Again, without getting into the theory as to why, there are 223 = 28 = 256 possible Boolean functions of 3 variables. In other words: 256 triadic Boolean functions.

We could implement a universal triadic operator in hardware the same way as we did a dyadic operator, but with an 8:1 multiplexer instead of a 4:1 multiplexer. However, unlike a 4:1 mux, an 8:1 mux doesn't fit into a 6-LUT on a modern FPGA, which distances our design from the underlying FPGA hardware: we might not be able to pipeline it as required for best operating speed, we are more at the mercy of the register retiming and logic packing of the CAD tool, and as we'll see, we lock ourselves out of a useful break in abstraction which gives a triadic operator special uses.

Instead, we can decompose a triadic Boolean function f(x,y,z) into two dyadic sub-functions based on the two possible values of z: g(x,y) = f(x,y,0) and h(x,y) = f(x,y,1). We can then reconstruct the original function like so: f(x,y,z) = ¬zg(x,y) ∨ zh(x,y). So we now have a Universal Triadic Operator (UTO) build out of two Universal Dyadic Operators (UDO) implementing functions of x and y and multiplexed by z. Note that this selection is done bit-wise, not word-wise: each bit of z selects the corresponding bit of g(x,y) or h(x,y). If necessary, we can register the outputs of the UDOs and the multiplexer as necessary for high speed.

We can use a triadic operator to implement very diverse functions:

• Majority: xy ∨ xz ∨ yz, which is used in Triple Modular Redundancy (TMR).
• Minority, which tells you if a TMR unit had a corrected error.
• Given x, y, and their sum x+y, you can recover the carries into each bit position like so: x ⊕ y ⊕ (x+y). This can allow you to find overflows in packed sub-word parallel arithmetic (e.g.: adding two 32-bit words as two vectors of 4 bytes)
• Bitfield masking and swapping between x and y, as selected by the bitmask z.

When used to implement Boolean functions with 4 or more variables ("n-adic"), triadic functions approximately halve the number of steps required compared to using ordinary dyadic functions. This is likely one reason the NVIDIA Maxwell GPUs implement them via the LOP3 instruction, and Intel CPUs with AVX-512 support via VPTERNLOG.

However, if you look at the UTO implementation of f(x,y,z) above, we always throw away half of the total work done by both dyadic halves (UDO1 and UDO2). Discarded work is a clue that we are missing out on some computational capacity or efficiency. So let's provide a way to output that discarded work. We can add a second output multiplexer, driven by the same functions g(x,y) and h(x,y), but controlled by the complement of z, which outputs the bits not selected by z. And, instead of hardcoding that complementation, we can control it with a 1-bit "dual" signal XOR'ed with all the bits of this second z input.

This Dual Triadic Operator (DTO) can now do more, when dual is set:

• We can compute 2 independent dyadic functions.
• We can set g() and h() to always output x or y, and depending on z, send those outputs either straight through or crossed-over. This acts as a Banyan Switch, which is the building block of a lot of switching networks.

We can build this Dual Triadic Operator with the following Verilog-2001 code. First, we need that inverter controlled by the "dual" signal:

```// Inverts a value (bitwise NOT if enabled)

// We put something this simple into a module since it conveys intent,
// and avoids an RTL schematic cluttered with a bunch of XOR gates.

`default_nettype none

module Inverter
#(
parameter       WORD_WIDTH         = 0
)
(
input   wire                       invert,
input   wire    [WORD_WIDTH-1:0]   in,
output  reg     [WORD_WIDTH-1:0]   out
);
always @(*) begin
out <= (invert == 1'b1) ? ~in : in;
end
endmodule
```

We also need that final output multiplexer, but in a bit-wise form rather than word-wise. We must remember that each bit of z selects the corresponding bits from x and y.

```// Bitwise 2:1 multiplexer
// Selects each bit from two sources, based on a mask of same size.

`default_nettype none

module Bitwise_2to1_Mux
#(
parameter WORD_WIDTH = 0
)
(
input   wire    [WORD_WIDTH-1:0]    in1,
input   wire    [WORD_WIDTH-1:0]    in2,
output  wire    [WORD_WIDTH-1:0]    out
);

// One mux per bit pair

generate
genvar j;
for(j = 0; j < WORD_WIDTH; j = j+1) begin: per_bit
#(
.WORD_WIDTH     (1),
.INPUT_COUNT    (2)
)
Mux
(
.in             ({in2[j],in1[j]}),
.out            (out[j])
);
end
endgenerate

endmodule
```

And finally, we put it all together, along with the previously defined Universal Dyadic Operators. This is a fully-pipelined design, with 2 stages total. Note: I pulled this code out of a known-good ALU design and made it into a module. I may have introduced a typo or three in the process.

```// Universal Dual Triadic Operator
// See include files for Dyadic_Boolean_Operator

`default_nettype none

#(
parameter WORD_WIDTH = 0
)
(
input   wire                                clock,
input   wire    [WORD_WIDTH-1:0]            x,
input   wire    [WORD_WIDTH-1:0]            y,
input   wire    [WORD_WIDTH-1:0]            z,
input   wire                                dual,
output  reg     [WORD_WIDTH-1:0]            a,
output  reg     [WORD_WIDTH-1:0]            b

);

// So simulation matches synthesis for FPGAs.

localparam ZERO = {WORD_WIDTH{1'b0}};

initial begin
a = ZERO;
b = ZERO;
end

// --------------------------------------------------------------------

wire [WORD_WIDTH-1:0] UDO1_raw;

#(
.WORD_WIDTH     (WORD_WIDTH)
)
UDO1
(
.a              (x),
.b              (y),
.o              (UDO1_raw)
);

reg [WORD_WIDTH-1:0] UDO1_out = 0;

always @(posedge clock) begin
UDO1_out <= UDO1_raw;
end

// --------------------------------------------------------------------

wire [WORD_WIDTH-1:0] UDO2_raw;

#(
.WORD_WIDTH     (WORD_WIDTH)
)
UDO2
(
.a              (x),
.b              (y),
.o              (UDO2_raw)
);

reg [WORD_WIDTH-1:0] UDO2_out = 0;

always @(posedge clock) begin
UDO2_out <= UDO2_raw;
end

// --------------------------------------------------------------------

// Pipeline z and dual to match

reg [WORD_WIDTH-1:0] z_2    = 0;
reg [WORD_WIDTH-1:0] dual_2 = 0;

always @(posedge clock) begin
z_2     <= z;
dual_2  <= dual;
end

// --------------------------------------------------------------------

// Yes, these two mux groups repeat eachother, but they can optionally
// select in opposition, which gives us both possible triadic results.

wire [WORD_WIDTH-1:0] a_raw;

Bitwise_2to1_Mux
#(
.WORD_WIDTH     (WORD_WIDTH)
)
SD1
(
.in1             (UDO1_out),
.in2             (UDO2_out),
.out             (a_raw)
);

always @(posedge clock) begin
a <= a_raw;
end

// --------------------------------------------------------------------

wire [WORD_WIDTH-1:0] z_2_inverted;

Inverter
#(
.WORD_WIDTH     (WORD_WIDTH)
)
(
.invert (dual_2),
.in     (z_2),
.out    (z_2_inverted)
);

// --------------------------------------------------------------------

wire [WORD_WIDTH-1:0] b_raw;

Bitwise_2to1_Mux
#(
.WORD_WIDTH     (WORD_WIDTH)
)
SD2
(