from FPGA Resources by GateForge Consulting Ltd.

Again, without getting into the theory as to why, there are
2^{23} = 2^{8} = 256 possible Boolean functions of 3
variables. In other words: 256 *triadic* Boolean functions.

We could implement a universal triadic operator in hardware the same way as we did a dyadic operator, but with an 8:1 multiplexer instead of a 4:1 multiplexer. However, unlike a 4:1 mux, an 8:1 mux doesn't fit into a 6-LUT on a modern FPGA, which distances our design from the underlying FPGA hardware: we might not be able to pipeline it as required for best operating speed, we are more at the mercy of the register retiming and logic packing of the CAD tool, and as we'll see, we lock ourselves out of a useful break in abstraction which gives a triadic operator special uses.

Instead, we can decompose a triadic Boolean function f(x,y,z) into two dyadic sub-functions based on the two possible values of z: g(x,y) = f(x,y,0) and h(x,y) = f(x,y,1). We can then reconstruct the original function like so: f(x,y,z) = ¬zg(x,y) ∨ zh(x,y). So we now have a Universal Triadic Operator (UTO) build out of two Universal Dyadic Operators (UDO) implementing functions of x and y and multiplexed by z. Note that this selection is done bit-wise, not word-wise: each bit of z selects the corresponding bit of g(x,y) or h(x,y). If necessary, we can register the outputs of the UDOs and the multiplexer as necessary for high speed.

We can use a triadic operator to implement very diverse functions:

- Majority: xy ∨ xz ∨ yz, which is used in Triple Modular Redundancy (TMR).
- Minority, which tells you if a TMR unit had a corrected error.
- Given x, y, and their sum x+y, you can recover the carries into each bit position like so: x ⊕ y ⊕ (x+y). This can allow you to find overflows in packed sub-word parallel arithmetic (e.g.: adding two 32-bit words as two vectors of 4 bytes)
- Bitfield masking and swapping between x and y, as selected by the bitmask z.

When used to implement Boolean functions with 4 or more variables ("n-adic"), triadic functions approximately halve the number of steps required compared to using ordinary dyadic functions. This is likely one reason the NVIDIA Maxwell GPUs implement them via the LOP3 instruction, and Intel CPUs with AVX-512 support via VPTERNLOG.

However, if you look at the UTO implementation of f(x,y,z) above, we always
throw away half of the total work done by both dyadic halves (UDO1 and UDO2).
Discarded work is a clue that we are missing out on some computational capacity
or efficiency. So let's provide a way to output that discarded work. We can add
a second output multiplexer, driven by the same functions g(x,y) and h(x,y),
but controlled by the *complement* of z, which outputs the bits not
selected by z. And, instead of hardcoding that complementation, we can control
it with a 1-bit "dual" signal XOR'ed with all the bits of this second z input.

This Dual Triadic Operator (DTO) can now do more, when dual is set:

- We can compute 2 independent dyadic functions.
- We can set g() and h() to always output x or y, and depending on z, send those outputs either straight through or crossed-over. This acts as a Banyan Switch, which is the building block of a lot of switching networks.

We can build this Dual Triadic Operator with the following Verilog-2001 code. First, we need that inverter controlled by the "dual" signal:

// Inverts a value (bitwise NOT if enabled) // We put something this simple into a module since it conveys intent, // and avoids an RTL schematic cluttered with a bunch of XOR gates. `default_nettype none module Inverter #( parameter WORD_WIDTH = 0 ) ( input wire invert, input wire [WORD_WIDTH-1:0] in, output reg [WORD_WIDTH-1:0] out ); always @(*) begin out <= (invert == 1'b1) ? ~in : in; end endmodule

We also need that final output multiplexer, but in a bit-wise form rather than word-wise. We must remember that each bit of z selects the corresponding bits from x and y.

// Bitwise 2:1 multiplexer // Selects each bit from two sources, based on a mask of same size. `default_nettype none module Bitwise_2to1_Mux #( parameter WORD_WIDTH = 0 ) ( input wire [WORD_WIDTH-1:0] select_mask, input wire [WORD_WIDTH-1:0] in1, input wire [WORD_WIDTH-1:0] in2, output wire [WORD_WIDTH-1:0] out ); // One mux per bit pair generate genvar j; for(j = 0; j < WORD_WIDTH; j = j+1) begin: per_bit Addressed_Mux #( .WORD_WIDTH (1), .ADDR_WIDTH (1), .INPUT_COUNT (2) ) Mux ( .addr (select_mask[j]), .in ({in2[j],in1[j]}), .out (out[j]) ); end endgenerate endmodule

And finally, we put it all together, along with the previously defined Universal Dyadic Operators. This is a fully-pipelined
design, with 2 stages total. *Note: I pulled this code out of a known-good
ALU design and made it into a module. I may have introduced a typo or
three in the process.*

// Universal Dual Triadic Operator // See include files for Dyadic_Boolean_Operator `default_nettype none module Dual_Triadic_Operator #( parameter WORD_WIDTH = 0 ) ( input wire clock, input wire [`DYADIC_CTRL_WIDTH-1:0] dyadic_op_1, input wire [`DYADIC_CTRL_WIDTH-1:0] dyadic_op_2, input wire [WORD_WIDTH-1:0] x, input wire [WORD_WIDTH-1:0] y, input wire [WORD_WIDTH-1:0] z, input wire dual, output reg [WORD_WIDTH-1:0] a, output reg [WORD_WIDTH-1:0] b ); // So simulation matches synthesis for FPGAs. localparam ZERO = {WORD_WIDTH{1'b0}}; initial begin a = ZERO; b = ZERO; end // -------------------------------------------------------------------- wire [WORD_WIDTH-1:0] UDO1_raw; Dyadic_Boolean_Operator #( .WORD_WIDTH (WORD_WIDTH) ) UDO1 ( .op (dyadic_op_1), .a (x), .b (y), .o (UDO1_raw) ); reg [WORD_WIDTH-1:0] UDO1_out = 0; always @(posedge clock) begin UDO1_out <= UDO1_raw; end // -------------------------------------------------------------------- wire [WORD_WIDTH-1:0] UDO2_raw; Dyadic_Boolean_Operator #( .WORD_WIDTH (WORD_WIDTH) ) UDO2 ( .op (dyadic_op_2), .a (x), .b (y), .o (UDO2_raw) ); reg [WORD_WIDTH-1:0] UDO2_out = 0; always @(posedge clock) begin UDO2_out <= UDO2_raw; end // -------------------------------------------------------------------- // Pipeline z and dual to match reg [WORD_WIDTH-1:0] z_2 = 0; reg [WORD_WIDTH-1:0] dual_2 = 0; always @(posedge clock) begin z_2 <= z; dual_2 <= dual; end // -------------------------------------------------------------------- // Yes, these two mux groups repeat eachother, but they can optionally // select in opposition, which gives us both possible triadic results. wire [WORD_WIDTH-1:0] a_raw; Bitwise_2to1_Mux #( .WORD_WIDTH (WORD_WIDTH) ) SD1 ( .select_mask (z_2), .in1 (UDO1_out), .in2 (UDO2_out), .out (a_raw) ); always @(posedge clock) begin a <= a_raw; end // -------------------------------------------------------------------- wire [WORD_WIDTH-1:0] z_2_inverted; Inverter #( .WORD_WIDTH (WORD_WIDTH) ) Triadic_Mode ( .invert (dual_2), .in (z_2), .out (z_2_inverted) ); // -------------------------------------------------------------------- wire [WORD_WIDTH-1:0] b_raw; Bitwise_2to1_Mux #( .WORD_WIDTH (WORD_WIDTH) ) SD2 ( .select_mask (z_2_inverted), .in1 (UDO1_out), .in2 (UDO2_out), .out (b_raw) ); always @(posedge clock) begin b <= b_raw; end endmodule

fpgacpu.ca