Source

# Triadic Boolean Operator (Dual Output)

Computes one of the 256 possible three-variable (triadic) Boolean operations, selectable at run-time, on the `A`, `B`, and `C` input words, with optional dual output.

We could implement a triadic Boolean operator in hardware the same way as we did the dyadic Boolean operator, but with an 8:1 multiplexer instead of a 4:1 multiplexer. However, unlike a 4:1 mux, an 8:1 mux doesn't fit into a 6-LUT on a modern FPGA, which distances our design from the underlying FPGA hardware: we might not be able to pipeline it as required for best operating speed, we are more at the mercy of the register retiming and logic packing of the CAD tool, and as we'll see, we lock ourselves out of a useful break in abstraction which gives a triadic operator special uses.

Instead, we can decompose a triadic Boolean function `f(A,B,C)` into two dyadic sub-functions based on the two possible values of `C`: ```g(A,B) = f(A,B,0)``` and `h(A,B) = f(A,B,1)`. We can then reconstruct the original triadic function like so: `f(A,B,C) = (g(A,B) & ~C) | (h(A,B) & C)`: the value of `C` selects one of the two dyadic Boolean functions of `A` and `B`. Note that `C` selects bit-wise, not word-wise: each bit of `C` selects the corresponding bit of `g(A,B)` or `h(A,B)`. This useful mathematical identity is known as "Shannon Decomposition" or "Boole's Expansion Theorem".

We can use a triadic operator to implement very diverse functions:

• Majority: `AB | AC | BC`, which is used in Triple Modular Redundancy (TMR).
• Minority, which tells you if a TMR unit had a corrected error.
• Given `A`, `B`, and their sum `A+B`, you can recover the carries into each bit position like so: `A ^ B ^ (A+B)`. This can allow you to find overflows in packed sub-word parallel arithmetic (e.g.: adding two 32-bit words as two vectors of 4 bytes)
• Bitfield masking and swapping between `A` and `B`, as selected by the bitmask `C`.

When used to implement Boolean functions with 4 or more variables (tetradic or n-adic), triadic functions approximately halve the number of steps required compared to using ordinary dyadic functions. This is likely one reason the NVIDIA Maxwell GPUs implemented triadic Boolean functions with the `LOP3` instruction, and Intel CPUs with AVX-512 support with the `VPTERNLOG` instruction.

However, if you look at the implementation of `f(A,B,C)` above, we always throw away half of the total work done by both dyadic halves `g(A,B)` and `h(A,B)`. Discarded work is a clue that we are missing out on some computational capacity or efficiency. So let's provide a way to output that discarded work. We can add a second output multiplexer, driven by the same functions `g(A,B)` and `h(A,B)`, but controlled by an inverted version of `C`, called `D`, which outputs the bits not selected by `C`. And, instead of hardcoding `D` as the inverse of `C`, we can control it with a 1-bit `dual` signal. If `dual` is not set, then `C` and `D` are the same, and both triadic outputs are identical.

With two outputs, our triadic Boolean operator can now do more. When `dual` is set:

• We can compute 2 independent dyadic Boolean functions.
• We can set `g(A,B)` and `h(A,B)` to always output `A` or `B`, and depending on `C`, send those outputs either straight through or crossed-over. This acts as a Banyan Switch, which is the building block of a lot of switching networks.

I have not gone through the trouble of defining all 256 triadic Boolean operations, but given the definitions of the 16 possible dyadic Boolean operations, you can easily create triadic definitions as needed: write out the 8-entry truth table for your desired triadic function, split the table into two 4-entry truth tables with the most-significant input bit (`C`, as above) always 1 in one table, and always 0 in the other, then use those two dyadic definitions as functions `g(A,B)` and `h(A,B)` as explained above.

You can, of course, repeat this process to implement tetradic or n-adic functions, but the size of the truth tables grows exponentially (2n), and the number of truth tables grows super-exponentially (65,536 possible tetradic functions, or 22n for n-adic functions), so there are increasingly fewer functions of general interest in that space, with some exceptions such as bit reductions and word reductions.

````include "Dyadic_Boolean_Operations.vh"

`default_nettype none

#(
parameter WORD_WIDTH = 0
)
(
input   wire    [WORD_WIDTH-1:0]                    word_A,
input   wire    [WORD_WIDTH-1:0]                    word_B,
input   wire    [WORD_WIDTH-1:0]                    word_C,
input   wire                                        dual,
output  wire    [WORD_WIDTH-1:0]                    result_1,
output  wire    [WORD_WIDTH-1:0]                    result_2

);

localparam WORD_ZERO = {WORD_WIDTH{1'b0}};
```

First dyadic Boolean operation: `g(A,B)`

```    wire [WORD_WIDTH-1:0] g;

#(
.WORD_WIDTH     (WORD_WIDTH)
)
first
(
.word_A         (word_A),
.word_B         (word_B),
.result         (g)
);
```

Second dyadic Boolean operation: `h(A,B)`

```    wire [WORD_WIDTH-1:0] h;

#(
.WORD_WIDTH     (WORD_WIDTH)
)
second
(
.word_A         (word_A),
.word_B         (word_B),
.result         (h)
);
```

Now we select each bit from either `g(A,B)` or `h(A,B)`, giving us the first result.

```    Multiplexer_Bitwise_2to1
#(
.WORD_WIDTH (WORD_WIDTH)
)
select_1
(
.word_in_0  (g),
.word_in_1  (h),
.word_out   (result_1)
);
```

Then, conditionally invert `word_C` if `dual` is set.

```    reg [WORD_WIDTH-1:0] word_D = WORD_ZERO;

always @(*) begin
word_D = (dual == 1'b1) ? ~word_C : word_C;
end
```

And select again for the second result, but selected by `word_D`, giving us all the bits not selected by the previous multiplexer if `dual` is set, else the same bits.

```    Multiplexer_Bitwise_2to1
#(
.WORD_WIDTH (WORD_WIDTH)
)
select_2
(