Dual Triadic Boolean Operator

by GateForge Consulting Ltd.

Again, without getting into the theory as to why, there are 223 = 28 = 256 possible Boolean functions of 3 variables. In other words: 256 triadic Boolean functions.

We could implement a universal triadic operator in hardware the same way as we did a dyadic operator, but with an 8:1 multiplexer instead of a 4:1 multiplexer. However, unlike a 4:1 mux, an 8:1 mux doesn't fit into a 6-LUT on a modern FPGA, which distances our design from the underlying FPGA hardware: we might not be able to pipeline it as required for best operating speed, we are more at the mercy of the register retiming and logic packing of the CAD tool, and as we'll see, we lock ourselves out of a useful break in abstraction which gives a triadic operator special uses.


Instead, we can decompose a triadic Boolean function f(x,y,z) into two dyadic sub-functions based on the two possible values of z: g(x,y) = f(x,y,0) and h(x,y) = f(x,y,1). We can then reconstruct the original function like so: f(x,y,z) = ¬zg(x,y) ∨ zh(x,y). So we now have a Universal Triadic Operator (UTO) build out of two Universal Dyadic Operators (UDO) implementing functions of x and y and multiplexed by z. Note that this selection is done bit-wise, not word-wise: each bit of z selects the corresponding bit of g(x,y) or h(x,y). If necessary, we can register the outputs of the UDOs and the multiplexer as necessary for high speed.

We can use a triadic operator to implement very diverse functions:

When used to implement Boolean functions with 4 or more variables ("n-adic"), triadic functions approximately halve the number of steps required compared to using ordinary dyadic functions. This is likely one reason the NVIDIA Maxwell GPUs implement them via the LOP3 instruction, and Intel CPUs with AVX-512 support via VPTERNLOG.


However, if you look at the UTO implementation of f(x,y,z) above, we always throw away half of the total work done by both dyadic halves (UDO1 and UDO2). Discarded work is a clue that we are missing out on some computational capacity or efficiency. So let's provide a way to output that discarded work. We can add a second output multiplexer, driven by the same functions g(x,y) and h(x,y), but controlled by the complement of z, which outputs the bits not selected by z. And, instead of hardcoding that complementation, we can control it with a 1-bit "dual" signal XOR'ed with all the bits of this second z input.

This Dual Triadic Operator (DTO) can now do more, when dual is set:


We can build this Dual Triadic Operator with the following Verilog-2001 code. First, we need that inverter controlled by the "dual" signal:

// Inverts a value (bitwise NOT if enabled)

// We put something this simple into a module since it conveys intent,
// and avoids an RTL schematic cluttered with a bunch of XOR gates.

module Inverter
#(
    parameter       WORD_WIDTH         = 0
)
(
    input   wire                       invert,
    input   wire    [WORD_WIDTH-1:0]   in,
    output  reg     [WORD_WIDTH-1:0]   out
);
    always @(*) begin
        out <= (invert == 1'b1) ? ~in : in;
    end
endmodule

We also need that final output multiplexer, but in a bit-wise form rather than word-wise. We must remember that each bit of z selects the corresponding bits from x and y.

// Bitwise 2:1 multiplexer
// Selects each bit from two sources, based on a mask of same size.

module Bitwise_2to1_Mux
#(
    parameter WORD_WIDTH = 0
)
(
    input   wire    [WORD_WIDTH-1:0]    select_mask,
    input   wire    [WORD_WIDTH-1:0]    in1,
    input   wire    [WORD_WIDTH-1:0]    in2,
    output  wire    [WORD_WIDTH-1:0]    out
);

    // One mux per bit pair

    generate
        genvar j;
        for(j = 0; j < WORD_WIDTH; j = j+1) begin: per_bit
            Addressed_Mux
            #(
                .WORD_WIDTH     (1),
                .ADDR_WIDTH     (1),
                .INPUT_COUNT    (2)
            )
            Mux
            (
                .addr           (select_mask[j]),    
                .in             ({in2[j],in1[j]}),
                .out            (out[j])
            );
        end
    endgenerate

endmodule

And finally, we put it all together, along with the previously defined Universal Dyadic Operators. This is a fully-pipelined design, with 2 stages total. Note: I pulled this code out of a known-good ALU design and made it into a module. I may have introduced a typo or three in the process.

// Universal Dual Triadic Operator
// See include files for Dyadic_Boolean_Operator

module Dual_Triadic_Operator
#(
    parameter WORD_WIDTH = 0
)
(
    input   wire                                clock,
    input   wire    [`DYADIC_CTRL_WIDTH-1:0]    dyadic_op_1,
    input   wire    [`DYADIC_CTRL_WIDTH-1:0]    dyadic_op_2,
    input   wire    [WORD_WIDTH-1:0]            x,
    input   wire    [WORD_WIDTH-1:0]            y,
    input   wire    [WORD_WIDTH-1:0]            z,
    input   wire                                dual,
    output  reg     [WORD_WIDTH-1:0]            a,
    output  reg     [WORD_WIDTH-1:0]            b
    
);

    // So simulation matches synthesis for FPGAs.

    localparam ZERO = {WORD_WIDTH{1'b0}};

    initial begin
        a = ZERO;
        b = ZERO;
    end

// --------------------------------------------------------------------

    wire [WORD_WIDTH-1:0] UDO1_raw;

    Dyadic_Boolean_Operator
    #(
        .WORD_WIDTH     (WORD_WIDTH)
    )
    UDO1
    (
        .op             (dyadic_op_1),
        .a              (x),
        .b              (y),
        .o              (UDO1_raw)
    );

    reg [WORD_WIDTH-1:0] UDO1_out = 0;

    always @(posedge clock) begin
        UDO1_out <= UDO1_raw;
    end
    
// --------------------------------------------------------------------

    wire [WORD_WIDTH-1:0] UDO2_raw;

    Dyadic_Boolean_Operator
    #(
        .WORD_WIDTH     (WORD_WIDTH)
    )
    UDO2
    (
        .op             (dyadic_op_2),
        .a              (x),
        .b              (y),
        .o              (UDO2_raw)
    );

    reg [WORD_WIDTH-1:0] UDO2_out = 0;

    always @(posedge clock) begin
        UDO2_out <= UDO2_raw;
    end
    
// --------------------------------------------------------------------

    // Pipeline z and dual to match

    reg [WORD_WIDTH-1:0] z_2    = 0;
    reg [WORD_WIDTH-1:0] dual_2 = 0;

    always @(posedge clock) begin
        z_2     <= z;
        dual_2  <= dual;
    end

// --------------------------------------------------------------------

    // Yes, these two mux groups repeat eachother, but they can optionally
    // select in opposition, which gives us both possible triadic results.

    wire [WORD_WIDTH-1:0] a_raw;

    Bitwise_2to1_Mux
    #(
        .WORD_WIDTH     (WORD_WIDTH)
    )
    SD1
    (
        .select_mask     (z_2),
        .in1             (UDO1_out),
        .in2             (UDO2_out), 
        .out             (a_raw)
    );

    always @(posedge clock) begin
        a <= a_raw;
    end
    
// --------------------------------------------------------------------

    wire [WORD_WIDTH-1:0] z_2_inverted;

    Inverter
    #(
        .WORD_WIDTH     (WORD_WIDTH)
    )
    Triadic_Mode
    (
        .invert (dual_2),
        .in     (z_2),
        .out    (z_2_inverted)
    );

// --------------------------------------------------------------------

    wire [WORD_WIDTH-1:0] b_raw;

    Bitwise_2to1_Mux
    #(
        .WORD_WIDTH     (WORD_WIDTH)
    )
    SD2
    (
        .select_mask     (z_2_inverted),
        .in1             (UDO1_out),
        .in2             (UDO2_out), 
        .out             (b_raw)
    );

    always @(posedge clock) begin
        b <= b_raw;
    end

endmodule

fpgacpu.ca