A signed binary integer adder/subtractor, which does arithmetic over
WORD_WIDTH
words as a sequence of smaller STEP_WORD_WIDTH
operations to
save area and increase operating speed at the price of extra latency.
The main use of this circuit is to avoid CAD problems which can emerge at large integer bit widths (e.g.: 128 bits): the area of added pipeline registers would become quite large, and the CAD tools can fail to retime them completely into the extra-wide adder logic, thus failing to reach a high operating frequency and slowing down the rest of the system. There are notes in the code which point out the critical paths you can expect to see (or not).
Also, if we don't need a result every cycle, regular pipelining is wasteful: it unnecessarily duplicates the input/output data storage, and makes poor use of the adder/subtractor logic (e.g.: the least-significant bits are used once, then sit idle for multiple cycles while the higher bits are computed).
Since the result latency depends on the ratio of WORD_WIDTH
to
STEP_WORD_WIDTH
and whether that ratio is a whole integer, the inputs are
set with a ready/valid handshake, and can be updated after the output
handshake completes.
Addition/subtraction is selected with add_sub
: 0 for an add (A+B
), and
1 for a subtract (A-B
). This assignment conveniently matches the
convention of sign bits. Note that the overflow
bit is only meaningful
for signed numbers. For unsigned numbers, use carry_out
instead.
`default_nettype none module Adder_Subtractor_Binary_Multiprecision #( parameter WORD_WIDTH = 0, parameter STEP_WORD_WIDTH = 0 ) ( input wire clock, input wire clock_enable, input wire clear, input wire input_valid, output reg input_ready, input wire add_sub, // 0/1 -> A+B/A-B input wire [WORD_WIDTH-1:0] A, input wire [WORD_WIDTH-1:0] B, output reg output_valid, input wire output_ready, output wire [WORD_WIDTH-1:0] sum, output reg carry_out, output wire [WORD_WIDTH-1:0] carries, output reg overflow ); initial begin input_ready = 1'b1; // Ready after reset. output_valid = 1'b0; carry_out = 1'b0; overflow = 1'b0; end `include "word_count_function.vh" `include "word_pad_function.vh" `include "clog2_function.vh" localparam WORD_ZERO = {WORD_WIDTH{1'b0}};
Compute how many STEP_WORD_WIDTH
words we will need to hold
a WORD_WIDTH
input. The total width may end up larger, but we will
discard the extra bits at the end.
localparam STEP_WORD_COUNT = word_count(WORD_WIDTH, STEP_WORD_WIDTH); localparam STEP_WORD_WIDTH_TOTAL = STEP_WORD_WIDTH * STEP_WORD_COUNT; localparam STEP_WORD_ZERO = {STEP_WORD_WIDTH{1'b0}}; localparam STEP_TOTAL_ZERO = {STEP_WORD_WIDTH_TOTAL{1'b0}};
How many pad bits at the end of the last step word? Re-adjust to zero (see Word Pad for why) since we don't construct a pad here, only index to its position in the last step word.
localparam PAD_WIDTH_RAW = word_pad(WORD_WIDTH, STEP_WORD_WIDTH); localparam PAD_WIDTH = (PAD_WIDTH_RAW == STEP_WORD_WIDTH) ? 0 : PAD_WIDTH_RAW;
We must add a bit of width to the step counter to deal with the special
case where STEP_WORD_WIDTH
equals WORD_WIDTH
, so the STEP_WORD_COUNT
is 1, and thus the counter would be of width zero, which is impossible.
The overhead is insignificant and grows logarithmically at worst.
localparam STEP_COUNT_WIDTH = clog2(STEP_WORD_COUNT) + 1; localparam STEP_COUNT_INITIAL = STEP_WORD_COUNT - 1; localparam STEP_ONE = {{STEP_COUNT_WIDTH-1{1'b0}},1'b1}; localparam STEP_ZERO = {STEP_COUNT_WIDTH{1'b0}};
Set the initial step_carry_in
into the step adder/subtractor to 1 (which
matches the add_sub
convention) if subtracting to complete the negation
of the inverted B
operand, and update it at each calculation step with the
step_carry_out
. The final carry_out
is calculated later, and depends on
the PAD_WIDTH
.
reg load_carry_initial = 1'b0; reg load_carry_step = 1'b0; reg step_carry_selected = 1'b0; wire step_carry_out; wire step_carry_in; always @(*) begin step_carry_selected = (load_carry_initial == 1'b1) ? add_sub : step_carry_out; end Register #( .WORD_WIDTH (1), .RESET_VALUE (1'b0) ) carry_storage ( .clock (clock), .clock_enable (load_carry_step), .clear (1'b0), .data_in (step_carry_selected), .data_out (step_carry_in) );
Extend A to the total width of the pipeline as a signed integer.
wire [STEP_WORD_WIDTH_TOTAL-1:0] A_extended; Width_Adjuster #( .WORD_WIDTH_IN (WORD_WIDTH), .SIGNED (1), .WORD_WIDTH_OUT (STEP_WORD_WIDTH_TOTAL) ) A_extender ( .original_input (A), .adjusted_output (A_extended) );
Word-reverse A_extended
so the pipeline outputs the least significant
step word first.
wire [STEP_WORD_WIDTH_TOTAL-1:0] A_reversed; Word_Reverser #( .WORD_WIDTH (STEP_WORD_WIDTH), .WORD_COUNT (STEP_WORD_COUNT) ) A_reverser ( .words_in (A_extended), .words_out (A_reversed) );
Read A_extended
into the pipeline, and feed it out one step word at
a time, from least to most-significant.
reg load_A = 1'b0; wire [STEP_WORD_WIDTH-1:0] step_A; Register_Pipeline #( .WORD_WIDTH (STEP_WORD_WIDTH), .PIPE_DEPTH (STEP_WORD_COUNT), .RESET_VALUES (STEP_TOTAL_ZERO) ) A_storage ( .clock (clock), .clock_enable (clock_enable), .clear (1'b0), .parallel_load (load_A), .parallel_in (A_reversed), // verilator lint_off PINCONNECTEMPTY .parallel_out (), // verilator lint_on PINCONNECTEMPTY .pipe_in (STEP_WORD_ZERO), .pipe_out (step_A) );
Invert B
if subtracting. The step_carry_in
is already correctly
initialized to 1 to make it into a negation.
We do the negation of B
in this module instead of using the built-in
negation logic in the Adder_Subtractor_Binary
sub-module because I could
not predict what logic would synthesize when subtracting, which is
internally be implemented as A+((~B)+1)-carry_in
. So to be sure the logic
synthesizes predictably, I decided to remove carry_in
as an input to this
module and use the internal carry_storage
as part of the negation of B
when subtracting, which then implements as A+((~B)+step_carry_in)
.
reg [WORD_WIDTH-1:0] B_selected = WORD_ZERO; always @(*) begin B_selected = (add_sub == 1'b1) ? ~B : B; end
Extend B to the total width of the pipeline as a signed integer.
wire [STEP_WORD_WIDTH_TOTAL-1:0] B_extended; Width_Adjuster #( .WORD_WIDTH_IN (WORD_WIDTH), .SIGNED (1), .WORD_WIDTH_OUT (STEP_WORD_WIDTH_TOTAL) ) B_extender ( .original_input (B_selected), .adjusted_output (B_extended) );
Word-reverse B_extended
so the pipeline outputs the least significant step
word first.
wire [STEP_WORD_WIDTH_TOTAL-1:0] B_reversed; Word_Reverser #( .WORD_WIDTH (STEP_WORD_WIDTH), .WORD_COUNT (STEP_WORD_COUNT) ) B_reverser ( .words_in (B_extended), .words_out (B_reversed) );
Read B_extended
into the pipeline , and feed it out one step word at
a time, from least to most-significant.
reg load_B = 1'b0; wire [STEP_WORD_WIDTH-1:0] step_B; Register_Pipeline #( .WORD_WIDTH (STEP_WORD_WIDTH), .PIPE_DEPTH (STEP_WORD_COUNT), .RESET_VALUES (STEP_TOTAL_ZERO) ) B_storage ( .clock (clock), .clock_enable (clock_enable), .clear (1'b0), .parallel_load (load_B), .parallel_in (B_reversed), // verilator lint_off PINCONNECTEMPTY .parallel_out (), // verilator lint_on PINCONNECTEMPTY .pipe_in (STEP_WORD_ZERO), .pipe_out (step_B) );
NOTE: the step_carry_in
and step_carry_out
storage and wiring was
defined earlier.
We cannot use the built-in overflow
as if there are pad bits at the MSB
positions in the last step word, they will falsely disable the overflow and
carry-out bits. So we calculate the overflow
and carry_out
later in
this module, and the unused logic in Adder_Subtractor_Binary
will
optimize away.
The adder carry-chain should never be the critical path. If so, reduce
STEP_WORD_WIDTH
as needed.
wire [STEP_WORD_WIDTH-1:0] step_sum; wire [STEP_WORD_WIDTH-1:0] step_carries; Adder_Subtractor_Binary #( .WORD_WIDTH (STEP_WORD_WIDTH) ) step_adder ( .add_sub (1'b0), // 0/1 -> A+B/A-B .carry_in (step_carry_in), .A (step_A), .B (step_B), .sum (step_sum), .carry_out (step_carry_out), .carries (step_carries), // verilator lint_off PINCONNECTEMPTY .overflow () // verilator lint_on PINCONNECTEMPTY );
Store the step_sum
word-by-word, then word-reverse it back to the
expected order so we can extract the least-significant WORD_WIDTH
subset
which contains the result we want.
reg load_step_sum = 1'b0; wire [STEP_WORD_WIDTH_TOTAL-1:0] sum_reversed; wire [STEP_WORD_WIDTH_TOTAL-1:0] sum_restored; Register_Pipeline #( .WORD_WIDTH (STEP_WORD_WIDTH), .PIPE_DEPTH (STEP_WORD_COUNT), .RESET_VALUES (STEP_TOTAL_ZERO) ) output_sum ( .clock (clock), .clock_enable (load_step_sum), .clear (1'b0), .parallel_load (1'b0), .parallel_in (STEP_TOTAL_ZERO), .parallel_out (sum_reversed), .pipe_in (step_sum), // verilator lint_off PINCONNECTEMPTY .pipe_out () // verilator lint_on PINCONNECTEMPTY ); Word_Reverser #( .WORD_WIDTH (STEP_WORD_WIDTH), .WORD_COUNT (STEP_WORD_COUNT) ) sum_reverser ( .words_in (sum_reversed), .words_out (sum_restored) ); Width_Adjuster #( .WORD_WIDTH_IN (STEP_WORD_WIDTH_TOTAL), .SIGNED (1), .WORD_WIDTH_OUT (WORD_WIDTH) ) sum_truncator ( .original_input (sum_restored), .adjusted_output (sum) );
Store the carries word-by-word, then word-reverse them back to the expected
order so we can extract the least-significant WORD_WIDTH
subset which
contains the result we want.
reg load_step_carries = 1'b0; wire [STEP_WORD_WIDTH_TOTAL-1:0] carries_reversed; wire [STEP_WORD_WIDTH_TOTAL-1:0] carries_restored; Register_Pipeline #( .WORD_WIDTH (STEP_WORD_WIDTH), .PIPE_DEPTH (STEP_WORD_COUNT), .RESET_VALUES (STEP_TOTAL_ZERO) ) output_carries ( .clock (clock), .clock_enable (load_step_carries), .clear (1'b0), .parallel_load (1'b0), .parallel_in (STEP_TOTAL_ZERO), .parallel_out (carries_reversed), .pipe_in (step_carries), // verilator lint_off PINCONNECTEMPTY .pipe_out () // verilator lint_on PINCONNECTEMPTY ); Word_Reverser #( .WORD_WIDTH (STEP_WORD_WIDTH), .WORD_COUNT (STEP_WORD_COUNT) ) carries_reverser ( .words_in (carries_reversed), .words_out (carries_restored) ); Width_Adjuster #( .WORD_WIDTH_IN (STEP_WORD_WIDTH_TOTAL), .SIGNED (0), .WORD_WIDTH_OUT (WORD_WIDTH) ) carries_truncator ( .original_input (carries_restored), .adjusted_output (carries) );
We gather together the carries from the final step_sum
(the most
significant word of the result) and the final step_carry_in
(which now
holds the last step_carry_out
), and select the correct carries, based on
the amount of padding in the last step word, to compute the carry_out
and
the overflow
. We have to handle all cases, including when there is no
padding when STEP_WORD_WIDTH
exactly divides WORD_WIDTH
.
The wiring here is constant and only uses 2 bits, so it will never be
a critical path, regardless of the width of all_carries
.
localparam ALL_CARRIES_WIDTH = 1 + STEP_WORD_WIDTH; localparam ALL_CARRIES_ZERO = {ALL_CARRIES_WIDTH{1'b0}}; reg [ALL_CARRIES_WIDTH-1:0] all_carries = ALL_CARRIES_ZERO; reg last_carry_in = 1'b0; always @(*) begin all_carries = {step_carry_in, carries_restored [STEP_WORD_WIDTH_TOTAL-1 -: STEP_WORD_WIDTH]}; last_carry_in = all_carries [STEP_WORD_WIDTH - PAD_WIDTH - 1]; carry_out = all_carries [STEP_WORD_WIDTH - PAD_WIDTH]; overflow = (carry_out != last_carry_in); end
We accept inputs in STATE_LOAD
, compute in
STATE_CALC
, and present the output in STATE_DONE
. Once the results are
read out, we return to STATE_LOAD
.
NOTE: The state encoding is arbitrary. Also, control from state_storage
to the input/output pipelines will tend to be the critical path as
WORD_WIDTH
gets larger due to physical distance and routing congestion,
depending on the target device and CAD tool.
localparam STATE_WIDTH = 2; localparam [STATE_WIDTH-1:0] STATE_LOAD = 2'b00; localparam [STATE_WIDTH-1:0] STATE_CALC = 2'b01; localparam [STATE_WIDTH-1:0] STATE_DONE = 2'b11; //verilator lint_off UNUSED localparam [STATE_WIDTH-1:0] STATE_ERROR = 2'b10; // Never reached. //verilator lint_on UNUSED wire [STATE_WIDTH-1:0] state; reg [STATE_WIDTH-1:0] state_next = STATE_LOAD; Register #( .WORD_WIDTH (STATE_WIDTH), .RESET_VALUE (STATE_LOAD) ) state_storage ( .clock (clock), .clock_enable (clock_enable), .clear (clear), .data_in (state_next), .data_out (state) );
Count down the calculation steps until zero, which is how long we stay in
STATE_CALC
.
reg step_do = 1'b0; reg step_load = 1'b0; wire [STEP_COUNT_WIDTH-1:0] step; Counter_Binary #( .WORD_WIDTH (STEP_COUNT_WIDTH), .INCREMENT (STEP_ONE), .INITIAL_COUNT (STEP_COUNT_INITIAL [STEP_COUNT_WIDTH-1:0]) ) calc_steps ( .clock (clock), .clear (clear), .up_down (1'b1), // 0/1 --> up/down .run (step_do), .load (step_load), .load_count (STEP_COUNT_INITIAL [STEP_COUNT_WIDTH-1:0]), .carry_in (1'b0), // verilator lint_off PINCONNECTEMPTY .carry_out (), .carries (), .overflow (), // verilator lint_on PINCONNECTEMPTY .count (step) ); reg step_done = 1'b0; always @(*) begin step_done = (step == STEP_ZERO); end
reg input_handshake_done = 1'b0; reg output_handshake_done = 1'b0; always @(*) begin input_ready = (state == STATE_LOAD); output_valid = (state == STATE_DONE); input_handshake_done = (input_ready == 1'b1) && (input_valid == 1'b1); output_handshake_done = (output_ready == 1'b1) && (output_valid == 1'b1); end
reg input_load = 1'b0; // Load of input operation and operands. reg calculating = 1'b0; // High while doing the calculation steps. reg last_calc = 1'b0; // High during the last calculation step. reg output_read = 1'b0; // When the result is read out. always @(*) begin input_load = (state == STATE_LOAD) && (input_handshake_done == 1'b1); calculating = (state == STATE_CALC); last_calc = (state == STATE_CALC) && (step_done == 1'b1); output_read = (state == STATE_DONE) && (output_handshake_done == 1'b1); end
After this point, there should be no reference to inputs, outputs, or states. All control logic must be expressed in terms of control events.
always @(*) begin state_next = (input_load == 1'b1) ? STATE_CALC : state; state_next = (last_calc == 1'b1) ? STATE_DONE : state_next; state_next = (output_read == 1'b1) ? STATE_LOAD : state_next; end
always @(*) begin load_carry_initial = (input_load == 1'b1); load_carry_step = (input_load == 1'b1) || (calculating == 1'b1); load_A = (input_load == 1'b1); load_B = (input_load == 1'b1); load_step_sum = (calculating == 1'b1); load_step_carries = (calculating == 1'b1); step_do = (calculating == 1'b1); step_load = (input_load == 1'b1); end endmodule