Computes the signed Quotient and Remainder of the signed Numerator and Denominator, using truncating long division, which matches the behaviour of most programming languages and manual long division. For example, here are the expected values:

- 22 / 7 = 3 rem 1
- -22 / 7 = -3 rem -1
- 22 / -7 = -3 rem 1
- -22 / -7 = 3 rem -1
- 7 / 22 = 0 rem 7
- -7 / -22 = 0 rem -7
- -7 / 22 = 0 rem -7
- 7 / -22 = 0 rem 7
- 0 / 0 = -1 rem 0 (raises div_by_zero)
- 22 / 0 = -1 rem 22 (raises div_by_zero)
- -22 / 0 = 1 rem 22 (raises div_by_zero)

Supports c-slowing to meet timing. This means the inputs and outputs contain multiple words: C_SLOW_RATIO words of WORD_WIDTH width. All inputs and outputs are concatenated vectors, in the same order. Note that outputs will change wildly during calculation and cannot be used until the outputs are signalled as valid.

Takes one cycle to load the inputs, plus C_SLOW_RATIO * WORD_WIDTH cycles to compute the results, but requires no multiplication or variable bit shifts or double-length registers, so scales better to large WORD_WIDTH. The inputs and outputs each use a conventional ready/valid handshake, and accept/give out one set of results per division.

Think of the numerator and denominator along a number line. Depending on
their initial signs, the divider will add or subtract the denominator
to/from the numerator as necessary *to bring the numerator towards zero*,
and then increment or decrement the quotient (initially at zero) to reflect
that change in the numerator and produce a quotient with the correct sign.
No calculations of absolute values or sign corrections after the fact are
necessary, which saves a lot of hardware.

The implementation refines this approach by subtracting multiples of the denominator, as in manual long division. Imagine aligning the LSB of the denominator to the MSB of the numerator and performing the addition or subtraction. If the operation result has the opposite sign, then the denominator is too big, so we shift the numerator right by 1 bit and try again. Thus, it takes WORD_WIDTH steps to try all possible denominator multiples.

In hardware, we do this by shifting the numerator bit-by-bit, MSB-first, into the remainder then adding/subtracting the denominator from the remainder, and storing back the new remainder result if the sign did not flip (meaning the shifted bits of the numerator were large enough). Whatever is left at the end is the remainder, with the correct sign.

Similarly, if the denominator could be added/subtracted from the remainder without a sign flip, we add or subtract the equivalent multiple of 1 to the quotient, else we add zero. So the quotient bits are computed MSB-first.

The remainder storage cannot be retimed into the calc_new_remainder logic, and so that adder carry-chain becomes a major part of the worst-case delay. The output of that adder and the new/old remainder sign comparison logic has a high fanout and forms the next major part of the critical path. There's nothing to optimize here, it's just placment/routing.

There is no problem meeting a 150 MHz timing constraint up to about 96 bits WORD_WIDTH. Past that, it's hard. Use the following settings:

- Synthesis: default, with retiming enabled, no_lc, no_srlextract, and resource_sharing off.
- Implementation: Use Performance_RefinePlacement strategy.

This has just barely met timing at 104 bits of WORD_WIDTH at 150 MHz, with a C_SLOW_RATIO of 3. Since the remainder logic can't retime, the C_SLOW_RATIO likely has little impact on cycle time, but it does allow simple multiplexing of the divider.

Last ditch option: pipeline the connections from the remainder logic to the quotient logic, including the "calculating" signal from the FSM. Then delay the output valid signal the same. This would break the high fanout path after the remainder adder, at the cost of some complication and a few extra cycles of computation while the quotient calculations complete.

`default_nettype none module Divider_Integer_Signed #( parameter WORD_WIDTH = 0, parameter C_SLOW_RATIO = 1, // Must be 1 or greater // Don't set at instantiation, except in Vivado IPI parameter TOTAL_WIDTH = WORD_WIDTH * C_SLOW_RATIO ) ( input wire clock, input wire in_valid, output reg in_ready, input wire signed [TOTAL_WIDTH-1:0] numerator, input wire signed [TOTAL_WIDTH-1:0] denominator, output reg out_valid, input wire out_ready, output wire signed [TOTAL_WIDTH-1:0] quotient, output reg signed [TOTAL_WIDTH-1:0] remainder, output reg [C_SLOW_RATIO-1:0] div_by_zero ); `include "clog2_function.vh" localparam WORD_ZERO = {WORD_WIDTH{1'b0}}; localparam TOTAL_ZERO = {TOTAL_WIDTH{1'b0}}; localparam C_SLOW_ZERO = {C_SLOW_RATIO{1'b0}};

Begin with handshake ports idle, and locally computed outputs invalid.

initial begin in_ready = 1'b0; out_valid = 1'b0; remainder = TOTAL_ZERO; div_by_zero = C_SLOW_ZERO; end

Common iteration counter used in multiple places

integer i;

Each division takes WORD_WIDTH steps (for each of the C_SLOW_RATIO inputs), from WORD_WIDTH-1 to 0, plus one step to initially load the numerator and denominator. Thus, we need some state bits later on to signal we can start stepping.

localparam STEPS_WIDTH = clog2(WORD_WIDTH); localparam STEPS_INITIAL = WORD_WIDTH - 1; localparam STEPS_ZERO = {STEPS_WIDTH{1'b0}}; localparam STEPS_ONE = {{STEPS_WIDTH-1{1'b0}},1'b1};

We also need to control the calculation step counter with a secondary c-slow counter so we step the calculation step counter only once all the c-slowed steps have happenned for each calculation step.

This approach consumes less hardware than c-slowing the step counter directly, and avoid the need to OR-reduce all the c-slowed step counter values to detect that all calculations have completed. We can also do it this way since all the c-slowed calculations have the exact same number of calculation steps.

This counter divides the total number of calculation steps by C_SLOW_RATIO, so we have to init with a value one greater, then reload the expected value so the count is done when equal to zero.

localparam C_SLOW_STEPS_WIDTH = clog2(C_SLOW_RATIO) + 1; localparam C_SLOW_STEPS_LOAD = C_SLOW_RATIO - 1; localparam C_SLOW_STEPS_INITIAL = C_SLOW_RATIO; localparam C_SLOW_STEPS_ZERO = {C_SLOW_STEPS_WIDTH{1'b0}}; localparam C_SLOW_STEPS_ONE = {{C_SLOW_STEPS_WIDTH-1{1'b0}},1'b1};

We need one extra bit in the remainder since the shift left always happens, so the final result is shifted by one and we don't want to lose the MSB. The right subsets of this wider register are extracted further down.

localparam REMAINDER_WIDTH = WORD_WIDTH + 1; localparam REMAINDER_ZERO = {REMAINDER_WIDTH{1'b0}}; localparam REMAINDER_ONES = {REMAINDER_WIDTH{1'b1}}; localparam TOTAL_REMAINDER_WIDTH = REMAINDER_WIDTH * C_SLOW_RATIO; localparam TOTAL_REMAINDER_ZERO = {TOTAL_REMAINDER_WIDTH{1'b0}};

The quotient is incremented/decremented by a power of 2, starting with the most significant bit, because we process the numerator MSB first, which makes the division proceed in steps of the largest possible multiple of the divisor at that point.

localparam QUOTIENT_INCREMENT_INITIAL = {1'b1,{WORD_WIDTH-1{1'b0}}}; localparam TOTAL_QUOTIENT_INCREMENT_INITIAL = {C_SLOW_RATIO{QUOTIENT_INCREMENT_INITIAL}}; localparam ADD = 1'b0; localparam SUB = 1'b1; localparam POSITIVE = 1'b0; localparam NEGATIVE = 1'b1;

We denote state as two bits, with the following transitions:

INIT -> CALC -> DONE -> INIT

We don't handle the fourth, impossible case.

localparam STATE_WIDTH = 2; localparam [STATE_WIDTH-1:0] STATE_INIT = 'b00; localparam [STATE_WIDTH-1:0] STATE_CALC = 'b10; localparam [STATE_WIDTH-1:0] STATE_DONE = 'b11; localparam [STATE_WIDTH-1:0] STATE_ERROR = 'b01; // Never reached

The running state bits, from which we derive the control outputs and the internal control signals.

reg [STATE_WIDTH-1:0] next_running_state = STATE_INIT; wire [STATE_WIDTH-1:0] running_state; Register #( .WORD_WIDTH (STATE_WIDTH), .RESET_VALUE (STATE_INIT) ) running_state_storage ( .clock (clock), .clock_enable (1'b1), .clear (1'b0), .data_in (next_running_state), .data_out (running_state) );

Count down WORD_WIDTH calculation steps for the entire division. Ends up at (2**WORD_WIDTH)-1 (all ones) when done, but we don't test for that value, we just clear it back to the start value at the next calculation.

reg clear_calculation_step = 1'b0; reg do_calculation_step = 1'b0; wire [STEPS_WIDTH-1:0] current_calculation_step; Counter_Binary #( .WORD_WIDTH (STEPS_WIDTH), .INCREMENT (STEPS_ONE), .INITIAL_COUNT (STEPS_INITIAL[STEPS_WIDTH-1:0]) ) calculation_steps ( .clock (clock), .clear (clear_calculation_step), .up_down (1'b1), // 0/1 -> up/down .run (do_calculation_step), .load (1'b0), .load_count (STEPS_ZERO), .carry_in (1'b0), // verilator lint_off PINCONNECTEMPTY .carry_out (), // verilator lint_on PINCONNECTEMPTY .count (current_calculation_step) );

Count down C_SLOW_RATIO steps for each calculation step for the entire division.

reg clear_c_slow_step = 1'b0; reg load_c_slow_step = 1'b0; reg do_c_slow_step = 1'b0; wire [C_SLOW_STEPS_WIDTH-1:0] current_c_slow_step; Counter_Binary #( .WORD_WIDTH (C_SLOW_STEPS_WIDTH), .INCREMENT (C_SLOW_STEPS_ONE), .INITIAL_COUNT (C_SLOW_STEPS_INITIAL[C_SLOW_STEPS_WIDTH-1:0]) ) c_slow_steps ( .clock (clock), .clear (clear_c_slow_step), .up_down (1'b1), // 0/1 -> up/down .run (do_c_slow_step), .load (load_c_slow_step), .load_count (C_SLOW_STEPS_LOAD [C_SLOW_STEPS_WIDTH-1:0]), .carry_in (1'b0), // verilator lint_off PINCONNECTEMPTY .carry_out (), // verilator lint_on PINCONNECTEMPTY .count (current_c_slow_step) );

Store the denominator. Never changes during division, so it simply loops around itself to output each c-slowed value in turn.

reg enable_denominator = 1'b0; reg load_denominator = 1'b0; wire [WORD_WIDTH-1:0] denominator_loaded; wire [TOTAL_WIDTH-1:0] all_denominators; Register_Pipeline #( .WORD_WIDTH (WORD_WIDTH), .PIPE_DEPTH (C_SLOW_RATIO), // concatenation of each stage initial/reset value .RESET_VALUES (TOTAL_ZERO) ) denominator_storage ( .clock (clock), .clock_enable (enable_denominator), .clear (1'b0), .parallel_load (load_denominator), .parallel_in (denominator), .parallel_out (all_denominators), .pipe_in (denominator_loaded), .pipe_out (denominator_loaded) );

Report if we tried to divide by zero for each denominator.

always @(*) begin for (i=0; i < C_SLOW_RATIO; i=i+1) begin: per_denominator_div_by_zero div_by_zero[i] = (all_denominators[WORD_WIDTH*i +: WORD_WIDTH] == WORD_ZERO); end end

Extract the initial sign of all the denominators. Many run-time decisions are based on it. No need for separate storage since the denominator never changes.

reg denominator_sign; always @(*) begin denominator_sign = denominator_loaded[WORD_WIDTH-1]; end

Store and left-shift-out the numerator (into the remainder). We load the numerator value shifted left by 1 during the load cycle as the next cycle is the first cycle of calculation, and we need all the data ready.

reg enable_numerator = 1'b0; reg load_numerator = 1'b0; wire [WORD_WIDTH-1:0] numerator_reg; reg numerator_msb = 1'b0; Register_Pipeline #( .WORD_WIDTH (WORD_WIDTH), .PIPE_DEPTH (C_SLOW_RATIO), .RESET_VALUES (TOTAL_ZERO) ) numerator_storage ( .clock (clock), .clock_enable (enable_numerator), .clear (1'b0), .parallel_load (load_numerator), .parallel_in (numerator << 1), // verilator lint_off PINCONNECTEMPTY .parallel_out (), // verilator lint_on PINCONNECTEMPTY .pipe_in (numerator_reg << 1), .pipe_out (numerator_reg) ); always @(*) begin numerator_msb = numerator_reg[WORD_WIDTH-1]; end

Store the initial sign of the numerators. Many run-time decisions are based on it. We also need the numerator signs at load time to properly initialize the remainders.

reg enable_numerator_sign = 1'b0; reg load_numerator_sign = 1'b0; reg [C_SLOW_RATIO-1:0] numerator_signs_at_load = C_SLOW_ZERO; wire numerator_sign;

Extract the sign bit (last bit) from each numerator

always @(*) begin for (i=0; i < C_SLOW_RATIO; i=i+1) begin: per_numerator_sign_bits numerator_signs_at_load[i] = numerator[(WORD_WIDTH*i)+(WORD_WIDTH-1)]; end end Register_Pipeline #( .WORD_WIDTH (1), .PIPE_DEPTH (C_SLOW_RATIO), // concatenation of each stage initial/reset value .RESET_VALUES (C_SLOW_ZERO) ) numerator_sign_storage ( .clock (clock), .clock_enable (enable_numerator_sign), .clear (1'b0), .parallel_load (load_numerator_sign), .parallel_in (numerator_signs_at_load), // verilator lint_off PINCONNECTEMPTY .parallel_out (), // verilator lint_on PINCONNECTEMPTY .pipe_in (numerator_sign), .pipe_out (numerator_sign) );

Store the remainders. Initialized with a zero or -1 (to sign-extend the numerator) and the first MSB of the numerator (pre-shifted out at numerator load above). At each calculation step, gets loaded with either the new remainder or the existing remainder, both left-shifted by 1 with the LSB filled-in with the MSB of the numerator.

Must be 1 bit wider than the numerator, since the shift always happens and we don't want to lose the MSB on the last step. This means we use the last WORD_WIDTH bits as the remainder output, and the first WORD_WIDTH bits for the new remainder calculations.

reg enable_remainder = 1'b0; reg load_remainder = 1'b0; reg [REMAINDER_WIDTH-1:0] remainder_updated = REMAINDER_ZERO; // Since we get the output remainders from another port, the // most-significant bit of this port is never used. // verilator lint_off UNUSED wire [REMAINDER_WIDTH-1:0] remainder_now; // verilator lint_on UNUSED reg [TOTAL_REMAINDER_WIDTH-1:0] all_initial_remainders = TOTAL_REMAINDER_ZERO; wire [TOTAL_REMAINDER_WIDTH-1:0] all_remainders; Register_Pipeline #( .WORD_WIDTH (REMAINDER_WIDTH), .PIPE_DEPTH (C_SLOW_RATIO), .RESET_VALUES (TOTAL_REMAINDER_ZERO) ) remainder_storage ( .clock (clock), .clock_enable (enable_remainder), .clear (1'b0), .parallel_load (load_remainder), .parallel_in (all_initial_remainders), .parallel_out (all_remainders), .pipe_in (remainder_updated), .pipe_out (remainder_now) );

Split out the remainders for output (last WORD_WIDTH bits) and for calculation (first WORD_WIDTH bits). The output remainder is picked from each c-slowed copy.

reg [WORD_WIDTH-1:0] current_remainder = WORD_ZERO; always @(*) begin current_remainder = remainder_now[REMAINDER_WIDTH-2:0]; for (i=0; i < C_SLOW_RATIO; i=i+1) begin: per_remainder remainder[WORD_WIDTH*i +: WORD_WIDTH] = all_remainders[(REMAINDER_WIDTH*i)+1 +: WORD_WIDTH]; end end

Compute the new remainder: new_remainder = current_remainder +/- denominator_loaded, depending on the initial signs of the numerator and denominator.

reg remainder_add_sub = ADD; wire [WORD_WIDTH-1:0] new_remainder; Adder_Subtractor_Binary #( .WORD_WIDTH (WORD_WIDTH) ) calc_new_remainder ( .add_sub (remainder_add_sub), // 0/1 -> A+B/A-B .carry_in (1'b0), .A_in (current_remainder), .B_in (denominator_loaded), .sum_out (new_remainder), // verilator lint_off PINCONNECTEMPTY .carry_out () // verilator lint_on PINCONNECTEMPTY );

We will need the signs of the current and new remainders. They affect what we add/sub into the quotient at each calculation step.

reg current_remainder_sign = 1'b0; reg new_remainder_sign = 1'b0; always @(*) begin current_remainder_sign = current_remainder[WORD_WIDTH-1]; new_remainder_sign = new_remainder[WORD_WIDTH-1]; end

Store the quotient. Gets a multiple of 1 or zero added or subtracted to/from it at each step, depending on the sign of the result of the addition/subtraction of the denominator to/from the remainder, the current sign of the remainder, and the initial signs of the numerator and denominator. That logic is worked out further down.

reg enable_quotient = 1'b0; reg clear_quotient = 1'b0; wire [WORD_WIDTH-1:0] new_quotient; wire [WORD_WIDTH-1:0] current_quotient; Register_Pipeline #( .WORD_WIDTH (WORD_WIDTH), .PIPE_DEPTH (C_SLOW_RATIO), .RESET_VALUES (TOTAL_ZERO) ) quotient_storage ( .clock (clock), .clock_enable (enable_quotient), .clear (clear_quotient), .parallel_load (1'b0), .parallel_in (TOTAL_ZERO), .parallel_out (quotient), .pipe_in (new_quotient), .pipe_out (current_quotient) );

Store the quotient increment which is added/subtracted to/from the quotient each time we could remove the denominator from the remainder. Shift it right by 1 each calculation step.

reg enable_quotient_increment = 1'b0; reg clear_quotient_increment = 1'b0; wire [WORD_WIDTH-1:0] quotient_increment; Register_Pipeline #( .WORD_WIDTH (WORD_WIDTH), .PIPE_DEPTH (C_SLOW_RATIO), .RESET_VALUES (TOTAL_QUOTIENT_INCREMENT_INITIAL) ) quotient_increment_storage ( .clock (clock), .clock_enable (enable_quotient_increment), .clear (clear_quotient_increment), .parallel_load (1'b0), .parallel_in (TOTAL_ZERO), // verilator lint_off PINCONNECTEMPTY .parallel_out (), // verilator lint_on PINCONNECTEMPTY .pipe_in (quotient_increment >> 1), .pipe_out (quotient_increment) );

Add/subtract to/from the quotient each time the denominator is successfully removed or added to from the current remainder without changing its sign.

reg quotient_add_sub = ADD; reg [WORD_WIDTH-1:0] quotient_step = WORD_ZERO; Adder_Subtractor_Binary #( .WORD_WIDTH (WORD_WIDTH) ) calc_new_quotient ( .add_sub (quotient_add_sub), // 0/1 -> A+B/A-B .carry_in (1'b0), .A_in (current_quotient), .B_in (quotient_step), .sum_out (new_quotient), // verilator lint_off PINCONNECTEMPTY .carry_out () // verilator lint_on PINCONNECTEMPTY );

Now let's do the control and steering logic

First, the input and output handshakes. To avoid long combination paths, ready and valid should not depend directly on eachother.

Accept inputs when empty (after results are read out) or frehsly reset/cleared). Declare outputs valid when calculation is done.

always @(*) begin out_valid = (running_state == STATE_DONE); in_ready = (running_state == STATE_INIT); // Use this instead to allow starting a new calculation without first // reading out the results. You can then use a Pipeline Half-Buffer to // implement backpressure, based on it's input ready signal. (if not // ready, then it's full, and if the divider also has valid results, then // accept nothing into the divider) // in_ready = (running_state == STATE_INIT) || (running_state == STATE_DONE); end

Then, define the basic interactions with and transformations within this module. Past this point, we should not refer directly to the FSM states, but to these events which are combinations of states and signals.

reg load_inputs = 1'b0; // All of the above. reg read_outputs = 1'b0; // When we read out the results. reg calculating = 1'b0; // High while performing the division steps. reg last_calculation = 1'b0; // High during the last calculation step. reg calculation_done = 1'b0; // High after all the calculations always @(*) begin load_inputs = (in_ready == 1'b1) && (in_valid == 1'b1); read_outputs = (out_valid == 1'b1) && (out_ready == 1'b1); calculating = (running_state == STATE_CALC); last_calculation = (running_state == STATE_CALC) && (current_calculation_step == STEPS_ZERO) && (current_c_slow_step == C_SLOW_STEPS_ZERO); calculation_done = (running_state == STATE_DONE); end

Define the running state machine transitions. There is no handling of erroneous states.

always @(*) begin next_running_state = (load_inputs == 1'b1) ? STATE_CALC : running_state; next_running_state = (last_calculation == 1'b1) ? STATE_DONE : next_running_state; next_running_state = (read_outputs == 1'b1) ? STATE_INIT : next_running_state; end

Control the C_SLOW_RATIO step counter (one step per calculating cycle). We step the C_SLOW_RATIO step counter at load since loading the inputs also performs the first shifts of the numerator and remainder and computes the first new remainder and quotient stored at the next clock edge.

reg c_slow_step_last = 1'b0; always @(*) begin c_slow_step_last = (current_c_slow_step == C_SLOW_STEPS_ZERO); load_c_slow_step = (c_slow_step_last == 1'b1); clear_c_slow_step = (calculation_done == 1'b1); do_c_slow_step = (load_inputs == 1'b1) || (calculating == 1'b1); end

Control the calculation step counter (one step for one cycle of C_SLOW_RATIO steps)

always @(*) begin clear_calculation_step = (calculation_done == 1'b1); do_calculation_step = (c_slow_step_last == 1'b1); end

Control the denominator. They are both loaded just once at the start of the calculation steps, then cycled to output a new c-slowed value at each calculation step.

always @(*) begin load_denominator = (load_inputs == 1'b1); enable_denominator = (calculating == 1'b1) || (load_denominator == 1'b1); end

Control the numerator and its sign. The sign is loaded once at calculation start. The numerator gets loaded once at the start, then shifted left by 1 at each calculation step. Enable without load is a shift left. See Register_Pipeline.

always @(*) begin load_numerator = (load_inputs == 1'b1); enable_numerator = (calculating == 1'b1) || (load_numerator == 1'b1); load_numerator_sign = (load_numerator == 1'b1); enable_numerator_sign = (enable_numerator == 1'b1); end

Control the remainder and select what to load into it. Load input once at start, then select new remainder at each calculations step.

always @(*) begin load_remainder = (load_inputs == 1'b1); enable_remainder = (load_remainder == 1'b1) || (calculating == 1'b1); end reg [REMAINDER_WIDTH-1:0] remainder_with_removal = REMAINDER_ZERO; reg [REMAINDER_WIDTH-1:0] remainder_unchanged = REMAINDER_ZERO; always @(*) begin remainder_with_removal = {new_remainder, numerator_msb}; remainder_unchanged = {current_remainder, numerator_msb}; remainder_updated = (new_remainder_sign == current_remainder_sign) ? remainder_with_removal : remainder_unchanged; for (i=0; i < C_SLOW_RATIO; i=i+1) begin: per_initial_remainder all_initial_remainders[REMAINDER_WIDTH*i +: REMAINDER_WIDTH] = (numerator_signs_at_load[i] == POSITIVE) ? REMAINDER_ZERO : REMAINDER_ONES; end end always @(*) begin remainder_add_sub = (numerator_sign == denominator_sign) ? SUB : ADD; end

Control the quotient and what we add or subtract to/from it. Always zero at start of calculations.

always @(*) begin clear_quotient = (load_inputs == 1'b1); enable_quotient = (calculating == 1'b1); clear_quotient_increment = (clear_quotient == 1'b1); enable_quotient_increment = (enable_quotient == 1'b1); end always @(*) begin quotient_add_sub = (numerator_sign == denominator_sign) ? ADD : SUB; quotient_step = (new_remainder_sign == current_remainder_sign) ? quotient_increment : WORD_ZERO; end endmodule