A generic signed multiplier module, with the inference left to the CAD tool. Any attributes to control the synthesis of the multiplier should be applied, in the text of the enclosing module, to this whole module. The attributes vary too much and none provide the default automatic inference choices made by the CAD tool (e.g.: logic for narrow widths, DSP blocks for larger widths), which are almost always the right choices.
Both input word widths are parameterized separately so as to infer the smallest multiplier necessary to generate a full product, whose width is the sum of the widths of the inputs. The user must manually supply that total width to the connecting wires in the enclosing module, as the enclosing module has no way to introspect the parameter values inside a module it instantiates.
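As a minimal sketch of what "manually supplying the total width" could look like, the enclosing module below computes the product width from its own parameters and uses it for the connecting port. The module name `Example_Multiply`, its parameter and port names, and the chosen widths and pipeline depths are all hypothetical and only for illustration.

```verilog
`default_nettype none

// Hypothetical enclosing module, for illustration only.
module Example_Multiply
#(
    parameter WIDTH_A = 12,
    parameter WIDTH_B = 18
)
(
    input  wire                                 clock,
    input  wire signed [WIDTH_A-1:0]            a,
    input  wire signed [WIDTH_B-1:0]            b,
    // The full product width is computed here, by hand,
    // as the sum of the two input widths.
    output wire signed [WIDTH_A+WIDTH_B-1:0]    product
);

    Multiplier_Binary_Parallel
    #(
        .WORD_WIDTH_A       (WIDTH_A),
        .WORD_WIDTH_B       (WIDTH_B),
        .INPUT_PIPE_DEPTH   (1),    // arbitrary example depth
        .OUTPUT_PIPE_DEPTH  (2)     // arbitrary example depth
        // PRODUCT_WIDTH is deliberately left at its default
    )
    multiplier
    (
        .clock          (clock),
        .A_in           (a),
        .B_in           (b),
        .product_out    (product)
    );

endmodule
```

Leaving `PRODUCT_WIDTH` unset keeps it equal to `WORD_WIDTH_A + WORD_WIDTH_B`, so the hand-computed wire width and the module's output width agree.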
The multiplier inference is limited to signed integers as the common case and to keep the code simple. If you must treat the inputs as unsigned integers, first extend them with a constant zero most-significant bit to force them positive. Yes, this may cost some area and may slow down the logic slightly, but if a single extra bit of width breaks your timing closure, then your design's timing closure was already on its last legs, and perhaps you need to allow more pipelining and/or use a narrower width.
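The zero-extension described above might be written as follows. This is a sketch under assumptions: the wrapper name `Example_Multiply_Unsigned`, its ports, and the default widths are hypothetical. Each unsigned input gains one constant-zero most-significant bit, so the signed multiplier sees only non-negative values, and the two extra product bits (always zero) are dropped at the output.

```verilog
`default_nettype none

// Hypothetical wrapper for unsigned operands, for illustration only.
module Example_Multiply_Unsigned
#(
    parameter WIDTH_A = 8,
    parameter WIDTH_B = 8
)
(
    input  wire                         clock,
    input  wire [WIDTH_A-1:0]           a_unsigned,
    input  wire [WIDTH_B-1:0]           b_unsigned,
    output wire [WIDTH_A+WIDTH_B-1:0]   product_unsigned
);

    localparam WIDTH_A_EXT = WIDTH_A + 1;
    localparam WIDTH_B_EXT = WIDTH_B + 1;
    localparam WIDTH_P_EXT = WIDTH_A_EXT + WIDTH_B_EXT;

    // Prepend a constant zero MSB to force a non-negative signed value.
    wire signed [WIDTH_A_EXT-1:0] a_extended = {1'b0, a_unsigned};
    wire signed [WIDTH_B_EXT-1:0] b_extended = {1'b0, b_unsigned};
    wire signed [WIDTH_P_EXT-1:0] product_extended;

    Multiplier_Binary_Parallel
    #(
        .WORD_WIDTH_A       (WIDTH_A_EXT),
        .WORD_WIDTH_B       (WIDTH_B_EXT),
        .INPUT_PIPE_DEPTH   (0),
        .OUTPUT_PIPE_DEPTH  (1)     // arbitrary example depth
    )
    multiplier
    (
        .clock          (clock),
        .A_in           (a_extended),
        .B_in           (b_extended),
        .product_out    (product_extended)
    );

    // The product of two non-negative operands fits in WIDTH_A+WIDTH_B
    // bits, so the top two bits of the extended product are always zero.
    assign product_unsigned = product_extended [WIDTH_A+WIDTH_B-1:0];

endmodule
```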
At the time of writing (March 2020), retiming of external registers into an inferred multiplier does not work in Vivado, so we cannot simply place a Register Pipeline before and after the multiplier module to help it meet timing if necessary.
Instead, we must connect the input and output pipelines and the multiplier all together in a single clocked always block to match the recommended HDL style for multiplier inference (UG901, Vivado Design Suite User Guide: Synthesis). This code also works under Intel Quartus Prime as its HDL coding guidelines for inferring multipliers are the same (UG-20131, Intel Quartus Prime Pro Edition User Guide: Design Recommendations).
Note: you must disable shift register extraction during synthesis to force implementation of the pipelines as registers which can be retimed and can be placed-and-routed independently. Else your pipeline may be implemented as shift registers which, while compact, won't provide any pipelining benefits. Shift register extraction is enabled by default in Vivado and Quartus synthesis.
Pipelining multipliers, although optional, benefits both synthesis and place-and-route.
`default_nettype none

module Multiplier_Binary_Parallel
#(
    parameter WORD_WIDTH_A      = 0,
    parameter WORD_WIDTH_B      = 0,
    parameter INPUT_PIPE_DEPTH  = 0,
    parameter OUTPUT_PIPE_DEPTH = 0,

    // Don't set at instantiation, except in IPI
    parameter PRODUCT_WIDTH     = WORD_WIDTH_A + WORD_WIDTH_B
)
(
    // Unused if no input/output pipelines (combinational multiplier)
    // verilator lint_off UNUSED
    input   wire                                clock,
    // verilator lint_on  UNUSED
    input   wire    signed  [WORD_WIDTH_A-1:0]  A_in,
    input   wire    signed  [WORD_WIDTH_B-1:0]  B_in,
    output  reg     signed  [PRODUCT_WIDTH-1:0] product_out
);

    localparam WORD_ZERO_A  = {WORD_WIDTH_A{1'b0}};
    localparam WORD_ZERO_B  = {WORD_WIDTH_B{1'b0}};
    localparam PRODUCT_ZERO = {PRODUCT_WIDTH{1'b0}};

    initial begin
        product_out = PRODUCT_ZERO;
    end
If their depth is greater than zero, create the input and/or output pipelines. These MUST be declared as signed, else the multiplication will be inferred as unsigned and calculate the wrong results when given negative integers.
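To see why the signed declaration matters, here is a small simulation-only sketch (the module and signal names are hypothetical). In Verilog-2001, an expression is evaluated as signed only if every operand is signed; a single unsigned operand makes the whole multiplication unsigned, and the negative operand is then zero-extended rather than sign-extended.

```verilog
// Simulation-only illustration; all names are hypothetical.
module Signed_Multiply_Demo;

    reg signed [3:0] a          = -4'sd2;   // bit pattern 4'b1110
    reg signed [3:0] b_signed   =  4'sd3;
    reg        [3:0] b_unsigned =  4'd3;    // same bits, not signed

    reg signed [7:0] p_signed;
    reg signed [7:0] p_mixed;

    initial begin
        // Both operands signed: a is sign-extended, giving -2 * 3 = -6
        p_signed = a * b_signed;

        // One operand unsigned: the expression becomes unsigned, so
        // a is zero-extended to 14, giving 14 * 3 = 42
        p_mixed  = a * b_unsigned;

        $display("signed * signed   = %0d", p_signed);  // -6
        $display("signed * unsigned = %0d", p_mixed);   // 42
        $finish;
    end

endmodule
```

The same rule applies to the pipeline registers: if they were declared without `signed`, the multiplication reading from them would silently become unsigned.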
Then, if necessary, we connect the inputs, the pipelines, and the multiplier all together in a single clocked always block. The CAD tool will infer DSP blocks and adders, and retime the pipeline registers as necessary if retiming is enabled in your CAD tool. Retiming is off by default in Vivado and Quartus synthesis.
We write the pipelines using the idiom of peeling out the first loop iteration so we never generate a negative index with i-1. Note the initialization value of i in the for-loops.
There are four possible cases of zero and non-zero (positive) pipeline depths, so we generate the correct code, based on the HDL guidelines, for each case. A negative pipe depth would result in negative array ranges, which is legal in Verilog-2001, though I have no idea what it's for. To avoid strange corner cases, no code is generated for negative values, which will cause the design to fail to elaborate.
    generate

        // No pipelines (combinational multiplier)

        if ((INPUT_PIPE_DEPTH == 0) && (OUTPUT_PIPE_DEPTH == 0)) begin: no_pipe

            always @(*) begin
                product_out = A_in * B_in;
            end

        end

        // Input pipeline only

        else if ((INPUT_PIPE_DEPTH > 0) && (OUTPUT_PIPE_DEPTH == 0)) begin: in_pipe

            reg signed [WORD_WIDTH_A-1:0] input_pipe_A [INPUT_PIPE_DEPTH-1:0];
            reg signed [WORD_WIDTH_B-1:0] input_pipe_B [INPUT_PIPE_DEPTH-1:0];

            integer i;

            initial begin
                for (i=0; i < INPUT_PIPE_DEPTH; i=i+1) begin
                    input_pipe_A [i] = WORD_ZERO_A;
                    input_pipe_B [i] = WORD_ZERO_B;
                end
            end

            always @(posedge clock) begin
                input_pipe_A[0] <= A_in;
                input_pipe_B[0] <= B_in;

                for (i=1; i < INPUT_PIPE_DEPTH; i=i+1) begin: per_input
                    input_pipe_A [i] <= input_pipe_A [i-1];
                    input_pipe_B [i] <= input_pipe_B [i-1];
                end
            end

            // Corner case: it isn't possible to put this in the clocked
            // always block without registering the output as a consequence.

            always @(*) begin
                product_out = input_pipe_A [INPUT_PIPE_DEPTH-1] * input_pipe_B [INPUT_PIPE_DEPTH-1];
            end

        end

        // Output pipeline only

        else if ((INPUT_PIPE_DEPTH == 0) && (OUTPUT_PIPE_DEPTH > 0)) begin: out_pipe

            reg signed [PRODUCT_WIDTH-1:0] output_pipe [OUTPUT_PIPE_DEPTH-1:0];

            integer i;

            initial begin
                for (i=0; i < OUTPUT_PIPE_DEPTH; i=i+1) begin
                    output_pipe [i] = PRODUCT_ZERO;
                end
            end

            always @(posedge clock) begin
                output_pipe [0] <= A_in * B_in;

                for (i=1; i < OUTPUT_PIPE_DEPTH; i=i+1) begin: per_output
                    output_pipe [i] <= output_pipe [i-1];
                end
            end

            always @(*) begin
                product_out = output_pipe [OUTPUT_PIPE_DEPTH-1];
            end

        end

        // Both input and output pipelines

        else if ((INPUT_PIPE_DEPTH > 0) && (OUTPUT_PIPE_DEPTH > 0)) begin: in_out_pipe

            reg signed [WORD_WIDTH_A-1:0]  input_pipe_A [INPUT_PIPE_DEPTH-1:0];
            reg signed [WORD_WIDTH_B-1:0]  input_pipe_B [INPUT_PIPE_DEPTH-1:0];
            reg signed [PRODUCT_WIDTH-1:0] output_pipe  [OUTPUT_PIPE_DEPTH-1:0];

            integer i;

            initial begin
                for (i=0; i < INPUT_PIPE_DEPTH; i=i+1) begin
                    input_pipe_A [i] = WORD_ZERO_A;
                    input_pipe_B [i] = WORD_ZERO_B;
                end

                for (i=0; i < OUTPUT_PIPE_DEPTH; i=i+1) begin
                    output_pipe [i] = PRODUCT_ZERO;
                end
            end

            always @(posedge clock) begin
                input_pipe_A[0] <= A_in;
                input_pipe_B[0] <= B_in;

                for (i=1; i < INPUT_PIPE_DEPTH; i=i+1) begin: per_input
                    input_pipe_A [i] <= input_pipe_A [i-1];
                    input_pipe_B [i] <= input_pipe_B [i-1];
                end

                output_pipe [0] <= input_pipe_A [INPUT_PIPE_DEPTH-1] * input_pipe_B [INPUT_PIPE_DEPTH-1];

                for (i=1; i < OUTPUT_PIPE_DEPTH; i=i+1) begin: per_output
                    output_pipe [i] <= output_pipe [i-1];
                end
            end

            always @(*) begin
                product_out = output_pipe [OUTPUT_PIPE_DEPTH-1];
            end

        end

    endgenerate

endmodule
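Reading the code above, the product should appear at the output INPUT_PIPE_DEPTH + OUTPUT_PIPE_DEPTH clock cycles after the operands are presented, since each pipeline stage adds one cycle and the final assignment to the output is combinational. The following self-checking testbench is a minimal sketch of how that could be confirmed in simulation. The testbench name, the operand values, the widths, and the pipeline depths are arbitrary choices, not part of the module.

```verilog
`default_nettype none

// Hypothetical simulation-only testbench, for illustration only.
module Multiplier_Binary_Parallel_tb;

    localparam WIDTH_A   = 8;
    localparam WIDTH_B   = 8;
    localparam IN_DEPTH  = 1;
    localparam OUT_DEPTH = 2;
    localparam LATENCY   = IN_DEPTH + OUT_DEPTH;

    reg                               clock = 1'b0;
    reg  signed [WIDTH_A-1:0]         a     = 0;
    reg  signed [WIDTH_B-1:0]         b     = 0;
    wire signed [WIDTH_A+WIDTH_B-1:0] product;

    // 10 time-unit clock period, first rising edge at time 5
    always #5 clock = ~clock;

    Multiplier_Binary_Parallel
    #(
        .WORD_WIDTH_A       (WIDTH_A),
        .WORD_WIDTH_B       (WIDTH_B),
        .INPUT_PIPE_DEPTH   (IN_DEPTH),
        .OUTPUT_PIPE_DEPTH  (OUT_DEPTH)
    )
    dut
    (
        .clock          (clock),
        .A_in           (a),
        .B_in           (b),
        .product_out    (product)
    );

    initial begin
        // Present the operands before the first rising edge and hold them.
        a = -8'sd7;
        b =  8'sd9;

        // Wait for the operands to pass through every pipeline stage,
        // then step just past the last edge so the output has settled.
        repeat (LATENCY) @(posedge clock);
        #1;

        if (product !== -16'sd63) begin
            $display("FAIL: expected -63, got %0d", product);
        end
        else begin
            $display("PASS: product = %0d", product);
        end

        $finish;
    end

endmodule
```

Using a negative operand also exercises the signed declarations of the pipeline registers, since an unsigned inference would produce a different result.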