Takes in a ready/valid handshake of a given word width, buffers it into a FIFO, passes it into another clock domain at full throughput, and presents a ready/valid handshake of a different word width. The data is repacked into the new word width without gaps. The word widths are arbitrary and need not be multiples of each other.
The minimum input-to-output latency is 7 cycles when both clocks are plesiochronous.
This FIFO Repacker places the input and output interfaces each in its own clock domain, mutually asynchronous and with no knowledge of or restriction on their relative clock frequencies or phases. Using a FIFO for data transfer between asynchronous clock domains allows multiple transfers to overlap with the CDC synchronization overhead. Once the CDC synchronization is done, all the data previously written in one clock domain can be freely read out in the other clock domain without further overhead.
Usually, a repacker is implemented as a minimally sized ring buffer with a pre-calculated read/write schedule which covers the whole sequence of possible amounts of data in the buffer at any given time. This is a complex schedule which resembles pseudo-random number generation and is likely unique for every pair of input/output widths. It also means the read side must be able to read at every possible offset into the ring buffer given by the schedule, which can create wide multiplexers. The minimum size of the buffer is itself a function of the computed schedule.
Instead, let's trade off more storage for the simplest implicit schedule: we size the buffer at the Least Common Multiple (LCM) of the input and output widths, which is the smallest buffer that holds an integer number of input and output entries at the same time. The behaviour of the buffer then reduces to that of a FIFO, albeit with different input and output word widths. This approach tends to minimize the amount of multiplexing and does not require computing any schedule (only plain counting). We only need to specify the input and output widths and everything else is computed for us.
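For example, with WORD_WIDTH_INPUT = 32 and WORD_WIDTH_OUTPUT = 24, the LCM is 96 bits: the buffer then holds exactly 3 input words and 4 output words at the same time.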
NOTE: Using the LCM depth does not guarantee full throughput. A multiple of the LCM may be needed. See below.
Since the input and output interfaces are asynchronous to each other, they each have their own locally synchronous clear/reset signal. However, it makes no sense to reset only one interface, as that would corrupt the read/write addresses and duplicate or lose data. Both interfaces must be reset together. Pick one interface reset as the primary reset, then synchronize it into the clock domain of the other interface using a Reset Synchronizer.
NOTE: Both interfaces must be out of reset before beginning operation, otherwise a CDC synchronization from one domain into another domain which is still under reset will be lost, and the system state becomes inconsistent.
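To illustrate that arrangement, here is a minimal standalone sketch (not part of this module; the module and signal names are placeholders) of carrying the chosen primary clear into the other clock domain through two synchronizing registers. In practice, use the library's Reset Synchronizer, which handles this properly.

// Illustrative sketch only: treat one interface's clear as the primary reset
// and generate a synchronized copy of it for the other clock domain.

module clear_synchronizer_sketch
(
    input  wire primary_clear,  // e.g. input_clear, the chosen primary reset
    input  wire other_clock,    // e.g. output_clock, the other interface's clock
    output wire other_clear     // drives the other interface's clear input
);
    reg clear_meta = 1'b0; // first stage: may go metastable
    reg clear_sync = 1'b0; // second stage: settled copy

    always @(posedge other_clock) begin
        clear_meta <= primary_clear;
        clear_sync <= clear_meta;
    end

    assign other_clear = clear_sync;
endmodule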
`default_nettype none

module CDC_FIFO_Repacker
#(
    parameter WORD_WIDTH_INPUT  = 0,
    parameter WORD_WIDTH_OUTPUT = 0,
    parameter CDC_EXTRA_STAGES  = 0
)
(
    input  wire                             input_clock,
    input  wire                             input_clear,
    input  wire                             input_valid,
    output reg                              input_ready,
    input  wire [WORD_WIDTH_INPUT-1:0]      input_data,

    input  wire                             output_clock,
    input  wire                             output_clear,
    output wire                             output_valid,
    input  wire                             output_ready,
    output reg  [WORD_WIDTH_OUTPUT-1:0]     output_data
);
First, we have to calculate the Least Common Multiple (LCM) of both WORD_WIDTH_INPUT and WORD_WIDTH_OUTPUT, which gives us the minimum buffer depth which will hold a whole number of input and output items at the same time.
`include "lcm_function.vh" localparam BUFFER_DEPTH_MIN = lcm(WORD_WIDTH_OUTPUT, WORD_WIDTH_INPUT); // Least Common Multiple
However, the LCM depth is not necessarily sufficient to ensure full throughput, as it may contain too few WORD_WIDTH_INPUT or WORD_WIDTH_OUTPUT entries to allow reads and writes to continue without stalls while the read and write addresses are synchronized across the input and output clock domains.
Based on the design of the [CDC Word Synchronizer](./CDC_Word_Synchronizer.html), the absolute worst-case latency for completing a transfer from one clock domain to the other is 8 clock cycles. Thus, the FIFO buffer has to contain more than twice that number of entries (17) to guarantee that there is always a free entry to read and write at any time, thus avoiding a stall.
We calculate the final BUFFER_DEPTH by dividing BUFFER_DEPTH_MIN by the larger of WORD_WIDTH_OUTPUT and WORD_WIDTH_INPUT, taking the ratio of 17 to that quotient (fudged up by +1 to compensate for integer division), and then multiplying BUFFER_DEPTH_MIN by that amount, thus guaranteeing that BUFFER_DEPTH >= 17 * max(WORD_WIDTH_INPUT, WORD_WIDTH_OUTPUT).
I do integer division with a +1 fudge factor to avoid potential problems, as it's unclear if casting a real to an integer truncates or rounds. So let's pay the price of up to BUFFER_DEPTH_MIN more entries than we really need to guarantee we meet our constraint. This can double the buffer depth when the LCM is a large number close to 17 * max(WORD_WIDTH_OUTPUT, WORD_WIDTH_INPUT).
`include "max_function.vh" localparam WORD_WIDTH_MAX = max(WORD_WIDTH_OUTPUT, WORD_WIDTH_INPUT); localparam ITEM_COUNT_MIN = BUFFER_DEPTH_MIN / WORD_WIDTH_MAX; localparam BUFFER_DEPTH_MULTIPLIER = (17 / ITEM_COUNT_MIN) + 1; localparam BUFFER_DEPTH = BUFFER_DEPTH_MIN * BUFFER_DEPTH_MULTIPLIER; `include "clog2_function.vh" localparam ADDR_WIDTH = clog2(BUFFER_DEPTH); localparam ADDR_ZERO = {ADDR_WIDTH{1'b0}}; localparam ADDR_LAST = BUFFER_DEPTH-1; localparam BUFFER_ZERO = {BUFFER_DEPTH{1'b0}}; localparam OUTPUT_ZERO = {WORD_WIDTH_OUTPUT{1'b0}}; // A little contortion to please the linter. // (doing (foo-1)[width-1:0], or similar, isn't legal in Verilog-2001) localparam [ADDR_WIDTH-1:0] ADDR_INITIAL_INPUT = WORD_WIDTH_INPUT - 1; localparam [ADDR_WIDTH-1:0] ADDR_INITIAL_OUTPUT = WORD_WIDTH_OUTPUT - 1; initial begin input_ready = 1'b1; // Empty at start, so accept data output_data = OUTPUT_ZERO; end
We have to access arbitrary word subsets of the entire FIFO storage so we can read and write words of different widths without introducing gaps in the data. Block RAMs only support a limited set of fixed word sizes, so they are not usable here. Thus, we must implement the buffer with registers and read/write exactly the word subsets needed.

We also need to do CDC, so we must be careful to never read and write the same register at the same time.
reg [BUFFER_DEPTH-1:0] buffer = BUFFER_ZERO;
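As an aside, the word subsets are accessed with Verilog's indexed part-select, which reads or writes a constant-width word starting at a variable bit offset. A minimal standalone illustration of that construct (not part of this module; the names are made up):

// Illustrative only: selects WORD_WIDTH bits of buffer_in starting at bit
// "offset", the same "+:" construct used to access the repacker buffer below.

module part_select_example
#(
    parameter BUFFER_WIDTH = 96,
    parameter WORD_WIDTH   = 24,
    parameter ADDR_WIDTH   = 7
)
(
    input  wire [BUFFER_WIDTH-1:0]  buffer_in,
    input  wire [ADDR_WIDTH-1:0]    offset,
    output wire [WORD_WIDTH-1:0]    word_out
);
    assign word_out = buffer_in [offset +: WORD_WIDTH];
endmodule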
To define a word subset inside the buffer, the input and output each use two counters: head and tail. The tail counters start at ADDR_ZERO, the head counters start at WORD_WIDTH_OUTPUT-1 or WORD_WIDTH_INPUT-1, and both can only increment by WORD_WIDTH_OUTPUT or WORD_WIDTH_INPUT, wrapping around to their start value if incremented past ADDR_LAST.

Since BUFFER_DEPTH is a whole multiple of the Least Common Multiple of the input/output word widths, the counters always index a whole word, without residue and without ever having to wrap a word around the end of the buffer.
We use two counters, rather than a single counter and an adder, so the head and tail values are calculated concurrently.
From these counter values we can perform range checks (e.g., seeing if a write head counter has passed a read tail counter, meaning the words overlap and there is no room to write), test if there is enough buffered data to compose an output word, detect when we have reached the end of the buffer, and calculate the vector part selects used to read and write the buffer.
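For example, with WORD_WIDTH_INPUT = 32 and BUFFER_DEPTH = 576, the write tail counts 0, 32, 64, ... 544 and then wraps back to 0, while the write head counts 31, 63, 95, ... 575 and then wraps back to 31. Each tail/head pair always brackets exactly one whole input word and never straddles the end of the buffer.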
reg                     increment_buffer_write_addr = 1'b0;
reg                     load_buffer_write_addr      = 1'b0;

wire [ADDR_WIDTH-1:0]   buffer_write_addr_tail;

Counter_Binary
#(
    .WORD_WIDTH     (ADDR_WIDTH),
    .INCREMENT      (WORD_WIDTH_INPUT),
    .INITIAL_COUNT  (ADDR_ZERO)
)
write_address_tail
(
    .clock          (input_clock),
    .clear          (input_clear),

    .up_down        (1'b0), // 0/1 --> up/down
    .run            (increment_buffer_write_addr),

    .load           (load_buffer_write_addr),
    .load_count     (ADDR_ZERO),

    .carry_in       (1'b0),
    // verilator lint_off PINCONNECTEMPTY
    .carry_out      (),
    .carries        (),
    .overflow       (),
    // verilator lint_on  PINCONNECTEMPTY

    .count          (buffer_write_addr_tail)
);

wire [ADDR_WIDTH-1:0]   buffer_write_addr_head;

Counter_Binary
#(
    .WORD_WIDTH     (ADDR_WIDTH),
    .INCREMENT      (WORD_WIDTH_INPUT),
    .INITIAL_COUNT  (ADDR_INITIAL_INPUT)
)
write_address_head
(
    .clock          (input_clock),
    .clear          (input_clear),

    .up_down        (1'b0), // 0/1 --> up/down
    .run            (increment_buffer_write_addr),

    .load           (load_buffer_write_addr),
    .load_count     (ADDR_INITIAL_INPUT),

    .carry_in       (1'b0),
    // verilator lint_off PINCONNECTEMPTY
    .carry_out      (),
    .carries        (),
    .overflow       (),
    // verilator lint_on  PINCONNECTEMPTY

    .count          (buffer_write_addr_head)
);
reg                     increment_buffer_read_addr = 1'b0;
reg                     load_buffer_read_addr      = 1'b0;

wire [ADDR_WIDTH-1:0]   buffer_read_addr_tail;

Counter_Binary
#(
    .WORD_WIDTH     (ADDR_WIDTH),
    .INCREMENT      (WORD_WIDTH_OUTPUT),
    .INITIAL_COUNT  (ADDR_ZERO)
)
read_address_tail
(
    .clock          (output_clock),
    .clear          (output_clear),

    .up_down        (1'b0), // 0/1 --> up/down
    .run            (increment_buffer_read_addr),

    .load           (load_buffer_read_addr),
    .load_count     (ADDR_ZERO),

    .carry_in       (1'b0),
    // verilator lint_off PINCONNECTEMPTY
    .carry_out      (),
    .carries        (),
    .overflow       (),
    // verilator lint_on  PINCONNECTEMPTY

    .count          (buffer_read_addr_tail)
);

wire [ADDR_WIDTH-1:0]   buffer_read_addr_head;

Counter_Binary
#(
    .WORD_WIDTH     (ADDR_WIDTH),
    .INCREMENT      (WORD_WIDTH_OUTPUT),
    .INITIAL_COUNT  (ADDR_INITIAL_OUTPUT)
)
read_address_head
(
    .clock          (output_clock),
    .clear          (output_clear),

    .up_down        (1'b0), // 0/1 --> up/down
    .run            (increment_buffer_read_addr),

    .load           (load_buffer_read_addr),
    .load_count     (ADDR_INITIAL_OUTPUT),

    .carry_in       (1'b0),
    // verilator lint_off PINCONNECTEMPTY
    .carry_out      (),
    .carries        (),
    .overflow       (),
    // verilator lint_on  PINCONNECTEMPTY

    .count          (buffer_read_addr_head)
);
To distinguish between the empty and full cases, which both show identically as equal read and write buffer addresses, we keep track of each time an address wraps around to zero by toggling a bit. The addresses never pass each other.

If the write address runs ahead of the read address enough to wrap around and reach the read address from behind, the buffer is full (or has less free space than one write word) and all writes to the buffer halt until we've read out more than the width of a write word. We detect this because the write address will have wrapped around one more time than the read address, so their wrap-around bits will be different.

Conversely, if the read address catches up to the write address from behind, the buffer is empty (or contains less than one read word of data) and all reads halt until we've written enough data to have more than the width of a read word in the buffer. In this case, the wrap-around bits are identical.
reg  toggle_buffer_write_addr_wrap_around = 1'b0;
wire buffer_write_addr_wrap_around;

Register_Toggle
#(
    .WORD_WIDTH     (1),
    .RESET_VALUE    (1'b0)
)
write_wrap_around_bit
(
    .clock          (input_clock),
    .clock_enable   (1'b1),
    .clear          (input_clear),
    .toggle         (toggle_buffer_write_addr_wrap_around),
    .data_in        (buffer_write_addr_wrap_around),
    .data_out       (buffer_write_addr_wrap_around)
);

reg  toggle_buffer_read_addr_wrap_around = 1'b0;
wire buffer_read_addr_wrap_around;

Register_Toggle
#(
    .WORD_WIDTH     (1),
    .RESET_VALUE    (1'b0)
)
read_wrap_around_bit
(
    .clock          (output_clock),
    .clock_enable   (1'b1),
    .clear          (output_clear),
    .toggle         (toggle_buffer_read_addr_wrap_around),
    .data_in        (buffer_read_addr_wrap_around),
    .data_out       (buffer_read_addr_wrap_around)
);
We need to compare the read and write addresses, along with their associated wrap-around bits, to detect if the buffer is holding any items. Therefore, we need to transfer the read address into the input_clock domain, and the write address into the output_clock domain. We do this with two CDC Word Synchronizers.

A read/write address is always valid, so we tie sending_valid high and ignore sending_ready. Then, we loop receiving_valid back into receiving_ready so a new CDC word transfer starts as soon as the current one completes. This configuration samples the address continuously, as fast as the CDC word transfers can happen. The synchronized address at receiving_data remains steady between CDC word transfers.
It takes a few cycles to complete a CDC word transfer, so when comparing the local read or write address with the synchronized counterpart from the other clock domain, we are comparing against a slightly stale version, lagging behind the actual value. However, since the addresses never pass each other, this does not cause any corruption. The synchronized value eventually catches up, and the actual buffer condition is updated. At worst, this lag means specifying a somewhat deeper FIFO to achieve the expected peak capacity, depending on the input/output rates. However, not being restricted to power-of-two FIFO depths minimizes this overhead.
wire [ADDR_WIDTH-1:0]   buffer_write_addr_tail_synced;
wire                    buffer_write_addr_wrap_around_synced;
wire                    buffer_write_addr_synced_valid;

CDC_Word_Synchronizer
#(
    .WORD_WIDTH             (1 + ADDR_WIDTH),
    .EXTRA_CDC_DEPTH        (CDC_EXTRA_STAGES),
    .OUTPUT_BUFFER_TYPE     ("HALF"), // "HALF", "SKID", "FIFO"
    .OUTPUT_BUFFER_CIRCULAR (0),
    .FIFO_BUFFER_DEPTH      (), // Only for "FIFO"
    .FIFO_BUFFER_RAMSTYLE   ()  // Only for "FIFO"
)
write_to_read
(
    .sending_clock      (input_clock),
    .sending_clear      (input_clear),
    .sending_data       ({buffer_write_addr_wrap_around, buffer_write_addr_tail}),
    .sending_valid      (1'b1),
    // verilator lint_off PINCONNECTEMPTY
    .sending_ready      (),
    // verilator lint_on  PINCONNECTEMPTY

    .receiving_clock    (output_clock),
    .receiving_clear    (output_clear),
    .receiving_data     ({buffer_write_addr_wrap_around_synced, buffer_write_addr_tail_synced}),
    .receiving_valid    (buffer_write_addr_synced_valid),
    .receiving_ready    (buffer_write_addr_synced_valid)
);

wire [ADDR_WIDTH-1:0]   buffer_read_addr_tail_synced;
wire                    buffer_read_addr_wrap_around_synced;
wire                    buffer_read_addr_synced_valid;

CDC_Word_Synchronizer
#(
    .WORD_WIDTH             (1 + ADDR_WIDTH),
    .EXTRA_CDC_DEPTH        (CDC_EXTRA_STAGES),
    .OUTPUT_BUFFER_TYPE     ("HALF"), // "HALF", "SKID", "FIFO"
    .OUTPUT_BUFFER_CIRCULAR (0),
    .FIFO_BUFFER_DEPTH      (), // Only for "FIFO"
    .FIFO_BUFFER_RAMSTYLE   ()  // Only for "FIFO"
)
read_to_write
(
    .sending_clock      (output_clock),
    .sending_clear      (output_clear),
    .sending_data       ({buffer_read_addr_wrap_around, buffer_read_addr_tail}),
    .sending_valid      (1'b1),
    // verilator lint_off PINCONNECTEMPTY
    .sending_ready      (),
    // verilator lint_on  PINCONNECTEMPTY

    .receiving_clock    (input_clock),
    .receiving_clear    (input_clear),
    .receiving_data     ({buffer_read_addr_wrap_around_synced, buffer_read_addr_tail_synced}),
    .receiving_valid    (buffer_read_addr_synced_valid),
    .receiving_ready    (buffer_read_addr_synced_valid)
);
We describe the state of the buffer itself as the number of items currently stored in it, as indicated by the read and write addresses and their wrap-around bits. We only care about the two extremes: not enough buffered data to read out one output word, and not enough free space to write in one input word:
reg cannot_read  = 1'b0;
reg cannot_write = 1'b0;

always @(*) begin
    cannot_read  = (buffer_read_addr_head > buffer_write_addr_tail_synced) && (buffer_read_addr_wrap_around        == buffer_write_addr_wrap_around_synced);
    cannot_write = (buffer_read_addr_tail_synced < buffer_write_addr_head) && (buffer_read_addr_wrap_around_synced != buffer_write_addr_wrap_around);
end
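For example, with 32-bit inputs and 24-bit outputs, suppose one input word has been written and the first output word has already been read out: the synchronized write tail is 32, and the read head is now 47 (the output word spanning bits 24 to 47). Since 47 > 32 and the wrap-around bits are still equal, cannot_read is set, because bits 32 to 47 of that output word have not been written yet.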
The input interface is simple: if the buffer has room for at least one more input word, signal that the input is ready, and when an input handshake completes, write the data directly into the buffer and increment the write address, wrapping around as necessary.
reg insert = 1'b0;

always @(*) begin
    input_ready                          = (cannot_write == 1'b0);
    insert                               = (input_valid  == 1'b1) && (input_ready == 1'b1);
    increment_buffer_write_addr          = (insert == 1'b1);
    load_buffer_write_addr               = (increment_buffer_write_addr == 1'b1) && (buffer_write_addr_head == ADDR_LAST [ADDR_WIDTH-1:0]);
    toggle_buffer_write_addr_wrap_around = (load_buffer_write_addr == 1'b1);
end
Normally this would be a Register module, but here we need to describe a clock enable on only part of the buffer registers, not a data mux. So this is a rare use of an if-statement outside of a generate block.
always @(posedge input_clock) begin
    if (insert == 1'b1) begin
        buffer [buffer_write_addr_tail +: WORD_WIDTH_INPUT] <= input_data;
    end
end
The output interface is not so simple because the output is registered, and so holds data independently of the buffer. We signal that the output holds valid data whenever we can remove an item from the buffer and load it into the output register. We meet this condition when an output handshake completes, or when the buffer holds an item but the output register does not hold any valid data. Also, we do not increment/wrap the read address if the previous item removed from the buffer and loaded into the output register was the last one.
reg remove               = 1'b0;
reg output_leaving_idle  = 1'b0;
reg load_output_register = 1'b0;

always @(*) begin
    remove                              = (output_valid == 1'b1) && (output_ready == 1'b1);
    output_leaving_idle                 = (output_valid == 1'b0) && (cannot_read  == 1'b0);
    load_output_register                = (remove == 1'b1) || (output_leaving_idle == 1'b1);
    increment_buffer_read_addr          = (load_output_register == 1'b1) && (cannot_read == 1'b0);
    load_buffer_read_addr               = (increment_buffer_read_addr == 1'b1) && (buffer_read_addr_head == ADDR_LAST [ADDR_WIDTH-1:0]);
    toggle_buffer_read_addr_wrap_around = (load_buffer_read_addr == 1'b1);
end
Normally this would be a Register module, but here we need to describe a clock enable on the output_data register, and a mux from the buffer. So this is a rare use of an if-statement outside of a generate block.
always @(posedge output_clock) begin
    if (load_output_register == 1'b1) begin
        output_data <= buffer [buffer_read_addr_tail +: WORD_WIDTH_OUTPUT];
    end
end
output_valid must be registered to match the latency of the buffer output register.
Register
#(
    .WORD_WIDTH     (1),
    .RESET_VALUE    (1'b0)
)
output_data_valid
(
    .clock          (output_clock),
    .clock_enable   (load_output_register == 1'b1),
    .clear          (output_clear),
    .data_in        (cannot_read == 1'b0),
    .data_out       (output_valid)
);

endmodule