Source

Pipeline FIFO Buffer

Decouples two sides of a ready/valid handshake to allow back-to-back transfers without a combinational path between input and output, thus pipelining the path to improve concurrency and/or timing. The latency from input to output is 2 clock cycles. Any FIFO depth is allowed, not only powers-of-2. Raising clear will return the FIFO to the empty state.

Since a FIFO buffer stores variable amounts of data, it will smooth out irregularities in the transfer rates of the input and output interfaces, and when used in pipeline loops, can store enough data to prevent artificial deadlocks (re: Kahn Process Networks with bounded channels).

This module is a generalization of the Skid Buffer, which acts as a 2-deep FIFO buffer. Go read it to get a deeper treatment of pipeline buffering requirements, and the particular idiom used to implement the control logic.

`default_nettype none

module Pipeline_FIFO_Buffer
#(
    parameter WORD_WIDTH = 0,
    parameter DEPTH      = 0,
    parameter RAMSTYLE   = ""
)
(
    input   wire                        clock,
    input   wire                        clear,

    input   wire                        input_valid,
    output  reg                         input_ready,
    input   wire    [WORD_WIDTH-1:0]    input_data,

    output  wire                        output_valid,
    input   wire                        output_ready,
    output  wire    [WORD_WIDTH-1:0]    output_data
);

    initial begin
        input_ready = 1'b1; // Empty at start, so accept data
    end

Constants

From the FIFO DEPTH, we derive the bit width of the buffer addresses and item count, and construct the constants we work with.

    `include "clog2_function.vh"

    localparam WORD_ZERO    = {WORD_WIDTH{1'b0}};

    localparam ADDR_WIDTH   = clog2(DEPTH);
    localparam ADDR_ONE     = {{ADDR_WIDTH-1{1'b0}},1'b1};
    localparam ADDR_ZERO    = {ADDR_WIDTH{1'b0}};
    localparam ADDR_LAST    = DEPTH-1;

Since the stored data count has to be able to represent DEPTH itself, and not a zero to DEPTH-1 count of that quantity, we need an extra bit, to guarantee sufficient range.

    localparam COUNT_WIDTH  = ADDR_WIDTH + 1;
    localparam COUNT_ONE    = {{COUNT_WIDTH-1{1'b0}},1'b1};
    localparam COUNT_ZERO   = {COUNT_WIDTH{1'b0}};
    localparam COUNT_LAST   = DEPTH;
    localparam COUNT_UP     = 1'b0;
    localparam COUNT_DOWN   = 1'b1;

Data Path

The buffer itself is a synchronous dual-port memory: one write port to insert data, and one read port to concurrently remove data. Typically this memory will be a dedicated Block RAM, but can also be built from LUT RAM if the width and depth are small, or even plain registers for very small cases. If your CAD tool does not support inference of memories as registers, you can instead chain together a few Skid Buffers, though the latency will increase by 1 clock cycle per Skid Buffer.

NOTE: There will NEVER be a concurrent read and write to the same address, so write-forwarding logic is not necessary, as specififed by the values of the READ_NEW_DATA and RW_ADDR_COLLISION parameters. Guide your CAD tool as necessary to tell it there will never be read/write address collisions, so you can obtain the highest possible operating frequency.

We initialize the read/write enables to zero, signifying an idle system.

    reg                     buffer_wren = 1'b0;
    wire [ADDR_WIDTH-1:0]   buffer_write_addr;

    reg                     buffer_rden = 1'b0;
    wire [ADDR_WIDTH-1:0]   buffer_read_addr;

    RAM_Simple_Dual_Port
    #(
        .WORD_WIDTH         (WORD_WIDTH),
        .ADDR_WIDTH         (ADDR_WIDTH),
        .DEPTH              (DEPTH),
        .RAMSTYLE           (RAMSTYLE),
        .READ_NEW_DATA      (0),
        .RW_ADDR_COLLISION  ("no"),
        .USE_INIT_FILE      (0),
        .INIT_FILE          (),
        .INIT_VALUE         (WORD_ZERO)
    )
    buffer
    (
        .clock              (clock),
        .wren               (buffer_wren),
        .write_addr         (buffer_write_addr),
        .write_data         (input_data),
        .rden               (buffer_rden),
        .read_addr          (buffer_read_addr),
        .read_data          (output_data)
    );

Since the buffer is a synchronous RAM, it takes a clock cycle before the read data updates, so we pass along the read enable line (buffer_rden) through a register to synchronize it and signal valid read data (output_valid). However, the register needs separate control: for example, when the output interface reads out the last item out of the buffer, the output becomes invalid even though the buffer output is unchanged.

    reg update_output_valid = 1'b0;

    Register
    #(
        .WORD_WIDTH     (1),
        .RESET_VALUE    (1'b0)
    )
    output_data_valid
    (
        .clock          (clock),
        .clock_enable   (update_output_valid),
        .clear          (clear),
        .data_in        (buffer_rden),
        .data_out       (output_valid)
    );

Read/Write Address Counters

The buffer read and write addresses are stored in counters, which both start at (and clear to) ADDR_ZERO. Each counter can only increment by ADDR_ONE at each read or write, and will wrap around to ADDR_ZERO if incremented past a value of DEPTH-1, labelled as ADDR_LAST. The depth can be any positive number, not only a power-of-2.

NOTE: The counters never pass eachother. If the write counter runs ahead of the read counter enough to wrap-around and reach the read counter from behind, the buffer is full and all writes to the buffer halt until after a read happens. Conversely, if the read counter catches up to the write counter from behind, the buffer is empty and all reads halt until after a write happens. In both the full and empty cases, both counters are equal but they get to that state by different paths, so we will need some extra state further down to make these two cases look different.

    reg increment_buffer_write_addr = 1'b0;
    reg load_buffer_write_addr      = 1'b0;

    Counter_Binary
    #(
        .WORD_WIDTH     (ADDR_WIDTH),
        .INCREMENT      (ADDR_ONE),
        .INITIAL_COUNT  (ADDR_ZERO)
    )
    write_address
    (
        .clock          (clock),
        .clear          (clear),
        .up_down        (1'b0), // 0/1 --> up/down
        .run            (increment_buffer_write_addr),
        .load           (load_buffer_write_addr),
        .load_count     (ADDR_ZERO),
        .carry_in       (1'b0),
        // verilator lint_off PINCONNECTEMPTY
        .carry_out      (),
        // verilator lint_on  PINCONNECTEMPTY
        .count          (buffer_write_addr)
    );

    reg increment_buffer_read_addr = 1'b0;
    reg load_buffer_read_addr      = 1'b0;

    Counter_Binary
    #(
        .WORD_WIDTH     (ADDR_WIDTH),
        .INCREMENT      (ADDR_ONE),
        .INITIAL_COUNT  (ADDR_ZERO)
    )
    read_address
    (
        .clock          (clock),
        .clear          (clear),
        .up_down        (1'b0), // 0/1 --> up/down
        .run            (increment_buffer_read_addr),
        .load           (load_buffer_read_addr),
        .load_count     (ADDR_ZERO),
        .carry_in       (1'b0),
        // verilator lint_off PINCONNECTEMPTY
        .carry_out      (),
        // verilator lint_on  PINCONNECTEMPTY
        .count          (buffer_read_addr)
    );

Stored Item Counter

To distinguish between the empty and full cases, which both identically show as equal read and write counters, we also count the number of words currently stored in the FIFO. This is not the most compact way of expressing the difference between the empty and full cases, but it is the simplest and fastest, as we don't need to track the read and write addresses to see if they have wrapped around before they came to be equal to determine if the FIFO is full or empty. All we need to do is count up or down by COUNT_ONE.

Put another way, instead of having to compare the whole value of both address counters plus an extra empty/full bit, we only need to compare one counter to two constant values (for the empty and full cases) which is simpler and more local logic.

Put yet another way, we found an equivalence between all possible pairs of read and write address values which represent the same amount of stored data, so we went from having to deal with (for a FIFO of depth N) ceil(22log2(N)+1) possible states (two counters and an empty/full bit), to only ceil(2log2(N)) actual states (the number of stored words).

    reg                     update_buffer_data_count    = 1'b0;
    reg                     incr_decr_buffer_data_count = 1'b0;
    wire [COUNT_WIDTH-1:0]  buffer_data_count;

    Counter_Binary
    #(
        .WORD_WIDTH     (COUNT_WIDTH),
        .INCREMENT      (COUNT_ONE),
        .INITIAL_COUNT  (COUNT_ZERO)
    )
    data_count
    (
        .clock          (clock),
        .clear          (clear),
        .up_down        (incr_decr_buffer_data_count), // 0/1 --> up/down
        .run            (update_buffer_data_count),
        .load           (1'b0),
        .load_count     (COUNT_ZERO),
        .carry_in       (1'b0),
        // verilator lint_off PINCONNECTEMPTY
        .carry_out      (),
        // verilator lint_on  PINCONNECTEMPTY
        .count          (buffer_data_count)
    );

Control Path

For a depth of N, the FIFO buffer has N states:

  1. Empty, representing 0 items in storage.
  2. Busy, N-2 similar states, representing 1 to N-1 items in storage.
  3. Full, representing N items in storage.

The operations which transition between these states are:

We also descriptively name each transition between states. These names will show up later in the code as datapath transformations.

                 /--\ +- flow              /--\ +- flow 
                 |  |                      |  |
          load   |  v    load       load   |  v    load
 -------   +    -------   +          +    -------   +     -------
| Empty | ---> | Busy  | --->       ---> | Busy  | ----> | Full  |
|   0   |      |   1   |      . . .      |  N-1  |       |   N   |
|       | <--- |       | <---       <--- |       | <---  |       |
 -------    -   -------    -          -   -------    -    -------
         unload         unload     unload         unload

We can see from the resulting state diagram that when the datapath is empty, it can only support an insertion, and when it is full, it can only support a removal. If the interfaces try to remove while Empty, or insert while Full, data will be duplicated or lost, respectively.

We can also see that the state of the FIFO is exactly represented by the number of items currently stored in it, and that number can only stay put, increment by 1, or decrement by 1. Thus, the data_count counter will be our state variable, and we do not need to uniquely distinguish the state transitions.

    reg empty   = 1'b0;
    reg busy    = 1'b0;
    reg full    = 1'b0;

    always @(*) begin
        empty = (buffer_data_count == COUNT_ZERO);
        full  = (buffer_data_count == COUNT_LAST);
        busy  = (empty == 1'b0) && (full == 1'b0);
    end

Input Interface

The input interface inserts directly into the buffer, and only signals the other end to stop if the buffer is full.

    reg insert = 1'b0;

    always @(*) begin
        input_ready = (full == 1'b0);
        insert      = (input_valid  == 1'b1) && (input_ready  == 1'b1);
    end

Output Interface

The buffer output is registered, so the output interface is necessarily pipelined and works in parallel with the input interface. Keeping that pipeline full and signalling to the read_address and data_count counters when an item is removed from the buffer requires a separate little engine to continually read from the buffer when possible.

Recall from above that output_valid is a registered version of buffer_rden, matching the latency of a buffer read and signalling when a buffer read has completed and new output_data is available.

We update output_valid and try to read from the buffer if the other end of the output interface is ready (output_ready is high), or if output_data is currently invalid, so even if output_ready is low, we try and pre-load the output interface. A buffer read happens if the buffer is not empty. Otherwise, output_valid updates from a buffer_rden which is necessarily low, and thus output_valid goes (or remains) low in the next cycle. If we can read from the buffer, and output_ready is high, then we have removed an item from the buffer.

    reg remove = 1'b0;

    always @(*) begin
        update_output_valid = (output_valid == 1'b0) || (output_ready        == 1'b1);
        buffer_rden         = (empty        == 1'b0) && (update_output_valid == 1'b1);
        remove              = (buffer_rden  == 1'b1) && (output_ready        == 1'b1);
    end

Datapath Transformations

Now that we have defined the possible states and operations of the datapath, we can define the datapath transformations, which are the edges between states. Or put otherwise: in which state(s) can the insert and/or remove operations happen, and what do they mean.

After this point, we can describe the rest of the control logic using only transformations, without reference to states, I/O signals, or operations.

    reg load    = 1'b0; // Inserts data into buffer.
    reg unload  = 1'b0; // Remove data from buffer.
    reg flow    = 1'b0; // Inserts new data into buffer as stored data is removed.

    always @(*) begin
        load    = (full  == 1'b0) && (insert == 1'b1) && (remove == 1'b0);
        unload  = (empty == 1'b0) && (insert == 1'b0) && (remove == 1'b1);
        flow    = (busy  == 1'b1) && (insert == 1'b1) && (remove == 1'b1);
    end

Next-State Calculations

Calculating the next state after each datapath transformation becomes the control lines of the data_count counter, where we increment, decrement, or leave constant the count of items in the buffer, which defines the state of the FIFO.

    always @(*) begin
        update_buffer_data_count    = (load == 1'b1) || (unload == 1'b1);
        incr_decr_buffer_data_count = (load == 1'b1) ? COUNT_UP : COUNT_DOWN;
    end

Control Signals

Here, we increment the read/write addresses as necessary (they only ever increment), enable a write to the buffer as necessary, and wrap-around the read/write addresses when they reach the end of the buffer.

    always @(*) begin
        increment_buffer_write_addr = (load   == 1'b1) || (flow == 1'b1);
        increment_buffer_read_addr  = (unload == 1'b1) || (flow == 1'b1);
        buffer_wren                 = increment_buffer_write_addr;
        load_buffer_write_addr      = (increment_buffer_write_addr == 1'b1) && (buffer_write_addr == ADDR_LAST [ADDR_WIDTH-1:0]);
        load_buffer_read_addr       = (increment_buffer_read_addr  == 1'b1) && (buffer_read_addr  == ADDR_LAST [ADDR_WIDTH-1:0]);
    end

endmodule

back to FPGA Design Elements

fpgacpu.ca