Source

Pipeline Skid Buffer

Decouples two sides of a ready/valid handshake to allow back-to-back transfers without a combinational path between input and output, thus pipelining the path.

A skid buffer is the smallest FIFO, with only two entries. It is useful when you need to pipeline the path between a sender and a receiver for concurrency and/or timing, but not to smooth-out data rate mismatches. It also only requires two data registers, which at this scale is smaller than LUTRAMs or Block RAMs (depending on implementation), and has more freedom of placement and routing.

Background

Networks-on-Chip (NoC) and elastic pipelines have as a fundamental building block a handshaking mechanism where each end of a link can signal if they have data to send ("valid"), or if they are able to receive data ("ready"). When both ends agree (valid and ready both high), a data transfer occurs on that clock cycle.

However, pipelining handshaking is more complicated: simply adding a pipeline register to the valid, ready, and data lines will work, but now each transfer take two cycles to start, and two cycles to stop. This isn't bad in terms of bandwidth if you know you can transfer a block of data per handshake, but now the receiver has to be aware of how many pipeline stages exist between it and the sender, and thus must have sufficient internal buffering to absorb the data that keeps arriving after it signals it is no longer ready to receive more data.

This is the basis of credit-based connections (which I'm not getting into here), which maximize bandwidth over long pipelines, but are overkill if you simply need to add a single pipeline stage between two ends, without having to modify them, so as to meet timing or allow each end to send off one item of data without having to wait for a response (thus overlapping communication and computation, which is desirable).

This fundamental pipeline block is the skid buffer.

Figuring Out The Requirements

To begin designing a skid buffer, let's imagine a single unit which can perform a valid/ready handshake and receive an input item of data, then performs the same handshake with the other end to output the data.

Input                       Output
-----                       ------
          -------------
ready <--|             |<-- ready
valid -->| Skid Buffer |--> valid
data  -->|             |--> data
          -------------

Ideally, the input and output interfaces operate concurrently for maximum bandwidth: in the same clock cycle, a new data item is received on the input interface and put into a register, and that same register is simultaneously read out by the output interface. However, if the output interface is not transfering data on a given cycle, the input interface must not transfer data during that cycle also, else we will overwrite the data register before it was read out. To avoid this problem, the input interface should declare itself not ready in the same cycle as the output interface declaring itself not ready. But this forms a direct combinational connection between them, not a pipelined one. If we could connect both interfaces directly, and not affect timing or concurrency, we wouldn't need pipelining in the first place!

To resolve this contradiction, we need an extra buffer register to capture the incoming data during a clock cycle where the input interface is transferring data, but the output interface isn't, and there is already data in the main register. Then, in the next cycle, the input interface can signal it is no longer ready, and no data gets lost. We can imagine this extra buffer register as allowing the input interface to "skid" to a stop, rather than stopping immediately, which we'd previously found contradicts our pipelining requirements.

`default_nettype none

module Pipeline_Skid_Buffer
#(
    parameter WORD_WIDTH = 0
)
(
    input   wire                        clock,
    input   wire                        areset,
    input   wire                        clear,

    input   wire                        input_valid,
    output  wire                        input_ready,
    input   wire    [WORD_WIDTH-1:0]    input_data,

    output  wire                        output_valid,
    input   wire                        output_ready,
    output  wire    [WORD_WIDTH-1:0]    output_data
);

    localparam WORD_ZERO = {WORD_WIDTH{1'b0}};

Data Path

Feed the data_out register either from the input_data, or a buffered copy of input_data. Write to registers only if enabled by control.

Funneling into a single data_out register rather than selecting between two equal output registers avoids a mux after registers, fed by two data streams (thus more routing and delay). A single output register also retimes more easily into downstream logic.

Set up the default control values to match the "empty" state of the skid buffer, so the first input_data to arrive ends up in the data_out by default. We don't have to worry about state here, just pass the data through unless told otherwise.

    reg                     data_buffer_wren = 1'b0; // EMPTY at start, so don't load.
    wire [WORD_WIDTH-1:0]   data_buffer_out;

    Register
    #(
        .WORD_WIDTH     (WORD_WIDTH),
        .RESET_VALUE    (WORD_ZERO)
    )
    data_buffer_reg
    (
        .clock          (clock),
        .clock_enable   (data_buffer_wren),
        .areset         (areset),
        .clear          (clear),
        .data_in        (input_data),
        .data_out       (data_buffer_out)
    );

    reg                     data_out_wren       = 1'b1; // EMPTY at start, so accept data.
    reg                     use_buffered_data   = 1'b0;
    reg [WORD_WIDTH-1:0]    selected_data       = WORD_ZERO;

    always @(*) begin
        selected_data = (use_buffered_data == 1'b1) ? data_buffer_out : input_data;
    end

    Register
    #(
        .WORD_WIDTH     (WORD_WIDTH),
        .RESET_VALUE    (WORD_ZERO)
    )
    data_out_reg
    (
        .clock          (clock),
        .clock_enable   (data_out_wren),
        .areset         (areset),
        .clear          (clear),
        .data_in        (selected_data),
        .data_out       (output_data)
    );

Control Path

We separate the control path so the associated data path does not have to know anything about the current state or its encoding.

This FSM assumes the usual meaning and behaviour of valid/ready handshake signals: when both are high, data transfers at the end of the clock cycle. It is an error to raise ready when not able to accept data (thus losing the incoming data), or to raise valid when not able to send data (thus duplicating previously sent data). These error situations are not handled.

To operate our datapath as a skid buffer, we need to understand which states we want to allow it to be in, and which state transitions we also allow. This skid buffer has three states:

  1. It is Empty.
  2. It is Busy, holding one item of data in the main register, either waiting or actively transferring data through that register.
  3. It is Full, holding data in both registers, and stopped until the main register is emptied and simultaneously refilled from the buffer register, so no data is lost or reordered. (Without an available empty register, the input interface cannot skid to a stop, so it must signal it is not ready.)

The operations which transition between these states are:

We also descriptively name each transition between states. These names will show up later in the code.

                 /--\ +- flow
                 |  |
          load   |  v   fill
 -------   +    ------   +    ------
|       | ---> |      | ---> |      |
| Empty |      | Busy |      | Full |
|       | <--- |      | <--- |      |
 -------    -   ------    -   ------
         unload         flush

We can see from the resulting state diagram that when the datapath is empty, it can only support an insertion, and when it is full, it can only support a removal. These constraints will become very important later on. If the interfaces try to remove while Empty, or insert while Full, data will be duplicated or lost, respectively.

This simple FSM description helped us clarify the problem, but it also glossed over the potential complexity of the implementation: 3 states, each connected to 2 signals (valid/ready) per interface, for a total of 16 possible transitions out of each state, or 48 possible state transitions total.

We don't want to have to manually enumerate all the transitions to then coalesce the equivalent ones and rule out all the impossible or illegal ones. Instead, if we express in logic the constraints on removals and insertions we determined from the state diagram, and the possible transformations on the datapath, we then get the state transition logic and datapath control signal logic almost for free.

Lets describe the possible states of the datapath, and initialize it. This code describes a binary state encoding, but the CAD tool can re-encode and re-number the state encoding. Usually this is beneficial, but if the states+inputs fit in a single LUT, forcing binary encoding reduces area. See what works best (i.e.: reaches the highest speed) for your given FPGA.

    localparam STATE_BITS = 2;

    localparam [STATE_BITS-1:0] EMPTY = 'd0; // Output and buffer registers empty
    localparam [STATE_BITS-1:0] BUSY  = 'd1; // Output register holds data
    localparam [STATE_BITS-1:0] FULL  = 'd2; // Both output and buffer registers hold data
    // There is no case where only the buffer register would hold data.

    // No handling of erroneous and unreachable state 3.
    // We could check and raise an error flag.

    wire [STATE_BITS-1:0] state;
    reg  [STATE_BITS-1:0] state_next = EMPTY;

Now, let's express the constraints we figured out from the state diagram:

We do this by computing the allowable output read/valid handshake signals based on the datapath state. We use state_next so we can have nice registered outputs. This little bit of code prunes away a large number of invalid state transitions. If some other logic seems to be missing, first see if this code has made it unnecessary.

This tiny bit of code is critical since it also implies the fundamental operating assumptions of a skid buffer: that one interface cannot have its current state depend on the current state of the other interface, as that would be a combinational path between both interfaces.

Compute ready for the input interface

    Register
    #(
        .WORD_WIDTH     (1),
        .RESET_VALUE    (1'b1) // EMPTY at start, so accept data
    )
    input_ready_reg
    (
        .clock          (clock),
        .clock_enable   (1'b1),
        .areset         (areset),
        .clear          (clear),
        .data_in        (state_next != FULL),
        .data_out       (input_ready)
    );

Compute valid for the output interface

    Register
    #(
        .WORD_WIDTH     (1),
        .RESET_VALUE    (1'b0)
    )
    output_valid_reg
    (
        .clock          (clock),
        .clock_enable   (1'b1),
        .areset         (areset),
        .clear          (clear),
        .data_in        (state_next != EMPTY),
        .data_out       (output_valid)
    );

After, let's describe the interface signal conditions which implement our two basic operations on the datapath: insert and remove. This also weeds out a number of possible state transitions.

    reg insert = 1'b0;
    reg remove = 1'b0;

    always @(*) begin
        insert = (input_valid  == 1'b1) && (input_ready  == 1'b1);
        remove = (output_valid == 1'b1) && (output_ready == 1'b1);
    end

Now that we have our datapath states and operations, let's use them to describe the possible transformations to the datapath, and in which state they can happen. You'll see that these exactly describe each of the 5 edges in the state diagram, and since we've pruned the space of possible interface conditions, we only need the minimum logic to describe them, and this logic gets re-used a lot later on, simplifying the code.

    reg load    = 1'b0; // Empty datapath inserts data into output register.
    reg flow    = 1'b0; // New inserted data into output register as the old data is removed.
    reg fill    = 1'b0; // New inserted data into buffer register. Data not removed from output register.
    reg flush   = 1'b0; // Move data from buffer register into output register. Remove old data. No new data inserted.
    reg unload  = 1'b0; // Remove data from output register, leaving the datapath empty.

    always @(*) begin
        load    = (state == EMPTY) && (insert == 1'b1) && (remove == 1'b0);
        flow    = (state == BUSY)  && (insert == 1'b1) && (remove == 1'b1);
        fill    = (state == BUSY)  && (insert == 1'b1) && (remove == 1'b0);
        flush   = (state == FULL)  && (insert == 1'b0) && (remove == 1'b1);
        unload  = (state == BUSY)  && (insert == 1'b0) && (remove == 1'b1);
    end

And now we simply need to calculate the next state after each datapath transformations:

    always @(*) begin
        state_next = (load   == 1'b1) ? BUSY  : state;
        state_next = (flow   == 1'b1) ? BUSY  : state_next;
        state_next = (fill   == 1'b1) ? FULL  : state_next;
        state_next = (flush  == 1'b1) ? BUSY  : state_next;
        state_next = (unload == 1'b1) ? EMPTY : state_next;
    end

    Register
    #(
        .WORD_WIDTH     (STATE_BITS),
        .RESET_VALUE    (EMPTY)         // Initial state
    )
    state_reg
    (
        .clock          (clock),
        .clock_enable   (1'b1),
        .areset         (areset),
        .clear          (clear),
        .data_in        (state_next),
        .data_out       (state)
    );

Similarly, from the datapath transformations, we can compute the necessary control signals to the datapath. These are not registered here, as they end at registers in the datapath.

    always @(*) begin
        data_out_wren     = (load  == 1'b1) || (flow == 1'b1) || (flush == 1'b1);
        data_buffer_wren  = (fill  == 1'b1);
        use_buffered_data = (flush == 1'b1);
    end

endmodule

For a 64-bit connection, the resulting skid buffer uses 128 registers for the buffers, 4 to 9 registers (and associated LUTs) for the FSM and interface outputs, depending on the particular state encoding chosen by the CAD tool, and easily reaches a high operating speed.


back to FPGA Design Elements

fpgacpu.ca