Differential Deserializer Bit Alignment

This bit alignment algorithm initially trains the Deserializer by gradually altering the input delay on the positive and negative deserialized data to measure the region of stable data ("the eye"), where both positive and negative data agree, and then sets the delay so we sample the data close to the middle of the stable region. The set delay is constant. This algorithm is based on AMD/Xilinx WP249 "SPI-4.2 Dynamic Phase Alignment" by Robert Le and Kyle Locke.

This bit alignment algorithm has simpler logic than the one described in XAPP855 "16-Channel, Double-Data-Rate LVDS Interface with Per-Channel Alignment of Clock and Data" by Greg Burton, and XAPP860 "16-Channel, DDR LVDS Interface with Real-Time Window Monitoring Application Note" by Brandon Day. However, it depends on not using SERDES width expansion: both positive and negative data SERDES blocks must be independent and in MASTER mode. Instead, we use a natively supported SERDES width and do the width expansion in our own logic: Differential Deserializer, with N to 2N Ratio.

Finally, this bit alignment algorithm also enables the possibility of dynamic adjustment of the input delay while the SERDES is operating (see XAPP860), which is not implemented here.

Usage

Provide the Deserializer with a suitable and constant training word as serial data input. Almost any training word pattern will do as long as it's not clock-like. I like to use a pattern that can be distinguished when permuted or reversed, with a definite 0/1 transition at the start, e.g.: 9'b000100101.

Then, in the clk_main domain, pulse start_alignment high for one cycle, then wait for done_alignment to pulse high for one cycle, signalling bit alignment is complete. All other signals are in the clk_parallel domain, which also runs the Deserializer.
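As a rough illustration of the start/done handshake described above, here is a minimal sketch of some clk_main logic that could drive it. The module name, the request input, the aligned output, and the synchronous active-high reset_main are all hypothetical and not part of this design; only start_alignment and done_alignment correspond to the ports of the bit alignment module.

```verilog
`default_nettype none

// Hypothetical helper (not part of the original design): converts the
// rising edge of "request" into a one-cycle start_alignment pulse, then
// raises "aligned" once done_alignment has pulsed. All in clk_main.

module Alignment_Requester_Example
(
    input   wire    clk_main,
    input   wire    reset_main,         // Assumed synchronous, active-high
    input   wire    request,            // Assumed: raise to ask for a bit alignment
    input   wire    done_alignment,     // From the bit alignment module
    output  reg     start_alignment,    // To the bit alignment module
    output  reg     aligned             // High after bit alignment completes
);

    reg request_previous;

    initial begin
        request_previous    = 1'b0;
        start_alignment     = 1'b0;
        aligned             = 1'b0;
    end

    always @(posedge clk_main) begin
        if (reset_main == 1'b1) begin
            request_previous    <= 1'b0;
            start_alignment     <= 1'b0;
            aligned             <= 1'b0;
        end
        else begin
            request_previous    <= request;
            // One-cycle pulse on the rising edge of request
            start_alignment     <= (request == 1'b1) && (request_previous == 1'b0);
            // A new start invalidates the old alignment; done sets it again.
            if (start_alignment == 1'b1) begin
                aligned <= 1'b0;
            end
            else if (done_alignment == 1'b1) begin
                aligned <= 1'b1;
            end
        end
    end

endmodule
```

Since a start pulse sent while an alignment is in progress is lost, logic like this should probably not raise request again until aligned goes high.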

Operation

Assume a constant stream of a single constant training word of width equal to the Deserializer output width. Each deserialized word is signalled by a ready/valid handshake, which is also the step forward signal for the alignment process.

We have two input delay chains, one for the positive (P) Deserializer and one for the negative (N) Deserializer. Init the P delay at tap 0 and the N delay at tap TAP_OFFSET (2 is a good value).

This means the N Deserializer sees a bit from TAP_OFFSET tap delays ago relative to the P Deserializer seeing the current bit. We can imagine this as the N Deserializer sampling TAP_OFFSET taps earlier in the data word. Thus, incrementing the P and N delays together moves the sampling points backwards in a data word we imagine to be fixed.

At each step, we compare the P and N deserialized words. Assuming we begin in a stable area where the P and N words match, as we move backwards the N Deserializer will hit the transition area between stable data first, causing the N and P Deserializer outputs to mismatch, and the P Deserializer will be the last one to exit the transition area, at which point the P and N Deserializer outputs match again.

If we happen to begin in an unstable area, where the P and N outputs mismatch, the logic remains the same: increment P and N taps until the P and N Deserializer outputs match again.

After we find the first stable location after a transition area, we initialize a counter of valid taps (V) to zero. We then increment the P and N delay, and the V counter, until the Deserializer outputs do not match, signalling the end of the stable area.

At this point, we now have V+1 valid taps (equal to P plus the one tap before the N tap). Since P now sits at the far end of the stable area, the middle lies V/2 taps behind it, so we can re-set the P and N delays to the nearest middle of the stable area with Pmid = Nmid = P - (V >> 1) - (TAP_OFFSET - 2).
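To make the search steps above concrete, here is a small simulation-only sketch of the same procedure run against a toy channel. Everything in it is an assumption made for illustration: the module name, the unstable tap ranges (3 to 5 and 23 to 25), and the simplification that the P and N words match only when both sampling points are in stable data. The midpoint uses the same expression as the tap_aligned calculation in the module's datapath.

```verilog
`default_nettype none

// Hypothetical behavioural model, for simulation only.

module Bit_Alignment_Search_Model;

    localparam TAP_OFFSET   = 2;
    localparam TAP_WIDTH    = 5;    // Same as TAP_COUNTER_WIDTH

    // Assumed toy channel: sampling is unstable at taps 3..5 and 23..25.
    function is_stable;
        input [TAP_WIDTH-1:0] tap;
        begin
            is_stable = !(((tap >= 5'd3)  && (tap <= 5'd5)) ||
                          ((tap >= 5'd23) && (tap <= 5'd25)));
        end
    endfunction

    // Assumed simplification: words match only if both P and N sample stable data.
    function words_match;
        input [TAP_WIDTH-1:0] tap_p;
        input [TAP_WIDTH-1:0] tap_n;
        begin
            words_match = is_stable(tap_p) && is_stable(tap_n);
        end
    endfunction

    reg [TAP_WIDTH-1:0] p;      // P delay tap
    reg [TAP_WIDTH-1:0] n;      // N delay tap
    reg [TAP_WIDTH-1:0] v;      // Count of valid taps
    reg [TAP_WIDTH-1:0] mid;    // Final aligned tap

    initial begin
        p = 5'd0;
        n = TAP_OFFSET;

        // Leave the initial stable area (skipped if we start in a transition).
        while (words_match(p, n) == 1'b1) begin
            p = p + 5'd1;
            n = n + 5'd1;
        end

        // Pass through the transition area until the outputs match again.
        while (words_match(p, n) == 1'b0) begin
            p = p + 5'd1;
            n = n + 5'd1;
        end

        // Count taps across the stable area until the outputs mismatch.
        v = 5'd0;
        while (words_match(p, n) == 1'b1) begin
            p = p + 5'd1;
            n = n + 5'd1;
            v = v + 5'd1;
        end

        mid = p - (v >> 1) - (TAP_OFFSET - 2);
        $display("P=%0d V=%0d Pmid=%0d", p, v, mid);
        $finish;
    end

endmodule
```

With these assumed numbers the eye for P spans taps 6 to 22, and the model should print P=21 V=15 Pmid=14, which is the middle of that span. Unlike the real module, this sketch has no equivalent of the all-taps-tried detection, so it would loop forever on a channel with no transition area.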

Corner Cases

This algorithm does assume the total jitter is not so bimodally distributed that a stable area of width greater than TAP_OFFSET delay taps exists inside the bit transition areas.

All tap counts use their full bit widths and will correctly wrap around. This means the algorithm will keep searching until it finds the eye, even through a wraparound.
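As a quick check of the wrap-around claim, here is a simulation-only sketch of the arithmetic at the tap counter width. The module name and the example values are made up for illustration; the expressions mirror the next-tap and aligned-tap calculations used in the module.

```verilog
`default_nettype none

// Hypothetical example, for simulation only.

module Tap_Wraparound_Example;

    localparam TAP_COUNTER_WIDTH    = 5;
    localparam TAP_OFFSET           = 2;

    reg [TAP_COUNTER_WIDTH-1:0] tap_p_current;
    reg [TAP_COUNTER_WIDTH-1:0] stable_tap_count;
    reg [TAP_COUNTER_WIDTH-1:0] tap_p_next;
    reg [TAP_COUNTER_WIDTH-1:0] tap_aligned;

    initial begin
        // Incrementing past the last tap wraps to the first: 31 + 1 --> 0
        tap_p_current       = 5'd31;
        tap_p_next          = tap_p_current + 5'd1;
        $display("tap_p_next  = %0d", tap_p_next);

        // Stepping back past tap 0 wraps to the top: 3 - 5 - 0 --> 30
        tap_p_current       = 5'd3;
        stable_tap_count    = 5'd10;
        tap_aligned         = tap_p_current - (stable_tap_count >> 1) - (TAP_OFFSET - 2);
        $display("tap_aligned = %0d", tap_aligned);
        $finish;
    end

endmodule
```

The results are truncated to 5 bits on assignment, so this should print 0 and 30 respectively.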

There is no guarantee of finding the minimum working tap value, only a working one. This means you must assume the worst case jitter possibly introduced by the IDELAY2 tap chain.

Parameters, Ports, and Constants

`default_nettype none

module Deserializer_Differential_Bit_Alignment
#(
    parameter TAP_OFFSET        = 2,    // How much the initial N IDELAY lags the P IDELAY for bit alignment.
    parameter WORD_WIDTH        = 12,   // Parallel data width

    // Do not set at instantiation, except in Vivado IPI
    parameter TAP_COUNTER_WIDTH = 5     // Hardcoded to match IDELAY2 hardware. See UG471.
)
(
    // clk_parallel domain, for Deserializer data and control

    input   wire                            clk_parallel,
    input   wire                            reset_parallel,

    input   wire                            datain_parallel_valid,  // There is a handshake, but no slack for stalling!
    output  reg                             datain_parallel_ready,  // Must always be ready before valid!
    input   wire    [WORD_WIDTH-1:0]        datain_p_parallel,      // Deserialized positive data, framed by datain_parallel_valid
    input   wire    [WORD_WIDTH-1:0]        datain_n_parallel,      // Deserialized negative data, framed by datain_parallel_valid

    input   wire    [TAP_COUNTER_WIDTH-1:0] tap_p_current,          // Current value of delay tap
    output  reg     [TAP_COUNTER_WIDTH-1:0] tap_p_load_value,       // New value of delay tap
    output  reg                             tap_p_load,             // Load new delay tap value

    input   wire    [TAP_COUNTER_WIDTH-1:0] tap_n_current,          // Current value of delay tap
    output  reg     [TAP_COUNTER_WIDTH-1:0] tap_n_load_value,       // New value of delay tap
    output  reg                             tap_n_load,             // Load new delay tap value

    // System control signals in clk_main domain

    input   wire                            clk_main,               // General logic clock

    input   wire                            start_alignment,        // Preferably a one-cycle pulse
    output  wire                            done_alignment          // Pulsed high means deserializer data is bit aligned
);

    localparam TAP_ZERO  = {TAP_COUNTER_WIDTH{1'b0}};
    localparam TAP_ONE   = {{TAP_COUNTER_WIDTH-1{1'b0}},1'b1};
    localparam TAP_TWO   = {{TAP_COUNTER_WIDTH-2{1'b0}},2'b10};

    initial begin
        datain_parallel_ready   = 1'b1; // Always ready (no backpressure possible)
        tap_p_load_value        = TAP_ZERO;
        tap_p_load              = 1'b0;
        tap_n_load_value        = TAP_ZERO;
        tap_n_load              = 1'b0;
    end

Datapath Operations

We must have the bit alignment logic work in the Deserializer clock domain since we cannot have the extra latency of passing the Deserializer data into the main clock domain without greatly complicating the state machine. (We could not tell if a new data value was one affected by the latest change in tap delay.)

Transfer a pulse signalling the start of training into the Deserializer clock domain. A pulse sent during a bit alignment in progress is lost and has no effect.

    wire start_alignment_deserializer;

    CDC_Pulse_Synchronizer_2phase
    #(
        .CDC_EXTRA_DEPTH        (0)
    )
    start_alignment_transfer
    (
        .sending_clock          (clk_main),
        .sending_pulse_in       (start_alignment),
        // verilator lint_off PINCONNECTEMPTY
        .sending_ready          (),
        // verilator lint_on  PINCONNECTEMPTY

        .receiving_clock        (clk_parallel),
        .receiving_pulse_out    (start_alignment_deserializer)
    );

Transfer the pulse signalling the end of bit alignment training from the Deserializer clock domain into the main system clock domain.

    reg done_alignment_deserializer = 1'b0;

    CDC_Pulse_Synchronizer_2phase
    #(
        .CDC_EXTRA_DEPTH        (0)
    )
    done_alignment_transfer
    (
        .sending_clock          (clk_parallel),
        .sending_pulse_in       (done_alignment_deserializer),
        // verilator lint_off PINCONNECTEMPTY
        .sending_ready          (),
        // verilator lint_on  PINCONNECTEMPTY

        .receiving_clock        (clk_main),
        .receiving_pulse_out    (done_alignment)
    );

Check if the Deserializer positive and negative polarity output data differ. Then latch it while the data is valid so we always know what the last match state was. This way we don't have to synchronize other events to datain_parallel_valid, which simplifies the later control logic.

    reg deserializer_outputs_match = 1'b0;

    always @(*) begin
        deserializer_outputs_match = (datain_p_parallel == ~datain_n_parallel);
    end

    wire deserializer_outputs_match_latched;

    Register
    #(
        .WORD_WIDTH     (1),
        .RESET_VALUE    (1'b0)
    )
    deserializer_outputs_latest_state
    (
        .clock          (clk_parallel),
        .clock_enable   (datain_parallel_valid == 1'b1),
        .clear          (reset_parallel),
        .data_in        (deserializer_outputs_match),
        .data_out       (deserializer_outputs_match_latched)
    );

Count the number of taps in the stable data area.

    reg                             stable_tap_count_increment  = 1'b0;
    reg                             stable_tap_count_load       = 1'b0;
    wire [TAP_COUNTER_WIDTH-1:0]    stable_tap_count;

    Counter_Binary
    #(
        .WORD_WIDTH     (TAP_COUNTER_WIDTH),
        .INCREMENT      (TAP_ONE),
        .INITIAL_COUNT  (TAP_ZERO)
    )
    stable_tap_counter
    (
        .clock          (clk_parallel),
        .clear          (reset_parallel),

        .up_down        (1'b0), // 0/1 --> up/down
        .run            (stable_tap_count_increment),

        .load           (stable_tap_count_load),
        .load_count     (TAP_ZERO),

        .carry_in       (1'b0),
        // verilator lint_off PINCONNECTEMPTY
        .carry_out      (),
        .carries        (),
        .overflow       (),
        // verilator lint_on  PINCONNECTEMPTY

        .count          (stable_tap_count)
    );

Calculate the next tap values and the final aligned tap value.

Normally I'd use Adder_Subtractor_Binary modules, but here we know all numbers are unsigned and of the same width, no carry in/out is needed, and wrap-around is expected and desired. There are no corner-cases. So let the CAD tool synthesize the math here.

    reg [TAP_COUNTER_WIDTH-1:0] tap_p_next  = TAP_ZERO;
    reg [TAP_COUNTER_WIDTH-1:0] tap_n_next  = TAP_ZERO;
    reg [TAP_COUNTER_WIDTH-1:0] tap_aligned = TAP_ZERO;

    always @(*) begin
        tap_p_next  = tap_p_current + TAP_ONE;
        tap_n_next  = tap_n_current + TAP_ONE;
        tap_aligned = tap_p_current - (stable_tap_count >> 1) - (TAP_OFFSET [TAP_COUNTER_WIDTH-1:0] - TAP_TWO);
    end

State Logic

    localparam  STATE_WIDTH                         = 2;
    localparam [STATE_WIDTH-1:0] STATE_IDLE         = 'd0;
    localparam [STATE_WIDTH-1:0] STATE_FIND_FIRST   = 'd1;
    localparam [STATE_WIDTH-1:0] STATE_THRU_FIRST   = 'd2;
    localparam [STATE_WIDTH-1:0] STATE_FIND_SECOND  = 'd3;

    wire [STATE_WIDTH-1:0] state;
    reg  [STATE_WIDTH-1:0] state_next = STATE_IDLE;
    
    Register
    #(
        .WORD_WIDTH     (STATE_WIDTH),
        .RESET_VALUE    (STATE_IDLE)
    )
    state_reg
    (
        .clock          (clk_parallel),
        .clock_enable   (1'b1),
        .clear          (reset_parallel),
        .data_in        (state_next),
        .data_out       (state)
    );

Datapath Transformations

    wire all_taps_stable;           // See Pulse_Divider below. Pulses high when all 2**TAP_COUNTER_WIDTH taps have been tried without a transition found.

    reg init_taps           = 1'b0; // Load both P and N IDELAY with the start delay tap values: P is 0 and N is TAP_OFFSET.
    reg init_find_first     = 1'b0; // Starting from a stable area, start looking for the start of the first data transition.
    reg init_thru_first     = 1'b0; // Starting from a transition area, start looking for the end of that first data transition area.

    reg finding_first       = 1'b0; // Currently in stable area, keep incrementing taps.
    reg none_first          = 1'b0; // From inside a stable area, reach the end of possible taps because there is NO transition area (perfect P/N alignment)
    reg found_first         = 1'b0; // From inside a stable area, found the start of the first data transition.

    reg exiting_first       = 1'b0; // Currently in a transition area, keep incrementing taps.
    reg exited_first        = 1'b0; // From inside the first transition area, found the start of the stable area. This is one end of the eye.

    reg finding_second      = 1'b0; // Currently in stable area, keep incrementing taps.
    reg none_second         = 1'b0; // From inside a stable area, reach the end of possible taps because there is NO transition area (perfect P/N alignment)
    reg found_second        = 1'b0; // From the stable area, found the start of the second transition. This is the other end of the eye.

    always @(*) begin
        init_taps           = (state == STATE_IDLE) && (start_alignment_deserializer == 1'b1);
        init_find_first     = (state == STATE_IDLE) && (deserializer_outputs_match_latched == 1'b1); // latch, so no sync to valid needed
        init_thru_first     = (state == STATE_IDLE) && (deserializer_outputs_match_latched == 1'b0);

        finding_first       = (state == STATE_FIND_FIRST)  && (deserializer_outputs_match == 1'b1) && (datain_parallel_valid == 1'b1);
        none_first          = (finding_first == 1'b1) && (all_taps_stable == 1'b1);
        found_first         = (state == STATE_FIND_FIRST)  && (deserializer_outputs_match == 1'b0) && (datain_parallel_valid == 1'b1);

        exiting_first       = (state == STATE_THRU_FIRST)  && (deserializer_outputs_match == 1'b0) && (datain_parallel_valid == 1'b1);
        exited_first        = (state == STATE_THRU_FIRST)  && (deserializer_outputs_match == 1'b1) && (datain_parallel_valid == 1'b1);

        finding_second      = (state == STATE_FIND_SECOND) && (deserializer_outputs_match == 1'b1) && (datain_parallel_valid == 1'b1);
        none_second         = (finding_second == 1'b1) && (all_taps_stable == 1'b1);
        found_second        = (state == STATE_FIND_SECOND) && (deserializer_outputs_match == 1'b0) && (datain_parallel_valid == 1'b1);
    end

Signal when all possible taps have been tried. If we reach that point without a transition, then the alignment is already perfect (or close enough), so we use the signal to skip searching for transition regions. The system will then naturally find the middle tap as the best one.

    localparam [TAP_COUNTER_WIDTH-1+1:0] TAP_COUNT = 2**TAP_COUNTER_WIDTH;

    Pulse_Divider
    #(
        .WORD_WIDTH         (TAP_COUNTER_WIDTH+1),
        .INITIAL_DIVISOR    (TAP_COUNT)
    )
    all_taps_stable_detector
    (
        .clock          (clk_parallel),
        .restart        (1'b0),
        .divisor        (TAP_COUNT),
        .pulses_in      (finding_first | finding_second),
        .pulse_out      (all_taps_stable),
        // verilator lint_off PINCONNECTEMPTY
        .div_by_zero    ()
        // verilator lint_on  PINCONNECTEMPTY
    );

State Transitions

    always @(*) begin
        state_next = init_taps && init_find_first ? STATE_FIND_FIRST  : state;
        state_next = init_taps && init_thru_first ? STATE_THRU_FIRST  : state_next;
        state_next = found_first                  ? STATE_THRU_FIRST  : state_next;
        state_next = none_first                   ? STATE_FIND_SECOND : state_next;
        state_next = exited_first                 ? STATE_FIND_SECOND : state_next;
        state_next = found_second                 ? STATE_IDLE        : state_next;
        state_next = none_second                  ? STATE_IDLE        : state_next;
    end

Control Signals

    always @(*) begin
        tap_p_load_value            = init_taps                  ? TAP_ZERO    : tap_p_next;
        tap_p_load_value            = found_second | none_second ? tap_aligned : tap_p_load_value;
        tap_p_load                  = init_taps | finding_first | exiting_first | finding_second | found_second | none_second;

        tap_n_load_value            = init_taps                  ? TAP_OFFSET [TAP_COUNTER_WIDTH-1:0] : tap_n_next; // Explicit width, as in tap_aligned
        tap_n_load_value            = found_second | none_second ? tap_aligned : tap_n_load_value;
        tap_n_load                  = init_taps | finding_first | exiting_first | finding_second | found_second | none_second;

        stable_tap_count_load       = exited_first | none_first;
        stable_tap_count_increment  = finding_second;

        done_alignment_deserializer = found_second | none_second;
    end

endmodule

