This bit alignment algorithm first trains the Deserializer by gradually altering the input delay on the positive and negative deserialized data to measure the region of stable data ("the eye"), where both positive and negative data agree. It then sets the delay so we sample the data close to the middle of the stable region. Once set, the delay stays constant. This algorithm is based on AMD/Xilinx WP249 "SPI-4.2 Dynamic Phase Alignment" by Robert Le and Kyle Locke.
This bit alignment algorithm has simpler logic than the one described in XAPP855 "16-Channel, Double-Data-Rate LVDS Interface with Per-Channel Alignment of Clock and Data" by Greg Burton, and XAPP860 "16-Channel, DDR LVDS Interface with Real-Time Window Monitoring Application Note" by Brandon Day. However, it depends on not using SERDES width expansion: both positive and negative data SERDES blocks must be independent and in MASTER mode. Instead, we use a natively supported SERDES width and do the width expansion in our own logic: Differential Deserializer, with N to 2N Ratio.
Finally, this bit alignment algorithm also allows dynamic adjustment of the input delay while the SERDES is operating (see XAPP860), which is not implemented here.
Provide the Deserializer with a suitable and constant training word as serial data input. Almost any training word pattern will do as long as it's not clock-like. I like to use a pattern that can be distinguished when permuted or reversed, with a definite 0/1 transition at the start, e.g.: 9'b000100101.
Then, in the clk_main domain, pulse start_alignment high for one cycle,
then wait for done_alignment to pulse high for one cycle, signalling bit
alignment is complete. All other signals are in the clk_parallel domain,
which also runs the Deserializer.
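As a sketch of the handshake described above, the following hypothetical helper generates the one-cycle start_alignment pulse in the clk_main domain and then waits for the one-cycle done_alignment pulse. It is not part of this module: the name Alignment_Request_Example and its request, aligned, and busy signals are assumptions made for illustration only.

```verilog
// Hypothetical helper, NOT part of the original design. It turns a level
// "request" into a one-cycle start_alignment pulse, then waits for the
// one-cycle done_alignment pulse. All signals are in the clk_main domain.
module Alignment_Request_Example
(
    input  wire clk_main,
    input  wire request,          // Assumed: level signal asking for alignment
    input  wire done_alignment,   // From Deserializer_Differential_Bit_Alignment
    output reg  start_alignment,  // To Deserializer_Differential_Bit_Alignment
    output reg  aligned           // Stays high once alignment has completed
);
    reg busy;                     // High while an alignment is in progress

    initial begin
        start_alignment = 1'b0;
        aligned         = 1'b0;
        busy            = 1'b0;
    end

    always @(posedge clk_main) begin
        start_alignment <= 1'b0;  // Default, so the pulse lasts one cycle
        if ((request == 1'b1) && (busy == 1'b0) && (aligned == 1'b0)) begin
            start_alignment <= 1'b1;
            busy            <= 1'b1;
        end
        if ((busy == 1'b1) && (done_alignment == 1'b1)) begin
            busy    <= 1'b0;
            aligned <= 1'b1;
        end
    end
endmodule
```

Tracking busy in the requester avoids sending a second start pulse during an alignment in progress, which the module would drop anyway.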
Assume a constant stream of a single constant training word of width equal to the Deserializer output width. Each deserialized word is signalled by a ready/valid handshake, which is also the step forward signal for the alignment process.
We have two input delay chains, one for the positive (P) Deserializer and
one for the negative (N) Deserializer. Initialize the P delay at tap 0 and
the N delay at tap TAP_OFFSET (2 is a good value).
This means the N Deserializer sees a bit from TAP_OFFSET tap delays ago
relative to the P Deserializer seeing the current bit. We can imagine this
as the N Deserializer sampling a bit TAP_OFFSET tap delays earlier in the
data word. Thus, incrementing the P and N delays together moves the
sampling points backwards in a data word we imagine to be fixed.
At each step, we compare the P and N deserialized words. Assuming we begin in a stable area where the P and N words match, as we move backwards the N Deserializer will hit the transition area between stable data first, causing the N and P Deserializer outputs to mismatch. The P Deserializer will be the last one to exit the transition area, at which point the P and N Deserializer outputs match again.
If we happen to begin in an unstable area, where the P and N outputs mismatch, the logic remains the same: increment P and N taps until the P and N Deserializer outputs match again.
After we find the first stable location after a transition area, we initialize a counter of valid taps (V) to zero. We then increment the P and N delays, and the V counter, until the Deserializer outputs no longer match, signalling the end of the stable area.
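At this point, we now have V+1 valid taps (equal to P plus the one tap
before the N tap). We can then re-set the P and N delays to the nearest
middle of the stable area with Pmid = Nmid = P - (V >> 1) - (TAP_OFFSET - 2).
This matches the tap_aligned calculation in the code: P has already moved to
the end of the stable area, so we step back by half of V.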
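To make the arithmetic concrete, here is a minimal standalone sketch of the midpoint calculation. It follows the tap_aligned expression in the module code, which subtracts half the stable tap count from the current P tap. The module name Tap_Midpoint_Example and the OFFSET_ADJUST constant are assumptions for illustration only.

```verilog
// Illustrative sketch only: recomputes the aligned tap value the same way
// as the tap_aligned expression in the module, as a standalone block.
module Tap_Midpoint_Example
#(
    parameter TAP_OFFSET        = 2,
    parameter TAP_COUNTER_WIDTH = 5
)
(
    input  wire [TAP_COUNTER_WIDTH-1:0] tap_p_current,    // P tap at the end of the stable area
    input  wire [TAP_COUNTER_WIDTH-1:0] stable_tap_count, // V, the counted stable taps
    output reg  [TAP_COUNTER_WIDTH-1:0] tap_aligned       // Pmid = Nmid
);
    // Correction for TAP_OFFSET values other than 2, truncated to the tap width.
    localparam [TAP_COUNTER_WIDTH-1:0] OFFSET_ADJUST = TAP_OFFSET - 2;

    // All operands are unsigned and of the same width, so the result wraps
    // around modulo 2**TAP_COUNTER_WIDTH.
    always @(*) begin
        tap_aligned = tap_p_current - (stable_tap_count >> 1) - OFFSET_ADJUST;
    end
endmodule
```

With the default parameters, P = 20 and V = 8 give 20 - 4 - 0 = 16. With P = 2 and V = 8, the 5-bit subtraction wraps around to tap 30.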
This algorithm does assume the total jitter is not so bimodally distributed
that a stable area of width greater than TAP_OFFSET delay taps exists
inside the bit transition areas.
All tap counts use their full bit widths and will correctly wrap around. This means the algorithm will keep searching until it finds the eye, even through a wraparound.
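The search described above, including the wraparound, can be sketched as an idealized simulation-only model. This is not the module's implementation. It assumes one tap step per deserialized word with no latency, a 5-bit tap counter, and at least one matching and one mismatching tap. It does not model the shortcut taken when all taps are stable. The name Eye_Search_Example and the match_at_tap input are hypothetical.

```verilog
// Idealized behavioural model, for simulation only. Bit i of match_at_tap is
// high when the P and N words would match with the P delay at tap i (and the
// N delay TAP_OFFSET taps later). ASSUMES both 0 and 1 bits are present.
module Eye_Search_Example
#(
    parameter TAP_OFFSET = 2
)
(
    input  wire [31:0] match_at_tap,
    output wire [4:0]  tap_aligned
);
    localparam [1:0] FIND_FIRST  = 2'd0; // In a stable area, looking for a transition
    localparam [1:0] THRU_FIRST  = 2'd1; // In a transition area, looking for the eye
    localparam [1:0] FIND_SECOND = 2'd2; // In the eye, counting stable taps
    localparam [1:0] DONE        = 2'd3;

    localparam [4:0] OFFSET_ADJUST = TAP_OFFSET - 2;

    function [4:0] search;
        input [31:0] match;
        integer   step;
        reg [1:0] state;
        reg [4:0] p;      // P tap, wraps around from 31 to 0
        reg [4:0] v;      // Count of stable taps
        begin
            search = 5'd0;
            p      = 5'd0;
            v      = 5'd0;
            state  = match[0] ? FIND_FIRST : THRU_FIRST;
            // 96 steps bound the three phases of at most 32 steps each.
            for (step = 0; step < 96; step = step + 1) begin
                case (state)
                    FIND_FIRST: begin
                        if (match[p]) p = p + 5'd1;
                        else          state = THRU_FIRST;
                    end
                    THRU_FIRST: begin
                        if (!match[p]) p = p + 5'd1;
                        else           state = FIND_SECOND;
                    end
                    FIND_SECOND: begin
                        if (match[p]) begin
                            p = p + 5'd1;
                            v = v + 5'd1;
                        end
                        else begin
                            search = p - (v >> 1) - OFFSET_ADJUST;
                            state  = DONE;
                        end
                    end
                    default: begin
                    end
                endcase
            end
        end
    endfunction

    assign tap_aligned = search(match_at_tap);
endmodule
```

For example, with mismatches only at taps 20 to 23 (match_at_tap = 32'hFF0FFFFF), the model passes the transition, counts 28 stable taps from tap 24 through the wraparound up to tap 19, and returns tap 6.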
There is no guarantee of finding the minimum working tap value, only a working one. This means you must assume the worst case jitter possibly introduced by the IDELAY2 tap chain.
`default_nettype none
module Deserializer_Differential_Bit_Alignment
#(
parameter TAP_OFFSET = 2, // How much the initial N IDELAY lags the P IDELAY for bit alignment.
parameter WORD_WIDTH = 12, // Parallel data width
// Do not set at instantiation, except in Vivado IPI
parameter TAP_COUNTER_WIDTH = 5 // Hardcoded to match IDELAY2 hardware. See UG471.
)
(
// clk_parallel domain, for Deserializer data and control
input wire clk_parallel,
input wire reset_parallel,
input wire datain_parallel_valid, // There is a handshake, but no slack for stalling!
output reg datain_parallel_ready, // Must always be ready before valid!
input wire [WORD_WIDTH-1:0] datain_p_parallel, // Deserialized positive data, framed by datain_parallel_valid
input wire [WORD_WIDTH-1:0] datain_n_parallel, // Deserialized negative data, framed by datain_parallel_valid
input wire [TAP_COUNTER_WIDTH-1:0] tap_p_current, // Current value of delay tap
output reg [TAP_COUNTER_WIDTH-1:0] tap_p_load_value, // New value of delay tap
output reg tap_p_load, // Load new delay tap value
input wire [TAP_COUNTER_WIDTH-1:0] tap_n_current, // Current value of delay tap
output reg [TAP_COUNTER_WIDTH-1:0] tap_n_load_value, // New value of delay tap
output reg tap_n_load, // Load new delay tap value
// System control signals in clk_main domain
input wire clk_main, // General logic clock
input wire start_alignment, // Preferably a one-cycle pulse
output wire done_alignment // Pulsed high means deserializer data is bit aligned
);
localparam TAP_ZERO = {TAP_COUNTER_WIDTH{1'b0}};
localparam TAP_ONE = {{TAP_COUNTER_WIDTH-1{1'b0}},1'b1};
localparam TAP_TWO = {{TAP_COUNTER_WIDTH-2{1'b0}},2'b10};
initial begin
datain_parallel_ready = 1'b1; // Always ready (no backpressure possible)
tap_p_load_value = TAP_ZERO;
tap_p_load = 1'b0;
tap_n_load_value = TAP_ZERO;
tap_n_load = 1'b0;
end
We must have the bit alignment logic work in the Deserializer clock domain since we cannot have the extra latency of passing the Deserializer data into the main clock domain without greatly complicating the state machine. (We could not tell if a new data value was one affected by the latest change in tap delay.)
Transfer a pulse signalling the start of training into the Deserializer clock domain. A pulse sent during a bit alignment in progress is lost and has no effect.
wire start_alignment_deserializer;
CDC_Pulse_Synchronizer_2phase
#(
.CDC_EXTRA_DEPTH (0)
)
start_alignment_transfer
(
.sending_clock (clk_main),
.sending_pulse_in (start_alignment),
// verilator lint_off PINCONNECTEMPTY
.sending_ready (),
// verilator lint_on PINCONNECTEMPTY
.receiving_clock (clk_parallel),
.receiving_pulse_out (start_alignment_deserializer)
);
Transfer the pulse signalling the end of bit alignment training from the Deserializer clock domain into the main system clock domain.
reg done_alignment_deserializer = 1'b0;
CDC_Pulse_Synchronizer_2phase
#(
.CDC_EXTRA_DEPTH (0)
)
done_alignment_transfer
(
.sending_clock (clk_parallel),
.sending_pulse_in (done_alignment_deserializer),
// verilator lint_off PINCONNECTEMPTY
.sending_ready (),
// verilator lint_on PINCONNECTEMPTY
.receiving_clock (clk_main),
.receiving_pulse_out (done_alignment)
);
Check if the Deserializer positive and negative polarity output data
differ. Then latch it while the data is valid so we always know what the
last match state was. This way we don't have to synchronize other events
to datain_parallel_valid, which simplifies the later control logic.
reg deserializer_outputs_match = 1'b0;
always @(*) begin
deserializer_outputs_match = (datain_p_parallel == ~datain_n_parallel);
end
wire deserializer_outputs_match_latched;
Register
#(
.WORD_WIDTH (1),
.RESET_VALUE (1'b0)
)
deserializer_outputs_latest_state
(
.clock (clk_parallel),
.clock_enable (datain_parallel_valid == 1'b1),
.clear (reset_parallel),
.data_in (deserializer_outputs_match),
.data_out (deserializer_outputs_match_latched)
);
Count the number of taps in the stable data area.
reg stable_tap_count_increment = 1'b0;
reg stable_tap_count_load = 1'b0;
wire [TAP_COUNTER_WIDTH-1:0] stable_tap_count;
Counter_Binary
#(
.WORD_WIDTH (TAP_COUNTER_WIDTH),
.INCREMENT (TAP_ONE),
.INITIAL_COUNT (TAP_ZERO)
)
stable_tap_counter
(
.clock (clk_parallel),
.clear (reset_parallel),
.up_down (1'b0), // 0/1 --> up/down
.run (stable_tap_count_increment),
.load (stable_tap_count_load),
.load_count (TAP_ZERO),
.carry_in (1'b0),
// verilator lint_off PINCONNECTEMPTY
.carry_out (),
.carries (),
.overflow (),
// verilator lint_on PINCONNECTEMPTY
.count (stable_tap_count)
);
Calculate the next tap values and the final aligned tap value.
Normally I'd use Adder_Subtractor_Binary modules, but here we know all numbers are unsigned and of the same width, no carry in/out is needed, and wrap-around is expected and desired. There are no corner-cases. So let the CAD tool synthesize the math here.
reg [TAP_COUNTER_WIDTH-1:0] tap_p_next = TAP_ZERO;
reg [TAP_COUNTER_WIDTH-1:0] tap_n_next = TAP_ZERO;
reg [TAP_COUNTER_WIDTH-1:0] tap_aligned = TAP_ZERO;
always @(*) begin
tap_p_next = tap_p_current + TAP_ONE;
tap_n_next = tap_n_current + TAP_ONE;
tap_aligned = tap_p_current - (stable_tap_count >> 1) - (TAP_OFFSET [TAP_COUNTER_WIDTH-1:0] - TAP_TWO);
end
State Logic
localparam STATE_WIDTH = 2;
localparam [STATE_WIDTH-1:0] STATE_IDLE = 'd0;
localparam [STATE_WIDTH-1:0] STATE_FIND_FIRST = 'd1;
localparam [STATE_WIDTH-1:0] STATE_THRU_FIRST = 'd2;
localparam [STATE_WIDTH-1:0] STATE_FIND_SECOND = 'd3;
wire [STATE_WIDTH-1:0] state;
reg [STATE_WIDTH-1:0] state_next = STATE_IDLE;
Register
#(
.WORD_WIDTH (STATE_WIDTH),
.RESET_VALUE (STATE_IDLE)
)
state_reg
(
.clock (clk_parallel),
.clock_enable (1'b1),
.clear (reset_parallel),
.data_in (state_next),
.data_out (state)
);
Datapath Transformations
wire all_taps_stable; // See Pulse_Divider below. Pulses high when all 2**TAP_COUNTER_WIDTH taps have been tried without a transition found.
reg init_taps = 1'b0; // Load both P and N IDELAY with the start delay tap values: P is 0 and N is TAP_OFFSET.
reg init_find_first = 1'b0; // Starting from a stable area, start looking for the start of the first data transition.
reg init_thru_first = 1'b0; // Starting from a transition area, start looking for the end of that first data transition area.
reg finding_first = 1'b0; // Currently in stable area, keep incrementing taps.
reg none_first = 1'b0; // From inside a stable area, reach the end of possible taps because there is NO transition area (perfect P/N alignment)
reg found_first = 1'b0; // From inside a stable area, found the start of the first data transition.
reg exiting_first = 1'b0; // Currently in a transition area, keep incrementing taps.
reg exited_first = 1'b0; // From inside the first transition area, found the start of the stable area. This is one end of the eye.
reg finding_second = 1'b0; // Currently in stable area, keep incrementing taps.
reg none_second = 1'b0; // From inside a stable area, reach the end of possible taps because there is NO transition area (perfect P/N alignment)
reg found_second = 1'b0; // From the stable area, found the start of the second transition. This is the other end of the eye.
always @(*) begin
init_taps = (state == STATE_IDLE) && (start_alignment_deserializer == 1'b1);
init_find_first = (state == STATE_IDLE) && (deserializer_outputs_match_latched == 1'b1); // latch, so no sync to valid needed
init_thru_first = (state == STATE_IDLE) && (deserializer_outputs_match_latched == 1'b0);
finding_first = (state == STATE_FIND_FIRST) && (deserializer_outputs_match == 1'b1) && (datain_parallel_valid == 1'b1);
none_first = (finding_first == 1'b1) && (all_taps_stable == 1'b1);
found_first = (state == STATE_FIND_FIRST) && (deserializer_outputs_match == 1'b0) && (datain_parallel_valid == 1'b1);
exiting_first = (state == STATE_THRU_FIRST) && (deserializer_outputs_match == 1'b0) && (datain_parallel_valid == 1'b1);
exited_first = (state == STATE_THRU_FIRST) && (deserializer_outputs_match == 1'b1) && (datain_parallel_valid == 1'b1);
finding_second = (state == STATE_FIND_SECOND) && (deserializer_outputs_match == 1'b1) && (datain_parallel_valid == 1'b1);
none_second = (finding_second == 1'b1) && (all_taps_stable == 1'b1);
found_second = (state == STATE_FIND_SECOND) && (deserializer_outputs_match == 1'b0) && (datain_parallel_valid == 1'b1);
end
Signal when all possible taps have been tried. If we reach that point without a transition, then the alignment is already perfect (or close enough), so we use the signal to skip searching for transition regions. The system will then naturally find the middle tap as the best one.
localparam [TAP_COUNTER_WIDTH-1+1:0] TAP_COUNT = 2**TAP_COUNTER_WIDTH;
Pulse_Divider
#(
.WORD_WIDTH (TAP_COUNTER_WIDTH+1),
.INITIAL_DIVISOR (TAP_COUNT)
)
all_taps_stable_detector
(
.clock (clk_parallel),
.restart (1'b0),
.divisor (TAP_COUNT),
.pulses_in (finding_first | finding_second),
.pulse_out (all_taps_stable),
// verilator lint_off PINCONNECTEMPTY
.div_by_zero ()
// verilator lint_on PINCONNECTEMPTY
);
State Transitions
always @(*) begin
state_next = init_taps && init_find_first ? STATE_FIND_FIRST : state;
state_next = init_taps && init_thru_first ? STATE_THRU_FIRST : state_next;
state_next = found_first ? STATE_THRU_FIRST : state_next;
state_next = none_first ? STATE_FIND_SECOND : state_next;
state_next = exited_first ? STATE_FIND_SECOND : state_next;
state_next = found_second ? STATE_IDLE : state_next;
state_next = none_second ? STATE_IDLE : state_next;
end
Control Signals
always @(*) begin
tap_p_load_value = init_taps ? TAP_ZERO : tap_p_next;
tap_p_load_value = found_second | none_second ? tap_aligned : tap_p_load_value;
tap_p_load = init_taps | finding_first | exiting_first | finding_second | found_second | none_second;
tap_n_load_value = init_taps ? TAP_OFFSET [TAP_COUNTER_WIDTH-1:0] : tap_n_next;
tap_n_load_value = found_second | none_second ? tap_aligned : tap_n_load_value;
tap_n_load = init_taps | finding_first | exiting_first | finding_second | found_second | none_second;
stable_tap_count_load = exited_first | none_first;
stable_tap_count_increment = finding_second;
done_alignment_deserializer = found_second | none_second;
end
endmodule