Basic Clock Domain Crossing (CDC)

from FPGA Resources by GateForge Consulting Ltd.

You can design plain synchronous logic and be cocksure it'll work (and you'd be right)...or you can design Clock Domain Crossing (CDC) logic, and always be full of doubt. CDC logic is a well-understood but subtle design problem. There are many ways to get it slightly wrong, causing intermittent failures, or to end up with a design that complicates things and reduces performance (as I discovered here).

I'll explain a basic CDC synchronizer, how it works and its limits, the design of a custom bit of CDC logic connecting a slow external microcontroller unit (MCU) bus to a fast FPGA, and the reverse case of passing a fast pulse into a slow clock domain.

The MCU bus has a synchronous protocol without wait states, which means that reads and writes take a fixed number of MCU clock cycles and you have to send/receive data at specific cycles in the transfer. The MCU and FPGA clocks are completely asynchronous, though the FPGA clock is always the faster one. Passing data across clock domains is a given, but the extra challenge here is passing the MCU clock also, so the FPGA can keep count of the MCU clock cycles despite operating on a completely unrelated, faster clock.

The Basics

A synchronizer forms the core of any CDC solution: a chain of two closely placed registers, with no logic between them, and clocked by the receiving clock. The input register receives a bit synchronized to a sending clock, and the output register sends a version of that bit synchronized to the receiving clock. The duration between bit transitions at the output may be of a different length than the original bit, with a new duration of some multiple of the receiving clock period.

// A basic Clock Domain Crossing synchronizer

// This can only ever be correct for 1 bit.
// DO NOT MAKE IT WORD-WIDE.

module cdc_synchronizer
(
    input   wire    data_from,
    input   wire    clock_to,
    output  reg     data_to
);

    // There should never be a need to increase this,
    // unless you are pushing the limits of your hardware.
    localparam DEPTH = 2;

    // Tell Vivado that these reg should be placed together (UG912),
    // and to show up as part of MTBF reports.
    (* ASYNC_REG = "TRUE" *)
    reg [DEPTH-1:0] sync_reg = 0;

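    // Shift the incoming bit down the chain, one stage per receiving clock edge.
    // With DEPTH = 2, this is the same as two explicit register assignments.
    always @(posedge clock_to) begin
        sync_reg <= {sync_reg[DEPTH-2:0], data_from};
    end

    // Combinational output, so use a blocking assignment.
    always @(*) begin
        data_to = sync_reg[DEPTH-1];
    end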

endmodule

Defining Metastability

Normally, digital logic works in a bistable mode: it's either in a high state or low state, glossing over the fact that it takes time to move from one state to the other. But so long as we sample states (e.g.: latch a signal) far from those transitions, the assumption holds and everything remains bistable.

A metastable state is simply any state that is neither high nor low, but in between, like during a transition. And the closer to the mid-point between states a metastable state is, the longer it takes to resolve to a valid state and the less certainty there is which valid state that will be. So the next logic output may remain metastable or resolve too late, a flip-flop may sample the wrong state, and so on... and your system falls apart.

(And it's impossible to design metastability-free circuits. That would imply instantaneous state transitions, which isn't physically possible.)

Latency

Despite having two registers, a synchronizer has an unpredictable latency from the point of view of the sending clock. A few different cases may happen, which will also explain why a synchronizer is necessary:

  1. The receiving clock captures the incoming bit properly, which then exits the synchronizer on the next cycle. Latency: 2 cycles.
  2. The receiving clock captures the incoming bit just before a transition, so that transition will pass through after two more cycles. Latency: 3 cycles.
  3. The receiving clock captures the incoming bit as it transitions, causing the first register output to go metastable, which then settles into the correct post-transition state and exits in the following cycle. Latency: 2 cycles.
  4. The receiving clock captures the incoming bit as it transitions, causing the first register output to go metastable, which then settles into the incorrect pre-transition state (no change). The post-transition bit is properly captured in the next cycle, and exits in the cycle after that. Latency: 3 cycles.
  5. The receiving clock captures the incoming bit as it transitions, causing the first register output to go metastable, which then settles into either the pre-transition or post-transition state, but takes too long. The input of the second register thus samples a metastable bit, causing one of the previous two cases (3,4) to happen, or this case (5) again, at its own output. However, this second metastability is exponentially less likely than the first, and should in practice "never" happen (meaning, once every many, many years on average, swamped by other failure modes).

Thus, depending on the alignment of the sending clock relative to the receiving clock at that particular point in time, the output high and low phases of a toggling bit may have different lengths, though the total period remains the same. Also, the synchronizer keeps a metastable state from propagating into the rest of the logic, where it could otherwise be resolved into an incorrect bit value.

The variable latency of a synchronizer implies an important design rule of CDC: only one bit at a time may ever change when crossing clock domains. If you tried to synchronize two bits in parallel, there would be no guarantee that both bits would always see the same latency through the synchronizer. With only a single bit, the worst case is that its transition is missed, and will get captured in the next receiving clock cycle. This delay causes no errors in itself, and only alters the received bit duration by one receiving clock cycle.

Performance

How fast can we pass a bit through a synchronizer? The input bit needs to remain stable long enough for the receiving clock to be always able to properly capture the bit between two of its transitions (a cycle), else we could miss a pulse (which matters greatly in this design, but may not in others). We can guarantee a proper capture if the minimum time between bit transitions always contains at least 3 edges of the receiving clock (that is, it lasts at least 1.5 receiving clock periods), regardless of clock alignment:

  1. The 3 receiving clock edges are positive-negative-positive. Thus, if the first posedge misses the first bit transition or goes metastable (see above), the second one will capture the bit properly before the second transition.
  2. The 3 receiving clock edges are negative-positive-negative. Since the posedge is squarely in the middle between the bit transitions, the bit is properly sampled.

So, in the worst case of a bit which toggles each sending clock cycle, the receiving clock must run at least 1.5x faster than the sending clock to provide 3 edges between bit transitions. In the particular CDC design I am describing, the worst case is even worse: the fastest toggling bit is the sending clock itself, so each of its high and low phases must last at least 1.5 periods of the receiving clock, meaning the sending clock must be at least 3x slower than the receiving clock.
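Written out as a short worked inequality (the symbols below are notation introduced here for illustration, and the sending clock is assumed to have a 50% duty cycle), the two cases look like this:

```latex
% T_send, T_recv: sending and receiving clock periods; f = 1/T.
% Case 1: a data bit that may toggle once per sending clock cycle.
\begin{align*}
T_{\mathrm{send}} &\ge \tfrac{3}{2}\, T_{\mathrm{recv}}
  &&\Longrightarrow& f_{\mathrm{recv}} &\ge 1.5\, f_{\mathrm{send}}
\end{align*}
% Case 2: the sending clock itself, which toggles twice per cycle.
\begin{align*}
\tfrac{1}{2}\, T_{\mathrm{send}} &\ge \tfrac{3}{2}\, T_{\mathrm{recv}}
  &&\Longrightarrow& f_{\mathrm{recv}} &\ge 3\, f_{\mathrm{send}}
\end{align*}
```

For example, under this idealized rule, which ignores setup, hold, and propagation delays, a 33 MHz sending clock would call for a receiving clock of at least 99 MHz.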

This 1/3 sending/receiving ratio is another way to express the worst case synchronizer latency of 3 cycles explained earlier: 3 receiving clock edges per incoming bit transition means 6 edges total over a cycle (two transitions), which necessarily contain 3 positive edges.

This 1/3 ratio is a stiff penalty, but peripherals are usually slower than core logic, and this approach keeps the peripheral and the core logic in lock-step without any other information needed: the sending clock signal, synchronized to the receiving clock, allows us to count how much time has passed in the peripheral so we can respond to it at just the right moment, as if the core logic was running on the same clock as the peripheral.

But there are shortcomings to this approach: see In Hindsight below.

Slow to Fast Transfer

Given a CDC synchronizer, we can start building a simple circuit to pass a slow clock and its associated data to a faster clock. We first capture all the data into registers driven by the slow clock. These registers would usually be the dedicated I/O registers at each pin of an FPGA, and serve multiple purposes:

Since these data were registered in their original clock domain, they will not be metastable, so we only have to hold them steady long enough to be properly captured by the fast clock.

The slow clock passes through a CDC synchronizer to filter out metastability, and then through a posedge pulse generator to convert the variable duration of the synchronized clock high phase into a single pulse lasting the duration of one fast clock cycle. This pulse then enables the fast clock to register the data after it has been stable for long enough and before it changes again. If we didn't use the pulse generator, the data might change after its first capture by the fast clock and get captured again while the synchronized slow clock bit was still high. We want to capture the data only once, as close as possible to the rising edge of the synchronized slow clock.

Finally, we delay the synchronized slow clock by one cycle to re-align it with the data, so from the fast clock's perspective the data changes at each rising edge of the synchronized slow clock. If we didn't do this, we could not pass the synchronized slow clock through any logic driven by the synchronized data: any gating of that clock while high would create a false rising edge and corrupt our cycle counting.
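The three steps above (synchronize the slow clock, turn its rising edge into a one-cycle enable pulse, and delay the synchronized clock to line it up with the captured data) can be sketched roughly as follows. This is not the original design's code: the module name, port names, and WIDTH parameter are hypothetical, and the sketch assumes data_slow was already registered by the slow clock upstream. As a simplification of this sketch, the one-cycle delay register also serves as the "previous value" register of the pulse generator. Timing margins are not analyzed here.

```verilog
// Sketch only: names are hypothetical, not from the original design.

module slow_to_fast_sketch
#(
    parameter WIDTH = 8
)
(
    input   wire                clock_fast,
    input   wire                clock_slow,         // sampled as a data bit here
    input   wire    [WIDTH-1:0] data_slow,          // assumed registered by clock_slow upstream
    output  reg     [WIDTH-1:0] data_fast,
    output  reg                 clock_slow_delayed  // synchronized slow clock, aligned with data_fast
);

    initial begin
        data_fast          = {WIDTH{1'b0}};
        clock_slow_delayed = 1'b0;
    end

    // Two-stage synchronizer on the slow clock, clocked by the fast clock.
    (* ASYNC_REG = "TRUE" *)
    reg [1:0] sync_reg = 2'b00;

    always @(posedge clock_fast) begin
        sync_reg <= {sync_reg[0], clock_slow};
    end

    wire clock_slow_synced = sync_reg[1];

    // Posedge pulse generator: high for one fast cycle after each
    // rising edge of the synchronized slow clock.
    wire capture = clock_slow_synced & ~clock_slow_delayed;

    always @(posedge clock_fast) begin
        // One-cycle delay, so this output rises on the same fast clock
        // edge at which data_fast takes its new value.
        clock_slow_delayed <= clock_slow_synced;

        // Capture the slow data exactly once per slow clock cycle.
        if (capture) begin
            data_fast <= data_slow;
        end
    end

endmodule
```

In this sketch, logic in the fast domain would treat each rising edge of clock_slow_delayed as "one MCU clock cycle has passed, and data_fast holds that cycle's data".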

Fast to Slow Transfer

Here, I'm only going to deal with one case of the reverse situation where we want to pass a brief fast clock event, such as a pulse, to a slow clock domain. We can't simply use a CDC synchronizer, as the fast pulse may come and go before the slow clock captures it.

So we first convert the fast pulse to a level using a pulse latch, synchronize it to the slow clock, then convert it back to a pulse with a pulse generator, giving us a single-cycle pulse in the slow clock domain. At the same time, the synchronized slow level is re-synchronized back into the fast clock domain and clears the pulse latch. The time it takes to reset the pulse latch puts a limit on the rate of the fast pulses, but since we are talking to a slow device, it's not usually a problem. (I used this for an infrequent trigger signal.)
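A minimal sketch of that loop, under stated assumptions: the module and signal names are hypothetical, pulse_fast is assumed to be high for exactly one fast clock cycle, and the clear is given priority over a new pulse (so pulses arriving while the latch is being reset are dropped, which is the rate limit mentioned above).

```verilog
// Sketch only: names are hypothetical, not from the original design.

module fast_to_slow_pulse_sketch
(
    input   wire    clock_fast,
    input   wire    pulse_fast,     // assumed one fast clock cycle long
    input   wire    clock_slow,
    output  wire    pulse_slow      // one slow clock cycle long
);

    // Pulse latch (fast domain): set by the pulse,
    // cleared once the level has been seen in the slow domain.
    reg  level_fast = 1'b0;
    wire level_returned;

    always @(posedge clock_fast) begin
        if (level_returned) begin
            level_fast <= 1'b0;
        end
        else if (pulse_fast) begin
            level_fast <= 1'b1;
        end
    end

    // Synchronize the level into the slow domain.
    (* ASYNC_REG = "TRUE" *)
    reg [1:0] sync_to_slow = 2'b00;

    always @(posedge clock_slow) begin
        sync_to_slow <= {sync_to_slow[0], level_fast};
    end

    wire level_slow = sync_to_slow[1];

    // Posedge pulse generator in the slow domain.
    reg level_slow_previous = 1'b0;

    always @(posedge clock_slow) begin
        level_slow_previous <= level_slow;
    end

    assign pulse_slow = level_slow & ~level_slow_previous;

    // Re-synchronize the slow level back into the fast domain
    // to clear the pulse latch.
    (* ASYNC_REG = "TRUE" *)
    reg [1:0] sync_to_fast = 2'b00;

    always @(posedge clock_fast) begin
        sync_to_fast <= {sync_to_fast[0], level_slow};
    end

    assign level_returned = sync_to_fast[1];

endmodule
```

The latch stays set until the level has made the full round trip, so each accepted fast pulse should produce exactly one slow pulse, however short the fast pulse is relative to the slow clock period.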

Reducing Latency

Eric Smith (@brouhaha) showed me a nice special case to save a slow clock cycle of latency when synchronizing from a fast to a slow clock domain: drive the second register of the CDC synchronizer on the negative edge of the slow clock. This change gives the synchronizer output half a cycle to reach the next register, before the second rising edge of the slow clock, reducing latency by one cycle.

The reduced time between the first and second synchronizer register acts as if we ran the slow clock twice as fast, and we can verify if the CDC synchronizer will work properly under those conditions. If the slow clock is less than half the speed of the fast clock, it already works by design. Nonetheless, make sure your CAD tool can handle and analyze this special synchronizer, and meet the setup and hold timings on the paths from positive to negative to positive clock edges.
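As a rough illustration of this variant (a sketch with hypothetical names, not code from the original design), the only change from the basic synchronizer is the clock edge used by the second register. Whether your tools place, constrain, and analyze it correctly is exactly the caveat raised above.

```verilog
// Sketch only: names are hypothetical, not from the original design.
// Fast-to-slow synchronizer with the second stage on the falling edge.

module cdc_synchronizer_negedge_sketch
(
    input   wire    data_from,  // from the fast clock domain
    input   wire    clock_to,   // slow receiving clock
    output  reg     data_to
);

    (* ASYNC_REG = "TRUE" *)
    reg sync_first  = 1'b0;

    (* ASYNC_REG = "TRUE" *)
    reg sync_second = 1'b0;

    // First stage: rising edge of the slow clock, as usual.
    always @(posedge clock_to) begin
        sync_first <= data_from;
    end

    // Second stage: falling edge, half a slow cycle later.
    // The output then has the remaining half cycle to reach the
    // next register, which is clocked on the following rising edge.
    always @(negedge clock_to) begin
        sync_second <= sync_first;
    end

    always @(*) begin
        data_to = sync_second;
    end

endmodule
```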

(You could play the same trick on the slow to fast domain crossings, but that might not be reliable enough, and it's not a lot of absolute time saved either.)

In Hindsight...

In hindsight, passing a Gray counter value driven by the sending clock, instead of the sending clock itself, might double the performance of the slow-to-fast circuit, while still keeping the core logic synchronized to the peripheral. It would be a much more complex circuit though.
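What makes a Gray-coded counter a candidate for crossing clock domains is that exactly one bit changes per increment, which fits the "only one bit at a time" rule from earlier. The sketch below shows only the counter in the sending domain, using the standard binary-to-Gray conversion; the names and the WIDTH parameter are hypothetical, and the receiving-side logic (per-bit synchronizers and detecting a changed count) is not shown.

```verilog
// Sketch only: names are hypothetical, not from the original design.

module gray_counter_sketch
#(
    parameter WIDTH = 4
)
(
    input   wire                clock,  // the sending clock
    output  reg     [WIDTH-1:0] gray    // registered, so it changes without glitches
);

    reg  [WIDTH-1:0] binary      = {WIDTH{1'b0}};
    wire [WIDTH-1:0] binary_next = binary + 1'b1;

    initial begin
        gray = {WIDTH{1'b0}};
    end

    always @(posedge clock) begin
        binary <= binary_next;
        // Binary to Gray: XOR each bit with the next-higher bit.
        gray   <= binary_next ^ (binary_next >> 1);
    end

endmodule
```

Registering the Gray value, rather than converting combinationally after the binary register, matters here: a combinational conversion could glitch on several bits while the binary counter settles.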

The slow-to-fast design does the CDC at the front-end of the interface, before any transaction logic. In further hindsight, it would have been much better to do the CDC at the back-end: run the MCU bus transaction logic off the MCU clock, including the cycle counter, and do the CDC once the initial read/write transaction was started. This change would allow the MCU bus to run at full speed (e.g. 90 MHz instead of 33 or 66 MHz), and moves the CDC 1/3 ratio from the clock cycle to the transaction cycle. Each transaction has an extra 3 receiving clock cycles of latency (6 if you are returning data), but runs at a much higher clock rate overall, which is quite a win if other devices are sharing the MCU bus and thus now aren't forced to run slower.

Finally, having an asynchronous read/write transaction take a fixed number of clock cycles complicates the hardware, which must carry the time elapsed across the interface, and complicates the software, which must be adjusted to keep track of that time elapsed. This dependency means that if you change the sender/receiver clock frequency ratio, you have to change the software also. Rather, an asynchronous interface requires a request/acknowledge handshake, which for bus interfaces often takes the form of a WAIT signal asserted by the receiver.


This article only scratches the surface of CDC, and glosses over accounting for propagation delay, setup, and hold times when determining the maximum data rate, or CAD tool issues regarding placement, timing analysis, and MTBF (Mean Time Between Failures) calculations. Designing even simple CDC circuits requires sweating a lot of details!

See Clifford E. Cummings' excellent paper Clock Domain Crossing (CDC) Design & Verification Techniques Using SystemVerilog for a larger and deeper overview of CDC.


fpgacpu.ca