from FPGA Design Elements by Charles Eric LaForest, PhD., GateForge Consulting Ltd.
You can design plain synchronous logic driven by a single clock and be certain it'll work (and you'd be right)...or you can design Clock Domain Crossing (CDC) logic, and always be full of doubt. CDC logic is a well-understood, but subtle design problem. There are many ways to get it slightly wrong, causing intermittent failures, or ending up with a design that complicates things and reduces performance.
Normally, digital logic works in a bistable mode: it's either in a high state or a low state, glossing over the fact that it takes time to move from one state to the other. But so long as we sample states (e.g., latch a signal) far from those transitions, the assumption holds and everything remains bistable.
A metastable state is simply any state that is neither high nor low, but in between, like during a transition. And the closer to the mid-point between states a metastable state is, the longer it takes to resolve to a valid state and the less certainty there is which valid state that will be. So the next logic output may remain metastable or resolve too late, a flip-flop may sample the wrong state, and so on... and your system falls apart.
Metastability can happen at clock domain crossings since the sending clock and receiving clock are (usually) asynchronous to each other, or at least it is a lot simpler and more robust to assume they are independent in that way. Thus the receiving clock might sample a signal from the sending clock domain too close to one of its transitions, causing metastability.
(And it's impossible to design metastability-free circuits. That would imply instantaneous state transitions, which isn't physically possible.)
A CDC Bit Synchronizer forms the core of any CDC solution: a chain of two (or more) closely placed registers, with no logic between them, and clocked by the receiving clock. The input register receives a bit synchronized to a sending clock, and the output register sends a version of that bit synchronized to the receiving clock. The duration between bit transitions at the output may be of a different length than the original bit, with a new duration of some multiple of the receiving clock period.
(You can find several more ready-to-use CDC modules here.)
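The register chain described above can be sketched as a purely behavioral model. This is only an illustration of the logical shifting behavior, not a hardware description: it does not model metastability, placement, or timing, and the class name `BitSynchronizer` and its `depth` parameter are hypothetical names chosen for this sketch.

```python
class BitSynchronizer:
    """Behavioral model of a chain of registers clocked by the receiving clock."""

    def __init__(self, depth=2, initial=0):
        if depth < 2:
            raise ValueError("a synchronizer needs at least two registers")
        self.stages = [initial] * depth

    def clock(self, bit_in):
        """Apply one receiving clock edge and return the output register's new value."""
        # The input register samples bit_in; each later register samples the
        # previous register's old value. There is no logic between registers.
        self.stages = [bit_in] + self.stages[:-1]
        return self.stages[-1]


# One sampled input value per receiving clock edge.
sync = BitSynchronizer(depth=2)
outputs = [sync.clock(b) for b in [0, 1, 1, 1, 0, 0, 0]]
print(outputs)  # [0, 0, 1, 1, 1, 0, 0]
```

In this idealized model the output is simply the sampled input delayed by one receiving clock edge. In real hardware, which receiving edge first captures a transition depends on the clock alignment, which is where the unpredictable latency comes from.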
Despite having two registers, a synchronizer has an unpredictable latency from the point of view of the sending clock. A few different cases may happen, which will also explain why a synchronizer is necessary:
Thus, depending on the alignment of the sending clock relative to the receiving clock at that particular point in time, the output high and low phases of a toggling bit may have different lengths, though the total period remains the same. Also, the synchronizer prevents a metastable state from propagating into the rest of the logic, where it could cause an incorrect bit value to be created.
The variable latency of a synchronizer implies an important design rule of CDC: only one bit at a time may ever be synchronized across a given clock domain crossing. If you tried to synchronize two bits in parallel, or the same bit twice in parallel, there would be no guarantee that both bits would always see the same latency through the synchronizer. With only a single bit, the worst case is that its transition is missed, and will get captured in the next receiving clock cycle. This delay causes no errors in itself, and only alters the received bit duration by one receiving clock cycle.
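A small enumeration may make the one-bit rule more concrete. The sketch below assumes a simplified model in which, on the receiving clock edge nearest a transition, each changing bit is independently captured as either its old or its new value. The function name `possible_received_values` is hypothetical.

```python
from itertools import product


def possible_received_values(old_bits, new_bits):
    """Enumerate what the receiver may see on the edge nearest the transition.

    Assumption: each bit that changes is captured independently as either its
    old or its new value on that edge; unchanged bits are captured as-is.
    """
    choices = [(o,) if o == n else (o, n) for o, n in zip(old_bits, new_bits)]
    return sorted(set(product(*choices)))


# Two bits changing together, each through its own synchronizer.
two_bits = possible_received_values((0, 0), (1, 1))
# A single bit through a single synchronizer.
one_bit = possible_received_values((0,), (1,))

print(two_bits)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(one_bit)   # [(0,), (1,)]
```

With two bits, the receiver may briefly observe (0, 1) or (1, 0), values the sender never produced. With one bit, every possible observation is either the old value or the new one, so a late capture only delays the transition.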
How fast can we pass a bit through a synchronizer? The input bit needs to remain stable long enough for the receiving clock to always be able to properly capture the bit between two of its transitions (a cycle), else we could miss a pulse. We can guarantee a proper capture if the minimum time between bit transitions is equal to or greater than the time spanned by 3 edges of the receiving clock, regardless of clock alignment:
So, in the worst case of a bit which toggles each sending clock cycle, the receiving clock must run at least 1.5x faster to provide 3 edges between bit transitions.
This works out to a 1/3 ratio between the toggle frequency of the bit and the frequency of the receiving clock, which is another way to express the worst-case synchronizer latency of 3 cycles explained earlier: 3 receiving clock edges per incoming bit transition means 6 edges total over one full cycle of the bit (two transitions), which necessarily contain 3 positive edges.
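The two figures above (1.5x and 1/3) can be checked with a little arithmetic. The sketch below assumes that "3 edges" counts both rising and falling edges of the receiving clock, so that 3 edges span 1.5 receiving clock periods, and it reads the 1/3 figure as the rate of full bit cycles relative to the receiving clock. The function name and the example frequencies are hypothetical.

```python
from fractions import Fraction


def min_receiving_clock_hz(sending_clock_hz, sending_cycles_per_transition=1):
    """Lowest receiving clock frequency that fits 3 edges between bit transitions.

    sending_cycles_per_transition: the minimum number of sending clock cycles
    for which the bit is held stable (1 means it may toggle every cycle).
    """
    # 3 receiving clock edges (both polarities) span 1.5 receiving periods,
    # so the stable time must be at least 1.5 / f_receiving.
    return Fraction(3, 2) * Fraction(sending_clock_hz) / sending_cycles_per_transition


f_send = 100_000_000  # hypothetical 100 MHz sending clock
f_recv = min_receiving_clock_hz(f_send)
print(float(f_recv))  # 150000000.0

# A bit toggling every sending cycle completes a full cycle every 2 sending cycles.
bit_rate = Fraction(f_send, 2)
print(bit_rate / f_recv)  # 1/3

# Holding the bit for 4 sending cycles relaxes the requirement.
print(float(min_receiving_clock_hz(f_send, 4)))  # 37500000.0
```

This ignores setup, hold, and propagation delays, so a real design would need some margin beyond these minimums.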
This 1/3 ratio is a stiff penalty, but peripheral clocks are usually slower than core logic clocks. The penalty can also be spread out over multiple bits: receive multiple bits in one clock domain, then pass the resulting whole word across clock domains (directly, without synchronization!), with a single "valid" bit being synchronized across clock domains and acting as a latch trigger in the receiving clock domain.
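The word-plus-valid scheme can be sketched behaviorally as follows. This is a minimal model under stated assumptions: the sender sets up the data word no later than it raises "valid" and holds both stable until the receiver has latched the word, and the receiver latches on the rising edge of the synchronized "valid". The class name `WordReceiver` and the edge-detection detail are assumptions of this sketch, not something the text above specifies.

```python
class WordReceiver:
    """Receiving-domain logic: synchronize 'valid', then latch the word directly."""

    def __init__(self):
        self.sync = [0, 0]    # two-register synchronizer for the valid bit
        self.prev_valid = 0   # previous synchronized valid, for edge detection
        self.word = None      # last latched word

    def clock(self, valid_in, data_in):
        """Apply one receiving clock edge. Returns True if a word was latched."""
        synced_valid = self.sync[-1]
        rising = synced_valid == 1 and self.prev_valid == 0
        if rising:
            # The data bus is sampled directly, without a synchronizer. This is
            # only safe because the sender has held it stable for several cycles.
            self.word = data_in
        self.prev_valid = synced_valid
        self.sync = [valid_in, self.sync[0]]
        return rising


# (valid, data) as seen at each receiving clock edge.
seen = [(0, 0x00), (1, 0xA5), (1, 0xA5), (1, 0xA5), (1, 0xA5), (0, 0xA5), (0, 0x00)]
rx = WordReceiver()
latched = [rx.clock(v, d) for v, d in seen]
print(latched.index(True), hex(rx.word))  # 3 0xa5
```

Only the single valid bit pays the synchronizer latency. The word itself crosses unsynchronized, because by the time the synchronized valid bit rises, the data has been stable for several receiving clock cycles.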
There is an implicit assumption in the above discussion about passing a bit through a synchronizer: the receiving clock edges must coincide with the incoming transition at most once. Otherwise, in cases where the incoming transition is very slow relative to the receiving clock, the receiving clock will sample the incoming transition multiple times, which greatly increases the chance of a metastable event. This will manifest as multiple bit transitions in the receiving clock domain. In these cases, the synchronizer must be followed by a Debouncer.
For similar reasons, the input of a synchronizer must come directly from a register in the sending clock domain, without any logic between that register and the synchronizer. Otherwise, that final logic in the sending clock domain might converge multiple paths of logic, and thus one or more glitches may appear at its output due to different path or logic cell delays. Those glitches may get sampled by the synchronizer and also result in multiple bit transitions in the receiving clock domain. Again, if this case isn't avoidable, you must use a Debouncer.
This article only scratches the surface of CDC, and glosses over accounting for propagation delay, setup, and hold times when determining the maximum data rate, or CAD tool issues regarding placement, timing analysis, and MTBF (Mean Time Between Failure) calculations. Designing even simple CDC circuits requires sweating a lot of details!
See Clifford E. Cummings' excellent paper Clock Domain Crossing (CDC) Design & Verification Techniques Using SystemVerilog for a larger and deeper overview of CDC.