Logic Design Principles

Make the code reflect the design directly: don't make the reader reverse-engineer the design from the implementation.

Small, Regular Modules

Create a set of conceptually small, regular modules which reflect the design, but whose interfaces say as little as possible about their implementation. A modular design means more debugging of the design itself than of its implementation, so bugs become more meaningful and less random. Smaller modules have a narrower context, and so more meaningful code and comments, and fewer places for bugs to happen. These benefits are preserved as you compose modules together into other conceptually small modules, even if they are internally complex (e.g.: Saturating Adder/Subtractor).

Small, reusable modules allow you to construct a design using regular, non-arbitrary design idioms (e.g.: ready/valid handshakes, pulse interfaces, separated state and computation), which constrain the design space away from unexpected corner cases, and further suggest any other small, reusable modules which may not already exist. Arbitrary modules tend to require other arbitrary modules to function together. Regular modules tend to require other regular modules to function together. How do we figure out what's a small, regular module? How do we seed the start of the library of modules from which designs are constructed?

Constants and Optimization

Constants are the first tool to optimize logic and clarify code.

Expect the existence of a logic optimizer in your CAD tool, and make use of it by expressing your logic without case-specific optimizations, then let the constant parameters which define a specific case optimize the logic down to a reduced form. A constant on one input of a 2-input (dyadic) Boolean gate reduces the gate to a constant, a wire, or an inverter, which can then further simplify downstream logic.

Constants allow you to express the meaning of numbers in your code, especially if you can construct the constant from other constants at elaboration time (e.g.: calculating localparam values from module parameters). And using constants allows expressing pattern matching, masking, and bit reductions using a single Verilog idiom, which makes the code easier to design, read, and debug. Define constants near the start of your module, or just before their use if the meaning is quite local.

parameter WORD_WIDTH = 23
...
localparam ALL_ONES = {WORD_WIDTH{1'b1}};
localparam ALL_ZERO = {WORD_WIDTH{1'b0}};
localparam MSB_ONLY = {1'b1,{WORD_WIDTH-1{1'b0}}};
...
always @(*) begin
    and_reduction = (foo == ALL_ONES);
    or_reduction  = (foo != ALL_ZERO);
    negative      = ((foo & MSB_ONLY) != ALL_ZERO);
end

Retiming and Pipelining

Where to pipeline a circuit isn't always obvious, and may change depending on which logic optimizations happen and how the logic is laid out on the FPGA. However, we can rely on the CAD tool's register retiming passes to do the work for us. In general, forward retiming, from input towards output, works better than backwards retiming. There are fewer logical restrictions on forward retiming, and it's the only retiming supported by Vivado's physical optimization passes.

First, we design the circuit as-is, without extra pipeline registers, keeping it readable and general and in its own module. Then, we add one or more registers before the inputs of the module, which is easily done with a simple register pipeline module, and reduces changing the amount of pipelining to a single module parameter, regardless of the implementation of the circuit. Your CAD tool's register retiming optimizations will then spread the registers along the pipeline as needed.
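Such a register pipeline might look like the following sketch (the module name, port names, and parameter names are illustrative, not from an existing library):

```verilog
// Hypothetical example: a parameterized pipeline of simple registers.
// Place one instance before the inputs of your combinational module,
// then let the CAD tool's forward retiming spread the stages as needed.

module register_pipeline
#(
    parameter WORD_WIDTH = 32,
    parameter PIPE_DEPTH = 2    // Change the amount of pipelining here
)
(
    input  wire                  clock,
    input  wire [WORD_WIDTH-1:0] in_data,
    output wire [WORD_WIDTH-1:0] out_data
);

    generate
        if (PIPE_DEPTH == 0) begin : no_pipe
            // No pipelining: a simple pass-through
            assign out_data = in_data;
        end
        else begin : pipe
            reg [WORD_WIDTH-1:0] stage [PIPE_DEPTH-1:0];
            integer i;

            always @(posedge clock) begin
                stage [0] <= in_data;
                for (i = 1; i < PIPE_DEPTH; i = i + 1) begin
                    stage [i] <= stage [i-1];
                end
            end

            assign out_data = stage [PIPE_DEPTH-1];
        end
    endgenerate

endmodule
```

Placing one instance of this module in front of the circuit to be pipelined keeps the circuit itself readable and general, and changing PIPE_DEPTH never alters the surrounding connections.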

Note that there are limits to retiming. For circuits which would need a very deep pipeline (e.g.: a 128-bit adder), the CAD tool may give up and not do as much forward retiming as it could, limiting the performance. You will then have to add pipeline registers manually, or use alternate implementation tradeoffs, such as multi-precision arithmetic.

If there is a loop in your circuit, it will block any registers from being retimed from outside to inside the loop. Instead, place the pipeline registers to be retimed just inside the start of the loop, where they can be retimed forward both into the feedback part of the loop and past the output of the loop. For an example, see the saturating accumulator.

Design Abstractions

Separate Operation from Implementation, and then separate Implementation into Controlpath and Datapath.

Operation

The Operation describes what the module does, how it works, and how you use it, without knowledge of its internals. You write this up as a story of comments at the start of your module, starting with the purpose and adding more detail up to the expected usage and behaviour of each input and output, how to set the module parameters, as well as corner cases and limitations. This story should be enough for another person to understand and use your module without having to look inside it. Writing the Operation first often improves the module design from the start, as you think of new cases to deal with.

Implementation

The Implementation describes how the Operations are done, using functional modules which connect the inputs and outputs described in the Operation. It's the Block Diagram level of description. Given a good set of building block modules, you only need to know the Operation of each module, not its Implementation, which minimizes unintended leakage across abstractions, and so you avoid getting increasingly lost in details as the design gets larger and more complex.

The Implementation also acts as the Operation of the Datapath and Controlpath, which form the two halves of the Implementation. You must keep the Datapath and the Controlpath separate to avoid having to explicitly describe every possible combination of states, inputs, and outputs, which makes for tedious and long code which doesn't describe the design itself, but enumerates its possible behaviours (e.g.: large case statements containing nested conditionals). The reader must then reverse-engineer the design through this extra layer of abstraction, usually while debugging!

Datapath

The Datapath describes which calculations must happen, and where the data flows from module to module. It describes the set of possible computations, but not their entire sequence or conditions. Designing the Datapath first and separately allows you to optimize it to your particular problem and to the underlying FPGA hardware. This is where you solve problems of clock speed, pipelining, arithmetic precision, latency, storage, area, etc... At the end, you have inputs and outputs for computation, and inputs and outputs for control.

It's possible for a Datapath to be too general, with logic functions going unused and a larger and slower implementation than necessary. Data inputs which are set once at run-time and then remain constant are a good indicator of an over-general Datapath. If such an unchanging data input only ever takes a single value, replace it with a module parameter or localparam to enable logic optimization. If it takes one of a few possible values, hardwire those values as constants into a multiplexer driving the data input. This change converts the data input into a control input (the mux selector) with fewer possible values, which results in simpler, optimized logic.
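As an illustrative sketch (the signal names and constant values are hypothetical), a wide data input which only ever carries a couple of run-time constants can be replaced by a hardwired multiplexer:

```verilog
// Before: a full WORD_WIDTH-wide data input, set once at run-time.
// After: the few possible values are hardwired constants, and a narrow
// selector becomes a control input, letting the optimizer shrink the logic.

localparam WORD_WIDTH = 32;

// The only values the former data input ever takes (example values)
localparam [WORD_WIDTH-1:0] OFFSET_A = 32'd1024;
localparam [WORD_WIDTH-1:0] OFFSET_B = 32'd4096;

reg  [WORD_WIDTH-1:0] offset;        // Replaces the old data input
wire                  offset_select; // New, narrower control input

always @(*) begin
    offset = (offset_select == 1'b0) ? OFFSET_A : OFFSET_B;
end
```

Since OFFSET_A and OFFSET_B are elaboration-time constants, the optimizer can reduce any downstream arithmetic on `offset` to logic selected by the single `offset_select` bit.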

Controlpath

The Controlpath describes when and in which order the Datapath calculations must happen, possibly in a data-dependent way depending on the control outputs from the Datapath. This is where you have handshaking, sequencing, internal state, conditions, any data-dependent calculation shortcuts, etc... Note that some of the Controlpath may end up embedded inside the Datapath (e.g.: Skid Buffers), and for simple cases may be almost non-existent (e.g.: data-flow pipelines).

Signalling

Signalling deals with coordinating multiple Datapaths and Controlpaths, or individual Datapath stages. Good signalling design can eliminate small FSMs in the Controlpath, or at least reduce the number of states and transitions. We refer to these signalling systems as "handshakes", since they coordinate agreement between a sender and a receiver.

Pulse Handshake

Pulse handshakes are a natural way to control computations which cannot accept a new input every clock cycle, due either to backwards dependencies in the pipeline (loops), or to an iterative algorithm (e.g.: long division). A pulse handshake takes in a single-cycle pulse to signal new input data, the start of a computation, or a change of state, and returns that same pulse later to signal the data has been accepted, the computation completed, or the state changed. The input pulse is carried alongside the computation, and eventually reaches the output at the same time as the result. It is very easy to do this with a 1-bit pipeline matching the main pipeline's path, or with a counter. Thus, the internal pipeline can be adjusted without changing the interface.
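For example, carrying the input pulse alongside a fixed-latency pipeline might look like this sketch (signal names are hypothetical; `clock` and `input_pulse` are assumed to exist in the surrounding module):

```verilog
// A 1-bit shift register matches the main pipeline's latency, so the
// input pulse re-emerges as the output pulse exactly when the result
// is ready. Adjusting PIPE_DEPTH tracks the internal pipeline without
// changing the interface.

localparam PIPE_DEPTH = 4;  // Must match the main pipeline's latency

reg  [PIPE_DEPTH-1:0] pulse_pipeline = {PIPE_DEPTH{1'b0}};
wire                  output_pulse   = pulse_pipeline [PIPE_DEPTH-1];

always @(posedge clock) begin
    pulse_pipeline <= {pulse_pipeline [PIPE_DEPTH-2:0], input_pulse};
end
```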

However, a pulse handshake limits the throughput to a maximum of one-half the clock frequency. At the limit, the output pulse is emitted on the next cycle after the input pulse, which means we can only send an input pulse every other clock cycle. Simultaneous input and output pulses during the same clock cycle imply combinational logic, and so are redundant, since the clock itself becomes the signalling to the sequential logic connected to this combinational logic. In other words, you have a conventional pipeline which never stops.

A multi-cycle input pulse implies multiple consecutive calculations, up to a certain limit, and thus one or more single- or multi-cycle output pulses, depending on the latency of each computation, which can be data-dependent and possibly complete out of order. Keeping track of how many input pulses have been sent relative to the number of output pulses received creates a credit-based signalling system.

Credit-based Interface

Credit-based signalling extends pulse handshakes to maintain full throughput over a long latency link, where many cycles can pass between sending an input pulse to a remote module and it arriving, and many cycles again before the output pulse returns after being sent by the remote module. This latency can either be pure propagation delay, or from multiple pipeline stages inserted to maintain a high clock frequency. As a tradeoff, the sender and receiver must know the latency between them, or more specifically, the maximum number of handshakes which can be in transit between the sender and receiver.

Abstractly, credit-based signalling involves a sender and a receiver with a pulse handshake between them. However, instead of the sender and receiver being directly connected, there is a number (N) of plain pipeline registers along the input pulse path from the sender to the receiver, and another set of N pipeline registers along the output pulse path from the receiver back to the sender. Thus the total latency added to the pulse handshake, on top of the receiver's latency, is 2N cycles.

The sender maintains a counter initialized with a value of 2N, to account for the additional round-trip latency of the pipelined pulses. Every time an input pulse is sent, the sender decrements the counter by one. Every time an output pulse is received, the sender increments the counter by one. A simultaneous pair of input and output pulses leaves the counter unchanged. This counter keeps track of the credit of the sender. In other words, of how many input pulses it can send before receiving any output pulses.
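The sender's credit counter can be sketched as follows (signal names are hypothetical; `clock`, `input_pulse_sent`, and `output_pulse_received` are assumed to exist in the surrounding module):

```verilog
// Credit counter for the sender: starts full at 2N, decrements on each
// input pulse sent, increments on each output pulse received. The sender
// may only send while it has credit remaining.

localparam N            = 4;  // Pipeline stages in each direction
localparam CREDIT_WIDTH = 4;  // Enough bits to hold 2N

reg  [CREDIT_WIDTH-1:0] credit   = 2*N;
wire                    can_send = (credit != 0);

always @(posedge clock) begin
    case ({input_pulse_sent, output_pulse_received})
        2'b10:   credit <= credit - 1;  // Sent only
        2'b01:   credit <= credit + 1;  // Received only
        default: credit <= credit;      // Both, or neither: unchanged
    endcase
end
```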

The receiver maintains an identical counter to the sender's, except it is initialized to zero. The receiver increments its counter every time an input pulse arrives, and decrements it every time it sends an output pulse back.

Use case: long latency due to distance and pipelining, with a need to maintain full throughput. Upside: the simplest pipelining (no skid buffers). Downside: 2N local buffers for an N-long pipeline, and a dependency between the configuration of the end modules and of the pipeline.

Ready/Valid Handshake

Options: Carloni buffering (which pipelines, but does not absorb stalls), and FIFO buffering (which absorbs stalls, but does not pipeline, being spatially local).

4-phase and 2-phase Asynchronous Handshakes

Pulse, Credit, and Ready/Valid Handshakes

These handshakes form a hierarchy: pulses, credit, ready/valid, 4-phase, and 2-phase (TODO: chapter to introduce/explain them, why/where to use them, etc...).

Ready/valid handshakes allow for action every clock cycle if possible, and compose well. Pulses necessarily limit your maximum throughput by half. They are a layer of abstraction.

Pulse generators and pulse latches remove many FSMs and can fix race conditions (ABA problems).

Since the `clear` input, a synchronous reset, is a signal that changes the output, and is derived from some control path logic, it must trigger an "output updated" signal in pulse interfaces like any other `load` or `valid` input signal, though only as a single pulse, since the output only changes once even under a steady `clear`. This also implies that `clear` must be pipelined, separately and in parallel if necessary, and not broadcast to all units, which means similar latency, and better routing and distribution.

Amending that last point: don't signal an "output updated" value, as consecutive calculations can have identical results (so a Word Change Detector is no good here), and multiple commands can update a given output. Instead, signal when a command is done. That "done" signal also denotes when the given output is valid to sample. Internally latch the command so we know which one to report as "done" if multiple commands update one internal object (itself with a "done" signal). This does not allow concurrency or queueing by itself.

Following up on the last point: using a "done" signal on commands also separates the control and data paths.

A possible general design principle: connect datapaths with ready/valid handshakes at their boundaries, and enumerate the ready/valid actions as a 2-input truth table, so the result is computed by one of the 16 dyadic Boolean functions.

Implementation Issues

Make signal names evolve with the computation, in lexicographic order, by adding or changing suffixes. Thus searching in text is easier, and waveform displays will be more organized. This naming scheme also makes it obvious when your code has a harmony, which is usually a sign that the implementation reflects the design.

Don't generate or divide clocks: generate enables for the desired rate, synchronous to the main clock.

Use non-blocking assignments for testbench logic, and blocking assignments for the testbench clock, to avoid races. Watch out for implicitly clocked always blocks: a blocking assignment there will work *nearly every time*, until a race condition happens in the simulation and impossible logic happens.

If a module contains latches which will hold state, then it must have a reset/clear. Otherwise, a reset of the surrounding logic would result in an inconsistent system (e.g.: a valid line from a latch staying high after reset).

CDC implies asynchrony, which means no notion of time, only of sequence, hence 2-phase/4-phase handshakes must be present inside non-trivial CDC designs.
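The point about generating enables instead of divided clocks can be sketched like this (signal names and the divisor value are illustrative; `clock` is assumed to exist in the surrounding module):

```verilog
// Instead of dividing the clock, generate a periodic enable pulse at
// the desired rate, synchronous to the main clock: here, one enable
// pulse every DIVISOR cycles. Downstream logic stays on the main clock
// and simply gates its updates with `enable`.

localparam DIVISOR       = 10;
localparam COUNTER_WIDTH = 4;   // Enough bits to count to DIVISOR-1

reg [COUNTER_WIDTH-1:0] cycle_count = 0;
reg                     enable      = 1'b0;

always @(posedge clock) begin
    if (cycle_count == DIVISOR-1) begin
        cycle_count <= 0;
        enable      <= 1'b1;    // Single-cycle enable pulse
    end
    else begin
        cycle_count <= cycle_count + 1;
        enable      <= 1'b0;
    end
end
```

This keeps the whole design in one clock domain, avoiding the skew and timing-analysis problems of logic-generated clocks.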

Scaling

As physical area increases, if there are any cycles in the dependencies (i.e.: not a straight pipeline), then area becomes a limiting factor for speed. Also, distributed control signals can have too far to travel.

As a design gets large, routing may not get as good a result (congestion, too much effort), and so critical paths become bad routes, and bit-reductions that cannot be pipelined, despite using carry-chains for faster calculations.

Design paradigm: give all modules ready/valid handshakes at input/output/control, to allow half-buffers or skid buffers to be added as necessary to avoid the above timing problems, without altering functionality. Are we at Kahn Networks now?

Following up on the above: can we make it so the handshake logic adds no latency, or optimizes away when possible (combinational paths without buffering)?