Logic Design Principles

All these principles of logic design are based on a root principle: don't make the reader have to reverse-engineer the design from the implementation. So make the code reflect the design directly. The rest of these design principles are all about how to do that.

(See the related System Design Standard for similar discussions about CAD project structure and major architectural design layers.)

Small and Regular Modules

Create a set of conceptually small and regular modules which reflect the design, but whose interfaces say as little as possible about their implementation. A conceptually small module has only one function (even if complex), with possibly multiple operations on that function. A conceptually regular module uses a limited number of interface idioms. The Binary Accumulator and its sub-modules exemplify these principles.

Small and regular modules have a narrower context, and so more meaningful code and comments, and fewer places for bugs to happen. These benefits are preserved as you compose modules together into other conceptually small modules, even if they are internally complex (e.g.: Saturating Adder/Subtractor).

Small and regular modules constrain the design space away from unexpected corner cases, and further suggest other small and regular modules which may not already exist. Arbitrary modules tend to require other arbitrary modules to function together; regular modules tend to require other regular modules.

A larger design built up from small and regular modules means more debugging of the design itself than of its implementation, so bugs become more meaningful and less random.

How do we figure out what is a small and regular module? How do we seed the start of a library of modules from which we build up larger designs? We separate functions, based both on how we think about functions, and on what the design shows us.

Design Abstractions

In the design as a whole, and within each module, separate Operation from Implementation, and then separate Implementation into Controlpath and Datapath.

Operation

The Operation describes what the module does, how it works, and how you use it, without knowledge of its internals. You write this up as a story of comments at the start of your module, starting with the purpose and adding more detail up to the expected usage and behaviour of each input and output, how to set the module parameters, as well as corner-cases and limitations. This story should be enough for another person to understand and use your module without having to look inside it. Writing the Operation first often improves the module design from the start, as you think of new cases to deal with.

Implementation

The Implementation describes how the Operations are done, using functional modules which connect the inputs and outputs described in the Operation. It's the Block Diagram level of description. Given a good set of building block modules, you only need to know the Operation of each module, not its Implementation, which minimizes unintended leakage across abstractions, and so you avoid getting increasingly lost in details as the design gets larger and more complex.

The Implementation also acts as the Operation description of the Datapath and Controlpath, which form the two halves of the Implementation. You must keep the Datapath and the Controlpath separate to avoid having to explicitly describe every possible combination of states, inputs, and outputs, which makes for tedious and long code which doesn't implement the design itself, but instead enumerates its possible behaviours (e.g.: large case statements containing nested conditionals). The reader must then reverse-engineer the design through this extra layer of abstraction, usually while debugging!

Datapath

The Datapath describes which calculations must happen, and where the data flows from module to module. It describes the set of possible computations, but not their entire sequence or conditions. Designing the Datapath first and separately allows you to optimize it to your particular problem and to the underlying FPGA hardware. This is where you solve problems of clock speed, pipelining, arithmetic precision, latency, storage, area, etc... At the end, you have inputs and outputs for computation, and inputs and outputs for control.

It's possible for a Datapath to be too general, with logic functions going unused and a larger and slower implementation than necessary. Having data inputs which are set once at run-time and remain constant is a good indicator of an over-general Datapath. Instead, if the unchanging data inputs only ever take a single value, replace them with a module parameter or localparam to enable logic optimization. Otherwise, if a few possible values exist, hardwire them to a multiplexer connected to the constant data input. This change converts the data input to a control input (the mux selector input) with fewer possible values, which results in simpler, optimized logic.
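As a sketch of that last conversion (the module and signal names here are illustrative, not from a specific library): a data input which only ever takes one of a few values becomes a small control input selecting among hardwired constants.

```verilog
// Hypothetical example: a datapath input that only ever takes one of
// two run-time-constant offset values. Instead of a full-width data
// input, both values are fixed at elaboration time and only the mux
// selector is exposed, converting a data input into a 1-bit control
// input and letting the optimizer reduce the adder logic.

module offset_adder
#(
    parameter WORD_WIDTH = 32,
    // The two possible offsets, now elaboration-time constants.
    parameter [31:0] OFFSET_A = 32'd16,
    parameter [31:0] OFFSET_B = 32'd64
)
(
    input  wire                     offset_select, // control input (mux selector)
    input  wire [WORD_WIDTH-1:0]    data_in,
    output reg  [WORD_WIDTH-1:0]    data_out
);

    always @(*) begin
        data_out = data_in + (offset_select ? OFFSET_B[WORD_WIDTH-1:0]
                                            : OFFSET_A[WORD_WIDTH-1:0]);
    end

endmodule
```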

Controlpath

The Controlpath describes when and in which order the Datapath calculations must happen, possibly in a data-dependent way depending on the control outputs from the Datapath. This is where you have handshaking, sequencing, internal state, conditions, any data-dependent calculation shortcuts, etc... Note that some of the Controlpath may end up embedded inside the Datapath (e.g.: Skid Buffers), and for simple cases may be almost invisible from the outside (e.g.: data-flow pipelines).

Handshaking

Handshaking deals with coordinating multiple Datapaths and Controlpaths, or individual Datapath stages. Good handshaking design can eliminate small FSMs in the Controlpath, or at least reduce the number of states and transitions. We refer to these systems as "handshakes" since they coordinate agreement between a sender and a receiver.

Ready/Valid Handshake

The ready/valid handshake is a simple system where a sender signals when it has valid data, and the receiver signals when it is ready. When both ready and valid are asserted, a data transfer happens. By following some design rules, you can support maximum throughput and operating frequency. All other handshakes can be converted to/from ready/valid handshakes, giving a universal, common interface for data and control. Furthermore, it is easy to have the sender and receiver work concurrently by buffering the handshake, with the buffering done differently depending on design needs.
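A minimal sketch of one such buffer (not a complete, production design): a single register with a ready/valid handshake on each side. It sustains full throughput, but its upstream ready depends combinationally on the downstream ready; a full Skid Buffer adds a second register to break that combinational path.

```verilog
// Single-register ready/valid buffer (illustrative sketch).
// A transfer happens on a port when its valid and ready are both high.

module handshake_register
#(
    parameter WORD_WIDTH = 8
)
(
    input  wire                  clock,
    input  wire                  clear,

    // Input (sender-facing) port
    input  wire                  s_valid,
    output wire                  s_ready,
    input  wire [WORD_WIDTH-1:0] s_data,

    // Output (receiver-facing) port
    output reg                   m_valid,
    input  wire                  m_ready,
    output reg  [WORD_WIDTH-1:0] m_data
);

    // We can accept new data when the register is empty, or when the
    // downstream is taking the current word this same cycle. Note this
    // makes s_ready combinational on m_ready.
    assign s_ready = (m_valid == 1'b0) || (m_ready == 1'b1);

    always @(posedge clock) begin
        if (clear == 1'b1) begin
            m_valid <= 1'b0;
        end
        else if (s_valid && s_ready) begin
            m_data  <= s_data;
            m_valid <= 1'b1;
        end
        else if (m_valid && m_ready) begin
            m_valid <= 1'b0;
        end
    end

endmodule
```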

Pulse Handshake

Pulse handshakes are a natural way to implement computations which cannot accept a new input every clock cycle due to either backwards dependencies in the pipeline (loops), or due to an iterative algorithm (e.g.: long division). A pulse handshake takes in a single cycle pulse to signal new input data, or the start of a computation, or a change of state, and returns that same pulse later to signal the data has been accepted, the computation completed, or the state has changed. The input pulse is carried alongside the computation, and eventually reaches the output at the same time as the result. It is very easy to do this with a 1-bit-wide pipeline matching the main pipeline's path, or with a counter. Thus, the internal pipeline can be adjusted without changing the interface, often automatically via parameters.
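The 1-bit pipeline carrying the pulse can be sketched as a simple shift register whose depth matches the main pipeline (the module and parameter names here are illustrative):

```verilog
// Carries a single-cycle pulse alongside a main pipeline of matching
// depth, so the pulse emerges in the same cycle as the result.
// Assumes PIPE_DEPTH >= 1; changing the depth is a parameter change.

module pulse_pipeline
#(
    parameter PIPE_DEPTH = 4
)
(
    input  wire clock,
    input  wire pulse_in,
    output wire pulse_out
);

    reg [PIPE_DEPTH-1:0] pulse_pipe = {PIPE_DEPTH{1'b0}};

    always @(posedge clock) begin
        // Shift the pulse one stage forward each cycle.
        pulse_pipe <= (pulse_pipe << 1) | pulse_in;
    end

    assign pulse_out = pulse_pipe[PIPE_DEPTH-1];

endmodule
```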

However, a pulse handshake limits the throughput to a maximum of one-half the clock frequency. At the limit, the output pulse is emitted on the next cycle after the input pulse, which means we can only send an input pulse every other clock cycle. Simultaneous input and output pulses during the same clock cycle imply combinational logic, and so are redundant, since the clock itself becomes the handshake to the sequential logic connected to this combinational logic. In other words, you have a conventional pipeline which never stops.

See the Pulse to Pipeline and the Pipeline to Pulse modules for detailed examples of Pulse Handshakes.

Asynchronous Handshake

When a sender and a receiver operate in different clock domains, we must take special precautions because of the Clock-Domain Crossing. Pulse generators and pulse latches remove many FSMs and can fix race conditions (ABA problems).

Since the `clear` input, a synchronous reset, is a signal that changes the output, and is derived from some control path logic, it must trigger an "output updated" signal in pulse interfaces just like any other `load` or `valid` input signal, though only as a single pulse, since the output only changes once even under a steady `clear`. This also implies that `clear` must be pipelined, separately and in parallel if necessary, rather than broadcast to all units, which gives similar latency and better routing and distribution.

However, rather than signalling an "output updated" value, signal when a command is done: consecutive calculations can have identical results (so a Word Change Detector is no good here), and multiple commands can update a given output. That "done" signal also denotes when the given output is valid to sample. Internally latch the command so we know which one to report as "done" when multiple commands update one internal object (itself with a "done" signal). This does not allow concurrency or queueing by itself, but using a "done" signal on commands also separates the control and data paths.

A possible general design principle follows from all this: connect Datapaths with ready/valid handshakes at their boundaries, enumerate the ready/valid actions as a 2-input truth table, and compute the result with one of the 16 dyadic Boolean functions.

Constants and Optimization

Constants are the first tool to optimize logic and clarify code. They move computations from run-time to synthesis-time (or even design-time).

Expect the existence of a logic optimizer in your CAD tool, and make use of it by expressing your logic without case-specific optimizations, then let the constant parameter values which define a specific case optimize the logic down to a reduced form. For example, a constant input into a 2-input (dyadic) Boolean gate input reduces the gate to a constant, a wire, or an inverter, which can then further simplify downstream logic.
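A sketch of expressing logic this way (the truth-table encoding here is an illustrative convention): a generic dyadic Boolean function selected by a 4-bit parameter. Feeding one input from a constant lets the optimizer collapse the whole function to a constant, a wire, or an inverter per bit.

```verilog
// Generic bitwise dyadic Boolean operator. The 4-bit TRUTH_TABLE
// parameter is indexed by the pair {A[i], B[i]}:
// e.g. 4'b1000 = AND, 4'b1110 = OR, 4'b0110 = XOR.

module dyadic_boolean
#(
    parameter             WORD_WIDTH  = 8,
    parameter [3:0]       TRUTH_TABLE = 4'b1000
)
(
    input  wire [WORD_WIDTH-1:0] A,
    input  wire [WORD_WIDTH-1:0] B,
    output reg  [WORD_WIDTH-1:0] Y
);

    integer i;

    always @(*) begin
        for (i = 0; i < WORD_WIDTH; i = i + 1) begin
            // Look up each output bit in the constant truth table.
            Y[i] = TRUTH_TABLE[{A[i], B[i]}];
        end
    end

endmodule
```

Since TRUTH_TABLE is a constant, tying A or B to a constant reduces each lookup to one of the two remaining table entries, which the optimizer then propagates downstream.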

Constants allow you to express the meaning of numbers in your code, especially if you can construct the constant from other constants at elaboration time (e.g.: calculating localparam values from module parameters). And using constants allows expressing pattern matching, masking, and bit reductions using a single Verilog idiom, which makes the code easier to design, read, and debug. Define constants near the start of your module, or just before their use if the meaning is quite local.

parameter WORD_WIDTH = 23;
...
localparam ALL_ONES = {WORD_WIDTH{1'b1}};
localparam ALL_ZERO = {WORD_WIDTH{1'b0}};
localparam MSB_ONLY = {1'b1,{WORD_WIDTH-1{1'b0}}};
...
always @(*) begin
    and_reduction = (foo == ALL_ONES);
    or_reduction  = (foo != ALL_ZERO);
    negative      = ((foo & MSB_ONLY) != ALL_ZERO);
end

Retiming and Pipelining

Where to pipeline a circuit isn't always obvious, and may change depending on which logic optimizations happen and how the logic is laid out on the FPGA. However, we can rely on the CAD tool's register retiming passes to do the work for us. In general, forward retiming, from input towards output, works better than backwards retiming. There are fewer logical restrictions on forward retiming, and it's the only retiming supported by Vivado's physical optimization passes.

First, we design the circuit as-is, without extra pipeline registers, keeping it readable and general and in its own module. Then, we add one or more registers before the inputs of the module, which is easily done with a simple register pipeline module, and reduces changing the amount of pipelining to a single module parameter, regardless of the implementation of the circuit. Your CAD tool's register retiming optimizations will then spread the registers along the pipeline as needed.
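Such an input register pipeline can be sketched as follows (names are illustrative; assumes PIPE_DEPTH >= 1):

```verilog
// A simple register pipeline: the amount of pipelining is one
// parameter, and the CAD tool's forward retiming pass spreads these
// registers into the downstream logic as needed.

module register_pipeline
#(
    parameter WORD_WIDTH = 32,
    parameter PIPE_DEPTH = 2
)
(
    input  wire                  clock,
    input  wire [WORD_WIDTH-1:0] data_in,
    output wire [WORD_WIDTH-1:0] data_out
);

    reg [WORD_WIDTH-1:0] stage [PIPE_DEPTH-1:0];

    integer i;

    always @(posedge clock) begin
        stage[0] <= data_in;
        for (i = 1; i < PIPE_DEPTH; i = i + 1) begin
            stage[i] <= stage[i-1];
        end
    end

    assign data_out = stage[PIPE_DEPTH-1];

endmodule
```

Placing this module in front of the unpiped circuit keeps the circuit itself readable and general, while the pipeline depth stays a one-line change.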

Note that there are limits to retiming. For circuits which would need a very deep pipeline (e.g.: a 128-bit adder), the CAD tool may give up and not do as much forward retiming as it could, limiting performance. You will then have to add pipeline registers manually, or use alternate implementation tradeoffs, such as multi-precision arithmetic.

A loop in your circuit blocks registers from being retimed from outside to inside the loop. Instead, place the pipeline registers to be retimed just inside the start of the loop, where they can be retimed forward both into the feedback part of the loop and past the output of the loop. For an example, see the saturating accumulator.

Implementation Issues

Make signal names evolve with the computation, in lexicographic order, by adding or changing suffixes. Text searches become easier, and waveform displays are better organized. This naming scheme also makes it obvious when your code has a harmony to it, which is usually a sign that the implementation reflects the design.

Don't generate or divide clocks: generate enables at the desired rate, synchronous to the main clock.

Use non-blocking assignments for testbench logic and blocking assignments for the testbench clock, to avoid simulation races. Watch out for implicitly clocked always blocks: a blocking assignment there will work *nearly every time*, until a race condition happens in the simulation and impossible logic results.

If a module contains latches which will hold state, then it must have a reset/clear. Otherwise, a reset of the surrounding logic would result in an inconsistent system (e.g.: a valid line from a latch staying high after reset).

A CDC implies asynchrony, which means there is no notion of time, only of sequence. Hence, 2-phase or 4-phase handshakes must be present inside non-trivial CDC designs.
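The "generate enables, not clocks" point can be sketched like this (module and parameter names are illustrative): a counter produces a one-cycle enable pulse at the desired rate, synchronous to the main clock, instead of a divided clock.

```verilog
// Emits a single-cycle enable pulse once every DIVIDE_BY cycles.
// Downstream logic uses this as a clock-enable on its registers,
// staying entirely in the main clock domain. Assumes DIVIDE_BY >= 2.

module rate_enable
#(
    parameter DIVIDE_BY = 4
)
(
    input  wire clock,
    input  wire clear,
    output reg  enable
);

    // Just wide enough to count up to DIVIDE_BY-1.
    localparam COUNT_WIDTH = $clog2(DIVIDE_BY);

    reg [COUNT_WIDTH-1:0] count = {COUNT_WIDTH{1'b0}};

    always @(posedge clock) begin
        if (clear) begin
            count  <= {COUNT_WIDTH{1'b0}};
            enable <= 1'b0;
        end
        else if (count == DIVIDE_BY-1) begin
            count  <= {COUNT_WIDTH{1'b0}};
            enable <= 1'b1;
        end
        else begin
            count  <= count + 1;
            enable <= 1'b0;
        end
    end

endmodule
```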

Scaling

As physical area increases, any cycles in the dependencies (anything other than a straight pipeline) make area a limiting factor for speed, and distributed control signals can have too far to travel. As a design gets large, routing may not get as good a result (congestion, too much effort), so critical paths become bad routes, as do bit-reductions that cannot be pipelined, despite using carry-chains for faster calculations.

One design paradigm to deal with this: give all modules ready/valid handshakes at their input, output, and control interfaces, so half-buffers or skid buffers can be added as necessary to avoid the above timing problems, without altering functionality. (Are we at Kahn Networks at this point?) Following up on that: can we make the handshake logic add no latency, or optimize away entirely when possible (combinational paths without buffering)?