Logic Design Principles

Constants and Optimization

Constants are the first tool to optimize logic and clarify code.

Expect the existence of a logic optimizer in your CAD tool, and make use of it by expressing your logic without case-specific optimizations, then let the constant parameters which define a specific case optimize the logic down to a reduced form. A constant fed into one input of a 2-input (dyadic) Boolean gate reduces the gate to a constant, a wire, or an inverter, which can then further simplify downstream logic.
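As a minimal sketch of this reduction, consider a word-wide XOR against a constant. The module name, port names, and the INVERT parameter are hypothetical, chosen only for illustration. The logic is written once in its general form, and the constant decides what remains after optimization.

```verilog
// Hypothetical example: all names here are illustrative assumptions.
module Conditional_Inverter_Sketch
#(
    parameter       WORD_WIDTH = 8,
    parameter [0:0] INVERT     = 1'b0   // 1-bit elaboration-time constant
)
(
    input  wire [WORD_WIDTH-1:0] word_in,
    output reg  [WORD_WIDTH-1:0] word_out
);
    // Spread the 1-bit constant across the whole word.
    localparam [WORD_WIDTH-1:0] INVERT_MASK = {WORD_WIDTH{INVERT}};

    // General form: one XOR gate per bit, with one constant input.
    // INVERT = 0: (x ^ 0) = x,  so each gate should reduce to a wire.
    // INVERT = 1: (x ^ 1) = ~x, so each gate should reduce to an inverter.
    always @(*) begin
        word_out = word_in ^ INVERT_MASK;
    end
endmodule
```

No case-specific code is written for either value of INVERT. The expectation, which depends on your CAD tool, is that the optimizer removes the XOR gates entirely in both cases.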

Constants allow you to express the meaning of numbers in your code, especially if you can construct the constant from other constants at elaboration time (e.g.: calculating localparam values from module parameters). Using constants also lets you express pattern matching, masking, and bit reductions using a single Verilog idiom, which makes the code easier to design, read, and debug. Define constants near the start of your module, or just before their use if the meaning is quite local.

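parameter WORD_WIDTH = 23
...
// Constants constructed from the module parameter at elaboration time
localparam ALL_ONES = {WORD_WIDTH{1'b1}};
localparam ALL_ZERO = {WORD_WIDTH{1'b0}};
localparam MSB_ONLY = {1'b1,{(WORD_WIDTH-1){1'b0}}};
...
always @(*) begin
    and_reduction = (foo == ALL_ONES);                // every bit is set
    or_reduction  = (foo != ALL_ZERO);                // at least one bit is set
    negative      = ((foo & MSB_ONLY) != ALL_ZERO);   // most-significant (sign) bit is set
end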

Retiming and Pipelining

Where to pipeline a circuit isn't always obvious, and may change depending on which logic optimizations happen and how the logic is laid out on the FPGA. However, we can rely on the CAD tool's register retiming passes to do the work for us. In general, forward retiming, from input towards output, works better than backwards retiming. There are fewer logical restrictions on forward retiming, and it's the only retiming supported by Vivado's physical optimization passes.

First, we design the circuit as-is, without extra pipeline registers, keeping it readable and general and in its own module. Then, we add one or more registers before the inputs of the module, which is easily done with a simple register pipeline module, and reduces changing the amount of pipelining to a single module parameter, regardless of the implementation of the circuit. Your CAD tool's register retiming optimizations will then spread the registers along the pipeline as needed.
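The approach described above can be sketched as follows. This is only an illustration under stated assumptions: the module names, port names, and the PIPE_DEPTH parameter are hypothetical, and the adder stands in for any combinational circuit. The registers have no reset, on the assumption that reset logic can restrict retiming in some tools.

```verilog
// Hypothetical register pipeline: PIPE_DEPTH registers in series, no reset.
module Register_Pipeline_Sketch
#(
    parameter WORD_WIDTH = 32,
    parameter PIPE_DEPTH = 2    // assumed to be 1 or more
)
(
    input  wire                  clock,
    input  wire [WORD_WIDTH-1:0] pipe_in,
    output wire [WORD_WIDTH-1:0] pipe_out
);
    reg [WORD_WIDTH-1:0] stage [PIPE_DEPTH-1:0];
    integer i;

    always @(posedge clock) begin
        stage[0] <= pipe_in;
        for (i = 1; i < PIPE_DEPTH; i = i + 1) begin
            stage[i] <= stage[i-1];
        end
    end

    assign pipe_out = stage[PIPE_DEPTH-1];
endmodule

// The circuit itself stays purely combinational and readable.
// All pipelining sits before its inputs, set by one parameter.
module Pipelined_Adder_Sketch
#(
    parameter WORD_WIDTH = 32,
    parameter PIPE_DEPTH = 2
)
(
    input  wire                  clock,
    input  wire [WORD_WIDTH-1:0] a,
    input  wire [WORD_WIDTH-1:0] b,
    output reg  [WORD_WIDTH-1:0] sum
);
    wire [WORD_WIDTH-1:0] a_piped;
    wire [WORD_WIDTH-1:0] b_piped;

    Register_Pipeline_Sketch
    #(
        .WORD_WIDTH (2 * WORD_WIDTH),
        .PIPE_DEPTH (PIPE_DEPTH)
    )
    input_pipeline
    (
        .clock    (clock),
        .pipe_in  ({a, b}),
        .pipe_out ({a_piped, b_piped})
    );

    // The registers above are candidates for forward retiming into this logic.
    always @(*) begin
        sum = a_piped + b_piped;
    end
endmodule
```

Whether the registers actually get moved into the adder depends on the tool and its settings. Some tools may also map a reset-less register chain like this one onto shift-register primitives, in which case a tool-specific attribute or setting may be needed to keep them as separate registers.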

Note that there are limits to retiming. For circuits which would need very deep pipelines (e.g. a 128-bit adder), the CAD tool may give up and not do as much forward retiming as could be done, limiting the performance. You will then have to add pipeline registers manually, or use alternate implementation tradeoffs, such as multi-precision arithmetic.

If there is a loop in your circuit, the loop blocks any registers from being retimed from outside to inside the loop. Instead, place the pipeline registers to be retimed just inside the start of the loop, where they can be retimed forward both into the feedback part of the loop and past the output of the loop.

Design

Make the code reflect the design: don't make the reader reverse-engineer the design from the implementation. The first step is to create a set of modules which reflect the design, but whose interfaces say as little as possible about their implementation.

A modular design means more debugging of the design itself than of its implementation, so bugs become more meaningful and less random. Smaller modules have a narrower context, and so more meaningful code and comments, and fewer places for bugs to happen. These benefits are preserved as you compose modules together into other conceptually small modules, even if they are internally complex (e.g.: Saturating Adder/Subtractor).

Small, reusable modules allow you to construct a design using regular, non-arbitrary idioms, which constrain the design space away from unexpected corner cases, and further suggest any other small modules which may not already exist. Arbitrary modules tend to require other arbitrary modules to function together. Regular modules tend to require other regular modules to function together.

How do we figure out what's a small, regular module? How do we seed the start of the library of modules from which designs are constructed?

Separate Operation from Implementation, and then separate Implementation into Controlpath and Datapath.

19. It's possible for a logic path to be too general, with constant inputs that don't change. This makes a slower path. Instead, have parallel, simpler paths which each hardcode one possible value of the unchanging inputs, then select at the end using the actual unchanging input. Pipelining may render this optimization moot. (e.g.: adders which depend on >3 inputs (data and control) are better split as separate simpler adders and a mux)

20. Following on above: an over-general logic path may indicate you are not implementing the complete algorithm, and should instead use all possible logic functions of the path.
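A minimal sketch of the split described in note 19, using an adder/subtractor as the example. The module and signal names are hypothetical. Each parallel path hardcodes one value of the control input, and the actual control input only drives the final mux.

```verilog
// Hypothetical example: names are illustrative assumptions.
module Add_Sub_Split_Sketch
#(
    parameter WORD_WIDTH = 16
)
(
    input  wire                  subtract,  // control: 0 = add, 1 = subtract
    input  wire [WORD_WIDTH-1:0] a,
    input  wire [WORD_WIDTH-1:0] b,
    output reg  [WORD_WIDTH-1:0] result
);
    reg [WORD_WIDTH-1:0] result_sum;
    reg [WORD_WIDTH-1:0] result_difference;

    // For contrast, the over-general single path would be:
    //   result = a + (b ^ {WORD_WIDTH{subtract}}) + subtract;
    // where the control input feeds every bit of the adder.

    always @(*) begin
        result_sum        = a + b;   // path with subtract hardcoded to 0
        result_difference = a - b;   // path with subtract hardcoded to 1
        result            = (subtract == 1'b1) ? result_difference : result_sum;
    end
endmodule
```

Whether the split version is actually faster depends on the target device and the CAD tool, so treat it as a candidate to measure rather than a guaranteed improvement.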

Signalling

7. Hierarchy: pulses, ready/valid, 4-phase, 2-phase (TODO: chapter to introduce/explain them, why/where use them, etc...)

2. Ready/valid handshakes allow for action every clock cycle if possible and compose well. Pulses necessarily limit your max throughput by half. They are a layer of abstraction.

3. Pulse generators and pulse latches remove many FSMs and can fix race conditions (ABA problems).

14. Since the `clear` input, a synchronous reset, is a signal that changes the output, and is derived from some control path logic, it must trigger an "output updated" signal in pulse interfaces as any other `load` or `valid` input signal, though only as a single pulse since the output only changes once even with a steady `clear`. This also implies that `clear` must be pipelined, separately and in parallel if necessary, and not broadcast to all units, which means similar latency, and better routing and distribution.

15. Amend that last one: don't signal an "output updated" value, as consecutive calculations can have identical results (so a Word Change Detector is no good here), and multiple commands can update a given output. Instead, signal when a command is done. That "done" signal also denotes when the given output is valid to sample. Internally latch the command so we know which one to report as "done" if multiple commands update one internal object (itself with a "done" signal). This does not allow concurrency or queueing by itself.

16. Following up on last one: using a "done" signal on commands also separates control and data paths.

17. Possible general design principle: connect datapaths with ready/valid handshakes at boundaries, enumerate ready/valid actions as 2-input truth table, result is computed by one of the 16 dyadic Boolean functions.
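As one illustration of the pulse generators mentioned in note 3, here is a minimal sketch which converts a level into a single-cycle pulse on its rising edge. The module and port names are hypothetical, and the input is assumed to be already synchronous to the clock.

```verilog
// Hypothetical example: names are illustrative assumptions.
// Assumes level_in is synchronous to clock.
module Pulse_Generator_Sketch
(
    input  wire clock,
    input  wire level_in,
    output reg  pulse_out
);
    // Remember the level from the previous clock cycle.
    reg level_previous = 1'b0;

    always @(posedge clock) begin
        level_previous <= level_in;
    end

    // High for exactly one cycle: the cycle where the level goes from 0 to 1.
    always @(*) begin
        pulse_out = (level_in == 1'b1) && (level_previous == 1'b0);
    end
endmodule
```

A steady high level thus produces only one pulse, which is the behavior note 14 asks of a held `clear`.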

Implementation Issues

21. Make signal names evolve with the computation, in lexicographic order, by adding/changing suffixes. Thus search in text is easier, and waveform displays will be more organized. This naming scheme also makes it obvious when your code has a harmony, which is usually a sign that the implementation reflects the design.

9. Don't generate/divide clocks: generate enables for the desired rate, synchronous to the main clock.

11. Non-blocking assignments for testbench logic, blocking assignment for testbench clock. (avoids races)

11a. Watch out for implicitly clocked always blocks: a blocking assignment there will work *nearly every time* until a race condition happens in the simulation and impossible logic happens.

18. If a module contains latches which will hold state, then it must have a reset/clear. Else a reset of the surrounding logic would result in an inconsistent system (e.g.: a valid line from a latch stays high after reset).

8. CDC implies asynchrony, which means no notion of time, only of sequence, hence the 2ph/4ph handshakes must be present inside non-trivial CDC designs.
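Note 9 can be sketched as follows. All module names, ports, and parameters are hypothetical. Instead of dividing the clock, a counter raises an enable for one cycle out of every DIVISOR cycles, and the slower logic stays on the main clock.

```verilog
// Hypothetical example: names and parameters are illustrative assumptions.
module Enable_Generator_Sketch
#(
    parameter DIVISOR       = 4,   // assumed to be 2 or more
    parameter COUNTER_WIDTH = 2    // assumed wide enough to hold DIVISOR-1
)
(
    input  wire clock,
    input  wire clear,             // synchronous clear
    output reg  enable_pulse
);
    localparam [COUNTER_WIDTH-1:0] COUNT_ZERO = {COUNTER_WIDTH{1'b0}};
    localparam [COUNTER_WIDTH-1:0] COUNT_ONE  = 1;
    localparam [COUNTER_WIDTH-1:0] COUNT_LAST = DIVISOR - 1;

    reg [COUNTER_WIDTH-1:0] count = COUNT_ZERO;

    always @(posedge clock) begin
        if ((clear == 1'b1) || (count == COUNT_LAST)) begin
            count <= COUNT_ZERO;
        end
        else begin
            count <= count + COUNT_ONE;
        end
    end

    // High for one main-clock cycle out of every DIVISOR cycles.
    always @(*) begin
        enable_pulse = (count == COUNT_LAST);
    end
endmodule

// The slower logic is still clocked by the main clock, gated by the enable.
module Slow_Counter_Sketch
(
    input  wire       clock,
    output reg  [7:0] slow_count
);
    wire enable_pulse;

    Enable_Generator_Sketch
    #(
        .DIVISOR       (4),
        .COUNTER_WIDTH (2)
    )
    slow_rate
    (
        .clock        (clock),
        .clear        (1'b0),
        .enable_pulse (enable_pulse)
    );

    initial begin
        slow_count = 8'd0;
    end

    always @(posedge clock) begin
        if (enable_pulse == 1'b1) begin
            slow_count <= slow_count + 8'd1;
        end
    end
endmodule
```

Everything remains in a single clock domain, so no clock-domain crossing is introduced between the fast and slow logic.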

Scaling

22. As physical area increases, if there are any cycles in dependency (not a straight pipeline), then area becomes a limiting factor for speed. Also, control signals that must be distributed across the design can have too far to travel.

23. As a design gets large, routing may not get as good a result (congestion, too much effort), and so the critical paths become bad routes and bit-reductions which cannot be pipelined, despite using carry-chains for faster calculations.

24. Design paradigm: give all modules ready/valid handshakes at input/output/control, so that half/skid-buffers can be added as necessary to avoid the above timing problems, without altering functionality. Are we at Kahn Networks now?

25. Follow up on above: can we make it so the handshake logic adds no latency or optimizes away when possible? (combinational paths without buffering)
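To make note 24 concrete, here is a minimal sketch of one possible half-buffer for a ready/valid interface. The module and port names are hypothetical, and this is only one simple way to build such a buffer. It holds a single word and registers both directions of the handshake, so no combinational path crosses it, at the cost of at most one transfer every two cycles.

```verilog
// Hypothetical example: names are illustrative assumptions.
module Half_Buffer_Sketch
#(
    parameter WORD_WIDTH = 8
)
(
    input  wire                  clock,
    input  wire                  clear,      // synchronous clear
    // Input side
    input  wire                  in_valid,
    output reg                   in_ready,
    input  wire [WORD_WIDTH-1:0] in_data,
    // Output side
    output reg                   out_valid,
    input  wire                  out_ready,
    output reg  [WORD_WIDTH-1:0] out_data
);
    initial begin
        out_valid = 1'b0;
        out_data  = {WORD_WIDTH{1'b0}};
    end

    // Accept a new word only when the buffer is empty.
    // in_ready depends only on a register, never on out_ready.
    always @(*) begin
        in_ready = (out_valid == 1'b0);
    end

    always @(posedge clock) begin
        if (clear == 1'b1) begin
            out_valid <= 1'b0;
        end
        else if ((in_valid == 1'b1) && (in_ready == 1'b1)) begin
            out_data  <= in_data;   // load: buffer becomes full
            out_valid <= 1'b1;
        end
        else if ((out_valid == 1'b1) && (out_ready == 1'b1)) begin
            out_valid <= 1'b0;      // unload: buffer becomes empty
        end
    end
endmodule
```

Because the handshake on each side completes independently, such a buffer can be placed between two modules to cut a long path without changing what either module computes. A skid buffer would do the same without halving the throughput, at the cost of more logic.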