
# Logic Design Principles

Make the code reflect the design directly: don't make the reader reverse-engineer the design from the implementation.

## Scaling

22. As physical area increases, if there are any cycles in dependency (not a straight pipeline), then area becomes a limiting factor for speed. Also, distributed control signals can have too far to travel.
23. As a design gets large, routing may not get as good a result (congestion, too much effort), and so critical paths become bad routes, as do bit-reductions that cannot be pipelined, despite using carry-chains for faster calculations.
24. Design paradigm: all modules with ready/valid handshakes at input/output/control to allow half/skid-buffers to be added as necessary to avoid the above timing problems, without altering functionality. Are we at Kahn Networks now?
25. Follow-up on the above: can we make it so the handshake logic adds no latency or optimizes away when possible? (combinational paths without buffering)

## Retiming and Pipelining

Where to pipeline a circuit isn't always obvious, and may change depending on which logic optimizations happen and how the logic is laid out on the FPGA. However, we can rely on the CAD tool's register retiming passes to do the work for us. In general, forward retiming, from input towards output, works better than backwards retiming. There are fewer logical restrictions on forward retiming, and it's the only retiming supported by Vivado's physical optimization passes.

First, we design the circuit as-is, without extra pipeline registers, …
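One common way to hand the pipelining decision to a forward retiming pass is sketched below. This is an assumption about usage, not something the text confirms: the logic is written as-is, and a chain of plain registers is placed at the inputs for the tool to move into the logic. The module name `Retimed_Multiply_Add`, its ports, and the `PIPE_DEPTH` parameter are hypothetical, and whether the registers actually move depends on tool options not shown here.

```verilog
// Hypothetical sketch: as-is logic preceded by input registers which
// a forward retiming pass may push into the multiply-add.
module Retimed_Multiply_Add
#(
    parameter WORD_WIDTH = 8,
    parameter PIPE_DEPTH = 2    // number of input registers, must be >= 1
)
(
    input  wire                      clock,
    input  wire [WORD_WIDTH-1:0]     a,
    input  wire [WORD_WIDTH-1:0]     b,
    input  wire [WORD_WIDTH-1:0]     c,
    output wire [(2*WORD_WIDTH)-1:0] result
);
    localparam TOTAL_WIDTH = 3 * WORD_WIDTH;

    // Plain registers with no reset or enable, which are the easiest
    // kind for a retiming pass to move.
    reg [TOTAL_WIDTH-1:0] input_pipe [PIPE_DEPTH-1:0];

    integer i;

    always @(posedge clock) begin
        input_pipe[0] <= {a, b, c};
        for (i = 1; i < PIPE_DEPTH; i = i + 1) begin
            input_pipe[i] <= input_pipe[i-1];
        end
    end

    wire [WORD_WIDTH-1:0] a_late;
    wire [WORD_WIDTH-1:0] b_late;
    wire [WORD_WIDTH-1:0] c_late;

    assign {a_late, b_late, c_late} = input_pipe[PIPE_DEPTH-1];

    // The logic itself is written without any pipelining.
    assign result = (a_late * b_late) + c_late;
endmodule
```

The latency seen from outside is `PIPE_DEPTH` cycles wherever the tool ends up placing the registers, so the interface does not change with the retiming result.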

## Constants and Optimization

Constants are the first tool to optimize logic and clarify code.

Expect the existence of a logic optimizer in your CAD tool, and make use of it by expressing your logic without case-specific optimizations, then let the constant parameters which define a specific case optimize the logic down to a reduced form. A constant input into a 2-input (dyadic) Boolean gate reduces the gate to a constant, a wire, or an inverter, which can then further simplify downstream logic.

Constants allow you to express the meaning of numbers in your code, especially if you can construct the constant from other constants at elaboration time (e.g.: calculating `localparam` values from module parameters). And using constants allows expressing pattern matching, masking, and bit reductions using a single Verilog idiom, which makes the code easier to design, read, and debug. Define constants near the start of your module, or just before their use if the meaning is quite local.
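As a minimal sketch of these points, the module below derives `localparam` constants from module parameters at elaboration time and uses one masked-compare idiom for pattern matching. The module name `Pattern_Match`, its ports, and its parameters are hypothetical, chosen only for illustration.

```verilog
// Hypothetical example: masked pattern match built from constants.
module Pattern_Match
#(
    parameter                  WORD_WIDTH = 8,
    parameter [WORD_WIDTH-1:0] PATTERN    = 8'hA0,
    parameter [WORD_WIDTH-1:0] MASK       = 8'hF0  // 1 = bit takes part in the match
)
(
    input  wire [WORD_WIDTH-1:0] word,
    output wire                  match
);
    // Constants constructed from other constants at elaboration time.
    localparam [WORD_WIDTH-1:0] WORD_ZERO      = {WORD_WIDTH{1'b0}};
    localparam [WORD_WIDTH-1:0] PATTERN_MASKED = PATTERN & MASK;

    // One idiom covers the masking, the matching, and the bit reduction.
    // Where a MASK bit is 0, the AND gate reduces to a constant 0, and
    // the logic for that bit can be optimized away.
    assign match = ((word & MASK) ^ PATTERN_MASKED) == WORD_ZERO;
endmodule
```

With the default parameters only the upper four bits of `word` are compared, so the general expression should reduce to a 4-bit comparison against a constant without any case-specific code.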

## Small, Regular Modules

Create a set of conceptually small, regular modules which reflect the design, but whose interfaces say as little as possible about their implementation, even if they are internally complex (e.g.: <a href="./Adder_Subtractor_Binary_Saturating.html">Saturating Adder/Subtractor</a>). Thus, a modular design means debugging more of the *design* than its implementation.

Small, reusable modules allow you to construct a design using regular, non-arbitrary design idioms (e.g.: ready/valid handshakes, pulse interfaces, separated state and computation), which constrain the design space away from unexpected corner cases, and further suggest any other small, reusable modules which may not already exist. Arbitrary modules tend to require other arbitrary modules to function together. Regular modules tend to require other regular modules to function together.

How do we figure out what's a small, regular module? How do we seed the start of the library of modules from which designs are constructed?

## Ready/Valid Handshake

Options: Carloni buffering (pipelines, but does not absorb stalls), and FIFO buffering (does not pipeline, since it is spatially local, but absorbs stalls).

## Credit-based Interface

Credit-based signalling extends pulse handshakes to maintain full throughput over a long latency link, where many cycles can pass between sending an input pulse to a remote module and it arriving, and many cycles again before the output pulse returns after being sent by the remote module. This latency can either be pure propagation delay, or come from multiple pipeline stages inserted to maintain a high clock frequency. As a tradeoff, the sender and receiver must know the latency between them, or more specifically, the maximum number of handshakes which can be in transit between the sender and receiver.

Abstractly, credit-based signalling involves a sender and a receiver with a pulse handshake between them. However, instead of the sender and receiver being directly connected, there is a number (N) of plain pipeline registers along the input pulse path from the sender to the receiver, and another set of N pipeline registers along the output pulse path from the receiver back to the sender. Thus the total latency added to the pulse handshake, on top of the receiver's latency, is 2N cycles.

## Pulse Handshake

Pulse handshakes are a natural way to control computations which cannot accept a new input every clock cycle due to either backwards dependencies in the pipeline (loops), or due to an iterative algorithm (e.g.: long division). A pulse handshake takes in a single-cycle pulse to signal new input data, or the start of a computation, or a change of state, and returns that same pulse later to signal the data has been accepted, the computation completed, or the state has changed. The input pulse is carried alongside the computation, and eventually reaches the output at the same time as the result. It is very easy to do this with a 1-bit pipeline matching the main pipeline's path, or with a counter. Thus, the internal pipeline can be adjusted without changing the interface.

However, a pulse handshake limits the throughput to a maximum of one-half the clock frequency. At the limit, the output pulse is emitted on the next cycle after the input pulse, which means we can only send an input pulse every other clock cycle. Simultaneous input and output pulses during the same clock cycle imply combinational logic, and so are redundant, since the clock itself becomes the signalling to the sequential logic connected to this combinational logic. In other words, you have a conventional pipeline which never stops.

A multi-cycle input pulse implies multiple consecutive calculations, up to a certain limit, and thus one or more single or multi-cycle output pulses, depending on the latency of each computation, which can be data-dependent and possibly complete out of order. Keeping track of how many input pulses have been sent relative to the number of output pulses received creates a credit-based signalling system.
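The 1-bit pipeline which carries the pulse alongside the computation can be sketched as follows. The module name `Pulse_Handshake_Pipeline`, the port names, and the stand-in computation (an increment followed by plain delay stages) are all assumptions for illustration, not part of the original library.

```verilog
// Hypothetical sketch: a pulse carried alongside a fixed-latency computation.
module Pulse_Handshake_Pipeline
#(
    parameter WORD_WIDTH = 8,
    parameter PIPE_DEPTH = 3    // latency in cycles, must be >= 1
)
(
    input  wire                  clock,
    input  wire                  clear,
    input  wire                  input_pulse,   // high for one cycle: input_data is valid
    input  wire [WORD_WIDTH-1:0] input_data,
    output wire                  output_pulse,  // high for one cycle: output_data is valid
    output wire [WORD_WIDTH-1:0] output_data
);
    reg [PIPE_DEPTH-1:0] pulse_pipe = {PIPE_DEPTH{1'b0}};
    reg [WORD_WIDTH-1:0] data_pipe [PIPE_DEPTH-1:0];

    integer i;

    always @(posedge clock) begin
        // Stand-in computation: increment in the first stage, then delay.
        data_pipe[0] <= input_data + 1'b1;
        for (i = 1; i < PIPE_DEPTH; i = i + 1) begin
            data_pipe[i] <= data_pipe[i-1];
        end

        // 1-bit pipeline matching the data path, one stage per cycle.
        if (clear) begin
            pulse_pipe <= {PIPE_DEPTH{1'b0}};
        end
        else begin
            pulse_pipe <= (pulse_pipe << 1) | input_pulse;
        end
    end

    assign output_pulse = pulse_pipe[PIPE_DEPTH-1];
    assign output_data  = data_pipe[PIPE_DEPTH-1];
endmodule
```

Changing `PIPE_DEPTH` changes the latency of both the data and the pulse together, so the logic connected to this module needs no changes.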

## Signalling

Signalling deals with coordinating multiple Datapaths and Controlpaths, or individual Datapath stages. Good signalling design can eliminate small FSMs in the Controlpath, or at least reduce the number of states and transitions. We refer to these signalling systems as "handshakes" since they coordinate agreement between a sender and a receiver.

Separate Operation from Implementation, and then separate Implementation into Datapath and Controlpath.

The Operation describes *what* the module does, how it works, how…

The Implementation describes *how* the Operations are done, using functional modules which connect the inputs and outputs described in the Operation. It's the Block Diagram level of description. Given a good set of building block modules, you only need to know the Operation of each module, not its Implementation, which minimizes unintended leakage across abstractions, and so you avoid getting increasingly lost in details as the design gets larger and more complex. The Implementation also acts as the Operation of the Datapath and Controlpath, which form the two halves of the Implementation:

The Datapath describes *which* calculations must happen, and where the data flows from module to module. It describes the set of possible computations, but not their entire sequence or conditions. Designing the Datapath first and separately allows you to optimize it to your particular problem and to the underlying FPGA hardware. This is where you solve problems of clock speed, pipelining, arithmetic precision, latency, storage, area, etc... At the end, you have inputs and outputs for computation, and inputs and outputs for control.

It's possible for a Datapath to be too general, with logic functions going…

The Controlpath describes *when and in which order* the Datapath calculations must happen, possibly in a data-dependent way depending on the control outputs from the Datapath. This is where you have handshaking, sequencing, internal state, conditions, any data-dependent calculation shortcuts, etc... Note that some of the Controlpath may end up embedded inside the Datapath (e.g.: Skid Buffers), and for simple cases may be almost non-existent (e.g.: data-flow pipelines).

8. CDC implies asynchrony, which means no notion of time, only of sequence, hence the 2ph/4ph handshakes must be present inside non-trivial CDC designs.
9. Don't generate/divide clocks: generate enables for the desired rate, synchronous to the main clock.
11. Non-blocking assignments for testbench logic, blocking assignment for testbench clock. (avoid races)
11a. Watch out for implicitly clocked always blocks: a blocking assignment there will work *nearly every time* until a race condition happens in the simulation and impossible logic happens.
18. If a module contains latches which will hold state, then it must have a reset/clear. Else a reset of the surrounding logic would result in an inconsistent system (e.g.: a valid line from a latch stayed high after reset).
21. Make signal names evolve with the computation, in lexicographic order, by adding/changing suffixes. Thus search in text is easier, and waveform displays will be more organized. This naming scheme also makes it obvious when your code has a harmony, which is usually a sign that the implementation reflects the design.
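The advice in note 9, to generate enables instead of dividing clocks, can be sketched as below. The module name `Enable_Generator`, its ports, and the `DIVISOR` parameter are hypothetical names chosen for illustration.

```verilog
// Hypothetical sketch: a rate enable, synchronous to the main clock,
// used in place of a divided clock.
module Enable_Generator
#(
    parameter DIVISOR = 4   // one enable pulse every DIVISOR cycles, must be >= 2
)
(
    input  wire clock,
    input  wire clear,
    output reg  enable
);
    localparam                   COUNT_WIDTH = $clog2(DIVISOR);
    localparam [COUNT_WIDTH-1:0] COUNT_ZERO  = {COUNT_WIDTH{1'b0}};
    localparam [COUNT_WIDTH-1:0] COUNT_LAST  = DIVISOR - 1;

    reg [COUNT_WIDTH-1:0] count = COUNT_ZERO;

    initial begin
        enable = 1'b0;
    end

    always @(posedge clock) begin
        // Count main clock cycles, wrapping after DIVISOR cycles.
        if (clear || (count == COUNT_LAST)) begin
            count <= COUNT_ZERO;
        end
        else begin
            count <= count + 1'b1;
        end
        // One-cycle pulse each time the count wraps.
        enable <= (count == COUNT_LAST) && (clear == 1'b0);
    end
endmodule
```

Logic running at the slower rate stays on the main clock and only updates when `enable` is high (e.g.: `if (enable) begin state <= state_next; end`), so there is only one clock domain to constrain and analyze.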

Pulse, Credit, and Ready/Valid Handshakes

The sender maintains a counter initialized with a value of 2N, to account for the additional round-trip latency of the pipelined pulses. Every time an input pulse is sent, the sender decrements the counter by one. Every time an output pulse is received, the sender increments the counter by one. A simultaneous pair of input and output pulses leaves the counter unchanged. This counter keeps track of the *credit* of the sender. In other words, of how many input pulses it can send before receiving any output pulses.

The receiver maintains an identical counter as the sender, except it is…
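The sender's counter described above can be sketched as follows. The module name `Credit_Counter_Sender`, the port names, and the `can_send` output are assumptions for illustration. The sketch does not itself stop the sender from emitting a pulse when no credit remains, which the surrounding logic would have to guarantee.

```verilog
// Hypothetical sketch of the sender-side credit counter.
module Credit_Counter_Sender
#(
    parameter PIPE_DEPTH = 2    // N pipeline registers in each direction
)
(
    input  wire clock,
    input  wire clear,
    input  wire input_pulse_sent,       // the sender emitted an input pulse this cycle
    input  wire output_pulse_received,  // an output pulse returned this cycle
    output wire can_send                // at least one credit remains
);
    // Initial credit of 2N, covering the round trip through both pipelines.
    localparam CREDIT_INITIAL = 2 * PIPE_DEPTH;
    localparam COUNT_WIDTH    = $clog2(CREDIT_INITIAL + 1);

    reg [COUNT_WIDTH-1:0] credit = CREDIT_INITIAL;

    always @(posedge clock) begin
        if (clear) begin
            credit <= CREDIT_INITIAL;
        end
        else begin
            case ({input_pulse_sent, output_pulse_received})
                2'b10:   credit <= credit - 1'b1;  // sent a pulse: spend one credit
                2'b01:   credit <= credit + 1'b1;  // pulse returned: regain one credit
                default: credit <= credit;         // neither or both: unchanged
            endcase
        end
    end

    assign can_send = (credit != 0);
endmodule
```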