Design

Split Operation from Implementation, then Implementation into Control and Data.
(Operation: what does it do? Implementation: how does it do it? Datapath: what
calculations must happen? Controlpath: when and in what order must the
calculations happen?)

Make the code reflect the design: don't make the reader reverse-engineer the
design from the implementation. The first step is to create a set of modules
which reflect the design, but whose interfaces say as little as possible about
their implementation. Thus, a modular design means debugging more of the
*design* than its implementation. Writing smaller modules means a smaller ...

Retiming and Pipelining

Where to pipeline a circuit isn't always obvious, and may change depending on
which logic optimizations happen and how the logic is laid out on the FPGA.
However, we can rely on the CAD tool's register retiming passes to do the work
for us. In general, forward retiming, from input towards output, works better
than backwards retiming. There are fewer logical restrictions on forward
retiming, and it's the only retiming supported by Vivado's physical
optimization passes.

First, we design the circuit as-is, without extra pipeline registers,
keeping it readable and general and in its own module. Then, we add one or more
registers before the inputs of the module, which is easily done with a
simple register pipeline module, and reduces changing the amount of pipelining
to a single module parameter, regardless of the implementation of the circuit.
Your CAD tool's register retiming optimizations will then spread the registers
along the pipeline as needed.

Note that there are limits to retiming. For circuits which would need very
deep pipelines (e.g. a 128-bit adder), the CAD tool may give up and not do as
much forward retiming as could be done, limiting the performance. You will then
have to add pipeline registers manually, or use alternate implementation
tradeoffs, such as multi-precision arithmetic
(./Adder_Subtractor_Binary_Multiprecision.html).

If there is a loop in your circuit, then that would block any registers from
being retimed from outside to inside the loop. Instead, place the pipeline
registers to be retimed just inside the start of the loop, where they can be
retimed forward both into the feedback part of the loop and past the output of
the loop. For an example, see the saturating accumulator
(./Accumulator_Binary_Saturating.html).
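As a rough illustration of the approach described in this section, the sketch below puts a simple register pipeline in front of an unpipelined adder. All module and port names here are hypothetical and are not taken from the author's library. The pipeline has no reset, on the assumption that plain registers are the easiest for a tool to retime. Depending on the CAD tool, retiming may also have to be enabled in the synthesis settings.

```verilog
`default_nettype none

// Hypothetical sketch: a plain pipeline of PIPE_DEPTH registers.
// No reset, so the CAD tool is free to move the registers.
module Register_Pipeline_Sketch
#(
    parameter WORD_WIDTH = 32,
    parameter PIPE_DEPTH = 2      // Must be 1 or more in this sketch
)
(
    input  wire                  clock,
    input  wire [WORD_WIDTH-1:0] pipe_in,
    output wire [WORD_WIDTH-1:0] pipe_out
);
    reg [WORD_WIDTH-1:0] stage [PIPE_DEPTH-1:0];
    integer i;

    always @(posedge clock) begin
        stage[0] <= pipe_in;
        for (i = 1; i < PIPE_DEPTH; i = i + 1) begin
            stage[i] <= stage[i-1];
        end
    end

    assign pipe_out = stage[PIPE_DEPTH-1];
endmodule

// The circuit to pipeline, written as-is: plain combinational logic.
module Adder_Sketch
#(
    parameter WORD_WIDTH = 32
)
(
    input  wire [WORD_WIDTH-1:0] a,
    input  wire [WORD_WIDTH-1:0] b,
    output wire [WORD_WIDTH-1:0] sum
);
    assign sum = a + b;
endmodule

// Registers placed before the inputs, to be retimed forward into the adder.
// The amount of pipelining is the single parameter PIPE_DEPTH.
module Pipelined_Adder_Sketch
#(
    parameter WORD_WIDTH = 32,
    parameter PIPE_DEPTH = 2
)
(
    input  wire                  clock,
    input  wire [WORD_WIDTH-1:0] a,
    input  wire [WORD_WIDTH-1:0] b,
    output wire [WORD_WIDTH-1:0] sum
);
    wire [WORD_WIDTH-1:0] a_pipelined;
    wire [WORD_WIDTH-1:0] b_pipelined;

    Register_Pipeline_Sketch
    #(
        .WORD_WIDTH (2 * WORD_WIDTH),
        .PIPE_DEPTH (PIPE_DEPTH)
    )
    input_pipeline
    (
        .clock    (clock),
        .pipe_in  ({a, b}),
        .pipe_out ({a_pipelined, b_pipelined})
    );

    Adder_Sketch
    #(
        .WORD_WIDTH (WORD_WIDTH)
    )
    adder
    (
        .a   (a_pipelined),
        .b   (b_pipelined),
        .sum (sum)
    );
endmodule
```

The adder itself never changes when the pipeline depth changes, which is the point of keeping it in its own module.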

Logic Design Principles

Make the code reflect the design directly: don't make the reader
reverse-engineer the design from the implementation.
Scaling

22. As physical area increases, if there are any cycles in dependency (not a straight pipeline), then area becomes a limiting factor for speed. Also, distributed control signals can have too far to travel.
23. As a design gets large, routing may not get as good a result (congestion, too much effort), and so critical paths become bad routes and bit-reductions that cannot be pipelined, despite using carry-chains for faster calculations.
24. Design paradigm: all modules with ready/valid handshakes at input/output/control to allow half/skid-buffers to be added as necessary to avoid above timing problems, without altering functionality. Are we at Kahn Networks now?
25. Follow up on above: can we make it so the handshake logic adds no latency or optimizes away when possible? (combinational paths without buffering)


Constants and Optimization

Constants are the first tool to optimize logic and clarify code.

Expect the existence of a logic optimizer in your CAD tool, and make use of
it by expressing your logic without case-specific optimizations, then let the
constant parameters which define a specific case optimize the logic down to a
reduced form. A constant input to a 2-input (dyadic) Boolean gate reduces the
gate to a constant, a wire, or an inverter, which can then further simplify
downstream logic.

Constants allow you to express the meaning of numbers in your code,
especially if you can construct the constant from other constants at
elaboration time (e.g.: calculating localparam values from module
parameters). And using constants allows expressing pattern matching, masking,
and bit reductions using a single Verilog idiom, which makes the code easier to
design, read, and debug. Define constants near the start of your module, or
just before their use if the meaning is quite local.

    parameter WORD_WIDTH = 23
    ...
    localparam ALL_ONES = {WORD_WIDTH{1'b1}};
    localparam ALL_ZERO = {WORD_WIDTH{1'b0}};
    localparam MSB_ONLY = {1'b1,{WORD_WIDTH-1{1'b0}}};
    ...
    always @(*) begin
        and_reduction = (foo == ALL_ONES);
        or_reduction  = (foo != ALL_ZERO);
        negative      = ((foo & MSB_ONLY) != ALL_ZERO);
    end

Small, Regular Modules

Create a set of conceptually small, regular modules which reflect the
design, but whose interfaces say as little as possible about their
implementation, even if they are internally complex (e.g.: Saturating
Adder/Subtractor, ./Adder_Subtractor_Binary_Saturating.html).

Small, reusable modules allow you to construct a design using regular
non-arbitrary design idioms (e.g.: ready/valid handshakes, pulse interfaces,
separated state and computation), which constrain the design space away from
unexpected corner cases, and further suggest any other small, reusable modules
which may not already exist. Arbitrary modules tend to require other arbitrary
modules to function together. Regular modules tend to require other regular
modules to function together.

How do we figure out what's a small, regular module? How do we seed the start of the library of modules from which designs are constructed?

Ready/Valid Handshake

Options: Carloni buffering (pipelines, but does not absorb stalls), and FIFO
(does not pipeline (spatially local) but absorbs stalls).

Credit-based Interface

Credit-based signalling extends pulse handshakes to maintain full
throughput over a long latency link, where many cycles can pass between sending
an input pulse to a remote module and it arriving, and many cycles again before
the output pulse returns after being sent by the remote module. This latency
can either be pure propagation delay, or from multiple pipeline stages inserted
to maintain a high clock frequency. As a tradeoff, the sender and receiver
must know the latency between them, or more specifically, the maximum number of
handshakes which can be in transit between the sender and receiver.

Abstractly, credit-based signalling involves a sender and a receiver with a
pulse handshake between them. However, instead of the sender and receiver being
directly connected, there is a number (N) of plain pipeline registers along the
input pulse path from the sender to the receiver, and another set of N pipeline
registers along the output pulse path from the receiver back to the sender.
Thus the total latency added to the pulse handshake, on top of the receiver's
latency, is 2N cycles.

Pulse Handshake

Pulse handshakes are a natural way to control computations which cannot
accept a new input every clock cycle due to either backwards dependencies in
the pipeline (loops), or due to an iterative algorithm (e.g.: long division).
A pulse handshake takes in a single cycle pulse to signal new input data, or
the start of a computation, or a change of state, and returns that same pulse
later to signal the data has been accepted, the computation completed, or the
state has changed. The input pulse is carried alongside the computation, and
eventually reaches the output at the same time as the result. It is very easy
to do this with a 1-bit pipeline matching the main pipeline's path, or with a
counter. Thus, the internal pipeline can be adjusted without changing the
interface.

However, a pulse handshake limits the throughput to a maximum of one-half
the clock frequency. At the limit, the output pulse is emitted on the next
cycle after the input pulse, which means we can only send an input pulse every
other clock cycle. Simultaneous input and output pulses during the same clock
cycle imply combinational logic, and so are redundant, since the clock itself
becomes the signalling to the sequential logic connected to this combinational
logic. In other words, you have a conventional pipeline which never stops.

A multi-cycle input pulse implies multiple consecutive calculations, up to a
certain limit, and thus one or more single or multi-cycle output pulses,
depending on the latency of each computation, which can be data-dependent and
possibly complete out of order. Keeping track of how many input pulses have
been sent relative to the number of output pulses received creates a
credit-based signalling system.
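To make the "1-bit pipeline matching the main pipeline's path" concrete, here is a minimal sketch of a fixed-latency computation with a pulse handshake. The module name, the port names, and the placeholder computation (adding one) are all assumptions made for illustration. Only the structure is the point: the pulse travels through the same number of stages as the data.

```verilog
`default_nettype none

// Hypothetical sketch: the input pulse is carried alongside the data and
// reaches the output on the same cycle as the result.
module Pulse_Handshake_Sketch
#(
    parameter WORD_WIDTH = 16,
    parameter LATENCY    = 3      // Pipeline stages, 1 or more in this sketch
)
(
    input  wire                  clock,
    input  wire                  input_pulse,   // High for one cycle with new data
    input  wire [WORD_WIDTH-1:0] input_data,
    output wire                  output_pulse,  // High for one cycle with the result
    output wire [WORD_WIDTH-1:0] output_data
);
    reg [WORD_WIDTH-1:0] data_stage  [LATENCY-1:0];
    reg                  pulse_stage [LATENCY-1:0];
    integer i;
    integer j;

    // Start with an empty pipeline, so no spurious output pulse appears.
    initial begin
        for (j = 0; j < LATENCY; j = j + 1) begin
            data_stage[j]  = {WORD_WIDTH{1'b0}};
            pulse_stage[j] = 1'b0;
        end
    end

    always @(posedge clock) begin
        // Placeholder computation (assumption): add one to the input.
        data_stage[0]  <= input_data + 1'b1;
        pulse_stage[0] <= input_pulse;
        for (i = 1; i < LATENCY; i = i + 1) begin
            data_stage[i]  <= data_stage[i-1];
            pulse_stage[i] <= pulse_stage[i-1];
        end
    end

    assign output_data  = data_stage[LATENCY-1];
    assign output_pulse = pulse_stage[LATENCY-1];
endmodule
```

Changing LATENCY changes the internal pipeline depth without changing the ports, matching the claim that the pipeline can be adjusted without altering the interface.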

Signalling

Signalling deals with coordinating multiple Datapaths and Controlpaths, or
individual Datapath stages. Good signalling design can eliminate small FSMs in
the Controlpath, or at least reduce the number of states and transitions. We
refer to these signalling systems as "handshakes" since they coordinate
agreement between a sender and a receiver.


Design Abstractions

Separate Operation from Implementation, and then separate Implementation
into Controlpath and Datapath.

Operation

The Operation describes what the module does, how it works, and how you use
it, without knowledge of internals. You write this up as a story of
comments at the start of your module, starting with the purpose and adding
more detail up to the expected usage and behaviour of each input and output,
how to set the module parameters, as well as corner-cases and limitations. This
story should be enough for another person to understand and use your module
without having to look inside it. Writing the Operation first often improves
the module design from the start, as you think of new cases to deal with.

Implementation

The Implementation describes how the Operations are done, using
functional modules which connect the inputs and outputs described in the
Operation. It's the Block Diagram level of description. Given a good set of
building block modules, you only need to know the Operation of each module, not
its Implementation, which minimizes unintended leakage across abstractions, and
so you avoid getting increasingly lost in details as the design gets larger and
more complex.

The Implementation also acts as the Operation of the Datapath and
Controlpath, which form the two halves of the Implementation. You must keep
the Datapath and the Controlpath separate to avoid having to explicitly
describe every possible combination of states, inputs, and outputs, which makes
for tedious and long code which doesn't describe the design itself, but
enumerates its possible behaviours (e.g.: large case statements containing
nested conditionals). The reader must then reverse-engineer the design through
this extra layer of abstraction, usually while debugging!

Datapath

The Datapath describes which calculations must happen, and where
the data flows from module to module. It describes the set of possible
computations, but not their entire sequence or conditions. Designing the
Datapath first and separately allows you to optimize it to your particular
problem and to the underlying FPGA hardware. This is where you solve problems
of clock speed, pipelining, arithmetic precision, latency, storage, area,
etc... At the end, you have inputs and outputs for computation, and inputs and
outputs for control.

It's possible for a Datapath to be too general, with logic functions going
unused and a larger and slower implementation than necessary. Having data
inputs which are set once at run-time and remain constant is a good indicator
of an over-general Datapath. Instead, if the unchanging data inputs only ever
take a single value, replace them with a module parameter or localparam to
enable logic optimization. Otherwise, if a few possible values exist, hardwire
them to a multiplexer connected to the constant data input. This change
converts the data input to a control input (the mux selector input) with fewer
possible values, which results in simpler, optimized logic.

Controlpath

The Controlpath describes when and in which order the Datapath
calculations must happen, possibly in a data-dependent way depending on the
control outputs from the Datapath. This is where you have handshaking,
sequencing, internal state, conditions, any data-dependent calculation
shortcuts, etc... Note that some of the Controlpath may end up embedded inside
the Datapath (e.g.: Skid Buffers), and for simple cases may be almost
non-existent (e.g.: data-flow pipelines).

Implementation Issues

21. Make signal names evolve with the computation, in lexicographic order, by adding/changing suffixes. Thus search in text is easier, and waveform displays will be more organized. This naming scheme also makes it obvious when your code has a harmony, which is usually a sign that the implementation reflects the design.
9. Don't generate/divide clocks: generate enables for the desired rate, synchronous to the main clock.
11. Non-blocking assignments for testbench logic, blocking assignment for testbench clock. (avoid races)
11a. Watch out for implicitly clocked always blocks: a blocking assignment there will work *nearly every time* until a race condition happens in the simulation and impossible logic happens.
18. If a module contains latches which will hold state, then it must have a reset/clear. Else a reset of the surrounding logic would result in an inconsistent system (e.g.: a valid line from a latch stayed high after reset).
8. CDC implies asynchrony, which means no notion of time, only of sequence, hence the 2ph/4ph handshakes must be present inside non-trivial CDC designs.
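As one possible reading of the advice to generate enables instead of divided clocks, the sketch below produces a one-cycle enable every DIVISOR cycles and uses it in logic that stays on the main clock. The module names, ports, and the 8-bit example counter are assumptions for illustration only.

```verilog
`default_nettype none

// Hypothetical sketch: a one-cycle enable pulse every DIVISOR clock cycles,
// synchronous to the main clock. No new clock is created.
module Enable_Generator_Sketch
#(
    parameter DIVISOR       = 10,   // 2 or more in this sketch
    parameter COUNTER_WIDTH = 4     // Must be wide enough to hold DIVISOR-1
)
(
    input  wire clock,
    output reg  enable
);
    localparam [COUNTER_WIDTH-1:0] COUNT_ZERO = {COUNTER_WIDTH{1'b0}};
    localparam [COUNTER_WIDTH-1:0] COUNT_LAST = DIVISOR - 1;

    reg [COUNTER_WIDTH-1:0] count = COUNT_ZERO;

    initial begin
        enable = 1'b0;
    end

    always @(posedge clock) begin
        enable <= (count == COUNT_LAST);
        count  <= (count == COUNT_LAST) ? COUNT_ZERO : count + 1'b1;
    end
endmodule

// Downstream logic remains clocked by the main clock and only updates
// when the enable is high.
module Slow_Counter_Sketch
(
    input  wire       clock,
    input  wire       enable,
    output reg  [7:0] slow_count
);
    initial begin
        slow_count = 8'd0;
    end

    always @(posedge clock) begin
        if (enable == 1'b1) begin
            slow_count <= slow_count + 8'd1;
        end
    end
endmodule
```

Because everything stays in one clock domain, no clock domain crossing is introduced between the fast and slow logic.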

4-phase and 2-phase Asynchronous Handshakes

Pulse, Credit, and Ready/Valid Handshakes

7. Hierarchy: pulses, credit, ready/valid, 4-phase, 2-phase (TODO: chapter to introduce/explain them, why/where to use them, etc...)
2. Ready/valid handshakes allow for action every clock cycle if possible and compose well. Pulses necessarily limit your max throughput by half. They are a layer of abstraction.
3. Pulse generators and pulse latches remove many FSMs and can fix race conditions (ABA problems).
14. Since the `clear` input, a synchronous reset, is a signal that changes the output, and is derived from some control path logic, it must trigger an "output updated" signal in pulse interfaces as any other `load` or `valid` input signal, though only as a single pulse since the output only changes once even with a steady `clear`. This also implies that `clear` must be pipelined, separately and in parallel if necessary, and not broadcast to all units, which means similar latency, and better routing and distribution.
15. Amend that last one: don't signal an "output updated" value, as consecutive calculations can have identical results (so a Word Change Detector is no good here), and multiple commands can update a given output. Instead, signal when a command is done. That "done" signal also denotes when the given output is valid to sample. Internally latch the command so we know which one to report as "done" if multiple commands update one internal object (itself with a "done" signal). This does not allow concurrency or queueing by itself.
16. Following up on last one: using a "done" signal on commands also separates control and data paths.
17. Possible general design principle: connect datapaths with ready/valid handshakes at boundaries, enumerate ready/valid actions as 2-input truth table, result is computed by one of the 16 dyadic Boolean functions.
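The notes above mention pulse generators and pulse latches without defining them. The sketch below shows one common interpretation, stated here as an assumption: a pulse generator turns a rising edge into a single-cycle pulse, and a pulse latch holds a received pulse as a level until cleared. The module and port names are hypothetical, and the author's own modules may behave differently (for example, in whether the latch output also passes the pulse combinationally).

```verilog
`default_nettype none

// Hypothetical sketch: emits a one-cycle pulse on each rising edge of level_in.
module Pulse_Generator_Sketch
(
    input  wire clock,
    input  wire level_in,
    output wire pulse_out
);
    reg level_previous = 1'b0;

    always @(posedge clock) begin
        level_previous <= level_in;
    end

    // High only during the first cycle where level_in is high.
    assign pulse_out = (level_in == 1'b1) && (level_previous == 1'b0);
endmodule

// Hypothetical sketch: remembers that a pulse arrived, until cleared.
// In this sketch, clear takes priority over a simultaneous pulse.
module Pulse_Latch_Sketch
(
    input  wire clock,
    input  wire clear,
    input  wire pulse_in,
    output reg  level_out
);
    initial begin
        level_out = 1'b0;
    end

    always @(posedge clock) begin
        if (pulse_in == 1'b1) begin
            level_out <= 1'b1;
        end
        if (clear == 1'b1) begin
            level_out <= 1'b0;
        end
    end
endmodule
```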

The sender maintains a counter initialized with a value of 2N, to account
for the additional round-trip latency of the pipelined pulses. Every time an
input pulse is sent, the sender decrements the counter by one. Every time an
output pulse is received, the sender increments the counter by one. A
simultaneous pair of input and output pulses leaves the counter unchanged. This
counter keeps track of the credit of the sender. In other words, of
how many input pulses it can send before receiving any output pulses.

The receiver maintains an identical counter as the sender, except it is
initialized to zero. The receiver increments the counter every time an input
pulse arrives ...

Use case: long latency due to distance and pipelining, and a need to maintain
full throughput. Upside: simplest pipelining (no skid buffers). Downside: 2N
local buffers for N-long pipeline, and creates a dependency between the
configuration of the end modules and of the pipeline.
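The sender-side counter described above could be sketched as follows. The module name, the `clear` input, and the `can_send` output are assumptions added to make the example usable. The text only specifies the initial value of 2N and the increment/decrement rules.

```verilog
`default_nettype none

// Hypothetical sketch of the sender's credit counter.
module Credit_Counter_Sender_Sketch
#(
    parameter PIPE_DEPTH    = 4,    // N pipeline registers in each direction
    parameter COUNTER_WIDTH = 8     // Must be wide enough to hold 2N
)
(
    input  wire clock,
    input  wire clear,                  // Assumption: synchronous reset to full credit
    input  wire input_pulse_sent,       // The sender emitted an input pulse this cycle
    input  wire output_pulse_received,  // An output pulse returned this cycle
    output wire can_send                // Assumption: high while credit remains
);
    localparam [COUNTER_WIDTH-1:0] CREDIT_INITIAL = 2 * PIPE_DEPTH;
    localparam [COUNTER_WIDTH-1:0] CREDIT_ZERO    = {COUNTER_WIDTH{1'b0}};

    reg [COUNTER_WIDTH-1:0] credit = CREDIT_INITIAL;

    always @(posedge clock) begin
        case ({input_pulse_sent, output_pulse_received})
            2'b10:   credit <= credit - 1'b1;  // Sent only: spend one credit
            2'b01:   credit <= credit + 1'b1;  // Received only: regain one credit
            default: credit <= credit;         // Neither or both: unchanged
        endcase
        // Last assignment wins, so clear overrides the case above.
        if (clear == 1'b1) begin
            credit <= CREDIT_INITIAL;
        end
    end

    assign can_send = (credit != CREDIT_ZERO);
endmodule
```

The sender would be expected to raise `input_pulse_sent` only while `can_send` is high. This sketch does not enforce that.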