Logic Design Principles

Retiming and Pipelining

Where to pipeline a circuit isn't always obvious, and may change depending on which logic optimizations happen, and how the logic is laid out on the FPGA. However, we can rely on the CAD tool's register retiming passes to do the work for us. In general, forward retiming, from input towards output, works better than backwards retiming. There are fewer logical restrictions on forward retiming, and it's the only retiming supported by Vivado's physical optimization passes.

First, we design the circuit to be pipelined as-is, without extra registers, keeping it readable and general and in its own module. Then, we add one or more registers to the inputs of the circuit, which is easily done with a simple pipeline module in front of the other module, and reduces changing the amount of pipelining to a single module parameter, regardless of the implementation of the circuit.

Note that there are limits to retiming. For circuits which would need very deep pipelines (e.g. a 128-bit adder), the CAD tool may give up and not do as much forward retiming as could be done, limiting the performance. You will then have to add pipeline registers manually, or use alternate implementation tradeoffs.

If there is a loop in your circuit, then that would block any registers from being retimed from outside to inside the loop. Instead place the pipeline registers to be retimed just inside the start of the loop, where they can be retimed forward both into the feedback part of the loop and past the output of the loop.

Design

Split Operation from Implementation, then Implementation into Control and Data. (Operation: What does it do? Implementation: How does it do it? Datapath: What calculations must happen? Controlpath: When and in what order must the calculations happen?)

Make the code reflect the design: don't make the reader reverse-engineer the design from the implementation. The first step is to create a set of modules which reflect the design, but whose interfaces say as little as possible about their implementation. Thus, a modular design means debugging more of the *design* than its implementation. Writing smaller modules means a smaller [...]
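The pipelining recipe from Retiming and Pipelining above (keep the circuit unpipelined in its own module, then put a register pipeline in front of its inputs) might look like the following minimal sketch. All module, port, and parameter names here (`Register_Pipeline_Sketch`, `Adder_Pipelined_Sketch`, `PIPE_DEPTH`, and so on) are hypothetical and chosen only for illustration, and the sketch assumes `PIPE_DEPTH` is at least 1. Whether and how far a given CAD tool retimes these registers forward depends on its settings, which are not shown here.

```verilog
// Hypothetical sketch: a plain register pipeline placed in front of an
// unpipelined adder, so the pipeline depth is a single parameter.

module Register_Pipeline_Sketch
#(
    parameter WORD_WIDTH = 8,
    parameter PIPE_DEPTH = 2    // Assumed to be >= 1 in this sketch
)
(
    input  wire                     clock,
    input  wire [WORD_WIDTH-1:0]    pipe_in,
    output wire [WORD_WIDTH-1:0]    pipe_out
);
    // One register per pipeline stage, with no reset.
    reg [WORD_WIDTH-1:0] stage [PIPE_DEPTH-1:0];
    integer i;

    always @(posedge clock) begin
        stage[0] <= pipe_in;
        for (i = 1; i < PIPE_DEPTH; i = i + 1) begin
            stage[i] <= stage[i-1];
        end
    end

    assign pipe_out = stage[PIPE_DEPTH-1];
endmodule

module Adder_Pipelined_Sketch
#(
    parameter WORD_WIDTH = 8,
    parameter PIPE_DEPTH = 2
)
(
    input  wire                     clock,
    input  wire [WORD_WIDTH-1:0]    a,
    input  wire [WORD_WIDTH-1:0]    b,
    output wire [WORD_WIDTH-1:0]    sum
);
    wire [WORD_WIDTH-1:0] a_piped;
    wire [WORD_WIDTH-1:0] b_piped;

    // All the pipeline registers sit at the inputs of the logic.
    Register_Pipeline_Sketch
    #(
        .WORD_WIDTH (2*WORD_WIDTH),
        .PIPE_DEPTH (PIPE_DEPTH)
    )
    input_pipeline
    (
        .clock      (clock),
        .pipe_in    ({a, b}),
        .pipe_out   ({a_piped, b_piped})
    );

    // The circuit itself stays unpipelined and readable.
    assign sum = a_piped + b_piped;
endmodule
```

Changing the amount of pipelining then only means changing `PIPE_DEPTH` where `Adder_Pipelined_Sketch` is instantiated; the adder logic itself is untouched.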
18. If a module contains latches which will hold state, then it must have a reset/clear. Else a reset of the surrounding logic would result in an inconsistent system (e.g.: a valid line from a latch stayed high after reset).

8. Clock domain crossing (CDC) implies asynchrony, which means no notion of time, only of sequence, hence the 2-phase/4-phase handshakes must be present inside non-trivial CDC designs.

Scaling

22. As physical area increases, if there are any cycles in dependency (not a straight pipeline), then area becomes a limiting factor for speed. Also, control signals being distributed can have too far to travel.

23. As a design gets large, routing may not get as good a result (congestion, too much effort), and so the critical paths become bad routes and bit-reductions that cannot be pipelined, despite using carry-chains for faster calculations.

24. Design paradigm: all modules with ready/valid handshakes at input/output/control to allow half/skid-buffers to be added as necessary to avoid the above timing problems, without altering functionality. Are we at Kahn Networks now?

25. Follow up on above: can we make it so the handshake logic adds no latency or optimizes away when possible? (combinational paths without buffering)


Constants and Optimization

Constants are the first tool to optimize logic and clarify code.

Expect the existence of a logic optimizer in your CAD tool, and make use of it by expressing your logic without case-specific optimizations, then let the constant parameters which define a specific case optimize the logic down to a reduced form. A constant input into a 2-input (dyadic) Boolean gate reduces the gate to a constant, a wire, or an inverter, which can then further simplify downstream logic.

Constants allow you to express the meaning of numbers in your code, especially if you can construct the constant from other constants at elaboration time (e.g.: calculating localparam values from module parameters). And using constants allows expressing pattern matching, masking, and bit reductions using a single Verilog idiom, which makes the code easier to design, read, and debug. Define constants near the start of your module, or just before their use if the meaning is quite local.

    parameter WORD_WIDTH = 23
    ...
    localparam ALL_ONES = {WORD_WIDTH{1'b1}};
    localparam ALL_ZERO = {WORD_WIDTH{1'b0}};
    localparam MSB_ONLY = {1'b1,{WORD_WIDTH-1{1'b0}}};
    ...
    always @(*) begin
        and_reduction = (foo == ALL_ONES);
        or_reduction  = (foo != ALL_ZERO);
        negative      = ((foo & MSB_ONLY) != ALL_ZERO);
    end
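To make the point about constant inputs to dyadic Boolean gates concrete, here is a small sketch. The module name `Word_Constant_Fold_Sketch` and the `MASK` and `INVERT` parameters are hypothetical, invented for this illustration. The logic is written once, in general form, and the expectation (an assumption about the optimizer, not a guarantee) is that constant parameter bits let the CAD tool fold each bit position down to a constant, a wire, or an inverter.

```verilog
// Hypothetical sketch: general logic specialized by constant parameters.
// For each bit i:
//   MASK[i] = 0                -> output bit is the constant INVERT[i]
//   MASK[i] = 1, INVERT[i] = 0 -> output bit is a plain wire from the input
//   MASK[i] = 1, INVERT[i] = 1 -> output bit is the inverted input

module Word_Constant_Fold_Sketch
#(
    parameter                   WORD_WIDTH = 8,
    parameter [WORD_WIDTH-1:0]  MASK       = {WORD_WIDTH{1'b1}},
    parameter [WORD_WIDTH-1:0]  INVERT     = {WORD_WIDTH{1'b0}}
)
(
    input  wire [WORD_WIDTH-1:0]    word_in,
    output wire [WORD_WIDTH-1:0]    word_out
);
    // No case-specific code: the constants select the specific case.
    assign word_out = (word_in & MASK) ^ INVERT;
endmodule
```

With the default parameters, the whole module should reduce to wires, so leaving it in a design costs nothing after optimization.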

Implementation Issues

21. Make signal names evolve with the computation, in lexicographic order, by adding/changing suffixes. Thus search in text is easier, and waveform displays will be more organized. This naming scheme also makes it obvious when your code has a harmony, which is usually a sign that the implementation reflects the design.

9. Don't generate/divide clocks: generate enables for the desired rate, synchronous to the main clock.

11. Non-blocking assignments for testbench logic, blocking assignment for testbench clock. (avoid races)

11a. Watch out for implicitly clocked always blocks: a blocking assignment there will work *nearly every time* until a race condition happens in the simulation and impossible logic happens.
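As an illustration of note 9 above, the sketch below generates a one-cycle enable every `DIVISOR` clock cycles instead of dividing the clock, and uses it to update a register at the slower rate. The module names (`Enable_Pulse_Sketch`, `Slow_Register_Sketch`) and parameters are hypothetical, and the sketch assumes `DIVISOR` is at least 2 and that `COUNT_WIDTH` is wide enough to hold `DIVISOR-1`.

```verilog
// Hypothetical sketch: a rate enable, synchronous to the main clock.

module Enable_Pulse_Sketch
#(
    parameter DIVISOR     = 4,  // Assumed >= 2
    parameter COUNT_WIDTH = 2   // Assumed wide enough to hold DIVISOR-1
)
(
    input  wire clock,
    input  wire clear,
    output reg  enable
);
    localparam [COUNT_WIDTH-1:0] COUNT_ZERO = {COUNT_WIDTH{1'b0}};
    localparam [COUNT_WIDTH-1:0] COUNT_LAST = DIVISOR - 1;

    reg [COUNT_WIDTH-1:0] count = COUNT_ZERO;

    initial enable = 1'b0;

    always @(posedge clock) begin
        if (clear == 1'b1) begin
            count  <= COUNT_ZERO;
            enable <= 1'b0;
        end
        else begin
            // Wrap the count, and pulse enable for one cycle at each wrap.
            count  <= (count == COUNT_LAST) ? COUNT_ZERO : count + 1'b1;
            enable <= (count == COUNT_LAST);
        end
    end
endmodule

// Everything stays on the main clock: the enable only gates the update.
module Slow_Register_Sketch
#(
    parameter WORD_WIDTH = 8
)
(
    input  wire                     clock,
    input  wire                     enable,
    input  wire [WORD_WIDTH-1:0]    data_in,
    output reg  [WORD_WIDTH-1:0]    data_out
);
    initial data_out = {WORD_WIDTH{1'b0}};

    always @(posedge clock) begin
        if (enable == 1'b1) begin
            data_out <= data_in;
        end
    end
endmodule
```

Because there is only one clock, no clock domain crossing is introduced between the fast and slow logic.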

Signalling

7. Hierarchy: pulses, ready/valid, 4-phase, 2-phase (TODO: chapter to introduce/explain them, why/where use them, etc...)

2. Ready/valid handshakes allow for action every clock cycle if possible and compose well. Pulses necessarily limit your max throughput by half. They are a layer of abstraction.

3. Pulse generators and pulse latches remove many FSMs and can fix race conditions (ABA problems).

14. Since the `clear` input, a synchronous reset, is a signal that changes the output, and is derived from some control path logic, it must trigger an "output updated" signal in pulse interfaces like any other `load` or `valid` input signal, though only as a single pulse since the output only changes once even with a steady `clear`. This also implies that `clear` must be pipelined, separately and in parallel if necessary, and not broadcast to all units, which means similar latency, and better routing and distribution.

15. Amend that last one: don't signal an "output updated" value, as consecutive calculations can have identical results (so a Word Change Detector is no good here), and multiple commands can update a given output. Instead, signal when a command is done. That "done" signal also denotes when the given output is valid to sample. Internally latch the command so we know which one to report as "done" if multiple commands update one internal object (itself with a "done" signal). This does not allow concurrency or queueing by itself.

16. Following up on last one: using a "done" signal on commands also separates control and data paths.

17. Possible general design principle: connect datapaths with ready/valid handshakes at boundaries, enumerate ready/valid actions as a 2-input truth table, and the result is computed by one of the 16 dyadic Boolean functions.
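The ready/valid notes above (2 and 17) can be illustrated with a minimal single-register handshake stage. This is a sketch under assumptions: the module name `Handshake_Register_Sketch` and all port names are hypothetical, and the design deliberately leaves a combinational path from `out_ready` to `in_ready`, which is the kind of path a skid buffer would be added to break.

```verilog
// Hypothetical sketch: one register with ready/valid on both sides.
// Each action is a dyadic Boolean function of a valid and a ready.

module Handshake_Register_Sketch
#(
    parameter WORD_WIDTH = 8
)
(
    input  wire                     clock,
    input  wire                     clear,

    input  wire                     in_valid,
    output wire                     in_ready,
    input  wire [WORD_WIDTH-1:0]    in_data,

    output reg                      out_valid,
    input  wire                     out_ready,
    output reg  [WORD_WIDTH-1:0]    out_data
);
    initial begin
        out_valid = 1'b0;
        out_data  = {WORD_WIDTH{1'b0}};
    end

    // A transfer completes when valid and ready are both high (AND).
    wire load   = (in_valid  == 1'b1) && (in_ready  == 1'b1);
    wire unload = (out_valid == 1'b1) && (out_ready == 1'b1);

    // Accept data when empty, or when the held data leaves this cycle.
    assign in_ready = (out_valid == 1'b0) || (out_ready == 1'b1);

    always @(posedge clock) begin
        if (clear == 1'b1) begin
            out_valid <= 1'b0;
        end
        else if (load == 1'b1) begin
            // Covers both filling when empty and replacing departing data.
            out_data  <= in_data;
            out_valid <= 1'b1;
        end
        else if (unload == 1'b1) begin
            out_valid <= 1'b0;
        end
    end
endmodule
```

When both sides are always ready and valid, this stage moves one item per clock cycle, which is the "action every clock cycle" case from note 2.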

19. It's possible for a logic path to be too general, with constant inputs that don't change. This makes a slower path. Instead, have parallel, simpler paths which each hardcode one possible value of the unchanging inputs, then select at the end using the actual unchanging input. Pipelining may render this optimization moot. (e.g.: adders which depend on >3 inputs (data and control) are better split as separate simpler adders and a mux)

20. Following on above: an over-general logic path may indicate you are not implementing the complete algorithm, and should instead use all possible logic functions of the path.

Retiming and Pipelining

First, we design the circuit as-is, without extra pipeline registers, keeping it readable and general and in its own module. Then, we add one or more registers before the inputs of the module, which is easily done with a simple register pipeline module, and reduces changing the amount of pipelining to a single module parameter, regardless of the implementation of the circuit. Your CAD tool's register retiming optimizations will then spread the registers along the pipeline as needed.

Note that there are limits to retiming. For circuits which would need very deep pipelines (e.g. a 128-bit adder), the CAD tool may give up and not do as much forward retiming as could be done, limiting the performance. You will then have to add pipeline registers manually, or use alternate implementation tradeoffs, such as multi-precision arithmetic (./Adder_Subtractor_Binary_Multiprecision.html).

Design

Make the code reflect the design: don't make the reader reverse-engineer the design from the implementation. The first step is to create a set of modules which reflect the design, but whose interfaces say as little as possible about their implementation. A modular design means more debugging of the design itself than of its implementation, so bugs become more [...] small modules, even if they are internally complex (e.g.: Saturating Adder/Subtractor (./Adder_Subtractor_Binary_Saturating.html)).

Small, reusable modules allow you to construct a design using regular non-arbitrary idioms, which constrain the design space away from unexpected corner cases, and further suggest any other small modules which may not already exist. Arbitrary modules tend to require other arbitrary modules to function together. Regular modules tend to require other regular modules to function together.

How do we figure out what's a small, regular module? How do we seed the start of the library of modules from which designs are constructed?

  • The Operation [...] how to set the module parameters, as well as corner-cases and [...] story should be enough for another person to understand and use your module without having to look inside it. Writing the Operation first often improves the module design from the start, as you think of new cases to deal with.

  • The Implementation describes how the Operations are done, using functional modules which connect the inputs and outputs described in the Operation. It's the Block Diagram level of description. Given a good set of building block modules, you only need to know the Operation of each module, not its Implementation, which minimizes unintended leakage across abstractions, and so you avoid getting increasingly lost in details as the design gets larger and more complex. The Implementation also acts as the Operation of the Datapath and Controlpath, which form the two halves of the Implementation:

  • The Datapath describes which calculations must happen, and where the data flows from module to module. It describes the set of possible computations, but not their sequence or conditions. Designing the Datapath first and separately allows you to optimize it for the particular problem domain, and for the underlying FPGA architecture. This is where you solve problems of clock speed, pipelining, arithmetic precision, latency, storage, area, etc... At the end, you have inputs and outputs for computation, and inputs and outputs for control.

  • The Controlpath describes when and in which order the Datapath calculations [...] up embedded inside the Datapath (e.g.: Skid Buffers (./Pipeline_Skid_Buffer.html)), and for simple cases may be almost non-existent (e.g.: simple data-flow pipelines).
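One way to picture the split between Implementation, Datapath, and Controlpath is the small accumulator sketched below. Every name in it (`Accumulator_Sketch`, `total_clear`, `total_add`, and so on) is hypothetical and invented for this illustration. The Datapath only offers the possible computations and exposes control inputs, the Controlpath decides which one happens each cycle, and the Implementation is just the block diagram wiring them together.

```verilog
// Hypothetical sketch of the Datapath/Controlpath split.

// Datapath: which calculations can happen. No sequencing decisions.
module Accumulator_Datapath_Sketch
#(
    parameter WORD_WIDTH = 8
)
(
    input  wire                     clock,
    input  wire                     total_clear,    // control input
    input  wire                     total_add,      // control input
    input  wire [WORD_WIDTH-1:0]    data_in,        // computation input
    output reg  [WORD_WIDTH-1:0]    total           // computation output
);
    localparam WORD_ZERO = {WORD_WIDTH{1'b0}};

    initial total = WORD_ZERO;

    always @(posedge clock) begin
        if (total_clear == 1'b1) begin
            total <= WORD_ZERO;
        end
        else if (total_add == 1'b1) begin
            total <= total + data_in;
        end
    end
endmodule

// Controlpath: when, and in which order, the calculations happen.
// In a case this simple it is almost non-existent.
module Accumulator_Controlpath_Sketch
(
    input  wire restart,
    input  wire in_valid,
    output wire total_clear,
    output wire total_add
);
    assign total_clear = (restart  == 1'b1);
    assign total_add   = (in_valid == 1'b1) && (restart == 1'b0);
endmodule

// Implementation: the block diagram connecting the two halves.
module Accumulator_Sketch
#(
    parameter WORD_WIDTH = 8
)
(
    input  wire                     clock,
    input  wire                     restart,
    input  wire                     in_valid,
    input  wire [WORD_WIDTH-1:0]    data_in,
    output wire [WORD_WIDTH-1:0]    total
);
    wire total_clear;
    wire total_add;

    Accumulator_Controlpath_Sketch
    controlpath
    (
        .restart        (restart),
        .in_valid       (in_valid),
        .total_clear    (total_clear),
        .total_add      (total_add)
    );

    Accumulator_Datapath_Sketch
    #(
        .WORD_WIDTH     (WORD_WIDTH)
    )
    datapath
    (
        .clock          (clock),
        .total_clear    (total_clear),
        .total_add      (total_add),
        .data_in        (data_in),
        .total          (total)
    );
endmodule
```

A user of `Accumulator_Sketch` only needs its Operation (restart clears the total, each valid input is added), not the contents of either half.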