This project has run its course.
The code is still available.
If you have any questions, please contact me: firstname.lastname@example.org or @elaforest
The Octavo soft-processor was my doctoral research project aimed at building FPGA overlay architectures by increasing the performance of soft-processors through adapting their architecture to FPGAs instead of ASICs. I continued it after graduation to clean up the code and implement the "Further Works" section of the thesis. The work here spans from late 2010 to early 2019.
You can get the complete source from the Octavo GitHub Repository.
Increasing the performance of a soft-processor would allow moving more of an FPGA design into software, and reduce the number of time-consuming hardware re-spins. It should also simplify the design of both the hardware and software:
Unfortunately, it's possible to overshoot these goals and end up with a system where the software is harder to write than the hardware. Octavo can do up to 4x as much work as any conventional scalar soft-processor (as little as half as many cycles on twice as fast a clock), but this comes at the cost of code which is either too hard to reliably write manually, or requires complex software to assist the programmer, and it must be split across 8 threads.
For example, the assembler I wrote to support enough of Octavo to write micro-benchmarks is 2160 lines of Python implementing 35 classes, and has multiple phases (parsing, data and code allocation, data dependency resolution, code generation, and binary image generation). It more closely resembles a crude compiler. In hindsight, this complexity is inevitable, since managing the configuration and allocation of the opcodes, branching logic, pointer logic, and ALU state all boils down to a problem of calculating register allocations, initialization data, variable lifetimes, and code hoisting (even if the programmer can easily do that last one).
The software complexity problem propagates upwards too. Since Octavo's hardware is so different, there is no existing C compiler that can support it, and although possible, it would be a whole separate project to add support to GCC or LLVM. Without C support, connecting Octavo to existing software systems becomes a manual, tedious process, cancelling out the simplified hardware design process.
Briefly put, too much of the design difficulty had been transferred to the software. It would now be less design effort to gain performance by adding accelerator pipelines tightly-coupled (i.e.: not hanging off a bus or interconnect) to a (mostly) conventional soft-processor with compiler support.
Can we transfer what was learned from Octavo to conventional soft-processors, and is it worth doing so? Let's consider a few possible cases and estimate the performance gains that are theoretically possible.
Our baseline for comparison is a conventional scalar RISC soft-processor which runs 1 thread, at 1x clock frequency, and has a CPI of 1.3. (This is a slightly optimistic ballpark value for a good scalar RISC design executing general-purpose code.) At the other end, we have Octavo v3 which runs 8 threads, at 2x the clock frequency, and has a CPI of 1.0.
From Octavo's reduced CPI and increased clock frequency we can calculate that, all other things being equal, Octavo has 2.6x the performance of the scalar soft-processor. In reality, the difference can be as much as 4x, so there's also up to a 1.54x reduction in Octavo's executed instruction count (i.e.: about two-thirds as many instructions), which sounds about right (e.g.: roughly -25% for branches now being zero-cycle, and another -10% for zero-cycle pointers and other little ALU tricks).
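As a sanity check, the estimate can be reproduced with a little arithmetic, using only the figures quoted above:

```python
# Back-of-the-envelope speedup model comparing a conventional scalar
# soft-processor (CPI 1.3, 1x clock) to Octavo v3 (CPI 1.0, 2x clock).

def speedup(clock_ratio, cpi_base, cpi_new, instr_ratio=1.0):
    """Overall speedup = clock gain * CPI gain * instruction-count gain."""
    return clock_ratio * (cpi_base / cpi_new) * instr_ratio

# All other things being equal: 2x clock, CPI 1.3 -> 1.0
base = speedup(2.0, 1.3, 1.0)    # 2.6

# The observed difference reaches 4x, so the implied reduction in
# executed instructions is:
instr_reduction = 4.0 / base     # ~1.54, i.e. about 2/3 as many instructions

print(round(base, 2), round(instr_reduction, 2))  # 2.6 1.54
```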
The first obvious step is to convert the scalar soft-processor to a round-robin multithreaded design, which runs 8 threads, at 2x the clock frequency, with a CPI of 1.0, and no change in instruction count. Having eight threads is merely a convenient power of two and matches a likely number of stages for the pipeline. Plus, any number of threads greater than one already brings in the same programming complications. The longer pipeline and round-robin multithreading mean the usual pipeline stalls from branches won't happen, those from loads are minimized, and the ALU result feedback paths are no longer necessary, simplifying the pipeline, granting the 2x clock frequency increase (about as much as physically possible), and dropping CPI to 1.0.
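The round-robin schedule itself is trivial, and a toy model shows why the hazards vanish: with 8 threads on an 8-stage pipeline, a thread's next instruction only enters the pipeline after its previous one has fully completed.

```python
# Toy round-robin hardware thread scheduler: consecutive instructions
# from the same thread are always THREADS cycles apart, so no
# forwarding or stall logic is needed.

THREADS = 8

def thread_for_cycle(cycle):
    return cycle % THREADS

# Two consecutive instructions of thread 3 enter at cycles 3 and 11:
entries = [c for c in range(16) if thread_for_cycle(c) == 3]
print(entries)  # [3, 11]
```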
From the clock frequency and CPI changes, we can estimate a 2.6x increase in performance, if you can use all threads. The design of accelerators is also improved, since we now have 8 cycles to perform any computation (at 2x clock frequency) before returning a result to a thread after one thread instruction of latency. The same applies to memory accesses. If your application is already parallel, multithreading a scalar soft-processor is a definite improvement: the code is unchanged, your software runs up to 2.6x faster, on less area than 2 or 3 scalar CPUs, and any accelerator you add has its performance potentially doubled (assuming everything runs on the same clock).
However, since even in simple cases an application accelerator attached to a RISC scalar soft-processor grants speedups of many orders of magnitude, all this effort for a 2.6x speedup in software doesn't seem worthwhile. It would be better to keep the software very simple, compatible, in charge of the control and other irregular computations, and move the heavy, repetitive, regular computations to an accelerator. Both hardware and software design effort are reduced this way.
Other than round-robin multithreading a sufficiently long pipeline to maximize clock frequency, there are still some design ideas in Octavo which would benefit conventional scalar CPUs, without complicating the software in incompatible ways:
Transferring data to/from an accelerator through loads/stores is simple, but since many functions take at least two arguments and return one, loads/stores are a bottleneck. If there were a way to pass the contents of the source and destination registers of an instruction directly to an accelerator, even if the destination register gets its result several cycles later, we could keep accelerators busier and interface with them with less overhead. It would make using smaller accelerators worthwhile, giving a finer breakdown of the work that is less dependent on the exact application being accelerated (i.e.: a smaller semantic gap).
Similarly, at the limit where data is not reused internally, either due to simple code or an already hardware-accelerated computation, the memory access stage becomes a bottleneck, and all computation tends towards the form "load, load, compute, store". To reduce this overhead, we would need dual and parallel memory pipeline stages to queue up (and overlap) two ordinary load and/or store instructions. A dual write port register file would allow dual loads to complete in one cycle rather than consecutively, but unless the memory latency is very small, simply overlapping otherwise consecutive memory latencies will have a greater benefit. Dual write ports on FPGAs are also costly in area and speed.
Although Octavo (v1 and v2) was originally aimed at Altera's Stratix IV FPGA, it performs pretty well on other Altera devices. It generally runs twice as fast as a NiosII/f, and gets fairly close to the absolute upper clock frequency limit allowed by the FPGA hardware.
The Fmax of Octavo v3 is 5% lower than that of Octavo v2, but v3 operates with less overhead and a more powerful ISA. On an Intel Cyclone V (5CGXFC5C6F27C7), an Octavo v3 instance with Multiplier and Accumulator accelerators uses 2245 ALMs (1808 LUTs and 4854 flip-flops), 12 M10K Block RAMs, 4 DSP blocks, and runs at up to 222 MHz (out of a max possible of 275 MHz). You could fit about 10 such Octavo instances on this FPGA.
The critical path is the calculation of flags between the ALU and the Branching Modules, which architecturally could not be pipelined, else they would fall out of sync with the current thread or introduce a second clock cycle of latency on branches, which is unacceptable. However, since this critical path is fixed, and all the other configurable parts of Octavo are heavily pipelined, we can increase the number of Branching and Address Modules, and the number of I/O ports, with little to no reduction in clock frequency.
Since Octavo v3, the design should be quite portable to other FPGAs (Xilinx, Lattice, etc...) since it uses no vendor-specific modules and is heavily pipelined. Any vendor-specific hardware can be attached as an accelerator, without altering Octavo's ISA or architecture.
The first version of Octavo, published in 2012, was a proof-of-concept: can we maximize operating frequency and issue one instruction per cycle? Octavo v1 had Instruction (I) and Data (A and B) memories, a simple controller (CTL) which could execute a branch based on the contents of a memory location in A, and an ALU which could do addition, subtraction, multiplication, and basic Boolean operations. It reached the 550 MHz limit of the Stratix IV FPGA, but had limitations: you had to write self-modifying code to implement indirect memory accesses, and the ALU was idle during a branch.
The second version of Octavo, published in 2014, addressed the inefficiencies of the first version. Octavo v2 keeps the same ALU, and the same Instruction (I) and Data (A and B) memories, but adds a Branch Trigger Module (BTM) which calculates one or more branches in parallel with the current instruction based on the result of the previous instruction. Branches take zero cycles in the common case. The Address Offset Module (AOM) can alter the instruction operands before execution to implement indirect memory access with post-incrementing. Finally, the I/O Predication Module (PRD) manages the I/O ports: if an instruction operand refers to a port which is not ready, the instruction is forced to a no-op and the Controller (CTL) re-fetches the same instruction to retry it. Octavo v2 no longer reached the maximum possible operating frequency, but its improved efficiency more than made up for the loss. Octavo v2 could also be operated in a SIMD configuration, with up to 32 datapaths.
The third version of Octavo, completed in 2019, fixes some limitations of Octavo v2 which was written in a hurry. The codebase was cleaned up, and computational overhead further reduced: multi-way branching with priority arbitration over 200+ branch conditions (FC), more flexible indirect addressing (AD), a Literal Pool to reduce duplication in Data memories (DM), a programmable Opcode Decoder (OD), a new three-operand ALU which supports bitwise parallelism and instruction chaining, and a new addressing mode to move twice as much data per instruction when data movement dominates computation (AS).
The following block diagrams detail the implementation of Octavo v3. They are not exact schematics, as some address translation and decoding logic is omitted for clarity, but they serve primarily as an architectural reference, and make clear the number of pipeline stages a path traverses, which is crucial to get exactly right, else threads will corrupt each other.
Each diagram is divided by numbered vertical dashed lines which denote pipeline registers. Each module spanning multiple pipeline stages has its blocks numbered, which match the pipeline register numbers in the detailed block diagram for that module, and so on... Each line represents a single bit or a bus and is named so it can be referenced in sub-diagrams. Small, unnamed boxes with a chevron at the bottom denote registers without logic. Larger such boxes denote synchronous memories. There are no asynchronous memories in Octavo.
You can click on a diagram to get a high-resolution version.
Octavo has only two 36-bit instruction formats: common and split. A common instruction has, from the most-significant bits down (left to right), the following fields:
All instructions are conventional 3-operand instructions, where D = A OP B for the common format, and DA = A OPa B and DB = A OPb B for the split format.
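A behavioural sketch of the two formats (the lambdas stand in for opcodes; they are placeholders, not actual Octavo operations):

```python
# Behavioural model of Octavo's two instruction formats.
# The operations passed in are hypothetical, for illustration only.

def common(op, a, b):
    """Common format: D = A OP B (one result)."""
    return op(a, b)

def split(op_a, op_b, a, b):
    """Split format: DA = A OPa B and DB = A OPb B
    (two results from the same source operands)."""
    return op_a(a, b), op_b(a, b)

print(common(lambda a, b: a + b, 2, 3))   # 5
print(split(lambda a, b: a + b,
            lambda a, b: a & b, 6, 3))    # (9, 2)
```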
Octavo instructions operate only on its internal memories. External loads and stores are done through I/O ports mapped into these memory spaces. There are four 1024-word memory spaces:
A number of memory locations in the A and B Data Memories have special functions memory-mapped onto them. Most have a configurable number of locations and start addresses, so the following values are examples:
The bottom row contains the Addressing module (AS), the Data Memory module (DM), and the Arithmetic and Logic Unit (ALU):
The Flow Control module (FC) computes the next Program Counter (PC) for each thread in turn, as well as if the current instruction must be cancelled (C) because a parallel branch did not go as predicted. The FC contains:
The 5-stage Branch Module (BM) takes in the result of the previous instruction (Rp) as well as some Flags computed by the ALU from that previous instruction. The result and destination address of the previous instruction, also denoted RB/DB' (identical to Rp), are used to configure the BM. The I/O Ready and Cancel signals from the previous instruction (IORp and Cp, the latter labelled simply C in the diagram) will prevent any external updates or internal state changes from taking effect if the previous instruction was no-op'ed. The BM also takes in the current I/O Ready (IOR) to stop a branch from executing until the parallel instruction can proceed, and the Program Counter (PC) value of the current instruction to determine if a branch happens in parallel with it. The BM produces the following signals:
The Condition Predicate module (CP) takes two 2-bit condition selectors and one 4-bit Boolean operation definition fed to a Universal Dyadic Boolean Operator and determines if the conditions are met. The CP design is discussed in detail here. The condition bits are mainly derived from the result of the previous instruction:
The 3-stage Branch Sentinel (BS) has two programmable memories, each holding one value per thread: a Mask (MSK) and Sentinel (STN). The BS applies the Mask to the Sentinel and to the result (R) of an instruction and then checks for a match (M). Any bit position set to 1 in the Mask is omitted from comparison.
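The match logic reduces to masking both values and comparing; a minimal model (36-bit words, as in Octavo):

```python
# Branch Sentinel match: any bit set to 1 in the Mask (MSK) is
# excluded from the comparison between the result (R) and the
# Sentinel (STN).

WIDTH = 36
FULL = (1 << WIDTH) - 1

def sentinel_match(r, stn, msk):
    keep = FULL & ~msk                 # bits that take part in the match
    return (r & keep) == (stn & keep)

# Match on the low byte only, ignoring the other 28 bits:
msk = FULL ^ 0xFF
print(sentinel_match(0x123456789, 0x876543289, msk))  # True  (both end in 0x89)
print(sentinel_match(0x123456788, 0x876543289, msk))  # False (0x88 != 0x89)
```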
The 3-stage Branch Counter (BC) receives the previous instruction's result and destination address (RB/DB'), the Branch Reached (BR) signal, and the I/O Ready for the current and previous instruction (IOR/IORp), all delayed by some pipeline stages to synchronize them to the current thread. The Branch Counter contains a Counter Memory (CM) which holds one 36-bit counter value for each thread. The output of CM is both OR-reduced to a Counter Active flag (CTR) and decremented by 1 (over 2 pipeline stages, to meet timing). If a branch is reached, and the counter is active, and the current instruction is not Annulled (IOR is high), then the Counter Memory stores the decremented counter value, until it reaches zero. The result of a previous instruction (RB/DB'/IORp), if not Annulled, can reload the counter.
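A behavioural model of the per-thread counter (the Annul/Cancel gating and the pipelined decrement are simplified away):

```python
# Per-thread Branch Counter: one counter word per thread, decremented
# each time its branch is reached, until it reaches zero. The
# OR-reduction of the counter bits gives the Counter Active flag.

THREADS = 8
counters = [0] * THREADS

def reload(thread, value):
    counters[thread] = value

def branch_reached(thread):
    """Returns the Counter Active flag, then decrements if active."""
    active = counters[thread] != 0   # OR-reduction of the counter bits
    if active:
        counters[thread] -= 1
    return active

reload(2, 3)
history = [branch_reached(2) for _ in range(5)]
print(history)  # [True, True, True, False, False]
```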
The 1-stage Branch Arbiter (BA) takes in the output of all the Branch Modules (BM) and filters them with a Priority Arbiter (PA) to pass along only the highest priority taken branch. There are two BMs in this diagram. The output Jump (J) is set if any Jump is set, and the PA is also fed an extra lowest-priority signal if no Jump is set. The PA outputs a one-hot bitmask of Jumps used to select the associated Cancel (C) and Branch Destination (BD) signals. If no Jumps are taken, any Cancel signal is passed through (since an untaken branch may cancel a parallel instruction), and the BD is set to zero (no one-hot mux input at that index).
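The one-hot grant of a Priority Arbiter can be computed with the classic isolate-lowest-set-bit trick; a minimal sketch (bit 0 assumed highest priority; the extra lowest-priority no-jump input is omitted):

```python
# Priority Arbiter: grants only the highest-priority request,
# as a one-hot bitmask. Bit 0 is assumed to be the highest priority.

def priority_arbiter(requests):
    return requests & -requests   # isolates the lowest set bit

print(bin(priority_arbiter(0b0110)))  # 0b10   (bit 1 wins over bit 2)
print(bin(priority_arbiter(0b1000)))  # 0b1000
print(priority_arbiter(0))            # 0 (no requests, no grant)
```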
The 2-stage Controller (CTL) generates the next Program Counter (PC) value for each thread based on any taken branch and whether the instruction was Annulled. There are two Program Counter Memories which store the next PC value (PCM) and the current PC value (PCM-1) for each thread. If an instruction is Annulled (IOR is low), the current thread's PC value gets sent out again, re-trying the same instruction. Cancel (C) overrides IOR, letting the next instruction be fetched. If a branch is taken (Jump (J) is high), then the current PC takes the value of the Branch Destination (BD).
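The per-thread next-PC selection can be summarized behaviourally (signal polarities follow the I/O Ready module's definition: IOR low means Annulled):

```python
# Controller next-PC selection for one thread: a taken branch (J)
# redirects to BD; an Annulled instruction (IOR low, not overridden
# by Cancel) retries the same PC; otherwise execution is sequential.

def next_pc(pc, jump, branch_dest, ior, cancel):
    if jump:
        return branch_dest
    if not ior and not cancel:   # Annulled: retry the same instruction
        return pc
    return pc + 1

print(next_pc(10, jump=True,  branch_dest=99, ior=True,  cancel=False))  # 99
print(next_pc(10, jump=False, branch_dest=99, ior=False, cancel=False))  # 10
print(next_pc(10, jump=False, branch_dest=99, ior=False, cancel=True))   # 11
print(next_pc(10, jump=False, branch_dest=99, ior=True,  cancel=False))  # 11
```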
If the Control Bits (CTL, see Control Path) fetched by the opcode of the current instruction have the Split bit set, then the instruction splits the 12-bit destination operand (D) into two 6-bit destinations (DA and DB), with the necessary six most-significant bits to offset the destination into the address space of the A (0-1023) and B (1024-2047) Data Memories (see Data Path). Otherwise, D is simply replicated into DA and DB.
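The exact bit placements aren't spelled out above, so this behavioural sketch makes assumptions (which half of D maps to which memory, and how each half is offset into its 1024-word space); it only illustrates the split-versus-replicate behaviour:

```python
# Split-destination sketch. Assumptions for illustration: the upper
# 6 bits of D address the A memory (0-1023), the lower 6 bits address
# the B memory (1024-2047).

B_BASE = 1024

def split_destinations(d, split):
    if split:
        da = (d >> 6) & 0x3F        # assumed: upper half -> A memory
        db = (d & 0x3F) | B_BASE    # assumed: lower half -> B memory
    else:
        da, db = d, d               # common format: D replicated into DA and DB
    return da, db

print(split_destinations(0b000011_000101, split=True))   # (3, 1029)
print(split_destinations(5, split=False))                # (5, 5)
```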
The 2-stage I/O Predication (PR) module receives the source and destination operands from the current instruction (A, B, DA, DB) and the empty/full bits from each read and write I/O port (E/Fr, E/Fw) and signals if each operand is referring to an I/O port (I/Or, I/Ow) rather than Data Memory (DM). For the I/O read ports, the read enable bit for the active port, if any, is also raised (rden_I/O). The equivalent write enable signals for the I/O write ports (I/Ow_wren) are created later in the Memory Write (MW) module. The all-important I/O Ready (IOR) signal is generated here, based on the masked empty/full bits (E/Frm and E/Fwm). The read enable signals (rden_I/O) are gated by IOR or by the Cancel (C) signal from the Flow Control module (FC), so if the current instruction is Annulled or Cancelled, no I/O port operation takes place. The I/O write enable signals (I/Ow_wren) are gated in the same manner later, in the Memory Write module (MW).
The 2-stage I/O Read Predication (RP) module receives a read address (Addr) for a Data Memory (DM) and the empty/full bits (E/Fr) for any I/O read ports mapped in that Data Memory space. The translated least-significant bits of the address select an empty/full bit, while the whole address is decoded to determine if it refers to an I/O port. If the address refers to an I/O read port, we pass on the corresponding empty/full bit, else we mask it with a "Full" signal since memory is always ready (E/Frm). Finally, we signal if this read accesses I/O (is_I/Or) and raise the read enable (rden) line for the accessed I/O port, if any.
The 2-stage I/O Write Predication (WP) module receives a write address (Addr) for a Data Memory (DM) and the empty/full bits (E/Fw) for any I/O write ports mapped in that Data Memory space. The translated least-significant bits of the address select an empty/full bit, while the whole address is decoded to determine if it refers to an I/O port. If the address refers to an I/O write port, we pass on the corresponding empty/full bit, else we mask it with an "Empty" signal since memory is always ready (E/Fwm). Finally, we signal if this write accesses I/O (is_I/Ow). The I/O port write enable signals (I/Ow_wren) are generated later in the Data Memory (DM).
The I/O Ready module (IOR) takes in the masked read/write empty/full bits (E/Frm and E/Fwm) for both the A and B Data Memories (DM). If all the empty/full read bits are "Full" (logic 1), and all the empty/full write bits are "Empty" (logic 0), then we raise I/O Ready (IOR) to signal the rest of the system that all I/O ports referenced by an instruction are ready and thus the instruction (and any parallel address offsets and branch operations) may proceed. If IOR is low, the instruction is "Annulled": it is converted to a no-op and re-fetched the next time the thread comes around. Any side-effects (e.g.: decrementing loop counters or post-incrementing address offsets) are also prevented.
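The reduction itself is a simple AND over the masked bits, with the polarities described above:

```python
# I/O Ready: all masked read empty/full bits must be Full (1) and all
# masked write empty/full bits must be Empty (0) for the instruction
# to proceed; otherwise it is Annulled and retried.

def io_ready(efrm_bits, efwm_bits):
    return all(b == 1 for b in efrm_bits) and all(b == 0 for b in efwm_bits)

print(io_ready([1, 1], [0, 0]))  # True:  all referenced ports ready
print(io_ready([1, 0], [0, 0]))  # False: a read port is Empty -> Annul
print(io_ready([1, 1], [1, 0]))  # False: a write port is Full -> Annul
```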
The 2-stage Address Module (AM) receives an address (Addr) and adds a per-thread offset to it (Addr'). The (translated) least-significant bits of the address index into the Programmed Offset (PO) and Programmed Increment (PI) tables. The whole address is decoded to see if it falls in the range of Shared Memory (SM?) and Indirect Memory (IM?), and is also carried along to the next pipeline stage.
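The offset-and-post-increment behaviour is the classic indirect-addressing pattern; a per-thread sketch (table indexing and Annul gating simplified to a single entry):

```python
# Address Module sketch: an indirect access adds a Programmed Offset
# (PO) to the address, then post-increments PO by the Programmed
# Increment (PI) for the next access. One entry shown; the real
# module holds tables indexed per thread.

class AddressModule:
    def __init__(self, offset, increment):
        self.po = offset      # Programmed Offset
        self.pi = increment   # Programmed Increment

    def translate(self, addr, is_indirect):
        if not is_indirect:
            return addr
        addr_prime = addr + self.po
        self.po += self.pi    # post-increment for the next access
        return addr_prime

am = AddressModule(offset=100, increment=1)
addrs = [am.translate(8, True) for _ in range(3)]
print(addrs)  # [108, 109, 110]
```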
The 2-stage Data Memory Read (RD0/RD1) receives a read address (Addr) from an instruction operand (A' or B') and the input from the I/O read ports (I/Or). The least-significant bits of the address (after translation) select one of the I/O ports, while the full address reads from the data memory (RAM) if the address was not previously determined in the I/O Predication module (PR) to be an I/O port address (is_I/Or). Otherwise, the RAM read access is disabled and the input from the selected I/O port is passed along instead. Finally, if a previously generated read enable (rden) is not active, or if the instruction was Annulled or Cancelled, the output (Data) is masked to zero.
The 2-stage Data Memory Write (WR0/WR1) receives a write address (Addr, from DA' or DB') and data (Data, from RA or RB) from the current instruction, as well as a previously generated write enable (wren) and the signal from the I/O Predication module (PR) which indicates if the address is accessing an I/O port (is_I/Ow). If the address is not referring to an I/O port, and the instruction is neither Cancelled nor Annulled, then the data is written to the data memory (RAM). Otherwise, the write data is sent to the referenced I/O port (I/Ow), and its associated write enable (I/Ow_wren) is raised.
The 4-stage Arithmetic and Logic Unit (ALU) receives the 36-bit A and B data words from the Data Memory (DM), the 20-bit control bits from the Opcode Decoder Memory (OD), and the 12-bit destination address DB'. The control bits are divided into multiple fields:
Depending on the control bits and the values of R and S, this ALU can compute all the usual dyadic Boolean operations and addition/subtractions of A and B, and more complex combinations thereof: