The Octavo soft-processor is a research CPU aimed at building FPGA overlay
architectures. Instead of implementing your whole design in hardware, and
waiting hours for it to place-and-route after each change, you implement just
the compute-heavy parts in hardware alongside an Octavo instance, and leave the
rest to software. Most design cycles now reduce to quick compiles, and both the
hardware and software design jobs are simplified. This isn't a new idea, but
Octavo has higher software performance and couples more directly to external
hardware than previous soft-processors.
Octavo's high performance comes from an architecture adapted to the
underlying FPGA, so the trade-offs differ from those of an ASIC. The
architecture works best for parallel code, but the increased branching and
addressing efficiency, and the more powerful ALU operations, also help
sequential code. Some architectural features include:
- You have to divide your work across 8 round-robin shared-memory threads.
There are no hazards between instructions within a thread. From the
perspective of a thread, an instruction completes in one cycle, which is
actually 8 clock cycles. Each thread is best considered as a separate "CPU",
with fixed latencies when sharing data across threads.
- You have to operate out of internal memory: 1024 instruction words, 2048
data words (256 per thread), all 36 bits wide. But these memories can do useful
work every cycle.
- There are no load/store instructions, but any instruction can perform up to
2 reads and 1 write (sometimes 2) from multiple I/O ports as operands. These
I/O ports allow tight coupling to external memories, hardware accelerators,
other Octavo instances, etc. Instructions automatically annul and re-try if
an I/O port isn't ready, so busy-wait loops are not required.
- There are no branch instructions. You implement flow-control by programming
(ahead of time) a special functional unit. However, multiple branches can
evaluate in parallel with an ALU operation, over any of more than 200 complex
arithmetic and Boolean conditions, and can always place a useful instruction in their
delay slot. Loops cost almost nothing, so you don't need to unroll or vectorize
data-heavy code to gain efficiency. Multi-way branching accelerates
- Instructions have only register-direct addressing. Octavo implements
pointers using a functional unit, programmed ahead of time, which modifies
read/write addresses in parallel, with programmable stride. You can load small
arrays of data on-chip, and process them without spending cycles on pointer
- The Octavo ISA defines over a million three-operand instructions. Each
thread has access to an independent and programmable window of any 16 of these
instructions. Each instruction can combine Boolean and arithmetic operations
together to support branch-free code and sub-word parallelism. The result of a
previous instruction can also be used simultaneously, allowing fast chains of
calculations on three values instead of two.
- You can easily add multiple datapaths, controlled by a single controlpath,
to implement SIMD processing.
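The round-robin threading in the first bullet above can be pictured with a small timing model. This is a minimal sketch, not Octavo's actual hardware: the thread count and memory sizes come from the text, while the function names and the assumption that thread 0 issues on clock cycle 0 are hypothetical.

```python
# Toy timing model of round-robin thread issue (illustrative only).
NUM_THREADS = 8                                # from the text: 8 round-robin threads
DATA_WORDS = 2048                              # from the text: total data words
WORDS_PER_THREAD = DATA_WORDS // NUM_THREADS   # 256 words per thread


def issuing_thread(clock_cycle: int) -> int:
    """Thread that issues an instruction on this clock cycle.

    Assumption: thread 0 issues on clock cycle 0.
    """
    return clock_cycle % NUM_THREADS


def thread_cycle_to_clock(thread: int, thread_cycle: int) -> int:
    """Clock cycle on which `thread` issues its `thread_cycle`-th instruction."""
    return thread_cycle * NUM_THREADS + thread


# One instruction issues per clock cycle, but each thread only sees every 8th slot.
schedule = [issuing_thread(c) for c in range(16)]
print(schedule)

# Consecutive instructions of thread 3 are 8 clock cycles apart.
print(thread_cycle_to_clock(3, 0), thread_cycle_to_clock(3, 1))
```

In this model one "thread cycle" always spans 8 clock cycles, which is why a thread never sees a hazard between its own consecutive instructions.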
The first version of Octavo, published in 2012, was a proof-of-concept: can
we maximize operating frequency and issue one instruction per cycle? Octavo v1
had Instruction (I) and Data (A and B) memories, a simple controller (CTL)
which could execute a branch based on the contents of a memory location in A,
and an ALU which could do addition, subtraction, multiplication, and basic
Boolean operations. It reached the 550 MHz limit of the Stratix IV FPGA, but
had limitations: you had to write self-modifying code to implement
indirect memory accesses, and the ALU was idle during a branch.
The second version of Octavo, published in 2014, addressed the
inefficiencies of the first version. Octavo v2 keeps the same ALU, and the
same Instruction (I) and Data (A and B) memories, but adds a Branch Trigger
Module (BTM) which calculates one or more branches in parallel with the current
instruction based on the result of the previous instruction. Branches take zero
cycles in the common case. The Address Offset Module (AOM) can alter the
instruction operands before execution to implement indirect memory access with
post-incrementing. Finally, the I/O Predication Module (PRD) manages the I/O
ports: if an instruction operand refers to a port which is not ready, the
instruction is forced to a no-op and the Controller (CTL) re-fetches the same
instruction to retry. Octavo v2 no longer reached the maximum possible
operating frequency, but its improved efficiency more than made up for the
loss. Octavo v2 could also be operated in a SIMD configuration, with up to 32
The third version of Octavo, currently under development, fixes some
limitations of Octavo v2, which was written in a hurry. The codebase was
cleaned up, and computational overhead was further reduced with: multi-way
branching with priority arbitration over 200+ branch conditions (FC),
more flexible indirect addressing (AD), a Literal Pool to
reduce duplication in Data memories (DM), a programmable Opcode Decoder (OD), a
new three-operand ALU which supports bitwise parallelism and instruction
chaining, and a new addressing mode to move twice as much data per instruction
when data movement dominates computation (AS).
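The I/O predication and post-incrementing indirect addressing described for Octavo v2 can be illustrated with a toy software model. This is only a sketch under stated assumptions: the `Port`, `PostIncrementPointer`, and `copy_from_port` names are hypothetical, and the model assumes that an annulled instruction changes no state, including the pointer.

```python
class Port:
    """Hypothetical read port; `None` in `arrivals` means "not ready" on that attempt."""

    def __init__(self, arrivals):
        self._arrivals = list(arrivals)
        self._attempt = 0

    def try_read(self):
        """Return (ready, value) for the next attempt."""
        if self._attempt >= len(self._arrivals):
            raise RuntimeError("port never became ready again")
        value = self._arrivals[self._attempt]
        self._attempt += 1
        return value is not None, value


class PostIncrementPointer:
    """Hypothetical stand-in for an address offset programmed ahead of time."""

    def __init__(self, base, stride):
        self.address = base
        self.stride = stride

    def next_address(self):
        address = self.address
        self.address += self.stride  # post-increment by the programmed stride
        return address


def copy_from_port(port, memory, pointer, count):
    """Store `count` words from `port` into `memory` through `pointer`.

    Returns the number of issue slots used, annulled attempts included.
    """
    slots = 0
    stored = 0
    while stored < count:
        slots += 1
        ready, value = port.try_read()
        if not ready:
            # Annulled: treated as a no-op, and the same instruction is retried.
            continue
        memory[pointer.next_address()] = value
        stored += 1
    return slots


memory = [0] * 16
port = Port([None, 10, None, None, 20, 30])
slots = copy_from_port(port, memory, PostIncrementPointer(base=4, stride=2), count=3)
print(slots, memory[4], memory[6], memory[8])  # prints "6 10 20 30"
```

The program itself contains no polling loop and no pointer arithmetic: the retry and the address update happen outside the "instruction", which is the point of both modules.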
Although Octavo (v1 and v2) was originally aimed at Altera's Stratix IV
FPGA, it performs pretty well on other Altera devices. It generally runs twice
as fast as a NiosII/f, and gets fairly close to the absolute upper clock
frequency limit allowed by the FPGA hardware. The Fmax of Octavo v3
seems to be 5% lower than that of Octavo v2, but v3 operates with less overhead and a more
We could port Octavo to Xilinx devices, but the ALU would have to be
implemented differently, though architecturally the same. See the works of
Cheah, Fahmy, and Kapre on the iDEA soft-processor, and its pipelining and forwarding,
for their high-performance solutions on Xilinx FPGAs.
Octavo v2 Fmax on Various Altera Devices (tuned to Stratix IV)
|Family ||Device||(MHz)||(MHz)||(Ratio)||(MHz)||(Ratio)
|Stratix V ||5SGXEA7N2F45C1||508||588||0.864||675||0.871
|Stratix IV ||EP4S100G5H40I1||470||493||0.953||550||0.896
|Arria V ||5AGXFB5K4F40I3||272||300||0.907||400||0.750
|Cyclone V ||5CGXFC7D6F31C6||239||267||0.895||315||0.848
|Cyclone IV ||EP4CGX30CF19C6||187||197||0.949||315||0.625
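The two ratio columns in the table above can be checked arithmetically. The sketch below assumes, as an inference from the numbers only, that the first ratio is the first MHz column divided by the second, and that the second ratio is the second MHz column divided by the third. The field names `mhz_a`, `mhz_b`, and `mhz_c` are placeholders, since the table does not name these columns.

```python
# Rows copied from the table: (family, mhz_a, mhz_b, ratio_1, mhz_c, ratio_2).
# The meaning of each MHz column is not stated here; only the arithmetic is checked.
rows = [
    ("Stratix V", 508, 588, 0.864, 675, 0.871),
    ("Stratix IV", 470, 493, 0.953, 550, 0.896),
    ("Arria V", 272, 300, 0.907, 400, 0.750),
    ("Cyclone V", 239, 267, 0.895, 315, 0.848),
    ("Cyclone IV", 187, 197, 0.949, 315, 0.625),
]

TOLERANCE = 0.0005  # the table lists ratios to three decimal places


def ratios_consistent(row) -> bool:
    """True if both listed ratios match the assumed column divisions."""
    _, mhz_a, mhz_b, ratio_1, mhz_c, ratio_2 = row
    return (abs(mhz_a / mhz_b - ratio_1) < TOLERANCE
            and abs(mhz_b / mhz_c - ratio_2) < TOLERANCE)


results = {row[0]: ratios_consistent(row) for row in rows}
print(results)
```

Every row passes this check, so the assumed relationship between the columns is at least consistent with the published figures.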
- Microarchitectural Comparison of the MXP and Octavo Soft-Processor FPGA Overlays
Charles Eric LaForest, Jason H. Anderson
ACM Transactions on Reconfigurable Technology and Systems (TRETS), May 2017, Volume 10, Issue 3, Article No. 19
Compares the micro-architecture, performance, and area of two
soft-processor FPGA overlays: the Octavo multi-threaded soft-processor and the
MXP soft vector processor, both compared against hardware implementations of
micro-benchmarks written in C synthesized with the LegUp HLS tool and also
written in the Verilog HDL. Overall, Octavo's higher operating frequency and
MXP's more efficient code execution result in similar performance from both,
within an order of magnitude of hardware implementations, but with a penalty
of an order of magnitude greater area.
- Approaching Overhead-Free Execution on FPGA Soft-Processors
Charles Eric LaForest, Jason Anderson, J. Gregory Steffan
IEEE International Conference on Field-Programmable Technology (FPT), December 2014, Shanghai, China
Describes the Branch Trigger and Address Offset Modules which can eliminate branching and addressing overheads, giving better performance than loop unrolling even against an "ideal" impossible reference processor.
Slides: PDF (also available from http://wiki.tcfpga.org/FPT2014 along with many others)
- Maximizing Speed and Density of Tiled FPGA Overlays via Partitioning
Charles Eric LaForest and J. Gregory Steffan
IEEE International Conference on Field-Programmable Technology (FPT), December 2013, Kyoto, Japan
Demonstrates simple design partitioning techniques to preserve performance when replicating datapaths ("tiling") via SIMD or multi-core scaling.
Slides: PPTX PDF
- Octavo: An FPGA-Centric Processor Family
Charles Eric LaForest and J. Gregory Steffan
ACM International Symposium on Field-Programmable Gate Arrays (FPGA), February 2012, Monterey, CA.
Describes the basic Octavo architecture, and how to maximize pipelined logic speed via "self-loop characterization".
Slides: PPTX PDF
- High-Speed Soft-Processor Architecture for FPGA Overlays
Thesis, Doctor of Philosophy (ECE), University of Toronto, 2014
All of the above, plus some background, work on expanded address spaces, instruction I/O predication, and benchmarking.
You can get the complete source from the Octavo GitHub Repository, updated as work progresses.