PUBLICATIONS
Theses
Journal Articles
- Microarchitectural Comparison of the MXP and Octavo
Soft-Processor FPGA Overlays
Charles Eric LaForest, Jason H. Anderson
ACM Transactions on Reconfigurable Technology and Systems (TRETS), May 2017, Volume 10, Issue 3, Article No. 19
Field-Programmable Gate Arrays (FPGAs) can yield higher performance and lower power than software
solutions on CPUs or GPUs. However, designing with FPGAs requires specialized hardware design skills
and hours-long CAD processing times. To reduce and accelerate the design effort, we can implement an
overlay architecture on the FPGA, on which we then more easily construct the desired system but at a
large cost in performance and area relative to a direct FPGA implementation. In this work, we compare
the micro-architecture, performance, and area of two soft-processor overlays: the Octavo multi-threaded
soft-processor and the MXP soft vector processor. To measure the area and performance penalties of these
overlays relative to the underlying FPGA hardware, we compare direct FPGA implementations of the
micro-benchmarks written in C synthesized with the LegUp HLS tool and also written in the Verilog HDL. Overall,
Octavo's higher operating frequency and MXP's more efficient code execution result in similar performance
from both, within an order of magnitude of direct FPGA implementations, but with a penalty of an order of
magnitude greater area.
- Composing Multi-Ported Memories on FPGAs
Charles Eric LaForest, Zimo Li, Tristan O'Rourke, Ming G. Liu, J. Gregory Steffan
ACM Transactions on Reconfigurable Technology and Systems (TRETS), August 2014, Volume 7, Issue 3, Article No. 16
Multi-ported memories are challenging to implement on FPGAs since the block
RAMs included in the fabric typically have only two ports. Hence we must
construct memories requiring more than two ports either out of logic elements
or by combining multiple block RAMs. We present a thorough exploration and
evaluation of the design space of FPGA-based soft multi-ported memories for
conventional solutions, and also for the recently-proposed Live Value Table
(LVT) and XOR approaches to
unidirectional-port memories, reporting results for both Altera and Xilinx
FPGAs. Additionally, we thoroughly evaluate and compare with a recent
LVT-based approach to bidirectional-port memories by Choi et al.
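The abstract above names the XOR approach without describing its mechanism. As a rough behavioural sketch only (plain Python, not the authors' Verilog, and not cycle-accurate), the general idea can be modelled with one bank per write port: a write stores the new value XORed with the other banks' contents at that address, and a read XORs all banks together so that the stale contributions cancel. The class name `XORMemory` and its method names are hypothetical, and the replication of banks needed to supply extra read ports in hardware is left out.

```python
from functools import reduce
from operator import xor


class XORMemory:
    """Illustrative software model of an XOR-based multi-write-port memory.

    Assumption: one bank per write port, writes applied one at a time.
    Real designs replicate banks per read port and handle same-cycle timing.
    """

    def __init__(self, depth, num_write_ports):
        # banks[p][a] is the word held at address a in write port p's bank.
        self.banks = [[0] * depth for _ in range(num_write_ports)]

    def write(self, port, addr, value):
        # XOR of every *other* bank's word at this address.
        others = reduce(
            xor,
            (bank[addr] for i, bank in enumerate(self.banks) if i != port),
            0,
        )
        # Store an encoded word so that XORing all banks yields `value`.
        self.banks[port][addr] = value ^ others

    def read(self, addr):
        # XOR across all banks recovers the most recently written value.
        return reduce(xor, (bank[addr] for bank in self.banks), 0)


mem = XORMemory(depth=16, num_write_ports=3)
mem.write(0, 5, 0xAB)
mem.write(2, 5, 0x1234)   # a later write from another port wins
mem.write(1, 7, 0x42)
print(hex(mem.read(5)), hex(mem.read(7)))
```

No table tracks which port wrote last, so the model needs only XOR gates around the banks. That is consistent with the trade-off reported for the XOR design elsewhere in this list: less logic, but more block RAM.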
Conference Papers
- Approaching Overhead-Free Execution on FPGA Soft-Processors
Charles Eric LaForest, Jason Anderson, J. Gregory Steffan
IEEE International Conference on Field-Programmable Technology (FPT), December 2014, Shanghai, China
Orthogonal to limitations in parallelism or clock frequency, the low
performance of soft-processors primarily originates in the intrinsic addressing
and flow-control overheads of scalar microprocessors, which expend a
considerable number of cycles interleaving address calculations and branch
decisions within the actual useful work. We present an improved version of the
Octavo soft-processor which statically overlaps "overhead" computations and
executes them in parallel with the "useful" computations, while still reaching
500 MHz on the Altera Stratix IV FPGA -- 0.909x of the absolute maximum rating.
We evaluate our cycle count improvements with multiple benchmarks, achieving
speedups ranging from 1.07x for control-heavy code, to 1.92x for loop-heavy code,
never performing worse than the original sequential code, and always performing
better than a totally unrolled loop.
Slides: PDF (also available from https://wiki.tcfpga.org/FPT2014 along with many others)
- Maximizing Speed and Density of Tiled FPGA Overlays via Partitioning
Charles Eric LaForest and J. Gregory Steffan
IEEE International Conference on Field-Programmable Technology (FPT), December 2013, Kyoto, Japan
Common practice for large FPGA design projects is to divide sub-projects
into separate synthesis partitions to allow incremental recompilation as each
sub-project evolves. In contrast, smaller design projects avoid partitioning
to give the CAD tool the freedom to perform as many global optimizations as
possible, knowing that the optimizations normally improve performance and
possibly area. In this paper, we show that for high-speed tiled designs
composed of duplicated components and hence having "multi-localities"
(multiple instances of equivalent logic), a designer can use partitioning to
preserve multi-locality and improve performance. In particular, we focus on
the lanes of SIMD soft processors and multicore meshes composed of them, as
compiled by Quartus 12.1 targeting a Stratix IV EP4SE230F29C2 device. We
demonstrate that, with negligible impact on compile time (less than +/-10%):
(i) we can use partitioning to provide high-level information to the CAD tool
about preserving multi-localities in a design, without low-level micro-managing
of the design description or CAD tool settings; (ii) by preserving
multi-localities within SIMD soft processors, we can increase both frequency
(by up to 31%) and compute density (by up to 15%); (iii) partitioning
improves the density and speed (by up to 51% and 54%, respectively) of a mesh of soft
processors, across many building block configurations and mesh geometries; (iv)
the improvements from partitioning increase as the number of tiled computing
elements (SIMD lanes or mesh nodes) increases. As an example of the benefits of
partitioning, a mesh of 102 scalar soft processors improves its operating
frequency from 284 up to 437 MHz, its peak performance from 28,968 up to 44,574
MIPS, while increasing its logic area by only 0.85%.
Slides: PPTX PDF
- Octavo: An FPGA-Centric Processor Family
Charles Eric LaForest and J. Gregory Steffan
ACM International Symposium on Field-Programmable Gate Arrays (FPGA), February 2012, Monterey, CA.
Overlay processor architectures allow FPGAs to be programmed by non-experts
using software, but prior designs have mainly been based on the architecture of
their ASIC predecessors. In this paper we develop a new processor architecture
that from the beginning accounts for and exploits the predefined widths,
depths, maximum operating frequencies, and other discretizations and limits of
the underlying FPGA components. The result is Octavo, a ten-pipeline-stage
eight-threaded processor that operates at the block RAM maximum of 550 MHz on a
Stratix IV FPGA. Octavo is highly parameterized, allowing us to explore
trade-offs in datapath and memory width, memory depth, and number of supported
thread contexts.
Slides: PPTX PDF
See the GitHub Octavo repository for the latest version and plans for future work.
- Multi-Ported Memories for FPGAs via XOR
Charles Eric LaForest, Ming G. Liu, Emma Rae Rapati, and J. Gregory Steffan
ACM International Symposium on Field-Programmable Gate Arrays (FPGA), February 2012, Monterey, CA.
Multi-ported memories are challenging to implement with FPGAs since the
block RAMs included in the fabric typically have only two ports. Any design
that requires a memory with more than two ports must therefore be built out of
logic elements or by combining multiple block RAMs. The recently-proposed Live
Value Table (LVT) design provides a significant operating
frequency improvement over conventional approaches. In this paper we present
an alternative approach based on the XOR operation that provides
multi-ported memories that use far less logic but more block RAMs than LVT
designs, and are often smaller and faster for memories that are more than 512
entries deep. We show that (i) both designs can exploit multipumping to trade
speed for area savings, (ii) multipumped XOR designs are significantly
smaller but moderately slower than their LVT counterparts, and (iii) both
the LVT and XOR approaches are valuable and useful in different situations,
depending on the constraints and resource utilization of the enclosing design.
Slides: PPTX PDF (rough conversion, sorry)
- Efficient Multi-Ported Memories for FPGAs
Charles Eric LaForest and J. Gregory Steffan
ACM International Symposium on Field-Programmable Gate Arrays (FPGA), February 2010, Monterey, CA.
Multi-ported memories are challenging to implement with FPGAs since the
provided block RAMs typically have only two ports. We present a thorough
exploration of the design space of FPGA-based soft multi-ported memories by
evaluating conventional solutions to this problem, and introduce a new design
that efficiently combines block RAMs into multi-ported memories with arbitrary
numbers of read and write ports and true random access to any memory location,
while achieving significantly higher operating frequencies than conventional
approaches. For example, we build a 256-location, 32-bit, 12-ported (4-write,
8-read) memory that operates at 281 MHz on Altera Stratix III FPGAs while
consuming an area equivalent to 3679 ALMs: a 43% speed improvement and 84%
area reduction over a pure ALM implementation, and a 61% speed improvement
over a pure "multipumped" implementation, although the pure multipumped
implementation is 7.2x smaller.
Selected as one of the 25 most significant papers in the first 20 years of the conference: FPGA20 (endorsement)
Slides: PPTX PDF
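The Live Value Table (LVT) design mentioned in the entries above can be illustrated with a similarly small behavioural model. This is a sketch under stated assumptions rather than the published implementation: each write port owns a bank, and a small table records, per address, which port wrote most recently, so a read knows which bank to select. The names `LVTMemory`, `write`, and `read` are hypothetical, and the per-read-port replication of banks used in hardware is not modelled.

```python
class LVTMemory:
    """Illustrative software model of an LVT-style multi-write-port memory.

    Assumption: one bank per write port plus a table of "live" port indices.
    Writes are applied one at a time; hardware timing is not modelled.
    """

    def __init__(self, depth, num_write_ports):
        # banks[p][a] is the word last written to address a through port p.
        self.banks = [[0] * depth for _ in range(num_write_ports)]
        # lvt[a] is the index of the port that most recently wrote address a.
        self.lvt = [0] * depth

    def write(self, port, addr, value):
        self.banks[port][addr] = value
        self.lvt[addr] = port      # mark this port's bank as holding the live value

    def read(self, addr):
        # Select the bank that holds the live value for this address.
        return self.banks[self.lvt[addr]][addr]


mem = LVTMemory(depth=256, num_write_ports=4)
mem.write(1, 10, 111)
mem.write(3, 10, 333)   # later write through another port supersedes the first
print(mem.read(10))
```

The table holds only a port index per address, so it is narrow compared with the data words. The cost in this model is the selection step on every read.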
fpgacpu.ca