PUBLICATIONS

Theses

High-Speed Soft-Processor Architecture for FPGA Overlays (Slides)
Doctor of Philosophy (ECE), University of Toronto, 2014
Efficient Multi-Ported Memories for FPGAs (T-Space copy)
Master of Applied Science (ECE), University of Toronto, 2009
Second-Generation Stack Computer Architecture (Waterloo copy)
Bachelor of Independent Studies, University of Waterloo, 2007

Journal Articles

Microarchitectural Comparison of the MXP and Octavo Soft-Processor FPGA Overlays
Charles Eric LaForest, Jason H. Anderson
ACM Transactions on Reconfigurable Technology and Systems (TRETS), May 2017, Volume 10, Issue 3, Article No. 19
Field-Programmable Gate Arrays (FPGAs) can yield higher performance and lower power than software solutions on CPUs or GPUs. However, designing with FPGAs requires specialized hardware design skills and hours-long CAD processing times. To reduce and accelerate the design effort, we can implement an overlay architecture on the FPGA, on which we then more easily construct the desired system but at a large cost in performance and area relative to a direct FPGA implementation. In this work, we compare the micro-architecture, performance, and area of two soft-processor overlays: the Octavo multi-threaded soft-processor and the MXP soft vector processor. To measure the area and performance penalties of these overlays relative to the underlying FPGA hardware, we compare direct FPGA implementations of the micro-benchmarks written in C synthesized with the LegUp HLS tool and also written in the Verilog HDL. Overall, Octavo's higher operating frequency and MXP's more efficient code execution result in similar performance from both, within an order of magnitude of direct FPGA implementations, but with a penalty of an order of magnitude greater area.
Composing Multi-Ported Memories on FPGAs
Charles Eric LaForest, Zimo Li, Tristan O'Rourke, Ming G. Liu, J. Gregory Steffan
ACM Transactions on Reconfigurable Technology and Systems (TRETS), August 2014, Volume 7, Issue 3, Article No. 16
Multi-ported memories are challenging to implement on FPGAs since the block RAMs included in the fabric typically have only two ports. Hence we must construct memories requiring more than two ports either out of logic elements or by combining multiple block RAMs. We present a thorough exploration and evaluation of the design space of FPGA-based soft multi-ported memories for conventional solutions, and also for the recently-proposed Live Value Table (LVT) and XOR approaches to unidirectional-port memories, reporting results for both Altera and Xilinx FPGAs. Additionally, we thoroughly evaluate and compare with a recent LVT-based approach to bidirectional-port memories by Choi et al.

Conference Papers

Approaching Overhead-Free Execution on FPGA Soft-Processors
Charles Eric LaForest, Jason Anderson, J. Gregory Steffan
IEEE International Conference on Field-Programmable Technology (FPT), December 2014, Shanghai, China
Orthogonal to limitations in parallelism or clock frequency, the low performance of soft-processors primarily originates in the intrinsic addressing and flow-control overheads of scalar microprocessors, which expend a considerable number of cycles interleaving address calculations and branch decisions within the actual useful work. We present an improved version of the Octavo soft-processor which statically overlaps "overhead" computations and executes them in parallel with the "useful" computations, while still reaching 500 MHz on the Altera Stratix IV FPGA -- 0.909x of the absolute maximum rating. We evaluate our cycle count improvements with multiple benchmarks, achieving speedups ranging from 1.07x for control-heavy code, to 1.92x for loop-heavy code, never performing worse than the original sequential code, and always performing better than a totally unrolled loop.
Slides: PDF (also available from https://wiki.tcfpga.org/FPT2014 along with many others)
Maximizing Speed and Density of Tiled FPGA Overlays via Partitioning
Charles Eric LaForest and J. Gregory Steffan
IEEE International Conference on Field-Programmable Technology (FPT), December 2013, Kyoto, Japan
Common practice for large FPGA design projects is to divide sub-projects into separate synthesis partitions to allow incremental recompilation as each sub-project evolves. In contrast, smaller design projects avoid partitioning to give the CAD tool the freedom to perform as many global optimizations as possible, knowing that the optimizations normally improve performance and possibly area. In this paper, we show that for high-speed tiled designs composed of duplicated components and hence having "multi-localities" (multiple instances of equivalent logic), a designer can use partitioning to preserve multi-locality and improve performance. In particular, we focus on the lanes of SIMD soft processors and multicore meshes composed of them, as compiled by Quartus 12.1 targeting a Stratix IV EP4SE230F29C2 device. We demonstrate that, with negligible impact on compile time (less than +/-10%): (i) we can use partitioning to provide high-level information to the CAD tool about preserving multi-localities in a design, without low-level micro-managing of the design description or CAD tool settings; (ii) by preserving multi-localities within SIMD soft processors, we can increase both frequency (by up to 31%) and compute density (by up to 15%); (iii) partitioning improves the density and speed (by up to 51 and 54%) of a mesh of soft processors, across many building block configurations and mesh geometries; (iv) the improvements from partitioning increase as the number of tiled computing elements (SIMD lanes or mesh nodes) increases. As an example of the benefits of partitioning, a mesh of 102 scalar soft processors improves its operating frequency from 284 up to 437 MHz, its peak performance from 28,968 up to 44,574 MIPS, while increasing its logic area by only 0.85%.
Slides: PPTX PDF
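
The mesh figures quoted in the abstract above can be cross-checked with a little arithmetic. The sketch below assumes that peak MIPS is simply the number of processors times the clock frequency in MHz, with one instruction per cycle per processor; that assumption and the helper name peak_mips are illustrative and not taken from the paper.

```python
# Cross-check of the 102-processor mesh figures from the abstract.
# Assumption (not from the paper): each scalar soft processor retires
# at most one instruction per clock cycle, so peak MIPS = processors * MHz.

def peak_mips(num_processors, clock_mhz, instructions_per_cycle=1):
    """Peak millions of instructions per second for a mesh of processors."""
    return num_processors * clock_mhz * instructions_per_cycle

mips_unpartitioned = peak_mips(102, 284)   # 28968
mips_partitioned = peak_mips(102, 437)     # 44574

# Relative frequency improvement from partitioning, in percent.
speedup_percent = (437 / 284 - 1) * 100

print(mips_unpartitioned, mips_partitioned, round(speedup_percent))
```

Under that assumption the products match the 28,968 and 44,574 MIPS quoted above, and the frequency ratio is consistent with the "up to 54%" speed improvement.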
Octavo: An FPGA-Centric Processor Family
Charles Eric LaForest and J. Gregory Steffan
ACM International Symposium on Field-Programmable Gate Arrays (FPGA), February 2012, Monterey, CA.
Overlay processor architectures allow FPGAs to be programmed by non-experts using software, but prior designs have mainly been based on the architecture of their ASIC predecessors. In this paper we develop a new processor architecture that from the beginning accounts for and exploits the predefined widths, depths, maximum operating frequencies, and other discretizations and limits of the underlying FPGA components. The result is Octavo, a ten-pipeline-stage eight-threaded processor that operates at the block RAM maximum of 550 MHz on a Stratix IV FPGA. Octavo is highly parameterized, allowing us to explore trade-offs in datapath and memory width, memory depth, and number of supported thread contexts.
Slides: PPTX PDF
See the GitHub Octavo repository for the latest version and plans for future work.
Multi-Ported Memories for FPGAs via XOR
Charles Eric LaForest, Ming G. Liu, Emma Rae Rapati, and J. Gregory Steffan
ACM International Symposium on Field-Programmable Gate Arrays (FPGA), February 2012, Monterey, CA.
Multi-ported memories are challenging to implement with FPGAs since the block RAMs included in the fabric typically have only two ports. Any design that requires a memory with more than two ports must therefore be built out of logic elements or by combining multiple block RAMs. The recently-proposed Live Value Table (LVT) design provides a significant operating frequency improvement over conventional approaches. In this paper we present an alternative approach based on the XOR operation that provides multi-ported memories that use far less logic but more block RAMs than LVT designs, and are often smaller and faster for memories that are more than 512 entries deep. We show that (i) both designs can exploit multipumping to trade speed for area savings, (ii) that multipumped XOR designs are significantly smaller but moderately slower than their LVT counterparts, and (iii) that both the LVT and XOR approaches are valuable and useful in different situations, depending on the constraints and resource utilization of the enclosing design.
Slides: PPTX PDF (rough conversion, sorry)
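
As a rough behavioural illustration of the XOR idea named in the abstract above, the sketch below models one bank per write port. A write stores the new value XORed with the other banks' contents at that address, and a read XORs all banks together so the older contributions cancel. The class and method names are hypothetical, and the model ignores the bank replication that real hardware would need to supply enough physical read ports; it is not the paper's implementation.

```python
# Behavioural sketch of an XOR-based multi-write-port memory.
# Assumptions: one logical bank per write port, all banks start at zero,
# and at most one port writes a given address at a time.

class XorMemory:
    def __init__(self, depth, num_write_ports):
        # One bank (list of words) per write port.
        self.banks = [[0] * depth for _ in range(num_write_ports)]

    def write(self, port, addr, value):
        # Encode the value against every other bank's word at this address.
        encoded = value
        for p, bank in enumerate(self.banks):
            if p != port:
                encoded ^= bank[addr]
        # Only this port's own bank is modified.
        self.banks[port][addr] = encoded

    def read(self, addr):
        # XOR of all banks: the other banks' words cancel, leaving the
        # most recently written value.
        result = 0
        for bank in self.banks:
            result ^= bank[addr]
        return result

mem = XorMemory(depth=8, num_write_ports=2)
mem.write(0, 3, 0xAB)
mem.write(1, 3, 0x5C)   # a later write through the other port wins
print(hex(mem.read(3)))
```

No table is needed to track which port wrote last, which is in line with the abstract's point that XOR designs spend less logic and more block RAM than LVT designs.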
Efficient Multi-Ported Memories for FPGAs
Charles Eric LaForest and J. Gregory Steffan
ACM International Symposium on Field-Programmable Gate Arrays (FPGA), February 2010, Monterey, CA.
Multi-ported memories are challenging to implement with FPGAs since the provided block RAMs typically have only two ports. We present a thorough exploration of the design space of FPGA-based soft multi-ported memories by evaluating conventional solutions to this problem, and introduce a new design that efficiently combines block RAMs into multi-ported memories with arbitrary numbers of read and write ports and true random access to any memory location, while achieving significantly higher operating frequencies than conventional approaches. For example we build a 256-location, 32-bit, 12-ported (4-write, 8-read) memory that operates at 281 MHz on Altera Stratix III FPGAs while consuming an area equivalent to 3679 ALMs: a 43% speed improvement and 84% area reduction over a pure ALM implementation, and a 61% speed improvement over a pure "multipumped" implementation, although the pure multipumped implementation is 7.2x smaller.
Selected as one of the 25 most significant papers in the first 20 years of the conference: FPGA20 (endorsement)
Slides: PPTX PDF
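
For comparison, here is a similarly simplified behavioural sketch of a Live Value Table style memory, the approach a later abstract on this page refers to as LVT. It assumes one bank per write port plus a small table recording which port last wrote each address, so a read can select the bank holding the live value. The names are hypothetical and the model omits per-read-port bank replication; it is an illustration, not the published design.

```python
# Behavioural sketch of a Live Value Table (LVT) multi-write-port memory.
# Assumptions: one logical bank per write port, and at most one port
# writes a given address at a time.

class LvtMemory:
    def __init__(self, depth, num_write_ports):
        # One bank per write port.
        self.banks = [[0] * depth for _ in range(num_write_ports)]
        # For each address, the id of the write port that wrote it last.
        self.lvt = [0] * depth

    def write(self, port, addr, value):
        # Each port writes only its own bank and updates the table.
        self.banks[port][addr] = value
        self.lvt[addr] = port

    def read(self, addr):
        # The table steers the read to the bank with the live value.
        return self.banks[self.lvt[addr]][addr]

mem = LvtMemory(depth=256, num_write_ports=4)
mem.write(2, 10, 99)
mem.write(0, 10, 5)     # newer write through a different port
print(mem.read(10))
```

The stale copy in bank 2 is left in place; only the table entry changes, which is why the table itself must support as many ports as the whole memory and is built from logic rather than block RAM in this kind of scheme.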
