Zero-Overhead Memory-Mapping Address Translation

by GateForge Consulting Ltd.

When memory-mapping a small memory or a number of control registers to a base address that isn't aligned to a power-of-2 boundary, the least significant bits (LSBs) of the address will not address the mapped entries in order. This offset scrambles the order of the control registers and memory locations, so the mapped order no longer matches the internal order of the actual hardware, which makes debugging harder.

If the address range does not start at a power-of-2 boundary, and may not span a power-of-2 sized block, the LSBs are not guaranteed to be consecutive, exhaustive, and starting at zero: their order can be rotated by the offset from the nearest power-of-2 boundary.

However, we can construct a translation table that can optimize down to mere rewiring of inputs or internal LUT logic, introducing no timing or area overhead. We implement the table as a small read-only memory which translates the raw LSBs into consecutive LSBs to directly address the memory or control registers. A separate address range decoder signals that the translation is valid.
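The separate address range decoder is not shown in this article. As a minimal sketch only, assuming a simple inclusive range compare on the full address, it could look like the module below. The module name, port names, and the FULL_ADDR_WIDTH parameter are hypothetical and not part of the original design; note that this decoder sees the whole address, while the translator only sees the LSBs.

```verilog
// Hypothetical range decoder sketch: all names here are illustrative only.
// Raises "hit" when the full address lies in the mapped range
// ADDR_BASE to ADDR_BASE + ADDR_COUNT - 1 (inclusive).

module Example_Range_Decoder
#(
    parameter       FULL_ADDR_WIDTH     = 4,
    parameter       ADDR_BASE           = 7,
    parameter       ADDR_COUNT          = 4
)
(
    input   wire    [FULL_ADDR_WIDTH-1:0]   address,
    output  reg                             hit
);

    // Sized constants, so both comparisons are done at the address width.
    localparam [FULL_ADDR_WIDTH-1:0] BASE  = ADDR_BASE;
    localparam [FULL_ADDR_WIDTH-1:0] BOUND = ADDR_BASE + ADDR_COUNT - 1;

    always @(*) begin
        hit = (address >= BASE) && (address <= BOUND);
    end

endmodule
```

With the default parameters, hit would be high for addresses 7 through 10 and low elsewhere, and it would qualify the translated LSBs as valid.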

For example, take 4 locations, addressed 0 to 3, but mapped at addresses 7 to 10. We want address 7 to access the zeroth location, and so on. The address bits must be translated as follows:

0111 --> 00
1000 --> 01
1001 --> 10
1010 --> 11

You can see that the two raw LSBs follow the sequence 3,0,1,2, which we must map to 0,1,2,3. We can pre-fill a table with the right values to do that: address 3 will contain value 0, address 0 will contain value 1, and so on.
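To make the example concrete, the same four-entry mapping can be written out by hand as a fully specified case statement. This is only an illustrative sketch with a hypothetical module name, not the parameterized module. It is equivalent to subtracting the base LSBs (3) from the raw LSBs, modulo 4.

```verilog
// Hand-written equivalent of the example table: 4 locations mapped at
// addresses 7 to 10. Hypothetical module, for illustration only.

module Example_Translation_Base7_Count4
(
    input   wire    [1:0]   raw_lsb,
    output  reg     [1:0]   translated_lsb
);

    always @(*) begin
        case (raw_lsb)
            2'd3: translated_lsb = 2'd0;    // address  7 (0111)
            2'd0: translated_lsb = 2'd1;    // address  8 (1000)
            2'd1: translated_lsb = 2'd2;    // address  9 (1001)
            2'd2: translated_lsb = 2'd3;    // address 10 (1010)
        endcase
    end

endmodule
```

Since every input value is covered, the logic is purely combinational, and a synthesis tool is free to absorb it into whatever logic it feeds.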

// Translates a non-consecutive sequence of address bits into a consecutive
// one, so they can be used "as expected" with other addressed components
// (muxes, RAMs, etc...). ***Consumes no hardware.***

module Address_Range_Translator
#(
    parameter       ADDR_COUNT          = 0,
    parameter       ADDR_BASE           = 0,
    parameter       ADDR_WIDTH          = 0,
    parameter       REGISTERED          = 0
)
(
    input   wire                        clock,
    input   wire    [ADDR_WIDTH-1:0]    raw_address,
    output  reg     [ADDR_WIDTH-1:0]    translated_address
);

// -----------------------------------------------------------

    //  *********  DO NOT MOVE! ************

    // Doing the obvious thing of placing a register at the output prevents
    // Quartus from reducing the translation table to a simple LUT
    // configuration change or input rewiring, and instead creates a small
    // RAM, which is too slow. It appears the translation table trick only
    // works when outputting straight into a RAM.

    reg [ADDR_WIDTH-1:0] cooked_address = 0;

    generate
        if (REGISTERED == 1) begin
            always @(posedge clock) begin
                cooked_address <= raw_address;
            end
        end
        else begin
            always @(*) begin
                cooked_address = raw_address;
            end
        end
    endgenerate

// -----------------------------------------------------------

    localparam ADDR_DEPTH = 2**ADDR_WIDTH;

    integer                     i, j;
    reg     [ADDR_WIDTH-1:0]    translation_table [ADDR_DEPTH-1:0];

    initial begin

        // In the case where ADDR_COUNT < ADDR_DEPTH, make sure all entries are
        // defined. This happens for a single entry: ADDR_WIDTH is artificially
        // kept at 1 instead of 0.

        for(i = 0; i < ADDR_DEPTH; i = i + 1) begin
            translation_table[i] = 0;
        end

        // In the case of a single entry, the LSB (j) will be either 1 or 0,
        // but always translates to 0, thus this should optimize away.

        j = ADDR_BASE[ADDR_WIDTH-1:0];
        for(i = 0; i < ADDR_COUNT; i = i + 1) begin
            translation_table[j] = i[ADDR_WIDTH-1:0];
            j = (j + 1) % ADDR_DEPTH; // Force wrap-around
        end
    end

// -----------------------------------------------------------

    always @(*) begin
        translated_address = translation_table[cooked_address];
    end

endmodule
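
As a usage sketch, the module can be instantiated for the earlier example (4 locations mapped at addresses 7 to 10, so 2 LSBs) and checked in simulation. The testbench below is hypothetical and assumes a Verilog-2001 simulator; only the instantiated module and its parameters come from the code above.

```verilog
`timescale 1ns / 1ps

// Hypothetical testbench: its name and structure are illustrative only.

module Address_Range_Translator_Example_tb;

    reg             clock = 1'b0;   // Unused here, since REGISTERED = 0
    reg     [1:0]   raw_address;
    wire    [1:0]   translated_address;

    Address_Range_Translator
    #(
        .ADDR_COUNT         (4),
        .ADDR_BASE          (7),
        .ADDR_WIDTH         (2),
        .REGISTERED         (0)
    )
    dut
    (
        .clock              (clock),
        .raw_address        (raw_address),
        .translated_address (translated_address)
    );

    integer address;
    integer errors;

    initial begin
        errors = 0;
        #1; // Let the translation table initialize before driving inputs
        for (address = 7; address <= 10; address = address + 1) begin
            raw_address = address[1:0];     // Only the LSBs reach the translator
            #1;
            if (translated_address !== (address - 7)) begin
                errors = errors + 1;
                $display("Address %0d: got %0d, expected %0d",
                         address, translated_address, address - 7);
            end
        end
        if (errors == 0) begin
            $display("All 4 addresses translated as expected.");
        end
        $finish;
    end

endmodule
```

In a real design, translated_address would drive the address port of the mapped RAM or register mux directly, which is the case where the table is expected to optimize away.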

fpgacpu.ca