
Processor and its software in VHDL part 3: Configurable Processor Pipelining

In part 1 we designed a processor and an assembler for writing software as initial values of ram in VHDL. In part 2 we made the processor word length configurable and modified it to process floating point operations. Following the idea of configurable word length, this time we will add the possibility of inserting pipeline registers in the processor to shorten the logic paths created by the increased word length. Using a similar idea as in part 2, we will chain the configuration of the processing pipelines so that both the software and the actual processor adapt to the pipeline depths defined by the word length and pipeline configuration of the floating point alu.

We will first create some functions that fill in the required number of nop instructions based on the floating point configuration. This allows our software to work with different pipeline depths. We will also use the same configuration to move the stage at which we catch the result. This way we can write a filter software in VHDL and then tune the resource use and accuracy by adjusting pipeline depths and word lengths to meet our desired performance.

We will test the configurable pipelines with Efinix Titanium and Trion, Lattice ECP5 and NX, Intel Cyclone 10lp and AMD Artix 7, all running at the same 120MHz core clock speed. We will create a configuration file for each FPGA to allow each to run with the minimal amount of pipelining needed to meet timing with 8 exponent and 24 mantissa bits in our floating point alu. The processor source code can be found in its own stand-alone repository and the example project in which it is used can be found here.

The processor repository is included in the example project as a submodule, and the float alu and memory modules are also in their own repositories for easy adoption outside of the presented application. All of the referenced repositories are part of the open source hVHDL project and are licensed under the permissive MIT license. The project also has its own Discord where you can get support, give feedback or ask questions if you find issues when testing or using the modules.

Adding an extra layer of hierarchy to our assembler

The example program that we have used to test our processor in the previous blog posts is a simple first order low pass filter, with the conceptual program shown below

temp = u - y
mpy_result = temp * g 
y = y + mpy_result
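
Written out as a single update, the program computes y = y + g * (u - y), so on every run the output y moves toward the input u by the fraction g, which is what makes it a first order low pass filter.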

In our low pass filter we need the result of the first subtraction for the multiplication, and we need the result of the multiplication for the final addition. Hence each instruction needs the result of the previous one.

In our pipelined processor design, we have delayed the next operation by the number of pipeline stages using a string of no-operation, or “nop”, instructions in the software. If we assume 2 pipeline cycles for each of the operations, the program with its nops is written with our VHDL functions as follows

    constant low_pass_filter : program_array := (
        write_instruction(sub  , temp , u , y) ,
        write_instruction(nop) ,
        write_instruction(nop) ,
        
        write_instruction(mpy  , mpy_result , temp , g) ,
        write_instruction(nop) ,
        write_instruction(nop) ,
        
        write_instruction(add  , y, mpy_result , y),
        write_instruction(nop) ,
        write_instruction(nop) );

We have already used the “&” operator for combining multiple program_arrays into a single program. Using this feature we can simplify our code by creating functions that return the corresponding arrays of instructions with the required number of nops, instead of using the write_instruction functions directly. For example, to add two values together we call “add”, which returns a vector of instructions that perform the add operation.

    function add
    (
        result_reg, left, right : natural
    )
    return program_array is
        constant fill : program_array(0 to number_of_float_add_fills) := (others => write_instruction(nop));
    begin
        return write_instruction(add, result_reg, left, right) & fill;
    end add;
    ------------------------------
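
A matching “multiply” function follows the same pattern. The sketch below is an assumption of what it could look like, using the mpy opcode from the earlier example and the number_of_float_mpy_fills constant introduced in the next section:

    function multiply
    (
        result_reg, left, right : natural
    )
    return program_array is
        -- same nop fill idea as in "add", but sized by the multiplier pipeline
        constant fill : program_array(0 to number_of_float_mpy_fills) := (others => write_instruction(nop));
    begin
        return write_instruction(mpy, result_reg, left, right) & fill;
    end multiply;
    ------------------------------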

Using similar functions for “sub” and “multiply”, we can write our low pass filter with three function calls and concatenate the results together

        ------------------------------
        constant low_pass_filter : program_array :=(
            sub(temp, u, y)                 &
            multiply(mpy_result , temp , g) &
            add(y, y, mpy_result));
        ------------------------------

Since our small programs are constants, we can freely combine the program_arrays without needing to define the length of the array. The synthesis tools automatically figure out the correct length of the constants, and hence the lengths of the program_arrays scale automatically with the number of nops that our add function returns. The full example program can be found in float_example_programs_pkg.vhd
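
This works because program_array is an unconstrained array type, so a constant takes its range from the value it is initialized with. A minimal sketch of the idea, assuming the type is declared along these lines in the processor packages:

    -- the range is left open, so a constant built by concatenation
    -- gets its length from the expression that initializes it
    type program_array is array (natural range <>) of t_instruction;

    -- the program length therefore follows the configured nop fills
    constant program_length : natural := low_pass_filter'length;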

Using the idea with the floating point processor

The pipeline stages of our floating point alu are already configurable via constants defined in packages that are found in the floating point repository. The design of the float alu was the topic of a previous blog post.

The floating point add has a denormalizer stage, the actual add stage and finally a normalizer stage, while the multiplier has multiplier and normalizer stages. Thus the number of pipeline stages in the float add is the combination of the add, denormalizer and normalizer pipelines, and similarly the multiplier nop fill is the combination of the mpy pipeline stages and the normalizer pipeline stages

        use work.normalizer_pkg.number_of_normalizer_pipeline_stages;
        use work.denormalizer_pkg.number_of_denormalizer_pipeline_stages;

        constant number_of_float_add_fills : natural :=
            number_of_normalizer_pipeline_stages    +
            add_pipeline_stages                     +
            number_of_denormalizer_pipeline_stages;
        
        constant number_of_float_mpy_fills : natural :=
            mpy_pipeline_stages                     +
            number_of_normalizer_pipeline_stages;

The pipeline stage constants are also used internally in the floating point alu to determine the number of actual pipeline stages. Hence, if we add more pipeline stages and resynthesize the code, the software automatically fills in the proper number of nops to accommodate the hardware.

Adding a configurable number of stages to the processing pipeline

Our processing pipeline is shown in the snippet below. The instruction pipeline, assigned at the top of the process, is a shift register into which instruction_from_ram is shifted at one end and which then advances one stage each clock cycle. If an “add” instruction is decoded in stage zero, the corresponding add procedure from the float alu is called with the registers decoded from the instruction.

The add instruction then propagates through the instruction pipeline in lock-step with the actual processing happening in the float alu. When the instruction has reached stage 3, corresponding to the result of the float alu being ready, the result is read from the float alu and placed into the register whose number is decoded from the instruction with the get_dest function.

     process(clock) is
        variable used_instruction : t_instruction;
     begin
        if rising_edge(clock) then
        ----------------------
            self.instruction_pipeline <= instruction_from_ram & self.instruction_pipeline(0 to self.instruction_pipeline'high-1);
        ----------------------
            used_instruction := self.instruction_pipeline(0);
            CASE decode(used_instruction) is
                WHEN add => 
                    add(float_alu, 
                        to_float(self.registers(get_arg1(used_instruction))), 
                        to_float(self.registers(get_arg2(used_instruction))));
                WHEN others => -- do nothing
            end CASE;
        ----------------------
            used_instruction := self.instruction_pipeline(1);
        ----------------------
            used_instruction := self.instruction_pipeline(2);
        ----------------------
            used_instruction := self.instruction_pipeline(3);
            CASE decode(used_instruction) is
                WHEN add =>
                    self.registers(get_dest(used_instruction)) <= get_add_result(float_alu);
                WHEN others => -- do nothing
            end CASE;
        ----------------------

The index in the instruction pipeline corresponds to the number of delays, and the number of pipeline delays corresponds to the number of nop commands. Because of this, we can use the same constants with which we fill the nops in the software to automatically calculate the stage at which we catch the result of the float operations.

            used_instruction := self.instruction_pipeline(0);
         -- decode add instruction
        ----------------------
            used_instruction := self.instruction_pipeline(number_of_float_add_fills);
        -- read result from add
        ----------------------

In case we need to adjust the stage at which the float add is triggered, we can also use another constant for placing the add in the instruction pipeline. With this we can use a pair of constants to place the floating point add at any given point in the pipeline

            used_instruction := self.instruction_pipeline(add_stage);
            CASE decode(used_instruction) is
                WHEN add => 
                    add(float_alu, 
                        to_float(self.registers(get_arg1(used_instruction))), 
                        to_float(self.registers(get_arg2(used_instruction))));
                WHEN others => -- do nothing
            end CASE;
        ----------------------
            used_instruction := self.instruction_pipeline(add_stage + number_of_float_add_fills);
            CASE decode(used_instruction) is
                WHEN add =>
                    self.registers(get_dest(used_instruction)) <= get_add_result(float_alu);
                WHEN others => -- do nothing
            end CASE;
        ----------------------

Simulating the design

The microprogramming repository has a testbench for the floating point processor. The input value to the first order filter is initialized to 0.5, hence the testbench runs a step response calculation. The processor calculation is requested every 60 clock cycles of the simulator_clock and the result is fetched directly from ram; when ram_read_is_ready, the result is converted to real and assigned to result3. The use of real numbers here hides the floating point number format and makes the result look the same regardless of the configuration.

    stimulus : process(simulator_clock)
        variable used_instruction : t_instruction;
    begin
        if rising_edge(simulator_clock) then
            simulation_counter <= simulation_counter + 1;
            --------------------
            create_simple_processor (
                processor                ,
                ram_read_instruction_in  ,
                ram_read_instruction_out ,
                ram_read_data_in         ,
                ram_read_data_out        ,
                ram_write_port           ,
                used_instruction);

            create_float_alu(float_alu);

            create_float_command_pipeline(processor,float_alu, 
                ram_read_instruction_in  ,
                ram_read_instruction_out ,
                ram_read_data_in         ,
                ram_read_data_out        ,
                ram_write_port           ,
                used_instruction);

            ------------------------------------------------------------------------
            -- test signals
            ------------------------------------------------------------------------
            if simulation_counter mod 60 = 0 then
                request_processor(processor);
            end if;
            processor_is_ready <= program_is_ready(processor);
            if program_is_ready(processor) then
                counter <= 0;
                counter2 <= 0;
            end if;
            if counter < 7 then
                counter <= counter +1;
            end if;

            CASE counter is
                WHEN 0 => request_data_from_ram(ram_read_data_in, y_address);
                WHEN others => --do nothing
            end CASE;
            if not processor_is_enabled(processor) then
                if ram_read_is_ready(ram_read_data_out) then
                    counter2 <= counter2 + 1;
                    CASE counter2 is
                        WHEN 0 => result3 <= to_real(to_float(get_ram_data(ram_read_data_out)));
                        WHEN others => -- do nothing
                    end CASE; --counter2
                end if;
            end if;

        end if; -- rising_edge
    end process stimulus;	

------------------------------------------------------------------------
    u_mpram : entity work.ram_read_x2_write_x1
    generic map(ram_contents)
    port map(
    simulator_clock          ,
    ram_read_instruction_in  ,
    ram_read_instruction_out ,
    ram_read_data_in         ,
    ram_read_data_out        ,
    ram_write_port);
------------------------------------------------------------------------
end vunit_simulation;

The simulations are launched using VUnit and GHDL and the result is viewed with GTKWave. The command line for running the VUnit script is

python vunit_run_sw_processor.py -p 32 --gtkwave-fmt ghw

Note that VUnit does have a gui option, but it is preferable to open gtkwave separately from VUnit, since this allows rerunning the simulation in the background and reloading the waveform without the view changing.

Figure 1. Test run from the float_processor_tb.vhd viewed with gtkwave

The example project also has a testbench for the floating point processor. In addition to the floating point processor, it has the interconnect and the transformations from fixed to float and float to fixed that are used in the fpga project, hence it corresponds to the waveform that is obtained over uart.

python vunit_run.py -p 32 --gtkwave-fmt ghw *mcu*

Figure 2. Example project test run of float_mcu_tb.vhd

Hardware test

We test the code with five of the most common vendor tools: Lattice Diamond, Intel Quartus, Efinix Efinity, Lattice Radiant and AMD Vivado, as seen in the picture above. The tested boards are my own custom ECP5 design, a Cyclone 10lp evaluation kit, an Efinix Titanium evaluation kit, a CRUVI Certus-NX Base Board and an Alchitry AU Artix 7 board. I used the most up to date tools for all of them. You can find build scripts for all of them in the hVHDL example project repository.

The following snippet shows the configuration source file for the Efinix Titanium. Each FPGA has a similar configuration file, which defines the packages that configure the floating point alu. Although this does lead to some duplication, all of the other source files are shared between the different builds, so just by adjusting the numbers here we can individually tune the mantissa and exponent lengths as well as the pipeline depths.

-- Efinix Titanium
package denormalizer_pipeline_pkg is

    constant pipeline_configuration : natural := 1;

end package denormalizer_pipeline_pkg;
------------------------------------------------------------------------
------------------------------------------------------------------------
package normalizer_pipeline_pkg is

    constant normalizer_pipeline_configuration : natural := 1;

end package normalizer_pipeline_pkg;
------------------------------------------------------------------------
------------------------------------------------------------------------
package float_word_length_pkg is

    constant mantissa_bits : integer := 24;
    constant exponent_bits : integer := 8;

end package float_word_length_pkg;
------------------------------------------------------------------------
------------------------------------------------------------------------
Building and running the test script with a one-liner

The build scripts in the example project allow running the build, loading the binary to the fpga and launching the python script which plots the results, all with a single command. The command line arguments for the python script give the com port and the name of the plot.

d:\lscc\diamond\3.13\bin\nt64\pnmainc.exe C:\dev\hVHDL_example_project\ecp5_build\ecp5_compile.tcl ; D:\lscc\diamond\3.13\bin\nt64\pgrcmd.exe -infile C:\dev\diamond_jtag.xcf ; python C:\dev\hVHDL_example_project\test_app.py com7 "Lattice ECP5"
quartus_sh -t C:\dev\hVHDL_example_project\quartus_build\compile_with_quartus.tcl;quartus_pgm -c "Cyclone 10 LP Evaluation Kit [USB-1]" -m jtag -o "p;.\output\top.sof"; python C:\dev\hVHDL_example_project\test_app.py com4 "Cyclone 10lp evm"
start /wait test_with_hw.bat & python test_app.py com14 "titanium evm"
d:\lscc\radiant\2023.2\bin\nt64\pnmainc.exe C:\dev\hVHDL_example_project\radiant_build\radiant_compile.tcl ; D:\lscc\radiant\2023.2\programmer\bin\nt64\pgrcmd.exe -infile C:\dev\build_radiant_jtag.xcf ; python C:\dev\hVHDL_example_project\test_app.py com5 "Lattice NX"
d:\xilinx\Vivado\2023.2\bin\vivado.bat -mode batch -nolog -nojournal -source C:\dev\hVHDL_example_project\alchitry_au_plus\vivado_compile.tcl -tclargs ram ; python C:\dev\hVHDL_example_project\test_app.py com7 "Artix 7"

In order for the one-liner to work with the Lattice tools, the .xcf file for the jtag needs to be created using the programmer tool. Note that all boards aside from the Cyclone 10lp EVM use ftdi chips for programming and uart.

Running the one-liner with each of the evaluation kits produces the plots shown below.

Figure 3. Configurable floating point processor run with all fpgas

Hardware configuration

The different FPGAs are running slightly different configurations. All of them use a 24 bit mantissa and an 8 bit exponent, and we increase the floating point pipeline depths until they meet the required timing of 120MHz. Each FPGA has a separate package file for its own floating point configuration. Note that all other sources are shared, so they all run the same software and the configuration is propagated through the design.

Resource use for floating point processor with different fpgas

FPGA (vendor)         | DSP          | LUT (reg)   | Memory blocks    | Pipelines (norm/denorm)
Cyclone 10 lp (Intel) | 7 (9x9 bit)  | 3514 (1331) | 4 (M9k)          | 3/3
Artix 7 (AMD)         | 2            | 1926 (1365) | 1 bram, 59 dmem  | 2/2
ECP5 (Lattice)        | 4            | 3132 (1273) | 2 ebr, 36 dmem   | 4/3
Certus NX (Lattice)   | 1            | 2826 (1016) | 2 ebr, 11 dmem   | 4/4
Titanium (Efinix)     | 4            | 1972 (910)  | 3 (10k)          | 1/1
Trion (Efinix)        | 4            | 2570 (1432) | 6 (5k)           | 3/4

The memory use differs a bit between the fpgas due to the 2 read port and 1 write port configuration. The Lattice and AMD FPGAs have distributed memory, which is also used for the processor ram. The logic use is also not directly comparable, as the pipeline depths are different. All of them meet timing and hence produce exactly the same result, though the latency of the calculation differs.
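
As an example, the Cyclone 10 lp build in the table above uses the 3/3 normalizer and denormalizer depths, so its pipeline packages could look like the following sketch, assuming the same package structure as in the Titanium example:

-- Intel Cyclone 10 lp (sketch, deeper pipelines than the Titanium example)
package denormalizer_pipeline_pkg is

    constant pipeline_configuration : natural := 3;

end package denormalizer_pipeline_pkg;
------------------------------------------------------------------------
------------------------------------------------------------------------
package normalizer_pipeline_pkg is

    constant normalizer_pipeline_configuration : natural := 3;

end package normalizer_pipeline_pkg;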

Next

Much of the code is still in an initial testing phase and the processor repository will require some refactoring to make it easier to navigate. There are some basic control functions still missing from our processor design, like branch and jump commands, function calls, as well as stalling the pipeline. The structure of the floating point alu is also far from optimal, which means that the performance and logic area can be made substantially better. These functionalities might be topics of future blog posts.
