Processor and its software in VHDL, part 2: Configurable floating point processing

In part 1 of processor and its software design in VHDL, we designed an assembler with which to write software directly in VHDL. We used this assembler to write test software for a set of three low pass filters, which were then tested with a noisy sine and synthesized for an Efinix Titanium FPGA. Using such a processor in our FPGA designs allows us to store the logic of how data is moved around in ram and additionally constrain the data movement between logic and ram ports. By constraining the data movement to and from ram, a processor allows us to easily reuse expensive parts of FPGA designs. The obvious target for reuse is the floating point hardware, and running floating point calculations on a configurable processor is the topic of this blog post. The processor and assembler can be found in the hVHDL microprogram processor repository.

The processor is designed by reusing the processor code from the previous blog post together with a floating point command pipeline that processes the instructions as floats. We will refactor the ram control part of the processor into its own module and then connect it to a new pipeline which uses the floating point alu from the previous blog post. The design of the floating point modules was covered in a post about floating point arithmetic.

Mirroring the design process, this blog post is structured around how we created the abstractions needed to reuse the modules in a new processor design. In addition to the processor design, we will add the necessary configurations so that the word length of the floating point module configures the rest of the processor. This way we can use any desirable floating point word length and have the processor, and the software which is written in VHDL, adapt to it automatically.

The designed floating point processor is tested with the Efinix Titanium evaluation kit. The impact of the word length on logic use is tested by synthesizing the code with multiple floating point mantissa lengths. The example project has build scripts for Vivado, Quartus, Diamond and Radiant. Currently the floating point processor design meets the targeted 120MHz timing only with the Titanium, hence in this blog we will only develop the code for the Titanium. We do not break the other builds, however: the code is refactored so that the builds other than Efinix Titanium still run with the original logic-only floating point design. The timing of the float processor for the other FPGAs will be fixed later and might be a topic of another blog post.

Processor code structure

The processor that we designed in the previous post has roughly three parts: the ram, the ram control unit and the processing pipeline. The ram port is connected to the flow control, which either passes the ram contents through directly or, in case the processor is stopped, pushes only no-operations into the processing pipeline.

The ram design was already using the multi port ram module from the hVHDL memory library, but a snippet from the original code below shows that the processor has the flow control part and the math unit in the same module. The math unit, which the original comment says is just for testing, has all of its register widths hard coded to 20 bits.

library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    use work.microinstruction_pkg.all;
    use work.multi_port_ram_pkg.all;
    use work.multiplier_pkg.radix_multiply;

package simple_processor_pkg is

    type simple_processor_record is record
        processor_enabled    : boolean                 ;
        is_ready             : boolean                 ;
        program_counter      : natural range 0 to 1023 ;
        registers            : reg_array               ;
        instruction_pipeline : instruction_array       ;
        -- math unit for testing, will be removed later
        add_a          : std_logic_vector(19 downto 0) ;
        add_b          : std_logic_vector(19 downto 0) ;
        add_result     : std_logic_vector(19 downto 0) ;
        mpy_a          : std_logic_vector(19 downto 0) ;
        mpy_b          : std_logic_vector(19 downto 0) ;
        mpy_a1         : std_logic_vector(19 downto 0) ;
        mpy_b1         : std_logic_vector(19 downto 0) ;
        mpy_raw_result : signed(39 downto 0)           ;
        mpy_result     : std_logic_vector(19 downto 0) ;
    end record;
Separating the flow control from the instruction path

Since we want to use the same control module for our floating point processor, while still keeping the old implementation in working condition, we need to separate the two features from each other. Doing this allows us to replace the instruction pipeline and data control path of the fixed point processor with the floating point processing pipeline.

The original processor is separated into two parts: the processor, which has only the program flow control part that drives the ram, and a separate command pipeline that actually executes the original commands.

Separating the processor module

The processor has the register array as well as the program counter and the ready and enabled flags, and it controls the flow of instructions from ram to the command pipeline. The procedure which creates this module also controls the ram ports. The processor module record is shown in the snippet below.

library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    use work.microinstruction_pkg.all;
    use work.multi_port_ram_pkg.all;
    use work.processor_configuration_pkg.all;

package simple_processor_pkg is

    alias register_bit_width is work.processor_configuration_pkg.register_bit_width;

    type simple_processor_record is record
        processor_enabled    : boolean                 ;
        is_ready             : boolean                 ;
        program_counter      : natural range 0 to 1023 ;
        registers            : reg_array               ;
        instruction_pipeline : instruction_array       ;
    end record;
------------------------------------------------------------------------     

The create_simple_processor procedure initializes used_instruction to the “nop” command by default. If the processor is enabled and the ram read port is ready, the program counter is incremented and a new word is requested from ram. The if statements here guarantee a clean start for the processor, such that only “nop” instructions are pushed into the instruction_pipeline in the cycles when the processor is enabled but the ram is not yet ready to be read. The create_simple_processor is defined as

    procedure create_simple_processor
    (
        signal self                    : inout simple_processor_record ;
        signal ram_read_instruction_in : out ram_read_in_record        ;
        ram_read_instruction_out       : in ram_read_out_record        ;
        signal ram_read_data_in        : out ram_read_in_record        ;
        ram_read_data_out              : in ram_read_out_record        ;
        signal ram_write_port          : out ram_write_in_record       ;
        used_instruction               : inout t_instruction
    ) is
    begin
        init_ram(ram_read_instruction_in, ram_read_data_in, ram_write_port);
    ------------------------------------------------------------------------
        self.is_ready <= false;
        used_instruction := write_instruction(nop);
        if self.processor_enabled then
            request_data_from_ram(ram_read_instruction_in, self.program_counter);

            if ram_read_is_ready(ram_read_instruction_out) then
                used_instruction := get_ram_data(ram_read_instruction_out);
            end if;

            if decode(used_instruction) = program_end then
                self.processor_enabled <= false;
                self.is_ready <= true;
            else
                self.program_counter <= self.program_counter + 1;
            end if;
        end if;
        self.instruction_pipeline <= used_instruction & self.instruction_pipeline(0 to self.instruction_pipeline'high-1);
    end create_simple_processor;
    ------------------------------------------------------------------------

The second module, called command_pipeline, has all the registers that the actual control path requires, and the command pipeline logic is put into its own create_command_pipeline procedure. The command pipeline has an instruction shift register of matching length, so the instructions are shifted in unison with the actual commands. An add command, when decoded in the first stage, loads the add registers; in the second stage the actual add is performed, and in the third stage the result is read. This is the code from the previous blog post, just moved into its own module. The command pipeline declaration is shown in the snippet below; note that we moved the hard coded bit widths into the processor configuration package.

library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    use work.microinstruction_pkg.all;
    use work.multi_port_ram_pkg.all;
    use work.multiplier_pkg.radix_multiply;
    use work.processor_configuration_pkg.all;

package command_pipeline_pkg is

    type command_pipeline_record is record
        add_a          : std_logic_vector(register_bit_width-1 downto 0) ;
        add_b          : std_logic_vector(register_bit_width-1 downto 0) ;
        add_result     : std_logic_vector(register_bit_width-1 downto 0) ;
        mpy_a          : std_logic_vector(register_bit_width-1 downto 0) ;
        mpy_b          : std_logic_vector(register_bit_width-1 downto 0) ;
        mpy_a1         : std_logic_vector(register_bit_width-1 downto 0) ;
        mpy_b1         : std_logic_vector(register_bit_width-1 downto 0) ;
        mpy_raw_result : signed(register_bit_width*2-1 downto 0)           ;
        mpy_result     : std_logic_vector(register_bit_width-1 downto 0) ;
    end record;
    ------------
    procedure create_command_pipeline (
        signal self                    : inout command_pipeline_record ;
        signal ram_read_instruction_in : out ram_read_in_record                    ;
        ram_read_instruction_out       : in ram_read_out_record                    ;
        signal ram_read_data_in        : out ram_read_in_record                    ;
        ram_read_data_out              : in ram_read_out_record                    ;
        signal ram_write_port          : out ram_write_in_record                   ;
        signal registers               : inout reg_array                           ;
        signal instruction_pipeline    : inout instruction_array                   ;
        instruction                    : in t_instruction);

The restructured processor instantiation with separate processor and command pipeline modules is shown next. Both the command pipeline and the processor get the used_instruction as an inout variable to their create procedures. The use of a variable here allows us to have a fully combinatorial path for the used instruction between the processor module and the command_pipeline module.

    stimulus : process(simulator_clock)
        variable used_instruction : t_instruction;
    begin
        if rising_edge(simulator_clock) then
            simulation_counter <= simulation_counter + 1;
            --------------------
            create_simple_processor (
                self                     ,
                ram_read_instruction_in  ,
                ram_read_instruction_out ,
                ram_read_data_in         ,
                ram_read_data_out        ,
                ram_write_port           ,
                used_instruction); -- variable used here

            create_command_pipeline(
                command_pipeline          ,
                ram_read_instruction_in   ,
                ram_read_instruction_out  ,
                ram_read_data_in          ,
                ram_read_data_out         ,
                ram_write_port            ,
                self.registers            ,
                self.instruction_pipeline ,
                used_instruction); --... and here
                --------------------
        end if;
    end process;
                
    u_mpram : entity work.ram_read_x2_write_x1
    generic map(ram_contents)
    port map(
    simulator_clock          ,
    ram_read_instruction_in  ,
    ram_read_instruction_out ,
    ram_read_data_in         ,
    ram_read_data_out        ,
    ram_write_port);

This is still the previous design, with the code simply moved into separate modules that can now be reused for building the floating point processor. Since the fixed point filter test passes and we can run the previous hardware tests with the newly refactored processor, we can now move on to designing the floating point processor.

Figure 1. Passed VUnit test report
Figure 2. Hardware test with the original design

Repurposing processor for floating point processing

In order to reuse the previously written processor, we will next create a new command pipeline that uses the floating point module. We will connect the configuration of the floating point mantissa and exponent lengths to the configurations of the processor word length and the ram modules. This allows us to change the floating point word length, and the rest of the processor design, including the actual commands, ram port widths and processor pipelines, is scaled to match it automatically.

In order for this to work, we need to have the instruction width, ram port width and float alu width all configured with the floating point data width. The propagation of the word length is shown in Figure 3.

Figure 3. Diagram for how type definitions are propagated in the processor design

The hVHDL float library is already designed for arbitrary floating point lengths, therefore it has a float_word_length_pkg.vhd configuration package for setting the word lengths. The floating point types get their mantissa and exponent lengths from this package, hence we will copy the float word length package into the processor library and use its definitions to configure the other modules. The float word length package is defined as follows.

library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;


-- float processor word length definitions
package float_word_length_pkg is

    constant mantissa_bits : integer := 24;
    constant exponent_bits : integer := 8;

end package float_word_length_pkg;
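For context on how these constants are consumed, the floating point type in the hVHDL float library is built from a sign, an exponent and a mantissa. A simplified sketch of such a record is shown below; the actual float_record in the float library may differ in detail, but this illustrates where the word length comes from.

```vhdl
library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    use work.float_word_length_pkg.mantissa_bits;
    use work.float_word_length_pkg.exponent_bits;

package float_type_sketch_pkg is

    -- sign + exponent + mantissa: this is why the ram width is
    -- later defined as mantissa_bits + exponent_bits + 1
    type float_record is record
        sign     : std_logic;
        exponent : signed(exponent_bits-1 downto 0);
        mantissa : unsigned(mantissa_bits-1 downto 0);
    end record;

end package float_type_sketch_pkg;
```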

The way we use these configuration packages is by adding them to the appropriate library in our fpga tool compile script. All references to packages are done through the “work” library, which actually means “the same library this file is compiled into”. Hence we can have multiple configurations in the same project, as long as the configurations, as well as the modules they configure, are compiled into separate libraries.

Scaling ram width with the used word length

Similarly to the floating point module, the multi port ram module gets its word length and ram depth from a package which is copied to the processor repository. In this processor specific ram configuration we use the floating point definitions package to obtain the bit widths of the mantissa and exponent and use them for the ram bit width definition, as seen in the snippet below.

library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    
    -- used in simple processor
    use work.float_word_length_pkg.mantissa_bits;
    use work.float_word_length_pkg.exponent_bits;

package ram_configuration_pkg is

    -- use float word length for ram width
    constant ram_bit_width : natural := mantissa_bits+exponent_bits+1;
    constant ram_depth     : natural := 2**9;

    subtype address_integer is natural range 0 to ram_depth-1;
    subtype t_ram_data      is std_logic_vector(ram_bit_width-1 downto 0);

    type ram_array is array (integer range 0 to ram_depth-1) of t_ram_data;

end package ram_configuration_pkg;
------------------------------------------------------------------------

The ram package also defines three additional data types that the ram module uses internally. The part that we actually need is the ram_array, which is the array of std_logic_vectors that defines the ram contents. This was already used in the previous blog post in the initialization of the ram module.

Scaling processor to match ram port

The previously hard coded processor instruction and register word lengths are moved into a processor configuration package. This package additionally defines the actual commands as well as the subtypes which are used to encode the instructions into std_logic_vectors.

Instead of connecting the processor to the floating point package directly, we will define the instruction and register bit widths from the ram port width. This allows us to have the processor word length defined by the ram. The example project also instantiates the original fixed point version of the processor, so propagating the definition from the ram keeps the fixed point processor from being tied to the floating point library. The processor configuration also defines the number of pipeline stages, the number of registers in the processor and the used set of instructions, as shown below.

library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- ram package defines instruction bit widths
    use work.ram_configuration_pkg.ram_bit_width;

package processor_configuration_pkg is

    constant instruction_bit_width     : natural := ram_bit_width;
    constant register_bit_width        : natural := ram_bit_width;

    constant number_of_registers       : natural := 5;
    constant number_of_pipeline_stages : natural := 9;

    type t_command is (
        program_end,
        nop        ,
        add        ,
        sub        ,
        mpy        ,
        save       ,
        load
    );

    subtype comm is std_logic_vector(19 downto 16);
    subtype dest is std_logic_vector(15 downto 12);
    subtype arg1 is std_logic_vector(11 downto 8);
    subtype arg2 is std_logic_vector(7 downto 4);
    subtype arg3 is std_logic_vector(3 downto 0);

end package processor_configuration_pkg;
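To illustrate how these subtypes are used, the following sketch shows one way an assembler function could pack a register-to-register command into a std_logic_vector. This is only an illustration, not the repository implementation: the actual write_instruction function lives in the assembler package covered in part 1 and also handles instructions like load and save, whose single ram address argument spans several of the argument fields.

```vhdl
    -- illustrative sketch of instruction encoding, using the
    -- subtype ranges defined in processor_configuration_pkg
    function write_instruction (
        command     : t_command;
        destination : natural := 0;
        argument1   : natural := 0;
        argument2   : natural := 0)
    return std_logic_vector
    is
        variable instruction : std_logic_vector(instruction_bit_width-1 downto 0) := (others => '0');
    begin
        -- each field is placed using the subtype ranges above
        instruction(comm'range) := std_logic_vector(to_unsigned(t_command'pos(command), comm'length));
        instruction(dest'range) := std_logic_vector(to_unsigned(destination, dest'length));
        instruction(arg1'range) := std_logic_vector(to_unsigned(argument1, arg1'length));
        instruction(arg2'range) := std_logic_vector(to_unsigned(argument2, arg2'length));
        return instruction;
    end write_instruction;
```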

Now, with the processor configuration tied to the ram configuration, which in turn is tied to the floating point word length, we only need to build the command pipeline for the floating point processor.

Float alu

The float alu has a subroutine interface that gives access to the add, subtract and multiply operations. These subroutines are what our processing pipeline will be calling based on the decoded instructions. The float alu is pipelined, hence we can push a new command into the pipeline on consecutive clock cycles. The interface is defined as follows

package float_alu_pkg is    

    type float_alu_record is record
        float_adder        : float_adder_record  ;
        adder_normalizer   : normalizer_record   ;

        float_multiplier : float_multiplier_record ;
        multiplier_normalizer : normalizer_record  ;

    end record;

     procedure multiply (
        signal self : inout float_alu_record;
        left, right : float_record);

    function get_multiplier_result ( self : float_alu_record)
        return float_record;
    ------------------------------------------------------------------------
    ------------------------------------------------------------------------
    procedure add (
        signal self : inout float_alu_record;
        left, right : float_record);

    procedure subtract (
        signal self : inout float_alu_record;
        left, right : float_record);
        
    function get_add_result ( self : float_alu_record)
        return float_record;
    ------------------------------------------------------------------------
    ------------------------------------------------------------------------

The command pipeline for the float alu is shown in the snippet below. The first stage triggers loads from ram as well as the three floating point instructions: add, subtract and multiply. When the corresponding instructions are decoded from the instruction that is read from ram, the arguments are pushed to the float alu.

The float alu is configured with a 4 stage add, hence command pipeline stage 4 decodes the multiply and add instructions with matching delay. Pipeline stages 0, 1 and 3 do not have any instructions associated with them, thus they are written out just for clarity.

            -- floating point processor pipeline
            create_float_alu(float_alu);

            --stage -1
            CASE decode(used_instruction) is
                WHEN load =>
                    request_data_from_ram(ram_read_data_in, get_sigle_argument(used_instruction));
                WHEN add => 
                    add(float_alu, 
                        to_float(self.registers(get_arg1(used_instruction))), 
                        to_float(self.registers(get_arg2(used_instruction))));

                WHEN sub =>
                    subtract(float_alu, 
                        to_float(self.registers(get_arg1(used_instruction))), 
                        to_float(self.registers(get_arg2(used_instruction))));
                WHEN mpy =>
                    multiply(float_alu, 
                        to_float(self.registers(get_arg1(used_instruction))), 
                        to_float(self.registers(get_arg2(used_instruction))));
                WHEN others => -- do nothing
            end CASE;
        ------------------------------------------------------------------------
            --stage 0
            used_instruction := self.instruction_pipeline(0);
            CASE decode(used_instruction) is
                WHEN others => -- do nothing
            end CASE;
            
        ------------------------------------------------------------------------
            --stage 1
            used_instruction := self.instruction_pipeline(1);
            CASE decode(used_instruction) is
                WHEN others => -- do nothing
            end CASE;
            
        ------------------------------------------------------------------------
            --stage 2
            used_instruction := self.instruction_pipeline(2);

            CASE decode(used_instruction) is
                WHEN load =>
                    self.registers(get_dest(used_instruction)) <= get_ram_data(ram_read_data_out);
                WHEN others => -- do nothing
            end CASE;
        ------------------------------------------------------------------------
            --stage 3
            used_instruction := self.instruction_pipeline(3);

            CASE decode(used_instruction) is

                WHEN others => -- do nothing
            end CASE;
        ------------------------------------------------------------------------
        --stage 4
            used_instruction := self.instruction_pipeline(4);
            CASE decode(used_instruction) is
                WHEN mpy =>
                    self.registers(get_dest(used_instruction)) <= to_std_logic_vector(get_multiplier_result(float_alu));
                WHEN add | sub => 
                    self.registers(get_dest(used_instruction)) <= to_std_logic_vector(get_add_result(float_alu));
                WHEN save =>
                    write_data_to_ram(ram_write_port, get_sigle_argument(used_instruction), self.registers(get_dest(used_instruction)));
                WHEN others => -- do nothing
            end CASE;
        ------------------------------------------------------------------------

The floating point alu normalizer and denormalizer pipeline depths are also configured in the floating point library. The number of pipeline cycles affects the instruction pipeline depth: with a 4 stage add, we need 4 instruction pipeline stages to match the latency. We can have a new instruction requested on every clock cycle, but we need to be able to catch the result after it has passed through the float pipeline. Also, if we needed to increase the pipeline depth to meet timing, we would need to add a matching number of pipeline stages to the command pipeline.

Test software

The test software for the floating point processor is the same low pass filter that was already developed in part 1 of this blog. Note that since the alu takes 4 cycles to run an operation and the result is needed by the next instruction, we need to add nop operations to the code to match the latency.

The build_sw function takes in the filter gain, the filter input address, the filter output address and the gain address. The addresses in the function arguments are used to initialize the corresponding ram addresses, and they are used to load the ram contents into the registers used by the actual commands. The write_instruction function encodes the arguments into bit vectors that correspond to the ram width. The output of build_sw is the initial contents of the ram, which is passed to the ram entity as a generic. The design of the VHDL assembler, of which the write_instruction function is a part, is also covered in part 1, and the package is found in the microprogram processor repository.

With our design ready, we will next synthesize the code for the Titanium FPGA. The build_sw function is listed below.

    function build_sw (filter_gain : real range 0.0 to 1.0; u_address, y_address, g_address : natural) 
    return ram_array
    is
        variable retval : ram_array := (others => (others => '0'));

------------------------------------------------------------------------
        constant u    : natural := 4;
        constant y    : natural := 2;
        constant g    : natural := 3;
        constant temp : natural := 1;

        constant program : program_array :=(
            write_instruction(load , u    , u_address) ,
            write_instruction(load , y    , y_address) ,
            write_instruction(load , g    , g_address) ,
            write_instruction(nop) ,
            write_instruction(sub  , temp , u    , y)    ,
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(mpy  , temp , temp , g)    ,
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(add  , y    , y    , temp),
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(nop) ,
            write_instruction(save , y    , y_address),
            write_instruction(program_end)
        );

    begin

        for i in program'range loop
            retval(i) := program(i);
        end loop;

        retval(y_address) := to_std_logic_vector(to_float(0.0));
        retval(u_address) := to_std_logic_vector(to_float(0.5));
        retval(g_address) := to_std_logic_vector(to_float(filter_gain));
            
        return retval;
        
    end build_sw;
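The returned ram_array is used as the initial ram contents. As an example (the gain and address values here are arbitrary; any free ram locations above the program code will do), the software could be built and handed to the ram entity like this:

```vhdl
    -- example addresses, chosen above the program code
    constant u_address : natural := 100;
    constant y_address : natural := 101;
    constant g_address : natural := 102;

    constant ram_contents : ram_array := build_sw(
        filter_gain => 0.05,
        u_address   => u_address,
        y_address   => y_address,
        g_address   => g_address);
```

This ram_contents constant is what is passed to the ram_read_x2_write_x1 entity through its generic map in the instantiations shown in this post.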

Test with Efinix Titanium EVM

In the Efinix Titanium build of the example project, the floating point processor replaces the floating point filter architecture that was already present in the design. This allows us to have the same code for all of the builds, with just varying implementations. The floating point processor also runs the same low pass filter as the filter implemented directly in hardware. Since both utilize the floating point alu, the performance is the same, although the float processor has a few extra cycles of latency due to the memory being in series with the data path.

The test implementation for the project's filter architecture can be found here. The floating point filter instantiates a float to integer converter. When the float filter is requested, the fixed point sine wave is first converted to float by a call to the convert_integer_to_float procedure. The completion of this conversion then triggers the processor and a call to write the converted float number to the memory from where the floating point filter will read it. When the processor runs to completion, program_is_ready triggers a conversion from float back to integer and the result is then set to the converted_integer signal. This signal is connected to the internal bus and is thus readable over UART with the connect_read_only_data_to_address procedure.

The process that implements the entire test is shown below

    floating_point_filter : process(clock)
        variable used_instruction : t_instruction;
    begin
        if rising_edge(clock) then
            init_bus(bus_out);
            connect_read_only_data_to_address(bus_in, bus_out, floating_point_filter_integer_output_address , converted_integer);

            create_float_to_integer_converter(float_to_integer_converter);
            create_simple_processor (
                processor                ,
                ram_read_instruction_in  ,
                ram_read_instruction_out ,
                ram_read_data_in         ,
                ram_read_data_out        ,
                ram_write_port           ,
                used_instruction);

            create_float_alu(float_alu);

            --stage -1
            CASE decode(used_instruction) is
                WHEN load =>
            --- SNIP ----------
            -------------------
            if example_filter_input.filter_is_requested then
                convert_integer_to_float(float_to_integer_converter, example_filter_input.filter_input, 15);
            end if;

            if int_to_float_conversion_is_ready(float_to_integer_converter) then
                request_processor(processor);
                write_data_to_ram(ram_write_port, u_address, to_std_logic_vector(get_converted_float(float_to_integer_converter)));
            end if;

            if program_is_ready(processor) then
                convert_float_to_integer(float_to_integer_converter, to_float(processor.registers(2)), 14);
            end if;
            converted_integer <= std_logic_vector(to_signed(get_converted_integer(float_to_integer_converter) +  32768, 16));

        end if; --rising_edge
    end process floating_point_filter;	
------------------------------------------------------------------------
    u_mpram : entity work.ram_read_x2_write_x1
    generic map(ram_contents)
    port map(
    clock                    ,
    ram_read_instruction_in  ,
    ram_read_instruction_out ,
    ram_read_data_in         ,
    ram_read_data_out        ,
    ram_write_port);
------------------------------------------------------------------------

The project has an example .bat script that launches the build, uploads the binary to the Efinix evaluation kit and runs the Python script that tests the hardware. Running the following one-liner with cmd in the project root eventually plots Figure 4.

start /wait test_with_hw.bat & python test_app.py com14 "titanium evm"
Figure 4. Result of running the build and test script, which plots the filtered waveforms from the FPGA
Hardware utilization

As we designed our ram width to be defined by the lengths of the mantissa and the exponent, we can test the resource use at different bit widths. I compiled the project with 18 to 32 bit mantissa lengths and the resource use is shown in the table below. Note that 200+ means that the maximum speed is not actually limited by the floating point processor, as the sine generation in the design becomes the limit. Since we have both the pure logic implementation and the floating point processor available, it is fairly trivial to compare the resource use of the two.

Since the ram width is calculated during synthesis from the floating point exponent and mantissa lengths, we only need to change the word length configuration and all floating point processing modules, as well as the rest of the processor, scale with it accordingly. The build time is also less than 90 seconds with the Efinity tool, thus changing the processor configuration can be accomplished with little effort by just changing the mantissa bit width and recompiling.

The resource use of the logic and processor floating point implementations is presented in the table below. The word length column shows mantissa + exponent + sign, so for example a 24 bit mantissa with an 8 bit exponent gives a 33 bit word. The processor is configured to use 5 registers and the float unit has 2 registers for the normalization and denormalization steps. The resource use also takes into account the floating point converter as well as the connection to the internal bus. We could also use the float alu normalizer/denormalizer stages for the conversions, which would bring down the resource use in both the processor and non-processor versions of the filter by a considerable amount.

Efinix Titanium Resources for floating point processor

Mantissa (word) | DSP | LUT (no processor) | Memory blocks | MHz (no processor)
18 (27)         | 1   | 1325 (734)         | 3 (0)         | 200+ (200+)
21 (30)         | 4   | 1427 (896)         | 3 (0)         | 200+ (200+)
24 (33)         | 4   | 1651 (954)         | 3 (0)         | 200+ (200+)
26 (35)         | 4   | 1721 (1016)        | 3 (0)         | 200+ (190+)
28 (37)         | 4   | 1842 (1119)        | 3 (0)         | 140+ (150+)
30 (39)         | 4   | 2034 (1193)        | 3 (0)         | 130+ (130+)
32 (41)         | 4   | 2154 (1259)        | 4 (0)         | 130+ (130+)

We can see from the example design that the first order filter done directly in logic fabric with a 24 bit mantissa takes about 1000 LUTs, and around 1600 with the processor included. The processor takes up around 60% more logic, but with this extra space allocation we get essentially free use of the floating point unit for any subsequent operations. As the pure logic design implements more registers and muxes with each additional calculation, even when the arithmetic logic is reused, its logic use will eventually exceed that of the corresponding processor.

Due to the way DSP slices work, it does not actually matter whether we use 19 bits or 38 bits for our multiplier. Similarly, RAM comes in fixed block sizes and we use a full block at a time. Due to this, the number of used DSP slices and memory blocks jumps up only at specific break points.

Configuration of VHDL modules

The configuration of the modules was done using configuration packages. The use of packages for configuration leads to some code duplication, but it gives us the flexibility to build almost any kind of commands and decoding for the processor. The configuration is chosen in the compile script, where we decide into which libraries the sources are compiled. We can have multiple configurations of the same modules in a single project as long as we compile the sources into different libraries; libraries are the VHDL language feature that allows us to do this.
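To sketch the idea, assume two variants of the same configuration package are compiled into different libraries; the library and package names below are made up for illustration:

```vhdl
-- sketch: compile the same package source, with different constants,
-- into two libraries, e.g. with GHDL:
--   ghdl -a --work=float24_lib configuration_pkg_24bit.vhd
--   ghdl -a --work=float32_lib configuration_pkg_32bit.vhd
--
-- a module then selects its configuration simply by the library it references:
library ieee;
    use ieee.std_logic_1164.all;

library float24_lib;
    use float24_lib.processor_configuration_pkg.all;

entity float_alu_wrapper is
    port (data_in : in std_logic_vector(ram_bit_width - 1 downto 0));
end entity float_alu_wrapper;
```

The entity source is written once; it sees whichever constants were compiled into the library named in its `library` clause, so switching configurations is a compile-script change only.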

VHDL2008 configuration

The code written here is in the 1993 version of the VHDL language standard. The language had a major update in 2008 which allows the use of package generics and generic packages. These let us pass packages and constants as generics to other packages, which would completely eliminate the need for duplicating these sources. Sadly, much of the code that we write in VHDL for synthesis still needs to be 1993 compatible, as some of the older tools that are still in use, mainly Intel Quartus and Vivado older than 2019.2, do not support the 2008 standard.

New versions of Vivado, Quartus Pro, Efinix Efinity, Lattice Radiant and Diamond all support these VHDL-2008 features, as do the open source VHDL simulators GHDL and NVC. Hence, barring the use of older Vivado or Quartus, we could make the code much simpler by using the 2008 standard of the language.
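With VHDL-2008 package generics, the duplicated configuration packages could collapse into a single generic package. A minimal sketch, with assumed names:

```vhdl
-- VHDL-2008 sketch of a generic configuration package; the names are
-- hypothetical, not taken from the repository.
package float_configuration_generic_pkg is
    generic (g_mantissa_length : natural;
             g_exponent_length : natural);

    -- sign + exponent + mantissa
    constant ram_bit_width : natural := 1 + g_exponent_length + g_mantissa_length;

end package float_configuration_generic_pkg;

-- each configuration then becomes a one-line package instantiation
-- instead of a duplicated source file:
package float24_configuration_pkg is new work.float_configuration_generic_pkg
    generic map (g_mantissa_length => 24, g_exponent_length => 8);
```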

Additional improvements

If our design does not meet the required timing, we can add more pipeline stages by increasing the instruction pipeline depth and adding pipeline stages to the normalizer and denormalizer modules. In the software this is mirrored by the number of nop commands between non-pipelineable parts of the actual code.

An obvious improvement would be to add another layer of abstraction between the VHDL software and the instructions. We can make a set of functions that take the pipeline depths from the instruction package and insert the nops automatically. Hence a function multiply(c,a,b) would output a program_array that has the correct number of nops to match the pipeline delays. We should also add a stall, which is simply a counter during which the pipeline is not advanced. This mechanism would allow us to use only 1 or 2 memory locations per floating point instruction instead of 1 + pipeline depth worth of no-operations.
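Such a function could look like the following sketch. It assumes a program_array type, a write_instruction helper and mpy/nop opcodes in the style of the assembler package; the exact names and the pipeline depth are assumptions:

```vhdl
-- hypothetical sketch: wrap a multiply instruction with the nops
-- required by the alu pipeline, so the calling code does not need
-- to know the pipeline depth.
constant alu_pipeline_depth : natural := 7; -- assumed example depth

function multiply(c, a, b : natural) return program_array is
    constant nop_padding : program_array(0 to alu_pipeline_depth - 1)
        := (others => write_instruction(nop));
begin
    return write_instruction(mpy, c, a, b) & nop_padding;
end function multiply;
```

A program would then be written as concatenations like `multiply(r2, r0, r1) & multiply(r3, r2, r2)`, and changing the pipeline depth in the instruction package would repad every program automatically.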

We could also add branching and function calling operations, which would substantially increase the amount of functionality we can pack into the memory of the software that we write in VHDL.

The floating point library is, as of writing, still in its first revision, hence there is very likely much to be improved. I would not be surprised if we could squeeze the floating point unit into a substantially tighter space. The multiplier normalizer, for example, is full width even though we are not really hitting the saturation points of the exponent and thus not running into denormalized numbers. The denormalizer and normalizer in the adder are also full width, even though we could possibly save a lot of space if only a half width normalizer was used and the first stage just shifted by half the word length.
