
High level Floating Point ALU in synthesizable VHDL

In the previous blog post we discussed how floating point arithmetic works at the logic level. We designed the floating point module and the functions for multiplication and addition. This time we combine the add and multiply functions under a single interface and use it to implement a first order filter. Additionally, we use the floating point normalizer and denormalizer modules to convert between float and integer.

The high level VHDL repositories created while writing this blog have now been moved under an open source project called “High Level Synthesizable VHDL”, or hVHDL, which can be found on GitHub. The project aims to make the modules easy for developers to adopt, and to make that even easier there is an example project with build scripts for the most common FPGAs. At the time of writing, there are build scripts for Lattice ECP5, Efinix Trion, Intel Cyclone 10 LP and Xilinx Spartan 7.

Floating point ALU

Floating point is a relatively complex way to represent numbers and to operate on them. To relieve the designer from this inherent complexity of the representation, we have a module for high level floating point operations written in VHDL. The module allows floating point numbers to be used through function and procedure interfaces and as real literals like 3.142, instead of raw register bits or vectors. Thus, instead of mapping signals to the ports of entities to calculate a multiplication, we just call a single procedure named multiply. Since the floating point module also tells when it is ready, the application is automatically timed correctly and we can change the pipeline delays without touching the application code.

The alu is designed to be pipelined, which means that a new addition or multiplication can be requested on every clock cycle. Since both operations have separate implementations, they can be requested and run simultaneously. The alu interface follows the create-request-ready pattern, which allows calls to be either sequential, where we wait for an action to be ready before requesting the next one, or pipelined, where a new action is requested every clock cycle.

Floating point ALU structure

As discussed in the previous blog post, both the floating point multiplication and addition have two parts. First we perform the add or multiply function, and once that is done we normalize the result. Normalization is done by shifting the leading zeroes out of the mantissa and adjusting the exponent by an amount equal to the shift.

A fully functioning floating point module thus consists of an adder, a normalizer for the adder, a floating point multiplier and a normalizer for the multiplier. All of these parts are designed as separately instantiable objects, so the float alu record is simply a list of the elements it is built from, as seen in the snippet below.

package float_alu_pkg is
------------------------------------------------------------------------
    type float_alu_record is record
        float_adder        : float_adder_record  ;
        adder_normalizer   : normalizer_record   ;

        float_multiplier : float_multiplier_record ;
        multiplier_normalizer : normalizer_record  ;
    end record;

Since the alu is implemented using a record and a procedure, using it just requires instantiating a signal of the float_alu_record type, creating the logic with a call to the create_float_alu procedure, and then using the subroutines to request actions from the alu and to check that the result is ready before fetching it.

architecture rtl of example is
    
    signal float_alu : float_alu_record := init_float_alu;
    
begin
    
    stimulus : process(simulator_clock)

    begin
        if rising_edge(simulator_clock) then

            create_float_alu(float_alu);

            multiply(float_alu, to_float(9.0), to_float(-9.0));
            add(float_alu, to_float(5.0), to_float(5.0));

            if multiplier_is_ready(float_alu) then
                test_multiplier <= to_real(get_multiplier_result(float_alu));
            end if;

            if add_is_ready(float_alu) then
                add_result_real <= to_real(get_add_result(float_alu));
            end if;

        end if; -- rising_edge
    end process stimulus;	

As discussed in the previous post, addition also requires the exponents to be equal before the mantissas can be summed. Therefore, before adding, we first scale the operands to a common exponent. For example, to add 1.5*2^3 and 1.5*2^1, the mantissa of the smaller operand is shifted right by two so that both numbers share the exponent 3. The adder module embeds the denormalizer module that is used for this scaling.

Floating point ALU implementation

The alu is implemented with a procedure that is driven from a signal of the alu record type. The create_float_alu procedure creates the adder, the multiplier and their normalizers. The complete functionality of the floating point alu is obtained by connecting these modules together. The connection is made by requesting the normalizers with the adder and multiplier results once their is_ready functions indicate completion, and then returning the results from the normalizers, as seen below.

 procedure create_float_alu 
    (
        signal float_alu_object : inout float_alu_record
    ) 
    is
    begin

        create_adder(float_alu_object.float_adder);
        create_normalizer(float_alu_object.adder_normalizer);

        if adder_is_ready(float_alu_object.float_adder) then
            request_normalizer(float_alu_object.adder_normalizer, get_result(float_alu_object.float_adder));
        end if;

        create_float_multiplier(float_alu_object.float_multiplier);
        create_normalizer(float_alu_object.multiplier_normalizer);

        if float_multiplier_is_ready(float_alu_object.float_multiplier) then
            request_normalizer(float_alu_object.multiplier_normalizer, get_multiplier_result(float_alu_object.float_multiplier));
        end if;

    end procedure;

The adder is connected to the alu interface with the request procedures. A call to the add procedure calls the pipelined_add procedure of the adder module, while add_is_ready and get_add_result return the corresponding values from the adder normalizer.

The same holds for the floating point multiplier.

procedure add
(
    signal alu_object : inout float_alu_record;
    left, right : float_record
) is
begin
    pipelined_add(alu_object.float_adder, left, right);
end add;

------------------------------------------------------------------------
function add_is_ready
(
    alu_object : float_alu_record
)
return boolean
is
begin
    return normalizer_is_ready(alu_object.adder_normalizer);
end add_is_ready;
------------------------------------------------------------------------
function get_add_result
(
    alu_object : float_alu_record
)
return float_record
is
begin
    return get_normalizer_result(alu_object.adder_normalizer);
end get_add_result;
------------------------------------------------------------------------
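
For completeness, the multiplier side is a mirror image of the same subroutines. The snippet below is a sketch: the name of the request procedure inside the multiplier module (here request_float_multiplier) is an assumption and may differ in the actual package, while multiplier_is_ready and get_multiplier_result match the calls used in the examples above.

procedure multiply
(
    signal alu_object : inout float_alu_record;
    left, right : float_record
) is
begin
    -- the multiplier module's pipelined request; the procedure name is an assumption
    request_float_multiplier(alu_object.float_multiplier, left, right);
end multiply;
------------------------------------------------------------------------
function multiplier_is_ready
(
    alu_object : float_alu_record
)
return boolean
is
begin
    return normalizer_is_ready(alu_object.multiplier_normalizer);
end multiplier_is_ready;
------------------------------------------------------------------------
function get_multiplier_result
(
    alu_object : float_alu_record
)
return float_record
is
begin
    return get_normalizer_result(alu_object.multiplier_normalizer);
end get_multiplier_result;
------------------------------------------------------------------------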

With this interface, we can evaluate floating point multiplications and additions with simple procedure calls. Since the module implementation is pipelined, we can request a new operation every clock cycle and then get the results in successive clock cycles after the pipeline delay of the module. The test code below requests multiplications and additions on 5 successive clock cycles.

    stimulus : process(simulator_clock)

    begin
        if rising_edge(simulator_clock) then
            simulation_counter <= simulation_counter + 1;

            create_float_alu(float_alu);
            CASE simulation_counter is
                WHEN 3 => multiply(float_alu, to_float(5.0), to_float(5.0));
                WHEN 4 => multiply(float_alu, to_float(6.0), to_float(5.0));
                WHEN 5 => multiply(float_alu, to_float(7.0), to_float(5.0));
                WHEN 6 => multiply(float_alu, to_float(8.0), to_float(-8.0));
                WHEN 7 => multiply(float_alu, to_float(9.0), to_float(-9.0));
                WHEN others => -- do nothing
            end CASE;

            CASE simulation_counter is
                WHEN 3 => add(float_alu, to_float(5.0), to_float(5.0));
                WHEN 4 => add(float_alu, to_float(6.0), to_float(5.0));
                WHEN 5 => add(float_alu, to_float(7.0), to_float(5.0));
                WHEN 6 => add(float_alu, to_float(8.1), to_float(-8.0));
                WHEN 7 => add(float_alu, to_float(9.0), to_float(-9.1));
                WHEN others => -- do nothing
            end CASE;

            if multiplier_is_ready(float_alu) then
                test_multiplier <= to_real(get_multiplier_result(float_alu));
            end if;

            if add_is_ready(float_alu) then
                add_result_real <= to_real(get_add_result(float_alu));
            end if;

        end if; -- rising_edge
    end process stimulus;	
Figure 1. Floating point alu operation. Note that the timing relationship is handled through the subroutines, thus it is not visible in the code above.

Configurability of the Floating Point module

Since we are building custom hardware for the floating point operations, we most likely want to tailor the implementation to fit the application. As all of the modules that go into the floating point alu are created as separately instantiated units, there are multiple interfaces in the code which allow us to vary the implementations. At the time of writing we have separate packages with which we can change the floating point mantissa and exponent lengths, as well as the number of pipeline stages used for the bit shifting. These variations are applied at compile time, without touching the actual application code that uses the floating point module. The simulation is configured with 4-stage pipelines for the normalizer and denormalizer, as shown below.
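
As a sketch, such a configuration package can be as simple as a single constant; only the package and constant names are taken from the normalizer package shown further below, and the actual package in the repository may define more than this.

package normalizer_pipeline_pkg is
    -- number of pipeline stages used by the normalizer and denormalizer
    constant normalizer_pipeline_configuration : natural := 4;
end package normalizer_pipeline_pkg;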

Using the ideas presented earlier on dependency management, all of the modules that go into the floating point alu reference packages using the “use work.package.all” notation. This leaves us the possibility to compile the sources into any desired library, so we can choose which packages end up in the same library as the floating point alu sources. Since we do not hard-code the library name, the same sources can be added to multiple different libraries, and the same project can therefore have multiple different floating point units if needed. This lets us use the compile script as the interface for building a specific floating point configuration.

Backpressure from pipelined units

The reason we can make this work without touching the application code is that the design uses a mechanism called backpressure to automatically time the modules. This works because the calling module waits for the calculation to be ready. It is done using the request-ready pattern, where operations are requested and the module that processes the request answers through an is_ready function when the data is ready.

Both the normalizer and denormalizer modules consist of a record that holds the pipeline stages for the scaled floating point numbers and a shift register with as many stages as the pipeline length. The record is specified as vectors of configurable length, as shown below.

    use work.normalizer_pipeline_pkg.normalizer_pipeline_configuration;

package normalizer_pkg is
------------------------------------------------------------------------
    alias number_of_normalizer_pipeline_stages is normalizer_pipeline_configuration;

    type normalizer_record is record
        normalizer_is_requested : std_logic_vector(number_of_normalizer_pipeline_stages downto 0);
        normalized_data         : float_array(0 to number_of_normalizer_pipeline_stages);
    end record;

The create procedure sets up the pipeline stages in a for loop that uses the number_of_normalizer_pipeline_stages constant as the loop bound. The normalize function has an argument for the maximum shift length, so if we add more pipeline stages, the total shift operation is split into smaller partial shifts.

------------------------------------------------------------------------
    procedure create_normalizer 
    (
        signal normalizer_object : inout normalizer_record
    ) 
    is
    begin

        normalizer_object.normalizer_is_requested(0) <= '0';
        for i in 1 to number_of_normalizer_pipeline_stages loop
            normalizer_object.normalizer_is_requested(i) <= normalizer_object.normalizer_is_requested(i-1);
            normalizer_object.normalized_data(i)         <= normalize(normalizer_object.normalized_data(i-1), mantissa_high/number_of_normalizer_pipeline_stages);
        end loop;
        end loop;
    end procedure;

The way we request an operation is shown below. The first pipeline stage is loaded with the data to be processed and the request bit is set.

------------------------------------------------------------------------
    procedure request_normalizer
    (
        signal normalizer_object : out normalizer_record;
        float_input              : in float_record
    ) is
    begin
        normalizer_object.normalizer_is_requested(normalizer_object.normalizer_is_requested'low) <= '1';
        normalizer_object.normalized_data(normalizer_object.normalized_data'low) <= float_input;
        
    end request_normalizer;

The normalizer_is_ready function simply returns a boolean that tells whether the topmost bit of the normalizer_is_requested shift register is set. Thanks to the shift register, the application code adapts to changes in the number of pipeline stages without needing to know how many clock cycles the operation takes.

------------------------------------------------------------------------
    function normalizer_is_ready
    (
        normalizer_object : normalizer_record
    )
    return boolean
    is
    begin
        return normalizer_object.normalizer_is_requested(normalizer_object.normalizer_is_requested'high) = '1';
    end normalizer_is_ready;
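
The result getter is not shown above; as a minimal sketch, it would simply return the last pipeline stage of the normalized data:

------------------------------------------------------------------------
    -- a sketch: the result is the last stage of the normalizer pipeline
    function get_normalizer_result
    (
        normalizer_object : normalizer_record
    )
    return float_record
    is
    begin
        return normalizer_object.normalized_data(normalizer_object.normalized_data'high);
    end get_normalizer_result;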

Filtering with floating point arithmetic on an FPGA

In order to do something meaningful with the floating point unit, it is tested with a simple filtering application.

The open source hVHDL project has a synthesizable example that uses the floating point unit for a simple first order filter. The filter is calculated using the following form of a first order filtering function

y = y + (u-y) * filter_gain

The VHDL floating point implementation performs one operation at a time, so the algorithm is broken down into

temp1 = (u-y)
temp2 = temp1*filter_gain
y = y + temp2

The floating point VHDL implementation of the filter is shown below. Since we are running all operations through the same alu, we add a filter counter that sequences the operations. This saves a substantial amount of logic resources, since we create only one float adder and one float multiplier and run all operations through the same hardware. In essence, this allows us to synthesize multiplexers in place of additional floating point operators.

    floating_point_filter : process(clock)
    begin
        if rising_edge(clock) then

            create_float_alu(float_alu);
        ------------------------------------------------------------------------
            filter_is_ready <= false;
            CASE filter_counter is
                WHEN 0 => 
                    subtract(float_alu, u, y);
                    filter_counter <= filter_counter + 1;
                WHEN 1 =>
                    if add_is_ready(float_alu) then
                        multiply(float_alu  , get_add_result(float_alu) , filter_gain);
                        filter_counter <= filter_counter + 1;
                    end if;

                WHEN 2 =>
                    if multiplier_is_ready(float_alu) then
                        add(float_alu, get_multiplier_result(float_alu), y);
                        filter_counter <= filter_counter + 1;
                    end if;
                WHEN 3 => 
                    if add_is_ready(float_alu) then
                        y <= get_add_result(float_alu);
                        filter_counter <= filter_counter + 1;
                        filter_is_ready <= true;
                    end if;
                WHEN others =>  -- wait for start
            end CASE;
        ------------------------------------------------------------------------
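
The subtract call in the sequence above is not part of the add and multiply subroutines shown earlier. A minimal sketch of how it could be implemented on top of the adder is shown below; the unary "-" for float_record is an assumption, and the actual package may instead provide a dedicated pipelined subtract in the adder module.

procedure subtract
(
    signal alu_object : inout float_alu_record;
    left, right : float_record
) is
begin
    -- negating the right operand is an assumed shortcut; a dedicated
    -- pipelined_subtract in the adder module would work equally well
    pipelined_add(alu_object.float_adder, left, -right);
end subtract;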

Although the example is a bit verbose, we can also wrap it into a filter record and a corresponding procedure that creates a filter object, at which point the filter is simplified to just one line of synthesizable VHDL code, the create_first_order_filter procedure call.

    floating_point_filter : process(clock)
    begin
        if rising_edge(clock) then

            create_float_alu(float_alu);
        ------------------------------------------------------------------------
            create_first_order_filter(filter, float_alu, filter_gain);
        ------------------------------------------------------------------------
            
            if we_want_to_trigger_the_filter then
                request_first_order_filter(filter, signal_to_be_filtered);
            end if;
            
        end if; --rising_edge
    end process floating_point_filter;	
end float;
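
For reference, a sketch of what the packaged filter could look like is shown below. The record and procedure contents are assumptions: create_first_order_filter would contain the same CASE machine as the verbose example, just operating on the record fields.

    -- a sketch of the filter object; field names and types are assumptions
    type first_order_filter_record is record
        filter_counter  : natural range 0 to 7;
        u               : float_record; -- latest filter input
        y               : float_record; -- filter state and output
        filter_is_ready : boolean;
    end record;

    -- requesting the filter loads the input and restarts the sequence
    procedure request_first_order_filter
    (
        signal filter : out first_order_filter_record;
        u             : in float_record
    ) is
    begin
        filter.u              <= u;
        filter.filter_counter <= 0;
    end request_first_order_filter;

    -- create_first_order_filter would run the subtract, multiply and add
    -- steps of the CASE machine shown earlier, reading filter.u and
    -- filter.y and advancing filter.filter_counter on each completed step.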

FPGA implementation

The floating point module is tested with a noisy sine that is filtered using the floating point filter. The project has build scripts for Intel Quartus, Lattice Diamond, Xilinx Vivado and Efinix Efinity. Both the Efinix and Lattice tools need their newest versions for the code to work. The VHDL sources for the example project can be found here and the floating point filter specifically can be found here. The sine wave is calculated using the sincos module that was designed previously. The resource use of the floating point module on different FPGAs can be found here.

Figure 2. Resource use for Lattice ECP5 with the first order filter calculated in 24 bit floating point
Figure 3. Noisy sine
Figure 4. Sine filtered using the floating point filter
Figure 5. A closeup of the sine peak that shows the remaining noise after filtering

Notes

The most commonly used format for floating point is defined in the IEEE 754 standard, which this floating point unit does not adhere to. The reason for not using the standard representation is that it specifies several kinds of error flags, and unless we specifically check for these errors there is no need to have them. The register format of the normalized floating point implementation is very close to the format of the standard; the only differences are that the exponent bias is removed and that the leading '1' bit is handled explicitly, whereas most floating point implementations treat the always-'1' bit as implied and do not store it in the registers. The rounding modes are also omitted and simple truncation is used. We could add rounding if desired, though it usually requires fewer resources to just increase the word length by a bit.

Since the alu is built from separate modules, we could also make a purely sequential version of the floating point unit using the normalizer, denormalizer, multiplier and adder operations, for substantially lower resource usage. In that implementation we would have only one normalizer and one denormalizer and use them for the multiplication, the addition and the float conversions. This would imply that, instead of being pipelined, the same normalization and denormalization hardware would be used several times per operation, and no new action could be requested before the previous one is ready.

Assuming that the calls to add, multiply and the associated is_ready functions were named identically, we could even run the presented implementation of the floating point filter with either the pipelined or the sequential version of the floating point alu!

Finally there is the conversion between integers and floating point. In a beautiful symmetry, integer_to_float is done by normalizing the integer and float_to_integer is done by denormalizing the float to the target radix. When we convert an integer to float, we load the mantissa with the integer, set the exponent to the target radix and push the number through the normalizer. Conversely, we get an integer from a float by denormalizing the floating point value with the target exponent set to the desired radix. The converter module essentially just renames the subroutines of the normalizer and denormalizer stages to make it clearer what is being done.
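
As a sketch of the integer-to-float direction, the conversion request could look like the procedure below. The float_record field names (sign, exponent, mantissa) and the length constants are assumptions based on the earlier posts; the actual converter module may differ.

    procedure request_int_to_float
    (
        signal normalizer_object : out normalizer_record;
        int_input                : in natural;
        target_radix             : in integer
    ) is
    begin
        -- load the mantissa with the integer and the exponent with the
        -- target radix, then let the normalizer pipeline do the rest
        request_normalizer(normalizer_object,
            (sign     => '0',
             exponent => to_signed(target_radix, exponent_length),
             mantissa => to_unsigned(int_input, mantissa_length)));
    end request_int_to_float;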
