Floating Point arithmetic in High Level VHDL

In digital arithmetic we have two ways to represent numbers: fixed point and floating point. With fixed point arithmetic we choose the appropriate scale for the numbers in the source code, whereas a floating point number carries its scale with it. This carried scale gives the numbers a high dynamic range, and the automatic range adjustment of floating point arithmetic can make scaling less of an issue. This simplification is desirable in most designs as it streamlines the design of calculations.

To get access to floating point numbers, in this post we develop the basic floating point arithmetic operations in VHDL: addition, subtraction and multiplication. The code is written using high level VHDL design, which was the topic of the previous blog post. We design a floating point abstract data type and an interface for it using records, procedures and functions.

In the follow-up post, “High level Floating Point ALU in synthesizable VHDL”, the floating point arithmetic modules are used to design a floating point ALU. This ALU is then used in an example project which has compile scripts for the most common FPGA vendor tools.

Floating point arithmetic

A floating point number f has three parts that form the number representation: the sign, an exponent N and a mantissa b_m which stores the fractional part of the number. The floating point format is as follows

\begin{equation} f = (-)\, 2^N\cdot b_m \end{equation}

We use the number range [0.5, 1) for the mantissa b_m and a signed integer N for the exponent. As an example, the number 0.25 in floating point format would be represented as

\begin{equation} 0.25_{f} = 2^{-1}\cdot 0.5 \end{equation}

To multiply two floating point numbers f_1 and f_2 , we simply add together the exponents and multiply the mantissas. The sign of the result is negative if the multiplier and multiplicand have different signs and positive if they match.

\begin{equation} f_1\cdot f_2 = (2^{N_1} \cdot b_{m_1}) \cdot (2^{N_2}\cdot b_{m_2}) = 2^{N_1+N_2}\cdot b_{m_1}b_{m_2} \end{equation}
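
As a worked example of this rule, take 0.25 = 2^{-1}\cdot 0.5 and 3 = 2^{2}\cdot 0.75 (numbers chosen here only for illustration). The product of two mantissas from [0.5, 1) lies in [0.25, 1), so the result may need to be shifted by one bit to bring the mantissa back into range:

\begin{equation*} 0.25\cdot 3 = 2^{-1+2}\cdot (0.5\cdot 0.75) = 2^{1}\cdot 0.375 = 2^{0}\cdot 0.75 \end{equation*}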

To add two floating point numbers that have equal exponents N_1 = N_2 = N, we simply add together the mantissas.

\begin{equation} f_1 + f_2 = 2^{N}\cdot (b_{m_1}+b_{m_2}), \quad N = N_1 = N_2 \end{equation}

The addition and subtraction of two floating point numbers with differing exponents is where the use of floating point numbers gets more complicated. Since the numbers are in exponential form, they first need to be scaled to have the same exponent for the addition of the mantissas to yield a correct result.
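
As an illustration with numbers picked for this example, adding 3 = 2^{2}\cdot 0.75 and 0.25 = 2^{-1}\cdot 0.5 requires the smaller number to be rewritten with the exponent of the larger one, which divides its mantissa by 2^{3}:

\begin{equation*} 3 + 0.25 = 2^{2}\cdot 0.75 + 2^{2}\cdot \frac{0.5}{2^{3}} = 2^{2}\cdot (0.75 + 0.0625) = 2^{2}\cdot 0.8125 \end{equation*}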

Digital implementation of floating point

Even though we usually talk about floats and real numbers interchangeably, they are definitely not the same. Whether we use fixed or floating point numbers, we actually do all calculations with integers. Digital circuits work with binary numbers: with floats the mantissa is stored as an unsigned integer and the exponent as a two’s complement signed integer.

For example, let's assume a 24 bit mantissa and an 8 bit exponent. With these the number 0.25 in floating point would be represented as

\begin{equation} 0.25 = 2^{-1}\cdot \underbrace{8388608 }_{2^{24}\cdot 0.5} \end{equation}

The floating point number would thus be stored in a register in three parts as (0, -1, 8388608), which in hex is (0, \textrm{0xff}, \textrm{0x800000}).

VHDL implementation of floating point

Using the ideas from the previous blog post about dependency management, the mantissa and exponent word lengths are defined in constants in a datatype package. This package is then used with all other modules.

package float_type_definitions_pkg is

    constant mantissa_length : integer := 16;
    constant exponent_length : integer := 8;

    constant mantissa_high : integer := mantissa_length - 1;
    constant exponent_high : integer := exponent_length - 1;

    subtype t_mantissa is unsigned(mantissa_high downto 0);
    subtype t_exponent is signed(exponent_high downto 0);

With this we have a single point in the design where the bit widths are defined, which allows us to change the word lengths as needed without changing the rest of the source code. For example, we can fix timing issues by lowering the number of mantissa bits, or increase them if more accuracy is needed.

The floating point is implemented in its own record type that uses the type definitions. This record has three parts with sign, exponent and mantissa stored in their own fields.

    type float_record is record
        sign     : std_logic;
        exponent : t_exponent;
        mantissa : t_mantissa;
    end record;

With fixed point we can mostly get away with just using the integers directly, but it is quite obvious that the floating point numbers are way too complicated to be readable from the values in the registers. Thus the very first abstraction that we must have is the ability to type in real numbers to our source code.

Real to float conversion

In VHDL numbers like 3.14 are of type real. Real valued numbers are not directly synthesizable, but we can still use them in RTL code to define constants of synthesizable types. Basically, as long as we do not try to have real values in registers or create logic using real values, the code is synthesizable.

The conversions between float and real values are delivered in a real to float package. This package defines the to_float and to_real functions.


package float_to_real_conversions_pkg is
------------------------------------------------------------------------
    function to_float ( real_number : real)
        return float_record;
------------------------------------------------------------------------
    function to_real ( float_number : float_record)
        return real;
------------------------------------------------------------------------
end package float_to_real_conversions_pkg;

The to_float function is needed to allow us to write real valued numbers in our source code. The inverse transform from float to real is needed in simulations as it gives us a human readable representation of the floating point data.

With these functions we can write signals and constants in our source code as

constant float_pi : float_record := to_float(3.14);
signal float_number : float_record := to_float(0.3154386);
constant real_data : real := to_real(float_pi); -- initialized from a constant, to_real is intended for simulation

When a real number is converted to float, we first find the exponent that scales the number into the mantissa range and then use this exponent to calculate the mantissa. The implementations of the functions can be found in the package.

Now that we can read and write floats in our source code, next I will discuss the implementation of addition, subtraction and multiplication using the floating point data type.
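
The conversion could look roughly like the following. This is a minimal sketch and not the code from the package: the package name and the _sketch function names are made up for this example, the type definitions are repeated only to make the block self-contained, and the mantissa is truncated rather than rounded.

```vhdl
library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    use ieee.math_real.all;

-- hypothetical package, named only for this example
package real_to_float_sketch_pkg is
    constant mantissa_length : integer := 16;
    constant exponent_length : integer := 8;

    subtype t_mantissa is unsigned(mantissa_length-1 downto 0);
    subtype t_exponent is signed(exponent_length-1 downto 0);

    type float_record is record
        sign     : std_logic;
        exponent : t_exponent;
        mantissa : t_mantissa;
    end record;

    function to_float_sketch (real_number : real) return float_record;
    function to_real_sketch (float_number : float_record) return real;
end package real_to_float_sketch_pkg;

package body real_to_float_sketch_pkg is

    function to_float_sketch (real_number : real) return float_record is
        variable magnitude : real         := abs(real_number);
        variable exponent  : integer      := 0;
        variable result    : float_record := ('0', (others => '0'), (others => '0'));
    begin
        if magnitude = 0.0 then
            return result;
        end if;

        -- find the exponent that scales the magnitude into [0.5, 1)
        while magnitude >= 1.0 loop
            magnitude := magnitude / 2.0;
            exponent  := exponent + 1;
        end loop;
        while magnitude < 0.5 loop
            magnitude := magnitude * 2.0;
            exponent  := exponent - 1;
        end loop;

        if real_number < 0.0 then
            result.sign := '1';
        end if;
        -- note: exponents outside the signed exponent range are not handled here
        result.exponent := to_signed(exponent, exponent_length);
        -- the mantissa is the scaled magnitude as an integer, truncated
        result.mantissa := to_unsigned(integer(floor(magnitude * 2.0**mantissa_length)), mantissa_length);
        return result;
    end function;

    function to_real_sketch (float_number : float_record) return real is
        variable value : real;
    begin
        value := real(to_integer(float_number.mantissa)) / 2.0**mantissa_length
                 * 2.0**to_integer(float_number.exponent);
        if float_number.sign = '1' then
            value := -value;
        end if;
        return value;
    end function;

end package body real_to_float_sketch_pkg;
```

With the 16 bit mantissa used here, to_float_sketch(0.25) gives the exponent -1 and the mantissa 32768, which is the 0.25 example from earlier scaled to 16 bits.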

Float multiplier

The multiplier is implemented using a float_multiplier_record data type to hold the registers and a procedure that creates the multiplier logic. Along with the record, the package has procedures and functions for requesting a multiplication, checking whether the multiplication is ready and getting the result. The float multiplier package is defined as follows

package float_multiplier_pkg is
------------------------------------------------------------------------
    type float_multiplier_record is record

        left   : float_record;
        right  : float_record;
        result : float_record;

        sign                           : std_logic;
        exponent                       : t_exponent;
        mantissa_multiplication_result : unsigned(mantissa_high*2+1 downto 0);
        shift_register                 : std_logic_vector(2 downto 0);
    end record;

    constant init_float_multiplier : float_multiplier_record := (zero, zero, zero, '0', (others => '0'),(others => '0'), (others => '0'));
------------------------------------------------------------------------
    procedure create_float_multiplier (
        signal float_multiplier_object : inout float_multiplier_record);
------------------------------------------------------------------------
    procedure request_float_multiplier (
        signal float_multiplier_object : out float_multiplier_record;
        left, right : float_record);
------------------------------------------------------------------------
    function float_multiplier_is_ready (float_multiplier_object : float_multiplier_record)
        return boolean;
------------------------------------------------------------------------
    function get_multiplier_result ( float_multiplier_object : float_multiplier_record)
        return float_record;
------------------------------------------------------------------------

The create_float_multiplier procedure holds the implementation of the floating point multiplier. If we compare the fixed point implementation of the multiplier to the floating point one, there is not a substantial difference. Since the scale is carried separately, the multiplication is done in radix of the mantissa length.

The sign bit is treated separately as an xor of the signs. The sign register and the shift register are needed since the multiplier takes 3 clock cycles to propagate through, thus pipelines are provided for the ready bit as well as the sign.

------------------------------------------------------------------------
    procedure create_float_multiplier 
    (
        signal float_multiplier_object : inout float_multiplier_record
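    ) 
    is
        -- aliases let the body refer to the record fields by their short names
        alias left           is float_multiplier_object.left;
        alias right          is float_multiplier_object.right;
        alias result         is float_multiplier_object.result;
        alias sign           is float_multiplier_object.sign;
        alias exponent       is float_multiplier_object.exponent;
        alias shift_register is float_multiplier_object.shift_register;
        alias mantissa_multiplication_result is float_multiplier_object.mantissa_multiplication_result;
    begin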

        shift_register                 <= shift_register(shift_register'left-1 downto 0) & '0';
        sign                           <= left.sign xor right.sign;
        exponent                       <= left.exponent + right.exponent;
        mantissa_multiplication_result <= left.mantissa * right.mantissa;

        result <= normalize((sign     => sign,
                             exponent => exponent,
                             mantissa => (mantissa_multiplication_result(mantissa_high*2+1 downto mantissa_high+1))
                            ));
    end procedure;

The request_float_multiplier procedure is used to start the multiplier. It takes in the multiplier object as well as the multiplier and multiplicand as arguments. The procedure loads the left and right registers of the multiplier object and loads the zeroth bit of the shift register with ‘1’. This bit is pushed through the shift register, which has a length equal to the number of pipeline stages in the multiplier.

The float_multiplier_is_ready function can be used to check when the floating point multiplier is ready and get_multiplier_result just returns the result of the multiplication.

------------------------------------------------------------------------
    procedure request_float_multiplier
    (
        signal float_multiplier_object : out float_multiplier_record;
        left, right : float_record
    ) is
    begin
        float_multiplier_object.shift_register(0) <= '1';
        float_multiplier_object.left <= left;
        float_multiplier_object.right <= right;
        
    end request_float_multiplier;
    
    ------------------------------------------------------------------------
    function float_multiplier_is_ready
    (
        float_multiplier_object : float_multiplier_record
    )
    return boolean
    is
    begin
        return float_multiplier_object.shift_register(float_multiplier_object.shift_register'left) = '1';
    end float_multiplier_is_ready;
    ------------------------------------------------------------------------
    function get_multiplier_result
    (
        float_multiplier_object : float_multiplier_record
    )
    return float_record
    is
    begin
        return float_multiplier_object.result;
    end get_multiplier_result;
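
The ready flag mechanism can be tried out on its own in simulation. The testbench below is a minimal sketch with made-up entity and signal names: it only models the 3 bit shift register, not the multiplier. The shift is written first as the default action and the request is written after it, so that the later assignment to bit 0 overrides the shifted-in '0'. In the same way the create procedure needs to be called before the request procedure within a process.

```vhdl
library ieee;
    use ieee.std_logic_1164.all;

-- hypothetical testbench, named only for this example
entity ready_pipeline_sketch_tb is
end entity;

architecture sim of ready_pipeline_sketch_tb is
    signal clock          : std_logic := '0';
    signal shift_register : std_logic_vector(2 downto 0) := (others => '0');
    signal cycle          : natural := 0;
    signal ready_cycle    : natural := 0;
begin

    stimulus : process
    begin
        -- run ten clock cycles and then check when the ready flag was seen
        for i in 1 to 10 loop
            clock <= '1'; wait for 5 ns;
            clock <= '0'; wait for 5 ns;
        end loop;
        assert ready_cycle = 4
            report "ready flag should be seen three cycles after the request"
            severity failure;
        wait;
    end process;

    pipeline : process(clock)
    begin
        if rising_edge(clock) then
            cycle <= cycle + 1;

            -- default action, corresponds to the shift in the create procedure
            shift_register <= shift_register(shift_register'left-1 downto 0) & '0';

            -- request on cycle 1, the last assignment to bit 0 wins
            if cycle = 1 then
                shift_register(0) <= '1';
            end if;

            -- corresponds to the is_ready function
            if shift_register(shift_register'left) = '1' then
                ready_cycle <= cycle;
            end if;
        end if;
    end process;

end sim;
```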

Floating point sum/subtraction

The adder is developed into its own module. Because the addition requires several steps, it is also implemented using a record-procedure object structure. The float_adder_record has floats for buffering the larger and smaller of the inputs and the result, as well as a counter for the state machine and a boolean that indicates when the addition is done.

package float_adder_pkg is
------------------------------------------------------------------------
    type float_adder_record is record
        larger  : float_record;
        smaller : float_record;
        result  : float_record;
        adder_counter : integer range 0 to 7;
        adder_is_done : boolean;
    end record;

------------------------------------------------------------------------
    procedure create_adder (
        signal adder_object : inout float_adder_record);
------------------------------------------------------------------------
    procedure request_add (
        signal adder_object : out float_adder_record;
        left, right : float_record);
------------------------------------------------------------------------
    procedure request_subtraction (
        signal adder_object : out float_adder_record;
        left, right : float_record);
------------------------------------------------------------------------
    function adder_is_ready (float_adder_object : float_adder_record)
        return boolean;
------------------------------------------------------------------------
    function get_result ( adder_object : float_adder_record)
        return float_record;
------------------------------------------------------------------------

The addition has three steps. First the added numbers are rearranged such that the “larger” float is the one with the larger exponent. In the second step the mantissa of the smaller number is scaled so that its exponent matches the exponent of the larger number, and in the last step the mantissas are summed together. We rearrange the numbers according to exponent sizes in order to be able to use only one de-normalize function call regardless of the relative magnitudes of the summed floating point numbers.

The normalization step is left outside the adder in order to allow sharing the normalization logic if lower logic resource use is desired.

-----------------------------------------------------------------------
    procedure create_adder
    (
        signal adder_object : inout float_adder_record
    ) is
        alias larger        is adder_object.larger        ;
        alias smaller       is adder_object.smaller       ;
        alias result        is adder_object.result        ;
        alias adder_counter is adder_object.adder_counter ;
        alias adder_is_done is adder_object.adder_is_done;
    begin
        adder_is_done <= false;
        CASE adder_counter is
            WHEN 0 => 
                if larger.exponent < smaller.exponent then
                    larger  <= smaller;
                    smaller <= larger;
                end if;
                adder_counter <= adder_counter + 1;
            WHEN 1 => 
                smaller <= denormalize_float(smaller, to_integer(larger.exponent));
                adder_counter <= adder_counter + 1;
            WHEN 2 =>
                result <= larger + smaller;
                adder_is_done <= true;
                adder_counter <= adder_counter + 1;
            WHEN others => -- do nothing
        end CASE;

    end create_adder;

Normalization and denormalization

A big part of the complications of floating point numbers is the normalization. This refers to shifting out all of the leading zeros from the mantissa. This allows us to have the maximum number of significant bits in the mantissa.

The normalization has two functions, one for calculating the number of leading zeros and another that left shifts out the leading zeros from the mantissa.

------------------------------------------------------------------------
    function number_of_leading_zeroes
    (
        data : std_logic_vector;
        max_shift : integer
    )
    return integer 
    is
        variable number_of_zeroes : integer := 0;
    begin
        for i in data'high - max_shift to data'high loop
            if data(i) = '0' then
                number_of_zeroes := number_of_zeroes + 1;
            else
                number_of_zeroes := 0;
            end if;
        end loop;

        return number_of_zeroes;
        
    end number_of_leading_zeroes;

This normalization left shifts all leading zeroes out of the mantissa and subtracts their number from the exponent. Thus a bit vector “00011101” would be shifted left three times to “11101000” and the exponent decreased by 3.

------------------------------------------------------------------------
    function normalize
    (
        float_number : float_record;
        max_shift : integer
    )
    return float_record
    is
        variable number_of_zeroes : natural := 0;
    begin
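        -- the mantissa is unsigned, so it is converted to match the std_logic_vector argument
        number_of_zeroes := number_of_leading_zeroes(std_logic_vector(float_number.mantissa), max_shift);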

        return (sign     => float_number.sign,
                exponent => float_number.exponent - number_of_zeroes,
                mantissa => shift_left(float_number.mantissa, number_of_zeroes));
    end normalize;
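
The left shift example can be checked in simulation. The testbench below is a minimal sketch that uses an 8 bit mantissa and a starting exponent of 5, both chosen only for this example. The entity and function names are made up, and the count is done over the whole word without the max_shift limit.

```vhdl
library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

-- hypothetical testbench, named only for this example
entity normalize_example_tb is
end entity;

architecture sim of normalize_example_tb is

    function count_leading_zeroes (data : unsigned) return natural is
        variable zeroes : natural := 0;
    begin
        -- scan from the lowest bit upwards, a '1' restarts the count, so
        -- the final value is the length of the run of zeros at the top
        for i in data'low to data'high loop
            if data(i) = '0' then
                zeroes := zeroes + 1;
            else
                zeroes := 0;
            end if;
        end loop;
        return zeroes;
    end function;

begin

    process
        constant mantissa : unsigned(7 downto 0) := "00011101";
        constant exponent : signed(7 downto 0)   := to_signed(5, 8);
        constant shift    : natural := count_leading_zeroes(mantissa);
    begin
        assert shift = 3
            report "expected three leading zeroes" severity failure;
        -- the leading zeros are shifted out of the mantissa
        assert shift_left(mantissa, shift) = unsigned'("11101000")
            report "unexpected normalized mantissa" severity failure;
        -- the exponent is decreased by the same amount, so the value is unchanged
        assert exponent - shift = to_signed(2, 8)
            report "unexpected normalized exponent" severity failure;
        wait;
    end process;

end sim;
```

The represented value does not change: 29/256 · 2^5 and 232/256 · 2^2 are both 3.625.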

This normalization is an extremely expensive operation as it calls for us to implement an any-to-any shift. Thus for a 24 bit mantissa, we need to be able to shift the mantissa by any number of bits between 0 and 24. The max_shift argument in the normalization functions allows us to limit the number of shifts we do in a single clock cycle and thereby the length of the longest logic chain. With this we can either call the normalization several times or build a pipeline if we want to pipeline the float addition.

The de-normalization works in reverse. It takes a number and right shifts its mantissa until the number has the required exponent. The de-normalization step is needed to scale dissimilar exponents before summation.

------------------------------------------------------------------------
    function denormalize_float
    (
        right           : float_record;
        set_exponent_to : integer;
        max_shift       : integer
    )
    return float_record
    is
        variable float : float_record := zero;
    begin
        float := (right.sign,
                  exponent => to_signed(set_exponent_to, exponent_length),
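                  -- the shift amount is calculated with integers so that it cannot wrap around in the exponent word length
                  mantissa => shift_right(right.mantissa, set_exponent_to - to_integer(right.exponent)));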

        return float;
        
    end denormalize_float;
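
The rearrange, de-normalize and add steps can be followed with concrete numbers in a small simulation. This is a minimal sketch with made-up names that uses plain variables instead of the float_record and the adder object. The inputs 0.25 and 3 are chosen only for this example and are deliberately given in the wrong order so that the swap takes place.

```vhdl
library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

-- hypothetical testbench, named only for this example
entity align_and_add_example_tb is
end entity;

architecture sim of align_and_add_example_tb is
    constant mantissa_length : integer := 16;
    subtype t_mantissa is unsigned(mantissa_length-1 downto 0);
begin

    process
        -- 0.25 = 2^-1 * 0.5 and 3.0 = 2^2 * 0.75 with a 16 bit mantissa
        variable larger_exponent  : integer    := -1;
        variable larger_mantissa  : t_mantissa := to_unsigned(32768, mantissa_length);
        variable smaller_exponent : integer    := 2;
        variable smaller_mantissa : t_mantissa := to_unsigned(49152, mantissa_length);

        variable temp_exponent : integer;
        variable temp_mantissa : t_mantissa;
        variable sum           : t_mantissa;
    begin
        -- step 1 : make sure that "larger" has the larger exponent
        if larger_exponent < smaller_exponent then
            temp_exponent    := larger_exponent;
            temp_mantissa    := larger_mantissa;
            larger_exponent  := smaller_exponent;
            larger_mantissa  := smaller_mantissa;
            smaller_exponent := temp_exponent;
            smaller_mantissa := temp_mantissa;
        end if;

        -- step 2 : de-normalize the smaller number to the larger exponent
        smaller_mantissa := shift_right(smaller_mantissa, larger_exponent - smaller_exponent);
        smaller_exponent := larger_exponent;

        -- step 3 : with equal exponents the mantissas can be added
        sum := larger_mantissa + smaller_mantissa;

        assert larger_exponent = 2
            report "unexpected exponent" severity failure;
        -- 53248 / 2^16 = 0.8125 and 2^2 * 0.8125 = 3.25
        assert sum = to_unsigned(53248, mantissa_length)
            report "unexpected mantissa sum" severity failure;
        wait;
    end process;

end sim;
```

In this example the sum of the mantissas still fits in the mantissa word length. The sketch does not cover the case where the sum of two large mantissas overflows.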

Floating point low pass filter with Efinix Trion FPGA

The floating point module is tested with a low pass filter. I use the filter as a baseline operation for testing the arithmetic module as it is very simple to use and it is easy to see that it works correctly. Proper operation of the filter requires correct implementations of both the addition and the multiplication, and since the filter output covers the numbers in our chosen number range, we can easily see whether the calculations work properly.

A simple low pass filter can be calculated using the following algorithm

\begin{equation} y(k+1) = y(k) + \bigg[u(k) - y(k)\bigg]\cdot a_0 \end{equation}

where a_0 is the filter gain. To translate this into VHDL, we use a record-procedure object for the filter. This allows us to reuse the filter as needed, which saves us from typing out the filter every time one is needed. The filter record holds the filter gain and the input and output registers along with a boolean for checking when the filter is ready and a counter for implementing the state machine.

------------------------------------------------------------------------
    type first_order_filter_record is record
        filter_counter   : integer range 0 to 7 ;
        filter_gain      : float_record;
        u                : float_record;
        y                : float_record;
        filter_is_ready : boolean;
    end record;
    
    ------------------------------------------------------------------------
    procedure create_first_order_filter (
        signal first_order_filter_object : inout first_order_filter_record;
        signal float_multiplier          : inout float_multiplier_record;
        signal float_adder               : inout float_adder_record);
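
As a reference for checking the hardware, the filter algorithm can also be run with real numbers in simulation. The sketch below uses a made-up entity name and assumes a gain of 0.5 and a unit step input purely for illustration. It performs the same subtract, multiply and add sequence that the state machine carries out.

```vhdl
-- hypothetical testbench, named only for this example
entity filter_reference_model_tb is
end entity;

architecture sim of filter_reference_model_tb is
begin

    process
        constant a0 : real := 0.5; -- assumed filter gain
        constant u  : real := 1.0; -- assumed step input
        variable y  : real := 0.0;
    begin
        -- three filter updates: subtract, multiply by the gain, add to the output
        for k in 1 to 3 loop
            y := y + (u - y) * a0;
        end loop;

        -- the output goes through 0.5 and 0.75 and approaches the input
        assert y = 0.875
            report "unexpected filter output" severity failure;
        wait;
    end process;

end sim;
```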

The filter equation above is implemented using a state machine. The state machine runs to 4 and waits there until it is restarted by setting the counter to zero. All state transitions are made with the adder_is_ready and float_multiplier_is_ready functions, which allows us to change the number of clock cycles the addition and multiplication take without changing the filter implementation.

The create_first_order_filter procedure gets the multiplier and the adder as arguments, which allows us to reuse them outside the filter hardware if desired. Since the filter runs on a state machine, we can reuse the same adder object for both the subtraction and the addition required by the filter. This greatly reduces the logic use as only multiplexers are needed for the additional multiplier and adder operations.

------------------------------------------------------------------------
    procedure create_first_order_filter
    (
        signal first_order_filter_object : inout first_order_filter_record;
        signal float_multiplier : inout float_multiplier_record;
        signal float_adder : inout float_adder_record
        
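    ) is
        -- aliases let the body refer to the record fields by their short names
        alias filter_counter  is first_order_filter_object.filter_counter;
        alias filter_gain     is first_order_filter_object.filter_gain;
        alias u               is first_order_filter_object.u;
        alias y               is first_order_filter_object.y;
        alias filter_is_ready is first_order_filter_object.filter_is_ready;
    begin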

        filter_is_ready <= false;
        CASE filter_counter is
            WHEN 0 => 
                request_subtraction(float_adder, u, y);
                filter_counter <= filter_counter + 1;
            WHEN 1 =>
                if adder_is_ready(float_adder) then
                    request_float_multiplier(float_multiplier  , get_result(float_adder) , filter_gain);
                    filter_counter <= filter_counter + 1;
                end if;

            WHEN 2 =>
                if float_multiplier_is_ready(float_multiplier) then
                    request_add(float_adder, get_multiplier_result(float_multiplier), y);
                    filter_counter <= filter_counter + 1;
                end if;
            WHEN 3 => 
                if adder_is_ready(float_adder) then
                    filter_is_ready <= true;
                    y <= normalize(get_result(float_adder));
                    filter_counter <= filter_counter + 1;
                end if;
            WHEN others =>  -- filter is ready
        end CASE;
    end create_first_order_filter;
------------------------------------------------------------------------

The implementation is tested in a project. The code snippet below shows the application code of the floating point filter. The if counter = 0 branch is run every 1200 clock cycles, at which point the floating point filter is requested. When the filter is ready, the floating point signal “test_float” is loaded with the filtered value. The mantissa and the exponent are then connected to the internal bus at addresses 5592 and 5593. This internal bus allows the registers to be transmitted out from the FPGA using UART. The internal bus was designed in a live coding session from which a recording can be found here.

--------------------------------
    signal float_multiplier   : float_multiplier_record := init_float_multiplier;
    signal adder              : float_adder_record := init_adder;
    signal first_order_filter : first_order_filter_record := init_first_order_filter;

    signal test_float : float_record := to_float(1.23525);
begin
    
    process(clock_120Mhz)
    begin
        if rising_edge(clock_120Mhz) then

            init_bus(bus_out);

            create_adder(adder);
            create_float_multiplier(float_multiplier);
            create_first_order_filter(first_order_filter, float_multiplier, adder);
          
            connect_read_only_data_to_address(bus_in , bus_out , 5592 , get_mantissa(test_float));
            connect_read_only_data_to_address(bus_in , bus_out , 5593 , get_exponent(test_float) + 1000);

            count_down_from(counter, 1199);
            if counter = 0 then
                testi <= testi + 1;
                filter_input <= (testi mod 16384)*8;

                if filter_input < 16384*4 then
                    request_float_filter(first_order_filter, to_float(0.0));
                else
                    request_float_filter(first_order_filter, to_float(22.1346836));
                end if;
            end if;

            if float_filter_is_ready(first_order_filter) then
                test_float <= get_filter_output(first_order_filter);
            end if;
        end if; -- rising_edge
    end process;

Results

The floating point numbers are captured from UART and converted to real numbers. Figure 1 and Figure 2 show the exponent and mantissa of the calculation. In Figure 3, the exponent and mantissa are converted to a real number using the MATLAB script below. In MATLAB the dot operators .* and .^ refer to element wise operations.

stairs(2.^exponent .* mantissa/2^16)
grid on

title('float in real numbers')

Figure 4 shows the waveform of a single filtered pulse. Line 30 in the project code shows that the number to be filtered is ~22.134. The filter output reaches the value 22.0127. The difference is due to the limited numerical accuracy that results from the floating point word length, the implementation and the use of truncation in the calculations.

Figure 5 shows the logic use. The floating point first order filter calculated using the floating point adder and multiplier consumes around 600 logic cells and a single 18×18 bit multiplier. The filter can meet timing at 120MHz. The mantissa bits are limited to 16 to meet timing at this clock frequency.

Figure 1. Exponent read from uart
Figure 2. Mantissa from low pass filter calculation
Figure 3. Real number converted from exponent and mantissa that were obtained with uart
Figure 4. Closeup of the square pulse filtered with the floating point first order filter
Figure 5. Logic use with and without the floating point filter calculated with 16 bit mantissa word length. The floating point first order filter calculated using the floating point adder and multiplier consumes around 600 logic cells and a single 18×18 bit multiplier

Remarks on the floating point implementation

Due to the normalization, the first bit of the mantissa is always ‘1’. Typically this bit is not stored in a register and is only used in the calculations, which gives a free extra bit of accuracy. Because of the abstractions used in the code, this feature can be added to the implementation at a later point.

We can choose the mantissa and exponent lengths arbitrarily. The Efinix FPGA used here has 18×18 bit multipliers, thus a good choice for the mantissa length is 18 bits as the multiplication can then be done with a single multiplier.

The reason why only 16 bits were used for the mantissa is the de-normalization and normalization functions used in scaling the exponents. These need to be able to implement any shift from 0 to mantissa length, which requires many layers of logic gates. These long logic chains can be shortened by pipelining the shifts, that is, by limiting the shift between two registers to only a part of the mantissa length. The normalizer and denormalizer modules are already under design as of writing.

The adder and multiplier are very commonly used together, thus a floating point arithmetic module will be made from a record and procedure that allows a single object to perform both multiplier and adder functions.

IEEE754

I haven’t talked about the standard so far. The IEEE754 standard describes the most commonly used floating point format. In addition to describing the number format it also specifies several special cases. These include normalized and de-normalized numbers and how to deal with +/- infinity and not a number (NaN) values. The standard also has requirements for accuracy and rounding.

If we do not need the special cases, we can simply omit the logic for them. Note that if we use shorter than standard mantissa lengths, our floating point will definitely not fulfill the requirements of the standard.
