Conquer the Divide

The next most important operation after summation, subtraction and multiplication is the fourth fundamental operation, division. As the inverse operation for multiplication, division is used for calculating scaling gains for measurements and controls. It is also needed for calculating further mathematic functions, for example arctangents are readily calculated using rational approximations which require evaluation of division of polynomials.

Importantly in the control of power electronics, division allows calculating the modulated voltages from measurements. This is extremely useful as it allows the converter to work with varying input voltages without it affecting the dynamics.

So this time we dive further into the elementary arithmetic as we take on the fourth fundamental operation, the division with VHDL.

The sources can be found in the fixed point math library

Divide and conquer

Since division is the same as multiplying with inverse, first we calculate the inverse of a number. The basic way to do inversion is by using the Newton-Raphson (NR) method. See the Wikipedia page of the algorithm for explanation.

The NR method uses newtons root finding algorithm for a function that has its root at the reciprocal of the input of the function. The NR reciprocal algorithm is $xi = xi(2-xi\cdot a)$ , where ‘a’ is the number to be inverted and $xi$ is the iterated value that converges to the inverse of $a$ . When this is written in code form we get

a = number_to_be_inverted
xi = initial_value
Loop :
xi = xi*(2-xi*a)
goto Loop
reciprocal = xi

For numbers between 0.5 and 1 this algorithm converges so fast that a single constant as initial value gives 32 bits of accuracy with less than 4 iterations for all numbers in the 0.5 to 1 range.

Extending the range

The [0.5, 1] NR range can be used to invert any arbitrary number by dividing or multiplying the number by 2 until it is between [0.5, 1] and then using the inverse algorithm. The inverse of the divisor is then multiplied with the dividend and after the multiplication is complete, the final division result can be obtained by using the same shift again. For example, if were to evaluate 325/531 we use the following procedure. The bold numbers are the inputs to the NR iteration. Note that steps 1 and 5 use the exact same shift

235/531 with newton raphson :
1 : range reduction : 531/1024 = 0.518552
2 : initial guess : x = 1.75
3 : 1.75*(2-1.75*0.51855) = 1.911940625 // first iteration
3.1 : 1.911940625*(2-1.911940625*0.51855) = 1.9283129 // second iteration
4 : 235 * 1,9283129= 453.149
5 : 453.149/1024 = 0.44252

If we look at the above example, we can see that the division algorithm has five steps, first is the range reduction, second is choosing the initial value, third is the NR iteration(s), fourth multiplying the inverse of the divisor and the dividend and fifth step is the shifting of the result again with the same scaling that was used to bring the input to the 0.5 to 1 range. And that is how we divide by multiplying. The algorithm converges extremely fast and is simple enough to be evaluated using just a basic calculator. Next the integer version of the division algorithm is developed.

Newton Raphson division in VHDL

The integer divider is implemented as an instantiable object as a record type which includes all of the internal signals required by the divider. The divider is created using create_divider procedure from signal with divider_record type. The multiplier interface is given to the create divider, so that the multiplier hardware can be reused in the application code.

The divider record has a counter which allows the NR division to be calculated multiple times and a multiplier_is_ready function is provided to allow for checking when the divider is ready as well as the intermediate value x for NR iterations and a process counter for the divider calculation.

The division is made accessible in application code by first creating a divider and a multiplier for it and the created object then allows request_division procedure to trigger the the divider and get_division_result(multiplier, divider) returns the division result after division_is_ready. The get division result is also given the requested radix which scales the division such that number/number returns 1 in fixed point as 2^radix. For example with radix15 that would be 32768.

package division_pkg is
--------------------------------------------------
    type division_record is record
        division_process_counter : natural range 0 to 3;
        x: int18;
        number_to_be_reciprocated : int18;
        number_of_newton_raphson_iteration : natural range 0 to 1;
        dividend : int18;
    end record;
    constant init_division : division_record := (3, 0, 0, 0, 0);
------------------------------------------------------------------------
------------------------------------------------------------------------
    procedure create_division (
        signal hw_multiplier : inout multiplier_record;
        signal division : inout division_record);
------------------------------------------------------------------------
------------------------------------------------------------------------
    function division_is_ready ( division_multiplier : multiplier_record; division : division_record)
        return boolean;
------------------------------------------------------------------------
    procedure request_division (
        signal division : out division_record;
        number_to_be_divided : int18;
        number_to_be_reciprocated : int18);
------------------------------------------------------------------------
    procedure request_division (
        signal division : out division_record;
        number_to_be_divided : int18;
        number_to_be_reciprocated : int18;
        iterations : in natural range 1 to 2);
------------------------------------------------------------------------
    function division_is_busy ( division : in division_record)
        return boolean;
------------------------------------------------------------------------
    function get_division_result ( division_result : int18)
        return natural;
------------------------------------------------------------------------
end package division_pkg;

When the division is calculated using integer arithmetic, also known as fixed point, the number ranges are determined by the used word length. As the hardware multipliers in FPGAs are predominantly 18 bit, we use that. With radix16 the range of the input [0.5 to 1] maps to [32768, 65535] and the result which ranges from 1 to 2, in radix16 is represented by [65536, 131071]. This uses the full accuracy of the 18 bit signed integer for the result of the inversion.

With the divider object minimal amount of code that allows access to division instantiates two signals for the divider and the multiplier that division uses, two procedure calls for creating the multiplier and the divider and then the division result is obtained with get_division_result after division_is_ready

-- architecture of a module
    signal multiplier : multiplier_record := init_multiplier;
    signal divider : division_record := init_division;
begin
    process_with_integer_division : process(clock)
    begin
        if rising_edge(clock) then
            create_multiplier(multiplier);
            create_division(multiplier, divider);
            if division_is_needed_by_application then
                request_division(divider, divident, divisor);
            end if;
  
            if division_is_ready(multiplier, division) then
                result_of_division <= get_division_result(multiplier, divider, radix15);
            end if;
    end process;

The range reduction function is called remove_leading zeros as that is effectively what it is actually doing. This can be observed from the bit representation of the valid input numbers which are any positive number with ‘1’ at the 15th position. Since the inverse is calculated using radix16 numbers, the range [32768, 65535] in bit vector is represented by

32768 : 00_0100_0000_0000_0000
65535 : 00_0111_1111_1111_1111

The divider range calculation is simply a bunch of if statements for different number ranges of the input. As the FPGA works in binary, multiplications by 2**n are equivalent to shifting the numbers n bits to the left.

------------------------------------------------------------------------
     function remove_leading_zeros
    (
        number : int18
    )
    return int18
    is
        variable abs_number : natural;
    begin
            abs_number := abs(number);
            if abs_number < 2**1  then return abs_number*2**15; end if;
            if abs_number < 2**2  then return abs_number*2**14; end if;
            if abs_number < 2**3  then return abs_number*2**13; end if;
            if abs_number < 2**4  then return abs_number*2**12; end if;
            if abs_number < 2**5  then return abs_number*2**11; end if;
            if abs_number < 2**6  then return abs_number*2**10; end if;
            if abs_number < 2**7  then return abs_number*2**9; end if;
            if abs_number < 2**8  then return abs_number*2**8; end if;
            if abs_number < 2**9  then return abs_number*2**7; end if;
            if abs_number < 2**10 then return abs_number*2**6; end if;
            if abs_number < 2**11 then return abs_number*2**5; end if;
            if abs_number < 2**12 then return abs_number*2**4; end if;
            if abs_number < 2**13 then return abs_number*2**3; end if;
            if abs_number < 2**14 then return abs_number*2**2; end if;
            if abs_number < 2**15 then return abs_number*2**1; end if;
            return abs_number;
         
    end remove_leading_zeros;

Step 2 : Initial value for NR iteration

The get_initial_value_for_division function is responsible for providing enough accuracy that only a single iteration is needed. A look up table of 32 initial values with 5 bit word length was chosen as a compromise between logic usage and accuracy. The index of the returned value from the table is read from bits 14 downto 10 which represent the input number range divided into 32 equal sized slices.

------------------------------------------------------------------------
    function get_initial_value_for_division
    (
        divisor : natural
    )
    return natural
    is
    --------------------------------------------------
        function get_lut_index
        (
            number : natural
        )
        return natural
        is
            variable u_number : unsigned(16 downto 0);
            variable lut_index : natural;
        begin
            u_number  := to_unsigned(number, 17);
            lut_index := to_integer(u_number(14 downto 10)); 
            return lut_index; 
        end get_lut_index;
    -------------------------------------------------- 
        type divisor_lut_array is array (integer range 0 to 31) of natural;
        constant divisor_lut : divisor_lut_array := ( 
          0  => 63 ,
          1  => 61 ,
          2  => 59 ,
          3  => 57 ,
          4  => 56 ,
          5  => 54 ,
          6  => 53 ,
          7  => 52 ,
          8  => 50 ,
          9  => 49 ,
          10 => 48 ,
          11 => 47 ,
          12 => 46 ,
          13 => 45 ,
          14 => 44 ,
          15 => 43 ,
          16 => 42 ,
          17 => 41 ,
          18 => 40 ,
          19 => 39 ,
          20 => 39 ,
          21 => 38 ,
          22 => 37 ,
          23 => 37 ,
          24 => 36 ,
          25 => 36 ,
          26 => 35 ,
          27 => 34 ,
          28 => 34 ,
          29 => 33 ,
          30 => 32 ,
          31 => 32);
    begin
        return divisor_lut(get_lut_index(divisor))*2**11;
    end get_initial_value_for_division;
--------------------------------------------------

The initial value used for the NR iteration is obtained by shifting the lookup table value left by 11 bits to move it into the 65536 to 131071 range of the result of the bit inversion. Optimal value for an index in the lookup table would be one the produces equal amount of error in the calculation with the bits 9 downto 0 being “00_0000_0000” and “11_1111_1111”. This does not happen for all values in the look up table but in the name of saving logic resources a minimum word length was chosen that provided good enough performance.

Steps 3 and 4 : Newton Raphson iterator and final multiplication

The Newton-Raphson iterator is created in the create_division procedure. The procedure has a state machine which can be run either one or two times. Two passes through the NR iteration already provides full accuracy thus there is no need for further iterations for the chosen word length.

The invert_bits is used in the second step since the number system used natively by VHDL is two’s complement in which $2-number$ is equivalent to just inverting all of the bits. Note that 2 refers to “2” in chosen radix, thus for rad16 it is 131071 or 2*2^16. Otherwise the algorithm is directly using the NR iteration. If the NR iteration is called with 2 iterations, then after the first pass the loop is run again. Once the NR iteration(s) are ready, the multiplier is then called with the dividend and the reciprocal obtained from the NR iterations.

------------------------------------------------------------------------
    procedure create_division
    (
        signal hw_multiplier : inout multiplier_record;
        signal division : inout division_record
    ) is
    --------------------------------------------------
        function invert_bits
        (
            number : natural
        )
        return natural
        is
            variable number_in_std_logic : std_logic_vector(16 downto 0);
        begin
            number_in_std_logic := not std_logic_vector(to_unsigned(number,17));
            return to_integer(unsigned(number_in_std_logic));
        end invert_bits;
    --------------------------------------------------
        alias division_process_counter is division.division_process_counter;
        alias x is division.x;
        alias number_to_be_reciprocated is division.number_to_be_reciprocated; 
        alias number_of_newton_raphson_iteration is division.number_of_newton_raphson_iteration; 
        alias dividend is division.dividend;
        variable xa : int18;
    --------------------------------------------------
    begin
         
        CASE division_process_counter is
            WHEN 0 =>
                multiply(hw_multiplier, number_to_be_reciprocated, x);
                division_process_counter <= division_process_counter + 1;
            WHEN 1 =>
                if multiplier_is_ready(hw_multiplier) then
                    xa := get_multiplier_result(hw_multiplier, 16);
                    multiply(hw_multiplier, x, invert_bits(xa));
                end if;
                increment_counter_when_ready(hw_multiplier,division_process_counter);
            WHEN 2 =>
                if multiplier_is_ready(hw_multiplier) then
                    x <= get_multiplier_result(hw_multiplier, 16);
                    if number_of_newton_raphson_iteration /= 0 then
                        number_of_newton_raphson_iteration <= number_of_newton_raphson_iteration - 1;
                        division_process_counter <= 0;
                    else
                        division_process_counter <= division_process_counter + 1;
                        multiply(hw_multiplier, get_multiplier_result(hw_multiplier, 16), dividend);
                    end if;
                end if;
            WHEN others => -- wait for start
        end CASE;
    end create_division;
------------------------------------------------------------------------

Step 5. Final shifting

Similar setup for calculating the number of zeroes in the range reduction is also used for scaling the final result. The final scaling is determined by the initial divisor range given as argument to the get_result_function.

------------------------------------------------------------------------
    function get_division_result
    (
        multiplier : multiplier_record;
        divisor : natural;
        radix : natural
    )
    return natural
    is
        variable multiplier_result : integer;
        variable multiplier_result2 : integer;
    begin
            multiplier_result := get_multiplier_result(multiplier,radix);
            multiplier_result2 := get_multiplier_result(multiplier,radix/2+1);
 
            if divisor < 2**1  then return (multiplier_result2)*2**7 ; end if ;
            if divisor < 2**2  then return (multiplier_result2)*2**6 ; end if ;
            if divisor < 2**3  then return (multiplier_result2)*2**5 ; end if ;
            if divisor < 2**4  then return (multiplier_result2)*2**4 ; end if ;
            if divisor < 2**5  then return (multiplier_result2)*2**3 ; end if ;
            if divisor < 2**6  then return (multiplier_result2)*2**2 ; end if ;
            if divisor < 2**7  then return (multiplier_result2)*2**1 ; end if ;
            if divisor < 2**8  then return (multiplier_result2)      ; end if ;
            if divisor < 2**9  then return (multiplier_result)*2**7  ; end if ;
            if divisor < 2**10 then return (multiplier_result)*2**6  ; end if ;
            if divisor < 2**11 then return (multiplier_result)*2**5  ; end if ;
            if divisor < 2**12 then return (multiplier_result)*2**4  ; end if ;
            if divisor < 2**13 then return (multiplier_result)*2**3  ; end if ;
            if divisor < 2**14 then return (multiplier_result)*2**2  ; end if ;
            if divisor < 2**15 then return (multiplier_result)*2**1  ; end if ;
 
            return (multiplier_result);
         
    end get_division_result;

The radix determines what number represents 1, so if division result is is returned with radix15, then 1 is represented by 32768 which is obtained when the same number is used for divisor and divident in the request_division call.

And that is how we divide with integers. This should work for all values for divisor values up to 32767. This is then tested by running all of the numbers through the divider with FPGA.

Testing with hardware.

Six division cores with multipliers are added to the hardware test code which is written in the system_components.vhd file. The divisions are calculated with input number is incremented at every calculation cycle thus all numbers between 1 and 32767 are run through the algorithm for every test. The data is sent out with UART and captured with PC. The instructions to build the uart capture software can be found in the navigation panel.

if ad_conversion_is_ready(spi_sar_adc_data_out) then
    request_power_supply_calculation(power_supply_simulation, -grid_duty_ratio, output_duty_ratio);
 
    test_leading_zeroes <= test_leading_zeroes + 1;
    if test_leading_zeroes = 32767 then
        test_leading_zeroes <= 1;
    end if;
 
    request_division(divider1 , test_leading_zeroes       , test_leading_zeroes                           , 1);
    request_division(divider2 , test_leading_zeroes/128   , test_leading_zeroes                           , 1);
    request_division(divider3 , test_leading_zeroes       , test_leading_zeroes/2 + test_leading_zeroes/4 , 1);
    request_division(divider4 , test_leading_zeroes +2500 , test_leading_zeroes                           , 1);
    request_division(divider5 , test_leading_zeroes       , test_leading_zeroes                           , 2);
    request_division(divider6 , test_leading_zeroes       , test_leading_zeroes +2500                     , 1);
 
    CASE uart_rx_data is
        WHEN 10 => transmit_16_bit_word_with_uart(uart_data_in, get_filter_output(bandpass_filter.low_pass_filter) );
        WHEN 11 => transmit_16_bit_word_with_uart(uart_data_in, (bandpass_filter.low_pass_filter.filter_input - get_filter_output(bandpass_filter.low_pass_filter))/2+32768);
        WHEN 12 => transmit_16_bit_word_with_uart(uart_data_in, get_filter_output(bandpass_filter)/2+32768);
        WHEN 13 => transmit_16_bit_word_with_uart(uart_data_in, bandpass_filter.low_pass_filter.filter_input - get_filter_output(bandpass_filter));
        WHEN 14 => transmit_16_bit_word_with_uart(uart_data_in, get_adc_data(spi_sar_adc_data_out));
        WHEN 15 => transmit_16_bit_word_with_uart(uart_data_in, uart_rx_data);
        WHEN 16 => transmit_16_bit_word_with_uart(uart_data_in, power_supply_simulation.output_inverter_simulation.output_emi_filter.capacitor_voltage.state/2 + 32768);
        WHEN 17 => transmit_16_bit_word_with_uart(uart_data_in, power_supply_simulation.output_inverter_simulation.output_emi_filter.inductor_current.state/2+ 32768);
        WHEN 18 => transmit_16_bit_word_with_uart(uart_data_in, power_supply_simulation.output_inverter_simulation.output_inverter.dc_link_voltage.state/2);
        WHEN 19 => transmit_16_bit_word_with_uart(uart_data_in, power_supply_simulation.grid_inverter_simulation.grid_emi_filter_2.capacitor_voltage.state/4 + 32768);
        WHEN 20 => transmit_16_bit_word_with_uart(uart_data_in, power_supply_simulation.grid_inverter_simulation.grid_emi_filter_2.inductor_current.state/4+ 32768);
        WHEN 21 => transmit_16_bit_word_with_uart(uart_data_in, power_supply_simulation.grid_inverter_simulation.grid_inverter.dc_link_voltage.state/2);
        WHEN 22 => transmit_16_bit_word_with_uart(uart_data_in, get_division_result(division_multiplier1, divider1, 17));
        WHEN 23 => transmit_16_bit_word_with_uart(uart_data_in, get_division_result(division_multiplier2, divider2, 17));
        WHEN 24 => transmit_16_bit_word_with_uart(uart_data_in, get_division_result(division_multiplier3, divider3, 17));
        WHEN 25 => transmit_16_bit_word_with_uart(uart_data_in, get_division_result(division_multiplier4, divider4, 17));
        WHEN 26 => transmit_16_bit_word_with_uart(uart_data_in, get_division_result(division_multiplier5, divider5, 17));
        WHEN 27 => transmit_16_bit_word_with_uart(uart_data_in, get_division_result(division_multiplier6, divider6, 17));
        WHEN others => -- get data from MDIO
            register_counter := register_counter + 1;

The result of the division tests are shown in figures 1 – 4. The division operation is presented by calculating $x/(x+2500)$ and $(x+2500)/x$ , which are shown in Figures 1 and 2. Since fixed point calculation is used, the numbers equal to $x/(x+2500)*32768$ in real numbers and since x goes from 1 to 32768, the result goes from 1/2501*32768 to 32767/(32768+2500), which equals range from [13 to 30445].

Figure 1. x/(x+2500) calculated with the division core.

With $(x+2500)/x$ when the numbers are always larger than 32768 and can be seen to approach 32768 when x gets closer to 32767. With low values of x the division results in a number that is larger than 65535 which overflows due to the uart communication being unsigned 16 bits. Note that even though there is overflow, these values could be evaluated with the implemented division assuming that a correct radix was used such that the resulting number could fit in the result range.

Figure 2. (x + 2500)/x with the divisor core. When the test result is over 2, the numbers overflow thus when x is near zero, there are multiple overflows.

Figure 3 shows the performance of x/x operation with a single pass of the NR iteration. With single iteration of $x/x$ we get a worst case error of 68, which equals accuracy of better than 0.01% for entire input number range.

The close up of the single pass NR division is shown in Figure 5. There are several of the lookup table values that do not provide optimal initial value for the division which can be observed by the error being non symmetric. For optimal behaviour, all of the 32 segments should have equal maximum error which should have symmetric quadratic behaviour within a segment, thus there is at least some extra accuracy that could be still obtained. For now, the obtained behaviour is assumed to be good enough and more effective range will be added if the need arises.

With two NR iterations we get the correct value of 32768 with every other number aside from 1/1 with which the division returns 32767. It should also noted that two NR iterations takes 16 clock cycles, so if the absolute maximum accuracy is required then it should be used.

Figure 3. x/x with single pass NR division shows maximum error of 68 with 1/1 and a maximum deviation from the correct value of 32768 of 28 with all other numbers.

Figure 4. Single pass NR division of x/x in close. Since the NR division has quadratic error behavior, the obtained value differs from the correct result quadratically as the error gets larger

Figure 4. x/x with two NR iterations. With two iterations the division gives the exact value of 32768 with all other numbers aside from 1, with which it returns 32767.

The test code with 6 division cores takes roughly extra 2900 logic cells compared to the system simulation test code in the previous blog post on which the division test is built on. One division object takes thus slightly under 500 logic cells from the FPGA. Compared to the entire device size of 25k logic gates, this is substantial and the resource usage could be lowered by removing the initial value look up table in favour of a single constant and one more NR iteration, however in this case the speed of the algorithm is preferred as the logic usage is still quite low. The complete system which includes all of the code from the previous blog posts consumes 6400 logic units from the the device and 30 multiplier elements and meets timing up to 125MHz.

Figure 6. logic use

Figure 7. Obtained maxium clock speeds for the ethernet clock and the main core clock

Now the project has division which allows calculating the modulation algorithm. Sine function is still needed for the simulation of the grid voltages and the controls need to be designed for the simulated hardware. These will be developed in future posts.

Anatolii

April 15, 2024 at 6:00 pm

I did it without the LUT, just using the shift_right:
if signed(sig_in(width-1 downto frac)) < shift_left(to_signed(1,width-frac),i) then
shifted_input <= shift_right(signed(sig_in),i);
shift <= i;
i <= 0;
else
i <= i + 1;
end if;

April 15, 2024 at 10:27 am

Nice one. I did not know well how to make own division RTL, so I used the provided IP, but it was always a little bit of a pain in the butt

Pingback: Sine and Cosine in Synthesizable VHDL - Hardware Descriptions