When we write HDL, our source code is compiled into a circuit of FPGA logic elements, memories and DSP slices, which then execute our design. If our design does not need the maximum throughput of a fully pipelined operation, we can save circuit area by arranging the design into sequential steps and reusing parts of the hardware. When we reuse hardware, we essentially trade complex circuit elements like DSPs for simpler ones. In this blog we look into design methods for building a simple processor, which allows us to trade logic resources for memory. Instead of using a standard processor architecture like RISC-V, we will design both the software and the processor that runs it in VHDL.
Design a custom processor and its assembly in VHDL
Using plain VHDL has a few substantial benefits over using an actual processor. Staying with VHDL allows us to add a processor for a single functionality without the need to have any new tools, compilers, languages or testing frameworks. We can both develop and test our software along with the processor directly with the VHDL simulator and both simulate and test the entire functionality in one step. We also have the possibility to parametrize the application in VHDL allowing for example filter gains to be set from generics and functions. With a processor customized to our application, we can also have the absolute minimal amount of processor features that our design needs.
We will first design the assembly software for the filter. We then design functions and data types which allow us to write the software with functions in VHDL. Then we will write a pipelined processor to run this software, and lastly we test it on hardware. The processor is tested with an example project which creates a noisy sine that is then filtered with a first order filter that is run from ram.
The sources for the processor, its testbench and the VHDL programming module are found in the microprogram processor repository. The repository has a vunit_run.py script to launch the testbench using VUnit and it was designed with GHDL and GTKWave. The design is shown in this post with the Efinix Titanium Evaluation kit. In addition to the Efinix Efinity, there are also build scripts for Intel Quartus, Xilinx Vivado and both Lattice Radiant and Diamond in the example projects repository.
Assembling a Low pass filter
The simplest example that does something useful is the basic low pass filter. The difference equation that we are implementing is given below. It is stable and describes a low pass filter as long as the filter gain is positive and less than 1.0.
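The equation itself can be read back from the three instruction program shown later in this section (subtract, multiply, add). The form below is reconstructed from that program, so the sample indexing is an assumption:

```latex
\begin{aligned}
y[k+1] &= y[k] + g\,\bigl(u[k] - y[k]\bigr) \qquad (1)\\
       &= (1-g)\,y[k] + g\,u[k]
\end{aligned}
```

Here u is the input, y the filter output and g the filter gain. The second form shows the single pole at z = 1 - g, which stays between 0 and 1 when 0 < g < 1, giving the non-oscillating low pass response described above.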
We will first start with the assembly program that we will design a processor for. Assembly code is written such that each instruction has a command, followed by the target register and lastly the arguments arg1_register and arg2_register.
For example a subtract instruction “sub” which accomplishes reg1 <= reg2 - reg3 would be
sub reg1, reg2, reg3
Using add, sub and mpy instructions that are similarly constructed, we can write equation (1) as processor instructions as follows
sub temp, u, y
mpy mpy_result, temp, g
add y, y, mpy_result
In the code above we first subtract y from u and place it to temp register. In the second line we multiply temp and g and place the result to mpy_result register and finally we get the filtered output by adding together previous output y and the mpy_result register.
In addition to the functional part of our program, the register values need to be loaded from and saved to memory. The full program thus first loads the register values from memory, then executes the calculation and lastly saves the modified value back to memory. This is the program that we will write in VHDL and run on our processor.
load_from_ram u, address(u)
load_from_ram y, address(y)
load_from_ram g, address(g)
sub temp, u, y
mpy mpy_result, temp, g
add y, y, mpy_result
save_to_ram address(y), y
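Before building any hardware, the program can be checked against a behavioural model. The following is a minimal simulation-only sketch in VHDL real arithmetic, where each line of the loop corresponds to one arithmetic instruction. The gain of 0.1, the input of 0.5 and the 30 iterations are illustration values chosen here, not values taken from the repository.

```vhdl
entity low_pass_reference_model is
end entity;

architecture sim of low_pass_reference_model is
begin
    check : process
        -- assumed values for illustration only
        constant g : real := 0.1;   -- filter gain
        constant u : real := 0.5;   -- constant input, i.e. a step
        variable y          : real := 0.0;
        variable temp       : real;
        variable mpy_result : real;
    begin
        for k in 1 to 30 loop
            temp       := u - y;          -- sub temp, u, y
            mpy_result := temp * g;       -- mpy mpy_result, temp, g
            y          := y + mpy_result; -- add y, y, mpy_result
        end loop;

        -- the step response approaches u from below without overshoot
        assert y > 0.45 and y < u
            report "filter output did not settle towards the input"
            severity failure;
        report "y after 30 steps = " & real'image(y);
        wait;
    end process;
end architecture;
```

With these values the output follows 0.5 * (1 - 0.9^30), which is a little under 0.48, so it ends up between 0.45 and the input level.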
Processor Registers
The registers temp, u, y, g and mpy_result are generic variables that the processor operates with. The registers are connected to ram ports to allow loading and storing their values and they can be added, subtracted and multiplied together. The registers are hence the actual data vectors between which all of our operations are executed.
The registers are defined as an array of bit vectors. We choose the register length to be 20 bits since it is a multiple of the Efinix Titanium ram port width and allows maximal utilization of the ram.
type reg_array is array (integer range 0 to 5) of std_logic_vector(19 downto 0);
Next we will design VHDL types and functions which allow us to describe the assembly instructions in VHDL.
Creating an instruction in VHDL
An instruction is a bit vector into which we encode the required command and its arguments and which is then stored in ram. Some of the bits in the instruction are dedicated to encoding the actual command and the rest of the bit vector can then hold the parameters of that command. We use the same instruction width as the register width so we can use a single ram block for both instructions and data.
VHDL is strongly typed and has attributes, accessed with the tick operator, for getting information from the types. We will use this feature for encoding the instructions and arguments into our instruction bit vector. The 20 bit instruction vector and the bit slices that correspond with the command, destination register index and the argument register indices are defined as follows
subtype t_instruction is std_logic_vector(19 downto 0);
subtype comm is std_logic_vector(19 downto 16);
subtype dest is std_logic_vector(15 downto 12);
subtype arg1 is std_logic_vector(11 downto 8);
subtype arg2 is std_logic_vector(7 downto 4);
subtype arg3 is std_logic_vector(3 downto 0);
There are only 5 commands that we currently need: the load and save memory operations and the sub, mpy and add arithmetic operations. To translate these commands into VHDL, we can make an enumerated type.
type t_command is (
add ,
sub ,
mpy ,
save ,
load
);
Encoding the command into the top 4 bits of the instruction uses the 'pos attribute. This gives the position of the command in the enumeration, counted from zero, which is used as the bit pattern. So for example, with the enumeration above, the “mpy” command would be 2 or “0010” in the command bit slice. Similarly we encode the registers that we are operating on by writing the index of each register into the bit vector slices defined by the dest, arg1, arg2 and arg3 subtype ranges.
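The 'pos encoding can be checked in isolation with a small self-checking simulation. This is a sketch using the five entry enumeration listed above. The helper name encode_command is made up for this example and is not part of the repository.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity command_encoding_example is
end entity;

architecture sim of command_encoding_example is
    -- same enumeration and command slice as in the text
    type t_command is (add, sub, mpy, save, load);
    subtype comm is std_logic_vector(19 downto 16);

    -- hypothetical helper: position in the enumeration as a 4 bit pattern
    function encode_command(command : t_command) return std_logic_vector is
    begin
        return std_logic_vector(to_unsigned(t_command'pos(command), comm'length));
    end function;
begin
    check : process
    begin
        -- 'pos counts from zero in declaration order
        assert t_command'pos(add) = 0 report "add is not at position 0" severity failure;
        assert t_command'pos(mpy) = 2 report "mpy is not at position 2" severity failure;
        assert encode_command(mpy)  = "0010" report "unexpected mpy pattern"  severity failure;
        assert encode_command(load) = "0100" report "unexpected load pattern" severity failure;
        -- 'val is the inverse mapping, which a decoder can use
        assert t_command'val(4) = load report "'val(4) is not load" severity failure;
        wait;
    end process;
end architecture;
```

Because the bit pattern follows the declaration order, inserting a new command in the middle of the enumeration changes the encoding of every command after it. The program constants are regenerated by the same functions, so this stays consistent as long as the program and processor are built from the same sources.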
The instruction encoding is done using a function that takes in the command, destination and argument register indices and returns a std_logic_vector as shown below
function write_instruction
(
command : in t_command;
destination : in natural range 0 to number_of_registers-1;
argument1 : in natural range 0 to number_of_registers-1;
argument2 : in natural range 0 to number_of_registers-1
)
return std_logic_vector
is
variable instruction : t_instruction := (others=>'0');
begin
instruction(comm'range) := std_logic_vector(to_unsigned(t_command'pos(command) , comm'length));
instruction(dest'range) := std_logic_vector(to_unsigned(destination , dest'length));
instruction(arg1'range) := std_logic_vector(to_unsigned(argument1 , arg1'length));
instruction(arg2'range) := std_logic_vector(to_unsigned(argument2 , arg2'length));
return instruction;
end write_instruction;
Using this function, an instruction that adds together registers 2 and 3 and places the result in register 1 is written as follows.
constant add_regs_2_and_3_to_1 : t_instruction := write_instruction(add, 1, 2, 3);
Similarly when we are decoding the instructions, we use the subtype ranges to return register indices.
function get_dest
(
input_register : std_logic_vector
)
return natural
is
begin
return to_integer(unsigned(input_register(dest'range)));
end get_dest;
You can find the full assembler source in the repository.
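As a round trip check, the encode and decode functions can be exercised together. The sketch below copies write_instruction and the subtype slices from the text and adds decode and get_arg1 written in the same style as get_dest. These two bodies are assumptions for illustration and the repository versions may differ.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity instruction_roundtrip_example is
end entity;

architecture sim of instruction_roundtrip_example is
    constant number_of_registers : natural := 6;
    type t_command is (add, sub, mpy, save, load);
    subtype t_instruction is std_logic_vector(19 downto 0);
    subtype comm is std_logic_vector(19 downto 16);
    subtype dest is std_logic_vector(15 downto 12);
    subtype arg1 is std_logic_vector(11 downto 8);
    subtype arg2 is std_logic_vector(7 downto 4);

    function write_instruction
    (
        command     : in t_command;
        destination : in natural range 0 to number_of_registers-1;
        argument1   : in natural range 0 to number_of_registers-1;
        argument2   : in natural range 0 to number_of_registers-1
    )
    return std_logic_vector
    is
        variable instruction : t_instruction := (others => '0');
    begin
        instruction(comm'range) := std_logic_vector(to_unsigned(t_command'pos(command), comm'length));
        instruction(dest'range) := std_logic_vector(to_unsigned(destination, dest'length));
        instruction(arg1'range) := std_logic_vector(to_unsigned(argument1, arg1'length));
        instruction(arg2'range) := std_logic_vector(to_unsigned(argument2, arg2'length));
        return instruction;
    end write_instruction;

    -- assumed decoder: 'val is the inverse of 'pos
    -- (a command field above 4 would be a run time error in simulation)
    function decode(instruction : t_instruction) return t_command is
    begin
        return t_command'val(to_integer(unsigned(instruction(comm'range))));
    end decode;

    function get_dest(instruction : t_instruction) return natural is
    begin
        return to_integer(unsigned(instruction(dest'range)));
    end get_dest;

    -- assumed, written in the same style as get_dest
    function get_arg1(instruction : t_instruction) return natural is
    begin
        return to_integer(unsigned(instruction(arg1'range)));
    end get_arg1;

    constant add_regs_2_and_3_to_1 : t_instruction := write_instruction(add, 1, 2, 3);
begin
    check : process
    begin
        -- command 0, dest 1, arg1 2, arg2 3, unused arg3 0
        assert add_regs_2_and_3_to_1 = x"01230" report "unexpected encoding" severity failure;
        assert decode(add_regs_2_and_3_to_1) = add report "decode failed" severity failure;
        assert get_dest(add_regs_2_and_3_to_1) = 1 report "get_dest failed" severity failure;
        assert get_arg1(add_regs_2_and_3_to_1) = 2 report "get_arg1 failed" severity failure;
        wait;
    end process;
end architecture;
```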
Assembling software code from VHDL source
We next need a program which is a collection of instructions. Since our instructions are actually bit vectors, our program is simply an array of these vectors defined as follows
type program_array is array (natural range <>) of t_instruction;
We can now write the low pass filter program by defining a constant of program_array type and filling it with instructions returned by write_instruction function calls as shown below
constant y : natural := 1;
constant g : natural := 2;
constant temp : natural := 3;
constant u : natural := 4;
constant result_address : natural := 100;
constant gain_address : natural := 101;
constant input_address : natural := 102;
constant program : program_array :=
(
write_instruction(load , y , result_address) ,
write_instruction(load , g , gain_address) ,
write_instruction(load , u , input_address) ,
write_instruction(sub , temp , u , y) ,
write_instruction(mpy , temp , temp , g) ,
write_instruction(add , y, temp , y),
write_instruction(save, y, result_address)
);
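The load and save lines above call write_instruction with three arguments, so they rely on an overload that takes a memory address instead of two argument registers. That overload is not shown in this post. A minimal sketch of how it could look is given below, assuming that the address occupies the 12 bits otherwise used by arg1, arg2 and arg3. The addr subtype and the body of get_address are assumptions made for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity memory_instruction_example is
end entity;

architecture sim of memory_instruction_example is
    type t_command is (add, sub, mpy, save, load);
    subtype t_instruction is std_logic_vector(19 downto 0);
    subtype comm is std_logic_vector(19 downto 16);
    subtype dest is std_logic_vector(15 downto 12);
    -- ASSUMPTION: the address reuses the arg1..arg3 bits as one 12 bit field
    subtype addr is std_logic_vector(11 downto 0);

    function write_instruction
    (
        command     : in t_command;
        destination : in natural range 0 to 15;
        address     : in natural range 0 to 4095
    )
    return std_logic_vector
    is
        variable instruction : t_instruction := (others => '0');
    begin
        instruction(comm'range) := std_logic_vector(to_unsigned(t_command'pos(command), comm'length));
        instruction(dest'range) := std_logic_vector(to_unsigned(destination, dest'length));
        instruction(addr'range) := std_logic_vector(to_unsigned(address, addr'length));
        return instruction;
    end write_instruction;

    -- assumed decoder for the address field
    function get_address(instruction : t_instruction) return natural is
    begin
        return to_integer(unsigned(instruction(addr'range)));
    end get_address;

    constant y              : natural := 1;
    constant result_address : natural := 100;
    constant load_y : t_instruction := write_instruction(load, y, result_address);
begin
    check : process
    begin
        -- load is command 4, register 1, address 100 = x"064"
        assert load_y = x"41064" report "unexpected encoding" severity failure;
        assert get_address(load_y) = 100 report "address did not round trip" severity failure;
        wait;
    end process;
end architecture;
```

With this layout a single instruction can reach 4096 addresses, which comfortably covers the addresses 100-108 used in this post.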
To make this program easily reusable, we can write a function that returns a program array with the low pass filter instructions. The following function takes in the addresses of filter_gain, result and input value and returns the corresponding program.
function low_pass_filter
(
gain_address : natural;
result_address : natural;
input_address : natural
)
return program_array
is
constant y : natural := 1;
constant g : natural := 2;
constant temp : natural := 3;
constant u : natural := 4;
constant program : program_array := (
write_instruction(load , y , result_address) ,
write_instruction(load , g , gain_address) ,
write_instruction(load , u , input_address) ,
write_instruction(sub , temp , u , y) ,
write_instruction(mpy , temp , temp , g) ,
write_instruction(add , y, temp , y),
write_instruction(save, y, result_address)
);
begin
return program;
end low_pass_filter;
With the low pass filter now enclosed in a function, we can make another program that executes multiple low pass filters by defining a constant which is initialized with multiple calls to the low_pass_filter function, giving the addresses of the variables as arguments. The following program_array defines a program of 3 low pass filters that use memory addresses 100-108 for the gains, inputs and results
constant multiple_low_pass_filters : program_array := (
low_pass_filter(gain_address => 100 , result_address => 101 , input_address => 102) &
low_pass_filter(gain_address => 103 , result_address => 104 , input_address => 105) &
low_pass_filter(gain_address => 106 , result_address => 107 , input_address => 108)
);
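The concatenation works because VHDL defines the & operator implicitly for every one dimensional array type, including program_array. The sketch below shows the resulting length and index range, using a made up two_words function as a stand-in for low_pass_filter so that the example stays short.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity program_concatenation_example is
end entity;

architecture sim of program_concatenation_example is
    subtype t_instruction is std_logic_vector(19 downto 0);
    type program_array is array (natural range <>) of t_instruction;

    -- hypothetical stand-in for low_pass_filter: returns two placeholder words
    function two_words(first_word : t_instruction) return program_array is
        constant program : program_array := (first_word, x"00000");
    begin
        return program;
    end function;

    constant combined : program_array :=
        two_words(x"11111") & two_words(x"22222") & two_words(x"33333");
begin
    check : process
    begin
        -- three 2 word programs become one 6 word program indexed from 0
        assert combined'length = 6 report "unexpected length" severity failure;
        assert combined'left = 0 and combined'right = 5 report "unexpected range" severity failure;
        assert combined(2) = x"22222" report "second program is not at index 2" severity failure;
        wait;
    end process;
end architecture;
```

The combined program is indexed from 0, so copying program(i) to the same index in ram lines up with a program counter that starts from zero.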
At this point, only a few functions into our assembler design, we are already creating inline function calls to which we pass the memory addresses of the variables, and we are actually programming in VHDL 🙂
Setting constants into memory
Before we can run this program, we need to store the gains of our filters into ram addresses 100, 103 and 106 to match the addresses in the program above. We can again use a function to set the memory contents of these addresses to specific values.
We can do this by making a function that takes in the program and uses it as the initial value of a variable. The function then modifies whatever memory values we need and returns the modified array.
function modify_mem_values(program : program_array) return program_array
is
variable retval : program_array := program;
begin
retval(100) := x"0acdc";
retval(103) := x"0acdc";
retval(106) := x"0acdc";
return retval;
end function;
Initializing ram with program
Lastly we need to initialize our ram_array, which is a data vector with the length of the ram, with the program. The full function to do this is shown below. To keep things simple, we loop through the program array indices and place the instructions in the corresponding memory indices.
The memory datatype ram_array is defined in the ram module. We will use a multi-port ram which allows 2 read operations and a write operation to occur at the same time. The multi-port ram source code can be found in the memory repository. The ram module allows setting the initial values of the ram via the generic which is the way we will be giving our program into the ram module as shown below
--in architecture
constant ram_contents : ram_array := ram_initial_values;
begin
------------------------------------------------------------------------
u_mpram : entity work.ram_read_x2_write_x1
generic map(ram_contents) -- initial ram values
port map(
clock ,
ram_read_instruction_in ,
ram_read_instruction_out ,
ram_read_data_in ,
ram_read_data_out ,
ram_write_port);
The function here takes the program and the filter gain as a real value as arguments and outputs the full ram_array. The to_fixed function is part of the hvhdl fixed point library and it is used to convert a real number into std_logic_vector. The to_fixed is overloaded here to keep the vector length and radix constants out of the calling code. The assert is written into the function to make sure that our program does not overlap with our constants, which could make for a really confusing bug.
function build_sw (program : program_array; filter_gain : real range 0.0 to 1.0) return ram_array
is
function to_fixed
(
number : real
)
return std_logic_vector
is
begin
return to_fixed(
number =>number,
bit_width => 20,
number_of_fractional_bits =>19);
end to_fixed;
variable retval : ram_array := (others => (others => '0'));
------------------------------------------------------------------------
begin
assert program'length < 100 report "program needs to be less than 100 instructions" severity failure;
for i in program'range loop
retval(i) := program(i);
end loop;
retval(100) := to_fixed(filter_gain);
retval(101) := to_fixed(0.0);
retval(102) := to_fixed(0.5);
retval(103) := to_fixed(filter_gain/2.0);
retval(104) := to_fixed(0.0);
retval(105) := to_fixed(0.08);
retval(106) := to_fixed(0.0);
return retval;
end build_sw;
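To get a feel for the numbers that end up in ram, here is a sketch of a real to fixed point conversion with 20 bit words and 19 fractional bits. The function real_to_q19 is a hypothetical helper written for this example. It is not the hVHDL to_fixed implementation, whose rounding and saturation behaviour may differ.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity real_to_fixed_example is
end entity;

architecture sim of real_to_fixed_example is
    constant word_length     : natural := 20;
    constant fractional_bits : natural := 19;

    -- hypothetical helper, NOT the hVHDL to_fixed implementation:
    -- scale by 2**19, round to the nearest integer, store as two's complement
    function real_to_q19(number : real) return std_logic_vector is
    begin
        return std_logic_vector(to_signed(integer(number * 2.0**fractional_bits), word_length));
    end function;
begin
    check : process
    begin
        assert real_to_q19(0.0)  = x"00000" report "0.0 failed"  severity failure;
        assert real_to_q19(0.5)  = x"40000" report "0.5 failed"  severity failure; -- 262144
        assert real_to_q19(0.08) = x"0A3D7" report "0.08 failed" severity failure; -- 41943
        assert real_to_q19(-0.5) = x"C0000" report "-0.5 failed" severity failure;
        wait;
    end process;
end architecture;
```

In this sketch the largest representable value is 524287 / 524288, so a value of exactly 1.0 does not fit into the word.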
Processing instructions in a pipeline
Creating a processor and processing the instructions in VHDL is quite straightforward. A single processing stage is essentially a case structure that takes in the instruction bit field and does some operation based on the different parts of the field like this
CASE decode(used_instruction) is
WHEN nop => -- do nothing
WHEN add => -- add
reg(destination) <= reg(arg1) + reg(arg2);
WHEN sub => -- subtract
reg(destination) <= reg(arg1) - reg(arg2);
WHEN mpy => -- multiply
reg(destination) <= reg(arg1) * reg(arg2);
WHEN others => -- do nothing
end CASE;
Practical processors need a way to control the logic depth by adding pipeline stages for operations. To have our processor function correctly with the pipelined operations, we add an instruction pipeline which gives the instruction the same amount of latency as the operation. The instruction pipeline is just a shift register: we feed the instruction that is read from ram into one end and it is then shifted through one stage every clock cycle as shown below
type instruction_array is array (natural range <>) of t_instruction;
signal instruction_pipeline : instruction_array(0 to 5);
begin
process(clock)
begin
if rising_edge(clock) then
-- the newest instruction enters at index 0, older ones move one index up
instruction_pipeline <= used_instruction & instruction_pipeline(0 to instruction_pipeline'high-1);
end if;
end process;
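The shift register can be tried out on its own. The following is a self-contained simulation sketch, where the clock generator, the signal names fetched and done, and the test word x"41064" are all made up for this example. It feeds one word into the pipeline and checks at which index the word sits a few clock edges later.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity instruction_pipeline_example is
end entity;

architecture sim of instruction_pipeline_example is
    subtype t_instruction is std_logic_vector(19 downto 0);
    type instruction_array is array (natural range <>) of t_instruction;

    constant nop_word  : t_instruction := (others => '0'); -- placeholder for "no instruction"
    constant test_word : t_instruction := x"41064";        -- arbitrary test pattern

    signal clock   : std_logic := '0';
    signal done    : boolean := false;
    signal fetched : t_instruction := nop_word;
    signal instruction_pipeline : instruction_array(0 to 5) := (others => nop_word);
begin
    clock_generator : process
    begin
        while not done loop
            clock <= '0'; wait for 5 ns;
            clock <= '1'; wait for 5 ns;
        end loop;
        wait;
    end process;

    shift : process(clock)
    begin
        if rising_edge(clock) then
            -- the newest word enters at index 0, older ones move one index up
            instruction_pipeline <= fetched & instruction_pipeline(0 to instruction_pipeline'high-1);
        end if;
    end process;

    check : process
    begin
        fetched <= test_word;
        wait until rising_edge(clock);   -- test_word is captured on this edge
        fetched <= nop_word;
        wait until rising_edge(clock);
        assert instruction_pipeline(0) = test_word report "word not at stage 0" severity failure;
        for i in 1 to 3 loop
            wait until rising_edge(clock);
        end loop;
        -- three edges later the word has moved from index 0 to index 3
        assert instruction_pipeline(3) = test_word report "word not at stage 3" severity failure;
        assert instruction_pipeline(0) = nop_word  report "stage 0 not cleared" severity failure;
        done <= true;
        wait;
    end process;
end architecture;
```

The index into instruction_pipeline is thus the number of clock cycles an instruction has spent in the pipeline, which is what lets each stage pick out its own instruction with a fixed index.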
With the instruction pipeline, we can now have a separate case structure for each stage of the pipeline and run different operations at different stages. In stage 1 we load the add and multiply input registers, and then we capture the add result in pipeline stage 3 and the multiply result in stage 4. This is shown below.
program_counter <= program_counter + 1;
request_data_from_ram(ram_read_instruction_in, program_counter);
if ram_read_is_ready(ram_read_instruction_out) then
used_instruction := get_ram_data(ram_read_instruction_out);
end if;
instruction_pipeline <= used_instruction & instruction_pipeline(0 to instruction_pipeline'high-1);
------------------------------------------------------------------------
--stage 1
used_instruction := instruction_pipeline(1);
CASE decode(used_instruction) is
WHEN add =>
add_a <= registers(get_arg1(used_instruction));
add_b <= registers(get_arg2(used_instruction));
WHEN sub =>
add_a <= registers(get_arg1(used_instruction));
add_b <= -registers(get_arg2(used_instruction));
WHEN mpy =>
mpy_a <= registers(get_arg1(used_instruction));
mpy_b <= registers(get_arg2(used_instruction));
WHEN load =>
request_data_from_ram(data_port_in, get_address(used_instruction));
WHEN others => -- do nothing
end CASE;
------------------------------------------------------------------------
--stage 2
used_instruction := instruction_pipeline(2);
add_result <= add_a + add_b;
mpy_raw_result <= signed(mpy_a) * signed(mpy_b);
CASE decode(used_instruction) is
WHEN others => -- do nothing
end CASE;
------------------------------------------------------------------------
--stage 3
used_instruction := instruction_pipeline(3);
mpy_result <= std_logic_vector(mpy_raw_result(38 downto 38-19));
CASE decode(used_instruction) is
WHEN add | sub =>
registers(get_dest(used_instruction)) <= add_result;
WHEN others => -- do nothing
end CASE;
------------------------------------------------------------------------
--stage 4
used_instruction := instruction_pipeline(4);
CASE decode(used_instruction) is
WHEN mpy =>
registers(get_dest(used_instruction)) <= mpy_result;
WHEN load =>
registers(get_dest(used_instruction)) <= get_ram_data(data_port_out);
WHEN others => -- do nothing
end CASE;
------------------------------------------------------------------------
--stage 5
used_instruction := instruction_pipeline(5);
CASE decode(used_instruction) is
WHEN save =>
write_data_to_ram(ram_data_in, get_address(used_instruction), registers(get_arg1(used_instruction)));
WHEN load =>
registers(get_dest(used_instruction)) <= get_ram_data(data_port_out);
WHEN others => -- do nothing
end CASE;
Note that I used a variable “used_instruction” as a name for each of the processing stages. A variable is updated immediately when it is assigned, hence I can reuse the same “used_instruction” variable for each pipeline stage. During the processor design, this makes it very easy to move parts of the design between different pipeline stages, as we can just copy “when =>” sections from one stage to another. It also makes adding or removing pipeline stages very simple.
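The slice mpy_raw_result(38 downto 38-19) in stage 3 is what keeps the multiplier output in the same number format as its inputs. Assuming both operands have 19 fractional bits, the 40 bit product has 38 fractional bits and a duplicated sign bit at index 39, so dropping bit 39 and the lowest 19 bits returns a 20 bit word with 19 fractional bits again. The sketch below checks this with 0.5 * 0.5. The test values are chosen here for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fixed_point_multiply_example is
end entity;

architecture sim of fixed_point_multiply_example is
    -- 0.5 with 19 fractional bits is 0.5 * 2**19 = 262144
    constant half : signed(19 downto 0) := to_signed(262144, 20);
begin
    check : process
        variable mpy_raw_result : signed(39 downto 0);
        variable mpy_result     : std_logic_vector(19 downto 0);
    begin
        -- 0.5 * 0.5 = 0.25, which is 131072 with 19 fractional bits
        mpy_raw_result := half * half;
        mpy_result     := std_logic_vector(mpy_raw_result(38 downto 38-19));
        assert to_integer(signed(mpy_result)) = 131072
            report "0.5 * 0.5 did not give 0.25" severity failure;

        -- the sign is preserved: -0.5 * 0.5 = -0.25
        mpy_raw_result := (-half) * half;
        mpy_result     := std_logic_vector(mpy_raw_result(38 downto 38-19));
        assert to_integer(signed(mpy_result)) = -131072
            report "-0.5 * 0.5 did not give -0.25" severity failure;
        wait;
    end process;
end architecture;
```

The one case where bits 39 and 38 differ is -1.0 * -1.0, whose result of +1.0 does not fit into this format. With a filter gain between 0 and 1 that product does not occur in the low pass filter program.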
Full processor design
The processor is built from the presented parts : the registers, memory and program counter and instruction pipeline. The test processor and its registers are collected in a single record. Additionally there are two boolean variables for indicating when the processor is done and when it is enabled. The full processor code can be found here.
type simple_processor_record is record
processor_enabled : boolean ;
is_ready : boolean ;
program_counter : natural range 0 to 511 ;
registers : reg_array ;
instruction_pipeline : instruction_array ;
add_a : std_logic_vector(19 downto 0) ;
add_b : std_logic_vector(19 downto 0) ;
add_result : std_logic_vector(19 downto 0) ;
mpy_a : std_logic_vector(19 downto 0) ;
mpy_b : std_logic_vector(19 downto 0) ;
mpy_a1 : std_logic_vector(19 downto 0) ;
mpy_b1 : std_logic_vector(19 downto 0) ;
mpy_raw_result : signed(39 downto 0) ;
mpy_result : std_logic_vector(19 downto 0) ;
end record;
The processor control logic is put into a procedure which creates the processor around the record and ram values. When the processor is enabled, data is read from the multi-port ram instruction port. When the ram module indicates that an instruction has been fetched from ram, the instruction is assigned to the used_instruction variable. If the decoded instruction is program_end, then the is_ready flag is set and processor_enabled is set to false. The program_end is special in that when it is decoded, the processor stops reading further instructions from the ram. We also add a “nop” instruction that simply does nothing but allows us to add delay instructions to our programs.
The init_ram and request_data_from_ram procedures are part of the ram module. In order to read the contents of the ram, we create an incrementing program counter and use it as the memory address for a request_data_from_ram procedure call.
procedure create_simple_processor
(
signal self : inout simple_processor_record;
signal ram_read_instruction_in : out ram_read_in_record ;
ram_read_instruction_out : in ram_read_out_record ;
signal ram_read_data_in : out ram_read_in_record ;
ram_read_data_out : in ram_read_out_record ;
signal ram_write_port : out ram_write_in_record
) is
variable used_instruction : t_instruction;
begin
init_ram(ram_read_instruction_in, ram_read_data_in, ram_write_port);
------------------------------------------------------------------------
--stage -1
self.is_ready <= false;
used_instruction := write_instruction(nop);
if self.processor_enabled then
request_data_from_ram(ram_read_instruction_in, self.program_counter);
if ram_read_is_ready(ram_read_instruction_out) then
used_instruction := get_ram_data(ram_read_instruction_out);
end if;
if decode(used_instruction) = program_end then
self.processor_enabled <= false;
self.is_ready <= true;
else
self.program_counter <= self.program_counter + 1;
end if;
end if;
------------------------------------------------------------------------
CASE decode(used_instruction) is
WHEN load =>
request_data_from_ram(ram_read_data_in, get_sigle_argument(used_instruction));
WHEN others => -- do nothing
end CASE;
self.instruction_pipeline <= used_instruction & self.instruction_pipeline(0 to self.instruction_pipeline'high-1);
-- the processor instruction pipeline starts here --
The snippet from the processor simulation test bench shows the instantiation and request. The simulation testbench requests the processor calculation every 60 clock cycles and the processor runs the 3 low pass filter program.
stimulus : process(simulator_clock)
begin
if rising_edge(simulator_clock) then
simulation_counter <= simulation_counter + 1;
--------------------
create_simple_processor (
self ,
ram_read_instruction_in ,
ram_read_instruction_out ,
ram_read_data_in ,
ram_read_data_out ,
ram_write_port);
------------------------------------------------------------------------
if simulation_counter mod 60 = 0 then
request_processor(self);
end if;
------------------------------------------------------------------------
end if; --rising_edge
end process;
u_mpram : entity work.ram_read_x2_write_x1
generic map(ram_contents)
port map(
simulator_clock ,
ram_read_instruction_in ,
ram_read_instruction_out ,
ram_read_data_in ,
ram_read_data_out ,
ram_write_port);
The simulation checks that the low pass filter outputs rise above 0.45 by the end of the simulation. The result1, result2 and result3 signals show the step responses of the filters, as shown in figure 1.
Hardware test
We test the processor and its software with the Efinix Titanium Evaluation kit. The example project that we use to test our processor design creates a noisy sine and filters it with a fixed point and a floating point filter. The noisy sine and the filtered results are connected to an internal bus which is driven by UART.
The implementation of the test code is given below. The code creates the processor and connects the multi-port ram to it. We buffer the input values and the request for the operation and launch the processor. When the program has run to completion as indicated by the program_is_ready, we reset a pair of counters that are then used to request the results from the ram which are then registered into the signals that we connected to the internal memory bus.
begin
fixed_point_filter : process(clock)
begin
if rising_edge(clock) then
init_bus(bus_out);
connect_read_only_data_to_address(bus_in, bus_out, 15165 , result1 + 32768);
connect_read_only_data_to_address(bus_in, bus_out, 15166 , result2 + 32768);
connect_read_only_data_to_address(bus_in, bus_out, 15167 , result3 + 32768);
------------------------------------------------------------------------
create_simple_processor(
self ,
ram_read_instruction_in ,
ram_read_instruction_out ,
ram_read_data_in ,
ram_read_data_out ,
ram_write_port);
------------------------------------------------------------------------
request_buffer <= example_filter_input.filter_is_requested;
input_buffer <= std_logic_vector(to_signed(example_filter_input.filter_input,20));
if request_buffer then
request_processor(self);
write_data_to_ram(ram_write_port, 102, input_buffer);
end if;
if program_is_ready(self) then
counter <= 0;
counter2 <= 0;
end if;
if counter < 7 then
counter <= counter +1;
end if;
CASE counter is
WHEN 0 => request_data_from_ram(ram_read_data_in, 101);
WHEN 1 => request_data_from_ram(ram_read_data_in, 104);
WHEN 2 => request_data_from_ram(ram_read_data_in, 106);
WHEN others => --do nothing
end CASE;
if not processor_is_enabled(self) then
if ram_read_is_ready(ram_read_data_out) then
counter2 <= counter2 + 1;
CASE counter2 is
WHEN 0 => result1 <= to_integer(signed(get_ram_data(ram_read_data_out))) / 2;
WHEN 1 => result2 <= to_integer(signed(get_ram_data(ram_read_data_out))) / 2;
WHEN 2 => result3 <= to_integer(signed(get_ram_data(ram_read_data_out))) / 2;
WHEN others => -- do nothing
end CASE; --counter2
end if;
end if;
end if; --rising_edge
end process;
------------------------------------------------------------------------
u_dpram : entity work.ram_read_x2_write_x1
generic map(final_sw)
port map(
clock ,
ram_read_instruction_in ,
ram_read_instruction_out ,
ram_read_data_in ,
ram_read_data_out ,
ram_write_port);
------------------------------------------------------------------------
end microprogram;
The example program repository also has a test_app.py script that uses a UART to communicate with the FPGA. The script sets and reads a few register values from the FPGA. Additionally it requests several streams of the fixed point, floating point and processor filtered data and plots 50000 points from each of them as shown in the figures below
Synthesis result and hardware utilization
The hardware utilization of the processor is less than 500 logic units, 4 multipliers and 2 memory blocks as seen in Figure 4. Efinix memory blocks are 10 bits wide, hence a 20 bit multi-port ram needs 2 ram blocks. The example project has a maximum clock speed of about 200 MHz, though this is limited by the floating point filter implementation instead of the microprocessor. The microprocessor alone achieves approximately 340 MHz core clock speed when the sine generation and the two other filters are disabled.
The processor filter implementation takes a little over twice the resources of a comparable fixed point filter written in plain VHDL. The 18 bit fixed point filter synthesizes into roughly 200 LUTs and 2 multipliers whereas the processor needs 4 multipliers and 470 LUTs. Both modules also include the connection to the internal bus and they have 18 and 20 bit word lengths respectively. Hence the logic use is not exactly comparable, but it is similar enough to show that the logic use of the processor is relatively low. With the maximum 340 MHz clock speed we could use these 470 LUTs to run 15 first order low pass filters at a rate of 1 MHz.
The code has also been tested with five of the most common vendor tools: Lattice Diamond, Intel Quartus, Efinix Efinity, Lattice Radiant and AMD Vivado. I used the most up to date versions of all of them. You can find build scripts for all of them in the hVHDL example projects repository.
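The cycle budget behind that last figure can be worked out directly. The arithmetic below is only a reading of the numbers quoted above, not a measurement:

```latex
\frac{340\ \text{MHz}}{1\ \text{MHz}} = 340\ \text{clock cycles per sample},
\qquad
\frac{340\ \text{cycles}}{15\ \text{filters}} \approx 22.7\ \text{cycles per filter}
```

Each low pass filter program is 7 instructions long, so a budget of about 22 cycles per filter leaves room for the pipeline latency between instructions that depend on each other's results. By the same numbers, the LUT ratio between the two implementations is 470 / 200 = 2.35.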
Additional points
The code is quite raw right now and, for example, the processor is in a single module. We can make both the processor and the function definitions reusable by moving the instruction and subtype definitions into their own package, which also allows easily changing the instruction length. The flow control part, which currently only starts and stops the processor operation, should also be put into a separate module so we can extend it if needed.
Reading about the design of processors is going to get very complicated very fast especially if our goal is just to add a minimal amount of software to trade off memory and processing time for fpga resources. Hence I usually design some initial assembly program first and then iterate with how to distribute different parts of the functionality between the software and hardware. This keeps the complexity at a minimum and might prevent overengineering the solution.
The most obvious application for the vhdl programmed processor would be floating point math and very wide word length calculations where the hardware reuse reduces the resource use by a huge amount.
If we run into an area where a lot of non time critical functionality is needed, then an actual soft processor, with the complexity of adding the additional build infrastructure and designing the software in C, C++ or Rust, would probably be the correct choice. But when we have a well contained problem, and especially if we aren't already using a processor, then a more customized solution might be the way to go.