Electronic Design

Careful HDL Coding Maximizes Performance In LUT-Based FPGAs

It's High Time You Understand The Interaction Between HDL Coding Style, FPGA Device Architectures, And Design Software.

In an ideal world, synthesis tools would understand and exploit all field-programmable gate array (FPGA) architectures and their special features without designer intervention. In the real world, however, this isn't the case. Applications that are speed- and area-intensive require that designers be aware of the consequences of coding style. To obtain optimal results, an understanding of the FPGA's architecture, the synthesis tool, and the back-end layout software also becomes necessary.

Most FPGAs are not fine-grained. Instead, they're made up of programmable functional units (PFUs) that implement combinational logic in lookup tables (LUTs) and a certain number of flip-flops or latches. The following lists some FPGA features that synthesis tools may have difficulties implementing:

  • The flip-flops inside the PFUs share some control signals, such as the clock, clock enable, and reset/set. In an ORCA architecture, for example, four flip-flops will fit inside a single PFU only if they all share these signals. Most synthesis tools don't understand this. If a design is coded without keeping this fact in mind, the tools might use some of the flip-flops inefficiently, which inflates the chip size.
  • Memory elements inside some FPGAs can be implemented in the LUT portion of the PFU. This method of constructing RAM or ROM inside FPGAs saves a large number of gates and drastically improves the speed of a device. Unfortunately, there's no single standard way to describe memory in HDL. Hence, the synthesis tools can't recognize a memory and map it onto the FPGA's LUT feature.
  • Counters and state machines also are difficult. With so many different kinds of these circuits, the reason for using one over another is mostly dependent on the application. A knowledge of an FPGA's architecture also helps in deciding which method is most efficient.
  • Design hierarchy and floorplanning are hard for synthesis tools to implement.
  • The Global Set Reset (GSR) signal is an internally routed reset signal that doesn't consume any of a chip's routing resources. There's currently no way to describe this feature in VHDL. Consequently, synthesis tools can't use it unless the GSR component is instantiated in the HDL code.

There are three basic techniques for writing VHDL code. Starting with the least efficient method, they are:

  1. A generic code that has not been targeted to an architecture.
  2. A generic code targeted towards a device architecture.
  3. An HDL code with macro instantiation.

It helps to compare these three methods, incorporating coding styles that would be targeted to reduce the aforementioned synthesis inefficiencies.

Synchronous Logic
Flip-flops and latches in most LUT-based FPGAs can be configured in synchronous set/reset mode using the Local Set Reset (LSR) assigned by the designers. In order for a latch or flip-flop to be implemented correctly, the synthesis tool must instantiate the proper library macro. But, this won't happen unless the HDL code contains the correct description. A basic understanding of the FPGA architecture to be used is a must.

Designers have to keep in mind the kinds of flip-flops and latches that are available in the vendor's macro library. If the code implements a register functionality that's not represented by a corresponding macro in the library, the extra functionality will be added to the circuit using additional logic. Most of the time, this extra logic ends up on the registers' datapath, increasing area and delay.

Each PFU can implement up to a certain number of latches and/or flip-flops that share some of their inputs. To get the highest area utilization out of the device, latches and flip-flops are best grouped in multiples of the PFU's register capacity.

If synchronous functionality of the flip-flops is required, the Global Set Reset signal can't implement the set/reset signal. This is because the GSR has asynchronous functionality. It can, however, be used in addition to the LSR signal.

If the code implies a gated Clock Enable (CE) signal, the synthesis tool tends to duplicate the enable logic for every register in the design. To avoid this, compute the gated signal in a separate process and pass its output to the CE input of the main module.
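A minimal sketch of this separation follows. The entity name, port names, and the particular gating function (EN_A and EN_B) are illustrative assumptions, not taken from the article's listings. The enable is computed once in its own process, and the clocked process tests only the resulting CE signal.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity; all names here are illustrative assumptions.
entity ce_shared is
  port (
    CLK      : in  std_logic;
    EN_A     : in  std_logic;
    EN_B     : in  std_logic;
    DATA_IN  : in  std_logic_vector(3 downto 0);
    DATA_OUT : out std_logic_vector(3 downto 0)
  );
end entity ce_shared;

architecture rtl of ce_shared is
  signal CE : std_logic;  -- single enable net shared by all four registers
begin
  -- Enable gating is computed once, outside the clocked process.
  CE_GEN : process (EN_A, EN_B)
  begin
    CE <= EN_A and EN_B;
  end process CE_GEN;

  -- The clocked process tests only CE, so every bit sees the same
  -- clock and the same enable.
  REG : process (CLK)
  begin
    if (CLK'event and CLK = '1') then
      if (CE = '1') then
        DATA_OUT <= DATA_IN;
      end if;
    end if;
  end process REG;
end architecture rtl;
```

Because all four bits share one clock and one enable, they meet the condition described earlier for fitting into a single PFU. Whether a given tool actually packs them that way depends on the tool.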

In order to use the correct flip-flop, the HDL code has to describe the correct functionality. For instance, the following code listing implements a two-bit register with an active-high synchronous reset and an active-high enable signal.

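DO <= D1 AND D2;                   -- combined enable, computed outside the process
SYNC_RST : process (CLK)           -- synchronous logic: only CLK is needed here
begin
  if (CLK'event and CLK = '1') then
    if (RST = '1') then            -- synchronous reset, tested inside the clock edge
      DATA_OUT <= (others => '0');
    elsif (DO = '1') then          -- intended to map to the flip-flop CE input
      DATA_OUT <= DATA_IN;
    end if;
  end if;
end process SYNC_RST;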

Note that to implement a synchronous reset correctly the "if (RST = '1') then" statement has to be entered after the CLK'event inside the process. And for "DO" to be connected to the CE input of the flip-flop, the "elsif (DO = '1') then" statement must go after the "if (RST = '1') then".

Be vigilant with this approach, because some synthesis tools have known limitations in implementing synchronous reset/set. They can produce unpredictable results that, although functionally correct, affect the area and speed of the resulting circuits. The DO signal will be connected to the CE of the flip-flops only if the code is written as shown in the previous HDL example.

Conversely, some signals are never meant to be connected to the CE port, yet they will be if the designer doesn't know which coding patterns result in a CE connection. Consider:

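SYNC_RST : process (CLK)
begin
  if (CLK'event and CLK = '1') then
    if (D1 = '0' and DATA_IN = "10") then
      DATA_OUT(0) <= DATA_IN(0);   -- only bit 0 is assigned in this branch
    elsif (D2 = '0' and DATA_IN = "01") then
      DATA_OUT(1) <= DATA_IN(1);   -- only bit 1 is assigned in this branch
    end if;                        -- no else: all other states hold their value
  end if;
end process SYNC_RST;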

The code in this listing will generate two flip-flops with two different CE signals for a couple of reasons. First, there are some undefined states in the process (such as the state when D1 = '1' and DATA_IN = "10"). Also, not all of the outputs are assigned under every branch of the "if" statement.

Both of these issues will force the synthesis tool to use the CE port of the flip-flops in order to retain their previous values. As a result, this circuit will consume two programmable logic cells (PLCs) instead of one. To avoid these kinds of inefficiencies, try the following when writing HDL code:

  • Always attempt to group multiples of four flip-flops under every "if" statement.
  • Try to define all the states of the control signals and the status of the register outputs for every state.

if (CLK'event and CLK = '1') then
  if (D1 = '0') then
    DATA_OUT <= "01";
  elsif (D2 = '0') then
    DATA_OUT <= "10";
  else
    DATA_OUT <= DATA_IN;
  end if;
end if;

The preceding listing is an example of code that will not try to use the CE inputs of the flip-flops.
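Both recommendations can be combined in one complete design unit. The sketch below is an assumption-laden illustration rather than one of the article's listings: the entity name and the four-bit width are chosen here to match the four-flip-flop grouping mentioned above. Every branch assigns all four bits, and the final else covers every remaining state, so no register ever has to hold its previous value.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity; the name and the 4-bit width are illustrative.
entity full_spec_reg is
  port (
    CLK      : in  std_logic;
    D1       : in  std_logic;
    D2       : in  std_logic;
    DATA_IN  : in  std_logic_vector(3 downto 0);
    DATA_OUT : out std_logic_vector(3 downto 0)
  );
end entity full_spec_reg;

architecture rtl of full_spec_reg is
begin
  REG4 : process (CLK)
  begin
    if (CLK'event and CLK = '1') then
      -- All four bits are assigned in every branch.
      if (D1 = '0') then
        DATA_OUT <= "0001";
      elsif (D2 = '0') then
        DATA_OUT <= "0010";
      else
        -- The else branch defines the outputs for all remaining states.
        DATA_OUT <= DATA_IN;
      end if;
    end if;
  end process REG4;
end architecture rtl;
```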

Memory Modules
The most efficient way to implement memory in an SRAM FPGA is by using the internal lookup tables inside of the PFU. In a Lucent FPGA, for example, each PFU can implement RAM or ROM in one of two configurations: a single 16-by-4 array or two 16-by-2 memory blocks. Multiple PFUs can then be used to implement other array sizes (such as 16 by 8, 32 by 4, and 64 by 8). Let's discuss three methods for implementing a 16-by-8 memory block.

The first method is generic VHDL code (see Code Listing 1). When the VHDL code in this listing is implemented in a 2C04, the design uses 128 flip-flops, 76 out of 100 PFUs, and 0 out of 800 three-state buffers (TBUFs). The timing report states that 38 MHz is the maximum frequency for this circuit after map, place, and route.

Method two is generic VHDL code targeted towards FPGAs (see Code Listing 2). When this VHDL code is implemented in a 2C04, the design utilizes 128 flip-flops, 41 out of 100 PFUs, and 128 out of 800 TBUFs. According to the timing report, 40 MHz is the maximum frequency for this circuit after map, place, and route.

The third and final method is instantiation of RAM (see Code Listing 3). When the VHDL code in the listing is implemented in a 2C04, the design uses 20 flip-flops, six out of 100 PFUs, and eight out of 800 TBUFs. The timing report states that 52 MHz is the maximum frequency for this circuit after map, place, and route.

The advantages to the first method are that it maintains generic VHDL code that can be targeted to any technology. Plus, no knowledge of the FPGA architecture is required. There are, however, disadvantages to this method. There's no utilization of the FPGA's architectural features. And, it produces poor area and timing results.

Method two offers several advantages. It maintains generic VHDL code that can be targeted to any technology. Compared to method one, it improves area by almost 50% (41 PFUs instead of 76). It also improves clock-to-out delays by almost 200% over the first method.

Yet, method two also has its weak points. One disadvantage is its use of the FPGA's three-state buffers, which might make routing difficult in bigger designs. Also, this method doesn't exploit the FPGA's architectural features.

The last method has two main advantages. It offers an improvement of almost 25% in the overall timing performance of the design, and it uses 35 fewer PFUs than method two (six instead of 41). Unfortunately, though, this method's VHDL code is locked to a specific technology.
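To make the "generic" style concrete, a technology-independent description of a 16-by-8 memory might look like the sketch below. This is not the article's Code Listing 1. The entity, port, and signal names are assumptions, as is the use of the ieee.numeric_std package for the address conversion.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical generic 16-by-8 RAM; all names are illustrative.
entity ram16x8 is
  port (
    CLK  : in  std_logic;
    WE   : in  std_logic;
    ADDR : in  std_logic_vector(3 downto 0);
    DIN  : in  std_logic_vector(7 downto 0);
    DOUT : out std_logic_vector(7 downto 0)
  );
end entity ram16x8;

architecture generic_rtl of ram16x8 is
  -- 16 words of 8 bits each: 128 storage bits in total.
  type ram_t is array (0 to 15) of std_logic_vector(7 downto 0);
  signal MEM : ram_t;
begin
  -- Synchronous write into the addressed word.
  WRITE_P : process (CLK)
  begin
    if (CLK'event and CLK = '1') then
      if (WE = '1') then
        MEM(to_integer(unsigned(ADDR))) <= DIN;
      end if;
    end if;
  end process WRITE_P;

  -- Asynchronous read of the addressed word.
  DOUT <= MEM(to_integer(unsigned(ADDR)));
end architecture generic_rtl;
```

A tool that doesn't recognize this array as a memory has to build it from registers. The 128 storage bits are consistent with the 128 flip-flops reported for the two generic methods, compared with 20 for the instantiated RAM.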

Counters And State Machines
There are many types of counters and state machines that can be implemented through VHDL. Each type is application-specific, with its own efficiencies and inefficiencies.

Binary counter circuits are the easiest to implement in HDL. They also fit very efficiently in some of today's LUT-based FPGAs. In an ORCA architecture, for instance, each LUT can be configured in a ripple mode so that the PFU can implement up to 4-bit arithmetic functions. Moreover, most common synthesis tools understand this FPGA feature and can take advantage of it, while still keeping the HDL code generic. The following code reveals a simple HDL implementation of a synchronous 8-bit upcounter with an enable line.

SYNC_CNT : process (CLK, RST)
begin
  if (RST = '1') then                      -- reset tested before the clock edge
    CNT <= (others => '0');
  elsif (CLK'event and CLK = '1') then
    if (ENBL = '1') then                   -- count only while enabled
      CNT <= CNT + '1';
    end if;
  end if;
end process SYNC_CNT;
DATA_OUT <= CNT;

Implemented in a 2C04, the counter uses eight flip-flops and two out of 100 PFUs. The maximum frequency for this circuit after map, place, and route is 91.542 MHz.

Its advantages are straightforward. It's very simple to implement, and it fits efficiently in most LUT-based FPGAs. But there is one main disadvantage. If the counter's output needs to be decoded for applications like a state-machine controller or a generic memory block, the decode logic will add a considerable number of gates to the circuit. This will most probably degrade the performance.
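The decode cost can be seen in a small sketch. The entity name, the AT_200 output, and the decoded value 200 are hypothetical choices made for illustration, and ieee.numeric_std is assumed for the arithmetic. Detecting a single count value requires a comparison across all eight counter bits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical binary counter with one decoded output.
entity bin_cnt_decode is
  port (
    CLK    : in  std_logic;
    RST    : in  std_logic;
    ENBL   : in  std_logic;
    AT_200 : out std_logic   -- high while the count equals 200
  );
end entity bin_cnt_decode;

architecture rtl of bin_cnt_decode is
  signal CNT : unsigned(7 downto 0);
begin
  SYNC_CNT : process (CLK, RST)
  begin
    if (RST = '1') then
      CNT <= (others => '0');
    elsif (CLK'event and CLK = '1') then
      if (ENBL = '1') then
        CNT <= CNT + 1;
      end if;
    end if;
  end process SYNC_CNT;

  -- The decode is a function of all eight counter bits, and each
  -- additional decoded value adds another comparison like this one.
  AT_200 <= '1' when CNT = to_unsigned(200, 8) else '0';
end architecture rtl;
```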

If performance is the desired goal, using one-hot or shift-register counters is more suitable than the previous solution. Still, this method has a serious drawback: It consumes a large number of gates. The following listing shows the HDL code for a 4-bit counter implemented in a one-hot configuration.

SYNC_CNTR : process (CLK, RST)
begin
  if (RST = '1') then
    CNT(3 downto 0) <= "0000";             -- exactly one bit is set after reset
    CNT(4) <= '1';
  elsif (CLK'event and CLK = '1') then
    if (ENBL = '1') then
      CNT <= CNT(3 downto 0) & CNT(4);     -- rotate left by one position
    end if;
  end if;
end process SYNC_CNTR;
DATA_OUT(3 downto 0) <= CNT(3 downto 0);

The resulting circuit, implemented in a 2C04, uses five flip-flops and two out of 100 PFUs. After map, place, and route, the maximum frequency for this circuit is 109 MHz.

A one-hot or shift-register counter circuit is much faster than the binary implementation, especially when it's combined with some sort of state-machine controller. This is because the decoding of the counter's output is done on just a single bit. In addition, this circuit is easy to implement in VHDL. As can be seen from the preceding listing, it takes only one line of code using the concatenation operator (&).

On the down side, implementing the circuit requires a large number of flip-flops. Assume that an N-bit counter is to be built. Using this method, [(2N)+1] registers will be needed to implement all of the counter states. This form of counter or state machine doesn't take advantage of the FPGA's architectural features, which would allow for a straightforward implementation of arithmetic functions.
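The single-bit decode that makes this style fast can be sketched as follows. The entity name and the IN_STATE2 output are hypothetical; the counter itself mirrors the five-register rotating arrangement described above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical one-hot ring counter with one decoded state output.
entity ring_cnt_decode is
  port (
    CLK       : in  std_logic;
    RST       : in  std_logic;
    ENBL      : in  std_logic;
    IN_STATE2 : out std_logic
  );
end entity ring_cnt_decode;

architecture rtl of ring_cnt_decode is
  signal CNT : std_logic_vector(4 downto 0);
begin
  SYNC_CNTR : process (CLK, RST)
  begin
    if (RST = '1') then
      CNT <= "10000";                       -- exactly one bit set
    elsif (CLK'event and CLK = '1') then
      if (ENBL = '1') then
        CNT <= CNT(3 downto 0) & CNT(4);    -- rotate left by one position
      end if;
    end if;
  end process SYNC_CNTR;

  -- Decoding a state needs no comparison logic: the state is one bit.
  IN_STATE2 <= CNT(2);
end architecture rtl;
```

The trade-off is visible in the declarations: five registers hold only five states, whereas five registers in a binary counter would hold 32.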

An Example
Let's look at an example: the implementation of a FIFO memory block. The size of the FIFO is going to be 127 by 4. It will be implemented using two methods, with and without instantiation of memory.

The FIFO is a single-port device, meaning that the memory array can be either read or written at any one time, but not both. FULL_L and EMPTY_L signals indicate the status of the FIFO. WRL and RDL are the active-low write and read signals.

First, let's try a generic VHDL description for a 127-by-4 FIFO. For this method, the whole FIFO design is placed under one process and then synthesized as a single block. Due to the limitation on the length of this article, the code for this method will not be shown. When implemented in a 2C15 FPGA, the design uses 528 flip-flops and 200 out of 400 PFUs. The timing report reveals that 20 MHz is the maximum frequency after map, place, and route.

Now, let's try an instantiated VHDL code approach. The same code that was used for the previous method is now divided into two blocks. Block one is a single process with the following functions:

  • Calculates the Write and Read addresses depending on the addresses' previous values, as well as the FULL_L, EMPTY_L, WRL, and RDL signals.
  • Calculates a value named DIFF_PTR that gets incremented by 1 in a write operation and decremented by the same value during a read operation. This value gets used to set the FULL_L and EMPTY_L signals of the FIFO.

This first block includes everything except the RAM_ARRAY entity. The process for this block can be found in Code Listing 4.
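The DIFF_PTR bookkeeping described in the list above can be sketched as follows. This is not the article's Code Listing 4. The entity name, the clock and reset ports, the active-high asynchronous reset, and the choice to let a write take priority when WRL and RDL are both asserted are all assumptions made for illustration, and the address counters are omitted.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical flag logic for a 127-deep FIFO; names other than
-- DIFF_PTR, FULL_L, EMPTY_L, WRL and RDL are illustrative.
entity fifo_flags is
  port (
    CLK     : in  std_logic;
    RST     : in  std_logic;   -- assumed active-high, asynchronous
    WRL     : in  std_logic;   -- active-low write
    RDL     : in  std_logic;   -- active-low read
    FULL_L  : out std_logic;   -- active-low full flag
    EMPTY_L : out std_logic    -- active-low empty flag
  );
end entity fifo_flags;

architecture rtl of fifo_flags is
  constant DEPTH  : natural := 127;
  signal DIFF_PTR : unsigned(6 downto 0);   -- number of stored words, 0 to 127
  signal FULL_I   : std_logic;
  signal EMPTY_I  : std_logic;
begin
  -- Flags are derived from DIFF_PTR alone.
  FULL_I  <= '0' when DIFF_PTR = DEPTH else '1';
  EMPTY_I <= '0' when DIFF_PTR = 0 else '1';

  PTR : process (CLK, RST)
  begin
    if (RST = '1') then
      DIFF_PTR <= (others => '0');
    elsif (CLK'event and CLK = '1') then
      if (WRL = '0' and FULL_I = '1') then       -- write while not full
        DIFF_PTR <= DIFF_PTR + 1;
      elsif (RDL = '0' and EMPTY_I = '1') then   -- read while not empty
        DIFF_PTR <= DIFF_PTR - 1;
      end if;
    end if;
  end process PTR;

  FULL_L  <= FULL_I;
  EMPTY_L <= EMPTY_I;
end architecture rtl;
```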

Block two is created with instantiated VHDL code. In this entity, there's an instantiation for eight RPP16-by-4z macros from the ORCA FPGA library. These macros were netlisted so as to create the 127-by-4 memory block of the FIFO (RAM_ARRAY).

Implemented in a 2C15 device, the design uses 20 flip-flops and 25 out of 400 PFUs. After map, place, and route, the maximum frequency for this circuit is 29 MHz.

Conclusion
For designs that aren't speed- and area-sensitive, it's probably enough to write generic, synthesizable HDL code. However, for designs in which speed and area are critical, a basic knowledge of the FPGA architecture and the correct HDL coding style for that architecture is a must.

Synthesis vendors are currently working with the FPGA suppliers in hopes of advancing the tools to a level where they automatically exploit all of a device's architectural features. But for now, designers must apply their digital hardware experience while coding HDL. This is the only way to get the highest utilization out of FPGAs.

Recommended Reading:

  1. Cohen, B., VHDL Coding Styles and Methodologies, Kluwer Academic Publishers, 1995.
  2. Lucent Technologies Inc., 1996 Field-Programmable Gate Arrays Data Book, Oct. 1996.
  3. Lucent Technologies Inc., ORCA FPGAs HDL Design Guide, 1996.
  4. Ott, D., and Wilderotter, T., A Designer's Guide to VHDL Synthesis, Kluwer Academic Publishers, 1994.