The term "bit slicing" has largely been relegated to the history books as a technique for constructing a processor from modules of smaller bit width, where each module processes one field or "slice" of an operand. Bit-slice processors usually consist of an arithmetic logic unit (ALU) of 2, 4, or 8 bits plus control lines. Using multiple, simpler ALUs was seen as a way to increase computing power in a cost-effective manner. The latest system-on-chip (SOC) technology revives bit-slicing in a programmable fashion to offload the main CPU by intelligently assigning processing tasks to other on-chip programmable hardware.
How did bit-slice evolve over the years?
In the early 1970s, a number of very complex processor designs passed the 8-bit barrier using very simple arithmetic logic units (ALUs). These sophisticated programmable digital systems weren't designed using 8-, 16-, or 32-bit microprocessors but rather cascaded 4-bit processors, known as bit-slice processors. These processors had very simple instruction sets (much simpler than today's RISC processors) but performed some very sophisticated processing. Devices such as AMD's Am2900 family and National Semiconductor's IMP-16 and IMP-8 were typically found in aviation systems, guidance and tracking systems, and early signal processing applications. Many of these bit-slice processors have gone the way of thru-hole components and have been replaced by the more popular 8- through 32-bit processors found on the market today. However, bit-slice processors are still found in some military, aerospace, industrial, and academic designs, and they are far from dead. The marriage of programmable logic, such as PLDs and FPGAs, with multiple reduced-instruction-set ALUs has opened up a new palette for digital designers.
The programmable face-off of bit-slice technology
Given the numerous microprocessors and microcontrollers on the market today, why would one build a design using bit-slicing techniques? Anyone who has completed a number of embedded designs already knows the answer - there are numerous tasks better performed by hardware than by software. To keep production costs down, however, the usual choice is to select a high-performance processor and implement the hardware functions in software. What if, instead of opting for a high-performance processor, a designer could use a low-cost microprocessor that included programmable logic and a number of simple-instruction-set ALUs? The microprocessor would then perform simple tasks while the programmable logic and ALUs handle the more complex, higher-bit-width processes.
Let us explore a device with 24 such ALUs, which we will call "data-paths," combined with a mixture of PLDs. The data-path shown in Figure 1 contains:
- An 8-bit single-cycle ALU that can perform general-purpose functions including add, subtract, AND, OR, XOR, and PASS
- Associated compare and condition generation circuits
- Built-in Cyclic Redundancy Check (CRC) and Pseudo Random Sequence (PRS) generation
- Variable most significant bit (MSB) that can be programmatically specified for arbitrary-width digital functions
- Two 4-byte deep FIFOs, two 8-bit wide data registers and two 8-bit accumulators
- Data inputs that support different types of data: configuration, control, and serial and parallel data
- Data outputs that can carry various signals such as condition and status data
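To make the ALU and condition-generation items in the list above concrete, here is a minimal behavioral sketch in Python. It is not the device's actual implementation: the function name `alu8`, the operation strings, and the `carry`/`zero` flag names are all assumptions chosen for illustration, and only the six operations named in the list are modeled.

```python
MASK8 = 0xFF  # one data-path slice is 8 bits wide


def alu8(op, a, b):
    """Behavioral model of one 8-bit ALU operation (hypothetical helper).

    Returns (result, flags), where flags holds two illustrative condition
    outputs: carry out of bit 7, and result equal to zero.
    """
    a &= MASK8
    b &= MASK8
    if op == "ADD":
        total = a + b
    elif op == "SUB":
        # Two's-complement subtract: a + ~b + 1. Carry set means no borrow.
        total = a + ((~b) & MASK8) + 1
    elif op == "AND":
        total = a & b
    elif op == "OR":
        total = a | b
    elif op == "XOR":
        total = a ^ b
    elif op == "PASS":
        total = a
    else:
        raise ValueError(f"unknown ALU operation: {op}")
    result = total & MASK8
    flags = {"carry": total > MASK8, "zero": result == 0}
    return result, flags


if __name__ == "__main__":
    print(alu8("ADD", 0xF0, 0x20))
    print(alu8("SUB", 5, 5))
```

The condition flags are what a surrounding state machine would typically branch on; in this sketch they are simply returned alongside the result.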
Each of these 8-bit data-paths can be coupled to its 8-bit neighbor, which in turn can be coupled to its neighbor, and so on. An architecture of this nature effectively yields an 8- to n-bit processor in multiples of 8 bits. Note that the FIFOs, data registers, accumulators, and ALUs in the data-path can all be configured as n-bit in this manner. In addition, multi-byte data-path modules automatically chain the 8-bit data-paths together, along with the control signals and status outputs for each of the data-paths in the module.
For instance, if 8 bits is not enough for a particular application, the data-path can be coupled to a neighboring data-path to form a 16-bit or wider processor. An additional benefit of this architecture is that each instruction requires only one clock cycle. As a consequence, designs run at hardware speed instead of processor state speed. In applications that are oversampled, or that do not need the highest clock rates, the single ALU block in the data-path can be efficiently shared between two sets of registers and condition generators. ALU and shift outputs are registered and can be used as inputs in subsequent cycles. Usage examples include support for 16-bit functions in one (8-bit) data-path or interleaving a CRC generation operation with a data shift operation.
An enhancement made to the standard bit-slice architecture is the inclusion of programmable logic. This allows developers to add a standard state machine described in Verilog. In addition, arithmetic functions that normally consume a large number of logic gates are no longer a concern because these functions can be implemented in the standard ALU and controlled by the state machine. Also note that the main processor and ALUs can run on separate clocks. For instance, the core processor can be clocked at 24 MHz while the ALUs can be clocked at 48 MHz or higher.
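The coupling of neighboring 8-bit data-paths into a wider processor can be sketched in a few lines of Python. This is only a behavioral model under stated assumptions: the function name `add_chained` and its `slices` parameter are hypothetical, and the loop merely stands in for the carry chain between slices. In hardware all slices operate in the same clock cycle.

```python
def add_chained(a, b, slices):
    """Add two (8 * slices)-bit values using chained 8-bit adders.

    Hypothetical model: each loop iteration represents one 8-bit data-path;
    the carry out of each slice feeds the carry in of its more significant
    neighbor. Returns (result, carry_out_of_top_slice).
    """
    carry = 0
    result = 0
    for i in range(slices):
        a_byte = (a >> (8 * i)) & 0xFF      # operand slice for data-path i
        b_byte = (b >> (8 * i)) & 0xFF
        total = a_byte + b_byte + carry     # 8-bit add with chained carry-in
        result |= (total & 0xFF) << (8 * i)
        carry = total >> 8                  # carry-out passed to next slice
    return result, carry


if __name__ == "__main__":
    # Three chained slices behave like a single 24-bit adder.
    value, carry = add_chained(0x00FFFF, 0x000001, slices=3)
    print(hex(value), carry)
```

Changing `slices` from 2 to 3 turns the same code from a 16-bit adder into a 24-bit one, which mirrors the "multiples of 8 bits" scaling described above.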
Figure 2 shows three 8-bit ALUs or data-paths chained together to form a 24-bit processor.
A 16-bit example
In this example, we are going to create a 16-bit pattern generator with the 16-bit pattern shifted out continuously using the PSoC 3 Programmable System-on-chip and PSoC Creator development environment from Cypress Semiconductor. In this project, we are only using the digital portion inside the chip without involving the main CPU.
One data-path is set up for the least significant 8 bits and another for the most significant 8 bits. Figure 3 shows the data-path configuration for the least significant 8 bits of the 16-bit pattern generator, and Figure 4 shows the data-path configuration for the most significant 8 bits.
In both Figure 3 and Figure 4, the ALU instructions are identical. A reset or clear of A0 (accumulator 0) is performed when the state machine points to Dynamic Configuration register 0. The value in A0 is shifted right one bit when the state machine points to Dynamic Configuration register 1, and the value in A1 (accumulator 1) is incremented when it points to Dynamic Configuration register 3. The bit shifted out of the high-order ALU is shifted into the low-order ALU - shifting into and out of an ALU is accomplished by setting CHAIN in the SIA field of Static Configuration register 6 in the low-order ALU (Figure 3) and setting CHAIN in the CIA field of Static Configuration register 6 in the high-order ALU (Figure 4).
Since both the high- and low-order 8-bit data-paths are clocked by a common clock, they act as a single 16-bit processor and are completely independent of the central processor - no firmware, processor intervention, or stolen processor cycles are needed to run the pattern generator. This simple project demonstrates how to connect multiple data-path ALUs. Rather than requiring a high-performance microcontroller to run tasks in what appears to be real time, developers can use a simple microcontroller to manage the application and leave the real-time background tasks to multiple ALUs combined with programmable logic.
System-on-chip (SOC) technology revives bit-slicing in a programmable fashion to offload the main CPU by intelligently assigning processing tasks to other on-chip programmable hardware. With a bit-slicing architecture, developers can build not only a standard state machine but also the arithmetic functions that normally consume a large number of logic gates. Neither is a cause for concern, because the arithmetic is implemented in the standard ALU contained in the data-path logic and controlled by the PLD-based state machine, allowing the modern embedded system engineer to focus on overall system power consumption and efficiency.
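The chained right shift described above can be modeled in software to see what the two data-paths do on each clock. The Python sketch below is an illustration, not the PSoC Creator project itself: the function names are hypothetical, and feeding the output bit back into the top of the high-order slice (so that the pattern repeats) is an assumption, since the reload mechanism of the actual design is not spelled out here.

```python
def shift_right_chained(high, low, shift_in=0):
    """One clock of a 16-bit right shift built from two 8-bit slices.

    The bit leaving the high-order slice enters the MSB of the low-order
    slice (the chained shift). Returns (new_high, new_low, bit_out).
    """
    bit_out = low & 1                    # bit leaving the 16-bit value
    chain = high & 1                     # bit passed from high to low slice
    new_low = (low >> 1) | (chain << 7)
    new_high = (high >> 1) | ((shift_in & 1) << 7)
    return new_high, new_low, bit_out


def pattern_bits(pattern, clocks):
    """Return the serial output of the pattern generator over `clocks` cycles."""
    high, low = (pattern >> 8) & 0xFF, pattern & 0xFF
    out = []
    for _ in range(clocks):
        # Assumption: recirculate the output bit so the pattern repeats.
        high, low, bit = shift_right_chained(high, low, shift_in=low & 1)
        out.append(bit)
    return out


if __name__ == "__main__":
    print(pattern_bits(0xA5C3, 32))
```

Because both slices update from the same pre-clock values, the model reflects the point that the two data-paths share a common clock and behave as one 16-bit shifter.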