Power Play For The SoC Developers

Chris Rowen explains how computational performance can be boosted by flexible length instruction extensions.

The Xtensa LX processor uses Tensilica's innovative FLIX (Flexible Length Instruction eXtensions) architecture – a highly efficient implementation of the Xtensa instruction set architecture (ISA) that gives designers more options for cost/performance tradeoffs. FLIX technology provides the flexibility to freely and modelessly intermix single-operation RISC instructions, simple- and compound-operation TIE instructions, and multiple-operation FLIX instructions. By packing multiple operations into a wide 32- or 64-bit instruction word, FLIX technology allows designers to accelerate a broader class of 'hot spots' in embedded applications, while eliminating the performance and code-size drawbacks of VLIW processor architectures.

Instruction-set performance relates to the number of useful operations that can be executed per unit of time or per clock. High performance does not guarantee good flexibility, however. Instruction-set flexibility relates to the diversity of applications whose computations can be efficiently encoded in the instruction stream. A longer instruction word generally allows a greater number and diversity of operations and operand specifiers to be encoded in each word.

RISC architectures generally encode one primitive operation per instruction. Long-instruction-word architectures encode a number of independent sub-instructions per instruction, with operation and operand specifiers for each sub-instruction. The sub-instructions may be primitive generic operations similar to RISC instructions or they may each be more sophisticated, application-specific operations, such as those described earlier as processor extensions. Making the instruction word longer, for any given number of operands and operations, makes instruction encoding simpler and more orthogonal.

It is worth noting that long-instruction-word processors are not always faster than RISC processors. Sometimes the benefit of RISC execution-unit simplicity boosts maximum clock frequency, and the execution of several distinct RISC instructions per cycle can compensate for the relative austerity of RISC instruction sets. Nevertheless, when RISC architectures are applied to the most demanding data-intensive tasks, they are typically given super-scalar implementations that attempt to execute multiple instructions per cycle, mimicking the greater intrinsic operational parallelism of long instruction words.

Shown in Figure 1 is an example of a basic long-instruction operation encoding. The figure lays out a 64-bit instruction word with three independent sub-instruction slots, each of which specifies an operation and operands. The first sub-instruction (sub-instruction 0) has an opcode and four operand specifiers – two source registers, an immediate field, and one destination register. The second and third sub-instructions (sub-instructions 1 and 2) have an opcode and three operand specifiers – two source registers and one source/destination register. The 2-bit format field on the left designates this particular grouping of sub-instructions. It may also designate the overall length of the instruction if the processor supports variable-length encoding.
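As a rough model of the Figure 1 layout, the word can be treated as a set of bit fields: a 2-bit format field plus three sub-instruction slots. The field names and widths below are invented for illustration and are not Tensilica's actual encoding:

```python
# Hypothetical bit-field model of a 64-bit, three-slot instruction word.
# Field widths are illustrative only, not Tensilica's actual encoding.
FIELDS = [                      # (name, width in bits), from bit 0 upward
    ("format", 2),              # designates this grouping of sub-instructions
    ("op0", 6), ("src0a", 4), ("src0b", 4), ("imm0", 8), ("dst0", 4),  # slot 0
    ("op1", 6), ("src1a", 4), ("src1b", 4), ("srcdst1", 4),            # slot 1
    ("op2", 6), ("src2a", 4), ("src2b", 4), ("srcdst2", 4),            # slot 2
]
assert sum(w for _, w in FIELDS) == 64

def pack(values):
    """Pack a dict of field values into one 64-bit instruction word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert 0 <= v < (1 << width), f"{name} out of range"
        word |= v << shift
        shift += width
    return word

def unpack(word):
    """Recover all field values from a packed instruction word."""
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields
```

A real decoder would use the format field to select among several such layouts, including shorter ones when variable-length encoding is supported.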

Clearly there is a hardware cost associated with long instruction words. Instruction memory is wider, decode logic is bigger, and a larger number of execution units and register files (or register file ports) must be implemented to deliver instruction parallelism. Larger numbers of bigger logic blocks are incrementally harder to optimise, so maximum clock frequency can drop compared to simpler, narrower instruction encodings such as RISC. Nevertheless, the performance and flexibility benefits can be substantial, particularly for data-intensive applications with high inherent parallelism.

In some long-instruction-word architectures, each sub-instruction has almost completely independent resources: dedicated execution units, dedicated register files, and dedicated data memories. In other architectures, the sub-instructions share common register files and data memories and require a number of ports into common storage structures to allow effective and efficient data sharing.

Long-instruction-word architectures also vary widely on the question: How 'long' is a long instruction? For high-end computer-system processors, such as Intel's Itanium family and for high-end embedded processors such as Texas Instruments' TMS320C6400 DSP family, the instruction word is very 'long' indeed – hundreds of bits. For more cost- and power-sensitive embedded applications, 'long' may be just 64 bits. The essential processor architecture principles are largely the same, however, once multiple independent sub-instructions are packed into each instruction word.

CODE SIZE AND LONG INSTRUCTIONS
One common liability of long-instruction-word architectures is large code size compared to architectures that encode one independent operation per instruction. This problem affects VLIW architectures generally, but it is especially important for SOC designs, where instruction memories may consume a significant fraction of total silicon area. Compared to code compiled for code-efficient architectures, VLIW code can often require two to five times more code storage. Figure 2 compares the total code size of a VLIW DSP (TI TMS320C6203) with that of Tensilica's Xtensa processor on the EEMBC Telecom suite, both for straight compilation from unmodified C and for optimised C code. No assembly code was used.

Similarly, a comparison in Figure 3 shows the total code size of a VLIW media processor (Philips Trimedia TM1300) with Tensilica's Xtensa processor for the EEMBC Consumer suite, with both straight compilation from unmodified C and with full optimisation of the C. No handwritten assembly code was created for the optimised Tensilica processor.

Code bloat stems, in part, from instruction-length inflexibility. If, for example, the compiler can find only one operation whose source operands and execution units are ready, it may be forced to encode several sub-instruction fields as NOPs (no operation). Instruction storage is already a major portion of embedded SOC silicon area, so code expansion translates into higher cost, poorer instruction-cache performance, or both.

A second source of VLIW code bloat is the loose encoding of frequent operations commonly found in VLIW processors. The TI TMS320C6203 DSP, for example, requires 32bits of instruction to specify a 16bit multiplication and 32bits to specify a 16bit add, so the common multiply/accumulate (MAC) combination takes at least 64bits. If a loop containing many MACs is unrolled four times (to amortise the cost of branch and address calculations), the resulting eight MAC operations require 512bits of instruction storage, not counting the additional bits for any loads, stores, branches or address-calculation instructions.

However, long instructions do not necessarily lead to VLIW code bloat. A long-instruction-word implementation of Tensilica's Vectra LX DSP architecture needs about 20bits within the instruction stream to specify eight 16bit MACs executing in SIMD fashion, not counting the additional bits for any loads, stores, branches, or address-calculation instructions.
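The contrast between these two encodings is easy to quantify using the figures quoted above:

```python
# Instruction-storage cost of eight 16-bit MACs, using the figures above.
vliw_bits_per_mac = 32 + 32         # 32-bit multiply + 32-bit add encodings
vliw_total = 8 * vliw_bits_per_mac  # the unrolled loop's eight MACs
simd_total = 20                     # one ~20-bit SIMD sub-instruction does all eight

print(vliw_total)                   # 512 bits, as stated above
print(vliw_total / simd_total)      # the SIMD encoding is over 25x denser here
```

Neither figure counts the loads, stores, branches, or address calculations, so this is a comparison of the MAC encodings alone.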

One attractive solution for long-instruction-word code bloat is to use a more flexible range of instruction lengths. If the processor allows multiple instruction lengths, including short instructions that encode a single operation, the compiler can achieve significantly better code size and instruction storage efficiency, compared to traditional VLIW processor designs with fixed-length instruction words. Reducing code size for long-instruction-word processors also tends to decrease bus-bandwidth requirements and reduces the power dissipation associated with instruction fetches. Tensilica's Xtensa LX processor, for example, incorporates flexible-length instruction extensions (FLIX). This architectural approach addresses the code size challenge by offering 16bit, 24bit, and a choice of either 32 or 64bit instruction lengths. Designer-defined instructions can use the 24, 32, and 64bit instruction formats.
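A back-of-the-envelope sketch shows why flexible lengths help. The operation mix below is invented for illustration; the 24-bit and 64-bit lengths follow the FLIX options just described, with lone operations taking a short encoding and only genuine multi-operation bundles paying for the long format:

```python
# Sketch: code-size effect of flexible instruction lengths.
# The 70/30 operation mix is hypothetical.
program = ["single"] * 70 + ["bundle3"] * 30  # lone ops vs 3-op bundles

def size_fixed_vliw(ops, word=64):
    # Fixed-length VLIW: every instruction occupies a full word,
    # single operations included (the other slots hold NOPs).
    return len(ops) * word

def size_flexible(ops):
    # Flexible lengths: a lone operation takes a 24-bit encoding;
    # only genuine 3-op bundles need the 64-bit format.
    return sum(24 if op == "single" else 64 for op in ops)

print(size_fixed_vliw(program))  # 6400 bits
print(size_flexible(program))    # 3600 bits, roughly 1.8x smaller
```

The fewer independent operations the compiler finds per cycle, the larger the saving, because fixed-length VLIW pays the full word for every NOP-padded instruction.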

Long instructions allow considerable encoding freedom: a large number of sub-instruction or operation slots can be defined (three to six independent slots are typical), depending on the operational richness required in each slot. The operation slots need not be equally sized. Big slots (20–30 bits) accommodate a wide variety of opcodes, relatively deep register files (16–32 entries), and three or four register-operand specifiers. Developers should consider creating processors with big operation slots for applications with modest degrees of parallelism, but a strong need for flexibility and generality within the application domain.

Small slots (8–16 bits) lend themselves to direct specification of movement among small register sets and allow a large number of independent slots to be packed into a long instruction word. Each of the larger number of slots offers a more limited range of operations, fewer specifiers and shallower register files. Developers should consider creating processors with many small slots for applications with a high degree of parallelism among many specialised function units.

LONG INSTRUCTION WORDS AND AUTOMATIC PROCESSOR GENERATION
Long-instruction-word architectures fit very well with automatic generation of processor hardware and software. High-level instruction descriptions can specify the set of sub-instructions that fit into each slot. From these descriptions, the processor generator determines the encoding requirements for each field in each slot, assigns opcodes, and creates instruction-decoding hardware for all necessary instruction formats. The processor generator can also create the corresponding compiler and assembler for the long-word processor. For long-instruction-word architectures, packing of sub-instructions into long instructions is a very complex task. The assembler can handle this packing, so assembly source code programs written by programmers need only specify the operations or sub-instructions, giving less attention to packing constraints. The compiler generates code with instruction-slot availability in mind to maximise performance and minimise code size, so it generally does its own packing of operations into long instructions.
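A toy version of the packing problem the assembler solves might look like this. The slot-legality table is invented for illustration, loosely mirroring the base/load-store/ALU split used in the examples that follow; real tools also schedule and reorder operations, which this in-order sketch does not:

```python
# Toy greedy packer: assigns a sequence of operations to the slots of
# successive long instructions. The slot-legality table is invented;
# unused slots are filled with the per-slot NOPs.
LEGAL = {
    "branch": {0},        # only slot 0 takes control flow
    "load":   {0, 1},
    "store":  {0, 1},
    "alu":    {0, 1, 2},
}

def pack(ops, n_slots=3):
    """Greedily bundle operations in order, without reordering them."""
    bundles, current = [], {}
    for op in ops:
        free = [s for s in sorted(LEGAL[op]) if s not in current]
        if not free:      # no legal free slot: emit the bundle, start anew
            bundles.append(current)
            current = {}
            free = sorted(LEGAL[op])
        current[free[0]] = op
    if current:
        bundles.append(current)
    return [[b.get(s, "nop") for s in range(n_slots)] for b in bundles]

print(pack(["load", "alu", "alu", "store", "branch", "alu"]))
# three bundles; the middle one holds only a store plus two NOPs
```

A real compiler would also reorder independent operations to reduce the NOP count; the sketch shows only the slot-legality constraint that makes hand-packing tedious.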

Below is a short but complete example of a very simple long-instruction word processor described in TIE with FLIX technology. It relies entirely on built-in definitions of 32-bit integer operations, and defines no new operations. It creates a processor with a high degree of potential parallelism even for applications written purely in terms of standard C integer operations and data-types. The first of three slots supports all the commonly used integer operations, including ALU operations, loads, stores, jumps and branches. The second slot offers loads and stores, plus the most common ALU operations. The third slot offers a full complement of ALU operations, but no loads and stores.

(1) length ml64 64 {InstBuf[3:0] == 15}
(2) format format1 ml64 {base_slot, ldst_slot, alu_slot}
(3) slot_opcodes base_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4, ADDI.N, AND, OR, XOR, BEQZ.N, BNEZ.N, BGEZ, BLTZ, BEQI, BNEI, BGEI, BLTI, BEQ, BNE, BGE, BLT, BGEU, BLTU, L32I.N, L32R, L16UI, L16SI, L8UI, S32I.N, S16I, S8I, SLLI, SRLI, SRAI, J, JX, MOVI.N}
(4) slot_opcodes ldst_slot {ADD.N, SUB, ADDI.N, L32I.N, L32R, L16UI, L16SI, L8UI, S32I.N, S16I, S8I, MOVI.N}
(5) slot_opcodes alu_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4, ADDI.N, AND, OR, XOR, SLLI, SRLI, SRAI, MOVI.N}

The first line of the example declares a new instruction length (64bits) and specifies the encoding of the first 4bits of the instruction, which determine the length. The second line declares a format, format1, for that instruction length and names the three slots within the new format: base_slot, ldst_slot, and alu_slot. The third line lists all the TIE instructions that can be packed into the first of those slots, base_slot. In this case, all the instructions happen to be pre-defined Xtensa LX instructions, but new instructions could also be included in this slot. The processor generator also creates a NOP (no operation) for each slot, so the software tools can always create a complete instruction, even when no other operations for that slot are available for packing into a long instruction. Lines 4 and 5 designate the subsets of instructions that can go into the other two slots.

A definition is shown below of a long-instruction-word architecture with a mix of built-in 32bit operations and new 128bit operations. It defines one 64bit instruction format with three sub-instruction slots (base_slot, ldst_slot, and alu_slot). The description takes advantage of the Xtensa processor's predefined RISC instructions, but also defines a large new register file and three new ALU operations on the new register file:

(1) length ml64 64 {InstBuf[3:0] == 15}
(2) format format1 ml64 {base_slot, ldst_slot, alu_slot}
(3) slot_opcodes base_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4, ADDI.N, AND, OR, XOR, BEQZ.N, BNEZ.N, BGEZ, BLTZ, BEQI, BNEI, BGEI, BLTI, BEQ, BNE, BGE, BLT, BGEU, BLTU, L32I.N, L32R, L16UI, L16SI, L8UI, S32I.N, S16I, S8I, SLLI, SRLI, SRAI, J, JX, MOVI.N}
(4) regfile x 128 32 x
(5) slot_opcodes ldst_slot {loadx, storex} /* slot does 128b load/store */
(6) immediate_range sim8 -128 127 1 /* 8-bit signed offset field */
(7) operation loadx {in x *a, in sim8 off, out x d} {out VAddr, in MemDataIn128} {
(8)   assign VAddr = a + off; assign d = MemDataIn128; }
(9) operation storex {in x *a, in sim8 off, in x s} {out VAddr, out MemDataOut128} {
(10)   assign VAddr = a + off; assign MemDataOut128 = s; }
(11) slot_opcodes alu_slot {addx, andx, orx} /* three new ALU operations on x regs */
(12) operation addx {in x a, in x b, out x c} {} {assign c = a + b;}
(13) operation andx {in x a, in x b, out x c} {} {assign c = a & b;}
(14) operation orx {in x a, in x b, out x c} {} {assign c = a | b;}

The first three lines are identical to the previous example. The fourth line declares a new register file 128bits wide and 32 entries deep. The fifth line lists the two load and store instructions for the new wide register file, which can be found in the second slot of the long instruction word. The sixth line defines a new immediate range, an 8bit signed value, to be used as the offset range for the new 128bit load and store instructions. Lines 7–10 fully define the new load and store instructions, in terms of basic interface signals VAddr (the address used to access local data memory), MemDataIn128 (the data being returned from local data memory), and MemDataOut128 (the data to be sent to the local data memory). The use of 128-bit memory data signals also guarantees that the local data memory will be at least 128 bits wide. Line 11 lists the three new ALU operations that can be put in the third slot of the long instruction word. Lines 12–14 fully define those operations on the 128-bit wide register file: add, bit-wise AND, and bit-wise OR.

With this example, any combination of the 39 instructions (including NOP) in the first slot, three instructions in the second slot (loadx, storex, and NOP), and four instructions in the third slot can be combined to form legal instructions – a total of 468 combinations. This simplified example specifies almost enough instructions to densely populate a long instruction word. The first slot needs about 21bits, the second slot only needs about 19bits, the third slot needs about 17bits, and the format/length field requires four bits – a total of roughly 61bits. This example shows the potential to independently specify operations to enable instruction-level parallelism. Moreover, all of the techniques for improving the performance of individual instructions – especially fusion and SIMD – are readily applied to the operations encoded in each sub-instruction.
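The combination count and the bit budget can be checked with simple arithmetic:

```python
# Slot populations from this example (each count includes the generated NOP).
slot0 = 38 + 1   # base_slot: 38 listed instructions + NOP
slot1 = 2 + 1    # ldst_slot: loadx, storex + NOP
slot2 = 3 + 1    # alu_slot: addx, andx, orx + NOP
combinations = slot0 * slot1 * slot2
print(combinations)        # 468 legal instruction combinations

# Approximate field widths quoted in the text.
bits = 21 + 19 + 17 + 4    # three slots plus the 4-bit format/length field
print(bits, "of 64 bits")  # the 64-bit word is nearly fully populated
```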

The compound operation technique can be applied within sub-instructions, but long instruction words also encourage the encoding of independent operations in different slots:

(1) length ml32 32 {InstBuf[3:0] == 15}
(2) format pair ml32 {shift, logic}
(3) regfile X 128 4 x
(4) slot_opcodes shift {xr_srl, xr_sll}
(5) operation xr_sll {in AR a, inout AR b} {} {assign b = b << {a[3:0], 3'h0};}
(6) operation xr_srl {in AR a, inout AR b} {} {assign b = b >> {a[3:0], 3'h0};}
(7) slot_opcodes logic {xr_or, xr_and}
(8) operation xr_and {in X c, inout X d} {} {assign d = d & c;}
(9) operation xr_or {in X c, inout X d} {} {assign d = d | c;}

The first two lines define a 32bit wide instruction, a new format, and the two slots within that format. The next line declares a new wide register file. Lines 4–6 define the instructions (byte shifts) that can occupy the first slot. Lines 7–9 define the instructions (bit-wise AND and bit-wise OR) that can occupy the second slot. Altogether this TIE example defines four instructions, representing the four combinations. If these were the only instructions, the processor generator would discover that this format requires only 16 bits to encode: 10 bits for the 'shift' slot (two 4-bit specifiers for the two AR register entries, plus 2 bits to select among shift left, shift right, and the slot's NOP) and 6 bits for the 'logic' slot (two 2-bit specifiers for the two X register entries, plus 2 bits to select among AND, OR, and NOP).
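Reading each slot's opcode field as needing to encode the slot's generated NOP as well as its listed operations (the processor generator creates one NOP per slot, as described earlier), the widths can be reproduced as follows:

```python
from math import ceil, log2

def slot_bits(n_ops, specifier_bits):
    # The opcode field must distinguish the listed operations plus the
    # slot's generated NOP; the specifiers are the register fields.
    return ceil(log2(n_ops + 1)) + specifier_bits

shift = slot_bits(2, 4 + 4)  # xr_sll, xr_srl; two 4-bit AR specifiers
logic = slot_bits(2, 2 + 2)  # xr_and, xr_or; two 2-bit X specifiers
print(shift, logic, shift + logic)  # 10 and 6 bits: 16 bits in total
```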

This article is based on the book Engineering the Complex SOC by Chris Rowen, President and CEO of Tensilica.
