Electronic Design
Making a New Type of Parallel Processing Possible

When looking for the best, it is often useful to examine the new or unusual to see if it will change the way designers create new applications.

Pattern recognition is a common requirement for many applications, from deep packet inspection for network switching to searching for text patterns within a document. Sometimes a single pattern is used, but often multiple patterns are employed.

A common text-processing pattern-matching system is based on regular expressions, also known as regex or regexp. The concept was formalized by Stephen Kleene in the 1950s. Regular expressions are a common tool in most scripting languages, like PHP and JavaScript, as well as in tools such as Linux/Unix grep.

Conventional processors perform pattern matching sequentially. Content addressable memories can perform limited matching in parallel. FPGAs and custom logic can be utilized to take advantage of parallel processing.
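To make the sequential approach concrete, here is a minimal Python sketch of multi-pattern matching on a conventional processor using the standard `re` module. The pattern names and expressions are hypothetical and chosen only for illustration. The point is that each pattern costs its own pass over the input.

```python
import re

# Hypothetical patterns, used only for illustration.
PATTERNS = {
    "ip_like": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    "get_request": re.compile(r"GET\s+\S+"),
}

def scan(text):
    """Return {pattern name: [matched substrings]} for every pattern that hits."""
    hits = {}
    for name, rx in PATTERNS.items():  # one sequential pass per pattern
        found = rx.findall(text)
        if found:
            hits[name] = found
    return hits

line = "GET /index.html from 10.0.0.7"
print(scan(line))
```

The work grows with the number of patterns, which is the cost a parallel matcher tries to avoid.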

1. Micron’s Automata Processor compiler maps a regular expression to state transition elements (STEs) within the array. The STEs then track incoming symbols for the desired pattern.

Micron’s Automata Processor (AP) tackles regular expressions by using a state transition element (STE) for each item within the expression (Fig. 1). The STE is actually more powerful than the functionality required to handle regular expressions. The figure shows a regular expression and how it maps to about two dozen STEs. The AP has other components like counters and terminal events in addition to the interconnect matrix.

All of the STEs in the group start out disabled except the first, which tries to match the letter P. When a P is detected, it enables the next STE, which matches the letter I. Essentially, a host presents the AP with a stream of bytes. Each STE sees each byte in the stream, but an STE only reacts when enabled.
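The enable-chain behavior can be sketched in a few lines of Python. This is a simplified software model, not Micron's implementation: it assumes a purely linear chain, a first STE that is always listening, and an STE that enables its successor for the next symbol only. The word "PIE" is a hypothetical example; the full expression from the figure is not reproduced here.

```python
def run_chain(pattern, stream):
    """Simulate a linear chain of STEs, one per symbol in `pattern`.

    Returns the stream offsets at which the last STE fires.
    """
    n = len(pattern)
    enabled = [False] * n
    reports = []
    for pos, symbol in enumerate(stream):
        enabled[0] = True  # assumption: the first STE always listens
        # Every STE sees the symbol, but only enabled STEs can fire.
        fired = [enabled[i] and symbol == pattern[i] for i in range(n)]
        if fired[-1]:
            reports.append(pos)
        # A firing STE enables its successor for the next symbol only.
        enabled = [False] + fired[:-1]
    return reports

print(run_chain("PIE", "xPIPIEPIE"))
```

All STEs are evaluated for every input symbol, so the time per symbol does not depend on the pattern length in hardware; the Python loop only imitates that step by step.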

Automata Processor Architecture

The first AP incarnation has 49,152 STEs and 6144 terminal events. An expression is mapped to a set of STEs and at least one terminal event that allows the host to recognize the event and the chain that triggered it. An AP will host multiple expressions that are evaluated in parallel. Each STE has 256 bits of information associated with it (Fig. 2). Micron used its DDR3 DRAM technology to implement the memory array. An STE that would match a single character will have one bit out of 256 set to true (1). An STE that would match a numeric digit would have 10 bits set.

2. Each STE has 256 bits of information associated with it. The input symbol selects one bit for each STE that is then ANDed with the STE enable bit to generate an output.
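The 256-bit lookup is easy to model in Python with an integer used as a bit mask. The function names below are hypothetical; the sketch only illustrates the select-then-AND step described in the caption.

```python
def symbol_mask(symbols):
    """Build a 256-bit mask with one bit set per accepted byte value."""
    mask = 0
    for byte in symbols:  # iterating over bytes yields integers 0..255
        mask |= 1 << byte
    return mask

def ste_output(mask, enabled, symbol):
    """Select the bit for `symbol`, then AND it with the enable bit."""
    return bool((mask >> symbol) & 1) and enabled

letter_p = symbol_mask(b"P")
digit = symbol_mask(b"0123456789")

print(bin(letter_p).count("1"), bin(digit).count("1"))
print(ste_output(digit, True, ord("7")), ste_output(digit, False, ord("7")))
```

A single-character STE sets one bit and a digit STE sets ten, matching the counts given in the text. Any character class costs the same 256 bits.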

This architecture would suffice if only a single stream were processed at a time. The catch is that many applications, such as a network switch, need to process many streams. The AP handles up to 512 streams by replicating the enable bit storage for each STE. Again, DRAM technology is easily applied to this task.
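A rough software analogy for the replicated enable bits: the match pattern is stored once, while each logical stream keeps its own enable vector, so symbols from different streams can be interleaved without disturbing one another. The class below is a hypothetical model built on a linear chain of STEs, with "PIE" again used as an illustrative pattern.

```python
class MultiStreamChain:
    """One STE chain shared by many logical streams (hypothetical model)."""

    def __init__(self, pattern, num_streams=512):
        self.pattern = pattern
        # The match logic is shared; only the enable bits are replicated,
        # one vector per logical stream.
        self.enabled = [[False] * len(pattern) for _ in range(num_streams)]

    def feed(self, stream_id, symbol):
        """Consume one symbol for one stream; return True if the last STE fires."""
        enabled = self.enabled[stream_id]
        enabled[0] = True  # assumption: the first STE always listens
        fired = [e and symbol == p for e, p in zip(enabled, self.pattern)]
        self.enabled[stream_id] = [False] + fired[:-1]
        return fired[-1]

chain = MultiStreamChain("PIE")
# Symbols from stream 0 ("PIE") and stream 1 ("PXE") arrive interleaved.
interleaved = [(0, "P"), (1, "P"), (0, "I"), (1, "X"), (0, "E"), (1, "E")]
results = [chain.feed(sid, sym) for sid, sym in interleaved]
print(results)
```

Only stream 0 reports a match, because stream 1's saved state was reset when it saw the X.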

Essentially, we have an 8-bit symbol that accesses a 256-bit array per STE. The output is ANDed with the output of a 512-bit array indexed by the 9-bit address of the logical data stream. This translates to 768 bits per STE, or 36 Mbits per AP.
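The storage figures follow directly from the numbers already given. A quick check in Python, assuming "Mbits" here means 2^20 bits:

```python
STES_PER_AP = 49_152
SYMBOL_BITS = 8          # one input byte
STREAM_ADDRESS_BITS = 9  # selects one of 512 logical streams

match_bits = 2 ** SYMBOL_BITS            # 256-bit symbol array per STE
enable_bits = 2 ** STREAM_ADDRESS_BITS   # 512 enable bits per STE
bits_per_ste = match_bits + enable_bits
total_bits = bits_per_ste * STES_PER_AP
total_mbits = total_bits / 2 ** 20       # assumption: 1 Mbit = 2**20 bits

print(bits_per_ste, total_bits, total_mbits)
```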

Additional APs running the same “program” can handle additional streams. Additional APs can also be used to handle more expressions. Both increases can occur without increasing latency since everything is done in parallel on all chips at the same time.

The interconnect between STEs (Fig. 3) is done at the block level. An STE within a block can enable any STE within the block. STEs may also connect to adjacent blocks. An STE output can enable one or more other STEs.

3. STEs are grouped in blocks. An STE can enable another STE within the block. A fixed number of block-to-block links to adjacent blocks are included in the fabric.

Compiling an Application

Regular expressions are one way to program an AP. There are lots of applications that already utilize regular expressions like the open-source Snort project, a network intrusion detection system (NIDS). The approach works well for text-based processing, but it does not necessarily take advantage of all of an AP’s capability.

ANML (Automata Network Markup Language) is the XML-based programming language for an AP. It provides full access to an STE’s abilities. Linkages between STEs can be arbitrary, whereas regular expressions impose a stricter configuration. The ANML compiler handles place and route like an FPGA compiler, so developers do not have to be concerned with low-level details. It can also target multi-AP configurations, breaking up an application as necessary.

ANML can be used to create applications that do fuzzy comparisons. Keep in mind that the symbols involved do not have to represent characters. For example, ANML could be used to implement a rule-based system for controlling a robot with inputs being state from sensors. Researchers at Georgia Tech are using Micron's AP to match genome protein sequences.

Micron also provides a graphical integrated development environment (IDE) that can display an ANML application. The simulator provides a way to show an application in action and to assist in debugging. Breakpoints can be set on STE transitions. Developers can then examine the system’s state and single-step if desired.

4. This PCI Express board facilitates AP application development. It can hold up to 48 APs. An FPGA provides a flexible interconnect linkage.

An AP can be placed in an embedded design but most developers will likely start with Micron’s PCI Express board (Fig. 4). The board has an FPGA and on-board DDR3 DRAM to connect up to 48 APs together and to a host via PCI Express. The APs are contained on memory-style DIMMs, but don’t plan on using them in existing memory sockets just yet.

An AP uses about 4 W. The 48-chip PCI Express board has bested a 48-CPU BECAT Hornet Cluster. One test that took 14 minutes on the AP board took over 45 hours on the cluster. Of course, your mileage will vary, but this highlights the performance difference. Also, the board uses about 300 W while the cluster requires 2000 W. On the other hand, the cluster is more flexible and can handle applications the AP cannot.
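A back-of-the-envelope calculation puts those figures in perspective. This sketch treats "over 45 hours" as exactly 45 hours, so the results are lower bounds, and it assumes each system draws its quoted power for the entire run. Neither assumption comes from the test itself.

```python
AP_BOARD_MINUTES = 14
CLUSTER_HOURS = 45       # "over 45 hours", so treat as a lower bound
AP_BOARD_WATTS = 300
CLUSTER_WATTS = 2000

# Speedup: ratio of run times, both expressed in minutes.
speedup = CLUSTER_HOURS * 60 / AP_BOARD_MINUTES

# Energy in watt-hours, assuming constant power draw for the whole run.
board_wh = AP_BOARD_WATTS * AP_BOARD_MINUTES / 60
cluster_wh = CLUSTER_WATTS * CLUSTER_HOURS
energy_ratio = cluster_wh / board_wh

print(f"speedup of at least {speedup:.0f}x")
print(f"board: {board_wh:.0f} Wh, cluster: {cluster_wh:.0f} Wh")
print(f"energy ratio of at least {energy_ratio:.0f}x")
```

Under these assumptions, the board finishes roughly 190 times faster and uses about 70 Wh against the cluster's 90 kWh.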

The University of Virginia and Micron Technology have founded the Center for Automata Processing. Research areas include biomedical informatics, cyber security, image analytics, and cortical computing. Given the flexibility of the AP, it will be useful in everything from single-chip solutions to systems that employ hundreds of chips.

Micron’s Automata Processor will be subservient to a host processor but, like a GPU, it provides significant acceleration for many applications, making it very desirable for a variety of solutions. Also like a GPU, an AP will be suitable only for a subset of problems, but for the problems that fit, the AP will provide a massive performance gain.
