A new generation of solid-state drive (SSD) controllers has turned the SSD business on end by enabling enterprise-class performance, reliability, and endurance with standard NAND flash memory. In this article, we will look under the hood a bit to see what issues are being dealt with by these new controllers.
Challenges in designing a flash controller fall into five areas: performance, reliability, endurance, security, and flexibility.
Performance is an issue with any storage medium. The host interface and the NAND flash interface both require careful coordination to maximize the overall performance of the SSD. A single flash die can be programmed at up to 40-50 megatransfers per second (MT/s) in asynchronous mode, and the newer ONFI 2 and Toggle modes increase this to 133-200 MT/s, a 3-4x performance increase at the flash interface. Typical SATA SSDs today support the SATA 2 spec at 3Gb/s, which translates to a top speed of just over 250MB/s. New controllers will need to support the SATA 3 spec at 6Gb/s (roughly 500MB/s), doubling the maximum host interface bandwidth.
To achieve the required performance, some parallelism must be used so that multiple die are programmed or returning results at once. Native command queuing, garbage collection and block picking are all techniques that must be balanced to obtain this parallelism and high performance, without exceeding the power envelope given for the SSD.
Native command queuing and flash parallelism
There are typically up to 128 flash die on an SSD, and each one can take a millisecond or more to program. The goal is to be able to program as many die as possible at the same time, or nearly at the same time. Native command queuing (NCQ) is part of the SATA drive specification, enabling the host system to send as many as 32 commands to the drive at once. NCQ enables the flash controller to manage up to 32 commands/responses at a time, which reduces the amount of time the operating system must wait to issue new commands and thus improves drive performance.
NCQ automatically gangs up commands and tracks them. The controller can rearrange the queue dynamically if necessary. For example, if the controller is able to respond sooner to a command that came in later than others, it can do so out of order without causing problems for the host system.
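As a toy illustration of the out-of-order completion that NCQ makes possible, the sketch below models a 32-deep command queue in which a later-arriving command can finish first. The `CommandQueue` class and the idea of a "ready set" of targets are illustrative assumptions, not an actual controller interface.

```python
from collections import deque

QUEUE_DEPTH = 32  # NCQ allows up to 32 outstanding commands

class CommandQueue:
    """Toy model of NCQ: commands complete out of order as flash die free up."""
    def __init__(self):
        self.pending = deque()

    def submit(self, tag, lba):
        if len(self.pending) >= QUEUE_DEPTH:
            raise RuntimeError("queue full: host must wait for a completion")
        self.pending.append((tag, lba))

    def complete_ready(self, ready_lbas):
        """Complete any queued command whose target is ready, regardless of arrival order."""
        done = [cmd for cmd in self.pending if cmd[1] in ready_lbas]
        for cmd in done:
            self.pending.remove(cmd)
        return done

q = CommandQueue()
q.submit(0, 100)   # arrives first, but its die is busy
q.submit(1, 200)   # arrives later, its die is ready
print(q.complete_ready({200}))  # → [(1, 200)]: the later command finishes first
```

The point of the model is simply that the host never stalls waiting on command 0; the controller services whatever it can, in whatever order the flash allows.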
Garbage collection and block picking
Flash memory is made up of cells which store one or a few bits of data each. These cells are grouped into pages, which are the smallest discrete location to which data can be written. The pages in turn are grouped into blocks, which are the smallest discrete location that can be erased. Flash memory cannot be directly overwritten like a hard disk drive; it must first be erased. Thus, while an empty page in a block can be written directly, it cannot be overwritten without first erasing the entire block of pages.
As the drive is used, data changes, and the changed data is written to other pages in the block or to new blocks. At this point, the old (stale) pages are marked as invalid and can be reclaimed by erasing the entire block. To do this, however, all of the still-valid information on all of the other occupied pages in the block must be moved to another block.
The requirement to relocate valid data and then erase blocks before writing new data into the same block causes write amplification; the total number of writes required at the flash memory is higher than the host computer originally requested. It also causes the SSD to perform write operations at a slower rate when it is busy moving data from blocks that need to be erased while concurrently writing new data from the host computer.
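Write amplification is simply the ratio of what the flash actually wrote to what the host asked for. A one-line calculation makes the definition concrete; the workload numbers below are hypothetical.

```python
def write_amplification(host_pages_written, relocated_pages):
    """WA = (host writes + garbage-collection relocation writes) / host writes."""
    return (host_pages_written + relocated_pages) / host_pages_written

# Hypothetical workload: the host writes 1,000 pages, and garbage collection
# had to relocate 600 still-valid pages to free the blocks those writes needed.
print(write_amplification(1000, 600))  # → 1.6
```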
SSD controllers use a technique called garbage collection to free up previously written blocks. This process also consolidates pages by moving and rewriting pages from multiple blocks to fill up fewer new ones. The old blocks are then erased to provide storage space for new incoming data. However, since flash blocks can only be written so many times before failing, it is important to also wear-level the entire SSD to avoid wearing out any one block prematurely.
“Block picking” is the process of deciding which block to recycle during garbage collection (also sometimes called recycling). Blocks with the fewest valid pages take the shortest time to recycle, because those valid pages must be copied to new blocks before the erase. It seems obvious that the controller should always pick the block with the fewest valid pages. However, this does not account for the wear on the blocks. Since each block can only be erased a finite number of times, the controller must also steer recycling toward lightly worn blocks so that wear evens out across the drive. This prevents the drive from reaching its end of life prematurely simply because a few blocks are exhausted. The key to block picking success is in the algorithms used to balance recycling blocks with minimal prior program/erase counts against recycling blocks that are cheapest to reclaim (those containing the least amount of valid data).
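One way to express that balance is a weighted score over each candidate block's relocation cost and its accumulated wear. The scoring function and weight below are a hypothetical sketch, not any vendor's actual algorithm.

```python
def pick_block(blocks, wear_weight=0.5):
    """Pick the index of the block to recycle.

    Each block is a (valid_pages, erase_count) tuple. Lower score wins:
    few valid pages means a cheap recycle, and a low erase count means the
    block still has life left and should take its share of erase cycles.
    The 0.5 weight is an illustrative tuning knob.
    """
    max_valid = max(b[0] for b in blocks) or 1
    max_erase = max(b[1] for b in blocks) or 1

    def score(b):
        relocation_cost = b[0] / max_valid  # pages that must be copied out
        wear = b[1] / max_erase             # how used-up the block already is
        return (1 - wear_weight) * relocation_cost + wear_weight * wear

    return min(range(len(blocks)), key=lambda i: score(blocks[i]))

blocks = [(10, 900), (12, 100), (2, 950)]  # (valid_pages, erase_count)
print(pick_block(blocks))                  # → 1: lightly worn, despite more valid pages
print(pick_block(blocks, wear_weight=0))   # → 2: pure cheapness ignores wear
```

Note how ignoring wear (weight 0) picks the nearly empty but heavily worn block, while the balanced score spreads erase cycles onto the fresher block.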
Another aspect of the garbage collection and block picking processes is the TRIM command, an operating system command that notifies the drive it no longer needs the data in certain pages. Information stored on the SSD is just bits, and even the tables used by the operating system (OS) to track where files are located on the drive look like any other data. When the OS deletes a file, it marks that space free in its table, but the SSD cannot interpret that table because it differs for every OS. Without TRIM, the SSD continues to preserve and recycle that invalid data until the OS actually writes something else into that space; only then does the drive know the old data can be dropped at the next recycling opportunity. Windows 7 and Linux kernels since 2.6.33 include the ATA TRIM command, which makes the SSD more efficient by letting it treat the data as invalid as soon as the OS determines it is no longer needed, rather than needlessly preserving it through recycling.
The TRIM command improves SSD performance in two ways: less data must be rewritten during garbage collection, and more free space results on the drive. The free space exists because the SSD knows the invalid data no longer needs to be tracked, leaving more room for new data to be written directly to free, erased blocks. SSDs that don’t support TRIM, or that are installed in systems with older OSes that don’t support it, lose this advantage. Most RAID configurations today do not yet support TRIM for SSDs in the RAID array; in those cases the overall performance of the SSD depends solely on its overall write speed and garbage collection efficiency.
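The effect of TRIM on garbage collection cost can be shown with a toy block model: pages the OS has trimmed no longer need to be copied before the erase. The `Block` class below is an illustrative sketch.

```python
class Block:
    """Toy flash block: maps page number -> state ('valid' or 'invalid')."""
    def __init__(self, pages):
        self.pages = {p: "valid" for p in pages}

    def trim(self, pages):
        # TRIM: the OS tells the drive these pages hold deleted data.
        for p in pages:
            if p in self.pages:
                self.pages[p] = "invalid"

    def relocation_cost(self):
        # Only still-valid pages must be copied before the block is erased.
        return sum(1 for s in self.pages.values() if s == "valid")

blk = Block(range(8))
print(blk.relocation_cost())  # → 8: without TRIM, every page would be copied
blk.trim([0, 1, 2, 3, 4])     # OS deleted a file occupying five of the pages
print(blk.relocation_cost())  # → 3: garbage collection now copies far less
```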
Reliability is another key storage criterion, and some of the techniques used to ensure flash reliability include preventing errors from initially occurring and then managing errors once they happen.
Enhancing the media
Flash chip geometries are already at 25nm and are heading steadily downward. As geometries become smaller, errors increase because there is less space in which to store electrons for a given cell in the flash memory. Since the data stored in a cell is represented by a certain voltage, with fewer electrons holding the data it is easier for those electrons to be disturbed, shifting the stored voltage and producing errors.
Flash memory manufacturers reserve a certain amount of extra space on each NAND page to be used for Error Correction Code (ECC) bits. More advanced controllers have the ability to store more ECC bits for higher reliability. However, the controller will need to perform some special steps with the flash media in order to create room for storing additional ECC bits.
Advanced error recovery (RAISE)
Shrinking geometries will eventually make it very difficult for even advanced ECC techniques to prevent data errors, so SSD manufacturers should have more than one data protection technique at their command. One such technique is the SandForce RAISE (Redundant Array of Independent Silicon Elements) technology, which writes data across multiple flash die to enable recovery from a failure in a sector, a page, or even an entire block. Similar in concept to RAID techniques used in disk-based storage, RAISE treats the individual die much like RAID treats individual disk drives and spreads data across them so these failures are not catastrophic.
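The RAID-like idea can be illustrated with simple XOR parity across die: lose any one die's page and the survivors plus the parity page reconstruct it. This is only a conceptual sketch of the principle; SandForce's actual RAISE layout and algorithms are proprietary.

```python
def xor_pages(pages):
    """Byte-wise XOR of equal-length pages."""
    out = bytearray(len(pages[0]))
    for page in pages:
        for i, b in enumerate(page):
            out[i] ^= b
    return bytes(out)

# Data striped across three die, plus one parity page (RAID-5-like sketch).
die = [b"alpha___", b"bravo___", b"charlie_"]
parity = xor_pages(die)

# Suppose die 1 fails: rebuild its page from the survivors plus parity.
rebuilt = xor_pages([die[0], die[2], parity])
print(rebuilt)  # → b'bravo___'
```

The recovery works because XOR is its own inverse: parity = d0 ^ d1 ^ d2, so d0 ^ d2 ^ parity = d1.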
End-to-end protection is designed to protect against any type of data integrity issue by comparing what is about to be sent back to the host with what was originally written. For example, SandForce SSD Processors add end-to-end protection information to the data when it first arrives from the host and that gets stored with the data on the flash. When the host later requests the data from the SSD, the SandForce processor then checks the end-to-end protection information from the retrieved page of data. This ensures that there were no errors during the writing process and double checks that both the ECC and RAISE did not fail.
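The shape of end-to-end protection can be sketched with a checksum attached when data arrives from the host and verified on read-back. The CRC32 tag below is a stand-in; the protection information a real controller stores is more elaborate and vendor-specific.

```python
import zlib

def protect(data: bytes) -> bytes:
    """Attach a CRC32 tag when data first arrives from the host."""
    return data + zlib.crc32(data).to_bytes(4, "big")

def verify(stored: bytes) -> bytes:
    """On read-back, recompute the tag before returning data to the host."""
    data, tag = stored[:-4], stored[-4:]
    if zlib.crc32(data).to_bytes(4, "big") != tag:
        raise IOError("end-to-end protection check failed")
    return data

stored = protect(b"host sector")
assert verify(stored) == b"host sector"

corrupted = b"X" + stored[1:]   # data silently altered after the tag was attached
try:
    verify(corrupted)
except IOError as e:
    print(e)  # the mismatch is caught before bad data reaches the host
```

Because the tag was computed before the data ever touched the flash, a successful check confirms that ECC and RAISE together returned exactly what the host originally wrote.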
One challenge in this process is knowing how and where to store the extra protection information. Knowing how to carve out space for this extra information is part of making this technique work in SSDs.
Read disturb management
Each flash cell holds a certain voltage. The act of reading a cell introduces the potential for altering the voltages stored nearby, so if an area is read too many times, its stored values may change. This effect is called read disturb, and read disturb management is a technique for monitoring and managing the number of times each area of the flash has been read. Once a particular area approaches the point where it has been read too many times, the controller moves the data there to another location, resetting the count.
Knowing how and when to move data to prevent read disturb errors is a difficult balance: move data too seldom and you may lose it, move it more often than necessary and you reduce performance and add to write amplification.
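A minimal sketch of the bookkeeping: count reads per block and signal a relocation when the count nears a threshold. The `READ_LIMIT` value and the per-block granularity are illustrative assumptions; real controllers tune both carefully against the balance described above.

```python
READ_LIMIT = 10000  # hypothetical per-block read threshold

class ReadDisturbTracker:
    def __init__(self):
        self.reads = {}

    def on_read(self, block):
        """Count reads; signal relocation when a block hits the limit."""
        self.reads[block] = self.reads.get(block, 0) + 1
        if self.reads[block] >= READ_LIMIT:
            self.reads[block] = 0          # data is moved, so the count restarts
            return "relocate"
        return "ok"

t = ReadDisturbTracker()
results = [t.on_read(7) for _ in range(READ_LIMIT)]
print(results[0], results[-1])  # → ok relocate
```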
Temperature management
If the flash memory or the controller gets too hot, it can burn out prematurely. The controller needs to understand the temperature of the entire drive and mitigate the impact of high temperatures. Writing to the flash draws the most power, and that is what heats things up. The controller therefore monitors the overall temperature, and if it exceeds a certain threshold, the temperature management logic slows down writes to reduce the power consumed, and thereby the temperature, until it falls back under the threshold.
The challenge in temperature management is to regulate write speeds without disabling writing altogether. Ideally, the controller should use several thermal thresholds to ratchet down write performance so as to avoid having to stop writing activity.
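A stepped-threshold policy can be sketched in a few lines. The temperatures and speed fractions below are purely illustrative; the point is the ratcheting shape, not the specific numbers.

```python
# Hypothetical thresholds (°C) and the write-speed fraction applied at each,
# checked hottest-first so the most aggressive throttle wins.
THRESHOLDS = [(90, 0.25), (80, 0.50), (70, 0.75)]

def write_speed_fraction(temp_c):
    """Ratchet write speed down in steps instead of stopping writes outright."""
    for threshold, fraction in THRESHOLDS:
        if temp_c >= threshold:
            return fraction
    return 1.0  # below all thresholds: full speed

print(write_speed_fraction(65))  # → 1.0
print(write_speed_fraction(72))  # → 0.75
print(write_speed_fraction(95))  # → 0.25
```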
One of the most important advances made by today’s new SSD controllers is a significant improvement in endurance. Today’s advanced controllers can manage commodity flash so that it delivers the same endurance as enterprise hard disk drives. This makes it possible to build SSDs that are competitively priced and yet last as long as other enterprise computing gear. There are several aspects to prolonging endurance, including firmware design and life curve throttling.
Designing firmware for endurance
Flash memory can only be written to so many times before it has trouble recovering (remembering) what it was storing. If the controller can write to the flash less or reduce the overall write amplification, the drive lasts longer. There are a few techniques used by advanced controllers to reduce write amplification including page-based volume management, efficient garbage collection, and line-speed data analysis.
Since whole flash blocks must be erased before their individual pages can be overwritten, moving the still-valid data from surrounding pages to new blocks increases write amplification. Today’s advanced controllers use page-based volume managers, which enable them to manage individual pages. This is important for random (as opposed to sequential) write operations, since the controller needs to be able to skip around from page to page when small amounts of data change in a block. If the controller uses a block-based system to write random data, it wastes an enormous amount of work, as it is forced to rewrite an entire block of typically over 100 pages just to replace a single sector of information. That single write would have a write amplification of over 1,000 on that block. A page-based system would rewrite only a single page, reducing the write amplification on that one write by two orders of magnitude.
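The arithmetic behind that comparison can be checked directly. The page and sector sizes below are illustrative assumptions (128 pages per block, a 4KiB page holding eight 512-byte sectors).

```python
PAGES_PER_BLOCK = 128
SECTORS_PER_PAGE = 8   # e.g. 4KiB page / 512B sectors (illustrative)

def wa_block_mapped():
    """Replacing one sector forces a rewrite of the whole block."""
    return (PAGES_PER_BLOCK * SECTORS_PER_PAGE) / 1

def wa_page_mapped():
    """Only the one page containing the sector is rewritten."""
    return SECTORS_PER_PAGE / 1

print(wa_block_mapped())  # → 1024.0 sectors written for 1 sector changed
print(wa_page_mapped())   # → 8.0
```

With these numbers the page-mapped write is 128x cheaper, matching the "two orders of magnitude" figure.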
As mentioned previously, garbage collection is a fundamental process in all flash drives because it ensures a steady pool of free blocks ready to be written. However, there is a limit to how much garbage collection should be done in advance of actually needing the space freed up. This background garbage collection can be wasteful because it may require moving of data that will soon be TRIMed by the operating system anyway. If the controller moves a bunch of data to new blocks and the OS will simply TRIM it shortly after that, the additional writes for that data needlessly increased the drive write amplification and further reduced the flash endurance. Early SSD makers thought that continuous background garbage collection was the best way to ensure a ready supply of fresh blocks for data, but it turned out that the performance gains of such continuous collection were not adequately offset by the larger loss of precious endurance. Ideally, the controller should know when garbage collection is optimum based on space needed and available write cycles. It should also have as efficient a garbage collection system as possible.
One simple method for extending the endurance of the flash is to not write to it in the first place. This might sound crazy, like making a gallon of fuel last longer by not burning it, but many techniques in use today for storage applications, including data deduplication, compression, and data differencing, reduce the amount of data that must be written to the drive. This approach, which SandForce implements with its DuraWrite technology, is a very complex process and requires a significant investment in the controller.
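The principle is easy to demonstrate with ordinary compression: redundant host data simply needs fewer flash bytes. This is only a stand-in for the data-reduction idea; DuraWrite's actual algorithms are proprietary and not simple zlib compression.

```python
import zlib

def flash_bytes_needed(host_data: bytes) -> int:
    """Compress before writing; fall back to raw if compression doesn't help."""
    compressed = zlib.compress(host_data)
    return min(len(compressed), len(host_data))

# Highly redundant data, as many real workloads are (logs, documents, zeros).
host_data = b"log entry: status OK\n" * 500
print(len(host_data), flash_bytes_needed(host_data))
```

Every byte not written to the flash is an erase cycle the media never spends, which is why this directly extends endurance rather than just saving space.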
Life curve throttling
If a flash drive can only be written to so many times, endurance depends on not writing to it unless necessary. Garbage collection schemes and block picking help reduce the number of unwanted writes, but at times it may be necessary to take a more direct approach. Life curve throttling is the method of tracking how much data has been written over a specific period of time using a system of debits and credits, and then throttling back write performance as needed. If the drive accumulates so many debits that the advertised life span is threatened, for example, the drive automatically slows down. With the SandForce DuraWrite technology reducing write amplification, typical users would never experience this throttling; it is more likely to occur in enterprise environments where data is written 24/7. Basic SSD controllers do not support life curve throttling and instead require the user to restrict the total number of writes sent to the drive over its life, which would be difficult for anyone to track and enforce manually.
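The debit/credit mechanism can be sketched as a budget that accrues over time and is spent per write. The class name and all numbers below are illustrative, not SandForce's actual parameters.

```python
class LifeCurveThrottle:
    """Debit/credit model: earn write budget over time, spend it per write."""
    def __init__(self, credit_per_day, initial_credit=0):
        self.credit = initial_credit
        self.credit_per_day = credit_per_day  # sized so total writes fit the rated life

    def new_day(self):
        self.credit += self.credit_per_day

    def throttled(self, bytes_to_write):
        """Spend credit; report whether this write must be slowed down."""
        self.credit -= bytes_to_write
        return self.credit < 0  # over budget: throttle to protect the life span

t = LifeCurveThrottle(credit_per_day=100)
t.new_day()
print(t.throttled(60))   # → False: within today's budget
print(t.throttled(60))   # → True: budget exhausted, write is slowed
```

A light client workload rarely exhausts its daily credit, while a 24/7 enterprise write stream steadily runs a deficit, which is exactly when the throttle engages.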
Securing user data, beyond ensuring proper drive operation and endurance, means encryption. There are many ways to encrypt data on a drive, but advanced SSD controllers apply an extra level of protection. Ideally, the drive should incorporate hardware-based rather than software-based encryption, because software encryption can be defeated and consumes valuable host CPU cycles. If properly designed, hardware-based encryption imposes no performance degradation, and it renders the data useless if the drive, or even the individual flash chips, are removed from the system. Advanced SSD controllers support various security standards, including ATA Security, TCG’s Opal/Enterprise, IEEE 1667, and so on. These all enhance the security of the device, and SSD controller manufacturers work closely with the standards organizations to ensure that the appropriate requirements are met.
One final area of consideration for an SSD controller designer is the amount of flexibility enabled. Ideally, the controller should be easily configurable in any form factor and usable with any type of flash memory. There are two considerations here: mapping strategies and media flexibility.
Mapping strategies
An SSD controller maintains a logical block address (LBA) lookup table that maps logical addresses to physical flash locations. The table is necessary because the flash media cannot be directly overwritten like an HDD, so each overwrite of the same logical location on the SSD must be redirected to a different physical location. This LBA table has to be stored someplace. One approach is to cache the full map in an external DRAM. This approach is fairly simple to use, but the external DRAM costs money and takes up additional space in the design and layout of the SSD; in addition, such DRAM typically lacks ECC, so errors can occur there. Other manufacturers offload the map to the host CPU and DRAM, but this requires custom drivers for different hosts, is resource-intensive on the host, and requires long recovery times after a power failure.
The best approach is to cache the map on the flash itself. Although this is a very complex process for the SSD to manage and control, it eliminates the DRAM from the layout, which enables more flexible SSD design options, reduces the overall solution cost, and is more power-efficient.
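The core behavior of any such map, wherever it is stored, is redirect-on-overwrite. The dictionary-based sketch below is a toy abstraction of that behavior; a real flash-resident map also handles journaling, caching, and power-fail recovery.

```python
class LbaMap:
    """Toy logical-to-physical map: every overwrite lands in a fresh physical
    page and the map is updated, since flash cannot be rewritten in place."""
    def __init__(self):
        self.map = {}          # logical page -> physical page
        self.next_free = 0     # next unwritten physical page (simplified)
        self.invalid = set()   # physical pages now holding stale data

    def write(self, logical):
        if logical in self.map:
            self.invalid.add(self.map[logical])  # old location becomes stale
        self.map[logical] = self.next_free
        self.next_free += 1

    def read(self, logical):
        return self.map[logical]

m = LbaMap()
m.write(5)        # LBA 5 -> physical page 0
m.write(5)        # overwrite: LBA 5 -> physical page 1; page 0 is now stale
print(m.read(5), sorted(m.invalid))  # → 1 [0]
```

The stale pages this process accumulates are exactly what garbage collection, described earlier, exists to reclaim.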
Design for media flexibility
Finally, the controller must be designed to work with flash from multiple manufacturers. This gives the SSD maker ultimate flexibility in choosing from whom to buy the flash memory (except in the case where they manufacture their own flash memory). Every flash memory has different characteristics, such as page/block size, spare area, and response times, and requiring SSD makers to adapt their controllers to a given brand of flash adds complexity to the design process. ‘Universal’ controller compatibility is much more desirable.
As we can see, there are many technical factors that contribute to the design and implementation of advanced flash controllers, enabling high performance, high endurance, and high security. Controller designers are working hard to meet these and other challenges daily.