At first glance, solid-state drives (SSDs) appear to be a no-brainer for makers of storage systems for enterprise servers and laptops. After all, SSDs promise higher read/write performance, higher reliability, and lower power consumption when compared with hard-disk drives. But in practice, SSD adoption has been held back not only by a higher cost per gigabyte, but also by real-world issues that prevent them from achieving their performance and reliability promises.
In this article, we will look at the key technical issues preventing SSDs from broad-based adoption by enterprise server and laptop makers, and outline the requirements for broad-based market success.
SSD TECHNOLOGY TODAY
Last year saw the proliferation of SSD product announcements in enterprise-class servers and laptops, beginning with 32- and 64-Gbyte devices. In servers, we’ve seen SSDs used in so-called “Tiered Storage” systems, in which the SSD acts as a higher-speed intermediary between system RAM and hard-disk-drive (HDD) storage. Fujitsu and other vendors have also begun to use SSDs in enterprise-class laptops, touting their ruggedness and higher read performance.
With SSDs now in the marketplace, two key trends have begun:
Declining costs: As NAND flash manufacturers have continued to advance process technology and densities, the price per gigabyte from NAND flash vendors has dropped approximately three orders of magnitude over the last decade, from thousands of dollars in the 2000 timeframe to today’s commodity levels of around $1 per gigabyte (for MLC-based technologies). Continued price declines are expected for years to come.
Increasing performance: At the same time, advances in flash memory as well as techniques such as DRAM caching are driving input/output operations per second (IOPS) higher, with today’s fastest SSDs sporting tens of thousands of read IOPS.
Despite these advances, most analysts predict a very slow ramp up toward broad-based adoption of SSDs in enterprise-class servers and laptops. One key reason is the relatively high cost per gigabyte for SSDs (compared with HDDs). Today’s SSDs mainly use SLC memory due to its higher life expectancy and reliability.
The cost of SLC memory is roughly four times that of multilevel-cell (MLC) memory, for two reasons. First, MLC memory stores two bits per cell and therefore provides twice the storage per square millimeter of silicon (silicon area being the main cost of the memory). Second, MLC accounts for roughly 90% of all NAND flash produced, further increasing the economies of scale in its production. Unfortunately, MLC flash memory isn’t yet deemed reliable or durable enough for widespread enterprise use.
Nevertheless, MLC flash is clearly the way forward due to its ability to rapidly reduce the cost per gigabyte. Still, several challenges must be overcome when using MLC flash in its current implementation:
Poor write endurance: Each block (or cell) of NAND flash memory can only be written a limited number of times. SLC memory generally sustains 100,000 program/erase (P/E) cycles, while MLC memory sustains roughly ten times fewer, at 10,000 cycles. Once a block (or cell) is written to its limit, it can no longer reliably retain the data stored in it.
Today’s SSDs differ fundamentally from HDDs when it comes to data storage. An HDD can take data directly from the host and write it to the rotating media. In contrast, an SSD can’t write a single bit of information without first erasing and then rewriting a very large block of data at one time (the P/E operation). In addition, to maximize the life of the flash memory, a wear-leveling technique that spreads P/E cycles evenly across all blocks forces the SSD controller to constantly move data around on the flash memory.
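The wear-leveling behavior described above can be sketched in a few lines of Python. This is a hypothetical, simplified controller policy for illustration only, not any vendor’s actual firmware: the controller tracks per-block erase counts and steers each new write to the least-worn erased block.

```python
# Simplified sketch of dynamic wear leveling (illustrative, not real
# firmware): track P/E cycles per block and always allocate the
# least-worn free block, so wear accumulates evenly across the flash.
class WearLeveler:
    def __init__(self, num_blocks: int):
        self.erase_counts = [0] * num_blocks   # P/E cycles consumed per block
        self.free_blocks = set(range(num_blocks))

    def allocate(self) -> int:
        # Pick the free block with the fewest program/erase cycles.
        block = min(self.free_blocks, key=lambda b: self.erase_counts[b])
        self.free_blocks.remove(block)
        return block

    def erase(self, block: int) -> None:
        # Erasing is what consumes a P/E cycle.
        self.erase_counts[block] += 1
        self.free_blocks.add(block)

wl = WearLeveler(num_blocks=4)
first = wl.allocate()
wl.erase(first)                  # `first` now carries one P/E cycle
assert wl.allocate() != first    # a fresh, less-worn block is chosen next
```

Real controllers also perform static wear leveling, periodically relocating long-lived “cold” data so the blocks holding it don’t lag behind in wear—which is exactly the constant data movement noted above.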
These factors and other differences from HDDs give rise to write amplification, in which the flash can end up absorbing as much as 100 times the amount of user data actually being stored. Consequently, these factors also limit the life expectancy of the SSD. Figure 1 shows the basic life-expectancy formula that affects all SSDs, and Figure 2 shows the details of the formula:
A typical MLC drive might have the characteristics shown in Figure 3.
Capacity = 128 Gbytes
P/E cycles = 10,000
Write speed from the host = 125 Mbytes/s
Duty cycle (when the drive is accessed for reads or writes) = 40% of the time
Write fraction (percentage of accesses to the drive that are writes rather than reads) = 33% of the time
Write amplification (assuming a conservative number) = 40
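Plugging these numbers into the life-expectancy formula gives the figure cited below. A quick Python sketch of the arithmetic (the variable names are ours, not taken from the figures):

```python
# Back-of-the-envelope life expectancy for the example MLC drive.
# The flash can absorb capacity x P/E cycles of writes in total; write
# amplification multiplies every host byte written, and the host only
# writes during the write share of the active duty cycle.
capacity_bytes = 128e9        # 128 Gbytes
pe_cycles = 10_000            # MLC program/erase limit per block
host_write_rate = 125e6       # 125 Mbytes/s from the host
duty_cycle = 0.40             # drive active 40% of the time
write_fraction = 0.33         # 33% of accesses are writes
write_amplification = 40      # conservative factor

total_flash_writes = capacity_bytes * pe_cycles
effective_flash_write_rate = (host_write_rate * duty_cycle
                              * write_fraction * write_amplification)
lifetime_days = total_flash_writes / effective_flash_write_rate / 86_400
print(f"{lifetime_days:.1f} days")  # about 22 days, consistent with the ~23-day estimate
```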
Clearly, 23 days is too short a lifespan to deploy in an enterprise environment. To overcome the endurance problem, SSD manufacturers use one or more of these five techniques:
- Combining MLC and SLC flash on the same device, which extends endurance by storing the most active data on the higher-endurance SLC memory while still lowering the total cost by using some MLC memory.
- Over-provisioning, which extends endurance by making more flash available. For example, an SSD with twice as much actual storage as its stated capacity would have twice the endurance as a drive in which flash and capacity had a 1:1 ratio (no over-provisioning). Of course this over-provisioning would also double the cost.
- DRAM caches, which extend endurance by coalescing writes before they reach the flash memory and by keeping housekeeping data in DRAM rather than in flash. Naturally, the DRAM also adds cost.
- Daily write limitations, which extend the life of the drive by restricting the number of writes to the flash each day. For example, one vendor’s warranty specifies a limit of 20 Gbytes per day written from the host, which can be reached in less than five minutes on that same drive.
- Reduced warranties (less than five years), which account for lower endurance by simply shortening the guaranteed life of the drive.
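Two of these techniques can be quantified directly using the same hypothetical drive parameters from the example above (illustrative numbers only):

```python
# Same hypothetical 128-Gbyte MLC drive as in the text.
capacity_bytes = 128e9
pe_cycles = 10_000
write_amplification = 40

# Over-provisioning: doubling the physical flash behind the same stated
# capacity doubles total endurance, and therefore the lifespan.
baseline_days = 23                 # the article's unmitigated estimate
overprovisioned_days = baseline_days * 2

# Daily write limit: capping host writes at 20 Gbytes/day stretches the
# same total flash endurance over many more days.
daily_host_limit = 20e9            # bytes written from the host per day
capped_lifetime_days = capacity_bytes * pe_cycles / (daily_host_limit
                                                     * write_amplification)
print(overprovisioned_days)        # 46 days with 2x over-provisioning
print(capped_lifetime_days / 365)  # roughly 4.4 years under the write cap
```

The arithmetic makes the trade-off plain: the write cap rescues the warranty period, but only by forbidding the sustained write workloads an enterprise drive exists to serve.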
Poor write performance: While random read performance in a typical MLC-based SSD can reach 10,000 to 20,000 IOPS, random write performance is significantly lower. Even a so-called “high-performance” SSD today delivers fewer than 1000 IOPS of write performance (Fig. 4). This is generally caused by a high write-amplification factor and by the need to restrict writes to extend the drive’s endurance.
Typically, SSD makers address the write performance issue with two methods. One is by adding DRAM caches, described earlier. However, this isn’t a long-term solution because it only speeds up writes until the cache is full (the first few minutes of use, at best). In any event, adding the gigabyte or more of DRAM cache that’s really needed to impact performance would make the SSD too expensive to sell. The other method is to over-provision the drive. This gives the SSD controller more room to manipulate the data and reduces the amount of time the drive is doing housekeeping operations, e.g., garbage collection (performed on blocks of data no longer needed by the host).
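The link between garbage collection and write amplification can be made concrete with a common steady-state approximation (our simplification, not a formula from the article): if a reclaimed block is still fraction u valid, those valid pages must be copied elsewhere before the erase, so only (1 − u) of each block’s writes carry new host data.

```python
# Simplified steady-state model of garbage-collection overhead:
# reclaiming a block that is fraction u valid copies u of the block and
# frees only (1 - u) for host data, so write amplification = 1 / (1 - u).
def write_amplification(valid_fraction: float) -> float:
    return 1.0 / (1.0 - valid_fraction)

print(write_amplification(0.50))   # 2.0: half-valid blocks double the writes
print(write_amplification(0.975))  # ~40, the conservative factor cited earlier
```

This also shows why over-provisioning helps: extra spare flash means reclaimed blocks tend to hold fewer still-valid pages (lower u), which pulls write amplification down.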
High error rates: NAND flash memory, like other memory chips, has naturally occurring defects that render portions of the die unusable. Most SSDs provide error protection for up to one sector error per 10¹⁵ bits read. Assuming a 250-Mbyte/s read speed and a 40% operating duty cycle, a sector error would result on average every 14.4 days, based on the following formula:
10¹⁵ bits / (8 × 250 Mbytes/s × 40%) ≈ 14.4 days
In contrast, high-performance HDDs offer error protection for up to one sector error per 10¹⁶ bits read, but they’re transferring data much more slowly than an SSD. Assuming a 120-Mbyte/s read speed at the same operating duty cycle, a sector error would result on average every 9.9 months, based on the following formula:
10¹⁶ bits / (8 × 120 Mbytes/s × 40%) ≈ 9.9 months
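Both intervals follow from the same arithmetic; here is a short Python check of the two formulas above:

```python
# Mean time between unrecoverable sector errors, given an error spec of
# one sector per `bits_per_error` bits read and a sustained read rate.
def days_between_errors(bits_per_error, read_bytes_per_s, duty_cycle):
    bits_read_per_s = 8 * read_bytes_per_s * duty_cycle
    return bits_per_error / bits_read_per_s / 86_400   # seconds -> days

ssd_days = days_between_errors(1e15, 250e6, 0.40)
hdd_days = days_between_errors(1e16, 120e6, 0.40)
print(f"SSD: {ssd_days:.1f} days")           # ~14.5 days (the article rounds to 14.4)
print(f"HDD: {hdd_days / 30.4:.1f} months")  # ~9.9 months
```

The comparison shows the compounding effect: the SSD both reads faster and tolerates an order of magnitude fewer bits per error, so its error interval shrinks by roughly 20x relative to the HDD.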
To address this problem, SSD manufacturers either reduce the warranty period for their devices or leave error handling to RAID logic in the host computer system. Shorter warranties aren’t acceptable to most mainstream users, and host-based RAID causes a high number of rebuilds and further reduces the SSD’s performance. And, of course, RAID isn’t a good option in a notebook environment.
Security: One of the stealth issues impacting the use of SSDs is the lack of encryption in a typical drive. For most products on the market today, data can be recovered from the SSD by simply removing the SSD cover and attaching a clip to the flash-memory chips—a process far easier than trying to read the contents of a password-protected HDD that’s been removed from its host. Enterprises will demand much better security guarantees, though, before they use SSDs in large quantities, especially in laptops.
REQUIREMENTS FOR BROAD SSD ADOPTION
The fundamental cost issue with today’s SSDs can be largely overcome with the use of MLC flash, but that flash must be made reliable enough and deliver enough performance to be practical for use in enterprise servers and laptops. The requirements for doing so are:
Better write endurance: The industry must develop new techniques to reduce the write amplification in MLC-based SSDs (thereby increasing the endurance) to meet the five-year expected life of enterprise-class HDDs, and it must do so without imposing daily write limitations, shorter warranties, or costly DRAM caches.
Better write performance: MLC-based SSDs should perform like HDDs—there should be no difference between write and read performance. This will also require significantly reducing the write amplification factor imposed by today’s SSD technology.
Lower error/defect rates: Error rates and ECC protection for MLC-based SSDs must be better than those of today’s enterprise HDDs, without over-provisioning or relying on system-level RAID techniques.
Full security: SSDs will need some form of built-in encryption to prevent data theft before enterprises will trust their use in laptops.
Reduced complexity, size, and costs: SSD designs must eliminate DRAM and combinations of SLC and MLC to reduce packaging complexity, size, and costs.
The market is more than ready for practical SSD storage devices, just as soon as SSD manufacturers overcome the challenges of performance, endurance, and complexity and can offer fast, reliable devices at reasonable prices. The advances taking place today will address these key issues, leading to a very bright future for MLC-based SSDs.