Interview: LSI Fellow Rob Ober Discusses PCIe Flash Technology

Download this article in .PDF format
This file type includes high resolution graphics and schematics.

Flash storage has changed the way programmers and developers look at storage. Initially a way to boot a PC or run a small program in a microcontroller, flash storage has replaced rotating magnetic media in a wide range of applications. Hard drives were often replaced with flash drives that incorporated the same storage interface. Unfortunately, the higher performance of flash storage exceeds the capabilities of this kind of interface.

PCI Express (PCIe) has been used as the main peripheral interface. It is scalable, and its bandwidth exceeds that of most storage interfaces. It is a good match for the performance of flash storage. PCI Express-based standards like NVM Express (see “Controllers Speed NVM Express Drive Delivery”) have emerged. LSI’s Nytro WarpDrive PCI Express platform offers lots of on-board flash storage (see the figure).

Figure 1. Nytro WarpDrive connects on-board flash to a host via PCI Express.

Rob Ober, LSI Fellow and processor and system architect in the LSI Corporate Strategy Office, discusses where PCI Express-based flash storage is headed.

Wong: What is the future of PCIe flash?

Ober: Ultimately, I think that PCIe cards will evolve to more external, rack-level, pooled flash solutions, without sacrificing the great attributes they have today. This is just my opinion, but other leaders in flash are going down this path too.

I’ve been working on enterprise flash storage since 2007, mulling over how to make it work. Endurance, capacity, cost, and performance have all been concerns to grapple with. Of course, flash is changing too as the process nodes shrink, from 60 nm through 50 nm, 35 nm, and 24 nm to 20 nm, and as cells move from single-level cell (SLC) to multi-level cell (MLC) to triple-level cell (TLC), along with all the variants of these trimmed for specific use cases. The endurance specification has gone from hundreds of thousands of program/erase (PE) cycles to 3000 and in some cases 500.

It’s worth pointing out that almost all the “magic” that has been developed around flash was already scoped out in 2007. It just takes a while for a whole new industry to mature. Individual die capacity increased, meaning fewer die are needed for a solution, and that means less parallel bandwidth for data transfer. And the requirement for state-of-the-art single operation write latency has fallen well below the actual write latency of the flash itself. I know that sounds impossible, but that’s a real market requirement.

In short, things have changed fast in the enterprise flash market, and we’ve learned a lot in just a few years.

Wong: So what is the most important thing you’ve learned about why people use flash?

Ober: Hands down, latency is the factor that matters most. No one is pushing the IOPs (input/output operations per second) limits of flash, and no one is pushing the bandwidth limits of flash. But they are pushing the latency limits. A lot of enterprises believe they have tremendous IOPs and bandwidth needs. But when we actually instrument their workloads, when we profile actual use cases with real systems, we find that they do not. Even so, the latency improvements have a profound impact on work done.
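
A rough way to see why latency rather than IOPs limits most workloads is Little's Law, which says sustained throughput equals the number of outstanding requests divided by the latency of each request. The sketch below is only an illustration: the function name `achieved_iops` and the 5 ms and 100 µs latencies are assumptions chosen for round numbers, not figures from the interview or from any LSI product.

```python
def achieved_iops(outstanding_ios: int, latency_s: float) -> float:
    """Little's Law: throughput = concurrency / per-request latency."""
    if outstanding_ios < 1 or latency_s <= 0:
        raise ValueError("need at least one outstanding I/O and a positive latency")
    return outstanding_ios / latency_s


# Assumed, illustrative latencies (not measurements from the article).
DISK_LATENCY_S = 5e-3     # hypothetical rotating-disk access time
FLASH_LATENCY_S = 100e-6  # hypothetical PCIe flash access time

for queue_depth in (1, 4):
    disk = achieved_iops(queue_depth, DISK_LATENCY_S)
    flash = achieved_iops(queue_depth, FLASH_LATENCY_S)
    print(f"queue depth {queue_depth}: disk {disk:,.0f} IOPs, "
          f"flash {flash:,.0f} IOPs, speedup {flash / disk:.0f}x")
```

Under these assumptions, an application that keeps only a few I/Os in flight reaches a small fraction of a card's rated IOPs, yet every reduction in latency turns directly into more completed work.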

Wong: What else have you learned?

Ober: We’ve gotten lots of feedback, and one of the biggest things we’ve learned is this: PCIe flash cards are awesome. They radically change the performance profiles of most applications, especially databases, allowing servers to run far more efficiently and increase actual work done four times to 10 times (and in a few extreme cases 100 times). Let me put that in context. Last month I saw a billboard advertising one company’s server giving 5% higher performance than a competitor’s on a key database. PCIe flash cards can easily give you 500% higher performance. So the feedback we get from large users is “PCIe cards are fantastic. We’re so thankful they came along. But…” There’s always a “but,” right?

Wong: So what are the drawbacks to PCIe flash?

Ober: It tends to be a pretty long list of frustrations, and they differ depending on the type of datacenter using them. We’re not the only ones hearing it. To be clear, none of these frustrations are stopping people from deploying PCIe flash. The attraction is just too compelling. But the problems are real, and they have real implications, and the market is asking for real solutions. That said, there are a number of problems with PCIe flash that, in my view, need to be solved.

The first involves stranded capacity and IOPs. Some leftover space is always needed in a PCIe card since, after all, databases don’t do well when they run out of storage! But you still pay for that unused capacity, and that’s expensive. Additionally, all the IOPs and bandwidth are rarely used, which means latency needs are met, but capability is left on the table. Of course, there’s also the issue of simply not having enough capacity on a card. It’s hard to figure out how much flash a server/application will need, but it’s vital as there is no flexibility after it’s installed. If a working set goes one byte over the card capacity, there’s going to be a massive problem.

Data protection can be a problem. Not all applications require protected data, and those that don’t should not have to bear that overhead. But if the storage holds the only copy of the data, any component failure in the datapath would be an inconvenient problem. More than that, some applications have an absolutely critical need for that protection or resilience, or else someone will lose their job. PCIe cards have a single point of failure, and if that goes, the 2 Tbytes of data behind it disappears too. Again, that is a huge problem for some applications.

Another is stranded data on server failures. If a server fails, all that valuable hot data is unavailable. To make matters worse, it all needs to be re-constructed when the server does come online because it will be stale, and that takes time. To illustrate, 2 Tbytes of hot data could take anywhere between hours and days to restore.
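
The "hours to days" range follows from simple arithmetic: rebuild time is the capacity divided by the effective rate at which hot data can be fetched again. The sketch below assumes decimal terabytes and two hypothetical effective refill rates (100 Mbytes/s and 10 Mbytes/s). Neither rate comes from the interview, and `restore_hours` is an illustrative name.

```python
def restore_hours(capacity_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to refill a cache of the given size at a steady effective rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return capacity_bytes / rate_bytes_per_s / 3600.0


CARD_CAPACITY = 2e12  # 2 Tbytes, as in the text (decimal units assumed)

# Hypothetical effective refill rates; real rates depend on the workload
# and on how fast the backing disks can serve random reads.
for label, rate in (("100 Mbytes/s", 100e6), ("10 Mbytes/s", 10e6)):
    hours = restore_hours(CARD_CAPACITY, rate)
    print(f"{label}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

At the faster assumed rate the refill takes several hours, and at the slower one it takes more than two days, which brackets the range quoted above.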

Yet another is that PCIe flash storage is a separate storage domain versus disks and boot. Because of this, you have to explicitly manage LUNs (logical unit numbers), explicitly place data to make it hot, and manage it via different APIs (application programming interfaces) and management portals. Applications may even have to be rewritten to use different APIs, depending on the PCIe card vendor.

Wong: Are there any vendor-specific drawbacks that you see?

Ober: There sure are, and the biggest is that performance doesn’t always scale. First, it’s important to know that most PCIe card deployments out there are multiple cards per server. One card will give significant performance improvements. With two cards, you’ll see an improvement, but it won’t be the same awesome improvement you saw with the installation of the first card. Three or four cards don’t give any improvement at all. The problem is that performance maxes out somewhere below two cards, as drivers and server on-loaded code create resource bottlenecks. This is more a competitor’s problem than ours. We scale.

In addition, and again depending on the vendor, performance can sag over time. Many vendors execute bit-error correction algorithms in on-loaded code in the server. As flash wears, and more bit-errors occur per read, more and more computation (latency) is needed in that code. It’s a very noticeable problem over time. This has not been an issue for our products because that’s all part of the hardware datapath, but I mention it because it has been a problem for the PCIe flash industry as a whole, and people have been seeing disappointing performance over time as their cards age.

Wong: Are there any other drawbacks?

Ober: Yeah, there’s one more. It’s hard to get cards in servers.

A PCIe card is a card, right? Not really. Getting a high-capacity card in a half-height, half-length PCIe form factor is tough, but doable. However, running that card has problems. It may need more than 25 W of power to run at full performance, and the slot may or may not provide it. Flash burns power proportionately to activity, and writes/erases are especially intense on power. It’s really hard to remove more than 25 W of heat with air cooling in a slot. Sometimes the air is preheated, or one slot doesn’t get good airflow. It ends up being a server-by-server, slot-by-slot qualification process, and, as trivial as that sounds, it’s actually one of the biggest problems. And it creates a very hostile environment for the flash memory itself.

Of course, everyone wants these fixed without affecting single operation latency, or increasing cost, etc. That’s what we’re here for, though, right? To summarize the problems: it’s not looking good. For a given solution, flash is getting less reliable, there is less bandwidth available at capacity because there are fewer die, we’re driving latency way below the actual write latency of flash, and we’re not satisfied with the best solutions we have for all the reasons above.

Wong: What are the implications of these problems, and where do you see PCIe flash solutions evolving over the next two to four years?

Ober: If you think these problems through enough, you start to consider one basic path. It also turns out we’re not the only ones realizing there are three basic goals going forward with storage:

Unified storage infrastructure for boot, flash, and HDDs (hard-disk drives)
Pooling of storage, especially the flash, so that resources can be allocated/shared
Low latency, high performance, exactly as if those resources were PCIe card flash

One easy answer would be a flash SAN (storage area network) or NAS (network attached storage). But I don’t believe that’s the answer. Not many customers want a flash SAN or NAS, at least not for their new infrastructure. More importantly, all the data is at the wrong end of the straw. The poor server is left sucking hard. Remember, this is flash, and people use flash for latency.

Today’s SAN type of flash devices have four times to 10 times worse latency than PCIe cards. That’s a non-starter. You have to suck the data through a relatively low-bandwidth interconnect, after passing through both the storage and network stacks. Making matters worse, there is interaction between the I/O threads of various servers and applications, meaning you have to wait in line for that resource. The net result is very poor latency compared to PCIe cards.

Normally where there’s smoke there’s fire. It’s true there is a lot of startup energy in this flash SAN and NAS space. At first glance, it seems to make sense to pursue these solutions if you’re a startup, because SAN/NAS is what people use today, and there’s lots of money spent in that market today. However, it’s not what the market is asking for, and, therefore, it’s not where the market is ultimately headed.

Another easy answer is NVMe (NVM Express) SSDs (solid-state drives). Everyone, at least OEMs, wants them. Front-bay PCIe SSDs (the HDD form factor or NVMe) crowd out your disk drive bays, but they don’t fix the problems. The extra mechanicals and form factor are more expensive and just make replacing the cards every five years a few minutes faster. We should be fixing problems, not just making them easier to deal with.

NVMe SSDs leave room for fewer HDDs. This is a definite drawback. They are still an island of flash in the server and have all the same problems as PCIe cards. They also provide uniformly bad cooling and hard-limit power to 9 W or 25 W per device. But to protect the storage in these devices, you need to have enough of them that you can apply RAID (redundant array of independent disks) or otherwise protect them. Once you have enough of those for protection, they give you awesome capacity, IOPs, and bandwidth, but the software protection on the server adds a lot of latency. IOPs and bandwidth are not what applications need. They need low latency for the working set of data. Latency is key.

Wong: What do you think the PCIe replacement solutions in the near future will look like?

Ober: You need to pool the flash across servers to optimize bandwidth and resource usage and allocate appropriate capacity. You need to be resilient, that is, to manage failures or errors and limit the impact and span of failures that do occur. And you need to commit writes at very low latency (lower than native flash) and maintain low-latency, bottleneck-free physical links to each server. Achieving these goals is not easy.

One approach is a single small enclosure per rack supporting around 32 servers. That enclosure manages temperature and cooling optimally for performance/endurance of the flash. It also supports remotely configured/managed resources, which are allocated to each server, along with the ability to re-assign resources from one server to another in the event of a server/VM (virtual machine) bluescreen.

It’s critical to have a low-latency/high-bandwidth physical cable or backplane from each server to the enclosure. The enclosure also needs the ability to replace small flash modules in case they fail or wear out over time, and those modules need to be physically small (and financially small too). And there needs to be a choice to provide protection across some or all modules (erasure coding) to allow continuous operation at very high bandwidth. Write latency performance can be further improved by using NV (non-volatile) memory to commit writes. And finally, this needs to be logically integrated with the whole storage architecture at the rack, with the same software model, APIs, drivers, management tools, alarms, messaging, etc.
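
To make the idea of protection across modules concrete, here is a minimal sketch of the simplest erasure code, single XOR parity, which can rebuild any one lost module from the survivors. It is not a description of LSI's design: the interview names only "erasure coding" in general, and the function names and the three tiny example "modules" are assumptions for illustration. Practical systems often use stronger codes that tolerate more than one failure.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Parity block stored on an extra module."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical contents of three small flash modules.
modules = [b"hot-data", b"flashmod", b"12345678"]
parity = make_parity(modules)

# Simulate losing the second module and rebuilding it.
recovered = rebuild_missing([modules[0], modules[2]], parity)
print(recovered == modules[1])
```

The trade-off is one extra module of capacity per protection group in exchange for continued operation when a single module fails or is swapped out.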

That means the performance looks exactly as if each server had multiple PCIe cards. But the capacity and bandwidth resources are shared, and systems can remain resilient. So ultimately, I think that PCIe cards will evolve to more external, rack-level, pooled flash solutions, without sacrificing the great attributes they have today, and in fact improving on them. This is just my opinion, but as I said, other leaders in flash are going down this path too.
