Apply Virtualization To Storage I/O

Virtualization is receiving lots of attention these days. Behind the buzz are some simple, time-tested concepts. But the movement of this technology from the mainframe to the mainstream has brought it into the limelight.

At its heart, virtualization is about making something “look” like something else. Typically, this means making an operating system “think” it’s running alone on a computer, when in fact that computer is shared by several operating systems—each referred to here as a system image (SI).

Since mainstream computers started incorporating memory-management units, this type of virtualization has been possible, albeit not terribly popular due to the extreme performance hit required to emulate every device in the computer. Recent hardware and software technology has boosted emulation speed, but plenty of room for improvement remains.

I/O virtualization (IOV) is simply the ability to make one device look like multiple devices, each assignable to a unique SI. Moving the virtualization to the device level can provide dramatic performance increases by freeing the system processor(s) from the cumbersome task of emulating those devices.

The PCI-SIG has defined a mechanism for providing these virtual device interfaces on the PCI Express (PCIe) bus. Now that these efforts are complete, some needed standardization is finally available in mainstream IOV design, enabling multi-vendor silicon solutions to work on multiple platforms under multiple operating systems.

Problem solved, right? Take a chip’s current PCI Express front-end logic, slap on this I/O virtualization stuff, and “poof!” It’s virtualizing like mad. Well, yes and no. Yes, it creates a chip assignable to multiple SIs at once. However, chances are that virtualizing the back end of the device is actually the harder part of the problem.

Consider a storage controller. Clearly, one would want to partition the connected storage so that System Image X (perhaps an online banking system) has space allocated only to itself, and that space isn’t accessible by System Image Y (e.g., the Web server for www.hackers-are-we.org).

PCI EXPRESS I/O VIRTUALIZATION
Let’s take a brief look at the system view of IOV. The term “system image,” or SI, refers to a real or virtual system of CPU(s), memory, operating systems, I/O, etc. Multiple SIs may run on one or more sets of actual hardware.

One example today might be a hypervisor like VMware running Windows XP and Linux simultaneously on a single-CPU desktop PC. In that case, two SIs exist, each sharing a single CPU, memory, disk drive, etc. Another example would be a blade server running Windows XP on one blade and Linux on another blade. There, each SI isn’t sharing any of the CPU blade’s hardware, though it could potentially share hardware on an I/O blade.

Regardless of the physical assignments, each SI needs to “see” its own PCI hierarchy. Even if no end devices are actually shared (e.g., two Fibre Channel controllers on the I/O blade, one assigned to the Linux blade and one assigned to the XP blade), some control over the hierarchy’s visibility is required. If end devices are shared, each SI must be restricted to seeing only its “portion” of shared end devices.

The device needs to make its one physical set of hardware appear to be multiple virtual devices, which appear completely independent to outside observers. Those devices may:

• occupy different PCI memory ranges
• have different settings for various PCI configuration registers
• potentially each be a particular PCI multifunction device

Furthermore, the device needs to keep cross-“device” traffic isolated internally so no data spillover occurs between virtual devices.

As seen in the examples above, a clear distinction can be drawn between systems having a single point of attachment to the PCI hierarchy and those with multiple points of attachment. The traditional single-CPU desktop computer and even the traditional n-way multi-CPU server previously had a “single” logical point of attachment to the PCI hierarchy (Fig. 1).

By contrast, blade systems enable a new hierarchy view where some upper-level enhanced PCI Express switch could allow multiple root complexes to attach to the total PCI hierarchy (Fig. 2). Here, some new mechanisms are clearly required to enable each root complex to access only its assigned portion of that hierarchy.

Given the large separation between these two types of systems, both from a complexity and market segmentation perspective, the PCI-SIG chose to break IOV up into two separate specifications. Since each root complex (Fig. 2, again) could also be utilizing single-root IOV, the two specifications will necessarily be interdependent. Thus, the so-called “concentric circles” model was adopted, whereby the single-root specification builds on the PCI Express base specification, and the multi-root specification builds on both the single-root specification and PCI Express base specification.

SINGLE-ROOT I/O VIRTUALIZATION
Single-root I/O virtualization’s primary target is existing PCI hierarchies, where single-CPU and multi-CPU computers have the traditional single point of attachment to PCI (Fig. 1, again). One of the significant constraining goals of the single-root spec was to enable the use of existing or absolutely minimally changed root-complex (i.e., chipset) silicon. Likewise, enabling existing or minimally changed switch silicon was a constraint.

Given those requirements, there can still only be a single memory address space from the bus perspective. Partitioning and allocation for the virtualized SIs is performed at a level above the root-complex attachment point. Some type of address translation logic is generally presumed to exist in or above the root complex to enable a “virtualization intermediary” (commonly referred to as a hypervisor) to perform that mapping. New IOV endpoint devices will be required, of course, with their associated non-trivial design and support challenges.

The “don’t change the chipset!” philosophy opens the virtualization market to significant numbers of existing or simply derived systems (e.g., systems that might need only a new BIOS or chipset revision). However, it shifts a substantial burden to the software performing the virtualization intermediary function.

MULTI-ROOT I/O VIRTUALIZATION
The most obvious example implementation of the multiple attachment point hierarchy (Fig. 2, again) is a blade server with a PCI Express backplane, though the PCI Express Cable specification opens up a number of other possibilities. This is a new PCIe hierarchy construct—effectively a (mini) fabric.

Here, the PCI-SIG target was “small” systems with 16 to 32 root ports as likely maximums, though the architecture allows many more. (One of the workgroup’s sayings was “Our yardstick is a yardstick,” i.e., the typical implementation is expected to be a system occupying not more than about three feet cubed.)

Again, retaining the use of existing or absolutely minimally changed root-complex (i.e., chipset) silicon was a key goal. Unlike single root, however, no virtualization intermediary is assumed and the complexity of partitioning the system moves into a new enhanced type of PCI Express switch (Fig. 2, again), which is called “multi-root aware.”

The key difference in a multi-root system is the partitioning of the PCI hierarchy into multiple virtual hierarchies all sharing the same physical hierarchy. Where single-root systems are stuck with a single memory address space being partitioned among their SIs, multi-root systems actually have a full 64-bit memory address space for each virtual hierarchy.

Configuration management software, working in conjunction with the enhanced switch(es) and IOV devices, programs the hierarchy so each root complex from Figure 2 “sees” its portion of the entire multi-root hierarchy as if it were a single-root hierarchy as in Figure 1. Each of those “views” of the hierarchy is called a virtual hierarchy. Each virtual hierarchy of a multi-root system can be independently enabled for single root or not. Therefore, endpoint devices in a multi-root system face the challenge of layering both modes.

Every SI should see its own virtualized copy of the configuration space and address map for a given device being virtualized. Effectively, the device needs “n” sets of PCI configuration space to support “n” of these virtual functions. The single-root specification defines lightweight virtual function definitions to reduce the gate count impact, while the multi-root specification relies on a full configuration space for each virtual hierarchy that can use the device.

The various “flavors” of configuration spaces are too detailed for this article, which is focused on virtualization at a high level. For the purposes of this discussion, it’s sufficient to note that every SI interacting with an IOV device will have its own device address range and configuration space. Thus, the IOV device can associate work with a particular SI based on which address space was accessed.
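The idea of associating work with an SI by the address it touched can be sketched in a few lines of Python. This is only an illustrative model, not how any particular device implements it: the `Window` and `IovAddressMap` names, the SI labels, and the address values are all assumptions made for the example, and real devices do this lookup in hardware.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Window(NamedTuple):
    """One address range exposed to a single system image (hypothetical layout)."""
    base: int   # first bus address of the range
    size: int   # length of the range in bytes
    si: str     # label of the owning system image


class IovAddressMap:
    """Maps a bus address to the system image whose range contains it."""

    def __init__(self, windows):
        self._windows = sorted(windows, key=lambda w: w.base)
        # Ranges must not overlap, or ownership would be ambiguous.
        for lower, upper in zip(self._windows, self._windows[1:]):
            if lower.base + lower.size > upper.base:
                raise ValueError("address ranges overlap")
        self._bases = [w.base for w in self._windows]

    def owner_of(self, address: int) -> Optional[str]:
        """Return the owning SI label, or None if no range claims the address."""
        index = bisect_right(self._bases, address) - 1
        if index < 0:
            return None
        window = self._windows[index]
        if address < window.base + window.size:
            return window.si
        return None


if __name__ == "__main__":
    # Made-up ranges for two system images.
    amap = IovAddressMap([
        Window(base=0x1000, size=0x1000, si="SI-X"),
        Window(base=0x2000, size=0x1000, si="SI-Y"),
    ])
    print(amap.owner_of(0x1004))
    print(amap.owner_of(0x5000))
```

An access that lands in no range returns `None`, which models the device refusing work it can’t attribute to any SI.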

VIRTUALIZING THE STORAGE SIDE
At this point in our hypothetical development process, an IOV device has been enabled to respond as if it were multiple devices and provided with a mechanism to distinguish between two different SIs. If the implementation were stopped at this point, the model would look like Figure 3. Note that the depictions of SIs don’t attempt to distinguish whether they’re single-root or multi-root. All that matters here is that they’re different images. The precise means of connection is unimportant.

Effectively, all SIs see all of the disks connected to the IOV storage controller. In some environments, this model might actually be okay. If the SIs were cooperative, they could divide up the pool of storage themselves. Likewise, if there were some software intermediary between each SI and the storage controller, it could divide up the pool of storage and allow an SI to see only a portion of the pool.

Considering the example at the beginning of this article, users could be uncomfortable with their banking system “cooperating” with the crew at www.hackers-are-we.org. While the software intermediary idea would be okay, it would eliminate a lot of the performance savings of doing IOV in hardware, and it would be a rather complex piece of software needing intimate knowledge of each controller’s hardware and device driver. Clearly, then, for most environments, hardware virtualization of the storage side is desirable.

SAS TO THE RESCUE
Therefore, it’s not a difficult stretch to imagine that a creative IOV storage controller designer could add a straightforward table mechanism to filter out disk drives by their ID and only let certain SIs “see” certain disk drives. Such a system would look like Figure 4, where each colored SI has access to the same color disk drive(s).

Historically, this could have been done fairly easily in a SCSI environment, where SCSI even provided facilities for subdividing a single disk drive. Even a SATA controller today could probably handle this sort of per-disk-drive “masking.”
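A per-drive masking table of the kind described above might look something like the following Python sketch. It is a hypothetical illustration only: the `DriveMaskTable` class, its method names, and the use of small integers as drive IDs are assumptions for the example, not any controller’s actual interface. The sketch denies by default, so an SI sees only drives explicitly granted to it.

```python
class DriveMaskTable:
    """Per-SI allow list of drive IDs; anything not listed stays hidden."""

    def __init__(self):
        self._allowed = {}  # SI label -> set of drive IDs it may see

    def grant(self, si, drive_id):
        """Make one drive visible to one system image."""
        self._allowed.setdefault(si, set()).add(drive_id)

    def revoke(self, si, drive_id):
        """Hide a drive again; a no-op if it was never granted."""
        self._allowed.get(si, set()).discard(drive_id)

    def may_access(self, si, drive_id):
        """Check a single request from an SI against the table."""
        return drive_id in self._allowed.get(si, set())

    def visible_drives(self, si, attached_drives):
        """Filter the attached drives down to those this SI may see."""
        allowed = self._allowed.get(si, set())
        return [d for d in attached_drives if d in allowed]


if __name__ == "__main__":
    table = DriveMaskTable()
    table.grant("SI-X", 0)
    table.grant("SI-X", 1)
    table.grant("SI-Y", 2)
    attached = [0, 1, 2, 3]
    print(table.visible_drives("SI-X", attached))
    print(table.visible_drives("SI-Y", attached))
```

Every grant here is one table entry per SI and drive, which hints at why this approach gets unwieldy as the drive count climbs.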

Like the free-for-all model of Figure 3, the per-drive-masking model of Figure 4 might be usable in certain controlled environments. As long as the number of disk drives connected is small (for example, the 1 to 15 drives SCSI supported), then this model is quite workable. Once the system grows beyond the bounds of directly connected disk drives, however, the complexity of this mechanism becomes cumbersome.

Furthermore, implementing the software to support a proprietary mechanism for a dozen or so disk drives is probably irritating but not prohibitive. Extending that software to tens, hundreds, or thousands of disk drives is likely more than any sane developer would take on.

Luckily, SAS provides a standard mechanism for access control, called zoning, which is nearly perfect for storage virtualization. SAS zoning is closely analogous to similarly named mechanisms in Fibre Channel and other storage-area network (SAN) technologies.

SAS, a point-to-point serial protocol designed as the successor to parallel SCSI, uses devices called expanders to connect additional devices. A typical SAS host adapter might implement eight ports, allowing the direct connection of eight disk drives. (Actually, SAS disk drives may use multiple ports to provide additional bandwidth, so those eight ports could even be fully utilized by having four higher-performance two-port disk drives attached.)

To provide more connectivity, SAS expanders would be used as shown in Figure 5, ignoring the colors for the moment. Each of these expanders is logically a switch, though without the high dollar cost associated with Fibre Channel switches. SAS expanders can optionally support a zoning capability, providing a means to limit access from specified hosts to specified targets, such as disk drives.

In SAS zoning, access is controlled per connection point on the expander (called a “PHY” in SAS-speak). Each expander maintains a table of which PHYs can communicate with other specific PHYs on that expander. By manipulating these tables on its zoning expanders, a SAS system can provide full access control.
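The per-PHY table just described can be modeled roughly as below. This Python sketch follows the article’s simplified description and makes no attempt to reproduce the data structures or messages of the SAS standard itself. The `ZoningExpander` class, its method names, and the PHY numbering are assumptions for illustration, as is the choice to keep permissions symmetric.

```python
class ZoningExpander:
    """Toy expander holding a PHY-to-PHY permission table (default: deny)."""

    def __init__(self, phy_count):
        self.phy_count = phy_count
        # permit[a][b] is True when PHY a may open a connection to PHY b.
        self._permit = [[False] * phy_count for _ in range(phy_count)]

    def _check(self, phy):
        if not 0 <= phy < self.phy_count:
            raise IndexError(f"no such PHY: {phy}")

    def allow(self, phy_a, phy_b):
        """Permit traffic between two PHYs (kept symmetric in this sketch)."""
        self._check(phy_a)
        self._check(phy_b)
        self._permit[phy_a][phy_b] = True
        self._permit[phy_b][phy_a] = True

    def can_connect(self, src, dst):
        """Would the expander route a connection request from src to dst?"""
        self._check(src)
        self._check(dst)
        return self._permit[src][dst]

    def reachable_from(self, src):
        """List every PHY that src is permitted to talk to."""
        self._check(src)
        return [p for p in range(self.phy_count) if self._permit[src][p]]


if __name__ == "__main__":
    expander = ZoningExpander(phy_count=6)
    # Hypothetical wiring: PHYs 0 and 1 face hosts, PHYs 2 to 5 face drives.
    expander.allow(0, 2)
    expander.allow(0, 3)
    expander.allow(1, 4)
    print(expander.reachable_from(0))
    print(expander.can_connect(1, 2))
```

Because permissions attach to connection points rather than to drive identities, swapping the drive on an already-zoned PHY needs no table change in this model.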

SAS zoning is configured via special SAS messages that extend the existing Serial Management Protocol (SMP) inherent in SAS already. The protocol already comprehends the idea of a protected “supervisor” as the only agent allowed to reconfigure the zones.

Because SAS zoning is done per connection point, adding or removing devices automatically triggers zone re-analysis and potentially zone reconfiguration. Thus, new disk drives may be added to a zone without disrupting other zones—or even alerting them that the system configuration changed.

Several articles could be written about SAS zoning alone. But for the purposes of this article, suffice it to say that zoning provides full host-to-disk isolation and access control (Fig. 5, again), with colors representing each zone.

Following these steps, it’s clear that mapping SAS zones to the SIs of PCI Express I/O virtualization provides a full-featured implementation of storage virtualization. Figure 6 shows the full picture of a SAS IOV controller. The SAS controller provides one or more logical SAS expander(s) internally with slight tweaks to map SIs as if they were PHYs. Each SI then sees only a portion of the total storage pool, without the need for a software intermediary filter. Furthermore, this has been accomplished using existing standardized mechanisms.

While this example used a plain SAS controller, a SAS RAID controller could be used as well. Such a RAID controller would likely present its RAID sets as if they were simple disk drives behind the same type of internal logical SAS expander as was used in the plain controller.
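The mapping of SIs onto zones inside the controller can be pictured with a short Python sketch. It is a loose model under stated assumptions, not a description of a real controller: the `LogicalZoningController` class and the endpoint names are hypothetical, and giving each endpoint exactly one zone is a simplification. SIs and storage targets (plain drives or RAID sets) are treated alike as zone members, and an SI sees only targets in its own zone.

```python
class LogicalZoningController:
    """Toy model of an internal logical expander that zones SIs like PHYs."""

    def __init__(self):
        self._zone_of = {}  # endpoint name (SI or target) -> zone label

    def assign(self, endpoint, zone):
        """Place an SI or a storage target into a zone."""
        self._zone_of[endpoint] = zone

    def targets_visible_to(self, si, targets):
        """Return the targets sharing a zone with the given SI."""
        zone = self._zone_of.get(si)
        if zone is None:
            return []  # an unzoned SI sees nothing
        return [t for t in targets if self._zone_of.get(t) == zone]


if __name__ == "__main__":
    controller = LogicalZoningController()
    # Hypothetical zones, named after the colors used in the figures.
    controller.assign("SI-X", "blue")
    controller.assign("SI-Y", "green")
    controller.assign("disk0", "blue")
    controller.assign("raid-set-A", "blue")
    controller.assign("disk1", "green")
    targets = ["disk0", "disk1", "raid-set-A", "disk2"]
    print(controller.targets_visible_to("SI-X", targets))
    print(controller.targets_visible_to("SI-Y", targets))
```

An unassigned target such as `disk2` is visible to no SI, matching the default-deny behavior one would want from hardware isolation.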