Understanding the Effects of the IO Stack on Workloads and SSD Performance

According to IDC, shipments of SSDs are expected to increase by a compound annual growth rate of 52 percent from 2010 to 2015 on a worldwide basis. With the resulting mass-proliferation of SSD storage solutions for enterprise applications, understanding critical performance components under a wide variety of workloads and demand intensities will enable end users to optimize tests to best measure their specific environment.

This article discusses the marked differences between "Synthetic Device Level" and "Real World - Application Level" workload performance measurement. We focus on the path taken by the IO stimulus through the software / hardware stack and its effect on the SSD test environment.

SSD performance is highly dependent on the workload applied and the method of application. Here, we investigate how workloads may be affected by host computer variables. Note: for the remainder of this article, the terms "real world workload" and "user workload" are used interchangeably.

What is a "User Workload"?

A user workload is an IO stimulus generated at the application level and applied to the device under test ("DUT") that traverses the same IO stack (including host side caching and file systems) as would any IO generated by a user application outside of a performance measurement environment. Further, a user workload based test requires that the DUT have a file system installed and that all interactions with the DUT exactly mirror the interactions and IO path(s) that will occur when the DUT is deployed in the end user system.

What is a "Synthetic Workload"?

A synthetic workload is a known and repeatable IO stimulus that is targeted directly at the block-level IO device. The set of test parameters completely describes the test stimulus. Such parameters commonly include:

Block size (typically measured in kilobytes)
Traffic type, i.e. the read/write ratio
Degree of randomness of the access pattern (random or sequential accesses)
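
As a minimal sketch of how such a parameter set can fully describe a stimulus, the Python below models the three parameters and generates a repeatable sequence of IO requests from them. The class name, field names, fixed seed, and 1 GiB target span are illustrative assumptions and do not correspond to any particular test tool.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticWorkload:
    """Hypothetical container for the three parameters listed above."""
    block_size_kib: int   # block size in KiB
    read_pct: int         # percentage of IOs that are reads (rest are writes)
    random_pct: int       # percentage of IOs issued at a random offset

    def generate(self, count, span_bytes=1 << 30, seed=0):
        """Return `count` (op, offset_bytes, length_bytes) tuples.

        The same parameters and seed always yield the same sequence,
        which is what makes the stimulus repeatable.
        """
        rng = random.Random(seed)
        block = self.block_size_kib * 1024
        blocks_in_span = span_bytes // block
        next_block = 0
        requests = []
        for _ in range(count):
            op = "read" if rng.randrange(100) < self.read_pct else "write"
            if rng.randrange(100) < self.random_pct:
                next_block = rng.randrange(blocks_in_span)
            requests.append((op, next_block * block, block))
            next_block = (next_block + 1) % blocks_in_span
        return requests


# An OLTP-like stimulus: 8 KB blocks, 67% read, 100% random.
oltp = SyntheticWorkload(block_size_kib=8, read_pct=67, random_pct=100)
first_run = oltp.generate(1000)
second_run = oltp.generate(1000)
print(first_run == second_run)  # True: identical stimulus on every run
```

Seeding the generator is one simple way to obtain repeatability; a real test tool may achieve the same property differently.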

What is a "Synthetic Device Level Test"?

Synthetic device level testing refers to the use of a synthetic workload (as defined above) to measure the performance of any device, in particular an SSD, at the block IO level (not at the file system or OS level). In this type of testing, the test sponsor applies synthetic workload parameters that relate to the type of "IO traffic of interest." This synthetic workload is often a mix of random and sequential reads and writes in various block sizes, similar to the IO traffic expected to access the block level DUT when it is used in the end user system.

Synthetic workload mapping

File System / Application level workloads will traverse the software/hardware stack and present a given IO stimulus to the Device Under Test. While these IOs may vary over time and by workload, general patterns have been observed that associate a given read/write ratio, block size, and access pattern with workload types. The following table shows general workload characteristics at the device level that have been observed to be associated with certain user application level workloads.

Synthetic Workload Parameters

Block Size (Kbytes)	Read/Write Ratio	Degree of Randomness	Typical User Workload
8	67% Read / 33% Write	100% Random	Database, commonly referred to as "OLTP" (online transaction processing)
128	90% Read / 10% Write	Random or Sequential	Video stream cache
128	50% Read / 50% Write	100% Sequential	Data backup
64	95% Read / 5% Write	Sequential	System startup (boot)
4	50% Read / 50% Write	100% Random	Virtual memory (disk paging)

Table 1 shows the access patterns observed at the block IO level for different user workloads. These block level IO stimuli may not be identical to user workloads at the application or file system level and can change depending upon how the IOs traverse the IO software/hardware stack.

Characteristics of file-system level testing

File system level testing differs from testing at the device level. File system level testing generally involves directly issuing specific file IO operations through the file system, targeted towards the DUT.

Because SSDs can be two or more orders of magnitude faster than conventional HDDs (SSDs can regularly reach 40,000 IOPS, compared with approximately 300 IOPS for a conventional HDD) and can have much lower latencies (SSD latencies are usually measured in microseconds, HDD latencies in milliseconds), timing issues and the effects of outstanding concurrent IOs are much more important in SSD performance measurement. Thus, SSDs may be more susceptible to file system effects, and may encounter them more frequently. These effects, which include caching, fragmentation, split IOs and the like as discussed below, can alter the nature of the outstanding IOs.

These variables and their effects on SSD performance are generally related to the following:

The test application/test stimulus generator itself
Various components/drivers that sit "between" the stimulus generator and the DUT
The interactions of these components within the OS software stack
Characteristics of each file system
The underlying computer hardware platform

As an example of how these interactions may affect the resultant I/O applied to the DUT (that is, the transfer function between the I/O generated and the I/O "seen" at the DUT level is not unity), consider the diagram in Figure 1. Some of the specific variables that can impact SSD performance testing at the file system level, as well as application I/O performance in general, are:

User Workloads. A primary interest for many, if not most, end users in comparing performance amongst SSDs is to determine and substantiate the performance benefits that can be gained while operating within their specific computing environments using their particular applications of interest. However, the range and diversity of applications that are available along with the particular manner in which they are actually used can introduce a significant set of factors that can impact application I/O performance.

Fragmentation: As one example, a single file IO operation issued by an application may, on the one hand, require multiple IO operations to the physical device due to file fragmentation. The same IO could also result in no physical device access at all due to various caching strategies that may be implemented at the OS or driver level. Furthermore, the drivers can also split or coalesce IO commands, which can result in the loss of 1:1 correspondence between the originating I/O operation and the physical device access.
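
To illustrate the fragmentation case, the sketch below maps one file-level request onto device-level requests using a simplified extent map. The function name, the extent tuple layout, and the example offsets are hypothetical; real file systems track extents in their own structures.

```python
def split_by_extents(file_offset, length, extents):
    """Map one file-level IO onto device-level IOs.

    `extents` is a list of (file_start, device_start, extent_length)
    tuples in bytes, sorted by file_start: a simplified, hypothetical
    stand-in for a file system's extent map.
    """
    device_ios = []
    end = file_offset + length
    for file_start, device_start, extent_length in extents:
        lo = max(file_offset, file_start)
        hi = min(end, file_start + extent_length)
        if lo < hi:  # the request overlaps this extent
            device_ios.append((device_start + (lo - file_start), hi - lo))
    return device_ios


# A 64 KiB file stored in three non-contiguous pieces on the device.
KIB = 1024
fragmented = [
    (0,        500 * KIB, 16 * KIB),
    (16 * KIB, 100 * KIB, 32 * KIB),
    (48 * KIB, 900 * KIB, 16 * KIB),
]
contiguous = [(0, 500 * KIB, 64 * KIB)]

print(len(split_by_extents(0, 64 * KIB, contiguous)))  # 1
print(len(split_by_extents(0, 64 * KIB, fragmented)))  # 3
```

The same 64 KiB application read becomes one device access or three, depending only on how the file happens to be laid out.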

Timing: Various timing considerations can have a notable impact upon the manner in which IO operations traverse the OS software stack (Fig. 1). For instance, while several applications can each be performing sequential access IO operations to their respective files, these concurrent IO operations can be observed to arrive in a more random access pattern at the lower-level disk drivers and other components (due to system task switching, intervening file system metadata I/O operations, etc.).

Figure 1. Timing considerations can have a notable impact upon the manner in which IO operations traverse the OS software stack.
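
The timing effect described above can be sketched with a toy model: several purely sequential streams, once interleaved, no longer look sequential at the device. The round-robin interleaving and the 1 GB spacing between files are simplifying assumptions; real task switching is far less regular.

```python
def sequential_fraction(requests):
    """Fraction of requests that start exactly where the previous one ended."""
    if len(requests) < 2:
        return 1.0
    hits = sum(
        1
        for (prev_off, prev_len), (off, _) in zip(requests, requests[1:])
        if off == prev_off + prev_len
    )
    return hits / (len(requests) - 1)


def sequential_stream(start, block, count):
    """One application's sequential reads as (offset, length) tuples."""
    return [(start + i * block, block) for i in range(count)]


def interleave(streams):
    """Round-robin merge, a crude stand-in for task switching between apps."""
    merged = []
    for group in zip(*streams):
        merged.extend(group)
    return merged


BLOCK = 64 * 1024
# Four applications, each reading its own file sequentially (hypothetical layout).
streams = [sequential_stream(app * 1_000_000_000, BLOCK, 100) for app in range(4)]

print(sequential_fraction(streams[0]))           # 1.0
print(sequential_fraction(interleave(streams)))  # 0.0
```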

Concurrency: All hosts process I/Os with some degree of concurrency. However, the degree of concurrency can be highly dependent on the host characteristics. For example, the number of processing cores and the ability to parallelize execution threads can each have an effect on how efficiently concurrent I/Os are processed, affecting the (apparent) performance of the DUT.

Caching: System caches may intercept small I/Os directed at the DUT, returning the requested data without ever actually accessing the DUT.
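
A small simulation shows how strongly a host cache can filter the reads that reach the DUT. The LRU policy, cache sizes, and access pattern below are assumptions chosen for illustration; actual OS caches use more elaborate policies.

```python
from collections import OrderedDict


def device_reads(block_reads, cache_blocks):
    """Return the block reads that miss a simple LRU cache and reach the DUT."""
    cache = OrderedDict()
    reached_device = []
    for block in block_reads:
        if block in cache:
            cache.move_to_end(block)      # cache hit: the DUT never sees it
            continue
        reached_device.append(block)      # cache miss: read goes to the DUT
        cache[block] = True
        if len(cache) > cache_blocks:
            cache.popitem(last=False)     # evict the least recently used block
    return reached_device


# An application re-reads the same eight blocks 100 times.
app_reads = list(range(8)) * 100

print(len(app_reads))                    # 800
print(len(device_reads(app_reads, 16)))  # 8
print(len(device_reads(app_reads, 4)))   # 800
```

With a cache larger than the working set, the DUT sees only the first pass; with a smaller one, this cyclic pattern defeats LRU entirely and every read reaches the device.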

Coalescing: It is also possible that file system level tests can transparently coalesce smaller transfers, effectively modifying the stimulus from a smaller block, random transfer to a larger block, more sequential transfer.
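
Coalescing can be sketched as merging back-to-back requests into larger ones. The function below is a simplified illustration, not the behavior of any specific driver or IO scheduler, and the 128 KiB merge cap is an arbitrary assumed limit.

```python
def coalesce(requests, max_bytes=128 * 1024):
    """Merge back-to-back requests into larger ones, up to `max_bytes`.

    `requests` is a list of (offset, length) tuples in arrival order.
    The 128 KiB cap is an arbitrary illustrative limit.
    """
    merged = []
    for offset, length in requests:
        if merged:
            last_offset, last_length = merged[-1]
            contiguous = offset == last_offset + last_length
            if contiguous and last_length + length <= max_bytes:
                merged[-1] = (last_offset, last_length + length)
                continue
        merged.append((offset, length))
    return merged


KIB = 1024
# Sixteen 4 KiB writes to adjacent offsets, as a test tool might issue them.
small_writes = [(i * 4 * KIB, 4 * KIB) for i in range(16)]

print(coalesce(small_writes))  # [(0, 65536)]
```

Sixteen small transfers issued by the test become a single 64 KiB transfer at the device, so the DUT is measured under a different block size than the one the test specified.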

In summary, file system level testing can dramatically, and potentially inconsistently, alter the nature of the I/O. When executing file system level testing, one potentially loses the 1:1 correspondence between the I/O generated and the I/O applied to the DUT. It is both the alteration of the I/O and its potentially inconsistent transfer function that may lead to imprecise results.

If the primary goal of end users is to properly and prudently match the performance needs of their particular applications to their specific storage purchases, file system level testing may lead to incorrect conclusions. The natural inclination is to map (i.e., correlate) the advertised/reported performance metrics of storage devices (e.g., IOPS, MB/s, etc.) directly to the presumed workload characteristics of their applications. Doing so is extremely difficult because the IO activity that stems from applications is subject to a wide and inconsistent series of variables and effects as the IO operations traverse the OS software stack. This seemingly "natural" mapping is therefore very imprecise.

This can easily be confirmed by collecting empirical IO operation performance metrics from the application perspective and comparing them at various key points within the OS software stack. Such a comparison of sample I/O traces will show the lack of a unity transfer function as the I/O migrates from the application, to the DUT, and back.
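
One rough way to express such a comparison is as ratios of device-level to application-level metrics. The sketch below assumes traces are available as simple lists of (offset, length) pairs; the function names, the choice of metrics, and the example traces are hypothetical.

```python
def summarize(trace):
    """Summarize a trace given as a list of (offset_bytes, length_bytes)."""
    count = len(trace)
    total = sum(length for _, length in trace)
    return {
        "ios": count,
        "bytes": total,
        "mean_io_bytes": total / count if count else 0.0,
    }


def compare(app_trace, device_trace):
    """Ratio of device-level to application-level metrics.

    Ratios of 1.0 would indicate a unity transfer function for that metric.
    """
    app, dev = summarize(app_trace), summarize(device_trace)
    return {key: dev[key] / app[key] for key in app if app[key]}


# Hypothetical traces: eight 4 KiB application writes observed at the
# device as a single 32 KiB write (for example, after coalescing).
app_trace = [(i * 4096, 4096) for i in range(8)]
device_trace = [(0, 32768)]

print(compare(app_trace, device_trace))
# {'ios': 0.125, 'bytes': 1.0, 'mean_io_bytes': 8.0}
```

Here the byte count is preserved, but the IO count and mean IO size are not, which is one concrete form a non-unity transfer function can take.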

The advantages of synthetic, block level testing

In contrast to user workload testing and the resultant, potentially wide variance between tests and the potential loss of 1:1 correspondence between the generated I/O and the actual I/O applied to the DUT, the use of Synthetic device level testing affords several advantages:

The use of a known and repeatable stimulus, providing consistent results across multiple test runs and multiple DUTs. This enables direct, consistent, DUT-to-DUT performance characterization.
The profile of the stimulus can be managed to reflect aspects of the user's workload, enabling comparison of a single DUT to multiple real-world use cases. This increases the efficiency of the test process because one can establish the DUT's performance across the potential use cases and do so non-disruptively and in a single test run.
Defining workload stimulus at the block IO level eliminates the effects of the software/hardware stack on the test stimulus. Different test environments may treat application level IOs differently, thus presenting a different test stimulus at the device level. Use of device level IO characterization ensures that the test sponsor is directly testing the DUT with a known test stimulus, not one that is changed as it traverses the IO software/hardware stack.

Furthermore, knowing your workload characteristics at the device level and comparing those to the synthetic test results enables:

Knowledge of which tests best determine a given SSD's performance strengths and weaknesses
Accurate comparison of different SSDs
The correct selection of the best SSD for several different workloads

SSDs have introduced a heretofore unheard of level of storage device performance. Whereas IOPS used to be measured in the hundreds (at best), SSDs have increased single drive performance by at least two orders of magnitude. However, this dramatic performance improvement comes at a substantial cost. That cost, combined with the "drop in" appearance of SSDs, requires that greater importance be placed on understanding the variables that affect performance measurement.

Although the natural tendency is to measure SSD performance the same way that one traditionally measures conventional storage device performance - i.e. through the complete software I/O stack - a different approach is required since SSD performance can be greatly affected by host side variables.

Synthetic workload testing applied directly to the DUT (treated as a block level device) ensures that host side variables are minimized. This results in a more precise measurement with far less run-to-run and host-to-host variance.

Synthetic tests can still be mapped to expected application traffic and, in fact, offer a clearer understanding of overall performance expectation. This is due to their consistent results and their ability to measure several workloads in a single test run.

The use of synthetic workload tests is preferred because end user workloads are extremely diverse. Synthetic tests enable one to accurately and consistently measure SSD performance under the widest variety of workloads and demand intensities. The end user, knowing the attributes of their particular IO profile, can then select the test results that best represent their environment.
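
As a rough sketch of that final selection step, the code below picks the row of Table 1 that most closely resembles a user's device-level IO profile. The dissimilarity measure is an ad hoc assumption (block sizes compared by ratio, percentages by difference), and the example profile is hypothetical; other weightings are equally defensible.

```python
import math

# Rows of Table 1: (workload, block size in KB, read %, random % or None
# where the table says "Random or Sequential").
TABLE_1 = [
    ("Database (OLTP)",    8,   67, 100),
    ("Video stream cache", 128, 90, None),
    ("Data backup",        128, 50, 0),
    ("System startup",     64,  95, 0),
    ("Virtual memory",     4,   50, 100),
]


def distance(profile, row):
    """Ad hoc dissimilarity between a user's IO profile and a table row."""
    block_kb, read_pct, random_pct = profile
    _, row_block, row_read, row_random = row
    terms = [
        math.log2(block_kb / row_block),   # block sizes compared by ratio
        (read_pct - row_read) / 100,
    ]
    if row_random is not None:             # None matches any access pattern
        terms.append((random_pct - row_random) / 100)
    return math.sqrt(sum(t * t for t in terms))


def closest_row(profile):
    """Name of the Table 1 workload nearest to (block KB, read %, random %)."""
    return min(TABLE_1, key=lambda row: distance(profile, row))[0]


# A hypothetical user profile: 16 KB blocks, 70% reads, 90% random.
print(closest_row((16, 70, 90)))  # Database (OLTP)
```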