Today's online, connected lifestyle delivers unprecedented amounts of raw data on consumer preferences to modern businesses. However, companies working to harness this information must first winnow innumerable petabytes of raw data to glean concentrated, actionable intelligence. Data such as server logs, click streams, search histories, likes, links, and postings contain valuable information about customer interests and preferences. However, this data does not fit comfortably into the database management systems designed for more structured IT tasks, such as accounting, customer relationship management, and enterprise resource planning.
The explosion in the volume of this unstructured data has prompted the development of new data management and processing tools to convert this vast amount of raw information into intelligence. Foremost among these new tools is the Apache Hadoop framework. Hadoop provides the tools necessary to process data sets that are much too big to fit in the file system of a single server. It harnesses the power of large clusters of servers to sift through vast amounts of data to distill essential intelligence.
This article discusses techniques to accelerate the Apache Hadoop framework using hardware accelerated data compression and a transparent compression/decompression file system (CeDeFS).
Hadoop Changes Traditional Assumptions
The sheer scale of Big Data and the need to process it in a timely way is changing many of the assumptions of enterprise computing. For example, Hadoop's designers found that it is more efficient to keep data very close to the processor rather than the traditional method of moving it across a storage network using large SAN or NAS servers. They prefer to load each server in the cluster with as much local disk storage as it can carry. They also prefer to use simpler and cheaper servers in larger arrays, rather than fewer super-powered compute servers and large network connected storage servers.
Accelerating Hadoop with Data Optimization
We begin with the observation that many Hadoop jobs are I/O bound. Moreover, many Hadoop map/reduce computations, though they may process huge amounts of data, are computationally fairly simple. Sorting, indexing, word and tuple counting, and partial ordering are the fundamental operations of data mining. These operations derive their effectiveness not so much from the computational complexity they can handle, but from the vast amount of data they can screen. Hadoop is designed to solve problems where the volume of data is too large to reside on a single server's file system. As much data as possible is stored on directly attached disks, and efforts are made to ensure that any given computation executes on or near the node that contains the data to be processed. This approach avoids transferring large amounts of data over the network. Even so, the low computational complexity of the map/reduce computations means that the processors must often wait for disk I/O to complete. Hence, accelerating the delivery of data from storage to the CPU speeds up the local nodes and, consequently, the whole cluster.
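To make the point concrete, the canonical word-count job can be sketched in a few lines of Python (a hypothetical stand-in for the Java map/reduce code a real Hadoop job would use); note how little CPU work is done per byte of input:

```python
from collections import Counter
from itertools import chain

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word -- trivial CPU work per byte
    return ([(word, 1) for word in line.split()] for line in lines)

def reduce_phase(pairs):
    # Reduce: sum the counts for each word
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

lines = ["big data big cluster", "data node"]
counts = reduce_phase(chain.from_iterable(map_phase(lines)))
# counts["big"] == 2, counts["data"] == 2, counts["node"] == 1
```

In a real cluster the input lines stream off local disk, so the speed of this loop is set almost entirely by how fast those bytes arrive, not by the computation itself.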
The Hardware Acceleration Approach
A new approach to Hadoop acceleration starts with packing as much data as possible into the individual nodes of the cluster. The approach combines a compression/decompression file system and a hardware compression accelerator installed on each Hadoop node. A compression/decompression file system is a filter layer that manages the mapping of compressed data blocks onto the native file system. Using gzip compression, the new approach can pack up to five times more data onto the local disks than can be stored in uncompressed form. This yields two benefits.
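The achievable packing factor is easy to check against a sample of one's own data. The sketch below is an illustration only, using Python's zlib module, which implements DEFLATE, the same algorithm used by gzip; real ratios vary by workload:

```python
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    # zlib implements DEFLATE, the algorithm underlying gzip
    return len(data) / len(zlib.compress(data, level))

# Repetitive, log-like data compresses very well
sample = b'10.0.0.1 - GET /index.html HTTP/1.1 200 5120\n' * 10000
ratio = compression_ratio(sample)  # far above 5:1 for input this repetitive
```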
First, the storage capacity of the local cluster is increased by a factor of 4 to 5. This not only allows much more data to be processed locally, it also yields immediate savings of up to $7,000 per node in disk drive costs.
The second, perhaps less obvious, benefit is that compression actually accelerates the delivery of data to the CPU. Since data is stored compressed on disk, each I/O operation delivers up to five times as much data to the CPU as would be possible without compression. Because the approach uses asynchronous hardware acceleration, the time required to decompress data overlaps with other processing. Consequently, the data delivery rate to the CPU is higher without adding significantly to the CPU's workload. The Hadoop node thus spends less time waiting for I/O and more time executing map/reduce instructions, leading directly to faster execution of map/reduce tasks. Early experience with Hadoop benchmarks indicates that execution time savings of up to 50 percent are achievable.
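A back-of-envelope model shows why: if decompression is fully overlapped by the accelerator, the logical read rate is simply the physical disk rate multiplied by the compression ratio. The figures below are hypothetical:

```python
def effective_read_rate(disk_mb_per_s: float, ratio: float) -> float:
    # Each physical I/O delivers ratio-times more logical data, assuming
    # decompression is fully overlapped by the hardware accelerator
    return disk_mb_per_s * ratio

# Hypothetical 150 MB/s disk with 5:1 compression
logical_rate = effective_read_rate(150.0, 5.0)  # 750.0 MB/s
```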
The Software Compression Approach
The idea of compressing Hadoop data sets is not new. Hadoop offers native compression codecs as well as add-in libraries that allow the user to compress data. So why use a hardware-accelerated approach? Applying compression via a software codec inserted into the application code differs in many respects from using an embedded hardware accelerator with file system compression. Each approach has pros and cons, and we examine them here.
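For reference, enabling a software codec in Hadoop is a configuration exercise. The fragment below shows the Hadoop 2.x property names for compressing intermediate map output with Snappy; it is an illustration, not a tuning recommendation:

```xml
<!-- mapred-site.xml: compress intermediate map output with Snappy -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```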
The greatest disadvantage of the software compression approach is that it uses the same processor, and in fact the same thread, that runs the application to compress or decompress data. Consequently, all of the work of compressing and decompressing is simply added to the CPU load. The user may employ multiple threads to compress and decompress data, and thereby apply parallel processors to the task, but this comes at the expense of increased complexity in the application code. Furthermore, software compression may yield little, if any, improvement in I/O speed because the reading thread must serially read, decompress, and process the incoming data. There is some gain from reading compressed data from the disk, but it is largely offset by the additional latency of decompressing in the same thread.
Since software compression can be a big drain on CPU resources, the most popular compression algorithms used with Hadoop are LZO and Snappy. They are very fast but do not achieve particularly good compression ratios, which limits any I/O acceleration. In this tradeoff between CPU load and compression ratio, the fast, low-ratio algorithms deliver better overall performance, but the compression work itself remains a bottleneck that must be eliminated to realize the maximum benefit for Hadoop.
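The speed-versus-ratio tradeoff is easy to demonstrate. Lacking LZO or Snappy bindings in the Python standard library, the sketch below uses zlib's fastest and slowest settings as a stand-in; LZO and Snappy sit even further toward the speed end of this curve:

```python
import time
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 20000  # ~880 KB

for level in (1, 9):  # 1 = fastest/weakest, 9 = slowest/strongest
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    print(f"level {level}: ratio {ratio:.1f}, {elapsed * 1000:.2f} ms")
```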
The one advantage that the Hadoop software compression codecs have is in network optimization during the shuffle phase, where a lot of data is exchanged between nodes. If the data is compressed by a native codec, it can remain compressed in this exchange. This consumes less network bandwidth and results in a faster shuffle, but again at the cost of significantly increased software complexity. Why? Because the data must be compressed when first read, decompressed for map processing, compressed again for the shuffle phase, decompressed for reduce processing, and so on, with every operation executed by the CPU.
By contrast, a compression file system approach using hardware acceleration handles all compression and decompression operations transparently. The simple act of reading or writing a file causes the underlying data to be compressed or decompressed automatically, with no other user intervention. The compression/decompression file system can also apply heuristics to ensure optimal storage, such as preventing the expansion of uncompressible data. Asynchronous I/O operation and parallel processing result in zero net increase in I/O latency. On the other hand, data is transferred uncompressed over the network during the shuffle phase, consuming more network bandwidth and time.
The Solid-state Disk Approach
Another proposal for accelerating Hadoop is to use solid-state disks (SSDs), either as a cache or as primary storage. This has the advantage of speeding up I/O by a factor of perhaps 2.5 to 3 for data read from the SSD. Unfortunately, SSDs cost about $3 per gigabyte versus 10 to 20 cents per gigabyte for hard drives, roughly a 15x premium. Multiplied over a large Hadoop cluster with hundreds or even thousands of nodes, SSD storage is not economically practical.
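The arithmetic makes the gap vivid; the node capacity below is hypothetical, with per-gigabyte prices taken from the figures above:

```python
def storage_cost(capacity_gb: float, dollars_per_gb: float) -> float:
    return capacity_gb * dollars_per_gb

# Hypothetical 24 TB node (e.g., twelve 2 TB drives)
hdd_cost = storage_cost(24_000, 0.20)  # $4,800
ssd_cost = storage_cost(24_000, 3.00)  # $72,000
premium = ssd_cost / hdd_cost          # 15x
```

At these prices, filling a single node with flash costs more than many entire HDD-based nodes, and the premium multiplies across every node in the cluster.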
Could we use SSDs purely as a caching device to speed up I/O? Unfortunately, Hadoop does not cooperate well with caches because its predominant mode of file access is to write or read an entire file sequentially. The resulting long streaming operations cannot be cached efficiently: once the size of the I/O exceeds the capacity of the cache, the overall system speed is limited by the speed of the hard disk, and the cache serves no useful purpose.
By contrast, compressing data on a hard disk achieves I/O acceleration proportional to the compression ratio, and this acceleration can be sustained across the entire capacity of the device. When we consider that Hadoop data sets can often be compressed by a factor of 4 to 5, the net result is probably better than could be achieved even by SSD primary storage.
In closing, the advantages of a hardware-accelerated compression file system are not limited to Hadoop. Any big data application that can use larger local storage stands to gain, as does any I/O-bound application that spends too much time waiting for disk I/O to complete, including the NoSQL-style databases that share Hadoop's storage access patterns.