Distributed computing platforms serve big-data applications

Feb. 22, 2015

Data storage is inexpensive: you can store a terabyte of data on a disk costing $50 or less. But reading that disk end-to-end could take six hours, according to Matei Zaharia, an assistant professor at MIT, speaking in a lecture titled “Distributed Computing Platforms” as part of the online course “Tackling the Challenges of Big Data,” which runs through March 17.

“To do anything quickly with these large data sets, we need to parallelize them,” he added, explaining that it’s not just a matter of speeding up queries but also of loading data, transforming it, indexing it, and performing complex analytics functions that might require custom code. He cited Google as an example of parallelization at scale, where a single internal application might use more than 1,000 nodes.

He cited problems with traditional network programming based on message passing between nodes. Depending on topology, moving data could be too slow. Failures must be handled: a single server might fail only once every three years, but across 10,000 nodes that works out to roughly 10 failures per day (10,000 nodes divided by about 1,100 days between failures per node). And there may be stragglers, nodes that run slower than the others.

New programming models have been developed to deal with these problems, he said, including data-parallel models of the type used in MapReduce, which operates on key-value records. Here, the programmer specifies an operation to run on all the data, and the system schedules the work and handles concerns such as recovery from failures.

He cited examples such as Hadoop (the open-source implementation of MapReduce) and the associated Hive and Pig. He described Microsoft’s Dryad model (with DryadLINQ built on top) as a generalization of MapReduce that can express more types of computations. Spark, he added, generalizes Dryad further and supports in-memory data sharing between computations. And Google developed the Pregel model for large-scale graph processing, with Giraph and other open-source systems following or extending it.

In addition, Google developed Dremel for large-scale SQL queries, with the open-source Impala and Tez implementing similar models. And Storm and Samza make use of the data-parallel model for real-time stream processing.

Zaharia cited the example of a simple MapReduce program that counts words in an input file. Given a line of text, the Map function outputs each word as the key and 1 as the value; the Reduce function adds the values for each key to produce that word’s count. Despite the seeming simplicity, there are subtleties involved. MapReduce sends the Map task to cluster nodes based on data locality to avoid sending data over the network. It also provides load balancing and will reassign a failed task to another node.
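As a rough illustration of that model (not code from the lecture), a word count might look like the local Python sketch below. The function names and the in-process grouping step are assumptions made for clarity; a real framework such as Hadoop would distribute the Map and Reduce calls across cluster nodes and handle the shuffle itself.

```python
# Minimal local sketch of the MapReduce word-count example.
# Illustrative only: a real framework distributes these steps across nodes.
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the values emitted for a single key (word)."""
    return word, sum(counts)

def word_count(lines):
    groups = defaultdict(list)          # stand-in for the shuffle phase
    for line in lines:
        for word, one in map_fn(line):
            groups[word].append(one)
    return dict(reduce_fn(w, c) for w, c in groups.items())

print(word_count(["to be or not to be", "to do is to be"]))
# {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}
```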

MapReduce is good at building single-pass analytics applications, Zaharia said, but for real-world applications requiring multiple MapReduce steps it has limitations related to programmability and performance. Responses have been to build higher-level programming interfaces, such as Hive and Pig, over MapReduce, or to generalize the model itself, as Dryad and Spark do, to better support multiple passes. In particular, Spark addresses applications that need to reuse data efficiently, such as interactive data mining, and it offers an abstraction called resilient distributed datasets (RDDs), which can be stored on disk or in memory for fast reuse. As an example, he said a full-text search of Wikipedia (about 60 GB across 20 machines) using on-disk data can take about 30 seconds; loading the data into RDD memory cuts this time to about one second.
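A brief PySpark sketch of that in-memory reuse pattern follows; the file path, application name, and search terms are placeholders rather than details from the lecture.

```python
# Sketch of in-memory data reuse with Spark RDDs (PySpark).
# The path and search terms are placeholders for illustration.
from pyspark import SparkContext

sc = SparkContext(appName="wiki-search-sketch")

# Load the text into an RDD and ask Spark to keep it in memory after the
# first action, so later scans avoid rereading the disk.
lines = sc.textFile("hdfs:///data/wikipedia.txt").cache()

# The first query pays the cost of reading (and caching) the data.
print(lines.filter(lambda line: "distributed" in line).count())

# Subsequent queries reuse the cached partitions and run much faster.
print(lines.filter(lambda line: "MapReduce" in line).count())

sc.stop()
```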

He also described asynchronous computing, as implemented by systems such as GraphLab for graph processing and Hogwild for machine learning, and stream processing, as implemented by Storm.
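The stream-processing idea, stated simply, is to update results incrementally as records arrive rather than rescanning the full data set for every query. The plain-Python sketch below illustrates only that idea; it is not Storm or Samza code, and the event source is made up.

```python
# Illustration of incremental (streaming) aggregation, not Storm/Samza code.
from collections import Counter
import itertools
import random

def fake_stream():
    """Stand-in for a real message stream, e.g. events read from a queue."""
    events = ["sensor", "click", "error", "login"]
    while True:
        yield random.choice(events)

running_counts = Counter()
for event in itertools.islice(fake_stream(), 500):
    running_counts[event] += 1          # update state per record as it arrives

print(running_counts.most_common(3))    # current top event types so far
```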

“Together,” he concluded, “these systems are making it easier and easier for ordinary programmers to tackle the large-scale infrastructure needed for big data.”

About the Author

Rick Nelson | Contributing Editor

Rick is currently Contributing Technical Editor. He was Executive Editor for EE from 2011 to 2018. Previously, he served on several publications, including EDN and Vision Systems Design, and he has received awards for signed editorials from the American Society of Business Publication Editors. He began his career as a design engineer at General Electric and Litton Industries and earned a BSEE degree from Penn State.
