The recently concluded online course “Tackling the Challenges of Big Data” described why this is so and what you can do about it. Samuel Madden, faculty co-chair for the course and professor at the MIT Computer Science and Artificial Intelligence Laboratory, said big data is characterized by volume, velocity, and variety—you have a lot of data coming at you fast, and you may need to accommodate multiple data sets that don’t conform to a common schema.
In subsequent course modules, MIT faculty members discussed how to deal with these issues. The variety problem, for example, can be addressed through data-curation tools like Data Wrangler, which help validate, transform, and consolidate disparate data sets, said Mike Stonebraker.
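To make that consolidation step concrete, here is a minimal Python sketch of my own showing the kind of schema mapping such curation tools automate; the field names and records are invented for illustration and are not taken from Data Wrangler itself.

```python
# Illustrative only: two sources describe the same customers with
# different field names and value formats; map them to one schema.
RAW_SOURCES = [
    {"cust_name": "Acme Corp", "zip": "02139", "sales": "1,200"},
    {"customer": "Beta LLC", "postal_code": "94103", "revenue": 875},
]

FIELD_MAP = {
    "cust_name": "customer", "customer": "customer",
    "zip": "postal_code", "postal_code": "postal_code",
    "sales": "revenue", "revenue": "revenue",
}

def normalize(record):
    """Map each source's field names onto a common schema and
    coerce values to consistent types."""
    out = {}
    for key, value in record.items():
        target = FIELD_MAP.get(key)
        if target == "revenue" and isinstance(value, str):
            value = float(value.replace(",", ""))
        if target:
            out[target] = value
    return out

consolidated = [normalize(r) for r in RAW_SOURCES]
print(consolidated)
```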
As for storing data, said Matei Zaharia, you can store a terabyte of data on a $50 disk. But moving 1 TB over a 45-Mb/s T3 line would take a couple of days—something to keep in mind if you want to process your data in the cloud. An option, believe it or not, is to ship your physical disk to your cloud service provider.
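As a quick sanity check on that claim (my own arithmetic, not the course's), the transfer-time calculation works out as follows:

```python
# Back-of-the-envelope check: moving 1 TB over a 45-Mb/s T3 link.
terabyte_bits = 1e12 * 8        # 1 TB expressed in bits
t3_rate_bps = 45e6              # 45 Mb/s in bits per second

seconds = terabyte_bits / t3_rate_bps
print(f"{seconds / 3600:.1f} hours (~{seconds / 86400:.1f} days)")
# ~49.4 hours, roughly two days, ignoring protocol overhead
```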
As for processing, you can employ distributed computing platforms, said Zaharia, or scalable multicore systems, said Nickolai Zeldovich, but with the latter you need to prevent the performance collapse that can occur when multiple cores contend for shared resources.
Fast algorithms can help, too. Ronitt Rubinfeld cited linear-time algorithms, in which processing time is roughly proportional to the size of the input data, as the gold standard taught to undergraduates. But for noisy, constantly changing data, she said, sublinear-time algorithms, which inspect only a small sample of the input, can deliver adequate approximate answers fast.
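To make the distinction concrete, here is a generic Python sketch of my own (not an example from the course) contrasting a full linear scan with a fixed-size random sample that trades exactness for speed:

```python
import random

def exact_fraction_positive(data):
    """Linear time: touches every element once."""
    return sum(x > 0 for x in data) / len(data)

def sampled_fraction_positive(data, n_samples=1000):
    """Sublinear in spirit: inspects only a fixed-size random sample,
    trading exactness for speed on huge or changing data."""
    sample = random.choices(data, k=n_samples)
    return sum(x > 0 for x in sample) / n_samples

data = [random.gauss(0.1, 1.0) for _ in range(1_000_000)]
print(exact_fraction_positive(data), sampled_fraction_positive(data))
```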
Data compression is another approach. In 2012, said Daniela Rus, faculty co-chair of the course, the world generated 2.5 quintillion bytes of data per day. To cut huge volumes of data down to size, she advocated the concept of the coreset, a small weighted set of points in which each point stands in for many points of the original data set.
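As a much-simplified illustration of the idea (real coreset constructions are more sophisticated and come with error guarantees), the sketch below collapses a large one-dimensional data set into a few weighted representatives that still preserve the mean:

```python
import random
from collections import Counter

def bucket_coreset(points, bucket_width=0.5):
    """Toy 1-D coreset: collapse nearby points into one weighted
    representative per bucket, so weighted aggregates over a few
    representatives approximate aggregates over the full data set."""
    buckets = Counter(round(p / bucket_width) for p in points)
    return [(b * bucket_width, w) for b, w in buckets.items()]

points = [random.gauss(3.0, 1.0) for _ in range(100_000)]
coreset = bucket_coreset(points)

full_mean = sum(points) / len(points)
core_mean = sum(v * w for v, w in coreset) / sum(w for _, w in coreset)
print(len(coreset), full_mean, core_mean)  # far fewer points, similar mean
```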
Alternatives to compression, said Piotr Indyk, include sampling, as performed with the sparse FFT algorithm. Or you can employ streaming, in which an algorithm requiring limited memory examines data in a single pass, deriving a synopsis (a vocabulary list for the complete works of Shakespeare, for example).
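In the same spirit, a single-pass vocabulary builder might look like the sketch below; the file name is an assumption, and memory use grows with the synopsis rather than with the amount of text streamed past it:

```python
def vocabulary(stream):
    """Single-pass synopsis: memory grows with the vocabulary (the
    synopsis), not with the length of the text streamed past it."""
    vocab = set()
    for line in stream:
        for word in line.lower().split():
            vocab.add(word.strip(".,;:!?'\"()-"))
    vocab.discard("")
    return vocab

# Usage sketch (assumes a local plain-text copy of the works):
# with open("shakespeare.txt") as f:
#     print(len(vocabulary(f)), "distinct words")
```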
Machine learning offers an alternative to traditional programming for big data applications, and, according to Tommi Jaakkola, it’s useful for engineering problems that are hard to specify and solve directly. With traditional programming, said John Guttag, your input is data and a program, and your output is a result. With machine learning, your input is the data and the result, and your output is the program—thereby automating the programming function.
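A toy illustration of that inversion (mine, not an example from the course): given example inputs and the results some unknown rule produced, fit a simple rule to the pairs and hand back a callable "program":

```python
def learn_program(xs, ys):
    """Inputs are the data and the results; the output is a 'program':
    fit a least-squares line to the example pairs and return it as a
    callable, the role a hand-written program would otherwise play."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

# The 'result' column was produced by some rule we never wrote down.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]   # roughly y = 2x
predict = learn_program(xs, ys)
print(predict(6))                # about 12
```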
Jaakkola described using machine learning to generate website recommendations, and Guttag addressed medical applications. In addition, Regina Barzilay described machine learning for natural-language processing, and Andrew W. Lo detailed a financial application involving the estimation of credit-card risk.
The big-data concept presents big challenges of its own for anyone attempting to master the topic. The 20-hour online course that I took presented only a sparse sample of the subject. But it made clear that “big data” isn’t just more of the same. Big data is in your future, and it behooves you to seek out courses, webcasts, trade shows, or other opportunities to learn about the big-data aspects that will be most important to you.
Rick Nelson
Executive Editor