Coresets offer a path to understanding big data

March 10, 2015

If you have too much data to process in traditional ways, you can employ fast algorithms, as discussed in earlier posts, or you can compress your data and operate on the compressed version. The latter approach is the subject of a lecture by MIT Professor Daniela Rus on the topic "Big Data Analytics: Data Compression," part of the online course "Tackling the Challenges of Big Data," which runs through March 17. In it, she explained how to learn big-data patterns from tiny core sets, typically written as one word: coresets.

Why compress data? In 2012, she said, the world generated 2.5 quintillion bytes of data per day. And where does that data come from? Consider GPS data alone: each GPS data packet is about 100 bytes. Logging one packet per second works out to roughly 0.4 MB of data per hour, or about 0.01 GB of data per day, per device. Multiply that by 100 million devices (and more than 300 million smartphones were sold in 2010 alone) and you get on the order of a petabyte (1,000 TB) of data per day.
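
As a quick sanity check of those figures, here is the arithmetic in Python; the one-reading-per-second logging rate is an assumption made to keep the numbers explicit:

```python
# Back-of-the-envelope check of the GPS data volumes quoted in the lecture,
# assuming one 100-byte reading per second per device.
PACKET_BYTES = 100            # one GPS reading
READINGS_PER_SECOND = 1       # assumed logging rate
DEVICES = 100_000_000         # 100 million GPS-equipped devices

mb_per_hour = PACKET_BYTES * READINGS_PER_SECOND * 3600 / 1e6
gb_per_day = mb_per_hour * 24 / 1e3
pb_per_day_total = gb_per_day * DEVICES / 1e6

print(f"{mb_per_hour:.2f} MB per hour per device")    # ~0.36 MB/hour
print(f"{gb_per_day:.4f} GB per day per device")      # ~0.009 GB/day
print(f"{pb_per_day_total:.2f} PB per day in total")  # ~0.86 PB/day
```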

What happens to all this data? Storing it on a mobile phone or sensor node is expensive, transmitting it is expensive, and interpreting it is difficult. Rus then presented a big-data computation model that assumes an infinite stream of vectors, n of which have arrived so far. With only log n memory, we cannot store all the data, but with M processors working in parallel we can achieve an insertion time per point of approximately (log n)/M. The goal is to find the right compressed dataset C, derived from the big dataset D, such that if an algorithm A cannot feasibly compute A(D), it can instead compute A(C) ≈ A(D).
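
As a toy illustration of the A(C) ≈ A(D) idea (the grid-snapping compression below is a stand-in, not Rus's coreset construction), this sketch replaces 100,000 points with a few thousand weighted representatives and answers a simple aggregate query almost exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(100_000, 2))     # the "big" dataset D

# Toy compression: snap each point to a coarse grid and keep one weighted
# representative per occupied cell. (A real coreset is built differently,
# but the weighted-representative idea is the same.)
cells = np.round(D, 1)
C, weights = np.unique(cells, axis=0, return_counts=True)

# A stand-in query A(): total distance of all points from the origin.
def A(points, w=None):
    w = np.ones(len(points)) if w is None else w
    return float(np.sum(w * np.linalg.norm(points, axis=1)))

print(A(D))            # A(D), computed on all 100,000 points
print(A(C, weights))   # A(C) ~ A(D), computed on far fewer weighted points
print(f"{len(C)} weighted points stand in for {len(D)}")
```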

By solving the k-median queries problem with respect to the big dataset, she said, many points in the original can be replaced by a single weighted representative, yielding a coreset of reasonable size. Coresets can themselves be merged for further compression, and the merged coresets can be merged again. Each merge introduces additional error, but the errors remain small as long as the tree of merged coresets stays shallow. Applied to the incoming infinite data stream, this process yields a representation of reasonable size.
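
Here is a minimal sketch of that merge-and-reduce pattern; the reduction step is naive weighted sampling standing in for a true k-median coreset construction, and the chunk size, coreset size, and 64-chunk stream are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def reduce_coreset(points, weights, size):
    # Naive reduction: sample `size` representatives in proportion to their
    # weights, then rescale so the total weight (points represented) is kept.
    if len(points) <= size:
        return points, weights
    p = weights / weights.sum()
    idx = rng.choice(len(points), size=size, replace=False, p=p)
    w = weights[idx] * (weights.sum() / weights[idx].sum())
    return points[idx], w

def merge(a, b):
    return np.vstack([a[0], b[0]]), np.concatenate([a[1], b[1]])

CHUNK, SIZE = 1_000, 200
levels = {}                        # level -> coreset, like a binary counter
for _ in range(64):                # 64 chunks of the "infinite" stream
    pts = rng.normal(size=(CHUNK, 2))
    cs = reduce_coreset(pts, np.ones(CHUNK), SIZE)
    lvl = 0
    while lvl in levels:           # two coresets at one level: merge, reduce
        cs = reduce_coreset(*merge(levels.pop(lvl), cs), SIZE)
        lvl += 1                   # so the merge tree stays O(log n) deep
    levels[lvl] = cs

# Summarize the stream so far by merging the O(log n) surviving coresets.
points, weights = list(levels.values())[0]
for cs in list(levels.values())[1:]:
    points, weights = merge((points, weights), cs)
print(f"{len(points)} weighted points summarize {64 * CHUNK} stream points")
```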

She elaborated on using coresets to solve transportation problems, concluding by describing a life-logging application that, using latent semantic analysis, merges cellphone GPS data with Yelp database data to create your life story, for example: "You left home at 7:30 a.m. You arrived at the Davis Square Au Bon Pain at 7:45. You entered the Davis Square subway station at 8:15…." Not great literature, perhaps, but an effective demonstration of the capabilities of coresets. Rus noted that coresets don't use much space or much time, adding, "I see them everywhere I look—at work or in life—and I hope you will find a way to use coresets in your own activities at work."
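
As a rough sketch of how such a narrative could be assembled (the venue list and coordinates below are hypothetical stand-ins for the Yelp data, not the system Rus described), timestamped waypoints can be matched to their nearest named venue:

```python
from math import hypot

# Hypothetical venue list; in the application Rus described, this role
# is played by the Yelp database.
VENUES = [("home", 0.00, 0.00),
          ("the Davis Square Au Bon Pain", 1.20, 0.80),
          ("the Davis Square subway station", 1.30, 0.90)]

def nearest_venue(x, y):
    return min(VENUES, key=lambda v: hypot(v[1] - x, v[2] - y))[0]

# Timestamped waypoints, e.g. weighted representatives from a GPS coreset.
waypoints = [("7:30 a.m.", 0.00, 0.01),
             ("7:45 a.m.", 1.21, 0.79),
             ("8:15 a.m.", 1.29, 0.91)]

for t, x, y in waypoints:
    print(f"At {t} you were at {nearest_venue(x, y)}.")
```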


About the Author

Rick Nelson | Contributing Editor

Rick is currently Contributing Technical Editor. He was Executive Editor for EE in 2011-2018. Previously he served on the staffs of several publications, including EDN and Vision Systems Design, and he has received awards for signed editorials from the American Society of Business Publication Editors. He began his career as a design engineer at General Electric and Litton Industries and earned a BSEE degree from Penn State.
