GPUs can process millions of records in a few milliseconds, making them useful for big data applications, said MIT Professor Sam Madden. In the online class “Tackling the Challenges of Big Data,” which runs from February 3 to March 17, he described how GPUs power MIT’s MapD tool, which can visualize vast quantities of Twitter data very quickly. As of October 2013, he said, there were about 500 million tweets per day, of which about 8 million were geocoded, allowing the tracking of snowstorms, for example.
A GPU, he said, can store data and scan through it at about 250 GB/s, with a compute capability of about four teraflops, leveraging thousands of threads. GPUs do, however, have limited memory capacity, so for the Twitter application the data needs to be judiciously encoded. He then described a “shared nothing” processing scheme, in which a database is partitioned among physical compute nodes, each of which applies a filter to contribute to the rendering of a heat map, for example, with the renderings combined into the final visualization.
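A minimal Python sketch of that shared-nothing idea, under assumed details: the partition layout, the keyword filter, and the one-degree grid cells are illustrative choices, not MapD’s actual implementation.

```python
from collections import Counter

# Illustrative geocoded tweets: (longitude, latitude, text).
# In a shared-nothing design each partition lives on its own node;
# here the "nodes" are just separate lists processed independently.
partitions = [
    [(-71.06, 42.36, "snow in boston"), (-72.68, 41.76, "coffee run")],
    [(-73.94, 40.67, "jets game"), (-71.06, 42.36, "more snow")],
]

def render_partial_heatmap(partition, keyword, cell_size=1.0):
    """Filter one partition and bin matching tweets into grid cells."""
    counts = Counter()
    for lon, lat, text in partition:
        if keyword in text:  # the per-node filter step
            cell = (int(lon // cell_size), int(lat // cell_size))
            counts[cell] += 1
    return counts

def combine(partials):
    """Merge the per-node renderings into the final heat map."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Each node computes its partial result independently ("shared nothing"),
# and only the small partial renderings are combined for visualization.
partials = [render_partial_heatmap(p, "snow") for p in partitions]
print(combine(partials))
```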
He then discussed the organization of GPU threads into “warps”—groups of threads running the same instructions on different pieces of data, in parallel rather than sequentially. He noted, though, that MapD essentially exposes an SQL relational database interface, so users don’t have to think about the underlying details of the GPU hardware.
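The “same instruction, different data” pattern can be sketched conceptually in Python, though this is only an analogy: real warps are fixed-size groups of GPU threads executing in hardware lockstep, whereas this sketch runs sequentially, and the group size of 4 is arbitrary.

```python
# Conceptual analogy only: a warp applies one instruction across a block
# of data elements, rather than issuing a separate instruction per element.
WARP_SIZE = 4  # illustrative; not the actual hardware warp size

def warp_apply(values, op):
    """Apply the same operation to every 'lane' of each warp-sized block."""
    results = []
    for start in range(0, len(values), WARP_SIZE):
        block = values[start:start + WARP_SIZE]   # one "warp" of data
        results.extend(op(v) for v in block)      # same instruction, every lane
    return results

print(warp_apply([1, 2, 3, 4, 5, 6, 7, 8], lambda v: v * 10))
```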
MapD, he said, relies on brute-force parallelism. Another approach, also used by MapD, is partitioning, in which data is divided into multiple chunks that can be processed independently. He described horizontal partitioning, in which, for example, rows in a table may represent monthly sales data. This type of partitioning makes it easy to zoom in on sales for August, for instance.
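A small Python sketch of horizontal partitioning; the sales rows and column layout are assumed for illustration. Splitting whole rows by month means a question about August only touches the August chunk.

```python
# Hypothetical sales rows: (month, salesperson, amount).
sales = [
    ("July", "Alice", 1200.0),
    ("August", "Bob", 950.0),
    ("August", "Alice", 400.0),
    ("September", "Carol", 700.0),
]

# Horizontal partitioning: group whole rows by the partitioning key (month),
# so a query about one month reads only that month's chunk.
partitions = {}
for row in sales:
    partitions.setdefault(row[0], []).append(row)

august_total = sum(amount for _, _, amount in partitions["August"])
print(august_total)  # only the August partition was scanned
```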
Vertical partitioning, on the other hand, can be useful if you want to examine sales by a particular salesperson (represented in one column), for example, and you don’t care about other salespeople, the customers to whom the sales were made, or any other information that might be stored in additional columns of the table. Large databases, he explained, tend to acquire many additional attributes (each represented in its own column) that might not be of interest for a specific analysis.
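Vertical partitioning can be sketched the same way; the column names here are again assumptions. Each attribute is stored separately, so totaling one salesperson’s sales reads only the two columns it needs and skips the rest.

```python
# Column-oriented (vertically partitioned) storage: each attribute lives
# in its own array, so a query can read only the columns it needs and
# ignore the many other attributes a large table tends to accumulate.
columns = {
    "salesperson": ["Alice", "Bob", "Alice", "Carol"],
    "amount":      [1200.0, 950.0, 400.0, 700.0],
    "customer":    ["Acme", "Globex", "Initech", "Umbrella"],  # never read below
}

# Total sales for one salesperson: only two columns are scanned.
alice_total = sum(
    amount
    for person, amount in zip(columns["salesperson"], columns["amount"])
    if person == "Alice"
)
print(alice_total)
```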
In addition to partitioning, you might employ sampling, in which you operate on a subset of the data (perhaps 10% of the total) using statistical techniques, just as in political polling for elections.
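A minimal sampling sketch in Python, assuming a simple uniform random sample of about 10% of the records and scaling the result back up, much as a poll extrapolates from respondents to the full electorate.

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical per-record values (e.g., individual sale amounts).
data = [random.uniform(10, 1000) for _ in range(100_000)]

# Operate on roughly 10% of the data and estimate the total from the sample.
sample = random.sample(data, k=len(data) // 10)
estimated_total = sum(sample) * (len(data) / len(sample))

print(f"estimate: {estimated_total:,.0f}  actual: {sum(data):,.0f}")
```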
And finally, you might employ summarization, in which you make use of some data structure derived from your original data—such as a histogram.
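A histogram as a summary structure might look like the sketch below; the values and bucket width are illustrative assumptions. The point is that later questions can be answered approximately from the small derived structure instead of rescanning the raw data.

```python
from collections import Counter

# Hypothetical raw values (e.g., individual sale amounts).
values = [12, 45, 180, 220, 975, 430, 88, 310, 640, 510]

BUCKET_WIDTH = 100  # assumed bucket size for illustration

# Build the histogram once: a compact summary derived from the raw data.
histogram = Counter((v // BUCKET_WIDTH) * BUCKET_WIDTH for v in values)

# Approximate answers come from the summary alone, e.g.,
# "how many values fall below 300?" without touching the raw data again.
under_300 = sum(count for bucket, count in histogram.items() if bucket < 300)
print(sorted(histogram.items()), under_300)
```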
So where does all this lead? With MapD’s Twitter analysis, you can determine where Dunkin Donuts is popular (New England and Florida, the latter presumably because of “snowbirds” who have traveled south) and where the dividing line falls between fans of the New England Patriots and the New York Jets (a line passing near Hartford, CT). You can come up with your own compelling and practical uses for such interactive visualization.
Registration for “Tackling the Challenges of Big Data” has been extended until February 10.—Rick Nelson