Big data challenges: volume, velocity, variety

Unless you’ve been in hiding for the past five years, you’ve undoubtedly heard of big data, thanks to the efforts of companies like IBM to aggressively market the term. That’s according to MIT Professor Samuel Madden, introducing an online course called “Tackling the Challenges of Big Data,” which runs from February 3 to March 17.

Madden works on data management software systems, including TinyDB, which collects sensor data, and C-Store, a relational database that stores massive amounts of information and exists in commercial form as Vertica. He said a goal of the course is to separate the big data marketing from the reality. My goal in taking the course is to be able to focus on the reality rather than the hype when writing about the topic.

Big data, Madden said, represents the democratization of information, providing all of us with access to data about transportation, medicine, social interaction, industrial monitoring, government, education, and even about ourselves through wearable devices.

He cited an example of the insights big data can yield. A database of patient records, cross-linked with other databases, indicated the cost of treating lung-cancer patients. An analysis showed that 10% of patients received treatment costing five times more than the median cost. You might, Madden said, hypothesize that the patients at the high end of the cost scale were sicker, or had better outcomes, but that was not the case. The dominant factor was that the patients receiving the most expensive treatment had one of two doctors.

That particular study, he said, required laborious, manual inspection of the data. His goal is to help automate the process of identifying trends.
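
To make the idea concrete, here is a minimal Python sketch of the kind of check that might be automated. It is not the method used in the study Madden described. The records are made up for illustration, and the function name `high_cost_share_by_doctor`, the doctor labels, and the dollar figures are all assumptions. The sketch flags patients whose treatment cost is at least five times the median, then reports what share of each doctor's patients fall into that group.

```python
from collections import Counter
from statistics import median

# Hypothetical, hand-made records: (patient_id, doctor, treatment_cost).
# These are NOT data from the study; they only illustrate the calculation.
records = [
    (1, "Dr. A", 10_000),
    (2, "Dr. B", 11_000),
    (3, "Dr. C", 9_000),
    (4, "Dr. D", 12_000),
    (5, "Dr. A", 10_500),
    (6, "Dr. B", 9_500),
    (7, "Dr. C", 60_000),
    (8, "Dr. D", 10_000),
    (9, "Dr. A", 11_500),
    (10, "Dr. C", 9_800),
]


def high_cost_share_by_doctor(records, multiple=5.0):
    """Return the cost threshold and, per doctor, the share of high-cost patients.

    A patient counts as high cost when the treatment cost is at least
    `multiple` times the median cost across all records.
    """
    costs = [cost for _, _, cost in records]
    threshold = multiple * median(costs)

    # Count all patients and high-cost patients for each doctor.
    totals = Counter(doctor for _, doctor, _ in records)
    high = Counter(doctor for _, doctor, cost in records if cost >= threshold)

    shares = {doctor: high[doctor] / totals[doctor] for doctor in totals}
    return threshold, shares


threshold, shares = high_cost_share_by_doctor(records)
print(f"High-cost threshold: {threshold:,.0f}")
for doctor, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{doctor}: {share:.0%} of patients above threshold")
```

In this toy data set one of ten patients exceeds the threshold, and that patient belongs to a single doctor. A real analysis would also need to control for factors such as how sick the patients were before concluding that the doctor is the dominant factor.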

Madden then asked, “So how do you know when you have a big data problem?” Most obviously, you might just have too much data: a volume problem. You may have a lot of data coming at you fast: a velocity problem. Or you may have a lot of different data sets that you need to integrate: a variety problem. And in addition to volume, velocity, and variety, you may have problems because your analysis methods do not scale.

And of course there are challenges related to privacy. He cited as an example usage-based car insurance. It may make sense to pay for insurance based on how and how much you drive, as monitored by your cellphone or a device in the car, but it would give your insurance company a lot of information about your behavior.

The course will present case studies and discuss ingesting and integrating data, storage and compute platforms, presentation and visualization of information, algorithms and analytics, and security and privacy.

Registration has been extended until February 10.

Rick Nelson