It’s a small world, maybe, but it sure contains a lot of data. MIT Professor Ronitt Rubinfeld cited the small-world phenomenon in her presentation as part of the online course “Tackling the Challenges of Big Data,” which runs through March 17. She asked, “What can you possibly hope to say about data that you can’t even view?”
Six Degrees of Separation is a play and a movie, but how can we tell whether the premise holds in the real world? The data we would need to answer that question, Rubinfeld said, is truly big data; some of it is inaccessible, and what you can access may change after you've accessed it. Somebody could die, or somebody could be born.
A linear-time algorithm is the gold standard of undergraduate education, she said, but even linear time may be inadequate for big data problems. We cannot know exactly how many people exhibit the six-degrees-of-separation property. This is not new territory: statisticians have long predicted the results of elections and the efficacy of medicines. But, said Rubinfeld, big data presents aspects that haven't been classically studied.
She discussed property testing, in which you ask whether an input has a given property or does not. Offering some leeway in answering that question for borderline cases, inputs that are close to having the property, can lead to much faster algorithms, ones that operate in sublinear time. The leeway is fine, she said, because your data is noisy and constantly changing, and there is no exact answer anyway.
She then discussed c-multiplicative and c-additive approximations, in which the output is within a factor of c of the optimal solution or within the optimal solution plus c, respectively (with an ε error term introduced by sampling). She applied such an approximation to the problem of determining the minimum size of a vertex cover, using an oracle-reduction framework and a greedy maximal-matching algorithm.
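The greedy maximal-matching idea behind the vertex-cover approximation can be sketched as follows (this is the standard offline version; the sublinear variant Rubinfeld described queries the matching through an oracle on sampled vertices rather than scanning every edge):

```python
def matching_vertex_cover(edges):
    """Greedy maximal matching: take an edge whenever both of its endpoints
    are still unmatched. Every edge of the graph then touches a matched
    vertex, so the matched vertices form a vertex cover; and since any
    cover must contain at least one endpoint of each matching edge, the
    result is at most twice the minimum size (a 2-multiplicative
    approximation)."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover
```

On the path 1-2-3-4, for example, this returns {1, 2, 3, 4}, twice the optimal cover {2, 3}, which matches the factor-2 guarantee.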
Some problems have big data inputs but a single number or pass/fail output. (Is this the optimal vertex cover?) Other problems might have big data outputs as well, but you might not need the entire output; you might just be interested in the ith bit of the original message. An approach to such problems, she said, lies in local computation algorithms (LCAs), which have been proposed in the last couple of years.
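A minimal sketch of the LCA idea, using maximal independent set as the example problem (my choice for illustration, not necessarily the lecture's): each query answers whether one vertex is in the solution by exploring only as much of the graph as that answer requires.

```python
import random

def make_mis_oracle(adj, seed=0):
    """LCA-style query oracle for a maximal independent set: assign every
    vertex a random rank, and declare v in the set iff no lower-ranked
    neighbor is in it. A query for one vertex recurses only toward lower
    ranks, so it touches a small neighborhood instead of materializing
    the whole (possibly huge) output, and all queries are consistent
    with one global solution."""
    rng = random.Random(seed)
    rank = {v: rng.random() for v in adj}
    memo = {}
    def in_mis(v):
        if v not in memo:
            memo[v] = all(not in_mis(u) for u in adj[v] if rank[u] < rank[v])
        return memo[v]
    return in_mis
```

Because ranks strictly decrease along the recursion, each query terminates after visiting a bounded neighborhood on bounded-degree graphs, which is the hallmark of an LCA.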
She then turned her attention to the question of how you can understand your data when it's not written down but only available through samples, as in a lottery, and how to tell whether the underlying distribution is uniform. She recounted using sublinear time algorithms to evaluate certain state lotteries over a period of years to determine that the results were uniform.
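Uniformity testing from samples is commonly built on collision counting (a standard technique in this area; the talk does not specify which algorithm the lottery analysis used):

```python
from collections import Counter

def collision_fraction(samples):
    """Fraction of sample pairs that land on the same outcome. For a
    uniform distribution over n outcomes this statistic concentrates
    near 1/n, while a distribution far from uniform yields a noticeably
    larger value; comparing it to a threshold slightly above 1/n gives
    a uniformity test whose sample count is sublinear in n."""
    m = len(samples)
    counts = Counter(samples)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    return collisions / (m * (m - 1) // 2)
```

The key point is that the test needs far fewer samples than the size of the outcome space, so you can judge a lottery without enumerating every possible draw.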
She also mentioned a new central limit theorem for generalized multinomial distributions that has been developed over the past couple of years. She also discussed support size and entropy and concluded, “I hope I’ve convinced you that for many problems, we need a lot less time and samples than one might think.”
See these related posts on the online course “Tackling the Challenges of Big Data”:
- Big data challenges: volume, velocity, variety
- Commuting via rapidly mixing Markov chains
- GPUs drive MIT’s MapD Twitter big-data application
- Emerging tools address data curation
- What is cloud computing?
- Big data impels database evolution
- Distributed computing platforms serve big-data applications
- NewSQL takes advantage of main-memory databases like H-Store
- Onions of encryption boost big-data security
- Lecture addresses scalability in multicore systems
- User interfaces can clarify, or obscure, big data’s meaning