Machine-learning tools for big-data analytics include programs that, in effect, learn from experience. Such programs can be used to process natural language, make recommendations, or uncover how biological systems work, according to MIT Professor Tommi Jaakkola, delivering a lecture in the online course “Tackling the Challenges of Big Data.” They can also be used for predictive user modeling or for solving large-scale inverse problems.
Machine learning is useful, Jaakkola said, because modern engineering problems are often hard to specify and solve directly. For instance, it may be hard to write an algorithm to detect credit-card fraud, but a machine can be presented with examples of legitimate and fraudulent credit-card transactions and learn to identify the latter.
Mapping an example to a label (for example, “fraudulent”) is a classification problem, Jaakkola said. Classifying news articles or biomedical samples, mapping genotype signatures to phenotypes, and predicting the success of financial strategies are all simple classification problems.
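The fraud example above can be sketched as a minimal learned classifier. The sketch below uses a nearest-centroid rule on invented transaction features (dollar amount, distance from home); the features, data, and method are illustrative assumptions, not the lecture's actual algorithm.

```python
def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def classify(x, legit_centroid, fraud_centroid):
    """Label x by whichever class centroid it is closer to (squared distance)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return "fraudulent" if dist2(x, fraud_centroid) < dist2(x, legit_centroid) else "legitimate"

# Toy labeled examples: [amount in dollars, miles from home]
legit = [[25, 2], [40, 5], [12, 1], [60, 8]]
fraud = [[900, 450], [1200, 600], [700, 300]]

c_legit, c_fraud = centroid(legit), centroid(fraud)
print(classify([1000, 500], c_legit, c_fraud))  # a far-away, high-value charge
```

The point is the workflow Jaakkola describes: rather than hand-writing fraud rules, the program derives its decision boundary from labeled examples.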
Beyond simple classification lie problems such as transcribing speech or processing natural-language sentences—often in the presence of incomplete or erroneous data. Jaakkola presented an example of mapping a sentence to dependency parses, noting that parsing a single sentence is computationally hard. But, he said, adaptively decomposing sentences into loosely coupled pieces that are easy to solve individually results in a close approximation for most languages.
He then discussed recommendation problems—for example, predicting what movie I might like. The concept is simple: If you like movies A, B, and C, and I like movies A and C but have not seen B, it’s probably safe to recommend movie B to me.
The challenge arises at scale, with thousands of movies and millions of users. The problem can be represented as a matrix of users vs. movies, populated with users’ ratings. The matrix is sparsely populated, because the average user will have seen very few of the available movies, and the task essentially becomes a matrix-completion problem: finding the simplest matrix consistent with the limited data. It can be addressed via factorization—based on “user features” (the ratings a user assigned to a limited number of movies) and “item features” (the users who rated the movie).
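The factorization idea can be sketched as follows: approximate the sparse ratings matrix as a product of a low-rank user-feature matrix and item-feature matrix, fit only to the observed entries, then read predictions off the product. The ratings, rank, learning rate, and training loop below are all illustrative assumptions.

```python
import random

random.seed(0)

# (user, movie) -> rating; most of the matrix is missing
R = {
    (0, 0): 5, (0, 1): 4, (1, 0): 5, (1, 2): 1,
    (2, 1): 4, (2, 2): 1, (3, 0): 1, (3, 2): 5,
}
n_users, n_movies, rank = 4, 3, 2

# Small random init for user factors U and movie factors V
U = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
V = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_movies)]

def predict(u, m):
    """Predicted rating = dot product of user and movie feature vectors."""
    return sum(U[u][k] * V[m][k] for k in range(rank))

# Stochastic gradient descent on the observed entries only
lr = 0.05
for _ in range(3000):
    for (u, m), r in R.items():
        err = r - predict(u, m)
        for k in range(rank):
            U[u][k], V[m][k] = (U[u][k] + lr * err * V[m][k],
                                V[m][k] + lr * err * U[u][k])

# Fill in an unseen entry, e.g. user 0's rating of movie 2
print(round(predict(0, 2), 1))
```

Because the learned factors are low-rank, they generalize beyond the observed cells: the model predicts user 0's missing rating for movie 2 from the taste pattern user 0 shares with similar users.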
Jaakkola concluded by saying, “There are lots of machine-learning algorithms available out there that solve all kinds of problems where we need to learn from experience. And I hope and strongly encourage you to learn more about them and try to apply them to problems that you are interested in.”
See these related posts on the online course “Tackling the Challenges of Big Data”:
- Big data challenges: volume, velocity, variety
- Commuting via rapidly mixing Markov chains
- GPUs drive MIT’s MapD Twitter big-data application
- Emerging tools address data curation
- What is cloud computing?
- Big data impels database evolution
- Distributed computing platforms serve big-data applications
- NewSQL takes advantage of main-memory databases like H-Store
- Onions of encryption boost big-data security
- Lecture addresses scalability in multicore systems
- User interfaces can clarify, or obscure, big data’s meaning
- It’s a small world, but with big data
- Sampling, streaming cut big data down to size
- Coresets offer a path to understanding big data