Natural language contains lots of useful data, but it's severely underutilized. For example, said MIT Professor Regina Barzilay, users generate enormous numbers of reviews every second on Amazon, Yelp, and other services, but when we make our own shopping choices, we can read only a tiny fraction of them. Consequently, services often provide a star rating, which, although aggregated over thousands of reviews, conveys little information.
Delivering a lecture in the online course "Tackling the Challenges of Big Data," Barzilay said that ideally we would compress thousands of reviews into a format we can read quickly before making a decision. Information extraction, she said, is the process of taking natural-language text and converting it into a structured representation that can be stored in a database. The question is how we can get a machine to help with this, given that machines don't understand the grammar of natural language.
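To make that idea concrete, here is a minimal sketch (not from the lecture) of what information extraction produces: free text in, a database-ready record out. The patterns, field names, and the `extract_performance` function are illustrative assumptions, not a general-purpose parser.

```python
import re

# A toy extractor: turn one sentence of free text into a structured
# record that could be stored as a database row. The patterns below
# are illustrative assumptions, not a general-purpose parser.
def extract_performance(text):
    record = {"artist": None, "venue": None, "date": None}
    artist = re.search(r"^(.+?) (?:plays|performs) at", text)
    venue = re.search(r"at (?:the )?([A-Z][\w' ]+?) on", text)
    date = re.search(r"on ([A-Z]\w+ \d{1,2})", text)
    if artist:
        record["artist"] = artist.group(1)
    if venue:
        record["venue"] = venue.group(1)
    if date:
        record["date"] = date.group(1)
    return record

print(extract_performance("Norah Jones plays at the Beacon Theatre on June 14"))
# {'artist': 'Norah Jones', 'venue': 'Beacon Theatre', 'date': 'June 14'}
```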
Surprisingly, she said, it's possible to extract a lot of information without any understanding of grammar. A simple word count, for example, can enable a computer to distinguish between a financial report and a weather forecast. Moving beyond this simple case, she described entity disambiguation, in which a machine tries to determine whether a word represents a person, organization, or location. For example, a capitalized word preceded by "Mr." is probably a name. Classification is the process of mapping such features to a prediction, or label.
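A sketch of both grammar-free tricks, using made-up vocabularies: word counts separate a financial report from a weather forecast, and a simple contextual feature flags likely person names. The word lists and thresholds are assumptions for illustration only.

```python
# Two grammar-free tricks, sketched with made-up vocabularies:
# (1) classify a document by counting indicative words,
# (2) flag a token as a probable person name from its local context.

FINANCE_WORDS = {"revenue", "earnings", "shares", "quarter", "profit"}
WEATHER_WORDS = {"rain", "sunny", "temperature", "forecast", "wind"}

def classify(document):
    words = document.lower().split()
    finance = sum(w in FINANCE_WORDS for w in words)
    weather = sum(w in WEATHER_WORDS for w in words)
    return "financial report" if finance > weather else "weather forecast"

def likely_person_names(tokens):
    # A capitalized word preceded by "Mr." is probably a name.
    return [t for prev, t in zip(tokens, tokens[1:])
            if prev == "Mr." and t[:1].isupper()]

print(classify("shares fell as quarter earnings missed revenue targets"))
print(likely_person_names("We spoke with Mr. Smith yesterday".split()))
```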
The next step is to determine relationships between entities, such as extracting the name of an artist and a venue from a tweet. Hidden Markov models and conditional random fields (CRFs) can take into account the dependencies that arise between neighboring words and labels.
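To make the sequence-labeling machinery concrete, below is a minimal Viterbi decoder for a toy hidden Markov model. The states and all probabilities are invented for illustration; a real system (and the CRFs the lecture mentions) would learn such parameters from annotated data rather than hand-setting them.

```python
# A minimal Viterbi decoder for a toy hidden Markov model that tags
# tokens as ARTIST, VENUE, or OTHER. All probabilities are invented
# for illustration; a real system would estimate them from data.

states = ["ARTIST", "VENUE", "OTHER"]

start = {"ARTIST": 0.3, "VENUE": 0.2, "OTHER": 0.5}
trans = {
    "ARTIST": {"ARTIST": 0.5, "VENUE": 0.1, "OTHER": 0.4},
    "VENUE":  {"ARTIST": 0.1, "VENUE": 0.5, "OTHER": 0.4},
    "OTHER":  {"ARTIST": 0.3, "VENUE": 0.3, "OTHER": 0.4},
}

def emit(state, word):
    # Toy emission model: capitalized words are more likely inside names.
    if word[:1].isupper():
        return 0.6 if state in ("ARTIST", "VENUE") else 0.1
    return 0.1 if state in ("ARTIST", "VENUE") else 0.8

def viterbi(words):
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start[s] * emit(s, words[0]), [s]) for s in states}
    for word in words[1:]:
        best = {
            s: max(
                ((p * trans[prev][s] * emit(s, word), path + [s])
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi("Catch Norah Jones at Beacon tonight".split()))
```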
Such processes usually require training data, she said, which a human must annotate by hand (providing "direct annotation"), and this annotation is consequently very expensive. One way to minimize this cost is multi-aspect summarization, which might extract from a restaurant review information on food, ambience, and service. Another is building an event database from a noisy stream of tweets, in which the huge redundancy of the stream supplements supervised annotation.
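As a rough sketch of multi-aspect summarization, the code below routes each review sentence to a food, ambience, or service aspect. The seed word lists are assumptions; the approach described in the lecture learns such groupings from data rather than hard-coding them.

```python
# Multi-aspect summarization, reduced to a keyword stand-in: route each
# review sentence to the food, ambience, or service aspect. The seed
# word lists are invented for illustration.

ASPECTS = {
    "food":     {"pasta", "delicious", "menu", "flavor", "dessert"},
    "ambience": {"cozy", "noisy", "lighting", "decor", "music"},
    "service":  {"waiter", "friendly", "slow", "attentive", "staff"},
}

def summarize(review):
    summary = {aspect: [] for aspect in ASPECTS}
    for sentence in review.split("."):
        words = set(sentence.lower().split())
        for aspect, seeds in ASPECTS.items():
            if words & seeds:
                summary[aspect].append(sentence.strip())
    return summary

review = ("The pasta was delicious. The waiter was slow. "
          "Cozy decor and soft music.")
for aspect, sentences in summarize(review).items():
    print(aspect, "->", sentences)
```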
Elaborating on restaurant reviews, she described augmenting the annotated sequence-labeling task with a content topic model, which groups together sentences that appear similar, without regard to meaning. For example, you could highlight in orange every sentence containing the word déjeuner; you can perform this task even if you don't speak French, and a computer can perform it across thousands of reviews. This approach yields a model based on unlabeled big data and driven by a small amount of supervised data. In fact, she said, the more unlabeled data you have, the better the performance.
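Here is a minimal sketch of the déjeuner trick, assuming invented sample reviews: index sentences by the words they contain, so all sentences sharing a surface word can be pulled together without anyone knowing what the word means.

```python
from collections import defaultdict

# The déjeuner trick: group sentences purely by surface word overlap,
# with no understanding of what the words mean. We build an index from
# each word to the sentences containing it, across many reviews.

def group_by_word(sentences):
    index = defaultdict(list)
    for sentence in sentences:
        for word in set(sentence.lower().split()):
            index[word].append(sentence)
    return index

reviews = [
    "Le déjeuner était parfait",
    "Un déjeuner rapide mais bon",
    "The dining room was noisy",
]
index = group_by_word(reviews)
print(index["déjeuner"])  # both French sentences, meaning unknown to us
```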
She then described in detail an example of automatically constructing event records (listing artists and venues for New York City performances) from a stream of relevant Twitter messages. Big data, and social media in particular, present challenges for natural-language technologies, she said. "However," she concluded, "designing a linguistically rich model, which can model the dependence in a smart, probabilistic way and utilize the big data, can deliver significant prediction boosts for these complex extraction tasks."
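As a rough illustration of how redundancy in the stream helps, the sketch below (an assumption, not the lecture's actual system) aggregates noisy per-tweet (artist, venue) guesses by vote counting, so repeated events survive while one-off extraction errors wash out.

```python
from collections import Counter

# Redundancy in a tweet stream, sketched: many noisy per-tweet guesses
# about (artist, venue) pairs are aggregated by simple vote counting.
# The candidate pairs below stand in for the output of a per-tweet
# extractor like the one sketched earlier.

def build_event_records(candidate_pairs, min_votes=2):
    votes = Counter(candidate_pairs)
    return [(artist, venue) for (artist, venue), n in votes.items()
            if n >= min_votes]

candidates = [
    ("Norah Jones", "Beacon Theatre"),  # extracted from tweet 1
    ("Norah Jones", "Beacon Theatre"),  # tweet 2 repeats the event
    ("Norah Jones", "Beacon Theatre"),
    ("Jones", "Beacon"),                # a noisy, one-off extraction
]
print(build_event_records(candidates))
# [('Norah Jones', 'Beacon Theatre')]
```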
See these related posts on the online course “Tackling the Challenges of Big Data”:
- Big data challenges: volume, velocity, variety
- Commuting via rapidly mixing Markov chains
- GPUs drive MIT’s MapD Twitter big-data application
- Emerging tools address data curation
- What is cloud computing?
- Big data impels database evolution
- Distributed computing platforms serve big-data applications
- NewSQL takes advantage of main-memory databases like H-Store
- Onions of encryption boost big-data security
- Lecture addresses scalability in multicore systems
- User interfaces can clarify, or obscure, big data’s meaning
- It’s a small world, but with big data
- Sampling, streaming cut big data down to size
- Coresets offer a path to understanding big data
- Machines learn from experience