Data integration is the topic addressed by MIT Professor Mike Stonebraker as part of the online course “Tackling the Challenges of Big Data,” which runs through March 17. (Today is the final day of extended registration.)
Data integration, which he also calls data curation, involves ingesting data (usually from a source not created by your team), validating it, often transforming it, consolidating it with other data you may have onsite, and looking at the results.
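To make those steps concrete, here is a minimal sketch in Python of the ingest, validate, transform, and consolidate flow. The field names, validation rule, and data are hypothetical stand-ins for whatever a real source would supply; this is not any tool Stonebraker mentions.

```python
# Minimal sketch of the ingest -> validate -> transform -> consolidate flow.
# Field names, the validation rule, and the data are all hypothetical.

raw_rows = [                                   # "ingest": data from an outside source
    {"store": "A12", "sales": "1500.0"},
    {"store": "B07", "sales": "-3"},           # bad record; should fail validation
    {"store": "C44", "sales": "980.5"},
]

def validate(rows):
    """Drop rows that fail a simple sanity check."""
    return [r for r in rows if float(r.get("sales", "nan")) >= 0]

def transform(rows):
    """Rename fields and convert types to match the local schema."""
    return [{"store_id": r["store"], "sales_usd": float(r["sales"])} for r in rows]

local_data = [{"store_id": "A01", "sales_usd": 2200.0}]   # data already onsite

curated = local_data + transform(validate(raw_rows))      # consolidate, then inspect
print(curated)
```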
Data curation, Stonebraker said, traces its roots to data warehouses, which originated in the retail sector in the 1990s. Those systems often ran twice over budget and took twice as long to build as predicted because of data integration issues, but they nevertheless paid for themselves within six months.
Data warehouses were built with a process called extract, transform, and load (ETL), which was human-intensive and won’t scale much beyond 10 or 20 data sources. (“Twist my arm, and I’ll give you 50,” Stonebraker added.)
But enterprises want to add more and more data sources. He cited the example of Miller beer, whose maker would like to integrate weather forecasts into its databases to help predict sales. Novartis, he said, wanted to integrate the electronic lab notebooks of 8,000 chemists and biologists, a job well beyond the capabilities of ETL. And web aggregator Goby (renamed Scout after a recent sale) is integrating 80,000 URLs. “You cannot do ETL at scale 80,000,” he emphasized.
He then described inverting the ETL architecture with a tool called Data Tamer, which automates most of the data integration task and asks a human (through a crowdsourcing system) when it gets stuck, cutting human labor dramatically.
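The general pattern is easy to sketch: score each candidate match automatically, decide the confident cases, and escalate only the ambiguous ones to a person. The Python sketch below illustrates that human-in-the-loop idea; the similarity function and threshold are hypothetical stand-ins, not Data Tamer’s actual algorithm.

```python
from difflib import SequenceMatcher

# Human-in-the-loop matching sketch. The scoring function and threshold
# are hypothetical; Data Tamer's real matcher is far more sophisticated.

CONFIDENCE = 0.85

def match_score(a: str, b: str) -> float:
    """Crude string similarity standing in for a learned matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def ask_crowd(a: str, b: str):
    """Stand-in for posting a hard case to a crowdsourcing system."""
    print(f"Escalating to a human: {a!r} vs. {b!r}")
    return None                       # decision deferred until a person answers

def same_entity(a: str, b: str):
    score = match_score(a, b)
    if score >= CONFIDENCE:
        return True                   # confident match: decide automatically
    if score <= 1 - CONFIDENCE:
        return False                  # confident non-match: decide automatically
    return ask_crowd(a, b)            # stuck: hand the pair to a human

print(same_entity("International Business Machines", "Internat'l Business Machines"))
print(same_entity("Novartis AG", "Miller Brewing"))
```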
He then cited the rise of data scientists, who take on point projects, such as deciding which keywords to buy on Google. Such a project may involve only three or four data sources, and ETL tools are “too heavy” for that. Data Wrangler, an interactive system that supports data cleaning and transformation, can help data scientists get data sets into a usable format.
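The kind of cleanup involved is easy to picture. The sketch below uses pandas (an assumption on my part; Data Wrangler is its own interactive tool, and the column names and data here are invented) to show typical steps: dropping unusable rows, fixing number formats, and splitting a combined field.

```python
import pandas as pd

# Typical cleaning steps a tool like Data Wrangler helps automate.
# Column names and data are hypothetical; this is pandas, not Data Wrangler.

raw = pd.DataFrame({
    "name_city": ["Acme Corp|Boston", "Globex|New York", None],
    "revenue":   ["1,200", "950", "1,100"],
})

clean = raw.dropna(subset=["name_city"]).copy()           # drop unusable rows
clean["revenue"] = (clean["revenue"]
                    .str.replace(",", "", regex=False)    # normalize number format
                    .astype(float))
clean[["name", "city"]] = clean["name_city"].str.split("|", expand=True)
clean = clean.drop(columns=["name_city"])                 # now in a usable format
print(clean)
```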
Today, he said, data scientists spend 80% of their time getting, extracting, and cleaning data and only 20% of the time on data science, but better tools are coming.—Rick Nelson
About the Author

Rick Nelson
Contributing Editor
Rick is currently a Contributing Technical Editor. He was Executive Editor of EE from 2011 to 2018. Previously, he served on several publications, including EDN and Vision Systems Design, and has received awards for signed editorials from the American Society of Business Publication Editors. He began his career as a design engineer at General Electric and Litton Industries and earned a BSEE degree from Penn State.