JPL statistician comments on big data, cost and reliability of results
Amy Braverman, principal statistician at the Jet Propulsion Laboratory, develops strategies to analyze information from NASA’s space-borne instruments. In an interview with Scott Thurm, senior deputy technology editor at The Wall Street Journal, she comments on dealing with imperfect data or data that’s not perfectly arranged in rows and columns.
“Data collection is so different today than it used to be,” she said, pointing out that spacecraft collect information on thousands of variables, freeways have built-in sensors, and supermarket scanners fill databases with information on purchases. Such “opportunistic” data doesn’t lend itself to analysis using traditional statistical and data-mining technologies, so your existing software is unlikely to work.
Data is also distributed, she said. Distributed computing is well known: you divide a problem and farm out the pieces to multiple computers. Distributed data presents different challenges. You may, for example, be trying to calculate a correlation coefficient between a column of data in New York and one in Los Angeles. You can move the data back and forth, she said, but that might get expensive. Or you can operate on summaries of the data, but that could compromise the accuracy of your result. There is a tradeoff between cost and the reliability of your conclusion, she said.
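The tradeoff can be illustrated with a minimal Python sketch. This is not Braverman's method. It assumes two synthetic columns whose rows are aligned by index, and it uses block averages as a stand-in for "summaries". The names `pearson` and `block_means`, the block size, and the generated data are all hypothetical and chosen only for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def block_means(values, block_size):
    """Summarize a column as the mean of each consecutive block of rows."""
    means = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        means.append(sum(block) / len(block))
    return means


# Synthetic stand-ins for the two columns (not real data).
random.seed(0)
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]       # column held at site A
y = [0.5 * xi + random.gauss(0, 1) for xi in x]  # column held at site B

# Option 1: move every value of one column to the other site.
exact_r = pearson(x, y)
values_moved_full = n

# Option 2: each site summarizes locally; only the summaries are moved.
block_size = 100  # hypothetical choice
x_summary = block_means(x, block_size)
y_summary = block_means(y, block_size)
summary_r = pearson(x_summary, y_summary)
values_moved_summary = len(y_summary)

print(f"full data: r = {exact_r:.3f}, values moved = {values_moved_full}")
print(f"summaries: r = {summary_r:.3f}, values moved = {values_moved_summary}")
```

In this sketch the summary route moves 100 values instead of 10,000, but the coefficient is estimated from far fewer points and is therefore noisier. Larger blocks make the transfer cheaper and the answer less reliable.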
You also need to know what the data you’ve captured means. From her own work, she noted that polar-orbiting satellites monitoring CO2 concentrations might cross the equator at 1:30 p.m. That’s a time when plants are photosynthesizing and taking up CO2, so it would be a mistake to base global CO2 distributions for all times of day on that data. “You have to be aware of the biases that may be imparted to the data that you have, relative to the data you wish you had,” she said.
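A toy calculation shows how sampling at a fixed time of day can bias an estimate. Everything in this sketch is an assumption made for illustration: the sinusoidal daily cycle, the baseline, the amplitude, and the hour of the daily minimum are invented numbers, not measurements or a model taken from the interview.

```python
import math

# Hypothetical parameters for a toy daily CO2 cycle (not real measurements).
BASELINE_PPM = 400.0   # assumed daily-mean concentration
AMPLITUDE_PPM = 5.0    # assumed size of the day/night swing
MIN_HOUR = 14.0        # assumed local hour of the daily minimum
OVERPASS_HOUR = 13.5   # 1:30 p.m. crossing time mentioned in the text


def co2_at(hour):
    """Toy concentration at a given local hour, lowest at MIN_HOUR."""
    phase = 2 * math.pi * (hour - MIN_HOUR) / 24
    return BASELINE_PPM - AMPLITUDE_PPM * math.cos(phase)


# "The data you wish you had": samples every 15 minutes over a full day.
hours = [h / 4 for h in range(96)]
daily_mean = sum(co2_at(h) for h in hours) / len(hours)

# "The data that you have": one sample per day at the overpass time.
overpass_value = co2_at(OVERPASS_HOUR)
bias = overpass_value - daily_mean

print(f"daily mean:     {daily_mean:.2f} ppm")
print(f"overpass value: {overpass_value:.2f} ppm")
print(f"bias:           {bias:.2f} ppm")
```

Under these assumptions the overpass sample sits below the daily mean, so treating it as representative of all times of day would understate the average.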
It’s important, she said, to think hard from first principles and build new statistical tools.
WSJ subscribers can see an excerpt of the interview here.—Rick Nelson