User interfaces can clarify, or obscure, big data’s meaning
Big data presents significant challenges related to data curation, database evolution, distributed computing, security, and scalability, as I’ve noted in several recent posts. Yet another challenge relates to user interfaces, according to MIT Professor David Karger in the online course “Tackling the Challenges of Big Data,” which runs through March 17.
Hal Varian, Karger said, has pointed out that as data becomes more pervasive, it becomes more important to be able to understand it, process it, visualize it, and communicate it to others. Computers can handle the processing part (as well as storage), but users need effective interfaces for the other three aspects.
Karger presented Anscombe’s quartet as an example of the importance of visualization. It consists of four datasets that have nearly identical statistical properties (means, variances, correlations, and regression lines) yet look quite different when graphed.
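The point is easy to reproduce. Below is a minimal sketch in Python using NumPy and Matplotlib (the tooling is my choice, not the lecture’s); the values are the published Anscombe figures, and the summary statistics come out nearly identical even though the four scatter plots look nothing alike.

```python
import numpy as np
import matplotlib.pyplot as plt

# The published Anscombe quartet values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for ax, (x, y) in zip(axes.flat, datasets):
    x, y = np.asarray(x), np.asarray(y)
    # The summary statistics are nearly identical across all four sets...
    print(f"mean(y)={y.mean():.2f}  var(y)={y.var(ddof=1):.2f}  "
          f"corr(x,y)={np.corrcoef(x, y)[0, 1]:.3f}")
    # ...but the scatter plots look nothing alike.
    ax.scatter(x, y)
plt.show()
```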
Karger noted that there are domain-specific ways of handling some types of data—such as images or text documents—but for the purpose of his lecture he would be focusing on data that generally fits into a row-and-column table format.
Data visualization has been around for some time. He cited a 1981 New York Times graphic that succinctly represented 2,200 numbers describing weather. (You can find the graphic here.) And visualizations predate the computer era, extending back to the Çatalhöyük map 8,900 years ago. More recently, John Snow used data visualization during a cholera outbreak in London in 1854 to correlate cases of the disease with proximity to a particular water pump. And Florence Nightingale diagrammed the causes of soldier mortality with her coxcomb diagram.
Visualization, Karger said, allows us to leverage “pre-attentive processing” (a concept introduced by Anne Treisman), which occurs without conscious thought. He credited Jacques Bertin with cataloging the visual variables that viewers perceive this way: position, relative size, shape, and color.
Just as data visualization can be helpful, it can also mislead, as Karger described in a section he called “Lying with Visualization.” A graph might not show the full scale, for example, or its time scale might be shifted. For artistic purposes, perspective or other effects might be added that distort the viewer’s perception. As a general rule, he said, don’t use multidimensional artifacts to represent one-dimensional quantities. He cited Edward Tufte’s “lie factor,” which compares the size of an effect as shown in the graphic with the size of the effect in the actual data.
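The lie factor is simple to compute. The sketch below is a minimal illustration of my own, not from the lecture; the sample numbers follow Tufte’s well-known fuel-economy chart, in which a roughly 53 percent change in the data was drawn as a roughly 783 percent change in the graphic.

```python
def lie_factor(data_start, data_end, graphic_start, graphic_end):
    """Tufte's lie factor: the effect shown in the graphic divided by
    the effect present in the data (a faithful graphic scores about 1.0)."""
    data_effect = abs(data_end - data_start) / abs(data_start)
    graphic_effect = abs(graphic_end - graphic_start) / abs(graphic_start)
    return graphic_effect / data_effect

# Data rose from 18 to 27.5 (about 53%), but the drawn element grew from
# 0.6 to 5.3 inches (about 783%), so the lie factor is roughly 14.8.
print(round(lie_factor(18.0, 27.5, 0.6, 5.3), 1))
```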
Interactivity in the form of exploratory data analysis can help you look at data from multiple perspectives and test hypotheses without preconceptions, he said. John Tukey pioneered exploratory data analysis with the PRIM-9 (project, rotate, isolate, mask) system.
Confirmatory data analysis, Karger said, is good for testing whether data fits a certain distribution. Visualization, he added, is not good for confirmatory data analysis but is good for suggesting hypotheses.
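As a concrete illustration of that distinction (the tool choice is mine; the lecture did not prescribe one), the sketch below uses SciPy’s Kolmogorov-Smirnov test to check how well a sample fits a normal distribution. Visualization, by contrast, is the better way to decide which distribution is worth testing in the first place.

```python
# A minimal confirmatory-analysis sketch: does this sample fit a normal
# distribution? (NumPy/SciPy are an assumed tool choice, not the lecture's.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=500)

# Kolmogorov-Smirnov test against a normal distribution parameterized by
# the sample's own mean and standard deviation.
stat, p_value = stats.kstest(sample, "norm",
                             args=(sample.mean(), sample.std(ddof=1)))
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
# A small p-value would be evidence against the hypothesized distribution;
# here the sample really is normal, so the test should not reject it.
```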
Direct manipulation is effective for data exploration, he said. He noted that advanced users think abstractly, have a language of relevant actions, and can debug and try again, whereas novices think concretely, learn by doing, and can’t figure out what they have done wrong. Amateurs are the ones best served by direct manipulation tools.
Karger said Ben Shneiderman put forth the direct manipulation paradigm in 1983. It consists of continuous representation, physical actions or button presses (instead of complex syntax), and incremental reversible operations. Karger said that direct manipulation tools for data are as important as WYSIWYG editors. The interaction strategy involves providing an overview (a starting point, not a blank screen), ways to filter, the ability to pan and zoom, and the ability to see details on demand.
Many websites are frontends for giant datasets, Karger said. They offer faceted browsing (versus the original hierarchical browsing) for direct-manipulation filtering, with templates providing a uniform presentation.
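To make the faceted-browsing idea concrete, here is a small Python sketch (the dataset and field names are invented for illustration, not taken from the lecture): an overview of facet counts, a filter applied by selecting a facet value, and details shown on demand.

```python
from collections import Counter

# A toy product catalog; the field names are hypothetical.
products = [
    {"name": "Laptop A", "brand": "Acme", "color": "silver", "price": 999},
    {"name": "Laptop B", "brand": "Bolt", "color": "black",  "price": 1299},
    {"name": "Phone C",  "brand": "Acme", "color": "black",  "price": 699},
    {"name": "Phone D",  "brand": "Bolt", "color": "silver", "price": 599},
]

def facet_counts(items, field):
    """Overview: how many items fall under each value of a facet."""
    return Counter(item[field] for item in items)

def apply_facet(items, field, value):
    """Filter: narrow the working set by one facet selection."""
    return [item for item in items if item[field] == value]

print(facet_counts(products, "brand"))      # Counter({'Acme': 2, 'Bolt': 2})
for item in apply_facet(products, "color", "black"):
    print(item["name"], item["price"])      # details on demand
```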
Karger then discussed defining tags for visualization elements and Exhibit, a prototype JavaScript library that supports those tags; it lets you author a single HTML page that becomes an interactive data visualization.
He concluded by discussing how to augment spreadsheets, which he described as the dominant database tool for end users who don’t know SQL. Spreadsheets offer only one table view at a time—they don’t support joins or connections between tables, and they don’t support many-to-many relationships. An alternative called Related Worksheets involves nested views with linkage between multiple tables. The model provides database-like interaction without the complexity.
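As a hedged illustration of what a single-table spreadsheet view cannot express, the sketch below uses pandas (my choice of library; the lecture described the Related Worksheets research prototype, not pandas) to join an orders table to a customers table, the kind of cross-table linkage Related Worksheets nests directly into the worksheet.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id":    [1, 2, 3],
    "customer_id": [10, 11, 10],
    "amount":      [250.0, 99.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "name":        ["Alice", "Bob"],
})

# A single spreadsheet view shows either table alone; the join links them,
# giving each order row its customer's name.
joined = orders.merge(customers, on="customer_id", how="left")
print(joined)
```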
See these related posts on the online course “Tackling the Challenges of Big Data”:
- Big data challenges: volume, velocity, variety
- Commuting via rapidly mixing Markov chains
- GPUs drive MIT’s MapD Twitter big-data application
- Emerging tools address data curation
- What is cloud computing?
- Big data impels database evolution
- Distributed computing platforms serve big-data applications
- NewSQL takes advantage of main-memory databases like H-Store
- Onions of encryption boost big-data security
- Lecture addresses scalability in multicore systems
Read these other posts on big-data topics:
- Lux Research acquires data and analytics provider Energy Points
- Big data in semiconductors—how to collect, detect, and act
- White House names first U.S. chief data scientist
- IIoT imposes latency, determinism challenges
- JPL statistician comments on big data, cost and reliability of results
Update, 4/18/2016: See related article “Site reviews 20 big-data visualization tools for presenters and developers.”