The world of relational databases got its start in 1970 with a pioneering paper by Edgar F. Codd in the Communications of the Association for Computing Machinery. Prototypes appeared in the 1970s, but a significant milestone occurred in 1984 when IBM released DB2, according to MIT Professor Mike Stonebraker, speaking as part of the online course “Tackling the Challenges of Big Data,” which runs through March 17.
By the ‘90s, the relational database had become a one-size-fits-all solution, Stonebraker said, adding, “Your relational database salesman was the guy with a hammer, and everything looked like a nail.” The hammer, in this case, is a disk-based, SQL-oriented database system, with ACID (atomicity, consistency, isolation, durability) transactions providing reliability.
By the mid-2000s, Stonebraker said, he came to realize that one size did not fit all, and by now he has reached the conclusion that “one size fits none.” The major vendors, which he called elephants, are selling you 1980s technology (DB2 is still available today), and there’s a better way to address your data-management problems now—whether they involve data warehouses, online transaction processing, or something else.
For data warehouses, he said, column stores are taking over from the row stores the elephants will sell you, because column stores can be two orders of magnitude faster: a typical warehouse query touches only a few of a table's columns, so a column store reads far less data per query and compresses it more effectively.
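To make the layout difference concrete, here is a minimal Python sketch (not from the talk; the table and its data are invented) of the same records stored row-wise and column-wise. A one-column query has to touch every row in the row layout but only a single array in the column layout.

```python
# Hypothetical illustration: one small table stored row-wise and column-wise.
# A typical warehouse query ("average price") needs only one column.

rows = [
    {"order_id": 1, "customer": "acme", "price": 19.99, "qty": 3},
    {"order_id": 2, "customer": "globex", "price": 5.00, "qty": 10},
    {"order_id": 3, "customer": "initech", "price": 42.50, "qty": 1},
]

# Row store: the query reads whole rows even though it needs one field.
avg_price_row = sum(r["price"] for r in rows) / len(rows)

# Column store: each column is a separate (and highly compressible) array;
# the query scans only the "price" column.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["acme", "globex", "initech"],
    "price": [19.99, 5.00, 42.50],
    "qty": [3, 10, 1],
}
avg_price_col = sum(columns["price"]) / len(columns["price"])

print(avg_price_row, avg_price_col)  # same answer; far less data scanned in the column case
```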
The OLTP world, he said, is an update-oriented world, where you want to do updates quickly. Since transaction-processing databases rarely exceed a terabyte, it can be feasible to store them entirely in main memory, whose price is dropping rapidly. And if your TP database doesn't fit in main memory, you can employ a technique called anti-caching, which keeps hot tuples in memory and evicts cold tuples, still in their main-memory format, to a slower storage device.
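Anti-caching inverts the traditional buffer pool: main memory is the primary home of the data, and cold records are pushed out rather than hot ones being pulled in. The toy Python sketch below (illustrative only, not any real system's code; all names are invented) evicts least-recently-used tuples to a stand-in for disk and pulls them back on access.

```python
import pickle
from collections import OrderedDict

# Toy anti-caching sketch (illustrative only; names are invented).
# Tuples live in memory; the least recently used ones are evicted to a
# cold store, serialized in their in-memory form, and pulled back on access.

class AntiCachingStore:
    def __init__(self, capacity):
        self.capacity = capacity        # max tuples kept in main memory
        self.hot = OrderedDict()        # key -> tuple, in LRU order
        self.cold = {}                  # key -> serialized bytes (stand-in for disk)

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.capacity:
            old_key, old_value = self.hot.popitem(last=False)   # coldest tuple
            self.cold[old_key] = pickle.dumps(old_value)        # still queryable, just slower

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)                   # mark as recently used
            return self.hot[key]
        if key in self.cold:
            value = pickle.loads(self.cold.pop(key))    # un-evict the cold tuple
            self.put(key, value)
            return value
        raise KeyError(key)

store = AntiCachingStore(capacity=2)
store.put(1, ("alice", 100))
store.put(2, ("bob", 200))
store.put(3, ("carol", 300))    # tuple 1 is evicted to the cold store
print(store.get(1))             # ('alice', 100), transparently reloaded into memory
```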
Stonebraker then addressed NoSQL, whose proponents contend that SQL is too slow and that you should code algorithms yourself. They further advocate giving up ACID. Both are bad ideas, he said, because compilers are very good at compiling SQL, and ACID can be made to run fast. Another tenet of NoSQL proponents, he said, is to load data now and think about a schema later. His advice? “Think about your data now and you’ll be way better off downstream.” He suggested that NoSQL actually means “not yet SQL,” but acknowledged that the tools are inexpensive, easy to learn, and applicable to low-end applications.
Turning to a different topic, Stonebraker discussed complex analytics: machine learning and predictive modeling, which fall under the rubric of data mining. These workloads involve linear algebra operations on arrays at scale, and they don’t look anything like SQL. You might, for example, want to know whether the closing prices of two stocks over time are correlated, so you compute the covariance of the two price series. Repeat that for all pairs of stocks and you are operating on an array of roughly 15,000 stocks by 4,000 trading days if, for instance, you want to evaluate every publicly traded stock in the U.S. over 20 years. You also want to mix these calculations with data management, and if you are running a table system, you’ll need to convert between tables and arrays. This, he said, is the world of data scientists, and there may or may not be a market for array database systems.
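As a rough illustration (made-up data, and not code from the course), the all-pairs computation is a single dense linear-algebra operation on the price array; note how little of it resembles SQL.

```python
import numpy as np

# Synthetic closing prices: an n_stocks x n_days array.
# (Stonebraker's example is on the order of 15,000 stocks x 4,000 trading days;
# kept tiny here so the sketch runs instantly.)
rng = np.random.default_rng(0)
n_stocks, n_days = 100, 250
prices = rng.lognormal(mean=3.0, sigma=0.1, size=(n_stocks, n_days))

# Covariance of every pair of stocks' price series: an n_stocks x n_stocks matrix.
cov = np.cov(prices)            # one dense linear-algebra operation
print(cov.shape)                # (100, 100)

# Correlation is the covariance normalized by the standard deviations.
corr = np.corrcoef(prices)
print(corr[0, 1])               # correlation of stock 0 and stock 1
```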
He then discussed graph databases, noting that Facebook is one big graph, as is Twitter, that there is a great deal of graph data in the science world, and that on such graphs you can perform operations such as finding a shortest path.
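On an unweighted graph, for example, a shortest path falls out of a breadth-first search; here is a small sketch with an invented “follows” graph.

```python
from collections import deque

# Tiny unweighted "follows" graph (invented data); shortest path by breadth-first search.
graph = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan", "eve"],
    "dan": ["eve"],
    "eve": [],
}

def shortest_path(graph, start, goal):
    queue = deque([[start]])    # queue of partial paths, explored level by level
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None                 # no path exists

print(shortest_path(graph, "ann", "eve"))   # ['ann', 'cat', 'eve']
```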
His final topic was Hadoop, an open-source implementation of Google’s MapReduce, so not surprisingly it performs two functions: map and reduce. Hadoop, he said, is good at “embarrassingly parallel operations.” He noted that Hadoop has morphed over the years into a complete stack, with the original Hadoop in the middle, the HDFS file system at the bottom, and SQL lookalikes such as Hive and Pig on top. However, the stack, he said, can run 100 times slower than a column store for certain operations, such as computational fluid dynamics simulations. And 95% of the market, he said, is not embarrassingly parallel.
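A minimal sketch of the two phases (conceptual only, not Hadoop’s Java API): each map call runs independently on its own slice of the input, which is exactly the embarrassingly parallel case, and the grouped outputs are then reduced per key.

```python
from collections import defaultdict

# Minimal map/reduce sketch (conceptual, not Hadoop's API): word count.
documents = ["one size fits all", "one size fits none", "fits none"]

def map_phase(doc):
    # Emit (word, 1) pairs; each document can be processed independently,
    # which is what makes the job "embarrassingly parallel."
    return [(word, 1) for word in doc.split()]

def reduce_phase(word, counts):
    return word, sum(counts)

# Shuffle: group the mapped pairs by key before reducing.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)   # {'one': 2, 'size': 2, 'fits': 3, 'all': 1, 'none': 2}
```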
He offered four general conclusions: data warehouses will be a column store market, OLTP will be a main-memory database market, NoSQL will be popular for low-end applications, and array and graph databases may or may not gain traction, but you should understand them. (Arrays can store homogeneous collections of items such as sensor data.)
“There is going to be a sea change from simple analytics to complex analytics,” he added, with a move away from business intelligence to data science—and unfortunately, data scientists are hard to find.
In addition, he said, “The internet of things is a force to be reckoned with. Everything on the planet of material significance is going to get sensor-tagged.”
Data is your most valuable asset, he concluded, and you need to leverage it.
About the Author

Rick Nelson
Contributing Editor
Rick is currently Contributing Technical Editor. He was Executive Editor for EE from 2011 to 2018. Previously he served on several publications, including EDN and Vision Systems Design, and has received awards for signed editorials from the American Society of Business Publication Editors. He began as a design engineer at General Electric and Litton Industries and earned a BSEE degree from Penn State.