Blending Big Data

Technology Editor Bill Wong examines how Hadoop and other big data platforms are being integrated into mainline database and development environments.

William Wong Blog

Oct. 14, 2014

It is not hard to see the interest in “big data,” or rather the usefulness of crunching the large amounts of data now available via the Internet. The trend toward the Internet of Things (IoT) only generates more data.

One of the platforms that is almost synonymous with big data is Hadoop (see “Essentials Of The Hadoop Open Source Project”). Its interface is different from that of the typical SQL database, and MapReduce is the typical way of getting data out of a Hadoop cluster. Still, the data acquired in this fashion is typically used by other frameworks, so integrating Hadoop with these can streamline data processing chores. Companies like Oracle and The Mathworks are starting to make Hadoop support part of their standard offerings.

Matlab Release 2014b incorporates a number of new features, including Hadoop support. Other features include a new default color theme (Fig. 1) for graphs. Graphs also have small but noticeable changes like bold titles. Git and Subversion support provides better collaboration, and there is a new custom toolbox packaging system. The packaging system can collect all the artifacts for a subsystem and place them into a single file for distribution and installation.

Figure 1. The Mathworks Matlab Release 2014b incorporates a range of new features including new defaults for graphing that are more user friendly.

The Matlab Hadoop support adds a feature called datastores. A datastore is an object for reading collections of data from a data set that is too large to fit into memory. It can be a collection of files and directories that have the same structure and data formatting. There are different types of datastores, such as one for text files.
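The datastore idea is easy to sketch outside of Matlab. The Python below is only an illustration of the concept, not the Matlab API: the `TextDatastore` class, its `rows_per_read` parameter, and the sample sensor files are all hypothetical names made up for this example. It reads a set of identically structured CSV files a few rows at a time so the whole data set never has to sit in memory.

```python
import csv
import os
import tempfile
from typing import Dict, Iterator, List


class TextDatastore:
    """Hypothetical stand-in for a datastore over same-format CSV files."""

    def __init__(self, paths: List[str], rows_per_read: int = 2) -> None:
        self.paths = paths
        self.rows_per_read = rows_per_read

    def read_chunks(self) -> Iterator[List[Dict[str, str]]]:
        """Yield lists of at most rows_per_read rows, one file at a time."""
        for path in self.paths:
            with open(path, newline="") as handle:
                chunk: List[Dict[str, str]] = []
                for row in csv.DictReader(handle):
                    chunk.append(row)
                    if len(chunk) == self.rows_per_read:
                        yield chunk
                        chunk = []
                if chunk:  # leftover rows at the end of the file
                    yield chunk


def write_demo_files(directory: str) -> List[str]:
    """Create two small CSV files with identical columns (made-up data)."""
    paths = []
    for index, readings in enumerate([[20.5, 21.0, 19.5], [30.0, 31.5, 29.0]]):
        path = os.path.join(directory, f"sensor_{index}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["sensor", "value"])
            for value in readings:
                writer.writerow([f"s{index}", value])
        paths.append(path)
    return paths


def chunk_sizes(rows_per_read: int = 2) -> List[int]:
    """Return the number of rows in each chunk read from the demo files."""
    with tempfile.TemporaryDirectory() as directory:
        store = TextDatastore(write_demo_files(directory), rows_per_read)
        return [len(chunk) for chunk in store.read_chunks()]


print(chunk_sizes())
```

Only one chunk is held at a time, which is the property that matters when the files are far larger than these toy examples.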

MapReduce can utilize the datastores. The mapreducer function defines the execution configuration for the MapReduce operation. The Hadoop support is designed to run on the workstation with a link to a Hadoop cluster, or it can be configured to run jobs on the cluster that return the data. In the latter case, the Matlab compiler is used to generate an application that runs on the cluster.
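For readers who have not used MapReduce, the pattern itself is small. The following Python is a minimal, single-process sketch of the idea and does not use the Matlab or Hadoop APIs. The function names `map_chunk`, `reduce_key`, and `map_reduce`, and the sensor readings, are assumptions made for illustration. The map step emits key/value pairs from each chunk of data, the pairs are grouped by key, and the reduce step collapses each group into one result, here a per-sensor mean.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, float]          # (sensor name, reading)
Partial = Tuple[float, int]         # (partial sum, count)


def map_chunk(chunk: List[Record]) -> Iterator[Tuple[str, Partial]]:
    """Map step: emit (key, (value, 1)) for every record in one chunk."""
    for sensor, value in chunk:
        yield sensor, (value, 1)


def reduce_key(partials: Iterable[Partial]) -> float:
    """Reduce step: combine all partial sums and counts for one key into a mean."""
    total = 0.0
    count = 0
    for partial_sum, partial_count in partials:
        total += partial_sum
        count += partial_count
    return total / count


def map_reduce(chunks: Iterable[List[Record]]) -> Dict[str, float]:
    """Run map over each chunk, group by key, then reduce each group."""
    grouped: Dict[str, List[Partial]] = defaultdict(list)
    for chunk in chunks:
        for key, partial in map_chunk(chunk):
            grouped[key].append(partial)
    return {key: reduce_key(partials) for key, partials in grouped.items()}


# Made-up readings split into two chunks, as a chunked reader might deliver them.
chunks = [
    [("s0", 20.0), ("s1", 30.0)],
    [("s0", 22.0), ("s1", 34.0)],
]
means = map_reduce(chunks)
print(means)
```

On a real cluster the chunks would be processed on different nodes and the grouping would happen over the network, but the map and reduce functions keep the same shape.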

Oracle's SQL database, Oracle SQL, is one of the best known relational database systems around. It is safe to say that Oracle mavens are well versed in SQL and would like to take advantage of Hadoop using the tools they are familiar with. This is why Oracle 12c (Fig. 2) provides hooks into Hadoop platforms like Cloudera, thereby blending SQL and big data into a single interface.

Figure 2. Oracle 12c provides hooks into Hadoop, blending SQL and big data into a single interface.

Oracle's big data connector for Hadoop was shown at the recent Oracle OpenWorld 2014.

Oracle's Hadoop support extends to Oracle GoldenGate, which provides real-time change and distribution support. One of the more interesting demos included Oracle Data Integrator, which has hooks into Hadoop. The Big Data Discovery feature can create recommendations of other data based on what is currently being presented. Visualization features included the use of size and color instead of a simple dot for a point within a graph. This provides additional dimensions while sticking with a 2D graph.

Oracle Big Data SQL takes the tried-and-true SQL language and allows it to be used with big data. This is not the first time SQL syntax has been used to access data in clusters like Hadoop. Apache's HiveQL has an SQL-like structure. HiveQL deviates slightly from SQL, adding and removing features where appropriate. The approach allows access to data using a familiar programming paradigm. In this case it is Oracle SQL rather than HiveQL.
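The appeal of an SQL front end is easier to see next to the hand-written alternative. The sketch below uses Python's built-in sqlite3 module purely as a stand-in. It is not Oracle Big Data SQL or HiveQL, and the `readings` table and sample rows are made up for the example. The same per-sensor average is computed once with a single declarative query and once with explicit grouping code of the kind a MapReduce job has to spell out.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, List, Tuple

# Made-up sample data: (sensor name, reading)
ROWS: List[Tuple[str, float]] = [
    ("s0", 20.0), ("s1", 30.0), ("s0", 22.0), ("s1", 34.0),
]


def average_with_sql(rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Compute per-sensor averages with one GROUP BY query."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
        connection.executemany("INSERT INTO readings VALUES (?, ?)", rows)
        cursor = connection.execute(
            "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"
        )
        return dict(cursor.fetchall())
    finally:
        connection.close()


def average_by_hand(rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Compute the same averages with explicit grouping and summing."""
    totals: Dict[str, List[float]] = defaultdict(lambda: [0.0, 0])
    for sensor, value in rows:
        totals[sensor][0] += value   # running sum
        totals[sensor][1] += 1       # running count
    return {sensor: total / count for sensor, (total, count) in totals.items()}


print(average_with_sql(ROWS))
print(average_by_hand(ROWS))
```

Both functions return the same result. The difference is that the query states what is wanted and leaves the grouping mechanics to the engine, which is the convenience an SQL layer over a cluster is meant to preserve.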

Oracle has also added Hadoop-as-a-Service (HaaS). This is a Software-as-a-Service (SaaS) offering but with Hadoop support. Oracle's Hadoop service works with HaaS as well as other Hadoop cluster implementations like Amazon's Elastic MapReduce.

Big data is becoming more important for the enterprise because data is being collected in all sorts of venues from remote sensors to smartphone usage to web browser tracking. These types of tools allow developers access to the information and processing power of platforms like HaaS without having to learn the sometimes arcane commands and configurations necessary to host and access big data.

This has some interesting implications for embedded developers as cloud-based computing becomes more common, along with the ability to pack something like a Hadoop cluster into an embedded system.

About the Author

William Wong Blog

Senior Content Director

Bill's latest articles are listed on this author page, William G. Wong.

Bill Wong covers Digital, Embedded, Systems and Software topics at Electronic Design. He writes a number of columns, including Lab Bench and alt.embedded, plus Bill's Workbench hands-on column. Bill is a Georgia Tech alumnus with a B.S. in Electrical Engineering and a master's degree in computer science from Rutgers, The State University of New Jersey.

He has written a dozen books and was the first Director of PC Labs at PC Magazine. He has worked in the computer and publication industry for almost 40 years and has been with Electronic Design since 2000. He helps run the Mercer Science and Engineering Fair in Mercer County, NJ.