One may not realize it, but technologies such as smart sensors and the Internet of Things (IoT) are beginning to play a crucial role in scientific discovery and engineering. By leveraging the right tools and techniques, data scientists can now collect, analyze, and interpret data from these technologies to reveal important physical phenomena or to provide insight into the operating environment, efficiency, and health of a system.
Beyond internal system benefits, big data can help companies conform to industry regulatory requirements and develop better-performing products or services, ultimately positioning them to do bigger and better business.
To get more insight into this area, I talked with Dave Oswill, product marketing manager at MathWorks.
Dave Oswill, MATLAB Product Manager, MathWorks
“Big data” is a well-known term that applies across a broad spectrum of industries. When embracing big data, what should companies be most aware of?
There’s a lot of hype about the need to find data scientists in order to take advantage of big data. But, in fact, businesses have scientists and engineers with the domain knowledge and experience to be most effective in making design and business decisions with big data. These scientists and engineers just need a software analysis and modeling tool that provides domain-specific capabilities and can handle large sets of data and work with the systems used to store and process this data.
Today, tools such as MATLAB are equipped with new capabilities that enable scientists and engineers to analyze big data for gaining insight, developing models, and incorporating these insights and models into their business’s products, services, and manufacturing processes. Businesses that consider scientists and engineers as part of their big-data strategy can gain a significant competitive advantage in the global marketplace by offering differentiated products and services while also optimizing manufacturing processes.
How can companies benefit from integrating big-data analytic tools with their internal processes? Can you provide an example?
Engineers at Baker Hughes developed a predictive maintenance system to reduce costs and downtime from equipment malfunctions on their oil and gas extraction trucks. In the past, it was difficult to predict the health of a pump or other equipment on their trucks. This caused equipment to be overhauled before it was needed, or the equipment was run to failure, which risked damaging multiple pieces of equipment, sometimes beyond repair.
To solve this problem, engineers at Baker Hughes used MATLAB to collect terabytes of data from the oil and gas extraction trucks, and then developed an application to calculate when equipment would need maintenance or replacement. Getting the results of these algorithms into the hands of those responsible for maintaining the equipment is estimated to save more than $10 million.
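The Baker Hughes workflow itself is proprietary, but the core idea of turning raw sensor streams into maintenance alerts can be sketched in a few lines. The following is a minimal, hypothetical illustration in Python (not MATLAB, and not the actual Baker Hughes algorithm): it flags readings that deviate sharply from a rolling baseline, a crude early-warning signal for equipment health.

```python
from statistics import mean, stdev

def health_alerts(vibration, window=5, z_thresh=3.0):
    """Flag sample indices whose reading deviates sharply from the
    rolling baseline of the preceding `window` readings."""
    alerts = []
    for i in range(window, len(vibration)):
        baseline = vibration[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(vibration[i] - mu) / sigma > z_thresh:
            alerts.append(i)
    return alerts

# Made-up vibration readings; the spike at index 6 triggers an alert.
readings = [1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 4.5, 1.0]
print(health_alerts(readings))  # -> [6]
```

A production system would use physics-informed features and trained models rather than a fixed z-score threshold, but the shape of the pipeline (ingest, baseline, flag, act) is the same.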
Businesses are accumulating rapidly increasing amounts of data. How is that changing the platforms used to store, analyze, and process this data?
Many teams still store vast amounts of data in flat files, and databases are also commonly used to store and manage data. But as the data increases in size and businesses realize the importance of this data across the organization, the trend is for organizations to move to specialized big-data platforms.
Hadoop, for example, provides storage infrastructure and supports a wide variety of data processing applications. Large-scale batch processing applications such as MapReduce and Spark can be used to look for trends and develop predictive models using historical data, while streaming applications add more intelligence and adaptive capabilities to products and services such as predictive maintenance, enhanced operation, or a differentiated level of automation.
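The batch-processing pattern that MapReduce popularized (map raw records into key/value pairs, group by key, then aggregate each group) can be sketched in plain Python. This is only an illustration of the pattern, not Hadoop or Spark code, and the sensor names are invented.

```python
from collections import defaultdict

# Map: emit (key, value) pairs from raw records.
def map_phase(records):
    for sensor_id, reading in records:
        yield sensor_id, reading

# Shuffle: group all values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: aggregate each group, here into a mean reading per sensor.
def reduce_phase(groups):
    return {k: sum(v) / len(v) for k, v in groups.items()}

records = [("pump-1", 20.0), ("pump-2", 31.0), ("pump-1", 22.0)]
print(reduce_phase(shuffle(map_phase(records))))
# -> {'pump-1': 21.0, 'pump-2': 31.0}
```

In a real cluster the map and reduce phases run in parallel across many machines over far larger datasets; the logic per record, however, stays this simple.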
In many cases, engineers and scientists are not familiar with these IT managed systems, so it’s important for them to have a familiar tool that can be used with, and take advantage of, the scaled-processing capabilities of these systems. The tool should also provide the infrastructure for deploying models and algorithms as part of an embedded system, as a real-time or near-real-time service, or as part of an organization’s IT/manufacturing system.
I imagine the thought of exploring and processing years’ worth of data may seem like a nightmare. What, if any, are the capabilities that can simplify this process?
Several capabilities make it easier for engineers and scientists to observe slow-moving trends across the data, to clean the data before developing an algorithm or model, and to find the most relevant data for that development. These include data visualizations supplemented with density information, which make it easy to view patterns and quickly gain insight within large datasets. These visualizations can also aid in identifying outliers, for which filtering routines can be developed and programmatically applied to newly acquired data.
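As one concrete, hypothetical example of such a filtering routine: an outlier fence can be derived once from historical data and then applied programmatically to newly acquired data. The sketch below, in Python for illustration only, uses the common 1.5 × IQR rule; a real pipeline would use domain-specific criteria.

```python
from statistics import quantiles

def make_iqr_filter(train):
    """Build a reusable outlier filter from historical data,
    using the standard 1.5 * IQR fence."""
    q1, _, q3 = quantiles(train, n=4)   # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return lambda xs: [x for x in xs if lo <= x <= hi]

historical = list(range(1, 101))    # stand-in for archived sensor data
clean = make_iqr_filter(historical)
print(clean([50, 200, -100]))       # extreme readings are dropped
```

Freezing the fence from historical data, rather than recomputing it per batch, is what lets the same routine be applied consistently to streaming or newly acquired data.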
Data reduction and correlation functionality that works at scale allows engineers and scientists to simplify the large number of signals collected from a system and to create a more efficient algorithm or model. Engineers and scientists can combine these capabilities with a familiar analysis and modeling tool that can work with big data on the desktop and scale for use with big-data systems and streaming-data pipelines.
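One simple form of correlation-based data reduction is to drop signals that are nearly redundant with signals already kept. The Python sketch below is an illustration of the idea (not a specific MATLAB feature); it computes pairwise Pearson correlation and keeps one representative per highly correlated group. The signal names and values are invented.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def drop_redundant(signals, threshold=0.95):
    """Keep one representative from each group of highly correlated signals."""
    kept = []
    for name, series in signals.items():
        if all(abs(pearson(series, signals[k])) < threshold for k in kept):
            kept.append(name)
    return kept

signals = {
    "temp_C":   [20, 21, 23, 24, 26],
    "temp_F":   [68.0, 69.8, 73.4, 75.2, 78.8],  # linear in temp_C -> redundant
    "pressure": [5, 3, 6, 2, 4],
}
print(drop_redundant(signals))  # -> ['temp_C', 'pressure']
```

With thousands of signals, the same idea is typically applied via scalable primitives (e.g., distributed correlation matrices or PCA) rather than a pairwise loop.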
What are the main considerations when developing a predictive model? Are there any core technologies that play a dominant role?
Often, these datasets contain a large number of signals, even after data reduction. To analyze this data and build an intelligent model, machine learning techniques are typically applied.
Machine learning uses computational methods to “learn” information directly from data without relying on a predetermined equation. In fact, the ability to train models using the data itself opens up many use cases for predictive modeling, such as predictive health for complex machinery and systems, energy load forecasting, and even financial credit scoring.
Machine learning is divided into two types of methods—supervised and unsupervised learning—each of which contains several algorithms tailored for different problems. Supervised learning uses a training dataset that maps input data to previously known response values, while unsupervised learning draws inferences from datasets with input data that doesn’t map to a known output response.
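The distinction can be made concrete with two toy, from-scratch examples in Python (illustrative only, with made-up data): a nearest-centroid classifier trained on labeled data, and a one-dimensional k-means clustering that infers groups without any labels.

```python
from statistics import mean

# Supervised: training data maps inputs to known labels.
def fit_centroids(X, y):
    by_label = {}
    for xi, yi in zip(X, y):
        by_label.setdefault(yi, []).append(xi)
    return {label: mean(vals) for label, vals in by_label.items()}

def predict(centroids, x):
    # Assign the label whose class centroid is nearest.
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Unsupervised: no labels; infer two groups from the data alone (1-D k-means).
def kmeans_1d(xs, iters=10):
    c1, c2 = min(xs), max(xs)
    for _ in range(iters):
        g1 = [x for x in xs if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in xs if abs(x - c1) > abs(x - c2)]
        c1, c2 = mean(g1), mean(g2)
    return sorted([c1, c2])

centroids = fit_centroids([1, 2, 9, 10], ["low", "low", "high", "high"])
print(predict(centroids, 8))      # classified against known labels
print(kmeans_1d([1, 2, 9, 10]))   # group centers found without labels
```

The supervised model needed the "low"/"high" answers up front; the unsupervised one discovered the same two groups on its own, which is exactly the trade-off described above.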
Each of these approaches has several different types of models, and each of those models can have many different combinations of parameters to try. It’s crucial that the engineers and scientists developing these models have a tool that enables them to quickly try and compare many different approaches. (For more on machine learning, see the three things you need to know about machine learning.)
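A common way to quickly try and compare many different approaches is a hold-out evaluation loop: fit each candidate model on training data, score it on unseen data, and keep the best. The Python sketch below uses two trivial stand-in models (a constant baseline and a least-squares line) purely to illustrate the pattern.

```python
from statistics import mean

def fit_mean_model(X, y):
    """Baseline: always predict the training mean."""
    m = mean(y)
    return lambda x: m

def fit_linear_model(X, y):
    """Least-squares line y = a*x + b."""
    mx, my = mean(X), mean(y)
    a = (sum((xi - mx) * (yi - my) for xi, yi in zip(X, y))
         / sum((xi - mx) ** 2 for xi in X))
    b = my - a * mx
    return lambda x: a * x + b

def mse_on_holdout(fit, Xtr, ytr, Xte, yte):
    """Fit on training data, score mean squared error on held-out data."""
    model = fit(Xtr, ytr)
    return sum((model(x) - y) ** 2 for x, y in zip(Xte, yte)) / len(yte)

# Made-up data roughly following y = 2x.
Xtr, ytr = [1, 2, 3, 4], [2.1, 3.9, 6.2, 8.0]
Xte, yte = [5, 6], [10.1, 11.9]

scores = {fit.__name__: mse_on_holdout(fit, Xtr, ytr, Xte, yte)
          for fit in (fit_mean_model, fit_linear_model)}
best = min(scores, key=scores.get)
print(best, scores)
```

Real model-comparison tooling adds cross-validation, hyperparameter sweeps, and many model families, but the loop structure (fit, score on held-out data, compare) is the same.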
In closing, what do you consider to be the best tips for incorporating the full data process into a company’s products, services, and operations?
It’s crucial for engineers and scientists to work closely with their IT team. Collaboration across teams not only primes a business for adaptation but also creates a significant competitive advantage by seamlessly integrating big-data solutions into the products and services that customers are demanding.
Using a tool like MATLAB can help businesses create a workflow that’s familiar and efficient, while enabling these domain experts, a business’s scientists and engineers, to gain important insight from a vast data collection. This approach allows these scientists and engineers to develop algorithms and models for smarter and differentiated products and services, ultimately leading organizations to better adapt to changing business conditions and address market needs more effectively.