Deep learning is one of the most profound but widely recognized phenomena taking place in the world of information technology. That statement opened a talk by Chris Rowen, CEO and co-founder of BabbleLabs, to a group of semiconductor industry executives during a recent Mentor, a Siemens Business, Summit hosted by its Emulation Division.
Rowen explained that some basic notions about what makes deep learning different are needed to understand the implications at the application level, especially around vision and speech recognition, BabbleLabs’ specialization. He also tailored his talk to outline what deep learning means for the semiconductor industry.
As he noted, everyone has been exposed to the AI hype. A Google search for AI returns about three billion pages, and a search for AI startups could return close to 15,000 results. “The reality is that the words AI startup have exactly the same meaning as the word startup because every startup today does AI and the level of investment taking place in the core ideas is phenomenal.”
Rowen challenged his audience to search arXiv.org for papers on neural networks. Audience members would find close to 20,000 papers, most published in the last two years. Rowen called it an explosion of invention, publishing, sharing, and mutual reinforcement of research in a single area.
“It's useful to review some basic taxonomy,” suggested Rowen. He began with AI, which includes theorem proving, knowledge-based systems, and a wide variety of computations with some connection to human reasoning. Within that is machine learning, a broad cross-section of data-driven, statistically oriented mechanisms that have evolved over the last 30 years or so. Within that is the sub-domain of deep learning, which has a precise definition. While it was invented more than 30 years ago, it has become prominent only in the last five or six years.
“We have rarely seen a technical idea rising from absolute obscurity,” remarked Rowen. Deep learning is built on deep neural networks consisting of a complex layer cake of computation transforming application domains. Describing the fundamental idea behind deep learning can be done several different ways, he said. “I think of it as the construction of a complex numerical model that mimics the behavior of an even more complex but hidden system. The hidden system in question is often the brain.”
Rowen used his own brain to illustrate the point. It looks at a series of photographs of his dad and has a built-in model of pictures. If he accumulates a large enough set of pictures associated with this label and others, he can evolve a system using a highly parameterized mathematical model. These models often have millions, sometimes hundreds of millions, of coefficients. Such a model can reproduce the generalization capabilities of that hidden system, finding common elements in pictures of people and capturing the quintessential elements that make a picture one of his dad, rather than just memorizing a particular set of photographs.
An iterative process of learning from examples can be applied to many different problems, almost independent of the type of data that goes in: images, video sequences, voices, written language, or financial transactions. All kinds of information can be extracted from the data so long as it can be labeled.
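The idea of fitting a parameterized model to labeled examples through iteration can be sketched in a few lines. The example below is an illustration only, not anything shown in the talk: it trains a single-layer logistic model (far smaller than a deep network) by gradient descent on made-up two-dimensional points, and the function names, learning rate, and toy labeling rule are all assumptions chosen for brevity.

```python
import math
import random


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, labels, epochs=200, lr=0.5):
    """Fit weights and a bias by repeatedly nudging them to reduce error."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - y  # gradient of the log loss w.r.t. the pre-activation
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x):
    """Return the predicted label (0 or 1) for one input."""
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0


# Toy labeled data (hypothetical): label is 1 when the two features sum to > 0.
random.seed(0)
examples, labels = [], []
for _ in range(100):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    examples.append(x)
    labels.append(1 if x[0] + x[1] > 0 else 0)

weights, bias = train(examples, labels)
accuracy = sum(
    predict(weights, bias, x) == y for x, y in zip(examples, labels)
) / len(examples)
print(f"training accuracy: {accuracy:.2f}")
```

The model is never told the labeling rule; it recovers an approximation of it from the examples alone. A deep network follows the same loop with many more layers and coefficients.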
The human brain can identify objects, locate them, describe the action, transcribe the speech, extract emotion, translate the language, and identify fraud in transactions––almost an unbounded set of actions. In that sense, deep learning becomes a fundamentally new computational paradigm in the same way that the Turing Machine is a fundamental computational model. It works on a whole set of problems that traditional computing, based on a sequential programming model, couldn’t.
The Image Classification Challenge
Rowen next proposed his audience consider vision, a hard problem to take on, he observed. Processing vision via neural networks requires huge computational power to perform training and create a model, and further computational power to run the model once it has been derived, a step called inference.
The genesis of image classification traces back to the ImageNet challenge, which consisted of 1.2 million images grouped in 1,000 categories of objects. Presumably, a six-year-old knows the names of 1,000 objects. The ImageNet challenge made it harder because 120 of the categories are breeds of dog, not something a normal six-year-old would know.
In the middle part of the last decade, when the benchmark was created, conventional image-processing algorithms were applied to solve it, but the correct label appeared among their top five guesses only about 70% to 75% of the time, an error rate of 25% to 30%.
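The top-five error rate mentioned above counts an image as wrong only when the correct label is missing from the classifier's five highest-scoring guesses. A minimal sketch of that metric follows; the function name and the tiny eight-class score table are hypothetical and stand in for a real classifier's output.

```python
def top_k_error(scores_per_image, true_labels, k=5):
    """Fraction of images whose true label is not among the k best guesses."""
    misses = 0
    for scores, truth in zip(scores_per_image, true_labels):
        # Class indices ordered from highest score to lowest, cut to the top k.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if truth not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)


# Hypothetical scores for four images over eight classes.
scores = [
    [0.10, 0.50, 0.03, 0.06, 0.11, 0.09, 0.07, 0.04],
    [0.30, 0.25, 0.20, 0.10, 0.08, 0.04, 0.02, 0.01],
    [0.30, 0.25, 0.20, 0.10, 0.08, 0.04, 0.02, 0.01],
    [0.01, 0.02, 0.04, 0.08, 0.10, 0.20, 0.25, 0.30],
]
truths = [1, 4, 7, 7]

top5 = top_k_error(scores, truths, k=5)
top1 = top_k_error(scores, truths, k=1)
print(top5, top1)  # 0.25 0.5
```

The second image shows why the top-five measure is more forgiving: the true class ranks fifth, so it counts as a hit under top-five but a miss under top-one.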
Then in 2012, a neural network was first applied to the challenge, dramatically cutting the error rate. In every subsequent year, there has been further progress on lowering the error rate. Interestingly, one researcher thought the approach was unfair because electronic systems were allowed to be trained for days on these pictures, whereas humans were asked to walk in off the street and get these images right. He elected to study these pictures and trained himself for hundreds of hours to compete effectively. For a while, he could beat the best neural-network systems, but then they got better and better, and even he was surpassed as the error rate dropped below 5%.
Neural networks started to achieve superhuman accuracy, changing traditional thinking. Deep learning was no longer a laboratory curiosity or something that might reduce costs a bit. Rather, neural networks now provided the best way of solving a class of problems.
The proliferation of neural networks raises interesting questions about where to perform this demanding computation. Imaging is one area being explored. Images are captured out on the edge, but computation may need to be done in the cloud to aggregate all of the data coming from all of the different cameras.
The camera could have its own built-in neural-network computing hardware so that it operates on its own image stream. Or the computation could move somewhere that aggregates data across a cluster of cameras for a parallax view of what's going on. Or it could move to the edge of the cloud, where it will have low latency and great flexibility. Or it could follow the default for almost any new computation and run in the cloud, which is flexible and offers easy access to all of the world's data, where everything can be correlated against everything else.
However, there are sharp tradeoffs, Rowen cautioned. For many applications, particularly when operating on real-time data, system responsiveness is imperative. The low latency needed to drive a car, for example, rules out a round trip to the cloud and back to make a split-second decision.
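A back-of-the-envelope calculation shows why round-trip delay matters for a moving vehicle. The sketch below is illustrative only: the speed and the delay are assumed figures, not numbers from the talk, and the function name is hypothetical.

```python
def distance_during_delay_m(speed_kmh, delay_ms):
    """Meters a vehicle travels while waiting delay_ms for an answer."""
    speed_m_per_s = speed_kmh / 3.6  # convert km/h to m/s
    return speed_m_per_s * (delay_ms / 1000.0)


# Assumed figures: highway speed of 100 km/h, 100 ms cloud round trip.
blind_distance = distance_during_delay_m(speed_kmh=100, delay_ms=100)
print(f"{blind_distance:.2f} m traveled before the reply arrives")
```

Under these assumptions the car covers close to three meters before any cloud response could arrive, which is the kind of gap on-board inference is meant to close.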
The scope of data analysis pulls the other way. To try to figure out what's happening in all of the shopping malls worldwide would require satellite photos from everywhere. That can only be done in the cloud where most data resides, where models have been trained, and where that and other information can be collected for correlation.
For privacy reasons, however, raw data should not go to the cloud. Analysis, especially sophisticated analysis, should be kept local so that only the necessary insights are shared, not raw data that contains other latent information.
Finally, Rowen pointed to costs in volume. Computation can be done in an edge device or an ASIC to get a couple of orders of magnitude less power consumption than in a general-purpose cloud device. The cloud's flexibility is worth real money, and the fact that it's always there, scalable, and “pay as you go” has advantages. Still, doing in milliwatts a computation that would otherwise take tens of watts is compelling. By volume of total computation, deep learning is going to be an edge phenomenon because that's where images come from and that's where the lowest long-term cost for computation is available.
In the meantime, training, development, and, most important, early deployment will shift toward the cloud, he affirmed.
Vision is an interesting problem as well, because almost all of the world's new data comes in the form of pixels. “Think about all of those sensors, thermometers, pressure sensors, accelerometers, and microphones,” directed Rowen, who added that there is nothing like a CMOS sensor that gives so much information per sensor. “The sheer number of sensors is getting pretty darn big.”
Chart the population of humans versus the population of CMOS sensors. “In about 2016, we hit a crossover point where there were more sensors than people, which begs the question, where's the data going? Who's looking at it? Not even all the people in the world could look at all of the outputs from all of the sensors 24/7.”
“What this means is that for the data to become useful, we need more and more intelligent interpretation.” Rowen said it would happen by filtering autonomously in response to what’s happening in front of the camera, or by filtering the data down by many orders of magnitude to get to something that makes sense. “If you take the raw sensor data volume, simplifying assumptions that these are all high-definition sensors running at 60 frames per second, that's 10^19 raw pixels per second. If you compare that to any existing network, storage, or computation capabilities, you realize that we are in deep trouble, since there's no way we could transfer, store and process 10^19 pixels per second.”
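The arithmetic behind the quoted figure (read here as 10^19 pixels per second) can be checked with a short calculation. This is a sketch under stated assumptions: the talk says only "high-definition" at 60 frames per second, so the 1920 x 1080 frame size is an assumption, and the sensor count printed is simply what the quoted total implies under that assumption, not a number Rowen gave.

```python
# Assumption: "high-definition" means a 1920 x 1080 frame.
WIDTH, HEIGHT = 1920, 1080
FRAMES_PER_SECOND = 60            # stated in the talk
QUOTED_TOTAL_PIXELS_PER_S = 1e19  # figure quoted in the talk

# Raw pixels produced by one sensor every second.
pixels_per_sensor_per_s = WIDTH * HEIGHT * FRAMES_PER_SECOND

# Number of such sensors needed to reach the quoted total.
implied_sensor_count = QUOTED_TOTAL_PIXELS_PER_S / pixels_per_sensor_per_s

print(f"per sensor: {pixels_per_sensor_per_s:,} pixels/s")
print(f"implied sensors: {implied_sensor_count:.2e}")
```

Each sensor alone produces more than a hundred million raw pixels per second under this assumption, which is why filtering at or near the sensor carries so much weight in the argument.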
Rowen noted a revolution in computation is needed to take advantage of the potential information captured by all of the sensors. At a minimum, the need is great for more computation at the edge of networks and huge amounts of computation in the cloud. Unfortunately, he stated, the power-intensive cloud is untenable without warming the planet even faster.
Deep Learning and Emulation
Deep-learning methods are opening up a wide spectrum of smarter systems, developed more quickly and responding more naturally to the real world. Deep-learning-based design is particularly challenging, however, for developers of real-time vision and speech systems. Traditional simulation methods, using modest artificial test stimulus sets, are increasingly inadequate for three reasons:
- The computational complexity of deep-learning inference is high, so large complex arithmetic engines must be deployed in silicon. Slow turnaround time for simulation of these structures inhibits the innovation rate.
- Deep-learning algorithm “correctness” is often judged by statistical measures, not bit-exactness, so large data streams must be characterized. Near-real-time emulation gives increased confidence in the real-world behavior of the system.
- These smart real-time systems involve sophisticated, usually asynchronous interactions between raw input data streams, deep-learning inference engines, high-level system controls, and fault-tolerance and error-recovery monitors. Modern emulation systems can faithfully implement actual system behavior even in obscure corner-case coincidences.
In closing, Rowen said the need is great for deep learning and vision optimized for edge or near-edge computation.
Author’s Note: To hear more about deep learning, register for DVCon February 25-28 at the DoubleTree Hotel in San Jose, Calif. Panelists from Achronix, AMD, Arm, Mythic, and NVIDIA will be discussing “Reshaping the Verification Landscape or Business as Usual?” The panel will be held on Wednesday, February 27, from 1:30 to 2:30 pm.