At this point there is no doubt that the present and future of computing are parallel, but which direction the evolution will take is still very much unclear. This article looks at some of the challenges facing us and tries to draw some conclusions about what the future parallel architecture should look like.
#1 Power Consumption Wall
Challenge: For many years, process geometries and supply voltages kept decreasing, creating the illusion that transistors are free. System-on-chip transistor counts grew exponentially because process scaling kept energy efficiency under control. As the channel length and oxide thickness approach atomic limits, the free-lunch scaling is over. Voltage scaling has stalled and we are stuck with high static leakage currents. Process technologists are extending the CMOS lifeline with clever tricks like strained silicon, high-k metal gates and FinFETs, but signs point to trouble ahead.
Solution: Architectures with the smallest number of transistors will win the power game. The metric that matters going forward will be energy efficiency, expressed as work done per unit energy (GFLOPS/W, GHz/W, etc).
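As a minimal sketch of how the efficiency metric is applied, the snippet below ranks two chips by GFLOPS/W. Both chips and all of their numbers are hypothetical and invented purely for illustration; they do not describe any real product.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency as work per unit energy (1 GFLOPS/W equals 1 GFLOP per joule)."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return gflops / watts


# Hypothetical chips: the figures are made up for illustration only.
chips = {
    "big_core": {"gflops": 100.0, "watts": 50.0},
    "small_core_array": {"gflops": 64.0, "watts": 4.0},
}

efficiency = {
    name: gflops_per_watt(spec["gflops"], spec["watts"])
    for name, spec in chips.items()
}

# The chip with lower peak throughput can still win on efficiency.
best = max(efficiency, key=efficiency.get)
print(efficiency, best)
```

Ranking by this ratio rather than by peak GFLOPS is the point of the metric: the slower chip in the sketch comes out ahead because it does more work per joule.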
#2 Thermal Density Wall
Challenge: We face an enormous power density challenge as we scale down to smaller geometries. Shrinking CMOS circuits sink large amounts of current in very small spaces, leading to thermal densities approaching those of rocket engines. It is possible to improve power density by running circuits at a low frequency, but this is detrimental to performance and product cost. Thermal density can be mitigated by designing a homogeneous architecture that spreads high-power circuits like floating point units uniformly across the chip. Architectures that bunch high-power arithmetic units close together while other parts of the chip sit quiet will experience severe thermal gradients across the chip, decreasing reliability and limiting the maximum performance.
Solution: The architecture should be parallel and distribute power consumption uniformly to minimize thermal hotspots.
#3 Off-chip Bandwidth Bottlenecks
Challenge: Processors designed at 28nm and below have incredible performance densities. For example, a 1024-core chip from Adapteva would occupy 130mm^2 of silicon while offering 2048 GFLOPS of performance. As the array scales up and the process geometry scales down, the performance density gets even more extreme. Currently, standard off-chip IO solutions are placed at the periphery of the chip, meaning IO bandwidth grows only linearly with the length of the chip edge, whereas performance grows with chip area, that is, as the square of the edge length. The IO problem can be solved with more sophisticated high-frequency off-chip communication standards, but this is non-optimal in terms of energy efficiency, latency and complexity. Another approach is to maximize on-chip data reuse while minimizing redundant off-chip data transfers. The off-chip bandwidth challenge is especially daunting for high performance accelerators that cannot complete tasks independently without involvement from the host.
Solution: Until someone invents an energy efficient method of transferring massive amounts of data off-chip, an optimal architecture should be smart enough to enable significant on-chip data and code reuse and to minimize communication with the rest of the system.
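The perimeter-versus-area argument can be sketched numerically. The snippet below starts from the figures quoted above (1024 cores, 130mm^2, 2048 GFLOPS) and assumes a square die, compute proportional to area and peripheral IO proportional to edge length. Those three assumptions and the function name `scaled` are illustrative, not a model of any specific chip.

```python
import math

# Figures quoted in the text for the 1024-core example.
BASE_CORES = 1024
BASE_AREA_MM2 = 130.0
BASE_GFLOPS = 2048.0


def scaled(side_factor: float) -> dict:
    """Scale the die edge by side_factor.

    Assumes a square die, compute proportional to area,
    and peripheral IO proportional to perimeter.
    """
    area = BASE_AREA_MM2 * side_factor ** 2
    gflops = BASE_GFLOPS * side_factor ** 2
    perimeter = 4 * math.sqrt(area)
    return {
        "gflops": gflops,
        "perimeter_mm": perimeter,
        "gflops_per_mm_of_edge": gflops / perimeter,
    }


base = scaled(1.0)
doubled = scaled(2.0)

# Doubling the edge quadruples compute but only doubles the IO perimeter.
compute_growth = doubled["gflops"] / base["gflops"]
io_growth = doubled["perimeter_mm"] / base["perimeter_mm"]
print(compute_growth, io_growth)
```

Under these assumptions, every doubling of the die edge doubles the compute that each millimeter of IO periphery has to feed, which is why on-chip reuse matters more as arrays grow.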
#4 Off-Chip Latency Mismatches
Challenge: Off-chip latency is another roadblock in scaling up the performance of real applications. It's very difficult to optimize the performance of an application if an off-chip resource access takes thousands of clock cycles to complete. In many cache-oriented architectures the off-chip penalty might be non-deterministic, causing severe problems in real-time applications with hard deadlines. Sophisticated caches can overcome some of these latency issues, but at a power and cost premium. With thousands of asynchronous cores running simultaneously, it is becoming increasingly difficult to achieve quality caching performance. A promising approach is to employ software caching techniques optimized on a per-application basis.
Solution: An architecture that does not use hardware-based caching.
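To make the idea of software caching concrete, here is a minimal sketch of a software-managed cache in which the application chooses the block size and capacity. The class name `SoftwareCache`, the least-recently-used eviction policy and the use of a Python list to stand in for off-chip memory are all assumptions made for illustration; a real implementation would move blocks into a core's local memory with explicit transfers.

```python
from collections import OrderedDict


class SoftwareCache:
    """Tiny software-managed cache: the application picks block size and capacity."""

    def __init__(self, backing, block_size, capacity_blocks):
        self.backing = backing            # stands in for off-chip memory
        self.block_size = block_size
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()       # block index -> locally held values
        self.off_chip_fetches = 0

    def read(self, addr):
        idx, offset = divmod(addr, self.block_size)
        if idx in self.blocks:
            self.blocks.move_to_end(idx)          # mark as most recently used
        else:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)   # evict least recently used
            start = idx * self.block_size
            self.blocks[idx] = self.backing[start:start + self.block_size]
            self.off_chip_fetches += 1            # one explicit off-chip transfer
        return self.blocks[idx][offset]


memory = list(range(64))
cache = SoftwareCache(memory, block_size=8, capacity_blocks=2)

# A sequential sweep over 16 addresses touches only two blocks.
total = sum(cache.read(addr) for addr in range(16))
print(total, cache.off_chip_fetches)
```

Because the policy lives in software, the number and timing of off-chip transfers is under the application's control and can be tuned to its access pattern, which is what makes the off-chip penalty predictable.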
#5 Architecture Scaling Limits
Challenge: Many processor architectures have inherent scaling limits. For example, a wide SIMD architecture with 1024 arithmetic units running in parallel might be very efficient on a small set of applications but will likely achieve only a fraction of peak performance on most applications. Today most SIMD extensions like AltiVec, SSE, and NEON have settled on 4-way SIMD parallelism, and increased performance requires using multiple large SIMD processor cores in parallel. With multiple cores, many choices affect scaling: cache coherency, programming model symmetry, and memory hierarchy. In less than five years it will be possible to put 64,000 RISC processors on a single chip, so scaling efficiency is very important.
Solution: An architectural approach with scalable parallelism and extensive field-proven application success stories.
#6 Frequency Scaling Limits
Challenge: For full swing CMOS circuits, it is difficult to make fast circuits while keeping dynamic power consumption under control. This wasn't always the case. When processors were running at 40MHz, there was still plenty of room to grow. Today leading-edge server processors might reach 4GHz, but we have reached a practical "frequency ceiling." Mobile processors are more efficient, but at 2GHz, they too are hitting the same frequency wall. In contrast, gains made from parallel circuits are essentially unlimited.
Solution: Architectures that derive performance from parallelism and execution efficiency rather than top operating frequency.
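The trade-off between frequency and parallelism can be sketched with the standard first-order estimate of CMOS switching power, P = a * C * V^2 * f. The capacitance and the two voltage and frequency operating points below are hypothetical; the sketch assumes that a circuit clocked at half the frequency can run at a somewhat lower supply voltage, and it ignores leakage.

```python
def dynamic_power(cap_farads, volts, freq_hz, activity=1.0):
    """First-order CMOS switching power: P = activity * C * V^2 * f."""
    return activity * cap_farads * volts ** 2 * freq_hz


C = 1e-9  # hypothetical switched capacitance per unit, in farads

# One unit at 2 GHz and 1.0 V versus two units at 1 GHz and 0.8 V.
# Both configurations deliver the same number of operations per second
# if the work parallelizes perfectly (an assumption).
one_fast = dynamic_power(C, volts=1.0, freq_hz=2e9)
two_slow = 2 * dynamic_power(C, volts=0.8, freq_hz=1e9)

ratio = two_slow / one_fast
print(one_fast, two_slow, ratio)
```

The saving comes entirely from the squared voltage term: at an unchanged supply voltage, the two slower units would draw the same dynamic power as the single fast one.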
#7 Imperfect Manufacturing
Challenge: As we scale down to 22nm and beyond, it becomes statistically impractical to avoid all manufacturing imperfections. Perfecting the manufacturing process would make wafers prohibitively expensive. To get wafers at reasonable cost we need to accept some level of defects and on-chip variation. Imperfections are mitigated by adding design redundancy.
Solution: Architecture that is naturally redundant without a need to add significant design overhead.
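A rough sketch of why natural redundancy helps: if a chip with many identical cores can still ship when a few of them are defective, yield improves dramatically. The model below is a simple binomial one that assumes independent per-core defects, and the per-core yield of 0.999 is a hypothetical value chosen for illustration, not measured data.

```python
from math import comb


def chip_yield(total_cores, required_cores, core_yield):
    """Probability that at least required_cores of total_cores are defect-free.

    Assumes each core fails independently with the same probability,
    which is a simplification of real defect distributions.
    """
    return sum(
        comb(total_cores, good)
        * core_yield ** good
        * (1 - core_yield) ** (total_cores - good)
        for good in range(required_cores, total_cores + 1)
    )


p = 0.999  # hypothetical per-core yield

# Every one of the 1024 cores must work.
no_tolerance = chip_yield(1024, 1024, p)

# The chip is still usable with up to 8 defective cores.
with_tolerance = chip_yield(1024, 1016, p)

print(no_tolerance, with_tolerance)
```

In an array of identical cores, the spare capacity is simply more of the same core, so tolerating defects needs little dedicated redundancy logic.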
#8 Time to Market
Challenge: Time to market is shrinking, and companies that miss critical market windows will at best see their profit potential decimated and at worst go out of business. Some simply allocate more resources to a project to meet goals, but this has limited ROI. Projects with large engineering teams are notoriously difficult to manage and extremely expensive. As system complexity increases we must continue to improve abstraction levels, code reuse, iteration speeds and engineering efficiency.
Solution: Architectures supporting high-level programming languages, floating-point data formats, fast compilation times, extensive code reuse and code portability.
#9 Software Complexity
Challenge: Open source development enables the collective to achieve something that most individual companies could not achieve alone. Given the complexity of today's drivers, operating systems, and application software, only the largest semiconductor companies can afford to develop closed source proprietary software.
Solution: A transparent architecture that supports an open source programming model, maximizes reuse and has corporate backing for the open source community.
#10 Amdahl's Law
Challenge: Amdahl's Law states that performance improvements are ultimately limited by how much serial code remains in the application. To scale performance by orders of magnitude in the future, the only practical approach is to make the serial portion of the code trivial and fixed in time.
Solution: An architecture that can handle application-level parallelism, task-level parallelism, as well as fine-grained parallelism.
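Amdahl's Law is easy to evaluate directly. With a serial fraction s and n processors, the speedup is 1 / (s + (1 - s) / n). The sketch below uses the 64,000-processor figure mentioned earlier in the article; the serial fractions are illustrative values, not measurements.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's Law: 1 / (s + (1 - s) / n)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# Illustrative serial fractions evaluated on a 64,000-processor chip.
for s in (0.10, 0.01, 0.001):
    print(f"serial fraction {s}: speedup {amdahl_speedup(s, 64_000):.1f}x")

# With 1% serial code, the speedup can never reach 100x,
# no matter how many processors are added.
limit_check = amdahl_speedup(0.01, 64_000)
```

The speedup is capped at 1 / s, so even a 1% serial portion wastes almost all of a 64,000-processor array. This is why the serial portion has to be made trivial before massive parallelism pays off.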