Adding processing directly into memory is getting a serious look, particularly for applications where the volume of data is so large that moving it back and forth between various memories and processors requires too much energy and time.
The idea of inserting processors into memory has cropped up intermittently over the past decade as a possible future direction, but it was dismissed as an expensive and untested alternative to device scaling. Now, as the benefits of scaling decrease due to thermal effects, various types of noise, and skyrocketing design and manufacturing costs, all options are on the table. This is particularly true for applications such as computer vision in cars, where LiDAR and camera sensors will generate streaming video, and for artificial intelligence/machine learning/deep learning, where large volumes of data need to be processed quickly.
“If you can process data where it resides, it’s much more efficient,” said Dan Bouvier, chief architect of client products at AMD. “If you have to go across links, it’s very expensive in terms of power—and I/Os are particularly costly. They’re not scaling. The PHYs are not scaling. And packaging technology is too expensive at this point to go to the finer bump pitches. You want as much compacted in close proximity as possible. And if you’re using heterogeneous processors, it’s easier to power manage locally.”
This is as true in data centers as it is in autonomous vehicles and other edge devices, and it’s far from a startling new revelation. AI/ML/DL and streaming video are not new technologies. But as they begin to ramp across multiple markets, unique challenges are cropping up involving power and latency. Put simply, the amount of data that will need to be processed is expected to outstrip the gains in performance and energy efficiency from scaling, and the only way to solve that is through architectural improvements and hardware-software co-design.
“Balancing memory bandwidth and compute bandwidth has been the central question in computer system architecture since the beginning of computers,” said Chris Rowen, CEO of Babblelabs. “Even 50 years ago, people said, ‘In a general-purpose sort of way I’m going to need on the order of a byte per operation.'”
That equation hasn’t changed significantly over the years. What has changed are the approaches to do that more efficiently. Among them:
• Combining multiple operations into a single cycle;
• Altering the frequency at which data moves between processor and memory, either through caching or reduced accuracy in computing; and
• Shortening the distance between processor and memory while ensuring there is sufficient bandwidth.
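Rowen’s rule of thumb — roughly a byte of memory traffic per operation for general-purpose code — can be made concrete with a simple arithmetic-intensity calculation. The sketch below uses the roofline argument with made-up hardware figures; the numbers are illustrative, not measurements of any real device:

```python
# Sketch: arithmetic intensity (ops per byte) determines whether a kernel
# is limited by compute throughput or by memory bandwidth (the "roofline"
# argument). All figures are illustrative.

def attainable_gflops(ops_per_byte, peak_gflops, mem_bw_gbs):
    """Performance is capped by whichever resource runs out first."""
    return min(peak_gflops, ops_per_byte * mem_bw_gbs)

# Hypothetical accelerator: 10 TFLOP/s of compute, 100 GB/s of DRAM bandwidth.
PEAK, BW = 10_000, 100

# A streaming kernel near Rowen's ~1 byte/op is memory-bound:
low_intensity = attainable_gflops(1.0, PEAK, BW)     # capped at 100 GFLOP/s

# A kernel that reuses each byte hundreds of times becomes compute-bound:
high_intensity = attainable_gflops(500.0, PEAK, BW)  # capped at 10,000 GFLOP/s

print(low_intensity, high_intensity)
```

Shortening the processor-memory distance attacks the bandwidth ceiling directly, which is why it matters most for low-intensity, data-heavy workloads.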
Work is underway in all three of these areas, and all show promise. But reducing the distance between processor and memory offers some interesting challenges on a number of fronts.
“This is certainly do-able from a technology perspective to decrease the distance,” said Craig Hampel, chief scientist at Rambus. “And it fits a need for weighting on neural network training because you can’t afford to have a delay. The problem is economics. If you look at DRAM, the bits are assembled in a way that is very regular so it is cost-effective. The goal of 3D is to make those distances shorter, and 2.5D certainly helps, as well. But both approaches make thermal issues more difficult to resolve, and they are harder to test.”
The Hybrid Memory Cube, developed by Micron and Samsung, provided one example of efforts to reduce distances and improve data throughput by stacking memory on logic in a 3D configuration and connecting the different layers with through-silicon vias (TSVs).
“People are very interested in direct access to memory,” said Amin Shokrollahi, CEO of Kandou Bus. “The problem is that you have to be able to build it so you can do normal programming. Software is as important as hardware.”
This is one area where the economics become particularly troublesome. “One reason why the Hybrid Memory Cube didn’t take off was there was no second source,” said Shokrollahi. “HBM (high-bandwidth memory) was more bare bones, but it provided access to all memory and it was multi-sourced. The HBM packaging also can support more layers, and you can cool it down very easily. If you package a processor inside memory, it gets very hot.”
One possible solution is to limit the size of the processors and the memories. Mythic, an Austin, Texas-based startup, introduced a new matrix multiply memory architecture at last month’s Hot Chips 30 conference aimed at the AI/machine learning market. Mythic’s approach is to improve performance by doing analog computation inside of flash memory.
“We are going to represent the weight matrix using flash transistors within a flash array,” said Dave Fick, Mythic’s CTO. “We take this flash array and we pack it into a tile. We have a tile-based architecture where each tile has one of these memory arrays, and then it also has other logic that supports reconfiguration and intermediate data storage. The SRAM provides intermediate data storage, so between intermediate stages we store the data in SRAM. We have a RISC-V processor for providing the control within the tile. We have a router that is going to communicate with adjacent tiles, and then a SIMD (single instruction, multiple data) unit that provides the operations that aren’t the matrix multiply.”
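The tile organization Fick describes can be modeled, very roughly, in a few lines. The class below is a purely digital stand-in for the analog flash array — the names and shapes are illustrative, not Mythic’s actual design:

```python
# Toy model of a matrix-multiply-in-memory tile: weights stay resident
# (as they would in a flash array), activations stream in, and results
# land in a local SRAM-like buffer before being routed to a neighbor.
# This is an illustrative sketch, not Mythic's implementation.

class Tile:
    def __init__(self, weights):
        self.weights = weights   # stands in for the resident flash weight array
        self.sram = []           # intermediate-data storage between stages

    def matvec(self, activations):
        """Compute y = W*x without ever moving W off the tile."""
        out = [sum(w * a for w, a in zip(row, activations))
               for row in self.weights]
        self.sram.append(out)    # buffer the intermediate result
        return out

    def route(self, neighbor):
        """Hand the most recent buffered result to an adjacent tile."""
        return neighbor.matvec(self.sram[-1])

# Two chained tiles: only activations cross tile boundaries, never weights.
t1 = Tile([[1, 0], [0, 1]])   # identity layer
t2 = Tile([[2, 2]])
t1.matvec([3, 4])
print(t1.route(t2))           # [14]
```

The key property the sketch captures is that the weight matrix never moves: only small activation vectors travel between tiles, which is where the energy savings come from.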
The limitation here is specialization, because flash transistors cannot be programmed quickly. “You will need to have a fixed set of applications that you are running, but that is fairly typical for edge systems,” Fick said. “We can support multiple applications by mapping different regions to different applications, so we can support several at a time.”
There are other problems that need to be solved, as well. Reducing the distance that signals need to travel between memory and logic creates thermal and cost challenges. But building processing directly into memory adds integration and compatibility issues into the mix.
“The big problem is the memory processes and logic processes don’t fit together, so you can’t do a reasonable job manufacturing these devices together,” said Raik Brinkmann, president and CEO of OneSpin Solutions. “That spurs another wave of innovation on the manufacturing side. For example, with a monolithic 3D architecture you have very thin wires between the logic layer and the memory layer that connect two pieces of silicon. That is basically in-memory computing.”
At this point no one is quite sure how this approach will yield in manufacturing.
“That adds a whole new set of challenges,” said Rob Aitken, an Arm fellow. “One of the interesting architectural innovations in that kind of processing is to do what Stanford did with a pixel-based processing system. In a system like that, the pixels are relatively independent of each other and exist in a 2D surface. All the yield problems you would get gluing two wafers together don’t affect you nearly as much as they do if you have a case of, ‘This wafer gets 75% yield and that wafer gets 75% yield, and when I put them together they get 30% yield.’ You have to build systems where the redundancy implicit in the 3D stacking works with you, not against you. But even if you don’t go to monolithic 3D, and you want to do compute in memory, or near memory, that gets into the data movement problem. If your system requires moving data from here to there, it doesn’t matter how clever your processor is or how fast it is, because that’s not your limiting factor.”
That opens up a whole series of other challenges from the design side.
“It’s not just about how you pack more stuff into a design,” said Mike Gianfagna, vice president of marketing at eSilicon. “Part of it also is how you change traditional approaches to chip design. A near-memory queue requires sophisticated parallel design.”
It also requires a deep understanding of how various types of memory will be utilized in a design. “One big nemesis is virtual memory subsystems,” said AMD’s Bouvier. “You’re moving through data in unnatural ways. You’ve got translations of translations.”
But Bouvier noted there are different metrics for how different types of chips utilize DRAM. With a discrete GPU, he said DRAM runs in the 90% efficiency range. With an APU or CPU, it runs in the 80% to 85% range.
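Those utilization figures translate directly into effective bandwidth. A quick illustration — the 500 GB/s peak is a made-up round number, not a figure from AMD:

```python
# Effective DRAM bandwidth at the utilization levels Bouvier cites.
# The peak figure is hypothetical, chosen only for illustration.

peak_gbs = 500
effective = {device: peak_gbs * eff
             for device, eff in [("discrete GPU", 0.90), ("APU/CPU", 0.825)]}
print(effective)   # {'discrete GPU': 450.0, 'APU/CPU': 412.5}
```

At that scale, the gap between 90% and ~82% utilization is tens of gigabytes per second of lost throughput.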
For AI/ML/DL applications, this is particularly important on the inferencing side.
“The reason Nvidia is so strong isn’t just because they have a parallel architecture—it’s that they have a huge amount of memory,” said Babblelabs’ Rowen. “One of the things that is strikingly different about the inference process is that you often have hundreds of operations per byte of memory. So what makes this problem different, at least for inference, is that you often don’t care about memory. You can throw a lot of compute closely connected to the problem without having that memory bottleneck be an issue. One of the things that’s causing so much innovation is people were coming out with very dense compute architectures and coarse-grained arrays, and the reaction was, ‘That’s nice, but there are no problems that have those characteristics.’ Most of them failed because they didn’t have enough memory bandwidth. But now we have a problem that really does have the characteristics where bandwidth is not an issue.”
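Rowen’s “hundreds of operations per byte” figure is easy to check with a back-of-the-envelope calculation for a typical convolution layer. The layer dimensions below are illustrative, not taken from any particular network:

```python
# Ops-per-byte for a hypothetical 3x3 convolution layer, showing why
# inference can be compute-bound rather than memory-bound.
# Dimensions are illustrative only.

H = W = 56          # output feature-map height and width
C_in = C_out = 128  # input and output channels
K = 3               # kernel size

# Multiply-accumulates: every output element needs K*K*C_in MACs.
macs = H * W * C_out * (K * K * C_in)

# Bytes moved (8-bit tensors): input + weights + output, each touched once.
bytes_moved = H * W * C_in + K * K * C_in * C_out + H * W * C_out

print(macs / bytes_moved)   # roughly several hundred MACs per byte
```

With each byte reused hundreds of times, the layer saturates compute long before it saturates memory bandwidth — exactly the regime where dense compute arrays pay off.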
This makes adding processing in or very close to memory much more attractive. And while it’s still not a sure bet, it’s no longer being dismissed without some serious discussion.
—Susan Rambo contributed to this report.