# **arm** Research

### New and cool memory technologies

Jonathan Beard, 2 October 2019

Data movement dominates, parallelism is critical




#### **Bottom line up front**

Just in case I get kicked off the stage before I finish.....

- The problem we're really trying to solve is always data movement:
  - Context to PE
  - PE communicating results to PE
  - Creating new context from parent PE
  - PE storing results
  - Aligning instruction/command data with input data
- PINM (processing in/near memory) is just another accelerator, but not one we should tackle first.
- We have to face our inconvenient truths...



#### **Bottom line up front**

- Most PINM solutions have issues with:
  - VA -> PA translation / interleaving (bank/channel/etc.)
  - Programmability
  - Cache maintenance operations: where do they happen?
  - In-NVM compute: what happens when cells die? How does it interact with wear-leveling?
  - Exceptions and error handling
  - Synchronization: between PINM units and with host cores
  - Working set size vs. device size: thread migration is needed (some have solutions, do others?)
- There are no magic memories
  - If it sounds too good to be true, it usually is.
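To make the translation / interleaving issue concrete, here is a minimal sketch of how interleaving scatters a buffer across memory devices. The bit layout (64-byte lines, 4 channels, 8 banks) and the names `decode` and `placement` are assumptions chosen for illustration, not the layout of any particular memory controller.

```python
# Hypothetical physical-address layout (an assumption for illustration only):
#   [ row ... | bank (3 bits) | channel (2 bits) | line offset (6 bits) ]
LINE_BITS = 6      # 64-byte cache line
CHANNEL_BITS = 2   # 4 channels
BANK_BITS = 3      # 8 banks per channel


def decode(pa: int) -> tuple[int, int]:
    """Return (channel, bank) for a physical address under the assumed layout."""
    channel = (pa >> LINE_BITS) & ((1 << CHANNEL_BITS) - 1)
    bank = (pa >> (LINE_BITS + CHANNEL_BITS)) & ((1 << BANK_BITS) - 1)
    return channel, bank


def placement(base_pa: int, size: int) -> dict[tuple[int, int], int]:
    """Count the cache lines of a physically contiguous buffer per (channel, bank)."""
    counts: dict[tuple[int, int], int] = {}
    for pa in range(base_pa, base_pa + size, 1 << LINE_BITS):
        key = decode(pa)
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # A single 4 KiB page, physically contiguous, aligned to the page size.
    spread = placement(0x1000_0000, 4096)
    print(len(spread), "distinct (channel, bank) locations hold part of this page")
```

Under these assumed parameters, even one page is split across every channel and bank, so a compute unit attached to a single bank sees only a slice of the data the programmer thinks of as contiguous.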



#### **Bottom line up front**

- Previous slide is a tad depressing...
- Let's talk about some easier opportunities....



## What is memory?



© 2018 Arm Limited

"the faculty by which the mind stores and remembers information" - Apple Dictionary

"the faculty by which the application stores and remembers information" - Apple Dictionary, edited 🙂



### Why don't we consider the interconnect as memory too?

Maybe it should be. It is, right?

- It stores data, right? Even if only for a few cycles at a time.
- Interconnect is way cheaper than DRAM (energy/latency).
- Keep data within the interconnect when possible.
- Why is this so hard to do? Some networking cores have done it, so why not general-purpose cores?



## Accelerator != Just Logic



\*Concept by Joe Wingbermuehle – "Application Specific Memory Subsystems", PhD Thesis, 2015

#### **Customized memory hierarchy**



There are some strange combinations



But...2-10x speedup on FPGA, up to 100x when built as ASIC (higher clock rates)

#### **Customized memory hierarchy**







Locality is, from a certain point of view.



### **Flip the script**

Instead of changing memory hierarchy, add processors (PEs) everywhere





## Do you really want to program that??



#### Just one problem among many..





## The interface




#### But for programming computer hardware....



Build Over (multi-chip modules)



#### We need intuitive, productive interfaces....





**But....** 







Do we need super advanced aliens to help us build programs??

(after all, they did help build the pyramids right??;))

© 2019 Arm Limited

(disclaimer: aliens didn't actually build the pyramids...)

Most "start-up" hardware vendors fail because of software, or the lack of it.





#### Mature Software Ecosystem

Market success isn't based on having the "best" hardware; it comes from having a broad software user base. More simply, it's the software!



#### What if....















- Build an interface that makes it easier to determine data locality, do fast dispatch, etc.
- Lower offload latency == more tasks / unit time
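The second bullet can be checked with back-of-the-envelope arithmetic. The sketch below assumes a single dispatcher that issues tasks back to back and pays the full offload latency once per task, with no overlap; the function name and all the numbers are hypothetical.

```python
def tasks_per_second(offload_latency_s: float, task_time_s: float) -> float:
    """Tasks completed per second by one dispatcher, assuming every task pays
    the offload latency in full and nothing overlaps (a simplifying assumption)."""
    return 1.0 / (offload_latency_s + task_time_s)


if __name__ == "__main__":
    task = 1e-6  # hypothetical fine-grained task: 1 microsecond of useful work
    for latency in (10e-6, 1e-6, 100e-9):  # hypothetical offload latencies
        rate = tasks_per_second(latency, task)
        print(f"{latency * 1e9:8.0f} ns offload -> {rate:12.0f} tasks/s")
```

For fine-grained tasks the offload latency dominates the denominator, so shaving it matters far more than making the accelerator itself faster.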









# The AI/ML Assisted System

(shh, it's not really aliens)



## Parting thoughts...

- Application performance is all about the data, and how fast you can access it
- Can we build better communications primitives to reduce main memory access?
- Memory is an accelerator, memory is compute, it is not an accessory.
- We shouldn't need advanced aliens to come down to help us program our systems...;)




# Thanks!



The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks

# **Issues with PINM**

Systems integration – translation / interleaving




To be useful, you have to initiate potentially hundreds of fine-grained, data-targeted tasks. How?

costly (in terms of time/cycles/energy)

What happens when cells go bad?

## **Types of PINM Accelerators**

Near-memory / In-memory / In-NVM







It doesn't make sense to move the controller for this configuration.



Integrated xbar + full cores near-memory in logic layer...
