[C82] ETA: Efficient Transformer Attention Mapping for ReRAM-based Compute-In-Memory Architectures

Abstract

Transformer models have set new performance benchmarks in vision and language applications. However, their attention mechanisms remain poorly suited for Compute-In-Memory (CIM) architectures due to frequent memory writes and compute-write dependencies. These limitations often lead to increased latency and inefficient resource use. In this work, we propose Efficient Transformer Attention mapping (ETA), a novel approach optimized for ReRAM-based CIM systems. ETA alleviates the compute-write bottleneck by enabling parallel execution of computation and memory writes, and reduces the number of required arrays through an array-aware mapping strategy. This dual optimization leads to significant improvements in both energy efficiency and latency. Experimental results on DeiT-small and GPT2-small using 64×64 arrays demonstrate that ETA outperforms previous state-of-the-art methods, reducing waiting-for-write (W4W) stalls by up to 66%, latency by up to 20%, and array usage by up to 29%.
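The benefit of overlapping memory writes with computation can be illustrated with a toy latency model. This is a conceptual sketch only, not the paper's implementation: the tile count and per-tile compute/write latencies below are made-up illustrative values, and the two scheduling functions are hypothetical simplifications of serialized versus pipelined (double-buffered) execution.

```python
def serialized_latency(tiles, t_compute, t_write):
    # Serialized schedule: each tile's result must be written back
    # before the next tile's computation starts (waiting-for-write).
    return tiles * (t_compute + t_write)

def overlapped_latency(tiles, t_compute, t_write):
    # Pipelined schedule: computing tile i overlaps with writing tile i-1,
    # so steady-state throughput is limited by the slower of the two stages.
    return t_compute + (tiles - 1) * max(t_compute, t_write) + t_write

# Arbitrary illustrative units (not measured values from the paper).
tiles, t_compute, t_write = 8, 2, 5
print(serialized_latency(tiles, t_compute, t_write))  # 56
print(overlapped_latency(tiles, t_compute, t_write))  # 42
```

When the write latency dominates, as is typical for ReRAM, the overlapped schedule hides most of the compute time behind writes, which is the kind of compute-write stall reduction the abstract describes.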

Publication
IEEE Asia Pacific Conference on Circuits and Systems (APCCAS) 2025
Johnny Rhe (이존이)
Combined MS-PhD student
Juhong Park (박주홍)
Visiting Researcher (Duke University)