[C94] REFLEX: Rewrite-Free Row-Aligned Sparse Attention for Efficient LLM Execution on PIM

Abstract

Large language models (LLMs) face decoding bottlenecks as attention repeatedly accesses the key-value (KV) cache. Sparse attention and processing-in-memory (PIM) each reduce data movement, but their naive integration produces irregular KV accesses that span multiple DRAM rows, leading to unnecessary row activations and KV-cache rewrites. We present REFLEX, a rewrite-free sparse attention framework that colocates the required KV entries in a single DRAM row and applies activation-aware scheduling for PIM execution. REFLEX preserves accuracy without hardware changes, achieving up to 1.64× higher throughput and 1.36× better energy efficiency on PIM systems, and up to 1.37× higher throughput in GPU-PIM systems.
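The benefit of row-aligned colocation can be illustrated with a minimal Python sketch; it is not the paper's implementation, and the row capacity, entry size, and selected token set below are hypothetical. The point is only that packing the KV entries chosen by sparse attention contiguously lets one DRAM row activation serve the whole sparse read, whereas the naive layout scatters the same entries across many rows.

```python
# Hypothetical parameters for illustration only (not from the paper).
ROW_BYTES = 1024             # assumed capacity of one DRAM row
HEAD_DIM = 128
ENTRY_BYTES = HEAD_DIM * 2   # one fp16 key or value vector

def rows_touched_naive(selected_ids, entries_per_row):
    """Naive layout: KV entries stay at their original token positions,
    so a sparse selection scatters across many DRAM rows."""
    return len({tok // entries_per_row for tok in selected_ids})

def rows_touched_colocated(selected_ids, entries_per_row):
    """Row-aligned layout: selected KV entries are packed contiguously,
    so a selection that fits in one row needs a single activation."""
    return -(-len(selected_ids) // entries_per_row)  # ceiling division

entries_per_row = ROW_BYTES // ENTRY_BYTES   # 4 entries per row here
selected = [3, 97, 210, 511]                 # token ids picked by sparse attention
print(rows_touched_naive(selected, entries_per_row))      # 4 row activations
print(rows_touched_colocated(selected, entries_per_row))  # 1 row activation
```

In this toy setting the naive layout activates four rows for four selected tokens, while the colocated layout activates one; the paper's activation-aware scheduling builds on the same observation.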

Publication
63rd Design Automation Conference (DAC)
Juhong Park (박주홍)
Visiting Researcher (Duke University)
Sangheum Yeon (연상흠)
Combined MS-PhD student