[J4] Design and Analysis of a Neural Network Inference Engine based on Adaptive Weight Compression


Neural networks generally require significant memory capacity and bandwidth to store and access their large numbers of synaptic weights. This paper presents the design of an energy-efficient neural network inference engine based on adaptive weight compression using a JPEG image encoding algorithm. To maximize the compression ratio with minimal accuracy loss, the quality factor of the JPEG encoder is adaptively controlled depending on the accuracy impact of each block. With 1% accuracy loss, the proposed approach achieves 63.4X compression for a multilayer perceptron (MLP) and 31.3X for LeNet-5 on the MNIST dataset, and 15.3X for AlexNet and 10.2X for ResNet-50 on ImageNet. The reduced memory requirement leads to higher throughput and lower energy for neural network inference (3X effective memory bandwidth and 22X lower system energy for MLP).
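The core idea, JPEG-style transform coding of weight blocks with a per-block quality factor chosen by its accuracy impact, can be sketched as below. This is a minimal illustration, not the paper's implementation: it assumes 8x8 blocks of weights pre-scaled to an 8-bit-like range, uses the standard JPEG luminance quantization table and quality-scaling convention, and substitutes a simple per-block reconstruction-error budget (`max_err`) as a stand-in for the paper's accuracy-impact metric.

```python
import numpy as np

# Standard JPEG luminance quantization table (quality 50).
Q50 = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
], dtype=float)

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix: rows are frequencies, columns are samples.
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] /= np.sqrt(2)
    return M * np.sqrt(2.0 / n)

def jpeg_roundtrip(block, quality):
    # Scale the quantization table by the quality factor (IJG convention),
    # quantize the 2D DCT coefficients, then reconstruct the block.
    s = 5000 / quality if quality < 50 else 200 - 2 * quality
    Q = np.clip(np.floor((Q50 * s + 50) / 100), 1, None)
    D = dct_matrix()
    q = np.round((D @ block @ D.T) / Q)
    return D.T @ (q * Q) @ D, np.count_nonzero(q)

def adapt_quality(block, max_err, qualities=(10, 25, 50, 75, 90)):
    # Pick the lowest quality factor (highest compression) whose
    # reconstruction error stays within this block's error budget --
    # a hypothetical proxy for the paper's per-block accuracy impact.
    for qf in qualities:
        rec, nnz = jpeg_roundtrip(block, qf)
        if np.abs(rec - block).max() <= max_err:
            return qf, rec
    return qualities[-1], rec  # fall back to the highest quality tried
```

Blocks whose weights matter little for accuracy tolerate a large `max_err` and get a low quality factor (few nonzero coefficients to entropy-code), while sensitive blocks keep a high quality factor, which is how an adaptive scheme trades compression ratio against accuracy loss block by block.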

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)