Title: FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures
Authors: Lan, Haidong; Meng, Jintao; Hundt, Christian; Schmidt, Bertil; Deng, Minwen; Wang, Xiaoning; Liu, Weiguo; Qiao, Yu; Feng, Shengzhong
Affiliations: [Lan, Haidong; Meng, Jintao; Deng, Minwen; Wang, Xiaoning] Tencent AI Lab, Shenzhen 518000, Peoples R China; [Meng, Jintao; Qiao, Yu; Feng, Shengzh…
Corresponding authors: Meng, Jintao; Schmidt, Bertil
Corresponding author addresses: [Meng, JT] Tencent AI Lab, Shenzhen 518000, Peoples R China; [Meng, JT] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen, Peoples R China; [Schmidt,…
Source: IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
Keywords: Convolution; Performance evaluation; Optimization; Computer architecture; Acceleration; Mobile handsets; Libraries; Convolutional neural networks; ARM architecture; inference computation; TensorGEMM
Abstract: Deep learning is ubiquitous in a wide range of applications spanning research and industry. In comparison to the time-consuming iterative training of convolutional neural networks (CNNs), inference is a relatively lightweight operation, making it amenable to execution on mobile devices. Nevertheless, lower latency and higher computational efficiency are crucial to allow for complex models and prolonged battery life. Addressing these challenges, we propose FeatherCNN - a fast inference library for ARM CPUs - targeting the performance ceiling of mobile devices. FeatherCNN employs three key techniques: 1) a highly efficient TensorGEMM (generalized matrix multiplication) routine accelerates Winograd convolution on ARM CPUs; 2) general layer optimization based on custom high-performance kernels improves both the computational efficiency and the locality of memory access patterns for non-Winograd layers; 3) the framework design emphasizes joint layer-wise optimization, using layer fusion to remove redundant calculations and memory movements. Performance evaluation reveals that FeatherCNN significantly outperforms state-of-the-art libraries. A forward propagation pass of VGG-16 on a 64-core ARM server is 48, 14, and 12 times faster than Caffe using OpenBLAS, Caffe2 using Eigen, and NNPACK, respectively. In addition, FeatherCNN is 3.19 times faster than the recently released TensorFlow Lite library on an iPhone 7 Plus. In terms of GEMM performance, FeatherCNN achieves 14.8 and 39.0 percent higher performance than Apple's Accelerate framework on an iPhone 7 Plus and Eigen on a Samsung Galaxy S8, respectively. The source code of the FeatherCNN library is publicly available at https://github.com/tencent/feathercnn.
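To illustrate the minimal-filtering idea behind the Winograd convolution that the abstract's TensorGEMM routine accelerates, the following is a minimal sketch (not FeatherCNN's actual code; the function names are hypothetical) of the 1D Winograd algorithm F(2,3), which produces two outputs of a 3-tap convolution with 4 multiplications instead of the 6 needed by direct convolution:

```python
def winograd_f2_3(d, g):
    """F(2,3): two outputs of a 3-tap convolution using 4 multiplications.

    d: 4 input values, g: 3 filter taps.
    Returns [y0, y1] with y[i] = sum_k d[i+k] * g[k].
    """
    # In practice the filter-side factors are precomputed once per filter.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv3(d, g):
    """Reference: direct 3-tap convolution (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 0.25]
assert winograd_f2_3(d, g) == direct_conv3(d, g)
```

In 2D CNN layers the same idea becomes F(2x2, 3x3), trading 36 multiplications per tile for 16; the element-wise products across input channels are what the paper's TensorGEMM reformulates as batched matrix multiplications.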