Optimizing CNN-based segmentation with deeply customized convolutional and deconvolutional architectures on FPGA
Author(s)
Type
Journal Article
Abstract
Algorithms based on Convolutional Neural Networks (CNNs) have been successful in solving image recognition
problems, showing very large accuracy improvements. In recent years, deconvolution layers have been widely used
as key components in state-of-the-art CNNs for end-to-end training and in models that support tasks such
as image segmentation and super resolution. However, deconvolution algorithms are computationally
intensive, which limits their applicability to real-time applications. In particular, there has been little research
on efficient implementations of deconvolution algorithms on FPGA platforms, which have been widely
used to accelerate CNN algorithms by practitioners and researchers due to their high performance and power
efficiency. In this work, we propose and develop a deconvolution architecture for efficient FPGA implementation.
FPGA-based accelerators are proposed for both deconvolution and CNN algorithms. In addition, memory sharing
between the computation modules is proposed for the FPGA-based CNN accelerator, along with other
optimization techniques. A non-linear optimization model based on the performance model is introduced to
efficiently explore the design space in order to achieve optimal processing speed of the system and improve
power efficiency. Furthermore, a hardware mapping framework is developed to automatically generate a
low-latency hardware design for any given CNN model on the target device. Finally, we implement our
designs on a Xilinx Zynq ZC706 board, and the deconvolution accelerator achieves a performance of 90.1 GOPS
at a working frequency of 200 MHz and a performance density of 0.10 GOPS/DSP using 32-bit quantization,
which significantly outperforms previous designs on FPGAs. A real-time scene segmentation application
on the Cityscapes dataset is used to evaluate our CNN accelerator on the Zynq ZC706 board, and the system achieves
a performance of 107 GOPS and 0.12 GOPS/DSP using 16-bit quantization, and supports up to 17 frames per
second for 512x512 image inputs with a power consumption of only 9.6 W.
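
The headline figures in the abstract can be related to one another with simple arithmetic. The sketch below is not taken from the paper: the function name derived_metrics and its dictionary keys are hypothetical, and the inputs are the rounded values quoted above, so the outputs are approximate. It assumes that performance density is throughput divided by the number of DSP slices used, and that the reported throughput is sustained at the maximum frame rate, which makes the per-frame operation count an upper bound rather than a measured workload.

```python
def derived_metrics(gops, gops_per_dsp, fps=None, power_w=None):
    """Back out quantities implied by reported headline figures.

    All names here are illustrative; inputs are rounded values from
    the abstract, so results are approximate.
    """
    # Assumption: performance density = throughput / DSP slices used.
    out = {"implied_dsps": gops / gops_per_dsp}
    if fps is not None:
        # Upper bound: assumes the full throughput is sustained per frame.
        out["gop_per_frame_upper_bound"] = gops / fps
    if power_w is not None:
        out["gops_per_watt"] = gops / power_w
    if fps is not None and power_w is not None:
        # Energy spent per processed frame at the maximum frame rate.
        out["joules_per_frame"] = power_w / fps
    return out


# Deconvolution accelerator: 90.1 GOPS at 0.10 GOPS/DSP.
deconv = derived_metrics(90.1, 0.10)

# CNN accelerator: 107 GOPS, 0.12 GOPS/DSP, 17 frames/s, 9.6 W.
cnn = derived_metrics(107.0, 0.12, fps=17, power_w=9.6)

for name, metrics in (("deconv", deconv), ("cnn", cnn)):
    for key, value in metrics.items():
        print(f"{name:6s} {key:28s} {value:8.2f}")
```

Under these assumptions both designs imply on the order of 900 DSP slices, and the segmentation system works out to roughly 0.56 J per frame and about 11 GOPS/W. The abstract itself does not state the DSP count, so these are consistency checks on the reported numbers, not additional results.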
Date Issued
2018-12-01
Date Acceptance
2018-07-25
Citation
ACM Transactions on Reconfigurable Technology and Systems, 2018, 11 (3)
ISSN
1936-7406
Publisher
Association for Computing Machinery
Journal / Book Title
ACM Transactions on Reconfigurable Technology and Systems
Volume
11
Issue
3
Copyright Statement
© 2018 ACM. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Reconfigurable Technology and Systems, Vol. 11, Iss. 3 (Dec 2018): http://doi.org/10.1145/3242900
Subjects
Science & Technology
Technology
Computer Science, Hardware & Architecture
Computer Science
FPGA
convolutional neural networks (CNNs)
deconvolution
hardware acceleration
segmentation
1006 Computer Hardware
Publication Status
Published
Article Number
19
Date Publish Online
2018-12