References
[1] H.-N. Wu and C.-T. Huang, “Data Locality Optimization of Depthwise Separable Convolutions
for CNN Inference Accelerators,” in Design, Automation & Test in Europe Conference
& Exhibition (DATE), 2019, pp. 120–125.
[2] Q. Sun, T. Chen, J. Miao, and B. Yu, “Power-driven DNN dataflow optimization on FPGA,”
in International Conference on Computer-Aided Design (ICCAD), 2019, pp. 1–7.
[3] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN accelerators,” in International
Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.
770–778.
[5] X. Yang, M. Gao, J. Pu, A. Nayak, Q. Liu, S. E. Bell, J. O. Setter, K. Cao, H. Ha, C. Kozyrakis
et al., “DNN dataflow choice is overrated,” arXiv preprint arXiv:1809.04070, 2018.
[6] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and
H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,”
arXiv preprint arXiv:1704.04861, 2017.
[7] Y. LeCun et al., “LeNet-5, convolutional neural networks,” URL: http://yann.lecun.com/exdb/lenet, vol. 20, p. 5, 2015.
[8] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection
with region proposal networks,” in Advances in neural information processing systems, 2015,
pp. 91–99.
[9] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time
object detection,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2016, pp. 779–788.
[10] M. J. Shafiee, B. Chywl, F. Li, and A. Wong, “Fast YOLO: A fast you only look once system
for real-time embedded object detection in video,” arXiv preprint arXiv:1709.05943, 2017.
[11] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted
residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2018, pp. 4510–4520.
[12] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “MnasNet:
Platform-aware neural architecture search for mobile,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2019, pp. 2820–2828.
[13] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint
high-throughput accelerator for ubiquitous machine-learning,” ACM SIGARCH
Computer Architecture News, vol. 42, no. 1, pp. 269–284, 2014.
[14] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al.,
“DaDianNao: A machine-learning supercomputer,” in International Symposium on Microarchitecture,
2014, pp. 609–622.
[15] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam,
“ShiDianNao: Shifting vision processing closer to the sensor,” in Proceedings of the International
Symposium on Computer Architecture, 2015, pp. 92–104.
[16] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X:
An accelerator for sparse neural networks,” in International Symposium on Microarchitecture
(MICRO), 2016, pp. 1–12.
[17] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: efficient inference
engine on compressed deep neural network,” ACM SIGARCH Computer Architecture
News, vol. 44, no. 3, pp. 243–254, 2016.
[18] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable
accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits,
vol. 52, no. 1, pp. 127–138, 2016.
[19] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia,
N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing
unit,” in Proceedings of the International Symposium on Computer Architecture, 2017, pp.
1–12.
[20] L. Du, Y. Du, Y. Li, J. Su, Y.-C. Kuan, C.-C. Liu, and M.-C. F. Chang, “A reconfigurable
streaming deep convolutional neural network accelerator for Internet of Things,” IEEE Transactions
on Circuits and Systems I: Regular Papers, vol. 65, no. 1, pp. 198–208, 2017.
[21] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, S. Zheng, T. Lu, J. Gu, L. Liu, and S. Wei, “A high
energy efficient reconfigurable hybrid neural network processor for deep learning applications,”
IEEE Journal of Solid-State Circuits, vol. 53, no. 4, pp. 968–982, 2017.
[22] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “Flexflow: A flexible dataflow accelerator
architecture for convolutional neural networks,” in International Symposium on High
Performance Computer Architecture (HPCA), 2017, pp. 553–564.
[23] Y. Huan, J. Xu, L. Zheng, H. Tenhunen, and Z. Zou, “A 3D tiled low power accelerator for
convolutional neural network,” in International Symposium on Circuits and Systems (ISCAS),
2018, pp. 1–5.
[24] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, “UNPU: An energy-efficient deep
neural network accelerator with fully variable weight bit precision,” IEEE Journal of Solid-
State Circuits, vol. 54, no. 1, pp. 173–185, 2018.
[25] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging
deep neural networks on mobile devices,” IEEE Journal on Emerging and Selected Topics in
Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.
[26] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, “Tangram: Optimized coarse-grained
dataflow for scalable NN accelerators,” in Proceedings of the Twenty-Fourth International
Conference on Architectural Support for Programming Languages and Operating
Systems, 2019, pp. 807–820.
[27] S. Cass, “Taking AI to the edge: Google’s TPU now comes in a maker-friendly package,”
IEEE Spectrum, vol. 56, no. 5, pp. 16–17, 2019.
[28] B. Moons and M. Verhelst, “DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling Applied
to Scalable Convolutional Neural Network Acceleration,” in System-Scenario-based
Design Principles and Applications, 2020, pp. 99–111.
[29] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P.
Graf, “A massively parallel coprocessor for convolutional neural networks,” in International
Conference on Application-specific Systems, Architectures and Processors, 2009, pp. 53–60.
[30] L. Song, F. Chen, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “AccPar: Tensor Partitioning for Heterogeneous
Deep Learning Accelerators,” in International Symposium on High Performance
Computer Architecture (HPCA), 2020, pp. 342–355.
[31] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator
design for deep convolutional neural networks,” in Proceedings of ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, 2015, pp. 161–170.
[32] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing the convolution operation to accelerate
deep neural networks on FPGA,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 26, no. 7, pp. 1354–1367, 2018.
[33] ——, “Automatic compilation of diverse CNNs onto high-performance FPGA accelerators,”
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
[34] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in International
Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 10–14.
[35] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow
for convolutional neural networks,” ACM SIGARCH Computer Architecture News, vol. 44,
no. 3, pp. 367–379, 2016.
[36] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, and X. Li, “SmartShuttle: Optimizing off-chip
memory accesses for deep learning accelerators,” in Design, Automation & Test in Europe
Conference & Exhibition (DATE), 2018, pp. 343–348.
[37] X. Yang, J. Pu, B. B. Rister, N. Bhagdikar, S. Richardson, S. Kvatinsky, J. Ragan-Kelley,
A. Pedram, and M. Horowitz, “A systematic approach to blocking convolutional neural networks,”
arXiv preprint arXiv:1606.04209, 2016.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional
neural networks,” in Advances in neural information processing systems, 2012, pp.
1097–1105.
[39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image
recognition,” arXiv preprint arXiv:1409.1556, 2014.
[40] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance
model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
[41] G. Ofenbeck, R. Steinmann, V. Caparros, D. G. Spampinato, and M. Püschel, “Applying
the roofline model,” in International Symposium on Performance Analysis of Systems and
Software (ISPASS), 2014, pp. 76–85.
[42] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, “DNNBuilder: an
automated tool for building high-performance DNN hardware accelerators for FPGAs,” in
International Conference on Computer-Aided Design (ICCAD), 2018, pp. 1–8.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich, “Going Deeper With Convolutions,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June 2015.
[44] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, “Understanding
reuse, performance, and hardware cost of DNN dataflow: A data-centric approach,” in
Proceedings of the International Symposium on Microarchitecture, 2019, pp. 754–768.
[45] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, “Scale-sim: Systolic CNN
accelerator simulator,” arXiv preprint arXiv:1811.02883, 2018.
[46] K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz,
“Towards energy-proportional datacenter memory with mobile DRAM,” in Annual International
Symposium on Computer Architecture (ISCA), 2012, pp. 37–48.
[47] I. G. Thakkar and S. Pasricha, “3D-ProWiz: An energy-efficient and optically-interfaced
3D DRAM architecture with reduced data access overhead,” IEEE Transactions on Multi-Scale
Computing Systems, pp. 168–184, 2015.