3D Object Detection, Recognition, and Pose Estimation with Deep Learning


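Name

陳世翔 (Shi-Xiang Chen)

Department

Department of Computer Science and Information Engineering

Thesis title

3D Object Detection, Recognition, and Pose Estimation with Deep Learning
(3D Object detection, recognition, and position estimation using CNN)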



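This electronic thesis is authorized for immediate open access.
Once the open-access date has been reached, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.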

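Abstract (Chinese)

In recent years, the rapid rise of deep learning has brought its applications in object detection and recognition close to maturity, and object detection is gradually extending to 3D applications such as self-driving cars, virtual reality, augmented reality, and robotic arms. 3D detection requires 3D images, which carry depth information that 2D images lack. The added depth data, however, makes 3D object detection harder: depth-image features must be extracted effectively, more complex high-dimensional data must be processed, objects clutter and occlude one another, and scenes are more complex. In this study, we propose a convolutional neural network (CNN) that directly estimates the position, orientation, and size of 3D objects. Given RGB and depth images as input, the network extracts features, predicts each object's class, pose, and position, and finally outputs a 3D bounding box.
The CNN model used in this study is adapted from the well-known 2D detection network YOLOv3. Our improvements fall into two parts. First, we modify the input of YOLOv3 to take RGB and depth images, and we add channel attention to the Darknet-53 architecture in YOLOv3 to strengthen feature extraction; these features are then used for multi-scale detection and recognition. Second, the 3D translation component of an object is estimated from the distance between the object center and the camera, and the loss function is modified to include a quaternion term for estimating the 3D rotation component. The network finally predicts multi-class object probabilities together with the 3D coordinates, orientation, and size, and outputs 3D bounding boxes.
In the experiments, we modified YOLOv3 into 6DoF YOLO so that the network predicts 3D bounding boxes. On the Falling Things dataset we used 20,854 images, 90% as training samples and the rest as test samples, and this detection system achieved an mAP of 89.33%. After a series of modifications and experimental analyses, we settled on the 6DoF SE-YOLO architecture, which increases the parameter count by a factor of about 1.014 and the computation by a factor of about 1.002. Tested on images at 416×416 resolution, it runs at an average of 35 images per second and reaches an mAP of 93.59%.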
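The second improvement can be read as a decoding step: an image-plane center plus a camera distance gives the 3D translation, and a quaternion gives the 3D rotation. The sketch below illustrates that idea under stated assumptions and is not the thesis's implementation. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, treats the predicted distance as the Euclidean distance from the camera center to the object center, and takes quaternions in `(w, x, y, z)` order. All function names are invented for this example.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The quaternion is normalized first, so a raw network output
    that is not exactly unit length still yields a valid rotation.
    """
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def backproject_center(u, v, dist, fx, fy, cx, cy):
    """Recover the 3D object center from its pixel (u, v) and camera distance.

    Assumption: `dist` is the Euclidean distance from the camera center
    to the object center, so the viewing ray is scaled to that length.
    """
    ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    return dist * ray / np.linalg.norm(ray)

def box_corners(center, size, q):
    """Return the 8 corners (8 x 3) of an oriented 3D bounding box."""
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                     dtype=float)
    local = 0.5 * signs * np.asarray(size, dtype=float)  # corners in the object frame
    return local @ quat_to_rotmat(q).T + np.asarray(center, dtype=float)

if __name__ == "__main__":
    # Hypothetical prediction: center pixel, distance in meters, box size, rotation.
    center = backproject_center(u=400, v=260, dist=1.5, fx=600, fy=600, cx=320, cy=240)
    half_turn = np.sqrt(0.5)
    corners = box_corners(center, size=(0.2, 0.1, 0.3), q=(half_turn, 0, 0, half_turn))
    print(corners.shape)
```

Normalizing the quaternion inside `quat_to_rotmat` is a design choice for this sketch: it keeps the decoded matrix a proper rotation even when the regressed four values drift from unit length.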

Abstract (English)

With the rise of deep learning, its applications in object detection and recognition have gradually matured in recent years. Object detection has also been extending to 3D applications such as self-driving cars, virtual reality, augmented reality, and robotic arms. 3D images carry depth information that 2D images lack, but the depth data also makes 3D object detection more difficult: depth-image features must be extracted effectively, complex high-dimensional data must be handled, objects occlude one another, and scenes are cluttered. In this research, we propose a convolutional neural network (CNN) that directly estimates the position, orientation, and size of 3D objects. The model takes RGB and depth images as input, extracts features, and outputs 3D bounding boxes.
Our model is adapted from the well-known 2D detection network YOLOv3, with two improvements. First, we modify the input to use RGB and depth images and add channel attention to enhance feature extraction; the resulting features are used for multi-scale detection and recognition. Second, we estimate the 3D translation by localizing the object center in the image and estimating the object's distance from the camera, and we add a quaternion term to the loss function to estimate the 3D rotation. The model predicts 3D bounding boxes that contain the object class, 3D coordinates, orientation, and size.
In the experiments, we modified YOLOv3 into 6DoF YOLO, which predicts 3D bounding boxes. We used 20,854 images from the Falling Things dataset, 90% for training and the rest for testing. 6DoF YOLO achieves 89.33% mAP. After further experimental analysis, we adopted the 6DoF SE-YOLO architecture, which increases the parameter count and the amount of computation by factors of about 1.014 and 1.002, respectively. This model reaches 93.59% mAP, with an average speed of 35 frames per second on 416×416 images.
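The channel attention mentioned above, together with the name 6DoF SE-YOLO, suggests a squeeze-and-excitation style block; that reading is an assumption, since the abstract does not spell out the design. The NumPy sketch below shows the general mechanism only: global average pooling per channel, two small fully connected layers, and a sigmoid gate that rescales each channel. The reduction ratio `r`, the random weights, and the function name `se_block` are hypothetical.

```python
import numpy as np

def se_block(feat, w1, b1, w2, b2):
    """Apply squeeze-and-excitation style channel attention.

    feat: feature map of shape (C, H, W)
    w1, b1: first fully connected layer, shapes (C // r, C) and (C // r,)
    w2, b2: second fully connected layer, shapes (C, C // r) and (C,)
    Returns the rescaled feature map and the per-channel gate.
    """
    squeeze = feat.mean(axis=(1, 2))                     # (C,) global average pool
    hidden = np.maximum(w1 @ squeeze + b1, 0.0)          # (C // r,) ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))     # (C,) sigmoid in (0, 1)
    return feat * gate[:, None, None], gate

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, H, W, r = 8, 4, 4, 4                              # hypothetical sizes
    feat = rng.standard_normal((C, H, W))
    w1 = 0.1 * rng.standard_normal((C // r, C))
    b1 = np.zeros(C // r)
    w2 = 0.1 * rng.standard_normal((C, C // r))
    b2 = np.zeros(C)
    out, gate = se_block(feat, w1, b1, w2, b2)
    print(out.shape, gate.shape)
```

A block of this kind adds only the two small fully connected layers per attention site, which is at least consistent with the small parameter and computation overhead reported in the experiments.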
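The figures above can be turned into approximate sample counts and a per-image latency. The short calculation below is only arithmetic on the numbers stated in the abstract. The exact split sizes are not given in the abstract, so rounding to the nearest image is an assumption, and the helper names are invented for this example.

```python
def split_counts(total, train_fraction):
    """Split `total` samples into (train, test), rounding train to the nearest image."""
    train = round(total * train_fraction)
    return train, total - train

def latency_ms(fps):
    """Average time per image in milliseconds for a given frame rate."""
    return 1000.0 / fps

if __name__ == "__main__":
    train, test = split_counts(20854, 0.9)
    print(train, test)                 # 18769 2085
    print(round(latency_ms(35), 1))    # 28.6
```

At 35 frames per second the model spends roughly 28.6 ms per 416×416 image on average.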

Keywords (Chinese)

★ 3D object detection
★ pose estimation
★ quaternion
★ object detection
★ six degrees of freedom

Keywords (English)

★ 3D object detection
★ position estimation
★ quaternion
★ object detection
★ 6 degrees of freedom

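Table of contents

Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 System Architecture
1.3 Features of the Thesis
1.4 Thesis Organization
Chapter 2 Related Work
2.1 Developments in 2D Object Detection Systems
2.2 Developments in 3D Object Detection Systems
Chapter 3 Network Architecture Modifications for 6D
3.1 YOLOv3 Architecture
3.2 6D Network Modifications Based on the YOLOv3 Architecture
Chapter 4 Quaternions and Bounding-Box Pose
4.1 Quaternion-Based Pose Computation
4.2 Bounding-Box Output of the Network
Chapter 5 Experimental Results and Discussion
5.1 Experimental Equipment
5.2 Training the Convolutional Neural Network
5.3 Evaluation and Comparison of CNN Architectures
5.4 6DoF SE-YOLO Results
Chapter 6 Conclusions and Future Work
References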


Advisor

曾定章 (Din-Chang Tseng)

Approval date

2020-7-28
