Real-time Semantic Segmentation based on Bilateral Segmentation Network with Coordinate Attention and Edge Detection Support

Name

Yu-Wei Tseng (曾昱瑋)

Department

Department of Electrical Engineering

Thesis Title

Real-time Semantic Segmentation based on Bilateral Segmentation Network with Coordinate Attention and Edge Detection Support
(Chinese title: 基於具有座標注意力和邊緣檢測輔助之雙邊分割網路的實時語義分割任務)

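Related Theses

★ Low-memory hardware design for real-time SIFT feature point extraction
★ Access control system with real-time face detection and face recognition
★ Self-propelled vehicle with real-time automatic following
★ Lossless compression algorithm and implementation for multi-lead ECG signals
★ Offline user-defined speaker wake-word system and its embedded implementation
★ Wafer map defect classification and embedded system implementation
★ Densely connected convolutional network for speech applied to small-footprint keyword spotting
★ G2LGAN: data augmentation for imbalanced datasets applied to wafer map defect classification
★ Algorithm design techniques for compensating the finite precision of multiplierless digital filters
★ Design and implementation of a programmable Viterbi decoder
★ Low-cost vector rotator silicon IP design based on extended elementary-angle CORDIC
★ Analysis and architecture design of a JPEG2000 still image coding system
★ Low-power turbo code decoder for communication systems
★ Platform-based design for multimedia communication
★ Design and implementation of a digital watermarking system for MPEG encoders
★ Algorithm development for video error concealment and its data reuse considerations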

Files

Full text can be browsed in the system (open after 2025-8-1)

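Abstract (Chinese)

Semantic segmentation has long been an important topic in computer vision. In recent years, Convolutional Neural Network approaches have evolved from the early Encoder-Decoder architecture into a wide variety of architectures. Spatial information and the receptive field are both indispensable for semantic segmentation. For semantic segmentation to be practically applicable, it must also run at real-time inference speed, yet most methods compromise on image resolution and low-level detail, which leads to a significant drop in accuracy. In this thesis, we propose a new architecture based on the Bilateral Segmentation Network (BiSeNet), called BiSeNet V3. We introduce a new feature refinement module to optimize the feature maps and a feature fusion module to combine features effectively, along with an attention mechanism that helps the model extract contextual information. To capture features better, we also use edge detection to enhance boundary features. Combining these methods, the network extracts features through the backbone network and the Sobel operator, while high-resolution and low-resolution features are combined by the proposed modules. Extensive experiments on the Cityscapes dataset verify the approach, which achieves an excellent trade-off between segmentation accuracy and inference speed. Specifically, for a 768 × 1536 input, BiSeNet V3 achieves 79.0% mIoU (Mean Intersection over Union) on the Cityscapes test set at 93.8 FPS on an NVIDIA GTX 1080Ti. For a 720 × 960 input, BiSeNet V3 achieves 76.6% mIoU on the CamVid dataset at 147.6 FPS on an NVIDIA GTX 1080Ti. These results reach the state of the art for real-time semantic segmentation.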
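The accuracy figures above are reported as mIoU. As a rough illustration of how that metric is commonly computed, the following is a minimal sketch in plain Python. The function name `mean_iou`, the flat label lists, and the `ignore_index=255` convention for unlabeled pixels are assumptions made for this example; the thesis's own evaluation code is not shown in this record.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection over Union over two flat sequences of class labels."""
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # assumed convention: skip unlabeled pixels
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fp = sum(conf[r][c] for r in range(num_classes)) - tp
        fn = sum(conf[c]) - tp
        denom = tp + fp + fn
        if denom > 0:  # classes absent from both prediction and target are skipped
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes, one unlabeled pixel.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
print(mean_iou(pred, target, num_classes=3))
```

In this toy case the per-class IoU values are 1/2, 2/3, and 1, so the mean is 13/18. Averaging per class rather than per pixel keeps rare classes from being swamped by large ones such as road or sky.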

Abstract (English)

Semantic segmentation has been an important issue in the field of computer vision. In recent years, Convolutional Neural Network approaches have evolved from the earlier Encoder-Decoder architecture to a variety of architectures. For the semantic segmentation task, spatial information and the receptive field are indispensable. For semantic segmentation to be practically applicable, it must have real-time inference speed. However, most current methods compromise on spatial resolution and low-level detail information, which leads to a significant decrease in accuracy. In this paper, we propose a new architecture based on the Bilateral Segmentation Network (BiSeNet), called BiSeNet V3. It introduces a new feature refinement module to optimize the feature map and a feature fusion module to combine the features efficiently. An attention mechanism is introduced to assist the model in capturing contextual information. We also use edge detection to enhance boundary features. Combining these methods, the network extracts features through the backbone network and the Sobel operator, while the high-resolution features are combined with the low-resolution features by the proposed modules. The results are verified by extensive experiments on the Cityscapes dataset. Our proposed approach achieves an excellent trade-off between segmentation accuracy and inference speed. Specifically, for a 768×1536 input, BiSeNet V3 achieves 79.0% mIoU on the Cityscapes test set at 93.8 FPS on an NVIDIA GTX 1080Ti. For a 720×960 input, BiSeNet V3 achieves 76.6% mIoU on the CamVid dataset at 147.6 FPS on an NVIDIA GTX 1080Ti. These results outperform other networks and achieve the state of the art for the real-time semantic segmentation task.
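The abstract names the Sobel operator as the source of edge features but does not say how it is wired into the network. The sketch below only illustrates the operator itself: it cross-correlates a grayscale image with the two standard 3×3 Sobel kernels and returns the gradient magnitude. The function name `sobel_magnitude`, the list-of-rows image format, and the choice to leave border pixels at zero are assumptions for this example, not details taken from the thesis.

```python
import math

# Standard 3x3 Sobel kernels for horizontal (x) and vertical (y) gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Gradient magnitude of a 2D grayscale image given as a list of rows.

    Border pixels are left at 0.0 (an assumption; padding is another option).
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


# A 4x4 image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
edges = sobel_magnitude(image)
print(edges[1])
```

For the step-edge image, the interior pixels on either side of the edge get a magnitude of 4.0 while the untouched border stays at 0.0. Because the kernels are fixed, an edge map like this adds no learnable parameters, which is one plausible reason to prefer it in a real-time setting.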

Keywords (Chinese)

★ Real-time semantic segmentation (實時語義分割)
★ Deep learning (深度學習)

Keywords (English)

★ Real-time Semantic Segmentation
★ Deep learning

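Table of Contents

Abstract (Chinese)
Abstract (English)
1. Introduction
1.1 Research Background and Motivation
1.2 Thesis Organization
2. Literature Review
2.1 Efficient Backbone Networks
2.2 Traditional Semantic Segmentation Algorithms
2.3 Real-time Semantic Segmentation Algorithms
2.4 Object Detection Algorithms and System Implementations
3. A Bilateral Segmentation Network Designed for Real-time Semantic Segmentation
3.1 Design Motivation and Background
3.2 Coordinate Feature Refinement Module and Coordinate Feature Fusion Module
3.3 Edge Detection Support Method
3.4 Bilateral Segmentation Network Architecture Design
4. An Intelligent Recyclable Waste Detection System Combining a Semantic Segmentation Network with an Object Detection Network
4.1 Design Motivation and Background
4.2 Region of Interest Module
4.3 Introduction to the Intelligent Recyclable Waste Detection System
4.4 Architecture Design Combining the Semantic Segmentation Network with the Object Detection Network
5. Experimental Results
5.1 Datasets
5.2 Experimental Details
5.3 Evaluation Metrics
5.4 Comparison Results: BiSeNet V3
5.5 Ablation Study
5.6 Comparison Results: Bilateral-YOLOv5 for the Intelligent Recyclable Waste Detection System
5.7 Discussion
6. Conclusion
References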

Advisor

Tsung-Han Tsai (蔡宗漢)

Approval Date

2023-1-18
