Deep-Learning-Based Multimedia Processing and Its Applications to Surveillance

Name

王建堯 (Chien-Yao Wang)

Department

Department of Computer Science and Information Engineering

Thesis Title

Deep-Learning-Based Multimedia Processing and Its Applications to Surveillance
(應用於安全監控之深度學習多媒體處理技術)

Files

Full text in the system: never publicly available

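Abstract (Chinese)

Security surveillance systems are increasingly important. In Taiwan, the share of criminal cases solved with the help of video surveillance rose from 1% in 2007 to 19.83% in the first quarter of 2016. Traditional surveillance, however, relies on passive human monitoring, so it is mostly used for after-the-fact investigation and cannot effectively stop an accident or crime while an emergency is unfolding. In addition, surveillance cameras worldwide are expected to generate 30 billion frames per second by 2020, a volume that human operators cannot handle. Developing an effective, active intelligent surveillance system is therefore extremely important. Deep learning has recently achieved great success in large-scale multimedia data analysis and promises to turn massive data into useful information effectively and quickly. This dissertation designs techniques suited to intelligent surveillance systems on the basis of deep-learning multimedia signal processing. The main sensors for active surveillance are cameras and microphones, and this dissertation develops intelligent audio and video analysis techniques for sound-based and video-based surveillance respectively. Video-based surveillance has the advantage that events can be observed clearly, but it often has blind spots and is easily disturbed by environmental changes. Sound-based surveillance has the advantage that sounds from all directions can be observed, analyzed, and recognized. This dissertation develops deep learning techniques for sound event recognition and detection on the audio side, and for image segmentation, action recognition, and group proposal on the video side.

For sound event recognition and detection, this dissertation designs an auditory-receptive-field binary pattern acoustic feature based on a model of human auditory perception, together with a deep neural network architecture, the hierarchical-diving deep belief network, that hierarchically extracts effective, abstract, discriminative features for classification. For semantic image segmentation, the proposed hierarchical joint-guided network takes the object boundary information produced by the proposed object boundary prediction joint learning network and adjusts the segmentation result with the proposed joint guidance and masking network. For action recognition, the proposed dynamic tracking motion attention model takes into account how objects change dynamically over the course of a video. For group proposal, an unsupervised transfer learning approach combines an objectness map extraction network with an object tracking network to extract dynamic groups from video.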

Abstract (English)

Surveillance systems are becoming increasingly important. In Taiwan, the share of criminal cases solved through video surveillance rose from 1% in 2007 to 19.83% in the first quarter of 2016. However, traditional surveillance relies on manual monitoring, so it is mostly used for passive, after-the-fact tracing and cannot effectively prevent accidents or crimes when an emergency occurs. Moreover, surveillance cameras worldwide are expected to produce 30 billion frames per second by 2020, far more data than human operators can handle. It is therefore important to develop an active, intelligent surveillance system. Deep learning has recently brought great success in multimedia data analysis; it can turn large amounts of data into useful information effectively and quickly. This dissertation designs techniques for intelligent surveillance systems based on deep-learning multimedia signal processing. The sensors suited to active surveillance are cameras and microphones, and this dissertation develops intelligent audio and video analysis techniques for sound-based and vision-based surveillance. Vision-based surveillance can observe events clearly, but it often has blind spots and is susceptible to environmental changes. Sound-based surveillance can observe sound from all directions and analyze and recognize it. This dissertation develops deep learning techniques for sound event recognition and detection on the audio side, and for image segmentation, action recognition, and group proposal on the vision side.

For sound event recognition and detection, a new deep neural network system, the hierarchical-diving deep belief network (HDDBN), is proposed to classify and detect sound events. From the proposed auditory-receptive-field binary pattern (ARFBP) visual audio descriptor, the system learns several forms of abstract knowledge that support knowledge transfer from previously learned concepts to useful representations. For semantic image segmentation, the proposed hierarchical joint-guided network (HJGN) uses our object boundary prediction hierarchical joint learning convolutional network (OBP-HJLCN) to guide the segmentation results. For action recognition, the proposed motion attention model, the dynamic tracking attention model (DTAM), not only considers motion information but also performs dynamic tracking of objects in videos. For group proposal, an unsupervised group proposal network (GPN) is developed by combining the proposed objectness map generation network with the proposed object tracklet network.

Keywords

★ Deep Learning (深度學習)
★ Intelligent Surveillance (智慧型監控)

Table of Contents

1 Introduction
1.1 Motivation and background
1.2 Research objective
1.3 Organization
2 Deep learning based multimedia processing
2.1 Sound event recognition
2.2 Semantic image segmentation
2.3 Group behavior analysis in video
3 Deep learning
3.1 Deep belief network
3.2 Convolutional neural network
3.3 Recurrent neural network
4 Sound event recognition
4.1 Sparse coding convolutional neural networks for sound event recognition
4.2 Auditory receptive field binary pattern for sound event recognition and detection
5 Semantic image segmentation
5.1 Hierarchical joint learning convolutional network for object boundary prediction
5.2 Hierarchical joint-guided networks for semantic image segmentation
6 Group behavior analysis in video
6.1 Dynamic tracking attention model for action recognition
6.2 Group proposal networks
7 Conclusions
References

Advisor

王家慶 (Jia-Ching Wang)

Date of Approval

2017-8-18