dc.description.abstract | Surveillance systems are becoming increasingly important. The proportion of criminal cases solved with the aid of video surveillance rose from 1% in 2007 to 19.83% in the first quarter of 2016. However, traditional surveillance systems rely on manual monitoring, so they are often used only for passive post-event tracing and cannot effectively prevent accidents or crimes as emergencies occur. Moreover, surveillance cameras worldwide were projected to produce 30 billion frames per second by 2020, a volume of data far beyond what human operators can handle. It is therefore important to develop active intelligent surveillance systems. Recently, deep learning has brought great success in multimedia data analysis; it can effectively and quickly turn large amounts of data into useful information. This dissertation designs deep-learning-based multimedia signal processing technologies for intelligent surveillance systems. Cameras and microphones are the sensors best suited to active surveillance, so this dissertation develops intelligent audio and video analysis technologies based on both sound and vision. Vision-based surveillance can clearly observe the occurrence of events, but it often has blind spots and is susceptible to environmental changes. Sound-based surveillance can capture sounds from all directions for analysis and recognition. Accordingly, this dissertation develops deep learning technologies for sound event recognition and detection on the audio side, and for semantic image segmentation, action recognition, and group proposal on the vision side.
For sound event recognition and detection, a new deep neural network system, called the hierarchical-diving deep belief network (HDDBN), is proposed to classify and detect sound events. The proposed system learns several forms of abstract knowledge from the proposed auditory-receptive-field binary pattern (ARFBP) visual audio descriptor, which supports the transfer of previously learned concepts into useful representations. For semantic image segmentation, a hierarchical joint-guided network (HJGN) is proposed that uses our designed object boundary prediction hierarchical joint learning convolutional network (OBP-HJLCN) to guide the segmentation results. For action recognition, the proposed motion attention model, called the dynamic tracking attention model (DTAM), not only considers motion information but also performs dynamic tracking of objects in videos. For group proposal, an unsupervised group proposal network (GPN) is developed by combining the proposed objectness map generation network and the proposed object tracklet network. | en_US |