Abstract: | With the development of multimedia surveillance technology, enabling people to retrieve audio-visual information more conveniently is an important issue. In recent years, content-based audio-visual retrieval techniques have been used to locate the key points in such data; besides saving the cost of manual labeling, they can automatically extract the main features of audio and video, compare the similarity of these feature parameters, and return the described content of a clip from an audio-visual database. Because of the huge amount of data and computation involved, deep learning can exploit massive information to learn in a way closer to the complex human brain. This not only benefits the development of environmental cognition and action recognition for audio-visual data, but can also be applied to surveillance systems in the future to make them more complete. Therefore, this project will use deep learning methods to learn audio and visual features separately, match audio and video content, and return a description of the video. Over the past three years, our team has executed the MOST integrated project, Intelligent Audio-visual Content Analysis, Authoring, and Recommendation, and gained an in-depth understanding of deep-learning-based audio and image retrieval. This project will also run for three years. In the first year, we will apply deep learning to raw audio and video, analyze the feature bases of each layer to understand the elements that form audio and video, recognize acoustic scenes from the audio, and recognize actions from the video. In the second year, building on the recognized scenes, we will detect acoustic events at different time points in the audio in finer detail; on the video side, we will use the video classification model to generate video descriptions and improve the architecture. Finally, we will integrate and match the results of acoustic event detection and video description to describe the content of the video. ;With the development of multimedia surveillance technology, how to capture audio-visual information, such as the events and actions in a video, the shooting environment, and the surrounding objects, more conveniently and quickly is one of the spotlights of research and application. Content-based audio/video retrieval can efficiently and automatically capture the main features of audio and video. These features are then compared by parameter similarity, and the textual content of the matching audio/video is returned from the database. Deep learning is a powerful machine-learning technology. Because of the huge amount of data and computation, it brings machine learning closer to the complexity of the human brain. This is beneficial to the development of environmental cognition and action recognition in audio-visual applications, and the project can also be applied to multimedia surveillance systems in the future. Therefore, this project will apply deep learning to obtain efficient feature representations. Over the past three years, our team has executed the MOST integrated project, Intelligent Audio-visual Content Analysis, Authoring, and Recommendation, so we have an in-depth understanding of deep learning applied to audio and images. We plan to execute this project over three years.
In the first year, we will apply deep learning to raw audio/video data and analyze the feature bases to understand the fundamentals of the content. We will then classify acoustic scenes in the audio signal and recognize actions in the visual signal. In the second year, based on this work, we will detect acoustic events in the audio; on the video side, we will not only improve the action-recognition architecture but also begin video captioning within the same architecture. In the last year, our goal is to integrate the results of acoustic event detection and video captioning so that the machine can tell us what the video describes. |
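The retrieval step described above, comparing learned feature vectors and returning the matching description from a database, can be sketched as follows. This is a minimal illustration rather than the project's actual system: the feature vectors are hand-made placeholders standing in for the embeddings a trained deep network would produce, and cosine similarity is one common choice of similarity measure.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_feature, database):
    """Rank database entries by similarity to the query feature.

    `database` maps a clip's text description to its feature vector
    (placeholders here; in practice, deep-network embeddings).
    """
    scored = [(desc, cosine_similarity(query_feature, feat))
              for desc, feat in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy database: descriptions paired with 4-D placeholder features.
db = {
    "a dog barking in a park": np.array([0.9, 0.1, 0.0, 0.2]),
    "traffic noise on a street": np.array([0.1, 0.8, 0.3, 0.0]),
    "people talking indoors": np.array([0.0, 0.2, 0.9, 0.1]),
}

query = np.array([0.85, 0.15, 0.05, 0.1])  # feature of an unknown clip
ranking = retrieve(query, db)
print(ranking[0][0])  # best match: "a dog barking in a park"
```

In the full system, the query feature would come from the audio/video encoder, and the returned description would be the content the machine reports for the clip.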