Abstract: | Deep learning (DL) has become the first choice of algorithm for information processing problems because it achieves state-of-the-art results in many areas. Before the emergence of DL, researchers had to design features manually, which requires a great deal of human labor and domain knowledge. With enough data and high-performance computing devices, a DL model can learn a rich representation from data to satisfy given constraints or to make decisions or predictions in an end-to-end way. Despite the variety of DL models, we are particularly interested in developing recurrent neural networks (RNNs), a class of neural networks (NNs), to solve real problems. The goal is not only high accuracy; other aspects such as model complexity and resource consumption are also carefully considered in most of our designs so that the systems are applicable in practice.
RNNs are best suited to time-dependent signals. An RNN receives input signals sequentially through time, and its hidden states accumulate information and update themselves time step by time step. An RNN is therefore good at learning the sequential dynamics of input signals, so it can make decisions at the present time or predict future variables. However, because the hidden states are updated and the recurrent weights are reused at every time step, RNN models can suffer from vanishing or exploding gradients during training. This makes it difficult for them to learn long-term dependencies, which degrades the performance of RNNs in many tasks. Hence, we propose new RNN structures that do not have the gradient problem and that are effective and efficient on our target problems.
In this dissertation, we develop RNN models to address real problems involving different multimedia signals. The signals are of various types, including time-series signals collected from integrated sensors, audio signals, images, and videos. First, we introduce two new RNN structures for human activity recognition on wearable devices. Because the target devices have limited resources, including power, memory, and computational capacity, well-known RNN structures such as long short-term memory (LSTM) and the gated recurrent unit (GRU) are not well suited. We propose the control gate-based recurrent neural network (CGRNN) and the self-gated recurrent neural network (SGRNN), which employ only one additional gate and no additional gate, respectively. The two new models achieve competitive accuracy with much lower resource consumption than LSTM and GRU. Second, we introduce RNN models for environmental sound recognition in real-life settings. We conduct experiments on datasets from the DCASE 2016 challenge, and our results outperform the baselines. Third, we introduce a DL model for real-time driver drowsiness detection (DDD). The model consists of a convolutional neural network (CNN), a convolutional version of CGRNN (ConvCGRNN), and a voting layer. The CNN extracts relevant facial representations from the driver's full face; these are fed to the ConvCGRNN, which learns temporal dependencies before the voting layer makes the final predictions. The system not only achieves high accuracy in detecting driver drowsiness but also processes input fast enough to run in real time. Lastly, we tackle single-image dehazing by developing a DL model called the encoder-recurrent decoder network (ERDN). The ERDN model has an encoder-decoder architecture. On the one hand, we propose the residual efficient spatial pyramid (rESP) module, an extension of the efficient spatial pyramid (ESP) module, to construct the encoder. The encoder can thus effectively process hazy images at any resolution to extract relevant features at multiple contextual levels. On the other hand, we introduce the use of a convolutional recurrent neural network (ConvRNN), specifically ConvCGRNN, as the main component of the decoder, which sequentially aggregates the encoded features from high levels to low levels to recover clear images. The proposed ERDN demonstrates its effectiveness and efficiency on the RESIDE-Standard dataset. |
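The vanishing and exploding gradient problem mentioned in the abstract can be illustrated with a minimal sketch. This is not the CGRNN or SGRNN proposed in the dissertation, whose equations the abstract does not give. It assumes a generic scalar recurrence h_t = f(w_rec * h_{t-1} + w_in * x_t); the function name `state_gradient` and all parameter values are hypothetical and chosen only for illustration.

```python
import math


def state_gradient(xs, w_rec, w_in, linear=False):
    """Run a scalar recurrence and return (h_T, d h_T / d h_0).

    Assumed generic recurrence (not the dissertation's model):
        h_t = f(w_rec * h_{t-1} + w_in * x_t)
    where f is tanh, or the identity when linear=True.
    The gradient is the chain-rule product of the per-step factors.
    """
    h = 0.0
    grad = 1.0
    for x in xs:
        pre = w_rec * h + w_in * x
        if linear:
            h = pre
            grad *= w_rec                    # identity activation: f'(pre) = 1
        else:
            h = math.tanh(pre)
            grad *= w_rec * (1.0 - h * h)    # tanh'(pre) = 1 - tanh(pre)^2
    return h, grad


xs = [0.5] * 50  # hypothetical constant input over 50 time steps

# Reused recurrent weight below 1 in magnitude: the product shrinks toward zero.
_, vanishing = state_gradient(xs, w_rec=0.5, w_in=1.0)

# Reused recurrent weight above 1 with no squashing: the product blows up.
_, exploding = state_gradient(xs, w_rec=1.5, w_in=1.0, linear=True)

print(f"vanishing gradient: {vanishing:.3e}")
print(f"exploding gradient: {exploding:.3e}")
```

Because the same recurrent weight is multiplied in at every time step, the gradient scales roughly geometrically with sequence length, which is why long-term dependencies are hard to learn with a plain recurrence.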