dc.description.abstract | Deep learning (DL) has become the first choice of algorithm for many information processing problems because it achieves state-of-the-art results in many areas. Before the emergence of DL, researchers had to design features manually, which required a great deal of human labor and domain knowledge. With enough data and high-performance computing devices, a DL model can learn a rich representation from data to satisfy given constraints or to make decisions or predictions in an end-to-end way. Among the variety of DL models, we are particularly interested in developing recurrent neural networks (RNNs), a class of neural networks (NNs), to solve real problems. The goal is not only high accuracy; other aspects such as model complexity and resource consumption are also carefully considered in most of our designs so that the systems are applicable in practice.
RNNs are well suited to time-dependent signals. An RNN receives input signals sequentially through time, and its hidden states accumulate information and are updated time step by time step. An RNN is therefore good at learning the sequential dynamics of input signals, so it can make decisions at the present time or predict future variables. However, because the hidden states are updated and the recurrent weights are reused at every time step, RNN models can suffer from vanishing or exploding gradients during training. This makes it difficult for them to learn long-term dependencies, which decreases the performance of RNNs in many tasks. Hence, we propose new RNN structures that do not have the gradient problem and that are effective and efficient on our target problems.
In this dissertation, we develop RNN models to address real problems involving different multimedia signals. The signals are of various types, including time-series signals collected from integrated sensors, audio signals, images, and videos. First, we introduce two new RNN structures for human activity recognition on wearable devices. Because the target devices are limited in resources, including power, memory, and computational capacity, well-known RNN structures such as long short-term memory (LSTM) and the gated recurrent unit (GRU) are not well suited to them. We propose the control gate-based recurrent neural network (CGRNN) and the self-gated recurrent neural network (SGRNN), which employ only one additional gate and no additional gate, respectively. The two new models achieve competitive accuracy with much lower resource consumption than LSTM and GRU. Secondly, we introduce RNN models for environmental sound recognition in real life. We conduct experiments on datasets from the DCASE 2016 challenge, and our results outperform the baselines. Thirdly, we introduce a DL model for the real-time driver drowsiness detection (DDD) problem. The model consists of a convolutional neural network (CNN), a convolutional version of CGRNN (ConvCGRNN), and a voting layer. The CNN extracts relevant facial representations from global face images; these are then fed to the ConvCGRNN, which learns temporal dependencies before the voting layer makes the final predictions. The system not only achieves high accuracy in detecting driver drowsiness but also runs in real time at a high processing speed. Lastly, we tackle the problem of single image dehazing by developing a DL model called the encoder-recurrent decoder network (ERDN). The ERDN model has an encoder-decoder architecture. On the one hand, we propose the residual efficient spatial pyramid (rESP) module, an extension of the efficient spatial pyramid (ESP) module, to construct the encoder. The encoder can thus effectively process hazy images at any resolution to extract relevant features at multiple contextual levels. On the other hand, we introduce the use of a convolutional recurrent neural network (ConvRNN), specifically ConvCGRNN, as the main component of the decoder, which sequentially aggregates the encoded features from high levels to low levels to recover clear images. The proposed ERDN demonstrates its effectiveness and efficiency on the RESIDE-Standard dataset. | en_US |