dc.description.abstract | With the advancement of technology, starting from the 20th century,we hope that com-puters have the same learning ability as humans.With a large number of researchers investing in artificial intelligence gradually developed technologies, such as machine learning and deep learning to let computers learn how to make decision and classifications.In recent years, most people use neural networks to learn a large amount of data, and then develop many neural network architectures for different data forms.
This thesis uses different neural networks to extract features from the image, sound and semantic.The feature extraction of images and sounds is based on the convolutional neural network (CNN). The semantic first is to use coding to change the text into a digital perfor-mance, and then use the word embedding method to make there a continuous relationship between each word and the word.Finally, concatenation of image and audio features is fed into the LSTM to initialize the first step, which is expected to provide an overview of the video content.
Using the automatic scoring mechanism of the language to score the sentences that the semantics constitute the network output, we find that adding the sound features is helpful for the entire semantics to form the network. In the scoring mechanism of each language, when we add sound events and sound scene features to the image features, compared to use of im-age features only, the video descriptions output by the network that adds sound features are The scores in the item ratings have improved. In the BLEU score, we also considered the length of the word group from one word to four words, each of which has an increase of more than 1%, and the Cider-D score has an increase of 2.27%, and the Meteor and the Rouge-L score, there was also an increase of 0.2% and 0.7%, and the effect was very significant. | en_US |