References
[1] S. Venugopalan, H. Xu, J. Donahue, "Translating videos to natural language using deep recurrent neural networks," arXiv preprint arXiv:1412.4729, 2014.
[2] M. Minsky and S. Papert, "Perceptrons," Cambridge, MA: MIT Press, 1969.
[3] N. Srivastava, G. E. Hinton, A. Krizhevsky, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929-1958, Jun. 2014.
[4] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proc. Nat. Acad. Sci. USA, vol. 79, pp. 2554-2558, Apr. 1982.
[5] Y. Bengio, R. Ducharme, P. Vincent, C. Janvin, "A neural probabilistic language model," The Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.
[6] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, S. Khudanpur, "Recurrent neural network based language model," Proceedings of Interspeech, 2010.
[7] S. Hochreiter, J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[8] Z. Wu, Y. G. Jiang, X. Wang, H. Ye, X. Xue, "Multi-stream multi-class fusion of deep networks for video classification," ACM Multimedia Conference, pp. 791-800, Oct. 2016.
[9] S. Venugopalan, M. Rohrbach, R. Mooney, T. Darrell, K. Saenko, "Sequence to sequence video to text," ICCV, 2015.
[10] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, L. Deng, "Semantic Compositional Networks for Visual Captioning," CVPR, 2017.
[11] Y. C. Wu, P. C. Chang, C. Y. Wang, J. C. Wang, "Asymmetric Kernel Convolutional Neural Network for acoustic scenes classification," IEEE International Symposium on Consumer Electronics (ISCE), May 2018.
[12] ETSI Standard Doc., "Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-End Feature Extraction Algorithm; Compression Algorithms," ES 201 108, v1.1.3, Sep. 2003.
[13] ETSI Standard Doc., "Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-End Feature Extraction Algorithm; Compression Algorithms," ES 202 050, v1.1.5, Jan. 2007.
[14] Librosa: an open source Python package for music and audio analysis, https://github.com/librosa, retrieved Dec. 1, 2016.
[15] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, O. Nieto, "librosa: Audio and Music Signal Analysis in Python," in Proceedings of the 14th Python in Science Conference, Jul. 2015.
[16] A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[17] M. D. Zeiler and R. Fergus, "Visualizing and Understanding Convolutional Networks," CoRR, abs/1311.2901, 2013; published in Proc. ECCV, 2014.
[18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, "Going deeper with convolutions," CoRR, abs/1409.4842, 2014.
[19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385, 2015.
[20] D. Tran, L. Bourdev, R. Fergus, et al., "Learning spatiotemporal features with 3D convolutional networks," Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497, 2015.
[21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "ImageNet large scale visual recognition challenge," IJCV, 2015.
[22] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," CVPR, 2014.
[23] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, "Translating videos to natural language using deep recurrent neural networks," NAACL, 2015.
[24] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," ICLR, 2015.
[25] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, "Microsoft COCO captions: Data collection and evaluation server," arXiv:1504.00325, 2015.
[26] TensorFlow: an open source Python package for machine intelligence, https://www.tensorflow.org, retrieved Dec. 1, 2016.
[27] J. Dean et al., "Large-Scale Deep Learning for Building Intelligent Computer Systems," in Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 1-1, Feb. 2016.
[28] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv:1605.02688, 2016.
[29] Librosa: an open source Python package for music and audio analysis, https://github.com/librosa, retrieved Dec. 1, 2016.
[30] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and Music Signal Analysis in Python," in Proceedings of the 14th Python in Science Conference, Jul. 2015.
[31] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, "YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition," ICCV, 2013.
[32] A. Mesaros, T. Heittola, and T. Virtanen, "TUT Database for Acoustic Scene Classification and Sound Event Detection," 2016 24th European Signal Processing Conference (EUSIPCO), pp. 1128-1132, Aug. 2016.
[33] A. Mesaros, T. Heittola, and T. Virtanen, TUT Acoustic Scenes 2016, Development dataset, http://doi.org/10.5281/zenodo.45739, retrieved Dec. 1, 2016.
[34] A. Mesaros, T. Heittola, and T. Virtanen, TUT Acoustic Scenes 2016, Evaluation dataset, https://zenodo.org/record/165995#.WXblsYiGNhE, retrieved Dec. 1, 2016.
[35] Q. Kong, I. Sobieraj, W. Wang, and M. Plumbley, "Deep Neural Network Baseline for DCASE Challenge 2016," in 2016 Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2016), pp. 50-54, Sep. 2016.
[36] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," ACL, 2002.
[37] M. Denkowski and A. Lavie, “Meteor universal: Language specific translation evaluation for any target language,” EACL Workshop on Statistical Machine Translation, 2014.
[38] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based Image Description Evaluation," CVPR, 2015.
[39] A. H. Abdulnabi, G. Wang, J. Lu, and K. Jia, "Multi-task CNN Model for Attribute Prediction," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1949-1959, Nov. 2015.