References
[1] D. Dougherty, “The maker movement,” Innovations: Technology, Governance, Globalization, vol. 7, no. 3, pp. 11–14, 2012.
[2] H. Chen, X. Liu, D. Yin, and J. Tang, “A survey on dialogue systems: Recent advances and new frontiers,” ACM SIGKDD Explorations Newsletter, vol. 19, no. 2, pp. 25–35, 2017.
[3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2425–2433.
[4] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, “MUTAN: Multimodal Tucker fusion for visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2612–2620.
[5] D. Yu, J. Fu, T. Mei, and Y. Rui, “Multi-level attention networks for visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4709–4717.
[6] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, and H. Li, “Dynamic fusion with intra- and inter-modality attention flow for visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6639–6648.
[7] L. Zhu, Z. Xu, Y. Yang, and A. G. Hauptmann, “Uncovering the temporal context for video question answering,” International Journal of Computer Vision, vol. 124, no. 3, pp. 409–421, 2017.
[8] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, “MovieQA: Understanding stories in movies through question-answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4631–4640.
[9] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, “TGIF-QA: Toward spatio-temporal reasoning in visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2758–2766.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[11] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[12] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[13] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[14] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[15] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[16] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[17] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
[18] N. Dahlbäck, A. Jönsson, and L. Ahrenberg, “Wizard of Oz studies: Why and how,” in Proceedings of the 1st International Conference on Intelligent User Interfaces, 1993, pp. 193–200.
[19] S. Tauroza and D. Allison, “Speech rates in British English,” Applied Linguistics, vol. 11, no. 1, pp. 90–105, 1990.
[20] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[21] B.-H. Chang, “Incorporate multi-modal context for improving user intent classification work,” Master’s thesis, National Central University, 2019.
[22] M. Popović and H. Ney, “Word error rates: Decomposition over POS classes and applications for error analysis,” in Proceedings of the Second Workshop on Statistical Machine Translation, 2007, pp. 48–55.