With the progress of natural language processing (NLP), automatic question-answering systems such as Watson, Siri, and Alexa have become some of the most important NLP applications. In recent years, enterprises have tried to build automatic customer-service chatbots that learn to solve customers' problems, reducing the cost of human agents and providing 24-hour customer service. However, the evaluation of chatbots still relies heavily on human annotation, which is time-consuming, and there is no effective way to quickly assess how good a chatbot is. The NTCIR-14 Short Text Conversation 3 (STC-3) task therefore introduced two new subtasks, Dialogue Quality (DQ) and Nugget Detection (ND), which aim to evaluate dialogues generated by chatbots automatically. In this paper, we address the DQ and ND subtasks of STC-3 with deep learning methods. The DQ subtask judges the quality of a whole dialogue with three measures: task accomplishment (A-score), dialogue effectiveness (E-score), and customer satisfaction (S-score). The ND subtask classifies whether each utterance in a dialogue contains a nugget, a problem similar to dialogue act (DA) labeling, and thereby analyzes the structure and logic of the dialogue. We apply a general model with an utterance layer, a context layer, and a memory layer to learn dialogue representations for both subtasks, and use gating and attention mechanisms at the utterance and context layers. We also compare BERT [9] and a multi-stack CNN as utterance representations. Experimental results show that BERT produces better utterance representations than the multi-stack CNN for both subtasks, and that our model outperforms all other participants' models as well as the organizers' baseline models on the Ubuntu customer-helpdesk dialogue corpus for both the DQ and ND subtasks.
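To make the layered architecture concrete, the sketch below shows one plausible PyTorch realization of the utterance/context/memory design with gating and attention described above. It is a minimal sketch under stated assumptions, not the authors' actual implementation: all module names, dimensions, and output sizes (e.g. `GatedDialogueModel`, `hidden_dim`, the number of nugget types) are illustrative.

```python
# Minimal sketch of an utterance/context/memory dialogue model with gating
# and attention. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class GatedDialogueModel(nn.Module):
    def __init__(self, utt_dim=768, hidden_dim=256,
                 num_nugget_types=7, num_scores=3):
        super().__init__()
        # Utterance layer: project a pre-computed sentence embedding
        # (e.g. from BERT or a multi-stack CNN) and gate it.
        self.utt_proj = nn.Linear(utt_dim, hidden_dim)
        self.utt_gate = nn.Linear(utt_dim, hidden_dim)
        # Context layer: a bidirectional GRU over the dialogue's
        # utterances, followed by another gate.
        self.ctx_rnn = nn.GRU(hidden_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.ctx_gate = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        # Memory layer: attention pooling over the context states into a
        # single dialogue-level vector.
        self.attn = nn.Linear(2 * hidden_dim, 1)
        # Output heads: per-utterance nugget labels (ND) and
        # dialogue-level quality scores (DQ).
        self.nugget_head = nn.Linear(2 * hidden_dim, num_nugget_types)
        self.quality_head = nn.Linear(2 * hidden_dim, num_scores)

    def forward(self, utt_embeddings):
        # utt_embeddings: (batch, n_utterances, utt_dim)
        h = torch.tanh(self.utt_proj(utt_embeddings))
        h = torch.sigmoid(self.utt_gate(utt_embeddings)) * h  # gated utterances
        ctx, _ = self.ctx_rnn(h)                      # (batch, n, 2*hidden_dim)
        ctx = torch.sigmoid(self.ctx_gate(ctx)) * ctx # gated context
        weights = torch.softmax(self.attn(ctx), dim=1)  # attention over utterances
        dialogue = (weights * ctx).sum(dim=1)         # pooled dialogue vector
        return self.nugget_head(ctx), self.quality_head(dialogue)
```

In the actual STC-3 setting the DQ targets are distributions over annotator-assigned scores rather than single values, so the quality head would typically be trained with a distribution-matching loss; that detail is omitted from this sketch.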