Name 吳登翔 (WU, DENG-SIANG)
Thesis title Concept Drift Prediction Based on a User Model (使用者模型為基礎的概念飄移預測)
The volume of data a user encounters, and the speed at which it arrives, are growing faster than ever. Data stream research has therefore become a trend in information retrieval. Concept drift refers to data categories changing over time, or to filtering errors that arise when a user's interests change. This study takes users' subtle preferences into account: it identifies which topics the documents a user has read belong to, and judges relevance from the co-occurrence between pairs of topics. To meet the system's speed requirements, we propose an NGD similarity tolerance method that reduces the number of terms and thereby shortens execution time. We also divide users' interests into four categories and design forgetting factors tailored to those categories, which retain or filter data to mitigate the loss of effectiveness caused by concept drift. Finally, the study predicts concept drift from users' reading behavior, reducing the impact on the system when drift occurs.
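As a rough illustration of the NGD similarity tolerance idea mentioned in the abstract, the sketch below computes the standard Normalized Google Distance from hit counts and then drops any term that lies within a tolerance of a term already kept. The NGD formula is the published one; the pruning rule, the function names (`ngd`, `prune_terms`), the `tolerance` value, and the toy counts are assumptions made for illustration and are not taken from the thesis.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from raw hit counts.

    fx, fy: number of pages containing each term
    fxy:    number of pages containing both terms
    n:      total number of indexed pages
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never co-occur are treated as infinitely far apart.
        return float("inf")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def prune_terms(terms, freq, pair_freq, n, tolerance=0.3):
    """Hypothetical pruning rule: keep a term only if it is farther than
    `tolerance` (in NGD) from every term that has already been kept."""
    kept = []
    for t in terms:
        is_distinct = all(
            ngd(freq[t], freq[k], pair_freq.get(frozenset((t, k)), 0), n) > tolerance
            for k in kept
        )
        if is_distinct:
            kept.append(t)
    return kept


# Toy counts, invented for the example.
freq = {"car": 1000, "automobile": 1000, "banana": 1000}
pair_freq = {
    frozenset(("car", "automobile")): 900,
    frozenset(("car", "banana")): 10,
}
kept_terms = prune_terms(["car", "automobile", "banana"], freq, pair_freq, n=1_000_000)
print(kept_terms)
```

With these toy counts, "automobile" is close to "car" and is dropped, while "banana" is kept. Fewer surviving terms means fewer pairwise NGD computations later, which is the kind of time saving the abstract describes.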
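Keywords (Chinese) ★ 概念飄移 (concept drift)
★ 遺忘因子 (forgetting factor)
★ 參與中間度分群 (betweenness centrality clustering)
★ 主題關係 (topic relationship)
Keywords (English) ★ Concept drift
★ Forgetting factor
★ Betweenness centrality
★ Topic relationship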
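Table of contents Contents

Abstract (Chinese)
Abstract (English)
Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
2. Literature Review
2-1 Document Preprocessing
2-1-1 Part-of-Speech Filtering and Keyword Merging Based on Part-of-Speech Combinations
2-1-2 Term Length Filtering
2-1-3 Stemming
2-1-4 Filtering by Wikipedia Search Result Count
2-2 Document Features
2-2-1 Term Frequency (TF)
2-2-2 Term Network
2-2-3 Betweenness Centrality Clustering
2-3 User Interests
2-4 Concept Drift
2-5 Normalized Google Distance (NGD)
3. System Architecture
3-1 Research Assumptions
3-2 System Architecture
3-3 Document Preprocessing
3-3-1 Document Preprocessing
3-3-2 Document Features
3-4 Similarity Tolerance
3-5 User Model
3-5-1 Term Activity Distribution Matrix
3-5-2 Topic Co-occurrence Matrix
3-6 Topic Mapping
3-7 Dynamic Forgetting Factor
3-8 Interest Removal
3-9 Document Filtering
3-10 Concept Drift Prediction
4. Experiments
(This chapter describes the experimental environment, the evaluation criteria, and the datasets used.)
4-1 Experimental Environment
4-2 Datasets and Evaluation Criteria
4-3 Experimental Design
4-3-1 Threshold Experiment
4-3-2 Time Reduction from Similarity Tolerance Experiment
4-3-3 User Model Learning Ability Experiment
4-3-4 Dynamic Forgetting Factor Experiment
4-3-5 Concept Drift Prediction Performance Experiment
5. Conclusions and Future Research Directions
5-1 Conclusions
5-2 Future Research Directions
References
Chinese-language sources
English-language sources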
Advisor 林熙禎
