Using Offline Wikipedia Database to Reduce Time Costing of NGD


Name

Yi-chun Cheng (鄭奕駿)

Department

Department of Information Management

Thesis title

Using Offline Wikipedia Database to Reduce Time Costing of NGD
(離線搜尋Wikipedia以縮減NGD運算時間之研究)



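The author has agreed to make this electronic thesis openly available immediately.
Electronic full texts that have reached their open-access date are licensed only for users' personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid breaking the law.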

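Abstract (Chinese)

With the rapid growth of the Internet, the volume of web content keeps increasing, and users can easily obtain large amounts of information from search engines and portals such as Google and Yahoo! Kimo. However, Jansen et al. report that most users typically enter an average of only 2.35 keywords per query, and that these keywords are mostly vague or incomplete, so too many documents are returned and information overload results. Earlier studies often used information classification or filtering to lower users' cost of accessing information, but these methods perform well only when a large amount of training data is available. Recent research proposed NGD, which uses the number of results that the Google search engine returns for given keywords to compute an abstract distance between two terms, and from that to judge whether the documents containing them are similar. NGD depends on Google's online search, however, and frequent querying causes the search service to be refused. Unlike previous work, this study therefore proposes building Wikipedia into an offline search engine, using Wikipedia's structured concepts and relatively pure content to overcome the difficulties met with the Google search engine. Experiments show that with the offline Wikipedia search engine the proposed method still maintains stable filtering performance while saving users a large amount of time.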
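For reference, the distance that the abstract describes is usually written as below. This follows the definition in Cilibrasi and Vitanyi's "The Google Similarity Distance" (reference 17), not a formula quoted from this record: f(x) and f(y) are the numbers of pages returned for each term alone, f(x, y) is the number returned for both terms together, and N is the total number of indexed pages. The thesis may use an adjusted variant for its offline Wikipedia index.

```latex
\mathrm{NGD}(x, y) =
  \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}
       {\log N - \min\{\log f(x), \log f(y)\}}
```

A value near 0 means the two terms almost always appear together; larger values mean they rarely co-occur.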

Abstract (English)

With the rapid development of the Internet, the amount of information on the web keeps increasing, and users can easily obtain a great deal of information from search engines and portals such as Google and Yahoo!. However, Jansen et al. pointed out that most users enter an average of only 2.35 keywords, and that these keywords are mostly unclear or incomplete, so a search returns too many pages and leads to information overload. Past research often used information classification or filtering to reduce the cost of accessing information, but these methods give good results only when a large amount of training data is available. Recent studies proposed NGD, which uses the number of results that Google's search engine returns for the input keywords to calculate an abstract distance between two words, and from that to determine whether the documents containing the two words are similar. However, NGD relies on Google's online search function, and under high-frequency querying Google refuses the user access to the search service. To solve this problem, this study proposes building an offline search engine from Wikipedia, because Wikipedia has structured concepts and high-purity content. Experiments show that when the offline Wikipedia database is used, the proposed method still provides stable filtering performance and saves the user a great deal of time.
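The idea of computing NGD from hit counts of a local index rather than from online Google queries can be sketched as follows. This is a minimal illustration, not the thesis implementation: `TOY_INDEX`, `count_docs`, and `ngd_offline` are hypothetical names, and the in-memory dictionary stands in for the offline Wikipedia search engine, which in the thesis would supply the result counts.

```python
import math

def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance from raw hit counts.

    fx, fy: documents containing each term; fxy: documents containing both;
    n: total number of documents in the index.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Assumption: terms that never (co-)occur are treated as maximally distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denom = math.log(n) - min(log_fx, log_fy)
    if denom == 0:
        # Both terms occur in every document, so they are indistinguishable.
        return 0.0
    return (max(log_fx, log_fy) - log_fxy) / denom

# Hypothetical stand-in for the offline index: document id -> set of terms.
TOY_INDEX = {
    1: {"apple", "fruit", "tree"},
    2: {"apple", "fruit"},
    3: {"apple", "computer"},
    4: {"computer", "keyboard"},
    5: {"fruit", "banana"},
    6: {"tree", "forest"},
    7: {"river"},
    8: {"mountain"},
}

def count_docs(*terms: str) -> int:
    """Number of documents that contain every given term (the 'result count')."""
    return sum(1 for words in TOY_INDEX.values() if all(t in words for t in terms))

def ngd_offline(x: str, y: str) -> float:
    """NGD between two terms using only local hit counts, no online queries."""
    return ngd(count_docs(x), count_docs(y), count_docs(x, y), len(TOY_INDEX))

if __name__ == "__main__":
    for pair in [("apple", "fruit"), ("apple", "computer"), ("apple", "river")]:
        print(pair, ngd_offline(*pair))
```

Because every count comes from a local lookup, the number of distance computations is not limited by a remote service's query quota, which is the bottleneck the abstract attributes to the online approach.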

Keywords (Chinese)

★ NGD
★ Wikipedia
★ Google

Keywords (English)

★ NGD
★ Wikipedia
★ Google

Table of Contents

Abstract (Chinese)
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Research Method
1-5 Thesis Organization
2. Literature Review
2-1 Information filtering
2-1-1 Content-based filtering
2-1-2 Collaborative-based filtering
2-2 Document Feature Selection
2-2-1 Term Frequency and Inverse Document Frequency (TF-IDF)
2-2-2 Terms Co-occurrence
2-2-3 Information Gain
2-2-4 Google similarity distance
2-2-5 Google core distance
2-3 Support vector machine
2-4 Collaborative Recommendation
2-4-1 User profile
2-4-2 Automated collaborative filtering
2-5 Social Network Link Analysis
2-5-1 K-core
2-5-2 Betweenness Centrality Measure
3. Research Method and System Architecture
3-1 System Architecture
3-2 Building the Wikipedia database with Solr
3-2-1 Data Preprocessing and Indexing
3-2-2 Wikipedia Extractor
3-3 Document Preprocessing
3-3-1 Part-of-speech and keyword combination
3-3-2 Length of word
3-4 Input to the Wikipedia database
3-4-1 Number of Wikipedia database search results
4. Experimental Results and Discussion
4-1 Experimental Environment
4-2 Experimental Datasets
4-3 Evaluation Criteria
4-4 Experimental Design
4-5 Experimental Results
4-5-1 Filtering performance on different categories from the same source
4-5-2 Filtering performance on the same category from different sources
4-5-3 Effect of news article length on filtering performance
4-5-4 Threshold adjustment
4-6 System Performance Analysis
4-6-1 Time complexity
4-6-2 Actual execution speed
5. Conclusions and Future Work
5-1 Conclusions
5-2 Future Work
References

References

﹝1﹞ Jansen, M., Spink, A., Bateman, J., and Saracevic, T., “Real Life Information Retrieval: A Study of User Queries on the Web,” in: Proc. ACM SIGIR Forum, vol. 32, pp. 5–17, 1998.
﹝2﹞ Montebello, M., “Information overload-an IR problem?”, String Processing and Information Retrieval: A South American Symposium, September 1998.
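﹝3﹞ Cilibrasi, R. L., and Vitanyi, P. M. B., “Automatic Meaning Discovery Using Google”, arXiv:cs.CL/0412098v2, Netherlands BSIK/BRICKS project, and by NWO project 612.55.002, 2005.
﹝4﹞ 李浩平, “Using NGD to Build a Document Filtering System for Cases of Insufficient User Feedback” (運用NGD建立適用於使者回饋資訊不足之文件過濾系統), Master's thesis, National Central University, 2011.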
﹝5﹞ Hanani, U., Shapira, B., and Shoval, P., “Information Filtering: Overview of Issues”, Research and Systems. User Modeling and User-Adapted Interaction, 11(3), 203-259, 2001.
﹝6﹞ Pazzani, M., and Billsus, D., “Content-Based Recommendation Systems”, The Adaptive Web, Vol 4321, pp. 325-341, 2007.
﹝7﹞ Basilico, J., and Hofmann, T., “Unifying collaborative and content-based filtering”, Proceedings of the twenty-first international conference on Machine learning, Banff, Alberta, Canada, 2004.
﹝8﹞ Salton, G., and Buckley, C., “Term-weighting approaches in automatic text retrieval”, Information Processing & Management, 24(5), pp. 513-523, 1988.
﹝9﹞ Joachims, T., “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization”, Proceedings of the Fourteenth International Conference on Machine Learning, 1997.
﹝10﹞ Sebastiani, F., “Machine learning in automated text categorization”, ACM Comput. Surv., 34(1), pp. 1-47, 2002.
﹝11﹞ Cheng, Y., and Xingshi, H., “A Text Feature Selection Algorithm Based on Improved TFIDF”, Pattern Recognition, CCPR ’08. Chinese Conference, 2008.
﹝12﹞ Shouning, Q., Sujuan, W., and Yan, Z., “Improvement of Text Feature Selection Method Based on TFIDF”, Future Information Technology and Management Engineering. FITME ’08. International Seminar, 2008.
﹝13﹞ Wen, Z., Yoshida, T., and Xijin, T., “TFIDF, LSI and multi-word in information retrieval and text categorization”, Systems, Man and Cybernetics, SMC. IEEE International Conference, 2008.
﹝14﹞ Wikipedia. Curse of dimensionality, from http://en.wikipedia.org/wiki/Curse_of_dimensionality
﹝15﹞ Liu, Y.-C., Wang, X.-L., and Liu, B.-Q., “A feature selection algorithm for document clustering based on word co-occurrence frequency”, Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference, 2004.
﹝16﹞ Quinlan, J. R., “Induction of Decision Trees”, Mach. Learn., 1(1), pp. 81-106, 1986.
﹝17﹞ Cilibrasi, R. L., and Vitanyi, P. M. B., “The Google Similarity Distance”, IEEE Trans. on Knowl. and Data Eng., 19(3), pp. 370-383, 2007.
﹝18﹞ P.-I. Chen and S.-J. Lin, “Automatic keyword prediction using Google similarity distance”, Expert Systems with Applications, 37(3), pp. 1928-1938, 2010.
﹝19﹞ P.-I. Chen and S.-J. Lin, “Word AdHoc Network: Using Google Core Distance to extract the most relevant information”, Knowledge-Based Systems, 24, pp. 393–405, 2011.
﹝20﹞ Vapnik, V., “Statistical Learning Theory”, Wiley-Interscience, 1998.
﹝21﹞ Joachims, T., “Text categorization with Support Vector Machines: Learning with many relevant features”, In C. Nedellec & C. Rouveirol (Eds.), Machine Learning: ECML-98, Vol 1398, pp. 137-142, Springer Berlin / Heidelberg, 1998.
﹝22﹞ G. Ercan, and I. Cicekli, “Using lexical chains for keyword extraction”, Information Processing and Management, 43 (6), pp. 1705–1714, 2007.
﹝23﹞ Y. Li, C. Zhang, and J.R. Swan, “An information filtering model on the Web and its application in JobAgent”, Knowledge-Based Systems, 13 (5), pp. 285–296, 2000.
﹝24﹞ T. Meng, and H.F. Yan, “On the peninsula phenomenon in Web graph and its implications on Web search”, Computer Networks, 51 (1), pp. 177–189, 2007.
﹝25﹞ F. Sebastiani, “Machine learning in automated text categorization”, ACM Computing Surveys, 34 (1), pp. 1–47, 2002.
﹝26﹞ H. Cui, J.R. Wen, J.Y. Nie, and W.Y. Ma, “Query expansion by mining user logs”, IEEE Transactions on Knowledge and Data Engineering, 15 (4), pp. 829–839, 2003.
﹝27﹞ I. Konstas, V. Stathopoulos, and J.M. Jose, “On social networks and collaborative recommendation”, in: SIGIR’09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 195–202, 2009.
﹝28﹞ Seidman, S., “Network structure and minimum degree”, Social Networks, 5, pp. 269-287, 1983.
﹝29﹞ Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., and Parisi, D., “Defining and identifying communities in networks”, 2004.
﹝30﹞ H.-W. Ma, and A.-P. Zeng, “The connectivity structure, giant strong component and centrality of metabolic networks”, Department of Genome Analysis,GBF - German Research Center for Biotechnology, Mascheroder Weg 1, 38124 Braunschweig, Germany, 2003.
﹝31﹞ Oliver M., QTag - a portable POS tagger, retrieved May 30, 2011, from http://phrasys.net/uob/om/software
﹝32﹞ H.-C. Chang and C.-C. Hsu, “Using topic keyword clusters for automatic document clustering”, Information Technology and Applications, ICITA 2005. Third International Conference, July, 2005.
﹝33﹞ Wikipedia Extractor, retrieved May 30, 2012, from http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
﹝34﹞ Apache Solr Project, retrieved May 30, 2012, from http://lucene.apache.org/solr/
﹝35﹞ Wikipedia: Database download, retrieved May 30, 2012, from http://en.wikipedia.org/wiki/Wikipedia:Database_download

Advisor

Shi-jen Lin (林熙禎)

Approval date

2012-7-20
