Using Normalized Google Distance to Refine Code Search Results

Name

祝亞琪 (Ya-chi Chu)

Department

Department of Information Management

Thesis title

運用NGD提升程式碼搜尋品質
(Using Normalized Google Distance to Refine Code Search Results)

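Related theses

★ A Web-Based Collaborative Platform for Team-Teaching Design: The Case of the Grade 1-9 Curriculum in Junior High Schools
★ Applying Content Management Mechanisms to Frequently Asked Questions (FAQ)
★ Applying Mobile Multi-Agent Technology to a Course Scheduling System
★ A Study of Access Control Mechanisms and Domestic Information Security Regulations
★ Introducing NFC Mobile Phone Transaction Mechanisms into Credit Card Systems
★ App-Based Recommendation Services in E-Commerce: The Case of Company P
★ Building a Service-Oriented System to Improve Production Processes: The Case of Company W's PMS System
★ Planning and Deploying a TSM Platform for NFC Mobile Payment
★ Keyword Marketing for Semiconductor Distributors: The Case of Company G
★ A Study of Domestic Track and Field Competition Information Systems: The Case of the ROC Year 103 (2014) National Intercollegiate Track and Field Open Information System
★ Evaluating the Adoption of a Pallet and Container Tracking Management System for Airport Ground Handling Ramp Operations: The Case of Company F
★ Information Security Management Maturity after Adopting an Information Security Management System: The Case of Company B
★ Applying Data Mining Techniques to Movie Recommendation: The Case of Online Video Platform F
★ Using BI Visualization Tools for Security Log Analysis: The Case of Company S
★ An Empirical Study of a Real-Time Analysis System for Privileged Account Login Behavior
★ Detecting and Handling Abnormal Usage Behavior in Mail Systems: The Case of Company T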

Files

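This electronic thesis is authorized for immediate open access.
Open-access electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.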

Abstract (Chinese)

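In recent years, as open source has become widespread, developers looking to raise software productivity and shorten development schedules have increasingly preferred to find suitable existing open source code and modify it, cutting development time and cost. This has given rise to a new kind of web service: code search. Many code search engines now give users a convenient channel for obtaining the application programming interfaces (APIs) provided by existing classes or frameworks, in the hope of helping developers find more useful material. However, the code returned by these search engines often fails to meet developers' needs. The returned files are too numerous and too complex to understand, so developers cannot filter out what they need or put the resources to use quickly.
This study therefore proposes an improved system architecture for code search. The system first applies appropriate filtering steps to results from the Koders search engine and downloads the query-related code into a repository. From each downloaded file it extracts the important APIs via the code's abstract syntax tree (AST), then uses the concept of Normalized Google Distance (NGD) to compute their relevance to the query in real time and re-rank the results. It also uses program structure to regroup the search results with a hierarchical clustering algorithm from data mining, and finally assigns each cluster a semantically meaningful label so that users without the relevant expertise can still narrow the results to a suitable cluster and develop applications quickly. The study uses precision, recall, and a case study as measures of whether the system improves the quality of search results, and compares the system with related research.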
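As a rough illustration of the NGD-based re-ranking step, the sketch below uses the standard definition by Cilibrasi and Vitanyi: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) and f(y) are hit counts for each term, f(x, y) is the hit count for both together, and N is the total number of indexed pages. The abstract does not say how hit counts are obtained or how scores are aggregated per file, so the function names `ngd` and `rank_by_ngd`, the dictionary inputs, and the choice to rank individual API names are assumptions made for illustration only.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance computed from hit counts.

    fx, fy -- hit counts for term x and term y
    fxy    -- hit count for pages containing both terms
    n      -- total number of indexed pages (must exceed fx and fy)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never co-occur are treated as maximally distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


def rank_by_ngd(query_hits, api_hits, joint_hits, n):
    """Return API names sorted from closest to farthest from the query.

    api_hits   -- hypothetical mapping: API name -> hit count
    joint_hits -- hypothetical mapping: API name -> hit count with the query
    """
    scored = [
        (ngd(query_hits, api_hits[api], joint_hits.get(api, 0), n), api)
        for api in api_hits
    ]
    return [api for _, api in sorted(scored)]
```

A smaller NGD means the two terms co-occur more often than their individual frequencies would suggest, so ascending order puts the most query-relevant API names first.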

Abstract (English)

With the popularity of open source software, many people are willing to share their projects over the internet. To make software production more efficient, developers search the web for existing open source software. As a result, a new internet service, the code search engine, has emerged. Although search engines give developers a convenient way to reuse existing application programming interfaces (APIs), the search results do not always satisfy developers' requirements. Numerous and complex search results make it hard for developers to reuse the code quickly.
We propose a system architecture that addresses the problem described above. First, we store the related data extracted from Koders search results in a local repository. Second, we convert every file into abstract syntax tree form to obtain structural data. Third, we cluster the files and compute each file's normalized Google distance value from the structural data, then re-rank the search results according to that value. Fourth, we assign semantic tags to each cluster to help users find the right cluster quickly.
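The structural extraction and clustering steps can be sketched roughly as follows. The abstract does not name the programming language of the indexed code, the parser, or the distance measure, so this sketch uses Python's standard `ast` module as a stand-in parser. The names `extract_api_calls`, `jaccard_distance`, and `cluster_files`, the Jaccard measure, average linkage, and the `threshold` cutoff are all illustrative assumptions rather than the thesis's actual design.

```python
import ast


def _dotted_name(node):
    """Rebuild a dotted name such as 'os.path.join' from an AST node."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # e.g. a call on a subscript or on another call's result


def extract_api_calls(source):
    """Collect the names of all functions and methods called in the source."""
    calls = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name:
                calls.add(name)
    return calls


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster_files(api_sets, threshold):
    """Agglomerative (bottom-up) clustering with average linkage.

    api_sets  -- mapping: file name -> set of API names used in that file
    threshold -- stop merging once the closest pair is farther than this
    """
    clusters = [[name] for name in api_sets]

    def distance(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        total = sum(jaccard_distance(api_sets[x], api_sets[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        d, i, j = min(
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Files that call largely the same APIs end up in the same cluster, which is the property a later labeling step would rely on.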
Finally, we use precision and recall as indices to evaluate the clustering performance of the proposed architecture. We also use a case study to examine whether the architecture effectively helps developers find useful source code, and we compare it with related academic research.
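For reference, precision and recall can be computed as in the minimal sketch below. How the thesis decides which files count as relevant is not stated in the abstract, so the function name `precision_recall` and the plain set-based inputs are assumptions for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision -- fraction of retrieved items that are relevant
    recall    -- fraction of relevant items that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

With four retrieved files of which two are relevant, and three relevant files overall, precision is 2/4 and recall is 2/3.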

Keywords (Chinese)

★ Code search
★ Code ranking
★ Open source
★ Normalized Google Distance
★ Hierarchical clustering algorithm
★ Abstract syntax tree

Keywords (English)

★ Open Source Code
★ Normalized Google Distance
★ Abstract Syntax Tree
★ Cluster Analysis
★ Code Search Engine

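Table of Contents

Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Research Method
1.5 Thesis Organization
Chapter 2 Literature Review
2.1 Introduction to Open Source
2.2 Code Search Engines
2.3 Code Matching
2.3.1 Introduction to the Abstract Syntax Tree
2.4 Code Ranking
2.4.1 Introduction to NGD (Normalized Google Distance)
2.5 Data Mining
2.5.1 Introduction to Data Mining
2.5.2 Data Mining Techniques
2.6 Cluster Labeling
2.6 Summary
Chapter 3 System Design and Architecture
3.1 System Architecture
3.2 Code Search Engine and Code Filtering
3.3 Code Extraction
3.4 NGD Computation
3.4 Data Mining and Ranking
3.5 Cluster Labeling
Chapter 4 Experimental Results and Discussion
4.1 System Implementation and Case Description
4.2 Clustering Evaluation
4.3 Evaluation of NGD Ranking Effectiveness
4.4 Usability of Cluster Labels
4.5 System Performance Evaluation
4.6 Comparison with Related Search Engines and Related Research
4.6.1 Comparison with Related Search Engines
4.6.2 Comparison with Related Research
Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
References
Chinese Sources
English Sources
Web Resources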

Advisor

林熙禎 (Shi-Jen Lin)

Approval date

2010-6-29