NCU Institutional Repository: Item 987654321/54340


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/54340


    Title: Parallel Information-Theoretic Co-Clustering based on MapReduce (平行化資訊理論共分群演算法)
    Author: Chao, Shih-Hsien (趙士賢)
    Contributor: Graduate Institute of Software Engineering
    Keywords: co-clustering; cloud computing; Hadoop; MapReduce
    Date: 2012-07-27
    Upload time: 2012-09-11 18:48:18 (UTC+8)
    Publisher: National Central University
    Abstract: Data clustering is widely used in many domains, such as data mining, document retrieval, image segmentation, and pattern classification. Traditional clustering algorithms are usually suited only to small-scale data analysis. Today, clustering tasks often involve several gigabytes of data, more than a single computer can handle. To address this, many researchers have tried to design efficient parallel clustering algorithms for large datasets.

    In this thesis we focus on Information-Theoretic Co-clustering (ITCC), a co-clustering algorithm that clusters rows and columns simultaneously. Its objective function is based on the mutual information between the clustered row and column random variables, subject to constraints on the number of row and column clusters. ITCC is widely used in domains such as text mining, social recommendation systems, and bioinformatics. We propose a Parallel Information-Theoretic Co-Clustering (PITCC) algorithm based on MapReduce. Because the data to be analyzed is very large, we develop the algorithm on Hadoop, a cloud computing platform. MapReduce is a simple yet powerful programming model that has been widely embraced by both academia and industry because of its high scalability and ease of use.

    We use the dataset released for the CAMRa2011 movie recommendation contest in our experiments and evaluate the results with three measures: speedup, sizeup, and scaleup. The experimental results indicate that the proposed algorithm is efficient and can process large datasets on commodity hardware.
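
    The mutual-information objective mentioned in the abstract can be illustrated with a small single-machine sketch. It is not the thesis's PITCC implementation and involves no MapReduce; it only assumes the standard ITCC formulation, in which a co-clustering is scored by how much mutual information is lost when rows and columns are merged into clusters. The function names and the toy joint distribution are hypothetical.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """Mutual information in bits of a joint distribution {(x, y): p}."""
    px = defaultdict(float)
    py = defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


def cocluster_joint(joint, row_cluster, col_cluster):
    """Aggregate the joint distribution over row and column clusters."""
    agg = defaultdict(float)
    for (x, y), p in joint.items():
        agg[(row_cluster[x], col_cluster[y])] += p
    return dict(agg)


def mutual_information_loss(joint, row_cluster, col_cluster):
    """Information lost by the co-clustering (lower is better)."""
    clustered = cocluster_joint(joint, row_cluster, col_cluster)
    return mutual_information(joint) - mutual_information(clustered)


# Hypothetical toy data: a 4x4 block-diagonal joint distribution in which
# rows 0-1 co-occur only with columns 0-1, and rows 2-3 only with columns 2-3.
joint = {}
for block in ((0, 1), (2, 3)):
    for x in block:
        for y in block:
            joint[(x, y)] = 0.125

good = {0: 0, 1: 0, 2: 1, 3: 1}  # clusters follow the block structure
bad = {0: 0, 1: 1, 2: 0, 3: 1}   # clusters cut across the blocks

loss_good = mutual_information_loss(joint, good, good)
loss_bad = mutual_information_loss(joint, bad, bad)
```

    In this toy case the assignment that follows the block structure loses no mutual information, while the assignment that cuts across the blocks loses all of it. An ITCC-style algorithm searches for row and column assignments that keep this loss small.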
    Appears in collections: [Graduate Institute of Software Engineering] Theses and Dissertations
