RC Version 7.0 © Powered By DSPACE, MIT. Enhanced by NTU Library IR team.


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/68762


    Title: Big Data Mining with Parallel Computing: A Comparison of Distributed and MapReduce Methodologies
    Authors: Yeh, Chen-lin
    Contributors: Department of Information Management
    Keywords: Big Data; Data Mining; Distributed Computing; Cloud Computing; Instance Selection
    Date: 2015-07-22
    Upload time: 2015-09-23 14:25:31 (UTC+8)
    Publisher: National Central University
    Abstract: The dataset size is growing faster than Moore's Law, and the big data frenzy is sweeping through our daily lives. The central challenge of managing massive amounts of data lies in identifying the techniques that can exploit it effectively. Big data mining extends the concepts of data mining: it aims to discover the most important and relevant knowledge in very large datasets in a timely manner. The advance of Internet technology and the popularity of cloud computing make it possible to break through the time-efficiency limitations of traditional data mining methods on very large datasets. The data scientist John Rauser defines big data as "any amount of data that's too big to be handled by one computer." A standalone machine has neither the memory nor the storage capacity to handle big data efficiently, so this study improves on the traditional data mining environment and process. Its purpose is to compare two computing architectures, a conventional distributed architecture and a cloud-based MapReduce architecture, that pool computing resources to classify large datasets, expanding storage capacity and combining it with greater computing power to speed up mining. In addition, instance selection is applied to filter out noisy data and reduce the dataset size, in order to examine whether such data preprocessing is a necessary step for big data. The goal is to determine which architecture and process yields the shortest execution time without sacrificing accuracy. This raises two research questions: Do the distributed and MapReduce methodologies perform differently in mining accuracy and efficiency over large-scale datasets? And does big data mining need data preprocessing? Experiments on four large-scale datasets of up to 500,000 instances, drawn from the UCI repository and the KDD Cup, show that a MapReduce architecture built on a single large host with 1 to 20 machine nodes, classifying directly with an SVM classifier and without data preprocessing, requires the least processing time and allows the classifier to achieve the highest classification accuracy regardless of the number of computer nodes used, except on a class-imbalanced dataset.
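The abstract does not name the instance selection algorithm used for data reduction. As one common example of the technique, the sketch below (an assumption, not the thesis's method) applies Wilson's Edited Nearest Neighbour rule: an instance is kept only if the majority of its k nearest neighbours share its class label, which filters out noisy (mislabelled) points before mining.

```python
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def enn_select(X, y, k=3):
    """Return indices of instances kept by the ENN editing rule."""
    kept = []
    for i, xi in enumerate(X):
        # k nearest neighbours of xi among all *other* instances
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: euclidean(xi, X[j]),
        )[:k]
        majority = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if majority == y[i]:   # agrees with its neighbourhood -> keep
            kept.append(i)
    return kept

# Toy data: two clusters plus one mislabelled (noisy) point at index 6.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.5)]
y = ["a", "a", "a", "b", "b", "b", "b"]   # the last label is noise
print(enn_select(X, y, k=3))   # → [0, 1, 2, 3, 4, 5]  (noisy point dropped)
```

On real datasets of the scale used in the thesis (hundreds of thousands of instances), the naive all-pairs neighbour search above would itself be a candidate for distribution across nodes.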
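The MapReduce-style flow compared in the thesis can be sketched as follows: the data is split across simulated nodes (map), each node classifies locally, and the nodes' answers are combined by majority vote (reduce). This is an illustrative assumption about the combination step, and a 1-nearest-neighbour rule stands in for the SVM used in the experiments to keep the sketch dependency-free.

```python
from collections import Counter

def partition(data, n_nodes):
    """Map step: deal labelled instances round-robin across n_nodes splits."""
    splits = [[] for _ in range(n_nodes)]
    for i, item in enumerate(data):
        splits[i % n_nodes].append(item)
    return splits

def predict_1nn(split, x):
    """One node classifies x with 1-NN over its local split only."""
    nearest = min(split, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]

def mapreduce_classify(data, x, n_nodes=4):
    """Reduce step: majority vote over the per-node predictions."""
    votes = [predict_1nn(split, x) for split in partition(data, n_nodes)]
    return Counter(votes).most_common(1)[0][0]

# Toy labelled data: class "a" near the origin, class "b" near (5, 5).
data = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
        ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"), ((6, 6), "b")]
print(mapreduce_classify(data, (0.2, 0.3), n_nodes=4))   # → a
```

Because each node sees only its own split, the per-node model is cheaper to train than one model over the full dataset, which is the source of the speed-up the thesis measures; the vote-based reduce is also where a class-imbalanced dataset can degrade accuracy, as the abstract notes.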
    Appears in Collections: [Graduate Institute of Information Management] Master's and Doctoral Theses

    Files in This Item:

    File          Description    Size    Format    Views
    index.html                   0Kb     HTML      553      View/Open


    All items in NCUIR are protected by the original copyright.

