The volume of data is growing faster than Moore's Law, and the big data frenzy is sweeping through our daily lives, with information available everywhere at our fingertips. A major challenge in managing such massive data is identifying the techniques that can best exploit it. Big data is, in essence, an extension of data mining: the goal is to extract valuable knowledge from enormous volumes of data in a timely manner. The spread of the Internet and the development of cloud computing break through the time-efficiency limits that traditional data mining faces on very large datasets, allowing massive data to be mined with substantially shorter computation times. The data scientist John Rauser defines big data as "any amount of data that's too big to be handled by one computer." A standalone machine lacks both the processing speed and the storage capacity to handle data at this scale, so this study improves on the traditional data mining environment and workflow.

This study analyzes two computing technologies, a conventional distributed architecture and a cloud-based MapReduce architecture, which pool computing resources to classify large datasets, thereby expanding storage capacity, harnessing greater computing power, and accelerating mining. It also applies instance selection to filter out noisy data and reduce data volume, in order to examine whether such preprocessing is a necessary step for big data. The aim is to determine which architecture and workflow achieves the shortest execution time without sacrificing accuracy. This raises two research questions: Do the distributed and MapReduce methodologies differ in mining accuracy and efficiency on large-scale datasets? And does big data mining require data preprocessing at all?

The experimental results, based on four large-scale datasets of up to 500,000 records from the UCI repository and the KDD Cup, demonstrate the effectiveness of the proposed architecture and workflow: a cloud MapReduce architecture built on a single large host with 1 to 20 machines, classifying directly with an SVM classifier and without data preprocessing, processes large datasets most efficiently. It requires the least processing time and allows the classifier to achieve the highest classification accuracy regardless of the number of computing nodes used, except on a class-imbalanced dataset.
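As a concrete illustration of the MapReduce-style classification workflow described above, the following is a minimal single-process sketch, assuming scikit-learn is available: each map task trains an SVM on one partition of the training data, and the reduce step combines the partition models by majority vote. The function names (map_train, reduce_vote) and the partition-and-vote scheme are illustrative assumptions, not the thesis implementation or its actual cloud setup.

    # mapreduce_svm.py -- minimal single-process simulation of the
    # map/reduce pattern for SVM classification; hypothetical names,
    # not the thesis code.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    def map_train(partition):
        """Map step: train one SVM per data partition."""
        X_part, y_part = partition
        return LinearSVC(dual=False).fit(X_part, y_part)

    def reduce_vote(models, X_test):
        """Reduce step: combine partition models by majority vote."""
        votes = np.stack([m.predict(X_test) for m in models])
        # majority label per test instance (assumes integer labels 0..C-1)
        return np.apply_along_axis(
            lambda col: np.bincount(col).argmax(), 0, votes)

    X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    n_nodes = 10  # stands in for the 1-20 machines in the experiments
    partitions = zip(np.array_split(X_tr, n_nodes), np.array_split(y_tr, n_nodes))
    models = [map_train(p) for p in partitions]  # map phase
    y_pred = reduce_vote(models, X_te)           # reduce phase
    print("accuracy:", (y_pred == y_te).mean())

Splitting the training data across nodes is what lets the architecture scale storage and computation at the same time; the vote in the reduce step is one simple way to merge the per-partition models into a single prediction.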
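For the instance selection step examined as data preprocessing, the sketch below shows one common noise-filtering method, Wilson's Edited Nearest Neighbor (ENN), which drops every training instance whose class label disagrees with the majority of its k nearest neighbors. The specific instance selection algorithm used in the thesis is not named in the abstract, so ENN is an illustrative stand-in.

    # enn_filter.py -- Wilson's Edited Nearest Neighbor (ENN), one common
    # instance selection method; an illustrative stand-in, not necessarily
    # the algorithm used in the thesis.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def enn_filter(X, y, k=3):
        """Keep only instances whose label agrees with the majority of
        their k nearest neighbors (assumes integer labels 0..C-1)."""
        knn = KNeighborsClassifier(n_neighbors=k + 1).fit(X, y)
        # the first neighbor of each point is the point itself, so drop it
        neigh_idx = knn.kneighbors(X, return_distance=False)[:, 1:]
        neigh_labels = y[neigh_idx]
        majority = np.apply_along_axis(
            lambda row: np.bincount(row).argmax(), 1, neigh_labels)
        keep = majority == y
        return X[keep], y[keep]

Applied before the map phase above (for example, X_tr, y_tr = enn_filter(X_tr, y_tr)), a filter of this kind shrinks the training set; the experiments reported here found that skipping this step and classifying directly was faster without sacrificing accuracy.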