Evaluation of missing value imputation methods for the helpfulness of online reviews


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/77523

Title:	遺漏值填補於網路評論有益性資料集之研究;Evaluation of missing value imputation methods for the helpfulness of online reviews
Author:	黃靖雅;Huang, Jing-Ya
Contributor:	Department of Information Management
Keywords:	data preprocessing;missing value;imputation;online review
Date:	2018-06-22
Upload time:	2018-08-31 14:46:54 (UTC+8)
Publisher:	National Central University
Abstract:	Today almost anything can be reviewed in public, including the newspapers, magazines, and books a person has read. Online reviews are widely regarded as trustworthy, and users can contribute them in several forms, such as star ratings, text, images, and videos. Most users also read online reviews before buying a product or trying a service, but when the volume of information becomes too large it causes information overload. Data mining and machine learning techniques can be used to process and filter this volume of information. This thesis studies a dataset on the helpfulness of online hotel reviews. During data cleaning, we found that missing attribute values are very common in real-world review data. The existing literature contains no study that compares how different types of techniques, including supervised learning algorithms used for imputation, perform on incomplete real-world online review datasets. We therefore designed two experiments. In the first, the dataset is collected from TripAdvisor, and some reviewer-related information is missing, such as reviewer level, age, and sex; the missing reviewer values are imputed so that a good prediction model can be built to help travelers and hotel operators find the most helpful reviews. In the second, missing data are simulated programmatically at missing rates of 10% to 50% to examine other missingness scenarios that may arise in practice, both to compare the imputation methods and to identify the best one for online review data. Three types of techniques are compared: case deletion; imputation by mean/mode, KNN, and SVM; and handling the incomplete dataset directly, without imputation, using the C5.0 decision tree, which is less sensitive to missing values. The results of both experiments show that letting the decision tree handle the incomplete data directly gives the best classification accuracy, which makes C5.0 the better choice for dealing with missing values in online review datasets. We hope this finding helps future users handle missing values more appropriately and efficiently, so that they can move on to data analysis sooner.
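
The simpler techniques named in the abstract (simulating missing values at a chosen rate, case deletion, mean/mode imputation, and KNN imputation) can be sketched roughly as follows. This is an illustration only, not the thesis's implementation: the function names, the toy data, the choice to remove cells completely at random, and the distance measure used for KNN are all assumptions made here. SVM-based imputation and C5.0 are not shown.

```python
import random
from collections import Counter


def simulate_missing(rows, rate, seed=0):
    """Blank out about `rate` of all cells, chosen completely at random
    (an assumed mechanism; the record does not say how cells were removed)."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    chosen = rng.sample(cells, int(round(rate * len(cells))))
    out = [list(r) for r in rows]
    for i, j in chosen:
        out[i][j] = None
    return out


def case_deletion(rows):
    """Keep only the rows that have no missing cell."""
    return [r for r in rows if None not in r]


def mean_mode_impute(rows):
    """Fill numeric columns with the column mean, other columns with the mode."""
    out = [list(r) for r in rows]
    for j in range(len(rows[0])):
        seen = [r[j] for r in rows if r[j] is not None]
        if not seen:
            continue  # nothing observed in this column; leave it missing
        if all(isinstance(v, (int, float)) for v in seen):
            fill = sum(seen) / len(seen)
        else:
            fill = Counter(seen).most_common(1)[0][0]
        for r in out:
            if r[j] is None:
                r[j] = fill
    return out


def knn_impute(rows, k=3):
    """Fill each missing cell of an all-numeric table with the mean of that
    column over the k nearest rows. Distance is the root mean squared
    difference over the columns both rows have observed (an assumed choice)."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return (sum((x - y) ** 2 for x, y in shared) / len(shared)) ** 0.5

    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = [(dist(row, other), other[j])
                      for n, other in enumerate(rows)
                      if n != i and other[j] is not None]
            donors = sorted(d for d in donors if d[0] != float("inf"))[:k]
            if donors:
                out[i][j] = sum(val for _, val in donors) / len(donors)
    return out


if __name__ == "__main__":
    # Hypothetical toy table; the thesis uses TripAdvisor review data instead.
    toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
    for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
        holed = simulate_missing(toy, rate, seed=1)
        print(rate, len(case_deletion(holed)), knn_impute(holed, k=2))
```

In a comparison of this kind, each filled (or reduced) table would then be passed to the same classifier so that the methods can be ranked by classification accuracy at each missing rate.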
Appears in Collections:	[Graduate Institute of Information Management] Theses and Dissertations


All items in NCUIR are protected by the original copyright.
