Evaluation of Missing Value Imputation Methods for the Helpfulness of Online Reviews


Name

黃靖雅 (Jing-Ya Huang)

Department

Department of Information Management

Thesis title

Evaluation of missing value imputation methods for the helpfulness of online reviews
(遺漏值填補於網路評論有益性資料集之研究)



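This electronic thesis is authorized for immediate open access.
The open-access electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid breaking the law.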

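Abstract (translated from Chinese)

In modern life, almost anything can be publicly reviewed, including the newspapers, magazines, and books you have read. Online reviews have come to be regarded as trustworthy, and users can provide them in different forms, such as star ratings, text, images, and videos. Most users also check online reviews before buying a product or trying an experience, and when the volume of information online grows too large, information overload results. We therefore apply data mining to these review data, using machine learning methods to process and filter this large volume of information.
This study uses a dataset on the helpfulness of online reviews. During data cleaning, we found that missing data is very common in such real-world data. Moreover, the existing literature contains no comparison of how supervised learning algorithms perform at predicting and imputing missing values on real-world data. We therefore designed two experiments. Experiment 1 applies missing value imputation methods to the reviewer attributes of an online review helpfulness dataset that contains missing values, so that a good predictive model can be built to help travelers or hotel operators find the most helpful reviews. Experiment 2 explores other missingness patterns that may arise in the real world by programmatically simulating 10% to 50% missing data; besides comparing the performance of the different imputation methods, it also identifies the best imputation method for the online review domain.
The experiments use three types of techniques: traditional case deletion; imputation by mean/mode, KNN, and the support vector machine widely used in academia; and a decision tree method, which is less sensitive to missing values and handles the incomplete data directly without imputation. The results show that applying the decision tree directly to the incomplete data gives the best classification accuracy. We believe this contribution can help future users handle missing values more appropriately and efficiently, so that they can move on to data analysis sooner.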
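The simpler techniques named in this abstract (case deletion, mean/mode imputation, and KNN imputation) can be illustrated with a minimal pure-Python sketch. It is not the thesis's implementation: the function names, the `MISSING` marker, and the sample records (hypothetical reviewer level, age, and sex values) are assumptions made for illustration, and SVM-based imputation and the decision tree's direct handling of missing values are left out.

```python
from collections import Counter
from statistics import mean

MISSING = None  # assumed marker for a missing attribute value


def case_deletion(rows):
    """Drop every record that has at least one missing value."""
    return [row for row in rows if MISSING not in row]


def mean_mode_impute(rows):
    """Fill numeric columns with the column mean, other columns with the mode."""
    filled = [list(row) for row in rows]
    for col in range(len(rows[0])):
        observed = [row[col] for row in rows if row[col] is not MISSING]
        if all(isinstance(v, (int, float)) for v in observed):
            fill = mean(observed)
        else:
            fill = Counter(observed).most_common(1)[0][0]
        for row in filled:
            if row[col] is MISSING:
                row[col] = fill
    return filled


def knn_impute(rows, k=2):
    """Numeric data only: fill each missing value with the mean of that
    column over the k complete records nearest on the observed columns."""
    complete = case_deletion(rows)
    filled = [list(row) for row in rows]
    for row in filled:
        known = [c for c, v in enumerate(row) if v is not MISSING]
        if len(known) == len(row):
            continue  # nothing to impute in this record

        def distance(other):
            # squared Euclidean distance over the columns this record has
            return sum((row[c] - other[c]) ** 2 for c in known)

        neighbours = sorted(complete, key=distance)[:k]
        for c in range(len(row)):
            if row[c] is MISSING:
                row[c] = mean(n[c] for n in neighbours)
    return filled


# Hypothetical records: (reviewer level, age, sex)
reviewers = [
    (3, 25.0, "F"),
    (5, None, "M"),
    (None, 31.0, "M"),
    (4, 40.0, None),
    (2, 22.0, "M"),
]
# Hypothetical purely numeric records for the KNN example
numeric = [(1.0, 10.0), (2.0, 20.0), (9.0, 90.0), (1.6, None)]

print(case_deletion(reviewers))
print(mean_mode_impute(reviewers))
print(knn_impute(numeric, k=2))
```

Case deletion keeps only two of the five sample records, which shows why it becomes costly as the share of incomplete records grows, while the two imputation functions keep every record at the price of introducing estimated values.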

Abstract (English)

In today's world, almost anything can be publicly reviewed, including the newspapers, magazines, and books you have read. Online reviews are considered trustworthy. Users can provide online reviews in several ways, such as star ratings, text, images, and videos. Most users also browse online reviews before purchasing goods or trying an experience. Because the Internet contains so much information, users face information overload; data mining techniques can be employed to address this problem.
This thesis studies the helpfulness of online hotel reviews. During data preprocessing, we found that real-world review datasets very commonly contain missing attribute values. The literature contains no study that examines the performance of different types of techniques for handling incomplete online review datasets.
The experiment is composed of two studies. In the first study, the dataset is collected from TripAdvisor, and some reviewer-related information is missing, such as reviewer level, age, and sex. Three types of techniques are compared: case deletion; imputation methods, including mean/mode, KNN, and SVM; and direct handling of the incomplete dataset by C5.0 without imputation. In the second study, missing rates of 10% to 50% are simulated on the dataset. The experimental results of the two studies show that the C5.0 decision tree algorithm is the better choice for dealing with missing values in online review datasets.
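The simulation of 10% to 50% missing rates in the second study can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the missing cells were chosen, so the sketch assumes they are picked uniformly at random over all cells (a completely-at-random mechanism), and the function name `simulate_missing`, the `seed` parameter, and the toy table are hypothetical.

```python
import random


def simulate_missing(rows, rate, seed=0):
    """Return a copy of `rows` with a fraction `rate` of all cells set to None.

    Assumption: cells are chosen uniformly at random, independent of their values.
    """
    rng = random.Random(seed)  # fixed seed so the mask is reproducible
    cells = [(r, c) for r in range(len(rows)) for c in range(len(rows[0]))]
    n_missing = round(rate * len(cells))
    masked = [list(row) for row in rows]  # leave the input untouched
    for r, c in rng.sample(cells, n_missing):
        masked[r][c] = None
    return masked


# Hypothetical complete table: 20 records x 5 attributes = 100 cells
data = [[r * 10 + c for c in range(5)] for r in range(20)]

for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    masked = simulate_missing(data, rate)
    n_none = sum(v is None for row in masked for v in row)
    print(f"missing rate {rate:.0%}: {n_none} of 100 cells removed")
```

Each masked copy could then be passed to the imputation methods under comparison, with the untouched `data` serving as the reference for scoring the results.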

Keywords

★ data preprocessing
★ missing value
★ imputation
★ online review

Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Thesis Structure
2. Literature Review
2-1 Online Reviews and Helpfulness
2-2 Introduction to Missing Values
2-2-1 Missing Completely at Random (MCAR)
2-2-2 Missing at Random (MAR)
2-2-3 Missing Not at Random (MNAR)
2-3 Missing Value Imputation Methods
2-3-1 Single Imputation
2-3-2 Multiple Imputation
3. Research Methodology
3-1 Experimental Design
3-2 Experimental Framework
3-3 Experiment 1
3-4 Experiment 2
4. Experimental Results
4-1 Experiment 1 Results
4-1-1 Classification Accuracy
4-1-2 Sensitivity Analysis
4-1-3 Specificity Analysis
4-1-4 Summary of Experiment 1
4-2 Experiment 2 Results
4-2-1 Experiment 2 (I)
4-2-1-1 Classification Accuracy
4-2-1-2 Sensitivity Analysis
4-2-1-3 Specificity Analysis
4-2-2 Experiment 2 (II)
4-2-2-1 Classification Accuracy
4-2-2-2 Sensitivity Analysis
4-2-2-3 Specificity Analysis
5. Conclusions
5-1 Research Findings
5-2 Contributions and Future Directions
References
Appendix 1
Appendix 2
Appendix 3


Advisor

蔡志豐 (Chih-Fong Tsai)

Approval date

2018-6-22
