A Comparison among Three Logistic Regression Models under Positive and Unlabeled Data

Name

Yu-Han Jhuang (莊渝涵)

Department

Department of Mathematics

Thesis Title

A Comparison among Three Logistic Regression Models under Positive and Unlabeled Data

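This electronic thesis is authorized for immediate open access.
The open-access electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.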

Abstract

In the era of big data, we often encounter data with poor label quality. In the binary classification problems of traditional supervised learning, a model trained on data that contain some incorrect labels is biased. One type of mislabeled data consists of positive data, all of which are correctly labeled, together with unlabeled data, which mix a large number of negative observations with a small number of positive ones. Such data are referred to as positive and unlabeled (PU) data. In this thesis, we compare the performance on PU data of three logistic regression variants proposed in the literature: c-logistic regression, ξ-logistic regression, and γ-logistic regression. Simulation experiments are used to compare the parameter estimation accuracy and the classification accuracy of the three methods under PU data. For the real data analysis, we apply the three methods to two data sets from the UCI Machine Learning Repository: the Wisconsin Diagnostic Breast Cancer data set (WDBC) and the Pima Indians Diabetes data set (Pima).
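The bias that motivates the comparison can be illustrated with a small simulation. The sketch below is not taken from the thesis and does not implement c-, ξ-, or γ-logistic regression. It only fits ordinary logistic regression, once to the true labels and once to PU labels. It assumes, hypothetically, that each true positive is labeled with a constant probability `label_prob`. The coefficient values, the sample size, and the helper names `fit_logistic` and `simulate_pu` are illustrative choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=25):
    """Ordinary logistic regression fitted by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                        # score vector
        hess = (X * (p * (1 - p))[:, None]).T @ X   # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def simulate_pu(n, beta, label_prob, rng):
    """Draw (X, y, s): y is the true class, s is the observed PU label.

    Assumption (not from the thesis): every true positive is labeled
    independently with the same probability `label_prob`, and negatives
    are never labeled, so the unlabeled group (s == 0) mixes negatives
    with the positives that were not labeled.
    """
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = rng.binomial(1, sigmoid(X @ beta))
    s = y * rng.binomial(1, label_prob, size=n)
    return X, y, s


rng = np.random.default_rng(0)
true_beta = np.array([0.5, 1.0])   # hypothetical intercept and slope
X, y, s = simulate_pu(20000, true_beta, label_prob=0.5, rng=rng)

beta_full = fit_logistic(X, y)     # fit with the (normally unobserved) true labels
beta_naive = fit_logistic(X, s)    # fit that treats every unlabeled point as negative

print("true coefficients :", true_beta)
print("fit on true labels:", np.round(beta_full, 3))
print("naive fit on PU   :", np.round(beta_naive, 3))
```

Under these assumptions the naive fit recovers a much lower intercept than the fit on the true labels, because unlabeled positives are counted as negatives. This is the kind of distortion that a method designed for PU data needs to correct.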

Keywords

★ Logistic regression
★ Mislabeling mechanism
★ Parameter estimation
★ Positive and unlabeled data
★ Robust estimation

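Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Methods
2.1 c-Logistic Regression
2.2 ξ-Logistic Regression
2.3 γ-Logistic Regression
2.4 Comparison of the Three Methods with Conventional Logistic Regression
3. Simulation Study
3.1 Simulation Settings
3.2 Simulation Results
3.2.1 Simulation 1: Small-Sample Experiment
3.2.2 Simulation 2: Large-Sample Experiment
3.2.3 Simulation 3: AMSE Performance under Different Sample Sizes
4. Real Data Analysis
4.1 Description of the Real Data
4.2 Results on the Real Data
5. Conclusion
References
Appendix A Derivation of c-Logistic Regression
Appendix B Derivation of ξ-Logistic Regression
Appendix C Additional Results for Simulation 1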

References

1. Copas, J. B. (1988). Binary regression models for contaminated data. Journal of the Royal Statistical Society, Series B, 50(2), 225-253.
2. Deloach, J., Caragea, D., and Ou, X. (2016). Android malware detection with weak ground truth data. In 2016 IEEE International Conference on Big Data, 3457-3464.
3. Fujisawa, H., and Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9), 2053-2081.
4. Hayashi, K. (2012). A boosting method with asymmetric mislabeling probabilities which depend on covariates. Computational Statistics, 27(2), 203-218.
5. Hung, H., Jou, Z. Y., and Huang, S. Y. (2018). Robust mislabel logistic regression without modeling mislabel probabilities. Biometrics, 74(1), 145-154.
6. Jones, M. C., Hjort, N. L., Harris, I. R., and Basu, A. (2001). A comparison of related density-based minimum divergence estimators. Biometrika, 88(3), 865-873.
7. Komori, O., Eguchi, S., Ikeda, S., Okamura, H., Ichinokawa, M., and Nakayama, S. (2016). An asymmetric logistic regression model for ecological data. Methods in Ecology and Evolution, 7(2), 249-260.
8. Mollah, M. N. H., Eguchi, S., and Minami, M. (2007). Robust prewhitening for ICA by minimizing β-divergence and its application to FastICA. Neural Processing Letters, 25(2), 91-110.
9. Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., and Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Proceedings of the Annual Symposium on Computer Application in Medical Care, 261-265.
10. Wang, H., Zhu, R., and Ma, P. (2018). Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association, 113(522), 829-844.
11. Yang, P., Liu, W., and Yang, J. Y. H. (2017). Positive unlabeled learning via wrapper-based adaptive sampling. International Joint Conference on Artificial Intelligence, 3273-3279.
12. Chronic Disease Control Division, Ministry of Health and Welfare (2019). In response to World Diabetes Day 2019: the whole family controls blood sugar together, so that people with diabetes are "fortunate to have family" [in Chinese]. Retrieved 2021/07/07, from https://www.hpa.gov.tw/Pages/Detail.aspx?nodeid=3804&pid=11765.
13. Cancer Prevention and Control Division, Ministry of Health and Welfare (2020). Ministry of Health and Welfare releases cancer incidence data [in Chinese]. Retrieved 2021/07/07, from https://www.hpa.gov.tw/Pages/Detail.aspx?nodeid=4141&pid=12682.

Advisor

黃世豪

Approval Date

2021-8-19