NCU Institutional Repository (theses and dissertations, past exam papers, journal articles, and research projects available for download): Item 987654321/92658


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/92658


    Title: 應用資料重採樣與資料離散化方法於類別不平衡問題之研究; Data Resampling and Discretization Methods for Class Imbalanced Data
    Author: 林家暘; LIN, JIA-YANG
    Contributor: Department of Information Management
    Keywords: data preprocessing; data resampling; data discretization; class imbalance; data mining
    Date: 2023-07-27
    Uploaded: 2023-10-04 16:07:44 (UTC+8)
    Publisher: National Central University
    Abstract: In recent years, with the rapid growth of artificial intelligence, many industries have begun developing intelligent applications from their existing data, aiming to create products suited to their own business. In practice, however, data tends to be skewed or unevenly distributed because of various human and environmental factors. This class imbalance problem is widespread across industries and domains and degrades the intelligent models used in related applications, so it has become an important practical concern. This study explores how different combinations and orderings of data preprocessing steps affect binary class imbalance problems. The preprocessing steps are the data-level oversampling method Synthetic Minority Over-sampling Technique (SMOTE) and the supervised discretization methods ChiMerge and MDLP. In addition, to better understand how resampling methods differ in handling class imbalance, the study includes several resampling methods, namely SMOTE with different sampling strategies, the undersampling method Tomek Links, and a hybrid of the two, and uses them to investigate how different preprocessing combinations and orderings affect multiclass imbalance problems.
    The study uses binary and multiclass datasets from the UCI and KEEL websites to compare the effects of single and combined preprocessing methods on binary and multiclass imbalance problems, thereby clarifying the applicability of each method and providing effective solutions and recommendations. According to the experimental results, for binary class imbalance problems the study recommends the combined method of first discretizing the features with MDLP and then balancing the dataset with SMOTE, which improves the classification performance of SVM, C4.5, and RF. For multiclass imbalance problems, when time cost is not a concern, it recommends resampling first and then discretizing with ChiMerge, which gives more robust and accurate results. If the efficiency of data processing and model computation is a priority, it recommends resampling first and then discretizing with MDLP, which efficiently yields fairly accurate results.
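    The oversampling step named in the abstract can be illustrated with a minimal sketch of the SMOTE idea: each synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbours. This is not the thesis's implementation. The function name `smote`, its parameters, and the toy data are assumptions made for illustration only, and the sketch uses only the Python standard library.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Illustrative SMOTE sketch (hypothetical helper, not the thesis code).

    Creates n_new synthetic points, each placed on the line segment between
    a randomly chosen minority point and one of its k nearest minority
    neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority point, nearest first.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy two-feature data (made up): 6 majority samples, 3 minority samples.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.4, 0.4), (0.5, 0.1)]
minority = [(2.0, 2.0), (2.2, 2.1), (2.1, 2.4)]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))  # prints "6 6"
```

    In the ordering the abstract recommends for binary problems, a discretization step such as MDLP would run on the features before this oversampling step; the sketch covers only the oversampling half.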
    Appears in collections: [Graduate Institute of Information Management] Theses and Dissertations

    Files in this item:

    File        Description  Size  Format  Views
    index.html               0Kb   HTML    126


    All items in NCUIR are protected by the original copyright.
