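Abstract: | Background: Biomedicine is a field rich in heterogeneous, evolving, complex and unstructured data (i.e. the HACE theorem). Acquiring biomedical data takes time and manpower and is usually very expensive. Whole-population studies are therefore not feasible, and inference must rely on sampling. In recent years, two concerns have grown in the medical field: running experiments on small samples, and extracting useful information from large volumes of medical data (big data). Researchers have claimed that, in small-sample biomedical studies, overfitting leads to false positives (type I errors) or false negatives (type II errors), which yield exaggerated results that do not represent the true effect. On the other hand, over the past few years data have become larger and more complex because of continuous generation from many sources such as fMRI, DTI, PET/SPECT and M/EEG. Big data mining has become the most fascinating and fastest-growing field; it makes it possible to select, explore and model large volumes of medical data to support clinical decision making, prevent medication errors and improve patient outcomes. However, big data poses many challenges, such as missing values, data heterogeneity and the complexity of managing data, which may affect results. A suitable process and algorithm for big data mining must therefore be found in order to extract useful information from massive data. To date, however, there is no relevant guideline, in particular on a suitable, trustworthy sample size that contains the most important information for reliable results. Purpose: The aim of this study is to integrate artificial intelligence and the properties of statistical parameters to determine an optimal sample size. The method overcomes biases found in current sample size calculation methods, such as the expected values of statistical parameters, specified thresholds and standardized differences between interventions (which are theoretically unclear). In addition, I investigated the effect of data variability across sample sizes on classifier performance. Method: In this study I used two kinds of data: experimental data and simulated data. The experimental data comprise two datasets: the first consists of brain signals from 63 stroke patients (continuous data), and the other consists of 120 sleep diaries (discrete categorical data), each diary recording one person's data. To find the optimal sample size, I first divided each experimental dataset into multiple sample sizes in steps of 10% of the dataset. I then used these sample sizes with four of the most commonly used AI methods: SVM, decision tree, naive Bayes and logistic regression. Ten-fold cross-validation was used to evaluate classification accuracy. I also measured the grand variance, eigenvalue and proportion among the samples at each sample size. In addition, I generated an artificial dataset by taking the mean of the real data; the generated data mimicked the real data. I used this dataset to examine the effect of standard deviation on classifier accuracy as the sample size increased from small to large. Finally, I plotted the classifier results for the two experimental datasets on an ROC graph to find a suitable sample size and to see how classifier performance varies across sample sizes (small to large). Results: The results showed a significant effect of sample size on classifier accuracy and data variance in all datasets. The stroke and sleep data showed an intrinsic property in the performance of machine learning (ML) classifiers, data variance (parameter-wise and subject-wise variance), eigenvalue and proportion of variance. I used this intrinsic property to design three criteria for determining the optimal sample size. Under criterion 1, a sample size is considered optimal when classifier performance and data variation reach the intrinsic behaviour simultaneously. In the second criterion I used performance, eigenvalue and proportion; when these show the intrinsic property simultaneously at a particular sample size, that sample size is considered effective. In addition, the ROC graph showed that classifiers perform poorly at small sample sizes but improve as the sample size increases. Conclusion: All results confirm that sample size has a significant effect on the performance of AI methods and on data variance. When the data variation fluctuates negligibly, increasing the sample size gives stable results from AI methods. Moreover, the intrinsic property of sample size helps us find the optimal sample size when accuracy, eigenvalue, proportion and variance become independent of further increases in the sample.; Background: Biomedicine is a field rich in a variety of heterogeneous, evolving, complex and unstructured data coming from autonomous sources (i.e. the heterogeneous, autonomous, complex and evolving (HACE) theorem). Acquiring biomedical data takes time and human effort and is usually very expensive. It is therefore difficult to work with whole populations, and researchers work with samples instead. In recent years, two concerns have grown in healthcare: the use of small sample sizes in experiments, and the extraction of useful information from massive medical data (big data). 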
Researchers have claimed that overfitting causes false positives (type I errors) or false negatives (type II errors) in small-sample studies in biomedicine, producing exaggerated results that do not represent a true effect. On the other hand, in the last few years the volume of data has grown larger and more complicated because of continuous generation from many sources such as functional magnetic resonance imaging (fMRI), computed tomography (CT) scans, positron emission tomography (PET)/single-photon emission computed tomography (SPECT) and electroencephalogram (EEG). Big data mining has become the most fascinating and fastest-growing area; it enables the selection, exploration and modelling of vast amounts of medical data to support clinical decision making, prevent medication errors and improve patient outcomes. However, big data poses several challenges, such as missing values, the heterogeneous nature of the data and the complexity of managing data, that may affect the outcome. It is therefore essential to find an appropriate process and algorithm for big data mining to extract useful information from massive data. To date, however, there is no guideline for this, especially regarding a fair sample size that contains the information needed for reliable results. Purpose: The goal of this study is to explore the relationship among sample size, statistical parameters and the performance of machine learning (ML) methods in order to ascertain an optimal sample size. The study also examines the impact of standard deviation across sample sizes by analyzing the performance of machine learning methods. Method: In this study, I used two kinds of data: experimental data and simulated data. The experimental data comprise two datasets: the first contains brain signals from 63 stroke patients (continuous data), and the other consists of 120 sleep diaries (discrete categorical data), each recording one person's data. 
To find an optimal sample size, I first divided each experimental dataset into multiple sample sizes in steps of 10% of the dataset. I then used these sample sizes with four of the most widely used machine learning methods: support vector machine (SVM), decision tree, naive Bayes and logistic regression. Ten-fold cross-validation was used to evaluate classification accuracy. I also measured the grand variance, eigenvalue and proportion among the samples at each sample size. In addition, I generated an artificial dataset by taking an average of the real data; the generated data mimicked the real data. I used this dataset to examine the effect of standard deviation on classifier accuracy as sample sizes were systematically increased from small to large. Finally, I plotted the classifiers' results for both experimental datasets on a receiver operating characteristic (ROC) graph to find an appropriate sample size and to see how classifier performance varies across sample sizes, from small to large. Results: The results showed a significant effect of sample size on classifier accuracy, data variance, eigenvalue and proportion in all datasets. The stroke and sleep datasets showed an intrinsic property in the performance of ML classifiers, data variance (parameter-wise and subject-wise variance), eigenvalue and proportion of variance. I used this intrinsic property to design two criteria for deciding an appropriate sample size. According to criterion I, a sample size is considered optimal when classifier performance reaches intrinsic behaviour simultaneously with data variation. In the second criterion, I used performance, eigenvalue and proportion to decide a suitable sample size: when these factors show the intrinsic property simultaneously at a specific sample size, that sample size is considered effective. 
In this study, both criteria suggested a similar optimal sample size of 250 for the sleep dataset, although the eigenvalue showed slightly more variation than the variance between sample sizes of 250 and 500. The variation in eigenvalues decreased after 500 samples. Because of this small variation, criterion II suggested 500 samples as the effective sample size. Note that if criteria I and II recommend different sample sizes, the one to choose is the size at which the simultaneous intrinsic property appears earlier, either between performance and variance or among performance, eigenvalue and proportion. Finally, I designed a third criterion based on the receiver operating characteristic (ROC) curve. The ROC graph illustrates that classifiers perform well at large sample sizes, which lie above the diagonal line. Small sample sizes, on the other hand, show worse performance and fall below the diagonal line. However, classifier performance improves as the sample size increases. Conclusion: All the results indicate that sample size has a dramatic impact on the performance of ML methods and on data variance. Increasing the sample size gives a steady outcome from machine learning methods when the data variation fluctuates negligibly. In addition, the intrinsic property of sample size helps to find an optimal sample size when accuracy, eigenvalue, proportion and variance become independent of further increases in sample size. |
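The sample-size sweep and the first criterion described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the synthetic two-class data (`make_data`), the nearest-centroid classifier standing in for SVM, decision tree, naive Bayes and logistic regression, the reading of "grand variance" as the variance of all pooled feature values, and the tolerance values in `first_stable_size` are all hypothetical choices introduced here. Only the standard library is used.

```python
import random
import statistics


def make_data(n, sd, seed=0):
    """Hypothetical two-class data: two features drawn around the class label (0 or 1)."""
    rng = random.Random(seed)
    data = [([rng.gauss(i % 2, sd), rng.gauss(i % 2, sd)], i % 2) for i in range(n)]
    rng.shuffle(data)
    return data


def fit_centroids(train):
    """Nearest-centroid stand-in classifier: one mean vector per class."""
    by_class = {}
    for x, y in train:
        by_class.setdefault(y, []).append(x)
    return {y: [statistics.mean(col) for col in zip(*xs)] for y, xs in by_class.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))


def cv_accuracy(data, k=10):
    """k-fold cross-validated accuracy; fold f holds out every k-th row starting at f."""
    correct = 0
    for fold in range(k):
        train = [row for i, row in enumerate(data) if i % k != fold]
        test = data[fold::k]
        model = fit_centroids(train)
        correct += sum(predict(model, x) == y for x, y in test)
    return correct / len(data)


def sweep(data, steps=10):
    """Evaluate accuracy and pooled variance at 10%, 20%, ..., 100% of the data."""
    results = []
    for step in range(1, steps + 1):
        subset = data[: len(data) * step // steps]
        values = [v for x, _ in subset for v in x]
        results.append((len(subset), cv_accuracy(subset), statistics.pvariance(values)))
    return results


def first_stable_size(results, acc_tol=0.02, var_tol=0.05):
    """Smallest size after which accuracy and variance both stay within tolerance.

    Assumption: 'stable' means every later step changes accuracy by at most
    acc_tol (absolute) and variance by at most var_tol (relative).
    """
    for i in range(len(results) - 1):
        tail = results[i:]
        pairs = list(zip(tail, tail[1:]))
        acc_ok = all(abs(b[1] - a[1]) <= acc_tol for a, b in pairs)
        var_ok = all(abs(b[2] - a[2]) <= var_tol * a[2] for a, b in pairs)
        if acc_ok and var_ok:
            return results[i][0]
    return None


if __name__ == "__main__":
    results = sweep(make_data(500, sd=0.8, seed=1))
    for n, acc, var in results:
        print(f"n={n:4d}  accuracy={acc:.3f}  variance={var:.3f}")
    print("first stable size:", first_stable_size(results))
```

Changing the `sd` argument of `make_data` gives a rough analogue of the simulated-data experiment, in which the standard deviation is varied while the sample size grows. The second criterion could be sketched the same way by replacing the pooled variance with eigenvalues and proportions of variance from a principal component analysis, which would need a linear algebra library and is left out here.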