dc.description.abstract | Background: Biomedicine is a field rich in heterogeneous, evolving, complex, and unstructured data coming from autonomous sources (the heterogeneous, autonomous, complex, and evolving (HACE) theorem). Acquiring biomedical data takes time and human effort and is usually very expensive, so it is difficult to work with whole populations; researchers therefore work with samples. In recent years, two growing concerns have dominated the healthcare area: the use of small sample sizes in experiments and the extraction of useful information from massive medical data (big data). Researchers have argued that, in small-sample biomedical studies, overfitting causes false positives (type I errors) or false negatives (type II errors), producing exaggerated results that do not represent a true effect. At the same time, data volumes have grown larger and more complex in recent years because data are generated continuously from many sources, such as functional magnetic resonance imaging (fMRI), computed tomography (CT), positron emission tomography (PET), single-photon emission computed tomography (SPECT), and electroencephalography (EEG). Big data mining has become a fascinating, fast-growing area that enables the selection, exploration, and modelling of vast amounts of medical data to support clinical decision making, prevent medication errors, and improve patient outcomes. However, big data poses several challenges, such as missing values, the heterogeneous nature of the data, and the complexity of data management, that may affect outcomes. It is therefore essential to find an appropriate process and algorithm for big data mining that can extract useful information from massive data. To date, however, no guideline exists for this, especially regarding a fair sample size that carries enough information for reliable results.
Purpose: The goal of this study is to explore the relationships among sample size, statistical parameters, and the performance of machine learning (ML) methods in order to ascertain an optimal sample size. The study also examines the impact of the standard deviation of the data on classifier performance across sample sizes.
Method: In this study, I used two kinds of data: experimental data and simulated data. The experimental data comprise two datasets: the first contains brain signals from 63 stroke patients (continuous data), and the second consists of 120 sleep diaries (discrete categorical data), each diary recording one person's data. To find an optimal sample size, I first divided each experimental dataset into multiple sample sizes in 10% increments. I then applied four of the most widely used machine learning methods, namely support vector machine (SVM), decision tree, naive Bayes, and logistic regression, to each sample size; a sketch of this sweep appears below. Ten-fold cross-validation was used to evaluate classification accuracy. I also measured the grand variance, eigenvalues, and proportion of variance across the samples at each sample size. In addition, I generated an artificial dataset by averaging the real data, so that the generated data mimicked the real data; I used this dataset to examine the effect of standard deviation on classifier accuracy as sample sizes were systematically increased from small to large. Finally, I plotted the classifiers' results for both experimental datasets on a receiver operating characteristic (ROC) graph to identify an appropriate sample size and to assess how classifier performance varies across sample sizes, from small to large.
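The following is a minimal sketch of the sample-size sweep described above, not the thesis code: it subsamples growing 10% fractions of a dataset, runs the four named classifiers with ten-fold cross-validation, and records mean accuracy per sample size. The feature matrix X and label vector y are assumed to be preloaded NumPy arrays, and the function name is illustrative.

```python
# Minimal sketch of the sample-size sweep (illustrative, not the thesis code).
# Assumes a preloaded feature matrix X and label vector y as NumPy arrays.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def sample_size_sweep(X, y, fractions=np.arange(0.1, 1.01, 0.1), seed=0):
    """Mean 10-fold CV accuracy of four classifiers on growing subsamples."""
    rng = np.random.default_rng(seed)
    classifiers = {
        "SVM": SVC(),
        "DecisionTree": DecisionTreeClassifier(random_state=seed),
        "NaiveBayes": GaussianNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
    }
    results = {}  # {sample_size: {classifier_name: mean_cv_accuracy}}
    for frac in fractions:
        n = int(round(frac * len(y)))
        # 10-fold CV needs at least 10 samples (and, with stratified folds,
        # roughly 10 per class), so very small subsets may need fewer folds.
        idx = rng.choice(len(y), size=n, replace=False)
        results[n] = {
            name: cross_val_score(clf, X[idx], y[idx], cv=10).mean()
            for name, clf in classifiers.items()
        }
    return results
```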
Results: The results showed a significant effect of sample size on classifier accuracy, data variance, eigenvalues, and proportion of variance in all datasets. Both the stroke and sleep datasets exhibited an intrinsic property in the performance of the ML classifiers, the data variances (parameter-wise and subject-wise), the eigenvalues, and the proportion of variance: beyond a certain sample size, these quantities stabilized. I used this intrinsic property to design two criteria for deciding an appropriate sample size. According to criterion I, a sample size is considered optimal when classifier performance and data variance reach intrinsic behaviour simultaneously. Criterion II uses performance, eigenvalues, and proportion of variance: when these factors show the intrinsic property simultaneously at a specific sample size, that size is considered an effective sample size (see the sketch after this paragraph). In this study, both criteria suggested a similar optimal sample size of 250 for the sleep dataset, although the eigenvalues varied slightly more than the variance between sample sizes of 250 and 500; this variation decreased after 500 samples, so, owing to this trivial variation, criterion II suggested 500 as the effective sample size. If criteria I and II recommend two different sample sizes, one should choose the size at which the simultaneous intrinsic property appears earlier, whether between performance and variance or among performance, eigenvalues, and proportion. Finally, I designed a third criterion based on the receiver operating characteristic curve. The ROC graph shows that classifiers perform well at large sample sizes, which sit above the diagonal line, whereas small sample sizes perform worse and fall below it; classifier performance improves as sample size increases.
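Below is an illustrative reading of criteria I and II, not the author's implementation: it computes the statistical parameters for a subsample and flags the first sample size at which every tracked curve stops changing by more than a relative tolerance, i.e. becomes independent of further increases in sample size. The helper names and the tolerance tol are assumptions.

```python
# Illustrative reading of criteria I and II (not the author's implementation):
# a sample size qualifies once every tracked curve stops changing by more than
# a relative tolerance, i.e. shows the "intrinsic property".
import numpy as np
from sklearn.decomposition import PCA

def data_parameters(X):
    """Grand variance plus the leading eigenvalue and its proportion of variance."""
    pca = PCA().fit(X)
    return {
        "variance": X.var(),                            # grand variance
        "eigenvalue": pca.explained_variance_[0],       # leading eigenvalue
        "proportion": pca.explained_variance_ratio_[0], # share of total variance
    }

def first_stable_size(sizes, curves, tol=0.01):
    """Smallest sample size after which all curves change by less than `tol`
    relative to the previous size: accuracy plus variance for criterion I,
    or accuracy plus eigenvalue and proportion for criterion II. `tol` is an
    assumed tolerance, not a value taken from the study."""
    for i in range(1, len(sizes)):
        if all(abs(c[i] - c[i - 1]) / (abs(c[i - 1]) + 1e-12) < tol
               for c in curves):
            return sizes[i]
    return None  # no plateau within the tested sizes
```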
Conclusion: All the results assert that sample size has a dramatic impact on the performance of ML methods and on data variance. Increasing the sample size yields a steady outcome from machine learning methods once the data variation shows only negligible fluctuation. In addition, the intrinsic property of sample size helps identify an optimal sample size: the point at which accuracy, eigenvalues, proportion of variance, and variance become independent of further increases in the number of samples. | en_US