Single-cell RNA sequencing (scRNA-seq) is a promising technique that provides detailed information about gene expression patterns at the single-cell level. However, it can be prohibitively expensive, particularly when profiling a large number of cells. To overcome this limitation, researchers have developed methods that infer cell composition from next-generation RNA sequencing (RNA-seq) data, including methods based on deep learning. These machine learning-based methods typically require a large amount of training data, which has led to the development of data generation techniques that produce pseudo-bulk RNA-seq samples for training cell deconvolution models. However, there is still room to improve these data generation methods. In this study, we use the Dirichlet distribution to generate synthetic RNA-seq samples that more closely resemble real-world scenarios. We construct several cell deconvolution models based on deep neural networks and the self-attention mechanism, train them on this Dirichlet-generated data, and aim to achieve better performance than existing methods. To evaluate the models' effectiveness, we use two real human peripheral blood mononuclear cell (PBMC) datasets as testing benchmarks. Our models outperform existing methods on both PBMC datasets, with Pearson correlations of approximately 0.87 and 0.78, respectively. Notably, our models predict cell types with smaller proportions in human PBMCs more precisely. Because of the variance between datasets, we also emphasize the importance of building a separate model for each dataset to optimize performance.
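
As a rough illustration of Dirichlet-based pseudo-bulk generation, the sketch below draws target cell-type proportions from a Dirichlet distribution, samples single cells accordingly, and sums their expression vectors into one synthetic sample. It is a minimal sketch under stated assumptions and not the exact procedure used in this study: the function names, the toy expression values, the cell-type labels, and the concentration parameters are all hypothetical. It uses only the Python standard library; the Dirichlet draw is obtained by normalizing independent Gamma draws.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one proportion vector from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def make_pseudo_bulk(cells_by_type, alpha, n_cells, rng):
    """Mix single-cell profiles into one pseudo-bulk sample.

    cells_by_type maps a cell-type label to a list of expression vectors
    (a hypothetical layout chosen for this sketch).
    Returns the summed expression vector and the realized proportions,
    which would serve as the training label.
    """
    types = sorted(cells_by_type)
    target = sample_dirichlet(alpha, rng)
    # Assign a cell type to each sampled cell according to the target proportions.
    picks = rng.choices(types, weights=target, k=n_cells)

    n_genes = len(cells_by_type[types[0]][0])
    bulk = [0.0] * n_genes
    counts = {t: 0 for t in types}
    for t in picks:
        cell = rng.choice(cells_by_type[t])  # sample with replacement
        counts[t] += 1
        for g, value in enumerate(cell):
            bulk[g] += value

    proportions = {t: counts[t] / n_cells for t in types}
    return bulk, proportions


# Toy data: three hypothetical cell types, three genes each.
rng = random.Random(0)
toy_cells = {
    "B": [[5.0, 0.0, 1.0], [4.0, 1.0, 0.0]],
    "NK": [[0.0, 1.0, 7.0]],
    "T": [[0.0, 6.0, 1.0], [1.0, 5.0, 0.0]],
}
bulk, props = make_pseudo_bulk(toy_cells, alpha=[1.0, 1.0, 1.0], n_cells=500, rng=rng)
```

With all concentration parameters equal to 1, the proportions are uniform over the simplex; smaller values would push samples toward mixtures dominated by a few cell types, which is one way to cover rare-cell-type scenarios. In practice, NumPy's `Generator.dirichlet` provides the same draw directly.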