Abstract: In this study, we introduce CCL-OCR, a self-supervised OCR model that unites contrastive learning with a centroid-enhanced strategy under a momentum-update regime. First, we apply a rich suite of data augmentations to invoice images, simulating real-world noise such as blur, distortion, and occlusion, to enhance robustness. At the same time, we preserve the strengths of momentum contrastive learning by employing a momentum encoder and a queue mechanism to harvest a diverse pool of negative samples, ensuring stable and consistent feature representations. We propose a centroid-enhanced loss that pulls features of the same class toward their centroid while pushing apart centroids of different classes. This dual objective markedly improves intra-class compactness and inter-class separability. A single hyperparameter λ balances the conventional InfoNCE loss against the centroid loss. We evaluated CCL-OCR on a dataset of 74,481 standardized character images cropped from eight categories of real invoices, covering various printing technologies, font styles, and typesetting variations.
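The momentum encoder and negative-sample queue described above follow the MoCo pattern: the key encoder's parameters are an exponential moving average of the query encoder's, and key features are pushed into a fixed-size FIFO queue that serves as the negative pool. The sketch below is a minimal illustration of those two mechanisms only; the class name, the stand-in parameter vectors, and the default sizes are hypothetical, not taken from the paper.

```python
import numpy as np
from collections import deque


class MomentumQueue:
    """Minimal sketch of a MoCo-style momentum update and negative queue."""

    def __init__(self, dim=128, queue_size=4096, momentum=0.7):
        self.momentum = momentum
        self.queue = deque(maxlen=queue_size)     # FIFO pool of key features
        rng = np.random.default_rng(0)
        self.theta_q = rng.normal(size=dim)       # query-encoder params (stand-in)
        self.theta_k = self.theta_q.copy()        # key encoder starts as a copy

    def update_key_encoder(self):
        # theta_k <- m * theta_k + (1 - m) * theta_q  (no gradient through theta_k)
        self.theta_k = self.momentum * self.theta_k + (1 - self.momentum) * self.theta_q

    def enqueue(self, keys):
        # oldest keys are evicted automatically once queue_size is reached
        for k in keys:
            self.queue.append(np.asarray(k))

    def negatives(self):
        # stack the queued keys into a (n_neg, dim) matrix for the contrastive loss
        if not self.queue:
            return np.empty((0, len(self.theta_k)))
        return np.stack(self.queue)
```

A real implementation would hold full network weights rather than a single vector, but the update rule and queue semantics are the same; the sensitivity analysis reports the momentum coefficient peaking near 0.7, which is used as the default here.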
Compared against VGG, ResNet, SimCLR, and MoCo, CCL-OCR achieved the highest accuracy at 96.5%, outperforming ResNet (95.28%), MoCo (90.65%), SimCLR (85.20%), and VGG (65.22%). It also surpassed 96% in precision (96.57%), recall (96.13%), and F1-score (96.35%), significantly exceeding the benchmarks set by the other models. These gains stem from the centroid loss's ability to compress intra-class variance and sharpen discrimination between visually similar characters. Sensitivity analysis found the optimal settings to be λ = 0.7, momentum = 0.7, and τ = 0.09. Extending training to 50 epochs with a learning rate of 0.1 and batch size of 256 further boosted performance. Case studies confirm that CCL-OCR robustly handles low-contrast, cropped, and complex-stroke Chinese characters, overcoming the fine-detail and high-frequency interference issues common in traditional OCR.
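The objective described above combines an InfoNCE term over a positive key and queued negatives with a centroid term that tightens classes and separates their centroids, weighted by λ. The abstract does not specify the exact functional forms, so the sketch below is an assumption: it uses the standard cosine-similarity InfoNCE with temperature τ, a squared-distance-to-centroid pull, a hinge-style push between centroid pairs (the `margin` parameter is hypothetical), and the additive combination L = L_InfoNCE + λ·L_centroid.

```python
import numpy as np


def info_nce(query, positive, negatives, tau=0.09):
    """InfoNCE loss for one query against a positive key and queued negatives."""
    q = query / np.linalg.norm(query)
    k_pos = positive / np.linalg.norm(positive)
    k_neg = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    logits = np.concatenate([[q @ k_pos], k_neg @ q]) / tau
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # positive key sits at index 0


def centroid_loss(features, labels, margin=1.0):
    """Pull each feature toward its class centroid; push centroids apart."""
    classes = list(np.unique(labels))
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # intra-class term: mean squared distance to the sample's own centroid
    intra = np.mean([np.sum((f - centroids[classes.index(l)]) ** 2)
                     for f, l in zip(features, labels)])
    # inter-class term: hinge penalty on centroid pairs closer than `margin`
    inter, pairs = 0.0, 0
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            d = np.linalg.norm(centroids[i] - centroids[j])
            inter += max(0.0, margin - d) ** 2
            pairs += 1
    return intra + inter / max(pairs, 1)


def combined_loss(query, positive, negatives, features, labels,
                  lam=0.7, tau=0.09):
    """Assumed additive combination: L = L_InfoNCE + lam * L_centroid."""
    return info_nce(query, positive, negatives, tau) + lam * centroid_loss(features, labels)
```

The defaults λ = 0.7 and τ = 0.09 mirror the best settings reported in the sensitivity analysis; whether λ scales the centroid term or interpolates between the two losses is not stated in the abstract, so the additive form here is one plausible reading.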