| Abstract: | The recognition of mobile-captured Chinese handwritten essays on \textbf{grid paper} (\textit{稿紙}) remains a persistent challenge in Document AI, primarily because of the severe domain misalignment between its vertical, dense layouts and the horizontal inductive biases of contemporary Vision-Language Models (VLMs). Although VLMs possess strong semantic reasoning capabilities, applying them directly to this domain suffers from resolution bottlenecks and severe reading-order hallucinations.
To bridge this gap, we introduce V2H-Rectify, a training-free preprocessing framework that treats layout rectification as a form of explicit Visual Prompting. Its central design principle is to decouple layout rectification from semantic recognition, so that the preprocessing module operates as a plug-and-play component compatible with any downstream OCR engine or VLM, without retraining or LoRA adaptation. V2H-Rectify incorporates three key innovations: (1) the Ensemble Skew Estimation (ESE) algorithm, a signal-driven perception module that removes geometric distortions; (2) a deep-feature-guided layout analysis algorithm that leverages CRAFT region scores to infer the document's logical topology; and (3) reading-order reconstruction, a text linearization mechanism that applies the grid-paper reading convention (top-to-bottom, right-to-left) to reorganize vertical visual tokens into a standardized horizontal sequence, effectively ``prompting'' the VLM with an in-distribution representation.
We validate our approach on the Mobile-Captured Grid Paper (MCGP) benchmark ($N=2{,}826$), which is not publicly released because of student privacy constraints. With V2H-Rectify, a specialized 3B-parameter model achieves a Character Error Rate (CER) of 21.59\% and a structural Sequence Similarity of 0.891, substantially outperforming the Gemini 3 Pro baseline without preprocessing (54.72\% CER, 0.596 similarity). Furthermore, when applied to Gemini 3 Pro, V2H-Rectify reduces CER to 12.75\%, with residual errors attributed primarily to physical blur and extreme cursive handwriting. These findings confirm that explicit text linearization is a more effective lever than parameter scaling for unlocking foundation model capabilities in out-of-distribution document scenarios. |
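
The reading-order rule stated in the abstract (top-to-bottom within a column, columns taken right-to-left) and the CER metric can be illustrated with a minimal sketch. It is not the V2H-Rectify implementation: the `CharBox` type, the `column_gap` threshold, and the simple x-coordinate column grouping are hypothetical stand-ins for the CRAFT-based layout analysis, and `cer` uses the standard definition (Levenshtein distance divided by reference length), which the abstract does not spell out.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    """Hypothetical detected character cell: centre coordinates in pixels."""
    char: str
    x: float
    y: float


def linearize(boxes: list[CharBox], column_gap: float) -> str:
    """Group boxes into vertical columns, then read columns right-to-left
    and characters top-to-bottom, returning one horizontal string."""
    columns: list[list[CharBox]] = []
    # Sweep from the rightmost box to the leftmost one.
    for box in sorted(boxes, key=lambda b: -b.x):
        # Same column if x lies within column_gap of the column's first box.
        if columns and abs(columns[-1][0].x - box.x) <= column_gap:
            columns[-1].append(box)
        else:
            columns.append([box])
    return "".join(
        b.char for col in columns for b in sorted(col, key=lambda b: b.y)
    )


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Two columns of two characters each, given in arbitrary detection order.
boxes = [
    CharBox("覺", x=100, y=50), CharBox("春", x=200, y=10),
    CharBox("不", x=100, y=10), CharBox("眠", x=200, y=50),
]
text = linearize(boxes, column_gap=30.0)
print(text)                   # 春眠不覺
print(cer("春眠不覺", text))  # 0.0
```

In this toy layout, the right-hand column (春, 眠) is read before the left-hand one (不, 覺), so the output matches the intended sentence. Reading the same boxes row by row, as a model with horizontal priors would, interleaves the two columns and raises the CER accordingly.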