Name: 蔡博維 (Po-Wei Tsai)
Department: Department of Computer Science and Information Engineering
Thesis title: 結合批次交叉注意力與混合排序損失以強化多面向評分模型 (Enhancing Cross-Prompt Automated Essay Scoring with Batch Cross-Attention and Hybrid Ranking Loss)
Full text available in the repository after 2030-8-26.
Abstract (Chinese): With the widespread application of natural language processing in education, business, and generation tasks, multi-trait scoring has become an important challenge in language understanding. However, most existing methods score from single samples or sample pairs only; lacking a global view, they are prone to inconsistencies in ranking transitivity. To address this limitation, this study proposes a cross-prompt multi-trait scoring model that combines Batch Cross-Attention with a Hybrid Ranking Loss. During training, Batch Cross-Attention lets all texts in the same batch serve simultaneously as Query, Key, and Value, so the attention mechanism can capture fine-grained differences among samples and the overall distribution, improving ranking stability and comparability. The Hybrid Ranking Loss combines a local pairwise rank loss with a global list-wise loss, penalizing local ordering errors while preserving global consistency and avoiding transitivity contradictions. The proposed model accommodates essay scoring, automatic question generation, and review-rating scenarios, delivering consistent evaluation across traits such as content, organization, and answerability. Experimental results show that, compared with conventional point-wise, pair-wise, and purely list-wise methods, the proposed approach yields significant gains in both score agreement (e.g., QWK) and rank correlation (e.g., Kendall's τ), demonstrating the effectiveness and generality of Batch Cross-Attention and the Hybrid Ranking Loss.

Abstract (English): Language education plays a vital role in globalization and cross-cultural communication,
and Automated Essay Scoring (AES) has gained increasing attention due to its fast and
consistent assessment capabilities. Traditional AES methods typically adopt prompt-specific
training, achieving high accuracy on familiar prompts but lacking generalization ability to
unseen prompts due to the unavailability of annotated data. To address this, recent cross-prompt
approaches train and test models across multiple prompts, yet most rely on point-wise or
pair-wise comparisons that learn only relative rankings between pairs of essays, neglecting the
positioning of individual essays within the overall distribution. Building upon the MOOSE
framework, this study proposes a two-stage batch-aware ranking and regression framework. In
the first stage, we introduce Batch Cross-Attention within the MOOSE architecture, allowing
all essays in the same mini-batch to attend to each other during forward propagation, thereby
jointly considering global semantic differences. Optimization employs a combination of
list-wise and pair-wise losses to ensure both global and local ranking consistency. In the second
stage, predicted ranking scores are discretized into K bins based on quantiles, and bin position
embeddings are concatenated with original essay features. A Bin Regressor is then trained with
mean squared error combined with pair-wise loss to fine-tune the continuous scores.
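The first stage described above — essays in a mini-batch attending to one another, trained with a pairwise-plus-list-wise objective — can be sketched in a few lines of numpy. This is an illustrative toy, not the thesis's implementation: the function names, the hinge `margin`, and the mixing weight `alpha` are assumed for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def batch_cross_attention(feats):
    """All essays in the mini-batch act as Query, Key, and Value,
    so each essay's representation mixes in every other essay's."""
    d = feats.shape[-1]
    scores = feats @ feats.T / np.sqrt(d)    # (B, B) similarity of each essay pair
    return softmax(scores, axis=-1) @ feats  # batch-aware features, shape (B, d)

def hybrid_ranking_loss(pred, gold, margin=0.1, alpha=0.5):
    """Pairwise hinge term (local ordering) plus a ListNet-style
    list-wise cross-entropy term (global distribution)."""
    B = len(pred)
    pair = 0.0
    for i in range(B):
        for j in range(B):
            if gold[i] > gold[j]:  # penalise pairs the predictions mis-order
                pair += max(0.0, margin - (pred[i] - pred[j]))
    pair /= B * (B - 1)
    listwise = -np.sum(softmax(gold) * np.log(softmax(pred) + 1e-12))
    return alpha * pair + (1 - alpha) * listwise
```

A reversed prediction order incurs a strictly larger loss than a correctly ordered one, which is the property the hybrid objective is meant to enforce.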
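The second stage — discretizing ranking scores into quantile bins and concatenating bin position embeddings onto the essay features — can likewise be sketched. The fixed random table stands in for a learned embedding, and `K` and `dim` are illustrative values, not those used in the thesis.

```python
import numpy as np

def quantile_bins(rank_scores, K=4):
    """Discretise stage-one ranking scores into K quantile bins,
    so each bin holds roughly the same number of essays."""
    edges = np.quantile(rank_scores, np.linspace(0, 1, K + 1)[1:-1])
    return np.digitize(rank_scores, edges)  # bin index in [0, K-1]

def add_bin_embedding(feats, bin_ids, K, dim=8):
    """Concatenate a bin position embedding onto the original essay
    features; a seeded random table stands in for a learned one."""
    table = np.random.default_rng(0).normal(size=(K, dim))
    return np.concatenate([feats, table[bin_ids]], axis=-1)
```

Per the abstract, a Bin Regressor would then be trained on the concatenated features with mean squared error plus a pair-wise term to refine the continuous scores.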
Experimental results demonstrate that our method improves ranking transitivity and QWK
regression accuracy across multiple prompts, yielding more stable and interpretable scoring by
incorporating global ranking information.

Keywords (Chinese): ★ automated scoring ★ ranking loss ★ Cross-Attention
Keywords (English): ★ Automated Essay Scoring ★ Ranking Loss ★ Cross-Attention

Table of Contents
Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background
1.2 Research Motivation
1.3 Research Objectives
Chapter 2 Related Work
2.1 Multi-Trait Automated Scoring and Cross-Prompt Adaptability
2.1.1 Development and Challenges of Multi-Trait Scoring
2.1.2 The Need for a Unified Cross-Prompt Scoring Framework
2.2 Ranking Stability in Scoring Systems
2.2.1 Evolution and Limitations of Learning-to-Rank Methods
2.3 ASAP-Based Essay Scoring Models
2.3.1 MOOSE
Chapter 3 Method
3.1 System Overview
3.2 Global Ranking Learning
3.2.1 Comparison Expert
3.2.2 Hybrid Ranking Loss
3.3 Two-Stage Ranking Regression
3.3.1 Global Ranking Feature Generation
3.3.2 Refinement Expert
Chapter 4 Experimental Results and Discussion
4.1 Evaluation Metrics
4.2 Datasets
4.2.1 ASAP++ Dataset
4.2.2 BeerAdvocate and TripAdvisor Datasets
4.2.3 QGEval Dataset
4.3 Experiments and Discussion
4.3.1 ASAP++
4.3.2 Beer & Trip
4.3.2.1 TripAdvisor
4.3.2.2 BeerAdvocate
4.3.3 QGEval
4.4 Ablation Studies
4.4.1 Effect of the Binning Strategy
4.4.2 Effect of Inference-Time Batch Size
4.4.3 Effect of the Number of Bins
4.4.4 Effect of the BCA Mask
4.5 Zero-Shot TOEFL
Chapter 5 Conclusion
Chapter 6 References

References
[1] O'Shea, K., & Nash, R. (2015). An introduction to convolutional neural
networks. arXiv preprint arXiv:1511.08458.
[2] Taghipour, K., & Ng, H. T. (2016, November). A neural approach to automated essay
scoring. In Proceedings of the 2016 conference on empirical methods in natural
language processing (pp. 1882-1891).
[3] Dong, F., & Zhang, Y. (2016, November). Automatic features for essay scoring–an
empirical study. In Proceedings of the 2016 conference on empirical methods in
natural language processing (pp. 1072-1077).
[4] Yang, R., Cao, J., Wen, Z., Wu, Y., & He, X. (2020). Enhancing automated essay
scoring performance via fine-tuning pre-trained language models with combination of
regression and ranking. Association for Computational Linguistics (ACL).
[5] Wang, Y., Wang, C., Li, R., & Lin, H. (2022). On the use of bert for automated essay
scoring: Joint learning of multi-scale essay representation. arXiv preprint
arXiv:2205.03835.
[6] Ridley, R., He, L., Dai, X., Huang, S., & Chen, J. (2020). Prompt agnostic essay scorer:
a domain generalization approach to cross-prompt automated essay scoring. arXiv
preprint arXiv:2008.01441.
[7] Do, H., Kim, Y., & Lee, G. G. (2023). Prompt-and trait relation-aware cross-prompt
essay trait scoring. arXiv preprint arXiv:2305.16826.
[8] Xu, J., Liu, J., Lin, M., Lin, J., Yu, S., Zhao, L., & Shen, J. (2025). EPCTS: Enhanced
prompt-aware cross-prompt essay trait scoring. Neurocomputing, 621, 129283.
[9] Chen, Y., & Li, X. (2023, July). PMAES: Prompt-mapping contrastive learning for
cross-prompt automated essay scoring. In Proceedings of the 61st annual meeting of
the association for computational linguistics (volume 1: long papers) (pp. 1489-1503).
[10] Foscarin, F., Mcleod, A., Rigaux, P., Jacquemard, F., & Sakai, M. (2020, October).
ASAP: a dataset of aligned scores and performances for piano transcription. In ISMIR
2020-21st International Society for Music Information Retrieval.
[11] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019, June). Bert: Pre-training
of deep bidirectional transformers for language understanding. In Proceedings of the
2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) (pp.
4171-4186).
[12] Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V. 2. The
Journal of Technology, Learning and Assessment, 4(3).
[13] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural
computation, 9(8), 1735-1780.
[14] Do, H., Kim, Y., & Lee, G. G. (2024). Autoregressive score generation for multi-trait
essay scoring. arXiv preprint arXiv:2403.08332.
[15] Lee, S., Cai, Y., Meng, D., Wang, Z., & Wu, Y. (2024). Unleashing large language
models' proficiency in zero-shot essay scoring. arXiv preprint arXiv:2404.04941.
[16] Stahl, M., Biermann, L., Nehring, A., & Wachsmuth, H. (2024). Exploring LLM
prompting strategies for joint essay scoring and feedback generation. arXiv preprint
arXiv:2404.15845.
[17] Ridley, R., He, L., Dai, X. Y., Huang, S., & Chen, J. (2021, May). Automated
cross-prompt scoring of essay traits. In Proceedings of the AAAI conference on artificial
intelligence (Vol. 35, No. 15, pp. 13745-13753).
[18] Fu, W., Wei, B., Hu, J., Cai, Z., & Liu, J. (2024). Qgeval: Benchmarking
multi-dimensional evaluation for question generation. arXiv preprint arXiv:2406.05707.
[19] Fu, J., Ng, S. K., Jiang, Z., & Liu, P. (2023). Gptscore: Evaluate as you desire. arXiv
preprint arXiv:2302.04166.
[20] Khashabi, D., Kordi, Y., & Hajishirzi, H. (2022). Unifiedqa-v2: Stronger
generalization via broader cross-format training. arXiv preprint arXiv:2202.12359.
[21] Li, Y., Wang, H., Zhang, Q., Xiao, B., Hu, C., Wang, H., & Li, X. (2025). Unieval:
Unified holistic evaluation for unified multimodal understanding and
generation. arXiv preprint arXiv:2505.10483.
[22] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019).
Roberta: A robustly optimized bert pretraining approach. arXiv preprint
arXiv:1907.11692.
[23] Chen, P. K., Tsai, B. W., Wei, S. K., Wang, C. Y., Wang, J. C., & Huang, Y. T. (2025,
July). Mixture of Ordered Scoring Experts for Cross-prompt Essay Trait Scoring.
In Proceedings of the 63rd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers) (pp. 18071-18084).
[24] Mohammadshahi, A., Scialom, T., Yazdani, M., Yanki, P., Fan, A., Henderson, J., &
Saeidi, M. (2022). RQUGE: Reference-free metric for evaluating question generation
by answering the question. arXiv preprint arXiv:2211.01482.
[25] Yin, Y., Song, Y., & Zhang, M. (2017, September). Document-level multi-aspect
sentiment classification as machine comprehension. In Proceedings of the 2017
conference on empirical methods in natural language processing (pp. 2044-2054).
[26] Jin, C., He, B., Hui, K., & Sun, L. (2018, July). TDNN: A two-stage deep neural
network for prompt-independent automated essay scoring. In Proceedings of the 56th
annual meeting of the association for computational linguistics (volume 1: long
papers) (pp. 1088-1097).
[27] Li, X., Chen, M., & Nie, J. Y. (2020). SEDNN: Shared and enhanced deep neural
network model for cross-prompt automated essay scoring. Knowledge-Based
Systems, 210, 106491.
[28] Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and annotation
of essays with the Intelligent Essay Assessor™. Automated essay scoring: A
cross-disciplinary perspective, 87-112.
[29] Mizumoto, A., & Eguchi, M. (2023). Exploring the potential of using an AI language
model for automated essay scoring. Research Methods in Applied Linguistics, 2(2),
100050.
[30] Lee, S., Cai, Y., Meng, D., Wang, Z., & Wu, Y. (2024). Unleashing large language
models' proficiency in zero-shot essay scoring. arXiv preprint arXiv:2404.04941.
[31] Stahl, M., Biermann, L., Nehring, A., & Wachsmuth, H. (2024). Exploring LLM
prompting strategies for joint essay scoring and feedback generation. arXiv preprint
arXiv:2404.15845.
[32] Dong, F., Zhang, Y., & Yang, J. (2017, August). Attention-based recurrent
convolutional neural network for automatic essay scoring. In Proceedings of the 21st
conference on computational natural language learning (CoNLL 2017) (pp. 153-162).
[33] Larkey, L. S. (1998, August). Automatic essay grading using text categorization
techniques. In Proceedings of the 21st annual international ACM SIGIR conference
on Research and development in information retrieval (pp. 90-95).
[34] Rudner, L. M., & Liang, T. (2002). Automated essay scoring using Bayes'
theorem. The Journal of Technology, Learning and Assessment, 1(2).
[35] Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., & Li, H. (2007, June). Learning to rank: from
pairwise approach to listwise approach. In Proceedings of the 24th international
conference on Machine learning (pp. 129-136).
[36] Mathias, S., & Bhattacharyya, P. (2018, May). ASAP++: Enriching the ASAP
automated essay grading dataset with essay attribute scores. In Proceedings of the
eleventh international conference on language resources and evaluation (LREC 2018).

Advisor: 王家慶 (Jia-Ching Wang)    Review date: 2025-8-28