Evaluating large language models (LLMs) across multiple languages remains a critical challenge due to the scarcity of low-resource benchmarks and inconsistencies in cross-lingual assessment. We introduce BenchWeaver, a fully automated multilingual evaluation pipeline that addresses these limitations by integrating high-throughput inference, bidirectional translation, and LLM-as-judge scoring. BenchWeaver supports a wide range of task types, including multiple-choice question answering, open-ended question answering, code generation, and translation, and enables standardized evaluation in both native and non-native language settings. We systematically evaluate LLMs across English, Simplified Chinese, Traditional Chinese, and Korean benchmarks, and demonstrate that our translation-enhanced evaluation achieves performance competitive with native-language pipelines. Furthermore, we conduct an extensive study of translation prompting strategies and benchmark BenchWeaver against the P-MMEval framework. Results show that BenchWeaver delivers reliable, scalable, and linguistically inclusive evaluation, offering a practical solution for benchmarking multilingual LLMs in low-resource settings.