Evaluating large language models (LLMs) across multiple languages remains a critical challenge due to the scarcity of low-resource benchmarks and inconsistencies in cross-lingual assessment. We introduce BenchWeaver, a fully automated multilingual evaluation pipeline that addresses these limitations by integrating high-throughput inference, bidirectional translation, and LLM-as-judge scoring. BenchWeaver supports a wide range of task types, including multiple-choice question answering, open-ended question answering, code generation, and translation, and enables standardized evaluation in both native and non-native language settings. We systematically evaluate LLMs across English, Simplified Chinese, Traditional Chinese, and Korean benchmarks, and demonstrate that our translation-enhanced evaluation achieves performance competitive with native-language pipelines. Furthermore, we conduct an extensive study of translation prompting strategies and benchmark BenchWeaver against the P-MMEval framework. Results show that BenchWeaver delivers reliable, scalable, and linguistically inclusive evaluation, offering a practical solution for benchmarking multilingual LLMs in low-resource settings.