Enhancing Code Generation Accuracy through the Addition of LLM Judging Units in a Multi-Agent System

Name

顏維新 (Wei-Hsin Yen)

Department

Department of Computer Science and Information Engineering

Thesis Title

於代理人制度下新增 LLM 投票單元提高生成程式碼正確性
(Enhancing Code Generation Accuracy through the Addition of LLM Judging Units in a Multi-Agent System)

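Related Theses

★ A C Language Extension for Conditional Event-Driven Programming
★ Wavelet-Transform-Based Fingerprint Liveness Detection with Aggregated LPQ and LBP Features
★ An MLOps System Applying Automated Testing to Machine Learning Pipelines in Heterogeneous Environments
★ A Batch Scheduling Framework for Improving Breast Cancer Screening Efficiency
★ Designing a Board Game as a Learning Aid with Visual Thinking Tools and Programs as Single Steps
★ Static Analysis of TOCTOU Vulnerabilities and Its Implementation
★ A Domain-Specific Language for Drawing Wind Power Control Logic
★ Design and Implementation of Expressing Relations Between Mathematical Formulas with Bidirectional Structures in Java
★ A Code Transformation Tool Supporting Modular Rule Authoring
★ A Static Type Checker for pandas DataFrame Based on Alternative Semantics
★ Design and Implementation of Automated Time Complexity Analysis: Evaluating the Power Consumption of Embedded Systems at the Software Level
★ Implementation and Analysis of a Domain-Specific Language for Seismic Tomography
★ Reducing the Number of EEG Channels for Fatigue Detection with Feature Selection
★ A Mechanism for Visualizing Procedural Thinking in Programming Learning Using Paper-Based Computation and Digitization
★ Array Shape Error Detection Based on Abstract Syntax Trees
★ A Mechanism for Learning Recursive Programs by Acquiring Functional Programming Thinking Through Role Division in Cooperative Learning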

Files

Browse the thesis in the system (open after 2029-7-19)

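Abstract (Chinese)

With advances in large language model (LLM) technology, LLMs have become an important aid in software development. However, the accuracy and reliability of LLM-generated code still face many challenges. This thesis analyzes in depth the correctness of code generated by current LLMs, examines the limitations of their practical use, and proposes a new solution to improve the accuracy of generated code.

This thesis proposes an LLM-based code generation method named JudgeCoder, which adopts a multi-agent system and a Chain-of-Thought (CoT) strategy to increase the correctness of generated code. By simulating the division of labor in team software development, it separates three tasks (writing code, writing test data, and executing tests), which reduces the hallucinations (LLM hallucination) that a single LLM may produce when responsibilities are not clearly divided. It also proposes a strategy that incorporates the idea of CoT-SC (Chain of Thought with Self-Consistency) to detect erroneous test data produced by model hallucination, so that faulty test data does not send the system into an unnecessary error-correction process. In experiments, JudgeCoder achieves state-of-the-art performance on the HumanEval and HumanEval-ET evaluation datasets, showing that the proposed voting mechanism, combined with appropriate prompting strategies and a sound error-judgment mechanism, can effectively improve the accuracy of generated code. These results not only validate the practicality of JudgeCoder but also offer a direction for future research on LLM-based automatic code generation.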

Abstract (English)

With the advancement of Large Language Models (LLMs), these models have become important aids in software development. However, LLMs still face numerous challenges in the accuracy and reliability of the code they generate. This thesis analyzes the correctness of code generated by current LLMs, explores their practical limitations, and proposes a solution to improve the accuracy of generated code.

This thesis introduces an LLM-based code generation method named JudgeCoder, which employs a multi-agent system and a Chain-of-Thought (CoT) strategy to increase the correctness of generated code. By simulating the division of labor in team software development, the process separates code generation, test data generation, and test execution, thereby reducing the hallucinations that unclear task division can cause in a single LLM. The thesis also presents a strategy that incorporates Chain of Thought with Self-Consistency (CoT-SC) to detect erroneous test data produced by model hallucination, so that faulty test data does not trigger an unnecessary correction process. In experiments, JudgeCoder achieves state-of-the-art results on the HumanEval and HumanEval-ET datasets. The results show that the proposed voting mechanism, coupled with appropriate prompting strategies and a sound error-judgment mechanism, can effectively improve the accuracy of generated code. These findings not only validate the practicality of JudgeCoder but also offer a direction for future research in LLM-based automatic code generation.
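
The abstract does not say how the judging units gather or combine their votes, so the following Python sketch is only an illustration of the general idea and not JudgeCoder's implementation. It assumes that each judge is a callable returning either "code" or "test" to name the artifact it believes is wrong, and that the majority answer decides whether a failing test is discarded or sent to the correction step. The names `Judge`, `majority_verdict`, `run_test`, and `triage` are hypothetical, and the lambda judges stand in for real LLM calls.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical judge interface: (task, code, test) -> "code" or "test",
# naming the artifact the judge believes is wrong.
Judge = Callable[[str, str, str], str]


def majority_verdict(judges: List[Judge], task: str, code: str, test: str) -> str:
    """Collect one vote per judge and return the most common answer.

    An odd number of judges avoids ties between the two possible answers.
    """
    votes = Counter(judge(task, code, test) for judge in judges)
    verdict, _count = votes.most_common(1)[0]
    return verdict


def run_test(code: str, test: str) -> bool:
    """Run the candidate code, then one test statement, in a fresh namespace.

    exec() is used only to keep the sketch short; untrusted generated code
    should be executed in a sandbox instead.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(test, namespace)
        return True
    except Exception:
        return False


def triage(judges: List[Judge], task: str, code: str,
           tests: List[str]) -> Tuple[List[str], List[str]]:
    """Split failing tests into those blamed on the code and those blamed on the test."""
    blame_code: List[str] = []
    blame_test: List[str] = []
    for test in tests:
        if run_test(code, test):
            continue  # passing tests need no judgment
        if majority_verdict(judges, task, code, test) == "test":
            blame_test.append(test)  # likely a hallucinated test: skip correction
        else:
            blame_code.append(test)  # would be handed to the code-fixing step
    return blame_code, blame_test


# Stub judges standing in for LLM calls (assumption: two blame the test, one the code).
task = "Return the sum of two integers."
code = "def add(a, b):\n    return a + b\n"
tests = ["assert add(1, 2) == 3", "assert add(2, 2) == 5"]
judges = [lambda t, c, s: "test", lambda t, c, s: "test", lambda t, c, s: "code"]
blame_code, blame_test = triage(judges, task, code, tests)
```

In this toy run the second test asserts a wrong expected value, so it fails against correct code; because two of the three stub judges blame the test, it is set aside and the code is not sent for correction.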

Keywords (Chinese)

★ Large Language Models
★ Code Generation
★ ChatGPT
★ Chain of Thought
★ Multi-Agent System
★ LLM Voting

Keywords (English)

★ LLM
★ Code Generation
★ ChatGPT
★ Chain-of-Thought
★ Multi-Agent Collaboration
★ LLM Judge

Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
1.1 The Evolution of Code Generation Techniques and Language Models
1.1.1 The Evolution of Language Models
1.1.2 Large Language Models (LLMs)
1.2 LLM-Assisted Software Development
1.2.1 Pair programming
1.2.2 Prompt programming
1.3 Techniques for Improving the Accuracy of LLM-Generated Code
1.3.1 Chain of Thought
1.3.2 Dual Execution Agreement
1.3.3 LLM Self-Correction
1.3.4 Multi-Agent Systems
1.4 Verifying the Correctness of LLM-Generated Code
1.5 Thesis Organization
2. Research Motivation
2.1 Shortcomings of Existing Code Generation Techniques
2.2 Motivating Example
3. Proposal and Implementation
3.1 Applying the Multi-Agent System
3.2 Using Chain of Thought to Improve Agent Output
3.3 Voting Mechanism
3.4 Implementation
3.4.1 Implementation Architecture
3.4.2 Maintaining Generated Data
3.4.3 Chain-of-Thought Prompt Design
3.4.4 JudgeCoder Usage and Error-Handling Example
4. Evaluation
4.1 Experimental Setup
4.1.1 Experimental Environment
4.1.2 Evaluation Criteria
4.1.3 Evaluation Datasets
4.1.4 Compared Methods
4.2 Performance Evaluation
4.3 Benefits of the Agent System
4.3.1 Benefit of the Voting Mechanism in the Agent System
4.3.2 Effect of the Number of Voting Units on Correctness
4.4 JudgeCoder Generation Results and Analysis
4.4.1 Voting Preferences and Generation-Round Statistics
4.4.2 Per-Round API Usage and Call Time
4.4.3 Relationship Between Generation Rounds and Correctness
5. Related Work
5.1 Automatic Code Generation with LLMs
5.2 Multi-Agent Systems
5.3 The Development of Chain of Thought
6. Conclusion
7. Future Work
7.1 Handling Complex Projects
7.2 Exploring Different Underlying Models for Different Agents
7.3 Comparing Different Voting Strategies
7.4 Multi-Language Support
7.5 Support for Compiled Languages
References

Advisor

莊永裕 (YungYu Zhuang)

Approval Date

2024-7-30
