With the emergence of large language models such as ELMo, GPT, and BERT, research in natural language processing has shifted toward a two-stage training paradigm: pretraining a large language model and then fine-tuning it on downstream tasks. Subsequent work on adapting models to multiple tasks and to unseen tasks revealed the generalization potential of pretrained models, which led to the concept of instruction tuning. This shift also changed the kind of labeled data required: instead of datasets designed for individual tasks, instruction-formatted data (instructional data) is needed. Earlier efforts built instruction-tuning corpora by adding instructions to existing annotated NLP datasets, but the scale of that manual effort motivated research on automatically generating instructional data.

This thesis proposes a new prompt-engineering-based framework for automatically generating instructional data. We design five prompts that guide an existing instruction-tuned language model to produce more than 10,000 instruction examples covering different domains, topics, and tasks. Because the framework allows the target task to be specified, it mitigates the task-type imbalance observed both in instruction data derived from existing NLP datasets and in previously proposed automated generation methods. We are also the first to attempt generating reward-model data for reinforcement learning with an automated approach. Although our experiments do not directly measure the effect of this data in reinforcement learning, we instruction-tune GPT-3 on the generated data and use the recently proposed G-Eval method to automatically evaluate both the generated data and the instruction-tuned results, obtaining scores 0.15 to 0.45 higher than the baselines.