Abstract: With the rapid advancement of large language models and generative models, music generation has gradually evolved from raw audio synthesis toward a more controllable form of music creation over structured musical elements. Although existing text-conditioned approaches can synthesize audio of reasonable quality, significant challenges remain in controlling musical structure and expressive performance.

This study proposes a symbolic music generation system built on a language model, treating music as a language with grammatical rules and structural logic, and modeling musical elements such as sections, rhythm, pitch, dynamics, and tempo in a structured way. We design a custom music-token vocabulary and a staged training strategy that decomposes music generation into five sub-tasks (chord generation, main-melody generation, secondary-melody generation, dynamics control, and tempo-variation generation), allowing the model to progressively learn musical structure and performance detail.

The system builds on the LLaMA 3.1 8B-Instruct model. Full-parameter fine-tuning is first applied so the model acquires the syntax and combinatorial patterns of the newly added music tokens and establishes basic associations among them. Each sub-task is then fine-tuned separately with LoRA, a parameter-efficient fine-tuning method, to improve task-specific performance while preserving the model's existing knowledge and reducing training cost.

To enhance structural consistency and grammatical correctness during generation, the system introduces a structure-aware logits masking mechanism that restricts the model to tokens permitted by the grammar, reinforcing section ordering, bar-level logic, and the consistency of performance-expression tokens.

Experimental results on a structured symbolic music dataset show that, with the proposed custom tokens and multi-stage training pipeline, the model can generate compositions with complete section planning, rhythmic continuity, and controlled expressive intent. These findings demonstrate that, after vocabulary extension and structure-aware training, large language models can be effectively applied to symbolic music generation.
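The core idea of structure-aware logits masking can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the mini-vocabulary and the grammar table below are hypothetical stand-ins for the actual music-token vocabulary and structural rules, but the mechanism (setting logits of grammar-violating tokens to negative infinity before selection) is the general technique described above.

```python
import math

# Hypothetical mini-vocabulary for illustration only; the actual
# music-token vocabulary in the thesis is much larger.
VOCAB = ["<bar>", "<chord_C>", "<note_C4>", "<dyn_mf>", "</bar>"]

# Hypothetical grammar: which tokens may legally follow each token.
ALLOWED_AFTER = {
    "<bar>": {"<chord_C>"},
    "<chord_C>": {"<note_C4>", "<dyn_mf>"},
    "<note_C4>": {"<dyn_mf>", "</bar>"},
    "<dyn_mf>": {"<note_C4>", "</bar>"},
}

def mask_logits(logits, prev_token):
    """Set logits of grammar-violating tokens to -inf so they can
    never be chosen, regardless of sampling temperature."""
    allowed = ALLOWED_AFTER.get(prev_token, set(VOCAB))
    return [x if tok in allowed else -math.inf
            for tok, x in zip(VOCAB, logits)]

def greedy_step(logits, prev_token):
    """Pick the highest-scoring token after applying the mask."""
    masked = mask_logits(logits, prev_token)
    return VOCAB[max(range(len(VOCAB)), key=lambda i: masked[i])]
```

For example, even if the raw logits strongly favor `"<bar>"` immediately after a `"<bar>"` token, the mask forces the model to emit the grammatically required chord token instead.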