| Abstract: | Fashion is a long-standing industry in which consumers show a persistent need for outfit coordination. As fashion recommendation systems grow in importance for e-commerce and styling applications, outfit compatibility prediction has become a key research focus. However, previous methods are limited by inconsistencies in auxiliary information, difficulties in multimodal integration, and insufficient modeling of high-order dependencies and global context; these issues restrict both generalizability and accuracy in practical applications. In particular, current methods remain suboptimal when visual and textual features lie in different representation spaces, or when hypergraph neural networks (HGNNs) lose fine-grained detail during information aggregation. This study aims to improve outfit compatibility prediction and matching accuracy by proposing a multi-stage prediction model that integrates a Large Language Model (LLM), FashionCLIP, a Hypergraph Neural Network (HGNN), and a Transformer. The model takes only the fashion item image—the richest source of visual information—as input.
A rapidly evolving LLM, GPT-4o, is then employed to generate a detailed textual description of each item, covering fashion attributes such as color, material, and style, together with its category label; these descriptions supply semantic information that the image alone cannot convey. FashionCLIP, which is tailored to fashion-related tasks, aligns the image and text features, yielding a semantically consistent multimodal embedding for each item. These aligned features are used to construct a hypergraph whose nodes correspond to clothing categories; a graph-convolutional message-passing mechanism aggregates information from neighboring nodes and updates the node representations, preserving the structural relations and many-to-many interactions implicit in an outfit. The updated node embeddings are then fed into a Transformer to capture fine-grained details and long-range dependencies in the global context, compensating for the limitations of the HGNN. Together, the HGNN and Transformer form the final compatibility prediction module, which produces an overall outfit compatibility representation. The proposed model is evaluated on the Polyvore Disjoint and Zalando datasets. Experimental results show that it outperforms existing approaches in outfit compatibility prediction, validating its effectiveness. |
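The HGNN message-passing step summarized above can be illustrated with a single hypergraph convolution layer in the standard spectral form, X' = σ(D_v^{-1/2} H W D_e^{-1} Hᵀ D_v^{-1/2} X Θ). This is a minimal numpy sketch of that generic operation, not the thesis's implementation; the toy incidence matrix, feature dimensions, and unit hyperedge weights below are illustrative assumptions.

```python
import numpy as np

def hgnn_layer(X, H, Theta, w=None):
    """One hypergraph convolution (message-passing) step.
    X:     (n_nodes, d_in)  node embeddings (e.g., multimodal item features)
    H:     (n_nodes, n_edges) incidence matrix; H[v, e] = 1 if node v lies on hyperedge e
    Theta: (d_in, d_out)    learnable projection
    w:     (n_edges,)       hyperedge weights (defaults to all ones)
    """
    n_nodes, n_edges = H.shape
    if w is None:
        w = np.ones(n_edges)
    Dv = H @ w                       # node degrees
    De = H.sum(axis=0)               # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(Dv))
    De_inv = np.diag(1.0 / De)
    # Normalized node-edge-node propagation: nodes exchange information
    # through every hyperedge (outfit) they share.
    A = Dv_inv_sqrt @ H @ np.diag(w) @ De_inv @ H.T @ Dv_inv_sqrt
    return np.maximum(A @ X @ Theta, 0.0)  # ReLU activation

# Toy example (hypothetical): 4 items connected by 2 hyperedges (outfits).
H = np.array([[1, 0],
              [1, 1],
              [1, 0],
              [0, 1]], dtype=float)
X = np.random.randn(4, 8)            # stand-in for FashionCLIP-aligned features
Theta = np.random.randn(8, 8)
Z = hgnn_layer(X, H, Theta)
print(Z.shape)  # (4, 8)
```

In the full model, the embeddings Z produced by such layers would then be passed to a Transformer encoder, whose self-attention supplies the global, long-range context that local hyperedge aggregation misses.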