Infrared and visible image fusion integrates the complementary information captured by the two sensor types into a single image that retains the salient features of both, with the goal of better matching human visual perception or supporting high-level vision tasks such as semantic segmentation and object detection. Most current fusion algorithms assume that paired infrared and visible images are available. In practice, however, different sensor devices often produce misaligned image content or drop frames, causing temporal misalignment. Recent work can remove slight displacements and deformations between the two inputs under the assumption of identical resolution, but real captured images may differ substantially in resolution and field of view and therefore require more effective alignment methods. Existing image fusion datasets also lack object and semantic segmentation annotations, which hampers the training of related models, and the differing infrared and visible content across datasets makes traditional feature matching methods unsatisfactory.

This thesis proposes a method for building an infrared and visible image fusion dataset with semantic segmentation information. We apply style transfer to images from existing semantic segmentation datasets to generate corresponding infrared and visible images, then use these images to retrain semantic segmentation models, yielding a dataset that matches the target application scenario and includes the corresponding semantic segmentation labels and masks. Depending on whether the background contains classic segmentation classes, we use either semantic segmentation labels or salient object masks. We achieve global spatial alignment by estimating image scale and translation in the frequency domain via a log-polar transform combined with the Fourier transform, and can further refine slight local displacements with deep learning to align objects in the frame more precisely. For temporal alignment, we combine spatial alignment with mask matching, examining candidate infrared and visible frames one by one to find the pair with maximum object overlap, thereby overcoming temporal misalignment caused by frame drops or device settings. Finally, we propose an ultra-low-parameter image fusion design that reduces computational resource requirements while improving image fusion performance and efficiency.

Keywords – Image fusion, Image alignment, Deep learning, Semantic segmentation, Style transfer
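The global spatial alignment step — estimating scale and translation in the frequency domain via a log-polar transform and the Fourier transform — follows the classic phase-correlation / Fourier–Mellin idea. The sketch below is illustrative only, not the thesis's implementation: `phase_correlation` recovers a translation from the cross-power spectrum, and `logpolar_magnitude` resamples the FFT magnitude (nearest-neighbour, for brevity) so that a spatial scaling becomes a shift along the log-radius axis.

```python
import numpy as np

def phase_correlation(a, b):
    """Estimate the (dy, dx) shift of image b relative to a.

    The normalized cross-power spectrum of the two FFTs is an ideal
    complex exponential; the peak of its inverse FFT marks the shift.
    """
    R = np.conj(np.fft.fft2(a)) * np.fft.fft2(b)
    R /= np.abs(R) + 1e-12          # keep only phase information
    corr = np.fft.ifft2(R).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = a.shape
    if dy > h // 2:                 # map wrap-around peaks to negative shifts
        dy -= h
    if dx > w // 2:
        dx -= w
    return dy, dx

def logpolar_magnitude(img, n_r=64, n_theta=64):
    """Nearest-neighbour log-polar resampling of the centred FFT magnitude.

    Because scaling an image shifts its log-polar magnitude map along the
    log-radius axis, running phase correlation on two such maps recovers
    the relative scale as exp(shift_r * log_base).
    """
    mag = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = img.shape
    cy, cx = h / 2.0, w / 2.0
    r_max = min(cy, cx)
    log_base = np.log(r_max) / n_r
    rs = np.exp(np.arange(n_r) * log_base)                  # log-spaced radii
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    y = np.clip((cy + rs[:, None] * np.sin(thetas)).round().astype(int), 0, h - 1)
    x = np.clip((cx + rs[:, None] * np.cos(thetas)).round().astype(int), 0, w - 1)
    return mag[y, x], log_base
```

Translation is then read directly from `phase_correlation` on the aligned-scale images, while the scale factor comes from the log-radius shift between the two log-polar magnitude maps.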
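The temporal-alignment search — examining spatially aligned candidate frames and keeping the pair with maximum object overlap — can be sketched as a simple intersection-over-union scan over object masks. The helper names below are hypothetical; the sketch assumes the infrared masks have already been spatially aligned to the visible frame:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean object masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def best_temporal_match(ir_masks, vis_mask):
    """Scan candidate infrared frames and return the index (and score) of
    the one whose object mask overlaps the visible-frame mask the most,
    compensating for frame drops between the two streams."""
    scores = [mask_iou(m, vis_mask) for m in ir_masks]
    best = int(np.argmax(scores))
    return best, scores[best]
```

In a video setting the same scan would run over a small window of neighbouring frames rather than the whole sequence, since frame drops displace the streams by only a few frames at a time.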
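The abstract does not specify the fusion network's architecture, so the following is only a generic illustration of how an ultra-low parameter budget is commonly reached: replacing standard convolutions with depthwise-separable ones. The two counting helpers are assumptions for illustration, not the thesis's design:

```python
def conv2d_params(c_in, c_out, k):
    """Parameter count of a standard k x k convolution (weights + biases)."""
    return c_in * c_out * k * k + c_out

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution (one filter per input channel) followed
    by a 1 x 1 pointwise convolution, each with biases."""
    depthwise = c_in * k * k + c_in
    pointwise = c_in * c_out + c_out
    return depthwise + pointwise
```

For a 64-to-64-channel 3 x 3 layer this drops the count from 36,928 to 4,800 parameters, roughly a 7.7x reduction, which is the kind of saving that makes a fusion model practical on resource-constrained devices.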