NCU Institutional Repository (中大機構典藏): Item 987654321/98183


    Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98183


    Title: 基於片段級分群提取之跨模態影片檢索技術與其於影片字幕生成之應用; SLCE: Segment-Level Clustering and Extraction based Cross-Modal Video Retrieval and its Applications in Video Captioning
    Author: 鄭佳榮; Cheng, Jia-Rong
    Contributor: Department of Computer Science and Information Engineering
    Keywords: Video Retrieval; Video Captioning; Contrastive Learning
    Date: 2025-06-30
    Uploaded: 2025-10-17 12:28:02 (UTC+8)
    Publisher: National Central University
    Abstract: With the rapid development of deep learning, contrastive learning has made significant progress in the field of self-supervised learning. Meanwhile, the Transformer has demonstrated outstanding performance in text-processing tasks thanks to its powerful language understanding and generation capabilities. Text-image retrieval, as an important application of multimodal learning, is widely used in tasks such as image annotation, automatic label generation, and cross-modal retrieval. This thesis examines the latest advances in these technologies and further analyzes their applications in multimodal retrieval.
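    The contrastive learning referred to above is the CLIP-style objective that CLIP4Clip inherits. As a point of reference only, the sketch below shows a minimal symmetric InfoNCE loss over paired video/text embeddings; the function name, tensor shapes, and temperature value are illustrative assumptions, not details taken from the thesis.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (B, D) tensors; matched pairs share the same row index.
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix scaled by the temperature.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both retrieval directions (video-to-text and text-to-video).
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```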
    This thesis examines the effectiveness of pre-training a Video Retrieval (VR) model and applying it to the Video Captioning (VC) task. Based on the CLIP4Clip video retrieval architecture, we introduce a Segment-Level Clustering (SLC) strategy into the retrieval model to filter the video features and extract more representative key features. We further apply Cross-Modal Segment Extraction (CMSE) to focus on the segment features most relevant to the text semantics, thereby improving retrieval performance and feature quality.
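    The abstract does not specify how SLC and CMSE are implemented. The sketch below illustrates one plausible reading, assuming per-frame features are grouped by k-means and the resulting segment features are re-weighted by their similarity to the query text; the function names and design choices here are hypothetical, not taken from the thesis.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def segment_level_clustering(frame_feats, num_segments=8):
    """Group per-frame features (T, D) into segment-level features (K, D).

    Hypothetical reading of SLC: k-means over the frame features, keeping the
    cluster centroids as representative segment features.
    """
    kmeans = KMeans(n_clusters=num_segments, n_init=10)
    kmeans.fit(frame_feats.detach().cpu().numpy())
    centroids = torch.from_numpy(kmeans.cluster_centers_).to(frame_feats)
    return centroids  # (K, D)

def cross_modal_segment_extraction(segment_feats, text_feat):
    """Weight segments by their similarity to the text query and pool them.

    Hypothetical reading of CMSE: a softmax over cosine similarities acts as
    text-conditioned attention over the segment features.
    """
    seg = F.normalize(segment_feats, dim=-1)        # (K, D)
    txt = F.normalize(text_feat, dim=-1)            # (D,)
    weights = torch.softmax(seg @ txt, dim=0)       # (K,)
    return (weights.unsqueeze(-1) * segment_feats).sum(dim=0)  # (D,)
```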
    The video captioning model is based on the Uni-VL architecture, with Transformers serving as the encoder and sentence decoder; the visual encoder optimized by the pre-trained retrieval model is used to extract more expressive video features. Finally, this study designs a Dual-Stream Transformer (DST) architecture that fuses multimodal features between the encoding and decoding streams through a weight-allocation strategy. Experimental results show an improvement in the overall performance of video captioning, demonstrating the feasibility and advantages of combining retrieval pre-training with multimodal fusion strategies.
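    The "weight allocation strategy" between the encoding and decoding streams is only named in the abstract. Below is a minimal sketch of one simple interpretation, a learned gate that blends the two streams' hidden states before decoding; the module name WeightedStreamFusion and the gating formulation are assumptions for illustration, not the thesis's actual design.

```python
import torch
import torch.nn as nn

class WeightedStreamFusion(nn.Module):
    """Fuse encoder-stream and decoder-stream features with a learned gate.

    One possible reading of the weight-allocation strategy in the DST: a
    sigmoid gate decides, per position, how much of each stream to keep.
    """
    def __init__(self, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, enc_stream, dec_stream):
        # enc_stream, dec_stream: (B, L, D) hidden states from the two streams.
        g = torch.sigmoid(self.gate(torch.cat([enc_stream, dec_stream], dim=-1)))
        return g * enc_stream + (1.0 - g) * dec_stream
```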
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations

    Files in This Item:

    File        Description    Size    Format    Views
    index.html                 0 KB    HTML      20


    All items in NCUIR are protected by copyright, with all rights reserved.

