Name: 邱靖恩 (Jing-En Chiu)
Department: Department of Computer Science and Information Engineering
Thesis Title: 長短時間步長生成對抗網路用於動態影片生成 (Long and Short Time Step Generative Adversarial Network for Motion Video Generation)
Full text available in the system after 2026-08-17.
Abstract (Chinese): With the growth of computing power and the development of machine learning models, image-related models have made great strides. Since the introduction of generative adversarial networks, many image generation applications have emerged, and motion video generation is one of them. Generating motion video from an image sequence helps a system analyze the future motion of objects in a scene and serves as a basis for its decisions; for example, a self-driving system can use it to identify potential dangers and respond before the driver notices them. It can also be used to produce creative material by generating a motion video from a single image. Current research on generating motion video with recurrent input mainly follows two approaches: in the first, the model generates images directly; in the second, the model generates warping information that is then synthesized with the input image. With direct generation, the model must simultaneously learn to generate realistic images and to model motion, whereas the warping approach focuses on learning the motion between adjacent frames and therefore usually produces sharper images; after reviewing prior work, this study adopts the latter. However, with either approach, the images become blurry once the model has recurrently consumed its own outputs for a certain length of time. To keep the generated images sharp, this study proposes a long and short time step model: a long time step model first generates frames that skip ahead by a configured step size using optical flow, and a short time step model then generates the intermediate frames using a difference method. Experiments with single- and multi-image inputs are conducted on a sea wave dataset and a dashcam dataset. On both datasets, the evaluation scores show that the proposed model mitigates the blurry regions produced by recurrent generation, so the video retains its image structure and plausible motion even after generating for a longer time.
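To illustrate the warping idea behind the long time step model, here is a minimal sketch of extrapolating a frame with optical flow. It is not the thesis code: the function name `extrapolate_with_flow` and the use of OpenCV's Farnebäck flow are illustrative assumptions, whereas the thesis learns the warping information with a generator network.

```python
# A minimal sketch (not the thesis implementation): estimate optical flow
# between the two most recent frames, then warp the last frame forward to
# extrapolate one future frame, assuming the motion stays constant.
import cv2
import numpy as np

def extrapolate_with_flow(prev_gray: np.ndarray, last_gray: np.ndarray,
                          last_bgr: np.ndarray) -> np.ndarray:
    """Warp `last_bgr` one step forward using flow from prev -> last."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, last_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    h, w = last_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Backward warping: sample the last frame at positions displaced by
    # -flow, a common approximation for one-step extrapolation.
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)
    return cv2.remap(last_bgr, map_x, map_y, cv2.INTER_LINEAR)
```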
Abstract (English): Generating motion video from an image sequence helps a system analyze the motion of objects in the scene and make decisions. For example, a self-driving system can determine whether there is a potential danger and react before the driver becomes aware of it. The application can also be used to generate video material from a single input image. In recent research, there are mainly two ways to generate motion video by recurrently taking generated images as input: the first directly generates images with the model; the second synthesizes the generated warping information with the last image of the input sequence. In the first method, the network has to model not only the distribution of the training data but also the object motion simultaneously. In the second method, the network focuses on learning how to generate the warping information between adjacent frames. We adopt the second method because it generally produces sharper images than the first. However, both recurrent generation methods eventually produce blurry results after generating several frames. To keep the generated images sharp, we propose a long and short time step generative adversarial network. First, the long time step network generates frames that skip a set number of frames, using an optical flow method. Second, the short time step network generates the frames between two long time step frames, using a difference method. We conduct single- and multi-image sequence experiments on a sea wave dataset and a car camera dataset. The results show that our model generates images with fewer blurry artifacts, a stable scene, and reasonable motion after generating several frames.
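To make the generation order concrete, the following is a hedged Python sketch of the long/short time step loop described in the abstract. `generate_video`, `long_step_net`, and `short_step_net` are hypothetical stand-ins, not the thesis implementation; the actual networks, conditioning windows, and step size may differ.

```python
# Sketch of the two-pass generation order: long time step anchors first,
# then short time step in-between frames. The two nets are assumed
# callables standing in for the thesis's generators.
from typing import Callable, List
import numpy as np

def generate_video(inputs: List[np.ndarray],
                   long_step_net: Callable[[List[np.ndarray]], np.ndarray],
                   short_step_net: Callable[[np.ndarray, np.ndarray], np.ndarray],
                   step: int = 4, num_anchors: int = 8) -> List[np.ndarray]:
    # 1) Long time step pass: recurrently predict anchor frames that skip
    #    `step` frames at a time, so blur accumulates over fewer iterations.
    anchors = list(inputs)
    for _ in range(num_anchors):
        anchors.append(long_step_net(anchors[-len(inputs):]))
    anchors = anchors[len(inputs):]          # keep only generated anchors

    # 2) Short time step pass: fill in the frames between each pair of
    #    consecutive anchors (e.g., previous frame plus a predicted
    #    difference image).
    video = []
    prev = inputs[-1]
    for anchor in anchors:
        cur = prev
        for _ in range(step - 1):
            cur = short_step_net(cur, anchor)  # one in-between frame
            video.append(cur)
        video.append(anchor)
        prev = anchor
    return video
```

Generating the anchors first means each recurrent chain is `step` times shorter for the same output length, which is the abstract's stated mechanism for delaying blur.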
Keywords:
★ Motion Video Generation
★ Video Prediction
★ Image Synthesis
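The evaluation scores mentioned in the abstract are, per Section 4-2 of the table of contents below, MSE, PSNR, and SSIM. As a quick reference, here is a minimal sketch of how such per-frame scores are commonly computed with scikit-image; the thesis's exact protocol (color space, cropping, averaging over frames) may differ, and `score_frame` is an illustrative name.

```python
# Common per-frame quality metrics for video prediction, computed with
# scikit-image. Assumes uint8 frames of identical shape.
import numpy as np
from skimage.metrics import (mean_squared_error,
                             peak_signal_noise_ratio,
                             structural_similarity)

def score_frame(real: np.ndarray, fake: np.ndarray) -> dict:
    """Compare one generated frame against its ground-truth frame."""
    return {
        "mse": mean_squared_error(real, fake),
        "psnr": peak_signal_noise_ratio(real, fake, data_range=255),
        # channel_axis=-1 for color images (skimage >= 0.19).
        "ssim": structural_similarity(real, fake,
                                      channel_axis=-1, data_range=255),
    }
```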
Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Objectives
1-3 Thesis Organization
2. Literature Review
2-1 Neural Network Architectures
2-1-1 Encoder-Decoder
2-1-2 Residual Block
2-2 Generative Adversarial Network Models
2-2-1 Generative Adversarial Networks
2-2-2 Conditional GAN
2-2-3 PatchGAN
2-3 Motion Video Generation Methods
2-3-1 Optical Flow Method
2-3-2 Difference Method
2-3-3 PredNet
2-3-4 Dual Motion GAN
3. Methodology
3-1 Generator
3-2 Discriminator
3-3 Video Generation Model
3-3-1 Long Time Step Model
3-3-2 Short Time Step Model
3-4 Loss Functions
3-4-1 Adversarial Loss
3-4-2 Reconstruction Loss
4. Experimental Results
4-1 Datasets
4-1-1 Sea Wave Video Dataset
4-1-2 Dashcam Dataset
4-1-2-1 KITTI
4-1-2-2 Caltech Pedestrian
4-1-3 Dataset Preprocessing
4-2 Evaluation Metrics
4-2-1 MSE
4-2-2 PSNR
4-2-3 SSIM
4-3 Experimental Environment
4-4 Single-Image Sequence Input Experiments
4-4-1 Sea Wave Dataset
4-4-2 Loss Ablation Study
4-4-3 Caltech Pedestrian Dataset
4-5 Multi-Image Sequence Input Experiments
4-5-1 Sea Wave Dataset
4-5-2 Loss Ablation Study
4-5-3 Caltech Pedestrian Dataset
4-6 Qualitative Analysis Experiments
4-6-1 Sea Wave Dataset, Four-Image Input
4-6-2 Sea Wave Dataset, Single-Image Input
4-6-3 Caltech Pedestrian Dataset, Four-Image Input
4-6-4 Caltech Pedestrian Dataset, Single-Image Input
5. Conclusions and Future Research Directions
References
Advisor: 鄭旭詠 (Hsu-Yung Cheng)
Approval Date: 2021-08-18