Abstract: In July 2020, the Joint Video Experts Team (JVET) approved the first version of the H.266/VVC video compression standard. Compared with the previous standard, H.265/HEVC, VVC roughly doubles compression efficiency, saving close to 50% in bit rate at the same video quality. This comes at the cost of a sharp increase in encoding complexity: encoding times are six to ten times longer than for H.265/HEVC. Reducing encoding time is therefore a primary task for the widespread adoption of the standard. Among the many new tools in the VVC specification, the QTMT (Quadtree with Nested Multi-Type Tree) block partitioning structure accounts for over 97% of the encoding time. Whereas HEVC partitions each CU (Coding Unit) with the QT structure only, VVC adds MTT partitioning, namely horizontal and vertical BT (Binary Tree) and TT (Ternary Tree) splits. The new structure gives every CU six possible partitioning modes, and exhaustively evaluating their Rate-Distortion cost (RD cost) is what drives the large increase in encoding time. We therefore propose three partitioning algorithms based on supervised learning models to make fast split decisions. In addition, a novel Motion Vector Field (MVF) is designed to strengthen motion estimation in inter prediction, and the motion estimation results are fed to the models as features. Finally, the three models are cascaded to obtain the best overall encoding efficiency.
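To make the source of the complexity concrete, the sketch below enumerates the six QTMT decisions named above (no further split, QT, horizontal/vertical BT, horizontal/vertical TT) and shows how a learned split predictor could prune the candidate list before the expensive RD-cost search. This is a minimal illustration, not the thesis implementation or VTM code: `predict_split_probabilities` stands in for the cascaded models, `rd_cost` stands in for the encoder's RD evaluation of one candidate, and `keep_top` is a hypothetical pruning parameter.

```python
# Illustrative sketch of learned pruning over the six QTMT split candidates.
QTMT_MODES = ["NO_SPLIT", "QT", "BT_HOR", "BT_VER", "TT_HOR", "TT_VER"]

def choose_split(cu, predict_split_probabilities, rd_cost, keep_top=3):
    """Evaluate RD cost only for the most likely QTMT split modes of a CU."""
    probs = predict_split_probabilities(cu)            # dict: mode name -> probability
    ranked = sorted(QTMT_MODES, key=lambda m: probs.get(m, 0.0), reverse=True)
    candidates = ranked[:keep_top]                      # skip the unlikely modes entirely
    return min(candidates, key=lambda m: rd_cost(cu, m))
```

The point of the design is that every mode removed from `candidates` avoids one full RD-cost evaluation, which is where the 97% of encoding time is spent.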
This work makes four main contributions. First, we redefine a novel MVF; experiments show that it effectively reduces BDBR (Bjøntegaard Delta Bit Rate) and can even serve as a substitute for the Affine Merge Mode in the VVC specification. Second, we develop a Directed Acyclic Graph-Support Vector Machine (DAG-SVM) algorithm for VVC partition prediction that cuts computation time by classifying CUs into six groups with almost no impact on encoding performance. Third, we exploit the ability of Random Forest Regression (RFR) to handle high-dimensional data by placing it at the end of the partition prediction framework, where it further refines the complex output of the Convolutional Neural Network (CNN) to improve performance. Finally, we design threshold selection schemes for the models, making the trade-off between encoding complexity and coding efficiency adjustable. Compared with the original VVC encoder (VTM-10.0 under the RAGOP32 configuration), the complete framework with thresholds (Thm = 0.125, Thd = 8) increases BDBR by only 1.31% while reducing encoding time by nearly 50%, clearly outperforming other state-of-the-art solutions. With thresholds (Thm = 0.2, Thd = 16), BDBR increases by just 2.74% and encoding time drops by almost 70%, greatly improving the feasibility of real-time VVC applications.
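The exact definitions of Thm and Thd are specific to the thesis's models; the sketch below only illustrates the general idea of a threshold-gated fast decision, with an RFR stage re-processing CNN-style features and falling back to the full RD search when the prediction is not trusted. The training data, feature layout, helper names, and the interpretation of the two thresholds here are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins for CNN feature vectors and per-CU split labels.
rng = np.random.default_rng(0)
cnn_features = rng.normal(size=(1000, 32))      # (N, D) feature vectors, one per CU
split_scores = rng.uniform(0, 5, size=1000)     # soft score over the six split modes

rfr = RandomForestRegressor(n_estimators=100, random_state=0)
rfr.fit(cnn_features, split_scores)

def fast_partition_decision(feature, cu_depth, full_rd_search, th_m=0.125, th_d=8):
    """Trust the RFR prediction only when it is confident and the CU is shallow enough."""
    score = rfr.predict(feature.reshape(1, -1))[0]
    confident = abs(score - round(score)) <= th_m   # prediction close to a mode index
    if confident and cu_depth <= th_d:
        return int(round(score))                    # fast decision: predicted mode index
    return full_rd_search()                         # uncertain: run the exhaustive search
```

Loosening the gate (larger th_m, larger th_d) makes more CUs take the fast path, trading a small BDBR increase for a larger encoding-time reduction, which is the adjustable trade-off reported above.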