
Coarse-to-fine Deep Video Coding with Hyperprior-guided Mode Prediction
Zhihao Hu, Guo Lu, Jinyang Guo, Shan Liu, Wei Jiang and Dong Xu. CVPR 2022.

Video compression systems are becoming more and more important for various practical applications due to the rapidly increasing demand for transmitting and storing huge amounts of video. Traditional codecs rely on the classical prediction-transform architecture and hand-crafted techniques. While the conventional methods like H.264[wiegand2003overview], H.265[sullivan2012overview] and the recent standard H.266[sullivan2020versatile] have achieved promising results based on different hand-crafted techniques, they cannot be end-to-end optimized by using large-scale video datasets. DVC, the first end-to-end deep video compression model that jointly optimizes all the components for video compression, showed that a learned codec can outperform the widely used video coding standard H.264 in terms of PSNR and even be on par with H.265 in terms of MS-SSIM. However, the previous deep video compression approaches only use the single-scale motion compensation strategy and rarely adopt the mode prediction technique from the traditional standards like H.264/H.265 for either motion or residual compression.

On the learned image compression side, Balle et al., for example, introduced the hyperprior entropy model, although such models have limitations in modeling long-term dependency and do not fully squeeze out the spatial redundancy in images. For video, an end-to-end learned compression scheme for low-latency scenarios introduces the usage of the previous multiple frames as references and designs an MV refinement network and a residual refinement network that also make use of the multiple reference frames. Another inter-frame compression approach for neural video coding can seamlessly build upon different existing neural image codecs and proposes to compute residuals directly in latent space instead of in pixel space, so that the same image compression network is reused for both key frames and intermediate frames. Recent work also presents the first inter-frame neural video decoder running on a commercial mobile phone, decompressing high-definition videos in real time at a low bitrate with visual quality comparable to conventional codecs, and other work reveals that a recent bit allocation approach claimed to be optimal is in fact sub-optimal due to its implementation, designing an approximation that improves bit allocation in terms of R-D performance and bitrate error. On the traditional-codec side, fast mode decision has long been studied, e.g., a layer-adaptive intra/inter mode decision algorithm and motion search scheme for the hierarchical B-frames in H.264/MPEG scalable video coding (SVC) with combined coarse-grain quality scalability (CGS) and temporal scalability, as well as heuristic-learning methods that speed up VVC intra-frame mode selection by first classifying the texture complexity of each CU as flat or non-flat.

Our main contributions are threefold. (1) We propose a coarse-to-fine (C2F) deep video compression framework for better motion compensation, in which we perform motion estimation, compression and compensation twice, in a coarse-to-fine manner. (2) We propose two hyperprior-guided mode prediction methods, in which we learn two mode prediction networks by using discriminant hyperprior information as the input. (3) Comprehensive experiments on the HEVC, UVG and MCL-JCV datasets demonstrate that our C2F framework equipped with the newly proposed hyperprior-guided mode prediction methods achieves comparable video compression performance with H.265(HM)[HM] in terms of PSNR and generally outperforms the latest standard VTM[VTM] in terms of MS-SSIM.

Following FVC[hu2021fvc], we use the deformable convolution[Dai2017DeformableCN] operation for feature-space motion compensation, which takes the reconstructed offset map as the input to control the sampling locations in the reference feature map. The network structure of the fine-level modules is the same as that in FVC[hu2021fvc], except that we adopt the newly proposed hyperprior-guided adaptive motion compression module (see Section 3.3 for more details), in which we learn a prediction network based on hyperprior information to decide the optimal block resolution for better motion coding. Our C2F framework can achieve better motion compensation results without significantly increasing bit costs.
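To make the feature-space warping concrete, below is a minimal sketch of deformable-convolution-based motion compensation using torchvision's deform_conv2d. This is an illustration under assumed shapes; the module name and channel sizes are hypothetical and this is not the authors' implementation.

import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class FeatureMotionCompensation(nn.Module):
    # Warps a reference feature map with a decoded offset map via
    # deformable convolution (illustrative module, not the paper's code).
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.weight = nn.Parameter(
            torch.randn(channels, channels, kernel_size, kernel_size) * 0.01)

    def forward(self, ref_feat, offset):
        # ref_feat: (N, C, H, W) reference feature map
        # offset:   (N, 2*k*k, H, W) one (dy, dx) pair per kernel sampling
        #           location, reconstructed from the decoded motion bitstream
        return deform_conv2d(ref_feat, offset, self.weight, padding=self.k // 2)

# toy usage: zero offsets reduce to a plain 3x3 convolution
feat = torch.randn(1, 64, 32, 32)
offset = torch.zeros(1, 2 * 3 * 3, 32, 32)
print(FeatureMotionCompensation()(feat, offset).shape)  # (1, 64, 32, 32)

In the C2F framework this kind of warping is applied twice, once with the coarse-level offset map and once with the fine-level offset map that refines it.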
To further improve the video compression performance, we also propose two efficient mode prediction methods for both motion compression and residual compression, which are motivated by the success of the rate-distortion (RD) optimization based mode prediction methods in the traditional codecs[wiegand2003overview, sullivan2012overview, sullivan2020versatile] and the recent work[hu2020improving] (see Section 2.1 for more details). Mode selection is pervasive in traditional codecs. For example, in the horizontal intra prediction mode, the prediction is formed by copying the samples immediately to the left of the block across the rows of the block. Versatile Video Coding (VVC) is the latest video coding standard developed by the Joint Video Exploration Team (JVET); in VVC, the quadtree plus multi-type tree (QT+MTT) structure of coding unit (CU) partition is adopted, and its computational complexity is considerably high due to the brute-force search for recursive rate-distortion (RD) optimization.

To predict the optimal mode without such a brute-force search, we propose a mode prediction network to automatically decide the resolution for each block based on hyperprior information, which represents the statistical information of each block. Our proposed mode prediction network can be readily used to adaptively select the optimal resolution of each block in motion compression or to decide whether to skip the residual information from each block in residual compression. For better illustration, below we assume the size of the encoded motion feature is 4x4. Taking subblock A (i.e., the top-left 2x2 subblock) in Fig. 4 as an example, we first perform the mode-guided avgpooling operation to average pool the four values "3, 4, 4, 5" in the top-left 2x2 subblock into only one value, "4", which is then quantized and transmitted as the bitstream by using the arithmetic coding (AC) operation.
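The following sketch reproduces that toy example, assuming a binary per-block mode map (1 means the 2x2 block is coded at low resolution, 0 means it keeps full resolution). The function name and mask convention are illustrative assumptions, not the paper's code.

import torch
import torch.nn.functional as F

def mode_guided_avgpool(motion_feat, low_res_mask):
    # motion_feat:  (N, C, H, W) encoded motion feature
    # low_res_mask: (N, 1, H//2, W//2), 1 where a 2x2 block is coded at
    #               low resolution (one averaged value per block), 0 elsewhere
    pooled = F.avg_pool2d(motion_feat, kernel_size=2)
    pooled_up = F.interpolate(pooled, scale_factor=2, mode="nearest")
    mask_up = F.interpolate(low_res_mask, scale_factor=2, mode="nearest")
    return mask_up * pooled_up + (1.0 - mask_up) * motion_feat

# the 4x4 example from the text: subblock A holds the values 3, 4, 4, 5
feat = torch.tensor([[3., 4., 1., 0.],
                     [4., 5., 1., 0.],
                     [2., 2., 6., 6.],
                     [2., 2., 6., 6.]]).view(1, 1, 4, 4)
mask = torch.tensor([[1., 0.],
                     [0., 0.]]).view(1, 1, 2, 2)
print(mode_guided_avgpool(feat, mask)[0, 0])
# subblock A is replaced by four copies of its average, 4

Only one value per low-resolution block then needs to be quantized and arithmetic coded, which is how the predicted mode saves motion bits.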
Similarly, for residual compression, the residual between the input feature $F_t$ and the final predicted feature $\bar{F}_t$ is denoted as the residual feature $R_t$, which is compressed by the hyperprior-guided adaptive residual compression module (see Section 3.3 for more details). In this module, based on hyperprior information, we also learn a prediction network to predict the skip/non-skip mode for better encoding of the residual features: it takes the hyperprior information as the input and predicts the skip/non-skip mode for each entry at each channel of the encoded residual feature. Our hyperprior-guided mode prediction methods do not introduce any additional bit cost, bring negligible computational cost, and can be readily used to predict the optimal coding modes (i.e., the optimal block resolution for motion coding and the skip/non-skip mode for residual compression).
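A minimal sketch of such a skip-mode predictor is shown below, assuming the hyperprior features are already decoded to the resolution of the residual feature. The two-convolution architecture and the channel sizes are assumptions made for illustration; the paper's exact layer configuration is not reproduced here.

import torch
import torch.nn as nn

class SkipModePredictor(nn.Module):
    # Predicts a per-entry, per-channel skip/non-skip decision for the
    # encoded residual feature from hyperprior information.
    def __init__(self, hyper_channels=128, residual_channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(hyper_channels, residual_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(residual_channels, residual_channels, 3, padding=1))

    def forward(self, hyperprior, residual_feat):
        keep = (torch.sigmoid(self.net(hyperprior)) > 0.5).float()  # 1 = non-skip
        # skipped entries are zeroed and need not be entropy coded; because the
        # decoder sees the same hyperprior, no extra bits signal the mode itself
        return residual_feat * keep

hyper = torch.randn(1, 128, 16, 16)
resid = torch.randn(1, 128, 16, 16)
print(SkipModePredictor()(hyper, resid).shape)  # (1, 128, 16, 16)

During training, the hard threshold would need a differentiable surrogate (e.g., a straight-through estimator); that detail is omitted in this sketch.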
Implementation details. Our implementation is based on PyTorch with CUDA support, and we use the Adam optimizer[kingma2014adam]. The initial learning rate is set as 5e-5 and is decreased by 80% at the 1,900,000th step and again at the 2,400,000th step. During the training process, we use the bit-rate estimation network to predict the bit-rate. When using MS-SSIM for performance evaluation, our model is further fine-tuned by using the MS-SSIM loss as the distortion loss, and this fine-tuned model is the one reported in the MS-SSIM comparisons. In order to minimize the mismatch between the short training sequences and the long testing sequences, we also follow the conventional methods and use a model with a larger trade-off weight at the fourth frame of each four consecutive P frames to reduce the cumulative error.
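As a concrete (and simplified) reading of this training recipe, the sketch below wires up a rate-distortion loss of the form lambda*D + R with the stated learning-rate schedule. The codec module, the lambda value and the data are placeholders; only the optimizer settings come from the text.

import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

class DummyCodec(nn.Module):
    # Stand-in for the full C2F network: returns a reconstruction and an
    # estimated bit cost (the real model uses a bit-rate estimation network).
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, frame, ref):
        recon = self.conv(frame - ref) + ref
        est_bits = recon.abs().mean()  # placeholder bit-rate estimate
        return recon, est_bits

model = DummyCodec()
lmbda = 256.0  # R-D trade-off weight (placeholder value)
optimizer = Adam(model.parameters(), lr=5e-5)
# "decreased by 80%" means the learning rate is multiplied by 0.2; the
# scheduler is stepped once per iteration, so milestones are step counts
scheduler = MultiStepLR(optimizer, milestones=[1_900_000, 2_400_000], gamma=0.2)

frame, ref = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
recon, est_bits = model(frame, ref)
loss = lmbda * torch.mean((recon - frame) ** 2) + est_bits  # lambda*D + R
optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()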
Experiments. We evaluate our performance on multiple datasets, including the HEVC[sullivan2012overview] Class B, C, D and E, UVG[UVGdataset] and MCL-JCV[wang2016mcl] datasets. The HEVC standard datasets[sullivan2012overview] contain different types of video sequences at various resolutions, including 1920x1080 (Class B), 832x480 (Class C), 416x240 (Class D) and 1280x720 (Class E). The UVG dataset[UVGdataset] contains seven 1080p video sequences with high frame rates, and the MCL-JCV dataset[wang2016mcl] contains thirty 1080p video sequences; both are widely used for evaluating learning-based video codecs. PSNR and MS-SSIM[wang2003multiscale] are used to evaluate the video quality. The versions of H.265(HM) and VTM are 16.20 and 11.2, respectively. To evaluate the effectiveness of our proposed method, we compare it with state-of-the-art learning-based methods, including the work of Agustsson et al. Based on the same setting as our proposed method, we further provide the re-implementation results of FVC[hu2021fvc] without the multi-frame feature fusion module, which is denoted as FVC(re-imp) and used as our baseline method. As shown in Fig. S3, we also evaluate our method under the settings of the compared methods.

The experimental results on the UVG, MCL-JCV, HEVC Class B, Class C, Class D and Class E datasets demonstrate that our C2F framework equipped with the new hyperprior-guided mode prediction methods achieves state-of-the-art performance. Although the performance of VTM is better than our method in terms of PSNR, we observe that our results are close to VTM at high bit-rates on all high-resolution datasets (i.e., UVG, MCL-JCV, HEVC Class B and Class E). When compared with the recently proposed ELF-VC on the UVG dataset, our proposed method achieves a 0.5dB improvement at 0.1bpp. Besides, our method runs at 3.41fps, which is about 3000x faster than VTM. In addition, we observe that the coarse-to-fine strategy improves more at higher bit-rates, while HAMC achieves more improvement at lower bit-rates. Qualitatively, the existing methods, including the conventional codec H.265(HM) and the learning-based methods DVC and FVC(re-imp), sometimes produce wrong colors or lose details, which leads to worse reconstruction quality; as shown in Fig. S4, the two lines produced by H.265(HM) have the wrong color and the first reconstructed line (i.e., the upper one) from DVC and FVC(re-imp) is also blurred, while our method better reconstructs both lines with the right color.

[Figure: visualization of coarse-to-fine motion compensation, showing (a) the input frame, (b) the offset map from the baseline FVC(re-imp) using single-scale motion compensation, and (c) the coarse-level and (d) the fine-level offset maps from our coarse-to-fine framework on the HEVC Class B dataset, together with (e) one ground-truth patch and its motion compensation results by using (f) the baseline FVC(re-imp) and (g) our C2F framework.]
[Figure: visualization of the predicted modes for the first P frame (the input frame and the reconstructed residual are shown in (a) and (b)) of the 3rd video from the HEVC Class E dataset by using our proposed methods HAMC (c) and HARC (d).]

Considering that the mode selection strategy is widely used in the conventional codecs like H.265, our work opens a new door for subsequent researchers to use or extend our hyperprior-guided method to decide other types of optimal modes for better video compression performance.
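For reference, the two quality metrics reported above can be computed as in the following sketch. It assumes the third-party pytorch_msssim package, which the paper does not necessarily use.

import torch
from pytorch_msssim import ms_ssim  # third-party package (assumed available)

def psnr(x, y, max_val=1.0):
    # PSNR in dB between two images with values in [0, max_val]
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

ref = torch.rand(1, 3, 256, 256)
dist = (ref + 0.02 * torch.randn_like(ref)).clamp(0.0, 1.0)
print(f"PSNR:    {psnr(ref, dist):.2f} dB")
print(f"MS-SSIM: {ms_ssim(ref, dist, data_range=1.0).item():.4f}")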
Reference: Z. Hu, G. Lu, J. Guo, S. Liu, W. Jiang and D. Xu, "Coarse-to-fine Deep Video Coding with Hyperprior-guided Mode Prediction," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

@inproceedings{hu2022coarse,
  author    = {Hu, Zhihao and Lu, Guo and Guo, Jinyang and Liu, Shan and Jiang, Wei and Xu, Dong},
  title     = {Coarse-to-fine Deep Video Coding with Hyperprior-guided Mode Prediction},
  booktitle = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2022},
}

