WO2005009046A1 - Interframe wavelet video coding method - Google Patents

Interframe wavelet video coding method

Info

Publication number
WO2005009046A1
Authority
WO
WIPO (PCT)
Prior art keywords
frames
average
video coding
frame
group
Prior art date
Application number
PCT/KR2004/001666
Other languages
French (fr)
Inventor
Chang-Hoon Yim
Ho-Jin Ha
Bae-Keun Lee
Woo-Jin Han
Original Assignee
Samsung Electronics Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Publication of WO2005009046A1 publication Critical patent/WO2005009046A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/53Multi-resolution motion estimation; Hierarchical motion estimation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/63Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using sub-band based transform, e.g. wavelets
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • H04N19/615Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding using motion compensated temporal filtering [MCTF]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/13Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]

Abstract

An interframe wavelet video coding (IWVC) method by which an average temporal distance (ATD) is minimized is provided. The IWVC method comprises receiving a group-of-frames and decomposing the group-of-frames into first difference frames and first average frames between the frames in a first forward temporal direction and a first backward temporal direction, wavelet-decomposing the first difference frames and the first average frames, and quantizing coefficients resulting from the wavelet-decomposition to generate a bitstream. The IWVC method provides improved video coding performance.

Description

INTERFRAME WAVELET VIDEO CODING METHOD

Technical Field
[1] The present invention relates to a wavelet video coding method, and more particularly, to an interframe wavelet video coding (IWVC) method in which an average temporal distance is reduced by changing a temporal filtering direction. Background Art
[2] With the development of information communication technology including the Internet, video communication as well as text and voice communication has increased. Conventional text communication cannot satisfy the various demands of users, and thus multimedia services that can provide various types of information such as text, pictures, and music have increased. Multimedia data requires a large capacity storage medium and a wide bandwidth for transmission since the amount of multimedia data is usually large. For example, a 24-bit true color image having a resolution of 640 * 480 needs a capacity of 640 * 480 * 24 bits, i.e., data of about 7.37 Mbits, per frame. When this image is transmitted at a speed of 30 frames per second, a bandwidth of 221 Mbits/sec is required. When a 90-minute movie based on such an image is stored, a storage space of about 1200 Gbits is required. Accordingly, a compression coding method is a requisite for transmitting multimedia data including text, video, and audio.
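These figures follow directly from the stated parameters. The short Python check below reproduces them; the 90-minute duration is an assumption, chosen because it is consistent with the stated total of about 1200 Gbits.

    # Storage and bandwidth arithmetic for 640x480, 24-bit color, 30 fps video.
    frame_bits = 640 * 480 * 24            # 7,372,800 bits, about 7.37 Mbits per frame
    bandwidth_bps = frame_bits * 30        # bits per second at 30 frames/sec
    movie_bits = bandwidth_bps * 90 * 60   # a 90-minute movie (assumed duration)

    print(frame_bits / 1e6)      # 7.37   (Mbits per frame)
    print(bandwidth_bps / 1e6)   # 221.18 (Mbits/sec)
    print(movie_bits / 1e9)      # 1194.4 (Gbits), i.e., about 1200 Gbits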
[3] A basic principle of data compression is removing data redundancy. Data can be compressed by removing spatial redundancy in which the same color or object is repeated in an image, temporal redundancy in which there is little change between adjacent frames in a moving image or the same sound is repeated in audio, or mental visual redundancy taking into account human eyesight and limited perception of high frequency. Data compression can be classified into lossy/lossless compression according to whether source data is lost, intraframe/interframe compression according to whether individual frames are compressed independently, and symmetric/asymmetric compression according to whether the time required for compression is the same as the time required for recovery. Data compression is defined as real-time compression when a compression/recovery time delay does not exceed 50 ms and as scalable compression when frames have different resolutions. For text or medical data, lossless compression is usually used. For multimedia data, lossy compression is usually used. Meanwhile, intraframe compression is usually used to remove spatial redundancy, and interframe compression is usually used to remove temporal redundancy.
[4] Different types of transmission media for multimedia have different performance. Currently used transmission media have various transmission rates. For example, an ultrahigh-speed communication network can transmit data of several tens of megabits per second while a mobile communication network has a transmission rate of 384 kilobits per second. In conventional video coding methods such as Motion Picture Experts Group (MPEG)-1, MPEG-2, H.263, and H.264, temporal redundancy is removed by temporal prediction based on motion estimation and compensation, and spatial redundancy is removed by transform coding. These methods have satisfactory compression rates, but they do not have the flexibility of a truly scalable bitstream. Accordingly, to support transmission media having various speeds or to transmit multimedia at a data rate suitable to a transmission environment, data coding methods having scalability, such as wavelet video coding and subband video coding, may be suitable to a multimedia environment. For example, Interframe Wavelet Video Coding (IWVC) can provide a very flexible, scalable bitstream.

Disclosure of Invention

Technical Problem
[5] However, conventional IWVC has lower performance than a coding method such as H.264. Due to this low performance, IWVC is used only for very limited applications although it has excellent scalability. Accordingly, it has been an issue to improve the performance of data coding methods having scalability.

Technical Solution
[6] The present invention provides a scalable data coding method which provides improved performance by reducing a total temporal distance for motion estimation.
[7] According to an aspect of the present invention, there is provided an interframe wavelet video coding method comprising receiving a group-of-frames and decomposing the group-of-frames into first difference frames and first average frames between the frames in a first forward temporal direction and a first backward temporal direction, wavelet-decomposing the first difference frames and the first average frames, and quantizing coefficients resulting from the wavelet-decomposition to generate a bitstream. Preferably, the interframe wavelet video coding method may further comprise obtaining a motion vector between frames and compensating for a temporal motion using the motion vector before decomposing the group-of-frames into the first difference and average frames. Also, the first forward temporal direction and the first backward temporal direction are preferably combined such that an average of temporal distances between frames in the group-of-frames is minimized.
[8] The decomposing the group-of-frames into the first difference and average frames may comprise (a) decomposing the group-of-frames into a first difference frame and a first average frame between two frames in the first forward temporal direction; and (b) decomposing the group-of-frames into another first difference frame and another first average frame between other two frames in the first backward temporal direction. The steps (a) and (b) may be alternately performed with respect to the frames in the group-of-frames. Meanwhile, the decomposing the group-of-frames into the first difference and average frames may further comprise decomposing the first average frames into a second difference frame and a second average frame between two first average frames in either a second forward temporal direction or a second backward temporal direction. Here, the decomposing the first average frames into the second difference and average frames may be repeated a plurality of times. The second forward temporal direction and the second backward temporal direction may be combined such that an average of temporal distances between frames in the group-of-frames is minimized.
[9] The decomposing the first average frames into the second difference and average frames may comprise (c) decomposing the first average frames into a second difference frame and a second average frame between two first average frames in the second forward temporal direction, and (d) decomposing the group-of-frames into another second difference frame and another second average frame between other two first average frames in the second backward temporal direction. The steps (c) and (d) may be alternately performed with respect to the first average frames.

Description of Drawings
[10] The above and other features and advantages of the present invention will become more apparent by describing in detail preferred embodiments thereof with reference to the attached drawings in which:
[11] FIG. 1 is a block diagram of an encoder performing an interframe wavelet video coding (IWVC) method;
[12] FIG. 2 illustrates directions of motion estimation in a conventional IWVC;
[13] FIGS. 3 and 4 illustrate directions of motion estimation in IWVC according to a first embodiment of the present invention;
[14] FIGS. 5 and 6 illustrate directions of motion estimation in IWVC according to a second embodiment of the present invention;
[15] FIG. 7 illustrates directions of motion estimation in IWVC according to a third embodiment of the present invention;
[16] FIG. 8 illustrates directions of motion estimation in IWVC according to a fourth embodiment of the present invention;
[17] FIG. 9 is a graph comparing Peak Signal to Noise Ratios (PSNRs) with respect to a 'Canoe' sequence between a conventional IWVC method and embodiments of the present invention;
[18] FIG. 10 is a graph comparing PSNRs with respect to a 'Bus' sequence between a conventional IWVC method and embodiments of the present invention; and
[19] FIG. 11 is a graph comparing changes in PSNRs with respect to a 'Canoe' sequence between a conventional IWVC method and embodiments of the present invention.

Mode for Invention
[20] A preferred embodiment of the present invention will now be described in detail with reference to the accompanying drawings.
[21] FIG. 1 is a block diagram of an encoder performing an interframe wavelet video coding (IWVC) method.
[22] The encoder performing an IWVC method includes a motion estimation block 10 which obtains a motion vector, a motion compensation temporal filtering block 40 which removes temporal redundancy using the motion vector, a spatial wavelet decomposition block 50 which removes spatial redundancy, a motion vector encoding block 20 which encodes the motion vector using a predetermined algorithm, a quantization block 60 which quantizes wavelet coefficients of respective components generated by the spatial wavelet decomposition block 50, and a buffer 30 which temporarily stores an encoded bitstream received from the quantization block 60.
[23] The motion estimation block 10 obtains a motion vector used by the motion compensation temporal filtering block 40 using a hierarchical method such as Hierarchical Variable Size Block Matching (HVSBM).
[24] The motion compensation temporal filtering block 40 decomposes frames into low- and high-frequency frames in a temporal direction using the motion vector obtained by the motion estimation block 10. In more detail, an average of two frames is defined as a low-frequency component, and half of a difference between the two frames is defined as a high-frequency component. Frames are decomposed in Group-of-Frames (GOF) units. Through such decomposition, temporal redundancy is removed. Decomposition into high- and low-frequency frames may be performed using only a pair of frames without using a motion vector. However, decomposition using a motion vector shows better performance than decomposition using only a pair of frames. For example, where a portion of a first frame has moved in a second frame, the amount of the motion can be represented by a motion vector. The portion of the first frame is compared with the portion of the second frame to which it is mapped by the motion vector, and the temporal motion is thereby compensated. Thereafter, the first and second frames are decomposed into low- and high-frequency frames.
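The averaging and differencing described above amounts to a Haar-style temporal transform on each frame pair. The NumPy sketch below illustrates only that step; the motion-compensated alignment that would precede it in the actual encoder is omitted, and the function names are illustrative rather than taken from the patent.

    import numpy as np

    def temporal_filter_pair(frame_a, frame_b):
        # Average of the pair -> temporal low-frequency frame;
        # half of the difference -> temporal high-frequency frame.
        # In the encoder of FIG. 1, the pair would first be aligned
        # using the motion vectors from block 10 (motion compensation).
        a = frame_a.astype(np.float64)
        b = frame_b.astype(np.float64)
        low = (a + b) / 2.0
        high = (a - b) / 2.0
        return low, high

    def inverse_temporal_filter_pair(low, high):
        # Perfect reconstruction of the original pair.
        return low + high, low - high

Because the transform is invertible, a decoder recovers the pair exactly from the two sub-band frames; the compression gain comes from the high-frequency frame being close to zero when adjacent frames are well aligned.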
[25] The spatial wavelet decomposition block 50 wavelet-decomposes frames that have been decomposed in the temporal direction by the motion compensation temporal filtering block 40 into spatial low- and high-frequency components, thereby removing spatial redundancy.
[26] The motion vector encoding block 20 encodes a motion vector hierarchically obtained by the motion estimation block 10 such that the motion vector has an optimal number of bits using a rate-distortion algorithm, and then transmits the encoded motion vector to the buffer 30. The quantization block 60 quantizes and encodes the wavelet coefficients of the components generated by the spatial wavelet decomposition block 50. The resulting encoded bitstream is scalable. The buffer 30 stores the encoded bitstream before transmission and is controlled by a rate control algorithm.
[27] FIG. 2 illustrates directions of motion estimation in conventional IWVC.
[28] In FIG. 2, a single GOF includes 16 frames. Two adjacent frames in a pair are replaced by a high-frequency frame and a low-frequency frame. In the conventional IWVC, motion estimation is performed only in a single direction, i.e., in a forward direction.
[29] For example, at level 0, motion estimation between frames 1 and 2 is performed in a direction from the frame 1 to the frame 2. Thereafter, a temporal high-frequency sub-band frame H1 is positioned at the frame 1, and a temporal low-frequency sub-band frame L2 is positioned at the frame 2. In this case, the temporal low-frequency sub-band L2 at level 1 is similar to the frame 2 at the level 0, and the temporal high-frequency sub-band H1 is similar to an edge image of the frame 1 at the level 0. As such, pairs of frames 1 and 2, 3 and 4, 5 and 6, 7 and 8, 9 and 10, 11 and 12, 13 and 14, and 15 and 16 at the level 0 are replaced by pairs of sub-band frames H1 and L2, H3 and L4, H5 and L6, H7 and L8, H9 and L10, H11 and L12, H13 and L14, and H15 and L16 which form frames at the level 1.
[30] Temporal low-frequency sub-band frames at the level 1 are decomposed into temporal low-frequency sub-band frames and temporal high-frequency sub-band frames at level 2. For example, for temporal decomposition, motion estimation is performed in a direction from the frame L2 to the frame L4. As a result, at the level 2, a temporal high-frequency sub-band frame LH2 is positioned at a position of the frame L2, and a temporal low-frequency sub-band frame LL4 is positioned at a position of the frame L4. Similarly, the frame LH2 is similar to an edge image of the frame L2, and the frame LL4 is similar to the frame L4. As such, frames L2, L4, L6, L8, L10, L12, L14, and L16 at the level 1 are replaced by frames LH2, LL4, LH6, LL8, LH10, LL12, LH14, and LL16 at the level 2.
[31] In the same manner as described above, the temporal low-frequency sub-band frames LL4, LL8, LL12 and LL16 at the level 2 are replaced by temporal high- and low-frequency sub-band frames LLH4, LLL8, LLH12, and LLL16 at level 3. The temporal low-frequency sub-band frames LLL8 and LLL16 at the level 3 are finally replaced by temporal high- and low-frequency sub-band frames LLLH8 and LLLL16 at level 4.
[32] In FIG. 2, shaded squares represent temporal high-frequency sub-band frames, and non-shaded squares represent temporal low-frequency sub-band frames. Consequently, the frames 1 through 16 at the level 0 are decomposed into five types of temporal sub-bands through temporal filtering from the level 0 to the level 4. This decomposition results in:
[33] one LLLL frame: LLLL16;
[34] one LLLH frame: LLLH8;
[35] two LLH frames: LLH4 and LLH12;
[36] four LH frames: LH2, LH6, LH10, and LH14; and
[37] eight H frames: H1, H3, H5, H7, H9, H11, H13, and H15.
[38] Where a single GOF includes eight frames, the eight frames are finally decomposed into four types of temporal sub-bands through temporal filtering from level 0 to level 3. Where a single GOF includes 32 frames, the 32 frames are finally decomposed into six types of temporal sub-bands through temporal filtering from level 0 to level 5.
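The number of temporal sub-band types grows logarithmically with the GOF size: a dyadic GOF of 2**k frames yields k high-frequency types plus one final low-frequency type. A small Python sketch of this relation (the function name is illustrative):

    import math

    def temporal_subband_types(gof_size):
        # A GOF of 2**k frames is temporally filtered through k stages,
        # producing k high-frequency sub-band types plus one final
        # low-frequency type: k + 1 in total.
        k = int(math.log2(gof_size))
        return k + 1

    assert temporal_subband_types(8) == 4    # H, LH, LLH, LLL
    assert temporal_subband_types(16) == 5   # H, LH, LLH, LLLH, LLLL
    assert temporal_subband_types(32) == 6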
[39] The present invention provides a scalable data coding method in which performance is improved by reducing a total temporal distance for motion estimation. To quantitatively calculate the total temporal distance, an average temporal distance (ATD) is defined. To calculate the ATD, a temporal distance is calculated first. The temporal distance is defined as a positional difference between two frames. For example, a temporal distance between the frame 1 and the frame 2 is defined as 1, and a temporal distance between the frame L2 and the frame L4 is defined as 2. The ATD is obtained by dividing the sum of temporal distances between frames in pairs, which are subjected to an operation for motion estimation, by the number of the pairs of the frames.
[40] Referring to FIG. 2, a temporal distance for motion estimation increases as the level increases. Where motion estimation is performed between the frames 1 and 2 at the level 0, a temporal distance is calculated as 2-1=1. Similarly, a temporal distance for motion estimation at the level 1 is 2, a temporal distance for motion estimation at the level 2 is 4, and a temporal distance for motion estimation at the level 3 is 8. In FIG. 2, 8, 4, 2, and 1 pairs of frames for motion estimation exist at the levels 0, 1, 2, and 3, respectively. Accordingly, the total number of pairs of frames used for motion estimation is 15. This is arranged in Table 1.
[41] Table 1: Number of pairs of frames and temporal distance for motion estimation at each level in conventional IWVC

              Number of pairs of frames     Temporal distance for
              for motion estimation         motion estimation
    Level 0   8                             1
    Level 1   4                             2
    Level 2   2                             4
    Level 3   1                             8
[42] As the temporal distance increases, the size of a motion vector also increases. In particular, this phenomenon appears strongly in a video sequence having fast motions. In the conventional IWVC shown in FIG. 2, as the level increases, the temporal distance also increases. A large temporal distance at a high level may cause the coding efficiency of the conventional IWVC to decrease. The ATD is calculated in the conventional IWVC as follows:

[43] ATD = (8 x 1 + 4 x 2 + 2 x 4 + 1 x 8) / 15 = 32 / 15 = 2.13
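A small helper makes the definition concrete. The per-level pair counts and temporal distances below are those of Table 1; the function name is illustrative.

    def average_temporal_distance(levels):
        # levels: list of (number_of_pairs, temporal_distance), one per level.
        # ATD = sum of all pairwise temporal distances / total number of pairs.
        total = sum(pairs * dist for pairs, dist in levels)
        count = sum(pairs for pairs, _ in levels)
        return total / count

    conventional_iwvc = [(8, 1), (4, 2), (2, 4), (1, 8)]   # Table 1
    print(round(average_temporal_distance(conventional_iwvc), 2))   # 2.13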
[44] FIGS. 3 through 8 illustrate different directions of motion estimation in IWVC according to different embodiments of the present invention. Hereinafter, an IWVC method having directions of motion estimation shown in FIGS. 3 and 4 is referred to as Method1. An IWVC method having directions of motion estimation shown in FIGS. 5 and 6 is referred to as Method2. An IWVC method having directions of motion estimation shown in FIG. 7 is referred to as Method3, and an IWVC method having directions of motion estimation shown in FIG. 8 is referred to as Method4. Since Method1 and Method2 provide the minimum ATD, they will be described in more detail by dividing each method into two modes according to whether the direction of motion estimation at the level 3 is a forward direction or a backward direction. In other words, Method1 is divided into Method1-a and Method1-b, and Method2 is divided into Method2-a and Method2-b. In FIGS. 3 through 8, solid lines denote forward motion estimation, and dotted lines denote backward motion estimation.
[45] Referring to FIGS. 3 and 4, in Method1, both forward motion estimation and backward motion estimation are present at level 0. Motion estimation between frames 1 and 2 is performed in a forward direction from the frame 1 to the frame 2. A temporal high-frequency sub-band frame H1 is positioned at the frame 1, and a temporal low-frequency sub-band frame L2 is positioned at the frame 2. However, motion estimation on the subsequent two frames is different. Motion estimation between frames 3 and 4 is performed in a backward direction from the frame 4 to the frame 3. A temporal high-frequency sub-band frame H4 is positioned at the frame 4, and a temporal low-frequency sub-band frame L3 is positioned at the frame 3.
[46] At the level 1, motion estimation is performed between the frames L2 and L3. As such, while a temporal distance for motion estimation at the level 1 is 2 in the conventional IWVC method, a temporal distance for motion estimation at the level 1 is 1 in Method1 shown in FIGS. 3 and 4. In other words, when motion estimation is performed in both the forward and backward directions at the level 0, the temporal distance for motion estimation can be reduced to 1 at the level 1. All of the directions of motion estimation except for the directions at level 3 are the same between Method1-a and Method1-b. As shown in FIGS. 3 and 4, LLLL frames are positioned at the positions of frames 10 and 7 in Method1-a and Method1-b, respectively.
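To see why the level-1 distance drops to 1, consider where the low-frequency frames land when the level-0 directions alternate between forward and backward, as described above for FIGS. 3 and 4. This is a sketch of the bookkeeping only:

    # Level 0: pairs (1,2), (3,4), (5,6), ... filtered with directions
    # F, B, F, B, ...  A forward pair leaves its L frame on the later
    # frame; a backward pair leaves its L frame on the earlier frame.
    pairs = [(2 * i + 1, 2 * i + 2) for i in range(8)]   # frames 1..16
    l_positions = [b if i % 2 == 0 else a for i, (a, b) in enumerate(pairs)]
    print(l_positions)   # [2, 3, 6, 7, 10, 11, 14, 15]
    # The L frames now sit in adjacent pairs, so the level-1 pairs
    # (L2, L3), (L6, L7), (L10, L11), (L14, L15) each have distance 1.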
[47] Directions of motion estimation at the level 0 are the same between Method1 and Method2, but directions of motion estimation at the level 1 are different between Method1 and Method2. In Method1, forward motion estimation is performed between frames L6 and L7, and backward motion estimation is performed between frames L10 and L11. Conversely, in Method2, backward motion estimation is performed between the frames L6 and L7, and forward motion estimation is performed between the frames L10 and L11. All of the directions of motion estimation except for the directions at level 3 are the same between Method2-a and Method2-b. As shown in FIGS. 5 and 6, LLLL frames are positioned at the positions of frames 11 and 6 in Method2-a and Method2-b, respectively.
[48] The numbers of pairs of frames used for motion estimation and the temporal distances in Method1 and Method2 are shown in Tables 2 and 3.

[49] Table 2: Number of pairs of frames and temporal distance for motion estimation at each level in Method1

              Number of pairs of frames     Temporal distance for
              for motion estimation         motion estimation
    Level 0   8                             1
    Level 1   4                             1
    Level 2   2                             4
    Level 3   1                             3

[50] Table 3: Number of pairs of frames and temporal distance for motion estimation at each level in Method2

              Number of pairs of frames     Temporal distance for
              for motion estimation         motion estimation
    Level 0   8                             1
    Level 1   4                             1
    Level 2   2                             3
    Level 3   1                             5
[51] The ATD is calculated in Method1 as follows:

[52] ATD = (8 x 1 + 4 x 1 + 2 x 4 + 1 x 3) / 15 = 23 / 15 = 1.53
[53] The ATD is calculated in Method2 as follows:

[54] ATD = (8 x 1 + 4 x 1 + 2 x 3 + 1 x 5) / 15 = 23 / 15 = 1.53
[55] In Method3 and Method4 shown in FIGS. 7 and 8, the LLLL frame is positioned at the position of the central frame, i.e., frame 8. As compared to Method1 and Method2, Method3 and Method4 provide a larger ATD; the corresponding figures are arranged in Tables 4 and 5.
[56] Table 4: Number of pairs of frames and temporal distance for motion estimation at each level in Method3

              Number of pairs of frames     Temporal distance for
              for motion estimation         motion estimation
    Level 0   8                             1
    Level 1   4                             2
    Level 2   2                             4
    Level 3   1                             2
[57] Table 5: Number of pairs of frames and temporal distance for motion estimation at each level in Method4

              Number of pairs of frames     Temporal distance for
              for motion estimation         motion estimation
    Level 0   8                             1
    Level 1   4                             2
    Level 2   2                             4
    Level 3   1                             1
[58] The ATD is calculated in Method3 as follows:

[59] ATD = (8 x 1 + 4 x 2 + 2 x 4 + 1 x 2) / 15 = 26 / 15 = 1.73
[60] The ATD is calculated in Method4 as follows:

[61] ATD = (8 x 1 + 4 x 2 + 2 x 4 + 1 x 1) / 15 = 25 / 15 = 1.67
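Running the same average_temporal_distance helper over the entries of Tables 2 through 5 reproduces these values. The Method4 level-1 distance of 2 is an inference here; it is the value consistent with the stated ATD of 1.67.

    methods = {
        "Method1": [(8, 1), (4, 1), (2, 4), (1, 3)],   # Table 2
        "Method2": [(8, 1), (4, 1), (2, 3), (1, 5)],   # Table 3
        "Method3": [(8, 1), (4, 2), (2, 4), (1, 2)],   # Table 4
        "Method4": [(8, 1), (4, 2), (2, 4), (1, 1)],   # Table 5
    }
    for name, levels in methods.items():
        print(name, round(average_temporal_distance(levels), 2))
    # Method1 1.53, Method2 1.53, Method3 1.73, Method4 1.67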
[62] The ATDs obtained in Method1 through Method4 are 1.53, 1.53, 1.73, and 1.67, respectively, while the ATD obtained in the conventional IWVC is 2.13. Among Method1 through Method4 shown in FIGS. 3 through 8, Method1 and Method2 provide the least ATD.
[63] The ATD corresponds to a total temporal distance for motion estimation. When the total temporal distance for motion estimation decreases, the total size of the motion vectors also decreases. This characteristic gives higher coding efficiency than the conventional IWVC.
[64] FIG. 9 is a graph comparing peak signal to noise ratios (PSNRs) with respect to a 'Canoe' sequence between the conventional IWVC and the embodiments of the present invention. Method1-a and Method2-a provide almost the same performance and give a higher PSNR than the conventional IWVC by 1.0 through 1.5 dB.
[65] FIG. 10 is a graph comparing PSNRs with respect to a 'Bus' sequence between the conventional IWVC and the embodiments of the present invention. Method1-a and Method2-a give higher PSNRs than the conventional IWVC by 1.0 dB and 1.5 dB, respectively. Method3 and Method4 provide lower performance than Method1-a and Method2-a but provide higher performance than the conventional IWVC.
[66] FIG. 11 is a graph comparing changes in PSNRs with respect to a 'Canoe' sequence between the conventional IWVC and the embodiments of the present invention.
[67] It can be inferred from FIG. 11 that the PSNR is highest at the position of the LLLL frame in a GOF in all of the methods.

Industrial Applicability
[68] According to the present invention, a total interframe temporal distance for motion estimation is reduced in a scalable video coding method using wavelets so that the performance of video coding can be improved.
[69] Although only a few embodiments of the present invention have been shown and described with reference to the attached drawings, it will be understood by those skilled in the art that changes may be made to these embodiments without departing from the features and spirit of the invention. For example, in the above-described embodiments of the present invention, a single GOF includes 16 frames. However, the present invention is not restricted thereto. In addition, the embodiments of the present invention have been described and tested based on IWVC. However, the present invention can be applied to other coding techniques. Therefore, it is to be understood that the above-described embodiments have been provided only in a descriptive sense and are not to be construed as placing any limitation on the scope of the invention.

Claims

[1] An interframe wavelet video coding method comprising: receiving a group-of-frames and decomposing the group-of-frames into first difference frames and first average frames between the frames in a first forward temporal direction and a first backward temporal direction; wavelet-decomposing the first difference frames and the first average frames; and quantizing coefficients resulting from the wavelet-decomposition to generate a bitstream.
[2] The interframe wavelet video coding method of claim 1, further comprising obtaining a motion vector between frames and compensating for a temporal motion using the motion vector before decomposing the group-of-frames into the first difference and average frames.
[3] The interframe wavelet video coding method of claim 1, wherein the first forward temporal direction and the first backward temporal direction are combined such that an average of temporal distances between frames in the group-of-frames is minimized.
[4] The interframe wavelet video coding method of claim 1, wherein decomposing the group-of-frames into the first difference and average frames comprises: (a) decomposing the group-of-frames into a first difference frame and a first average frame between two frames in the first forward temporal direction; and (b) decomposing the group-of-frames into another first difference frame and another first average frame between other two frames in the first backward temporal direction.
[5] The interframe wavelet video coding method of claim 4, wherein steps (a) and (b) are alternately performed with respect to the frames in the group-of-frames.
[6] The interframe wavelet video coding method of claim 5, wherein the decomposing the group-of-frames into the first difference and average frames further comprises decomposing the first average frames into a second difference frame and a second average frame between two first average frames in either a second forward temporal direction or a second backward temporal direction.
[7] The interframe wavelet video coding method of claim 6, wherein decomposing the first average frames into the second difference and average frames is repeated a plurality of times.
[8] The interframe wavelet video coding method of claim 7, wherein the second forward temporal direction and the second backward temporal direction are combined such that an average of temporal distances between frames in the group-of-frames is minimized.
[9] The interframe wavelet video coding method of claim 6, wherein decomposing the first average frames into the second difference and average frames comprises: (c) decomposing the first average frames into a second difference frame and a second average frame between two first average frames in the second forward temporal direction; and (d) decomposing the group-of-frames into another second difference frame and another second average frame between other two first average frames in the second backward temporal direction.
[10] The interframe wavelet video coding method of claim 9, wherein steps (c) and (d) are alternately performed with respect to the first average frames.
[11] The interframe wavelet video coding method of claim 4, wherein steps (a) and (b) are performed alternately and sequentially with respect to the frames in the group-of-frames.
[12] The interframe wavelet video coding method of claim 4, wherein step (a) is performed with respect to a temporally first half of all of the frames in the group-of-frames and step (b) is performed with respect to a temporally second half of all of the frames in the group-of-frames.
[13] The interframe wavelet video coding method of claim 9, wherein steps (a) and (b) are performed alternately and sequentially with respect to the frames in the group-of-frames.
[14] The interframe wavelet video coding method of claim 9, wherein step (a) is performed with respect to a temporally first half of all of the frames in the group-of-frames and step (b) is performed with respect to a temporally second half of all of the frames in the group-of-frames.
[15] The interframe wavelet video coding method of claim 6, wherein decomposing the first average frames into the second difference and average frames is repeated at least one time.
PCT/KR2004/001666 2003-07-18 2004-07-07 Interframe wavelet video coding method WO2005009046A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020030049449A KR20050009639A (en) 2003-07-18 2003-07-18 Interframe Wavelet Video Coding Method
KR10-2003-0049449 2003-07-18

Publications (1)

Publication Number Publication Date
WO2005009046A1 true WO2005009046A1 (en) 2005-01-27

Family

ID=36841006

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2004/001666 WO2005009046A1 (en) 2003-07-18 2004-07-07 Interframe wavelet video coding method

Country Status (3)

Country Link
KR (1) KR20050009639A (en)
CN (1) CN1810040A (en)
WO (1) WO2005009046A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101014129B (en) * 2007-03-06 2010-12-15 孟智平 Video data compression method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5495292A (en) * 1993-09-03 1996-02-27 Gte Laboratories Incorporated Inter-frame wavelet transform coder for color video compression

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
OHM J.R.: "Temporal domain sub-band video coding with motion compensation", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, vol. 3, 23 March 1992 (1992-03-23) - 26 March 1992 (1992-03-26), pages 229 - 232, XP000378915 *
VAN DER SCHAAR M., TURAGA D.S.: "Unconstrained motion compensated temporal filtering (UMCTF) framework for wavelet video coding", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, vol. 3, 6 April 2003 (2003-04-06) - 10 April 2003 (2003-04-10), pages III-81 - 84 *

Also Published As

Publication number Publication date
KR20050009639A (en) 2005-01-25
CN1810040A (en) 2006-07-26

Similar Documents

Publication Publication Date Title
JP4891234B2 (en) Scalable video coding using grid motion estimation / compensation
US20050226334A1 (en) Method and apparatus for implementing motion scalability
US20060088222A1 (en) Video coding method and apparatus
US20050169379A1 (en) Apparatus and method for scalable video coding providing scalability in encoder part
US20050157793A1 (en) Video coding/decoding method and apparatus
US6931068B2 (en) Three-dimensional wavelet-based scalable video compression
US7042946B2 (en) Wavelet based coding using motion compensated filtering based on both single and multiple reference frames
US20050163217A1 (en) Method and apparatus for coding and decoding video bitstream
EP1766998A1 (en) Scalable video coding method and apparatus using base-layer
US20030202599A1 (en) Scalable wavelet based coding using motion compensated temporal filtering based on multiple reference frames
US20050047509A1 (en) Scalable video coding and decoding methods, and scalable video encoder and decoder
US20060013312A1 (en) Method and apparatus for scalable video coding and decoding
US20050158026A1 (en) Method and apparatus for reproducing scalable video streams
EP1741297A1 (en) Method and apparatus for implementing motion scalability
Conklin et al. A comparison of temporal scalability techniques
US7292635B2 (en) Interframe wavelet video coding method
EP1709811A1 (en) Device and method for playing back scalable video streams
WO2005020587A1 (en) Adaptive interframe wavelet video coding method, computer readable recording medium and system therefor
WO2005009046A1 (en) Interframe wavelet video coding method
Tillier et al. Multiple descriptions scalable video coding
KR100577364B1 (en) Adaptive Interframe Video Coding Method, Computer Readable Medium and Device for the Same
WO2006006793A1 (en) Video encoding and decoding methods and video encoder and decoder
EP1766986A1 (en) Temporal decomposition and inverse temporal decomposition methods for video encoding and decoding and video encoder and decoder
Al-Asmari et al. Low bit rate video compression algorithm using 3-D Decomposition
Kim et al. Scalable interframe wavelet coding with low complex spatial wavelet transform

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DPEN Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed from 20040101)
WWE Wipo information: entry into national phase

Ref document number: 20048170007

Country of ref document: CN

122 Ep: pct application non-entry in european phase