WO2003063482A1 - Coding scene transitions in video coding - Google Patents

Coding scene transitions in video coding

Info

Publication number
WO2003063482A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
video
frame
transition
coded
Application number
PCT/FI2003/000052
Other languages
French (fr)
Inventor
Miska Hannuksela
Original Assignee
Nokia Corporation
Application filed by Nokia Corporation
Priority to EP03700320A (EP1468558A1)
Publication of WO2003063482A1

Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 - Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 - Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B 27/038 - Cross-faders therefor
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 - Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/134 - Methods or arrangements using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/142 - Detection of scene cut or scene change
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 - Methods or arrangements using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/179 - Methods or arrangements using adaptive coding, the coding unit being a scene or a shot
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/20 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/85 - Methods or arrangements using pre-processing or post-processing specially adapted for video compression
    • H04N 19/87 - Methods or arrangements using pre-processing or post-processing involving scene cut or scene change detection in combination with video compression

Definitions

  • Figure 1 shows the placement of the image frames of two different scenes onto scalability layers in accordance with a preferred embodiment of the invention;
  • Figure 2 shows a scene transition that can be composed by means of the placement of image frames according to Figure 1;
  • Figure 3 shows a second scene transition that can be composed by means of the placement of image frames according to Figure 1;
  • Figure 4 shows the placement of the image frames of two different scenes onto scalability layers in accordance with a second preferred embodiment of the invention;
  • Figure 5 shows a graph illustrating the portion of discardable data as a function of the duration of the cross-faded scene transition in accordance with a preferred embodiment of the invention;
  • Figure 6 shows a block chart illustrating a mobile station in accordance with a preferred embodiment of the invention.
  • the invention is applicable to all video coding methods using scalable coding.
  • the invention is particularly applicable to the low bit rate video coding methods typically used in bandwidth-limited telecommunication systems. These include for instance the ITU-T standards H.263 and H.26L (later possibly H.264), the latter of which is currently being standardized.
  • the invention is applicable for instance in mobile stations, allowing video playback to adapt to variable transfer capacity or channel quality and to the processor power available at each particular time when applications other than video playback are run in the mobile station.
  • Video coding methods use scalable coding for flexible adjustment of the video coding bit rate, whereby some elements or element sets in a video sequence can be discarded without any impact on the reconstruction of the other parts of the video sequence.
  • Scalability is typically implemented by grouping image frames onto several hierarchical layers. Substantially, only the image frames necessary for decoding the video information at the receiving end are coded onto the base layer.
  • an independently decodable group of pictures may constitute a sub-sequence, although in the present description, a sub-sequence is understood to mean any group of pictures whose pictures can be decoded using the pictures of the same group of pictures and one or more other groups of pictures.
  • the base layer in each group of pictures GOP typically comprises at least one I frame and a necessary number of P frames.
  • One or more enhancement layers may be defined under the base layer, each layer improving the quality of the video coding compared with the upper layer. Consequently, enhancement layers comprise P or B frames predicted by motion compensation from the pictures of one or more upper layers.
  • the frames are typically numbered in accordance with a predetermined alphanumeric series.
  • the quality of the picture to be displayed improves the more scalability layers are available to the terminal or the more scalability layers the terminal is capable of decoding.
  • the temporal or spatial resolution or spatial quality of image data improves, since the amount of image information and the bit rate used for its transfer increase.
  • a larger number of scalability layers also sets considerably higher requirements on the processing power of the terminal as regards decoding.
  • the bit rate of a video sequence is adjustable by discarding lower scalability layers from the video sequence.
  • the dependence of each image frame in a group of pictures or a sub-sequence on the other image frames in the group of pictures may also be known in some cases.
  • the group of pictures or the sub-sequence and the pictures dependent thereon also constitute an independent entity that can be omitted from the video sequence, if need be, without affecting the decoding of subsequent image frames in the video sequence. Accordingly, only the image frames of said sub-sequence and the sub-sequences of lower scalability layers dependent thereon remain undecoded or at least cannot be decoded correctly.
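
As an illustration of this dependency rule, the sketch below (not part of the patent; all identifiers are hypothetical) drops a sub-sequence together with every sub-sequence that directly or transitively predicts from it:

```python
from dataclasses import dataclass, field

@dataclass
class SubSequence:
    sid: str                                      # hypothetical identifier
    layer: int                                    # 0 = base layer, 1+ = enhancement layers
    depends_on: set = field(default_factory=set)  # sids of reference sub-sequences

def droppable_set(subseqs, victim):
    """Return the victim plus all sub-sequences that (transitively) depend on it;
    omitting exactly this set leaves the rest of the sequence decodable."""
    doomed = {victim}
    changed = True
    while changed:
        changed = False
        for s in subseqs:
            if s.sid not in doomed and s.depends_on & doomed:
                doomed.add(s.sid)
                changed = True
    return doomed

# Dropping enhancement sub-sequence "E1" also drops "E2", which is predicted
# from it, while the base-layer sub-sequence "B0" remains decodable.
seqs = [SubSequence("B0", 0), SubSequence("E1", 1, {"B0"}), SubSequence("E2", 2, {"E1"})]
assert droppable_set(seqs, "E1") == {"E1", "E2"}
```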
  • scalable video coding brings forth a plurality of advantages for adjusting the bit rate of a video sequence.
  • a method of implementing a scene transition utilizing scalable video coding will be described.
  • a method will be described wherein the scalability layers of a video sequence are combined with the above-described image objects of image frames and their information types such that scalable video coding having good compression efficiency is achieved for the scene transition.
  • the invention is not only restricted to scalable coding of a scene transition. It is essential to the invention to enable the coding of a scene transition into a video sequence such that it comprises essential information about the different scenes and their processing during the scene transition, whereby the scene transition can be decoded in a decoder merely based on the information comprised by said video sequence.
  • the following is an exemplary illustration of the invention using a cross-faded scene transition and an abrupt scene transition as examples.
  • the image frames to be presented during a scene transition are typically composed of two superposed image frames, the first image frame from a first scene and the second image frame from a second scene.
  • One of the image frames constitutes a background picture and the other, called a foreground picture, is placed on top of the background picture.
  • the opacity of the background picture is constant, i.e. its pixel-specific alpha level values are not adjusted.
  • the background picture and the foreground picture are determined by scalability layers.
  • Figure 1 shows an example of how the image frames of two different scenes are placed on scalability layers during a scene transition according to the invention.
  • the first image frame 100 of a first (ending) scene is located on a base layer.
  • the image frame 100 may be either an I frame, the determination of whose image information does not use motion-compensated temporal prediction, or a P frame, which is a motion-compensated image frame predicted from previous image frames.
  • the coding of a second (beginning) scene begins during a temporally subsequent image frame, and, in accordance with the invention, the image frames comprised by it are also placed on the base layer. This means that the rest of the image frames 102, 104 of the first (ending) scene are placed on a first enhancement layer (Enhancement1).
  • These image frames are typically P frames.
  • the image frames comprised by the second (beginning) scene are placed on the base layer, at least for the duration of the scene transition.
  • the first image frame 106 of the scene is typically an I frame, from which the subsequent image frames of the second scene are temporally predicted.
  • the subsequent image frames of the second scene are temporally predicted frames, typically P frames, as shown in Figure 1 by frames 108 and 110.
  • Such image frame placement on scalability layers enables a cross-faded scene transition to be implemented in accordance with a preferred embodiment of the invention such that the image frame on the base layer is always defined as a background picture whose opacity is at its maximum (100%).
  • image frames located on an enhancement layer are placed on top of the background picture and their opacity is adjusted for instance with suitable filters such that the frames gradually change from opaque to transparent.
  • a new (second) scene begins at the following image frame 106 on the base layer, image frame 106 being set as the background picture as regards its depth position, and its opacity value being defined to maximum.
  • temporally simultaneously with image frame 106 on the base layer, image frame 102 of the ending (first) scene is on the enhancement layer, and its transparency should be increased in order to achieve a cross-faded scene transition.
  • its opacity is set to a value of, for instance, 67%, in addition to which image frame 102 is defined as a foreground picture as regards its depth position.
  • a picture combined from image frames 106 and 102 is coded in the video sequence, and it shows picture 106 more faintly in the background and picture 102 more clearly in the foreground, since its opacity value is substantially high (60 to 100%).
  • a second image frame 108 of the second scene is on the base layer and it is similarly set as the background picture as regards its depth position, and its opacity value is defined to maximum.
  • the last image frame 104 of the temporally simultaneously ending (first) scene is on the first enhancement layer, the opacity value of the frame being for instance 33%, and, in addition, image frame 104 is defined as a foreground picture as regards its depth position. That is, for said point in time, a picture combined from image frames 108 and 104 is coded into the video sequence, and it shows picture 108 more clearly in the background and picture 104 more faintly in the foreground, since its opacity value is substantially low (10 to 40%). Furthermore, it is feasible that between said image frames there would be a frame whose opacity value is substantially 50%, but this is not shown in this example.
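
The two-layer composition described above can be sketched in a few lines of Python; this is an illustration only, not the patent's implementation, and numpy arrays stand in for decoded pictures:

```python
import numpy as np

def blend(background, foreground, opacity):
    """Weighted average of an opaque background picture and a foreground
    picture with the given opacity (alpha) value."""
    out = opacity * foreground.astype(np.float32) \
        + (1.0 - opacity) * background.astype(np.float32)
    return out.astype(np.uint8)

# Stand-ins for decoded luma pictures (hypothetical 16x16 frames).
frame_102 = np.full((16, 16), 200, np.uint8)   # ending scene, enhancement layer
frame_104 = np.full((16, 16), 200, np.uint8)
frame_106 = np.full((16, 16),  50, np.uint8)   # beginning scene, base layer
frame_108 = np.full((16, 16),  50, np.uint8)

# Base-layer frames are opaque background pictures; the enhancement-layer
# frames are foreground pictures whose opacity falls 67% -> 33% as above.
transition = [blend(frame_106, frame_102, 0.67),
              blend(frame_108, frame_104, 0.33)]
```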
  • image frame 200 shows a picture of a boat, the image frame belonging to the first (ending) scene in this video sequence.
  • Image frame 200 corresponds to image frame 100 of the first base layer in the video sequence of Figure 1, during which frame there are no image frames on lower scalability layers. In other words, only the first image frame 100 of the base layer is coded into the video sequence for said point in time.
  • image frame 202 comprising image information about the first (ending) scene and the second (beginning) scene, combined in accordance with the invention.
  • the beginning scene shows a picture of a man's head dimly in the background of image frame 202.
  • Image frame 202 corresponds to the point in time in Figure 1, when image frame 106 of a beginning scene is on the base layer and image frame 102 of an ending scene on the enhancement layer.
  • Image frame 106 (head) is set as a background picture as regards its depth position and its opacity value is defined to maximum.
  • the opacity of image frame 102 (boat) on the first enhancement layer is set to the value 67% and image frame 102 is defined as a foreground picture as regards its depth position.
  • image frame 202 is coded in the video sequence, wherein picture 106 (head) is shown more faintly in the background and picture 102 (boat) more intensely at the front, since the latter's opacity value is substantially high (67%).
  • image frame 204 also comprising image information about the first (ending) scene and the second (beginning) scene, combined in accordance with the invention.
  • Image frame 204 corresponds to the point in time in Figure 1, when image frame 108 of a beginning scene is on the base layer and the last image frame 104 of an ending scene is on the enhancement layer.
  • image frame 108 is set as the background picture as regards its depth position, its opacity value being defined to maximum.
  • the opacity value of image frame 104 is set to 33%, and, in addition, image frame 104 is defined as a foreground picture as regards its depth position.
  • picture 204 is coded in the video sequence for said point in time, wherein picture 108 (head) is seen more intensely in the background and picture 104 (boat) more faintly at the front, since its opacity value is only 33%.
  • the scene transition is ended and only the third image frame 110 of the second scene on the base layer is coded in image frame 206, from which the presentation of the second scene continues.
  • one or more scalability layers, or independently decodable groups of pictures GOP or sub-sequences comprised thereby, can now be discarded from a video sequence, thus lowering the bit rate of the video sequence, and yet the scene transition can simultaneously be decoded without lowering the image frequency.
  • this can be implemented by discarding the first enhancement layer from the video sequence. This means that only image frames 100, 106, 108 and 110 comprised by the base layer are shown of the video sequence. In other words, a transition occurs directly from the first (ending) scene to the second (beginning) scene as an abrupt scene transition.
  • a scene transition can thus preferably be performed without it having an impact on the picture quality of the video sequence, and, typically, the viewer does not experience an abrupt scene transition performed instead of a cross-faded scene transition as disturbing or faulty.
  • in prior art solutions, the image frequency would instead have to be lowered at a scene transition, and the viewer would experience this as jerky and disturbing.
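
Under the frame placement of Figure 1, this traffic-shaping step amounts to a simple filter over the frame records; the sketch below is illustrative Python with hypothetical frame records, not an actual bitstream operation:

```python
def drop_enhancement_layers(frames, max_layer=0):
    """Keep only frames on layers 0..max_layer; with the placement of Figure 1,
    keeping just the base layer turns the cross-fade into an abrupt cut."""
    return [f for f in frames if f["layer"] <= max_layer]

# Hypothetical frame records mirroring Figure 1 (frame number, layer index).
sequence = [{"num": 100, "layer": 0}, {"num": 102, "layer": 1},
            {"num": 106, "layer": 0}, {"num": 104, "layer": 1},
            {"num": 108, "layer": 0}, {"num": 110, "layer": 0}]

# Only base-layer frames 100, 106, 108 and 110 remain, as described above.
assert [f["num"] for f in drop_enhancement_layers(sequence)] == [100, 106, 108, 110]
```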
  • image frame 300 shows a picture of a boat, the image frame belonging to the first (ending) scene in the video sequence. That is, image frame 300 corresponds to image frame 100 of the first base layer in the video sequence of Figure 1, during the duration of which no image frames exist on the lower scalability layers. In other words, only the first image frame 100 of the base layer is coded into the video sequence for said point in time. At the next point in time, an abrupt scene transition from the first scene to the second scene is performed in the video sequence of Figure 3 by discarding the image frames on the enhancement layer from the video sequence.
  • image frame 302 then comprises image information only about a second (beginning) scene on the base layer, the man's head being clearly visible in image frame 302.
  • image frame 302 corresponds to image frame 106 of Figure 1 as such.
  • image frame 304 comprises image information only about the second image frame 108 of the second scene on the base layer.
  • in image frame 306, only the third image frame 110 of the second scene on the base layer is coded, and the display of the second scene continues from there.
  • the scene transition can thus preferably be performed as an abrupt scene transition without it affecting the picture quality of the video sequence, nor can the abrupt scene transition shown in image frames 300 to 306 be observed as faulty in any way.
  • in the above examples, the image frames of a latter scene are typically placed on the base layer and set as background pictures in their depth position.
  • if the intention is to emphasize the first (ending) scene, for instance because, in the case of an abrupt scene transition, all image frames of the ending scene are to be shown, then the image frames of the first scene can be placed on the base layer.
  • in this case, an I frame has to be coded in the second (beginning) scene instead of a P frame immediately after the scene transition. However, as regards compression efficiency, this is not as preferable a solution as the above-described coding arrangement.
  • the above problem can be solved in systems supporting backward temporal prediction.
  • a method called 'reference picture selection' is known in some coding methods, which in its general form allows also the prediction of the image information of image frames from temporally later image frames.
  • the transfer of an INTRA frame is a coding technique that utilizes the reference picture selection. This means that the INTRA frame is not placed in a temporally 'correct' place in a video sequence, but its place is transferred temporally later. In this case, the image frames between the 'correct' place and the real place of the INTRA frame in the video sequence are predicted from said INTRA frame temporally backwards. This naturally requires that non-coded image frames be buffered for a sufficiently long time in order for all image frames shown to be coded and arranged in the display order.
  • Figure 4 shows all image frames 100, 102 and 104 of a first (ending) scene, placed on a base layer, their depth position defining them as background pictures and their opacity value being at maximum (100%). At least image frames 106 and 108, occurring during the scene transition, of a second (beginning) scene, are placed on a first enhancement layer. These image frames are P frames that are temporally predicted backwards from I frame 110. Depending on the coding method used, I frame 110 may be located either on the base layer or on the first enhancement layer.
  • the depth position placement of image frames located on an enhancement layer and occurring during a scene transition defines them as foreground pictures, and their opacity values change gradually. If the intention is to accomplish a cross-faded scene transition similar to that in the example of Figures 1 and 2 above, the opacity value of image frame 106 is set to 33% and the opacity value of image frame 108 to 67%.
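
The reordering implied by backward prediction can be sketched as follows; the chained prediction order (108 from 110, 106 from 108) is an assumption made for illustration, since the text only states that the P frames are predicted backwards from I frame 110:

```python
# I frame 110 is decoded first, P frames 108 and 106 are then predicted
# backwards from it, and the decoder buffers them until they can be
# presented in display order (106, 108, 110).
decode_order = [110, 108, 106]                  # assumed transmission/decoding order
decoded = {}
for num in decode_order:
    ref = None if num == 110 else num + 2       # assumed chain: 108 -> 110, 106 -> 108
    decoded[num] = (f"frame {num} (INTRA)" if ref is None
                    else f"frame {num} predicted backwards from frame {ref}")

for num in sorted(decode_order):                # presentation (display) order
    print(decoded[num])
```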
  • Scalability by layers can thus preferably be utilized in the implementation of a cross-faded scene transition; on the other hand, if the enhancement layer has to be discarded from the video sequence, for instance because of narrowing available bandwidth, the scene transition can still preferably be performed as an abrupt scene transition.
  • the above examples use only two scalability layers: the base layer and one enhancement layer.
  • the number of enhancement layers is typically not restricted at all, and the coding is able to use several enhancement layers in addition to the base layer.
  • the base layer is further dividable such that a separate INTRA layer comprising only I frames exists above it, followed by the actual base layer and, below that, a necessary number of enhancement layers.
  • the above examples illustrate the invention in situations where a scene transition is performed between two scenes.
  • the invention is not restricted to scene transitions between only two scenes, but the coding can be performed by coding more than two scenes in the same scene transition.
  • the different scenes may be temporally consecutive or at least partly overlapping.
  • the different scenes can be placed on different scalability layers such that after the scene transition, the image frames comprised by a continuous scene are preferably placed on the base layer and the image frames comprised by the other scenes are placeable in several different manners on several enhancement layers.
  • the image frames to be generated during a scene transition can be coded in the above manner by defining a different depth position for the image frames of the different scenes and by weighting the opacity values of the different image frames in different manners.
  • one way to proceed is to place the first scene on the base layer, to place monochrome, e.g. black frames on the first enhancement layer at least for part of the duration of the scene transition, and to place the second scene on the second enhancement layer.
  • the monochrome frames may be thought to exist between frames 102 and 106, and similarly between frames 104 and 108.
  • Such a scene transition between different scenes through black or white is very typical for example in documentary videos.
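
A fade through black fits the same composition model; in the sketch below (illustrative only, with invented opacity values), the ending scene on the base layer is veiled by the monochrome frame on enhancement layer 1, after which the beginning scene on enhancement layer 2 emerges:

```python
import numpy as np

def blend(background, foreground, opacity):
    out = opacity * foreground.astype(np.float32) \
        + (1.0 - opacity) * background.astype(np.float32)
    return out.astype(np.uint8)

black  = np.zeros((16, 16), np.uint8)           # monochrome frame, enhancement layer 1
scene1 = np.full((16, 16), 200, np.uint8)       # ending scene frame, base layer
scene2 = np.full((16, 16),  50, np.uint8)       # beginning scene frame, enhancement layer 2

# First half of the transition: the black layer thickens over scene 1;
# second half: scene 2 is composed on top of the (now black) result.
step1 = blend(scene1, black, 0.67)              # scene 1 fading towards black
step2 = blend(blend(scene1, black, 1.0), scene2, 0.33)  # scene 2 emerging from black
```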
  • a problem in scanning video files arises when scanning is to be started in the middle of a video sequence.
  • To continue scanning a video file at a random point requires that an independently decodable group of pictures GOP be found.
  • Scene transitions are often coded by predicting the latter group of pictures from a first group of pictures belonging to the scene transition, and thus the latter group of pictures is not independently decodable and cannot be used to continue scanning the video file.
  • a beginning scene would be a natural point to start scanning the file.
  • this is avoidable in a decoder in such a manner that when the scanning of a video file starts at a random point, the decoder searches the video sequence for the following scene transition and starts decoding from that scene transition.
  • the second (beginning) scene starts as an I frame, which thus acts as the starting point of an independently decodable group of pictures GOP or a sub-sequence.
  • the I frame acts as a starting point for the decoding.
  • the scene transition of the invention preferably provides a point from which decoding can be started after a random scanning point.
  • B frames can also be used in a scene transition for displaying image frames that occur during the scene transition.
  • the image information of the B frames that occur during the scene transition is determined from these anchor frames by temporal prediction: the pixel values of the macro blocks in the predicted image frames are calculated as average values of the pixel values of the motion-compensated prediction blocks of the anchor frames, or as weighted average values based on the temporal distance of each B frame from both anchor frames.
  • since the compression efficiency of B frames is typically better than that of P frames, a better quality is also achieved at a corresponding bit rate for the image frames combined in the scene transition than if the image frames corresponding to the B frames were P frames. If an image frame occurring during a scene transition, such as a conventional B frame, is not used to predict other frames and does not have to be transmitted, temporally corresponding image frames on other enhancement layers need not be transmitted either.
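
The weighted prediction described above reduces, per pixel, to a pair of weights derived from the B frame's position between its anchors; the following sketch illustrates the idea (not the H.26L bitstream syntax, and the block contents are stand-ins):

```python
import numpy as np

def b_frame_prediction(ref0, ref1, t, t0, t1):
    """Weighted average of two motion-compensated anchor blocks, the weights
    being proportional to the B frame's temporal distance from each anchor."""
    w1 = (t - t0) / (t1 - t0)       # the closer to ref1, the larger its weight
    w0 = 1.0 - w1
    out = w0 * ref0.astype(np.float32) + w1 * ref1.astype(np.float32)
    return out.astype(np.uint8)

# Hypothetical 8x8 prediction blocks after motion compensation.
ref0 = np.full((8, 8), 100, np.uint8)           # anchor frame at t0 = 0
ref1 = np.full((8, 8), 200, np.uint8)           # anchor frame at t1 = 4
print(b_frame_prediction(ref0, ref1, t=1, t0=0, t1=4))   # 0.75*100 + 0.25*200 = 125
```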
  • the bit rate of the part to be discarded from the bit stream of the video sequence can be determined such that it depends on the lengths of the scenes, i.e. the time between scene transitions, and on the duration of the scene transition. If the assumption here is that a constant bit rate is used and an equal bit rate is reserved for the use of each image frame, a formula can be defined for approximating the portion of discardable data in a cross-faded scene transition from the data reserved for the use of the image frames.
  • the portion of discardable data can be given by formula 1 :
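
The formula itself has not survived in this text. A reconstruction that is consistent with every worked figure quoted below (an assumption, not the patent's verbatim expression) is

```latex
% D: portion of discardable data, t: duration of the scene transition,
% T: duration of the actual scene (the time between scene transitions)
D = \frac{t}{T + t}
```

This form gives D = 0 for an abrupt transition (t = 0), D = 1/2 when the transition is as long as the scene (t = T), about 9% for t/T = 0.1, and about 6% for t/T = 0.06, matching the examples below.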
  • the portion of discardable data as a function of the duration of the cross-faded scene transition can be presented by the curve of the graph of Fig. 5.
  • the graph shows that if a cross-faded scene transition is not used (duration of scene transition is zero, i.e. an abrupt scene transition is involved), the amount of data to be discarded during the scene transition is naturally zero.
  • if the duration of the scene transition is equal to that of the actual scene, half of the image frame data can be discarded during the scene transition.
  • in practice, the ratio of the duration of the scene transition to the duration of the entire scene is typically below 0.1, the amount of discardable data then being less than 10%.
  • for example, a movie trailer may include one-second scenes, between which a 0.1-second cross-fade is used, the ratio of the duration of the scene transition to the duration of the entire scene being exactly 0.1, which corresponds to a portion of 9% of discardable data.
  • a news clip may include for instance 5-second scenes, which are cross-faded to the next scene during 0.3 seconds. In this case the ratio of the duration of the scene transition to the length of the entire scene is 0.06, corresponding to a portion of about 6% of discardable data.
  • the graph of Fig. 5 further shows that, at its maximum, the amount of discardable data is determined using weighted averages, which thus resembles the above prediction of B pictures by weighting according to the temporal distances from the image frames used as anchor frames.
  • data can be discarded from a video sequence during a scene transition according to the above formula using transition filtering known per se, such as SMIL (Synchronized Multimedia Integration Language) filtering.
  • the SMIL 2.0 standard presents means for transition filtering of, for instance, image and video files.
  • in the filtering process, either one source is used, or the filtering is determined to take place between two sources, on the basis of which the filtering output is determined for a given range in the image frame.
  • the filter determines the transition between the origin media and the destination media by denoting the origin media by the value 0.0 and the destination media by the value 1.0. This enables the filtering process and the desired result to be determined by setting a suitable value for said parameter.
  • the SMIL 2.0 standard presents a plurality of different filtering effects that are applicable to the transition filtering according to the invention.
  • the properties of the filters, particularly said parameter determining the transition, are determined in accordance with formula 1.
  • the desired filtering effect affects the type of filter used.
  • coding a scene transition according to the invention is not only limited to the above examples and a cross-faded or abrupt scene transition, but, in principle, the invention is applicable to any type of scene transition. Accordingly, the invention is applicable for instance to the previously mentioned tiling, roll, push, door or different zoomings.
  • the procedure is the same in all scene transitions: determining the opacity and depth values for each frame during the scene transition, as well as the filter type and the effect required for the scene transition.
  • coding according to the invention is performed in a video encoder, which may be a video encoder known per se.
  • the video encoder used could be for instance a video encoder according to the ITU-T recommendations H.263 or H.26L, which, in accordance with the invention, is arranged to determine that the presentation time of at least one video frame of the first scene is equal to the presentation time of at least one video frame of the second scene during the scene transition, said video frames thus being scene transition video frames, to define scene transition information for at least one video frame of at least one scene for generating a scene transition with a decoder, to code said at least one scene transition video frame of the first scene in an encoder into a video sequence, to code at least said one scene transition video frame of the second scene in the encoder into the video sequence, and to code said scene transition information in the encoder into the video sequence.
  • similarly, decoding according to the invention takes place in a video decoder, which may be a video decoder known per se.
  • the video decoder used could be for instance a low bit rate video decoder according to the ITU-T recommendations H.263 or H.26L, which, in accordance with the invention, is arranged to receive a coded video frame of a first scene, a coded video frame of a second scene and coded scene transition information, to decode the coded video frame of the first scene, the coded video frame of the second scene and the coded scene transition information, and to generate a scene transition by using the decoded video frame of the first scene, the decoded video frame of the second scene and the decoded scene transition information.
  • the different parts of video-based telecommunication systems may comprise properties to enable bi-directional transfer of multimedia files, i.e. transfer and reception of files.
  • the functional elements of the invention in the above video encoder, video decoder and terminal can be implemented preferably as software, hardware or a combination of the two.
  • the coding and decoding methods of the invention are particularly well suited to be implemented as computer software comprising computer-readable commands for carrying out the functional steps of the invention.
  • the encoder and decoder can preferably be implemented as a software code stored on storage means and executable by a computer-like device, such as a personal computer (PC) or a mobile station, for achieving the coding/decoding functionalities with said device.
  • Fig. 6 shows a block diagram of a mobile station MS according to a preferred embodiment of the invention.
  • a central processing unit CPU controls blocks responsible for the various functions of the MS: a memory MEM comprising typically both random access memory RAM and read-only memory ROM, a radio frequency part comprising transmitter/receiver TX/RX, a video codec CODEC and a user interface UI.
  • the user interface comprises a keyboard KB, a display DP, a speaker SP and a microphone MF.
  • the CPU is a microprocessor, or in alternative embodiments, some other kind of processor, such as a digital signal processor.
  • the operating instructions of the CPU have been stored beforehand in the ROM.
  • the CPU uses the radio frequency block for transmitting and receiving data over a radio path.
  • the video codec may be either hardware based or fully or partly software based, in which case the CODEC comprises computer programs for controlling the CPU to perform video encoding and decoding functions as explained above.
  • the CPU uses the RAM as its working memory.
  • the mobile station MS can advantageously include a video camera CAM, whereby the mobile station can capture motion video by the video camera. The captured motion video is then encoded and compressed using the CPU, the RAM and CODEC based software.
  • the radio frequency block is then used to exchange encoded video with other parties.
  • the invention can also be implemented as a video signal comprising at least a first and a second scene, the first scene being an ending scene and the second a beginning scene, at least one of the scenes comprising independently decodable video frames coded in accordance with at least a first frame format, and video frames coded in accordance with a second frame format, at least one of the video frames according to the second frame format being predicted from at least one other video frame.
  • in addition, the video signal comprises scene transition information for at least one video frame of at least one scene for generating a scene transition with a decoder.

Abstract

A method of generating a scene transition in a video sequence between a first and a second scene. One of the scenes comprises independently decodable video frames coded according to a first frame format, and video frames coded according to a second frame format, one of the video frames according to the second frame format being predicted from one other video frame. The presentation time of one video frame of the first scene is determined to be equal to that of one scene transition video frame of the second scene during the scene transition. Scene transition information is determined for one video frame of one scene for generating a scene transition with a decoder. One scene transition video frame of the first scene, one scene transition video frame of the second scene, and the scene transition information are coded in the encoder into the video sequence.

Description

CODING SCENE TRANSITIONS IN VIDEO CODING
FIELD OF THE INVENTION
The invention relates to video coding, particularly to transitions between scenes that are included in video files, i.e. to scene transitions.
BACKGROUND OF THE INVENTION
Video files are composed of a plurality of still image frames, which are shown rapidly in succession as a video sequence (typically 15 to 30 frames per second) to create an idea of a moving image. Image frames typically comprise a plurality of stationary background objects defined by image information that remains substantially the same, and few moving objects defined by image information that changes somewhat. In such a case, the image information comprised by the image frames to be shown in succession is typically very similar, i.e. consecutive image frames comprise much redundancy. In fact, the redundancy comprised by video files is dividable into spatial, temporal and spectral redundancy. Spatial redundancy represents the mutual correlation between adjacent image pixels; temporal redundancy represents the change in given image objects in following frames, and spectral redundancy the correlation between different colour components within one image frame. Several video coding methods utilize the above-described temporal redundancy of consecutive image frames. In this case, so-called motion-compensated temporal prediction is used, wherein the contents of some (typically most) image frames in a video sequence are predicted from the other frames in the sequence by tracking the changes in given objects or areas in the image frames between consecutive image frames. A video sequence comprises compressed image frames, whose image information is determined without using motion-compensated temporal prediction. Such frames are called INTRA or I frames. Similarly, motion-compensated image frames comprised by a video sequence and predicted from previous image frames are called INTER or P frames (Predicted). One I frame and possibly one or more previously coded P frames are used in the determination of the image information of P frames. If a frame is lost, frames depending thereon can no longer be correctly decoded.
Typically, an I frame initiates a video sequence defined as a Group of Pictures (GOP), the image information of the P frames comprised by which can be defined using only the I frames comprised by said group of pictures GOP and previous P frames. The following I frame again initiates a new group of images GOP, and the image information of the frames comprised by it cannot thus be defined on the basis of the frames in a previous group of pictures GOP. Accordingly, groups of pictures GOP do not temporally overlap and each group of pictures can be independently decoded. In addition, many video compression methods use bi-directionally predicted B frames, which are placed between two anchor frames (I and P frame or two P frames) within a group of pictures GOP, and the image information of the B frame is predicted from both the previous anchor frame and the anchor frame following the B frame. B frames thus provide image information of a better quality than do P frames, but they are typically not used as an anchor frame and discarding them from the video sequence does therefore not cause any deterioration of the quality of subsequent pictures.
Each image frame is dividable into macro blocks that comprise the colour components (e.g. Y, U, V) of all pixels from a rectangular image area. More precisely, a macro block is composed of three blocks, each block comprising colour values (e.g. Y, U or V) from one colour layer of the pixels from said image area. The spatial resolution of the blocks may be different from that of the macro block; for example, components U and V can be presented at only half the resolution compared with component Y. Macro blocks can also be used to form for example slices, which are groups of several macro blocks wherein the macro blocks are typically selected in the image scanning order. In fact, in video coding methods, temporal prediction is typically performed block or macro block-specifically, not image frame-specifically. Many video materials, such as news, music videos and movie trailers, comprise rapid cuts between different image material scenes. Sometimes cuts between different scenes are abrupt, but often scene transition is used, i.e. the transition from scene to scene takes place for instance by fading, wiping, tiling or rolling the image frames of a previous scene, and by bringing forth the scenes of a subsequent scene. As regards coding efficiency, the video coding of a scene transition is often a serious problem, since the image frames during a scene transition comprise information on the image frames of both the ending scene and the beginning scene.
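As an aside, the macro block structure described above can be illustrated with a short sketch (illustrative Python, not part of the patent): assuming the half-resolution chroma mentioned above (4:2:0 sampling), each macro block pairs a 16x16 luma block with the co-located 8x8 chroma blocks.

```python
import numpy as np

def macroblocks(y, u, v, size=16):
    """Split a 4:2:0 picture into macro blocks: one 16x16 Y block plus the
    co-located 8x8 U and V blocks, which have half the spatial resolution."""
    h, w = y.shape
    for r in range(0, h, size):
        for c in range(0, w, size):
            yield (y[r:r + size, c:c + size],
                   u[r // 2:(r + size) // 2, c // 2:(c + size) // 2],
                   v[r // 2:(r + size) // 2, c // 2:(c + size) // 2])

# A hypothetical 32x32 picture: Y at full resolution, U and V at half resolution.
Y = np.zeros((32, 32), np.uint8)
U = np.zeros((16, 16), np.uint8)
V = np.zeros((16, 16), np.uint8)
blocks = list(macroblocks(Y, U, V))
assert len(blocks) == 4 and blocks[0][0].shape == (16, 16) and blocks[0][1].shape == (8, 8)
```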
A typical scene transition, fading, is performed by lowering the intensity or luminance of the image frames in a first scene gradually to zero and simultaneously raising the intensity of the image frames in a second scene gradually to its maximum value. Such a scene transition is called a cross-faded scene transition. A second typical scene transition, tiling, is performed by randomly or pseudo-randomly discarding square parts from the image frames of a first scene, and replacing the discarded parts with bits taken from the corresponding places in a second scene. Some typical scene transitions, such as roll, push, door etc., are accomplished by 'fixing' the first image frames on the surface of a virtual object (a paper sheet, a sliding door or an ordinary door) or some other arbitrary object, and turning this object or piece gradually away from sight, whereby information about the image frames of a second scene is copied to the emerging image areas. Many other transitions are known and used in several commercially available products, such as Avid Cinema™ (Avid Technology Inc.).
Present video coding methods utilize several methods of coding scene transitions. For example, in the coding according to the ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) H.263 standard, the above-described B frames are usable for presenting image frames during a scene transition. In this case, one image frame from a first (ending) scene and one image frame from a second (beginning) scene are selected as anchor frames. The image information of the B frames inserted between these during the scene transition is defined from these anchor frames by temporal prediction such that the pixel values of the predicted image blocks are calculated as average values of the pixel values of the motion-compensated prediction blocks of the anchor frames.
As regards coding efficiency, such a solution is, however, disadvantageous particularly if coding the scene transition requires that several B frames be inserted between the anchor frames. In fact, the coding has been improved in the ITU-T H.26L standard such that the image information of the B frames inserted between the anchor frames during the scene transition is defined from these anchor frames by temporal prediction such that the pixel values of the B image frames are calculated as weighted average values of the pixel values of the anchor frames based on the temporal distance of each B frame from both anchor frames. This improves the coding efficiency of scene transitions made by fading, in particular, and also the quality of the predicted B frames. Generally speaking, it is feasible that a computer-generated image is made of layers, i.e. image objects. Each of these image objects is definable by three types of information: the texture of the image object, its shape and transparency, and the layering order (depth) relative to the background of the image and other image objects. For example, MPEG-4 video coding uses some of these information types and the parameters values defined for them in coding scene transitions.
Shape and transparency are often defined using an alpha plane, which measures non-transparency, i.e. opacity and whose value is usually defined separately for each image object, possibly excluding the background, which is usually defined as opaque. It can be defined that the alpha plane value of an opaque image object, such as the background, is 1.0, whereas the alpha plane value of a fully transparent image object is 0.0. Intermediate values define how strongly a given image object is visible in the image relative to the background and other at least partly superposed image objects that have a higher depth value relative to said image object. Layering image objects on top of each other according to their shape, transparency and depth position is called scene composition. In practice, this is based on the use of weighted average values. The image object closest to the background, i.e. positioned the deepest, is first positioned on top of the background, and a combined image is created from these. The pixel values of the composite image are determined as an average value weighted by the alpha plane values of the background image and said image object. The alpha plane value of the combined image is then defined as 1.0, and it then becomes the background image for the following image object. The process continues until all image objects are combined with the image. The above-described process for coding a scene transition is used for instance in MPEG-4 video coding such that image frames in a beginning scene are typically selected as background images, whose opacity has a full value, and the opacity of image frames in an ending scene, the frames being 'image objects' to be positioned on top of the background, is reduced during the scene transition. When the opacity, i.e. alpha plane value, of the image frames of the ending scene reaches zero, only the image frame of the beginning scene is visible in the final image frame.
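The scene composition process just described can be condensed into a loop; the sketch below is a simplified illustration (scalar alpha values instead of per-pixel alpha planes, and invented example values), not MPEG-4's actual composition machinery:

```python
import numpy as np

def compose_scene(background, objects):
    """Layer image objects onto an opaque background, deepest first: each step
    is an alpha-weighted average, and the combined picture (alpha 1.0) then
    serves as the background for the next image object."""
    canvas = background.astype(np.float32)
    for texture, alpha, depth in sorted(objects, key=lambda obj: -obj[2]):
        canvas = alpha * texture.astype(np.float32) + (1.0 - alpha) * canvas
    return canvas.astype(np.uint8)

# Hypothetical scene: an opaque background plus two partly transparent objects.
bg = np.full((16, 16), 50, np.uint8)
objects = [(np.full((16, 16), 200, np.uint8), 0.67, 1),   # shallow foreground object
           (np.full((16, 16), 120, np.uint8), 0.50, 2)]   # deeper object, composed first
print(compose_scene(bg, objects)[0, 0])                   # 0.67*200 + 0.33*(0.5*120 + 0.5*50)
```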
However, prior art scene transition coding involves several problems. The use of weighted anchor frame average values in the prediction of B frames does not work well in situations wherein the duration of the scene transition is long and the images include much motion, which considerably lowers the compression efficiency of coding based on temporal prediction. If the B pictures used in the scene transition are discarded in traffic shaping, for instance in a streaming server, the image rate of the transmitted sequence temporarily decreases during the scene transition, which is usually observed as jerky motion. A problem in the method used in MPEG-4 video coding is the complexity of coding a scene transition. In MPEG-4 video coding, scene composition always takes place by means of a system controlling the video coding and decoding, since an individual MPEG-4 video sequence cannot contain the information required for composing a scene from two or more video sequences. Consequently, composing a scene transition requires control-level support for the actual process and the simultaneous transfer of two or more video sequences, which typically requires a wider bandwidth, at least temporarily.
BRIEF DESCRIPTION OF THE INVENTION
The object of the invention is thus to provide a method and an apparatus for implementing the method to alleviate the above problems. The objects of the invention are achieved by a method, a video encoder, a video decoder and computer software that are characterized in what is disclosed in the independent claims.
The preferred embodiments of the invention are disclosed in the dependent claims.
The invention is based on composing a scene transition in a video sequence between at least a first and a second scene, the first scene being an ending scene and the second scene a beginning scene. At least one of the scenes comprises independently decodable video frames coded according to at least a first frame format, i.e. I frames, and video frames coded according to a second frame format, i.e. P or B frames, at least one of the video frames according to the second frame format being predicted from at least one other video frame. The scene transition is preferably coded into the video sequence such that the presentation times of at least one video frame of the first scene and at least one video frame of the second scene are determined to be the same during the scene transition, allowing said video frames to be called scene transition video frames. Scene transition information for composing the scene transition with a decoder is defined for at least one video frame of at least one of the scenes. At least said one scene transition video frame of the first scene, said one scene transition video frame of the second scene and said scene transition information are then coded in an encoder into the video sequence.
Similarly, when said video sequence is being decoded, the coded video frame of the first scene, the coded video frame of the second scene, and the coded scene transition information are received at a decoder. These are decoded and the scene transition is generated using the decoded video frame of the first scene, the decoded video frame of the second scene, and the decoded scene transition information.
In a preferred embodiment of the invention, frames of the first and second scenes are placed on different scalability layers comprising at least a base layer and a first enhancement layer.
The advantage of the method of the invention is that it allows a scene transition to be coded into a video sequence such that the sequence comprises the essential information about the different scenes and their processing during the scene transition, enabling the decoding of the scene transition in a decoder based merely on the information comprised by said video sequence. A further advantage is that the method of the invention enables scalable coding also in the coding of the scene transition. An additional advantage is that, in accordance with a preferred embodiment of the invention, the scalability layers of the video sequence are combined with the above-described image objects of image frames and their information types such that scalable video coding with good compression efficiency is achieved for the scene transition.
BRIEF DESCRIPTION OF THE FIGURES
In the following, the invention will be described in detail in connection with preferred embodiments with reference to the accompanying drawings, in which
Figure 1 shows the placement of the image frames of two different scenes onto scalability layers in accordance with a preferred embodiment of the invention;
Figure 2 shows a scene transition that can be composed by means of the placement of image frames according to Figure 1;
Figure 3 shows a second scene transition that can be composed by means of the placement of image frames according to Figure 1 ;
Figure 4 shows the placement of the image frames of two different scenes onto scalability layers in accordance with a second preferred embodiment of the invention;
Figure 5 shows a graph illustrating the portion of discardable data as a function of the duration of the cross-faded scene transition in accordance with a preferred embodiment of the invention; and
Figure 6 shows a block chart illustrating a mobile station in accordance with a preferred embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
The invention is applicable to all video coding methods using scalable coding. The invention is particularly applicable to the different low bit rate video codings typically used in limited-band telecommunication systems. These include for instance the ITU-T standards H.263 and H.26L (later possibly H.264), the latter currently being standardized. In these systems, the invention is applicable for instance in mobile stations, allowing video playback to adapt to the variable transfer capacity or channel quality and to the processor power available at each particular time, when applications other than video playback are run in the mobile station.
Furthermore, it is to be noted that, for the sake of clarity, the invention will be described next by describing the coding and temporal prediction of image frames at image frame level. However, in practice, coding and temporal prediction typically take place at block or macro block level, as was stated above.
Several video coding methods use scalable coding for flexible adjustment of the video coding bit rate, whereby some elements or element sets of a video sequence can be discarded without any impact on the reconstruction of the other parts of the video sequence. Scalability is typically implemented by grouping image frames onto several hierarchical layers. Substantially only the image frames necessary for decoding the video information at the receiving end are coded into the image frames of the base layer.
The concept of an independently decodable group of pictures GOP is typically used also in this case. In some video coding methods, such an independently decodable group of pictures may constitute a sub-sequence, although in the present description a sub-sequence is understood to mean any group of pictures whose pictures can be decoded using the pictures of the same group of pictures and of one or more other groups of pictures. The base layer in each group of pictures GOP typically comprises at least one I frame and a necessary number of P frames. One or more enhancement layers may be defined under the base layer, each layer improving the quality of the video coding compared with the upper layer. Consequently, enhancement layers comprise P or B frames predicted by motion compensation from the pictures of one or more upper layers. On each layer, the frames are typically numbered in accordance with a predetermined alphanumeric series. As regards the terminal that plays back the video sequence, the quality of the picture to be displayed improves the more scalability layers are available or the more scalability layers the terminal is capable of decoding. In other words, the temporal or spatial resolution or the spatial quality of the image data improves, since the amount of image information and the bit rate used for its transfer increase. Similarly, a larger number of scalability layers also sets considerably higher requirements on the processing power of the terminal as regards decoding.
Correspondingly, the bit rate of a video sequence is adjustable by discarding lower scalability layers from the video sequence. The dependence of each image frame in a group of pictures or a sub-sequence on the other image frames in the group of pictures may also be known in some cases. In these instances, the group of pictures or the sub-sequence and the pictures dependent thereon constitute an independent entity that can be omitted from the video sequence, if need be, without affecting the decoding of subsequent image frames in the video sequence. Accordingly, only the image frames of said sub-sequence and of the sub-sequences of lower scalability layers dependent thereon remain un-decoded or at least cannot be decoded correctly. In other words, scalable video coding brings forth a plurality of advantages for adjusting the bit rate of a video sequence.

In the following, a method of implementing a scene transition utilizing scalable video coding is described, wherein the scalability layers of a video sequence are combined with the above-described image objects of image frames and their information types such that scalable video coding having good compression efficiency is achieved for the scene transition.
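As an illustrative sketch of such bit-rate adjustment (the Frame structure and the layer numbering below are assumptions for illustration, not any standard's syntax):

```python
from dataclasses import dataclass

@dataclass
class Frame:
    number: int       # frame number in the alphanumeric series
    layer: int        # 0 = base layer, 1 = first enhancement layer, ...
    frame_type: str   # 'I', 'P' or 'B'

def thin_to_layer(sequence, max_layer):
    """Adjust the bit rate by keeping only the frames on layers up to
    `max_layer`. Because lower (enhancement) layers are predicted from
    upper layers and not vice versa, the remaining frames still decode."""
    return [f for f in sequence if f.layer <= max_layer]
```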
However, it is to be noted that the invention is not restricted only to scalable coding of a scene transition. It is essential to the invention that a scene transition can be coded into a video sequence such that the sequence comprises the essential information about the different scenes and their processing during the scene transition, whereby the scene transition can be decoded in a decoder merely on the basis of the information comprised by said video sequence.

The following illustrates the invention by way of example using a cross-faded scene transition and an abrupt scene transition. The image frames to be presented during a scene transition are typically composed of two superposed image frames, the first image frame coming from a first scene and the second image frame from a second scene. One of the image frames constitutes a background picture and the other, called a foreground picture, is placed on top of the background picture. The opacity of the background picture is constant, i.e. its pixel-specific alpha level values are not adjusted.
In the present embodiment of the invention, the background picture and the foreground picture are determined by scalability layers. This is illustrated in Figure 1, which shows an example of how the image frames of two different scenes are placed on scalability layers during a scene transition according to the invention. In Figure 1, the first image frame 100 of a first (ending) scene is located on a base layer. Image frame 100 may be either an I frame, the determination of whose image information does not use motion-compensated temporal prediction, or a P frame, which is a motion-compensated image frame predicted from previous image frames. The coding of a second (beginning) scene begins during a temporally subsequent image frame, and, in accordance with the invention, the image frames comprised by it are also placed on the base layer. This means that the remaining image frames 102, 104 of the first (ending) scene are placed on a first enhancement layer (Enhancement1). These image frames are typically P frames.
As stated, in this embodiment the image frames comprised by the second (beginning) scene are placed on the base layer, at least for the duration of the scene transition. The first image frame 106 of the scene is typically an I frame, from which the subsequent image frames of the second scene are temporally predicted. In other words, the subsequent image frames of the second scene are temporally predicted frames, typically P frames, as shown in Figure 1 by frames 108 and 110.

Such placement of image frames on scalability layers enables a cross-faded scene transition to be implemented in accordance with a preferred embodiment of the invention such that the image frame on the base layer is always defined as a background picture whose opacity is at its maximum (100%). During a scene transition, image frames located on an enhancement layer are placed on top of the background picture, and their opacity is adjusted, for instance with suitable filters, such that the frames gradually change from opaque to transparent.
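Under an assumed linear fade (the helper below is hypothetical, not mandated by the invention), the enhancement-layer opacities can be generated as follows; for a two-frame transition this yields the 67% and 33% used in the example below:

```python
def fade_opacities(n_transition_frames):
    """Opacity values for the ending scene's frames on the enhancement
    layer, running gradually from opaque towards transparent under a
    linear fade. For two frames this gives 2/3 and 1/3, i.e. roughly
    67% and 33%; for three frames, 75%, 50% and 25%."""
    n = n_transition_frames
    return [(n - i) / (n + 1) for i in range(n)]
```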
In the video sequence of Figure 1 , there are no image frames on lower scalability layers during the first image frame 100 of the base layer. For this point in time, only the first image frame 100 of the base layer is coded in the video sequence.
A new (second) scene begins at the following image frame 106 on the base layer, image frame 106 being set as the background picture as regards its depth position, and its opacity value being defined to the maximum. Temporally simultaneous with image frame 106 on the base layer is image frame 102 of the ending (first) scene on the enhancement layer, whose transparency should be increased in order to achieve a cross-faded scene transition. In the example of Figure 1, its opacity is set to a value of, for instance, 67%, and image frame 102 is additionally defined as a foreground picture as regards its depth position. For this point in time, a picture combined from image frames 106 and 102 is coded into the video sequence, showing picture 106 more faintly in the background and picture 102 more clearly in the foreground, since the opacity value of the latter is substantially high (60 to 100%).
During the temporally following image frame, a second image frame 108 of the second scene is on the base layer, and it is similarly set as the background picture as regards its depth position, its opacity value being defined to the maximum. In addition, the last image frame 104 of the temporally simultaneously ending (first) scene is on the first enhancement layer, the opacity value of the frame being for instance 33%, and image frame 104 being additionally defined as a foreground picture as regards its depth position. That is, for said point in time, a picture combined from image frames 108 and 104 is coded into the video sequence, showing picture 108 more clearly in the background and picture 104 more faintly in the foreground, since the opacity value of the latter is substantially low (10 to 40%). Furthermore, it is feasible that between said image frames there would be a frame whose opacity value is substantially 50%, but this is not shown in this example.
During the temporally following image frame, a third image frame 110 of the second scene is on the base layer. Since the first scene has ended, only image frame 110 is coded into the video sequence, and the presentation of the second scene continues from that frame.

The video sequence of Figure 2 can preferably be used to illustrate the above-described cross-faded scene transition. In Figure 2, image frame 200 shows a picture of a boat, the image frame belonging to the first (ending) scene in this video sequence. Image frame 200 corresponds to image frame 100 on the base layer in the video sequence of Figure 1, during which frame there are no image frames on lower scalability layers. In other words, only the first image frame 100 of the base layer is coded into the video sequence for said point in time.
At the next point in time, a scene transition starts in the video sequence of Figure 2, image frame 202 comprising image information about the first (ending) scene and the second (beginning) scene, combined in accordance with the invention. The beginning scene shows a picture of a man's head dimly in the background of image frame 202. Image frame 202 corresponds to the point in time in Figure 1 when image frame 106 of the beginning scene is on the base layer and image frame 102 of the ending scene is on the enhancement layer. Image frame 106 (head) is set as the background picture as regards its depth position, and its opacity value is defined to the maximum. The opacity of image frame 102 (boat) on the first enhancement layer is set to the value 67%, and image frame 102 is defined as a foreground picture as regards its depth position. For this point in time, an image frame 202 combined from image frames 106 and 102 is coded into the video sequence, wherein picture 106 (head) is shown more faintly in the background and picture 102 (boat) more intensely in the front, since its opacity value is substantially high (67%).
At the next point in time, the scene transition still continues in the video sequence of Figure 2, image frame 204 also comprising image information about the first (ending) scene and the second (beginning) scene, combined in accordance with the invention. Image frame 204 corresponds to the point in time in Figure 1 when image frame 108 of the beginning scene is on the base layer and the last image frame 104 of the ending scene is on the enhancement layer. In the same way, image frame 108 is set as the background picture as regards its depth position, its opacity value being defined to the maximum. The opacity value of image frame 104 is set to 33%, and image frame 104 is additionally defined as a foreground picture as regards its depth position. That is, picture 204, combined from image frames 108 and 104, is coded into the video sequence for said point in time, wherein picture 108 (head) is seen more intensely in the background and picture 104 (boat) more faintly in the front, since its opacity value is only 33%.
At the last point in time in the video sequence of Figure 2, the scene transition has ended, and only the third image frame 110 of the second scene on the base layer is coded as image frame 206, from which the presentation of the second scene continues.
The above describes by way of example how the placement of image frames according to the invention onto different scalability layers allows a cross-faded scene transition to be implemented advantageously as regards coding efficiency. However, in the transmission or decoding of a video sequence, a situation may arise in which the bit rate of the video sequence must be adapted to the maximum bandwidth available for data transfer and/or to the decoding rate of the terminal. Such an adjustment of the bit rate causes problems for known video coding methods when implementing a scene transition.
In accordance with a preferred embodiment of the invention, one or more scalability layers, or independently decodable groups of pictures GOP or sub-sequences comprised thereby, can now be discarded from a video sequence, thus lowering the bit rate of the video sequence, and yet the scene transition can simultaneously be decoded without lowering the image frequency. In the placement of image frames according to Figure 1, this can be implemented by discarding the first enhancement layer from the video sequence. This means that only image frames 100, 106, 108 and 110 comprised by the base layer are shown of the video sequence. In other words, a transition occurs directly from the first (ending) scene to the second (beginning) scene as an abrupt scene transition, i.e. directly from image frame 100 of the first scene to the beginning I frame 106 of the second scene; an abrupt scene transition is thus performed instead of a cross-faded one. The scene transition can, however, preferably be performed without it having an impact on the picture quality of the video sequence, and, typically, the viewer does not experience an abrupt scene transition performed instead of a cross-faded scene transition as disturbing or faulty. In a prior art implementation, wherein scalability layers cannot be discarded, the image frequency would instead have to be lowered at the scene transition, which the viewer would experience as jerky and disturbing.
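Continuing the layer-discarding sketch above (it assumes the hypothetical Frame and thin_to_layer definitions from that sketch; the frame types below are likewise assumptions, the text allowing frame 100 to be an I or a P frame):

```python
# The frames of Figure 1 (numbers follow the figure).
figure1 = [Frame(100, layer=0, frame_type='P'),
           Frame(102, layer=1, frame_type='P'),
           Frame(106, layer=0, frame_type='I'),
           Frame(104, layer=1, frame_type='P'),
           Frame(108, layer=0, frame_type='P'),
           Frame(110, layer=0, frame_type='P')]

# Discarding the first enhancement layer leaves frames 100, 106, 108
# and 110: the cross-fade degrades gracefully into an abrupt cut at
# I frame 106, with no drop in image frequency on the base layer.
base_only = thin_to_layer(figure1, max_layer=0)
```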
The above-described abrupt scene transition can preferably be illustrated by the video sequence of Figure 3, comprising the same scenes (boat and head) as the video sequence shown in Figure 2. Also in Figure 3, image frame 300 shows a picture of a boat, the image frame belonging to the first (ending) scene in the video sequence. That is, image frame 300 corresponds to image frame 100 on the base layer in the video sequence of Figure 1, during which no image frames exist on the lower scalability layers. In other words, only the first image frame 100 of the base layer is coded into the video sequence for said point in time. At the next point in time, an abrupt scene transition from the first scene to the second scene is performed in the video sequence of Figure 3 by discarding the image frames of the enhancement layer from the video sequence. In accordance with the invention, image frame 302 then comprises image information only about the second (beginning) scene on the base layer, the man's head being clearly visible in image frame 302. In other words, image frame 302 corresponds to image frame 106 of Figure 1 as such.
Similarly, image frame 304 comprises image information only about the second image frame 108 of the second scene on the base layer. In the video sequence of Figure 3, only the third image frame 110 of the second scene on the base layer is coded into image frame 306, and the display of the second scene continues from there.
As Figure 3 shows, the scene transition can preferably be performed as an abrupt scene transition without it affecting the picture quality of the video sequence, nor can the abrupt scene transition shown in image frames 300 to 306 be perceived as in any way faulty.
As is evident from the above, as regards the implementation of the invention, it is preferable to place the image frames of the latter scene on the base layer and to set them as background pictures in the depth position. However, if the intention is to emphasize the first (ending) scene, for instance because, in the case of an abrupt scene transition, all image frames of the ending scene are to be shown, then the image frames of the first scene can be placed on the base layer. In this case, in accordance with an embodiment of the invention, an I frame has to be coded in the second (beginning) scene, instead of a P frame, immediately after the scene transition. As regards compression efficiency, however, this is not as preferable a solution as the above-described coding arrangement.
In accordance with a preferred embodiment of the invention, the above problem can be solved in systems supporting backward temporal prediction. Some coding methods include a method called 'reference picture selection', which in its general form also allows the image information of image frames to be predicted from temporally later image frames. The transfer of an INTRA frame is a coding technique that utilizes reference picture selection: the INTRA frame is not placed in a temporally 'correct' place in the video sequence, but its place is transferred temporally later. The image frames between the 'correct' place and the real place of the INTRA frame in the video sequence are then predicted from said INTRA frame temporally backwards. This naturally requires that the not yet coded image frames be buffered for a sufficiently long time, in order for all the image frames to be shown to be coded and arranged in display order.
In the following, the above-described coding of a scene transition by means of an INTRA frame transfer is illustrated with reference to Figure 4. Figure 4 shows all image frames 100, 102 and 104 of a first (ending) scene placed on a base layer, their depth position defining them as background pictures and their opacity value being at the maximum (100%). At least the image frames 106 and 108 of a second (beginning) scene, occurring during the scene transition, are placed on a first enhancement layer. These image frames are P frames that are temporally predicted backwards from I frame 110. Depending on the coding method used, I frame 110 may be located either on the base layer or on the first enhancement layer.
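A minimal sketch of the reordering this implies at the decoder (the presentation times and the tuple layout are illustrative assumptions):

```python
# Frames as (presentation_time, label) pairs in decoding order: the
# transferred I frame 110 of Figure 4 is decoded first, then the P
# frames 106 and 108 predicted temporally backwards from it.
decoded_in_order = [(3, 'I-110'), (1, 'P-106'), (2, 'P-108')]

# Buffer until the backward-predicted frames are available, then emit
# in presentation-time order: 106, 108, 110.
display_order = sorted(decoded_in_order)
```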
As regards the coding of a scene transition, it is essential that the depth position placement of image frames located on an enhancement layer and occurring during a scene transition define them as foreground pictures, and their opacity values change gradually. If the intention is to accomplish a cross-faded scene transition similar to that in the example of Figures 1 and 2 above, the opacity value of image frame 106 is set to 33% and the opacity value of image frame 108 to 67%.
The above examples clearly show how the method of the invention combines an improved, i.e. more compression-efficient, cross-faded scene transition using weighted averages with video coding that is scalable by layers. Scalability by layers can preferably be utilized in the implementation of a cross-faded scene transition, and, on the other hand, if the enhancement layer has to be discarded from the video sequence, for instance because of a narrowing available bandwidth, the scene transition can still preferably be performed as an abrupt scene transition.
The above examples present a simplified illustration of the invention using only two scalability layers: the base layer and one enhancement layer. However, in scalable coding the number of enhancement layers is typically not restricted at all, but the coding is able to use several enhancement layers in addition to the base layer. Furthermore, in some coding methods the base layer is further dividable such that a separate INTRA layer, comprising only I frames, exists above it, followed by the actual base layer and, below that, a necessary number of enhancement layers.
Furthermore, the above examples illustrate the invention in situations where a scene transition is performed between two scenes. However, the invention is not restricted to scene transitions between only two scenes, but more than two scenes can be coded in the same scene transition. The different scenes may be temporally consecutive or at least partly overlapping. In this case, the different scenes can be placed on different scalability layers such that, after the scene transition, the image frames comprised by the continuing scene are preferably placed on the base layer, while the image frames comprised by the other scenes can be placed in several different manners on several enhancement layers. The image frames to be generated during a scene transition can be coded in the above manner by defining a different depth position for the image frames of the different scenes and by weighting the opacity values of the different image frames in different manners.

One way to proceed in this case is to place the first scene on the base layer, to place monochrome, e.g. black, frames on the first enhancement layer for at least part of the duration of the scene transition, and to place the second scene on the second enhancement layer. In Figure 4, for example, the monochrome frames may be thought to exist between frames 102 and 106, and similarly between frames 104 and 108. This enables the scene transition between the first and the second scene to be performed by first fading the first scene to black, whereupon the image frames may return to the second scene, for instance as a cross-faded scene transition as described above. Such a scene transition between scenes through black or white is very typical for example in documentary videos.

A problem in scanning video files arises when scanning is to be started in the middle of a video sequence. Such situations arise for instance when a user wants to scan a locally stored video file, or a streaming file at a given point, forward or backward, when a user starts playing back a streaming file at a random point, or when an error stopping the playback of a video file is observed, requiring that the playback of the file be restarted from some point subsequent to the error. Continuing the scanning of a video file at a random point requires that an independently decodable group of pictures GOP be found. Scene transitions are often coded by predicting the latter group of pictures from a first group of pictures belonging to the scene transition; thus the latter group of pictures is not independently decodable and cannot be used to continue scanning the video file. A beginning scene would, however, be a natural point at which to start scanning the file.
In accordance with a preferred embodiment of the invention, this is avoidable in a decoder in such a manner that, when the scanning of a video file starts at a random point, the decoder looks in the video sequence for the scene transition following said point and starts decoding from that scene transition. This can preferably be implemented, since in the scene transition of the invention the second (beginning) scene starts with an I frame, which thus acts as the starting point of an independently decodable group of pictures GOP or a sub-sequence. As regards the above-described transfer of an INTRA frame, it is likewise feasible that the I frame acts as the starting point for the decoding. In this way, the scene transition of the invention preferably provides a point from which decoding can be started after a random scanning point.
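A minimal sketch of this random-access rule (reusing the hypothetical Frame structure from the earlier sketch; a real decoder would additionally parse the bitstream for picture types):

```python
def next_decoding_start(frames, seek_index):
    """Return the index of the first frame at or after a random seek
    point from which decoding can start: the I frame that begins the
    new scene of the next scene transition, i.e. the start of an
    independently decodable group of pictures."""
    for i in range(seek_index, len(frames)):
        if frames[i].frame_type == 'I':
            return i
    raise ValueError('no independently decodable picture after seek point')
```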
In accordance with a preferred embodiment of the invention, B frames can also be used in a scene transition for displaying image frames that occur during the scene transition. In this case, the image information of the B frames that occur during the scene transition is determined from the anchor frames by temporal prediction, by calculating the pixel values of the macro blocks in the predicted image frames as average values, or as average values weighted according to the distance of each B frame from both anchor frames, of the pixel values of the motion-compensated prediction blocks of the anchor frames. Because the compression efficiency of B frames is typically better than that of P frames, a better quality is also achieved, at a corresponding bit rate, for the image frames combined in the scene transition than if the image frames corresponding to the B frames were P frames. If an image frame occurring during a scene transition, such as a conventional B frame, is not used to predict other frames and does not have to be transmitted, the temporally corresponding image frames on other enhancement layers need not be transmitted either.
In order to achieve sufficient compression efficiency, the bit rate of the part to be discarded from the bit stream of the video sequence can be determined such that it depends on the lengths of the scenes, i.e. the time between scene transitions, and on the duration of the scene transition. If the assumption here is that a constant bit rate is used and an equal bit rate is reserved for the use of each image frame, a formula can be defined for approximating the portion of data discardable in a cross-faded scene transition from the data reserved for the use of the image frames. If the portion of discardable data is denoted by S(discard), the average duration of the cross-faded scene transition by D(cross-fade), and the length of the scenes by T(scene cut), the portion of discardable data is given by formula 1:
S(discard) = D(cross-fade) / (D(cross-fade) + T(scene cut))    (1)
The portion of discardable data as a function of the duration of the cross-faded scene transition can be presented by the curve of the graph of Fig. 5. The graph shows that if a cross-faded scene transition is not used (the duration of the scene transition is zero, i.e. an abrupt scene transition is involved), the amount of data to be discarded during the scene transition is naturally zero. On the other hand, if the duration of the scene transition is equal to that of the actual scene, half of the image frame data can be discarded during the scene transition. The ratio of the duration of the scene transition to the duration of the entire scene is typically below 0.1, the amount of discardable data then being less than 10%. For instance, a movie trailer may include scenes with a duration of one second, between which a 0.1-second cross-fade is used; the ratio of the duration of the scene transition to the duration of the entire scene is then exactly 0.1, which corresponds to a portion of 9% of discardable data. Similarly, a news clip may include for instance 5-second scenes, which are cross-faded to the next scene during 0.3 seconds. In this case the ratio of the duration of the scene transition to the length of the entire scene is 0.06, corresponding to a portion of about 6% of discardable data.
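A quick numerical check of these figures under formula (1) (a throwaway sketch; the function name is an assumption):

```python
def discardable_share(d_crossfade, t_scene):
    """Formula (1): portion of frame data discardable per scene."""
    return d_crossfade / (d_crossfade + t_scene)

print(discardable_share(0.1, 1.0))  # movie trailer: ~0.091, about 9%
print(discardable_share(0.3, 5.0))  # news clip: ~0.057, about 6%
print(discardable_share(1.0, 1.0))  # fade as long as the scene: 0.5
```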
The graph of Fig. 5 further shows that, at its maximum, half of the data is discardable. The image frames combined during the cross-fade are calculated using weighted averages, which thus resembles the above prediction of B pictures by weighting according to the distances of each frame from the image frames used as anchor frames.
In accordance with a preferred embodiment of the invention, data can be discarded from a video sequence during a scene transition according to the above formula as transition filtering known per se, such as SMIL filtering (Synchronized Multimedia Integration Language). The SMIL 2.0 standard presents means for transition filtering of, for instance, image and video files. The filtering process either uses one source, or the filtering is determined to take place between two sources, on the basis of which the filtering output is determined for a given range in the image frame. The filter determines the transition between the origin media and the destination media by denoting the origin media by the value 0.0 and the destination media by the value 1.0. This enables the filtering process and the desired result to be determined by setting a suitable value for said parameter.
The SMIL 2.0 standard presents a plurality of different filtering effects that are applicable to the transition filtering according to the invention. In accordance with a preferred embodiment of the invention, the properties of the filters, particularly said parameter determining the transition, are determined in accordance with formula 1. In addition, the desired filtering effect affects the type of filter used. A detailed description of the SMIL 2.0 standard is found in the specification 'The SMIL 2.0 Transition Effects Module', W3C, 7 August 2001.
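A sketch of driving such a transition parameter over a fade (a hypothetical helper in the spirit of SMIL's 0.0-to-1.0 progress value, not an API of any SMIL implementation):

```python
def transition_progress(t, t_start, duration):
    """Map presentation time to a SMIL-style transition parameter:
    0.0 shows only the origin media (the ending scene), 1.0 only the
    destination media (the beginning scene)."""
    return min(max((t - t_start) / duration, 0.0), 1.0)

# A 0.3-second cross-fade starting at t = 5.0 s, sampled at 0.1 s steps:
samples = [transition_progress(5.0 + 0.1 * i, 5.0, 0.3) for i in range(4)]
# -> [0.0, 0.333..., 0.666..., 1.0]
```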
Consequently, coding a scene transition according to the invention is not limited only to the above examples and to a cross-faded or abrupt scene transition, but, in principle, the invention is applicable to any type of scene transition. Accordingly, the invention is applicable for instance to the previously mentioned tiling, roll, push, door or different zoomings. In principle, the procedure is the same in all scene transitions: determining the opacity and depth values for each frame during the scene transition, as well as the filter type and the effect to be used for the scene transition.
The above describes a method of coding a scene transition as a scalable video sequence. In concrete terms, this is performed in a video encoder, which may be a video encoder known per se. The video encoder used could be for instance a video encoder according to ITU-T recommendation H.263 or H.26L, which, in accordance with the invention, is arranged to determine that the presentation time of at least one video frame of the first scene is equal to the presentation time of at least one video frame of the second scene during the scene transition, said video frames thus being scene transition video frames, to define scene transition information for at least one video frame of at least one scene for generating a scene transition with a decoder, to code said at least one scene transition video frame of the first scene in the encoder into a video sequence, to code at least said one scene transition video frame of the second scene in the encoder into the video sequence, and to code said scene transition information in the encoder into the video sequence.
Correspondingly, decoding takes place in a video decoder, which may be a video decoder known per se. The video decoder used could be for instance a low bit rate video decoder according to ITU-T recommendation H.263 or H.26L, which, in accordance with the invention, is arranged to receive at the decoder a coded video frame of a first scene, a coded video frame of a second scene and coded scene transition information, to decode the coded video frame of the first scene, the coded video frame of the second scene and the coded scene transition information, and to generate a scene transition by using the decoded video frame of the first scene, the decoded video frame of the second scene and the decoded scene transition information.
The different parts of video-based telecommunication systems, particularly terminals, may comprise properties enabling bi-directional transfer of multimedia files, i.e. both transmission and reception of files. This allows the encoder and decoder to be implemented as a video codec comprising the functionalities of both an encoder and a decoder.
It is to be noted that the functional elements of the invention in the above video encoder, video decoder and terminal can preferably be implemented as software, hardware or a combination of the two. The coding and decoding methods of the invention are particularly well suited to be implemented as computer software comprising computer-readable commands for carrying out the functional steps of the invention. The encoder and decoder can preferably be implemented as software code stored on storage means and executable by a computer-like device, such as a personal computer (PC) or a mobile station, for achieving the coding/decoding functionalities with said device.
Fig. 6 shows a block diagram of a mobile station MS according to a preferred embodiment of the invention. In the mobile station MS, a central processing unit CPU controls the blocks responsible for the various functions of the MS: a memory MEM, typically comprising both random access memory RAM and read-only memory ROM, a radio frequency part comprising a transmitter/receiver TX/RX, a video codec CODEC and a user interface UI. The user interface comprises a keyboard KB, a display DP, a speaker SP and a microphone MF. The CPU is a microprocessor, or, in alternative embodiments, some other kind of processor, such as a digital signal processor. Advantageously, the operating instructions of the CPU have been stored beforehand in the ROM. In accordance with its instructions (i.e. a computer program), the CPU uses the radio frequency block for transmitting and receiving data over a radio path. The video codec may be either hardware based or fully or partly software based, in which case the CODEC comprises computer programs for controlling the CPU to perform the video encoding and decoding functions explained above. The CPU uses the RAM as its working memory. Furthermore, the mobile station MS can advantageously include a video camera CAM, whereby the mobile station can capture motion video with the video camera. The captured motion video is then encoded and compressed using the CPU, the RAM and CODEC-based software. The radio frequency block is then used to exchange encoded video with other parties.
The invention can also be implemented as a video signal comprising at least a first and a second scene, the first scene being an ending scene and the second a beginning scene, at least one of the scenes comprising independently decodable video frames coded in accordance with at least a first frame format, and video frames coded in accordance with a second frame format, at least one of the video frames according to the second frame format being predicted from at least one other video frame. Such a video signal comprises scene transition information for at least one video frame of at least one scene for generating a scene transition with a decoder.
It is obvious to a person skilled in the art that as technology advances, the basic idea of the invention can be implemented in a variety of ways. The invention and its embodiments are thus not limited to the above examples, but may vary within the scope of the claims.

Claims

1. A method of generating a scene transition in a video sequence between at least a first and a second scene, the first scene being an ending scene and the second a beginning scene, at least one of said scenes comprising independently decodable video frames coded in accordance with at least a first frame format, and video frames coded in accordance with a second frame format, at least one of the video frames according to the second frame format being predicted from at least one other video frame, characterized by determining that the presentation time of at least one video frame of the first scene is equal to the presentation time of at least one video frame of the second scene during the scene transition, said video frames being scene transition video frames, determining scene transition information for at least one video frame of at least one scene for generating a scene transition with a decoder, coding at least said one scene transition video frame of the first scene in an encoder into a video sequence, coding at least said one scene transition video frame of the second scene in the encoder into the video sequence, and coding said scene transition information in the encoder into the video sequence.
2. A method as claimed in claim 1, characterized by frames of the first and second scenes being placed on different scalability layers comprising at least a base layer and a first enhancement layer.
3. A method as claimed in claim 2, characterized by coding at least the scene transition video frames of at least the first and second scenes onto the different scalability layers in the video sequence.
4. A method as claimed in any one of the preceding claims, characterized by coding the scene transition into the video sequence such that it comprises a scene transition video frame of at least said one first scene and a scene transition video frame of at least said one second scene, whose image frame information is mixed according to a ratio predetermined by said scene transition information.
5. A method as claimed in claim 4, characterized by mixing said scene transition video frames by filtering the image frame information comprised thereby.
6. A method as claimed in claim 5, characterized by performing said filtering as transition filtering, such as SMIL filtering.
7. A method as claimed in any one of the preceding claims, characterized by the scene transition to be coded into the video sequence being at least one of the following: a cross-faded scene transition, tiling, rolling, pushing, zooming.
8. A video encoder for generating a scene transition in a video sequence between at least a first and a second scene, the first scene being an ending scene and the second a beginning scene, at least one of said scenes comprising independently decodable video frames coded in accordance with at least a first frame format, and video frames coded in accordance with a second frame format, at least one of the video frames according to the second frame format being predicted from at least one other video frame, characterized in that the video encoder is arranged to determine that the presentation time of at least one video frame of the first scene is equal to the presentation time of at least one video frame of the second scene during the scene transition, said video frames being scene transition video frames, determine scene transition information for at least one video frame of at least one scene for generating a scene transition with a decoder, code at least said one scene transition video frame of the first scene in the encoder into a video sequence, code at least said one scene transition video frame of the second scene in the encoder into the video sequence, and code said scene transition information in the encoder into the video sequence.
9. A video encoder as claimed in claim 8, characterized in that the video encoder is arranged to place frames of the first and second scenes on different scalability layers comprising at least a base layer and a first enhancement layer.
10. A mobile station, characterized in that the mobile station comprises a video encoder as claimed in claim 8.
11. Computer software for generating a scene transition in a video sequence between at least a first and a second scene, the first scene being an ending scene and the second a beginning scene, at least one of said scenes comprising independently decodable video frames coded in accordance with at least a first frame format, and video frames coded in accordance with a second frame format, at least one of the video frames according to the second frame format being predicted from at least one other video frame, characterized in that the computer software comprises software means for determining that the presentation time of at least one video frame of the first scene is equal to the presentation time of at least one video frame of the second scene during the scene transition, said video frames being scene transition video frames, software means for determining scene transition information for at least one video frame of at least one scene for generating a scene transition with a decoder, software means for coding at least said one scene transition video frame of the first scene in an encoder into the video sequence, software means for coding at least said one scene transition video frame of the second scene in the encoder into the video sequence, and software means for coding said scene transition information in the encoder into the video sequence.
12. A method of decoding a scene transition from a video sequence between at least a first and a second scene, the first scene being an ending scene and the second a beginning scene, at least one of said scenes comprising independently decodable video frames coded in accordance with at least a first frame format, and video frames coded in accordance with a second frame format, at least one of the video frames according to the second frame format being predicted from at least one other video frame, characterized by receiving in a decoder a coded video frame of the first scene, a coded video frame of the second scene and coded scene transition information, decoding the coded video frame of the first scene, decoding the coded video frame of the second scene, decoding the coded scene transition information, and generating a scene transition by using the decoded video frame of the first scene, the decoded video frame of the second scene and the decoded scene transition information.
13. A method as claimed in claim 12, characterized by decoding the scene transition from the video sequence such that it comprises a scene transition video frame of at least said one first scene and a scene transition video frame of at least said one second scene, whose image frame information is mixed according to a ratio determined by said decoded scene transition information.
14. A method as claimed in claim 12 or 13, characterized by initiating the access of the video sequence at a random point in said video sequence, determining the scene transition following said random point, and initiating decoding from the first independently decodable group of pictures of the beginning scene in connection with the scene transition.
15. A decoder for decoding a scene transition from a video sequence between at least a first and a second scene, the first scene being an ending scene and the second a beginning scene, at least one of said scenes comprising independently decodable video frames coded in accordance with at least a first frame format, and video frames coded in accordance with a second frame format, at least one of the video frames according to the second frame format being predicted from at least one other video frame, characterized in that the decoder is arranged to receive a coded video frame of a first scene, a coded video frame of a second scene and coded scene transition information, decode the coded video frame of the first scene, decode the coded video frame of the second scene, decode the coded scene transition information, and generate a scene transition by using the decoded video frame of the first scene, the decoded video frame of the second scene and the decoded scene transition information.
16. A mobile station, characterized in that the mobile station comprises a video decoder as claimed in claim 15.
17. Computer software for decoding a scene transition from a video sequence between at least a first and a second scene, the first scene being an ending scene and the second a beginning scene, at least one of said scenes comprising independently decodable video frames coded in accordance with at least a first frame format, and video frames coded in accordance with a second frame format, at least one of the video frames according to the second frame format being predicted from at least one other video frame, characterized in that the software comprises software means for receiving in a decoder a coded video frame of a first scene, a coded video frame of a second scene and coded scene transition information, software means for decoding the coded video frame of the first scene, software means for decoding the coded video frame of the second scene, software means for decoding the coded scene transition information, and software means for generating a scene transition by using the decoded video frame of the first scene, the decoded video frame of the second scene and the decoded scene transition information.
18. A video signal comprising at least a first and a second scene, the first scene being an ending scene and the second a beginning scene, at least one of said scenes comprising independently decodable video frames coded in accordance with at least a first frame format, and video frames coded in accordance with a second frame format, at least one of the video frames according to the second frame format being predicted from at least one other video frame, characterized in that the video signal comprises scene transition information for at least one video frame of at least one scene for generating a scene transition with a decoder.
PCT/FI2003/000052 2002-01-23 2003-01-22 Coding scene transitions in video coding WO2003063482A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP03700320A EP1468558A1 (en) 2002-01-23 2003-01-22 Coding scene transitions in video coding

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20020128 2002-01-23
FI20020128A FI114433B (en) 2002-01-23 2002-01-23 Coding of a stage transition in video coding

Publications (1)

Publication Number Publication Date
WO2003063482A1 true WO2003063482A1 (en) 2003-07-31

Family

ID=8562893

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2003/000052 WO2003063482A1 (en) 2002-01-23 2003-01-22 Coding scene transitions in video coding

Country Status (4)

Country Link
US (2) US7436886B2 (en)
EP (1) EP1468558A1 (en)
FI (1) FI114433B (en)
WO (1) WO2003063482A1 (en)

Families Citing this family (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7003035B2 (en) 2002-01-25 2006-02-21 Microsoft Corporation Video coding methods and apparatuses
US20040001546A1 (en) 2002-06-03 2004-01-01 Alexandros Tourapis Spatiotemporal prediction for bidirectionally predictive (B) pictures and motion vector prediction for multi-picture reference motion compensation
US7769084B1 (en) 2002-07-15 2010-08-03 Apple Inc. Method for implementing a quantizer in a multimedia compression and encoding system
US7418037B1 (en) * 2002-07-15 2008-08-26 Apple Inc. Method of performing rate control for a compression system
US7154952B2 (en) 2002-07-19 2006-12-26 Microsoft Corporation Timestamp-independent motion vector prediction for predictive (P) and bidirectionally predictive (B) pictures
US7804897B1 (en) 2002-12-16 2010-09-28 Apple Inc. Method for implementing an improved quantizer in a multimedia compression and encoding system
US7940843B1 (en) 2002-12-16 2011-05-10 Apple Inc. Method of implementing improved rate control for a multimedia compression and encoding system
US7609763B2 (en) * 2003-07-18 2009-10-27 Microsoft Corporation Advanced bi-directional predictive coding of video frames
US8064520B2 (en) 2003-09-07 2011-11-22 Microsoft Corporation Advanced bi-directional predictive coding of interlaced video
US20090142039A1 (en) * 2003-12-26 2009-06-04 Humax Co., Ltd. Method and apparatus for recording video data
KR100583518B1 (en) * 2003-12-26 2006-05-24 주식회사 휴맥스 Method for setting record quality in digital recording device
US8315307B2 (en) * 2004-04-07 2012-11-20 Qualcomm Incorporated Method and apparatus for frame prediction in hybrid video compression to enable temporal scalability
US20060013305A1 (en) * 2004-07-14 2006-01-19 Sharp Laboratories Of America, Inc. Temporal scalable coding using AVC coding tools
KR100679018B1 (en) * 2004-09-07 2007-02-05 삼성전자주식회사 Method for multi-layer video coding and decoding, multi-layer video encoder and decoder
JP2008521265A (en) * 2004-11-04 2008-06-19 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Method and apparatus for processing encoded video data
US7466867B2 (en) * 2004-11-26 2008-12-16 Taiwan Imagingtek Corporation Method and apparatus for image compression and decompression
US8548055B2 (en) * 2005-03-10 2013-10-01 Qualcomm Incorporated Encoding of multimedia data
US8155189B2 (en) * 2005-10-19 2012-04-10 Freescale Semiconductor, Inc. System and method of coding mode decision for video encoding
KR100772868B1 (en) * 2005-11-29 2007-11-02 삼성전자주식회사 Scalable video coding based on multiple layers and apparatus thereof
US7831727B2 (en) * 2006-09-11 2010-11-09 Apple Computer, Inc. Multi-content presentation of unassociated content types
US7743341B2 (en) * 2006-09-11 2010-06-22 Apple Inc. Rendering icons along a multidimensional path having a terminus position
US7747968B2 (en) * 2006-09-11 2010-06-29 Apple Inc. Content abstraction presentation along a multidimensional path
US7930650B2 (en) 2006-09-11 2011-04-19 Apple Inc. User interface with menu abstractions and content abstractions
US7984377B2 (en) 2006-09-11 2011-07-19 Apple Inc. Cascaded display of video media
US8099665B2 (en) * 2006-09-11 2012-01-17 Apple Inc. Organizing and sorting media menu items
US7743338B2 (en) * 2006-09-11 2010-06-22 Apple Inc. Image rendering with image artifact along a multidimensional path
US7853972B2 (en) 2006-09-11 2010-12-14 Apple Inc. Media preview user interface
US20080095228A1 (en) * 2006-10-20 2008-04-24 Nokia Corporation System and method for providing picture output indications in video coding
US8630346B2 (en) * 2007-02-20 2014-01-14 Samsung Electronics Co., Ltd System and method for introducing virtual zero motion vector candidates in areas of a video sequence involving overlays
WO2008115195A1 (en) * 2007-03-15 2008-09-25 Thomson Licensing Methods and apparatus for automated aesthetic transitioning between scene graphs
US8254455B2 (en) 2007-06-30 2012-08-28 Microsoft Corporation Computing collocated macroblock information for direct mode macroblocks
US8335259B2 (en) * 2008-03-12 2012-12-18 Packetvideo Corp. System and method for reformatting digital broadcast multimedia for a mobile device
US8224775B2 (en) * 2008-03-31 2012-07-17 Packetvideo Corp. System and method for managing, controlling and/or rendering media in a network
WO2010065107A1 (en) * 2008-12-04 2010-06-10 Packetvideo Corp. System and method for browsing, selecting and/or controlling rendering of media with a mobile device
US8189666B2 (en) 2009-02-02 2012-05-29 Microsoft Corporation Local picture identifier and computation of co-located information
US20100201870A1 (en) * 2009-02-11 2010-08-12 Martin Luessi System and method for frame interpolation for a compressed video bitstream
US20100302255A1 (en) * 2009-05-26 2010-12-02 Dynamic Representation Systems, LLC-Part VII Method and system for generating a contextual segmentation challenge for an automated agent
US11647243B2 (en) 2009-06-26 2023-05-09 Seagate Technology Llc System and method for using an application on a mobile device to transfer internet media content
US20120210205A1 (en) 2011-02-11 2012-08-16 Greg Sherwood System and method for using an application on a mobile device to transfer internet media content
US9195775B2 (en) 2009-06-26 2015-11-24 Iii Holdings 2, Llc System and method for managing and/or rendering internet multimedia content in a network
US9565479B2 (en) * 2009-08-10 2017-02-07 Sling Media Pvt Ltd. Methods and apparatus for seeking within a media stream using scene detection
US20110183651A1 (en) * 2010-01-28 2011-07-28 Packetvideo Corp. System and method for requesting, retrieving and/or associating contact images on a mobile device
US8798777B2 (en) 2011-03-08 2014-08-05 Packetvideo Corporation System and method for using a list of audio media to create a list of audiovisual media
WO2014160380A1 (en) * 2013-03-13 2014-10-02 Deja.io, Inc. Analysis platform of media content metadata
WO2015009676A1 (en) * 2013-07-15 2015-01-22 Sony Corporation Extensions of motion-constrained tile sets sei message for interactivity
EP3092806A4 (en) 2014-01-07 2017-08-23 Nokia Technologies Oy Method and apparatus for video coding and decoding
KR20170054900A (en) * 2015-11-10 2017-05-18 Samsung Electronics Co., Ltd. Display apparatus and control method thereof
US10136194B2 (en) * 2016-07-06 2018-11-20 Cisco Technology, Inc. Streaming piracy detection method and system
CN112312201B (en) * 2020-04-09 2023-04-07 Beijing Wodong Tianjun Information Technology Co., Ltd. Method, system, device and storage medium for video transition
CN113542847B (en) * 2020-04-21 2023-05-02 Douyin Vision Co., Ltd. Image display method, device, equipment and storage medium
US11240540B2 (en) * 2020-06-11 2022-02-01 Western Digital Technologies, Inc. Storage system and method for frame trimming to optimize network bandwidth
US20220279185A1 (en) * 2021-02-26 2022-09-01 Lemon Inc. Methods of coding images/videos with alpha channels
CN113115054B (en) * 2021-03-31 2022-05-06 Hangzhou Hikvision Digital Technology Co., Ltd. Video stream encoding method, device, system, electronic device and storage medium
WO2022204619A1 (en) * 2021-11-08 2022-09-29 Innopeak Technology, Inc. Online detection for dominant and/or salient action start from dynamic environment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5703995A (en) * 1996-05-17 1997-12-30 Willbanks; George M. Method and system for producing a personalized video recording
US5966162A (en) * 1996-10-25 1999-10-12 Diva Systems Corporation Method and apparatus for masking the effects of latency in an interactive information distribution system
WO2001047283A1 (en) * 1999-12-22 2001-06-28 General Instrument Corporation Video compression for multicast environments using spatial scalability and simulcast coding
EP1132812A1 (en) * 2000-03-07 2001-09-12 Lg Electronics Inc. Method of detecting dissolve/fade in mpeg-compressed video environment
US6337881B1 (en) * 1996-09-16 2002-01-08 Microsoft Corporation Multimedia compression system with adaptive block sizes
US20020167608A1 (en) * 2001-04-20 2002-11-14 Semko Szybiak Circuit and method for live switching of digital video programs containing embedded audio data

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05101609A (en) * 1990-09-28 1993-04-23 Digital Fx Inc System and method for editing video tape
CO4440458A1 (en) 1995-06-05 1997-05-07 Kimberly Clark Co Temporary marking, ultraviolet radiation detection, and printing, using photoerasable dyes
EP0882358A2 (en) 1996-02-20 1998-12-09 SAS Institute, Inc. Method and apparatus for transitions, reverse play and other special effects in digital motion video
US6360234B2 (en) * 1997-08-14 2002-03-19 Virage, Inc. Video cataloger system with synchronized encoders
US6301428B1 (en) 1997-12-09 2001-10-09 Lsi Logic Corporation Compressed video editor with transition buffer matcher
US6912251B1 (en) * 1998-09-25 2005-06-28 Sarnoff Corporation Frame-accurate seamless splicing of information streams
US6658157B1 (en) * 1999-06-29 2003-12-02 Sony Corporation Method and apparatus for converting image information
US6614936B1 (en) * 1999-12-03 2003-09-02 Microsoft Corporation System and method for robust video coding using progressive fine-granularity scalable (PFGS) coding

Also Published As

Publication number Publication date
FI20020128A0 (en) 2002-01-23
EP1468558A1 (en) 2004-10-20
US20030142751A1 (en) 2003-07-31
US7436886B2 (en) 2008-10-14
FI114433B (en) 2004-10-15
FI20020128A (en) 2003-07-24
US20090041117A1 (en) 2009-02-12

Similar Documents

Publication Publication Date Title
US7436886B2 (en) Coding scene transitions in video coding
US8050321B2 (en) Grouping of image frames in video coding
KR100698938B1 (en) Grouping of image frames in video coding
US6909747B2 (en) Process and device for coding video images
US20120200663A1 (en) Method and Apparatus For Improving The Average Image Refresh Rate in a Compressed Video Bitstream
US20060239563A1 (en) Method and device for compressed domain video editing
US20180077385A1 (en) Data, multimedia & video transmission updating system
JP2004502356A (en) Multicast transmission system including bandwidth scaler
EP1280356A2 (en) Apparatus and method for compressing multiplexed video signals
US20020021753A1 (en) Video signal coding method and video signal encoder
JP2004015351A (en) Encoding apparatus and method, program, and recording medium
KR100944540B1 (en) Method and Apparatus for Encoding using Frame Skipping
CN116193048A (en) Video processing method and video processing device
JPH10243403A (en) Dynamic image coder and dynamic image decoder
KR20110026952A (en) Video coding method using image data copy

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: The EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2003700320

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2003700320

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP