WO2006111912A2

WO2006111912A2 - Device and method for identifying a segment boundary

Info

Publication number: WO2006111912A2
Application number: PCT/IB2006/051172
Authority: WO
Inventors: Freddy Snijder; Luca Giovacchini
Original assignee: Koninklijke Philips Electronics N.V.
Priority date: 2005-04-18
Filing date: 2006-04-14
Publication date: 2006-10-26
Also published as: WO2006111912A3

Abstract

The method of identifying a segment boundary in a content item comprises the steps of determining a first value for each of a plurality of content features of the content item at a first plurality of time instances, determining a second value for each of the plurality of content features at a second plurality of time instances, and identifying a segment boundary in dependence on a difference between the first value and the second value exceeding a certain threshold for at least one of the plurality of content features. The electronic device of the invention is operative to perform the method of the invention, e.g. by executing software instructions.

Description

Device and method for identifying a segment boundary

The invention relates to an electronic device, comprising electronic circuitry operative to identify a segment boundary in a content item.

The invention further relates to a method of identifying a segment boundary in a content item. The invention also relates to software for making a programmable device operative to perform a method of identifying a segment boundary in a content item.

An embodiment of this method is known from US 6,125,229. The known method creates a segmentation for a source video for the purpose of creating a visual index. The known method detects significant scenes by detecting when a scene of video has changed or a static scene has occurred. Two consecutive frames are compared, and if the frames are determined to be significantly different, a scene change is determined to have occurred between the two frames; and if determined to be significantly alike, processing is performed to determine if a static scene has occurred. A first drawback of the known method is that the created visual index does not allow a user to both find a particular part in the source video and get a broad overview of the source video. A second drawback of the known method is that it does not correctly segment certain types of video items.

It is a first object of the invention to provide an electronic device operative to create a segmentation for a content item, which allows a user to both find a particular part in a content item and get a broad overview of the content item.

The first object is according to the invention realized in that the electronic circuitry is operative to select a segmentation parameter based on a user input, the user input indicating a granularity of desired segmentation of a content item, and create a segmentation for the content item using the segmentation parameter, the segmentation comprising a plurality of segments. Advantageously, the electronic device may allow the user to select granularity levels, e.g. 'fine', 'average' or 'coarse', or a desired number of segments to allow him to both find a particular part in the content item (e.g. by using the 'fine' level) and get a broad overview of the content item (e.g. by selecting the 'coarse' level). The segmentation parameter may comprise a time difference or a difference threshold, for example. If the user selects a desired number of segments, the electronic device may sequentially create multiple segmentations until the desired number of segments is at least approximately reached.

The electronic device may be a consumer or professional device. The electronic device may be, for example, a video player, a video recorder, a TV, a radio, a set top box, a mobile phone or a portable music player. The electronic circuitry may be an general purpose or an application specific processor, for example, a Philips TriMedia processor, an AMD Athlon CPU or a Intel Pentium CPU. The electronic device may comprise a storage means, e.g. an optical disc reader/writer or a hard disk. Alternatively or additionally, a storage means may be located outside the electronic device. In an embodiment of this electronic device, the electronic circuitry is further operative to create a plurality of segmentations for the content item, each segmentation being created using a different segmentation parameter and each segmentation comprising a plurality of segments, and align a first segment boundary of a first one of the segmentations at a first time instance with a second segment boundary of a second one of the segmentations at a second time instance if a difference between the first time instance and the second time instance does not exceed a certain threshold. Advantageously, this ensures that the transition between granularities is better understandable for users. A user may not understand that segment boundaries are generally only approximations/estimates of the real segment boundaries and may be different depending on the used method of segmentation and the used segmentation parameters.

The selected segmentation parameter may correspond to an associated time difference and the electronic circuitry may be operative to create a segmentation for the content item by comparing a value of at least one content feature at at least a first time instance with a value of the at least one content feature at at least a second time instance, the first time instance and second time instance having a time difference which equals the associated time difference. Herewith, a better performance is achieved than by adapting the generally used segmentation methods. Many known segmentation methods compare features of consecutive frames. In these known segmentation methods, the difference threshold (i.e. the device determines that a shot/scene-cut is present when the difference between a feature of a next frame and the feature of the previous frame exceeds this threshold) could also be adapted to create a granularity in segmentation.

Both the at least first time instance and the at least second time instance may each have a duration substantially similar to the associated time difference. Experiments have shown that this results in the best performance. It is a second object of the invention to provide a method of creating a segmentation for content item, which allows a user to both find a particular part in a content item and get a broad overview of the content item. The second object is according to the invention realized in that the method comprises the steps of selecting a segmentation parameter based on a user input, the user input indicating a granularity of desired segmentation of the content item and creating the segmentation for the content item using the segmentation parameter, the segmentation comprising a plurality of segments. The method of the invention enables browsing through a content item at multiple time-scales. The method doesn't need to focus on a specific type of segment detection such as scene detection; at any timescale it could try to segment any type of segment, e.g. scenes, programs, commercial blocks, news item, music video clip, etc. However, at higher time-scales the method will more likely find segments that typically have a longer duration range (e.g. programs) and at lower timescales the methods will more likely find segments with a typically sorter duration range (e.g. scenes).

The segmentation method can be used to generically browse through a content item, in a semantically meaningful way, at different timescales. This is done by optimizing the segmentation system, for a set of timescales, to detect a diverse set of meaningful segment boundary combined (in other words the boundaries are considered to be equal). The diverse set of boundaries could contain Scene boundaries, Commercial (block) boundaries, Event boundaries (e.g. start of shouting or fighting event in scene), Music video clip boundaries, News item boundaries, News reportage boundaries and/or Series intro boundaries. The cursors on a remote control could be used to browse back and forward at some timescale (e.g. left- and right cursor) and change timescale to a higher or lower timescale (up- and down-cursor). The browsing can be accompanied by an on-screen color bar with a coarseness adapted to the browsing/segmentation timescale. The colors of the color bar could be mappings of the feature behavior calculated at the current timescale. In this way the color roughly represents what type of content is contained in the current segment. The browsing can alternatively or additionally be accompanied by onscreen key frames; for each segment, at the current timescale, one key frame. The playback of the current segment is indicated by the middle key frame in a sequence of displayed key frames or the highlighted key frame in a sequence of displayed key frames. The segmentation method can also be used for specific applications such as scene segmentation ("Go to start or end of scene"), commercial (block) segmentation ("Go to start or end of commercial (block)"), music video clip segmentation ("Go to start or end of music video clip"), News item segmentation ("Go to start or end of news item", "Go to start or end of news reportage") and/or Series intro segmentation, such as the title song of the Star Trek Voyager TV series ("Skip the intro"). In this case the method is optimized to detect a particular kind of boundary. This is done by optimizing the used audio/video features, used timescale, boundary detection model and detection threshold.

It is a third object of the invention to provide an electronic device of the kind described in the opening paragraph, which can achieve a good segmentation quality for all types of video content.

The third object is according to the invention realized in that the electronic circuitry is operative to determine a first value for each of a plurality of content features of a content item at a first plurality of time instances, determine a second value for each of the plurality of content features at a second plurality of time instances, and identify a segment boundary in dependence on a difference between the first value and the second value exceeding a certain threshold for at least one of the plurality of content features. This electronic device provides a generic method of segmenting content items, without the implementation being specifically adapted to a certain type of segment boundary. This allows an equipment manufacturer or service provider to provide a regional model and/or to update the model remotely (e.g. in EPG metadata) without having to update the software. This invention is based on the insight gained from experiments that a first combination of features may be different between frames and represent a segment boundary and a second combination of features may be different and not represent a segment boundary. This insight has been applied to a generic model of segmenting content items.

The electronic device may be a consumer or professional device. The electronic device may be, for example, a video player, a video recorder, a TV, a radio, a set top box, a mobile phone or a portable music player. The electronic circuitry may be a general purpose or an application specific processor, for example, a Philips TriMedia processor, an AMD Athlon CPU or an Intel Pentium CPU. The electronic device may comprise a storage means, e.g. an optical disc reader/writer or a hard disk. Alternatively or additionally, the electronic device may be able to access an external storage means. In an embodiment of the electronic device, the electronic circuitry is further operative to determine at least a subset of the plurality of content features for which a difference between the first value and the second value exceeds a certain threshold, compare the determined subset of content features with a plurality of predefined subsets of content features, and identify a segment boundary if the determined subset comprises all content features of at least one of the predefined subsets. This embodiment is advantageous, because most of the situations encountered in practice are complex and some features can be useless in some cases and useful in others.

Both the at least first plurality of time instances and the at least second plurality of time instances may each have a duration substantially similar to a time difference between the first plurality of time instances and the second plurality of time instances. Experiments have shown that this results in the best performance.

The electronic circuitry may further be operative to select a segmentation parameter based on a user input, the user input indicating a granularity of desired segmentation of the content item and the segmentation parameter comprising at least one of a time difference between the first plurality of time instances and the second plurality of time instances and a difference threshold. Advantageously, the electronic device may allow the user to select granularity levels, e.g. 'fine', 'average' or 'coarse', or a desired number of segments to allow him to both find a particular part in the content item (e.g. by using the 'fine' level) and get a broad overview of the content item (e.g. by selecting the 'coarse' level). If the user selects a desired number of segments, the electronic device may sequentially create multiple segmentations until the desired number of segments is at least approximately reached.

The electronic circuitry may further be operative to create a plurality of segmentations for the content item, each segmentation being created using a different segmentation parameter and each segmentation comprising a plurality of segments and align a first segment boundary of a first one of the segmentations at a first time instance with a second segment boundary of a second one of the segmentations at a second time instance if a difference between the first time instance and the second time instance does not exceed a certain threshold. Advantageously, this ensures that the transition between granularities is better understandable for users. A user may not understand that segment boundaries are generally only approximations/estimates of the real segment boundaries and may be different depending on the used method of segmentation and the used segmentation parameters. It is a fourth object of the invention to provide a method of the kind described in the opening paragraph, which can achieve a good segmentation quality for all types of video content.

The fourth object is according to the invention realized in that the method comprises the steps of determining a first value for each of a plurality of content features of a content item at a first plurality of time instances, determining a second value for each of the plurality of content features at a second plurality of time instances, and identifying a segment boundary in dependence on a difference between the first value and the second value exceeding a certain threshold for at least one of the plurality of content features. In order for the user to browse through a content item in a meaningful manner the applied segmentation(s) preferably has some semantic meaning. For instance, it is more logical and convenient to have a segmentation boundary at the start of a commercial (block) then at some random point within a commercial (block). Bridging what is called the semantic gap, using (low-level) audio/video features to detect semantically meaningful segmentation boundaries is a very difficult task. Many methods for content segmentation (also called story unit segmentation) are based on the usage of a specific set of low- level audio/video features that are combined in a dedicated manner, making the methods harder to extend with new and/or semantically more meaningful features and thus also harder to improve.

In an embodiment of the method, the method further comprises the steps of determining at least a subset of the plurality of content features for which a difference between the first value and the second value exceeds a certain threshold and comparing the determined subset of content features with a plurality of predefined subsets of content features, and the step of identifying a segment boundary comprises identifying a segment boundary if the determined subset comprises all content features of at least one of the predefined subsets.

The method may further comprise the step of selecting a segmentation parameter based on a user input, the user input indicating a granularity of desired segmentation of the content item and the segmentation parameter comprising at least one of a time difference between the first plurality of time instances and the second plurality of time instances and a difference threshold. This step enables browsing through a content item at multiple time-scales as described for the method achieving the second object of the invention. These and other aspects of the devices and methods of the invention will be further elucidated and described with reference to the drawings, in which:

Fig. 1 shows equations used in an embodiment of the method of the invention; Fig. 2 shows a first example in which the equations of Fig. 1 are applied to one content feature.

Fig. 3 shows a second example in which the equations of Fig. 1 are applied to one content feature; and

Fig. 4 shows a third example in which the equations of Fig. 1 are applied to multiple content features; Corresponding elements within the drawings are identified by the same reference numeral.

An embodiment of the two methods of the invention is described below. This segmentation method provides a solution for coarse to fine browsing (and visa- versa) in a semantically meaningful way by defining a segmentation method able to segment content at multiple (definable) timescales (coarse to fine browsing (and visa- versa)) and able to use any (extendable) set of extracted audio/video features (more powerful possibilities to detect semantically meaningful boundaries) The segmentation method may comprise a few processing steps:

1. Per time instance, extract audio/video features from audio-visual content

2. Normalize the extracted continuous audio/video features

3. For all defined (selected) timescales a. Per time instance, over a window of multiple time instances, calculate behavior features from audio/video features at current timescale b. Per time instance, calculate behavior difference features c. Per time instance, evaluate the segment boundary detection model, designed for this timescale, using the calculated behavior difference features. The output of the detection model resulting from the evaluation is a continuous value that can be interpreted as a confidence measure for the presence of a segmentation boundary at the current time instance. d. Per time instance, over a window multiple time instances, filter the continuous model output to decide if there is a segment boundary present and at what position, within this window of multiple time instances. 4. Multi timescale segment boundary detection alignment: a. Optional - Align segment boundary detections at lowest timescale with shotcuts: segments will start at the beginning of a shot b. Align the boundary detections at multiple timescales: detections at higher timescales are aligned with detections at lower timescales because the boundary position detection is more accurate at lower timescales. Note that, if at the lowest timescale the detections are shotcut-aligned, this will lead to boundary detections aligned with shotcuts at all timescales.

In principal any set of audio/video features can be used for this content segmentation method. These can be low- level video feature types such as average and variance of video frame luminance and color components, average, variance, maximum and minimum of X- and y-direction motion in video frame, average and variance of video frame complexity (bitrate x quantizer scale), average and variance of video frame Mean Absolute Difference (MAD) error, and number of I-, B- and/or -P macro blocks in a video frame (in the domain of MPEG video encoding). MAD is a product of the block matching process used for motion estimation during (MPEG) video encoding.

These low- level video features could be calculated for the whole video frame or selected parts of the frame, such as, the letterbox areas or a central frame area and surrounding frame area.

Types of low- level audio features could, for instance, be audio signal root- mean-square (RMS) level, spectral centroid, bandwidth, zero-crossing rate, spectral roll-off frequency, band energy ratio, delta spectrum magnitude, pitch and pitch strength, ERB- or Mel-Frequency Cepstral Coefficients (EFFCs or MFCCs) related features and Auditory Filter bank Temporal Envelope (AFTE) features.

Examples of higher- level feature types are Indoor/outdoor detection in video, Nature/urban(city) landscape, Speech/no speech, Silence/no silence and Crowd noise/no crowd noise.

Note that usually low- level audio feature extraction is done more often then low-level video feature extraction. In one embodiment, multiple low-level audio features, extracted within the duration of one video frame, are averaged in order to align the extracted low- level audio and video features. In this implementation there is thus one set of features per video frame (features are video frame aligned). Advantageously, the following set of low- level audio and video features that are calculated during MPEG2 encoding can be used:

Low-level video features: - Total frame luminosity

Total frame luminosity variation

Total frame U chrominance

Total frame U chrominance variation

Total frame V chrominance - Total frame V chrominance variation

Frame luminosity upper letterbox area

Frame luminosity lower letterbox area

Frame luminosity variation upper letterbox area

Frame luminosity variation lower letterbox area - Frame complexity (bitrate x quantizer scale) center area

Frame complexity (bitrate x quantizer scale) surrounding area

Frame Mean Absolute Difference (MAD) of upper frame area

Frame Mean Absolute Difference (MAD) of lower frame area

Average X-motion total frame - Average Y-motion total frame

Max X-motion total frame

Max Y-motion total frame

Total Number I Macro Blocks upper frame area

Total Number I Macro Blocks lower frame area

Low level audio features:

The average of the first EFCC coefficient across the M subframes [EFCC(I)]

The average of the second EFCC coefficient across the M subframes [EFCC(2)] - The average of the third EFCC coefficient across the M subframes [EFCC(3)]

The average of the fourth EFCC coefficient across the M subframes [EFCC(4)]

The average of the fifth EFCC coefficient across the M subframes [EFCC(5)]

The average of the sixth EFCC coefficient across the M subframes [EFCC(O)] The average of the seventh EFCC coefficient across the M subframes

[EFCC(7)] The average of the eighth EFCC coefficient across the M subframes [EFCC(8)]

The average of the ninth EFCC coefficient across the M subframes [EFCC(9)]

The zeroth Modulation Cepstrum Coefficient (Modulation depth)

The first Modulation Cepstrum Coefficient (Modulation spectrum tilt)

The second Modulation Cepstrum Coefficient

The third Modulation Cepstrum Coefficient

The fourth Modulation Cepstrum Coefficient

Correlation coefficient between EFCC(O) and EFCC(I)

Correlation coefficient between EFCC(I) and EFCC(2)

Correlation coefficient between EFCC(I) and EFCC(3)

Average inter- frame power spectrum correlation coefficient

Audio and video features having continuous values can have significantly different ranges which lowers the numerical stability of the detection system design and (online) evaluation. These audio and video features are normalized. At some time instance k , feature normalization of a feature Z₁(Jc) can be defined as follows:

m_t and S₁ are pre-calculated mean and standard deviation constants. The constants are calculated over an audio/video reference content set.

Behavior of extracted low-level audio and video features can be calculated ('extracted') over a sliding window in time. The size of the sliding window is the actual definition of the segmentation timescale. Behavior of feature signals in time can be defined in many different ways. A very simple definition is the mean and standard deviation of the feature signals over a window with length w . Another example of feature signal behavior definition is a set of pre-selected spectral energies of each feature signal.

In the following embodiment, during the behavior feature calculation, for each time instant k , the local behavior of each feature is extracted. This is done by applying m behavior feature extraction operators separately to each of the n features inside a sliding window long w centered on k . f_t (k) is the value at time instance k of the i-th feature, Φ_;. the j-th behavior feature extraction operator to be applied to every feature signal, we can express the behavior features as equation (1) of Fig.1. |_JC J and |^~JC] are respectively the closest lower integer and the closest higher integer of x . The result is a set of n • m behavior features BF_{1 j} . Note that the total number of samples inside the window is w . For example, two behavior features can be extracted: a window mean and standard deviation operator. This behavior features can be applied to the 38 features calculated during MPEG2 encoding, resulting in 2 x 38 = 76 behavior features. In this example, a time instance corresponds to a video frame.

The behavior features dynamics are extracted by simply making an absolute difference between behavior features at a certain distance as defined in equation (2) of Fig.1. When defining the mean and standard deviation of features over a behavior feature extraction window w as behavior features, an optimal choice for the used difference distance d is equal to w . The figures show that, when distance d is chosen to be equal to w , the peaks of the mean and standard deviation behavior difference features are exactly positioned at the transition of the feature behavior.

Fig. 2 demonstrates that the peak position of the mean behavior difference is at the exact position of the feature value transition when w and d are chosen to be equal. Fig. 2(a) shows an ideal transition of a feature for a mean behavior feature, with a moving window of length w used to calculate the local mean. Fig. 2(b) shows corresponding mean behavior obtained by using window w. Fig. 2(c) shows corresponding mean behavior dynamics obtained with equation (2) of Fig. 1 using a distance d equal to w.

Fig. 3 demonstrates that the peak position of the standard deviation behavior difference is at the exact position of the feature value transition when w and d are chosen to be equal. Fig. 3 (a) shows an ideal transition of a feature for a standard deviation behavior feature, with a moving window of length w used to calculate the local standard deviation. Fig. 3(b) shows corresponding standard deviation behavior obtained by using window w. Fig. 3(c) shows corresponding standard deviation behavior dynamics obtained with equation (2) of Fig. 1 using a distance d equal to w.

Fig. 4 shows a real example of behavior mean dynamics of five different features created with w=63 I-frames and d=w and further shows manually annotated boundaries.

It is possible to see two important things: The behavior mean dynamic assumes the expected triangular shape, however there are no perfect triangles because of noise factors, basically, the strongest source of noise is the instability of the features around the transition. Note that this noise factor is related to the window length. - None of these features show where are all and only the real boundaries, but each of these features can provide different and useful information to retrieve them.

If the information gathered from a large amount of features is combined, it is expectable to identify these boundaries. By adding new features (the relevant ones) the highest peaks become exactly the real boundaries (target boundaries). However, most of the situations encountered in practice are more complex, some features can be useless in some cases and useful in others, different dynamics could have different weights in identifying changes in the semantic. For this reason, it is advantageous to determine at least a subset of the plurality of content features for which a difference between the first value and the second value exceeds a certain threshold, compare the determined subset of content features with a plurality of predefined subsets of content features, and identify a segment boundary if the determined subset comprises all content features of at least one of the predefined subsets. To implement this, the calculated behavior difference features at every time instance k can be used as input for the evaluation of a detection model The detection model is designed (trained and structurally optimized) to have a high output when, based on the given input behavior difference features, the presence of a segment boundary is probable. The output value is low when this is not probable. In time the detection model output forms a signal.

This implementation can be equal to the detection model used in WO2004/019224, herein incorporated by reference. This detection model consists of a Self- Organizing Map (SOM) segmenting, for this method, the behavior difference feature space into clusters. Per cluster, a linear detection model is designed. To evaluate the model, using the behavior difference features, first the best matching SOM cluster is found. Subsequently, the linear model belonging to the best matching SOM cluster is evaluated with (a subset of) the behavior difference features. The evaluation results in a continuous output value that can be interpreted as a confidence measure for the presence of a segmentation boundary at the current time instance (that is, the current video frame).

Two types of optimizations can be applied during the detection model design: Structural - SOM map size

- Per SOM cluster, the input subset of the linear model belonging to that cluster

Parameters - Codebook vectors of SOM clusters

- Per SOM cluster, weights and bias of linear model belonging to that cluster.

The design can be example based: every video frame one example, consisting of the behavior difference features as input set and a binary target value indicating the presence of a segment boundary. The examples form sequences in time since the examples are derived form video sequences. Before training, the target signal is first processed to make the learning problem well defined. A detection at a time instance k could actually have occurred in the range between k - w and k + w (assuming d is chosen equal to w), due to the averaging nature of both the mean and standard deviation operator used in the behavior feature extraction process. The learning problem is thus ill-defined when the target signal only consists of peaks at the real ground truth positions. This is solved by adapting the target signal to contain triangular shaped waveforms, with base 2 w , with the peaks at the positions of the real ground truth segment boundaries.

At some timescale defined by w , for any time instance k , preferably only one detection should be chosen within some range around A; ; this is because of the averaging nature of behavior extraction operators (the actual range considering to have only one detection depends on the specific properties of the behavior extraction operators used).

Within this considered "filtering" range, the detection model output signal could have multiple peaks, usually wider than one time instance. The problem to solve here is to choose the detected boundary position, within the filter range, if any at all.

A filtering mechanism is required to decide if there is boundary detection at all, and if the presence of a segment boundary is plausible, at what position within the range this boundary occurs.

Theoretically, when the mean and standard deviation behavior features have a strong transition as in Figs. 2 and 3, ideally the model output will also show a strong triangular shaped peak with a 2 w base. Empirical analysis of the model output indeed shows that at real segment boundaries are strongly triangular shaped peaks, with a base 2 w , in many instances. However the triangular shapes usually have a lot of higher frequency peaks. The implemented filtering stage filters out the higher frequency peaks and selects only those peaks in the filtered signal that actually look similar to the ideal triangular shape, with base w or higher. Used filtering range is between A; - w and k + w . At every time instance k, the filtering is step is evaluated as follows: - In the filtering range, consisting of 2 w +1 video frames, the detection model

output is filtered using a low-pass filter with a cut-off frequency of . This method of

2w - l filtering suppresses all peaks that have a width smaller then w .

Only real triangular shapes with base 2w or higher are selected in the following manner: - First, the global form must be present: within a range of 2 w the signal should rise and fall: At each instance k , difference in model output at instances k + w -1 and A; should be bigger then zero and difference in model output at instances k +2 w - land A; + w should be smaller then zero.

- If the global form is present, the form is checked at a finer scale : within a range of 1/2 w the signal should also rise and fall: At instance k where global form is present, difference in model output at instances k + w IA-X and A; should be bigger then zero and difference in model output at instances k + w 12- 1 and k + w IA should be smaller then zero.

If the triangular form is present in the 2 w and 1/2 w ranges the top of the peak is a potential segment boundary position - If the model output value at the selected potential segment boundary position is higher then some trained threshold, a segment boundary detection is made at that position.

The previous segmentation steps can be repeated for multiple timescales, resulting in content segmentations with different typical segment lengths. When the different segmentations are used together in one browsing application, enabling the user to browse at multiple levels, segment boundaries at multiple levels very close to each other must be aligned to be able to seamlessly switch browsing at one level to another. Further, boundary detections at coarser time scales can be improved through alignment with segmentations at finer levels. Segment boundary detections at a courser timescale are less accurate because of the averaging nature of behavior extraction operators, working over some window w .

Usually it is of interest to align the segmentation at the lowest timescale with shotcuts, enabling fluent browsing. However in some application scenarios, boundaries are not typically at hard shot boundaries. For instance in football, the change in pace of the game as the start of an attack on goal is not at a shotcut boundary.

A multi timescale alignment may comprise the following steps: 1. Over all timescales W₁ starting with highest until second lowest (because there is nothing beyond the lowest timescale): a. For all detections in current timescale W₁ : i. Find the closest detection with the next lower timescale ii. If the distance to the closest detection is smaller than filter window of the current timescale W₁ then move the detection to the location of the closest detection iii. Else, create a new detection at the next lower timescale (reverse alignment through creation of new detection) b. If current timescale W₁ is not highest timescale then do realignment upward (changes at current timescale W₁ , due to alignment with lower level, has to be propagated back to higher levels): i. For all timescales W₂ higher than current timescale until highest timescale

1. For all detections in current timescale W₂ :

A. Find the closest detection with the next lower timescale

B. If the distance to the closest detection is smaller than filter window of the current timescale W₂ then move the detection to the location of the closest detection

While the invention has been described in connection with preferred embodiments, it will be understood that modifications thereof within the principles outlined above will be evident to those skilled in the art, and thus the invention is not limited to the preferred embodiments but is intended to encompass such modifications. The invention resides in each and every novel characteristic feature and each and every combination of characteristic features. Reference numerals in the claims do not limit their protective scope. Use of the verb "to comprise" and its conjugations does not exclude the presence of elements other than those stated in the claims. Use of the article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.

'Means', as will be apparent to a person skilled in the art, are meant to include any hardware (such as separate or integrated circuits or electronic elements) or software (such as programs or parts of programs) which perform in operation or are designed to perform a specified function, be it solely or in conjunction with other functions, be it in isolation or in co-operation with other elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. 'Software' is to be understood to mean any software product stored on a computer-readable medium, such as a floppy disk, downloadable via a network, such as the Internet, or marketable in any other manner.

Claims

CLAIMS:

1. An electronic device, comprising electronic circuitry operative to: determine a first value for each of a plurality of content features of a content item at a first plurality of time instances; determine a second value for each of the plurality of content features at a second plurality of time instances; and identify a segment boundary in dependence on a difference between the first value and the second value exceeding a certain threshold for at least one of the plurality of content features.

2. An electronic device as claimed in claim 1, further being operative to: determine at least a subset of the plurality of content features for which a difference between the first value and the second value exceeds a certain threshold; compare the determined subset of content features with a plurality of predefined subsets of content features; and - identify a segment boundary if the determined subset comprises all content features of at least one of the predefined subsets.

3. An electronic device as claimed in claim 1, wherein both the first plurality of time instances and the second plurality of time instances each have a duration substantially similar to a time difference between the first plurality of time instances and the second plurality of time instances.

4. An electronic device as claimed in claim 1, the electronic circuitry further being operative to: - select a segmentation parameter based on a user input, the user input indicating a granularity of desired segmentation of the content item and the segmentation parameter comprising at least one of a time difference between the first plurality of time instances and the second plurality of time instances and a difference threshold.

5. An electronic device as claimed in claim 1, the electronic circuitry further being operative to: create a plurality of segmentations for the content item, each segmentation being created using a different segmentation parameter and each segmentation comprising a plurality of segments; and align a first segment boundary of a first one of the segmentations at a first time instance with a second segment boundary of a second one of the segmentations at a second time instance if a difference between the first time instance and the second time instance does not exceed a certain threshold.

6. A method of identifying a segment boundary, comprising the steps of: determining a first value for each of a plurality of content features of a content item at a first plurality of time instances; determining a second value for each of the plurality of content features at a second plurality of time instances; and identifying a segment boundary in dependence on a difference between the first value and the second value exceeding a certain threshold for at least one of the plurality of content features.

7. A method as claimed in claim 6, further comprising the steps of: determining at least a subset of the plurality of content features for which a difference between the first value and the second value exceeds a certain threshold; and comparing the determined subset of content features with a plurality of predefined subsets of content features, and wherein the step of identifying a segment boundary comprises identifying a segment boundary if the determined subset comprises all content features of at least one of the predefined subsets.

8. A method as claimed in claim 6, further comprising the step of: - selecting a segmentation parameter based on a user input, the user input indicating a granularity of desired segmentation of the content item and the segmentation parameter comprising at least one of a time difference between the first plurality of time instances and the second plurality of time instances and a difference threshold.

9. Software for making a programmable device operative to perform the method of claim 6.