WO2007113580A1 - Intelligent media content playing device with user attention detection, corresponding method and carrier medium - Google Patents

Intelligent media content playing device with user attention detection, corresponding method and carrier medium Download PDF

Info

Publication number
WO2007113580A1
WO2007113580A1 (PCT/GB2007/001288)
Authority
WO
WIPO (PCT)
Prior art keywords
user
face
media content
detecting
playing
Prior art date
Application number
PCT/GB2007/001288
Other languages
French (fr)
Inventor
David John Chatting
Ian Christopher Kegel
Original Assignee
British Telecommunications Public Limited Company
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from EP06251933A external-priority patent/EP1843592A1/en
Priority claimed from EP06251932A external-priority patent/EP1843591A1/en
Application filed by British Telecommunications Public Limited Company filed Critical British Telecommunications Public Limited Company
Publication of WO2007113580A1 publication Critical patent/WO2007113580A1/en

Links

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/76 Television signal recording
    • H04N 5/765 Interface circuits between an apparatus for recording and another apparatus
    • H04N 5/775 Interface circuits between an apparatus for recording and another apparatus between a recording apparatus and a television receiver
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/21 Server components or server architectures
    • H04N 21/222 Secondary servers, e.g. proxy server, cable television Head-end
    • H04N 21/2225 Local VOD servers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/2343 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/23439 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements for generating different versions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N 21/258 Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N 21/25866 Management of end-user data
    • H04N 21/25891 Management of end-user data being end-user preferences
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N 21/266 Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N 21/2662 Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/41 Structure of client; Structure of client peripherals
    • H04N 21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N 21/42204 User interfaces specially adapted for controlling a client device through a remote control device; Remote control devices therefor
    • H04N 21/42206 User interfaces specially adapted for controlling a client device through a remote control device; Remote control devices therefor characterized by hardware details
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/41 Structure of client; Structure of client peripherals
    • H04N 21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N 21/4223 Cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/433 Content storage operation, e.g. storage operation in response to a pause request, caching operations
    • H04N 21/4334 Recording operations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/441 Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N 21/4415 Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/442 Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N 21/44213 Monitoring of end-user related data
    • H04N 21/44218 Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N 21/63 Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N 21/64 Addressing
    • H04N 21/6405 Multicasting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/76 Television signal recording
    • H04N 5/84 Television signal recording using optical recording
    • H04N 5/85 Television signal recording using optical recording on discs or drums
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 9/00 Details of colour television systems
    • H04N 9/79 Processing of colour television signals in connection with recording
    • H04N 9/80 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback
    • H04N 9/804 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback involving pulse code modulation of the colour picture signal components
    • H04N 9/8042 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback involving pulse code modulation of the colour picture signal components involving data reduction

Definitions

  • the present invention relates to providing intelligent user control options for the consumption of media content, such as video images, by one or more viewers.
  • Equipment for providing media content such as video images is well known and includes the video cassette recorder (VCR), digital video disc (DVD) players, hard disc drive (HDD) based recorders, and terrestrial, cable and satellite TV set top boxes coupled to a viewing device such as a television set (TV).
  • personal computers are also being used to provide media content, and may include DVD disc readers, TV tuner capture cards, and Internet file download and streaming facilities.
  • Streaming media is increasingly being used for fast and convenient access to media content such as Video on Demand (VoD) and is typically implemented using a remote storage server transmitting over a network to a receiving apparatus such as a computer or mobile phone.
  • the quality of the media content streamed will depend on the bandwidth of the network over which it is transmitted; for example, over a broadband Internet connection a high quality media content transmission may be made, whereas over a wireless network the same media content may need to be transmitted at a lower quality level given the constraints of the lower bandwidth connection.
  • the media content quality level can be adjusted by using different compression technology (data rate), resulting in fewer packets sent per unit time, a reduced number of frames for video, and lower resolution images.
  • the quality of media content can also be adjusted dynamically; for example British Telecom's Fastnet video streaming technology provides media content in a number of different quality formats, and switches between them depending on network congestion. Thus if the network becomes congested, the media server switches to streaming a lower quality level media content to the consuming device, for example a mobile phone. This is described in Walker, M. D., Nilsson, M., Jebb, T. and Turnbull R., Mobile Video-Streaming, BT Technology Journal, Vol. 21, No. 3 (September 2003), pp. 192-202.
  • a distraction such as a phone call or the door-bell ringing during playback of a DVD may necessitate the viewer manually rewinding to media content at the point of interruption in order to continue viewing without missing any of this content.
  • HDD devices such as TiVoTM allow a user to press a "pause" button during a live broadcast in order to accommodate an interruption. When the viewer returns, the HDD has started recording the missed content so that the user can return to watching the video content from the interruption time by pressing a "continue" button for example.
  • WO2004084054 describes an attempt to automate such facilities by utilising an eye gaze attention recognition system which detects when a viewer's pupils are directed at a viewing screen. Detection of attention or non-attention can then be used to pause playback of media content.
  • However, this approach relies on eye gaze detection equipment, which also requires a light source to illuminate the pupils of the viewer.
  • PCT published patent application No. WO 2006/061770 and US 2002/144259 describe intelligent pause buttons in which user attentiveness is monitored by some mechanism (several examples are given, including using a video camera to monitor the user's gaze or movements, etc) and, if user inattentiveness is detected, the display of the media is paused until user attentiveness is again detected.
  • PCT published patent application WO 2003/026250 describes a system in which user attentiveness is again measured and in the event that user inattentiveness is detected during playback of a stream of media, the quality of the media is reduced. Again a number of different ways of detecting user attentiveness/inattentiveness are described including techniques involving cameras.
  • the present invention provides user media content control options automatically based on a face detection system or other means of detecting user attention states. It has been found that detection of a face directed towards a viewing screen is surprisingly highly indicative of user attention. Furthermore, face detection systems can be much simpler if the face to be detected is always oriented in the same direction (and in the same direction as the training data). Although people do look out of the corners of their eyes, this is typically followed shortly afterwards by reorienting the head in the same direction. Thus using face detection as a mechanism for detecting user attentiveness is especially efficient because the face detection system works best when the face is oriented in the same way as in the training data.
  • face detection is simpler than tracking eye gaze and does not require expensive pupil tracking mechanisms.
  • the detection equipment does not need to be as sensitive and can detect its targets at a greater distance than eye tracking. This is especially useful in informal viewing situations such as a number of viewers in a lounge watching a TV screen from a distance, perhaps up to 5 metres, rather than a single viewer up close to a computer monitor for example.
  • Face detection can be implemented using cheap off-the-shelf equipment such as a web-cam and readily available computer software.
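To make the "cheap off-the-shelf" point concrete, the sketch below counts faces in a single webcam frame using OpenCV's bundled frontal-face Haar cascade. The patent does not name this library; OpenCV, the cascade file and the parameter values are assumptions made here purely for illustration.

```python
# Minimal face-count sketch using OpenCV (an assumed, illustrative choice;
# the embodiments in this document use Wu et al or Neven Vision software).
import cv2

# Stock frontal-face Haar cascade shipped with the opencv-python package.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)          # default web cam
ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # One bounding rectangle per (roughly face-on) face in the frame.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print(f"faces detected: {len(faces)}")
cap.release()
```

Because a frontal-face cascade is trained only on face-on examples, a count of zero from such a detector already approximates the "face-on equals attentive" heuristic described above.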
  • the system notes or stores a play or distraction index or time (eg media content duration so far) for the media content being played, and continues to play the media content, most preferably at a reduced quality level.
  • the system offers the viewer various viewing control options based on the noted play index. This can be implemented as a user interface for example a graphical user interface superimposed over the playing media content in the form of a user message and/or user actuable soft screen button.
  • the user may then choose to continue playing the media content by ignoring the generated user interface, or to "rewind" the media content by actuating or operating the user interface so that the media content re-plays from the noted play index.
  • Various other user control options are contemplated including viewing a "summary" of the missed content.
  • the system provides a more user friendly control interface than other known approaches. There is no unnecessary pausing and instead a user interface or control option is provided.
  • a user may be happy for the media content to continue playing rather than pausing even if they are interrupted, for example when momentarily discussing the content with a fellow viewer or during ad breaks.
  • a high number of (unnecessary) pauses can be annoying for the viewer and this is exacerbated where these are based on false detection of a user non-attention state.
  • there are typically fewer false positives (ie detections that the user has looked away when in fact they haven't) with face detection than with pupil tracking, for example.
  • By continuing to play the media, but at a reduced quality level, the user is able to keep half an eye or ear on the media and can therefore reach a good conclusion as to whether or not the "missed" content contains something that the user would like to "see" (whilst paying full attention).
  • By reducing the quality level, the bandwidth occupied by the streaming media is greatly reduced, which is beneficial to the operator.
  • Moreover, since the system has detected user inattentiveness, it is likely that the user will not greatly notice or mind the fact that the quality level of the media is reduced.
  • a web cam or other digital camera apparatus and face detection software are used in conjunction with computer equipment arranged to play media content.
  • the face detection software used in an embodiment is based on an algorithm disclosed in Wu J., Rehg J. M., Mullin M. D. (2004) Learning a Rare Event Detection Cascade by Direct Feature Selection, Advances in Neural Information Processing Systems vol. 16 - see also http://www.cc.gatech.edu/~wujx/research.htm.
  • Other face detection software could alternatively be used, for example Mobile-I available from Neven Vision.
  • the web cam and face detection software detects a face, and the computer plays the media content such as a movie.
  • the absence of a face is detected and used by the computer to store the current play index of the movie, as well as to reduce the quality at which the media is being streamed (in the case where the media is being streamed over a connection having a limited amount of bandwidth for which there is competing demand from other users).
  • a face is detected which triggers the computer to display a "rewind" button on the screen, and to increase the quality level of streamed media back to the "normal" level appropriate for when a user is viewing the media with attention.
  • the computer stops the current playing of the media and restarts playing from the stored play index (or shortly before this); otherwise the rewind button disappears from the display after a brief period and the media continues to play uninterrupted.
  • the face detection software and/or computer are arranged to identify detected faces for example by associating an identifier with each detected face.
  • the system detects the new face and recognises it as a previously identified face and displays an appropriate or personalised control option for that user.
  • the returned user may be offered a summary of the content missed either on the entire screen or in a corner of the screen with the main content continuing to play in the background.
  • the summary may be offered on a separate personal device, for example a PDA or phone.
  • the system can be configured to receive multiple face detection outputs or readings before basing a decision on those inputs, in order to minimise the likelihood of false positives or false negatives (for example, a determination that the user is looking away when in fact they are attentive).
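One simple way to realise this, sketched below under stated assumptions, is to accept a change in face count only after it has been reported in N consecutive detector readings; the figure of 10 readings echoes the example given later in this document, but is otherwise an arbitrary tuning value.

```python
# Debounce sketch: a new face count is confirmed only after it has been
# observed in `required_consecutive` readings in a row (assumed rule).
class DebouncedFaceCount:
    def __init__(self, required_consecutive=10):
        self.required = required_consecutive
        self.stable_count = 0      # last confirmed face count
        self.candidate = 0         # count currently being confirmed
        self.run_length = 0

    def update(self, raw_count):
        """Feed one raw detector reading; return the confirmed count."""
        if raw_count == self.candidate:
            self.run_length += 1
        else:
            self.candidate = raw_count
            self.run_length = 1
        if self.run_length >= self.required:
            self.stable_count = self.candidate
        return self.stable_count
```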
  • apparatus for playing media content is provided which is arranged to offer smart media content consumption (for example viewing or listening) options, including rewinding to a point where consumption was interrupted.
  • the apparatus includes means for detecting a face such as a camera and a face detection system or software executed on a processing platform such as a computer or set-top box.
  • the apparatus also comprises means for playing the media content and associating the media content being played with a play index; examples include a DVD player which includes a play index or a streaming media client.
  • the apparatus (for example the processing platform) is arranged to store the play index of the media content being played in response to detecting the absence of a previously detected face.
  • the apparatus further comprises means for generating a user interface in response to detecting a face following the detection of the absence of a face. This may be a displayed message with a user control input facility such as a remote control device or soft button on the display showing the played media.
  • if no user input is received within a short period, the user interface is deactivated, for example the displayed message is deleted. If however a user input is received through the user interface, the playing means is arranged to re-play at least a part of the media content depending on the stored play index. For example the missed content may be replayed from the stored play index, or from a play index dependent on this, such as a few seconds before.
  • the user interface may be implemented by generating a user alert such as a displayed message or audible sound, and facilitating the generation of a control signal in response to the user input, wherein the apparatus is responsive to the control signal to instruct the playing means to re-play the at least a part of the media content.
  • Facilitation of the control signal may be achieved by monitoring incoming remote control signals for a predetermined signal for example.
  • the apparatus also comprises a face recogniser which may be implemented using face recognition software executed on the processing platform.
  • the face recogniser maintains a database of faces so that it can either recognise and identify a newly detected face, or add this to the database together with an identifier which can be used with the rest of the system.
  • the face recogniser can be used to recognise a particular recently absent face in order to offer a personal service such as offering to provide a summary of content that user has missed when there are a number of users or viewers using the system.
  • a method of playing media content such as a movie and having a play index
  • the method comprising playing the media content in response to detecting a face, and storing the current play index in response to detecting the absence of a previously detected face.
  • the method further comprises generating a user interface associated with the stored play index in response to detecting a face. Again there is no requirement for recognising this face, merely that a face has again been detected.
  • the method further comprises re-playing at least part of the media content depending on the stored play index in response to receiving a user input from the generated user interface. If no user input is received, the user interface is no longer generated after a short time. In an embodiment the method further comprises identifying each detected face, and a personalised user interface is generated each time a previously absent identified face is again recognised.
  • a method of playing media comprising: playing the media content in response to detecting a face; pausing the media content in response to detecting the absence of a detected face; resuming playing the media content in response to again detecting a face.
  • different measures or indicators of user attentiveness may be used, for example infra-red or other proximity detectors, eye or pupil tracking devices, or even monitoring interaction with the system such as button presses.
  • user attention state is used to indicate whether or not a user is attentive or paying attention to the media content being played.
  • this can be determined by detecting the presence (attentive) or absence (non-attentive) of a face for example.
  • One or a combination of attentiveness or user attention state detectors may be used, and this may be configured to depend on the content being used. For example audio-only content does not require the user to be directly in front of the speaker, but merely within the room, say, whereas a 3D movie may require a user to be positioned within a certain location space for optimum viewing.
  • a media server which is supplying the media content is instructed by a media client of a user device to play the media content at a quality level dependent on the detection of a face.
  • the user device detects a face notionally viewing a screen onto which the media content would be played, the user device instructs the media server to transmit a higher or the highest quality media content.
  • the user device or apparatus instructs the media server to reduce or degrade the media content quality, hence reducing the data rate.
  • reducing the quality may involve the server switching to transmit a lower bit-rate stream to the user device, for example one coded using a higher compression ratio algorithm, or switching off one of a number of parallel streams or sessions each supporting a different layer in hierarchically encoded data.
  • the user device continues to instruct a reduction in media content quality level so that the media content is degraded over time, perhaps to no content at all. Then when user attention (eg a face) is again detected, the media content quality is increased, perhaps initially to the highest quality so that the user does not notice any reduction in quality of the content that is actually viewed as well as also providing a smart rewind function at that time.
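A minimal sketch of that behaviour follows, assuming a fixed ladder of quality levels, one degradation step per detector tick while no face is seen, and an immediate jump back to the top level on re-detection; the level names and per-tick step are illustrative, not taken from the patent.

```python
# Attention-driven quality ladder (assumed levels and step timing).
QUALITY_LEVELS = ["high", "medium", "low", "audio_only", "off"]

class QualityController:
    def __init__(self):
        self.level = 0                          # 0 = highest quality

    def tick(self, face_detected: bool) -> str:
        if face_detected:
            self.level = 0                      # restore full quality at once
        elif self.level < len(QUALITY_LEVELS) - 1:
            self.level += 1                     # degrade one step per tick
        return QUALITY_LEVELS[self.level]
```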
  • User inattention may occur for a number of reasons, for example because the user or viewer has left the room to get a drink, answer the telephone or door, or looked away to talk with someone else in the room.
  • the media server multicasts or transmits multiple streams of the same media content at different quality levels or bit rates, and the user device switches between the streams depending on the user attention state determined.
  • the user device may switch off one of a number of parallel streams each supporting a different layer in hierarchically encoded content.
  • the media server may be integral with the attention detection apparatus, for example all contained as part of a computer system, rather than having the media server at a remote location.
  • the embodiment may be implemented on a personal computer with a DVD player as the media server. In this case, no network is required in order to couple these two parts of the overall system.
  • degrading the quality level of the content in this embodiment can also be used to reduce the power consumption of the computer, by lowering the brightness of the display (or turning it off altogether when not required to show content at all).
  • Figure 1 is a schematic showing a system for detecting user attentiveness and offering user controls based thereon;
  • Figure 2 is a flow chart illustrating a method of responding to detected user inattentiveness
  • Figure 3 is a flow chart illustrating another method of responding to detected user inattentiveness
  • Figure 4 is a schematic showing a system for responding to user inattentiveness when watching streamed media
  • Figure 5 is a graph illustrating degradation of video on demand content in response to viewer non-attentiveness
  • Figure 6 is a flow chart illustrating a method of operating a user device or apparatus for playing the media content
  • Figure 7 is a flow chart illustrating a method of operating a media server for playing the media content
  • Figure 8 illustrates a layered approach to streaming media content at different quality levels or bit rates
  • Figure 9 illustrates a media server architecture for layered streaming media content
  • Figure 10 illustrates a streaming client at the user device for receiving layered streamed media content.
  • Figure 1 illustrates a system according to an embodiment, the system 100 comprising a user's face 105, a camera 110 or viewing equipment such as a web cam or other camera, a face detector 115, optionally a face recogniser 120, an attention interpreter 125, a media store 130 such as a DVD player containing video content such as a movie, a viewing screen 135, a user interface message display 140 (and optionally a user actuable soft button), and a user input device 145 such as a remote control unit.
  • the face detector 115, face recogniser 120, and attention interpreter 125 are typically implemented as software executed on a computing platform such as a personal computer or dedicated video content equipment such as a DVD player or set-top box.
  • the viewing screen 135 may be part of the computing platform, or a stand-alone device such as a plasma-TV screen for example.
  • the Media Store may be a buffer for streamed content, or may exist remotely over a network - eg Video on Demand (VoD).
  • the camera 110 is arranged to capture images pointing away from the viewing screen 135 in order to "see" user faces 105 viewing the viewing screen.
  • the camera 110 will be positioned at right angles to the plane of the viewing screen 135.
  • the camera can be arranged such that the viewing angle of the screen and the camera's field of view are largely coincident; and this may be accomplished with additional lenses if necessary.
  • Video or still images from the camera 110 are directed to the face detector 115, which may be implemented using various well known software packages on a computing or processor platform.
  • Examples of face detection algorithms include: Wu et al (2004); modules of Neven Vision's Mobile-ITM face recognition software developer's kit (SDK); Detecting Faces in Images: A Survey, Ming-Hsuan Yang, David J. Kriegman and Narendra Ahuja - http://vision.ai.uiuc.edu/mhyang/papers/pami02a.pdf; and C-VIS Computer Vision and Automation GmbH Face Snap or similar technologies. These and similar packages may be either configured simply for face detection (115) or additionally for face recognition (120). The face detector function 115 simply reports whether a face has been detected or how many faces are currently detected within the field of view of the camera 110.
  • the face recogniser function 120 either adds a new detected face to a database of recognised faces, or notes a match between a detected face and a face recorded in the database.
  • the database may be temporary and used only for each media content playing session - such as each movie viewing. This is because a different set of faces may watch a different movie.
  • the database may collect face identities from multiple media content playing sessions, to offer personalised services to regular viewers. Examples might include providing summaries or rewinds for individual viewers depending on content that they personally have missed.
  • the Mobile-I TM and some of the other packages attempt to identify faces from any angle, and the system when using these packages can benefit from further identifying "looking ahead" faces, that is only faces that are looking directly (0 degrees) at the screen 135 or within a certain range of angles, for example up to say 20 degrees.
  • Various algorithms are available to supplement the basic face recognition packages, including for example that discussed in Wu J., Rehg J. M., Mullin M. D. (2004) Learning a Rare Event Detection Cascade by Direct Feature Selection, Advances in Neural Information Processing Systems vol. 16 - see also http://www.cc.gatech.edu/~wujx/research.htm.
  • the Wu et al algorithm can be configured to detect only "face-on" faces by training it only on face-on examples. From these examples it learns its own rules and these pre-learnt rules can then be used to only detect new faces when they are face-on.
  • This algorithm then gives the location of a rectangle within the camera image that contains the face. In order to integrate this with the Neven Mobile-I system, this result is fed directly into the Mobile-I Pre-selector stage (see figure, page 2, Steffen et al).
  • This software package is therefore configured to detect faces that are, or filter out faces that are not, "face-on" or attentive, and then feed these results on to the face recogniser 120.
  • the camera 110 is preferably located very near to the screen 135 so that users viewing the screen will be "face-on" to the camera.
  • the attention interpreter 125 interprets the outputs of the face detector 115 and face recogniser 120 as user (105) attention states, and controls operation of the media store and a user interface (140, 145) accordingly.
  • the attention interpreter 125 is typically implemented as a routine in a computing or processor based device, and two embodiments of this function (125) are illustrated in figures 2 and 3 and described in greater detail below.
  • the attention interpreter 125 controls playing of the media store 130, for example playing and rewinding of a movie played on a DVD player.
  • the attention interpreter 125 generates a user interface with the user 105 in order to allow the user to control operation of the media store dependent on the user's attention status as interpreted by the attention interpreter 125.
  • the attention interpreter 125 generates a user message display or user alert 140 on the viewing screen 135 superimposed on the content displayed on the screen by the media store 130. This could be provided in a new pop-up window on a Microsoft Windows TM based computer for example, but typically with an overlaid graphic.
  • the user display or alert 140 could be provided with a soft button or icon actuable by a computer mouse for example, or simply as a message to operate a remote control device 145 in a particular way.
  • the user alert 140 could simply be indicated using a suitable LED, a sound or some other user alerting mechanism, on the screen apparatus 135 or even on another device. The user alert is only shown (or sounded) on the screen apparatus for a short period, allowing it to be ignored.
  • a user control input is also enabled to allow the user to activate a predetermined control on the remote control device 145 for example, and which is interpreted by the attention interpreter 125.
  • the user control input may be enabled using a dedicated button or control on the remote control device 145 which is only used for the "smart rewind" or other smart functions provided by the system 100.
  • a standard function control or button such as the rewind control may be disabled for the media store 130 and instead enabled for the attention interpreter 125 in response to display of the displayed user control input 140.
  • coloured buttons may be used together with the user alert "press red button to replay what I missed".
  • the user control input may be a mouse click whilst the cursor is positioned over the soft button 140.
  • the user control input mechanism need not be a remote control device 145 as typically used with a TV set or other audio-visual provisioning equipment (eg DVD players), but could be integral with this equipment. Either way the user control input is used to record a user input in response to the user alert 140, which generates a control signal (button triggered) or message which causes the attention interpreter 125 to control the media store in predetermined ways.
  • the camera could be used to facilitate gesture recognition as a means of user input control.
  • a current example of this technology is provided by the Sony EyeToy.
  • the attention interpreter 125 is arranged to provide certain "smart" control options available to the user 105 in response to detecting a change in the user's attention status - for example that the user has returned to an attentive state following a period of non-attention.
  • This particular example may correspond to the user watching a movie played by the media store 130 on the viewing screen 135, the user being interrupted for example by a phone call and moving or looking away from the screen 135, then looking back to the screen following completion of the phone call. In the meantime the movie has continued playing so that the user has missed some of the content.
  • the attention interpreter 125 then offers the user the opportunity to rewind to the content playing when the user first looked away from the screen (a non-attention status) and to replay the content from this point.
  • it may be desirable to configure the system to rewind slightly prior to this time in order to ensure continuity - say one or two seconds before the stored play index; other configurations are also possible in which the re-play starting point is dependent on the stored play index - for example a selectable number of seconds before or after the stored play or distraction index.
  • This is implemented by generating the user control input interface, comprising in this embodiment a user alert message 140 on the viewing screen 135, and receiving an appropriate control input from the user via the remote control device 145.
  • the media store 130 is then commanded by the attention interpreter 125 to rewind to this point (or slightly before for example) and start replaying.
  • Alternative or additional "smart" user control options may be provided, for example a summary of the missed content. This may be implemented simply as a faster playing of the missed content until this catches up with the current played content, or only playing some of the frames of the content - for example every 10th frame. This may be provided on the full size of the viewing screen, or in a smaller window with the currently playing content in the background. Alternatively the summary content may be provided by a third party content provider such as a broadcaster for example.
  • Figure 2 illustrates a method of operating the attention interpreter 125 for this situation in more detail.
  • a distraction or play index is maintained which corresponds to the position or time in the playing of content when the user became non-attentive - for example looking or moving away from the viewing screen 135.
  • a media store 130 or the content itself such as a DVD movie will incorporate its own play index, for example to indicate that hour:1, minute:12 and second:45 of a movie is currently being played.
  • the media store 130 and/or attention interpreter 125 may be configured to associate an independent play index with the content if one does not come already integrated, in order to perform the method.
  • the play step (205) indicates that the media store 130 is playing the content.
  • the method (200) determines whether the rewind button 140 has been shown on the screen 135 and activated by the user on the remote control device 145 (210). If this is the case (210Y), then the media store 130 is instructed to rewind the content to a previously stored distraction or play index (215) before continuing to play the content (205). Otherwise (210N), the method determines whether a summary (or other smart content control function) button has been shown and activated (220). If this is the case (220Y), a summary routine is invoked (225) before continuing to play the content (205).
  • the summary function or routine may simply involve playing the missed content (between the stored distraction or play index and the current play index) at an increased speed within a reduced size window of the screen 135. Otherwise (220N), the method receives the output from the face detector 115 (230).
  • the face detector output will typically just be the number of faces detected within the camera's field of view - thus if there is only one user the outputs will be 1 for an attentive state or 0 for a non-attentive (looking away or absent) state.
  • the number of faces detected corresponds to a face count parameter for each unit time or camera image.
  • the method determines whether the current face count is less than, equal to, or greater than a previous face count (235). If the face count is less than the previous face count (235<), this corresponds to a user looking away from the screen (becoming non-attentive); the distraction or play index is set or stored (240), and the content then continues to play (205).
  • If the face count is greater than the previous face count (235>), the method determines whether the distraction index has been set (245). If the distraction index is not set (245N), this corresponds to a new user watching the content, in addition to an existing user. In this case the content continues to play (205) without offering user content control options.
  • an option to rewind to the beginning of the content might be offered if the current position of the playing content is close to the beginning (eg the current play index is below a threshold). This situation might correspond to a friend arriving to join the initial user shortly after the start of the content.
  • a summary option might be offered. If the distraction index is set (245Y), this corresponds to a user having already watched some of the content returning to watch the content after a period of non-attentiveness.
  • the method determines whether the last face count was greater than zero (250). If this is not the case (250N), this means there is only one user to consider, and the rewind button is shown (255). The method then returns to continue playing the content (205) and determines any user response to the rewind button at step (210) as previously described. If the last face count was greater than zero (250Y), this means that an additional user has returned to view the screen 135 so that for example there are now two or more users watching the screen. Upon detecting this new attentive face, the method shows the summary button (260) before returning to continue playing the content (205). Determining whether the summary button has been actuated is then carried out at step (220) as previously described.
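The face-count comparison at the heart of figure 2 can be condensed into a short sketch. The player interface and the two UI calls below are assumed stand-ins for whatever media store and display the system actually uses, not anything specified in the patent.

```python
# Sketch of the figure 2 decision logic (names are illustrative).
def show_rewind_button():
    print("[UI] rewind to distraction index offered")

def show_summary_button():
    print("[UI] summary of missed content offered")

class AttentionInterpreter:
    def __init__(self, player):
        self.player = player           # assumed to expose .current_index()
        self.prev_count = 0
        self.distraction_index = None

    def on_face_count(self, count):
        if count < self.prev_count:
            # A viewer looked away: remember where they stopped watching.
            self.distraction_index = self.player.current_index()
        elif count > self.prev_count and self.distraction_index is not None:
            if self.prev_count == 0:
                show_rewind_button()   # the sole viewer has returned
            else:
                show_summary_button()  # a viewer rejoined others still watching
        self.prev_count = count
```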
  • the following example situation further illustrates the effect of this method.
  • Andrew sits down to watch a live football match on television; he watches for five minutes before he is interrupted by a telephone call from his friend Ben.
  • the conversation lasts for two minutes. When Andrew starts watching again, the television offers to rewind the match to where he left off. He opts not to rewind as he can see there is no score.
  • Ben comes over and watches the match with him.
  • the system operates to provide the rewind button option.
  • the Face Detector counts two faces in the scene instead of one, and the Attention Interpreter has a new rule for seeing more than one face at once which triggers the automatic summarisation function.
  • Figure 3 illustrates a method of operating the attention interpreter 125 according to another embodiment which utilises the face recogniser function 120.
  • the method recognises the faces rather than just counting them and making certain assumptions about their identity as in the method of figure 2.
  • the play step (305) indicates that the media store 130 is playing the content.
  • the method (300) of figure 3 then proceeds in the same manner as the method (200) of figure 2, rewinding (315) in response to detecting activation of a rewind button (310) and activating the summary routine (325) in response to detecting activation of the summary button (320).
  • the attention interpreter 125 also receives an identifier with each face detected from the face recogniser 120. This may be simply a number assigned to each recognised face and stored in a database. Each identifier or number has its own associated stored play or distraction index corresponding to the respective user.
  • the method determines whether the additional face has been seen before - ie is a previously recognised face (345).
  • This may be implemented by using a new database of users or faces for each media content playing session (eg movie viewing). If the newly detected face has not been seen before (345N), this corresponds to a completely new or unrecognised face joining the content viewing, for example a friend starting to watch a movie part way through. In this case the content continues to play (305) without offering user content control options.
  • an alternative is to offer an option to rewind to the beginning of the content if the current position of the playing content is close to the beginning; or to offer a summary option.
  • the method (300) gets the distraction index for the newly re-recognised face (350). The method then determines whether the last face count was greater than zero (355). If this is not the case (355N), this means there is only one user to consider, and the rewind button is shown (360). The method then returns to continue playing the content (305) and determines any user response to the rewind button at step (310). Note however that the distraction index used at step (315) will be the one obtained at step (350) for the respective recognised user.
  • the method shows the summary button (365) before returning to continue playing the content (305). Determining whether the summary button has been actuated is then carried out at step (320) as previously described.
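The per-user bookkeeping this implies can be sketched as below: each identifier from the face recogniser keys its own stored distraction index, and the option offered on a face's return depends on whether other viewers are present. The function names are illustrative assumptions.

```python
# Per-user distraction indices keyed by face recogniser identifier.
distraction_index = {}    # face id -> play index when that face disappeared

def on_face_left(face_id, current_play_index):
    distraction_index[face_id] = current_play_index

def on_face_returned(face_id, other_viewers_present):
    if face_id not in distraction_index:
        return None       # face never seen leaving: no control option offered
    index = distraction_index.pop(face_id)
    return ("summary", index) if other_viewers_present else ("rewind", index)
```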
  • the system described in this scenario or second embodiment provides more complex behaviour than in the first embodiment. In this case it is able to distinguish between the faces of Andrew and Ben using the face recogniser and to offer a personalised service.
  • the camera 110 used may be provided with a light source (not shown) to illuminate the faces for detection and recognition.
  • the ambient lighting may be relied upon, as this will be contributed to by the content displayed on the screen 135.
  • This may be complemented by using a long exposure time for each frame; in this regard it is helpful that viewers tend to stay relatively motionless whilst watching video content.
  • various night mode and "camera-shake" facilities are already “built-in” to many digital cameras and can be used in this situation to improve the facial images provided to the face detector 115 and recogniser 120.
  • a sequence of images or frames can be summed such that each pixel in the resultant image is the sum of the luminance values in corresponding positions in the other frames. The summed images can then be used for face detection/recognition.
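A minimal sketch of that summation, assuming greyscale (luminance) frames held as NumPy arrays; clipping the accumulated values back to the 8-bit range is an implementation choice, not something the text prescribes.

```python
import numpy as np

def sum_frames(gray_frames):
    """Sum equally sized 2-D uint8 luminance frames pixel-by-pixel."""
    acc = np.zeros_like(gray_frames[0], dtype=np.uint32)  # avoid overflow
    for frame in gray_frames:
        acc += frame
    # Clip back into the 8-bit range expected by a face detector.
    return np.clip(acc, 0, 255).astype(np.uint8)
```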
  • the face detector 115 or the attention interpreter 125 can be configured to only recognise a change of attention status after this has been confirmed on a number of subsequent face detections. For example after 10 image frames from the camera all indicate that there is one less or one more face than previously detected.
  • alternatively, the statistical mode of the observations could be used, giving an integer rather than the floating point number that a mean calculation would produce.
  • some face detection software packages can be configured to provide a confidence measure related to how confident the software module is that it has detected a face.
  • This confidence measure or output can be used in an embodiment by the attention interpreter to decide when a face has become absent, for example by monitoring the confidence measure over time and detecting the absence of a face when the confidence measure drops below a predetermined threshold.
  • detection of the absence of a face may only follow a characteristic temporal pattern such as a sharp drop-off in the confidence measure, rather than a gradual decline say which may be due to environmental changes.
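The sketch below shows one way such a rule might look, assuming a per-frame confidence value in [0, 1]; both the absolute threshold and the drop-rate limit that separates a sharp fall from gradual environmental drift are invented tuning values.

```python
from collections import deque

class AbsenceDetector:
    """Declare a face absent only on a sharp, below-threshold confidence fall."""
    def __init__(self, threshold=0.3, max_drop_per_frame=0.15, window=5):
        self.threshold = threshold
        self.max_drop = max_drop_per_frame
        self.history = deque(maxlen=window)

    def update(self, confidence):
        self.history.append(confidence)
        if len(self.history) < 2 or confidence >= self.threshold:
            return False
        # Average fall per frame across the window; gradual drift stays small.
        drop_rate = (self.history[0] - confidence) / (len(self.history) - 1)
        return drop_rate > self.max_drop
```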
  • Figure 4 illustrates a system according to an embodiment, the system 1100 comprising a user's face 1105, a camera 1110 or viewing equipment such as a web cam or other camera, a face detector 1115, a session controller 1120, a display screen 1125, a network 1130 and a media server 1135 such as a video on demand (VoD) server.
  • the face detector 1115 and session controller 1120 are typically implemented as software executed on a computing platform such as a personal computer or dedicated video content equipment such as a set-top box.
  • the viewing screen 1125 may be part of the computing platform, or a stand-alone device such as a plasma-TV screen for example.
  • the network 1130 may be a broadband Internet connection or a wireless network.
  • a user device or apparatus 1160 implements the face detection, session control and media content stream receiving and playing on the screen.
  • This may be provided on a mobile phone or other wireless device, home entertainment equipment, or a computer for example.
  • Media content servers 1135 other than a VoD server may alternatively be used as will be understood by those skilled in the art.
  • the camera 1110 is arranged to capture images pointing away from the viewing screen 1125 in order to "see" user faces 1105 viewing the viewing screen.
  • the camera 1110 will be positioned at right angles to the plane of the viewing screen 1125.
  • the camera can be arranged such that the viewing angle of the screen and the camera's field of view are largely coincident; and this may be accomplished with additional lenses if necessary.
  • Video or still images from the camera 1110 are directed to the face detector 1115, which may be implemented using various well known software packages on a computing or processor platform. Examples of face detection algorithms include: Wu et al (2004); Neven Vision's Mobile-ITM face recognition software developer's kit (SDK); and Detecting Faces in Images: A Survey, Ming-Hsuan Yang, David J. Kriegman and Narendra Ahuja.
  • the face detector function 1115 simply reports whether a face has been detected or how many faces are currently detected within the field of view of the camera 1110.
  • the Wu et al algorithm can be configured to detect only "face-on" faces by training it only on face-on examples. From these examples it learns its own rules and these pre-learnt rules can then be used to only detect new faces when they are face-on. Pre-learnt rules may be distributed with the service, so that no learning is required in situ. This algorithm then gives the location of a rectangle within the camera image that contains the face. In an alternative embodiment there may be no restriction on the orientation of detected faces, merely whether faces are detected or not.
  • the session controller or control module 1120 interprets the output of the face detector 1115 as a user (1105) attention state, and controls operation of the media store 1135 dependent on the user attention state.
  • the session control module 1120 is typically implemented as a routine in a computing or processor based device, and an embodiment of this function (1120) is illustrated in figure 3 and described in greater detail below.
  • the session controller 1120 instructs the media store 1135 over the network using known control packets and protocols 1140 for example HTTP and RPC.
  • the media server 1135 holds one or more items of media content, such as movies, in a plurality of quality formats or quality levels and can switch between these formats according to instructions 1140 from the session control module 1120.
  • Various techniques used for implementing the switching will be known to those skilled in the art, for example server switching between streams at different bit rates, server provision of multiple bit rate streams which the user device or client can switch between, or parallel streams supporting different layers of hierarchically encoded content which can be switched on or off by either the server or client.
  • a duplex control link (incorporating return path 1145) may be used for more robust signalling, for example for server 1135 to return acknowledgements following receipt of an instruction from the session module 1120.
  • the quality formats range from high quality (large files or high bit-rate streams) to low quality (smaller files or low bit- rate streams) which may be implemented using lower frame rates or image resolution as is known.
  • different compression technologies may be used to provide smaller file sizes or lower bit-rate streams, and hence a reduced data rate over the network connection 1150.
  • a single media content file or bit stream may be crafted to reduce its bit-rate to the user device 1160, for example by sending only intra-coded frames (i-frames) and not the predicted or bi-predictive frames (p-frames and b-frames), as is known in some compression algorithms.
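As a toy illustration of that i-frame-only idea (a real implementation would operate on the codec bitstream, which is glossed over here), filtering a tagged frame sequence might look like this:

```python
# Hypothetical tagged frames: "I" intra-coded, "P" predicted, "B" bi-predictive.
frames = [("I", b"..."), ("P", b"..."), ("B", b"..."), ("I", b"...")]

# Keep only intra-coded frames to cut the bit-rate sent to the user device.
reduced_stream = [data for kind, data in frames if kind == "I"]
```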
  • the media content 1150 provided to the user device 1160 over the network 1130 has a quality level controlled by the session control module 1120 which in turn is dependent on whether a user face 1105 has been detected or not (user attention or non-attention).
  • This media content is received by the user device using a streaming client 1165 such as RealPlayer, Windows Media Player or Apple QuickTime for example, which has established a network connection 1150 to the media server 1135.
  • the media content is then played by the streaming client 1165 on the screen 1125 and/or an audio transducer such as loud speakers (not shown) at the quality level at which it is received from the media server 1135.
  • Figure 5 illustrates a graph of video quality over time.
  • t0 is the time after which the video quality is lowered once zero faces are seen (user attention state is non-attentive). This will be largely dependent on the length of pre-buffered video available.
  • the dotted line indicates the desired video quality if the VoD Server is able to deliver a gradual decline in quality.
  • the solid line indicates how this may be approximated by switching between discrete quality layers or levels using different compression algorithms for example.
  • This can be implemented using variable rate coding as is known. Whilst the detailed implementation of variable rate coding is beyond the scope of this specification, the interested reader is directed to A. Ortega, "Variable Bit-rate Video Coding", in Compressed Video over Networks, M.-T. Sun and A. R. Reibman (eds).
  • the gradient of the decline may be controlled by the demands of the network (a congested network will benefit from a rapid descent) or the usage patterns of the television (related to the probability that the viewer will return; this may be learnt by the system or defined by the user or programme maker). t1 is the time taken for the system to show full video quality on the return of the viewer (user attention state is attentive). As such it should be minimised to avoid any disruption to the viewer's experience.
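The dotted/solid-line behaviour of figure 5 can be sketched as follows, assuming a linear decline after the grace period t0 that is then snapped to the nearest available discrete layer; the grace period, gradient and layer values are illustrative only.

```python
LAYERS = [1.0, 0.75, 0.5, 0.25, 0.0]    # assumed discrete quality levels

def quality_at(seconds_inattentive, t0=10.0, gradient=0.02):
    """Quality as a function of time since the last face was seen."""
    if seconds_inattentive <= t0:
        return LAYERS[0]                 # grace period: full quality
    desired = max(0.0, 1.0 - gradient * (seconds_inattentive - t0))
    # Approximate the ideal (dotted-line) value by the closest layer.
    return min(LAYERS, key=lambda layer: abs(layer - desired))
```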
  • the data-rate of the streamed video is successively decreased. This results in bandwidth savings across the network 1130.
  • this may be accomplished by the server crafting an individual stream for the device or switching it to a prepared lower-rate source; or in a further alternative implementing the switching at the streaming client 1165 by switching between multicast streams at different bit rates - various other methods of implementing the quality level change will also be known to those skilled in the art.
  • the low-rate version may have a slower frame-rate, lower resolution or be otherwise degraded. As audio can be heard over a much wider range than video can be seen, the audio will typically not be degraded (or at least not as quickly as the video) in a television based implementation for example.
  • the degradation is such that the perception of quality decreases gradually, for example using the same approach detailed in Hands, D., Bourret, A., Bayart, D. (2005) Video QoS enhancement using perceptual quality metrics, BT Technology Journal, Vol. 23, No. 2. (April 2005), pp. 208-216.
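By way of illustration only, the staircase of Figure 5 might be approximated as in the Python sketch below; the number of levels and the timing values are invented for the sketch, with t0 as discussed above and a per-step delay corresponding to the t2 parameter discussed later with figure 6.

```python
def quality_level(seconds_absent: float, t0: float = 10.0,
                  t2: float = 5.0, max_level: int = 5) -> int:
    """Figure 5 staircase: hold full quality until t0 seconds have passed
    with no face seen, then drop one discrete level per further t2 seconds,
    approximating the dotted line's gradual decline."""
    if seconds_absent < t0:
        return max_level
    steps = int((seconds_absent - t0) // t2) + 1
    return max(0, max_level - steps)
```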
  • Either the VoD Server or Session Control module may therefore be configured to measure the current network traffic and use this to determine if the data-rate is decreased or not.
  • the return control path 1145 may be used to refuse an instruction from the session control.
  • Face detection may sometimes be in error.
  • the face detection function or session controller may be configured to aggregate results, for instance taking the statistical mode of the faces counted over a time window.
  • Figure 6 illustrates a method of operating the session controller 1120 in more detail.
  • the session controller monitors the output from the face detector 1115, and issues instructions 1140 to the media server 1135 accordingly.
  • the play step (605) indicates that the media content is being received over the network 1130 from the media server 1135 and is played on the screen 1125, for example using a streaming client 1165.
  • the method (600) determines whether or not faces have been detected by the face detection package 1115 (610). If the output or count of the face detector 1115 is greater than zero (610N) corresponding to a user attention state of "attentive", viewing or using, the session controller 1120 instructs the media server to transmit or stream the media content at the highest quality level (615).
  • the session control 1120 may instruct the streaming client 1165 to switch between content streams at different bit rates.
  • the server 1135 is arranged to transmit or multicast multiple streams of the same content, but at different bit rates. In this case there is no need to instruct the server 1135.
  • the client can be configured to switch on or off parallel streams or sessions associated with the different layers in order to change the content quality. The method then returns to playing the media content (605) which will now or shortly be at a higher quality level (eg higher data rate).
  • the increased quality level may be implemented by rapidly increasing the quality level over a series of intermediate quality levels, in order to allow time for the higher quality levels to be buffered before playing. Again this is described in Walker et al (2003), especially Section 4.
  • the server switches between different bit rate content streams so that the user device continues to receive the same stream or maintains the same session but at different bit rates according to the server switching. This avoids having any delay introduced whilst the high quality media content is buffered at the user device 1160 before it can be played.
  • An intermediate quality level media content stream may be played with little or no buffering immediately whilst the high quality media content is buffered then played so that there is no interruption.
  • the method determines whether a predetermined length of time (t2) has elapsed since the last face was detected or the last quality level reduction was instructed (620). This avoids the quality being reduced too quickly, for example if the viewer has only momentarily looked away from the screen 1125. If the length of time has not been of sufficient duration (t2) since the last quality reduction (620N), then the method returns to playing the media content (605). If however there has been a predetermined period of time (t2) since the last face was detected or the last quality reduction instructed (620), then the method instructs the media server 1135 to reduce the quality level by one quality level (625). This may continue so that over time, with no faces detected, the quality level of the media content eventually falls to the lowest level - which may be zero data rate or a black screen for example. Again the reduce quality instruction to the media server 1135 may be in any suitable format.
  • the length of the time based parameters t0 and t2 can be configured according to system design.
  • t0 may be derived from the storage/network constraints as it reflects the size of the video buffer at the client - assuming that frames already received should not be thrown away.
  • t2 may be set according to the speed at which the video is intended to decline.
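A minimal Python sketch of the Figure 6 control loop follows, by way of example only; the face_detector.face_count() and media_server.set_quality() interfaces are assumed stand-ins for the face detector 1115 output and the instructions 1140, and for simplicity a single delay t2 gates every reduction (whereas the description above distinguishes t0 for the first reduction).

```python
import time

def session_control_loop(face_detector, media_server,
                         t2: float = 5.0, max_level: int = 5) -> None:
    """Restore maximum quality while faces are detected (steps 610/615);
    otherwise step the quality down one level every t2 seconds (620/625)."""
    level = max_level
    last_event = time.monotonic()    # last face seen or last reduction
    while True:
        if face_detector.face_count() > 0:          # attentive
            if level < max_level:
                level = max_level
                media_server.set_quality(level)     # instruction 1140
            last_event = time.monotonic()
        elif time.monotonic() - last_event >= t2:   # non-attentive for t2
            if level > 0:
                level -= 1
                media_server.set_quality(level)
            last_event = time.monotonic()
        time.sleep(0.1)
```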
  • Figure 7 illustrates a method of operating the media server 1135 in more detail.
  • the media server, having set up a connection with the streaming client 1165 on the user device 1160, transmits or streams packets of media content to the streaming client (405).
  • the media content streamed is at a particular quality level or data rate, and the same media content is also stored in different formats having different assigned quality levels. For example lower quality media content may be highly compressed, have a low frame rate or image resolution, or a combination of these.
  • the method then "listens" for quality control instructions from the user device's session controller 1120 (410). If no new quality level instruction is received (410N), then the media server continues to play the media content at the same quality level (405).
  • the method sets the streaming quality level to the instructed level by switching to a different content stream (415). This may be implemented simply by switching to transmitting a different content file, matching the play index of the previous stream to the new stream so that streaming of the new file starts at the right location. Again the mechanism for switching between quality levels is described in Walker et al (2003) and WO03/009581 as noted previously. The method then transmits or streams the lower quality content (405) until instructed to again change this by the user device's session controller 1120.
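The Figure 7 server loop might be sketched as follows, again by way of illustration only; it assumes the differently encoded copies of the content are packet-aligned so that the play index carries over when streams are switched, and content_by_level, control_queue and send_packet are hypothetical stand-ins.

```python
import queue

def serve_content(content_by_level, control_queue, send_packet, level):
    """Stream packets at the current quality level (405), listen for a new
    level (410), and switch streams at the matching play index (415)."""
    index = 0
    stream = content_by_level[level]
    while index < len(stream):
        send_packet(stream[index])                # step 405
        index += 1
        try:
            level = control_queue.get_nowait()    # step 410
            stream = content_by_level[level]      # step 415: same index,
        except queue.Empty:                       # different stream
            pass
```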
  • the camera 1110 used may be provided with a light source (not shown) to illuminate the faces for detection and recognition.
  • the ambient lighting may be relied upon, as this will be contributed to by the content displayed on the screen 1125.
  • This may be complemented by using a long exposure time for each frame; in this regard it is helpful that viewers tend to stay relatively motionless whilst watching video content.
  • various night mode and "camera-shake" facilities are already “built-in” to many digital cameras and can be used in this situation to improve the facial images provided to the face detector 1115.
  • a sequence of images or frames can be summed such that each pixel in the resultant image is the sum of the luminance values in corresponding positions in the other frames. The summed images can then be used for face detection/recognition.
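A sketch of this frame summation is given below, assuming NumPy and 2-D 8-bit luminance frames; the rescaling back into 8-bit range is added here so that the result remains a valid image, whereas the text above simply sums the values.

```python
import numpy as np

def long_exposure(frames):
    """Sum per-pixel luminance over a sequence of frames - a software long
    exposure - then rescale into 8-bit range for the face detector 1115."""
    acc = np.zeros(frames[0].shape, dtype=np.uint32)
    for frame in frames:
        acc += frame                  # corresponding pixels are summed
    return (acc // len(frames)).astype(np.uint8)
```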
  • the face detector 1115 or the session controller 1120 can be configured to only recognise a change of attention status after this has been confirmed on a number of subsequent face detections. For example after 10 image frames from the camera all indicate that there is one less or one more face than previously detected.
  • This might be implemented in the method of figure 6, for example at step 610, by an additional routine holding the "face count" parameter noted at time x, then comparing the current "face count" parameter 10 times (once for each camera image, say) at times x+1, x+2, ... x+9, and then determining whether the average "face count" parameter is less than, greater than, or equal to the "last face count" parameter.
  • a statistical mode of the observations could be used to give an integer rather than a floating point number from the mean calculation.
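The aggregation just described might look like the following sketch, in which the mode of the last ten raw face counts is reported as the confirmed count; the window length of 10 matches the example above, and the class and method names are invented.

```python
from collections import Counter, deque

class FaceCountFilter:
    """Report a change in face count only once it dominates a window of
    observations, using the statistical mode to stay integer-valued."""
    def __init__(self, window: int = 10):
        self.counts = deque(maxlen=window)

    def update(self, raw_count: int) -> int:
        self.counts.append(raw_count)
        mode, _ = Counter(self.counts).most_common(1)[0]
        return mode
```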
  • some face detection software packages can be configured to provide a confidence measure related to how confident the software module is that it has detected a face.
  • This confidence measure or output can be used in an embodiment by the attention interpreter to decide when a face has become absent, for example by monitoring the confidence measure over time and detecting the absence of a face when the confidence measure drops below a predetermined threshold.
  • detection of the absence of a face may only follow a characteristic temporal pattern such as a sharp drop-off in the confidence measure, rather than a gradual decline say which may be due to environmental changes.
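One possible reading of this confidence-based test, as a hedged Python sketch (threshold and min_drop are invented tuning parameters): the face is declared absent only when the confidence is below the threshold and arrived there via a sharp per-sample drop rather than a gradual decline.

```python
def face_absent(confidences, threshold: float = 0.4,
                min_drop: float = 0.3) -> bool:
    """confidences: recent detector confidence samples, oldest first.
    A slow drift below the threshold (e.g. changing room lighting) does
    not trigger absence; a sharp drop-off does."""
    if len(confidences) < 2 or confidences[-1] >= threshold:
        return False
    return (confidences[-2] - confidences[-1]) >= min_drop
```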
  • the streaming client 1165 can set up multiple RTP sessions each associated with a different quality level of the content, for example the content in different compression formats or at lower frame rates. Then as the required quality level changes, either the media server 1135 starts transmission at the different quality level on one session and stops transmission on the current quality level on another session, such that the bits received by the receiving device 1160 change bit rate. Alternatively the same stream or session is maintained but the bit stream used by the server is switched, where each bit stream has a different bit rate. The received bits will be buffered and the rate at which the buffered bits are taken and decoded from the buffer by the client 1165 will also be changed to correspond with the new bit rate.
  • a hierarchical coding approach in which the original content, for example video data, is encoded into a number of discrete streams called layers, where the first layer consists of basic data of a relatively poor quality and successive layers represent more detailed information. Layers can be added to increase the image quality, or taken away to degrade the image or other content quality level; this effectively means that the bit rate is decreased when layers are removed, or increased when layers are added, providing the required changes in quality level.
  • Layered video compression is known from the 1998 version of H.263, but may equally be any other codec, such as MPEG4.
  • Each layer in the hierarchy is coded in such a way as to allow the quality of individual pictures to be enhanced and their resolution to be increased, and additional pictures to be included to increase the overall picture rate.
  • Figure 8 shows a typical dependency between pictures in an H.263 scalable layered codec, with boxes representing frames for each layer and arrows showing dependency between frames.
  • the lowest row shows original, uncoded frames.
  • the next row shows the lowest layer (Layer 0) of the hierarchy which is coded at half the frame rate of Layer 1.
  • Frames in Layer 0 are predicted from the previously encoded frame, as in conventional video compression.
  • Frames in Layer 1 may be predicted from the previously encoded frame in Layer 1 and, if present, the temporally simultaneous Layer 0 encoded frame.
  • Frames in Layer 2 may be predicted from the previously encoded frame in Layer 2 and, if present, the temporally simultaneous Layer 1 and Layer 0 encoded frames.
  • the H.263 specification allows for 15 layers, though a smaller number can be used in practical embodiments.
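To make the layered arithmetic concrete, a small sketch with invented per-layer rates: because every enhancement layer is predicted from the layers beneath it, layers must be active contiguously from Layer 0, and the delivered bit rate is simply the sum over active layers.

```python
LAYER_KBPS = [128, 128, 256, 512]      # assumed rates for Layers 0..3

def active_layers(quality_level: int) -> set:
    """Layers are enabled contiguously from Layer 0 upwards, since each
    enhancement layer depends on the layers beneath it."""
    return set(range(quality_level + 1))

def total_bitrate(layers: set) -> int:
    """Dropping a layer lowers the delivered rate; adding one raises it."""
    return sum(LAYER_KBPS[i] for i in layers)

# e.g. total_bitrate(active_layers(2)) -> 512 kbit/s (Layers 0-2)
```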
  • Figure 9 illustrates a media server 1135 which uses the layered approach to content delivery.
  • the media content in this case is stored in a data-store 905 already compressed, although it could be received from a live feed for example.
  • the content is packetised by an RTP packetiser 910 according to the Real-time Transport Protocol (RTP), although other protocols could alternatively be used.
  • the packetiser 910 attaches an RTP header to the packets, as well as an H.263 Payload Header as is known.
  • the payload header contains video specific information such as motion vector predictors.
  • the packets are numbered by a packet numbering function 915 in order to allow the receiving client to recover the correct order of the content packets.
  • the layered encoding process uses a control strategy together with an output buffer for each layer in order to provide each layer's constant bit-rate output.
  • Each layer is transmitted as an independent RTP session on a separate IP address by a corresponding session handler 925.
  • the rate at which data is transmitted is controlled by the Transfer Rate Control module 920 which counts Layer 0 bytes to ensure that the correct number are transmitted in a given period of time.
  • the transmission rate of the outer layers is smoothed and locked to the rate of Layer 0 using First-In-First-Out buffer elements 930.
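The byte-counting idea behind the Transfer Rate Control module 920 could be sketched as below, for illustration only; packets are assumed to be bytes objects and send is a hypothetical transmit callback.

```python
import time

def paced_send(packets, bytes_per_second: int, send) -> None:
    """Count bytes as they are sent and sleep out the remainder of each
    one-second window, so the output approximates a constant bit rate."""
    sent, window_start = 0, time.monotonic()
    for pkt in packets:
        send(pkt)
        sent += len(pkt)
        if sent >= bytes_per_second:
            elapsed = time.monotonic() - window_start
            if elapsed < 1.0:
                time.sleep(1.0 - elapsed)
            sent, window_start = 0, time.monotonic()
```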
  • Figure 10 illustrates a streaming client 1165 which uses the layered approach to content delivery and reception.
  • Each RTP/RTCP session associated with each layer of encoded data has a session handler 705 at the client which is responsible for receiving RTP packets from the network. These packets are forwarded to a blender module 710 which receives the packets in the order in which they were received from the network. This may not be the order required for decoding because of packet inter-arrival jitter or packet loss.
  • the blender 710 uses the packet numbers in the RTP headers to arrange the packets from each layer in the right order, and then combines the packets from all the received layers.
  • the output from the blender 710 is a single stream of packets in the correct order for decoding.
  • the packets are then sent to a decoder 715 for decoding into video samples or pictures.
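A sketch of the blender 710's reordering step follows, using a heap keyed on the packet numbers applied by function 915; a real client would also need loss and timeout handling, which this illustration omits.

```python
import heapq

class Blender:
    """Merge packets arriving from several layer sessions and release them
    to the decoder 715 in packet-number order."""
    def __init__(self):
        self.heap = []
        self.next_seq = 0

    def push(self, seq: int, packet: bytes) -> None:
        heapq.heappush(self.heap, (seq, packet))

    def pop_ready(self) -> list:
        """Return the packets whose turn has come, in decode order."""
        out = []
        while self.heap and self.heap[0][0] == self.next_seq:
            out.append(heapq.heappop(self.heap)[1])
            self.next_seq += 1
        return out
```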
  • the client 1165 also comprises a controller 720 which controls operation of the RTP session handlers 705 and blender/buffer 710.
  • the session control 1120 instructs the server 1135 to drop or add a layer.
  • This may be implemented in the present embodiment by the server instructing a corresponding RTP handler (for example 925₃ for Layer 3) to close a current session with the corresponding handler (705₃) at the client 1165, or to open a new session in order to transfer the contents of its output buffer 930₃.
  • the session control 1120 may directly instruct the controller 720 in the client 1165 to open or close an RTP session using a local session handler 705, depending on which layer is to be added/dropped.
  • the various layer RTP sessions may be maintained open, but corresponding layer encoded packets may not be sent depending on the quality level currently required. This may be implemented at the FIFO buffers 930 for example, with packets being dropped after a certain time. Then when a higher quality level is requested, the packets are routed to the corresponding RTP handler 925 where the RTP session is already open so that there is no delay in increasing the quality level of the content provided.
  • the low bit-rate RTP session provides low bit rate packets to the blender which starts filling up its newly enlarged or lengthened buffer. Meanwhile the packets from the higher layers start arriving and can be combined with the low bit-rate packets waiting in the buffer in order to form the higher quality content. Initially the low rate packets can be sent to the decoder in order to maintain the content at an initial low rate, then increasingly enlarged batches of packets are sent from the buffer to the decoder to provide the higher quality level content.
  • a final embodiment (not shown) combines certain of the above separately described features, so as to provide an embodiment in which, as well as reducing the quality of streamed media in response to detecting inattentiveness, on detecting re-attentiveness the system not only increases the quality back to the higher level but also offers the user a smart option such as a smart rewind option or a summary option.
  • real-time video streaming for example from a camera or other source of unencoded or otherwise high quality video data could be provided.
  • the embodiments may be implemented on a range of apparatus for example set-top boxes coupled to a VoD system over the Internet, or wireless mobile devices such as mobile phones where the network is a wireless network such as GPRS for example.
  • the embodiments may be implemented as processor control code, for example on a carrier medium such as a disk, CD- or DVD-ROM, programmed memory such as read only memory (firmware), or on a data carrier such as an optical or electrical signal carrier.
  • the embodiments may be implemented using a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
  • the code may comprise conventional programme code or microcode or, for example code for setting up or controlling an ASIC or FPGA.
  • the code may also comprise code for dynamically configuring re-configurable apparatus such as re-programmable logic gate arrays.
  • the code may comprise code for a hardware description language such as Verilog™ or VHDL (Very high speed integrated circuit Hardware Description Language).
  • the code may be distributed between a plurality of coupled components in communication with one another.
  • the embodiments may also be implemented using code running on a field- (re)programmable analogue array or similar device in order to configure analogue hardware.
  • a method of operating an electronic device to play media content comprising: playing the media content at a first quality level; determining an attention state of a user of the electronic device; playing the media content at a second quality level in response to detecting a change in the user attention state.
  • determining the user attention state comprises detecting the presence or absence of a face within a predetermined area.
  • the media content is received over a network from a media server, and wherein the method further comprises switching from a first media content stream at a first bit rate corresponding to the first quality level to a second media content stream at a second bit rate corresponding to the second quality level in response to determining a change in the user attention state.
  • An electronic device for playing media content comprising: means for determining an attention state of a user of the electronic device; means for playing the media content at a quality level dependent on the user attention state.
  • the means for determining the user attention state comprises means for detecting faces.
  • the playing means comprises: a session control module; means for receiving and playing media content transmitted from the media server; the receiving means arranged to switch from a first media content stream at a first bit rate corresponding to the first quality level to a second media content stream at a second bit rate corresponding to the second quality level in response to the session control determining a change in the user attention state.
  • the playing means comprises: a session control module for communicating with a media server; means for receiving and playing media content transmitted from the media server; the session control module arranged to instruct the media server to transmit the media content at a quality level dependent on the user attention state.
  • a device wherein the media content is played at a further reduced quality level for each consecutive predetermined period in which the absence of a face is detected.
  • a system for playing media content comprising: an electronic device for playing the media content according to any one of clauses 8 to 13; a network coupled to the device and a media server for transmitting the media content over the network to the device.

Abstract

The present invention relates to providing intelligent user control options for consuming media content such as video images to one or more viewers. The present invention provides an apparatus for playing media content, and comprising: means for detecting a face (110, 115); means for playing the media content and associating the media content being played with a play index (130); means for storing the play index of the media content being played in response to detecting the absence of a previously detected face (240); means for generating a user interface (140, 145) in response to detecting a face following the detection of the absence of a face; wherein the playing means is arranged to re-play at least a part of the media content depending on the stored play index in response to receiving a user input from the user interface.

Description

PLAYING MEDIA CONTENT
Field of the Invention
The present invention relates to providing intelligent user control options for consuming media content such as video images to one or more viewers.
Background of the Invention
Equipment for providing media content such as video images is well known and includes the video cassette recorder (VCR), digital video disc players (DVD), hard disc drive based recorders (HDD), and terrestrial, cable and satellite TV set top boxes coupled to a viewing device such as a television set (TV). Increasingly personal computers are also being used to provide media content, and may include DVD disc readers, TV tuner capture cards, and Internet file download and streaming facilities. Streaming media is increasingly being used for fast and convenient access to media content such as Video on Demand (VoD) and is typically implemented using a remote storage server transmitting over a network to a receiving apparatus such as a computer or mobile phone. Typically the quality of the media content streamed will depend on the bandwidth of the network over which it is transmitted; for example, using a broadband Internet connection a high quality level media content transmission may be made, whereas over a wireless network the same media content may need to be transmitted at a lower quality level given the constraints of the lower bandwidth connection.
The media content quality level can be adjusted by using different compression technology (data rate), resulting in fewer packets to be sent over a unit time, a reduced number of frames for video, and lower resolution images. The quality of media content can also be adjusted dynamically; for example British Telecom's Fastnet video streaming technology provides media content in a number of different quality formats, and switches between them depending on network congestion. Thus if the network becomes congested, the media server switches to streaming a lower quality level media content to the consuming device, for example a mobile phone. This is described in Walker, M. D., Nilsson, M., Jebb, T. and Turnbull, R., Mobile Video-Streaming, BT Technology Journal, Vol. 21, No. 3 (September 2003), pp. 192-202.
However the world is a distracting place and media displaying equipment is generally not designed to accommodate this in an efficient and user friendly manner. For example, a distraction such as a phone call or the door-bell ringing during playback of a DVD may necessitate the viewer manually rewinding the media content to the point of interruption in order to continue viewing without missing any of this content. Recent advances in video content recording using HDD devices such as TiVo™ allow a user to press a "pause" button during a live broadcast in order to accommodate an interruption. When the viewer returns, the HDD has started recording the missed content so that the user can return to watching the video content from the interruption time by pressing a "continue" button for example.
WO2004084054 describes an attempt to automate such facilities by utilising an eye gaze attention recognition system which detects when a viewer's pupils are directed at a viewing screen. Detection of attention or non-attention can then be used to pause playback of media content. However such an arrangement requires expensive eye gaze detection equipment, which also needs a light source to illuminate the pupils of the viewer.
PCT published patent application No. WO 2006/061770 and US 2002/144259 describe intelligent pause buttons in which user attentiveness is monitored by some mechanism (several examples are given, including using a video camera to monitor the user's gait or movements, etc) and if user inattentiveness is detected then it pauses the display of the media until user attentiveness is again detected. PCT published patent application WO 2003/026250 describes a system in which user attentiveness is again measured and in the event that user inattentiveness is detected during playback of a stream of media, the quality of the media is reduced. Again a number of different ways of detecting user attentiveness/inattentiveness are described including techniques involving cameras.
Summary of the Invention
The present invention provides user media content control options automatically based on a face detection system or other means of detecting user attention states. It has been found that detection of a face directed towards a viewing screen is surprisingly highly indicative of user attention. Furthermore, face detection systems can be much simpler if the face to be detected is always oriented in the same direction (and in the same direction as the training data). Although people do look out of the corners of their eyes, this is typically followed shortly afterwards by reorienting the head in the same direction. Thus using face detection as a mechanism for detecting user attentiveness is especially efficient because the face detection system works best when the face is oriented in the same way as in the training data. Thus if a face is present but not oriented in the same way as the training data, it is likely to be the case that the respective person is not actually watching the screen displaying the media, and so a correct conclusion of inattentiveness is reached without having to perform any complex head orientation processing as suggested in WO 2003/026250. Also face detection is simpler than tracking eye gaze and does not require expensive pupil tracking mechanisms. Furthermore because the face is larger than the eyes, the detection equipment does not need to be as sensitive and can detect its targets at a greater distance than eye tracking. This is especially useful in informal viewing situations such as a number of viewers in a lounge watching a TV screen from a distance, perhaps up to 5 metres, rather than a single viewer up close to a computer monitor for example. Face detection can be implemented using cheap off-the-shelf equipment such as a web-cam and readily available computer software. When the absence of an attentive face is detected, the system notes or stores a play or distraction index or time (eg media content duration so far) for the media content being played, and continues to play the media content, most preferably at a reduced quality level. When an attentive face is then (re)detected, the system offers the viewer various viewing control options based on the noted play index. This can be implemented as a user interface, for example a graphical user interface superimposed over the playing media content in the form of a user message and/or user actuable soft screen button. The user may then choose to continue playing the media content by ignoring the generated user interface, or to "rewind" the media content by actuating or operating the user interface so that the media content re-plays from the noted play index. Various other user control options are contemplated including viewing a "summary" of the missed content.
By providing the user with play options such as "rewind" rather than automatically pausing upon detection of user non-attention, the system provides a more user friendly control interface than other known approaches. There is no unnecessary pausing and instead a user interface or control option is provided. In some situations a user may be happy for the media content to continue playing rather than pausing even if they are interrupted, for example when momentarily discussing the content with a fellow viewer or during ad breaks. Furthermore a high number of (unnecessary) pauses can be annoying for the viewer and this is exacerbated where these are based on false detection of a user non-attention state. Furthermore it has been estimated that there are typically fewer false positives (ie detection that the user has looked away when in fact they haven't) with face detection than pupil tracking for example, and even when false positives are detected their impact is less annoying for the user.
By continuing to play the media, but at a reduced quality level, the user is able to keep half an eye or ear on the media and can therefore reach a good conclusion as to whether or not the "missed" content contains something that the user would like to "see" (whilst paying full attention). However, by reducing the quality level, the bandwidth occupied by the streaming media is greatly reduced which is beneficial to the operator. Furthermore, since the system has detected user inattentiveness, it is likely that the user will not greatly notice or mind the fact that the quality level of the media is reduced. In an embodiment, a web cam or other digital camera apparatus and face detection software are used in conjunction with computer equipment arranged to play media content. The face detection software used in an embodiment is based on an algorithm disclosed in Wu J., Rehg J. M., Mullin M. D. (2004) Learning a Rare Event Detection Cascade by Direct Feature Selection, Advances in Neural Information Processing Systems vol.16 - see also http://www.cc.gatech.edu/~wuix/research.htm. However other face detection software could alternatively be used, for example Mobile-I available from Neven Vision. The web cam and face detection software detects a face, and the computer plays the media content such as a movie. When the user leaves the computer screen to answer the phone for example, the absence of a face is detected and used by the computer to store the current play index of the movie as well as to reduce the quality at which the media is being streamed (in the case where the media is being streamed over a connection having a limited amount of bandwidth for which there is competing demand from other users). When the user returns to the viewing screen of the computer, a face is detected which triggers the computer to display a "rewind" button on the screen, and to increase the quality level of streamed media back to the "normal" level appropriate for when a user is viewing the media with attention. If the rewind button is actuated by the user, for example by pressing a button on a remote control device, the computer stops the current playing of the media and restarts playing from the stored play index (or shortly before this); otherwise the rewind button disappears from the display after a brief period and the media continues to play uninterrupted.
In another embodiment, the face detection software and/or computer are arranged to identify detected faces for example by associating an identifier with each detected face.
Then when a user returns to view the screen, the system detects the new face and recognises it as a previously identified face and displays an appropriate or personalised control option for that user. For example the returned user may be offered a summary of the content missed, either on the entire screen or in a corner of the screen with the main content continuing to play in the background. As a further alternative, the summary may be offered on a separate personal device, for example a PDA or phone. When detecting whether a face is present/attentive or not, the system can be configured to receive multiple face detection outputs or readings before basing a decision on those inputs, in order to minimise the likelihood of false positives or false negatives - for example a determination that the user is looking away when in fact they are attentive. This means that say one reading that a user is non-attentive amongst say 10 readings that the user is attentive can be ignored, and the system simply responds to the user being attentive. This also allows some flexibility to be built into the system, for example allowing a user brief distractions (eg a brief look at another user) without generating the user interface.
In one aspect there is provided apparatus for playing media content, and which is arranged to offer smart media content consumption (for example viewing or listening) options including rewinding to a point where consumption was interrupted. The apparatus includes means for detecting a face such as a camera and a face detection system or software executed on a processing platform such as a computer or set-top box. The apparatus also comprises means for playing the media content and associating the media content being played with a play index; examples include a DVD player which includes a play index or a streaming media client. The apparatus (for example the processing platform) is arranged to store the play index of the media content being played in response to detecting the absence of a previously detected face. The apparatus further comprises means for generating a user interface in response to detecting a face following the detection of the absence of a face. This may be a displayed message with a user control input facility such as a remote control device or soft button on the display showing the played media.
If no user input is received, the user interface is deactivated, for example the displayed message is deleted. If however a user input is received through the user interface, the playing means is arranged to re-play at least a part of the media content depending on the stored play index in response to receiving a user input from the user interface. For example the missed content may be replayed from the stored play index, or a play index dependent on this such as a few seconds before. The user interface may be implemented by generating a user alert such as a displayed message or audible sound, and facilitating the generation of a control signal in response to the user input, wherein the apparatus is responsive to the control signal to instruct the playing means to re-play the at least a part of the media content. Facilitation of the control signal may be achieved by monitoring incoming remote control signals for a predetermined signal for example.
In an embodiment, the apparatus also comprises a face recogniser which may be implemented using face recognition software executed on the processing platform. The face recogniser maintains a database of faces so that it can either recognise and identify a newly detected face, or add this to the database together with an identifier which can be used with the rest of the system. The face recogniser can be used to recognise a particular recently absent face in order to offer a personal service such as offering to provide a summary of content that user has missed when there are a number of users or viewers using the system.
There is also provided a method of playing media content such as a movie and having a play index, the method comprising playing the media content in response to detecting a face, and storing the current play index in response to detecting the absence of a previously detected face. In some embodiments there is no need to recognise the absent face, it is enough that previously there was one face detected, then it was not detected (absent). The method further comprises generating a user interface associated with the stored play index in response to detecting a face. Again there is no requirement for recognising this face, merely that a face has again been detected.
In an embodiment, the method further comprises re-playing at least part of the media content depending on the stored play index in response to receiving a user input from the generated user interface. If no user input is received, the user interface is no longer generated after a short time. In an embodiment the method further comprises identifying each detected face, wherein a personalised user interface is generated each time a previously absent identified face is again recognised.
In another aspect there is also provided a method of playing media, the method comprising: playing the media content in response to detecting a face; pausing the media content in response to detecting the absence of a detected face; resuming playing the media content in response to again detecting a face.
According to a further aspect of the present invention, there is provided a method of playing media content at different quality levels depending on the attentiveness of the user(s) in which attentiveness is detected using face detection which is correlated with user attentiveness and/or in which a smart rewind function is offered to a user upon detection of user re-attentiveness. In other embodiments, different measures or indicators of user attentiveness may be used, for example infra-red or other proximity detectors, eye or pupil tracking devices, or even monitoring interaction with the system such as button presses. In this specification the term user attention state is used to indicate whether or not a user is attentive or paying attention to the media content being played. As noted, this can be determined by detecting the presence (attentive) or absence (non-attentive) of a face for example. One or a combination of attentiveness or user attention state detectors may be used, and this may be configured to depend on the content being used. For example audio-only content does not require the user to be directly in front of the speaker, but merely within the room say. Whereas a 3D movie may require a user to be positioned within a certain location space for optimum viewing.
In an embodiment, in order to implement the different quality levels, a media server which is supplying the media content is instructed by a media client of a user device to play the media content at a quality level dependent on the detection of a face. When the user device detects a face notionally viewing a screen onto which the media content would be played, the user device instructs the media server to transmit a higher or the highest quality media content. Whereas if no face is detected, which corresponds to viewer or user inattention (a user non-attention state), the user device or apparatus instructs the media server to reduce or degrade the media content quality, hence reducing the data rate. This may be achieved by the server switching to transmit a lower bit-rate stream to the user device, for example one coded using a higher compression ratio compression algorithm, or switching off one of a number of parallel streams or sessions each supporting a different layer in hierarchically encoded data. In an embodiment, over a period of lack of viewer attention or interest - corresponding to an absence of detected faces - the user device continues to instruct a reduction in media content quality level so that the media content is degraded over time, perhaps to no content at all. Then when user attention (eg a face) is again detected, the media content quality is increased, perhaps initially to the highest quality so that the user does not notice any reduction in quality of the content that is actually viewed, as well as providing a smart rewind function at that time. User inattention may occur for a number of reasons, for example because the user or viewer has left the room to get a drink, answer the telephone or door, or looked away to talk with someone else in the room. By degrading the quality of media content such as VoD which is not actually being viewed, network traffic and congestion is reduced in this embodiment.
In another embodiment, the media server multicasts or transmits multiple streams of the same media content at different quality levels or bit rates, and the user device switches between the streams depending on the user attention state determined. Alternatively the user device may switch off one of a number of parallel streams each supporting a different layer in hierarchically encoded content.
In another embodiment the media server may be integral with the attention detection apparatus, for example all contained as part of a computer system, rather than having the media server at a remote location. For example the embodiment may be implemented on a personal computer with a DVD player as the media server. In this case, no network is required in order to couple these two parts of the overall system. The use of degrading the quality level of the content in this embodiment can be used to reduce the power consumption of the computer by lowering the brightness of the display (or turning it off altogether when not required to show content at all).
Brief Description of the Drawings
Embodiments will now be described with reference to the following drawings, by way of example only and without intending to be limiting, in which:
Figure 1 is a schematic showing a system for detecting user attentiveness and offering user controls based thereon;
Figure 2 is a flow chart illustrating a method of responding to detected user inattentiveness;
Figure 3 is a flow chart illustrating another method of responding to detected user inattentiveness;
Figure 4 is a schematic showing a system for responding to user inattentiveness when watching streamed media;
Figure 5 is a graph illustrating degradation of video on demand content in response to viewer non-attentiveness;
Figure 6 is a flow chart illustrating a method of operating a user device or apparatus for playing the media content;
Figure 7 is a flow chart illustrating a method of operating a media server for playing the media content;
Figure 8 illustrates a layered approach to streaming media content at different quality levels or bit rates; Figure 9 illustrates a media server architecture for layered streaming media content; and
Figure 10 illustrates a streaming client at the user device for receiving layered streamed media content.
Detailed Description
Figure 1 illustrates a system according to an embodiment, the system 100 comprising a user's face 105, a camera 110 or viewing equipment such as a web cam or other camera, a face detector 115, optionally a face recogniser 120, an attention interpreter 125, a media store 130 such as a DVD player containing video content such as a movie, a viewing screen 135, a user interface message display 140 (and optionally a user actuable soft button), and a user input device 145 such as a remote control unit. The face detector 115, face recogniser 120, and attention interpreter 125 are typically implemented as software executed on a computing platform such as a personal computer or dedicated video content equipment such as a DVD player or set-top box. The viewing screen 135 may be part of the computing platform, or a stand-alone device such as a plasma-TV screen for example. The Media Store may be a buffer for streamed content, or may exist remotely over a network - eg Video on Demand (VoD).
The camera 110 is arranged to capture images pointing away from the viewing screen 135 in order to "see" user faces 105 viewing the viewing screen. Typically the camera 110 will be positioned at right angles to the plane of the viewing screen 135. The camera can be arranged such that the viewing angle of the screen and the camera's field of view are largely coincident; and this may be accomplished with additional lenses if necessary. Video or still images from the camera 110 are directed to the face detector 115, which may be implemented using various well known software packages on a computing or processor platform. Examples of face detection algorithms include: Wu et al (2004); modules of Neven Vision's Mobile-I™ face recognition software developer's kit (SDK); Detecting Faces in Images: A Survey, Ming-Hsuan Yang, David J. Kriegman and Narendra Ahuja - http://vision.ai.uiuc.edu/mhyang/papers/pami02a.pdf; C-VIS Computer Vision and Automation GmbH Face Snap; or similar technologies. These and similar packages may be either configured simply for face detection (115) or additionally for face recognition (120). The face detector function 115 simply reports whether a face has been detected or how many faces are currently detected within the field of view of the camera 110. The face recogniser function 120 either adds a new detected face to a database of recognised faces, or notes a match between a detected face and a face recorded in the database. The database may be temporary and used only for each media content playing session - such as each movie viewing. This is because a different set of faces may watch a different movie. Alternatively the database may collect face identities from multiple media content playing sessions, to offer personalised services to regular viewers. Examples might include providing summaries or rewinds for individual viewers depending on content that they personally have missed.
Various technologies are available for implementing face recognition; for example the Mobile-I™ approach uses Gabor Wavelets to locate individual facial features and to graph the geometry of interconnecting features; this graph is constructed for each face and can then be matched against the faces in the database. Whilst the specific operations of this software are beyond the scope of this discussion, this is described in more detail in Johannes Steffens, Egor Elagin, Hartmut Neven: PersonSpotter - Fast and Robust System for Human Detection, Tracking and Recognition. FG 1998: 516-521 - see http://ieeexplore.ieee.org/iel4/5501/14786/00671000.pdf?arnumber=671000. Another well known approach is that of using Eigenfaces, which takes each original facial image and uses a pre-calculated set of base images (each is an Eigenface) in different combinations to create a best fit for the particular original image. The face can then be identified by referring to its combination of pre-determined Eigenfaces. This is described in more detail in M. Turk and A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuroscience, vol. 3, pp. 71-86, 1991 - see http://www.cs.ucsb.edu/~mturk/Papers/jcn.pdf. The Mobile-I™ and some of the other packages attempt to identify faces from any angle, and the system when using these packages can benefit from further identifying "looking ahead" faces, that is only faces that are looking directly (0 degrees) at the screen 135 or within a certain range of angles, for example up to say 20 degrees. Various algorithms are available to supplement the basic face recognition packages, including for example that discussed in Wu J., Rehg J. M., Mullin M. D. (2004) Learning a Rare Event Detection Cascade by Direct Feature Selection, Advances in Neural Information Processing Systems vol.16 - see also http://www.cc.gatech.edu/~wuix/research.htm.
In this embodiment, the Wu et al algorithm can be configured to detect only "face-on" faces by training it only on face-on examples. From these examples it learns its own rules and these pre-learnt rules can then be used to only detect new faces when they are face-on. This algorithm then gives the location of a rectangle within the camera image that contains the face. In order to integrate this with the Neven Mobile-I system, this result is fed directly into the Mobile-I Pre-selector stage (see figure, page 2, Steffens et al). This software package is therefore configured to detect faces that are, or filter out faces that are not, "face-on" or attentive and then feed these results on to the face recogniser 120. In an alternative embodiment there may be no restriction on the orientation of detected faces, merely whether faces are detected or not.
Note that the camera 110 is preferably located very near to the screen 135 so that users viewing the screen will be "face-on" to the camera.
Whichever face detector 115 and optionally face recognition 120 package is used, these will provide the number of detected faces (115) and, if face recognition is used, identifiers for each detected face (120) as outputs to the attention interpreter 125. The attention interpreter 125 interprets the outputs of the face detector 115 and face recogniser 120 as user (105) attention states, and controls operation of the media store and a user interface (140, 145) accordingly. The attention interpreter 125 is typically implemented as a routine in a computing or processor based device, and two embodiments of this function (125) are illustrated in figures 2 and 3 and described in greater detail below.
The attention interpreter 125 controls playing of the media store 130 for example playing, and rewinding of a movie played on a DVD player. In addition the attention interpreter 125 generates a user interface with the user 105 in order to allow the user to control operation of the media store dependent on the user's attention status as interpreted by the attention interpreter 125. In the embodiment illustrated, the attention interpreter 125 generates a user message display or user alert 140 on the viewing screen 135 superimposed on the content displayed on the screen by the media store 130. This could be provided in a new pop-up window on a Microsoft Windows ™ based computer for example, but typically with an overlaid graphic. The user display or alert 140 could be provided with a soft button or icon actuable by a computer mouse for example, or simply as a message to operate a remote control device 145 in a particular way. As a further alternative, the user alert 140 could simply be indicated using a suitable LED, a sound or some other user alerting mechanism, on the screen apparatus 135 or even on another device. The user alert is only shown (or sounded) on the screen apparatus for a short period, allowing it to be ignored.
A user control input is also enabled to allow the user to activate a predetermined control on the remote control device 145 for example, and which is interpreted by the attention interpreter 125. The user control input may be enabled using a dedicated button or control on the remote control device 145 which is only used for the "smart rewind" or other smart functions provided by the system 100. Alternatively, a standard function control or button such as the rewind control may be disabled for the media store 130 and instead enabled for the attention interpreter 125 in response to display of the displayed user control input 140. For example coloured buttons may be used together with the user alert "press red button to replay what I missed". This may be achieved using re-programmable remote control devices 145, and a suitably controlled computing platform hosting the attention interpreter 125 and media store 130 functions, or by suitable modification to legacy or new content players such as set-top boxes. Alternatively the user control input may be a mouse click whilst the cursor is positioned over the soft button 140. As noted, the user control input mechanism need not be a remote control device 145 as typically used with a TV set or other audio-visual provisioning equipment (eg DVD players), but could be integral with this equipment. Either way the user control input is used to record a user input in response to the user alert 140, which generates a control signal (button triggered) or message which causes the attention interpreter 125 to control the media store in predetermined ways.
As a further alternative, the camera could be used to facilitate gesture recognition as a means of user input control. A current example of this technology is provided by the Sony EyeToy.
The attention interpreter 125 is arranged to provide certain "smart" control options available to the user 105 in response to detecting a change in the user's attention status - for example that the user has returned to an attentive state following a period of non-attention. This particular example may correspond to the user watching a movie played by the media store 130 on the viewing screen 135, the user being interrupted for example by a phone call and moving or looking away from the screen 135, then looking back to the screen following completion of the phone call. In the meantime the movie has continued playing so that the user has missed some of the content. The attention interpreter 125 then offers the user the opportunity to rewind to the content playing when the user first looked away from the screen (a non-attention status) and to replay the content from this point. It may be desirable to configure the system to rewind slightly prior to this time in order to ensure continuity - say one or two seconds before the stored play index; other configurations are also possible in which the re-play starting point is dependent on the stored play index - for example a selectable number of seconds before or after the stored play or distraction index. This is implemented by generating the user control input interface comprising, in this embodiment, a user alert message 140 on the viewing screen 135, and receiving an appropriate control input from the user via the remote control device 145. The media store 130 is then commanded by the attention interpreter 125 to rewind to this point (or slightly before, for example) and start replaying.
Alternative or additional "smart" user control options may be provided, for example a summary of the missed content. This may be implemented simply as a faster playing of the missed content until this catches up with the current played content, or only playing some of the frames of the content - for example every 10th frame. This may be provided on the full size of the viewing screen, or in a smaller window with the currently playing content in the background. Alternatively the summary content may be provided by a third party content provider such as a broadcaster for example.
Figure 2 illustrates a method of operating the attention interpreter 125 for this situation in more detail. In the method (200), a distraction or play index is maintained which corresponds to the position or time in the playing of content when the user became non-attentive - for example looking or moving away from the viewing screen 135. Typically a media store 130 or the content itself such as a DVD movie will incorporate its own play index, for example to indicate that hour:1, minute:12 and second:45 of a movie is currently being played. However the media store 130 and/or attention interpreter 125 may be configured to associate an independent play index with the content if it does not come already integrated, in order to perform the method.
Referring to figure 2, the play step (205) indicates that the media store 130 is playing the content. The method (200) then determines whether the rewind button 140 has been shown on the screen 135 and activated by the user on the remote control device 145 (210). If this is the case (210Y), then the media store 130 is instructed to rewind the content to a previously stored distraction or play index (215) before continuing to play the content (205). Otherwise (210N), the method determines whether a summary (or other smart content control function) button has been shown and activated (220). If this is the case (220Y), a summary routine is invoked (225) before continuing to play the content (205). As noted above, the summary function or routine may simply involve playing the missed content (between the stored distraction or play index and the current play index) at an increased speed within a reduced size window of the screen 135. Otherwise (220N), the method receives the output from the face detector 115 (230).
The face detector output will typically just be the number of faces detected within the camera's field of view - thus if there is only one user the outputs will be 1 for an attentive state or 0 for a non-attentive (looking away or absent) state. The number of faces detected corresponds to a face count parameter for each unit time or camera image. The method then determines whether the current face count is less than, equal to, or greater than a previous face count (235). If the face count is less than the previous face count (235<), this corresponds to a user looking away from the screen (becoming non-attentive) and the distraction or play index is set or stored (240), and then content continues to play (205). In this step (240), the attention interpreter 125 queries the media store's current play index and stores this, for example as a distraction index. If the face count is equal to the previous face count (235=) the attention status of the user has not changed and the content continues to play (205). If the face count is greater than the last face count (235>) this corresponds to a user returning to watch the screen (a new or returning attentive user), and the method determines if the distraction index is set (245) or in other words that a play index has been stored.
If the distraction index is not set (245N), this corresponds to a new user watching the content, in addition to an existing user. In this case the content continues to play (205) without offering user content control options. As an alternative, an option to rewind to the beginning of the content might be offered if the current position of the playing content is close to the beginning (eg the current play index is below a threshold). This situation might correspond to a friend arriving to join the initial user shortly after the start of the content. As a further alternative a summary option might be offered. If the distraction index is set (245Y), this corresponds to a user having already watched some of the content returning to watch the content after a period of non-attentiveness.
In this case, the method then determines whether the last face count was greater than zero (250). If this is not the case (250N), this means there is only one user to consider, and the rewind button is shown (255). The method then returns to continue playing the content (205) and determines any user response to the rewind button at step (210) as previously described. If the last face count was greater than zero (250Y), this means that an additional user has returned to view the screen 135 so that for example there are now two or more users watching the screen. Upon detecting this new attentive face, the method shows the summary button (260) before returning to continue playing the content (205). Determining whether the summary button has been actuated is then carried out at step (220) as previously described.
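To make the control flow of figure 2 concrete, the following sketch expresses the method (200) as a loop; the media_store, face_detector and ui objects and their method names are illustrative assumptions, not part of the specification.

```python
def attention_interpreter_loop(media_store, face_detector, ui):
    """Sketch of the method (200) of figure 2; all object names are illustrative."""
    last_face_count = face_detector.count_faces()
    distraction_index = None  # play index stored when a user becomes non-attentive

    while media_store.is_playing():                          # step 205
        if ui.rewind_button_pressed() and distraction_index is not None:
            media_store.seek(distraction_index)              # steps 210/215
            distraction_index = None
            continue
        if ui.summary_button_pressed() and distraction_index is not None:
            play_summary(media_store, distraction_index)     # steps 220/225
            distraction_index = None
            continue

        face_count = face_detector.count_faces()             # step 230
        if face_count < last_face_count:                     # 235<: user looked away
            distraction_index = media_store.current_play_index()  # step 240
        elif face_count > last_face_count:                   # 235>: face (re)appeared
            if distraction_index is not None:                # step 245
                if last_face_count == 0:                     # 250N: sole returning user
                    ui.show_rewind_button()                  # step 255
                else:                                        # 250Y: others still watching
                    ui.show_summary_button()                 # step 260
        last_face_count = face_count


def play_summary(media_store, start_index):
    """Play the missed content at increased speed, e.g. every 10th frame."""
    end_index = media_store.current_play_index()
    media_store.play_range(start_index, end_index, frame_step=10)
```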
The following example situation further illustrates the effect of this method. Andrew sits down to watch a live football match on television; he watches for five minutes before he is interrupted by a telephone call from his friend Ben. The conversation lasts for two minutes, and when Andrew starts watching again the television offers to rewind the match to where he left off. He opts not to rewind as he can see there is no score. After ten minutes Ben comes over and watches the match with him. The television offers to play a short summary of the match so far, so that he can catch up.
In the first case, after the telephone call, the system operates to provide the rewind button option. In the second case, when Ben arrives, the Face Detector counts two faces in the scene instead of one and the Attention Interpreter has a rule for seeing more than one face at once, which triggers the automatic summarisation function.
Figure 3 illustrates a method of operating the attention interpreter 125 according to another embodiment which utilises the face recogniser function 120. In this situation the method recognises the faces rather than just counting them and making certain assumptions about their identity as in the method of figure 2. In the method (300) of figure 3, a distraction or play index is maintained for each recognised face. As with the previous method (200), the play step (305) indicates that the media store 130 is playing the content. The method (300) of figure 3 then proceeds in the same manner as the method (200) of figure 2, rewinding (315) in response to detecting activation of a rewind button (310) and activating the summary routine (325) in response to detecting activation of the summary button (320).
Although not explicitly shown, the attention interpreter 125 also receives an identifier with each face detected from the face recogniser 120. This may be simply a number assigned to each recognised face and stored in a database. Each identifier or number has its own associated stored play or distraction index corresponding to the respective user.
As with the previous method, the present method (300) then receives the face count from the face detector 115 (330) and determines whether the face count is less than, equal to, or greater than the previous face count (335). If the face count is less than the previous face count (335<), this corresponds to a user looking away from the screen (becoming non-attentive) and the distraction or play index for that user is set or stored (340), and then content continues to play (305). If the face count is equal to the previous face count (335=) the attention status of the users has not changed and the content continues to play (305). If however the face count is greater than the last face count (335>) this corresponds to an additional face having been detected, and the method determines whether the additional face has been seen before - ie is a previously recognised face (345). This may be implemented by using a new database of users or faces for each media content playing session (eg movie viewing). If the newly detected face has not been seen before (345N), this corresponds to a completely new or unrecognised face joining the content viewing, for example a friend starting to watch a movie part way through. In this case the content continues to play (305) without offering user content control options. As before an alternative is to offer an option to rewind to the beginning of the content if the current position of the playing content is close to the beginning; or to offer a summary option.
If the face belongs to a user that has been previously recognised during playing of the current content (345Y), this corresponds to a previously recognised user, having already watched some of the content, returning to watch the content after a period of non-attentiveness. In this case, the method (300) then gets the distraction index for the newly re-recognised face (350). The method then determines whether the last face count was greater than zero (355). If this is not the case (355N), this means there is only one user to consider, and the rewind button is shown (360). The method then returns to continue playing the content (305) and determines any user response to the rewind button at step (310). Note however that the distraction index used at step (315) will be the one obtained at step (350) for the respective recognised user.
If the last face count was greater than zero (355Y), this means that an additional previously recognised user has returned to view the screen 135 so that for example there are now two or more previously recognised users watching the screen. Upon re-recognising this previously recognised face, the method shows the summary button (365) before returning to continue playing the content (305). Determining whether the summary button has been actuated is then carried out at step (320) as previously described.
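The per-user bookkeeping that distinguishes the method (300) from the method (200) might be sketched as below; keying the stored indices on the recogniser's face identifiers is an assumption (the specification only requires faces to be matched as previously seen or not), and the helper names are illustrative.

```python
# Illustrative sketch of the per-face state used by the method (300).
distraction_indices = {}   # face identifier -> stored distraction/play index
seen_this_session = set()  # identifiers recognised during the current content

def on_face_left(face_id, media_store):
    """Step 340: store a distraction index for the user who looked away."""
    distraction_indices[face_id] = media_store.current_play_index()

def on_face_appeared(face_id, last_face_count, ui):
    """Steps 345 to 365: decide which option, if any, to offer."""
    if face_id not in seen_this_session:        # 345N: a completely new viewer
        seen_this_session.add(face_id)
        return                                  # play on, no options offered
    index = distraction_indices.get(face_id)    # step 350
    if index is None:
        return
    if last_face_count == 0:                    # 355N: sole returning viewer
        ui.show_rewind_button(index)            # step 360
    else:                                       # 355Y: others are still watching
        ui.show_summary_button(index)           # step 365
```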
The following extension to the example situation outlined earlier illustrates the effect of the method (300) of figure 3. At half-time Andrew goes to the kitchen to get some beer, while Ben continues to watch the commentator's analysis before he too leaves, to go to the bathroom. When Andrew returns, the television offers to rewind to where Andrew left off. A short while later, when Ben returns, a summary of the period he was away is offered.
The system described in this scenario or second embodiment provides more complex behaviour than in the first embodiment. In this case it is able to distinguish between the faces of Andrew and Ben using the face recogniser and to offer a personalised service.
Whilst identifiers have been described as associated with user faces, in fact it is only necessary to identify the faces in a scene as previously seen or not and no explicit identity needs to be defined for each face. Where two faces are seen simultaneously, it can be assumed that these two faces have different identities.
Whilst full face recognition has been described, alternative attention detecting mechanisms could be used, for example the eye gaze or pupil tracking approach described with respect to WO2004084054. As a further alternative, parts of a face or various combinations of facial features may be used, such as detecting and/or recognising eyes and a nose, a mouth and eyes, or eyes and mouth and nose.
The camera 110 used may be provided with a light source (not shown) to illuminate the faces for detection and recognition. Alternatively the ambient lighting may be relied upon, as this will be contributed to by the content displayed on the screen 135. This may be complemented by using a long exposure time for each frame; in this regard it is helpful that viewers tend to stay relatively motionless whilst watching video content. In addition various night mode and "camera-shake" facilities are already "built-in" to many digital cameras and can be used in this situation to improve the facial images provided to the face detector 115 and recogniser 120. Alternatively, in order to simulate a long exposure, a sequence of images or frames can be summed such that each pixel in the resultant image is the sum of the luminance values in corresponding positions in the other frames. The summed images can then be used for face detection/recognition.
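By way of illustration only, the summation of frames to simulate a long exposure might be implemented as follows (a sketch assuming NumPy and equal-sized greyscale luminance frames; the normalisation back to an 8-bit range is an added practical step):

```python
import numpy as np

def simulate_long_exposure(frames):
    """Sum luminance values across frames to approximate a long exposure.

    frames: sequence of 2-D uint8 arrays of identical shape (luminance images).
    Returns an 8-bit image suitable for passing to face detection/recognition.
    """
    acc = np.zeros_like(frames[0], dtype=np.float64)
    for frame in frames:
        acc += frame                      # per-pixel sum of luminance values
    if acc.max() > 0:
        acc = 255.0 * acc / acc.max()     # rescale into the displayable range
    return acc.astype(np.uint8)
```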
In order to minimise the number of "false positive" user attention status changes detected by the system, the face detector 115 or the attention interpreter 125 can be configured to only recognise a change of attention status after this has been confirmed on a number of subsequent face detections - for example after 10 image frames from the camera all indicate that there is one less or one more face than previously detected. This might be implemented in the method of figure 2, for example at step 235, by an additional routine which holds the "last face count" parameter at the value noted at time x, then compares the current "face count" parameter 10 times (once for each camera image, say) at times x+1, x+2, ... x+10, and then determines whether the average "face count" parameter is less than, greater than, or equal to the "last face count" parameter. Alternatively a statistical mode of the observations could be used, giving an integer rather than the floating point number which the mean calculation produces.
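A sketch of this confirmation step is given below, assuming one face count per camera frame; it accepts a new count only once a full window of observations is available, using the statistical mode so the result stays an integer.

```python
from collections import Counter, deque

class DebouncedFaceCount:
    """Accept a change of face count only after it persists over a window."""

    def __init__(self, window=10):
        self.window = deque(maxlen=window)  # most recent per-frame counts
        self.accepted = None                # last confirmed face count

    def update(self, face_count):
        """Feed one per-frame observation; return the confirmed count."""
        self.window.append(face_count)
        if len(self.window) == self.window.maxlen:
            # Statistical mode of the window: an integer, unlike the mean.
            self.accepted = Counter(self.window).most_common(1)[0][0]
        return self.accepted
```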
As a further alternative, some face detection software packages can be configured to provide a confidence measure related to how confident the software module is that it has detected a face. This confidence measure or output can be used in an embodiment by the attention interpreter to decide when a face has become absent, for example by monitoring the confidence measure over time and detecting the absence of a face when the confidence measure drops below a predetermined threshold. In another alternative, the absence of a face may only be detected when the confidence measure follows a characteristic temporal pattern such as a sharp drop-off, rather than a gradual decline which may be due to environmental changes.
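The confidence-based alternative might look like the following sketch; the threshold, window and drop values are assumptions chosen for illustration, and a face is declared absent only when the confidence is both low and has fallen sharply from a recent baseline.

```python
from collections import deque

class ConfidenceAbsenceDetector:
    """Detect the absence of a face from a detector's confidence measure."""

    def __init__(self, threshold=0.3, window=30, min_drop=0.4):
        self.history = deque(maxlen=window)  # recent confidence values
        self.threshold = threshold           # absolute confidence floor (assumed)
        self.min_drop = min_drop             # required sharp drop from baseline

    def face_absent(self, confidence):
        """True when confidence is low AND fell sharply, not gradually."""
        baseline = max(self.history) if self.history else confidence
        self.history.append(confidence)
        sharp_drop = (baseline - confidence) >= self.min_drop
        return confidence < self.threshold and sharp_drop
```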
Figure 4 illustrates a system according to an embodiment, the system 1100 comprising a user's face 1105, a camera 1110 or viewing equipment such as a web cam or other camera, a face detector 1115, a session controller 1120, a display screen 1125, a network 1130, and a video on demand (VoD) server 1135. The face detector 1115 and session controller 1120 are typically implemented as software executed on a computing platform such as a personal computer or dedicated video content equipment such as a set-top box. The viewing screen 1125 may be part of the computing platform, or a stand-alone device such as a plasma-TV screen for example. The network 1130 may be a broadband Internet connection or a wireless network. A user device or apparatus 1160 implements the face detection, session control and media content stream receiving and playing on the screen.
This may be provided on a mobile phone or other wireless device, home entertainment equipment, or a computer, for example. Media content servers 1135 other than a VoD server may alternatively be used as will be understood by those skilled in the art.
The camera 1110 is arranged to capture images pointing away from the viewing screen 1125 in order to "see" user faces 1105 viewing the viewing screen. Typically the camera 1110 will be positioned at right angles to the plane of the viewing screen 1125. The camera can be arranged such that the viewing angle of the screen and the camera's field of view are largely coincident; and this may be accomplished with additional lenses if necessary. Video or still images from the camera 1110 are directed to the face detector 1115, which may be implemented using various well known software packages on a computing or processor platform. Examples of face detection algorithms include: Wu et al (2004); Neven Vision's Mobile-I™ face recognition software developer's kit (SDK); Detecting Faces in Images: A Survey, Ming-Hsuan Yang, David J. Kriegman and Narendra Ahuja, http://vision.ai.uiuc.edu/mhyang/papers/pami02a.pdf; C-VIS Computer Vision and Automation GmbH Face Snap or similar technologies. The face detector function 1115 simply reports whether a face has been detected or how many faces are currently detected within the field of view of the camera 1110.
In this embodiment, the Wu et al algorithm can be configured to detect only "face-on" faces by training it only on face-on examples. From these examples it learns its own rules and these pre-learnt rules can then be used to only detect new faces when they are face-on. Pre-learnt rules may be distributed with the service, so that no learning is required in situ. This algorithm then gives the location of a rectangle within the camera image that contains the face. In an alternative embodiment there may be no restriction on the orientation of detected faces, merely whether faces are detected or not.
The session controller or control module 1120 interprets the output of the face detector 1115 as a user (1105) attention state, and controls operation of the media server 1135 dependent on the user attention state. The session control module 1120 is typically implemented as a routine in a computing or processor based device, and an embodiment of this function (1120) is illustrated in figure 6 and described in greater detail below.
The session controller 1120 instructs the media server 1135 over the network using known control packets and protocols 1140, for example HTTP and RPC. The media server 1135 stores one or more items of media content such as movies in a plurality of quality formats or quality levels and can switch between these formats according to instructions 1140 from the session control module 1120. Various techniques used for implementing the switching will be known to those skilled in the art, for example server switching between streams at different bit rates, server provision of multiple bit rate streams which the user device or client can switch between, or parallel streams supporting different layers of hierarchically encoded content which can be switched on or off by either the server or client. A duplex control link (incorporating return path 1145) may be used for more robust signalling, for example for the server 1135 to return acknowledgements following receipt of an instruction from the session module 1120. The quality formats range from high quality (large files or high bit-rate streams) to low quality (smaller files or low bit-rate streams), which may be implemented using lower frame rates or image resolution as is known. Alternatively or additionally, different compression technologies may be used to provide smaller file sizes or lower bit-rate streams, and hence a reduced data rate over the network connection 1150. As a further alternative, a single media content file or bit stream may be crafted to reduce its bit-rate to the user device 1160, for example by sending only intra-coded frames (i-frames) and not the predicted or bi-predictive frames (p-frames and b-frames) as is known in some compression algorithms. Either way, the media content 1150 provided to the user device 1160 over the network 1130 has a quality level controlled by the session control module 1120, which in turn is dependent on whether a user face 1105 has been detected or not (user attention or non-attention). This media content is received by the user device using a streaming client 1165 such as RealPlayer, Windows Media or Apple QuickTime for example, which has established a network connection 1150 to the media server 1135. The media content is then played by the streaming client 1165 on the screen 1125 and/or an audio transducer such as loudspeakers (not shown) at the quality level at which it is received from the media server 1135.
Figure 5 illustrates a graph of video quality over time. t0 is the time after which the video quality is lowered once zero faces are seen (user attention state is non-attentive). This will be largely dependent on the length of pre-buffered video available. The dotted line indicates the desired video quality if the VoD server is able to deliver a gradual decline in quality. The solid line indicates how this may be approximated by switching between discrete quality layers or levels, using different compression algorithms for example. This can be implemented using variable rate coding as is known. Whilst the detailed implementation of variable rate coding is beyond the scope of this specification, the interested reader is directed to A. Ortega, "Variable Bit-rate Video Coding", in Compressed Video over Networks, M.-T. Sun and A. R. Reibman, Eds, pp. 343-382, Marcel Dekker, New York, NY, 2000 - see http://sipi.usc.edu/~ortega/Papers/OrtegaVBR-Chapter.pdf. The gradient of the decline may be controlled by the demands of the network (a congested network will benefit from a rapid descent) or the usage patterns of the television (related to the probability that the viewer will return; this may be learnt by the system or defined by the user or programme maker). t1 is the time taken for the system to show full video quality on the return of the viewer (user attention state is attentive). As such it should be minimised to avoid any disruption to the viewer's experience. In reality there may be intermediate quality video layers or levels which allow the video to start quickly, while allowing the higher rate layers to be initiated, as described in Walker, M. D., Nilsson, M., Jebb, T. and Turnbull, R., Mobile Video-Streaming, BT Technology Journal, Vol. 21, No. 3 (September 2003), pp. 192-202 (Walker et al (2003)), especially in Section 4. t2 is the minimum time between quality degradations when the user is non-attentive, and is used here to avoid the quality level being degraded too quickly.
As the time since the last observation of an attentive face increases, the data-rate of the streamed video is successively decreased. This results in bandwidth savings across the network 1130: the VoD server decreases the data it sends after being instructed by the television's or user device's session control 1120 to adapt the stream. As already noted this may be accomplished by the server crafting an individual stream for the device or switching it to a prepared lower-rate source; or, in a further alternative, by implementing the switching at the streaming client 1165 by switching between multicast streams at different bit rates - various other methods of implementing the quality level change will also be known to those skilled in the art. The low-rate version may have a slower frame-rate, lower resolution or be otherwise degraded. As the audio will be heard over a much wider range than the video, it will typically not be degraded (or at least not as quickly as the video) in a television based implementation for example.
Typically there will be a series of decreasing quality levels, of which the lowest is a static (probably black) screen. As the camera makes more observations without an attentive face, confidence grows that there are no viewers and the video can be safely reduced to the lowest quality level. In order to avoid the user noticing an apparent interruption, the full quality service should be reinstated as soon as practicable after attentive faces are observed. There is generally a degradation of perceptual quality of video and audio as the data-rate is decreased. However, some degradations are perceptually more significant than others at the same data-rate; for instance decreased quality in the foreground will be noticed more than in the background. In an embodiment the degradation is such that the perception of quality decreases gradually, for example using the same approach detailed in Hands, D., Bourret, A., Bayart, D. (2005) Video QoS enhancement using perceptual quality metrics, BT Technology Journal, Vol. 23, No. 2 (April 2005), pp. 208-216.
At times of low network traffic it may be unnecessary to decrease the data-rate when the viewer is inattentive. Either the VoD Server or Session Control module may therefore be configured to measure the current network traffic and use this to determine if the data-rate is decreased or not. In an embodiment the return control path 1145 may be used to refuse an instruction from the session control.
Face detection may sometimes be in error. As such the face detection function or session controller may be configured to aggregate results, for instance taking the statistical mode of the faces counted over a time window.
Figure 6 illustrates a method of operating the session controller 1120 in more detail. In the method (600), the session controller monitors the output from the face detector 1115, and issues instructions 1140 to the media server 1135 accordingly. The play step (605) indicates that the media content is being received over the network 1130 from the media server 1135 and is played on the screen 1125, for example using a streaming client 1165. The method (600) then determines whether or not faces have been detected by the face detection package 1115 (610). If the output or count of the face detector 1115 is greater than zero (610N) corresponding to a user attention state of "attentive", viewing or using, the session controller 1120 instructs the media server to transmit or stream the media content at the highest quality level (615). This can be implemented by sending a suitable quality control packet or instruction to the media server 1135. In an alternative embodiment, the session control 1120 may instruct the streaming client 1165 to switch between content streams at different bit rates. In this alternative arrangement, the server 1135 is arranged to transmit or multicast multiple streams of the same content, but at different bit rates. In this case there is no need to instruct the server 1135. Similarly where a layered encoding approach is used, the client can be configured to switch on or off parallel streams or sessions associated with the different layers in order to change the content quality. The method then returns to playing the media content (605) which will now or shortly be at a higher quality level (eg higher data rate).
In some embodiments the increased quality level may be implemented by rapidly increasing the quality level over a series of intermediate quality levels, in order to allow time for the higher quality levels to be buffered before playing. Again this is described in Walker et al (2003), especially Section 4. In this arrangement the server switches between different bit rate content streams so that the user device continues to receive the same stream or maintains the same session but at different bit rates according to the server switching. This avoids having any delay introduced whilst the high quality media content is buffered at the user device 1160 before it can be played. An intermediate quality level media content stream may be played with little or no buffering immediately whilst the high quality media content is buffered then played so that there is no interruption.
If no faces are detected (610Y), corresponding to a user attention state of "non-attentive" or not watching, the method determines whether a predetermined length of time (t0) has elapsed since the last face was detected or the last quality level reduction was instructed (620). This avoids the quality being reduced too quickly, for example if the viewer has only momentarily looked away from the screen 1125. If there has not been sufficient time (t2) since the last quality reduction (620N), then the method returns to playing the media content (605). If however there has been a predetermined period of time (t2) since the last face was detected or the last quality reduction instructed (620), then the method instructs the media server 1135 to reduce the quality level by one quality level (625). This may continue so that over time, with no faces detected, the quality level of the media content eventually falls to the lowest level - which may be zero data rate or a black screen for example. Again the reduce quality instruction to the media server 1135 may be in any suitable format.
The time-based parameters t0 and t2 can be configured according to system design. For example t0 may be derived from the storage/network constraints, as it reflects the size of the video buffer at the client - assuming that already received frames are not to be thrown away. t2 may be set according to the speed at which the video quality is intended to decline.
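The interplay of t0 and t2 in the method (600) can be sketched as follows; the media_server and face_detector interfaces and the polling interval are illustrative assumptions.

```python
import time

def session_controller_loop(face_detector, media_server, t0=5.0, t2=10.0,
                            max_level=5):
    """Sketch of the method (600) of figure 6; names and timings illustrative."""
    level = max_level                    # current quality level (highest = best)
    last_face_time = time.monotonic()    # when an attentive face was last seen
    last_reduction = last_face_time

    while True:
        now = time.monotonic()
        if face_detector.count_faces() > 0:          # step 610: attentive
            if level < max_level:
                media_server.set_quality(max_level)  # step 615: restore quality
                level = max_level
            last_face_time = now
        elif level > 0:                              # non-attentive
            waited = now - max(last_face_time, last_reduction)
            # t0 gates the first reduction, t2 each further reduction (step 620).
            if waited >= (t0 if level == max_level else t2):
                level -= 1                           # step 625: one level down
                media_server.set_quality(level)
                last_reduction = now
        time.sleep(0.1)                              # poll roughly once per frame
```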
Figure 7 illustrates a method of operating the media server 1135 in more detail. In the method (400), the media server, having set up a connection with the streaming client 1165 on the user device 1160, transmits or streams packets of media content to the streaming client (405). The media content streamed is at a particular quality level or data rate, and the same media content is also stored in different formats having different assigned quality levels. For example lower quality media content may be highly compressed, have a low frame rate or image resolution, or a combination of these. The method then "listens" for quality control instructions from the user device's session controller 1120 (410). If no new quality level instruction is received (410N), then the media server continues to stream the media content at the same quality level (405). If however a quality level instruction is received to increase or decrease the quality level (410Y), the method sets the streaming quality level to the instructed level by switching to a different content stream (415). This may be implemented simply by switching to transmitting a different content file, matching the play index of the previous stream to the new stream so that streaming of the new file starts at the right location. Again the mechanism for switching between quality levels is described in Walker et al (2003) and WO03/009581 as noted previously. The method then transmits or streams the content at the new quality level (405) until instructed to change this again by the user device's session controller 1120.
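On the server side, the behaviour of the method (400) might be sketched as below; the streams mapping, packet interface and control channel are assumptions, with the play-index matching of step (415) as the key detail.

```python
def media_server_loop(streams, client, control_channel):
    """Sketch of the method (400) of figure 7.

    streams: dict mapping quality level -> stream object for the same title
    (illustrative names; each stream exposes packets and a play index).
    """
    level = max(streams)                             # start at the highest level
    stream = streams[level]

    while client.connected():
        client.send(stream.next_packet())            # step 405: stream content
        instruction = control_channel.poll()         # step 410: listen
        if instruction is not None:                  # 410Y: change quality level
            index = stream.current_play_index()
            stream = streams[instruction.level]
            stream.seek(index)                       # step 415: match play index
            level = instruction.level
```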
Whilst full face recognition has been described, alternative attention detecting mechanisms could be used, for example the eye gaze or pupil tracking approach described with respect to WO2004084054. As a further alternative, parts of a face or various combinations of facial features may be used, such as detecting and/or recognising eyes and a nose, a mouth and eyes, or eyes and mouth and nose.
The camera 1110 used may be provided with a light source (not shown) to illuminate the faces for detection and recognition. Alternatively the ambient lighting may be relied upon, as this will be contributed to by the content displayed on the screen 1125. This may be complemented by using a long exposure time for each frame; in this regard it is helpful that viewers tend to stay relatively motionless whilst watching video content. In addition various night mode and "camera-shake" facilities are already "built-in" to many digital cameras and can be used in this situation to improve the facial images provided to the face detector 1115. Alternatively, in order to simulate a long exposure, a sequence of images or frames can be summed such that each pixel in the resultant image is the sum of the luminance values in corresponding positions in the other frames. The summed images can then be used for face detection/recognition.
In order to minimise the number of "false positive" or "false negative" user attention state changes detected by the system, the face detector 1115 or the session controller 1120 can be configured to only recognise a change of attention state after this has been confirmed on a number of subsequent face detections - for example after 10 image frames from the camera all indicate that there is one less or one more face than previously detected. This might be implemented in the method of figure 6, for example at step 610, by an additional routine which holds the "last face count" parameter at the value noted at time x, then compares the current "face count" parameter 10 times (once for each camera image, say) at times x+1, x+2, ... x+10, and then determines whether the average "face count" parameter is less than, greater than, or equal to the "last face count" parameter. Alternatively a statistical mode of the observations could be used, giving an integer rather than the floating point number which the mean calculation produces.
As a further alternative, some face detection software packages can be configured to provide a confidence measure related to how confident the software module is that it has detected a face. This confidence measure or output can be used in an embodiment by the session controller 1120 to decide when a face has become absent, for example by monitoring the confidence measure over time and detecting the absence of a face when the confidence measure drops below a predetermined threshold. In another alternative, the absence of a face may only be detected when the confidence measure follows a characteristic temporal pattern such as a sharp drop-off, rather than a gradual decline which may be due to environmental changes.
As noted previously, various methods of adapting the bit-rate of the media content's transmission over the network to the receiving device 1160 can be used with the embodiments. For example the streaming client 1165 can set up multiple RTP sessions, each associated with a different quality level of the content, for example the content in different compression formats or at lower frame rates. Then as the required quality level changes, the media server 1135 starts transmission at the different quality level on one session and stops transmission at the current quality level on another session, so that the stream received by the receiving device 1160 changes bit rate. Alternatively the same stream or session is maintained but the bit stream used by the server is switched, where each bit stream has a different bit rate. The received bits will be buffered, and the rate at which the buffered bits are taken and decoded from the buffer by the client 1165 will also be changed to correspond with the new bit rate.
In an alternative arrangement a hierarchical coding approach is used, in which the original content (for example video data) is encoded into a number of discrete streams called layers. The first layer consists of basic data of a relatively poor quality, and successive layers represent more detailed information, so that layers can be added to increase the image quality or taken away to degrade the image or other content quality level. This effectively means that the bit rate is decreased when layers are removed or increased when layers are added, providing the required changes in quality level. Layered video compression is known from the 1998 version of H.263, but may equally be provided by any other codec, such as MPEG4. Each layer in the hierarchy is coded in such a way as to allow the quality of individual pictures to be enhanced and their resolution to be increased, and additional pictures to be included to increase the overall picture rate. Figure 8 shows a typical dependency between pictures in an H.263 scalable layered codec, with boxes representing frames for each layer and arrows showing dependency between frames. The lowest row shows original, uncoded frames. The next row shows the lowest layer (Layer 0) of the hierarchy, which is coded at half the frame rate of Layer 1. Frames in Layer 0 are predicted from the previously encoded frame, as in conventional video compression. Frames in Layer 1 may be predicted from the previously encoded frame in Layer 1 and, if present, the temporally simultaneous Layer 0 encoded frame. Frames in Layer 2 may be predicted from the previously encoded frame in Layer 2 and, if present, the temporally simultaneous Layer 1 and Layer 0 encoded frames. The H.263 specification allows for 15 layers, though a smaller number can be used in practical embodiments.
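The prediction rules of figure 8 can be stated compactly, as in this sketch; it assumes, purely for illustration, that each layer runs at twice the frame rate of the layer below it, so that a temporally simultaneous lower-layer frame exists only at even time indices of the layer above.

```python
def reference_frames(layer, t):
    """Return (layer, time) references for frame t of a given layer in an
    H.263-style hierarchy; a simplified illustration, not the codec's full rules."""
    refs = [(layer, t - 1)]               # previously encoded frame in this layer
    if layer > 0 and t % 2 == 0:          # lower-layer frame exists at even indices
        refs.append((layer - 1, t // 2))  # the temporally simultaneous frame
    return refs
```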
Figure 9 illustrates a media server 1135 which uses the layered approach to content delivery. The media content in this case is stored in a data-store 905 already compressed, although it could be received from a live feed for example. The content is packetised by an RTP packetiser 910 according to the Real-time Transport Protocol (RTP), although other protocols could alternatively be used. The packetiser 910 attaches an RTP header to the packets, as well as an H.263 payload header as is known. The payload header contains video specific information such as motion vector predictors. The packets are numbered in order by a packet numbering function 915, to allow the receiving client to recover the correct order of the content packets. The layered encoding process uses a control strategy together with an output buffer for each layer in order to provide each layer's constant bit-rate output. Each layer is transmitted as an independent RTP session on a separate IP address by a corresponding session handler 925. The rate at which data is transmitted is controlled by the Transfer Rate Control module 920, which counts Layer 0 bytes to ensure that the correct number are transmitted in a given period of time. The transmission rate of the outer layers is smoothed and locked to the rate of Layer 0 using First-In-First-Out (FIFO) buffer elements 930.
Figure 10 illustrates a streaming client 1165 which uses the layered approach to content delivery and reception. Each RTP/RTCP session associated with each layer of encoded data has a session handler 705 at the client which is responsible for receiving RTP packets from the network. These packets are forwarded to a blender module 710 which receives the packets in the order in which they were received from the network. This may not be the order required for decoding because of packet inter-arrival jitter or packet loss. The blender 710 uses the packet numbers in the RTP headers to arrange the packets from each layer in the right order, and then combines the packets from all the received layers. The output from the blender 710 is a single stream of packets in the correct order for decoding. The packets are then sent to a decoder 715 for decoding into video samples or pictures. The client 1165 also comprises a controller 720 which controls operation of the RTP session handlers 705 and blender/buffer 710.
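The blender's reordering can be illustrated with a small priority queue keyed on the packet numbers added by the server, a sketch assuming each packet object carries a number attribute; packets from all layer sessions are merged into a single decode-order stream.

```python
import heapq

class Blender:
    """Merge packets from several layer sessions into decode order."""

    def __init__(self):
        self.heap = []           # min-heap keyed on packet number
        self.next_number = 0     # next packet number expected by the decoder

    def receive(self, packet):
        """Called by any session handler 705 as packets arrive (any order)."""
        heapq.heappush(self.heap, (packet.number, packet))

    def packets_for_decoder(self):
        """Yield packets in the correct order as soon as they are contiguous."""
        while self.heap and self.heap[0][0] == self.next_number:
            _, packet = heapq.heappop(self.heap)
            self.next_number += 1
            yield packet
```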
In order to control or change the quality level of the received content, the session control 1120 instructs the server 1135 to drop or add a layer. This may be implemented in the present embodiment by the server instructing a corresponding RTP handler (for example 9253 for Layer 3) to close a current session with the corresponding handler (7053) at the client 1165, or to open a new session in order to transfer the contents of its output buffer 9303. Alternatively the session control 1120 may directly instruct the controller 720 in the client 1165 to open or close an RTP session using a local session handler 705, depending on which layer is to be added or dropped.
As a yet further alternative, the various layer RTP sessions may be maintained open, but corresponding layer encoded packets may not be sent depending on the quality level currently required. This may be implemented at the FIFO buffers 930 for example, with packets being dropped after a certain time. Then when a higher quality level is requested, the packets are routed to the corresponding RTP handler 925 where the RTP session is already open so that there is no delay in increasing the quality level of the content provided.
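Adding or dropping layers then reduces to opening or closing the per-layer sessions, as in this sketch (the handler interface is an assumption):

```python
def set_quality_layers(layer_handlers, wanted_layers):
    """Open sessions for the first wanted_layers layers and close the rest.

    layer_handlers: list of per-layer RTP session handlers, index = layer number.
    """
    for layer, handler in enumerate(layer_handlers):
        if layer < wanted_layers and not handler.is_open():
            handler.open_session()   # add this enhancement layer
        elif layer >= wanted_layers and handler.is_open():
            handler.close_session()  # drop this layer to cut the bit rate
```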
In order to provide a fast transition from a low quality level to a high quality level, switching in additional streams or layers is done incrementally so that the quality can be built up by the blender 710. At the same time the buffer of the blender 710 is lengthened in order to accommodate the increased packet numbers associated with the higher bit rates of the higher quality levels. When the quality level is still low, the low bit-rate RTP session provides low bit rate packets to the blender which starts filling up its newly enlarged or lengthened buffer. Meanwhile the packets from the higher layers start arriving and can be combined with the low bit-rate packets waiting in the buffer in order to form the higher quality content. Initially the low rate packets can be sent to the decoder in order to maintain the content at an initial low rate, then increasingly enlarged batches of packets are sent from the buffer to the decoder to provide the higher quality level content.
A final embodiment (not shown) combines certain of the above separately described features, providing an embodiment in which, as well as reducing the quality of streamed media in response to detecting inattentiveness, the system on detecting renewed attentiveness both increases the quality back to the higher level and offers the user a smart option such as a smart rewind option or a summary option.
Whilst the embodiments have been described with respect to encoded video files stored at the server, in other embodiments real-time video streaming could be provided, for example from a camera or other source of unencoded or otherwise high quality video data. This includes video data generated by a computational process on, for instance, a computer's graphics processor. The data rate can then be adjusted dynamically as the real-time unencoded video data is encoded before transmission.
The embodiments may be implemented on a range of apparatus for example set-top boxes coupled to a VoD system over the Internet, or wireless mobile devices such as mobile phones where the network is a wireless network such as GPRS for example.
The skilled person will recognise that the above-described apparatus and methods may be embodied as processor control code, for example on a carrier medium such as a disk, CD- or DVD-ROM, programmed memory such as read only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. For many applications embodiments of the invention will be implemented on a DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array). Thus the code may comprise conventional programme code or microcode or, for example, code for setting up or controlling an ASIC or FPGA. The code may also comprise code for dynamically configuring re-configurable apparatus such as re-programmable logic gate arrays. Similarly the code may comprise code for a hardware description language such as Verilog™ or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, the code may be distributed between a plurality of coupled components in communication with one another. Where appropriate, the embodiments may also be implemented using code running on a field-(re)programmable analogue array or similar device in order to configure analogue hardware.
The skilled person will also appreciate that the various embodiments and specific features described with respect to them could be freely combined with the other embodiments or their specifically described features in general accordance with the above teaching. The skilled person will also recognise that various alterations and modifications can be made to specific examples described without departing from the scope of the appended claims.
The present application also includes the following clauses:
1. A method of operating an electronic device to play media content, the method comprising: playing the media content at a first quality level; determining an attention state of a user of the electronic device; playing the media content at a second quality level in response to detecting a change in the user attention state.
2. A method according to clause 1, wherein determining the user attention state comprises detecting the presence or absence of a face within a predetermined area.

3. A method according to clause 1 or 2, wherein the media content is received over a network from a media server, and wherein the method further comprises switching from a first media content stream at a first bit rate corresponding to the first quality level to a second media content stream at a second bit rate corresponding to the second quality level in response to determining a change in the user attention state.
4. A method according to clause 1 or 2, wherein the media content is received over a network from a media server, and wherein the method further comprises instructing the media server to transmit the media content at the second quality level in response to detecting the change in the user attention state.
5. A method according to any one preceding clause, wherein the media content is played at an increased or highest quality level in response to detecting a face, and wherein the media content is played at a reduced quality level in response to detecting the absence of a face.
6. A method according to clause 5, wherein the media content is played at a further reduced quality level for each consecutive predetermined period in which the absence of a face is detected.
7. A method according to any one preceding clause, wherein the quality levels depend on one or a combination of: data rate, compression ratio, resolution, frame rate of the transmitted media content.
8. An electronic device for playing media content, the device comprising: means for determining an attention state of a user of the electronic device; means for playing the media content at a quality level dependent on the user attention state.
9. A device according to clause 8, wherein the means for determining the user attention state comprises means for detecting faces.

10. A device according to clause 8 or 9, wherein the playing means comprises: a session control module; means for receiving and playing media content transmitted from the media server; the receiving means arranged to switch from a first media content stream at a first bit rate corresponding to the first quality level to a second media content stream at a second bit rate corresponding to the second quality level in response to the session control determining a change in the user attention state.
11. A device according to clause 8 or 9, wherein the playing means comprises: a session control module for communicating with a media server; means for receiving and playing media content transmitted from the media server; the session control module arranged to instruct the media server to transmit the media content at a quality level dependent on the user attention state.
12. A device according to any one of clauses 8 to 11, wherein the media content is played at an increased or highest quality level in response to detecting a face, and wherein the media content is played at a reduced quality level in response to detecting the absence of a face.
13. A device according to clause 12, wherein the media content is played at a further reduced quality level for each consecutive predetermined period in which the absence of a face is detected.
14. A system for playing media content and comprising: an electronic device for playing the media content according to any one of clauses 8 to 13; a network coupled to the device and a media server for transmitting the media content over the network to the device.

Claims

1. Apparatus for playing media content, comprising: means for detecting a user attention state and a user non-attention state; means for playing the media content which is associated with a play index; means for storing the play index of the media content being played in response to detecting a user non-attention state; means for generating a user interface in response to detecting a user attention state following the detection of the user non-attention state; wherein the playing means is arranged to re-play at least a part of the media content depending on the stored play index in response to receiving a user input from the user interface.
2. An apparatus as claimed in claim 1, wherein the means for detecting the user attention and non-attention states comprises means for detecting a face wherein detecting a user attention state comprises detecting a face and detecting a user non-attention state comprises detecting the absence of a previously detected face.
3. An apparatus according to claim 2, wherein the means for detecting a face comprises a camera and face detection software executed on a processor.
4. An apparatus according to claim 2 or 3, further comprising means for recognising a face detected by the face detection means, and wherein the means for generating a user interface is further arranged to generate a personalised user interface in response to recognising a previously absent face.
5. An apparatus according to claim 4, wherein the face recognising means comprises face recognition software executed on a processor and which maintains a database of recognised faces, the apparatus further arranged to associate separate stored play indexes with each recognised face.
6. An apparatus according to any one preceding claim wherein the playing means is operable to play the media at a reduced quality level in response to detecting a user non-attention state.
7. An apparatus according to any one preceding claim, wherein the user interface comprises means for generating a user alert, and means for generating a control signal in response to the user input, the apparatus being responsive to the control signal to instruct the playing means to re-play the at least a part of the media content.
8. An apparatus according to claim 7, wherein the means for generating the user alert comprises means for displaying a user message on a display screen used by the playing means to display the media content.
9. An apparatus according to any one preceding claim, wherein the playing means is arranged to re-play all of the media content from a play index dependent on the stored play index.
10. An apparatus according to one of claims 1 to 8, wherein the playing means is arranged to play a summary of the media content from a play index dependent on the stored play index.
11. An apparatus according to any one preceding claim, wherein the means for playing media content comprises a display screen together with one or a combination of: a DVD player; a video player; a set-top box receiving an external broadcast; a media client for receiving streamed media; a hard disk drive.
12. A method of playing media content having a play index, the method comprising: playing the media content in response to detecting a user attention state; storing the current play index in response to detecting a user non-attention state; generating a user interface associated with the stored play index in response to detecting a user attention state.
13. A method according to claim 12, wherein detecting a user attention state comprises detecting a face and detecting a user non-attention state comprises detecting the absence of a previously detected face.
14. A method according to claim 13, further comprising identifying each detected face and wherein a personalised user interface is generated each time a previously absent identified face is again detected.
15. A method according to any one of claims 12 to 14, further comprising playing the media content at a reduced quality level in response to detecting a user non-attention state prior to detecting a user attention state following the detection of the user non-attention state.
16. A method according to any one of claims 12 to 15, further comprising re-playing at least part of the media content depending on the stored play index in response to receiving a user input from the generated user interface.
17. A method according to any one of claims 12 to 15, wherein the re-played media content is re-played as a summary of the media content from the stored play index.
18. A carrier medium carrying processor code which when executed on a processor causes the processor to carry out a method according to any one of claims 12 to 17.
PCT/GB2007/001288 2006-04-05 2007-04-05 Intelligent media content playing device with user attention detection, corresponding method and carrier medium WO2007113580A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
EP06251933A EP1843592A1 (en) 2006-04-05 2006-04-05 Controlling media content quality
EP06251932A EP1843591A1 (en) 2006-04-05 2006-04-05 Intelligent media content playing device with user attention detection, corresponding method and carrier medium
EP06251933.5 2006-04-05
EP06251932.7 2006-04-05

Publications (1)

Publication Number Publication Date
WO2007113580A1 true WO2007113580A1 (en) 2007-10-11

Family

ID=38229812

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/001288 WO2007113580A1 (en) 2006-04-05 2007-04-05 Intelligent media content playing device with user attention detection, corresponding method and carrier medium

Country Status (1)

Country Link
WO (1) WO2007113580A1 (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2459705A (en) * 2008-05-01 2009-11-04 Sony Computer Entertainment Inc Media reproducing device with user detecting means
WO2010026187A1 (en) * 2008-09-05 2010-03-11 Skype Limited Communications system
WO2010142869A1 (en) 2009-06-08 2010-12-16 Weballwin Method for controlling the attention of a user watching a multimedia flow on the screen of a multimedia apparatus connected to a network, and systems for implementing same
EP2466771A1 (en) * 2010-12-16 2012-06-20 Gérard Olivier Smart audience monitoring device
US8413199B2 (en) 2008-09-05 2013-04-02 Skype Communication system and method
US20130089006A1 (en) * 2011-10-05 2013-04-11 Qualcomm Incorporated Minimal cognitive mode for wireless display devices
US8421839B2 (en) 2008-09-05 2013-04-16 Skype Peripheral device for communication over a communications system
US8473994B2 (en) 2008-09-05 2013-06-25 Skype Communication system and method
US8489691B2 (en) 2008-09-05 2013-07-16 Microsoft Corporation Communication system and method
US8520050B2 (en) 2008-09-05 2013-08-27 Skype Communication system and method
US20140081748A1 (en) * 2012-09-14 2014-03-20 International Business Machines Corporation Customized television commercials
WO2014108194A1 (en) * 2013-01-10 2014-07-17 Telefonaktiebolaget L M Ericsson (Publ) Apparatus and method for controlling adaptive streaming of media
WO2014085145A3 (en) * 2012-11-29 2014-07-24 Qualcomm Incorporated Methods and apparatus for using user engagement to provide content presentation
US8866628B2 (en) 2008-09-05 2014-10-21 Skype Communication system and method
CN104737099A (en) * 2012-08-31 2015-06-24 谷歌公司 Dynamic adjustment of video quality
CN104767962A (en) * 2015-01-16 2015-07-08 京东方科技集团股份有限公司 Multipurpose conference terminal and multipurpose conference system
US9690455B1 (en) * 2014-04-17 2017-06-27 Google Inc. Methods, systems, and media for providing media guidance based on detected user events
US20170364142A1 (en) * 2015-08-12 2017-12-21 Boe Technology Group Co., Ltd. Distance sensing substrate, display device, display system and resolution adjustment method
EP3261354A1 (en) * 2013-06-05 2017-12-27 Thomson Licensing Method and apparatus for content distribution for multi-screen viewing
US9872199B2 (en) 2015-09-22 2018-01-16 Qualcomm Incorporated Assigning a variable QCI for a call among a plurality of user devices
US9930386B2 (en) 2013-06-05 2018-03-27 Thomson Licensing Method and apparatus for content distribution multiscreen viewing
EP2404411B1 (en) * 2009-03-06 2018-05-02 Alcatel Lucent Real-time multi-media streaming bandwidth management
WO2018108284A1 (en) * 2016-12-15 2018-06-21 Telefonaktiebolaget Lm Ericsson (Publ) Audio recording device for presenting audio speech missed due to user not paying attention and method thereof
JP2018530277A (en) * 2015-09-01 2018-10-11 トムソン ライセンシングThomson Licensing Method, system and apparatus for media content control based on attention detection
WO2019026360A1 (en) * 2017-07-31 2019-02-07 ソニー株式会社 Information processing device and information processing method
US10212474B2 (en) 2013-06-05 2019-02-19 Interdigital Ce Patent Holdings Method and apparatus for content distribution for multi-screen viewing
US11064264B2 (en) 2018-09-20 2021-07-13 International Business Machines Corporation Intelligent rewind function when playing media content
US11438642B2 (en) * 2018-08-23 2022-09-06 Rovi Guides, Inc. Systems and methods for displaying multiple media assets for a plurality of users

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020144259A1 (en) * 2001-03-29 2002-10-03 Philips Electronics North America Corp. Method and apparatus for controlling a media player based on user activity
US20030052911A1 (en) * 2001-09-20 2003-03-20 Koninklijke Philips Electronics N.V. User attention-based adaptation of quality level to improve the management of real-time multi-media content delivery and distribution
US20050281531A1 (en) * 2004-06-16 2005-12-22 Unmehopa Musa R Television viewing apparatus
WO2006061770A1 (en) * 2004-12-07 2006-06-15 Koninklijke Philips Electronics N.V. Intelligent pause button

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020144259A1 (en) * 2001-03-29 2002-10-03 Philips Electronics North America Corp. Method and apparatus for controlling a media player based on user activity
US20030052911A1 (en) * 2001-09-20 2003-03-20 Koninklijke Philips Electronics N.V. User attention-based adaptation of quality level to improve the management of real-time multi-media content delivery and distribution
WO2003026250A1 (en) * 2001-09-20 2003-03-27 Koninklijke Philips Electronics N.V. Quality adaption for real-time multimedia content delivery based on user attention
US20050281531A1 (en) * 2004-06-16 2005-12-22 Unmehopa Musa R Television viewing apparatus
WO2006061770A1 (en) * 2004-12-07 2006-06-15 Koninklijke Philips Electronics N.V. Intelligent pause button

Cited By (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2459705A (en) * 2008-05-01 2009-11-04 Sony Computer Entertainment Inc Media reproducing device with user detecting means
US8774592B2 (en) 2008-05-01 2014-07-08 Sony Computer Entertainment Inc. Media reproduction for audio visual entertainment
GB2459705B (en) * 2008-05-01 2010-05-12 Sony Computer Entertainment Inc Media reproducing device, audio visual entertainment system and method
US8407749B2 (en) 2008-09-05 2013-03-26 Skype Communication system and method
US9654726B2 (en) 2008-09-05 2017-05-16 Skype Peripheral device for communication over a communications system
US9128592B2 (en) 2008-09-05 2015-09-08 Skype Displaying graphical representations of contacts
US8413199B2 (en) 2008-09-05 2013-04-02 Skype Communication system and method
US8866628B2 (en) 2008-09-05 2014-10-21 Skype Communication system and method
WO2010026187A1 (en) * 2008-09-05 2010-03-11 Skype Limited Communications system
US8421839B2 (en) 2008-09-05 2013-04-16 Skype Peripheral device for communication over a communications system
US8473994B2 (en) 2008-09-05 2013-06-25 Skype Communication system and method
US8489691B2 (en) 2008-09-05 2013-07-16 Microsoft Corporation Communication system and method
US8520050B2 (en) 2008-09-05 2013-08-27 Skype Communication system and method
EP2404411B1 (en) * 2009-03-06 2018-05-02 Alcatel Lucent Real-time multi-media streaming bandwidth management
WO2010142869A1 (en) 2009-06-08 2010-12-16 Weballwin Method for controlling the attention of a user watching a multimedia flow on the screen of a multimedia apparatus connected to a network, and systems for implementing same
EP2466771A1 (en) * 2010-12-16 2012-06-20 Gérard Olivier Smart audience monitoring device
WO2013052887A1 (en) * 2011-10-05 2013-04-11 Qualcomm Incorporated Minimal cognitive mode for wireless display devices
KR101604296B1 (en) * 2011-10-05 2016-03-25 Qualcomm Incorporated Minimal cognitive mode for wireless display devices
CN104041064A (en) * 2011-10-05 2014-09-10 Qualcomm Incorporated Minimal cognitive mode for wireless display devices
US20130089006A1 (en) * 2011-10-05 2013-04-11 Qualcomm Incorporated Minimal cognitive mode for wireless display devices
CN104737099B (en) * 2012-08-31 2018-05-08 Google LLC Dynamic adjustment of video quality
EP2891039A4 (en) * 2012-08-31 2016-04-27 Google Inc Dynamic adjustment of video quality
CN108347648A (en) * 2012-08-31 2018-07-31 Google LLC Dynamic adjustment of video quality
CN104737099A (en) * 2012-08-31 2015-06-24 Google Inc. Dynamic adjustment of video quality
US9652112B2 (en) 2012-08-31 2017-05-16 Google Inc. Dynamic adjustment of video quality
US20140081748A1 (en) * 2012-09-14 2014-03-20 International Business Machines Corporation Customized television commercials
US20140081749A1 (en) * 2012-09-14 2014-03-20 International Business Machines Corporation Customized television commercials
JP2016504836A (en) * 2012-11-29 2016-02-12 Qualcomm Incorporated Method and apparatus for using user engagement to provide content presentation
US9398335B2 (en) 2012-11-29 2016-07-19 Qualcomm Incorporated Methods and apparatus for using user engagement to provide content presentation
WO2014085145A3 (en) * 2012-11-29 2014-07-24 Qualcomm Incorporated Methods and apparatus for using user engagement to provide content presentation
CN104813678A (en) * 2012-11-29 2015-07-29 Qualcomm Incorporated Methods and apparatus for using user engagement to provide content presentation
WO2014108194A1 (en) * 2013-01-10 2014-07-17 Telefonaktiebolaget L M Ericsson (Publ) Apparatus and method for controlling adaptive streaming of media
CN105359479A (en) * 2013-01-10 2016-02-24 Telefonaktiebolaget L M Ericsson (Publ) Apparatus and method for controlling adaptive streaming of media
US9930386B2 (en) 2013-06-05 2018-03-27 Thomson Licensing Method and apparatus for content distribution multiscreen viewing
US10212474B2 (en) 2013-06-05 2019-02-19 Interdigital Ce Patent Holdings Method and apparatus for content distribution for multi-screen viewing
EP3261354A1 (en) * 2013-06-05 2017-12-27 Thomson Licensing Method and apparatus for content distribution for multi-screen viewing
US10416853B2 (en) 2014-04-17 2019-09-17 Google Llc Methods, systems, and media for providing media guidance based on detected user events
US9690455B1 (en) * 2014-04-17 2017-06-27 Google Inc. Methods, systems, and media for providing media guidance based on detected user events
US9888126B2 (en) 2015-01-16 2018-02-06 Boe Technology Group Co., Ltd. Multipurpose conferencing terminal and multipurpose conference system
EP3070936A1 (en) * 2015-01-16 2016-09-21 BOE Technology Group Co., Ltd. Multi-purpose conference terminal and multi-purpose conference system
CN104767962A (en) * 2015-01-16 2015-07-08 BOE Technology Group Co., Ltd. Multipurpose conference terminal and multipurpose conference system
EP3070936A4 (en) * 2015-01-16 2017-03-29 BOE Technology Group Co., Ltd. Multi-purpose conference terminal and multi-purpose conference system
CN104767962B (en) * 2015-01-16 2019-02-15 BOE Technology Group Co., Ltd. Multi-use conferencing terminal and multi-use conferencing system
US20170364142A1 (en) * 2015-08-12 2017-12-21 Boe Technology Group Co., Ltd. Distance sensing substrate, display device, display system and resolution adjustment method
US10228759B2 (en) * 2015-08-12 2019-03-12 Boe Technology Group Co., Ltd. Distance sensing substrate, display device, display system and resolution adjustment method
JP2018530277A (en) * 2015-09-01 2018-10-11 Thomson Licensing Method, system and apparatus for media content control based on attention detection
US9872199B2 (en) 2015-09-22 2018-01-16 Qualcomm Incorporated Assigning a variable QCI for a call among a plurality of user devices
WO2018108284A1 (en) * 2016-12-15 2018-06-21 Telefonaktiebolaget Lm Ericsson (Publ) Audio recording device for presenting audio speech missed due to user not paying attention and method thereof
WO2019026360A1 (en) * 2017-07-31 2019-02-07 Sony Corporation Information processing device and information processing method
JPWO2019026360A1 (en) * 2017-07-31 2020-05-28 Sony Corporation Information processing apparatus and information processing method
US11250873B2 (en) 2017-07-31 2022-02-15 Sony Corporation Information processing device and information processing method
US11438642B2 (en) * 2018-08-23 2022-09-06 Rovi Guides, Inc. Systems and methods for displaying multiple media assets for a plurality of users
US11812087B2 (en) 2018-08-23 2023-11-07 Rovi Guides, Inc. Systems and methods for displaying multiple media assets for a plurality of users
US11064264B2 (en) 2018-09-20 2021-07-13 International Business Machines Corporation Intelligent rewind function when playing media content

Similar Documents

Publication Publication Date Title
WO2007113580A1 (en) Intelligent media content playing device with user attention detection, corresponding method and carrier medium
US11366632B2 (en) User interface for screencast applications
US11651794B2 (en) Variable speed playback
EP1843591A1 (en) Intelligent media content playing device with user attention detection, corresponding method and carrier medium
EP1843592A1 (en) Controlling media content quality
US9167312B2 (en) Pause-based advertising methods and systems
US11930250B2 (en) Video assets having associated graphical descriptor data
US20100122277A1 (en) A device and a method for playing audio-video content
US20220174357A1 (en) Simulating audience feedback in remote broadcast events
WO2006041996A2 (en) Method for minimizing buffer delay effects in streaming digital content
CN113141514B (en) Media stream transmission method, system, device, equipment and storage medium
JP7155164B2 (en) Temporal placement of rebuffering events
WO2003058965A1 (en) Conferencing with synchronous presentation of media programs
JP2002077820A (en) Accumulating/reproducing device and digital broadcast transmitting device
JP2009224818A (en) Content reproducing unit and content reproducing method
JP6034113B2 (en) Video content distribution device
WO2020128625A1 (en) Method for operating an electronic device when playing audiovisual content
EP2670156A1 (en) Interactive audio/video broadcast system, method for operating the same and user device for operation in the interactive audio/video broadcast system
JP4994942B2 (en) Information processing apparatus, information processing method, and information processing system
US11949948B2 (en) Playback control based on image capture
JP2008054150A (en) Multiple channel image transfer device
KR20230074544A (en) Real-time and file-based audio data processing
CN114827715A (en) Display device and media asset playing method

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 07732332
Country of ref document: EP
Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 07732332
Country of ref document: EP
Kind code of ref document: A1