WO2009083832A1 - Device and method for converting multimedia content using a text-to-speech engine - Google Patents

Device and method for converting multimedia content using a text-to-speech engine

Info

Publication number: WO2009083832A1
Authority: WO (WIPO/PCT)
Prior art keywords: signal, data, text, data signal, image
Application number: PCT/IB2008/055114
Other languages: French (fr)
Inventor: Werner Lane
Original assignee: Koninklijke Philips Electronics N.V.
Application filed by: Koninklijke Philips Electronics N.V.

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/41 - Structure of client; Structure of client peripherals
    • H04N 21/414 - Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance
    • H04N 21/41407 - Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance, embedded in a portable device, e.g. video client on a mobile phone, PDA, laptop
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/4302 - Content synchronisation processes, e.g. decoder synchronisation
    • H04N 21/4307 - Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N 21/43074 - Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen, of additional data with content streams on the same device, e.g. of EPG data or interactive icon with a TV program
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 65/00 - Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/60 - Network streaming of media packets
    • H04L 65/75 - Media network packet handling
    • H04L 65/764 - Media network packet handling at the destination
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 - Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/2343 - Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/234318 - Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements, by decomposing into objects, e.g. MPEG-4 objects
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/238 - Interfacing the downstream path of the transmission network, e.g. adapting the transmission rate of a video stream to network bandwidth; Processing of multiplex streams
    • H04N 21/2389 - Multiplex stream processing, e.g. multiplex stream encrypting
    • H04N 21/23892 - Multiplex stream processing involving embedding information at multiplex stream level, e.g. embedding a watermark at packet level
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/4402 - Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/440236 - Processing of video elementary streams involving reformatting operations of video signals by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 - Assembly of content; Generation of multimedia applications
    • H04N 21/854 - Content authoring
    • H04N 21/8547 - Content authoring involving timestamps for synchronizing content

Definitions

  • After the synchronization of the outputs from the converters 3 and 4 and the decoder 6, the formatter 7 sends the combined media stream signal CMS to the input of the encoder 8.
  • This combined media stream signal CMS may be compressed and output as a media stream file MSF (e.g. MP4). It should be understood that an uncompressed media stream file may also be provided.
  • the media stream file MSF can be stored in a memory media and/or the media player 11.
  • the media stream file MSF includes the compressed or encoded combined media stream signal that may be played back, viewed and/or used by the media player 11.
  • The system 10 may also include a controller (not shown in the drawing) and a selector 9 for enabling user-selectable preferences that may include background music and/or various display formats.
  • the selector 9 can be integrated with the controller and can be implemented by software or hardware and this can be in the form of a selection button, remote control, etc.
  • The selector 9 will send a template signal TPL to the text-to-image engine 4 for setting the output parameters (such as display format, background colour, etc.) of the text-to-image engine 4 according to the preference of the user.
  • the selector 9 is also designed to select and output background music data BM to a background music or audio decoder 5.
  • the output of the background music or audio decoder 5 represents second audio data AS-2, which is coupled to another input of the formatter 7.
  • the second audio data AS-2 together with the other outputs from the converters 3 and 4 and decoder 6, which include the uncompressed image data IMD-I, first audio data AS-I, and image data IMD-2 as described above in the first embodiment, are input into the formatter 7. It has to be noted that in the case of the second audio data AS-2, it can be designed to decrease in amplitude, or become muted, whenever the first audio data AS-I is present. This is to minimize the disturbance to the user when he is listening to, or reading, the converted multimedia contents (e.g. first audio data AS-I and/or second image data IMD-2, respectively).
  • the formatter 7 sends the combined media stream signal CMS, which includes the second audio data AS-2, the uncompressed image data IMD-I, first audio data AS-I, and image data IMD-2, to the input of the encoder 8.
  • This combined media stream signal CMS is compressed and output as a media stream file MSF (e.g. MP4) that may be played back, viewed and/or used by the media player 11 , as in the first embodiment.
  • The signals processed in the formatter 7, as shown in Fig. 2, correspond to the first image data signal IMD-I representing images downloaded from the source (see Fig. 2a), the second image data signal IMD-2 representing the output of the text-to-image engine 4 (see Fig. 2b), the first audio data AS-I for the different pages/stories (see Fig. 2c), and the second audio data AS-2 for the background music (see Fig. 2d); the horizontal axis represents a time variable for these signals.
  • Fig. 2a shows the first image data signal IMD-I comprising a downloaded image located at Page 1 of the first story 53-P1 (e.g. one that was embedded in an RSS feed), a downloaded image located at Page 4 of the first story 53-P4, a downloaded image located at Page 1 of the second story 55-P1, and a downloaded image located at Page 2 of the second story 55-P2.
  • Fig. 2b shows the second image data signal IMD-2 comprising a Cover Page 51, a page for the Title of first story 52, Page 1 of first story 53-P2, and Page 2 of first story 53-P3.
  • the second image data signal IMD-2 further contains a Title of second story 54 (e.g. one that was embedded in the RSS feed), Page 1 of second story 55-P3 and Page 2 of second story 55-P4.
  • Fig. 2c and Fig. 2d represent the first audio data AS-I from the output of the Text-To-Speech engine 3 and the second audio signal AS-2 from the background music or audio decoder 5, respectively.
  • When the system 10 receives a data input signal DI, or data input signals from the Internet IA (e.g. RSS feeds) or from a memory medium IB, the extractor 2 extracts the compressed visual information CVI (e.g. in a JPEG format) and the decoder 6 decodes and outputs the first image data IMD-I.
  • The extractor 2 also extracts the textual signal TS from all image pages, and the extracted textual signals TS are processed by the text-to-speech engine 3, except for the image pages of the first story 53-P1 and 53-P4 and the image pages of the second story 55-P1 and 55-P2, where the images embedded in the data input signal DI do not include texts. The other pages, which contain textual signals, will be converted and output by the text-to-speech engine 3 as first audio data AS-I, as shown in Fig. 2c.
  • The image pages that have textual signals being converted to first audio data AS-I include the cover page 51A, a page for the Title of the first story 52A, Page 1 of the first story 53A-P2, Page 2 of the first story 53A-P3, the Title of the second story 54A (e.g. one that was embedded in the RSS feed), Page 1 of the second story 55A-P3 and Page 2 of the second story 55A-P4.
  • The second audio data AS-2 (i.e., the background music) is introduced during the absence of first audio data AS-I; this is represented in Fig. 2d by the second audio data AS-2 illustrated with an increased audio signal amplitude 56. When the first audio data AS-I signal is present, the second audio data AS-2 is designed to have its amplitude minimised; this is represented in Fig. 2d by the second audio data AS-2 illustrated with a reduced audio signal amplitude 57.
  • Such an effect, in which the level of one signal is reduced by the presence of another signal, is known as ducking; it will not be described in detail here as it is known to a person skilled in the art (a small illustrative sketch of this ducking behaviour is given after this list).
  • The second audio data AS-2 with an increased audio signal amplitude 56 corresponds to a page containing an image, but without any textual signal (e.g. Page 1 of the first story 53-P1 as represented in Fig. 2a).
  • the user preference settings could be implemented in the form of a template, with the user preference selector 9 outputting a template signal TPL to the text-to-image engine 4 to configure the characteristics of the displayed image.
  • The template signal TPL includes parameters such as font (i.e., type face, size, colour) and size of images. Other user preference parameters include the duration for showing images that are extracted from the source (e.g. from the data input signal DI), turning on or off the images that are being processed, the maximum number of stories to be processed, the location from which stories are retrieved (e.g. the URL of an RSS feed), different kinds of CODEC settings for the output format (e.g. MPEG-4, WMV, bit rates, etc.), various pause time settings (e.g. between stories, between headlines, between images, etc.), and so on.
  • the user preference settings can also include audio related parameters that are sent to the background music or audio decoder 5, where the volume level is set as desired by the user.
  • The flowchart 100 describes the operations involved in a process for creating stories for the system 10 (a compact sketch of this overall flow is given after this list). These operations, which are performed automatically after the user starts the system 10, include the following: the user may select images and add them to the story, which affects the Make "Image" operation in step 250; and in the Make "Pages" operation in step 260, the number of pages to be made depends on the selected font size of the displayed texts, with the number of pages needed to display one complete story being determined by the system 10.
  • the system 10 may invoke a default page or predetermined number of pages that is/are predefined as default in the system 10.
  • the default can be information extracted from the data input signal DI (e.g. title, description, images, etc., from the RSS file).
  • one or more extracted stories are processed by the extractor 2, as shown by the operation in step 200.
  • the extracted stories may be received/input from the data input sources DI, IA and IB, as described hereinbefore in relation to Fig. 1.
  • the system 10 first makes an introduction page (i.e., invoke the Make "Introduction” operation), which is shown in step 210.
  • the sub-operations of step 210 are shown in Fig. 3c, which will be explained in more detail hereinafter.
  • The user may also define or modify some user preference settings for the display format and/or set or select the background music to be reproduced when pages containing non-textual information are displayed.
  • The next operation, the Make "Story" operation, is shown in step 220.
  • the sub-operations of step 220 are shown in Fig. 3b, which will be explained in more detail hereinafter.
  • A check is then made to see whether the next story needs to be made, as shown in step 270. If another story has to be made, the Make "Story" operation in step 220 will be repeated.
  • Otherwise, the method according to the invention will be terminated in step 280, and the converters 3, 4 and decoders 5, 6 are designed to stop reproducing audio and video outputs (i.e., the outputs coupled to the formatter 7, which include the uncompressed first image data IMD-I, first audio data AS-I, second image data IMD-2 and/or second audio data AS-2).
  • the system 10 may be configured by the user to start the repeat process from the first story after the last story has been reproduced (not shown in the flowchart).
  • The above descriptions give a brief overview of the operations involved, including extracting the stories, making "Introduction" (see the sub-operations of step 210 in Fig. 3c) and making "Story" (see the sub-operations of step 220 in Fig. 3b), up to the operation of stopping the audio and video outputs when all stories have been made. Further descriptions of the sub-operations for making the "Introduction" page in step 210 and for making the story in step 220 will be given hereinafter.
  • the sub-operations in step 210 include further operations where the extractor 2 (or parser) sends the "Introduction” texts (i.e. textual signal TS) to the text-to-image engine 4 and text-to-speech engine 3, as shown in step 211 and step 212, respectively.
  • the extractor 2 also sends the compressed visual information CVI of the "Introduction” page to the decoder 6, as shown in step 213.
  • Upon receiving the converted outputs of the signal "Introduction" from the converters 3 and 4 and the decoder 6, the formatter 7 takes the outputs of the text-to-image engine 4, the text-to-speech engine 3, and the decoder 6 and combines them to output a combined media stream signal CMS (i.e. a formatted AV stream), as shown in step 214.
  • The sub-operations of step 220 for making "Story" are now described in more detail. This operation includes making the story "Title" as shown in step 230, and checking whether the story contains picture or image information, as shown in step 240.
  • the sub-operations of step 230 are shown in Fig. 3d, which will be explained in more detail hereinafter.
  • If the story "Title" contains picture or image information, then the operation of making "Image", as shown in step 250, will take place.
  • the sub-operations of step 250 are shown in Fig. 3e, which will be explained in more detail hereinafter. If the story “Title” does not contain a picture or an image, then the operation proceeds to making "Page", as shown in step 260, of which the sub-operations are shown in Fig. 3f, which will be explained in more detail hereinafter.
  • The sub-operations of step 230 for making the introduction page "Story Title", as shown in Fig. 3d, include further operations where the extractor 2 sends the "Story" texts (i.e. textual signal TS) to the text-to-image engine 4 and the text-to-speech engine 3.
  • the extractor 2 also sends the compressed visual information CVI of the "Story Title" page to the decoder 6, as shown in step 233.
  • the formatter 7 takes the outputs of the text-to-image engine 4, the text-to-speech engine 3, and the decoder 6, combines them and outputs a combined media stream signal CMS (i.e. a formatted AV stream), as shown in step 234.
  • The sub-operations of step 250 for making the "Image", as shown in Fig. 3e, include further operations where the parser or extractor 2 sends the "Image" data (i.e. compressed visual signals CVI) to the decoder 6, as in step 251.
  • the decoder 6 sends signal "Image” represented by the first image data IMD-I to the formatter 7, as shown in step 252.
  • the formatter 7 takes outputs of the decoder 6 and combines them with the outputs of the text-to-image engine 4 and the text-to-speech engine 3 to output a combined media stream signal CMS for a duration that is pre-determined by the user, and this operation is shown in step 253.
  • The outputs of the text-to-image engine 4 and the text-to-speech engine 3 may be represented by signals of zero amplitude, or other default levels.
  • the combined media stream signal CMS will include information relating to the user preference for the corresponding background music data BM and image template TPL, which are outputs of the background music or audio decoder 5 that represent second audio data AS-2, and the output of the text-to-image engine 4, both coupled to the inputs of the formatter 7.
  • The sub-operations of step 260 for making the "Page", as shown in Fig. 3f, include further operations where the parser or extractor 2 takes the complete story texts and sends them to the text-to-speech engine 3, as shown in step 261.
  • The extractor 2 also splits the story texts into parts that fit on a single image page, as shown in step 262 (a simple sketch of such page splitting is given after this list). Each part of the textual signal of the story texts fits into one image page before it is sent to the text-to-image engine 4, as shown in step 263.
  • The fitting of story texts into a single page is determined automatically by the system 10, based on parameters that include the size of the displayed image page, the font type and the font size; this will not be described in detail as it is known to a person skilled in the art.
  • Upon receiving the signal "Text", along with a part of the image of the story texts that fits into one image page, the formatter 7 takes these outputs of the text-to-speech engine 3 and text-to-image engine 4, respectively, and the output of the decoder 6 and combines them to output a combined media stream signal CMS, as shown in step 264. The formatter 7 continues to receive and output the combined media stream signal CMS until the Text-To-Speech Event signal TTSE indicates that the corresponding last word in the text page has been received, which indicates that the whole story text has been processed.
  • In step 265, the system 10 checks whether the whole story text has been processed, i.e., there is no more data input signal DI, IA or IB. If not, the operation is repeated from step 262. If the whole story text has been processed, then the operation proceeds to step 266, which continues with the operation in step 270, as shown in Fig. 3a, where a check is made to determine whether all stories have been made. If all the stories have not been made, the operations that follow after step 270 will be repeated from step 220 according to the explanation given above. The operation will terminate with step 280, as described above, when all stories have been made.
  • the system 10 is configured into a portable device 20, as shown in Fig. 1, by designing the formatter 7 to include an output for providing a video data signal VDS (e.g. LVDS) for driving a display means 21 and another output for providing audio output AO to an audio driver for driving loudspeaker or earphone 22.
  • the formatter 7 also has a storage means 23 to store the combined media stream signal CMS.
  • the portable device 20 may include at least one or more input ports 12 for receiving data input signals from the Internet IA (e.g. RSS feeds), or from a memory medium IB that includes data (e.g. Texts and JPEG).
  • the portable device 20 can also provide the output for delivering the combined media stream file MSF, as described in the first embodiment for the system 10, so that it can be connected to another device or computer for receiving the combined media stream signal CMS, such as the media player 11.
  • the portable device 20 can be designed so that the possibility for output of the combined media stream file CMS is removed. In this case, this embodiment only serves as a portable device.
  • the method described in the above sections could be used by service providers to provide the user with flexible means of formatting multimedia content according to the user- selectable preference via a website subscription service.
  • a computer product comprising executable codes for performing the method described above could be stored onto a computer readable medium for use with the conventional media player 11.
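By way of illustration of the ducking behaviour mentioned in the list above, the following Python sketch reduces the amplitude of the background music AS-2 whenever the synthesised speech AS-I is present. The gain value and the per-sample "speech active" flags are assumptions made for this example, not parameters taken from the patent.

def duck_background(background, speech_active, duck_gain=0.2):
    """background:    background-music samples (second audio data AS-2).
       speech_active: per-sample flags, True while first audio data AS-I is present.
       duck_gain:     gain applied to the music while speech is present.
    Returns the ducked music: full amplitude in the absence of speech
    (amplitude 56 in Fig. 2d) and reduced amplitude while speech plays
    (amplitude 57 in Fig. 2d)."""
    return [sample * (duck_gain if active else 1.0)
            for sample, active in zip(background, speech_active)]

# Example: constant-level music, with speech present in the middle section.
music = [1.0] * 6
speech = [False, False, True, True, False, False]
print(duck_background(music, speech))   # [1.0, 1.0, 0.2, 0.2, 1.0, 1.0]

A real implementation would smooth the gain changes over a few milliseconds rather than switching it per sample, but the principle is the same.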
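The overall story-creation flow of Figs. 3a through 3f (steps 200 to 280) can be summarised by the sketch below. Each callable is a placeholder for one flowchart operation, the story objects are assumed to be dictionaries, and the branching around the image check of step 240 is simplified; none of these names is defined by the patent.

def create_stories(stories, make_introduction, make_title, make_image, make_pages):
    """stories: the items produced by the extractor 2 in step 200, assumed here
    to be dictionaries with an "images" entry. Each callable stands in for one
    operation of Figs. 3a-3f."""
    make_introduction()              # step 210: Make "Introduction"
    for story in stories:            # steps 220/270: make each story in turn
        make_title(story)            # step 230: Make story "Title"
        if story.get("images"):      # step 240: does the story contain an image?
            make_image(story)        # step 250: Make "Image"
        make_pages(story)            # step 260: Make "Pages"
    # step 280: all stories made -> stop reproducing the audio and video outputs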
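The splitting of a story text into parts that each fit on one image page (step 262) can be sketched as a simple greedy word wrap. The character and line budgets below are stand-ins for the capacity that, in the system described, would be derived from the selected font type, font size and display size.

import textwrap

def split_story_into_pages(story_text, chars_per_line=32, lines_per_page=8):
    """Greedily split the complete story text into page-sized parts, each of
    which would then be sent to the text-to-image engine 4 (step 263)."""
    lines = textwrap.wrap(story_text, width=chars_per_line)
    return ["\n".join(lines[i:i + lines_per_page])
            for i in range(0, len(lines), lines_per_page)]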

Abstract

A system 10 or portable device 20 capable of extracting multimedia content from a data input signal, converting the data information type of the multimedia content to a converted data format, and formatting the multimedia content and the converted data format into a combined (media stream) signal is disclosed. This system or device enables users to access and play the multimedia content in a seamless manner without being interrupted when switching the device to or using it in different modes or playing different information types.

Description

DEVICE AND METHOD FOR CONVERTING MULTIMEDIA CONTENT USING A TEXT-TO-SPEECH ENGINE
FIELD OF THE INVENTION
The invention relates to a system, a method, and a portable device for processing and/or displaying multimedia content; more particularly, original multimedia content may be reformatted so that it can be utilized by a user in a flexible and user-friendly manner.
BACKGROUND OF THE INVENTION
Portable media players are becoming more popular and are available in a variety of configurations. For example, a portable digital music player, e.g. MP3 player or decoder, which can be either software or hardware-based, converts encoded MP3 data into an output audio signal and also displays images or videos. Music can be downloaded into such media players from a personal computer (PC) via a parallel or USB cable or directly from the Internet. Such media players typically store the music in a flash memory, on a hard disk, or on removable memory cards.
In addition to music, users can also retrieve personalized and fast delivery of news from the Internet through RSS ("Really Simple Syndication") feeds. RSS feeds can deliver short news stories based on the user's interest. Traditionally, those news feeds can only be accessed on a computer or device connected to the Internet. RSS feeds may contain either a summary of content from an associated web site or the full text. RSS makes it possible for people to keep up with their favorite web sites in an automated manner. RSS content can be read using software called a "feed reader" or an "aggregator." The user subscribes to a feed by entering the feed's link into the reader. The reader checks the user's subscribed feeds regularly for new content, downloading any updates that it finds. A RSS feed in its most basic form consists of a channel with its own attributes (e.g. title, description, etc) and a number of items, each with its own attributes (e.g. title, description, images, etc). An example of a schematic diagram outlining the basic structure of a RSS file is illustrated below:
<?xml version="1.0"?>
<channel>
  <title>BBC News</title>
  <description>BBC UK News updated every minute of every day</description>
</channel>
<item>
  <title>Clare Short quits post over Iraq</title>
  <description>Clare Short quits the cabinet, accusing Tony Blair of breaking his promises over the UN's role in rebuilding Iraq.</description>
</item>
<item>...</item>
<item>...</item>
The information enclosed between the "<channel>" tags is used to describe the feed itself. When the extractor encounters a channel tag, it uses the attributes of the channel tag to generate the "introduction page". The items contain the various "stories" of the RSS feed. Their title, description and image attributes are utilized to create "titles", "images" and "pages" in the resulting output stream.
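To make this extraction step concrete, here is a minimal sketch in Python of how a parser might turn such a file into an introduction page plus a list of stories. It uses the standard xml.etree.ElementTree module; the Feed and Story containers and the use of <enclosure> elements for item images are assumptions made for this example, not details taken from the patent.

import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

@dataclass
class Story:
    title: str
    description: str
    images: List[str] = field(default_factory=list)

@dataclass
class Feed:
    intro_title: str          # used for the "introduction page"
    intro_description: str
    stories: List[Story] = field(default_factory=list)

def parse_rss(xml_text: str) -> Feed:
    root = ET.fromstring(xml_text)
    # The <channel> attributes describe the feed itself and yield the introduction page.
    channel = next(root.iter("channel"), root)
    feed = Feed(intro_title=channel.findtext("title", default=""),
                intro_description=channel.findtext("description", default=""))
    # Each <item> is one "story"; its title, description and images later become
    # the "titles", "images" and "pages" of the output stream.
    for item in root.iter("item"):
        feed.stories.append(Story(
            title=item.findtext("title", default=""),
            description=item.findtext("description", default=""),
            images=[e.get("url", "") for e in item.findall("enclosure")]))
    return feed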
A technique for pushing forward RSS feeds onto non-web-enabled, hand-held devices, such as mobile phones, is known from document US2006/0155698-A1. An SMS ("Short Message Service") message may be sent to a mobile phone when new feed items are available. The SMS message generally contains a small amount of information about each item because of the limited size of SMS messages. A recipient of an SMS message may reply with an SMS message for more information about items in which the recipient is interested. In response to the request for more information, a voice message is placed in the recipient's voicemail that includes the full news item, translated by a speech synthesis program. Other forms of conventional players are known for visually impaired users.
Such devices do not require a display screen and are specially designed for reading books aloud and playing music. In addition, such devices can help the visually impaired users to browse the Internet, access e-mail, take voice and text notes, and manage personal information such as phone numbers, addresses and appointments. A device capable of receiving a mark-up text input for describing information stored and accessible via the Internet, and of analysing the mark-up text to provide a spoken output is known from document GB-2361556-A. Text configuration signals from a remote location can be received by a hand-held device. The text configuration signals are converted into speech engine configurational signals for configuring a speech engine to generate a corresponding speech output. This device is designed for a motor vehicle to prevent the visual output from causing distraction to a driver. It is also intended for use by a visually impaired person. However, the conventional devices cited above have several drawbacks in that they lack flexibility in outputting information to the user. They accept one specific type of input and produce one type of output, e.g. text converted to speech in the audio output. Moreover, during operation users can access only one type of retrieved information, namely speech output generated by a speech engine. Such speech output corresponds to the speech engine configurational signals that are converted from text configuration signals in the retrieved information.
Another portable device that uses a removable ROM device capable of providing visual and aural information displayed in paperless book is known from document US6933928-B1. The device can be read in a visual mode, as print, or in an aural mode, as the spoken word corresponding to the print, or in both visual and aural modes.
This device allows the user to use the book in the aural mode when he is driving and switch to the visual mode when he is travelling by train.
However, this paperless book has a drawback, as it uses content in a specific format, which does not allow off-line preparation of downloaded content for use by other conventional portable devices and/or media players.
SUMMARY OF THE INVENTION
It is an object of the invention to provide a solution to the problem of reproducing/representing the information data signal to the user that reduces or removes the drawbacks described above.
This object is achieved in that various embodiments of the present invention relate to portable devices capable of accessing and outputting information in different situations. For example, in one embodiment, the user of a portable device may view the information represented by a text configuration signal on the display means of the device and/or listen to speech output from the same device when the user is commuting on public transportation, such as train or bus. When the user disembarks the public transportation and walks to his destination, he can continue to listen to the speech output via the headphone coupled to the portable device. At any point in time, if the user should desire to view the text configuration signal on the display means of the portable device, he can choose to do so while simultaneously choosing that the speech output is still being reproduced by the portable device. He may also like this to take place seamlessly, i.e. without any interruption when switching the invention to or using the invention in different modes or playing different information types.
Another embodiment of the invention is directed to a system capable of extracting multimedia content from one or more input signals. The input signals can be inputs from among a plurality of different input signal sources, such as the Internet or stored media. The extracted multimedia is converted into another format and then combined with original multimedia content to form a combined signal.
Another embodiment of the invention is directed to a method of extracting multimedia content or data information from one or more different input signal sources. The extracted multimedia content is converted into another format(s) that is then combined with original multimedia content to form a combined signal. In another embodiment, a portable device is capable of receiving a combined signal that includes a first data signal and a second data signal. The second data signal is a converted form of the first data signal. Based on a user's selection, the portable device can simultaneously reproduce/represent the data signals as audio and/or visual output.
One aspect of the invention is related to converting data information, extracted from an input source, wherein textual content is converted to an image-type signal and/or textual content is converted to an audio signal.
Another aspect of the invention is related to generating an event synchronization signal for synchronizing the start and end of an audio signal with a corresponding portion of visual multimedia content. Another aspect of the invention is related to enabling at least one user-selectable preference, where the user can select from among a plurality of user- selectable preferences, such as background music and/or display format.
Another aspect of the invention is related to a portable device capable of receiving a combined signal that includes two types of data signals. One of the types is a converted form of the other data signal. An example of such data signals is the conversion of a textual configuration signal (a first data type) into an image configuration signal (a second data type). Another example of such data signals is the conversion of a textual configuration signal (a first data type) into speech in an audio configuration signal (a second data type).
Further aspects, embodiments and advantages of the invention are apparent from and will be elucidated with reference to the detailed description hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will be described in greater detail hereinafter, by way of non-limiting example, with reference to the embodiments shown in the drawings.
Fig. 1 illustrates a block diagram of a system according to an embodiment of the present invention and a portable device according to another embodiment of the present invention.
Fig. 2 illustrates the various timing of the different types of data signals.
Figs. 3a through 3f illustrate the flow charts of a method according to an embodiment of the present invention.
DESCRIPTION OF EMBODIMENTS
Fig. 1 shows a system 10 capable of extracting multimedia content from a desired input signal from among a plurality of different input signal sources, and converting such extracted data information into one or more converted information format(s), which are then combined into a combined media stream signal CMS. The system 10 includes an extractor 2 (or extraction means) arranged to extract data from the one or more input signals DI, IA, IB that represent multimedia content. The extracted data may include one or more data information types such as audio, textual or graphical information. The system 10 also includes a decoder 6 coupled to the extractor 2 so as to convert, e.g., compressed visual information CVI (e.g. in a JPEG format) into uncompressed image data IMD-I. The system 10 further includes one or more converters 3 and 4, which are used to convert the data information types into different or converted data types that are based upon the original data information types. For example, in the embodiment shown in Fig. 1, the converters include a text-to-speech engine 3 for converting textual data TS into first audio data AS-I, and a text-to-image engine 4 for converting textual data TS into an image data format IMD-2. A formatter 7 (or formatting means), coupled to the converters 3, 4 and the decoder 6, is used to format the multimedia content and the converted data formats into a combined media stream signal CMS. An encoder 8, coupled to the formatter 7, is used to compress the combined media stream signal CMS into a media stream file MSF for retrieval and playback by a media player 11.
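Purely as an illustration of the data flow of Fig. 1 (not code disclosed by the patent), the following Python sketch wires placeholder functions together; every name used here is an assumption introduced for this example.

def build_media_stream_file(data_input, extract, text_to_speech,
                            text_to_image, jpeg_decode, format_streams, encode):
    """Schematic wiring of Fig. 1: extractor (2) -> converters (3, 4) and
    decoder (6) -> formatter (7) -> encoder (8). Each argument is a stand-in
    for the corresponding block; none of these callables is defined by the patent."""
    # Extractor 2: separate the input into textual signals TS, compressed
    # visual information CVI, and the sequence data SD describing the structure.
    ts, cvi, sd = extract(data_input)

    as_1 = [text_to_speech(t) for t in ts]        # first audio data AS-I (plus TTSE events)
    imd_2 = [text_to_image(t) for t in ts]        # second image data IMD-2 ("pages")
    imd_1 = [jpeg_decode(c) for c in cvi]         # first image data IMD-I

    cms = format_streams(sd, imd_1, imd_2, as_1)  # combined media stream signal CMS
    return encode(cms)                            # media stream file MSF (e.g. MP4)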
The system 10 as shown in Fig. 1 and as described herein below, realises part of a system for formatting multimedia contents so that it can be displayed, played or used by a user in a flexible and user- friendly manner. In another embodiment, the extractor 2 can also receive data input signals from the Internet IA (e.g. RSS feeds) or from a memory medium IB that includes data (e.g. text plus JPEG image TJ). The extractor 2 extracts the multimedia content or separates it into the various data information types and outputs the different data information signals to the corresponding converters 3 and 4 and the decoder 6. In addition, the extractor 2 may also output a sequence data signal SD to the formatter 7. The sequence data signal SD is for indicating the document structure and texts, and such a signal includes header, image and page data that are used by the formatter 7 to format the outputs of the converters into a combined media stream signal CMS, which will now be described hereinafter.
In the embodiment of Fig. 1, the outputs of the converters 3 and 4 and the decoder 6, which include the uncompressed image data IMD-I, first audio data AS-I, and image data IMD-2, are inputs to the formatter 7. To synchronize these outputs of the converters, an event synchronization signal may be used. The text-to-speech engine 3 is designed to provide a Text-To-Speech Event signal TTSE to the formatter 7. The Text-To- Speech Event signal TTSE represents events or information at the precise moment when a particular sentence, word or phoneme is started. A typical event output of a text-to-speech engine 3 is shown in Table 1.
Table 1
No.  Type      Start   End     Position  Value
1    start     0.0000  0.0000  0-0
2    sentence  0.0000  1.9442  0-22      Football season on now
3    phoneme   0.0000  0.1053  0-0       pau
4    phrase    0.1053  1.7442  0-22      Football season on now
5    token     0.1053  0.6629  0-0
6    word      0.1053  0.6629  0-8       Football
7    syllable  0.1053  0.3738  0-0       1
8    phoneme   0.1053  0.2046  0-0       f
9    phoneme   0.2046  0.2785  0-0       uh
10   phoneme   0.2784  0.3737  0-0       t
11   syllable  0.3738  0.6629  0-0       0
12   phoneme   0.3738  0.4536  0-0       b
13   phoneme   0.4536  0.5980  0-0       ao
14   phoneme   0.5979  0.6629  0-0       1
15   token     0.6628  1.0347  0-0
16   word      0.6628  1.0347  9-15      season
17   syllable  0.6628  0.9233  0-0       1
18   phoneme   0.6628  0.8049  0-0       S
19   phoneme   0.8049  0.9233  0-0       i
20   syllable  1.0102  1.0861  0-0       0
21   phoneme   1.0102  1.0215  0-0       Z
22   phoneme   1.0214  1.0630  0-0       ah
23   phoneme   1.0861  1.0981  0-0       n
24   token     1.0981  1.2966  0-0
25   word      1.0981  1.2966  16-18     on
26   syllable  1.0981  1.2966  0-0       1
27   phoneme   1.0981  1.2295  0-0       ao
28   phoneme   1.2295  1.2966  0-0       n
29   token     1.2966  1.8496  0-0
30   word      1.2966  1.8496  19-22     now
31   syllable  1.2966  1.8496  0-0       1
32   phoneme   1.2966  1.3685  0-0       n
33   phoneme   1.3684  1.8496  0-0       aw
34   phoneme   1.8496  2.0836  0-0       pau
35   end       2.0836  2.0836  0-0
Each line of the event output is generated at the precise time that the corresponding audio is generated within the text-to-speech engine 3.
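The rows of Table 1 can be modelled as simple records. The following sketch is illustrative only (an actual text-to-speech engine would expose such events through its own API rather than as text lines); it parses lines in the format of Table 1 so that the word and phoneme timings can be used programmatically.

from dataclasses import dataclass

@dataclass
class TtsEvent:
    number: int
    kind: str       # start, sentence, phrase, token, word, syllable, phoneme, end
    start: float    # seconds from the start of the utterance
    end: float
    position: str   # character range "first-last" within the input text
    value: str      # the sentence, word or phoneme concerned (may be empty)

def parse_event_line(line: str) -> TtsEvent:
    parts = line.split(None, 5)                     # up to six columns
    number, kind, start, end, position = parts[:5]
    value = parts[5] if len(parts) > 5 else ""
    return TtsEvent(int(number), kind, float(start), float(end), position, value)

words = [parse_event_line(l) for l in (
    "6 word 0.1053 0.6629 0-8 Football",
    "16 word 0.6628 1.0347 9-15 season")]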
In order to synchronize the second image data IMD-2 (i.e. output of text-to- image engine 4) with the first audio data AS-I (i.e. output of text-to-speech engine 3), the formatter 7 is designed to use the information that includes a series of second image data (i.e. multiple "pages" of images that are part of an extracted "story"), along with knowledge of the texts present in the particular "page", first audio data signal AS-I and Text- To- Speech Event signal TTSE. The Text-To- Speech Event signal TTSE comprises markers that indicate what the speech engine is actually outputting at a certain point in time. The formatter 7 can be a "switch box" that sequences the various inputs (e.g. image data, and audio data) into its output stream to realize the combined media stream signal CMS.
It should be noted that the combined media stream signal CMS could be in the form of a program stream (such as the format specified in the MPEG specification), which is a container format for multiplexing digital audio, video, etc. Program streams are created by combining one or more packetized elementary streams, which have a common time base, into a single stream.
An elementary stream contains only one kind of data, e.g. audio, video or closed caption and is often referred to as "elementary", "data", "audio", or "video" bitstreams or streams. The format of the elementary stream depends upon the codec or data carried in the stream, but will often carry a common header when turned into a packetized elementary stream. The technique of creating the combined media stream signal will not be described in detail herein as this is known to a person skilled in the art. The combined media stream signal CMS could also be realized by other forms of communications protocol for audio, video, and data - e.g. a transport stream that allows multiplexing of digital video and audio and synchronization of the output, which is known to a person skilled in the art.
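Purely as an illustration of the multiplexing principle mentioned above, the sketch below merges packets of two elementary streams by their presentation timestamps on a common time base. The packet layout and stream names are assumptions for this example; an actual program or transport stream follows the MPEG packet formats, which are not reproduced here.

```python
import heapq

# Packets of two hypothetical elementary streams, each carrying a presentation
# timestamp ("pts") on the common time base. The dict layout is an assumption
# for illustration and is not the MPEG packet format.
audio_packets = [{"stream": "audio", "pts": t} for t in (0.0, 0.5, 1.0, 1.5)]
video_packets = [{"stream": "video", "pts": t} for t in (0.0, 0.25, 0.75, 1.25)]

def multiplex(*streams):
    """Merge per-stream packet lists (each sorted by pts) into one combined stream."""
    return list(heapq.merge(*streams, key=lambda packet: packet["pts"]))

for packet in multiplex(audio_packets, video_packets):
    print(f'{packet["pts"]:.2f}s {packet["stream"]}')
```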
As soon as first audio data AS-1 appears, the formatter 7 takes the second image data IMD-2 coming from the text-to-image engine 4 and places it in its output stream together with the first audio data AS-1. It then constantly monitors the "word" events of the Text-To-Speech Event signal TTSE that correspond with the text or word of interest (e.g. the first word on the new page of the story) in the second image data IMD-2. As soon as the last word of the text in the second image data IMD-2 has been "spoken", which is indicated by the Text-To-Speech Event signal TTSE that follows the last word represented by the current text of the second image data IMD-2, the formatter 7 outputs the next image of the series of second image data IMD-2. For example, the sentence "Football season on now" as shown in Table 1 is displayed on two pages, split such that the phrase "Football season" is on the first page and the phrase "on now" is on the second page. When the last word "season" on the first page is "spoken", as detected from the Text-To-Speech Event signal TTSE, the text-to-speech engine 3 continues with the output of the first "spoken" word "on" of the second page, and the text-to-image engine 4 outputs the second image data IMD-2 representing the second page containing the image displaying the phrase "on now". In this way, the images are synchronized with the sound and the spoken output appears to be seamlessly reproduced during the switch-over to a new page of the story. The process repeats until all images of the second image data IMD-2 for the story have been processed.
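The following sketch illustrates, under assumed data shapes, the page-advance behaviour just described: the formatter keeps the current page on screen and switches to the next page image as soon as the word events show that the last word of the current page has been spoken. The page list and event tuples are hypothetical and are not taken from the system 10.

```python
# Page contents and event tuples below are hypothetical; event times are the
# end times of the corresponding "word" events (cf. Table 1).

pages = [
    {"image": "page-1.png", "words": ["Football", "season"]},
    {"image": "page-2.png", "words": ["on", "now"]},
]

word_events = [(0.6629, "Football"), (1.0347, "season"),
               (1.2966, "on"), (1.8496, "now")]

def sequence_pages(pages, word_events):
    """Yield (time, image) switch points for the combined media stream."""
    page_index = 0
    remaining = list(pages[0]["words"])
    yield (0.0, pages[0]["image"])               # show the first page immediately
    for time, word in word_events:
        if word in remaining:
            remaining.remove(word)
        if not remaining and page_index + 1 < len(pages):
            page_index += 1                       # last word of the page was spoken
            remaining = list(pages[page_index]["words"])
            yield (time, pages[page_index]["image"])

for t, image in sequence_pages(pages, word_events):
    print(f"at {t:.3f} s switch to {image}")
```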
In another embodiment, the "word" event of the Text-To-Speech Event signal TTSE is designed to match the movement of the colour of the displayed texts represented by the second image data IMD-2. In this way, the colour of the displayed texts changes in correspondence with the audio/speech output; this will not be described further, as it is known from karaoke systems and educational toys.
In another embodiment, the text-to-speech engine 3 can output a Text-To-Speech Event signal TTSE that includes a "lip shape" event. Such information can be used for synchronizing the displayed texts with the corresponding audio output and animated image signals that represent the corresponding lip shape. The time duration of every phoneme (i.e. one of the small set of speech sounds that are distinguished by the speakers of a particular language) allocated to each lip shape is adjusted to synchronize it with the displayed text and the image corresponding to the lip shape. The synchronization information is a record of a number of lip shapes corresponding to a series of phonemes divided into small groups. By comparing the lip shape allocated to each phoneme with the lip shape in the synchronization information and adjusting the time duration of the phonemes in accordance with the position and manner of articulation of each phoneme, the corresponding displayed text and image signal are synchronized with the phoneme and lip shape. Details of the process of synchronizing the text and image signal with the phoneme allocated to each lip shape will not be described further as this is well known to the person skilled in the art.
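A minimal sketch of the duration adjustment described above is given below, assuming a simple phoneme-to-lip-shape table and a proportional scaling rule; both are illustrative assumptions rather than the actual synchronisation procedure.

```python
# Hypothetical phoneme-to-lip-shape table; the real mapping depends on the
# phoneme inventory and the animation model and is not part of this example.
LIP_SHAPE = {"f": "teeth-on-lip", "uh": "rounded", "t": "closed",
             "b": "closed", "ao": "open", "l": "tongue-up"}

def fit_group(phonemes, target_duration):
    """Scale (phoneme, duration) pairs so that their total equals target_duration."""
    total = sum(duration for _, duration in phonemes)
    scale = target_duration / total
    return [(p, d * scale, LIP_SHAPE.get(p, "neutral")) for p, d in phonemes]

# One small group of phonemes ("Foot-" from Table 1, durations rounded) fitted
# to a group duration taken from the synchronisation information (assumed here).
group = [("f", 0.10), ("uh", 0.07), ("t", 0.10)]
for phoneme, duration, shape in fit_group(group, target_duration=0.30):
    print(f"{phoneme}: {duration:.3f}s, lip shape: {shape}")
```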
After the synchronization of the outputs from the converters 3 and 4 and the decoder 6, the formatter 7 sends the combined media stream signal CMS to the input of the encoder 8. This combined media stream signal CMS may be compressed and output as a media stream file MSF (e.g. MP4). It should be understood that an uncompressed media stream file may also be provided.
An example of the compression technique is MPEG-4, which will not be described further as it is known to a person of ordinary skill in the art. The media stream file MSF can be stored in a memory medium and/or in the media player 11. The media stream file MSF includes the compressed or encoded combined media stream signal that may be played back, viewed and/or used by the media player 11.
In another embodiment, the system 10 may also include a controller (not shown in the drawing) and a selector 9 for enabling user-selectable preferences that may include background music and/or various display formats. The selector 9 can be integrated with the controller and can be implemented in software or hardware, for example in the form of a selection button, remote control, etc. When enabled or selected by the user, the selector 9 sends a template signal TPL to the text-to-image engine 4 for setting the output parameters (such as display format, background colour, etc.) of the text-to-image engine 4 according to the preference of the user. The selector 9 is also designed to select and output background music data BM to a background music or audio decoder 5. The output of the background music or audio decoder 5 represents second audio data AS-2, which is coupled to another input of the formatter 7.
The second audio data AS-2, together with the other outputs from the converters 3 and 4 and the decoder 6, which include the uncompressed image data IMD-1, first audio data AS-1 and second image data IMD-2 as described above in the first embodiment, are input into the formatter 7. It has to be noted that the second audio data AS-2 can be designed to decrease in amplitude, or become muted, whenever the first audio data AS-1 is present. This is to minimize the disturbance to the user when listening to, or reading, the converted multimedia content (e.g. the first audio data AS-1 and/or second image data IMD-2, respectively).
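The attenuation behaviour described above, commonly called ducking, can be illustrated with the following sketch, in which the background music samples are scaled down wherever speech samples are present. The gain values and sample data are assumptions chosen for the example.

```python
# Gains and sample values are assumptions chosen for the example.
FULL_GAIN = 1.0     # music amplitude when no speech is present
DUCKED_GAIN = 0.2   # reduced music amplitude while speech is present (use 0.0 to mute)

def mix(speech, music, duck_gain=DUCKED_GAIN):
    """Mix two equal-length sample lists, ducking the music wherever speech is non-zero."""
    mixed = []
    for s, m in zip(speech, music):
        gain = duck_gain if s != 0.0 else FULL_GAIN
        mixed.append(s + gain * m)
    return mixed

speech = [0.0, 0.0, 0.4, 0.5, 0.0]   # AS-1: silence, then spoken output, then silence
music = [0.3, 0.3, 0.3, 0.3, 0.3]    # AS-2: constant background music
print(mix(speech, music))            # approximately [0.3, 0.3, 0.46, 0.56, 0.3]
```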
As in the first embodiment, after the synchronization of the outputs from the converters 3, 4 and the decoder 6, the formatter 7 sends the combined media stream signal CMS, which includes the second audio data AS-2, the uncompressed image data IMD-1, first audio data AS-1 and second image data IMD-2, to the input of the encoder 8. This combined media stream signal CMS is compressed and output as a media stream file MSF (e.g. MP4) that may be played back, viewed and/or used by the media player 11, as in the first embodiment.
In the following sections, the different outputs of the converters 3 and 4 and the decoders 5 and 6 that are processed in the formatter 7 are described in relation to the pages of the stories received by the system 10, with reference to the drawings in Fig. 2. The processed signals in the formatter 7 as shown in Fig. 2 correspond to the first image data signal IMD-1 representing images downloaded from the source (see Fig. 2a), the second image data signal IMD-2 representing the output of the text-to-image engine 4 (see Fig. 2b), the first audio data AS-1 for the different pages/stories (see Fig. 2c), and the second audio data AS-2 for the background music (see Fig. 2d); the horizontal axis represents a time variable for these signals. As an example, the following description illustrates contents containing a cover page and two stories, each story containing two image pages and two pages with textual information. Fig. 2a shows the first image data signal IMD-1 comprising a downloaded image located at Page 1 of the first story 53-P1 (e.g. one that was embedded in an RSS feed), a downloaded image located at Page 4 of the first story 53-P4, a downloaded image located at Page 1 of the second story 55-P1, and a downloaded image located at Page 2 of the second story 55-P2. Fig. 2b shows the second image data signal IMD-2 comprising a Cover Page 51, a page for the Title of the first story 52, Page 1 of the first story 53-P2, and Page 2 of the first story 53-P3. The second image data signal IMD-2 further contains a Title of the second story 54 (e.g. one that was embedded in the RSS feed), Page 1 of the second story 55-P3 and Page 2 of the second story 55-P4.
The description of the audio signals in relation to the above image data signals IMD-1 and IMD-2 will be given in the following sections, with reference to Fig. 2c and Fig. 2d, which represent the first audio data AS-1 from the output of the text-to-speech engine 3 and the second audio data AS-2 from the background music or audio decoder 5, respectively. In operation, when the system 10 receives a data input signal DI, or data input signals from the Internet 1A (e.g. RSS feeds) or from a memory medium 1B, the extractor 2 extracts the compressed visual information CVI (e.g. in a JPEG format) and the decoder 6 decodes it and outputs the first image data IMD-1. In the above example, the extractor 2 also extracts the textual signal TS from all image pages and the extracted textual signals TS are processed by the text-to-speech engine 3, except for the image pages of the first story 53-P1 and 53-P4 and the image pages of the second story 55-P1 and 55-P2, where the images embedded in the data input signal DI do not include texts. The other pages, which contain textual signals, are converted and output by the text-to-speech engine 3 as first audio data AS-1, as shown in Fig. 2c. In the above example, the image pages whose textual signals are converted to first audio data AS-1 include the cover page 51A, a page for the Title of the first story 52A, Page 1 of the first story 53A-P2, Page 2 of the first story 53A-P3, the Title of the second story 54A (e.g. one that was embedded in the RSS feed), Page 1 of the second story 55A-P3 and Page 2 of the second story 55A-P4. There is no output at first audio data AS-1 during the presence of the image pages of the first image data signal IMD-1, as these pages do not contain textual information. In this example, the second audio data AS-2 (i.e. background music) is introduced during the absence of first audio data AS-1, which is represented in Fig. 2d by the second audio data AS-2 illustrated with an increased audio signal amplitude 56; when the first audio data AS-1 signal is present, the second audio data AS-2 is designed to have its amplitude minimised. This is represented in Fig. 2d by the second audio data AS-2 illustrated with a reduced audio signal amplitude 57. Such an effect, in which the level of one signal is reduced by the presence of another signal, is known as ducking and will not be described in detail as it is known to a person skilled in the art.
It has to be noted that instead of a minimal output level, the second audio data AS-2 can also be muted. The second audio data AS-2 with an increased audio signal amplitude 56, as represented in Fig. 2d, corresponds to a page containing an image but no textual signal (e.g. Page 1 of the first story 53-P1 as represented in Fig. 2a). It should be noted that the user preference settings could be implemented in the form of a template, with the user preference selector 9 outputting a template signal TPL to the text-to-image engine 4 to configure the characteristics of the displayed image. The template signal TPL includes parameters such as the font (i.e. typeface, size, colour), the size of the images (e.g. the size of the output video, to optimize for a particular playback device), and/or the background image on which the text is written. Other user preference parameters include the duration for showing images extracted from the source (e.g. from the data input signal DI), turning the processing of images on or off, the maximum number of stories to be processed, the location from which stories are retrieved (e.g. the URL of an RSS feed), various codec settings for the output format (e.g. MPEG-4, WMV, bit rates, etc.), various pause time settings (e.g. between stories, between headlines, between images, etc.), etc. The user preference settings can also include audio-related parameters that are sent to the background music or audio decoder 5, where the volume level is set as desired by the user.
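One possible, purely illustrative layout for the template signal TPL and the related user preference settings is sketched below; the field names and default values are assumptions for the example and are not a defined interface of the system 10.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical layout for the template signal TPL and user preferences.
# Field names and defaults are illustrative assumptions only.

@dataclass
class Template:
    font_face: str = "sans-serif"
    font_size: int = 24
    font_colour: str = "#000000"
    image_width: int = 320                    # output video size for the playback device
    image_height: int = 240
    background_image: Optional[str] = None    # image on which the text is written

@dataclass
class UserPreferences:
    template: Template = field(default_factory=Template)
    image_duration_s: float = 5.0             # how long source images are shown
    process_images: bool = True               # turn processing of images on or off
    max_stories: int = 10
    feed_url: str = "https://example.com/feed.rss"   # hypothetical RSS location
    codec: str = "MPEG-4"
    bit_rate_kbps: int = 512
    pause_between_stories_s: float = 1.0
    background_music_volume: float = 0.5      # sent to the audio decoder 5

prefs = UserPreferences()
print(prefs.template.font_size, prefs.codec, prefs.feed_url)
```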
In the following sections, the operation of the system 10 shown in Fig. 1 will be described in conjunction with the flow charts shown in Figs. 3a through 3f. Only the main steps will be described and, where necessary, intermediate steps are briefly addressed.
As an example, the flow chart 100 describes the operations involved in a process for creating stories for the system 10. These operations in the flowchart 100, which are performed automatically after the user starts the system 10, include:
• Extracting a story from the data source (e.g. the Internet, or storage medium);
• Making an "Introduction" page in the Make "Introduction" operation in step 210, where some introductory information about the contents available and the author/creator of the contents are displayed, based on the user' s installation inputs; • Making a "Story" page in operation 220, which comprises sub-operations for making one or more pages, including the operation for inputting the "Story Title" as in step 230 (i.e. Make "Story Title" operation), where user preferences are set and selections of background music to be reproduced after the texts have been spoken, etc. are made. These settings and selections may be predefined by the user during installation of the system 10. The user may select images and add them to the story, and this will effect the "make image" operation in step 250; • Making "Pages" in the Make "Pages" operation in step 260, the number of pages to be made being dependent on the selection of the font size of the displayed texts, and the number of pages needed to display one complete story will be determined by the system 10;
• Adding other stories, for which purpose the Make "Story" operation in step 220 and the Make "Pages" operation in step 260 will be repeated until the last story is made.
It should be noted that if no user input is received, the system 10 may invoke a default page or predetermined number of pages that is/are predefined as default in the system 10. The default can be information extracted from the data input signal DI (e.g. title, description, images, etc., from the RSS file).
In this embodiment, one or more extracted stories are processed by the extractor 2, as shown by the operation in step 200. The extracted stories may be received from the data input sources DI, 1A and 1B, as described hereinbefore in relation to Fig. 1. When the stories (i.e. the multimedia content) from the data input have been extracted, the system 10 first makes an introduction page (i.e. invokes the Make "Introduction" operation), as shown in step 210. The sub-operations of step 210 are shown in Fig. 3c and will be explained in more detail hereinafter. It should be noted that during this operation the user may also define or modify some user preference settings for the display format and/or set or select the background music to be reproduced when pages containing non-textual information are displayed.
The next operation, the Make "Story" operation, is shown in step 220. The sub-operations of step 220 are shown in Fig. 3b and will be explained in more detail hereinafter. Each time the operation in step 220 is completed, a check is made to see whether the next story has to be made, as shown in step 270. If another story has to be made, the Make "Story" operation in step 220 is repeated. When no further story has to be made, the method according to the invention terminates in step 280, and the converters 3, 4 and decoders 5, 6 are designed to stop reproducing audio and video outputs (i.e. the outputs coupled to the formatter 7, which include the uncompressed first image data IMD-1, first audio data AS-1, second image data IMD-2 and/or second audio data AS-2). It has to be noted that the system 10 may be configured by the user to restart from the first story after the last story has been reproduced (not shown in the flowchart). The above description gives a brief overview of the operations involved, from extracting the stories, making the "Introduction" (see the sub-operations of step 210 in Fig. 3c) and making the "Story" (see the sub-operations of step 220 in Fig. 3b), up to the operation of stopping the audio and video outputs when all stories have been made. The sub-operations for making the "Introduction" page in step 210 and for making the story in step 220 will be explained in more detail hereinafter.
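For illustration, the top-level flow of Fig. 3a (steps 200 to 280) can be sketched as follows, under the assumption that each operation can be modelled as a simple function; the function names merely mirror the step labels and are not part of the disclosed system.

```python
# The function names mirror the step labels of Fig. 3a; they are illustrative only.

def extract_stories(source):
    """Step 200: extract one or more stories from the data input."""
    return list(source)

def make_introduction():
    """Step 210: make the "Introduction" page (sub-operations of Fig. 3c)."""
    print("make introduction page")

def make_story(story):
    """Step 220: make a "Story" (sub-operations of Fig. 3b)."""
    print(f'make story: {story["title"]}')

def run(source):
    stories = extract_stories(source)      # step 200
    make_introduction()                    # step 210
    for story in stories:                  # step 270: another story to make?
        make_story(story)                  # step 220
    print("stop audio and video outputs")  # step 280

run([{"title": "Football season on now"}, {"title": "Second story"}])
```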
For making the "Introduction" page, the sub-operations in step 210 include further operations where the extractor 2 (or parser) sends the "Introduction" texts (i.e. textual signal TS) to the text-to-image engine 4 and text-to-speech engine 3, as shown in step 211 and step 212, respectively. The extractor 2 also sends the compressed visual information CVI of the "Introduction" page to the decoder 6, as shown in step 213. Upon receiving the converted outputs of the signal "Introduction" from the converters 3 and 4 and decoder 6, the formatter 7 takes the outputs of the text-to-image engine 4, the text-to-speech engine 3, and the decoder 6 and combine them to output a combined media stream signal CMS (i.e. a formatted AV stream), as shown in step 214. In this section, the operations in step 220 for making "Story" are described in more detail. This operation includes making the story "Title" as shown in step 230, and checking whether the story contains picture or image information, as shown in step 240. The sub-operations of step 230 are shown in Fig. 3d, which will be explained in more detail hereinafter. If the story "Title" contains picture or image information, then the operation of making "Image" as shown in step 250 will take place. The sub-operations of step 250 are shown in Fig. 3e, which will be explained in more detail hereinafter. If the story "Title" does not contain a picture or an image, then the operation proceeds to making "Page", as shown in step 260, of which the sub-operations are shown in Fig. 3f, which will be explained in more detail hereinafter. The sub-operations of step 230 for making the introduction page "Story Title," as shown in Fig. 3d, includes further operations where the extractor 2 sends "Story" texts (i.e. textual signal TS) to the text-to-image engine 4 and text-to-speech engine 3, as shown in step 231 and step 232, respectively. The extractor 2 also sends the compressed visual information CVI of the "Story Title" page to the decoder 6, as shown in step 233. Upon receiving the converted outputs of the signal "Story Title" from the converters 3 and 4 and decoder 6, the formatter 7 takes the outputs of the text-to-image engine 4, the text-to-speech engine 3, and the decoder 6, combines them and outputs a combined media stream signal CMS (i.e. a formatted AV stream), as shown in step 234.
The sub-operations of step 250 for making the "Image", as shown in Fig. 3e, include further operations in which the parser or extractor 2 sends the "Image" data (i.e. the compressed visual signals CVI) to the decoder 6, as in step 251. The decoder 6 sends the signal "Image", represented by the first image data IMD-1, to the formatter 7, as shown in step 252. Upon receiving the signal "Image", the formatter 7 takes the output of the decoder 6 and combines it with the outputs of the text-to-image engine 4 and the text-to-speech engine 3 to output a combined media stream signal CMS for a duration that is predetermined by the user; this operation is shown in step 253. It should be noted that when the "Image" data do not contain textual data, the outputs of the text-to-image engine 4 and the text-to-speech engine 3 may be represented by signals of zero amplitude, or other default levels.
In a variant of this embodiment, where the system 10 also includes a selector 9 for enabling a user-selectable preference (i.e. background music and/or display format), the combined media stream signal CMS will include information relating to the user preference for the corresponding background music data BM and image template TPL. These preferences are reflected in the output of the background music or audio decoder 5, which represents the second audio data AS-2, and in the output of the text-to-image engine 4, both of which are coupled to inputs of the formatter 7.
The sub-operations of step 260 for making the "Page", as shown in Fig. 3f, include further operations in which the parser or extractor 2 takes the complete story texts and sends them to the text-to-speech engine 3, as shown in step 261. The extractor 2 also splits the story texts into parts that fit on a single image page, as shown in step 262. Each part of the textual signal of the story texts fits into one image page before it is sent to the text-to-image engine 4, as shown in step 263. The fitting of story texts into a single page is determined automatically by the system 10, based on parameters that include the size of the displayed image page, the font type and the font size; this will not be described in detail as it is known to a person skilled in the art.
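The splitting of step 262 can be illustrated by the sketch below, which approximates "fits on a single image page" with a character budget derived from assumed page and font dimensions; the actual system 10 is described as taking the rendered page size, font type and font size into account.

```python
# The character-budget heuristic below is an assumption made for this example;
# the system 10 is described as taking the displayed page size, font type and
# font size into account.

def chars_per_page(page_width_px, page_height_px, font_size_px):
    """Rough estimate of how many characters fit on one image page."""
    cols = page_width_px // max(1, font_size_px // 2)          # glyphs ~ half as wide as tall
    rows = page_height_px // max(1, int(font_size_px * 1.4))   # assumed line spacing
    return cols * rows

def split_into_pages(text, capacity):
    """Split text into page-sized parts on word boundaries (step 262)."""
    pages, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) > capacity and current:
            pages.append(current)
            current = word
        else:
            current = candidate
    if current:
        pages.append(current)
    return pages

capacity = chars_per_page(320, 240, font_size_px=24)
story_text = "Football season on now. " * 30
print(len(split_into_pages(story_text, capacity)), "pages")
```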
Upon receiving the signal "Text", along with a part of the image of story texts that fits into one image page, the formatter 7 takes these outputs of the text-to-speech engine 3 and text-to-image engine 4, respectively, and the output of the decoder 6 and combines them to output a combined media stream signal CMS, as shown in step 264. The formatter 7 continues to receive and output the combined media stream signal CMS until the Text-To- Speech Event signal TTSE indicates that the corresponding last word in the text page has been received, which indicates that the whole story text has been processed.
In the next operation, as shown in step 265, the system 10 checks whether the whole story text has been processed, i.e. whether there is no more data input signal DI, 1A or 1B. If not, the operation is repeated from step 262. If the whole story text has been processed, the operation proceeds to step 266, which continues with the operation in step 270, as shown in Fig. 3a, where a check is made to determine whether all stories have been made. If not all stories have been made, the operations following step 270 are repeated from step 220, as explained above. The operation terminates with step 280, as described above, when all stories have been made.
In another embodiment, the system 10 is configured as a portable device 20, as shown in Fig. 1, by designing the formatter 7 to include an output for providing a video data signal VDS (e.g. LVDS) for driving a display means 21 and another output for providing an audio output AO to an audio driver for driving a loudspeaker or earphone 22. In this embodiment, the formatter 7 also has a storage means 23 to store the combined media stream signal CMS. The portable device 20 may include one or more input ports 12 for receiving data input signals from the Internet 1A (e.g. RSS feeds) or from a memory medium 1B that includes data (e.g. texts and JPEG images).
It should be noted that the portable device 20 can also provide the output for delivering the media stream file MSF, as described in the first embodiment of the system 10, so that it can be connected to another device or computer, such as the media player 11, for receiving the combined media stream signal CMS.
In a variant of this embodiment, the portable device 20 can be designed without the output of the media stream file MSF. In this case, this embodiment serves only as a portable device.
In yet another embodiment, the method described in the above sections could be used by service providers to provide the user with flexible means of formatting multimedia content according to the user-selectable preference via a website subscription service.
In another embodiment, a computer program product comprising executable code for performing the method described above could be stored on a computer-readable medium for use with the conventional media player 11.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. Use of the article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and/or by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims

CLAIMS:
1. A system 10 comprising:
- extraction means 2 for extracting multimedia content from at least one data input signal, wherein the extracted multimedia content includes at least one data information type;
- converting means 3, 4, 6 for converting the at least one data information type of the multimedia content to at least one converted data format; and
- formatting means 7 for formatting the multimedia content and the at least one converted data format into a combined (media stream) signal CMS.
2. A system 10 as claimed in claim 1, wherein the converting means include a text-to-speech engine 3 designed for converting textual content of the at least one data information type to an audio signal; or a text-to-image engine 4 designed for converting textual content of the at least one data information type to an image-type signal.
3. A system 10 as claimed in claim 1, wherein the converting means include a combination of a text-to-speech engine 3 designed for converting a first textual information to an audio signal, and a text-to-image engine 4 designed for converting a first or second textual information to an image signal.
4. A system 10 as claimed in claim 2 or 3, wherein the text-to-speech engine 3 is designed for generating an event synchronisation signal TTSE for synchronising the start and end of the first audio signal AS-1 with the corresponding portion of multimedia content.
5. A system 10 as claimed in claim 1, further comprising: a selector 9 for enabling at least one user-selectable preference.
6. A system 10 as claimed in claim 5, wherein the user-selectable preference includes: background music; and/or a display format.
7. A method of receiving and outputting information, comprising the steps of:
- extracting multimedia content from at least one data input signal, wherein the extracted multimedia content includes at least one data information type;
- converting the at least one data information type of the multimedia content to at least one converted data format; and
- formatting the multimedia content and the at least one converted data format into a combined signal CMS.
8. A method as claimed in claim 7, further comprising the step of:
- generating an event synchronisation signal TTSE for synchronising a start and an end of the at least one converted data format with a corresponding portion of the multimedia content.
9. A method as claimed in claim 7 or 8, further comprising the step of:
- enabling at least one user-selectable preference.
10. A method as claimed in claim 9, wherein the enabling step includes minimising the background music and/or the formatting of the image signal in accordance with the display format.
11. A portable device 20 comprising:
- an input port arranged to receive a combined information signal that includes a first data signal and a second data signal, wherein the second data signal is a converted form of the first data signal;
- a selector 9 arranged to configure the portable device to output the first data signal, the second data signal or both; and
- a controller arranged to enable, based upon a selection signal from the selector 9, an audio output means 22 for reproducing an audio output signal based upon the combined information signal and a display means 21 for reproducing visual content based upon the combined information signal.
12. A portable device 20 as claimed in claim 11, wherein the combined information signal further includes a third data signal that is also a converted form of the first data signal.
13. A portable device 20 as claimed in claim 12, wherein the first data signal is a textual data signal TS, the second data signal is a speech data signal AS-1 converted from the textual signal TS, and the third data signal is an image data signal IMD-1 converted from the textual signal TS.
14. A portable device 20 comprising:
- an input port arranged to receive a combined information signal that includes a first data signal and a second data signal, wherein the second data signal is a converted form of the first data signal; and
- output means 21, 22 for simultaneously outputting the first and second data signals in synchronisation.
15. A portable device 20 as claimed in claim 14, wherein the combined information signal further includes a third data signal that is also a converted form of the first data signal.
16. A portable device 20 as claimed in claim 15, wherein the first data signal is a textual data signal TS, the second data signal is a speech data signal AS-1 converted from the textual signal TS, and the third data signal is an image data signal IMD-1 converted from the textual signal TS.
17. A computer product comprising executable codes for performing the steps of claim 7.
18. A computer readable medium, on which the computer program product according to claim 17 is stored.
PCT/IB2008/055114 2007-12-21 2008-12-05 Device and method for converting multimedia content using a text-to-speech engine WO2009083832A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP07123887 2007-12-21
EP07123887.7 2007-12-21

Publications (1)

Publication Number Publication Date
WO2009083832A1 true WO2009083832A1 (en) 2009-07-09

Family

ID=40433826

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2008/055114 WO2009083832A1 (en) 2007-12-21 2008-12-05 Device and method for converting multimedia content using a text-to-speech engine

Country Status (1)

Country Link
WO (1) WO2009083832A1 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0598597A1 (en) * 1992-11-18 1994-05-25 Canon Information Systems, Inc. Method and apparatus for scripting a text-to-speech-based multimedia presentation
EP1001627A1 (en) * 1998-05-28 2000-05-17 Kabushiki Kaisha Toshiba Digital broadcasting system and terminal therefor
US6970602B1 (en) * 1998-10-06 2005-11-29 International Business Machines Corporation Method and apparatus for transcoding multimedia using content analysis
WO2001079986A2 (en) * 2000-04-19 2001-10-25 Roundpoint Inc. Electronic browser
US20030110297A1 (en) * 2001-12-12 2003-06-12 Tabatabai Ali J. Transforming multimedia data for delivery to multiple heterogeneous devices
JP2005309173A (en) * 2004-04-23 2005-11-04 Nippon Hoso Kyokai <Nhk> Speech synthesis controller, method thereof and program thereof, and data generating device for speech synthesis
WO2006081482A2 (en) * 2005-01-26 2006-08-03 Hansen Kim D Apparatus, system, and method for digitally presenting the contents of a printed publication
KR20060115162A (en) * 2005-05-04 2006-11-08 하승준 Web-service operating method of member web-page as member request using production editing of narrator multi-media data by 3d and tts module and system for the same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BURNETT I ET AL: "Universal multimedia experiences for tomorrow", IEEE SIGNAL PROCESSING MAGAZINE, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 20, no. 2, 1 March 2003 (2003-03-01), pages 63 - 73, XP011095793, ISSN: 1053-5888 *

Similar Documents

Publication Publication Date Title
US9875735B2 (en) System and method for synthetically generated speech describing media content
US20130204605A1 (en) System for translating spoken language into sign language for the deaf
JP4127668B2 (en) Information processing apparatus, information processing method, and program
US8732783B2 (en) Apparatus and method for providing additional information using extension subtitles file
US20050180462A1 (en) Apparatus and method for reproducing ancillary data in synchronization with an audio signal
JP4970903B2 (en) Multimedia content playback method and apparatus
JP2008299032A (en) Linguistic training aid, and character data regenerator
US20020055088A1 (en) Toggle-tongue language education method and apparatus
CN101527153B (en) Method of synchronously displaying asynchronous transmitted text and audio and video data on mobile terminal
Neves A world of change in a changing world
US7933671B2 (en) Data outputting device, data outputting method, data outputting program, and recording medium
WO2009083832A1 (en) Device and method for converting multimedia content using a text-to-speech engine
JP2008294722A (en) Motion picture reproducing apparatus and motion picture reproducing method
US11665392B2 (en) Methods and systems for selective playback and attenuation of audio based on user preference
JP7179387B1 (en) HIGHLIGHT MOVIE GENERATION SYSTEM, HIGHLIGHT MOVIE GENERATION METHOD, AND PROGRAM
JP2002197488A (en) Device and method for generating lip-synchronization data, information storage medium and manufacturing method of the information storage medium
KR100705901B1 (en) Mobile Device And Television Receiver Based On Text To Voice Converter
Spina Transcripts
JPH11145918A (en) Data broadcast transmission system, data broadcast reception system and data broadcast system
JP2002300434A (en) Program transmission system and device thereof
KR100641214B1 (en) Method for serving karaoke contents in a mobile terminal
KR20080086793A (en) Audio data reproducing mobile device
JP2004226701A (en) Marking apparatus and method, and data outputting apparatus and method
JP2007334365A (en) Information processor, information processing method, and information processing program
JP2000339925A (en) Method and apparatus for forming contents and for reproducing contents, and memory medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08868766

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08868766

Country of ref document: EP

Kind code of ref document: A1