WO2003063133A1

WO2003063133A1 - Personalisation of the acoustic presentation of messages synthesised in a terminal

Info

Publication number: WO2003063133A1
Application number: PCT/FR2002/003984
Authority: WO
Inventors: Ghislain Moncomble; Philippe Passelaigue; Jean-Pierre Remy
Original assignee: France Telecom
Priority date: 2002-01-23
Filing date: 2002-11-21
Publication date: 2003-07-31
Also published as: WO2003063133A8; FR2835087B1; FR2835087A1

Abstract

The invention relates to the personalisation of the acoustic presentation of messages synthesised in a terminal (1), whereby acoustic characteristics (CV) which describe a voice (V) are selected from a catalogue of acoustic characteristics pre-recorded on a server (2) for transmission to a voice synthesiser (3). A text message (MT) which can be selected on the terminal is synthesised in the synthesiser, based on the selected acoustic characteristics, as a voice message (MS) which is transmitted to the terminal for listening. At least one noise (B) can be selected on the server for mixing with the voice message.

Description

Personalization of the audio presentation of synthesized messages in a terminal

The present invention relates to the audio presentation of messages in a terminal. The messages are initially textual and are then converted to speech by a voice synthesis means internal or external to the terminal.

Currently, interactive voice servers, or any other voice synthesis means accessible via a server, broadcast messages which result from the voice synthesis of text messages on the basis of artificial or natural voice models. Apart from pre-filtering and adjustment of the bass and treble in the audio means of most terminals, such as television receivers, radio receivers, or personal terminals such as computers, digital assistants, telephones or radiotelephones, the users of these terminals who consult the voice servers all listen to the same voices for the broadcast of the synthesized messages, without any personal influence on them. Because a given voice server broadcasts its messages with a single voice, the messages are not always well perceived by some users.

Furthermore, US patents 5,860,064 and 6,006,187 propose a selection of speech emotion parameters in an integrated speech synthesis system linked to the graphical user interface in a personal computer. These parameters are mainly directed to the pitch, the volume and the speech rate and apply to a voice message, or to selected parts of a voice message.

However, the voices thus constructed by the user are accessible only to that user. In addition, the means used to build the voice must be reproduced on each user's computer.

The present invention aims to remedy the drawbacks of “mono-voice” servers of the prior art so that each user personalizes the acoustic context of the voice messages broadcast by the servers, and thus makes the broadcast voice messages more intelligible and perceptible, and therefore more familiar, while improving the distribution of the means used and the personalization of voice messages compared to the two aforementioned US patents.

To this end, a method for personalizing the sound presentation of messages synthesized in a terminal is characterized in that it comprises the steps of selecting acoustic characteristics describing a voice in a first catalog of acoustic characteristics pre-stored in a server means, selecting acoustic characteristics describing a sound effect in a second catalog of acoustic characteristics stored in the server means in order to transmit them, with the selected acoustic characteristics describing the voice, to a voice synthesis means, and synthesizing a text message in the voice synthesis means, in dependence on the selected acoustic characteristics describing the voice, into a voice message which is mixed with the selected sound effects to be transmitted to the terminal.

Thus, the invention distributes the means implemented for the method between the terminal and the server. It improves the personalization of messages by selecting the acoustic characteristics of sound effects superimposed on the synthesized voice.

Instead of selecting acoustic characteristics to describe and thus compose a voice and a sound effect, the terminal user can directly select a voice in the first catalog and a sound effect in the second catalog, or even select a combination which is stored in a third catalog at least in the terminal and which includes at least one voice and at least one sound effect in order to synthesize any text message depending on the acoustic characteristics of the combination.

The selection of voices and sound effects is preferably accompanied by a selection of characteristics of a visual presentation, which can be a wallpaper and/or a facial animation, in a fourth catalog pre-stored in the server means in order to transmit the visual presentation to the terminal and display it in the terminal in synchronism with the reproduction of the voice message in the terminal.

The invention also relates to a system for personalizing the sound presentation of messages in a terminal, for implementing the method of the invention. The system is characterized in that it includes server means for storing acoustic characteristics describing voices, and also acoustic characteristics describing sound effects to be selectively mixed with described voices, a voice synthesis means in which voices are described depending on acoustic characteristics, and an application means in the terminal for selecting in the server means acoustic characteristics describing a voice and acoustic characteristics describing a sound effect so that the voice synthesis means synthesizes at least one text message, according to the selected acoustic characteristics describing the voice, into a voice message mixed with the sound effects described and transmitted to the terminal.

Other characteristics and advantages of the present invention will appear more clearly on reading the following description of several preferred embodiments of the invention with reference to the corresponding appended drawings in which:

- Figure 1 is a schematic block diagram of a system for the audio presentation of messages according to a preferred embodiment of the invention; and

- Figure 2 is an algorithm of a method of audio presentation of messages according to the invention.

With reference to FIG. 1, a sound presentation system for messages according to the invention essentially comprises a user terminal 1 provided at least with a loudspeaker or an earpiece, a central sound server 2 and voice synthesis equipment 3. The system is based on an architecture of the client-server type between the terminal 1 associated with the equipment 3 and the central sound server 2.

For example, the user terminal 1 is a personal computer or a personal digital assistant, or a smart television or radio receiver, or a fixed or mobile telephone. The voice synthesis equipment 3 is connected to the terminal 1 by a conventional link 4 of the wired or short-range radio type. As a variant, the equipment 3 is removably integrated, like a card, in the terminal 1. In addition, the terminal 1 is connected to the central server 2 via an access network 5 corresponding to the type of terminal and a packet network 6 such as the internet.

The voice synthesis equipment 3 essentially comprises a buffer memory 30 for storing a text message MT to be synthesized, and preferably at least one test text TE to be synthesized, an analyzer 31 of the phonetics and of the prosody of the text to be synthesized, a speech synthesizer 32 proper, and a generator 33 generating an acoustic model as a function of acoustic characteristics CA delivered by the terminal 1 and supplied by the central server 2. The functional elements 30 to 33 schematically represent the speech synthesis equipment for a better understanding of the invention and may correspond to software modules.

The invention relates more particularly to the third module 33, which defines an acoustic model in dependence on the parameters, in particular the values, of the characteristics CA of a sound, such as a voice mixed with a sound effect, in order to apply to this model predetermined rules for synthesizing a text message MT, transcribed phonetically and prosodically in the analyzer 31, into a synthesized voice message MS transmitted by the synthesizer 32 to the terminal 1. According to another variant, the characteristics CA received in the generator 33 make it possible to select acoustic units which are concatenated according to predetermined rules in the synthesizer 32 in order to reproduce vocally a message analyzed phonetically and prosodically in the analyzer 31. Whatever the type of speech synthesis implemented in the synthesizer 32, the speech synthesis is defined as a function of the acoustic characteristics CA processed in the module 33 and selected by the terminal 1.

As will be seen in the following description with reference to FIG. 2, the terminal 1 supports a sound presentation personalization application 10 for selecting the acoustic characteristics CA of a sound to be composed, or for selecting a print of a sound described by acoustic characteristics, in the central server 2.

The text messages MT to be temporarily stored in the memory 30 in order to be synthesized are provided by the user of the terminal 1, either by entering them with the keyboard or by voice recognition in the terminal, or by reading them from the memory of the terminal if they have been prerecorded there, or by downloading them from document servers through the networks 5 and 6. Preferably, the test message TE contained in the memory 30 is preselected by the user of the terminal 1, and is therefore known to the user, in order to test a voice configured by the user and modeled in the voice synthesis equipment 3.
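The division of the equipment 3 into the elements 30 to 33 can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class and field names (`AcousticModel`, `SynthesisEquipment`, `pitch_hz`, `rate`) are hypothetical, the "analysis" is a plain word split rather than a real phonetic and prosodic analysis, and no audio is produced.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AcousticModel:
    """Hypothetical stand-in for the model produced by the generator 33."""
    pitch_hz: float
    rate: float  # relative speech rate, 1.0 = normal (assumption)


@dataclass
class SynthesisEquipment:
    """Toy model of equipment 3: buffer 30, analyzer 31, synthesizer 32, generator 33."""
    buffer: List[str] = field(default_factory=list)  # buffer memory 30
    model: Optional[AcousticModel] = None

    def generate_model(self, ca: dict) -> None:
        # Generator 33: build an acoustic model from the selected characteristics CA.
        self.model = AcousticModel(pitch_hz=ca["pitch_hz"], rate=ca["rate"])

    def store(self, text: str) -> None:
        # Buffer memory 30: hold a text message MT until it is synthesized.
        self.buffer.append(text)

    def analyze(self, text: str) -> List[str]:
        # Analyzer 31: placeholder analysis, here just lowercase word units.
        return text.lower().split()

    def synthesize(self) -> List[Tuple[str, float, float]]:
        # Synthesizer 32: pair each unit with the current model parameters.
        if self.model is None:
            raise RuntimeError("no acoustic model has been generated yet")
        text = self.buffer.pop(0)
        return [(unit, self.model.pitch_hz, self.model.rate)
                for unit in self.analyze(text)]
```

The sketch keeps the model generation separate from the synthesis call, mirroring the statement that the synthesis depends on characteristics CA processed in the module 33 whatever synthesis technique is used in the synthesizer 32.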

According to another variant of the system architecture, at least one voice synthesis equipment 3 is integrated into the central server 2 and shared by several user terminals 1, for which user identifiers IDU are respectively associated with test messages MT. The central server 2 is then analogous to an interactive voice server in which synthesized voices and their characteristics can be selected by users to listen to voice or multimedia messages. In this variant, the exchanges, in particular of acoustic characteristic commands and of voice message broadcasts, are carried out through the access network 5 and the packet network 6 between the terminal 1 and the server 2, and not, as in FIG. 1, also between the server 2 and the equipment 3 via the terminal 1.

According to another variant, the central sound server 2 is distributed into several central servers in each of which one or more catalogs of files defined below can be consulted.

The central server 2 essentially comprises three catalogs of sound files V (CV, AV), B (CB, AB) and C (CC, AC) from which a terminal user can draw to personalize the sound presentation of the voice messages reproduced in his terminal. All these files can be selected from the terminal 1 using the application 10. The first catalog of files V (CV, AV) relates to voices whose voice prints have been recorded and analyzed in order to store the essential acoustic characteristics CV of those voices. For example, the acoustic characteristics describing a predetermined voice V and contained in a file of the first catalog concern the male or female sex; the age, in the form of a period corresponding to childhood, adolescence, adulthood or old age; prosodic characteristics such as the successive lengths of syllabic segments, with particular emphasis on the accentuation of sentence components; the laryngeal and fundamental frequencies relating to the pitch of the voice; the speech rate or rhythm, which can be slow, fast or intermediate; the sound level expressed in decibels; etc.
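As an illustration of what a file of the first catalog might hold, the following Python sketch models some of the characteristics CV listed above as a small validated record. The field names, the enumerations and the validation rules are assumptions made for the example; the description does not specify any file format.

```python
from dataclasses import dataclass
from enum import Enum


class AgeGroup(Enum):
    CHILDHOOD = "childhood"
    ADOLESCENCE = "adolescence"
    ADULTHOOD = "adulthood"
    OLD_AGE = "old age"


class SpeechRate(Enum):
    SLOW = "slow"
    INTERMEDIATE = "intermediate"
    FAST = "fast"


@dataclass(frozen=True)
class VoiceCharacteristics:
    """Hypothetical record for the characteristics CV of one voice V."""
    sex: str                 # "male" or "female"
    age_group: AgeGroup      # period of life
    pitch_hz: float          # fundamental frequency (pitch)
    speech_rate: SpeechRate  # slow, intermediate or fast
    level_db: float          # sound level in decibels

    def __post_init__(self) -> None:
        # Reject values outside the ranges named in the description.
        if self.sex not in ("male", "female"):
            raise ValueError("sex must be 'male' or 'female'")
        if self.pitch_hz <= 0:
            raise ValueError("pitch_hz must be positive")
```

Prosodic characteristics such as segment lengths and accentuation are left out of this sketch because the description gives no concrete representation for them.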

The file of a predetermined voice V also contains attributes AV specific to each voice, which are optional and which concern the owner of the voice, such as a user, or a company or an organization as a collective user; and/or restrictions on access to the voice file, so that it can be distributed to and used by predetermined users who are identified by identifiers IDU entered in a list UV of users authorized to use the predetermined voice, or characteristics defining a user profile that a user must present to access the voice; and/or a fee for the use of the voice, which may be free; and any other feature contributing to the marketing of the predetermined voice.
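The access restriction based on the list UV can be sketched as a simple membership check. This is a hedged illustration: the names `VoiceAttributes` and `may_use` are hypothetical, and the rule that an empty list means the voice is public is an assumption of the example, not something the description states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceAttributes:
    """Hypothetical record for the optional attributes AV of a voice."""
    owner: str
    authorized_users: frozenset = frozenset()  # list UV of identifiers IDU
    fee: float = 0.0                           # 0.0 means free use


def may_use(av: VoiceAttributes, idu: str) -> bool:
    """Return True if the user identified by idu may use the voice.

    Assumption: a voice with no authorized-user list is available to everyone.
    """
    if not av.authorized_users:
        return True
    return idu in av.authorized_users
```

A profile-based restriction, which the description also mentions, would need a comparison between user profile fields and required values and is not covered here.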

The second catalog of files B (CB, AB) relates to sound effects B, which can be sound effects, special sounds or musical pieces, one or more of which can be selected by the user of the terminal 1 in order to be superimposed on the selected voice with which a text message is synthesized. Each sound effect B is defined, like the voices in the first catalog, by acoustic characteristics CB and, where appropriate, is associated with attributes AB and with a list UB of authorized users. For practical reasons of downloading, the different sound effects in the second catalog are preferably made up of small sound files that are chained and looped. One or more chained sound effects can be downloaded to the terminal.

The third catalog of files C (CC, AC) relates to files of sound combinations, each of which results from acoustic characteristics CC combining the acoustic characteristics CV and CB of at least one voice V and at least one sound effect B, or of several combinations of voices and sound effects distributed over time. Each combination is thus defined by acoustic characteristics CC and associated with attributes AC and a list UC of authorized users.
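The idea of superimposing a short, looped sound effect on a longer voice message can be sketched as follows. The function name `mix_with_loop`, the representation of audio as lists of float samples and the `effect_gain` parameter are assumptions of this example; the description does not specify how the mixing is performed.

```python
from itertools import cycle, islice
from typing import List, Sequence


def mix_with_loop(voice: Sequence[float],
                  effect_loop: Sequence[float],
                  effect_gain: float = 0.3) -> List[float]:
    """Superimpose a short looped effect on a voice signal, sample by sample.

    The effect is repeated as often as needed to cover the whole voice signal.
    No clipping or normalization is applied in this sketch.
    """
    if not effect_loop:
        return list(voice)
    looped = islice(cycle(effect_loop), len(voice))
    return [v + effect_gain * e for v, e in zip(voice, looped)]
```

Because only the short loop needs to be stored or downloaded, the effect file stays small regardless of the length of the synthesized message, which is consistent with the practical download concern mentioned above.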

A terminal user can thus define an audio program which is divided into various periods during which combinations of voice and sound effects will respectively personalize portions of a text message MT to be synthesized. The attributes of a combination define in particular the lengths of periods for respective sound combinations, as well as the start time of these periods relative to the start time of a message.
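A minimal sketch of such an audio program is given below. It assumes that each period is described by a start time and a duration in seconds relative to the start of the message, and that a lookup returns the combination active at a given instant; the names `Period` and `combination_at` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Period:
    """One period of an audio program, relative to the start of a message."""
    start_s: float      # start time, in seconds after the message begins
    duration_s: float   # length of the period, in seconds
    combination: str    # name of the sound combination C used in the period


def combination_at(program: Sequence[Period], t: float) -> Optional[str]:
    """Return the combination active at time t, or None if no period covers t."""
    for period in program:
        if period.start_s <= t < period.start_s + period.duration_s:
            return period.combination
    return None
```

For example, a program of two consecutive periods assigns the first combination to the opening seconds of a message and the second to the remainder, and returns `None` once the program has ended.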

Preferably, the central sound server 2 comprises a fourth catalog PV (CPV, APV) relating to visual presentations PV of text messages to be synthesized, each defined by characteristics CPV and attributes APV. The characteristics CPV of a visual presentation relate to a wallpaper, or to more or less animated images, or more particularly to the face of an animator's head, of which at least the eyes and mouth are animated according to the pronunciation of the synthesized voice message by means of a facial animation engine implemented in the user terminal. The whole animator's head, or elements such as eyes and mouths, can be chosen from the fourth catalog. As in the previous catalogs, visual presentation attributes APV define the owner of the visual presentation PV, access restrictions, for example associated with a list UPV of authorized users, or a fee relating to the selected visual presentation.

As a variant, the four catalogs defined above are distributed over respective servers instead of being centralized in a single server 2, and/or are offered in variants, for example by geographic region, in particular in order to offer voices and sound effects adapted to regional or local customs and to reduce response time.

Reference is now made to FIG. 2 to describe the main steps E1 to E14 of the method for personalizing the sound presentation of synthesized messages MS from the user terminal 1. The terminal 1 is designated by an identifier IDU which may include a telephone number or an Internet Protocol (IP) address accompanied, where appropriate, by a confidential access code.

Depending on the type of terminal, the commands relating to the selections made during the method correspond to the pressing of a key on the terminal keyboard, for example translated into a DTMF (Dual Tone MultiFrequency) code for a telephone or radiotelephone, or into a command specific to a communication protocol or graphical user interface, or else correspond to a voice command recognized by a voice recognition means included in the terminal 1 and/or the server 2. The various selections are preferably assisted by pages displayed in the terminal, when the latter has a display or screen, the dialogue between the terminal 1 and the server 2 thus being carried out in a known manner. As a variant, the dialogue between the terminal 1 and the central server 2 is carried out by means of an interactive voice server. The catalogs consulted in the server 2 by the terminal 1 are presented as a tree, that is to say by means of successive menus and submenus with a return to a main menu. Depending on the personalization application 10 implemented in the terminal 1, the user selects the acoustic characteristics of at least one voice V and of a sound effect B and/or of a sound combination C either directly in the central server 2, or after downloading to the terminal 1 the part of the catalogs relating to files that are made available to the public or to which this user is authorized access.

FIG. 2 illustrates a preferred example of steps making up the personalization method according to the invention, for both of the above variants.
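For the telephone case, the translation of a key press into a DTMF code can be illustrated with the standard DTMF keypad layout, in which each key is signalled by one low-group and one high-group frequency. The function name `dtmf_frequencies` is hypothetical, and the sketch only looks up the frequency pair; it does not generate or decode any tone.

```python
from typing import Tuple

# Standard DTMF layout: one low-group (row) and one high-group (column)
# frequency, in hertz, per key.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key: str) -> Tuple[int, int]:
    """Return the (low, high) frequency pair in Hz for a single keypad key."""
    if len(key) != 1:
        raise ValueError("key must be a single character")
    for row_index, row in enumerate(KEYPAD):
        col_index = row.find(key)
        if col_index != -1:
            return ROW_HZ[row_index], COL_HZ[col_index]
    raise ValueError(f"not a DTMF key: {key!r}")
```

In a menu-driven dialogue of the kind described above, a server would map each decoded key to a menu or submenu choice; that mapping is application-specific and is not shown here.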

In a conventional manner, in step E1, the user at the terminal 1 opens a session of the application 10 relating to the audio presentation of messages to be synthesized, so that the terminal 1 calls the central server 2, the IP address of which is stored in the terminal. If the application allows it, the user selects the catalogs, or sound categories in the catalogs, from a menu and downloads them from the server 2 into the terminal 1 in order to make various selections according to the following steps and build a customized sound presentation. Otherwise, the following selection steps are carried out through a question and answer dialogue between the terminal 1 and the server 2 which, as the user makes selections, constructs a specific sound presentation. This second variant will be referred to below.

In the next step E3, the application 10 asks the user whether he wishes to select at least one of his favorite combinations, determined and stored previously, possibly in association with the identifier IDU in the server 2, if such combinations exist in the third catalog. Otherwise, sound characteristics are selected by the user as described below in order to constitute a sound combination personalizing the voice presentation of the text messages MT to be synthesized by the voice synthesis equipment 3.

In the following step E4, the user of the terminal 1 selects acoustic characteristics CV, validating their parameters, so as to constitute a personalized voice V in the first catalog of voice files. As a variant, instead of selecting acoustic characteristics of a voice V, the user selects a voice V from among several authorized voices in the first catalog, each of which is designated by a name and a brief description of its acoustic characteristics. In step E4, the application 10 optionally offers the user the possibility of further personalizing the sound presentation of his messages by recording a predetermined voice print, for example a predetermined sentence spoken by the user.

In order subsequently to mix the voice defined by the selected characteristics CV, or the previously selected voice, with one or more sound effects B, the user selects in step E5, in a manner analogous to step E4, acoustic characteristics CB in the second catalog of sound effect files so as to determine one or more sound effects, or directly selects one or more authorized sound effects each defined by predetermined acoustic characteristics. After steps E4 and E5, the voice and sound effect characteristics selected directly or indirectly are transmitted to the acoustic model generator 33.

Then a test text TE is optionally selected in step E6 so that the selected test text serves as a test for speech synthesis in the synthesizer 32, depending on an acoustic model defined by the selected characteristics CV in the generator 33, before definitively validating the choice of acoustic characteristics of the combination of voice and sound effects CS = CV + CB selected in the preceding steps E4 and E5. The test text TE selected in step E6 can be a text, or a combination of texts, digitized and prerecorded in the terminal 1 or in the buffer memory 30 of the equipment 3, or entered directly by the user in the terminal 1, or downloaded from one or more document servers, at least textual ones, via the terminal 1 into the memory 30 of the equipment 3. The test text TE is preferably stored in the memory 30 of the equipment 3, in particular for test steps in subsequent sound presentations.

In the next step E7, the test text TE read in the memory 30 is analyzed by the analyzer 31 and synthesized in the synthesizer 32 as a function in particular of the acoustic model defined by the acoustic characteristics of voice CV selected in step E4 with mixing of sound effects B selected in step E5. The voice message MS produced by the synthesizer 32 is transmitted to the terminal 1 so that the user can listen to it.

If the user is not satisfied with the acoustic characteristics of the voice message produced, the application 10 proposes to him in step E8 to modify, that is to say to add or remove or correct an acoustic characteristic CV of the voice described V or CB of the sound effects described B selected in steps E4 and E5, returning to step E4. Finally, after possibly one or more repetitions of steps E4 to E8, the terminal 1 and the server 2 store the acoustic characteristics [CV + CB] of the selected combination CS in step E9, preferably by associating it with the IDU user identifier.
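The repetition of steps E4 to E8 until the user is satisfied, followed by the storage of step E9, can be sketched as a simple loop. This is an illustration only: the function `personalize`, its callback parameters and the `max_rounds` safety limit are assumptions of the example and do not appear in the description.

```python
from typing import Any, Callable, Dict, Tuple


def personalize(select_voice: Callable[[], Any],
                select_effects: Callable[[], Any],
                synthesize_test: Callable[[Any, Any], Any],
                is_satisfied: Callable[[Any], bool],
                max_rounds: int = 10) -> Tuple[Dict[str, Any], int]:
    """Repeat selection and test listening until the user accepts the result.

    Returns the accepted combination and the number of rounds it took.
    """
    for round_no in range(1, max_rounds + 1):
        cv = select_voice()            # step E4: voice characteristics CV
        cb = select_effects()          # step E5: sound effect characteristics CB
        ms = synthesize_test(cv, cb)   # steps E6 and E7: synthesize the test text
        if is_satisfied(ms):           # step E8: user accepts or asks to modify
            return {"CV": cv, "CB": cb}, round_no  # step E9: combination to store
    raise RuntimeError("no combination accepted within max_rounds")
```

The callbacks stand in for the dialogue between the terminal, the server and the synthesis equipment, so the same loop applies whether the selections are made directly on the server or on downloaded catalog parts.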

Optionally, in step E10, the application 10 offers the user a selection of characteristics CPV in the fourth catalog of visual presentation files in order to select, in step E101, a visual presentation such as a wallpaper and/or a facial animation.

In the next step E11, the terminal 1 and the server 2 store the visual presentation characteristics CPV possibly selected in step E101 in association with the combination of acoustic characteristics of voice and sound effects [CV + CB] selected in steps E4 to E8.

Then, in the following steps, in particular E121, E122 and E123, the application 10 invites the user of the terminal 1 to select one or more parameters, in particular temporal and/or documentary, personalizing the use of the sound combination CS composed in the preceding steps.

In step E121, the user indicates two broadcast dates for the selected combination CS. In practice, the user indicates a start date and/or an end date for the broadcast of the selected combination. One or more broadcast periods can thus be selected during which the selected combination is made accessible. By associating such periods with various selected combinations, audio programs are formed. The audiovisual programs created are preferably displayed in the terminal with their respective time positions. All preceding and following time data are expressed in year, month, day, hour, minute and second.
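A broadcast period with an optional start date and an optional end date can be checked as in the following sketch. The function name `is_on_air` and the treatment of a missing bound as "unbounded" are assumptions of the example; Python's `datetime` carries the year, month, day, hour, minute and second resolution mentioned above.

```python
from datetime import datetime
from typing import Optional


def is_on_air(now: datetime,
              start: Optional[datetime] = None,
              end: Optional[datetime] = None) -> bool:
    """Return True if `now` falls within the broadcast period.

    Assumption: a missing start or end date leaves that side of the period open.
    """
    if start is not None and now < start:
        return False
    if end is not None and now > end:
        return False
    return True
```

Several such periods, each tied to a different combination, would together form the kind of program described in this step.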

In step E122, the application 10 proposes that the user determine the start time at which the selected combination CS is introduced, as well as its duration, relative to the time at which listening to a voice message MS synthesized in the equipment 3 starts, in order to constitute an audio program with other selected combinations. Optionally, the start time and the broadcast duration of the selected combination CS are chosen randomly by the application 10. Alternatively, several combinations are selected to constitute a series of sound combinations which is repeated periodically.

In step E123, the application 10 offers the user the possibility of associating predetermined documents with the selected combination CS. Each of these documents is identified by an identifier which can be a name and/or an address such as a URL (Uniform Resource Locator) address read from a website server. The association of a document with one or more combinations can be made available to any user.
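The two options mentioned in this step, a randomly chosen start time and duration and a periodically repeated series of combinations, can be sketched as follows. The function names, the uniform random choice and the fixed slot length `slot_s` are assumptions of the example; the description does not say how the random values are drawn or how long each element of a series lasts.

```python
import random
from typing import Sequence, Tuple


def random_slot(message_duration_s: float, rng: random.Random) -> Tuple[float, float]:
    """Pick a random start time and duration that fit inside the message."""
    start = rng.uniform(0.0, message_duration_s)
    duration = rng.uniform(0.0, message_duration_s - start)
    return start, duration


def periodic_combination(series: Sequence[str], slot_s: float, t: float) -> str:
    """Return the combination active at time t when the series repeats forever.

    Assumption: every combination in the series lasts slot_s seconds.
    """
    if not series or slot_s <= 0:
        raise ValueError("series must be non-empty and slot_s positive")
    return series[int(t // slot_s) % len(series)]
```

Passing an explicit `random.Random` instance keeps the random choice reproducible when a seed is fixed, which is convenient for testing but is a design choice of this sketch only.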

After steps E121 and/or E122 and/or E123, or after a so-called “favorite” combination, already stored in the server 2 or the terminal 1 and accessible to the user, has been selected in step E3, the application 10 prompts the user of the terminal 1 to listen to a text message MT of his choice, which is transmitted to the voice synthesis equipment 3 that has received the acoustic characteristics CV and CB of the selected or favorite combination. The voice message MS resulting from the voice synthesis of the selected text message MT is listened to by the user simultaneously with a possible visual presentation, such as a facial animation, whose characteristics CPV were selected in step E10 and which is displayed in the terminal. Then, if the user wishes to modify acoustic and/or visual characteristics of the presentation of the synthesized message, as indicated in step E14, he again proceeds to the selection of acoustic characteristics of voice and/or sound effects, and possibly of visual presentation characteristics, from step E4. Otherwise, the session of the application 10, at least between the terminal 1 and the equipment 3, is terminated in step E15.

Claims

1 - Method for personalizing the sound presentation of messages synthesized in a terminal (1), characterized in that it comprises the steps of selecting (E4) acoustic characteristics (CV) describing a voice (V) in a first catalog of acoustic characteristics stored in a server means (2), selecting (E5) acoustic characteristics (CB) describing a sound effect (B) in a second catalog of acoustic characteristics stored in the server means (2) in order to transmit them, with the selected acoustic characteristics (CV) describing the voice (V), to a voice synthesis means (3), and synthesizing (E7) a text message (MT) in the voice synthesis means, depending on the selected acoustic characteristics describing the voice, into a voice message (MS) which is mixed with the selected sound effects to be transmitted to the terminal (1).

2 - Method according to claim 1, according to which the steps of selecting (E4, E5) acoustic characteristics describing a voice (V) and a sound effect (B) are replaced by a step of directly selecting a voice in the first catalog and sound effects in the second catalog.

3 - Method according to claim 1 or 2, comprising a step (E6) of selecting the text message (MT) to be synthesized.

4 - Method according to claim 3, according to which the text message (MT) to be synthesized is transmitted by the terminal (1) and stored in the voice synthesis means (3).

5 - Method according to claim 3, according to which the text message (MT) to be synthesized is selected in a text document server in order to download it via the terminal (1) into the voice synthesis means (3).

6 - Method according to any one of claims 1 to 5, comprising, after listening (E7) to the voice message (MS) transmitted by the voice synthesis means (3) to the terminal (1), a step (E8) of adding, removing or correcting an acoustic characteristic (CV, CB) to describe at least the voice.

7 - Method according to any one of claims 1 to 6, comprising a step (E3) of selecting at least one combination (C) stored in a third catalog at least in the terminal and comprising at least one voice (V) and at least one sound effect (B) in order to synthesize any text message (MT) depending on the acoustic characteristics of the combination.

8 - Method according to any one of claims 1 to 7, comprising a selection (E10, E101) of characteristics of a visual presentation (CPV) in a fourth catalog stored in the server means (2) in order to transmit them to the terminal (1) and to display the visual presentation in the terminal in synchronism with the reproduction of the voice message (MS) in the terminal.

9 - Method according to any one of claims 1 to 8, further comprising a step (E121, E122, E123) of selecting at least one of the following parameters personalizing the use of at least the voice (V) described by the selected acoustic characteristics: date and period of broadcast of the selected voice, instant of introduction and duration of the selected voice with respect to the instant of the start of a voice message synthesized in the voice synthesis means (3), identifier of documents to be associated with the voice, combination of sounds, including the selected voice, or of a series of sounds.

10 - Method according to any one of claims 1 to 9, comprising beforehand in the server means (2) a definition of attributes specific at least to voices and relating to ownership and/or an access restriction and/or a fee for the use of voices.

11 - System for personalizing the sound presentation of messages synthesized in a terminal (1), characterized in that it comprises server means (2) for storing acoustic characteristics (CV) describing voices (V) and acoustic characteristics (CB) describing sound effects (B) to be selectively mixed with described voices, a voice synthesis means (3) in which voices are described depending on acoustic characteristics, and an application means (10) in the terminal (1) for selecting in the server means (2) acoustic characteristics describing a voice and acoustic characteristics describing a sound effect so that the voice synthesis means synthesizes a text message (MT) according to the selected acoustic characteristics describing the voice into a voice message (MS) mixed with the sound effects described and transmitted to the terminal.

12 - System according to claim 11, wherein the voice synthesis means is a piece of equipment (3) located near or integrated into the terminal (1).

13 - System according to claim 11, wherein the voice synthesis means (3) is integrated in the server means (2).