WO2007028871A1

WO2007028871A1 - Speech synthesis system having operator-modifiable prosodic parameters

Info

Publication number: WO2007028871A1
Application number: PCT/FR2006/001967
Authority: WO
Inventors: Edouard Hinard; Cédric BOIDIN; Laurent Roussarie
Original assignee: France Telecom
Priority date: 2005-09-07
Filing date: 2006-08-22
Publication date: 2007-03-15

Abstract

The invention concerns a system for speech synthesis of a text by concatenation of acoustic units comprising means for: generating (6) a target prosody in the form of a set of prosodic parameters; selecting (7) candidate acoustic units; and processing the signal (8) to create the sound signal including: means (10) for concatenation of the candidate acoustic units into a first intermediate stream; and means (11) for prosodic modification of said intermediate audio stream based on the target prosody so as to obtain the sound signal, and said system comprising: means (9) enabling the final sound signal to be listened to by a user; and means (12) enabling the speech synthesis system to be edited by the user, for editing the prosody generated with the sound signal and for modifying the prosodic parameters of the unit selecting means (7) and/or of the prosodic modifying means (11) prior to the creation of a new sound signal.

Description

Voice synthesis system having prosodic parameters modifiable by an operator.

The present invention relates to a system and method for voice synthesis by concatenating acoustic units.

Voice synthesis by concatenation of acoustic units relies on a number of known principles.

Typically, a text-to-speech synthesis chain comprises the following steps:

- linguistic processing, which extracts from the text the linguistic information relevant to the synthesis,

- phonetic transcription, which transforms the linguistic information into a phonetic string comprising a series of target acoustic units; this phonetic transcription can be accompanied by the generation of prosodic information,

- selection of the candidate acoustic units, that is to say selection of the pre-recorded speech fragments that will be used for the synthesis, and

- signal synthesis, which consists in concatenating the selected candidate acoustic units to form the requested sound signal; this synthesis can also be accompanied by prosodic modifications.

Prosody is thus involved in three of the steps mentioned:

- the prosodic generation step, which uses a prosody model to generate a target prosody. The target prosody is the prosody imposed by the system. It can be used in the selection step and/or in the signal processing step,

- the step of selecting the acoustic units, which consists in selecting, in a database, the prerecorded speech segments that will be used for the synthesis, and which may or may not use the target prosody,

- the signal processing step, which creates the final signal. Signal processing methods allow prosodic modifications to be applied so as to actually obtain the target prosody.
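By way of illustration only, the four-step chain described above can be sketched in Python as follows. The function names, the one-letter-per-unit transcription and the `inventory` dictionary are hypothetical simplifications that do not come from the application, and prosody generation is deliberately left out of this sketch.

```python
from typing import Dict, List


def linguistic_processing(text: str) -> List[str]:
    """Toy linguistic analysis: lowercase the text and split it into words."""
    return text.lower().split()


def phonetic_transcription(words: List[str]) -> List[str]:
    """Toy transcription (assumption): one target unit per alphabetic letter."""
    return [ch for word in words for ch in word if ch.isalpha()]


def select_units(targets: List[str],
                 inventory: Dict[str, List[float]]) -> List[List[float]]:
    """Pick one pre-recorded fragment (a list of samples) per target unit."""
    return [inventory[t] for t in targets]


def synthesize_signal(units: List[List[float]]) -> List[float]:
    """Concatenate the selected fragments into a single signal."""
    signal: List[float] = []
    for unit in units:
        signal.extend(unit)
    return signal


def text_to_speech(text: str, inventory: Dict[str, List[float]]) -> List[float]:
    """Run the four steps of the chain in order."""
    words = linguistic_processing(text)
    targets = phonetic_transcription(words)
    units = select_units(targets, inventory)
    return synthesize_signal(units)
```

Each stage consumes only the output of the previous one, which is what allows a later stage to be re-run on its own when its parameters change.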

The development of a prosody model is a subject well known to those skilled in the art. However, no model currently makes it possible to generate a perfect prosody, one that would sound natural. To obtain such a prosody, it has therefore been proposed to use voice synthesis systems assisted by an operator. Such a system is described, for example, in US patent application 2003/0229494 (Rutten et al.). In this patent application, the operator acts iteratively: he listens to the sentence produced by the system, can then adjust the parameters of the selection step, and starts a new selection. This process is repeated until the operator obtains a result that suits him.

The disadvantage of such a system is that the relationship between changes made by the operator and the result heard is not intuitive. It is thus difficult for the operator to predict the result of the proposed changes.

The object of the invention is therefore to remedy this drawback by proposing an interactive voice synthesis system in which the user-provided parameter changes have a direct relationship with the expected result. This advantageously allows such a system to be used by an operator with little experience.

The subject of the invention is therefore a system for the voice synthesis of a text by concatenation of acoustic units, comprising:

- prosodic generation means capable of generating a target prosody of the text in the form of a set of prosodic parameters,

- candidate acoustic unit selection means capable of generating a stream of candidate acoustic units representative of the text and of its target prosody,

- signal processing means able to create the sound signal representative of the text and comprising:

  - means for concatenating the stream of candidate acoustic units into a first intermediate stream, and

  - means for prosodic modification of this intermediate sound stream as a function of the target prosody in order to obtain the final sound stream,

said system further comprising:

- means enabling a user to listen to the final sound stream, and

- editing means adapted to allow a user to apply modifications to the parameters of the speech synthesis system so that it generates a new sound stream,

wherein the editing means are adapted to edit the prosody generated with the final sound stream and to modify the prosodic parameters of the unit selection means and/or of the prosodic modification means before the creation of a new sound signal by the signal processing means (8).

Other features of the invention are:

- the modifiable prosodic parameters are at least the fundamental frequency, the duration and/or the energy;

- the modifiable prosodic parameters relate to the phonemes, the syllables, the words, the groups of words, the sentence of the text or a combination of these.

Another subject of the invention is a method for the voice synthesis of a text by concatenation of acoustic units, comprising the steps of:

a) prosodic generation of a target prosody of the text in the form of a set of prosodic parameters,

b) selection of candidate acoustic units in the form of a stream representative of the text and of its target prosody,

c) concatenation of the stream of candidate acoustic units into an intermediate sound stream,

d) prosodic modification of this intermediate sound stream as a function of the target prosody to obtain the final sound stream,

e) listening, by a user, to the sound stream thus generated, and

f) modification of the parameters of the speech synthesis system and return to step b) if the user considers the generated sound stream incorrect,

the modifications relating to the prosodic parameters used by the candidate acoustic unit selection and/or prosodic modification steps, as a function of the prosody generated with the final sound stream.

Other features of this method are:

- the modifiable prosodic parameters relate to the phonemes, the syllables, the words, the groups of words, the sentences of the text or a combination of these.

Another subject is a computer program product comprising program code instructions recorded on a computer-readable medium, for implementing the steps of the method when said program is run on a computer.

Another subject is a data carrier supporting the computer program.

Thus, advantageously, the prosody of the sound stream generated during a first pass serves as the target for the second pass or, more generally, the prosody generated at a given iteration serves as the basis for the target prosody used at the next iteration.

The prosodic parameters that can be modified by the user are advantageously parameters such as the fundamental frequency, the duration and/or the energy, whose relationship with the qualities of the sound stream is directly perceptible to the user, even if he is not very experienced. In addition, the system and the method advantageously make it possible to apply the prosodic modifications to all or part of the text to be synthesized, with a fully configurable granularity. The modification can be applied to phonemes as well as to syllables, words, groups of words or sentences of the text.

The invention will be better understood on reading the description which follows, given solely by way of example and with reference to the appended drawings, in which:

- FIG. 1 is a simplified diagram of a voice synthesis system according to the invention;

- FIG. 2 is a flow diagram of the method according to the invention;

- FIG. 3 is an example of display of the information by the editing means; and

- FIG. 4 is a second example of information display by the interface means.

With reference to FIG. 1, a voice synthesis system 1 is intended to transform a text 2 into a sound wave 3. The text 2 is entered into the system 1 by input means 4, which transform it into a file, typically in the Unicode standard. This file is processed by linguistic processing means 5, which extract from the text the information relevant to the synthesis by means of a linguistic analysis of this text.

This linguistic information is used by the phonetic transcription and prosodic generation means 6. The transcription, which is not necessarily unique, takes the form of a series of target acoustic units, augmented by additional information representing the target prosody of the text. This target prosody is in the form of a set of prosodic parameters such as, for example, fundamental frequency, duration or energy.
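One possible in-memory representation of such a transcription is sketched below in Python, purely as an illustration. The class and field names, and the choice of hertz, milliseconds and decibels as units, are assumptions; the application does not prescribe any particular data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProsodicParameters:
    """Prosodic parameters attached to one target unit (units are assumptions)."""
    f0_hz: float        # fundamental frequency
    duration_ms: float  # duration
    energy_db: float    # energy


@dataclass
class TargetUnit:
    """A target acoustic unit augmented with its target prosody."""
    phoneme: str
    prosody: ProsodicParameters


def total_duration_ms(units: List[TargetUnit]) -> float:
    """Total duration of the target prosody over a sequence of units."""
    return sum(unit.prosody.duration_ms for unit in units)


# Hypothetical two-unit transcription with its target prosody.
target = [
    TargetUnit("b", ProsodicParameters(f0_hz=120.0, duration_ms=60.0, energy_db=65.0)),
    TargetUnit("a", ProsodicParameters(f0_hz=130.0, duration_ms=110.0, energy_db=70.0)),
]
```

Keeping the three parameters on every unit means an editor can later change any one of them for any span of units without touching the rest.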

The voice synthesis system 1 also comprises means 7 for selecting candidate acoustic units. These candidate acoustic units are prerecorded pieces of speech corresponding to phonemes, diphones, syllables, etc., and each represents a sound variation of a basic acoustic unit, for example a variation of length, size, etc.

These selection means 7 generate a stream of candidate acoustic units representative of the text to be synthesized and the target prosody defined above.
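The application does not say how the selection means 7 score the candidates. As a minimal sketch, assuming a simple weighted distance between each candidate's recorded prosody and the target prosody, a greedy selection could look like this in Python. The dictionary keys, the weights and the cost function are hypothetical, and unit selection systems commonly add a concatenation cost and a global search, both of which are omitted here.

```python
from typing import Dict, List


def target_cost(candidate: dict, target: dict,
                f0_weight: float = 1.0, duration_weight: float = 1.0) -> float:
    """Weighted distance between a candidate's prosody and the target prosody."""
    return (f0_weight * abs(candidate["f0_hz"] - target["f0_hz"])
            + duration_weight * abs(candidate["duration_ms"] - target["duration_ms"]))


def select_units(targets: List[dict],
                 database: Dict[str, List[dict]]) -> List[dict]:
    """For each target unit, keep the recorded candidate with the lowest cost."""
    stream = []
    for target in targets:
        candidates = database[target["phoneme"]]
        best = min(candidates, key=lambda c: target_cost(c, target))
        stream.append(best)
    return stream
```

With such a cost, changing the target fundamental frequency or duration of a unit directly changes which recorded candidate is picked for it.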

This stream of candidate acoustic units is processed by signal processing means 8 to produce a sound stream. This sound stream is used by listening means 9 to generate the sound wave 3.

The signal processing means 8 comprise means 10 for concatenating the stream of candidate acoustic units into a single intermediate sound stream. The signal processing means 8 also comprise prosodic modification means 11 capable of modifying this intermediate sound stream as a function of the parameters of the target prosody in order to obtain the final sound stream.
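The two-stage structure of the signal processing means 8 (concatenation, then prosodic modification) can be illustrated with the following Python sketch. It is deliberately naive: duration is changed by nearest-neighbour resampling of each segment and energy by a plain amplitude gain, so the pitch is not preserved. The application does not specify the modification algorithm, and all function names here are hypothetical.

```python
from typing import List, Tuple


def concatenate(units: List[List[float]]) -> Tuple[List[float], List[Tuple[int, int]]]:
    """Join the units into one intermediate stream and remember their boundaries."""
    stream: List[float] = []
    bounds: List[Tuple[int, int]] = []
    for unit in units:
        start = len(stream)
        stream.extend(unit)
        bounds.append((start, len(stream)))
    return stream, bounds


def stretch(samples: List[float], target_len: int) -> List[float]:
    """Naive time scaling: repeat or drop samples to reach target_len."""
    if target_len <= 0 or not samples:
        return []
    n = len(samples)
    return [samples[i * n // target_len] for i in range(target_len)]


def modify_prosody(stream: List[float], bounds: List[Tuple[int, int]],
                   target_lengths: List[int], gains: List[float]) -> List[float]:
    """Apply a per-segment duration and gain to the intermediate stream."""
    out: List[float] = []
    for (start, end), length, gain in zip(bounds, target_lengths, gains):
        segment = stream[start:end]
        out.extend(gain * s for s in stretch(segment, length))
    return out
```

Because the intermediate stream and its boundaries are kept, the modification stage can be applied again with new targets without re-running the concatenation.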

These various means of the speech synthesis system 1 will not be described in detail, since they are well known to those skilled in the art. Further information on these means can be found, for example, in the aforementioned US patent application 2003/0229494.

The voice synthesis system 1 also comprises means 12 for editing the prosodic parameters. These editing means 12 allow a user, through a visual interface 13, to edit the prosody generated with the final sound stream and to modify the prosodic parameters used by the unit selection means 7 and/or the prosodic modification means 11.

The operation of the voice synthesis system 1 will now be described as a method, with reference to FIG. 2. The method starts at step 20.

A target prosody is generated at 21 from the text 2 by the well-known means described above.

The candidate acoustic units are selected at 22 in the form of a stream representative of the text and of its target prosody. This stream is concatenated at 23 into a single intermediate sound stream.

Prosodic modifications are then applied at 24 to this intermediate sound stream, as a function of the target prosody, to obtain a final sound stream.

This sound stream is listened to by the user at 25. In parallel, it is presented visually at 26 on the interface 13.

If the user considers at 27 that the sound stream is of satisfactory quality, the process ends at 28.

On the other hand, if the sound stream has defects, the user modifies the prosodic parameters at 29 via the interface 13. Depending on the type of prosodic modification requested, that is to say on which prosodic parameters have been modified, the method executes a new step 22 of selecting the candidate acoustic units and/or only a prosodic modification at 24 of the intermediate sound stream.

The process is thus repeated until the sound stream is of satisfactory quality.
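The control flow of FIG. 2 can be summarised by the following Python sketch, given only as an illustration. The five callables, the `reselect` flag returned by the review step and the `max_iterations` safety bound are assumptions introduced for the example; the application only states that, depending on the parameters modified, the selection at 22 is run again or only the prosodic modification at 24.

```python
def synthesis_loop(text, generate_prosody, select_units, concatenate,
                   modify_prosody, review, max_iterations=10):
    """Iterate synthesis until the reviewer accepts the result.

    review(final, target) returns None when the user is satisfied (step 27),
    otherwise a pair (new_target, reselect) describing the edit made at 29.
    """
    target = generate_prosody(text)        # step 21
    units = select_units(text, target)     # step 22
    intermediate = concatenate(units)      # step 23
    final = None
    for _ in range(max_iterations):
        final = modify_prosody(intermediate, target)  # step 24
        edit = review(final, target)                  # steps 25 to 27
        if edit is None:
            return final                              # step 28
        target, reselect = edit                       # step 29
        if reselect:
            # The edit affects unit selection: redo steps 22 and 23.
            units = select_units(text, target)
            intermediate = concatenate(units)
        # Otherwise only step 24 is repeated, on the same intermediate stream.
    return final
```

Passing the stages in as callables keeps the sketch independent of any particular synthesis engine; in an interactive tool, `review` would play the audio and wait for the operator's edits.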

Figure 3 shows an example of a user interface. It shows the structure of the sound stream generated by the system during the first pass. It contains in particular a curve 31 representing the main prosodic information of the generated sound stream: the fundamental frequency at 32 and the duration of the different segments that make up this stream at 33.

Figure 4 shows the structure of the sound stream with a prosody being modified by the operator. In this example, the operator considers that the first part of the stream, referenced 40, does not need to be modified. On the other hand, he considers that a second part, referenced 41, requires prosodic modifications. He can modify any of the prosodic parameters, such as the fundamental frequency, the duration or the energy, at several possible scales, for example at the level of the phoneme, the syllable, the word, the group of words or the sentence. The modified prosody is represented by the curve 42. The juxtaposition of the prosody of the two parts, namely the prosody of the unmodified part and the new prosody associated with the modified part, gives a new target prosody. Once the modifications have been validated by the operator, the voice synthesis system 1 generates a new sound stream from this new target prosody.
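The juxtaposition described above amounts to splicing the operator's edited values into the prosody generated at the previous pass. A minimal Python sketch follows, assuming the prosody is stored as one value per segment and that the edited part is a contiguous range of segments; both are assumptions, as is the function name.

```python
from typing import List


def merge_prosody(generated: List[float], edited: List[float],
                  start: int, end: int) -> List[float]:
    """Build a new target prosody from the generated prosody and an edited span.

    Segments outside [start, end) keep the prosody generated at the previous
    pass; segments inside it take the values set by the operator.
    """
    if not 0 <= start <= end <= len(generated):
        raise ValueError("edited span lies outside the generated prosody")
    if len(edited) != end - start:
        raise ValueError("edited span must supply one value per segment")
    return generated[:start] + list(edited) + generated[end:]
```

For example, with a generated fundamental frequency contour of `[100.0, 110.0, 120.0, 130.0]` and an edit of the last two segments to `[150.0, 160.0]`, the new target is `[100.0, 110.0, 150.0, 160.0]`.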

The speech synthesis system described is, preferably, embodied as a computer program executable on a conventional computer, for example a workstation, comprising a sound card and loudspeakers.

Therefore, the invention also relates to a computer program comprising software instructions for executing the method previously described on a computer. This computer program can be stored on or transmitted by a data carrier. This may be a hardware storage medium such as a CD-ROM, a magnetic diskette or a hard disk, or a transmissible medium such as an electrical, optical or radio signal.

It is also noted that the user-modifiable prosodic parameters are advantageously parameters such as the fundamental frequency, the duration and/or the energy, whose relationship with the qualities of the sound stream is directly perceptible to the user, even if he is not very experienced.

In addition, the system and method thus described advantageously make it possible to apply the prosodic modifications to all or part of the text to be synthesized, with a fully configurable granularity. This modification can be applied to phonemes as well as to syllables, words, groups of words or sentences of the text.
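To illustrate what a configurable granularity can mean in practice, the Python sketch below applies one modification (a scaling of the fundamental frequency) to a single phoneme, syllable or word. The per-phoneme dictionaries with `syllable` and `word` indices, and the function name, are hypothetical; the application does not describe the underlying data model.

```python
from typing import List


def scale_f0(phonemes: List[dict], level: str, index: int,
             factor: float) -> List[dict]:
    """Return a copy of the phoneme list with F0 scaled on one chosen unit.

    level is "phoneme", "syllable" or "word"; index selects which unit of
    that level is edited. Each phoneme dict carries the index of the
    syllable and of the word it belongs to.
    """
    if level not in ("phoneme", "syllable", "word"):
        raise ValueError("unsupported level: " + level)
    edited = []
    for position, phoneme in enumerate(phonemes):
        unit_index = position if level == "phoneme" else phoneme[level]
        new_phoneme = dict(phoneme)  # leave the input untouched
        if unit_index == index:
            new_phoneme["f0_hz"] = phoneme["f0_hz"] * factor
        edited.append(new_phoneme)
    return edited
```

The same pattern extends to duration or energy, and to groups of words or whole sentences, by adding the corresponding index to each phoneme.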

Claims

1. System for the voice synthesis of a text by concatenation of acoustic units comprising:

means (6) of prosodic generation capable of generating a target prosody of the text in the form of a set of prosodic parameters,

means (7) for selecting candidate acoustic units capable of generating a stream of candidate acoustic units representative of the text and the target prosody thereof,

signal processing means (8) able to create a sound signal representative of the text and comprising:

means (10) for concatenating the flow of candidate acoustic units into a first intermediate stream, and

means (11) of prosodic modification of this intermediate sound flux as a function of the target prosody in order to obtain the sound signal, and said system further comprising

means (9) for listening to the sound signal by a user, and

editing means (12) adapted to allow the user to apply modifications to the parameters of the speech synthesis system so that it generates a new sound signal, characterized in that the editing means (12) are adapted to edit the prosody generated with the sound signal and to modify the prosodic parameters of the unit selection means (7) and/or of the prosodic modification means (11) before the creation of said new sound signal by the signal processing means (8).

2. Voice synthesis system according to claim 1, characterized in that the modifiable prosodic parameters are at least the fundamental frequency, the duration and / or the energy.

3. Voice synthesis system according to claim 1 or 2, characterized in that the modifiable prosodic parameters relate to the phonemes, the syllables, the words, the groups of words, the sentence of the text or a combination of these.

4. Method of voice synthesis of a text by concatenation of acoustic units comprising the steps of: a) prosodic generation (21) of a target prosody of the text in the form of a set of prosodic parameters, b) selection (22) of candidate acoustic units in the form of a stream representative of the text and of its target prosody, c) concatenation (23) of the stream of candidate acoustic units into an intermediate sound stream, d) prosodic modification (24) of this intermediate sound stream as a function of the target prosody to obtain the final sound stream, e) listening (25), by a user, to the sound stream thus generated, and f) modification (29) of the parameters of the speech synthesis system followed by a return to step b) if the user considers the generated sound stream incorrect, characterized in that the modifications relate to the prosodic parameters used by the steps of selection (22) of candidate acoustic units and/or of prosodic modification (24), as a function of the prosody generated with the final sound stream.

5. Voice synthesis method according to claim 4, characterized in that the modifiable prosodic parameters are at least the fundamental frequency, duration and / or energy.

6. Voice synthesis method according to one of claims 4 or 5, characterized in that the modifiable prosodic parameters relate to phonemes, syllables, words, groups of words, sentences of the text or a combination of these.

7. A computer program product comprising program code instructions recorded on a computer-readable medium, for carrying out the steps of the method according to any one of claims 4 to 6 when said program is running on a computer.

8. Data carrier supporting the computer program according to claim 7.