WO2007071834A1

WO2007071834A1 - Voice synthesis by concatenation of acoustic units

Info

Publication number: WO2007071834A1
Application number: PCT/FR2006/002745
Authority: WO
Inventors: Edouard Hinard; Cédric BOIDIN; Laurent Roussarie
Original assignee: France Telecom
Priority date: 2005-12-16
Filing date: 2006-12-15
Publication date: 2007-06-28
Also published as: EP1960996A1; EP1960996B1; DE602006012540D1; FR2895133A1

Abstract

The present invention relates to a system of voice synthesis by concatenation of acoustic units comprising: - means (4) for linguistically processing a text so as to transform it into a string of phonemes accompanied by prosodic indications, - means (6) for synthesizing prerecorded elements by concatenation so as to restore an acoustic signal, as a function of the string of phonemes, - input and editing means (8), such that the linguistic processing means (4) comprise at least one elementary processing unit (4A, 4B, 4C) that generates intermediate results of the linguistic processing of said text, said unit being associated with an editor (8A, 8B, 8C) of the input and editing means (8), allowing an operator to modify the intermediate results and the voice synthesis system comprises means (14) for parameterizing the text on the basis of the results modified by the operator, the linguistic processing means (4) adapting the linguistic processing of the text on the basis of said parameterization.

Description

Voice synthesis by concatenation of acoustic units

The present invention relates to a system and method for voice synthesis by concatenation of acoustic units and a computer program for implementing the method.

A text-to-speech synthesis system conventionally comprises means for inputting the text to be synthesized and means for linguistically processing this text to transform it into a series of phonemes accompanied by prosodic indications. This linguistic processing includes syntactic processing, grapheme-to-phoneme conversion and prosodic processing. It relies on dictionaries as well as rule sets.

Such a system also includes means for synthesis by concatenation of prerecorded elements, which generate an acoustic signal from the sequence of phonemes provided by the linguistic processing.

Such a system is explained in more detail in Gaël Richard, Olivier Cappé "Synthesis of speech from the text", Techniques of the engineer H 7 288.

Such systems seek to achieve a quality comparable to that of natural speech.

Currently, a significant limitation on the quality of these speech synthesis systems lies in the linguistic processing. This limitation is related to the loss of information induced by transcription and to the ambiguous nature of certain textual forms. As a result, synthetic speech can be used systematically for static recordings only under the control of an operator who compensates for the inevitable defects of this linguistic processing.

In the state of the art, three methods are known to allow an operator to control the result of a speech synthesis system:

- a method of enriching the text with tags. This enrichment makes it possible to control the linguistic analysis (phonetization of a word or its grammatical label) or the synthesizer (volume, pitch of voice, speed of speech). The use of tags is currently being standardized by the W3C. A first version of the Speech Synthesis Markup Language (SSML) was published in September 2004 at the URL http://www.w3.org/TR/speech-synthesis/. The input text is enriched through a specialized editor. The "TTS Director" tool from Loquendo is an example of an editor dedicated to speech synthesis (http://www.loquendo.com/en/technology/tts director.htm).

- configuration of the system. For example, the Lexitool tool, which is part of the Elan Speech catalog, makes it possible to manage an exception lexicon. The operator enriches the system's data by adding to the lexicon the words that the system does not pronounce correctly and associating the expected pronunciation with them.

- interactive synthesis. This is described in the article by Peter Rutten and Justin Fackrell, "The application of interactive speech unit selection in TTS Systems", Eurospeech 2003. The operator intervenes during the synthesis process, after an important processing stage has been executed, and changes the subsequent overall behavior of the system by modifying the parameters of that stage. For example, in this article, an operator can locally modify the parameterization of the synthesizer, after the selection process has run, to produce a synthetic variant that is closer to what is expected.
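As a rough illustration of the first approach (tag enrichment), the sketch below builds a small SSML fragment with Python's standard `xml.etree.ElementTree`. The element names (`speak`, `phoneme`, `break`, `prosody`) come from the SSML 1.0 recommendation; the sentence, the IPA string and the attribute values are invented for the example, and the SSML namespace declaration is omitted for brevity. It is not taken from any of the tools cited above.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Return a short SSML fragment enriched with three kinds of tags."""
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-US"})
    speak.text = "You say "

    # Force the pronunciation of one word (acts on the linguistic analysis).
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": "təˈmeɪtoʊ"})
    phoneme.text = "tomato"
    phoneme.tail = ","

    # Insert an explicit pause.
    pause = ET.SubElement(speak, "break", {"time": "300ms"})
    pause.tail = " "

    # Slow down the rest of the sentence (acts on the synthesizer).
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow"})
    prosody.text = "I say tomato."

    return ET.tostring(speak, encoding="unicode")


print(build_ssml())
```

The operator of such a tool has to know which tag and which attribute value will produce the desired audible change, which is the learning burden discussed in the text.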

These methods have the major disadvantage of a low correlation between the modification performed by the operator and the final result obtained. "Low correlation" means here that the operator cannot manipulate the system intuitively: significant learning is required before the operator is able to determine which parameter or parameters to modify to obtain a better result.

The object of the invention is therefore to overcome this drawback by proposing an interactive speech synthesis system and method that are easy for an operator to use.

The object of the invention is a system for voice synthesis by concatenation of acoustic units comprising:

- means for storing a text to be synthesized,

- means for linguistically processing said text to transform said text into a series of phonemes accompanied by prosodic indications,

- means for synthesis by concatenation of prerecorded elements to restore an acoustic signal as a function of the series of phonemes,

- input and editing means,

characterized in that the linguistic processing means comprise at least one elementary processing unit generating intermediate results of the linguistic processing of said text, said elementary processing unit being associated with an editor of the input and editing means, allowing an operator to modify the results of the elementary processing unit, and in that said voice synthesis system further comprises means for parameterizing the text to be synthesized according to the results modified by the operator, said linguistic processing means adapting the linguistic processing of the text according to said parameterization.

Other features are:

- the parameterization of the text includes tags inserted into the text to be synthesized;

- the or each elementary processing unit is adapted to perform one of the elementary processes from the set consisting of: a) validation of the text to be synthesized, b) division of the text into sentences, c) division of the text into breath groups, d) division of the text into words, e) modification of an exception lexicon, f) phonetization of words, g) grammatical analysis, h) prosody;

- the linguistic processing means comprise elementary processing means for performing all of the elementary processes of said set.

Another object is a method of voice synthesis by concatenation of acoustic units comprising the steps of:

- storage of a text to be synthesized,

- linguistic processing of said text to transform said text into a series of phonemes accompanied by prosodic indications,

- generation of a sound signal and intermediate results from said series,

- analysis by an operator of the sound signal and the intermediate results,

- modification by the operator of said intermediate results if said operator establishes that the quality of the sound signal is insufficient,

- creation and/or modification of parameters of the text to be synthesized,

- looping back to the linguistic processing step, the latter generating a new series of phonemes taking said parameters into account.

Other features of this method are:

- the modification of the parameters consists of creating or modifying tags in the text to be synthesized;

- the step of generating intermediate results comprises one of the following elementary processing substeps:

- validation of the text to be synthesized,

- division of the text into sentences,

- division of the text into breath groups,

- division of the text into words,

- modification of an exception lexicon,

- phonetization of words,

- grammatical analysis,

- prosody;

- said method further comprises a step of selecting the elementary processing substep to be performed from among the set of elementary processing substeps;

- the method is executed 8 times in succession and, each time, a different elementary processing substep is selected, in the following order:

- validation of the text to be synthesized,

- division of the text into sentences,

- division of the text into breath groups,

- division of the text into words,

- modification of an exception lexicon,

- phonetization of words,

- grammatical analysis,

- prosody.

Another object is a computer program comprising program code instructions for performing the steps of the method when said program is executed on a computer.

Advantageously, the linguistic processing is broken down for the operator into a series of elementary processes, allowing him to control all the parameters that affect the quality of the sound stream produced.

Being able to select the elementary step on which he wishes to intervene, the operator advantageously controls the speech synthesis tool in what appears to him to be the detail of its operation.

In addition, the sequence of elementary processes offers a logical processing order that is well suited to the way the operator works, even though it does not correspond to the internal operation of the synthesis system.

The invention will be better understood on reading the description which follows, made solely by way of example, and in relation to the appended drawings in which:

FIG. 1 is a block diagram of a speech synthesis system according to one embodiment of the invention;

FIG. 2 is a flow chart of a speech synthesis method according to one embodiment of the invention;

FIG. 3 is a variant of the method according to FIG. 2; and

FIG. 4 is a flow chart of a speech synthesis method using the method of FIG. 3 according to an order of presentation of the elementary processes.

With reference to FIG. 1, a voice synthesis system 1 comprises means 2 for inputting a text to be synthesized. This text is stored in a buffer memory 3 in the form of a record comprising the text itself, encoded for example according to the ISO/IEC 10646 standard, as well as linguistic processing aid parameters, for example in the form of SSML tags.

The buffer memory 3 is connected to means 4 for linguistic processing of this text. These linguistic processing means 4 are connected to a second buffer memory 5 in which they store the result of the linguistic processing in the form of a series of phonemes accompanied by prosodic indications.

This second buffer memory 5 is connected to means 6 for synthesis by concatenation of prerecorded elements, which restore an acoustic signal as a function of the series of phonemes.

The acoustic signal is transformed into sounds by speakers 7.

A detailed description of these various elements is contained in the document by G. Richard and O. Cappé cited above.

The voice synthesis system 1 comprises input and editing means 8. These input and editing means 8 comprise keyboard-type input means 9 and a pointing tool 10 such as a mouse. They also comprise a display screen 11 and means 12 for controlling these devices 9, 10, 11.

Advantageously, these input and editing means 8 present a user-friendly graphical interface to an operator of the voice synthesis system 1.

The linguistic processing means 4 comprise a chain of elementary processing units 4A, 4B, 4C, each of which handles a particular stage of the linguistic processing chain, such as the division of the text into sentences, the division of the sentences into words, the phonetization of words, grammatical analysis, prosody, etc.

Each elementary processing unit 4A, 4B, 4C is connected to a specialized editor 8A, 8B, 8C of the input and editing means 8, allowing the operator to act on the elementary results of the corresponding unit 4A, 4B, 4C in order to modify them.

Each pair consisting of an elementary processing unit 4A, 4B, 4C and its editor 8A, 8B, 8C constitutes a processing and editing module 13A, 13B, 13C for a given stage of linguistic processing.

The voice synthesis system 1 comprises parameterization means 14 connected to the first buffer memory 3 and to the elementary processing modules 13A, 13B, 13C.

These parameterization means 14 add, modify or delete the linguistic processing aid parameters contained in the record stored in the buffer memory 3, according to the modifications made by the operator to the elementary results of the elementary processing units 4A, 4B, 4C, so that, when the record is subsequently processed by the same elementary processing units, the elementary result obtained at the output of each unit is the result as modified by the operator. The means 14 are not designed to act on the parameter settings of the elementary processing units themselves, nor on the synthesis means 6.
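The behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Record` class, the position-indexed `overrides` dictionary (standing in for tags) and the toy `phonetize` unit are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Text to synthesize plus its processing-aid parameters (stand-in for tags)."""
    text: str
    overrides: dict = field(default_factory=dict)  # word position -> forced result


def phonetize(record: Record) -> list:
    """Toy elementary unit: one 'phonetic form' per word.

    The unit's own rule (lowercasing) is never changed; only the parameters
    carried by the record influence its output.
    """
    words = record.text.split()
    return [record.overrides.get(i, w.lower()) for i, w in enumerate(words)]


def parameterize(record: Record, edited: list) -> None:
    """Add, modify or delete parameters so that reprocessing yields `edited`."""
    default = phonetize(Record(record.text))  # output without any parameter
    record.overrides = {
        i: new for i, (old, new) in enumerate(zip(default, edited)) if old != new
    }


record = Record("Read the record")
parameterize(record, ["rEd", "the", "record"])  # operator corrects the first word
print(record.overrides)   # {0: 'rEd'}
print(phonetize(record))  # ['rEd', 'the', 'record']
```

Storing the correction with the text rather than in the unit means the same unit, run again on the same record, reproduces the operator's choice without any change to its rules.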

In a preferred embodiment, the speech synthesis system 1 comprises 8 modules corresponding to 8 stages of the linguistic processing of the text.

The first module deals with the text itself. It allows the operator to confirm that the text to be synthesized suits him. Optionally, this module enriches the text with voice-change tags.

The technique used by this first module is described in the state of the art, for example in the standardization of the W3C SSML language.

The second module deals with the division of the text into sentences. The editor shows the operator the sentence boundaries, which can be deleted, moved or inserted.
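A minimal sketch of this kind of boundary editing is given below. The naive punctuation-based splitter and the representation of boundaries as character offsets are assumptions made for the illustration; the source does not specify how the unit or its editor represent boundaries.

```python
import re


def propose_boundaries(text: str) -> list:
    """Naive automatic proposal: a boundary after every '.', '!' or '?'."""
    return [m.end() for m in re.finditer(r"[.!?]", text)]


def split_at(text: str, boundaries: list) -> list:
    """Cut the text at the given character offsets, dropping empty pieces."""
    pieces, start = [], 0
    for end in sorted(boundaries):
        pieces.append(text[start:end].strip())
        start = end
    pieces.append(text[start:].strip())
    return [p for p in pieces if p]


text = "Dr. Martin arrives at 5 p.m. today. He will speak first."
automatic = propose_boundaries(text)
print(split_at(text, automatic))
# ['Dr.', 'Martin arrives at 5 p.', 'm.', 'today.', 'He will speak first.']

# The operator deletes the three spurious boundaries (after "Dr.", "p." and "m.").
edited = automatic[3:]
print(split_at(text, edited))
# ['Dr. Martin arrives at 5 p.m. today.', 'He will speak first.']
```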

The third module deals with the division into breath groups. The editor highlights the breath groups and the pause durations between groups. The operator can change the placement of the pauses and their durations.

The fourth module deals with the division into words. The editor highlights the groupings of words that are linked. The operator can separate words or group others to form phrases.

The fifth module deals with the lexicon. The operator intervenes on the data by adding, modifying or deleting entries of the exception lexicon.
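An exception lexicon of this kind can be pictured as a lookup that takes precedence over the general pronunciation rules. The sketch below assumes a hypothetical `letter_to_sound` fallback and an invented phonetic notation; it is not the data format of any actual system.

```python
def letter_to_sound(word: str) -> str:
    """Hypothetical stand-in for rule-based grapheme-to-phoneme conversion."""
    return " ".join(word.lower())


class ExceptionLexicon:
    """Words whose stored pronunciation overrides the general rules."""

    def __init__(self) -> None:
        self._entries = {}

    def add(self, word: str, pronunciation: str) -> None:
        # Adding an existing word modifies its entry.
        self._entries[word.lower()] = pronunciation

    def remove(self, word: str) -> None:
        self._entries.pop(word.lower(), None)

    def pronounce(self, word: str) -> str:
        key = word.lower()
        if key in self._entries:
            return self._entries[key]
        return letter_to_sound(word)


lexicon = ExceptionLexicon()
lexicon.add("Telecom", "t e l e k O m")
print(lexicon.pronounce("telecom"))  # t e l e k O m
print(lexicon.pronounce("cat"))      # c a t
```

Unlike a correction attached to one text, an entry in the lexicon changes the system's data and therefore affects every later text containing the word.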

The sixth module deals with the phonetization of words. The editor presents to the operator the phonetic form or forms of each word on which the system relies to vocalize the text. The operator acts on the choice of pronunciation variants, liaisons, the mute e, etc. It should be noted that this module differs from the preceding module on the lexicon in that it does not modify the data but the result of the phonetization process.

The seventh module deals with grammatical analysis. The editor presents the operator with the result of the grammatical analysis and the rules that led to this result. The operator can modify the choice of grammar rules and the markers associated with each word or group of words.

The eighth module deals with prosody. The editor presents the operator with prosodic information in the form of curves or tables of values that the operator can modify.
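One possible way to carry an edited table of prosodic values back into the text is the `contour` attribute of the SSML `prosody` element, which takes a list of `(position%,pitch)` targets. The sketch below is only an illustration under that assumption; the table layout and the values are invented, and the source does not say how prosodic edits are encoded.

```python
def to_contour(points: list) -> str:
    """Format (position in %, pitch offset in Hz) pairs as a contour string."""
    return " ".join(f"({pos}%,{hz:+d}Hz)" for pos, hz in sorted(points))


# Table of values as it might be shown to the operator.
table = [(0, 10), (50, 20), (100, -10)]

# The operator raises the mid-sentence peak.
table[1] = (50, 35)

print(to_contour(table))  # (0%,+10Hz) (50%,+35Hz) (100%,-10Hz)
```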

The operation of each elementary processing unit and its associated interfacing module will now be explained in relation to FIG. 2.

Once the text has been stored at 20 in the voice synthesis system 1, a complete speech synthesis, up to the generation of the sound signal, is performed at 21. The operator thus has a reference sound signal for his analysis.

This synthesis 21 comprises successively a linguistic processing step 22 and a concatenation synthesis step 23 as explained above.

During the linguistic processing step 22, one of the elementary processing units 4A, 4B, 4C generates intermediate results at 24. For example, the grammatical analysis means generate a grammatical analysis result accompanied by the rules used. The sound result and the intermediate results obtained are presented to the operator at 25.

If, at 26, the sound result meets the operator's expectations, it is validated at 27, together with the intermediate results.

If the sound result and/or the intermediate results do not meet the operator's expectations, the operator modifies the intermediate results at 28 using the corresponding interface module.

These modifications are taken into account at 29 by the voice synthesis system 1 in the form of a modification of the linguistic processing aid parameters contained in the stored text. Preferably, this is done by enriching the text to be synthesized or by modifying its existing enrichment.

Then the voice synthesis step 21 is executed again using the new enriched text.

The improvement process loops until the operator is satisfied with the result obtained.
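The control flow of FIG. 2 can be sketched as a simple loop. Everything in the sketch is a hypothetical stand-in: the callables `process`, `review` and `parameterize` replace the synthesis chain, the operator and the parameterization means, and the `max_rounds` guard is an addition for the example that the source does not mention.

```python
from typing import Callable, Optional


def improve(
    text: str,
    process: Callable[[str, dict], list],
    review: Callable[[list], Optional[list]],
    parameterize: Callable[[dict, list], dict],
    max_rounds: int = 10,
) -> tuple:
    """Repeat synthesis until the (simulated) operator validates the result."""
    params: dict = {}
    for _ in range(max_rounds):
        results = process(text, params)        # 21-24: synthesis and intermediate results
        edited = review(results)               # 25-26: operator listens and inspects
        if edited is None:                     # 27: result validated
            return results, params
        params = parameterize(params, edited)  # 28-29: edits become text parameters
    raise RuntimeError("result not validated within max_rounds")


# Toy stand-ins: one "result" per word, overridable by position.
def toy_process(text: str, params: dict) -> list:
    return [params.get(i, w) for i, w in enumerate(text.split())]


def toy_parameterize(params: dict, edited: list) -> dict:
    return {**params, **dict(enumerate(edited))}


scripted_operator = iter([["hello", "WORLD"], None])  # one correction, then approval
results, params = improve(
    "hello world", toy_process, lambda r: next(scripted_operator), toy_parameterize
)
print(results)  # ['hello', 'WORLD']
```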

It will be understood that, to obtain a sound stream having all the characteristics desired by the operator, it may be necessary to act on several elementary processes.

In a preferred embodiment, shown in FIG. 3, the speech synthesis method further includes a step 30 of selecting the elementary processing module whose intermediate results will be analyzed and possibly modified by the operator.

Thus, the operator can advantageously choose the type of elementary processing whose results he wishes to analyze and modify.

Advantageously, as shown in FIG. 4, the modifications are made in the following order of presentation of the elementary processing units.

The operator starts at 40 by editing the text via the first module, associated with the elementary processing unit for the text itself.

Then, when he has obtained a satisfactory result at this level, the operator launches at 41 the second module, for dividing the text into sentences.

After obtaining a satisfactory intermediate result, he launches at 42 the third module, for division into breath groups, then at 43 the fourth module, for division into words, then at 44 the fifth module, for the lexicon, then at 45 the sixth module, for word phonetization, then at 46 the seventh module, for grammatical analysis, then at 47 the eighth module, for prosody.

This embodiment is remarkable in that it follows a logical order for the operator but does not correspond to the organization of the processing within a linguistic analyzer of a conventional speech synthesis system.

The operator can also go back to modify the intermediate results of one of the modules already processed, for example because he noticed an error belatedly.
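The ordered session of FIG. 4, including the possibility of going back, can be sketched as below. The step labels paraphrase the eight modules; the function name `run_session` and the encoding of the operator's decisions as a list of strings are assumptions made for the illustration.

```python
STEPS = [
    "text validation", "sentence division", "breath groups", "word division",
    "exception lexicon", "phonetization", "grammatical analysis", "prosody",
]


def run_session(decisions) -> list:
    """Walk through STEPS in order and return the sequence of steps visited.

    `decisions` gives, for each visit, either "ok" (move on to the next
    module) or the name of an earlier step to return to. Once the decisions
    are exhausted, every remaining step is accepted.
    """
    decisions = iter(decisions)
    visited, i = [], 0
    while i < len(STEPS):
        visited.append(STEPS[i])
        choice = next(decisions, "ok")
        if choice == "ok":
            i += 1
        else:
            target = STEPS.index(choice)
            if target > i:
                raise ValueError("can only go back to a step already processed")
            i = target
    return visited


# The operator accepts three steps, then returns to sentence division.
trace = run_session(["ok", "ok", "ok", "sentence division"])
print(len(trace))  # 11
print(trace[3:5])  # ['word division', 'sentence division']
```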

Claims

1. System for voice synthesis by concatenation of acoustic units comprising:

- means (2) for storing a text to be synthesized,

- means (4) for linguistic processing of said text to transform said text into a series of phonemes accompanied by prosodic indications,

- means (6) for synthesis by concatenation of prerecorded elements to restore an acoustic signal as a function of the series of phonemes,

- means (8) for inputting and editing,

characterized in that the linguistic processing means (4) comprise at least one elementary processing unit (4A, 4B, 4C) generating intermediate results of the linguistic processing of said text, said elementary processing unit being associated with an editor (8A, 8B, 8C) of the inputting and editing means (8), allowing an operator to modify the results of the elementary processing unit (4A, 4B, 4C), and in that said voice synthesis system further comprises means (14) for parameterizing the text to be synthesized according to the results modified by the operator, said linguistic processing means (4) adapting the linguistic processing of the text according to said parameterization.

2. Voice synthesis system according to claim 1, characterized in that the parameterization of the text includes tags inserted into the text to be synthesized.

3. Voice synthesis system according to claim 1 or 2, characterized in that the or each elementary processing unit is adapted to perform one of the elementary processes from the set consisting of: a) validation of the text to be synthesized, b) division of the text into sentences, c) division of the text into breath groups, d) division of the text into words, e) modification of an exception lexicon, f) phonetization of words, g) grammatical analysis, h) prosody.

4. Voice synthesis system according to claim 3, characterized in that the linguistic processing means comprise elementary processing means for performing all the elementary processes of said set.

5. Method of voice synthesis by concatenation of acoustic units comprising the steps of:

- storage (20) of a text to be synthesized,

- linguistic processing (22) of said text to transform said text into a series of phonemes accompanied by prosodic indications,

- generation (23, 24) of a sound signal and intermediate results from said series,

- analysis (25) by an operator of the sound signal and the intermediate results,

- modification (28) by the operator of said intermediate results if said operator establishes that the quality of the sound signal is insufficient,

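- creation and/or modification (29) of parameters of the text to be synthesized,

- looping back to the linguistic processing step (22), the latter generating a new series of phonemes taking said parameters into account.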

6. Speech synthesis method according to claim 5, characterized in that the modification of the parameters consists in creating or modifying tags in the text to be synthesized.

7. Speech synthesis method according to claim 5 or 6, characterized in that the step of generating intermediate results comprises one of the following elementary processing substeps:

- validation of the text to be synthesized,

- division of the text into sentences,

- division of the text into breath groups,

- division of the text into words,

- modification of an exception lexicon,

- phonetization of words,

- grammatical analysis,

- prosody.

8. Speech synthesis method according to claim 7, characterized in that it further comprises a step of selecting (30) the elementary processing substep to be performed from among all the elementary processing substeps.

9. Speech synthesis method, characterized in that the method of claim 8 is executed 8 times in succession and in that, each time, a different elementary processing substep is selected, in the following order:

- validation of the text to be synthesized,

- division of the text into sentences,

- division of the text into breath groups,

- division of the text into words,

- modification of an exception lexicon,

- phonetization of words,

- grammatical analysis,

- prosody.

10. Computer program comprising program code instructions for performing the steps of the method according to one of claims 5 to 9 when said program is executed on a computer.