WO2009101319A1

WO2009101319A1 - Method, device and computer program for searching for keywords in a speech signal

Info

Publication number: WO2009101319A1
Application number: PCT/FR2009/050159
Authority: WO
Inventors: Corentin Dubois; Delphine Charlet
Original assignee: France Telecom
Priority date: 2008-02-08
Filing date: 2009-02-03
Publication date: 2009-08-20
Also published as: FR2927461A1

Abstract

A method of identifying at least one keyword in a speech signal, comprising the steps of: a/ searching for a series of sub-lexical units, obtained by converting the keyword, in a sequence of sub-lexical units obtained by converting the speech signal; b/ detecting segmentation marks in the speech signal; and c/ using the segmentation marks detected in step b/ to validate or invalidate the results of the search of step a/.

Description

Method, device and computer program for searching for keywords in a speech signal

The invention relates to the field of identification of keywords in a speech signal.

When a person utters a sentence, he or she produces an acoustic signal. This acoustic signal can be converted into an electrical signal for processing. In the remainder of the description, however, the terms "acoustic signal", "speech signal" and "pronounced sentence" are used to designate any signal representative of the acoustic signal.

Spoken words can be recognized by searching for keywords in the speech signal, for example using a Spoken Term Detection (STD) method. For example, one may seek to detect and locate every occurrence of a keyword in the speech signal produced by a newscaster. The keyword can be entered as text by a user.

A known approach is to use large-vocabulary automatic speech recognition, or LVCSR ("Large Vocabulary Continuous Speech Recognizer"), to transcribe the speech signal into text. A conventional text search is then performed to find the searched keyword(s) in that text. However, LVCSR methods have a significant error rate, for example 15 to 20%.

In addition, LVCSR methods use closed dictionaries, which is a limitation even though some dictionaries have a relatively large number of entries, currently of the order of 70,000. A request formulated by a user may contain one or more keywords that do not belong to the dictionary. Such keywords are said to be out of vocabulary, or OOV. An OOV keyword contained in a speech signal is therefore absent from the transcription of that speech signal. Moreover, OOV words, which may include proper nouns for example, generally carry information and are likely to be prime search targets. Handling OOV keywords is therefore a real challenge in the field of STD.

Another approach, based on phonetic search, makes it possible to handle OOV keywords. This approach uses a representation of the speech signal in sub-lexical units, for example phonemes. These sub-lexical units are shorter than most words and can be combined to represent any keyword. The representation in sub-lexical units can be obtained, for example, by decoding the speech signal into a phoneme sequence or phoneme lattice, or by phonetizing a textual transcription of the speech signal obtained by LVCSR. The keyword search is then performed using a representation of the keyword in sub-lexical units on the one hand, and the representation of the speech signal in sub-lexical units on the other. However, searches based on sub-lexical representations are prone to false alarms, especially for relatively short keywords.

There is therefore a need to improve the reliability of searches based on representations in sub-lexical units.

According to a first aspect, the subject of the invention is a method for identifying at least one keyword in a speech signal, comprising, for each keyword, a step of: a/ searching for a sequence of sub-lexical units, called the request, obtained by conversion of the keyword, in a sequence of sub-lexical units obtained by conversion of the speech signal.

The method further comprises the steps of: b/ detecting segmentation marks, called boundaries, in the speech signal, and c/ using the boundaries detected in step b/ to validate or invalidate the search results of step a/.

Taking the boundaries of the speech signal into account makes it possible to reject at least some of the search results that correspond to false alarms. The results of the search based on sub-lexical representations are thus constrained to remain consistent with the results of the boundary detection.

The search of step a/ may identify one or more subsequences of sub-lexical units within the sequence corresponding to the speech signal. Each identified subsequence, called a candidate subsequence or detection, matches the request.

For example, the detected boundaries may include word boundaries. If a detection is exactly framed by two consecutive word boundaries, it can reasonably be assumed that the detection corresponds to a whole word, and the detection is retained. If, on the other hand, the word boundaries surrounding a detection are relatively far from it, the detection probably corresponds to only part of a spoken word, and the detection is rejected.

The sub-lexical units may include, for example, phones, phonemes, diphones, syllables, or other units.

The detected segments can be words, breath groups, sentences or other units. Segmentation marks, or boundaries, may include word boundaries, sentence boundaries or others.

Advantageously, the method may comprise a step of transcribing the speech signal using a dictionary. The transcription can be performed by an LVCSR method, for example using existing LVCSR software.

The resulting transcription can be used for the boundary detection of step b/, which is thus implemented relatively simply. The invention is of course not limited to using a transcription of the speech signal to detect the boundaries.

Advantageously, the resulting transcription can be used for the conversion of the speech signal. For example, the speech signal is first transcribed and the textual transcription of the speech signal thus obtained, for example by LVCSR, is then transformed into a sequence of sub-lexical units.

The conversion of the speech signal is thus relatively reliable, since the transcription can be carried out with known software and with a relatively low error rate.

Of course, the invention is in no way limited to using this transcription step to convert the speech signal. For example, the speech signal may be converted directly into phonemes.

Provision can be made to search for one or more keywords. The number of keywords can be relatively high.

The terms "word" and "keyword" refer both to words in the usual sense and to phrases, i.e. sequences of words forming units of meaning.

Advantageously, the method includes a text search step in the transcription of the speech signal. This search can concern the same keyword as the search based on sub-lexical units, or another keyword. The results of the text search can be combined with the search results of step a/. The method thus benefits both from the relatively good precision of the text search and from the ability of the search based on sub-lexical units to handle OOV keywords.

The method may thus comprise a step of transcribing the speech signal, the results of which can be used for the boundary detection of step b/, for the conversion of the speech signal, and/or for a text search. Nevertheless, the method according to one aspect of the invention can be implemented without any transcription of the speech signal.

Advantageously, a score is estimated for each detection, or candidate subsequence, obtained in the search step a/. Estimating a score makes it possible to weight the contribution of the word boundaries rather than apply them as an all-or-nothing criterion.

One can decide to keep or reject a candidate sub-sequence according to the value of the corresponding score. For example, only those detections whose score exceeds a certain threshold or is below a certain threshold may be retained.

For example, if several keywords are searched for, the different search steps may associate different keywords with the same subsequence, or with subsequences that overlap at least in part. A score can then be calculated for each subsequence and each of these keywords, and the subsequence/keyword association with the lowest score is chosen.
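The selection among overlapping detections can be sketched as follows. This is only an illustration under stated assumptions: the tuple layout `(keyword, start, end, score)`, the function name `resolve_overlaps` and the greedy strategy (best score first) are hypothetical choices, not something the description prescribes.

```python
def resolve_overlaps(detections):
    """Keep, among overlapping detections, the one with the lowest score.

    detections: list of (keyword, start, end, score) tuples, where start is
    inclusive and end is exclusive (indices into the phoneme sequence).
    Greedy strategy (an assumption): visit detections from best to worst
    score and keep each one that does not overlap an already kept one.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[3]):
        _, start, end, _ = det
        if all(end <= k_start or start >= k_end for _, k_start, k_end, _ in kept):
            kept.append(det)
    # Return the survivors in order of appearance in the signal.
    return sorted(kept, key=lambda d: d[1])


candidates = [
    ("Iran", 4, 7, 3.0),       # straddles two words, poor score
    ("grandir", 0, 6, 0.0),    # overlaps "Iran", better score
    ("ensemble", 6, 11, 0.5),  # overlaps "Iran", better score
]
print(resolve_overlaps(candidates))
```

A lower score is treated as better here, in line with the choice of the lowest score described above.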

The invention is in no way limited by this step of estimating a score. For example, it may be possible to retain a detection only if the first sub-lexical unit of this detection comes immediately after a word boundary and if the last sub-lexical unit of this detection is immediately followed by a word boundary.
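The strict criterion described above, without any score, can be sketched as follows. The representation is an assumption made for illustration: a detection is a half-open index range `[start, end)` in the phoneme sequence, and `boundaries` holds the phoneme indices at which a word boundary falls (including 0 and the sequence length). The example values mirror the configuration of FIG. 4 (four phonemes before the detection, four after, one boundary inside).

```python
def is_framed_by_boundaries(start, end, boundaries):
    """Return True if a boundary falls immediately before phoneme `start`
    and immediately after phoneme `end - 1`."""
    return start in boundaries and end in boundaries


# Hypothetical layout: two words of 6 and 5 phonemes.
boundaries = {0, 6, 11}

# A 3-phoneme detection straddling both words is rejected.
print(is_framed_by_boundaries(4, 7, boundaries))

# A detection covering exactly the first word is retained.
print(is_framed_by_boundaries(0, 6, boundaries))
```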

Advantageously, for each detection, the score is estimated from at least one distance corresponding to this detection. This distance can be obtained in the search step a/ and characterizes the alignment between the candidate subsequence and the sequence of sub-lexical units corresponding to the keyword. The alignment is thus taken into account when deciding whether to keep or reject a given detection.

Alternatively, the score may ignore the alignment between the detection and the searched keyword.

Advantageously and in a nonlimiting manner, the score is estimated from a number of sub-lexical units obtained by subtracting the number of sub-lexical units of the detection from the number of sub-lexical units between the boundary immediately preceding the detection and the boundary immediately following it. If the first sub-lexical unit of the detection comes immediately after a word boundary and the last sub-lexical unit of the detection is immediately followed by a word boundary, this number is zero. If, on the other hand, the detection is part of a longer word, this number may be relatively high. In the case of word boundaries, this number thus reflects how closely the detection coincides with a pronounced word.

Advantageously and in a nonlimiting manner, the score is estimated from the result of a comparison between the number of boundaries, for example word boundaries, within the searched sequence of sub-lexical units and the number of boundaries within the detection. If these numbers differ, the detection may be rejected. For example, if the detection covers (at least partially) more than one word while the keyword corresponds to a single word, the detection may be rejected. The detection may also be rejected if it covers a single word, for example the spoken French word "jambon" (ham), while the keyword, for example "Jean Bon", corresponds to two words. It is recalled that in the present description the term "word" refers both to an isolated word and to a phrase.

Advantageously and in a nonlimiting manner, the score is estimated from the number of sub-lexical units of the detection. Indeed, the lower this number, the higher the risk of a false alarm. Conversely, if the detection is relatively long, the search results are likely to be correct.

It should be noted that the invention is limited by the order of the steps only insofar as this order is necessary for the implementation of the method. For example, step b/ may be performed before step a/.

According to another aspect, the subject of the invention is a computer program intended to be stored in a memory of a device for identifying keywords in a speech signal, and/or stored on a memory medium intended to cooperate with a reader of the central unit of this device, and/or downloaded via a telecommunication network, characterized in that it comprises instructions for implementing the method according to one aspect of the invention when the instructions are executed by a processor of this device.

According to yet another aspect, the subject of the invention is a device for identifying at least one keyword in a speech signal, comprising:

- automatic search means for searching for at least one series of sub-lexical units, respectively obtained by conversion of the at least one keyword, in a sequence of sub-lexical units obtained by conversion of the speech signal;

- detection means for detecting segmentation marks of the speech signal;

- processing means connected to the detection means and the automatic search means for validating or invalidating the search results using the segmentation marks obtained from the detection means.

The automatic search means, the detection means and the processing means can be integrated in the same electronic chip, for example a processor, a microprocessor, a DSP (Digital Signal Processor) or other.

The device may further comprise any other means for implementing the method according to one of the embodiments of the invention. The device for identifying at least one keyword in a speech signal may include a computer, a terminal, a possibly remote server, a chip or other.

The speech signal may, for example, be stored on various media, such as a CD (Compact Disc) or other.

The invention finds a particularly advantageous application in the field of spontaneous speech recognition, in which the user enjoys total freedom of speech, but is of course not limited to this field.

Other features and advantages of the present invention will appear in the following detailed description, made with reference to the accompanying drawings, in which:

FIG. 1 shows an example of a device for identifying keywords in a speech signal according to an embodiment of the present invention.

FIG. 2 shows an exemplary architecture of a keyword identification device according to an embodiment of the present invention.

FIG. 3 is a flowchart of an exemplary method of identifying keywords in a speech signal, implemented in a device according to the embodiment of FIG. 2.

FIG. 4 shows an exemplary portion of a sequence of sub-lexical units including a detection, according to one embodiment of the invention.

FIG. 5 is a flowchart of an exemplary method of identifying keywords in a speech signal, according to an embodiment of the present invention.

FIG. 6 is a flowchart of an exemplary method of identifying keywords in a speech signal according to another embodiment of the present invention.

Identical references designate identical or similar objects from one figure to another.

Reference is first made to FIG. 1, in which a device 1 for identifying keywords in a speech signal comprises a central unit 2. Means for recording an acoustic signal, for example a microphone 13, communicate with acoustic signal processing means, for example a sound card 7. The sound card 7 provides a signal in a format suitable for processing by a microprocessor 8.

A computer program for identifying keywords in a speech signal may be stored in a memory, for example a hard disk 6. When this computer program is executed by the microprocessor 8, the computer program and the signal representative of the acoustic signal can be stored temporarily in a random access memory 9 communicating with the microprocessor 8.

The computer program can also be stored on a memory medium, for example a floppy disk or a CD-ROM, intended to cooperate with a reader, for example a floppy disk drive 10a or a CD-ROM reader 10b.

The computer program can also be downloaded via a telecommunication network, for example the Internet, represented in FIG. 1 by the reference 12. A modem 11 can be used for this purpose.

Device 1 may also include peripherals, for example a screen 3, a keyboard 4 and a mouse 5.

FIG. 2 shows an exemplary architecture of a device for identifying keywords in a speech signal according to one embodiment of the invention.

First conversion means 21 make it possible to convert a speech signal S(t), also referred to as a document, into a sequence of sub-lexical units P, for example a sequence of phonemes. The first conversion means 21 may comprise LVCSR transcription means 22 as well as phonetization means 23.

The LVCSR transcription means 22 are arranged to transcribe the speech signal S(t) using a dictionary of, for example, 65,000 entries. The transcription T of the speech signal S(t) comprises words W_j corresponding to the speech signal S(t), together with temporal indicators t_j^(0), t_j^(1). For example, the temporal indicators may comprise, for each word of the transcription, a start time and a duration, or a start time t_j^(0) and an end time t_j^(1). The variable j indexes the words of the transcription T.

The phonetization means 23 make it possible to obtain a phoneme sequence P from the transcription T output by the LVCSR transcription means 22. Each word W_j of the transcription T can be phonetized separately, that is to say that no linking phoneme is added between two words of the transcription T. This facilitates the recognition of keywords converted into phonemes within the phoneme sequence, since the keywords are themselves converted in isolation, without any particular context, by second conversion means 24 described below.

Each word W _j of the transcription T is phonetized by resorting to the most probable pronunciation of this word.
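Word-by-word phonetization can be sketched as a lexicon lookup that also records where each word ends, so that the word boundaries remain available later. The lexicon, its phoneme labels and the function name are hypothetical; a real system would rely on a pronunciation dictionary or a grapheme-to-phoneme tool and would pick the most probable pronunciation.

```python
def phonetize_transcription(words, lexicon):
    """Concatenate the pronunciation of each word, phonetized in isolation.

    Returns (phonemes, boundaries), where boundaries[j] is the index in
    `phonemes` at which word j starts and boundaries[-1] == len(phonemes).
    No linking phoneme is inserted between consecutive words.
    """
    phonemes = []
    boundaries = [0]
    for word in words:
        phonemes.extend(lexicon[word])  # single (most probable) pronunciation
        boundaries.append(len(phonemes))
    return phonemes, boundaries


# Hypothetical lexicon with illustrative phoneme labels.
LEXICON = {
    "grandir": ["G", "R", "AN", "D", "I", "R"],
    "ensemble": ["AN", "S", "AN", "B", "L"],
}

phonemes, boundaries = phonetize_transcription(["grandir", "ensemble"], LEXICON)
print(phonemes)
print(boundaries)
```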

The phoneme sequence P comprises, in addition to the phonemes p_i themselves, temporal indicators t_i. Each phoneme can thus be located in time. These temporal indicators t_i are obtained from the transcription T. Since the transcription T has temporal indicators t_j^(0), t_j^(1) for words only, the temporal indicators t_i of the phoneme sequence P are deduced, for example, by linear interpolation. Periods of silence can be taken into account if they exceed a certain duration, for example 0.2 seconds.
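A minimal sketch of the linear interpolation, assuming that each phoneme of a word is given an equal share of the word's duration. The function name is hypothetical, and the handling of silences longer than 0.2 seconds is deliberately left out.

```python
def interpolate_phoneme_times(word_start, word_end, n_phonemes):
    """Spread n_phonemes evenly over [word_start, word_end].

    Returns one (start, end) pair per phoneme, in seconds.
    """
    if n_phonemes <= 0:
        return []
    step = (word_end - word_start) / n_phonemes
    return [
        (word_start + i * step, word_start + (i + 1) * step)
        for i in range(n_phonemes)
    ]


# A 3-phoneme word lasting from 0.0 s to 3.0 s (illustrative values).
print(interpolate_phoneme_times(0.0, 3.0, 3))
```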

The variable i serves to index the phonemes of the P sequence. The first conversion means 21 thus make it possible to obtain a transcription T and a phoneme sequence P from the speech signal S (t).

The second conversion means 24 make it possible to convert the keywords W_Q into a sequence W_P of phonemes p_l. The variable l is used to index the phonemes of the sequence W_P.

In an alternative embodiment, not shown, the second conversion means and the phonetization means can be one and the same.

Automatic search means 25, for example a DSP, make it possible to search for the sequence W_P in the phoneme sequence P. The search may or may not take pronunciation variants into account. In the first case, the search can be limited to the most probable pronunciations, insofar as the phonetization means 23 take only the most probable pronunciation into account. If a keyword is recognized with several possible pronunciations in the same subsequence of the sequence P, only the pronunciation with the lowest alignment distance is kept.

The search can be performed by aligning the sequence W_P with the sequence P, each alignment being characterized by a distance.

The distance can be estimated as the sum of the costs of the operations, such as substitutions, insertions and deletions, needed to match part of the sequence P with the sequence W_P. These costs can be derived from preprogrammed matrices, stored for example in look-up tables (LUTs).
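One conventional way to obtain such a distance is approximate substring matching by dynamic programming, where a match may start at any position of the sequence. The sketch below is an assumption about how the search could be done, not the patent's own algorithm: it uses constant unit costs instead of phoneme-dependent cost matrices, and the function name and phoneme labels are hypothetical.

```python
def phonetic_search(query, sequence, thr1, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Find subsequences of `sequence` that align with `query` at low cost.

    Returns (start, end, distance) triples with start inclusive and end
    exclusive, one per end position whose best distance is below thr1.
    """
    m = len(query)
    # prev[i] = (cost, start) of the best alignment of query[:i] against a
    # subsequence ending at the previous position of `sequence`.
    prev = [(i * del_cost, 0) for i in range(m + 1)]
    detections = []
    for j, unit in enumerate(sequence, start=1):
        cur = [(0.0, j)]  # an empty query prefix matches for free anywhere
        for i in range(1, m + 1):
            diag_cost, diag_start = prev[i - 1]
            substitution = (diag_cost + (0.0 if query[i - 1] == unit else sub_cost), diag_start)
            insertion = (prev[i][0] + ins_cost, prev[i][1])      # extra unit in the sequence
            deletion = (cur[i - 1][0] + del_cost, cur[i - 1][1])  # query unit missing
            cur.append(min(substitution, insertion, deletion))
        distance, start = cur[m]
        if distance < thr1:
            detections.append((start, j, distance))
        prev = cur
    return detections


sequence = ["G", "R", "AN", "D", "I", "R", "AN", "S", "AN", "B", "L"]
print(phonetic_search(["I", "R", "AN"], sequence, thr1=0.5))
```

With a higher threshold, neighbouring end positions of the same match are also returned, so a practical implementation would additionally merge overlapping hits.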

The search performed by the means 25 may be a phonetic search of a type known to those skilled in the art.

The search yields at least one subsequence C_k of the sequence P. The search means can be configured to keep only the subsequences C_k whose distance is below a certain threshold THR1. The variable k indexes the subsequences obtained by the search means 25.

Detection means 26 make it possible to detect word boundaries in the speech signal S(t). In this example, the detection means receive the transcription T from the LVCSR transcription means 22, so that detecting the start indicators t_j^(0) and end indicators t_j^(1) of each word is trivial.

These word boundaries are used by processing means 27 to validate or invalidate the results obtained from the search means 25, as detailed below. Only the validated subsequences C*_m are kept, the variable m indexing these retained subsequences.

It should be noted that the various means 21, 24, 25, 26 and 27 can be integrated into a single component, for example a microprocessor.

FIG. 3 is a flowchart of an exemplary method for identifying keywords in a speech signal, implemented in a device according to the embodiment of FIG. 2. In this embodiment, the speech signal is converted into phonemes via a transcription into words, and this transcription is also used for boundary detection.

After a step 30 of receiving a speech signal S(t), an LVCSR transcription is performed during a step 31, then the transcription T thus obtained is phonetized in a step 32.

For a given keyword W_Q, after a step 33 of receiving this keyword, a phonetization step 34 is implemented to convert the keyword into a series of phonemes W_P, or request.

In a phonetic search step 35, subsequences C_k (or detections) of the sequence P are identified as relatively close to the request W_P. The algorithm implemented assigns to each detection C_k a distance D_k indicative of the alignment between this detection C_k and the request W_P. This distance D_k is called the alignment distance. Only the detections C_k for which the distance D_k is below a certain threshold THR1 are kept.

A step 36 of detecting word boundaries locates the start time t_j^(0) and the end time t_j^(1) of each word transcribed during the LVCSR transcription step 31. These start times t_j^(0) and end times t_j^(1) constitute the word boundaries detected in the speech signal.

For each detection Ck obtained from the phonetic search, it is tested whether this detection is consistent with word boundaries detected in the speech signal. A loop 37 is implemented to traverse the different detections Ck, with conventional steps of initialization, testing and incrementation.

For each detection C_k, a step 38 estimates a number N_b^(k) of sub-lexical units that precede the first sub-lexical unit of the detection and lie between the same boundaries as that first sub-lexical unit.

To better understand what is meant by this number N _b ^(k) , it is possible to refer, for example, to the phoneme sequence portion of FIG. 4. In this figure, only one candidate subsequence 49 is represented, and the number N _b ^(k) is called N _b for simplicity.

The portion shown in FIG. 4 corresponds to the transcription of a speech signal corresponding to the French text "grandir ensemble" ("grow together"). The phonemes are referenced 48. The detected word boundaries, represented by double vertical bars, have been superimposed on this phoneme sequence portion.

For the keyword "Iran", the phonetic search step leads to selecting the framed subsequence 49.

The number N _b corresponds to the number of phonemes between the word boundary preceding the detection 49 and the first phoneme "I" of the detection 49, ie N _b = 4.

Also during this step 38, a number N_a^(k) of sub-lexical units is estimated: those that follow the last sub-lexical unit of the candidate subsequence 49 and lie between the same boundaries as that last sub-lexical unit. This number, called N_a in FIG. 4, corresponds to the number of phonemes between the last phoneme "AN" of the detection and the word boundary following the detection, i.e. N_a = 4.

The result of subtracting the number of sub-lexical units of the detection from the number of sub-lexical units between the boundary immediately preceding detection 49 and the boundary immediately following detection 49 is therefore N_{a,b} = N_a + N_b = 8. This sum indicates the extent to which the detection is only part of one or more longer words.

In addition, during this step 38, a number N_s^d of word boundaries within the detection 49 is estimated; here N_s^d = 1, since the detection 49 partially covers two words. The number N_s^q of word boundaries within the searched phoneme sequence "I R AN", i.e. the request, is also estimated; here N_s^q = 0, because the request corresponds to the single word "Iran". The difference between these two numbers is then calculated:

$$N_s = N_s^d - N_s^q$$

In the example of FIG. 4, N_s = 1. This difference is called N_s^(k) in the context of the loop 37 of FIG. 3.

Finally, during step 38, the number L^(k) of sub-lexical units of the detection is stored; in the example of FIG. 4, L = 3. A relatively short detection is more likely to be a false alarm than a relatively long one. For example, the distance characterizing the alignment between a relatively short request and part of a longer word may be relatively small. A relatively short detection may also straddle two words, as in the example of FIG. 4. This is why the number L^(k), or L in the context of FIG. 4, is taken into account.

The numbers N_a^(k), N_b^(k), N_s^(k) and L^(k) are thus estimated from the results of the search (the detection, referenced 49 in FIG. 4 and C_k in FIG. 2) and from the results of the boundary detection (the word boundaries, represented in FIG. 4 by double vertical bars). These numbers N_a^(k), N_b^(k), N_s^(k) and L^(k) thus describe the textual configuration of the detection C_k.
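The four quantities can be computed from a detection and a sorted list of boundary positions, as in the sketch below. The index conventions are assumptions made for illustration: a detection is the half-open phoneme range `[start, end)`, and `boundaries` lists the phoneme indices at which a word boundary falls, including 0 and the sequence length. The example reproduces the values stated for FIG. 4 (N_b = 4, N_a = 4, N_s = 1, L = 3).

```python
from bisect import bisect_left, bisect_right


def boundary_features(start, end, boundaries, query_boundaries=0):
    """Return (n_b, n_a, n_s, length) for the detection [start, end).

    boundaries: sorted phoneme indices of word boundaries (0 and len included).
    query_boundaries: number of word boundaries inside the request (N_s^q).
    """
    left = boundaries[bisect_right(boundaries, start) - 1]  # boundary before the detection
    right = boundaries[bisect_left(boundaries, end)]         # boundary after the detection
    n_b = start - left  # units before the detection, same word as its first unit
    n_a = right - end   # units after the detection, same word as its last unit
    # Boundaries strictly inside the detection (N_s^d).
    inner = bisect_left(boundaries, end) - bisect_right(boundaries, start)
    n_s = inner - query_boundaries
    return n_b, n_a, n_s, end - start


# Two words of 6 and 5 phonemes; the detection covers phonemes 4, 5 and 6.
print(boundary_features(4, 7, [0, 6, 11]))  # (4, 4, 1, 3)
```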

Step 38 of estimating the parameters N_a^(k), N_b^(k), N_s^(k) and L^(k) is followed by a step 39 of calculating a score D'_k according to the formula:

$$D'_k = c_1 D_k + c_2 \frac{c_3 + N_a^{(k)} + N_b^{(k)} + N_s^{(k)}}{L^{(k)}}$$

where c_1, c_2 and c_3 denote positive or zero constants. The triplet {c_1, c_2, c_3} can be optimized to obtain the highest possible performance measure.

For a relatively long detection, the number L^(k) is likely to be relatively high, so that the weight of the sum c_3 + N_a^(k) + N_b^(k) + N_s^(k) is relatively small. Indeed, phonetic search (step 35) generally gives relatively good results for relatively long detections, and word boundaries may be less relevant in that case.

Thus, for a relatively short keyword such as "Iran", a detection such as detection 49 in FIG. 4 will have a relatively high score D'_k. A test step 40, during which the score D'_k is compared with a second threshold, makes it possible to reject the detections whose score is too high. Only the detections C*_m whose scores D'_k are sufficiently low are kept (step 41).
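Steps 39 to 41 can be sketched as follows, assuming the score has the form given in claim 7, namely D' = c1·D + c2·(c3 + N_a + N_b + N_s)/L. The constant values, the threshold and the function names are illustrative assumptions; the description only states that the constants are positive or zero and can be optimized.

```python
def boundary_aware_score(d_k, n_a, n_b, n_s, length, c1=1.0, c2=1.0, c3=0.0):
    """Score D'_k combining the alignment distance with boundary features.

    Assumed form: c1 * D_k + c2 * (c3 + N_a + N_b + N_s) / L.
    """
    if length <= 0:
        raise ValueError("a detection must contain at least one sub-lexical unit")
    return c1 * d_k + c2 * (c3 + n_a + n_b + n_s) / length


def validate(detections, thr2, **constants):
    """Keep the detections whose score is below the second threshold.

    detections: iterable of (d_k, n_a, n_b, n_s, length) tuples.
    """
    return [d for d in detections if boundary_aware_score(*d, **constants) < thr2]


straddling = (0.0, 4, 4, 1, 3)  # configuration of detection 49 in FIG. 4
framed = (0.0, 0, 0, 0, 6)      # detection matching a whole word exactly

print(boundary_aware_score(*straddling))        # 3.0
print(validate([straddling, framed], thr2=1.0))
```

Even with a perfect alignment (D_k = 0), the straddling detection is penalized by its boundary configuration, whereas the framed detection keeps a score of zero.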

FIG. 5 shows an exemplary embodiment in which an improved phonetic search, such as the search described with reference to FIGS. 2 and 3, is combined with a textual search.

In the example of FIG. 5, a step 50 for receiving a search keyword is followed by a test step 51 to determine whether this keyword belongs to a fixed dictionary.

If this keyword does indeed belong to the dictionary, a text search (step 52) is carried out, using a method known from the prior art, and using this dictionary.

In the opposite case, an improved phonetic search is carried out (step 53), for example using the method of the embodiment described with reference to FIGS. 2 and 3. A given keyword is thus searched according to one or the other of a textual search and an improved phonetic search. The results of these two searches are collected (step 54).

FIG. 6 is a flowchart corresponding to another embodiment, in which a conventional text search is combined with an improved phonetic search.

In this example, after a step 60 of receiving a keyword, a text search step 61 is performed. It is followed by a test step 62: if the text search has yielded no detection, an improved phonetic search is carried out (step 63).

At step 64, the results of the text search of step 61 and / or the results of the improved phonetic search of step 63 are collected.
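The two combination strategies of FIG. 5 and FIG. 6 differ only in the criterion that triggers the phonetic search. The sketch below makes that difference explicit; the function names are hypothetical, and `text_search` and `phonetic_search` are stand-in callables that return a list of detections for a keyword.

```python
def combine_by_dictionary(keyword, dictionary, text_search, phonetic_search):
    """FIG. 5 style: in-vocabulary keywords go to the text search,
    out-of-vocabulary keywords go to the improved phonetic search."""
    if keyword in dictionary:
        return text_search(keyword)
    return phonetic_search(keyword)


def combine_by_fallback(keyword, text_search, phonetic_search):
    """FIG. 6 style: try the text search first and fall back to the
    improved phonetic search only if it returns no detection."""
    hits = text_search(keyword)
    return hits if hits else phonetic_search(keyword)


# Stand-in searches: the transcription missed the in-vocabulary word "iran".
dictionary = {"iran", "paris"}
text_hits = {"paris": [("paris", 12.4)]}
text_search = lambda kw: text_hits.get(kw, [])
phonetic_search = lambda kw: [(kw, 47.1)]

print(combine_by_dictionary("iran", dictionary, text_search, phonetic_search))
print(combine_by_fallback("iran", text_search, phonetic_search))
```

In this contrived case the dictionary criterion returns nothing, because the keyword is in the vocabulary but was mistranscribed, while the fallback criterion still reaches the phonetic search.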

Tables 1 and 2 below show the results of an exemplary application of the invention. The experiments consist of searching for two lists of keywords. The first list consists of all the proper names pronounced in the speech signal. The second list consists of proper names that are not pronounced in the speech signal.

The speech signal comes from eight French television newscasts, broadcast in 2002 and 2003, and has a duration of approximately 2 hours 30 minutes.

"Recall" is the ratio of the number of correct detections to the number of detections to be made. "Precision" is the ratio of the number of correct detections to the number of detections made. The measure F_max is a harmonic mean of precision and recall. This performance measure F_max can serve as the optimization criterion for the triplet {c_1, c_2, c_3} in the embodiment of FIG. 3.
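These definitions translate directly into code. The sketch below computes precision, recall and their harmonic mean from detection counts; the function name and the counts are illustrative and are not taken from Tables 1 and 2.

```python
def precision_recall_f(n_correct, n_made, n_expected):
    """Return (precision, recall, F) from detection counts.

    n_correct:  number of correct detections
    n_made:     number of detections made
    n_expected: number of detections to be made
    """
    precision = n_correct / n_made if n_made else 0.0
    recall = n_correct / n_expected if n_expected else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Illustrative counts: 1 correct detection out of 2 made, 2 expected.
print(precision_recall_f(1, 2, 2))  # (0.5, 0.5, 0.5)
```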

The terms "textual search" and "classical phonetic search" denote respectively a conventional textual search and a conventional phonetic search, as described above with reference to the prior art. The term "improved phonetic search" refers to a search according to the embodiment of FIGS. 2 and 3. When the combination criterion is the dictionary of the LVCSR, the method implemented is of the type described with reference to FIG. 5. When the combination criterion is the result of the textual search, the method implemented is of the type described with reference to FIG. 6.

Searching for the keywords of the first list makes it possible to evaluate the performance of the method according to one aspect of the invention in terms of recall and precision. Searching for the keywords of the first and second lists together makes it possible to test the robustness of the method more specifically, insofar as searching for words of the second list tends to reduce precision without changing recall.

Table 1 below shows the search results for the keywords in the first list.

Table 1

Table 2 below shows the search results of the union keywords of the first and second lists.

Table 2

These results show the ability of the improved phonetic search to eliminate many false alarms. Even when only a phonetic search is performed, recall is, as expected, higher than with the textual search, because OOV keywords are handled, but precision is also improved relative to the classical phonetic search, to a level comparable to that of the textual search.

In the case of a combination of two types of searches, this gain in precision is all the more marked because the search for relatively short keywords, that is to say those most likely to generate false alarms, is often handled by the textual search. Of the two embodiments envisaged for combining the searches, the one that uses the result of the textual search as the combination criterion gives the best results. In addition to handling OOV keywords, this embodiment makes it possible, by resorting to the phonetic search, to correct some of the transcription errors made by the LVCSR method.

Claims

1. A method of identifying at least one keyword in a speech signal, the method comprising, for each keyword, a step of: a/ performing a search (35) for a series of sub-lexical units obtained by conversion (34) of the keyword, in a sequence of sub-lexical units obtained by conversion (31, 32) of the speech signal, characterized in that it further comprises the steps of: b/ detecting (36) segmentation marks, called boundaries, in the speech signal, and c/ using (37) the segmentation marks detected in step b/ to validate or invalidate the search results of step a/.

2. The method of claim 1, further comprising the steps of transcribing the speech signal with the aid of a dictionary, performing a text search (52; 61) in the transcription of the speech signal thus obtained, and combining (54; 64) the results of the text search with the results validated in step c/.

3. Method according to one of claims 1 or 2, comprising a step of transcribing (31) the speech signal using a dictionary, the resulting transcription being used for the conversion (32) of the speech signal.

4. Method according to one of claims 1 to 3, comprising a step of transcribing (31) the speech signal using a dictionary, the transcription thus obtained being used for the boundary detection (36) of step b/.

5. A method according to one of claims 1 to 4, wherein, in the search step a/, at least one candidate subsequence of sub-lexical units (C_k; 49) of the sequence of sub-lexical units (P) is obtained for the searched series of sub-lexical units (W_P), and in step c/ a score (D'_k) is estimated for each candidate subsequence obtained in the search step a/.

6. The method according to claim 5, wherein, for each candidate subsequence (C_k; 49), the score is estimated from at least one of:

a distance (D _k ) corresponding to said candidate subsequence, said distance being obtained in step a / of search,

a number of sub-lexical units (N_{a,b}) obtained by subtracting the number of sub-lexical units of the candidate subsequence from the number of sub-lexical units between the boundary immediately preceding the candidate subsequence and the boundary immediately following the candidate subsequence,

a result (N _s ) of a comparison between the number of boundaries within the desired lexical unit sequence and the number of boundaries within the candidate subsequence, and

the number of sub-lexical units (L) of the candidate subsequence.

7. The method of claim 6, wherein, for each candidate subsequence (C_k; 49), the score is estimated using the formula:

$$D' = c_1 D + c_2 \frac{c_3 + N_{a,b} + N_s}{L}$$

where D' denotes the score, D the distance, N_{a,b} the number of sub-lexical units obtained by subtracting the number of sub-lexical units of the candidate subsequence from the number of sub-lexical units between the boundary immediately preceding the candidate subsequence and the boundary immediately following the candidate subsequence,

N_s the absolute value of the difference between the number of boundaries within the searched sequence of sub-lexical units and the number of boundaries within the candidate subsequence,

L the number of sub-lexical units of the candidate subsequence, and c_1, c_2, c_3 three constant values, these values being positive or zero.

8. Computer program intended to be stored in a memory of a device (2) for identifying keywords in a speech signal, and / or stored on a memory medium intended to cooperate with a reader (10a, 10b) of said device and / or downloaded via a telecommunication network (12), characterized in that it comprises instructions for implementing the method according to one of the preceding claims, when said instructions are executed by a processor of said device for identifying keywords in a speech signal.

9. Device for identifying at least one keyword (WQ) in a speech signal (S (t)), said device comprising

- automatic search means (25) for searching for at least one series of sub-lexical units (W_P), respectively obtained by conversion of said at least one keyword, in a sequence of sub-lexical units (T) obtained by conversion of the speech signal, characterized in that it further comprises

detection means (26) for detecting segmentation marks of the speech signal; and processing means (27) connected to the detection means and to the automatic search means for validating or invalidating the results of the search using the segmentation marks obtained from the detection means.

10. Device according to claim 9, characterized in that it comprises means for implementing the method according to any one of claims 2 to 7.