WO1999054869A1 - Adaptation of a speech recognizer for dialectal and linguistic domain variations - Google Patents

Info

Publication number
WO1999054869A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
recognizer
data
smoothing
additional
Prior art date
Application number
PCT/EP1999/002673
Other languages
French (fr)
Inventor
Volker Fischer
Yuqing Gao
Michael A. Picheny
Siegfried Kunzmann
Original Assignee
International Business Machines Corporation
Ibm Deutschland Informationssysteme Gmbh
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corporation and Ibm Deutschland Informationssysteme Gmbh
Priority to AT99924814T (patent ATE231642T1)
Priority to DE69905030T (patent DE69905030T2)
Priority to EP99924814A (patent EP1074019B1)
Publication of WO1999054869A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G10L15/07 Adaptation to the speaker

Definitions

  • said smoothing means performing a Bayesian smoothing.
  • a smoothing factor K from the range 1 to 500 is being suggested.
  • Especially the subrange for smoothing factor K of 20 to 60 is proposed.
  • Bayesian smoothing has been shown to produce good results in terms of recognition accuracy and performance. Intensive experimentation revealed that a smoothing factor K from the range 1 to 500 accomplishes good results. Especially the subrange for smoothing factor K of 20 to 60 turned out to achieve the best results.
  • iteration means 205 for optionally iterating the operation of said reestimation means and for optionally iterating the operation of said smoothing means are suggested.
  • the iteration may be based on said reestimated dialect or domain specific acoustic model parameters or based on said base language acoustic model parameters.
  • This teaching allows for a stepwise approach to the generation of an optimally adapted speech recognizer.
  • said iteration means use a modified additional speech data corpus and/or said iteration means use a new smoothing factor value K.
  • the iteration process may be based on an enlarged or modified additional speech data corpus. For instance, a changed smoothing factor allows the generation process to be tuned to how limited the amount of training data is.
  • said adapted speech recognizer is speaker independent.
  • a method for generating an adapted speech recognizer using a base speech recognizer 201 for a definite but arbitrary base language is suggested.
  • Said method comprises a first step 202 of providing an additional speech data corpus.
  • Said additional speech data corpus comprises a collection of domain specific
  • said method comprises a second step 203 of reestimating acoustic model parameters of said base speech recognizer by a speaker adaption technique using said additional speech data corpus.
  • said method comprises an optional third step 204 for smoothing the reestimated acoustic model parameters.
  • said method comprises an optional fourth step 205 for iterating said first step by providing a modified additional speech data corpus and for iterating said second and third step based on said reestimated acoustic model parameters or based on said base acoustic model parameters.
  • said acoustic model is a Hidden Markov Model (HMM).
  • said speaker adaption technique is the Maximum-A-Posteriori-adaption (MAP) or the Maximum-Likelihood-Linear-Regression-adaption (MLLR).
  • said adapted speech recognizer is speaker independent.
  • Figure 1 is a diagram reflecting the overall structure of the state-of-the-art adaptation process visualizing the generation of a speaker dependent speech recognizer from a speaker independent speech recognizer of the base language.
  • Figure 2 is a diagram reflecting the overall structure of the adaptation process according the current invention visualizing the generation of an improved speaker independent speech recognizer from a speaker independent speech recognizer of the base language.
  • Said improved speaker independent speech recognizer may be the basis for further customization generating an improved speaker dependent speech recognizer.
  • Figure 3 gives a comparison of the error rates of the baseline recognizer (VV), the standard training procedure (VV-S), and the scalable fastboot method (VV-G) normalized to the error rate of the baseline recognizer (VV) for a German test speaker.
  • the starting point is a speech recognizer 101 for a base language which is speaker independent and without specialization to any domain.
  • the individual user has to read a predefined enrollment script 103 which is a further input to the reestimation process 102.
  • the parameters of the underlying acoustic model are adapted by available speaker adaptation techniques according to the state of the art.
  • the result of this generation process is the output of a speaker dependent speech recognizer.
  • the current invention teaches a fast bootstrap (i.e. upfront) procedure for the training of a speech recognizer with improved recognition accuracy; that is, it proposes a generation process for an additionally adapted speaker independent speech recognizer based upon a general speech recognizer for the base language.
  • both accuracy and speed of the recognition system can be significantly improved by explicit modelling of language dialects and orthogonally by the integration of domain specific training data in the modelling process.
  • the architecture of the invention allows to improve the recognition system along both of these directions.
  • the current invention utilizes the fact that for certain dialects, such as Austrian German or Canadian French, the phonetic contexts are similar to those of the base language (German or French, respectively), whereas acoustic model parameters differ significantly due to different pronunciations.
  • Similar, not well trained acoustic models for specific domains e.g. base domain: office correspondence, specific domain: radiology
  • the current invention achieves the reduction of training efforts for individual end users, an improved speaker independent recognition accuracy for specific domains and dialect speakers, and the rapid development of new data files for speech recognizers in specific environments.
  • the current invention (called fastboot in the remainder) utilizes the observation that speaker adaptation techniques, such as the maximum a posteriori estimation of Gaussian mixture observations (MAP adaptation) or maximum likelihood linear regression (MLLR adaptation), yield a significantly larger improvement in recognition accuracy for dialect speakers than for speakers that use pronunciations observed during the training of the recognizer. According to the current teaching this approach results in improved speaker independent recognition accuracy not only for dialect speakers. These techniques move the output probabilities B of the HMMs to a speaker's particular acoustic space, and thus it is achieved that
    • the main differences between dialect and base language are captured by the output probabilities of the HMMs,
    • the trained parameters for the base language already provide good initial values for a dialect specific reestimation by the forward-backward algorithm, and
    • the reestimation of significant contexts from dialect data can be omitted to achieve a fast training procedure.
  • Fig. 2 teaching the application of additional speaker adaptation techniques for the upfront training, i.e. for the training before the speech recognizer is personalized to a specific user, of a speech recognizer for a dialect within a base language or for a special domain.
  • the current invention suggests starting with a base speech recognizer 201 for a base language.
  • an additional speech data corpus 202 is being provided; the current invention is suggesting the usage of actual speech data not comparable with a dictionary.
  • This additional speech data corpus may comprise any collection of domain specific speech data and/or dialect
  • the speech recognizer for the base language may be already used for an unsupervised collection of the additional speech data.
  • the generation process comprises reestimating 203 the acoustic model parameters of said base speech recognizer by one of the available speaker adaption techniques using the additional speech data corpus, thus generating an improved adapted speech recognizer reducing the potential training effort for individual end users and at the same time improving the speaker independent recognition accuracy for specific domains and/or dialect speakers .
  • the invention teaches the application of a further smoothing 204 of the reestimated acoustic model parameters.
  • Bayesian smoothing is an efficient smoothing technology for that purpose. With respect to Bayesian smoothing good results have been achieved with a smoothing factor K from the range 1 to 500 (see below for more details with respect to the smoothing approach). In particular, the range of 20 to 60 for the smoothing factor K yielded excellent results.
  • the current teaching suggests to iterate 205 the above mentioned generation process of reestimating the acoustic model parameters and the smoothing.
  • the iteration can be based on the reestimated acoustic model parameters of the previous run or on the base acoustic model parameters.
  • the iteration can be based on the decision whether the generated adapted speech recognizer shows sufficient recognition improvement.
  • the iteration step may be based for example on a modified additional speech data corpus and/or on the usage of a new smoothing factor value K.
  • this invention suggests optionally applying Bayesian smoothing to the reestimated parameters.
  • this invention suggest to use the means ⁇ i7
  • $c_i = \sum_t c_i(t)$ is the sum of all posterior probabilities $c_i(t)$ of the i-th Gaussian at time t, computed from all observed dialect data $x_t$, and N denotes the total number of mixture
  • the constant K is referred to as a smoothing factor; it allows for an optimization of the recognition accuracy and depends on the relative amount of dialect training data.
  • Figure 3 compares the relative speaker independent error rates achieved with the baseline recognizer.
  • Figure 3 shows a comparison of the error rates of the baseline recognizer (VV), the standard training procedure (VV-S), and the scalable fastboot method (VV-G) normalized to the error rate of the baseline recognizer (VV) for the German test speakers.
  • the error rate for the Austrian speakers increases by more than 50 percent, showing the need to improve the recognition accuracy for dialect speakers. Therefore, for the follow up product, ViaVoice Gold (VV-G), less than 50 hours of speech from approximately one hundred native Austrian speakers (50 % female, 50 % male) were collected and applied with the fastboot approach for the upfront training of the recognizer according to the current invention.
  • Figure 3 compares the results achieved with the fastboot method (VV-G) to the standard training procedure (VV-S), that can be applied if both training corpora are pooled together. It becomes evident that the fastboot method is superior to the standard procedure and yields a 30 percent improvement for the dialect speakers.
  • the results for different values of the smoothing factor show that recognition accuracy is scalable, which is an important feature, if an integrated recognizer for base language and dialect (or - orthogonal to this direction - base domain and specific domain) is needed.
  • the pooled training corpus of the common recognizer (VV-S) is approx.
  • the fastboot approach offers a scalable recognition accuracy, if dialects and/or specific domains are handled in an integrated speech recognizer.
  • the fastboot approach uses only a small amount of additional domain specific or dialect data, which is inexpensive and easy to collect.
  • the fastboot approach reduces the time for the upfront training of the recognizer, and therefore allows for the rapid development of new data files for recognizers in specific environments.

Abstract

The invention relates to a generator and a method for generating an adapted speaker independent speech recognizer. The generator of an adapted speech recognizer is being based upon a base speech recognizer of an arbitrary base language. The generator also comprises an additional speech data corpus used for generation of said adapted speech recognizer. Said additional speech data corpus comprises a collection of domain specific speech data and/or dialect specific speech data. Said generator comprises reestimation means for reestimating language or domain specific acoustic model parameters of the base speech recognizer by a speaker adaption technique. Said additional speech data corpus is exploited by said reestimation means for generating the adapted speech recognizer. The invention proposes smoothing means for smoothing the reestimated acoustic model parameters. A beneficial range for the smoothing factor of a Bayesian smoothing is given. It is further suggested to iterate the adaption process.

Description

DESCRIPTION
ADAPTATION OF A SPEECH RECOGNIZER FOR DIALECTAL AND LINGUISTIC DOMAIN VARIATIONS
1 Background of the Invention
1.1 Field of the Invention
The present invention relates to speech recognition systems. More particularly, the invention relates to a generator for generating an adapted speech recognizer. Furthermore the invention also relates to a method of generating such an adapted speech recognizer said method being executed by said generator.
1.2 Description and Disadvantages of Prior Art
For more than two decades, speech recognition systems have used Hidden Markov Models to capture the statistical properties of acoustic subword units, such as context dependent phones or subphones. An overview of this topic may be found for instance in L. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77(2), pp. 257-285, 1989 or in X. Huang and Y. Ariki and M. Jack, Hidden Markov Models for Speech Recognition, Information Technology Series, Edinburgh University Press, Edinburgh, 1990.
A Hidden Markov Model is a stochastic automaton that operates on a finite set of states $S = \{s_1, \ldots, s_N\}$ and allows for the observation of an output each time $t$, $t = 1, 2, \ldots, T$, a state is occupied. It is defined by a tuple $\mathrm{HMM} = (\pi, A, B)$ where the initial state vector

$$\pi = [\pi_i] = [P(s(1) = s_i)], \quad 1 \le i \le N, \tag{1}$$

gives the probabilities that the HMM occupies state $s_i$ at time $t = 1$, and

$$A = [a_{ij}] = [P(s(t+1) = s_j \mid s(t) = s_i)], \quad 1 \le i, j \le N, \tag{2}$$

gives the probabilities for a transition from state $s_i$ to $s_j$, assuming a first order time invariant process. In case of discrete HMMs the observations $o_l$ are from a finite alphabet $O = \{o_1, \ldots, o_L\}$, and

$$B = [b_{kl}] = [p(o_l \mid s(t) = s_k)], \quad 1 \le k \le N, \; 1 \le l \le L, \tag{3}$$

is a stochastic matrix that gives the probabilities to observe $o_l$ in state $s_k$.
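As a minimal illustration of how the three components of the tuple (π, A, B) interact, the following Python sketch computes the likelihood of a discrete observation sequence with the forward recursion, which is the forward half of the forward-backward algorithm. The two-state model, its probability values, and the function name `forward_likelihood` are hypothetical and chosen only for illustration; they do not come from the source.

```python
def forward_likelihood(pi, A, B, observations):
    """Return P(observations | HMM) for a discrete HMM (pi, A, B).

    pi[i]   : probability of starting in state i
    A[i][j] : probability of moving from state i to state j
    B[k][l] : probability of observing symbol l in state k
    observations : non-empty list of symbol indices
    """
    n_states = len(pi)
    # alpha[i] = P(o_1 .. o_t, s(t) = s_i), initialised for t = 1
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    for obs in observations[1:]:
        # Propagate through the transition matrix, then weight by the
        # probability of emitting the current symbol.
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs]
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state model over a two-symbol alphabet.
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]

likelihood = forward_likelihood(pi, A, B, [0, 1, 0])
```

Because each row of A and B sums to one, the likelihoods of all observation sequences of a fixed length sum to one, which is a convenient sanity check for any such implementation.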
For (semi-) continuous HMMs, which provide the state of the art in today's large vocabulary continuous speech recognition systems, the observations are (continuous valued) feature vectors $c_l$, and the output probabilities are defined by the probability density functions

$$B = [b_{kl}] = [p(c_l \mid s(t) = s_k)], \quad 1 \le k \le N, \; 1 \le l \le L. \tag{4}$$

The actual distribution $p(c_l \mid s_k)$ of the feature vectors is usually approximated by a mixture of $N_k$ Gaussians:

$$p(c_l \mid s_k) = \sum_{i=1}^{N_k} \omega_{ik} \, \mathcal{N}(c_l \mid \mu_{ik}, \Sigma_{ik}) \tag{5}$$

$$= \sum_{i=1}^{N_k} \omega_{ik} \, |2\pi\Sigma_{ik}|^{-1/2} \exp\!\left(-(c_l - \mu_{ik})^{T} \, \Sigma_{ik}^{-1} \, (c_l - \mu_{ik})/2\right); \tag{6}$$
the mixture component weights ω, the means μ, and the covariance matrices Σ are estimated from a large amount of transcribed speech data during the training of the recognizer. A well known procedure to solve that problem is the EM-algorithm (illustrated for instance by A. Dempster and N. Laird and D. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B (Methodological), 1977, Vol. 39(1), pp. 1-38), and the Markov model parameters π, A, B are usually estimated by the use of the forward-backward algorithm (illustrated for instance by L. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77(2), pp. 257-285, 1989).
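The Gaussian mixture output density of equations (5) and (6) can be evaluated directly once the weights, means, and covariances are known. The sketch below assumes diagonal covariance matrices, a simplification that the source does not state, so that the determinant and the inverse reduce to sums over per-dimension variances. The function names and the example values are hypothetical.

```python
import math


def gaussian_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance.

    `var` holds the per-dimension variances, i.e. the diagonal of Sigma.
    """
    # log |2 * pi * Sigma| for a diagonal Sigma
    log_det = sum(math.log(2.0 * math.pi * v) for v in var)
    # (x - mu)^T Sigma^{-1} (x - mu) for a diagonal Sigma
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, var))
    return math.exp(-0.5 * (log_det + quad))


def mixture_density(x, weights, means, variances):
    """Weighted sum of Gaussian densities, as in equation (5)."""
    return sum(
        w * gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Hypothetical two-component mixture over two-dimensional feature vectors.
density = mixture_density(
    [0.5, -0.2],
    weights=[0.3, 0.7],
    means=[[0.0, 0.0], [1.0, -1.0]],
    variances=[[1.0, 1.0], [0.5, 2.0]],
)
```

Practical recognizers usually work with log densities to avoid numerical underflow in high dimensions; the direct form is kept here because it mirrors the equations term by term.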
The training of a speech recognizer for an arbitrary language is described in some detail by L. Bahl and S. Balakrishnan-Aiyer and J. Bellegarda and M. Franz and P. Gopalakrishnan and D. Nahamoo and M. Novak and M. Padmanabhan and M. Picheny and S. Roukos, Performance of the IBM large vocabulary continuous speech recognition system on the ARPA Wall Street Journal task, Detroit, Proc. of the IEEE Int. Conference on Acoustics, Speech, and Signal Processing, pp. 41-44, 1995 or by L. Bahl and P. de Souza and P. Gopalakrishnan and D. Nahamoo and M. Picheny, Context-dependent Vector Quantization for Continuous Speech Recognition, Minneapolis, Proc. of the IEEE Int. Conference on Acoustics, Speech, and Signal Processing, 1993. The procedure is briefly outlined in the following, since it provides the basis for the current invention. The algorithm assumes the existence of a labelled training corpus and a speaker independent recognizer for the computation of an initial alignment between the spoken words and the speech signal. After the framewise computation of cepstral features and their first and second order derivatives, the Viterbi algorithm is used for the selection of the phonetic baseforms that best match the utterances. An outline of the Viterbi algorithm can be found in Viterbi, A.J., Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm, IEEE Trans. on Information Theory, Vol. 13, pp. 260-269, 1967.
Since the acoustic feature vectors show significant variations in different contexts, it is important to identify the phonetic contexts that lead to specific variations. For that purpose the labelled training data is passed through a binary decision network that separates the contexts into equivalence classes depending on the variations observed in the feature vectors. A multi-dimensional Gaussian mixture model is used to model the feature vectors that belong to each class represented by the terminal nodes (leaves) of the decision network. These models are used as initial observation densities in a set of context-dependent, continuous parameter HMMs, and are further refined by running the forward-backward algorithm, which converges to a local optimum after a few iterations. The total number of both context dependent HMMs and Gaussians is limited by the specification of an upper bound and depends on the amount and contents of the training data.
Both the large amount of data needed for the estimation of model parameters and relevant contexts and the need to run several forward-backward iterations make the training of a speech recognizer a very time consuming process. Moreover, speakers face a large degradation in recognition accuracy if their pronunciation differs from the pronunciations observed during the training of the recognizer. This degradation can be caused by poorly trained acoustic models due to a mismatch between the collected data and the task domain. This can be considered the main reason why most commercially available speech recognition products (e.g. IBM ViaVoice, Dragon Naturally Speaking, Kurzweil) at least recommend, if not require, that a new user read an enrollment script of about 50 - 250 sentences for a speaker dependent reestimation of the model parameters.
For such reestimation processes, speaker adaptation techniques are exploited during the training of the recognizer, for instance the maximum a posteriori estimation of Gaussian mixture observations (MAP adaptation; refer for instance to J. Gauvain and C. Lee, Maximum a Posteriori Estimation of Multivariate Gaussian Mixture Observations of Markov Chains, IEEE Trans. on Speech and Audio Processing, Vol. 2(2), pp. 291-298, 1994) or the maximum likelihood linear regression (MLLR adaptation; refer for instance to C. Leggetter and P. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech and Language, Vol. 9, pp. 171-185, 1995).
1.3 Objective of the Invention
Given these problems, the objective of the invention is to reduce the training effort for individual end users and to improve speaker independent recognition accuracy.
It is a further objective of the current invention to make the development of new adapted speech recognizers easier and faster.
2 Summary and Advantages of the Invention
The objective of the invention is solved by the independent claims.
The objective of the invention is solved by claim 1. The generator of an adapted speech recognizer according to the teaching of the current application is based upon a base speech recognizer 201 for a definite but arbitrary base language. The generator also comprises an additional speech data corpus 202 used for generation of said adapted speech recognizer. Said additional speech data corpus comprises a collection of domain specific speech data and/or dialect specific speech data. Furthermore said generator comprises reestimation means 203 for reestimating acoustic model parameters of the base speech recognizer by a speaker adaption technique. Said additional speech data corpus is exploited by said reestimation means for generating the adapted speech recognizer.
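What distinguishes this generator from a speaker dependent enrollment is that the reestimation statistics are gathered from several speakers of the additional corpus at once. The sketch below illustrates only that pooling step for the mean of a single Gaussian: it accumulates occupancy-weighted feature vectors across speakers and normalizes by the total occupancy. It is not the reestimation procedure of the source itself. The data layout, the function name `pooled_mean`, and the assumption that the posterior weights were already computed elsewhere (for example by a forward-backward pass) are all hypothetical.

```python
def pooled_mean(speaker_data):
    """Occupancy-weighted mean of one Gaussian, pooled over all speakers.

    speaker_data maps a speaker id to a list of (posterior, feature_vector)
    pairs, where `posterior` is the probability that the Gaussian generated
    that frame. Returns None if the Gaussian received no evidence at all.
    """
    total = 0.0
    acc = None
    for frames in speaker_data.values():
        for posterior, x in frames:
            if acc is None:
                acc = [0.0] * len(x)
            total += posterior
            acc = [a + posterior * xi for a, xi in zip(acc, x)]
    if acc is None or total == 0.0:
        return None
    return [a / total for a in acc]


# Hypothetical dialect corpus with two speakers and 2-dimensional features.
corpus = {
    "speaker_1": [(1.0, [1.0, 2.0])],
    "speaker_2": [(0.5, [3.0, 4.0]), (0.5, [5.0, 6.0])],
}
dialect_mean = pooled_mean(corpus)
```

Because the statistics are summed before normalization, no single speaker dominates unless that speaker contributes most of the occupancy, which is what keeps the resulting model speaker independent.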
The technique proposed by the current invention thus achieves a significant reduction of training effort for individual end users, an improved speaker independent recognition accuracy for specific domains and dialect speakers, and the rapid development of new data files for speech recognizers in specific environments. Moreover, the recognition rate of non-dialect speakers is also improved.
Whereas in the past speaker adaptation techniques were usually applied to an individual end user's speech data and therefore yielded a speaker dependent speech recognizer, in the current invention they are applied to a dialect and/or domain specific collection of training data from several speakers. This allows for an improved speaker independent recognition especially (but not solely) for a given dialect and domain and minimizes the individual end user's investment to customize the recognizer to their needs.
Another important aspect of this invention is the reduced effort for the generation of a specific speech recognizer: whereas other commercially available toolkits start from the definition of subword units and/or HMM topologies, and thus require a considerably large amount of training data, the current approach starts from an already trained general purpose speech recognizer.
The approach of the current teaching offers a scalable recognition accuracy if dialects and/or specific domains are handled in an integrated speech recognizer. As the current invention is completely independent of the specific dialect and/or specific domain, they may be combined in any possible combination.
Moreover, the amount of additional data (the additional speech data corpus) is very moderate. Only a small amount of additional domain specific or dialect data is required, and it is inexpensive and easy to collect.
Finally, the current invention allows the time for the upfront training of the recognizer to be reduced significantly. Therefore it allows for rapid development of new data files for recognizers in specific environments or combinations of environments.
Additional advantages are accomplished by claim 2.
According to a further embodiment of the proposed invention said additional speech data corpus can be collected unsupervised or supervised.
Based on such a teaching complete flexibility is offered to an exploiter of the current teaching on how the additional speech data corpus is being provided.
Additional advantages are accomplished by claim 3.
According to a further embodiment of the proposed invention said acoustic model is a Hidden-Markov-Model (HMM).
Thus the current teaching may be applied to the HMM technology. Therefore the HMM approach, one of the most successful techniques in the area of speech recognition, can be further improved with the current teachings.
Additional advantages are accomplished by claim 4. According to a further embodiment of the proposed invention said speaker adaption technique is the Maximum-A-Posteriori-adaption (MAP) or the Maximum-Likelihood-Linear-Regression-adaption (MLLR).
These approaches also make it possible to deal with situations in which only sparse training data is available. Excellent adaptation results in terms of recognition accuracy and generation speed of the adapted speech recognizer are achieved especially with these speaker adaptation techniques.
Claim 5 achieves additional benefits.
According to this additional embodiment of the proposed invention smoothing means 204 are introduced for optionally smoothing the reestimated acoustic model parameters.
Experiments revealed that additional smoothing further improves the recognition accuracy and the adaptation speed. Especially in cases with a limited amount of training data these improvements are of specific importance.
Additional advantages are accomplished by claims 6, 7 and 8. According to a further embodiment of the proposed invention said smoothing means perform a Bayesian smoothing. A smoothing factor K from the range 1 to 500 is suggested. In particular, the subrange of 20 to 60 for the smoothing factor K is proposed.
Bayesian smoothing has been shown to produce good results in terms of recognition accuracy and performance. Intensive experimentation revealed that a smoothing factor K from the range 1 to 500 accomplishes good results. Especially the subrange for smoothing factor K of 20 to 60 turned out to achieve the best results.
Additional advantages are accomplished by claim 9. According to a further embodiment of the proposed invention iteration means 205 for optionally iterating the operation of said reestimation means and for optionally iterating the operation of said smoothing means are suggested. The iteration may be based on said reestimated dialect or domain specific acoustic model parameters or based on said base language acoustic model parameters.
This teaching allows for a stepwise approach to the generation of an optimally adapted speech recognizer.
Additional advantages are accomplished by claim 10. According to a further embodiment of the proposed invention said iteration means use a modified additional speech data corpus and/or said iteration means use a new smoothing factor value K.
With this teaching a remarkable amount of selective influence on the iteration process is possible. Depending on the nature of said additional speech data corpus, the iteration process may be based on an enlarged or modified additional speech data corpus. For instance, a changed smoothing factor makes it possible to steer the generation process depending on how limited the amount of training data is.
Additional advantages are accomplished by claim 11.
According to a further embodiment of the proposed invention said adapted speech recognizer is speaker independent.
This approach offers the benefit that an adapted speech recognizer can be generated which is already tailored to a certain domain and/or dialect, or set of domains and/or dialects, but which is still speaker independent. Nevertheless said adapted speech recognizer may be further personalized, resulting in a speaker dependent speech recognizer. Thus specialization and flexibility are offered at the same time.
The objective of the invention is solved by claim 12. A method for generating an adapted speech recognizer using a base speech recognizer 201 for a definite but arbitrary base language is suggested. Said method comprises a first step 202 of providing an additional speech data corpus. Said additional speech data corpus comprises a collection of domain specific speech data and/or dialect specific speech data. Furthermore said method comprises a second step 203 of reestimating acoustic model parameters of said base speech recognizer by a speaker adaption technique using said additional speech data corpus.
The benefits achieved by the teaching of claim 12 are those already discussed with claim 1.
Additional advantages are accomplished by claim 13. According to a further embodiment of the proposed invention said method comprises an optional third step 204 for smoothing the reestimated acoustic model parameters.
Experiments revealed that additional smoothing further improves the recognition accuracy and the adaptation speed. Especially in cases with a limited amount of training data these improvements are of specific importance. For further advantages refer to the benefits discussed with claims 6, 7 and 8 above.
Additional advantages are accomplished by claim 14. According to a further embodiment of the proposed invention said method comprises an optional fourth step 205 for iterating said first step by providing a modified additional speech data corpus and for iterating said second and third step based on said reestimated acoustic model parameters or based on said base acoustic model parameters.
For advantages adhering to this teaching refer to the benefits discussed with claim 9 above.
Additional advantages are accomplished by claim 15. According to a further embodiment of the proposed invention said acoustic model is a Hidden Markov Model (HMM). Moreover it is taught that said speaker adaption technique is the Maximum-A-Posteriori-adaption (MAP) or the Maximum-Likelihood-Linear-Regression-adaption (MLLR). In addition it is suggested to perform a Bayesian smoothing.
The advantages of this approach have been discussed with claims 3, 4, 6, 7 and 8 above.
Additional advantages are accomplished by claim 16.
According to a further embodiment of the proposed invention said adapted speech recognizer is speaker independent.
Benefits related to this teaching are discussed together with claim 11 above.
3 Brief Description of the Drawings
Figure 1 is a diagram reflecting the overall structure of the state-of-the-art adaptation process visualizing the generation of a speaker dependent speech recognizer from a speaker independent speech recognizer of the base language.
Figure 2 is a diagram reflecting the overall structure of the adaptation process according to the current invention, visualizing the generation of an improved speaker independent speech recognizer from a speaker independent speech recognizer of the base language. Said improved speaker independent speech recognizer may be the basis for further customization generating an improved speaker dependent speech recognizer.
Figure 3 gives a comparison of the error rates of the baseline recognizer (VV), the standard training procedure (VV-S), and the scalable fastboot method (VV-G), normalized to the error rate of the baseline recognizer (VV) for a German test speaker.
4 Description of the Preferred Embodiment
Throughout this description the usage of the current teaching is not limited to a certain language, a certain dialect or a certain usage domain. If a certain language, a certain dialect or a certain domain is mentioned, this is to be interpreted as an example only, not limiting the scope of the invention.
Moreover if the current description is referring to a dialect/domain this may be interpreted as a specific dialect/domain or any combination of dialects/domains.
4.1 Introduction
The training of a speech recognizer for a given language, for instance one based on Hidden Markov Models, requires the collection of a large amount of general speech data for the detection of relevant phonetic contexts and the proper estimation of acoustic model parameters. However, a significant decrease in recognition accuracy can be observed if a speaker's pronunciation differs significantly from those present in the training corpus. Therefore, commercially available speech recognizers partly shift the estimation of acoustic parameters to the individual end user by enforcing the personalization process depicted in Fig. 1.
The starting point is a speech recognizer 101 for a base language which is speaker independent and without specialization to any domain. The individual user has to read a predefined enrollment script 103 which is a further input to the reestimation process 102. Within this reestimation process the parameters of the underlying acoustic model are adapted by available speaker adaptation techniques according to the state of the art. The result of this generation process is the output of a speaker dependent speech recognizer.
The current invention teaches a fast bootstrap (i.e. upfront) procedure for the training of a speech recognizer with improved recognition accuracy; i.e. the current invention proposes a generation process for an additionally adapted speaker independent speech recognizer based upon a general speech recognizer for the base language.
According to the current teaching both accuracy and speed of the recognition system can be significantly improved by explicit modelling of language dialects and, orthogonally, by the integration of domain specific training data in the modelling process. The architecture of the invention allows the recognition system to be improved along both of these directions. The current invention utilizes the fact that for certain dialects, e.g. Austrian German or Canadian French, the phonetic contexts are similar to those in the base language (German or French, resp.), whereas acoustic model parameters differ significantly due to different pronunciations. Similarly, insufficiently trained acoustic models for specific domains (e.g. base domain: office correspondence, specific domain: radiology) can be estimated more accurately by the application of the invention to a limited amount of acoustic data from the target domain.
By upfront training of dialects and/or specific domains towards a large number of end users the performance of the recognition system is tremendously increased and user investment to customize the recognizer to their needs is minimized.
According to the current teaching it is in addition possible to reduce the training procedure to the computation of Hidden Markov Model parameters. Moreover, it is possible to use Bayesian smoothing techniques for the better utilization of a small amount of dialect or domain specific training data and for the achievement of a scalable recognition accuracy for a specific dialect within a base language (or domain, resp.).
Thus, based on these techniques, the current invention achieves the reduction of training efforts for individual end users, an improved speaker independent recognition accuracy for specific domains and dialect speakers, and the rapid development of new data files for speech recognizers in specific environments.
4.2 Solution
The current invention (called fastboot in the remainder) utilizes the observation that speaker adaptation techniques, e.g. the maximum a posteriori estimation of gaussian mixture observations (MAP adaptation) or maximum likelihood linear regression (MLLR adaptation), yield a significantly larger improvement in recognition accuracy for dialect speakers than for speakers that use pronunciations observed during the training of the recognizer. According to the current teaching this approach results in improved speaker independent recognition accuracy not only for dialect speakers. These techniques move the output probabilities B of the HMMs to a speaker's particular acoustic space, and thus it is achieved that
• the main differences between dialect and base language are captured by the output probabilities of the HMMs,
• the trained parameters for the base language already provide good initial values for a dialect specific reestimation by the forward-backward algorithm, and
• the reestimation of significant contexts from dialect data can be omitted to achieve a fast training procedure.
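The reestimation referred to above relies on posterior probabilities of the individual gaussians given the observed dialect frames. The following is a minimal sketch of that quantity for a single frame, assuming diagonal covariances and a plain mixture without HMM state alignment; the function name `gaussian_posteriors` and all numbers are illustrative assumptions and are not taken from the original text.

```python
import math


def gaussian_posteriors(frame, weights, means, variances):
    """Posterior probability of each diagonal-covariance Gaussian for one frame.

    Hypothetical helper: a plain mixture is used here, so the HMM state
    alignment that a real forward-backward pass would provide is ignored.
    """
    log_joint = []
    for weight, mean, var in zip(weights, means, variances):
        log_density = -0.5 * sum(
            (x - m) ** 2 / v + math.log(2.0 * math.pi * v)
            for x, m, v in zip(frame, mean, var)
        )
        log_joint.append(math.log(weight) + log_density)
    peak = max(log_joint)  # subtract the maximum for numerical stability
    scaled = [math.exp(value - peak) for value in log_joint]
    total = sum(scaled)
    return [value / total for value in scaled]


# Base-language Gaussians serve as the starting point for a dialect frame.
posteriors = gaussian_posteriors(
    frame=[0.2],
    weights=[0.5, 0.5],
    means=[[0.0], [5.0]],
    variances=[[1.0], [1.0]],
)
print(posteriors)
```

Because the base-language parameters are used as initial values, a frame that lies close to an existing gaussian is assigned almost entirely to it, which is why no new phonetic contexts need to be detected in this sketch.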
The basic teaching of the current invention is depicted in Fig. 2, teaching the application of additional speaker adaptation techniques for the upfront training, i.e. for the training before the speech recognizer is personalized to a specific user, of a speech recognizer for a dialect within a base language or for a special domain.
Referring to Fig. 2, the current invention suggests starting with a base speech recognizer 201 for a base language. For the final generation of an adapted speech recognizer an additional speech data corpus 202 is provided; the current invention suggests the usage of actual speech data, which is not comparable with a dictionary. This additional speech data corpus may comprise any collection of domain specific speech data and/or dialect specific speech data. The speech recognizer for the base language may already be used for an unsupervised collection of the additional speech data.
The generation process comprises reestimating 203 the acoustic model parameters of said base speech recognizer by one of the available speaker adaption techniques using the additional speech data corpus, thus generating an improved adapted speech recognizer, reducing the potential training effort for individual end users and at the same time improving the speaker independent recognition accuracy for specific domains and/or dialect speakers.
Optionally the invention teaches the application of a further smoothing 204 of the reestimated acoustic model parameters. Bayesian smoothing is an efficient smoothing technique for that purpose. With respect to Bayesian smoothing, good results have been achieved with a smoothing factor k from the range 1 to 500 (see below for more details with respect to the smoothing approach). Especially the range of 20 to 60 for the smoothing factor k yielded excellent results.
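The effect of the smoothing factor can be illustrated with a scalar sketch. It assumes the usual MAP-style interpolation in which the base-language mean acts as a prior with weight alpha = k times the base mixture weight; the function `smoothed_mean` and the numbers are hypothetical and only meant to show the trend across the range 1 to 500.

```python
def smoothed_mean(dialect_mean, count, base_mean, base_weight, k):
    """Interpolate a dialect mean with the base-language mean.

    Assumes the prior weight is alpha = k * base_weight, so a larger k pulls
    the result towards the base-language value.
    """
    alpha = k * base_weight
    return (count * dialect_mean + alpha * base_mean) / (count + alpha)


# Illustrative numbers only: 30 frames of dialect evidence around 2.0,
# a base-language mean of 0.0 and a base mixture weight of 0.5.
for k in (1, 20, 60, 500):
    print(k, round(smoothed_mean(2.0, 30.0, 0.0, 0.5, k), 3))
```

With little dialect data a small k lets the estimate follow the few observed frames closely, whereas a large k keeps it near the base-language value; this is one way to read the scalability of the recognition accuracy mentioned in the text.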
Optionally the current teaching suggests iterating 205 the above mentioned generation process of reestimating the acoustic model parameters and the smoothing. The iteration can be based on the reestimated acoustic model parameters of the previous run or on the base acoustic model parameters. Whether to iterate can be decided on the basis of whether the generated adapted speech recognizer shows sufficient recognition improvement. To achieve the desired recognition improvements the iteration step may be based, for example, on a modified additional speech data corpus and/or on the usage of a new smoothing factor value K.
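The optional iteration can be sketched as a simple control loop. This is only an illustration under stated assumptions: the callables `reestimate`, `smooth` and `error_rate`, the stopping threshold `target` and the flag `restart_from_base` are hypothetical names introduced here, and the loop varies only the smoothing factor, not the speech data corpus.

```python
def iterate_adaptation(base, corpus, k_values, reestimate, smooth, error_rate,
                       target, restart_from_base=True):
    """Repeat reestimation (203) and smoothing (204) with new smoothing factors.

    All callables are hypothetical placeholders supplied by the caller:
      reestimate(start, corpus) -> adapted parameters
      smooth(adapted, base, k)  -> smoothed parameters
      error_rate(params)        -> error on a held-out test set
    """
    current, best, best_error = base, base, error_rate(base)
    for k in k_values:
        # Start either from the base model or from the previous run's result.
        start = base if restart_from_base else current
        current = smooth(reestimate(start, corpus), base, k)
        error = error_rate(current)
        if error < best_error:
            best, best_error = current, error
        if best_error <= target:
            break  # sufficient recognition improvement reached
    return best, best_error
```

The flag mirrors the two options named in the text: iterating on the parameters of the previous run or on the base acoustic model parameters.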
Finally the process results in the generation 206 of an adapted speaker independent speech recognizer for a dialect and/or specific domain.
Whereas in the past speaker adaptation techniques were usually applied to an individual end user's speech data and therefore yielded a speaker dependent speech recognizer, in the current invention they are applied to a dialect and/or domain specific collection of training data from several speakers. This allows for an improved speaker independent recognition especially (but not solely) for a given dialect and domain and minimizes the individual end user's investment to customize the recognizer to their needs.
Another important aspect of this invention is the reduced effort for the generation of a specific speech recognizer: whereas other commercially available toolkits start from the definition of subword units and/or HMM topologies, and thus require a considerably large amount of training data, the current approach starts from an already trained general purpose speech recognizer.
For further recognition improvement this invention suggests optionally applying Bayesian smoothing to the reestimated parameters. In particular it is suggested to use the means $\mu_i^b$, variances $\Gamma_i^b$ and mixture component weights $\omega_i^b$ of the base language system (distinguished by the upper index b) for the reestimation of the dialect specific parameters $\mu_i^d$, $\Gamma_i^d$ and $\omega_i^d$ (distinguished by the upper index d) by Bayesian smoothing and tying (refer for instance to J. Gauvain and C. Lee, Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, IEEE Trans. on Speech and Audio Processing, Vol. 2(2), pp. 291-298, 1994) according to the following equations:

$$\mu_i^d = \frac{\sum_t c_i(t)\, x_t + \alpha_i\, \mu_i^b}{c_i + \alpha_i} \tag{7}$$

$$\Gamma_i^d = \frac{Y_i + \alpha_i \left( \Gamma_i^b + \mu_i^b\, \mu_i^{b\,T} \right)}{c_i + \alpha_i} - \mu_i^d\, \mu_i^{d\,T} \tag{8}$$

$$Y_i = \sum_t c_i(t)\, x_t\, x_t^T \tag{9}$$

$$\omega_i^d = \frac{c_i + \alpha_i}{\sum_{m \in M} \left( c_m + \alpha_m \right)}, \qquad \alpha_i = k \cdot \omega_i^b \tag{10}$$

Here, $c_i = \sum_t c_i(t)$ is the sum of all posteriori probabilities $c_i(t)$ of the i-th gaussian at time t, computed from all observed dialect data $x_t$, N denotes the total number of mixture components, and M is the set of gaussians that belong to the same phonetic context as the i-th gaussian. The constant k is referred to as a smoothing factor; it allows for an optimization of the recognition accuracy and depends on the relative amount of dialect training data.
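One possible reading of equations (7) to (10) can be sketched in code for the gaussians of a single phonetic context. The sketch assumes diagonal covariances, a smoothing factor k greater than zero, and a weight update normalised over the gaussians of that context; the function name `bayesian_smooth` and its argument layout are hypothetical and not part of the original disclosure.

```python
def bayesian_smooth(frames, posteriors, base_weights, base_means, base_vars, k):
    """Smooth dialect statistics with base-language parameters.

    frames[t] is a feature vector, posteriors[t][i] is c_i(t) for the Gaussians
    of one phonetic context, and the base_* lists hold their base-language
    weights, means and diagonal variances. Returns (weights, means, variances).
    """
    n_gauss, dim = len(base_weights), len(base_means[0])
    alpha = [k * weight for weight in base_weights]  # prior weight per Gaussian
    counts = [sum(row[i] for row in posteriors) for i in range(n_gauss)]
    total = sum(c + a for c, a in zip(counts, alpha))
    weights, means, variances = [], [], []
    for i in range(n_gauss):
        denom = counts[i] + alpha[i]
        mean, var = [], []
        for d in range(dim):
            # Posterior-weighted first and second order statistics.
            first = sum(row[i] * x[d] for row, x in zip(posteriors, frames))
            second = sum(row[i] * x[d] ** 2 for row, x in zip(posteriors, frames))
            prior_second = base_vars[i][d] + base_means[i][d] ** 2
            m = (first + alpha[i] * base_means[i][d]) / denom
            mean.append(m)
            var.append((second + alpha[i] * prior_second) / denom - m ** 2)
        weights.append(denom / total)
        means.append(mean)
        variances.append(var)
    return weights, means, variances


# Two dialect frames at 2.0, fully assigned to a single base Gaussian N(0, 1).
print(bayesian_smooth([[2.0], [2.0]], [[1.0], [1.0]], [1.0], [[0.0]], [[1.0]], k=2))
```

The posteriors passed in would normally come from a forward-backward pass over the dialect data; here they are simply supplied by the caller.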
4.3 Example of an Embodiment of the Invention
In 1997 IBM Speech Systems released ViaVoice, the first continuous speech recognition software available in 6 different languages. The German recognizer, for example, was trained with several hundred hours of carefully read continuous sentences. Speech was collected solely from fewer than a thousand native German speakers (approx. 50% female, 50% male).
For test purposes of the current teaching, speech from 20 different German speakers (10 female, 10 male) and 20 native Austrian speakers (10 female, 10 male) was collected. All speakers read the same medium perplexity test script from an office correspondence domain, which is supposed to be one of the most important applications for continuous speech recognition.
For both groups of speakers, Figure 3 compares the relative speaker independent error rates achieved with the baseline recognizer. Figure 3 shows a comparison of the error rates of the baseline recognizer (VV), the standard training procedure (VV-S), and the scalable fastboot method (VV-G), normalized to the error rate of the baseline recognizer (VV) for the German test speakers. The error rate for the Austrian speakers increases by more than 50 percent, showing the need to improve the recognition accuracy for dialect speakers.
Therefore, for the follow-up product, ViaVoice Gold (VV-G), less than 50 hours of speech from approx. one hundred native Austrian speakers (50% female, 50% male) were collected and applied with the fastboot approach for the upfront training of the recognizer according to the current invention. Figure 3 compares the results achieved with the fastboot method (VV-G) to the standard training procedure (VV-S), which can be applied if both training corpora are pooled together. It becomes evident that the fastboot method is superior to the standard procedure and yields a 30 percent improvement for the dialect speakers. The results for different values of the smoothing factor show that recognition accuracy is scalable, which is an important feature if an integrated recognizer for base language and dialect (or, orthogonal to this direction, base domain and specific domain) is needed.
Moreover, since the pooled training corpus of the common recognizer (VV-S) is approx. 7 times larger than the Austrian training corpus, and the standard training procedure usually has to compute 4 to 5 forward-backward iterations, the fastboot method is at least 25 times faster. Thus, the rapid development of speech recognizers for specific dialects or domains becomes possible by our invention.
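The speed figure can be checked with a rough calculation. The sketch below assumes that training cost grows linearly with the amount of data and the number of forward-backward passes, and that the fastboot method needs a single pass over the dialect corpus; both assumptions and the function name `estimated_speedup` are illustrative and not stated in the text.

```python
def estimated_speedup(corpus_ratio, standard_iterations, fastboot_passes=1):
    """Rough speedup of dialect-only adaptation over pooled retraining.

    Assumes cost is proportional to (amount of data) x (number of passes).
    """
    return corpus_ratio * standard_iterations / fastboot_passes


low = estimated_speedup(7, 4)   # pooled corpus 7x larger, 4 iterations
high = estimated_speedup(7, 5)  # pooled corpus 7x larger, 5 iterations
print(low, high)
```

Under these assumptions both estimates lie above the factor of 25 quoted in the text.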
4.4 Further Advantages of the Current Teaching
The invention and its embodiment presented above demonstrate the following further advantages:
• The fastboot approach yields a significant decrease in speaker independent error rate for dialect speakers. Moreover, the recognition rate of non-dialect speakers is also improved.
• The fastboot approach offers a scalable recognition accuracy, if dialects and/or specific domains are handled in an integrated speech recognizer.
• The fastboot approach uses only a small amount of additional domain specific or dialect data, which is inexpensive and easy to collect.
• The fastboot approach reduces the time for the upfront training of the recognizer, and therefore allows for the rapid development of new data files for recognizers in specific environments.
5 Acronyms
HMM Hidden Markov Model
MAP maximum a posteriori adaptation
MLLR maximum likelihood linear regression adaptation

Claims

1. Generator of an adapted-speech-recognizer comprising a base-speech-recognizer (201) for a base-language
further characterized by
comprising an additional-speech-data-corpus (202) used for generation of said adapted-speech-recognizer and said additional-speech-data-corpus comprising a collection of domain-specific-speech-data and/or dialect-specific-speech-data, and
said generator comprising reestimation-means (203) for reestimating acoustic-model-parameters of said base-speech-recognizer by a speaker-adaption-technique using said additional-speech-data-corpus.
2. Generator according to claim 1
wherein said additional-speech-data-corpus being provided by unsupervised or supervised collection.
3. Generator according to any of above claims
wherein said acoustic-model is a Hidden-Markov-Model (HMM).
4. Generator according to claim 3
wherein said speaker-adaption-technique is the Maximum-A-Posteriori-adaption (MAP) or
wherein said speaker-adaption-technique is the Maximum-Likelihood-Linear-Regression-adaption (MLLR).
5. Generator according to claim 4
further comprising smoothing-means (204) for optionally smoothing the reestimated acoustic-model-parameters.
6. Generator according to claim 5
wherein said smoothing-means performing a Bayesian smoothing.
7. Generator according to claim 6
wherein a smoothing factor K is from the range 1 to 500.
8. Generator according to claim 6
wherein a smoothing factor K is from the range 20 to 60.
9. Generator according to any of above claims
further comprising iteration-means (205) for optionally iterating the operation of said reestimation-means and for optionally iterating the operation of said smoothing-means based on said reestimated-acoustic-model-parameters or based on said base-acoustic-model-parameters.
10. Generator according to claim 9
wherein said iteration-means using a modified additional-speech-data-corpus and/or
wherein said iteration-means using a new smoothing factor value K.
11. Generator according to any of above claims
wherein said adapted-speech-recognizer being speaker independent.
12. Method for generating an adapted-speech-recognizer using a base-speech-recognizer (201) for a base-language
said method comprising a first-step (202) of providing an additional-speech-data-corpus, said additional-speech-data-corpus comprising a collection of domain-specific-speech-data and/or dialect-specific-speech-data, and
said method comprising a second-step (203) of reestimating acoustic-model-parameters of said base-speech-recognizer by a speaker-adaption-technique using said additional-speech-data-corpus.
13. Method for generating an adapted-speech-recognizer according to claim 12
said method comprising an optional third-step (204) for smoothing the reestimated acoustic-model-parameters.
14. Method for generating an adapted-speech-recognizer according to claim 12 or 13
said method comprising an optional fourth-step (205)
for iterating said first-step by providing a modified additional-speech-data-corpus and
for iterating said second and third-step based on said reestimated-acoustic-model-parameters or based on said base-acoustic-model-parameters.
15. Method for generating an adapted-speech-recognizer according to any of claims 12 to 14
wherein said acoustic-model is a Hidden-Markov-Model (HMM), and
wherein said speaker-adaption-technique is the Maximum-A-Posteriori-adaption (MAP) or
wherein said speaker-adaption-technique is the Maximum-Likelihood-Linear-Regression-adaption (MLLR), and
wherein said third-step performing a Bayesian smoothing.
16. Method for generating an adapted-speech-recognizer according to any of claims 12 to 15
wherein said adapted-speech-recognizer being speaker independent.
PCT/EP1999/002673 1998-04-22 1999-04-21 Adaptation of a speech recognizer for dialectal and linguistic domain variations WO1999054869A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
AT99924814T ATE231642T1 (en) 1998-04-22 1999-04-21 ADAPTATION OF A LANGUAGE RECOGNIZER TO DIALECTIC AND LINGUISTIC FIELD VARIANTS
DE69905030T DE69905030T2 (en) 1998-04-22 1999-04-21 ADAPTING A SPEAKER TO DIALECTIC AND LINGUISTIC AREAS
EP99924814A EP1074019B1 (en) 1998-04-22 1999-04-21 Adaptation of a speech recognizer for dialectal and linguistic domain variations

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US8265698P 1998-04-22 1998-04-22
US60/082,656 1998-04-22
US6611398A 1998-04-23 1998-04-23
US09/066,113 1998-04-23

Publications (1)

Publication Number Publication Date
WO1999054869A1 (en) 1999-10-28

Family

ID=26746379

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP1999/002673 WO1999054869A1 (en) 1998-04-22 1999-04-21 Adaptation of a speech recognizer for dialectal and linguistic domain variations

Country Status (6)

Country Link
EP (1) EP1074019B1 (en)
CN (1) CN1157711C (en)
AT (1) ATE231642T1 (en)
DE (1) DE69905030T2 (en)
TW (1) TW477964B (en)
WO (1) WO1999054869A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE466361T1 (en) * 2006-08-11 2010-05-15 Harman Becker Automotive Sys LANGUAGE RECOGNITION USING A STATISTICAL LANGUAGE MODEL USING SQUARE ROOT SMOOTHING
CN103839546A (en) * 2014-03-26 2014-06-04 合肥新涛信息科技有限公司 Voice recognition system based on Yangze river and Huai river language family
CN104766607A (en) * 2015-03-05 2015-07-08 广州视源电子科技股份有限公司 Television program recommendation method and system
CN106384587B (en) * 2015-07-24 2019-11-15 科大讯飞股份有限公司 A kind of audio recognition method and system
CN107452403B (en) * 2017-09-12 2020-07-07 清华大学 Speaker marking method
CN112133290A (en) * 2019-06-25 2020-12-25 南京航空航天大学 Speech recognition method based on transfer learning and aiming at civil aviation air-land communication field
CN112767961B (en) * 2021-02-07 2022-06-03 哈尔滨琦音科技有限公司 Accent correction method based on cloud computing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"BUILDING BASEFORMS FOR A NEW APPLICATION DOMAIN", IBM TECHNICAL DISCLOSURE BULLETIN, vol. 36, no. 4, 1 April 1993 (1993-04-01), pages 93 - 94, XP000364452, ISSN: 0018-8689 *
DIAKOLOUKAS V ET AL: "Development of dialect-specific speech recognizers using adaptation methods", 1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (CAT. NO.97CB36052), 1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, MUNICH, GERMANY, 21-24 APRIL 1997, 1997, Los Alamitos, CA, USA, IEEE Comput. Soc. Press, USA, pages 1455 - 1458 vol.2, XP002111686, ISBN: 0-8186-7919-0 *
HSIAO-WUEN HON ET AL: "TOWARDS SPEECH RECOGNITION WITHOUT VOCABULARY-SPECIFIC TRAINING", PROCEEDINGS OF THE EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY (EUROSPEECH), PARIS, SEPT. 26 - 28, 1989, vol. 1, no. CONF. 1, 26 September 1989 (1989-09-26), TUBACH J P;MARIANI J J, pages 481 - 484, XP000209672 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1136982A2 (en) * 2000-03-24 2001-09-26 Philips Corporate Intellectual Property GmbH Generation of a language model and an acoustic model for a speech recognition system
EP1136982A3 (en) * 2000-03-24 2004-03-03 Philips Intellectual Property & Standards GmbH Generation of a language model and an acoustic model for a speech recognition system
US6999925B2 (en) 2000-11-14 2006-02-14 International Business Machines Corporation Method and apparatus for phonetic context adaptation for improved speech recognition
EP1215653A1 (en) * 2000-12-18 2002-06-19 Siemens Aktiengesellschaft Method and system for speech recognition for a small size implement
CN102543071A (en) * 2011-12-16 2012-07-04 安徽科大讯飞信息科技股份有限公司 Voice recognition system and method used for mobile equipment
CN104751844A (en) * 2015-03-12 2015-07-01 深圳市富途网络科技有限公司 Voice identification method and system used for security information interaction

Also Published As

Publication number Publication date
TW477964B (en) 2002-03-01
DE69905030D1 (en) 2003-02-27
CN1157711C (en) 2004-07-14
EP1074019A1 (en) 2001-02-07
EP1074019B1 (en) 2003-01-22
ATE231642T1 (en) 2003-02-15
DE69905030T2 (en) 2003-11-27
CN1298533A (en) 2001-06-06

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 99805299.X

Country of ref document: CN

AK Designated states

Kind code of ref document: A1

Designated state(s): CN DE HU IN JP KR PL

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: IN/PCT/2000/00211/DE

Country of ref document: IN

NENP Non-entry into the national phase

Ref country code: KR

WWE Wipo information: entry into national phase

Ref document number: 1999924814

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1999924814

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWG Wipo information: grant in national office

Ref document number: 1999924814

Country of ref document: EP