US20080312921A1 - Speech recognition utilizing multitude of speech features - Google Patents


Publication number
US20080312921A1
Authority
US
United States
Prior art keywords
speech
features
speech recognition
log
multitude
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/195,123
Inventor
Scott E. Axelrod
Sreeram Viswanath Balakrishnan
Stanley F. Chen
Yuging Gao
Rameah A. Gopinath
Hong-Kwang Kuo
Benoit Maison
David Nahamoo
Michael Alan Picheny
George A. Saon
Geoffrey G. Zweig
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/085 Methods for reducing search complexity, pruning

Definitions

  • the present invention relates generally to a speech recognition system, and more particularly, to a speech recognition system that utilizes a multitude of speech features with a log-linear model.
  • Speech recognition systems are used to identify word sequences from an unknown speech utterance.
  • speech features such as cepstra and delta cepstra features are extracted from the unknown utterance by a feature extractor to characterize the unknown utterance.
  • a search is then done to compare the extracted features of the unknown utterance to models of speech units (such as phrases, words, syllables, phonemes, sub-phones, etc.) to compute the scores or probabilities of different word sequence hypotheses.
  • the search space is restricted by pruning out unlikely hypotheses.
  • The word sequence associated with the highest score, likelihood, or probability is recognized as the unknown utterance.
  • a language model that determines the relative likelihood of different word sequences is also used in the calculation of the overall score of the word sequence hypotheses.
  • the speech recognition models may be used to model speech as a sequence of acoustic features, or observations produced by an unobservable “true” state sequence of sub-phones, phonemes, syllables, words, phrases, and the like.
  • Model parameters output from the training operation are often estimated to maximize the likelihood of the training observations.
  • the optimum set of parameters for speech recognition is determined by maximizing the likelihood on the training data.
  • the speech recognition system determines the word sequence with the maximum posterior probability given the observed speech signal to recognize the unknown speech utterance.
  • the best word sequence hypothesis is determined through the search process that considers the scores of all possible hypotheses within the search space.
  • a speech recognition system is provided.
  • the combination of a log-linear model with a multitude of speech features is provided to recognize unknown speech utterances.
  • the speech recognition system models the posterior probability of a hypothesis, that is, the conditional probability of a sequence of linguistic units given the observed speech signal and possibly other information, using a log-linear model.
  • the posterior model captures the probability of the sequence of linguistic units given the observed speech features and the parameters of the posterior model.
  • The posterior model may be determined using the probability of the word sequence hypotheses given a multitude of speech features. That is, in accordance with these exemplary aspects, the probability of a word sequence with timing information and labels, given a multitude of speech features, is used to determine the posterior model.
  • the speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features.
  • log-linear models are used wherein parameters may be trained with sparse or incomplete training data.
  • FIG. 1 shows an exemplary speech processing system embodying the exemplary aspects of the present invention.
  • FIG. 2 shows an exemplary speech recognition system embodying the exemplary aspects of the present invention.
  • FIG. 3 shows an exemplary speech processor embodying the exemplary aspects of the present invention.
  • FIG. 4 shows an exemplary decoder embodying the exemplary aspects of the present invention.
  • FIG. 5 shows a flowchart for data training in accordance with the exemplary aspects of the present invention.
  • FIG. 6 shows a flowchart for speech recognition in accordance with the exemplary aspects of the present invention.
  • When referring to the figures (FIGS. 1-6), like structures and elements shown throughout are indicated with like reference numerals.
  • In FIG. 1, an exemplary speech processing system 1000 embodying the exemplary aspects of the present invention is shown. It is initially noted that the speech processing system 1000 of FIG. 1 is presented for illustration purposes only, and is representative of countless configurations in which the exemplary aspects of the present invention may be implemented. Thus, the present invention should not be considered limited to the system configuration shown in the figure.
  • The speech processing system 1000 includes a telephone system 210, a voice transport system 220, a voice input device 230, and a server 300.
  • Terminals 110-120 are connected to the telephone system 210 via telephone network 215, and terminals 140-150 are connected to the voice transport system 220 via data network 225.
  • The telephone system 210, voice transport system 220, and voice input device 230 are connected to the speech recognition system 300.
  • The speech recognition system 300 is also connected to a speech database 310.
  • Speech is sent from a remote user over network 215 or 225 through one of terminals 110-150, or directly from voice input device 230.
  • Terminals 110-150 run a variety of speech recognition and terminal applications.
  • the speech recognition system 300 receives the input speech and provides the speech recognition results to the inputting terminal or device.
  • the speech recognition system 300 may include or may be connected to a speech database 310 which includes training data, speech models, meta-data, speech data and their true transcription, language and pronunciation models, application specific data, speaker information, various types of models and parameters, and the like.
  • the speech recognition system 300 then provides the optimal word sequence as the recognition output or it may provide a lattice of word sequence hypotheses with corresponding confidence scores.
  • Lattices may take a plurality of forms, including a summary of a set of hypotheses as a graph, which may have a complex topology. It should be appreciated that if the graph contains loops, the set of hypotheses may be infinite.
  • the speech processing system 1000 may be any system known in the art for speech processing.
  • the speech processing system 1000 may be configured and may include various topologies and protocols known to those skilled in the art.
  • Although FIG. 1 only shows 2 terminals and one voice input device, the various exemplary aspects of the present invention are not limited to any particular number of terminals and input devices.
  • any number of terminals and input devices may be applied in the present invention.
  • FIG. 2 shows an exemplary speech recognition system 300 embodying the exemplary aspects of the present invention.
  • The speech recognition system 300 includes a speech processor 320, a storage device 340, an input device 360, and an output device 380, all connected by bus 395.
  • The processor 320 of the speech recognition system 300 receives the incoming speech data, comprising the unknown utterance and meta-data such as caller ID, speaker gender, channel conditions, and the like, from a user at a terminal 110-150 or voice input device 230 through the input device 360.
  • The speech processor 320 then performs the speech recognition based on the appropriate models stored in the storage device 340, or received from the database 310 through the input device 360.
  • The speech processor 320 then routes the recognition results through the output device 380 to the user at the requesting terminal 110-150 or voice input device 230, or to a computer agent that may perform actions appropriate to what the user said.
  • Although FIG. 2 shows a particular form of speech recognition system, it should be understood that other layouts are possible and that the various aspects of the invention are not limited to such a layout.
  • the speech processor 320 may provide recognition results based on data stored in memory 340 or the database 310 .
  • the various exemplary aspects of the present invention are not limited to such layout.
  • FIG. 3 shows an exemplary speech processor 320 embodying the exemplary aspects of the present invention.
  • The speech processor 320 includes a decoder 322, which models the posterior probability of linguistic units relevant to speech recognition using a log-linear model in order to recognize the unknown utterance. That is, from the probabilities determined, the decoder 322 determines the optimal word sequence that has the highest probability, and outputs that word sequence as the recognized output.
  • the decoder may prune the lattice of possible hypotheses to restrict the search space and reduce computation time.
  • the decoder 322 is further connected to a training storage 325 which stores speech data and their true transcriptions for training, and a model storage 327 that stores model parameters obtained from the training operation.
  • FIG. 4 shows the decoder of FIG. 3 in further detail.
  • The decoder 322 includes a feature extractor 3222, a log-linear function 3224, and a search device 3226.
  • During the training operation, training data is input to the decoder 322 along with the true word transcription from the training storage 325; the model parameters are generated and output to the model storage 327, to be used during the speech recognition operation.
  • During the speech recognition operation, unknown speech data is input to the decoder 322 along with the model parameters stored in the model storage 327 during the training operation, and the optimal word sequence is output.
  • In training, data is input to the feature extractor 3222 along with the meta-data and the truth from the training storage 325. The truth can consist of the true transcriptions, which are typically words but can also be other linguistic units such as phrases, syllables, phonemes, acoustic phonetic features, sub-phones, and the like, and possibly, but not necessarily, time alignments that match the linguistic units in the true transcription with the corresponding segments of speech. That is, the training operation is performed to maximize the likelihood of the truth.
  • the feature extractor 3222 extracts a multitude of features from the input data using a multitude of extracting elements.
  • The features may advantageously be asynchronous, overlapping, statistically non-independent, and the like, in accordance with the various exemplary aspects of this invention.
  • The extracting elements include, but are not limited to, a direct matching element, a synchronous phonetic element, an acoustic phonetic element, a linguistic semantic pragmatic features element, and the like.
  • the exemplary direct matching element may compute a dynamic time warping score against various reference speech segments in the database.
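  • As a rough illustration of the direct matching element, the sketch below computes a dynamic time warping cost between an input segment and a few stored reference segments. It is a minimal sketch, not the patent's implementation: the names `dtw_score` and `best_reference`, the use of scalar frames (real systems compare feature vectors), and the absolute-difference frame distance are all assumptions made for this example.

```python
import math


def dtw_score(query, reference):
    """Dynamic time warping cost between two sequences of scalar frames.

    Lower is better. Scalar frames and |a - b| distance are simplifying
    assumptions; real front ends would compare feature vectors.
    """
    n, m = len(query), len(reference)
    # cost[i][j] = best cumulative cost aligning query[:i] with reference[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(query[i - 1] - reference[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # stretch the reference frame
                cost[i][j - 1],      # stretch the query frame
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def best_reference(query, references):
    """Return (label, score) for the stored segment with the lowest DTW cost."""
    scored = ((label, dtw_score(query, seg)) for label, seg in references.items())
    return min(scored, key=lambda pair: pair[1])


# Hypothetical reference segments keyed by label.
references = {"a": [1, 2, 2, 3], "b": [5, 5, 5]}
label, score = best_reference([1, 2, 3], references)
```

  • In a system of the kind described, several such scores (for example, the top matches at word, syllable, or allophone level) could each become one feature of the log-linear model.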
  • Synchronous phonetic features can be derived from traditional features like mel cepstra features.
  • Acoustic phonetic features can be asynchronous features that include linguistic distinctive features such as voicing, place of articulation, and the like.
  • features can also include higher level information extracted from a particular word sequence hypothesis, for example, from a semantic or syntactic parse tree, the pragmatic or semantic coherence, and the like.
  • Features can also be meta-data such as speaker information, speaking rate, channel condition, and the like.
  • The multitude of extracted features is then provided to a log-linear function 3224, which, using the parameters of the log-linear model, can compute the posterior probability of a hypothesized linguistic unit or sequence, given the extracted features and possibly a particular time alignment of the linguistic units to speech data.
  • In the training operation, the correct word sequence is known; for example, it is created by humans transcribing the speech.
  • The true time alignment of any particular unit sequence to the speech may or may not be known.
  • the trainer uses the extracted features, the correct word sequence, or linguistic unit sequence, with possibly time alignments to the speech, and optimizes the parameters of the log-linear model.
  • The log-linear output may be provided to the search device 3226, which can refine and provide a better linguistic unit sequence choice and a more accurate time alignment of the linguistic unit sequence to the speech.
  • This new alignment may then be looped back to the feature extractor 3222 as FEEDBACK to repeat the process for a second time to optimize the model parameters.
  • the initial time alignment may be bootstrapped by human annotation or by hidden Markov model technology.
  • The model parameters corresponding to the maximum likelihood are determined as the training model parameters, and are sent to the model storage 327, where they are stored for the subsequent speech recognition operations.
  • the log linear models are trained using any one of several algorithms, including improved iterative scaling, iterative scaling, preconditioned conjugate gradient, and the like.
  • the training results in optimizing the parameters of the model in terms of some criterion such as maximum likelihood or maximum entropy subject to some constraints.
  • The training is performed by a trainer (not shown) that uses the features provided by the feature extractor, the correct linguistic unit sequence, and the corresponding time alignment to the speech.
  • In an exemplary embodiment, preprocessing by a state-of-the-art hidden Markov model recognition system is used to extract the features and to align the target unit sequences.
  • The hidden Markov model may be used to align the speech frames to optimal sub-phone state sequences, and to determine the top ranked Gaussians. That is, within the hidden Markov model, the Gaussian probability models of traditional features, such as mel cepstra features, that best match the speech frame are determined.
  • The sub-phone state sequences and the ranked Gaussian data are features used to train the log-linear model.
  • Speech data to be recognized is input to the feature extractor 3222 along with the meta-data, and possibly a lattice that comprises the current search space of the search device 3226.
  • This lattice may be pre-generated by well known technology based on hidden Markov models, or may be generated on a previous round of recognition.
  • the lattice is a compact representation of the current set of scores/probabilities of various possible hypotheses considered within the search space.
  • The feature extractor 3222 then extracts a multitude of features from the input data using a multitude of extracting elements. It should be appreciated that the features may be asynchronous, overlapping, statistically non-independent, and the like, in accordance with the various exemplary aspects of this invention.
  • the extracting elements include, but are not limited to, direct matching element, synchronous phonetic element, acoustic phonetic element, linguistic semantic pragmatic features element, and the like.
  • the multitude of extracted features is then provided to a log-linear function 3224 .
  • the search device 3226 is provided to determine the optimal word sequence of all possible word sequences. In an exemplary embodiment, the search device 3226 limits the search to the most promising candidates by pruning out unlikely word sequences.
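  • One common way to limit a search to the most promising candidates is beam pruning. The sketch below is an assumption about how such pruning could look, not a description of the search device 3226 itself; the function name `prune` and the parameters `beam_width` and `max_active` are hypothetical.

```python
def prune(hypotheses, beam_width, max_active):
    """Keep hypotheses whose log score is within `beam_width` of the best,
    capped at `max_active` entries, best first.

    `hypotheses` is a list of (word_sequence, log_score) pairs.
    """
    if not hypotheses:
        return []
    best = max(score for _, score in hypotheses)
    kept = [(h, s) for h, s in hypotheses if s >= best - beam_width]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_active]


# Hypothetical partial hypotheses with log scores.
active = [(("a",), -1.0), (("b",), -2.5), (("c",), -10.0), (("d",), -1.5)]
survivors = prune(active, beam_width=2.0, max_active=2)
```

  • A wider beam or larger cap trades computation time for a lower risk of discarding the hypothesis that would eventually have scored best.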
  • the search device 3226 consults the log-linear function 3224 about the likelihood of entire or partial word or other unit sequences.
  • The search space considered by the search device 3226 may be represented as a lattice that is a compact representation of the hypotheses under active consideration, along with the scores/probabilities. Such a lattice may be an input to the search device, constraining the search space, or an output after work has been done by the search device 3226 to update the probabilities in the lattice or prune out unlikely paths.
  • the search device 3226 may also advantageously combine the probabilities/scores from the log-linear function 3224 with probabilities/scores from other models such as language model, hidden Markov model, and the like in a non-log-linear fashion such as linear interpolation after dynamic range compensation.
  • language model and hidden Markov model information may also be considered features that are combined in the log-linear function 3224 .
  • the output of the search device 3226 is an optimal word sequence with the highest posterior probability among all the hypotheses in the search space.
  • The search device 3226 may also output a highly pruned lattice of highly likely hypotheses, of which an N-best list is an example, that may be utilized by a computer agent to take further action.
  • the search device 3226 may also output a lattice with updated scores and possibly alignments that can be fed back into the feature extractor 3222 and log-linear function 3224 to refine the scores/probabilities. It should be appreciated that, in accordance with the various exemplary embodiments of this invention, this last step may be optional.
  • a single-pass decoding or multiple-pass decoding may be applied, where a lattice, or list of top hypotheses, may be generated in the first pass using a crude model and may be looped back and rescored using the more refined model in a subsequent pass.
  • the probability of each of the word sequences in the lattice is evaluated.
  • The probability of each specific word sequence may be related to the probability of the best alignment of its constituent sub-phone state sequence. It should be appreciated that the optimally aligned state sequence may be found by any of a variety of alignment processes in accordance with the various embodiments of this invention, and that this invention is not limited to any particular alignment.
  • Selecting the word sequence with the highest probability is done using the new model to perform word recognition.
  • the probabilities from various models may be combined heuristically with the probability from the log linear model of the various exemplary embodiments of this invention.
  • Multiple scores may be combined, including the traditional hidden Markov model likelihood score and the language model score, through linear interpolation after dynamic range compensation, with the probability score from the log linear model of the various exemplary embodiments of this invention.
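  • The text does not define how dynamic range compensation is performed. The sketch below assumes, purely for illustration, a min-max rescaling of each model's scores to [0, 1] before a weighted linear interpolation; the function names, the example scores, and the interpolation weights are all hypothetical.

```python
def compress_range(scores):
    """Rescale one model's scores to [0, 1].

    Min-max rescaling is an ASSUMED stand-in for the 'dynamic range
    compensation' mentioned in the text, which does not specify a method.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)  # no spread: treat all hypotheses alike
    return [(s - lo) / (hi - lo) for s in scores]


def interpolate(score_lists, weights):
    """Weighted linear combination of per-hypothesis scores from several models."""
    normalized = [compress_range(scores) for scores in score_lists]
    n = len(score_lists[0])
    return [
        sum(w * column[i] for w, column in zip(weights, normalized))
        for i in range(n)
    ]


# Three hypotheses scored by three models (illustrative numbers only).
log_linear = [0.7, 0.2, 0.1]          # posterior probabilities
hmm = [-120.0, -100.0, -150.0]        # log likelihoods
language_model = [-8.0, -9.0, -12.0]  # log probabilities

combined = interpolate([log_linear, hmm, language_model], [0.5, 0.3, 0.2])
best_index = max(range(len(combined)), key=combined.__getitem__)
```

  • Rescaling first keeps a model with a very large numeric range, such as HMM log likelihoods, from dominating the interpolation regardless of its weight.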
  • the search device 3226 consults the log-linear function 3224 repeatedly in determining the scores/probabilities of different sequences.
  • The lattice is consulted by the search device 3226 to determine which hypotheses to consider.
  • Each path in the lattice corresponds to a word sequence and has an associated probability stored in the lattice.
  • the log linear models are determined based on the posterior probability of a hypothesis given a multitude of speech features.
  • the log linear model allows for the potential combination of multiple features in a unified fashion. For example, asynchronous and overlapping features may be incorporated formally.
  • the posterior probability may be represented as the probability of a sequence associated with a hypothesis given a sequence of acoustic observations:
  • $i$ is the index pointing to the $i$-th word (or unit)
  • $k$ is the number of words (units) in the hypothesis
  • $T$ is the length of the speech signal (e.g., number of frames)
  • $w_1^k$ is the sequence of words associated with the hypothesis $H_j$.
  • $o_1^T$ is the sequence of acoustic observations.
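  • The equation these symbols belong to does not appear in this text. A chain-rule factorization consistent with the definitions above, offered as a reconstruction under that assumption rather than as the original typesetting, would be:

```latex
P(H_j \mid o_1^T) \;=\; P(w_1^k \mid o_1^T) \;=\; \prod_{i=1}^{k} P\!\left(w_i \mid w_1^{i-1},\, o_1^T\right) \tag{1}
```

  • Each factor is the conditional probability of the $i$-th unit given the preceding units and the full observation sequence, which is the quantity the log-linear model is then used to represent.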
  • conditional probabilities may be represented by a maximum entropy log-linear model:
  • $\lambda_i$ are the parameters of the log-linear model
  • $Z$ is the normalization factor that ensures Equation 2 is a true probability (that is, sums to 1).
  • the normalization factors are a function of the conditioned variables.
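  • Equation 2 itself is missing from this text. The standard maximum entropy form that matches the surrounding description is sketched below as a reconstruction. The feature index is written $j$ here, following the later references to features $f_j$, to keep it distinct from the word index $i$; that labeling, and the explicit sum over candidate units $w'$, are assumptions.

```latex
P\!\left(w_i \mid w_1^{i-1},\, o_1^T\right)
  \;=\; \frac{\exp\!\Big(\sum_j \lambda_j\, f_j\!\left(w_i,\, w_1^{i-1},\, o_1^T\right)\Big)}
             {Z\!\left(w_1^{i-1},\, o_1^T\right)},
\qquad
Z\!\left(w_1^{i-1},\, o_1^T\right) \;=\; \sum_{w'} \exp\!\Big(\sum_j \lambda_j\, f_j\!\left(w',\, w_1^{i-1},\, o_1^T\right)\Big) \tag{2}
```

  • Written this way, $Z$ depends only on the conditioning variables, which is what makes the left-hand side sum to 1 over the candidate units.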
  • the speech recognition system shown in FIGS. 1-4 models the posterior probability of linguistic units relevant to speech recognition using a log-linear model.
  • the posterior model captures the probability of the linguistic unit given the observed speech features and the parameters of the posterior model.
  • the posterior model may be used to determine the probability of the word sequence hypotheses given a multitude of speech features.
  • The sequence $w_1^k$ need not be a word sequence, but can also be a sequence of phrases, syllables, phonemes, sub-phone units, and the like associated with the spoken sentence.
  • The model of the various aspects of the present invention may therefore apply at different levels of the linguistic hierarchy, and the features $f_j$ may include many possibilities, including: synchronous and asynchronous, disjoint and overlapping, correlated and uncorrelated, segmental and suprasegmental, acoustic phonetic, hierarchical linguistic, meta-data, higher level knowledge, and the like.
  • the speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features.
  • a feature may be defined as a function f with the following properties:
  • $c_i$ denotes everything the probability is conditioned on, which may include context and observations
  • $b$ is a binary function expressing some property of the conditioned event
  • $w$ is the target (or predicted) state/unit such as a word
  • $\alpha$ is the weight of the function.
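  • The defining equation for the feature function is not present in this text. A form consistent with the four properties listed above, given here as a reconstruction and with the weight symbol $\alpha$ assumed, would be:

```latex
f_{b,w}^{\alpha}(c_i,\, w_i) \;=\;
\begin{cases}
  \alpha & \text{if } b(c_i) = 1 \text{ and } w_i = w, \\
  0      & \text{otherwise.}
\end{cases} \tag{3}
```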
  • A feature is a computable function, conditioned upon context and observation, that may be thought of as firing or becoming active for a specific context/observation and a specific prediction, for example, $w_i$.
  • The weight of the function $\alpha$ may be equal to 1 or 0, or may be real-valued.
  • The weight $\alpha$ may be related to the confidence of whether the property was detected in the speech signal, or the importance of that property.
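  • The feature definition above can be sketched in a few lines. This is a minimal illustration, assuming a context represented as a dictionary and a hypothetical voicing detector; `make_feature`, the `"voicing"` key, and the 0.5 threshold are not from the source.

```python
def make_feature(b, w, alpha=1.0):
    """Build a feature that fires with weight `alpha` when the binary
    property `b` holds for the context AND the predicted unit equals `w`."""
    def feature(context, predicted):
        return alpha if b(context) and predicted == w else 0.0
    return feature


# Hypothetical property: the detector reports voicing in the segment.
def is_voiced(context):
    return context["voicing"] > 0.5


# alpha = 0.8 stands for the confidence in the detected property.
voiced_bat = make_feature(is_voiced, "bat", alpha=0.8)

fires = voiced_bat({"voicing": 0.9}, "bat")        # property holds, unit matches
wrong_unit = voiced_bat({"voicing": 0.9}, "pat")   # unit differs
not_voiced = voiced_bat({"voicing": 0.1}, "bat")   # property absent
```

  • Because each feature only looks at its own property and target unit, features built this way may overlap or be correlated without changing how they are combined in the exponent.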
  • the lattice output from the decoder 322 may consist of more than one score.
  • scores may be obtained of the top predetermined number of matches.
  • other data may be used by the search device 3226 , including such information as the hidden Markov model scores obtained from a hidden Markov model decoder and scores for different match levels of Dynamic Time Warping, such as word vs syllable vs allophone.
  • An exemplary method of combining the different scores is to use a log-linear model and then train the parameters of the log-linear model.
  • The log-linear model for the posterior probability of a path $H_i$ may be given by the exponential of the sum of a linear combination of the different scores:
  • $F_{wj}$ is the $j$-th score feature for the segment spanned by word $w$. For example, if the top 10 Dynamic Time Warping scores and the hidden Markov score obtained by various well known Dynamic Time Warping and hidden Markov model technologies (not explicitly shown in the figures) are returned, then there will be 11 score features for each word in the lattice.
  • $Z$ is the normalization constant given by the sum over all paths ($H_{1 \ldots 3}$) of the exponential term:
  • This normalization ensures that Equation (4) is a true probability, that is, that the path probabilities sum to 1.
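  • Neither Equation (4) nor the expression for $Z$ appears in this text. Forms consistent with the description above are sketched below as a reconstruction; the use of $\lambda_j$ for the weight on the $j$-th score feature and the sum over words $w$ on the path are assumptions drawn from the surrounding definitions.

```latex
P(H_i \mid o_1^T) \;=\; \frac{1}{Z}\, \exp\!\Big(\sum_{w \in H_i} \sum_j \lambda_j\, F_{wj}\Big) \tag{4}

Z \;=\; \sum_{H} \exp\!\Big(\sum_{w \in H} \sum_j \lambda_j\, F_{wj}\Big)
```

  • Here the outer sum in $Z$ runs over every path $H$ in the lattice, so dividing by $Z$ makes the path probabilities sum to 1.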
  • The parameters $\lambda_j$ may be estimated by maximizing the likelihood of the correct path, that is, maximizing the probability of the hypothesis over all the training data.
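  • A minimal sketch of this estimation on a single toy lattice follows. It uses plain gradient ascent, which is a swapped-in simplification (the source names iterative scaling and conjugate gradient methods, among others); the function names, the learning rate, and the feature values are illustrative assumptions.

```python
import math


def path_features(path):
    """Sum the per-word score-feature vectors over the words on a path."""
    dims = len(path[0])
    return [sum(word[j] for word in path) for j in range(dims)]


def posteriors(paths, lam):
    """Posterior of each path: exp(lam . F(path)) / Z, with Z over all paths."""
    logits = [sum(l * f for l, f in zip(lam, path_features(p))) for p in paths]
    top = max(logits)                          # subtract max for stability
    exps = [math.exp(x - top) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def train(paths, correct, steps=200, lr=0.1):
    """Gradient ascent on log P(correct path).

    The gradient for each weight is the observed feature value on the
    correct path minus its expected value under the current model.
    """
    dims = len(paths[0][0])
    lam = [0.0] * dims
    feats = [path_features(p) for p in paths]
    for _ in range(steps):
        post = posteriors(paths, lam)
        for j in range(dims):
            expected = sum(p * f[j] for p, f in zip(post, feats))
            lam[j] += lr * (feats[correct][j] - expected)
    return lam


# Three paths of two words each; each word has 2 score features
# (say, one DTW-derived score and one HMM score; values are made up).
paths = [
    [[0.9, 0.8], [0.7, 0.6]],  # index 0: the correct path
    [[0.2, 0.9], [0.3, 0.5]],
    [[0.4, 0.1], [0.5, 0.2]],
]
weights = train(paths, correct=0)
trained_posteriors = posteriors(paths, weights)
```

  • With all weights at zero the three paths are equally likely; training raises the posterior of the correct path by increasing the weights of the features that distinguish it.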
  • The weight parameters $\lambda_j$ can themselves have dependencies. For example, they could be a function of the length of the word or of the number of training samples for that word, syllable, phone, or the like.
  • Equation (4) may further be generalized to having an exponent which is a weighted sum of general features, each of which is a function of the path $H_i$ and the acoustic observation sequence $o_1^T$.
  • Such general features may also represent non-verbal information, such as whether test and training sequences are from the same gender, same speaker, same noise condition, same phonetic context, etc.
  • The individual word scores $F_{wj}$ may themselves be taken to be posterior word probabilities from a log-linear model.
  • the log-linear models may be calculated quite tractably even using lots of features. Examples of features are Dynamic Time Warping, hidden Markov model, and the like.
  • Log-linear models are used to make the best use of any given set of detected features, without the use of assumptions about features that are not present. That is, in contrast to other models such as hidden Markov models, which require using the same set of features in training and testing operations, the log-linear models make no assumptions about unobserved features, so that if some feature is not observable, due to noise masking for example, the log-linear model will make the best use of the other available features.
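  • One simple way to picture this property: a feature that was not detected contributes nothing to the exponent, and the remaining features still yield a normalized posterior. The sketch below illustrates that reading under stated assumptions; the function `posterior`, the feature names, and the weights are hypothetical, and treating an unobserved feature as a zero contribution is this example's choice rather than something the source specifies.

```python
import math


def posterior(candidates, weights, observed):
    """Log-linear posterior over candidate units using only observed features.

    candidates: {label: {feature_name: value}}
    weights:    {feature_name: lambda}
    observed:   set of feature names actually detected in this utterance
    """
    logits = {
        label: sum(
            weights[name] * value
            for name, value in feats.items()
            if name in observed  # unobserved features are simply skipped
        )
        for label, feats in candidates.items()
    }
    top = max(logits.values())
    exps = {label: math.exp(x - top) for label, x in logits.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}


candidates = {
    "bat": {"dtw": 0.9, "voicing": 1.0},
    "pat": {"dtw": 0.6, "voicing": 0.0},
}
weights = {"dtw": 2.0, "voicing": 1.5}

full = posterior(candidates, weights, observed={"dtw", "voicing"})
masked = posterior(candidates, weights, observed={"dtw"})  # voicing lost to noise
```

  • In this toy case the masked posterior is less confident than the full one but still prefers the same candidate, because the remaining feature is used on its own terms.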
  • The speech recognition system may make full use of known models by training them together with the log linear model, and by using the known models to obtain the first lattice, alignment, or decoding, which is then combined with the log linear model of this invention.
  • A log-linear model is provided that utilizes, among many possible features, the identities of the Gaussians that are the best match to traditional short time spectral features, in a traditional Gaussian mixture model comprising weighted combinations of Gaussian distributions of spectral features such as mel cepstra features, widely used in hidden Markov models, and the matching of speech segments to a large corpus of training data.
  • Advantages may be obtained, such as not requiring all features used in training to appear in testing/recognition operations. That is, with models other than log linear models, if features used for training do not appear in testing, a “mismatched condition” is obtained and performance is poor. Accordingly, usage of models other than a log linear model often results in failure if some features used in training are obscured by noise and are not present in the test data.
  • FIG. 5 shows a flowchart of a method for data training according to the various exemplary aspects of the present invention.
  • Control proceeds to step 5100, where training data and meta-data are input to the decoder.
  • This data contains the speech data typically collected and stored beforehand in the training storage, including the stored truth.
  • meta data may include such information as speaker gender or identity, recording channel, personal profile of speaker, and the like.
  • the truth may generally consist of the true word sequence transcription created by human transcribers.
  • a model is input to the decoder. This model is a general model stored beforehand in the model storage.
  • a prestored lattice is input. Control then proceeds to step 5400 .
  • In step 5400, a multitude of features are extracted and a search is performed. These features include those derived from traditional spectral features such as mel cepstra and time derivatives; acoustic phonetic or articulatory distinctive features such as voicing and place of articulation; scores from dynamic time warping matches to speech segments; higher level information extracted from a particular word sequence hypothesis, for example, from a semantic or syntactic parse tree or from its pragmatic or semantic coherence; and meta-data such as speaking rate and channel condition. It should also be appreciated that some of the features extracted in this step may include log-linear or other models which will be updated in this process.
  • A lattice with scores, objective functions, and auxiliary statistics are determined using a log-linear function according to the various exemplary embodiments of this invention.
  • a plurality of objective functions are calculated in this step due to the fact that a plurality of models are being trained in this process, that is, the log linear model giving the overall score as well as any other models used for feature extraction.
  • the top level objective function is total posterior likelihood, which is to be maximized.
  • the auxiliary statistics calculated in this step may include gradient functions, and other statistics required for optimization using an auxiliary function technique.
  • In step 5500, it is determined whether the objective functions are close enough to optimal. It should be appreciated that there are a plurality of tests for optimality, including thresholds on the increase of the objective functions or on the gradients. If optimality has not been reached, control continues to step 5600, where the models are updated using the auxiliary statistics, and then control returns to step 5200. It is to be appreciated that there are a plurality of methods for updating the models, including but not limited to quasi-Newton gradient search, generalized iterative scaling, extended Baum-Welch, and expectation maximization.
  • In step 5400, efficient implementations may update only a subset of parameters in an iteration, and thus only a restricted calculation need be performed. This restriction may include updating only a single feature extractor.
  • If optimality has been reached, control continues to step 5700, where the model parameters are output. Then, in step 5800, the process ends.
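  • As a non-limiting illustration of the iterate, test, and update loop described for FIG. 5, the following minimal sketch repeats an update until the increase in an objective falls below a threshold. Plain gradient ascent is used here only as a simple stand-in for the update methods the description lists; the names `train`, `toy_objective`, `step_size`, and `tolerance` are hypothetical and do not appear in the specification.

```python
def train(objective_and_gradient, initial_params, step_size=0.1,
          tolerance=1e-10, max_iterations=10_000):
    """Gradient ascent with a threshold test on the objective increase."""
    params = list(initial_params)
    value, gradient = objective_and_gradient(params)
    for _ in range(max_iterations):
        # Update the model parameters using the auxiliary statistics
        # (here, just the gradient).
        params = [p + step_size * g for p, g in zip(params, gradient)]
        new_value, gradient = objective_and_gradient(params)
        # Optimality test: stop when the objective barely changes.
        converged = abs(new_value - value) < tolerance
        value = new_value
        if converged:
            break
    return params, value


def toy_objective(params):
    """Hypothetical concave objective with its maximum at (1, -2)."""
    x, y = params
    value = -(x - 1.0) ** 2 - (y + 2.0) ** 2
    gradient = [-2.0 * (x - 1.0), -2.0 * (y + 2.0)]
    return value, gradient


best_params, best_value = train(toy_objective, [0.0, 0.0])
```

  • In a real trainer the objective would be the total posterior likelihood of the training transcriptions, and the update could be restricted to a subset of the parameters in each iteration.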
  • FIG. 6 shows a flowchart of a method for speech recognition according to the various exemplary aspects of the present invention.
  • Control proceeds to step 6100, where test data is input to the decoder.
  • This test data is received from a user at a remote terminal via a telephone or data network, or at a voice input device.
  • This data may also include meta data such as speaker gender or identity, recording channel, personal profile of speaker, and the like.
  • In step 6200, the model is input. This model is stored in the model storage 327 during the training operation. Then, in step 6300, a prestored hypothesis lattice is input. Control then continues to step 6400.
  • In step 6400, a multitude of features are extracted and a search is performed using a log-linear model of these features. These features include those derived from traditional spectral features. It should also be appreciated that some of the features extracted in this step may be determined using log-linear or other models.
  • In this step, different unit sequence hypotheses along with their corresponding time alignments are explored, and the probabilities of partial and whole sequences are determined. It should be appreciated that the search in this step is constrained by the previously input lattice. The pruned combined results determine an updated lattice with scores. It should be appreciated that a particular embodiment of this updated lattice may be a single most likely hypothesis.
  • In step 6500, it is determined whether another pass is needed. If another pass is needed, then control returns to step 6200. It should be appreciated that the features and models used in subsequent passes may vary.
  • The lattice output in step 6400 may be used as the input lattice in step 6300. Otherwise, if no additional pass is needed, control continues to step 6600, where the optimal word sequence is output. That is, the word sequence corresponding to the hypothesis in the lattice having the highest score is output. It should be appreciated that in an alternative embodiment, the lattice is output. Control then continues to step 6700, where the process ends.
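  • As a non-limiting illustration of the multiple-pass flow of FIG. 6, the following minimal sketch rescores and prunes a set of hypotheses once per pass and outputs the highest-scoring hypothesis at the end. The lattice is simplified to a flat list of hypotheses, and the names `decode`, `crude_score`, `refined_score`, and `beam` are hypothetical and do not appear in the specification.

```python
def decode(hypotheses, scorers, beam=2):
    """Rescore and prune the hypothesis list once per pass."""
    lattice = [(hyp, 0.0) for hyp in hypotheses]
    for scorer in scorers:
        # Each pass may use different features and models.
        rescored = [(hyp, scorer(hyp)) for hyp, _ in lattice]
        rescored.sort(key=lambda item: item[1], reverse=True)
        # The pruned result is the input lattice of the next pass.
        lattice = rescored[:beam]
    best_hypothesis = lattice[0][0]
    return best_hypothesis, lattice


def crude_score(hyp):
    """Hypothetical first-pass score that prefers shorter hypotheses."""
    return -float(len(hyp))


REFINED = {("recognize", "speech"): -0.5,
           ("wreck", "a", "nice", "beach"): -3.0}


def refined_score(hyp):
    """Hypothetical second-pass score looked up from a table."""
    return REFINED.get(hyp, float("-inf"))


candidates = [("recognize", "speech"),
              ("wreck", "a", "nice", "beach"),
              ("wreck", "an", "ice", "peach")]
best, final_lattice = decode(candidates, [crude_score, refined_score])
```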

Abstract

In a speech recognition system, the combination of a log-linear model with a multitude of speech features is provided to recognize unknown speech utterances. The speech recognition system models the posterior probability of linguistic units relevant to speech recognition using a log-linear model. The posterior model captures the probability of the linguistic unit given the observed speech features and the parameters of the posterior model. The posterior model may be determined using the probability of the word sequence hypotheses given a multitude of speech features. Log-linear models are used with features derived from sparse or incomplete data. The speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features. Not all features used in training need to appear in testing/recognition.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • The present application is a continuation of parent application Ser. No. 10/724,536 filed on Nov. 28, 2003.
  • FIELD OF THE INVENTION
  • The present invention relates generally to a speech recognition system, and more particularly, to a speech recognition system that utilizes a multitude of speech features with a log-linear model.
  • BACKGROUND
  • Speech recognition systems are used to identify word sequences from unknown speech utterances. In an exemplary speech recognition system, speech features such as cepstra and delta cepstra features are extracted from the unknown utterance by a feature extractor to characterize the unknown utterance. A search is then done to compare the extracted features of the unknown utterance to models of speech units (such as phrases, words, syllables, phonemes, sub-phones, etc.) to compute the scores or probabilities of different word sequence hypotheses. Typically the search space is restricted by pruning out unlikely hypotheses. The word sequence associated with the highest score, likelihood, or probability is recognized as the unknown utterance. In addition to the acoustic model, a language model that determines the relative likelihood of different word sequences is also used in the calculation of the overall score of the word sequence hypotheses.
  • Through a training operation, the parameters for the speech recognition models are determined. The speech recognition models may be used to model speech as a sequence of acoustic features, or observations produced by an unobservable “true” state sequence of sub-phones, phonemes, syllables, words, phrases, and the like. Model parameters output from the training operation are often estimated to maximize the likelihood of the training observations. The optimum set of parameters for speech recognition is determined by maximizing the likelihood on the training data. The speech recognition system determines the word sequence with the maximum posterior probability given the observed speech signal to recognize the unknown speech utterance. The best word sequence hypothesis is determined through the search process that considers the scores of all possible hypotheses within the search space.
  • SUMMARY OF THE INVENTION
  • In accordance with the exemplary aspects of this invention, a speech recognition system is provided.
  • In accordance with the various exemplary aspects of this invention, the combination of a log-linear model with a multitude of speech features is provided to recognize unknown speech utterances.
  • In accordance with various exemplary aspects of this invention, the speech recognition system models the posterior probability of a hypothesis, that is, the conditional probability of a sequence of linguistic units given the observed speech signal and possibly other information, using a log-linear model.
  • In accordance with these exemplary aspects, the posterior model captures the probability of the sequence of linguistic units given the observed speech features and the parameters of the posterior model.
  • In accordance with these exemplary aspects of this invention, the posterior model may be determined using the probability of the word sequence hypotheses given a multitude of speech features. That is, in accordance with these exemplary aspects, the probability of a word sequence with timing information and labels, given a multitude of speech features, is used to determine the posterior model.
  • In accordance with the various exemplary aspects of this invention, the speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features.
  • In accordance with the various exemplary aspects of this invention, log-linear models are used wherein parameters may be trained with sparse or incomplete training data.
  • In accordance with the various exemplary aspects of this invention, not all features used in training need to appear in testing/recognition.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 shows an exemplary speech processing system embodying the exemplary aspects of the present invention.
  • FIG. 2 shows an exemplary speech recognition system embodying the exemplary aspects of the present invention.
  • FIG. 3 shows an exemplary speech processor embodying the exemplary aspects of the present invention.
  • FIG. 4 shows an exemplary decoder embodying the exemplary aspects of the present invention.
  • FIG. 5 shows a flowchart for data training in accordance with the exemplary aspects of the present invention.
  • FIG. 6 shows a flowchart for speech recognition in accordance with the exemplary aspects of the present invention.
  • DETAILED DESCRIPTION OF THE INVENTION
  • The following description details how exemplary aspects of the present invention are employed. Throughout the description of the invention, reference is made to FIGS. 1-6. When referring to the figures, like structures and elements shown throughout are indicated with like reference numerals.
  • DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • In FIG. 1, an exemplary speech processing system 1000 embodying the exemplary aspects of the present invention is shown. It is initially noted that the speech processing system 1000 of FIG. 1 is presented for illustration purposes only, and is representative of countless configurations in which the exemplary aspects of the present invention may be implemented. Thus, the present invention should not be considered limited to the system configuration shown in the figure.
  • As shown in FIG. 1, the speech processing system 1000 includes a telephone system 210, a voice transport system 220, a voice input device 230, and a server 300. Terminals 110-120 are connected to telephone system 210 via telephone network 215 and terminals 140-150 are connected to voice transport system 220 via data network 225. As shown in FIG. 1, telephone system 210, voice transport system 220, and voice input device 230 are connected to speech recognition system 300. The speech recognition system 300 is also connected to a speech database 310.
  • In operation, speech is sent from a remote user over network 215 or 225 through one of terminals 110-150, or directly from voice input device 230. In response to the input speech, terminals 110-150 run a variety of speech recognition and terminal applications.
  • The speech recognition system 300 receives the input speech and provides the speech recognition results to the inputting terminal or device.
  • The speech recognition system 300 may include or may be connected to a speech database 310 which includes training data, speech models, meta-data, speech data and their true transcription, language and pronunciation models, application specific data, speaker information, various types of models and parameters, and the like. The speech recognition system 300 then provides the optimal word sequence as the recognition output, or it may provide a lattice of word sequence hypotheses with corresponding confidence scores. In accordance with the various exemplary aspects of this invention, lattices may have a plurality of embodiments, including a summary of a set of hypotheses by a graph which may have complex topology. It should be appreciated that if the graph contains loops, the set of hypotheses may be infinite.
  • As discussed above, though the exemplary embodiment above describes speech processing system 1000 in a particular embodiment, the speech processing system 1000 may be any system known in the art for speech processing. Thus, it is contemplated that the speech processing system 1000 may be configured and may include various topologies and protocols known to those skilled in the art.
  • For example, it is to be appreciated that though FIG. 1 only shows 2 terminals and one voice input device, the various exemplary aspects of the present invention are not limited to any particular number of terminals and input devices. Thus, it is contemplated that any number of terminals and input devices may be applied in the present invention.
  • FIG. 2 shows an exemplary speech recognition system 300 embodying the exemplary aspects of the present invention. As shown in FIG. 2, the speech recognition system 300 includes a speech processor 320, a storage device 340, an input device 360 and an output device 380, all connected by bus 395.
  • In operation, the speech processor 320 of speech recognition system 300 receives the incoming speech data, comprising an unknown utterance and meta-data such as caller ID, speaker gender, channel conditions, and the like, from a user at a terminal 110-150 or voice input device 230 through the input device 360. The speech processor 320 then performs the speech recognition based on the appropriate models stored in the storage device 340, or received from the database 310 through the input device 360. The speech processor 320 then routes the recognition results to the user at the requesting terminal 110-150 or voice input device 230, or to a computer agent (that may perform actions appropriate to what the user said), through output device 380.
  • Although FIG. 2 shows a particular form of speech recognition system, it should be understood that other layouts are possible and that the various aspects of the invention are not limited to such layout.
  • In the above exemplary embodiment, the speech processor 320 may provide recognition results based on data stored in the storage device 340 or the database 310. However, it is to be appreciated that the various exemplary aspects of the present invention are not limited to such layout.
  • FIG. 3 shows an exemplary speech processor 320 embodying the exemplary aspects of the present invention. As shown in FIG. 3, the speech processor 320 includes a decoder 322 which utilizes the posterior probability of linguistic units relevant to speech recognition using a log-linear model to provide the recognition of the unknown utterance. That is, from the probabilities determined, the decoder 322 determines the optimal word sequence that has the highest probability, and outputs the word sequence as the recognized output. The decoder may prune the lattice of possible hypotheses to restrict the search space and reduce computation time.
  • The decoder 322 is further connected to a training storage 325 which stores speech data and their true transcriptions for training, and a model storage 327 that stores model parameters obtained from the training operation.
  • FIG. 4 shows the decoder of FIG. 3 in further detail. As shown in FIG. 4, the decoder 322 includes a feature extractor 3222, a log-linear function 3224, and a search device 3226.
  • In operation, during the training operation, training data is input to the decoder 322 along with the true word transcription from the training storage 325, where the model parameters are generated and output to the model storage 327, to be used during the speech recognition operation. During the speech recognition operation, unknown speech data is input to the decoder 322 along with the model parameters stored in the model storage 327 during the training operation, and the optimal word sequence is output.
  • As shown in FIGS. 3-4, during the training operation, training data is input to the feature extractor 3222 along with the meta-data and the truth from the training storage 325. The truth can consist of the true transcriptions, which are typically words but can also be other linguistic units like phrases, syllables, phonemes, acoustic phonetic features, sub-phones, and the like, and possibly but not necessarily time alignments for matching the linguistic units in the true transcription with the corresponding segments of speech. That is, the training operation is performed to determine the maximum likelihood of truth. The feature extractor 3222 extracts a multitude of features from the input data using a multitude of extracting elements. It should be appreciated that the features may be advantageously asynchronous, overlapping, statistically non-independent, and the like, in accordance with the various exemplary aspects of this invention. The extracting elements include, but are not limited to, direct matching element, synchronous phonetic element, acoustic phonetic element, linguistic semantic pragmatic features element, and the like.
  • For example, the exemplary direct matching element may compute a dynamic time warping score against various reference speech segments in the database. Synchronous phonetic features can be derived from traditional features like mel cepstra features. Acoustic phonetic features can be asynchronous features that include linguistic distinctive features such as voicing, place of articulation, and the like.
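  • As a non-limiting illustration of the score a direct matching element might compute, the following minimal sketch implements the classic dynamic time warping recurrence. It assumes one scalar value per frame and an absolute-difference local cost purely for brevity; a practical system would more likely compare vectors of spectral features. The name `dtw_distance` is hypothetical and does not appear in the specification.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of scalar frames."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]
```

  • A lower cost indicates a closer match between the test segment and a reference speech segment, so the cost (or its negative) can serve as one score feature among many.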
  • It should be appreciated that, in accordance with the various exemplary embodiments of this invention, none of these feature extractors need to be perfectly accurate. Features can also include higher level information extracted from a particular word sequence hypothesis, for example, from a semantic or syntactic parse tree, the pragmatic or semantic coherence, and the like. Features can also be meta-data such as speaker information, speaking rate, channel condition, and the like.
  • The multitude of extracted features are then provided to a log-linear function 3224, which, using the parameters of the log-linear model, can compute the posterior probability of a hypothesized linguistic unit or sequence, given the extracted features and possibly a particular time alignment of the linguistic units to speech data.
  • During the training process, the correct word sequence is known, for example, because the correct sequence is created by humans transcribing the speech. However, there may be multiple valid choices of linguistic units, for example, phonemes, that make up the word sequence due to pronunciation variants and the like. All the valid sequences may be compactly represented as a lattice. In addition, the true time alignment of any particular unit sequence to the speech may or may not be known. The trainer (not shown in diagram) uses the extracted features, the correct word sequence, or linguistic unit sequence, with possibly time alignments to the speech, and optimizes the parameters of the log-linear model.
  • Thus, during training, the log-linear output may be provided to the search device 3226, which can refine and provide a better linguistic unit sequence choice and a more accurate time alignment of the linguistic unit sequence to the speech. This new alignment may then be looped back to the feature extractor 3222 as FEEDBACK to repeat the process for a second time to optimize the model parameters. It should be appreciated that the initial time alignment may be bootstrapped by human annotation or by hidden Markov model technology. Thus, the model parameters corresponding to the maximum likelihood are determined as the training model parameters, and are sent to the model storage 327, where they are stored for the subsequent speech recognition operations.
  • In various exemplary embodiments of the present invention, the log linear models are trained using any one of several algorithms, including improved iterative scaling, iterative scaling, preconditioned conjugate gradient, and the like. The training results in optimizing the parameters of the model in terms of some criterion such as maximum likelihood or maximum entropy subject to some constraints. The training is performed by a trainer (not shown) that uses the features provided by the feature extractor 3222, the correct linguistic unit sequence and the corresponding time alignment to the speech.
  • In an exemplary embodiment, preprocessing is performed by a state-of-the-art hidden Markov model recognition system (not shown in the figures) to extract the features and to align the target unit sequences. For example, the hidden Markov model may be used to align the speech frames to optimal sub-phone state sequences, and determine the top ranked Gaussians. That is, within the hidden Markov model, the Gaussian probability models of traditional features such as mel cepstra features that are the best match to the speech frame are pre-determined. In this exemplary embodiment, sub-phone state sequences and the ranked Gaussian data are features used to train the log linear model.
  • It should be understood that this exemplary embodiment is only one specific implementation, and that many other embodiments of training using log linear models may be used in the various aspects of this invention.
  • During the speech recognition operation, speech data to be recognized is input to the feature extractor 3222 along with the meta-data, and possibly a lattice that comprises the current search space of the search device 3226. This lattice may be pre-generated by well known technology based on hidden Markov models, or may be generated on a previous round of recognition. The lattice is a compact representation of the current set of scores/probabilities of various possible hypotheses considered within the search space. The feature extractor 3222 then extracts a multitude of features from the input data using a multitude of extracting elements. It should be appreciated that the features may be asynchronous, overlapping, statistically non-independent, and the like, in accordance with the various exemplary aspects of this invention. The extracting elements include, but are not limited to, direct matching element, synchronous phonetic element, acoustic phonetic element, linguistic semantic pragmatic features element, and the like. The multitude of extracted features is then provided to a log-linear function 3224.
  • The search device 3226 is provided to determine the optimal word sequence of all possible word sequences. In an exemplary embodiment, the search device 3226 limits the search to the most promising candidates by pruning out unlikely word sequences. The search device 3226 consults the log-linear function 3224 about the likelihood of entire or partial word or other unit sequences. The search space considered by the search device 3226 may be represented as a lattice that is a compact representation of the hypotheses under active consideration, along with the scores/probabilities. Such a lattice may be an input to the search device, constraining the search space, or an output after work has been done by the search device 3226 to update the probabilities in the lattice or to prune out unlikely paths. The search device 3226 may also advantageously combine the probabilities/scores from the log-linear function 3224 with probabilities/scores from other models such as language model, hidden Markov model, and the like in a non-log-linear fashion such as linear interpolation after dynamic range compensation. However, language model and hidden Markov model information may also be considered features that are combined in the log-linear function 3224.
  • The output of the search device 3226 is an optimal word sequence with the highest posterior probability among all the hypotheses in the search space. The search device 3226 may also output a highly pruned lattice of highly likely hypotheses, of which an N-best list may be an example, that may be utilized by a computer agent to take further action. The search device 3226 may also output a lattice with updated scores and possibly alignments that can be fed back into the feature extractor 3222 and log-linear function 3224 to refine the scores/probabilities. It should be appreciated that, in accordance with the various exemplary embodiments of this invention, this last step may be optional.
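  • As a non-limiting illustration of reducing a lattice to an N-best list, the following minimal sketch enumerates the paths of a small acyclic lattice and keeps the highest-scoring ones. The lattice encoding (a dictionary from a node to its outgoing `(word, log_probability, next_node)` arcs) and the names `enumerate_paths` and `n_best` are assumptions made for this example. Exhaustive enumeration is only practical for small lattices and does not terminate if the graph contains loops.

```python
import heapq


def enumerate_paths(lattice, node, end):
    """Yield (words, total_log_probability) for every path from node to end."""
    if node == end:
        yield [], 0.0
        return
    for word, log_prob, next_node in lattice.get(node, []):
        for words, score in enumerate_paths(lattice, next_node, end):
            yield [word] + words, log_prob + score


def n_best(lattice, start, end, n):
    """Return the n highest-scoring paths, best first."""
    return heapq.nlargest(n, enumerate_paths(lattice, start, end),
                          key=lambda item: item[1])


# Hypothetical lattice with two competing paths from node 0 to node 3.
toy_lattice = {
    0: [("recognize", -1.0, 1), ("wreck", -2.0, 2)],
    1: [("speech", -0.5, 3)],
    2: [("a nice beach", -1.5, 3)],
}
top = n_best(toy_lattice, 0, 3, 1)
```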
  • As discussed in the above exemplary embodiments, in the speech recognition system of the exemplary aspects of this invention, there are many possible word sequences in the search space consisting theoretically of any sequence of words in the vocabulary, so that an efficient search operation is performed by the decoder 322 to obtain the optimal word sequence. It should be appreciated that, as shown by the feedback loop in FIG. 4, a single-pass decoding or multiple-pass decoding may be applied, where a lattice, or list of top hypotheses, may be generated in the first pass using a crude model and may be looped back and rescored using the more refined model in a subsequent pass.
  • In the multiple-pass decoding, the probability of each of the word sequences in the lattice is evaluated. The probability of each specific word sequence may be related to the probability of the best alignment of its constituent sub-phone state sequence. It should be appreciated that the optimally aligned state sequence may be found in any variety of alignment process in accordance with the various embodiments of this invention, and that this invention is not limited to any particular alignment.
  • Selecting the word sequence with the highest probability is done using the new model to perform word recognition.
  • It should be appreciated that, in accordance with the various exemplary embodiments of this invention, the probabilities from various models may be combined heuristically with the probability from the log linear model of the various exemplary embodiments of this invention. In particular, multiple scores may be combined, including the traditional hidden Markov model likelihood score and the language model score, through linear interpolation after dynamic range compensation, with the probability score from the log linear model of the various exemplary embodiments of this invention.
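  • As a non-limiting illustration of such a heuristic combination, the following minimal sketch linearly interpolates two score lists after bringing them to a common range. The specification does not define dynamic range compensation; min-max rescaling to the interval [0, 1] is an assumption made here, and the names `compensate_range` and `interpolate` are hypothetical.

```python
def compensate_range(scores):
    """Assumed form of range compensation: min-max rescaling to [0, 1]."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def interpolate(score_lists, weights):
    """Weighted sum of rescaled scores, one result per hypothesis."""
    rescaled = [compensate_range(scores) for scores in score_lists]
    count = len(score_lists[0])
    return [sum(w * r[i] for w, r in zip(weights, rescaled))
            for i in range(count)]


# Hypothetical scores for three hypotheses from two models.
hmm_log_likelihoods = [-1200.0, -1100.0, -1000.0]
log_linear_posteriors = [0.25, 0.75, 0.5]
combined = interpolate([hmm_log_likelihoods, log_linear_posteriors],
                       [0.25, 0.75])
best_index = combined.index(max(combined))
```

  • Without the rescaling step, the hidden Markov model log-likelihoods, which span hundreds of units, would dominate the posterior scores, which lie between 0 and 1.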
  • In accordance with the various exemplary embodiments of this invention, the search device 3226 consults the log-linear function 3224 repeatedly in determining the scores/probabilities of different sequences. The lattice is consulted by the search device 3226 to determine what hypothesis to consider. Each path in the lattice corresponds to a word sequence and has an associated probability stored in the lattice.
  • In the above-described exemplary embodiments of the present invention, the log linear models are determined based on the posterior probability of a hypothesis given a multitude of speech features. The log linear model allows for the potential combination of multiple features in a unified fashion. For example, asynchronous and overlapping features may be incorporated formally.
  • As a simple example, the posterior probability may be represented as the probability of a sequence associated with a hypothesis given a sequence of acoustic observations:
  • $$P(H_j \mid \text{features}) = P(w_1^k \mid o_1^T) = \prod_{i=1}^{k} P(w_i \mid w_1^{i-1}, o_1^T), \tag{1}$$
  • where:
  • $H_j$ is the jth hypothesis, which contains a word (or other linguistic unit) sequence $w_1^k = w_1 w_2 \ldots w_k$,
  • $i$ is the index pointing to the ith word (or unit),
  • $k$ is the number of words (units) in the hypothesis,
  • $T$ is the length of the speech signal (e.g., number of frames),
  • $w_1^k$ is the sequence of words associated with the hypothesis $H_j$, and
  • $o_1^T$ is the sequence of acoustic observations.
  • In the above equation (1), the conditional probabilities may be represented by a maximum entropy log-linear model:
  • $$P(w_i \mid w_1^{i-1}, o_1^T) = \frac{\exp\left(\sum_j \lambda_j f_j(w_i, w_1^{i-1}, o_1^T)\right)}{Z(w_1^{i-1}, o_1^T)}, \tag{2}$$
  • where:
  • $\lambda_j$ are the parameters of the log-linear model,
  • $f_j$ are the multitude of features extracted,
  • and
  • Z is the normalization factor that ensures that Equation 2 is a true probability (will sum up to 1). The normalization factors are a function of the conditioned variables.
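  • As a non-limiting illustration of Equations (1) and (2), the following minimal sketch computes the conditional probability of each candidate word from weighted feature functions and multiplies the conditionals along a word sequence. Normalizing over an explicit candidate list, the two toy feature functions, and all function names are assumptions made for this example.

```python
import math


def conditional_probabilities(candidates, history, observations,
                              features, weights):
    """Equation (2): exponentiate the weighted feature sum and normalize."""
    scores = {
        word: sum(lam * f(word, history, observations)
                  for lam, f in zip(weights, features))
        for word in candidates
    }
    top = max(scores.values())  # subtracted for numerical stability
    unnormalized = {w: math.exp(s - top) for w, s in scores.items()}
    z = sum(unnormalized.values())  # depends only on the conditioned variables
    return {w: u / z for w, u in unnormalized.items()}


def sequence_probability(words, vocabulary, observations, features, weights):
    """Equation (1): product of the per-word conditional probabilities."""
    probability = 1.0
    for i, word in enumerate(words):
        conditionals = conditional_probabilities(
            vocabulary, words[:i], observations, features, weights)
        probability *= conditionals[word]
    return probability


def voiced_yes(word, history, observations):
    """Hypothetical feature: fires for "yes" when voicing was detected."""
    return 1.0 if word == "yes" and observations["voiced"] else 0.0


def repeats_previous(word, history, observations):
    """Hypothetical feature: fires when the word repeats the previous one."""
    return 1.0 if history and history[-1] == word else 0.0
```

  • With weights of ln 3 and -ln 3 for the two features and voicing detected, the first word is "yes" with probability 0.75, and the sequence ("yes", "no") has probability 0.75 × 0.5 = 0.375.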
  • As shown in the above exemplary embodiment, in accordance with various exemplary aspects of this invention, the speech recognition system shown in FIGS. 1-4 models the posterior probability of linguistic units relevant to speech recognition using a log-linear model. As shown above, the posterior model captures the probability of the linguistic unit given the observed speech features and the parameters of the posterior model. Thus, the posterior model may be used to determine the probability of the word sequence hypotheses given a multitude of speech features.
  • It should be appreciated that the above representation is just an example, and that, according to the various aspects of the present invention, myriad variations may be applied. For example, the sequence $w_1^k$ need not be a word sequence, but can also be a sequence of phrases, syllables, phonemes, sub-phone units, and the like associated with the spoken sentence. Further, it is to be appreciated that the model of the various aspects of the present invention may therefore apply at different levels of linguistic hierarchy, and that the features $f_j$ may include many possibilities, including: synchronous and asynchronous, disjoint and overlapping, correlated and uncorrelated, segmental and suprasegmental, acoustic phonetic, hierarchical linguistic, meta-data, higher level knowledge, and the like.
  • By modeling in accordance to the various exemplary aspects of this invention, the speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features.
  • In the various aspects of the present invention, a feature may be defined as a function f with the following properties:
  • $$f_{\langle b, w \rangle}(\bar{c}_i, w_i) = \begin{cases} \alpha & \text{if } b(\bar{c}_i) = 1 \text{ and } w = w_i \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
  • where:
  • $\bar{c}_i$ denotes everything the probability is conditioned on, which may include context and observations,
  • b is a binary function expressing some property of the conditioned event, and w is the target (or predicted) state/unit such as a word, and
  • α is the weight of the function.
  • That is, a feature is a computable function that is conditioned upon context and observation, that may be thought of firing or becoming active for a specific context/observation and a specific prediction, for example, wi.
  • It should be appreciated that the weight of the function α may be equal to 1 or 0, or may be real-valued. For example, in an exemplary embodiment, the weight α may be related to the confidence of whether the property was detected in the speech signal, or the importance of that property.
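  • As a non-limiting illustration of Equation (3), the following minimal sketch builds a feature from a binary property b, a target unit w, and a weight α. Representing the conditioning context as a dictionary, the voicing threshold of 0.5, and the names `make_feature` and `is_voiced` are assumptions made for this example.

```python
def make_feature(b, w, alpha=1.0):
    """Return f_<b,w>: alpha when b(context) == 1 and the prediction is w."""
    def feature(context, predicted):
        return alpha if b(context) == 1 and predicted == w else 0.0
    return feature


def is_voiced(context):
    """Hypothetical binary property of the conditioned event."""
    return 1 if context.get("voicing", 0.0) > 0.5 else 0


# The weight 0.8 stands for the confidence that voicing was detected.
voiced_ba = make_feature(is_voiced, "ba", alpha=0.8)
```

  • The feature fires only when both the property holds for the context and the prediction matches the target unit; in every other case it contributes nothing to the log-linear sum.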
  • In accordance with various exemplary aspects of this invention, the lattice output from the decoder 322 may consist of more than one score. For example, scores may be obtained of the top predetermined number of matches. In addition, other data may be used by the search device 3226, including such information as the hidden Markov model scores obtained from a hidden Markov model decoder and scores for different match levels of Dynamic Time Warping, such as word vs syllable vs allophone.
  • An exemplary method of combining the different scores is to use a log-linear model and then train the parameters of the log-linear model.
  • For example, the log-linear model for the posterior probability of a path $H_i$ may be given by the exponential of a linear combination of the different scores:
  • $$P(H_i) = \frac{\exp\left(-\sum_{w \in H_i} \sum_j \alpha_j F_{wj}\right)}{Z}, \tag{4}$$
  • where:
  • $F_{wj}$ is the jth score feature for the segment spanned by word $w$. For example, if the top 10 Dynamic Time Warping scores and the hidden Markov model score obtained by various well known Dynamic Time Warping and hidden Markov model technologies (not explicitly shown in the figures) are returned, then there will be 11 score features for each word in the lattice.
  • $Z$ is the normalization constant given by the sum over all paths (H1 . . . 3) of the exponential term:
  • $$Z = \sum_i \exp\left(-\sum_{w \in H_i} \sum_j \alpha_j F_{wj}\right)$$
  • that is needed to ensure that Equation (4) is a true probability, that is, that it sums to 1.
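  • As a non-limiting illustration of Equation (4), the following minimal sketch turns per-word score features into posterior probabilities over the paths of a lattice. Each path is assumed to be given explicitly as a list of per-word score vectors, and the name `path_posteriors` is hypothetical.

```python
import math


def path_posteriors(paths, alphas):
    """Equation (4): P(H_i) = exp(-sum_w sum_j alpha_j * F_wj) / Z."""
    energies = [
        sum(alpha * score
            for word_scores in path
            for alpha, score in zip(alphas, word_scores))
        for path in paths
    ]
    lowest = min(energies)  # subtracted for numerical stability
    unnormalized = [math.exp(-(e - lowest)) for e in energies]
    z = sum(unnormalized)  # sum of the exponential term over all paths
    return [u / z for u in unnormalized]


# Two hypothetical paths with two score features per word.
path_a = [[0.0, 0.0]]                          # one word
path_b = [[0.0, 0.0], [math.log(3.0), 0.0]]    # two words
posteriors = path_posteriors([path_a, path_b], [1.0, 1.0])
```

  • Because of the negative sign in the exponent, the scores behave like distances: the path with the smaller weighted total receives the larger posterior, here 0.75 against 0.25.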
  • For the lattice generated on training data, the parameters αj may be estimated by maximizing the likelihood of the correct path, that is, maximizing the probability of the hypothesis over all the training data.
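  • As a non-limiting illustration of this estimation, the following minimal sketch raises the log-probability of the correct path by gradient ascent on the weights. Under Equation (4), the gradient for each weight is the expected path-level score under the current model minus the score of the correct path. Plain gradient ascent, the fixed iteration count, and the names `path_totals` and `train_alphas` are assumptions made for this example; no regularization is applied, so the weights can keep growing when the correct path is always separable.

```python
import math


def path_totals(path, n_features):
    """Sum each score feature over the words of a path."""
    return [sum(word[j] for word in path) for j in range(n_features)]


def train_alphas(lattices, n_features, learning_rate=0.1, iterations=200):
    """lattices is a list of (paths, index_of_correct_path) pairs."""
    alphas = [0.0] * n_features
    for _ in range(iterations):
        gradient = [0.0] * n_features
        for paths, correct in lattices:
            totals = [path_totals(p, n_features) for p in paths]
            energies = [sum(a * t for a, t in zip(alphas, tot))
                        for tot in totals]
            lowest = min(energies)
            unnormalized = [math.exp(-(e - lowest)) for e in energies]
            z = sum(unnormalized)
            posteriors = [u / z for u in unnormalized]
            for j in range(n_features):
                expected = sum(p * tot[j]
                               for p, tot in zip(posteriors, totals))
                # d log P(correct) / d alpha_j
                gradient[j] += expected - totals[correct][j]
        alphas = [a + learning_rate * g for a, g in zip(alphas, gradient)]
    return alphas


# One hypothetical training lattice: the correct path (index 0) has the
# smaller distance-like score.
training = [([[[1.0]], [[2.0]]], 0)]
learned = train_alphas(training, n_features=1)
```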
  • It should be appreciated that the above embodiment is merely an exemplary embodiment, and that the above equation (4) may be revised by adding syllable and allophone features since a hierarchical segmentation is available. The weight parameters $\alpha_j$ can have dependencies themselves. For example, they could be a function of the length of the word or of the number of training samples for that word, syllable, phone, or the like.
  • It should further be appreciated that equation (4) may further be generalized to having an exponent which is a weighted sum of general features, each of which is a function of the path $H_i$ and the acoustic observation sequence $o_1^T$.
  • Further, it should be appreciated that other features representing “non-verbal information” (such as whether test and training sequences are from the same gender, same speaker, same noise condition, same phonetic context, etc.) may also be included in this framework, and that the various exemplary aspects of this invention are not limited to the above described embodiments.
  • In other exemplary embodiments, the individual word scores Fwj may themselves be taken to be posterior word probabilities from a log-linear model. The log-linear models may be calculated quite tractably even with a large number of features. Examples of features are Dynamic Time Warping scores, hidden Markov model scores, and the like.
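  • The specification does not show how the Dynamic Time Warping scores are computed. The following is a textbook sketch of a DTW distance over one-dimensional sequences, together with a helper that keeps the k best matches against stored templates. The names `dtw_distance` and `top_k_dtw_scores`, the absolute-difference local cost, and the sample data are assumptions for illustration.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = cost of the best alignment of a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # local frame distance
            d[i][j] = cost + min(d[i - 1][j],      # a advances, b[j-1] reused
                                 d[i][j - 1],      # b advances, a[i-1] reused
                                 d[i - 1][j - 1])  # both sequences advance
    return d[n][m]

def top_k_dtw_scores(segment, templates, k):
    """Return the k smallest DTW distances of a segment to the templates."""
    return sorted(dtw_distance(segment, t) for t in templates)[:k]

# Hypothetical one-dimensional "feature tracks".
segment = [1.0, 2.0, 3.0]
templates = [[1.0, 2.0, 2.0, 3.0], [0.0, 0.0, 0.0], [1.0, 2.0, 4.0]]
scores = top_k_dtw_scores(segment, templates, k=2)
```

  • Each returned distance could then serve as one score feature Fwj for the word spanning the segment.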
  • In accordance with the exemplary aspects of the present invention, log-linear models are used to make the best use of any given set of detected features, without the use of assumptions about features that are not present. That is, in contrast to other models such as the hidden Markov models, which require using the same set of features in training and testing operations, the log-linear models make no assumptions about unobserved features, so that if some feature is not observable, due to noise masking for example, the log-linear model will make the best use of the other available features.
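  • A minimal sketch of this behavior follows, assuming that features are stored by name and that an unobserved feature is simply absent from the sum. The function name, feature names, weights, and values are hypothetical. For readability, this sketch treats features as scores where higher is better, so the exponent carries a positive sign, unlike Equation (4).

```python
import math

def posterior_over_units(candidates, weights):
    """Log-linear posterior over candidate units.

    candidates: dict mapping unit -> dict of observed feature name -> value.
                A feature that was not observed is simply left out.
    weights:    dict mapping feature name -> weight.
    """
    scores = {
        unit: sum(weights.get(name, 0.0) * value
                  for name, value in feats.items())
        for unit, feats in candidates.items()
    }
    m = max(scores.values())  # for numerical stability
    unnorm = {u: math.exp(s - m) for u, s in scores.items()}
    z = sum(unnorm.values())
    return {u: v / z for u, v in unnorm.items()}

weights = {"hmm": 1.0, "dtw": 0.5, "voicing": 2.0}

# All three features observed.
clean = {
    "cat": {"hmm": -1.0, "dtw": -2.0, "voicing": 1.0},
    "cap": {"hmm": -1.2, "dtw": -1.5, "voicing": 0.0},
}
# Same candidates with the voicing feature masked by noise.
noisy = {
    "cat": {"hmm": -1.0, "dtw": -2.0},
    "cap": {"hmm": -1.2, "dtw": -1.5},
}
p_clean = posterior_over_units(clean, weights)
p_noisy = posterior_over_units(noisy, weights)
```

  • In both cases the result is a valid distribution over the candidates. The model needs no retraining or placeholder value when a feature is missing, although the ranking may change because less evidence is available.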
  • In accordance with the exemplary aspects of this invention, the speech recognition system may make full use of the known models by training the known models together with the log-linear model, and by using the known models to obtain a first lattice, alignment, or decoding that is combined with the log-linear model of this invention.
  • In accordance with various exemplary embodiments of this invention, a log-linear model is provided that utilizes, among many possible features, the identities of the Gaussians that best match traditional short-time spectral features in a traditional Gaussian mixture model (comprising weighted combinations of Gaussian distributions of spectral features such as mel cepstra features, widely used in hidden Markov models), as well as the matching of speech segments to a large corpus of training data.
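  • To make the notion of a Gaussian identity feature concrete, the sketch below picks, for each frame, the index of the mixture component with the highest weighted log-density. The diagonal-covariance assumption, the function names, and the two-component toy mixture are assumptions for illustration and are not taken from the specification.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def best_gaussian_ids(frames, gmm):
    """Index of the best-matching mixture component for each frame.

    gmm: list of (weight, mean, var) tuples, one per component.
    """
    ids = []
    for x in frames:
        scores = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        ids.append(max(range(len(gmm)), key=scores.__getitem__))
    return ids

# Hypothetical two-component mixture over two-dimensional frames.
gmm = [
    (0.5, [0.0, 0.0], [1.0, 1.0]),
    (0.5, [3.0, 3.0], [1.0, 1.0]),
]
frames = [[0.1, -0.2], [2.9, 3.2], [0.4, 0.3]]
ids = best_gaussian_ids(frames, gmm)
```

  • The resulting sequence of component indices, one per time frame, is a discrete feature stream that a log-linear model can consume directly.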
  • In accordance with the various exemplary aspects of this invention, advantages such as not necessitating all features used in training to appear in testing/recognition operations may be obtained. That is, with models other than log-linear models, if features used for training do not appear in testing, a “mismatched condition” is obtained and performance is poor. Accordingly, usage of models other than a log-linear model often results in failure if some features used in training are obscured by noise and are not present in the test data.
  • FIG. 5 shows a flowchart of a method for data training according to the various exemplary aspects of the present invention. Beginning at step 5000, control proceeds to step 5100, where training data and meta-data are input to the decoder. This data contains the speech data typically collected and stored beforehand in the training storage, including the stored truth. It should be appreciated that meta-data may include such information as speaker gender or identity, recording channel, personal profile of speaker, and the like. The truth may generally consist of the true word sequence transcription created by human transcribers. Next, in step 5200, a model is input to the decoder. This model is a general model stored beforehand in the model storage. Then in step 5300, a prestored lattice is input. Control then proceeds to step 5400.
  • In step 5400, a multitude of features are extracted and a search is performed. These features include: those derived from traditional spectral features such as mel cepstra and time derivatives; acoustic-phonetic or articulatory distinctive features such as voicing, place of articulation, and the like; scores from dynamic time warping matches to speech segments; higher-level information extracted from a particular word sequence hypothesis, for example, from a semantic or syntactic parse tree, the pragmatic or semantic coherence, and the like; and speaking rate, channel condition, and the like. It should also be appreciated that some of the features extracted in this step may include log-linear or other models which will be updated in this process.
  • In this step, a lattice with scores, objective functions, and auxiliary statistics are determined using a log-linear function according to the various exemplary embodiments of this invention. It should be appreciated that a plurality of objective functions are calculated in this step because a plurality of models are being trained in this process, that is, the log-linear model giving the overall score as well as any other models used for feature extraction. The top-level objective function is the total posterior likelihood, which is to be maximized. It should be appreciated that there may be a plurality of types of objective functions for feature extractors. In various exemplary embodiments, these types of objective functions include posterior likelihood, direct likelihood, distance, and the like.
  • In this step, different unit sequence hypotheses consistent with the true word sequence transcription, along with their corresponding time alignments are explored and the probabilities of partial and whole sequences are determined. The pruned combined results determine an updated lattice with scores.
  • It should be appreciated that, in accordance with the various exemplary aspects of this invention, the auxiliary statistics calculated in this step may include gradient functions, and other statistics required for optimization using an auxiliary function technique.
  • Next, in step 5500, it is determined if the objective functions are close enough to optimal. It should be appreciated that there are a plurality of tests for optimality, including thresholds on increase of objective functions or gradients. If optimality has not been reached, control continues to step 5600, where the models are updated and then control returns to step 5200. In step 5600, the models are updated using the auxiliary statistics. It is to be appreciated that there are a plurality of methods for updating the models, including but not limited to quasi-Newton gradient search, generalized iterative scaling, extended Baum-Welch, and expectation maximization.
  • It should be also appreciated that efficient implementations may only update a subset of parameters in an iteration, and thus, in step 5400, only a restricted calculation need be performed. This restriction may include only updating a single feature extractor.
  • If optimality has been reached, control continues to step 5700, where the model parameters are output. Then, in step 5800, the process ends.
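  • The loop of FIG. 5 can be sketched as follows for the weights of Equation (4). This is a simplified illustration under stated assumptions: plain gradient ascent stands in for the update methods named above, each lattice is reduced to an explicit list of paths with the index of the correct one, and the names `train`, `path_features`, and `posteriors`, the step size, and the tolerance are all hypothetical.

```python
import math

def path_features(path, n_feats):
    # F_j(H): sum over the words w in the path of the score F_wj
    return [sum(word[j] for word in path) for j in range(n_feats)]

def posteriors(feats, alpha):
    # Equation (4), evaluated with a max-shift for numerical stability
    exps = [-sum(a * f for a, f in zip(alpha, fv)) for fv in feats]
    m = max(exps)
    unnorm = [math.exp(e - m) for e in exps]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def train(lattices, n_feats, step=0.1, tol=1e-6, max_iters=1000):
    """lattices: list of (paths, index_of_correct_path) pairs."""
    alpha = [0.0] * n_feats
    prev_obj = -math.inf
    for _ in range(max_iters):
        obj = 0.0                 # total log posterior of the correct paths
        grad = [0.0] * n_feats    # auxiliary statistics: the gradient
        for paths, correct in lattices:              # step 5400
            feats = [path_features(p, n_feats) for p in paths]
            post = posteriors(feats, alpha)
            obj += math.log(post[correct])
            for j in range(n_feats):
                expected = sum(p * fv[j] for p, fv in zip(post, feats))
                grad[j] += expected - feats[correct][j]
        if obj - prev_obj < tol:                     # step 5500
            break
        prev_obj = obj
        alpha = [a + step * g for a, g in zip(alpha, grad)]  # step 5600
    return alpha                                     # step 5700

# Two toy lattices; the correct path (index 0) has a lower first score
# and a higher second score than its competitor.
lattices = [
    ([[[1.0, 2.0]], [[2.0, 1.0]]], 0),
    ([[[0.5, 3.0]], [[3.0, 0.5]]], 0),
]
alpha = train(lattices, 2)
```

  • The optimality test here is a threshold on the increase of the objective, one of the tests mentioned above. On this toy data, training drives the first weight positive and the second negative, which raises the posterior of the correct paths.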
  • FIG. 6 shows a flowchart of a method for speech recognition according to the various exemplary aspects of the present invention. Beginning at step 6000, control proceeds to step 6100, where test data is input to the decoder. In accordance with the various exemplary embodiments of this invention, this test data is received from a user at a remote terminal via a telephone or data network or at a voice input device. This data may also include meta data such as speaker gender or identity, recording channel, personal profile of speaker, and the like. Next, in step 6200, the model is input. This model is stored in the model storage 327 during the training operation. Then, in step 6300, a prestored hypothesis lattice is input. Control then continues to step 6400.
  • In step 6400, a multitude of features are extracted and a search is performed using a log linear model of these features. These features include those derived from traditional spectral features. It should also be appreciated that some of the features extracted in this step may be determined using log-linear or other models.
  • In this step, different unit sequence hypotheses along with their corresponding time alignments are explored and the probabilities of partial and whole sequences are determined. It should be appreciated that this search in this step is constrained by the previous input lattice. The pruned combined results determine an updated lattice with scores. It should be appreciated that a particular embodiment of this updated lattice may be a single best most likely hypothesis.
  • Next, in step 6500, it is determined whether another pass is needed. If another pass is needed, then control returns to step 6200. It should be appreciated that the features and models used in subsequent passes may vary. The lattice output in step 6400 may be used as the input lattice in step 6300. Otherwise, no additional pass is needed, and control continues to step 6600, where the optimal word sequence is output. That is, the word sequence corresponding to the hypothesis in the lattice having the highest score is output. It should be appreciated that in an alternative embodiment, the lattice is output. Control then continues to step 6700, where the process ends.
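  • The multi-pass flow of FIG. 6 can be sketched as below, assuming the lattice is simplified to a flat list of hypotheses and each pass supplies its own scoring function. The name `decode`, the beam size, and the two toy scoring functions are hypothetical stand-ins for the log-linear models that an actual system would apply in each pass.

```python
def decode(lattice, passes, beam=2):
    """Multi-pass rescoring over a simplified lattice.

    lattice: list of hypotheses, each a tuple of words.
    passes:  list of scoring functions, one per pass (higher is better).
    """
    for score in passes:                      # steps 6200 to 6400
        ranked = sorted(lattice, key=score, reverse=True)
        lattice = ranked[:beam]               # pruned, updated lattice
    return lattice[0]                         # step 6600: best hypothesis

hypotheses = [
    ("recognize", "speech"),
    ("wreck", "a", "nice", "beach"),
    ("recognize", "peach"),
]

def first_pass(hyp):
    # Toy score: prefer hypotheses with fewer words.
    return -len(hyp)

def second_pass(hyp):
    # Toy score: prefer hypotheses ending in "speech".
    return 1.0 if hyp[-1] == "speech" else 0.0

best = decode(hypotheses, [first_pass, second_pass])
```

  • Each pass searches only the hypotheses that survived the previous one, which mirrors the constraint that the search is limited by the input lattice.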
  • The foregoing description of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and other modifications and variations may be possible in light of the above teachings. Thus, the embodiments disclosed were chosen and described in order to best explain the principles of the invention and its practical application to enable others skilled in the art to best utilize the invention in various embodiments and modifications as are suited to the particular use. It is intended that the appended claims be construed to include other alternative embodiments of the invention except insofar as limited by the prior art.

Claims (15)

1. A speech recognition system, comprising:
a features extractor that extracts a multitude of speech features directly from input speech;
a log-linear function that receives the multitude of speech features obtained from the input speech and determines a posterior probability of each of a plurality of hypothesized linguistic units given the extracted multitude of speech features, and
a search device that analyzes the posterior probabilities determined by the log-linear function to determine a recognized output of unknown utterances.
2. The speech recognition system of claim 1, wherein the log linear function models the posterior probability using a log linear model.
3. The speech recognition system of claim 1, wherein the speech features comprise at least one of asynchronous, overlapping, and statistically non-independent speech features.
4. The speech recognition system of claim 1, wherein at least one of the speech features extracted is derived from incomplete data.
5. The speech recognition system of claim 1, further comprising a loopback.
6. The speech recognition system of claim 1, wherein the features are extracted using direct matching between test data and training data.
7. The speech recognition system of claim 1, wherein the features are extracted using Gaussian model identities at each time frame.
8. A speech recognition method, comprising:
extracting a multitude of speech features directly from input speech;
using a log linear function for determining a posterior probability of each of a plurality of hypothesized linguistic units given the extracted multitude of speech features, and
determining a recognized output of unknown utterances using the posterior probabilities.
9. The speech recognition method of claim 8, wherein the log linear function models the posterior probability using a log linear model.
10. The speech recognition method of claim 8, wherein the speech features comprise at least one of asynchronous, overlapping, and statistically non-independent speech features.
11. The speech recognition method of claim 8, wherein at least one of the speech features extracted is derived from incomplete data.
12. The speech recognition method of claim 8, further comprising a step of loopback.
13. The speech recognition method of claim 8, wherein the features are extracted using direct matching between test data and training data.
14. The speech recognition method of claim 8, wherein the extracting of a multitude of speech features comprises using Gaussian model identities at each time frame to identify and extract features.
15. A program storage device storing a program of instructions executable by a machine for performing a method of speech recognition, the method comprising:
extracting a multitude of speech features directly from input speech;
using a log linear function for determining a posterior probability of each of a plurality of hypothesized linguistic units given the extracted multitude of speech features, and
determining a recognized output of unknown utterances using the posterior probabilities.
US12/195,123 2003-11-28 2008-08-20 Speech recognition utilizing multitude of speech features Abandoned US20080312921A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/195,123 US20080312921A1 (en) 2003-11-28 2008-08-20 Speech recognition utilizing multitude of speech features

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/724,536 US7464031B2 (en) 2003-11-28 2003-11-28 Speech recognition utilizing multitude of speech features
US12/195,123 US20080312921A1 (en) 2003-11-28 2008-08-20 Speech recognition utilizing multitude of speech features

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/724,536 Continuation US7464031B2 (en) 2003-11-28 2003-11-28 Speech recognition utilizing multitude of speech features

Publications (1)

Publication Number Publication Date
US20080312921A1 true US20080312921A1 (en) 2008-12-18

Family

ID=34620090

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/724,536 Expired - Fee Related US7464031B2 (en) 2003-11-28 2003-11-28 Speech recognition utilizing multitude of speech features
US12/195,123 Abandoned US20080312921A1 (en) 2003-11-28 2008-08-20 Speech recognition utilizing multitude of speech features

Country Status (3)

Country Link
US (2) US7464031B2 (en)
JP (1) JP4195428B2 (en)
CN (1) CN1296886C (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090099841A1 (en) * 2007-10-04 2009-04-16 Kubushiki Kaisha Toshiba Automatic speech recognition method and apparatus
US20100030560A1 (en) * 2006-03-23 2010-02-04 Nec Corporation Speech recognition system, speech recognition method, and speech recognition program
US20120078621A1 (en) * 2010-09-24 2012-03-29 International Business Machines Corporation Sparse representation features for speech recognition
CN104462071A (en) * 2013-09-19 2015-03-25 株式会社东芝 SPEECH TRANSLATION APPARATUS and SPEECH TRANSLATION METHOD
CN108415898A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 The word figure of deep learning language model beats again a point method and system
WO2022102937A1 (en) * 2020-11-12 2022-05-19 Samsung Electronics Co., Ltd. Methods and systems for predicting non-default actions against unstructured utterances
US11373671B2 (en) 2018-09-12 2022-06-28 Shenzhen Shokz Co., Ltd. Signal processing device having multiple acoustic-electric transducers

Families Citing this family (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7899671B2 (en) * 2004-02-05 2011-03-01 Avaya, Inc. Recognition results postprocessor for use in voice recognition systems
US7392187B2 (en) * 2004-09-20 2008-06-24 Educational Testing Service Method and system for the automatic generation of speech features for scoring high entropy speech
US7840404B2 (en) * 2004-09-20 2010-11-23 Educational Testing Service Method and system for using automatic generation of speech features to provide diagnostic feedback
US7809568B2 (en) * 2005-11-08 2010-10-05 Microsoft Corporation Indexing and searching speech with text meta-data
US7831428B2 (en) * 2005-11-09 2010-11-09 Microsoft Corporation Speech index pruning
US7831425B2 (en) * 2005-12-15 2010-11-09 Microsoft Corporation Time-anchored posterior indexing of speech
US8214213B1 (en) * 2006-04-27 2012-07-03 At&T Intellectual Property Ii, L.P. Speech recognition based on pronunciation modeling
US8214208B2 (en) * 2006-09-28 2012-07-03 Reqall, Inc. Method and system for sharing portable voice profiles
US7788094B2 (en) * 2007-01-29 2010-08-31 Robert Bosch Gmbh Apparatus, method and system for maximum entropy modeling for uncertain observations
US7813929B2 (en) * 2007-03-30 2010-10-12 Nuance Communications, Inc. Automatic editing using probabilistic word substitution models
US20090099847A1 (en) * 2007-10-10 2009-04-16 Microsoft Corporation Template constrained posterior probability
US7933847B2 (en) * 2007-10-17 2011-04-26 Microsoft Corporation Limited-memory quasi-newton optimization algorithm for L1-regularized objectives
US8296141B2 (en) * 2008-11-19 2012-10-23 At&T Intellectual Property I, L.P. System and method for discriminative pronunciation modeling for voice search
US9484019B2 (en) 2008-11-19 2016-11-01 At&T Intellectual Property I, L.P. System and method for discriminative pronunciation modeling for voice search
US8401852B2 (en) * 2009-11-30 2013-03-19 Microsoft Corporation Utilizing features generated from phonic units in speech recognition
WO2012023450A1 (en) * 2010-08-19 2012-02-23 日本電気株式会社 Text processing system, text processing method, and text processing program
US8630860B1 (en) * 2011-03-03 2014-01-14 Nuance Communications, Inc. Speaker and call characteristic sensitive open voice search
US8727991B2 (en) 2011-08-29 2014-05-20 Salutron, Inc. Probabilistic segmental model for doppler ultrasound heart rate monitoring
US8909512B2 (en) * 2011-11-01 2014-12-09 Google Inc. Enhanced stability prediction for incrementally generated speech recognition hypotheses based on an age of a hypothesis
CN102376305B (en) * 2011-11-29 2013-06-19 安徽科大讯飞信息科技股份有限公司 Speech recognition method and system
US9324323B1 (en) 2012-01-13 2016-04-26 Google Inc. Speech recognition using topic-specific language models
US8775177B1 (en) * 2012-03-08 2014-07-08 Google Inc. Speech recognition process
CN102810135B (en) * 2012-09-17 2015-12-16 顾泰来 A kind of Medicine prescription auxiliary process system
US9697827B1 (en) * 2012-12-11 2017-07-04 Amazon Technologies, Inc. Error reduction in speech processing
US9653070B2 (en) 2012-12-31 2017-05-16 Intel Corporation Flexible architecture for acoustic signal processing engine
CN105378830A (en) * 2013-05-31 2016-03-02 朗桑有限公司 Processing of audio data
CN103337241B (en) * 2013-06-09 2015-06-24 北京云知声信息技术有限公司 Voice recognition method and device
US9529901B2 (en) * 2013-11-18 2016-12-27 Oracle International Corporation Hierarchical linguistic tags for documents
US9842592B2 (en) * 2014-02-12 2017-12-12 Google Inc. Language models using non-linguistic context
KR20170034227A (en) * 2015-09-18 2017-03-28 삼성전자주식회사 Apparatus and method for speech recognition, apparatus and method for learning transformation parameter
CN106683677B (en) 2015-11-06 2021-11-12 阿里巴巴集团控股有限公司 Voice recognition method and device
US10832664B2 (en) 2016-08-19 2020-11-10 Google Llc Automated speech recognition using language models that selectively use domain-specific model components
JP6585022B2 (en) 2016-11-11 2019-10-02 株式会社東芝 Speech recognition apparatus, speech recognition method and program
US10347245B2 (en) * 2016-12-23 2019-07-09 Soundhound, Inc. Natural language grammar enablement by speech characterization
US20180330718A1 (en) * 2017-05-11 2018-11-15 Mitsubishi Electric Research Laboratories, Inc. System and Method for End-to-End speech recognition
US10607601B2 (en) * 2017-05-11 2020-03-31 International Business Machines Corporation Speech recognition by selecting and refining hot words
US10672388B2 (en) * 2017-12-15 2020-06-02 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for open-vocabulary end-to-end speech recognition
JP7120064B2 (en) * 2019-02-08 2022-08-17 日本電信電話株式会社 Language model score calculation device, language model creation device, methods thereof, program, and recording medium
CN110853669B (en) * 2019-11-08 2023-05-16 腾讯科技(深圳)有限公司 Audio identification method, device and equipment
US11250872B2 (en) * 2019-12-14 2022-02-15 International Business Machines Corporation Using closed captions as parallel training data for customization of closed captioning systems
US11074926B1 (en) 2020-01-07 2021-07-27 International Business Machines Corporation Trending and context fatigue compensation in a voice signal
CN113657461A (en) * 2021-07-28 2021-11-16 北京宝兰德软件股份有限公司 Log anomaly detection method, system, device and medium based on text classification

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6304841B1 (en) * 1993-10-28 2001-10-16 International Business Machines Corporation Automatic construction of conditional exponential models from elementary features
US6456969B1 (en) * 1997-12-12 2002-09-24 U.S. Philips Corporation Method of determining model-specific factors for pattern recognition, in particular for speech patterns
US20030023438A1 (en) * 2001-04-20 2003-01-30 Hauke Schramm Method and system for the training of parameters of a pattern recognition system, each parameter being associated with exactly one realization variant of a pattern from an inventory
US6687690B2 (en) * 2001-06-14 2004-02-03 International Business Machines Corporation Employing a combined function for exception exploration in multidimensional data
US7010486B2 (en) * 2001-02-13 2006-03-07 Koninklijke Philips Electronics, N.V. Speech recognition system, training arrangement and method of calculating iteration values for free parameters of a maximum-entropy speech model
US7054810B2 (en) * 2000-10-06 2006-05-30 International Business Machines Corporation Feature vector-based apparatus and method for robust pattern recognition
US7324927B2 (en) * 2003-07-03 2008-01-29 Robert Bosch Gmbh Fast feature selection method and system for maximum entropy modeling

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0756595A (en) 1993-08-19 1995-03-03 Hitachi Ltd Voice recognition device
US5790754A (en) * 1994-10-21 1998-08-04 Sensory Circuits, Inc. Speech recognition apparatus for consumer electronic applications
CN1141696C (en) * 2000-03-31 2004-03-10 清华大学 Non-particular human speech recognition and prompt method based on special speech recognition chip
JP2002251592A (en) * 2001-02-22 2002-09-06 Toshiba Corp Learning method for pattern recognition dictionary
JP3919475B2 (en) 2001-07-10 2007-05-23 シャープ株式会社 Speaker feature extraction apparatus, speaker feature extraction method, speech recognition apparatus, and program recording medium

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100030560A1 (en) * 2006-03-23 2010-02-04 Nec Corporation Speech recognition system, speech recognition method, and speech recognition program
US8781837B2 (en) * 2006-03-23 2014-07-15 Nec Corporation Speech recognition system and method for plural applications
US20090099841A1 (en) * 2007-10-04 2009-04-16 Kubushiki Kaisha Toshiba Automatic speech recognition method and apparatus
US8311825B2 (en) * 2007-10-04 2012-11-13 Kabushiki Kaisha Toshiba Automatic speech recognition method and apparatus
US20120078621A1 (en) * 2010-09-24 2012-03-29 International Business Machines Corporation Sparse representation features for speech recognition
US8484023B2 (en) * 2010-09-24 2013-07-09 Nuance Communications, Inc. Sparse representation features for speech recognition
CN104462071A (en) * 2013-09-19 2015-03-25 株式会社东芝 SPEECH TRANSLATION APPARATUS and SPEECH TRANSLATION METHOD
CN108415898A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 The word figure of deep learning language model beats again a point method and system
US11373671B2 (en) 2018-09-12 2022-06-28 Shenzhen Shokz Co., Ltd. Signal processing device having multiple acoustic-electric transducers
US11875815B2 (en) 2018-09-12 2024-01-16 Shenzhen Shokz Co., Ltd. Signal processing device having multiple acoustic-electric transducers
WO2022102937A1 (en) * 2020-11-12 2022-05-19 Samsung Electronics Co., Ltd. Methods and systems for predicting non-default actions against unstructured utterances
US11705111B2 (en) 2020-11-12 2023-07-18 Samsung Electronics Co., Ltd. Methods and systems for predicting non-default actions against unstructured utterances

Also Published As

Publication number Publication date
CN1622196A (en) 2005-06-01
US20050119885A1 (en) 2005-06-02
JP4195428B2 (en) 2008-12-10
JP2005165272A (en) 2005-06-23
US7464031B2 (en) 2008-12-09
CN1296886C (en) 2007-01-24

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION