US20140019131A1 - Method of recognizing speech and electronic device thereof - Google Patents

Method of recognizing speech and electronic device thereof

Info

Publication number
US20140019131A1
US 2014/0019131 A1 (Application No. US 13/940,848)
Authority
US
United States
Prior art keywords
phoneme
speech
speech signal
sections
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/940,848
Inventor
Jae-won Lee
Dong-Suk Yook
Hyeon-taek LIM
Tae-Yoon Kim
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Korea University Research and Business Foundation
Original Assignee
Samsung Electronics Co Ltd
Korea University Research and Business Foundation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Samsung Electronics Co Ltd and Korea University Research and Business Foundation
Assigned to SAMSUNG ELECTRONICS CO., LTD. and KOREA UNIVERSITY RESEARCH AND BUSINESS FOUNDATION. Assignors: KIM, TAE-YOON; LEE, JAE-WON; LIM, HYEON-TAEK; YOOK, DONG-SUK
Publication of US20140019131A1
Legal status: Abandoned

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 15/05: Word boundary detection
    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/142: Hidden Markov Models [HMMs]


Abstract

A method of recognizing speech and an electronic device thereof are provided. The method includes: segmenting a speech signal into a plurality of sections at preset time intervals; performing a phoneme recognition with respect to one of the plurality of sections of the speech signal by using a first acoustic model; extracting a candidate word of the one of the plurality of sections of the speech signal by using the phoneme recognition result; and performing a speech recognition with respect to the one of the plurality of sections of the speech signal by using the candidate word.

Description

    CROSS-REFERENCE TO RELATED APPLICATION(S)
  • This application claims priority from Korean Patent Application No. 10-2012-0076809, filed on Jul. 13, 2012, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein by reference in its entirety.
  • BACKGROUND
  • 1. Field
  • Methods and apparatuses consistent with the exemplary embodiments generally relate to recognizing speech, and more particularly, to recognizing speech uttered by a user by using an acoustic model, a language model, and a pronunciation dictionary.
  • 2. Description of the Related Art
  • Speech recognition is used to control electronic devices such as smart phones, navigation systems, and the like. As the hardware performance of electronic devices has improved and users' demands on speech recognition have increased, the user environment has shifted from an isolated word recognition method, which recognizes a user's speech against dozens of predefined commands, to a continuous speech recognition method, which recognizes a plurality of natural languages.
  • The continuous speech recognition method recognizes a word string including at least one word out of hundreds of thousands to millions of words, and forms a search space over all available words. It calculates probabilities by using information including the acoustic model, the language model, and the pronunciation dictionary in order to determine which sentence a given utterance corresponds to, and acquires the recognized sentence according to the calculation result.
  • However, in the continuous speech recognition method, the search space becomes larger and thus the memory requirement increases; because of the increased number of calculations, speech recognition may become considerably slower or even impossible.
  • Accordingly, a speech recognition method which rapidly recognizes a plurality of natural languages is required.
  • SUMMARY
  • Exemplary embodiments address at least the above problems and/or disadvantages and other disadvantages not described above. Also, the exemplary embodiments are not required to overcome the disadvantages described above, and an exemplary embodiment may not overcome any of the problems described above.
  • The exemplary embodiments provide a speech recognition method of further rapidly recognizing a plurality of natural languages, and an electronic device thereof.
  • According to an aspect of an exemplary embodiment, there is provided a method of recognizing speech in an electronic device. The method may comprise: segmenting a speech signal into a plurality of sections at preset time intervals; performing a phoneme recognition with respect to one of the plurality of sections of the speech signal based on a first acoustic model; extracting a candidate word of the one of the plurality of sections of the speech signal based on a result of the phoneme recognition; and performing speech recognition with respect to the one of the plurality of sections based on the candidate word.
  • The performance of the phoneme recognition may further comprise: deleting at least one last phoneme of a plurality of phonemes of the one of the plurality of sections of the speech signal based on a segmented Viterbi algorithm. The at least one deleted phoneme may be used to perform a phoneme recognition with respect to a next section of the speech signal following the one of the plurality of sections.
  • The extraction may comprise: extracting a similar phoneme pronounced similarly to the recognized phoneme; and generating a word graph for extracting the candidate word of the one of the plurality of sections based on the recognized phoneme and the similar phoneme.
  • The performing of the speech recognition may further comprise: calculating a Gaussian probability of the speech signal of the one of the plurality of sections based on a second acoustic model; and outputting a word string having a highest probability in the word graph based on the second acoustic model and a language model.
  • The first and second acoustic models may be different from each other.
  • The performance of the phoneme recognition, the extraction, and the performance of the speech recognition may be performed in parallel through different cores.
  • According to an aspect of another exemplary embodiment, there is provided an electronic device comprising: a speech signal input part configured to receive a speech signal; a speech signal segmenter configured to segment the speech signal input through the speech signal input part into a plurality of sections at preset time intervals; a phoneme recognizer configured to perform a phoneme recognition with respect to one of the plurality of sections of the speech signal based on a first acoustic model; a candidate word extractor configured to extract a candidate word of the one of the plurality of sections of the speech signal based on a result of the phoneme recognition; and a speech recognizer configured to perform speech recognition with respect to the one of the plurality of sections based on the candidate word.
  • The phoneme recognizer is configured to delete at least one last phoneme of a plurality of phonemes of the one of the plurality of sections of the speech signal based on a segmented Viterbi algorithm to perform the phoneme recognition. The at least one deleted phoneme may be used to perform a phoneme recognition with respect to a next section of the speech signal following the one of the plurality of sections.
  • The candidate word extractor is configured to extract a similar phoneme pronounced similarly to the recognized phoneme and generate a word graph for extracting a candidate word of the one of the plurality of sections based on the recognized phoneme and the similar phoneme.
  • The speech recognizer is configured to calculate a Gaussian probability of the speech signal of the one of the plurality of sections based on a second acoustic model and output a word string having a highest probability in the word graph based on the second acoustic model and a language model to perform the speech recognition.
  • The first acoustic model of the phoneme recognizer and the second acoustic model of the speech recognizer may be different from each other.
  • The phoneme recognizer, the candidate word extractor, and the speech recognizer may be realized as different cores of the electronic device.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The above and/or other aspects will be more apparent by describing certain exemplary embodiments with reference to the accompanying drawings, in which:
  • FIG. 1 is a schematic block diagram illustrating a structure of an electronic device for performing speech recognition according to an exemplary embodiment;
  • FIG. 2 is a block diagram illustrating a detailed structure of the electronic device of FIG. 1 for recognizing speech according to an exemplary embodiment;
  • FIG. 3 is a view illustrating a method of processing parallel speech recognition according to an exemplary embodiment; and
  • FIG. 4 is a flowchart illustrating a method of recognizing speech according to an exemplary embodiment.
  • DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS
  • Exemplary embodiments are described in greater detail with reference to the accompanying drawings.
  • In the following description, the same drawing reference numerals are used for the same elements even in different drawings. The matters defined in the description, such as detailed construction and elements, are provided to assist in a comprehensive understanding of the exemplary embodiments. Thus, it is apparent that the exemplary embodiments can be carried out without those specifically defined matters. Also, well-known functions or constructions are not described in detail since they would obscure the exemplary embodiments with unnecessary detail.
  • FIG. 1 is a schematic block diagram illustrating a structure of an electronic device 100 for performing speech recognition according to an exemplary embodiment. Referring to FIG. 1, the electronic device 100 includes a speech signal input part 110, a speech signal segmenter 120, a phoneme recognizer 130, a candidate word extractor 140, and a speech recognizer 150. The electronic device 100 according to the present exemplary embodiment may be realized as various types of electronic devices such as a smart phone, a smart television (TV), a desktop personal computer (PC), a tablet PC, etc. Accordingly, the above-noted elements of the electronic device may take the form of an entirely hardware embodiment such as a processor or circuit(s), an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware.
  • The speech signal input part 110 receives a speech signal corresponding to a speech uttered by a user. The speech signal input part 110 may include a microphone and an amplifier for amplifying the received speech. However, receiving of the speech signal in real time by using the microphone is only an exemplary embodiment, and thus the speech signal input part 110 may receive the speech signal through a pre-stored file.
  • The speech signal segmenter 120 segments the speech signal into a plurality of sections. In detail, the speech signal segmenter 120 may segment the speech signal into the plurality of sections at preset time intervals (e.g., 0.1 second).
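  • As an illustration of this time-based segmentation, the following is a minimal Python sketch. Only the 0.1-second interval comes from the exemplary embodiment above; the function name, signature, and the 16 kHz sample rate in the usage example are assumptions.

```python
import numpy as np

def segment_speech(signal: np.ndarray, sample_rate: int,
                   section_sec: float = 0.1) -> list:
    """Split a 1-D speech signal into fixed-length sections.

    The final section may be shorter if the signal length is not a
    multiple of the section length.
    """
    section_len = int(sample_rate * section_sec)
    return [signal[i:i + section_len]
            for i in range(0, len(signal), section_len)]

# One second of 16 kHz audio yields ten 0.1 s sections.
sections = segment_speech(np.zeros(16000), sample_rate=16000)
assert len(sections) == 10
```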
  • The phoneme recognizer 130 recognizes a phoneme of a speech signal of one of the plurality of sections segmented by the speech signal segmenter 120. In detail, the phoneme recognizer 130 may calculate a Gaussian probability distribution of a characteristic vector corresponding to the speech signal of the one section by using a first acoustic model for phoneme recognition and selects an optimum phoneme.
  • The phoneme recognizer 130 may delete at least the last one of a plurality of phonemes of the speech signal of the one section by using a segmented Viterbi algorithm. In detail, since the speech signal segmenter 120 segments the speech signal in units of time, not in units of phonemes, the phoneme recognizer 130 may not properly recognize a phoneme positioned at the end of the one section. Therefore, the phoneme recognizer 130 deletes at least one phoneme positioned at the end of the one section and outputs the deleted at least one phoneme to the speech signal segmenter 120 so that it can be used when recognizing a phoneme of the next section.
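  • The boundary handling might look like the sketch below. The data layout (phoneme labels paired with their start frames) and both names are hypothetical; the patent only states that the trailing phoneme is deleted and re-recognized together with the next section.

```python
def correct_section_boundary(phonemes: list, start_frames: list):
    """Drop the last phoneme of a section, which the time-based split may
    have cut in half, and report the frame at which the next section's
    recognition should resume so the dropped phoneme is recognized again.
    """
    if len(phonemes) <= 1:
        # Nothing can safely be kept; re-recognize the whole section.
        return [], (start_frames[0] if start_frames else 0)
    return phonemes[:-1], start_frames[-1]

# e.g. correct_section_boundary(["s", "a", "m"], [0, 40, 85])
# -> (["s", "a"], 85): the cut-off "m" is re-recognized with the next section.
```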
  • The candidate word extractor 140 extracts a candidate word of the recognized phoneme by using a phoneme recognition result output from the phoneme recognizer 130. In detail, the candidate word extractor 140 extracts a similar phoneme pronounced similarly to the phoneme output from the phoneme recognizer 130 and generates a word graph of the speech signal of the one section for extracting the candidate word by using the similar phoneme. However, the generation of the word graph of the speech signal of the one section to extract the candidate word is only exemplary, and thus a candidate word list of the speech signal of the one section may be generated instead. The candidate word extractor 140 outputs the word graph of the speech signal of the one section to the speech recognizer 150.
  • The speech recognizer 150 performs speech recognition with respect to the speech signal of the one section by using the candidate word extracted by the candidate word extractor 140. In detail, the speech recognizer 150 may search the word graph output from the candidate word extractor 140 for an optimum path of the speech signal of the one section output from the speech signal segmenter 120 to perform the speech recognition.
  • The phoneme recognizer 130, the candidate word extractor 140, and the speech recognizer 150 may operate in parallel in different cores of a processor or in different processors. In other words, after the phoneme recognizer 130 performs phoneme recognition with respect to a speech signal of a first section, it transmits the recognition result of the first section to the candidate word extractor 140 and performs phoneme recognition with respect to a speech signal of a second section. The candidate word extractor 140 extracts the candidate word based on the phoneme recognition result of the first section, outputs the extracted candidate word to the speech recognizer 150, and then extracts a candidate word by using the phoneme recognition result of the second section output from the phoneme recognizer 130. The speech recognizer 150 performs the speech recognition with respect to the speech signal of the first section by using the candidate word of the first section extracted by the candidate word extractor 140, and then performs the speech recognition with respect to the speech signal of the second section by using the candidate word of the second section.
  • The electronic device 100 described above rapidly performs phoneme recognition with a relatively small number of calculations, extracts a small number of candidate words based on the result of the phoneme recognition, and performs speech recognition by using a noticeably smaller number of candidate words than an existing method of recognizing a plurality of continuous words. Also, the electronic device 100 performs the phoneme recognition, the extraction of the candidate word, and the speech recognition in parallel, so that speech recognition is performed even more rapidly for the user.
  • FIG. 2 is a block diagram illustrating a detailed structure of the electronic device 100 for recognizing a speech according to an exemplary embodiment. Referring to FIG. 2, the electronic device 100 includes the speech signal input part 110, the speech signal segmenter 120, the phoneme recognizer 130, the candidate word extractor 140, and the speech recognizer 150.
  • The speech signal input part 110 receives a speech signal corresponding to a user's speech. The speech signal input part 110 may receive the speech signal in real time from a speech input device such as a microphone. However, this is only an example, and the speech signal input part 110 may receive the speech signal from a file stored in a storage (not shown) of the electronic device 100.
  • The speech signal segmenter 120 segments the speech signal into a plurality of sections at preset time intervals. Here, the speech signal segmenter 120 includes a section segmenter 121, a preprocessor 122, and a characteristic vector extractor 123.
  • The section segmenter 121 segments the speech signal output from the speech signal input part 110 at the preset time intervals (e.g., 0.1 seconds).
  • The preprocessor 122 performs signal-processing, such as noise removal, with respect to a speech signal of one of the plurality of sections.
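  • The patent leaves the exact signal-processing open ("such as noise removal"); a minimal stand-in for the preprocessor 122 could be DC-offset removal plus pre-emphasis, as sketched below. The concrete filter choice is an assumption.

```python
import numpy as np

def preprocess(section: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Illustrative preprocessing: remove the DC offset, then apply a
    first-order pre-emphasis filter to boost high frequencies.
    """
    section = section - section.mean()          # remove DC offset
    return np.append(section[0], section[1:] - alpha * section[:-1])
```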
  • The characteristic vector extractor 123 extracts a characteristic vector from the speech signal of the one section which is preprocessed. The characteristic vector extractor 123 outputs the characteristic vector of the speech signal of the one section to the phoneme recognizer 130 and the speech recognizer 150.
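  • The patent does not name the characteristic-vector type; MFCCs are a conventional choice, so a sketch of the characteristic vector extractor 123 might use librosa as below. The library choice and the 13-coefficient setting are assumptions.

```python
import numpy as np
import librosa  # assumed available; the patent does not prescribe a library

def extract_characteristic_vectors(section: np.ndarray,
                                    sample_rate: int) -> np.ndarray:
    """Return one 13-dimensional MFCC characteristic vector per frame
    of the preprocessed section (result shape: frames x 13).
    """
    mfcc = librosa.feature.mfcc(y=np.asarray(section, dtype=float),
                                sr=sample_rate, n_mfcc=13)
    return mfcc.T
```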
  • The phoneme recognizer 130 performs phoneme recognition by using the characteristic vector extracted by the characteristic vector extractor 123. Here, the phoneme recognizer 130 includes a first Gaussian probability calculator 131, a first acoustic model 132, an optimum candidate searcher 133, and a section segmentation error corrector 134.
  • The first Gaussian probability calculator 131 calculates a Gaussian probability of the characteristic vector of the speech signal of the one section by using the first acoustic model 132.
  • The first acoustic model 132 is an acoustic model for phoneme recognition; in the case of the Korean language, it stores information about 40 to 50 phonemes. The first acoustic model 132 may be a hidden Markov model (HMM) acoustic model. In particular, the first acoustic model 132 may be realized more simply than the acoustic model applied to an existing method of recognizing a plurality of continuous words, to enable rapid speech recognition.
  • The optimum candidate searcher 133 selects optimum phonemes included in the speech signal of the one section based on the calculation results of the first acoustic model 132 and the first Gaussian probability calculator 131.
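  • A minimal sketch of the interplay between the first Gaussian probability calculator 131 and the optimum candidate searcher 133 follows, assuming one diagonal Gaussian per phoneme. This is a simplification: a real HMM acoustic model would score state sequences over time rather than single vectors.

```python
import numpy as np

def log_gaussian(x: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    """Log-likelihood of a characteristic vector under a diagonal Gaussian."""
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var)
                               + (x - mean) ** 2 / var))

def best_phoneme(x: np.ndarray, model: dict) -> str:
    """Pick the phoneme whose Gaussian scores the vector highest,
    standing in for the optimum candidate searcher 133.
    `model` maps a phoneme label to its (mean, var) arrays.
    """
    return max(model, key=lambda p: log_gaussian(x, *model[p]))
```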
  • The section segmentation error corrector 134 deletes at least the last one of the plurality of phonemes selected by the optimum candidate searcher 133. In detail, the speech signal segmenter 120 according to the present exemplary embodiment segments the speech signal based on time, not based on phonemes. Therefore, the last phoneme of the one section may be cut off at the section boundary before all of its data has been input into the phoneme recognizer 130, and thus the at least last one of the plurality of phonemes selected by the optimum candidate searcher 133 may be an incorrectly selected phoneme. Therefore, the section segmentation error corrector 134 deletes the at least last one of the plurality of phonemes selected by the optimum candidate searcher 133 and outputs the phonemes which are not deleted to the candidate word extractor 140. The section segmentation error corrector 134 outputs the at least one deleted phoneme to the section segmenter 121 so that it is recognized again in the next section.
  • The phoneme recognizer 130 according to the present exemplary embodiment deletes the at least last one of the plurality of phonemes selected by the optimum candidate searcher 133 to correct a section segmentation error through the section segmentation error corrector 134. However, this is only an example, and the phoneme recognizer 130 may instead search for the end of a phoneme by using an HMM state position check or a signal processing technique to minimize the section segmentation error.
  • The candidate word extractor 140 extracts a candidate word based on the phoneme of the speech signal of the one section recognized by the phoneme recognizer 130. The candidate word extractor 140 includes a similarity calculator 141 and a section word graph generator 142.
  • The similarity calculator 141 calculates a pronunciation similarity between the phoneme of the speech signal of the one section and other phonemes by using a pronunciation dictionary to extract a similar phoneme pronounced similarly to the phoneme of the speech signal of the one section.
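  • One simple way to realize the similarity calculator 141 is a phoneme confusion table, sketched below. The table entries and the threshold are illustrative values, not data from the patent's pronunciation dictionary.

```python
# Illustrative pronunciation-similarity scores between phoneme pairs.
SIMILARITY = {
    ("b", "p"): 0.8,   # voiced/voiceless stop pair
    ("d", "t"): 0.8,
    ("m", "n"): 0.7,   # nasals
}

def similar_phonemes(phoneme: str, threshold: float = 0.6) -> list:
    """Return phonemes pronounced similarly to `phoneme`."""
    result = []
    for (a, b), score in SIMILARITY.items():
        if score >= threshold:
            if a == phoneme:
                result.append(b)
            elif b == phoneme:
                result.append(a)
    return result
```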
  • The section word graph generator 142 generates a section word graph for generating a candidate word based on extracted similar phonemes. Here, the section word graph may be a network type graph on which recognized phonemes are connected to the similar phonemes. The section word graph generator 142 outputs the section word graph for extracting the candidate word of the speech signal of the one section to an optimum word graph path searcher 153.
  • In the above-described exemplary embodiment, the candidate word extractor 140 generates the section word graph, but this is only exemplary. Therefore, the candidate word extractor 140 may extract candidate words to generate a candidate word list.
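  • Building on the similar_phonemes sketch above, the candidate extraction could be realized as follows. This produces the flat candidate word list that the patent names as an alternative to the graph; the pronunciation-dictionary layout is an assumption, and enumerating every lattice path is only practical for the short phoneme sequences of a single section.

```python
from itertools import product

def candidate_words(recognized: list, pronunciation_dict: dict) -> set:
    """Expand each recognized phoneme with its similar phonemes and look
    up every resulting pronunciation in the dictionary, which is assumed
    to map a phoneme string to a word.
    """
    lattice = [[p] + similar_phonemes(p) for p in recognized]
    found = set()
    for path in product(*lattice):      # every path through the lattice
        word = pronunciation_dict.get("".join(path))
        if word is not None:
            found.add(word)
    return found

# e.g. candidate_words(["b", "a", "d"], {"bad": "bad", "pat": "pat"})
# also tries "pad", "bat", and "pat" via the b/p and d/t confusions.
```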
  • The speech recognizer 150 performs speech recognition with respect to one section by using the candidate words output from the candidate word extractor 140. The speech recognizer 150 includes a second Gaussian probability calculator 151, a second acoustic model 152, the optimum word graph path searcher 153, a language model 154, and a speech recognition output part 155.
  • The second Gaussian probability calculator 151 calculates a Gaussian probability distribution of the speech signal of the one section by using the second acoustic model 152.
  • Here, the second acoustic model 152 is an acoustic model used in a general method of recognizing a plurality of continuous words and may be an acoustic model using triphones. In particular, in order to perform more elaborate speech recognition, the second acoustic model 152 stores a larger amount of data than the first acoustic model 132.
  • The optimum word graph path searcher 153 searches the section word graph output from the section word graph generator 142 for an optimum path corresponding to the speech signal, based on the calculation result of the second Gaussian probability calculator 151. Here, the optimum word graph path searcher 153 may perform the speech recognition by using the language model 154, which stores a grammar and a sentence structure, in order to recognize a sentence more accurately. In other words, the first acoustic model 132 may be an acoustic model specialized for high-speed speech recognition, and the second acoustic model 152 may be an elaborate acoustic model for improving the performance of continuous word speech recognition.
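  • The final decoding step could be sketched as below: each candidate word string is scored by the second acoustic model and a language model, and the highest-scoring string wins. All function and parameter names are hypothetical, and the bigram language model is an assumption; the patent does not specify the language-model order.

```python
import math

def best_word_string(candidates, acoustic_score, bigram_prob):
    """Return the word sequence with the highest combined score, in the
    spirit of the optimum word graph path searcher 153.

    `acoustic_score(words)` is assumed to return the second acoustic
    model's log-likelihood; `bigram_prob` maps (w1, w2) to a probability,
    with a small floor for unseen pairs.
    """
    def lm_score(words):
        return sum(math.log(bigram_prob.get(pair, 1e-9))
                   for pair in zip(words, words[1:]))
    return max(candidates, key=lambda w: acoustic_score(w) + lm_score(w))
```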
  • The speech recognition output part 155 outputs a word string (a sentence) generated by the optimum path searched by the optimum word graph path searcher 153.
  • The phoneme recognizer 130, the candidate word extractor 140, and the speech recognizer 150 may be organized as a pipeline whose stages operate in parallel on different cores. In detail, as shown in FIG. 3, the speech signal segmenter 120 segments a speech signal into N sections and transmits the speech signals of the N sections to the phoneme recognizer 130. The phoneme recognizer 130 performs phoneme recognition with respect to a first section at a time t1. At a time t2, the phoneme recognizer 130 performs phoneme recognition with respect to a second section, and the candidate word extractor 140 extracts a candidate word of the first section. At a time t3, the phoneme recognizer 130 performs phoneme recognition with respect to a third section, the candidate word extractor 140 extracts a candidate word of the second section, and the speech recognizer 150 performs speech recognition with respect to the first section. In this way, the phoneme recognizer 130, the candidate word extractor 140, and the speech recognizer 150 operate in parallel at each time step, and the speech recognizer 150 finishes and outputs the speech recognition results for all sections only a short time tn+2−tn after the user stops uttering (a minimal sketch of such a pipeline follows).
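  • The sketch below mimics the three-stage pipeline of FIG. 3 with threads and queues; all names are illustrative, and the three stage functions stand in for the phoneme recognizer 130, the candidate word extractor 140, and the speech recognizer 150. Note that CPython threads do not give true multi-core parallelism because of the GIL, so a faithful multi-core port would use multiprocessing instead.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the section stream

def run_pipeline(sections, recognize_phonemes, extract_candidates, decode):
    """Run the three stages concurrently: while section n is decoded,
    candidates for section n+1 are extracted and phonemes of section
    n+2 are recognized, mirroring times t1..t3 of FIG. 3.
    """
    q1, q2, results = queue.Queue(), queue.Queue(), []

    def candidate_stage():
        while (item := q1.get()) is not _SENTINEL:
            q2.put(extract_candidates(item))
        q2.put(_SENTINEL)               # propagate end-of-stream

    def decode_stage():
        while (item := q2.get()) is not _SENTINEL:
            results.append(decode(item))

    workers = [threading.Thread(target=candidate_stage),
               threading.Thread(target=decode_stage)]
    for w in workers:
        w.start()
    for section in sections:            # stage 1 runs in the caller's thread
        q1.put(recognize_phonemes(section))
    q1.put(_SENTINEL)
    for w in workers:
        w.join()
    return results
```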
  • As described above, the electronic device 100 performs a phoneme recognition operation, a candidate word extracting operation using phoneme recognition, and a speech recognition operation using a candidate word in parallel. Therefore, the electronic device 100 performs speech recognition more rapidly than an existing method of recognizing a plurality of continuous words.
  • A speech recognition method of the electronic device 100 according to an exemplary embodiment will now be described with reference to FIG. 4.
  • Referring to FIG. 4, in operation S410, the electronic device 100 determines whether a speech signal is input. The speech signal may be input in real time through a speech input device such as a microphone or through a pre-stored file.
  • If it is determined in operation S410 that the speech signal is input, the electronic device 100 segments the speech signal into a plurality of sections at preset time intervals in operation S420. In detail, the electronic device 100 segments the input speech signal into the plurality of sections at the preset time intervals (e.g., 0.1 seconds) and performs signal-processing with respect to a speech signal of one of the plurality of sections to extract a characteristic vector.
  • In operation S430, the electronic device 100 recognizes a phoneme of the speech signal of the one section. In detail, the electronic device 100 recognizes the phoneme of the speech signal of the one section by using a first acoustic model. In order to recognize the phoneme more accurately, the electronic device 100 deletes at least one last phoneme of a plurality of recognized phonemes and uses the at least one deleted phoneme to recognize a phoneme of the speech signal of the next section.
  • In operation S440, the electronic device 100 extracts a candidate word of the speech signal of the one section by using the phoneme recognition result. In detail, the electronic device 100 extracts similar phonemes of the plurality of recognized phonemes and generates a word graph for extracting the candidate word. Here, the word graph is a network type graph on which the recognized phonemes are respectively connected to the similar phonemes.
  • In operation S450, the electronic device 100 performs speech recognition with respect to the speech signal of the one section by using the candidate word. In detail, the electronic device 100 performs speech recognition with respect to the speech signal of the one section by using a second acoustic model and a language model of the candidate word (the word graph) extracted in operation S440.
  • The electronic device 100 may repeatedly perform operations S430 through S450 with respect to speech signals of next sections. The electronic device 100 may repeatedly perform operations S430 through S450 in parallel by using different cores of a processor.
  • As described above, according to the speech recognition method, an electronic device may more rapidly and accurately perform speech recognition than an existing method of recognizing a plurality of continuous words.
  • As will be appreciated by one skilled in the art, aspects of the exemplary embodiments may be embodied as an apparatus, system, method or computer program product. Accordingly, aspects of the exemplary embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the exemplary embodiments may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon, and executed by a hardware processor.
  • Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • The foregoing exemplary embodiments are merely exemplary and are not to be construed as limiting. The present teaching can be readily applied to other types of apparatuses. Also, the description of the exemplary embodiments is intended to be illustrative, and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.

Claims (15)

What is claimed is:
1. A method of recognizing speech in an electronic device, the method comprising:
segmenting a speech signal into a plurality of sections at preset time intervals;
performing a phoneme recognition with respect to one of the plurality of sections of the speech signal based on a first acoustic model;
extracting a candidate word of the one of the plurality of sections of the speech signal based on a result of the phoneme recognition; and
performing speech recognition with respect to the one of the plurality of sections of the speech signal based on the candidate word.
2. The method of claim 1, wherein the performing of the phoneme recognition further comprises:
deleting at least one last phoneme of a plurality of phonemes of the one of the plurality of sections of the speech signal based on a segmented Viterbi algorithm,
wherein the at least one deleted phoneme is used to perform a phoneme recognition with respect to a next section of the speech signal following the one of the plurality of sections.
3. The method of claim 1, wherein the extracting comprises:
extracting a similar phoneme pronounced similarly to the recognized phoneme; and
generating a word graph for extracting the candidate word of the one of the plurality of sections based on the recognized phoneme and the similar phoneme.
4. The method of claim 3, wherein the performing of the speech recognition comprises:
calculating a Gaussian probability of the speech signal of the one of the plurality of sections based on a second acoustic model; and
outputting a word string having a highest probability in the word graph based on the second acoustic model and a language model.
5. The method of claim 4, wherein the first and second acoustic models are different from each other.
6. The method of claim 1, wherein the performing of the phoneme recognition, the extracting of the candidate word, and the performing of the speech recognition are performed in parallel by different cores of the electronic device.
7. An electronic device comprising:
a speech signal input part configured to receive a speech signal;
a speech signal segmenter configured to segment the speech signal input through the speech signal input part into a plurality of sections at preset time intervals;
a phoneme recognizer configured to perform a phoneme recognition with respect to one of the plurality of sections of the speech signal based on a first acoustic model;
a candidate word extractor configured to extract a candidate word of the one of the plurality of sections of the speech signal based on a result of the phoneme recognition; and
a speech recognizer configured to perform speech recognition with respect to the one of the plurality of sections of the speech signal based on the candidate word.
8. The electronic device of claim 7, wherein the phoneme recognizer is configured to delete at least one last phoneme of a plurality of phonemes of the one of the plurality of sections of the speech signal based on a segmented Viterbi algorithm to perform the phoneme recognition,
wherein the at least one deleted phoneme is used to perform a phoneme recognition with respect to a next section of the speech signal following the one of the plurality of sections.
9. The electronic device of claim 7, wherein the candidate word extractor is configured to extract a similar phoneme pronounced similarly to the recognized phoneme and generate a word graph for extracting a candidate word of the one of the plurality of sections based on the recognized phoneme and the similar phoneme.
10. The electronic device of claim 9, wherein the speech recognizer is configured to calculate a Gaussian probability of the speech signal of the one of the plurality of sections based on a second acoustic model and output a word string having a highest probability in the word graph based on the second acoustic model and a language model to perform the speech recognition.
11. The electronic device of claim 10, wherein the first acoustic model of the phoneme recognizer and the second acoustic model of the speech recognizer are different from each other.
12. The electronic device of claim 7, wherein the phoneme recognizer, the candidate word extractor, and the speech recognizer are realized as different cores of the electronic device.
13. A method of recognizing speech in an electronic device, the method comprising:
receiving a speech signal;
segmenting the received speech signal into a plurality of sections;
performing phoneme recognition on a first section of the plurality of sections at a first time;
performing phoneme recognition on a second section of the plurality of sections, and extracting a candidate word of the first section at a second time; and
performing phoneme recognition on a third section of the plurality of sections, extracting a candidate word of the second section, and performing speech recognition on the first section of the plurality of sections at a third time.
14. The method of claim 13, wherein a phoneme recognition operation, a candidate word extracting operation based on a recognized phoneme, and speech recognition based on the candidate word are performed in parallel.
15. The method of claim 14, wherein the phoneme recognition operation, the candidate word extracting operation, and the speech recognition operation are performed through different cores.
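For illustration of the boundary handling recited in claims 2 and 8, the following is a minimal sketch of deleting at least one last phoneme of a section and carrying its frames into the next section's phoneme recognition. The decoder interface, the frame representation, and the delete-exactly-one-trailing-phoneme policy are assumptions made for the example; this is not the segmented Viterbi algorithm of the thesis cited below.

```python
# Hypothetical sketch: the trailing phoneme hypothesized for a section may
# be cut by the section boundary, so it is deleted and its frames are
# re-recognized together with the next section.

def phoneme_recognition_with_carryover(sections, decode):
    """sections: one list of feature frames per fixed-length section.
    decode(frames) -> [(phoneme, start_frame, end_frame), ...] is a
    hypothetical first-acoustic-model phoneme decoder."""
    carried = []  # frames of the deleted trailing phoneme
    for i, frames in enumerate(sections):
        frames = carried + frames  # prepend the carried-over frames
        hypothesis = decode(frames)
        is_last_section = i == len(sections) - 1
        if not is_last_section and len(hypothesis) > 1:
            # Delete the trailing phoneme and keep its frames for the
            # next section's recognition pass.
            *kept, (_, start, _) = hypothesis
            carried = frames[start:]
            yield [p for p, _, _ in kept]
        else:
            carried = []
            yield [p for p, _, _ in hypothesis]


if __name__ == "__main__":
    # Toy decoder that pretends every three frames form one phoneme.
    def toy_decode(frames):
        n = (len(frames) + 2) // 3
        return [(f"ph{j}", j * 3, min(len(frames), (j + 1) * 3))
                for j in range(n)]

    sections = [list(range(9)), list(range(9, 18)), list(range(18, 23))]
    for recognized in phoneme_recognition_with_carryover(sections, toy_decode):
        print(recognized)
```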
US13/940,848 2012-07-13 2013-07-12 Method of recognizing speech and electronic device thereof Abandoned US20140019131A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020120076809A KR20140028174A (en) 2012-07-13 2012-07-13 Method for recognizing speech and electronic device thereof
KR10-2012-0076809 2012-07-13

Publications (1)

Publication Number Publication Date
US20140019131A1 true US20140019131A1 (en) 2014-01-16

Family

ID=48700451

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/940,848 Abandoned US20140019131A1 (en) 2012-07-13 2013-07-12 Method of recognizing speech and electronic device thereof

Country Status (4)

Country Link
US (1) US20140019131A1 (en)
EP (1) EP2685452A1 (en)
KR (1) KR20140028174A (en)
CN (1) CN103544955B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036774B * 2014-06-20 2018-03-06 国家计算机网络与信息安全管理中心 Tibetan dialect recognition method and system
US20160063990A1 (en) * 2014-08-26 2016-03-03 Honeywell International Inc. Methods and apparatus for interpreting clipped speech using speech recognition
KR102267405B1 (en) * 2014-11-21 2021-06-22 삼성전자주식회사 Voice recognition apparatus and method of controlling the voice recognition apparatus
CN104851220A (en) * 2014-11-22 2015-08-19 重庆市行安电子科技有限公司 Automatic alarm system
CN105700389B (en) * 2014-11-27 2020-08-11 青岛海尔智能技术研发有限公司 Intelligent home natural language control method
KR102396983B1 (en) * 2015-01-02 2022-05-12 삼성전자주식회사 Method for correcting grammar and apparatus thereof
KR102386854B1 (en) * 2015-08-20 2022-04-13 삼성전자주식회사 Apparatus and method for speech recognition based on unified model
CN106297784A * 2016-08-05 2017-01-04 Method and system for quick-response speech recognition playback on an intelligent terminal
CN109961775A (en) * 2017-12-15 2019-07-02 中国移动通信集团安徽有限公司 Accent recognition method, apparatus, equipment and medium based on HMM model
CN109215630B (en) * 2018-11-14 2021-01-26 北京羽扇智信息科技有限公司 Real-time voice recognition method, device, equipment and storage medium
CN110176237A (en) * 2019-07-09 2019-08-27 北京金山数字娱乐科技有限公司 A kind of audio recognition method and device
CN110570842B (en) * 2019-10-25 2020-07-10 南京云白信息科技有限公司 Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree
KR102345754B1 (en) * 2019-12-31 2021-12-30 주식회사 포스코아이씨티 Speech Recognition Model Management System for Training Speech Recognition Model
CN111091849B * 2020-03-03 2020-12-22 龙马智芯(珠海横琴)科技有限公司 Snore identification method and device, storage medium, snore stopping device, and processor
CN111553726B * 2020-04-22 2023-04-28 上海海事大学 HMM-based order-brushing prediction system and method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU1520000A (en) * 1998-11-25 2000-06-13 Sony Electronics Inc. Method and apparatus for very large vocabulary isolated word recognition in a parameter sharing speech recognition system
JP2002366187A (en) * 2001-06-08 2002-12-20 Sony Corp Device and method for recognizing voice, program and recording medium
CN100465043C * 2004-07-27 2009-03-04 日本塑料株式会社 Cowl cover for a vehicle

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2240203A (en) * 1990-01-18 1991-07-24 Apple Computer Automated speech recognition system
US5699456A (en) * 1994-01-21 1997-12-16 Lucent Technologies Inc. Large vocabulary connected speech recognition system and method of language representation using evolutional grammar to represent context free grammars
US7310600B1 (en) * 1999-10-28 2007-12-18 Canon Kabushiki Kaisha Language recognition using a similarity measure
US6961701B2 (en) * 2000-03-02 2005-11-01 Sony Corporation Voice recognition apparatus and method, and recording medium
US20050080615A1 (en) * 2000-06-01 2005-04-14 Microsoft Corporation Use of a unified language model
US20020178005A1 (en) * 2001-04-18 2002-11-28 Rutgers, The State University Of New Jersey System and method for adaptive language understanding by computers
US7587321B2 (en) * 2001-05-08 2009-09-08 Intel Corporation Method, apparatus, and system for building context dependent models for a large vocabulary continuous speech recognition (LVCSR) system
US20120116768A1 (en) * 2001-07-12 2012-05-10 At&T Intellectual Property Ii, L.P. Systems and Methods for Extracting Meaning from Multimodal Inputs Using Finite-State Devices
US7089188B2 (en) * 2002-03-27 2006-08-08 Hewlett-Packard Development Company, L.P. Method to expand inputs for word or document searching
US20040128132A1 (en) * 2002-12-30 2004-07-01 Meir Griniasty Pronunciation network
US7698136B1 (en) * 2003-01-28 2010-04-13 Voxify, Inc. Methods and apparatus for flexible speech recognition
US20050010412A1 (en) * 2003-07-07 2005-01-13 Hagai Aronowitz Phoneme lattice construction and its application to speech recognition and keyword spotting
US20070033003A1 (en) * 2003-07-23 2007-02-08 Nexidia Inc. Spoken word spotting queries
US20080255839A1 (en) * 2004-09-14 2008-10-16 Zentian Limited Speech Recognition Circuit and Method
US7930180B2 (en) * 2005-01-17 2011-04-19 Nec Corporation Speech recognition system, method and program that generates a recognition result in parallel with a distance value
US20080294441A1 (en) * 2005-12-08 2008-11-27 Zsolt Saffer Speech Recognition System with Huge Vocabulary
US20070271241A1 (en) * 2006-05-12 2007-11-22 Morris Robert W Wordspotting system
US7831424B2 (en) * 2006-07-07 2010-11-09 International Business Machines Corporation Target specific data filter to speed processing
US8000964B2 (en) * 2007-12-12 2011-08-16 Institute For Information Industry Method of constructing model of recognizing english pronunciation variation
US20100241418A1 (en) * 2009-03-23 2010-09-23 Sony Corporation Voice recognition device and voice recognition method, language model generating device and language model generating method, and computer program
US20110166855A1 (en) * 2009-07-06 2011-07-07 Sensory, Incorporated Systems and Methods for Hands-free Voice Control and Voice Search

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cho, Seonho, "Segmented Viterbi Algorithm for Speech Recognition," M.A. thesis, Korea University, June 2011 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150234937A1 (en) * 2012-09-27 2015-08-20 Nec Corporation Information retrieval system, information retrieval method and computer-readable medium
US20210272551A1 (en) * 2015-06-30 2021-09-02 Samsung Electronics Co., Ltd. Speech recognition apparatus, speech recognition method, and electronic device
US10074361B2 (en) 2015-10-06 2018-09-11 Samsung Electronics Co., Ltd. Speech recognition apparatus and method with acoustic modelling
US10607603B2 (en) 2015-10-06 2020-03-31 Samsung Electronics Co., Ltd. Speech recognition apparatus and method with acoustic modelling
US11176926B2 (en) 2015-10-06 2021-11-16 Samsung Electronics Co., Ltd. Speech recognition apparatus and method with acoustic modelling
US20170229124A1 (en) * 2016-02-05 2017-08-10 Google Inc. Re-recognizing speech with external data sources
CN110808032A (en) * 2019-09-20 2020-02-18 平安科技(深圳)有限公司 Voice recognition method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
EP2685452A1 (en) 2014-01-15
KR20140028174A (en) 2014-03-10
CN103544955B (en) 2018-09-25
CN103544955A (en) 2014-01-29

Similar Documents

Publication Publication Date Title
US20140019131A1 (en) Method of recognizing speech and electronic device thereof
KR100755677B1 (en) Apparatus and method for dialogue speech recognition using topic detection
US8731926B2 (en) Spoken term detection apparatus, method, program, and storage medium
JP5480760B2 (en) Terminal device, voice recognition method and voice recognition program
US10319373B2 (en) Information processing device, information processing method, computer program product, and recognition system
KR101590724B1 (en) Method for modifying error of speech recognition and apparatus for performing the method
KR102396983B1 (en) Method for correcting grammar and apparatus thereof
US9672820B2 (en) Simultaneous speech processing apparatus and method
US8849668B2 (en) Speech recognition apparatus and method
WO2014183373A1 (en) Systems and methods for voice identification
CN112331229B (en) Voice detection method, device, medium and computing equipment
KR20170007107A (en) Speech Recognition System and Method
US20160232892A1 (en) Method and apparatus of expanding speech recognition database
JP6690484B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
Patel et al. Cross-lingual phoneme mapping for language robust contextual speech recognition
Serrino et al. Contextual Recovery of Out-of-Lattice Named Entities in Automatic Speech Recognition.
JP5688761B2 (en) Acoustic model learning apparatus and acoustic model learning method
US20110224985A1 (en) Model adaptation device, method thereof, and program thereof
US20110218802A1 (en) Continuous Speech Recognition
KR101424496B1 (en) Apparatus for learning Acoustic Model and computer recordable medium storing the method thereof
KR101483947B1 (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
US9953638B2 (en) Meta-data inputs to front end processing for automatic speech recognition
KR102299269B1 (en) Method and apparatus for building voice database by aligning voice and script
JP4595415B2 (en) Voice search system, method and program
KR20110119478A (en) Apparatus for speech recognition and method thereof

Legal Events

Date Code Title Description
AS Assignment

Owner name: KOREA UNIVERSITY RESEARCH AND BUSINESS FOUNDATION,

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LEE, JAE-WON;YOOK, DONG-SUK;LIM, HYEON-TAEK;AND OTHERS;SIGNING DATES FROM 20130618 TO 20130710;REEL/FRAME:030788/0888

Owner name: SAMSUNG ELECTRONICS CO., LTD., KOREA, REPUBLIC OF

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LEE, JAE-WON;YOOK, DONG-SUK;LIM, HYEON-TAEK;AND OTHERS;SIGNING DATES FROM 20130618 TO 20130710;REEL/FRAME:030788/0888

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION