US20040210434A1 - System and iterative method for lexicon, segmentation and language model joint optimization - Google Patents

System and iterative method for lexicon, segmentation and language model joint optimization

Info

Publication number
US20040210434A1
US20040210434A1 (application US10/842,264)
Authority
US
United States
Prior art keywords
language model
lexicon
corpus
executable instructions
subset
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/842,264
Inventor
Hai-Feng Wang
Chang-Ning Huang
Kai-Fu Lee
Shuo Di
Jianfeng Gao
Dong-Feng Cai
Lee-Feng Chien
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Application filed by Microsoft Corp
Priority to US10/842,264
Publication of US20040210434A1
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC. Assignment of assignors interest (see document for details). Assignors: MICROSOFT CORPORATION
Status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 Probabilistic grammars, e.g. word n-grams

Definitions

  • This invention generally relates to language modeling and, more specifically, to a system and iterative method for lexicon, word segmentation and language model joint optimization.
  • a language model measures the likelihood of any given sentence. That is, a language model can take any sequence of items (words, characters, letters, etc.) and estimate the probability of the sequence.
  • a common approach to building a prior art language model is to utilize a prefix tree-like data structure to build an N-gram language model from a known training set of a textual corpus.
  • a prefix tree data structure (a.k.a. a suffix tree, or a PAT tree) enables a higher-level application to quickly traverse the language model, providing the substantially real-time performance characteristics described above.
  • the N-gram language model counts the number of occurrences of a particular item (word, character, etc.) in a string (of size N) throughout a text. The counts are used to calculate the probability of the use of the item strings.
  • the N-gram language model is limited in a number of respects.
  • the memory required to store the N-gram language model, and the access time required to utilize a large N-gram language model, are prohibitively large for N-grams larger than three (i.e., a tri-gram).
  • a fixed lexicon limits the ability of the model to select the best words in general or specific to a task. If a word is not in the lexicon, it does not exist as far as the model is concerned. Thus, a small lexicon is not likely to cover the intended linguistic content.
  • segmentation algorithms are often ad-hoc and not based on any statistical or semantic principles.
  • a simplistic segmentation algorithm typically errs in favor of larger words over smaller words.
  • the model is unable to accurately predict smaller words contained within larger, lexically acceptable strings.
  • This invention concerns a system and iterative method for lexicon, segmentation and language model joint optimization.
  • the present invention does not rely on a predefined lexicon or segmentation algorithm, rather the lexicon and segmentation algorithm are dynamically generated in an iterative process of optimizing the language model.
  • a method for improving language model performance comprising developing an initial language model from a lexicon and segmentation derived from a received corpus using a maximum match technique, and iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved.
  • FIG. 1 is a block diagram of a computer system incorporating the teachings of the present invention
  • FIG. 2 is a block diagram of an example modeling agent to iteratively develop a lexicon, segmentation and language model, according to one implementation of the present invention
  • FIG. 3 is a graphical representation of a DOMM tree according to one aspect of the present invention.
  • FIG. 4 is a flow chart of an example method for building a DOMM tree
  • FIG. 5 is a flow chart of an example method for lexicon, segmentation and language model joint optimization, according to the teachings of the present invention
  • FIG. 6 is a flow chart detailing the method steps for generating an initial lexicon, and iteratively altering a dynamically generated lexicon, segmentation and language model until convergence, according to one implementation of the present invention.
  • FIG. 7 is a storage medium with a plurality of executable instructions which, when executed, implement the innovative modeling agent of the present invention, according to an alternate embodiment of the present invention.
  • in describing the present invention, an innovative language model, the Dynamic Order Markov Model (DOMM), is referenced.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • program modules may be located in both local and remote memory storage devices. It is noted, however, that modification to the architecture and methods described herein may well be made without deviating from the spirit and scope of the present invention.
  • FIG. 1 illustrates an example computer system 102 including an innovative language modeling agent 104 , to jointly optimize a lexicon, segmentation and language model according to the teachings of the present invention.
  • language modeling agent 104 may well be implemented as a function of an application, e.g., word processor, web browser, speech recognition system, etc.
  • innovative modeling agent may well be implemented in hardware, e.g., a programmable logic array (PLA), a special purpose processor, an application specific integrated circuit (ASIC), microcontroller, and the like.
  • computer 102 is intended to represent any of a class of general or special purpose computing platforms which, when endowed with the innovative language modeling agent (LMA) 104 , implement the teachings of the present invention in accordance with the first example implementation introduced above.
  • computer system 102 may alternatively support a hardware implementation of LMA 104 as well.
  • but for the description of LMA 104 , the following description of computer system 102 is intended to be merely illustrative, as computer systems of greater or lesser capability may well be substituted without deviating from the spirit and scope of the present invention.
  • computer 102 includes one or more processors or processing units 132 , a system memory 134 , and a bus 136 that couples various system components including the system memory 134 to processors 132 .
  • the bus 136 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
  • the system memory includes read only memory (ROM) 138 and random access memory (RAM) 140 .
  • a basic input/output system (BIOS) 142 containing the basic routines that help to transfer information between elements within computer 102 , such as during start-up, is stored in ROM 138 .
  • Computer 102 further includes a hard disk drive 144 for reading from and writing to a hard disk, not shown, a magnetic disk drive 146 for reading from and writing to a removable magnetic disk 148 , and an optical disk drive 150 for reading from or writing to a removable optical disk 152 such as a CD ROM, DVD ROM or other such optical media.
  • the hard disk drive 144 , magnetic disk drive 146 , and optical disk drive 150 are connected to the bus 136 by a SCSI interface 154 or some other suitable bus interface.
  • the drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for computer 102 .
  • a number of program modules may be stored on the hard disk 144 , magnetic disk 148 , optical disk 152 , ROM 138 , or RAM 140 , including an operating system 158 , one or more application programs 160 including, for example, the innovative LMA 104 incorporating the teachings of the present invention, other program modules 162 , and program data 164 (e.g., resultant language model data structures, etc.).
  • a user may enter commands and information into computer 102 through input devices such as keyboard 166 and pointing device 168 .
  • Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, or the like.
  • These and other input devices are connected to the processing unit 132 through an interface 170 that is coupled to bus 136 .
  • a monitor 172 or other type of display device is also connected to the bus 136 via an interface, such as a video adapter 174 .
  • personal computers often include other peripheral output devices (not shown) such as speakers and printers.
  • computer 102 operates in a networked environment using logical connections to one or more remote computers, such as a remote computer 176 .
  • the remote computer 176 may be another personal computer, a personal digital assistant, a server, a router or other network device, a network “thin-client” PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer 102 , although only a memory storage device 178 has been illustrated in FIG. 1.
  • the logical connections depicted in FIG. 1 include a local area network (LAN) 180 and a wide area network (WAN) 182 .
  • remote computer 176 executes an Internet Web browser program such as the “Internet Explorer” Web browser manufactured and distributed by Microsoft Corporation of Redmond, Wash. to access and utilize online services.
  • When used in a LAN networking environment, computer 102 is connected to the local network 180 through a network interface or adapter 184 .
  • When used in a WAN networking environment, computer 102 typically includes a modem 186 or other means for establishing communications over the wide area network 182 , such as the Internet.
  • the modem 186 , which may be internal or external, is connected to the bus 136 via an input/output (I/O) interface 156 .
  • In addition to network connectivity, I/O interface 156 also supports one or more printers 188 .
  • program modules depicted relative to the personal computer 102 may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • the data processors of computer 102 are programmed by means of instructions stored at different times in the various computer-readable storage media of the computer.
  • Programs and operating systems are typically distributed, for example, on floppy disks or CD-ROMs. From there, they are installed or loaded into the secondary memory of a computer. At execution, they are loaded at least partially into the computer's primary electronic memory.
  • the invention described herein includes these and other various types of computer-readable storage media when such media contain instructions or programs for implementing the innovative steps described below in conjunction with a microprocessor or other data processor.
  • the invention also includes the computer itself when programmed according to the methods and techniques described below.
  • certain sub-components of the computer may be programmed to perform the functions and steps described below. The invention includes such sub-components when they are programmed as described.
  • the invention described herein includes data structures, described below, as embodied on various types of memory media.
  • FIG. 2 illustrates a block diagram of an example language modeling agent (LMA) 104 , incorporating the teachings of the present invention.
  • language modeling agent 104 is comprised of one or more controllers 202 , innovative analysis engine 204 , storage/memory device(s) 206 and, optionally, one or more additional applications (e.g., graphical user interface, prediction application, verification application, estimation application, etc.) 208 , each communicatively coupled as shown.
  • LMA 104 may well be implemented as a function of a higher level application, e.g., a word processor, web browser, speech recognition system, or a language conversion system.
  • controller(s) 202 of LMA 104 are responsive to one or more instructional commands from a parent application to selectively invoke the features of LMA 104 .
  • LMA 104 may well be implemented as a stand-alone language modeling tool, providing a user with a user interface ( 208 ) to selectively implement the features of LMA 104 discussed below.
  • controller(s) 202 of LMA 104 selectively invoke one or more of the functions of analysis engine 204 to optimize a language model from a dynamically generated lexicon and segmentation algorithm.
  • controller 202 is intended to represent any of a number of alternate control systems known in the art including, but not limited to, a microprocessor, a programmable logic array (PLA), a micro-machine, an application specific integrated circuit (ASIC) and the like.
  • controller 202 is intended to represent a series of executable instructions to implement the control logic described above.
  • the innovative analysis engine 204 is comprised of a Markov probability calculator 212 , a data structure generator 210 including a frequency calculation function 213 , a dynamic lexicon generation function 214 and a dynamic segmentation function 216 , and a data structure memory manager 218 .
  • controller 202 selectively invokes an instance of the analysis engine 204 to develop, modify and optimize a statistical language model (SLM).
  • analysis engine 204 develops a statistical language model data structure fundamentally based on the Markov transition probabilities between individual items (e.g., characters, letters, numbers, etc.) of a textual corpus (e.g., one or more sets of text).
  • analysis engine 204 utilizes as much data (referred to as “context” or “order”) as is available to calculate the probability of an item string.
  • the language model of the present invention is aptly referred to as a Dynamic Order Markov Model (DOMM).
  • When invoked by controller 202 to establish a DOMM data structure, analysis engine 204 selectively invokes the data structure generator 210 .
  • data structure generator 210 establishes a tree-like data structure comprised of a plurality of nodes (associated with each of the plurality of items) and denoting inter-node dependencies. As described above, the tree-like data structure is referred to herein as a DOMM data structure, or DOMM tree.
  • Controller 202 receives the textual corpus and stores at least a subset of the textual corpus in memory 206 as a dynamic training set 222 from which the language model is to be developed. It will be appreciated that, in alternate embodiments, a predetermined training set may also be used.
  • Frequency calculation function 213 identifies a frequency of occurrence for each item (character, letter, number, word, etc.) in the training set subset. Based on inter-node dependencies, data structure generator 210 assigns each item to an appropriate node of the DOMM tree, with an indication of the frequency value (C i ) and a compare bit (b i ).
  • the Markov probability calculator 212 calculates the probability of an item (character, letter, number, etc.) from a context (j) of associated items. More specifically, according to the teachings of the present invention, the Markov probability of a particular item (C i ) is dependent on as many previous characters as data “allows”, in other words: P(C 1 , C 2 , C 3 , . . . , C N ) ≈ ΠP(C i |C i-1 , C i-2 , C i-3 , . . . , C j ) (equation (2), below).
  • the number of characters employed as context (j) by Markov probability calculator 212 is a “dynamic” quantity that is different for each sequence of characters C i , C i-1 , C i-2 , C i-3 , etc.
  • the number of characters relied upon for context (j) by Markov probability calculator 212 is dependent, at least in part, on a frequency value for each of the characters, i.e., the rate at which they appear throughout the corpus. More specifically, if in identifying the items of the corpus Markov probability calculator 212 does not identify at least a minimum occurrence frequency for a particular item, it may be “pruned” (i.e., removed) from the tree as being statistically irrelevant. According to one embodiment, the minimum frequency threshold is three (3).
  • analysis engine 204 does not rely on a fixed lexicon or a simple segmentation algorithm (both of which tend to be error prone). Rather, analysis engine 204 selectively invokes a dynamic segmentation function 216 to segment items (characters or letters, for example) into strings (e.g., words). More precisely, segmentation function 216 segments the training set 222 into subsets (chunks) and calculates a cohesion score (i.e., a measure of the similarity between items within the subset). The segmentation and cohesion calculation is iteratively performed by segmentation function 216 until the cohesion score for each subset reaches a predetermined threshold.
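  • The patent leaves the cohesion measure itself unspecified. The Python sketch below assumes one plausible choice, the average pointwise mutual information (PMI) of adjacent items within a chunk; the function name and count-based inputs are illustrative, not from the patent. Chunks scoring below the predetermined threshold would be re-segmented and re-scored until every chunk passes, as described above.

        import math

        def cohesion(chunk, item_freq, pair_freq, total):
            # Average PMI over adjacent items; item_freq and pair_freq are
            # corpus counts, total is the number of items in the corpus.
            if len(chunk) < 2:
                return 0.0
            scores = []
            for a, b in zip(chunk, chunk[1:]):
                if pair_freq.get((a, b), 0) == 0:
                    return float("-inf")    # items never adjacent: no cohesion
                scores.append(math.log(pair_freq[(a, b)] * total /
                                       (item_freq[a] * item_freq[b])))
            return sum(scores) / len(scores)
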
  • the lexicon generation function 214 is invoked to dynamically generate and maintain a lexicon 220 in memory 206 .
  • lexicon generation function 214 analyzes the segmentation results and generates a lexicon from item strings with a Markov transition probability that exceeds a threshold.
  • lexicon generation function 214 develops a dynamic lexicon 220 from item strings which exceed a pre-determined Markov transition probability taken from one or more language models developed by analysis engine 204 .
  • analysis engine 204 dynamically generates a lexicon of statistically significant, statistically accurate item strings from one or more language models developed over a period of time.
  • the lexicon 220 comprises a “virtual corpus” that Markov probability calculator 212 relies upon (in addition to the dynamic training set) in developing subsequent language models.
  • When invoked to modify or utilize the DOMM language model data structure, analysis engine 204 selectively invokes an instance of data structure memory manager 218 .
  • data structure memory manager 218 utilizes system memory as well as extended memory to maintain the DOMM data structure. More specifically, as will be described in greater detail below with reference to FIGS. 6 and 7, data structure memory manager 218 employs a WriteNode function and a ReadNode function (not shown) to maintain a subset of the most recently used nodes of the DOMM data structure in a first level cache 224 of a system memory 206 , while relegating least recently used nodes to extended memory (e.g., disk files in hard drive 144 , or some remote drive), to provide for improved performance characteristics.
  • a second level cache of system memory 206 is used to aggregate write commands until a predetermined threshold has been met, at which point data structure memory manager 218 makes one aggregate WriteNode command to an appropriate location in memory.
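  • The following Python sketch shows one way the two-level caching scheme described above might be organized. Only the ReadNode/WriteNode names come from the description; the dict-backed store (standing in for disk files) and the capacity and threshold values are assumptions.

        from collections import OrderedDict

        class NodeCache:
            def __init__(self, capacity=100000, flush_threshold=1000, store=None):
                self.lru = OrderedDict()   # first-level cache: recently used nodes
                self.pending = {}          # second-level cache: aggregated writes
                self.capacity = capacity
                self.flush_threshold = flush_threshold
                self.store = store if store is not None else {}  # disk stand-in

            def read_node(self, node_id):
                # ReadNode: serve from the first-level cache when possible,
                # else fall back to pending writes, then the backing store.
                if node_id in self.lru:
                    self.lru.move_to_end(node_id)
                    return self.lru[node_id]
                node = self.pending.get(node_id, self.store.get(node_id))
                if node is not None:
                    self._cache(node_id, node)
                return node

            def write_node(self, node_id, node):
                # WriteNode: buffer writes, then flush them as one aggregate
                # write once the predetermined threshold is met.
                self._cache(node_id, node)
                self.pending[node_id] = node
                if len(self.pending) >= self.flush_threshold:
                    self.store.update(self.pending)
                    self.pending.clear()

            def _cache(self, node_id, node):
                self.lru[node_id] = node
                self.lru.move_to_end(node_id)
                if len(self.lru) > self.capacity:
                    self.lru.popitem(last=False)   # evict least recently used
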
  • data structure memory manager 218 may well be combined as a functional element of controller(s) 202 without deviating from the spirit and scope of the present invention.
  • FIG. 3 graphically represents a conceptual illustration of an example Dynamic Order Markov Model tree-like data structure 300 , according to the teachings of the present invention.
  • FIG. 3 presents an example DOMM data structure 300 for a language model developed from the English alphabet, i.e., A, B, C, . . . Z.
  • the DOMM tree 300 is comprised of one or more root nodes 302 and one or more subordinate nodes 304 , each associated with an item (character, letter, number, word, etc.) of a textual corpus, logically coupled to denote dependencies between nodes.
  • root nodes 302 are comprised of an item and a frequency value (e.g., a count of how many times the item occurs in the corpus).
  • the subordinate nodes are arranged in binary sub-trees, wherein each node includes a compare bit (b i ), an item with which the node is associated (A, B, . . . ), and a frequency value (C N ) for the item.
  • a binary sub-tree is comprised of subordinate nodes 308 - 318 denoting the relationships between nodes and the frequency with which they occur.
  • the complexity of a search of the DOMM tree approximates log(N), where N is the total number of nodes to be searched.
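  • As a rough illustration, a subordinate node and the bit-wise search over a binary sub-tree might be rendered in Python as follows. The field names are hypothetical, and the PAT-tree-style branching on the compare bit is an assumption rather than a structure mandated by the patent.

        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class DommNode:
            compare_bit: int                    # bi: bit position to branch on
            item: str                           # item the node is associated with
            count: int                          # CN: frequency value of the item
            left: Optional["DommNode"] = None
            right: Optional["DommNode"] = None

        def find(node, key_bits, item):
            # Branch on each node's compare bit (key_bits is the bit pattern of
            # the sought item, assumed long enough); a balanced sub-tree is
            # traversed in roughly log(N) steps, matching the estimate above.
            while node is not None:
                if node.item == item:
                    return node
                node = node.right if key_bits[node.compare_bit] else node.left
            return None
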
  • the size of the DOMM tree 300 may exceed the space available in the memory device 206 of LMA 104 and/or the main memory 140 of computer system 102 . Accordingly, data structure memory manager 218 facilitates storage of a DOMM tree data structure 300 across main memory (e.g., 140 and/or 206 ) into an extended memory space, e.g., disk files on a mass storage device such as hard drive 144 of computer system 102 .
  • FIG. 4 is a flow chart of an example method for building a Dynamic Order Markov Model (DOMM) data structure, according to one aspect of the present invention.
  • language modeling agent 104 may be invoked directly by a user or a higher-level application.
  • controller 202 of LMA 104 selectively invokes an instance of analysis engine 204 , and a textual corpus (e.g., one or more documents) is loaded into memory 206 as a dynamic training set 222 and split into subsets (e.g., sentences, lines, etc.), block 402 .
  • data structure generator 210 assigns each item of the subset to a node in data structure and calculates a frequency value for the item, block 404 .
  • frequency calculation function 213 is invoked to identify the occurrence frequency of each item within the training set subset.
  • data structure generator determines whether additional subsets of the training set remain and, if so, the next subset is read in block 408 and the process continues with block 404 .
  • data structure generator 210 completely populates the data structure, a subset at a time, before invocation of the frequency calculation function 213 .
  • frequency calculation function 213 simply counts each item as it is placed into associated nodes of the data structure.
  • data structure generator 210 may optionally prune the data structure, block 410 .
  • a number of mechanisms may be employed to prune the resultant data structure 300 .
  • FIG. 5 is a flow chart of an example method for lexicon, segmentation and language model joint optimization, according to the teachings of the present invention. As shown, the method begins with block 400 wherein LMA 104 is invoked and a prefix tree of at least a subset of the received corpus is built. More specifically, as detailed in FIG. 4, data structure generator 210 of modeling agent 104 analyzes the received corpus and selects at least a subset as a training set, from which a DOMM tree is built.
  • a very large lexicon is built from the prefix tree and pre-processed to remove some obvious illogical words. More specifically, lexicon generation function 214 is invoked to build an initial lexicon from the prefix tree. According to one implementation, the initial lexicon is built from the prefix tree using all sub-strings whose length is less than some pre-defined value, say ten (10) items (i.e., the sub-string is ten nodes or less from root to the most subordinate node). Once the initial lexicon is compiled, lexicon generation function 214 prunes the lexicon by removing some obvious illogical words (see, e.g., block 604 , below). According to one implementation, lexicon generation function 214 appends a pre-defined lexicon with the new, initial lexicon generated from at least the training set of the received corpus.
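  • A minimal Python sketch of this initial-lexicon step follows. Enumerating sub-strings directly from the item sequence stands in for walking the prefix-tree branches; max_len reflects the ten-item bound suggested above, and min_count is an illustrative pre-pruning knob not taken from the patent.

        from collections import Counter

        def initial_lexicon(corpus_items, max_len=10, min_count=2):
            # Collect every sub-string of up to max_len items with its count.
            counts = Counter()
            n = len(corpus_items)
            for i in range(n):
                for j in range(i + 1, min(i + max_len, n) + 1):
                    counts[tuple(corpus_items[i:j])] += 1
            return {w: c for w, c in counts.items() if c >= min_count}
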
  • At least the training set of the received corpus is segmented, using the initial lexicon. More particularly, dynamic segmentation function 216 is invoked to segment at least the training set of the received corpus to generate an initial segmented corpus.
  • dynamic segmentation function 216 utilizes a Maximum Match technique to provide an initial segmented corpus.
  • segmentation function 216 starts at the beginning of an item string (or branch of the DOMM tree) and searches the lexicon to see if the initial item (I 1 ) is a one-item “word”. Segmentation function 216 then combines it with the next item in the string to see if the combination (e.g., I 1 I 2 ) is found as a “word” in the lexicon, and so on. According to one implementation, the longest string (I 1 , I 2 , . . . I N ) of items found in the lexicon is deemed to be the correct segmentation for that string. It is to be appreciated that more complex Maximum Match algorithms may well be utilized by segmentation function 216 without deviating from the scope and spirit of the present invention.
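  • A simple forward Maximum Match segmenter of the kind described might look as follows in Python (illustrative names; the lexicon is assumed to hold item tuples, as in the earlier sketch):

        def maximum_match(items, lexicon, max_word_len=10):
            # Greedy forward maximum match: at each position take the longest
            # item string found in the lexicon, falling back to a single item,
            # then continue immediately after it.
            segments, i = [], 0
            while i < len(items):
                match = items[i:i + 1]                  # one-item fallback
                for j in range(min(i + max_word_len, len(items)), i + 1, -1):
                    if tuple(items[i:j]) in lexicon:
                        match = items[i:j]
                        break
                segments.append(match)
                i += len(match)
            return segments
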
  • an iterative process is entered wherein the lexicon, segmentation and language model are jointly optimized, block 506 .
  • the innovative iterative optimization employs a statistical language modeling approach to dynamically adjust the segmentation and lexicon to provide an optimized language model. That is, unlike prior art language modeling techniques, modeling agent 104 does not rely on a pre-defined static lexicon, or simplistic segmentation algorithm to generate a language model. Rather, modeling agent 104 utilizes the received corpus, or at least a subset thereof (training set), to dynamically generate a lexicon and segmentation to produce an optimized language model. In this regard, language models generated by modeling agent 104 do not suffer from the drawbacks and limitations commonly associated with prior art modeling systems.
  • FIG. 6 presents a more detailed flow chart for generating an initial lexicon, and the iterative process of refining the lexicon and segmentation to optimize the language model, according to one implementation of the present invention.
  • the method begins with step 400 (FIG. 4) of building a prefix tree from the received corpus.
  • the prefix tree may be built using the entire corpus or, alternatively, using a subset of the entire corpus (referred to as a training corpus).
  • lexicon generation function 214 generates an initial lexicon from the prefix tree by identifying substrings (or branches of the prefix tree) with less than a select number of items. According to one implementation, lexicon generation function 214 identifies substrings of ten (10) items or less to comprise the initial lexicon. In block 604 , lexicon generation function 214 analyzes the initial lexicon generated in step 602 for obvious illogical substrings, removing these substrings from the initial lexicon.
  • lexicon generation function 214 analyzes the initial lexicon of substrings for illogical, or improbable words and removes these words from the lexicon.
  • dynamic segmentation function 216 is invoked to segment at least the training set of the received corpus to generate a segmented corpus.
  • the Maximum Match algorithm is used to segment based on the initial lexicon.
  • the frequency analysis function 213 is invoked to compute the frequency of occurrence in the received corpus for each word in the lexicon, sorting the lexicon according to the frequency of occurrence. The word with the lowest frequency is identified and deleted from the lexicon.
  • the threshold for this deletion and re-segmentation may be determined according to the size of the corpus.
  • a corpus of 600M items may well utilize a frequency threshold of 500 to be included within the lexicon. In this way, we can delete most of the obvious illogical words from the initial lexicon.
  • the received corpus is segmented based, at least in part, on the initial lexicon, block 504 .
  • the initial segmentation of the corpus is performed using a maximum matching process.
  • an iterative process of dynamically altering the lexicon and segmentation begins, to optimize a statistical language model (SLM) from the received corpus (or training set), block 506 .
  • the process begins in block 606 , wherein the Markov probability calculator 212 utilizes the initial lexicon and segmentation to begin language model training using the segmented corpus. That is, given the initial lexicon and an initial segmentation, a statistical language model may be generated therefrom. It should be noted that although the language model does not yet benefit from a refined lexicon and a statistically based segmentation (which will evolve in the steps to follow), it is nonetheless fundamentally based on the received corpus itself.
  • the segmented corpus (or training set) is re-segmented using SLM-based segmentation.
  • in SLM-based segmentation, given a sentence w 1 , w 2 , . . . w n , there are M possible ways to segment it (where M ≥ 1).
  • Dynamic segmentation function 216 computes a probability (p i ) of each segmentation (S i ) based on an N-gram statistical language model.
  • a Viterbi search algorithm is employed to find the most probable segmentation S k , where: S k = argmax i p i , for i = 1, . . . , M.
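  • This search can be sketched as a standard dynamic program over word break points, shown below in Python. For brevity the sketch scores candidate words with a supplied unigram log-probability function; the N-gram model of the patent would add a context dimension to the same recurrence. All names are illustrative.

        def viterbi_segment(items, word_logprob, max_word_len=10):
            # best[i] holds the best log-probability of any segmentation of
            # items[:i]; back[i] holds the start of the last word in it.
            n = len(items)
            best = [float("-inf")] * (n + 1)
            back = [0] * (n + 1)
            best[0] = 0.0
            for i in range(1, n + 1):
                for j in range(max(0, i - max_word_len), i):
                    score = best[j] + word_logprob(tuple(items[j:i]))
                    if score > best[i]:
                        best[i], back[i] = score, j
            # Recover the argmax segmentation Sk by following back-pointers.
            segments, i = [], n
            while i > 0:
                segments.append(items[back[i]:i])
                i = back[i]
            return segments[::-1]
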
  • the lexicon is updated using the re-segmented corpus resulting from the SLM-based segmentation described above.
  • modeling agent 104 invokes frequency analysis function 213 to compute the frequency of occurrence in the received corpus for each word in the lexicon, sorting the lexicon according to the frequency of occurrence. The word with the lowest frequency is identified and deleted from the lexicon. All occurrences of the word must then be re-segmented into smaller words, and the uni-counts for those words are re-computed.
  • the threshold for this deletion and re-segmentation may be determined according to the size of the corpus.
  • a corpus of 600M items may well utilize a frequency threshold of 500 to be included within the lexicon.
  • the language model is updated to reflect the dynamically generated lexicon and the SLM-based segmentation, and a measure of the language model perplexity (i.e., an inverse probability measure) is computed by Markov probability calculator 212 . If the perplexity continues to converge (toward zero (0)), i.e., improve, the process continues with block 608 wherein the lexicon and segmentation are once again modified with the intent of further improving the language model performance (as measured by perplexity). If in block 614 it is determined that the language model has not improved as a result of the recent modifications to the lexicon and segmentation, a further determination of whether the perplexity has reached an acceptable threshold is made, block 616 . If so, the process ends.
  • lexicon generation function 214 deletes the word with the smallest frequency of occurrence in the corpus from the lexicon, re-segmenting the word into smaller words, block 618 , as the process continues with block 610 .
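  • Putting these steps together, the iterative loop of blocks 606 through 618 might be sketched as follows, under the simplifying assumption that train_lm, resegment and perplexity are supplied as functions; the signatures are hypothetical, not from the patent.

        def joint_optimize(corpus, lexicon, segmented, train_lm, resegment,
                           perplexity, max_iters=50):
            # Each pass retrains the model (block 606), re-segments the corpus
            # with it (block 608), drops the lowest-frequency lexicon word
            # (blocks 610-612), and stops once perplexity stops improving
            # (blocks 614-616). lexicon is a set of words; segmented is a
            # list of word lists.
            prev_ppl = float("inf")
            lm = None
            for _ in range(max_iters):
                lm = train_lm(segmented)
                segmented = resegment(corpus, lexicon, lm)
                freq = {w: sum(seg.count(w) for seg in segmented)
                        for w in lexicon}
                lexicon.discard(min(freq, key=freq.get))
                ppl = perplexity(lm, segmented)
                if ppl >= prev_ppl:        # no further improvement: converged
                    break
                prev_ppl = ppl
            return lm, lexicon, segmented
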
  • innovative language modeling agent 104 generates an optimized language model premised on a dynamically generated lexicon and segmentation rules statistically predicated on at least a subset of the received corpus.
  • the resultant language model has improved computational and predictive capability when compared to prior art language models.
  • FIG. 7 is a block diagram of a storage medium having stored thereon a plurality of instructions including instructions to implement the innovative modeling agent of the present invention, according to yet another embodiment of the present invention.
  • FIG. 7 illustrates a storage medium/device 700 having stored thereon a plurality of executable instructions 702 , at least a subset of which, when executed, implement the innovative modeling agent 104 of the present invention.
  • When executed by a processor of a host system, the executable instructions 702 implement the modeling agent to generate a statistical language model representation of a textual corpus for use by any of a host of other applications executing on or otherwise available to the host system.
  • storage medium 700 is intended to represent any of a number of storage devices and/or storage media known to those skilled in the art such as, for example, volatile memory devices, non-volatile memory devices, magnetic storage media, optical storage media, and the like.
  • the executable instructions are intended to reflect any of a number of software languages known in the art such as, for example, C++, Visual Basic, Hypertext Markup Language (HTML), Java, eXtensible Markup Language (XML), and the like.
  • the storage medium/device 700 need not be co-located with any host system. That is, storage medium/device 700 may well reside within a remote server communicatively coupled to and accessible by an executing system. Accordingly, the software implementation of FIG. 7 is to be regarded as illustrative, as alternate storage media and software embodiments are anticipated within the spirit and scope of the present invention.

Abstract

A method for optimizing a language model is presented comprising developing an initial language model from a lexicon and segmentation derived from a received corpus using a maximum match technique, and iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved.

Description

    RELATED APPLICATIONS
  • This is a continuation of U.S. patent application Ser. No. 09/609,202, filed Jun. 20, 2000, now U.S. Pat. No. ______, which claims priority to provisional patent application No. 60/163,850, entitled “An iterative method for lexicon, word segmentation and language model joint optimization”, filed on Nov. 5, 1999 by the inventors of this application, each of which is incorporated herein by reference. [0001]
  • TECHNICAL FIELD
  • This invention generally relates to language modeling and, more specifically, to a system and iterative method for lexicon, word segmentation and language model joint optimization. [0002]
  • BACKGROUND
  • Recent advances in computing power and related technology have fostered the development of a new generation of powerful software applications including web-browsers, word processing and speech recognition applications. The latest generation of web-browsers, for example, anticipate a uniform resource locator (URL) address entry after a few of the initial characters of the domain name have been entered. Word processors offer improved spelling and grammar checking capabilities, word prediction, and language conversion. Newer speech recognition applications similarly offer a wide variety of features with impressive recognition and prediction accuracy rates. In order to be useful to an end-user, these features must execute in substantially real-time. To provide this performance, many applications rely on a tree-like data structure to build a simple language model. [0003]
  • Simplistically, a language model measures the likelihood of any given sentence. That is, a language model can take any sequence of items (words, characters, letters, etc.) and estimate the probability of the sequence. A common approach to building a prior art language model is to utilize a prefix tree-like data structure to build an N-gram language model from a known training set of a textual corpus. [0004]
  • The use of a prefix tree data structure (a.k.a. a suffix tree, or a PAT tree) enables a higher-level application to quickly traverse the language model, providing the substantially real-time performance characteristics described above. Simplistically, the N-gram language model counts the number of occurrences of a particular item (word, character, etc.) in a string (of size N) throughout a text. The counts are used to calculate the probability of the use of the item strings. Traditionally, a tri-gram (N-gram where N=3) approach involves the following steps: [0005]
  • (a) a textual corpus is dissected into a plurality of items (characters, letters, numbers, etc.); [0006]
  • (b) the items (e.g., characters (C)) are segmented (e.g., into words (W)) in accordance with a small, pre-defined lexicon and a simple, pre-defined segmentation algorithm, wherein each W is mapped in the tree to one or more C's; [0007]
  • (c) train a language model on the dissected corpus by counting the occurrence of strings of characters, from which the probability of a sequence of words (W1, W2, . . . WM) is predicted from the previous two words: [0008]
  • P(W1, W2, W3, . . . WM)≈ΠP(Wi|Wi-1, Wi-2)   (1)
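  • By way of illustration, the tri-gram estimate of equation (1) can be sketched in a few lines of Python. This is a minimal maximum-likelihood version assuming a pre-segmented corpus and no smoothing; all names are illustrative rather than taken from the patent.

        from collections import defaultdict

        def train_trigram(sentences):
            # Count tri-grams and their bi-gram contexts over a corpus of
            # pre-segmented sentences (lists of words).
            tri, bi = defaultdict(int), defaultdict(int)
            for words in sentences:
                padded = ["<s>", "<s>"] + words
                for i in range(2, len(padded)):
                    bi[(padded[i - 2], padded[i - 1])] += 1
                    tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
            return tri, bi

        def sentence_probability(words, tri, bi):
            # P(W1 . . . WM) ~= product over i of P(Wi | Wi-2, Wi-1), per
            # equation (1), using maximum-likelihood count ratios.
            p = 1.0
            padded = ["<s>", "<s>"] + words
            for i in range(2, len(padded)):
                context = (padded[i - 2], padded[i - 1])
                if bi.get(context, 0) == 0:
                    return 0.0      # unseen context: MLE assigns zero
                p *= tri.get(context + (padded[i],), 0) / bi[context]
            return p
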
  • The N-gram language model is limited in a number of respects. First, the counting process utilized in constructing the prefix tree is very time consuming. Thus, only small N-gram models (typically bi-gram, or tri-gram) can practically be achieved. Second, as the string size (N) of the N-gram language model increases, the memory required to store the prefix tree increases by 2^N. Thus, the memory required to store the N-gram language model, and the access time required to utilize a large N-gram language model, are prohibitively large for N-grams larger than three (i.e., a tri-gram). [0009]
  • Prior art N-gram language models tend to use a fixed (small) lexicon, a simplistic segmentation algorithm, and will typically only rely on the previous two words to predict the current word (in a tri-gram model). [0010]
  • A fixed lexicon limits the ability of the model to select the best words in general or specific to a task. If a word is not in the lexicon, it does not exist as far as the model is concerned. Thus, a small lexicon is not likely to cover the intended linguistic content. [0011]
  • The segmentation algorithms are often ad-hoc and not based on any statistical or semantic principles. A simplistic segmentation algorithm typically errs in favor of larger words over smaller words. Thus, the model is unable to accurately predict smaller words contained within larger, lexically acceptable strings. [0012]
  • As a result of the foregoing limitations, language models using prior art lexicon and segmentation algorithms tend to be error prone. That is, any errors made in the lexicon or segmentation stage are propagated throughout the language model, thereby limiting its accuracy and predictive attributes. [0013]
  • Finally, limiting the model to at most the previous two words for context (in a tri-gram language model) is also limiting in that a greater context might be required to accurately predict the likelihood of a word. The limitations on these three aspects of the language model often result in poor predictive qualities of the language model. [0014]
  • Thus, a system and method for lexicon, segmentation algorithm and language model joint optimization is required, unencumbered by the deficiencies and limitations commonly associated with prior art language modeling techniques. Just such a solution is provided below. [0015]
  • SUMMARY
  • This invention concerns a system and iterative method for lexicon, segmentation and language model joint optimization. To overcome the limitations commonly associated with the prior art, the present invention does not rely on a predefined lexicon or segmentation algorithm, rather the lexicon and segmentation algorithm are dynamically generated in an iterative process of optimizing the language model. According to one implementation, a method for improving language model performance is presented comprising developing an initial language model from a lexicon and segmentation derived from a received corpus using a maximum match technique, and iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved. [0016]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The same reference numbers are used throughout the figures to reference like components and features. [0017]
  • FIG. 1 is a block diagram of a computer system incorporating the teachings of the present invention; [0018]
  • FIG. 2 is a block diagram of an example modeling agent to iteratively develop a lexicon, segmentation and language model, according to one implementation of the present invention; [0019]
  • FIG. 3 is a graphical representation of a DOMM tree according to one aspect of the present invention; [0020]
  • FIG. 4 is a flow chart of an example method for building a DOMM tree; [0021]
  • FIG. 5 is a flow chart of an example method for lexicon, segmentation and language model joint optimization, according to the teachings of the present invention; [0022]
  • FIG. 6 is a flow chart detailing the method steps for generating an initial lexicon, and iteratively altering a dynamically generated lexicon, segmentation and language model until convergence, according to one implementation of the present invention; and [0023]
  • FIG. 7 is a storage medium with a plurality of executable instructions which, when executed, implement the innovative modeling agent of the present invention, according to an alternate embodiment of the present invention.[0024]
  • DETAILED DESCRIPTION
  • This invention concerns a system and iterative method for lexicon, segmentation and language model joint optimization. In describing the present invention, an innovative language model, the Dynamic Order Markov Model (DOMM), is referenced. A detailed description of DOMM is presented in copending U.S. patent application Ser. No. 09/608,526 entitled A Method and Apparatus for Generating and Managing a Language Model Data Structure, by Lee, et al., the disclosure of which is expressly incorporated herein by reference. [0025]
  • In the discussion herein, the invention is described in the general context of computer-executable instructions, such as program modules, being executed by one or more conventional computers. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, personal digital assistants, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. In a distributed computer environment, program modules may be located in both local and remote memory storage devices. It is noted, however, that modification to the architecture and methods described herein may well be made without deviating from the spirit and scope of the present invention. [0026]
  • Example Computer System [0027]
  • FIG. 1 illustrates an example computer system 102 including an innovative language modeling agent 104, to jointly optimize a lexicon, segmentation and language model according to the teachings of the present invention. It should be appreciated that although depicted as a separate, stand alone application in FIG. 1, language modeling agent 104 may well be implemented as a function of an application, e.g., word processor, web browser, speech recognition system, etc. Moreover, although depicted as a software application, those skilled in the art will appreciate that the innovative modeling agent may well be implemented in hardware, e.g., a programmable logic array (PLA), a special purpose processor, an application specific integrated circuit (ASIC), microcontroller, and the like. [0028]
  • It will be evident, from the discussion to follow, that computer 102 is intended to represent any of a class of general or special purpose computing platforms which, when endowed with the innovative language modeling agent (LMA) 104, implement the teachings of the present invention in accordance with the first example implementation introduced above. It is to be appreciated that although the language modeling agent is depicted herein as a software application, computer system 102 may alternatively support a hardware implementation of LMA 104 as well. In this regard, but for the description of LMA 104, the following description of computer system 102 is intended to be merely illustrative, as computer systems of greater or lesser capability may well be substituted without deviating from the spirit and scope of the present invention. [0029]
  • As shown, computer 102 includes one or more processors or processing units 132, a system memory 134, and a bus 136 that couples various system components including the system memory 134 to processors 132. [0030]
  • The bus 136 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. The system memory includes read only memory (ROM) 138 and random access memory (RAM) 140. A basic input/output system (BIOS) 142, containing the basic routines that help to transfer information between elements within computer 102, such as during start-up, is stored in ROM 138. Computer 102 further includes a hard disk drive 144 for reading from and writing to a hard disk, not shown, a magnetic disk drive 146 for reading from and writing to a removable magnetic disk 148, and an optical disk drive 150 for reading from or writing to a removable optical disk 152 such as a CD ROM, DVD ROM or other such optical media. The hard disk drive 144, magnetic disk drive 146, and optical disk drive 150 are connected to the bus 136 by a SCSI interface 154 or some other suitable bus interface. The drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for computer 102. [0031]
  • Although the exemplary environment described herein employs a hard disk 144, a removable magnetic disk 148 and a removable optical disk 152, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROMs), and the like, may also be used in the exemplary operating environment. [0032]
  • A number of program modules may be stored on the hard disk 144, magnetic disk 148, optical disk 152, ROM 138, or RAM 140, including an operating system 158, one or more application programs 160 including, for example, the innovative LMA 104 incorporating the teachings of the present invention, other program modules 162, and program data 164 (e.g., resultant language model data structures, etc.). A user may enter commands and information into computer 102 through input devices such as keyboard 166 and pointing device 168. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are connected to the processing unit 132 through an interface 170 that is coupled to bus 136. A monitor 172 or other type of display device is also connected to the bus 136 via an interface, such as a video adapter 174. In addition to the monitor 172, personal computers often include other peripheral output devices (not shown) such as speakers and printers. [0033]
  • As shown, computer 102 operates in a networked environment using logical connections to one or more remote computers, such as a remote computer 176. The remote computer 176 may be another personal computer, a personal digital assistant, a server, a router or other network device, a network “thin-client” PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer 102, although only a memory storage device 178 has been illustrated in FIG. 1. [0034]
  • As shown, the logical connections depicted in FIG. 1 include a local area network (LAN) 180 and a wide area network (WAN) 182. Such networking environments are commonplace in offices, enterprise-wide computer networks, Intranets, and the Internet. In one embodiment, remote computer 176 executes an Internet Web browser program such as the “Internet Explorer” Web browser manufactured and distributed by Microsoft Corporation of Redmond, Wash. to access and utilize online services. [0035]
  • When used in a LAN networking environment, computer 102 is connected to the local network 180 through a network interface or adapter 184. When used in a WAN networking environment, computer 102 typically includes a modem 186 or other means for establishing communications over the wide area network 182, such as the Internet. The modem 186, which may be internal or external, is connected to the bus 136 via an input/output (I/O) interface 156. In addition to network connectivity, I/O interface 156 also supports one or more printers 188. In a networked environment, program modules depicted relative to the personal computer 102, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used. [0036]
  • Generally, the data processors of computer 102 are programmed by means of instructions stored at different times in the various computer-readable storage media of the computer. Programs and operating systems are typically distributed, for example, on floppy disks or CD-ROMs. From there, they are installed or loaded into the secondary memory of a computer. At execution, they are loaded at least partially into the computer's primary electronic memory. The invention described herein includes these and other various types of computer-readable storage media when such media contain instructions or programs for implementing the innovative steps described below in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described below. Furthermore, certain sub-components of the computer may be programmed to perform the functions and steps described below. The invention includes such sub-components when they are programmed as described. In addition, the invention described herein includes data structures, described below, as embodied on various types of memory media. [0037]
  • For purposes of illustration, programs and other executable program components such as the operating system are illustrated herein as discrete blocks, although it is recognized that such programs and components reside at various times in different storage components of the computer, and are executed by the data processor(s) of the computer. [0038]
  • Example Language Modeling Agent [0039]
  • FIG. 2 illustrates a block diagram of an example language modeling agent (LMA) 104, incorporating the teachings of the present invention. As shown, language modeling agent 104 is comprised of one or more controllers 202, innovative analysis engine 204, storage/memory device(s) 206 and, optionally, one or more additional applications (e.g., graphical user interface, prediction application, verification application, estimation application, etc.) 208, each communicatively coupled as shown. It will be appreciated that although depicted in FIG. 2 as a number of disparate blocks, one or more of the functional elements of the LMA 104 may well be combined. In this regard, modeling agents of greater or lesser complexity which iteratively jointly optimize a dynamic lexicon, segmentation and language model may well be employed without deviating from the spirit and scope of the present invention. [0040]
  • As alluded to above, although depicted as a separate functional element, LMA 104 may well be implemented as a function of a higher level application, e.g., a word processor, web browser, speech recognition system, or a language conversion system. In this regard, controller(s) 202 of LMA 104 are responsive to one or more instructional commands from a parent application to selectively invoke the features of LMA 104. Alternatively, LMA 104 may well be implemented as a stand-alone language modeling tool, providing a user with a user interface (208) to selectively implement the features of LMA 104 discussed below. [0041]
  • In either case, controller(s) 202 of LMA 104 selectively invoke one or more of the functions of analysis engine 204 to optimize a language model from a dynamically generated lexicon and segmentation algorithm. Thus, except as configured to effect the teachings of the present invention, controller 202 is intended to represent any of a number of alternate control systems known in the art including, but not limited to, a microprocessor, a programmable logic array (PLA), a micro-machine, an application specific integrated circuit (ASIC) and the like. In an alternate implementation, controller 202 is intended to represent a series of executable instructions to implement the control logic described above. [0042]
  • As shown, the innovative analysis engine 204 is comprised of a Markov probability calculator 212, a data structure generator 210 including a frequency calculation function 213, a dynamic lexicon generation function 214 and a dynamic segmentation function 216, and a data structure memory manager 218. Upon receiving an external indication, controller 202 selectively invokes an instance of the analysis engine 204 to develop, modify and optimize a statistical language model (SLM). More specifically, in contrast to prior art language modeling techniques, analysis engine 204 develops a statistical language model data structure fundamentally based on the Markov transition probabilities between individual items (e.g., characters, letters, numbers, etc.) of a textual corpus (e.g., one or more sets of text). Moreover, as will be shown, analysis engine 204 utilizes as much data (referred to as “context” or “order”) as is available to calculate the probability of an item string. In this regard, the language model of the present invention is aptly referred to as a Dynamic Order Markov Model (DOMM). [0043]
  • When invoked by [0044] controller 202 to establish a DOMM data structure, analysis engine 204 selectively invokes the data structure generator 210. In response, data structure generator 210 establishes a tree-like data structure comprised of a plurality of nodes (associated with each of the plurality of items) and denoting inter-node dependencies. As described above, the tree-like data structure is referred to herein as a DOMM data structure, or DOMM tree. Controller 202 receives the textual corpus and stores at least a subset of the textual corpus in memory 206 as a dynamic training set 222 from which the language model is to be developed. It will be appreciated that, in alternate embodiments, a predetermined training set may also be used.
  • Once the dynamic training set is received, at least a subset of the training set [0045] 222 is retrieved by frequency calculation function 213 for analysis. Frequency calculation function 213 identifies a frequency of occurrence for each item (character, letter, number, word, etc.) in the training set subset. Based on inter-node dependencies, data structure generator 210 assigns each item to an appropriate node of the DOMM tree, with an indication of the frequency value ($C_i$) and a compare bit ($b_i$).
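  • By way of illustration, the following Python sketch shows one way such a frequency-counting trie could be populated; the node layout (a children map, a count field and a compare-bit field) and the ten-item depth cap are assumptions made for the example, not the patent's exact node format.

```python
# Hypothetical sketch of populating a DOMM-style trie with item frequencies.
# Node fields (children, count, compare_bit) are illustrative assumptions.
class Node:
    def __init__(self):
        self.children = {}    # item -> Node
        self.count = 0        # frequency value (C_i)
        self.compare_bit = 0  # placeholder for the compare bit (b_i)

def add_sequence(root, items, max_depth=10):
    """Insert every suffix of `items` up to max_depth, counting occurrences."""
    for start in range(len(items)):
        node = root
        for item in items[start:start + max_depth]:
            node = node.children.setdefault(item, Node())
            node.count += 1

root = Node()
add_sequence(root, "ABAB")  # a string is simply a sequence of character items
print(root.children["A"].count)                # 2 occurrences of "A"
print(root.children["A"].children["B"].count)  # 2 occurrences of "AB"
```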
  • The [0046] Markov probability calculator 212 calculates the probability of an item (character, letter, number, etc.) from a context (j) of associated items. More specifically, according to the teachings of the present invention, the Markov probability of a particular item ($C_i$) is dependent on as many previous characters as the data allows; in other words:
  • $P(C_1, C_2, C_3, \ldots, C_N) \approx \prod_i P(C_i \mid C_{i-1}, C_{i-2}, C_{i-3}, \ldots, C_j)$   (2)
  • The number of characters employed as context (j) by [0047] Markov probability calculator 212 is a "dynamic" quantity that is different for each sequence of characters $C_i, C_{i-1}, C_{i-2}, C_{i-3}$, etc. According to one implementation, the number of characters relied upon for context (j) by Markov probability calculator 212 is dependent, at least in part, on a frequency value for each of the characters, i.e., the rate at which they appear throughout the corpus. More specifically, if in identifying the items of the corpus Markov probability calculator 212 does not identify at least a minimum occurrence frequency for a particular item, it may be "pruned" (i.e., removed) from the tree as being statistically irrelevant. According to one embodiment, the minimum frequency threshold is three (3).
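  • As a concrete, hedged illustration of the dynamic-order idea, the sketch below estimates $P(C_i \mid \text{context})$ using the longest context whose joint count meets the minimum-frequency threshold of three, shrinking the context otherwise; the counting helper is an assumption standing in for lookups against the DOMM tree.

```python
# Dynamic-order estimate: use the longest context whose counts are reliable
# (joint count >= MIN_FREQ) and back off to shorter contexts otherwise.
from collections import Counter

MIN_FREQ = 3  # the minimum occurrence frequency mentioned in the text

def ngram_counts(corpus, max_order):
    counts = Counter()
    for start in range(len(corpus)):
        for end in range(start + 1, min(start + max_order, len(corpus)) + 1):
            counts[corpus[start:end]] += 1
    return counts

def dynamic_order_prob(counts, history, item):
    """P(item | history), shrinking the history until counts are sufficient."""
    total_unigrams = sum(c for k, c in counts.items() if len(k) == 1)
    for j in range(len(history), -1, -1):
        ctx = history[len(history) - j:]
        joint = counts[ctx + item]
        if joint >= MIN_FREQ:
            return joint / (counts[ctx] if ctx else total_unigrams)
    return 0.0  # unseen under every context length

counts = ngram_counts("abababab", max_order=4)
print(dynamic_order_prob(counts, "ab", "a"))  # 0.75
```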
  • As alluded to above, [0048] analysis engine 204 does not rely on a fixed lexicon or a simple segmentation algorithm (both of which tend to be error prone). Rather, analysis engine 204 selectively invokes a dynamic segmentation function 216 to segment items (characters or letters, for example) into strings (e.g., words). More precisely, segmentation function 216 segments the training set 222 into subsets (chunks) and calculates a cohesion score (i.e., a measure of the similarity between items within the subset). The segmentation and cohesion calculation is iteratively performed by segmentation function 216 until the cohesion score for each subset reaches a predetermined threshold.
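  • The patent does not define the cohesion measure at this point, so the sketch below uses a plausible stand-in: scoring a chunk by the average pointwise mutual information (PMI) of adjacent items, so that chunks whose items co-occur more often than chance score higher.

```python
# Illustrative cohesion score (an assumption): average adjacent-pair PMI.
import math
from collections import Counter

def cohesion(chunk, unigrams, bigrams, total):
    """Average PMI of adjacent items; higher = more cohesive chunk."""
    if len(chunk) < 2:
        return 0.0
    scores = []
    for a, b in zip(chunk, chunk[1:]):
        p_ab = bigrams[a + b] / (total - 1)
        p_a, p_b = unigrams[a] / total, unigrams[b] / total
        scores.append(math.log(p_ab / (p_a * p_b)) if p_ab else float("-inf"))
    return sum(scores) / len(scores)

corpus = "thecatthecatthedog"
unigrams = Counter(corpus)
bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
print(cohesion("the", unigrams, bigrams, len(corpus)))  # high: a real "word"
print(cohesion("tt", unigrams, bigrams, len(corpus)))   # low: spans a boundary
```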
  • The [0049] lexicon generation function 214 is invoked to dynamically generate and maintain a lexicon 220 in memory 206. According to one implementation, lexicon generation function 214 analyzes the segmentation results and generates a lexicon from item strings with a Markov transition probability that exceeds a threshold. In this regard, lexicon generation function 214 develops a dynamic lexicon 220 from item strings which exceed a pre-determined Markov transition probability taken from one or more language models developed by analysis engine 204. Accordingly, unlike prior art language models which rely on a known, fixed lexicon that is prone to error, analysis engine 204 dynamically generates a lexicon of statistically significant, statistically accurate item strings from one or more language models developed over a period of time. According to one embodiment, the lexicon 220 comprises a “virtual corpus” that Markov probability calculator 212 relies upon (in addition to the dynamic training set) in developing subsequent language models.
  • When invoked to modify or utilize the DOMM language model data structure, [0050] analysis engine 204 selectively invokes an instance of data structure memory manager 218. According to one aspect of the invention, data structure memory manager 218 utilizes system memory as well as extended memory to maintain the DOMM data structure. More specifically, as will be described in greater detail below with reference to FIGS. 6 and 7, data structure memory manager 218 employs a WriteNode function and a ReadNode function (not shown) to maintain a subset of the most recently used nodes of the DOMM data structure in a first level cache 224 of a system memory 206, while relegating least recently used nodes to extended memory (e.g., disk files in hard drive 144, or some remote drive), to provide for improved performance characteristics. In addition, a second level cache of system memory 206 is used to aggregate write commands until a predetermined threshold has been met, at which point data structure memory manager 218 makes one aggregate WriteNode command to an appropriate location in memory. Although depicted as a separate functional element, those skilled in the art will appreciate that data structure memory manager 218 may well be combined as a functional element of controller(s) 202 without deviating from the spirit and scope of the present invention.
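  • A minimal sketch of that two-level scheme follows; the class and method names (NodeStore, write_node, read_node), the capacities, and the dictionary standing in for disk-resident extended memory are all illustrative assumptions.

```python
# Two-level caching sketch: an LRU first-level cache of recently used nodes,
# plus a write buffer that is flushed as one aggregate write at a threshold.
from collections import OrderedDict

class NodeStore:
    def __init__(self, l1_capacity=4, write_batch=3):
        self.l1 = OrderedDict()  # first-level cache (most recently used)
        self.l1_capacity = l1_capacity
        self.write_buffer = {}   # second-level cache of pending writes
        self.write_batch = write_batch
        self.extended = {}       # stands in for disk files / remote storage

    def write_node(self, node_id, payload):
        self.write_buffer[node_id] = payload
        if len(self.write_buffer) >= self.write_batch:
            self.extended.update(self.write_buffer)  # one aggregate write
            self.write_buffer.clear()

    def read_node(self, node_id):
        if node_id in self.l1:
            self.l1.move_to_end(node_id)  # refresh recency
            return self.l1[node_id]
        payload = self.write_buffer.get(node_id, self.extended.get(node_id))
        self.l1[node_id] = payload
        if len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)   # evict the least recently used node
        return payload

store = NodeStore()
for i in range(5):
    store.write_node(i, {"item": chr(65 + i), "count": i})
print(store.read_node(2))  # {'item': 'C', 'count': 2}
```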
  • Example Data Structure—Dynamic Order Markov Model (DOMM) Tree [0051]
  • FIG. 3 graphically represents a conceptual illustration of an example Dynamic Order Markov Model tree-[0052] like data structure 300, according to the teachings of the present invention. To conceptually illustrate how a DOMM tree data structure 300 is configured, FIG. 3 presents an example DOMM data structure 300 for a language model developed from the English alphabet, i.e., A, B, C, . . . Z. As shown, the DOMM tree 300 is comprised of one or more root nodes 302 and one or more subordinate nodes 304, each associated with an item (character, letter, number, word, etc.) of a textual corpus, logically coupled to denote dependencies between nodes. According to one implementation of the present invention, root nodes 302 are comprised of an item and a frequency value (e.g., a count of how many times the item occurs in the corpus). At some level below the root node level 302, the subordinate nodes are arranged in binary sub-trees, wherein each node includes a compare bit ($b_i$), an item with which the node is associated (A, B, . . . ), and a frequency value ($C_N$) for the item.
  • Thus, beginning with the root node associated with the [0053] item B 306, a binary sub-tree is comprised of subordinate nodes 308-318 denoting the relationships between nodes and the frequency with which they occur. Given this conceptual example, it should be appreciated that starting at a root node, e.g., 306, the complexity of a search of the DOMM tree approximates log(N), where N is the total number of nodes to be searched.
  • As alluded to above, the size of the [0054] DOMM tree 300 may exceed the space available in the memory device 206 of LMA 104 and/or the main memory 140 of computer system 102. Accordingly, data structure memory manager 218 facilitates storage of a DOMM tree data structure 300 across main memory (e.g., 140 and/or 206) into an extended memory space, e.g., disk files on a mass storage device such as hard drive 144 of computer system 102.
  • Example Operation and Implementation [0055]
  • Having introduced the functional and conceptual elements of the present invention with reference to FIGS. 1-3, the operation of the innovative [0056] language modeling agent 104 will now be described with reference to FIGS. 5-10.
  • Building DOMM Tree Data Structure [0057]
  • FIG. 4 is a flow chart of an example method for building a Dynamic Order Markov Model (DOMM) data structure, according to one aspect of the present invention. As alluded to above, [0058] language modeling agent 104 may be invoked directly by a user or a higher-level application. In response, controller 202 of LMA 104 selectively invokes an instance of analysis engine 204, and a textual corpus (e.g., one or more documents) is loaded into memory 206 as a dynamic training set 222 and split into subsets (e.g., sentences, lines, etc.), block 402. Data structure generator 210 then assigns each item of the subset to a node in the data structure and calculates a frequency value for the item, block 404. According to one implementation, once the data structure generator has populated the data structure with the subset, frequency calculation function 213 is invoked to identify the occurrence frequency of each item within the training set subset.
  • In block [0059] 406, the data structure generator determines whether additional subsets of the training set remain and, if so, the next subset is read in block 408 and the process continues with block 404. In an alternate implementation, data structure generator 210 completely populates the data structure, a subset at a time, before invocation of the frequency calculation function 213. In an alternate embodiment, frequency calculation function 213 simply counts each item as it is placed into associated nodes of the data structure.
  • If, in block [0060] 406, data structure generator 210 has completely loaded the data structure 300 with items of the training set 222, data structure generator 210 may optionally prune the data structure, block 410. A number of mechanisms may be employed to prune the resultant data structure 300; one such mechanism is sketched below.
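  • One pruning mechanism consistent with the minimum-frequency discussion above is to drop any node observed fewer than a threshold number of times; this sketch assumes the Node class and `root` from the earlier trie example.

```python
# Prune nodes observed fewer than min_count times as statistically irrelevant.
def prune(node, min_count=3):
    node.children = {
        item: child for item, child in node.children.items()
        if child.count >= min_count
    }
    for child in node.children.values():
        prune(child, min_count)

prune(root)  # `root` from the earlier populating sketch
```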
  • Example Method for Lexicon, Segmentation and Language Model Joint Optimization [0061]
  • FIG. 5 is a flow chart of an example method for lexicon, segmentation and language model joint optimization, according to the teachings of the present invention. As shown, the method begins with [0062] block 400, wherein LMA 104 is invoked and a prefix tree of at least a subset of the received corpus is built. More specifically, as detailed in FIG. 4, data structure generator 210 of modeling agent 104 analyzes the received corpus and selects at least a subset as a training set, from which a DOMM tree is built.
  • In [0063] block 502, a very large lexicon is built from the prefix tree and pre-processed to remove some obviously illogical words. More specifically, lexicon generation function 214 is invoked to build an initial lexicon from the prefix tree. According to one implementation, the initial lexicon is built from the prefix tree using all sub-strings whose length is less than some pre-defined value, say ten (10) items (i.e., the sub-string is ten nodes or less from the root to the most subordinate node). Once the initial lexicon is compiled, lexicon generation function 214 prunes the lexicon by removing some obviously illogical words (see, e.g., block 604, below). According to one implementation, lexicon generation function 214 appends a pre-defined lexicon with the new, initial lexicon generated from at least the training set of the received corpus.
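  • Operationally, collecting every root-to-node path of ten nodes or fewer from the prefix tree is equivalent to collecting every substring of at most ten items from the training text; a hedged sketch, reusing the Node trie and `root` built earlier:

```python
# Initial lexicon: every root-to-node path in the trie, with its count.
# Depth is already bounded by max_depth=10 at insertion time.
def paths(node, prefix=""):
    for item, child in node.children.items():
        word = prefix + item
        yield word, child.count
        yield from paths(child, word)

lexicon = {word: count for word, count in paths(root)}
print(lexicon)  # e.g. {'A': 2, 'AB': 2, ..., 'B': 2, ...} for "ABAB"
```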
  • In [0064] block 504, at least the training set of the received corpus is segmented using the initial lexicon. More particularly, dynamic segmentation function 216 is invoked to segment at least the training set of the received corpus to generate an initial segmented corpus. Those skilled in the art will appreciate that there are a number of ways in which the training corpus could be segmented, e.g., fixed-length segmentation, Maximum Match, etc. Because a statistical language model (SLM) has not yet been generated from the received corpus, dynamic segmentation function 216 utilizes a Maximum Match technique to provide an initial segmented corpus. Accordingly, segmentation function 216 starts at the beginning of an item string (or branch of the DOMM tree) and searches the lexicon to see if the initial item ($I_1$) is a one-item "word". The segmentation function then combines it with the next item in the string to see if the combination (e.g., $I_1I_2$) is found as a "word" in the lexicon, and so on. According to one implementation, the longest string ($I_1, I_2, \ldots, I_N$) of items found in the lexicon is deemed to be the correct segmentation for that string. It is to be appreciated that more complex Maximum Match algorithms may well be utilized by segmentation function 216 without deviating from the scope and spirit of the present invention.
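  • A minimal forward Maximum Match sketch follows; the toy lexicons are illustrative. Note how a misleading long entry can force a bad greedy segmentation, which is one reason the later, statistically driven re-segmentation matters.

```python
# Forward Maximum Match: take the longest lexicon entry at each position,
# falling back to a single item when nothing longer matches.
def maximum_match(text, lexicon, max_len=10):
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

print(maximum_match("thecat", {"the", "cat"}))          # ['the', 'cat']
print(maximum_match("thecat", {"the", "cat", "thec"}))  # ['thec', 'a', 't']
```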
  • Having developed an initial lexicon and segmentation from the training corpus, an iterative process is entered wherein the lexicon, segmentation and language model are jointly optimized, block [0065] 506. More specifically, as will be shown in greater detail below, the innovative iterative optimization employs a statistical language modeling approach to dynamically adjust the segmentation and lexicon to provide an optimized language model. That is, unlike prior art language modeling techniques, modeling agent 104 does not rely on a pre-defined static lexicon or a simplistic segmentation algorithm to generate a language model. Rather, modeling agent 104 utilizes the received corpus, or at least a subset thereof (training set), to dynamically generate a lexicon and segmentation to produce an optimized language model. In this regard, language models generated by modeling agent 104 do not suffer from the drawbacks and limitations commonly associated with prior art modeling systems.
  • Having introduced the innovative process in FIG. 5, FIG. 6 presents a more detailed flow chart for generating an initial lexicon, and the iterative process of refining the lexicon and segmentation to optimize the language model, according to one implementation of the present invention. As before, the method begins with step [0066] 400 (FIG. 4) of building a prefix tree from the received corpus. As discussed above, the prefix tree may be built using the entire corpus or, alternatively, using a subset of the entire corpus (referred to as a training corpus).
  • In [0067] block 502, the process of generating an initial lexicon begins with block 602, wherein lexicon generation function 214 generates an initial lexicon from the prefix tree by identifying substrings (or branches of the prefix tree) with less than a select number of items. According to one implementation, lexicon generation function 214 identifies substrings of ten (10) items or less to comprise the initial lexicon. In block 604, lexicon generation function 214 analyzes the initial lexicon generated in block 602 for obviously illogical substrings, removing these substrings from the initial lexicon. That is, lexicon generation function 214 analyzes the initial lexicon of substrings for illogical, or improbable, words and removes these words from the lexicon. For the initial pruning, dynamic segmentation function 216 is invoked to segment at least the training set of the received corpus to generate a segmented corpus. According to one implementation, the Maximum Match algorithm is used to segment based on the initial lexicon. Then frequency calculation function 213 is invoked to compute the frequency of occurrence in the received corpus for each word in the lexicon, sorting the lexicon according to the frequency of occurrence. The word with the lowest frequency is identified and deleted from the lexicon. The threshold for this deletion and re-segmentation may be determined according to the size of the corpus. According to one implementation, a corpus of 600M items may well utilize a frequency threshold of 500 to be included within the lexicon. In this way, most of the obviously illogical words can be deleted from the initial lexicon.
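  • The sketch below ties those pruning steps together on assumed toy data, reusing maximum_match and Counter from the earlier sketches; the threshold of 2 stands in for the 500-count figure quoted for a 600M-item corpus.

```python
# Frequency-based lexicon pruning: segment with the current lexicon, count
# word occurrences, and drop entries falling below the threshold.
def prune_lexicon(corpus, lexicon, threshold):
    freq = Counter(maximum_match(corpus, lexicon))
    return {w for w in lexicon if freq[w] >= threshold}

lexicon = {"the", "cat", "thecab"}  # "thecab" is an obviously illogical entry
print(prune_lexicon("thecatthecat", lexicon, threshold=2))  # {'the', 'cat'}
```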
  • Once the initial lexicon is generated and pruned, [0068] step 502, the received corpus is segmented based, at least in part, on the initial lexicon, block 504. As described above, according to one implementation, the initial segmentation of the corpus is performed using a maximum matching process.
  • Once the initial lexicon and corpus segmentation process is complete, an iterative process of dynamically altering the lexicon and segmentation begins, to optimize a statistical language model (SLM) from the received corpus (or training set), block [0069] 506. As shown, the process begins in block 606, wherein the Markov probability calculator 212 utilizes the initial lexicon and segmentation to begin language model training using the segmented corpus. That is, given the initial lexicon and an initial segmentation, a statistical language model may be generated therefrom. It should be noted that although the language model does not yet benefit from a refined lexicon and a statistically based segmentation (which will evolve in the steps to follow), it is nonetheless fundamentally based on the received corpus itself.
  • In [0070] block 608, having performed initial language model training, the segmented corpus (or training set) is re-segmented using SLM-based segmentation. Given a sentence $w_1, w_2, \ldots, w_n$, there are M possible ways to segment it (where M ≥ 1). Dynamic segmentation function 216 computes a probability ($p_i$) of each segmentation ($S_i$) based on an N-gram statistical language model. According to one implementation, segmentation function 216 utilizes a tri-gram (i.e., N=3) statistical language model for determining the probability of any given segmentation. A Viterbi search algorithm is employed to find the most probable segmentation $S_k$, where:
  • $S_k = \operatorname{argmax}_i\,(p_i)$   (3)
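  • The sketch below illustrates the argmax of equation (3) with a dynamic program over split points. For brevity it scores words with an assumed unigram model rather than the tri-gram Viterbi search named in the text, but the principle of choosing the most probable segmentation is the same.

```python
# Most probable segmentation via dynamic programming over split points.
import math

def best_segmentation(text, logprob, max_len=10):
    # best[i] = (log-probability of the best segmentation of text[:i], split)
    best = [(0.0, 0)] + [(float("-inf"), 0)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j][0] + logprob(text[j:i])
            if score > best[i][0]:
                best[i] = (score, j)
    words, i = [], len(text)
    while i > 0:                       # backtrace the chosen split points
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

unigram = {"the": 0.4, "cat": 0.3, "thec": 0.2, "a": 0.05, "t": 0.05}
logprob = lambda w: math.log(unigram.get(w, 1e-9))
print(best_segmentation("thecat", logprob))  # ['the', 'cat']
```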
  • In [0071] block 610, the lexicon is updated using the re-segmented corpus resulting from the SLM-based segmentation described above. According to one implementation, modeling agent 104 invokes frequency calculation function 213 to compute the frequency of occurrence in the received corpus for each word in the lexicon, sorting the lexicon according to the frequency of occurrence. The word with the lowest frequency is identified and deleted from the lexicon. All occurrences of the word must then be re-segmented into smaller words, and the unigram counts for those words re-computed. The threshold for this deletion and re-segmentation may be determined according to the size of the corpus. According to one implementation, a corpus of 600M items may well utilize a frequency threshold of 500 to be included within the lexicon.
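  • A hedged sketch of that update step, reusing maximum_match and Counter from the earlier examples:

```python
# Drop the lowest-frequency lexicon word and re-segment its occurrences
# into smaller words using the remaining lexicon.
def drop_rarest(segmented, lexicon):
    freq = Counter(w for w in segmented if w in lexicon)
    rarest = min(freq, key=freq.get)
    lexicon = lexicon - {rarest}
    updated = []
    for w in segmented:
        updated.extend(maximum_match(w, lexicon) if w == rarest else [w])
    return updated, lexicon

segmented = ["the", "cat", "thecat", "the", "cat"]
print(drop_rarest(segmented, {"the", "cat", "thecat"}))
# (['the', 'cat', 'the', 'cat', 'the', 'cat'], {'the', 'cat'})
```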
  • In [0072] block 612, the language model is updated to reflect the dynamically generated lexicon and the SLM-based segmentation, and a measure of the language model perplexity (i.e., an inverse probability measure) is computed by Markov probability calculator 212. If the perplexity continues to decrease, i.e., the model continues to improve, the process continues with block 608, wherein the lexicon and segmentation are once again modified with the intent of further improving the language model performance (as measured by perplexity). If, in block 614, it is determined that the language model has not improved as a result of the recent modifications to the lexicon and segmentation, a further determination of whether the perplexity has reached an acceptable threshold is made, block 616. If so, the process ends.
  • If, however, the language model has not yet reached an acceptable perplexity threshold, [0073] lexicon generation function 214 deletes the word with the smallest frequency of occurrence in the corpus from the lexicon, re-segmenting the word into smaller words, block 618, as the process continues with block 610.
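  • Perplexity, the stopping criterion used throughout this loop, can be read as the exponentiated negative average log-probability of the corpus under the model, so lower values mean better prediction; a minimal sketch:

```python
# Perplexity: exp of the negative mean per-word log-probability.
import math

def perplexity(logprobs):
    """logprobs: natural-log probabilities the model assigns to each word."""
    return math.exp(-sum(logprobs) / len(logprobs))

# A model assigning each of four words probability 0.25 has perplexity 4.
print(perplexity([math.log(0.25)] * 4))  # ~4.0
```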
  • It is to be appreciated, based on the foregoing, that innovative [0074] language modeling agent 104 generates an optimized language model premised on a dynamically generated lexicon and segmentation rules statistically predicated on at least a subset of the received corpus. In this regard, the resultant language model has improved computational and predictive capability when compared to prior art language models.
  • ALTERNATE EMBODIMENTS
  • FIG. 7 is a block diagram of a storage medium having stored thereon a plurality of instructions including instructions to implement the innovative modeling agent of the present invention, according to yet another embodiment of the present invention. In general, FIG. 7 illustrates a storage medium/[0075] device 700 having stored thereon a plurality of executable instructions 702, at least a subset of which, when executed, implement the innovative modeling agent 104 of the present invention. When executed by a processor of a host system, the executable instructions 702 implement the modeling agent to generate a statistical language model representation of a textual corpus for use by any of a host of other applications executing on or otherwise available to the host system.
  • As used herein, [0076] storage medium 700 is intended to represent any of a number of storage devices and/or storage media known to those skilled in the art such as, for example, volatile memory devices, non-volatile memory devices, magnetic storage media, optical storage media, and the like. Similarly, the executable instructions are intended to reflect any of a number of software languages known in the art such as, for example, C++, Visual Basic, Hypertext Markup Language (HTML), Java, eXtensible Markup Language (XML), and the like. Moreover, it is to be appreciated that the storage medium/device 700 need not be co-located with any host system. That is, storage medium/device 700 may well reside within a remote server communicatively coupled to and accessible by an executing system. Accordingly, the software implementation of FIG. 7 is to be regarded as illustrative, as alternate storage media and software embodiments are anticipated within the spirit and scope of the present invention.
  • Although the invention has been described in language specific to structural features and/or methodological steps, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or steps described. Rather, the specific features and steps are disclosed as exemplary forms of implementing the claimed invention. [0077]

Claims (27)

1. A method comprising:
developing an initial language model from a lexicon and segmentation derived from a received corpus;
iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved; and
utilizing the iteratively refined language model in an application to predict a likelihood of another corpus.
2. A method according to claim 1, wherein the application is one or more of a spelling and/or grammatical checker, a word-processing application, a language translation application, a speech recognition application, and the like.
3. A method according to claim 1, wherein the step of developing an initial language model comprises:
generating a prefix tree data structure from items dissected from the received corpus;
identifying sub-strings of N items or less from the prefix tree data structure; and
populating the lexicon with the identified sub-strings.
4. A method according to claim 3, wherein N is equal to three (3).
5. A method according to claim 1, wherein predictive capability is quantitatively expressed as a perplexity measure.
6. A method according to claim 5, wherein the language model is refined until the perplexity measure is reduced below an acceptable predictive threshold.
7. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a method according to claim 1.
8. A computer system comprising:
a storage device having stored therein a plurality of executable instructions; and
an execution unit, coupled to the storage device, to execute at least a subset of the plurality of executable instructions to implement a method according to claim 1.
9. A computer system comprising:
a storage device having stored therein a plurality of executable instructions; and
an execution unit, coupled to the storage device, to execute at least a subset of the plurality of executable instructions to implement a method according to claim 1.
10. A method comprising:
developing an initial language model from a lexicon and segmentation derived from a received corpus;
iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved, wherein:
the step of iteratively refining the language model comprises:
re-segmenting the corpus by determining, for each segment, a probability of occurrence for that segment; and
updating the lexicon from the re-segmented corpus; and
computing a predictive measure for the language model using the updated lexicon and the re-segmented corpus, wherein the predictive measure is language model perplexity.
11. A method according to claim 10, further comprising:
determining whether the predictive capability of the language model improved as a result of the steps of updating and re-segmenting; and
performing additional updating and re-segmenting, if the predictive capability improved, until no further improvement is identified.
12. A method according to claim 10, wherein the probability of occurrence for a segment is calculated using an N-gram language model.
13. A method according to claim 12, wherein the N-gram language model is a tri-gram language model.
14. A method according to claim 10, wherein the probability of occurrence for a segment is calculated using two prior segments.
15. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a method according to claim 10.
16. A computer system comprising:
a storage device having stored therein a plurality of executable instructions; and
an execution unit, coupled to the storage device, to execute at least a subset of the plurality of executable instructions to implement a method according to claim 10.
17. A method comprising:
developing an initial language model from a lexicon and segmentation derived from a received corpus; and
iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved, wherein the initial language model is derived using a maximum match technique.
18. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a method according to claim 17.
19. A computer system comprising:
a storage device having stored therein a plurality of executable instructions; and
an execution unit, coupled to the storage device, to execute at least a subset of the plurality of executable instructions to implement a method according to claim 17.
20. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a language modeling agent, the language modeling agent including a function to develop an initial language model from a lexicon and segmentation derived from a received corpus, and a function to iteratively refine the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved, wherein the language modeling agent quantitatively determines predictive capability using a perplexity measure.
21. A storage medium according to claim 20, wherein the function to develop the initial language model generates a prefix tree data structure from items dissected from the received corpus, identifies sub-strings of N items or less from the prefix tree, and populates the lexicon with the identified sub-strings.
22. A storage medium according to claim 20, further comprising at least a subset of instructions which, when executed, implements an application utilizing the language model developed by the language modeling agent.
23. A system comprising:
a storage medium drive, to removably receive a storage medium according to claim 20; and
an execution unit, coupled to the storage medium drive, to access and execute at least a subset of the plurality of executable instructions populating the removably received storage medium to implement the language modeling agent.
24. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a language modeling agent, the language modeling agent including a function to develop an initial language model from a lexicon and segmentation derived from a received corpus, and a function to iteratively refine the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved, wherein the language modeling agent derives the lexicon and segmentation from the received corpus using a maximum matching technique.
25. A storage medium according to claim 24, wherein the function to develop the initial language model generates a prefix tree data structure from items dissected from the received corpus, identifies sub-strings of N items or less from the prefix tree, and populates the lexicon with the identified sub-strings.
26. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a language modeling agent, the language modeling agent including a function to develop an initial language model from a lexicon and segmentation derived from a received corpus, and a function to iteratively refine the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved, wherein:
the function to iteratively refine the initial language model does so by determining, for each segment, a probability of occurrence for that segment, and re-segmenting the corpus to reflect an improved segment probability; and
the language modeling agent utilizes hidden Markov probability measures to determine the probability of occurrence for each segment.
27. A storage medium according to claim 26, wherein the function to develop the initial language model generates a prefix tree data structure from items dissected from the received corpus, identifies sub-strings of N items or less from the prefix tree, and populates the lexicon with the identified sub-strings.
US10/842,264 1999-11-05 2004-05-10 System and iterative method for lexicon, segmentation and language model joint optimization Abandoned US20040210434A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/842,264 US20040210434A1 (en) 1999-11-05 2004-05-10 System and iterative method for lexicon, segmentation and language model joint optimization

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US16385099P 1999-11-05 1999-11-05
US09/609,202 US6904402B1 (en) 1999-11-05 2000-06-30 System and iterative method for lexicon, segmentation and language model joint optimization
US10/842,264 US20040210434A1 (en) 1999-11-05 2004-05-10 System and iterative method for lexicon, segmentation and language model joint optimization

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/609,202 Continuation US6904402B1 (en) 1999-11-05 2000-06-30 System and iterative method for lexicon, segmentation and language model joint optimization

Publications (1)

Publication Number Publication Date
US20040210434A1 true US20040210434A1 (en) 2004-10-21

Family

ID=26860000

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/609,202 Expired - Lifetime US6904402B1 (en) 1999-11-05 2000-06-30 System and iterative method for lexicon, segmentation and language model joint optimization
US10/842,264 Abandoned US20040210434A1 (en) 1999-11-05 2004-05-10 System and iterative method for lexicon, segmentation and language model joint optimization

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US09/609,202 Expired - Lifetime US6904402B1 (en) 1999-11-05 2000-06-30 System and iterative method for lexicon, segmentation and language model joint optimization

Country Status (5)

Country Link
US (2) US6904402B1 (en)
JP (1) JP2003523559A (en)
CN (1) CN100430929C (en)
AU (1) AU4610401A (en)
WO (1) WO2001037128A2 (en)

US6888948B2 (en) * 1997-01-13 2005-05-03 Micro Ear Technology, Inc. Portable system programming hearing aids
US6851048B2 (en) * 1997-01-13 2005-02-01 Micro Ear Technology, Inc. System for programming hearing aids
US20050196002A1 (en) * 1997-01-13 2005-09-08 Micro Ear Technology, Inc., D/B/A Micro-Tech Portable system for programming hearing aids
US20020168075A1 (en) * 1997-01-13 2002-11-14 Micro Ear Technology, Inc. Portable system programming hearing aids
US6157912A (en) * 1997-02-28 2000-12-05 U.S. Philips Corporation Speech recognition method with language model adaptation
US20010041602A1 (en) * 1997-05-02 2001-11-15 H. Stephen Berger Integrated hearing aid for telecommunications devices
US6240194B1 (en) * 1997-07-18 2001-05-29 U.S. Philips Corporation Hearing aid with external frequency control
US6229900B1 (en) * 1997-07-18 2001-05-08 Beltone Netherlands B.V. Hearing aid including a programmable processor
US6330233B1 (en) * 1997-07-29 2001-12-11 Matsushita Electric Industrial Co., Ltd. CDMA radio transmitting apparatus and CDMA radio receiving apparatus
US6048305A (en) * 1997-08-07 2000-04-11 Natan Bauman Apparatus and method for an open ear auditory pathway stimulator to manage tinnitus and hyperacusis
US6717925B1 (en) * 1997-08-12 2004-04-06 Nokia Mobile Phones Limited Point-to-multipoint mobile radio transmission
US6052657A (en) * 1997-09-09 2000-04-18 Dragon Systems, Inc. Text segmentation and identification of topic using language models
US6707581B1 (en) * 1997-09-17 2004-03-16 Denton R. Browning Remote information access system which utilizes handheld scanner
US6076056A (en) * 1997-09-19 2000-06-13 Microsoft Corporation Speech recognition system for recognizing continuous and isolated speech
US6163769A (en) * 1997-10-02 2000-12-19 Microsoft Corporation Text-to-speech using clustered context-dependent phoneme-based units
US6674867B2 (en) * 1997-10-15 2004-01-06 Belltone Electronics Corporation Neurofuzzy based device for programmable hearing aids
US6219427B1 (en) * 1997-11-18 2001-04-17 Gn Resound As Feedback cancellation improvements
US6695943B2 (en) * 1997-12-18 2004-02-24 Softear Technologies, L.L.C. Method of manufacturing a soft hearing aid
US20020111745A1 (en) * 1998-01-09 2002-08-15 Micro Ear Technology, Inc., D/B/A Micro-Tech. Portable hearing-related analysis system
US6366863B1 (en) * 1998-01-09 2002-04-02 Micro Ear Technology Inc. Portable hearing-related analysis system
US6023570A (en) * 1998-02-13 2000-02-08 Lattice Semiconductor Corp. Sequential and simultaneous manufacturing programming of multiple in-system programmable systems through a data network
US6545989B1 (en) * 1998-02-19 2003-04-08 Qualcomm Incorporated Transmit gating in a wireless communication system
US6104913A (en) * 1998-03-11 2000-08-15 Bell Atlantic Network Services, Inc. Personal area network for personal telephone services
US6418431B1 (en) * 1998-03-30 2002-07-09 Microsoft Corporation Information retrieval and speech recognition based on language models
US6141641A (en) * 1998-04-15 2000-10-31 Microsoft Corporation Dynamically configurable acoustic model for speech recognition system
US6347148B1 (en) * 1998-04-16 2002-02-12 Dspfactory Ltd. Method and apparatus for feedback reduction in acoustic systems, particularly in hearing aids
US6351472B1 (en) * 1998-04-30 2002-02-26 Siemens Audiologische Technik Gmbh Serial bidirectional data transmission method for hearing devices by means of signals of different pulsewidths
US6137889A (en) * 1998-05-27 2000-10-24 Insonus Medical, Inc. Direct tympanic membrane excitation via vibrationally conductive assembly
US6188979B1 (en) * 1998-05-28 2001-02-13 Motorola, Inc. Method and apparatus for estimating the fundamental frequency of a signal
US6151645A (en) * 1998-08-07 2000-11-21 Gateway 2000, Inc. Computer communicates with two incompatible wireless peripherals using fewer transceivers
US6240193B1 (en) * 1998-09-17 2001-05-29 Sonic Innovations, Inc. Two line variable word length serial interface
US6061431A (en) * 1998-10-09 2000-05-09 Cisco Technology, Inc. Method for hearing loss compensation in telephony systems based on telephone number resolution
US6188976B1 (en) * 1998-10-23 2001-02-13 International Business Machines Corporation Apparatus and method for building domain-specific language models
US6838485B1 (en) * 1998-10-23 2005-01-04 Baker Hughes Incorporated Treatments for drill cuttings
US6265102B1 (en) * 1998-11-05 2001-07-24 Electric Fuel Limited (E.F.L.) Prismatic metal-air cells
US6374210B1 (en) * 1998-11-30 2002-04-16 U.S. Philips Corporation Automatic segmentation of a text
US6251062B1 (en) * 1998-12-17 2001-06-26 Implex Aktiengesellschaft Hearing Technology Implantable device for treatment of tinnitus
US6208273B1 (en) * 1999-01-29 2001-03-27 Interactive Silicon, Inc. System and method for performing scalable embedded parallel data compression
US6334072B1 (en) * 1999-04-01 2001-12-25 Implex Aktiengesellschaft Hearing Technology Fully implantable hearing system with telemetric sensor testing
US6198971B1 (en) * 1999-04-08 2001-03-06 Implex Aktiengesellschaft Hearing Technology Implantable system for rehabilitation of a hearing disorder
US6094492A (en) * 1999-05-10 2000-07-25 Boesen; Peter V. Bone conduction voice transmission apparatus and system
US20040199375A1 (en) * 1999-05-28 2004-10-07 Farzad Ehsani Phrase-based dialogue modeling with particular application to creating a recognition grammar for a voice-controlled user interface
US6557029B2 (en) * 1999-06-28 2003-04-29 Micro Design Services, Llc System and method for distributing messages
US6490558B1 (en) * 1999-07-28 2002-12-03 Custom Speech Usa, Inc. System and method for improving the accuracy of a speech recognition program through repetitive training
US6590986B1 (en) * 1999-11-12 2003-07-08 Siemens Hearing Instruments, Inc. Patient-isolating programming interface for programming hearing aids
US6324907B1 (en) * 1999-11-29 2001-12-04 Microtronic A/S Flexible substrate transducer assembly
US6366880B1 (en) * 1999-11-30 2002-04-02 Motorola, Inc. Method and apparatus for suppressing acoustic background noise in a communication system by equalization of pre- and post-comb-filtered subband spectral energies
US6601093B1 (en) * 1999-12-01 2003-07-29 IBM Corporation Address resolution in ad-hoc networking
US20010003542A1 (en) * 1999-12-14 2001-06-14 Kazunori Kita Earphone-type music reproducing device and music reproducing system using the device
US6377925B1 (en) * 1999-12-16 2002-04-23 Interactive Solutions, Inc. Electronic translator for assisting communications
US20010040873A1 (en) * 1999-12-20 2001-11-15 Kabushiki Kaisha Toshiba Communication apparatus and method
US20010004397A1 (en) * 1999-12-21 2001-06-21 Kazunori Kita Body-wearable type music reproducing apparatus and music reproducing system which comprises such music reproducing apparatus
US20010031999A1 (en) * 2000-01-07 2001-10-18 John Carter Electro therapy method and apparatus
US6978155B2 (en) * 2000-02-18 2005-12-20 Phonak Ag Fitting-setup for hearing device
US20010033664A1 (en) * 2000-03-13 2001-10-25 Songbird Hearing, Inc. Hearing aid format selector
US6565503B2 (en) * 2000-04-13 2003-05-20 Cochlear Limited At least partially implantable system for rehabilitation of hearing disorder
US6575894B2 (en) * 2000-04-13 2003-06-10 Cochlear Limited At least partially implantable system for rehabilitation of a hearing disorder
US20010031996A1 (en) * 2000-04-13 2001-10-18 Hans Leysieffer At least partially implantable system for rehabilitation of a hearing disorder
US6697674B2 (en) * 2000-04-13 2004-02-24 Cochlear Limited At least partially implantable system for rehabilitation of a hearing disorder
US20010049566A1 (en) * 2000-05-12 2001-12-06 Samsung Electronics Co., Ltd. Apparatus and method for controlling audio output in a mobile terminal
US20020048374A1 (en) * 2000-06-01 2002-04-25 Sigfrid Soli Method and apparatus for measuring the performance of an implantable middle ear hearing aid, and the responses of a patient wearing such a hearing aid
US20020012438A1 (en) * 2000-06-30 2002-01-31 Hans Leysieffer System for rehabilitation of a hearing disorder
US20020026091A1 (en) * 2000-08-25 2002-02-28 Hans Leysieffer Implantable hearing system with means for measuring its coupling quality
US6554762B2 (en) * 2000-08-25 2003-04-29 Cochlear Limited Implantable hearing system with means for measuring its coupling quality
US20020076073A1 (en) * 2000-12-19 2002-06-20 Taenzer Jon C. Automatically switched hearing aid communications earpiece
US6584356B2 (en) * 2001-01-05 2003-06-24 Medtronic, Inc. Downloadable software support in a pacemaker
US20020095892A1 (en) * 2001-01-09 2002-07-25 Johnson Charles O. Cantilevered structural support
US6590987B2 (en) * 2001-01-17 2003-07-08 Etymotic Research, Inc. Two-wired hearing aid system utilizing two-way communication for programming
US6582628B2 (en) * 2001-01-17 2003-06-24 Dupont Mitsui Fluorochemicals Conductive melt-processible fluoropolymer
US6823312B2 (en) * 2001-01-18 2004-11-23 International Business Machines Corporation Personalized system for providing improved understandability of received speech
US20020150219A1 (en) * 2001-04-12 2002-10-17 Jorgenson Joel A. Distributed audio system for the capture, conditioning and delivery of sound
US20020183648A1 (en) * 2001-05-03 2002-12-05 Audia Technology, Inc. Method for customizing audio systems for hearing impaired
US6913578B2 (en) * 2001-05-03 2005-07-05 Apherma Corporation Method for customizing audio systems for hearing impaired
US20030064746A1 (en) * 2001-09-20 2003-04-03 Rader R. Scott Sound enhancement for mobile phones and other products producing personalized audio for users
US20030128859A1 (en) * 2002-01-08 2003-07-10 International Business Machines Corporation System and method for audio enhancement of digital devices for hearing impaired
US20050288263A1 (en) * 2002-05-09 2005-12-29 Jinghua Yang 2-(α-Hydroxypentyl) benzoate and its preparation and use

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7228270B2 (en) * 2001-07-23 2007-06-05 Canon Kabushiki Kaisha Dictionary management apparatus for speech conversion
US20030028369A1 (en) * 2001-07-23 2003-02-06 Canon Kabushiki Kaisha Dictionary management apparatus for speech conversion
US20060206313A1 (en) * 2005-01-31 2006-09-14 Nec (China) Co., Ltd. Dictionary learning method and device using the same, input method and user terminal device using the same
US9640176B2 (en) * 2005-03-21 2017-05-02 Nuance Communications, Inc. Apparatus and method for model adaptation for spoken language understanding
US20140330565A1 (en) * 2005-03-21 2014-11-06 At&T Intellectual Property Ii, L.P. Apparatus and Method for Model Adaptation for Spoken Language Understanding
US7941418B2 (en) * 2005-11-09 2011-05-10 Microsoft Corporation Dynamic corpus generation
US20070106977A1 (en) * 2005-11-09 2007-05-10 Microsoft Corporation Dynamic corpus generation
US20090006092A1 (en) * 2006-01-23 2009-01-01 Nec Corporation Speech Recognition Language Model Making System, Method, and Program, and Speech Recognition System
US9619465B2 (en) * 2006-02-17 2017-04-11 Google Inc. Encoding and adaptive, scalable accessing of distributed models
US20140257787A1 (en) * 2006-02-17 2014-09-11 Google Inc. Encoding and adaptive, scalable accessing of distributed models
US10089304B2 (en) 2006-02-17 2018-10-02 Google Llc Encoding and adaptive, scalable accessing of distributed models
US10885285B2 (en) * 2006-02-17 2021-01-05 Google Llc Encoding and adaptive, scalable accessing of distributed models
US20070271087A1 (en) * 2006-05-18 2007-11-22 Microsoft Corporation Language-independent language model using character classes
US20080195940A1 (en) * 2007-02-09 2008-08-14 International Business Machines Corporation Method and Apparatus for Automatic Detection of Spelling Errors in One or More Documents
US9465791B2 (en) * 2007-02-09 2016-10-11 International Business Machines Corporation Method and apparatus for automatic detection of spelling errors in one or more documents
US8463598B2 (en) * 2007-08-23 2013-06-11 Google Inc. Word detection
US20110137642A1 (en) * 2007-08-23 2011-06-09 Google Inc. Word Detection
US8010341B2 (en) 2007-09-13 2011-08-30 Microsoft Corporation Adding prototype information into probabilistic models
US20090076794A1 (en) * 2007-09-13 2009-03-19 Microsoft Corporation Adding prototype information into probabilistic models
US8353008B2 (en) * 2008-05-19 2013-01-08 Yahoo! Inc. Authentication detection
US20090288142A1 (en) * 2008-05-19 2009-11-19 Yahoo! Inc. Authentication detection
US8462123B1 (en) 2008-10-21 2013-06-11 Google Inc. Constrained keyboard organization
US8384686B1 (en) * 2008-10-21 2013-02-26 Google Inc. Constrained keyboard organization
US20140214407A1 (en) * 2013-01-29 2014-07-31 Verint Systems Ltd. System and method for keyword spotting using representative dictionary
US9798714B2 (en) * 2013-01-29 2017-10-24 Verint Systems Ltd. System and method for keyword spotting using representative dictionary
US10198427B2 (en) * 2013-01-29 2019-02-05 Verint Systems Ltd. System and method for keyword spotting using representative dictionary
US9639520B2 (en) * 2013-01-29 2017-05-02 Verint Systems Ltd. System and method for keyword spotting using representative dictionary
US20180067921A1 (en) * 2013-01-29 2018-03-08 Verint Systems Ltd. System and method for keyword spotting using representative dictionary
US9396724B2 (en) 2013-05-29 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and apparatus for building a language model
CN104217717A (en) * 2013-05-29 2014-12-17 腾讯科技(深圳)有限公司 Language model constructing method and device
WO2014190732A1 (en) * 2013-05-29 2014-12-04 Tencent Technology (Shenzhen) Company Limited Method and apparatus for building a language model
US9972311B2 (en) * 2014-05-07 2018-05-15 Microsoft Technology Licensing, Llc Language model optimization for in-domain application
US20150325235A1 (en) * 2014-05-07 2015-11-12 Microsoft Corporation Language Model Optimization For In-Domain Application
US9734826B2 (en) 2015-03-11 2017-08-15 Microsoft Technology Licensing, Llc Token-level interpolation for class-based language models
US11132389B2 (en) * 2015-03-18 2021-09-28 Research & Business Foundation Sungkyunkwan University Method and apparatus with latent keyword generation
US10614107B2 (en) 2015-10-22 2020-04-07 Verint Systems Ltd. System and method for keyword searching using both static and dynamic dictionaries
US11386135B2 (en) 2015-10-22 2022-07-12 Cognyte Technologies Israel Ltd. System and method for maintaining a dynamic dictionary
US10546008B2 (en) 2015-10-22 2020-01-28 Verint Systems Ltd. System and method for maintaining a dynamic dictionary
US11093534B2 (en) 2015-10-22 2021-08-17 Verint Systems Ltd. System and method for keyword searching using both static and dynamic dictionaries
CN109408794A (en) * 2017-08-17 2019-03-01 阿里巴巴集团控股有限公司 Method for building a word-frequency dictionary, word segmentation method, server, and client device
US10607604B2 (en) * 2017-10-27 2020-03-31 International Business Machines Corporation Method for re-aligning corpus and improving the consistency
US11276394B2 (en) * 2017-10-27 2022-03-15 International Business Machines Corporation Method for re-aligning corpus and improving the consistency
US20190130902A1 (en) * 2017-10-27 2019-05-02 International Business Machines Corporation Method for re-aligning corpus and improving the consistency
CN110162681A (en) * 2018-10-08 2019-08-23 腾讯科技(深圳)有限公司 Text recognition and text processing method and apparatus, computer device, and storage medium
CN110853628A (en) * 2019-11-18 2020-02-28 苏州思必驰信息科技有限公司 Model training method and device, electronic equipment and storage medium
CN111951788A (en) * 2020-08-10 2020-11-17 百度在线网络技术(北京)有限公司 Language model optimization method and device, electronic equipment and storage medium
CN113468308A (en) * 2021-06-30 2021-10-01 竹间智能科技(上海)有限公司 Dialogue act classification method and device, and electronic equipment

Also Published As

Publication number Publication date
WO2001037128A3 (en) 2002-02-07
US6904402B1 (en) 2005-06-07
WO2001037128A2 (en) 2001-05-25
AU4610401A (en) 2001-05-30
JP2003523559A (en) 2003-08-05
CN1387651A (en) 2002-12-25
CN100430929C (en) 2008-11-05

Similar Documents

Publication Publication Date Title
US6904402B1 (en) System and iterative method for lexicon, segmentation and language model joint optimization
US7275029B1 (en) System and method for joint optimization of language model performance and size
US7020587B1 (en) Method and apparatus for generating and managing a language model data structure
US7493251B2 (en) Using source-channel models for word segmentation
US6816830B1 (en) Finite state data structures with paths representing paired strings of tags and tag combinations
JP4945086B2 (en) Statistical language model for logical forms
US7158930B2 (en) Method and apparatus for expanding dictionaries during parsing
US11210468B2 (en) System and method for comparing plurality of documents
JP4215418B2 (en) Word prediction method, speech recognition method, speech recognition apparatus and program using the method
CN110457708B (en) Vocabulary mining method and device based on artificial intelligence, server and storage medium
US20030046078A1 (en) Supervised automatic text generation based on word classes for language modeling
US9720903B2 (en) Method for parsing natural language text with simple links
US20060277028A1 (en) Training a statistical parser on noisy data by filtering
US11068653B2 (en) System and method for context-based abbreviation disambiguation using machine learning on synonyms of abbreviation expansions
JP2002215619A (en) Method for extracting translation sentences from a translated document
Singh et al. A novel unsupervised corpus-based stemming technique using lexicon and corpus statistics
CN110688450B (en) Keyword generation method based on Monte Carlo tree search, keyword generation model based on reinforcement learning and electronic equipment
Babii et al. Modeling vocabulary for big code machine learning
JP2006065387A (en) Text sentence search device, method, and program
US10810368B2 (en) Method for parsing natural language text with constituent construction links
JP5291645B2 (en) Data extraction apparatus, data extraction method, and program
CN114817458A (en) Bid-winning item retrieval method based on funnel model and cosine algorithm
Pla et al. Improving chunking by means of lexical-contextual information in statistical language models
Mammadov et al. Part-of-speech tagging for Azerbaijani language
JP3369127B2 (en) Morphological analyzer

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034766/0001

Effective date: 20141014