US 7680659 B2
A method of training language model parameters trains discriminative model parameters in the language model based on a performance measure having discrete values.
1. A method comprising:
for each value of a feature weight in a set of discrete values for the feature weight: for each of a set of phonetic sequences:
a processor using a baseline language model to identify a set of candidate word sequences from the phonetic sequence, wherein the baseline language model designates one of the candidate word sequences as a most likely word sequence and wherein the baseline language model provides a probability for each candidate word sequence;
for each candidate word sequence in the set of candidate word sequences:
determining a value for a feature from the candidate word sequence;
multiplying the value of the feature weight by the value for the feature to produce a result and summing the result with the probability for the candidate word sequence provided by the baseline language model to produce a score for the candidate word sequence;
selecting the candidate word sequence with the highest score;
comparing the candidate word sequence with the highest score to an actual word sequence to determine a sum of the number of words in the actual word sequence that are replaced with another word in a candidate word sequence, the number of words in the actual word sequence that are omitted in the candidate word sequence, and the number of words present in the candidate word sequence that are not present in the actual word sequence to produce an error value;
summing the error values for the phonetic sequences together to form a sample risk; and
selecting the value for the feature weight that provides the smallest sample risk as the feature weight value for a feature in a discriminative language model.
2. The method of
using sample risk to select a feature to include in the discriminative language model.
3. The method of
4. The method of
5. A computer-readable storage medium storing computer-executable instructions that when executed by a processor cause the processor to perform steps comprising:
selecting a feature having a feature function for inclusion in a discriminative language model comprising a sum of weighted feature functions;
for each of a plurality of different possible values of a weight applied to the feature function for the selected feature in the sum of weighted feature functions in the discriminative language model:
scoring each of a plurality of candidate word sequences to produce a score for each candidate word sequence, wherein each candidate word sequence score is computed through steps comprising determining a value for the feature function of the selected feature from the respective candidate word sequence, multiplying the value of the weight by the value of the feature function of the selected feature and adding the result to a probability of the respective candidate word sequence provided by a baseline language model;
selecting a candidate word sequence of the plurality of word sequences with a best score of the scores for the candidate word sequences;
using the respective selected word sequence to generate a performance measure that is associated with the value of the weight by comparing the selected word sequence to an actual word sequence to determine a value for a discrete error function; and
using the performance measures associated with the values of the weight to select a value of the weight to store in the discriminative language model.
6. The computer-readable storage medium of
7. The computer-readable storage medium of
8. The computer-readable storage medium of
9. The computer-readable storage medium of
determining a plurality of sample risks from a plurality of candidate features, with one sample risk being associated with each candidate feature;
determining a baseline sample risk from a baseline model; and
using the plurality of sample risks and the baseline sample risk to select a feature from the plurality of candidate features.
10. The computer-readable storage medium of
11. The computer-readable storage medium of
12. A method of selecting features for a discriminative language model, the method comprising:
for each of a set of candidate features, determining a difference between a performance measure associated with a discriminative language model that uses the feature and a performance measure associated with a discriminative language model that does not use the feature, wherein the performance measure associated with the discriminative language model that uses the feature is based on a count of the number of words in an actual word sequence that are omitted in a candidate word sequence selected using the discriminative language model that uses the feature, wherein the discriminative language model is a linear discriminative function that provides a score for a candidate word sequence wherein the linear discriminative function comprises a weighted sum of feature function values that includes a feature function value for the feature and a separate probability for the candidate word sequence;
using each difference to score each candidate feature; and
selecting a candidate feature based on the scores.
13. The method of
determining an interference score for the feature, the interference score indicating the similarity in the performance of a discriminative language model that uses the feature and a discriminative language model that does not use the feature; and
using the interference score and the difference to score the feature.
14. The method of
determining a plurality of performance measures using different values for a model parameter associated with the feature;
selecting a value for the model parameter based on the plurality of performance measures; and
using the selected value of the model parameter when determining the performance measure for the discriminative model that uses the feature.
The task of language modeling is to estimate the likelihood of a word string. This is fundamental to a wide range of applications such as speech recognition and Asian language text input.
The traditional approach to language modeling uses a parametric model with maximum likelihood estimation (MLE), usually with smoothing methods to deal with data sparseness problems. This approach is optimal under the assumption that the true distribution of data on which the parametric model is based is known. Unfortunately, this assumption rarely holds in realistic applications.
An alternative approach to language modeling is based on the framework of discriminative training, which uses the much weaker assumption that training and test data are generated from the same distribution but that the form of the distribution is unknown. Unlike the traditional approach, which maximizes a function (the likelihood of the training data) that is only loosely associated with the error rate, discriminative training methods ideally aim to minimize the same performance measure used to evaluate the language model, namely the error rate on training data.
However, this ideal has not been achieved because the error rate on a given finite set of training samples takes a set of discrete values that form a step function (or piecewise constant function) of the model parameters, and thus cannot be easily minimized. To address this problem, previous research has concentrated on developing loss functions that provide a smooth loss curve approximating the error rate. Such loss functions have theoretically appealing properties, such as convergence and bounded generalization error. However, minimizing a loss function rather than the error rate means that such systems optimize a performance measure different from the one used to evaluate the system in which the language model is applied. As a result, training the language model to optimize the loss function does not guarantee that the language model will produce a minimum number of errors in realistic applications.
A method of training language model parameters trains discriminative model parameters in the language model based on a performance measure having discrete values.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with embodiments of the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
Embodiments of the invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Some embodiments of the invention are designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Memory 204 is implemented as non-volatile electronic memory such as random access memory (RAM) with a battery back-up module (not shown) such that information stored in memory 204 is not lost when the general power to mobile device 200 is shut down. A portion of memory 204 is preferably allocated as addressable memory for program execution, while another portion of memory 204 is preferably used for storage, such as to simulate storage on a disk drive.
Memory 204 includes an operating system 212, application programs 214 as well as an object store 216. During operation, operating system 212 is preferably executed by processor 202 from memory 204. Operating system 212, in one preferred embodiment, is a WINDOWS® CE brand operating system commercially available from Microsoft Corporation. Operating system 212 is preferably designed for mobile devices, and implements database features that can be utilized by applications 214 through a set of exposed application programming interfaces and methods. The objects in object store 216 are maintained by applications 214 and operating system 212, at least partially in response to calls to the exposed application programming interfaces and methods.
Communication interface 208 represents numerous devices and technologies that allow mobile device 200 to send and receive information. The devices include wired and wireless modems, satellite receivers and broadcast tuners to name a few. Mobile device 200 can also be directly connected to a computer to exchange data therewith. In such cases, communication interface 208 can be an infrared transceiver or a serial or parallel communication connection, all of which are capable of transmitting streaming information.
Input/output components 206 include a variety of input devices such as a touch-sensitive screen, buttons, rollers, and a microphone as well as a variety of output devices including an audio generator, a vibrating device, and a display. The devices listed above are by way of example and need not all be present on mobile device 200. In addition, other input/output devices may be attached to or found with mobile device 200 within the scope of the present invention.
Many embodiments of the present invention provide a method of training discriminative model parameters by attempting to minimize the error rate of the word sequences identified using the model parameters instead of optimizing a loss function approximation to the error rate.
Under one embodiment of the present invention, the discriminative language model is a linear discriminative function that provides a score for each of a set of candidate word sequences. The linear discriminative function uses a sum of weighted feature functions such as:

Score(W, A, λ) = Σ_{d=0..D} λ_d f_d(W, A)

where W is a candidate word sequence for phonetic string A, f_0(W, A) is a base feature equal to the probability that the baseline language model provides for W, f_d(W, A) for d&gt;0 are feature functions such as N-gram counts, and λ_d are the feature weights.
Thus, given a phonetic string A, the selection of a word sequence is defined as:

W*(A, λ) = argmax_W Score(W, A, λ)

where the argmax is taken over the set of candidate word sequences for A.
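The scoring and selection just described can be sketched in code. This is an illustrative sketch only, not the patented implementation; the candidate representation (a tuple of word sequence, baseline probability score, and feature values) is an assumption made for the example:

```python
def score(base_log_prob, feature_values, weights):
    # Linear discriminative score: the baseline model's probability score
    # plus the weighted sum of the feature function values.
    return base_log_prob + sum(w * f for w, f in zip(weights, feature_values))

def select_word_sequence(candidates, weights):
    # candidates: list of (word_sequence, base_log_prob, feature_values).
    # Return the candidate word sequence with the highest score.
    return max(candidates, key=lambda c: score(c[1], c[2], weights))[0]
```

For example, with candidates `[("a b", -2.0, [1.0]), ("a c", -3.0, [5.0])]` and a single feature weight of 0.5, the second candidate's feature contribution outweighs its lower baseline probability, so "a c" is selected; with a weight of 0.0 the baseline alone decides and "a b" is selected.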
In order to use the discriminative language model, the features for the model must be selected and their respective model parameters must be trained.
In step 300 of
At step 302, a discriminative model trainer 408 evaluates an error function for the top candidate identified by baseline language model 402 in each of training samples 404. This error function is determined using the actual word sequences 406 that correspond to phonetic sequences 400. In general, the error function provides a count of one for each word in actual word sequences 406 that is replaced with a different word in a training sample 404 or is omitted from the training sample 404. In addition, any extra word inserted into a training sample 404 that is not found in the actual word sequence 406 also adds one to the count. Summing the values of the error function over the training samples produces the sample risk for the baseline language model, SR(λ0).
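The error count just described, which counts substitutions, deletions, and insertions relative to the actual word sequence, is the standard word-level edit distance. A minimal sketch, assuming whitespace-delimited word sequences (the function name is illustrative):

```python
def word_errors(actual, candidate):
    # Word-level Levenshtein distance: the minimum number of
    # substitutions, deletions, and insertions needed to turn the
    # candidate word sequence into the actual word sequence.
    a, c = actual.split(), candidate.split()
    d = [[0] * (len(c) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(c) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(c) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (a[i - 1] != c[j - 1]),  # substitution or match
                d[i - 1][j] + 1,  # word of the actual sequence omitted
                d[i][j - 1] + 1,  # extra word inserted in the candidate
            )
    return d[len(a)][len(c)]
```

For instance, `word_errors("the cat sat", "the bat sat")` counts one substitution, and `word_errors("the cat sat", "the cat")` counts one omission.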
At step 304, discriminative model trainer 408 performs a line search to identify the best weight λi and associated sample risk, SR(λi), for each feature f_i(W), i&gt;0, in a set of candidate features 410. The weight of the base feature, λ0, is fixed during the training procedure. Under many embodiments, candidate features 410 include features for specific N-grams, which add one to a count each time a specific N-gram is found in a word sequence. A method for performing such a line search is shown in
In step 500 of
After all of the candidate word sequences in the training sample have been scored, the candidate word sequence with the highest score is selected and the error function is evaluated using the selected word sequence and the actual word sequence from actual word sequences 406 at step 506. At step 508, the value of the error function is stored for this particular value for λi and the current sample.
At step 510, the method determines if there are more values for λi to be evaluated. If there are more values, the process returns to step 502 and steps 504, 506 and 508 are repeated for the new value of λi. Steps 502, 504, 506, 508 and 510 are repeated until all of the discrete values for λi have been evaluated for the current feature.
At step 512, the process determines if there are more training samples in training samples 404. If there are more training samples, the next training sample is selected at step 500 and steps 500-510 are repeated for the next training sample. When there are no further training samples at step 512, the process continues at step 514 where the values of the error function for each value of λi are summed over all of the training samples to form a set of sample risks SR(λ0,λi), with a separate sample risk for each value of λi. In terms of an equation, the sample risk for a particular value of λi is:

SR(λ0, λi) = Σ_{m=1..M} Er_m(λi)

where M is the number of training samples and Er_m(λi) is the stored value of the error function for training sample m under the value λi.
At step 516, the optimum value for λi is selected based on the sample risk values. In some embodiments, the value of λi that produces the smallest sample risk is selected as the optimum value of λi. In other embodiments, a window is formed around each value of λi and the sample risk is integrated over the window. The value of λi that produces the lowest integration across the window is selected as the optimum value for λi as shown by:

λi* = argmin_{λi} ∫_{λi−b}^{λi+b} SR(λ0, λ) dλ

where b is the half-width of the window around each value of λi.
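The line search of steps 500-516 can be sketched as follows, using the first selection rule (smallest raw sample risk). The names and the representation of a training sample as a list of candidates paired with the actual word sequence are assumptions made for this example:

```python
def sample_risk(samples, weight, error_fn):
    # samples: list of (candidates, actual); each candidate is a tuple
    # (word_sequence, base_log_prob, feature_value) for the single
    # feature whose weight is being searched.
    risk = 0
    for candidates, actual in samples:
        # Score each candidate and keep the top-scoring word sequence.
        top = max(candidates, key=lambda c: c[1] + weight * c[2])
        risk += error_fn(actual, top[0])
    return risk

def line_search(samples, candidate_weights, error_fn):
    # Evaluate each discrete weight value and return the one that
    # yields the smallest sample risk over all training samples.
    return min(candidate_weights, key=lambda w: sample_risk(samples, w, error_fn))
```

With a single sample whose correct candidate only wins when the feature is weighted, the search prefers the nonzero weight, illustrating how the discrete error values drive the choice.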
At step 308, the top N features in ranked features 412 are examined to determine which feature provides a best gain value G(f), which is calculated as:

G(f) = Red(f)·(1 − Int(f))   EQ. 6

where Red(f) is the sample risk reduction provided by the feature and Int(f) is the interference term for the feature.
The sample risk reduction term of equation 6 is determined as:

Red(f) = SR(λ0) − SR(λ0, λf)

where SR(λ0) is the sample risk of the baseline model determined at step 302 and SR(λ0, λf) is the sample risk obtained with the optimum weight λf identified for the feature at step 304.
The interference term Int(f) is calculated as the cosine similarity between two vectors:
In EQ. 8, Tr() is also a column vector having a separate element for each training sample in training samples 404. The value of the i-th element in Tr() is the difference between the value of the error function for the top word sequence candidate identified by baseline model 402 for training sample i and the value of the error function for the top scoring word sequence identified by feature i using the optimum weight identified in step 304. The denominator in EQ. 8 is the product of the Euclidean length of vectors Tr(f) and Tr().
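The two terms can be sketched in code. The cosine-similarity form follows the description above; construction of the per-sample error-difference vectors is left to the caller, and the function names are illustrative rather than taken from the patent:

```python
import math

def risk_reduction(baseline_risk, feature_risk):
    # Red(f): how much the sample risk drops when the feature is
    # added with its optimum weight.
    return baseline_risk - feature_risk

def interference(tr_a, tr_b):
    # Cosine similarity between two per-sample error-difference
    # vectors: the dot product divided by the product of the
    # Euclidean lengths of the two vectors.
    dot = sum(x * y for x, y in zip(tr_a, tr_b))
    norm = math.sqrt(sum(x * x for x in tr_a)) * math.sqrt(sum(x * x for x in tr_b))
    return dot / norm if norm else 0.0
```

Identical error-difference vectors give an interference of 1.0 (the feature adds nothing new), while orthogonal vectors give 0.0 (the feature corrects a disjoint set of samples).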
At step 310, the feature, s, with the highest gain value as computed using EQ 6 is selected. At step 312, the value of the model parameter λs for the selected feature is updated using the line search of
At step 314, the updated value of λs and the feature, s, are added to the list of selected features and parameters of the discriminative model 414. The feature that is added to selected features 414 is also removed from ranked features 412.
At step 316, the gain values for the top N features remaining in ranked feature list 412 are updated using the newly added selected features and parameters to recompute column vector Tr(f) of EQ. 8.
At step 318, the process determines if more features should be added to the discriminative model. Under some embodiments, this is determined using a threshold value compared to the gain values of the top N features in ranked feature list 412. If at least one feature has a gain value that exceeds the threshold, the process returns to step 310 to select the feature with the best gain value and steps 312, 314 and 316 are repeated.
When no more features are to be added to discriminative model 414 at step 318, the selected features and their parameters can be used as a discriminative model to score candidate word sequences and thus select one candidate word sequence from a plurality of candidate word sequences as representing a phonetic string. This is shown in step 320 in
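The overall selection procedure of steps 310-318 amounts to greedy forward selection. A minimal sketch, in which `gain_fn` and `threshold` stand in for the gain computation and stopping test described above (both names are illustrative):

```python
def greedy_select(ranked, gain_fn, threshold):
    # Repeatedly pick the candidate feature with the best gain, move
    # it into the selected set, and stop once no remaining feature's
    # gain exceeds the threshold.
    selected = []
    while ranked:
        best = max(ranked, key=lambda f: gain_fn(f, selected))
        if gain_fn(best, selected) <= threshold:
            break
        ranked.remove(best)
        selected.append(best)
    return selected
```

In the full method, each selection would also re-run the line search to update the chosen feature's weight and recompute the gains of the remaining candidates; here `gain_fn` receives the current selected set so such recomputation can be modeled by the caller.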
As shown in
Although an error function has been used above, the present invention is not limited to such error functions. Instead, any performance measure with discrete values may be used in the line search of
Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.