WO2007068123A1 - Method and system for training and applying a distortion component to machine translation - Google Patents

Method and system for training and applying a distortion component to machine translation

Info

Publication number
WO2007068123A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
distortion
source
training
segments
Prior art date
Application number
PCT/CA2006/002056
Other languages
French (fr)
Inventor
Roland Kuhn
George Foster
Michel Simard
Eric Joanis
Original Assignee
National Research Council Of Canada
Priority date
Filing date
Publication date
Application filed by National Research Council Of Canada
Publication of WO2007068123A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/44Statistical methods, e.g. probability models

Definitions

  • This application is related to a means and a method for translating source language into target language. More specifically this application relates to a means and method of training and using a distortion component to machine translation.
  • Machine translation is the translation by a machine of sentences in one human language, the source language, into sentences in a second human language, the target language.
  • By "sentence" is meant any normal-sounding sequence of words in a given language conveying a complete meaning in its context, not necessarily a sequence of words that might be considered grammatically correct; for instance, "No way!" is a sentence in the sense intended here.
  • An important aspect of machine translation is finding, for a word or word sequence in the source language, the words in the target language that best translate it. However, once this has been done, the target- language words that have been found must often be reordered to reflect the characteristics of the target language.
  • Thus, today's machine translation systems have a "distortion" component that assesses the likelihood of each of the possible reorderings; this component may be a separate module or incorporated into other components. Since rules for word order vary from language to language, the distortion component must be created anew for each combination of source and target language: for instance, the distortion component in a system for translating German to English might have very different properties from the distortion component in a system for translating Chinese to English.
  • IBM models have some key drawbacks compared to today's phrase-based models. They are computationally expensive, both at the training step (when their parameters are calculated from training data) and when being used to carry out translation. Another disadvantage is that they allow a single word in one language to generate zero, one, or many words in the other language, but do not permit several words in one language to generate, as a group, any number of words in the other language. In other words, the IBM models allow one-to-many generation, but not many-to-many generation.
  • A form of phrase-based machine translation based on joint probabilities is described in "A Phrase-Based, Joint Probability Model for Statistical Machine Translation" by D. Marcu and W. Wong in Empirical Methods in Natural Language Processing, (University of Pennsylvania, July 2002); a slightly different form of phrase-based machine translation based on conditional probabilities is described in "Statistical Phrase-Based Translation" by P. Koehn, F.-J. Och, and D. Marcu in Proceedings of the North American Chapter of the Association for Computational Linguistics, 2003, pp. 127-133.
  • a "phrase” can be any sequence of contiguous words in a source-language or target-language sentence.
  • Phrase-based machine translation offers the advantage of both one-to-many word generation and many-to-many word generation.
  • the distortion component of a machine translation (MT) system inputs the source-language sentence (henceforth referred to as the "source sentence") and a set of complete or partial target-language hypotheses (henceforth the "target hypotheses") and generates a distortion score for each of the target hypotheses.
  • This score reflects how likely the distortion component considers the reordering of words in a particular hypothesis to be.
  • In some systems, the preferred hypotheses receive a high score and the ones considered to be unlikely receive a low score; in other systems, the convention is the opposite and it is the lowest scores that indicate the hypotheses preferred by the distortion component.
  • In conventional systems, the distortion score is basically a penalty on reordering. Expressed as a negative number, this penalty is least severe (closest to zero) for hypotheses whose word order is similar to that of the original source-language sentence, and more severe for hypotheses whose word order differs greatly from it.
  • phrase-based systems described in these two articles can only allow local rearrangements of phrases, within a window of (maximally) three phrases.
  • the kind of reordering necessary to perform good translation depends on the identity of the two languages involved.
  • the distortion component of a French-to-English system may not need to concern itself with the placement of verbs, since French and English tend to put verbs in the same sentence locations.
  • this French-to-English distortion component may need to handle the location of adjectives relative to the nouns they modify, since French tends to place adjectives after the noun instead of before the noun as in English (e.g., "la maison rouge" can be literally translated as "the house red").
  • Chinese questions have a very different word order from English questions, while Chinese statements often have a word order that is similar to that of English.
  • a conventional penalty-based distortion component can only learn one language-pair-dependent aspect of distortion: the severity of the penalty that should be assigned to reordering. For instance, it can learn that for language pairs such as Japanese and English, which have very different word order, the distortion penalty should be mild, while for language pairs such as French and English, which have somewhat similar word order, the distortion penalty should be severe. However, it is incapable of learning for a particular language pair when the words in an input sentence in the source language need to be drastically reordered for translation, and when the words of an input sentence can be left in roughly the same order. In other words, the conventional penalty is incapable of learning for a particular language pair when reordering is more or less likely.
  • the systems described in the papers cited above by C. Tillmann and T. Zhang, and by S. Kumar and W. Byrne can learn when local reordering is likely (within phrases that are very close to each other) but cannot deal with global reordering.
  • a method for generating a distortion component used in assigning a distortion score in a machine translation system which comprises the steps of a) providing a bilingual sentence pair having a training source sentence and a training target sentence; b) segmenting the training source sentence and the training target sentence into one or more than one training source sentence segments and one or more than one training target sentence segments; c) aligning the order of the training source sentence segments with the order of the corresponding training target sentence segments; d) forming a distorted training source sentence with the aligned training source sentence segments in the position order of the training target sentence segments; e) outputting the training source sentence and the associated distorted training source sentence to form a distortion training corpus; f) using a supervised learning tool on the distortion training corpus to generate a distortion component.
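  • The steps a) to f) above can be pictured with the following short sketch (Python, for illustration only); the segment aligner align_segments and the exact data structures are assumptions, stand-ins for the components described in more detail below.

```python
# Illustrative sketch only (not the patented implementation): align_segments() is a
# hypothetical stand-in for the segment aligner described below, and the supervised
# learning step f) is indicated only as a comment.

def build_distortion_training_corpus(bilingual_sentence_pairs, align_segments):
    """Steps a)-e): produce (source segments, distorted source segments) pairs."""
    corpus = []
    for source_sentence, target_sentence in bilingual_sentence_pairs:        # step a)
        alignment = align_segments(source_sentence, target_sentence)         # steps b), c)
        if alignment is None:                 # the pair could not be segment-aligned
            continue
        source_segments, target_order = alignment
        # step d): reorder the source segments to follow the order of the
        # target segments with which they are aligned
        distorted_source = [source_segments[i] for i in target_order]
        # step e): add the (original, distorted) pair to the distortion training corpus
        corpus.append((source_segments, distorted_source))
    return corpus

# step f): a supervised learning tool (e.g. the choice trees described later)
# would then be trained on `corpus` to obtain the distortion component.
```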
  • a method for providing a distortion score to a translation hypothesis for the translation of a source sentence made of words into a target sentence comprising the steps: providing a source sentence to a decoder; segmenting said source sentence into one or more than one segments; choosing a segment; removing said selected segment from the source sentence leaving remaining words; inputting said chosen segment into a partial distortion hypothesis; providing said chosen segment, the remaining words, and the partial distortion hypothesis to a distortion component; calculating a distortion score with a distortion component acquired through supervised learning from a distortion training corpus; repeating steps c to e until all words have been chosen; calculating a cumulative distortion score; outputting to said decoder said distortion score.
  • a computer readable memory for obtaining a distortion score for a translation hypothesis of a source sentence into a target sentence in a machine translation system
  • a machine translation system comprising: a source sentence; a decoder; the source sentence being inputted in said decoder; a phrase table; the phrase table having source words associated with corresponding target words, the phrase table providing possible segments to said decoder; a distortion component; the distortion component acquired through supervised learning from a distortion training corpus; the distortion training corpus made of a training source sentence and an associated distorted training source sentence; the decoder providing said distortion component with a selected segment from said source sentence and remaining segments from said source sentence and a distorted sentence hypothesis; the distortion component outputting a distortion score to said decoder;
  • a method for assigning a score to a translation hypothesis of a source sentence comprising the steps: providing a segmented translation sentence hypothesis; providing an associated segmented source sentence; providing an alignment of the segments of the translation sentence hypothesis with the segments of the source sentence; calculating a new distortion score based on the alignment with a distortion component acquired through supervised learning from a distortion training corpus having new source information; outputting a new set of rescored hypotheses.
  • the translation hypothesis provided is assigned a cumulative score comprising a phrase translation score; a language model score; a number of words score; and a distortion score, where said distortion score is assigned by a distortion component derived by means of a supervised learning tool from a bilingual sentence pair comprising a segmented training source sentence and a segmented training target sentence, such that the segments of the training source sentence have been aligned with the segments of the training target sentence.
  • Figure 1 illustrates one embodiment of a method of using a distortion component in a machine translation system.
  • Figure 2 illustrates one embodiment of the invention of using a distortion component during decoding.
  • Figure 3 illustrates one embodiment of the invention to generate a distortion training corpus.
  • Figure 4 illustrates another embodiment of the invention to generate a distortion training corpus.
  • Figure 5 illustrates yet another embodiment of the invention to generate a distortion training corpus.
  • Figure 6 illustrates an embodiment of the invention using choice trees for forming a distortion component.
  • Figure 7 illustrates an embodiment of the invention for growing choice trees.
  • Figure 8 illustrates another embodiment of the invention for growing choice trees.
  • Figure 9 illustrates another embodiment of the invention of using a distortion component during decoding.
  • Figure 10 illustrates an embodiment of the invention using labeling.
  • Figure 11 illustrates an embodiment of the invention of training a distorted source language model using a distorted source-language corpus.
  • Figure 12 illustrates an embodiment of the invention using a distortion component during rescoring.
  • Figure 13 illustrates an embodiment of the invention using a distortion component during rescoring.
  • the invention described here addresses the design of the distortion component for any specified pair of source and target languages.
  • the mathematical relationship between the multiple input features and the distortion score is learned in a separate training step on a set of examples of reordered sentences called "segment-aligned sentence pairs"; this set of examples must be generated for each combination of source and target language. For instance, to learn the relationship between the input features and the distortion score for the case of Chinese source sentences and English target hypotheses, one would train the system on segment- aligned sentence pairs in which one sentence in each pair is Chinese and the other English.
  • the invention is also applicable in the context of other approaches to statistical machine translation.
  • the invention is applicable to systems in which groups of words in the source sentence have been transformed in some way prior to translation.
  • some groups of words have been replaced by a structure indicating the presence of a given type of information or syntactic structure (e.g., a number, name, or date), including systems where such structures can cover originally noncontiguous words.
  • the invention applies not only to phrases but to other groupings and transformations of words as such, the word "segment" will be used instead of "phrase”.
  • the invention is applicable to systems in which reordering (distortion) of segments is applied before translation of the segments, to systems in which reordering is applied after translation of the segments, and to systems in which reordering and translation occur in arbitrary order.
  • the segments of the source-language sentence are reordered first, then each segment in the reordered sentence is translated into the target language.
  • the segments of the source-language sentence are left in their original order, then each is translated into a target-language segment; subsequently, the target-language segments are reordered.
  • phrase-based machine translation decoding (translation) of a source sentence S is carried out by finding word sequences T in the target language as shown in Fig. 1.
  • words belonging to the source language are written in the form s_j and words belonging to the target language are written in the form t_j.
  • phrase-based machine translation is built around "phrases".
  • the word "phrase” does not have grammatical significance; a phrase is simply a contiguous sequence of one or more words.
  • In Fig. 1 the phrase boundaries are indicated by vertical bars (except at the beginning or end of a sentence, where they are unnecessary).
  • phrase table is a data structure that shows, for a phrase in one language, which phrases in the other language correspond to it; it is obtained from bilingual training data by means of a complex alignment procedure.
  • the phrase table shown in the figure states that the two-word source phrase "s1 s2" can be translated into the target language as "t6 t7 t8" (three target-language words), "t6 t8" (two words), or "t7” (one word).
  • the phrase table will also include information about probability, relative frequency or likelihood of each possible translation for a given phrase (for space reasons this kind of information contained in the phrase table is not shown in the figure).
  • The process of finding the best translation is called "search" or "decoding"; the "search engine" or "decoder" is the component of the system that performs it.
  • the source sentence is segmented into "phrases" (segments).
  • Typically, the system will only create segmentations using phrases that are contained in the phrase table (i.e., those for which phrase translations are known).
  • For a source word that is not covered by any phrase in the phrase table, a special handling mechanism is invoked - e.g., the word might simply be cut out of the source sentence prior to segmentation.
  • the figure shows three possible segmentations of a nine-word source sentence; there may be other possible segmentations.
  • Distortion means that phrases can be shuffled; phrase translation means that a phrase in the source language is replaced by a phrase in the target language.
  • the top, leftmost target word sequence shown is "t1
  • "no reordering" leaving the phrases in the same order in the distorted source hypothesis as in the original source-language sequence - will always be one of the possible distortions.
  • phrase-based machine translation systems assign numerical scores to translation hypotheses.
  • the overall score assigned to a translation hypothesis will typically be a combination of sub-scores reflecting different aspects of the hypothesis and its relationship to the source sentence.
  • the sub-scores used in the calculation typically include a "phrase translation” score (which is higher if the phrase translations employed in generating the hypothesis had a high probability according to the phrase table, and lower if these translations are improbable), a "language model” score (which is higher if the sequence of words in the hypothesis is probable according to a model of the target language, and lower if the sequence of words is improbable), a "number of words” score (which penalizes hypotheses that seem to have too few or too many words), and a distortion score. If the user asks for a single translation of the source sentence, the translation hypothesis with the best overall score will be output.
  • One of the objects of the current invention is to assign distortion scores that more accurately reflect the probability that a given distortion will occur.
  • Although the distortion score is not the only component that goes into determining the overall score of a hypothesis, making it more accurate will improve the average performance of the overall score, ensuring that good translation hypotheses are output by the system more often.
  • DSH is used for "distorted source hypothesis”. This abbreviation will be used repeatedly in the text that follows. Note that each DSH can generate several different translation hypotheses, since a particular phrase in a DSH may be translated into several different target- language phrases. The figure shows three groups of translation hypotheses, each sharing the same DSH.
  • the distortion score found during decoding is the same for each member of a group, since this distortion score depends only on the relationship between the DSH and the original, undistorted, segmented source sentence; the distortion score doesn't depend on the target-language phrases chosen. It would be possible to construct another embodiment in which the distortion score does depend on the target-language phrases. (Later, it will be shown how even in the current embodiment, a distortion score obtained during an optional "rescoring of translation hypotheses" step that occurs after the initial decoding may depend on the sequence of words in each target-language hypothesis).
  • the overall scores for the translation hypotheses in a group will typically differ, since the non-distortion components of the overall score found during decoding (e.g., language model score, phrase translation score) depend on the target-language phrases chosen to match each source- language phrase.
  • Fig. 1 distortion and phrase translation are not shown sequentially (i.e., with distortion first and then phrase translation). This is because in many state-of-the-art systems, distortion and phrase translation are interleaved. For instance, generation of the target hypothesis "t1
  • the system decides to try choosing the phrase "s8 s9" as the first (leftmost) phrase in a DSH.
  • the system starts a number of different translation hypotheses by mapping "s8 s9" onto different possible translations for this phrase found in the phrase table.
  • the phrase table allows "s8 s9” to be translated by any one of these target-language phrases (there may be other translations of "s8 s9” that are possible according to the phrase table, but which were not chosen by the system).
  • each of these partial hypotheses is made up of two matching halves: the DSH and a target-language part.
  • The terms "decoder" and "search engine" are used interchangeably, as are "decoding" and "translation".
  • FIG. 2 shows a snapshot of the state of the decoder at a given point during the decoding process.
  • the decoder is attempting to extend a partial DSH consisting of two segments or phrases: "s8 s9
  • the decoder requires a distortion score in order to assess whether adding a given segment or phrase (e.g., "s1 s2") to the right end of the partial DSH is likely or unlikely.
  • this score is formulated as the estimated probability that the given phrase will be chosen next; in Fig. 2, the estimated probability that the phrase "s1 s2" will be chosen next is denoted P(s1 s2).
  • the distortion component is supplied with information about the context by the decoder. As shown in Fig. 2, this contextual information includes the partial DSH (with phrase boundary information retained in it) and information about which words in the source sentence have been consumed and which must still be placed into the DSH. Although in the figure, the score returned by the distortion component is shown as a probability, in the current embodiment the logarithm of this probability is used. Note that as segments or phrases are added to the right end of the partial DSH, a cumulative distortion score can be obtained.
  • the cumulative distortion score (in terms of probability) would be the product of the P(s8 s9) returned by the distortion component initially (when the partial DSH was still empty), with the P(s3 s4) returned by it when the DSH is set to "s8 s9", and so on: P(s8 s9)*P(s3 s4)*P(s1 s2)*P(s5 s6 s7).
  • In terms of the logarithm of the probability, this is equivalent to returning the sum of the logarithms of the individual probabilities: log P(s8 s9) + log P(s3 s4) + log P(s1 s2) + log P(s5 s6 s7).
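  • As a small illustration of this cumulative score (a sketch with invented probability values, not part of the claimed system), the product of the per-step probabilities can be accumulated equivalently as a sum of logarithms:

```python
import math

# Per-step probabilities returned by the distortion component as segments are
# appended to the right edge of the DSH (the values here are invented).
step_probabilities = [0.4, 0.6, 0.5, 0.8]   # P(s8 s9), P(s3 s4), P(s1 s2), P(s5 s6 s7)

product_score = 1.0   # cumulative probability
log_score = 0.0       # equivalent cumulative log-probability
for p in step_probabilities:
    product_score *= p
    log_score += math.log(p)

assert abs(math.log(product_score) - log_score) < 1e-12
```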
  • the distortion component may also optionally return an estimate called the "future score" as shown in Figure 2.
  • the future score is an estimate of the cumulative score associated with completing the DSH.
  • the phrase “s1 s2" is added to the partial DSH, the source-language words “s5", “s6", and “s7” will not yet have been used up - they will eventually have to be assigned to the DSH, in some order.
  • the distortion score for adding the source phrase currently being considered to the partial DSH is good, but choosing this phrase leads to later choices that have very bad scores.
  • It would be useful for the decoder to know this, so that it can avoid making choices that appear attractive in the short term but will lead to bad long-term consequences; the future score is a means for the distortion component to convey information about these long-term dangers to the decoder.
  • a future score for the words remaining in the source sentence, "s5", “s6”, and “s7” may be returned without the distortion component having exact information on how they will themselves be segmented or distorted later on (e.g., at a later stage they might be segmented as "s5
  • the new distortion component outputs a distortion score (and optionally, a future score) for a given source segment or phrase that the decoder is considering adding to the edge of a partial DSH.
  • Many different types of input information can be used to compute the score; the precise way in which these input features are combined is learned from the training corpus, which consists of segment-aligned sentence pairs. The training process is described below.
  • Examples of the kinds of input features that can be used in the computation of the distortion score and the future score include position-based, word-based, and syntax-based features, such as:
  • a possible feature could be the occurrence or non-occurrence of a given word in the first position of a particular segment or phrase; another possible feature could be the occurrence or non-occurrence of a given word in the last position of the source sentence.
  • Figure 3 shows the process for generating the "distortion training corpus", which is used to estimate the parameters of the distortion component of the invention.
  • the training source and training target sentences will often be shown in all lower-case form, since capitalization is usually removed in today's systems prior to translation (such systems often have a module that restores capitalization where appropriate after translation has taken place).
  • the training source language is German and the training target language is English.
  • the distortion training corpus is derived from a bilingual "sentence-aligned" corpus in which each sentence in the training source language is associated with exactly one sentence in the training target language, such that each member of the pair is a translation of the other.
  • a segment aligner processes the training bilingual sentence pairs in this corpus by dividing each member of the pair into segments, then deciding for each segment in the training source sentence which segment in the training target sentence corresponds to it.
  • the segment aligner then outputs, in addition to the segmented training source sentence, a distorted version of it in which the segments follow the order of the segments with which they are aligned in the training target sentence.
  • the distortion training corpus shown contains not only the original segmented training source sentences and their distorted counterparts, but also the corresponding segmented training target sentences (in italics). There are many different ways of building the segment aligner.
  • a phrase-based MT system searches for a way of segmenting the source sentence in a bilingual sentence pair into phrases, and then translating each of the training phrases (using its phrase table) in such a manner that the resulting target-language phrases can be reordered to form precisely the training target sentence of the sentence pair.
  • the "yield" for segment alignment was about 30% - that is, about 30% of the training sentence pairs in the bilingual sentence-aligned corpus could be segment- aligned and placed in the distortion training corpus by the phrase-based MT system.
  • the bilingual sentence-aligned corpus was different from the one from which the MT system's phrase table had been obtained.
  • One way of increasing the yield for the segment alignment step is to allow a phrase-based MT system to carry out segment alignment on the corpus on which the system was originally trained. In this case, there is a higher probability that the system will find a way of generating the target sentence of a sentence pair from the source sentence, since this exact sentence pair was used (along with others) to populate the system's phrase table.
  • the yield proportion of sentence pairs that could be completely segment-aligned was 60%.
  • the resulting distortion corpus is less suitable for training the distortion component than if it had been derived from data not used to train the system, because it is biased. For instance, during segment alignment on a bilingual corpus used to train the MT system that is carrying out the alignment, the system will never encounter words in the training source sentence that are not in its vocabulary. Thus, the distortion training corpus will not contain examples of how a source sentence containing such "unknown" words could be segment-aligned. Other, more subtle, biases will also be present in the distortion training corpus resulting from this kind of training; for instance, long phrases in the source sentence will have matches in the phrase table more often than may be the case for the final system (i.e., the system in which the distortion component will be incorporated).
  • Figure 4 shows a high-yield segment alignment method that avoids some of these problems.
  • a sentence pair is segment-aligned by the phrase-based MT system and placed in the distortion training corpus (as described above).
  • When a sentence pair cannot be segment-aligned, the MT system generates a list of N translations for the training source sentence from the pair.
  • a "selector" module then compares each of the N hypotheses with the actual target sentence and picks the hypothesis that is closest to it according to some reasonable metric - e.g., a measure of N-gram overlap similar to the BLEU metric (see K. Papineni, S. Roukos, T. Ward, and W. -J.
  • the translation that comes closest to "the poems were especially good” is "the poems were excellent", so the segment-alignment that ends up in the distortion training corpus is the one associated with "the poems were excellent”.
  • the target-language words chosen are irrelevant anyway; all that matters is having the correct segment alignment between the training source sentence and its distorted source counterpart.
  • the quality of the translation hypothesis chosen is poor (in terms of the target word sequence), it may still yield the correct segment alignment.
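  • A minimal sketch of such a selector is given below; the n-gram overlap measure used here is only a simple stand-in for the BLEU-like metric mentioned above, and the function names are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Set of n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(hypothesis, reference, max_n=3):
    """Crude n-gram overlap (up to trigrams) between a hypothesis and the reference."""
    score, used = 0.0, 0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hypothesis, n)
        if hyp_ngrams:
            score += len(hyp_ngrams & ngrams(reference, n)) / len(hyp_ngrams)
            used += 1
    return score / used if used else 0.0

def select_closest(n_best_hypotheses, reference):
    """Pick the hypothesis from the N-best list closest to the actual target sentence."""
    return max(n_best_hypotheses, key=lambda h: overlap_score(h, reference))

# As in the example in the text, "the poems were excellent" would be selected as
# closest to the reference "the poems were especially good".
best = select_closest(
    [["the", "poems", "were", "excellent"], ["poems", "are", "good"]],
    ["the", "poems", "were", "especially", "good"],
)
```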
  • There are other ways of producing the distortion training corpus that will not be described in detail. For instance, in cases where the MT system cannot segment-align a sentence pair, information about synonyms, the syntactic classes of words, etc., could be applied to generate the most plausible alignments. Another possibility would be to use a non-phrase-based MT translation model, such as any one of the IBM models, either by itself or in conjunction with a phrase-based model, to carry out segment alignment in order to produce the distortion training corpus. Finally, it would be possible to generate from a sentence pair more than one plausible segment alignment for the distortion training corpus - perhaps all possible segment alignments for that sentence pair, or a strategically defined subset of the possible segment alignments.
  • the role of the distortion component is to provide the decoder with information about which possible rearrangements of the words in the source sentence are likely, and which are unlikely.
  • the distortion component of this invention learns how to make predictions of this kind from the distortion training corpus, which typically consists of source-language sentences paired with distorted source-language sentences containing exactly the same words but representing a word order characteristic of the target language; each such pair can be called an example of a distortion. It is assumed that the distortion training corpus consists mainly of examples of correct or approximately correct distortions.
  • the present embodiment of the invention uses trees; however, it would be possible to modify the tree-based embodiment of the invention so that another tool for supervised learning was used instead of trees.
  • the tree training method used in this embodiment is the Gelfand-Ravishankar-Delp expansion-pruning method (S. Gelfand, C. Ravishankar, and E. Delp. "An Iterative Growing and Pruning Algorithm for Classification Tree Design", IEEE Transactions on Pattern Analysis and Machine Intelligence, V. 13, no. 2, pp. 163-174, February 1991).
  • the pruning step involves estimating probabilities for the held-out data (as in A. Lazarides, Y. Normandin, and R.
  • Figure 5 shows how the distortion training corpus is further processed in one embodiment of the invention, preparatory to training a set of trees that will be incorporated in the distortion component.
  • the source language is German; three German sentences are shown, along with their distortions into a phrase order imposed by the target language, English.
  • a third sentence, "Damals pronounce! (“Then not!) - presumably part of a dialogue and meaning "Not at that time! - is distorted to a phrase order that is more acceptable for English, "Nicht graf! ("Not then!).
  • the "segment choice history corpus" in Figure 5 reflects this history of choices made in the course of generating the distorted version of the source sentence, with the segment chosen shown after an arrow for each step.
  • the last step, in which there is only one segment left in the RS (the remaining segments of the source sentence not yet placed in the DSH), is always omitted, since at this stage there is effectively no choice left - the last remaining segment must be added to the right edge of the DSH.
  • the figure shows how the segment choice history corpus is further processed to yield subcorpora based on the number of choices made at each step.
  • the "4-choice” corpus contains only examples of steps where, in the process of generating the distortion, the choice is between exactly four segments left in the RS.
  • the distortion pair derived from "Damals nicht!" does not contribute any examples to the "4-choice corpus", since the maximum number of choices made during construction of "Nicht damals!" from "Damals nicht!" was three.
  • a training source sentence with S segments can contribute examples to a maximum of S-1 subcorpora (the 2-choice corpus, the 3-choice corpus, and so on up to the S-choice corpus).
  • the last subcorpus handles not only cases of exactly M choices, but also cases of more than M choices; this subcorpus is denoted the "M+" corpus.
  • the subcorpora generated included a 2-choice corpus, a 3-choice corpus, and so on, with the last two corpora being the 13-choice corpus and the 14+-choice corpus.
  • the latter includes not only examples of cases where there were 14 choices, but also cases where there were 15 choices, 16 choices, and so on.
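  • A sketch of how a segment choice history corpus could be split into these subcorpora, with cases of more than M choices absorbed into the "M+" corpus, might look as follows (the data structures are assumptions made for illustration):

```python
def split_into_subcorpora(choice_histories, max_choices):
    """choice_histories: one history per training sentence; each history is a list of
    steps, and each step is a pair (remaining_segments, chosen_segment).  Steps with
    only one remaining segment are omitted, as in the text.  max_choices is M; the
    last subcorpus collects all cases of M or more choices (the "M+" corpus)."""
    subcorpora = {n: [] for n in range(2, max_choices + 1)}
    for history in choice_histories:
        for remaining_segments, chosen_segment in history:
            n = len(remaining_segments)
            if n < 2:
                continue                      # no real choice left; skip
            bucket = min(n, max_choices)      # cases of > M choices go into the M+ corpus
            subcorpora[bucket].append((remaining_segments, chosen_segment))
    return subcorpora
```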
  • Figure 6 shows how each of the subcorpora is used to train a corresponding choice tree.
  • the value of M is 4; the 4+-choice tree handles cases of four or more choices. More detail on this training process is given in Figure 7. Note that since trees are a way of assigning probabilities to classes of examples, it is necessary to assign class labels to the choices in the training corpus. Many different labeling schemes are possible. For an N-choice or N+-tree, N different labels will be needed.
  • the labeling scheme shown in the figure is a simple left-to-right one, in which the leftmost choice in the RS receives the label "A", the middle one receives the label "B", and the rightmost one receives the label "C".
  • each example of a choice made will be assigned the label of the choice made in that situation: for instance, if the leftmost segment in the RS was chosen in a given example, that example receives the label "A". In the figure, the label assigned to each example follows an arrow ("→").
  • Figure 7 shows how once a classification tree has been grown using the standard tree-growing methods described in the technical literature cited above, it partitions the examples found in the training corpus. For instance, consider the first example shown in Figure 7, where the three-way choice is between "das buch" (labeled A), "gelesen" (labeled B), and the period "." (labeled C). The choice actually made was B, so the example as a whole receives the label B. For this example, the answer to the question in the top node (which will be explained shortly) is "no", so this example would be placed into the rightmost leaf node (the node numbered "5"). There, it would cause the count of the B class to be incremented by 1.
  • the counts associated with each class in a leaf can be used to estimate probabilities for new examples encountered during use of a tree. For instance, suppose that during decoding a situation is encountered where the decoder must choose between three segments in the RS, any one of which may now be added to the right edge of a given DSH. Also suppose that the tree shown in Figure 7 assigns this example to the rightmost leaf node in the figure (node number 5). Since that node has a count of 13 for label A, 11 for label B, and 1 for label C, one way of estimating probabilities for each of the three possibilities in the example would be to take the count for a label and divide by the total count of 25 for a node.
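  • Using the counts quoted above for leaf node 5 (13 for A, 11 for B, 1 for C), the relative-frequency estimate just described would be computed as in this small sketch:

```python
leaf_counts = {"A": 13, "B": 11, "C": 1}   # class counts stored in leaf node 5
total = sum(leaf_counts.values())          # 25

probabilities = {label: count / total for label, count in leaf_counts.items()}
# probabilities: A -> 0.52, B -> 0.44, C -> 0.04
```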
  • M+ trees differ from the others in that they handle not only cases where there are M choices in the RS, but also cases where there are more than M choices. These trees are trained on examples of M or more choices, and applied during decoding to cases of M or more choices; however, they only have M class labels.
  • An example of a 4+-choice tree is shown in Figure 8. The labeling scheme used here is the same left-to-right one used in Figure 7, but an extension to it is needed to deal with cases where there are more than M choices in the RS. Note that in the last training example shown here, the RS contains six segments - more than there are labels (only the labels A, B, C, and D are available).
  • A left-to-right bias could be incorporated within the last, "don't care" label, so that each such segment gets a higher probability than ones to the right of it. This might be done, for example, by giving each segment 2/3 of the probability of the one to the left of it, and then normalizing to ensure that the total probability for the "don't care" segments attains the value predicted by that node of the tree; many other such biases are possible.
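  • One possible realization of this bias is sketched below; the 2/3 decay factor and the probability mass value are the illustrative numbers mentioned above, not fixed parts of the invention.

```python
def distribute_dont_care_mass(total_mass, num_segments, decay=2.0 / 3.0):
    """Spread the probability mass predicted for the "don't care" label over the extra
    segments, each segment receiving `decay` times the share of the one to its left,
    then normalizing so that the shares sum to total_mass."""
    raw_shares = [decay ** i for i in range(num_segments)]
    normalizer = sum(raw_shares)
    return [total_mass * share / normalizer for share in raw_shares]

# e.g. a leaf predicts probability 0.3 for the "don't care" label covering three segments:
shares = distribute_dont_care_mass(0.3, 3)   # approximately [0.142, 0.095, 0.063]
```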
  • Figure 9 is an expanded version of Figure 2, showing details of the component labeled "Distortion component” in Figure 2 (according to one embodiment of the invention). Note that for the choice trees described above to provide a probability estimate that a given segment will be chosen next for the right edge of the partial DSH, the words in the source sentence that are not yet in the DSH must be assigned to segments. However, this may not yet be the case. In the Figure 2 and Figure 9 example, the decoder has requested a score for the segment "s1 s2". However, the decoder does not have information about the segmentation of the words "s5 s6 s7", which remain "unconsumed" by the DSH.
  • Figure 9 shows one way of solving this problem.
  • a module called the “segment generator” generates possible segmentations for the relevant "unconsumed” words.
  • This module may consult the phrase table so that only segmentations compatible with the phrase table are generated. For each of these, a score for the segment of interest can be generated from the appropriate choice tree.
  • the heuristic currently used is to return the maximum value found for the probability of interest (in the example, the probability of choosing "s1 s2" next).
  • If the future score could be known exactly, it would be a function of the future segmentation for the "unconsumed" words. For instance, in Figure 9, if the future segmentation for "s5 s6 s7" turns out to be "
  • That greedy estimate is obtained by adding the segment being scored (here, "s1 s2") to the right edge of the DSH, then using the choice trees to find which of the assumed segments has the highest probability of being chosen next, and making that choice repeatedly while calculating the product of distortion probabilities.
  • the 3-choice tree estimates that with a DSH set to "s8 s9
  • ” has the highest probability of being chosen: P(s7) = 0.5.
  • the system sets the DSH to "s8 s9
  • the DSH is now complete ("s6" will be used to complete it, with probability 1.0) and the future score estimate returned to the decoder will be 0.5*0.7 = 0.35.
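  • The greedy estimate described in this passage can be sketched as follows; choice_tree_probability is a hypothetical stand-in for whichever choice tree applies to the current number of remaining segments, and numbers such as 0.5, 0.7, and 1.0 in the worked example above would arise from repeated maximizations of this kind.

```python
def greedy_future_score(dsh, remaining_segments, choice_tree_probability):
    """Greedy estimate of the future distortion score for completing the DSH.

    choice_tree_probability(dsh, remaining, candidate) is assumed to return the
    probability, according to the appropriate choice tree, that `candidate` is
    chosen next given the current DSH and the remaining segments."""
    dsh = list(dsh)
    remaining = list(remaining_segments)
    score = 1.0
    while len(remaining) > 1:          # the last segment is forced (probability 1.0)
        best = max(remaining,
                   key=lambda segment: choice_tree_probability(dsh, remaining, segment))
        score *= choice_tree_probability(dsh, remaining, best)
        dsh.append(best)
        remaining.remove(best)
    return score
```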
  • Figure 7 and Figure 8 illustrate use of some of these features in the trees.
  • information about features is conveyed by the use of yes/no questions.
  • "pos(X)” is defined as a symbol for the original position plus one, in the source sentence, of the rightmost word in the partial DSH;
  • pos(A) is defined as the original position of the leftmost word in the RS segment labeled "A”.
  • the answer to the question "is pos(A)-pos(X) < 0?" is "yes"
  • the segment labeled A was originally in a position preceding the word now at the right edge of the partial DSH.
  • the question in node 2 of the tree shown in Figure 7, "ich ⁇ DSH?", is a question asking if the German word "ich” is found anywhere in the DSH.
  • Figure 8 illustrates three other kinds of questions.
  • the question in node 1 "VERB ⁇ B?” asks if any of the words in segment B are verbs (this question is only allowed if the system also contains a syntactic tagger for German capable of labeling words by their syntactic categories, such as VERB, NOUN, and so on).
  • the question in node 2 of Figure 8 asks whether the German word "und” is present at a specific location of the DSH (position 2).
  • the question in node 5 of Figure 8 asks whether the length of the segment labeled A, in words, is greater than five.
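  • The kinds of yes/no questions illustrated in Figures 7 and 8 could be implemented as simple predicates over the partial DSH and the labeled RS segments, as in the sketch below; the representation (lists of words, original word positions, syntactic tags) is an assumption made for illustration.

```python
def pos_difference_question(dsh_words, segment_a_words, original_position):
    """Is pos(A) - pos(X) < 0?  pos(X) is the original position (plus one) of the
    rightmost DSH word; pos(A) is the original position of the leftmost word of the
    RS segment labeled A.  original_position maps a word to its source position
    (this sketch assumes each word occurs only once)."""
    pos_x = original_position[dsh_words[-1]] + 1
    pos_a = original_position[segment_a_words[0]]
    return pos_a - pos_x < 0

def word_in_dsh_question(word, dsh_words):
    """e.g. "ich in DSH?": is the given source word anywhere in the partial DSH?"""
    return word in dsh_words

def word_at_position_question(word, dsh_words, position):
    """e.g. is "und" at position 2 of the DSH (positions counted from 1)?"""
    return len(dsh_words) >= position and dsh_words[position - 1] == word

def contains_verb_question(segment_tags):
    """e.g. "VERB in B?": does the segment contain a word tagged VERB (needs a tagger)?"""
    return "VERB" in segment_tags

def segment_length_question(segment_words, threshold=5):
    """e.g. is the segment labeled A longer than five words?"""
    return len(segment_words) > threshold
```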
  • Example #3 illustrates the tie-breaking rule.
  • Example #4 illustrates how, when the number of segments exceeds the available labels (this is a situation that will arise during the growing and use of an M+ tree, in this case a 4+ tree), the segments furthest from X receive the last, "don't care" label.
  • this embodiment models the choices of segment available to the decoder at a given time.
  • a typical decoder in a state-of-the-art system builds each target-language hypothesis starting with the left edge and proceeding rightwards, each time adding a previously unassigned segment from the source sentence to the right edge of the growing hypothesis, until all segments in the source sentence have been consumed.
  • the description of the invention given above assumes that it is applied from left to right, in step with the choices made by the decoder.
  • Figure 11 shows a method for extracting useful distortion-related information from the distortion training corpus of Figures 3-5 that is completely different from the methods discussed so far.
  • These models rely on statistics of frequent word sequences in a given language to make predictions about the probabilities of newly-encountered word sequences in that language.
  • Some forms of these models are called "N-gram" models, because they rely on sequences of N or fewer words. For instance, in the 3-gram model used in many automatic speech recognition systems, the probability of occurrence of a word is calculated from the two words preceding it. The calculation involves counts of 3-word sequences, 2-word sequences, and single words obtained from a large corpus. For instance, given that someone has just said the words "the old", a 3-gram model can be used to estimate the probability that the next word will be "elephant".
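  • As a toy illustration of the 3-gram calculation (with invented counts; real systems apply smoothing and back-off, which are omitted here), the probability of "elephant" following "the old" can be estimated from corpus counts:

```python
# Invented counts from a hypothetical corpus.
trigram_counts = {("the", "old", "elephant"): 3, ("the", "old", "man"): 40}
bigram_counts = {("the", "old"): 120}

def trigram_probability(w1, w2, w3):
    """Unsmoothed maximum-likelihood estimate of P(w3 | w1 w2)."""
    history_count = bigram_counts.get((w1, w2), 0)
    if history_count == 0:
        return 0.0
    return trigram_counts.get((w1, w2, w3), 0) / history_count

p = trigram_probability("the", "old", "elephant")   # 3 / 120 = 0.025
```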
  • a "distorted source-language corpus" containing only the distorted source sentences, with segmentation information removed, can be extracted from it.
  • This corpus contains sentences from the source language reordered to reflect the word order characteristic of a particular target language.
  • the example shows German sentences that have been reordered to reflect English word order, as was described earlier (if these same German sentences had originally been segment-aligned with sentences from a language other than English, their word order in the distorted source-language corpus might be quite different).
  • a distorted source language model (DSLM) is then trained on the distorted source-language corpus by means of standard techniques.
  • Since a DSLM can output probabilities on partial distorted source-language hypotheses (DSHs), it can be used as a standalone distortion component. That is, just as the module called "Distortion component" in Figure 2 can be embodied in a tree-based component as shown in Figure 9, it could also be embodied in a DSLM.
  • segmentation information is discarded from the DSH.
  • the score for segment "s1 s2" would simply be the conditional probability estimated by the DSLM of "s8 s9 s3 s4" being followed by "s1 s2".
  • the future score estimate could be obtained by a greedy procedure analogous to that described earlier.
  • the future score estimate for "s1 s2" is an estimate of the future probability score obtained when the remaining words in the source sentence ("s5", "s6", and "s7") are added to DSH. This could be obtained by assuming that the DSH is now "s8 s9 s3 s4 s1 s2", using the DSLM to determine which of the three remaining words has the highest probability of appearing at the right edge of this DSH, adding that word there while incorporating its DSLM probability in the future score, and so on until the remaining words are used up.
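  • Putting the last two paragraphs together, a DSLM used as a standalone distortion component could be sketched as below; dslm_probability is a hypothetical stand-in for the conditional probability of a next word given the distorted history, as supplied by an N-gram model trained on the distorted source-language corpus.

```python
def dslm_segment_score(dsh_words, segment_words, dslm_probability):
    """Score a candidate segment as the DSLM probability of its words following the
    (segmentation-free) DSH, e.g. the probability of "s1 s2" following "s8 s9 s3 s4"."""
    history = list(dsh_words)
    score = 1.0
    for word in segment_words:
        score *= dslm_probability(history, word)
        history.append(word)
    return score

def dslm_future_score(dsh_words, remaining_words, dslm_probability):
    """Greedy future-score estimate: repeatedly append the remaining word to which
    the DSLM assigns the highest probability, multiplying the probabilities."""
    history, remaining, score = list(dsh_words), list(remaining_words), 1.0
    while remaining:
        best = max(remaining, key=lambda w: dslm_probability(history, w))
        score *= dslm_probability(history, best)
        history.append(best)
        remaining.remove(best)
    return score
```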
  • DSLMs are possible.
  • One possibility would be to treat not the individual words, but the segments, found in the distortion training corpus as words for the purpose of statistical N-gram modeling.
  • the N-gram model trained on the data shown in Figure 11 would treat the first distorted example shown there as the "word" "ich habe" being followed by the "word" "gelesen", followed by the "word" "das buch" and the "word" ".".
  • the resulting N-gram model will have as its units segments found in the distortion training corpus, rather than individual words.
  • A disadvantage of the approach shown in Figure 11 is that the actual word movements are lost.
  • the DSLM may predict that
  • the training data might consist of distorted word sequences in which each word is annotated with its displacement from its original position.
  • a word-by-word translation of which is "I
  • a DSLM or a combination of DSLMs can be used as a standalone distortion component.
  • Another embodiment of the use of DSLMs is as features input to a system based on supervised learning, such as the tree-based embodiment of the invention shown in Figure 9. Many of the input features described earlier complement the kind of information contained in the DSLM. To incorporate DSLM information in the trees, yes/no questions pertaining to DSLM scores are devised.
  • DSLM(seg, DSH) denotes the conditional probability assigned by the DSLM to the RS segment with label “seg” following the words in the DSH
  • examples of possible questions could be “Is DSLM(A, DSH) > DSLM(B, DSH)?”, "Is DSLM(C, DSH)
  • the output of the first step, performed by the decoder and based on a given set of information sources, is a representation of the most probable translation hypotheses according to these information sources.
  • This representation may, for instance, be a set of N hypotheses, each accompanied with a probability score (an "N-best list"), or a word lattice with probabilities associated with transitions in the lattice.
  • a second set of information sources is used to assign new probability scores to the translation hypotheses encoded in the representation output from the first step; this is called "rescoring".
  • the hypotheses are then reordered according to the new scores, so that a hypothesis that received a fairly low ranking from the decoder may end up as the system's top guess after rescoring.
  • the set of information sources used for rescoring is a superset of the set of information sources used for decoding - hence the names "small set” and “large set” employed in Figure 12 for the set of decoding and rescoring information sources respectively.
  • the only requirement for an information source for rescoring is that it be capable of generating a numerical score for each translation hypothesis. This score is returned by a "feature function" whose input is a translation hypothesis, and often the original source sentence.
  • a weight estimation procedure is invoked prior to use of the complete two-step system to assign weights to the information sources employed in the second step, with larger weights being assigned to more reliable information sources.
  • the new distortion method can be applied during decoding, during rescoring, or during both decoding and rescoring.
  • each of the N-best translation hypotheses can be segment-aligned with the original source sentence, so that the DSH can be recovered. This can be achieved by ensuring that during decoding, information about which segment of the source sentence generated which segment of the target sentence is retained. Note that unlike the situation during decoding, the system now doesn't have to guess the segmentation for the source sentence - it is known. Also note that while for the decoding step, only a source-to-target language distortion model is used, for the rescoring step a model for the reverse direction can also be used. The model for the reverse, target-to-source direction could be trained on the same segment-aligned data as the source-to-target one, or on different data.
  • the source-to-target model is used to estimate the probability that the original source sentence could have been distorted into the word order represented by a particular hypothesis, while the target-to-source model is used to estimate the probability that a particular hypothesis could be distorted into the word order represented by the original source sentence.
  • the source language is German and the target language is English.
  • One way for the system to generate the source-to-target distortion score is for it to move from left to right in the DSH, multiplying probabilities assigned by a source-to-target distortion model (of the same form as that described earlier for decoding) as it goes.
  • the score for H1 in Figure 13 generated by the German-to-English distortion feature function would be initialized to the probability assigned by the model to choosing the segment "ich habe" when the DSH is empty and the RS consists of the segments "das buch", "gelesen", and ".". This probability would then be multiplied by the probability of choosing "gelesen" at the next step - and so on.
  • H1 and H3 are not the same, they represent the same German-to-English distortion, and will thus be assigned the same German-to- English distortion feature function score.
  • the system is calculating the probability of distorting an English-language hypothesis into German-like word order. For instance, its score for H1 will be based on the probability (according to a model of English distorted into German word order that is trained separately from the German-to-English one) that "i have
  • the source-to-target and target-to-source distortion models are by no means equivalent; for instance, they may use different word input features.
  • the German-to-English trees may contain questions about the presence or absence of certain German words; the English-to-German trees would contain questions about the presence or absence of English words.
  • the distorted source language model (DSLM) for this example would involve only German words (in English-like order), while the distorted target language model (DTLM) for this invention would involve only English words (in German-like word order).
  • the two feature functions shown in Figure 13 - the source-to-target distortion feature function, and the target-to-source distortion feature function - will generate different probability scores for the same hypotheses.
  • The segment choice aspect of the invention may be applied in any order of choices, not necessarily the left-to-right order on target-language hypotheses favoured by today's decoders. For instance, one could train and apply a rescoring module that assumes the DSH is constructed by first choosing the rightmost segment, then choosing the second rightmost segment, and so on, proceeding leftwards until all the segments in the source sentence have been consumed. Another possibility would be a rescoring module that begins in the middle of a DSH and proceeds to grow outwards, making choices at both its left and right edges.
  • the distortion-based feature functions for rescoring just described are similar to those that can be used for decoding. However, other distortion-based feature functions that are trained on distortion corpora, but that are less tied to the functioning of the decoder, can be devised. Such feature functions assess the overall relationship between the source sentence and the complete DSH associated with a complete target-language hypothesis, or between a complete target-language hypothesis and its distortion into the ordering represented by the source sentence. Details of the order in which choices are made may be ignored by such feature functions.
  • One such feature function assesses the probability of the permutation between the source sentence and a complete DSH (or between a target- language hypothesis and its distorted version).
  • This permutation could be measured, for instance, by a rank correlation metric such as Spearman's correlation or Kendall's tau, both of which are used to measure the extent to which two lists of ranks (e.g., from 1 to N) are correlated with each other. These take values between +1 and -1, with +1 meaning that the two lists are perfectly correlated.
  • the inputs to one of these metrics would be the original word sequence and its distorted counterpart.
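  • For instance, Kendall's tau between the original positions of the words and their positions in the distorted counterpart can be computed as in this small sketch (a brute-force version, adequate for sentence-length inputs):

```python
def kendalls_tau(original_positions, distorted_positions):
    """Kendall's tau between two rankings of the same n items: +1 means the relative
    order is identical, -1 means it is completely reversed."""
    n = len(original_positions)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            a = original_positions[i] - original_positions[j]
            b = distorted_positions[i] - distorted_positions[j]
            if a * b > 0:
                concordant += 1
            elif a * b < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

tau_same = kendalls_tau([1, 2, 3, 4], [1, 2, 3, 4])       # +1.0: no reordering
tau_reversed = kendalls_tau([1, 2, 3, 4], [4, 3, 2, 1])   # -1.0: fully reversed
```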
  • a permutation-based feature function would assign a better score to hypotheses in the N-best list whose word sequence had approximately the most probable correlation with the word sequence in the source sentence, and penalize those whose correlation with the source sentence was too great or too little.
  • the probability of a given amount of permutation often depends on the nature of the sentence, so the usefulness of the permutation-based feature function would be enhanced by dividing input sentences into several classes, each associated with a different distribution of probability across possible amounts of permutation.
  • the supervised learning tools described earlier such as trees, combined with input features based on the presence or absence of certain words or syntactic categories, can be used to do this. For instance, recall that word order in Chinese and English differs more - there is more permutation - if the sentence is a question.
  • a tree-based feature function for predicting the amount of permutation for the Chinese-to-English or English-to- Chinese task would probably contain questions about the presence or absence of "?" at the end of the source sentence, and assign higher probability to lower correlation between the original and the distorted word sequence if the "?" was present.
  • a tree-based permutation feature function for English-to-German or German-to-English would probably contain questions about the presence or absence of subordinate verbs.
  • Another kind of feature function would assess the relationship between the original position of each individual word in the source sentence, and its position in the complete DSH.
  • certain source words may have a tendency to move left or right in the DSH, by small or large amounts.
  • past participles of German verbs have a tendency to move left.
  • In the section above entitled "Statistical N-gram Modeling of Distorted Source Language", it was shown how a word in the complete DSH may be annotated by its displacement from its original position; it was shown how this displacement can be calculated in various ways.
  • the section showed how a hybrid DSLM can be used during decoding to score the probability that a given word will be displaced a given amount from its original position.
  • these and other types of DSLM can be used as feature functions for rescoring; so can their converse, DTLMs that score the probability of a target hypothesis being distorted into the word order characteristic of the source sentence.
  • DSLMs and DTLMs based on statistical N-gram approaches work particularly well for decoding in today's systems, because of the left-to-right nature of such decoding.
  • the system has knowledge of the complete sequence of each target-language hypothesis, and thus knows the complete DSH for the hypothesis (and its converse, the complete sequence of words in the target hypothesis rearranged to reflect the order of words in the source sentence).
  • feature functions for modeling word displacement can be based on a broad range of supervised learning tools with word-based and syntax-based features (as well as on DSLMs and DTLMs). Such tools can, for instance, learn how the presence of other words at other positions in the source sentence affects the direction and magnitude of the displacement of the word currently being considered.
  • the questions in the tree ask about the identity of the word itself, about the identity of words elsewhere in the sentence, about the number of words in the sentence, and so on.
  • For each word in a given source sentence one can thus calculate its probability of being displaced by the amount found in the DSH.
  • the overall feature function score for a given DSH is obtained by multiplying together the probabilities assigned by the tree to the displacements of the individual words.
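For illustration, the following Python sketch shows how the rank correlation mentioned in the list above could be computed. It is a minimal sketch only: the representation of a complete DSH as the list of original source-word positions in their distorted order, and the function name, are assumptions made here for clarity rather than details taken from the invention.

```python
# Minimal sketch (assumed representation): a complete DSH is given as the list
# of original source-word positions in their distorted order.
def kendall_tau(distorted_positions):
    """Kendall's tau between the original order (0, 1, ..., N-1) and the distorted
    order: +1 means the orders agree perfectly, -1 means one is the reverse of the other."""
    n = len(distorted_positions)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if distorted_positions[i] < distorted_positions[j]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([0, 1, 2, 3, 4]))   # 1.0  (no reordering)
print(kendall_tau([4, 3, 2, 1, 0]))   # -1.0 (complete reversal)
print(kendall_tau([0, 1, 4, 2, 3]))   # 0.6  (moderate reordering)
```

A rescoring feature function could then map this value, together with class information about the source sentence (for instance, whether it ends in "?"), to a probability of the observed amount of permutation.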

Abstract

Machine translation is the translation by a machine of sentences in one human language (the source language) into sentences in a second human language (the target language). However, once the words in a sentence have been translated, the target-language words that have been found must often be reordered to reflect the characteristics of the target language. Thus, a 'distortion' component is desirable to assess the extent to which each reordering reflects a correct translation. Since rules for word order vary from language to language, the system provides a distortion component which assigns a distortion score to individual translation hypotheses. The distortion component is derived through a supervised learning system from a source sentence and a distorted source sentence originating from a bilingual sentence pair. The distortion component is based on multiple features; the features can be position-based features, word-based features, and/or syntax-based features.

Description

Title: Method and System for Training and Applying a Distortion Component to Machine Translation
Field of the Invention
This application is related to a means and a method for translating source language into target language. More specifically this application relates to a means and method of training and using a distortion component to machine translation.
Background of the Invention
Machine translation is the translation by a machine of sentences in one human language, the source language, into sentences in a second human language, the target language. By "sentence" is meant any normal-sounding sequence of words in a given language conveying a complete meaning in its context, not necessarily a sequence of words that might be considered grammatically correct; for instance, "No way!" is a sentence in the sense intended here. An important aspect of machine translation is finding, for a word or word sequence in the source language, the words in the target language that best translate it. However, once this has been done, the target- language words that have been found must often be reordered to reflect the characteristics of the target language. Thus, today's machine translation systems have a "distortion" component that assesses the likelihood of each of the possible reorderings; this component may be a separate module or incorporated into other components. Since rules for word order vary from language to language, the distortion component must be created anew for each combination of source and target language: for instance, the distortion component in a system for translating German to English might have very different properties from the distortion component in a system for translating Chinese to English.
Important early work on statistical machine translation, preceding the development of phrase-based translation, was carried out by researchers at IBM in the 1990s. These researchers developed a set of mathematical models for machine translation now collectively known in the machine translation research community as the "IBM models", which are defined in "The Mathematics of Statistical Machine Translation: Parameter Estimation" by P. Brown et al., Computational Linguistics, June 1993, V. 19, no. 2, pp. 263-312. Henceforth, the expression "IBM models" in this document will refer to the mathematical models defined in this article by P. Brown et al.
Though mathematically powerful, these IBM models have some key drawbacks compared to today's phrase-based models. They are computationally expensive, both at the training step (when their parameters are calculated from training data) and when being used to carry out translation. Another disadvantage is that they allow a single word in one language to generate zero, one, or many words in the other language, but do not permit several words in one language to generate, as a group, any number of words in the other language. In other words, the IBM models allow one-to-many generation, but not many-to-many generation.
A phrase-based machine translation based on joint probabilities is described in "A Phrase-Based, Joint Probability Model for Statistical Machine Translation" by D. Marcu and W. Wong in Empirical Methods in Natural Language Processing, (University of Pennsylvania, July 2002); a slightly different form of phrase-based machine translation based on conditional probabilities is described in "Statistical Phrase-Based Translation" by P. Koehn, F.-J. Och, and D. Marcu in Proceedings of the North American Chapter of the Association for Computational Linguistics, 2003, pp. 127-133. In these documents, a "phrase" can be any sequence of contiguous words in a source-language or target-language sentence. Phrase-based machine translation offers the advantage of both one-to-many word generation and many-to-many word generation.
The distortion component of a machine translation (MT) system inputs the source-language sentence (henceforth referred to as the "source sentence") and a set of complete or partial target-language hypotheses (henceforth the "target hypotheses") and generates a distortion score for each of the target hypotheses. This score reflects how likely the distortion component considers the reordering of words in a particular hypothesis to be. In some of today's systems, the preferred hypotheses receive a high score and the ones considered to be unlikely receive a low score; in other systems, the convention is the opposite and it is the lowest scores that indicate the hypotheses preferred by the distortion component.
In many of today's state-of-the-art machine translation systems, the distortion score is basically a penalty on reordering. Expressed as a negative number, this penalty is least severe (closest to zero) for hypotheses whose word order is similar to that of the original source-language sentence, and more severe (negative of larger magnitude) for hypotheses whose word order is very different from that of the original sentence. Given two target hypotheses for a source sentence, a distortion penalty of this kind cannot prefer the hypothesis that has undergone a more drastic rearrangement, even though this may be desired for a specific language.
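The conventional penalty can be made concrete with a short sketch. The formulation below, which penalizes the jump between the end of the previously translated source phrase and the start of the next one, is an assumption chosen because it is typical of phrase-based decoders; the description above only requires that the penalty grow with the amount of reordering.

```python
# Sketch of a conventional distortion penalty (one typical formulation, assumed
# here for illustration): penalize the jump between the end of the previously
# translated source phrase and the start of the next one.
def conventional_distortion_penalty(phrase_spans_in_chosen_order):
    """phrase_spans_in_chosen_order: (start, end) word indices of the source
    phrases in the order they are translated; end is exclusive."""
    penalty = 0
    previous_end = 0
    for start, end in phrase_spans_in_chosen_order:
        penalty -= abs(start - previous_end)   # 0 when the next phrase follows on directly
        previous_end = end
    return penalty

print(conventional_distortion_penalty([(0, 2), (2, 4), (4, 5)]))  # 0  (monotone order)
print(conventional_distortion_penalty([(4, 5), (0, 2), (2, 4)]))  # -9 (reordered)
```

Note that a penalty of this form can only disfavour reordering, which is precisely the limitation discussed above.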
The IBM models described above were capable of assigning higher scores to a hypothesis that has undergone a more drastic rearrangement than to competing hypotheses whose word order is similar to that of the original source sentence. There has also been some recent work on building phrase- based systems with this property. Such work is described in two recently published articles:
• "A Localized Prediction Model for Statistical Machine Translation", by C. Tillmann and T. Zhang, Association for Computational Linguistics,
Ann Arbor, Michigan, USA, June 2005;
• "Local Phrase Reordering Models for Statistical Machine Translation" by S. Kumar and W. Byrne, published in Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), Vancouver, Canada, Oct. 2005.
However the phrase-based systems described in these two articles can only allow local rearrangements of phrases, within a window of (maximally) three phrases.
To understand why a distortion component of some kind is needed for machine translation, consider a system that translates German sentences into English. German and English share a common ancestry, and it is thus often easy to translate a German word into an English word that corresponds to it exactly (this is usually harder to do for unrelated languages, such as Chinese and English). However, the location of the past participle of a verb in a German sentence is different from its typical location in an English sentence. For instance, the literal word-to-word English translation of the German sentence "Ich habe das Buch gelesen" is "I have the book read"; the literal English translation of "Du hast ein Hund gesehen" is "You have a dog seen". It is the job of the distortion component in a German-to-English system to make it more likely that the system produces translations with correct English word ordering. In the case of a conventional distortion penalty, translation hypotheses are penalized more severely the more they differ from the original word order.
For instance, given the input "lch habe das Buch gelesen" a conventional distortion penalty would severely penalize the translation hypotheses "Book I the read have" and "Have read the I book", assigning them a lower score than the hypotheses "I have the book read" and "I have read the book". Thus, the conventional distortion penalty is successful in discouraging wildly unlikely reorderings. Unfortunately, among the latter two hypotheses, the one with the word order most similar to the original - "I have the book read" - will be assigned a higher distortion score than the correct English translation, "I have read the book", since the latter involves more reordering. Similarly, as a translation of "Du hast ein Hund gesehen", a conventional distortion penalty will assign a higher score to "You have a dog seen" than to "You have seen a dog".
It is clearly desirable to build a distortion component that is capable of assigning a higher score to a hypothesis that has undergone more reordering than to one that has undergone less reordering, when that is appropriate.
It is interesting to note that the system for German-to-English translation described in the article "Improved Alignment Models for Statistical Machine Translation", by F-J. Och, C. Tillmann, and H. Ney (published in Empirical Methods in Natural Language Processing, June 1999, University of Maryland) used a conventional distortion penalty but had to incorporate an ad hoc modification to handle German verbs. Having to incorporate such ad hoc modifications for each different language pair can be costly and time-consuming.
As mentioned above, the kind of reordering necessary to perform good translation depends on the identity of the two languages involved. For instance, the distortion component of a French-to-English system may not need to concern itself with the placement of verbs, since French and English tend to put verbs in the same sentence locations. On the other hand, this French-to-English distortion component may need to handle the location of adjectives relative to the nouns they modify, since French tends to place adjectives after the noun instead of before the noun as in English (e.g., "la maison rouge" can be literally translated as "the house red"). In the case of a Chinese-to-English system, Chinese questions have a very different word order from English questions, while Chinese statements often have a word order that is similar to that of English. For instance, "Ni jiao shenme mingzi?" translated word-for-word into English without reordering would be "You are called what name?", which needs to be reordered, while a word-to-word translation of the possible answer "Wo jiao Ming" would be "I'm called Ming", which does not require reordering. This example illustrates that a distortion component that knows not only about the words being considered for reordering, but also about the overall properties of the source-language sentence in which they are embedded (in this example, whether it is a question or a statement), would be desirable.
A conventional penalty-based distortion component can only learn one language-pair-dependent aspect of distortion: the severity of the penalty that should be assigned to reordering. For instance, it can learn that for language pairs such as Japanese and English, which have very different word order, the distortion penalty should be mild, while for language pairs such as French and English, which have somewhat similar word order, the distortion penalty should be severe. However, it is incapable of learning for a particular language pair when the words in an input sentence in the source language need to be drastically reordered for translation, and when the words of an input sentence can be left in roughly the same order. In other words, the conventional penalty is incapable of learning for a particular language pair when reordering is more or less likely. The systems described in the papers cited above by C. Tillmann and T. Zhang, and by S. Kumar and W. Byrne, can learn when local reordering is likely (within phrases that are very close to each other) but cannot deal with global reordering.
Summary of the Invention
It is an object of the invention to provide a distortion component which is capable of learning for a particular language pair when reordering is more or less likely.
It is an object of the invention to train a distortion component that incorporates multiple input features in generation of the distortion score.
It is an object of the invention to offer a method and system which assigns a higher score to a hypothesis that has undergone more reordering than to one that has undergone less reordering, when that is appropriate. It is an object of the invention to offer a method and a system which is capable of rewarding reordering when it is appropriate and penalizing it when it is inappropriate, depending on which words are being considered for reordering, and depending on the properties of the source language sentence.
It is another object of the invention to offer a method and a system which allows rearrangement within an arbitrarily wide window of segments or phrases.
It is an object of the invention to offer a method and a system which allows global rearrangement of phrases, in addition to the local rearrangement.
It is another object of the invention to offer a method and a system where the distortion component learns what particular kinds of reordering are likely to occur for a particular language pair, and under what circumstances they are likely to occur.
These objects are met by providing a method for generating a distortion component used in assigning a distortion score in a machine translation system which comprises the steps of a) providing a bilingual sentence pair having a training source sentence and a training target sentence; b) segmenting the training source sentence and the training target sentence into one or more than one training source sentence segments and one or more than one training target sentence segments; c) aligning the order of the training source sentence segments with the order of the corresponding training target sentence segments; d) forming a distorted training source sentence with the aligned training source sentence segments in the position order of the training target sentence segments; e) outputting the training source sentence and the associated distorted training source sentence to form a distortion training corpus; f) using a supervised learning tool on the distortion training corpus to generate a distortion component.
And by providing a method for providing a distortion score to a translation hypothesis for the translation of a source sentence made of words into a target sentence comprising the steps: providing a source sentence to a decoder; segmenting said source sentence into one or more than one segments; choosing a segment; removing said selected segment from the source sentence leaving remaining words; inputting said chosen segment into a partial distortion hypothesis; providing said chosen segment, the remaining words, and the partial distortion hypothesis to a distortion component; calculating a distortion score with a distortion component acquired through supervised learning from a distortion training corpus; repeating steps c to e until all words have been chosen; calculating a cumulative distortion score; outputting to said decoder said distortion score.
And by providing a computer readable memory for obtaining distortion score for a translation hypothesis of a source sentence into a target sentence in a machine translation system comprising; a source sentence; a decoder; the source sentence being inputted in said decoder; a phrase table; the phrase table having source words associated with corresponding target words, the phrase table providing possible segments to said decoder; a distortion component; the distortion component acquired through supervised learning from a distortion training corpus; the distortion training corpus made of a training source sentence and an associated distorted training source sentence; the decoder providing said distortion component with a selected segment from said source sentence and remaining segments from said source sentence and a distorted sentence hypothesis; the distortion component outputting a distortion score to said decoder;
And by providing a method for assigning a score to a translation hypothesis of a source sentence comprising the steps: providing a segmented translation sentence hypothesis; providing an associated segmented source sentence; providing an alignment of the segments of the translation sentence hypothesis with the segments of the source sentence; calculating a new distortion score based on the alignment with a distortion component acquired through supervised learning from a distortion training corpus having new source information; outputting a new set of rescored hypotheses. And by providing a machine translation system for translating a source sentence into a target sentence where a translation hypothesis is provided, the translation hypothesis provided is assigned a cumulative score comprising a phrase translation score; a language model score; a number of words score; and a distortion score, where said distortion score is assigned by a distortion component derived by means of a supervised learning tool from a bilingual sentence pair comprising a segmented training source sentence and a segmented training target sentence, such that the segments of the training source sentence have been aligned with the segments of the training target sentence.
Further features of the invention will be described or will become apparent in the course of the following detailed description.
Brief Description of the Drawings
In order that the invention may be more clearly understood, embodiments thereof will now be described in detail by way of example, with reference to the accompanying drawings, in which:
Figure 1 illustrates one embodiment of a method of using a distortion component in a machine translation system.
Figure 2 illustrates one embodiment of the invention of using a distortion component during decoding.
Figure 3 illustrates one embodiment of the invention to generate a distortion training corpus.
Figure 4 illustrates another embodiment of the invention to generate a distortion training corpus.
Figure 5 illustrates yet another embodiment of the invention to generate a distortion training corpus.
Figure 6 illustrates an embodiment of the invention using choice tree for forming a distortion component.
Figure 7 illustrates an embodiment of the invention for growing choice trees.
Figure 8 illustrates another embodiment of the invention for growing choice trees.
Figure 9 illustrates another embodiment of the invention of using a distortion component during decoding.
Figure 10 illustrates an embodiment of the invention using labeling.
Figure 11 illustrates an embodiment of the invention of training a distorted source language model using a distorted source-language corpus.
Figure 12 illustrates an embodiment of the invention using a distortion component during rescoring.
Figure 13 illustrates an embodiment of the invention using a distortion component during rescoring.
Description of Preferred Embodiments
The invention described here addresses the design of the distortion component for any specified pair of source and target languages. According to the invention, the mathematical relationship between the multiple input features and the distortion score is learned in a separate training step on a set of examples of reordered sentences called "segment-aligned sentence pairs"; this set of examples must be generated for each combination of source and target language. For instance, to learn the relationship between the input features and the distortion score for the case of Chinese source sentences and English target hypotheses, one would train the system on segment-aligned sentence pairs in which one sentence in each pair is Chinese and the other English. To learn the relationship for the case of German source sentences and English target hypotheses one would train the system on segment-aligned sentence pairs in which one sentence in each pair is German and the other English, and so on. This separate training step for the distortion component of the invention has no parallel in conventional systems whose distortion component is a penalty. In such conventional systems, the value of the penalty is typically estimated by constructing the complete machine translation system for a given language pair and then experimenting with greater or lesser values of the penalty on trial data - in other words, penalty estimation is typically a trial-and-error procedure, which can yield inferior translation performance. The current embodiment of the invention has been carried out in the framework of phrase-based statistical machine translation.
Although the current embodiment of the invention is phrase-based, the invention is also applicable in the context of other approaches to statistical machine translation. For instance, the invention is applicable to systems in which groups of words in the source sentence have been transformed in some way prior to translation. Thus, it is applicable to systems in which some groups of words have been replaced by a structure indicating the presence of a given type of information or syntactic structure (e.g., a number, name, or date), including systems where such structures can cover originally noncontiguous words. The invention applies not only to phrases but to other groupings and transformations of words; as such, the word "segment" will be used instead of "phrase".
Furthermore, the invention is applicable to systems in which reordering (distortion) of segments is applied before translation of the segments, to systems in which reordering is applied after translation of the segments, and to systems in which reordering and translation occur in arbitrary order. In the first type of system, the segments of the source-language sentence are reordered first, then each segment in the reordered sentence is translated into the target language. In the second type of system, the segments of the source-language sentence are left in their original order, then each is translated into a target-language segment; subsequently, the target-language segments are reordered. In the third type of system, there may be some reordering prior to translation of segments, and some reordering afterwards - i.e., the two processes are interleaved.
For convenience, the description that follows will follow the convention in which a high score for a translation hypothesis indicates that the distortion component considers the word order in the hypothesis to be highly probable or suitable (the invention also applies to a system with the reverse convention).
In phrase-based machine translation, decoding (translation) of a source sentence S is carried out by finding word sequences T in the target language as shown in Fig. 1. In this figure, words belonging to the source language are written in the form si and words belonging to the target language are written in the form tj. As the name indicates, phrase-based machine translation is built around "phrases". In this context, the word "phrase" does not have grammatical significance; a phrase is simply a contiguous sequence of one or more words. In Fig. 1, the phrase boundaries are indicated by vertical bars (except at the beginning or end of a sentence, where they are unnecessary). For instance, the expression "s1 s2|s3 s4|s5 s6 s7|s8 s9" denotes a segmentation of a nine-word sentence into four phrases: "s1 s2", "s3 s4", "s5 s6 s7", and "s8 s9". A "phrase table" is a data structure that shows, for a phrase in one language, which phrases in the other language correspond to it; it is obtained from bilingual training data by means of a complex alignment procedure. For instance, the phrase table shown in the figure states that the two-word source phrase "s1 s2" can be translated into the target language as "t6 t7 t8" (three target-language words), "t6 t8" (two words), or "t7" (one word). In typical phrase-based systems, the phrase table will also include information about probability, relative frequency or likelihood of each possible translation for a given phrase (for space reasons this kind of information contained in the phrase table is not shown in the figure).
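As a concrete illustration of the phrase table and segmentation notation just described, the sketch below encodes the correspondences quoted from Figure 1 as a Python dictionary. The probability values are invented for illustration (the figure does not show them), and the variable names are assumptions.

```python
# Illustrative sketch: a phrase table as a mapping from a source phrase to its
# possible target translations with associated probabilities.  The probability
# values are invented for illustration; Figure 1 does not show them.
phrase_table = {
    "s1 s2":    [("t6 t7 t8", 0.5), ("t6 t8", 0.3), ("t7", 0.2)],
    "s3 s4":    [("t2 t3 t4", 1.0)],
    "s5 s6 s7": [("t5", 1.0)],
    "s8 s9":    [("t1", 0.5), ("t1 t5", 0.3), ("t2 t5", 0.2)],
}

# One segmentation of the nine-word sentence, written with the vertical-bar convention.
segmentation = ["s1 s2", "s3 s4", "s5 s6 s7", "s8 s9"]
print("|".join(segmentation))   # s1 s2|s3 s4|s5 s6 s7|s8 s9
```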
The process whereby translation hypotheses are generated from a source sentence occurs in two steps:
1. a segmentation step
2. a distortion and phrase translation step.
The process as a whole is called "search" or "decoding"; the component of the system that performs it is called the "search engine" or "decoder". These two steps are often interleaved during decoding, but for clarity, they will be described as if they occur sequentially.
In the segmentation step, the source sentence is segmented into "phrases" (segments). Typically, the system will only create segmentations using phrases that are contained in the phrase table (i.e., those for which phrase translations are known). In cases where the source sentence contains a word that is "unknown" (not included in the phrase table) a special handling mechanism is invoked - e.g., the word might simply be cut out of the source sentence prior to segmentation. The figure shows three possible segmentations of a nine-word source sentence; there may be other possible segmentations.
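A minimal sketch of the segmentation step follows. The recursive helper and the small set of "known" phrases are assumptions for illustration; the function simply enumerates the segmentations of the example sentence that use only phrases for which translations are known.

```python
# Sketch (assumed helper): enumerate segmentations of a source sentence that use
# only phrases known to the phrase table.
def enumerate_segmentations(words, known_phrases, max_phrase_len=7):
    if not words:
        return [[]]
    segmentations = []
    for length in range(1, min(max_phrase_len, len(words)) + 1):
        phrase = " ".join(words[:length])
        if phrase in known_phrases:
            for rest in enumerate_segmentations(words[length:], known_phrases, max_phrase_len):
                segmentations.append([phrase] + rest)
    return segmentations

# A small, invented set of phrases assumed to be in the phrase table.
known_phrases = {"s1 s2", "s3 s4", "s5 s6", "s5 s6 s7", "s7", "s8 s9"}
source_words = "s1 s2 s3 s4 s5 s6 s7 s8 s9".split()
for segmentation in enumerate_segmentations(source_words, known_phrases):
    print("|".join(segmentation))
# s1 s2|s3 s4|s5 s6|s7|s8 s9
# s1 s2|s3 s4|s5 s6 s7|s8 s9
```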
Next, distortion (reordering) and phrase translation take place. Distortion means that phrases can be shuffled; phrase translation means that a phrase in the source language is replaced by a phrase in the target language. For instance, the top, leftmost target word sequence shown (including phrase boundaries) is "t1 |t2 t3 t4| t5 |t6 t7 t8". This was obtained from the distorted source sequence "s8 s9|s3 s4|s5 s6 s7| s1 s2" by replacing the phrase "s8 s9" with the phrase "t1 ", the phrase "s3 s4" with the phrase "t2 t3 t4", and so on. Of course, "no reordering" - leaving the phrases in the same order in the distorted source hypothesis as in the original source-language sequence - will always be one of the possible distortions.
The end result of this process will be many different translation hypotheses. For a long source sentence, a typical phrase-based machine translation system could generate tens or hundreds of thousands of hypotheses. Typically, the user of a machine translation system is only interested in one or a few of these hypotheses - those representing the system's best guesses at the correct translation. Thus, phrase-based machine translation systems assign numerical scores to translation hypotheses.
The overall score assigned to a translation hypothesis will typically be a combination of sub-scores reflecting different aspects of the hypothesis and its relationship to the source sentence. The sub-scores used in the calculation typically include a "phrase translation" score (which is higher if the phrase translations employed in generating the hypothesis had a high probability according to the phrase table, and lower if these translations are improbable), a "language model" score (which is higher if the sequence of words in the hypothesis is probable according to a model of the target language, and lower if the sequence of words is improbable), a "number of words" score (which penalizes hypotheses that seem to have too few or too many words), and a distortion score. If the user asks for a single translation of the source sentence, the translation hypothesis with the best overall score will be output.
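The way the sub-scores are combined is not restricted by the description above; a weighted sum of log-scores (a log-linear combination) is a common choice in phrase-based systems and is sketched here purely as an illustration, with invented weights.

```python
# Sketch of combining sub-scores into an overall hypothesis score.  The weighted
# sum of log-scores below is one common choice, and the weights are invented for
# illustration; the invention does not prescribe a particular combination.
def overall_score(log_phrase_score, log_lm_score, word_count_score, log_distortion_score,
                  weights=(1.0, 1.0, 0.5, 1.0)):
    subscores = (log_phrase_score, log_lm_score, word_count_score, log_distortion_score)
    return sum(w * s for w, s in zip(weights, subscores))

# A hypothesis with a better (less negative) combined score is preferred.
print(overall_score(-4.2, -12.7, -0.3, -1.5))   # -18.55
```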
One of the objects of the current invention is to assign distortion scores that more accurately reflect the probability that a given distortion will occur. Although the distortion score is not the only component that goes into determining the overall score of a hypothesis, making it more accurate will improve the average performance of the overall score, ensuring that good translation hypotheses are output by the system more often. In Figure 1 , the abbreviation "DSH" is used for "distorted source hypothesis". This abbreviation will be used repeatedly in the text that follows. Note that each DSH can generate several different translation hypotheses, since a particular phrase in a DSH may be translated into several different target- language phrases. The figure shows three groups of translation hypotheses, each sharing the same DSH. In the current embodiment of the invention, the distortion score found during decoding is the same for each member of a group, since this distortion score depends only on the relationship between the DSH and the original, undistorted, segmented source sentence; the distortion score doesn't depend on the target-language phrases chosen. It would be possible to construct another embodiment in which the distortion score does depend on the target-language phrases. (Later, it will be shown how even in the current embodiment, a distortion score obtained during an optional "rescoring of translation hypotheses" step that occurs after the initial decoding may depend on the sequence of words in each target-language hypothesis). However, the overall scores for the translation hypotheses in a group will typically differ, since the non-distortion components of the overall score found during decoding (e.g., language model score, phrase translation score) depend on the target-language phrases chosen to match each source- language phrase.
In Fig. 1 , distortion and phrase translation are not shown sequentially (i.e., with distortion first and then phrase translation). This is because in many state-of-the-art systems, distortion and phrase translation are interleaved. For instance, generation of the target hypothesis "t1 |t2 t3 t4| t5 |t6 t7 t8" from the segmented source sentence "s1 s2|s3 s4|s5 s6 s7|s8 s9|" might occur in the following order:
1. the system decides to try choosing the phrase "s8 s9" as the first (leftmost) phrase in a DSH.
2. from this partial DSH, the system starts a number of different translation hypotheses by mapping "s8 s9" onto different possible translations for this phrase found in the phrase table. In the figure, we see that the system created partial hypotheses in which "s8 s9" was translated by the target-language phrases "t1", "t1 t5", and "t2 t5", so we can deduce that the phrase table allows "s8 s9" to be translated by any one of these target-language phrases (there may be other translations of "s8 s9" that are possible according to the phrase table, but which were not chosen by the system). Note that each of these partial hypotheses is made up of two matching halves: the DSH and a target-language part.
3. the system now takes the partial hypothesis whose DSH begins with "s8 s9" and whose target-language part begins with "t1 ", and extends it by adding "s3 s4" to the right end of the partial DSH. It then consults the phrase table to find possible translations for the phrase "s3 s4".
One of these, "t2 t3 t4", is chosen, yielding a partial hypothesis whose DSH part is "s8 s9|s3 s4| ..." and whose target-language part is "t1 |t2 t3 t4|..."
4. the system continues in this manner, each time choosing a phrase from the source sentence that has not yet been "used up", adding it to the right end of the partial DSH, choosing from the phrase table a target-language phrase translation for the source-language phrase, and then adding that target-language phrase to the right end of the target-language part of the hypothesis. Eventually the DSH is of the same length as the original source-language sentence (and contains all the same phrases), and the target-language part of the hypothesis contains a matching phrase translation for each phrase in the DSH. The hypothesis is now complete.
Use of the Invention During Decoding
In the current embodiment of the invention, the distortion component is designed to assist the decoder (search engine) during decoding (translation) as in the example shown in Figure 2. The figure shows a snapshot of the state of the decoder at a given point during the decoding process. At the time shown, the decoder is attempting to extend a partial DSH consisting of two segments or phrases: "s8 s9|s3 s4|". These were extracted from the source sentence: "s1 s2 s3 s4 s5 s6 s7 s8 s9". Because four of the words in the source sentence have already been consumed in the construction of the distorted source-language part of the partial hypothesis ("s3", "s4", "s8", and "s9"), only the remaining five words need to be added to the partial hypothesis. The decoder has just consulted the phrase table and determined that one possibility is to now add the phrase "s1 s2" to the right edge of the partial DSH (which would yield the partial hypothesis "s8 s9|s3 s4|s1 s2", leaving only the words "s5", "s6", and "s7" to be added to the partial hypothesis in some order). Thus, the decoder requires a distortion score in order to assess whether adding a given segment or phrase (e.g., "s1 s2") to the right end of the partial DSH is likely or unlikely. In the current embodiment, this score is formulated as the estimated probability that the given phrase will be chosen next; in Fig. 2, the estimated probability that the phrase "s1 s2" will be chosen next is denoted P(s1 s2).
In order to estimate the probability that a given phrase will be chosen next, the distortion component is supplied with information about the context by the decoder. As shown in Fig. 2, this contextual information includes the partial DSH (with phrase boundary information retained in it) and information about which words in the source sentence have been consumed and which must still be placed into the DSH. Although in the figure, the score returned by the distortion component is shown as a probability, in the current embodiment the logarithm of this probability is used. Note that as segments or phrases are added to the right end of the partial DSH, a cumulative distortion score can be obtained. For instance, if the completed DSH ends up being "s8 s9|s3 s4|s1 s2|s5 s6 s7", the cumulative distortion score (in terms of probability) would be the product of the P(s8 s9) returned by the distortion component initially (when the partial DSH was still empty), with the P(s3 s4) returned by it when the DSH is set to "s8 s9", and so on: P(s8 s9)*P(s3 s4)*P(s1 s2)*P(s5 s6 s7). In terms of the logarithm of the probability, this is equivalent to returning the sum of the logarithms of the individual probabilities:
log P(s8 s9) + log P(s3 s4) + log P(s1 s2) + log P(s5 s6 s7)
In addition to returning the distortion score for adding a given source segment or phrase to the edge of the partial DSH, the distortion component may also optionally return an estimate called the "future score" as shown in Figure 2. The future score is an estimate of the cumulative score associated with completing the DSH. In the figure, if the phrase "s1 s2" is added to the partial DSH, the source-language words "s5", "s6", and "s7" will not yet have been used up - they will eventually have to be assigned to the DSH, in some order. Imagine a situation where the distortion score for adding the source phrase currently being considered to the partial DSH is good, but choosing this phrase leads to later choices that have very bad scores. It would be useful for the decoder to know this, so that it can avoid making choices that appear attractive in the short term but will lead to bad long-term consequences; the future score is a means for the distortion component to convey information about these long-term dangers to the decoder.
In the example shown in Figure 2, suppose that P(s1 s2) is high, but assigning "s5", "s6", and "s7" and phrases containing them in subsequent steps will involve very low probabilities. In this case, the distortion component could return a value F(s5 s6 s7) for the future score to the decoder that was very low. The same reasoning operates in reverse - there may be situations where choosing a particular source segment or phrase looks unappealing in the short term (low distortion score) but has great long-term potential (high future score). Exactly how the future score is used during decoding depends on the decoder's strategy. Typically, future scores are estimates rather than values the distortion component can calculate exactly. In the example above, for instance, a future score for the words remaining in the source sentence, "s5", "s6", and "s7", may be returned without the distortion component having exact information on how they will themselves be segmented or distorted later on (e.g., at a later stage they might be segmented as "s5| s6|s7", "s5 s6 s7", "s5| s6 s7", etc.).
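The cumulative log-probability score described above can be sketched as follows. The probability estimate returned for each segment is a placeholder (a uniform guess) standing in for the learned distortion component, and the function names are assumptions for illustration.

```python
import math

# Sketch of accumulating the distortion score during decoding.  The probability
# returned for each candidate segment is a uniform placeholder here; in the
# invention it is estimated by the distortion component learned from the
# distortion training corpus.
def segment_probability(segment, partial_dsh, remaining_segments):
    return 1.0 / len(remaining_segments)      # placeholder estimate

def cumulative_distortion_score(segments_in_chosen_order):
    log_score = 0.0
    partial_dsh = []
    remaining = list(segments_in_chosen_order)
    for segment in segments_in_chosen_order:
        p = segment_probability(segment, partial_dsh, remaining)
        log_score += math.log(p)
        partial_dsh.append(segment)
        remaining.remove(segment)
    return log_score

# log P(s8 s9) + log P(s3 s4) + log P(s1 s2) + log P(s5 s6 s7)
print(cumulative_distortion_score(["s8 s9", "s3 s4", "s1 s2", "s5 s6 s7"]))
```

A future score for the words not yet placed in the DSH could be estimated in the same spirit and added to the cumulative score to guide the decoder's search.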
Input Features Used By the Distortion Component
As shown in Figure 2, the new distortion component outputs a distortion score (and optionally, a future score) for a given source segment or phrase that the decoder is considering adding to the edge of a partial DSH. Many different types of input information can be used to compute the score; the precise way in which these input features are combined is learned from the training corpus, which consists of segment-aligned sentence pairs. The training process is described below.
To understand the kind of tradeoff that goes on in computing the distortion score, consider an example similar to that shown in Figure 2. Suppose that the partial DSH is "s8 s9| s3 s4", and that there are only two choices of source-language phrases that can be added next to the partial DSH: "s1 s2" or "s5 s6 s7". There are arguments in favour of both. The argument for choosing "s1 s2" next is that words that occur near the beginning of the source sentence should occur as near as possible to the beginning in the DSH - by choosing "s1 s2", one minimizes the distance between the new and the original positions of words. The argument in favour of choosing "s5 s6 s7" next is that in this way, a long unbroken sequence of words "s3 ... s7" is kept in its original order (given the partial DSH, "s1 s2" is guaranteed to be out of order no matter what is decided). Clearly, there is no way of deciding between these two arguments without looking at examples of what happens in similar cases for the pair of languages in question. In effect, the new distortion component does exactly that, learning from the training corpus which of the two factors - distance from original position or preservation of long, unbroken word sequences - to assign more importance to in a given case.
Examples of the kinds of input features that can be used in the computation of the distortion score and the future score include position-based, word-based, and syntax-based features, such as the following (a sketch computing a few of the position-based features appears after this list):
• The difference between the original positions in the source sentence of the leftmost word in the given source phrase and the rightmost word in the partial DSH. For instance, in Figure 2, the given source phrase is "s1 s2" (whose leftmost word had original position 1 ) and the last word in the partial DSH is "s4" (whose original position was 4). Thus, the difference for this case is 1-4 = -3. The conventional penalty-based distortion score is typically based entirely on this input feature.
• The difference between the original position of the leftmost word of the given source phrase and what will be its new position if it is chosen at a given step.
• The length in words of the partial DSH, the length in words of the given source phrase, and the number of remaining, unused words in the source sentence, and the differences between these.
• The average number of words per phrase in the partial DSH, in the source sentence, and in the remaining, unused part of the source sentence.
• The number of phrases in the partial DSH.
• The degree of permutation in the partial DSH, in other words, the extent to which the phrases in the partial DSH have departed from their original order. There are several possible metrics for measuring this permutation quantitatively, such as Spearman's rank correlation or Kendall's tau.
• The degree of fragmentation in the remaining, unused part of the source sentence as measured, for instance, by the number of "holes" it contains (a hole is a gap in the original sequence of words caused by the removal of words that are now in the partial DSH).
• The identity of the words in the source sentence, in the partial DSH, in the source segment or phrase being scored, or the identity of the words that remain (i.e., those that are contained in the source sentence but have not yet been assigned a position in the DSH, and that are not contained in the given source phrase). For instance, in a Chinese-to-English system, the presence of a Chinese question word like "shenme" in the Chinese source sentence may make it more likely that several Chinese phrases in the source will have positions in the DSH that are quite far from their original positions (since as was mentioned above, Chinese and English have word orders that tend to be more different for questions than for other types of sentences, such as statements).
• The identity of words in given positions of the source sentence, of the partial DSH, or of a particular segment or phrase. For instance, a possible feature could be the occurrence or non-occurrence of a given word in the first position of a particular segment or phrase; another possible feature could be the occurrence or non-occurrence of a given word in the last position of the source sentence.
• The syntactic category to which individual words or phrases in the partial DSH, the given source phrase, and the remaining words belong (e.g., adjective, verb, noun phrase, etc.). For instance, in the case of an English-to-German system, the presence of a verb in the past participle in the given source phrase might cause it to be assigned a low probability of being assigned next when there are still many remaining, unused words in the source sentence (causing the past participle to be pushed rightwards in the distorted English DSH). In the technical literature, syntactic categories such as nouns, verbs, adjectives, etc., to which words may belong are often referred to as "part-of-speech tags".
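The sketch below (referred to at the head of this list) computes a few of the position-based features above for the Figure 2 example. Representing each phrase as a list of its original word positions, and the particular "holes" measure used, are assumptions for illustration.

```python
# Sketch of a few position-based input features.  Each phrase is represented as
# the list of original (0-based) positions of its words - an assumed
# representation for illustration only.
def position_features(partial_dsh_phrases, candidate_phrase, source_length):
    # Original position of the candidate's leftmost word minus the original
    # position of the rightmost word already in the partial DSH.
    last_position = partial_dsh_phrases[-1][-1] if partial_dsh_phrases else -1
    jump = candidate_phrase[0] - last_position

    used = {p for phrase in partial_dsh_phrases for p in phrase}
    remaining = [p for p in range(source_length)
                 if p not in used and p not in candidate_phrase]

    # Fragmentation of the remaining words, approximated here by the number of
    # gaps ("holes") separating them.
    holes = sum(1 for a, b in zip(remaining, remaining[1:]) if b - a > 1)

    return {
        "jump": jump,
        "dsh_length_words": sum(len(p) for p in partial_dsh_phrases),
        "candidate_length_words": len(candidate_phrase),
        "remaining_words": len(remaining),
        "holes_in_remaining": holes,
    }

# Partial DSH "s8 s9|s3 s4" and candidate phrase "s1 s2" from Figure 2 (0-based positions).
print(position_features([[7, 8], [2, 3]], [0, 1], 9))
# {'jump': -3, 'dsh_length_words': 4, 'candidate_length_words': 2,
#  'remaining_words': 3, 'holes_in_remaining': 0}
```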
Generating the Distortion Training Corpus
Figure 3 shows the process for generating the "distortion training corpus", which is used to estimate the parameters of the distortion component of the invention. Note that in this and following examples, the training source and training target sentences will often be shown in all lower-case form, since capitalization is usually removed in today's systems prior to translation (such systems often have a module that restores capitalization where appropriate after translation has taken place). In this example, the training source language is German and the training target language is English. The distortion training corpus is derived from a bilingual "sentence-aligned" corpus in which each sentence in the training source language is associated with exactly one sentence in the training target language, such that each member of the pair is a translation of the other. A segment aligner processes the training bilingual sentence pairs in this corpus by dividing each member of the pair into segments, then deciding for each segment in the training source sentence which segment in the training target sentence corresponds to it. The segment aligner then outputs, in addition to the segmented training source sentence, a distorted version of it in which the segments follow the order of the segments with which they are aligned in the training target sentence. In the figure, the distortion training corpus shown contains not only the original segmented training source sentences and their distorted counterparts, but also the corresponding segmented training target sentences (in italics). There are many different ways of building the segment aligner. In the preferred embodiment, a phrase-based MT system searches for a way of segmenting the source sentence in a bilingual sentence pair into phrases, and then translating each of the training phrases (using its phrase table) in such a manner that the resulting target-language phrases can be reordered to form precisely the training target sentence of the sentence pair.
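The final step performed by the segment aligner, producing the distorted training source sentence from a segment alignment, can be sketched as below. The representation of the alignment (each source segment paired with the position of the target segment it aligns to) is an assumption made here for illustration.

```python
# Sketch of producing a distorted training source sentence from a segment
# alignment.  Each entry pairs a source segment with the (0-based) position of
# the target segment it is aligned to; the representation is assumed for illustration.
def distorted_source(segment_alignment):
    ordered = sorted(segment_alignment, key=lambda pair: pair[1])
    return " | ".join(source_segment for source_segment, _ in ordered)

# "ich habe | das buch | gelesen | ." aligned to "i have | read | the book | ."
alignment = [("ich habe", 0), ("das buch", 2), ("gelesen", 1), (".", 3)]
print(distorted_source(alignment))   # ich habe | gelesen | das buch | .
```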
Clearly, it will not always be possible to segment-align a bilingual sentence pair in this way. For instance, if the phrase table of the phrase-based MT system does not contain a word found in the training target sentence of the pair, it will be unable to use its phrase table to generate this training target sentence from the training source sentence of the pair. Thus, if a phrase- based MT system is used as the segment aligner, only a fraction of the sentence pairs found in the bilingual sentence-aligned corpus can be segment-aligned and placed in the distortion training corpus - the remaining sentence pairs are not used for training the distortion component. In one experiment on a bilingual sentence-aligned corpus in which the source language was Chinese and the target language English, the "yield" for segment alignment was about 30% - that is, about 30% of the training sentence pairs in the bilingual sentence-aligned corpus could be segment- aligned and placed in the distortion training corpus by the phrase-based MT system. For this experiment, the bilingual sentence-aligned corpus was different from the one from which the MT system's phrase table had been obtained.
One way of increasing the yield for the segment alignment step is to allow a phrase-based MT system to carry out segment alignment on the corpus on which the system was originally trained. In this case, there is a higher probability that the system will find a way of generating the target sentence of a sentence pair from the source sentence, since this exact sentence pair was used (along with others) to populate the system's phrase table. In an experiment on a Chinese-English sentence-aligned corpus that was first used to train a phrase-based MT system and then was subjected to segment alignment by this same system, the yield (proportion of sentence pairs that could be completely segment-aligned) was 60%. However, in some ways the resulting distortion corpus is less suitable for training the distortion component than if it had been derived from data not used to train the system, because it is biased. For instance, during segment alignment on a bilingual corpus used to train the MT system that is carrying out the alignment, the system will never encounter words in the training source sentence that are not in its vocabulary. Thus, the distortion training corpus will not contain examples of how a source sentence containing such "unknown" words could be segment-aligned. Other, more subtle, biases will also be present in the distortion training corpus resulting from this kind of training; for instance, long phrases in the source sentence will have matches in the phrase table more often than may be the case for the final system (i.e., the system in which the distortion component will be incorporated).
Figure 4 shows a high-yield segment alignment method that avoids some of these problems. In this method, where possible, a sentence pair is segment- aligned by the phrase-based MT system and placed in the distortion training corpus (as described above). When a sentence pair cannot be segment- aligned, the MT system generates a list of N translations for the training source sentence from the pair. A "selector" module then compares each of the N hypotheses with the actual target sentence and picks the hypothesis that is closest to it according to some reasonable metric - e.g., a measure of N-gram overlap similar to the BLEU metric (see K. Papineni, S. Roukos, T. Ward, and W. -J. Zhu, "BLEU: A method for automatic evaluation of machine translation", IBM Tech. Rep. RC22176, Sept. 2001 ). Because of the way phrase-based translation works, in the course of generating each of these N hypotheses, the system can keep track of the segment alignment between the hypothesis and the source sentence. Thus, it can produce a segment alignment between the training source sentence and the chosen translation hypothesis, and thus also between the training source sentence and the distorted training source sentence corresponding to the chosen translation. These are put into the distortion training corpus.
Clearly, the yield of this process is 100%. The approximation made is that in cases where the MT system cannot segment-align (i.e., generate) a sentence pair, the training target sentence is replaced by a similar sentence that can be generated by the MT system. In Figure 4, the MT system can generate "i have read the book" from "ich habe das buch gelesen", but it cannot generate "the poems were especially good" from "die gedichte waren besonders gut". Perhaps in the system's phrase tables, "besonders gut" was never aligned with "especially good". Thus, the system generates N translations of "die gedichte waren besonders gut". Of these N hypotheses, the translation that comes closest to "the poems were especially good" is "the poems were excellent", so the segment-alignment that ends up in the distortion training corpus is the one associated with "the poems were excellent". Note that for building the distortion training corpus, the target-language words chosen are irrelevant anyway; all that matters is having the correct segment alignment between the training source sentence and its distorted source counterpart. Thus, even if the quality of the translation hypothesis chosen is poor (in terms of the target word sequence), it may still yield the correct segment alignment.
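The "selector" module of Figure 4 can be sketched as follows. The unigram/bigram overlap used here is a crude stand-in for a BLEU-like metric, and the example N-best list is invented; both are assumptions for illustration.

```python
# Sketch of the "selector": pick, from the system's N-best translations, the
# hypothesis closest to the actual training target sentence.  The unigram and
# bigram overlap below is a crude stand-in for a BLEU-like metric.
def ngram_set(words, n):
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(hypothesis, reference):
    hyp_words, ref_words = hypothesis.split(), reference.split()
    return sum(len(ngram_set(hyp_words, n) & ngram_set(ref_words, n)) for n in (1, 2))

def select_closest(nbest_hypotheses, reference):
    return max(nbest_hypotheses, key=lambda hyp: overlap(hyp, reference))

nbest = ["the poems were excellent", "the poetry was very good", "poems are good"]
print(select_closest(nbest, "the poems were especially good"))
# the poems were excellent
```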
In a variant of the Figure 4 approach, one could deal with a sentence pair that cannot be segment-aligned by running the MT system in reverse to produce N translations of the training target sentence into the training source language, then picking the translation that is closest to the training source sentence. Thus, for sentence pairs that cannot be segment-aligned, one could choose to replace either the training source sentence or the training target sentence, depending on the characteristics of the sentence pair and the MT system's phrase tables.
There are other ways of producing the distortion training corpus that will not be described in detail. For instance, in cases where the MT system cannot segment-align a sentence pair, information about synonyms, the syntactic classes of words, etc., could be applied to generate the most plausible alignments. Another possibility would be to use a non-phrase-based MT translation model, such as any one of the IBM models, either by itself or in conjunction with a phrase-based model, to carry out segment alignment in order to produce the distortion training corpus. Finally, it would be possible to generate from a sentence pair more than one plausible segment alignment for the distortion training corpus - perhaps all possible segment alignments for that sentence pair, or a strategically defined subset of the possible segment alignments.
Training the Distortion Component
The role of the distortion component is to provide the decoder with information about which possible rearrangements of the words in the source sentence are likely, and which are unlikely. The distortion component of this invention learns how to make predictions of this kind from the distortion training corpus, which typically consists of source-language sentences paired with distorted source-language sentences containing exactly the same words but representing a word order characteristic of the target language; each such pair can be called an example of a distortion. It is assumed that the distortion training corpus consists mainly of examples of correct or approximately correct distortions. There is a vast technical literature devoted to computational tools that enable computers to learn how to make predictions from data consisting of examples labeled with the correct outcome. This learning process is often called "supervised learning" in the technical literature. A recent book describing tools for supervised learning is "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by T. Hastie, R. Tibshirani, and J. Friedman (published by Springer-Verlag, New York, 2001 ). Most of the computational tools for supervised learning described in this book are applicable to the task of learning from a distortion training corpus for a particular pair of source and target language how to predict which distortions of a source sentence are likely, and which unlikely. For instance, various linear models described in the book, various nearest-neighbour methods, support vector machines, kernel methods, neural networks, and trees could all be applied to this distortion prediction task. Other supervised learning methods found in the technical literature, such as hidden Markov models, could also be used. Below, one of these other methods, called "statistical language modeling", and how to apply it to distortion, will be described in detail.
The present embodiment of the invention uses trees; however, it would be possible to modify the tree-based embodiment of the invention so that another tool for supervised learning was used instead of trees. The tree training method used in this embodiment is the Gelfand-Ravishankar-Delp expansion-pruning method (S. Gelfand, C. Ravishankar, and E. Delp. "An Iterative Growing and Pruning Algorithm for Classification Tree Design", IEEE Transactions on Pattern Analysis and Machine Intelligence, V. 13, no. 2, pp. 163-174, February 1991). However, the pruning step involves estimating probabilities for the held-out data (as in A. Lazarides, Y. Normandin, and R. Kuhn, "Improving Decision Trees for Acoustic Modeling", International Conference on Spoken Language Processing, V. 2, pp. 1053-1056, Philadelphia, PA, October 1996). In what follows, the word "tree" will be used to mean the types of trees described in the machine learning literature, such as classification trees, regression trees, and decision trees. This meaning of "tree" is quite different from another meaning of "tree" sometimes used in the natural language processing literature, where a "tree" can be a grammatical analysis of a sentence, such as a parse tree or a syntactic tree.
Figure 5 shows how the distortion training corpus is further processed in one embodiment of the invention, preparatory to training a set of trees that will be incorporated in the distortion component. In the example, the source language is German; three German sentences are shown, along with their distortions into a phrase order imposed by the target language, English. We have already seen the sentence "Ich habe das Buch gelesen" ("I have the book read") which is distorted into the English phrase order "Ich habe gelesen das Buch", and the sentence "Die Gedichte waren besonders gut" ("The poems were especially good") which retains the same phrase order. A third sentence, "Damals nicht!" ("Then not!") - presumably part of a dialogue and meaning "Not at that time!" - is distorted to a phrase order that is more acceptable for English, "Nicht damals!" ("Not then!").
Each example of a distorted sentence is processed in a series of steps in which the distorted sentence is constructed from left to right, as would occur during decoding, to reflect the choices made. For instance, the first step in constructing "ich habe | gelesen | das buch | ." from "ich habe | das buch | gelesen | ." was to choose "ich habe" from among four choices ("ich habe", "das buch", "gelesen", and ".") and add it to the right edge of the DSH (which, initially, was the empty string). This causes the set of remaining segments (RS) to be reduced to three at the start of the next step: "das buch", "gelesen", and ".". One of these, "gelesen", is then added to the right side of the DSH - and so on. The "segment choice history corpus" in Figure 5 reflects this history of choices made in the course of generating the distorted version of the source sentence, with the segment chosen shown after an arrow for each step. The last step, in which there is only one segment left in the RS, is always omitted, since at this stage there is effectively no choice left - the last remaining segment must be added to the right edge of the DSH.
Finally, the figure shows how the segment choice history corpus is further processed to yield subcorpora based on the number of choices made at each step. For instance, the "4-choice" corpus contains only examples of steps where, in the process of generating the distortion, the choice is between exactly four segments left in the RS. Note that the distortion pair derived from "Damals nicht!" does not contribute any examples to the "4-choice corpus", since the maximum number of choices made during construction of its distorted version "nicht damals !" was three. In general, a training source sentence with S segments can contribute examples to a maximum of S-1 subcorpora (the 2-choice corpus, the 3-choice corpus, and so on up to the S-choice corpus).
In practice, it is desirable to fix a maximum number M for the number of subcorpora obtained in this manner. The last subcorpus handles not only cases of exactly M choices, but also cases of more than M choices; this subcorpus is denoted the "M+" corpus. For instance, in recent Chinese-to-English experiments, good results were obtained by setting M to 14. Thus, the subcorpora generated included a 2-choice corpus, a 3-choice corpus, and so on, with the last two corpora being the 13-choice corpus and the 14+-choice corpus. The latter includes not only examples of cases where there were 14 choices, but also cases where there were 15 choices, 16 choices, and so on. Figure 6 shows how each of the subcorpora is used to train a corresponding choice tree. In the example shown, the value of M is 4; the 4+-choice tree handles cases of four or more choices. More detail on this training process is given in Figure 7. Note that since trees are a way of assigning probabilities to classes of examples, it is necessary to assign class labels to the choices in the training corpus. Many different labeling schemes are possible. For an N-choice or N+-choice tree, N different labels will be needed. The labeling scheme shown in the figure is a simple left-to-right one, in which the leftmost choice in the RS receives the label "A", the middle one receives the label "B", and the rightmost one receives the label "C". Each example of a choice made will be assigned the label of the choice made in that situation: for instance, if the leftmost segment in the RS was chosen in a given example, that example receives the label "A". In the figure, the label assigned to each example follows an arrow "→".
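The grouping into subcorpora and the simple left-to-right labeling can be sketched as follows; this is a minimal illustration (function names are hypothetical) that reuses the choice history produced by the earlier sketch.

```python
from collections import defaultdict

def label_for(index, n_labels):
    """Simple left-to-right labelling: A, B, C, ...; every position beyond
    the last available label receives that final ("don't care") label."""
    return chr(ord("A") + min(index, n_labels - 1))

def build_subcorpora(history, M=4):
    """Group the choice examples by the number of segments in the RS,
    capping at M so that the last subcorpus (the "M+" corpus) also collects
    cases with more than M choices."""
    subcorpora = defaultdict(list)
    for rs, chosen in history:
        n_labels = min(len(rs), M)                      # 2, 3, ..., M ("M+")
        labels = [label_for(i, n_labels) for i in range(len(rs))]
        subcorpora[n_labels].append((rs, labels, labels[rs.index(chosen)]))
    return subcorpora

# Usage with the history extracted in the previous sketch:
# build_subcorpora(segment_choice_history(source, distorted), M=4)
```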
Figure 7 shows how once a classification tree has been grown using the standard tree-growing methods described in the technical literature cited above, it partitions the examples found in the training corpus. For instance, consider the first example shown in Figure 7, where the three-way choice is between "das buch" (labeled A), "gelesen" (labeled B), and the period "." (labeled C). The choice actually made was B, so the example as a whole receives the label B. For this example, the answer to the question in the top node (which will be explained shortly) is "no", so this example would be placed into the rightmost leaf node (the node numbered "5"). There, it would cause the count of the B class to be incremented by 1.
Once all the examples have been partitioned by the tree, the counts associated with each class in a leaf can be used to estimate probabilities for new examples encountered during use of a tree. For instance, suppose that during decoding a situation is encountered where the decoder must choose between three segments in the RS, any one of which may now be added to the right edge of a given DSH. Also suppose that the tree shown in Figure 7 assigns this example to the rightmost leaf node in the figure (node number 5). Since that node has a count of 13 for label A, 11 for label B, and 1 for label C, one way of estimating probabilities for each of the three possibilities in the example would be to take the count for a label and divide by the total count of 25 for a node. This would yield an estimated probability of 13/25 = 0.52 for the segment with label A (the leftmost segment), an estimated probability of 11/25 = 0.44 for the segment with label B (the middle segment), and an estimated probability of 1/25 = 0.04 for the segment with label C (the rightmost segment). The technical literature contains many similar techniques for estimating probabilities from trees in this situation.
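The count-to-probability conversion just described is simple enough to state directly; the sketch below (a hypothetical helper, with optional additive smoothing that the description above does not prescribe) reproduces the numbers given for node 5 of Figure 7.

```python
def leaf_probabilities(counts, smoothing=0.0):
    """Estimate class probabilities at a leaf by dividing each class count
    by the leaf total; `smoothing` adds a small constant to every count."""
    total = sum(counts.values()) + smoothing * len(counts)
    return {label: (count + smoothing) / total for label, count in counts.items()}

# Node 5 of Figure 7: counts A=13, B=11, C=1 -> probabilities 0.52, 0.44, 0.04
print(leaf_probabilities({"A": 13, "B": 11, "C": 1}))
```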
As explained above, M+ trees differ from the others in that they handle not only cases where there are M choices in the RS, but also cases where there are more than M choices. These trees are trained on examples of M or more choices, and applied during decoding to cases of M or more choices; however, they only have M class labels. An example of a 4+-choice tree is shown in Figure 8. The labeling scheme used here is the same left-to-right one used in Figure 7, but an extension to it is needed to deal with cases where there are more than M choices in the RS. Note that in the last training example shown here, the RS contains six segments - more than there are labels (only the labels A, B, C, and D are available). The extension to the labeling scheme adopted here is to label the first three segments from left to right as "A", "B", and "C" (as was done previously) and then to label all segments that are further right as "D". Thus, "D" is a special kind of "don't care" label for the M+ tree.
During decoding, if there are several segments in the RS that receive the "don't care" label (in this case, "D"), the probability assigned to this label must be divided up among all these segments in order to ensure that the probabilities assigned to all possible choices sum to one. One way of doing this is to assign uniformly to each such segment the total probability assigned to the "don't care" label, divided by the number of such segments in the RS. For instance, in Figure 8, if both the first example (RS = "| ich | habe | das | buch | gelesen |") and the last example (RS = "| weiss | nicht | was | soll | es | bedeuten |") were encountered during decoding, they would end up in node 7 of the tree. If the simple probability estimate given earlier (dividing the count of each class by the total count) is used, then in this node the estimated probability of class A is 5/30 = 0.167, the estimated probability of class B is 4/30 = 0.133, the estimated probability of class C is 6/30 = 0.2, and the estimated probability of class D is 15/30 = 0.5. Consider the segments labeled "D" in the two examples. In the first example, there is only one such segment: the period ".". It would be assigned the full "D" probability of 0.5. In the second example, there are three segments labeled "D": "soll", "es", and "bedeuten". The "D" probability of 0.5 must be divided up among these three segments. If the uniform probability assignment is used, these three segments would each be assigned a probability of 0.5/3 = 0.167. Thus, this tree is capable of assigning probabilities to any number of choices that is at least four.
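The uniform split of the "don't care" mass can be sketched as follows (hypothetical helper; the per-label probabilities are those of node 7 of Figure 8).

```python
def segment_probabilities(label_probs, rs_labels):
    """Convert per-label probabilities from an M+ tree leaf into per-segment
    probabilities, splitting the "don't care" label's mass uniformly over
    every segment that carries it."""
    dont_care = max(label_probs)                 # the last label, e.g. "D"
    n_dont_care = rs_labels.count(dont_care)
    return [label_probs[lab] / n_dont_care if lab == dont_care else label_probs[lab]
            for lab in rs_labels]

# Node 7 of Figure 8 with the six-segment example: the three "D" segments
# each receive 0.5 / 3 = 0.167.
print(segment_probabilities({"A": 5/30, "B": 4/30, "C": 6/30, "D": 15/30},
                            ["A", "B", "C", "D", "D", "D"]))
```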
Incidentally, there are many other ways of dividing up the probability assigned to the segments labeled "don't care" besides uniform assignment. For instance, a left-to-right bias could be incorporated within this label, so that each such segment gets a higher probability than ones to the right of it. This might be done, for example, by giving each segment 2/3 of the probability of the one to the left of it, and then normalizing to ensure that the total probability for the "don't care" segments attains the value predicted by that node of the tree; many other such biases are possible. One might also wish to adjust the total probability assigned to the "don't care" label so that it could depend on the number of such segments (e.g., by allowing the M+ tree to contain questions about the number of "don't care" segments).
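As a concrete instance of the left-to-right bias mentioned above, the sketch below (hypothetical helper) gives each "don't care" segment 2/3 of the share of the segment to its left and then normalizes so the shares still sum to the total predicted by the node.

```python
def biased_dont_care_shares(total_mass, n_segments, decay=2/3):
    """One possible left-to-right bias: each "don't care" segment receives
    `decay` times the share of the segment to its left, and the shares are
    normalized so that together they still sum to `total_mass`."""
    raw = [decay ** i for i in range(n_segments)]
    return [total_mass * r / sum(raw) for r in raw]

print(biased_dont_care_shares(0.5, 3))   # shares decrease left to right, summing to 0.5
```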
Figure 9 is an expanded version of Figure 2, showing details of the component labeled "Distortion component" in Figure 2 (according to one embodiment of the invention). Note that for the choice trees described above to provide a probability estimate that a given segment will be chosen next for the right edge of the partial DSH, the words in the source sentence that are not yet in the DSH must be assigned to segments. However, this may not yet be the case. In the Figure 2 and Figure 9 example, the decoder has requested a score for the segment "s1 s2". However, the decoder does not have information about the segmentation of the words "s5 s6 s7", which remain "unconsumed" by the DSH. These words might be segmented as three one-word segments (|s5|, |s6|, and |s7|), as two segments (|s5| and |s6 s7|, or |s5 s6| and |s7|), or as one segment (|s5 s6 s7|). This information clearly affects the probability estimate that will be returned. For instance, if the task is to estimate the probability of |s1 s2| given that the other choices are |s5|, |s6|, and |s7|, the 4-choice tree or 4+-choice tree (depending on the system's design) will be used to estimate this probability. On the other hand, if the task is to estimate the probability of |s1 s2| given that there is one other choice, |s5 s6 s7|, the 2-choice tree will be used; the result may be quite different. Thus, a segmentation for "unconsumed" words (other than the words in the segment whose probability is being estimated) will be needed.
Figure 9 shows one way of solving this problem. Here, a module called the "segment generator" generates possible segmentations for the relevant "unconsumed" words. This module may consult the phrase table so that only segmentations compatible with the phrase table are generated. For each of these, a score for the segment of interest can be generated from the appropriate choice tree. The heuristic currently used is to return the maximum value found for the probability of interest (in the example, the probability of choosing "s1 s2" next).
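One way the segment generator could enumerate phrase-table-compatible segmentations is sketched below; the phrase table contents shown are purely hypothetical, chosen to reproduce the three segmentations listed above for "s5 s6 s7".

```python
def compatible_segmentations(words, phrase_table):
    """Enumerate all segmentations of the unconsumed words whose segments
    all occur in the phrase table (simple recursive sketch)."""
    if not words:
        return [[]]
    segmentations = []
    for k in range(1, len(words) + 1):
        segment = " ".join(words[:k])
        if segment in phrase_table:
            for rest in compatible_segmentations(words[k:], phrase_table):
                segmentations.append([segment] + rest)
    return segmentations

# Hypothetical phrase table entries for the Figure 9 example:
table = {"s5", "s6", "s7", "s5 s6", "s6 s7"}
print(compatible_segmentations(["s5", "s6", "s7"], table))
# -> [['s5', 's6', 's7'], ['s5', 's6 s7'], ['s5 s6', 's7']]
```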
Note that if the future score could be known exactly, it would be a function of the future segmentation for the "unconsumed" words. For instance, in Figure 9, if the future segmentation for "s5 s6 s7" turns out to be "|s5| s6| s7|", then the future score will be derived from the probabilities estimated when adding these three segments, in some order, to the DSH. In one embodiment of the invention, once the segmentation for unconsumed words that returns the maximal probability is found, a greedy future score estimate is obtained for that same segmentation, and returned as the future score estimate. For instance, if the highest probability for "s1 s2" in Figure 9 is obtained with the segmentation "|s5| s6| s7|", the future score estimate returned will be the greedy future score estimate for "|s5| s6| s7|".
That greedy estimate is obtained by adding the segment being scored (here, "s1 s2") to the right edge of the DSH, then using the choice trees to find which of the assumed segments has the highest probability of being chosen next, and making that choice repeatedly while calculating the product of distortion probabilities. Suppose that in the example, the 3-choice tree estimates that with a DSH set to "s8 s9| s3 s4| s1 s2", among "|s5|", "|s6|", and "|s7|", "|s7|" has the highest probability of being chosen: P(s7) = 0.5 . Then the system sets the DSH to "s8 s9| s3 s4| s1 s2| s7", and uses the 2-choice tree to determine which is more probable as the next choice, "|s5|" or "|s6|". Suppose the winner is "s5", with P(s5) = 0.7 . The DSH is now complete ("s6" will be used to complete it, with probability 1.0) and the future score estimate returned to the decoder will be 0.5*0.7 .
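The greedy procedure just described can be sketched as follows; `choice_prob` is a hypothetical stand-in for a lookup in the appropriate choice tree, so that with the probabilities of the example above the function would return 0.5 * 0.7.

```python
def greedy_future_score(dsh_segments, remaining_segments, choice_prob):
    """Greedy future-score estimate: repeatedly add the remaining segment the
    model considers most probable, multiplying its probability into the score;
    the last segment is forced, so it contributes probability 1.
    `choice_prob(dsh, segment, rs)` stands in for a lookup in the appropriate
    N-choice or M+ choice tree."""
    dsh, rs, score = list(dsh_segments), list(remaining_segments), 1.0
    while len(rs) > 1:
        best = max(rs, key=lambda seg: choice_prob(dsh, seg, rs))
        score *= choice_prob(dsh, best, rs)
        dsh.append(best)
        rs.remove(best)
    return score
```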
Above, under the heading "Input Features Used By the Distortion Component", a partial list of features that can be used in computing distortion scores was given:
• The difference between the original positions in the source sentence of the leftmost word in a given source-language segment and the rightmost word in the partial DSH.
• The difference between the original position of the leftmost word of the given source segment and what will be its new position if it is chosen at a given step.
• The length in words of the partial DSH, the length in words of the given source segment, and the number of remaining, unused words in the source sentence, and the differences between these.
• The average number of words per segment in the partial DSH, in the source sentence, and in the remaining, unused part of the source sentence.
• The number of segments in the partial DSH.
• The degree of permutation in the partial DSH, in other words, the extent to which the segments in the partial DSH have departed from their original order.
• The degree of fragmentation in the remaining, unused part of the source sentence.
• The identity of the words in the source sentence or in the partial DSH, the identity of the words in the given source segment, and the identity of the words that remain.
• The identity of words in given positions of the source sentence, of the partial DSH, or of a particular segment.
• The syntactic category to which individual words or segments in the partial DSH, the given source segment, and the remaining words belong (e.g., adjective, verb, noun phrase, etc.)
Many other such features can be devised; these ones are listed purely for illustrative purposes.
Figure 7 and Figure 8 illustrate use of some of these features in the trees. In the context of tree methodology, information about features is conveyed by the use of yes/no questions. For instance, to understand the question in the top node in Figure 7, one must know that in the current embodiment, "pos(X)" denotes the original position in the source sentence of the rightmost word in the partial DSH, plus one; "pos(A)" is defined as the original position of the leftmost word in the RS segment labeled "A". Thus, if the answer to the question "is pos(A)-pos(X)<0?" is "yes", then the segment labeled A was originally in a position preceding the word now at the right edge of the partial DSH.
The question in node 2 of the tree shown in Figure 7, "ich ∈ DSH?", is a question asking if the German word "ich" is found anywhere in the DSH. Figure 8 illustrates three other kinds of questions. The question in node 1, "VERB ∈ B?", asks if any of the words in segment B are verbs (this question is only allowed if the system also contains a syntactic tagger for German capable of labeling words by their syntactic categories, such as VERB, NOUN, and so on). The question in node 2 of Figure 8 asks whether the German word "und" is present at a specific location of the DSH (position 2). Finally, the question in node 5 of Figure 8 asks whether the length of the segment labeled A, in words, is greater than five.
With respect to the labeling scheme for segments in the RS, many different labeling schemes are possible. Whichever scheme is chosen, it must be applied both when the trees are grown from the distortion training corpus, and at decoding time. A simple left-to-right labeling scheme was already described, and illustrated in the examples above. The currently preferred labeling scheme is somewhat different. If "pos(X)" denotes the original position in the source sentence of the rightmost word in the DSH, plus one (as described above), then the current labeling scheme involves the difference between the original position of the leftmost word in each segment in the RS, and pos(X). Segments are labeled in ascending order of this difference, with ties being broken in favour of the rightmost of the tied segments. This "distance-from-X" labeling scheme, as used both prior to training and for decoding, is illustrated in Figure 10. In that figure, four examples are shown before and after labeling. Example #3 illustrates the tie-breaking rule. Example #4 illustrates how, when the number of segments exceeds the available labels (this is a situation that will arise during the growing and use of an M+ tree, in this case a 4+ tree), the segments furthest from X receive the last, "don't care" label.
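A minimal sketch of this "distance-from-X" labeling follows. It assumes that the difference is taken as an absolute distance and that "in favour of the rightmost" means the tied segment whose leftmost word has the larger original position receives the earlier label; both readings are interpretations of the description above rather than details fixed by it.

```python
def distance_from_x_labels(segments, pos_x, n_labels=4):
    """Label RS segments by their distance from pos(X): segments are sorted by
    the absolute difference between the original position of their leftmost
    word and pos(X), ties going to the rightmost segment, and segments beyond
    the available labels all receive the last ("don't care") label.
    Each segment is given as (original_position_of_leftmost_word, text)."""
    order = sorted(range(len(segments)),
                   key=lambda i: (abs(segments[i][0] - pos_x), -segments[i][0]))
    labels = [None] * len(segments)
    for rank, i in enumerate(order):
        labels[i] = chr(ord("A") + min(rank, n_labels - 1))
    return labels
```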
One of the other possible labeling schemes would make it unnecessary to have multiple trees. In this scheme, there are only two possible labels for segments: "C" for "chosen" and "N" for "not chosen". A single tree is trained directly on the segment choice history corpus (see Figure 5). In each of the examples in the corpus, the segment actually chosen is labeled with "C", and all the others are labeled "N". From this corpus, a tree learns rules for estimating the probability that a given segment is "C"; this is the probability returned to the decoder for that segment.

Order of Segment Choice
In the description of the embodiment above, it was shown that this embodiment models the choices of segment available to the decoder at a given time. As was explained there, a typical decoder in a state-of-the-art system builds each target-language hypothesis starting with the left edge and proceeding rightwards, each time adding a previously unassigned segment from the source sentence to the right edge of the growing hypothesis, until all segments in the source sentence have been consumed. Thus, the description of the invention given above assumes that it is applied from left to right, in step with the choices made by the decoder.
It would be possible to decode in a different order - for instance, starting at the right edge of each hypothesis and adding segments to the left edge, or from the middle of each hypothesis outwards. One could apply the invention to systems in which the decoder works in such a different order; one must simply train the distortion component in a way that is consistent with the way it will be used during decoding. For instance, to train the distortion component of the invention to work with a right-to-left decoder, the training algorithm would consider that, for each example of a segment-aligned sentence pair, the largest number of choices occurs when the rightmost segment in the DSH is chosen, and that subsequent choices are made by growing the DSH leftwards.
Statistical N-gram Modeling of Distorted Source Language
Figure 11 shows a method for extracting useful distortion-related information from the distortion training corpus of Figures 3-5 that is completely different from the methods discussed so far. There are well-established techniques in computational linguistics for extracting "statistical language models" (statistical LMs) from large text corpora; these are used in both machine translation and automatic speech recognition (see, for instance, F. Jelinek. "Self-Organized Language Modeling for Speech Recognition", in Readings in Speech Recognition, ed. A. Waibel and K. Lee, publ. Morgan Kaufmann, pp. 450-506, 1990). These models rely on statistics of frequent word sequences in a given language to make predictions about the probabilities of newly-encountered word sequences in that language. Some forms of these models are called "N-gram" models, because they rely on sequences of N or fewer words. For instance, in the 3-gram model used in many automatic speech recognition systems, the probability of occurrence of a word is calculated from the two words preceding it. The calculation involves counts of 3-word sequences, 2-word sequences, and single words obtained from a large corpus. For instance, given that someone has just said the words "the old ...", a 3-gram model can be used to estimate the probability that the next word will be "elephant".
As shown in Figure 11, once the distortion training corpus has been produced by one of the methods described earlier, a "distorted source-language corpus" containing only the distorted source sentences, with segmentation information removed, can be extracted from it. This corpus contains sentences from the source language reordered to reflect the word order characteristic of a particular target language. The example shows German sentences that have been reordered to reflect English word order, as was described earlier (if these same German sentences had originally been segment-aligned with sentences from a language other than English, their word order in the distorted source-language corpus might be quite different). A distorted source language model (DSLM) is then trained on the distorted source-language corpus by means of standard techniques. Of course, the predictions made by a DSLM will always depend on both languages in the pair: a DSLM trained on English sentences that have been distorted to reflect Chinese word order will be quite different from a DSLM trained on English sentences that have been distorted to reflect German word order, though both DSLMs will deal with sequences of English words.
Since a DSLM can output probabilities on partial distorted source-language hypotheses (DSHs), it can be used as a standalone distortion component. That is, just as the module called "Distortion component" in Figure 2 can be embodied in a tree-based component as shown in Figure 9, it could also be embodied in a DSLM. In this embodiment, segmentation information is discarded from the DSH. In the Figure 2 example, where the partial DSH is "s8 s9 | s3 s4 |", the score for segment "s1 s2" would simply be the conditional probability estimated by the DSLM of "s8 s9 s3 s4" being followed by "s1 s2". The future score estimate could be obtained by a greedy procedure analogous to that described earlier. For this Figure 2 example, the future score estimate for "s1 s2" is an estimate of the future probability score obtained when the remaining words in the source sentence ("s5", "s6", and "s7") are added to the DSH. This could be obtained by assuming that the DSH is now "s8 s9 s3 s4 s1 s2", using the DSLM to determine which of the three remaining words has the highest probability of appearing at the right edge of this DSH, adding that word there while incorporating its DSLM probability in the future score, and so on until the remaining words are used up.
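Scoring a candidate segment with a DSLM, ignoring segmentation, could look like the following sketch; `ngram_prob` is a hypothetical stand-in for whatever smoothed N-gram estimate the DSLM provides (here only the single previous word is used, as in a bigram model).

```python
def dslm_segment_score(dsh_words, segment_words, ngram_prob):
    """Score a candidate segment as the DSLM probability of its words
    following the words already in the (unsegmented) DSH.
    `ngram_prob(word, previous_word)` stands in for a smoothed N-gram
    estimate, applied here in bigram fashion."""
    score, prev = 1.0, (dsh_words[-1] if dsh_words else "<s>")
    for word in segment_words:
        score *= ngram_prob(word, prev)
        prev = word
    return score
```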
Many different kinds of DSLMs are possible. One possibility would be to treat not the individual words, but the segments, found in the distortion training corpus as words for the purpose of statistical N-gram modeling. For instance, the N-gram model trained on the data shown in Figure 11 would treat the first distorted example shown there as the "word" "ich habe" being followed by the "word" "gelesen" followed by the "word" "das buch" and the "word" "." . The resulting N-gram model will have as its units segments found in the distortion training corpus, rather than individual words.
Some other useful kinds of DSLMs arise from the observation that there is a drawback to training a DSLM directly on distorted sentences as indicated in Figure 11, which is that the actual word movements are lost. For instance, for translating from English to another language, the DSLM may predict that "John misses Mary" and "Mary misses John" are equally likely. It has no way of helping the MT system decide which of these is a better translation, because it has lost track of the order of "John" and "Mary" in the English source sentence.
To deal with this drawback, instead of training the DSLM with re-ordered source words, we could train it with re-ordered source word positions. This sort of model would estimate the probability of observing a re-ordered sequence such as "1 2 3 5 7 6 4" (where the original sequence was "1 2 3 4 5 6 7"). Unfortunately, this kind of DSLM suffers from a weakness that is the exact opposite of the weakness of the previous kind of DSLM: it knows about word movements, but forgets about the words themselves. This can be a problem, since the likelihood of a given distortion often depends on the words involved. For example, in general, French and English have similar word orders. Thus, modeling French-to-English movements for the French word positions "1 2 3 4", a good DSLM will assign "1 2 3 4" higher probability than "3 2 1 4", "1 4 3 2", or "2 1 3 4", etc. This is fine for "Mais Jean aime Marie" ("But John loves Mary"). However, this kind of DSLM won't work so well for "Marie manque a Jean", a word-by-word translation of which is "Mary misses to John", but which is correctly translated as "John misses Mary". Thus, we are led to consider hybrid DSLMs that have both kinds of information (word information and positional information). In such a model, the training data might consist of distorted word sequences in which each word is annotated with its displacement from its original position. As another French-to-English example, consider a segmented French source sentence "j' | ai laisse | couler | l' | eau | chaude", a word-by-word translation of which is "I | let | run | the | water | hot". When the French is distorted into English word order, one gets "j' | ai laisse | l' | chaude | eau | couler" ("I | let | the | hot | water | run"). The annotated version of the distorted French sentence used for training purposes would be:
"j'⁰ ai⁰ laisse⁰ l'⁻¹ chaude⁻¹ eau⁰ couler⁺³".
Alternatively, one could annotate each word by its displacement relative to the word originally preceding it. For the example in the previous paragraph, this would yield:
"j'⁰ ai⁰ laisse⁰ l'⁻³ chaude⁻¹ eau⁺¹ couler⁺⁴".
Here "j'" originally appeared just after the start-of-sentence marker, and it still does; similarly, "ai" appeared right after "j'", and "laisse" right after "ai", so all of these receive a displacement of 0. By contrast, "l'" originally appeared after "couler", but now it is three words to the left of "couler", so it gets "-3" - and so on. There are a number of possible schemes of this sort for annotating displacement.
Once the annotation protocol for displacement has been determined and the training data obtained, it is straightforward to apply N-gram language modeling techniques that will yield a hybrid DSLM. For instance, one can turn each word annotated by a displacement tag into a "word" in the DSLM. Thus, the distorted French sequence "j'⁰ ai⁰ laisse⁰ l'⁻³ chaude⁻¹ eau⁺¹ couler⁺⁴" would become an example of the word "j'⁰" being followed by the word "ai⁰", and so on, with the DSLM not containing the information that "j'⁰", "j'⁻¹", "j'⁺¹", and so on, are related in any way. The problem with this approach is that it leads to data sparsity. Thus, it makes more sense to ignore the displacement tag in the context being conditioned on, while keeping it for the current word. This is a kind of N-pos approach (N-pos approaches are well-documented in the technical literature on statistical language modeling). For instance, for a hybrid bi-pos DSLM, the probability of the distorted example sentence "j'⁰ ai⁰ laisse⁰ l'⁻³ chaude⁻¹ eau⁺¹ couler⁺⁴" would be (letting "ST" represent the sentence start marker):
P(j'⁰ | ST) * P(ai⁰ | j') * P(laisse⁰ | ai) * P(l'⁻³ | laisse) * P(chaude⁻¹ | l') * P(eau⁺¹ | chaude) * P(couler⁺⁴ | eau).
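A minimal sketch of this bi-pos style factorization follows; `prob` is a hypothetical stand-in for a smoothed bigram estimate trained on displacement-annotated data, and the log of the product is returned for numerical convenience.

```python
import math

def hybrid_bipos_log_prob(annotated_sentence, prob):
    """Bi-pos style score of a distorted, displacement-annotated sentence:
    each annotated token (word plus displacement tag) is predicted from the
    bare preceding word, so the tag is kept on the current word but ignored
    in the conditioning context. `annotated_sentence` is a list of
    (word, displacement) pairs and `prob(current_token, previous_word)`
    stands in for a smoothed bigram estimate trained on such data."""
    log_p, prev_word = 0.0, "<ST>"          # "ST" is the sentence start marker
    for word, displacement in annotated_sentence:
        log_p += math.log(prob((word, displacement), prev_word))
        prev_word = word                    # condition on the word without its tag
    return log_p
```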
Many other ways of generating a hybrid DSLM are possible, such as using syntactic classes of words (e.g., part-of-speech tags) or automatically defined word classes in place of the words themselves.
The foregoing shows how to create DSLMs based on words, DSLMs based on word positions, and hybrid DSLMs that combine both word and positional information. Clearly, it would also be possible to train several different DSLMs, and use their combined outputs as the distortion component. For instance, one could combine outputs from several different DSLMs, each representing a probability, by interpolating them and thus obtaining a single probability. Besides interpolation, many other methods for combining probabilities from probabilistic language models are found in the technical literature.
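Linear interpolation, mentioned above as one way of combining several DSLM outputs, can be written in a single line; the weights and probabilities below are purely illustrative.

```python
def interpolated_probability(probabilities, weights):
    """Linear interpolation of the probabilities returned by several DSLMs
    for the same event; the weights are assumed to be non-negative and to
    sum to one, so the result is again a probability."""
    return sum(w * p for w, p in zip(weights, probabilities))

print(interpolated_probability([0.20, 0.05, 0.12], [0.5, 0.3, 0.2]))   # approximately 0.139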
As described above, a DSLM or a combination of DSLMs can be used as a standalone distortion component. Another embodiment of the use of DSLMs is as features input to a system based on supervised learning, such as the tree-based embodiment of the invention shown in Figure 9. Many of the input features described earlier complement the kind of information contained in the DSLM. To incorporate DSLM information in the trees, yes/no questions pertaining to DSLM scores are devised. For instance, if "DSLM(seg, DSH)" denotes the conditional probability assigned by the DSLM to the RS segment with label "seg" following the words in the DSH, then examples of possible questions could be "Is DSLM(A, DSH) > DSLM(B, DSH)?" and other comparisons or thresholds involving DSLM scores.
Use of the Distortion Component for Rescoring Translation Hypotheses
Many of today's statistical machine translation systems perform translation in two steps, as illustrated in Figure 12 for a French-to-English example. The output of the first step, performed by the decoder and based on a given set of information sources, is a representation of the most probable translation hypotheses according to these information sources. This representation may, for instance, be a set of N hypotheses, each accompanied with a probability score (an "N-best list"), or a word lattice with probabilities associated with transitions in the lattice. In the second step, a second set of information sources is used to assign new probability scores to the translation hypotheses encoded in the representation output from the first step; this is called "rescoring". The hypotheses are then reordered according to the new scores, so that a hypothesis that received a fairly low ranking from the decoder may end up as the system's top guess after rescoring.
Typically (though not necessarily), the set of information sources used for rescoring is a superset of the set of information sources used for decoding - hence the names "small set" and "large set" employed in Figure 12 for the set of decoding and rescoring information sources respectively. The only requirement for an information source for rescoring is that it be capable of generating a numerical score for each translation hypothesis. This score is returned by a "feature function" whose input is a translation hypothesis, and often the original source sentence. Typically, a weight estimation procedure is invoked prior to use of the complete two-step system to assign weights to the information sources employed in the second step, with larger weights being assigned to more reliable information sources. Full technical descriptions of rescoring can be found, for instance, in the articles "Discriminative Training and Maximum Entropy Models for Statistical Machine Translation" by F. Och and H. Ney (Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 2002) and "Minimum Error Rate Training for Statistical Machine Translation" by F. Och (Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, July 2003).
Above, it was described how one embodiment of the invention can be applied at the decoding step. However, it can also be used at the rescoring step. Note that this gives three ways of applying the invention during MT: the new distortion method can be applied during decoding, during rescoring, or during both decoding and rescoring.
As shown in Figure 13, to apply the current invention to rescoring, each of the N-best translation hypotheses can be segment-aligned with the original source sentence, so that the DSH can be recovered. This can be achieved by ensuring that during decoding, information about which segment of the source sentence generated which segment of the target sentence is retained. Note that unlike the situation during decoding, the system now doesn't have to guess the segmentation for the source sentence - it is known. Also note that while for the decoding step, only a source-to-target language distortion model is used, for the rescoring step a model for the reverse direction can also be used. The model for the reverse, target-to-source direction could be trained on the same segment-aligned data as the source-to-target one, or on different data. The source-to-target model is used to estimate the probability that the original source sentence could have been distorted into the word order represented by a particular hypothesis, while the target-to-source model is used to estimate the probability that a particular hypothesis could be distorted into the word order represented by the original source sentence. In the figure, the source language is German and the target language is English.
One way for the system to generate the source-to-target distortion score is for it to move from left to right in the DSH, multiplying probabilities assigned by a source-to-target distortion model (of the same form as that described earlier for decoding) as it goes. For instance, the score for H1 in Figure 13 generated by the German-to-English distortion feature function would be initialized to the probability assigned by the model to choosing the segment "ich habe" when the DSH is empty and the RS consists of the segments "das buch", "gelesen", and ".". This probability would then be multiplied by the probability of choosing "gelesen" at the next step - and so on. Note that in the example, although H1 and H3 are not the same, they represent the same German-to-English distortion, and will thus be assigned the same German-to-English distortion feature function score.
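The left-to-right rescoring walk just described can be sketched as follows; `choice_prob` is a hypothetical stand-in for the same kind of distortion model used at decoding time, and the forced final choice contributes probability 1.

```python
def source_to_target_distortion_score(source_segments, dsh_segments, choice_prob):
    """Rescoring feature: walk the known DSH from left to right, multiplying
    the probability of each segment choice given the DSH built so far and
    the segments still remaining; the forced last choice contributes 1.
    `choice_prob(dsh, segment, rs)` stands in for the distortion model."""
    dsh, rs, score = [], list(source_segments), 1.0
    for segment in dsh_segments:
        if len(rs) > 1:
            score *= choice_prob(dsh, segment, rs)
        dsh.append(segment)
        rs.remove(segment)
    return score
```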
For the English-to-German distortion score, the system is calculating the probability of distorting an English-language hypothesis into German-like word order. For instance, its score for H1 will be based on the probability (according to a model of English distorted into German word order that is trained separately from the German-to-English one) that "i have | read | the book | . " will be distorted into "i have | the book | read | . ". The source-to-target and target-to-source distortion models are by no means equivalent; for instance, they may use different word input features. For the tree-based embodiment of the invention, it was shown earlier that the German-to-English trees may contain questions about the presence or absence of certain German words; the English-to-German trees would contain questions about the presence or absence of English words. Similarly, in the standalone distorted language model embodiment of the invention, the distorted source language model (DSLM) for this example would involve only German words (in English-like order), while the distorted target language model (DTLM) for this invention would involve only English words (in German-like word order). Thus typically, the two feature functions shown in Figure 13 - the source-to-target distortion feature function, and the target-to-source distortion feature function - will generate different probability scores for the same hypotheses.
For rescoring, one may choose to apply the segment choice aspect of the invention in any order of choices, not necessarily the left-to-right order on target-language hypotheses favoured by today's decoders. For instance, one could train and apply a rescoring module that assumes the DSH is constructed by first choosing the rightmost segment, then choosing the second rightmost segment, and so on, proceeding leftwards until all the segments in the source sentence have been consumed. Another possibility would be a rescoring module that begins in the middle of a DSH and proceeds to grow outwards, making choices at both its left and right edges.
The distortion-based feature functions for rescoring just described are similar to those that can be used for decoding. However, other distortion-based feature functions that are trained on distortion corpora, but that are less tied to the functioning of the decoder, can be devised. Such feature functions assess the overall relationship between the source sentence and the complete DSH associated with a complete target-language hypothesis, or between a complete target-language hypothesis and its distortion into the ordering represented by the source sentence. Details of the order in which choices are made may be ignored by such feature functions.
One such feature function assesses the probability of the permutation between the source sentence and a complete DSH (or between a target- language hypothesis and its distorted version). This permutation could be measured, for instance, by a rank correlation metric such as Spearman's correlation or Kendall's tau, both of which are used to measure the extent to which two lists of ranks (e.g., from 1 to N) are correlated with each other. These take values between +1 and -1 , with +1 meaning that the two lists are perfectly correlated. Here, the inputs to one of these metrics would be the original word sequence and its distorted counterpart. A permutation-based feature function would assign a better score to hypotheses in the N-best list whose word sequence had approximately the most probable correlation with the word sequence in the source sentence, and penalize those whose correlation with the source sentence was too great or too little.
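Of the two rank-correlation metrics mentioned, Kendall's tau is easy to compute directly; the sketch below (hypothetical helper, assuming the items being ranked are distinct, as word positions are) measures how much reordering the distorted sequence exhibits.

```python
def kendalls_tau(original_order, distorted_order):
    """Kendall's tau between two orderings of the same (distinct) items:
    +1 when the orders agree completely, -1 when one is the reverse of the
    other, and intermediate values for partial reorderings."""
    rank = {item: i for i, item in enumerate(distorted_order)}
    n = len(original_order)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if rank[original_order[i]] < rank[original_order[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Word positions "1 2 3 4 5 6 7" reordered as "1 2 3 5 7 6 4":
print(kendalls_tau(list("1234567"), list("1235764")))   # about 0.62
```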
The probability of a given amount of permutation often depends on the nature of the sentence, so the usefulness of the permutation-based feature function would be enhanced by dividing input sentences into several classes, each associated with a different distribution of probability across possible amounts of permutation. The supervised learning tools described earlier (such as trees), combined with input features based on the presence or absence of certain words or syntactic categories, can be used to do this. For instance, recall that word order in Chinese and English differs more - there is more permutation - if the sentence is a question. A tree-based feature function for predicting the amount of permutation for the Chinese-to-English or English-to- Chinese task would probably contain questions about the presence or absence of "?" at the end of the source sentence, and assign higher probability to lower correlation between the original and the distorted word sequence if the "?" was present. Similarly, a tree-based permutation feature function for English-to-German or German-to-English would probably contain questions about the presence or absence of subordinate verbs.
Another kind of feature function would assess the relationship between the original position of each individual word in the source sentence, and its position in the complete DSH. For a given source language and target language, certain source words may have a tendency to move left or right in the DSH, by small or large amounts. For instance, in the German-to-English case, past participles of German verbs have a tendency to move left. In the section above entitled "Statistical N-gram Modeling of Distorted Source Language", it was shown how a word in the complete DSH may be annotated by its displacement from its original position, and how this displacement can be calculated in various ways. The section showed how a hybrid DSLM can be used during decoding to score the probability that a given word will be displaced a given amount from its original position. As mentioned above, these and other types of DSLM can be used as feature functions for rescoring; so can their converse, DTLMs that score the probability of a target hypothesis being distorted into the word order characteristic of the source sentence.
DSLMs and DTLMs based on statistical N-gram approaches work particularly well for decoding in today's systems, because of the left-to-right nature of such decoding. However, at rescoring time, the system has knowledge of the complete sequence of each target-language hypothesis, and thus knows the complete DSH for the hypothesis (and its converse, the complete sequence of words in the target hypothesis rearranged to reflect the order of words in the source sentence). In this context, feature functions for modeling word displacement can be based on a broad range of supervised learning tools with word-based and syntax-based features (as well as on DSLMs and DTLMs). Such tools can, for instance, learn how the presence of other words at other positions in the source sentence affects the direction and magnitude of the displacement of the word currently being considered.
For instance, one can grow a decision tree each of whose leaves contains a probability distribution giving the probability of various possible displacements for the given word. The questions in the tree ask about the identity of the word itself, about the identity of words elsewhere in the sentence, about the number of words in the sentence, and so on. For each word in a given source sentence, one can thus calculate its probability of being displaced by the amount found in the DSH. The overall feature function score for a given DSH is obtained by multiplying together the probabilities assigned by the tree to the displacements of the individual words. Clearly, one can build a similar feature function that will be applied in the opposite direction, to assess the relationship between the positions of words in a target hypothesis and their new positions when they are distorted to match the word order in the source sentence.
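The overall score of such a displacement feature function is just a product over words; a minimal sketch follows, where `displacement_log_prob` is a hypothetical stand-in for the leaf distribution of the decision tree described above (log probabilities are summed rather than multiplied for numerical stability).

```python
def displacement_feature_log_score(words, displacements, displacement_log_prob):
    """Word-displacement feature for rescoring: sum, in log space, the
    probability the model assigns to each source word being displaced by
    the amount observed in the complete DSH.
    `displacement_log_prob(word, displacement, sentence)` stands in for the
    leaf distribution of a tree grown over word- and syntax-based questions."""
    return sum(displacement_log_prob(word, d, words)
               for word, d in zip(words, displacements))
```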
Since rescoring is computationally cheap, it is quite practical to apply several or all of the feature functions described in this section during rescoring. Other advantages that are inherent to the structure are obvious to one skilled in the art. The embodiments are described herein illustratively and are not meant to limit the scope of the invention as claimed. Variations of the foregoing embodiments will be evident to a person of ordinary skill and are intended by the inventor to be encompassed by the following claims.

Claims

Claims:
1. A method for generating a distortion component for use in assigning a distortion score in a machine translation system comprising the steps of:
a) providing a bilingual sentence pair having a training source sentence and a training target sentence;
b) segmenting the training source sentence and the training target sentence into one or more than one training source sentence segments and one or more than one training target sentence segments;
c) aligning the order of the training source sentence segments with the order of the corresponding training target sentence segments;
d) forming a distorted training source sentence with the aligned training source sentence segments in the position order of the training target sentence segments;
e) outputting the training source sentence and the associated distorted training source sentence to form a distortion training corpus;
f) using a supervised learning tool on the distortion training corpus to generate a distortion component.
2. The method of claim 1 where the supervised learning tool comprises a decision tree method.
3. The method of claim 1 where the supervised learning tool comprises a hybrid distorted source language model.
4. The method of claim 1 where using the supervised learning tool results in an N-gram distorted source language model.
5. A method for providing a distortion score to a translation hypothesis for the translation of a source sentence made of words into a target sentence comprising the steps:
a) providing a source sentence to a decoder;
b) segmenting said source sentence into one or more than one segments;
c) choosing a segment;
d) removing said chosen segment from the source sentence, leaving remaining words;
e) inputting said chosen segment into a partial distortion hypothesis;
f) providing said chosen segment, the remaining words, and the partial distortion hypothesis to a distortion component;
g) calculating a distortion score with a distortion component acquired through supervised learning from a distortion training corpus;
h) repeating steps c to e until all words have been chosen;
i) calculating a cumulative distortion score;
j) outputting to said decoder said distortion score.
6. The method of claim 5 where the supervised learning technique comprises a decision tree method.
7. The method of claim 5 where the supervised learning technique comprises a distorted-source language model.
8. The method of claim 5 comprising an additional step at step g) of having said distortion component also calculate a future score.
9. The method of claim 8 where a future score is an estimate of the cumulative score for using the remaining words.
10. The method of claim 5 where the distortion component is based on one or more than one of the following features:
• a position based feature;
• a word based feature;
• a syntax based feature.
11. The method of claim 5 where calculating the distortion score is based on two or more than two of the following input features:
- information about the position of the chosen segment;
- the length of the chosen segment;
- information about the position of remaining segments;
- lengths of remaining segments;
- information about the difference between the original position of a segment in the source sentence and its new position in a hypothesis;
- information about the presence of a particular word;
- information about the absence of a particular word;
- information about a part-of-speech tag in the chosen segment;
- information about a part-of-speech tag in a remaining segment;
- information about a part-of-speech tag in the source sentence;
- information about the presence of particular words at a given position in chosen segments;
- information about the presence of particular words at a given position in remaining segments;
- information about the presence of particular words at a given position in the source sentence;
- information about the absence of particular words at a given position in chosen segments;
- information about the absence of particular words at a given position in remaining segments;
- information about the absence of particular words at a given position in the original sentence;
- information about the degree of permutation exhibited by a sequence of chosen segments;
- information about the degree of permutation exhibited by a sequence of chosen words;
- information about the degree of fragmentation exhibited by the word sequence in the remaining segments;
- the number of gaps exhibited by the words in the word sequence in the remaining segments;
- information about the scores generated by distorted-source language models applied to the words in the sequence of chosen segments.
12. The method of claim 5 where the distortion score is based on information about whether the source sentence is a question.
13. The method of claim 5 where the distortion score is based on information about whether the source sentence is a statement.
14. The method of claim 5 where the distortion score is based on syntactic information about one or more than one word in the source sentence.
15. A method for assigning a score to a translation hypothesis of a source sentence comprising the steps:
a) providing a segmented translation sentence hypothesis;
b) providing an associated segmented source sentence;
c) providing an alignment of the segments of the translation sentence hypothesis with the segments of the source sentence;
d) calculating a new distortion score based on the alignment with a distortion component acquired through supervised learning from a distortion training corpus having new source information;
e) outputting a new set of rescored hypotheses.
16. The method of claim 15 where the distortion component is based on two or more than two of the following features:
information about the positions or lengths of segments in the hypothesis;
information about the difference between the original positions of the segments and their new position in the hypothesis;
information about the presence or absence of particular words or part-of-speech tags in the original sentence or in the target hypothesis;
information about the presence or absence of particular words at given positions in the original sentence or in the target hypothesis;
information about the degree of permutation undergone by the sequence of segments associated with the target hypothesis compared with their original sequence;
information about the distortion incurred by transforming the current target-language hypothesis into the source-language sentence;
information about the scores generated by distorted-source language models applied to the words in the new sequence of segments associated with the target hypothesis.
17. A computer readable memory for obtaining a distortion score for a translation hypothesis of a source sentence into a target sentence in a machine translation system comprising:
- a source sentence;
- a decoder;
- said source sentence being inputted in said decoder;
- a phrase table;
- said phrase table having source words associated with corresponding target words;
- said phrase table providing possible segments to said decoder;
- a distortion component;
- said distortion component acquired through supervised learning from a distortion training corpus;
- said distortion training corpus made of a training source sentence and an associated distorted training source sentence;
- said decoder providing said distortion component with a selected segment from said source sentence and remaining segments from said source sentence and a distorted sentence hypothesis;
- said distortion component outputting a distortion score to said decoder.
18. The computer readable memory of claim 17 where said distortion component also outputs a future score estimate.
19. The computer readable memory of claim 17 where said distortion component is made of trees derived from the distortion training corpus.
20. The computer readable memory of claim 19 where the trees are based on one or more than one of the following features:
• a position based feature;
• a word based feature;
• a syntax based feature.
21. A machine translation system for translating a source sentence into a target sentence where a translation hypothesis is provided, the translation hypothesis provided being assigned a cumulative score comprising the following elements:
• a phrase translation score;
• a language model score;
• a number of words score;
• and a distortion score, where said distortion score is assigned by a distortion component derived by means of a supervised learning tool from a bilingual sentence pair comprising a segmented training source sentence and a segmented training target sentence, such that the segments of the training source sentence have been aligned with the segments of the training target sentence.
PCT/CA2006/002056 2005-12-16 2006-12-18 Method and system for training and applying a distortion component to machine translation WO2007068123A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US75082805P 2005-12-16 2005-12-16
US60/750,828 2005-12-16

Publications (1)

Publication Number Publication Date
WO2007068123A1 true WO2007068123A1 (en) 2007-06-21

Family

ID=38162528

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2006/002056 WO2007068123A1 (en) 2005-12-16 2006-12-18 Method and system for training and applying a distortion component to machine translation

Country Status (1)

Country Link
WO (1) WO2007068123A1 (en)

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102053959A (en) * 2010-12-13 2011-05-11 百度在线网络技术(北京)有限公司 Method and device for generating sequence regulating model for machine translation
US8352244B2 (en) 2009-07-21 2013-01-08 International Business Machines Corporation Active learning systems and methods for rapid porting of machine translation systems to new language pairs or new domains
CN103092830A (en) * 2011-10-28 2013-05-08 北京百度网讯科技有限公司 Reordering rule acquisition method and device
WO2014192598A1 (en) * 2013-05-29 2014-12-04 独立行政法人情報通信研究機構 Translation word order information output device, translation word order information output method, and recording medium
US9519643B1 (en) 2015-06-15 2016-12-13 Microsoft Technology Licensing, Llc Machine map label translation
WO2017001940A1 (en) * 2015-06-30 2017-01-05 Yandex Europe Ag Method and system for transcription of a lexical unit from a first alphabet into a second alphabet
US9916306B2 (en) 2012-10-19 2018-03-13 Sdl Inc. Statistical linguistic analysis of source content
US9954794B2 (en) 2001-01-18 2018-04-24 Sdl Inc. Globalization management system and method therefor
US9984054B2 (en) 2011-08-24 2018-05-29 Sdl Inc. Web interface including the review and manipulation of a web document and utilizing permission based control
US10061749B2 (en) 2011-01-29 2018-08-28 Sdl Netherlands B.V. Systems and methods for contextual vocabularies and customer segmentation
US10140320B2 (en) 2011-02-28 2018-11-27 Sdl Inc. Systems, methods, and media for generating analytical data
US10198438B2 (en) 1999-09-17 2019-02-05 Sdl Inc. E-services translation utilizing machine translation and translation memory
US10248650B2 (en) 2004-03-05 2019-04-02 Sdl Inc. In-context exact (ICE) matching
US10261994B2 (en) 2012-05-25 2019-04-16 Sdl Inc. Method and system for automatic management of reputation of translators
CN109783825A (en) * 2019-01-07 2019-05-21 四川大学 A kind of ancient Chinese prose interpretation method neural network based
US10319252B2 (en) 2005-11-09 2019-06-11 Sdl Inc. Language capability assessment and training apparatus and techniques
US10417646B2 (en) 2010-03-09 2019-09-17 Sdl Inc. Predicting the cost associated with translating textual content
US10452740B2 (en) 2012-09-14 2019-10-22 Sdl Netherlands B.V. External content libraries
US10572928B2 (en) 2012-05-11 2020-02-25 Fredhopper B.V. Method and system for recommending products based on a ranking cocktail
US10580015B2 (en) 2011-02-25 2020-03-03 Sdl Netherlands B.V. Systems, methods, and media for executing and optimizing online marketing initiatives
US10614167B2 (en) 2015-10-30 2020-04-07 Sdl Plc Translation review workflow systems and methods
US10635863B2 (en) 2017-10-30 2020-04-28 Sdl Inc. Fragment recall and adaptive automated translation
US10657540B2 (en) 2011-01-29 2020-05-19 Sdl Netherlands B.V. Systems, methods, and media for web content management
CN111709234A (en) * 2020-05-28 2020-09-25 北京百度网讯科技有限公司 Training method and device of text processing model and electronic equipment
US10817676B2 (en) 2017-12-27 2020-10-27 Sdl Inc. Intelligent routing services and systems
CN112865721A (en) * 2021-01-05 2021-05-28 紫光展锐(重庆)科技有限公司 Signal processing method, device, equipment, storage medium, chip and module equipment
US11256867B2 (en) 2018-10-09 2022-02-22 Sdl Inc. Systems and methods of machine learning for digital assets and message creation
US11308528B2 (en) 2012-09-14 2022-04-19 Sdl Netherlands B.V. Blueprinting of multimedia assets
US11386186B2 (en) 2012-09-14 2022-07-12 Sdl Netherlands B.V. External content library connector systems and methods
CN115312029A (en) * 2022-10-12 2022-11-08 之江实验室 Voice translation method and system based on voice depth characterization mapping

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5477451A (en) * 1991-07-25 1995-12-19 International Business Machines Corp. Method and system for natural language translation
US5510981A (en) * 1993-10-28 1996-04-23 International Business Machines Corporation Language translation apparatus and method using context-based translation models
US20030233222A1 (en) * 2002-03-26 2003-12-18 Radu Soricut Statistical translation using a large monolingual corpus
US20040002848A1 (en) * 2002-06-28 2004-01-01 Ming Zhou Example based machine translation system

WO2017001940A1 (en) * 2015-06-30 2017-01-05 Yandex Europe Ag Method and system for transcription of a lexical unit from a first alphabet into a second alphabet
US10614167B2 (en) 2015-10-30 2020-04-07 Sdl Plc Translation review workflow systems and methods
US11080493B2 (en) 2015-10-30 2021-08-03 Sdl Limited Translation review workflow systems and methods
US10635863B2 (en) 2017-10-30 2020-04-28 Sdl Inc. Fragment recall and adaptive automated translation
US11321540B2 (en) 2017-10-30 2022-05-03 Sdl Inc. Systems and methods of adaptive automated translation utilizing fine-grained alignment
US10817676B2 (en) 2017-12-27 2020-10-27 Sdl Inc. Intelligent routing services and systems
US11475227B2 (en) 2017-12-27 2022-10-18 Sdl Inc. Intelligent routing services and systems
US11256867B2 (en) 2018-10-09 2022-02-22 Sdl Inc. Systems and methods of machine learning for digital assets and message creation
CN109783825B (en) * 2019-01-07 2020-04-28 四川大学 Neural network-based ancient language translation method
CN109783825A (en) * 2019-01-07 2019-05-21 四川大学 Neural network-based ancient language translation method
CN111709234A (en) * 2020-05-28 2020-09-25 北京百度网讯科技有限公司 Training method and device of text processing model and electronic equipment
CN112865721A (en) * 2021-01-05 2021-05-28 紫光展锐(重庆)科技有限公司 Signal processing method, device, equipment, storage medium, chip and module equipment
CN112865721B (en) * 2021-01-05 2023-05-16 紫光展锐(重庆)科技有限公司 Signal processing method, device, equipment, storage medium, chip and module equipment
CN115312029A (en) * 2022-10-12 2022-11-08 之江实验室 Voice translation method and system based on voice depth characterization mapping

Similar Documents

Publication Publication Date Title
WO2007068123A1 (en) Method and system for training and applying a distortion component to machine translation
EP0932897B1 (en) A machine-organized method and a device for translating a word-organized source text into a word-organized target text
KR101031970B1 (en) Statistical method and apparatus for learning translation relationships among phrases
Bod An all-subtrees approach to unsupervised parsing
US20080154577A1 (en) Chunk-based statistical machine translation system
Lavergne et al. From n-gram-based to CRF-based translation models
JP2006134311A (en) Extraction of treelet translation pair
CN112417823A (en) Chinese text word order adjusting and quantitative word completion method and system
Zamora-Martinez et al. N-gram-based machine translation enhanced with neural networks for the French-English BTEC-IWSLT'10 task.
Callison-Burch et al. Co-training for statistical machine translation
Hadj Ameur et al. Improving Arabic neural machine translation via n-best list re-ranking
Mermer Unsupervised search for the optimal segmentation for statistical machine translation
McTait Translation patterns, linguistic knowledge and complexity in an approach to EBMT
Hoang Improving statistical machine translation with linguistic information
Costa-jussà An overview of the phrase-based statistical machine translation techniques
Farzi et al. A neural reordering model based on phrasal dependency tree for statistical machine translation
CN111597831A (en) Machine translation method for generating statistical guidance by hybrid deep learning network and words
Benkov Neural Machine Translation as a Novel Approach to Machine Translation
JP2003263433A (en) Method of generating translation model in statistical machine translator
Khalilov et al. Neural network language models for translation with limited data
Do et al. Discriminative adaptation of continuous space translation models
Bender Robust machine translation for multi-domain tasks
Sánchez-Martínez et al. Exploring the use of target-language information to train the part-of-speech tagger of machine translation systems
Vaičiūnas et al. Statistical language models of Lithuanian based on word clustering and morphological decomposition
Dugonik et al. The usage of differential evolution in a statistical machine translation

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
NENP Non-entry into the national phase
    Ref country code: DE
122 EP: PCT application non-entry in European phase
    Ref document number: 06840483
    Country of ref document: EP
    Kind code of ref document: A1