WO2009002141A1

WO2009002141A1 - A system and method of language translation

Info

Publication number: WO2009002141A1
Application number: PCT/MY2008/000061
Authority: WO
Inventors: Abdul Aziz Normaziah; Bin Ab. Rahman Suhaimi
Original assignee: Mimos Berhad
Priority date: 2007-06-27
Filing date: 2008-06-27
Publication date: 2008-12-31
Also published as: MY151645A

Abstract

The present invention generally relates to a system for translating a language automatically, wherein hardware such as a computer is used in the system. The system comprises a sentence level matching technique and a phrase look-up matching technique. The system searches a translation memory, using the sentence level matching technique, to find out whether there are closely matched examples. When a record is found, the translation technique returns the translation suggestion for the input sentence; when no more records are matched, the phrase look-up matching technique is applied. A bi-section technique then analyzes the input sentence by splitting it into two parts, wherein the first part runs from the first word to a splitting point and the second part holds the rest of the sentence. The longest chunk of the input sentence is searched for, moving from the left portion of the input to the right portion. After an output is obtained from the first part of the original phrase, the second part acts as the original phrase and is split further.

Description

A SYSTEM AND METHOD OF LANGUAGE TRANSLATION

FIELD OF THE INVENTION

The present invention relates to a system and method of language translation.

More particularly, the present invention relates to a system and method of language translation that utilizes a translation memory method using a phrase look-up approach and a word alignment information database.

BACKGROUND OF THE INVENTION

Several methods of language translation have been introduced in the prior art. Some examples are Hua et al., 2005; Simard and Langlais, 2001; Macklovitch and Russell, 2000; and Hanas and Furuse, 2000. In the prior art, conventional translation methods retrieve examples matched with the input sentence at sentence level. In such an approach, the method provides a good translation suggestion only when there are closely matched examples.

Basically, in the prior art, the method can be divided into three steps:-

(a) recording translation pairs (inclusive of word alignment information)

(b) retrieving examples from the translation memory by using search engines

(c) applying an online learning mechanism

In another prior art, Simard and Langlais, 2001, the examples are ranked according to the length of the matched sub-sequences of words.

Therefore, it is an objective of the present invention to introduce a new method of translation wherein a good translation suggestion is only provided when there are closely matched examples. Further to this, the present invention adopts the longest available sub-sequence to cover as much of the source sentence as possible. The present invention is also designed to develop a user interface for a translator to perform an interactive translation process.

SUMMARY OF THE INVENTION

The present invention relates to a system for translating a language automatically, wherein hardware such as a computer is used in the system and wherein the system comprises a sentence level matching technique and a phrase look-up matching technique. The system searches a translation memory, using the sentence level matching technique, to find out whether there are closely matched examples. When a record is found, the translation technique returns the translation suggestion for the input sentence, and when no more records are matched, the phrase look-up matching technique is applied. A bi-section technique analyzes the input sentence by splitting it into two parts. The first part runs from the first word to a splitting point and the second part holds the rest of the sentence. The longest chunk of the input sentence is searched for, moving from the left portion of the input to the right portion. After an output is obtained from the first part of the original phrase, the second part acts as the original phrase and is split further.

According to the present invention, this process is repeated and the results for all these sub-phrases are aggregated to form the basic output. A repetition avoidance technique is applied thereafter, wherein this technique accumulates and identifies the equivalent-meaning or repeated words for a certain word. Said technique of accumulating and identifying the similar-meaning words is based on an equivalence search over the source and target words retrieved from the word alignment information database. Further to this, an inter-phrase word-to-word distance summation technique is used, in which a summation value is calculated to give an estimate of which word is repeated and needs to be omitted and which one is not.

The present invention also relates to a method of automatic language translation that is carried out on hardware, wherein, when translating a new sentence, the system first searches for an example in the word phrase database or the translation memory database. The sentence that is input into the system is compared to the source language part of the examples from the phrase and language database. If there is an exactly matched example, the target part of this example is suggested as the translation. Before the input sentence is processed, a module to filter each word in the sentence is invoked; any punctuation, symbol or word space is extracted and all words are thereafter merged. By using normalization, two sentences with the same words are matched even though said words are in a different order. When no exact match is obtained, the method searches for examples in the next database, the translation memory database. The phrase look-up matching is used to find a suggested meaning for a phrase by parsing the phrase into sub-phrases, finding a meaning for these sub-phrases and combining the results to get a final output.

The method further includes an input occurrence detection and verification process that is used by the phrase look-up algorithm in order to commence the translation process. If there is no meaning or the word is not in the input, the system simply skips to the next word or phrase.

According to the present invention, the system and method are carried out via hardware such as a computer or any other device of similar capacity, and the system and method can work locally, remotely, over a network or via the internet.

BRIEF DESCRIPTION OF THE FIGURES

Figure 1 shows a bi-section algorithm flow process.

Figure 2 shows a flow diagram of a SAT and WAT used by the bi-section algorithm as shown in Figure 1.

Figure 3 shows a flow chart of an example of a basic output with a repetition of word.

Figure 4 shows a flow chart of getting a final output using an inter-phrase word-to-word distance summation algorithm.

Figure 5 shows a detailed flow chart of the method and system of language translation according to the present invention.

Figure 6 shows a diagram of an example of word filtering formatting output.

Figure 7 shows a chart for translation suggestion word phrases using translation memory database.

Figure 8 shows a chart for translation suggestion word phrases using phrase database.

Figure 9 shows a chart showing an example of the phrase look up matching technique.

Figure 10 shows a graph showing the relation between the index and the inter-phrase distance summation.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

The present invention will now be described with reference to the accompanying figures but is not limited thereto. It should also be noted that the present invention can be used for all types of language translation. However, for the purpose of this description, the invention is described with reference to an English-Malay translation method.

The present invention employs mainly two techniques: (1) sentence level matching and (2) phrase look-up matching. Firstly, the present invention searches the translation memory to find out whether there are closely matched examples. In said search, the present invention uses the sentence level matching technique. According to the present invention, if a record is found, the translation method returns the translation suggestion for the input sentence. If no more records are matched, the present invention performs the phrase look-up matching method. This method segments the input sentence into several sub-sequences using a bi-section algorithm. The bi-section technique analyzes an input sentence by splitting it into two parts. The first part runs from the first word to a splitting point and the second part holds the rest of the sentence. The purpose of this method is to search for the longest chunk of the input sentence, moving from the left portion of the input to the right portion. After an output is obtained from the first part of the original phrase, the second part acts as the original phrase and is split further. According to the present invention, this process is repeated and the results for all these sub-phrases are aggregated to form the basic output. Reference is made to Figure 1, which shows the details of the bi-section algorithm process. Reference is also made to Figure 2, which shows the two database tables, the sentence alignment table (SAT) and the word alignment table (WAT), that store all the information required by the present invention.
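By way of illustration only, the bi-section segmentation described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation of Figure 1: the function name `bisect_translate`, the `lookup` callable and the small phrase table are hypothetical, and the splitting point is simply moved leftwards from the end of the remaining phrase until a known chunk is found.

```python
def bisect_translate(words, lookup):
    """Greedy left-to-right search for the longest known chunk.

    `lookup` is a hypothetical callable that returns the target text for a
    tuple of source words, or None when the chunk is unknown.
    """
    output = []
    start = 0
    while start < len(words):
        # Try the longest first part, then move the splitting point left.
        for split in range(len(words), start, -1):
            target = lookup(tuple(words[start:split]))
            if target is not None:
                output.append(target)
                start = split  # the second part now acts as the original phrase
                break
        else:
            start += 1  # no meaning found for this word: skip it
    return output


# Hypothetical phrase table, used only for this illustration.
table = {("gaming",): "permainan", ("big", "industry"): "industri besar"}
result = bisect_translate("gaming is big industry".split(), table.get)
```

In this sketch a word with no entry ("is") is skipped, which mirrors the skipping behaviour described for the phrase look-up algorithm.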

The repetition avoidance technique is applied thereafter, wherein this method accumulates and identifies the equivalent-meaning or repeated Malay words for a certain English word. The method of accumulating and identifying the similar-meaning words is based on an equivalence search over the source and target words retrieved from the word alignment information database. These repeated words are generated because of the way the source (English) and target (Malay) pairs of sentences are dealt with. Basically, the structure of the Malay sentence generated from this process is acceptable, but it may contain repeated words that need to be omitted. Figure 3 shows an example of how a repeated word "kamu" occurs twice in the basic output.

Next, the inter-phrase word-to-word distance summation technique is used. In this method, an algorithm calculates a summation value that gives a clue as to which word is repeated and needs to be omitted and which one is not. Figure 4 shows the basic process of obtaining a final output using this method. According to this method, there are three steps involved in calculating the word-to-word distance summation value:-

- Determining the location of each word in the basic output ($loc_i$) and in the SAT table ($loc_j$).

- Calculating the distance between the two word locations ($loc_j$, $loc_i$), wherein

$d_i = |loc_j - loc_i|$

in which $i, j = 0, 1, \ldots, n$

$loc_i$ = location value in the basic output

$loc_j$ = location value in the SAT's target sentence

- Summing all the word distance values $d_i$ to obtain the summation value $\sum d$.
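As a worked illustration of the three steps above, the following Python sketch computes the distance for each pair of locations and sums the results. The function name `distance_summation` and the sample location pairs are hypothetical, and the sketch assumes the distance is the absolute difference between the two locations.

```python
def distance_summation(location_pairs):
    """Sum |loc_j - loc_i| over (loc_i, loc_j) pairs.

    loc_i is a word's location in the basic output and loc_j is its
    location in the SAT target sentence (assumed integer positions).
    """
    return sum(abs(loc_j - loc_i) for loc_i, loc_j in location_pairs)


# Hypothetical locations: distances are 0, 2 and 1.
total = distance_summation([(0, 0), (1, 3), (2, 1)])
```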

To further describe the present invention, reference is made to Figure 5, which shows the details of the method according to the present invention. When translating a new sentence, the system first searches for an example in the word phrase database or the translation memory database. The sentence that is input into the system is compared to the source language part of the examples from the phrase and language database.

If there is an exactly matched example, the target part of this example is suggested as the translation. However, there could be some instances in which the system fails to find a related example in the databases. In order to eliminate such problems, three methods are applied in the sentence level matching technique.

Before the input sentence is processed, a module to filter each word in the sentence is invoked. Any punctuation, symbol or word space is extracted and all words are thereafter merged (Figure 6). Then, by using normalization, two sentences with the same words can be matched even though said words are in a different order. For example, "Vanilla Ice Cream" can be matched with "Ice Cream Vanilla".
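The filtering and normalization steps may be illustrated with the following Python sketch. It is offered under assumptions not stated in the description: the function names `filter_words` and `normalized_key` are hypothetical, lower-casing is added for convenience, and sorting the filtered words is just one possible way to make the comparison independent of word order.

```python
import re


def filter_words(sentence):
    """Remove punctuation and symbols, then split on white space."""
    return re.sub(r"[^\w\s]", " ", sentence).lower().split()


def normalized_key(sentence):
    """Order-independent key: the sorted tuple of filtered words."""
    return tuple(sorted(filter_words(sentence)))


same = normalized_key("Vanilla Ice Cream") == normalized_key("Ice Cream, Vanilla!")
```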

This is done by using SQL commands such as LIKE together with the AND logical operator. Currently, there are 2500 Malay-English phrases. In the present invention, the inventors have created another database entry to locate all these phrases. The first step according to the present invention is to select from the word phrase database in order to look up the meaning of the input sentence. If there is an exactly matched example, the target part of this example is suggested as the translation. Otherwise, the present invention searches for examples in the next database, the translation memory database. Figures 7 and 8 show the results of translations retrieved from the translation memory and the word phrase database.
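One way in which LIKE clauses joined with AND can match words regardless of their order is sketched below using Python's standard `sqlite3` module. The table name `phrases`, its columns `source` and `target`, the helper `find_order_independent` and the sample row are all hypothetical; the description does not disclose the actual schema or queries.

```python
import sqlite3


def find_order_independent(conn, words):
    """Return rows whose source text contains every word, in any order.

    Assumes `words` is non-empty. One LIKE clause is built per word and
    the clauses are joined with AND.
    """
    where = " AND ".join("source LIKE ?" for _ in words)
    params = [f"%{word}%" for word in words]
    sql = f"SELECT source, target FROM phrases WHERE {where}"
    return conn.execute(sql, params).fetchall()


# Hypothetical in-memory phrase table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (source TEXT, target TEXT)")
conn.execute(
    "INSERT INTO phrases VALUES (?, ?)", ("Ice Cream Vanilla", "aiskrim vanila")
)
rows = find_order_independent(conn, ["Vanilla", "Ice", "Cream"])
```

Note that a substring pattern of this kind also matches longer entries and parts of words, so a practical system would likely need an additional check on the returned rows.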

Further to this, the phrase look-up matching is used to find a suggested meaning for a phrase by parsing the phrase into sub-phrases, finding a meaning for these sub-phrases and combining the results to get a final output. The algorithm is based on the bi-section concept described earlier. Figure 9 shows an example of the phrase look-up matching process.
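Taken together, the look-up order described above (word phrase database, then translation memory database, then phrase look-up matching) may be sketched as follows. The function `suggest_translation`, the dictionary-based databases and the `phrase_lookup` callable are hypothetical stand-ins, and the input sentence is assumed to have been filtered and normalized already.

```python
def suggest_translation(sentence, phrase_db, translation_memory, phrase_lookup):
    """Try exact matches in each database in turn, then fall back."""
    for database in (phrase_db, translation_memory):
        target = database.get(sentence)
        if target is not None:
            return target  # exact match: suggest the target part
    return phrase_lookup(sentence)  # no exact match in either database


# Hypothetical data for illustration only.
phrase_db = {"ice cream": "aiskrim"}
translation_memory = {"red car": "kereta merah"}


def fallback(sentence):
    return "phrase look-up: " + sentence
```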

Input occurrence detection and verification is a two-stage process that is used by the phrase look-up algorithm in order to commence the translation process. It does not build the output, but it gives the green light to the basic output construction process and obtains the index value of the matching Sentence Alignment Table (SAT) entry. The first stage is detection, followed by the second stage, which is verification.

Input occurrence detection is the process in which the SAT is searched to find an entry that contains a word sequence that exactly matches the user input. The entry might contain more words than the input, and the input does not need to be at the beginning. For instance, if the input is "big industry" and the SAT has an entry "gaming has become a big industry recently", that entry is considered a match.

Verification is the process that follows detection. In this process, the Word Alignment Table (WAT) is checked for the words that make up the SAT entry. It is important to find a sequence of WAT entries (each of which can be one or a few words) that exactly matches the user's input. By this process the present invention is able to verify that a meaning can be found for the input. For example, if the input is "I like to", the detected SAT entry is "I like to see the sky" and the corresponding WAT entries are "I", "like", "to see" and "the sky", the verification returns false. The input cannot be matched exactly using the WAT entries, because combining the first three WAT entries gives "I like to see", which is not the same as "I like to". The third WAT entry would have to be "to" instead of "to see" in order to verify the occurrence of the input. The SAT entry is dropped and the system starts over again with the detection process. This process keeps repeating until an entry is successfully detected and verified. The entries are stored in an array for further processing. After the verification is completed, the output construction begins.
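The two stages may be illustrated with the following Python sketch, which reuses the examples given above. The function names `detect` and `verify` and the list-based representation of SAT and WAT entries are assumptions made for illustration; the description does not specify how the tables are stored or searched.

```python
def detect(sat_source_words, input_words):
    """Return the index at which the input occurs contiguously, or -1."""
    n, m = len(sat_source_words), len(input_words)
    for start in range(n - m + 1):
        if sat_source_words[start:start + m] == input_words:
            return start
    return -1


def verify(wat_entries, input_words):
    """True if a run of consecutive WAT entries spells exactly the input."""
    for first in range(len(wat_entries)):
        joined = []
        for entry in wat_entries[first:]:
            joined.extend(entry.split())
            if joined == input_words:
                return True
            if len(joined) >= len(input_words):
                break  # overshoot or mismatch: try the next starting entry
    return False


position = detect("gaming has become a big industry recently".split(),
                  ["big", "industry"])
verified = verify(["I", "like", "to see", "the sky"], ["I", "like", "to"])
```

Here `verified` is false because the entry "to see" cannot be split, which is the situation described in the "I like to" example.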

The basic output construction is the primary stage in generating the output. This output is an intermediate output that will be modified by the repetition avoidance stage to get the final output. It is correct in terms of structure, but might have some words repeated many times. In the present invention, the repetition avoidance algorithm is implemented to solve this problem. The output is constructed based on the SAT and WAT entries obtained from the above procedure. It is important to highlight the concept of source-target pairs, which is used throughout the rest of the algorithm. These pairs represent a source word and its target meaning as mentioned in the WAT.

Initially the present invention has an array of source-target pairs obtained from the verification of input occurrence above, which follows the source language sentence structure. Adding the target word for each relevant source word in the source language order will probably not make sense. For example, for the English "red car", using the order from the source gives "merah kereta", whereas the correct translation in the Malay language is "kereta merah". To solve this problem, the target Malay entry in the SAT is selected; it has a correct structure because it is actually a Malay sentence. The SAT entry is taken word by word starting at the beginning; for the first SAT word, the system looks for the pair in the source ordered array whose target word matches it and adds that pair as the first entry in the Malay ordered pair array. The matched pair is deleted from the English ordered array so that said pair is not chosen again. Next, the second word in the SAT entry is selected and the same procedure is repeated for the rest of the words until a target ordered pairs array is achieved.

If there is no meaning or the word is not in the input, the system would just skip to the next word or phrase. The system will keep on taking entries until it is done with the Malay ordered source-target pairs. Now, we have constructed the basic output.
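The reordering of source-target pairs into target language order may be sketched as follows, using the "red car" example. This is a simplified illustration: the function name `reorder_pairs` is hypothetical and each pair is assumed to hold a single target word.

```python
def reorder_pairs(source_ordered_pairs, sat_target_words):
    """Rearrange (source, target) pairs to follow the SAT target sentence."""
    remaining = list(source_ordered_pairs)  # pairs in source language order
    target_ordered = []
    for word in sat_target_words:
        for i, (_source, target) in enumerate(remaining):
            if target == word:
                # Move the pair across so it cannot be chosen again.
                target_ordered.append(remaining.pop(i))
                break
        # SAT words with no matching pair are simply skipped.
    return target_ordered


pairs = [("red", "merah"), ("car", "kereta")]
ordered = reorder_pairs(pairs, ["kereta", "merah"])
```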

The basic output has no problems in structure, but it may contain repeated words. These repeated words are generated because of the way we deal with the source-target pairs. Let us consider this example:

The target word "anda" for the source "you" is repeated 4 times in the output although it is mentioned only once in the input. The SAT has the word "anda" 4 times, and each time the input agrees to include it in the output. We can easily take the first one, but what will happen if the input is "the kind of plants you can grow most successfully"? We need the second "anda" in the output. We resolve this problem using a mathematical model that we have named the inter-phrase word-to-word distance summation.

As already mentioned earlier in this description, this algorithm calculates a summation value that gives a clue as to which word is repeated and needs to be omitted and which one is not.

Table 1 shows the value of $loc_i$ for each of the entries in the basic output.

The underlined "anda" is a repeated "anda". Running the repetition avoidance algorithm first gives the ∑d for each word. Table 2 describes the details of obtaining ∑d for each word. The ∑d is used as the judgment value to omit the extra "anda". Since there are two instances of "you" in the basic output but only one in the input, one of them must be omitted. The row values belonging to the conflicting words are crossed out, and the rest are summed to get the ∑d of each word. The ∑d values in Table 3 are used to select the "you" index with the minimum summation value. The other "you" is not taken into the final output. Figure 10 shows the plot of the summation values versus the index. The thick black dotted line represents the margin between accepted and non-accepted words, i.e. everything below the line is accepted and the rest is not. After removing the dropped words from the basic output, the final output is as follows: "Jika anda memilih suatu program bukan kritikal".
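The selection step may be illustrated with the following Python sketch, which keeps, for each source word, only as many occurrences as appear in the input and prefers those with the smallest ∑d. The function name `drop_repeats`, the tuple layout and the ∑d values are hypothetical; they are not the values of Tables 2 and 3, and the sketch reflects only one reading of the selection rule.

```python
from collections import Counter


def drop_repeats(entries, input_words):
    """Keep the lowest-sigma_d occurrences of each source word.

    `entries` is a list of (source_word, target_word, sigma_d) tuples in
    target language order.
    """
    allowed = Counter(input_words)  # occurrences permitted per source word
    by_source = {}
    for index, (source, _target, sigma_d) in enumerate(entries):
        by_source.setdefault(source, []).append((sigma_d, index))
    keep = set()
    for source, occurrences in by_source.items():
        occurrences.sort()  # smallest summation value first
        keep.update(index for _sigma, index in occurrences[:allowed[source]])
    return [target for index, (_s, target, _d) in enumerate(entries)
            if index in keep]


# Hypothetical summation values for illustration only.
entries = [("if", "jika", 4), ("you", "anda", 3),
           ("choose", "memilih", 5), ("you", "anda", 9)]
final_output = drop_repeats(entries, ["if", "you", "choose"])
```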

There are several further modifications done to the present invention to improve the application and workability of the present invention. The improvements are such as:-

(a) Implementing recursive code instead of the nested loops to speed up the process.

(b) Using a simpler and linear model to implement the repetition avoidance algorithm that chooses one pivot value and calculates the distance to the pivot.

(c) Derivation of second-degree polynomial formulas to estimate the summation. By using them with single loops in the calculation of the inner-sentence word-to-word distance summation instead of nested loops, we can effectively reduce the time.
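The description does not give the polynomial formulas themselves. As one possible illustration of improvements (b) and (c), assume the word locations are the consecutive integers 0 to n and that distances are measured to a single pivot location p. Under those assumptions the sum of |i - p| has the closed form p(p+1)/2 + (n-p)(n-p+1)/2, which is a second-degree polynomial in p and n. The function names below are hypothetical.

```python
def pivot_distance_sum_loop(n, p):
    """Sum of |i - p| for i = 0..n, computed with an explicit loop."""
    return sum(abs(i - p) for i in range(n + 1))


def pivot_distance_sum_closed(n, p):
    """Same sum via a second-degree polynomial (assumes 0 <= p <= n)."""
    left = p * (p + 1) // 2               # distances 1..p to the left
    right = (n - p) * (n - p + 1) // 2    # distances 1..(n - p) to the right
    return left + right


example = pivot_distance_sum_closed(4, 1)  # 1 + 0 + 1 + 2 + 3
```

Replacing the inner loop with such a formula removes one level of nesting when the summation is needed for every candidate pivot.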

In the present invention, the system and method are carried out via hardware such as a computer or any other device of similar capacity, and the system and method can work locally, remotely, over a network or via the internet.

Claims

1. A system to translate a language automatically characterized in that hardware such as a computer is used in the system and wherein the system comprises a sentence level matching technique and a phrase look-up matching technique and wherein the system searches a translation memory, using the sentence level matching technique, to find out whether there are closely matched examples and wherein when a record is found, the translation technique would return the translation suggestion for the input sentence and when no more records are matched, then the phrase look-up matching technique is applied and wherein further to this a bi-section technique would analyze an input sentence by splitting it into two different parts and wherein a first part is from the first word until a splitting point and the second part is to hold the rest of the word/sentence and wherein the longest chunk for the input sentence is searched starting movement from the left portion of the word to the right portion of the word and wherein after obtaining an output from the first part of the original phrase, the second part will act as the original phrase and it will be split further.

2. A system to translate a language automatically as claimed in Claim 1 wherein the system would be repeated and the result for all these sub phrases would be aggregated for the basic output.

3. A system to translate a language automatically as claimed in Claim 1 wherein a repetition avoidance technique is applied thereafter wherein this technique would accumulate and identify the equivalent meaning or repeated words for a certain word and wherein said technique of accumulating and identifying of the similar meaning word is based on the equivalent search for source and target word retrieved from word alignment information database.

4. A system to translate a language automatically as claimed in Claim 1 wherein an inter-phrase word to word distance summation technique is used therein in which a summation value is calculated to give an estimated value on which word is repeated and needs to be omitted and which one is not.

5. A method of language translation automatically characterized in that the method includes a hardware wherein when translating a new sentence, the system first searches the example from word phrase database or translation memory database and wherein the sentence that is input into the system is compared to the source language of the example from the phrase and language database and wherein if there is an exact matched example, then the target part of this example is suggested as the translation and wherein before the input sentence is processed, a module to filter each word in the sentence is invoked and any punctuation, symbol or word space will be extracted and all words would thereafter be merged and wherein by using normalization, two sentences with the same words are matched even though said similar words are in different order.

6. A method of language translation automatically as claimed in Claim 5 wherein when no exact match is obtained the method will search examples from the next database, translation memory database and wherein the phrase look-up matching is used to find suggested meaning for a phrase by parsing the phrase into sub phrases and finding a meaning to these sub phrases and combining the results to get a final output.

7. A method of language translation automatically as claimed in Claim 5 wherein the method further includes an input occurrence detection and a verification process that is used by the phrase look-up algorithm in order to commence the translation process.

8. A method of language translation automatically as claimed in Claim 5 wherein if there is no meaning or the word is not in the input, the system would just skip to the next word or phrase.

9. A system and method as claimed in any of the preceding claims wherein the system and method is conducted via hardware such as a computer or any other device in similar capacity and wherein the system and method could work locally or remotely or over a network or via the internet.