WO2007088538A2 - Method and apparatus for translating utterances - Google Patents

Method and apparatus for translating utterances

Info

Publication number
WO2007088538A2
Authority
WO
WIPO (PCT)
Prior art keywords
grammar
dataset
target
utterance
source
Prior art date
Application number
PCT/IL2007/000123
Other languages
French (fr)
Other versions
WO2007088538A3 (en)
Inventor
David Horn
Eytan Ruppin
Shimon Edelman
Zach Solan
Original Assignee
Ramot At Tel Aviv University Ltd.
Cornell Research Foundation, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ramot At Tel Aviv University Ltd. and Cornell Research Foundation, Inc.
Publication of WO2007088538A2
Publication of WO2007088538A3


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/55 - Rule-based translation

Definitions

  • the present invention relates generally to the field of language processing and, more particularly, but not exclusively, to the translation of language utterances for use in automated systems, such as machine translation systems, data transformation systems, speech recognition systems, and the like.
  • Speech recognition is the automated processing of verbal input. This permits a person to converse with a machine (e.g., a computer system), thereby foregoing the need for laborious input devices such as keyboards.
  • Machine translation is the automated process of translating one language (a source language) into another language (a target language).
  • Given a lexicon of tokens, a language is a (possibly infinite) set of sequences of these tokens.
  • the tokens are the words of the language.
  • a language is defined by a grammar describing a syntactic structure which is a breakdown of sequences and a description of how those sequences combine into larger sequences.
  • the grammar is not unique to the language in the sense that a number of different grammars can define the same language.
  • the grammar consists of sets of symbols and a set of production rules.
  • the sets of symbols generally include the lexicon and an additional set of symbols which represents other objects such as collections of lexicon entries.
  • the production rules indicate how a symbol may be successively replaced by substituting other symbols for it.
  • For example, in natural languages an utterance can be a phrase or a sentence, and in a computer programming language an utterance can be a program statement.
  • a grammar G of a language is a four-tuple G = (V_N, V_T, P, S), where V_N is a set of symbols called "nonterminals," V_T is a set of symbols called "terminals," P is a set of production rules, and S is a symbol called a "start" symbol.
  • the elements of V_T are the lexicon entries of the language. They are called "terminals" because they do not have to be replaced to form a valid utterance.
  • the elements of V_N are "nonterminals" in that they indicate the need for further replacement in order to complete the production of the utterance.
  • the nonterminals thus represent purely abstract objects, which do not appear in any utterance in the language, but rather map to production rules (elements of P) which indicate how to replace the nonterminals with terminals or with other nonterminals.
  • Each production rule consists of a left hand side sequence called a "name”, and a right hand side sequence called a "body”.
  • a production rule is applied to sequences and is interpreted as a left-to-right replacement rule.
  • An utterance generating machine repeatedly applies the production rules P until it has replaced all the nonterminals such as to form a valid utterance consisting of terminals only.
  • the start symbol S is the initial symbol which starts the production of the utterance.
  • Replacing a symbol or a sequence of symbols is done by finding the production rule that has that symbol or sequence of symbols as its rule name, and putting the rule body (which may include one or more elements of V_N or V_T) in its place. The result may be that more nonterminals have been inserted by the application of the rule, so the process repeats. Eventually, no elements of V_N are left and the result is the textual statement.
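To make the formalism concrete, here is a minimal runnable sketch, in Python, of an utterance-generating machine of the kind described above. The toy grammar is invented for illustration and is not taken from the patent.

```python
import random

# Toy grammar G = (V_N, V_T, P, S). Keys of P are nonterminal rule
# names; each maps to a list of alternative rule bodies. Any symbol
# not appearing as a key is a terminal.
P = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "N":  [["cat"], ["dog"]],
    "V":  [["sees"], ["chases"]],
}

def generate(symbol="S"):
    """Repeatedly apply production rules until only terminals remain."""
    if symbol not in P:                 # a terminal needs no replacement
        return [symbol]
    body = random.choice(P[symbol])     # replace the name with one body
    out = []
    for sym in body:                    # newly inserted nonterminals
        out.extend(generate(sym))       # are expanded in turn
    return out

print(" ".join(generate()))  # e.g. "the cat chases the dog"
```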
  • Natural language processing methods have been developed for grammar induction in which a representation of a grammar is induced from a corpus of text.
  • statistical grammar induction methods aim to identify the most probable grammar for given training data [K. Lari and S. J. Young, "The estimation of stochastic context-free grammars using the Inside-Outside algorithm," Computer Speech and Language, 4:35-56, 1990; F. Pereira and Y. Schabes, "Inside-Outside reestimation from partially bracketed corpora," in Annual Meeting of the ACL, 128-135, 1992].
  • a large body of work in natural language processing has been devoted to techniques that induce a grammar so as to learn, recognize and/or generalize text corpora.
  • a method of constructing translation rules using a first dataset of sentences in a source language and a second dataset of sentences in a target language. The translation rules are for translating utterances of the source language into utterances of the target language.
  • the method comprises: acquiring at least a partial source grammar characterizing the source language and at least a partial target grammar characterizing the target language; and, for each sentence of the first dataset, generating a mapping function for mapping an ordered set of source grammar symbols being associated with the sentence to an ordered set of target grammar symbols, thereby providing a set of mapping functions.
  • the method further comprises archiving the set of mapping functions in a database, thereby constructing the translation rules.
  • the mapping functions are initialized by a machine readable dictionary.
  • the method further comprises archiving the at least partial source grammar and the at least partial target grammar in the database.
  • the generation of the mapping function comprises, for each dataset of the first and second datasets, parsing each sentence of the dataset to provide an ordered set of symbols covering the sentence, and updating the mapping functions based on the ordered set of symbols.
  • the method further comprises constructing a source probability matrix from the source grammar and a target probability matrix from the target grammar, each probability matrix comprising a plurality of entries representing co-occurrence probabilities of symbols in a respective grammar.
  • the method further comprises, for each dataset of the first and second datasets, parsing each sentence of the dataset to provide an ordered set of symbols covering the sentence, and updating a respective probability matrix using the ordered set of symbols.
  • the method further comprises updating the set of mapping functions using the probability matrices.
  • each of the source grammar and the target grammar independently comprises: terminals being associated with tokens of a lexicon characterizing the dataset and nonterminals being associated with equivalence classes of tokens of the lexicon and/or significant patterns of a respective dataset.
  • the acquiring of the source grammar and the target grammar comprises, for each sentence of the first dataset and for each sentence of the second dataset, searching for partial overlaps between the sentence and other sentences of the respective dataset, applying a significance test on the partial overlaps, and defining a most significant partial overlap as a significant pattern of the sentence, thereby extracting significant patterns from the first and the second datasets, thereby acquiring nonterminals for the source grammar and the target grammar.
  • the acquiring of the source grammar and the target grammar comprises: for each dataset of the first dataset and the second dataset: searching over the dataset for similarity sets, each similarity set comprises a plurality of segments of size L having L-S common tokens and S uncommon tokens, each of the plurality of segments being a portion of a different sentence of the dataset; and defining a plurality of equivalence classes corresponding to uncommon tokens of at least one similarity set, thereby acquiring nonterminals for the source grammar and the target grammar.
  • the definition of the plurality of equivalence classes comprises, for each segment of each similarity set: extracting a significant pattern corresponding to a most significant partial overlap between the segment and other segments or combination of segments of the similarity set, thereby providing, for each similarity set, a plurality of significant patterns; and using the plurality of significant patterns for classifying tokens of the similarity set into at least one equivalence class; thereby defining the plurality of equivalence classes.
  • the classification of the tokens comprises, selecting a leading significant pattern of the similarity set, and defining uncommon tokens of segments corresponding to the leading significant pattern as an equivalence class.
  • the method further comprises, prior to the search for the similarity sets: extracting a plurality of significant patterns from the dataset, each significant pattern of the plurality of significant patterns corresponding to a most significant partial overlap between one sentence of the dataset and other sentences of the dataset; and for each significant pattern of the plurality of significant patterns, grouping at least a few tokens of the significant pattern, thereby redefining the dataset.
  • the method further comprises, for each similarity set having at least one equivalence class, grouping at least a few tokens of the similarity set thereby redefining the dataset.
  • the method further comprises for each sentence, searching over the sentence for tokens being identified as members of previously defined equivalence classes, and attributing a respective equivalence class to each identified token, thereby acquiring additional nonterminals.
  • the attribution of the respective equivalence class to the identified token is subjected to a generalization test.
  • the generalization test comprises determining a number of different sentences having tokens being identified as other elements of the respective equivalence class, and if the number of different sentences is larger than a predetermined generalization threshold, then attributing the respective equivalence class to the identified token.
  • the attribution of the respective equivalence class to the identified token is subjected to a significance test.
  • the significance test comprises: for each sentence having elements of the respective equivalence class, searching for partial overlaps between the sentence and other sentences having elements of the respective equivalence class, and defining a most significant partial overlap as a significant pattern of the sentence, thereby extracting a plurality of significant patterns; selecting a leading significant pattern of the plurality of significant patterns; and if the leading significant pattern includes the identified token, then attributing the respective equivalence class to the identified token.
  • the method further comprises constructing a graph having a plurality of paths representing the dataset, wherein each extraction of significant pattern is by searching for partial overlaps between paths of the graph.
  • the method further comprises calculating, for each path, a set of probability functions characterizing the partial overlaps.
  • the most significant partial overlap is determined by a significance test being performed by evaluating a statistical significance of the set of probability functions.
  • the apparatus comprises: a grammar acquirer, for acquiring at least a partial source grammar characterizing the source language and at least a partial target grammar characterizing the target language; a mapping function generator for generating, for each sentence of the first dataset, a mapping function for mapping an ordered set of source grammar symbols being associated with the sentence to an ordered set of target grammar symbols, thereby to provide a set of mapping functions; and an archiving unit for archiving the set of mapping functions in a database.
  • the mapping function generator is operable to generate the mapping function using a machine readable dictionary.
  • the archiving unit is operable to archive the at least partial source grammar and the at least partial target grammar in the database.
  • the mapping function generator comprises a parser for parsing each sentence of the datasets to provide an ordered set of symbols covering the sentence, and a mapping function update unit for updating the set of mapping functions using the ordered set of symbols.
  • the apparatus further comprises, a probability matrix constructor, for constructing a source probability matrix from the source grammar and a target probability matrix from the target grammar, each probability matrix comprises a plurality of entries representing co-occurrence probabilities of symbols in a respective grammar.
  • the mapping function generator comprises a parser for parsing each sentence of the datasets to provide an ordered set of symbols covering the sentence, and wherein the probability matrix constructor is operable to update the probability matrices using the ordered set of symbols.
  • the mapping function generator comprises a mapping function update unit for updating the set of mapping functions using the probability matrices.
  • each of the set of source grammar symbols and the set of target grammar symbols independently comprises: terminals being associated with tokens of a lexicon characterizing the dataset and nonterminals being associated with equivalence classes of tokens of the lexicon and/or significant patterns of a respective dataset.
  • the grammar acquirer comprises: a searcher, for searching, for each sentence of the first dataset and for each sentence of the second dataset, partial overlaps between the sentence and other sentences of the respective dataset; a testing unit, for applying a significance test on the partial overlaps; and a definition unit, for defining a most significant partial overlap as a significant pattern of the sentence, thereby to acquire nonterminals for the source grammar and the target grammar.
  • the grammar acquirer comprises: a searcher, for searching over each dataset of the first dataset and the second dataset for similarity sets, each similarity set comprises a plurality of segments of size L having L-S common tokens and S uncommon tokens, each of the plurality of segments being a portion of a different sentence of the respective dataset; and a definition unit, for defining a plurality of equivalence classes corresponding to uncommon tokens of at least one similarity set, thereby to acquire nonterminals for the source grammar and the target grammar.
  • the apparatus further comprises an extractor, capable of extracting, for a given set of sentences, a significant pattern corresponding to a most significant partial overlap between one sentence of the given set and other sentences of the given set, thereby providing, for the given set of sentences, a plurality of significant patterns.
  • the given set of sentences is a similarity set, hence the plurality of significant patterns corresponds to the similarity set.
  • the definition unit comprises a classifier, capable of classifying tokens of the similarity set into at least one equivalence class using the plurality of significant patterns.
  • the classifier is designed for selecting a leading significant pattern of the similarity set, and defining uncommon tokens of segments corresponding to the leading significant pattern as an equivalence class.
  • the given set of sentences is the dataset, hence the plurality of significant patterns corresponds to the dataset.
  • the apparatus further comprises a first grouper for grouping at least a few tokens of each significant pattern of the plurality of significant patterns.
  • the apparatus further comprises a second grouper, for grouping at least a few tokens of each similarity set having at least one equivalence class.
  • the apparatus further comprises a second definition unit having a second searcher, for searching over each sentence for tokens being identified as members of previously defined equivalence classes, wherein the second definition unit is designed to attribute a respective equivalence class to each identified token.
  • the apparatus further comprises a constructor, for constructing a graph having a plurality of paths representing the dataset.
  • the extractor is designed to search for partial overlaps between paths of the graph.
  • the graph comprises a plurality of vertices, each representing one token of the lexicon, and further wherein each path of the plurality of paths comprises a sequence of vertices respectively corresponding to one sentence of the dataset.
  • the apparatus further comprises electronic-calculation functionality for calculating, for each path, a set of probability functions characterizing the partial overlaps.
  • the extractor comprises a testing unit capable of evaluating a statistical significance of the set of probability functions.
  • a method of translating an utterance from a source language to a target language comprises: employing a structured stochastic language model so as to generate from the utterance a plurality of candidate utterances in the target language, and so as to assign a score to each candidate utterance of the plurality of candidate utterances; and selecting a candidate utterance having an optimal score, thereby translating the utterance from the source language to the target language.
  • the method further comprises accessing a database to obtain at least a partial source grammar characterizing the source language and at least a partial target grammar characterizing the target language, wherein the generation of the plurality of candidate utterances is based on the at least partial grammars.
  • the method further comprises, prior to the generation of the plurality of candidate utterances: accessing the database to obtain translation rules defined by a set of mapping functions for mapping ordered sets of source grammar symbols to ordered sets of target grammar symbols; parsing the utterance to obtain an ordered set of source grammar symbols covering the utterance, thereby providing an utterance cover in the at least partial source grammar; using the translation rules for mapping the utterance cover to an ordered set of target grammar symbols; and assigning prior probabilities in the structured stochastic language model for each target grammar symbol of the grammar symbols set.
  • the method further comprises prior to the selection of the candidate utterance, processing the plurality of candidate utterances according to additional ranking criteria.
  • the processing comprises, for each candidate utterance of the plurality of candidate utterances: parsing the candidate utterance to obtain an ordered set of target grammar symbols covering the candidate utterance, thereby providing a candidate utterance cover in the at least partial target grammar; and performing correspondence analysis to provide goodness of correspondence between the candidate utterance cover in the at least partial target grammar and the utterance cover in the at least partial source grammar.
  • apparatus for translating an utterance from a source language to a target language comprises: a candidate utterance generator operable to employ a structured stochastic language model to generate from the utterance a plurality of candidate utterances in the target language, and to assign a score to each candidate utterance of the plurality of candidate utterances; and an optimizer, for selecting from the plurality of candidate utterances, a candidate utterance having an optimal score.
  • the apparatus further comprises an input unit configured for accessing a database to obtain at least a partial source grammar characterizing the source language and at least a partial target grammar characterizing the target language, wherein the candidate utterance generator is configured to generate the plurality of candidate utterances based on the at least partial grammars.
  • the input unit is further configured for accessing the database to obtain translation rules defined by a set of mapping functions for mapping ordered sets of source grammar symbols to ordered sets of target grammar symbols.
  • the apparatus further comprises: a parser, configured for parsing the utterance to obtain an ordered set of source grammar symbols covering the utterance, thereby to provide an utterance cover in the at least partial source grammar; a mapping unit configured for mapping the utterance cover to an ordered set of target grammar symbols using the translation rules; and a probability assigner, configured for assigning prior probabilities in the structured stochastic language model for each target grammar symbol of the grammar symbols set.
  • the apparatus further comprises a candidate utterance processing unit for processing the plurality of candidate utterances according to thematic criteria.
  • the parser is further configured to parse each candidate utterance to obtain an ordered set of target grammar symbols covering the candidate utterance, thereby to provide a candidate utterance cover in the at least partial target grammar.
  • the candidate utterance processing unit is configured for performing correspondence analysis to provide goodness of correspondence between the candidate utterance cover in the at least partial target grammar and the utterance cover in the at least partial source grammar.
  • the probability assigner is further configured for obtaining conditional probabilities representing a discourse context of the utterance and replacing the prior probabilities with the conditional probabilities.
  • a text processing system having a translator
  • the translator comprises the translation apparatus described herein.
  • a text processing system having a style checker
  • the style checker comprises the translation apparatus described herein.
  • a voice command and control system comprises a voice input unit, an appliance and the translation apparatus described herein.
  • the voice input unit is operable to receive a voice command in the source language and to convert the voice command to an utterance recognizable by the apparatus, and the apparatus is configured to translate the utterance into a target language recognizable by the appliance.
  • Implementation of the method and system of the present invention involves performing or completing selected tasks or steps manually, automatically, or a combination thereof.
  • several selected steps could be implemented by hardware or by software on any operating system of any firmware or a combination thereof.
  • selected steps of the invention could be implemented as a chip or a circuit.
  • selected steps of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system.
  • selected steps of the method and system of the invention could be described as being performed by a data processor, such as a computing platform for executing a plurality of instructions.
  • FIG. 1 is a flowchart diagram of the method suitable for constructing translation rules according to various exemplary embodiments of the present invention
  • FIG. 2 is a schematic illustration of an apparatus for constructing translation rules, according to various exemplary embodiments of the present invention
  • FIG. 3 is a flowchart diagram of a method for extracting significant patterns from a dataset, according to various exemplary embodiments of the present invention
  • FIGs. 4a-b are simplified illustrations of a structured graph (Figure 4a) and a random graph (Figure 4b), according to various exemplary embodiments of the present invention
  • FIG. 5 illustrates a representative example of a portion of a graph with a search-path going through five vertices, according to various exemplary embodiments of the present invention
  • FIG. 6 illustrates a pattern-vertex having three vertices which are identified as a significant pattern of the trial path of Figure 5, according to various exemplary embodiments of the present invention
  • FIG. 7 is a flowchart diagram of a method for defining equivalence classes, according to various exemplary embodiments of the present invention
  • FIG. 8a is a schematic illustration of a portion of a graph constructed for a corpus of text in which the tokens are words, according to various exemplary embodiments of the present invention
  • FIG. 8b illustrates a generalized-vertex, defined for a similarity set having an equivalence class, according to various exemplary embodiments of the present invention
  • FIG. 9a illustrates a portion of a graph in which an equivalence class is attributed to vertices identified as elements thereof, according to various exemplary embodiments of the present invention
  • FIG. 9b illustrates an additional step of the method in which once a particular path has been supplemented by an additional equivalence class, the graph or a portion thereof is rewired, by defining a generalized- vertex including the existing equivalence class and the newly attributed equivalence class, according to various exemplary embodiments of the present invention
  • FIG. 9c illustrates the additional step of Figure 9b, with an optional modification in which the generalized-vertex also includes other vertices within a predetermined window, according to various exemplary embodiments of the present invention
  • FIGs. 10a-d illustrate nested relationships between significant patterns and equivalence classes in a tree format, according to various exemplary embodiments of the present invention
  • FIG. 11 is a schematic illustration of a grammar acquirer which can be used in the apparatus of Figure 2, according to various exemplary embodiments of the present invention
  • FIG. 12 is a flowchart diagram of a method suitable for translating an utterance from a source language to a target language, according to various exemplary embodiments of the present invention
  • FIG. 13 is a schematic illustration of apparatus suitable for translating an utterance from a source language to a target language, according to various exemplary embodiments of the present invention.
  • FIG. 14 is a schematic illustration of a translation of an utterance sentence from English to Chinese, according to various exemplary embodiments of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • the present embodiments comprise a method and apparatus which can be used for translation. Specifically, the present embodiments can be used to construct translation rules from a source language to a target language. The present embodiments can further be used for translating an utterance from a source language to a target language.
  • the method of the present embodiments can be embodied in many forms. For example, it can be embodied on a tangible medium such as a computer for performing the method steps. It can be embodied on a computer readable medium, comprising computer readable instructions for carrying out the method steps. It can also be embodied in an electronic device having digital computer capabilities arranged to run the computer program on the tangible medium or execute the instructions on a computer readable medium.
  • computer programs implementing the method can commonly be distributed to users on a distribution medium such as, but not limited to, a floppy disk or CD-ROM. From the distribution medium, the computer programs can be copied to a hard disk or a similar intermediate storage medium. The computer programs can be run by loading the computer instructions either from their distribution medium or their intermediate storage medium into the execution memory of the computer, configuring the computer to act in accordance with the method of this invention. All these operations are well-known to those skilled in the art of computer systems.
  • a method and apparatus according to the present embodiments can be used in many applications.
  • a method and apparatus according to the present embodiments can be used to translate a body of text in one language, referred to herein as the source language, into a body of text in another language, referred to herein as the target language, in an intelligent phrasebook running on a personal digital assistant or a cellular telephone, for use by tourists, government officials, etc.
  • a method and apparatus according to the present embodiments can also be used in automatic translation of online material, e.g., in frequently updated digital content, such as web feed format news feeds and the like.
  • a method and apparatus according to the present embodiments can also be incorporated into chatrooms and other instant messaging software to provide automatic translation thereto.
  • the present embodiments can also be operating in a monolingual mode.
  • a method and apparatus according to the present embodiments can be incorporated in text processing software to verify, correct or suggest alternative idioms. This is particularly useful for helping non-native speakers to ensure that their language is idiomatically correct.
  • a method and apparatus according to the present embodiments can also be employed in an automatic learning system, whereby the user provides an utterance and the system verifies whether or not, or to what extent, the utterance is grammatically correct.
  • the present embodiments can also be used to provide automatic translation of an input stream in a source natural language into a semantically equivalent target stream of executable commands, e.g., in voice command and control systems.
  • the translation of the present embodiments is also useful when the utterances in the source natural language are syntactically out-of-grammar but semantically accurate.
  • the method and apparatus of the present embodiments can be used as add-ons to existing voice command and control systems to allow the systems to execute commands in response to freely spoken utterances. For example, in a voice command and control system controlling a computer, a user can provide the utterance "make a hard copy" when the pre-defined wording of the command is "print.”
  • the present embodiments can also be used for converting one formal language to the other.
  • the present embodiments can convert a computer program written in a third-, fourth- or fifth-generation programming language to another computer program in a lower generation programming language.
  • the method and apparatus for constructing translation rules uses two datasets: a first dataset which includes sentences defined over a lexicon of tokens in the source language, and a second dataset which includes sentences defined over a lexicon of tokens in the target language.
  • the two datasets are preferably parallel matched corpora, e.g., two translations of a given set of texts.
  • the tokens of any of the lexicons can be of any type known in the art of language processing, include, without limitation, natural language words, computer programming language statements, machine executable commands, and the like. In various exemplary embodiments of the invention the tokens are natural language words.
  • the source and target languages are preferably, but not obligatorily, different languages, either of the same type (e.g., both natural languages) or of different types (e.g., the source language is a natural language and the target language is a machine command language). In embodiments in which a monolingual operation mode is employed, the source and target languages are identical.
  • an utterance refers to any sequence of tokens from the lexicon of the respective language.
  • an utterance can refer to a complete sentence (e.g., a sequence of words which includes a subject and a verb, as in most natural languages), a sentence fragment (e.g., a sequence of words which lacks a subject or a verb), a phrase, a concatenation of phrases, a statement and the like.
  • an utterance can refer to a sequence of computer program statements, such as, but not limited to, a program block.
  • FIG. 1 is a flowchart diagram of the method suitable for constructing translation rules according to various exemplary embodiments of the present invention. It is to be understood that, unless otherwise defined, the method steps described hereinbelow can be executed either contemporaneously or sequentially in many combinations or orders of execution. Specifically, the ordering of the flowchart diagrams is not to be considered as limiting. For example, two or more method steps, appearing in the following description or in the flowchart diagrams in a particular order, can be executed in a different order (e.g., a reverse order) or substantially contemporaneously. Additionally, several method steps described below are optional and may not be executed.
  • the method begins at step 10 and continues to step 11 in which at least a partial source grammar G^A characterizing the source language A and at least a partial target grammar G^B characterizing the target language B are acquired.
  • the grammars are in the form of sets of grammar symbols (a set of terminals and a set of nonterminals) and a set of production rules, as further explained in the Background section above.
  • the grammars can be partial in the sense that the set of terminals can include only a part of the lexicon entries (e.g., those tokens that actually exist in the respective dataset, or even a portion thereof).
  • the term "grammar” as used herein refers to a grammar or a partial grammar.
  • the grammars can be read from a grammar database or can be acquired directly from the two datasets using any method known in the art.
  • the grammars are context free grammars. Methods for acquiring a context free grammar from a dataset which are suitable for the present embodiments are found in International Patent Application No. PCT/IL2004/000704, U.S. Patent Nos. 6,836,760, 6,957,184 and 7,155,392 and U.S. Patent Application Publication No. 20060085193, the contents of which are hereby incorporated by reference.
  • a preferred technique for acquiring a context free grammar is provided hereinunder.
  • the method continues to step 12 in which a set T of mapping functions is generated.
  • Step 12 is executed by generating, for each sentence of the first dataset, a mapping function for mapping an ordered set of source grammar symbols which are associated with the sentence to an ordered set of target grammar symbols.
  • the mapping functions can be generated using a machine readable dictionary, but, more preferably, the mapping functions are initialized by the machine readable dictionary and updated thereafter.
  • each sentence s^A of the first dataset is parsed using the source grammar G^A to provide an ordered set L^A ⊆ G^A of source grammar symbols covering the sentence s^A.
  • the parsing is performed by a parser which receives s^A and G^A and determines how s^A can be generated from G^A.
  • the result of the parsing is the ordered set L^A.
  • each sentence s^B of the second dataset is parsed using the target grammar G^B to provide an ordered set L^B ⊆ G^B of target grammar symbols covering the sentence s^B.
  • the ordered sets L^A and L^B preferably represent grammatical structures in the respective grammar.
  • the mapping functions are then updated based on the obtained ordered set.
  • T is updated such that T(a_j) → b_k, where a_j is a symbol of L^A and b_k is a symbol of L^B.
  • the update of T is based on the grammars.
  • the mapping functions preferably map between grammatical structures and not necessarily individual grammar symbols.
  • the mapping functions provide a probabilistically weighted set of candidate grammar structures.
  • a further update of T is preferably performed by means of estimating the optimal correspondence between points in the two datasets.
  • the advantage of such an update is that the use of correspondence information allows enforcing proper mapping of thematic relations. Additionally, the use of correspondence information allows linking both the associations of patterns in the two languages and the corresponding slots of patterns. Thus, when more than one possible mapping exists from the set L^A to the set L^B, the mapping that preserves the correspondence receives a higher likelihood as the correct mapping.
  • each grammar is represented as a metric space characterized by a distance function between every two tokens of the dataset.
  • the value of the distance function is the length of the minimal "geodesic" of the metric space which passes through the two symbols.
  • the optimal correspondence between points in the two metric spaces is ultimately estimated using the distance spectra in each space.
  • the procedure can be better understood by considering the following intuitive example: consider a map of the U.S. on which point-like locations of three cities, say Boston, New York and Los Angeles, are marked, but not named. Suppose further that the information about the relative distances among these cities is available. The latter is referred to as "the distance spectrum" of the point configuration. A second distance spectrum can also be computed for the map. By assigning to each city on the map its correct name, the two distance spectra can be optimally matched.
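A minimal sketch of the distance-spectrum idea using the city example; the distances are rounded, made-up figures and the brute-force search over permutations is an illustrative simplification, not the patent's estimation procedure.

```python
from itertools import permutations

cities = ["Boston", "New York", "Los Angeles"]
# Distance spectrum of the named configuration (rough air miles).
D_named = [[0, 190, 2600],
           [190, 0, 2450],
           [2600, 2450, 0]]
# Distance spectrum of the unnamed map points, in unknown order.
D_map = [[0, 2450, 190],
         [2450, 0, 2600],
         [190, 2600, 0]]

def mismatch(perm):
    # Total disagreement if map point i is named cities[perm[i]].
    return sum(abs(D_map[i][j] - D_named[perm[i]][perm[j]])
               for i in range(3) for j in range(3))

best = min(permutations(range(3)), key=mismatch)
for i, c in enumerate(best):
    print(f"map point p{i} -> {cities[c]}")
# p0 -> New York, p1 -> Los Angeles, p2 -> Boston (mismatch 0)
```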
  • the principle of distance spectra is preferably used for estimating the optimal set of mapping functions T, using the inter-symbol distances in the two datasets.
  • a source probability matrix is preferably constructed from the source grammar
  • a target probability matrix is preferably constructed from the target grammar, where the entries of the probability matrices represent co-occurrence probabilities of symbols in a respective grammar.
  • the probability matrices are updated once the sets L^A and L^B are known.
  • the source probability matrix P(a_i, a_j) has a probability entry representing the "geodesic distance" between the symbols a_i and a_j.
  • the set T of mapping functions is preferably updated using the probability matrices.
  • the entries in the probability matrices serve as prior probabilities for grammatical structures.
  • Given an ordered set of source grammar symbols, the method can determine all the grammatical structures which are associated with the set, and use the source probability matrix to assign a prior to each structure. A similar procedure can be employed for the target grammar. The priors associated with the grammatical structures of each grammar can then be used for updating the mapping functions.
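A minimal sketch of how co-occurrence entries might serve as priors over candidate grammatical structures; the symbols, the probability values and the product-of-adjacent-pairs scoring are all assumptions made for illustration.

```python
# Hypothetical co-occurrence probabilities between source grammar
# symbols, standing in for entries of the source probability matrix.
P_src = {("NP", "VP"): 0.30, ("NP", "PP"): 0.05, ("VP", "PP"): 0.10}

def structure_prior(structure):
    """Prior of an ordered set of symbols, taken here as the product
    of co-occurrence probabilities of adjacent symbol pairs."""
    prior = 1.0
    for a, b in zip(structure, structure[1:]):
        prior *= P_src.get((a, b), 1e-6)   # small floor for unseen pairs
    return prior

candidates = [("NP", "VP"), ("NP", "PP"), ("NP", "VP", "PP")]
for c in sorted(candidates, key=structure_prior, reverse=True):
    print(c, structure_prior(c))   # highest-prior structure first
```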
  • From step 12 the method continues to step 13, in which the set T and optionally the grammars G^A and G^B are archived in a database.
  • the method ends at step 15.
  • Exemplary pseudo source code for implementing the method on a data processor is provided in the Examples section that follows.
  • Apparatus 20 can be used to execute selected steps of the method described above and illustrated in the flowchart diagram of Figure 1.
  • Apparatus 20 preferably comprises a grammar acquirer 22 which acquires grammars G^A and G^B.
  • Grammar acquirer 22 can acquire the grammars either by accessing a grammar database or directly from the two datasets generally shown at 36.
  • a preferred configuration of acquirer 22 is provided hereinunder (see Figure 11 and the accompanying description).
  • Apparatus 20 further comprises a mapping function generator 24 which provides the set T of mapping functions, by generating, for each sentence of the first dataset, a mapping function for mapping an ordered set of source grammar symbols associated with the sentence to an ordered set of target grammar symbols.
  • Generator 24 can generate the set T using a machine readable dictionary.
  • Generator 24 can also generate the set by parsing the sentences in the datasets.
  • generator 24 preferably comprises a parser 26 which parses each sentence according to the respective grammar, as described above, and an update unit 34 which updates the set T according to the sets L^A and L^B.
  • generator 24 communicates with a probability matrix constructor 28 which constructs the probability matrices as further detailed hereinabove.
  • generator 24 receives the matrices from constructor 28, and activates unit 34 to update T based on the matrices.
  • Apparatus 20 further comprises an archiving unit 30 for archiving the set T and optionally the grammars G^A and G^B in a database 32.
  • the procedure is employed on each of the two datasets to acquire the grammars G^A and G^B.
  • the procedure involves the extraction of significant patterns from the dataset, followed by the definition of equivalence classes of tokens of the datasets.
  • the nonterminals of the grammar are defined as significant patterns and/or equivalence classes of the dataset. The description below is by way of two methods, 200 and 210, whereby method step 11 above comprises the successive execution of methods 200 and 210.
  • Method 200 is described first with reference to the flowchart diagram of Figure 3 and method 210 is described hereinafter with reference to the flowchart diagram of Figure 7.
  • Method 200 comprises the following method steps, which are illustrated in the flowchart of Figure 3.
  • Method 200 begins at step 201 and continues to step 202 in which overlaps between sentences (i.e., sequences of tokens of the dataset) are searched, considering each sentence of the dataset as a "trial-sequence" which is compared, segment by segment, to all other sentences.
  • Such a graph may include a plurality of vertices and paths of vertices, where each vertex represents one token of the lexicon and each path of vertices represents a sentence of the dataset.
  • for a lexicon of n tokens (say, n different words), the n vertices are connected thereamongst by edges, preferably directed edges, in many combinations, depending on the sentences of the raw dataset on which the method of the presently preferred embodiment is applied.
  • each path of the graph is preferably marked, e.g., by adding marking vertices, such as a "begin" vertex before its first vertex and an "end” vertex after its last vertex.
  • marking vertices represent the beginning and end of the respective sentence of the dataset.
  • the tokens are words of a natural language
  • the "begin” and "end” vertices can be interpreted as regular expression tokens which are typically used by text editors to locate the endpoints of a sentence.
  • each vertex which represents a token has at least one incoming path and at least one outgoing path, preferably an equal number of incoming and outgoing paths.
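A minimal sketch of the graph construction just described, assuming sentences are given as token lists; the toy dataset is invented.

```python
from collections import defaultdict

dataset = [["the", "cat", "sat"],
           ["the", "dog", "sat"],
           ["the", "cat", "ran"]]

# Each sentence becomes a path of vertices flanked by the "begin"
# and "end" marking vertices; each distinct token is one vertex.
paths = [["begin"] + s + ["end"] for s in dataset]

# Directed edges, each labeled with the paths traversing it.
edges = defaultdict(set)
for idx, path in enumerate(paths):
    for a, b in zip(path, path[1:]):
        edges[(a, b)].add(idx)

for (a, b), ids in sorted(edges.items()):
    print(f"{a} -> {b}: paths {sorted(ids)}")
```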
  • the constructed graph is not a random graph. Rather, the graph represents the structure of the dataset with the appearance of bundles of sub-paths, signifying a relatively high probability associated with a given sub-structure which can be identified as a motif.
  • Figures 4a-b show simplified illustrations of a structured graph (Figure 4a) and a random graph (Figure 4b).
  • Method 200 preferably continues to step 203 in which a significance test is applied on the partial overlaps which are obtained in step 202.
  • significance tests are known in the art and can include, for example, statistical evaluation of flow quantities, such as, but not limited to, probability functions or conditional probability functions which characterize the partial overlaps between paths on the graph.
  • a set of probability functions is defined using the number of paths connecting particular vertices on the graph. For example, considering a single vertex, e1, on the graph, a probability, p(e1), can be defined as the number of paths leaving e1 divided by the total number of paths. Similarly, considering two vertices, e1 and e2, a (conditional) probability, p(e2|e1), can be defined as the number of paths leading from e1 to e2 divided by the total number of paths leaving e1.
  • This prescription is preferably applied to all combinations of vertices on the graph, defining, e.g., p(e1), p(e2|e1) and p(e3|e1 e2) for paths leaving e1 and going through e2 and e3, and the corresponding leftward counterparts for paths traversed in the opposite direction.
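A minimal sketch of these probability functions, computed by counting paths on the toy graph above; counting consecutive sub-sequences of full paths stands in for traversing the graph itself.

```python
def count_through(paths, seq):
    """Number of paths containing `seq` as a consecutive run of vertices."""
    n = len(seq)
    return sum(any(p[i:i + n] == seq for i in range(len(p)))
               for p in paths)

paths = [["begin", "the", "cat", "sat", "end"],
         ["begin", "the", "dog", "sat", "end"],
         ["begin", "the", "cat", "ran", "end"]]

# p(e1): paths leaving e1, out of all paths.
p1 = count_through(paths, ["the"]) / len(paths)
# p(e2|e1): paths going from e1 to e2, out of paths leaving e1.
p2 = count_through(paths, ["the", "cat"]) / count_through(paths, ["the"])
# p(e3|e1 e2): paths through e1 e2 e3, out of paths through e1 e2.
p3 = (count_through(paths, ["the", "cat", "sat"])
      / count_through(paths, ["the", "cat"]))
print(p1, p2, p3)  # 1.0 0.666... 0.5
```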
  • the graph can define a Markov model.
  • a "search-path" of length K, going through vertices e1 e2 … eK on the graph (corresponding to a trial-sequence of K tokens of the dataset), can be used to define a variable order Markov model up to order K, represented by the following matrix:
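The matrix itself appears to have been lost in extraction. A plausible reconstruction, inferred from the surrounding definitions (successively longer conditional probabilities, with an m × m diagonal sub-matrix yielding a lower-order model, as noted below) and therefore an assumption rather than the patent's verbatim formula, is the triangular array

$$
M = \begin{pmatrix}
p(e_1) & p(e_2 \mid e_1) & p(e_3 \mid e_1 e_2) & \cdots & p(e_K \mid e_1 e_2 \cdots e_{K-1})\\
       & p(e_2)          & p(e_3 \mid e_2)     & \cdots & p(e_K \mid e_2 \cdots e_{K-1})\\
       &                 & \ddots              &        & \vdots\\
       &                 &                     &        & p(e_K)
\end{pmatrix}
$$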
  • a similar Markov model can be obtained from an m × m diagonal sub-matrix of M. It will be appreciated that whereas the collection of all paths which represent a sentence of the dataset defines all the conditional probabilities appearing in M, the search-path e1 … eK used in M does not necessarily represent a sentence of the dataset.
  • the definition of the search-path is based on conditional probabilities, such as p(e2|e1).
  • the probability functions comprise probability functions characterizing a rightward direction on each path and probability functions characterizing a leftward direction on each path.
  • a probability function, P_R, characterizing a rightward direction
  • a probability function, P_L, characterizing a leftward direction
  • P_R and P_L vary between 0 and 1 and are specific to the path in question.
  • P_R(3) = p(e3|e1 e2)
  • the rightward direction probability corresponding to the sub-path e1 e2 e3 equals the number of paths moving from e1 through e2 into e3, divided by the number of paths moving from e1 to e2
  • P_L(3) = p(e3|e4)
  • the leftward direction probability corresponding to the sub-path e3 e4 equals the number of paths moving from e3 to e4 divided by the number of paths entering e4. It is convenient to define the aforementioned probabilities in the explicit notations P_R(e1;e3) and P_L(e4;e3), respectively.
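A minimal sketch of the explicit notations P_R(·;·) and P_L(·;·), reusing the path-counting helper above on a toy bundle like the one described next for Figure 5; indices refer to positions on the search-path.

```python
def count_through(paths, seq):
    n = len(seq)
    return sum(any(p[i:i + n] == seq for i in range(len(p)))
               for p in paths)

# Three paths join at e2 and leave after e4, forming a coherent bundle.
paths = [["begin", "e1", "e2", "e3", "e4", "e5", "end"],
         ["begin", "x1", "e2", "e3", "e4", "y1", "end"],
         ["begin", "x2", "e2", "e3", "e4", "y2", "end"]]
search = ["e1", "e2", "e3", "e4", "e5"]

def P_R(i, j):
    """P_R(e_i; e_j): paths through e_i..e_j over paths through e_i..e_{j-1}."""
    return (count_through(paths, search[i:j + 1])
            / count_through(paths, search[i:j]))

def P_L(i, j):
    """P_L(e_i; e_j), leftward: paths through e_j..e_i over paths
    through e_{j+1}..e_i."""
    return (count_through(paths, search[j:i + 1])
            / count_through(paths, search[j + 1:i + 1]))

print(P_R(1, 3))  # P_R(e2;e4) = 1.0 inside the coherent bundle
print(P_R(1, 4))  # P_R(e2;e5) = 0.33...: drop where paths leave at e4
print(P_L(3, 0))  # P_L(e4;e1) = 0.33...: drop leftward before e2
```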
  • Figure 5 illustrates a representative example of a portion of a graph in which a search-path, going through e1 e2 e3 e4 e5 and marked with a "begin" vertex at its beginning and an "end" vertex at its end, is selected. Also shown in Figure 5 are other paths, joining and leaving the search-path at various vertices. The bundle of sub-paths between vertex e2 and vertex e4 displays certain coherence, possibly indicating the presence of a significant pattern in the dataset.
  • the portion of the graph is positioned in a rectangular coordinate system in which the vertices are conveniently arranged along the abscissa while the ordinate represents probability values.
  • P_R first increases because some other paths join to form a coherent bundle, then decreases at e5, because many paths leave the search-path at e4.
  • a candidate overlap can be defined as a sub-sequence represented by a path or a sub-path on the graph in which P_R > 1 − δ_R and P_L > 1 − δ_L, where δ_R and δ_L are two parameters smaller than unity.
  • a typical value for δ_R and δ_L is from about 0.01 to about 0.99. As used herein the term "about" refers to ±10%.
  • the decrement of P_R and P_L can be quantified by defining decrease functions and comparing their values with predetermined cutoffs, so as to identify overlaps between paths or sub-paths.
  • the decrease functions are defined as ratios between probabilities of paths having some common vertices.
  • the statistical significance of the decreases in P_R and P_L can be evaluated, for example, by defining their significance in terms of a null hypothesis and requiring that the corresponding p-values are, on average, smaller than a predetermined threshold, α.
  • a typical value for α is from 0.001 to 0.1.
  • the null hypothesis depends on the choice of the functions which characterize the overlaps. For example, when the ratios are used, the null hypothesis can be P_R(e1;e5) > η_R P_R(e1;e4) and P_L(e4;e1) > η_L P_L(e4;e2). Alternatively, the null hypothesis can be P_R > 1 − δ_R and P_L > 1 − δ_L, or any other combination of the above conditions.
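A minimal sketch of the ratio-based endpoint test; the symbol names follow the reconstruction above and the cutoff value 0.65 is an assumed illustration, not a value given by the patent.

```python
def endpoint_test(P_R_inside, P_R_past, P_L_inside, P_L_past,
                  eta_R=0.65, eta_L=0.65):
    """Reject the null hypothesis 'no significant decrease' (and so
    accept the candidate pattern) when both directional probabilities
    drop sharply past the candidate's endpoints, e.g.
    P_R(e1;e5) < eta_R * P_R(e1;e4)."""
    return (P_R_past < eta_R * P_R_inside and
            P_L_past < eta_L * P_L_inside)

# Values from the toy bundle above: 1.0 inside, 1/3 past the endpoints.
print(endpoint_test(1.0, 1/3, 1.0, 1/3))  # True: e2..e4 is significant
```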
  • P_L and P_R are preferably calculated from many starting points (such as e1 and e4 in the present example), more preferably from all starting points on the search-path, traversing each sub-path both leftward and rightward.
  • This procedure defines many search-sections on the search-path, from which several partial overlaps can be identified. Once the partial overlaps have been identified, the most significant partial overlap is defined as a significant pattern. This step is designated in Figure 3 by Block 204.
  • For a given search-path there are many sub-paths, each represented by an element in a set C, which can be considered as an "overlap score." Once the set C is calculated, its supremum is selected and the sub-path which corresponds to the supremum is preferably defined as the significant pattern of the search-path.
  • the procedure in which overlaps are searched along a search-path is preferably repeated for more than one path of the original graph, more preferably for all the paths of the original graph (hence for all the sentences of the dataset). It will be appreciated that significant patterns can be found depending on the degree to which the search-path overlaps with other paths.
  • the graph can be "rewired" by merging each, or at least a few, significant patterns into a new vertex, referred to hereinafter as a pattern-vertex.
  • This is equivalent to a redefinition of the dataset whereby several tokens are grouped according to the significant patterns to which they belong.
  • This rewiring process reduces the length of the paths of the graph; nonetheless the content of the paths, in terms of the original sentences of the dataset, is conserved.
  • the rewiring can be used to define equivalence classes for the dataset.
  • pattern-vertices can define nonterminals of the grammar.
  • the identification of the significant patterns can depend on other vertices of the search-path, and not only on the vertices belonging to the overlapping sub-paths. The extent of this dependence is dictated by the selected identification procedure (e.g., the choice of the probability functions, the significance test, etc.).
  • a sub-path e2 e3 e4 is defined as a significant pattern of the search-path "begin" → e1 → … → e5 → "end."
  • the vertices e2, e3 and e4 also belong to other paths on the graph, each of which can in turn also be selected as a search-path along which partial overlaps are searched. Being dependent on other vertices of the search-path, the sub-path e2e3e4 may be accepted as a significant pattern for one search-path and may be rejected, on account of failing to pass the selected significance test, for another search-path.
  • the merging of significant patterns into pattern-vertices of the graph can therefore be done in more than one way.
  • in one embodiment, significant patterns are merged only on the path for which they turned out to be significant, while leaving the vertices unmerged on other paths.
  • in the context-free embodiment, after each search on each search-path, sub-paths which are identified as significant patterns are merged into a pattern-vertex, irrespective of whether or not these sub-paths are defined as significant patterns in other paths as well.
  • in the single rewiring embodiment, after each search on each search-path, the sub-paths which are identified as significant patterns are merged into a pattern-vertex.
  • in the multiple rewiring embodiment, after each search on each search-path, the sub-paths which are identified as significant patterns are merged into pattern-vertices.
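A minimal sketch of rewiring one path by merging an identified significant pattern into a pattern-vertex; representing the merged vertex as a tuple is an illustrative choice.

```python
def rewire(path, pattern):
    """Replace each consecutive occurrence of `pattern` in `path`
    with a single pattern-vertex, shortening the path while keeping
    its content."""
    out, i, n = [], 0, len(pattern)
    while i < len(path):
        if path[i:i + n] == pattern:
            out.append(tuple(pattern))   # the new pattern-vertex
            i += n
        else:
            out.append(path[i])
            i += 1
    return out

path = ["begin", "e1", "e2", "e3", "e4", "e5", "end"]
print(rewire(path, ["e2", "e3", "e4"]))
# ['begin', 'e1', ('e2', 'e3', 'e4'), 'e5', 'end']
```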
  • FIG. 6 illustrates a pattern-vertex 42 having vertices e2, e3 and e4, which are identified as a significant pattern for the trial path of Figure 5. Note that vertices e2, e3 and e4 remain on the graph in addition to pattern-vertex 42, because, in the present example, there is a path which goes through e2 and e3 but not through e4, and a path which goes through e4 and e5 (see Figure 5) but not through e2 and e3.
  • the rewiring procedure can be used to define equivalence classes of tokens, allowing, for a given sentence, the replacement of one or more tokens of the sentence with other tokens which are members of the same equivalence class (see, e.g., J. G. Wolff, "Learning syntax and meanings through optimization and distributional analysis," in Y. Levy, I. M. Schlesinger and M. D. S. Braine, Eds., Categories and Processes in Language Acquisition, 179-215, Lawrence Erlbaum, Hillsdale, NJ, 1988).
  • Method 200 ends at step 205.
  • Following is a description of a method 210 suitable for defining equivalence classes so as to identify the nonterminals of the grammar.
  • the method begins at step 211 and continues to step 212, in which significant patterns are extracted from the dataset, for example, using selected steps of method 200 as further detailed hereinabove.
  • the dataset is redefined, as stated, by grouping tokens thereof according to the significant pattern to which they belong.
  • the method continues to step 213 in which the dataset is searched for similarity sets.
  • as used herein, a similarity set refers to a plurality of segments of different sequences, preferably of equal size, having a predetermined number of common tokens and a predetermined number of uncommon tokens.
  • selected steps of method 210 can be represented mathematically as operations performed on a graph having vertices and paths, where each vertex represents one token of the lexicon and each path represents a sentence of the dataset.
  • in graph terms, a similarity set refers to a plurality of paths sharing a predetermined number of vertices within a given window of vertices.
  • the number of shared vertices is L - S.
  • Figure 8a is a schematic illustration of a portion of a graph constructed for a corpus of text in which the tokens are words. Shown in Figure 8a is a similarity set 62 of four paths sharing three vertices within a window of four vertices.
  • a similarity set can thus be considered as a kind of a generalized search-path, which is allowed to branch at S given locations into other vertices of other paths sharing the prefix and suffix sub-paths of the original search-path within some limited window of a predetermined length, L. All the vertices at each branching location of the generalized search-path are collectively referred to hereinbelow as a slot of vertices.
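A minimal sketch of the similarity-set search for the simplest case S = 1: within every window of length L, paths that agree on the L − 1 fixed positions and differ at one slot form a similarity set. Fixing the slot position is a simplification; the toy sentences are invented.

```python
from collections import defaultdict

paths = [["begin", "where", "is", "the", "cat", "?", "end"],
         ["begin", "where", "is", "the", "dog", "?", "end"],
         ["begin", "where", "is", "the", "bird", "?", "end"]]

L, slot = 4, 3          # window length; branching location (S = 1)

# Group windows by the tokens at their L-1 fixed positions.
buckets = defaultdict(list)
for p in paths:
    for i in range(len(p) - L + 1):
        win = p[i:i + L]
        context = tuple(w for k, w in enumerate(win) if k != slot)
        buckets[context].append(win[slot])

for context, fillers in buckets.items():
    if len(set(fillers)) > 1:        # genuine branching: a similarity set
        print(context, "->", sorted(set(fillers)))
# ('where', 'is', 'the') -> ['bird', 'cat', 'dog']
```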
  • the method continues to step 214, in which the similarity sets are used for defining equivalence classes corresponding to slots of vertices which represent uncommon tokens of similarity sets.
  • the definition of the equivalence classes is preferably done, using method 200 which, as stated, can be used for extracting one or more significant patterns from a search-path.
  • the set of all alternative vertices at the given location is defined as an equivalence class.
  • the similarity set in this case consists of all the paths that share e1 and e4 and branch into all possible vertices at location e3.
  • the probability P(e5|e1 e3 e4) is preferably defined as the sum Σ P(e5|e1 e e4), taken over all vertices e of the equivalence class at location e3.
  • the method may further comprise a step which is similar to the rewiring step introduced in method 200 above. More specifically, for each similarity set found to have at least one equivalence class therein, a generalized-vertex is defined, representing all vertices of a respective L-size window of the similarity set.
• Figure 8b illustrates a generalized-vertex 68, defined for a similarity set having an equivalence class 66.
• Generalized-vertex 68 preferably represents the vertices of equivalence class 66 as well as all the vertices of the L-size window used to define equivalence class 66.
  • the rewiring of the graph can be done in any rewiring mode including, without limitation, multiple, single and batch rewiring modes, as further detailed hereinabove.
• generalized-vertex 68, with its enclosed equivalence class 66, also generalizes all other paths participating in its definition.
  • the dataset is generalized in the sense that many of its paths generate sentences that were not listed as sentences in the original dataset.
  • the generalization procedure can be taken one step further by allowing for multiple appearances of equivalence class within a generalized-vertex, even when such equivalence classes were not found in the search for shared vertices within the L-size window.
• the method continues to step 215 in which equivalence classes are attributed to individual members of previously defined equivalence classes. More specifically, in this embodiment each path is searched for vertices identified as members of previously defined equivalence classes. Once such a vertex is found, the respective equivalence class is attributed thereto.
• Figure 9a illustrates a portion of a graph in which an equivalence class 72 is attributed to vertices identified as elements thereof. Equivalence class 72 is adjacent to existing equivalence class 66, hence forming, together with the other vertices of the L-size window, a further generalized path designated by numeral 74.
• the attribution of the equivalence classes is preferably subjected to a generalization test, so as to prevent over-generalization of the dataset. This can be done, for example, by imposing a condition in which there is a sufficient number (say, larger than a generalization threshold, ω) of members of equivalence class 72 which already exist in path 74 at the time the aforementioned search is made.
• a typical value for the generalization threshold, ω, is from about 50% to about 65% of the size of the respective equivalence class (class 72 in the example of Figure 9a).
• the attribution of the equivalence classes can also be subjected to a significance test, e.g., one of the significance tests of method 200. More specifically, path 74 can be used as a generalized search-path on which method 200 can be employed for extracting one or more significant patterns. According to a preferred embodiment of the present invention, class 72 is attributed to path 74 if a significant pattern emerges by searching along path 74.
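• The generalization test can be summarized by the following hedged sketch, in which the class contents, the threshold value and the function name are illustrative assumptions:

```python
def passes_generalization_test(equiv_class, attested_fillers, omega=0.6):
    """Attribute the class to the path only if a sufficient share of its
    members (at least the generalization threshold omega) already exists
    in the path's slot at the time of the search."""
    return len(equiv_class & attested_fillers) / len(equiv_class) >= omega

class_72 = {"cat", "dog", "cow", "horse", "fox"}
found_in_path_74 = {"cat", "dog", "cow"}
print(passes_generalization_test(class_72, found_in_path_74))  # True: 3/5 = 0.6
```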
  • FIGS 9b-c are illustrations of an additional step of method 210, according to a preferred embodiment of the present invention.
• the graph or a portion thereof can be rewired, again, by defining a generalized-vertex including the existing equivalence class, the newly attributed equivalence class and, optionally, other vertices of the respective L-size window.
  • this procedure can be done in any rewiring mode including, without limitation, multiple, single and batch rewiring modes, as further detailed hereinabove.
  • Figure 9b illustrates a generalized-vertex 76, representing the vertices of equivalence class 66 and the vertices of equivalence class 72.
• Figure 9c illustrates a generalized-vertex 78, representing the vertices of equivalence class 66, the vertices of equivalence class 72 and the vertices of the L-size window used to define equivalence classes 66 and 72.
  • the procedure of generalization and redefinition of the dataset is iteratively repeated.
  • new significant patterns and equivalence classes are defined in terms of previously defined significant patterns and equivalence classes as well as remaining tokens.
• iterations are preferably performed over all sequences of the redefined dataset, time and again, until, say, no further significant patterns are found.
  • the list of equivalence classes is updated continuously, and new significant patterns are found using the existing equivalence classes.
  • the vertices are compared to one or more equivalence classes from the pool of existing equivalence classes. Because a vertex or a token can appear in several classes, different combinations of equivalence classes are checked, preferably while scoring each combination.
• the winning combination is preferably the largest class for which most of the members are found among the candidate paths in the set (the ratio between the number of members that have been found among the paths and the total number of members in the equivalence class is compared to the predetermined generalization threshold as one of the configuration acceptance criteria).
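• As a hedged illustration of this scoring, the sketch below picks the winning class among candidate combinations; the only acceptance criterion shown is the coverage-ratio test named above, and the toy classes are invented for the example:

```python
def winning_class(candidate_classes, attested_fillers, omega=0.6):
    """Among the checked combinations, prefer the largest equivalence class
    whose member-coverage ratio meets the generalization threshold."""
    viable = [c for c in candidate_classes
              if len(c & attested_fillers) / len(c) >= omega]
    return max(viable, key=len, default=None)

classes = [{"cat", "dog"}, {"cat", "dog", "cow", "horse"}]
print(winning_class(classes, {"cat", "dog", "cow"}))
# {'cat', 'dog', 'cow', 'horse'}: coverage 3/4 >= 0.6 and it is the largest
```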
  • the present embodiments enable the construction of a graph having many paths, in principle of the same order of magnitude as the original number of paths, yet its overall structure is much reduced, since many of the vertices and sub-paths are merged to pattern-vertices.
• the pattern-vertices that are left in the final format of the graph are referred to herein as "root-patterns."
• the set of all tokens, equivalence classes and significant patterns thus forms a context-free grammar, whereby the terminals are the tokens, and the nonterminals are equivalence classes or significant patterns. It is common practice to represent the context-free grammar hierarchically as a forest of multilevel trees.
  • Each tree can represent a pattern of tokens of the generalized dataset, whereby child nodes, appearing on the leaf level of the tree, correspond to tokens, and parent nodes, appearing on the partition levels, correspond to significant patterns or equivalence classes.
  • the tree representation specifies the relations between all the significant patterns and equivalence classes that appear in the tree and can therefore be considered as the set of rules of the context free grammar.
  • any path on the graph can be represented as one root-pattern, or a set of consecutive root-patterns and some of the original tokens.
  • each root-pattern is preferably considered in its tree format.
  • the tree can be constructed to be read from top to bottom and from left to right, where, preferably, only one of the children of each equivalence class is selected to generate a sequence, appearing on the leaf-level of the tree.
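• To illustrate this top-to-bottom, left-to-right reading, the following sketch generates a sequence from a toy tree; the node encoding ('P' for significant pattern, 'E' for equivalence class) and the example tree are assumptions made for the illustration:

```python
import random

def generate(node):
    """Patterns emit all of their children in order; equivalence classes
    emit exactly one selected child; plain tokens emit themselves."""
    if isinstance(node, str):
        return [node]
    kind, children = node
    if kind == 'E':                       # equivalence class: pick one child
        return generate(random.choice(children))
    return [t for child in children for t in generate(child)]  # pattern

root = ('P', ['the', ('E', ['cat', 'dog']), ('P', ['is', 'on']), 'the', 'mat'])
print(' '.join(generate(root)))  # e.g. "the cat is on the mat"
```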
• Figures 10a-c illustrate nested relationships between significant patterns and equivalence classes in a tree format.
  • Figure 10a shows a simple relationship of a sequence containing several tokens and one significant pattern (designated by blob 67 in Figure 10a) of two tokens. Such relationships are typically obtained in early iterations of the generalization procedure.
• A further iteration is shown in Figure 10b, where significant pattern 67 is found to belong to another significant pattern, designated by blob 101, together with an equivalence class, designated by blob 98.
• Also shown in Figure 10b is an additional significant pattern 120 on the same partition level as significant pattern 101, parenting two equivalence classes, 70 and 66.
• equivalence class 70 is partitioned into child nodes on the leaf level of the tree;
• equivalence class 66 is partitioned into one child node and one parent node, representing another equivalence class, designated by blob 65.
  • a typical final tree is shown in Figure 10c, where a root-pattern 144, parenting the aforementioned significant patterns 120 and 101, is left between the "begin" vertex and the "end" vertex of the graph from which the tree is constructed.
• An additional tree structure is illustrated in Figure 10d.
• Any such tree or forest structure thus represents a four-tuple G(VN, VT, P, S) in which the terminals VT are the child nodes on the leaf level of the tree, and the nonterminals VN are the parent nodes on the partition levels.
• an utterance can be produced by successive application of the production rules P thereto.
• Table 1 below is a representative example of a list of production rules suitable for producing utterances from the tree structure of Figure 10d.
  • method 210 depends, in principle, on the order in which the paths are selected to be searched and rewired. Hence, one can construct a set of graphs which differ from each other by the paths traversal order used in their construction. Each graph in the set corresponds to another generalized dataset.
  • method 210 further comprises an optimization procedure in which selected steps (e.g., steps 213, 214 and 215) are repeated a plurality of times, while permuting a searching order of the similarity sets.
• the optimization is achieved by calculating, for each generalized dataset, a generalization factor, which can be defined, for example, as the ratio between the number of sequences of the generalized dataset and the number of sequences of the original dataset.
  • the optimal generalized dataset can be selected as the generalized dataset corresponding to the maximal generalization factor.
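• A hedged sketch of this selection loop is given below; `generalize` stands in for a full run of steps 213-215 under a given traversal order and is an assumed callable, not a routine defined by the patent:

```python
from itertools import islice, permutations

def generalization_factor(generalized, original):
    """Ratio between the number of sequences of the generalized dataset
    and the number of sequences of the original dataset."""
    return len(generalized) / len(original)

def select_optimal(original, generalize, n_orders=100):
    """Re-run the search under several traversal orders of the similarity
    sets and keep the dataset with the maximal generalization factor."""
    best_score, best = float('-inf'), None
    for order in islice(permutations(range(len(original))), n_orders):
        candidate = generalize(original, order)
        score = generalization_factor(candidate, original)
        if score > best_score:
            best_score, best = score, candidate
    return best
```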
• alternatively, the optimization can be achieved by calculating, for each generalized dataset, a recall-precision pair.
• Recall and precision are effectiveness measures known in the art, in particular in the areas of data mining, database processing and information retrieval.
  • a recall value is the amount of relevant information (e.g., number of sequences) retrieved from the database divided by the amount of relevant information which exists in the database; and
  • a precision value is the amount of relevant information retrieved from the database divided by the total amount of information which is retrieved.
• a large value of precision combined with a small value of recall corresponds to low productivity, while a small value of precision combined with a large value of recall corresponds to over-generalization.
• the optimal generalized dataset is selected as the generalized dataset corresponding to an optimal combination (e.g., multiplication) of the precision and recall values.
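• The following sketch computes the two measures and one possible combination (their product), as suggested above; the toy sequence identifiers are illustrative:

```python
def recall_precision(retrieved, relevant):
    """recall = |retrieved ∩ relevant| / |relevant|;
    precision = |retrieved ∩ relevant| / |retrieved|."""
    hit = len(set(retrieved) & set(relevant))
    return hit / len(relevant), hit / len(retrieved)

recall, precision = recall_precision(
    retrieved=["s1", "s2", "s3"], relevant=["s1", "s2", "s4", "s5"])
print(recall, precision, recall * precision)  # 0.5 0.666... 0.333...
```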
• P(si, ej) is the joint probability for both si and ej to appear in the same equivalence class;
• P(si) and P(ej) are, respectively, the probabilities of si and ej to appear in any equivalence class.
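• The formula combining these quantities does not survive in this text; one common way to relate such joint and marginal membership probabilities is a pointwise ratio, sketched below under that assumption:

```python
from math import log

def class_association(si, ej, classes):
    """Pointwise association log[P(si,ej) / (P(si) * P(ej))], with all
    probabilities estimated over membership in the equivalence classes."""
    n = len(classes)
    joint = sum(1 for c in classes if si in c and ej in c) / n
    p_si = sum(1 for c in classes if si in c) / n
    p_ej = sum(1 for c in classes if ej in c) / n
    return log(joint / (p_si * p_ej)) if joint else float('-inf')
```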
• for an equivalence class, the value propagated upwards is preferably the strongest non-zero activation of its members; for a pattern, it is preferably the average weight of the child nodes, on the condition that all the children are activated by adjacent inputs.
  • Method 210 ends at step 216.
  • An exemplified pseudo source code for implementing methods 200 and 210 on a data processor is provided in the Examples section that follows.
  • Acquirer 22 can be used for executing selected steps of methods 200 and 210, and preferably comprises a constructor 82, for constructing a graph representing the dataset as further detailed hereinabove.
  • Acquirer 22 further comprises a searcher 84, for searching for partial overlaps between sentence and other sentences of the dataset, a testing unit 86, for applying significance tests on the partial overlaps, and a significant pattern definition unit 88, for defining significant pattern of sentence, as further detailed hereinabove.
  • Acquirer 22 can further comprise a searcher 94, for searching over the dataset for similarity sets, and an equivalence class definition unit 96, for defining equivalence classes as further detailed hereinabove.
  • the memory media can be any memory media known to those skilled in the art, capable of storing the generalized dataset either in a digital form or in an analog form.
• the memory is removable so as to allow plugging the memory into a host (e.g., a processing system), thereby allowing the host to store the generalized dataset in it or to retrieve the generalized dataset from it.
  • Examples for memory media which may be used include, but are not limited to, disk drives (e.g., magnetic, optical or semiconductor), CD-ROMs, floppy disks, flash cards, compact flash cards, miniature cards, solid state floppy disk cards, battery-backed SRAM cards and the like.
• the grammar is stored in the memory media in a retrievable format so as to provide accessibility to the stored data. It is appreciated that in all the above embodiments, the grammar can be stored in the memory media in an appropriate displayable format, either graphically or textually. Many displayable formats are presently known, for example, TEXT, BITMAP™, DBF™, TIFF™, DIB™, PALETTE™, RIFF™, PDF™, DVI™ and the like. However, it is to be understood that any other format that is presently known or will be developed during the lifetime of this patent is within the scope of the present invention.
• the present embodiments successfully provide a method and apparatus for translating an utterance u^A from a source language to a target language.
  • the method and apparatus of the present embodiments translate an input utterance by employing a structured stochastic language model.
  • the model is preferably a grammar-based generative model.
• the role of the stochastic language model according to the present embodiments is to generate all possible sequences of tokens and to assess the relative likelihoods of the generated sequences by assigning a score to each such sequence.
• since the model is a grammar-based generative model, the generation of sequences is according to the production rules and grammar symbols dictated by the appropriate grammar.
• the language model uses the target grammar G^B to generate candidate utterances in the target language.
  • the ultimate result of the application of the stochastic language model according to the present embodiments is a list of candidate utterances ranked by their score.
  • the scores obtained from the language model can be used for a comparison of competing candidate utterances.
  • the candidate utterance having the optimal score is selected as the translation of the input utterance.
• the use of a grammar-based structured stochastic language model allows the production of the entire candidate utterance, unlike, e.g., phrase-based techniques (see, e.g., Chiang, supra) in which a language model serves only for ranking already-structured results produced using phrase-level conditional probability information.
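• The overall flow can be summarized by the skeleton below; all four callables are hypothetical stand-ins for the stages described in this section, not the patent's concrete routines, and the one-to-one symbol mapping is a simplifying assumption:

```python
def translate(u_A, parse, translation_rules, generate_candidates, score):
    """Parse u^A into source symbols L^A, map them through the translation
    rules to target symbols L^B, let the structured stochastic language
    model over G^B generate candidate utterances, and return the candidate
    with the optimal score."""
    L_A = parse(u_A)                            # ordered source symbols
    L_B = [translation_rules[a] for a in L_A]   # mapping functions T
    candidates = generate_candidates(L_B)       # SSLM^B produces U^B
    return max(candidates, key=score)           # optimal-score candidate
```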
  • Figure 12 is a flowchart diagram of a method suitable for translating an input utterance from a source language to a target language, according to various exemplary embodiments of the present invention.
• the method begins at step 221 and optionally and preferably continues to step 222 in which the method accesses a database to obtain at least a partial source grammar G^A and at least a partial target grammar G^B.
• the grammars are produced and archived as described hereinabove (e.g., by means of methods 200 and 210 or using acquirer 22), but grammars generated by other means are not excluded from the scope of the present invention.
• the method preferably continues to step 223 in which the database is accessed to obtain translation rules.
  • the translation rules comprise the set T of mapping functions described above.
• the method continues to step 224 in which the input utterance u^A is parsed to obtain an ordered set L^A of source grammar symbols as described above.
• the method proceeds to step 225 in which L^A is mapped using the mapping functions to an ordered set L^B of target grammar symbols, and to step 226 in which, for each target grammar symbol b ∈ L^B, the method assigns prior probabilities in a structured stochastic language model SSLM^B defined over the target grammar G^B.
• the method continues to step 227 in which prior probabilities in SSLM^B are replaced with conditional probabilities P(b|D) representing a discourse context D of the input utterance.
• the information sources used to determine the discourse context D may include textual and extra-linguistic settings of u^A.
• in step 228 the structured stochastic language model is employed so as to generate a list U^B of candidate utterances in the target language.
• the language model also assigns a score (typically a likelihood) to each candidate utterance u^B ∈ U^B.
• in step 230 the candidate utterance having the optimal score is selected as the translation of u^A.
• optionally, step 230 is preceded by a further processing step 229 in which the candidate utterances are processed according to ranking criteria, such as, but not limited to, a thematic fit.
• each candidate utterance u^B ∈ U^B is parsed to obtain an ordered set A^B of target grammar symbols covering the candidate utterance.
• a correspondence analysis is performed to provide a goodness of correspondence between A^B and L^A.
• the goodness of correspondence can serve as a ranking criterion for the further processing, whereby the translation of u^A is defined as the argmax of corresp(A^B, L^A).
• optionally, the correspondence analysis is supplemented by a calculation of an overall probability P(u^B) for each u^B ∈ U^B, and the translation of u^A is defined as the argmax of the weighted sum λP(u^B) + (1 − λ)corresp(A^B, L^A).
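• A minimal sketch of this weighted ranking is given below; the candidate strings, probabilities and correspondence scores are invented for the example:

```python
def ranked_translation(candidates, lam=0.5):
    """argmax of lam * P(u^B) + (1 - lam) * corresp(A^B, L^A), given
    (utterance, overall probability, correspondence) triples."""
    return max(candidates, key=lambda c: lam * c[1] + (1 - lam) * c[2])[0]

U_B = [("candidate one", 0.30, 0.90),   # 0.5*0.30 + 0.5*0.90 = 0.60
       ("candidate two", 0.40, 0.70)]   # 0.5*0.40 + 0.5*0.70 = 0.55
print(ranked_translation(U_B))          # candidate one
```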
  • the method ends at step 231.
  • An exemplified pseudo source code for implementing the method on a data processor is provided in the Examples section that follows.
• apparatus 240 comprises an input unit 242 which accesses a database to obtain the grammars G^A and G^B and the translation rules.
• Apparatus 240 can further comprise parser 26 for parsing the utterance u^A to provide the ordered set L^A as described above.
• Apparatus 240 can also comprise a mapping unit 244 which maps L^A to the ordered set L^B, and a probability assigner 246 which assigns the prior probabilities in SSLM^B as further detailed hereinabove.
• Probability assigner 246 can also obtain conditional probabilities representing the discourse context and replace the prior probabilities in SSLM^B with the conditional probabilities.
• Apparatus 240 further comprises a candidate utterance generator 248 which employs the structured stochastic language model and generates the candidate utterances U^B as described above.
• An optimizer 250 selects from U^B the candidate utterance having the optimal score.
• apparatus 240 comprises a candidate utterance processing unit 252 which processes U^B.
  • parser 26 preferably parses each candidate utterance, and unit 252 performs correspondence analysis, as described above.
• Pattern Distillation: for each path: 2.1 find the leading significant pattern: define the path as a search-path and perform method 200 on the search-path by considering all search segments (i, j), j > i, starting PR at ei and PL at ej; choose out of all segments the leading significant pattern, P, for the search-path; and
• 2.2 rewire graph: create a new vertex corresponding to P and replace the string of vertices comprising P with the new vertex P, using the context-free embodiment or the context-sensitive embodiment.
• 3.1.1 define a slot at location j; 3.1.2 define the generalized path consisting of all paths that have an identical prefix (at locations i to j−1) and an identical suffix (at locations j+1 to i+L−1); and
• 3.3 rewire graph: create a new vertex corresponding to P and replace the string of vertices it subsumes with the new vertex P, using the context-free embodiment or the context-sensitive embodiment.
• 4.1.4 rewire graph: create a new vertex corresponding to P by replacing the string of vertices subsumed by P with the new vertex P, using the context-free embodiment or the context-sensitive embodiment. 5. Reiteration:
  • Each grammar (set of terminals and nonterminals) can be acquired by the algorithm of Example 1.
• For two parallel matched corpora do: initialize set T using a bilingual machine-readable dictionary; //PASS 1: update T with parallel-corpus data; update the probability matrices P(ai, aj) and P(bk, bl):
  • EXAMPLE 3 Following is a detailed algorithm which can be used, according to various exemplary embodiments of the present invention, for translating an utterance from a source language to a target language.
• the algorithm is provided by way of a pseudo code for the case in which the dataset is a corpus of text.
  • Figure 14 exemplifies a translation of an utterance sentence from a source language (English in the present Example) to a target language (Chinese in the present Example).
• the left panel of Figure 14 represents the source grammar and the right panel represents the target grammar.
  • a few terminals, patterns and equivalence classes are shown for each grammar.
  • the parsing of the input utterance "The cat is on the mat” evokes source grammar terminals shown in bold in the left panel.
  • the source grammar terminals are mapped to corresponding target grammar terminals on the right panel. Patterns such as “the " and “is on " and equivalence classes are also mapped to their counterparts. Due to polysemy and initial structural ambiguity, a source grammar symbol can be mapped to more than one target grammar symbol. The disambiguation is enforced by the context via the interaction between the target elements and the target language model. Given an ordered set of target grammar symbols, the target language model constructs the most probable utterance that is consistent with the source language utterance.
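• The disambiguation-by-context mechanism can be illustrated with the toy sketch below; French target tokens are used in place of the Chinese of Figure 14 purely for readability, and the mapping table, scores and sentence are invented for the example:

```python
from itertools import product

# hypothetical toy mapping T with polysemy: "mat" has two target candidates
T = {"the": ["le"], "cat": ["chat"], "is on": ["est sur"],
     "mat": ["tapis", "natte"]}

def target_lm_score(tokens):
    """Stand-in for the target language model: the context here is taken
    to prefer 'tapis' as the final token."""
    return 0.9 if tokens[-1] == "tapis" else 0.1

L_A = ["the", "cat", "is on", "the", "mat"]       # parsed source symbols
U_B = [list(c) for c in product(*(T[s] for s in L_A))]
print(" ".join(max(U_B, key=target_lm_score)))    # le chat est sur le tapis
```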

Abstract

A method of constructing translation rules using a first dataset of sentences in a source language and a second dataset of sentences in a target language is disclosed. The method comprises, acquiring at least a partial source grammar characterizing the source language and at least a partial target grammar characterizing the target language. The method further comprises, for each sentence of the first dataset, generating a mapping function for mapping an ordered set of source grammar symbols being associated with the sentence to an ordered set of target grammar symbols, thereby providing a set of mapping functions. The mapping functions are then archived in a database to be used for translation.

Description

METHOD AND APPARATUS FOR TRANSLATING UTTERANCES
FIELD AND BACKGROUND OF THE INVENTION
The present invention relates generally to the field of language processing and, more particularly, but not exclusively, to the translation of language utterances for use in automated systems, such as machine translation systems, data transformation systems, speech recognition systems, and the like.
Applications for automated recognition and/or translation of languages abound. Well-known examples include speech or voice recognition, machine translation, or any other field in which it is required to maintain multiple instances of the same semantic information, such as the management of heterogeneous data environments or data warehousing. Speech recognition is the automated processing of verbal input. This permits a person to converse with a machine (e.g., a computer system), thereby foregoing the need for laborious input devices such as keyboards. Machine translation is the automated process of translating one language (a source language) into another language (a target language).
The need for automated data translations is driven in part by rapid changes in computing and information systems technology but also by business changes. To conduct world-wide business, for example, international companies must acquire and process vast amounts of data which are often written in a foreign language. In turn, these companies must also communicate with overseas concerns, such as foreign companies, governments and customers. Since translation performed by human translators is a time consuming and expensive task, any automation of the process is highly desirable. Automated data translations are widely needed for natural as well as non- natural languages. In natural languages, automated data translations translate phrases and sentences and are widely used in many applications, e.g., text editors, real time translations over the internet and the like. In non-natural languages, such as computer-programming languages or formal logic languages, data in a format recognizable by legacy software is converted to a format recognizable by new software.
When translating natural languages, a system must process information which is not only vast, but often ambiguous or uncertain. A word in a given passage will often have a meaning which can only be discerned from its context. Consider, for example, the word "flies" in the phrases "fruit flies like a banana" and "time flies like an arrow." In the former, the word is a noun; in the latter, it is a verb. Thus the word "flies," when examined in isolation, is ambiguous since its meaning cannot be clearly discerned. Consequently, a system translating a passage of text must "disambiguate" each word, i.e., determine the best analysis for a word from a number of possible analyses by examining the context in which the word appears. For a text passage of even modest size, this process demands significant, if not substantial, processing time and expense.
Given a lexicon of tokens, a language is a (possibly infinite) set of sequences of these tokens. In natural languages the tokens are the words of the language. A language is defined by a grammar describing a syntactic structure which is a breakdown of sequences and a description of how those sequences combine into larger sequences. The grammar, however, is not unique to the language in the sense that a number of different grammars can define the same language. In practical terms, the grammar consists of sets of symbols and a set of production rules. The sets of symbols generally include the lexicon and an additional set of symbols which represents other objects such as collections of lexicon entries. The production rules indicate how a symbol may be successively replaced by substituting other symbols for it. The ultimate result of a production process is a sequence consisting of lexicon entries. Such sequence is referred to herein as an utterance. For example, in natural languages an utterance can be a phrase or a sentence, and in computer programming language an utterance can be a program statement.
Formally, a grammar G of a language is a four-tuple G(VN, VT, P, S), where VN is a set of symbols called "nonterminals," VT is a set of symbols called "terminals," P is a set of production rules, and S is a symbol called a "start" symbol. The elements of VT are the lexicon entries of the language. They are called "terminals" because they do not have to be replaced to form a valid utterance. Conversely, the elements of VN are "nonterminals" in that they indicate the need for further replacement in order to complete the production of the utterance. The nonterminals thus represent purely abstract objects, which do not appear in any utterance in the language, but rather map to production rules (elements of P) which indicate how to replace the nonterminals with terminals or with other nonterminals. Each production rule consists of a left hand side sequence called a "name", and a right hand side sequence called a "body". A production rule is applied to sequences and is interpreted as a left-to-right replacement rule.
An utterance generating machine repeatedly applies the production rules P until it has replaced all the nonterminals such as to form a valid utterance consisting of terminals only. The start symbol S is the initial symbol which starts the production of the utterance. Replacing a symbol or a sequence of symbols is done by finding the production rule that has that symbol or sequence of symbols as its rule name, and putting the rule body (which may include one or more elements of VN or VT) in its place. The result may be that more nonterminals have been inserted by the application of the rule, so the process repeats. Eventually, no elements of VN are left and the result is the textual statement.
Natural language processing methods have been developed for grammar induction in which a representation of a grammar is induced from a corpus of text. Of particular interest are statistical grammar induction methods aiming to identify the most probable grammar for a given training data [K. Lari and S. J. Young, "The estimation of stochastic context-free grammars using the Inside-Outside algorithm," Computer Speech and Language, 4:35-56, 1990; F. Pereira and Y. Schabes, "Inside-Outside reestimation from partially bracketed corpora," in Annual Meeting of the ACL, 128-135, 1992]. Over the years, a large body of work has been done in natural language processing to develop various techniques aiming to induce a grammar hence to learn, recognize and/or generalize text corpora.
Yet, in the area of language translation, most commercially available automated translation techniques are limited to word-by-word translation. Machine translation systems that attempt phrase- and sentence-level translation, may be useful in certain simple situations, but are generally unreliable in many respects.
Conventional machine translation techniques rely on a phrase-based approach, which uses phrase-level conditional probability information, to map the source sentence into the target sentence as follows: firstly, the original sentence is segmented into phrases, typically with a uniform distribution over segmentations; secondly, the phrases are reordered according to some distortion model; and thirdly each phrase is translated into a target-language phrase according to a model estimated from the training data [Chiang, D., "A hierarchical phrase-based model for statistical machine translation," Proceedings of ACL'05, pages 263-270, 2005]. It has been noted that because of data sparseness, using phrases of more than three words does not improve performance. The main thrust of research, therefore, focuses on capturing the reordering regularities. For example, the above technique links hierarchical phrases in the two languages, by learning synchronous context free grammars from examples. This approach however, is limited by the quality of the grammars.
There is thus a widely recognized need for, and it would be highly advantageous to have a method and apparatus for translating utterances, devoid of the above limitations.
SUMMARY OF THE INVENTION
According to one aspect of the present invention there is provided a method of constructing translation rules using a first dataset of sentences in a source language and a second dataset of sentences in a target language. The translation rules are for translating utterances of the source language into utterances of the target language.
The method comprises: acquiring at least a partial source grammar characterizing the source language and at least a partial target grammar characterizing the target language; and, for each sentence of the first dataset, generating a mapping function for mapping an ordered set of source grammar symbols being associated with the sentence to an ordered set of target grammar symbols, thereby providing a set of mapping functions. The method further comprises archiving the set of mapping functions in a database, thereby constructing the translation rules.
According to further features in preferred embodiments of the invention described below, the mapping functions are initialized by a machine readable dictionary.
According to still further features in the described preferred embodiments the method further comprises archiving the at least partial source grammar and the at least partial target grammar in the database.
According to still further features in the described preferred embodiments the generation of the mapping function comprises, for each dataset of the first and second datasets, parsing each sentence of the dataset to provide an ordered set of symbols covering the sentence, and updating the mapping functions based on the ordered set of symbols. According to still further features in the described preferred embodiments the method further comprises, constructing a source probability matrix from the source grammar and a target probability matrix from the target grammar, each probability matrix comprises a plurality of entries representing co-occurrence probabilities of symbols in a respective grammar.
According to still further features in the described preferred embodiments the method further comprises, for each dataset of the first and second datasets, parsing each sentence of the dataset to provide an ordered set of symbols covering the sentence, and updating a respective probability matrix using the ordered set of symbols.
According to still further features in the described preferred embodiments the method further comprises updating the set of mapping functions using the probability matrices.
According to still further features in the described preferred embodiments each of the source grammar and the target grammar independently comprises: terminals being associated with tokens of a lexicon characterizing the dataset and nonterminals being associated with equivalence classes of tokens of the lexicon and/or significant patterns of a respective dataset.
According to still further features in the described preferred embodiments the acquiring of the source grammar and the target grammar, comprises, for each sentence of the first dataset and for each sentence of the second dataset, searching for partial overlaps between the sentence and other sentences of the respective dataset, applying a significance test on the partial overlaps, and defining a most significant partial overlap as a significant pattern of the sentence, thereby extracting significant patterns from the first and the second datasets, thereby acquiring nonterminals for the source grammar and the target grammar.
According to still further features in the described preferred embodiments the acquiring of the source grammar and the target grammar, comprises: for each dataset of the first dataset and the second dataset: searching over the dataset for similarity sets, each similarity set comprises a plurality of segments of size L having L-S common tokens and S uncommon tokens, each of the plurality of segments being a portion of a different sentence of the dataset; and defining a plurality of equivalence classes corresponding to uncommon tokens of at least one similarity set, thereby acquiring nonterminals for the source grammar and the target grammar.
According to still further features in the described preferred embodiments the definition of the plurality of equivalence classes comprises, for each segment of each similarity set: extracting a significant pattern corresponding to a most significant partial overlap between the segment and other segments or combination of segments of the similarity set, thereby providing, for each similarity set, a plurality of significant patterns; and using the plurality of significant patterns for classifying tokens of the similarity set into at least one equivalence class; thereby defining the plurality of equivalence classes.
According to still further features in the described preferred embodiments the classification of the tokens comprises, selecting a leading significant pattern of the similarity set, and defining uncommon tokens of segments corresponding to the leading significant pattern as an equivalence class. According to still further features in the described preferred embodiments the method further comprises, prior to the search for the similarity sets: extracting a plurality of significant patterns from the dataset, each significant pattern of the plurality of significant patterns corresponding to a most significant partial overlap between one sentence of the dataset and other sentences of the dataset; and for each significant pattern of the plurality of significant patterns, grouping at least a few tokens of the significant pattern, thereby redefining the dataset.
According to still further features in the described preferred embodiments the method further comprises, for each similarity set having at least one equivalence class, grouping at least a few tokens of the similarity set thereby redefining the dataset. According to still further features in the described preferred embodiments the method further comprises for each sentence, searching over the sentence for tokens being identified as members of previously defined equivalence classes, and attributing a respective equivalence class to each identified token, thereby acquiring additional nonterminals. According to still further features in the described preferred embodiments the attribution of the respective equivalence class to the identified token is subjected to a generalization test. According to still further features in the described preferred embodiments the generalization test comprises determining a number of different sentences having tokens being identified as other elements of the respective equivalence class, and if the number of different sentences is larger than a predetermined generalization threshold, then attributing the respective equivalence class to the identified token.
According to still further features in the described preferred embodiments the attribution of the respective equivalence class to the identified token is subjected to a significance test.
According to still further features in the described preferred embodiments the significance test comprises: for each sentence having elements of the respective equivalence class, searching for partial overlaps between the sentence and other sentences having elements of the respective equivalence class, and defining a most significant partial overlap as a significant pattern of the sentence, thereby extracting a plurality of significant patterns; selecting a leading significant pattern of the plurality of significant patterns; and if the leading significant pattern includes the identified token, then attributing the respective equivalence class to the identified token.
According to still further features in the described preferred embodiments the method further comprises constructing a graph having a plurality of paths representing the dataset, wherein each extraction of significant pattern is by searching for partial overlaps between paths of the graph.
According to still further features in the described preferred embodiments the method further comprises calculating, for each path, a set of probability functions characterizing the partial overlaps.
According to still further features in the described preferred embodiments the most significant partial overlap is determined by a significance test being performed by evaluating a statistical significance of the set of probability functions.
According to another aspect of the present invention there is provided apparatus for constructing translation rules using a first dataset of sentences in a source language and a second dataset of sentences in a target language. The apparatus comprises: a grammar acquirer, for acquiring at least a partial source grammar characterizing the source language symbols and at least a partial target grammar characterizing the target language; a mapping function generator for generating, for each sentence of the first dataset, a mapping function for mapping an ordered set of source grammar symbols being associated with the sentence to an ordered set of target grammar symbols, thereby to provide a set of mapping functions; and an archiving unit for archiving the set of mapping functions in a database.
According to further features in preferred embodiments of the invention described below, the mapping function generator is operable to generate the mapping function using a machine readable dictionary.
According to still further features in the described preferred embodiments the archiving unit is operable to archive the at least partial source grammar and the at least partial target grammar in the database. According to still further features in the described preferred embodiments the mapping function generator comprises a parser for parsing each sentence of the datasets to provide an ordered set of symbols covering the sentence, and a mapping function update unit for updating the set of mapping functions using the ordered set of symbols. According to still further features in the described preferred embodiments the apparatus further comprises, a probability matrix constructor, for constructing a source probability matrix from the source grammar and a target probability matrix from the target grammar, each probability matrix comprises a plurality of entries representing co-occurrence probabilities of symbols in a respective grammar. According to still further features in the described preferred embodiments the mapping function generator comprises a parser for parsing each sentence of the datasets to provide an ordered set of symbols covering the sentence, and wherein the probability matrix constructor is operable to update the probability matrices using the ordered set of symbols. According to still further features in the described preferred embodiments the mapping function generator comprises a mapping function update unit for updating the set of mapping functions using the probability matrices.
According to still further features in the described preferred embodiments each of the set of source grammar symbols and the set of target grammar symbols independently comprises: terminals being associated with tokens of a lexicon characterizing the dataset and nonterminals being associated with equivalence classes of tokens of the lexicon and/or significant patterns of a respective dataset. According to still further features in the described preferred embodiments the grammar acquirer comprises: a searcher, for searching, for each sentence of the first dataset and for each sentence of the second dataset, partial overlaps between the sentence and other sentences of the respective dataset; a testing unit, for applying a significance test on the partial overlaps; and a definition unit, for defining a most significant partial overlap as a significant pattern of the sentence, thereby to acquire nonterminals for the source grammar and the target grammar.
According to still further features in the described preferred embodiments the grammar acquirer comprises: a searcher, for searching over each dataset of the first dataset and the second dataset for similarity sets, each similarity set comprises a plurality of segments of size L having L-S common tokens and S uncommon tokens, each of the plurality of segments being a portion of a different sentence of the respective dataset; and a definition unit, for defining a plurality of equivalence classes corresponding to uncommon tokens of at least one similarity set, thereby to acquire nonterminals for the source grammar and the target grammar.
According to still further features in the described preferred embodiments the apparatus further comprises an extractor, capable of extracting, for a given set of sentences, a significant pattern corresponding to a most significant partial overlap between one sentence of the dataset and other sequences of the dataset, thereby providing, for the given set of sentences, a plurality of significant patterns.
According to still further features in the described preferred embodiments the given set of sequences is a similarity set, hence the plurality of significant patterns corresponds to the similarity set.
According to still further features in the described preferred embodiments the definition unit comprises a classifier, capable of classifying tokens of the similarity set into at least one equivalence class using the plurality of significant patterns.
According to still further features in the described preferred embodiments the classifier is designed for selecting a leading significant pattern of the similarity set, and defining uncommon tokens of segments corresponding to the leading significant pattern as an equivalence class.
According to still further features in the described preferred embodiments the given set of sentences is the dataset, hence the plurality of significant patterns corresponds to the dataset. According to still further features in the described preferred embodiments the apparatus further comprises a first grouper for grouping at least a few tokens of each significant pattern of the plurality of significant patterns.
According to still further features in the described preferred embodiments the apparatus further comprises a second grouper, for grouping at least a few tokens of each similarity set having at least one equivalence class.
According to still further features in the described preferred embodiments the apparatus further comprises a second definition unit having a second searcher, for searching over each sentence for tokens being identified as members of previously defined equivalence classes, wherein the second definition unit is designed to attribute a respective equivalence class to each identified token.
According to still further features in the described preferred embodiments the apparatus further comprises a constructor, for constructing a graph having a plurality of paths representing the dataset. According to still further features in the described preferred embodiments the extractor is designed to search for partial overlaps between paths of the graph.
According to still further features in the described preferred embodiments the graph comprises a plurality of vertices, each representing one token of the lexicon, and further wherein each path of the plurality of paths comprises a sequence of vertices respectively corresponding to one sentence of the dataset.
According to still further features in the described preferred embodiments the apparatus further comprises electronic-calculation functionality for calculating, for each path, a set of probability functions characterizing the partial overlaps.
According to still further features in the described preferred embodiments the extractor comprises a testing unit capable of evaluating a statistical significance of the set of probability functions.
According to yet another aspect of the present invention there is provided a method of translating an utterance from a source language to a target language. The method comprises: employing a structured stochastic language model so as to generate from the utterance a plurality of candidate utterances in the target language, and so as to assign a score to each candidate utterance of the plurality of candidate utterances; and selecting a candidate utterance having an optimal score, thereby translating the utterance from the source language to the target language. According to further features in preferred embodiments of the invention described below, the method further comprises accessing a database to obtain at least a partial source grammar characterizing the source language and at least a partial target grammar characterizing the target language, wherein the generation of the plurality of candidate utterances is based on the at least partial grammars.
According to still further features in the described preferred embodiments the method further comprises, prior to the generation of the plurality of candidate utterances: accessing the database to obtain translation rules defined by a set of mapping functions for mapping ordered sets of source grammar symbols to ordered sets of target grammar symbols; parsing the utterance to obtain an ordered set of source grammar symbols covering the utterance, thereby providing an utterance cover in the at least partial source grammar; using the translation rules for mapping the utterance cover to an ordered set of target grammar symbols; and assigning prior probabilities in the structured stochastic language model for each target grammar symbol of the grammar symbols set.
According to still further features in the described preferred embodiments the method further comprises prior to the selection of the candidate utterance, processing the plurality of candidate utterances according to additional ranking criteria.
According to still further features in the described preferred embodiments the processing comprises, for each candidate utterance of the plurality of candidate utterances: parsing the candidate utterance to obtain an ordered set of target grammar symbols covering the candidate utterance, thereby providing a candidate utterance cover in the at least partial target grammar; and performing correspondence analysis to provide goodness of correspondence between the candidate utterance cover in the at least partial target grammar and the utterance cover in the at least partial source grammar.
According to still further features in the described preferred embodiments the employment of stochastic language model comprises replacing the prior probabilities with conditional probabilities representing a discourse context of the utterance. According to still another aspect of the present invention there is provided apparatus for translating an utterance from a source language to a target language, comprises: a candidate utterance generator operable to employ a structured stochastic language model to generate from the utterance a plurality of candidate utterances in the target language, and to assign a score to each candidate utterance of the plurality of candidate utterances; and an optimizer, for selecting from the plurality of candidate utterances, a candidate utterance having an optimal score.
According to further features in preferred embodiments of the invention described below, the apparatus further comprises an input unit configured for accessing a database to obtain at least a partial source grammar characterizing the source language and at least a partial target grammar characterizing the target language, wherein the candidate utterance generator is configured to generate the plurality of candidate utterances based on the at least partial grammars. According to still further features in the described preferred embodiments the input unit is further configured for accessing the database to obtain translation rules defined by a set of mapping functions for mapping ordered sets of source grammar symbols to ordered sets of target grammar symbols, and the apparatus further comprises: a parser, configured for parsing the utterance to obtain an ordered set of source grammar symbols covering the utterance, thereby to provide an utterance cover in the at least partial source grammar; a mapping unit configured for mapping the utterance cover to an ordered set of target grammar symbols using the translation rules; and a probability assigner, configured for assigning prior probabilities in the structured stochastic language model for each target grammar symbol of the grammar symbols set.
According to still further features in the described preferred embodiments the apparatus further comprises a candidate utterance processing unit for processing the plurality of candidate utterances according to thematic criteria.
According to still further features in the described preferred embodiments the parser is further configured to parse each candidate utterance to obtain an ordered set of target grammar symbols covering the candidate utterance, thereby to provide a candidate utterance cover in the at least partial target grammar, and wherein the candidate utterance processing unit is configured for performing correspondence analysis to provide goodness of correspondence between the candidate utterance cover in the at least partial target grammar and the utterance cover in the at least partial source grammar.
According to still further features in the described preferred embodiments the probability assigner is further configured for obtaining conditional probabilities representing a discourse context of the utterance and replacing the prior probabilities with the conditional probabilities.
According to an additional aspect of the present invention there is provided a text processing system having a translator, the translator comprises the translation apparatus described herein.
According to yet an additional aspect of the present invention there is provided a text processing system having a style checker, the style checker comprises the translation apparatus described herein.
According to still an additional aspect of the present invention there is provided a voice command and control system comprises a voice input unit, an appliance and the translation apparatus described herein. The voice input unit is operable to receive a voice command in the source language and to convert the voice command to an utterance recognizable by the apparatus, and the apparatus is configured to translate the utterance to a target language recognizable by the appliance.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, suitable methods and materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.
Implementation of the method and system of the present invention involves performing or completing selected tasks or steps manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of preferred embodiments of the method and system of the present invention, several selected steps could be implemented by hardware or by software on any operating system of any firmware or a combination thereof. For example, as hardware, selected steps of the invention could be implemented as a chip or a circuit. As software, selected steps of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In any case, selected steps of the method and system of the invention could be described as being performed by a data processor, such as a computing platform for executing a plurality of instructions.
BRIEF DESCRIPTION OF THE DRAWINGS The invention is herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice. In the drawings:
FIG. 1 is a flowchart diagram of the method suitable for constructing translation rules according to various exemplary embodiments of the present invention;
FIG. 2 is a schematic illustration of an apparatus for constructing translation rules, according to various exemplary embodiments of the present invention;
FIG. 3 is a flowchart diagram of a method for extracting significant patterns from a dataset, according to various exemplary embodiments of the present invention;
FIGs. 4a-b are simplified illustrations of a structured graph (Figure 4a) and a random graph (Figure 4b), according to various exemplary embodiments of the present invention;
FIG. 5 illustrates a representative example of a portion of a graph with a search-path going through five vertices, according to various exemplary embodiments of the present invention;
FIG. 6 illustrates a pattern-vertex having three vertices which are identified as significant pattern of the trial path of Figure 5, according to various exemplary embodiments of the present invention;
FIG. 7 is a flowchart diagram of a method for defining equivalence classes, according to various exemplary embodiments of the present invention; FIG. 8a is a schematic illustration of a portion of a graph constructed for a corpus of text in which the tokens are words, according to various exemplary embodiments of the present invention;
FIG. 8b illustrates a generalized-vertex, defined for a similarity set having an equivalence class, according to various exemplary embodiments of the present invention;
FIG. 9a illustrates a portion of a graph in which an equivalence class is attributed to vertices identified as elements thereof, according to various exemplary embodiments of the present invention;
FIG. 9b illustrates an additional step of the method in which, once a particular path has been supplemented by an additional equivalence class, the graph or a portion thereof is rewired, by defining a generalized-vertex including the existing equivalence class and the newly attributed equivalence class, according to various exemplary embodiments of the present invention;
FIG. 9c illustrates the additional step of Figure 9b, with an optional modification in which the generalized-vertex also includes other vertices within a predetermined window, according to various exemplary embodiments of the present invention;
FIGs. 10a-d illustrate nested relationships between significant patterns and equivalence classes in a tree format, according to various exemplary embodiments of the present invention;
FIG. 11 is a schematic illustration of a grammar acquirer which can be used in the apparatus of Figure 2, according to various exemplary embodiments of the present invention;
FIG. 12 is a flowchart diagram of a method suitable for translating an utterance from a source language to a target language, according to various exemplary embodiments of the present invention;
FIG. 13 is a schematic illustration of apparatus suitable for translating an utterance from a source language to a target language, according to various exemplary embodiments of the present invention; and
FIG. 14 is a schematic illustration of a translation of an utterance sentence from English to Chinese, according to various exemplary embodiments of the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present embodiments comprise a method and apparatus which can be used for translation. Specifically, the present embodiments can be used to construct translation rules from a source language to a target language. The present embodiments can further be used for translating an utterance from a source language to a target language.
The principles and operation of a method and apparatus according to the present embodiments may be better understood with reference to the drawings and accompanying descriptions. Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting.
The method of the present embodiments can be embodied in many forms. For example, it can be embodied on a tangible medium such as a computer for performing the method steps. It can be embodied on a computer readable medium, comprising computer readable instructions for carrying out the method steps. It can also be embodied in an electronic device having digital computer capabilities arranged to run the computer program on the tangible medium or execute the instructions on a computer readable medium. For example, computer programs implementing the method can commonly be distributed to users on a distribution medium such as, but not limited to, a floppy disk or CD-ROM. From the distribution medium, the computer programs can be copied to a hard disk or a similar intermediate storage medium. The computer programs can be run by loading the computer instructions either from their distribution medium or their intermediate storage medium into the execution memory of the computer, configuring the computer to act in accordance with the method of this invention. All these operations are well-known to those skilled in the art of computer systems.
The method and apparatus of the present embodiments can be used in many applications. For example, a method and apparatus according to the present embodiments can be used to translate a body of text in one language, referred to herein as the source language, to a body of text in another language, referred to herein as the target language, in an intelligent phrasebook running on a personal digital assistant or a cellular telephone, for use by tourists, government officials, etc. A method and apparatus according to the present embodiments can also be used in automatic translation of online material, e.g., in frequently updated digital content, such as web feed format news feeds and the like. A method and apparatus according to the present embodiments can also be incorporated into chatrooms and other instant messaging software to provide automatic translation thereto. The present embodiments can also be operated in a monolingual mode. For example, a method and apparatus according to the present embodiments can be incorporated in text processing software to verify, correct or suggest alternative idioms. This is particularly useful for helping non-native speakers to ensure that their language is idiomatically correct. A method and apparatus according to the present embodiments can also be employed in an automatic learning system, whereby the user provides an utterance and the system verifies whether or not, or to what extent, the utterance is grammatically correct.
The present embodiments can also be used to provide automatic translation of an input stream in a source natural language into a semantically equivalent target stream of executable commands, e.g., in voice command and control systems. The translation of the present embodiments is also useful when the utterances in the source natural language are syntactically out-of-grammar but semantically accurate. Thus the method and apparatus of the present embodiments can be used as add-ons to existing voice command and control systems to allow the systems to execute commands in response to freely spoken utterances. For example, in a voice command and control system controlling a computer, a user can provide the utterance "make a hard copy" when the pre-defined wording of the command is "print."
The present embodiments can also be used for converting one formal language to another. For example, the present embodiments can convert a computer program written in a third-, fourth- or fifth-generation programming language to another computer program in a lower generation programming language.
The method and apparatus for constructing translation rules according to the present embodiments uses two datasets: a first dataset which includes sentences defined over a lexicon of tokens in the source language, and a second dataset which includes sentences defined over a lexicon of tokens in the target language. The two datasets are preferably parallel matched corpora, e.g., two translations of a given set of texts. The tokens of any of the lexicons can be of any type known in the art of language processing, including, without limitation, natural language words, computer programming language statements, machine executable commands, and the like. In various exemplary embodiments of the invention the tokens are natural language words.
It is expected that during the life of this patent many relevant sequential datasets will be developed, and the scope of the terms "token" and "sequence of tokens" is intended to include all such new technologies a priori.
The source and target languages are preferably, but not obligatorily, different languages, either of the same type (e.g., both natural languages) or of different types (e.g., the source language is a natural language and the target language is a machine command language). In embodiments in which a monolingual operation mode is employed, the source and target languages are identical.
Once the translation rules are constructed as further detailed hereinbelow, they are archived in a database and can be later used for translating utterances from the source language to the target language. The term "utterance" as used herein refers to any sequence of tokens from the lexicon of the respective language. Thus, in natural languages, an utterance can refer to a complete sentence (e.g., a sequence of words which includes a subject and a verb, as in most natural languages), a sentence fragment (e.g., a sequence of words which lacks a subject or a verb), a phrase, a concatenation of phrases, a statement and the like. In a computer programming language, an utterance can refer to a sequence of computer program statements, such as, but not limited to, a program block.
Referring now to the drawings, Figure 1 is a flowchart diagram of the method suitable for constructing translation rules according to various exemplary embodiments of the present invention. It is to be understood that, unless otherwise defined, the method steps described hereinbelow can be executed either contemporaneously or sequentially in many combinations or orders of execution. Specifically, the ordering of the flowchart diagrams is not to be considered as limiting. For example, two or more method steps, appearing in the following description or in the flowchart diagrams in a particular order, can be executed in a different order (e.g., a reverse order) or substantially contemporaneously. Additionally, several method steps described below are optional and may not be executed. The method begins at step 10 and continues to step 11 in which at least a partial source grammar GA characterizing the source language A and at least a partial target grammar GB characterizing the target language B are acquired. The grammars (or partial grammars) are in the form of sets of grammar symbols (a set of terminals and a set of nonterminals) and a set of production rules, as further explained in the Background section above. The grammars can be partial in the sense that the set of terminals can include only a part of the lexicon entries (e.g., those tokens that actually exist in the respective dataset or even a portion thereof). The term "grammar" as used herein refers to a grammar or a partial grammar.
The grammars can be read from a grammar database or can be acquired directly from the two datasets using any method known in the art. In various exemplary embodiments of the invention the grammars are context free grammars. Methods for acquiring a context free grammar from a dataset which are suitable for the present embodiments are found in International Patent Application No. PCT/IL2004/000704, U.S. Patent Nos. 6,836,760, 6,957,184 and 7,155,392 and U.S. Published Patent Application No. 20060085193, the contents of which are hereby incorporated by reference. Also contemplated are the methods disclosed in "Computational Grammatical Inference", Pieter Adriaans and Menno van Zaanen, Chapter 7 in "Innovations in Machine Learning: Theory and Applications", Lakhmi Jain and Dawn Holmes (eds.), pp. 187-203, 2006, ISBN: 3-540-30609-9, Springer Verlag; and "Grammatical Inference for Syntax-Based Statistical Machine Translation", Menno van Zaanen and Jeroen Geertzen, Proceedings of the International Colloquium on Grammatical Inference (ICGI), Tokyo, Japan, pp. 356-358, 2006.
A preferred technique for acquiring a context free grammar is provided hereinunder. The method continues to step 12 in which a set T of mapping functions is generated. Step 12 is executed by generating, for each sentence of the first dataset, a mapping function for mapping an ordered set of source grammar symbols which are associated with the sentence to an ordered set of target grammar symbols. The mapping functions can be generated using a machine readable dictionary, but, more preferably, the mapping functions are initialized by the machine readable dictionary and updated thereafter.
Thus, according to a preferred embodiment of the present invention each sentence sA of the first dataset is parsed using the source grammar GA to provide an ordered set LA ⊆ GA of source grammar symbols covering the sentence sA. The parsing is performed by a parser which receives sA and GA and determines how sA can be generated from GA. The result of the parsing is the ordered set LA. Similarly to sA, each sentence sB of the second dataset is parsed using the target grammar GB to provide an ordered set LB ⊆ GB of target grammar symbols covering the sentence sB. The ordered sets LA and LB preferably represent grammatical structures in the respective grammar. The mapping functions are then updated based on the obtained ordered sets. Formally, for all aj ∈ LA and for all bk ∈ LB, T is updated such that T(aj) → bk. As will be understood by one of ordinary skill in the art, since the sets LA and LB are obtained using the grammars GA and GB, the update of T is based on the grammars. When the ordered sets represent grammatical structures, the mapping functions preferably map between grammatical structures and not necessarily individual grammar symbols.
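By way of illustration only, the following Python sketch shows one way in which the set T could be accumulated from the ordered sets LA and LB obtained by parsing a parallel sentence pair. The count-based representation of T and the symbol names are assumptions introduced for this sketch; they are not mandated by the method.

```python
from collections import defaultdict

# T is represented here as a count table over (source symbol, target symbol)
# pairs; normalizing a row yields a probabilistic mapping T(aj) -> bk.

def update_mapping(T, LA, LB):
    """Update T(aj) -> bk for all aj in LA and all bk in LB."""
    for aj in LA:
        for bk in LB:
            T[aj][bk] += 1

T = defaultdict(lambda: defaultdict(int))

# A toy parallel pair: each sentence is represented by the ordered set of
# grammar symbols covering it (hypothetical symbol names).
LA = ["P101", "E12", "P47"]   # source-side cover of sA
LB = ["Q9", "F3", "Q22"]      # target-side cover of sB
update_mapping(T, LA, LB)

# The row of T for a source symbol, as a distribution over target symbols:
row = T["P101"]
total = sum(row.values())
print({bk: n / total for bk, n in row.items()})
```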
It is appreciated that when the grammars GA and GB are acquired probabilistically from the datasets, the obtained mapping functions have a probabilistic origin. In various exemplary embodiments of the invention the mapping functions provide a probabilistically weighted set of candidate grammar structures.
A further update of T is preferably performed by means of estimating the optimal correspondence between points in the two datasets. The advantage of such an update is that the use of correspondence information allows enforcing proper mapping of thematic relations. Additionally, the use of correspondence information allows linking both the associations of patterns in the two languages and the corresponding slots of patterns. Thus, when more than one possible mapping exists from the set LA to the set LB, the mapping that preserves the correspondence receives a higher likelihood as the correct mapping.
The optimal correspondence between points can be achieved using a technique known as the "distance spectrum" technique [see, e.g., Edelman, S. "Representation and recognition in vision", MIT Press, Cambridge, MA, 1999]. According to the presently preferred embodiment of the invention each grammar is represented as a metric space characterized by a distance function between every two tokens of the dataset. The value of the distance function is the length of the minimal "geodesic" of the metric space which passes through the two symbols. The optimal correspondence between points in the two metric spaces is ultimately estimated using the distance spectra in each space.
The procedure can be better understood by considering the following intuitive example: consider a map of the U.S. on which point-like locations of three cities, say Boston, New York and Los Angeles, are marked, but not named. Suppose further that the information about the relative distances among these cities is available. The latter is referred to as "the distance spectrum" of the point configuration. A second distance spectrum can also be computed for the map. By assigning to each city on the map its correct name, the two distance spectra can be optimally matched.
Returning to the present embodiments of the invention, the principle of distance spectra is preferably used for estimating the optimal set of mapping functions T, using the inter-symbol distances in the two datasets. Thus, a source probability matrix is preferably constructed from the source grammar, and a target probability matrix is preferably constructed from the target grammar, where the entries of the probability matrices represent co-occurrence probabilities of symbols in a respective grammar. Preferably, the probability matrices are updated once the sets LA and LB are known. Formally, for each pair (aj1, aj2) ∈ LA the source probability matrix P(aj1, aj2) has a probability entry representing the "geodesic distance" between the symbols aj1 and aj2 (low probability for large "geodesic distance" and high probability for small "geodesic distance"), and for each pair (bk1, bk2) ∈ LB the target probability matrix P(bk1, bk2) has a probability entry representing the "geodesic distance" between the symbols bk1 and bk2.
Once the probability matrices are calculated, the set T of mapping functions is preferably updated using the probability matrices. The entries in the probability matrices serve as prior probabilities for grammatical structures. Given an ordered set of source grammar symbols, the method can determine all the grammatical structures which are associated with the set, and use the source probability matrix to assign a prior to each structure. A similar procedure can be employed for the target grammar. The priors associated with the grammatical structures of each grammar can then be used for updating the mapping functions.
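As a minimal sketch of this step, and with a simple within-cover co-occurrence frequency standing in for the "geodesic distance" based entries described above, a probability matrix and the prior it induces on a candidate grammatical structure could be computed as follows (all symbol names are hypothetical):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_matrix(covers):
    """Estimate a co-occurrence probability for every pair of grammar
    symbols appearing in the same sentence cover."""
    pair_counts, n = defaultdict(int), 0
    for cover in covers:
        for x, y in combinations(cover, 2):
            pair_counts[frozenset((x, y))] += 1
            n += 1
    return {pair: count / n for pair, count in pair_counts.items()}

def structure_prior(structure, P, eps=1e-6):
    """Assign a prior to an ordered symbol set as the product of its
    pairwise co-occurrence probabilities (eps smooths unseen pairs)."""
    prior = 1.0
    for x, y in combinations(structure, 2):
        prior *= P.get(frozenset((x, y)), eps)
    return prior

covers = [["P1", "E2", "P3"], ["P1", "E2", "P4"], ["P1", "E5", "P3"]]
P = cooccurrence_matrix(covers)
print(structure_prior(["P1", "E2", "P3"], P))
```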
From step 12 the method continues to step 13 in which the set T and optionally the grammars GA and GB are archived in a database. The method ends at step 15. An exemplified pseudo source code for implementing the method on a data processor is provided in the Examples section that follows.
Reference is now made to Figure 2 which is a schematic illustration of an apparatus 20 for constructing translation rules, according to various exemplary embodiments of the present invention. Apparatus 20 can be used to execute selected steps of the method described above and illustrated in the flowchart diagram of Figure 1. Apparatus 20 preferably comprises a grammar acquirer 22 which acquires grammars GA and GB. Grammar acquirer 22 can acquire the grammars either by accessing a grammar database or directly from the two datasets generally shown at 36. A preferred configuration of acquirer 22 is provided hereinunder (see Figure 11 and the accompanying description). Apparatus 20 further comprises a mapping function generator 24 which provides the set T of mapping functions, by generating, for each sentence of the first dataset, a mapping function for mapping an ordered set of source grammar symbols associated with the sentence to an ordered set of target grammar symbols. Generator 24 can generate the set T using a machine readable dictionary. Generator 24 can also generate the set by parsing the sentences in the datasets. Thus, generator 24 preferably comprises a parser 26 which parses each sentence according to the respective grammar, as described above, and an update unit 34 which updates the set T according to the sets LA and LB. Optionally and preferably, generator 24 communicates with a probability matrix constructor 28 which constructs the probability matrices as further detailed hereinabove. In a preferred embodiment, generator 24 receives the matrices from constructor 28, and activates unit 34 to update T based on the matrices.
Apparatus 20 further comprises an archiving unit 30 for archiving the set T and optionally the grammars GA and GB in a database 32.
Following is a description of a preferred procedure for acquiring a grammar from a dataset. In various exemplary embodiments of the invention the procedure is employed on each of the two datasets to acquire the grammars GA and GB. Generally, the procedure involves the extraction of significant patterns from the dataset, followed by the definition of equivalence classes of tokens of the datasets. In various exemplary embodiments of the invention the nonterminals of the grammar are defined as significant patterns and/or equivalence classes of the dataset. The description below is by way of two methods 200 and 210 whereby method
200 extracts the significant patterns and method 210 defines the equivalence class. Thus, in various exemplary embodiments of the invention method step 11 above (see Figure 1) comprises the successive execution of methods 200 and 210.
Method 200 is described first with reference to the flowchart diagram of Figure 3 and method 210 is described hereinafter with reference to the flowchart diagram of Figure 7.
Method 200 comprises the following method steps, which are illustrated in the flowchart of Figure 3. Method 200 begins at step 201 and continues to step 202 in which overlaps between sentences (i.e., sequences of tokens of the dataset) are searched, considering each sentence of the dataset as a "trial-sequence" which is compared, segment by segment, to all other sentences. The terms "sentences" and "sequences" are referred to hereinbelow interchangeably.
This can be done, for example, by constructing a graph which represents the dataset. Such a graph may include a plurality of vertices and paths of vertices, where each vertex represents one token of the lexicon and each path of vertices represents a sentence of the dataset. Thus, according to a preferred embodiment of the present invention, for a lexicon of size n (say, n different words), there are n vertices on the graph. These n vertices are connected thereamongst by edges, preferably directed edges, in many combinations, depending on the sentences of the raw dataset on which the method of the presently preferred embodiment is applied.
The endpoints of each path of the graph are preferably marked, e.g., by adding marking vertices, such as a "begin" vertex before its first vertex and an "end" vertex after its last vertex. These marking vertices represent the beginning and end of the respective sentence of the dataset. For example, when the tokens are words of a natural language the "begin" and "end" vertices can be interpreted as regular expression tokens which are typically used by text editors to locate the endpoints of a sentence. Thus, each vertex which represents a token has at least one incoming path and at least one outgoing path, preferably an equal number of incoming and outgoing paths.
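A minimal Python sketch of such a graph construction, with "begin" and "end" marking vertices added to every path, might read as follows; keeping the paths explicitly, alongside the edge counts, is an implementation choice made so that path counts can be queried later:

```python
from collections import defaultdict

def build_graph(sentences):
    """One vertex per token plus 'begin'/'end' markers; each sentence of
    the dataset contributes one path of vertices."""
    edges = defaultdict(int)   # (u, v) -> number of sentence paths using it
    paths = []
    for s in sentences:
        path = ["begin"] + s + ["end"]
        paths.append(path)
        for u, v in zip(path, path[1:]):
            edges[(u, v)] += 1
    return edges, paths

sentences = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
edges, paths = build_graph(sentences)
print(edges[("the", "cat")], edges[("begin", "the")])
```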
Once the graph is constructed, overlaps between the paths thereof can be searched, for example, by considering different sub-paths of different lengths for each path and comparing these sub-paths with sub-paths of other paths of the graph. As the dataset inherently possesses some kind of structure, the constructed graph is not a random graph. Rather, the graph represents the structure of the dataset with the appearance of bundles of sub-paths, signifying a relatively high probability associated with a given sub-structure which can be identified as a motif. Figures 4a-b show simplified illustrations of a structured graph (Figure 4a) and a random graph (Figure 4b). Shown in Figures 4a-b is a plurality of vertices e1, e2, ..., e16, each representing one token of the lexicon. Referring to Figure 4a, of particular interest are vertex e1 and vertex e15, which are connected by many sub-paths of the graph, hence defining an overlap 32 therebetween. Method 200 preferably continues to step 203 in which a significance test is applied on the partial overlaps which are obtained in step 202. Significance tests are known in the art and can include, for example, statistical evaluation of flow quantities, such as, but not limited to, probability functions or conditional probability functions which characterize the partial overlaps between paths on the graph. According to a preferred embodiment of the present invention a set of probability functions is defined using the number of paths connecting particular vertices on the graph. For example, considering a single vertex, e1, on the graph, a probability, p(e1), can be defined as the number of paths leaving e1 divided by the total number of paths. Similarly, considering two vertices, e1 and e2, a (conditional) probability, p(e2 | e1), can be defined as the number of paths leading from e1 to e2 divided by the total number of paths leaving e1. This prescription is preferably applied to all combinations of vertices on the graph, defining, e.g., p(e1), p(e2 | e1), p(e3 | e1 e2), for paths leaving e1 and going through e2 and e3, and p(e1), p(e1 | e2), p(e1 | e2 e3), for paths going through e3 and e2 and entering e1. In terms of all the conditional probabilities, the graph can define a Markov model. Thus, a "search-path," of length K, going through vertices e1 e2 ... eK on the graph (corresponding to a trial-sequence of K tokens of the dataset), can be used to define a variable order Markov model up to order K, represented by the following matrix:
M(i,j) = p(ei | ej ej+1 ... ei-1) for j < i; M(i,i) = p(ei); and M(i,j) = p(ei | ei+1 ei+2 ... ej) for j > i, with i, j = 1, ..., K
(EQ. 1) For any sub-path of e1e2...eK having a length m < K, a similar Markov model can be obtained from an m × m diagonal sub-matrix of M. It will be appreciated that whereas the collection of all paths which represent a sentence of the dataset defines all the conditional probabilities appearing in M, the search-path e1e2...eK used in M does not necessarily represent a sentence of the dataset. The definition of the search-path is based on conditional probabilities, such as p(e2 | e1), which are predetermined by those paths which represent the sentences of the dataset.
An occurrence of a significant overlap (e.g., overlap 32 in Figure 4a) along a search-path can be identified by observing some extreme values of the relevant conditional probabilities. According to a preferred embodiment of the present invention, the probability functions comprise probability functions characterizing a rightward direction on each path and probability functions characterizing a leftward direction on each path. Thus, for a search-path e1e2...en...ek, a probability function, PR, characterizing a rightward direction, is preferably defined by the first column of M, moving top down, and a probability function, PL, characterizing a leftward direction, is preferably defined by the last column of M, moving bottom up. Specifically,
PR(n) = p(en | e1e2 ... en-1) and PL(n) = p(en | en+1en+2 ... ek). (EQ. 2) As will be appreciated by one ordinarily skilled in the art, both PR and PL vary between 0 and 1 and are specific to the path in question.
In terms of the number of paths, PR and PL can be understood by considering, for simplicity, that the path in question is e1e2e3e4 (K = 4). Hence, according to a preferred embodiment of the present invention, PR(3) = p(e3 | e1 e2), the rightward direction probability corresponding to the sub-path e1e2e3, equals the number of paths moving from e1 through e2 into e3 divided by the number of paths moving from e1 to e2, and PL(3) = p(e3 | e4), the leftward direction probability corresponding to the sub-path e3e4, equals the number of paths moving from e3 to e4 divided by the number of paths entering e4. It is convenient to define the aforementioned probabilities in the explicit notations PR(e1;e3) and PL(e4;e3), respectively.
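Under these definitions, PR and PL reduce to ratios of path counts. The following sketch computes them over the explicit path store of the earlier graph sketch; it is illustrative only:

```python
def count_containing(paths, segment):
    """Number of stored paths containing `segment` as a contiguous run."""
    k = len(segment)
    return sum(
        any(p[i:i + k] == segment for i in range(len(p) - k + 1))
        for p in paths
    )

def P_R(paths, prefix, nxt):
    """Fraction of the paths through `prefix` that continue into `nxt`."""
    denom = count_containing(paths, prefix)
    return count_containing(paths, prefix + [nxt]) / denom if denom else 0.0

def P_L(paths, suffix, prev):
    """Fraction of the paths through `suffix` that are entered from `prev`."""
    denom = count_containing(paths, suffix)
    return count_containing(paths, [prev] + suffix) / denom if denom else 0.0

paths = [["e1", "e2", "e3", "e4", "e5"], ["e0", "e2", "e3", "e4", "e6"]]
print(P_R(paths, ["e2", "e3"], "e4"))  # 1.0: every path through e2 e3 reaches e4
print(P_L(paths, ["e3", "e4"], "e2"))  # 1.0: every path into e3 e4 arrives via e2
```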
Figure 5 illustrates a representative example of a portion of a graph in which a search-path, going through e1e2e3e4e5 and marked with a "begin" vertex at its beginning and an "end" vertex at its end, is selected. Also shown in Figure 5 are other paths, joining and leaving the search-path at various vertices. The bundle of sub-paths between vertex e2 and vertex e4 displays certain coherence, possibly indicating the presence of a significant pattern in the dataset.
To illustrate the use of the probabilities PR and PL, the portion of the graph is positioned in a rectangular coordinate system in which the vertices are conveniently arranged along the abscissa while the ordinate represents probability values. Progressing from e1 rightwards, PR(n), n = 1, 2, 3, 4, 5, has the values 4/41, 3/4, 1, 1 and 1/3, respectively. Progressing from e4 leftwards, PL(n), n = 4, 3, 2, 1, has the values 6/41, 5/6, 1 and 3/5. Thus, PR first increases because some other paths join to form a coherent bundle, then decreases at e5, because many paths leave the path at e4. Similarly, progressing leftward, PL first increases because other paths join at e4 and then decreases at e1, because paths leave the path at e2. The decline of PR or PL is preferably interpreted as an indication of the end of the candidate pattern. The overlaps can be identified by requiring that the values of PR and PL within a candidate overlap are sufficiently large. Thus, a candidate overlap can be defined as a sub-sequence represented by a path or a sub-path on the graph in which PR > 1 - εR and PL > 1 - εL, where εR and εL are two parameters smaller than unity. A typical value for εR and εL is from about 0.01 to about 0.99. As used herein the term "about" refers to ± 10 %.
Optionally and preferably, the decrement of PR and PL can be quantified by defining decrease functions and comparing their values with predetermined cutoffs, hence to identify overlaps between paths or sub-paths. According to a preferred embodiment of the present invention, the decrease functions are defined as ratios between probabilities of paths having some common vertices. In the example shown in Figure 5 the decrement of PR at e4 can be quantified using a rightward direction decrease function, DR, defined as DR(e1;e4) = PR(e1;e5)/PR(e1;e4), and the decrement of PL at e2 can be quantified using a leftward direction decrease function, DL, defined as DL(e4;e2) = PL(e4;e1)/PL(e4;e2). Denoting the predetermined cutoffs by ηR and ηL, respectively, a partial overlap can be identified when both DR < ηR and DL < ηL. A typical value for both ηR and ηL is from about 0.4 to about 0.8.
Thus, the statistical significance of the decreases in PR and PL can be evaluated, for example, by defining their significance in terms of a null hypothesis and requiring that the corresponding p-values are, on the average, smaller than a predetermined threshold, α. A typical value for α is from 0.001 to 0.1.
The null hypothesis depends on the choice of the functions which characterize the overlaps. For example, when the ratios are used, the null hypothesis can be PR(e1;e5) > ηR PR(e1;e4) and PL(e4;e1) ≥ ηL PL(e4;e2). Alternatively, the null hypothesis can be PR > 1 - εR and PL > 1 - εL, or any other combination of the above conditions.
For a given search-path, PL and PR are preferably calculated from many starting points (such as e1 and e4 in the present example), more preferably from all starting points on the search-path, traversing each sub-path both leftward and rightward. This procedure defines many search-sections on the search-path, from which several partial overlaps can be identified. Once the partial overlaps have been identified, the most significant partial overlap is defined as a significant pattern. This step is designated in Figure 3 by Block 204. In an alternative, yet preferred, embodiment, a set of cohesion coefficients, c(i,j), i > j, is calculated, for each trial path, as follows: c(i,j) = M(i,j) log[M(i,j)/(M(i-1,j) M(i,j+1))] (EQ. 3), where M(i,j) are elements of the variable order Markov model matrix (see Equation 1).
For a given search-path there are many sub-paths, each represented by an element in the set c(i,j), which can be considered as an "overlap score." Once the set c(i,j) is calculated, its supremum is selected and the sub-path which corresponds to the supremum is preferably defined as the significant pattern of the search-path.
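A minimal sketch of EQ. 3 and of the supremum selection, operating on a precomputed matrix M (0-based indices; the numerical entries below are hypothetical), might read:

```python
import math

def cohesion(M, i, j):
    """c(i,j) = M(i,j) * log[ M(i,j) / (M(i-1,j) * M(i,j+1)) ], for i > j."""
    return M[i][j] * math.log(M[i][j] / (M[i - 1][j] * M[i][j + 1]))

def most_significant_subpath(M):
    """Return the sub-path (j, i) whose cohesion score is supremal."""
    K = len(M)
    best_score, best_ij = float("-inf"), None
    for j in range(K - 1):
        for i in range(j + 1, K):
            score = cohesion(M, i, j)
            if score > best_score:
                best_score, best_ij = score, (j, i)
    return best_ij, best_score

# A toy 4x4 variable order Markov matrix (hypothetical values).
M = [
    [0.10, 0.50, 0.40, 0.30],
    [0.60, 0.20, 0.70, 0.50],
    [0.90, 0.95, 0.15, 0.80],
    [0.30, 0.40, 0.85, 0.25],
]
print(most_significant_subpath(M))
```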
It is to be understood that it is not intended to limit the scope of the present invention to the above statistical significance tests, and that other significance tests as well as other probability functions or cohesion coefficients can be implemented.
The procedure in which overlaps are searched along a search-path is preferably repeated for more than one path of the original graph, more preferably for all the paths of the original graph (hence for all the sentences of the dataset). It will be appreciated that significant patterns can be found, depending on the degree by which the search-path overlaps with other paths.
Once constructed, the graph can be "rewired" by merging each, or at least a few, significant patterns into a new vertex, referred to hereinafter as a pattern-vertex. This is equivalent to a redefinition of the dataset whereby several tokens are grouped according to the significant patterns to which they belong. This rewiring process reduces the length of the paths of the graph; nonetheless the contents of the paths, in terms of the original sentences of the dataset, are conserved. As further detailed hereinunder, the rewiring can be used to define equivalence classes for the database. Thus, pattern-vertices can define nonterminals of the grammar.
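The rewiring itself amounts to replacing every occurrence of a significant pattern by a single pattern-vertex, as in the following illustrative sketch (the pattern-vertex name P42 is arbitrary):

```python
def rewire(paths, pattern, name):
    """Merge every occurrence of `pattern` (a list of vertices identified
    as a significant pattern) into one pattern-vertex called `name`."""
    k = len(pattern)
    rewired = []
    for p in paths:
        out, i = [], 0
        while i < len(p):
            if p[i:i + k] == pattern:
                out.append(name)   # the new pattern-vertex
                i += k
            else:
                out.append(p[i])
                i += 1
        rewired.append(out)
    return rewired

paths = [["e1", "e2", "e3", "e4", "e5"], ["e0", "e2", "e3", "e4", "e6"]]
print(rewire(paths, ["e2", "e3", "e4"], "P42"))
# both paths now pass through the single pattern-vertex P42
```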
In principle, the identification of the significant patterns can depend on other vertices of the search-path, and not only on the vertices belonging to the overlapping sub-paths. The extent of this dependence is dictated by the selected identification procedure (e.g., the choice of the probability functions, the significance test, etc.). Referring to the example of Figure 5, a sub-path e2e3e4 is defined as a significant pattern of the search-path "begin"→e1→...→e5→"end". By definition, the vertices e2, e3 and e4 also belong to other paths on the graph, each of which in turn can also be selected as a search-path along which partial overlaps are searched. Being dependent on other vertices of the search-path, the sub-path e2e3e4 may be accepted as a significant pattern for one search-path and may be rejected, on account of failing to pass the selected significance test, for another search-path.
The definition of the pattern-vertices of the graph can therefore be done in more than one way. In one embodiment, referred to hereinafter as the "context-sensitive embodiment," significant patterns are merged only on the path for which they turned out to be significant, while leaving the vertices unmerged on other paths.
In another embodiment, referred to hereinafter as the "context-free embodiment," after each search on each search-path, sub-paths which are identified as significant patterns are merged into a pattern-vertex, irrespective of whether or not these sub-paths are defined as significant patterns also in other paths. In still another embodiment, referred to hereinafter as the "single rewiring embodiment," after each search on each search-path, the sub-paths which are identified as significant patterns are merged into a pattern-vertex.
In yet another embodiment, referred to hereinafter as the "multiple rewiring embodiment," after each search on each search-path, the sub-paths which are identified as significant patterns are merged into pattern-vertices.
In a further embodiment, referred to hereinafter as the "batch rewiring embodiment," after all paths are searched, the sub-paths which are identified as significant patterns are merged into pattern-vertices. Figure 6 illustrates a pattern-vertex 42 having vertices e2, e3 and e4, which are identified as a significant pattern for the trial path of Figure 5. Note that vertices e2, e3 and e4 remain on the graph in addition to pattern-vertex 42, because, in the present example, there is a path which goes through e2 and e3 but not through e4, and a path which goes through e4 and e5 (see Figure 5) but not through e2 and e3. The rewiring procedure, as stated, can be used to define equivalence classes of tokens, allowing, for a given sentence, the replacement of one or more tokens of the sentence with other tokens which are members of the same equivalence class (see, e.g., J. G. Wolff, "Learning syntax and meanings through optimization and distributional analysis," in Y. Levy, I. M. Schlesinger and M. D. S. Braine, Ed., Categories and Processes in Language Acquisition, 179-215, Lawrence Erlbaum, Hillsdale, NJ, 1988).
For example, suppose that for a particular dataset an equivalence class, E, of two vertices, e3 and e6, is defined, i.e., E = {e3, e6}. Suppose further that among the sentences of the dataset there are two sentences, say, e1e2e3e4e5 and e1e2e6e4e7, which include the members of E. These sentences can be generalized to e1e2Ee4e5 and e1e2Ee4e7, which, in addition to the original sentences of the dataset, also include the new sentences e1e2e6e4e5 and e1e2e3e4e7, not necessarily present in the original dataset.
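A minimal sketch of this generalization, enumerating the sentences generated by a path once members of an equivalence class are made interchangeable, could be:

```python
import itertools

def generalize(sentence, eq_classes):
    """Replace each token that is a member of an equivalence class by the
    class itself, then enumerate all sentences the generalized path yields."""
    slots = []
    for tok in sentence:
        for members in eq_classes.values():
            if tok in members:
                slots.append(sorted(members))
                break
        else:
            slots.append([tok])
    return [list(s) for s in itertools.product(*slots)]

E = {"E": {"e3", "e6"}}
print(generalize(["e1", "e2", "e3", "e4", "e5"], E))
# the generalized path e1 e2 E e4 e5 yields the original sentence
# e1 e2 e3 e4 e5 together with the new sentence e1 e2 e6 e4 e5
```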
Method 200 ends at step 205. Following is a description of a method 210 suitable for defining equivalence classes so as to identify the nonterminals of the grammar. The method begins at step 211 and continues to step 212 in which significant patterns are extracted from the dataset, for example, using selected steps of method 200 as further detailed hereinabove. Preferably, once the significant patterns are extracted, the dataset is redefined, as stated, by grouping tokens thereof according to the significant pattern to which they belong. The method continues to step 213 in which the dataset is searched for similarity sets.
As used herein, "similarity set" refers to a plurality of segments of different sequences, preferably of equal size, having a predetermined number of common tokens and a predetermined number of uncommon tokens. As further detailed hereinunder, selected steps of method 210 can be represented mathematically as operations performed on a graph having vertices and paths where each vertex represent one token of the lexicon and each path represent a sentence of the dataset. In conjunction to a graph, "similarity set" refers to a plurality of paths sharing a predetermined number of vertices within a given window of vertices. Denoting the window size (or, equivalently, the size of the segment) by L and the number of unshared vertices within the L-size window (or, equivalently, the number of uncommon tokens in the Z-size segment) by S, the number of shared vertices (or common tokens) is L - S.
Figure 8a is a schematic illustration of a portion of a graph constructed for a corpus of text in which the tokens are words. Shown in Figure 8a is a similarity set 62 of four paths sharing three vertices within a window of four vertices. A similarity set can thus be considered as a kind of generalized search-path, which is allowed to branch at S given locations into other vertices of other paths sharing the prefix and suffix sub-paths of the original search-path within some limited window of a predetermined length, L. All the vertices at each branching location of the generalized search-path are collectively referred to hereinbelow as a slot of vertices. In the example shown in Figure 8a, similarity set 62 comprises L - S = 3 shared vertices within a window of size L = 4, hence having S = 1 slot (designated by numeral 64 in Figure 8a).
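For the S = 1 case illustrated in Figure 8a, the search for similarity sets can be sketched as follows; the data layout and the restriction to a single slot are assumptions of this example:

```python
from collections import defaultdict

def similarity_sets(paths, L):
    """Find S = 1 similarity sets: L-size windows that agree on L - 1
    positions and branch at a single slot into different vertices."""
    sets = defaultdict(set)
    for p in paths:
        for i in range(len(p) - L + 1):
            win = tuple(p[i:i + L])
            for slot in range(L):
                key = (win[:slot], slot, win[slot + 1:])  # shared context
                sets[key].add(win[slot])                  # branching vertex
    # keep only slots where at least two paths actually branch
    return {k: v for k, v in sets.items() if len(v) > 1}

paths = [["the", "cat", "is", "here"], ["the", "dog", "is", "here"]]
print(similarity_sets(paths, L=4))
# one similarity set: slot 1 with equivalence-class candidates {cat, dog}
```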
Referring now again to Figure 7, method 210 continues to step 214 in which the similarity sets are used for defining equivalence classes corresponding to slots of vertices which represent uncommon tokens of similarity sets. As each similarity set comprises a plurality of paths, the definition of the equivalence classes is preferably done using method 200 which, as stated, can be used for extracting one or more significant patterns from a search-path. Thus, according to a preferred embodiment of the present invention if a significant pattern emerges by searching along the generalized search-path, the set of all alternative vertices at the given location is defined as an equivalence class.
The significance test employed by method 200 when searching for significant patterns of a similarity set can be generalized by defining the probabilities for a path with an open slot in terms of probabilities of the individual paths which form the similarity set. For example, consider a window of size L = 3, composed of vertices e2, e3 and e4, with a slot at e3. The similarity set in this case consists of all the paths that share e2, e4 and branch into all possible vertices at location e3. According to a preferred embodiment of the present invention the probability P(e3|e2;e4) is defined as Σβ P(e3β|e2;e4), where each P(e3β|e2;e4) is calculated by considering a different path going through the corresponding e3β. Similarly, for e2, e3, e4 and e5 the probability P(e5|e2e3e4) is preferably defined as Σβ P(e5|e2;e3β;e4) and so on.
It will be appreciated that once an equivalence class is defined for a given path, the path is generalized, because, in addition to the original sentences that led to the existence of the equivalence class, other sentences can be generated from the path.
According to a preferred embodiment of the present invention the method may further comprise a step which is similar to the rewiring step introduced in method 200 above. More specifically, for each similarity set found to have at least one equivalence class therein, a generalized-vertex is defined, representing all vertices of a respective L-size window of the similarity set. Figure 8b illustrates a generalized-vertex 68, defined for a similarity set having an equivalence class 66. Generalized-vertex 68 preferably represents the vertices of equivalence class 66 as well as all the vertices of the L-size window used to define equivalence class 66. The rewiring of the graph can be done in any rewiring mode including, without limitation, multiple, single and batch rewiring modes, as further detailed hereinabove.
It will be appreciated that the definition of generalized-vertex 68, with its enclosed equivalence class 66, also generalizes all other paths participating in its definition. Thus, once the creation of equivalence classes is allowed, the dataset is generalized in the sense that many of its paths generate sentences that were not listed as sentences in the original dataset.
The generalization procedure can be taken one step further by allowing for multiple appearances of equivalence classes within a generalized-vertex, even when such equivalence classes were not found in the search for shared vertices within the L-size window. Hence, according to a preferred embodiment of the present invention the method continues to step 215 in which equivalence classes are attributed to individual members of previously defined equivalence classes. More specifically, in this embodiment each path is searched for vertices identified as members of previously defined equivalence classes. Once such a vertex is found, the respective equivalence class is attributed thereto. Figure 9a illustrates a portion of a graph in which an equivalence class 72 is attributed to vertices identified as elements thereof. Equivalence class 72 is adjacent to the existing equivalence class 66, hence forming, together with the other vertices of the L-size window, a further generalized path designated by numeral 74.
The attribution of the equivalence classes is preferably subjected to a generalization test, so as to prevent over-generalization of the dataset. This can be done, for example, by imposing a condition in which there is a sufficient number (say, larger than a generalization threshold, ω) of members of equivalence class 72 which already exist in path 74 at the time the aforementioned search is made. A typical value for the generalization threshold, ω, is from about 50 % to about 65 % of the size of the respective equivalence class (class 72 in the example of Figure 9a).
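A minimal sketch of this generalization test, with ω expressed as a fraction of the class size, might be:

```python
def passes_generalization_test(eq_class, tokens_on_path, omega=0.6):
    """Accept the attribution of `eq_class` to a path only if the fraction
    of class members already present reaches the threshold omega."""
    present = sum(1 for member in eq_class if member in tokens_on_path)
    return present / len(eq_class) >= omega

print(passes_generalization_test({"e3", "e6", "e9"}, ["e1", "e3", "e6", "e4"]))
# True: 2 of 3 members (about 67 %) already exist on the path
```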
In addition to the generalization test, the attribution of the equivalence classes can also be subjected to a significance test, e.g., one of the significance tests of method 200. More specifically, path 74 can be used as a generalized search-path on which method 200 can be employed for extracting one or more significant patterns. According to a preferred embodiment of the present invention, class 72 is attributed to path 74 if a significant pattern emerges by searching along path 74.
Reference is now made to Figures 9b-c, which are illustrations of an additional step of method 210, according to a preferred embodiment of the present invention. Hence, once a particular path has been supplemented by an additional equivalence class, the graph or a portion thereof can be rewired, again, by defining a generalized-vertex including the existing equivalence class, the newly attributed equivalence class and, optionally, other vertices of the respective L-size window. Similarly to the above rewiring procedure, this procedure can be done in any rewiring mode including, without limitation, multiple, single and batch rewiring modes, as further detailed hereinabove. Figure 9b illustrates a generalized-vertex 76, representing the vertices of equivalence class 66 and the vertices of equivalence class 72. Figure 9c illustrates a generalized-vertex 78, representing the vertices of equivalence class 66, the vertices of equivalence class 72 and the vertices of the L-size window used to define equivalence classes 66 and 72.
Preferably, the procedure of generalization and redefinition of the dataset is iteratively repeated. With each reiteration, new significant patterns and equivalence classes are defined in terms of previously defined significant patterns and equivalence classes as well as remaining tokens. These iterations are preferably performed over all sequences of the redefined dataset, time and again, until, say, no further significant patterns are found.
Thus, during the iterative process, the list of equivalence classes is updated continuously, and new significant patterns are found using the existing equivalence classes. For each set of candidate paths, the vertices are compared to one or more equivalence classes from the pool of existing equivalence classes. Because a vertex or a token can appear in several classes, different combinations of equivalence classes are checked, preferably while scoring each combination. The winner combination is preferably the largest class for which most of the members are found among the candidate paths in the set (the ratio between the number of members that have been found among the paths and the total number of members in the equivalence class is compared to the predetermined generalization threshold as one of the configuration acceptance criteria). If not all the members appear in an existing set, a new equivalence class can be created, with only those members that do. Thus, as the portion of the dataset that is processed increases, the dataset is enriched with new significant patterns and their accompanying equivalence classes, and the graph is bootstrapped with the pattern-vertices and generalized vertices. The recursive nature of this process allows method 210 to form more and more complex patterns, in a hierarchical manner.
Thus, the present embodiments enable the construction of a graph having many paths, in principle of the same order of magnitude as the original number of paths, yet its overall structure is much reduced, since many of the vertices and sub-paths are merged to pattern-vertices. The pattern-vertices that are left in the final format of the graph are referred to herein as "root-patterns." The set of all tokens, equivalence classes and significant patterns thus forms a context free grammar, whereby the terminals are the tokens, and the nonterminals are equivalence classes or significant patterns. It is commonly acceptable to represent the context free grammar hierarchically as a forest of multilevel trees. Each tree can represent a pattern of tokens of the generalized dataset, whereby child nodes, appearing on the leaf level of the tree, correspond to tokens, and parent nodes, appearing on the partition levels, correspond to significant patterns or equivalence classes. The tree representation specifies the relations between all the significant patterns and equivalence classes that appear in the tree and can therefore be considered as the set of rules of the context free grammar.
In general, any path on the graph can be represented as one root-pattern, or a set of consecutive root-patterns and some of the original tokens. To produce a sentence from a given path, each root-pattern is preferably considered in its tree format. The tree can be constructed to be read from top to bottom and from left to right, where, preferably, only one of the children of each equivalence class is selected to generate a sequence, appearing on the leaf-level of the tree.
Figures 10a-d illustrate nested relationships between significant patterns and equivalence classes in a tree format. Figure 10a shows a simple relationship of a sequence containing several tokens and one significant pattern (designated by blob 67 in Figure 10a) of two tokens. Such relationships are typically obtained in early iterations of the generalization procedure. A further reiteration is shown in Figure 10b, where significant pattern 67 is found to belong to another significant pattern, designated by blob 101 in Figure 10b, together with an equivalence class, designated by blob 98. Also shown in Figure 10b is an additional significant pattern 120 on the same partition level as significant pattern 101, parenting two equivalence classes, 70 and 66. Whereas equivalence class 70 is partitioned to child nodes on the leaf level of the tree, equivalence class 66 is partitioned to one child node and one parent node, representing another equivalence class, designated by blob 65. A typical final tree is shown in Figure 10c, where a root-pattern 144, parenting the aforementioned significant patterns 120 and 101, is left between the "begin" vertex and the "end" vertex of the graph from which the tree is constructed. An additional tree structure is illustrated in Figure 10d. Any such tree or forest structure thus represents a four-tuple G(VN, VT, P, S) in which the terminals VT are the child nodes on the leaf level of the tree, and the nonterminals VN are the parent nodes on the partition levels. Starting from the distinguished start symbol S, utterances can be produced by successive application of the production rules P thereto. Table 1 below is a representative example of a list of production rules suitable for producing utterances from the tree structure of Figure 10d.
Table 1: production rules for producing utterances from the tree structure of Figure 10d (reproduced in the source document as an image).
When applied to the terminals and nonterminals of the grammar illustrated in Figure 10d, the following candidate utterances can be produced: (i) "BEGIN George is working extremely far away END"; (ii) "BEGIN Cindy is playing really far away END"; (iii) "BEGIN Jim is living extremely far away END".
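Table 1 itself is reproduced as an image in the source document; the rule set below is therefore a hypothetical reconstruction, consistent with the three candidate utterances above but not necessarily identical to the actual rules of Figure 10d. The sketch produces utterances by successive application of the rules, starting from the start symbol S:

```python
import itertools

# Hypothetical production rules of the kind listed in Table 1.  Nonterminals
# (patterns P..., equivalence classes E...) expand to token sequences;
# equivalence classes offer alternative single children.
rules = {
    "S":  [["BEGIN", "E1", "is", "E2", "P1", "END"]],
    "E1": [["George"], ["Cindy"], ["Jim"]],
    "E2": [["working"], ["playing"], ["living"]],
    "P1": [["E3", "far", "away"]],
    "E3": [["extremely"], ["really"]],
}

def expand(symbol):
    """Yield every token sequence derivable from `symbol`."""
    if symbol not in rules:          # terminal
        yield [symbol]
        return
    for production in rules[symbol]:
        for parts in itertools.product(*map(list, map(expand, production))):
            yield [tok for part in parts for tok in part]

for utterance in itertools.islice(expand("S"), 3):
    print(" ".join(utterance))
```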
One ordinarily skilled in the art will appreciate that the generalization procedure of method 210 depends, in principle, on the order in which the paths are selected to be searched and rewired. Hence, one can construct a set of graphs which differ from each other by the path traversal order used in their construction. Each graph in the set corresponds to another generalized dataset.
According to a preferred embodiment of the present invention method 210 further comprises an optimization procedure in which selected steps (e.g., steps 213, 214 and 215) are repeated a plurality of times, while permuting a searching order of the similarity sets. Thus, a plurality of generalized datasets is obtained, each corresponding to a different generalization of the same input dataset.
Preferably, the optimization is achieved by calculating, for each generalized dataset, a generalization factor, which can be defined, for example, as a ratio between the number of sequences of the generalized dataset and the number of sequences of the original dataset. The optimal generalized dataset can be selected as the generalized dataset corresponding to the maximal generalization factor.
Alternatively, the optimization can be achieved by calculating, for each generalized dataset, a recall-precision pair. Recall and precision are effectiveness measures known in the art, in particular in the areas of data mining, database processing and information retrieval. Broadly, a recall value is the amount of relevant information (e.g., number of sequences) retrieved from the database divided by the amount of relevant information which exists in the database; and a precision value is the amount of relevant information retrieved from the database divided by the total amount of information which is retrieved. Hence, a large value of the precision and a small value of the recall correspond to low productivity, while a small value of the precision and a large value of the recall correspond to over-generalization. Thus, according to a preferred embodiment of the present invention the optimal generalized dataset is selected as the generalized dataset corresponding to an optimal combination (e.g., multiplication) of the precision and recall values.
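As a minimal sketch, the selection among the generalized datasets could combine the two measures multiplicatively (the candidate recall-precision values below are hypothetical):

```python
def generalization_score(recall, precision):
    """Combine recall and precision multiplicatively; the generalized
    dataset with the maximal score is selected as optimal."""
    return recall * precision

# (recall, precision) per path-traversal order, hypothetical values:
candidates = {"order-A": (0.80, 0.70), "order-B": (0.95, 0.40)}
best = max(candidates, key=lambda k: generalization_score(*candidates[k]))
print(best)  # order-A: 0.56 beats order-B: 0.38
```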
A particular feature of the present embodiment is the ability to make an educated guess as to the meaning of unfamiliar sentences, by considering the patterns that become active. More specifically, novel sentences can be characterized by distributed representations formed in terms of activities of existing patterns. Hence, according to a preferred embodiment of the present invention the activities of each sequence are calculated by propagating upwards on each pattern, preferably from its leaf level to its pattern-vertex. For example, denoting a novel sequence of length k by s1, ..., sk, the initial activities, aj, of the terminals ej can be probabilistically defined as aj = max over i = 1..k of {P(si,ej) log[P(si,ej)/(P(si)P(ej))]}, where P(si,ej) is the joint probability for both si and ej to appear in the same equivalence class, and P(si), P(ej) are, respectively, the probabilities of si and ej to appear in any equivalence class. For an equivalence class, the value propagated upwards is preferably the strongest non-zero activation of its members; for a pattern, it is preferably the average weight of the child nodes, on the condition that all the children are activated by adjacent inputs.
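A sketch of the initial activity computation follows; the probability tables P(si,ej) and P(·) are assumed to have been estimated beforehand, and the toy values are hypothetical:

```python
import math

def initial_activities(s, P_joint, P):
    """a_j = max over i of P(s_i, e_j) * log[ P(s_i, e_j) / (P(s_i) P(e_j)) ],
    computed for every lexicon entry e_j with a known marginal probability."""
    activities = {}
    for ej in P:
        best = 0.0
        for si in s:
            p = P_joint.get((si, ej), 0.0)
            if p > 0.0:
                best = max(best, p * math.log(p / (P[si] * P[ej])))
        activities[ej] = best
    return activities

P = {"cat": 0.10, "dog": 0.10, "beagle": 0.05}   # marginal probabilities
P_joint = {("beagle", "dog"): 0.04}              # joint probabilities
print(initial_activities(["the", "beagle", "runs"], P_joint, P))
```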
Method 210 ends at step 216. An exemplified pseudo source code for implementing methods 200 and 210 on a data processor is provided in the Examples section that follows. Reference is now made to Figure 11, which is a simplified illustration of grammar acquirer 22, which can be employed by apparatus 20 in various exemplary embodiments of the invention. Acquirer 22 can be used for executing selected steps of methods 200 and 210, and preferably comprises a constructor 82, for constructing a graph representing the dataset as further detailed hereinabove. Acquirer 22 further comprises a searcher 84, for searching for partial overlaps between each sentence and other sentences of the dataset, a testing unit 86, for applying significance tests on the partial overlaps, and a significant pattern definition unit 88, for defining the significant pattern of each sentence, as further detailed hereinabove. Acquirer 22 can further comprise a searcher 94, for searching the dataset for similarity sets, and an equivalence class definition unit 96, for defining equivalence classes as further detailed hereinabove.
Once constructed, the grammar can be stored in appropriate memory media for future use. The memory media can be any memory media known to those skilled in the art, capable of storing the generalized dataset either in a digital form or in an analog form. Preferably, but not exclusively, the memory is removable so as to allow plugging the memory into a host (e.g., a processing system), thereby allowing the host to store the generalized dataset in it or to retrieve the generalized dataset from it. Examples of memory media which may be used include, but are not limited to, disk drives (e.g., magnetic, optical or semiconductor), CD-ROMs, floppy disks, flash cards, compact flash cards, miniature cards, solid state floppy disk cards, battery-backed SRAM cards and the like.
According to a preferred embodiment of the present invention, the grammar is stored in the memory media in a retrievable format so as to provide accessibility to the stored data. It is appreciated that in all the above embodiments, the grammar can be stored in the memory media in an appropriate displayable format, either graphically or textually. Many displayable formats are presently known, for example, TEXT, BITMAP™, DBF™, TIFF™, DIB™, PALETTE™, RIFF™, PDF™, DVI™ and the like. However, it is to be understood that any other format that is presently known or will be developed during the lifetime of this patent is within the scope of the present invention.
The present embodiments successfully provide a method and apparatus for translating an utterance uA from a source language to a target language. Generally, the method and apparatus of the present embodiments translate an input utterance by employing a structured stochastic language model. The model is preferably a grammar-based generative model. The role of the stochastic language model according to the present embodiments is to generate all possible sequences of tokens and to assess the relative possibilities of the generated sequences by assigning a score to each such sequence. When the model is a grammar-based generative model, the generation of sequences is according to the production rules and grammar symbols dictated by the appropriate grammar. To this end, the language model uses the target grammar GB to generate candidate utterances in the target language. Thus, the ultimate result of the application of the stochastic language model according to the present embodiments is a list of candidate utterances ranked by their score. The scores obtained from the language model can be used for a comparison of competing candidate utterances. The candidate utterance having the optimal score is selected as the translation of the input utterance. The use of a grammar-based structured stochastic language model allows producing the entire candidate utterance, unlike, e.g., phrase-based techniques (see, e.g., Chiang, supra) in which a language model serves only for ranking the already structured results produced using phrase-level conditional probability information.
Reference is now made to Figure 12 which is a flowchart diagram of a method suitable for translating an input utterance from a source language to a target language, according to various exemplary embodiments of the present invention. In the exemplary illustration of Figure 12 the method begins at step 221 and optionally and preferably continues to step 222 in which the method accesses a database to obtain at least a partial source grammar GA and at least a partial target grammar GB. In various exemplary embodiments of the invention the grammars are produced and archived as described hereinabove (e.g., by means of methods 200 and 210 or using acquirer 22), but grammars generated by other means are not excluded from the scope of the present invention.
The method preferably continues to step 223 in which the database is accessed to obtain translation rules. In various exemplary embodiments of the invention the translation rules comprise the set T of mapping functions described above. According to a preferred embodiment of the present invention the method continues to step 224 in which the input utterance uA is parsed to obtain an ordered set LA of source grammar symbols as described above. In this embodiment, the method proceeds to step 225 in which LA is mapped using the mapping functions to an ordered set LB of target grammar symbols, and to step 226 in which, for each target grammar symbol b ∈ LB, the method assigns prior probabilities in a structured stochastic language model SSLMB defined over the target grammar GB.
According to a preferred embodiment of the present invention the method continues to step 227 in which prior probabilities in SSLMB are replaced with conditional probabilities P(b|D) representing a discourse context D of the input utterance. The information sources used to determine the discourse context D may include textual and extra-linguistic settings of uA.
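Under simplifying assumptions, the replacement of prior probabilities with conditional probabilities P(b|D) can be sketched in Python as a Bayesian update over the grammar symbols; the symbol names and likelihood values below are hypothetical.

    def condition_on_discourse(priors, likelihood_given_context):
        # Bayes update sketch: P(b|D) is proportional to P(D|b) * P(b),
        # renormalized over all symbols.
        post = {b: priors[b] * likelihood_given_context.get(b, 1e-9)
                for b in priors}
        z = sum(post.values())
        return {b: p / z for b, p in post.items()}

    # Hypothetical symbols: a weather-related discourse boosts weather patterns.
    priors = {"PATTERN_weather": 0.2, "PATTERN_sports": 0.8}
    context = {"PATTERN_weather": 0.9, "PATTERN_sports": 0.1}
    print(condition_on_discourse(priors, context))
    # -> {'PATTERN_weather': ~0.69, 'PATTERN_sports': ~0.31}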
The method continues to step 228 in which the structured stochastic language model is employed so as to generate a list UB of candidate utterances in the target language. The language model also assigns a score (typically a likelihood) to each candidate utterance uB ∈ UB. The method continues to step 230 in which the candidate utterance having the optimal score is selected as the translation of uA.
In various exemplary embodiments of the invention step 230 is preceded by a further processing step 229 in which the candidate utterances are processed according to ranking criteria, such as, but not limited to, a thematic fit. For example, in one embodiment, each candidate utterance uB ∈ UB is parsed to obtain an ordered set AB of target grammar symbols covering the candidate utterance. Subsequently a correspondence analysis is performed to provide goodness of correspondence between AB and LA. The goodness of correspondence can serve as a ranking criterion for the further processing, whereby the translation of uA is defined as the argmax of corresp(AB, LA). In another embodiment, the correspondence analysis is supplemented by a calculation of an overall probability P(uB) for each uB ∈ UB, and the translation of uA is defined as the argmax of the weighted sum β·P(uB) + (1 − β)·corresp(AB, LA).
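The weighted-sum criterion of the latter embodiment can be sketched in Python as follows; the candidate strings, probabilities and correspondence scores are hypothetical placeholders.

    def rerank(candidates, corresp, beta=0.5):
        # candidates: list of (utterance, model probability);
        # corresp(u) in [0, 1]: goodness of thematic correspondence with uA.
        scored = [(u, beta * p + (1 - beta) * corresp(u)) for u, p in candidates]
        return max(scored, key=lambda s: s[1])[0]

    cands = [("candidate one", 0.6), ("candidate two", 0.5)]
    closeness = {"candidate one": 0.2, "candidate two": 0.9}.get
    print(rerank(cands, closeness, beta=0.4))   # -> "candidate two"

In this demonstration the candidate with the higher model probability is outranked by the candidate with the better thematic correspondence, illustrating the effect of the weight β.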
The method ends at step 231. Exemplified pseudo source code for implementing the method on a data processor is provided in the Examples section that follows.
Reference is now made to Figure 13 which is a simplified illustration of an apparatus 240 for translating an input utterance uA from a source language to a target language. In various exemplary embodiments of the invention apparatus 240 comprises an input unit 242 which accesses a database to obtain the grammars GA, GB and the translation rules. Apparatus 240 can further comprise parser 26 for parsing the utterance uA to provide the ordered set LA as described above. Apparatus 240 can also comprise a mapping unit 244 which maps LA to the ordered set LB, and a probability assigner 246 which assigns the prior probabilities in SSLMB as further detailed hereinabove. Probability assigner 246 can also obtain conditional probabilities representing discourse context and replace the prior probabilities in SSLMB with the conditional probabilities.
Apparatus 240 further comprises a candidate utterance generator 248 which employs the structured stochastic language model and generates the list UB of candidate utterances as described above. An optimizer 250 selects from UB the candidate utterance having the optimal score. When additional processing is desired, apparatus 240 comprises a candidate utterance processing unit 252 which processes UB. In this embodiment, parser 26 preferably parses each candidate utterance, and unit 252 performs correspondence analysis, as described above.
Additional objects, advantages and novel features of the present invention will become apparent to one ordinarily skilled in the art upon examination of the following examples, which are not intended to be limiting. Additionally, each of the various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below finds experimental support in the following examples.
EXAMPLES
Reference is now made to the following examples, which together with the above descriptions illustrate the invention in a non-limiting fashion.
EXAMPLE 1
Following is a detailed algorithm which can be used, according to various exemplary embodiments of the present invention, for acquiring a grammar from a dataset. For a better understanding of the present embodiment, the algorithm is explained for the case in which the dataset is a corpus of text having a plurality of sentences defined over a lexicon of words. A non-limiting sketch of the rewiring operation employed throughout the algorithm is provided following step 5 below.
1. Initialization: load all sentences as paths onto a graph whose vertices are the unique words of the corpus.
2. Pattern Distillation: for each path
2.1 find the leading significant pattern: define the path as a search-path and perform method 200 on the search-path by considering all search segments (i, j), j > i, starting PR at ei and PL at ej; choose out of all segments the leading significant pattern, P, for the search-path; and
2.2 rewire graph: create a new vertex corresponding to P and replace the string of vertices comprising P with the new vertex P, using the context-free embodiment or the context-sensitive embodiment.
3. Generalization - First Step: for each path
3.1 slide a context window of size L along the search-path from its beginning vertex to its end; at each step i (i = 1, ..., K-L-1 for a path of length K) examine the generalized search-paths: for all j = i+1, ..., i+L-1 do:
3.1.1 define a slot at location j;
3.1.2 define the generalized path consisting of all paths that have identical prefix (at locations i to j-1) and identical suffix (at locations j+1 to i+L-1); and
3.1.3 execute method 200 on the generalized path;
3.2 choose the leading P for all searches performed on each generalized path;
3.3 for the leading P define an equivalence class E consisting of all the vertices that appeared in the relevant slot at location j of the generalized path; and
3.4 rewire graph: create a new vertex corresponding to P, and replace the string of vertices it subsumes with the new vertex P, using the context-free embodiment or the context-sensitive embodiment.
4. Generalization - Bootstrap: for each path
4.1 slide a context window of size L along the search-path from its beginning vertex to its end; at each step i (i = 1, ..., K-L-1 for a path of length K) do:
4.1.1 construct generalized search-paths: for all slots at locations j, j = i+1, ..., i+L-2, do:
(i) consider all possible paths through these slots; and
(ii) at each slot j compare the set of all encountered vertices to the list of existing equivalence classes, selecting the one E(j) that has the largest overlap with this set, provided it is larger than a minimum overlap ω;
4.1.2 reduce generalized search-path: for each k, k = i+1, ..., i+L-2 and all j, j = i+1, ..., i+L-1 such that j ≠ k, do:
(i) consider the paths going through all the vertices in k that belong to E(j) for all j; if no E(j) is assigned to a particular j, choose the vertex that appears on the original search-path at location j; and
(ii) execute method 200 on the resulting generalized path;
4.1.3 extract the leading P, which may include one new equivalence class E, or none; and
4.1.4 rewire graph: create a new vertex corresponding to P by replacing the string of vertices subsumed by P with the new vertex P, using either the context-free embodiment or the context-sensitive embodiment.
5. Reiteration:
Repeat step 4 until no further significant patterns are found.
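By way of a non-limiting illustration, the following Python sketch shows the rewiring operation referred to in steps 2.2, 3.4 and 4.1.4 above, in its context-free variant: each occurrence of the significant pattern is replaced by a single new vertex. The vertex naming scheme and the example paths are hypothetical.

    def rewire(paths, pattern):
        # Replace each occurrence of `pattern` (a tuple of vertices) with a
        # single new vertex representing P (context-free variant).
        new_vertex = "P_" + "_".join(pattern)
        width = len(pattern)
        rewired = []
        for path in paths:
            out, i = [], 0
            while i < len(path):
                if tuple(path[i:i + width]) == pattern:
                    out.append(new_vertex)
                    i += width
                else:
                    out.append(path[i])
                    i += 1
            rewired.append(out)
        return rewired

    paths = [["the", "cat", "is", "on", "the", "mat"],
             ["the", "dog", "is", "on", "the", "mat"]]
    print(rewire(paths, ("is", "on", "the", "mat")))
    # -> [['the', 'cat', 'P_is_on_the_mat'], ['the', 'dog', 'P_is_on_the_mat']]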
EXAMPLE 2
Following is a detailed algorithm which can be used, according to various exemplary embodiments of the present invention, for constructing translation rules. The algorithm is provided by way of a pseudo code for the case in which the dataset is a corpus of text; a non-limiting sketch of selected steps is provided following the pseudo code.
Require: Two context-free grammars GA = {ai}, GB = {bk}.
// Each grammar (set of terminals and nonterminals) can be acquired by the algorithm of Example 1.
Require: Two parallel matched corpora [the corpus notation appears as an image in the original and is not reproduced].
do: initialize set T using a bilingual machine-readable dictionary
// PASS 1: update T with parallel-corpus data;
update the probability matrices P(aj1, aj2) and P(bk1, bk2):
[the update rules appear as images in the original and are not reproduced]
update the mapping functions T using distance spectrum relaxation, with [two expressions shown as images in the original] as the corresponding "distance" matrices.
end for
end for
end for
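For illustration, the following Python sketch shows two elementary ingredients of the above algorithm under simplifying assumptions: initialization of the mapping set T from a bilingual machine-readable dictionary, and accumulation of a symbol co-occurrence probability matrix from a parsed corpus. The dictionary entries and symbol names are hypothetical, and the distance spectrum relaxation step is not implemented here.

    from collections import defaultdict

    def init_mapping(dictionary):
        # Initialize T(a, b) from a bilingual machine-readable dictionary:
        # uniform weight over the listed translations of each source symbol.
        T = defaultdict(dict)
        for a, translations in dictionary.items():
            for b in translations:
                T[a][b] = 1.0 / len(translations)
        return T

    def cooccurrence(parsed_corpus):
        # PASS 1 helper: co-occurrence counts of grammar symbols within each
        # sentence, normalized into a probability matrix over symbol pairs.
        counts, total = defaultdict(int), 0
        for symbols in parsed_corpus:
            for i, s1 in enumerate(symbols):
                for s2 in symbols[i + 1:]:
                    counts[(s1, s2)] += 1
                    total += 1
        return {pair: n / total for pair, n in counts.items()}

    T = init_mapping({"cat": ["mao"], "mat": ["dianzi", "xizi"]})
    P_A = cooccurrence([["the", "cat", "E_noun"], ["the", "mat", "E_noun"]])
    print(T["mat"])                  # -> {'dianzi': 0.5, 'xizi': 0.5}
    print(P_A[("the", "E_noun")])    # -> ~0.333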
EXAMPLE 3
Following is a detailed algorithm which can be used, according to various exemplary embodiments of the present invention, for translating an utterance from a source language to a target language. The algorithm is provided by way of a pseudo code for the case in which the dataset is a corpus of text; a non-limiting sketch summarizing the flow of the pseudo code is provided thereafter.
Require: Two context-free grammars GA = {ai}, GB = {bk};
// Each grammar (set of terminals and nonterminals) can be acquired by the algorithm of Example 1.
Require: set of mapping functions T(a, b)
// The set T can be estimated by the algorithm of Example 2
Require: an input utterance uA in the source language
do: LA ⇐ parse(uA);
do: determine the discourse context D from LA and any other information source;
do: LB ⇐ T(LA); // Map the list LA into its counterpart LB using T
for all bj ∈ LB do: initialize the prior attached to bj in SSLMB end for
for all bt ∈ GB do: update the prior of bt using P(bt|D); end for
do: run SSLMB starting with the priors computed above, to generate a list UB of candidate utterances ranked by likelihood;
// Post-process (re-rank) UB using additional criteria such as thematic fit
for all um = (b1, ..., bn) ∈ UB do
P(um) ⇐ Πn P(bn)
C(um, uA) ⇐ corresp(parse(um), parse(uA), T)
// Goodness of thematic correspondence
end for
do: uB ⇐ argmax over um of [β·P(um) + (1 − β)·C(um, uA)]
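The flow of the above pseudo code can be summarized, under simplifying assumptions, by the following Python sketch. The parser, the mapping set T, the language model and the correspondence function are supplied as hypothetical stub components, and the mapping step keeps only the most probable counterpart of each source symbol rather than maintaining multiple alternatives.

    def translate(u_A, parse_A, T, language_model, parse_B, corresp, beta=0.5):
        # Mirrors the pseudo code above; all callables are assumptions
        # supplied by the surrounding system.
        L_A = parse_A(u_A)                                  # source-grammar cover
        L_B = [max(T[a], key=T[a].get) for a in L_A if a in T]   # map via T
        best, best_score = None, float("-inf")
        for u_B, p in language_model(L_B):                  # (candidate, likelihood)
            score = beta * p + (1 - beta) * corresp(parse_B(u_B), L_A)
            if score > best_score:
                best, best_score = u_B, score
        return best

    # Toy demonstration with stub components (all hypothetical):
    T = {"cat": {"mao": 1.0}, "mat": {"dianzi": 1.0}}
    lm = lambda L_B: [(" ".join(L_B), 0.8), (" ".join(reversed(L_B)), 0.2)]
    same = lambda cover, L_A: 1.0 if len(cover.split()) == len(L_A) else 0.0
    print(translate("cat mat", str.split, T, lm, lambda u: u, same))
    # -> "mao dianzi"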
EXAMPLE 4
Figure 14 exemplifies a translation of an utterance from a source language (English in the present Example) to a target language (Chinese in the present Example).
In Figure 14, solid lines enclose and point to evoked terminals, dot-dashed lines enclose and point to evoked patterns (grammatical structures), and thick arrows represent translation path selected according to a preferred embodiment of the present invention.
The left panel of Figure 14 represents the source grammar and the right panel represents the target grammar. A few terminals, patterns and equivalence classes are shown for each grammar. The parsing of the input utterance "The cat is on the mat" evokes source grammar terminals shown in bold in the left panel. The source grammar terminals are mapped to corresponding target grammar terminals on the right panel. Patterns such as "the " and "is on " and equivalence classes are also mapped to their counterparts. Due to polysemy and initial structural ambiguity, a source grammar symbol can be mapped to more than one target grammar symbol. The disambiguation is enforced by the context via the interaction between the target elements and the target language model. Given an ordered set of target grammar symbols, the target language model constructs the most probable utterance that is consistent with the source language utterance.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination.
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims. All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention.

Claims

WHAT IS CLAIMED IS:
1. A method of constructing translation rules using a first dataset of sentences in a source language and a second dataset of sentences in a target language, the translation rules being for translating utterances of the source language into utterances of the target language, the method comprising: acquiring at least a partial source grammar characterizing the source language and at least a partial target grammar characterizing the target language; for each sentence of the first dataset, generating a mapping function for mapping an ordered set of source grammar symbols being associated with said sentence to an ordered set of target grammar symbols, thereby providing a set of mapping functions; and archiving said set of mapping functions in a database, thereby constructing the translation rules.
2. The method of claim 1, wherein said generating said mapping function is by a machine readable dictionary.
3. The method of claim 1, further comprising archiving said at least partial source grammar and said at least partial target grammar in said database.
4. The method of claim 1, wherein said generating said mapping function comprises, for each dataset of the first and second datasets, parsing each sentence of said dataset to provide an ordered set of symbols covering said sentence, and updating said mapping functions based on said ordered set of symbols.
5. The method of claim 1, further comprising, constructing a source probability matrix from the source grammar and a target probability matrix from the target grammar, each probability matrix comprising a plurality of entries representing co-occurrence probabilities of symbols in a respective grammar.
6. The method of claim 5, further comprising, for each dataset of the first and second datasets, parsing each sentence of said dataset to provide an ordered set of symbols covering said sentence, and updating a respective probability matrix using said ordered set of symbols.
7. The method of claim 6, further comprising updating said set of mapping functions using said probability matrices.
8. The method of claim 1, wherein each of said source grammar and said target grammar independently comprises: terminals being associated with tokens of a lexicon characterizing said dataset and nonterminals being associated with equivalence classes of tokens of said lexicon and/or significant patterns of a respective dataset.
9. The method of claim 8, wherein said acquiring said source grammar and said target grammar, comprises for each sentence of the first dataset and for each sentence of the second dataset, searching for partial overlaps between said sentence and other sentences of the respective dataset, applying a significance test on said partial overlaps, and defining a most significant partial overlap as a significant pattern of said sentence, thereby extracting significant patterns from the first and the second datasets, thereby acquiring nonterminals for said source grammar and said target grammar.
10. The method of claim 8, wherein said acquiring said source grammar and said target grammar, comprises: for each dataset of the first dataset and the second dataset: searching over said dataset for similarity sets, each similarity set comprising a plurality of segments of size L having L-S common tokens and S uncommon tokens, each of said plurality of segments being a portion of a different sentence of said dataset; and defining a plurality of equivalence classes corresponding to uncommon tokens of at least one similarity set, thereby acquiring nonterminals for said source grammar and said target grammar.
11. The method of claim 10, wherein said definition of said plurality of equivalence classes comprises, for each segment of each similarity set: extracting a significant pattern corresponding to a most significant partial overlap between said segment and other segments or combination of segments of said similarity set, thereby providing, for each similarity set, a plurality of significant patterns; and using said plurality of significant patterns for classifying tokens of said similarity set into at least one equivalence class; thereby defining said plurality of equivalence classes.
12. The method of claim 11, wherein said classification of said tokens comprises, selecting a leading significant pattern of said similarity set, and defining uncommon tokens of segments corresponding to said leading significant pattern as an equivalence class.
13. The method of claim 11, further comprising, prior to said search for said similarity sets: extracting a plurality of significant patterns from the dataset, each significant pattern of said plurality of significant patterns corresponding to a most significant partial overlap between one sentence of the dataset and other sentences of the dataset; and for each significant pattern of said plurality of significant patterns, grouping at least a few tokens of said significant pattern, thereby redefining the dataset.
14. The method of claim 13, further comprising constructing a graph having a plurality of paths representing the dataset, wherein each extraction of significant pattern is by searching for partial overlaps between paths of said graph.
15. The method of claim 14, wherein said graph comprises a plurality of vertices, each representing one token of the lexicon, and further wherein each path of said plurality of paths comprises a sequence of vertices respectively corresponding to one sentence of the dataset.
16. The method of claim 14, further comprising calculating, for each path, a set of probability functions characterizing said partial overlaps.
17. Apparatus for constructing translation rules using a first dataset of sentences in a source language and a second dataset of sentences in a target language, the translation rules being for translating utterances of the source language into utterances of the target language, the apparatus comprising: a grammar acquirer, for acquiring at least a partial source grammar characterizing the source language symbols and at least a partial target grammar characterizing the target language; a mapping function generator for generating, for each sentence of the first dataset, a mapping function for mapping an ordered set of source grammar symbols being associated with said sentence to an ordered set of target grammar symbols, thereby to provide a set of mapping functions; and an archiving unit for archiving said set of mapping functions in a database.
18. The apparatus of claim 17, wherein said mapping function generator is operable to generate said mapping function using a machine readable dictionary.
19. The apparatus of claim 17, wherein said archiving unit is operable to archive said at least partial source grammar and said at least partial target grammar in said database.
20. The apparatus of claim 17, wherein said mapping function generator comprises a parser for parsing each sentence of the datasets to provide an ordered set of symbols covering said sentence, and a mapping function update unit for updating said set of mapping functions using said ordered set of symbols.
21. The apparatus of claim 17, further comprising, a probability matrix constructor, for constructing a source probability matrix from the source grammar and a target probability matrix from the target grammar, each probability matrix comprising a plurality of entries representing co-occurrence probabilities of symbols in a respective grammar.
22. The apparatus of claim 21, wherein said mapping function generator comprises a parser for parsing each sentence of the datasets to provide an ordered set of symbols covering said sentence, and wherein said probability matrix constructor is operable to update said probability matrices using said ordered set of symbols.
23. The apparatus of claim 22, wherein said mapping function generator comprises a mapping function update unit for updating said set of mapping functions using said probability matrices.
24. The apparatus of claim 17, wherein each of said set of source grammar symbols and said set of target grammar symbols independently comprises: terminals being associated with tokens of a lexicon characterizing said dataset and nonterminals being associated with equivalence classes of tokens of said lexicon and/or significant patterns of a respective dataset.
25. The apparatus of claim 24, wherein said grammar acquirer comprises: a searcher, for searching, for each sentence of the first dataset and for each sentence of the second dataset, partial overlaps between said sentence and other sentences of the respective dataset; a testing unit, for applying a significance test on said partial overlaps; and a definition unit, for defining a most significant partial overlap as a significant pattern of said sentence, thereby to acquire nonterminals for said source grammar and said target grammar.
26. The apparatus of claim 24, wherein said grammar acquirer comprises: a searcher, for searching over each dataset of the first dataset and the second dataset for similarity sets, each similarity set comprising a plurality of segments of size L having L-S common tokens and S uncommon tokens, each of said plurality of segments being a portion of a different sentence of the respective dataset; and a definition unit, for defining a plurality of equivalence classes corresponding to uncommon tokens of at least one similarity set, thereby to acquire nonterminals for said source grammar and said target grammar.
27. The apparatus of claim 26, further comprising an extractor, capable of extracting, for a given set of sentences, a significant pattern corresponding to a most significant partial overlap between one sentence of said dataset and other sequences of said dataset, thereby providing, for said given set of sentences, a plurality of significant patterns.
28. The apparatus of claim 27, wherein said given set of sentences is a similarity set, hence said plurality of significant patterns corresponds to said similarity set.
29. The apparatus of claim 28, wherein said definition unit comprises a classifier, capable of classifying tokens of said similarity set into at least one equivalence class using said plurality of significant patterns.
30. The apparatus of claim 29, wherein said classifier is designed for selecting a leading significant pattern of said similarity set, and defining uncommon tokens of segments corresponding to said leading significant pattern as an equivalence class.
31. The apparatus of claim 27, wherein said given set of sentences is the dataset, hence said plurality of significant patterns corresponds to the dataset.
32. The apparatus of claim 27, further comprising a constructor, for constructing a graph having a plurality of paths representing the dataset.
33. The apparatus of claim 32, wherein said extractor is designed to search for partial overlaps between paths of said graph.
34. The apparatus of claim 33, wherein said graph comprises a plurality of vertices, each representing one token of the lexicon, and further wherein each path of said plurality of paths comprises a sequence of vertices respectively corresponding to one sentence of the dataset.
35. The apparatus of claim 33, further comprising electronic-calculation functionality for calculating, for each path, a set of probability functions characterizing said partial overlaps.
36. A method of translating an utterance from a source language to a target language, comprising: employing a structured stochastic language model so as to generate from the utterance a plurality of candidate utterances in the target language, and so as to assign a score to each candidate utterance of said plurality of candidate utterances; and selecting a candidate utterance having an optimal score, thereby translating the utterance from the source language to the target language.
37. The method of claim 36, further comprising accessing a database to obtain at least a partial source grammar characterizing the source language and at least a partial target grammar characterizing the target language, wherein said generation of said plurality of candidate utterances is based on said at least partial grammars.
38. The method of claim 37, further comprising, prior to said generation of said plurality of candidate utterances: accessing said database to obtain translation rules defined by a set of mapping functions for mapping ordered sets of source grammar symbols to ordered sets of target grammar symbols; parsing the utterance to obtain an ordered set of source grammar symbols covering the utterance, thereby providing an utterance cover in said at least partial source grammar; using said translation rules for mapping said utterance cover to an ordered set of target grammar symbols; and assigning prior probabilities in said structured stochastic language model for each target grammar symbol of said grammar symbols set.
39. The method of claim 38, further comprising prior to said selection of said candidate utterance, processing said plurality of candidate utterances according to additional ranking criteria.
40. The method of claim 39, wherein said processing comprises, for each candidate utterance of said plurality of candidate utterances: parsing said candidate utterance to obtain an ordered set of target grammar symbols covering said candidate utterance, thereby providing a candidate utterance cover in said at least partial target grammar; and performing correspondence analysis to provide goodness of correspondence between said candidate utterance cover in said at least partial target grammar and said utterance cover in said at least partial source grammar.
41. The method of claim 38, wherein said employing said stochastic language model comprises replacing said prior probabilities with conditional probabilities representing a discourse context of the utterance.
42. Apparatus for translating an utterance from a source language to a target language, comprising: a candidate utterance generator operable to employ a structured stochastic language model to generate from the utterance a plurality of candidate utterances in the target language, and to assign a score to each candidate utterance of said plurality of candidate utterances; and an optimizer, for selecting from said plurality of candidate utterances, a candidate utterance having an optimal score.
43. The apparatus of claim 42, further comprising an input unit configured for accessing a database to obtain at least a partial source grammar characterizing the source language and at least a partial target grammar characterizing the target language, wherein said candidate utterance generator is configured to generate said plurality of candidate utterances based on said at least partial grammars.
44. The apparatus of claim 43, wherein said input unit is further configured for accessing said database to obtain translation rules defined by a set of mapping functions for mapping ordered sets of source grammar symbols to ordered sets of target grammar symbols, and the apparatus further comprises: a parser, configured for parsing the utterance to obtain an ordered set of source grammar symbols covering the utterance, thereby to provide an utterance cover in said at least partial source grammar; a mapping unit configured for mapping said utterance cover to an ordered set of target grammar symbols using said translation rules; and a probability assigner, configured for assigning prior probabilities in said structured stochastic language model for each target grammar symbol of said grammar symbols set.
45. The apparatus of claim 44, further comprising a candidate utterance processing unit for processing said plurality of candidate utterances according to thematic criteria.
46. The apparatus of claim 45, wherein said parser is further configured to parse each candidate utterance to obtain an ordered set of target grammar symbols covering said candidate utterance, thereby to provide a candidate utterance cover in said at least partial target grammar, and wherein said candidate utterance processing unit is configured for performing correspondence analysis to provide goodness of correspondence between said candidate utterance cover in said at least partial target grammar and said utterance cover in said at least partial source grammar.
47. The apparatus of claim 44, wherein said probability assigner is further configured for obtaining conditional probabilities representing a discourse context of the utterance and replacing said prior probabilities with said conditional probabilities.
48. A text processing system having a translator, the translator comprising the apparatus of any of claims 42-47.
49. A text processing system having a style checker, the style checker comprising the apparatus of any of claims 42-47.
50. A voice command and control system comprising a voice input unit, an appliance and the apparatus of any of claims 42-47, said voice input unit being operable to receive a voice command in the source language and to convert said voice command to an utterance recognizable by the apparatus, and said apparatus being configured to translate the utterance to a target language recognizable by said appliance.