US8560297B2 - Locating parallel word sequences in electronic documents - Google Patents

Locating parallel word sequences in electronic documents

Info

Publication number
US8560297B2
Authority
US
United States
Prior art keywords
word
electronic document
word sequence
sequences
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US12/794,778
Other versions
US20110301935A1 (en)
Inventor
Christopher Brian Quirk
Kristina N. Toutanova
Jason Robert Smith
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp
Priority to US12/794,778
Assigned to MICROSOFT CORPORATION (assignors: QUIRK, CHRISTOPHER BRIAN; SMITH, JASON ROBERT; TOUTANOVA, KRISTINA N.)
Publication of US20110301935A1
Application granted
Publication of US8560297B2
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC (assignor: MICROSOFT CORPORATION)
Legal status: Active
Adjusted expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/42: Data-driven translation
    • G06F 40/45: Example-based machine translation; Alignment
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition

Definitions

  • Distortion features capture the difference between the positions of the previously aligned and the currently aligned word sequences.
  • a first set of distortion features can bin such distances, while a second set can consider the absolute difference between an expected position (one after the previously aligned word sequence) and the actual position.
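  • As an illustration only, these two sets of distortion features might be computed as in the following minimal Python sketch; the bin boundaries and feature names are hypothetical and not taken from the patent.

    def distortion_features(prev_pos, curr_pos):
        """Distortion features for a pair of consecutively aligned word sequences."""
        feats = {}
        jump = curr_pos - prev_pos
        # First set: bin the jump distance (bin boundaries are illustrative).
        if jump == 1:
            feats["distortion_bin"] = "monotone"
        elif jump <= 0:
            feats["distortion_bin"] = "backward"
        elif jump <= 5:
            feats["distortion_bin"] = "small_forward"
        else:
            feats["distortion_bin"] = "large_forward"
        # Second set: absolute difference between the expected position
        # (one after the previously aligned word sequence) and the actual one.
        feats["abs_distortion"] = abs(curr_pos - (prev_pos + 1))
        return feats

    # Previous source sequence aligned to target 3, current one to target 7.
    print(distortion_features(3, 7))  # {'distortion_bin': 'small_forward', 'abs_distortion': 3}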
  • Features derived from an online collaborative encyclopedia can include a number of matching links in a particular word sequence pair of interest. Such links can be weighted by their inverse frequency in the document, such that a link that appears often does not contribute much to such feature's value.
  • Another feature that can be derived from the markup of the online collaborative encyclopedia is an image feature, which can be set to a particular value when two word sequences are captions of the same image.
  • a list feature can have a particular value when two word sequences in a word sequence pair are both items in a list. The image feature and the list feature can fire with a negative value when the feature matches on one word sequence and not the other. None of the above features fire on a null alignment. There is also a bias feature for these two models which fires on all non-null alignments.
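  • The markup-derived features above might be computed along the lines of the following sketch. The Seq record and the feature names are assumptions made for illustration; the patent does not prescribe this representation.

    from dataclasses import dataclass, field
    from typing import Optional, Set

    @dataclass
    class Seq:
        """Toy word sequence record with markup information."""
        links: Set[str] = field(default_factory=set)  # pages the sequence links to
        image: Optional[str] = None                   # image captioned, if any
        in_list: bool = False                         # is the sequence a list item?

    def markup_features(src: Seq, tgt: Seq, doc_link_counts: dict) -> dict:
        feats = {}
        # Matching links, weighted by inverse frequency in the document so that
        # a link appearing often contributes little to the feature's value.
        matching = src.links & tgt.links
        if matching:
            feats["link_match"] = sum(1.0 / doc_link_counts[l] for l in matching)
        # Image feature: positive when both sequences caption the same image,
        # negative when only one of the two sequences is an image caption.
        if src.image is not None and src.image == tgt.image:
            feats["image"] = 1.0
        elif (src.image is None) != (tgt.image is None):
            feats["image"] = -1.0
        # List feature, with the same positive/negative firing pattern.
        if src.in_list and tgt.in_list:
            feats["list"] = 1.0
        elif src.in_list != tgt.in_list:
            feats["list"] = -1.0
        # Bias feature that fires on all non-null alignments.
        feats["bias"] = 1.0
        return feats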
  • word-level induced lexicon features can also be extracted by the feature extractor component 114.
  • a lexicon model can be built using an approach similar to ones developed for unsupervised lexicon induction from monolingual or comparable document corpora. This lexicon model and its utilization by the feature extractor component 114 in connection with word sequence extraction is described herein.
  • the lexicon model can be based on a probabilistic model P(w_t | w_s, T, S), where w_s is a word in a source-language article S and w_t is a word in the linked target-language article T. Under this model, word pairs, rather than word sequence pairs, are aligned during training. Such a model can be trained from a relatively small set of annotated article pairs (e.g., articles in an online collaborative encyclopedia) where, for some words in the source language, one or more words are marked as corresponding to the source word (in the context of the article pair), or it is indicated that the source word does not have a corresponding word in the target article. The word-level annotated articles are disjoint from the word sequence aligned articles described herein.
  • the following features are used in the lexicon model and can be extracted by the feature extractor component 114 .
  • the first feature is a translation probability, which is the probability p(w_t | w_s) from a word alignment model trained on seed parallel data.
  • Another feature that can be extracted from the first electronic document 104 and/or the second electronic document 106 by the feature extractor component 114 can be a position difference, which is the absolute value of the difference in relative position of the words w_s and w_t in the articles S and T.
  • Another word-level induced lexicon feature that can be extracted by the feature extractor component 114 is a feature indicative of orthographic similarity.
  • Orthographic similarity is a function of the edit distance between source and target words. The edit distance between words written in different alphabets is computed by first performing a deterministic phonetic translation of the words to a common alphabet. This translation is inexact and thus can be subject to improvement.
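  • A minimal sketch of such an orthographic similarity feature follows, assuming a Levenshtein edit distance and a toy phonetic mapping; the tiny mapping table and the normalization are illustrative only.

    def edit_distance(a, b):
        """Levenshtein distance via the standard rolling-array dynamic program."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                         prev + (ca != cb))
        return dp[-1]

    # Toy deterministic phonetic mapping to a common (Latin) alphabet; a real
    # table would cover the whole source alphabet.
    PHONETIC = {"б": "b", "е": "e", "р": "r", "л": "l", "и": "i", "н": "n"}

    def orthographic_similarity(w_s, w_t):
        a = "".join(PHONETIC.get(c, c) for c in w_s)
        b = "".join(PHONETIC.get(c, c) for c in w_t)
        return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

    print(orthographic_similarity("берлин", "berlin"))  # 1.0: exact phonetic match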
  • the feature extractor component 114 can be configured to extract a context translation probability.
  • the context translation probability feature can consider all words occurring next to the word w_s in the article S and next to the word w_t in the article T, in a local context window (e.g., one word to the left and one word to the right of the subject word).
  • Several scoring functions can be computed measuring the translation correspondence between the contexts (e.g., using a particular model trained from seed parallel data). This feature is similar to distributional similarity measures used in previous work, with the difference that it is limited to the contexts of words within a linked article pair.
  • Still another example of a word-level induced lexicon feature that can be extracted from the first electronic document 104 and the second electronic document 106 includes distributional similarity.
  • the distributional similarity feature corresponds more closely to context similarity measures used in existing work on lexicon induction. For each source head word w_s, a distribution over context positions o ∈ {−2, −1, +1, +2} and context words v_s occurring at those positions can be collected, based on a count of the times a context word occurred at that offset from the head word: P(o, v_s | w_s). Similarly, a distribution over positions and context words for each target head word can be gathered: P(o, v_t | w_t). Using a word translation table P(v_t | v_s) estimated on the seed parallel corpus, a cross-lingual context distribution can be estimated as P(o, v_t | w_s) = Σ_{v_s} P(v_t | v_s) P(o, v_s | w_s). The similarity of the words w_s and w_t can then be defined as 1 minus the Jensen-Shannon divergence of the two distributions over positions and target words.
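  • The following sketch shows one way the cross-lingual context distribution and the resulting similarity could be computed. It is a simplified reading of the formulas above (for instance, a translation table with incomplete coverage leaves the projected distribution unnormalized); the names and data layouts are assumptions.

    import math
    from collections import Counter, defaultdict

    OFFSETS = (-2, -1, 1, 2)

    def context_distribution(sentences, head):
        """P(o, v | head): distribution over (offset, context word) pairs."""
        counts = Counter()
        for sent in sentences:
            for i, w in enumerate(sent):
                if w == head:
                    for o in OFFSETS:
                        if 0 <= i + o < len(sent):
                            counts[(o, sent[i + o])] += 1
        total = sum(counts.values()) or 1
        return {k: c / total for k, c in counts.items()}

    def project(src_dist, ttable):
        """P(o, v_t | w_s) = sum_{v_s} P(v_t | v_s) * P(o, v_s | w_s)."""
        out = defaultdict(float)
        for (o, v_s), p in src_dist.items():
            for v_t, p_trans in ttable.get(v_s, {}).items():
                out[(o, v_t)] += p_trans * p
        return dict(out)

    def jensen_shannon(p, q):
        keys = set(p) | set(q)
        m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
        def kl(a):
            return sum(v * math.log2(v / m[k]) for k, v in a.items() if v > 0)
        return 0.5 * kl(p) + 0.5 * kl(q)

    def distributional_similarity(src_dist, tgt_dist, ttable):
        return 1.0 - jensen_shannon(project(src_dist, ttable), tgt_dist)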
  • the ranker component 116 can receive these features extracted from the electronic documents 104 and 106 from the feature extractor component 114 and can output a ranked list of word sequence pairs, wherein the ranking is indicative of an amount of parallelism between the word sequences in such word sequence pairs.
  • the ranker component 116 can utilize two separate models for outputting ranked lists of word sequence pairs.
  • the first ranking model used by the ranker component 116 can, for each word sequence in the source document (the first electronic document 104), either select a word sequence in the target document (the second electronic document 106) to which it is parallel or assign a null value to the word sequence in the source document.
  • the ranker component 116 can be or include a model that models a global alignment of word sequences between the first electronic document 104 and the second electronic document 106 .
  • the first set of word sequences 108 and the second set of word sequences 110 are observed, and for each source word sequence a hidden variable can indicate the corresponding target sequence in the target document to which the source word sequence is aligned.
  • this model can be a first order linear chain conditional random field (CRF).
  • the features extracted by the feature extractor component 114 can be provided to the ranker component 116, and the ranker component 116 can assign a score to each word sequence pair, wherein the score is indicative of an amount of parallelism between word sequences in the word sequence pair. Additionally, the features described above can be utilized during training of the ranker component 116. More particularly, given the set of feature functions described above, the weights of a log-linear ranking model for P(w_t | w_s, T, S) can be learned, and the trained model can be used to define a new word translation table P_lex(t | s). This new translation table can be used to define another HMM word alignment model (together with distortion probabilities trained from parallel data) for use in sentence or word sequence extraction models. Two copies of each feature using the HMM word alignment model can be generated: one using the seed data HMM model and another using this new HMM model.
  • the comparer component 118 and the labeler component 120 can act as described above in connection with assigning the label 122 to a word sequence pair. Again this label 122 can indicate that a word sequence in the second set of word sequences 110 is a translation of a word sequence in the first set of word sequences 108 .
  • the system 100 can be utilized to identify paraphrases of word sequences written in the same language. That is, the first electronic document 104 and the second electronic document 106 can include the first set of word sequences 108 and the second set of word sequences 110 , wherein such word sequences 108 and 110 are written in the same language.
  • the feature extractor component 114 can extract at least some of the features mentioned above from the first electronic document 104 and the second electronic document 106 and the ranker component 116 can rank word sequence pairs as indicated previously. Scores output by the ranker component 116 can be indicative of a likelihood that word sequences in scored word sequence pairs are paraphrases of one another.
  • the comparer component 118 can compare a highest score assigned to word sequences considered by the ranker component 116 , and if the highest score is above a threshold value, can instruct the labeler component 120 to label the word sequences in the word sequence pair as being paraphrases of one another.
  • Paraphrases ascertained through utilization of the system 100 can be employed in practice in a variety of settings.
  • such paraphrases can be utilized in a search engine such that if a user submits a query, the search engine can execute a search over a document corpus utilizing such query as well as paraphrases of such query.
  • the user may submit a query to a search engine and the search engine may execute a search over a document corpus utilizing the query and a modification of such query, wherein such modification includes paraphrases of word sequences that exist in the query.
  • paraphrases ascertained through utilization of the system 100 can be employed in connection with advertising. For example, a user may submit a query to a search engine and the search engine can sell advertisements pertaining to such query by selling paraphrases of the query to advertisers that are willing to bid on such paraphrases.
  • such identified parallel word sequences can be utilized in a variety of settings.
  • a labeled word sequence pair with respect to a first language and a second language could be provided to a system that is configured to learn a machine translation system. This machine translation system may then be utilized to translate text from the first language to the second language.
  • word sequences that are labeled as being substantially parallel to one another when such word sequences are written in different languages can be employed in a multi-lingual search setting. For example, a user may wish to formulate a query in a first language because the user is more comfortable with writing in such language, but the user may wish to receive the search results in a second language.
  • word sequence pairs can be employed in connection with automatically translating a query from the first language to the second language to execute a search over a document corpus that comprises documents in the second language.
  • Other utilizations of word sequence alignments are contemplated and intended to fall under the scope of the hereto appended claims.
  • the system 200 includes a data store 102 that comprises the first electronic document 104 and the second electronic document 106 .
  • a segmenter component 202 can analyze linguistic features in the first electronic document 104 and the second electronic document 106 and can extract word sequences from such electronic documents.
  • the segmenter component 202 can be configured to extract sentences from the electronic documents 104 and 106 .
  • the segmenter component 202 can be configured to search for certain types of punctuation including periods, question marks, etc. and can extract word sequences by selecting text that is between such punctuation marks.
  • the segmenter component 202 can be configured to analyze content of electronic documents 104 and 106 to extract certain types of phrases. For instance, the segmenter component 202 can search for certain noun/verb arrangements or other parts of a language and can extract word sequences based at least in part upon such analysis. Thus, the segmenter component 202 can output the first set of word sequences 108 and the second set of word sequences 110 that correspond to the first electronic document 104 and the second electronic document 106 and can cause such sets of word sequences 108 and 110 to be stored in the data store 102 .
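  • A minimal sketch of the punctuation-driven segmentation described above follows; the regular expression is illustrative and, for instance, ignores abbreviations.

    import re

    def segment(text):
        """Split document text into word sequences (here, sentences) by
        searching for sentence-final punctuation such as periods and
        question marks."""
        pieces = re.split(r"(?<=[.!?])\s+", text.strip())
        return [p for p in pieces if p]

    doc = "Berlin is the capital of Germany. It has 3.4 million inhabitants! Is it large?"
    print(segment(doc))
    # ['Berlin is the capital of Germany.', 'It has 3.4 million inhabitants!', 'Is it large?']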
  • a first electronic document 302 can comprise multiple word sequences 304-310. These word sequences are labeled respectively as 1, 2, 3 and 4 and may be written in a first language.
  • a second electronic document 312 can comprise multiple word sequences 314-320, wherein such word sequences are in a second language.
  • the electronic documents 302 and 312 may be somehow related, such as, for example, two articles in different languages in an online collaborative encyclopedia that are linked to one another.
  • word sequences in the electronic documents 302 and 312 may be translations of one another, may be partial translations of one another or may not be translations of one another.
  • a first model can be utilized to indicate probabilities of alignments between word sequences in the electronic documents 302 and 312. That is, for each word sequence in the first electronic document 302, a value can be assigned to a word sequence pair indicating whether a particular word sequence in the electronic document 312 is aligned with such word sequence in the electronic document 302, and such alignment information can be utilized in connection with assigning scores to word sequence pairs.
  • each word sequence in a first electronic document 302 can be paired with each word sequence in the second electronic document 312 and such word sequence pairs can be assigned scores.
  • For instance, word sequence 1 (shown by reference numeral 304) can be paired with each word sequence in the second electronic document 312; these word sequence pairs are shown in the matrix 322 by the values 1A, 1B, 1C and 1D.
  • the ranker component 116 ( FIG. 1 ) can assign a score to each of those word sequence pairs.
  • a highest score assigned to the word sequence pairs 1A, 1B, 1C, and 1D can be compared to a threshold value, and if such highest score is above the threshold value, then the corresponding word sequence pair can be labeled as including parallel word sequences. Therefore, if the word sequence pair 1A is assigned the highest score amongst the word sequence pairs 1A, 1B, 1C, and 1D and such score is above the threshold value, then the word sequence pair 1A can be labeled as parallel, such that the word sequence A 314 is labeled as being a translation of the word sequence 1 304. As can be seen in the matrix 322, scores can be assigned to each possible word sequence pair between the electronic documents 302 and 312.
  • scores can be assigned to different alignments of word sequences in the electronic documents 302 and 312 .
  • an empty second language word sequence can exist (as an additional row in the matrix 322), and the matrix 322 can include binary values, with a single “1” in each column (which indicates that a word sequence in the first electronic document corresponds with a certain word sequence in the second electronic document or with the empty word sequence) and “0” for the remaining entries of each column.
  • the ranker may then assign a score to the matrix in its entirety. This score is indicative of the parallelism between word sequences in the scored alignment.
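  • A sketch of this matrix representation, and of a toy score over an entire alignment, follows. The clustering bonus only illustrates the intuition that parallel word sequences occur in clusters; the patent's actual global model is the CRF mentioned above.

    def make_alignment_matrix(assignments, n_target):
        """One column per source word sequence, one row per target word
        sequence, plus a final row for the empty word sequence.
        assignments[j] is the aligned target row for source column j,
        or None for the empty word sequence."""
        matrix = [[0] * len(assignments) for _ in range(n_target + 1)]
        for col, row in enumerate(assignments):
            matrix[n_target if row is None else row][col] = 1
        # Exactly one "1" in every column, as described above.
        assert all(sum(col) == 1 for col in zip(*matrix))
        return matrix

    def alignment_score(assignments, pair_scores, cluster_bonus=0.5):
        """Sum of per-pair scores plus a bonus when adjacent source sequences
        align to adjacent target sequences."""
        total = sum(pair_scores[c][r]
                    for c, r in enumerate(assignments) if r is not None)
        for a, b in zip(assignments, assignments[1:]):
            if a is not None and b == a + 1:
                total += cluster_bonus
        return total

    # Word sequences 1-4 aligned to A, B, the empty sequence, and D.
    print(make_alignment_matrix([0, 1, None, 3], n_target=4))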
  • With reference to FIG. 4, an example methodology 400 that facilitates determining whether two word sequences in a word sequence pair are parallel to one another is illustrated. While the methodology 400 is described as being a series of acts that are performed in a sequence, it is to be understood that the methodology is not limited by the order of the sequence. For instance, some acts may occur in a different order than what is described herein. In addition, an act may occur concurrently with another act. Furthermore, in some instances, not all acts may be required to implement the methodology described herein.
  • the acts described herein may be computer-executable instructions that can be implemented by one or more processors and/or stored on a computer-readable medium or media.
  • the computer-executable instructions may include a routine, a sub-routine, programs, a thread of execution, and/or the like.
  • results of acts of the methodologies may be stored in a computer-readable medium, displayed on a display device, and/or the like.
  • the computer-readable medium may be a non-transitory medium, such as memory, hard drive, CD, DVD, flash drive, or the like.
  • the methodology 400 begins at 402, and at 404 a first electronic document that comprises a first word sequence is received.
  • the first word sequence can be written in a first language.
  • a second electronic document is received, wherein the second electronic document comprises a set of word sequences.
  • the set of word sequences can be written in a second language.
  • the first and second electronic documents may be web pages such as related web pages that belong to an online collaborative encyclopedia.
  • a score is assigned to each word sequence pair formed by pairing the first word sequence with each word sequence in the set of word sequences.
  • the score can be indicative of an amount of parallelism between the first word sequence and the other word sequences in the respective word sequence pairs.
  • a highest score assigned to the word sequence pairs is compared to a threshold value. If the highest score is above the threshold value, data can be stored indicating that the word sequences in the corresponding word sequence pair are parallel to one another.
  • for example, the data can indicate that the word sequence from the set of word sequences in the word sequence pair is a translation of the first word sequence. In another example, the data can indicate that the two word sequences in the word sequence pair are paraphrases of one another.
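  • The methodology might be realized along the lines of the following sketch, where score_fn stands in for the trained ranker; the signature and return format are assumptions, not the patent's.

    def label_first_sequence(first_seq, target_seqs, score_fn, threshold):
        """Pair the first word sequence with each word sequence in the second
        electronic document, score each pair, and compare the highest score
        with the threshold value."""
        scored = [(score_fn(first_seq, t), t) for t in target_seqs]
        best_score, best_target = max(scored, key=lambda st: st[0])
        if best_score > threshold:
            # Data indicating the two word sequences are parallel (e.g.,
            # translations or paraphrases of one another).
            return {"source": first_seq, "target": best_target, "parallel": True}
        # Null value: no parallel word sequence in the second document.
        return {"source": first_seq, "target": None, "parallel": False}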
  • With reference to FIG. 5, a high-level illustration of an example computing device 500 that can be used in accordance with the systems and methodologies disclosed herein is provided.
  • the computing device 500 may be used in a system that supports detecting parallel word sequences in electronic documents.
  • at least a portion of the computing device 500 may be used in a system that supports extracting parallel word sequences from web pages that belong to an online collaborative encyclopedia.
  • the computing device 500 includes at least one processor 502 that executes instructions that are stored in a memory 504 .
  • the memory 504 may be or include RAM, ROM, EEPROM, Flash memory, or other suitable memory.
  • the instructions may be, for instance, instructions for implementing functionality described as being carried out by one or more components discussed above or instructions for implementing one or more of the methods described above.
  • the processor 502 may access the memory 504 by way of a system bus 506 .
  • the memory 504 may also store electronic documents, word sequences, ranking models, word sequence pairs, data indicating that words in a word sequence pair are parallel to one another, etc.
  • the computing device 500 additionally includes a data store 508 that is accessible by the processor 502 by way of the system bus 506 .
  • the data store 508 may be or include any suitable computer-readable storage, including a hard disk, memory, etc.
  • the data store 508 may include executable instructions, electronic documents which may be web pages, word sequences, etc.
  • the computing device 500 also includes an input interface 510 that allows external devices to communicate with the computing device 500 .
  • the input interface 510 may be used to receive instructions from an external computer device, a user, etc.
  • the computing device 500 also includes an output interface 512 that interfaces the computing device 500 with one or more external devices.
  • the computing device 500 may display text, images, etc. by way of the output interface 512 .
  • the computing device 500 may be a distributed system. Thus, for instance, several devices may be in communication by way of a network connection and may collectively perform tasks described as being performed by the computing device 500 .
  • a system or component may be a process, a process executing on a processor, or a processor.
  • a component or system may be localized on a single device or distributed across several devices.
  • a component or system may refer to a portion of memory and/or a series of transistors.

Abstract

Systems and methods for automatically extracting parallel word sequences from comparable corpora are described. Electronic documents, such as web pages belonging to a collaborative online encyclopedia, are analyzed to locate parallel word sequences between electronic documents written in different languages. These parallel word sequences are then used to train a machine translation system that can translate text from one language to another.

Description

BACKGROUND
Machine translation refers to the utilization of a computing device to translate text or speech from a source natural language to a target natural language. Due to complexities in natural languages, however, executing a translation between natural languages is a non-trivial task.
Generally, machine translation systems are learned systems that are trained through utilization of training data. Pursuant to an example, labeled training data can include word sequence pairs that are either translations of one another or are not translations of one another. These word sequence pairs are labeled and provided to a learning system that learns a model based at least in part upon this labeled training data. Thus, machine translation systems are trained in such a way that they can translate a word sequence not included in the training data by observing how translation between a source language and target language works in practice with respect to various word sequences in the training data.
It can be ascertained that acquiring more training data with which to train a machine translation system causes the translations output by the machine translation system to be increasingly accurate. Several languages have a significant amount of training data associated therewith. Thus, many conventional machine translation systems are quite adept at translating between, for example, English and Spanish. For other languages, however, there is a lack of training data that can be utilized to train a machine translation system that is desirably configured to translate between such languages. For instance, there is a lack of training data that would allow machine translation systems to accurately translate between German and Bulgarian.
One manner for obtaining this training data is to have individuals that can speak both German and Bulgarian, for example, manually label word sequences (e.g., as being parallel to one another or not parallel to one another). This labeled data may then be used to train a machine translation system that is configured to translate between the German and Bulgarian languages, for instance. Manually labeling training data, however, is a relatively monumental task, particularly if a great amount of training data is desired.
SUMMARY
The following is a brief summary of subject matter that is described in greater detail herein. This summary is not intended to be limiting as to the scope of the claims.
Various technologies pertaining to acquiring training data for a machine translation system are described herein. These technologies include utilizing electronic sources that have heretofore not been considered for obtaining parallel data. These sources include online collaborative dictionaries/encyclopedias. Such an encyclopedia may include various documents in multiple different languages that discuss a substantially similar subject. Therefore, for instance, such a collaborative encyclopedia may comprise a first entry about a first topic written in the German language and may also comprise a second entry about the same topic written in the Bulgarian language. Such documents may be direct translations of one another, partial translations of one another or authored entirely separate from one another. That the electronic documents are in some way related can be inferred, for instance, as oftentimes links exist between the documents in the collaborative online encyclopedia. In another example, each of the documents may include a link to a substantially similar image. Furthermore, entries in the online collaborative encyclopedia pertaining to a substantially similar topic may include links in the text to substantially similar web pages (e.g., web pages that are also linked as being related to one another), etc. Other sources of information are also contemplated, such as news sources that publish news about events in different languages, though in this case determining the related document pairs may be more challenging. As an example, documents that have similar person names in the title and are published within one week of one another might be considered related.
As indicated above, word sequences in documents of different languages may or may not be direct translations of one another. Regardless, these sources may be good sources for obtaining parallel data that can be used for training a machine translation system. Accordingly, also described herein are technologies pertaining to ascertaining if a word sequence in a first electronic document written in a first language is parallel to a word sequence in a second electronic document that is written in a second language. For example, electronic documents that are known to be related to one another can be analyzed and word sequences can be generated through analysis of these documents. In one example, a word sequence can be an entire sentence. In another example, a word sequence may be a phrase. A word sequence in the first electronic document in the first language can be selected and paired separately with each word sequence in the second electronic document to create word sequence pairs. A score may be assigned to each word sequence pair, wherein the score is indicative of whether or not the word sequences in the word sequence pair are parallel to one another (e.g., whether the first word sequence is parallel to the other word sequence in the word sequence pair). The higher the score assigned to the word sequence pair, the greater the probability that the two word sequences in the word sequence pair are parallel to one another. Thus, a ranked list of word sequence pairs can be output.
With more detail pertaining to assigning scores, scores can be assigned to sequence pairs through utilization of any suitable method/mechanism. For instance, a ranker can be trained through utilization of manually labeled parallel word sequence pairs and non-parallel word sequence pairs. In addition, the ranker can be trained to take into consideration the global configuration of electronic documents. That is, in many sources it may be common for parallel word sequences to occur in clusters. The ranker can consider such alignment when assigning scores to word sequence pairs.
A threshold value may then be compared to a highest score assigned to a word sequence pair. If the score is above the threshold value, then a label can be assigned to the word sequence pair indicating that the word sequences in the word sequence pair are parallel to one another (e.g., translations of one another). If no score is above the threshold value, then a null value can be assigned to the first word sequence which indicates that the first word sequence does not have a corresponding parallel word sequence in the second electronic document. As an example, the threshold value may be selected by the user before executing the system.
In another exemplary embodiment, a substantially optimal (in terms of being parallel) second language word sequence may be sought for each first language word sequence. Given a first language word sequence, a ranker described herein can score every pairing with second language word sequences, as well as the pair of the first language word sequence and the empty word sequence. A label may then be assigned to the second language word sequence assigned the highest (or best) score, unless the best scoring second language word sequence is the empty word sequence, in which case the null value is assigned to the aforementioned first language word sequence. In this way, a separate threshold is automatically selected for each first language word sequence.
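This selection amounts to an argmax over candidate word sequences that include the empty word sequence, as in the following minimal sketch (score_fn again stands in for the trained ranker and is assumed to accept the empty word sequence as a candidate):

    EMPTY = None  # stands in for the empty word sequence

    def best_target(first_seq, target_seqs, score_fn):
        """Score every pairing of the first language word sequence with the
        second language word sequences and with the empty word sequence, and
        keep the best; choosing the empty word sequence plays the role of an
        automatically selected, per-sequence threshold."""
        candidates = list(target_seqs) + [EMPTY]
        return max(candidates, key=lambda t: score_fn(first_seq, t))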
Furthermore, an exemplary ranker described herein need not consider each first language word sequence in isolation. Instead, the ranker may assign scores to particular word sequence alignments across documents (e.g., whole documents or a collection of word sequences between documents), where every first language word sequence is labeled with either a corresponding second language word sequence or the empty word sequence. In this way, the ranker can look at adjacent information to affect its score. As an example, it may be expected that parallel word sequences tend to occur in clusters, and the ranker can output a high score to correspondences where this phenomenon occurs.
Other aspects will be appreciated upon reading and understanding the attached figures and description.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a functional block diagram of an example system that facilitates assigning a label to a word sequence pair that indicates that word sequences in such pair are parallel to one another.
FIG. 2 is a functional block diagram of an example system that facilitates segmenting electronic documents into word sequences.
FIG. 3 is an example depiction of word sequence pairs.
FIG. 4 is a flow diagram that illustrates an example methodology for labeling a word sequence pair to indicate that word sequences in such pair are parallel to one another.
FIG. 5 is an example computing system.
DETAILED DESCRIPTION
Various technologies pertaining to extracting parallel word sequences from aligned comparable electronic documents will now be described with reference to the drawings, where like reference numerals represent like elements throughout. In addition, several functional block diagrams of example systems are illustrated and described herein for purposes of explanation; however, it is to be understood that functionality that is described as being carried out by certain system components may be performed by multiple components. Similarly, for instance, a component may be configured to perform functionality that is described as being carried out by multiple components.
With reference to FIG. 1, an example system 100 that facilitates acquiring parallel word sequences from aligned, related electronic documents is illustrated. The system 100 includes a data store 102 that comprises a first electronic document 104 and a second electronic document 106. The first electronic document 104 can include a first set of word sequences 108 and the second electronic document 106 can include a second set of word sequences 110. In an example, these word sequences can be sentences. In another example, the word sequences can be multi-word phrases.
A receiver component 112 can receive the first electronic document 104 and the second electronic document 106 from the data store 102. A feature extractor component 114 is in communication with the receiver component 112 and can extract features from the first electronic document 104 and the second electronic document 106, respectively. Example features that can be extracted will be described in greater detail below.
In a first exemplary embodiment, a ranker component 116 can receive the features extracted by the feature extractor component 114 and, based at least in part upon such features, can assign scores to word sequence pairs between word sequences in the first electronic document 104 and the second electronic document 106. Pursuant to an example, the first set of word sequences 108 can include a first word sequence. This first word sequence can be paired individually with each word sequence in the second set of word sequences 110, thereby creating a plurality of word sequence pairs. The ranker component 116 can assign a score to each word sequence pair, wherein the score can be indicative of an amount of parallelism between the word sequences in the word sequence pair. As used herein, the term “parallelism” can refer to word sequences having a substantially similar semantic meaning.
A comparer component 118 can receive the scores assigned to the word sequence pairs by the ranker component 116 and can compare a highest score output by the ranker component 116 with a threshold value. A score above the threshold value indicates that it is relatively highly probable that the word sequences in the corresponding word sequence pair are parallel to one another, while a score below the threshold value can indicate that it is more likely that the second set of word sequences 110 in the second electronic document 106 does not include a word sequence that is parallel to the aforementioned first word sequence.
A labeler component 120 can be in communication with the comparer component 118 and can assign a label to the word sequence pair. For example, the labeler component 120 can label a word sequence pair as including parallel word sequences if the comparer component 118 found that the score assigned to the word sequence pair is above the threshold. Additionally, the labeler component 120 can be configured to label word sequence pairs assigned a score below a threshold by the ranker component 116 as not being parallel word sequences.
In a second exemplary embodiment (which can operate in combination with the first exemplary embodiment), the ranker component 116 can receive the aforementioned features and can assign scores to a plurality of different possible alignments between word sequences in the first electronic document 104 and word sequences in the second electronic document 106. For instance, each word sequence in the first set of word sequences 108 can be labeled with either a corresponding word sequence in the second set of word sequences 110 or an empty word sequence. The ranker component 116 may then assign a score to a possible alignment, wherein the alignment refers to the aforementioned alignment of word sequences in the first set of word sequences with the respective word sequences in the second set of word sequences as evidenced by the labels. In this way, the ranker component 116 can take into consideration adjacency of word sequences when assigning a score to an alignment. As an example, it may be expected that parallel word sequences tend to occur in clusters, and the ranker component 116 can output a high score to correspondences where this phenomenon occurs. The ranker component 116 can assign scores to a plurality of different possible alignments, and may cause a highest assigned score (and corresponding alignment) to be stored in the data repository 102. This score is indicative of a probability that word sequences in the first set of word sequences 108 in the particular alignment are parallel to corresponding word sequences in the second set of word sequences 110 in the particular alignment. In an example, this score can be taken into consideration when assigning scores to particular word sequence pairs as described above. Alternatively, possible alignments can be selected based at least in part upon scores assigned to word sequence pairs, wherein such scores are assigned as described above.
Exemplary embodiments of the system 100 will now be described. In a first embodiment the system 100 can be configured to locate parallel word sequences in electronic documents that are written in different languages. Thus, the first set of word sequences 108 can be written in a first language and the second set of word sequences 110 can be written in a second language. Moreover, the first electronic document 104 and the second electronic document 106 may each be web pages. Pursuant to a particular example, the web pages may belong to a collaborative online encyclopedia that includes entries on various topics in several different languages. Therefore, the first electronic document 104 may pertain to a particular topic written in a first language and the second electronic document 106 may pertain to the same topic but is written in the second language. In an example, the relationship between the first electronic document 104 and the second electronic document 106 can be known based upon links between the first electronic document 104 and the second electronic document 106. For instance, the online collaborative encyclopedia may link entries pertaining to substantially similar topics that are written in different languages. In another example, the online collaborative encyclopedia may store images in a common repository, and entries written in different languages that link to the same image may be deemed to be related. Still further, hyperlinks in electronic documents can be compared to see whether such hyperlinks link to a substantially similar page or pages that are known to be related (e.g., entries pertaining to different topics written in different languages).
In another example, the first electronic document 104 and the second electronic document 106 may be web pages pertaining to news items. That is, the first electronic document 104 may be a news item written in the first language pertaining to a particular event, and the second electronic document 106 may be a news item pertaining to a substantially similar event but written in the second language. With respect to the online sources described above, the second electronic document 106 may be a direct translation of the first electronic document 104. In some cases, however, some word sequences in the second set of word sequences 110 may be translations of word sequences in the first set of word sequences 108 while other word sequences in the second set of word sequences 110 are not parallel with any word sequence in the first set of word sequences 108. For example, if the first electronic document 104 and the second electronic document 106 are obtained from an online collaborative encyclopedia, the first electronic document 104 may be written by a first author in the first language about a particular topic. The second electronic document 106 may be written by a second author in the second language, and the second author may have used some of the language in the first electronic document 104 and placed a translation thereof in the second electronic document 106 in the second language. For instance, a multi-lingual individual may have written the second electronic document 106 and may have directly translated at least some text from the first electronic document 104 to the second electronic document 106. Other authors, however, may have generated original text that does not align with text in the first electronic document 104. Since the first electronic document 104 and the second electronic document 106 are known to be related and may be comparable, however, such documents 104 and 106 may still be good sources for obtaining parallel word sequences in the first and second languages, respectively.
As mentioned above, the feature extractor component 114 can extract features from word sequences in the first set of word sequences 108 and the second set of word sequences 110. The feature extractor component 114 can be configured to extract features from each word sequence pair of interest. Such features extracted by the feature extractor component 114 can include, but are not limited to, features derived from word alignments, distortion features, features derived from a markup of an online collaborative encyclopedia, and/or word-level induced lexicon features. Pursuant to an example, word sequences in the first electronic document 104 and the second electronic document 106 can be aligned through utilization of a known/existing word alignment model, and the features derived from word alignments are computed over the alignment output by that model. These features include the log probability of the alignment, the number of aligned/unaligned words, the longest aligned sequence of words, and the number of words with fertility 1, 2, and 3 or above. Additionally, two more features that can be classified as being derived from word alignments include a sentence length feature that models the length ratio between source and target word sequences with a Poisson distribution, and the difference in relative document position of the two word sequences, which captures the idea that aligned electronic documents have similar topic progression. Such features are defined on word sequence pairs and are contemplated by the ranker component 116 when assigning scores to word sequence pairs.
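By way of illustration only, the following Python sketch shows one plausible way to compute the alignment-derived features described above; the function name, its inputs, and the choice of Poisson rate are assumptions for exposition and are not part of the disclosed system.

```python
import math
from collections import Counter

def alignment_features(src_len, tgt_len, src_rel_pos, tgt_rel_pos, links, log_prob):
    """Illustrative sketch of features derived from a word alignment.
    `links` is a list of (src_index, tgt_index) pairs from an existing word
    alignment model; `log_prob` is that model's alignment log probability.
    All names and inputs here are assumptions, not a disclosed interface."""
    aligned_src = {s for s, _ in links}
    aligned_tgt = {t for _, t in links}
    fertility = Counter(s for s, _ in links)  # links per source word
    # longest run of consecutively aligned source words
    longest = run = 0
    for i in range(src_len):
        run = run + 1 if i in aligned_src else 0
        longest = max(longest, run)
    # sentence length feature: Poisson log probability of the target length,
    # assuming (for illustration) an expected target length equal to src_len
    lam = float(max(src_len, 1))
    length_log_prob = -lam + tgt_len * math.log(lam) - math.lgamma(tgt_len + 1)
    return {
        "alignment_log_prob": log_prob,
        "aligned_words": len(aligned_src) + len(aligned_tgt),
        "unaligned_words": (src_len - len(aligned_src)) + (tgt_len - len(aligned_tgt)),
        "longest_aligned_run": longest,
        "fertility_1": sum(1 for f in fertility.values() if f == 1),
        "fertility_2": sum(1 for f in fertility.values() if f == 2),
        "fertility_3_plus": sum(1 for f in fertility.values() if f >= 3),
        "length_log_prob": length_log_prob,
        "rel_doc_position_diff": abs(src_rel_pos - tgt_rel_pos),  # positions in [0, 1]
    }
```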
Distortion features capture the difference between the positions of the previously aligned word sequence and the currently aligned word sequence. A first set of distortion features bins such distances, while a second set considers the absolute difference between an expected position (one after the previously aligned word sequence) and the actual position.
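A minimal sketch of these two distortion feature sets follows; the bin boundaries are assumptions chosen for illustration.

```python
def distortion_features(prev_tgt_pos, curr_tgt_pos, bins=(1, 2, 3, 5, 10)):
    """Sketch of the two distortion feature sets: binned jump distances, and
    the absolute difference between the expected position (one after the
    previously aligned word sequence) and the actual position."""
    jump = curr_tgt_pos - prev_tgt_pos
    expected = prev_tgt_pos + 1
    features = {"abs_diff_from_expected": abs(curr_tgt_pos - expected)}
    for b in bins:
        # indicator feature that fires when the jump falls within the bin
        features["jump_within_%d" % b] = 1.0 if abs(jump) <= b else 0.0
    return features
```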
Features derived from the markup of an online collaborative encyclopedia can include the number of matching links in a particular word sequence pair of interest. Such links can be weighted by their inverse frequency in the document, such that a link that appears often does not contribute much to the value of this feature. Another feature that can be derived from the markup of the online collaborative encyclopedia is an image feature, which can be set to a particular value when the two word sequences are captions of a same image. A list feature can have a particular value when the two word sequences in a word sequence pair are both items in a list. The image feature and the list feature can fire with a negative value when the feature matches on one word sequence but not the other. None of the above features fires on a null alignment. There is also a bias feature for these two models, which fires on all non-null alignments.
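For exposition, a hedged sketch of these markup-derived features is given below; the particular feature values (+1/−1) and the input names are assumptions rather than disclosed choices.

```python
def markup_features(src_links, tgt_links, link_counts,
                    src_is_caption, tgt_is_caption, src_in_list, tgt_in_list):
    """Illustrative sketch of markup-derived features: matching links weighted
    by inverse frequency in the document, plus image-caption and list-item
    features that fire negatively on a mismatch. Inputs are assumptions:
    `src_links`/`tgt_links` are sets of link targets in each word sequence,
    and `link_counts` maps a link to its frequency in the document."""
    matching = sum(1.0 / link_counts[l] for l in src_links & tgt_links)
    image = (1.0 if src_is_caption and tgt_is_caption
             else -1.0 if src_is_caption or tgt_is_caption else 0.0)
    in_list = (1.0 if src_in_list and tgt_in_list
               else -1.0 if src_in_list or tgt_in_list else 0.0)
    # bias feature fires on all non-null alignments
    return {"matching_links": matching, "image": image, "list": in_list, "bias": 1.0}
```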
Furthermore, as indicated above, word-level induced lexicon features can be extracted by the feature extractor component 114. Pursuant to an example, a lexicon model can be built using an approach similar to ones developed for unsupervised lexicon induction from monolingual or comparable document corpora. This lexicon model and its utilization by the feature extractor component 114 in connection with word sequence extraction is described herein. The lexicon model can be based on a probabilistic model P(wt|ws, T, S), where wt is a word in the target language, ws is a word in the source language, and T and S are linked articles in the target and source languages, respectively. This model can be trained through utilization of any suitable training technique. Specifically, word pairs can be aligned during training rather than word sequence pairs. Such a model can be trained from a relatively small set of annotated article pairs (e.g., articles in an online collaborative encyclopedia) where, for some words in the source language, one or more target words are marked as corresponding to the source word (in the context of the article pair) or it is indicated that the source word does not have a corresponding word in the target article. The word-level annotated articles are disjoint from the word sequence aligned articles described herein.
The following features are used in the lexicon model and can be extracted by the feature extractor component 114. The first feature is the translation probability p(wt|ws) from a suitable word alignment model that is trained on particular seed parallel data. This probability can be utilized in both directions, as can the log probabilities in the two directions.
Another feature that can be extracted from the first electronic document 104 and/or the second electronic document 106 by the feature extractor component 114 is a position difference, which is the absolute value of the difference in relative position of the words ws and wt in the articles S and T.
Another word-level induced lexicon feature that can be extracted by the feature extractor component 114 is a feature indicative of orthographic similarity. Orthographic similarity is a function of the edit distance between source and target words. The edit distance between words written in different alphabets is computed by first performing a deterministic phonetic translation of the words to a common alphabet. This translation is inexact and thus can be subject to improvement.
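The following sketch illustrates such a feature under the assumption that a deterministic romanization function is available; the `romanize` stand-in and the normalization are illustrative assumptions.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def orthographic_similarity(ws, wt, romanize=lambda w: w):
    """Sketch: edit distance after a deterministic phonetic mapping of both
    words to a common alphabet; `romanize` is an assumed stand-in for that
    mapping."""
    a, b = romanize(ws), romanize(wt)
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)
```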
In yet another example of a word-level induced lexicon feature, the feature extractor component 114 can be configured to extract a context translation probability. The context translation probability feature considers all words occurring next to the word ws in the article S and next to the word wt in the article T in a local context window (e.g., one word to the left and one word to the right of the subject word). Several scoring functions can be computed measuring the translation correspondence between the contexts (e.g., using a particular model trained from seed parallel data). This feature is similar to distributional similarity measures used in previous work, with the difference that it is limited to the contexts of words within a linked article pair.
Still another example of a word-level induced lexicon feature that can be extracted from the first electronic document 104 and the second electronic document 106 is distributional similarity. The distributional similarity feature corresponds more closely to context similarity measures used in existing work on lexicon induction. For each source head word ws, a distribution over context positions o ∈ {−2, −1, +1, +2} and context words vs appearing in those positions is collected, based on the number of times a context word occurred at that offset from the head word: P(o, vs|ws) ∝ weight(o)·C(ws, o, vs). Adjacent positions −1 and +1 have a weight of 2; other positions have a weight of 1. Likewise, a distribution over positions and target context words can be gathered for each target head word: P(o, vt|wt). Using an existing word translation table P(vt|vs) estimated on the seed parallel corpus, a cross-lingual context distribution for the source head word can be estimated as P(o, vt|ws) = Σvs P(vt|vs)·P(o, vs|ws). The similarity of the words ws and wt can then be defined as 1 minus the Jensen-Shannon divergence between these two distributions over positions and target words.
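For concreteness, a sketch of this computation follows; it assumes each row of the translation table is a normalized distribution, and all function and variable names are illustrative.

```python
import math
from collections import defaultdict

OFFSETS = (-2, -1, +1, +2)
WEIGHT = {-2: 1, -1: 2, +1: 2, +2: 1}  # adjacent positions weighted 2

def context_distribution(tokens, head):
    """P(o, v | head): weighted, normalized counts of context words v
    observed at offsets o from occurrences of `head` in `tokens`."""
    dist = defaultdict(float)
    for i, w in enumerate(tokens):
        if w != head:
            continue
        for o in OFFSETS:
            if 0 <= i + o < len(tokens):
                dist[(o, tokens[i + o])] += WEIGHT[o]
    total = sum(dist.values()) or 1.0
    return {k: v / total for k, v in dist.items()}

def project(src_dist, ttable):
    """Cross-lingual projection P(o, vt|ws) = sum_vs P(vt|vs) * P(o, vs|ws);
    each `ttable[vs]` is assumed to be a normalized dict of probabilities."""
    out = defaultdict(float)
    for (o, vs), p in src_dist.items():
        for vt, p_t in ttable.get(vs, {}).items():
            out[(o, vt)] += p_t * p
    return dict(out)

def distributional_similarity(p, q):
    """1 minus the (base-2) Jensen-Shannon divergence of two distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0.0) * math.log2(a.get(k, 0.0) / m[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 1.0 - 0.5 * (kl(p) + kl(q))
```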
As described previously, the ranker component 116 can receive these features extracted from the electronic documents 104 and 106 from the feature extractor component 114 and can output a ranked list of word sequence pairs, wherein the ranking is indicative of an amount of parallelism between word sequences in such word sequence pairs. The ranker component 116 can utilize two separate models for outputting ranked lists of word sequence pairs. The first ranking model used by the ranker component 116 can, for each word sequence in the source document (the first electronic document 104), select either a word sequence in the target document (the second electronic document 106) that it is parallel to or assign a null value to the word sequence in the source document. Furthermore, the ranker component 116 can be or include a model that models a global alignment of word sequences between the first electronic document 104 and the second electronic document 106. In this exemplary model, the first set of word sequences 108 and the second set of word sequences 110 are observed, and for each source word sequence a hidden variable indicates the corresponding target word sequence in the target document to which the source word sequence is aligned. Pursuant to an example, this model can be a first order linear chain conditional random field (CRF). Of course, other suitable models are also contemplated by the inventors and are intended to fall under the scope of the hereto appended claims.
The features extracted by the feature extractor component 114 can be provided to the ranker component 116, and the ranker component 116 can assign a score to each word sequence pair, wherein the score is indicative of an amount of parallelism between word sequences in the word sequence pair. Additionally, the features described above can be utilized to train the ranker component 116. More particularly, given the set of feature functions described above, the weights of a log-linear ranking model for P(wt|ws, T, S) can be trained based at least in part upon the word-level annotated article pairs. After this model is trained, a new translation table Plex(t|s) can be generated, defined as Plex(t|s) ∝ Σt∈T,s∈S P(t|s, T, S), where the summation is over occurrences of the source and target words in linked articles (articles known to be in some way related). This new translation table can be used to define another HMM word alignment model (together with distortion probabilities trained from parallel data) for use in sentence or word sequence extraction models. Two copies of each feature using the HMM word alignment model can be generated: one using the seed data HMM model and another using this new HMM model.
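One plausible reading of this table construction is sketched below; `lexicon_model` stands in for the trained model P(t|s, T, S) and is an assumed callable, not a disclosed interface.

```python
from collections import defaultdict

def induce_translation_table(linked_article_pairs, lexicon_model):
    """Sketch of building Plex(t|s): sum the trained lexicon model
    P(t|s, T, S) over occurrences of s and t in linked article pairs,
    then normalize per source word. `linked_article_pairs` is assumed
    to be an iterable of (S, T) token lists for related articles."""
    totals = defaultdict(lambda: defaultdict(float))
    for S, T in linked_article_pairs:
        for s in S:                  # occurrences, not types, per the summation
            for t in T:
                totals[s][t] += lexicon_model(t, s, T, S)
    table = {}
    for s, row in totals.items():
        z = sum(row.values()) or 1.0
        table[s] = {t: v / z for t, v in row.items()}  # Plex(t|s)
    return table
```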
The comparer component 118 and the labeler component 120 can act as described above in connection with assigning the label 122 to a word sequence pair. Again this label 122 can indicate that a word sequence in the second set of word sequences 110 is a translation of a word sequence in the first set of word sequences 108.
In another embodiment, the system 100 can be utilized to identify paraphrases of word sequences written in the same language. That is, the first electronic document 104 and the second electronic document 106 can include the first set of word sequences 108 and the second set of word sequences 110, wherein such word sequences 108 and 110 are written in the same language. The feature extractor component 114 can extract at least some of the features mentioned above from the first electronic document 104 and the second electronic document 106, and the ranker component 116 can rank word sequence pairs as indicated previously. Scores output by the ranker component 116 can be indicative of a likelihood that word sequences in scored word sequence pairs are paraphrases of one another. The comparer component 118 can compare the highest score assigned to word sequence pairs considered by the ranker component 116 to a threshold value and, if the highest score is above the threshold value, can instruct the labeler component 120 to label the word sequences in the word sequence pair as being paraphrases of one another.
Paraphrases ascertained through utilization of the system 100 can be employed in practice in a variety of settings. For example, such paraphrases can be utilized in a search engine such that if a user submits a query, the search engine can execute a search over a document corpus utilizing such query as well as paraphrases of such query. In a similar example, the user may submit a query to a search engine and the search engine may execute a search over a document corpus utilizing the query and a modification of such query, wherein such modification includes paraphrases of word sequences that exist in the query. Furthermore, paraphrases ascertained through utilization of the system 100 can be employed in connection with advertising. For example, a user may submit a query to a search engine and the search engine can sell advertisements pertaining to such query by selling paraphrases of the query to advertisers that are willing to bid on such paraphrases.
If the system 100 is employed in connection with locating translations between word sequences in different languages, such identified parallel word sequences can be utilized in a variety of settings. For example, a labeled word sequence pair with respect to a first language and a second language can be provided to a learning system that is configured to train a machine translation system. This machine translation system may then be utilized to translate text from the first language to the second language. In another example, word sequences that are labeled as being substantially parallel to one another when such word sequences are written in different languages can be employed in a multi-lingual search setting. For example, a user may wish to formulate a query in a first language because the user is more comfortable with writing in such language, but the user may wish to receive the search results in a second language. These word sequence pairs can be employed in connection with automatically translating a query from the first language to the second language to execute a search over a document corpus that comprises documents in the second language. Other utilizations of word sequence alignments are contemplated and intended to fall under the scope of the hereto appended claims.
Referring now to FIG. 2, an example system 200 that facilitates extracting word sequences from electronic documents is illustrated. The system 200 includes the data store 102, which comprises the first electronic document 104 and the second electronic document 106. A segmenter component 202 can analyze linguistic features in the first electronic document 104 and the second electronic document 106 and can extract word sequences from such electronic documents. Pursuant to an example, the segmenter component 202 can be configured to extract sentences from the electronic documents 104 and 106. In such a case, the segmenter component 202 can be configured to search for certain types of punctuation, including periods, question marks, etc., and can extract word sequences by selecting text that is between such punctuation marks. In another example, the segmenter component 202 can be configured to analyze content of the electronic documents 104 and 106 to extract certain types of phrases. For instance, the segmenter component 202 can search for certain noun/verb arrangements or other parts of speech and can extract word sequences based at least in part upon such analysis. Thus, the segmenter component 202 can output the first set of word sequences 108 and the second set of word sequences 110 that correspond to the first electronic document 104 and the second electronic document 106, and can cause such sets of word sequences 108 and 110 to be stored in the data store 102.
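A minimal sketch of the punctuation-based variant of such segmentation appears below; a production segmenter would also handle abbreviations, quotations, and language-specific punctuation.

```python
import re

def segment_sentences(text):
    """Illustrative punctuation-based sentence segmentation: split after a
    period, question mark, or exclamation point followed by whitespace."""
    spans = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in spans if s]

# Example: yields ['Word sequences are extracted.', 'Is this a sentence?']
print(segment_sentences("Word sequences are extracted. Is this a sentence?"))
```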
With reference now to FIG. 3, an example depiction 300 of assigning scores to word sequence pairs existent in a first electronic document and a second electronic document is illustrated. A first electronic document 302 can comprise multiple word sequences 304-310. These word sequences are labeled respectively as 1, 2, 3 and 4 and may be written in a first language.
A second electronic document 312 can comprise multiple word sequences 314-320, wherein such word sequences are in a second language. The electronic documents 302 and 312 may be related in some way, such as, for example, two articles in different languages in an online collaborative encyclopedia that are linked to one another. Thus, word sequences in the electronic documents 302 and 312 may be translations of one another, may be partial translations of one another, or may not be translations of one another.
As described above, a first model can be utilized to indicate probabilities of alignments between word sequences in the electronic documents 302 and 312. That is, for each word sequence in the first electronic document 302, a value can be assigned with respect to a word sequence pair regarding whether a particular word sequence in the electronic document 312 is aligned with such word sequence in the electronic document 302, and such alignment information can be utilized in connection with assigning scores to word sequence pairs.
Thereafter, each word sequence in the first electronic document 302 can be paired with each word sequence in the second electronic document 312, and such word sequence pairs can be assigned scores. Thus, as is shown in an example matrix 322 with respect to word sequence 1 (shown by reference numeral 304), such word sequence is paired with each of the word sequences 314-320 in the second electronic document 312. These word sequence pairs are shown in the matrix 322 by the values 1A, 1B, 1C, and 1D. The ranker component 116 (FIG. 1) can assign a score to each of those word sequence pairs. Thereafter, as described previously, the highest score assigned to the word sequence pairs 1A, 1B, 1C, and 1D can be compared to a threshold value, and if such highest score is above the threshold value, then the corresponding word sequence pair can be labeled as including parallel word sequences. Therefore, if the word sequence pair 1A is assigned the highest score amongst the word sequence pairs 1A, 1B, 1C, and 1D and such score is above the threshold value, then the word sequence pair 1A can be labeled as parallel, such that the word sequence A 314 is labeled as being a translation of the word sequence 1 304. As can be seen in the matrix 322, scores can be assigned to each possible word sequence pair between the electronic documents 302 and 312.
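This per-row scoring and thresholding can be summarized by the following sketch; the `score` callable is an assumed stand-in for the ranker component's scoring function.

```python
def label_parallel_pairs(src_seqs, tgt_seqs, score, threshold):
    """Sketch of the matrix-based labeling: pair every source word sequence
    with every target word sequence, take the highest-scoring pair per source
    sequence, and label it parallel only if the score exceeds the threshold."""
    labels = {}
    for i, src in enumerate(src_seqs):
        best_score, best_j = max((score(src, tgt), j)
                                 for j, tgt in enumerate(tgt_seqs))
        # None indicates no parallel word sequence in the target document
        labels[i] = best_j if best_score > threshold else None
    return labels
```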
In another exemplary embodiment, in addition to or alternatively to assigning scores individually to word sequence pairs, scores can be assigned to different alignments of word sequences in the electronic documents 302 and 312. In this case, an empty second-language word sequence can exist (as an additional row in the matrix 322), and the matrix 322 can include binary values, with a single "1" in each column (which indicates that a word sequence in the first electronic document corresponds with a certain word sequence in the second electronic document or with the empty word sequence) and the remaining entries in each column "0". The ranker may then assign a score to the matrix in its entirety, wherein this score is indicative of parallelism between word sequences in the scored alignment.
With reference now to FIG. 4, an example methodology 400 that facilitates determining whether two word sequences in a word sequence pair are parallel to one another is illustrated. While the methodology 400 is described as being a series of acts that are performed in a sequence, it is to be understood that the methodology is not limited by the order of the sequence. For instance, some acts may occur in a different order than what is described herein. In addition, an act may occur concurrently with another act. Furthermore, in some instances, not all acts may be required to implement the methodology described herein.
Moreover, the acts described herein may be computer-executable instructions that can be implemented by one or more processors and/or stored on a computer-readable medium or media. The computer-executable instructions may include a routine, a sub-routine, programs, a thread of execution, and/or the like. Still further, results of acts of the methodologies may be stored in a computer-readable medium, displayed on a display device, and/or the like. The computer-readable medium may be a non-transitory medium, such as memory, hard drive, CD, DVD, flash drive, or the like.
The methodology 400 begins at 402, and at 404 a first electronic document that comprises a first word sequence is received. In an example, the first word sequence can be written in a first language.
At 406, a second electronic document is received, wherein the second electronic document comprises a set of word sequences. In an example, the set of word sequences can be written in a second language. Furthermore, as described herein, the first and second electronic documents may be web pages such as related web pages that belong to an online collaborative encyclopedia.
At 408, a score is assigned to each word sequence pair, wherein a word sequence pair comprises the first word sequence and a word sequence from the set of word sequences in the second electronic document. The score can be indicative of an amount of parallelism between the word sequences in the respective word sequence pairs.
At 410, the highest score assigned to the word sequence pairs is compared to a threshold value. At 412, a determination is made regarding whether the score is above the threshold. If the score is above the threshold, then at 414 data is generated that indicates that the word sequences in the word sequence pair are parallel to one another. For example, the data can indicate that the word sequence from the set of word sequences in the word sequence pair is a translation of the first word sequence. In another example, the data can indicate that the two word sequences in the word sequence pair are paraphrases of one another.
If at 412 it is determined that the highest score is not above the threshold, then at 416 data is generated that indicates that the first word sequence does not have a corresponding parallel word sequence in the second electronic document. The methodology then completes at 418.
Now referring to FIG. 5, a high-level illustration of an example computing device 500 that can be used in accordance with the systems and methodologies disclosed herein is illustrated. For instance, the computing device 500 may be used in a system that supports detecting parallel word sequences in electronic documents. In another example, at least a portion of the computing device 500 may be used in a system that supports extracting parallel word sequences from web pages that belong to an online collaborative encyclopedia. The computing device 500 includes at least one processor 502 that executes instructions that are stored in a memory 504. The memory 504 may be or include RAM, ROM, EEPROM, Flash memory, or other suitable memory. The instructions may be, for instance, instructions for implementing functionality described as being carried out by one or more components discussed above or instructions for implementing one or more of the methods described above. The processor 502 may access the memory 504 by way of a system bus 506. In addition to storing executable instructions, the memory 504 may also store electronic documents, word sequences, ranking models, word sequence pairs, data indicating that words in a word sequence pair are parallel to one another, etc.
The computing device 500 additionally includes a data store 508 that is accessible by the processor 502 by way of the system bus 506. The data store 508 may be or include any suitable computer-readable storage, including a hard disk, memory, etc. The data store 508 may include executable instructions, electronic documents which may be web pages, word sequences, etc. The computing device 500 also includes an input interface 510 that allows external devices to communicate with the computing device 500. For instance, the input interface 510 may be used to receive instructions from an external computer device, a user, etc. The computing device 500 also includes an output interface 512 that interfaces the computing device 500 with one or more external devices. For example, the computing device 500 may display text, images, etc. by way of the output interface 512.
Additionally, while illustrated as a single system, it is to be understood that the computing device 500 may be a distributed system. Thus, for instance, several devices may be in communication by way of a network connection and may collectively perform tasks described as being performed by the computing device 500.
As used herein, the terms “component” and “system” are intended to encompass hardware, software, or a combination of hardware and software. Thus, for example, a system or component may be a process, a process executing on a processor, or a processor. Additionally, a component or system may be localized on a single device or distributed across several devices. Furthermore, a component or system may refer to a portion of memory and/or a series of transistors.
It is noted that several examples have been provided for purposes of explanation. These examples are not to be construed as limiting the hereto-appended claims. Additionally, it may be recognized that the examples provided herein may be permutated while still falling under the scope of the claims.

Claims (20)

What is claimed is:
1. A method comprising the following computer-executable acts:
receiving a first electronic document, wherein the first electronic document comprises a first set of word sequences;
receiving a second electronic document, wherein the second electronic document comprises a second set of word sequences, wherein a word sequence pair comprises a word sequence from the first set of word sequences and a word sequence from the second set of word sequences or an empty word sequence, and wherein the second document comprises a hyperlink to the first document;
automatically correlating the first electronic document and the second electronic document based at least in part upon the hyperlink;
assigning a respective label to each word sequence pair to generate a plurality of possible alignments of word sequences in the first set of word sequences with respect to word sequences in the second set of word sequences;
assigning respective scores to a plurality of different alignments, wherein a score is based at least in part upon a plurality of features comprising:
a first distortion feature that indicates a difference between a position of a previously aligned word sequence and a currently aligned word sequence with respect to at least one word sequence in the first set of word sequences and the respective word sequences in the second set of word sequences; and
a second distortion feature that is indicative of a difference between:
an actual position of the currently aligned word sequence in the second electronic document relative to the previously aligned word sequence in the second electronic document; and
an expected position of the currently aligned word sequence in the second electronic document, the expected position being adjacent to the previously aligned word sequence; and
causing a highest score assigned to an alignment amongst all scores assigned to the plurality of different alignments to be stored in a data repository, wherein the score is indicative of an amount of parallelism between word sequences aligned in the alignment.
2. The method of claim 1, wherein the word sequences in the first set of word sequences are in a first language and the word sequences in the second set of word sequences are in a second language.
3. The method of claim 2, wherein the highest score is utilized to train a statistical machine translation system that is configured to translate text in the first language to text in the second language.
4. The method of claim 2, wherein the first electronic document and the second electronic document are a first web page and a second web page, respectively.
5. The method of claim 4, further comprising:
determining that the first web page and the second web page are directed toward the same subject matter based at least in part upon the second electronic document comprising the hyperlink to the first electronic document; and
automatically correlating the first web page with the second web page based at least in part upon the determining that the first web page and the second web page are directed toward the same subject matter.
6. The method of claim 5, wherein the first web page and the second web page are from an online collaborative encyclopedia.
7. The method of claim 1, wherein the score is assigned based at least in part upon features derived from word alignments between words in word sequences in the first set of word sequences and words in the respective word sequences in the second set of word sequences.
8. The method of claim 1, wherein the score is assigned based at least in part upon word-level induced lexicon features between words in at least one word sequence in the first set of word sequences and words in the respective word sequences in the second set of word sequences.
9. A computing apparatus, comprising:
a processor; and
a memory that is configured with components that are executable by the processor, the components comprising:
a receiver component that receives:
a first electronic document that comprises a first set of word sequences; and
a second electronic document that comprises a second set of word sequences and a hyperlink to the first electronic document, wherein the first electronic document is automatically correlated with the second electronic document based at least in part upon the hyperlink to the first electronic document in the second electronic document;
a feature extractor component that extracts a plurality of features based on the first electronic document and the second electronic document, the plurality of features comprising:
a first distortion feature that is indicative of a difference between a position of a previously aligned word sequence and a currently aligned word sequence with respect to at least one word sequence in the first set of word sequences and the respective word sequences in the second set of word sequences or an empty word sequence; and
a second distortion feature that is indicative of a difference between:
an actual position of the currently aligned word sequence in the second electronic document relative to the previously aligned word sequence in the second electronic document; and
an expected position of the currently aligned word sequence in the second electronic document, the expected position being adjacent to the previously aligned word sequence; and
a ranker component that outputs a ranked list of word sequence pairs, wherein the word sequence pairs comprise a word sequence in the first set of word sequences and a word sequence in the second set of word sequences, wherein the ranked list of word sequence pairs are ranked in an order based at least in part upon the first distortion feature and the second distortion feature and that is indicative of an amount of parallelism between word sequences in the word sequence pairs.
10. The computing apparatus of claim 9, wherein the components further comprise a comparer component that compares scores assigned to the word sequence pairs with a threshold score and selects the word sequence pair with a highest score as including word sequences that are parallel to one another.
11. The computing apparatus of claim 9, wherein the ranker component outputs the ranked list of word sequence pairs based at least in part upon word sequence alignment between the first electronic document and the second electronic document.
12. The computing apparatus of claim 9, wherein the first electronic document and the second electronic document are a first web page and a second web page, respectively.
13. The computing apparatus of claim 12, wherein the first web page and the second web page belong to an online collaborative encyclopedia.
14. The computing apparatus of claim 13, wherein the first set of word sequences in the first web page is in a first language and the second set of word sequences in the second web page is in a second language.
15. The computing apparatus of claim 9, wherein the plurality of features further comprise one or more of a feature derived from word alignments between the first electronic document and the second electronic document, a feature derived from a markup of the first electronic document or the second electronic document, or a word-level induced lexicon feature.
16. A computer-readable data storage device comprising instructions that, when executed by a processor, cause the processor to perform acts, comprising:
receiving a first web page that comprises a first set of word sequences in a first language;
receiving a second web page that comprises a second set of word sequences in a second language, wherein the first web page and the second web page are web pages in a collaborative encyclopedia, wherein the first web page and the second web page are directed toward same subject matter, wherein the first web page comprises a first hyperlink to the second web page and the second web page comprises a second hyperlink to the first web page;
automatically correlating the first web page and the second web page based at least in part upon the first hyperlink and the second hyperlink;
assigning a score to a plurality of word sequence pairs, wherein a word sequence pair comprises a word sequence in the first set of word sequences and a word sequence in the second set of word sequences, wherein the score is indicative of an amount of parallelism between word sequences in the word sequence pairs, and wherein the score is assigned based at least in part upon a plurality of features comprising:
a first distortion feature that is indicative of a difference between a position of a previously aligned word sequence and a currently aligned word sequence with respect to at least one word sequence in the first set of word sequences and the respective word sequences in the second set of word sequences or an empty word sequence; and
a second distortion feature that is indicative of a difference between:
an actual position of the currently aligned word sequence in the second electronic document relative to the previously aligned word sequence in the second electronic document; and
an expected position of the currently aligned word sequence in the second electronic document, the expected position being adjacent to the previously aligned word sequence; and
comparing a highest score assigned to a word sequence pair amongst all scores assigned to word sequence pairs that comprise a first word sequence to a threshold value; and
if the highest score is above the threshold value, outputting data that indicates that the word sequences in the word sequence pair that has been assigned the highest score are parallel to one another.
17. The method of claim 1, wherein the plurality of features further comprises one or more of a feature derived from word alignments between the first electronic document and the second electronic document, a feature derived from a markup of the first electronic document or the second electronic document, or a word-level induced lexicon feature.
18. The computer-readable data storage device of claim 16, wherein the plurality of features further comprises one or more of a feature derived from word alignments between the first electronic document and the second electronic document, a feature derived from a markup of the first electronic document or the second electronic document, or a word-level induced lexicon feature.
19. The method of claim 4, wherein at least one of the first web page or the second web page is a web page pertaining to news items.
20. The method of claim 1, wherein the plurality of features further comprises an image feature indicating whether a word sequence in the first set of word sequences and a word sequence in the second set of word sequences are both captions of an image.
US12/794,778 2010-06-07 2010-06-07 Locating parallel word sequences in electronic documents Active 2031-08-28 US8560297B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/794,778 US8560297B2 (en) 2010-06-07 2010-06-07 Locating parallel word sequences in electronic documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US12/794,778 US8560297B2 (en) 2010-06-07 2010-06-07 Locating parallel word sequences in electronic documents

Publications (2)

Publication Number Publication Date
US20110301935A1 US20110301935A1 (en) 2011-12-08
US8560297B2 true US8560297B2 (en) 2013-10-15

Family

ID=45065161

Family Applications (1)

Application Number Title Priority Date Filing Date
US12/794,778 Active 2031-08-28 US8560297B2 (en) 2010-06-07 2010-06-07 Locating parallel word sequences in electronic documents

Country Status (1)

Country Link
US (1) US8560297B2 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8655640B2 (en) * 2011-03-02 2014-02-18 Raytheon Bbn Technologies Corp. Automatic word alignment
US8838434B1 (en) * 2011-07-29 2014-09-16 Nuance Communications, Inc. Bootstrap call router to other languages using selected N-best translations
US20130103695A1 (en) * 2011-10-21 2013-04-25 Microsoft Corporation Machine translation detection in web-scraped parallel corpora
US9374427B2 (en) * 2011-11-01 2016-06-21 Rahul Shekher Geographical location based cloud storage
US9009148B2 (en) * 2011-12-19 2015-04-14 Microsoft Technology Licensing, Llc Clickthrough-based latent semantic model
US20130212095A1 (en) * 2012-01-16 2013-08-15 Haim BARAD System and method for mark-up language document rank analysis
US11308037B2 (en) * 2012-10-30 2022-04-19 Google Llc Automatic collaboration
US10282419B2 (en) * 2012-12-12 2019-05-07 Nuance Communications, Inc. Multi-domain natural language processing architecture
CN103473280B (en) * 2013-08-28 2017-02-08 中国科学院合肥物质科学研究院 Method for mining comparable network language materials
CN106462579B (en) * 2014-10-15 2019-09-27 微软技术许可有限责任公司 Dictionary is constructed for selected context
US9400781B1 (en) * 2016-02-08 2016-07-26 International Business Machines Corporation Automatic cognate detection in a computer-assisted language learning system
US10318640B2 (en) * 2016-06-24 2019-06-11 Facebook, Inc. Identifying risky translations
US10387576B2 (en) * 2017-11-30 2019-08-20 International Business Machines Corporation Document preparation with argumentation support from a deep question answering system
US10984279B2 (en) * 2019-06-13 2021-04-20 Wipro Limited System and method for machine translation of text


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6321191B1 (en) 1999-01-19 2001-11-20 Fuji Xerox Co., Ltd. Related sentence retrieval system having a plurality of cross-lingual retrieving units that pairs similar sentences based on extracted independent words
US6757646B2 (en) * 2000-03-22 2004-06-29 Insightful Corporation Extended functionality for an inverse inference engine based web search
US20040122656A1 (en) * 2001-03-16 2004-06-24 Eli Abir Knowledge system method and appparatus
US20030233222A1 (en) * 2002-03-26 2003-12-18 Radu Soricut Statistical translation using a large monolingual corpus
US7620538B2 (en) * 2002-03-26 2009-11-17 University Of Southern California Constructing a translation lexicon from comparable, non-parallel corpora
US20050060643A1 (en) * 2003-08-25 2005-03-17 Miavia, Inc. Document similarity detection and classification system
US20100177970A1 (en) * 2004-02-15 2010-07-15 Exbiblio B.V. Capturing text from rendered documents using supplemental information
US20050228643A1 (en) * 2004-03-23 2005-10-13 Munteanu Dragos S Discovery of parallel text portions in comparable collections of corpora and training using comparable texts
US20060009963A1 (en) 2004-07-12 2006-01-12 Xerox Corporation Method and apparatus for identifying bilingual lexicons in comparable corpora
US20090083023A1 (en) * 2005-06-17 2009-03-26 George Foster Means and Method for Adapted Language Translation
US20090076792A1 (en) * 2005-12-16 2009-03-19 Emil Ltd Text editing apparatus and method
US20070250306A1 (en) 2006-04-07 2007-10-25 University Of Southern California Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections
US20080300857A1 (en) * 2006-05-10 2008-12-04 Xerox Corporation Method for aligning sentences at the word level enforcing selective contiguity constraints
US8249855B2 (en) * 2006-08-07 2012-08-21 Microsoft Corporation Identifying parallel bilingual data over a network
US20080120092A1 (en) 2006-11-20 2008-05-22 Microsoft Corporation Phrase pair extraction for statistical machine translation
US20080262826A1 (en) * 2007-04-20 2008-10-23 Xerox Corporation Method for building parallel corpora

Non-Patent Citations (49)

* Cited by examiner, † Cited by third party
Title
Adafre, et al., "Finding Similar Sentences Across Multiple Languages in Wikipedia", Retrieved at << http://www.sics.se/jussi/newtext/working—notes/10—adafre—derijke.pdf >>, Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006, pp. 1-8.
Adafre, et al., "Finding Similar Sentences Across Multiple Languages in Wikipedia", Retrieved at >, Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006, pp. 1-8.
Bar-Haim, et al., "The Second Pascal Recognising Textual Entailment Challenge", Retrieved at << http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.132.9327&rep=rep1&type=pdf >>, In The Second PASCAL Recognising Textual Entailment Challenge, Apr. 10, 2006, pp. 1-9.
Bar-Haim, et al., "The Second Pascal Recognising Textual Entailment Challenge", Retrieved at >, In The Second PASCAL Recognising Textual Entailment Challenge, Apr. 10, 2006, pp. 1-9.
Blunsom, et al., "Discriminative Word Alignment with Conditional Random Fields", Retrieved at << http://portal.acm.org/citation.cfm?id=1220184 >>, Annual Meeting of the ACL, Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, Jul. 17-18, 2006, pp. 65-72.
Blunsom, et al., "Discriminative Word Alignment with Conditional Random Fields", Retrieved at >, Annual Meeting of the ACL, Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, Jul. 17-18, 2006, pp. 65-72.
Brown, et al., "The Mathematics of Statistical Machine Translation: Parameter Estimation.", Retrieved at << http://acl.ldc.upenn.edu/J/J93/J93-2003.pdf >>, Computational Linguistics, Special issue on using large corpora: II, vol. 19, No. 2, Jun. 1993, pp. 1-50.
Brown, et al., "The Mathematics of Statistical Machine Translation: Parameter Estimation.", Retrieved at >, Computational Linguistics, Special issue on using large corpora: II, vol. 19, No. 2, Jun. 1993, pp. 1-50.
Fung, et al., "Multi-Level Bootstrapping For Extracting Parallel Sentences from a Quasicomparable Corpus", Retrieved at << http://www.mt-archive.info/Coling-2004-Fung-1.pdf >>, International Conference On Computational Linguistics, Proceedings of the 20th international conference on Computational Linguistics, Aug. 23-27, 2004, pp. 1-7.
Fung, et al., "Multi-Level Bootstrapping For Extracting Parallel Sentences from a Quasicomparable Corpus", Retrieved at >, International Conference On Computational Linguistics, Proceedings of the 20th international conference on Computational Linguistics, Aug. 23-27, 2004, pp. 1-7.
Gale, et al., "Identifying Word Correspondences in Parallel Texts", Retrieved at << http://www.aclweb.org/anthology/H/H91/H91-1026.pdf >>, Speech and Natural Language, Proceedings of a Workshop, Feb. 19-22, 1991, pp. 152-157.
Gale, et al., "Identifying Word Correspondences in Parallel Texts", Retrieved at >, Speech and Natural Language, Proceedings of a Workshop, Feb. 19-22, 1991, pp. 152-157.
Haghighi, et al., "Learning Bilingual Lexicons from Monolingual Corpora", Retrieved at << http://www.aclweb.org/anthology/P/P08/P08-1088.pdf >>, In Proceedings of ACL-08: HLT, Jun. 2008, pp. 771-779.
Haghighi, et al., "Learning Bilingual Lexicons from Monolingual Corpora", Retrieved at >, In Proceedings of ACL-08: HLT, Jun. 2008, pp. 771-779.
Koehn, et al., "Learning a Translation Lexicon from Monolingual Corpora", Retrieved at << http://www.aclweb.org/anthology/W/W02/W02-0902.pdf >>, Annual Meeting of the ACL, Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition, vol. 9, Jul. 12, 2002, pp. 1-8.
Koehn, et al., "Learning a Translation Lexicon from Monolingual Corpora", Retrieved at >, Annual Meeting of the ACL, Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition, vol. 9, Jul. 12, 2002, pp. 1-8.
Koehn, et al., "Statistical Phrase-Based Translation", Retrieved at << http://www.isi.edu/˜marcu/papers/phrases-hlt2003.pdf >>, North American Chapter Of The Association For Computational Linguistics, Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, May 27-Jun. 1, 2003, pp. 1-7.
Koehn, et al., "Statistical Phrase-Based Translation", Retrieved at >, North American Chapter Of The Association For Computational Linguistics, Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, May 27-Jun. 1, 2003, pp. 1-7.
Koehn, Philipp., "Europarl: A Parallel Corpus for Statistical Machine Translation", Retrieved at << http://www.mt-archive.info/MTS-2005-Koehn.pdf >>, In MT summit X, The tenth machine translation summit, 2005, pp. 79-86.
Koehn, Philipp., "Europarl: A Parallel Corpus for Statistical Machine Translation", Retrieved at >, In MT summit X, The tenth machine translation summit, 2005, pp. 79-86.
Lafferty, et al., "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", Retrieved at http://www.cis.upenn.edu/~pereira/papers/crf.pdf >>, Proceedings of the Eighteenth International Conference on Machine Learning, Jun. 28-Jul. 1, 2001, pp. 1-8.
Lafferty, et al., "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", Retrieved at http://www.cis.upenn.edu/˜pereira/papers/crf.pdf >>, Proceedings of the Eighteenth International Conference on Machine Learning, Jun. 28-Jul. 1, 2001, pp. 1-8.
Mining meaning from Wikipedia by Olena Medelyan, David Milne, Catherine Legg and Ian H. Witten, International Journal of Human-Computer Studies, 2009, pp. 716-754. *
Moore, Robert C., "Fast and Accurate Sentence Alignment of Bilingual Corpora", Retrieved at << http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.198&rep=rep1&type=pdf >>, Lecture Notes In Computer Science, vol. 2499, Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users, Oct. 6-12, 2002, pp. 1-10.
Moore, Robert C., "Fast and Accurate Sentence Alignment of Bilingual Corpora", Retrieved at >, Lecture Notes In Computer Science, vol. 2499, Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users, Oct. 6-12, 2002, pp. 1-10.
Munteanu, et al., "Improved Machine Translation Performace via Parallel Sentence Extraction from Comparable Corpora.", Retrieved at << http://www.mt-archive.info/HLT-NAACL-2004-Munteanu.pdf >>, HLT-NAACL, 2004, pp. 1-8.
Munteanu, et al., "Improved Machine Translation Performace via Parallel Sentence Extraction from Comparable Corpora.", Retrieved at >, HLT-NAACL, 2004, pp. 1-8.
Munteanu, et al., "Improving Machine Translation Performance by Exploiting Non-Parallel Corpora", Retrieved at << http://www.aclweb.org/anthology/J/J05/J05-4003.pdf >>, Computational Linguistics, vol. 31, No. 4, Dec. 2005, pp. 477-504.
Munteanu, et al., "Improving Machine Translation Performance by Exploiting Non-Parallel Corpora", Retrieved at >, Computational Linguistics, vol. 31, No. 4, Dec. 2005, pp. 477-504.
Och, Franz J., "Minimum Error Rate Training in Statistical Machine Translation", Retrieved at << http://acl.ldc.upenn.edu/acl2003/main/pdfs/Och.pdf >>, Annual Meeting of the ACL, Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, Jul. 7-12, 2003, pp. 1-8.
Och, Franz J., "Minimum Error Rate Training in Statistical Machine Translation", Retrieved at >, Annual Meeting of the ACL, Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, Jul. 7-12, 2003, pp. 1-8.
Papineni, et al., "BLEU: A Method for Automatic Evaluation Of machine Translation", Retrieved at << http://acl.ldc.upenn.edu/acl2002/MAIN/pdfs/Main076.pdf >>, Annual Meeting of the ACL, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Jul. 7-12, 2002, pp. 1-8.
Papineni, et al., "BLEU: A Method for Automatic Evaluation Of machine Translation", Retrieved at >, Annual Meeting of the ACL, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Jul. 7-12, 2002, pp. 1-8.
Rapp, Reinhard., "Automatic Identification of Word Translations from Unrelated English and German Corpora", Retrieved at << http://acl.ldc.upenn.edu/P/P99/P99-1067.pdf >>, Annual Meeting of the ACL, Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, Jun. 20-26, 1999, pp. 519-526.
Rapp, Reinhard., "Automatic Identification of Word Translations from Unrelated English and German Corpora", Retrieved at >, Annual Meeting of the ACL, Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, Jun. 20-26, 1999, pp. 519-526.
Resnik, et al., "The Web as a Parallel Corpus", Retrieved at << http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/nasmith/papers/resnik+smith.cl03.pdf >>, Computational Linguistics, Special issue on web as corpus, vol. 29, No. 3, Sep. 2003, pp. 349-380.
Resnik, et al., "The Web as a Parallel Corpus", Retrieved at >, Computational Linguistics, Special issue on web as corpus, vol. 29, No. 3, Sep. 2003, pp. 349-380.
Sarikaya, et al., "Iterative Sentence-Pair Extraction from Quasi-Parallel Corpora for Machine Translation", Retrieved at << http://www1.cs.columbia.edu/˜smaskey/papers/sent—select—mt.pdf >>, In Interspeech, 2009, pp. 432-435.
Sarikaya, et al., "Iterative Sentence-Pair Extraction from Quasi-Parallel Corpora for Machine Translation", Retrieved at >, In Interspeech, 2009, pp. 432-435.
Tillmann, Christoph., "A Beam-search Extraction Algorithm for Comparable Data", Retrieved at << http://www.aclweb.org/anthology/P/P09/P09-2057.pdf >>, Annual Meeting of the ACL, Proceedings of the ACL-IJCNLP 2009, Conference Short Papers, Aug. 4, 2009, pp. 225-228.
Tillmann, Christoph., "A Beam-search Extraction Algorithm for Comparable Data", Retrieved at >, Annual Meeting of the ACL, Proceedings of the ACL-IJCNLP 2009, Conference Short Papers, Aug. 4, 2009, pp. 225-228.
Tillmann, et al., "A Simple Sentence-level Extraction Algorithm for Comparable Data", Retrieved at << http://www.aclweb.org/anthology/N/N09/N09-2024.pdf >>, Human Language Technology Conference, Proceedings of Human Language Technologies, The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, May 31-Jun. 5, 2009, pp. 93-96.
Tillmann, et al., "A Simple Sentence-level Extraction Algorithm for Comparable Data", Retrieved at >, Human Language Technology Conference, Proceedings of Human Language Technologies, The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, May 31-Jun. 5, 2009, pp. 93-96.
Vogel, et al., "HMM-based Word Alignment in Statistical Translation", Retrieved at << http://www.aclweb.org/anthology/C/C96/C96-2141.pdf >>, International Conference On Computational Linguistics, Proceedings of the 16th conference on Computational linguistics, vol. 2, Aug. 5-9, 1996, pp. 836-841.
Wu, et al., "Inversion Transduction Grammar Constraints for Mining Parallel Sentences from Quasi-Comparable Corpora", Retrieved at << http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.5058&rep=rep1&type=pdf >>, In Forthcoming, 2005, pp. 1-12.
Zhao, et al., "Adaptive Parallel Sentences Mining from Web Bilingual News Collection", Retrieved at << http://www.cs.cmu.edu/~bzhao/publications/ICDM2002_SentAlign.pdf >>, ICDM, Proceedings of the 2002 IEEE International Conference on Data Mining, Dec. 9-12, 2002, pp. 745-748.

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130144878A1 (en) * 2011-12-02 2013-06-06 Microsoft Corporation Data discovery and description service
US9286414B2 (en) * 2011-12-02 2016-03-15 Microsoft Technology Licensing, LLC Data discovery and description service
US9292094B2 (en) 2011-12-16 2016-03-22 Microsoft Technology Licensing, LLC Gesture inferred vocabulary bindings
US9746932B2 (en) 2011-12-16 2017-08-29 Microsoft Technology Licensing, LLC Gesture inferred vocabulary bindings
US10303761B2 (en) * 2016-09-27 2019-05-28 Panasonic Intellectual Property Management Co., Ltd. Method, non-transitory computer-readable recording medium storing a program, apparatus, and system for creating similar sentence from original sentences to be translated
US10755729B2 (en) * 2016-11-07 2020-08-25 Axon Enterprise, Inc. Systems and methods for interrelating text transcript information with video and/or audio information
US10943600B2 (en) 2016-11-07 2021-03-09 Axon Enterprise, Inc. Systems and methods for interrelating text transcript information with video and/or audio information

Also Published As

Publication number Publication date
US20110301935A1 (en) 2011-12-08

Similar Documents

Publication Publication Date Title
US8560297B2 (en) Locating parallel word sequences in electronic documents
Denecke Using SentiWordNet for multilingual sentiment analysis
Van der Plas et al. Finding synonyms using automatic word alignment and measures of distributional similarity
JP5356197B2 (en) Word semantic relation extraction device
Cook et al. Novel word-sense identification
US9262400B2 (en) Non-transitory computer readable medium and information processing apparatus and method for classifying multilingual documents
US10606903B2 (en) Multi-dimensional query based extraction of polarity-aware content
Bal et al. Sentiment analysis with a multilingual pipeline
Piskorski et al. Slav-NER: the 3rd cross-lingual challenge on recognition, normalization, classification, and linking of named entities across Slavic languages
Mataoui et al. A new syntax-based aspect detection approach for sentiment analysis in Arabic reviews
Yun Ying et al. Opinion mining on Viet Thanh Nguyen's The Sympathizer using topic modelling and sentiment analysis
Attia et al. An automatically built named entity lexicon for Arabic
Duran et al. Some issues on the normalization of a corpus of products reviews in Portuguese
Daðason Post-correction of Icelandic OCR text
Rayson et al. Automatic error tagging of spelling mistakes in learner corpora
Warburton Processing terminology for the translation pipeline
Kanojia et al. Challenge dataset of cognates and false friend pairs from indian languages
Chatzitheodorou COSTA MT evaluation tool: An open toolkit for human machine translation evaluation
Khairova et al. Automatic extraction of synonymous collocation pairs from a text corpus
Eyecioglu et al. Constructing a Turkish corpus for paraphrase identification and semantic similarity
Zampieri et al. Grammatical error detection with limited training data: The case of Chinese
Hakkani-Tur et al. Statistical sentence extraction for information distillation
Saralegi et al. Cross-lingual projections vs. corpora extracted subjectivity lexicons for less-resourced languages
Drury et al. A contextual classification strategy for polarity analysis of direct quotations from financial news
Mridha et al. Semantic error detection and correction in Bangla sentence

Legal Events

Date Code Title Description
AS Assignment

Owner name: MICROSOFT CORPORATION, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:QUIRK, CHRISTOPHER BRIAN;TOUTANOVA, KRISTINA N.;SMITH, JASON ROBERT;REEL/FRAME:024490/0122

Effective date: 20100602

STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034544/0001

Effective date: 20141014

FPAY Fee payment

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8