US20110301935A1 - Locating parallel word sequences in electronic documents - Google Patents
- Publication number
- US20110301935A1 (application Ser. No. 12/794,778)
- Authority
- US
- United States
- Prior art keywords
- word
- word sequence
- electronic document
- sequences
- word sequences
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/45—Example-based machine translation; Alignment
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Definitions
- Machine translation refers to the utilization of a computing device to translate text or speech from a source natural language to a target natural language. Due to complexities in natural languages, however, executing a translation between natural languages is a non-trivial task.
- labeled training data can include word sequence pairs that are either translations of one another or are not translations of one another. These word sequence pairs are labeled and provided to a learning system that learns a model based at least in part upon this labeled training data.
- machine translation systems are trained in such a way that they can translate a word sequence not included in the training data by observing how translation between a source language and target language works in practice with respect to various word sequences in the training data.
- Generating this training data requires individuals who can speak both German and Bulgarian, for example, to manually label word sequences (e.g., as being parallel to one another or not parallel to one another). This labeled data may then be used to train a machine translation system that is configured to translate between the German and Bulgarian languages, for instance. Manually labeling training data, however, is a relatively daunting task, particularly if a great amount of training data is desired.
- Such technologies include utilizing electronic sources that have heretofore not been considered for obtaining parallel data.
- These sources include online collaborative dictionaries/encyclopedias.
- Such an encyclopedia may include various documents in multiple different languages that discuss a substantially similar subject. Therefore, for instance, such a collaborative encyclopedia may comprise a first entry about a first topic written in the German language and may also comprise a second entry about the same topic written in the Bulgarian language.
- Such documents may be direct translations of one another, partial translations of one another, or authored entirely separately from one another. That the electronic documents are in some way related can be inferred, for instance, as oftentimes links exist between the documents in the collaborative online encyclopedia.
- each of the documents may include a link to a substantially similar image.
- entries in the online collaborative encyclopedia pertaining to a substantially similar topic may include links in the text to substantially similar web pages (e.g., web pages that are also linked as being related to one another), etc.
- Other sources of information are also contemplated, such as news sources that publish news about events in different languages, though in this case determining the related document pairs may be more challenging.
- documents that have similar person names in the title and are published within one week of one another might be considered related.
- word sequences in documents of different languages may or may not be direct translations of one another. Regardless, these sources may be good sources for obtaining parallel data that can be used for training a machine translation system. Accordingly, also described herein are technologies pertaining to ascertaining if a word sequence in a first electronic document written in a first language is parallel to a word sequence in a second electronic document that is written in a second language. For example, electronic documents that are known to be related to one another can be analyzed and word sequences can be generated through analysis of these documents. In one example, a word sequence can be an entire sentence. In another example, a word sequence may be a phrase.
- a word sequence in the first electronic document in the first language can be selected and paired separately with each word sequence in the second electronic document to create word sequence pairs.
- a score may be assigned to each word sequence pair, wherein the score is indicative of whether or not the word sequences in the word sequence pair are parallel to one another (e.g., whether the first word sequence is parallel to the other word sequence in the word sequence pair). The higher the score assigned to the word sequence pair, the greater the probability that the two word sequences in the word sequence pair are parallel to one another. Thus, a ranked list of word sequence pairs can be output.
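- The pairing-and-ranking step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the scoring function is a toy token-overlap measure standing in for the trained ranker, and all names are illustrative.

```python
from itertools import product

def rank_pairs(source_seqs, target_seqs, score_fn):
    """Pair every source word sequence with every target word sequence,
    score each pair, and return pairs sorted by descending score."""
    scored = [
        ((s, t), score_fn(s, t))
        for s, t in product(source_seqs, target_seqs)
    ]
    # Higher score => greater probability the two sequences are parallel.
    return sorted(scored, key=lambda item: item[1], reverse=True)

def toy_score(s, t):
    # Stand-in for a trained ranker: fraction of shared lowercase tokens.
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / max(len(a | b), 1)

ranked = rank_pairs(["the cat sat"], ["the cat sat", "a dog ran"], toy_score)
```

The output is the ranked list of word sequence pairs referred to above, with the most probably parallel pair first.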
- scores can be assigned to sequence pairs through utilization of any suitable method/mechanism.
- a ranker can be trained through utilization of manually labeled parallel word sequence pairs and non-parallel word sequence pairs.
- the ranker can be trained to take into consideration the global configuration of electronic documents. That is, in many sources it may be common for parallel word sequences to occur in clusters. The ranker can consider such alignment when assigning scores to word sequence pairs.
- a threshold value may then be compared to a highest score assigned to a word sequence pair. If the score is above the threshold value, then a label can be assigned to the word sequence pair indicating that the word sequences in the word sequence pair are parallel to one another (e.g., translations of one another). If no score is above the threshold value, then a null value can be assigned to the first word sequence which indicates that the first word sequence does not have a corresponding parallel word sequence in the second electronic document. As an example, the threshold value may be selected by the user before executing the system.
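- The comparison of the highest score against a user-selected threshold can be sketched as below; the function name and dictionary representation are illustrative assumptions, not taken from the patent.

```python
def label_best_pair(pair_scores, threshold):
    """pair_scores: dict mapping (source_seq, target_seq) -> ranker score.
    Returns the best-scoring pair (i.e., labeled as parallel) if its score
    clears the user-selected threshold; otherwise None, indicating that the
    source sequence has no parallel counterpart in the target document."""
    if not pair_scores:
        return None
    best_pair, best_score = max(pair_scores.items(), key=lambda kv: kv[1])
    return best_pair if best_score > threshold else None
```

For example, with scores {("s1", "tA"): 0.9, ("s1", "tB"): 0.4} and a threshold of 0.5, the pair ("s1", "tA") would be labeled parallel; raising the threshold to 0.95 would yield the null result instead.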
- a substantially optimal (in terms of being parallel) second language word sequence may be sought for each first language word sequence.
- a ranker described herein can score every pairing with second language word sequences, as well as the pair of the first language word sequence and the empty word sequence.
- a label may then be assigned to the second language word sequence assigned the highest (or best) score unless the best scoring second language word sequence is the empty word sequence, in which case the null value is assigned to the aforementioned first language word sequence. In this way, a separate threshold is automatically selected for each first language word sequence.
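- Including the empty word sequence as a scored candidate, as described above, can be sketched as follows; this is an illustrative simplification in which the ranker's learned score for the empty pairing is replaced by a fixed toy value.

```python
EMPTY = None  # sentinel for the empty (null) target word sequence

def align_source_seq(source_seq, target_seqs, score_fn):
    """Score the pairing of source_seq with every target word sequence AND
    with the empty sequence; return the best-scoring candidate (None means
    unaligned). Because the empty sequence competes as a candidate, a
    separate threshold is effectively selected per source sequence."""
    candidates = list(target_seqs) + [EMPTY]
    return max(candidates, key=lambda t: score_fn(source_seq, t))

def toy_score(s, t):
    if t is EMPTY:
        return 0.3  # toy stand-in for the learned "no counterpart" score
    a, b = set(s.split()), set(t.split())
    return len(a & b) / max(len(a | b), 1)
```

Here a source sequence with no good match loses to the empty candidate and is left unaligned, with no single global cutoff needed.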
- an exemplary ranker described herein need not consider each first language word sequence in isolation. Instead, the ranker may assign scores to particular word sequence alignments across documents (e.g., whole documents or a collection of word sequences between documents), where every first language word sequence is labeled with either a corresponding second language word sequence or the empty word sequence. In this way, the ranker can look at adjacent information to affect its score. As an example, it may be expected that parallel word sequences tend to occur in clusters, and the ranker can output a high score to correspondences where this phenomenon occurs.
- FIG. 1 is a functional block diagram of an example system that facilitates assigning a label to a word sequence pair that indicates that word sequences in such pair are parallel to one another.
- FIG. 2 is a functional block diagram of an example system that facilitates segmenting electronic documents into word sequences.
- FIG. 3 is an example depiction of word sequence pairs.
- FIG. 4 is a flow diagram that illustrates an example methodology for labeling a word sequence pair to indicate that word sequences in such pair are parallel to one another.
- FIG. 5 is an example computing system.
- the system 100 includes a data store 102 that comprises a first electronic document 104 and a second electronic document 106 .
- the first electronic document 104 can include a first set of word sequences 108 and the second electronic document 106 can include a second set of word sequences 110 .
- these word sequences can be sentences.
- the word sequences can be multi-word phrases.
- a receiver component 112 can receive the first electronic document 104 and the second electronic document 106 from the data store 102 .
- a feature extractor component 114 is in communication with the receiver component 112 and can extract features from the first electronic document 104 and the second electronic document 106 , respectively. Example features that can be extracted will be described in greater detail below.
- the ranker component 116 can receive the features extracted by the feature extractor component 114 and based at least in part upon such features can assign scores to word sequence pairs between word sequences in the first electronic document 104 and the second electronic document 106 .
- the first set of word sequences 108 can include a first word sequence. This first word sequence can be paired individually with each word sequence in the second set of word sequences 110 , thereby creating a plurality of word sequence pairs.
- the ranker component 116 can assign a score to each word sequence pair, wherein the score can be indicative of an amount of parallelism between the word sequences in the word sequence pair.
- parallelism can refer to word sequences having a substantially similar semantic meaning.
- a comparer component 118 can receive the scores assigned to the word sequence pairs by the ranker component 116 and can compare a highest score output by the ranker component 116 with a threshold value.
- a score above the threshold value indicates that it is relatively highly probable that the word sequences in the corresponding word sequence pair are parallel to one another, while a score below the threshold value can indicate that it is more likely that the second set of word sequences 110 in the second electronic document 106 does not include a word sequence that is parallel to the aforementioned first word sequence.
- a labeler component 120 can be in communication with the comparer component 118 and can assign a label to the word sequence pair. For example, the labeler component 120 can label a word sequence pair as including parallel word sequences if the comparer component 118 found that the score assigned to the word sequence pair is above the threshold. Additionally, the labeler component 120 can be configured to label word sequence pairs assigned a score below a threshold by the ranker component 116 as not being parallel word sequences.
- the ranker component 116 can receive the aforementioned features and can assign scores to a plurality of different possible alignments between word sequences in the first electronic document 104 and word sequences in the second electronic document 106 .
- each word sequence in the first set of word sequences 108 can be labeled with either a corresponding word sequence in the second set of word sequences 110 or an empty word sequence.
- the ranker component 116 may then assign a score to a possible alignment, wherein the alignment refers to the aforementioned alignment of word sequences in the first set of word sequences with the respective word sequences in the second set of word sequences as evidenced by the labels.
- the ranker component 116 can take into consideration adjacency of word sequences when assigning a score assigned to an alignment. As an example, it may be expected that parallel word sequences tend to occur in clusters, and the ranker component 116 can output a high score to correspondences where this phenomenon occurs.
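- One simple way to reward the clustering phenomenon noted above is an additive bonus for adjacent alignments. This is only a sketch of the idea, not the CRF-based scoring the patent describes; the bonus value and function names are illustrative.

```python
def alignment_score(alignment, pair_score, cluster_bonus=0.5):
    """alignment: list where alignment[i] is the index of the target word
    sequence aligned to source sequence i, or None for the empty sequence.
    Adds a bonus whenever two adjacent source sequences align to adjacent
    target sequences, rewarding the clusters of parallel sequences that
    tend to occur in practice."""
    total = sum(pair_score(i, j)
                for i, j in enumerate(alignment) if j is not None)
    for prev, cur in zip(alignment, alignment[1:]):
        if prev is not None and cur is not None and cur == prev + 1:
            total += cluster_bonus
    return total
```

With a uniform pair score, an alignment that maps sequences 0 and 1 to consecutive targets outscores one that maps them to scattered targets, which is exactly the adjacency preference described above.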
- the ranker component 116 can assign scores to a plurality of different possible alignments, and may cause a highest assigned score (and corresponding alignment) to be stored in the data repository 102 . This score is indicative of a probability that word sequences in the first set of word sequences 108 in the particular alignment are parallel to corresponding word sequences in the second set of word sequences 110 in the particular alignment. In an example, this score can be taken into consideration when assigning scores to particular word sequence pairs as described above. Alternatively, possible alignments can be selected based at least in part upon scores assigned to word sequence pairs, wherein such scores are assigned as described above.
- the system 100 can be configured to locate parallel word sequences in electronic documents that are written in different languages.
- the first set of word sequences 108 can be written in a first language and the second set of word sequences 110 can be written in a second language.
- the first electronic document 104 and the second electronic document 106 may each be web pages.
- the web pages may belong to a collaborative online encyclopedia that includes entries on various topics in several different languages. Therefore, the first electronic document 104 may pertain to a particular topic written in a first language and the second electronic document 106 may pertain to the same topic but is written in the second language.
- the relationship between the first electronic document 104 and the second electronic document 106 can be known based upon links between the first electronic document 104 and the second electronic document 106 .
- the online collaborative encyclopedia may link entries pertaining to substantially similar topics that are written in different languages.
- the online collaborative encyclopedia may store images in a common repository, and entries written in different languages that link to the same image may be deemed to be related.
- hyperlinks in electronic documents can be compared to see whether such hyperlinks link to a substantially similar page or pages that are known to be related (e.g., entries pertaining to different topics written in different languages).
- the first electronic document 104 and the second electronic document 106 may be web pages pertaining to news items. That is, the first electronic document 104 may be a news item written in a first language pertaining to a particular event and the second electronic document 106 may be a news item pertaining to a substantially similar event but written in the second language. With respect to the online sources described above, the second electronic document 106 may be a direct translation of the first electronic document 104 . However, in some cases some word sequences in the second set of word sequences 110 may be translations of some word sequences in the first set of word sequences 108 while other word sequences in the second set of word sequences 110 are not parallel with word sequences in the first set of word sequences 108 .
- the first electronic document 104 and the second electronic document 106 are obtained from an online collaborative encyclopedia
- the first electronic document 104 may be written by a first author in the first language about a particular topic.
- the second electronic document 106 may be written by a second author in the second language, and the second author may have used some of the language in the first electronic document 104 and placed a translation thereof in the second electronic document 106 in the second language.
- a multi-lingual individual may have written the second electronic document 106 and may have directly translated at least some text from the first electronic document 104 to the second electronic document 106 .
- Other authors may have generated original text which does not align with text in the first electronic document 104 . Since the first electronic document 104 and the second electronic document 106 are known to be related and may be comparable, however, such documents 104 and 106 may still be good sources for obtaining parallel word sequences in the first and second languages, respectively.
- the feature extractor component 114 can extract features from word sequences in the first set of word sequences 108 and the second set of word sequences 110 .
- the feature extractor component 114 can be configured to extract features from each word sequence pair of interest.
- Such features extracted by the feature extractor component 114 can include but are not limited to features derived from word alignments, distortion features, features derived from a markup of an online collaborative encyclopedia and/or word level induced lexicon features.
- word sequences in the first electronic document 104 and the second electronic document 106 can be aligned through utilization of a known/existing word alignment model.
- the features derived from word alignments refer to such alignment output by the known alignment model.
- features include the log probability of an alignment, a number of aligned/unaligned words, longest aligned sequence of words, and a number of words with fertility 1, 2 and 3 or above.
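- The count-based alignment features listed above can be computed as in the following sketch; the link representation and function name are illustrative assumptions.

```python
def alignment_features(alignment, src_len):
    """alignment: list of (src_index, tgt_index) word-alignment links.
    Returns counts usable as ranker features: aligned/unaligned source
    words, the longest run of consecutively aligned source words, and
    fertility counts (how many source words have 1, 2, or >=3 links)."""
    links_per_src = {}
    for s, _t in alignment:
        links_per_src[s] = links_per_src.get(s, 0) + 1
    aligned = len(links_per_src)
    longest = run = 0
    for s in range(src_len):
        run = run + 1 if s in links_per_src else 0
        longest = max(longest, run)
    fert = {1: 0, 2: 0, "3+": 0}
    for n in links_per_src.values():
        fert[n if n < 3 else "3+"] += 1
    return {"aligned": aligned, "unaligned": src_len - aligned,
            "longest_run": longest, "fertility": fert}
```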
- two more features that can be classified as being derived from word alignments can include a sentence length feature that models the length ratio between source and target word sequences with a Poisson distribution, and a difference in relative document position of the two word sequences which captures the idea that aligned electronic documents have similar topic progression.
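- The two features just described can be sketched as follows; the expected length ratio `mean_ratio` is a hypothetical corpus-estimated constant, not a value given in the patent.

```python
import math

def length_ratio_feature(src_len, tgt_len, mean_ratio=1.0):
    """Log-probability of the target length under a Poisson distribution
    whose mean is the source length scaled by an expected length ratio."""
    lam = src_len * mean_ratio
    return tgt_len * math.log(lam) - lam - math.lgamma(tgt_len + 1)

def position_difference(src_idx, src_total, tgt_idx, tgt_total):
    """Absolute difference in relative document position; small values
    reflect the similar topic progression of aligned documents."""
    return abs(src_idx / src_total - tgt_idx / tgt_total)
```

Under this feature, a target sequence of roughly the same length as the source scores higher than one three times as long, and sequences at the same relative position in their documents have a position difference of zero.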
- Such features can be defined on word sequence pairs and are contemplated by the ranker component 116 when assigning scores to word sequence pairs.
- Distortion features capture the difference between the position of the previously aligned word sequence and the position of the currently aligned word sequence.
- a first set of distortion features can bin such distances while a second set of features can look at the absolute difference between an expected position (one after the previously aligned word sequence) and an actual position.
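- A minimal sketch of both distortion feature sets follows; the bin boundaries are illustrative assumptions rather than values from the patent.

```python
def distortion_features(prev_tgt_pos, cur_tgt_pos, bins=(1, 2, 4, 8)):
    """Returns (binned jump distance, absolute deviation from the expected
    position). The expected position is one after the previously aligned
    word sequence, capturing the monotone-alignment tendency."""
    jump = cur_tgt_pos - prev_tgt_pos
    deviation = abs(cur_tgt_pos - (prev_tgt_pos + 1))
    # Bin the absolute jump into the first bucket whose bound it fits under.
    for i, bound in enumerate(bins):
        if abs(jump) <= bound:
            return i, deviation
    return len(bins), deviation
```

A step to the very next target sequence lands in the smallest bin with zero deviation; long jumps fall into coarser bins with large deviations.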
- Features derived from an online collaborative encyclopedia can include a number of matching links in a particular sequence pair of interest. Such links can be weighted by their inverse frequency in the document such that a link that appears often does not contribute much to such a feature's value.
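- The inverse-frequency weighting of matching links can be sketched as below; the link names and data shapes are illustrative.

```python
from collections import Counter

def weighted_link_matches(src_links, tgt_links, doc_link_counts):
    """Number of links shared by the two word sequences, each weighted by
    its inverse frequency in the document so that a frequently occurring
    link contributes little to the feature value."""
    return sum(1.0 / doc_link_counts[link]
               for link in set(src_links) & set(tgt_links))

doc_counts = Counter({"Danube": 1, "Europe": 4})
score = weighted_link_matches({"Danube", "Europe"},
                              {"Danube", "Europe"}, doc_counts)
```

Here the rare link "Danube" contributes a full 1.0 while the common link "Europe" contributes only 0.25.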
- Another feature that can be derived from the markup of the online collaborative dictionary is an image feature which can be set to a particular value when two word sequences are captions of a same image.
- a list feature can have a particular value when two word sequences in a word sequence pair are both items in a list. The image feature and the list feature can fire with a negative value when the feature matches on one word sequence and not another. None of the above features fire on a null alignment. There is also a bias feature for these two models which fires on all non-null alignments.
- word-level induced lexicon features can be extracted from the feature extractor component 114 .
- a lexicon model can be built using an approach similar to ones developed for unsupervised lexicon induction from monolingual or comparable documents corpora. This lexicon model and its utilization by the feature extractor component 114 in connection with word sequence extraction is described herein.
- the lexicon model can be based on a probabilistic model P(w t |w s ) of a target word w t given a source word w s .
- word pairs can be aligned during training rather than word sequence pairs.
- Such a model can be trained from a relatively small set of annotated article pairs (e.g., articles in an online collaborative encyclopedia) where for some words in the source language, one or more words are marked as corresponding to the source word (in the context of the article pair) or it is indicated that the source word does not have a corresponding relation in the target article.
- the word level annotated articles are disjoint from the word sequence aligned articles described herein.
- the following features are used in the lexicon model and can be extracted by the feature extractor component 114 .
- the first feature is a translation probability p(w t |w s ) from a word alignment model trained on seed parallel data.
- Another feature that can be extracted from the first electronic document 104 and/or the second electronic document 106 by the feature extractor component 114 can be a position difference, which is an absolute value of a difference in relative position of words w s and w t in articles S and T.
- Another word-level induced lexicon feature that can be extracted by the feature extractor component 114 is a feature indicative of orthographic similarity.
- Orthographic similarity is a function of the edit distance between source and target words. The edit distance between words written in different alphabets is computed by first performing a deterministic phonetic translation of the words to a common alphabet. This translation is inexact and thus can be subject to improvement.
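- The orthographic similarity computation can be sketched as follows; the identity transliteration used here is a placeholder for the inexact phonetic translation step, and the normalization of edit distance into a similarity is an illustrative choice.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def orthographic_similarity(src, tgt, transliterate=lambda w: w):
    """Similarity as a function of edit distance after mapping both words
    into a common alphabet (transliterate stands in for the deterministic
    phonetic translation described above)."""
    s, t = transliterate(src), transliterate(tgt)
    return 1.0 - edit_distance(s, t) / max(len(s), len(t), 1)
```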
- the feature extractor component 114 can be configured to extract a context translation probability.
- the context translation probability feature can review all words occurring next to word w s in the article S and next to w t in the article T in a local context window (e.g., one word to the left and one word to the right of the subject word).
- Several scoring functions can be computed measuring the translation correspondence between the context (e.g., using a particular model trained from seed parallel data). This feature is similar to distributional similarity measures used in previous work with a difference that it is limited to context of words within a linked article pair.
- Still another example of a word-level induced lexicon feature that can be extracted from the first electronic document 104 and the second electronic document 106 includes distributional similarity.
- the distributional similarity feature corresponds more closely to context similarity measures used in existing work on lexicon induction. For each source head word w s , a distribution over context positions o∈{−2, −1, +1, +2} and context words v s occurring in those positions is collected, based on a count of times a context word occurred at that offset from the head word: P(o, v s |w s ).
- similarly, a distribution over positions and context words for each target head word can be gathered: P(o, v t |w t ).
- using a word translation table P(v t |v s ) estimated on the seed parallel corpus, a cross-lingual context distribution can be estimated as: P(o, v t |w s )=Σ v s P(v t |v s )P(o, v s |w s ).
- the similarity of words w s and w t can be defined as 1 minus the Jensen-Shannon divergence of a distribution over positions and target words.
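- The "1 minus Jensen-Shannon divergence" similarity can be sketched as below, with distributions represented as dicts over (position, context-word) events that sum to 1; the base-2 logarithm bounds the divergence in [0, 1].

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (base 2) of dict-distributions p, q."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)

def js_similarity(p, q):
    """1 minus the Jensen-Shannon divergence of two distributions over
    positions and target words, as in the similarity described above."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    pd = {k: p.get(k, 0.0) for k in keys}
    qd = {k: q.get(k, 0.0) for k in keys}
    return 1.0 - (0.5 * kl(pd, m) + 0.5 * kl(qd, m))
```

Identical context distributions give a similarity of 1, and completely disjoint ones give 0.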
- the ranker component 116 can receive these features extracted from the electronic documents 104 and 106 from the feature extractor component 114 and can output a ranked list of word sequence pairs that are indicative of an amount of parallelism between word sequences in such word sequence pairs.
- the ranker component 116 can utilize two separate models for outputting ranked lists of word sequence pairs.
- the first ranking model used by the ranker component 116 can, for each word sequence in the source document (the first electronic document 104 ), select either a word sequence in the target document (the second electronic document 106 ) that it is parallel to or assign a null value to the word sequence in the source document.
- the ranker component 116 can be or include a model that models a global alignment of word sequences between the first electronic document 104 and the second electronic document 106 .
- the first set of word sequences 108 and the second set of word sequences 110 are observed, and for each source word sequence a hidden variable can indicate the corresponding target sequence in the target document to which the source word sequence is aligned.
- this model can be a first order linear chain conditional random field (CRF).
- the features extracted by the feature extractor component 114 can be provided to the ranker component 116 and the ranker component 116 can assign a score to each word sequence pair, wherein the score is indicative of an amount of parallelism between word sequences in the word sequence pair. Additionally, during training of the ranker component 116 the features described above can be utilized to train the ranker component 116 . More particularly, given the set of feature functions described above, the weights of a log-linear ranking model for P(w t |w s ) can be trained.
- from this lexicon model, a new translation table can be generated, which can be defined as P lex (t|s).
- This new translation table can be used to define another HMM word alignment model (together with distortion probabilities trained from parallel data) for use in sentence or word sequence extraction models. Two copies of each feature using the HMM word alignment model can be generated, one using the seed data HMM model and another using this new HMM model.
- the comparer component 118 and the labeler component 120 can act as described above in connection with assigning the label 122 to a word sequence pair. Again this label 122 can indicate that a word sequence in the second set of word sequences 110 is a translation of a word sequence in the first set of word sequences 108 .
- the system 100 can be utilized to identify paraphrases of word sequences written in the same language. That is, the first electronic document 104 and the second electronic document 106 can include the first set of word sequences 108 and the second set of word sequences 110 , wherein such word sequences 108 and 110 are written in the same language.
- the feature extractor component 114 can extract at least some of the features mentioned above from the first electronic document 104 and the second electronic document 106 and the ranker component 116 can rank word sequence pairs as indicated previously. Scores output by the ranker component 116 can be indicative of a likelihood that word sequences in scored word sequence pairs are paraphrases of one another.
- the comparer component 118 can compare a highest score assigned to word sequences considered by the ranker component 116 , and if the highest score is above a threshold value, can instruct the labeler component 120 to label the word sequences in the word sequence pair as being paraphrases of one another.
- Paraphrases ascertained through utilization of the system 100 can be employed in practice in a variety of settings.
- such paraphrases can be utilized in a search engine such that if a user submits a query, the search engine can execute a search over a document corpus utilizing such query as well as paraphrases of such query.
- the user may submit a query to a search engine and the search engine may execute a search over a document corpus utilizing the query and a modification of such query, wherein such modification includes paraphrases of word sequences that exist in the query.
- paraphrases ascertained through utilization of the system 100 can be employed in connection with advertising. For example, a user may submit a query to a search engine and the search engine can sell advertisements pertaining to such query by selling paraphrases of the query to advertisers that are willing to bid on such paraphrases.
- such identified parallel word sequences can be utilized in a variety of settings.
- a labeled word sequence pair with respect to a first language and a second language could be provided to a system that is configured to learn a machine translation system. This machine translation system may then be utilized to translate text from the first language to the second language.
- word sequences that are labeled as being substantially parallel to one another when such word sequences are written in different languages can be employed in a multi-lingual search setting. For example, a user may wish to formulate a query in a first language because the user is more comfortable with writing in such language, but the user may wish to receive the search results in a second language.
- word sequence pairs can be employed in connection with automatically translating a query from the first language to the second language to execute a search over a document corpus that comprises documents in the second language.
- Other utilizations of word sequence alignments are contemplated and intended to fall under the scope of the hereto appended claims.
- the system 200 includes a data store 102 that comprises the first electronic document 104 and the second electronic document 106 .
- a segmenter component 202 can analyze linguistic features in the first electronic document 104 and the second electronic document 106 and can extract word sequences from such electronic documents.
- the segmenter component 202 can be configured to extract sentences from the electronic documents 104 and 106 .
- the segmenter component 202 can be configured to search for certain types of punctuation including periods, question marks, etc. and can extract word sequences by selecting text that is between such punctuation marks.
- the segmenter component 202 can be configured to analyze content of electronic documents 104 and 106 to extract certain types of phrases. For instance, the segmenter component 202 can search for certain noun/verb arrangements or other parts of a language and can extract word sequences based at least in part upon such analysis. Thus, the segmenter component 202 can output the first set of word sequences 108 and the second set of word sequences 110 that correspond to the first electronic document 104 and the second electronic document 106 and can cause such sets of word sequences 108 and 110 to be stored in the data store 102 .
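- The punctuation-based segmentation performed by the segmenter component can be sketched with a simple regular-expression splitter; this minimal version ignores abbreviations and other complications that a production segmenter would handle.

```python
import re

def segment_sentences(text):
    """Split text into word sequences at whitespace that follows a period,
    question mark, or exclamation point, keeping the punctuation with the
    preceding sequence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

seqs = segment_sentences(
    "The Danube flows east. Where does it end? In the Black Sea.")
```

The resulting list of sentences corresponds to the sets of word sequences 108 and 110 that are stored back in the data store.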
- a first electronic document 302 can comprise multiple word sequences 304 - 310 . These word sequences are labeled respectively as 1 , 2 , 3 and 4 and may be written in a first language.
- a second electronic document 312 can comprise multiple word sequences 314 - 320 wherein such word sequences are in a second language.
- the electronic documents 302 and 312 may be related in some manner, such as, for example, two articles in different languages in an online collaborative encyclopedia that are linked to one another.
- word sequences in the electronic documents 302 and 312 may be translations of one another, may be partial translations of one another, or may not be translations of one another.
- a first model can be utilized to indicate probabilities of alignments between word sequences in the electronic documents 302 and 312 . That is, for each word sequence in the first electronic document 302 , a value can be assigned with respect to a word sequence pair regarding whether a particular word sequence in the electronic document 312 is aligned with such word sequence in the electronic document 302 , and such alignment information can be utilized in connection with assigning scores to word sequence pairs.
- each word sequence in a first electronic document 302 can be paired with each word sequence in the second electronic document 312 and such word sequence pairs can be assigned scores.
- for example, word sequence 1 (shown by reference numeral 304 ) yields the word sequence pairs shown in the matrix 322 as the values 1 A, 1 B, 1 C and 1 D.
- the ranker component 116 ( FIG. 1 ) can assign a score to each of those word sequence pairs.
- a highest score assigned to the word sequence pairs 1 A, 1 B, 1 C, and 1 D can be compared to a threshold value, and if such highest score is above the threshold value, then the corresponding word sequence pair can be labeled as including parallel word sequences. Therefore, if the word sequence pair 1 A is assigned the highest score amongst the word sequence pairs 1 A, 1 B, 1 C, and 1 D and such score is above the threshold value, then the word sequence pair 1 A can be labeled as parallel such that the word sequence A 314 is labeled as being a translation of the word sequence 1 304 . As can be seen in the matrix 322 , scores can be assigned to each possible word sequence pair between the electronic documents 302 and 312 .
- scores can be assigned to different alignments of word sequences in the electronic documents 302 and 312 .
- an empty second language word sequence can exist (as an additional row in the matrix 322 ), and the matrix 322 can include binary values, with a single “1” in each column (which indicates that a word sequence in the first electronic document corresponds with a certain word sequence in the second electronic document or the empty word sequence) and the remaining entries in each column set to “0”.
- the ranker may then assign a score to the matrix in its entirety, wherein this score is indicative of parallelism between word sequences in the scored alignment.
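The per-column selection and thresholding described above can be illustrated with the following simplified sketch. The scoring function is a placeholder standing in for the trained ranker component and is not part of the described system.

```python
def label_parallel_pairs(source_sequences, target_sequences, score_fn, threshold):
    """For each source word sequence, find the best-scoring target word
    sequence; label the pair parallel only if its score exceeds the
    threshold, otherwise record None (no parallel counterpart exists).

    score_fn(s, t) is a stand-in for the trained ranker.
    """
    labels = {}
    for s in source_sequences:
        # Score every pairing of s with a target word sequence.
        scored = [(score_fn(s, t), t) for t in target_sequences]
        best_score, best_target = max(scored)
        # Compare the highest score to the threshold value.
        labels[s] = best_target if best_score > threshold else None
    return labels
```

A score matrix such as the matrix 322 is implicit here: each inner list of scores corresponds to one column of such a matrix.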
- FIG. 4 an example methodology 400 that facilitates determining whether two word sequences in a word sequence pair are parallel to one another is illustrated. While the methodology 400 is described as being a series of acts that are performed in a sequence, it is to be understood that the methodology is not limited by the order of the sequence. For instance, some acts may occur in a different order than what is described herein. In addition, an act may occur concurrently with another act. Furthermore, in some instances, not all acts may be required to implement the methodology described herein.
- the acts described herein may be computer-executable instructions that can be implemented by one or more processors and/or stored on a computer-readable medium or media.
- the computer-executable instructions may include a routine, a sub-routine, programs, a thread of execution, and/or the like.
- results of acts of the methodologies may be stored in a computer-readable medium, displayed on a display device, and/or the like.
- the computer-readable medium may be a non-transitory medium, such as memory, hard drive, CD, DVD, flash drive, or the like.
- the methodology 400 begins at 402 , and at 404 a first electronic document that comprises a first word sequence is received.
- the first word sequence can be written in a first language.
- a second electronic document is received, wherein the second electronic document comprises a set of word sequences.
- the set of word sequences can be written in a second language.
- the first and second electronic documents may be web pages such as related web pages that belong to an online collaborative encyclopedia.
- a score is assigned to each word sequence pair.
- the score can be indicative of an amount of parallelism between the first word sequence and the other word sequences in the respective word sequence pairs.
- a highest score assigned to the word sequence pairs is compared to a threshold value, and if such highest score is above the threshold value, data can be generated with respect to the corresponding word sequence pair.
- the data can indicate that the word sequence from the set of word sequences in the word sequence pair is a translation of the first word sequence. In another example, the data can indicate that the two word sequences in the word sequence pair are paraphrases of one another.
- FIG. 5 a high-level illustration of an example computing device 500 that can be used in accordance with the systems and methodologies disclosed herein is illustrated.
- the computing device 500 may be used in a system that supports detecting parallel word sequences in electronic documents.
- at least a portion of the computing device 500 may be used in a system that supports extracting parallel word sequences from web pages that belong to an online collaborative encyclopedia.
- the computing device 500 includes at least one processor 502 that executes instructions that are stored in a memory 504 .
- the memory 504 may be or include RAM, ROM, EEPROM, Flash memory, or other suitable memory.
- the instructions may be, for instance, instructions for implementing functionality described as being carried out by one or more components discussed above or instructions for implementing one or more of the methods described above.
- the processor 502 may access the memory 504 by way of a system bus 506 .
- the memory 504 may also store electronic documents, word sequences, ranking models, word sequence pairs, data indicating that words in a word sequence pair are parallel to one another, etc.
- the computing device 500 additionally includes a data store 508 that is accessible by the processor 502 by way of the system bus 506 .
- the data store 508 may be or include any suitable computer-readable storage, including a hard disk, memory, etc.
- the data store 508 may include executable instructions, electronic documents which may be web pages, word sequences, etc.
- the computing device 500 also includes an input interface 510 that allows external devices to communicate with the computing device 500 .
- the input interface 510 may be used to receive instructions from an external computer device, a user, etc.
- the computing device 500 also includes an output interface 512 that interfaces the computing device 500 with one or more external devices.
- the computing device 500 may display text, images, etc. by way of the output interface 512 .
- the computing device 500 may be a distributed system. Thus, for instance, several devices may be in communication by way of a network connection and may collectively perform tasks described as being performed by the computing device 500 .
- a system or component may be a process, a process executing on a processor, or a processor.
- a component or system may be localized on a single device or distributed across several devices.
- a component or system may refer to a portion of memory and/or a series of transistors.
Description
- Machine translation refers to the utilization of a computing device to translate text or speech from a source natural language to a target natural language. Due to complexities in natural languages, however, executing a translation between natural languages is a non-trivial task.
- Generally, machine translation systems are learned systems that are trained through utilization of training data. Pursuant to an example, labeled training data can include word sequence pairs that are either translations of one another or are not translations of one another. These word sequence pairs are labeled and provided to a learning system that learns a model based at least in part upon this labeled training data. Thus, machine translation systems are trained in such a way that they can translate a word sequence not included in the training data by observing how translation between a source language and target language works in practice with respect to various word sequences in the training data.
- It can be ascertained that acquiring more training data with which to train a machine translation system causes translations output by the machine translation system to be increasingly accurate. Several languages have a significant amount of training data associated therewith. Thus, many conventional machine translation systems are quite adept at translating between, for example, English and Spanish. For other languages, however, there is a lack of training data that can be utilized to train a machine translation system that is desirably configured to translate between such languages. For instance, there is a lack of training data that would allow a machine translation system to be trained to efficiently translate between German and Bulgarian.
- One manner for obtaining this training data is to have individuals that can speak both German and Bulgarian, for example, manually label word sequences (e.g., as being parallel to one another or not parallel to one another). This labeled data may then be used to train a machine translation system that is configured to translate between the German and Bulgarian languages, for instance. Manually labeling training data, however, is a relatively monumental task, particularly if a great amount of training data is desired.
- The following is a brief summary of subject matter that is described in greater detail herein. This summary is not intended to be limiting as to the scope of the claims.
- Various technologies pertaining to acquiring training data for a machine translation system are described herein. These technologies include utilizing electronic sources that have heretofore not been considered for obtaining parallel data. These sources include online collaborative dictionaries/encyclopedias. Such an encyclopedia may include various documents in multiple different languages that discuss a substantially similar subject. Therefore, for instance, such a collaborative encyclopedia may comprise a first entry about a first topic written in the German language and may also comprise a second entry about the same topic written in the Bulgarian language. Such documents may be direct translations of one another, partial translations of one another or authored entirely separately from one another. That the electronic documents are in some way related can be inferred, for instance, as oftentimes links exist between the documents in the collaborative online encyclopedia. In another example, each of the documents may include a link to a substantially similar image. Furthermore, entries in the online collaborative encyclopedia pertaining to a substantially similar topic may include links in the text to substantially similar web pages (e.g., web pages that are also linked as being related to one another), etc. Other sources of information are also contemplated, such as news sources that publish news about events in different languages, though in this case determining the related document pairs may be more challenging. As an example, documents that have similar person names in the title and are published within one week of one another might be considered related.
- As indicated above, word sequences in documents of different languages may or may not be direct translations of one another. Regardless, these sources may be good sources for obtaining parallel data that can be used for training a machine translation system. Accordingly, also described herein are technologies pertaining to ascertaining if a word sequence in a first electronic document written in a first language is parallel to a word sequence in a second electronic document that is written in a second language. For example, electronic documents that are known to be related to one another can be analyzed and word sequences can be generated through analysis of these documents. In one example, a word sequence can be an entire sentence. In another example, a word sequence may be a phrase. A word sequence in the first electronic document in the first language can be selected and paired separately with each word sequence in the second electronic document to create word sequence pairs. A score may be assigned to each word sequence pair, wherein the score is indicative of whether or not the word sequences in the word sequence pair are parallel to one another (e.g., whether the first word sequence is parallel to the other word sequence in the word sequence pair). The higher the score assigned to the word sequence pair, the greater the probability that the two word sequences in the word sequence pair are parallel to one another. Thus, a ranked list of word sequence pairs can be output.
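The pairing and ranking described in the preceding paragraph can be sketched as follows. This is an illustration only; the scoring function is a placeholder for the trained ranker, and the function name is hypothetical.

```python
def rank_word_sequence_pairs(first_doc_sequences, second_doc_sequences, score_fn):
    """Pair every word sequence from the first document with every word
    sequence from the second document, score each pair, and return the
    pairs sorted from most to least likely parallel.

    score_fn(s, t) stands in for the trained ranker described herein.
    """
    pairs = [(score_fn(s, t), s, t)
             for s in first_doc_sequences
             for t in second_doc_sequences]
    # Highest score first: a higher score indicates a greater
    # probability that the two word sequences are parallel.
    pairs.sort(reverse=True)
    return [(s, t, score) for score, s, t in pairs]
```

The output corresponds to the ranked list of word sequence pairs referred to above.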
- With more detail pertaining to assigning scores, scores can be assigned to word sequence pairs through utilization of any suitable method/mechanism. For instance, a ranker can be trained through utilization of manually labeled parallel word sequence pairs and non-parallel word sequence pairs. In addition, the ranker can be trained to take into consideration the global configuration of electronic documents. That is, in many sources it may be common for parallel word sequences to occur in clusters. The ranker can consider such clustering when assigning scores to word sequence pairs.
- A threshold value may then be compared to a highest score assigned to a word sequence pair. If the score is above the threshold value, then a label can be assigned to the word sequence pair indicating that the word sequences in the word sequence pair are parallel to one another (e.g., translations of one another). If no score is above the threshold value, then a null value can be assigned to the first word sequence which indicates that the first word sequence does not have a corresponding parallel word sequence in the second electronic document. As an example, the threshold value may be selected by the user before executing the system.
- In another exemplary embodiment, a substantially optimal (in terms of being parallel) second language word sequence may be sought for each first language word sequence. Given a first language word sequence, a ranker described herein can score every pairing with second language word sequences, as well as the pair of the first language word sequence and the empty word sequence. A label may then be assigned to the second language word sequence assigned the highest (or best) score, unless the best scoring second language word sequence is the empty word sequence, in which case the null value is assigned to the aforementioned first language word sequence. In this way, a separate threshold is automatically selected for each first language word sequence.
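The empty-word-sequence variant can be sketched as follows. The score assigned to pairing a first language word sequence with the empty word sequence acts as an automatically selected, per-sequence threshold. The scoring function and names are again placeholders, not part of the described components.

```python
EMPTY = None  # stands for the empty second language word sequence

def best_target_or_null(source_seq, target_sequences, score_fn):
    """Score the pairing of source_seq with every second language word
    sequence and with the empty word sequence; return the best target,
    or None when the empty word sequence wins (i.e., no parallel
    counterpart exists for this source word sequence)."""
    candidates = list(target_sequences) + [EMPTY]
    return max(candidates, key=lambda t: score_fn(source_seq, t))
```

Note that no global threshold parameter appears: the empty word sequence's score plays that role for each source word sequence separately.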
- Furthermore, an exemplary ranker described herein need not consider each first language word sequence in isolation. Instead, the ranker may assign scores to particular word sequence alignments across documents (e.g., whole documents or a collection of word sequences between documents), where every first language word sequence is labeled with either a corresponding second language word sequence or the empty word sequence. In this way, the ranker can look at adjacent information to affect its score. As an example, it may be expected that parallel word sequences tend to occur in clusters, and the ranker can output a high score to correspondences where this phenomenon occurs.
- Other aspects will be appreciated upon reading and understanding the attached figures and description.
-
FIG. 1 is a functional block diagram of an example system that facilitates assigning a label to a word sequence pair that indicates that word sequences in such pair are parallel to one another. -
FIG. 2 is a functional block diagram of an example system that facilitates segmenting electronic documents into word sequences. -
FIG. 3 is an example depiction of word sequence pairs. -
FIG. 4 is a flow diagram that illustrates an example methodology for labeling a word sequence pair to indicate that word sequences in such pair are parallel to one another. -
FIG. 5 is an example computing system.
- Various technologies pertaining to extracting parallel word sequences from aligned comparable electronic documents will now be described with reference to the drawings, where like reference numerals represent like elements throughout. In addition, several functional block diagrams of example systems are illustrated and described herein for purposes of explanation; however, it is to be understood that functionality that is described as being carried out by certain system components may be performed by multiple components. Similarly, for instance, a component may be configured to perform functionality that is described as being carried out by multiple components.
- With reference to
FIG. 1 , an example system 100 that facilitates acquiring parallel word sequences from aligned, related electronic documents is illustrated. The system 100 includes a data store 102 that comprises a first electronic document 104 and a second electronic document 106 . The first electronic document 104 can include a first set of word sequences 108 and the second electronic document 106 can include a second set of word sequences 110 . In an example, these word sequences can be sentences. In another example, the word sequences can be multi-word phrases. - A
receiver component 112 can receive the first electronic document 104 and the second electronic document 106 from the data store 102 . A feature extractor component 114 is in communication with the receiver component 112 and can extract features from the first electronic document 104 and the second electronic document 106 , respectively. Example features that can be extracted will be described in greater detail below. - In a first exemplary embodiment, the
ranker component 116 can receive the features extracted by the feature extractor component 114 and, based at least in part upon such features, can assign scores to word sequence pairs between word sequences in the first electronic document 104 and the second electronic document 106 . Pursuant to an example, the first set of word sequences 108 can include a first word sequence. This first word sequence can be paired individually with each word sequence in the second set of word sequences 110 , thereby creating a plurality of word sequence pairs. The ranker component 116 can assign a score to each word sequence pair, wherein the score can be indicative of an amount of parallelism between the word sequences in the word sequence pair. As used herein, the term “parallelism” can refer to word sequences having a substantially similar semantic meaning. - A
comparer component 118 can receive the scores assigned to the word sequence pairs by the ranker component 116 and can compare a highest score output by the ranker component 116 with a threshold value. A score above the threshold value indicates that it is relatively highly probable that the word sequences in the corresponding word sequence pair are parallel to one another, while a score below the threshold value can indicate that it is more likely that the second set of word sequences 110 in the second electronic document 106 does not include a word sequence that is parallel to the aforementioned first word sequence. - A
labeler component 120 can be in communication with the comparer component 118 and can assign a label to the word sequence pair. For example, the labeler component 120 can label a word sequence pair as including parallel word sequences if the comparer component 118 found that the score assigned to the word sequence pair is above the threshold. Additionally, the labeler component 120 can be configured to label word sequence pairs assigned a score below a threshold by the ranker component 116 as not being parallel word sequences. - In a second exemplary embodiment (which can operate in combination with the first exemplary embodiment), the
ranker component 116 can receive the aforementioned features and can assign scores to a plurality of different possible alignments between word sequences in the first electronic document 104 and word sequences in the second electronic document 106 . For instance, each word sequence in the first set of word sequences 108 can be labeled with either a corresponding word sequence in the second set of word sequences 110 or an empty word sequence. The ranker component 116 may then assign a score to a possible alignment, wherein the alignment refers to the aforementioned alignment of word sequences in the first set of word sequences with the respective word sequences in the second set of word sequences as evidenced by the labels. In this way, the ranker component 116 can take into consideration adjacency of word sequences when assigning a score to an alignment. As an example, it may be expected that parallel word sequences tend to occur in clusters, and the ranker component 116 can output a high score to correspondences where this phenomenon occurs. The ranker component 116 can assign scores to a plurality of different possible alignments, and may cause a highest assigned score (and corresponding alignment) to be stored in the data repository 102 . This score is indicative of a probability that word sequences in the first set of word sequences 108 in the particular alignment are parallel to corresponding word sequences in the second set of word sequences 110 in the particular alignment. In an example, this score can be taken into consideration when assigning scores to particular word sequence pairs as described above. Alternatively, possible alignments can be selected based at least in part upon scores assigned to word sequence pairs, wherein such scores are assigned as described above. - Exemplary embodiments of the
system 100 will now be described. In a first embodiment, the system 100 can be configured to locate parallel word sequences in electronic documents that are written in different languages. Thus, the first set of word sequences 108 can be written in a first language and the second set of word sequences 110 can be written in a second language. Moreover, the first electronic document 104 and the second electronic document 106 may each be web pages. Pursuant to a particular example, the web pages may belong to a collaborative online encyclopedia that includes entries on various topics in several different languages. Therefore, the first electronic document 104 may pertain to a particular topic written in a first language and the second electronic document 106 may pertain to the same topic but be written in the second language. In an example, the relationship between the first electronic document 104 and the second electronic document 106 can be known based upon links between the first electronic document 104 and the second electronic document 106 . For instance, the online collaborative encyclopedia may link entries pertaining to substantially similar topics that are written in different languages. In another example, the online collaborative encyclopedia may store images in a common repository, and entries written in different languages that link to the same image may be deemed to be related. Still further, hyperlinks in electronic documents can be compared to see whether such hyperlinks link to a substantially similar page or pages that are known to be related (e.g., entries pertaining to different topics written in different languages). - In another example, the first
electronic document 104 and the second electronic document 106 may be web pages pertaining to news items. That is, the first electronic document 104 may be a news item written in a first language pertaining to a particular event and the second electronic document 106 may be a news item pertaining to a substantially similar event but written in the second language. With respect to the online sources described above, the second electronic document 106 may be a direct translation of the first electronic document 104 . However, in some cases some word sequences in the second set of word sequences 110 may be translations of some word sequences in the first set of word sequences 108 while other word sequences in the second set of word sequences 110 are not parallel with word sequences in the first set of word sequences 108 . For example, if the first electronic document 104 and the second electronic document 106 are obtained from an online collaborative encyclopedia, the first electronic document 104 may be written by a first author in the first language about a particular topic. The second electronic document 106 may be written by a second author in the second language, and the second author may have used some of the language in the first electronic document 104 and placed a translation thereof in the second electronic document 106 in the second language. For instance, a multi-lingual individual may have written the second electronic document 106 and may have directly translated at least some text from the first electronic document 104 to the second electronic document 106 . Other authors, however, may have generated original text which does not align with text in the first electronic document 104 . Since the first electronic document 104 and the second electronic document 106 are known to be related and may be comparable, however, such documents 104 and 106 may include at least some parallel word sequences. - As mentioned above, the
feature extractor component 114 can extract features from word sequences in the first set of word sequences 108 and the second set of word sequences 110 . The feature extractor component 114 can be configured to extract features from each word sequence pair of interest. Such features extracted by the feature extractor component 114 can include but are not limited to features derived from word alignments, distortion features, features derived from a markup of an online collaborative encyclopedia, and/or word-level induced lexicon features. Pursuant to an example, word sequences in the first electronic document 104 and the second electronic document 106 can be aligned through utilization of a known/existing word alignment model. The features derived from word alignments refer to such alignment output by the known alignment model. These features include the log probability of an alignment, a number of aligned/unaligned words, a longest aligned sequence of words, and a number of words with particular fertility values. Such features can be utilized by the ranker component 116 when assigning scores to word sequence pairs. - Distortion features can refer to reviewing the difference between positions of previously and currently aligned word sequences. A first set of distortion features can bin such distances, while a second set of features can look at the absolute difference between an expected position (one after the previously aligned word sequence) and an actual position.
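The two kinds of distortion features just described might be computed roughly as follows. Positions are assumed to be zero-based indices of word sequences in the target document, and the bin boundaries are illustrative only, not values specified herein.

```python
def distortion_features(prev_aligned_pos, current_aligned_pos):
    """Compute the two kinds of distortion features described above.

    Returns (bin_label, absolute_deviation): bin_label coarsely buckets
    the jump from the previously to the currently aligned target
    position; absolute_deviation is the absolute difference between the
    expected position (one after the previously aligned word sequence)
    and the actual position.
    """
    jump = current_aligned_pos - prev_aligned_pos
    # Illustrative bins; a real model would choose these empirically.
    if jump == 1:
        bin_label = "monotone"
    elif 0 <= jump <= 3:
        bin_label = "small"
    else:
        bin_label = "large"
    expected = prev_aligned_pos + 1
    return bin_label, abs(current_aligned_pos - expected)
```

A monotone alignment thus yields a deviation of zero, consistent with the expectation that parallel word sequences occur in clusters.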
- Features derived from an online collaborative encyclopedia can include a number of matching links in a particular word sequence pair of interest. Such links can be weighted by their inverse frequency in the document, such that a link that appears often does not contribute much to such feature's value. Another feature that can be derived from the markup of the online collaborative encyclopedia is an image feature, which can be set to a particular value when two word sequences are captions of a same image. A list feature can have a particular value when two word sequences in a word sequence pair are both items in a list. The image feature and the list feature can fire with a negative value when the feature matches on one word sequence and not the other. None of the above features fire on a null alignment. There is also a bias feature for these two models which fires on all non-null alignments.
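The inverse-frequency weighting of matching links can be sketched as follows. The exact weighting formula is an assumption made for illustration; only the general idea (frequent links contribute little) is taken from the description above.

```python
from collections import Counter

def matching_link_feature(source_links, target_links, document_links):
    """Sum, over links appearing in both word sequences of a pair, a
    weight equal to the inverse of the link's frequency in the whole
    document, so that a link appearing often contributes little to the
    feature value."""
    freq = Counter(document_links)
    shared = set(source_links) & set(target_links)
    return sum(1.0 / freq[link] for link in shared if freq[link] > 0)
```

For example, a link occurring twice in the document contributes 0.5 to the feature value when it matches across the pair, while a unique link contributes 1.0.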
- Furthermore, as indicated above, word-level induced lexicon features can be extracted by the
feature extractor component 114 . Pursuant to an example, a lexicon model can be built using an approach similar to ones developed for unsupervised lexicon induction from monolingual or comparable document corpora. This lexicon model and its utilization by the feature extractor component 114 in connection with word sequence extraction is described herein. The lexicon model can be based on a probabilistic model P(wt|ws, T, S), where wt is a word in the target language, ws is a word in the source language, and T and S are linked articles in the target and source languages, respectively. This model can be trained through utilization of any suitable training technique. Specifically, word pairs can be aligned during training rather than word sequence pairs. Such a model can be trained from a relatively small set of annotated article pairs (e.g., articles in an online collaborative encyclopedia), where for some words in the source language, one or more words are marked as corresponding to the source word (in the context of the article pair) or it is indicated that the source word does not have a corresponding relation in the target article. The word-level annotated articles are disjoint from the word sequence-aligned articles described herein. - The following features are used in the lexicon model and can be extracted by the
feature extractor component 114 . The first feature is a transition probability, which is the transition probability p(wt|ws) from a suitable word alignment model that is trained on particular seed parallel data. This probability can be utilized in both directions, as can the log probabilities in the two directions. - Another feature that can be extracted from the first
electronic document 104 and/or the second electronic document 106 by the feature extractor component 114 can be a position difference, which is an absolute value of a difference in relative position of words ws and wt in articles S and T. - Another word-level induced lexicon feature that can be extracted by the
feature extractor component 114 is a feature indicative of orthographic similarity. Orthographic similarity is a function of the edit distance between source and target words. The edit distance between words written in different alphabets is computed by first performing a deterministic phonetic translation of the words to a common alphabet. This translation is inexact and thus can be subject to improvement. - In yet another example of a word-level induced lexicon feature, the
feature extractor component 114 can be configured to extract a context translation probability. The context translation probability feature can review all words occurring next to the word ws in the article S and next to the word wt in the article T in a local context window (e.g., one word to the left and one word to the right of the subject word). Several scoring functions can be computed measuring the translation correspondence between the contexts (e.g., using a particular model trained from seed parallel data). This feature is similar to distributional similarity measures used in previous work, with the difference that it is limited to the context of words within a linked article pair. - Still another example of a word-level induced lexicon feature that can be extracted from the first
electronic document 104 and the second electronic document 106 includes distributional similarity. The distributional similarity feature corresponds more closely to context similarity measures used in existing work on lexicon induction. For each source head word ws, a distribution over context positions o ∈ {−2, −1, +1, +2} and context words vs in those positions is collected, based on a count of the times a context word occurred at that offset from the head word: P(o, vs|ws) ∝ weight(o)·C(ws, o, vs). Adjacent positions −1 and +1 have a weight of 2; other positions have a weight of 1. Likewise, a distribution over context positions and context words for each target head word can be gathered: P(o, vt|wt). Using an existing word translation table P(vt|vs) estimated on the seed parallel corpus, a cross-lingual context distribution can be estimated as follows: P(o, vt|ws) = Σvs P(vt|vs)·P(o, vs|ws). The similarity of words ws and wt can then be defined as 1 minus the Jensen-Shannon divergence between the two distributions over positions and target words. - As described previously, the
ranker component 116 can receive these features extracted from the electronic documents 104 and 106 by the feature extractor component 114 and can output a ranked list of word sequence pairs, wherein the assigned rankings are indicative of an amount of parallelism between word sequences in such word sequence pairs. The ranker component 116 can utilize two separate models for outputting ranked lists of word sequence pairs. The first ranking model used by the ranker component 116 can, for each word sequence in the source document (the first electronic document 104 ), select either a word sequence in the target document (the second electronic document 106 ) that it is parallel to or assign a null value to the word sequence in the source document. Furthermore, the ranker component 116 can be or include a model that models a global alignment of word sequences between the first electronic document 104 and the second electronic document 106 . In this exemplary model, the first set of word sequences 108 and the second set of word sequences 110 are observed, and for each source word sequence a hidden variable can indicate the corresponding target sequence in the target document to which the source word sequence is aligned. Pursuant to an example, this model can be a first-order linear-chain conditional random field (CRF). Of course, other suitable models are also contemplated by the inventors and are intended to fall under the scope of the hereto appended claims. - The features extracted by the
feature extractor component 114 can be provided to the ranker component 116, and the ranker component 116 can assign a score to each word sequence pair, wherein the score is indicative of an amount of parallelism between the word sequences in the pair. Additionally, the features described above can be utilized during training of the ranker component 116. More particularly, given the set of feature functions described above, the weights of a log linear ranking model for P(wt|ws, T, S) can be trained based at least in part upon the word-level annotated article pairs. After this model is trained, a new translation table Plex(t|s) can be generated, defined as Plex(t|s) ∝ Σt∈T,s∈S P(t|s, T, S), where the summation is over occurrences of the source and target words in linked articles (articles known to be in some way related). This new translation table can be used to define another HMM word alignment model (together with distortion probabilities trained from parallel data) for use in sentence or word sequence extraction models. Two copies of each feature using the HMM word alignment model can be generated: one using the seed data HMM model and another using this new HMM model. - The
comparer component 118 and the labeler component 120 can act as described above in connection with assigning the label 122 to a word sequence pair. Again, this label 122 can indicate that a word sequence in the second set of word sequences 110 is a translation of a word sequence in the first set of word sequences 108. - In another embodiment, the
system 100 can be utilized to identify paraphrases of word sequences written in the same language. That is, the first electronic document 104 and the second electronic document 106 can include the first set of word sequences 108 and the second set of word sequences 110, wherein such word sequences are written in a same language. The feature extractor component 114 can extract at least some of the features mentioned above from the first electronic document 104 and the second electronic document 106, and the ranker component 116 can rank word sequence pairs as indicated previously. Scores output by the ranker component 116 can be indicative of a likelihood that the word sequences in scored word sequence pairs are paraphrases of one another. The comparer component 118 can compare a highest score assigned to the word sequence pairs considered by the ranker component 116 to a threshold value and, if the highest score is above the threshold value, can instruct the labeler component 120 to label the word sequences in the corresponding word sequence pair as being paraphrases of one another. - Paraphrases ascertained through utilization of the
system 100 can be employed in practice in a variety of settings. For example, such paraphrases can be utilized in a search engine such that, if a user submits a query, the search engine can execute a search over a document corpus utilizing such query as well as paraphrases of such query. In a similar example, the user may submit a query to a search engine, and the search engine may execute a search over a document corpus utilizing the query and a modification of such query, wherein such modification includes paraphrases of word sequences that exist in the query. Furthermore, paraphrases ascertained through utilization of the system 100 can be employed in connection with advertising. For example, a user may submit a query to a search engine, and the search engine can sell advertisements pertaining to such query by offering paraphrases of the query to advertisers that are willing to bid on such paraphrases. - If the
system 100 is employed in connection with locating translations between word sequences in different languages, the identified parallel word sequences can be utilized in a variety of settings. For example, a labeled word sequence pair with respect to a first language and a second language could be provided to a system that is configured to learn a machine translation system. This machine translation system may then be utilized to translate text from the first language to the second language. In another example, word sequences that are labeled as being substantially parallel to one another when such word sequences are written in different languages can be employed in a multi-lingual search setting. For example, a user may wish to formulate a query in a first language because the user is more comfortable with writing in such language, but may wish to receive search results in a second language. These word sequence pairs can be employed in connection with automatically translating the query from the first language to the second language to execute a search over a document corpus that comprises documents in the second language. Other utilizations of word sequence alignments are contemplated and intended to fall under the scope of the hereto appended claims. - Referring now to
FIG. 2, an example system 200 that facilitates extracting word sequences from electronic documents is illustrated. The system 200 includes a data store 102 that comprises the first electronic document 104 and the second electronic document 106. A segmenter component 202 can analyze linguistic features in the first electronic document 104 and the second electronic document 106 and can extract word sequences from such electronic documents. Pursuant to an example, the segmenter component 202 can be configured to extract sentences from the electronic documents 104 and 106. For instance, the segmenter component 202 can be configured to search for certain types of punctuation, including periods, question marks, etc., and can extract word sequences by selecting text that lies between such punctuation marks. In another example, the segmenter component 202 can be configured to analyze content of the electronic documents 104 and 106. For instance, the segmenter component 202 can search for certain noun/verb arrangements or other parts of a language and can extract word sequences based at least in part upon such analysis. Thus, the segmenter component 202 can output the first set of word sequences 108 and the second set of word sequences 110 that correspond to the first electronic document 104 and the second electronic document 106, respectively, and can cause such sets of word sequences to be retained in the data store 102. - With reference now to
FIG. 3, an example depiction 300 of assigning scores to word sequence pairs existent in a first electronic document and a second electronic document is illustrated. A first electronic document 302 can comprise multiple word sequences 304-310. These word sequences are labeled respectively as 1, 2, 3, and 4 and may be written in a first language. - A second
electronic document 312 can comprise multiple word sequences 314-320, wherein such word sequences are in a second language. These word sequences are labeled respectively as A, B, C, and D. - As described above, a first model can be utilized to indicate probabilities of alignments between word sequences in the
electronic documents 302 and 312. For instance, the model can indicate a probability that a given word sequence in the electronic document 312 is aligned with a particular word sequence in the electronic document 302, and such alignment information can be utilized in connection with assigning scores to word sequence pairs. - Thereafter, each word sequence in the first
electronic document 302 can be paired with each word sequence in the second electronic document 312, and such word sequence pairs can be assigned scores. Thus, as is shown in an example matrix 322 with respect to word sequence 1 (shown by reference numeral 304), such word sequence is paired with each of the word sequences A, B, C, and D in the second electronic document 312. These word sequence pairs are shown in the matrix 322 by the values 1A, 1B, 1C, and 1D, and the ranker component 116 (FIG. 1) can assign a score to each of those word sequence pairs. Thereafter, as described previously, the highest score assigned to the word sequence pairs 1A, 1B, 1C, and 1D can be compared to a threshold value, and if such highest score is above the threshold value, then the corresponding word sequence pair can be labeled as including parallel word sequences. Therefore, if the word sequence pair 1A is assigned the highest score amongst the word sequence pairs 1A, 1B, 1C, and 1D and such score is above the threshold value, then the word sequence pair 1A can be labeled as parallel, such that the word sequence A 314 is labeled as being a translation of the word sequence 1 304. As can be seen in the matrix 322, scores can be assigned to each possible word sequence pair between the electronic documents 302 and 312. - In another exemplary embodiment, in addition to or alternatively to assigning scores individually to word sequence pairs, scores can be assigned to different alignments of word sequences in the
electronic documents 302 and 312. In such an embodiment, the matrix 322 can include binary values, with a single “1” in each column (which indicates that a word sequence in the first electronic document corresponds with a certain word sequence in the second electronic document or with the empty word sequence) and the remainder of the entries in each column set to “0”. The ranker may then assign a score to the entirety of the matrix, wherein the score is indicative of parallelism between word sequences in the scored alignment. - With reference now to
FIG. 4, an example methodology 400 that facilitates determining whether two word sequences in a word sequence pair are parallel to one another is illustrated. While the methodology 400 is described as being a series of acts that are performed in a sequence, it is to be understood that the methodology is not limited by the order of the sequence. For instance, some acts may occur in a different order than what is described herein. In addition, an act may occur concurrently with another act. Furthermore, in some instances, not all acts may be required to implement the methodology described herein. - Moreover, the acts described herein may be computer-executable instructions that can be implemented by one or more processors and/or stored on a computer-readable medium or media. The computer-executable instructions may include a routine, a sub-routine, programs, a thread of execution, and/or the like. Still further, results of acts of the methodologies may be stored in a computer-readable medium, displayed on a display device, and/or the like. The computer-readable medium may be a non-transitory medium, such as memory, hard drive, CD, DVD, flash drive, or the like.
- The
methodology 400 begins at 402, and at 404 a first electronic document that comprises a first word sequence is received. In an example, the first word sequence can be written in a first language.
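- The received documents are assumed to have already been divided into word sequences by a component such as the segmenter component 202 of FIG. 2, which can split on terminal punctuation. A minimal punctuation-based segmenter, offered only as an illustrative sketch (the function name and regular expression are ours, not the patent's), might be:

```python
import re

def segment(text):
    # Split on whitespace that follows terminal punctuation (periods,
    # question marks, exclamation points), keeping the punctuation mark
    # attached to the word sequence that precedes it.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]
```

A richer segmenter could instead use part-of-speech analysis, as the description of FIG. 2 notes, but the punctuation rule above suffices to illustrate the interface.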
- At 408, a score is assigned to each word sequence pair. The score can be indicative of an amount of parallelism between the first word sequence and the other word sequences in the respective word sequence pairs.
- At 410, a highest score assigned to the word sequence pairs is compared to some threshold value. At 412, a determination is made regarding whether the score is above the threshold. If the score is above the threshold then at 414 data is generated that indicates that the word sequences in the word sequence pairs are parallel to one another. For example, the data can indicate that the word sequence from the set of word sequences in the word sequence pair is a translation of the first word sequence. In another example the data can indicate that the two word sequences in the word sequence pair are paraphrases of one another.
- If at 412 it is determined that the highest score is below the threshold then at 416 data is generated that indicates that the first word sequence does not have a corresponding parallel word sequence in the second electronic document. The methodology then completes at 418.
- Now referring to
FIG. 5, a high-level illustration of an example computing device 500 that can be used in accordance with the systems and methodologies disclosed herein is illustrated. For instance, the computing device 500 may be used in a system that supports detecting parallel word sequences in electronic documents. In another example, at least a portion of the computing device 500 may be used in a system that supports extracting parallel word sequences from web pages that belong to an online collaborative encyclopedia. The computing device 500 includes at least one processor 502 that executes instructions that are stored in a memory 504. The memory 504 may be or include RAM, ROM, EEPROM, Flash memory, or other suitable memory. The instructions may be, for instance, instructions for implementing functionality described as being carried out by one or more components discussed above or instructions for implementing one or more of the methods described above. The processor 502 may access the memory 504 by way of a system bus 506. In addition to storing executable instructions, the memory 504 may also store electronic documents, word sequences, ranking models, word sequence pairs, data indicating that words in a word sequence pair are parallel to one another, etc. - The
computing device 500 additionally includes a data store 508 that is accessible by the processor 502 by way of the system bus 506. The data store 508 may be or include any suitable computer-readable storage, including a hard disk, memory, etc. The data store 508 may include executable instructions, electronic documents (which may be web pages), word sequences, etc. The computing device 500 also includes an input interface 510 that allows external devices to communicate with the computing device 500. For instance, the input interface 510 may be used to receive instructions from an external computer device, a user, etc. The computing device 500 also includes an output interface 512 that interfaces the computing device 500 with one or more external devices. For example, the computing device 500 may display text, images, etc. by way of the output interface 512. - Additionally, while illustrated as a single system, it is to be understood that the
computing device 500 may be a distributed system. Thus, for instance, several devices may be in communication by way of a network connection and may collectively perform tasks described as being performed by the computing device 500. - As used herein, the terms “component” and “system” are intended to encompass hardware, software, or a combination of hardware and software. Thus, for example, a system or component may be a process, a process executing on a processor, or a processor. Additionally, a component or system may be localized on a single device or distributed across several devices. Furthermore, a component or system may refer to a portion of memory and/or a series of transistors.
- It is noted that several examples have been provided for purposes of explanation. These examples are not to be construed as limiting the hereto-appended claims. Additionally, it may be recognized that the examples provided herein may be permutated while still falling under the scope of the claims.
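- As a further illustration of the training loop described in connection with FIG. 1, the re-estimated lexicon Plex(t|s) ∝ Σ P(t|s, T, S) amounts to summing per-occurrence link posteriors over linked article pairs and normalizing per source word. The sketch below assumes an input format of (source, target, posterior) triples, which is our own simplification of the model's posteriors:

```python
from collections import defaultdict

def build_lexicon(posteriors):
    """posteriors: iterable of (s, t, p) triples, one per occurrence of the
    word pair in linked articles, where p approximates P(t | s, T, S).
    Returns P_lex(t | s), normalized so each source word's row sums to 1."""
    table = defaultdict(lambda: defaultdict(float))
    for s, t, p in posteriors:
        table[s][t] += p
    return {s: {t: p / sum(ts.values()) for t, p in ts.items()}
            for s, ts in table.items()}
```

In the described system, such a table would then parameterize the second HMM word alignment model used to generate the additional copies of each alignment-based feature.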
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/794,778 US8560297B2 (en) | 2010-06-07 | 2010-06-07 | Locating parallel word sequences in electronic documents |
Publications (2)
Publication Number | Publication Date |
---|---|
US20110301935A1 true US20110301935A1 (en) | 2011-12-08 |
US8560297B2 US8560297B2 (en) | 2013-10-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: MICROSOFT CORPORATION, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:QUIRK, CHRISTOPHER BRIAN;TOUTANOVA, KRISTINA N.;SMITH, JASON ROBERT;REEL/FRAME:024490/0122 Effective date: 20100602 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
AS | Assignment |
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034544/0001 Effective date: 20141014 |
|