WO2010042452A2 - Machine learning for transliteration
- Publication number
- WO2010042452A2 (PCT/US2009/059576)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- potential
- transliteration
- pair
- anchor text
- pairs
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/53—Processing of non-Latin text
Definitions
- This specification relates to transliteration.
- Each language is normally expressed in a particular writing system (e.g., a script), which is usually characterized by a particular alphabet.
- the English language can be expressed using Latin characters while the Japanese language can be expressed using Katakana characters.
- the scripts used by some languages include a particular alphabet that has been extended to include additional marks or characters.
- a first writing system is used to represent words normally represented by a second writing system.
- a transliterated term can be a term that has been converted from one script to another script or a phonetic representation in one script of a term in another script. Transliterations can differ from translations because the meanings of terms are not reflected in the transliterated terms.
- Techniques for extracting transliteration pairs may require annotated training data or language specific data.
- conventional techniques for transliteration use rules, which specify that one or more particular characters in a first script can be mapped to one or more particular characters in a second script. These rules are typically language specific and may require annotated training data and/or parallel training data (e.g., comparable training data in the first and second scripts).
- This specification describes technologies relating to machine learning for transliteration.
- one aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving a plurality of resources.
- the plurality of resources can include a plurality of anchor text.
- One or more potential transliterations can be determined from the plurality of anchor text.
- the one or more potential transliterations can be sorted, based on likelihoods of the one or more potential transliterations co-occurring with text that identifies a same resource or location.
- One or more potential transliteration pairs can be identified from the one or more potential transliterations.
- Each potential transliteration pair can include a first anchor text in a first writing system and a second anchor text in a second writing system.
- the second anchor text and the first anchor text can identify a same resource or location.
- the first anchor text and the second anchor text can be compared; and the potential transliteration pair can be first classified as being a transliteration pair or not being a transliteration pair, based on the comparison.
- the first classified potential transliteration pairs can be first sorted, based on likelihoods of the first classified potential transliteration pairs being transliteration pairs, to produce first sorted potential transliteration pairs.
- a subset of the first sorted potential transliteration pairs can be identified. The subset can include potential transliteration pairs classified as being transliteration pairs and potential transliteration pairs classified as not being transliteration pairs.
- the first anchor text and the second anchor text can be aligned; and one or more edits from the alignment can be extracted.
- a classification model can be generated, based on the one or more edits and the subset.
- Each of the first classified potential transliteration pairs can be second classified as being a transliteration pair or not being a transliteration pair, using the classification model.
- Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
- Comparing the first anchor text and the second anchor text can include determining a first edit distance between the first anchor text and the second anchor text, and comparing the first edit distance to a first threshold value.
- the aligning can be based on minimizing the first edit distance.
- the first threshold value can be the lesser of a length of the first anchor text and a length of the second anchor text.
- the potential transliteration pair can be first classified as not being a transliteration pair when the first edit distance is greater than the first threshold value, and the potential transliteration pair can be first classified as being a transliteration pair when the first edit distance is less than the first threshold value.
- Generating the classification model can include associating each of the one or more edits to a feature, and generating a feature weight for each feature.
- the second classifying can include: for each of the first classified potential transliteration pairs, comparing the first classified potential transliteration pair to the one or more features in the classification model; determining one or more feature weights, based on the comparison; and summing the one or more feature weights to produce a classification score.
- the method can further include for the one or more edits associated with the first writing system, second sorting the second classified potential transliteration pairs, based on likelihoods of the second classified potential transliteration pairs being transliteration pairs; and for each second sorted potential transliteration pair, reclassifying the second sorted potential transliteration pair as not being a transliteration pair when its corresponding classification score indicates that the second sorted potential transliteration pair is not a transliteration pair; reclassifying the second sorted potential transliteration pair as a best potential transliteration pair when the second sorted potential transliteration pair has the highest likelihood of being a transliteration pair and its corresponding classification score indicates that the second sorted potential transliteration pair is a transliteration pair; determining a second edit distance between the second sorted potential transliteration pair and the best potential transliteration pair; reclassifying the second sorted potential transliteration pair as being a transliteration pair when its second edit distance is less than a second threshold value and its corresponding classification score indicates that the second sorted potential trans
- the method can further include for the one or more edits associated with the second writing system, third sorting the reclassified potential transliteration pairs, based on likelihoods of the reclassified potential transliteration pairs being transliteration pairs; and for each third sorted potential transliteration pair classified as being a transliteration pair, reclassifying the third sorted potential transliteration pair as not being a transliteration pair when its corresponding classification score indicates that the third sorted potential transliteration pair is not a transliteration pair; reclassifying the third sorted potential transliteration pair as a best potential transliteration pair when the third sorted potential transliteration pair has the highest likelihood of being a transliteration pair and its corresponding classification score indicates that the third sorted potential transliteration pair is a transliteration pair; determining a third edit distance between the third sorted potential transliteration pair and the best potential transliteration pair; reclassifying the third sorted potential transliteration pair as being a transliteration pair when its third edit distance is less than a third threshold value and
- the classification model can use a support vector machine (SVM).
- the likelihoods can be calculated using log-likelihood ratios.
- another aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving a plurality of resources.
- the plurality of resources can include a plurality of anchor text.
- One or more potential transliterations can be determined from the plurality of anchor text.
- One or more potential transliteration pairs can be identified from the one or more potential transliterations.
- Each potential transliteration pair can include a first anchor text in a first writing system and a second anchor text in a second writing system.
- the second anchor text and the first anchor text can identify a same resource or location.
- For each potential transliteration pair, the potential transliteration pair can be classified as being a transliteration pair or not being a transliteration pair; and the first anchor text can be aligned with the second anchor text. One or more edits can be extracted from the alignment. A classification model can be generated, based on the one or more edits and a subset of the classified potential transliteration pairs. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products. These and other embodiments can optionally include one or more of the following features. The method can further include identifying transliteration pairs from the potential transliteration pairs, using the classification model.
- the one or more potential transliteration pairs can be identified from the one or more potential transliterations, based on likelihoods of the one or more potential transliterations co-occurring with text that identifies a same resource or location.
- the classifying can include determining a first edit distance between the first anchor text and the second anchor text; comparing the first edit distance to a first threshold value; and classifying the potential transliteration pair as being a transliteration pair or not being a transliteration pair, based on the comparison.
- the aligning can be based on minimizing the first edit distance.
- the first threshold value can be the lesser of a length of the first anchor text and a length of the second anchor text.
- the method can further include reclassifying the potential transliteration pairs based on refinement rules and the one or more edits in the first writing system.
- the method can also include reclassifying the potential transliteration pairs classified as being a transliteration pair based on the refinement rules and the one or more edits in the second writing system.
- the method can further include sorting the classified potential transliteration pairs based on likelihoods of the classified potential transliteration pairs being transliteration pairs.
- the subset can include potential transliteration pairs classified as being a transliteration pair and potential transliteration pairs classified as not being a transliteration pair.
- the classification model can use a support vector machine (SVM).
- Automatically identifying transliteration pairs using anchor text increases the flexibility and coverage of identification by: (i) reducing or eliminating the use of annotated training data, and (ii) reducing or eliminating the use of language specific rules or data (e.g., parallel training data).
- FIG. 1 shows example transliteration pairs.
- FIG. 2 is a block diagram illustrating an example automatic identification of transliteration pairs using anchor text.
- FIG. 3 includes an example potential transliteration pair and its corresponding edits.
- FIG. 4 is a flow chart showing an example process for automatically identifying transliteration pairs using anchor text.
- FIG. 5 is a schematic diagram of a generic computer system.
- FIG. 1 shows example transliteration pairs.
- a transliteration pair can include, for example, a first word represented in a first writing system and a second word represented in a second writing system.
- the first writing system and second writing system can be used to express a same language or different languages.
- the first and second writing systems can be Katakana and Kanji, which are writing systems used to express the Japanese language.
- the first and second writing systems can be Latin and Kanji, which are used to express different languages.
- each transliteration in a transliteration pair can include text of any length, e.g., a transliteration can be a single character or a phrase.
- Transliterations can have a plurality of variants.
- FIG. 1 includes three Katakana transliterations for the Latin representation of the English word "saxophone".
- annotated training data or language specific rules can limit the amount of training data that can be used to train a classification model (e.g., using a classifier), for example, for identifying transliteration pairs. As a result, the likelihood that all variants of a transliteration are identified may be reduced.
- using anchor text to train a classification model can increase the amount of usable training data (e.g., any resource that includes anchor text can be used), thereby increasing the likelihood that all variants of a transliteration are learned by the classifier.
- FIG. 2 is a block diagram illustrating an example automatic identification of transliteration pairs using anchor text.
- FIG. 2 includes one or more resources 210.
- the one or more resources 210 can be, for example, web pages, spreadsheets, emails, blogs, and instant message (IM) scripts.
- the one or more resources 210 can include anchor text.
- Anchor text is text that links to a resource (e.g., a web page) identified by a Uniform Resource Locator (URL), for example.
- anchor text can link to a particular location in a resource (e.g., a location on a web page).
- Anchor text can provide contextual information related to the resource.
- anchor text may be related to actual text of an associated URL.
- the hyperlink displays in a web page as Google.
- the anchor text may not be related to the actual text of the associated URL.
- a hyperlink to the Google™ web site can be represented as:
- Because anchor text can provide contextual information related to the resource, different anchor text that identifies a same resource can be used as unannotated training data for identifying transliteration pairs.
- "search engine" is not a transliteration for "Google"
- other examples of anchor text may provide the same contextual information as transliterations.
- Another hyperlink to the Google™ web site can be represented as:
- <a href="http://www.google.com">谷歌</a>.
- the hyperlink displays in a web page as 谷歌. 谷歌 is the transliteration for "Google" in Chinese.
- a web page can include anchor text that is a Katakana transliteration of "personal care", and an English website can include the anchor text "personal care", where both link to a same web page (e.g., a web page about personal health).
- anchor text 220 from the resources 210 can be extracted.
- extracted samples of anchor text can be in the same writing system, and therefore, cannot be transliteration pairs.
- extracted samples of anchor text that do not link to a same resource or location are less likely to carry related contextual information than samples that do link to a same resource or location.
- these extracted samples of anchor text are not likely to be the same contextual information. Therefore, potential transliteration pairs 230 (e.g., transliteration pair candidates) can be determined by identifying samples of anchor text that are represented in different writing systems and link to a same resource or location.
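- As a minimal sketch of this candidate step, the Python below groups anchor text by the URL it links to and keeps only pairs written in different scripts. The function names, the `(anchor_text, url)` input shape, and the script guess based on Unicode character names are assumptions made for illustration, not details taken from the specification.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations


def script_of(text):
    """Rough script guess: first word of the Unicode name of the first letter.

    Assumption: this heuristic is only an illustration; a production system
    would use a proper script-detection routine.
    """
    for ch in text:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def candidate_pairs(anchors):
    """anchors: iterable of (anchor_text, url) tuples (hypothetical input shape).

    Returns the set of anchor-text pairs that link to the same URL and are
    written in different scripts.
    """
    by_url = defaultdict(set)
    for text, url in anchors:
        by_url[url].add(text)

    pairs = set()
    for texts in by_url.values():
        for a, b in combinations(sorted(texts), 2):
            if script_of(a) != script_of(b):
                pairs.add((a, b))
    return pairs
```

- Anchor text in the same script, or anchor text that links to different URLs, never forms a candidate in this sketch, which mirrors the two filters described above.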
- Each potential transliteration pair comprises a first anchor text in a first writing system and a second anchor text in a second writing system.
- various statistics related to the extracted anchor text can be collected. For example, the frequency (e.g., a count) with which text that is the same as the anchor text, but does not link to a resource or a location (e.g., plain text that is not associated with a URL), occurs in the resources can be determined.
- the frequency of the anchor text occurring in more than one writing system and linking to a same resource or location (e.g., occurring as a potential transliteration pair) can also be determined.
- the frequency of the anchor text occurring only in a single writing system can be determined.
- One or more potential transliterations from the plurality of anchor text can be determined. One or more of these frequencies can be used to calculate a likelihood that a potential transliteration co-occurs with text that identifies a same resource or location. In some implementations, a likelihood of co-occurrence can be calculated by using a log-likelihood ratio. Other statistics related to the occurrences of the anchor text can also be collected and used to calculate the likelihood that a potential transliteration pair is a transliteration pair. Potential transliteration pairs can be identified from the one or more potential transliterations. The one or more potential transliterations can be sorted, based on likelihoods of the one or more potential transliterations co-occurring with text that identifies a same resource or location. Using this likelihood determination, a system can identify potential transliteration pairs in training data in any language or writing system, e.g., the system can be language/writing system-independent.
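- The specification does not give a formula for the log-likelihood ratio. One common choice, assumed here purely for illustration, is the G² statistic over a 2x2 contingency table of co-occurrence counts. The parameter names `k11`, `k12`, `k21`, and `k22` are hypothetical, and how the collected frequencies map onto these cells is also an assumption.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    Assumed cell meanings (not specified in the source):
      k11: both anchor texts link to the same resource
      k12: first anchor text occurs without the second
      k21: second anchor text occurs without the first
      k22: neither occurs
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))

    g2 = 0.0
    for k, i, j in cells:
        if k > 0:  # an empty cell contributes nothing to the sum
            g2 += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * g2
```

- A table whose counts are statistically independent yields a ratio of zero, and larger values indicate stronger association, so sorting candidates by this value in descending order places the most strongly co-occurring candidates first.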
- each potential transliteration pair can be scored.
- a classification model 240 can be generated, using a classifier (e.g., a linear support vector machine (SVM)).
- the classification model 240 can be used to score each potential transliteration pair, as described in more detail below.
- the classification model 240 can include features of transliteration pairs and a corresponding feature weight for each feature. The features can be matched to a potential transliteration pair, and the corresponding feature weights can be summed to produce a classification score for the potential transliteration pair.
- Each potential transliteration pair is initially classified as being a transliteration pair or not being a transliteration pair.
- each potential transliteration pair is initially classified based on an edit distance between the first anchor text and the second anchor text.
- the edit distance can be defined, for example, as the number of operations that are used to transform the first anchor text into the second anchor text.
- In FIG. 3, an example potential transliteration pair and its corresponding edits are shown.
- the transliteration pair comprises a first anchor text in a first writing system (e.g., "sample") and a second anchor text in a second writing system (e.g., サンプル).
- the first anchor text and the second anchor text can be processed before an edit distance is determined.
- the first anchor text and second anchor text are transformed into a common writing system.
- サンプル can be transformed into "sanpuru" (e.g., a phonetic spelling), so that the first anchor text and the transformed second anchor text both comprise Latin characters.
- both the first anchor text and the transformed second anchor text are normalized. For example, "sample” and “sanpuru” can be capitalized to produce "SAMPLE” and "SANPURU". This normalization can be used to facilitate alignment of the first anchor text and the transformed second anchor text.
- the first anchor text and transformed second anchor text can be aligned to determine an edit distance and one or more edits from each of the first anchor text and the transformed second anchor text.
- characters from the first anchor text can be matched with characters from the transformed second anchor text.
- characters can be matched based on the statistical likelihood that one or more characters in the first anchor text co-occur with one or more characters in the second anchor text. For example, co-occurrence probabilities can be measured by Dice coefficients.
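- A Dice coefficient for a pair of character chunks can be computed as twice their joint count divided by the sum of their individual counts. The sketch below assumes a hypothetical list of observed `(first_chunk, second_chunk)` matches as input; how those observations are gathered is not specified here.

```python
from collections import Counter


def dice_table(aligned_pairs):
    """Dice coefficient for every observed (first_chunk, second_chunk) match.

    aligned_pairs: iterable of (first_chunk, second_chunk) observations
    (a hypothetical input format used only for this illustration).
    """
    joint, first, second = Counter(), Counter(), Counter()
    for s, t in aligned_pairs:
        joint[(s, t)] += 1
        first[s] += 1
        second[t] += 1
    # Dice(s, t) = 2 * count(s, t) / (count(s) + count(t))
    return {(s, t): 2.0 * c / (first[s] + second[t])
            for (s, t), c in joint.items()}
```

- A coefficient near one means the two chunks almost always appear together, which makes them a strong match candidate during alignment.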
- consonant maps can be used to further refine the alignment process.
- the alignment can be performed, such that the edit distance is minimized.
- the characters of "SAMPLE” are matched with the characters of "SANPURU” in six operations.
- the operations include: (1) matching “S” with “S”, (2) matching “A” with “A”, (3) matching “M” with “N”, (4) matching "P” with “PU”, (5) matching "L” with “R”, and (6) matching "E” with "U”. Because six operations are used to align the first anchor text with transformed second anchor text, the edit distance is six.
- an unweighted edit distance is minimized.
- the edit distance can be weighted, and the weighted edit distance is minimized. Other implementations are possible.
- the edits associated with Latin can be represented as features “S_S”, “A_A”, “M_N”, “P_PU”, “L_R”, and “E_U”.
- the underscore between the letters can be used to separate the characters of the first anchor text from the characters of the transformed second anchor text.
- the edits associated with Katakana can be represented as features “S_S”, “A_A”, “N_M”, “PU_P”, “R_L”, and “U_E”.
- Each edit can be associated to a feature in the classification model 240.
- "P_PU" can be a feature of the potential transliteration pair associated with Latin.
- "PU_P" can be a feature of the potential transliteration pair associated with Katakana.
- features that correspond to the beginning or the end of anchor text in a potential transliteration pair can be marked as beginning features or ending features, respectively.
- the feature "S_S" can be represented as "^S_^S", where the prefix "^" indicates that this feature represents characters that occur at the beginning of anchor text in a potential transliteration pair.
- the feature “E_U” can be represented as "E$_U$", where the suffix "$” indicates that this feature represents characters that occur at the end of anchor text in a potential transliteration pair.
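- The feature naming described above can be sketched as follows. The function names and the list-of-tuples alignment format are assumptions; the underscore separator and the "$" suffix follow the text, and the "^" prefix is assumed as the beginning marker. All three are exposed as parameters because other delimiters are possible.

```python
def edit_features(alignment, sep="_", start="^", end="$"):
    """Turn an ordered alignment into feature strings.

    alignment: list of (first_chunk, second_chunk) tuples in text order
    (a hypothetical representation of the extracted edits).
    """
    feats = []
    last = len(alignment) - 1
    for i, (a, b) in enumerate(alignment):
        if i == 0:  # mark the edit at the beginning of the anchor text
            a, b = start + a, start + b
        if i == last:  # mark the edit at the end of the anchor text
            a, b = a + end, b + end
        feats.append(a + sep + b)
    return feats


def swap_sides(alignment):
    """Features for the other writing system use the same edits, reversed."""
    return [(b, a) for a, b in alignment]
```

- For the "SAMPLE" and "SANPURU" alignment, `edit_features` yields the Latin-side features and `edit_features(swap_sides(...))` yields the Katakana-side features.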
- the appropriate feature weights can be used to calculate the classification score.
- Other implementations are possible. For example, different characters (e.g., delimiters) can be used as separators, prefixes, or suffixes.
- the alignment is performed so that there are no empty edits.
- adjacent edits can be grouped together so that there are no edits with an empty side.
- a first anchor text "TOP" can be aligned with a second transformed anchor text "TOPPU" (トップ).
- alignment may produce the following features "T_T", "O_O", "P_P", "<null>_P", and "<null>_U", where <null> represents an empty edit.
- the alignment can be performed, such that the adjacent features "<null>_P" and "<null>_U" are combined to produce "<null>_PU", and then "P_P" and "<null>_PU" are combined to produce "P_PPU". As a result, the final alignment produces the features "T_T", "O_O", and "P_PPU".
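- One way to group adjacent edits so that no edit has an empty side is sketched below. The empty string stands in for the empty edit, and the left-to-right merge order is an assumption; it reaches the same final grouping as the two-step combination described above.

```python
def merge_empty_edits(edits):
    """Fold edits with an empty side into a neighbouring edit.

    edits: ordered list of (first_chunk, second_chunk); "" marks an empty side
    (a hypothetical representation chosen for this illustration).
    """
    merged = []
    for a, b in edits:
        if merged and (a == "" or b == ""):
            # Append the non-empty side to the preceding edit
            pa, pb = merged[-1]
            merged[-1] = (pa + a, pb + b)
        else:
            merged.append((a, b))

    # A leading empty edit has no predecessor, so fold it into the next edit
    if len(merged) > 1 and "" in merged[0]:
        (a0, b0), (a1, b1) = merged[0], merged[1]
        merged[:2] = [(a0 + a1, b0 + b1)]
    return merged
```

- Applied to the "TOP" and "TOPPU" edits, this returns three edits whose last element pairs "P" with "PPU".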
- statistics can be collected that can be used to calculate the likelihood that a potential transliteration pair is a transliteration pair. For example, a count of the number of consonants in each of the first anchor text and the transformed second anchor text can be determined. A count of the number of vowels in each of the first anchor text and the transformed second anchor text can also be determined. The difference between the counts can be calculated. The differences between the first anchor text and the second transformed anchor can be used to calculate the likelihood that the potential transliteration pair is a transliteration pair.
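- The consonant and vowel counts can be gathered with a few lines of Python. This sketch assumes both strings are already in Latin characters and treats only A, E, I, O, and U as vowels, which is a simplification introduced here.

```python
VOWELS = set("AEIOU")  # assumption: a simple Latin vowel set


def cv_counts(text):
    """Return (consonant_count, vowel_count) for the letters in text."""
    letters = [c for c in text.upper() if c.isalpha()]
    vowels = sum(c in VOWELS for c in letters)
    return len(letters) - vowels, vowels


def cv_differences(first, second):
    """Absolute differences in consonant and vowel counts between two texts."""
    c1, v1 = cv_counts(first)
    c2, v2 = cv_counts(second)
    return abs(c1 - c2), abs(v1 - v2)
```

- For "SAMPLE" and "SANPURU", the consonant counts match and the vowel counts differ by one.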
- the edit distance is compared to a threshold value.
- the threshold value is the lesser of a length of the first anchor text and a length of the second anchor text. If the edit distance is greater than the threshold value, then the potential transliteration pair can be classified as not being a transliteration pair. Otherwise, the potential transliteration can be classified as being a transliteration pair.
- the edit distance is six, and the lengths of "SAMPLE" and "SANPURU" are six and seven respectively. Therefore, the threshold value is six, and "sample" and サンプル can be initially classified as being a transliteration pair.
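- The threshold rule can be written directly. The function name is hypothetical, and the edit distance is passed in as a number (here, the count of alignment operations) rather than computed.

```python
def initial_classification(first, second, edit_distance):
    """True when the pair is initially classified as a transliteration pair.

    The threshold is the lesser of the two text lengths; a distance greater
    than the threshold means "not a transliteration pair".
    """
    threshold = min(len(first), len(second))
    return edit_distance <= threshold
```

- With "SAMPLE", "SANPURU", and an edit distance of six, the threshold is six and the pair is accepted; an edit distance of seven would be rejected.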
- the initial classification can be based on any of the statistics determined during alignment.
- the potential transliteration pair can be classified as not being a transliteration pair if the edit distance is less than the threshold value.
- the potential transliteration pairs are sorted.
- the potential transliteration pairs can be sorted according to their likelihoods of being a transliteration pair. For example, a log-likelihood ratio of each potential transliteration pair being a transliteration pair can be calculated, e.g., using the statistics (e.g., frequencies and counts) acquired during extraction and alignment.
- a subset of the sorted potential transliteration pairs can be used to generate (e.g., train) the classification model 240.
- the potential transliteration pairs that are most likely to be transliteration pairs (e.g., top 1%) and were initially classified as being transliteration pairs, can be extracted. These pairs can be used to represent samples of transliteration pairs.
- the potential transliteration pairs that are least likely to be transliteration pairs (e.g., bottom 1%) and were initially classified as not being transliteration pairs can be extracted. These pairs can be used to represent samples that are not transliteration pairs. These samples can be used to train the classification model 240.
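- Selecting the training samples from the two ends of the sorted list might look like the sketch below. The tuple layout `(pair_id, likelihood, initially_positive)` and the default `fraction` of 1% are assumptions for illustration.

```python
def select_training_samples(pairs, fraction=0.01):
    """Pick positive and negative training samples from a ranked list.

    pairs: list of (pair_id, likelihood, initially_positive) tuples
    (a hypothetical layout). Returns (positives, negatives).
    """
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
    k = max(1, int(len(ranked) * fraction))  # keep at least one per end

    # Most likely pairs that were also initially classified as pairs
    positives = [p for p in ranked[:k] if p[2]]
    # Least likely pairs that were also initially classified as non-pairs
    negatives = [p for p in ranked[-k:] if not p[2]]
    return positives, negatives
```

- Requiring the likelihood ranking and the initial classification to agree keeps noisy candidates out of both sample sets.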
- the samples of transliteration pairs, the samples that are not transliteration pairs, and the features can be used to generate the classification model 240, e.g., generate a feature weight for each of the features.
- a simple linear SVM can be trained using this data.
- the samples of transliteration pairs can be used as data points for a first class, e.g., a class of data points that are transliteration pairs.
- the samples that are not transliteration pairs can be used as data points for a second class, e.g., a class of data points that are not transliteration pairs.
- a hyperplane (e.g., a maximum-margin hyperplane) can be determined that separates the data points into their respective classes, and maximizes the distance from the hyperplane to the nearest data point.
- the data can be used to train generative models, using linear discriminant analysis or a naive Bayes classifier; or discriminative models, using logistic regression or a perceptron.
- a feature weight for each feature can be calculated based on samples that include the feature. For example, a feature weight can be increased for a feature that is included in a sample in the first class. As another example, a feature weight can be decreased for a feature that is included in a sample in the second class. If the feature appears in samples in both the first and the second class, a neutral feature weight (e.g., zero) can be assigned. Using this example convention, a higher classification score can represent a better potential transliteration pair (e.g., a better transliteration pair candidate).
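- The weighting convention in this example can be illustrated with simple counting. This is not SVM training; it is a toy stand-in, with hypothetical names and a unit step size, that only shows how the sign of a weight follows from the class of the samples that contain the feature.

```python
from collections import defaultdict


def feature_weights(positive_samples, negative_samples):
    """Toy weights: +1 per positive sample, -1 per negative sample.

    Each sample is an iterable of feature strings. Features seen in both
    classes receive a neutral weight of zero, per the example convention.
    """
    weights = defaultdict(float)
    seen_pos, seen_neg = set(), set()

    for sample in positive_samples:
        for f in sample:
            weights[f] += 1.0
            seen_pos.add(f)
    for sample in negative_samples:
        for f in sample:
            weights[f] -= 1.0
            seen_neg.add(f)

    for f in seen_pos & seen_neg:  # appears in both classes: neutral
        weights[f] = 0.0
    return dict(weights)
```

- A trained linear SVM would instead derive the weights from the separating hyperplane, but the resulting per-feature weights are used in the same way when scoring.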
- Feature weights that correspond to features in a potential transliteration pair can be summed to produce a classification score for the potential transliteration pair. Based on the classification score, the potential transliteration pairs that were originally extracted from the anchor text can be classified as either being or not being a transliteration pair. Using the example convention described previously, if the classification score is negative, for example, the potential transliteration pair can be classified as not being a transliteration pair. If the classification score is positive, for example, the potential transliteration pair can be classified as being a transliteration pair. A classification score of zero can represent a neutral classification.
- Returning to the example in FIG. 3, the potential transliteration pair comprising "sample" and サンプル can be compared to the features in the classification model, and the features "S_S", "A_A", "N_M", "PU_P", and "R_L" can be returned.
- the feature weights corresponding to the returned features can be summed to produce a classification score for the potential transliteration pair, and the classification score can be used to classify the potential transliteration pair.
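- Scoring and classifying a pair from its matched features is a lookup and a sum. In this sketch the function name and label strings are assumptions, and the weights in the usage note are made-up values rather than weights from a trained model.

```python
def classify(features, weights):
    """Sum the weights of the matched features and classify by sign.

    features: feature strings found for one potential transliteration pair.
    weights: dict mapping feature string to weight; unknown features count 0.
    """
    score = sum(weights.get(f, 0.0) for f in features)
    if score > 0:
        label = "transliteration pair"
    elif score < 0:
        label = "not a transliteration pair"
    else:
        label = "neutral"
    return score, label
```

- For instance, with illustrative weights `{"S_S": 0.5, "A_A": 0.0, "N_M": 1.0, "PU_P": 1.5, "R_L": 1.0}`, the returned features for the "sample" example sum to a positive score, so the pair is classified as a transliteration pair.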
- further refinements can be performed to improve the accuracy and precision of the automatic identification of transliteration pairs.
- the following refinement rules can be used.
- the potential transliteration pairs can be sorted again based on the likelihood of the potential transliteration pairs being transliteration pairs, e.g., using the log-likelihood ratio.
- the potential transliteration pair can be reclassified as not being a transliteration pair (e.g., a negative candidate 242) if its corresponding classification score is negative.
- the potential transliteration pair can be reclassified as a best potential transliteration pair (e.g., a best candidate 244) when the potential transliteration pair has the highest likelihood of being a transliteration pair and its corresponding classification score is positive.
- a second edit distance can be determined between the potential transliteration pair and the best potential transliteration pair, and the potential transliteration pair can be reclassified as being a transliteration pair (e.g., a positive candidate 246) when its second edit distance is less than a second threshold value and its corresponding classification score is positive.
- the potential transliteration pair can be reclassified as not being a transliteration pair when its second edit distance is greater than the second threshold value.
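- The refinement rules can be sketched as a single pass over candidates sorted by likelihood. Several details are assumptions made for this illustration: the candidates share the same text in one writing system, the second edit distance is a Levenshtein distance between romanized candidate texts, the best candidate is the highest-likelihood candidate with a positive score, and a zero score or a distance equal to the threshold is treated as negative.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def refine(candidates, second_threshold=2):
    """candidates: list of (text, likelihood, score) (hypothetical layout).

    Returns a dict mapping each text to "best", "positive", or "negative".
    """
    labels, best = {}, None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for text, _, score in ranked:
        if score <= 0:  # assumption: a zero score is treated as negative
            labels[text] = "negative"
        elif best is None:
            best = text
            labels[text] = "best"
        elif levenshtein(text, best) < second_threshold:
            labels[text] = "positive"
        else:  # assumption: distance equal to the threshold is negative
            labels[text] = "negative"
    return labels
```

- The effect is that only candidates that both score positively and stay close to the best candidate remain classified as transliteration pairs.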
- the same refinement rules can also be performed for the one or more edits associated with the second writing system (e.g., edits associated with the features "S_S”, “A_A”, “N_M”, “PU_P”, “R_L”, and “U_E”).
- potential transliteration pairs that were previously reclassified as not being a transliteration pair in the previous refinement are not reclassified during this refinement.
- even further refinements to the classification model can be achieved through performing one or more iterations of the steps used to train the classification model 240, described previously.
- an initial classification model 240 e.g., a pre-classification model
- the classification model 240 can be used to identify transliteration pairs in input text.
- input text can be compared to the features in the classification model 240, and transliteration pairs can be identified using the techniques described previously.
- a set of identified transliteration pairs (e.g., potential transliteration pairs classified as transliteration pairs) is generated.
- FIG. 4 is a flow chart showing an example process 400 for automatically identifying transliteration pairs using anchor text.
- the method includes receiving 410 a plurality of resources, the plurality of resources including a plurality of anchor text.
- the resources 210 that include anchor text 220 can be received by an extraction engine.
- one or more potential transliterations can be determined from the plurality of anchor text.
- the one or more potential transliterations can be sorted, based on likelihoods of the one or more potential transliterations co-occurring with text that identifies a same resource or location.
- One or more potential transliteration pairs can be identified from one or more potential transliterations.
- Each potential transliteration pair includes a first anchor text in a first writing system and a second anchor text in a second writing system, the second anchor text and the first anchor text identifying a same resource.
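The candidate-generation steps above can be sketched as follows, under stated assumptions. The input format (a list of `(anchor_text, target_url)` tuples), the function names, and the script detection based on Unicode character names are all hypothetical choices for illustration. A raw co-occurrence count across resources stands in for the likelihood used for sorting.

```python
import unicodedata
from collections import Counter, defaultdict
from itertools import product


def script_of(text: str) -> str:
    """Rough script guess: first word of the Unicode name of the first letter."""
    for ch in text:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def potential_pairs(links, first_script, second_script):
    """Pair anchor texts in two scripts that identify the same resource.

    links: iterable of (anchor_text, target_url) tuples (assumed format).
    Returns pairs sorted by how many resources they co-occur on.
    """
    by_target = defaultdict(set)
    for text, url in links:
        by_target[url].add(text)

    counts = Counter()
    for texts in by_target.values():
        firsts = [t for t in texts if script_of(t) == first_script]
        seconds = [t for t in texts if script_of(t) == second_script]
        counts.update(product(firsts, seconds))

    return [pair for pair, _ in counts.most_common()]
```

For example, if "sample" and "サンプル" both appear as anchor text for two different URLs, and "test" shares only one of those URLs with "サンプル", the pair ("sample", "サンプル") is ranked ahead of ("test", "サンプル").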
- Each potential transliteration pair can be classified 420 as being a transliteration pair or not being a transliteration pair.
- a classification engine can perform the classifying.
- the first anchor text can be aligned 430 with the second anchor text.
- an alignment engine can perform the aligning.
- One or more edits can be extracted 440 from the alignment.
- the extraction engine or the alignment engine can perform the extraction.
- a classification model can be generated 450, based on the one or more edits and a subset of the classified potential transliteration pairs.
- the subset can include potential transliteration pairs classified as being a transliteration pair and potential transliteration pairs classified as not being a transliteration pair.
- Transliteration pairs can be identified 460 from the potential transliteration pairs, using the classification model.
- the classification engine can generate the classification model, and identify the transliteration pairs.
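The aligning 430 and edit-extracting 440 steps can be illustrated with a short sketch. It assumes that the second anchor text has already been romanized (here "sanpuru" for the Katakana rendering of "sample"), uses a standard Levenshtein alignment as a stand-in for whatever alignment the alignment engine actually performs, and names edits in the `X_Y` style of the features quoted earlier. The function names are hypothetical.

```python
def align(a: str, b: str):
    """Align two strings by edit distance; '' marks a gap on one side."""
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitute
                             dist[i - 1][j] + 1,         # delete from a
                             dist[i][j - 1] + 1)         # insert from b

    # Trace back from the bottom-right corner, preferring diagonal moves.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((a[i - 1], ""))
            i -= 1
        else:
            pairs.append(("", b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def extract_edits(pairs):
    """Merge gap edits into the preceding edit and name each edit 'X_Y'."""
    merged = []
    for x, y in pairs:
        if merged and (x == "" or y == ""):
            px, py = merged[-1]
            merged[-1] = (px + x, py + y)
        else:
            merged.append((x, y))
    return [f"{x.upper()}_{y.upper()}" for x, y in merged]
```

Calling `extract_edits(align("sample", "sanpuru"))` yields `["S_S", "A_A", "M_N", "P_PU", "L_R", "E_U"]`, which mirrors the second-writing-system features "S_S", "A_A", "N_M", "PU_P", "R_L", and "U_E" with the two sides swapped. Edits like these, together with the labels of the classified subset, are the inputs from which a classification model could be generated.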
- FIG. 5 is a schematic diagram of a generic computer system 500.
- the system 500 can be used for practicing operations described in association with the techniques described previously (e.g., process 400).
- the system 500 can include a processor 510, a memory 520, a storage device 530, and input/output devices 540. Each of the components 510, 520, 530, and 540 is interconnected using a system bus 550.
- the processor 510 is capable of processing instructions for execution within the system 500. Such executed instructions can implement one or more components of a system for automatically identifying transliteration pairs using anchor text, as in FIG. 2, for example.
- the processor 510 is a single-threaded processor.
- the processor 510 is a multi-threaded processor.
- the processor 510 is capable of processing instructions stored in the memory 520 or on the storage device 530 to display graphical information for a user interface on the input/output device 540.
- the memory 520 is a computer-readable medium, such as volatile or non-volatile memory, that stores information within the system 500.
- the memory 520 could store the potential transliteration pairs 230 and classification model 240, for example.
- the storage device 530 is capable of providing persistent storage for the system 500.
- the storage device 530 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, or other suitable persistent storage means.
- the input/output device 540 provides input/output operations for the system 500.
- the input/output device 540 includes a keyboard and/or pointing device.
- the input/output device 540 includes a display unit for displaying graphical user interfaces.
- the input/output device 540 can provide input/output operations for a system for automatically identifying transliteration pairs using anchor text, as in FIG. 2.
- the segmentation system can include computer software components to automatically identify transliteration pairs using anchor text. Examples of such software components include an extraction engine to extract anchor text from resources, an alignment engine to align a potential transliteration pair, and a classification engine to classify the potential transliteration pairs. Such software components can be persisted in storage device 530, memory 520 or can be obtained over a network connection, to name a few examples. Although many of the examples described in this specification illustrate
- transliteration pairs can be extracted from English (e.g., Latin characters) and Korean (e.g., Hangul characters) anchor text.
- transliteration pairs can be extracted from Hindi (e.g., Devanagari characters) and Russian (e.g., Cyrillic characters) anchor text.
- other types e.g., samples
- phonetic variants of a word in a single writing system can be used to train the classification model.
- spelling variants of a word in a single writing system can be used to train the classification model.
- Other implementations are possible.
- Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible program carrier for execution by, or to control the operation of, data processing apparatus.
- the tangible program carrier can be a computer-readable medium.
- the computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, or a combination of one or more of them.
- data processing apparatus encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program does not necessarily correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
- the processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
- a processor will receive instructions and data from a read-only memory or a random access memory or both.
- the essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
- a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, a device with spoken language input, to name just a few.
- a smart phone is an example of a device with spoken language input, which can accept voice input (e.g., a user query spoken into a microphone on the device).
- Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200980148103.XA CN102227724B (en) | 2008-10-10 | 2009-10-05 | Machine learning for transliteration |
JP2011531101A JP5604435B2 (en) | 2008-10-10 | 2009-10-05 | Machine learning for transliteration |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10469208P | 2008-10-10 | 2008-10-10 | |
US61/104,692 | 2008-10-10 | ||
US12/357,269 | 2009-01-21 | ||
US12/357,269 US8275600B2 (en) | 2008-10-10 | 2009-01-21 | Machine learning for transliteration |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2010042452A2 (en) | 2010-04-15 |
WO2010042452A3 (en) | 2010-07-08 |
Family
ID=42099693
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2009/059576 WO2010042452A2 (en) | 2008-10-10 | 2009-10-05 | Machine learning for transliteration |
Country Status (5)
Country | Link |
---|---|
US (1) | US8275600B2 (en) |
JP (1) | JP5604435B2 (en) |
KR (1) | KR101650112B1 (en) |
CN (1) | CN102227724B (en) |
WO (1) | WO2010042452A2 (en) |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10319252B2 (en) | 2005-11-09 | 2019-06-11 | Sdl Inc. | Language capability assessment and training apparatus and techniques |
US9122674B1 (en) | 2006-12-15 | 2015-09-01 | Language Weaver, Inc. | Use of annotations in statistical machine translation |
US8831928B2 (en) | 2007-04-04 | 2014-09-09 | Language Weaver, Inc. | Customizable machine translation service |
US8825466B1 (en) | 2007-06-08 | 2014-09-02 | Language Weaver, Inc. | Modification of annotated bilingual segment pairs in syntax-based machine translation |
US8521516B2 (en) * | 2008-03-26 | 2013-08-27 | Google Inc. | Linguistic key normalization |
US8990064B2 (en) | 2009-07-28 | 2015-03-24 | Language Weaver, Inc. | Translating documents based on content |
US8463591B1 (en) * | 2009-07-31 | 2013-06-11 | Google Inc. | Efficient polynomial mapping of data for use with linear support vector machines |
US20110218796A1 (en) * | 2010-03-05 | 2011-09-08 | Microsoft Corporation | Transliteration using indicator and hybrid generative features |
US10417646B2 (en) | 2010-03-09 | 2019-09-17 | Sdl Inc. | Predicting the cost associated with translating textual content |
US8930176B2 (en) | 2010-04-01 | 2015-01-06 | Microsoft Corporation | Interactive multilingual word-alignment techniques |
US8682643B1 (en) * | 2010-11-10 | 2014-03-25 | Google Inc. | Ranking transliteration output suggestions |
JP5090547B2 (en) * | 2011-03-04 | 2012-12-05 | 楽天株式会社 | Transliteration processing device, transliteration processing program, computer-readable recording medium recording transliteration processing program, and transliteration processing method |
US11003838B2 (en) | 2011-04-18 | 2021-05-11 | Sdl Inc. | Systems and methods for monitoring post translation editing |
US8886515B2 (en) | 2011-10-19 | 2014-11-11 | Language Weaver, Inc. | Systems and methods for enhancing machine translation post edit review processes |
US8942973B2 (en) * | 2012-03-09 | 2015-01-27 | Language Weaver, Inc. | Content page URL translation |
US10261994B2 (en) | 2012-05-25 | 2019-04-16 | Sdl Inc. | Method and system for automatic management of reputation of translators |
US9176936B2 (en) * | 2012-09-28 | 2015-11-03 | International Business Machines Corporation | Transliteration pair matching |
US9152622B2 (en) | 2012-11-26 | 2015-10-06 | Language Weaver, Inc. | Personalized machine translation via online adaptation |
US9213694B2 (en) | 2013-10-10 | 2015-12-15 | Language Weaver, Inc. | Efficient online domain adaptation |
RU2632137C2 (en) | 2015-06-30 | 2017-10-02 | Общество С Ограниченной Ответственностью "Яндекс" | Method and server of transcription of lexical unit from first alphabet in second alphabet |
EP3318979A4 (en) * | 2015-06-30 | 2019-03-13 | Rakuten, Inc. | Transliteration processing device, transliteration processing method, transliteration processing program and information processing device |
US10460038B2 (en) * | 2016-06-24 | 2019-10-29 | Facebook, Inc. | Target phrase classifier |
US10268686B2 (en) | 2016-06-24 | 2019-04-23 | Facebook, Inc. | Machine translation system employing classifier |
US10789410B1 (en) * | 2017-06-26 | 2020-09-29 | Amazon Technologies, Inc. | Identification of source languages for terms |
US10558748B2 (en) | 2017-11-01 | 2020-02-11 | International Business Machines Corporation | Recognizing transliterated words using suffix and/or prefix outputs |
US11062621B2 (en) * | 2018-12-26 | 2021-07-13 | Paypal, Inc. | Determining phonetic similarity using machine learning |
KR20220017313A (en) * | 2020-08-04 | 2022-02-11 | 삼성전자주식회사 | Method for transliteration search and electronic device supporting that |
CN112883162A (en) * | 2021-03-05 | 2021-06-01 | 龙马智芯(珠海横琴)科技有限公司 | Transliteration name recognition method, transliteration name recognition device, recognition equipment and readable storage medium |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE69422406T2 (en) * | 1994-10-28 | 2000-05-04 | Hewlett Packard Co | Method for performing data chain comparison |
JP3863330B2 (en) * | 1999-09-28 | 2006-12-27 | 株式会社東芝 | Nonvolatile semiconductor memory |
US20010029455A1 (en) * | 2000-03-31 | 2001-10-11 | Chin Jeffrey J. | Method and apparatus for providing multilingual translation over a network |
EP1575031A3 (en) * | 2002-05-15 | 2010-08-11 | Pioneer Corporation | Voice recognition apparatus |
US20050216253A1 (en) * | 2004-03-25 | 2005-09-29 | Microsoft Corporation | System and method for reverse transliteration using statistical alignment |
US20070022134A1 (en) * | 2005-07-22 | 2007-01-25 | Microsoft Corporation | Cross-language related keyword suggestion |
CN100555308C (en) * | 2005-07-29 | 2009-10-28 | 富士通株式会社 | Address recognition unit and method |
CN100483399C (en) * | 2005-10-09 | 2009-04-29 | 株式会社东芝 | Training transliteration model, segmentation statistic model and automatic transliterating method and device |
CN101042692B (en) * | 2006-03-24 | 2010-09-22 | 富士通株式会社 | translation obtaining method and apparatus based on semantic forecast |
JP4734155B2 (en) * | 2006-03-24 | 2011-07-27 | 株式会社東芝 | Speech recognition apparatus, speech recognition method, and speech recognition program |
US7983903B2 (en) * | 2007-09-07 | 2011-07-19 | Microsoft Corporation | Mining bilingual dictionaries from monolingual web pages |
US7917488B2 (en) * | 2008-03-03 | 2011-03-29 | Microsoft Corporation | Cross-lingual search re-ranking |
-
2009
- 2009-01-21 US US12/357,269 patent/US8275600B2/en active Active
- 2009-10-05 CN CN200980148103.XA patent/CN102227724B/en not_active Expired - Fee Related
- 2009-10-05 WO PCT/US2009/059576 patent/WO2010042452A2/en active Application Filing
- 2009-10-05 JP JP2011531101A patent/JP5604435B2/en not_active Expired - Fee Related
- 2009-10-05 KR KR1020117008322A patent/KR101650112B1/en active IP Right Grant
Non-Patent Citations (2)
Title |
---|
JIAN-CHENG WU ET AL.: 'Learning to Find English to Chinese Transliteration on the Web' THE 2007 JOINT CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND COMPUTATIONAL NATURAL LANGUAGE LEARNING June 2007, PRAGUE, pages 996 - 1004 * |
K. SARAVANAN ET AL.: 'Some Experiments in Mining Named Entity Transliteration Pairs from Comparable Corpora' THE 2ND INTERNATIONAL WORKSHOP ON ''CROSS LINGUAL INFORMATION ACCESS'' ADDRESSING THE INFORMATION NEED OF MULTILINGUAL SOCIETIES January 2008, pages 26 - 33 * |
Also Published As
Publication number | Publication date |
---|---|
US8275600B2 (en) | 2012-09-25 |
CN102227724A (en) | 2011-10-26 |
WO2010042452A3 (en) | 2010-07-08 |
KR20110083623A (en) | 2011-07-20 |
US20100094614A1 (en) | 2010-04-15 |
JP5604435B2 (en) | 2014-10-08 |
JP2012505474A (en) | 2012-03-01 |
KR101650112B1 (en) | 2016-08-22 |
CN102227724B (en) | 2014-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8275600B2 (en) | Machine learning for transliteration | |
Shoukry et al. | Sentence-level Arabic sentiment analysis | |
US7983903B2 (en) | Mining bilingual dictionaries from monolingual web pages | |
Elarnaoty et al. | A machine learning approach for opinion holder extraction in Arabic language | |
US8606559B2 (en) | Method and apparatus for detecting errors in machine translation using parallel corpus | |
US8706474B2 (en) | Translation of entity names based on source document publication date, and frequency and co-occurrence of the entity names | |
Jabbar et al. | Empirical evaluation and study of text stemming algorithms | |
Sun et al. | Chinese new word identification: a latent discriminative model with global features | |
Ali et al. | A Recent Survey of Arabic Named Entity Recognition on Social Media. | |
Chen et al. | Integrating natural language processing with image document analysis: what we learned from two real-world applications | |
Ozer et al. | Diacritic restoration of Turkish tweets with word2vec | |
Uthayamoorthy et al. | Ddspell-a data driven spell checker and suggestion generator for the tamil language | |
Nehar et al. | Rational kernels for Arabic root extraction and text classification | |
López et al. | Experiments on sentence boundary detection in user-generated web content | |
Naz et al. | Urdu part of speech tagging using transformation based error driven learning | |
CN110008312A (en) | A kind of document writing assistant implementation method, system and electronic equipment | |
US20120197894A1 (en) | Apparatus and method for processing documents to extract expressions and descriptions | |
Sen et al. | Bangla natural language processing: A comprehensive review of classical machine learning and deep learning based methods | |
Rajan et al. | Survey of nlp resources in low-resource languages nepali, sindhi and konkani | |
Myint et al. | Disambiguation using joint entropy in part of speech of written Myanmar text | |
Wei et al. | Feature selection on Chinese text classification using character n-grams | |
Wang et al. | Mongolian named entity recognition system with rich features | |
Xie et al. | Automatic chinese spelling checking and correction based on character-based pre-trained contextual representations | |
Tongtep et al. | Multi-stage automatic NE and pos annotation using pattern-based and statistical-based techniques for thai corpus construction | |
Baishya et al. | Present state and future scope of Assamese text processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 200980148103.X Country of ref document: CN |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09819723 Country of ref document: EP Kind code of ref document: A2 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2011531101 Country of ref document: JP |
|
ENP | Entry into the national phase |
Ref document number: 20117008322 Country of ref document: KR Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 09819723 Country of ref document: EP Kind code of ref document: A2 |