WO2008131509A1 - Systems and methods for improving translation systems - Google Patents

Systems and methods for improving translation systems

Info

Publication number
WO2008131509A1
Authority
WO
WIPO (PCT)
Prior art keywords
message
database
entries
parsed
translation engine
Prior art date
Application number
PCT/CA2007/001004
Other languages
English (en)
French (fr)
Inventor
Tony Shut Lee Lau
Stephen John Barker
Joey Charles Tremblay
Original Assignee
Fireswirl Systems Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fireswirl Systems Inc.
Publication of WO2008131509A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/47 Machine-assisted translation, e.g. using translation memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Definitions

  • the invention relates to methods and systems for computer-based translation between human languages.
  • a drawback with currently available translation systems is that they do not typically handle linguistic forms that are commonly found in some languages.
  • translation engines tend to be proficient at translating individual words, but may not be adept at translating idioms or abbreviations which are common in the English language.
  • a translation engine will typically misinterpret the phrase "running out of ink", which has both a literal meaning and a well-understood idiomatic meaning.
  • Another drawback with currently available translation systems relates to the written Chinese language and other Asian languages which do not include spaces between written words.
  • currently available translation systems have difficulty interpreting written language that lacks inter-word spacing.
  • A parallel corpus comprises a manually created database of pairs of words, pairs of phrases and/or pairs of sentences in the input language and in the target language.
  • Such a parallel corpus may be large, computationally expensive to use and maintain, and prone to error because of its size and the need for accurate translation within it.
  • Some currently available translation systems encounter problems translating languages that have very different structure, such as English and Chinese, for example.
  • Figure 1 schematically depicts a method and system for improving the operation of a translation engine according to a particular embodiment of the invention.
  • Figure 2 schematically depicts a method and system for a maximal matching process that may be used as a part of the segmenter of the Figure 1 translation system in accordance with a particular embodiment of the invention.
  • input messages received over an interface are processed by a message formatting block, a lookup lexicon block, a segmenter, a lexical pre-processor, an abbreviation and idiom expansion block, a spelling check block and/or a grammar check block, any or all of which may modify the input message prior to sending the input message to a translation engine (if required) for translation.
  • Figure 1 schematically depicts a method and system for improving the operation of a translation engine 110 according to a particular embodiment of the invention.
  • translation system 100 of Figure 1 comprises a front-end 106 which is added to and works in conjunction with translation engine 110.
  • Translation engine 110 may generally comprise any translation system known in the art or developed in the future, including, by way of non-limiting example, the Google™ translation engine (see http://translate.google.com/translate_t) or the Systransoft™ translation engine (see http://www.systransoft.com/).
  • a typical translation engine 110 comprises a software API which receives as inputs: (i) input language; (ii) target language; (iii) input message in input language; and (iv) an optional context field.
  • the optional context field may be used to adjust certain grammatical rules (i.e. where translation engine 110 is a so-called "rules-based" translation engine).
  • Non-limiting examples of context fields include: conversational, medical, technical, industrial, news or the like.
  • An example of an input command to translation engine 110 could be: input language: "English"; target language: "French"; input message: "Hello. How are you?"; and context: "conversational".
  • the output of a typical translation engine 110 is a translated version of the input message in the target language.
  • the output corresponding to the above command could be "Bonjour. Ça va?".
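  • The four API inputs and the example command described above can be pictured as a small request object. The following is a minimal sketch, not the API of any real engine: `TranslationRequest` and `stub_translate` are hypothetical names, and the stub answers from a fixed table rather than performing translation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TranslationRequest:
    """The four inputs the text attributes to a typical engine API."""
    input_language: str
    target_language: str
    message: str
    context: Optional[str] = None  # optional, e.g. "conversational"


def stub_translate(request: TranslationRequest) -> str:
    """Hypothetical stand-in for a real engine: answers from a fixed table."""
    canned = {
        ("English", "French", "Hello. How are you?"): "Bonjour. Ça va?",
    }
    key = (request.input_language, request.target_language, request.message)
    if key not in canned:
        raise LookupError("stub engine has no translation for this request")
    return canned[key]


request = TranslationRequest(
    input_language="English",
    target_language="French",
    message="Hello. How are you?",
    context="conversational",
)
print(stub_translate(request))
```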
  • One of the more common currently available translation engines 110 is the Systransoft™ translation engine (also referred to as SYSTRAN™).
  • SYSTRAN™ uses what is called a rule-based translation algorithm. After receiving an input message, the SYSTRAN™ translation engine 110 attempts to break the phrase down grammatically. Using a set of rules between the input language and the target language (a language pair), the SYSTRAN™ translation engine 110 returns one translation of the input message in the target language.
  • the Google™ translation engine is another common currently available translation engine 110.
  • the Google™ translation engine is based on statistical translation research conducted by IBM™.
  • the Google™ translation engine makes use of a parallel corpus to mathematically (i.e. statistically) generate a set of rules between two languages. Given an input message in the input language, the Google™ translation engine 110 attempts to return the "most likely" output message in the target language.
  • Front-end 106 comprises hardware and software to facilitate interaction between user 102 and translation engine 110 and to improve the performance of translation engine 110.
  • front end 106 comprises a communications interface 104 and a front end controller 108.
  • Front end controller 108 may comprise one or more programmable processor(s) which may include, without limitation, embedded microprocessors, dedicated computers, groups of data processors or the like. Some functions of front end controller 108 may be implemented in software, while others may be implemented with specific hardware devices. The operation of front end controller 108 may be governed by appropriate firmware/code residing and executing therein, as is well known in the art. Front end controller 108 may comprise memory or have access to external memory. In one particular embodiment, front end controller 108 is embodied by a computer, although this is not necessary, as front end controller 108 may be implemented in an embedded architecture or some other control unit specific to system 100. Front end controller 108 may comprise or may otherwise be connected to other interface components (not specifically shown) which may be used to interact with any of the other components of system 100.
  • Interface 104 facilitates communication between user 102 and front end controller 108 and may be embodied in a wide variety of formats which depend to some degree on the implementation of front end 106.
  • front end 106 may be implemented over the internet, in which case interface 104 may comprise a user computer and front end controller 108 may comprise a remote server.
  • front end 106 may be implemented on a user computer, in which case interface 104 may comprise some input/output device of the user computer and front end controller 108 may comprise the processor(s) of the user computer.
  • front end 106 may be implemented on a cellular telephone network, in which case interface 104 may comprise a cellular telephone and front end controller 108 may comprise a remote server. Additionally or alternatively, interface 104 may comprise the input/output components of a cellular phone and front end controller 108 may comprise an embedded processor on the cellular telephone.
  • Front end controller 108 operates software (not explicitly shown).
  • Figure 1 schematically illustrates a number of the functional components of this software in accordance with an embodiment of the invention. As described in more particular detail below, the operation of this software may be used to improve the performance of translation engine 110. In addition, the operation of this software on front end controller 108 may control: the operation of front end 106; the interaction between user 102 and front end controller 108; and/or the interaction between front end controller 108 and translation engine 110.
  • FIG 1 schematically illustrates a method 112 for improving the performance of translation engine 110 using front end 106 according to an embodiment of the invention.
  • Method 112 commences when user 102 inputs some information to be translated into interface 104 (not explicitly shown in Figure 1).
  • the information to be translated is referred to herein as the "message".
  • the information input by user 102 into interface 104 will comprise text information. This is not necessary.
  • user 102 may input information orally or by some other communication technique, such as Braille or the like.
  • user 102 may also input the input and target languages and, optionally, one or more contextual parameter(s).
  • translation system 100 is only suitable for translation between a particular pair of languages, in which case, the input and target languages may not be required.
  • the message to be translated, together with any parameters input by user 102, may be referred to herein as the "message set".
  • Interface 104 converts the message set input by user 102 into a format suitable for use by the rest of front end 106. Typically, although not necessarily, this format will be a text format. In circumstances where user 102 inputs a text-based message set, interface 104 may comprise a completely text-based interface. This is not necessary. In some embodiments, interface 104 may comprise suitable voice-recognition software which allows user 102 to speak into interface 104 and converts this speech into a suitable message set (e.g. text) for interpretation by the rest of front end 106.
  • the message set is then passed from interface 104 into format checking block 114.
  • Format checking block 114 checks to make sure that the message set is in a format that can be handled by front end 106 and/or translation engine 110.
  • translation engine 110 may only be suitable for certain language pairs and may not handle other languages, in which case format checking block 114 may determine that the message set is not in a suitable format.
  • user 102 may have omitted to input the target language in the message set, in which case format checking block 114 may determine that the input message set is missing a target language parameter.
  • the length of the message may be limited to a certain number of characters and format checking block 114 may detect that the length of the message is too long. It will be appreciated that limiting the length of the message may be useful in applications where computational resources are scarce.
  • if format checking block 114 determines that the message set is improperly formatted, then an error flag (not explicitly shown) is set and method 112 proceeds on branch 116 to response message block 118.
  • Response message block 118 detects the error flag and outputs a suitable error indication to user 102 over interface 104.
  • format checking block 114 determines the particular type of error and sets a particular error flag from among a group of possible error flags. In such cases, response message block 118 can detect the particular type of error by detecting the particular error flag and output an error indication that is particular to that particular type of error.
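  • The format checks and type-specific error flags described above might be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: `FormatError`, `check_format`, `error_indication`, the 160-character limit and the list of supported language pairs are all hypothetical.

```python
from enum import Enum, auto
from typing import Optional, Set, Tuple


class FormatError(Enum):
    """One flag per error type, so the response block can tailor its message."""
    MISSING_TARGET_LANGUAGE = auto()
    UNSUPPORTED_LANGUAGE_PAIR = auto()
    MESSAGE_TOO_LONG = auto()


# Assumed limits, chosen only for illustration.
MAX_MESSAGE_CHARS = 160
SUPPORTED_PAIRS: Set[Tuple[str, str]] = {
    ("English", "French"),
    ("English", "Chinese"),
}

ERROR_TEXT = {
    FormatError.MISSING_TARGET_LANGUAGE: "Please specify a target language.",
    FormatError.UNSUPPORTED_LANGUAGE_PAIR: "This language pair is not supported.",
    FormatError.MESSAGE_TOO_LONG: "The message is too long.",
}


def check_format(
    message: str,
    input_language: Optional[str],
    target_language: Optional[str],
) -> Optional[FormatError]:
    """Return an error flag, or None when the message set is well formed."""
    if not target_language:
        return FormatError.MISSING_TARGET_LANGUAGE
    if (input_language, target_language) not in SUPPORTED_PAIRS:
        return FormatError.UNSUPPORTED_LANGUAGE_PAIR
    if len(message) > MAX_MESSAGE_CHARS:
        return FormatError.MESSAGE_TOO_LONG
    return None


def error_indication(flag: FormatError) -> str:
    """Response-block side: map a specific flag to a specific indication."""
    return ERROR_TEXT[flag]


flag = check_format("Hello", "English", None)
if flag is not None:
    print(error_indication(flag))
```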
  • lookup lexicon block 120 may comprise or otherwise have access to a database of commonly used linguistic phrases in the input language and their corresponding meanings in the target language. This database can be populated in any suitable manner. The contents of this database may depend on the input language and the types of common phrases in the input language. The lookup lexicon database may also be regionally specific. For example, commonly used English phrases may be different in Texas than they are in Scotland. If translation engine 110 makes use of a parallel corpus, then the lookup lexicon database is preferably significantly smaller than the parallel corpus used by translation engine 110. The lookup lexicon database may also be focused on a particular type of translation. For example, a lookup lexicon may focus on typical mobile phone SMS text messages.
  • the message is checked against the database in lookup lexicon block 120 to determine if the message (or part of the message) matches one of the well known phrases in the database. If the message matches one of the phrases in the database of lookup lexicon block 120, then the corresponding translation is sent to response message block 118 on branch 122. If the translated output message arrives at response message block 118 along branch 122, then response message block 118 formats the output message in the target language and outputs the translated message to user 102 over interface 104.
  • the message need not be sent to (or translated by) translation engine 110. It will be appreciated that this obviation of translation engine 110 may save considerable computational resources.
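  • A lookup lexicon of this kind can be pictured as a small phrase table consulted before the engine. The sketch below is illustrative only: the entries, the `normalize` step (lower-casing and whitespace collapsing) and the function names are assumptions, and it covers only whole-message matches, not the partial matches the text also allows.

```python
from typing import Dict, Optional

# Hypothetical English-to-French entries; a real database would be far larger.
LEXICON: Dict[str, str] = {
    "see you later": "À plus tard",
    "thank you": "Merci",
}


def normalize(message: str) -> str:
    """Assumed normalization: lower-case and collapse runs of whitespace."""
    return " ".join(message.lower().split())


def lookup(message: str, lexicon: Dict[str, str]) -> Optional[str]:
    """Return the stored translation, or None when the engine is still needed."""
    return lexicon.get(normalize(message))


print(lookup("See  you later", LEXICON))         # a hit: the engine can be skipped
print(lookup("Where is the station?", LEXICON))  # a miss: None
```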
  • segmenter block 126 is optional and is typically used in circumstances where the input language is an Asian language (e.g. Chinese) or some other language where there are no spaces between individual words. Segmenter block 126 may operate to parse the input message into parts (e.g. words or phrases) which simplify the subsequent translation. In one particular embodiment, segmenter block 126 operates according to a maximal matching process which attempts to identify (e.g. match) the longest words/phrases possible within the message in order to parse the message into words/phrases.
  • FIG. 2 schematically depicts a maximal matching process 226 according to a particular embodiment of the invention.
  • Maximal matching process 226 may have access to a relatively large dictionary/database 202 of words and phrases in the input language (e.g. Chinese).
  • maximal matching process 226 starts in block 204, where it checks the message for the presence of the character sequence corresponding to the largest word/phrase in database 202.
  • maximal matching process 226 evaluates whether there is a match between the characters of the input message and the largest word/phrase in database 202. If there is such a match (block 206 YES output), then method 226 proceeds to block 214, where the characters corresponding to the matched word/phrase are parsed from the rest of the message. If the block 206 inquiry is negative (i.e. there is no match), then method 226 proceeds to block 208.
  • Block 208 involves checking the message for the presence of the character sequence corresponding to the next largest word/phrase in database 202.
  • maximal matching process 226 evaluates whether there is a match between the characters of the input message and the next largest word/phrase in database 202. If there is such a match (block 210 YES output), then method 226 proceeds to block 214, where the characters corresponding to the matched word/phrase are parsed from the rest of the message.
  • Block 212 involves evaluating end criteria for the maximal matching process. If the block 212 end criteria indicate that maximal matching process 226 should end (block 212 YES output), then method 226 proceeds to block 216 and ends. If, on the other hand, the block 212 end criteria indicate that maximal matching process 226 should continue (block 212 NO output), then method 226 loops back to block 208. There may be a variety of suitable end criteria to evaluate in block 212.
  • Non-limiting examples of end criteria which may be used in block 212 include any one or more of: the entirety of the message (or some threshold percentage of the message characters) being recognized in block 204; reaching the smallest word in input language database 202; and recognizing a certain threshold number of words/phrases or characters.
  • Maximal matching process 226 may be implemented by segmenter 126.
  • segmenter 126 parses known word(s)/phrase(s) from the rest of the message to segment the message into words/phrases (e.g. by introducing spaces between words and/or phrases).
  • database 202 incorporates some of the well known phrases used by the database of lookup lexicon block 120, such that entire well known phrases may be parsed from the message.
  • the effectiveness of segmenter block 126 to parse the input message may depend on the size of dictionary 202 in the input language, but is generally independent of the target language.
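  • The loop of Figure 2 (try the largest dictionary entry, then the next largest, parsing out each match until an end criterion is met) might be sketched as below. This is a simplified reading of the description rather than the patent's code: `maximal_match` is a hypothetical name, the only end criteria implemented are "whole message recognized" and "dictionary exhausted", and characters that never match are returned as unsegmented runs.

```python
from typing import Iterable, List, Tuple


def maximal_match(message: str, dictionary: Iterable[str]) -> List[str]:
    """Segment `message` by trying dictionary entries from longest to shortest."""
    # Each piece is (text, parsed); only unparsed pieces are searched further.
    pieces: List[Tuple[str, bool]] = [(message, False)]
    # Longest first; ties broken alphabetically so the result is deterministic.
    for entry in sorted(set(dictionary), key=lambda w: (-len(w), w)):
        if not entry:
            continue
        next_pieces: List[Tuple[str, bool]] = []
        for text, parsed in pieces:
            if parsed or entry not in text:
                next_pieces.append((text, parsed))
                continue
            chunks = text.split(entry)
            for i, chunk in enumerate(chunks):
                if chunk:
                    next_pieces.append((chunk, False))
                if i < len(chunks) - 1:
                    next_pieces.append((entry, True))  # parse out the match
        pieces = next_pieces
        # End criterion: the entire message has been recognized.
        if all(parsed for _, parsed in pieces):
            break
    return [text for text, _ in pieces]


words = maximal_match("我喜欢北京大学", ["北京大学", "北京", "大学", "喜欢", "我"])
print(" ".join(words))  # 我 喜欢 北京大学
```

  • Because the four-character entry is tried before its two-character substrings, the longer phrase is kept whole, which is the behaviour the maximal matching description aims for.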
  • in segmenter block 126, the strict maximal matching process 226 of Figure 2 may be modified to consider certain patterns of words. Taking such word patterns into account may help to address issues with proper nouns which may otherwise be difficult for maximal matching process 226 to handle. Segmenter block 126 may also incorporate a table of words/phrases that are "bound" and possibly a list of words/phrases that are "unbound". Segmenter block 126 could recognize that a "bound" Chinese symbol (for example) could not be further segmented such that the bound symbol could be ruled out for further segmentation. Segmenter 126 may also modify the order of the words/phrases checked in maximal matching process 226, such that a different order (i.e.
  • words/phrases of database 202 may be categorized into a number of groups corresponding to their level of commonness, and each such commonness level can then be searched in order, with the words/phrases within each commonness level checked in size order.
  • segmenter block 126 may comprise or otherwise have access to a statistical algorithm based on a corpus corresponding to the English language.
  • a statistical segmenter block 126 may be trained on a corpus of manually segmented words/phrases. Training would initialize the statistical segmenter block 126 by reading the corpus, calculating the probability that a series of characters, say between one to three characters, would terminate or begin a new word.
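  • One way to picture the training step described above is to count, over a manually segmented corpus, how often a word boundary falls between each pair of adjacent characters. The sketch below makes several assumptions that are not in the text: it uses a context of one character on each side (the text mentions one to three), a fixed 0.5 threshold, and it treats unseen character pairs as boundaries. The names `train` and `segment` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def train(corpus: Sequence[Sequence[str]]) -> Dict[Pair, float]:
    """Estimate P(boundary between a and b) from manually segmented sentences."""
    boundary: Counter = Counter()
    total: Counter = Counter()
    for words in corpus:
        text = "".join(words)
        cuts = set()
        position = 0
        for word in words[:-1]:
            position += len(word)
            cuts.add(position)  # index in `text` where a new word begins
        for i in range(1, len(text)):
            pair = (text[i - 1], text[i])
            total[pair] += 1
            if i in cuts:
                boundary[pair] += 1
    return {pair: boundary[pair] / total[pair] for pair in total}


def segment(
    text: str,
    model: Dict[Pair, float],
    threshold: float = 0.5,
    unseen: float = 1.0,
) -> List[str]:
    """Start a new word wherever the boundary probability exceeds the threshold."""
    if not text:
        return []
    words = [text[0]]
    for previous, current in zip(text, text[1:]):
        if model.get((previous, current), unseen) > threshold:
            words.append(current)
        else:
            words[-1] += current
    return words


model = train([["我", "喜欢", "北京"], ["他", "喜欢", "上海"]])
print(segment("我喜欢上海", model))  # ['我', '喜欢', '上海']
```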
  • segmenter block 126 may comprise or otherwise have access to a rule-based segmenter.
  • a rule-based segmenter block 126 may comprise a set of rules for which segmentation could always occur.
  • Lexical pre-processor block 130 further breaks down the message into phrases or words, based on a set of rules that specify a word, a number, punctuation, etc.
  • lexical pre-processor block 130 attempts to address syntax errors (or other errors) in the message as input by user 102.
  • lexical pre-processor block 130 makes use of processes based on closest match to data in lookup tables and/or databases.
  • user errors or even acceptable forms of speech mangling
  • lexical pre-processor block 130 include:
  • lexical pre-processor block 130 may expand and formalize common abbreviated forms into forms with a higher level of comprehensibility.
  • formalizations may include:
  • abbreviation and idiom expansion block 132 may comprise or otherwise have access to a database of commonly used abbreviations, slang, and idiomatic phrases in the input language and their corresponding expansions/meanings (also in the input language).
  • the database of abbreviation and idiom expansion block 132 is used to find exact matches with portions of the message and matching portions of the message are expanded and/or replaced with a more literal meaning.
  • Abbreviations may be expanded - e.g. lol → laughing out loud or brb → be right back;
  • Idioms may be interpreted - e.g. under the weather → sick.
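  • The exact-match expansion described above could look roughly like this. It is a sketch under assumptions: the table holds only the three examples from the text, keys are stored in lower case, matching is case-insensitive on word boundaries, and longer entries are tried first so that a multi-word idiom wins over any shorter entry it contains. `expand` is a hypothetical name.

```python
import re
from typing import Dict

# Input-language abbreviations and idioms mapped to more literal wording.
EXPANSIONS: Dict[str, str] = {
    "lol": "laughing out loud",
    "brb": "be right back",
    "under the weather": "sick",
}


def expand(message: str, table: Dict[str, str]) -> str:
    """Replace every exact, whole-word match with its expansion."""
    if not table:
        return message
    keys = sorted(table, key=len, reverse=True)  # longest entries first
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: table[match.group(0).lower()], message)


print(expand("brb, feeling under the weather lol", EXPANSIONS))
# be right back, feeling sick laughing out loud
```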
  • Spell checking block 134 scans the message as modified for misspelled words and replaces the misspelled words in the message with correctly spelled words.
  • Spell checking block 134 may be implemented using commercially available spell checking software. Such commercially available spell-checking software may make use of a double-metaphone process for example. Spell checking block 134 may be principally applicable where the input message is in a Latin-based language, although it is envisaged that other spell checking systems may be devised (or may already exist) for other types of languages.
  • Spell checking block 134 preferably contains logic for making decisions about when to replace words which may be misspelled in order to avoid mistakenly replacing words which are not contained in the spell checking dictionary. Such logic may be based on grammatical constructs which are particular to a given input language. For example, in the case of English, spell checking block 134 may determine not to replace proper nouns (even though they may be absent from the spell checking dictionary) on the basis that conventional nouns are normally preceded by an article or the like whereas proper nouns are not. In some embodiments, such logic may be based on the number of letters that differ between a potentially misspelled word and a potential replacement word.
  • Examples include mixed order of a certain number of letters, omission of a certain number of letters, inclusion of a certain number of extra letters, and a certain number of incorrect letters.
  • Still other examples of spell checking logic include letters that are closely spaced on the keyboard (which may have a higher probability of being improperly spelled) and vowels and similarly sounding letters (which may also have a higher probability of error).
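  • The replacement logic described above (a limit on the number of differing letters, plus a guard for proper nouns) might be sketched as follows. The sketch is not the double-metaphone process mentioned in the text; it swaps in an optimal string alignment distance, which counts swapped, omitted, extra and incorrect letters. The one-edit limit, the article list and the function names are assumptions, and the input is assumed to be whitespace-separated words without punctuation.

```python
from typing import List, Optional, Set

ARTICLES = {"a", "an", "the"}


def osa_distance(a: str, b: str) -> int:
    """Edits (insert, delete, substitute, adjacent swap) needed to turn a into b."""
    prev2: List[int] = []
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # mixed letter order
        prev2, prev = prev, cur
    return prev[len(b)]


def correct_word(
    word: str, previous: Optional[str], dictionary: Set[str], max_edits: int = 1
) -> str:
    """Replace `word` only when a close dictionary entry exists."""
    lower = word.lower()
    if lower in dictionary:
        return word
    # Proper-noun guard: capitalized and not preceded by an article.
    if word[:1].isupper() and (previous is None or previous.lower() not in ARTICLES):
        return word
    best = min(dictionary, key=lambda c: (osa_distance(lower, c), c), default=None)
    if best is not None and osa_distance(lower, best) <= max_edits:
        return best
    return word


def correct_message(message: str, dictionary: Set[str]) -> str:
    corrected: List[str] = []
    previous: Optional[str] = None
    for word in message.split():
        corrected.append(correct_word(word, previous, dictionary))
        previous = word
    return " ".join(corrected)


vocabulary = {"i", "saw", "the", "movie", "with"}  # lower-case entries assumed
print(correct_message("I saw teh movie with Bill", vocabulary))
# I saw the movie with Bill
```

  • Note that this simple guard also leaves a misspelled, capitalized first word untouched, since nothing precedes it; a fuller implementation would need a separate rule for sentence-initial words.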
  • grammar checking block 136 processes each element of the input message (as modified) and parses the input message into its grammatical elements. Once grammatically parsed, grammar checking block 136 may correct grammatical errors in the message (if detected) and may attempt to restructure the message into one or more formats that will be more easily interpreted by translation engine 110. For example, some translation engines (e.g. the SYSTRAN™ translation engine) have problems distinguishing proper nouns. This problem is particularly relevant where the proper noun has a literal definition in the input language or in the target language (e.g. the proper noun "Bill" has a literal definition in the English language). Grammar checking block 136 may recognize that the word "Bill" may be a proper noun that may be the subject (e.g. "Bill was here") or object (e.g. "I took Bill to the movie") of a message, for example.
  • the word "Bill" (or any other problematic word) in the message could be replaced with one or more "stand-in" words (in this case a stand-in noun) and a suitable flag could be set to indicate the presence of the stand-in word(s) in the message.
  • the message containing the "stand-in" word(s) could then be further processed by the remainder of system 100 (e.g. static look up lexicon 120 and translation engine 110).
  • the stand-in word(s) may be selected to minimize translation difficulties for translation engine 110.
  • the use of stand-in words may be selected on the basis of the particular implementation of translation engine 110. As explained in more detail below, in some situations (e.g.
  • the word(s) replaced by stand-in word(s) are proper nouns
  • the proper nouns are reinserted into the message in the place of the stand-in word(s) after translation of the message by translation engine 110.
  • the word(s) replaced by stand-in word(s) may be separately translated as secondary messages and then the separately translated secondary messages may be reinserted (after translation by translation engine 110) into the primary messages.
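  • The stand-in mechanism for proper nouns can be illustrated with a toy round trip: swap the problematic word for a stand-in, translate, then put the original back. Everything below is hypothetical, including the `ZZ0`-style stand-in tokens (the text envisages real stand-in nouns chosen to suit the engine), the assumption that the engine passes stand-ins through unchanged, and the word-for-word `toy_engine`.

```python
from typing import Dict, List, Tuple

# Toy word-for-word engine that wrongly treats "Bill" as a common noun.
TOY_TABLE = {"Bill": "facture", "was": "était", "here": "ici"}


def toy_engine(message: str) -> str:
    return " ".join(TOY_TABLE.get(word, word) for word in message.split())


def insert_stand_ins(
    message: str, problem_words: List[str]
) -> Tuple[str, Dict[str, str]]:
    """Replace each problem word with a stand-in; the mapping doubles as the flag."""
    mapping: Dict[str, str] = {}
    tokens: List[str] = []
    for token in message.split():
        if token in problem_words:
            stand_in = f"ZZ{len(mapping)}"
            mapping[stand_in] = token
            tokens.append(stand_in)
        else:
            tokens.append(token)
    return " ".join(tokens), mapping


def restore(translated: str, mapping: Dict[str, str]) -> str:
    """Reinsert the original words in place of the stand-ins after translation."""
    return " ".join(mapping.get(token, token) for token in translated.split())


naive = toy_engine("Bill was here")
prepared, stand_ins = insert_stand_ins("Bill was here", ["Bill"])
protected = restore(toy_engine(prepared), stand_ins)
print(naive)      # facture était ici
print(protected)  # Bill était ici
```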
  • Grammar checking block 136 may also attempt to simplify the grammatical structure of the message. For example, grammar checking block 136 may attempt to remove unnecessary parts of the message which may result in conflicts or other difficulties for translation engine 110. Non-limiting examples of the types of grammatical corrections which may be made by grammar checking block 136 include:
  • lookup lexicon block 120 performs a substantially similar function to lookup lexicon block 120 discussed above, except that the second time, lookup lexicon block 120 acts on the message as modified by segmenter block 126, lexical preprocessor 130, abbreviation and idiom expansion block 132, spell checking block 134 and grammar checking block 136 and on the secondary message (if present). If the modified message matches one of the phrases in the database of lookup lexicon block 120, then the corresponding translation is sent to response message block 118 on branch 142.
  • response message block 118 formats the output message in the target language and outputs the translated message to user 102 over interface 104.
  • the message need not be sent to (or translated by) translation engine 110. It will be appreciated that this obviation of translation engine 110 may save considerable computational resources.
  • translation engine 110 receives the modified message and outputs a translated message (on branch 138) to response message block 118.
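  • Taken together, the flow through method 112 (first lexicon lookup, pre-processing blocks, second lexicon lookup, then the engine) might be orchestrated as in the sketch below. It is a simplification under assumptions: each block is modelled as a plain string-to-string function, the lexicon is an exact-match dictionary, and `run_front_end` and the toy stages are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Stage = Callable[[str], str]


def run_front_end(
    message: str,
    lexicon: Dict[str, str],
    stages: Sequence[Stage],
    engine: Callable[[str], str],
) -> str:
    """Return a translation, calling the engine only when both lookups miss."""
    hit = lexicon.get(message)
    if hit is not None:
        return hit  # first lookup: answered on branch 122
    for stage in stages:  # segmenter, pre-processor, expansion, spelling, grammar
        message = stage(message)
    hit = lexicon.get(message)
    if hit is not None:
        return hit  # second lookup on the modified message: branch 142
    return engine(message)  # otherwise the engine translates the modified message


engine_calls: List[str] = []


def recording_engine(message: str) -> str:
    """Toy engine that records its inputs so skipped calls are visible."""
    engine_calls.append(message)
    return f"<engine:{message}>"


toy_lexicon = {"thank you": "Merci"}
toy_stages: List[Stage] = [str.strip, lambda m: m.replace("thx", "thank you")]

print(run_front_end("thx ", toy_lexicon, toy_stages, recording_engine))
print(run_front_end("good morning", toy_lexicon, toy_stages, recording_engine))
print(engine_calls)
```

  • In this run the first message is resolved by the second lookup after expansion, so only the second message reaches the engine.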
  • translation engine 110 may comprise a commercially available translation engine with a suitable API, such as the Google™ translation engine (see http://translate.google.com/translate_t) or the Systransoft™ translation engine (see http://www.systransoft.com/).
  • Upon receipt of the translated message on branch 138, response message block 118 outputs the translated message to user 102 through interface 104.
  • the translated message received at response message block 118 is received along with a flag indicating that one or more word(s) in the translated message have been replaced by stand-in word(s) or that there is a separately translated secondary message that is meant to accompany the primary message. If response message block 118 receives such a flag, then word(s) replaced by the stand-in word(s) may be reinserted into the translated message by response message block 118 prior to outputting the translated message over interface 104 and/or the separately translated secondary message may be reinserted into the corresponding locations of the primary translated message prior to outputting the translated message over interface 104.
  • Certain implementations of the invention comprise computer processors which execute software instructions which cause the processors to perform a method of the invention.
  • one or more processors in a dual modulation display system may implement data processing steps in the methods described herein by executing software instructions retrieved from a program memory accessible to the processors.
  • the invention may also be provided in the form of a program product.
  • the program product may comprise any medium which carries a set of computer-readable instructions which, when executed by a data processor, cause the data processor to execute a method of the invention.
  • Program products according to the invention may be in any of a wide variety of forms.
  • the program product may comprise, for example, physical media such as magnetic data storage media including floppy diskettes, hard disk drives, optical data storage media including CD ROMs, DVDs, electronic data storage media including ROMs, flash RAM, or the like.
  • the instructions may be present on the program product in encrypted and/or compressed formats.
  • where a component (e.g. a software module, processor, assembly, device, circuit, etc.) is referred to above, reference to that component should be interpreted as including as equivalents of that component any component which performs the function of the described component (i.e. that is functionally equivalent), including components which are not structurally equivalent to the disclosed structure which performs the function in the illustrated exemplary embodiments of the invention.
  • translation engine 110 is implemented by a commercially available translation engine 110, such as the Google™ translation engine (see http://translate.google.com/translate_t) or the Systransoft™ translation engine (see http://www.systransoft.com/).
  • translation engine 110 may comprise a novel combination of a rule-based translation process and a statistical translation process. In such embodiments, translation engine 110 could commence by generating a rule based translation of the modified message received on branch 140.
  • Such a novel translation engine 110 could generate a number of possible output messages which would be possible translations of a particular input message. For instance, consider translation of the English phrase “Let's eat dogs” into Chinese. A rule-based translator may return both tm “, literally “eat dog”, and literally “eat hot
  • a rule-based translator may also return other phrases as well, which would be incorrect.
  • the novel translation engine would use statistical analysis to determine which of the translated messages outputted by the rule-based translation process is most likely to exist in the target language.
  • Generating this statistical analysis set may involve the use of a non-parallel corpus.
  • a non-parallel corpus is a corpus that would only need to be in the target language, without any "part of speech" tags or parallel human translations. This type of corpus is the most easily obtainable. Statistical analysis would then be applied to the message translated by the rule-based translation process.
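  • The combination described above, in which a rule-based process proposes several candidate translations and statistics from a target-language-only corpus pick the most plausible one, might be sketched with a small bigram model. This is one possible reading, not the patent's method: the add-one smoothed `BigramModel`, the toy English "target" corpus and the candidate lists are all assumptions, and scores are not length-normalized, so the sketch compares candidates of equal length.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


class BigramModel:
    """Add-one smoothed bigram model trained on target-language text only."""

    def __init__(self, sentences: Iterable[Sequence[str]]) -> None:
        self.contexts: Counter = Counter()
        self.bigrams: Counter = Counter()
        vocabulary = set()
        for words in sentences:
            padded = ["<s>", *words, "</s>"]
            vocabulary.update(padded)
            self.contexts.update(padded[:-1])
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocabulary_size = len(vocabulary)

    def log_probability(self, words: Sequence[str]) -> float:
        padded = ["<s>", *words, "</s>"]
        total = 0.0
        for left, right in zip(padded, padded[1:]):
            numerator = self.bigrams[(left, right)] + 1
            denominator = self.contexts[left] + self.vocabulary_size
            total += math.log(numerator / denominator)
        return total


def most_likely(candidates: List[List[str]], model: BigramModel) -> List[str]:
    """Pick the rule-based candidate the target-language model scores highest."""
    return max(candidates, key=model.log_probability)


# Toy monolingual corpus in the target language (no parallel translations).
target_corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]]
language_model = BigramModel(target_corpus)

# Hypothetical outputs of a rule-based translation step.
rule_based_candidates = [["cat", "the", "sat"], ["the", "cat", "sat"]]
print(most_likely(rule_based_candidates, language_model))  # ['the', 'cat', 'sat']
```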
PCT/CA2007/001004 2007-04-30 2007-06-07 Systems and methods for improving translation systems WO2008131509A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US91505907P 2007-04-30 2007-04-30
US60/915,059 2007-04-30

Publications (1)

Publication Number Publication Date
WO2008131509A1 true WO2008131509A1 (en) 2008-11-06

Family

ID=39925120

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2007/001004 WO2008131509A1 (en) 2007-04-30 2007-06-07 Systems and methods for improving translation systems

Country Status (1)

Country Link
WO (1) WO2008131509A1 (de)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
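CN106951416A (zh) * 2017-03-21 2017-07-14 成都星阵地科技有限公司 Multilingual instant translation system based on big data processing and manual intervention
CN109558600A (zh) * 2018-11-14 2019-04-02 北京字节跳动网络技术有限公司 Translation processing method and device
CN109657252A (zh) * 2018-12-25 2019-04-19 北京微播视界科技有限公司 Information processing method, device, electronic equipment and computer-readable storage medium
CN110502762A (zh) * 2019-08-27 2019-11-26 北京金山数字娱乐科技有限公司 Translation platform and management method thereof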

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000034890A2 (en) * 1998-12-10 2000-06-15 Global Information Research And Technologies, Llc Text translation system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BROWN R.D.: "Example-Based Machine Translation in the Pangloss System", PROCEEDINGS OF THE 16TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS (COLING-96), COPENHAGEN, DENMARK, 5 August 1996 (1996-08-05) - 9 August 1996 (1996-08-09), pages 169 - 174 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951416A (zh) * 2017-03-21 2017-07-14 成都星阵地科技有限公司 基于大数据处理及人工干预的多语言即时翻译系统
CN109558600A (zh) * 2018-11-14 2019-04-02 北京字节跳动网络技术有限公司 翻译处理方法及装置
CN109558600B (zh) * 2018-11-14 2023-06-30 抖音视界有限公司 翻译处理方法及装置
CN109657252A (zh) * 2018-12-25 2019-04-19 北京微播视界科技有限公司 信息处理方法、装置、电子设备及计算机可读存储介质
CN110502762A (zh) * 2019-08-27 2019-11-26 北京金山数字娱乐科技有限公司 一种翻译平台及其管理方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07719920

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07719920

Country of ref document: EP

Kind code of ref document: A1