CN103729346A - Method for dynamically generating mass language assets in multiple language industry standard formats - Google Patents


Info

Publication number
CN103729346A
Authority
CN
China
Prior art keywords
language
assets
multilingual
user
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210383201.5A
Other languages
Chinese (zh)
Other versions
CN103729346B (en)
Inventor
杜金林
朱懿
杜勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Translated By Mdt Infotech Ltd Shanghai
Original Assignee
SHANGHAI YONGJINYI INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI YONGJINYI INFORMATION TECHNOLOGY Co Ltd
Priority to CN201210383201.5A
Publication of CN103729346A
Application granted
Publication of CN103729346B
Expired - Fee Related
Anticipated expiration

Abstract

The invention provides a method for dynamically generating mass language assets in multilingual industry-standard formats. A parser developed for the purpose reads the content of corpora and term bases stored in XML-based standard formats such as TMX and TBX and imports it into a designated database. During import, database tables that automatically match and hold the same content in different language pairs are used to generate a multilingual database in which one source text is linked to multiple matching target-language texts. When a user works with the system, search results for the language pair the user specifies are automatically returned in the form of translation memory and presented to the end user in a specific format for reuse. When the multilingual database is extended or updated, the related content in all languages is updated automatically, so the user continues to obtain up-to-date translation memory content after the language assets are dynamically updated. Language assets stored in a text-based format are reused directly, the data is not easily damaged or lost, and the safety of the assets is improved.

Description

Method for dynamically generating mass language assets in multilingual industry-standard formats
Technical field
The present invention relates to a method for dynamically generating mass language assets in multilingual industry-standard formats. It is intended for the development and application of the TM module of CAT software or of multilingual translation systems, and belongs to the technical field of multilingual machine translation.
Background art
TM (Translation Memory) is one of the most widely adopted technologies in the field of computer-aided translation (CAT). TM technology can significantly improve translation efficiency and keep content consistent. Because many kinds of CAT software have been built on TM technology and their TM storage formats differ, an open standard called TMX (Translation Memory eXchange) has been successfully applied in the localization and translation industry to ease the exchange of TM data between translators and CAT tools.
In software and website localization, the content files to be processed are highly repetitive. Content is also updated frequently, and each update builds on the previous version, adding a small amount of new content or making minor corrections to existing content. It is therefore necessary to make full use of what was already translated in earlier versions rather than translate it again.
TM technology reuses this translated content effectively. It improves translation efficiency by combining segments with a TM database. The database uses the translation unit (TU) as its unit of data, linking each source-language sentence to the corresponding target-language sentence. When a translator works in a TM-based CAT tool, the tool continually stores newly translated content in the TM database. For content to be translated (a word, phrase, sentence, or paragraph), the tool first searches the TM database for matching content and automatically offers the closest translation, which the translator can easily insert.
As translated content accumulates, the TM database keeps growing. Translators no longer need to retranslate identical content and can concentrate on the new content that needs translation, while the accuracy of the TM keeps translations of identical content consistent. This is the goal that TM technology pursues.
However, as economic globalization deepens, the software and website localization and globalization industry is developing rapidly, and the number of localization tools and TM tools built on TM technology keeps growing. These tools come from different vendors, and each vendor has its own file storage format. Moreover, a localization service provider often serves different clients, or different projects of the same client, and those clients and projects require different localization tools. Because the file data of these tools lacks an exchangeable standard format, the TM resources accumulated in the past are difficult to reuse. A unified standard format for TM databases is clearly needed.
In summary, as the software and website localization and globalization industry develops rapidly, reusing existing language assets (TM and terminology resources) stored in TMX and TBX formats helps raise output and quality and reduce costs. A TMX or TBX file usually covers a single language pair, such as English to Chinese or English to German. Industry technology still supports only this single-language-pair form, and no technology yet generates multilingual language pairs automatically from the identical content of existing single-language-pair files.
Shortcomings of the prior art: 1) the existing storage architecture for language assets is two-dimensional and unidirectional, so the correspondences between a source language and each of its target languages cannot be connected; 2) multilingual (multidimensional), multidirectional language pairs cannot be obtained automatically from the identical content in a mass of single-language-pair TMX or TBX files, which wastes resources considerably, and obtaining them manually would incur enormous labor costs.
Summary of the invention
To solve the above problems, the present invention aims to provide a method for dynamically generating mass language assets in multilingual industry-standard formats. The technical solution of the invention is as follows:
A method for dynamically generating mass language assets in multilingual industry-standard formats, comprising the following steps:
1. A parser developed for the purpose reads the content of corpora and term bases stored in XML-based standard formats such as TMX and TBX and imports it into a designated database;
2. During import, database tables that automatically match and hold the same content in different language pairs are used to generate a multilingual database in which one source text is linked to multiple matching target-language texts;
3. When a user works with the system, search results for the language pair the user specifies are automatically returned in the form of translation memory and presented to the end user in a specific format for reuse;
4. When the multilingual database is extended or updated, the related multilingual content is updated automatically, ensuring that the user continues to obtain up-to-date translation memory content after the language assets are dynamically updated.
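The reading part of step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser the patent describes: it assumes TMX input whose `<tu>` elements hold `<tuv xml:lang="...">` children with a `<seg>` text, and the name `parse_tmx` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def parse_tmx(tmx_text):
    """Return one {language code: segment text} dict per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Fall back to a plain "lang" attribute if xml:lang is absent.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text around inline markup in <seg>.
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        if unit:
            units.append(unit)
    return units


# Hypothetical single-language-pair file used only for this illustration.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en-us" datatype="plaintext" segtype="sentence"
          adminlang="en-us" creationtool="demo" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en-us"><seg>Computer-assisted translation</seg></tuv>
      <tuv xml:lang="fr-fr"><seg>Traduction assistée par ordinateur</seg></tuv>
    </tu>
  </body>
</tmx>"""

units = parse_tmx(SAMPLE_TMX)
print(units)
```

Each returned dictionary corresponds to one translation unit and could then be written to whatever database the import step targets.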
As a preferred solution, the above method further comprises:
A corpus parsing module, which parses the industry-standard TMX and TBX formats, reads the corpus information (source language, target language, and so on) into memory, and converts it into binary objects;
A corpus matching module, which matches corpus entries on the intermediate language and stores each target-language entry at the correct position in the multilingual corpus matrix;
A corpus generation module, which reads corpus information from the multilingual corpus matrix and outputs it as TMX or TBX files according to the industry standards, for archiving and backup or for use in other TMX- or TBX-compatible tools.
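One way to picture the multilingual corpus matrix and the matching module is a table whose rows are keyed by the intermediate-language text and whose columns are language codes. The sketch below is an assumption-laden illustration: the class name `CorpusMatrix`, its methods, and the use of exact string equality as the matching criterion are all hypothetical, since the source does not specify how matching is performed.

```python
from collections import defaultdict


class CorpusMatrix:
    """Rows keyed by intermediate-language text; columns are language codes."""

    def __init__(self, pivot_lang):
        self.pivot_lang = pivot_lang
        self.rows = defaultdict(dict)

    def add_pair(self, pivot_text, target_lang, target_text):
        # Matching step: entries with identical pivot text share one row.
        row = self.rows[pivot_text]
        row[self.pivot_lang] = pivot_text
        row[target_lang] = target_text

    def pairs(self, source_lang, target_lang):
        """All (source, target) segments available for the requested pair."""
        return [
            (row[source_lang], row[target_lang])
            for row in self.rows.values()
            if source_lang in row and target_lang in row
        ]


matrix = CorpusMatrix("en-us")
# Two single-language-pair entries that share the same English text.
matrix.add_pair("Computer-assisted translation", "fr-fr",
                "Traduction assistée par ordinateur")
matrix.add_pair("Computer-assisted translation", "de-de",
                "Computerunterstützte Übersetzung")

# A French-German pair is now available although it was never imported.
print(matrix.pairs("fr-fr", "de-de"))
```

Keying rows on the shared text keeps the derived pairs implicit: nothing extra is stored for French-German, it simply falls out of both entries landing in the same row.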
The beneficial effect of the method is as follows. The language assets held in the multilingual database are physically independent of the language assets held in TMX and TBX form, so even if the multilingual database is deleted, the original language assets are unaffected, which safeguards the assets. In addition, the assets are kept on the storage medium as text-based XML (TMX and TBX are both XML-based), unlike the binary database files that CAT tools read and write frequently, so they are protected and are not lost accidentally.
Processing the two industry-standard formats TMX and TBX directly brings the following benefits:
1) Language assets stored in a text-based format are reused directly; the data is not easily damaged or lost, which improves asset safety.
2) No manual format conversion is needed; industry-standard files are imported automatically, enabling reuse of the language assets.
3) Multilingual, multidimensional language pairs and term pairs are obtained automatically. For example, starting from a corpus with 3 language pairs, the invention yields 9 additional language pairs, an extra gain in assets. This maximizes the value of the language assets, keeps language usage consistent as an enterprise globalizes and internationalizes its products, directly improves efficiency and quality, saves substantial multilingual production costs, and shortens the time an enterprise needs to roll out its products globally.
4) High-speed querying and reuse of mass multilingual assets is supported.
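The figure of 9 additional pairs in item 3 is consistent with counting directed (source, target) pairs, assuming the 3 original pairs share one source language so that 4 languages are involved in total. The source does not spell out this calculation, so the reading below is an assumption, and the language codes are illustrative.

```python
from itertools import permutations

languages = ["en-us", "zh-cn", "fr-fr", "de-de"]
# The three original single-language-pair assets, all with English as source.
original = {("en-us", "zh-cn"), ("en-us", "fr-fr"), ("en-us", "de-de")}

# Every ordered (source, target) combination of two distinct languages.
all_directed = set(permutations(languages, 2))
additional = all_directed - original

print(len(all_directed), len(additional))
```

With 4 languages there are 4 × 3 = 12 directed pairs, so 12 − 3 = 9 are new.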
Brief description of the drawings
Fig. 1. System block diagram of the method for dynamically generating mass language assets in multilingual industry-standard formats.
Specific embodiments
Abbreviations and key term definitions:
MTMM: Multilingual Translation Memory Matrix
TM: Translation Memory
TU: Translation Unit
TMX: Translation Memory eXchange
TBX: Term Base eXchange
CAT: Computer Aided Translation
LISA: Localization Industry Standards Association
OSCAR: Open Standards for Container/Content Allowing Re-use
The specific embodiment is as follows:
The method for dynamically generating mass language assets in multilingual industry-standard formats comprises the following steps:
1) A parser developed for the purpose reads the content of corpora and term bases stored in XML-based standard formats such as TMX and TBX and imports it into a designated database;
2) During import, database tables that automatically match and hold the same content in different language pairs are used to generate a multilingual database in which one source text is linked to multiple matching target-language texts;
3) When a user works with the system, search results for the language pair the user specifies are automatically returned in the form of translation memory and presented to the end user in a specific format for reuse;
4) When the multilingual database is extended or updated, the related multilingual content is updated automatically, ensuring that the user continues to obtain up-to-date translation memory content after the language assets are dynamically updated.
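Steps 2) to 4) can be illustrated with a small relational sketch. Everything here is an assumption made for illustration: the single `segment` table, the `unit_id` column that groups translations of the same content, the helper names `upsert` and `lookup`, and exact-match lookup (real TM tools typically also offer fuzzy matching). The source does not describe its schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segment ("
    " unit_id INTEGER NOT NULL,"  # groups translations of the same content
    " lang TEXT NOT NULL,"
    " text TEXT NOT NULL,"
    " PRIMARY KEY (unit_id, lang))"
)


def upsert(unit_id, lang, text):
    # INSERT OR REPLACE overwrites the row with the same (unit_id, lang) key.
    conn.execute(
        "INSERT OR REPLACE INTO segment (unit_id, lang, text) VALUES (?, ?, ?)",
        (unit_id, lang, text),
    )


def lookup(source_lang, target_lang, source_text):
    """Return target-language segments whose unit contains source_text."""
    rows = conn.execute(
        "SELECT t.text FROM segment AS s"
        " JOIN segment AS t ON t.unit_id = s.unit_id"
        " WHERE s.lang = ? AND s.text = ? AND t.lang = ?",
        (source_lang, source_text, target_lang),
    )
    return [row[0] for row in rows]


upsert(1, "en-us", "Computer-assisted translation")
upsert(1, "fr-fr", "Traduction assistée par ordinateur")
upsert(1, "de-de", "Computerunterstützte Übersetzung")

# Step 3: the user asks for French to German, a pair that was never imported.
before = lookup("fr-fr", "de-de", "Traduction assistée par ordinateur")

# Step 4: updating one language is visible to every pair that uses it.
upsert(1, "de-de", "Computergestützte Übersetzung")
after = lookup("fr-fr", "de-de", "Traduction assistée par ordinateur")
print(before, after)
```

Because every language version of a unit lives in the same table, an update to one row needs no per-pair bookkeeping; later lookups for any pair see the new text.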
Specifically, the method further comprises:
A corpus parsing module, which parses the industry-standard TMX and TBX formats, reads the corpus information (source language, target language, and so on) into memory, and converts it into binary objects;
A corpus matching module, which matches corpus entries on the intermediate language and stores each target-language entry at the correct position in the multilingual corpus matrix;
A corpus generation module, which reads corpus information from the multilingual corpus matrix and outputs it as TMX or TBX files according to the industry standards, for archiving and backup or for use in other TMX- or TBX-compatible tools.
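The generation module's output step might look like the sketch below, which writes one requested language pair as TMX. It is an illustration only: the function name `build_tmx`, the row representation (one dictionary of language code to text per matrix row), and the header values are assumptions, and the header attribute set follows the TMX 1.4 header as commonly documented rather than anything stated in the source.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(rows, source_lang, target_lang):
    """Serialize rows that contain both languages as a TMX document string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch", "creationtoolversion": "0",
        "segtype": "sentence", "o-tmf": "none", "adminlang": "en-us",
        "srclang": source_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for row in rows:
        # Skip rows that lack either side of the requested pair.
        if source_lang not in row or target_lang not in row:
            continue
        tu = ET.SubElement(body, "tu")
        for lang in (source_lang, target_lang):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = row[lang]
    return ET.tostring(tmx, encoding="unicode")


# Hypothetical matrix rows: the second lacks French and German.
rows = [
    {"en-us": "Computer-assisted translation",
     "fr-fr": "Traduction assistée par ordinateur",
     "de-de": "Computerunterstützte Übersetzung"},
    {"en-us": "Translation memory"},
]

output = build_tmx(rows, "fr-fr", "de-de")
print(output)
```

The resulting string can be written to a `.tmx` file for archiving or handed to another TMX-compatible tool.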
The language assets held in the multilingual database are physically independent of the language assets held in TMX and TBX form, so even if the multilingual database is deleted, the original language assets are unaffected, which safeguards the assets. In addition, the assets are kept on the storage medium as text-based XML (TMX and TBX are both XML-based), unlike the binary database files that CAT tools read and write frequently, so they are protected and are not lost accidentally.
Conceptual examples of the invention:
A. Translation memory (TMX) example:
In the usual case, TM content consists of two-dimensional, single language pairs:
English en-us: People's Republic of China is a permanent member of the United Nations Organization
Chinese zh-cn: the People's Republic of China (PRC) is the permanent member of the UN organizations
English en-us: People's Republic of China is a permanent member of the United Nations Organization
French fr-fr: République populaire de Chine est membre permanent de l'Organisation des Nations Unies
English en-us: People's Republic of China is a permanent member of the United Nations Organization
German de-de: Der Volksrepublik China ist ständiges mitglied der Organisation der Vereinten Nationen
With the technology of the invention, TM content for any matching multilingual, multidimensional language pair is obtained automatically, for example:
Chinese zh-cn: the People's Republic of China (PRC) is the permanent member of the UN organizations
French fr-fr: République populaire de Chine est membre permanent de l'Organisation des Nations Unies
Chinese zh-cn: the People's Republic of China (PRC) is the permanent member of the UN organizations
German de-de: Der Volksrepublik China ist ständiges mitglied der Organisation der Vereinten Nationen
French fr-fr: République populaire de Chine est membre permanent de l'Organisation des Nations Unies
German de-de: Der Volksrepublik China ist ständiges mitglied der Organisation der Vereinten Nationen
B. Term base (TBX) example:
In the usual case, term content consists of two-dimensional, single language pairs:
English en-us: Computer-assisted translation
Chinese zh-cn: computer-aided translation
English en-us: Computer-assisted translation
French fr-fr: Traduction assistée par ordinateur
English en-us: Computer-assisted translation
German de-de: Computerunterstützte Übersetzung
With the technology of the invention, terms for any matching multilingual, multidimensional language pair are obtained automatically:
Chinese zh-cn: computer-aided translation
French fr-fr: Traduction assistée par ordinateur
Chinese zh-cn: computer-aided translation
German de-de: Computerunterstützte Übersetzung
French fr-fr: Traduction assistée par ordinateur
German de-de: Computerunterstützte Übersetzung
Processing the two industry-standard formats TMX and TBX directly brings the following benefits:
1) Language assets stored in a text-based format are reused directly; the data is not easily damaged or lost, which improves asset safety.
2) No manual format conversion is needed; industry-standard files are imported automatically, enabling reuse of the language assets.
3) Multilingual, multidimensional language pairs and term pairs are obtained automatically. For example, starting from a corpus with 3 language pairs, the invention yields 9 additional language pairs, an extra gain in assets. This maximizes the value of the language assets, keeps language usage consistent as an enterprise globalizes and internationalizes its products, directly improves efficiency and quality, saves substantial multilingual production costs, and shortens the time an enterprise needs to roll out its products globally.
4) High-speed querying and reuse of mass multilingual assets is supported.
Each vendor wants users to depend more heavily on its own CAT product. From the user's point of view, however, a method that automatically generates multilingual pairs from the identical content of single-language-pair mass language assets, while safeguarding the assets and maximizing the use of resources, is highly valuable. The technical solution of the invention achieves this: besides ensuring reuse and asset safety for the original single-language-pair sentences, it automatically obtains multilingual, multidimensional language pairs for the user, yielding an extra gain in assets and maximizing the value of the language assets.
The above is only a preferred embodiment of the invention. Any non-inventive improvement made by those skilled in the art within this spirit falls within the protection scope of the invention.

Claims (2)

1. A method for dynamically generating mass language assets in multilingual industry-standard formats, characterized by comprising the following steps: (1) a parser developed for the purpose reads the content of corpora and term bases stored in XML-based standard formats such as TMX and TBX and imports it into a designated database; (2) during import, database tables that automatically match and hold the same content in different language pairs are used to generate a multilingual database in which one source text is linked to multiple matching target-language texts; (3) when a user works with the system, search results for the language pair the user specifies are automatically returned in the form of translation memory and presented to the end user in a specific format for reuse; (4) when the multilingual database is extended or updated, the related multilingual content is updated automatically, ensuring that the user continues to obtain up-to-date translation memory content after the language assets are dynamically updated.
2. The method for dynamically generating mass language assets in multilingual industry-standard formats according to claim 1, characterized by further comprising the following steps: using a corpus parsing module to parse the industry-standard TMX and TBX formats, read the corpus information (source language, target language, and so on) into memory, and convert it into binary objects; using a corpus matching module to match corpus entries on the intermediate language and store each target-language entry at the correct position in the multilingual corpus matrix; using a corpus generation module to read corpus information from the multilingual corpus matrix and output it as TMX or TBX files according to the industry standards, for archiving and backup or for use in other TMX- or TBX-compatible tools.
CN201210383201.5A 2012-10-11 2012-10-11 Method for dynamically generating mass language assets in multiple language industry standard formats Expired - Fee Related CN103729346B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210383201.5A CN103729346B (en) 2012-10-11 2012-10-11 Method for dynamically generating mass language assets in multiple language industry standard formats

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210383201.5A CN103729346B (en) 2012-10-11 2012-10-11 Method for dynamically generating mass language assets in multiple language industry standard formats

Publications (2)

Publication Number Publication Date
CN103729346A (en) 2014-04-16
CN103729346B (en) 2017-02-08

Family

ID=50453425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210383201.5A Expired - Fee Related CN103729346B (en) 2012-10-11 2012-10-11 Method for dynamically generating mass language assets in multiple language industry standard formats

Country Status (1)

Country Link
CN (1) CN103729346B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104407862A (en) * 2014-11-20 2015-03-11 北京奇虎科技有限公司 Data processing plugin and data processing method applied to browser

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826072A (en) * 2009-03-02 2010-09-08 Sdl有限公司 Computer assisted natural language translation
CN102591859A (en) * 2011-12-28 2012-07-18 华为技术有限公司 Method and relevant device for reusing industrial standard formatted files
US8244519B2 (en) * 2008-12-03 2012-08-14 Xerox Corporation Dynamic translation memory using statistical machine translation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8244519B2 (en) * 2008-12-03 2012-08-14 Xerox Corporation Dynamic translation memory using statistical machine translation
CN101826072A (en) * 2009-03-02 2010-09-08 Sdl有限公司 Computer assisted natural language translation
CN102591859A (en) * 2011-12-28 2012-07-18 华为技术有限公司 Method and relevant device for reusing industrial standard formatted files

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘小军: "基于多语种平行语料库的机器辅助翻译系统", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104407862A (en) * 2014-11-20 2015-03-11 北京奇虎科技有限公司 Data processing plugin and data processing method applied to browser
CN104407862B (en) * 2014-11-20 2017-10-31 北京奇虎科技有限公司 Data processing insert arrangement and data processing method applied to browser

Also Published As

Publication number Publication date
CN103729346B (en) 2017-02-08


Legal Events

Date Code Title Description
DD01 Delivery of document by public notice

Addressee: SHANGHAI UTRANSHUB INFORMATION TECHNOLOGY CO.,LTD.

Document name: Notification of Passing Preliminary Examination of the Application for Invention

C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: SHANGHAI YOUYI INFORMATION TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: SHANGHAI YONGJINYI INFORMATION TECHNOLOGY CO., LTD.

Effective date: 20141106

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20141106

Address after: 200439 room 306, Gao Jing International Building, 101 Yin Gao Xi Road, Shanghai

Applicant after: Translated by Mdt InfoTech Ltd. Shanghai

Address before: 200439 Gao Jing International Building 101, Yin Gao Xi Road, 306, Shanghai, China

Applicant before: SHANGHAI UTRANSHUB INFORMATION TECHNOLOGY CO.,LTD.

C53 Correction of patent of invention or patent application
CB02 Change of applicant information

Address after: 200439 Shanghai city Baoshan District Yixian Road No. 2816 Wordsworth Pentium building B building 20 floor

Applicant after: Translated by Mdt InfoTech Ltd. Shanghai

Address before: 200439 room 306, Gao Jing International Building, 101 Yin Gao Xi Road, Shanghai

Applicant before: Translated by Mdt InfoTech Ltd. Shanghai

ASS Succession or assignment of patent right

Owner name: DU JINLIN

Free format text: FORMER OWNER: SHANGHAI YOUYI INFORMATION TECHNOLOGY CO., LTD.

Effective date: 20150401

COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 200439 BAOSHAN, SHANGHAI TO: 200441 BAOSHAN, SHANGHAI

TA01 Transfer of patent application right

Effective date of registration: 20150401

Address after: 200441 Shanghai Yixian Road, No. 2816 Wordsworth Pentium building B building 20 floor

Applicant after: Du Jinlin

Address before: 200439 Shanghai city Baoshan District Yixian Road No. 2816 Wordsworth Pentium building B building 20 floor

Applicant before: Translated by Mdt InfoTech Ltd. Shanghai

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160215

Address after: 200441 Shanghai Yixian Road, No. 2816 Wordsworth Pentium building B building 20 floor

Applicant after: Translated by Mdt InfoTech Ltd. Shanghai

Address before: 200441 Shanghai Yixian Road, No. 2816 Wordsworth Pentium building B building 20 floor

Applicant before: Du Jinlin

C14 Grant of patent or utility model
GR01 Patent grant
DD01 Delivery of document by public notice

Addressee: Sun Yuxiao

Document name: payment instructions

DD01 Delivery of document by public notice
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170208

CF01 Termination of patent right due to non-payment of annual fee