WO2016206336A1

WO2016206336A1 - File extraction and restoration method favorable for translation work

Info

Publication number: WO2016206336A1
Application number: PCT/CN2015/098668
Authority: WO
Inventors: 江潮; 罗伟峰
Original assignee: 武汉传神信息技术有限公司
Priority date: 2015-06-25
Filing date: 2015-12-24
Publication date: 2016-12-29
Also published as: CN104933041A; CN104933041B

Abstract

An artificial intelligence file processing method that facilitates translation work. The method comprises: using the file-processing support of the Aspose component, disassembling a file object to be translated into a set of data to be translated, in which a single sentence is the minimum unit; establishing a standard translator processing document and copying each sentence in the set of data to be translated into the translator processing document one by one; having a translator fill translations into the translator processing document one by one; traversing the set of data to be translated and the translator processing document, and writing the translations into the set of data to be translated; and restoring the set of data to be translated to a document in the original manuscript format. Manuscripts in various formats can thus be converted into a standard translator processing document. Sentences that appear repeatedly need to be translated only once, the translator's work is simplified, translation efficiency is improved, the extraction and restoration logic executes efficiently, and the restored translated manuscript retains the original manuscript format.

Description

File extraction and restoration method for translation work

Technical field

The invention relates to an artificial intelligence and document processing method that facilitates translation work.

Background technique

With China having become the world's second-largest economy and the "One Belt, One Road" strategy being steadily implemented, China's various fields are more and more closely linked to the world. The market for the language support services needed for communication between countries in the course of internationalization keeps growing, which brings new opportunities and challenges to the translation industry.

Practitioners in the translation industry face a large number of manuscripts in various formats that need to be translated every day. Because the manuscripts are so varied, translators need to master document programs such as Word, Excel, PPT and PDF, as well as the computer-assisted translation tools for each type of document. This is a major challenge and barrier to entry for full-time translators. Such problems have hindered the development of the entire industry and the process of China's globalization.

Therefore, there is a need for a method that converts a plurality of mainstream document formats into a document of a unified standard style and can conversely restore the converted standard document to its original format, so as to simplify translation work and improve translation efficiency.

Summary of the invention

The technical problem to be solved by the invention is to simplify translation work and improve translation efficiency; to that end, a file extraction and restoration method that facilitates translation work is proposed.

In order to solve the above technical problem, the file extraction and restoration method proposed by the present invention for facilitating translation work includes the following steps:

1) Using the document-processing support of the Aspose dynamic link library, disassemble the document object to be translated into a data set to be translated in which a single sentence is the minimum unit;

2) Establish a translator processing document having three fields, "original", "translation" and id, where the "original" field holds the original text of a sentence and the "translation" field holds its translation;

3) Copy each sentence in the data set to be translated, one by one, to the "original" field of the translator processing document, and then replace the content of each sentence in the data set to be translated with a unique placeholder Guid, adjacent placeholder Guids being given different character formats; the content of the id field has a one-to-one mapping to the Guids;

4) Deliver the translator processing document to the translator, who translates the original text in the "original" field one by one and fills each translation into the corresponding "translation" field until processing is complete;

5) Traverse the data set to be translated and the translator processing document, find the translation for each id according to the id corresponding to each Guid, and overwrite the corresponding Guid in the data set to be translated with that translation;

6) Call the Aspose dynamic link library to restore the data set to be translated to a document in the original format.

Disassembling the document object to be translated into a data set to be translated with a sentence as a minimum unit includes the following steps:

1-1 Call the Aspose component;

1-2 Traverse the document object to obtain all paragraph objects; the paragraph objects contain all the text information of the document object and exclude symbols, images and other non-text information that does not need translation;

1-3 Traverse the child node objects of each paragraph object to obtain a number of character set objects (Run objects). The Aspose component provides paragraph objects, child node objects and Run objects to facilitate character operations; a Run object is a continuous segment of characters with a consistent character format in the document;

1-4 Traverse each Run object and split the Run objects so that each contains either only one complete sentence or only one sentence fragment;

1-5 Traverse each Run object and merge each Run object containing only a sentence fragment into the subsequent Run object containing only one complete sentence.

The result is a collection of Run objects with the sentence as the minimum unit, each containing exactly one complete sentence.

Merging a Run object containing only one sentence fragment into a subsequent Run object includes the following steps:

1-4-1 Take out the character content of a Run object containing only one sentence fragment, store it in a temporary storage unit, and then delete that Run object from the paragraph object;

1-4-2 Check the next Run object. If its character content is only a sentence fragment, take out that content, append it to the temporary storage unit, delete the Run object from the paragraph object, and continue with the next Run object; otherwise, take the character content out of the temporary storage unit, add it to the character content of the next Run object, and then empty the temporary storage unit;

1-4-3 If the character content of the next Run object ends with a sentence terminator, take the character content out of the temporary storage unit, add it to the character content of that Run object, and then empty the temporary storage unit.

The invention also includes establishing a dictionary object whose keys are original texts and whose values are translations, each original-translation pair forming a key-value pair; when the translator processing document is traversed, the original text and translation of each record are written into the dictionary object.

In step 5), if the translation field of the record for an id is empty, the original text of that record is used as a key to look up a matching translation value in the dictionary object; if one is found, it is filled into the translation field.

If no matching translation value is found in the dictionary object, the sentence has been missed in translation, and the original text is filled in directly so that the reviewer can find it easily.

Further, before the translator processing document is sent to the translator, the document is traversed and repeated sentences are marked, to remind the translator that they do not need to be translated again.

Further, before the translator processing document is sent to the translator, the document is traversed and each sentence in the original text is automatically matched against the terms in the termbase; where a term matches, the sentence is annotated with the term, so that translation work goes more smoothly.

Further, before the translator processing document is sent to the translator, the document is traversed and the sentences in the original text are matched one by one against the entries in the corpus; where a sentence matches, the translation from the corpus is filled into the "translation" field of the matching sentence.

Advantageous effects: The present invention simplifies the translator's work. The translator no longer needs to master the various mainstream document programs such as PPT, Word, Excel and PDF, and can devote more energy to translating text. In addition, the documents to be translated are automatically pre-analysed during processing and repeated sentences are found and marked, so that each repeated sentence needs to be translated only once and its other occurrences are filled in automatically. The results of each translation are collected, so that when a new manuscript is received the previously accumulated corpus and terminology can be used directly, further improving translation efficiency.

DRAWINGS

The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a screenshot of a translator translation processing interface according to a specific embodiment of the present invention. The figure mainly shows a translator processing document filled with the original text.

FIG. 2 is a screenshot of another translator translation processing interface according to a specific embodiment of the present invention. The figure mainly shows a translator processing document after pre-processing.

FIG. 3 is an overall flow chart of the present invention.

Detailed description

The method for extracting and restoring files for translation work proposed by the present invention comprises the following steps:

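1) Using the document-processing support of the Aspose dynamic link library, disassemble the document object to be translated into a data set to be translated in which a single sentence is the minimum unit;

2) Establish a translator processing document having three fields, "original", "translation" and id, where the "original" field holds the original text of a sentence and the "translation" field holds its translation;

3) Copy each sentence in the data set to be translated, one by one, to the "original" field of the translator processing document, and then replace the content of each sentence in the data set to be translated with a unique placeholder Guid, adjacent placeholder Guids being given different character formats; the content of the id field has a one-to-one mapping to the Guids;

4) Deliver the translator processing document to the translator, who translates the original text in the "original" field one by one and fills each translation into the corresponding "translation" field until processing is complete;

5) Traverse the data set to be translated and the translator processing document, find the translation for each id according to the id corresponding to each Guid, and overwrite the corresponding Guid in the data set to be translated with that translation;

6) Call the Aspose dynamic link library to restore the data set to be translated and generate a translated document recognizable by the document processing tool.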

In order to better understand the present invention, the translation process of the present invention is described in detail below by taking the processing and translation of a Word document as an example:

S1. Call the Aspose component;

S2. Traverse the Word document object to be translated and obtain all paragraph objects; the paragraph objects contain all the text information of the document object and exclude symbols, images and other non-text information that does not need translation;

S3. Traverse the child node objects of each paragraph object to obtain a number of character set objects (Run objects);

The Aspose component provides paragraph objects, child node objects and Run objects to facilitate character operations. A Run object is a continuous segment of characters with a consistent character format within a document.

The Run objects obtained fall into four cases: (1) a Run object contains multiple complete sentences; (2) a Run object contains multiple complete sentences and a sentence fragment; (3) a Run object contains only one sentence fragment; (4) a Run object contains exactly one complete sentence. Further sentence processing is therefore required: the existing Run objects are split and merged so that each resulting Run object contains exactly one complete sentence.

S4. Split: traverse each Run object and split the Run objects so that each contains either only one complete sentence or only one sentence fragment. The method is, for example, as follows:

Starting from the first Run object, check the character content of the Run object.

If a Run object contains only one complete sentence, or a sentence fragment, the next Run object is directly checked;

If a Run object contains multiple complete sentences, the Run object is split with a sentence terminator and split into several Run objects that contain only one complete sentence.

If a Run object contains multiple complete sentences and a sentence fragment, the Run object is split at the sentence terminators into several Run objects each containing only one complete sentence, plus one Run object containing the sentence fragment.
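The splitting rule in S4 can be illustrated with a minimal Python sketch. It works on plain strings rather than on Aspose Run objects, and the `TERMINATORS` set and the names `split_run` and `is_fragment` are assumptions made for illustration; the source does not list which characters count as sentence terminators.

```python
# Assumption: this set of sentence terminators is illustrative only.
TERMINATORS = ".!?。！？"


def split_run(text: str) -> list[str]:
    """Split a run's text after each sentence terminator.

    Every piece except possibly the last ends with a terminator; a last
    piece without a terminator is a sentence fragment.
    """
    pieces = []
    current = ""
    for i, ch in enumerate(text):
        current += ch
        next_is_terminator = i + 1 < len(text) and text[i + 1] in TERMINATORS
        # Close the piece at the last terminator of a sequence such as "...".
        if ch in TERMINATORS and not next_is_terminator:
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)  # trailing sentence fragment
    return pieces


def is_fragment(text: str) -> bool:
    """A piece is a fragment if it does not end with a terminator."""
    return not text.rstrip().endswith(tuple(TERMINATORS))
```

In a real implementation each piece would presumably become a new Run object that keeps the character format of the Run it was split from. Abbreviations and decimal numbers would also need special handling, which this sketch omits.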

For example, consider the paragraph: "In order to solve the above problems, a method is proposed that converts various mainstream document formats such as Word, Excel, PPT and PDF into a Word document of a unified standard style and can conversely restore the converted standard Word document to its original format. To simplify translation work and improve translation efficiency."

Applying the Aspose component to this paragraph and traversing the paragraph object yields several character set objects (Run objects), in order: Run-1: "In order to solve the above problem, a special will be proposed", Run-2: "Word, Excel, PPT, PDF", Run-3: "A variety of mainstream document formats are converted into a unified standard style", Run-4: "Word", Run-5: "Documents and can also be converted in turn The standard obtained", Run-6: "Word", Run-7: "The method of restoring the document to the original format. To simplify the translation work and improve the translation efficiency."

Clearly, Run-1 to Run-6 each contain only a sentence fragment, while Run-7 contains two seemingly complete sentences. To generate a data set in which each unit is a full sentence, Run-1 through Run-6 need to be merged and Run-7 needs to be split further.

S5. Merge: traverse each Run object and merge each Run object containing only a sentence fragment into the subsequent Run object containing only one complete sentence. Specifically, the following steps are included:

S5-1. Take out the character content of a Run object containing only one sentence fragment, store it in a temporary storage unit, and then delete that Run object from the paragraph object;

S5-2. Check the next Run object. If its character content is only a sentence fragment, take out that content, append it to the temporary storage unit, delete the Run object from the paragraph object, and continue with the next Run object; otherwise, take the character content out of the temporary storage unit, add it to the character content of the next Run object, and then empty the temporary storage unit;

S5-3. If the character content of the next Run object ends with a sentence terminator, take the character content out of the temporary storage unit, add it to the character content of that Run object, and then empty the temporary storage unit.
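The merge in S5 can be sketched in Python as follows. This is a simplified illustration on a list of strings, not on Aspose Run objects; the `TERMINATORS` set, the name `merge_fragments`, and the handling of a trailing fragment that has no following sentence are all assumptions, since the source does not specify them.

```python
# Assumption: this set of sentence terminators is illustrative only.
TERMINATORS = ".!?。！？"


def merge_fragments(run_texts):
    """Merge sentence fragments into the next run that ends a sentence."""
    merged = []
    buffer = ""  # plays the role of the temporary storage unit
    for text in run_texts:
        if text.rstrip().endswith(tuple(TERMINATORS)):
            # This run closes a sentence: put the buffered fragments in
            # front of it and empty the temporary storage unit.
            merged.append(buffer + text)
            buffer = ""
        else:
            # Fragment: store its content; the run itself is dropped.
            buffer += text
    if buffer:
        # Assumption: leftover fragments are kept as their own unit.
        merged.append(buffer)
    return merged
```

For instance, `merge_fragments(["Convert ", "Word", " files.", " Done."])` yields two sentence units. Because fragments are folded into the run that closes the sentence, the merged sentence would presumably carry that run's character format.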

S6. Establish a translator processing document having three fields, "original", "translation" and id, where the "original" field holds the original text of a sentence and the "translation" field holds its translation;

S7. Copy each sentence in the data set to be translated, one by one, to the "original" field of the translator processing document, and then replace the content of each sentence in the data set to be translated with a unique placeholder Guid. Adjacent placeholder Guids are given different character formats, for example different colors, so that the character formats of adjacent Guids differ from each other; the content of the id field has a one-to-one mapping to the Guids;
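Steps S6 and S7 can be sketched as follows, assuming the sentence units are already available as a list of strings. The row layout, the name `build_translator_rows` and the two alternating colors are hypothetical; the source only requires that each placeholder be unique, that adjacent placeholders differ in character format, and that ids map one-to-one to Guids.

```python
import uuid

# Assumption: two alternating colours stand in for "different character formats".
COLOURS = ("red", "blue")


def build_translator_rows(sentences):
    """Return (rows, placeholders) for a list of sentence strings."""
    rows = []
    placeholders = []
    for index, sentence in enumerate(sentences):
        guid = str(uuid.uuid4())  # unique placeholder for this sentence
        rows.append({"id": index + 1, "guid": guid,
                     "original": sentence, "translation": ""})
        # The placeholder takes the sentence's place in the data set;
        # adjacent placeholders get different colours.
        placeholders.append({"text": guid, "colour": COLOURS[index % 2]})
    return rows, placeholders
```

Each row here corresponds to one record of the translator processing document, and the placeholder list stands in for the data set to be translated after its sentences have been replaced.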

S8. Traverse the translator processing document and mark repeated sentences. As shown in FIG. 2, "Android Integration Guide" is repeated and is marked with a colored band to remind the translator not to translate it again.
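A minimal sketch of the repeat marking in S8, assuming each record is a dict with an "original" key; the "repeat" flag and the name `mark_repeats` are hypothetical stand-ins for the colored marking shown in the interface.

```python
def mark_repeats(rows):
    """Flag every record whose original text has already appeared."""
    seen = set()
    for row in rows:
        # The first occurrence stays unmarked; later ones are flagged.
        row["repeat"] = row["original"] in seen
        seen.add(row["original"])
    return rows
```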

S9. Traverse the translator processing document and automatically match each sentence in the original text against the terms in the termbase; where a term matches, annotate the sentence with the term so that translation work goes more smoothly.

S10. Traverse the translator processing document and match the sentences in the original text one by one against the entries in the corpus. If a sentence matches 100%, the translation from the corpus is filled into the "translation" field of the matching sentence. As shown in FIG. 2, "SDK Download" has an established translation "SDK download" in the corpus, and "Frequently Asked Questions" has an established translation "FAQ"; these corpus translations are filled into the "translation" fields of the matching sentences. If the match is less than 100%, the row is marked in a certain color to indicate that the sentence still requires translation by the translator, but the corpus translation is filled in first for the translator to refer to and modify.
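The corpus pre-filling in S10 might look like the sketch below. The source does not say how similarity is measured or where the cut-off for a partial match lies, so the use of Python's `difflib.SequenceMatcher`, the `FUZZY_THRESHOLD` value, the "match" field and the name `prefill_from_corpus` are all assumptions.

```python
from difflib import SequenceMatcher

# Assumption: similarity is measured with difflib, and 0.75 is an arbitrary
# cut-off for a partial match; the source specifies neither.
FUZZY_THRESHOLD = 0.75


def prefill_from_corpus(rows, corpus):
    """Fill translations from a corpus dict of original -> translation."""
    for row in rows:
        source = row["original"]
        if source in corpus:
            # 100% match: fill in the corpus translation directly.
            row["translation"] = corpus[source]
            row["match"] = "exact"
            continue
        best_key, best_score = None, 0.0
        for key in corpus:
            score = SequenceMatcher(None, source, key).ratio()
            if score > best_score:
                best_key, best_score = key, score
        if best_key is not None and best_score >= FUZZY_THRESHOLD:
            # Partial match: pre-fill for reference and flag for review.
            row["translation"] = corpus[best_key]
            row["match"] = "partial"
        else:
            row["match"] = "none"
    return rows
```

Rows flagged "partial" correspond to the color-marked rows that the translator still has to check and revise.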

S11. Send the translator processing document to the translator, who translates the original text in the "original" field one by one and fills each translation into the corresponding "translation" field until processing is complete;

S12. Create a dictionary object whose keys are original texts and whose values are translations, each original-translation pair forming a key-value pair; when the translator processing document is traversed, the original text and translation of each record are written into the dictionary object.

S13. Upload the corpus information to the server as new corpus data for use and reference in subsequent translation work.

S14. Traverse the data set to be translated and the translator processing document, find the translation for each id according to the id corresponding to each Guid, and overwrite the corresponding Guid in the data set to be translated with that translation.

If the translation field of the record for an id is empty, the original text of that record is used as a key to look up a matching translation value in the dictionary object; if one is found, it is filled into the translation field.

If no matching translation value is found in the dictionary object, the sentence has been missed in translation, and the original text is filled in directly so that the reviewer can find it.

S15. Call the Aspose dynamic link library to restore the data set to be translated to a file in the original format.
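The write-back in S14, together with the dictionary fallback from S12, can be sketched as follows. The sketch treats the data set to be translated as a single string containing placeholders; that representation, the name `restore_text`, and the row layout are assumptions for illustration.

```python
def restore_text(placeholder_text, rows):
    """Replace each placeholder Guid with its translation.

    Falls back to the dictionary object when a record's translation is
    empty, and to the original text when the dictionary has no entry.
    """
    # Dictionary object (S12): original text -> translation, built from
    # the records the translator actually filled in.
    memory = {r["original"]: r["translation"] for r in rows if r["translation"]}
    restored = placeholder_text
    for row in rows:
        translation = (row["translation"]
                       or memory.get(row["original"])
                       or row["original"])
        restored = restored.replace(row["guid"], translation)
    return restored
```

A repeated sentence that the translator left empty is therefore filled from its first translated occurrence, and an untranslated sentence shows up in the restored text in its original language.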

For the Excel, PPT and PDF documents mentioned in the present invention, those skilled in the art can use the Aspose component to obtain the character information contained in these documents and, following the method disclosed by the present invention, split and merge it into a sentence-based data set, generate the translator processing document for the translator to translate, and, after translation, restore the translated document. For example, an Excel, PDF or PPT document can be converted into a corresponding Word document with an existing tool and then processed by the method of the above embodiment. Excel documents can also be processed directly with the Aspose component.

Obviously, the present invention has the following beneficial features:

(1) With this method, documents in many different formats can be converted into standardized documents of a uniform format. This supports standardized management of translation work, simplifies the translator's work and improves translation efficiency.

(2) Recurring sentences are found by analysing all the documents in the same batch together. A sentence that repeats multiple times needs to be translated only at its first occurrence; elsewhere its translation is filled in automatically.

(3) New corpus information generated during translation is saved to the cloud service. As the corpus accumulates, the translator's own translation ability gradually improves, and corpus reuse improves translation efficiency.

(4) All manuscripts to be translated can be converted into a standard to-be-translated format, which facilitates the management of team translation and eases the work of the translation project manager.

(5) The extraction and restoration logic is highly efficient and does not have a negative impact on translation work.

(6) When restoring, the algorithm can be modified according to the specific situation to achieve the final effect of format retention.

It should be understood that the above specific embodiments merely illustrate the technical solutions of the present invention and are not to be construed as limiting. Modifications or equivalent substitutions of the technical solutions that do not depart from the spirit and scope of the invention are intended to be included within the scope of the appended claims.

Claims

1. A file extraction and restoration method that facilitates translation work, characterized in that it comprises the following steps:

1) Using the document-processing support of the Aspose dynamic link library, disassemble the document object to be translated into a data set to be translated in which a single sentence is the minimum unit;

2) Establish a translator processing document having three fields, "original", "translation" and id, where the "original" field holds the original text of a sentence and the "translation" field holds its translation;

3) Copy each sentence in the data set to be translated, one by one, to the "original" field of the translator processing document, and then replace the content of each sentence in the data set to be translated with a unique placeholder Guid, adjacent placeholder Guids being given different character formats; the content of the id field has a one-to-one mapping to the Guids;

4) Deliver the translator processing document to the translator, who translates the original text in the "original" field one by one and fills each translation into the corresponding "translation" field until processing is complete;

5) Traverse the data set to be translated and the translator processing document, find the translation for each id according to the id corresponding to each Guid, and overwrite the corresponding Guid in the data set to be translated with that translation;

6) Call the Aspose dynamic link library to restore the data set to be translated and generate a document in the original document format.

2. The file extraction and restoration method that facilitates translation work according to claim 1, wherein disassembling the document object to be translated into a data set to be translated with a sentence as the minimum unit comprises the following steps:

1-1 Call the Aspose component;

1-2 Traverse the document object to obtain all paragraph objects; the paragraph objects contain all the text information of the document object and exclude symbols, images and other non-text information that does not need translation;

1-3 Traverse the child node objects of each paragraph object to obtain a plurality of character set objects (Run objects);

1-4 Traverse each Run object and split the Run objects so that each contains either only one complete sentence or only one sentence fragment;

1-5 Traverse each Run object and merge each Run object containing only a sentence fragment into the subsequent Run object containing only one complete sentence.

3. The file extraction and restoration method that facilitates translation work according to claim 2, wherein merging a Run object containing only one sentence fragment into a subsequent Run object comprises the following steps:

1-4-1 Take out the character content of a Run object containing only one sentence fragment, store it in a temporary storage unit, and then delete that Run object from the paragraph object;

1-4-2 Check the next Run object. If its character content is only a sentence fragment, take out that content, append it to the temporary storage unit, delete the Run object from the paragraph object, and continue with the next Run object; otherwise, take the character content out of the temporary storage unit, add it to the character content of the next Run object, and then empty the temporary storage unit;

1-4-3 If the character content of the next Run object ends with a sentence terminator, take the character content out of the temporary storage unit, add it to the character content of that Run object, and then empty the temporary storage unit.

4. The file extraction and restoration method that facilitates translation work according to claim 1, further comprising: creating a dictionary object whose keys are original texts and whose values are translations, each original-translation pair forming a key-value pair; when the translator processing document is traversed, the original text and translation of each record are written into the dictionary object.

5. The file extraction and restoration method that facilitates translation work according to claim 4, wherein in step 5), if the translation field of the record for an id is empty, the original text of that record is used as a key to look up a matching translation value in the dictionary object, and if one is found, it is filled into the translation field;

If no matching translation value is found in the dictionary object, the sentence has been missed in translation, and the original text is filled in directly.

6. The file extraction and restoration method that facilitates translation work according to claim 1, wherein before the translator processing document is sent to the translator, the document is traversed and repeated sentences are marked to remind the translator that they do not need to be translated again.

7. The file extraction and restoration method that facilitates translation work according to claim 1, wherein before the translator processing document is sent to the translator, the document is traversed and each sentence in the original text is automatically matched against the terms in the termbase, and where a term matches, the sentence is annotated with the term so that translation work goes more smoothly.

8. The file extraction and restoration method that facilitates translation work according to claim 1, wherein before the translator processing document is sent to the translator, the document is traversed and the sentences in the original text are matched one by one against the entries in the corpus, and where a sentence matches, the translation from the corpus is filled into the "translation" field of the matching sentence.