WO2002095614A1

WO2002095614A1 - Method for identifying language/character code system

Info

Publication number: WO2002095614A1
Application number: PCT/JP2001/004350
Authority: WO
Inventors: Izumi Suzuki
Original assignee: Izumi Suzuki
Priority date: 2001-05-24
Filing date: 2001-05-24
Publication date: 2002-11-28
Also published as: JPWO2002095614A1

Abstract

A method for mechanically identifying the language and character code system of a text document encoded by a computer. A list LBSL/C of byte strings of a specified length is prepared in advance for each target language/character code system; it stores the byte strings of the specified number of bytes that may occur in a text document of that language/character code system. For each language/character code system, an “occurrence rate of learnt byte strings”, i.e. the proportion of the specified-length byte strings contained in the target text document that already exist in the list LBSL/C, is calculated. Only when exactly one language/character code system has an occurrence rate of learnt byte strings close to 1 is that language/character code system output as the result.

Description

Specification

Language / character code system identification processing method

The present invention relates to a multilingual processing technique in a computer, and more particularly to a machine processing method for identifying the language and the character code system of a text document encoded by the computer.

Background technology

In recent years, multilingual processing capability for computers and network data has become increasingly important. There are more than 100 languages in the world with a sizable speaker population, and some 20 kinds of character systems in use can be counted. Also, as of the end of 1999, about 140 alphabets were under review by the ISO/IEC 10646 review group. With the global spread of the Internet, more and more users are trying to communicate on the Internet in their local languages. However, for many languages, particularly in Asia, the character code systems used to handle them on computers are far from unified. For example, for Hindi, seven well-known character sets are actually used on Internet pages. A difference in character code system is not merely a difference in character font: if a text document encoded in character code system A is displayed using the character font of a different character code system B (that is, decoded under character code system B), completely meaningless text is displayed.

For this reason, as one of the multilingual processing technologies for this very wide variety of character code systems, a language and character code system identification method that satisfies the following requirements is currently needed.

(Problems to be solved)

In a machine identification method for language and character code system:

(1) Capability when the text document to be identified does not fall under any of the registered target language / character code systems: avoid erroneously assigning it to whichever target language / character code system merely appears most likely; that is, output either a correct identification result or “unidentifiable”.

(2) Capability when multiple language / character code systems are mixed.

(3) The information necessary for identification can be obtained from text documents by the same machine processing method regardless of the language / character code system, is expressed in the same data structure regardless of the language / character code system, and only such information is used for each language / character code system.

Identification methods that meet the above requirements can be powerful information processing tools in relatively large multilingual processing systems, such as those for searching, classifying, and statistically surveying documents on the network. Next, we focus on the statistical survey on the Internet and describe in detail the technical need for a method that satisfies the above requirements.

Given the situation surrounding the network described earlier, it is now necessary to understand in detail which languages and character code systems are used on Internet pages, and to what extent. The survey systematically accesses pages on the Internet around the world using robot search techniques, and automatically identifies and tabulates the language and character code system used on the pages. (The text document that is input to the identification device and is to be identified is called the “target text document.”) If the text used on a certain page is written in a language / character code system that is not registered in the identification device, the page is checked manually and new language / character code systems are registered if necessary. (A registered language / character code system is called a “target language / character code system.”)

(Prior Art) The following three methods have been known for using a machine to identify the language, character code system, genre, etc. of a text document encoded by a computer.

(1) A method that creates in advance a table of the frequencies of the words or characters mainly used in each target language / character code system or genre, and compares it with the frequencies of occurrence of the words or characters used in the text document to be identified (JP-A-2000-148754).

(2) A method that, for each language / character code system / genre, lists in advance a plurality of words or characters that appear specifically in it as compared with other languages / character code systems / genres, and checks whether those words or characters appear in the target text document.

(3) A method combining the characteristics of (1) and (2) above (JP-A-7-262188).

However, for the purpose of a statistical survey on the Internet, these methods have difficulties in the following two respects.

1. All of them output, as the identification result, the most likely language / character code system among the target language / character code systems. It is therefore difficult to determine clearly whether the document actually belongs to any target language / character code system at all.

2. Adapting to documents in which multiple language / character code systems are mixed is difficult. Also, in method (2), it is difficult to detect unregistered language / character code systems when they are mixed in. For example, suppose that Japanese / Shift-JIS is registered as a target, that Malay / iso8859-1 is not registered, and that a text document containing both Japanese / Shift-JIS and Malay / iso8859-1 is to be identified. Method (2) will output Japanese / Shift-JIS as the result, and the fact that an unregistered language / character code system is included will be overlooked.

If requirements (1) and (2) of the problems to be solved are not satisfied, then when performing the above statistical survey on the Internet, not only are the results inaccurate, but unregistered language / character code systems may also be missed. As many language / character code systems as possible are registered in advance before a survey is started, but the possibility of encountering an unregistered language / character code system during the course of the survey remains. Rather, it can be said that one of the objectives of this survey is to collect unregistered language / character code systems through the survey. There is a need for a method that can reliably detect, in the identification process, text documents that are partly (at least about 20% or more) or wholly written in such unregistered language / character code systems.

In addition, in the above-mentioned statistical survey on the Internet, which involves a wide variety of language / character code systems, a method is inefficient unless it satisfies requirement (3) of the problems to be solved, that is, unless the unit of information used for identification is something other than words or characters of the language extracted from the text document on the basis of knowledge of each language and character code system.

Disclosure of the invention

The unit of information used in the present invention is every partial byte sequence of a specified number of bytes contained in the text document (that is, a specified-length byte string). As the information for each language / character code system, a list of the specified-length byte strings that may appear in a text document written in that language / character code system (LBSL/C) is created in advance. If the list contains most of the specified-length byte strings that can appear in text documents of a certain language / character code system, then a text document in which byte strings absent from the list appear frequently can be shown not to be written in that language / character code system. In addition, the simple union of the lists LBSL/C of multiple language / character code systems is itself the list LBSL/C of a new language / character code system meaning “a mixture of, or any one of, these language / character code systems”, so documents in which multiple language / character code systems are mixed can easily be identified.
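The idea of specified-length byte strings and of taking the union of two lists can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `byte_ngrams` and `build_lbsl_c`, the use of a Python `set` as the list LBSL/C, and the two one-sentence training snippets are all assumptions made for the example.

```python
def byte_ngrams(data: bytes, n: int = 3) -> list[bytes]:
    """Return every contiguous n-byte substring of `data`, in order."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_lbsl_c(training_docs: list[bytes], n: int = 3) -> set[bytes]:
    """Collect the n-byte strings seen in training text of one
    language / character code system (a stand-in for the list LBSL/C)."""
    learned: set[bytes] = set()
    for doc in training_docs:
        learned.update(byte_ngrams(doc, n))
    return learned


# Hypothetical training snippets; a usable list would need far more text.
japanese_sjis = build_lbsl_c(["日本語の文書です。".encode("shift_jis")])
english_latin1 = build_lbsl_c(["This is an English document.".encode("iso-8859-1")])

# The plain set union plays the role of the list for
# "Japanese / Shift-JIS, English / iso8859-1, or a mixture of the two".
japanese_or_english = japanese_sjis | english_latin1
```

A set is used here only because membership tests are cheap; any structure that supports lookup of a byte string would serve the same purpose.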

The list LBSL/C for each language / character code system can easily be obtained from text documents written in that language / character code system. The guideline for the amount of text required to obtain a list LBSL/C that gives good identification results is stated as 1 KB 20 KB for character codes, with 2-byte codes such as Japanese noted separately.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a diagram schematically showing a system according to the present invention. FIG. 2 is a flowchart of the general steps of the process executed by the system shown in FIG. 1. FIG. 3 is a flowchart of the detailed steps executed in step 204 shown in FIG. 2 for calculating the learned byte string appearance rate in the target text document for each language / character code system.

Fig. 4 is a flowchart of the detailed steps executed in step 206 shown in Fig. 2 to delete the lower language / character code system when there is more than one language / character code system whose appearance rate exceeds the upper limit UB. Fig. 5 shows part of the list LBSL/C of 3-byte strings that may appear when the language / character code system is “Japanese / Shift-JIS”. Fig. 6 shows an example in which there is no language / character code system whose learned byte string appearance rate takes a value between the lower limit (LB) and the upper limit (UB).

FIG. 7 is an example of a list in which the relationships described in claim 2 are written out for the “Example of target language / character code system” (A to H) described on page 6. In FIG. 7, each parenthesized pair means that the first language / character code system is higher than the second. FIG. 8 is an example of execution of the process described in step 206 of FIG. 2. The language / character code systems are the same as in the example described in FIG. 6, and the relationships are the same as in the example described in FIG. 7. Fig. 9 shows the number of LBSL/C items used in the experiment described under “Possibility of Industrial Use” and the amount of text documents referred to in order to create them. Fig. 10 shows the output results of step 204 shown in Fig. 2 in the experiment described under “Possibility of Industrial Use”.

BEST MODE FOR CARRYING OUT THE INVENTION

The present invention will be described in more detail with reference to the accompanying drawings.

A computer-encoded text document (the target text document) is input, and first, in step 202, it is checked whether it is a long or a short document. Next, in step 203, all specified-length byte strings included in the target text document are read and stored in the list LBSS. The specified byte length is generally 3 bytes. 1 byte and 2 bytes do not provide the desired discrimination performance. On the other hand, as the specified length increases, the discrimination performance improves, but the processing time and the number of items required in the list LBSL/C per language / character code system also increase. Next, for each target language / character code system, the list LBSL/C of specified-length byte strings that may appear in text documents of that language / character code system, created in advance, is searched for each item of LBSS to determine whether that byte string exists in it, and the appearance rate of learned byte strings is calculated for each language / character code system (step 204). Fig. 5 shows an example (part) of the list LBSL/C, and Fig. 3 shows the detailed steps of step 204.
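Steps 203 and 204 can be sketched roughly as below. The sketch assumes that LBSS keeps every occurrence of a byte string (duplicates included) and that a document shorter than the specified length is simply rejected; neither detail is stated in the text, and the function name `learned_rate` is hypothetical.

```python
def learned_rate(target: bytes, lbsl_c: set[bytes], n: int = 3) -> float:
    """Fraction of the n-byte strings in `target` that already exist in `lbsl_c`."""
    # Step 203: read all specified-length byte strings into LBSS.
    lbss = [target[i:i + n] for i in range(len(target) - n + 1)]
    if not lbss:
        # Assumption: a document shorter than n bytes cannot be scored.
        raise ValueError("target text document is shorter than n bytes")
    # Step 204: count the items of LBSS that are found in LBSL/C.
    hits = sum(1 for s in lbss if s in lbsl_c)
    return hits / len(lbss)


# Toy list with three learned 3-byte strings.
toy_list = {b"abc", b"bcd", b"cde"}
fully_known = learned_rate(b"abcde", toy_list)    # abc, bcd, cde are all learned
partly_known = learned_rate(b"abcdx", toy_list)   # cdx has not been learned
```

In a full system this function would be called once per target language / character code system, giving one rate per system.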

Next, in step 205, it is checked whether there is any language / character code system whose learned byte string appearance rate takes a value between the predetermined lower limit LB and upper limit UB. If one or more such language / character code systems exist, the system outputs “unidentifiable” and terminates the identification process. If none exists (Fig. 6 illustrates such a case), the target text document may still correspond to multiple language / character code systems, and the processing for that case is performed next. The values of LB and UB are determined in advance depending on the implementation. The lower the lower limit LB and the higher the upper limit UB, the more likely it is that some language / character code system has a learned byte string appearance rate between LB and UB.
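The check in step 205 amounts to a test over the per-system rates. The following sketch treats the band as inclusive at both ends and uses illustrative values LB = 0.3 and UB = 0.9; both choices are assumptions, since the text leaves LB and UB to the implementation. The function name and the sample rates are hypothetical.

```python
def any_rate_in_band(rates: dict[str, float], lb: float, ub: float) -> bool:
    """True if some language / character code system has a rate between LB and UB."""
    return any(lb <= rate <= ub for rate in rates.values())


LB, UB = 0.3, 0.9  # illustrative values only

# Every rate is either above UB or below LB: processing continues.
clear_case = {"A": 0.98, "B": 0.12, "C": 0.05}

# One rate falls between LB and UB: the result is "unidentifiable".
ambiguous_case = {"A": 0.70, "B": 0.12, "C": 0.05}

continue_processing = not any_rate_in_band(clear_case, LB, UB)
output_unidentifiable = any_rate_in_band(ambiguous_case, LB, UB)
```

Widening the band (lower LB, higher UB) can only make `any_rate_in_band` return True more often, which is why the choice of the two limits trades coverage against caution.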

When the items of the list LBSL/C of one language / character code system (A) are all included in the same list of another language / character code system (B), A is defined to be higher than B. The relationship between two language / character code systems defined in this way is described as a set of symbols that specify the language / character code systems (an example is shown in Fig. 7). Given these relationships among the target language / character code systems, created in advance, if there are multiple language / character code systems whose learned byte string appearance rate exceeds the upper limit UB and some of them stand in this relationship to one another, the lower language / character code system is excluded (step 206). The details of the procedure for implementing step 206 are shown in the flowchart of Fig. 4. Fig. 8 shows an execution example of step 206.
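Step 206 can be sketched with plain set operations, given pairs (X, Y) meaning "X is higher than Y", i.e. the list of X is contained in the list of Y. This is not the flowchart of Fig. 4, only a compact equivalent under that reading. The function name `remove_lower` is hypothetical, and the pair ("A", "D") is an assumption added for illustration; the pair ("B", "D") corresponds to "English only" versus "Japanese, English, or mixed".

```python
def remove_lower(candidates: set[str],
                 higher_than: set[tuple[str, str]]) -> set[str]:
    """Drop every candidate that is lower than another remaining candidate."""
    lower = {y for (x, y) in higher_than
             if x in candidates and y in candidates}
    return candidates - lower


# (X, Y) means X is higher than Y. D stands for a mixture list.
relations = {("A", "D"), ("B", "D")}

# Systems whose learned byte string appearance rate exceeded UB.
above_ub = {"B", "D"}
remaining = remove_lower(above_ub, relations)

# If exactly one system remains it is the identification result.
result = next(iter(remaining)) if len(remaining) == 1 else "unidentifiable"
```

With an English-only input, both the English list and the mixture list score high, and removing the lower (more inclusive) system leaves the single specific answer.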

Finally, if only one language / character code system remains after the removal in step 206 above, that language / character code system is output as the identification result, and the process ends.

Possibility of Industrial Use

The present invention can be a powerful multilingual information processing means not only in the statistical survey on the Internet described in the background, but also, for the same reason, in the search and classification of documents on the network. Below, two additional features of the present invention are described, followed by an experiment to confirm its effectiveness and the results of that experiment.

(Feature 1: Diversity of identifiable text documents)

In the prior art, it may be difficult to identify a text document in which only specific types of characters are used frequently. For example, hiragana is used in almost all Japanese documents and, moreover, very frequently. For this reason, hiragana is often adopted as a character with a high frequency of appearance in prior art (1), and the first byte of kana is often adopted as a characteristic character code in prior art (2). In particular, to determine which of the Japanese character code systems Shift-JIS or EUC is in use, prior art (2) checks for the presence or absence of codes that are used in the first byte of Shift-JIS kana but are not used in EUC. However, pages actually exist on the Internet, such as “Ichiban Prefectural Universities (Tokyo)” listing Aoyama Gakuin University, Asia University, Ueno Gakuen University, Sakurarin University, Otsuma Women's University and so on, in which kana characters are not used at all, and proper identification of such documents cannot be expected.

In contrast, the method of the present invention uses, for each target language / character code system, a list of specified-length byte strings that may be used in text documents of that language / character code system, so there is no problem in identifying text such as the above example. A document consisting mostly of digits, blanks (spaces), symbols and the like that are common to many language / character code systems may be judged unidentifiable, but even in that case there is no possibility of returning an incorrect identification result.

(Feature 2: Reliability when information is insufficient)

Further, according to the present invention, even if the list LBSL/C of specified-length byte strings that may appear in text documents of a language / character code system, created in advance for each target language / character code system, does not contain enough items, an incorrect identification result is not returned. The output is again either a correct result or “unidentifiable”, for the following reasons.

Assume that the list LBSL/C of language / character code system A is incomplete. There are three cases for the target text document: 1) it is written in language / character code system A; 2) it is written in a language / character code system B that is different from A and registered as a target; and 3) it is written in an unregistered language / character code system C that is different from A. In case 1), the learned byte string appearance rate of the target text document for A would be likely to exceed the upper limit UB if the items of LBSL/C were sufficient, but is likely to fall below UB because the items are insufficient. Even so, this does nothing to raise the learned byte string appearance rate for other language / character code systems, and consequently “unidentifiable” is returned.

In case 2), the learned byte string appearance rate of the target text document for A is smaller than the lower limit LB even if the LBSL/C items of A are sufficient. If the LBSL/C items of A are insufficient, this value will be less than or equal to the value obtained with sufficient items, and so it is not a factor in returning incorrect results.

In case 3), regardless of the target language / character code system, the learned byte string appearance rate is unlikely to exceed UB in the first place, and with insufficient items the appearance rate only takes a smaller value. Therefore, the result is output as “unidentifiable”.

Note that by sending a text document whose language / character code system is known to this device and calculating the learned byte string appearance rate for that language / character code system, it is also possible to test whether the items of the list LBSL/C of that language / character code system are sufficient.
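Such a sufficiency test might look like the sketch below: a document of known language / character code system is scored against its own list, and the byte strings that are still missing are reported. The function name `coverage_report`, the returned tuple, and the value used for UB are all assumptions for illustration.

```python
def coverage_report(known_doc: bytes, lbsl_c: set[bytes],
                    ub: float, n: int = 3) -> tuple[bool, float, set[bytes]]:
    """Return (sufficient?, learned rate, byte strings missing from the list)."""
    grams = [known_doc[i:i + n] for i in range(len(known_doc) - n + 1)]
    if not grams:
        raise ValueError("document is shorter than n bytes")
    hits = sum(1 for g in grams if g in lbsl_c)
    rate = hits / len(grams)
    missing = {g for g in grams if g not in lbsl_c}
    # The list is treated as sufficient when the rate exceeds UB.
    return rate > ub, rate, missing


# Deliberately incomplete toy list: b"cde" has not been learned yet.
incomplete_list = {b"abc", b"bcd"}
sufficient, rate, missing = coverage_report(b"abcde", incomplete_list, ub=0.9)
```

The `missing` set shows which byte strings lowered the rate; adding them to the list is one possible way to extend it, although the text does not prescribe how a list should be updated.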

(Experiment)

Details of experiments performed to verify the effectiveness of the present invention are described below.

For the eight language / character code systems (A to H) shown in Fig. 6, the items in the list LBSL/C for each language / character code system were collected in the numbers shown in Fig. 9. However, for the two language / character code systems D and E, the union of the items of the lists LBSL/C of language / character code systems A and B, and of B and C, respectively, was used. In addition, for Indonesian / iso8859-1, the number of items of the list LBSL/C was intentionally reduced. The items of the lists LBSL/C were extracted from text documents that were randomly collected from pages on the Internet and whose language / character code system was checked manually. The number of text documents in each language / character code system referred to in order to extract the items of the list LBSL/C is also shown in Fig. 9. In addition, the relationships described in claim 2 among the target language / character code systems are the same as those shown in FIG. 7. Among the text documents identified in the experiment, those using A (Japanese / Shift-JIS) and B (English / iso8859-1) are shown below.

A (Japanese / Shift-JIS)

As the globalization of the economy progresses rapidly, the movement to build a new international economic order is in full swing with the establishment of the World Trade Organization (WTO) and the development of APEC (Asia-Pacific Economic Cooperation).

The Ministry of Economy, Trade and Industry (METI) has conducted consultations with countries around the world and has exerted leadership in various places to develop a contracted economic system and to achieve stable development of the Japanese economy and the world economy.

As the world's largest donor, Japan is implementing effective and efficient economic cooperation based on the Official Development Assistance Charter in order to help developing countries become self-sustaining.

In addition, the Ministry of Economy, Trade and Industry recognizes that economic cooperation that contributes to the national interest is important, and is promoting comprehensive economic cooperation that secures organic coordination between aid and trade.

B (English / iso8859-1)

Framing everything, of course, are her trademark curls. "We all have the hair," says Margulies of her two older sisters and their divorced parents, Paul, an advertising copywriter, and Francesca, a dance teacher. Margulies began her career as a hair model for a perm company.

"I'd go out on a runway, and they'd say, 'This is our perm! Look how natural and beautiful it is,'" says the actress, who has never had a perm at all. To maintain her corkscrews, she shampoos daily, conditions every six weeks with Sebastian Potion 9 and deep-conditions twice a year. "My hair will do pretty much what I want it to do," she says. "It's like Play-Doh."

Photo by: Daniela Federici

Under the above conditions, text documents in each of the language / character code systems A, B, C, F, G and H, and a text document in which Japanese / Shift-JIS and English are mixed (about 70 pounds each for language / character code systems A, B, C, F, G and H, and about 130 pounds for the mixture of Japanese and English), were each input to the identification device described in claim 1. Fig. 10 shows the appearance rate of learned byte strings for each language / character code system.

For Indonesian, whose list LBSL/C was deliberately left insufficient for the comparative experiment, the Indonesian input text was judged unidentifiable. For the other input texts, the correct identification result was obtained by performing step 206 of claim 2. For example, in the case of text input in English / L, the learned byte string appearance rates of the two language / character code systems “B. English / L only” and “D. Japanese / S, English / L, or mixed” exceeded UB. By performing the processing of step 206 on these language / character code systems, the single language / character code system “B. English / L only” is obtained, as shown in example 1 of Fig. 8. (The character code system Shift-JIS is abbreviated to S, and iso8859-1 to L.)

When conducting a survey on the Internet as described in the “Background” section, the number of language / character code systems registered as targets is likely to be on the order of hundreds. In this example, only eight language / character code systems were registered as targets. However, discrimination ability becomes an issue mainly among closely related language / character code systems that share the same character code system, such as French / L and English / L; what matters is that, in identifying such closely related languages, proper identification is performed without outputting “unidentifiable”. Therefore, the effectiveness of the present invention can be confirmed by conducting experiments on closely related language / character code systems, without conducting experiments on several hundred language / character code systems.

Claims


1. A machine processing method for identifying the language and character code system of a text document encoded by a computer (referred to as the target text document), comprising:

means for reading all the specified-length byte strings contained in the target text document and storing them in a list (called LBSS) (step 203);

means for storing, for each registered language / character code system (referred to as a target language / character code system), a list (called LBSL/C), created in advance, of the specified-length byte strings that may appear in text documents of that language / character code system;

a step (steps 302 to 306) of searching whether each specified-length byte string in LBSS exists in each LBSL/C; and

means for calculating and storing, based on the results of the above step, for each language / character code system, the ratio of the number of items in the list LBSS whose specified-length byte string already exists in the list LBSL/C (called the learned byte string appearance rate);

the processing method being characterized by outputting the former language / character code system when it is judged that the learned byte string appearance rate for only one language / character code system is close to 1 and that the learned byte string appearance rates for all other language / character code systems are sufficiently lower than 1, and outputting “unidentifiable” otherwise.

2. The machine processing method for identifying the language and character code system of a text document encoded by a computer according to claim 1, further comprising:

means for storing, as sets of symbols representing the target language / character code systems, an arbitrary number of pieces of information describing the relationship between two language / character code systems, the relationship being defined by the list LBSL/C described in claim 1 of one language / character code system (A) being included in the same list of another language / character code system (B) (in which case A is higher than B) (Fig. 7); and

a step (step 206) of receiving the above information describing the relationships among the target language / character code systems together with multiple language / character code systems and, if any two of those language / character code systems stand in such a relationship, outputting the one or more language / character code systems obtained by deleting the lower language / character code systems from them.