WO2010050844A1 - Method of computerized semantic indexing of natural language text, method of computerized semantic indexing of collection of natural language texts, and machine-readable media - Google Patents

Publication number
WO2010050844A1
Authority
WO
WIPO (PCT)
Prior art keywords
named
text
relations
relation
attributes
Prior art date
Application number
PCT/RU2009/000111
Other languages
French (fr)
Inventor
Vladimir Fyodorovich Khoroshevsky
Victor Petrovich Klintsov
Original Assignee
Zakrytoe Aktsionernoe Obschestvo "Avicomp Services"
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zakrytoe Aktsionernoe Obschestvo "Avicomp Services"
Priority to EP09823885A (EP2350871A1)
Publication of WO2010050844A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/313: Selection or weighting of terms for indexing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Definitions

  • the present invention relates to the information technologies field, namely, to methods of computerized semantic indexing of natural language texts, as well as to machine-readable media comprising respective programs, and could be used for ordering and accumulating information in specified knowledge areas for the purpose of semantic navigation through documents and document collections, as well as for highly precise and quick search of facts and documents relevant to the user's information needs.
  • the EAPO Patent No. 002016 (published on 2001.01.22) describes a method, where unique information blocks are detected in text document fragments and used for subsequent processing and searching.
  • the RU Patent No. 2268488 (published on 2006.01.20) granted on the basis of the PCT Application published as WO 01/06414 discloses the method wherein words, stable phrases, idioms, sentences, and even ideas are coded for the subsequent processing at the numerical level.
  • the RU Patent No. 2273879 (published on 2006.04.10) describes a method wherein morphological and syntactic text analysis is performed, with subsequent indexing of the detected units.
  • a text similarity is determined by text fragments.
  • the disadvantage of all those methods consists in that they do not take into account the semantic ambiguity of the natural language words and expressions.
  • the US Patent No. 6,189,002 discloses a method wherein a text is divided into paragraphs and words, which are converted into vectors of ordered elements. Each vector element for a paragraph is determined by applying a predetermined function to the number of occurrences, in that paragraph, of the word corresponding to that element.
  • the text vector is considered as the semantic profile of the document.
  • this method requires an enormous volume of stored data and does not resolve the semantic ambiguity of words and expressions.
  • the object of the present invention consists in extending the set of methods for indexing natural language texts by employing techniques of computerized linguistic analysis thereof and then using the obtained results for building indices, which ensures semantic navigation through documents and document collections, as well as highly precise and quick search of facts and documents relevant to the user's information needs, particularly with respect to texts in highly inflected languages.
  • Fig. 1 depicts the schematic block-diagram explaining the claimed methods
  • Fig. 2 depicts the fragment of the application domain specification
  • Fig. 3 shows the rule schema for extracting the named entities of the type of "Person"
  • Fig. 4 shows the rule schema for extracting the semantically meaningful relationships of the type of "work"
  • Fig. 5 depicts the fragment of the graphical form for representing the results of the text processing
  • Fig. 6 shows the general diagram for storing the results of processing one text
  • Fig. 7 shows the diagram of the left-hand side of the rule for combining the named entities of the type of "Person".
  • the proposed methods allow for performing effectively the conceptual indexing of the natural language texts both for the subsequent semantic navigation through the documents and document collections and for search purposes.
  • the method of computerized semantic indexing of natural language text according to the first aspect of the present invention and the method of computerized semantic indexing of a collection of natural language texts according to the second aspect of the present invention could be implemented in practically any computing environment, e.g., in a personal computer connected to external databases.
  • the steps of performing those methods are illustrated in Fig. 1.
  • a token could be any text object from the following set: words, each consisting of a series of letters and, possibly, hyphens; a series of spaces; punctuation marks; numbers. Sometimes, character series such as A300, il50b, etc. are also included. Token separation is always carried out in accordance with rather simple rules, for example, as in the mentioned US Patent Application No. 2007/0073533. In Fig. 1, this step is conventionally marked with the reference number 2.
  • tokens are considered as the first level elementary units.
  • respective second level elementary units named hereinafter morphs are formed based on the morphological analysis.
  • its normalized word form is identified.
  • the normalized word form will be «идти»
  • the normalized word form will be «красивого»
  • the normalized word form will be «красивый»
  • the normalized word form will be «сцена».
  • a part of speech to which this word relates and its morphological characteristics are indicated. Of course, for various parts of speech those characteristics differ.
  • RAM random access memory
  • the next step, conventionally marked with the reference number 3 in Fig. 1, consists in identifying stable phrases in the set of derived elementary units of the first two levels, tokens and morphs. This is performed by converting the elementary units, i.e., tokens and morphs, into series that are compared with the series of normalized words and their characteristics in dictionaries stored in advance in the databases, where words are listed with specification of the grammatical associations between them. When a series being compared coincides with the corresponding dictionary series, that series is recognized as a stable phrase and stored as such in the database as a third level elementary unit.
  • sentences corresponding to the portions of the text being indexed are formed.
  • these are real sentences ending with a period, but in some cases it is convenient to treat parts of ordinary sentences, such as an isolated element of an enumeration, as sentences. Therefore, the sentences produced by this step do not always coincide with the sentences of the text being indexed in the usual sense.
  • this analysis is conventionally divided into steps marked with the reference numbers 5 to 11.
  • Said multistage semantic-syntactic analysis is carried out by addressing the linguistic and heuristic rules formed in the database in the predetermined linguistic environment.
  • Such an environment could be, for example, the linguistic environment mentioned in the above RU Patent No. 2242048, or the environment disclosed in said US Patent Application No. 2007/0073533, or any other linguistic environment defining respective rules that make it possible to eliminate the syntactic and semantic ambiguity of real text words and expressions.
  • the linguistic and heuristic rules in the chosen environments are hereinafter referred to as rules.
  • the semantically meaningful objects hereinafter referred to as named entities (the reference number 5 in Fig. 1), and the attributes thereof (the reference numbers 7 and 9 in Fig. 1) are identified.
  • the identification of the named entities that are considered as the fourth level elementary units is carried out in the sentence in a set of elementary units of the first, second, and/or third levels.
  • the morphological attributes are formed for every named entity using said rules from the morphological attributes of those elementary units of the second and/or third levels (i.e., morphs and/or stable phrases) which constitute this named entity.
  • the semantic attributes are formed for every named entity using said rules from the semantic attributes of the elementary units of the second and/or third levels which constitute this named entity.
  • the step of forming said attributes is conventionally marked with the reference 7.
  • every named entity is assigned a respective type from the application ontology according to the topics of the application domain to which the text being indexed relates.
  • the application ontology means, in this case, the specification of the particular application domain, which is stored in the respective database.
  • the corresponding anaphoric reference considered as the fifth level elementary unit (if any) is determined.
  • Every identified named entity is stored in the respective memory together with the type assigned thereto and morphologic and semantic attributes determined thereto.
  • the anaphoric reference is stored together with the type and attributes of the named entity which is the antecedent of that anaphoric reference, as well as with the indication of the co-reference between that named entity and the anaphoric reference thereof.
  • the semantically meaningful relations between the named entities hereinafter referred to as named relations are determined based on the elementary units of the first, second, third, fourth and/or fifth levels using said rules (the step 6).
  • the named relations can relate the named entities within both one sentence and the entire text being indexed.
  • the morphological attributes are determined for every named relation using said rules from the second level elementary units (i.e., morphs) constituting this relation, as well as the semantic attributes from the elementary units of the first, second, third and/or fourth levels, constituting this relation.
  • the respective type is assigned to every named relation from the application ontology stored in the database according to the topics of the application domain to which the text being indexed relates. After that, every named relation is stored in the respective memory together with the type assigned thereto and the morphological and semantic attributes determined therefor.
  • the stored named entities and named relations are used for forming the triples.
  • a set of triples of three types is formed within the text being indexed for each of the identified named relations relating certain named entities.
  • the single first type triple corresponds to the relation established by the named relation between two named entities.
  • Each of the set of the second type triples corresponds to a value of a particular attribute of one of those entities, and each of the set of the third type triples corresponds to a value of a particular attribute of the named relation itself.
  • the first type triple could be represented (depicted) as $O_i \rightarrow R_{ij} \rightarrow O_j$.
  • Each of the set of the second type triples could be represented as $O_i \rightarrow A_{im} \rightarrow V_{im}$ or $O_j \rightarrow A_{jn} \rightarrow V_{jn}$, where $A_{im}$ and $A_{jn}$ are respective attributes, and $V_{im}$ and $V_{jn}$ are, respectively, values of those attributes.
  • each of the set of the third type triples could be represented as $R_{ij} \rightarrow A_{ijk} \rightarrow V_{ijp}$, where $A_{ijk}$ is a respective attribute, and $V_{ijp}$ is a value of that attribute.
  • the indices i, j, k, m, n, and p are integers.
  • the triples formed at the step 12 and indices obtained at the step 13, together with the reference to the initial text from which those triples have been formed, are stored in the database (the step 15 in Fig. 1; the step 14 is omitted in this case).
  • a convolution is performed (not shown) of the objects related by co-reference relations into a single object whose set of attributes is the combination of the attributes of all objects interrelated by the co-reference relations. This is done to reduce the memory volume required in the database for storing such objects, as well as to integrate under one object the information obtained from the entire text.
  • the method of computerized semantic indexing of a collection of natural language texts according to the second aspect of the present invention is carried out exactly as the already discussed method of computerized semantic indexing of natural language text according to the first aspect of the present invention, but in this case, after the step 13 of indexing and prior to the step 15 of storing in the database, one more step is performed. At this step, marked with the reference number 14 in Fig. 1 and performed substantially simultaneously with the step 15, the following is carried out when storing in the database the formed triples and obtained semantic indices of the succeeding text.
  • the newly derived named entities and named relations are compared with the named entities and named relations already existing in the database, using the linguistic and heuristic rules in the predetermined linguistic environment that are formed in the database.
  • duplicate information is not stored in the database, and the respective named entities and/or named relations are supplemented with references to the succeeding texts where they are present and references to the text fragments within each of the succeeding texts from which they are derived.
  • the step of indexing the text collection proceeds similarly to the indexing of the first text of this collection (or the first text indexed by this method), which makes it possible to simplify significantly the entire indexing procedure, reduce the required memory volume, and integrate the information obtained from different texts within a single object.
  • a representative example of such text is the following message: «L ⁇ eHmpcuibHbi ⁇ ⁇ edep ⁇ ibHbi ⁇ 3a 26.06.08 K) ⁇ eH ⁇ o nopymui TuMotueHKo ⁇ u ⁇ edamb y Uymuua ijeny ⁇ a zo3 26.06.08 14:00 Kue ⁇ , HiOHb 26 (Ho ⁇ bi ⁇ Pezuo ⁇ , Muxaun P ⁇ o ⁇ ) — IIpe3udeHm V ⁇ pauHbi
  • the previously created application domain specification is used, within which the text collection processing and semantic index construction will be carried out.
  • a fragment of such specification is depicted in Fig. 2.
  • Such specifications are developed by human experts, who record, based on their experience and knowledge, a list of object types and a list of typical relations therebetween essential for this application domain.
  • the main types of objects are "Person", "Organization", "Location", and some others.
  • the human experts also build in advance a set of rules, each rule containing, in the left-hand side, a template for searching for examples of objects and/or examples of relations therebetween, and in the right-hand side, operators for fixing in the text the examples of objects and/or examples of relations therebetween determined in accordance with the template.
  • the specific data corresponding to the domain specification are derived in the texts being processed.
  • common and special lexicons are used.
  • the step of segmenting the text into elementary units (tokens) is performed together with the morphological analysis of the token-words (reference number 2 in Fig. 1).
  • the initial text is transformed into a set of tokens and morphs that are represented in the Table 1 and Table 2, respectively.
  • the step is carried out for deriving the stable phrases (lookups) using the common and special lexicons (reference number 3 in Fig. 1).
  • the initial text is supplemented, besides the first and second level elementary units, with a set of the third level elementary units, lookups.
  • the fragment of this set for the above example is represented in the Table 3.
  • the text being processed is fragmented into sentences (reference number 4 in Fig. 1).
  • the pluralities formed at the above steps are supplemented with a set of sentences, represented in the Table 4.
  • the text being processed will be segmented into the sentences, each of which is marked with a plurality of annotations of the first, second and third level elementary units.
  • the step of deriving the named entities is carried out at the set of the elementary units of the first, second and/or third levels using said rules.
  • the named entities «Новый Регион» ["New Region"], «Сергей Лавров» ["Sergey Lavrov"], «Украина» ["Ukraine"]
  • the pronouns are determined that could be anaphoric references to the corresponding named entities, and for those pronouns that actually are such references, the co-reference between the respective named entity and the anaphoric reference thereof (the fifth level elementary unit) is fixed.
  • the obtained set of the anaphoric references is represented in the Table 6.
  • the semantically meaningful relations between the named entities are determined using the rules.
  • the initial text will be marked with the set of annotations corresponding to the named entities with the attributes thereof and the named relations with the attributes thereof between the named entities.
  • the graphical representation of the text processing results is shown in Fig. 5.
  • the next step, marked with the reference number 12 in Fig. 1, is a technical one and is carried out to form the triples corresponding to the stored named entities and named relations.
  • the fragment of the set of such triples for the example under consideration is represented in the Table 8.
  • the formed set of triples contains the initial data for the semantic indexing of the text processed at the previous steps.
  • the semantic index is built as follows: first, from the set of the triples obtained at the previous step, the triple subsets are formed, each of which corresponds to one named entity with the attributes thereof, and every obtained triple subset is used as an input for one of the conventional indexers, for example, the well-known, freely available indexer Lucene, the indexer of the Yandex search engine, the Google indexer, or any other indexer, whose output is an index unique for the given triple subset.
  • the similar operation series is performed for all subsets of triples corresponding to the pairs of the kind "named entity - named relation" and to the triples of the kind "named entity - named relation - named entity", taking into account the attributes of the respective named entities and/or named relations, thereby obtaining a set of the corresponding unique indices which constitute, in the aggregate, the semantic index of the text.
  • the fragment of the semantic index for the example under consideration is represented in the Tables 9 to 11.
  • a set of continuous chains of the triples for the relation "The_same" is formed.
  • the check is performed whether the set of continuous chains of the triples obtained at the previous step is empty. If that set is not empty, then, sequentially, at the next steps (53-56), the set of objects for the next chain is formed (53), this set is convolved into the single object (54) having the combined set of the attributes (without repetitions), the obtained single object is stored together with the attributes thereof (55), and the set of the processed objects of the succeeding chain is removed (56). But if at the step 52 the set of the triple chains turns out to be empty (initially or as a result of performing the steps 53 to 55), then, at the step
  • the formed overall set of the triples is supplemented with the semantic indices and references to the initial text; after which, at the step 59, the supplemented set is stored in the database.
  • the processing of every subsequent text including its semantic index constructing is carried out by performing just the same steps as for the single text.
  • after the step 13 of indexing and prior to the step 15 of storing in the database, one more step marked with the reference number 14 in Fig. 1 is carried out: the step of combining the results of processing the succeeding text with the results of processing the previous texts already stored in the database, which step is carried out as follows.
  • the named objects and named relations newly derived in the succeeding text being indexed are compared with the named objects and named relations already existing in the database by checking the coincidence of the semantic indices thereof, and, in the case of a positive result of such comparison, the respective objects and relations are excluded from further processing, while the object and/or relation already existing in the database is supplemented with the reference to that text and that fragment of that text where the object and/or relation excluded from further processing is identified.
  • the similarity between the new objects and/or relations and those objects and/or relations that already exist in the database is identified using the linguistic and heuristic rules formed in advance in the database, and in the case of a positive result, the descriptions of the objects and/or relations already existing in the database are extended with the new data, after which the existing semantic indices are rebuilt and the new semantic indices are added as secondary ones to the already existing indices; moreover, the reference to that text and that fragment of that text where the new objects and/or relations are identified is stored in the object and/or relation already existing in the database, and then the respective objects and relations are excluded from further processing. Otherwise, the newly identified named objects and named relations together with the semantic indices thereof are added to the database.
  • the object will be identified, particularly, the object «Украина» ["Ukraine"], whose semantic index fully coincides with the semantic index of the object «украина» ["Ukraine"] already existing in the database, and, moreover, the similarity will be identified (by applying the rule whose diagram is shown in Fig.
  • the present invention provides for extending the set of methods for indexing natural language texts by employing techniques of computerized linguistic analysis thereof and subsequent use of the obtained results for building indices. The main difference of these methods from the known methods of indexing consists in indexing semantically meaningful concepts and relations therebetween rather than key words and lookups, which provides for semantic navigation through documents and document collections, as well as highly precise and quick search of facts and documents, especially with respect to texts in highly inflected languages.
  • Table 1 Results of tokenizing the example text
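
The token classes listed in the definitions above (words made of letters and optional hyphens, runs of spaces, punctuation marks, numbers, and mixed character series such as A300 or il50b) can be illustrated with a short Python sketch. The text names these classes but gives no tokenizer grammar, so the regular expression, the class labels (`space`, `alnum`, `number`, `word`, `punct`) and the `tokenize` function below are assumptions made only for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical token grammar; the alternatives are tried in this order.
TOKEN_RE = re.compile(
    r"(?P<space>\s+)"                                        # a series of spaces
    r"|(?P<alnum>(?=[^\W_]*\d)(?=[^\W_]*[^\W\d_])[^\W_]+)"   # letters and digits mixed, e.g. A300
    r"|(?P<number>\d+)"                                      # plain numbers
    r"|(?P<word>[^\W\d_]+(?:-[^\W\d_]+)*)"                   # letters with optional inner hyphens
    r"|(?P<punct>.)",                                        # any other single character
    re.DOTALL,
)


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Split text into (class, token) pairs that cover every character of the input."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for kind, token in tokenize("Airbus A300 costs 26, high-inflectional."):
        print(kind, repr(token))
```

Because the last alternative accepts any single character, the tokens always concatenate back to the original text, which keeps token offsets usable as references to text fragments.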

Abstract

The present invention relates to the information technologies field, namely, to methods of computerized semantic indexing of natural language texts. The use of the present invention permits extending the set of methods for indexing natural language texts by employing techniques of computerized linguistic analysis thereof and then using the obtained results for building indices, which ensures semantic navigation through documents and document collections, as well as highly precise and quick search of facts and documents relevant to the user's information needs, particularly with respect to texts in highly inflected languages. The method of computerized semantic indexing of natural language text comprises the steps of: segmenting the text in electronic form into tokens; identifying stable phrases; forming sentences; by addressing the linguistic and heuristic rules formed in the database in the predetermined linguistic environment, identifying the semantically meaningful objects (named entities) and the semantically meaningful relations therebetween (named relations); for every named relation, forming the set of triples, where the single first type triple corresponds to the relation established by the named relation between two named entities, each of the set of the second type triples corresponds to a value of a particular attribute of one of those entities, and each of the set of the third type triples corresponds to a value of a particular attribute of the named relation itself; on the set of the formed triples, indexing all named entities related by the named relations separately, all pairs of the kind "named entity - named relation", and all triples of the kind "named entity - named relation - named entity", while taking into account the attributes of the respective named entities and/or named relations; and storing in the database the formed triples and the obtained indices together with the reference to the initial text from which those triples have been formed.
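
The three types of triples named in the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions: the `NamedEntity` and `NamedRelation` classes, the `form_triples` function, the identifiers `O1`, `O2`, `R12`, and the sample attribute values are all hypothetical, and treating the assigned type as an ordinary attribute is a simplification that the source does not prescribe.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class NamedEntity:
    ident: str                                   # hypothetical identifier, e.g. "O1"
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class NamedRelation:
    ident: str                                   # hypothetical identifier, e.g. "R12"
    source: NamedEntity
    target: NamedEntity
    attributes: Dict[str, str] = field(default_factory=dict)


def form_triples(rel: NamedRelation) -> List[Triple]:
    """Return the triples of all three types for one named relation."""
    # First type: the single triple entity -> relation -> entity.
    triples: List[Triple] = [(rel.source.ident, rel.ident, rel.target.ident)]
    # Second type: entity -> attribute -> value, for both related entities.
    for entity in (rel.source, rel.target):
        for name, value in entity.attributes.items():
            triples.append((entity.ident, name, value))
    # Third type: relation -> attribute -> value.
    for name, value in rel.attributes.items():
        triples.append((rel.ident, name, value))
    return triples


if __name__ == "__main__":
    # Invented sample data, not taken from the patent's example text.
    person = NamedEntity("O1", {"type": "Person", "surname": "Ivanov"})
    org = NamedEntity("O2", {"type": "Organization"})
    works = NamedRelation("R12", person, org, {"type": "work"})
    for triple in form_triples(works):
        print(triple)
```

Each relation therefore yields exactly one first type triple plus one triple per attribute value of the two entities and of the relation itself.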

Description

METHOD OF COMPUTERIZED SEMANTIC INDEXING
OF NATURAL LANGUAGE TEXT,
METHOD OF COMPUTERIZED SEMANTIC INDEXING OF COLLECTION OF NATURAL LANGUAGE TEXTS, AND MACHINE-READABLE MEDIA
Field of the Invention
The present invention relates to the information technologies field, namely, to methods of computerized semantic indexing of natural language texts, as well as to machine-readable media comprising respective programs, and could be used for ordering and accumulating information in specified knowledge areas for the purpose of semantic navigation through documents and document collections, as well as for highly precise and quick search of facts and documents relevant to the user's information needs.
Background of the Invention
At present, various methods of the computerized indexing of natural language texts are known.
Thus, the EAPO Patent No. 002016 (published on 2001.01.22) describes a method where unique information blocks are detected in text document fragments and used for subsequent processing and searching. The RU Patent No. 2268488 (published on 2006.01.20), granted on the basis of the PCT Application published as WO 01/06414, discloses a method wherein words, stable phrases, idioms, sentences, and even ideas are coded for subsequent processing at the numerical level. The RU Patent No. 2273879 (published on 2006.04.10) describes a method wherein morphological and syntactic text analysis is performed, with subsequent indexing of the detected units. In the method of the US Patent No. 6,871,174 (published on 2005.03.22), a text similarity is determined by text fragments. The disadvantage of all those methods is that they do not take into account the semantic ambiguity of natural language words and expressions.
The US Patent No. 6,189,002 (published on 2001.02.13) discloses a method wherein a text is divided into paragraphs and words, which are converted into vectors of ordered elements. Each vector element for a paragraph is determined by applying a predetermined function to the number of occurrences, in that paragraph, of the word corresponding to that element. The text vector is considered as the semantic profile of the document. However, taking into account the variety of paragraphs, this method requires an enormous volume of stored data and does not resolve the semantic ambiguity of words and expressions.
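
The storage concern raised above can be made concrete with a small Python sketch of per-paragraph occurrence vectors. It is not the method of the US Patent No. 6,189,002 itself, only a reading of the description given here: the function name `paragraph_vectors`, the splitting of paragraphs on blank lines, the whitespace tokenization, and the choice of log(1 + n) as the "predetermined function" are all assumptions.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def paragraph_vectors(
    text: str,
    f: Callable[[int], float] = math.log1p,  # assumed stand-in for the "predetermined function"
) -> Tuple[List[str], List[List[float]]]:
    """Return the ordered vocabulary and one occurrence-based vector per paragraph."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    counts = [Counter(p.lower().split()) for p in paragraphs]
    vocabulary = sorted(set().union(*counts))
    # Element k of a paragraph vector is f(number of occurrences of word k in that paragraph).
    vectors = [[f(c[word]) for word in vocabulary] for c in counts]
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vecs = paragraph_vectors("gas price gas\n\nprice talks")
    print(vocab)
    print(vecs)
```

The result has one element per vocabulary word for every paragraph, so the stored data grows with the product of the two, and a word with several meanings still occupies a single position in the vector.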
Semantic ambiguity is taken into account in many known methods. For example, the RU Patent No. 2242048 (published on 2005.03.22), US Patents Nos. 6,871,199 (published on 2005.03.22), 7,024,407 (published on 2006.04.04), and 7,383,169 (published on 2008.06.03), US Patent Application Publications Nos. 2007/0005343 and 2007/0005344 (both published on 2007.01.04) and 2008/0097951 (published on 2008.04.24), JP Laid-open Applications Nos. 05-128149 (published on 1993.05.25), 06-195374 (published on 1994.07.15), 10-171806 (published on 1998.06.26), and 2005-182438 (published on 2005.07.07), and EP Patent Application No. 0853286 (published on 1998.07.15) describe methods wherein the ambiguity of the words and/or expressions found in texts is eliminated in one manner or another. However, all those methods have only restricted application and do not provide full-fledged semantic text indexing.
Closest to the claimed group of inventions is a method of computerized semantic indexing of natural language text or text collection, which method is disclosed in the US Patent Application No. 2007/0073533 (published on 2007.03.29). In that method, functional structures are determined for every text portion, and triples characterizing predicate members are determined in every functional structure based on the linearization transfer rules. Then, such features as named entity, co-reference, lexical entry, structural-semantic relationship, and attribution and meronymic information are extracted from every text portion. Next, for every text portion, on the basis of the determined constituent structure, canonical triple representations characterizing the predicative members are determined together with the derived feature representations, and a structural index is determined on the basis of the canonical representation of the text portion.
This method ensures good results, but yet is rather limited due to the fact that fragments of the predicative-argument structure derived during the syntactic analysis are linearized as the triples. Moreover, this method is directed to the search tasks only, rather than the tasks of navigation through a document collection.
Summary of the Invention
The object of the present invention consists in extending the set of methods for indexing natural language texts by employing techniques of computerized linguistic analysis thereof and then using the obtained results for building indices, which ensures semantic navigation through documents and document collections, as well as highly precise and quick search of facts and documents relevant to the user's information needs, particularly with respect to texts in highly inflected languages.
This object is achieved, and said technical result is obtained, using a method of computerized semantic indexing of natural language text and a method of computerized semantic indexing of a natural language text collection in accordance with the features of the enclosed independent claims 1 and 7, respectively. Variations of both methods are disclosed in the respective dependent claims.
Brief Description of the Drawings
The invention is explained by describing a particular embodiment thereof and the accompanying drawings, in which:
Fig. 1 depicts the schematic block-diagram explaining the claimed methods;
Fig. 2 depicts the fragment of the application domain specification;
Fig. 3 shows the rule schema for extracting the named entities of the type of "Person";
Fig. 4 shows the rule schema for extracting the semantically meaningful relationships of the type of "work";
Fig. 5 depicts the fragment of the graphical form for representing the results of the text processing;
Fig. 6 shows the general diagram for storing the results of processing one text;
Fig. 7 shows the diagram of the left-hand side of the rule for combining the named entities of the type of "Person".
Detailed Description of the Invention
The proposed methods allow effective conceptual indexing of natural language texts, both for subsequent semantic navigation through documents and document collections and for search purposes.
The method of computerized semantic indexing of natural language text according to the first aspect of the present invention and the method of computerized semantic indexing of a collection of natural language texts according to the second aspect of the present invention can be implemented in practically any computing environment, e.g., in a personal computer connected to external databases. The steps of those methods are illustrated in Fig. 1.
All further explanations refer to the Russian language, which is one of the most highly inflected languages, although the claimed methods are applicable to semantic text indexing in any natural language.
First of all, for subsequent computerized processing, the text to be indexed needs to be presented in electronic form. This step is conventionally marked in Fig. 1 with the reference number 1 and can be performed by any known method, for example, by scanning the text and then recognizing it using a well-known tool such as ABBYY FineReader. If the text comes for indexing from an electronic network, for example, from the Internet, then the step of representing it in electronic form has been carried out in advance, before the text was placed in the network.
The text converted into electronic form is then processed. It is first segmented into elementary units called tokens. A token can be any text object from the following set: a word consisting of a series of letters and, possibly, hyphens; a series of spaces; a punctuation mark; a number. Sometimes character series such as A300, il50b, etc. are also included. Token separation is always carried out in accordance with rather simple rules, for example, as in the mentioned US Patent Application No. 2007/0073533. In Fig. 1, this step is conventionally marked with the reference number 2. Hereinafter, tokens are considered the first level elementary units.
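The token classes listed above can be illustrated with a minimal sketch. The regular expression, the class names (`alnum`, `word`, `number`, `space`, `punct`) and the function `tokenize` are assumptions made for this illustration only; the description does not prescribe a particular tokenizer.

```python
import re
from typing import List, Tuple

# Illustrative token classes (names are assumptions, not taken from the description).
# Order matters: codes such as "A300" are tried before plain words and numbers.
TOKEN_PATTERN = re.compile(
    r"(?P<alnum>[^\W\d_]+\d[^\W_]*)"          # letters then a digit, e.g. A300, il50b
    r"|(?P<word>[^\W\d_]+(?:-[^\W\d_]+)*)"    # letters, optionally joined by hyphens
    r"|(?P<number>\d+(?:[.,]\d+)*)"           # digits, optionally with . or , inside
    r"|(?P<space>\s+)"                        # a series of spaces
    r"|(?P<punct>[^\w\s]|_)"                  # any single punctuation mark
)

def tokenize(text: str) -> List[Tuple[str, str, int, int]]:
    """Return (token class, token text, start offset, end offset) for each token."""
    return [(m.lastgroup, m.group(), m.start(), m.end())
            for m in TOKEN_PATTERN.finditer(text)]

if __name__ == "__main__":
    for kind, value, start, end in tokenize("Kiev, June 26: A300"):
        print(kind, repr(value), start, end)
```

Keeping the character offsets with each token lets later levels (morphs, stable phrases, named entities) refer back to the fragment of the initial text they were derived from.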
Moreover, for every token that is a word, respective second level elementary units, hereinafter called morphs, are formed based on morphological analysis. In so doing, the normalized word form is identified for every word. For example, for the word «иду» the normalized word form is «идти», for the word «красивого» it is «красивый», and for the word «стена» it is «стена». Moreover, for every word form, the part of speech to which the word belongs and its morphological characteristics are indicated. Naturally, those characteristics differ between parts of speech. For example, for nouns and adjectives these are gender (masculine, feminine, neuter), number (singular, plural), and case; for verbs these are aspect (perfective, imperfective), person, and number (singular, plural); etc. Thus, the morph of a given word is its normalized word form plus its morphological characteristics, including the part of speech. One and the same word can have several morphs. For example, the word «стекло» has two morphs: one for the neuter noun ["glass"] and one for the verb in the past tense ["flowed down"].
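As a hedged illustration of what a morph might look like as a data record, the sketch below stores the normalized form, the part of speech and the morphological characteristics together. The class `Morph`, the toy lexicon and the feature names are assumptions for this example; a real system would rely on a full morphological dictionary, and the Cyrillic spellings follow our reading of the examples in the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Morph:
    """Second level elementary unit: normalized form plus morphological characteristics."""
    lemma: str                               # normalized word form
    pos: str                                 # part of speech
    features: Tuple[Tuple[str, str], ...]    # (characteristic, value) pairs

# Toy lexicon for illustration only (hypothetical; not the dictionaries of the description).
TOY_LEXICON: Dict[str, List[Morph]] = {
    "стекло": [
        Morph("стекло", "noun",
              (("gender", "neuter"), ("number", "singular"), ("case", "nominative"))),
        Morph("стечь", "verb",
              (("aspect", "perfective"), ("tense", "past"),
               ("gender", "neuter"), ("number", "singular"))),
    ],
    "стена": [
        Morph("стена", "noun",
              (("gender", "feminine"), ("number", "singular"), ("case", "nominative"))),
    ],
}

def analyze(word: str) -> List[Morph]:
    """Return every morph recorded for the word; an unknown word yields an empty list."""
    return TOY_LEXICON.get(word.lower(), [])

if __name__ == "__main__":
    for morph in analyze("стекло"):
        print(morph.lemma, morph.pos, dict(morph.features))
```

Returning a list rather than a single record keeps the ambiguity explicit, so that later steps can resolve it using context.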
Those skilled in the art will appreciate that the operations of this and subsequent steps are carried out with the intermediate results being stored, for example, in random access memory (RAM).
The next step, conventionally marked with the reference number 3 in Fig. 1, consists in identifying stable phrases in the set of derived elementary units of the first two levels, tokens and morphs. This is performed by converting the elementary units, i.e., tokens and morphs, into series that are compared with the series of normalized words and their characteristics in dictionaries stored in advance in the databases, where words are given together with the grammatical associations between them. When a series being compared coincides with the corresponding dictionary series, it is recognized as a stable phrase and stored as such in the database as a third level elementary unit.
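The comparison of unit series with dictionary series can be sketched as a longest-match lookup over normalized word forms. This is a simplified illustration: the dictionary `PHRASES` and the function `find_stable_phrases` are hypothetical, and the check of grammatical associations mentioned above is deliberately left out.

```python
from typing import Dict, List, Tuple

# Hypothetical phrase dictionary keyed by series of normalized word forms.
PHRASES: Dict[Tuple[str, ...], str] = {
    ("prime", "minister"): "prime minister",
    ("state", "budget"): "state budget",
    ("moscow", "state", "university"): "Moscow State University",
}
MAX_LEN = max(len(key) for key in PHRASES)

def find_stable_phrases(lemmas: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, phrase) spans, preferring the longest match at each position."""
    found = []
    i = 0
    while i < len(lemmas):
        match = None
        # Try the longest candidate series first, down to two units.
        for n in range(min(MAX_LEN, len(lemmas) - i), 1, -1):
            key = tuple(lemmas[i:i + n])
            if key in PHRASES:
                match = (i, i + n, PHRASES[key])
                break
        if match:
            found.append(match)
            i = match[1]          # continue after the recognized phrase
        else:
            i += 1
    return found

if __name__ == "__main__":
    print(find_stable_phrases(
        ["the", "moscow", "state", "university", "and", "state", "budget"]))
```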
At the next step, indicated by the reference number 4 in Fig. 1, sentences corresponding to portions of the text being indexed are formed. Usually these are real sentences ending with a full stop, but in some cases it is convenient to treat a part of an ordinary sentence, say, an isolated element of an enumeration, as a sentence. Therefore, the sentences produced by this step do not always coincide with the sentences of the text in the ordinary sense.
This order of steps is chosen because identifying the stable phrases (lookups) before forming the sentences makes it possible, in some instances, to eliminate certain ambiguities before the text is analyzed in detail. For example, fixing the lookup «МГУ им. М.В. Ломоносова» ["Moscow State University named for M. V. Lomonosov"] eliminates the false end of sentence after the word «им», which in the general case is the pronoun ["them"] but here is an abbreviation of the word «имени» ["named for"]. After the step 4, the multistage semantic-syntactic analysis is carried out.
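The effect of fixing lookups before sentence formation can be shown with a small sketch in which full stops inside an already recognized lookup span are not treated as sentence ends. The function `split_sentences`, the representation of lookups as character spans and the sample sentence are assumptions for illustration.

```python
import re
from typing import List, Tuple

def split_sentences(text: str, protected: List[Tuple[int, int]]) -> List[str]:
    """Split at '.', '!' or '?' unless the mark lies inside a protected lookup span."""
    sentences = []
    start = 0
    for m in re.finditer(r"[.!?]", text):
        pos = m.start()
        if any(a <= pos < b for a, b in protected):
            continue                      # the mark belongs to a stable phrase
        if pos + 1 < len(text) and not text[pos + 1].isspace():
            continue                      # a terminator must be followed by whitespace
        sentence = text[start:pos + 1].strip()
        if sentence:
            sentences.append(sentence)
        start = pos + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

if __name__ == "__main__":
    text = "Он окончил МГУ им. М.В. Ломоносова. Потом он уехал."
    lookup = "МГУ им. М.В. Ломоносова"
    begin = text.find(lookup)
    print(split_sentences(text, [(begin, begin + len(lookup))]))
    print(split_sentences(text, []))      # without the lookup, «им.» ends a sentence
```

Without the protected span, the sketch cuts the text after «им.», which is exactly the false sentence end that the order of steps is meant to avoid.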
In Fig. 1, the multistage semantic-syntactic analysis is conventionally divided into steps marked with the reference numbers 5 to 11. This analysis is carried out by addressing the linguistic and heuristic rules formed in the database in the predetermined linguistic environment. Such an environment could be, for example, the linguistic environment mentioned in the above RU Patent No. 2242048, or the environment disclosed in said US Patent Application No. 2007/0073533, or any other linguistic environment defining respective rules that make it possible to eliminate the syntactic and semantic ambiguity of real text words and expressions. The linguistic and heuristic rules in the chosen environment are hereinafter referred to as rules.
During said multistage semantic-syntactic analysis, the semantically meaningful objects, hereinafter referred to as named entities (the reference number 5 in Fig. 1), and their attributes (the reference numbers 7 and 9 in Fig. 1) are identified. The named entities, which are considered the fourth level elementary units, are identified within the sentence in the set of elementary units of the first, second, and/or third levels. In this case, the morphological attributes of every named entity are formed using said rules from the morphological attributes of those elementary units of the second and/or third levels (i.e., morphs and/or stable phrases) which constitute this named entity. Moreover, the semantic attributes of every named entity are formed using said rules from the semantic attributes of the elementary units of the second and/or third levels which constitute this named entity. The step of forming said attributes is conventionally marked with the reference number 7. At the step marked with the reference number 9 in Fig. 1, every named entity is assigned a respective type from the application ontology according to the topics of the application domain to which the text being indexed relates.
The application ontology means, in this case, the specification of the particular application domain, which is stored in the respective database. For every named entity, i.e., for every fourth level elementary unit with a type assigned to it, the corresponding anaphoric reference (if any), considered as a fifth level elementary unit, is determined. For example, in the sentence «Когда Путина назначили премьер-министром, он сформировал правительство» ["When Putin was appointed prime minister, he formed the government"], the anaphoric reference to the word «Путин» [«Путина»] is the pronoun «он» ["he"], while the word «Путин» is the antecedent of that anaphor. This step of determining the anaphoric reference is conventionally marked with the reference number 11 in Fig. 1. After that, every identified named entity is stored in the respective memory together with the type assigned to it and the morphological and semantic attributes determined for it. The anaphoric reference is stored together with the type and attributes of the named entity that is its antecedent, as well as with an indication of the co-reference between that named entity and its anaphoric reference.
After the steps marked with the reference numbers 5, 7, 9 and 11 in Fig. 1 have been performed, the semantically meaningful relations between the named entities, hereinafter referred to as named relations, are determined based on the elementary units of the first, second, third, fourth and/or fifth levels using said rules (the step 6). The named relations can relate named entities both within one sentence and across the entire text being indexed.
At the step marked with the reference number 8 in Fig. 1, for every named relation, the morphological attributes are determined using said rules from the second level elementary units (i.e., morphs) constituting this relation, and the semantic attributes are determined from the elementary units of the first, second, third and/or fourth levels constituting this relation.
At the step marked with the reference number 10 in Fig. 1, every named relation is assigned a respective type from the application ontology stored in the database according to the topics of the application domain to which the text being indexed relates. After that, every named relation is stored in the respective memory together with the type assigned to it and the morphological and semantic attributes determined for it.
At the step marked with the reference number 12 in Fig. 1, the stored named entities and named relations are used for forming the triples. In so doing, a set of triples of three types is formed within the text being indexed for each of the identified named relations relating certain named entities. The single first type triple corresponds to the relation established by the named relation between two named entities. Each of the second type triples corresponds to the value of a particular attribute of one of those entities, and each of the third type triples corresponds to the value of a particular attribute of the named relation itself. If two named entities are labeled $O_i$ and $O_j$, and the named relation relating them is labeled $R_{ij}$, then the first type triple can be represented as $O_i \to R_{ij} \to O_j$. Each of the second type triples can be represented as $O_i \to A_{im} \to V_{im}$ or $O_j \to A_{jn} \to V_{jn}$, where $A_{im}$ and $A_{jn}$ are the respective attributes, and $V_{im}$ and $V_{jn}$ are, respectively, the values of those attributes. Similarly, each of the third type triples can be represented as $R_{ij} \to A_{ijk} \to V_{ijp}$, where $A_{ijk}$ is the respective attribute and $V_{ijp}$ is the value of that attribute. In these notations, the indices i, j, k, m, n, and p are integers.
Then, at the step marked with the reference number 13 in Fig. 1, the text indexing is carried out. For this purpose, all named entities related by the named relations (each separately), all pairs of the kind "named entity - named relation", and all triples of the kind "named entity - named relation - named entity" are indexed taking into account the attributes of the respective named entities and/or named relations. The triples formed at the step 12 and the indices obtained at the step 13, together with the reference to the initial text from which those triples have been formed, are stored in the database (the step 15 in Fig. 1; the step 14 is omitted in this case).
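The three triple types can be made concrete with a short sketch. The classes `Entity` and `Relation`, the function `build_triples` and the sample attribute names are assumptions for illustration; the values are loosely modeled on the example text and are not taken from the tables of the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

@dataclass
class Entity:
    id: str
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class Relation:
    id: str
    source: Entity
    target: Entity
    attributes: Dict[str, str] = field(default_factory=dict)

def build_triples(rel: Relation) -> Tuple[List[Triple], List[Triple], List[Triple]]:
    """Return the (first type, second type, third type) triples for one named relation."""
    first = [(rel.source.id, rel.id, rel.target.id)]             # O_i -> R_ij -> O_j
    second = [(e.id, name, value)                                # O -> A -> V
              for e in (rel.source, rel.target)
              for name, value in e.attributes.items()]
    third = [(rel.id, name, value)                               # R_ij -> A -> V
             for name, value in rel.attributes.items()]
    return first, second, third

if __name__ == "__main__":
    # Illustrative values only.
    person = Entity("O1", {"type": "Person", "name": "Sergey Lavrov"})
    org = Entity("O2", {"type": "Organization", "name": "MFA of Russia"})
    work = Relation("R12", person, org, {"type": "work", "position": "head"})
    for group in build_triples(work):
        print(group)
```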
Prior to storing, a convolution is performed (not shown in Fig. 1) in which the objects related by co-reference relations are merged into a single object whose set of attributes is the combination of the attributes of all the objects interrelated by the co-reference relations. This reduces the database memory required for storing such objects and integrates under one object the information obtained over the entire text.
The method of computerized semantic indexing of a collection of natural language texts according to the second aspect of the present invention is carried out exactly as the already discussed method of computerized semantic indexing of natural language text according to the first aspect of the present invention, but in this case, after the step 13 of indexing and prior to the step 15 of storing in the database, one more step is performed. At this step, marked with the reference number 14 in Fig. 1 and performed substantially simultaneously with the step 15, the following is carried out when the formed triples and the obtained semantic indices of the succeeding text are stored in the database. The newly derived named entities and named relations are compared with the named entities and named relations already existing in the database, using the linguistic and heuristic rules formed in the database in the predetermined linguistic environment. When similar named entities and/or named relations are identified, the duplicated information is not stored in the database, and the respective named entities and/or named relations are supplemented with references to the succeeding texts where they are present and with references to the text fragments within each of the succeeding texts from which they are derived.
As a result, each succeeding text of the collection is indexed in the same way as the first text of this collection (or the first text indexed by this method), which makes it possible to simplify the entire indexing procedure significantly, to reduce the required memory volume, and to integrate the information obtained from different texts within a single object.
It should be apparent to those skilled in the art that the storage devices mentioned at individual steps could in practice be either different devices or a single storage device of sufficient capacity. Similarly, the individual databases mentioned at the respective steps could be not only physically separate databases but also a single database. Furthermore, said storage devices (memories) could be implemented on the same single database, or combined with one of said databases. Those skilled in the art will also appreciate that the methods claimed in the present invention are carried out in the respective computing environment under the control of appropriate programs, which are recorded on machine-readable media intended for direct use in computer operation. Therefore, the machine-readable media with such programs are also aspects of the present invention.
Example
To illustrate the embodiment of the claimed method of computerized semantic indexing of natural language texts, consider the following example. Let there be a set of Russian texts formed from the news feeds of electronic mass media. Thus, the step of converting the texts into electronic form, marked in Fig. 1 with the reference number 1, can be considered already performed.
A representative example of such text is the following message:
«Центральный федеральный [Figure imgf000012_0001] за 26.06.08
Ющенко поручил Тимошенко выведать у Путина цену на газ
26.06.08 14:00
Киев, Июнь 26 (Новый Регион, Михаил Рябов) — Президент Украины Виктор Ющенко заявил, что еще вчера утвердил директивы для переговоров премьер-министра Юлии Тимошенко с российским коллегой Владимиром Путиным в Москве 28 июня.
По словам Ющенко, Тимошенко во время переговоров с Путиным должна определить «четкую формулу цены на газ», причем, «чтобы это была цена не политическая, это была цена экономическая».
«Это значит, что обеим сторонам понятны зависимости, на базе которых формируется цена на газ в очередном году», — пояснил Ющенко.
Он дал понять, что в случае повышения цены на газ Украина будет добиваться пересмотра стоимости транзита российского газа в Европу.
«Тут должна быть коррекция и увязка политики цены на газ и политики цены на тариф», — заявил президент Украины.
По словам Ющенко, Тимошенко должна добиться, чтобы формула цены на газ для Украины в 2009 году стала известна не позднее 15 сентября 2008 года, чтобы украинская сторона смогла учесть новые расценки в госбюджете на следующий год.
Как сообщал «Новый Регион», ранее глава МИД России Сергей Лавров заявлял, что цена на газ для Украины в 2009 году может вырасти вдвое.».
["Central Federal District / Publication from 06.26.2008
Uschenko has instructed Timoshenko to ferret out the gas price from Putin.
06.26.2008, 2:00 PM
Kiev, June 26 (New Region, Michail Ryabov) - The Ukrainian President Victor Uschenko has stated that as early as yesterday he approved the directives for the negotiations of the Prime Minister Yulia Timoshenko with her Russian colleague Vladimir Putin in Moscow on June 28.
According to Uschenko, Timoshenko, during the negotiations with Putin, must define a "clear formula of the price for gas", and "this price must be an economic price rather than a political price".
"This means that both sides understand the dependencies on the basis of which the price for gas is formed in the next year", Uschenko explained.
He gave to understand that, in the case of an increase in the price for gas, Ukraine will seek a revision of the cost of transiting Russian gas to Europe.
"Here there must be a correction and a matching of the gas price policy and the tariff price policy", stated the Ukrainian President.
According to Uschenko, Timoshenko must ensure that the formula of the price for gas for Ukraine in 2009 becomes known not later than September 15, 2008, so that the Ukrainian party can take the new rates into account in the state budget for the next year.
As the "New Region" informed, the head of the Russian Foreign Office Sergey Lavrov said earlier that the price for gas for Ukraine in 2009 could double."]
In accordance with the claimed method of computerized semantic indexing of natural language texts, a previously created application domain specification is used, within which the text collection is processed and the semantic index is constructed. A fragment of such a specification is depicted in Fig. 2. Such specifications are developed by human experts who record, based on their experience and knowledge, a list of object types and a list of typical relations between them that are essential for the application domain. In the example cited, the main object types are "Person", "Organization", "Location", and some others. Typical relations between them fall into two classes: common relations, peculiar to any domain, for example, the relation "BEJEXAMPLIFIED", which indicates the object hierarchy of the "descendant-ancestor" type; and special relations, specific to the chosen application domain, which in the cited example are the typical relations "work", "own", "located_in", etc.
Moreover, the human experts also build in advance a set of rules. Each rule contains, on the left-hand side, a template for searching for instances of objects and/or instances of relations between them, and on the right-hand side, operators for fixing in the text the instances of objects and/or relations found in accordance with the template. Using those rules prepared by the human experts, the specific data corresponding to the domain specification are derived from the texts being processed. Besides the domain specification and the rules, common and special lexicons are used in accordance with the above methods.
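The two-sided structure of such a rule can be sketched as a template paired with an action. The sketch below is only an analogy: it uses a regular expression as the left-hand side and a function producing an annotation as the right-hand side, whereas the actual rule formalism (Figs. 3 and 4) is not reproduced here. The names `Rule`, `Annotation`, `FIRST_NAMES` and `person_rule` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Annotation:
    type: str
    start: int
    end: int
    text: str

@dataclass
class Rule:
    name: str
    template: re.Pattern                        # left-hand side: what to search for
    action: Callable[[re.Match], Annotation]    # right-hand side: how to fix the find

# Hypothetical mini-lexicon of first names used by the template.
FIRST_NAMES = ("Sergey", "Victor", "Yulia", "Vladimir")

person_rule = Rule(
    name="person_first_last",
    template=re.compile(r"\b(?:%s) [A-Z][a-z]+\b" % "|".join(FIRST_NAMES)),
    action=lambda m: Annotation("Person", m.start(), m.end(), m.group()),
)

def apply_rules(text: str, rules: List[Rule]) -> List[Annotation]:
    """Run every rule over the text and collect the annotations its action produces."""
    return [rule.action(m) for rule in rules for m in rule.template.finditer(text)]

if __name__ == "__main__":
    sentence = "the head of the MFA of Russia Sergey Lavrov told earlier"
    for ann in apply_rules(sentence, [person_rule]):
        print(ann)
```

Separating the template from the action mirrors the division described above: experts can change what is searched for without touching how the result is recorded.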
In accordance with the claimed method of computerized semantic indexing of natural language texts, the step of segmenting the text into elementary units, tokens, is performed first, together with the morphological analysis of the token-words (reference number 2 in Fig. 1). As a result of this step, the initial text is transformed into a set of tokens and morphs, which are represented in Table 1 and Table 2, respectively.
Further, after the text has been segmented into tokens and the morphological analysis of the token-words has been performed, the step of deriving the stable phrases (lookups) is carried out using the common and special lexicons (reference number 3 in Fig. 1). As a result of this step, the initial text is supplemented, besides the first and second level elementary units, with a set of the third level elementary units, lookups. The fragment of this set for the above example is represented in Table 3.
After the above steps, the text being processed is divided into sentences (reference number 4 in Fig. 1). As a result of this step, the sets formed at the above steps are supplemented with a set of sentences, represented in Table 4. Thus, after all the above steps, the text being processed is segmented into sentences, each of which is marked with a plurality of annotations of the first, second and third level elementary units.
Next, in accordance with the claimed method of computerized semantic indexing of natural language texts, the step of deriving the named entities (fourth level elementary units) is carried out on the set of elementary units of the first, second and/or third levels using said rules. Thus, for example, in the sentence «Как сообщал «Новый Регион», ранее глава МИД России Сергей Лавров заявлял, что цена на газ для Украины в 2009 году может вырасти вдвое.» ["As the "New Region" informed, the head of the MFA of Russia Sergey Lavrov said earlier that the price for gas for Ukraine in 2009 could double."] of the text being considered, the named entities «Новый Регион» ["New Region"], «Сергей Лавров» ["Sergey Lavrov"], «Украина» ["Ukraine"], and some others are derived using the rule whose diagram is represented in Fig. 3. As a result of performing the steps marked with the reference numbers 5, 7 and 9 in Fig. 1, the fourth level elementary units (named entities together with their attributes) are derived. The fragment of the set of such units for the example under consideration is represented in Table 5.
After that, within the entire text being processed, during the step marked in Fig. 1 with the reference number 11, the pronouns that could be anaphoric references to the corresponding named entities are determined, and for those pronouns that really are such references, the co-reference between the respective named entity and its anaphoric reference (the fifth level elementary unit) is fixed. For the example under consideration, the obtained set of anaphoric references is represented in Table 6.
After the above steps, on the set of the derived first, second, third, fourth and fifth level elementary units, the semantically meaningful relations between the named entities are determined using the rules. Thus, for example, in the sentence «Как сообщал «Новый Регион», ранее глава МИД России Сергей Лавров заявлял, что цена на газ для Украины в 2009 году может вырасти вдвое.» ["As the "New Region" informed, the head of the MFA of Russia Sergey Lavrov said earlier that the price for gas for Ukraine in 2009 could double."] of the text being considered, the named relation «работать» ["work"] is derived using the rule whose diagram is represented in Fig. 4. As a result of performing the steps marked with the reference numbers 6, 8 and 10 in Fig. 1, the set of named relations between the named entities is derived, the fragment of which for the example under consideration is represented in Table 7.
Thus, after all the above steps, the initial text is marked with the set of annotations corresponding to the named entities with their attributes and to the named relations between the named entities with their attributes. For the example under consideration, the graphical representation of the text processing results is shown in Fig. 5.
The next step, marked with the reference number 12 in Fig. 1, is a technical one and is carried out to form the triples corresponding to the stored named entities and named relations. The fragment of the set of such triples for the example under consideration is represented in Table 8. In fact, the formed set of triples contains the initial data for the semantic indexing of the text processed at the previous steps.
At the step marked with the reference number 13 in Fig. 1, the semantic index is built as follows. First, from the set of triples obtained at the previous step, triple subsets are formed, each of which corresponds to one named entity with its attributes. Every obtained triple subset is used as an input to one of the conventional indexers, for example, the well-known freeware indexer Lucene, the indexer of the Yandex search engine, the Google indexer, or any other indexer, whose output is an index unique for the given triple subset. A similar series of operations is performed for all subsets of triples corresponding to the pairs of the kind "named entity - named relation" and to the triples of the kind "named entity - named relation - named entity", taking into account the attributes of the respective named entities and/or named relations, thereby obtaining a set of corresponding unique indices which together constitute the semantic index of the text. The fragment of the semantic index for the example under consideration is represented in Tables 9 to 11.
At the step marked with the reference number 15 in Fig. 1, the triples formed at the step 12 and the indices obtained at the step 13, together with the reference to the initial text from which those triples have been formed, are stored in the database; the step 14 is omitted when a single text is processed. The general diagram of storing all results obtained at the previous steps is represented in Fig. 6.
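The grouping of index entries into the three kinds named above (entity, entity-relation pair, full triple) can be sketched with a toy in-memory inverted index. This is a stand-in and not one of the indexers mentioned in the description: the class `ToySemanticIndex` and its key layout are assumptions, and attributes are omitted for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

class ToySemanticIndex:
    """Toy inverted index: maps entity, pair and triple keys to text identifiers."""

    def __init__(self) -> None:
        self.postings: Dict[tuple, Set[str]] = defaultdict(set)

    def add(self, text_id: str, first_type_triples: Iterable[Triple]) -> None:
        for subject, relation, obj in first_type_triples:
            keys = (
                ("entity", subject),
                ("entity", obj),
                ("pair", subject, relation),     # "named entity - named relation"
                ("pair", obj, relation),
                ("triple", subject, relation, obj),
            )
            for key in keys:
                self.postings[key].add(text_id)

    def lookup(self, *key: str) -> List[str]:
        """Return the sorted identifiers of the texts in which the key occurs."""
        return sorted(self.postings.get(key, ()))

if __name__ == "__main__":
    index = ToySemanticIndex()
    index.add("text-1", [("Sergey Lavrov", "work", "MFA of Russia")])
    print(index.lookup("entity", "Sergey Lavrov"))
    print(index.lookup("pair", "Sergey Lavrov", "work"))
    print(index.lookup("triple", "Sergey Lavrov", "work", "MFA of Russia"))
```

Indexing entities, pairs and triples under separate keys is what allows a query to ask either for everything about an object or for a specific fact linking two objects.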
Referring to Fig. 6, at the first step (51), a set of continuous chains of the triples for the relation "The_same" is formed. At the next step (52), a check is performed whether the set of continuous chains of triples obtained at the previous step is empty. If that set is not empty, then, sequentially, at the next steps (53-56), the set of objects for the next chain is formed (53), this set is convolved into a single object (54) having the combined set of attributes (without repetitions), the obtained single object is stored together with its attributes (55), and the set of the processed objects of that chain is removed (56). But if at the step 52 the set of triple chains turns out to be empty (initially or as a result of performing the steps 53 to 55), then, at the step 57, the overall set of triples obtained at all previous steps is formed; at the step 58, the formed overall set of triples is supplemented with the semantic indices and references to the initial text; after which, at the step 59, the supplemented set is stored in the database.
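The convolution of co-referent objects can be sketched as follows. Instead of the chain-by-chain loop of Fig. 6, this illustration uses a union-find structure to group objects linked by "The_same" pairs and then merges their attribute sets without repetitions; the function `convolve`, the object identifiers and the sample attributes are assumptions.

```python
from typing import Dict, List, Set, Tuple

Attributes = Dict[str, Set[str]]

def convolve(objects: Dict[str, Attributes],
             same: List[Tuple[str, str]]) -> Dict[str, Attributes]:
    """Merge objects linked by "The_same" pairs into single objects with combined attributes."""
    parent = {obj_id: obj_id for obj_id in objects}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path compression
            x = parent[x]
        return x

    for a, b in same:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller identifier as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    merged: Dict[str, Attributes] = {}
    for obj_id, attrs in objects.items():
        target = merged.setdefault(find(obj_id), {})
        for name, values in attrs.items():
            target.setdefault(name, set()).update(values)   # sets drop repetitions
    return merged

if __name__ == "__main__":
    objects = {
        "e1": {"type": {"Person"}, "name": {"Putin"}},
        "e2": {"type": {"Person"}, "pronoun": {"he"}},
        "e3": {"type": {"Location"}, "name": {"Ukraine"}},
    }
    print(convolve(objects, [("e1", "e2")]))
```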
In accordance with the claimed method of computerized semantic indexing of a collection of natural language texts, every subsequent text is processed, including the construction of its semantic index, by performing exactly the same steps as for a single text. However, in this case, after the step 13 of indexing and prior to the step 15 of storing in the database, one more step, marked with the reference number 14 in Fig. 1, is carried out: the step of combining the results of processing the succeeding text with the results of processing the previous texts already stored in the database. This step is carried out as follows.
The named objects and named relations newly derived in the succeeding text being indexed are compared with the named objects and named relations already existing in the database by checking whether their semantic indices coincide. If they do, the respective objects and relations are excluded from further processing, and the object and/or relation already existing in the database is supplemented with a reference to the text, and to the fragment of that text, where the excluded object and/or relation was identified. If the semantic indices do not coincide, the similarity between the new objects and/or relations and those already existing in the database is checked using the linguistic and heuristic rules formed in advance in the database. If similarity is found, the descriptions of the objects and/or relations already existing in the database are widened with the new data, after which the existing semantic indices are rebuilt and the new semantic indices are added as secondary ones to the already existing indices; moreover, the object and/or relation already existing in the database is supplemented with a reference to the text, and to the fragment of that text, where the new objects and/or relations were identified, and then the respective objects and relations are excluded from further processing. Otherwise, the newly identified named objects and named relations, together with their semantic indices, are added to the database.
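The three outcomes of the step 14 (coinciding index, similar object, new object) can be sketched as below. The similarity test `similar_person` is a deliberately crude, hypothetical stand-in for the linguistic and heuristic rules (such as the rule of Fig. 7), the semantic index is reduced to a plain string, and the rebuilding of existing indices is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class StoredObject:
    index: str                                   # primary semantic index (simplified)
    attributes: dict
    secondary_indices: set = field(default_factory=set)
    references: list = field(default_factory=list)

def similar_person(existing_index: str, new_index: str) -> bool:
    """Hypothetical rule: same first and last name parts, e.g. with a patronymic added."""
    a, b = existing_index.split(), new_index.split()
    return len(a) >= 2 and len(b) >= 2 and a[0] == b[0] and a[-1] == b[-1]

def merge(db: Dict[str, StoredObject], index: str, attributes: dict, reference: str) -> str:
    """Combine one newly derived object with the database; return what was done."""
    for obj in db.values():
        if index == obj.index or index in obj.secondary_indices:
            obj.references.append(reference)      # indices coincide: store only the reference
            return "duplicate"
    for obj in db.values():
        if similar_person(obj.index, index):
            obj.attributes.update(attributes)     # widen the existing description
            obj.secondary_indices.add(index)      # keep the new index as a secondary one
            obj.references.append(reference)
            return "widened"
    db[index] = StoredObject(index, dict(attributes), references=[reference])
    return "added"

if __name__ == "__main__":
    db: Dict[str, StoredObject] = {}
    print(merge(db, "Ukraine", {"type": "Location"}, "text-1"))
    print(merge(db, "Victor Uschenko", {"type": "Person"}, "text-1"))
    print(merge(db, "Ukraine", {"type": "Location"}, "text-2"))
    print(merge(db, "Victor Andreyevich Uschenko", {"patronymic": "Andreyevich"}, "text-2"))
```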
Thus, for example, suppose that the next text processed after the example just considered is the following: «Киев, Июль 21 (Новый Регион, Анна Сергеева) - Сегодня в Киеве состоялись двусторонние переговоры руководства Украины и Германии. Встреча Виктора Ющенко и Ангелы Меркель в формате с глазу на глаз длилась 15 минут, а затем начались переговоры в расширенном формате, которые затянулись на полтора часа... Виктор Андреевич Ющенко [Figure imgf000019_0001] было на фоне бойкой и активной Меркель, которая быстро и много говорила...» ["Kiev, July 21 (New Region, Anna Sergeyeva) - Today in Kiev, bilateral negotiations took place between the Ukrainian and German leaders. The meeting of Victor Uschenko and Angela Merkel in a tête-à-tête format lasted for 15 minutes, and then the negotiations in the extended format began, which lasted for an hour and a half... Victor Andreyevich Uschenko seemed rather slow against the lively and active Merkel, who spoke quickly and much..."]. Then, after the steps 1 to 13 have been performed, such objects and relations will be identified as, for example, «Новый Регион» ["New Region"], «Украина» ["Ukraine"], «Виктор Ющенко» ["Victor Uschenko"], «Ангела Меркель» ["Angela Merkel"], «Виктор Андреевич Ющенко» ["Victor Andreyevich Uschenko"], «Встречи_переговоры» ["meetings_negotiations"], etc., and the semantic index of this text will be formed, a fragment of which is represented in Tables 12 to 14.
Further, in accordance with the claimed method of computerized semantic indexing of a collection of natural language texts, at step 14 an object will be identified, in particular the object «Украина» ["Ukraine"], whose semantic index fully coincides with the semantic index of the object «Украина» ["Ukraine"] already existing in the database. Moreover, similarity will be identified (by applying the rule whose diagram is shown in Fig. 7) between the object «Виктор Андреевич Ющенко» ["Victor Andreyevich Uschenko"] and the object «Виктор Ющенко» ["Victor Uschenko"] already existing in the database, whereafter the existing description of the object «Виктор Ющенко» ["Victor Uschenko"] in the database will be widened with the new information and an additional semantic index, as shown in Tables 15 and 16.
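The three outcomes of step 14 described above (coinciding index, similarity found by a rule, or a genuinely new object) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Entity` class, the `similar` function (a hypothetical stand-in for a rule such as the one in Fig. 7) and the index strings are invented for the example, and the reconstitution of existing indices is not modeled; the new index is simply appended as a secondary one.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A named object with attributes, semantic indices and source references."""
    attributes: dict
    indices: list                                    # primary index first, secondary ones after
    references: list = field(default_factory=list)  # (text id, fragment) pairs


def similar(existing: Entity, new: Entity) -> bool:
    """Hypothetical similarity rule: type, names and gender must all be present and equal."""
    keys = ("type", "Family_name", "First_name", "Gender")
    return all(
        k in existing.attributes and existing.attributes[k] == new.attributes.get(k)
        for k in keys
    )


def merge_into_database(database: list, new: Entity) -> Entity:
    """Apply the comparison step to one newly derived object; return the stored object."""
    # Case 1: a semantic index coincides, so only the reference is recorded.
    for stored in database:
        if set(stored.indices) & set(new.indices):
            stored.references.extend(new.references)
            return stored
    # Case 2: a similarity rule fires, so the description is widened, the new
    # index is kept as a secondary one, and the reference is recorded.
    for stored in database:
        if similar(stored, new):
            for key, value in new.attributes.items():
                stored.attributes.setdefault(key, value)
            stored.indices.extend(i for i in new.indices if i not in stored.indices)
            stored.references.extend(new.references)
            return stored
    # Case 3: nothing matches, so the object is added to the database as new.
    database.append(new)
    return new
```

With a stored "Victor Uschenko" object and a new "Victor Andreyevich Uschenko" object, the second case fires: the stored description gains the patronymic and carries both indices, which mirrors the before and after states of Tables 15 and 16.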
Thus, the present invention extends the set of methods for indexing natural language texts by employing techniques of computerized linguistic analysis and subsequently using the obtained results to build indices. The main difference of these methods from the known methods of indexing consists in indexing semantically meaningful concepts and the relations between them rather than key words and lookups, which provides for semantic navigation through documents and document collections, as well as for highly precise and quick search of facts and documents, especially for texts in highly inflectional languages.
Table 1. Results of tokenizing the example text
Figure imgf000021_0001
Figure imgf000021_0002
Figure imgf000021_0003
Figure imgf000022_0001
Table 2. Results of the morphological analysis of the example text
Figure imgf000023_0001
Figure imgf000024_0001
Table 3. Results of identifying the lookups in the text
Figure imgf000025_0001
Figure imgf000025_0002
Figure imgf000025_0003
Figure imgf000025_0004
Figure imgf000025_0005
Figure imgf000026_0001
Figure imgf000026_0002
Figure imgf000026_0003
Figure imgf000026_0004
Table 4. Results of segmenting the text into the sentences
Figure imgf000026_0005
Figure imgf000027_0001
Table 5. Results of identifying the named entities in the text
Figure imgf000027_0002
Figure imgf000027_0003
Figure imgf000027_0004
Figure imgf000028_0001
Table 6. Results of identifying the co-references in the text
Figure imgf000028_0002
Table 7. Results of identifying the semantically meaningful relations in the text
Figure imgf000028_0003
Figure imgf000028_0004
Figure imgf000028_0005
Figure imgf000029_0001
Table 8. Triple representation of the text processing results
Figure imgf000030_0001
Figure imgf000031_0001
Figure imgf000031_0002
Table 9. Fragment of the semantic index of the text (named entities)
Figure imgf000031_0003
Person Виктор Ющенко [Victor Uschenko]; First_name Виктор [Victor]; Initial_letter В. [V.]; Gender m; Family_name Ющенко [Uschenko]
Semantic index: a_7986d80e781ce95b_253defa0906d86a3
Figure imgf000031_0004
Table 10. Fragment of the semantic index of the text (pairs of the kind "named entity - named relation")
Figure imgf000031_0005
Person Сергей Лавров [Sergey Lavrov]; First_name Сергей [Sergey]; Initial_letter С. [S.]; Gender m
Semantic index: b_348eda6c3d9a407_3ebd0c4b59f6bdd0
Figure imgf000032_0001
Table 11. Fragment of the semantic index of the text (triples of the kind "named entity - named relation - named entity")
Triple (object - relation - object) Semantic index (unique code)
Figure imgf000032_0002
Figure imgf000033_0001
Table 12. Fragment of the semantic index of the new text (named entities)
Figure imgf000033_0002
Figure imgf000033_0003
Figure imgf000034_0001
Table 13. Fragment of the semantic index of the new text (pairs of the kind "named entity - named relation")
Figure imgf000034_0002
Figure imgf000034_0003
Table 14. Fragment of the semantic index of the new text (triples of the kind "named entity - named relation - named entity")
Triple (object - relation - object) Semantic index (unique code)
Person Виктор Ющенко [Victor Uschenko]; First_name Виктор [Victor]; Initial_letter В. [V.]; Gender m; Family_name Ющенко [Uschenko]; Relation Встречи_и_переговоры [meetings_and_negotiations]; Person Ангела Меркель [Angela Merkel]; First_name Ангела [Angela]; Ini-
Semantic index: c_7986d80e781ce95b b5b8f5d8c84d5db1
Figure imgf000035_0001
Table 15. Fragment of the semantic index of the named entities for collection from two texts prior to combining such objects
Figure imgf000035_0002
Person Виктор Ющенко [Victor Uschenko]; First_name Виктор [Victor]; Initial_letter В. [V.]; Gender m; Family_name Ющенко [Uschenko]
Semantic index: a_7986d80e781ce95b_253defa0906d86a3
Person Виктор Андреевич Ющенко [Victor Andreevich Uschenko]; First_name Виктор [Victor]; Patronymic Андреевич [Andreevich]; Initial_letter В. [V.]; Initial_letter2 А. [A.]; Gender m; Family_name Ющенко [Uschenko]
Semantic index: a_7cd1659a3047c3bc_6843f8d1284fdcf0
Table 16. Fragment of the semantic index of the named entities for collection from two texts after combining such objects
Figure imgf000035_0003
Person Виктор Андреевич Ющенко [Victor Andreevich Uschenko]; First_name Виктор [Victor]; Patronymic Андреевич [Andreevich]; Initial_letter В. [V.]; Initial_letter2 А. [A.]; Gender m; Family_name Ющенко [Uschenko]
Semantic indices: a_7986d80e781ce95b_253defa0906d86a3, a_7cd1659a3047c3bc_6843f8d1284fdcf0

Claims
1. A method of computerized semantic indexing of a natural language text, the method comprising the steps of:
- presenting the text to be indexed in an electronic form for subsequent computerized processing;
- segmenting the text in the electronic form into elementary units, hereinafter referred to as tokens;
- identifying stable phrases in the text during linguistic analysis;
- forming sentences corresponding to portions of the text;
- in every sentence having the identified stable phrases, during multistage semantic-syntactic analysis by addressing the linguistic and heuristic rules formed in the database in the predetermined linguistic environment, hereinafter referred to as rules, identifying semantically meaningful objects, hereinafter referred to as named entities, and semantically meaningful relations between the named entities, hereinafter referred to as named relations;
- within the text being indexed, for every identified named relation relating certain named entities, forming a set of triples: a single first type triple corresponding to the relation established by the named relation between two named entities, each of a set of second type triples corresponding to a value of a particular attribute of one of those entities, and each of a set of third type triples corresponding to a value of a particular attribute of the named relation itself;
- on the set of the formed triples, separately indexing all named entities related by the named relations, all pairs of the kind "named entity - named relation", and all triples of the kind "named entity - named relation - named entity", while taking into account the attributes of the respective named entities and/or named relations;
- storing in the database the formed triples and the obtained indices together with a reference to the initial text from which those triples have been formed.
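The triple-forming and indexing steps of claim 1 can be illustrated with a minimal sketch. The three triple types follow the wording of the claim, and the `a_`, `b_` and `c_` prefixes follow the codes visible in Tables 9 to 11. Everything else is an assumption: the text does not state how the unique codes are computed, so the truncated SHA-1 digest, the `describe` helper and all function names here are hypothetical.

```python
import hashlib
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def build_triples(rel_name, rel_attrs, left, right):
    """left and right are (entity name, attribute dict) pairs."""
    triples = [Triple(left[0], rel_name, right[0])]            # first type: the relation itself
    for name, attrs in (left, right):                          # second type: entity attributes
        triples += [Triple(name, k, v) for k, v in attrs.items()]
    triples += [Triple(rel_name, k, v) for k, v in rel_attrs.items()]  # third type: relation attributes
    return triples


def describe(name, attrs):
    """Canonical string for a name plus its attributes (attribute order does not matter)."""
    return name + "|" + "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))


def index_code(prefix, *parts):
    """Hypothetical unique code: a prefix plus a truncated SHA-1 of the joined parts."""
    digest = hashlib.sha1("\x1f".join(parts).encode("utf-8")).hexdigest()[:16]
    return f"{prefix}_{digest}"


def build_indices(rel_name, rel_attrs, left, right):
    """Index the entities, the entity-relation pairs and the full triple separately."""
    rel = describe(rel_name, rel_attrs)
    l, r = describe(*left), describe(*right)
    return {
        "entities": [index_code("a", l), index_code("a", r)],
        "pairs": [index_code("b", l, rel), index_code("b", r, rel)],
        "triple": index_code("c", l, rel, r),
    }
```

Because the codes in this sketch depend on the attributes as well as the names, two mentions with different attribute sets receive different codes, which is consistent with "Victor Uschenko" and "Victor Andreevich Uschenko" carrying different indices in Table 15.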
2. The method according to claim 1, wherein said tokens, hereinafter referred to as the first level elementary units, are selected from the group consisting of: words in the form of a series of letters or of letters and hyphens; numbers; punctuation marks; and series of spaces.
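The four token kinds listed in claim 2 map naturally onto a single regular expression. The following is a minimal sketch, not the tokenizer of the described system: the group names and the `tokenize` function are illustrative, and treating the underscore as punctuation is an assumption.

```python
import re

TOKEN_RE = re.compile(
    r"(?P<word>[^\W\d_]+(?:-[^\W\d_]+)*)"   # letters, optionally joined by hyphens
    r"|(?P<number>\d+)"                     # series of digits
    r"|(?P<space>\s+)"                      # series of spaces
    r"|(?P<punct>[^\w\s]|_)"                # any single remaining character
)


def tokenize(text):
    """Return (kind, value) pairs covering every character of the text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]
```

The character class `[^\W\d_]` matches Unicode letters, so the same pattern handles both Cyrillic and Latin words.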
3. The method according to claim 1, further comprising a step of forming, for every token being a word, respective second level elementary units, hereinafter referred to as morphs, based on morphological analysis.
4. The method according to claim 1, wherein, during said linguistic analysis at the step of forming sentences, the first and/or second level elementary units (i.e., tokens and morphs) in every sentence are converted into said stable phrases, hereinafter referred to as the third level elementary units, by addressing dictionaries and morphological associations stored in advance in the database.
5. The method according to claim 1, further comprising, during said multistage semantic-syntactic analysis, steps of:
- identifying, in the sentence, said named entities considered as the fourth level elementary units in a set of elementary units of the first, second, and/or third levels;
- forming, using said rules, the morphological attributes for every named entity from the morphological attributes of said elementary units of the second and/or third levels (i.e., morphs and/or stable phrases) which constitute this named entity;
- forming, using said rules, the semantic attributes for every named entity from the semantic attributes of the elementary units of the second and/or third levels which constitute this named entity;
- assigning to every named entity a respective type from the application ontology stored in the database according to the topics of the domain to which the text being indexed relates;
- storing every named entity in the memory together with the type assigned thereto and the morphological and semantic attributes determined thereto.
6. The method according to claim 5, further comprising steps of:
- determining, for every named entity with a type assigned thereto, an anaphoric reference considered as the fifth level elementary unit, and storing said anaphoric reference in the database together with the type and attributes of the named entity being the antecedent for that anaphoric reference, as well as with the indication of the co-reference between that named entity and the anaphoric reference thereof;
- determining said named relations, considered as the sixth level elementary units, using said rules on the basis of the elementary units of the first, second, third, fourth and/or fifth levels;
- determining, using said rules, for every named relation the morphological attributes from the second level elementary units constituting this named relation;
- determining, using said rules, for every named relation the semantic attributes from the elementary units of the first, second, third and/or fourth levels constituting this relation;
- assigning the respective type to every named relation from the application ontology stored in the database according to the topics of the domain to which the text being indexed relates;
- storing in the memory every named relation together with the type assigned thereto and the morphological and semantic attributes determined thereto.
7. The method according to claim 1, further comprising, prior to storing in the database the formed triples and obtained indices, a step of convolving every group of objects related by co-reference relations into a single object whose set of attributes is the combination of the attributes of all objects interrelated by the co-reference relations.
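The convolution of claim 7 amounts to grouping objects connected by co-reference links and taking the union of their attributes. A minimal sketch follows, assuming objects are given as a dictionary from an identifier to an attribute dictionary and co-references as identifier pairs. The union-find grouping and the choice to collect attribute values in sets are illustrative decisions, not details taken from the text.

```python
def convolve(entities, coreferences):
    """Merge co-referent objects.

    entities: dict mapping an object id to its attribute dict.
    coreferences: iterable of (id, id) pairs linked by co-reference.
    Returns one merged attribute dict per group, with values collected in sets.
    """
    parent = {ent_id: ent_id for ent_id in entities}

    def find(x):
        # Follow parent links to the group representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in coreferences:
        parent[find(a)] = find(b)

    groups = {}
    for ent_id, attrs in entities.items():
        merged = groups.setdefault(find(ent_id), {})
        for key, value in attrs.items():
            merged.setdefault(key, set()).add(value)
    return list(groups.values())
```

Keeping sets of values means that conflicting attribute values from different mentions are preserved rather than silently overwritten.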
8. A method of computerized semantic indexing of a collection of natural language texts, the method comprising all the steps of the method according to claim 1 in reference to the succeeding text being indexed, and further comprising, during the step of storing in the database the formed triples and the obtained indices of the succeeding text, steps of: comparing, using the linguistic and heuristic rules in the predetermined linguistic environment, the newly derived named entities and named relations with the named entities and named relations already existing in the database, and, in the case of identifying similar named entities and/or named relations, not storing the duplicated information in the database and supplementing the respective named entities and/or named relations with references to the succeeding texts where they are present and with references to the text fragments within each of the succeeding texts from which they are derived.
9. The method according to claim 8, wherein said tokens, hereinafter referred to as the first level elementary units, are selected from the group consisting of: words in the form of a series of letters or of letters and hyphens; numbers; punctuation marks; and series of spaces.
10. The method according to claim 8, further comprising a step of forming, for every token being a word, respective second level elementary units, hereinafter referred to as morphs, based on morphological analysis.
11. The method according to claim 8, wherein, during said linguistic analysis at the step of forming sentences, the first and/or second level elementary units (i.e., tokens and morphs) in every sentence are converted into said stable phrases, hereinafter referred to as the third level elementary units, by addressing dictionaries and morphological associations stored in advance in the database.
12. The method according to claim 8, further comprising, during said multistage semantic-syntactic analysis, steps of:
- identifying, in the sentence, said named entities considered as the fourth level elementary units in a set of elementary units of the first, second, and/or third levels;
- forming, using said rules, the morphological attributes for every named entity from the morphological attributes of said elementary units of the second and/or third levels (i.e., morphs and/or stable phrases) which constitute this named entity;
- forming, using said rules, the semantic attributes for every named entity from the semantic attributes of the elementary units of the second and/or third levels which constitute this named entity;
- assigning to every named entity a respective type from the domain ontology stored in the database according to the topics of the domain to which the text being indexed relates;
- storing every named entity in the memory together with the type assigned thereto and the morphological and semantic attributes determined thereto.
13. The method according to claim 12, further comprising steps of:
- determining, for every named entity with a type assigned thereto, an anaphoric reference considered as the fifth level elementary unit, and storing said anaphoric reference in the database together with the type and attributes of the named entity being the antecedent for that anaphoric reference, as well as with the indication of the co-reference between that named entity and the anaphoric reference thereof;
- determining said named relations, considered as the sixth level elementary units, using said rules on the basis of the elementary units of the first, second, third, fourth and/or fifth levels;
- determining, using said rules, for every named relation the morphological attributes from the second level elementary units constituting this named relation;
- determining, using said rules, for every named relation the semantic attributes from the elementary units of the first, second, third and/or fourth levels constituting this relation;
- assigning the respective type to every named relation from the application ontology stored in the database according to the topics of the domain to which the text being indexed relates;
- storing in the memory every named relation together with the type assigned thereto and the morphological and semantic attributes determined thereto.
14. The method according to claim 8, further comprising, prior to storing in the database the formed triples and obtained indices, a step of convolving every group of objects related by co-reference relations into a single object whose set of attributes is the combination of the attributes of all objects interrelated by the co-reference relations.
15. A machine-readable medium intended for direct operation in a computer and comprising a program for carrying out the method according to claim 1.
16. A machine-readable medium intended for direct operation in a computer and comprising a program for carrying out the method according to claim 8.
PCT/RU2009/000111 2008-10-29 2009-03-06 Method of computerized semantic indexing of natural language text, method of computerized semantic indexing of collection of natural language texts, and machine-readable media WO2010050844A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP09823885A EP2350871A1 (en) 2008-10-29 2009-03-06 Method of computerized semantic indexing of natural language text, method of computerized semantic indexing of collection of natural language texts, and machine-readable media

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2008142648/12A RU2399959C2 (en) 2008-10-29 2008-10-29 Method for automatic text processing in natural language through semantic indexation, method for automatic processing collection of texts in natural language through semantic indexation and computer readable media
RU2008142648 2008-10-29

Publications (1)

Publication Number Publication Date
WO2010050844A1 true WO2010050844A1 (en) 2010-05-06

Family

ID=42129031

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2009/000111 WO2010050844A1 (en) 2008-10-29 2009-03-06 Method of computerized semantic indexing of natural language text, method of computerized semantic indexing of collection of natural language texts, and machine-readable media

Country Status (3)

Country Link
EP (1) EP2350871A1 (en)
RU (1) RU2399959C2 (en)
WO (1) WO2010050844A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2452002C1 (en) * 2011-03-04 2012-05-27 Сергей Иванович Колесник Method of creating multilingual automatic index for electronic digital pilot book
RU2518946C1 (en) * 2012-11-27 2014-06-10 Александр Александрович Харламов Method for automatic semantic indexing of natural language text
US9772995B2 (en) 2012-12-27 2017-09-26 Abbyy Development Llc Finding an appropriate meaning of an entry in a text
US20140188456A1 (en) * 2012-12-27 2014-07-03 Abbyy Development Llc Dictionary Markup System and Method
RU2538303C1 (en) * 2013-08-07 2015-01-10 Александр Александрович Харламов Method for automatic semantic comparison of natural language texts
RU2538304C1 (en) * 2013-08-22 2015-01-10 Александр Александрович Харламов Method for automatic semantic classification of natural language texts
RU2565473C2 (en) * 2013-11-01 2015-10-20 Федеральное государственное бюджетное образовательное учреждение высшего профессионального образования "Российский государственный гуманитарный университет" (РГГУ) Method of constructing corpus based on internet forums
RU2665239C2 (en) * 2014-01-15 2018-08-28 Общество с ограниченной ответственностью "Аби Продакшн" Named entities from the text automatic extraction
RU2544739C1 (en) * 2014-03-25 2015-03-20 Игорь Петрович Рогачев Method to transform structured data array
EA201700031A1 (en) * 2014-06-27 2017-05-31 Игорь Петрович РОГАЧЕВ METHOD first converting the original data files, METHOD FOR FORMING RELATIONSHIPS MAP components often STRUCTURED logical constructions convert the original data files, a method of searching in the transformed data sets using the card RELATIONS components and systems and apparatus for implementing these methods
RU2618374C1 (en) * 2015-11-05 2017-05-03 Общество с ограниченной ответственностью "Аби ИнфоПоиск" Identifying collocations in the texts in natural language
CN107402912B (en) * 2016-05-19 2019-12-31 北京京东尚科信息技术有限公司 Method and device for analyzing semantics
RU2619193C1 (en) * 2016-06-17 2017-05-12 Общество с ограниченной ответственностью "Аби ИнфоПоиск" Multi stage recognition of the represent essentials in texts on the natural language on the basis of morphological and semantic signs
RU2646386C1 (en) * 2016-12-07 2018-03-02 Общество с ограниченной ответственностью "Аби Продакшн" Extraction of information using alternative variants of semantic-syntactic analysis
CN106933809A (en) * 2017-03-27 2017-07-07 三角兽(北京)科技有限公司 Information processor and information processing method
CN107203511B (en) * 2017-05-27 2020-07-17 中国矿业大学 Network text named entity identification method based on neural network probability disambiguation
RU2713568C1 (en) * 2019-11-10 2020-02-05 Игорь Петрович Рогачев Method of transforming structured data array
RU2717719C1 (en) * 2019-11-10 2020-03-25 Игорь Петрович Рогачев Method of forming a data structure containing simple judgments
RU2717718C1 (en) * 2019-11-10 2020-03-25 Игорь Петрович Рогачев Method of transforming a structured data array containing simple judgments

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2273879C2 (en) * 2002-05-28 2006-04-10 Владимир Владимирович Насыпный Method for synthesis of self-teaching system for extracting knowledge from text documents for search engines
US7191115B2 (en) * 2001-06-20 2007-03-13 Microsoft Corporation Statistical method and apparatus for learning translation relationships among words
US20070073533A1 (en) * 2005-09-23 2007-03-29 Fuji Xerox Co., Ltd. Systems and methods for structural indexing of natural language text
US7305336B2 (en) * 2002-08-30 2007-12-04 Fuji Xerox Co., Ltd. System and method for summarization combining natural language generation with structural analysis
US7346493B2 (en) * 2003-03-25 2008-03-18 Microsoft Corporation Linguistically informed statistical models of constituent structure for ordering in sentence realization for a natural language generation system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130124964A1 (en) * 2011-11-10 2013-05-16 Microsoft Corporation Enrichment of named entities in documents via contextual attribute ranking
US9552352B2 (en) * 2011-11-10 2017-01-24 Microsoft Technology Licensing, Llc Enrichment of named entities in documents via contextual attribute ranking
US8997008B2 (en) 2012-07-17 2015-03-31 Pelicans Networks Ltd. System and method for searching through a graphic user interface

Also Published As

Publication number Publication date
RU2399959C2 (en) 2010-09-20
EP2350871A1 (en) 2011-08-03
RU2008142648A (en) 2010-05-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09823885

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2009823885

Country of ref document: EP