WO2015125088A1

WO2015125088A1 - Document characterization method

Info

Publication number: WO2015125088A1
Application number: PCT/IB2015/051239
Authority: WO
Inventors: Rodrigo Andrés SANDOVAL URRICH; Juan Ignacio SÁA HARGOUS; José Manuel JIMENEZ MARIN
Original assignee: Servicios Digitales Webdox Spa
Priority date: 2014-02-18
Filing date: 2015-02-18
Publication date: 2015-08-27
Also published as: CL2016002090A1; PE20161166A1

Abstract

The present invention describes a method of automatic document characterization which receives a given unstructured input document and results in the automatic assignment of one or more document classes or categories to which the contents relate, the automatic determination of a list of the names of natural or legal persons found in the text, the automatic determination of other relevant information mentioned in the text that is related to the document class or multiclass, and the automatic determination of the document issue date.

Description

DOCUMENT CHARACTERIZATION METHOD

TECHNICAL FIELD

The present invention relates to natural language processing focused on identifying relevant information in the contents of an input document in digital format, also defined as automatic document characterization, in a step-by-step process in which the information found in earlier steps is used to search for and define the data sought in later steps.

BACKGROUND

The state of the art discloses several methods and techniques related to the classification or characterization of documents according to their content.

Patent US8380718 (B2) discloses a system and method for grouping similar documents. Frequencies of occurrences are determined for terms and noun phrases within a set of documents. A sub-group of the documents is selected by removing those documents having terms and noun phrases that fall outside a bounded range of upper and lower conditions for frequency of occurrence.

On the other hand, European patent EP1365329 (B1) discloses a classification method wherein one document is classified into at least one document class by selecting terms for use in the classification from among terms that occur in the document. A similarity between the input document and each class is calculated using information saved for every document class. The calculated similarity to each class is corrected. The class to which the input document belongs is determined in accordance with the corrected similarity to each class.

Document US2013110843 discloses a method and system for classifying insurance files for identification, sorting and efficient collection of subrogation claims. The invention determines whether an insurance claim has merit to warrant claim recovery efforts utilizing software code for partially describing a set of documents having unstructured and structured file data containing terms and phrases having contextual bases; code for transforming the terms and phrases; code for iterating a classification process to determine rules that best classify the set of documents based upon context; code for incorporating the rules into an induction and knowledge representation, thesauri taxonomies and text summarization to classify subrogation claims; and code for calculating a base score and a concept vector to identify the selected claims that demonstrate a given probability of subrogation recovery.

First, the methods and systems described above do not allow a multiple classification process or characterization of different content elements included in each document.

Second, all but one of the methods described above depend on a set of keywords that are synonyms of each other and do not consider multiple language-dependent forms of the same semantic expression (grouped in rules), which would provide a more flexible way to recognize specific text patterns in the document. Each expression may be composed of a different number of words, with different synonyms, conjugations and word orders, and may even be a set of combined words separated by no more than a specific number of words.

Third, the methods described above do not measure the importance of an expression in a specific context, that is, they do not give an expression more or less weight in the result of the characterization. Finally, methods focused on classification do not address a detection process in which the contents found in the first steps of the characterization drive the behavior and performance of the following steps.

SUMMARY

The present invention relates to natural language processing focused on identifying specific information elements in the contents of an input document in digital format, also defined as automatic document characterization, in a step-by-step process in which the information found in earlier steps is used to search for and define the data sought in later steps. This text processing approach is focused on documents in which the relevant information elements must be identified from contents that do not follow a specific layout, order, or syntax, known as unstructured documents. Unstructured documents, such as legal documents, do not present a specific sequence of elements, and the contents can be found using different expressions depending on the language used.

The information elements searched for in the document allow specific characteristics of the document to be determined. These include the document classes or categories and other associated characteristics, such as dates, names of natural or legal persons mentioned in the text, and other relevant information mentioned in the text that is related to the document class or multiclass.

Although the way these elements are searched differs from one to another, in all of them the approach is to recognize specific text patterns within the document contents and to drive the detection process with the contents found in the first steps of the characterization, utilizing the set of rules defined for each class.

The method for automatic document characterization described here includes document classification, which is a sequential process for associating documents with pre-defined document classes and has been addressed by different methods and techniques, such as naïve Bayes and Support Vector Machines, among others. However, these methods are not well suited to the natural overlapping of classes, which in this type of document is not only feasible but expected. Also, these methods, except one, focus on specific keywords and do not easily treat different language-dependent expressions as equivalent concepts or ideas.

The process for automatic document characterization processes the document contents five times (OCR, Class and Multiclass Determination, Names detection, Document class-related characterization, Document date detection), since a different approach is needed for each of the information elements.

The first revision aims to refine the input text, acknowledging that the results of the Optical Character Recognition (OCR) might separate whole words by assuming spaces and end of lines. A dictionary of valid words in a given language is used to validate word merges, thus separated words are corrected, allowing the rest of the revisions to work easily.

The second revision focuses on the document classes: it searches for semantic rules that might occur in the text and, according to the occurrence of those rules, determines one or more classes that might apply.

The third revision focuses on the names referenced in the contents. This revision uses a list of relevant keywords: those that might be included at the beginning or at the end of a name, and those that might precede or follow a name without being part of it. The relevance of each of the defined keywords can be associated with the main class assigned to the document in the previous step, therefore allowing a fine-tuned detection process.

The fourth revision is related to the detailed characterization of the document, based on the class or multiclass that has been previously defined. Given this class or multiclass, there is a list of characteristics that a document has. As an example, in the case of a real estate lease agreement, it is useful to know the "lease rent" or the "leasable area", so this information must be searched for.

The final revision searches only for valid dates, but also considers some of the preceding text and the information already defined, to help in the determination of the issue date. It must be taken into consideration that each class may have one or more relevant dates, such as the date on which a lease agreement commences and the date on which it ends. At this step, the dates to search for will be pre-established, since the class of the document will be known.

This invention aims to produce a much faster, more complete and more accurate characterization result compared to the manual characterization that a legal expert may provide by reading the document contents. This speed and accuracy comparison is feasible when implementing the process as a computing system available for document characterization for users of this system. The accuracy of the characterization is supported by a rule calibration process that takes into account a set of previously classified documents and adjusts each rule's coefficients to match these documents' classes as closely as possible, which allows the continuous improvement of this accuracy.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram that represents the document characterization process

FIG. 2a is a block diagram that represents the Class and Multiclass determination process

FIG. 2b is a block diagram that represents the Class and Multiclass determination process with essential rules

FIG. 3 is a block diagram that represents an example of an embodiment of the invention. In this particular scenario, "Existing purchase" rule was used as an example.

FIG. 4 is a block diagram that represents the Class and Multiclass calibration process.

FIG. 5 is a block diagram that represents the Names detection process.

FIG. 6 is a block diagram that represents the Document class-related characterization.

FIG. 7 is a block diagram that represents the Document date detection process.

FIG. 8 is a block diagram that describes a computer system for executing the document characterization, from the reception of the document to be characterized, the retrieval of words from the dictionary and text rules from the database, to the final display of resulting characteristics of the provided document.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

In this section each of the processes is detailed and explained. Each process has its own figure (FIG. 2a, 2b, 3, 4, 5, 6, 7) and its own explanation, and a figure of the complete process is also included (FIG. 1).

Main process

The following text uses FIG. 1 to explain the whole process. Each process is then explained in its own section following this paragraph. As mentioned earlier, the first step is the revision of the text with the recognition of individual words 101 with an Optical Character Recognition (OCR). Once the text has been revised by the OCR, the Determination of Class or Multiclass is made 102, aligned with a Class or Multiclass rule calibration process 102a, which is an iterative process that improves the rule score definition, obtaining a better classification. With the class already defined, a names detection process 103 is completed, with the objective of obtaining relevant names of people and organizations in the document.

This process 103 takes into consideration the class with which the document has been associated to make the search 105 in a more precise way, considering a relevant set of keywords and expressions 103a to define the names contained in the document. Once the classes and names have been defined, it is possible to initiate a document class-related characterization 104, which searches for information taking into consideration the class previously defined. The final step consists of the search and definition of the document's valid issue date and other relevant dates 106, which, similarly to the names search, takes into consideration the class that has been defined to make the search. Once all of these processes have finished, the final document characterization 107 has been made. This final document characterization 107 includes dates, names of natural or legal persons mentioned in the text and other relevant information mentioned in the text, providing a detailed description of the document in a fast and accurate way. This speed and accuracy comparison is feasible when implementing the process as a computing system available for document characterization for users of this system.

For further details on each of the processes, please refer to the respective section.

Document Text Refinement

The document text is reviewed recognizing individual words obtained through an Optical Character Recognition 101 (OCR) processing means. If the document's origin is a word processor, this refinement process may produce just a minimal difference or no difference at all. To attempt correcting the detection of spaces within words by the OCR process, each individual word is concatenated with the next one and then compared to the list of valid words in the language of the document stored in the dictionary 101a. If this concatenated word is found, then it's kept for the following processes.
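The merge check described above can be sketched as follows. This is a minimal illustration and not the applicants' implementation: the function name `merge_split_words`, the sample dictionary and the case-insensitive lookup are assumptions made for the example.

```python
def merge_split_words(words, dictionary):
    """Rejoin word fragments that OCR separated with a spurious space.

    Each word is concatenated with the next one; the concatenation is kept
    only when it appears in the dictionary of valid words.
    """
    merged = []
    i = 0
    while i < len(words):
        if i + 1 < len(words):
            candidate = words[i] + words[i + 1]
            if candidate.lower() in dictionary:  # lookup is case-insensitive (assumption)
                merged.append(candidate)
                i += 2  # both fragments are consumed by the merge
                continue
        merged.append(words[i])
        i += 1
    return merged


# Hypothetical dictionary of valid words for the document language.
VALID_WORDS = {"the", "agreement", "lease", "is", "signed"}

print(merge_split_words("the lea se agree ment is signed".split(), VALID_WORDS))
```

One limitation of this sketch is that two valid words whose concatenation is also valid would be merged as well; the text does not say how such cases are handled.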

Class or Multi-Class Determination

The process depends on a plurality of rules defined by means of specification of logic text-related rules 301, each of them defining a set of keywords that represent equivalent verbal expressions 301a. These expressions are described by keywords, synonyms, conjugations, different numbers of words and/or equivalent verbalizations 301a of the same concept, also considering different levels of separation between these words, thus allowing more flexibility in text rule detection (2a.201 and 2b.201). Each rule may be related to one or more classes (2a.202 and 2b.202a), but the relevance for each class is determined by a relevance coefficient that may have a positive or negative value 305. In some classes, a rule can be considered essential, in which case the coefficient should be high and the absence of this rule might even lead to a penalty regarding the document class in which this rule is considered essential 2b.202b. This means that the very same rule might be highly relevant for one class of documents, with a rather high positive coefficient value or even being considered essential; slightly relevant for a second class, with a lower positive coefficient value; of no relevance at all for another class, represented by a null coefficient; and even contradictory for yet another class of documents, represented by a negative coefficient value 2b.202b.

FIG. 2b illustrates the duality of an essential rule 2b.202b, which arises from the different values that it may have for different classes. For a given class such as a real estate purchase document, the "existing purchase" rule may be relevant to indicate that the document is effectively a purchase document. On the other hand, a document such as a real estate lease agreement will not have this rule as essential, and the weight of this rule in the document's total score may be very low (2a.203 and 2b.203).

For example, in the preferred embodiment of this method, the characterization of legal documents, the rule "existing purchase" 301 applies to any type of purchase, such as real estate, chattels or shares. In any of these different types of agreements, this rule may be found in slightly different ways 301a. A possible expression that relates to this rule is "... the parties agree to subscribe the present purchase agreement ..." 301b. An alternative expression is "... the seller sells ... and the buyer buys ..." 301b, regardless of the item being considered in the sale. The methodology for all the other rules works in a similar way, implementing a set of keywords that represent equivalent terms (302 and 304).

An additional and complementary rule that applies to this kind of contract is "existing purchase price", which refers to the price the buyer is going to pay. As with the "existing purchase" rule, this rule can be found in different forms in the text, such as "... the price paid by the buyer is the following ..." or "... the price of ... shall be ...".

In a related example, a different type of contract is the lease agreement. The main rule that is found in this kind of agreement is "existing lease", which can be found as "... the landlord hereby leases the premises to the tenant ...", and later in the document "... the tenant hires the premises", which correspond to a real estate lease agreement. A different way of expressing this exact same rule is "... the lessee receives from the lessor ...". In the same type of document, another rule that can be found is "length of the lease", which describes the duration of the contract. This rule can be found in the document as "... this lease shall start on [date1] and end on [date2]", or also as "... this lease will be in effect from [date] for a duration of X days ...". These rules are found in lease contracts; therefore rather high coefficients should be set to associate them with the lease contract class, and one or both of them may even be considered essential, thus being assigned the highest coefficient for this class (303).

The classification process searches the document from the beginning to the end, identifying each word individually and comparing it to the several defined rules 301a. If the word matches any of the words of any of the rules, then the following text is processed to complete each of those candidate rules being evaluated, while each new word is also compared with the other rules 301b.
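As an illustration of this matching, the sketch below counts occurrences of a rule whose expressions are sequences of synonym sets separated by a bounded number of words. The data layout, the names `tokenize`, `count_expression` and `count_rule`, the sample synonym sets and the requirement that keywords appear in a fixed order are all assumptions made for the example; the text also allows different word orders, which this sketch does not cover.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_expression(tokens, expression, max_gap):
    """Count matches of one expression.

    An expression is a list of synonym sets. A match starts at a token from
    the first set; each later set must be matched in order, with at most
    max_gap other words between consecutive keywords.
    """
    count = 0
    for start, token in enumerate(tokens):
        if token not in expression[0]:
            continue
        pos, matched = start, True
        for synonyms in expression[1:]:
            window = tokens[pos + 1 : pos + 2 + max_gap]
            offset = next((k for k, w in enumerate(window) if w in synonyms), None)
            if offset is None:
                matched = False
                break
            pos = pos + 1 + offset
        if matched:
            count += 1
    return count


def count_rule(text, expressions, max_gap=2):
    """Total occurrences of a rule: the sum over its equivalent expressions."""
    tokens = tokenize(text)
    return sum(count_expression(tokens, e, max_gap) for e in expressions)


# Hypothetical definition of the "existing purchase" rule.
EXISTING_PURCHASE = [
    [{"seller"}, {"sells", "sold"}],
    [{"buyer"}, {"buys", "purchases"}],
    [{"purchase"}, {"agreement"}],
]

print(count_rule("The seller sells the house and the buyer hereby buys it.",
                 EXISTING_PURCHASE))
```

Raising or lowering `max_gap` changes how much separation between keywords is tolerated, which corresponds to the "levels of separation" mentioned for rule definitions.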

The classification process ends up with a set of rules found to apply for each document (2a.202 and 2b.202). Then, the final classification score for each document class is calculated by this formula 305.

ClassScore(document, rule) = occurrences * Coefficient

This means that if the coefficient of a certain rule is high for a document class, any of the repeated occurrences will add up for assigning that class to the document.
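A small worked sketch of this scoring follows. Summing the per-rule products into one score per class, the fixed penalty for a missing essential rule and the assignment threshold are assumptions made for illustration; the text gives only the per-rule product and states that a missing essential rule might lead to a penalty. All rule names, class names and numeric values are hypothetical.

```python
def class_scores(occurrences, coefficients, essential=None, penalty=0.0):
    """Score every class as the sum of occurrences * coefficient over its rules.

    occurrences:  {rule: number of times the rule was found in the document}
    coefficients: {class: {rule: relevance coefficient (may be negative)}}
    essential:    {class: [rules considered essential for that class]}
    penalty:      amount subtracted when an essential rule is absent (assumption)
    """
    scores = {}
    for cls, rule_coeffs in coefficients.items():
        total = sum(occurrences.get(rule, 0) * coeff
                    for rule, coeff in rule_coeffs.items())
        for rule in (essential or {}).get(cls, ()):
            if occurrences.get(rule, 0) == 0:
                total -= penalty
        scores[cls] = total
    return scores


def assign_classes(scores, threshold):
    """Return every class whose score reaches the threshold (multiclass)."""
    return sorted(cls for cls, score in scores.items() if score >= threshold)


# Hypothetical example values.
occurrences = {"existing purchase": 2, "existing purchase price": 1}
coefficients = {
    "purchase": {"existing purchase": 5.0, "existing purchase price": 2.0,
                 "existing lease": -3.0},
    "lease": {"existing lease": 5.0, "existing purchase": -3.0},
}
essential = {"purchase": ["existing purchase"], "lease": ["existing lease"]}

scores = class_scores(occurrences, coefficients, essential, penalty=4.0)
print(scores)
print(assign_classes(scores, threshold=5.0))
```

Here the "existing purchase" rule raises the purchase score through a positive coefficient and lowers the lease score through a negative one, which mirrors the dual role of a rule across classes described earlier.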

The definition of the previous scores for each class will have effect on the Class or Multiclass calibration (306, 2a.204 and 2b.204), and then this iterative process (2a.205 and 2b.205) will modify and improve the set of rules used to classify the document.

Names Detection

Any document, especially a legal document, might reference names of natural persons and organizations, so this process focuses on producing a list of all the names found in the contents 501. This detection, using the detection means, is based on the fact that documents mention persons' names preceded by some specific verbal expressions 502a, such as "Mr." or "Mrs.", and that the name might be followed by a personal ID reference or by a verb that specifies the action for which this person is being referenced. In the case of organizations, the preceding text might also include equivalent expressions, and the name usually ends with "LLC", "Inc." or other company type acronyms usually found in legal documents. Therefore, this step considers a list of preceding expressions that indicate the following text is a name but are not included in the name itself, such as "Mr." or "Mrs.", and also a list of following expressions, which may or may not be included in the name, used to find the end of the referenced name. This step and the list of preceding and following expressions are included in an application 502.

To improve the names detection process, some extra rules may be considered depending on the document classes determined in previous steps 501a. For example, a document associated with the "purchase contract" document class, which is determined by the occurrence of the rule "existing purchase" and others, may include expressions such as "... as the seller ..." or "... also known as the buyer ..." as text that follows the names in the document. These additional class-dependent rules improve the complete name detection, especially the detection of the end of a name. Taking the previously defined class into consideration also makes it possible to distinguish relevant from non-relevant names in the document. The previously defined class strongly influences the way in which the mentioned process 502b is carried out.
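The effect of class-dependent following expressions on finding the end of a name can be sketched as below. The sketch covers person names only and is an assumption-laden illustration: the expression lists, the `max_words` cap used to discard implausibly long candidates and the function name `detect_person_names` are all hypothetical.

```python
import re

# Expressions that precede a name but are not part of it.
PRECEDING = re.compile(r"\b(?:Mr|Mrs)\.\s+")

# Generic text that can end a name (here only a comma, as an assumption).
GENERIC_FOLLOWING = [","]

# Extra following expressions that depend on the document class.
CLASS_FOLLOWING = {
    "purchase contract": [" as the seller", " also known as the buyer"],
}


def detect_person_names(text, doc_class=None, max_words=6):
    """List person names that follow a preceding expression.

    A name ends at the earliest following expression. Candidates longer than
    max_words are discarded because their end was probably not detected.
    """
    enders = GENERIC_FOLLOWING + CLASS_FOLLOWING.get(doc_class, [])
    names = []
    for match in PRECEDING.finditer(text):
        rest = text[match.end():]
        cuts = [pos for pos in (rest.find(e) for e in enders) if pos != -1]
        if not cuts:
            continue
        name = rest[:min(cuts)].strip()
        if name and len(name.split()) <= max_words:
            names.append(name)
    return names


sample = ("Mr. Juan de la Cruz as the seller and Mrs. Ana Soto, "
          "also known as the buyer, agree.")
print(detect_person_names(sample))
print(detect_person_names(sample, doc_class="purchase contract"))
```

Without the class-specific expressions the first name has no detectable end and is dropped; with them, both names are found, which is the improvement the paragraph above attributes to class-dependent rules.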

Document class-related characterization

This characterization refers to the detection of other specific data from the document, such as numeric amounts and proper names, among others. In some cases it is of great use to obtain particular information from a document, such as the lease rent or the leasable area in a real estate lease agreement. Once the Class or Multiclass has been defined 602a and the names search is finished 601, the list of characteristics of a particular document is defined. Given that list, the information is searched for in the document 602. This is of great utility to the characterization 603 of the document, since it takes the class of the document into consideration to elaborate a complete characterization. An example of this definition is the real estate lease agreement 603a. In this document it is useful to know the lease rent 603b, so the search for this information is completed once the class has been defined. The use of the rules defined for the class or multiclass is very useful here, since the appearance of a rule in the text is clear evidence that class-related information must appear nearby; once the rule has been identified in the text, it is possible to extract the information related to the rule. This particular use can be replicated for other documents and their particular information. This process is explained in FIG. 6.
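A minimal sketch of this class-driven extraction is shown below, assuming that each class carries a list of characteristics and that each characteristic is located with a text pattern. The regular expressions, the dictionary layout and the function name `characterize` are hypothetical and far simpler than real rule definitions would need to be.

```python
import re

# Hypothetical characteristics per document class, each with a pattern whose
# first group captures the value found next to the matching expression.
CHARACTERISTICS = {
    "real estate lease agreement": {
        "lease rent":
            r"(?:monthly\s+)?rent\s+(?:of|is|shall be)\s+(?:USD\s*|\$\s*)?([\d.,]*\d)",
        "leasable area":
            r"leasable\s+area\s+(?:of|is)\s+([\d.,]*\d)\s*(?:m2|square\s+meters)",
    },
}


def characterize(text, doc_classes):
    """Search only for the characteristics listed for the assigned classes."""
    result = {}
    for cls in doc_classes:
        for name, pattern in CHARACTERISTICS.get(cls, {}).items():
            match = re.search(pattern, text, flags=re.IGNORECASE)
            if match:
                result[name] = match.group(1)
    return result


sample = ("The tenant shall pay a monthly rent of USD 1,500. "
          "The leasable area is 120 square meters.")
print(characterize(sample, ["real estate lease agreement"]))
```

Because the lookup is keyed by class, a document classified differently would not be searched for lease-specific values at all.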

Document Date Detection

Using detection means, the final step detects the document issue date. It considers that more than one valid date may be included in the document, but that only one of them can be considered the document issue date. This detection also takes into account the existence of many verbal forms to describe a date 701.

Taking these considerations into account, the document date detection process is performed by recognizing each valid date and calculating a score through the detection means 702 that depends on the following criteria: the document issue date must be prior to the current date; it must be complete, which means it should include day, month, and year; and it can be preceded by an expression that references a location.

The definitive document issue date is the one that gets the highest score or, in case of a tie, the first one found 703. This process, the techniques to detect valid dates in the document, and the list of keywords that allow determination of the document issue date are configurable in an application.
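The scoring and tie-breaking just described might look like the following sketch. It handles a single date form ("18 February 2014"), gives each criterion an equal weight of one point and recognizes a location only as a capitalized word followed by a comma; these choices, and the names `DATE_RE`, `LOCATION_RE` and `issue_date`, are assumptions for the example.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Optional day, month name, four-digit year.
DATE_RE = re.compile(
    r"(?:\b(\d{1,2})\s+)?\b(" + "|".join(MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE)

# Hypothetical location cue: a capitalized word and a comma right before the date.
LOCATION_RE = re.compile(r"\b[A-Z][a-z]+,\s*$")


def issue_date(text, today):
    """Return the best-scoring date, keeping the first one in case of a tie."""
    best, best_score = None, None
    for match in DATE_RE.finditer(text):
        day, month, year = match.group(1), MONTHS[match.group(2).lower()], int(match.group(3))
        try:
            candidate = date(year, month, int(day) if day else 1)
        except ValueError:
            continue  # not a valid calendar date
        score = 0
        if day is not None:                          # complete: day, month and year
            score += 1
        if candidate < today:                        # prior to the current date
            score += 1
        if LOCATION_RE.search(text[:match.start()]): # preceded by a location
            score += 1
        if best_score is None or score > best_score: # strict '>' keeps the first on ties
            best, best_score = candidate, score
    return best


sample = ("Santiago, 18 February 2014. The lease shall start on 1 March 2014 "
          "and end on 28 February 2030.")
print(issue_date(sample, today=date(2015, 2, 18)))
```

In the sample, the first date satisfies all three criteria, the lease start date satisfies two and the future end date only one, so the first date is selected.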

Given that the Class or Multiclass of the document has been defined 704, one or more dates that are relevant in the document must then be searched for. The rules used previously to classify the document are used in the search for the key dates, since in some cases the text matching the rule has relevant information nearby. The relevant dates of the document are already defined in a pre-established list. The search for these dates is then made 705, completing the document characterization process 706. As an example, in a real estate lease agreement, it is useful to know the date on which the lease commences as well as the date on which the lease ends.

Class or Multi-Class Calibration

To improve the quality of this class or multi-class determination method, there is a training process using means of automatic specification. It receives a set of previously classified documents 401, defined as the training set, and then assigns different coefficients to all the defined rules, aiming to match the occurrence of the rules in the training set documents 403 and to match as closely as possible the classes assigned to each document 402. This process uses a Genetic Algorithm (404), also known as GA, with an objective function that defines the optimality of a specific combination of coefficients assigned to the rules. The objective function is based on measuring the differences between the classes pre-assigned to the training set documents 401 and the classes assigned automatically with a given candidate combination of coefficients for each rule and each document. The process then iteratively narrows this difference to arrive at a final improved set of coefficients 405. The improvement of the set of coefficients is a constant and iterative process, which enables the legal documents to be characterized correctly.
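The calibration loop can be sketched with a very small genetic algorithm, shown below. The text does not specify the GA's operators or parameters, so the population size, the coefficient range, the uniform crossover, the Gaussian mutation, the survival of the best half and the simplification to one predicted class per document are all assumptions made for illustration.

```python
import random


def predict(occurrences, coeffs, classes, rules):
    """Assign the single highest-scoring class (a single-class simplification)."""
    return max(classes, key=lambda c: sum(
        occurrences.get(r, 0) * coeffs[(r, c)] for r in rules))


def fitness(coeffs, training_set, classes, rules):
    """Objective: number of training documents whose class is reproduced."""
    return sum(predict(occ, coeffs, classes, rules) == label
               for occ, label in training_set)


def calibrate(training_set, classes, rules, generations=50, pop_size=20, seed=0):
    """Evolve one coefficient per (rule, class) pair to fit the training set."""
    rng = random.Random(seed)
    keys = [(r, c) for r in rules for c in classes]

    def score(individual):
        return fitness(individual, training_set, classes, rules)

    population = [{k: rng.uniform(-5, 5) for k in keys} for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]          # the best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = {k: rng.choice((a[k], b[k])) for k in keys}  # uniform crossover
            mutated = rng.choice(keys)
            child[mutated] += rng.gauss(0, 1)          # mutate one coefficient
            children.append(child)
        population = parents + children
    return max(population, key=score)


# Hypothetical training set: (rule occurrences, pre-assigned class).
RULES = ["existing purchase", "existing lease"]
CLASSES = ["purchase", "lease"]
TRAINING = [
    ({"existing purchase": 2}, "purchase"),
    ({"existing lease": 3}, "lease"),
    ({"existing purchase": 1, "existing lease": 4}, "lease"),
]

best = calibrate(TRAINING, CLASSES, RULES)
print(fitness(best, TRAINING, CLASSES, RULES), "of", len(TRAINING), "documents matched")
```

Because the best half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next, which reflects the iterative narrowing of the difference described above.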

Figure 8 illustrates an exemplary embodiment of a computing system based on a Web server 800 that stores and executes an embodiment of the present invention. This Web server stores an application that implements the characterization method and an application for OCR 803 that will process the documents received for characterization. The Web application 801 is the main application that users access to execute both the OCR and characterization processes, by connecting from a user computer 808 through a network 807 and using this Web application to characterize documents 809 that they have on the local disk of their own computer.

The Web server reacts to a user request for document characterization by receiving the document, processing it by OCR and retrieving the dictionary of valid words 802, stored in a file on the server's local disk, into the local memory 804, and also retrieving the rules collection and predefined document classes, stored in a database 806 accessible to the server. The characterization process is then executed by the CPU 805, resulting in a Web page with the information obtained by the characterization of the supplied document, which is sent back to the user's computer and displayed on its screen or display.

The calibration process operates in a similar way. An administrative user loads several pre-characterized documents to the Web Server, and executes the calibration process by another option in the same Web application. This calibration process finally modifies the coefficients for the rules updating them in the database.

Claims

1. A method of automatic document characterization which receives a given input unstructured document and results in the automatic assignment of one or more document classes or categories to which the contents relates to, an automatic determination of the document issue date, and an automatic determination of a list of the names of natural or legal persons found in the text and other relevant information mentioned in the text, which is related with a document class or multiclass, faster, more complete and more accurate than the manual characterization or manual description made by legal technical personnel, wherein this speed and accuracy comparison is feasible when implementing the process as a computing system, the method comprises the steps of:

a. Upon receiving a digital document, from a user connected from his own computer in a Web application, process it with an OCR application;

b. Then, execute an automatic document characterization process within the application and process the text of the document received from the user, wherein this involves, first the refinement of the text using a dictionary of valid words, then the revision of occurrence of big number of different logic rules with their calibrated coefficients, that are associated to classes, by recognizing any of many different language expressions and set of keywords, resulting in a score for each of the rules that allow determining which classes apply through a means of specification certain logic text-related rules;

c. Recognizing names of peoples and organizations in the document contents by using a wide variety of different possible verbalizations of these names and according to the text, determine which of these names are relevant to the contents and which only references through a detection means;

d. Recognizing and extracting relevant information based on the class or multiclass definition made and the set of rules defined for that class or multiclass, through the detection means;

e. Recognizing relevant dates in the document and assigning a score to each of them according to the preceding or following text, associated with different keywords and language expressions that used to refer to document issue dates, taking into account the previously defined class, and by using the defined rules of the same class, through the detection means, related to the class previously identified and the set of rules used to classify as such; and

f. Reviewing the document contents, recognizing different text patterns as a combination of keywords, synonyms or equivalent terms, within the contents of the document through detection means.

2. The method of claim 1 wherein the rules are defined using a configuration application that allows the user to specify a set of keywords or synonyms, number of occurrences in the contents, and/or usage within certain phrases, thus defining the rules.

3. The method of claim 1 wherein the characterization process is calibrated using automatic specification that defines a weight coefficient and value of the elements on each rule.

4. The method of claim 1 wherein valid language words and its synonyms list to be used in refining the text in a given document in digital format resulted from an optical character recognition process.