WO2015125088A1 - Document characterization method - Google Patents

Document characterization method

Info

Publication number
WO2015125088A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
class
text
rules
characterization
Prior art date
Application number
PCT/IB2015/051239
Other languages
French (fr)
Inventor
Rodrigo Andrés SANDOVAL URRICH
Juan Ignacio SÁA HARGOUS
José Manuel JIMENEZ MARIN
Original Assignee
Servicios Digitales Webdox Spa
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Family has litigation
First worldwide family litigation filed: https://patents.darts-ip.com/?family=53877689&utm_source=google_patent&utm_medium=platform_link&utm_campaign=public_patent_search&patent=WO2015125088(A1) ("Global patent litigation dataset" by Darts-ip is licensed under a Creative Commons Attribution 4.0 International License.)
Application filed by Servicios Digitales Webdox Spa
Publication of WO2015125088A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Definitions

  • The present invention relates to natural language processing focused on identifying relevant information in the contents of an input document in digital format, also defined as automatic document characterization, in a step-by-step process in which the information found in earlier steps is used to search for and define the data of the following steps.
  • The state of the art discloses several methods and techniques related to classification or characterization of documents according to their content.
  • Patent US8380718 discloses a system and method for grouping similar documents. Frequencies of occurrences are determined for terms and noun phrases within a set of documents. A sub-group of the documents is selected by removing those documents having terms and noun phrases that fall outside a bounded range of upper and lower conditions for frequency of occurrence.
  • The European patent EP1365329 (B1) discloses a classification method wherein one document is classified into at least one document class by selecting terms for use in the classification from among terms that occur in the document. A similarity between the input document and each class is calculated using information saved for every document class. The calculated similarity to each class is corrected. The class to which the input document belongs is determined in accordance with the corrected similarity to each class.
  • Document US2013110843 discloses a method and system for classifying insurance files for identification, sorting and efficient collection of subrogation claims.
  • the invention determines whether an insurance claim has merit to warrant claim recovery efforts utilizing software code for partially describing a set of documents having unstructured and structured file data containing terms and phrases having contextual bases, code for transforming the terms and phrases, code for iterating a classification process to determine rules that best classify the set of documents based upon context, code for incorporating the rules into an induction and knowledge representation, thesauri taxonomies and text summarization to classify subrogation claims; code for calculating a base score and a concept vector to identify the selected claims that demonstrate a given probability of subrogation recovery.
  • The present invention relates to natural language processing focused on identifying specific information elements in the contents of an input document in digital format, also defined as automatic document characterization, in a step-by-step process in which the information found in earlier steps is used to search for and define the data of the following steps.
  • This text processing approach is focused on documents in which different information elements are relevant to identify from the contents that do not follow a specific layout, order, or syntax, known as unstructured documents. Unstructured documents, such as legal documents, do not present a specific sequence of elements and the contents can be found using different expressions depending on the language used.
  • The information elements that are searched for in the document allow determining specific characteristics of this document. This is the case of document classes or categories, and other associated characteristics, such as dates, names of natural or legal persons mentioned in the text and other relevant information mentioned in the text, which are related to the document class or multiclass.
  • The method for automatic document characterization described here includes document classification, which is a sequential process for associating documents to pre-defined document classes, and has been addressed by different methods and techniques, such as naive Bayes and Support Vector Machines, among others. But these methods are not properly prepared for the natural overlapping of classes, which in this type of document is not only feasible, but expected. Also, these methods, except one, focus on specific keywords, not easily considering different language-dependent expressions as equivalent concepts or ideas.
  • The process for automatic document characterization processes the document contents five times (OCR, Class and Multiclass determination, Names detection, Document class-related characterization, Document date detection), since different approaches are needed for each of the information elements.
  • the first revision aims to refine the input text, acknowledging that the results of the Optical Character Recognition (OCR) might separate whole words by assuming spaces and end of lines.
  • OCR Optical Character Recognition
  • a dictionary of valid words in a given language is used to validate word merges, thus separated words are corrected, allowing the rest of the revisions to work easily.
  • The second revision focuses on the document classes by searching for semantic rules that might occur in the text and, according to the rules that occur, determining one or more classes that might apply.
  • the third revision focuses on the names referenced in the contents.
  • This revision uses a list of relevant keywords: those that might be included at the beginning or at the end of a name, and those that might precede or follow a name without being part of it.
  • The relevance of each of the defined keywords can be associated to the main class assigned to the document in the previous step, therefore allowing a fine-tuned detection process.
  • The fourth revision is related to the detailed characterization of the document, based on the class or multiclass that has been previously defined. Given this class or multiclass, there is a list of characteristics that a document has. As an example, in the case of a real estate lease agreement, it is useful to know the "lease rent" or the "leasable area", so this information must be searched for.
  • The final revision searches only for valid dates, but also considers some of the preceding text and the information already defined, to help in the determination of the issue date. It must be taken into consideration that each class may have one or more relevant dates, such as the date on which a lease agreement commences and the date on which it ends. At this step, the dates to search for will be pre-established, since the class of the document will be known.
  • This invention aims to produce a much faster, complete and accurate characterization result compared to the manual characterization that a legal expert may provide by reading the document contents.
  • This speed and accuracy comparison is feasible when implementing the process as a computing system available for document characterization for users of this system.
  • The accuracy of the characterization is supported by a rule calibration process that takes into account a set of previously classified documents and adjusts each rule's coefficients to match these documents' classes as closely as possible, which allows the continuous improvement of this accuracy.
  • FIG. 1 is a block diagram that represents the document characterization process
  • FIG. 2a is a block diagram that represents the Class and Multiclass determination process
  • FIG. 2b is a block diagram that represents the Class and Multiclass determination process with essential rules
  • FIG. 3 is a block diagram that represents an example of an embodiment of the invention. In this particular scenario, the "Existing purchase" rule was used as an example.
  • FIG. 4 is a block diagram that represents the Class and Multiclass calibration process.
  • FIG. 5 is a block diagram that represents the Names detection process.
  • FIG. 6 is a block diagram that represents the Document class-related characterization.
  • FIG. 7 is a block diagram that represents the Document date detection process.
  • FIG. 8 is a block diagram that describes a computer system for executing the document characterization, from the reception of the document to be characterized, the retrieval of words from the dictionary and text rules from the database, to the final display of resulting characteristics of the provided document.
  • the first step is the revision of the text with the recognition of individual words 101 with an Optical Character Recognition (OCR).
  • the Determination of Class or Multiclass is made 102, aligned with a Class or Multiclass rule calibration process 102a, which is an iterative process that improves the rule score definition, obtaining a better classification.
  • a names detection process 103 is completed, with the objective of obtaining relevant names of people and organizations in the document.
  • This process 103 takes into consideration the class with which the document has been associated, to make the search 105 more precise, considering a relevant set of keywords and expressions 103a to identify the names contained in the document.
  • a document class-related characterization 104, which searches for information taking into consideration the class previously defined.
  • The final step consists of the search and definition of the document's valid issue date and other relevant dates 106, which, similarly to the search of names, takes into consideration the class that has been defined.
  • This final document characterization 107 includes dates, names of natural or legal persons mentioned in the text and other relevant information mentioned in the text, producing a detailed description of the document in a fast and accurate way. This speed and accuracy comparison is feasible when implementing the process as a computing system available for document characterization for users of this system.
  • the document text is reviewed recognizing individual words obtained through an Optical Character Recognition 101 (OCR) processing means. If the document's origin is a word processor, this refinement process may produce just a minimal difference or no difference at all.
  • each individual word is concatenated with the next one and then compared to the list of valid words in the language of the document stored in the dictionary 101a. If this concatenated word is found, then it's kept for the following processes.
  • The process depends on a plurality of rules defined by means of the specification of logic text-related rules 301, each of them defining a set of keywords that represent equivalent verbal expressions 301a. These expressions are described by keywords, synonyms, conjugations, different numbers of words and/or equivalent verbalizations 301a of the same concept, also considering different levels of separation between these words, thus allowing more flexibility in text rule detection (2a.201 and 2b.201).
  • Each rule may be related to one or more classes (2a.202 and 2b.202a), but the relevance for each class is determined by a relevance coefficient, that may have a positive or negative value 305.
  • A rule can be considered essential, in which case its coefficient should be high, and the absence of this rule might even lead to a penalty for the document class in which this rule is considered essential 2b.202b.
  • FIG. 2b illustrates the duality of an essential rule 2b.202b, due to the different values that it may have for different classes.
  • the "existing purchase" rule may be relevant to indicate that the document is effectively a purchase document.
  • A document such as a real estate lease agreement won't have this rule as essential, and the weight that this rule has in the document's total score may be very low (2a.203 and 2b.203).
  • The rule "existing purchase" 301 applies to any type of purchase, such as real estate, chattels or shares.
  • this rule may be found in slightly different ways 301a.
  • A possible expression that relates to this rule is "... the parties agree to subscribe the present purchase agreement" 301b.
  • An alternative expression is "... the seller sells ... and the buyer buys ..." 301b, regardless of the item being considered in the sale.
  • The methodology for all the other rules works in a similar way, implementing a set of keywords that represent equivalent terms (302 and 304).
  • a different type of contract is the lease agreement.
  • the main rule that is found on this kind of agreement is "existing lease" which can be found as
  • The classification process searches the document from the beginning to the end, identifying each word individually and comparing it to one of the several rules defined 301a. If the word matches any of the words of any of the rules, then the following text is processed to complete each of those candidate rules being evaluated, while also comparing each new word with the other rules 301b.
  • the classification process ends up with a set of rules found to apply for each document (2a.202 and 2b.202). Then, the final classification score for each document class is calculated by this formula 305.
  • This detection, using the detection means, is based on the fact that documents mention persons' names preceded by some specific verbal expressions 502a, such as "Mr." or "Mrs.", and the name might be followed by a personal ID reference, or a verb that specifies the action for which this person is being referenced.
  • preceding text might also include equivalent expressions, and usually ends up with "LLC", "Inc." or other company type acronyms usually found in legal documents.
  • A document associated with the "purchase contract" document class, which is determined by the occurrence of the rule "existing purchase" and others, may include expressions such as "... as the seller", or "... also known as the buyer", as text that follows the names in the document.
  • These additional document class-dependent rules improve the complete name detection, especially the detection of the end of a name. It is also possible that, taking into consideration the previously defined class, names in the document are recognized as either relevant or non-relevant. The type of class previously defined has a strong impact on the way the mentioned process 502b is done.
  • This characterization refers to the detection of other specific data from the document, such as numeric amounts and proper names, among others. In some cases it is of great use to obtain particular information from a document, such as the lease rent or the leasable area in a real estate lease agreement.
  • the Class or Multiclass has already been defined 602a and the names search is finished 601, the list of characteristics of a particular document is defined. Given that list, the information of the document is searched 602. This is of great utility to the characterization 603 of the document, since it takes into consideration the class of the document to elaborate a complete characterization.
  • An example of this definition is the real estate lease agreement 603a.
  • The final step is detecting the document issue date. It considers that more than one valid date may be included in the document, but only one of them can be considered the document issue date. Also, this detection takes into account the existence of many verbal forms to describe a date 701.
  • The document date detection process is performed by recognizing each valid date and calculating a score through the detection means 702, which depends on the following criteria: the document issue date must be prior to the current date; it must be complete, which means it should include day, month, and year; and it can be preceded by an expression that references a location.
  • The definitive document issue date is the one that gets the highest score, or the first one in case of a tie 703. This process, the techniques to detect valid dates in the document, and the list of keywords that allow determination of the document issue date are configurable in an application.
  • Given that the Class or Multiclass of the document has been defined 704, one or more dates that are relevant in the document need to be searched for.
  • the rules used previously to define the document are used in the search of the key dates, since in some cases the text concerning the rule has relevant information nearby.
  • the relevant dates of the document will be already defined with a pre-established list. Then the search of these dates will be made 705, completing the document characterization process 706.
  • This process uses a Genetic Algorithm (404), also known as GA, with an objective function that defines the optimality of a specific combination of coefficients assigned to the rules. The objective is based on measuring the differences between the classes pre-assigned to the training set documents 401 and the classes assigned automatically with a given candidate combination of coefficients for each rule and each document. This difference is then iteratively narrowed to arrive at a final improved set of coefficients 405.
  • The improvement of the set of coefficients is a constant and iterative process, which makes it possible to characterize legal documents correctly.
  • FIG. 8 illustrates an exemplary embodiment of a computing system based on a Web server 800 that stores and executes an embodiment of the present invention.
  • This Web server stores an application that implements the characterization method and an application for OCR 803 that will process the documents received for characterization.
  • the Web application 801 is the main application that users access to execute both the OCR and Characterization processes, by connecting from a user computer 808, using a network 807, and using this Web application to characterize documents 809 they have in their own computer local disk.
  • the Web server reacts to a user request for document characterization by receiving the document, processing it by OCR and retrieving the dictionary of valid words 802, stored in a file in the local server's disk, to the local memory 804, and also retrieving the rules collection and document predefined classes, stored in a database 806 accessible for the server.
  • The characterization process is then executed by the CPU 805, resulting in a Web page with the information obtained by the characterization of the supplied document, which is sent back to the user's computer and displayed on its screen or display.
  • the calibration process operates in a similar way.
  • An administrative user loads several pre-characterized documents to the Web Server, and executes the calibration process by another option in the same Web application.
  • This calibration process finally modifies the coefficients for the rules updating them in the database.
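The coefficient calibration described above can be illustrated with a minimal genetic algorithm sketch in Python. This is not the calibration procedure of the invention: the single-class setup, the threshold, the population size, the elitist selection, the uniform crossover and the Gaussian mutation are all assumptions chosen for illustration, and the names `misclassified` and `calibrate` are hypothetical. The objective simply counts training documents whose automatically assigned class differs from the pre-assigned one.

```python
import random

def misclassified(coeffs, training_set, threshold=0.5):
    """Objective (assumed): number of documents whose predicted label
    differs from the label pre-assigned in the training set."""
    errors = 0
    for found_rules, is_in_class in training_set:
        score = sum(coeffs[r] for r in found_rules)
        if (score >= threshold) != is_in_class:
            errors += 1
    return errors

def calibrate(rules, training_set, generations=50, pop_size=20, seed=0):
    """Evolve one coefficient per rule; returns the best candidate found."""
    rng = random.Random(seed)
    population = [{r: rng.uniform(-1, 1) for r in rules} for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half unchanged (elitism).
        population.sort(key=lambda c: misclassified(c, training_set))
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each coefficient comes from one parent.
            child = {r: rng.choice((a[r], b[r])) for r in rules}
            # Mutation: perturb one randomly chosen coefficient.
            child[rng.choice(rules)] += rng.gauss(0, 0.2)
            children.append(child)
        population = parents + children
    return min(population, key=lambda c: misclassified(c, training_set))

# Hypothetical training set for a single "purchase" class:
# (set of rules found in the document, pre-assigned membership).
rules = ["existing purchase", "existing lease"]
training_set = [
    ({"existing purchase"}, True),
    ({"existing lease"}, False),
    (set(), False),
]
best = calibrate(rules, training_set)
```

Because the better half of each generation is carried over unchanged, the best error count never increases from one generation to the next. A multiclass version would keep one coefficient per rule and class, as the text describes.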

Abstract

The present invention describes a method of automatic document characterization which receives a given unstructured input document and results in the automatic assignment of one or more document classes or categories to which the contents relate, the automatic determination of a list of the names of natural or legal persons found in the text, and the automatic determination of other relevant information mentioned in the text and the document issue date, which is related to the document class or multiclass.

Description

DOCUMENT CHARACTERIZATION METHOD
TECHNICAL FIELD
The present invention relates to natural language processing focused on identifying relevant information in the contents of an input document in digital format, also defined as automatic document characterization, in a step-by-step process in which the information found in earlier steps is used to search for and define the data of the following steps.
BACKGROUND
The state of the art discloses several methods and techniques related to classification or characterization of documents according to their content.
Patent US8380718 (B2) discloses a system and method for grouping similar documents. Frequencies of occurrences are determined for terms and noun phrases within a set of documents. A sub-group of the documents is selected by removing those documents having terms and noun phrases that fall outside a bounded range of upper and lower conditions for frequency of occurrence.
On the other hand, the European patent EP1365329 (B1) discloses a classification method wherein one document is classified into at least one document class by selecting terms for use in the classification from among terms that occur in the document. A similarity between the input document and each class is calculated using information saved for every document class. The calculated similarity to each class is corrected. The class to which the input document belongs is determined in accordance with the corrected similarity to each class.
Document US2013110843 discloses a method and system for classifying insurance files for identification, sorting and efficient collection of subrogation claims. The invention determines whether an insurance claim has merit to warrant claim recovery efforts utilizing software code for partially describing a set of documents having unstructured and structured file data containing terms and phrases having contextual bases, code for transforming the terms and phrases, code for iterating a classification process to determine rules that best classify the set of documents based upon context, code for incorporating the rules into an induction and knowledge representation, thesauri taxonomies and text summarization to classify subrogation claims; code for calculating a base score and a concept vector to identify the selected claims that demonstrate a given probability of subrogation recovery.
First, the methods and systems described above do not allow a multiple classification process or characterization of different content elements included in each document.
Second, all the methods described above, except one, do not consider multiple language-dependent forms for the same semantic expression (grouped in rules), which would provide a more flexible way to recognize specific text patterns in the document; instead, these methods depend on a set of keywords that are synonyms of each other. Each expression may be composed of different numbers of words, with different synonyms, conjugations, orders, and even a set of combined words separated by no more than a specific number of words.
Third, the methods described above do not measure the importance of an expression in a specific context by giving that expression more or less weight in the result of the characterization. Finally, methods focused on classification do not address a detection process in which the contents found in the first steps of the characterization drive the behavior and performance of the following steps.
SUMMARY
The present invention relates to natural language processing focused on identifying specific information elements in the contents of an input document in digital format, also defined as automatic document characterization, in a step-by-step process in which the information found in earlier steps is used to search for and define the data of the following steps. This text processing approach is focused on documents in which different information elements are relevant to identify from contents that do not follow a specific layout, order, or syntax, known as unstructured documents. Unstructured documents, such as legal documents, do not present a specific sequence of elements and the contents can be found using different expressions depending on the language used.
The information elements that are searched for in the document allow determining specific characteristics of this document. This is the case of document classes or categories, and other associated characteristics, such as dates, names of natural or legal persons mentioned in the text and other relevant information mentioned in the text, which are related to the document class or multiclass.
Although the way these elements are searched differs from one another, in all of them the approach is to recognize specific text patterns within the document contents and drive the detection process with the contents found in the first steps of the characterization, utilizing the set of rules defined for each class. The method for automatic document characterization described here includes document classification, which is a sequential process for associating documents to pre-defined document classes, and has been addressed by different methods and techniques, such as naive Bayes and Support Vector Machines, among others. But these methods are not properly prepared for the natural overlapping of classes, which in this type of document is not only feasible, but expected. Also, these methods, except one, focus on specific keywords, not easily considering different language-dependent expressions as equivalent concepts or ideas.
The process for automatic document characterization processes the document contents five times (OCR, Class and Multiclass determination, Names detection, Document class-related characterization, Document date detection), since different approaches are needed for each of the information elements.
The first revision aims to refine the input text, acknowledging that the results of the Optical Character Recognition (OCR) might split whole words by wrongly assuming spaces and ends of lines. A dictionary of valid words in a given language is used to validate word merges, so that split words are corrected, allowing the rest of the revisions to work more easily.
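As an illustration of this refinement, the following Python sketch joins each token with the next one and keeps the merged form only when it appears in a dictionary of valid words. It is a minimal sketch under assumptions: the function name `merge_split_words`, the lowercase comparison and the tiny example dictionary are hypothetical and not taken from the described method.

```python
def merge_split_words(tokens, dictionary):
    """Join adjacent tokens when their concatenation is a valid dictionary word."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            candidate = tokens[i] + tokens[i + 1]
            if candidate.lower() in dictionary:
                # The OCR inserted a spurious space: keep the merged word.
                merged.append(candidate)
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged

# Hypothetical dictionary and OCR output with one wrongly split word.
dictionary = {"the", "lease", "agreement", "parties"}
tokens = "the lease agree ment".split()
refined = merge_split_words(tokens, dictionary)
```

Text that comes from a word processor would pass through this step unchanged unless two adjacent words happen to form a third valid word, a case a fuller implementation would need to handle.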
The second revision focuses on the document classes by searching for semantic rules that might occur in the text and, according to the rules that occur, determining one or more classes that might apply.
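The scoring formula itself is not reproduced in this text, so the following Python sketch only illustrates one plausible reading: each class score is the sum of the relevance coefficients of the rules found, minus a penalty for every essential rule that is absent, and every class whose score reaches a threshold is assigned, which allows multiclass results. The coefficient values, the penalty, the threshold and the function names are assumptions.

```python
def score_classes(found_rules, coefficients, essential, penalty=1.0):
    """Score each class from the rules found in a document (assumed weighted sum)."""
    scores = {}
    for cls, rule_weights in coefficients.items():
        total = sum(w for rule, w in rule_weights.items() if rule in found_rules)
        # Essential rules that are absent reduce the class score.
        missing = [r for r in essential.get(cls, ()) if r not in found_rules]
        scores[cls] = total - penalty * len(missing)
    return scores

def assign_classes(scores, threshold=0.5):
    """Return every class whose score reaches the threshold (multiclass allowed)."""
    return sorted(cls for cls, s in scores.items() if s >= threshold)

# Hypothetical coefficients: positive or negative relevance per rule and class.
coefficients = {
    "purchase": {"existing purchase": 0.9, "existing lease": -0.4},
    "lease": {"existing lease": 0.9, "existing purchase": 0.1},
}
essential = {"purchase": ["existing purchase"], "lease": ["existing lease"]}

scores = score_classes({"existing purchase"}, coefficients, essential)
classes = assign_classes(scores)
```

Here the same rule carries a high coefficient for one class and a low one for another, so a document where both rules occur could be assigned to more than one class.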
The third revision focuses on the names referenced in the contents. This revision uses a list of relevant keywords: those that might be included at the beginning or at the end of a name, and those that might precede or follow a name without being part of it. The relevance of each of the defined keywords can be associated to the main class assigned to the document in the previous step, therefore allowing a fine-tuned detection process.
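A minimal Python sketch of keyword-driven name detection follows. It assumes English titles that precede a person's name without being part of it ("Mr.", "Mrs.") and company-type acronyms that close an organization's name ("LLC", "Inc."); the patterns, the capitalized-word heuristic and the example sentence are hypothetical simplifications.

```python
import re

# Keywords that precede a person's name but are not part of it (assumed list).
PREFIXES = r"(?:Mr\.|Mrs\.)"
# Keywords that are included at the end of an organization's name (assumed list).
SUFFIXES = r"(?:LLC|Inc\.)"
# Simplification: a name word starts with a capital letter.
WORD = r"[A-Z][\w&'-]*"

PERSON = re.compile(rf"{PREFIXES}\s+((?:{WORD}\s+)*{WORD})")
COMPANY = re.compile(rf"\b((?:{WORD}\s+)+{SUFFIXES})")

def detect_names(text):
    """Return (persons, organizations) found through the surrounding keywords."""
    return PERSON.findall(text), COMPANY.findall(text)

text = "Mr. John Smith, as the seller, and Acme Holdings LLC, as the buyer."
persons, organizations = detect_names(text)
```

A class-aware version could add the expressions expected after a name for the assigned class to help locate where each name ends.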
The fourth revision is related to the detailed characterization of the document, based on the class or multiclass that has been previously defined. Given this class or multiclass, there is a list of characteristics that a document has. As an example, in the case of a real estate lease agreement, it is useful to know the "lease rent" or the "leasable area", so this information must be searched for.
The final revision searches only for valid dates, but also considers some of the preceding text and the information already defined, to help in the determination of the issue date. It must be taken into consideration that each class may have one or more relevant dates, such as the date on which a lease agreement commences and the date on which it ends. At this step, the dates to search for will be pre-established, since the class of the document will be known.
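The issue date selection can be sketched in Python as follows. The sketch assumes a single verbal form for dates ("5 March 2014"), an optional preceding location expression ("In Santiago, on"), and one point per satisfied criterion; these choices and the names `DATE_RE` and `find_issue_date` are assumptions, and a real implementation would recognize many more date forms.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Only complete dates (day, month, year) match; the location prefix is optional.
DATE_RE = re.compile(
    r"(?P<loc>In\s+[A-Z]\w+,\s+on\s+)?"
    r"(?P<day>\d{1,2})\s+(?P<month>[A-Z][a-z]+)\s+(?P<year>\d{4})")

def find_issue_date(text, today):
    """Score each valid date and return the best one (first wins on a tie)."""
    candidates = []
    for m in DATE_RE.finditer(text):
        month = MONTHS.get(m.group("month"))
        if month is None:
            continue
        try:
            d = date(int(m.group("year")), month, int(m.group("day")))
        except ValueError:
            continue  # e.g. "31 February 2014" is not a valid date
        # Assumed scoring: one point if prior to today, one if a location precedes it.
        score = (1 if d < today else 0) + (1 if m.group("loc") else 0)
        candidates.append((score, d))
    if not candidates:
        return None
    # max() returns the first maximal element, which implements the tie rule.
    return max(candidates, key=lambda c: c[0])[1]

text = ("In Santiago, on 5 March 2014, the parties agree that "
        "the lease commences on 1 April 2014.")
issue_date = find_issue_date(text, today=date(2015, 1, 1))
```

The second date in the example is a valid candidate but scores lower because no location expression precedes it.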
This invention aims to produce a much faster, more complete and more accurate characterization result compared to the manual characterization that a legal expert may provide by reading the document contents. This speed and accuracy comparison is feasible when implementing the process as a computing system available for document characterization for users of this system. The accuracy of the characterization is supported by a rule calibration process that takes into account a set of previously classified documents and adjusts each rule's coefficients to match these documents' classes as closely as possible, which allows the continuous improvement of this accuracy.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram that represents the document characterization process
FIG. 2a is a block diagram that represents the Class and Multiclass determination process
FIG. 2b is a block diagram that represents the Class and Multiclass determination process with essential rules
FIG. 3 is a block diagram that represents an example of an embodiment of the invention. In this particular scenario, "Existing purchase" rule was used as an example.
FIG. 4 is a block diagram that represents the Class and Multiclass calibration process.
FIG. 5 is a block diagram that represents the Names detection process.
FIG. 6 is a block diagram that represents the Document class-related characterization.
FIG. 7 is a block diagram that represents the Document date detection process.
FIG. 8 is a block diagram that describes a computer system for executing the document characterization, from the reception of the document to be characterized, the retrieval of words from the dictionary and text rules from the database, to the final display of resulting characteristics of the provided document.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
In this section each process is described in detail to make it easier to understand. Each process has its own figure (FIG. 2a, 2b, 3, 4, 5, 6, 7) and explanation, and a figure of the complete process is also included (FIG. 1).
Main process
The following text uses FIG. 1 to explain the whole process. Each individual process is explained in its own section after this overview. As mentioned earlier, the first step is the revision of the text, recognizing individual words 101 obtained through Optical Character Recognition (OCR). Once the text has been revised, the Determination of Class or Multiclass 102 is made, aligned with a Class or Multiclass rule calibration process 102a, which is an iterative process that improves the rule score definition and thus yields a better classification. With the class already defined, a names detection process 103 is completed, with the objective of obtaining the relevant names of people and organizations in the document.
This process 103 takes into consideration the class with which the document has been associated to make the search 105 more precise, considering a relevant set of keywords and expressions 103a to identify the names contained in the document. Once the classes and names have been defined, it is possible to initiate a document class-related characterization 104, which searches for information taking into consideration the class previously defined. The final step consists of the search and definition of the document's valid issue date and other relevant dates 106, which, like the search for names, takes into consideration the class that has been defined. Once all of these processes have finished, the final document characterization 107 is complete. This final document characterization 107 includes dates, names of natural or legal persons mentioned in the text and other relevant information mentioned in the text, producing a detailed description of the document quickly and accurately. This speed and accuracy comparison is feasible when the process is implemented as a computing system that users can access to characterize documents.
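The order of the steps in FIG. 1 can be sketched as a simple pipeline. This is only an illustrative outline: the `Characterization` container, the `characterize_document` function and the five callables passed to it are hypothetical names introduced here, and the patent does not prescribe any particular programming interface.

```python
from dataclasses import dataclass, field


@dataclass
class Characterization:
    """Hypothetical container for the final characterization (107)."""
    classes: list = field(default_factory=list)
    names: list = field(default_factory=list)
    details: dict = field(default_factory=dict)
    dates: dict = field(default_factory=dict)


def characterize_document(raw_text, refine, classify, find_names,
                          find_details, find_dates):
    """Run the steps in the order described for FIG. 1.

    Every step is supplied by the caller; all steps after the
    classification receive the detected classes, as the text requires.
    """
    text = refine(raw_text)                  # 101: text refinement
    classes = classify(text)                 # 102: class or multiclass
    names = find_names(text, classes)        # 103: names detection
    details = find_details(text, classes)    # 104: class-related data
    dates = find_dates(text, classes)        # 106: issue and key dates
    return Characterization(classes, names, details, dates)  # 107
```

Passing the steps as parameters keeps the sketch independent of how each step is implemented.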
For further details on each of the processes, please refer to the respective section.
Document Text Refinement
The document text is reviewed recognizing individual words obtained through an Optical Character Recognition 101 (OCR) processing means. If the document's origin is a word processor, this refinement process may produce just a minimal difference or no difference at all. To attempt correcting the detection of spaces within words by the OCR process, each individual word is concatenated with the next one and then compared to the list of valid words in the language of the document stored in the dictionary 101a. If this concatenated word is found, then it's kept for the following processes.
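The concatenation check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `merge_split_words` and the tiny in-memory dictionary are hypothetical, and lower-casing before the lookup is a choice made here, not something the text specifies.

```python
def merge_split_words(words, dictionary):
    """Join adjacent OCR tokens whose concatenation is a valid word.

    `dictionary` is a set of valid lower-case words (the dictionary
    101a in the text). Tokens that do not form a valid word when
    joined with their successor are kept unchanged.
    """
    refined = []
    i = 0
    while i < len(words):
        if i + 1 < len(words):
            candidate = words[i] + words[i + 1]
            if candidate.lower() in dictionary:
                refined.append(candidate)
                i += 2  # both fragments have been consumed
                continue
        refined.append(words[i])
        i += 1
    return refined


dictionary = {"the", "tenant", "agreement", "lease", "signs"}
tokens = ["the", "ten", "ant", "signs", "the", "agree", "ment"]
print(merge_split_words(tokens, dictionary))
# ['the', 'tenant', 'signs', 'the', 'agreement']
```

One limitation of this sketch is that two valid words whose concatenation is also valid (for example "in" and "to") would be merged; handling that case is outside what the text describes.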
Class or Multi-Class Determination
The process depends on a plurality of rules defined by means of specification of logic text-related rules 301, each of them defining a set of keywords that represent equivalent verbal expressions 301a. These expressions are described by keywords, synonyms, conjugations, different numbers of words and/or equivalent verbalizations 301a of the same concept, also considering different levels of separation between these words, thus allowing more flexibility in text rule detection (2a.201 and 2b.201). Each rule may be related to one or more classes (2a.202 and 2b.202a), but its relevance for each class is determined by a relevance coefficient, which may have a positive or negative value 305. In some classes, a rule can be considered essential, in which case the coefficient should be high, and the absence of this rule might even lead to a penalty for the document class in which the rule is considered essential 2b.202b. This means that the very same rule might be highly relevant for one class of documents, with a rather high positive coefficient value, and even considered essential; slightly relevant for a second class, with a lower positive coefficient value; of no relevance at all for another class, with a null coefficient; and even contradictory for yet another class of documents, with a negative coefficient value 2b.202b.
FIG. 2b illustrates the duality of an essential rule 2b.202b, which arises from the different values that it may have for different classes. For a given class such as a real estate purchase document, the "existing purchase" rule may be relevant to indicate that the document is effectively a purchase document. On the other hand, for a document such as a real estate lease agreement this rule is not essential, and the weight that this rule has in the document's total score may be very low (2a.203 and 2b.203).
For example, in the preferred embodiment of this method, the characterization of legal documents, the rule "existing purchase" 301 applies to any type of purchase, such as real estate, chattels or shares. In any of these different types of agreements, this rule may be found in slightly different ways 301a. A possible expression that relates to this rule is "... the parties agree to subscribe the present purchase agreement ..." 301b. An alternative expression is "... the seller sells ... and the buyer buys ..." 301b, regardless of the item being considered in the sale. The methodology for all the other rules works in a similar way, implementing a set of keywords that represent equivalent terms (302 and 304).
An additional and complementary rule that applies to this kind of contract is "existing purchase price", which refers to the price the buyer is going to pay. As with the "existing purchase" rule, this rule can be found in different forms in the text, such as "... the price paid by the buyer is the following ..." or "... the price of ... shall be ...".
In a related example, a different type of contract is the lease agreement. The main rule found in this kind of agreement is "existing lease", which can appear as "... the landlord hereby leases the premises to the tenant ...", and later in the document as "... the tenant hires the premises", which corresponds to a real estate lease agreement. A different way of expressing this exact same rule is "... the lessee receives from the lessor ...". In the same type of document, another rule that can be found is "length of the lease", which describes the duration of the contract. This rule can be found in the document as "... this lease shall start on [date1] and end on [date2]", or also as "... this lease will be in effect from [date] for a duration of X days ...". These rules are found in lease contracts, so rather high coefficients should be set to associate them with the lease contract class; one or both of them may even be considered essential and thus be assigned the highest coefficients for this class (303).
The classification process searches the document from the beginning to the end, identifying each word individually and comparing it to one of the several rules defined 301a. If the word matches any of the words of any of the rules, then the following text is processed to complete each of those candidate rules being evaluated, but also comparing each new word with other rules as well 301 b.
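A rule made of equivalent expressions, each tolerating some separation between its keywords, can be approximated with regular expressions. The sketch below is an assumption-laden illustration rather than the patented matcher: the function `count_rule_occurrences`, the `max_gap` parameter and the keyword lists are hypothetical, and a regex scan stands in for the word-by-word candidate tracking described above.

```python
import re


def count_rule_occurrences(text, expressions, max_gap=4):
    """Count how often any expression of a rule occurs in the text.

    Each expression is a list of keywords that must appear in order,
    with at most `max_gap` other words between consecutive keywords.
    """
    # Up to max_gap intervening words, then the separator before the
    # next keyword; the lazy quantifier prefers the closest keyword.
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_gap
    total = 0
    for keywords in expressions:
        pattern = r"\b" + gap.join(re.escape(k) for k in keywords) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


# Two equivalent verbalizations of the "existing purchase" rule,
# taken from the example expressions in the text.
EXISTING_PURCHASE = [
    ["parties", "agree", "purchase", "agreement"],
    ["seller", "sells", "buyer", "buys"],
]

sample = "The seller sells the car and the buyer buys it."
print(count_rule_occurrences(sample, EXISTING_PURCHASE))  # 1
```

Synonyms and conjugations could be handled by listing additional expressions for the same rule.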
The classification process ends up with a set of rules found to apply for each document (2a.202 and 2b.202). Then, the final classification score for each document class is calculated by this formula 305.
ClassScore(document, rule) = occurrences * coefficient
This means that if the coefficient of a certain rule is high for a document class, each repeated occurrence of that rule adds to the score for assigning that class to the document.
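As a worked illustration of the formula, the sketch below sums occurrences times coefficient over all rules of a class and subtracts a penalty when an essential rule is absent. The summation over rules, the penalty value, the coefficient values and the function name `class_scores` are all assumptions made for this example; the text states only the per-rule product and that a missing essential rule may be penalized.

```python
def class_scores(occurrences, coefficients, essential=None, penalty=0.0):
    """Score every class for one document.

    occurrences:  {rule: number of times the rule was found}
    coefficients: {class: {rule: relevance coefficient}}
    essential:    {class: [rules considered essential]} (optional)
    penalty:      amount subtracted per missing essential rule
                  (hypothetical; the text gives no value)
    """
    scores = {}
    for cls, rule_coeffs in coefficients.items():
        score = sum(occurrences.get(rule, 0) * coeff
                    for rule, coeff in rule_coeffs.items())
        for rule in (essential or {}).get(cls, ()):
            if occurrences.get(rule, 0) == 0:
                score -= penalty
        scores[cls] = score
    return scores


coefficients = {
    "purchase": {"existing purchase": 5.0,
                 "existing purchase price": 3.0,
                 "existing lease": -2.0},
    "lease": {"existing lease": 5.0,
              "length of the lease": 3.0,
              "existing purchase": -2.0},
}
essential = {"purchase": ["existing purchase"], "lease": ["existing lease"]}
found = {"existing purchase": 2, "existing purchase price": 1}

print(class_scores(found, coefficients, essential, penalty=4.0))
# {'purchase': 13.0, 'lease': -8.0}
```

Here the negative coefficient and the missing essential rule both push the lease score down, while the repeated "existing purchase" rule raises the purchase score.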
The scores defined for each class feed the Class or Multiclass calibration (306, 2a.204 and 2b.204), and this iterative process (2a.205 and 2b.205) then modifies and improves the set of rules used to classify the document.
Names Detection
Any document, especially a legal document, might reference the names of natural persons and organizations, so this process focuses on producing a list of all the names found in the contents 501. This detection, using the detection means, is based on the fact that documents mention persons' names preceded by specific verbal expressions 502a, such as "Mr." or "Mrs.", and that the name might be followed by a personal ID reference or by a verb that specifies the action for which this person is referenced. In the case of organizations, the preceding text might also include equivalent expressions, and the name usually ends with "LLC", "Inc." or other company type acronyms usually found in legal documents. Therefore, this step considers a list of preceding expressions, which indicate that the following text is a name but are not included in the name itself, such as "Mr." or "Mrs.", and also a list of following expressions, which may or may not be included in the name and are used to find the end of the referenced name. This step and the lists of preceding and following expressions are included in an application 502.
To improve the names detection process, some extra rules may be considered depending on the document classes determined in previous steps 501a. For example, a document associated with the "purchase contract" document class, which is determined by the occurrence of the rule "existing purchase" and others, may include expressions such as "... as the seller ..." or "... also known as the buyer ..." as text that follows the names in the document. These additional class-dependent rules improve the complete name detection, especially the detection of the end of a name. Taking the previously defined class into consideration also makes it possible to recognize which names in the document are relevant and which are not. The class previously defined therefore has a large impact on the way the mentioned process 502b is carried out.
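The use of preceding expressions, company suffixes and class-dependent following expressions can be sketched with regular expressions. Everything named here (`detect_names`, the expression lists, the capitalized-word heuristic for a name) is an assumption chosen for illustration; the text only states that such lists exist and are configurable.

```python
import re

# A name is approximated as a run of capitalized words (assumption).
NAME = r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*"
PRECEDING = [r"Mr\.", r"Mrs\."]            # not part of the name
COMPANY_SUFFIXES = [r"LLC", r"Inc\."]      # kept as part of the name
CLASS_FOLLOWING = {                        # extra rules per class
    "purchase contract": [r"as the seller", r"also known as the buyer"],
}


def detect_names(text, document_class=None):
    """Return the names found in the text, in order of appearance."""
    patterns = [
        r"(?:%s)\s+(%s)" % ("|".join(PRECEDING), NAME),
        r"((?:[A-Z][\w&'-]*\s+)+(?:%s))" % "|".join(COMPANY_SUFFIXES),
    ]
    following = CLASS_FOLLOWING.get(document_class)
    if following:
        # Class-dependent expressions mark where a name ends.
        patterns.append(r"(%s),?\s+(?:%s)" % (NAME, "|".join(following)))
    found = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            found.append((match.start(1), match.group(1)))
    names = []
    for _, name in sorted(found):
        if name not in names:  # drop repeated mentions
            names.append(name)
    return names


text = ("Jane Doe, as the seller, sells the property to "
        "Mr. John Smith and Acme Holdings LLC.")
print(detect_names(text))
# ['John Smith', 'Acme Holdings LLC']
print(detect_names(text, "purchase contract"))
# ['Jane Doe', 'John Smith', 'Acme Holdings LLC']
```

In this example the seller's name carries no title, so it is found only when the class-dependent expressions for the hypothetical "purchase contract" class are active.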
Document class-related characterization
This characterization refers to the detection of other specific data in the document, such as numeric amounts and proper names, among others. In some cases it is very useful to obtain particular information from a document, such as the lease rent or the leasable area in a real estate lease agreement. Once the Class or Multiclass has been defined 602a and the names search is finished 601, the list of characteristics of the particular document is defined. Given that list, the document is searched for the information 602. This is very useful for the characterization 603 of the document, since it takes the class of the document into consideration to build a complete characterization. An example is the real estate lease agreement 603a. In this document it is useful to know the lease rent 603b, so the search for this information is carried out once the class has been defined. The rules defined for the class or multiclass are very useful here, since the appearance of a rule in the text is clear evidence that class-related information must appear, so once the rule has been identified in the text, it is possible to extract the information related to the rule. This approach can be replicated for other documents and their particular information. This process is explained in FIG. 6.
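A class-dependent list of characteristics can be sketched as a table of extraction patterns per class. The table `CHARACTERISTICS`, the function `characterize`, the regular expressions and the currency and unit formats are hypothetical examples; the text names the lease rent and leasable area but does not specify how they are written in a document.

```python
import re

# Hypothetical characteristics per class, each with a pattern whose
# first group captures the value that follows the trigger keyword.
CHARACTERISTICS = {
    "real estate lease agreement": {
        "lease rent": r"rent\D{0,40}?(\$\s?[\d,]+(?:\.\d+)?)",
        "leasable area":
            r"area\D{0,40}?([\d,]+(?:\.\d+)?\s?(?:square meters|m2))",
    },
}


def characterize(text, document_class):
    """Extract the characteristics defined for the given class.

    Returns {characteristic: value or None}; an unknown class has no
    characteristics defined, so the result is empty.
    """
    result = {}
    for name, pattern in CHARACTERISTICS.get(document_class, {}).items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[name] = match.group(1) if match else None
    return result


text = ("The monthly rent shall be $1,500 and the leasable area "
        "is 120 square meters.")
print(characterize(text, "real estate lease agreement"))
# {'lease rent': '$1,500', 'leasable area': '120 square meters'}
```

Because the lookup is keyed by class, a purchase document would never be searched for a lease rent.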
Document Date Detection
Using the detection means, the final step is detecting the document issue date. This step considers that more than one valid date may be included in the document, but only one of them can be considered the document issue date. The detection also takes into account the existence of many verbal forms to describe a date 701.
Taking these considerations into account, the document date detection process is performed by recognizing each valid date and calculating a score through the detection means 702, which depends on the following criteria: the document issue date must be prior to the current date; it must be complete, which means it should include day, month and year; and it can be preceded by an expression that references a location.
The definitive document issue date is the one that gets the highest score or, in case of a tie, the first one found 703. This process, the techniques to detect valid dates in the document, and the list of keywords that allow determination of the document issue date are configurable in an application.
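The scoring and tie-breaking described above can be sketched as follows. The point values, the single "Month D, YYYY" date format, the "In <City>," location expression and the function name `issue_date` are assumptions for this example; the text lists the criteria but not their weights or the supported verbal forms.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Optional location prefix, month name, optional day, four-digit year.
DATE_RE = re.compile(
    r"(?P<loc>In [A-Z]\w+, )?(?P<month>%s) (?:(?P<day>\d{1,2}), )?"
    r"(?P<year>\d{4})" % "|".join(MONTHS)
)


def issue_date(text, today):
    """Return the highest-scoring valid date, or None if there is none."""
    best = None
    for m in DATE_RE.finditer(text):
        score = 0
        day = m.group("day")
        if day:
            score += 2  # complete date: day, month and year
        try:
            candidate = date(int(m.group("year")),
                             MONTHS.index(m.group("month")) + 1,
                             int(day or 1))
        except ValueError:
            continue  # not a valid calendar date, e.g. February 30
        if candidate < today:
            score += 1  # prior to the current date
        if m.group("loc"):
            score += 1  # preceded by a location expression
        # Strict comparison keeps the first date in case of a tie.
        if best is None or score > best[0]:
            best = (score, candidate)
    return best[1] if best else None


text = ("In Santiago, March 5, 2014, the lease starts on April 1, 2014 "
        "and ends on April 1, 2030.")
print(issue_date(text, today=date(2015, 1, 1)))  # 2014-03-05
```

Passing `today` explicitly keeps the sketch deterministic; a real system would use the current date.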
Given that the Class or Multiclass of the document has been defined 704, one or more dates that are relevant in the document must also be searched for. The rules previously used to classify the document are used in the search for the key dates, since in some cases the text concerning the rule has relevant information nearby. The relevant dates of the document are already defined in a pre-established list. The search for these dates is then made 705, completing the document characterization process 706. As an example, in a real estate lease agreement, it is useful to know the date on which the lease commences as well as the date on which it ends.
Class or Multi-Class Calibration
To improve the quality of this class or multi-class determination method, there is a training process using means of automatic specification that receives a set of previously classified documents 401, defined as the training set, and then proceeds to assign different coefficients to all the defined rules, aiming to match the occurrence of them in the training set documents 403, and also attempting to match as closely as possible the classes assigned to each of them 402. This process uses a Genetic Algorithm 404, also known as GA, with an objective function that defines the optimality of a specific combination of coefficients assigned to the rules. The function is based on measuring the differences between the classes pre-assigned to the training set documents 401 and the classes assigned automatically with a given candidate combination of coefficients for each rule and each document. The process then iteratively narrows this difference to arrive at a final improved set of coefficients 405. The improvement of the set of coefficients is a constant and iterative process, which makes it possible to characterize legal documents correctly.
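A very small genetic algorithm for this calibration might look like the sketch below. It is illustrative only: the rule and class names, population size, selection scheme, crossover and mutation operators and the fitness measure (number of training documents classified correctly) are assumptions, since the text specifies only that a GA minimizes the difference between pre-assigned and automatically assigned classes.

```python
import random

RULES = ["existing purchase", "existing purchase price",
         "existing lease", "length of the lease"]
CLASSES = ["purchase", "lease"]


def classify(occurrences, coeffs):
    """Pick the class with the highest sum of occurrences * coefficient.

    `coeffs` is a flat list: one coefficient per (class, rule) pair.
    """
    scores = [sum(occurrences.get(r, 0) * coeffs[ci * len(RULES) + ri]
                  for ri, r in enumerate(RULES))
              for ci in range(len(CLASSES))]
    return CLASSES[scores.index(max(scores))]


def fitness(coeffs, training_set):
    """Number of training documents whose class is reproduced."""
    return sum(classify(occ, coeffs) == label for occ, label in training_set)


def calibrate(training_set, population_size=30, generations=40, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducible runs
    n = len(RULES) * len(CLASSES)
    population = [[rng.uniform(-1, 1) for _ in range(n)]
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=lambda c: fitness(c, training_set), reverse=True)
        parents = population[: population_size // 2]  # keep the fitter half
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]           # one-point crossover
            child[rng.randrange(n)] += rng.gauss(0, 0.3)  # mutation
            children.append(child)
        population = parents + children
    return max(population, key=lambda c: fitness(c, training_set))


training_set = [
    ({"existing purchase": 2, "existing purchase price": 1}, "purchase"),
    ({"existing purchase": 1}, "purchase"),
    ({"existing lease": 2, "length of the lease": 1}, "lease"),
    ({"existing lease": 1}, "lease"),
]
best = calibrate(training_set)
print(fitness(best, training_set), "of", len(training_set), "matched")
```

Because the fitter half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next.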
Figure 8 illustrates an exemplary embodiment of a computing system based on a Web server 800 that stores and executes an embodiment of the present invention. This Web server stores an application that implements the characterization method and an OCR application 803 that processes the documents received for characterization. The Web application 801 is the main application that users access to execute both the OCR and characterization processes, by connecting from a user computer 808 over a network 807 and using this Web application to characterize documents 809 stored on their own computer's local disk.
The Web server reacts to a user request for document characterization by receiving the document, processing it by OCR and retrieving the dictionary of valid words 802, stored in a file on the server's local disk, into local memory 804, and also retrieving the rules collection and predefined document classes, stored in a database 806 accessible to the server. The characterization process is then executed by the CPU 805, resulting in a Web page with the information obtained by the characterization of the supplied document, which is sent back to the user's computer and displayed on its screen or display.
The calibration process operates in a similar way. An administrative user loads several pre-characterized documents to the Web Server, and executes the calibration process by another option in the same Web application. This calibration process finally modifies the coefficients for the rules updating them in the database.

Claims

1. A method of automatic document characterization which receives a given input unstructured document and results in the automatic assignment of one or more document classes or categories to which the contents relates to, an automatic determination of the document issue date, and an automatic determination of a list of the names of natural or legal persons found in the text and other relevant information mentioned in the text, which is related with a document class or multiclass, faster, more complete and more accurate than the manual characterization or manual description made by legal technical personnel, wherein this speed and accuracy comparison is feasible when implementing the process as a computing system, the method comprises the steps of:
a. Upon receiving a digital document, from a user connected from his own computer in a Web application, process it with an OCR application;
b. Then, execute an automatic document characterization process within the application and process the text of the document received from the user, wherein this involves, first the refinement of the text using a dictionary of valid words, then the revision of occurrence of big number of different logic rules with their calibrated coefficients, that are associated to classes, by recognizing any of many different language expressions and set of keywords, resulting in a score for each of the rules that allow determining which classes apply through a means of specification certain logic text-related rules;
c. Recognizing names of peoples and organizations in the document contents by using a wide variety of different possible verbalizations of these names and according to the text, determine which of these names are relevant to the contents and which only references through a detection means;
d. Recognizing and extracting relevant information based on the class or multiclass definition made and the set of rules defined for that class or multiclass, through the detection means;
e. Recognizing relevant dates in the document and assigning a score to each of them according to the preceding or following text, associated with different keywords and language expressions that used to refer to document issue dates, taking into account the previously defined class, and by using the defined rules of the same class, through the detection means, related to the class previously identified and the set of rules used to classify as such; and
f. Reviewing the document contents, recognizing different text patterns as a combination of keywords, synonyms or equivalent terms, within the contents of the document through detection means.
2. The method of claim 1 wherein the rules are defined using a configuration application that allows the user to specify a set of keywords or synonyms, number of occurrences in the contents, and/or usage within certain phrases, thus defining the rules.
3. The method of claim 1 wherein the characterization process is calibrated using automatic specification that defines a weight coefficient and value of the elements on each rule.
4. The method of claim 1 wherein valid language words and its synonyms list to be used in refining the text in a given document in digital format resulted from an optical character recognition process.
PCT/IB2015/051239 2014-02-18 2015-02-18 Document characterization method WO2015125088A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201461941002P 2014-02-18 2014-02-18
US61/941,002 2014-02-18

Publications (1)

Publication Number Publication Date
WO2015125088A1 true WO2015125088A1 (en) 2015-08-27

Family

ID=53877689

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2015/051239 WO2015125088A1 (en) 2014-02-18 2015-02-18 Document characterization method

Country Status (3)

Country Link
CL (1) CL2016002090A1 (en)
PE (1) PE20161166A1 (en)
WO (1) WO2015125088A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11003889B2 (en) 2018-10-22 2021-05-11 International Business Machines Corporation Classifying digital documents in multi-document transactions based on signatory role analysis
US11017221B2 (en) 2018-07-01 2021-05-25 International Business Machines Corporation Classifying digital documents in multi-document transactions based on embedded dates

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020083090A1 (en) * 2000-12-27 2002-06-27 Jeffrey Scott R. Document management system
US20040205448A1 (en) * 2001-08-13 2004-10-14 Grefenstette Gregory T. Meta-document management system with document identifiers
US20070206884A1 (en) * 2006-03-03 2007-09-06 Masahiro Kato Image processing apparatus, recording medium, computer data signal, and image processing method
US20100046842A1 (en) * 2008-08-19 2010-02-25 Conwell William Y Methods and Systems for Content Processing

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11017221B2 (en) 2018-07-01 2021-05-25 International Business Machines Corporation Classifying digital documents in multi-document transactions based on embedded dates
US11810070B2 (en) 2018-07-01 2023-11-07 International Business Machines Corporation Classifying digital documents in multi-document transactions based on embedded dates
US11003889B2 (en) 2018-10-22 2021-05-11 International Business Machines Corporation Classifying digital documents in multi-document transactions based on signatory role analysis
US11769014B2 (en) 2018-10-22 2023-09-26 International Business Machines Corporation Classifying digital documents in multi-document transactions based on signatory role analysis

Also Published As

Publication number Publication date
CL2016002090A1 (en) 2016-12-30
PE20161166A1 (en) 2016-10-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15751446

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 001498-2016

Country of ref document: PE

WWE Wipo information: entry into national phase

Ref document number: NC2016/0001532

Country of ref document: CO

122 Ep: pct application non-entry in european phase

Ref document number: 15751446

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 240417)
