US20090037389A1 - Document Comparison Using Multiple Similarity Measures - Google Patents
- Publication number
- US20090037389A1 (application No. US 12/193,803)
- Authority
- US
- United States
- Prior art keywords
- similarity
- documents
- terms
- patents
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Definitions
- the present invention relates generally to document searching and retrieval and, in particular, to a comparison of documents using multiple similarity measures.
- searches are conducted for different reasons. Many searches are undertaken in an attempt to find material of interest for research and other purposes.
- a user conducting such a search may know of the existence of a desired document or publication, such as a book, and may be seeking to determine whether the database under investigation contains the desired document or publication, or other related documents.
- a user may construct a query of a database based on one or more keywords in an attempt to retrieve all records relating to an area of interest to the user.
- the searching and retrieval of information from databases can also be used as a strategic tool to investigate and determine the activities of market competitors.
- Pharmaceutical companies are particularly interested in the activities of their competitors.
- There are large time and dollar costs associated with pharmaceutical research so before committing resources to a particular area of interest, it is common for pharmaceutical companies to search industry and patent databases to determine what is presently known and understood in that particular area. Further, it is important to determine the nature and scope of technology in the field of interest that might be protected by patents or other intellectual property rights.
- Patents provide a limited monopoly right to exploit an invention in a particular jurisdiction to the exclusion of all others, in exchange for providing an enabling disclosure of how the invention works.
- it is particularly important to determine which chemical compounds might be subject to patent protection before committing large resources to research in a given area.
- a company may invest large amounts of time and money to research a new drug, only to find that the drug is protected by a patent granted to a market competitor.
- searching industry and patent databases is difficult, as different publications may utilise different words or expressions in relation to the same subject matter.
- a query using a given keyword may not retrieve all relevant publications due to the variance that exists in technical jargon and terminology.
- a scientist about to file a patent application for a new invention requires more complex retrieval techniques to identify existing patents and patent applications that are similar to the new invention.
- a company seeking to identify relationships with a competitor based on their assigned patents also requires more complex retrieval techniques than those afforded by traditional keyword based search techniques.
- EJB Enterprise Java Beans
- biomedical concepts are known by a variety of names.
- biological concepts may be related as a result of belonging to the same class. For example, the terms Amylase and Somatostatin are related, because both are proteins.
- Another complication is that a group of molecules may be similar in respect of a nominal attribute or characteristic, even if the formulae for the respective molecules are different. In such circumstances, it is generally not possible to utilise string-based matching techniques on the formulae to identify those molecules that possess a desired attribute or characteristic. Further, a search of a database using a generic or commercial trade name for a chemical composition may not retrieve relevant documents in which the composition is only described with reference to its formulaic representation.
- 7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE and 7-CHLORO-1-METHYL-5-PHENYL-3H-1,4-BENZODIAZEPIN-2(1H)-ONE are different formulations of Valium.
- a simple keyword search for the term “valium” might return documents relating to the first formulation, the second formulation, or neither formulation.
- One technique for querying protein patents is described in “A Protein Patent Query System Powered by Kleisli”, J. Chen et al., Proceedings of the ACM SIGMOD Conference, Seattle, Wash., 1998. Given a protein sequence, Chen uses patent and protein databases, as well as bioinformatics tools, to identify whether similar protein sequences have already been patented.
- the present disclosure provides a method for determining the similarity of two or more publications based on multiple similarity measures determined for those publications.
- the similarity measures can include, for example, semantic, structural, and reference similarity measures.
- the method utilizes the number of common terms in the publications that are being compared to determine the similarity of those publications.
- the method utilizes the references contained within the publications that are being compared to determine the similarity of those publications.
- the method determines the similarity of terms contained within the publications by utilizing a domain-specific ontology. In so doing, terms that are synonyms are considered to be equivalent.
- the similarity of the structural representations of those chemical formulae is utilized in determining the similarity of the publications.
- documents from the chemical, biochemical or pharmaceutical domains are compared, and the determination of the semantic similarity utilizes a determination of structural similarity of the chemicals described in the plurality of documents.
- a method of comparing a plurality of documents comprising the steps of: determining a plurality of similarity measures for the plurality of documents; and determining an overall similarity measure for the plurality of documents, based on the plurality of similarity measures.
- a computer program product having a computer readable medium having a computer program recorded therein for comparing documents.
- the computer program product comprises: computer program code means for determining a plurality of similarity measures for the plurality of documents; and computer program code means for determining an overall similarity measure for the plurality of documents, based on the plurality of similarity measures.
- a computer program product having a computer readable medium having a computer program recorded therein for comparing documents.
- the computer program product comprises: computer program code means for determining a reference similarity measure, based on references contained in the plurality of documents; computer program code means for determining a semantic similarity measure, based on the similarity of terms contained in the plurality of documents; and computer program code means for determining a similarity measure for the plurality of documents, based on the reference similarity measure and the semantic similarity measure.
- a knowledge retrieval system comprising: a parser for extracting required information from presented materials; an annotator for annotating terms from the parsed presented materials, by utilizing at least one of an ontology, a taxonomy, and a dictionary; a chemical representation device for deriving information about the annotated terms, based on a connection table; an integrator for collating the derived information for storage in a database; and a retrieval system for retrieving information from the database, based on input search criteria.
- there is provided an apparatus for implementing any one of the aforementioned methods.
- a computer program product including a computer readable medium having recorded thereon a computer program for implementing any one of the methods described above.
- Other aspects of the invention are also disclosed.
- FIG. 1 is a flow diagram of a method of determining a semantic similarity measure for biomedical patent documents, in accordance with an embodiment of the present disclosure
- FIG. 2 is a schematic block diagram illustrating a system architecture of a BioPatentMiner system
- FIG. 3 is a Reference Graph illustrating the reference similarity between documents
- FIG. 4 is a schematic block diagram illustrating a system architecture of a BioPatentMiner system incorporating an embodiment of the present disclosure
- FIG. 5 is a screenshot capture from a visualizer 490 of FIG. 4;
- FIG. 6 is a schematic block diagram of a general purpose computer upon which arrangements described can be practised.
- VSM Vector Space Model
- TF-IDF Term Frequency-Inverse Document Frequency
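- $tf_{ik}$ is the frequency of the term k in the document i
- N is the number of documents in a collection
- $n_k$ is the number of documents in which the term k occurs at least once.
- $d_i \cdot d_j$ is the dot product between the vectors, calculated as $d_i \cdot d_j = \sum_k w_{ik} \times w_{jk}$, where $w_{ik}$ is the weight of term k in document i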
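As a minimal sketch of the Vector Space Model described by these symbols, the Python below weights each term with the common TF-IDF form tf_ik × log(N / n_k) and compares two documents by cosine similarity. The logarithmic weighting form, the function names, and the sample token lists are assumptions made for illustration and are not taken from the disclosure.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)  # N: number of documents in the collection
    # n_k: number of documents in which term k occurs at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf_ik: frequency of term k in document i
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


# Hypothetical, already tokenised documents with stop words removed.
docs = [
    ["insulin", "glucose", "protein"],
    ["insulin", "glucose", "antibody"],
    ["polymer", "gel", "carrier"],
]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3))
```

Documents that share no terms score 0.0, and a term that occurs in every document receives a weight of zero under this weighting.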
- An ontology is a data structure containing all relevant entities, possible relationships between those entities, and associated rules for a nominal domain.
- An ontology is typically hierarchical in structure, but is not restricted to only hierarchical structures.
- a taxonomy may refer to either the classification of a group of objects, or principles underlying such classification. Taxonomies are frequently hierarchical in structure, but are not necessarily so.
- a taxonomy may be a simple organization of a group of objects, or even an alphabetical list.
- ontologies may be found, for example, at each of the following Internet websites: Gene Ontology (http://www.geneontology.org) and Suggested Upper Merged Ontology (SUMO) (http://www.ontologyportal.org).
- a related area of research is concerned with determining the similarity between terms in an ontology or taxonomy.
- One simple technique is to calculate the similarity based on the distance (number of edges) between two terms in an ontology graph.
- information-theoretic approaches for determining the similarity between ontology terms have been found to be most effective. For example, “Using Information Content to Evaluate Semantic Similarity in a Taxonomy”, P. Resnik, Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 448-453, 1995, proposes a method to determine the similarity between two semantic classes in a taxonomy based on the amount of information that the semantic classes share in common.
- let $p_C$ be the probability of encountering a class C or a child of that class in the taxonomy. Although Resnik identified child terms by considering only is-a links, the approach can be extended to links of all types. $p_C$ increases monotonically as one moves up the taxonomy and approaches 1 for the root.
- information theory defines the information content of a term as $-\ln(p_C)$.
- $$\mathrm{sim}(C_k, C_l) = \frac{2 \times \ln(p_{min})}{\ln(p_{C_k}) + \ln(p_{C_l})}$$
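The class similarity above can be sketched in Python on a toy taxonomy. The class names, the parent links, and the occurrence counts below are hypothetical; the probability of a class is estimated here as the share of all occurrences that fall on the class or any of its descendants, which is one plausible reading of the definition of p_C.

```python
import math

# Hypothetical toy taxonomy: child -> parent (is-a links); the root has parent None.
PARENT = {
    "Entity": None,
    "Chemical": "Entity",
    "Protein": "Chemical",
    "Enzyme": "Protein",
    "Hormone": "Protein",
    "Gene": "Entity",
}
# Hypothetical number of occurrences observed directly for each class.
COUNT = {"Entity": 0, "Chemical": 10, "Protein": 10,
         "Enzyme": 20, "Hormone": 20, "Gene": 40}


def ancestors(cls):
    """Return cls followed by all of its ancestors up to the root."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain


def probability(cls):
    """p_C: probability of encountering class C or any of its descendants."""
    total = sum(COUNT.values())
    covered = sum(n for c, n in COUNT.items() if cls in ancestors(c))
    return covered / total


def class_similarity(ck, cl):
    """2 * ln(p_min) / (ln(p_Ck) + ln(p_Cl)), with p_min taken over common ancestors."""
    common = set(ancestors(ck)) & set(ancestors(cl))
    p_min = min(probability(c) for c in common)
    denom = math.log(probability(ck)) + math.log(probability(cl))
    # Both classes are the root when denom is 0; treat them as identical.
    return 2 * math.log(p_min) / denom if denom else 1.0


print(round(class_similarity("Enzyme", "Hormone"), 3))
```

Two classes whose only common ancestor is the root score 0, because the root has probability 1 and therefore carries no information.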
- the molecular fingerprints contain information about atoms and substructures contained within a molecule.
- the fingerprints are represented as a boolean array, in which each element is TRUE (1) or FALSE (0).
- Boolean arrays in turn are usually represented as bitmaps, an array of bytes or words in which each bit represents one position of the boolean array.
- Molecular fingerprints can be used to measure the similarity between different chemicals utilizing the Tanimoto Coefficient.
- the similarity between two fingerprints A and B is calculated to be:
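In its standard form, the Tanimoto coefficient of two bit fingerprints is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below applies that standard definition to fingerprints held as Python integers used as bitmaps; the two example bit patterns are invented, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as integer bitmaps."""
    common = bin(a & b).count("1")   # bits set in both fingerprints
    either = bin(a | b).count("1")   # bits set in at least one fingerprint
    return common / either if either else 0.0


# Hypothetical 8-bit fingerprints.
fp_a = 0b10110110
fp_b = 0b10011110
print(round(tanimoto(fp_a, fp_b), 3))
```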
- Such documents can include, but are not restricted to, patent documents.
- the method utilizes a plurality of similarity measures to determine the overall similarity between the documents.
- One measure relates to common terms that occur in the documents under consideration.
- Another measure relates to common references between the documents under consideration.
- a further measure relates to the premise that documents containing similar terms should be considered similar.
- the method determines the semantic similarity between terms in the documents by utilizing a domain-specific ontology. For example, terms which are synonyms (based on the ontology) are considered to be equivalent.
- the similarity between two or more patent documents can also be determined based on the structural similarity among molecules described in those patent documents, since similar molecules tend to behave similarly.
- the similarity between two chemicals is determined based on the number of common substructures between those two chemicals.
- a preferred embodiment utilizes a modified Vector Space model in which the documents are represented by the terms contained therein.
- the terms can be, for example, English words or biological concepts.
- the terms exclude “stop words”, which are common words that are ignored when searching a key phrase. The exclusion of such “stop words” accelerates the searching process.
- FIG. 1 is a flow diagram of a method 100 for determining a semantic similarity measure for biomedical patent documents, in accordance with the present disclosure.
- the method 100 begins at a Start step 110 and proceeds to step 120, which identifies biological terms within the documents that are to be compared.
- the biological terms are typically identified with reference to one or more appropriate technical dictionaries and/or ontologies.
- Control passes to step 130, which identifies chemical terms in the documents that are to be compared.
- the chemical terms are typically identified with reference to a set of rules.
- the method identifies chemical terms by comparing terms within the documents with a predefined dictionary of chemical names and utilising a customisable rule engine.
- Control passes to step 140, which determines the similarity of the biological terms identified in step 120.
- step 150 determines the similarity of the chemical terms identified in step 130.
- Control passes to step 160 to determine the semantic similarity between the documents under consideration, based on the similarities of the biological terms and chemical terms identified in steps 140 and 150. Determining the similarity of the documents will be discussed in greater detail below.
- Control then passes to an End step 170 and the method terminates. It will be appreciated and readily understood by a person skilled in the relevant art that one or more steps of the method 100 can be practised in a different order, or in parallel, without departing from the spirit and scope of the present invention.
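Steps 120 and 130 of the method 100 can be illustrated with a small Python sketch that looks up biological terms in a dictionary and finds chemical terms with a dictionary plus one rule. The dictionaries, the single formula rule, and the function name are all hypothetical stand-ins; the disclosure does not specify the contents of its dictionaries or of its rule engine.

```python
import re

# Hypothetical dictionary entries; a real system would load technical
# dictionaries and/or ontologies instead.
BIO_TERMS = {"amylase", "somatostatin", "insulin"}
CHEM_NAMES = {"valium", "benzodiazepine"}

# Hypothetical rule: a run of element symbols with optional counts,
# for example C16H13ClN2O.
FORMULA_RULE = re.compile(r"(?:[A-Z][a-z]?\d*){2,}")


def identify_terms(text):
    """Return (biological terms, chemical terms) found in the text."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # Step 120: dictionary lookup of biological terms.
    bio = {t.lower() for t in tokens if t.lower() in BIO_TERMS}
    # Step 130: dictionary lookup of chemical names ...
    chem = {t.lower() for t in tokens if t.lower() in CHEM_NAMES}
    # ... plus the rule, restricted to tokens that contain a digit so that
    # ordinary capitalised words are not mistaken for formulae.
    chem |= {t for t in tokens
             if any(ch.isdigit() for ch in t) and FORMULA_RULE.fullmatch(t)}
    return bio, chem


bio, chem = identify_terms(
    "Insulin and amylase were assayed with C16H13ClN2O (Valium).")
print(sorted(bio), sorted(chem))
```

Steps 140 to 160 would then compare the two term sets between documents; they are omitted from this sketch.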
- FIG. 2 is a schematic block diagram illustrating the overall architecture of an embodiment of the BioPatentMiner system 200 .
- the system 200 uses a crawler 210 to download patents 220 from an online patent database, such as provided by the United States Patent and Trademark Office (USPTO), based on a query input by a user.
- the system 200 can be also used on a collection of biomedical patents obtained by other techniques.
- a parser 230 parses these patents 220 to extract information. Such information may include, for example, inventors, assignees, titles, and abstracts.
- the parser 230 assumes that the patents 220 are in the HyperText Markup Language (HTML) format of the USPTO site.
- Other embodiments of the system 200 are adjusted to handle documents in other formats.
- Biological terms in the parsed files are passed from the parser 230 to an annotator 240.
- the annotator 240 annotates the parsed files, using for example a BioAnnotator system as described in “Information Extraction from Biomedical Literature: Methodology, Evaluation and an Application”, L. Subramaniam et al., Proceedings of the ACM Conference on Information and Knowledge Management, New Orleans, La., 2003.
- BioAnnotator identifies and classifies biological terms in scientific text by utilizing publicly available biomedical ontologies 250 that are presented as an input to the annotator 240 .
- the annotated patents are passed from the annotator 240 to be indexed by an integrator/indexer 260 , such as an XML search engine.
- the integrator/indexer 260 also receives as an input the ontologies 250 , and passes the integrated information from the patents and the ontologies to an integrated knowledge database 270 .
- Information contained in the integrated knowledge database 270 can be retrieved using a knowledge retriever 280 .
- the knowledge retriever presents information from the integrated knowledge database 270 requested by a user on a display system, such as a visualizer 290. Further information pertaining to the BioPatentMiner system 200 is contained in “BioPatentMiner: An Information Retrieval System for BioMedical Patents”, S. Mukherjea et al., Proceedings of the Very Large Databases (VLDB) Conference, Toronto, Canada, 2004.
- the present disclosure provides a method for determining the similarity of two or more publications based on multiple similarity measures determined for those publications.
- the method identifies technical terms in documents under consideration and determines the similarity of those documents based on at least two of semantic, structural, and reference similarity measures.
- a domain-specific ontology is utilised to determine the semantic similarity between the publications.
- Such domains can include, for example, computer science, chemistry, and biochemistry.
- the ontology stores all related synonyms for each biological concept.
- chemical formulas in the publications under consideration are compared based on common substructures contained therein.
- an embodiment of the present disclosure determines the similarity of two given documents and returns a similarity value Sim(Pi, Pj) in the range 0.0 to 1.0, in which a value of 1.0 indicates that the documents are identical.
- the similarity value is determined by a semantic similarity measure and a reference similarity measure.
- the similarity value is defined as: $Sim(P_i, P_j) = WT_{ref} \times RefSim(P_i, P_j) + (1 - WT_{ref}) \times SemSim(P_i, P_j)$
- the Reference Similarity component, RefSim(Pi, Pj), calculates the similarity of the documents based on references contained within the documents, and will be discussed in further detail below.
- the Semantic Similarity component, SemSim(Pi, Pj), calculates the similarity of the documents based on the contents of the documents, as also explained in further detail below.
- the final similarity component is a linear combination of the two similarity measures.
- WTref is a predefined constant determining the weight given to the Reference Similarity during the final similarity calculation.
- the weighting term is set to 0.5, so that equal importance is given to each of the Reference and Semantic similarity components.
- the value of the weighting term can be varied for specific applications, to place greater or lesser importance on the commonality of references shared by the documents or the semantic similarity of the documents.
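The linear combination of the two components can be sketched as a single Python function. The sketch assumes that the semantic component receives the complementary weight (1 − WTref), which is consistent with a weight of 0.5 giving both components equal importance and with a result in the range 0.0 to 1.0; the function and parameter names are illustrative.

```python
def overall_similarity(ref_sim: float, sem_sim: float, wt_ref: float = 0.5) -> float:
    """Combine reference and semantic similarity into one value in [0, 1].

    wt_ref plays the role of WTref; (1 - wt_ref) is assumed to be the
    weight of the semantic component.
    """
    if not 0.0 <= wt_ref <= 1.0:
        raise ValueError("wt_ref must lie in the range 0.0 to 1.0")
    return wt_ref * ref_sim + (1.0 - wt_ref) * sem_sim


print(overall_similarity(1.0, 0.5))        # equal weighting
print(overall_similarity(1.0, 0.0, 0.25))  # references count for a quarter
```

Raising `wt_ref` favours documents that cite each other, while lowering it favours documents with similar content.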
- the Reference similarity of the documents is determined in this embodiment by utilizing a Reference graph.
- the documents are the nodes of this graph. If a document Pi contains a reference to a document Pj, a direct edge is created between the corresponding nodes.
- FIG. 3 shows an example of such a Reference Graph 300 , in which pi corresponds to document Pi.
- a direct edge links nodes p1 and p3, which indicates that document P1 is related to document P3.
- document P1 contains a reference to document P3.
- Node p2 also has a direct edge to p3, indicating that document P2 also contains a reference to document P3.
- the Reference Similarity of the corresponding documents is 0.
- the minimum path length between p1 and p3 is 1. Therefore, the Reference similarity between these patents, RefSim(P1, P3), will also be 1.
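A Reference graph of this kind can be sketched in Python with a breadth-first search for the minimum path length. Two points are assumptions rather than statements of the disclosure: edges are traversed in both directions when searching for a path, and the similarity is taken as the reciprocal of the minimum path length, with 0 when no path exists. Both choices reproduce the value of 1 given for P1 and P3, but other definitions would as well.

```python
from collections import deque


def build_reference_graph(references):
    """references: {document: [documents it cites]} -> adjacency sets."""
    graph = {}
    for doc, cited in references.items():
        graph.setdefault(doc, set())
        for other in cited:
            graph[doc].add(other)
            # Assumption: a citation links the two nodes in both directions.
            graph.setdefault(other, set()).add(doc)
    return graph


def min_path_length(graph, start, goal):
    """Number of edges on the shortest path, or None if none exists."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def ref_sim(graph, a, b):
    """Assumed form: 1 / minimum path length, 0.0 when no path exists."""
    length = min_path_length(graph, a, b)
    if length is None:
        return 0.0
    return 1.0 if length == 0 else 1.0 / length


# P1 and P2 both cite P3; the hypothetical P4 cites nothing and is not cited.
graph = build_reference_graph({"P1": ["P3"], "P2": ["P3"], "P4": []})
print(ref_sim(graph, "P1", "P3"), ref_sim(graph, "P1", "P2"), ref_sim(graph, "P1", "P4"))
```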
- this embodiment utilizes an augmented Cosine Similarity method to compare documents based on the contents of those documents.
- the documents are represented by vectors of the terms contained therein.
- the terms can be English words or terms belonging to a domain-specific ontology.
- the terms exclude “stop words”, which are common words that are ignored when searching a key phrase.
- the semantic similarity between the documents is determined by utilizing a domain-specific ontology. Even if two terms are not equal, those terms can be semantically similar if the terms are synonyms or if the terms belong to the same ontology class.
- the similarity between the classes will determine the similarity between the terms.
- the structural similarity between the molecules is determined and may be incorporated into the semantic similarity measure.
- a domain-specific ontology is utilized that contains various terms as well as all synonyms for those terms. Further, each term belongs to an Ontology class.
- the classes form an ISA-hierarchy in which a class can be a child of another class. Initially, the similarity between all pairs of the ontology classes is pre-computed using Lin's method, as described above with reference to equation (4), to determine sim(Ck,Cl):
- $p_{C_k}$ is the probability of encountering the class $C_k$ or a child of that class in the taxonomy
- $p_{min}$ is the minimum probability among the common ancestors of $C_k$ and $C_l$
- $$SemSim(P_i, P_j) = \frac{d_i \cdot d_j}{\sqrt{d_i \cdot d_i} \times \sqrt{d_j \cdot d_j}} \quad (9)$$
- $d_i \cdot d_j$ is the dot product between the vectors, calculated as
- N is the number of documents in a collection and $n_C$ is the number of documents in which a term of Class C occurs at least once.
- the method utilizes this metric in such a way that common classes are allocated a lower weight.
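One way to give common classes a lower weight is an inverse document frequency computed per class rather than per term. The sketch below assumes the form log(N / n_C), by analogy with TF-IDF; the exact expression used by the method is not reproduced in this text, and the sample class sets are invented.

```python
import math


def class_weights(doc_classes):
    """doc_classes: one set of ontology classes per document.

    Returns {class: log(N / n_C)}, an assumed weighting in which a class
    that occurs in every document receives weight 0.
    """
    n_docs = len(doc_classes)                      # N
    classes = set().union(*doc_classes)
    return {
        c: math.log(n_docs / sum(c in d for d in doc_classes))  # n_C
        for c in classes
    }


# Hypothetical classes found in three documents.
doc_classes = [
    {"Chemical", "Gene or Genome"},
    {"Chemical"},
    {"Chemical", "Amino Acid, Peptide, or Protein"},
]
weights = class_weights(doc_classes)
print(weights["Chemical"], round(weights["Gene or Genome"], 3))
```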
- FIG. 4 is a schematic block diagram illustrating the overall architecture of a BioPatentMiner system 400 enhanced in accordance with an embodiment of the present invention.
- the system 400 uses a crawler 410 to download patents 420 from an online patent database, such as provided by the United States Patent and Trademark Office (USPTO), based on a query input by a user.
- the system 400 can be also used on a collection of biomedical patents obtained by other techniques.
- a parser 430 parses these patents 420 to extract information. Such information may include, for example, inventors, assignees, titles, and abstracts.
- the parser 430 assumes that the patents 420 are in the HyperText Markup Language (HTML) format of the USPTO site.
- Other embodiments of the system 400 are adjusted to handle documents in other formats.
- Biological terms in the parsed files are passed from the parser 430 to an annotator 440.
- the annotator 440 annotates the parsed files, using for example a BioAnnotator system as described in “Information Extraction from Biomedical Literature: Methodology, Evaluation and an Application”, L. Subramaniam et al., Proceedings of the ACM Conference on Information and Knowledge Management, New Orleans, La., 2003.
- BioAnnotator identifies and classifies biological terms in scientific text by utilizing publicly available biomedical ontologies 450 that are presented as an input to the annotator 440 .
- BioAnnotator also uses a Rule Engine to identify unknown and new chemical formulae that are not present in the ontologies.
- the biomedical ontology used by the BioAnnotator is Unified Medical Language System (UMLS), as described at the URL http://umlsks.nlm.nih.gov.
- UMLS is a consolidated repository of medical terms and their relationships, spread across multiple languages and disciplines. Such disciplines include, for example, chemistry and biology.
- One section of UMLS is a Semantic Network that includes 235 biomedical semantic classes like Gene or Genome, and Amino Acid, Peptide, or Protein.
- the Semantic Network includes biological concepts, each of which is associated with one or more semantic classes.
- the concept caspase 3 is associated with the semantic class Amino Acid, Peptide or Protein.
- a biological concept can be referred to by various synonyms.
- caspase 3 is variously referred to using the terms apopain, Yama protein, and CPP32 protein.
- UMLS identifies one of the synonyms as the primary name of the concept and BioAnnotator annotates each identified biological term with an associated primary name (or baseform) and semantic class.
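The annotation of a term with its primary name and semantic class can be sketched as a dictionary lookup. The table below is a hand-written excerpt built from the caspase 3 example and is not UMLS data or a UMLS API; treating "caspase 3" as the primary name is an assumption, since the text does not say which synonym UMLS designates.

```python
# Hypothetical excerpt: synonym (lower case) -> (baseform, semantic class).
ONTOLOGY = {
    "caspase 3": ("caspase 3", "Amino Acid, Peptide, or Protein"),
    "apopain": ("caspase 3", "Amino Acid, Peptide, or Protein"),
    "yama protein": ("caspase 3", "Amino Acid, Peptide, or Protein"),
    "cpp32 protein": ("caspase 3", "Amino Acid, Peptide, or Protein"),
}


def annotate(term):
    """Return the annotation for a known term, or None if it is unknown."""
    entry = ONTOLOGY.get(term.strip().lower())
    if entry is None:
        return None
    baseform, semantic_class = entry
    return {"term": term, "baseform": baseform, "class": semantic_class}


print(annotate("Apopain"))
```

Because every synonym maps to the same baseform, two documents that use different names for the same concept end up sharing a term.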
- for a chemical term, the baseform is the Molecular fingerprint representation of the chemical and the class is Chemical.
- the Molecular fingerprint is derived using a connection table, which contains a list of all the atoms within a chemical structure, together with bond information that describes the exact manner in which the individual atoms are linked together. Examples of such connection tables are available, for example, at Chemical Abstracts Service (CAS Online) http://www.cas.org.
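To make the idea of deriving a fingerprint from a connection table concrete, the sketch below represents a connection table as a list of atoms plus a list of bonds and sets one bit per atom type and per bonded pair. This hashing scheme is a deliberately simplified, hypothetical stand-in: the disclosure does not describe how its fingerprints are computed, and practical fingerprints encode much richer substructures. Hydrogen atoms are omitted from the examples.

```python
import zlib


def fingerprint(atoms, bonds, n_bits=64):
    """Return an integer bitmap for a molecule.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (atom index, atom index, bond order) triples
    """
    features = set(atoms)                      # one feature per element present
    for i, j, order in bonds:
        a, b = sorted((atoms[i], atoms[j]))
        features.add(f"{a}{order}{b}")         # one feature per bonded pair
    bits = 0
    for feat in features:
        # crc32 is used because it is stable across runs, unlike hash().
        bits |= 1 << (zlib.crc32(feat.encode()) % n_bits)
    return bits


# Heavy-atom connection tables for ethanol (C-C-O) and methanol (C-O).
ethanol = fingerprint(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
methanol = fingerprint(["C", "O"], [(0, 1, 1)])
print(bin(ethanol), bin(methanol))
```

Every feature of methanol also occurs in ethanol, so each bit set in the methanol fingerprint is also set in the ethanol fingerprint.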
- a Reference graph is created based on the references of the documents.
- the system integrates the information of the patents and biomedical ontologies.
- the integrated Patent Knowledge base enables several types of information retrieval. Besides keyword search, the system allows the user to search for patents containing a class of biomedical concept. Moreover, a user can identify semantic relationships between patent documents, inventors, and assignees, for example, as well as biomedical concepts.
- the annotated patents are represented in the eXtensible Markup Language (XML) and are passed through a chemical representation block 445 to be indexed by an integrator/indexer 460 , such as an XML search engine.
- the chemical representation block receives as inputs the annotated patents from the annotator 440 , and a connection table 455 .
- the connection table 455 is used to create Molecular Fingerprints for the chemicals identified in the annotated patent.
- the integrator/indexer 460 receives the ontologies 450 and patents from the chemical representation block 445 as inputs and passes the integrated information from the patents and the ontologies to an integrated knowledge database 470 .
- Information contained in the integrated knowledge database 470 can be retrieved using a knowledge retriever 480 .
- the integrated information space provided by the integrated knowledge base 470 facilitates several types of information retrieval. Besides keyword search, the system allows the user to search for patents containing a class of biomedical concept. Further, it is possible to identify semantic relationships between patents, inventors, assignees, and biomedical concepts.
- the knowledge retriever presents information from the integrated knowledge database 470 requested by a user on a display system, such as a visualizer 490 , as will be described below with reference to FIG. 5 .
- the embodiment of FIG. 4 performs the functions of a semantic search, a semantic association, and a similarity search to identify similar documents.
| Patents | Common terms |
| --- | --- |
| Ranbaxy: Azole compounds as therapeutic agents for fungal infections [6670363]; Pfizer: Bicycliccarbonyl indole compounds as anti-inflammatory/analgesic agents [6303628] | sulfur, phenyl, hydroxy, composition, trifluoromethyl, oxygen, hydrogen, halogens, salts |
| Ranbaxy: Azole compounds as therapeutic agents for fungal infections [6670363]; Pfizer: Heteroaryl phenyl pyrazole compounds as anti-inflammatory/analgesic agents [6649636] | mammals, phenyl, hydroxy, composition, trifluoromethyl, hydrogen, salts, formyl, furyl, oxazolyl |
| Ranbaxy: Azole compounds as therapeutic agents for fungal infections [6670363]; Pfizer: 3-Azabicycl | phenyl, cobalt, alkyl, naphthyl, hydroxy, composition, imidazolyl, trifluoromethyl, |
- Table 1 shows similar patent documents between two companies Ranbaxy and Pfizer. The titles of the patents and the patent numbers are shown in a first column, and the common biological terms between the patents are shown in a second column. This type of analysis is a particularly useful aid for companies to identify potential patent infringements. It is also possible to identify patent documents that are similar to a nominal patent or patent application.
- FIG. 5 is a screenshot 500 showing a visualization of the similarity of a collection of patents that are assigned to University of Texas, as obtained from the USPTO site using the keyword “glycolysis”.
- the input patents are shown as the nodes of a graph. If the similarity between two patents is greater than a predetermined similarity threshold, an edge is drawn between the nodes (patents).
- the similarity threshold is set by default to a value of 0.1, but can be readily modified by a user, as desired.
- a similarity threshold of 0.2 is utilized.
- some patents are not related to any other patents, while some other patents form clusters.
- if the threshold is increased, many of the edges will disappear, and at a lower threshold more patents will be related.
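The effect of the threshold on the displayed graph can be sketched as a filter over pairwise similarity values. The patent labels and similarity values below are invented for illustration; only the rule that an edge is drawn when the similarity exceeds the threshold, and the default of 0.1, come from the text.

```python
def similarity_edges(similarity, threshold=0.1):
    """Return the document pairs whose similarity exceeds the threshold.

    similarity: {(doc_a, doc_b): similarity value}
    """
    return sorted(pair for pair, value in similarity.items() if value > threshold)


# Hypothetical pairwise similarities for three patents.
sims = {("A", "B"): 0.35, ("A", "C"): 0.15, ("B", "C"): 0.05}
print(similarity_edges(sims))        # default threshold of 0.1
print(similarity_edges(sims, 0.2))   # a higher threshold removes an edge
```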
- the brightness of an edge linking two nodes is proportional to the similarity between the two patents connected by that edge. Clicking on an edge shows the terms common among the patents. For example, the terms common between U.S. Pat. Nos. 5,801,033 and 5,529,914 are shown in a popup window 510, which indicates that the common terms are gels, carriers, and polymers.
- FIG. 5 shows that terms like insulin, glucose and antibodies are associated with the given set of patents.
- the user can select a term from the list 520 to see the patents that refer to that term.
- the term “insulin” is selected in the list 520 , which results in the patents that contain the term “insulin” being highlighted. It can be seen that many of these highlighted patents are similar and form a cluster 530 .
- Table 2 shows the percentage of patents found to be similar using the two techniques. To emphasize that the evaluation is not biased by the value of the threshold, two values for the constant were used, 0.1 and 0.2. It is assumed that, in general, patents in the same USPTO class are similar to each other and dissimilar to patents of other classes. Table 2 shows that for dissimilar patents (in different classes), the number of patents found to be similar by the method disclosed herein is slightly larger, or even smaller, than the number of patents found to be similar by the old technique. Thus, when comparing patents of class 156 and 514 , 0.3% of the patents were found similar by the old technique and 0.28% of the patents were found similar by the method disclosed herein at a threshold of 0.1.
- patents of both classes 514 and 424 are pharmaceutical and are thus related. However, patents within the same class are obviously more related. Table 2 shows that, in this case, the increase in the number of similar patents was higher during intra-class comparison than during inter-class comparison. Thus, the method disclosed herein is also effective in identifying the most similar patents to a query patent in a collection of related Pharmaceutical patents.
- the aforementioned preferred method(s) comprise a particular control flow. There are many other variants of the preferred method(s) which use different control flows without departing from the spirit or scope of the invention. Furthermore, one or more of the steps of the preferred method(s) may be performed in parallel rather than sequentially.
- the method of comparing documents is preferably practised using a general-purpose computer system 600 , such as that shown in FIG. 6 wherein the processes described above may be implemented as software, such as an application program executing within the computer system 600 .
- the steps of the method of comparing documents are effected by instructions in the software that are carried out by the computer.
- the instructions may be formed as one or more code modules, each for performing one or more particular tasks.
- the software may also be divided into two separate parts, in which a first part performs the document comparison methods and a second part manages a user interface between the first part and the user.
- the software may be stored in a computer readable medium, including the storage devices described below, for example.
- the software is loaded into the computer from the computer readable medium, and then executed by the computer.
- a computer readable medium having such software or computer program recorded on it is a computer program product.
- the use of the computer program product in the computer preferably effects an advantageous apparatus for comparing documents.
- the computer system 600 is formed by a computer module 601, input devices such as a keyboard 602 and mouse 603, and output devices including a printer 615, a display device 614 and loudspeakers 617.
- a Modulator-Demodulator (Modem) transceiver device 616 is used by the computer module 601 for communicating to and from a communications network 620, for example connectable via a telephone line 621 or other functional medium.
- the modem 616 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN), and may be incorporated into the computer module 601 in some implementations.
- the computer module 601 typically includes at least one processor unit 605, and a memory unit 606, for example formed from semiconductor random access memory (RAM) and read only memory (ROM).
- the module 601 also includes a number of input/output (I/O) interfaces including an audio-video interface 607 that couples to the video display 614 and loudspeakers 617, an I/O interface 613 for the keyboard 602 and mouse 603 and optionally a joystick (not illustrated), and an interface 608 for the modem 616 and printer 615.
- the modem 616 may be incorporated within the computer module 601, for example within the interface 608.
- a storage device 609 is provided and typically includes a hard disk drive 610 and a floppy disk drive 611.
- a magnetic tape drive (not illustrated) may also be used.
- a CD-ROM drive 612 is typically provided as a non-volatile source of data.
- the components 605 to 613 of the computer module 601 typically communicate via an interconnected bus 604 and in a manner which results in a conventional mode of operation of the computer system 600 known to those in the relevant art. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun Sparcstations or like computer systems evolved therefrom.
- the application program is resident on the hard disk drive 610 and read and controlled in its execution by the processor 605.
- Intermediate storage of the program and any data fetched from the network 620 may be accomplished using the semiconductor memory 606, possibly in concert with the hard disk drive 610.
- the application program may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 612 or 611, or alternatively may be read by the user from the network 620 via the modem device 616.
- the software can also be loaded into the computer system 600 from other computer readable media.
- computer readable medium refers to any storage or transmission medium that participates in providing instructions and/or data to the computer system 600 for execution and/or processing.
- storage media include floppy disks, magnetic tape, CD-ROM, a hard disk drive, a ROM or integrated circuit, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 601.
- Examples of transmission media include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
- the method of comparing documents may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions of semantic similarity determination and structural similarity determination.
- dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories.
Abstract
Disclosed herein is a method for comparing documents. The method includes the steps of: determining a plurality of similarity measures; and determining an overall similarity measure for the plurality of documents, based on the plurality of similarity measures. In one embodiment, the similarity measures are chosen from the group of similarity measures consisting of semantic and reference similarity measures. When comparing documents from the chemical, biochemical or pharmaceutical domains, the determination of the similarity utilizes a determination of structural similarity of the chemical formulas described in the plurality of documents.
Description
- This application is a Division of U.S. application Ser. No. 11/304,029 filed Dec. 15, 2005, the complete disclosure of which, in its entirety, is herein incorporated by reference.
- The present invention relates generally to document searching and retrieval and, in particular, to a comparison of documents using multiple similarity measures.
- There are many techniques for searching a database to retrieve relevant documents and publications in response to a query provided by a user. Searches are conducted for different reasons. Many searches are undertaken in an attempt to find material of interest for research and other purposes. A user conducting such a search may know of the existence of a desired document or publication, such as a book, and may be seeking to determine whether the database under investigation contains the desired document or publication, or other related documents. Alternatively, a user may construct a query of a database based on one or more keywords in an attempt to retrieve all records relating to an area of interest to the user.
- The searching and retrieval of information from databases can also be used as a strategic tool to investigate and determine the activities of market competitors. Pharmaceutical companies are particularly interested in the activities of their competitors. There are large time and dollar costs associated with pharmaceutical research, so before committing resources to a particular area of interest, it is common for pharmaceutical companies to search industry and patent databases to determine what is presently known and understood in that particular area. Further, it is important to determine the nature and scope of technology in the field of interest that might be protected by patents or other intellectual property rights.
- Patents provide a limited monopoly right to exploit an invention in a particular jurisdiction to the exclusion of all others, in exchange for providing an enabling disclosure of how the invention works. In the case of pharmaceutical companies, it is particularly important to determine which chemical compounds might be subject to patent protection before committing large resources to research in a given area. Without undertaking a relevant search of patent databases and the like, a company may invest large amounts of time and money to research a new drug, only to find that the drug is protected by a patent granted to a market competitor. However, searching industry and patent databases is difficult, as different publications may utilise different words or expressions in relation to the same subject matter. Thus, a query using a given keyword may not retrieve all relevant publications due to the variance that exists in technical jargon and terminology.
- When investigating a particular field of interest, it is known to determine the similarity between two textual documents based on common keywords, as described, for example, in “A Vector Space Model for Automatic Indexing”, Salton G., Wong, A. and Yang, C. S. Communications of the ACM, 18(11), November 1975. Returning to the example of pharmaceutical companies searching patent databases, it may not be appropriate to search for common words between two documents, as pharmaceutical patent documents typically contain many different chemical and biological terms.
- Many jurisdictions provide patent databases that are able to be accessed from a remote computer terminal, typically via an Internet-based interface. For example, the records of the United States Patent and Trademark Office are able to be accessed via the Internet at the Uniform Resource Locator (URL) http://www.uspto.gov/patft/. Other patent databases are provided by, for example, the European Patent Office, the Australian Patent Office, and the Japanese Patent Office. Online patent databases typically allow traditional keyword based searches on various fields of a patent or patent application. The searchable fields can include, for example, the name of an inventor, assignee, and title. However, under some circumstances the simple keyword based searches are inadequate. For example, a scientist about to file a patent application for a new invention requires more complex retrieval techniques to identify existing patents and patent applications that are similar to the new invention. Further, a company seeking to identify relationships with a competitor based on their assigned patents also requires more complex retrieval techniques than those afforded by traditional keyword based search techniques.
- Research systems that utilize different techniques for retrieving information from patent databases have been studied. For example, “Evaluating Document Retrieval in Patent Database: a Preliminary Report”, M. Osborn et al., Proceedings of the ACM Conference on Information and Knowledge Management, Las Vegas, Nev., 1997 introduces a system that integrates a series of shallow natural language processing techniques into a vector based document information retrieval system for searching a subset of US patents. Another study, “A Patent Search and Classification System”, L. Larkey, Proceedings of the ACM Digital Library Conference, Berkeley, Calif., 1999 uses a probabilistic information retrieval system for searching and classifying US patents. Another search system is described in “Knowledge Discovery in Patent Databases”, M. Marinescu et al., Proceedings of the ACM Conference on Information and Knowledge Management, McLean, Va., 2002, which attempts to utilise techniques like Correspondence and Cluster analysis for mining patents. Some of the challenges in the domain of patent retrieval are discussed in “Workshop on Patent Retrieval: SIGIR 2000 Workshop Report”, N. Kando et al., ACM SIGIR Forum, 34(1):28-30, April 2000.
- Traditionally, text-based documents are compared based on the number of similar terms among the documents under comparison. Such techniques may not be reliable, however, for some technical disciplines in which synonyms are frequently used or in emerging areas of technology for which standardised terms are yet to be determined. Such technical disciplines include, for example, the computer science and pharmaceutical domains. In the computer science domain, Enterprise Java Beans may also be referred to as EJB. Thus, EJB is a synonym for Enterprise Java Beans in the computer science domain. In the pharmaceutical domain, many biomedical concepts are known by a variety of names. Further, biological concepts may be related as a result of belonging to the same class. For example, the terms Amylase and Somatostatin are related, because both are proteins.
- Another complication is that a group of molecules may be similar in respect of a nominal attribute or characteristic, even if the formulae for the respective molecules are different. In such circumstances, it is generally not possible to utilise string-based matching techniques on the formulae to identify those molecules that possess a desired attribute or characteristic. Further, a search of a database using a generic or commercial trade name for a chemical composition may not retrieve relevant documents in which the composition is only described with reference to its formulaic representation. For example, 7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE and 7-CHLORO-1-METHYL-5-PHENYL-3H-1,4-BENZODIAZEPIN-2(1H)-ONE are different formulations of Valium. Thus, a simple keyword search for the term “valium” might return documents relating to the first formulation, the second formulation, or neither formulation. One technique for querying protein patents is described in “A Protein Patent Query System Powered by Kleisli”, J. Chen et al., Proceedings of the ACM SIGMOD Conference, Seattle, Wash., 1998. Given a protein sequence, Chen uses patent and protein databases, as well as bioinformatics tools, to identify whether similar protein sequences have already been patented.
- Due to the complexities described above that exist in the pharmaceutical domain, it is known for pharmaceutical companies to employ one or more patent analysts, or to engage an external agency, to examine manually hundreds of patents retrieved by querying the patent databases. This is an expensive and time-consuming approach for searching patent databases and comparing the documents contained therein, and is subject to human error.
- Thus, a need exists for an improved method of comparing two or more publications to determine the similarity of those documents.
- It is an object of the present invention to overcome substantially, or at least ameliorate, one or more disadvantages of existing arrangements.
- The present disclosure provides a method for determining the similarity of two or more publications based on multiple similarity measures determined for those publications. The similarity measures can include, for example, semantic, structural, and reference similarity measures.
- In an embodiment that employs a similarity measure, the method utilizes the number of common terms in the publications that are being compared to determine the similarity of those publications.
- In another embodiment that employs a reference similarity measure, the method utilizes the references contained within the publications that are being compared to determine the similarity of those publications.
- In a further embodiment that employs a semantic similarity measure, the method determines the similarity of terms contained within the publications by utilizing a domain-specific ontology. In so doing, terms that are synonyms are considered to be equivalent. In a further embodiment, in which the publications being compared contain chemical formulae, the similarity of the structural representations of those chemical formulae is utilized in determining the similarity of the publications.
- In a yet further embodiment, documents from the chemical, biochemical or pharmaceutical domains are compared, and the determination of the semantic similarity utilizes a determination of structural similarity of the chemicals described in the plurality of documents.
- According to a first aspect of the present disclosure, there is provided a method of comparing a plurality of documents, the method comprising the steps of: determining a plurality of similarity measures for the plurality of documents; and determining an overall similarity measure for the plurality of documents, based on the plurality of similarity measures.
- According to a second aspect of the present disclosure, there is provided a computer program product having a computer readable medium having a computer program recorded therein for comparing documents. The computer program product comprises: computer program code means for determining a plurality of similarity measures for the plurality of documents; and computer program code means for determining an overall similarity measure for the plurality of documents, based on the plurality of similarity measures.
- According to a third aspect of the present disclosure, there is provided a computer program product having a computer readable medium having a computer program recorded therein for comparing documents. The computer program product comprises: computer program code means for determining a reference similarity measure, based on references contained in the plurality of documents; computer program code means for determining a semantic similarity measure, based on the similarity of terms contained in the plurality of documents; and computer program code means for determining a similarity measure for the plurality of documents, based on the reference similarity measure and the semantic similarity measure.
- According to a fourth aspect of the present disclosure, there is provided a knowledge retrieval system, comprising: a parser for extracting required information from presented materials; an annotator for annotating terms from the parsed presented materials, by utilizing at least one of an ontology, a taxonomy, and a dictionary; a chemical representation device for deriving information about the annotated terms, based on a connection table; an integrator for collating the derived information for storage in a database; and a retrieval system for retrieving information from the database, based on input search criteria.
- According to another aspect of the present disclosure, there is provided an apparatus for implementing any one of the aforementioned methods. According to another aspect of the present disclosure, there is provided a computer program product including a computer readable medium having recorded thereon a computer program for implementing any one of the methods described above. Other aspects of the invention are also disclosed.
- One or more embodiments of the present invention will now be described with reference to the drawings, in which:
- FIG. 1 is a flow diagram of a method of determining a semantic similarity measure for biomedical patent documents, in accordance with an embodiment of the present disclosure;
- FIG. 2 is a schematic block diagram illustrating a system architecture of a BioPatentMiner system;
- FIG. 3 is a Reference Graph illustrating the reference similarity between documents;
- FIG. 4 is a schematic block diagram illustrating a system architecture of a BioPatentMiner system incorporating an embodiment of the present disclosure;
- FIG. 5 is a screenshot capture from a visualizer 490 of FIG. 4; and
- FIG. 6 is a schematic block diagram of a general purpose computer upon which arrangements described can be practised.
- Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.
- The principles of the preferred method described herein have general applicability to document searching and retrieval. For ease of explanation, the steps of the preferred method are described with reference to a comparison of patents and patent documents. However, it is not intended that the present invention be limited to the described method. For example, the invention may have application to general database searches and comparisons of research materials.
- Text-based documents are often compared based on the number of terms that are common between the documents under comparison. For example, the Vector Space Model (VSM) described in “A Vector Space Model for Automatic Indexing”, Salton G., Wong, A. and Yang, C. S. Communications of the ACM, 18(11), November 1975, represents documents as vectors of terms di={wi1, wi2, . . . , win}, where wik (1≤k≤n) is a non-negative value denoting the weight of the term k in the document i. Generally, the weight is determined by Term Frequency-Inverse Document Frequency (TF-IDF), which is calculated as:
- $$w_{ik} = tf_{ik} \times \log\left(\frac{N}{n_k}\right)$$
- where tfik is the frequency of the term k in the document i, N is the number of documents in a collection and nk is the number of documents in which the term k occurs at least once. The term
- $$\log\left(\frac{N}{n_k}\right)$$
- is the Inverse Document Frequency, which reflects how rare the term k is in the document corpus.
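- The similarity between two documents represented by vectors di and dj is then given by the Cosine Similarity between the vectors, calculated as:
- $$\mathrm{Sim}(d_i, d_j) = \frac{d_i \cdot d_j}{\sqrt{d_i \cdot d_i}\,\sqrt{d_j \cdot d_j}}$$
- where di·dj is the dot product between the vectors, calculated as
- $$d_i \cdot d_j = \sum_{k=1}^{n} \sum_{l=1}^{n} w_{ik} \cdot w_{jl}$$
- If wik and wjl are weights of the same term, wik·wjl = wik×wjl.
- If wik and wjl are weights of different terms, wik·wjl = 0.0.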
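- The TF-IDF weighting and Cosine Similarity described above can be illustrated with a minimal Python sketch. The function names, the whitespace tokenisation and the three toy documents are assumptions made purely for illustration and are not part of the disclosed method.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_k: number of documents in which term k occurs at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf_ik: frequency of term k in document i
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors

def dot(u, v):
    # Only identical terms contribute; any pair of different terms contributes 0.0
    return sum(w * v[t] for t, w in u.items() if t in v)

def cosine_similarity(u, v):
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom else 0.0

# Hypothetical toy documents, tokenised on whitespace
docs = [
    "insulin receptor binding assay".split(),
    "insulin receptor signalling pathway".split(),
    "adhesive laminate bonding process".split(),
]
vecs = tf_idf_vectors(docs)
```

- In this toy collection the first two documents share the terms “insulin” and “receptor” and therefore receive a non-zero similarity, whereas the first and third documents share no terms and receive a similarity of 0.0.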
- A significant limitation of determining the similarity between documents based solely on common terms is that the semantic similarity between terms is not taken into account. In order to overcome this limitation, ontologies and taxonomies can be utilized to determine the similarity between terms. An ontology is a data structure containing all relevant entities, possible relationships between those entities, and associated rules for a nominal domain. An ontology is typically hierarchical in structure, but is not restricted to only hierarchical structures. A taxonomy may refer to either the classification of a group of objects, or principles underlying such classification. Taxonomies are frequently hierarchical in structure, but are not necessarily so. A taxonomy may be a simple organization of a group of objects, or even an alphabetical list. Examples of ontologies may be found, for example, at each of the following Internet websites: Gene Ontology (http://www.geneontology.org) and Suggested Upper Merged Ontology (SUMO) (http://www.ontologyportal.org).
- “THESUS: Organizing Web Document Collections based on Link Semantics”, M. Halkidi et al., VLDB Journal, 12(4), November 2003 utilizes similarity measures, based on an ontology, to cluster Web documents. Similarly, “Exploiting Hierarchical Domain Structure to Compute Similarity”, P. Ganesan et al., ACM Transactions on Information Systems, 21(1):64-93, January 2003, discloses several similarity measures that have been developed utilizing taxonomies for comparing collections. One such measure is the Generalized Cosine Similarity Measure, which is similar to the Cosine Similarity Measure of the Vector Space Model described above. However, when determining the dot product between two vectors in the Generalized Cosine Similarity Measure, if two terms of the vectors representing the collections are not similar, the similarity between the two terms is not zero, as described above with reference to the VSM, but rather is determined by an ontology or taxonomy.
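- The idea of letting an ontology or taxonomy supply the similarity between non-identical terms can be sketched as follows. This is only one plausible reading of a generalized dot product, not necessarily the exact formulation of Ganesan et al.; the table TERM_SIM, its value of 0.5 and the function names are hypothetical.

```python
import math

# Hypothetical term-to-term similarities; in practice these would be
# derived from an ontology or taxonomy rather than hard-coded.
TERM_SIM = {frozenset(("amylase", "somatostatin")): 0.5}

def term_similarity(a, b):
    if a == b:
        return 1.0
    return TERM_SIM.get(frozenset((a, b)), 0.0)

def generalized_dot(u, v, sim=term_similarity):
    # Every pair of terms contributes, scaled by the similarity of the two terms
    return sum(wu * wv * sim(tu, tv)
               for tu, wu in u.items()
               for tv, wv in v.items())

def generalized_cosine(u, v, sim=term_similarity):
    denom = math.sqrt(generalized_dot(u, u, sim) * generalized_dot(v, v, sim))
    return generalized_dot(u, v, sim) / denom if denom else 0.0

def exact_match(a, b):
    # Plain VSM behaviour: different terms have zero similarity
    return 1.0 if a == b else 0.0

doc_a = {"amylase": 1.0, "assay": 1.0}
doc_b = {"somatostatin": 1.0, "assay": 1.0}
```

- With exact matching only the shared term “assay” contributes, whereas with the hypothetical ontology-based similarity the related terms “amylase” and “somatostatin” raise the score.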
- A related area of research is concerned with determining the similarity between terms in an ontology or taxonomy. One simple technique is to calculate the similarity based on the distance (number of edges) between two terms in an ontology graph. However, information-theoretic approaches for determining the similarity between ontology terms have been found to be most effective. For example, “Using Information Content to Evaluate Semantic Similarity in a Taxonomy”, P. Resnik, Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 448-453, 1995, proposes a method to determine the similarity between two semantic classes in a taxonomy based on the amount of information that the semantic classes share in common. Let pC be the probability of encountering a class C, or a child of that class, in the taxonomy. Although Resnik identified child terms by considering only is-a links, the approach can be extended to links of all types. pC is monotonically non-decreasing as one moves up the taxonomy and approaches 1 at the root. Information theory defines the information content of a term as −ln(pC).
- Given two classes Ck and Cl in the taxonomy, let Pmin be the minimum probability among common ancestors of the two classes. Since the amount of information shared by the two classes can be determined from the common ancestor with the maximum information content, Resnik defined the similarity between the terms to be −ln(Pmin). The value of this metric can vary from 0, for very dissimilar terms, to infinity. “An Information-Theoretic Definition of Similarity”, D. Lin, Proceedings of the International Conference on Machine Learning, pages 296-304, San Francisco, Calif., 1998, refined the similarity measure so that it varies between 0 and 1. Thus, the similarity between two ontology resources Ck and Cl is calculated to be:
- $$\mathrm{Sim}(C_k, C_l) = \frac{2 \times \ln(P_{min})}{\ln(p_{C_k}) + \ln(p_{C_l})}$$
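- The information-theoretic measures attributed above to Resnik and Lin can be sketched on a toy taxonomy. The taxonomy in PARENT, the occurrence counts in COUNT and all function names are hypothetical; the sketch also assumes that a class counts as one of its own ancestors.

```python
import math

# Hypothetical taxonomy (child -> parent) and hypothetical occurrence counts
PARENT = {
    "entity": None,
    "protein": "entity",
    "chemical": "entity",
    "amylase": "protein",
    "somatostatin": "protein",
    "benzodiazepine": "chemical",
}
COUNT = {"entity": 0, "protein": 2, "chemical": 1,
         "amylase": 3, "somatostatin": 3, "benzodiazepine": 3}

def ancestors(c):
    """Return c followed by its ancestors up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def probability(c):
    """p_C: share of occurrences falling on class C or any of its descendants."""
    total = sum(COUNT.values())
    return sum(n for node, n in COUNT.items() if c in ancestors(node)) / total

def p_min(ck, cl):
    """Minimum probability among the common ancestors of the two classes."""
    common = set(ancestors(ck)) & set(ancestors(cl))
    return min(probability(c) for c in common)

def resnik(ck, cl):
    return -math.log(p_min(ck, cl))

def lin(ck, cl):
    denom = math.log(probability(ck)) + math.log(probability(cl))
    if denom == 0:
        return 1.0  # both classes are the root (assumed convention)
    return 2 * math.log(p_min(ck, cl)) / denom
```

- With these assumed counts, two proteins share the ancestor “protein” and obtain a Lin similarity strictly between 0 and 1, whereas a protein and a chemical share only the root and obtain a similarity of 0.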
- Various measures of similarity between pairs of chemical molecules have been proposed, as discussed for example in “Similarity and Dissimilarity Measures for Processing Chemical Structure Databases”, V. Gillet et al., The Computer Journal, 41(8), 1998. One technique is to determine the similarity based on the “Molecular Fingerprints” of the chemicals. The molecular fingerprints contain information about atoms and substructures contained within a molecule. The fingerprints are represented as a boolean array, in which each element is TRUE (1) or FALSE (0). Boolean arrays in turn are usually represented as bitmaps, an array of bytes or words in which each bit represents one position of the boolean array. The procedure for constructing a fingerprint for a chemical, which is stored as a string of bits of length L, is as follows:
- 1. Set all L bits of the fingerprint initially to 0.
- 2. Assign to every substructure of the chemical, starting from each atom and extending down the bonds (including bond type) until the entire molecule is represented, a unique pattern described by a small number of positions along the fingerprint. Instead of allowing the chain length to increase until the entire structure is represented, it can be stopped at a given depth (usually 8).
- 3. If a substructure is present in a chemical, the corresponding sets of bits in the fingerprint are set to 1. In other words, each substructure corresponds to a partial fingerprint and the total fingerprint is constructed by logically summing them.
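- The three steps above can be sketched in Python as follows. The molecule representation (a list of atom symbols plus a dictionary of bonds), the use of SHA-256 to choose bit positions and the default parameter values are all assumptions made for illustration. The sketch enumerates only linear paths of up to eight atoms and ignores branched substructures, so it is a simplification rather than a production fingerprinting scheme.

```python
import hashlib

def path_patterns(atoms, bonds, max_atoms=8):
    """Enumerate linear atom/bond paths, e.g. 'C-C=O', up to max_atoms atoms."""
    adj = {i: [] for i in range(len(atoms))}
    for (i, j), order in bonds.items():
        adj[i].append((j, order))
        adj[j].append((i, order))
    patterns = set()

    def walk(node, visited, tokens):
        # A path and its reverse describe the same substructure; keep one form
        patterns.add(min("".join(tokens), "".join(reversed(tokens))))
        if len(visited) == max_atoms:
            return
        for nxt, order in adj[node]:
            if nxt not in visited:
                walk(nxt, visited | {nxt}, tokens + [order, atoms[nxt]])

    for start in range(len(atoms)):
        walk(start, {start}, [atoms[start]])
    return patterns

def fingerprint(atoms, bonds, length=64, bits_per_pattern=2, max_atoms=8):
    bits = [0] * length  # step 1: all L bits start at 0
    for pattern in path_patterns(atoms, bonds, max_atoms):
        digest = hashlib.sha256(pattern.encode()).digest()
        for k in range(bits_per_pattern):
            # step 2: each substructure maps to a small number of positions
            pos = int.from_bytes(digest[4 * k:4 * k + 4], "big") % length
            bits[pos] = 1  # step 3: logical OR into the total fingerprint
    return bits

# Hypothetical example: ethanol skeleton C-C-O (hydrogens omitted)
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = {(0, 1): "-", (1, 2): "-"}
ethanol_fp = fingerprint(ethanol_atoms, ethanol_bonds)
```

- Because different substructures may hash to the same positions, a set bit does not uniquely identify a substructure; this is an inherent property of hashed fingerprints of this kind.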
- Molecular fingerprints can be used to measure the similarity between different chemicals utilizing the Tanimoto Coefficient. The similarity between two fingerprints A and B is calculated to be:
- $$\mathrm{Sim}(A, B) = \frac{c}{a + b - c}$$
- where a and b are the number of 1s in A and B, respectively, and c is the number of 1s common to both fingerprints. Such a measure provides a very simple picture of the similarity relationships between pairs of structures, and is computationally efficient. Further, such a similarity measure has been found to be effective, in that molecules that have been judged to be structurally similar by chemists have also been found to be similar by this metric.
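- The Tanimoto Coefficient described above can be sketched in a few lines of Python. The function name, the representation of a fingerprint as a list of 0/1 values and the two example fingerprints are assumptions; returning 0.0 when both fingerprints are empty is likewise an assumed convention.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto Coefficient c / (a + b - c) for two equal-length 0/1 lists."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = sum(fp_a)                               # number of 1s in A
    b = sum(fp_b)                               # number of 1s in B
    c = sum(x & y for x, y in zip(fp_a, fp_b))  # 1s common to both
    denom = a + b - c
    return c / denom if denom else 0.0          # assumed convention for empty input

# Hypothetical 8-bit fingerprints
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1, 0, 1, 0]
similarity = tanimoto(fp_a, fp_b)
```

- Here a = 4, b = 4 and c = 3, so the coefficient is 3/(4 + 4 − 3) = 0.6.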
- Previous research has been conducted in the analysis of co-citations to determine related documents in bibliometrics. The notion is that if a first document cites both second and third documents, then the second and third documents are related. This technique has been adapted to find related pages in the World-wide Web, as discussed in “Life, Death and Lawfulness on the Electronic Frontier”, J. Pitkow et al., Proceedings of the ACM SIGCHI '97 Conference on Human Factors in Computing Systems, pages 383-390, Atlanta, Ga., March 1997, and “Finding Related Pages in the World-Wide Web”, J. Dean et al., Proceedings of the Eighth International World-Wide Web Conference, Toronto, Canada, May 1999.
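- The co-citation notion can be illustrated with a short Python sketch that counts, for every pair of documents, how many other documents cite both. The document identifiers and the citation data are hypothetical, and the raw count is only one of several ways in which a co-citation strength might be expressed.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citations):
    """citations: {citing document: list of cited documents}.
    Returns a Counter mapping each sorted pair of cited documents to the
    number of documents that cite both members of the pair."""
    counts = Counter()
    for cited in citations.values():
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts

# Hypothetical citation data
citations = {
    "D1": ["D2", "D3"],
    "D4": ["D2", "D3", "D5"],
    "D6": ["D5"],
}
counts = cocitation_counts(citations)
```

- In this example D2 and D3 are cited together by both D1 and D4, so they are co-cited twice and would be regarded as the most closely related pair.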
- Disclosed herein is a method for determining the similarity between multiple documents. Such documents can include, but are not restricted to, patent documents. The method utilizes a plurality of similarity measures to determine the overall similarity between the documents. One measure relates to common terms that occur in the documents under consideration. Another measure relates to common references between the documents under consideration. A further measure relates to the premise that documents containing similar terms should be considered similar. Instead of only determining matching keywords in the documents, the method determines the semantic similarity between terms in the documents by utilizing a domain-specific ontology. For example, terms which are synonyms (based on the ontology) are considered to be equivalent.
- When considering chemical, biochemical or pharmaceutical patent documents, the similarity between two or more patent documents can also be determined based on the structural similarity among molecules described in those patent documents, since similar molecules tend to behave similarly. In one embodiment, the similarity between two chemicals is determined based on the number of common substructures between those two chemicals.
- A preferred embodiment utilizes a modified Vector Space model in which the documents are represented by the terms contained therein. The terms can be, for example, English words or biological concepts. In a further embodiment, the terms exclude “stop words”, which are common words that are ignored when searching a key phrase. The exclusion of such “stop words” accelerates the searching process.
- FIG. 1 is a flow diagram of a method 100 for determining a semantic similarity measure for biomedical patent documents, in accordance with the present disclosure. The method 100 begins at a Start step 110 and proceeds to step 120, which identifies biological terms within the documents that are to be compared. The biological terms are typically identified with reference to one or more appropriate technical dictionaries and/or ontologies. Control passes to step 130, which identifies chemical terms in the documents that are to be compared. The chemical terms are typically identified with reference to a set of rules. In one embodiment, the method identifies chemical terms by comparing terms within the documents with a predefined dictionary of chemical names and utilising a customisable rule engine. An example of a rule used to identify chemical terms is that a string containing “methyl” is a chemical. Control passes to step 140, which determines the similarity of the biological terms identified in step 120. Then step 150 determines the similarity of the chemical terms identified in step 130. Control passes to step 160 to determine the semantic similarity between the documents under consideration, based on the similarities of the biological terms and chemical terms determined in steps 140 and 150. Control then passes to an End step 170 and the method terminates. It will be appreciated and readily understood by a person skilled in the relevant art that one or more steps of the method 100 can be practised in a different order, or in parallel, without departing from the spirit and scope of the present invention.
- BioPatentMiner is a system to facilitate knowledge discovery from patents related to biomedicine. FIG. 2 is a schematic block diagram illustrating the overall architecture of an embodiment of the BioPatentMiner system 200. The system 200 uses a crawler 210 to download patents 220 from an online patent database, such as that provided by the United States Patent and Trademark Office (USPTO), based on a query input by a user. The system 200 can also be used on a collection of biomedical patents obtained by other techniques. A parser 230 parses these patents 220 to extract information. Such information may include, for example, inventors, assignees, titles, and abstracts. In one embodiment, the parser 230 assumes that the patents 220 are in the HyperText Markup Language (HTML) format of the USPTO site. Other embodiments of the system 200 are adjusted to handle documents in other formats.
- Biological terms in the parsed files are passed from the parser 230 to an annotator 240. The annotator 240 annotates the parsed files, using for example a BioAnnotator system as described in “Information Extraction from Biomedical Literature: Methodology, Evaluation and an Application”, L. Subramaniam et al., Proceedings of the ACM Conference on Information and Knowledge Management, New Orleans, La., 2003. BioAnnotator identifies and classifies biological terms in scientific text by utilizing publicly available biomedical ontologies 250 that are presented as an input to the annotator 240.
- The annotated patents are passed from the annotator 240 to be indexed by an integrator/indexer 260, such as an XML search engine. The integrator/indexer 260 also receives as an input the ontologies 250, and passes the integrated information from the patents and the ontologies to an integrated knowledge database 270. Information contained in the integrated knowledge database 270 can be retrieved using a knowledge retriever 280. The knowledge retriever presents information from the integrated knowledge database 270 requested by a user on a display system, such as a visualizer 290. Further information pertaining to the BioPatentMiner system 200 is contained in “BioPatentMiner: An Information Retrieval System for BioMedical Patents”, S. Mukherjea et al., Proceedings of the Very Large Databases (VLDB) Conference, Toronto, Canada, 2004.
- The present disclosure provides a method for determining the similarity of two or more publications based on multiple similarity measures determined for those publications. The method identifies technical terms in documents under consideration and determines the similarity of those documents based on at least two of semantic, structural and reference similarity measures. When seeking to compare documents from a particular domain, a domain-specific ontology is utilised to determine the semantic similarity between the publications. Such domains can include, for example, computer science, chemistry, and biochemistry.
- In an embodiment concerned with comparing documents in the biochemical domain, the ontology stores all related synonyms for each biological concept. In a further embodiment concerned with chemical, biological and pharmaceutical documents, chemical formulas in the publications under consideration are compared based on common substructures contained therein.
- In the following example, an embodiment of the present disclosure determines the similarity of two given documents and returns a similarity value Sim(Pi, Pj) in the range 0.0 to 1.0, in which a value of 1.0 indicates that the documents are identical. In this embodiment, the similarity value is determined by a semantic similarity measure and a reference similarity measure. The similarity value is defined as:
$$\mathrm{Sim}(P_i, P_j) = WT_{ref} \times \mathrm{RefSim}(P_i, P_j) + (1 - WT_{ref}) \times \mathrm{SemSim}(P_i, P_j) \qquad (6)$$
- The Reference Similarity component, RefSim(Pi, Pj), calculates the similarity of the documents based on references contained within the documents, and will be discussed in further detail below. The Semantic Similarity component, SemSim(Pi, Pj), calculates the similarity of the documents based on the contents of the documents, as also explained in further detail below. The final similarity value is a linear combination of the two similarity measures. WTref is a predefined constant determining the weight given to the Reference Similarity during the final similarity calculation. In one embodiment, the weighting term is set to 0.5, so that equal importance is given to each of the Reference and Semantic similarity components. The value of the weighting term can be varied for specific applications, to place greater or lesser importance on the commonality of references shared by the documents or the semantic similarity of the documents.
- The Reference similarity of the documents is determined in this embodiment by utilizing a Reference graph. The documents are the nodes of this graph. If a document Pi contains a reference to a document Pj, a direct edge is created between the corresponding nodes.
FIG. 3 shows an example of such a Reference Graph 300, in which pi corresponds to document Pi. In the example shown in FIG. 3, a direct edge links nodes p1 and p3, which indicates that document P1 is related to document P3. As the edge is drawn from p1 to p3, document P1 contains a reference to document P3. Node p2 also has a direct edge to p3, indicating that document P2 also contains a reference to document P3. As each of documents P1 and P2 contains a reference to document P3, it is likely that documents P1 and P2 are related. Further, node p3 has direct edges to each of nodes p4 and p5, indicating that document P3 contains references to each of documents P4 and P5. Accordingly, documents P4 and P5 are considered to be similar. Based on the above observations from the Reference Graph 300 in FIG. 3, if there is a directed or undirected path between any two nodes of a Reference graph, the corresponding documents are considered to be similar, with the degree of similarity inversely proportional to the length of the path between the two nodes. Thus, the Reference Similarity between two documents Pi and Pj is calculated to be the inverse of the minimum path length between corresponding nodes pi and pj in the Reference graph. Thus,
$$\mathrm{RefSim}(P_i, P_j) = \frac{1}{d_{min}(p_i, p_j)} \qquad (7)$$
- where dmin(pi, pj) is the minimum path length between nodes pi and pj. If there are no paths between two nodes, the Reference Similarity of the corresponding documents is 0. For the example Reference graph 300 shown in FIG. 3, the minimum path length between p1 and p3 is 1. Therefore, the Reference similarity between these patents, RefSim(P1,P3), will also be 1. On the other hand, the minimum path length between p1 and p2, as well as p1 and p4, is 2. Therefore, the Reference Similarity between the corresponding patents is 0.5, as RefSim(P1,P2)=½ and RefSim(P1,P4)=½.
- In order to determine the semantic similarity between two documents, SemSim(Pi,Pj), this embodiment utilizes an augmented Cosine Similarity method to compare documents based on the contents of those documents. The documents are represented by vectors containing the terms contained therein. The terms can be English words or terms belonging to a domain-specific ontology. In one embodiment, the terms exclude “stop words”, which are common words that are ignored when searching a key phrase. While comparing the terms in the documents, the semantic similarity between the documents is determined by utilizing a domain-specific ontology. Even if two terms are not equal, those terms can be semantically similar if the terms are synonyms or if the terms belong to the same ontology class. If the terms belong to different classes, the similarity between the classes will determine the similarity between the terms. Moreover, in the case in which pharmaceutical patent documents are being compared, if both the terms are molecules represented by chemical formulae, the structural similarity between the molecules is determined and may be incorporated into the semantic similarity measure.
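- Returning to the Reference Similarity described above (the inverse of the minimum path length in the Reference graph) and the weighted combination with the Semantic Similarity, the following is a minimal sketch rather than the disclosed implementation. It assumes that edge direction is ignored when measuring path length, which is consistent with the FIG. 3 values quoted in the text, and that a document compared with itself scores 1. The names `ref_sim`, `combined_sim`, and `FIG3_EDGES` are hypothetical.

```python
from collections import deque

# Reference edges of the FIG. 3 example: (citing document, cited document).
FIG3_EDGES = [("p1", "p3"), ("p2", "p3"), ("p3", "p4"), ("p3", "p5")]


def ref_sim(edges, src, dst):
    """Inverse of the minimum path length between src and dst.

    Edge direction is ignored (an assumption). Returns 0.0 when the two
    nodes are not connected.
    """
    if src == dst:
        return 1.0  # assumption: a document is fully similar to itself
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {src}
    queue = deque([(src, 0)])
    while queue:  # breadth-first search finds the shortest path first
        node, dist = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt == dst:
                return 1.0 / (dist + 1)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0


def combined_sim(ref, sem, wt_ref=0.5):
    """Linear combination of Reference and Semantic similarity."""
    return wt_ref * ref + (1 - wt_ref) * sem
```

With the FIG. 3 edges, `ref_sim(FIG3_EDGES, "p1", "p2")` follows the two-edge path through p3, and `combined_sim` with the default weight of 0.5 gives both components equal importance.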
- In one embodiment, a domain-specific ontology is utilized that contains various terms as well as all synonyms for those terms. Further, each term belongs to an Ontology class. The classes form an ISA-hierarchy in which a class can be a child of another class. Initially, the similarity between all pairs of the ontology classes is pre-computed using Lin's method, as described above with reference to equation (4), to determine sim(Ck,Cl):
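$$\mathrm{sim}(C_k, C_l) = \frac{2 \times \log p_{min}}{\log p_{C_k} + \log p_{C_l}} \qquad (8)$$
- where pCk is the probability of encountering the class Ck or a child of that class in the taxonomy, pCl is defined likewise for the class Cl, and pmin is the minimum probability among their common ancestors.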
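- As an illustration of the class similarity just described, the following sketch computes Lin's measure over a small IS-A hierarchy. It is not the disclosed implementation: the hierarchy, the class probabilities, and the names `ancestors`, `lin_sim`, `PARENT`, and `PROB` are all hypothetical, and the natural logarithm is assumed (the ratio does not depend on the base).

```python
import math

# Hypothetical IS-A hierarchy (child -> parent) and class probabilities.
PARENT = {
    "Physiologic Function": "Root",
    "Molecular Function": "Physiologic Function",
    "Organism Function": "Physiologic Function",
    "Chemical": "Root",
}
PROB = {
    "Root": 1.0,
    "Physiologic Function": 0.2,
    "Molecular Function": 0.05,
    "Organism Function": 0.08,
    "Chemical": 0.3,
}


def ancestors(cls, parent):
    """Return cls together with all of its ancestors up to the root."""
    chain = [cls]
    while cls in parent:
        cls = parent[cls]
        chain.append(cls)
    return chain


def lin_sim(ck, cl, parent, prob):
    """Lin similarity: 2 * log(p_min) / (log p(ck) + log p(cl))."""
    common = set(ancestors(ck, parent)) & set(ancestors(cl, parent))
    p_min = min(prob[c] for c in common)
    denom = math.log(prob[ck]) + math.log(prob[cl])
    if denom == 0:
        return 1.0  # both classes are the root itself
    return 2 * math.log(p_min) / denom
```

In this toy hierarchy, two classes that share only the root score 0, identical classes score 1, and siblings under an informative ancestor fall in between.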
- The Semantic Similarity between two documents Pi and Pj represented by vectors di and dj is then given by the Cosine Similarity between the vectors, as described above with reference to equation (2), calculated as:
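$$\mathrm{SemSim}(P_i, P_j) = \frac{d_i \cdot d_j}{|d_i| \times |d_j|} \qquad (9)$$
- where di·dj is the dot product between the vectors, calculated as
$$d_i \cdot d_j = \sum_{k} \sum_{l} w_{ik} \cdot w_{jl} \qquad (10)$$
- The method disclosed herein for calculating the dot product differs from traditional techniques. For terms k and l in documents i and j respectively, wik·wjl can be calculated in various ways: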
- 1. If terms k and l are the same, or represent the same concept, wik·wjl is calculated to be:
$$w_{ik} \cdot w_{jl} = tf_{ik} \times tf_{jl} \times idf_k \times idf_l \qquad (11)$$
- which is the traditional TF-IDF measure, as described above with reference to equation (1). The weight of a term is determined by the frequency of the term in the document, as well as how common the term is in the document corpus.
- 2. If k and l belong to the Ontology classes Ck and Cl, respectively, wik·wjl is calculated to be:
$$w_{ik} \cdot w_{jl} = tf_{ik} \times tf_{jl} \times idf(C_k) \times idf(C_l) \times \mathrm{sim}(C_k, C_l) \times WT_{Ont} \qquad (12)$$
- For calculating the contribution of dissimilar terms to the overall similarity between two documents, the following factors are considered:
- (a) The frequencies of the terms in the patents.
- (b) idf(C) represents the inverse document frequency of class C in the corpus and is calculated as:
$$idf(C) = \log \frac{N}{n_C} \qquad (13)$$
- where N is the number of documents in a collection and nC is the number of documents in which a term of Class C occurs at least once. The method utilizes this metric in such a way that common classes are allocated a lower weight.
- (c) The similarity between the classes to which the terms belong. The similarity value sim(Ck,Cl) varies between 0, when the common ancestor of the classes is the root of the ontology hierarchy, and 1, when Ck and Cl are the same.
- (d) WTOnt is a predefined constant between 0 and 1 that determines the relative importance allocated to Ontology-based similarity when determining the overall similarity of documents. If the constant is set to 0, the results are equivalent to traditional cosine similarity, as described above with reference to equation (2). Preferably, the constant is set to be less than 1, so that if two terms are not equivalent, but belong to the same class, those terms are given less weight. In one embodiment, the constant is set to 0.6, as a result of experimentation with some patent collections.
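- The factors listed above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the natural logarithm is assumed for the inverse document frequency, each vector norm is taken to be the square root of the augmented dot product of a document with itself, and the term frequencies, idf values, classes, and the names `idf`, `pair_weight`, `aug_dot`, `sem_sim`, and `ONTOLOGY` are hypothetical.

```python
import math

WT_ONT = 0.6  # value reported in the text for one embodiment


def idf(n_docs, n_docs_with_class):
    """Inverse document frequency, log(N / n_C); natural log assumed."""
    return math.log(n_docs / n_docs_with_class)


def pair_weight(k, l, tf_i, tf_j, term_idf, term_class, class_idf,
                class_sim, wt_ont=WT_ONT):
    """Contribution of term k (document i) and term l (document j)."""
    if k == l:
        # Same term or concept: plain TF-IDF product.
        return tf_i[k] * tf_j[l] * term_idf[k] * term_idf[l]
    ck, cl = term_class[k], term_class[l]
    # Different terms: weighted by class idf, class similarity and WT_Ont.
    return (tf_i[k] * tf_j[l] * class_idf[ck] * class_idf[cl]
            * class_sim(ck, cl) * wt_ont)


def aug_dot(tf_i, tf_j, **kw):
    """Augmented dot product: sum over every term pair (k, l)."""
    return sum(pair_weight(k, l, tf_i, tf_j, **kw)
               for k in tf_i for l in tf_j)


def sem_sim(tf_i, tf_j, **kw):
    """Cosine-style similarity built on the augmented dot product."""
    norm = math.sqrt(aug_dot(tf_i, tf_i, **kw) * aug_dot(tf_j, tf_j, **kw))
    return aug_dot(tf_i, tf_j, **kw) / norm if norm else 0.0


# Hypothetical ontology data for two single-term documents.
ONTOLOGY = dict(
    term_idf={"insulin": 1.2, "glucagon": 1.5},
    term_class={"insulin": "Hormone", "glucagon": "Hormone"},
    class_idf={"Hormone": 0.9},
    class_sim=lambda a, b: 1.0 if a == b else 0.0,
)
DOC_A = {"insulin": 2}   # term -> term frequency
DOC_B = {"glucagon": 1}
```

With `wt_ont=0.0` the two documents share no identical term and score 0, as in traditional cosine similarity; with the default weight they receive a non-zero score because both terms belong to the same class.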
- Pharmaceutical patents typically contain many chemical formulas representing different molecules. When utilizing the ontology, all chemicals are treated similarly, since all chemicals belong to the same class. However, it is desirable to determine the similarity between documents based on the structural similarity between the chemicals contained in the documents. Therefore, in equation (12) above, if k and l are both chemicals, wik·wjl is calculated to be:
$$w_{ik} \cdot w_{jl} = tf_{ik} \times tf_{jl} \times idf(\mathrm{Chemical})^{2} \times \mathrm{simChemical}(k, l) \times WT_{Chem} \qquad (14)$$
- where idf(Chemical) is the Inverse Document Frequency of the Chemical class. simChemical(k,l) determines the similarity between the chemicals k and l. The similarity is calculated by the Tanimoto coefficient of the Molecular fingerprints of the chemicals, as discussed above with reference to equation (5). WTChem is a predefined constant between 0 and 1 that determines how much importance is given to Chemical-based similarity. In one embodiment, WTChem is set to 0.5. As described above, BioPatentMiner is a system to facilitate knowledge discovery from patents related to biomedicine.
FIG. 4 is a schematic block diagram illustrating the overall architecture of a BioPatentMiner system 400 enhanced in accordance with an embodiment of the present invention. The system 400 uses a crawler 410 to download patents 420 from an online patent database, such as that provided by the United States Patent and Trademark Office (USPTO), based on a query input by a user. The system 400 can also be used on a collection of biomedical patents obtained by other techniques. A parser 430 parses these patents 420 to extract information. Such information may include, for example, inventors, assignees, titles, and abstracts. In one embodiment, the parser 430 assumes that the patents 420 are in the HyperText Markup Language (HTML) format of the USPTO site. Other embodiments of the system 400 are adjusted to handle documents in other formats.
- Biological terms in the parsed files are passed from the parser 430 to an annotator 440. The annotator 440 annotates the parsed files, using for example a BioAnnotator system as described in “Information Extraction from Biomedical Literature: Methodology, Evaluation and an Application”, L. Subramaniam et al., Proceedings of the ACM Conference on Information and Knowledge Management, New Orleans, La., 2003. BioAnnotator identifies and classifies biological terms in scientific text by utilizing publicly available biomedical ontologies 450 that are presented as an input to the annotator 440. BioAnnotator also uses a Rule Engine to identify unknown and new chemical formulae that are not present in the ontologies.
- In one embodiment, the biomedical ontology used by the BioAnnotator is the Unified Medical Language System (UMLS), as described at the URL http://umlsks.nlm.nih.gov. UMLS is a consolidated repository of medical terms and their relationships, spread across multiple languages and disciplines. Such disciplines include, for example, chemistry and biology. One section of UMLS is a Semantic Network that includes 235 biomedical semantic classes like Gene or Genome, and Amino Acid, Peptide, or Protein. In addition, the Semantic Network includes biological concepts, each of which is associated with one or more semantic classes. For example, the concept caspase 3 is associated with the semantic class Amino Acid, Peptide or Protein. A biological concept can be referred to by various synonyms. For example, caspase 3 is variously referred to using the terms apopain, Yama protein, and CPP32 protein. UMLS identifies one of the synonyms as the primary name of the concept and BioAnnotator annotates each identified biological term with an associated primary name (or baseform) and semantic class.
- For the chemicals identified by BioAnnotator in the documents under consideration, the baseform is the Molecular fingerprint representation of the chemical and the class is Chemical. The Molecular fingerprint is derived using a connection table, which contains a list of all the atoms within a chemical structure, together with bond information that describes the exact manner in which the individual atoms are linked together. Examples of such connection tables are available, for example, at Chemical Abstracts Service (CAS Online) http://www.cas.org.
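- The structural comparison of two chemicals by the Tanimoto coefficient of their Molecular fingerprints can be sketched as follows. This is an illustration only: a fingerprint is modelled here as the set of its set-bit positions, the derivation of a fingerprint from a connection table is not shown, and the names `tanimoto` and `chem_pair_weight` are hypothetical. Returning 0 for two empty fingerprints is an assumption.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared bits / bits set in either fingerprint."""
    if not fp_a and not fp_b:
        return 0.0  # assumption: two empty fingerprints are not comparable
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)


def chem_pair_weight(tf_ik, tf_jl, idf_chemical, sim_chemical, wt_chem=0.5):
    """Contribution of two chemical terms k and l to the dot product.

    Both terms belong to the Chemical class, so the class idf is squared.
    """
    return tf_ik * tf_jl * idf_chemical ** 2 * sim_chemical * wt_chem


# Hypothetical fingerprints given as sets of set-bit positions.
FP_K = {1, 2, 3}
FP_L = {2, 3, 4}
```

For the two hypothetical fingerprints, two of the four distinct bits are shared, so `tanimoto(FP_K, FP_L)` evaluates to 0.5, and that value can be passed as `sim_chemical` to `chem_pair_weight`.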
- A Reference graph is created based on the references of the documents. In addition, to facilitate knowledge discovery the system integrates the information of the patents and biomedical ontologies. The integrated Patent Knowledge base enables several types of information retrieval. Besides keyword search, the system allows the user to search for patents containing a class of biomedical concept. Moreover, a user can identify semantic relationships between patent documents, inventors, and assignees, for example, as well as biomedical concepts.
- In this embodiment, the annotated patents are represented in the eXtensible Markup Language (XML) and are passed through a chemical representation block 445 to be indexed by an integrator/indexer 460, such as an XML search engine. The chemical representation block receives as inputs the annotated patents from the annotator 440, and a connection table 455. The connection table 455 is used to create Molecular Fingerprints for the chemicals identified in the annotated patent. In order to facilitate knowledge discovery, the system integrates the information of the patents and biomedical ontologies. Thus, the integrator/indexer 460 receives the ontologies 450 and patents from the chemical representation block 445 as inputs and passes the integrated information from the patents and the ontologies to an integrated knowledge database 470.
- Information contained in the integrated knowledge database 470 can be retrieved using a knowledge retriever 480. The integrated information space provided by the integrated knowledge database 470 facilitates several types of information retrieval. Besides keyword search, the system allows the user to search for patents containing a class of biomedical concept. Further, it is possible to identify semantic relationships between patents, inventors, assignees, and biomedical concepts. The knowledge retriever presents information from the integrated knowledge database 470 requested by a user on a display system, such as a visualizer 490, as will be described below with reference to FIG. 5. Thus, the embodiment of FIG. 4 performs the functions of a semantic search, a semantic association, and a similarity search to identify similar documents.
TABLE 1

| Patents | Common Terms |
| --- | --- |
| Ranbaxy: Azole compounds as therapeutic agents for fungal infections [6670363]; Pfizer: Bicycliccarbonyl indole compounds as anti-inflammatory/analgesic agents [6303628] | sulfur, phenyl, hydroxy, composition, trifluoromethyl, oxygen, hydrogen, halogens, salts |
| Ranbaxy: Azole compounds as therapeutic agents for fungal infections [6670363]; Pfizer: Heteroaryl phenyl pyrazole compounds as anti-inflammatory/analgesic agents [6649636] | mammals, phenyl, hydroxy, composition, trifluoromethyl, hydrogen, salts, formyl, furyl, oxazolyl |
| Ranbaxy: Azole compounds as therapeutic agents for fungal infections [6670363]; Pfizer: 3-Azabicyclo[3.1.0]hexane derivatives useful in therapy [6313312] | phenyl, cobalt, alkyl, naphthyl, hydroxy, composition, imidazolyl, trifluoromethyl, hydrogen, morpholinyl, salts, furyl, oxazolyl, aryl |
| Ranbaxy: Azole compounds as therapeutic agents for fungal infections [6670363]; Pfizer: Steroidal glycosides [5698526] | mammals, phenyl, thienyl, hydroxy, composition, imidazolyl, trifluoromethyl, hydrogen, morpholinyl, salts, furyl, oxazolyl, aryl |
| Ranbaxy: Extended release formulation of etodolac [6586005]; Pfizer: Rapidly releasing and taste-masking pharmaceutical dosage form [6221402] | solutions, dosage forms, hydroxypropylcellulose |
| Ranbaxy: Azole compounds as therapeutic agents for fungal infections [6670363]; Pfizer: Sulfonylbenzene compounds as anti-inflammatory/analgesic agents [6294558] | sulfur, mammals, thienyl, hydroxy, composition, imidazolyl, trifluoromethyl, oxygen, hydrogen, salts, furyl, triazolyl, oxazolyl, atom, carboxyl, aryl |
| Ranbaxy: Derivatives of monosaccharides as cell adhesion inhibitors [6329344]; Pfizer: Pharmaceutically active compounds [6586439] | cobalt, salts, prodrugs, aryl |

- For each document retrieved from a database by a search, it is possible to determine documents that are similar. Further, it is possible to identify documents that are similar between two companies. For example, Table 1 shows similar patent documents between two companies, Ranbaxy and Pfizer. The titles of the patents and the patent numbers are shown in a first column, and the common biological terms between the patents are shown in a second column. This type of analysis is a particularly useful aid for companies to identify potential patent infringements. It is also possible to identify patent documents that are similar to a nominal patent or patent application.
- It is possible to visualize the similarity between a collection of patents using the visualizer, which is a client-side, Swing-based Java Web Start application. For example, FIG. 5 is a screenshot 500 showing a visualization of the similarity of a collection of patents that are assigned to University of Texas, as obtained from the USPTO site using the keyword “glycolysis”. The input patents are shown as the nodes of a graph. If the similarity between two patents is greater than a predetermined similarity threshold, an edge is drawn between the nodes (patents). The similarity threshold is set by default to a value of 0.1, but can be readily modified by a user, as desired.
- In the particular example shown in FIG. 5, a similarity threshold of 0.2 is utilized. At a similarity threshold of 0.2, some patents are not related to any other patents, while some other patents form clusters. If the threshold is increased, many of the edges will disappear, and at a lower threshold more patents will be related. In this embodiment, the brightness of an edge linking two nodes is proportional to the similarity between the two patents connected by that edge. Clicking on an edge shows the terms common among the patents. For example, the terms common between U.S. Pat. Nos. 5,801,033 and 5,529,914 are shown in a popup window 510, which indicates that the common terms are gels, carriers, and polymers.
- The main terms associated with the given group of patents are shown in a scrolled list 520 on the left side of the display. Thus, FIG. 5 shows that terms like insulin, glucose and antibodies are associated with the given set of patents. The user can select a term from the list 520 to see the patents that refer to that term. In the example shown in FIG. 5, the term “insulin” is selected in the list 520, which results in the patents that contain the term “insulin” being highlighted. It can be seen that many of these highlighted patents are similar and form a cluster 530.
- United States patents are manually classified, so that each patent is categorized into one or more classes. The classification of patent documents facilitates searching of related documents. In one experiment, the USPTO internet site was queried to download patents from four classes. In particular, 500 patents were downloaded from each of the following classes:
-
- 1. Class 156: Adhesive Bonding and Miscellaneous Chemical Manufacture;
- 2. Class 181: Acoustics;
- 3. Class 424: Drug, Bio-Affecting and Body Treating Compositions; and
- 4. Class 514: Drug, Bio-Affecting and Body Treating Compositions.
- The patents were annotated using BioAnnotator to identify biomedical concepts and chemical formulas. The chemicals were then converted to their Molecular fingerprints. The ontology used was UMLS. It is to be noted that the first two classes are not related to pharmacy or biomedicine. However, since UMLS is spread across multiple disciplines, including Chemistry, many concepts were also identified utilizing UMLS in patents of these first two classes.
- The similarity between UMLS classes was pre-computed. Two classes are considered to be similar if those classes have a common ancestor with large information content. Thus, Molecular Function and Organism Function are found to be similar with a common ancestor Physiologic Function. Therefore, terms that belong to these two classes will have some similarity as well.
- In order to establish the effectiveness of the method disclosed herein, a comparison was made with results obtained from the traditional Vector-space TF-IDF Cosine-similarity method of determining similarity. The pair-wise patent similarity was computed using the two techniques for patents within the same USPTO class, as well as patents in different USPTO classes. If the similarity between two patents was greater than a nominal threshold, the patents were deemed to be similar.
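- The thresholded comparison described above can be sketched as follows. This is an illustration rather than the evaluation code used in the experiment: it assumes that the reported percentage is taken over document pairs, and the toy similarity scores and the names `percent_similar` and `TOY_SCORES` are hypothetical.

```python
from itertools import product


def percent_similar(pairs, sim, threshold):
    """Percentage of document pairs whose similarity exceeds threshold."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if sim(a, b) > threshold)
    return 100.0 * hits / len(pairs)


# Hypothetical pair-wise similarity scores between two small classes.
TOY_SCORES = {
    ("a", "x"): 0.25,
    ("a", "y"): 0.15,
    ("b", "x"): 0.05,
    ("b", "y"): 0.0,
}


def toy_sim(a, b):
    """Look up a precomputed similarity score for the pair (a, b)."""
    return TOY_SCORES[(a, b)]


# Inter-class comparison: every document of one class against the other.
INTER_CLASS_PAIRS = list(product(["a", "b"], ["x", "y"]))
```

For an intra-class comparison, the pairs would instead be the unordered pairs within one class, for example `itertools.combinations(docs, 2)`; any similarity function, such as traditional cosine similarity or the combined measure, can be passed as `sim`.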
TABLE 2

| Comparison | Similarity Threshold | Using CosSim (%) | Using Semantic and Structural Similarity Measures (%) | Increase |
| --- | --- | --- | --- | --- |
| Inter-class (181 vs 514) | 0.1 | 0.3 | 0.28 | −0.02 |
| | 0.2 | 0.04 | 0.02 | −0.02 |
| Intra-class (181) | 0.1 | 10.74 | 11.82 | 1.08 |
| | 0.2 | 6.88 | 7.58 | 0.7 |
| Intra-class (514) | 0.1 | 6.8 | 14.4 | 7.6 |
| | 0.2 | 1.76 | 4.21 | 2.45 |
| Inter-class (156 vs 424) | 0.1 | 2.24 | 2.22 | −0.02 |
| | 0.2 | 0.78 | 0.77 | −0.01 |
| Intra-class (156) | 0.1 | 9.22 | 9.89 | 0.67 |
| | 0.2 | 5.04 | 5.42 | 0.38 |
| Intra-class (424) | 0.1 | 3.48 | 6.35 | 2.87 |
| | 0.2 | 1.04 | 2.31 | 1.27 |
| Inter-class (424 vs 514) | 0.1 | 1.8 | 3.52 | 1.72 |
| | 0.2 | 0.45 | 0.93 | 0.48 |
| Intra-class (424) | 0.1 | 3.48 | 6.35 | 2.87 |
| | 0.2 | 1.04 | 2.31 | 1.27 |
| Intra-class (514) | 0.1 | 6.8 | 14.4 | 7.6 |
| | 0.2 | 1.76 | 4.21 | 2.45 |

- Table 2 shows the percentage of patents found to be similar using the two techniques. To emphasize that the evaluation is not biased by the value of the threshold, two threshold values were used, 0.1 and 0.2. It is assumed that, in general, patents in the same USPTO class are similar to each other and dissimilar to patents of other classes. Table 2 shows that for dissimilar patents (in different classes), the number of patents found to be similar by the method disclosed herein is slightly larger, or even smaller, than the number of patents found to be similar by the old technique. Thus, when comparing patents of classes 181 and 514, 0.3% of the patents were found similar by the old technique and 0.28% of the patents were found similar by the method disclosed herein at a threshold of 0.1.
- However, for similar patents (in the same class) a larger number of patents were found to be similar by the method disclosed herein. This was particularly the case for pharmaceutical patents belonging to classes 424 and 514. Thus, at a threshold of 0.1, by Cosine Similarity 6.8% of the patents in class 514 were found to be similar, while by the method disclosed herein 14.4% of the patents were found to be similar.
- These statistics indicate that the method disclosed herein calculates a higher similarity value for similar patents, but not for dissimilar patents. The intra-class similarity calculated by the method disclosed herein will generally be greater than the similarity between the patents as calculated by Cosine Similarity, since the method disclosed herein augments Cosine Similarity by considering the Semantic and Structural similarity between terms, even if the terms are not equal. Moreover, the method disclosed herein calculates the Reference similarity, and generally there are references between patents of the same USPTO class.
- It is noted that patents of both classes 514 and 424 are pharmaceutical and are thus related. However, patents within the same class are obviously more related. Table 2 shows that, in this case, the increase in the number of similar patents was higher during intra-class comparison than during inter-class comparison. Thus, the method disclosed herein is also effective in identifying the most similar patents to a query patent in a collection of related Pharmaceutical patents.
- The aforementioned preferred method(s) comprise a particular control flow. There are many other variants of the preferred method(s) which use different control flows without departing from the spirit or scope of the invention. Furthermore, one or more of the steps of the preferred method(s) may be performed in parallel rather than sequentially.
- The method of comparing documents is preferably practised using a general-purpose computer system 600, such as that shown in FIG. 6, wherein the processes described above may be implemented as software, such as an application program executing within the computer system 600. In particular, the steps of the method of comparing documents are effected by instructions in the software that are carried out by the computer. The instructions may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part performs the document comparison methods and a second part manages a user interface between the first part and the user. The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer from the computer readable medium, and then executed by the computer. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer preferably effects an advantageous apparatus for comparing documents.
- The computer system 600 is formed by a computer module 601, input devices such as a keyboard 602 and mouse 603, and output devices including a printer 615, a display device 614 and loudspeakers 617. A Modulator-Demodulator (Modem) transceiver device 616 is used by the computer module 601 for communicating to and from a communications network 620, for example connectable via a telephone line 621 or other functional medium. The modem 616 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN), and may be incorporated into the computer module 601 in some implementations.
- The computer module 601 typically includes at least one processor unit 605, and a memory unit 606, for example formed from semiconductor random access memory (RAM) and read only memory (ROM). The module 601 also includes a number of input/output (I/O) interfaces including an audio-video interface 607 that couples to the video display 614 and loudspeakers 617, an I/O interface 613 for the keyboard 602 and mouse 603 and optionally a joystick (not illustrated), and an interface 608 for the modem 616 and printer 615. In some implementations, the modem 616 may be incorporated within the computer module 601, for example within the interface 608. A storage device 609 is provided and typically includes a hard disk drive 610 and a floppy disk drive 611. A magnetic tape drive (not illustrated) may also be used. A CD-ROM drive 612 is typically provided as a non-volatile source of data. The components 605 to 613 of the computer module 601 typically communicate via an interconnected bus 604 and in a manner which results in a conventional mode of operation of the computer system 600 known to those in the relevant art. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun Sparcstations or like computer systems evolved therefrom.
- Typically, the application program is resident on the hard disk drive 610 and read and controlled in its execution by the processor 605. Intermediate storage of the program and any data fetched from the network 620 may be accomplished using the semiconductor memory 606, possibly in concert with the hard disk drive 610. In some instances, the application program may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 612 or 611, or alternatively may be read by the user from the network 620 via the modem device 616. Still further, the software can also be loaded into the computer system 600 from other computer readable media. The term “computer readable medium” as used herein refers to any storage or transmission medium that participates in providing instructions and/or data to the computer system 600 for execution and/or processing. Examples of storage media include floppy disks, magnetic tape, CD-ROM, a hard disk drive, a ROM or integrated circuit, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 601. Examples of transmission media include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
- The method of comparing documents may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions of semantic similarity determination and structural similarity determination. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories.
- It is apparent from the above that the arrangements described are applicable to the computer, data processing, and pharmaceutical industries.
- The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
Claims (16)
1. A knowledge retrieval system, comprising:
a parser for extracting required information from presented materials;
an annotator for annotating terms from said parsed presented materials by utilizing at least one of an ontology, a taxonomy, and a dictionary;
a chemical representation device for deriving information about said annotated terms, based on a connection table;
an integrator for collating said derived information for storage in a database; and
a retrieval system for retrieving information from said database, based on input search criteria.
2. The system of claim 1, further comprising:
a crawler for obtaining said presented materials from the Internet.
3. The system of claim 1, wherein said required information is selected from inventors' names, assignees, titles, and abstracts.
4. The system of claim 1, wherein said annotated terms are scientific terms identified and classified from said parsed presented materials.
5. The system of claim 1, wherein said annotator further creates a reference graph based on references contained within said parsed presented materials.
6. The system of claim 1, wherein said presented materials are patent documents.
7. A knowledge retrieval system, comprising:
a parser for extracting required information from presented materials;
an annotator for annotating terms from said parsed presented materials by utilizing at least one of an ontology, a taxonomy, and a dictionary;
a chemical representation device for deriving information about said annotated terms, based on a connection table;
an integrator for collating said derived information for storage in a database; and
a retrieval system for retrieving information from said database, based on input search criteria, and
wherein said annotated terms are scientific terms identified and classified from said parsed presented materials.
8. The system of claim 7, further comprising:
a crawler for obtaining said presented materials from the Internet.
9. The system of claim 7, wherein said required information is selected from inventors' names, assignees, titles, and abstracts.
10. The system of claim 7, wherein said annotator further creates a reference graph based on references contained within said parsed presented materials.
11. The system of claim 7, wherein said presented materials are patent documents.
12. A knowledge retrieval system, comprising:
a parser for extracting required information from presented materials;
an annotator for annotating terms from said parsed presented materials by utilizing at least one of an ontology, a taxonomy, and a dictionary;
a chemical representation device for deriving information about said annotated terms, based on a connection table;
an integrator for collating said derived information for storage in a database; and
a retrieval system for retrieving information from said database, based on input search criteria, and
wherein said annotator further creates a reference graph based on references contained within said parsed presented materials.
13. The system of claim 12, further comprising:
a crawler for obtaining said presented materials from the Internet.
14. The system of claim 12, wherein said required information is selected from inventors' names, assignees, titles, and abstracts.
15. The system of claim 12, wherein said annotated terms are scientific terms identified and classified from said parsed presented materials.
16. The system of claim 12, wherein said presented materials are patent documents.
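For orientation, the component chain recited in the independent claims (parser, annotator, chemical representation device, integrator, retrieval system) can be pictured as a simple processing pipeline. The following is a minimal sketch under stated assumptions: the toy dictionary, the connection-table layout (a list of atoms plus bonds as index pairs), the in-memory database, and every function and field name are hypothetical and serve only to illustrate the flow of data between the claimed components.

```python
from dataclasses import dataclass, field

# Hypothetical dictionary: each scientific term maps to a class and a toy
# connection table (atoms, and bonds as pairs of atom indices).
DICTIONARY = {
    "ethanol": {"class": "chemical", "atoms": ["C", "C", "O"], "bonds": [(0, 1), (1, 2)]},
    "methanol": {"class": "chemical", "atoms": ["C", "O"], "bonds": [(0, 1)]},
}


@dataclass
class Record:
    title: str
    abstract: str
    annotations: dict = field(default_factory=dict)


def parse(material):
    """Parser: extract the required fields from a presented material."""
    return Record(title=material["title"], abstract=material["abstract"])


def annotate(record):
    """Annotator: tag abstract words that appear in the dictionary."""
    words = record.abstract.lower().replace(",", " ").replace(".", " ").split()
    for word in words:
        if word in DICTIONARY:
            record.annotations[word] = {"class": DICTIONARY[word]["class"]}
    return record


def derive_chemical_info(record):
    """Chemical representation: derive counts from each term's connection table."""
    for term, info in record.annotations.items():
        table = DICTIONARY[term]
        info["atom_count"] = len(table["atoms"])
        info["bond_count"] = len(table["bonds"])
    return record


def integrate(records):
    """Integrator: collate records into an in-memory database keyed by term."""
    database = {}
    for record in records:
        for term in record.annotations:
            database.setdefault(term, []).append(record)
    return database


def retrieve(database, term):
    """Retrieval: return titles of records annotated with the search term."""
    return [record.title for record in database.get(term.lower(), [])]
```

In use, each presented material would pass through `parse`, `annotate`, and `derive_chemical_info` in turn, the resulting records would be handed to `integrate`, and `retrieve` would then answer a search against the collated database.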
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/193,803 US20090037389A1 (en) | 2005-12-15 | 2008-08-19 | Document Comparison Using Multiple Similarity Measures |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/304,029 US7472121B2 (en) | 2005-12-15 | 2005-12-15 | Document comparison using multiple similarity measures |
US12/193,803 US20090037389A1 (en) | 2005-12-15 | 2008-08-19 | Document Comparison Using Multiple Similarity Measures |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/304,029 Division US7472121B2 (en) | 2005-12-15 | 2005-12-15 | Document comparison using multiple similarity measures |
Publications (1)
Publication Number | Publication Date |
---|---|
US20090037389A1 (en) | 2009-02-05 |
Family ID: 38174981
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/304,029 Expired - Fee Related US7472121B2 (en) | 2005-12-15 | 2005-12-15 | Document comparison using multiple similarity measures |
US12/193,803 Abandoned US20090037389A1 (en) | 2005-12-15 | 2008-08-19 | Document Comparison Using Multiple Similarity Measures |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/304,029 Expired - Fee Related US7472121B2 (en) | 2005-12-15 | 2005-12-15 | Document comparison using multiple similarity measures |
Country Status (1)
Country | Link |
---|---|
US (2) | US7472121B2 (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7958136B1 (en) * | 2008-03-18 | 2011-06-07 | Google Inc. | Systems and methods for identifying similar documents |
US20110161303A1 (en) * | 2009-12-30 | 2011-06-30 | Hong Fu Jin Precision Industry (Shenzhen) Co., Ltd | System and method for analyzing official notices of electronically filed patent applications |
US20120158639A1 (en) * | 2010-12-15 | 2012-06-21 | Joshua Lamar Moore | Method, system, and computer program for information retrieval in semantic networks |
US20130091150A1 (en) * | 2010-06-30 | 2013-04-11 | Jian-Ming Jin | Determiining similarity between elements of an electronic document |
US20130110496A1 (en) * | 2011-10-28 | 2013-05-02 | Sap Ag | Calculating Term Similarity Using A Meta-Model Semantic Network |
US20130322849A1 (en) * | 2012-05-31 | 2013-12-05 | Sony Corporation | Information processing apparatus, program, and information processing method |
US9104710B2 (en) | 2013-03-15 | 2015-08-11 | Src, Inc. | Method for cross-domain feature correlation |
US9311390B2 (en) | 2008-01-29 | 2016-04-12 | Educational Testing Service | System and method for handling the confounding effect of document length on vector-based similarity scores |
US20180068018A1 (en) * | 2010-04-30 | 2018-03-08 | International Business Machines Corporation | Managed document research domains |
WO2018091714A1 (en) * | 2016-11-21 | 2018-05-24 | Universite de Bordeaux | Device and method for generating a drug database |
JPWO2021084723A1 (en) * | 2019-10-31 | 2021-05-06 | ||
US20220261399A1 (en) * | 2019-06-24 | 2022-08-18 | Evonik Operations Gmbh | Method for predicting a feedstuff and/or feedstuff raw material |
US20230114965A1 (en) * | 2021-10-08 | 2023-04-13 | EMC IP Holding Company LLC | Document data management via graph cliques for layout understanding |
US20230154571A1 (en) * | 2017-06-22 | 2023-05-18 | Arianegroup Sas | Method and device for selecting a subassembly of molecules for use in predicting at least one property of a molecular structure |
Families Citing this family (57)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1049030A1 (en) * | 1999-04-28 | 2000-11-02 | SER Systeme AG Produkte und Anwendungen der Datenverarbeitung | Classification method and apparatus |
EP1128278B1 (en) * | 2000-02-23 | 2003-09-17 | SER Solutions, Inc | Method and apparatus for processing electronic documents |
US9177828B2 (en) | 2011-02-10 | 2015-11-03 | Micron Technology, Inc. | External gettering method and device |
EP1182577A1 (en) | 2000-08-18 | 2002-02-27 | SER Systeme AG Produkte und Anwendungen der Datenverarbeitung | Associative memory |
US6658423B1 (en) * | 2001-01-24 | 2003-12-02 | Google, Inc. | Detecting duplicate and near-duplicate files |
ATE537507T1 (en) | 2001-08-27 | 2011-12-15 | Bdgb Entpr Software Sarl | METHOD FOR AUTOMATICALLY INDEXING DOCUMENTS |
US20070294232A1 (en) * | 2006-06-15 | 2007-12-20 | Andrew Gibbs | System and method for analyzing patent value |
US8140267B2 (en) * | 2006-06-30 | 2012-03-20 | International Business Machines Corporation | System and method for identifying similar molecules |
US8015162B2 (en) * | 2006-08-04 | 2011-09-06 | Google Inc. | Detecting duplicate and near-duplicate files |
US7912831B2 (en) * | 2006-10-03 | 2011-03-22 | Yahoo! Inc. | System and method for characterizing a web page using multiple anchor sets of web pages |
US9892111B2 (en) * | 2006-10-10 | 2018-02-13 | Abbyy Production Llc | Method and device to estimate similarity between documents having multiple segments |
US9495358B2 (en) * | 2006-10-10 | 2016-11-15 | Abbyy Infopoisk Llc | Cross-language text clustering |
WO2008097194A1 (en) * | 2007-02-09 | 2008-08-14 | Agency For Science, Technology And Research | Keyword classification and determination in language modelling |
US7809705B2 (en) * | 2007-02-13 | 2010-10-05 | Yahoo! Inc. | System and method for determining web page quality using collective inference based on local and global information |
US20080306899A1 (en) * | 2007-06-07 | 2008-12-11 | Gregory Michelle L | Methods, apparatus, and computer-readable media for analyzing conversational-type data |
CN101334771A (en) * | 2007-06-28 | 2008-12-31 | 冠亚智财股份有限公司 | Correlation screening system and method for patent constituted element combination |
US7877385B2 (en) * | 2007-09-21 | 2011-01-25 | Microsoft Corporation | Information retrieval using query-document pair information |
US20090132522A1 (en) * | 2007-10-18 | 2009-05-21 | Sami Leino | Systems and methods for organizing innovation documents |
CN102341802B (en) * | 2009-06-30 | 2014-05-28 | 国际商业机器公司 | Graph similarity calculation system and method |
US9213756B2 (en) * | 2009-11-02 | 2015-12-15 | Harry Urbschat | System and method of using dynamic variance networks |
US8321357B2 (en) * | 2009-09-30 | 2012-11-27 | Lapir Gennady | Method and system for extraction |
US9158833B2 (en) * | 2009-11-02 | 2015-10-13 | Harry Urbschat | System and method for obtaining document information |
US9152883B2 (en) | 2009-11-02 | 2015-10-06 | Harry Urbschat | System and method for increasing the accuracy of optical character recognition (OCR) |
US9824142B2 (en) * | 2010-01-15 | 2017-11-21 | Nec Corporation | Information processing device, information processing method, and computer-readable recording medium |
EP2354971A1 (en) * | 2010-01-29 | 2011-08-10 | E-Therapeutics plc | Document analysis system |
US9183287B2 (en) | 2010-01-29 | 2015-11-10 | E-Therapeutics Plc | Social media analysis system |
CN103189881B (en) * | 2010-08-17 | 2019-06-18 | 西格拉姆申德勒有限公司 | FSTP expert system |
KR20120052636A (en) * | 2010-11-16 | 2012-05-24 | 한국전자통신연구원 | A hscode recommendation service system and method using ontology |
US8856181B2 (en) * | 2011-07-08 | 2014-10-07 | First Retail, Inc. | Semantic matching |
US9792355B2 (en) * | 2011-10-25 | 2017-10-17 | Intelligent Views Gmbh | Searches for similar documents |
US9734195B1 (en) * | 2013-05-16 | 2017-08-15 | Veritas Technologies Llc | Automated data flow tracking |
US9467409B2 (en) * | 2013-06-04 | 2016-10-11 | Yahoo! Inc. | System and method for contextual mail recommendations |
RU2607975C2 (en) * | 2014-03-31 | 2017-01-11 | Общество с ограниченной ответственностью "Аби ИнфоПоиск" | Constructing corpus of comparable documents based on universal measure of similarity |
US9984166B2 (en) * | 2014-10-10 | 2018-05-29 | Salesforce.Com, Inc. | Systems and methods of de-duplicating similar news feed items |
US10592841B2 (en) | 2014-10-10 | 2020-03-17 | Salesforce.Com, Inc. | Automatic clustering by topic and prioritizing online feed items |
US11250956B2 (en) * | 2014-11-03 | 2022-02-15 | Cerner Innovation, Inc. | Duplication detection in clinical documentation during drafting |
US10002256B2 (en) * | 2014-12-05 | 2018-06-19 | GeoLang Ltd. | Symbol string matching mechanism |
US9672206B2 (en) | 2015-06-01 | 2017-06-06 | Information Extraction Systems, Inc. | Apparatus, system and method for application-specific and customizable semantic similarity measurement |
US11017301B2 (en) | 2015-07-27 | 2021-05-25 | International Business Machines Corporation | Obtaining and using a distributed representation of concepts as vectors |
US10073890B1 (en) | 2015-08-03 | 2018-09-11 | Marca Research & Development International, Llc | Systems and methods for patent reference comparison in a combined semantical-probabilistic algorithm |
US10621499B1 (en) | 2015-08-03 | 2020-04-14 | Marca Research & Development International, Llc | Systems and methods for semantic understanding of digital information |
US10381108B2 (en) * | 2015-09-16 | 2019-08-13 | Charles Jianping Zhou | Web search and information aggregation by way of molecular network |
US9798818B2 (en) | 2015-09-22 | 2017-10-24 | International Business Machines Corporation | Analyzing concepts over time |
US20180039944A1 (en) * | 2016-01-05 | 2018-02-08 | Linkedin Corporation | Job referral system |
US10540439B2 (en) | 2016-04-15 | 2020-01-21 | Marca Research & Development International, Llc | Systems and methods for identifying evidentiary information |
EP3485391A4 (en) * | 2016-07-15 | 2020-01-22 | Albert Einstein College of Medicine | Persistance and linking of analytic products in big data environments |
US10769213B2 (en) | 2016-10-24 | 2020-09-08 | International Business Machines Corporation | Detection of document similarity |
CN109791797B (en) * | 2016-12-05 | 2023-05-02 | 智慧芽信息科技(苏州)有限公司 | System, apparatus and method for searching and displaying available information based on chemical structure similarity in large database |
CN106599609A (en) * | 2017-01-13 | 2017-04-26 | 景德镇陶瓷大学 | New method for calculating similarity of drug molecules based on ChEBI description |
US11244743B2 (en) | 2018-01-05 | 2022-02-08 | International Business Machines Corporation | Adaptive weighting of similarity metrics for predictive analytics of a cognitive system |
US20190328420A1 (en) | 2018-04-30 | 2019-10-31 | Sebastian Khairkhahan | Pericardial space access device and method |
CN108804421B (en) * | 2018-05-28 | 2022-04-15 | 中国科学技术信息研究所 | Text similarity analysis method and device, electronic equipment and computer storage medium |
US11269943B2 (en) * | 2018-07-26 | 2022-03-08 | JANZZ Ltd | Semantic matching system and method |
EP3729422B1 (en) * | 2019-02-27 | 2023-08-09 | Google LLC | Detecting continuing conversations with computing devices |
US11676582B2 (en) * | 2019-02-27 | 2023-06-13 | Google Llc | Detecting conversations with computing devices |
CN111209373A (en) * | 2020-01-07 | 2020-05-29 | 北京启明星辰信息安全技术有限公司 | Sensitive text recognition method and device based on natural semantics |
US11847169B2 (en) * | 2020-12-18 | 2023-12-19 | Shanghai Henghui Intellectual Property Service Co., Ltd. | Method for data processing and interactive information exchange with feature data extraction and bidirectional value evaluation for technology transfer and computer used therein |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5774833A (en) * | 1995-12-08 | 1998-06-30 | Motorola, Inc. | Method for syntactic and semantic analysis of patent text and drawings |
US20030004936A1 (en) * | 2001-06-29 | 2003-01-02 | Epatentmanager.Com | Simultaneous intellectual property search and valuation system and methodology (SIPS-VSM) |
US20030033295A1 (en) * | 2001-07-11 | 2003-02-13 | Adler Marc Stephen | Method for analyzing and recording innovations |
US6556992B1 (en) * | 1999-09-14 | 2003-04-29 | Patent Ratings, Llc | Method and system for rating patents and other intangible assets |
US20040053317A1 (en) * | 2002-09-10 | 2004-03-18 | Sidney Kimmel Cancer Center | Gene segregation and biological sample classification methods |
US20040126840A1 (en) * | 2002-12-23 | 2004-07-01 | Affymetrix, Inc. | Method, system and computer software for providing genomic ontological data |
US20050182571A1 (en) * | 2004-02-17 | 2005-08-18 | Ki-Eun Kim | Sequence indexing method and system |
US20060053173A1 (en) * | 2004-09-03 | 2006-03-09 | Biowisdom Limited | System and method for support of chemical data within multi-relational ontologies |
US20060074991A1 (en) * | 2002-11-06 | 2006-04-06 | Lussier Yves A | System and method for generating an amalgamated database |
US7058658B2 (en) * | 2000-03-28 | 2006-06-06 | Dana-Farber Cancer Institute, Inc. | Molecular database for antibody characterization |
- 2005-12-15: US11/304,029 (US7472121B2), not active, Expired - Fee Related
- 2008-08-19: US12/193,803 (US20090037389A1), not active, Abandoned
Non-Patent Citations (1)
Title |
---|
Gillet et al. (Valerie J. Gillet, "Similarity and Dissimilarity Methods for Processing Chemical Structure Database," The Computer Journal, Vol. 41, No. 8, 1998) * |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9311390B2 (en) | 2008-01-29 | 2016-04-12 | Educational Testing Service | System and method for handling the confounding effect of document length on vector-based similarity scores |
US8713034B1 (en) * | 2008-03-18 | 2014-04-29 | Google Inc. | Systems and methods for identifying similar documents |
US7958136B1 (en) * | 2008-03-18 | 2011-06-07 | Google Inc. | Systems and methods for identifying similar documents |
US20110161303A1 (en) * | 2009-12-30 | 2011-06-30 | Hong Fu Jin Precision Industry (Shenzhen) Co., Ltd | System and method for analyzing official notices of electronically filed patent applications |
US20180068018A1 (en) * | 2010-04-30 | 2018-03-08 | International Business Machines Corporation | Managed document research domains |
US20130091150A1 (en) * | 2010-06-30 | 2013-04-11 | Jian-Ming Jin | Determiining similarity between elements of an electronic document |
US8566273B2 (en) * | 2010-12-15 | 2013-10-22 | Siemens Aktiengesellschaft | Method, system, and computer program for information retrieval in semantic networks |
US20120158639A1 (en) * | 2010-12-15 | 2012-06-21 | Joshua Lamar Moore | Method, system, and computer program for information retrieval in semantic networks |
US9256595B2 (en) * | 2011-10-28 | 2016-02-09 | Sap Se | Calculating term similarity using a meta-model semantic network |
US20130110496A1 (en) * | 2011-10-28 | 2013-05-02 | Sap Ag | Calculating Term Similarity Using A Meta-Model Semantic Network |
US9854220B2 (en) * | 2012-05-31 | 2017-12-26 | Saturn Licensing Llc | Information processing apparatus, program, and information processing method |
US20130322849A1 (en) * | 2012-05-31 | 2013-12-05 | Sony Corporation | Information processing apparatus, program, and information processing method |
US9104710B2 (en) | 2013-03-15 | 2015-08-11 | Src, Inc. | Method for cross-domain feature correlation |
WO2018091714A1 (en) * | 2016-11-21 | 2018-05-24 | Universite de Bordeaux | Device and method for generating a drug database |
FR3059118A1 (en) * | 2016-11-21 | 2018-05-25 | Universite de Bordeaux | DEVICE AND METHOD FOR GENERATING A DATABASE RELATING TO DRUGS |
US11210314B2 (en) | 2016-11-21 | 2021-12-28 | Universite de Bordeaux | Device and method for generating a drug database |
US20230154571A1 (en) * | 2017-06-22 | 2023-05-18 | Arianegroup Sas | Method and device for selecting a subassembly of molecules for use in predicting at least one property of a molecular structure |
US20220261399A1 (en) * | 2019-06-24 | 2022-08-18 | Evonik Operations Gmbh | Method for predicting a feedstuff and/or feedstuff raw material |
JPWO2021084723A1 (en) * | 2019-10-31 | 2021-05-06 | ||
WO2021084723A1 (en) * | 2019-10-31 | 2021-05-06 | 富士通株式会社 | Retrieval method, retrieval program, and retrieval device |
JP7363914B2 (en) | 2019-10-31 | 2023-10-18 | 富士通株式会社 | Search method, search program and search device |
US20230114965A1 (en) * | 2021-10-08 | 2023-04-13 | EMC IP Holding Company LLC | Document data management via graph cliques for layout understanding |
US11880403B2 (en) * | 2021-10-08 | 2024-01-23 | EMC IP Holding Company LLC | Document data management via graph cliques for layout understanding |
Also Published As
Publication number | Publication date |
---|---|
US20070143322A1 (en) | 2007-06-21 |
US7472121B2 (en) | 2008-12-30 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US7472121B2 (en) | Document comparison using multiple similarity measures | |
Tekli et al. | An overview on XML similarity: Background, current trends and future directions | |
Almpanidis et al. | Combining text and link analysis for focused crawling—An application for vertical search engines | |
Ding et al. | Swoogle: a search and metadata engine for the semantic web | |
US8838650B2 (en) | Method and apparatus for preprocessing a plurality of documents for search and for presenting search result | |
US8751498B2 (en) | Finding and disambiguating references to entities on web pages | |
Aljaber et al. | Document clustering of scientific texts using citation contexts | |
US7657507B2 (en) | Pseudo-anchor text extraction for vertical search | |
WO2003017143A2 (en) | Method and system for enhanced data searching | |
Mahgoub et al. | Mining association rules from unstructured documents | |
Mehta et al. | Stamantic clustering: combining statistical and semantic features for clustering of large text datasets | |
Taha et al. | BusSEngine: a business search engine | |
Inokuchi et al. | A method for online analytical processing of text data | |
Zhu et al. | Using UMLS-based Re-Weighting Terms as a Query Expansion Strategy. | |
Berardi et al. | Mining generalized association rules on biomedical literature | |
Zanasi | Web mining through the online analyst | |
Dima et al. | Keyphrase extraction for technical language processing | |
Eirinaki | Web mining: a roadmap | |
Selmi et al. | MeSH-based semantic query expansion | |
Acharya et al. | The process of information extraction through natural language processing | |
Neves | Stepping stones and pathways: Improving retrieval by chains of relationships between documents | |
Piotrowski | NLP-supported full-text retrieval | |
Drymonas | Ontology learning from text based on multi-word term concepts: the ontogain method | |
Krishnamurthy et al. | Using structured queries for keyword information retrieval | |
Cornoiu | Vector Space Model for Legal XML Retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION |