WO2017175080A1 - Systems and methods for retrieving material science based information

Info

Publication number: WO2017175080A1
Application number: PCT/IB2017/051431
Authority: WO
Inventors: Sapankumar Hiteshchandra SHAH; Dhwani Sanjay VORA; Purushottham Gautham BASAVARSU; Sreedhar Sannareddy Reddy
Original assignee: Tata Consultancy Services Limited
Priority date: 2016-04-04
Filing date: 2017-03-11
Publication date: 2017-10-12

Abstract

A system and method for retrieving a set of result documents from a distributed database pertaining to a material science query given by a user. The system comprises an extraction module to extract attributes of material science from a set of documents stored in the distributed database. A post-processing module of the system is configured to filter the extracted information components and resolve ambiguities that arise during the extraction. Further, an indexing module of the system is configured to generate an index table of the filtered information components to mark the location of the documents in the distributed database. A query processor module is configured to convert the user query into an updated query, and a searching module is configured to execute the updated query on the index table to retrieve a set of result documents from the distributed database. The result documents, pertaining to the user query, are the final output of the system.

Description

SYSTEMS AND METHODS FOR RETRIEVING MATERIAL SCIENCE BASED INFORMATION

CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY

[001] The present application claims priority to Patent Application Serial Number 201621011900, filed before the Indian Patent Office on April 04, 2016, and incorporates that application in its entirety.

TECHNICAL FIELD

[002] The present embodiments relate generally to query processing and, more specifically, to generating search results pertaining to material science specific attributes.

BACKGROUND

[003] Search engines rarely give the right answer when entities, attributes or relationships are involved. For example, a designer wants to know which composition of steel should be used to achieve a hardness of 50 Rc and above in a product. This requires retrieving publications which are about steel and looking for cases having hardness above 50 Rc. Consider the query "composition of steel should be used to achieve a hardness of 50Rc and above". Using existing search techniques, this query will probably find documents that contain the terms "steel", "composition", "hardness" and the value "50 Rc", but few if any of those are likely to actually be about the composition of steel that should be used to achieve a hardness of 50 Rc and above.

[004] In the above example, the search results do not match the intent of the designer because current search technologies do not understand the relation between composition, hardness and 50 Rc in the given query. The search retrieves all publications containing the terms hardness and 50 Rc, even when the terms are unrelated and simply present in different parts of the document. Further, due to the absence of value relations in existing search techniques, the search engine does not retrieve a publication in which the hardness of the final product is 60 Rc (even though 60 Rc is greater than 50 Rc).

[005] Considering the above limitations, there is a need for a search engine which can collectively analyze and retrieve documents having information about material composition data, material property, process data, empirical models, experimental results and process-structure-property relations.

OBJECTIVE OF THE DISCLOSURE

[006] In accordance with the present embodiments, the primary objective is to provide a system and method for retrieving a set of documents pertaining to the material science domain.

[007] Another objective of the embodiments is to provide a system and method to retrieve information on entity-value association of material science.

[008] Another objective of the embodiments is to provide a system and method for preparing an index table according to an entity and value range relation.

SUMMARY OF THE DISCLOSURE

[009] The following presents a simplified summary of some embodiments of the disclosure in order to provide a basic understanding of the embodiments. This summary is not an extensive overview of the embodiments. It is not intended to identify key/critical elements of the embodiments or to delineate the scope of the embodiments. Its sole purpose is to present some embodiments in a simplified form as a prelude to the more detailed description that is presented below.

[0010] In view of the foregoing, an embodiment herein provides a system and method for retrieving a set of results from a distributed database pertaining to a query, wherein the query comprises one or more attributes and constraints of material science.

[0011] In one aspect, a method is provided to retrieve a set of results from a distributed database pertaining to a query given by a user, the method comprising steps of: extracting one or more information components of a set of documents, wherein the set of documents are stored in the distributed database, further wherein the information components comprise at least one of one or more entities, one or more value instances and one or more entity-value associations, wherein the one or more entities are based on a pre-defined dictionary; filtering the extracted information components to resolve entity-value association ambiguities that arise during the extraction using a knowledge repository; indexing the filtered information based on each of the one or more information components using an indexing module of the system, wherein the indexing marks the location of the information components in the documents stored in the distributed database; converting the user query into a query on the said index tables, wherein converting comprises parsing the user query; and executing the updated query on the index table, wherein the executing comprises searching the index tables, wherein searching comprises retrieving a set of result documents from the distributed database and combining the retrieved documents using set theoretic operations to produce result documents pertaining to the user query.

[0012] In another aspect, a system is provided for retrieving a set of results from a distributed database pertaining to a query given by a user. The system comprises a processor communicatively coupled with a memory, wherein the processor comprises: an extraction module configured to extract one or more information components of a set of documents, wherein the set of documents are stored in the distributed database, the information components comprise at least one of one or more entities, one or more value instances and one or more entity-value associations, wherein the one or more entities are based on a predefined dictionary; a post-processing module configured to filter the extracted information components and resolve one or more entity-value association ambiguities that arise during the extraction using the predefined dictionary; an indexing module configured to generate an index table of the filtered information components based on each of the one or more information components, wherein the indexing marks the location of the set of documents in the distributed database; a query processor module configured to convert the user query into an updated query based on the one or more information components present in the index table; and a searching module configured to execute the updated query on the index table, wherein the executing comprises retrieving a set of result documents from the distributed database, wherein the result documents pertain to the user query.

[0013] It should be appreciated by those skilled in the art that any block diagram herein represents a conceptual view of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in a computer readable medium and so executed by a computing device or processor, whether or not such computing device or processor is explicitly shown.

BRIEF DESCRIPTION OF THE FIGURES

[0014] The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:

[0015] Figure 1 illustrates a block diagram showing a system for retrieving a set of results from a distributed database pertaining to a query given by a user, in accordance with an embodiment of the disclosure.

[0016] Figure 2 illustrates a schematic diagram showing a system for retrieving a set of results from a distributed database pertaining to a query given by a user, in accordance with an embodiment of the disclosure.

[0017] Figure 3 illustrates a schematic diagram showing a system for extracting information components from a set of documents stored in distributed database, in accordance with an embodiment of the disclosure.

[0018] Figure 4 illustrates a flow diagram showing a method for retrieving a set of results from a distributed database pertaining to a query given by a user, in accordance with an embodiment of the disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS

[0019] Some embodiments, illustrating all its features, will now be discussed in detail. The words "comprising," "having," "containing," and "including," and other forms thereof, are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items.

[0020] It must also be noted that as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Although any systems and methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present disclosure, the preferred systems and methods are now described. In the following description, for the purpose of explanation and understanding, reference has been made to numerous embodiments for which the intent is not to limit the scope of the disclosure.

[0021] One or more components of the embodiment are described as modules for the understanding of the specification. For example, a module may include a self-contained component in a hardware circuit comprising logic gates, semiconductor devices, integrated circuits or any other discrete component. The module may also be a part of any software program executed by any hardware entity, for example a processor. The implementation of a module as a software program may include a set of logical instructions to be executed by a processor or any other hardware entity.

[0022] The disclosed embodiments are merely exemplary of the disclosure, which may be embodied in various forms.

[0023] The elements illustrated in the Figures interoperate as explained in more detail below. Before setting forth the detailed explanation, however, it is noted that all of the discussion below, regardless of the particular implementation being described, is exemplary in nature, rather than limiting. For example, although selected aspects, features, or components of the implementations are depicted as being stored in memories, all or part of the systems and methods consistent with the attrition warning system and method may be stored on, distributed across, or read from other machine-readable media.

[0024] Method steps of the embodiments may be performed by one or more computer processors executing a program tangibly embodied on a computer-readable medium to perform functions of the embodiments by operating on input and generating output. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, the processor receives (reads) instructions and data from a memory (such as a read-only memory and/or a random access memory) and writes (stores) instructions and data to the memory. Storage devices suitable for tangibly embodying computer program instructions and data include, for example, all forms of non-volatile memory, such as semiconductor memory devices, including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROMs. Any of the foregoing may be supplemented by, or incorporated in, specially-designed ASICs (application-specific integrated circuits) or FPGAs (Field-Programmable Gate Arrays). A computer can generally also receive (read) programs and data from, and write (store) programs and data to, a non-transitory computer-readable storage medium such as an internal disk (not shown) or a removable disk.

[0025] In view of the foregoing, an embodiment herein provides a system and method for retrieving a set of results from a distributed database pertaining to a query, wherein the query comprises one or more attributes and constraints of material science.

[0026] According to an embodiment of the disclosure, a system (100) is provided for retrieving a set of results from a distributed database (120) pertaining to a query given by a user. The system (100) comprises a user interface (102), a memory (106), and a processor (104) communicatively coupled with the memory (106). Further, the processor (104) comprises: an extraction module (108) to extract one or more information components of a set of documents, wherein the set of documents are stored in the distributed database (120), the information components comprise at least one of one or more entities, one or more value instances and one or more entity-value associations, wherein the information components are based on a predefined dictionary (118); a post-processing module (110) to filter the extracted information components and resolve one or more entity-value association ambiguities that arise during the extraction using the predefined dictionary (118); an indexing module (112) to generate an index table of the filtered information components based on each of the one or more information components, wherein the indexing marks the location of the set of documents in the distributed database (120); a query processor module (114) configured to convert the user query into an updated query based on the one or more information components present in the index table; and a searching module (116) configured to execute the updated query on the index table, wherein the executing comprises retrieving a set of result documents from the distributed database (120), wherein the result documents pertain to the user query.

[0027] In the preferred embodiment of the disclosure, the extraction module (108) extracts one or more information components of a set of documents stored in the distributed database (120). Before extraction of the one or more information components, the extraction module (108) performs pre-processing over the text of each document. In the pre-processing, the system (100) converts one or more PDF files into textual form using a PDF to XML conversion tool. It would be appreciated that in this embodiment the PDF to XML conversion tool is Exegenix. Exegenix takes the one or more PDF files as input and applies tokenization and sentence splitting to convert the PDF text into a sequence of sentences. Further, the sequence of sentences is subjected to part-of-speech tagging (PoS tagging), stemming and dependency parsing. The PoS tagging assigns a part of speech to each token of a tokenized sentence. The parts of speech comprise noun, verb, adjective, cardinal number, etc. It would be appreciated that the system uses the maximum entropy algorithm for PoS tagging. The PoS tags are used as features for information extraction.

[0028] The dependency parser analyzes the grammatical structure of a sentence. It provides the dependency relations between the words of the sentence. A dependency relation identifies a head word and the word that is dependent on the head word. The type of dependency relation specifies the grammatical relation between the words of a sentence.

[0029] The extraction module (108) extracts one or more information components of a set of documents, wherein the set of documents are stored in the distributed database (120), the information components comprise at least one of, one or more entities, one or more value instances and one or more entity-value association, wherein the one or more entities are based on a predefined dictionary (118).

[0030] Further, the extraction module (108) extracts the one or more value instances. During the PoS tagging, the PoS tagger assigns the cardinal number tag to all occurrences of value instances which are followed by a measurement unit, wherein the measurement unit is stored in the predefined dictionary (118). The value instances include all occurrences of numbers and value ranges. It would be appreciated that the use of regular expressions over tokens helps in extracting value ranges. Furthermore, the extraction module (108) applies a set of rules to identify an associated value from the one or more value instances for each of the one or more entities. It would be appreciated that regular expressions over dependency graphs are used to implement the set of rules. The set of rules may use distance, in terms of the number of edges in the path between two nodes in the dependency graph, and is as follows:

1. If there is an entity node E1 in the path from value node V1 to the root node in the dependency graph, create relation {E1, V1};

2. If an entity node E2 in the dependency graph has the shortest distance from the value node V2 and the unit associated with V2 is valid for E2, create relation {E2, V2}; and

3. If a value node V3 is mentioned in brackets and it is preceded by an entity node E3, create relation {E3, V3}.

[0031] In the above set of rules for entity-value relation extraction from text, rule 1 starts with a value node V1 and recursively traverses the path to the root node looking for a head node E1 that is identified as an entity. Thus it relates a value to the closest entity in the head-modifier relation path. Rule 2 relates value nodes that are not in head-modifier relation paths, and rule 3 extracts value relations where a value node is mentioned in brackets preceded by an entity node. The extracted information components of a set of documents stored in the distributed database (120) are then passed to the indexing module (112) for further processing.
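Rules 1 and 2 above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the dependency graph is assumed to be given as a hypothetical mapping from each node to its head word, entities are assumed to carry a set of allowed units, and the function names are invented for illustration. Rule 3 (bracketed values) is omitted for brevity.

```python
from collections import deque


def head_path(node, heads):
    """Yield the ancestors of node, nearest first, up to the root."""
    seen = set()
    while heads.get(node) is not None and node not in seen:
        seen.add(node)
        node = heads[node]
        yield node


def tree_distances(start, heads):
    """Count edges from start to every node of the undirected dependency tree."""
    neighbours = {}
    for child, head in heads.items():
        neighbours.setdefault(child, set())
        if head is not None:
            neighbours[child].add(head)
            neighbours.setdefault(head, set()).add(child)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in neighbours.get(cur, ()):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def associate(value_node, value_unit, heads, entity_units):
    """Return the entity related to value_node, or None if no rule applies.

    heads maps each node to its head word (None for the root).
    entity_units maps each entity node to its set of allowed units.
    """
    # Rule 1: the first entity met on the path from the value node to the root.
    for ancestor in head_path(value_node, heads):
        if ancestor in entity_units:
            return ancestor
    # Rule 2: the nearest entity (fewest edges) for which the unit is valid.
    dist = tree_distances(value_node, heads)
    candidates = [
        (dist[entity], entity)
        for entity, units in entity_units.items()
        if entity in dist and value_unit in units
    ]
    return min(candidates)[1] if candidates else None
```

For a hand-built tree such as `{"showed": None, "alloy": "showed", "hardness": "showed", "of": "hardness", "60": "of"}`, the value node `"60"` reaches the entity `"hardness"` on its way to the root, so rule 1 applies. Rule 2 is only consulted when no entity lies on that path.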

[0032] The extraction module (108) may also use the predefined dictionary (118) to extract information entities presented in tabular form.
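The extraction of value instances described in paragraph [0030], that is, numbers or value ranges followed by a measurement unit from the dictionary, could be sketched with a regular expression as below. This is an illustrative assumption, not the disclosed implementation: the `UNITS` list stands in for the measurement units of the predefined dictionary, the pattern works on raw text rather than on PoS-tagged tokens, and negative numbers are not handled.

```python
import re

# Hypothetical measurement units standing in for the predefined dictionary.
UNITS = ["°C/min", "°C", "MPa", "Rc", "min", "%"]

# Longest units first so that "°C/min" is preferred over "°C".
_UNIT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
_NUM = r"\d+(?:\.\d+)?"
VALUE_RE = re.compile(
    rf"(?P<low>{_NUM})"
    rf"(?:\s*(?:-|\u2013|to)\s*(?P<high>{_NUM}))?"  # optional upper bound of a range
    rf"\s*(?P<unit>{_UNIT})(?![A-Za-z])"             # unit must not run into a word
)


def extract_values(sentence):
    """Return (low, high, unit) tuples; high is None for a point value."""
    results = []
    for match in VALUE_RE.finditer(sentence):
        high = float(match.group("high")) if match.group("high") else None
        results.append((float(match.group("low")), high, match.group("unit")))
    return results
```

Applied to a phrase such as "elongation of 55-65 %", the sketch yields one range with lower bound 55 and upper bound 65, which is the form the index later splits into low and high entries.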

[0033] In the preferred embodiment of the disclosure, the post-processing module (110) of the system (100) is configured to filter the extracted information components and resolve one or more ambiguities that arise during the extraction. In the post-processing, the post-processing module (110) uses the predefined dictionary (118), which comprises information on processes, process parameters, measurement units, value instances, material property and other constraints pertaining to material science. During extraction of information components from the documents, one or more ambiguities, such as ambiguous entity-value associations, may occur.

[0034] In an example, consider the sentence "The specimen is heated to 1800 °C and held for 4 min prior to water quenching it to room temperature at 100 °C/min". Applying rule 1 of the extraction module (108) for entity-value association relates the value 100 °C/min to the process quenching, which is incorrect since this value belongs to the parameter cooling rate. The predefined dictionary (118) contains information on the quenching process, its parameters and the measurement units. By consulting the dictionary, the system may correct the information and associate the value 100 with the parameter cooling rate. Similarly, value constraints are also helpful in resolving ambiguities; for example, the value for carbon by weight percentage in steels can vary only in the range of 0 to 2 percent.

[0035] The post-processing module (110) uses the following rules to associate values with the right entities:

1. Unit Constraint: Associate value V to an entity E only when the unit mentioned for V is in the allowable list of units of E

2. Value Constraint: When one or more entities satisfy rule 1, select only those entities for which value V satisfies the value range constraint.

3. If the ambiguity is still not resolved after applying rules 1 and 2, associate value V with all qualifying entities.

Rule 3 may result in some loss of extraction precision, since it associates a value with multiple entities. However, rule 3 improves the recall of the overall search system. At the same time, the loss of precision for the search system as a whole is minimal, since most queries have multiple value-based constraints. While a document might incorrectly match an individual constraint, the probability of a document incorrectly matching all constraints is rather low, and hence such a document is not likely to be retrieved.
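The unit constraint, value constraint and fallback rules above might be sketched as follows. The dictionary layout and its entries are assumptions made for illustration: only the 0 to 2 percent range for carbon in steels comes from the description, and the other units and ranges are hypothetical.

```python
# Hypothetical dictionary: entity -> allowed units and optional (min, max) range.
DICTIONARY = {
    "quenching": {"units": {"°C"}, "range": None},
    "cooling rate": {"units": {"°C/min", "°C/s"}, "range": None},
    "carbon": {"units": {"wt%"}, "range": (0.0, 2.0)},     # range from the description
    "chromium": {"units": {"wt%"}, "range": (0.0, 30.0)},  # illustrative range only
}


def resolve(value, unit, candidates, dictionary=DICTIONARY):
    """Return the candidate entities that value (with unit) may be associated with."""
    # Rule 1 (unit constraint): the unit must be in the entity's allowed units.
    by_unit = [e for e in candidates if unit in dictionary[e]["units"]]

    # Rule 2 (value constraint): the value must lie inside the entity's range.
    def in_range(entity):
        bounds = dictionary[entity]["range"]
        return bounds is None or bounds[0] <= value <= bounds[1]

    by_value = [e for e in by_unit if in_range(e)]
    # Rule 3: if several entities still qualify, all of them are kept.
    return by_value
```

With these entries, `resolve(100, "°C/min", ["quenching", "cooling rate"])` keeps only `"cooling rate"`, mirroring the quenching example, while `resolve(0.4, "wt%", ["carbon", "chromium"])` keeps both entities because neither constraint can separate them.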

[0036] In the preferred embodiment of the disclosure, the indexing module (112) of the system (100) is configured to generate an index table of the filtered information components based on each of the one or more information components, wherein the indexing marks the location of the set of documents in the distributed database (120). The indexing is performed to optimize speed and performance in finding relevant documents for the updated query. The index table holds an index for each information component, namely entity, value instance and entity-value relation. In addition to the index for each information component, the index table contains two additional indices for value ranges, entity_high and entity_low. The upper value of a value range is indexed under entity_high and the lower value under entity_low.
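A minimal sketch of such an index table is given below, assuming the extracted relations arrive as hypothetical `(doc_id, entity, low, high)` records in which `high` is `None` for a point value. The in-memory dictionary is only a stand-in for whatever index store an implementation would use.

```python
from collections import defaultdict


def build_index(records):
    """Build index name -> sorted list of (value, doc_id) postings.

    Point values are stored under the entity name. The bounds of a value
    range are stored under "<entity>_low" and "<entity>_high".
    """
    index = defaultdict(list)
    for doc_id, entity, low, high in records:
        if high is None:
            index[entity].append((low, doc_id))
        else:
            index[f"{entity}_low"].append((low, doc_id))
            index[f"{entity}_high"].append((high, doc_id))
    for postings in index.values():
        postings.sort()  # sorted postings make range scans straightforward
    return dict(index)
```

Keeping the two bounds in separate indices lets a query constrain the upper and lower end of a reported range independently.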

[0037] In the preferred embodiment of the disclosure, the query processor module (114) of the system (100) is configured to convert the user query into an updated query based on the one or more information components present in the index table. The system (100) provides a domain specific query language to support entity-value based search requests. The basic attribute of the updated query is an entity-value constraint. The language provides Boolean operators to build complex queries over such entity-value constraints. The value constraint may be either a point value constraint or a range value constraint. The range constraint is expressed either with lower and upper value bounds or with relational operators such as >, <, ≥ and ≤.
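The grammar of the domain specific query language is not given in the description, so the following sketch assumes a hypothetical surface syntax of the form `entity op value` or `entity = [low, high]` and shows how a single entity-value constraint could be normalised into an interval. The `Constraint` class and `parse_constraint` function are illustrative names.

```python
import math
import re
from dataclasses import dataclass

_NUM = r"-?(?:\d+(?:\.\d+)?|∞)"
_CONSTRAINT = re.compile(
    rf"^\s*(?P<entity>[A-Za-z_][\w ]*?)\s*(?P<op>>=|<=|≥|≤|>|<|=)\s*"
    rf"(?:\[\s*(?P<lo>{_NUM})\s*,\s*(?P<hi>{_NUM})\s*\]|(?P<val>{_NUM}))\s*$"
)


@dataclass(frozen=True)
class Constraint:
    entity: str
    low: float
    high: float
    low_inclusive: bool = True
    high_inclusive: bool = True


def _to_float(token):
    """Convert a numeric token, treating ∞ and -∞ as infinities."""
    if token.endswith("∞"):
        return -math.inf if token.startswith("-") else math.inf
    return float(token)


def parse_constraint(text):
    """Parse one entity-value constraint into a Constraint interval."""
    match = _CONSTRAINT.match(text)
    if match is None:
        raise ValueError(f"unrecognised constraint: {text!r}")
    entity, op = match.group("entity"), match.group("op")
    if match.group("lo") is not None:
        if op != "=":
            raise ValueError("a bracketed range must be used with '='")
        return Constraint(entity, _to_float(match.group("lo")), _to_float(match.group("hi")))
    value = _to_float(match.group("val"))
    if op == "=":
        return Constraint(entity, value, value)  # point value constraint
    if op in (">", ">=", "≥"):
        return Constraint(entity, value, math.inf, low_inclusive=(op != ">"))
    return Constraint(entity, -math.inf, value, high_inclusive=(op != "<"))
```

Under this assumed syntax, `hardness >= 50` becomes the interval from 50 to infinity on the entity `hardness`, so a document reporting 60 Rc can satisfy the constraint even though the literal value 50 never appears in it.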

[0038] In the preferred embodiment of the disclosure, the searching module (116) of the system (100) is configured to execute the updated query on the index table, wherein the executing comprises retrieving a set of result documents from the distributed database (120), wherein the result documents pertain to the user query. The searching module (116) takes the updated query and executes it on the index table. The index table holds each entity and each entity-value relation in its respective index. The updated query retrieves all the documents in which the one or more information components are present.

[0039] Examples of updated queries on the index are as follows:

1. When the updated query on the index is "elongation = [55, 65]", the searching module may retrieve all documents in which the entity is elongation and the point value of elongation is in the range of [55, 65].

2. When the updated query on the index is "elongation_high = [55, 65]", the searching module may retrieve all documents in which the entity is elongation and the value range of elongation has its higher bound in the range [55, 65].

3. When the updated query on the index is "elongation_low = [55, 65]", the searching module may retrieve all documents in which the entity is elongation and the value range of elongation has its lower bound in the range [55, 65].

4. When the updated query on the index is "elongation_high = [65, ∞]", the searching module may retrieve all documents in which the entity is elongation and the value range of elongation has its higher bound greater than 65; and

5. When the updated query on the index is "elongation_low = [-∞, 55]", the searching module may retrieve all documents in which the entity is elongation and the value range of elongation has its lower bound less than 55.
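The five example queries can be reproduced with a small lookup sketch. The index contents below are invented for illustration, bounds are treated as inclusive, and a linear scan is used for clarity where a real index would more likely use a sorted or tree-based structure.

```python
import math

# Hypothetical index contents: index name -> list of (value, doc_id).
INDEX = {
    "elongation": [(58.0, "doc1"), (70.0, "doc2")],
    "elongation_low": [(50.0, "doc3"), (60.0, "doc4")],
    "elongation_high": [(62.0, "doc3"), (80.0, "doc4")],
}


def lookup(index, name, low, high):
    """Return ids of documents whose value under index `name` lies in [low, high]."""
    return {doc_id for value, doc_id in index.get(name, []) if low <= value <= high}


point_hits = lookup(INDEX, "elongation", 55, 65)              # example 1
high_in_range = lookup(INDEX, "elongation_high", 55, 65)      # example 2
low_in_range = lookup(INDEX, "elongation_low", 55, 65)        # example 3
high_above = lookup(INDEX, "elongation_high", 65, math.inf)   # example 4
low_below = lookup(INDEX, "elongation_low", -math.inf, 55)    # example 5
```

In this toy index, `doc3` reports a range of 50 to 62 and `doc4` a range of 60 to 80, so the second query selects `doc3` and the third selects `doc4`.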

[0040] Referring to Figure 4, a method (300) is illustrated for retrieving a set of results from a distributed database (120) pertaining to a query, wherein the query comprises one or more attributes and constraints of material science.

[0041] At block 302, one or more information components of a set of documents are extracted using the extraction module (108). The set of documents is stored in the distributed database (120). The one or more information components comprise one or more entities, one or more value instances and one or more entity-value relations, wherein the information components are based on a predefined dictionary (118). The one or more entities comprise at least one of material composition, material property, process and parameter. The material composition represents a set of element name and value pairs. The list of chemical element names is a closed set which is stored in the predefined dictionary (118). The material property is again a closed set, also stored in the predefined dictionary (118). The material property includes a list of properties such as tensile strength, hardness, etc. Before extraction of the one or more information components, the extraction module (108) performs pre-processing over the text of each document. In the pre-processing, the system (100) converts one or more PDF files into textual form using a PDF to XML conversion tool. It would be appreciated that in this embodiment the PDF to XML conversion tool is Exegenix. Exegenix takes the one or more PDF files as input and applies tokenization and sentence splitting to convert the PDF text into a sequence of sentences. Further, the sequence of sentences is subjected to part-of-speech tagging (PoS tagging), stemming and dependency parsing. The PoS tagging assigns a part of speech to each token of a tokenized sentence. The parts of speech comprise noun, verb, adjective, cardinal number, etc. It would be appreciated that the system uses the maximum entropy algorithm for PoS tagging. The PoS tags are used as features for information extraction.

[0042] At block 304, the extracted information components are filtered to resolve entity-value association ambiguities that arise during the extraction, using the predefined dictionary (118). In the post-processing, the post-processing module (110) uses the predefined dictionary (118), which comprises information on processes, process parameters, measurement units, value instances, material property and other constraints pertaining to material science.

[0043] At block 306, the filtered information is indexed into an index table based on each of the one or more information components using the indexing module (112), wherein the indexing marks the location of the information components in the documents stored in the distributed database (120). The indexing is performed to optimize speed and performance in finding relevant documents for the updated query. The index table holds an index for each information component, namely entity, value instance and entity-value relation. In addition to the index for each information component, the index table contains two additional indices for value ranges, entity_high and entity_low. The upper value of a value range is indexed under entity_high and the lower value under entity_low.

[0044] At block 308, the user query is converted into an updated query based on one or more information components present in the index table, wherein converting comprises parsing the user query. The system (100) provides a domain specific query language to support entity-value based search requests. Therefore the converted query is an updated query in the domain specific query language. The basic attribute of the updated query is an entity-value constraint. The language provides Boolean operators to build complex queries over such entity-value constraints. The value constraint may be either a point value constraint or a range value constraint. The range constraint is expressed either with lower and upper value bounds or with relational operators such as >, <, ≥ and ≤.

[0045] At block 310, the updated query is executed on the index table, wherein the executing comprises searching the index tables, wherein searching comprises retrieving a set of result documents from the distributed database (120) and combining the retrieved documents using set theoretic operations to produce result documents pertaining to the user query.
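The combination of per-constraint results using set theoretic operations might look like the sketch below. It assumes that each constraint has already been resolved to a set of document identifiers, and that the Boolean operators of the query language map to intersection (and), union (or) and complement against a known document universe (not). The nested-tuple expression format is hypothetical.

```python
def evaluate(expr, hits, universe):
    """Evaluate a Boolean expression over per-constraint result sets.

    expr is either a constraint label (str) or a tuple of the form
    ("and", a, b), ("or", a, b) or ("not", a).
    hits maps constraint labels to sets of document ids.
    universe is the set of all indexed document ids, needed for "not".
    """
    if isinstance(expr, str):
        return hits.get(expr, set())
    op, *args = expr
    if op == "and":
        return evaluate(args[0], hits, universe) & evaluate(args[1], hits, universe)
    if op == "or":
        return evaluate(args[0], hits, universe) | evaluate(args[1], hits, universe)
    if op == "not":
        return universe - evaluate(args[0], hits, universe)
    raise ValueError(f"unknown operator: {op!r}")
```

For instance, with `hits = {"hardness>=50": {"d1", "d2"}, "carbon<=0.8": {"d2", "d3"}}`, the expression `("and", "hardness>=50", "carbon<=0.8")` yields `{"d2"}`. This is why a document that wrongly matches one constraint is still unlikely to appear in the final result.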

[0046] The preceding description has been presented with reference to various embodiments. Persons having ordinary skill in the art and technology to which this application pertains will appreciate that alterations and changes in the described structures and methods of operation can be practiced without meaningfully departing from the principle, spirit and scope.

Claims

WE CLAIM

1. A system for retrieving a set of results from a distributed database pertaining to a query given by a user, the system comprising:

a user interface;

a memory;

a processor communicatively coupled with the memory, the processor comprising:

an extraction module configured to extract one or more information components of a set of documents, wherein the set of documents are stored in the distributed database, the information components comprise at least one of, one or more entities, one or more value instances and one or more entity-value association, wherein the one or more entities are based on a predefined dictionary;

a post-processing module configured to filter the extracted one or more information components and resolve one or more entity-value association ambiguities that arise during the extraction using the predefined dictionary;

an indexing module configured to generate an index table of the filtered information components based on each of the one or more information components, wherein the indexing marks the location of the set of documents in the distributed database;

a query processor module configured to convert the user query into an updated query based on the one or more information components present in the index table; and

a searching module configured to execute the updated query on the index table, wherein the executing comprises retrieving a set of result documents from the distributed database, wherein the result documents pertain to the user query.

2. The system claimed in claim 1, wherein the query comprises constraint of one or more value instances in association with the one or more entities.

3. The system claimed in claim 1, wherein the one or more entities include attributes of material science, wherein the attributes include at least one material composition, material property, and parameter and processes for known material category.

4. The system claimed in claim 1, wherein the one or more value instances include occurrence of numbers and range values followed by a measurement unit.

5. The system claimed in claim 1, wherein the predefined dictionary comprises the information of one or more processes, process parameters, measurement units, range value constraints pertaining to the material science.

6. The system claimed in claim 1, wherein the index table comprises the one or more entity value associations for a point value constraint and a low value and high value of the value range constraint.

7. A computer implemented method to retrieve a set of results from a distributed database pertaining to a query given by a user, the method comprising:

extracting one or more information components of a set of documents, wherein the set of documents are stored in the distributed database, further wherein the information components comprise at least one of one or more entities, one or more value instances and one or more entity-value associations, wherein the one or more entities are based on a pre-defined dictionary;

filtering the extracted information components to resolve entity-value association ambiguities that arise during the extraction using predefined dictionary;

indexing the filtered information into an index table based on each of the one or more information components using an indexing module, wherein the indexing marks the location of the documents stored in the distributed database;

converting the user query into an updated query based on one or more information components present in the index table, wherein converting comprises parsing the user query; and

executing the updated query on the index table, wherein the executing comprises of searching the index tables wherein searching comprises retrieving a set of result documents from the distributed database and combining the retrieved documents using set theoretic operations to produce result documents pertaining to the user query.

8. The method claimed in claim 7, wherein the query comprises constraint of one or more value instances in association with the one or more entities.

9. The method claimed in claim 7, wherein the one or more entities include attributes of material science, wherein the attributes include at least one material composition, material property, and parameter and processes for known material category.

10. The method claimed in claim 7, wherein the one or more value instances include occurrence of numbers and range values followed by a measurement unit.

11. The method claimed in claim 7, wherein the predefined dictionary comprises the information of one or more processes, process parameters, measurement units, range value constraints pertaining to the material science.

12. The method claimed in claim 7, wherein the index table comprises the one or more entity value associations for a point value constraint and a low value and high value of the value range constraint.

13. A computer readable medium storing instructions for executing a method performed by a computer processor, the method comprising:

filtering the extracted information components to resolve entity-value association ambiguities that arise during the extraction using predefined dictionary;

indexing the filtered information into an index table based on each of the one or more information components using an indexing module, wherein the indexing marks the location of the documents stored in the distributed database;