WO2001040966A2 - Database indexing system and method for managing diverse document types - Google Patents

Database indexing system and method for managing diverse document types Download PDF

Info

Publication number
WO2001040966A2
WO2001040966A2 PCT/EP2000/011791 EP0011791W WO0140966A2 WO 2001040966 A2 WO2001040966 A2 WO 2001040966A2 EP 0011791 W EP0011791 W EP 0011791W WO 0140966 A2 WO0140966 A2 WO 0140966A2
Authority
WO
WIPO (PCT)
Prior art keywords
document
index
value
date
database
Prior art date
Application number
PCT/EP2000/011791
Other languages
French (fr)
Other versions
WO2001040966A3 (en
Inventor
Ingo Elfering
Julian Reschke
Original Assignee
Medical Data Services Gmbh
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Medical Data Services Gmbh filed Critical Medical Data Services Gmbh
Publication of WO2001040966A2 publication Critical patent/WO2001040966A2/en
Publication of WO2001040966A3 publication Critical patent/WO2001040966A3/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84Mapping; Conversion
    • G06F16/86Mapping to a database

Definitions

  • the present invention relates to computer-based storage system and intelligent retrieval of documents from that system.
  • the present invention relates to systems, methods and computer program products for document storage and intelligent retrieval by computers.
  • the present invention find particular, but not exclusive, application to the healthcare industry.
  • a problem frequently encountered in healthcare systems is that medical information is generated in a variety of formats and styles from a multitude of data sources, such as physicians' records, clinical labs, insurance providers, health care authorities and ultimately patients themselves. For all involved parties it is most valuable to access and analyze this data electronically for different purposes.
  • a patient for example, may wish to be able to not only view his/her medical record, but access all prescription data or all data relating to a particular disease.
  • Document volume The number of documents and the amount of data is huge. Manual/human processing is not cost or time efficient. This calls for processing by electronic means, leveraging up-to-date information technology.
  • Document formats Building on the previous paragraph, prescription data is, like other documents, generated in different formats, often in a semi-structured way. Handling of all this variety in the analyzing software is considered not feasible. Algorithms would need to be highly complex and would require continuous updating due to changes in the document formats. This is not a good prerequisite for handling sensitive medical information.
  • Document integrity Documents have to be stored unaltered. There are several reasons for this. Preserving digital signatures is just one of them.
  • the present invention is a method to overcome all the above obstacles. It allows the storage of the unaltered documents in modern, high- volume database systems (obstacles 1 and 4). It enhances the stored documents with configurable and extendable index information (obstacle 3). At the same time it provides a single, flexible query interface to the analyzing software which is independent of the actual document format (obstacle 2). This way the complexity of the analyzing method is reduced and the maintainability is enhanced. Summary of the Invention
  • this invention relates to an indexing system for a database in single computer or in a distributed computational system, the index comprises:
  • a document-part which is a pattern which is matched against the document and for each match, generating a value of the index and storing it in the database;
  • a second iteration of the invention comprises a method for indexing and retrieving documents from an electronic database wherein the database is to contain or contains multiple document types having multiple measures and measurement units wherein the method comprises: 1. creating or generating a name uniquely identifying an index for a specific document format;
  • Figure 1 is a flowchart of a virtual indexing system. Description of the Invention
  • the core of the present invention lies in the document indexing mechanism.
  • an index can be defined for multiple document types.
  • An example of a query would be "return all prescription dates where drug is equal to aspirin".
  • drug and date would be the names of an indices which are defined for all prescription documents.
  • the index has to cope with different document formats. Additionally, the index has to work with documents that contain more than one prescription. This raises the issue of index correlation. Given a document with multiple prescriptions, this document will have multiple "drug" and "date" indices. When performing the above-mentioned query, the correct date for the specified drug needs to be returned (as opposed to returning the wrong date, e.g. the date of another drugs prescription).
  • the present invention defines an index with the following properties:
  • Document-part a pattern which is matched against the document. For each match, a value of the index is generated and stored in the database.
  • Block-ID an identifier that is unique inside a specific document. Block-IDs are used to correlate index pairs. The "date" and “drug” index for the same prescription would have the same block-ID.
  • Properties 1 to 4 are configured in the system at setup time. Properties 5 to 7 are generated by the system for each stored document. The high-level workflow for storing and indexing documents is specified in Figure 1.
  • the implementation manages documents in XML format (http://www.w3.org/TR/1998/REC-xml- 19980210). "Document- part", tags and block-IDs are specified in the Xpath language (http://www.w3.org/TR xpath).
  • the implementation uses Microsoft SQL Server as the database, while the software itself is written in VisualBasic.
  • the implementation has been set to use other vendors' database as well (Oracle, Informix, Sybase, etc.). Also the implementation language itself could also use C, C++, Java or any other computer language.
  • DAL Data Access Layer
  • the implementation consists basically of two parts. One is the database layout and the other is the software implementing the algorithms for storage and retrieval.
  • the database layout is the basis for the higher functionality. As already mentioned the implementation uses Microsoft SQL Server as the preferred relational database engine.
  • the implementation's database contains three standard tables named vConfig, vScripts and vDocTypes. The layout of the tables is as follows: vConfig table
  • ID unique id for a vConfig table entry. Generated by the database upon insertion of a new row in vConfig.
  • virtName name of a virtual index as can be used in the query language.
  • Pattern Xpath expression used to match parts of the XML document.
  • postProc reference to the vScripts table. postProc contains the unique ID of a vScripts row.
  • targetType type of the result of the script, can be string, date or double.
  • index flag if the index value is calculated on storage or on retrieval of the document.
  • required flag if a value for the index is required or not.
  • the vConfig table holds all information about virtual indices. By reading the information from vConfig, the DAL knows which indices there are, for which document types they apply and where the script for generating an index value can be found.
  • the field targetType determines how the value of a generated virtual index should be treated. Indices can be strings (text), date (date and time) or double (double precision floating point number). vScripts table
  • ID unique id for a vScripts entry. Generated by the database upon insertion of a new row in vScripts.
  • Name name of the script which may be used in the query language (see VQL flag).
  • Script the source code of the script itself, written in a specific computer language (see Language field).
  • Language computer language the script is written in (in this implementation must be a language supporting Microsoft ScriptControl Interface - currently available are VBScript, JScript and PerlScript).
  • VQL flag if the script can also be used in the query language.
  • vScripts table is an addendum to the vConfig table. All fields could also be placed in the vConfig table itself, however since several indices might use the same script, it is more efficient to organize things this way. vScripts not only gives the source code of the script itself, but also tells in which language the script must be executed. If a script needs to be executed, the DAL can start a new Microsoft ScriptControl object with the given language, load the source code to the object and execute the script. vDocTypes table
  • ID unique id for a vDocTypes entry. Generated by the database upon insertion of a new row in vDocTypes.
  • DocTypeGroup grouping of document types under logical name.
  • TargetTable name of the database table where the actual documents will be stored.
  • DocTypes defines all known document types which are handled by the data access layer. If a document of unknown type is to be stored by the DAL, a new entry in vDocTypes will be created. Thus, the DAL is not limited to storage of a predefined set of document types.
  • TargetTable field which determines in which table of the database documents of this type will be stored (and virtual index values be kept).
  • Prescription a of Format A could be:
  • ID unique id for a VPrescription entry. Generated by the database upon insertion of a new row.
  • VPrescriptionlndex 1 ID: unique id for a VPrescriptionlndex entry. Generated by the database upon insertion of a new row.
  • configID ID of the entry in vConfig which created the entry.
  • recordlD ID of the document in VPrescription for which this entry is intended.
  • valueDouble field which holds floating point values.
  • blockID string defining a block in the document.
  • taglD xpath expression which identifies the XML element of the document which was matched to create this entry.
  • the data access layer is the software operating on the database. It provides an interface with which the application can store and retrieve XML documents.
  • the data access layer provides an interface with which the application can store and retrieve XML documents.
  • two functions are of particular interest: putDocument and findDocument.
  • the putDocument just needs the document itself to store a new document.
  • the VisualBasic implementation follows the algorithm as described in Figure 1. The details of what happens in the individual parts of the algorithm are described below.
  • Document Retrieval explains how the findDocument method is used to retrieve the stored, indexed information again. In particular it is described how the query is translated into SQL code, which is then executed by the database.
  • Determine Document Format examines the documents if any namespace references or document type definitions (DTD in XML lingo) are present. If so, it uses the name of the namespace or DTD. Otherwise it uses the name of the root tag as the name of the document type. In this example, for documents a and b, no namespace or DTDs are present, so the document types "A" and "B" are taken from the root tag.
  • DTD namespace references or document type definitions
  • the DAL consults the vDocTypes for an entry whose "DocType" columns contains the needed document type.
  • the TargetTable defines in which database table the document is to be stored. If the document type is not found in the vDocTypes table, a new entry is created which uses a DefaultTable as TargetTable. In this example, the DAL will find "VPrescription" as TargetTable for both documents a and b.
  • the VPrescription table when storing documents a and b, the VPrescription table would look like that of Table 7: Table 7 vPrescription.ID DocType Value
  • the DAL now looks up all entries in vConfig which have the stored document type in the DocType column. For each entry found, the DAL performs then the other steps. If no entries in vConfig are found, the DAL returns control to the application.
  • entries 1 and 3 are found in vConfig for documents of type A (entries 2 and 4 for type B). "Match document-part of Index" - the XPath expression of an index is matched against the stored XML document. This matching returns a list of 0, 1 or more nodes (XML lingo). With 0 nodes, no match was found and the work for this index is done (no entries in VPrescriptionlndex will be written).
  • the PostProc entry for the index is the unique ID of an entry in vScripts. This entry is loaded and the DAL initiates a Microsoft
  • ScriptControl object The code for the script is loaded into the ScriptControl for the specified language and executed. The matched node and its text property are given as input to the script. The script returns the index value and the block ID.
  • the DAL then creates a new entry in VPrescriptionsIndex with the following values: configlD with the ID of the current index, recordID with the ID of the document stored in VPrescription, valueXXX with the value from the script (date/string/double depending on the TargetType of the index), blocklD as returned from the script and, finally, taglD as the xpath identifier for the matched document node currently processed. If all nodes are processed, the next index is processed and matched against the document. When all indices are done, control is returned to the caller.
  • the data access layer accepts an SQL-like query language which is by itself an XML document.
  • the exact syntax of the language is outside the scope of this document.
  • the DAL takes such a document and transforms it into a real SQL query for the database.
  • the values returned from the database experience some format conversion by the DAL and are then passed on to the user of the DAL.

Abstract

A method for indexing in an electronic database document which have different formats and which contain diverse measures and diverse measurements between and amongst the documents.

Description

Database Indexing System and Method for Managing Diverse Document Types Area of the Invention
The present invention relates to computer-based storage system and intelligent retrieval of documents from that system. In particular, the present invention relates to systems, methods and computer program products for document storage and intelligent retrieval by computers. Background of the Invention
The present invention find particular, but not exclusive, application to the healthcare industry. A problem frequently encountered in healthcare systems is that medical information is generated in a variety of formats and styles from a multitude of data sources, such as physicians' records, clinical labs, insurance providers, health care authorities and ultimately patients themselves. For all involved parties it is most valuable to access and analyze this data electronically for different purposes. A patient, for example, may wish to be able to not only view his/her medical record, but access all prescription data or all data relating to a particular disease.
Access to medical documents has to overcome four main obstacles: Document volume: The number of documents and the amount of data is huge. Manual/human processing is not cost or time efficient. This calls for processing by electronic means, leveraging up-to-date information technology. Document formats: Building on the previous paragraph, prescription data is, like other documents, generated in different formats, often in a semi-structured way. Handling of all this variety in the analyzing software is considered not feasible. Algorithms would need to be highly complex and would require continuous updating due to changes in the document formats. This is not a good prerequisite for handling sensitive medical information.
Document contexts: For example, measurement units like weight or temperature need to be unified to a common scale for analyses to make sense. The information on which units are used (pounds or kilograms) is not necessarily contained in the document itself. It might depend on the data source/person who generated the document.
Document integrity: Documents have to be stored unaltered. There are several reasons for this. Preserving digital signatures is just one of them.
The present invention is a method to overcome all the above obstacles. It allows the storage of the unaltered documents in modern, high- volume database systems (obstacles 1 and 4). It enhances the stored documents with configurable and extendable index information (obstacle 3). At the same time it provides a single, flexible query interface to the analyzing software which is independent of the actual document format (obstacle 2). This way the complexity of the analyzing method is reduced and the maintainability is enhanced. Summary of the Invention
In a first embodiment, this invention relates to an indexing system for a database in single computer or in a distributed computational system, the index comprises:
1. a name uniquely identifying an index for a specific document format;
2. a name of the format to which this index applies;
3. a document-part which is a pattern which is matched against the document and for each match, generating a value of the index and storing it in the database;
4. a value-conversion statement to convert the matched document-part to the index value;
5. a value for an index for a specific matched part of a document; 6. a tag which is a pattern identifying the part of the document which was matched in order to produce the value; 7. a block-ID that is an identifier which is unique inside a specific document; wherein properties 1 to 4 are configured in the system at setup time and properties 5 to 7 are generated by the system for each stored document.
In a second iteration of the invention, it comprises a method for indexing and retrieving documents from an electronic database wherein the database is to contain or contains multiple document types having multiple measures and measurement units wherein the method comprises: 1. creating or generating a name uniquely identifying an index for a specific document format;
2. creating or generating a name of the format to which this index applies;
3. defining a document-part which is a pattern which is matched against the document and for each match, generating a value of the index and storing it in the database;
4. creating or generating a value-conversion statement to convert the matched document-part to the index value;
5. generating a value for an index for a specific matched part of a document;
6. generating a tag which is a pattern identifying the part of the document which was matched in order to produce the value;
7. generating a block-ID that is an identifier which is unique inside a specific document; configuring properties 1 to 4 in the system at setup time and causing the system to generate properties 5 to 7 by the system for each stored document. Description of the Figures
Figure 1 is a flowchart of a virtual indexing system. Description of the Invention
The core of the present invention lies in the document indexing mechanism. For example, in order to allow querying of all prescription documents an index can be defined for multiple document types. An example of a query would be "return all prescription dates where drug is equal to aspirin". In this example "drug" and "date" would be the names of an indices which are defined for all prescription documents.
For this to work, the index has to cope with different document formats. Additionally, the index has to work with documents that contain more than one prescription. This raises the issue of index correlation. Given a document with multiple prescriptions, this document will have multiple "drug" and "date" indices. When performing the above-mentioned query, the correct date for the specified drug needs to be returned (as opposed to returning the wrong date, e.g. the date of another drugs prescription).
The present invention defines an index with the following properties:
1. Name: ("date" or "drug" in the given example) uniquely identifying an index for a specific document format.
2. Format: name of the format to which this index applies.
3. Document-part: a pattern which is matched against the document. For each match, a value of the index is generated and stored in the database.
4. Value-conversion: statements to convert the matched document part to the index value.
5. Value: the value of an index for a specific (matched part of a) document. The "drug" index would have the value "aspirin" in the given example.
6. Tag: a pattern identifying the part of the document, which was matched in order to produce the value. 7. Block-ID: an identifier that is unique inside a specific document. Block-IDs are used to correlate index pairs. The "date" and "drug" index for the same prescription would have the same block-ID. 8. Properties 1 to 4 are configured in the system at setup time. Properties 5 to 7 are generated by the system for each stored document. The high-level workflow for storing and indexing documents is specified in Figure 1.
As an example consider the following two prescriptions which are of different format. Prescription a of Format A looks like this: date: 12/24/1999 drug: aspirin
while prescription b of Format B could look like this:
prescribe( 1999- 12-06, nicorette); prescribe(2000-01-01, aspirin).
When configuring the system, the following index information would be set up
(Table 1).
Table 1
Figure imgf000005_0001
Value-conversion of "none" signifies that the selected parts of the document are copied into the value field for the index (as shown below). "American-to-ISO8601" means a date format conversion from American notation to the ISO standard. This is an example where an index is used for generating information not imminently described in the document. When storing documents a and b, the following index information would be generated (note the date format conversion for the date index of document a) (Table 2):
Table 2
Row Document Index Value
1 a date 1999-12-24 date:*[l] a[l]
2 a drug aspirin drug:*[l] a[l]
3 b date 1999-12-06 prescription(*,x)[l] b[l]
4 b date 2000-01-01 prescription(*,x)[2] b[2]
5 b drug nicorette prescription(x,*)[l] b[l]
6 b drug aspirin prescription^, * ) [2] b[2] Going back to the original example, the query to process was:
"Return all prescription dates where drug is equal to aspirin". The algorithm for this type of query works as follows: First, eliminate all rows in the table which have "drug" as index and not "aspirin" as value. This gives Table 3:
Table 3
Row Document Index Value ■;lBW»ltl
1 a date 1999-12-24 date:*[l] a[l]
2 a drug aspirin drug:*[l] a[l]
3 b date 1999-12-06 prescription^ ,x)[ 1 ] b[l]
4 b date 2000-01-01 prescription *,x) [2] b[2]
6 b drug aspirin prescription^, *)[2] b[2]
Second, eliminate all rows which have "date" as index and which block-ID does not exist for a row with "drug" as index. This reduces the table to that of Table 4:
Table 4
Row Document Index Value tii m 111
1 a date 1999-12-24 date:*[l] a[l]
2 a drug aspirin drug:*[l] a[l]
4 b date 2000-01-01 prescription(*,x)[2] b[2]
6 b drug aspirin prescription(x,*)[2] b[2]
Third, since only dates should be returned, eliminate all rows, whose index is not "date" (Table 5):
Table 5
Row Document Index Value l_________ jl a date 1999-12-24 date:*[l] a[l] b date 2000-01-01 prescription(*,x)[2] b[2]
Finally, return the values of the remaining rows: "1999-12-24" and "2000-01-01", which are the dates when aspirin was prescribed.
-3- Note that the above notations for "document-part" patterns, tags and block- IDs are used for demonstration purposes only. The present invention was implemented using a totally different set of languages as will be described later on.
Implementation Description
An implementation of the present invention exists. The implementation manages documents in XML format (http://www.w3.org/TR/1998/REC-xml- 19980210). "Document- part", tags and block-IDs are specified in the Xpath language (http://www.w3.org/TR xpath). Furthermore, the implementation uses Microsoft SQL Server as the database, while the software itself is written in VisualBasic. The implementation has been set to use other vendors' database as well (Oracle, Informix, Sybase, etc.). Also the implementation language itself could also use C, C++, Java or any other computer language.
The implementation, called "Data Access Layer" or DAL further on, consists basically of two parts. One is the database layout and the other is the software implementing the algorithms for storage and retrieval.
Database Layout
The database layout is the basis for the higher functionality. As already mentioned the implementation uses Microsoft SQL Server as the preferred relational database engine. The implementation's database contains three standard tables named vConfig, vScripts and vDocTypes. The layout of the tables is as follows: vConfig table
1. ID: unique id for a vConfig table entry. Generated by the database upon insertion of a new row in vConfig.
2. virtName: name of a virtual index as can be used in the query language.
3. docType: document type this index is used on; holds the ID of an entry in the vDocTypes table.
4. Pattern: Xpath expression used to match parts of the XML document. 5. postProc: reference to the vScripts table. postProc contains the unique ID of a vScripts row.
6. targetType: type of the result of the script, can be string, date or double.
7. indexed: flag if the index value is calculated on storage or on retrieval of the document. 8. required: flag if a value for the index is required or not.
9. unique: flag if the value for the index needs to be unique among all documents of this type. The vConfig table holds all information about virtual indices. By reading the information from vConfig, the DAL knows which indices there are, for which document types they apply and where the script for generating an index value can be found. The field targetType determines how the value of a generated virtual index should be treated. Indices can be strings (text), date (date and time) or double (double precision floating point number). vScripts table
1. ID: unique id for a vScripts entry. Generated by the database upon insertion of a new row in vScripts.
2. Name: name of the script which may be used in the query language (see VQL flag).
3. Script: the source code of the script itself, written in a specific computer language (see Language field). 4. Language: computer language the script is written in (in this implementation must be a language supporting Microsoft ScriptControl Interface - currently available are VBScript, JScript and PerlScript). 5. VQL: flag if the script can also be used in the query language.
The vScripts table is an addendum to the vConfig table. All fields could also be placed in the vConfig table itself, however since several indices might use the same script, it is more efficient to organize things this way. vScripts not only gives the source code of the script itself, but also tells in which language the script must be executed. If a script needs to be executed, the DAL can start a new Microsoft ScriptControl object with the given language, load the source code to the object and execute the script. vDocTypes table
1. ID: unique id for a vDocTypes entry. Generated by the database upon insertion of a new row in vDocTypes.
2. DocType: name of the document type.
3. DocTypeGroup: grouping of document types under logical name. 4. TargetTable: name of the database table where the actual documents will be stored.
DocTypes defines all known document types which are handled by the data access layer. If a document of unknown type is to be stored by the DAL, a new entry in vDocTypes will be created. Thus, the DAL is not limited to storage of a predefined set of document types.
Most important is the TargetTable field which determines in which table of the database documents of this type will be stored (and virtual index values be kept). Sample Document Tables
Returning to the prescription example set out above, the documents to be stored would be valid XML documents as follows. Prescription a of Format A could be:
<A>
<date> 12/24/ 1999</date>
<drug>aspirin</drug>
</A>
while prescription b of Format B could look like this:
<B>
<drug date=" 1999- 12-06">nicorette</drug>
<drug date="2000-01-01 ">aspirin</drug> </B>
In order to handle XML documents of format A and B the following entries are created in the database (TablesβA, 6B, and 6C)
Table 6A vDocTypes.ID DocTyp DocTypeGroup TargetTable
A Prescription Vprescription B Prescription Vprescription
Table 6B vScripts.ID Name Script Language VQL
1 Adate VBScript false
2 Adrug VBScript false 3 Bdate VBScript false 4 Bdrug VBScript false
Table 6C vConfig.ID virtName DocType Pattern PostProc TargetType indexed
1 date A //date 1 date true
2 date B //drug/@date 3 date true 3 drug A //drug 2 string true 4 drug B //drug 4 string true
In addition to this, the tables for storing the documents and the indices need to be created as follows:
VPrescription
1. ID: unique id for a VPrescription entry. Generated by the database upon insertion of a new row.
2. DocType: name of the document type.
_>. Value: the document itself.
VPrescriptionlndex 1. ID: unique id for a VPrescriptionlndex entry. Generated by the database upon insertion of a new row.
2. configID: ID of the entry in vConfig which created the entry.
3. recordlD: ID of the document in VPrescription for which this entry is intended. 4. valueDouble: field which holds floating point values.
5. valueDate: field which holds date values.
6. valueString: field which holds string values.
7. blockID: string defining a block in the document.
8. taglD: xpath expression which identifies the XML element of the document which was matched to create this entry. Data Access Layer
The data access layer is the software operating on the database. It provides an interface with which the application can store and retrieve XML documents. For the description of virtual indexing, two functions are of particular interest: putDocument and findDocument.
The putDocument just needs the document itself to store a new document. The VisualBasic implementation follows the algorithm as described in Figure 1. The details of what happens in the individual parts of the algorithm are described below.
The section "Document Retrieval" explains how the findDocument method is used to retrieve the stored, indexed information again. In particular it is described how the query is translated into SQL code, which is then executed by the database.
Document Storage
A detailed description of how the individual steps of the algorithm in Figure 1 are implemented is given below.
"Determine Document Format" - the present implementation examines the documents if any namespace references or document type definitions (DTD in XML lingo) are present. If so, it uses the name of the namespace or DTD. Otherwise it uses the name of the root tag as the name of the document type. In this example, for documents a and b, no namespace or DTDs are present, so the document types "A" and "B" are taken from the root tag.
The DAL consults the vDocTypes for an entry whose "DocType" columns contains the needed document type. The TargetTable then defines in which database table the document is to be stored. If the document type is not found in the vDocTypes table, a new entry is created which uses a DefaultTable as TargetTable. In this example, the DAL will find "VPrescription" as TargetTable for both documents a and b.
"Store Document" - the DAL will create an SQL statement to insert the document as new entry into the TargetTable. The database is set up to give this new entry a unique ID which is returned to the DAL. Lets call this document.ID for now.
In this example, when storing documents a and b, the VPrescription table would look like that of Table 7: Table 7 vPrescription.ID DocType Value
< A><date> 12/24/ 1999</date>
<drug>aspirin</drug></A>
B <B><drug date=" 1999- 12-06">nicorette</drug>
<drug date="2000-01 -01 ">aspirin</drug>< B>
"Find All Indices..." - the DAL now looks up all entries in vConfig which have the stored document type in the DocType column. For each entry found, the DAL performs then the other steps. If no entries in vConfig are found, the DAL returns control to the application.
In this example, entries 1 and 3 are found in vConfig for documents of type A (entries 2 and 4 for type B). "Match document-part of Index" - the XPath expression of an index is matched against the stored XML document. This matching returns a list of 0, 1 or more nodes (XML lingo). With 0 nodes, no match was found and the work for this index is done (no entries in VPrescriptionlndex will be written).
For 1 or more nodes found, the PostProc entry for the index is the unique ID of an entry in vScripts. This entry is loaded and the DAL initiates a Microsoft
ScriptControl object. The code for the script is loaded into the ScriptControl for the specified language and executed. The matched node and its text property are given as input to the script. The script returns the index value and the block ID. The DAL then creates a new entry in VPrescriptionsIndex with the following values: configlD with the ID of the current index, recordID with the ID of the document stored in VPrescription, valueXXX with the value from the script (date/string/double depending on the TargetType of the index), blocklD as returned from the script and, finally, taglD as the xpath identifier for the matched document node currently processed. If all nodes are processed, the next index is processed and matched against the document. When all indices are done, control is returned to the caller.
In this example, when storing document "a", the indices "date" and "drug" for document type A need to be processed. The pattern for "date" matches all elements in a document with name "date". The script is called with the XML fragment "<date>12/24/1999</date>". This script knows that it has to do date conversion on its input and returns "1999-12-24" as date and the string "[!]" as block ID. After having processed all indices for document a and b, the table VPrescriptionlndex looks like Table 8:
Table 8
ID configl recordl valueDate valueString taglD blocklD
1 1999-12-24 //date[l] [1] 1 aspirin //drug[l] [1]
2 1999-12-06 //drug[l]/@da [1] te
2000-01-01 //drug[2]/@da [2] te
5 4 2 nicorette //drug[l] [1] 6 4 2 aspirin //drug[2] [2]
Document Retrieval
The data access layer accepts an SQL-like query language which is by itself an XML document. The exact syntax of the language is outside the scope of this document. However, the DAL takes such a document and transforms it into a real SQL query for the database. The values returned from the database experience some format conversion by the DAL and are then passed on to the user of the DAL.
What is of interest here is to show how a quite general query is translated into valid SQL for the database. We go back to our example: select date from Prescription where drug = "aspirin" The phrase "from Prescription" defines where to start looking. "Prescription" is a document type group. Looking into vDocTypes, the DAL sees that document type "A" and "B" belong to this group. Furthermore it sees that both A and B documents are stored in the database table "VPrescription". From that the DAL knows that all index information to VPrescription is to be found in the table "VPrescriptionlndex". The intermediate SQL statement now looks like this:
SELECT [TBD] FROM VPrescription d, VPrescriptionlndex i WHERE d.ID = i.recordID [TBD].
"[TBD]" meaning placeholders in the SQL statements which need further work.
Looking into the phrase "select date", the DAL can deduce that "date" is an index whose value should be returned. Looking for "date" in vConfig, the DAL finds 2 indices called like that for prescription document types. Both have their value in the "valueDate" field of the index table (it is an error if they differ), so the SQL statements becomes: SELECT i. valueDate FROM VPrescription d, VPrescriptionlndex i WHERE d.ID = i.recordID [TBD] Now going for the last part of the query: "where drug = "aspirin"". The DAL again deduces that "drug" is the name of an index. Looking at vConfig for indices of document group Prescriptions it finds the entries 3 and 4. Both have values of type string. The SQL statement becomes:
SELECT [TBD] FROM VPrescription d, VPrescriptionlndex i WHERE d.ID = i.recordID [TBD]
"[TBD]" meaning placeholders in the SQL statements which need further work. Looking into the phrase "select date", the DAL can deduce that "date" is an index whose value should be returned. Looking for "date" in vConfig, the DAL finds 2 indices called like that for prescription document types. Both have their value in the "valueDate" field of the index table (it is an error if they differ), so the SQL statements becomes:
SELECT i.valueDate FROM VPrescription d, VPrescriptionlndex i WHERE AID = i.recordID [TBD]
Now going for the last part of the query: "where drug = "aspirin"". The DAL again deduces that "drug" is the name of an index. Looking at vConfig for indices of document group Prescriptions it finds the entries 3 and 4. Both have values of type string. The SQL statement becomes:
SELECT i.valueDate FROM VPrescription d, VPrescriptionlndex i.
VPrescriptionlndex i2 WHERE d.ID = i.recordID AND i.recordID = i2.recordID AND i.blockID = i2.blockID
AND _2.valueString = 'aspirin'
Now the SQL statement is ready to be sent against the database, which will return 1999-12-24 and 2000-01-01. The foregoing exemplification is provided to illustrate the invention. This exemplification is not intended to limit what is reserved to the inventors hereunder.

Claims

We claim:
1. An indexing system for a database in single computer or in a distributed computational system, it comprising: 1 • a name uniquely identifying an index for a specific document format;
2. a name of the format to which this index applies;
3. a document-part which is a pattern which is matched against the document and for each match, generating a value of the index and storing it in the database; 4. a value-conversion statement to convert the matched document-part to the index value;
5. a value for an index for a specific matched part of a document;
6. a tag which is a pattern identifying the part of the document which was matched in order to produce the value; 7. a block-ID that is an identifier which is unique inside a specific document; wherein properties 1 to 4 are configured in the system at setup time and properties 5 to 7 are generated by the system for each stored document.
2. A method for indexing and retieving documents from an electronic database wherein the database is to contain or contains multiple document types having multiple measures and measurement units wherein the method comprises: 1. creating or generating a name uniquely identifying an index for a specific document format; 2. creating or generating a name of the format to which this index applies;
3. defining a document-part which is a pattern which is matched against the document and for each match, generating a value of the index and storing it in the database;
4. creating or generating a value-conversion statement to convert the matched document-part to the index value;
5. generating a value for an index for a specific matched part of a document;
6. generating a tag which is a pattern identifying the part of the document which was matched in order to produce the value;
7. generating a block-ID that is an identifier which is unique inside a specific document; configuring properties 1 to 4 in the system at setup time and causing the system to generate properties 5 to 7 by the system for each stored document.
PCT/EP2000/011791 1999-11-29 2000-11-24 Database indexing system and method for managing diverse document types WO2001040966A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB9928210.5A GB9928210D0 (en) 1999-11-29 1999-11-29 Method
GB9928210.5 1999-11-29

Publications (2)

Publication Number Publication Date
WO2001040966A2 true WO2001040966A2 (en) 2001-06-07
WO2001040966A3 WO2001040966A3 (en) 2003-12-04

Family

ID=10865374

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2000/011791 WO2001040966A2 (en) 1999-11-29 2000-11-24 Database indexing system and method for managing diverse document types

Country Status (2)

Country Link
GB (1) GB9928210D0 (en)
WO (1) WO2001040966A2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1451715A1 (en) * 2001-08-14 2004-09-01 McDonald, Nathan Joel Document analysis system and method
EP1730652A1 (en) * 2004-04-02 2006-12-13 Samsung Electronics Co., Ltd. Xml processor having function for filtering tree path, method of filtering tree path and recording medium thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5918225A (en) * 1993-04-16 1999-06-29 Sybase, Inc. SQL-based database system with improved indexing methodology
US5924090A (en) * 1997-05-01 1999-07-13 Northern Light Technology Llc Method and apparatus for searching a database of records

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5918225A (en) * 1993-04-16 1999-06-29 Sybase, Inc. SQL-based database system with improved indexing methodology
US5924090A (en) * 1997-05-01 1999-07-13 Northern Light Technology Llc Method and apparatus for searching a database of records

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ROSENTHAL ET AL: "XML, Databases, and Interoperability" , ATCEA ANNUAL FEDERAL DATABASE COLLOQUIUM, XX, XX, PAGE(S) 1-18 XP002205817 page 3, paragraph 3 -page 11, paragraph 5.3 *
TAN K F ET AL: "FIBEX, an extractor enabling querying of documents using SQL" , DATABASE AND EXPERT SYSTEMS APPLICATIONS, 1998. PROCEEDINGS. NINTH INTERNATIONAL WORKSHOP ON VIENNA, AUSTRIA 26-28 AUG. 1998, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, PAGE(S) 108-112 XP010296641 ISBN: 0-8186-8353-8 page 109, right-hand column, paragraph 3 -page 111, left-hand column, paragraph 5.3 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1451715A1 (en) * 2001-08-14 2004-09-01 McDonald, Nathan Joel Document analysis system and method
EP1451715A4 (en) * 2001-08-14 2009-05-13 Nathan Joel Mcdonald Document analysis system and method
US7668865B2 (en) 2001-08-14 2010-02-23 Mcdonald Nathan Joel Document analysis system and method
EP1730652A1 (en) * 2004-04-02 2006-12-13 Samsung Electronics Co., Ltd. Xml processor having function for filtering tree path, method of filtering tree path and recording medium thereof
EP1730652A4 (en) * 2004-04-02 2009-11-11 Samsung Electronics Co Ltd Xml processor having function for filtering tree path, method of filtering tree path and recording medium thereof

Also Published As

Publication number Publication date
GB9928210D0 (en) 2000-01-26
WO2001040966A3 (en) 2003-12-04

Similar Documents

Publication Publication Date Title
EP1089195B1 (en) Method for storing and managing data
US6742001B2 (en) System and method for sharing data between hierarchical databases
US5895461A (en) Method and system for automated data storage and retrieval with uniform addressing scheme
US7318063B2 (en) Managing XML documents containing hierarchical database information
EP1302888A2 (en) A system and method for use in providing a healthcare information database
US20040172442A1 (en) System and Method for Sharing Data Between Hierarchical Databases
US8713041B2 (en) Peer to peer (P2P) missing fields and field valuation feedback
EP1918827A1 (en) Data processing
US20060047648A1 (en) Comprehensive query processing and data access system and user interface
US20070016610A1 (en) Conversion of hierarchically-structured HL7 specifications to relational databases
US20200013491A1 (en) Interoperable Record Matching Process
EP1606734A1 (en) Conversion of structured information
WO2002101515A2 (en) System and method for managing data and documents
US20080109400A1 (en) Method and device for configuring a variety of medical information
Si et al. An OMOP CDM-based relational database of clinical research eligibility criteria
US20090112794A1 (en) Aliased keys for federated database queries
Solbrig et al. Automated population of an i2b2 clinical data warehouse using FHIR
Pluempitiwiriyawej et al. A classification scheme for semantic and schematic heterogeneities in XML data sources
WO2001040966A2 (en) Database indexing system and method for managing diverse document types
US20100050109A1 (en) Data input method
Freire et al. Managing XML data: An abridged overview
EP1965313A1 (en) Data processing
Liu Enabling electronic healthcare information exchange
Clark et al. A structured interface to the object-oriented genomics unified schema for XML-formatted data
Wahlin XML for asp. net developers

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): US

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: COMMUNICATION PURSUANT TO RULE 69 EPC (EPO FORM 1205A OF 101202)

122 Ep: pct application non-entry in european phase