WO2001040966A2

WO2001040966A2 - Database indexing system and method for managing diverse document types

Info

Publication number: WO2001040966A2
Application number: PCT/EP2000/011791
Authority: WO
Inventors: Ingo Elfering; Julian Reschke
Original assignee: Medical Data Services Gmbh
Priority date: 1999-11-29
Filing date: 2000-11-24
Publication date: 2001-06-07
Also published as: GB9928210D0; WO2001040966A3

Abstract

A method for indexing, in an electronic database, documents which have different formats and which contain diverse measures and diverse measurements between and amongst the documents.

Description

Database Indexing System and Method for Managing Diverse Document Types

Area of the Invention

The present invention relates to a computer-based storage system and to the intelligent retrieval of documents from that system. In particular, the present invention relates to systems, methods and computer program products for document storage and intelligent retrieval by computers.

Background of the Invention

The present invention finds particular, but not exclusive, application in the healthcare industry. A problem frequently encountered in healthcare systems is that medical information is generated in a variety of formats and styles from a multitude of data sources, such as physicians' records, clinical labs, insurance providers, health care authorities and ultimately patients themselves. For all involved parties it is most valuable to access and analyze this data electronically for different purposes. A patient, for example, may wish not only to view his/her medical record, but also to access all prescription data or all data relating to a particular disease.

Access to medical documents has to overcome four main obstacles:

Document volume: The number of documents and the amount of data are huge. Manual processing is neither cost-efficient nor time-efficient. This calls for processing by electronic means, leveraging up-to-date information technology.

Document formats: Prescription data, as mentioned in the previous paragraph, is, like other documents, generated in different formats, often in a semi-structured way. Handling all this variety in the analyzing software is not considered feasible. Algorithms would need to be highly complex and would require continuous updating due to changes in the document formats. This is not a good basis for handling sensitive medical information.

Document contexts: For example, measurement units like weight or temperature need to be unified to a common scale for analyses to make sense. The information on which units are used (pounds or kilograms) is not necessarily contained in the document itself. It might depend on the data source/person who generated the document.

Document integrity: Documents have to be stored unaltered. There are several reasons for this. Preserving digital signatures is just one of them.

The present invention is a method to overcome all the above obstacles. It allows the storage of the unaltered documents in modern, high-volume database systems (obstacles 1 and 4). It enhances the stored documents with configurable and extendable index information (obstacle 3). At the same time it provides a single, flexible query interface to the analyzing software which is independent of the actual document format (obstacle 2). This way the complexity of the analyzing method is reduced and the maintainability is enhanced.

Summary of the Invention

In a first embodiment, this invention relates to an indexing system for a database in a single computer or in a distributed computational system, wherein the index comprises:

1. a name uniquely identifying an index for a specific document format;

2. a name of the format to which this index applies;

3. a document-part which is a pattern which is matched against the document and for each match, generating a value of the index and storing it in the database;

4. a value-conversion statement to convert the matched document-part to the index value;

5. a value for an index for a specific matched part of a document;

6. a tag which is a pattern identifying the part of the document which was matched in order to produce the value;

7. a block-ID that is an identifier which is unique inside a specific document;

wherein properties 1 to 4 are configured in the system at setup time and properties 5 to 7 are generated by the system for each stored document.

In a second embodiment, the invention comprises a method for indexing and retrieving documents from an electronic database, wherein the database is to contain or contains multiple document types having multiple measures and measurement units, and wherein the method comprises:

1. creating or generating a name uniquely identifying an index for a specific document format;

2. creating or generating a name of the format to which this index applies;

3. defining a document-part which is a pattern which is matched against the document and for each match, generating a value of the index and storing it in the database;

4. creating or generating a value-conversion statement to convert the matched document-part to the index value;

5. generating a value for an index for a specific matched part of a document;

6. generating a tag which is a pattern identifying the part of the document which was matched in order to produce the value;

7. generating a block-ID that is an identifier which is unique inside a specific document;

configuring properties 1 to 4 in the system at setup time and causing the system to generate properties 5 to 7 for each stored document.

Description of the Figures

Figure 1 is a flowchart of a virtual indexing system.

Description of the Invention

The core of the present invention lies in the document indexing mechanism. For example, in order to allow querying of all prescription documents, an index can be defined for multiple document types. An example of a query would be "return all prescription dates where drug is equal to aspirin". In this example "drug" and "date" would be the names of indices which are defined for all prescription documents.

For this to work, the index has to cope with different document formats. Additionally, the index has to work with documents that contain more than one prescription. This raises the issue of index correlation. Given a document with multiple prescriptions, this document will have multiple "drug" and "date" indices. When performing the above-mentioned query, the correct date for the specified drug needs to be returned (as opposed to returning the wrong date, e.g. the date of another drug's prescription).

The present invention defines an index with the following properties:

1. Name: ("date" or "drug" in the given example) uniquely identifying an index for a specific document format.

2. Format: name of the format to which this index applies.

3. Document-part: a pattern which is matched against the document. For each match, a value of the index is generated and stored in the database.

4. Value-conversion: statements to convert the matched document part to the index value.

5. Value: the value of an index for a specific (matched part of a) document. The "drug" index would have the value "aspirin" in the given example.

6. Tag: a pattern identifying the part of the document which was matched in order to produce the value.

7. Block-ID: an identifier that is unique inside a specific document. Block-IDs are used to correlate index pairs. The "date" and "drug" index for the same prescription would have the same block-ID.

Properties 1 to 4 are configured in the system at setup time. Properties 5 to 7 are generated by the system for each stored document. The high-level workflow for storing and indexing documents is specified in Figure 1.
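The seven properties can be pictured as two record types, one for the configured part and one for the generated part. The following Python sketch is only an illustration: the class names, the field names and the `document` field are hypothetical, and the "drug:*" pattern is assumed from the demonstration tag notation used in the tables of this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexDefinition:
    """Properties 1 to 4: configured once, at setup time."""
    name: str              # property 1, e.g. "date" or "drug"
    format: str            # property 2, the document format the index applies to
    document_part: str     # property 3, pattern matched against the document
    value_conversion: str  # property 4, how a match becomes the index value


@dataclass(frozen=True)
class IndexEntry:
    """Properties 5 to 7: generated for every match in a stored document."""
    definition: IndexDefinition
    document: str   # hypothetical field: which stored document the entry belongs to
    value: str      # property 5
    tag: str        # property 6
    block_id: str   # property 7, shared by entries of the same prescription


# Assumed definition of the "drug" index for Format A (demonstration notation).
drug_a = IndexDefinition("drug", "A", "drug:*", "none")
entry = IndexEntry(drug_a, "a", "aspirin", "drug:*[1]", "a[1]")
print(entry.definition.name, entry.value, entry.block_id)
```

Keeping the definition separate from the entries mirrors the split between setup-time configuration and per-document generation.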

As an example, consider the following two prescriptions, which are of different formats. Prescription a of Format A looks like this:

date: 12/24/1999
drug: aspirin

while prescription b of Format B could look like this:

prescribe(1999-12-06, nicorette); prescribe(2000-01-01, aspirin).

When configuring the system, the following index information would be set up (Table 1).

Table 1

Value-conversion of "none" signifies that the selected parts of the document are copied into the value field for the index (as shown below). "American-to-ISO8601" means a date format conversion from American notation to the ISO standard. This is an example where an index is used for generating information that is not directly present in the document. When storing documents a and b, the following index information would be generated (note the date format conversion for the date index of document a) (Table 2):

Table 2

Row  Document  Index  Value       Tag                   Block-ID
1    a         date   1999-12-24  date:*[1]             a[1]
2    a         drug   aspirin     drug:*[1]             a[1]
3    b         date   1999-12-06  prescription(*,x)[1]  b[1]
4    b         date   2000-01-01  prescription(*,x)[2]  b[2]
5    b         drug   nicorette   prescription(x,*)[1]  b[1]
6    b         drug   aspirin     prescription(x,*)[2]  b[2]

Going back to the original example, the query to process was:

"Return all prescription dates where drug is equal to aspirin". The algorithm for this type of query works as follows: First, eliminate all rows in the table which have "drug" as index and not "aspirin" as value. This gives Table 3:

Table 3

Row  Document  Index  Value       Tag                   Block-ID
1    a         date   1999-12-24  date:*[1]             a[1]
2    a         drug   aspirin     drug:*[1]             a[1]
3    b         date   1999-12-06  prescription(*,x)[1]  b[1]
4    b         date   2000-01-01  prescription(*,x)[2]  b[2]
6    b         drug   aspirin     prescription(x,*)[2]  b[2]

Second, eliminate all rows which have "date" as index and whose block-ID does not occur in any remaining row with "drug" as index. This reduces the table to that of Table 4:

Table 4

Row  Document  Index  Value       Tag                   Block-ID
1    a         date   1999-12-24  date:*[1]             a[1]
2    a         drug   aspirin     drug:*[1]             a[1]
4    b         date   2000-01-01  prescription(*,x)[2]  b[2]
6    b         drug   aspirin     prescription(x,*)[2]  b[2]

Third, since only dates should be returned, eliminate all rows whose index is not "date" (Table 5):

Table 5

Row  Document  Index  Value       Tag                   Block-ID
1    a         date   1999-12-24  date:*[1]             a[1]
4    b         date   2000-01-01  prescription(*,x)[2]  b[2]

Finally, return the values of the remaining rows: "1999-12-24" and "2000-01-01", which are the dates when aspirin was prescribed.
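The three elimination steps can be followed in a few lines of Python. This is a minimal sketch, not the implementation described later on: the `Row` type and the function name `select_where` are made up for illustration, and the rows are those of Table 2. Because a block-ID is only unique inside one document, the sketch correlates on the pair of document and block-ID.

```python
from typing import NamedTuple


class Row(NamedTuple):
    row: int
    document: str
    index: str
    value: str
    tag: str
    block_id: str


TABLE_2 = [
    Row(1, "a", "date", "1999-12-24", "date:*[1]", "a[1]"),
    Row(2, "a", "drug", "aspirin", "drug:*[1]", "a[1]"),
    Row(3, "b", "date", "1999-12-06", "prescription(*,x)[1]", "b[1]"),
    Row(4, "b", "date", "2000-01-01", "prescription(*,x)[2]", "b[2]"),
    Row(5, "b", "drug", "nicorette", "prescription(x,*)[1]", "b[1]"),
    Row(6, "b", "drug", "aspirin", "prescription(x,*)[2]", "b[2]"),
]


def select_where(rows, select_index, where_index, where_value):
    # Step 1: drop rows of the condition index whose value does not match.
    step1 = [r for r in rows
             if not (r.index == where_index and r.value != where_value)]
    # Step 2: drop rows of the selected index whose block has no matching
    # condition row in the same document.
    blocks = {(r.document, r.block_id) for r in step1 if r.index == where_index}
    step2 = [r for r in step1
             if r.index != select_index or (r.document, r.block_id) in blocks]
    # Step 3: keep only rows of the selected index and return their values.
    return [r.value for r in step2 if r.index == select_index]


dates = select_where(TABLE_2, "date", "drug", "aspirin")
print(dates)  # ['1999-12-24', '2000-01-01']
```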

Note that the above notations for "document-part" patterns, tags and block-IDs are used for demonstration purposes only. The present invention was implemented using a totally different set of languages, as will be described later on.

Implementation Description

An implementation of the present invention exists. The implementation manages documents in XML format (http://www.w3.org/TR/1998/REC-xml-19980210). Document-parts, tags and block-IDs are specified in the XPath language (http://www.w3.org/TR/xpath). Furthermore, the implementation uses Microsoft SQL Server as the database, while the software itself is written in VisualBasic. The implementation can be set up to use other vendors' databases as well (Oracle, Informix, Sybase, etc.). The implementation could also be written in C, C++, Java or any other computer language.

The implementation, called "Data Access Layer" or DAL further on, consists basically of two parts. One is the database layout and the other is the software implementing the algorithms for storage and retrieval.

Database Layout

The database layout is the basis for the higher functionality. As already mentioned, the implementation uses Microsoft SQL Server as the preferred relational database engine. The implementation's database contains three standard tables named vConfig, vScripts and vDocTypes. The layout of the tables is as follows:

vConfig table

1. ID: unique id for a vConfig table entry. Generated by the database upon insertion of a new row in vConfig.

2. virtName: name of a virtual index as can be used in the query language.

3. docType: document type this index is used on; holds the ID of an entry in the vDocTypes table.

4. Pattern: XPath expression used to match parts of the XML document.

5. postProc: reference to the vScripts table. postProc contains the unique ID of a vScripts row.

6. targetType: type of the result of the script; can be string, date or double.

7. indexed: flag indicating whether the index value is calculated on storage or on retrieval of the document.

8. required: flag indicating whether a value for the index is required.

9. unique: flag indicating whether the value for the index needs to be unique among all documents of this type.

The vConfig table holds all information about virtual indices. By reading the information from vConfig, the DAL knows which indices there are, for which document types they apply and where the script for generating an index value can be found. The field targetType determines how the value of a generated virtual index should be treated. Indices can be strings (text), date (date and time) or double (double precision floating point number).

vScripts table

1. ID: unique id for a vScripts entry. Generated by the database upon insertion of a new row in vScripts.

2. Name: name of the script which may be used in the query language (see VQL flag).

3. Script: the source code of the script itself, written in a specific computer language (see Language field).

4. Language: computer language the script is written in (in this implementation it must be a language supporting the Microsoft ScriptControl interface; currently available are VBScript, JScript and PerlScript).

5. VQL: flag indicating whether the script can also be used in the query language.

The vScripts table is an addendum to the vConfig table. All fields could also be placed in the vConfig table itself; however, since several indices might use the same script, it is more efficient to organize things this way. vScripts not only gives the source code of the script itself, but also tells in which language the script must be executed. If a script needs to be executed, the DAL can start a new Microsoft ScriptControl object with the given language, load the source code into the object and execute the script.

vDocTypes table

1. ID: unique id for a vDocTypes entry. Generated by the database upon insertion of a new row in vDocTypes.

2. DocType: name of the document type.

3. DocTypeGroup: grouping of document types under a logical name.

4. TargetTable: name of the database table where the actual documents will be stored.

vDocTypes defines all known document types which are handled by the data access layer. If a document of unknown type is to be stored by the DAL, a new entry in vDocTypes will be created. Thus, the DAL is not limited to storage of a predefined set of document types.

Most important is the TargetTable field, which determines in which table of the database documents of this type will be stored (and virtual index values be kept).

Sample Document Tables

Returning to the prescription example set out above, the documents to be stored would be valid XML documents as follows. Prescription a of Format A could be:

<A>

<date>12/24/1999</date>

<drug>aspirin</drug>

</A>

while prescription b of Format B could look like this:

<B>

<drug date="1999-12-06">nicorette</drug>

<drug date="2000-01-01">aspirin</drug>

</B>

In order to handle XML documents of format A and B, the following entries are created in the database (Tables 6A, 6B, and 6C):

Table 6A

vDocTypes.ID  DocType  DocTypeGroup  TargetTable
              A        Prescription  VPrescription
              B        Prescription  VPrescription

Table 6B

vScripts.ID  Name   Script  Language  VQL
1            Adate          VBScript  false
2            Adrug          VBScript  false
3            Bdate          VBScript  false
4            Bdrug          VBScript  false

Table 6C

vConfig.ID  virtName  DocType  Pattern       PostProc  TargetType  indexed
1           date      A        //date        1         date        true
2           date      B        //drug/@date  3         date        true
3           drug      A        //drug        2         string      true
4           drug      B        //drug        4         string      true
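For readers who want to experiment with this layout, the configuration tables can be set up roughly as follows. The sketch uses SQLite through Python's sqlite3 module in place of Microsoft SQL Server; the column types, the constraints and the default values for "required" and "unique" are assumptions, the script sources are left empty because they are not given here, and the vDocTypes IDs are simply generated by the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vDocTypes (
    ID INTEGER PRIMARY KEY,
    DocType TEXT NOT NULL UNIQUE,
    DocTypeGroup TEXT,
    TargetTable TEXT NOT NULL
);
CREATE TABLE vScripts (
    ID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    Script TEXT,
    Language TEXT NOT NULL,
    VQL INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE vConfig (
    ID INTEGER PRIMARY KEY,
    virtName TEXT NOT NULL,
    docType INTEGER NOT NULL REFERENCES vDocTypes(ID),
    pattern TEXT NOT NULL,
    postProc INTEGER REFERENCES vScripts(ID),
    targetType TEXT NOT NULL CHECK (targetType IN ('string', 'date', 'double')),
    "indexed" INTEGER NOT NULL DEFAULT 1,
    "required" INTEGER NOT NULL DEFAULT 0,  -- assumed default
    "unique" INTEGER NOT NULL DEFAULT 0     -- assumed default
);
""")

# Table 6A (IDs are generated by the database).
conn.executemany(
    "INSERT INTO vDocTypes (DocType, DocTypeGroup, TargetTable) VALUES (?, ?, ?)",
    [("A", "Prescription", "VPrescription"),
     ("B", "Prescription", "VPrescription")])

# Table 6B (script source not shown in the tables, so stored as NULL).
conn.executemany(
    "INSERT INTO vScripts (ID, Name, Script, Language, VQL) "
    "VALUES (?, ?, NULL, 'VBScript', 0)",
    [(1, "Adate"), (2, "Adrug"), (3, "Bdate"), (4, "Bdrug")])

# Table 6C; the docType column stores the ID of the matching vDocTypes row.
table_6c = [
    (1, "date", "A", "//date", 1, "date"),
    (2, "date", "B", "//drug/@date", 3, "date"),
    (3, "drug", "A", "//drug", 2, "string"),
    (4, "drug", "B", "//drug", 4, "string"),
]
conn.executemany(
    "INSERT INTO vConfig (ID, virtName, docType, pattern, postProc, targetType) "
    "SELECT ?, ?, ID, ?, ?, ? FROM vDocTypes WHERE DocType = ?",
    [(i, name, pattern, script, ttype, doc)
     for i, name, doc, pattern, script, ttype in table_6c])


def indices_for(doc_type):
    """Return (ID, virtName, pattern) of all indices configured for a type."""
    return conn.execute(
        "SELECT c.ID, c.virtName, c.pattern FROM vConfig c "
        "JOIN vDocTypes t ON c.docType = t.ID "
        "WHERE t.DocType = ? ORDER BY c.ID", (doc_type,)).fetchall()


print(indices_for("A"))  # [(1, 'date', '//date'), (3, 'drug', '//drug')]
```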

In addition to this, the tables for storing the documents and the indices need to be created as follows:

VPrescription

1. ID: unique id for a VPrescription entry. Generated by the database upon insertion of a new row.

2. DocType: name of the document type.

3. Value: the document itself.

VPrescriptionIndex

1. ID: unique id for a VPrescriptionIndex entry. Generated by the database upon insertion of a new row.

2. configID: ID of the entry in vConfig which created the entry.

3. recordID: ID of the document in VPrescription for which this entry is intended.

4. valueDouble: field which holds floating point values.

5. valueDate: field which holds date values.

6. valueString: field which holds string values.

7. blockID: string defining a block in the document.

8. tagID: XPath expression which identifies the XML element of the document which was matched to create this entry.

Data Access Layer

The data access layer is the software operating on the database. It provides an interface with which the application can store and retrieve XML documents. For the description of virtual indexing, two functions are of particular interest: putDocument and findDocument.

The putDocument function needs only the document itself in order to store a new document. The VisualBasic implementation follows the algorithm shown in Figure 1. The details of what happens in the individual parts of the algorithm are described below.

The section "Document Retrieval" explains how the findDocument method is used to retrieve the stored, indexed information again. In particular it is described how the query is translated into SQL code, which is then executed by the database.

Document Storage

A detailed description of how the individual steps of the algorithm in Figure 1 are implemented is given below.

"Determine Document Format" - the present implementation checks whether the document contains any namespace references or document type definitions (DTD in XML lingo). If so, it uses the name of the namespace or DTD. Otherwise it uses the name of the root tag as the name of the document type. In this example, for documents a and b, no namespace or DTDs are present, so the document types "A" and "B" are taken from the root tag.

The DAL consults the vDocTypes table for an entry whose "DocType" column contains the needed document type. The TargetTable then defines in which database table the document is to be stored. If the document type is not found in the vDocTypes table, a new entry is created which uses a DefaultTable as TargetTable. In this example, the DAL will find "VPrescription" as TargetTable for both documents a and b.
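The "Determine Document Format" step can be sketched with Python's standard xml.dom.minidom module. This is only an illustration of the rule stated above: the function name is hypothetical, and the order of precedence (namespace first, then DTD, then root tag) as well as the choice of the DTD's system identifier as its "name" are assumptions.

```python
from xml.dom import minidom


def determine_document_type(xml_text: str) -> str:
    """Return the document type name for an XML document (illustrative rule)."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    # Assumption: a namespace on the root element takes precedence.
    if root.namespaceURI:
        return root.namespaceURI
    # Assumption: the DTD is named by its system ID, else public ID, else name.
    if doc.doctype is not None:
        return doc.doctype.systemId or doc.doctype.publicId or doc.doctype.name
    # Neither namespace nor DTD: fall back to the name of the root tag.
    return root.tagName


doc_a = "<A><date>12/24/1999</date><drug>aspirin</drug></A>"
doc_b = ('<B><drug date="1999-12-06">nicorette</drug>'
         '<drug date="2000-01-01">aspirin</drug></B>')

print(determine_document_type(doc_a))  # A
print(determine_document_type(doc_b))  # B
```

The returned name would then be looked up in vDocTypes to find the TargetTable.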

"Store Document" - the DAL will create an SQL statement to insert the document as new entry into the TargetTable. The database is set up to give this new entry a unique ID which is returned to the DAL. Lets call this document.ID for now.

In this example, when storing documents a and b, the VPrescription table would look like that of Table 7:

Table 7

VPrescription.ID  DocType  Value
1                 A        <A><date>12/24/1999</date><drug>aspirin</drug></A>
2                 B        <B><drug date="1999-12-06">nicorette</drug><drug date="2000-01-01">aspirin</drug></B>

"Find All Indices..." - the DAL now looks up all entries in vConfig which have the stored document type in the DocType column. For each entry found, the DAL then performs the remaining steps. If no entries in vConfig are found, the DAL returns control to the application.

In this example, entries 1 and 3 are found in vConfig for documents of type A (entries 2 and 4 for type B).

"Match document-part of Index" - the XPath expression of an index is matched against the stored XML document. This matching returns a list of 0, 1 or more nodes (XML lingo). With 0 nodes, no match was found and the work for this index is done (no entries in VPrescriptionIndex will be written).

For 1 or more nodes found, the PostProc entry for the index is the unique ID of an entry in vScripts. This entry is loaded and the DAL initiates a Microsoft ScriptControl object. The code for the script is loaded into the ScriptControl for the specified language and executed. The matched node and its text property are given as input to the script. The script returns the index value and the block ID. The DAL then creates a new entry in VPrescriptionIndex with the following values: configID with the ID of the current index, recordID with the ID of the document stored in VPrescription, valueXXX with the value from the script (date/string/double depending on the TargetType of the index), blockID as returned from the script and, finally, tagID as the XPath identifier for the matched document node currently processed. When all nodes have been processed, the next index is processed and matched against the document. When all indices are done, control is returned to the caller.

In this example, when storing document "a", the indices "date" and "drug" for document type A need to be processed. The pattern for "date" matches all elements in a document with name "date". The script is called with the XML fragment "<date>12/24/1999</date>". This script knows that it has to do date conversion on its input and returns "1999-12-24" as date and the string "[1]" as block ID. After having processed all indices for documents a and b, the table VPrescriptionIndex looks like Table 8:

Table 8

ID  configID  recordID  valueDate   valueString  tagID            blockID
1   1         1         1999-12-24               //date[1]        [1]
2   3         1                     aspirin      //drug[1]        [1]
3   2         2         1999-12-06               //drug[1]/@date  [1]
4   2         2         2000-01-01               //drug[2]/@date  [2]
5   4         2                     nicorette    //drug[1]        [1]
6   4         2                     aspirin      //drug[2]        [2]
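The storage steps above can be traced with a short Python sketch that produces the six index entries for documents a and b. Several simplifications are assumptions of this sketch rather than features of the implementation: plain Python functions stand in for the VBScript post-processing scripts and the ScriptControl object, the `match` helper understands only patterns of the form `//name` and `//name/@attr` (ElementTree has no attribute-selecting XPath), and the block ID is derived from the position of the matched node.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def american_to_iso(text, position):
    """Stand-in for the 'Adate' script: converts MM/DD/YYYY to ISO 8601."""
    value = datetime.strptime(text.strip(), "%m/%d/%Y").strftime("%Y-%m-%d")
    return value, f"[{position}]"


def copy_value(text, position):
    """Stand-in for scripts that copy the matched text unchanged."""
    return text.strip(), f"[{position}]"


# (configID, virtName, docType, pattern, script, targetType) as in Table 6C.
V_CONFIG = [
    (1, "date", "A", "//date", american_to_iso, "date"),
    (2, "date", "B", "//drug/@date", copy_value, "date"),
    (3, "drug", "A", "//drug", copy_value, "string"),
    (4, "drug", "B", "//drug", copy_value, "string"),
]


def match(root, pattern):
    """Yield (position, text, tagID) per match; hypothetical mini-matcher."""
    element_path, _, attribute = pattern.partition("/@")
    nodes = root.findall("." + element_path)
    for position, node in enumerate(nodes, start=1):
        tag_id = f"{element_path}[{position}]"
        if attribute:
            if attribute in node.attrib:
                yield position, node.attrib[attribute], f"{tag_id}/@{attribute}"
        else:
            yield position, node.text or "", tag_id


def index_document(record_id, xml_text, index_table):
    """Append one index entry per matched node of each configured index."""
    root = ET.fromstring(xml_text)
    for config_id, _name, doc_type, pattern, script, target_type in V_CONFIG:
        if doc_type != root.tag:  # document type taken from the root tag
            continue
        for position, text, tag_id in match(root, pattern):
            value, block_id = script(text, position)
            index_table.append({
                "ID": len(index_table) + 1,
                "configID": config_id,
                "recordID": record_id,
                "valueDate": value if target_type == "date" else None,
                "valueString": value if target_type == "string" else None,
                "tagID": tag_id,
                "blockID": block_id,
            })


index_table = []
index_document(1, "<A><date>12/24/1999</date><drug>aspirin</drug></A>", index_table)
index_document(2, '<B><drug date="1999-12-06">nicorette</drug>'
                  '<drug date="2000-01-01">aspirin</drug></B>', index_table)
for entry in index_table:
    print(entry)
```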

Document Retrieval

The data access layer accepts an SQL-like query language which is by itself an XML document. The exact syntax of the language is outside the scope of this document. However, the DAL takes such a document and transforms it into a real SQL query for the database. The values returned from the database undergo some format conversion by the DAL and are then passed on to the user of the DAL.

What is of interest here is how a quite general query is translated into valid SQL for the database. We go back to our example:

select date from Prescription where drug = "aspirin"

The phrase "from Prescription" defines where to start looking. "Prescription" is a document type group. Looking into vDocTypes, the DAL sees that document types "A" and "B" belong to this group. Furthermore, it sees that both A and B documents are stored in the database table "VPrescription". From that the DAL knows that all index information for VPrescription is to be found in the table "VPrescriptionIndex". The intermediate SQL statement now looks like this:

SELECT [TBD] FROM VPrescription d, VPrescriptionIndex i WHERE d.ID = i.recordID [TBD]

"[TBD]" meaning placeholders in the SQL statements which need further work.

Looking into the phrase "select date", the DAL can deduce that "date" is an index whose value should be returned. Looking for "date" in vConfig, the DAL finds 2 indices of that name for prescription document types. Both have their value in the "valueDate" field of the index table (it is an error if they differ), so the SQL statement becomes:

SELECT i.valueDate FROM VPrescription d, VPrescriptionIndex i WHERE d.ID = i.recordID [TBD]

Now going for the last part of the query: "where drug = "aspirin"". The DAL again deduces that "drug" is the name of an index. Looking at vConfig for indices of document group Prescription, it finds the entries 3 and 4. Both have values of type string. The SQL statement becomes:


SELECT i.valueDate FROM VPrescription d, VPrescriptionIndex i, VPrescriptionIndex i2 WHERE d.ID = i.recordID AND i.recordID = i2.recordID AND i.blockID = i2.blockID AND i2.valueString = 'aspirin'
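The self-join on recordID and blockID can be tried out on the sample data with SQLite (used here in place of Microsoft SQL Server). The sketch below loads the two documents and the index rows of Table 8 and runs the statement. One addition is an assumption of this sketch and not part of the statement as printed: the two configID conditions, which restrict alias i to the "date" indices and alias i2 to the "drug" indices found in vConfig. Without them, on this sample data, each drug row would also be paired with itself and contribute an empty valueDate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE VPrescription (ID INTEGER PRIMARY KEY, DocType TEXT, Value TEXT);
CREATE TABLE VPrescriptionIndex (
    ID INTEGER PRIMARY KEY, configID INTEGER, recordID INTEGER,
    valueDouble REAL, valueDate TEXT, valueString TEXT,
    blockID TEXT, tagID TEXT
);
""")

conn.executemany("INSERT INTO VPrescription VALUES (?, ?, ?)", [
    (1, "A", "<A><date>12/24/1999</date><drug>aspirin</drug></A>"),
    (2, "B", '<B><drug date="1999-12-06">nicorette</drug>'
             '<drug date="2000-01-01">aspirin</drug></B>'),
])

# Index rows as in Table 8.
conn.executemany(
    "INSERT INTO VPrescriptionIndex "
    "(ID, configID, recordID, valueDate, valueString, tagID, blockID) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)", [
        (1, 1, 1, "1999-12-24", None, "//date[1]", "[1]"),
        (2, 3, 1, None, "aspirin", "//drug[1]", "[1]"),
        (3, 2, 2, "1999-12-06", None, "//drug[1]/@date", "[1]"),
        (4, 2, 2, "2000-01-01", None, "//drug[2]/@date", "[2]"),
        (5, 4, 2, None, "nicorette", "//drug[1]", "[1]"),
        (6, 4, 2, None, "aspirin", "//drug[2]", "[2]"),
    ])

QUERY = """
SELECT i.valueDate
FROM VPrescription d, VPrescriptionIndex i, VPrescriptionIndex i2
WHERE d.ID = i.recordID
  AND i.recordID = i2.recordID
  AND i.blockID = i2.blockID
  AND i2.valueString = ?
  AND i.configID IN (1, 2)    -- assumption: the "date" indices in vConfig
  AND i2.configID IN (3, 4)   -- assumption: the "drug" indices in vConfig
ORDER BY i.ID
"""

dates = [row[0] for row in conn.execute(QUERY, ("aspirin",))]
print(dates)  # ['1999-12-24', '2000-01-01']
```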

Now the SQL statement is ready to be sent against the database, which will return 1999-12-24 and 2000-01-01.

The foregoing exemplification is provided to illustrate the invention. This exemplification is not intended to limit what is reserved to the inventors hereunder.

Claims

We claim:

1. An indexing system for a database in a single computer or in a distributed computational system, comprising:

1. a name uniquely identifying an index for a specific document format;

2. a name of the format to which this index applies;

3. a document-part which is a pattern which is matched against the document and for each match, generating a value of the index and storing it in the database;

4. a value-conversion statement to convert the matched document-part to the index value;

5. a value for an index for a specific matched part of a document;

6. a tag which is a pattern identifying the part of the document which was matched in order to produce the value;

7. a block-ID that is an identifier which is unique inside a specific document;

wherein properties 1 to 4 are configured in the system at setup time and properties 5 to 7 are generated by the system for each stored document.

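2. A method for indexing and retrieving documents from an electronic database, wherein the database is to contain or contains multiple document types having multiple measures and measurement units, and wherein the method comprises:

1. creating or generating a name uniquely identifying an index for a specific document format;

2. creating or generating a name of the format to which this index applies;

3. defining a document-part which is a pattern which is matched against the document and for each match, generating a value of the index and storing it in the database;

4. creating or generating a value-conversion statement to convert the matched document-part to the index value;

5. generating a value for an index for a specific matched part of a document;

6. generating a tag which is a pattern identifying the part of the document which was matched in order to produce the value;

7. generating a block-ID that is an identifier which is unique inside a specific document;

configuring properties 1 to 4 in the system at setup time and causing the system to generate properties 5 to 7 for each stored document.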