US20080010313A1 - method and system for indexing and searching contents of extensible markup language (xml) documents - Google Patents

Info

Publication number
US20080010313A1
Authority
US
United States
Prior art keywords
field
index
record
word
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/858,238
Inventor
David Thede
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to US11/858,238
Publication of US20080010313A1
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99932 Access augmentation or optimizing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99934 Query formulation, input preparation, or translation
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99943 Generating database or data structure, e.g. via user interface

Definitions

  • the following is a computer program segment implementing the method of encoding field qualifiers by a pattern or an array of integers, according to one embodiment of the present invention.
  • This function converts a field expression, such as “/record//name” into a corresponding numeric array.
  • a flag “fUseWildcards” is used to specify whether a particular field expression in a search query may contain wildcard characters. For example, the query “/record//name contains Smith” finds any field “name” within a field “record” that has “Smith” as the value of the “name” field.
  • a wildcard character is used between the field “record” and the field “name.” In other words, this expression should be able to also match “/record/patient/name”, and “/record/name”, etc.
  • a delimiter “/” is used at the beginning of the expression, such as “/record/name”
  • the “record” field is the top level field element.
  • the “record” field can be nested inside other fields. That is, “record/name” can match expressions such as “/table/record/name” and “/customer/record/name.” Therefore, a wildcard character should be used at the front of a field expression when there is no delimiter “/”.
  • the input field expression is tokenized based on the delimiter “/”.
  • Each string token is then assigned to a numeric value or identifier by calling the function “getFieldId.” If there is no identifier returned, the token is inserted into the table of field names so that a unique id can be created for the token by a separate function.
  • the following is a computer program segment implementing pattern-matching using numeral encoding of field qualifiers, according to one embodiment of the present invention.
  • a is the numeric encoding of the field qualifiers of a word in the index
  • b is the numeric encoding of a field qualifier in a search query.
  • Each of the integers in the “a” and “b” arrays corresponds to a field name.
  • the “b” array may contain wildcard characters so that the query will support words with similarly nested field structures. For example, as discussed above, the query “/record//name contains Smith” matches any field “name” within a field “record” that has “Smith” as the value of the “name” field.
  • the “b” array would contain: <record code>, <matchAny>, <name code>, where <record code> is the integer corresponding to the “record” field, <matchAny> is a wildcard character that matches any number of values, and <name code> is the integer corresponding to the “name” field.
  • a modified matching method is used in an alternative embodiment of the present invention.
  • This method uses a combination of string matching and integer pattern matching to identify a particular word in a particular field or fields.
  • Second, the field expression of the search query is transformed into a numeral or integer pattern, which is then matched against the numerical encoding representing the field qualifiers of each word in the index. The resulting matches are subsequently combined with the matches from the first step.
  • a search request “address/street contains Oak” may be converted to the integer pattern (*,5,6,*) associated with the word “Oak”.
  • the wildcard characters at the beginning and the end of the pattern indicate that the address field may be inside another field and that additional fields may be nested inside the street field. Therefore, evaluation of a complex field expression is reduced to a simple matching of integer patterns. Replacement of string comparisons with numerical comparisons accordingly improves the speed of the search.
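  • The conversion of a field expression into an integer pattern, and the matching of that pattern against a word's field encoding, might be sketched in Python as follows. This is a minimal illustration and not the program segment referred to above: the names `MATCH_ANY`, `encode_field_expression`, `matches`, and `allow_nested` are hypothetical, the field table is assumed to be a plain dictionary, and new identifiers are assumed to be assigned sequentially.

```python
MATCH_ANY = -1  # hypothetical sentinel standing in for the <matchAny> wildcard


def encode_field_expression(expr, field_ids, allow_nested=False):
    """Turn a field expression such as "/record//name" into a list of integers."""
    tokens = expr.split("/")
    pattern = []
    if tokens[0] == "":
        tokens = tokens[1:]        # leading "/": the first field is the top-level field
    else:
        pattern.append(MATCH_ANY)  # no leading "/": the first field may be nested anywhere
    for token in tokens:
        if token == "":
            pattern.append(MATCH_ANY)  # "//" leaves the intermediate fields unspecified
        else:
            # Unknown names get a fresh id (assumes ids are the contiguous range 1..n).
            pattern.append(field_ids.setdefault(token, len(field_ids) + 1))
    if allow_nested:
        pattern.append(MATCH_ANY)  # also accept fields nested below the last one
    return pattern


def matches(a, b):
    """Return True if word encoding `a` satisfies query pattern `b`."""
    if not b:
        return not a
    if b[0] == MATCH_ANY:
        # The wildcard absorbs zero or more leading codes of `a`.
        return any(matches(a[i:], b[1:]) for i in range(len(a) + 1))
    return bool(a) and a[0] == b[0] and matches(a[1:], b[1:])


field_ids = {"record": 1, "name": 2, "first_name": 3, "last_name": 4,
             "address": 5, "street": 6, "city": 7, "state": 8}

name_pattern = encode_field_expression("/record//name", field_ids)
street_pattern = encode_field_expression("address/street", field_ids, allow_nested=True)
```

  • Under these assumptions, `street_pattern` corresponds to the (*,5,6,*) pattern mentioned above, and `matches([1, 5, 6], street_pattern)` accepts a word stored in record/address/street. Whether a trailing wildcard is added is left to the caller here, since the two examples in the text differ on that point.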

Abstract

A method and a computer system for indexing and searching the data content of nested field records, such as those in Extensible Markup Language (XML). The system includes an indexing and searching engine that constructs an improved full-text search index on the input XML data and then performs searches using the index. The system supports exact matches and partial matches using a wildcard character. The method transforms the input XML data into a form that encodes the data structural information by suffixing each word with its corresponding field qualifiers or an equivalent numerical pattern thereof. The resulting encoded words are then stored in a full-text index structure. Various types of full-index search may be performed. One alternative embodiment is to combine string matching and numeric or integer pattern matching to identify a particular word in a particular field. The portion of the word without field qualifiers is matched against the words in the index, and the pattern of numerals representing the word's field qualifiers is matched against the numeral patterns of the words in the index that correspond to their respective field qualifiers. Therefore, evaluation of complex field criteria is reduced to simpler and faster numeric matching.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a Divisional of application Ser. No. 10/902,144, filed Jul. 30, 2004, which is a Divisional of application Ser. No. 09/549,533, filed Apr. 14, 2000, now U.S. Pat. No. 6,782,380, issued on Aug. 24, 2004, the entire contents of all of which are incorporated by reference.
  • FIELD OF THE INVENTION
  • The present invention relates to methods and systems to index and search records stored in a language using nested fields, particularly those stored in the Extensible Markup Language (XML). In particular, the present invention relates to an improved method and a computerized system to index and search documents and data in languages such as XML that utilize nested fields.
  • BACKGROUND OF THE INVENTION
  • The Extensible Markup Language (XML) is a universally accepted format for representing structured data in textual form. It is widely adopted in enterprise databases, and particularly in databases and applications connected to the World Wide Web. The manipulation and exchange of structured data, e.g., spreadsheets, address books, financial transactions, technical drawings, etc., is often challenging as the data is traditionally represented in platform or program dependent document formats. XML provides a set of rules and guidelines for designing text formats for such data; these XML text formats are unambiguous, platform-independent, and extensible.
  • An example of a simple XML document is provided as follows:
    <record>
    <name>
    <first_name>
    John
    </first_name>
    <last_name>
    Smith
    </last_name>
    </name>
    <address>
    <street>
    123 Smith Drive
    </street>
    <city>
    New York
    </city>
    <state>
    New York
    </state>
    </address>
    </record>
  • Basic XML format includes tags with brackets, e.g., <city> begins a field and </city> ends a field. Thus, <city> New York</city> represents a field named “city” that contains the content “New York.” Fields can be nested, e.g., “city” is an element in the field “address,” as shown above. More complex syntax can be used for various types of data.
  • A key practical issue in realizing the advantages afforded by XML is the need for an efficient search method. Easy data manipulation and exchange requires an effective method to handle computationally intensive search operations for complex and concurrent queries, which are becoming commonplace in the use of networked enterprise databases and databases connected to the Internet.
  • Existing database management systems, such as relational database and object-oriented database systems, are generally equipped with mechanisms or facilities for rapidly retrieving selected records based on key fields in the database. Such facilities or mechanisms often depend upon the data and the schema, and therefore are specific to each database. A variety of complex data structures are implemented in databases to facilitate fast retrieval of data based on key fields; examples include binary trees, B-trees, and red-black trees. Additionally, various types of indices are built for certain key words or fields that are frequently queried in a database to enable fast searching on those words and fields.
  • Existing full-text indices allow rapid searches on any word in a body of text. They are commonly used by Internet search engines such as Hotbot and Alta Vista to enable a user to quickly identify a particular Web site. Although they vary considerably in their implementation, full-text indices essentially consist of a table of words in alphabetical order, with pointers or links to the corresponding locations of the words in a database or a file. Generally a full-text index also supports wildcard (represented by “*”) searches that locate words based on a partial match. For example, a search for “appl*” will find “apply,” “appliance,” etc.
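  • The structure described above (an alphabetical word table with pointers to word locations, plus trailing-wildcard lookup) can be illustrated with a short Python sketch. It is a simplified assumption about how such an index might look, not a description of any particular search engine; `build_index` and `prefix_search` are hypothetical names.

```python
from bisect import bisect_left


def build_index(text):
    """Return (sorted word table, postings) where postings maps word -> positions."""
    postings = {}
    for position, word in enumerate(text.lower().split()):
        postings.setdefault(word, []).append(position)
    return sorted(postings), postings


def prefix_search(sorted_words, query):
    """Look up `query` in the word table; a trailing "*" matches any suffix."""
    if not query.endswith("*"):
        i = bisect_left(sorted_words, query)
        found = i < len(sorted_words) and sorted_words[i] == query
        return [query] if found else []
    prefix = query[:-1]
    # All words sharing a prefix are adjacent in an alphabetical table.
    start = bisect_left(sorted_words, prefix)
    end = start
    while end < len(sorted_words) and sorted_words[end].startswith(prefix):
        end += 1
    return sorted_words[start:end]


words, postings = build_index("apply the appliance to an apple")
hits = prefix_search(words, "appl*")
```

  • Because the table is sorted, the binary search finds the first candidate directly and the scan stops at the first word that no longer shares the prefix.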
  • Neither of these existing technologies provides an efficient way to search XML. Because XML represents structured data in a textual format, it lends itself only to a slow, sequential scan of the text when searching for a particular record. Standard full-text indexing provides only an incomplete solution because the field context of each word is not preserved. For example, a standard full-text index of the sample XML document above supports a search for “Smith,” but not for “Smith” only in the “address” field. That is, one cannot restrict a full-text index search to addresses that contain “Smith”; such a search returns every record that has “Smith” in any field. Some full-text indexing systems have the ability to search for a word associated with a particular property or field of a document (such as “Author is John Smith”), but this still does not provide a way to search based on the structural context of a word in an XML file, which involves several nested field qualifiers.
  • What is needed, therefore, is an improved full-text indexing mechanism for searching XML data that is capable of distinguishing between “Smith” in the last_name field and “Smith” in the street field, or between “New York” in the city field and “New York” in the state field. Such a mechanism should also preserve information on nested fields, so that the street field is recognized as an element within the address field, and the last_name field is recognized as an element of the name field. Queries such as “address contains New York” (search for any record that contains New York in the address field or any field under the address field) and “address/city contains New York” (search for any record that contains New York in the city field that is part of an address field) should rapidly retrieve the qualifying records using such an improved indexing and searching mechanism. To make fast and effective searches possible, certain external data structures need to be constructed to preserve the inherent structure information in the XML data and to provide a shortcut to locate particular items.
  • However, the current state of the art provides only limited alternatives for indexing and searching XML data. One approach is to create a separate index for each sub-field, which preserves the structural information of the data but drastically increases the overhead and is therefore not desirable. Another approach is to use a directed graph to represent the nested fields (Goldman R. et al., Lore: a database management system for XML, 2000). The search through a directed graph can be extremely computationally intensive and costly as the complexity of the data, and hence the complexity of the graph, grows. Both approaches result in an index structure whose complexity is comparable with that of the XML data itself. A more efficient and cost-saving indexing and searching method is desired.
  • SUMMARY OF THE INVENTION
  • To resolve the above problems, the present invention is directed to an improved method and a computer system for indexing and searching records in a language utilizing nested fields, such as XML. The present invention discloses an indexing and searching engine that constructs an improved full-text search index on the input XML data and then performs searches using the index. The indexing and searching engine according to the preferred embodiment of this invention supports exact matches and partial matches using a wildcard character.
  • In accordance with one aspect of the present invention, the method transforms the problem of indexing and searching nested field records, including XML data, into the problem of full-text indexing and searching of plain text documents. The input XML data is changed into a form that encodes the field structural information by suffixing each word with its corresponding field qualifiers in their nested entirety, or alternatively, by suffixing each word with a numerical code pattern that represents the word's corresponding field qualifiers in their nested entirety. The resulting encoded words are then stored in a full-text index structure.
  • In accordance with another aspect of the present invention, wildcard matching may be used to perform searches with or without field qualifiers. Searching with a wildcard and no field qualifiers identifies a record that includes a particular word regardless of the field in which it appears, whereas searching with a wildcard and field qualifiers identifies a record that includes a particular word in a designated field or in fields that share a certain level of similarly nested structure.
  • In accordance with yet another aspect of the present invention, a combination of string matching and integer pattern matching is used in the search of a particular word. The portion of the word without field qualifiers is first matched against the words in the index, and then the word's field qualifiers are transformed into a pattern of numerals, e.g., integers, to be matched against the integer patterns of the words in the index that correspond to their respective field qualifiers. Therefore, evaluation of complex field criteria is reduced to simpler and faster numeric matching.
  • The present invention, in all its aspects as a method and computer system for indexing and searching nested field records such as XML data and documents, significantly improves the effectiveness and speed of the search, and hence facilitates full realization of the advantages of XML as an extensible, portable data exchange format.
  • Further features, objects, and advantages of the present invention are apparent in the examples and in the detailed description that follows.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
  • Indexing
  • Though the present invention is described here with particular reference to XML records, the present invention extends to any nested field record where a series of fields and sub-fields are used to nest data from a record.
  • According to a preferred embodiment of this invention, a nested field record, such as an XML document or any data stored in XML form, is transformed into a text of words that encode the field structure context of each word in the XML data. The transformation is accomplished by giving each word in the XML document a suffix that represents the field information. First, each field name is assigned a numerical code, such as an integer. For example, the following encoding may be used in the sample XML document provided in the Background section:
    record = 1
    name = 2
    first_name = 3
    last_name = 4
    address = 5
    street = 6
    city = 7
    state = 8
  • Second, each word is assigned a suffix according to the numerical encoding of its field or nested fields. For example, in the sample XML document, the first occurrence of “Smith” is found in the record/name/last_name field. The numerical pattern of these nested field qualifiers is therefore “1/2/4,” and this occurrence is represented as “Smith/1/2/4.” Similarly, the second occurrence of “Smith” is found in the record/address/street field. Its numerical pattern is therefore “1/5/6,” and this occurrence is represented as “Smith/1/5/6.” Using the same encoding mechanism, the sample XML document would be transformed to the following text for indexing:
    John/1/2/3
    Smith/1/2/4
    123/1/5/6
    Smith/1/5/6
    Drive/1/5/6
    New/1/5/7
    York/1/5/7
    New/1/5/8
    York/1/5/8
  • And accordingly, a full-text index for this transformed data may be built as follows:
    123/1/5/6
    Drive/1/5/6
    John/1/2/3
    New/1/5/7
    New/1/5/8
    Smith/1/2/4
    Smith/1/5/6
    York/1/5/7
    York/1/5/8
  • Depending upon the complexity of the XML data, deeply nested structures may be reduced to lists of words suffixed by longer numeral or integer strings. There is no intrinsic limitation to this method; both the length of the word lists and the length of the suffix string may grow. The method of indexing according to the present invention is therefore more efficient and robust, and less computationally intensive, than available methods such as building a separate index for every field.
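  • The two indexing steps described above (assigning a code to each field name, then suffixing each word with the codes of its enclosing fields) can be sketched in Python. This is an illustrative reading of the method under stated assumptions, not the patented implementation: it uses the standard `xml.etree.ElementTree` parser, assigns codes in order of first appearance, splits words on whitespace, and the name `encode_words` is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<record>
  <name><first_name>John</first_name><last_name>Smith</last_name></name>
  <address>
    <street>123 Smith Drive</street>
    <city>New York</city>
    <state>New York</state>
  </address>
</record>"""


def encode_words(xml_text):
    """Return (field_ids, encoded_words) for an XML document."""
    field_ids = {}   # field name -> integer code, in order of first appearance
    encoded = []     # words suffixed with their nested field codes

    def walk(elem, path):
        code = field_ids.setdefault(elem.tag, len(field_ids) + 1)
        path = path + [str(code)]
        suffix = "/" + "/".join(path)
        for word in (elem.text or "").split():
            encoded.append(word + suffix)
        for child in elem:
            walk(child, path)
            # Text following a child's closing tag still belongs to this field.
            for word in (child.tail or "").split():
                encoded.append(word + suffix)

    walk(ET.fromstring(xml_text), [])
    return field_ids, encoded


field_ids, encoded = encode_words(SAMPLE)
index = sorted(encoded)  # a plain sorted list stands in for the full-text index
```

  • For the sample document this assigns record = 1 through state = 8 and yields tokens such as `Smith/1/2/4` and `Smith/1/5/6`. A production index would also store the location of each word, which this sketch omits.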
  • In an alternative embodiment of this invention, the field structure context of each word is encoded using strings of field names in the nested order. For example, as discussed above, the first occurrence of “Smith” is found in the record/name/last_name field, so this record may be represented as “Smith/record/name/last_name.” Similarly, the second occurrence of “Smith” is found in the record/address/street field, and this record may be represented as “Smith/record/address/street.” Using this encoding mechanism, the sample XML document would be transformed to the following text for indexing:
    John/record/name/first_name
    Smith/record/name/last_name
    123/record/address/street
    Smith/record/address/street
    Drive/record/address/street
    New/record/address/city
    York/record/address/city
    New/record/address/state
    York/record/address/state
  • And accordingly, a full-text index for this transformed data may be built as follows:
    123/record/address/street
    Drive/record/address/street
    John/record/name/first_name
    New/record/address/city
    New/record/address/state
    Smith/record/address/street
    Smith/record/name/last_name
    York/record/address/city
    York/record/address/state
  • The two alternative indexing methods according to the present invention may be used interchangeably for XML data of limited volume and complexity. However, when higher volumes of data with more complex nested field structures are involved, reducing the word suffix representation to a numeral or integer string will both save disk and memory space and decrease computational time for indexing and subsequent searches.
Searching
  • Once the encoded field qualifiers are stored in a full-text index along with each word, the content and the structure of the XML data are preserved. Various full-text index searches may be performed to identify a particular word in a particular field or fields using the index.
  • According to one embodiment of the present invention, wildcard matching may be used to perform searches with or without field qualifiers. To search for a particular word, e.g., “John”, without field qualifiers, a wildcard character is added to the end of the word following the delimiter “/”, e.g., “John/*”. This expression will match “John” in any field. To search for a particular word with field qualifiers, the field qualifiers encoded in the indexing operation are used along with wildcard characters, which represent unspecified fields. For example, in the above sample XML document, two steps need to be completed to search for “New York” contained in the field “/record/address”. First, “/record/address” is transformed to the integer string “/1/5”, using the field encodings established when the index was created. Because the search should also cover any fields that might be nested inside the address field, a wildcard character is added at the end, e.g., “/1/5/*”. Second, the numeral pattern of the field qualifiers is appended to each search term, e.g., “New/1/5/*” and “York/1/5/*”. This transformation converts any field search into an equivalent plain text search.
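  • By way of illustration only, the two-step transformation of a field search into plain-text wildcard terms might be sketched as follows. The names FieldIds, encodeFieldPath, toSearchTerms, and matchesTerm are hypothetical. The sketch assumes that the trailing wildcard is carried into every search term, and that a term ending in “/*” matches the named field itself as well as anything nested inside it; both are assumptions rather than statements of the original method.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Assumed to be the table built at indexing time (field name -> integer).
using FieldIds = std::map<std::string, long>;

// Step 1: "/record/address" -> "/1/5". An empty expression yields "".
// std::map::at throws std::out_of_range for a field that was never indexed.
std::string encodeFieldPath(const std::string& expr, const FieldIds& ids) {
    std::string encoded;
    std::istringstream parts(expr);
    std::string name;
    while (std::getline(parts, name, '/')) {
        if (name.empty()) continue;  // skip the leading '/'
        encoded += "/" + std::to_string(ids.at(name));
    }
    return encoded;
}

// Step 2: append the encoded qualifiers and a trailing wildcard to each word,
// e.g. "New" -> "New/1/5/*". With no field expression this gives "New/*".
std::vector<std::string> toSearchTerms(const std::vector<std::string>& words,
                                       const std::string& fieldExpr,
                                       const FieldIds& ids) {
    std::vector<std::string> terms;
    const std::string suffix = encodeFieldPath(fieldExpr, ids);
    for (const auto& word : words) terms.push_back(word + suffix + "/*");
    return terms;
}

// Plain-text match of one index entry against one term. Only a trailing "/*"
// is treated as a wildcard in this sketch.
bool matchesTerm(const std::string& entry, const std::string& term) {
    assert(!term.empty());
    if (term.size() >= 2 && term.compare(term.size() - 2, 2, "/*") == 0) {
        const std::string base = term.substr(0, term.size() - 2);
        if (entry == base) return true;
        // The '/' after the base keeps "/1/5" from matching "/1/55".
        return entry.compare(0, base.size() + 1, base + "/") == 0;
    }
    return entry == term;
}
```

  • With the sample field codes, the words “New” and “York” restricted to “/record/address” become the terms “New/1/5/*” and “York/1/5/*”, which match the index entries “New/1/5/7” and “York/1/5/8” but not “Smith/1/2/4”.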
  • By way of example, the following is a computer program segment implementing the method of encoding field qualifiers by a pattern or an array of integers, according to one embodiment of the present invention. This function converts a field expression, such as “/record//name” into a corresponding numeric array. A flag “fUseWildcards” is used to specify whether a particular field expression in a search query may contain wildcard characters. For example, the query “/record//name contains Smith” finds any field “name” within a field “record” that has “Smith” as the value of the “name” field. Therefore, a wildcard character is used between the field “record” and the field “name.” In other words, this expression should be able to also match “/record/patient/name”, and “/record/name”, etc. When a delimiter “/” is used at the beginning of the expression, such as “/record/name”, the “record” field is the top level field element. When there is no “/” at the beginning of the expression, such as “record/name”, the “record” field can be nested inside other fields. That is, “record/name” can match expressions such as “/table/record/name” and “/customer/record/name.” Therefore, a wildcard character should be used at the front of a field expression when there is no delimiter “/”.
  • To perform the encoding, the input field expression is tokenized based on the delimiter “/”. Each string token is then mapped to a numeric value or identifier by calling the function “getFieldId.” If no identifier is returned, the token is inserted into the table of field names so that a unique id is created for the token by a separate function.
    void encodeFieldExpression(const char *expr, FieldIdList &fieldId,
                               int fUseWildcards) {
        if (fUseWildcards) {
            // No leading '/': the first field may be nested at any depth.
            if (*expr != '/')
                fieldId.append(matchAny);
            else
                expr++;  // skip the leading '/' that anchors the top level
        }
        DStringSet s;
        s.tokenize(expr, '/', fUseWildcards);
        for (int i = 0; i < s.getCount(); ++i) {
            const char *str = s.getString(i);
            if (strIsBlank(str))
                fieldId.append(matchAny);  // empty token, as in "//"
            else {
                long id = getFieldId(str);
                if (id == FAIL)
                    id = add(str);  // unseen field name: register it
                fieldId.append(id);
            }
        }
    }
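  • By way of illustration only, a self-contained variant of the above encoding function might be sketched with standard containers as follows. The FieldTable type, the kMatchAny sentinel value of -1, and the getOrAdd member are assumptions introduced for this sketch; the DStringSet and FieldIdList types of the listing above are not reproduced. Unlike the listing above, this variant skips a leading “/” whether or not wildcards are enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

const long kMatchAny = -1;  // assumed sentinel; real field ids start at 1

// Hypothetical table of field names; ids are handed out in first-seen order.
struct FieldTable {
    std::map<std::string, long> ids;
    long getOrAdd(const std::string& name) {
        assert(!name.empty());
        auto it = ids.find(name);
        if (it != ids.end()) return it->second;
        long id = static_cast<long>(ids.size()) + 1;
        ids.emplace(name, id);
        return id;
    }
};

// Converts a field expression such as "/record//name" into an integer array.
std::vector<long> encodeFieldExpression(const std::string& expr,
                                        FieldTable& table,
                                        bool useWildcards) {
    std::vector<long> out;
    std::size_t pos = 0;
    if (!expr.empty() && expr[0] == '/') {
        pos = 1;  // a leading '/' anchors the expression at the top level
    } else if (useWildcards) {
        out.push_back(kMatchAny);  // may be nested inside any other fields
    }
    while (pos <= expr.size()) {
        std::size_t next = expr.find('/', pos);
        if (next == std::string::npos) next = expr.size();
        const std::string token = expr.substr(pos, next - pos);
        if (token.empty()) {
            // An empty token comes from "//": any fields may intervene.
            if (useWildcards) out.push_back(kMatchAny);
        } else {
            out.push_back(table.getOrAdd(token));
        }
        pos = next + 1;
    }
    return out;
}
```

  • Starting from an empty table, “/record//name” encodes to the array (1, matchAny, 2), and “record/name” then encodes to (matchAny, 1, 2).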
  • By way of example, the following is a computer program segment implementing pattern-matching using numeral encoding of field qualifiers, according to one embodiment of the present invention. Suppose “a” is the numeric encoding of the field qualifiers of a word in the index, and “b” is the numeric encoding of a field qualifier in a search query. Each of the integers in the “a” and “b” arrays corresponds to a field name. The “b” array may contain wildcard characters so that the query will support words with similarly nested field structures. For example, as discussed above, the query “/record//name contains Smith” matches any field “name” within a field “record” that has “Smith” as the value of the “name” field. To match this expression, the “b” array would contain: <record code>, <matchAny>, <name code>, where <record code> is the integer corresponding to the “record” field, <matchAny> is a wildcard character that matches any number of values, and <name code> is the integer corresponding to the “name” field.
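    static int lMatch(const long *a, const long *b) {
        // Both arrays are zero-terminated; walk them in step.
        while (*a && *b)
            if ((*a == *b) || (*b == matchOne)) {
                a++;
                b++;
            }
            else if (*b == matchAny) {
                b++;
                if (!*b)
                    return true;  // trailing matchAny matches the rest of a
                // Try to match the rest of b at each remaining position of a.
                while (*a) {
                    if (lMatch(a, b))
                        return true;
                    else
                        a++;
                }
                return false;
            }
            else
                return false;
        if (*a)
            return false;  // pattern exhausted but fields remain
        if (*b) {
            // Only a single trailing matchAny may remain in the pattern.
            if (*b != matchAny)
                return false;
            b++;
            if (*b)
                return false;
        }
        return true;
    }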
  • A modified matching method is used in an alternative embodiment of the present invention. This method uses a combination of string matching and integer pattern matching to identify a particular word in a particular field or fields. First, the portion of the word absent the field identifiers or their numeral encoding is matched against words in the index, to identify the matched records. This is a typical word look-up used for a text search that is not limited by fields. Second, the field expression of the search query is transformed into a numeral or integer pattern, which is then matched against the numerical encoding representing the field qualifiers of each word in the index. The resulting matches are subsequently combined with the matches from the first step. For example, a search request “address/street contains Oak” may be converted to the integer pattern (*,5,6,*) associated with the word “Oak”. The wildcard characters at the beginning and the end of the pattern indicate that the address field may be inside another field and that additional fields may be nested inside the street field. Therefore, evaluation of a complex field expression is reduced to a simple matching of integer patterns. Replacement of string comparisons with numerical comparisons accordingly improves the speed of the search.
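  • By way of illustration only, the two-step search of this alternative embodiment might be sketched as follows. The sketch assumes an in-memory index that maps each word to the integer field paths in which it occurs; the names Index, FieldPath, matchFrom, and search, and the kMatchAny sentinel value of -1, are hypothetical. The wildcard is taken to stand for zero or more fields.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

const long kMatchAny = -1;  // assumed sentinel for "zero or more fields"

using FieldPath = std::vector<long>;
using Index = std::multimap<std::string, FieldPath>;  // word -> field path

// Integer pattern matching: does path[i..] match pattern[j..]?
bool matchFrom(const FieldPath& path, std::size_t i,
               const FieldPath& pattern, std::size_t j) {
    if (j == pattern.size()) return i == path.size();
    if (pattern[j] == kMatchAny) {
        // Let the wildcard absorb 0, 1, 2, ... of the remaining fields.
        for (std::size_t k = i; k <= path.size(); ++k)
            if (matchFrom(path, k, pattern, j + 1)) return true;
        return false;
    }
    return i < path.size() && path[i] == pattern[j] &&
           matchFrom(path, i + 1, pattern, j + 1);
}

// Step 1 looks the bare word up in the index; step 2 filters the resulting
// occurrences by comparing integers instead of field-name strings.
std::vector<FieldPath> search(const Index& index, const std::string& word,
                              const FieldPath& pattern) {
    assert(!word.empty());
    std::vector<FieldPath> hits;
    auto range = index.equal_range(word);
    for (auto it = range.first; it != range.second; ++it)
        if (matchFrom(it->second, 0, pattern, 0)) hits.push_back(it->second);
    return hits;
}
```

  • For an index in which “Oak” occurs in the field paths (1,5,6) and (1,2,4), the pattern (*,5,6,*) returns only the occurrence at (1,5,6).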
  • Although alternative embodiments of the present invention have been described in detail, it is to be understood that the same is by way of illustration and example only, and is not to be taken by way of limitation. Other modifications and variations that do not depart from the scope and spirit of the invention are understood to be a part thereof.
  • All references cited above are expressly incorporated herein to the same extent as if each was individually incorporated by reference.

Claims (8)

1. A method of creating a searchable index for a nested field record containing data, comprising assigning at least one abbreviated identifier to at least one field of the record containing data, and creating an index comprising a string that includes the abbreviated identifier matched with the data of said field.
2. The method of claim 1, wherein the nested field record is an XML record.
3. The method of claim 2, wherein the abbreviated identifier is a numeral.
4. The method of claim 3, wherein the data of the XML record includes words.
5. The method of claim 4, wherein the string of the index is organized as numeral/word, wherein the numeral identifies the field and the word is the data from the XML record.
6. An index created by the method of claim 1.
7. A method of searching for desired data contained in a nested field record, comprising searching the index of claim 5 for the desired data and the matched abbreviated identifier.
8. A computer readable medium comprising instructions for carrying out the method of claim 7.
US11/858,238 2000-04-14 2007-09-20 method and system for indexing and searching contents of extensible markup language (xml) documents Abandoned US20080010313A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/858,238 US20080010313A1 (en) 2000-04-14 2007-09-20 method and system for indexing and searching contents of extensible markup language (xml) documents

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US09/549,533 US6782380B1 (en) 2000-04-14 2000-04-14 Method and system for indexing and searching contents of extensible mark-up language (XML) documents
US10/902,144 US7289986B2 (en) 2000-04-14 2004-07-30 Method and system for indexing and searching contents of extensible markup language (XML) documents
US11/858,238 US20080010313A1 (en) 2000-04-14 2007-09-20 method and system for indexing and searching contents of extensible markup language (xml) documents

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/902,144 Division US7289986B2 (en) 2000-04-14 2004-07-30 Method and system for indexing and searching contents of extensible markup language (XML) documents

Publications (1)

Publication Number Publication Date
US20080010313A1 true US20080010313A1 (en) 2008-01-10

Family

ID=32869744

Family Applications (3)

Application Number Title Priority Date Filing Date
US09/549,533 Expired - Lifetime US6782380B1 (en) 2000-04-14 2000-04-14 Method and system for indexing and searching contents of extensible mark-up language (XML) documents
US10/902,144 Expired - Lifetime US7289986B2 (en) 2000-04-14 2004-07-30 Method and system for indexing and searching contents of extensible markup language (XML) documents
US11/858,238 Abandoned US20080010313A1 (en) 2000-04-14 2007-09-20 method and system for indexing and searching contents of extensible markup language (xml) documents

Country Status (1)

Country Link
US (3) US6782380B1 (en)

Patent Citations (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5701459A (en) * 1993-01-13 1997-12-23 Novell, Inc. Method and apparatus for rapid full text index creation
US5703655A (en) * 1995-03-24 1997-12-30 U S West Technologies, Inc. Video programming retrieval using extracted closed caption data which has been partitioned and stored to facilitate a search and retrieval process
US5701469A (en) * 1995-06-07 1997-12-23 Microsoft Corporation Method and system for generating accurate search results using a content-index
US6026388A (en) * 1995-08-16 2000-02-15 Textwise, Llc User interface and other enhancements for natural language information retrieval system and method
US5963940A (en) * 1995-08-16 1999-10-05 Syracuse University Natural language information retrieval system and method
US5778361A (en) * 1995-09-29 1998-07-07 Microsoft Corporation Method and system for fast indexing and searching of text in compound-word languages
US5721897A (en) * 1996-04-09 1998-02-24 Rubinstein; Seymour I. Browse by prompted keyword phrases with an improved user interface
US5913208A (en) * 1996-07-09 1999-06-15 International Business Machines Corporation Identifying duplicate documents from search results without comparing document content
US6167393A (en) * 1996-09-20 2000-12-26 Novell, Inc. Heterogeneous record search apparatus and method
US5913209A (en) * 1996-09-20 1999-06-15 Novell, Inc. Full text index reference compression
US6466940B1 (en) * 1997-02-21 2002-10-15 Dudley John Mills Building a database of CCG values of web pages from extracted attributes
US6269362B1 (en) * 1997-12-19 2001-07-31 Alta Vista Company System and method for monitoring web pages by comparing generated abstracts
US6094649A (en) * 1997-12-22 2000-07-25 Partnet, Inc. Keyword searches of structured databases
US6240407B1 (en) * 1998-04-29 2001-05-29 International Business Machines Corp. Method and apparatus for creating an index in a database system
US6366934B1 (en) * 1998-10-08 2002-04-02 International Business Machines Corporation Method and apparatus for querying structured documents using a database extender
US6421656B1 (en) * 1998-10-08 2002-07-16 International Business Machines Corporation Method and apparatus for creating structure indexes for a data base extender
US6360215B1 (en) * 1998-11-03 2002-03-19 Inktomi Corporation Method and apparatus for retrieving documents based on information other than document content
US6345252B1 (en) * 1999-04-09 2002-02-05 International Business Machines Corporation Methods and apparatus for retrieving audio information using content and speaker information
US6266094B1 (en) * 1999-06-14 2001-07-24 Medialink Worldwide Incorporated Method and apparatus for the aggregation and selective retrieval of television closed caption word content originating from multiple geographic locations
US6636845B2 (en) * 1999-12-02 2003-10-21 International Business Machines Corporation Generating one or more XML documents from a single SQL query
US20020133484A1 (en) * 1999-12-02 2002-09-19 International Business Machines Corporation Storing fragmented XML data into a relational database by decomposing XML documents with application specific mappings
US6418448B1 (en) * 1999-12-06 2002-07-09 Shyam Sundar Sarkar Method and apparatus for processing markup language specifications for data and metadata used inside multiple related internet documents to navigate, query and manipulate information from a plurality of object relational databases over the web
US20010007987A1 (en) * 1999-12-14 2001-07-12 Nobuyuki Igata Structured-document search apparatus and method, recording medium storing structured-document searching program, and method of creating indexes for searching structured documents
US6782380B1 (en) * 2000-04-14 2004-08-24 David Victor Thede Method and system for indexing and searching contents of extensible mark-up language (XML) documents
US20030041053A1 (en) * 2001-08-23 2003-02-27 Chantal Roth System and method for accessing biological data
US20030140027A1 (en) * 2001-12-12 2003-07-24 Jeffrey Huttel Universal Programming Interface to Knowledge Management (UPIKM) database system with integrated XML interface
US20050256900A1 (en) * 2001-12-12 2005-11-17 Nmatrix, Inc. Universal programming interface to knowledge management (UPIKM) database system with integrated XML interface

Also Published As

Publication number Publication date
US7289986B2 (en) 2007-10-30
US20050004935A1 (en) 2005-01-06
US6782380B1 (en) 2004-08-24

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION