US20080228743A1 - System and method for multi-dimensional aggregation over large text corpora - Google Patents

System and method for multi-dimensional aggregation over large text corpora

Info

Publication number
US20080228743A1
Authority
US
United States
Prior art keywords
list
query
values
data
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/129,850
Inventor
Jeffrey A. Kusnitz
Daniel N. Meredith
Linda A. Nguyen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SAP SE
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US12/129,850
Publication of US20080228743A1
Assigned to SAP AG. Assignment of assignors interest (see document for details). Assignors: INTERNATIONAL BUSINESS MACHINES CORPORATION
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists

Abstract

Systems and methods for multi-dimensional aggregation. Exemplary embodiments include a method for retrieving data from an inverted list index within a computer system, wherein the index comprises annotated postings, the method including receiving a query in a system, converting the query into a query language, scanning at least one list of postings for data from the query, aggregating the data in the list, thereby resulting in an aggregated list, wherein the aggregating includes recording the occurrence of unique values from the list, mapping the values using a user-provided definition to an alternate value, grouping the values by a user-provided mapping of values to groups, recording and mutating data associated with the unique value in the list, relating the recorded data values with other values in the index and returning the requested data from the aggregated list in a return format.

Description

  • CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 11/686,639, filed Mar. 15, 2007, the disclosure of which is incorporated by reference herein in its entirety.
  • TRADEMARKS
  • IBM® is a registered trademark of International Business Machines Corporation, Armonk, N.Y., U.S.A. Other names used herein may be registered trademarks, trademarks or product names of International Business Machines Corporation or other companies.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • This invention relates to inverted indexes used in text corpora indexing, and particularly to systems and methods for multi-dimensional aggregation.
  • 2. Description of Background
  • An inverted index is constructed over a given corpus of documents, and consists of two primary structures: 1) a dictionary of all the unique terms in the corpus, and 2) for each term in the dictionary, a list of documents that contain the term. The area of large text indexing is an active research space, and many advancements have been made over the years toward improving the efficiency, performance and scale of indexes. Yet the general functionality of an index has not changed drastically during that period.
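The two structures described above can be illustrated with a minimal sketch. It is not taken from the patent: the function name `build_inverted_index`, the whitespace tokenizer and the sample corpus are assumptions chosen only for illustration.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """Map each unique term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        # Naive tokenization (lowercase, split on whitespace); an assumption.
        for term in text.lower().split():
            index[term].add(doc_id)
    # The keys act as the dictionary of unique terms;
    # each value is that term's list of documents.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Hypothetical three-document corpus.
corpus = {
    1: "IBM files quarterly report",
    2: "quarterly results released",
    3: "IBM research on text indexing",
}
index = build_inverted_index(corpus)
```

Looking up `index["ibm"]` then answers the simple Boolean query "find all documents that contain the word 'IBM'" without rescanning the corpus.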
  • In general, inverted indexes are built to serve very simple Boolean queries, such as “Find all documents that contain the word ‘IBM’”. Indexes respond to such queries with a subset of the documents that contain the terms, and potentially an estimate of how many other documents also contain the term. Yet the data within an index can be used to provide much more insight than a list of documents for the user to investigate manually. For example, inverted indexes can be used for aggregation of unstructured information across multiple dimensions for large corpora. For instance, aggregation could provide a by-email-address count of all e-mail addresses found in the .edu domain. However, current unstructured indexing techniques do not handle aggregation operations well, and current aggregation techniques do not handle unstructured information well.
  • SUMMARY OF THE INVENTION
  • Exemplary embodiments include a method for retrieving data from an inverted index within a computer system, wherein the index comprises annotated postings, the method including receiving a query in a system, converting the query into a query language, scanning at least one list of postings for data from the query, aggregating the data in the list, thereby resulting in an aggregated list, wherein the aggregating includes recording the occurrence of unique values from the list, mapping the values using a user-provided definition to an alternate value, grouping the values by a user-provided mapping of values to groups, recording and mutating data associated with the unique value in the list, relating the recorded data values with other values in the index and returning the requested data from the aggregated list in a return format.
  • Additional exemplary embodiments include a method for multi-dimension inverted index aggregation within a computer system having an input device, a memory and a display, the method including receiving a query in the memory from the input device, converting the query into a query language and sending the request to an index server, parsing the query and identifying requisite postings lists and aggregation keys and functions, initializing the aggregation functions, while results are being collected and prior to a terminating condition (e.g. the expiration of a pre-determined time or consumption of a fixed number of postings/matches), iteratively seeking through the postings list for matches to the query, passing the aggregation keys to the aggregation functions in response to a match, processing the keys with a respective function and mutating key-specific data, entering an index to a table from an output of the functions and collecting the aggregation results and returning the results to the display.
  • System and computer program products corresponding to the above-summarized methods are also described and claimed herein.
  • Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with advantages and features, refer to the description and to the drawings.
  • TECHNICAL EFFECTS
  • As a result of the summarized invention, an indexing strategy and postings format that allow for efficient queries across classes of metadata, and a framework for analyzing and aggregating postings metadata, have been achieved.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
  • FIG. 1 illustrates an inverted list format in accordance with exemplary embodiments;
  • FIG. 2 illustrates a system level diagram of an exemplary multi-dimensional aggregation system; and
  • FIG. 3 illustrates an exemplary multi-dimensional aggregation method.
  • The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
  • DETAILED DESCRIPTION OF THE INVENTION
  • Exemplary embodiments include multi-dimensional aggregation where a subsystem is built on top of an existing inverted list index such that candidate documents can be efficiently scanned by relating data values observed to other data values in the index.
  • Exemplary embodiments further include indexing strategies and postings format that allow for efficient queries across classes of metadata and a framework for analyzing postings metadata. In one implementation, a metadata typing system, and a per posting data field which can store metadata related to a given posting are provided. In another exemplary implementation, a group of query-time operations which provide aggregation and numerical analyses on the metadata stored per posting is provided.
  • Exemplary embodiments further include a method for retrieving data from an inverted list index within a computer system, wherein the index includes annotated postings, the method consisting of receiving a query in the computer, converting the query into a query language, scanning at least one list of postings for data from the query, aggregating the data in the list, thereby resulting in an aggregated list, wherein the aggregating includes recording the occurrence of a unique value from the list, recording and/or mutating one or more data values relating to the unique value in the list, and relating the recorded values in the index, and returning the requested data from the aggregated list in a return format.
  • The flexible indexing framework allows for storing mined data in the index and accessing it through an index term. For example, a data miner that tags documents whenever it finds a person's name can be implemented. Using an index term such as <<PERSON>>, the indexing framework can record all names in an inverted list, using the data fields to store the individual names. Queries can then be supported such as, “Find me all documents that contain “quarterly report”, “IBM” and any ‘<<PERSON>>’”. Answering such a question requires only four inverted lists. Additionally, a query engine can return the list of all the names that were actually hidden behind the postings of the term <<PERSON>>. This feature gives users the ability to find documents and to learn more about the document set as well. The additional overhead of using data fields is offset by the token-type model deployed, which allows for a tailored compression mechanism based on the type of an index term, as well as by the added capabilities of the index in answering queries.
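The <<PERSON>> example above can be sketched as follows. This is an illustration under stated assumptions, not the patented postings format: the `Posting` layout, the helper names `docs`, `phrase_docs` and `query_with_names`, and all sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Posting:
    doc_id: int
    position: int                # positional information for the occurrence
    data: Optional[str] = None   # per-posting data field (hypothetical layout)


# Four inverted lists: "quarterly", "report", "ibm" and the mined term <<PERSON>>.
index = {
    "quarterly": [Posting(1, 0), Posting(2, 3)],
    "report": [Posting(1, 1), Posting(2, 4)],
    "ibm": [Posting(1, 5), Posting(3, 0)],
    "<<PERSON>>": [
        Posting(1, 8, "Ada Lovelace"),
        Posting(1, 12, "Alan Turing"),
        Posting(3, 2, "Grace Hopper"),
    ],
}


def docs(term):
    """Ids of all documents that contain the term."""
    return {p.doc_id for p in index[term]}


def phrase_docs(first, second):
    """Ids of documents where `second` occurs immediately after `first`."""
    following = {(p.doc_id, p.position) for p in index[second]}
    return {p.doc_id for p in index[first] if (p.doc_id, p.position + 1) in following}


def query_with_names():
    """Documents with "quarterly report", "IBM" and any <<PERSON>>, plus the names found."""
    matches = phrase_docs("quarterly", "report") & docs("ibm") & docs("<<PERSON>>")
    names = [p.data for p in index["<<PERSON>>"] if p.doc_id in matches]
    return matches, names


matches, names = query_with_names()
```

The query touches exactly four lists, and the data fields of the <<PERSON>> postings let the engine report which names were found in the matching documents.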
  • To overcome the burden of transferring large sets of data from the data fields along with the list of documents that match a query, an aggregation over inverted list metadata method can be employed. The query language is extended with an AGGREGATION operator that allows processing of all the data fields for all postings for a given index term. The method can be implemented to count unique data fields and return the top N values with their counts. The query can then return the set of document identifiers satisfying the query and a much smaller additional set containing the aggregate view of the <<PERSON>> inverted list.
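A top-N aggregation of this kind might look like the sketch below. The patent does not specify the operator's syntax or implementation; the function name `aggregate_top_n` and the `(doc_id, data)` tuple layout are assumptions, and the counting relies on Python's standard `collections.Counter`.

```python
from collections import Counter


def aggregate_top_n(postings, n):
    """Count unique data-field values in one postings list; return the n most common."""
    counts = Counter(data for _doc_id, data in postings if data is not None)
    return counts.most_common(n)


# Hypothetical <<PERSON>> postings as (doc_id, data field) pairs.
person_postings = [
    (1, "Ada Lovelace"), (1, "Alan Turing"), (2, "Ada Lovelace"),
    (3, "Grace Hopper"), (4, "Ada Lovelace"), (4, "Alan Turing"),
]
top = aggregate_top_n(person_postings, 2)
```

Only the short `top` list needs to travel back with the document identifiers, rather than every data field in the postings list.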
  • Within the indexes described herein, the metadata portion of a posting is expanded to potentially include an arbitrary data value associated with the posting as shown in FIG. 1, in which the location block in each posting represents the positional information. Adding more data to a posting is discussed further in the description below, in which query processing and techniques for minimizing its impact are discussed. Exemplary conforming indexes employ methods that allow the analysis and annotation of unstructured information (e.g., web documents), and provide a framework to build an index of the annotation and analysis results.
  • The embodiments described herein support several aggregation features, such as, but not limited to: aggregation on single or multiple keys, and in the case of multiple keys, the order of the aggregation can be specified; map functions can be defined in order to transform values; partitions of the key space can be specified, in order to aggregate into custom segments; and process functions can be defined which specify how values are aggregated.
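The features listed above (ordered multiple keys, map functions, key-space partitions and process functions) can be combined in one small routine. The sketch below is an assumption-laden illustration rather than the patent's design: the `aggregate` signature, its parameter names and the sample records are all hypothetical.

```python
from collections import defaultdict


def aggregate(records, keys, map_fns=None, partition_fns=None, process=None, initial=0):
    """Aggregate records over an ordered list of keys.

    map_fns:       key -> function transforming a raw value
    partition_fns: key -> function assigning a (mapped) value to a custom segment
    process:       function (accumulator, record) -> new accumulator
    """
    map_fns = map_fns or {}
    partition_fns = partition_fns or {}
    process = process or (lambda acc, record: acc + 1)  # default: count occurrences
    table = defaultdict(lambda: initial)
    for record in records:
        group = []
        for key in keys:  # the order of `keys` fixes the order of aggregation
            value = record[key]
            if key in map_fns:
                value = map_fns[key](value)
            if key in partition_fns:
                value = partition_fns[key](value)
            group.append(value)
        group = tuple(group)
        table[group] = process(table[group], record)
    return dict(table)


# Hypothetical mined e-mail occurrences with the date of the containing document.
records = [
    {"email": "ann@mit.edu", "date": "2007-01-15"},
    {"email": "bob@MIT.edu", "date": "2007-01-20"},
    {"email": "cy@cmu.edu", "date": "2007-02-03"},
    {"email": "ann@mit.edu", "date": "2007-02-11"},
]
by_domain = aggregate(
    records,
    keys=["email", "date"],
    map_fns={
        "email": lambda e: e.split("@")[1].lower(),  # address -> domain
        "date": lambda d: d[:7],                     # day -> month
    },
    partition_fns={"date": lambda m: "Jan" if m.endswith("-01") else "other"},
)
```

Passing a different `process` function (for example, one that sums a numeric field) changes how values accumulate without touching the grouping logic.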
  • FIG. 2 illustrates a system level diagram of an exemplary multi-dimensional aggregation system 200, which includes computer 205, network 210 and index server 215. In an exemplary multi-dimensional method 300 as illustrated in FIG. 3, a user can, at step 305, enter a query, such as “show me how many times each month American Idol is mentioned on the Internet”, in computer 205, which converts the query, at step 310, into a well-defined query language and sends the request to index server 215 at step 315, which can be via network 210. Index server 215 parses the query at step 320 and identifies the requisite postings lists and the required aggregation keys and functions. At step 325, the aggregation functions are initialized and an empty results table is created. In general, while time remains and not enough results have been collected, index server 215 seeks through the postings lists for matches. When a match is found at step 330, the aggregation keys are passed to the aggregation function in step 335, which processes the keys with the indicated function, increments the key-specific counters and accumulates the results in the results table. The index server 215 collects the aggregation results table and returns it to computer 205 at step 340.
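The server-side portion of method 300 can be sketched as a bounded scan loop. This is a hedged illustration only: the function `run_aggregation`, its parameters `max_seconds` and `max_matches`, and the dictionary-shaped postings are assumptions, and the step numbers in the comments map loosely onto the description above.

```python
import time
from collections import defaultdict


def run_aggregation(postings, matches_query, key_fn, max_seconds=1.0, max_matches=1000):
    """Scan postings until out of time or enough matches; count matches per key."""
    results = defaultdict(int)                  # step 325: empty results table
    deadline = time.monotonic() + max_seconds
    matched = 0
    for posting in postings:
        # Terminating conditions: time budget spent or enough matches consumed.
        if time.monotonic() >= deadline or matched >= max_matches:
            break
        if matches_query(posting):              # step 330: a match is found
            results[key_fn(posting)] += 1       # step 335: key-specific counter
            matched += 1
    return dict(results)                        # results table returned to the client


# Hypothetical postings carrying a document date and text snippet.
postings = [
    {"doc_id": 1, "date": "2007-01-09", "text": "American Idol premiere"},
    {"doc_id": 2, "date": "2007-01-17", "text": "American Idol recap"},
    {"doc_id": 3, "date": "2007-02-02", "text": "weather report"},
    {"doc_id": 4, "date": "2007-02-21", "text": "American Idol finalists"},
]
monthly = run_aggregation(
    postings,
    matches_query=lambda p: "american idol" in p["text"].lower(),
    key_fn=lambda p: p["date"][:7],             # aggregation key: month
)
```

Because the loop stops on either condition, a partial results table is still returned when the budget runs out, which trades completeness for a bounded response time.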
  • An alternate exemplary method for retrieving data from an inverted list index within a computer system, wherein the index comprises annotated postings, includes scanning at least one list of the postings for the data; aggregating the data in the list, thereby resulting in an aggregated list; and returning the requested data from the aggregated list in a return format. In one exemplary implementation, aggregating includes recording the occurrence of a unique value from the list and recording the frequency of the unique value in the list. The aggregating can further include relating the recorded values to the remaining values in the index, and the relating can include creating related tables of the values.
  • Furthermore, in other exemplary implementations the method can further include aggregating counts of the values over at least one key, aggregating counts of the mappings of the values over at least one key, aggregating counts of the values over at least one set of values associated with at least one key, aggregating mappings of the values over at least one set of values associated with at least one key, and aggregating mappings of alternate values over an aggregation of the values over at least one key.
  • The capabilities of the present invention can be implemented in software, firmware, hardware or some combination thereof.
  • As one example, one or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
  • Additionally, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
  • The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
  • While the preferred embodiment of the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first described.

Claims (5)

1. A method for retrieving data from an inverted list index within a computer system, wherein the index comprises annotated postings, the method comprising:
receiving a query in a system;
converting the query into a query language;
scanning at least one list of postings for data from the query;
aggregating the data in the list, thereby resulting in an aggregated list, wherein the aggregating includes:
recording the occurrence of unique values from the list;
mapping the values using a user-provided definition to an alternate value;
grouping the values by a user-provided mapping of values to groups;
recording and mutating data associated with the unique value in the list;
relating the recorded data values with other values in the index; and
returning the requested data from the aggregated list in a return format.
2. The method as claimed in claim 1 wherein the annotated postings contain per-document identification, per-occurrence identification, and per-occurrence related data, wherein alternately per-occurrence related data is accessible using per-document identification and per-occurrence identification.
3. The method as claimed in claim 2 wherein the unique value is the result of a computation on a pre-existing value.
4. The method as claimed in claim 3 wherein recording data associated with the unique value takes place during query processing.
5. A method for multi-dimensional inverted index aggregation within a computer system having an input device, a memory and a display, the method consisting of:
receiving a search query in the memory from the input device;
converting the query into a query language and sending the request to an index server;
parsing the query and identifying requisite postings lists and aggregation keys and functions;
initializing the aggregation functions;
while results are being collected and prior to the expiration of a pre-determined time, iteratively seeking through the postings list for matches to the query;
passing the aggregation keys to the aggregation functions in response to a match;
processing the keys with a respective function and incrementing a key-specific counter;
entering an index to a table from an output of the functions; and
collecting the aggregation results and returning the results to the display.
US12/129,850 2007-03-15 2008-05-30 System and method for multi-dimensional aggregation over large text corpora Abandoned US20080228743A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/129,850 US20080228743A1 (en) 2007-03-15 2008-05-30 System and method for multi-dimensional aggregation over large text corpora

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/686,639 US7720837B2 (en) 2007-03-15 2007-03-15 System and method for multi-dimensional aggregation over large text corpora
US12/129,850 US20080228743A1 (en) 2007-03-15 2008-05-30 System and method for multi-dimensional aggregation over large text corpora

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US11/686,639 Continuation US7720837B2 (en) 2007-03-15 2007-03-15 System and method for multi-dimensional aggregation over large text corpora

Publications (1)

Publication Number Publication Date
US20080228743A1 (en) 2008-09-18

Family

ID=39763665

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/686,639 Active 2028-06-08 US7720837B2 (en) 2007-03-15 2007-03-15 System and method for multi-dimensional aggregation over large text corpora
US12/129,850 Abandoned US20080228743A1 (en) 2007-03-15 2008-05-30 System and method for multi-dimensional aggregation over large text corpora

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US11/686,639 Active 2028-06-08 US7720837B2 (en) 2007-03-15 2007-03-15 System and method for multi-dimensional aggregation over large text corpora

Country Status (1)

Country Link
US (2) US7720837B2 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110040761A1 (en) * 2009-08-12 2011-02-17 Globalspec, Inc. Estimation of postings list length in a search system using an approximation table
US8122029B2 (en) * 2007-06-08 2012-02-21 Apple Inc. Updating an inverted index
US20120054182A1 (en) * 2010-08-24 2012-03-01 International Business Machines Corporation Systems and methods for massive structured data management over cloud aware distributed file system
US20130311438A1 (en) * 2012-05-18 2013-11-21 Splunk Inc. Flexible schema column store
US9817853B1 (en) * 2012-07-24 2017-11-14 Google Llc Dynamic tier-maps for large online databases
US9990386B2 (en) 2013-01-31 2018-06-05 Splunk Inc. Generating and storing summarization tables for sets of searchable events
US10061807B2 (en) 2012-05-18 2018-08-28 Splunk Inc. Collection query driven generation of inverted index for raw machine data
US10229150B2 (en) 2015-04-23 2019-03-12 Splunk Inc. Systems and methods for concurrent summarization of indexed data
US10331720B2 (en) 2012-09-07 2019-06-25 Splunk Inc. Graphical display of field values extracted from machine data
US10474674B2 (en) 2017-01-31 2019-11-12 Splunk Inc. Using an inverted index in a pipelined search query to determine a set of event data that is further limited by filtering and/or processing of subsequent query pipestages
US11321311B2 (en) 2012-09-07 2022-05-03 Splunk Inc. Data model selection and application based on data sources

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU4328000A (en) 1999-03-31 2000-10-16 Verizon Laboratories Inc. Techniques for performing a data query in a computer system
US8275661B1 (en) 1999-03-31 2012-09-25 Verizon Corporate Services Group Inc. Targeted banner advertisements
US8572069B2 (en) * 1999-03-31 2013-10-29 Apple Inc. Semi-automatic index term augmentation in document retrieval
US6718363B1 (en) * 1999-07-30 2004-04-06 Verizon Laboratories, Inc. Page aggregation for web sites
US9535979B2 (en) 2013-06-21 2017-01-03 International Business Machines Corporation Multifaceted search

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5852822A (en) * 1996-12-09 1998-12-22 Oracle Corporation Index-only tables with nested group keys
US5915249A (en) * 1996-06-14 1999-06-22 Excite, Inc. System and method for accelerated query evaluation of very large full-text databases
US6105023A (en) * 1997-08-18 2000-08-15 Dataware Technologies, Inc. System and method for filtering a document stream
US6349308B1 (en) * 1998-02-25 2002-02-19 Korea Advanced Institute Of Science & Technology Inverted index storage structure using subindexes and large objects for tight coupling of information retrieval with database management systems
US20020161757A1 (en) * 2001-03-16 2002-10-31 Jeffrey Mock Simultaneous searching across multiple data sets
US20030009443A1 (en) * 2001-06-15 2003-01-09 Oleg Yatviskiy Generic data aggregation
US6567810B1 (en) * 1997-11-19 2003-05-20 At&T Corp. Efficient and effective distributed information management
US20030225779A1 (en) * 2002-05-09 2003-12-04 Yasuhiro Matsuda Inverted index system and method for numeric attributes
US20030225752A1 (en) * 1999-08-04 2003-12-04 Reuven Bakalash Central data warehouse with integrated data aggregation engine for performing centralized data aggregation operations
US6665666B1 (en) * 1999-10-26 2003-12-16 International Business Machines Corporation System, method and program product for answering questions using a search engine
US20040205044A1 (en) * 2003-04-11 2004-10-14 International Business Machines Corporation Method for storing inverted index, method for on-line updating the same and inverted index mechanism
US20050144159A1 (en) * 2003-12-29 2005-06-30 International Business Machines Corporation Method and system for processing a text search query in a collection of documents
US6922700B1 (en) * 2000-05-16 2005-07-26 International Business Machines Corporation System and method for similarity indexing and searching in high dimensional space
US20050198076A1 (en) * 2003-10-17 2005-09-08 Stata Raymond P. Systems and methods for indexing content for fast and scalable retrieval
US20060184521A1 (en) * 1999-07-30 2006-08-17 Ponte Jay M Compressed document surrogates
US7107260B2 (en) * 1999-08-12 2006-09-12 International Business Machines Corporation Data access system
US20060241903A1 (en) * 1997-02-04 2006-10-26 The Bristol Observatory, Ltd Apparatus and method for probabilistic population size and overlap determination, remote processing of private data and probabilistic population size and overlap determination for three or more data sets
US20070078880A1 (en) * 2005-09-30 2007-04-05 International Business Machines Corporation Method and framework to support indexing and searching taxonomies in large scale full text indexes
US20070112761A1 (en) * 2005-06-28 2007-05-17 Zhichen Xu Search engine with augmented relevance ranking by community participation
US7328201B2 (en) * 2003-07-18 2008-02-05 Cleverset, Inc. System and method of using synthetic variables to generate relational Bayesian network models of internet user behaviors
US20080133473A1 (en) * 2006-11-30 2008-06-05 Broder Andrei Z Efficient multifaceted search in information retrieval systems

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5915249A (en) * 1996-06-14 1999-06-22 Excite, Inc. System and method for accelerated query evaluation of very large full-text databases
US5852822A (en) * 1996-12-09 1998-12-22 Oracle Corporation Index-only tables with nested group keys
US20060241903A1 (en) * 1997-02-04 2006-10-26 The Bristol Observatory, Ltd Apparatus and method for probabilistic population size and overlap determination, remote processing of private data and probabilistic population size and overlap determination for three or more data sets
US6105023A (en) * 1997-08-18 2000-08-15 Dataware Technologies, Inc. System and method for filtering a document stream
US6567810B1 (en) * 1997-11-19 2003-05-20 At&T Corp. Efficient and effective distributed information management
US6349308B1 (en) * 1998-02-25 2002-02-19 Korea Advanced Institute Of Science & Technology Inverted index storage structure using subindexes and large objects for tight coupling of information retrieval with database management systems
US20060184521A1 (en) * 1999-07-30 2006-08-17 Ponte Jay M Compressed document surrogates
US20070192295A1 (en) * 1999-08-04 2007-08-16 Reuven Bakalash Relational database management system having integrated non-relational multi-dimensional data store or aggregated data elements
US20030225752A1 (en) * 1999-08-04 2003-12-04 Reuven Bakalash Central data warehouse with integrated data aggregation engine for performing centralized data aggregation operations
US7107260B2 (en) * 1999-08-12 2006-09-12 International Business Machines Corporation Data access system
US6665666B1 (en) * 1999-10-26 2003-12-16 International Business Machines Corporation System, method and program product for answering questions using a search engine
US6922700B1 (en) * 2000-05-16 2005-07-26 International Business Machines Corporation System and method for similarity indexing and searching in high dimensional space
US20020161757A1 (en) * 2001-03-16 2002-10-31 Jeffrey Mock Simultaneous searching across multiple data sets
US20030009443A1 (en) * 2001-06-15 2003-01-09 Oleg Yatviskiy Generic data aggregation
US20030225779A1 (en) * 2002-05-09 2003-12-04 Yasuhiro Matsuda Inverted index system and method for numeric attributes
US20040205044A1 (en) * 2003-04-11 2004-10-14 International Business Machines Corporation Method for storing inverted index, method for on-line updating the same and inverted index mechanism
US7328201B2 (en) * 2003-07-18 2008-02-05 Cleverset, Inc. System and method of using synthetic variables to generate relational Bayesian network models of internet user behaviors
US20050198076A1 (en) * 2003-10-17 2005-09-08 Stata Raymond P. Systems and methods for indexing content for fast and scalable retrieval
US20050144159A1 (en) * 2003-12-29 2005-06-30 International Business Machines Corporation Method and system for processing a text search query in a collection of documents
US20070112761A1 (en) * 2005-06-28 2007-05-17 Zhichen Xu Search engine with augmented relevance ranking by community participation
US20070078880A1 (en) * 2005-09-30 2007-04-05 International Business Machines Corporation Method and framework to support indexing and searching taxonomies in large scale full text indexes
US20080133473A1 (en) * 2006-11-30 2008-06-05 Broder Andrei Z Efficient multifaceted search in information retrieval systems

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8122029B2 (en) * 2007-06-08 2012-02-21 Apple Inc. Updating an inverted index
US20110040762A1 (en) * 2009-08-12 2011-02-17 Globalspec, Inc. Segmenting postings list reader
US20110040905A1 (en) * 2009-08-12 2011-02-17 Globalspec, Inc. Efficient buffered reading with a plug-in for input buffer size determination
US8205025B2 (en) 2009-08-12 2012-06-19 Globalspec, Inc. Efficient buffered reading with a plug-in for input buffer size determination
US20110040761A1 (en) * 2009-08-12 2011-02-17 Globalspec, Inc. Estimation of postings list length in a search system using an approximation table
US8775425B2 (en) * 2010-08-24 2014-07-08 International Business Machines Corporation Systems and methods for massive structured data management over cloud aware distributed file system
US20120054182A1 (en) * 2010-08-24 2012-03-01 International Business Machines Corporation Systems and methods for massive structured data management over cloud aware distributed file system
US10997138B2 (en) * 2012-05-18 2021-05-04 Splunk, Inc. Query handling for field searchable raw machine data using a field searchable datastore and an inverted index
US10423595B2 (en) * 2012-05-18 2019-09-24 Splunk Inc. Query handling for field searchable raw machine data and associated inverted indexes
US20170139964A1 (en) * 2012-05-18 2017-05-18 Splunk Inc. Query handling for field searchable raw machine data
US20170139965A1 (en) * 2012-05-18 2017-05-18 Splunk Inc. Query handling for field searchable raw machine data and associated inverted indexes
US9753974B2 (en) * 2012-05-18 2017-09-05 Splunk Inc. Flexible schema column store
US11144521B2 (en) * 2012-05-18 2021-10-12 Splunk Inc. Query handling for field searchable raw machine data using a field searchable datastore or an inverted index
US11003644B2 (en) 2012-05-18 2021-05-11 Splunk Inc. Directly searchable and indirectly searchable using associated inverted indexes raw machine datastore
US10061807B2 (en) 2012-05-18 2018-08-28 Splunk Inc. Collection query driven generation of inverted index for raw machine data
US20130311438A1 (en) * 2012-05-18 2013-11-21 Splunk Inc. Flexible schema column store
US20170140013A1 (en) * 2012-05-18 2017-05-18 Splunk Inc. Directly field searchable and indirectly searchable by inverted indexes raw machine datastore
US10409794B2 (en) * 2012-05-18 2019-09-10 Splunk Inc. Directly field searchable and indirectly searchable by inverted indexes raw machine datastore
US10402384B2 (en) * 2012-05-18 2019-09-03 Splunk Inc. Query handling for field searchable raw machine data
US9817853B1 (en) * 2012-07-24 2017-11-14 Google Llc Dynamic tier-maps for large online databases
US10331720B2 (en) 2012-09-07 2019-06-25 Splunk Inc. Graphical display of field values extracted from machine data
US11893010B1 (en) 2012-09-07 2024-02-06 Splunk Inc. Data model selection and application based on data sources
US11755634B2 (en) 2012-09-07 2023-09-12 Splunk Inc. Generating reports from unstructured data
US10977286B2 (en) 2012-09-07 2021-04-13 Splunk Inc. Graphical controls for selecting criteria based on fields present in event data
US11386133B1 (en) 2012-09-07 2022-07-12 Splunk Inc. Graphical display of field values extracted from machine data
US11321311B2 (en) 2012-09-07 2022-05-03 Splunk Inc. Data model selection and application based on data sources
US10685001B2 (en) 2013-01-31 2020-06-16 Splunk Inc. Query handling using summarization tables
US11163738B2 (en) 2013-01-31 2021-11-02 Splunk Inc. Parallelization of collection queries
US9990386B2 (en) 2013-01-31 2018-06-05 Splunk Inc. Generating and storing summarization tables for sets of searchable events
US10387396B2 (en) 2013-01-31 2019-08-20 Splunk Inc. Collection query driven generation of summarization information for raw machine data
US10229150B2 (en) 2015-04-23 2019-03-12 Splunk Inc. Systems and methods for concurrent summarization of indexed data
US11604782B2 (en) 2015-04-23 2023-03-14 Splunk, Inc. Systems and methods for scheduling concurrent summarization of indexed data
US10474674B2 (en) 2017-01-31 2019-11-12 Splunk Inc. Using an inverted index in a pipelined search query to determine a set of event data that is further limited by filtering and/or processing of subsequent query pipestages

Also Published As

Publication number Publication date
US7720837B2 (en) 2010-05-18
US20080228718A1 (en) 2008-09-18


Legal Events

Code Title Description
STCB Information on status: application discontinuation. Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION
AS Assignment. Owner name: SAP AG, GERMANY. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:028540/0522. Effective date: 20120629