US20110040761A1 - Estimation of postings list length in a search system using an approximation table - Google Patents
- Publication number
- US20110040761A1 (U.S. patent application Ser. No. 12/854,726)
- Authority
- US
- United States
- Prior art keywords
- size
- posting list
- predetermined
- posting
- reading
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
Definitions
- the present invention generally relates to searching an inverted index. More particularly, the invention relates to estimating a posting list size based on document frequency in order to minimize accesses to the posting list stored in secondary storage.
- an inverted index 100 often comprises two related data structures (see FIG. 1 ): a lexicon and a posting file.
- When processing a user's query, a computerized search system needs access to the postings of the terms that describe the user's information need. As part of processing the query, the search system aggregates information from these postings, by document, in an accumulation process that leads to a ranked list of documents to answer the user's query.
- a large inverted index may not fit into a computer's main memory, requiring secondary storage, typically disk storage, to help store the posting file, lexicon, or both.
- Each separate access to disk may incur seek time on the order of several milliseconds if it is necessary to move the hard drive's read heads, which is very expensive in terms of runtime performance compared to accessing main memory.
- the present invention provides, in a first aspect, a method of minimizing accesses to secondary storage when searching an inverted index for a search term.
- the method comprises automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
- the present invention provides, in a second aspect, a computer system for minimizing accesses to secondary storage when searching an inverted index for a search term.
- the computer system comprises a memory, and a processor in communication with the memory to perform a method.
- the method comprises automatically obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
- the present invention provides, in a third aspect, a program product for minimizing accesses to secondary storage when searching an inverted index for a search term.
- the program product comprises a storage medium readable by a processor and storing instructions for execution by the processor for performing a method.
- the method comprises automatically obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, the posting list being stored in secondary storage, and reading at least a portion of the posting list into memory based on the approximated size.
- the present invention provides, in a fourth aspect, a data structure for use in minimizing accesses to data stored in secondary storage when searching an inverted index for a search term.
- the data structure comprises a posting list length approximation table, comprising a hash table, the hash table comprising: a plurality of range IDs, each range ID corresponding to a subset of posting lists of predetermined similar size and representing a non-overlapping range of document frequencies, and a posting list length approximation for each range ID.
- FIG. 1 depicts one example of an inverted index consisting of a lexicon and corresponding posting file.
- FIG. 2 depicts one example of a posting list length approximation table data structure, according to one aspect of the present invention.
- FIG. 3 is a flow diagram for one example of a method of reading a posting list in accordance with one or more aspects of the present invention.
- FIG. 4 depicts one example of an inverted index with the storage split between main memory and secondary storage.
- FIG. 5 is an object oriented instance diagram showing one example of a posting list reader and the main objects it uses, in accordance with the present invention.
- FIG. 6 is a block diagram of one example of a computing unit incorporating one or more aspects of the present invention.
- the present invention approximates posting list size, preferably as a length in bytes, according to a term's document frequency.
- the approximate posting list size is preferably predetermined, and it covers, with high probability, the size of the associated posting list in secondary storage. Knowing the approximate size is useful for minimizing the number of accesses to secondary storage when reading a posting list. For example, if the approximate covering read size is several megabytes or less, a highly efficient strategy is to scoop up the whole posting list in a single access to secondary storage through a single read system call.
- If the approximate covering read size is larger than the largest available main memory input buffer, the list can be read, for example, by filling the largest available input buffer several times using a single system call per buffer fill operation, and then doing one more partial read to pick up the remainder of the approximate covering read size. For the rare case where the approximate covering read size does not cover the posting list being read, additional supplemental reads can be issued as necessary.
- Reading a posting list with a predetermined buffer fill size strategy requires knowledge of the size of the posting list in terms of buffer elements (typically bytes) before reading begins.
- If an inverted index is small enough that the lexicon fits entirely into memory, it is a simple matter to determine the size of a posting list in bytes. For example, referring back to FIG. 1 , and assuming that the lexicon is entirely in main memory, simply subtract adjacent posting list addresses to know the size in bytes of a given posting list.
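For example, with the lexicon fully in memory this subtraction might look like the sketch below; the list-of-addresses representation and the sentinel final address are assumptions of this sketch, not the patent's layout.

```python
def posting_list_size_in_bytes(addresses, i):
    """Size in bytes of posting list i when the whole lexicon is in memory.

    addresses: posting list start addresses (byte offsets into the posting
    file) in lexicon order, followed by one sentinel entry equal to the
    total posting file length (an assumption of this sketch).
    """
    # Adjacent posting lists are stored contiguously, so the difference of
    # adjacent start addresses is the size of the earlier list.
    return addresses[i + 1] - addresses[i]
```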
- an advantage of the present invention is that this sizing information preferably used for efficient reading from secondary storage can be instantly available in main memory without needing to store the full lexicon in main memory and without needing to store posting list size in bytes as a separate field in the lexicon.
- the present invention uses a posting list range to predetermine approximate posting list size.
- a posting list range is a set of posting lists defined by an inclusive minimum and an inclusive maximum document frequency.
- a posting list is a member of a posting list range if the posting list's document frequency falls within the inclusive minimum and maximum of the range.
- Posting lists that are part of the same posting list range will have the same approximate posting list size.
- the present invention builds on the concept of a posting list range to predetermine approximate posting list size.
- the posting lists in the inverted index are partitioned into a collection of non-overlapping ranges whose union is the complete set of posting lists in the index. Each of these ranges is assigned a unique range identifier (rangeId).
- the documentFrequencyToRangeIdTranslator function takes as input an integer that is the length of a posting list in number of postings, also known as the document frequency.
- the function returns the ID of the range that includes the posting list whose document frequency was passed in.
- ln( ) is the natural logarithm function
- ceil( ) is a function that rounds a number with a fractional part to the next higher integer.
- Table I below shows how the implementation of documentFrequencyToRangeIdTranslator above partitions posting lists into posting list ranges.
- the implementation of documentFrequencyToRangeIdTranslator has been found to work well in practice with a natural language corpus in which the word distribution adheres to Zipf's law.
- Each successive rangeId includes twice as many document frequencies as the preceding rangeId.
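A sketch of this translator, consistent with the description above, follows; the exact implementation and Table I are not reproduced in this excerpt, so the boundary placement below (range 0 holds document frequency 1, each successive range doubling in width) is an assumption.

```python
def document_frequency_to_range_id_translator(document_frequency: int) -> int:
    """Map a document frequency to the ID of its posting list range.

    Conceptually this is ceil(ln(df) / ln(2)), so each successive rangeId
    spans twice as many document frequencies as the one before it
    (1, 2, 3-4, 5-8, 9-16, ...). The integer-exact form below avoids
    floating-point rounding error at exact powers of two.
    """
    if document_frequency < 1:
        raise ValueError("document frequency must be at least 1")
    # (df - 1).bit_length() equals ceil(log2(df)) for integers df >= 1.
    return (document_frequency - 1).bit_length()
```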
- the Posting List Length Approximation table data structure pictured in FIG. 2 can be created in one example as follows. For each posting list range compute the mean and standard deviation of posting list size as stored in secondary storage, preferably in bytes. Next, create a Posting List Length Approximation object consisting of the rangeId of the current range, the mean of posting list size, and the standard deviation of posting list size. Finally, add a hash table entry to the Posting List Length Approximation Table mapping the rangeId to the Posting List Length Approximation object.
- FIG. 2 depicts one example of a data structure 200 for a posting list length approximation table, in accordance with one aspect of the present invention.
- the data structure comprises a hash table 210 with keys 220 and associated values 230 .
- the keys comprise a plurality of range ID's 240 , as described above.
- the associated values comprise the posting list length approximation information 250 .
- the length approximation information is based on a predetermined length.
- the information comprises, for example, the corresponding range ID, a mean posting list length, and a standard deviation for the posting list length.
- the mean length and standard deviation are preferably expressed, for example, in bytes.
- the posting list length approximation table has an access method getPostingListLengthApproximation(documentFrequency) which returns a Posting List Length Approximation object based on a document frequency passed in.
- this method translates the document frequency to a rangeId using the documentFrequencyToRangeIdTranslator function discussed earlier. This rangeId is then used to do a hash table lookup to find the proper Posting List Length Approximation object to return.
- the resulting Posting List Length Approximation object can then be turned into an approximate covering read size by, for example, adding the mean posting list length in bytes to the desired number of standard deviations.
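Putting these pieces together, the table construction and lookup described above might be sketched as follows. The function and field names, the plain dict standing in for the hash table, and the default of two standard deviations are illustrative assumptions, not taken from the patent text.

```python
import math
from collections import defaultdict


def document_frequency_to_range_id(document_frequency):
    # Integer-exact equivalent of ceil(ln(df)/ln(2)); see the translator above.
    return (document_frequency - 1).bit_length()


def build_posting_list_length_approximation_table(posting_list_stats):
    """Build the approximation table of FIG. 2.

    posting_list_stats: iterable of (document_frequency, size_in_bytes)
    pairs, one per posting list as stored in secondary storage
    (an illustrative input shape).
    """
    sizes_by_range = defaultdict(list)
    for document_frequency, size_in_bytes in posting_list_stats:
        range_id = document_frequency_to_range_id(document_frequency)
        sizes_by_range[range_id].append(size_in_bytes)
    table = {}
    for range_id, sizes in sizes_by_range.items():
        mean = sum(sizes) / len(sizes)
        std_dev = math.sqrt(sum((s - mean) ** 2 for s in sizes) / len(sizes))
        # The Posting List Length Approximation object: (rangeId, mean, std dev).
        table[range_id] = (range_id, mean, std_dev)
    return table


def get_posting_list_length_approximation(table, document_frequency):
    # Translate document frequency to a rangeId, then do the hash table lookup.
    return table[document_frequency_to_range_id(document_frequency)]


def approximate_covering_read_size(table, document_frequency, num_std_devs=2):
    # Mean posting list length plus the desired number of standard deviations;
    # the default of 2 is an assumption of this sketch.
    _, mean, std_dev = get_posting_list_length_approximation(table, document_frequency)
    return math.ceil(mean + num_std_devs * std_dev)
```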
- the present example has a similar structure to the inverted index of FIG. 1 .
- the inverted index is large enough that the posting file is entirely in secondary storage and only half of the lexicon fits into main memory.
- FIG. 4 shows how the inverted index 400 is divided between main memory 402 , storing the lexicon index 404 , and secondary storage 406 , storing the full lexicon 408 and the posting file 410 .
- Let N be the total number of terms in the full lexicon in secondary storage. Only every second term, for a total of N/2 terms, is kept in the lexicon index in main memory due to memory constraints.
- the full lexicon in secondary storage is preferably organized as a sequence of blocks, e.g., block 412 , each of a constant size k (e.g., in bytes) such that any block can accommodate the largest pair of lexicon entries in the lexicon.
- the lexicon index does not need to store explicit disk pointers into the full lexicon. Instead, to locate the block in the full lexicon of the lexicon index record with zero-based index i, simply seek to offset i*k in the full lexicon.
- the lexicon index includes document frequency but does not include posting list sizing in bytes. The goal is to keep the main memory lexicon data structure as compact as possible.
- the Posting List Length Approximation Table will provide needed sizing information for efficient reading of posting lists in secondary storage.
- the search system uses an object called a Posting List Reader 500 , shown in FIG. 5 , to read postings from secondary storage during query processing.
- the Posting List Reader uses a Posting List Length Approximation Table 502 to accurately estimate the sizes of posting lists to be read. It uses an Enhanced Buffered Reader 504 with an internal buffer of size bufsize bytes to read postings 506 from secondary storage using efficient predetermined buffer fill size strategies. Preferably, bufsize is relatively large (for example several megabytes) to facilitate reading large posting lists with relatively few read system calls.
- the Posting List Reader provides the following access methods:
- the readPosting( ) method may be used.
- When a user runs a query, the search system first parses the query, identifies the terms for which postings are needed to process the query, and locates each of these terms in the lexicon to obtain a document frequency and posting list address for each. Assuming a lexicon structured similar to that shown in FIG. 4 , a term's document frequency and posting list address can be retrieved without accessing secondary storage about half the time by doing a binary search of the lexicon index in main memory, which is very fast.
- Otherwise, a disk seek can be used to find the term in the full lexicon in secondary storage by seeking to offset i*k in the full lexicon and reading the lexicon entries there, where i is the zero-based record offset in the lexicon index of the lexically greatest term that is lexically less than the sought term, and k is the block size of the blocks in the full lexicon.
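Under the FIG. 4 layout, this lookup might be sketched as follows. The in-memory lexicon index is modeled as a sorted list of (term, document frequency, posting list address) records, and read_block stands in for a seek-and-read of one k-byte block of the full lexicon; both representations are illustrative assumptions.

```python
from bisect import bisect_right


def locate_term(lexicon_index, term, k, read_block):
    """Return (document_frequency, posting_list_address) for term, or None.

    lexicon_index: sorted list of (term, df, address) tuples for every
    second term of the full lexicon, held in main memory.
    read_block(offset): returns the lexicon entries stored in the
    full-lexicon block at the given byte offset in secondary storage.
    """
    index_terms = [t for t, _, _ in lexicon_index]
    i = bisect_right(index_terms, term) - 1
    if i < 0:
        return None  # lexically before every indexed term
    if index_terms[i] == term:
        # Hit in the in-memory index: no secondary storage access needed.
        _, df, address = lexicon_index[i]
        return df, address
    # Miss: one disk seek to offset i*k, where i indexes the lexically
    # greatest indexed term that is less than the sought term.
    for block_term, df, address in read_block(i * k):
        if block_term == term:
            return df, address
    return None
```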
- the Posting List Reader receives an initialize request (step 302 ) that includes a document frequency and a posting list address.
- the document frequency is the length of the posting list to read in number of postings
- the posting list address is the byte offset in the posting file where the posting list to read starts.
- the Posting List Reader obtains a Posting List Length Approximation object (step 304 ) by calling the getPostingListLengthApproximation method on the Posting List Length Approximation Table pictured in FIG. 2 , passing the document frequency to this getter.
- getPostingListLengthApproximation in turn translates the document frequency passed in to a rangeId using the documentFrequencyToRangeIdTranslator function described earlier and does a hash table lookup in the Posting List Length Approximation table based on the rangeId to obtain the Posting List Length Approximation object.
- the Posting List Reader next obtains the approximate size of the posting list to read (step 306 ) by getting the mean and standard deviation of posting list length from the Posting List Length Approximation object and adding the desired number of standard deviations to the mean. Let approximateReadSize be the approximate read size calculated in this step.
- the next step in initializing the Posting List Reader is to build a predetermined buffer fill size strategy (step 308 ) for use with the Enhanced Buffered Reader.
- a predetermined buffer fill size strategy is an ordered sequence of (fillSize, numTimesToUse) pairs, where fillSize indicates how much of the Enhanced Buffered Reader's internal input buffer to fill when a buffer fill is needed, and numTimesToUse indicates how many times to use the associated fillSize. There are two cases to consider, based on the relative sizes of the bufsize (the Enhanced Buffered Reader's internal buffer size) and approximateReadSize.
- the above strategy, when installed in an Enhanced Buffered Reader and used to read the posting list, will utilize the available input buffer of size bufsize bytes to read approximateReadSize bytes of data using a minimal number of disk seeks and minimal data transfer.
- the approximateReadSize is sufficient to read the entire posting list with high probability; however, as many supplemental 8 kilobyte reads as necessary will be issued to handle the relatively rare case when the approximateReadSize is insufficient.
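The two cases might be sketched as follows. The 8-kilobyte supplemental read size comes from the text above; representing "as many times as necessary" with None is a choice of this sketch, not the patent's.

```python
def build_buffer_fill_size_strategy(approximate_read_size, bufsize,
                                    supplemental_read_size=8 * 1024):
    """Return an ordered list of (fill_size, num_times_to_use) pairs (step 308).

    A num_times_to_use of None means "as many times as necessary", covering
    the rare case where approximate_read_size does not cover the posting list.
    """
    strategy = []
    if approximate_read_size <= bufsize:
        # Case 1: the whole covering read fits in the internal buffer,
        # so a single fill suffices.
        strategy.append((approximate_read_size, 1))
    else:
        # Case 2: several full-buffer fills, then one partial fill for the
        # remainder of the approximate covering read size.
        full_fills, remainder = divmod(approximate_read_size, bufsize)
        strategy.append((bufsize, full_fills))
        if remainder:
            strategy.append((remainder, 1))
    # Supplemental 8 KB reads if the approximation falls short.
    strategy.append((supplemental_read_size, None))
    return strategy
```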
- the next step in initializing the Posting List Reader is to seek the Enhanced Buffered Reader to the start of the posting list (step 310 ).
- the posting list address that was passed to the initialize request (step 302 ) is forwarded to the Enhanced Buffered Reader's seek method.
- the predetermined buffer fill size strategy of step 308 is installed in the Enhanced Buffered Reader (step 312 ), by calling the appropriate setter.
- the posting list reader is now ready to start processing read requests for postings (step 314 ).
- the Enhanced Buffered Reader automatically initiates buffer refilling as needed using read sizes consistent with good runtime performance when accessing secondary storage.
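A minimal sketch of this refill behavior follows, assuming the (fillSize, numTimesToUse) strategy shape discussed above; the Enhanced Buffered Reader itself is the subject of a companion application, so this interface and implementation are assumptions, not the patent's.

```python
class EnhancedBufferedReader:
    """Illustrative sketch: refills an internal buffer from an underlying
    file object using the installed predetermined buffer fill size strategy."""

    def __init__(self, file_obj):
        self._file = file_obj
        self._strategy = []
        self._buffer = b""

    def seek(self, offset):
        self._file.seek(offset)
        self._buffer = b""

    def set_fill_size_strategy(self, strategy):
        # Copy each pair so the installed strategy can be consumed in place.
        self._strategy = [list(pair) for pair in strategy]

    def _next_fill_size(self):
        fill_size, times = self._strategy[0]
        if times is not None:  # None means "use as many times as needed"
            self._strategy[0][1] -= 1
            if self._strategy[0][1] == 0:
                self._strategy.pop(0)
        return fill_size

    def read(self, n):
        # Refill with strategy-determined read sizes until n bytes are buffered
        # or the underlying file is exhausted.
        while len(self._buffer) < n and self._strategy:
            chunk = self._file.read(self._next_fill_size())
            if not chunk:
                break
            self._buffer += chunk
        out, self._buffer = self._buffer[:n], self._buffer[n:]
        return out
```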
- a data processing system 600 suitable for storing and/or executing program code may be provided that includes at least one processor 610 coupled directly or indirectly to memory elements through a system bus 620 .
- the memory elements include, for instance, data buffers 630 and 640 , local memory employed during actual execution of the program code, bulk storage 650 , and cache memory, which provides temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- I/O devices 660 can be coupled to the system either directly or through intervening I/O controllers.
- Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the available types of network adapters.
- the shortcomings of the prior art are overcome and additional advantages are provided through the provision of a computer program product for minimizing accesses to secondary storage for a posting list when searching an inverted index for a search term.
- the computer program product comprises a storage medium readable by a processing circuit and storing instructions for execution by a computer for performing a method.
- the method includes, for instance, automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
- an application can be deployed for performing one or more aspects of the present invention.
- the deploying of an application comprises providing computer infrastructure operable to perform one or more aspects of the present invention.
- a computing infrastructure can be deployed comprising integrating computer readable code into a computing system, in which the code in combination with the computing system is capable of performing one or more aspects of the present invention.
- a process for integrating computing infrastructure comprising integrating computer readable code into a computer system
- the computer system comprises a computer readable medium, in which the computer medium comprises one or more aspects of the present invention.
- the code in combination with the computer system is capable of performing one or more aspects of the present invention.
- aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system”. Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
- the computer readable medium may be a computer readable storage medium.
- a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
- the computer readable storage medium includes the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
- a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- a computer program product includes, for instance, one or more computer readable media to store computer readable program code means or logic thereon to provide and facilitate one or more aspects of the present invention.
- the computer program product can take many different physical forms, for example, disks, platters, flash memory, etc., including those above.
- Program code embodied on a computer readable medium may be transmitted using an appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language, such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
Abstract
Description
- This application claims priority under 35 U.S.C. §119 to the following U.S. Provisional Applications, which are herein incorporated by reference in their entirety:
- Provisional Patent Application Ser. No. 61/233,411, by Flatland et al., entitled “ESTIMATION OF POSTINGS LIST LENGTH IN A SEARCH SYSTEM USING AN APPROXIMATION TABLE,” filed on Aug. 12, 2009; and
- Provisional Patent Application No. 61/233,420, by Flatland et al., entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION,” filed on Aug. 12, 2009;
- Provisional Patent Application Ser. No. 61/233,427, by Flatland et al., entitled “SEGMENTING POSTINGS LIST READER,” filed on Aug. 12, 2009.
- This application contains subject matter which is related to the subject matter of the following applications, each of which is assigned to the same assignee as this application and filed on the same day as this application. Each of the below listed applications is hereby incorporated herein by reference in its entirety:
- U.S. Non-Provisional patent application Ser. No. ______, by Flatland et al., entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION” (Attorney Docket No. 1634.069A); and
- U.S. Non-Provisional patent application Ser. No. ______, by Flatland et al., entitled “SEGMENTING POSTINGS LIST READER” (Attorney Docket No. 1634.070A).
- The present invention generally relates to searching an inverted index. More particularly, the invention relates to estimating a posting list size based on document frequency in order to minimize accesses to the posting list stored in secondary storage.
- The following definition of Information Retrieval (IR) is from the book Introduction to Information Retrieval by Manning, Raghavan and Schutze, Cambridge University Press, 2008:
- Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
- An inverted index is a data structure central to the design of numerous modern information retrieval systems. In chapter 5 of Search Engines: Information Retrieval in Practice (Addison Wesley, 2010), Croft, Metzler and Strohman observe:
- An inverted index is the computational equivalent of the index found in the back of this textbook . . . . The book index is arranged in alphabetical order by index term. Each index term is followed by a list of pages about the word.
- In a search system implemented using a computer, an inverted index 100 often comprises two related data structures (see FIG. 1 ):
- 1. A lexicon 101 contains the distinct set of terms 102 (i.e., with duplicates removed) that occur throughout all the documents of the index. To facilitate rapid searching, terms in the lexicon are usually stored in sorted order. Each term typically includes a document frequency 104 and a pointer into the other major data structure of the inverted index, the posting file 108 . The document frequency is a count of the number of documents in which a term occurs. The document frequency is useful at search time both for prioritizing term processing and as input to scoring algorithms.
- 2. The posting file 108 consists of one posting list per term in the lexicon, e.g., list 110 for term 112 , recording for each term the set of documents in which the term occurs. Each entry in a posting list is called a posting. The number of postings in a given posting list equals the document frequency of the associated lexicon entry. A posting includes at least a document identifier and may include additional information such as: a count of the number of times the term occurs in the document; a list of term positions within the document where the term occurs; and more generally, scoring information that ascribes some degree of importance (or lack thereof) to the fact that the document contains the term.
- When processing a user's query, a computerized search system needs access to the postings of the terms that describe the user's information need. As part of processing the query, the search system aggregates information from these postings, by document, in an accumulation process that leads to a ranked list of documents to answer the user's query.
- A large inverted index may not fit into a computer's main memory, requiring secondary storage, typically disk storage, to help store the posting file, lexicon, or both. Each separate access to disk may incur seek time on the order of several milliseconds if it is necessary to move the hard drive's read heads, which is very expensive in terms of runtime performance compared to accessing main memory.
- Therefore, it would be helpful to minimize accesses to secondary storage when searching an inverted list, in order to improve runtime performance.
- The present invention provides, in a first aspect, a method of minimizing accesses to secondary storage when searching an inverted index for a search term. The method comprises automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
- The present invention provides, in a second aspect, a computer system for minimizing accesses to secondary storage when searching an inverted index for a search term. The computer system comprises a memory, and a processor in communication with the memory to perform a method. The method comprises automatically obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
- The present invention provides, in a third aspect, a program product for minimizing accesses to secondary storage when searching an inverted index for a search term. The program product comprises a storage medium readable by a processor and storing instructions for execution by the processor for performing a method. The method comprises automatically obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, the posting list being stored in secondary storage, and reading at least a portion of the posting list into memory based on the approximated size.
- The present invention provides, in a fourth aspect, a data structure for use in minimizing accesses to data stored in secondary storage when searching an inverted index for a search term. The data structure comprises a posting list length approximation table, comprising a hash table, the hash table comprising: a plurality of range IDs, each range ID corresponding to a subset of posting lists of predetermined similar size and representing a non-overlapping range of document frequencies, and a posting list length approximation for each range ID.
- These, and other objects, features and advantages of this invention will become apparent from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings.
- One or more aspects of the present invention are particularly pointed out and distinctly claimed as examples in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
-
FIG. 1 depicts one example of an inverted index consisting of a lexicon and corresponding posting file. -
FIG. 2 depicts one example of a posting list length approximation table data structure, according to one aspect of the present invention. -
FIG. 3 is a flow diagram for one example of a method of reading a posting list in accordance with one or more aspects of the present invention. -
FIG. 4 depicts one example of an inverted index with the storage split between main memory and secondary storage. -
FIG. 5 is an object oriented instance diagram showing one example of a posting list reader and the main objects it uses, in accordance with the present invention. -
FIG. 6 is a block diagram of one example of a computing unit incorporating one or more aspects of the present invention.
- The present invention approximates posting list size, preferably as a length in bytes, according to a term's document frequency. The approximate posting list size is preferably predetermined, and it covers, with high probability, the size of the associated posting list in secondary storage. Knowing the approximate size is useful for minimizing the number of accesses to secondary storage when reading a posting list. For example, if the approximate covering read size is several megabytes or less, a highly efficient strategy is to scoop up the whole posting list in a single access to secondary storage through a single read system call. If the approximate covering read size is larger than the largest available main memory input buffer, then the list can be read, for example, by filling the largest available input buffer several times using a single system call per buffer fill operation, and then doing one more partial read to pick up the remainder of the approximate covering read size. For the rare case where the approximate covering read size does not cover the posting list being read, additional supplemental reads can be issued as necessary.
- U.S. Non-Provisional Patent Application entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION” (Attorney Docket No. 1634.069A), filed concurrently herewith, describes an enhanced buffered reader that can be configured with predetermined buffer fill size strategies. When the posting file is in secondary storage, using an enhanced buffered reader to read a posting list offers advantages over a conventional buffered reader. An enhanced buffered reader can be configured with a predetermined buffer fill size strategy that is based on both the size of the posting list (in bytes, for example) and the size of the available input buffer, ensuring that the minimum number of system calls needed to read from secondary storage is issued. Another advantage of the enhanced buffered reader is that it neatly encapsulates buffer management details. The detailed description of the present invention assumes a working understanding of enhanced buffered readers.
- Reading a posting list with a predetermined buffer fill size strategy requires knowledge of the size of the posting list in terms of buffer elements (typically bytes) before reading begins. When an inverted index is small enough that the lexicon fits entirely into memory, it is a simple matter to determine the size of a posting list in bytes. For example, referring back to
FIG. 1, and assuming that the lexicon is entirely in main memory, simply subtract adjacent posting list addresses to determine the size in bytes of a given posting list. As will become apparent in the detailed description below, an advantage of the present invention is that this sizing information, preferably used for efficient reading from secondary storage, can be instantly available in main memory without needing to store the full lexicon in main memory and without needing to store posting list size in bytes as a separate field in the lexicon. - The present invention uses a posting list range to predetermine approximate posting list size. A posting list range is a set of posting lists defined by an inclusive minimum and an inclusive maximum document frequency. A posting list is a member of a posting list range if the posting list's document frequency falls within the inclusive minimum and maximum of the range. Posting lists that are part of the same posting list range will have the same approximate posting list size.
- As a prerequisite for populating the Posting List Length Approximation Table data structure pictured in
FIG. 2, and given an inverted index, the posting lists in the inverted index are partitioned into a collection of non-overlapping ranges whose union is the complete set of posting lists in the index. Each of these ranges is assigned a unique range identifier (rangeId). - One example of a way to accomplish this partitioning of posting lists is through a function called documentFrequencyToRangeIdTranslator, shown below and summarized here. The function takes as input an integer that is the length of a posting list in number of postings, also known as the document frequency. The function returns the ID of the range that includes the posting list whose document frequency was passed in. ln( ) is the natural logarithm function, and ceil( ) is a function that rounds a number with a fractional part up to the next higher integer.
-
documentFrequencyToRangeIdTranslator

int documentFrequencyToRangeIdTranslator(int documentFrequency) {
    return ceil(ln(documentFrequency) / ln(2.0));
}

- Table I below shows how the implementation of documentFrequencyToRangeIdTranslator above partitions posting lists into posting list ranges. This implementation has been found to work well in practice with a natural language corpus in which the word distribution adheres to Zipf's law. Each successive rangeId includes twice as many document frequencies as the preceding rangeId.
-
TABLE I
Sample Range Definitions

minDocumentFrequency | maxDocumentFrequency | rangeId
---|---|---
1 | 1 | 0
2 | 2 | 1
3 | 4 | 2
5 | 8 | 3
9 | 16 | 4
17 | 32 | 5
etc. | |

- Other implementations of the documentFrequencyToRangeIdTranslator are possible. The above is merely one example. This function could be implemented in any way that defines a complete non-overlapping partitioning of the posting lists into ranges.
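The translator and Table I above can be checked with a short Python transcription (a sketch only; the function name is rendered in snake_case, and math.log plays the role of ln()):

```python
import math

def document_frequency_to_range_id(document_frequency):
    # Each successive rangeId covers twice as many document frequencies
    # as the preceding one: df 1 -> 0, df 2 -> 1, df 3-4 -> 2, df 5-8 -> 3, ...
    return math.ceil(math.log(document_frequency) / math.log(2.0))

# Reproduce the sample range definitions of Table I.
for df, expected_range_id in [(1, 0), (2, 1), (3, 2), (4, 2), (5, 3),
                              (8, 3), (9, 4), (16, 4), (17, 5), (32, 5)]:
    assert document_frequency_to_range_id(df) == expected_range_id
```
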
- Given an inverted index, the Posting List Length Approximation table data structure pictured in
FIG. 2 can be created in one example as follows. For each posting list range, compute the mean and standard deviation of posting list size as stored in secondary storage, preferably in bytes. Next, create a Posting List Length Approximation object consisting of the rangeId of the current range, the mean of posting list size, and the standard deviation of posting list size. Finally, add a hash table entry to the Posting List Length Approximation Table mapping the rangeId to the Posting List Length Approximation object. -
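The table-building steps just described might look as follows in Python (a sketch under the assumption that each posting list's document frequency and stored size in bytes are available; all names are illustrative, and the population standard deviation is used):

```python
import math
from statistics import mean, pstdev

def range_id_for(document_frequency):
    # documentFrequencyToRangeIdTranslator from the text.
    return math.ceil(math.log(document_frequency) / math.log(2.0))

def build_approximation_table(posting_lists):
    """posting_lists: iterable of (document_frequency, size_in_bytes) pairs.

    Returns a hash table mapping rangeId to a Posting List Length
    Approximation: (rangeId, mean size, standard deviation of size).
    """
    sizes_by_range = {}
    for document_frequency, size_in_bytes in posting_lists:
        sizes_by_range.setdefault(range_id_for(document_frequency),
                                  []).append(size_in_bytes)
    return {range_id: (range_id, mean(sizes), pstdev(sizes))
            for range_id, sizes in sizes_by_range.items()}

table = build_approximation_table([(1, 10), (1, 14), (3, 40), (4, 44)])
assert table[0] == (0, 12, 2)    # range 0 (df 1): sizes 10 and 14
assert table[2][1] == 42         # range 2 (df 3-4): mean of 40 and 44
```
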
FIG. 2 depicts one example of a data structure 200 for a posting list length approximation table, in accordance with one aspect of the present invention. The data structure comprises a hash table 210 with keys 220 and associated values 230. The keys comprise a plurality of range IDs 240, as described above. The associated values comprise the posting list length approximation information 250. In the presently preferred embodiment, the length approximation information is based on a predetermined length. The information comprises, for example, the corresponding range ID, a mean posting list length, and a standard deviation for the posting list length. The mean length and standard deviation are preferably expressed, for example, in bytes. - In one example, in addition to the structure shown in
FIG. 2, the posting list length approximation table has an access method getPostingListLengthApproximation(documentFrequency), which returns a Posting List Length Approximation object based on a document frequency passed in. In the present example, the implementation of this method translates the document frequency to a rangeId using the documentFrequencyToRangeIdTranslator function discussed earlier. This rangeId is then used to do a hash table lookup to find the proper Posting List Length Approximation object to return. The resulting Posting List Length Approximation object can then be turned into an approximate covering read size by, for example, adding the desired number of standard deviations to the mean posting list length in bytes. - One example of how to use a posting list length approximation table to read a posting list efficiently will now be provided with reference to the flow diagram 300 of
FIG. 3. The present example has a structure similar to the inverted index of FIG. 1. In the scenario of this example, the inverted index is large enough that the posting file is entirely in secondary storage and only half of the lexicon fits into main memory. -
FIG. 4 shows how the inverted index 400 is divided between main memory 402, storing the lexicon index 404, and secondary storage 406, storing the full lexicon 408 and the posting file 410. Referring to FIG. 4, let N be the total number of terms in the full lexicon in secondary storage. Only every second term, for a total of N/2 terms, is kept in the lexicon index in main memory due to memory constraints. The full lexicon in secondary storage is preferably organized as a sequence of blocks, e.g., block 412, each of a constant size k (e.g., in bytes) such that any block can accommodate the largest pair of lexicon entries in the lexicon. This causes some internal fragmentation within the full lexicon, but the advantage is that the lexicon index does not need to store explicit disk pointers into the full lexicon. Instead, to locate the full-lexicon block corresponding to the lexicon index record with zero-based index i, simply seek to offset i*k in the full lexicon. By design, the lexicon index includes document frequency but does not include posting list sizing in bytes. The goal is to keep the main memory lexicon data structure as compact as possible. The Posting List Length Approximation Table will provide the needed sizing information for efficient reading of posting lists in secondary storage. - In this example, it is assumed that the search engine implementation uses an object called a
Posting List Reader 500, shown in FIG. 5, to read postings from secondary storage during query processing. The Posting List Reader uses a Posting List Length Approximation Table 502 to accurately estimate the sizes of posting lists to be read. It uses an Enhanced Buffered Reader 504 with an internal buffer of size bufsize bytes to read postings 506 from secondary storage using efficient predetermined buffer fill size strategies. Preferably, bufsize is relatively large (for example, several megabytes) to facilitate reading large posting lists with relatively few read system calls. The Posting List Reader provides the following access methods:
- initialize(documentFrequency, postingListAddress)—Prepares the Posting List Reader for reading based on a document frequency and posting list address of a term obtained from the lexicon. After initialization, the readPosting() method may be used.
- readPosting()—Reads the next posting from the posting list.
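The lexicon lookup implied by FIG. 4, a binary search of the in-memory lexicon index followed, when necessary, by a seek to offset i*k in the full lexicon, can be sketched in Python (hypothetical names; assumes the lexicon index is a sorted list holding every second term of the full lexicon):

```python
import bisect

def locate_term(lexicon_index, term, block_size_k):
    """Return ('memory', i) if the binary search finds the term directly,
    or ('disk', offset) giving the byte offset i * k of the full-lexicon
    block to read, where i is the zero-based index in the lexicon index
    of the lexically greatest term lexically less than the term sought.
    """
    i = bisect.bisect_left(lexicon_index, term)
    if i < len(lexicon_index) and lexicon_index[i] == term:
        return ('memory', i)  # about half the time: no disk access needed
    # One disk seek into the block-structured full lexicon.
    return ('disk', max(i - 1, 0) * block_size_k)

assert locate_term(['apple', 'cherry', 'grape'], 'cherry', 64) == ('memory', 1)
assert locate_term(['apple', 'cherry', 'grape'], 'fig', 64) == ('disk', 64)
```
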
- When a user runs a query, the search system first parses the query, identifies the terms for which postings are needed to process the query, and locates each of these terms in the lexicon to obtain a document frequency and posting list address for each. Assuming a lexicon structured similar to that shown in
FIG. 4, a term's document frequency and posting list address can be retrieved without accessing secondary storage about half the time by doing a binary search of the lexicon index in main memory, which is very fast. If necessary, a disk seek can be used to find the term in the full lexicon in secondary storage by seeking to offset i*k in the full lexicon and reading the lexicon entries there, where i is the zero-based record offset in the lexicon index of the lexically greatest term that is lexically less than the sought term, and k is the block size of the blocks in the full lexicon. Having obtained a document frequency and posting list address for a term, the search system initializes a Posting List Reader, preparing it to read postings, as discussed below. - Returning to
FIG. 3, the Posting List Reader receives an initialize request (step 302) that includes a document frequency and a posting list address. The document frequency is the length of the posting list to read in number of postings, and the posting list address is the byte offset in the posting file where the posting list to read starts.
FIG. 2, passing the document frequency to this getter. (The implementation of getPostingListLengthApproximation in turn translates the document frequency passed in to a rangeId using the documentFrequencyToRangeIdTranslator function described earlier and does a hash table lookup in the Posting List Length Approximation table based on the rangeId to obtain the Posting List Length Approximation object.) - The Posting List Reader next obtains the approximate size of the posting list to read (step 306) by getting the mean and standard deviation of posting list length from the Posting List Length Approximation object and adding the desired number of standard deviations to the mean. Let approximateReadSize be the approximate read size calculated in this step.
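Steps 304 and 306 amount to a table lookup followed by simple arithmetic. A Python sketch follows, using a toy approximation table with invented numbers (all names and values are illustrative):

```python
import math

# Toy table: rangeId -> (rangeId, mean size in bytes, standard deviation).
APPROXIMATION_TABLE = {
    3: (3, 1000, 150),   # df 5-8
    4: (4, 2200, 400),   # df 9-16
}

def get_posting_list_length_approximation(document_frequency):
    # Translate the document frequency to a rangeId, then do the hash lookup.
    range_id = math.ceil(math.log(document_frequency) / math.log(2.0))
    return APPROXIMATION_TABLE[range_id]

def approximate_read_size(document_frequency, num_std_devs=2):
    # Covering read size = mean + desired number of standard deviations.
    _, mean_size, std_dev = get_posting_list_length_approximation(document_frequency)
    return mean_size + num_std_devs * std_dev

assert approximate_read_size(8) == 1300    # rangeId 3: 1000 + 2 * 150
assert approximate_read_size(16) == 3000   # rangeId 4: 2200 + 2 * 400
```
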
- The next step in initializing the Posting List Reader is to build a predetermined buffer fill size strategy (step 308) for use with the Enhanced Buffered Reader. A predetermined buffer fill size strategy is an ordered sequence of (fillSize, numTimesToUse) pairs, where fillSize indicates how much of the Enhanced Buffered Reader's internal input buffer to fill when a buffer fill is needed, and numTimesToUse indicates how many times to use the associated fillSize. There are two cases to consider, based on the relative sizes of the bufsize (the Enhanced Buffered Reader's internal buffer size) and approximateReadSize.
- Case 1: approximateReadSize<=bufsize; and
- Case 2: approximateReadSize>bufsize.
- A discussion of these cases follows.
- Case 1: approximateReadSize<=bufsize
- Build a two-stage predetermined buffer fill size strategy as indicated below in Table II.
-
TABLE II

Stage | Fill Size | Number of Times to Use
---|---|---
1 | approximateReadSize | 1
2 | 8 kilobytes | Repeat as necessary

- The above two-stage strategy, when installed in an Enhanced Buffered Reader and used to read the posting list, will with high probability result in a single disk seek and read of exactly approximateReadSize bytes. As many supplemental 8 kilobyte reads as necessary may then be issued to handle the relatively rare case when the approximateReadSize is insufficient.
- Case 2: approximateReadSize>bufsize
- For this discussion, let “/” represent the operation of integer division, and “%” represent the operation of integer modulo.
- In this case, we build a predetermined buffer fill size strategy that generally has three stages, as indicated in the following table. However, the second stage is not necessary when bufsize divides approximateReadSize evenly.
-
TABLE III

Stage | Fill Size | Number of Times to Use
---|---|---
1 | bufsize | approximateReadSize / bufsize
2 | approximateReadSize % bufsize | 1
3 | 8 kilobytes | Repeat as necessary

- The above strategy, when installed in an Enhanced Buffered Reader and used to read the posting list, will utilize the available input buffer of size bufsize bytes to read approximateReadSize bytes of data using a minimal number of disk seeks and minimal data transfer. The approximateReadSize is sufficient to read the entire posting list with high probability; however, as many supplemental 8 kilobyte reads as necessary will be issued to handle the relatively rare case when the approximateReadSize is insufficient.
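Tables II and III can be folded into a single strategy-building function. The following Python sketch returns the ordered (fillSize, numTimesToUse) pairs, with None standing in for "repeat as necessary" (the names and the None convention are illustrative, not from the source):

```python
def build_fill_size_strategy(approximate_read_size, bufsize, supplemental=8 * 1024):
    """Return an ordered list of (fill_size, num_times_to_use) pairs;
    num_times_to_use of None means "repeat as necessary"."""
    strategy = []
    if approximate_read_size <= bufsize:
        # Case 1: a single read of exactly approximateReadSize bytes.
        strategy.append((approximate_read_size, 1))
    else:
        # Case 2: fill the whole buffer repeatedly, then one partial read
        # for the remainder (skipped when bufsize divides evenly).
        full_fills, remainder = divmod(approximate_read_size, bufsize)
        strategy.append((bufsize, full_fills))
        if remainder:
            strategy.append((remainder, 1))
    # Supplemental 8 kilobyte reads cover the rare case where the
    # approximate covering read size turns out to be insufficient.
    strategy.append((supplemental, None))
    return strategy

# Case 1: approximateReadSize fits in the buffer.
assert build_fill_size_strategy(5000, 8192) == [(5000, 1), (8192, None)]
# Case 2: 20000 = 2 * 8192 + 3616, so two full fills plus a partial read.
assert build_fill_size_strategy(20000, 8192) == [(8192, 2), (3616, 1), (8192, None)]
```
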
- Referring once again to
FIG. 3, the next step in initializing the Posting List Reader is to seek the Enhanced Buffered Reader to the start of the posting list (step 310). The posting list address that was passed to the initialize request (step 302) is forwarded to the Enhanced Buffered Reader's seek method. - Finally, the predetermined buffer fill size strategy of
step 308 is installed in the Enhanced Buffered Reader (step 312) by calling the appropriate setter. The Posting List Reader is now ready to start processing read requests for postings (step 314). As the search system's search logic issues read requests as desired, the Enhanced Buffered Reader automatically initiates buffer refilling as needed, using read sizes consistent with good runtime performance when accessing secondary storage. - As shown in
FIG. 6, one example of a data processing system 600 suitable for storing and/or executing program code includes at least one processor 610 coupled directly or indirectly to memory elements through a system bus 620. As known in the art, the memory elements include, for instance, data buffers 630 and 640, local memory employed during actual execution of the program code, bulk storage 650, and cache memory, which provides temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. - Input/Output or I/O devices 660 (including, but not limited to, keyboards, displays, pointing devices, DASD, tape, CDs, DVDs, thumb drives and other memory media, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the available types of network adapters.
- The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a computer program product for minimizing accesses to secondary storage for a posting list when searching an inverted index for a search term. The computer program product comprises a storage medium readable by a processing circuit and storing instructions for execution by a computer for performing a method. The method includes, for instance, automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
- Methods and systems relating to one or more aspects of the present invention are also described and claimed herein. Further, services relating to one or more aspects of the present invention are also described and may be claimed herein.
- Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.
- In one aspect of the present invention, an application can be deployed for performing one or more aspects of the present invention. As one example, the deploying of an application comprises providing computer infrastructure operable to perform one or more aspects of the present invention.
- As a further aspect of the present invention, a computing infrastructure can be deployed comprising integrating computer readable code into a computing system, in which the code in combination with the computing system is capable of performing one or more aspects of the present invention.
- As yet a further aspect of the present invention, a process for integrating computing infrastructure comprising integrating computer readable code into a computer system may be provided. The computer system comprises a computer readable medium, in which the computer medium comprises one or more aspects of the present invention. The code in combination with the computer system is capable of performing one or more aspects of the present invention.
- As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system”. Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
- Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- In one example, a computer program product includes, for instance, one or more computer readable media to store computer readable program code means or logic thereon to provide and facilitate one or more aspects of the present invention. The computer program product can take many different physical forms, for example, disks, platters, flash memory, etc., including those above.
- Program code embodied on a computer readable medium may be transmitted using an appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language, such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
- The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
- The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
- The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below, if any, are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Claims (33)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/854,726 US20110040761A1 (en) | 2009-08-12 | 2010-08-11 | Estimation of postings list length in a search system using an approximation table |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US23341109P | 2009-08-12 | 2009-08-12 | |
US23342009P | 2009-08-12 | 2009-08-12 | |
US23342709P | 2009-08-12 | 2009-08-12 | |
US12/854,726 US20110040761A1 (en) | 2009-08-12 | 2010-08-11 | Estimation of postings list length in a search system using an approximation table |
Publications (1)
Publication Number | Publication Date |
---|---|
US20110040761A1 true US20110040761A1 (en) | 2011-02-17 |
Family
ID=43589199
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/854,726 Abandoned US20110040761A1 (en) | 2009-08-12 | 2010-08-11 | Estimation of postings list length in a search system using an approximation table |
US12/854,775 Abandoned US20110040762A1 (en) | 2009-08-12 | 2010-08-11 | Segmenting postings list reader |
US12/854,755 Active 2030-09-21 US8205025B2 (en) | 2009-08-12 | 2010-08-11 | Efficient buffered reading with a plug-in for input buffer size determination |
US13/460,515 Abandoned US20120260011A1 (en) | 2009-08-12 | 2012-04-30 | Efficient buffered reading with a plug-in for input buffer size determination |
Family Applications After (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/854,775 Abandoned US20110040762A1 (en) | 2009-08-12 | 2010-08-11 | Segmenting postings list reader |
US12/854,755 Active 2030-09-21 US8205025B2 (en) | 2009-08-12 | 2010-08-11 | Efficient buffered reading with a plug-in for input buffer size determination |
US13/460,515 Abandoned US20120260011A1 (en) | 2009-08-12 | 2012-04-30 | Efficient buffered reading with a plug-in for input buffer size determination |
Country Status (1)
Country | Link |
---|---|
US (4) | US20110040761A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015078273A1 (en) * | 2013-11-29 | 2015-06-04 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for search |
EP3000060A4 (en) * | 2013-05-21 | 2017-01-18 | Facebook, Inc. | Database sharding with update layer |
US10545960B1 (en) * | 2019-03-12 | 2020-01-28 | The Governing Council Of The University Of Toronto | System and method for set overlap searching of data lakes |
US11003644B2 (en) * | 2012-05-18 | 2021-05-11 | Splunk Inc. | Directly searchable and indirectly searchable using associated inverted indexes raw machine datastore |
US11334353B2 (en) * | 2017-05-18 | 2022-05-17 | Nec Corporation | Multiparty computation method, apparatus and program |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012126180A1 (en) | 2011-03-24 | 2012-09-27 | Microsoft Corporation | Multi-layer search-engine index |
US9152697B2 (en) | 2011-07-13 | 2015-10-06 | International Business Machines Corporation | Real-time search of vertically partitioned, inverted indexes |
EP2888679A1 (en) * | 2012-08-24 | 2015-07-01 | Yandex Europe AG | Computer-implemented method of and system for searching an inverted index having a plurality of posting lists |
US8739151B1 (en) * | 2013-03-15 | 2014-05-27 | Genetec Inc. | Computer system using in-service software upgrade |
US10983973B2 (en) * | 2013-05-21 | 2021-04-20 | Facebook, Inc. | Database sharding with incorporated updates |
US9910860B2 (en) * | 2014-02-06 | 2018-03-06 | International Business Machines Corporation | Split elimination in MapReduce systems |
US9971770B2 (en) * | 2014-11-25 | 2018-05-15 | Sap Se | Inverted indexing |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4754399A (en) * | 1983-12-28 | 1988-06-28 | Hitachi, Ltd. | Data transfer control system for controlling data transfer between a buffer memory and input/output devices |
AU652371B2 (en) * | 1990-06-29 | 1994-08-25 | Fujitsu Limited | Data transfer system |
US5537552A (en) * | 1990-11-27 | 1996-07-16 | Canon Kabushiki Kaisha | Apparatus for selectively comparing pointers to detect full or empty status of a circular buffer area in an input/output (I/O) buffer |
JPH0820964B2 (en) * | 1991-09-13 | 1996-03-04 | インターナショナル・ビジネス・マシーンズ・コーポレイション | Memory control device and method |
US6820144B2 (en) * | 1999-04-06 | 2004-11-16 | Microsoft Corporation | Data format for a streaming information appliance |
US6813731B2 (en) * | 2001-02-26 | 2004-11-02 | Emc Corporation | Methods and apparatus for accessing trace data |
GB2381401B (en) * | 2001-10-23 | 2005-12-21 | Thirdspace Living Ltd | Data switch |
US20040243556A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, and including a document common analysis system (CAS) |
US7567959B2 (en) * | 2004-07-26 | 2009-07-28 | Google Inc. | Multiple index based information retrieval system |
JP4305378B2 (en) * | 2004-12-13 | 2009-07-29 | ソニー株式会社 | Data processing system, access control method, apparatus thereof, and program thereof |
US20070198754A1 (en) * | 2006-02-07 | 2007-08-23 | International Business Machines Corporation | Data transfer buffer control for performance |
JP2008033721A (en) * | 2006-07-31 | 2008-02-14 | Matsushita Electric Ind Co Ltd | Dma transfer control device |
CA2675216A1 (en) * | 2007-01-10 | 2008-07-17 | Nick Koudas | Method and system for information discovery and text analysis |
- 2010
  - 2010-08-11 US US12/854,726 patent/US20110040761A1/en not_active Abandoned
  - 2010-08-11 US US12/854,775 patent/US20110040762A1/en not_active Abandoned
  - 2010-08-11 US US12/854,755 patent/US8205025B2/en active Active
- 2012
  - 2012-04-30 US US13/460,515 patent/US20120260011A1/en not_active Abandoned
Patent Citations (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3987419A (en) * | 1974-12-05 | 1976-10-19 | Goodyear Aerospace Corporation | High speed information processing system |
US4158235A (en) * | 1977-04-18 | 1979-06-12 | Burroughs Corporation | Multi port time-shared associative buffer storage pool |
US5179662A (en) * | 1989-08-31 | 1993-01-12 | International Business Machines Corporation | Optimized i/o buffers having the ability to increase or decrease in size to meet system requirements |
US5263159A (en) * | 1989-09-20 | 1993-11-16 | International Business Machines Corporation | Information retrieval based on rank-ordered cumulative query scores calculated from weights of all keywords in an inverted index file for minimizing access to a main database |
US6439783B1 (en) * | 1994-07-19 | 2002-08-27 | Oracle Corporation | Range-based query optimizer |
US5784698A (en) * | 1995-12-05 | 1998-07-21 | International Business Machines Corporation | Dynamic memory allocation that enables efficient use of buffer pool memory segments |
US5915249A (en) * | 1996-06-14 | 1999-06-22 | Excite, Inc. | System and method for accelerated query evaluation of very large full-text databases |
US6067584A (en) * | 1996-09-09 | 2000-05-23 | National Instruments Corporation | Attribute-based system and method for configuring and controlling a data acquisition task |
US5916309A (en) * | 1997-05-12 | 1999-06-29 | Lexmark International Inc. | System for dynamically determining the size and number of communication buffers based on communication parameters at the beginning of the reception of message |
US6067547A (en) * | 1997-08-12 | 2000-05-23 | Microsoft Corporation | Hash table expansion and contraction for use with internal searching |
US6161154A (en) * | 1997-11-26 | 2000-12-12 | National Instruments Corporation | System and method for extracting and restoring a video buffer from/to a video acquisition cycle |
US6349308B1 (en) * | 1998-02-25 | 2002-02-19 | Korea Advanced Institute Of Science & Technology | Inverted index storage structure using subindexes and large objects for tight coupling of information retrieval with database management systems |
US6463486B1 (en) * | 1999-04-06 | 2002-10-08 | Microsoft Corporation | System for handling streaming information using a plurality of reader modules by enumerating output pins and associated streams of information |
US20050246457A1 (en) * | 1999-04-06 | 2005-11-03 | Microsoft Corporation | System for handling streaming information using a plurality of reader modules by enumerating output pins and associated streams of information |
US6542967B1 (en) * | 1999-04-12 | 2003-04-01 | Novell, Inc. | Cache object store |
US7099898B1 (en) * | 1999-08-12 | 2006-08-29 | International Business Machines Corporation | Data access system |
US7330916B1 (en) * | 1999-12-02 | 2008-02-12 | Nvidia Corporation | Graphic controller to manage a memory and effective size of FIFO buffer as viewed by CPU can be as large as the memory |
US6546456B1 (en) * | 2000-09-08 | 2003-04-08 | International Business Machines Corporation | Method and apparatus for operating vehicle mounted disk drive storage device |
US6993604B2 (en) * | 2000-11-15 | 2006-01-31 | Seagate Technology Llc | Dynamic buffer size allocation for multiplexed streaming |
US20060080482A1 (en) * | 2000-11-15 | 2006-04-13 | Seagate Technology Llc | Dynamic buffer size allocation for multiplexed streaming |
US20030069877A1 (en) * | 2001-08-13 | 2003-04-10 | Xerox Corporation | System for automatically generating queries |
US7058642B2 (en) * | 2002-03-20 | 2006-06-06 | Intel Corporation | Method and data structure for a low memory overhead database |
US7266622B2 (en) * | 2002-03-25 | 2007-09-04 | International Business Machines Corporation | Method, computer program product, and system for automatic application buffering |
US7487141B1 (en) * | 2003-06-19 | 2009-02-03 | Sap Ag | Skipping pattern for an inverted index |
US20050028156A1 (en) * | 2003-07-30 | 2005-02-03 | Northwestern University | Automatic method and system for formulating and transforming representations of context used by information services |
US7533245B2 (en) * | 2003-08-01 | 2009-05-12 | Illinois Institute Of Technology | Hardware assisted pruned inverted index component |
US7143263B2 (en) * | 2003-10-16 | 2006-11-28 | International Business Machines Corporation | System and method of adaptively reconfiguring buffers |
US20080140639A1 (en) * | 2003-12-29 | 2008-06-12 | International Business Machines Corporation | Processing a Text Search Query in a Collection of Documents |
US7370037B2 (en) * | 2003-12-29 | 2008-05-06 | International Business Machines Corporation | Methods for processing a text search query in a collection of documents |
US7337165B2 (en) * | 2003-12-29 | 2008-02-26 | International Business Machines Corporation | Method and system for processing a text search query in a collection of documents |
US7213094B2 (en) * | 2004-02-17 | 2007-05-01 | Intel Corporation | Method and apparatus for managing buffers in PCI bridges |
US7146466B2 (en) * | 2004-03-23 | 2006-12-05 | International Business Machines | System for balancing multiple memory buffer sizes and method therefor |
US7480750B2 (en) * | 2004-05-14 | 2009-01-20 | International Business Machines Corporation | Optimization of buffer pool sizes for data storage |
US7536408B2 (en) * | 2004-07-26 | 2009-05-19 | Google Inc. | Phrase-based indexing in an information retrieval system |
US20060248037A1 (en) * | 2005-04-29 | 2006-11-02 | International Business Machines Corporation | Annotation of inverted list text indexes using search queries |
US20070078890A1 (en) * | 2005-10-05 | 2007-04-05 | International Business Machines Corporation | System and method for providing an object to support data structures in worm storage |
US7487178B2 (en) * | 2005-10-05 | 2009-02-03 | International Business Machines Corporation | System and method for providing an object to support data structures in worm storage |
US20090049086A1 (en) * | 2005-10-05 | 2009-02-19 | International Business Machines Corporation | System and method for providing an object to support data structures in worm storage |
US20070112813A1 (en) * | 2005-11-08 | 2007-05-17 | Beyer Kevin S | Virtual cursors for XML joins |
US20070156958A1 (en) * | 2006-01-03 | 2007-07-05 | Emc Corporation | Methods, systems, and computer program products for optimized copying of logical units (LUNs) in a redundant array of inexpensive disks (RAID) environment using buffers that are smaller than LUN delta map chunks |
US20070255698A1 (en) * | 2006-04-10 | 2007-11-01 | Garrett Kaminaga | Secure and granular index for information retrieval |
US20080040307A1 (en) * | 2006-08-04 | 2008-02-14 | Apple Computer, Inc. | Index compression |
US20080059420A1 (en) * | 2006-08-22 | 2008-03-06 | International Business Machines Corporation | System and Method for Providing a Trustworthy Inverted Index to Enable Searching of Records |
US20080082554A1 (en) * | 2006-10-03 | 2008-04-03 | Paul Pedersen | Systems and methods for providing a dynamic document index |
US20080228743A1 (en) * | 2007-03-15 | 2008-09-18 | International Business Machines Corporation | System and method for multi-dimensional aggregation over large text corpora |
US20100161617A1 (en) * | 2007-03-30 | 2010-06-24 | Google Inc. | Index server architecture using tiered and sharded phrase posting lists |
US20090132521A1 (en) * | 2007-08-31 | 2009-05-21 | Powerset, Inc. | Efficient Storage and Retrieval of Posting Lists |
US20090094416A1 (en) * | 2007-10-05 | 2009-04-09 | Yahoo! Inc. | System and method for caching posting lists |
US20090112843A1 (en) * | 2007-10-29 | 2009-04-30 | International Business Machines Corporation | System and method for providing differentiated service levels for search index |
US20090164437A1 (en) * | 2007-12-20 | 2009-06-25 | Torbjornsen Oystein | Method for dynamic updating of an index, and a search engine implementing the same |
US20090164424A1 (en) * | 2007-12-25 | 2009-06-25 | Benjamin Sznajder | Object-Oriented Twig Query Evaluation |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11003644B2 (en) * | 2012-05-18 | 2021-05-11 | Splunk Inc. | Directly searchable and indirectly searchable using associated inverted indexes raw machine datastore |
EP3000060A4 (en) * | 2013-05-21 | 2017-01-18 | Facebook, Inc. | Database sharding with update layer |
US10977229B2 (en) | 2013-05-21 | 2021-04-13 | Facebook, Inc. | Database sharding with update layer |
WO2015078273A1 (en) * | 2013-11-29 | 2015-06-04 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for search |
US10452691B2 (en) | 2013-11-29 | 2019-10-22 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for generating search results using inverted index |
US11334353B2 (en) * | 2017-05-18 | 2022-05-17 | Nec Corporation | Multiparty computation method, apparatus and program |
US10545960B1 (en) * | 2019-03-12 | 2020-01-28 | The Governing Council Of The University Of Toronto | System and method for set overlap searching of data lakes |
Also Published As
Publication number | Publication date |
---|---|
US8205025B2 (en) | 2012-06-19 |
US20120260011A1 (en) | 2012-10-11 |
US20110040762A1 (en) | 2011-02-17 |
US20110040905A1 (en) | 2011-02-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20110040761A1 (en) | Estimation of postings list length in a search system using an approximation table | |
US11036799B2 (en) | Low RAM space, high-throughput persistent key value store using secondary memory | |
US10678654B2 (en) | Systems and methods for data backup using data binning and deduplication | |
US8977623B2 (en) | Method and system for search engine indexing and searching using the index | |
US10114908B2 (en) | Hybrid table implementation by using buffer pool as permanent in-memory storage for memory-resident data | |
KR102564170B1 (en) | Method and device for storing data object, and computer readable storage medium having a computer program using the same | |
CN106980665B (en) | Data dictionary implementation method and device and data dictionary management system | |
US10255234B2 (en) | Method for storing data elements in a database | |
US10810174B2 (en) | Database management system, database server, and database management method | |
CN105468644B (en) | Method and equipment for querying in database | |
US11169968B2 (en) | Region-integrated data deduplication implementing a multi-lifetime duplicate finder | |
CN110532347A (en) | A kind of daily record data processing method, device, equipment and storage medium | |
CN111475105A (en) | Monitoring data storage method, device, server and storage medium | |
CN111045994B (en) | File classification retrieval method and system based on KV database | |
US11520818B2 (en) | Method, apparatus and computer program product for managing metadata of storage object | |
CN112306957A (en) | Method and device for acquiring index node number, computing equipment and storage medium | |
US20130218851A1 (en) | Storage system, data management device, method and program | |
US20230138113A1 (en) | System for retrieval of large datasets in cloud environments | |
US8051090B2 (en) | File management method of a ring buffer and related file management apparatus | |
US7720805B1 (en) | Sequential unload processing of IMS databases | |
CN111723266A (en) | Mass data processing method and device | |
WO2022141650A1 (en) | Memory-frugal index design in storage engine | |
US20230385240A1 (en) | Optimizations for data deduplication operations | |
US20210011881A1 (en) | System and method for insertable and removable file system | |
US8819089B2 (en) | Memory efficient representation of relational data with constant time random access to the data itself |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: GLOBALSPEC, INC., NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FLATLAND, STEINAR;DALTON, JEFF J.;REEL/FRAME:024825/0968 Effective date: 20100809 |
|
AS | Assignment |
Owner name: COMERICA BANK, MICHIGAN Free format text: SECURITY AGREEMENT;ASSIGNOR:GLOBALSPEC, INC.;REEL/FRAME:026146/0641 Effective date: 20031229 |
|
AS | Assignment |
Owner name: GLOBALSPEC, INC., NEW YORK Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:COMERICA BANK;REEL/FRAME:028464/0833 Effective date: 20120626 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |