US20110040761A1 - Estimation of postings list length in a search system using an approximation table - Google Patents

Estimation of postings list length in a search system using an approximation table

Info

Publication number
US20110040761A1
US20110040761A1 (application US12/854,726)
Authority
US
United States
Prior art keywords
size
posting list
predetermined
posting
reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/854,726
Inventor
Steinar Flatland
Jeff J. Dalton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GLOBALSPEC Inc
Original Assignee
GLOBALSPEC Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GLOBALSPEC Inc filed Critical GLOBALSPEC Inc
Priority to US12/854,726
Assigned to GLOBALSPEC, INC. (assignment of assignors interest). Assignors: DALTON, JEFF J.; FLATLAND, STEINAR
Publication of US20110040761A1
Assigned to COMERICA BANK (security agreement). Assignor: GLOBALSPEC, INC.
Assigned to GLOBALSPEC, INC. (release by secured party). Assignor: COMERICA BANK
Legal status: Abandoned

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G06F16/319: Inverted lists

Definitions

  • the present invention generally relates to searching an inverted index. More particularly, the invention relates to estimating a posting list size based on document frequency in order to minimize accesses to the posting list stored in secondary storage.
  • an inverted index 100 often comprises two related data structures (see FIG. 1).
  • When processing a user's query, a computerized search system needs access to the postings of the terms that describe the user's information need. As part of processing the query, the search system aggregates information from these postings, by document, in an accumulation process that leads to a ranked list of documents to answer the user's query.
  • a large inverted index may not fit into a computer's main memory, requiring secondary storage, typically disk storage, to help store the posting file, lexicon, or both.
  • Each separate access to disk may incur seek time on the order of several milliseconds if it is necessary to move the hard drive's read heads, which is very expensive in terms of runtime performance compared to accessing main memory.
  • the present invention provides, in a first aspect, a method of minimizing accesses to secondary storage when searching an inverted index for a search term.
  • the method comprises automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
  • the present invention provides, in a second aspect, a computer system for minimizing accesses to secondary storage when searching an inverted index for a search term.
  • the computer system comprises a memory, and a processor in communication with the memory to perform a method.
  • the method comprises automatically obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
  • the present invention provides, in a third aspect, a program product for minimizing accesses to secondary storage when searching an inverted index for a search term.
  • the program product comprises a storage medium readable by a processor and storing instructions for execution by the processor for performing a method.
  • the method comprises automatically obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, the posting list being stored in secondary storage, and reading at least a portion of the posting list into memory based on the approximated size.
  • the present invention provides, in a fourth aspect, a data structure for use in minimizing accesses to data stored in secondary storage when searching an inverted index for a search term.
  • the data structure comprises a posting list length approximation table, comprising a hash table, the hash table comprising: a plurality of range IDs, each range ID corresponding to a subset of posting lists of predetermined similar size and representing a non-overlapping range of document frequencies, and a posting list length approximation for each range ID.
  • FIG. 1 depicts one example of an inverted index consisting of a lexicon and corresponding posting file.
  • FIG. 2 depicts one example of a posting list length approximation table data structure, according to one aspect of the present invention.
  • FIG. 3 is a flow diagram for one example of a method of reading a posting list in accordance with one or more aspects of the present invention.
  • FIG. 4 depicts one example of an inverted index with the storage split between main memory and secondary storage.
  • FIG. 5 is an object oriented instance diagram showing one example of a posting list reader and the main objects it uses, in accordance with the present invention.
  • FIG. 6 is a block diagram of one example of a computing unit incorporating one or more aspects of the present invention.
  • the present invention approximates posting list size, preferably as a length in bytes, according to a term's document frequency.
  • the approximate posting list size is preferably predetermined, and it covers, with high probability, the size of the associated posting list in secondary storage. Knowing the approximate size is useful for minimizing the number of accesses to secondary storage when reading a posting list. For example, if the approximate covering read size is several megabytes or less, a highly efficient strategy is to scoop up the whole posting list in a single access to secondary storage through a single read system call.
  • the list can be read, for example, by filling the largest available input buffer several times using a single system call per buffer fill operation, and then doing one more partial read to pick up the remainder of the approximate covering read size. For the rare case where the approximate covering read size does not cover the posting list being read, additional supplemental reads can be issued as necessary.
  • Reading a posting list with a predetermined buffer fill size strategy requires knowledge of the size of the posting list in terms of buffer elements (typically bytes) before reading begins.
  • if an inverted index is small enough that the lexicon fits entirely into memory, it is a simple matter to determine the size of a posting list in bytes. For example, referring back to FIG. 1, and assuming that the lexicon is entirely in main memory, simply subtract adjacent posting list addresses to know the size in bytes of a given posting list.
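With the whole lexicon in main memory, the subtraction described above is trivial. A minimal sketch follows; the lexicon contents and posting file length are hypothetical illustration data, not values from the patent:

```python
# Hypothetical in-memory lexicon: terms in sorted order, each paired
# with its posting list address (byte offset into the posting file).
lexicon = [
    ("apple",  0),
    ("banana", 1800),
    ("cherry", 2400),
]
POSTING_FILE_LENGTH = 3100  # total posting file length in bytes

def posting_list_size(i):
    """Exact size in bytes of the i-th term's posting list, obtained by
    subtracting adjacent posting list addresses (the last list runs to
    the end of the posting file)."""
    start = lexicon[i][1]
    end = lexicon[i + 1][1] if i + 1 < len(lexicon) else POSTING_FILE_LENGTH
    return end - start
```

This exactness is only available when the lexicon is memory-resident, which motivates the approximation table that follows.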
  • an advantage of the present invention is that this sizing information preferably used for efficient reading from secondary storage can be instantly available in main memory without needing to store the full lexicon in main memory and without needing to store posting list size in bytes as a separate field in the lexicon.
  • the present invention uses a posting list range to predetermine approximate posting list size.
  • a posting list range is a set of posting lists defined by an inclusive minimum and an inclusive maximum document frequency.
  • a posting list is a member of a posting list range if the posting list's document frequency falls within the inclusive minimum and maximum of the range.
  • Posting lists that are part of the same posting list range will have the same approximate posting list size.
  • the present invention builds on the concept of a posting list range to predetermine approximate posting list size.
  • the posting lists in the inverted index are partitioned into a collection of non-overlapping ranges whose union is the complete set of posting lists in the index. Each of these ranges is assigned a unique range identifier (rangeId).
  • the function takes as input an integer that is the length of a posting list in number of postings, also known as the document frequency.
  • the function returns the ID of the range that includes the posting list whose document frequency was passed in.
  • ln( ) is the natural logarithm function.
  • ceil( ) is a function that rounds a number with a fractional part up to the next higher integer.
  • Table I below shows how the implementation of documentFrequencyToRangeIdTranslator above partitions posting lists into posting list ranges.
  • the implementation of documentFrequencyToRangeIdTranslator has been found to work well in practice with a natural language corpus in which the word distribution adheres to Zipf's law.
  • Each successive rangeId includes twice as many document frequencies as the preceding rangeId.
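The implementation of documentFrequencyToRangeIdTranslator is not reproduced in this excerpt, but the surrounding description (ln( ), ceil( ), and successive ranges that double in width) pins down one function consistent with it. The sketch below is an assumption modeled on that description, not the patent's exact code; the function name is a Python-cased stand-in for documentFrequencyToRangeIdTranslator:

```python
def document_frequency_to_range_id(document_frequency):
    """Stand-in for documentFrequencyToRangeIdTranslator (assumed form).

    Equivalent to ceil(ln(df + 1) / ln(2)) for integer df >= 1, computed
    exactly with integer arithmetic: rangeId 1 covers df 1, rangeId 2
    covers df 2-3, rangeId 3 covers df 4-7, and so on, so each
    successive range spans twice as many document frequencies.
    """
    return document_frequency.bit_length()
```

For example, a term occurring in 1,000 documents falls in rangeId 10, which covers document frequencies 512 through 1,023.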
  • the Posting List Length Approximation table data structure pictured in FIG. 2 can be created in one example as follows. For each posting list range compute the mean and standard deviation of posting list size as stored in secondary storage, preferably in bytes. Next, create a Posting List Length Approximation object consisting of the rangeId of the current range, the mean of posting list size, and the standard deviation of posting list size. Finally, add a hash table entry to the Posting List Length Approximation Table mapping the rangeId to the Posting List Length Approximation object.
  • FIG. 2 depicts one example of a data structure 200 for a posting list length approximation table, in accordance with one aspect of the present invention.
  • the data structure comprises a hash table 210 with keys 220 and associated values 230.
  • the keys comprise a plurality of range IDs 240, as described above.
  • the associated values comprise the posting list length approximation information 250 .
  • the length approximation information is based on a predetermined length.
  • the information comprises, for example, the corresponding range ID, a mean posting list length, and a standard deviation for the posting list length.
  • the mean length and standard deviation are preferably expressed, for example, in bytes.
  • the posting list length approximation table has an access method, getPostingListLengthApproximation(documentFrequency), which returns a Posting List Length Approximation object based on the document frequency passed in.
  • this method translates the document frequency to a rangeId using the documentFrequencyToRangeIdTranslator function discussed earlier. This rangeId is then used to do a hash table lookup to find the proper Posting List Length Approximation object to return.
  • the resulting Posting List Length Approximation object can then be turned into an approximate covering read size by, for example, adding the mean posting list length in bytes to the desired number of standard deviations.
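Putting the pieces together, a table keyed by rangeId that holds per-range mean and standard deviation could be built and queried roughly as follows. This is a hedged sketch: the class and method names mirror the text, but the construction details and the translator are assumptions, not the patent's exact code:

```python
from collections import defaultdict
from statistics import mean, pstdev

def document_frequency_to_range_id(df):
    # Assumed translator: each successive rangeId spans twice as many
    # document frequencies (rangeId 1 -> df 1, rangeId 2 -> df 2-3, ...).
    return df.bit_length()

class PostingListLengthApproximationTable:
    """Hash table mapping rangeId -> (rangeId, mean, stddev) of posting
    list sizes in bytes, as in FIG. 2. Names follow the text; the
    construction details are a sketch."""

    def __init__(self, posting_lists):
        # posting_lists: iterable of (document_frequency, size_in_bytes).
        by_range = defaultdict(list)
        for df, size in posting_lists:
            by_range[document_frequency_to_range_id(df)].append(size)
        self._table = {
            range_id: (range_id, mean(sizes), pstdev(sizes))
            for range_id, sizes in by_range.items()
        }

    def get_posting_list_length_approximation(self, document_frequency):
        # Translate df to a rangeId, then do the hash table lookup.
        return self._table[document_frequency_to_range_id(document_frequency)]

def approximate_covering_read_size(approximation, num_std_devs=2):
    """Mean posting list length plus the desired number of standard
    deviations, in bytes (the covering read size from the text)."""
    _, mean_len, std_len = approximation
    return int(mean_len + num_std_devs * std_len)
```

The table holds one small entry per range rather than one per term, which is what keeps it resident in main memory even when the lexicon is not.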
  • the present example has a similar structure to the inverted index of FIG. 1 .
  • suppose the inverted index is large enough that the posting file is entirely in secondary storage and only half of the lexicon fits into main memory.
  • FIG. 4 shows how the inverted index 400 is divided between main memory 402, storing the lexicon index 404, and secondary storage 406, storing the full lexicon 408 and the posting file 410.
  • let N be the total number of terms in the full lexicon in secondary storage. Only every second term, for a total of N/2 terms, is kept in the lexicon index in main memory due to memory constraints.
  • the full lexicon in secondary storage is preferably organized as a sequence of blocks, e.g., block 412, each of a constant size k (e.g., in bytes) such that any block can accommodate the largest pair of lexicon entries in the lexicon.
  • the lexicon index does not need to store explicit disk pointers into the full lexicon. Instead, to locate the block in the full lexicon of the lexicon index record with zero-based index i, simply seek to offset i*k in the full lexicon.
  • the lexicon index includes document frequency but does not include posting list sizing in bytes. The goal is to keep the main memory lexicon data structure as compact as possible.
  • the Posting List Length Approximation Table will provide needed sizing information for efficient reading of posting lists in secondary storage.
  • the search system uses an object called a Posting List Reader 500, shown in FIG. 5, to read postings from secondary storage during query processing.
  • the Posting List Reader uses a Posting List Length Approximation Table 502 to accurately estimate the sizes of posting lists to be read. It uses an Enhanced Buffered Reader 504 with an internal buffer of size bufsize bytes to read postings 506 from secondary storage using efficient predetermined buffer fill size strategies. Preferably, bufsize is relatively large (for example several megabytes) to facilitate reading large posting lists with relatively few read system calls.
  • the Posting List Reader provides the following access methods:
  • the readPosting( ) method may be used.
  • When a user runs a query, the search system first parses the query, identifies the terms for which postings are needed to process the query, and locates each of these terms in the lexicon to obtain a document frequency and posting list address for each. Assuming a lexicon structured similar to that shown in FIG. 4, a term's document frequency and posting list address can be retrieved without accessing secondary storage about half the time by doing a binary search of the lexicon index in main memory, which is very fast.
  • a disk seek can be used to find the term in the full lexicon in secondary storage by seeking to offset i*k in the full lexicon and reading the lexicon entries there, where i is the zero-based record offset in the lexicon index of the lexically greatest term that is lexically less than the sought term, and k is the block size of the blocks in the full lexicon.
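The two-level lookup just described, a binary search of the in-memory lexicon index followed, on a miss, by a seek to offset i*k in the full lexicon, might be sketched as follows. The tuple layout and the block size k are hypothetical:

```python
import bisect

BLOCK_SIZE_K = 256  # hypothetical constant block size k, in bytes

def locate_term(lexicon_index, term):
    """Two-level lexicon lookup (a sketch; field layout is hypothetical).

    lexicon_index holds every second term of the full lexicon, sorted,
    as (term, document_frequency, posting_list_address) tuples. Returns
    the in-memory entry on a hit, or the byte offset in the full
    lexicon of the block to read from secondary storage on a miss."""
    terms = [entry[0] for entry in lexicon_index]
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return ("in_memory", lexicon_index[i])  # about half of lookups
    # i - 1 is the zero-based record offset of the lexically greatest
    # indexed term that is lexically less than the sought term; its
    # block in the full lexicon starts at offset (i - 1) * k.
    return ("seek", max(i - 1, 0) * BLOCK_SIZE_K)
```

Because blocks are a constant k bytes, no explicit disk pointers into the full lexicon need to be stored in the index, keeping the in-memory structure compact.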
  • the Posting List Reader receives an initialize request (step 302) that includes a document frequency and a posting list address.
  • the document frequency is the length of the posting list to read in number of postings
  • the posting list address is the byte offset in the posting file where the posting list to read starts.
  • the Posting List Reader obtains a Posting List Length Approximation object (step 304) by calling the getPostingListLengthApproximation method on the Posting List Length Approximation Table pictured in FIG. 2, passing the document frequency to this getter.
  • getPostingListLengthApproximation in turn translates the document frequency passed in to a rangeId using the documentFrequencyToRangeIdTranslator function described earlier and does a hash table lookup in the Posting List Length Approximation table based on the rangeId to obtain the Posting List Length Approximation object.
  • the Posting List Reader next obtains the approximate size of the posting list to read (step 306) by getting the mean and standard deviation of posting list length from the Posting List Length Approximation object and adding the desired number of standard deviations to the mean. Let approximateReadSize be the approximate read size calculated in this step.
  • the next step in initializing the Posting List Reader is to build a predetermined buffer fill size strategy (step 308) for use with the Enhanced Buffered Reader.
  • a predetermined buffer fill size strategy is an ordered sequence of (fillSize, numTimesToUse) pairs, where fillSize indicates how much of the Enhanced Buffered Reader's internal input buffer to fill when a buffer fill is needed, and numTimesToUse indicates how many times to use the associated fillSize. There are two cases to consider, based on the relative sizes of bufsize (the Enhanced Buffered Reader's internal buffer size) and approximateReadSize.
  • the above strategy, when installed in an Enhanced Buffered Reader and used to read the posting list, will utilize the available input buffer of size bufsize bytes to read approximateReadSize bytes of data using a minimal number of disk seeks and minimal data transfer.
  • the approximateReadSize is sufficient to read the entire posting list with high probability; however, as many supplemental 8 kilobyte reads as necessary will be issued to handle the relatively rare case when the approximateReadSize is insufficient.
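The strategy construction can be illustrated as follows. The two cases from the text (approximateReadSize at most bufsize, or larger) plus the supplemental 8-kilobyte reads are represented; the exact encoding of "as many times as necessary" is an assumption (None here), not the patent's wire format:

```python
SUPPLEMENTAL_READ_SIZE = 8 * 1024  # 8-kilobyte supplemental reads

def build_fill_size_strategy(approximate_read_size, bufsize):
    """Return an ordered list of (fill_size, num_times_to_use) pairs.

    Case 1: the approximate read size fits in one buffer fill.
    Case 2: several full buffer fills plus one final partial fill.
    A trailing entry with num_times_to_use = None ("as many times as
    necessary") covers the rare case where the approximation does not
    cover the posting list. A sketch, not the patent's exact encoding."""
    strategy = []
    if approximate_read_size <= bufsize:
        strategy.append((approximate_read_size, 1))
    else:
        full_fills, remainder = divmod(approximate_read_size, bufsize)
        strategy.append((bufsize, full_fills))
        if remainder:
            strategy.append((remainder, 1))
    strategy.append((SUPPLEMENTAL_READ_SIZE, None))
    return strategy
```

With a several-megabyte bufsize, most posting lists reduce to a single (approximateReadSize, 1) entry, i.e., one read system call per list.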
  • the next step in initializing the Posting List Reader is to seek the Enhanced Buffered Reader to the start of the posting list (step 310).
  • the posting list address that was passed to the initialize request (step 302) is forwarded to the Enhanced Buffered Reader's seek method.
  • the predetermined buffer fill size strategy of step 308 is installed in the Enhanced Buffered Reader (step 312) by calling the appropriate setter.
  • the posting list reader is now ready to start processing read requests for postings (step 314).
  • the Enhanced Buffered Reader automatically initiates buffer refilling as needed using read sizes consistent with good runtime performance when accessing secondary storage.
  • a data processing system 600 suitable for storing and/or executing program code may be provided that includes at least one processor 610 coupled directly or indirectly to memory elements through a system bus 620.
  • the memory elements include, for instance, data buffers 630 and 640, local memory employed during actual execution of the program code, bulk storage 650, and cache memory, which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
  • I/O devices 660 can be coupled to the system either directly or through intervening I/O controllers.
  • Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the available types of network adapters.
  • the shortcomings of the prior art are overcome and additional advantages are provided through the provision of a computer program product for minimizing accesses to secondary storage for a posting list when searching an inverted index for a search term.
  • the computer program product comprises a storage medium readable by a processing circuit and storing instructions for execution by a computer for performing a method.
  • the method includes, for instance, automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
  • an application can be deployed for performing one or more aspects of the present invention.
  • the deploying of an application comprises providing computer infrastructure operable to perform one or more aspects of the present invention.
  • a computing infrastructure can be deployed comprising integrating computer readable code into a computing system, in which the code in combination with the computing system is capable of performing one or more aspects of the present invention.
  • a process for integrating computing infrastructure is provided, comprising integrating computer readable code into a computer system.
  • the computer system comprises a computer readable medium, in which the computer readable medium comprises one or more aspects of the present invention.
  • the code in combination with the computer system is capable of performing one or more aspects of the present invention.
  • aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system”. Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • the computer readable medium may be a computer readable storage medium.
  • a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • the computer readable storage medium includes the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer program product includes, for instance, one or more computer readable media to store computer readable program code means or logic thereon to provide and facilitate one or more aspects of the present invention.
  • the computer program product can take many different physical forms, for example, disks, platters, flash memory, etc., including those above.
  • Program code embodied on a computer readable medium may be transmitted using an appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language, such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

Abstract

The present invention provides a method of minimizing accesses to secondary storage when searching an inverted index for a search term. The method comprises automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size. Corresponding computer system and program products are also provided.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims priority under 35 U.S.C. §119 to the following U.S. Provisional Applications, which are herein incorporated by reference in their entirety:
  • Provisional Patent Application Ser. No. 61/233,411, by Flatland et al., entitled “ESTIMATION OF POSTINGS LIST LENGTH IN A SEARCH SYSTEM USING AN APPROXIMATION TABLE,” filed on Aug. 12, 2009; and
  • Provisional Patent Application No. 61/233,420, by Flatland et al., entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION,” filed on Aug. 12, 2009;
  • Provisional Patent Application Ser. No. 61/233,427, by Flatland et al., entitled “SEGMENTING POSTINGS LIST READER,” filed on Aug. 12, 2009.
  • This application contains subject matter which is related to the subject matter of the following applications, each of which is assigned to the same assignee as this application and filed on the same day as this application. Each of the below listed applications is hereby incorporated herein by reference in its entirety:
  • U.S. Non-Provisional patent application Ser. No. ______, by Flatland et al., entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION” (Attorney Docket No. 1634.069A); and
  • U.S. Non-Provisional patent application Ser. No. ______, by Flatland et al., entitled “SEGMENTING POSTINGS LIST READER” (Attorney Docket No. 1634.070A).
  • TECHNICAL FIELD
  • The present invention generally relates to searching an inverted index. More particularly, the invention relates to estimating a posting list size based on document frequency in order to minimize accesses to the posting list stored in secondary storage.
  • BACKGROUND
  • The following definition of Information Retrieval (IR) is from the book Introduction to Information Retrieval by Manning, Raghavan and Schutze, Cambridge University Press, 2008:
      • Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
  • An inverted index is a data structure central to the design of numerous modern information retrieval systems. In chapter 5 of Search Engines: Information Retrieval in Practice (Addison Wesley, 2010), Croft, Metzler and Strohman observe:
      • An inverted index is the computational equivalent of the index found in the back of this textbook . . . . The book index is arranged in alphabetical order by index term. Each index term is followed by a list of pages about the word.
  • In a search system implemented using a computer, an inverted index 100 often comprises two related data structures (see FIG. 1):
      • 1. A lexicon 101 contains the distinct set of terms 102 (i.e., with duplicates removed) that occur throughout all the documents of the index. To facilitate rapid searching, terms in the lexicon are usually stored in sorted order. Each term typically includes a document frequency 104 and a pointer into the other major data structure of the inverted index, the posting file 108. The document frequency is a count of the number of documents in which a term occurs. The document frequency is useful at search time both for prioritizing term processing and as input to scoring algorithms.
      • 2. The posting file 108 consists of one posting list per term in the lexicon, e.g., list 110 for term 112, recording for each term the set of documents in which the term occurs. Each entry in a posting list is called a posting. The number of postings in a given posting list equals the document frequency of the associated lexicon entry. A posting includes at least a document identifier and may include additional information such as: a count of the number of times the term occurs in the document; a list of term positions within the document where the term occurs; and more generally, scoring information that ascribes some degree of importance (or lack thereof) to the fact that the document contains the term.
  • When processing a user's query, a computerized search system needs access to the postings of the terms that describe the user's information need. As part of processing the query, the search system aggregates information from these postings, by document, in an accumulation process that leads to a ranked list of documents to answer the user's query.
  • A large inverted index may not fit into a computer's main memory, requiring secondary storage, typically disk storage, to help store the posting file, lexicon, or both. Each separate access to disk may incur seek time on the order of several milliseconds if it is necessary to move the hard drive's read heads, which is very expensive in terms of runtime performance compared to accessing main memory.
  • Therefore, it would be helpful to minimize accesses to secondary storage when searching an inverted index, in order to improve runtime performance.
  • BRIEF SUMMARY OF INVENTION
  • The present invention provides, in a first aspect, a method of minimizing accesses to secondary storage when searching an inverted index for a search term. The method comprises automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
  • The present invention provides, in a second aspect, a computer system for minimizing accesses to secondary storage when searching an inverted index for a search term. The computer system comprises a memory, and a processor in communication with the memory to perform a method. The method comprises automatically obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
  • The present invention provides, in a third aspect, a program product for minimizing accesses to secondary storage when searching an inverted index for a search term. The program product comprises a storage medium readable by a processor and storing instructions for execution by the processor for performing a method. The method comprises automatically obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, the posting list being stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
  • The present invention provides, in a fourth aspect, a data structure for use in minimizing accesses to data stored in secondary storage when searching an inverted index for a search term. The data structure comprises a posting list length approximation table, comprising a hash table, the hash table comprising: a plurality of range IDs, each range ID corresponding to a subset of posting lists of predetermined similar size and representing a non-overlapping range of document frequencies, and a posting list length approximation for each range ID.
  • These, and other objects, features and advantages of this invention will become apparent from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • One or more aspects of the present invention are particularly pointed out and distinctly claimed as examples in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
  • FIG. 1 depicts one example of an inverted index consisting of a lexicon and corresponding posting file.
  • FIG. 2 depicts one example of a posting list length approximation table data structure, according to one aspect of the present invention.
  • FIG. 3 is a flow diagram for one example of a method of reading a posting list in accordance with one or more aspects of the present invention.
  • FIG. 4 depicts one example of an inverted index with the storage split between main memory and secondary storage.
  • FIG. 5 is an object oriented instance diagram showing one example of a posting list reader and the main objects it uses, in accordance with the present invention.
  • FIG. 6 is a block diagram of one example of a computing unit incorporating one or more aspects of the present invention.
  • DETAILED DESCRIPTION OF THE INVENTION
  • The present invention approximates posting list size, preferably as a length in bytes, according to a term's document frequency. The approximate posting list size is preferably predetermined, and it covers, with high probability, the size of the associated posting list in secondary storage. Knowing the approximate size is useful for minimizing the number of accesses to secondary storage when reading a posting list. For example, if the approximate covering read size is several megabytes or less, a highly efficient strategy is to scoop up the whole posting list in a single access to secondary storage through a single read system call. If the approximate covering read size is larger than the largest available main memory input buffer, then the list can be read, for example, by filling the largest available input buffer several times using a single system call per buffer fill operation, and then doing one more partial read to pick up the remainder of the approximate covering read size. For the rare case where the approximate covering read size does not cover the posting list being read, additional supplemental reads can be issued as necessary.
  • U.S. Non-Provisional Patent Application entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION” (Attorney Docket No. 1634.069A) filed concurrently herewith, describes an enhanced buffered reader that can be configured with predetermined buffer fill size strategies. When the posting file is in secondary storage, using an enhanced buffered reader to read a posting list offers advantages over a conventional buffered reader. An enhanced buffered reader can be configured with a predetermined buffer fill size strategy that is based on both the size of the posting list (in bytes, for example) and the size of the available input buffer, ensuring that the fewest required number of system calls to read from secondary storage are issued. Another advantage of the enhanced buffered reader is that it neatly encapsulates buffer management details inside the enhanced buffered reader. The detailed description of the present invention assumes a working understanding of enhanced buffered readers.
  • Reading a posting list with a predetermined buffer fill size strategy requires knowledge of the size of the posting list in terms of buffer elements (typically bytes) before reading begins. When an inverted index is small enough that the lexicon fits entirely into memory, it is a simple matter to determine the size of a posting list in bytes. For example, referring back to FIG. 1, and assuming that the lexicon is entirely in main memory, simply subtract adjacent posting list addresses to know the size in bytes of a given posting list. As will become apparent in the detailed description below, an advantage of the present invention is that the sizing information used for efficient reading from secondary storage can be instantly available in main memory without needing to store the full lexicon in main memory and without needing to store posting list size in bytes as a separate field in the lexicon.
  • The present invention uses a posting list range to predetermine approximate posting list size. A posting list range is a set of posting lists defined by an inclusive minimum and an inclusive maximum document frequency. A posting list is a member of a posting list range if the posting list's document frequency falls within the inclusive minimum and maximum of the range. Posting lists that are part of the same posting list range will have the same approximate posting list size.
  • As a prerequisite for populating the Posting List Length Approximation Table data structure pictured in FIG. 2, and given an inverted index, the posting lists in the inverted index are partitioned into a collection of non-overlapping ranges whose union is the complete set of posting lists in the index. Each of these ranges is assigned a unique range identifier (rangeId).
  • One example of a way to accomplish this partitioning of posting lists is through a function called documentFrequencyToRangeIdTranslator, shown below and summarized here. The function takes as input an integer that is the length of a posting list in number of postings, also known as the document frequency. The function returns the ID of the range that includes the posting list whose document frequency was passed in. ln( ) is the natural logarithm function, and ceil( ) is a function that rounds a number with a fractional part to the next higher integer.
  •        documentFrequencyToRangeIdTranslator
    int documentFrequencyToRangeIdTranslator(int documentFrequency) {
      return ceil(ln(documentFrequency) / ln(2.0));
    }
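  • The translator above can be sketched in runnable form as follows. This is an illustrative Python sketch, not part of the disclosure; the integer bit-length formulation is an exact equivalent of ceil(ln(df)/ln(2.0)) for document frequencies of one or more, and it avoids floating-point rounding risk at exact powers of two.

```python
def document_frequency_to_range_id(document_frequency: int) -> int:
    """Map a document frequency (>= 1) to its rangeId.

    (df - 1).bit_length() equals ceil(log2(df)) for integer df >= 1,
    matching ceil(ln(df) / ln(2.0)) without floating-point rounding risk.
    """
    return (document_frequency - 1).bit_length()
```

For example, document frequencies 1, 2, 3-4, 5-8, 9-16, and 17-32 map to rangeIds 0 through 5, as in Table I below.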
  • Table I below shows how the implementation of documentFrequencyToRangeIdTranslator above partitions posting lists into posting list ranges. The implementation of documentFrequencyToRangeIdTranslator has been found to work well in practice with a natural language corpus in which the word distribution adheres to Zipf's law. Each successive rangeId includes twice as many document frequencies as the preceding rangeId.
  • TABLE I
    Sample Range Definitions
    minDocumentFrequency   maxDocumentFrequency   rangeId
    1                      1                      0
    2                      2                      1
    3                      4                      2
    5                      8                      3
    9                      16                     4
    17                     32                     5
    etc.
  • Other implementations of the documentFrequencyToRangeIdTranslator are possible. The above is merely one example. This function could be implemented in any way that defines a complete non-overlapping partitioning of the posting lists into ranges.
  • Given an inverted index, the Posting List Length Approximation table data structure pictured in FIG. 2 can be created in one example as follows. For each posting list range compute the mean and standard deviation of posting list size as stored in secondary storage, preferably in bytes. Next, create a Posting List Length Approximation object consisting of the rangeId of the current range, the mean of posting list size, and the standard deviation of posting list size. Finally, add a hash table entry to the Posting List Length Approximation Table mapping the rangeId to the Posting List Length Approximation object.
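  • The table construction just described can be sketched as follows. This is a hypothetical Python sketch under stated assumptions: the function and variable names are illustrative, each approximation object is represented as a (rangeId, mean, stddev) tuple, and per-range statistics use the population standard deviation.

```python
import statistics
from collections import defaultdict

def build_length_approximation_table(posting_lists):
    """Build a rangeId -> (rangeId, mean, stddev) approximation table.

    posting_lists: iterable of (document_frequency, size_in_bytes) pairs,
    one pair per posting list in the inverted index.
    """
    sizes_by_range = defaultdict(list)
    for df, size in posting_lists:
        # Same partitioning as documentFrequencyToRangeIdTranslator:
        # (df - 1).bit_length() == ceil(log2(df)) for df >= 1.
        sizes_by_range[(df - 1).bit_length()].append(size)
    return {
        range_id: (range_id, statistics.fmean(sizes), statistics.pstdev(sizes))
        for range_id, sizes in sizes_by_range.items()
    }
```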
  • FIG. 2 depicts one example of a data structure 200 for a posting list length approximation table, in accordance with one aspect of the present invention. The data structure comprises a hash table 210 with keys 220 and associated values 230. The keys comprise a plurality of range IDs 240, as described above. The associated values comprise the posting list length approximation information 250. In the presently preferred embodiment, the length approximation information is based on a predetermined length. The information comprises, for example, the corresponding range ID, a mean posting list length, and a standard deviation for the posting list length. The mean length and standard deviation are preferably expressed, for example, in bytes.
  • In one example, in addition to the structure shown in FIG. 2, the posting list length approximation table has an access method getPostingListLengthApproximation(documentFrequency) which returns a Posting List Length Approximation object based on a document frequency passed in. In the present example, the implementation of this method translates the document frequency to a rangeId using the documentFrequencyToRangeIdTranslator function discussed earlier. This rangeId is then used to do a hash table lookup to find the proper Posting List Length Approximation object to return. The resulting Posting List Length Approximation object can then be turned into an approximate covering read size by, for example, adding the mean posting list length in bytes to the desired number of standard deviations.
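  • Putting the lookup and the covering-size computation together might look like the following sketch. This is illustrative only: it assumes the approximation object is a (rangeId, mean, stddev) tuple, and the default of 2.0 standard deviations is an assumed choice, since the description above leaves the desired number of standard deviations open.

```python
def get_covering_read_size(table, document_frequency, num_stddevs=2.0):
    """Translate a document frequency to a rangeId, look up its length
    approximation, and return mean + num_stddevs * stddev bytes."""
    range_id = (document_frequency - 1).bit_length()  # assumed translator
    _, mean, stddev = table[range_id]
    return mean + num_stddevs * stddev
```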
  • One example of how to use a posting list length approximation table to read a posting list efficiently will now be provided with reference to the flow diagram 300 of FIG. 3. The inverted index in the present example is structured similarly to the inverted index of FIG. 1. In the scenario of this example, the inverted index is large enough that the posting file is entirely in secondary storage and only half of the lexicon fits into main memory.
  • FIG. 4 shows how the inverted index 400 is divided between main memory 402, storing the lexicon index 404, and secondary storage 406, storing the full lexicon 408 and the posting file 410. Referring to FIG. 4, let N be the total number of terms in the full lexicon in secondary storage. Only every second term, for a total of N/2 terms, is kept in the lexicon index in main memory due to memory constraints. The full lexicon in secondary storage is preferably organized as a sequence of blocks, e.g., block 412, each of a constant size k (e.g., in bytes) such that any block can accommodate the largest pair of lexicon entries in the lexicon. This causes some internal fragmentation within the full lexicon, but the advantage is that the lexicon index does not need to store explicit disk pointers into the full lexicon. Instead, to locate the block in the full lexicon that corresponds to the lexicon index record with zero-based index i, simply seek to offset i*k in the full lexicon. By design, the lexicon index includes document frequency but does not include posting list sizing in bytes. The goal is to keep the main memory lexicon data structure as compact as possible. The Posting List Length Approximation Table will provide the needed sizing information for efficient reading of posting lists in secondary storage.
  • In this example, it is assumed that the search engine implementation uses an object called a Posting List Reader 500, shown in FIG. 5, to read postings from secondary storage during query processing. The Posting List Reader uses a Posting List Length Approximation Table 502 to accurately estimate the sizes of posting lists to be read. It uses an Enhanced Buffered Reader 504 with an internal buffer of size bufsize bytes to read postings 506 from secondary storage using efficient predetermined buffer fill size strategies. Preferably, bufsize is relatively large (for example several megabytes) to facilitate reading large posting lists with relatively few read system calls. The Posting List Reader provides the following access methods:
      • initialize(documentFrequency, postingListAddress)—Prepares the Posting List Reader for reading based on a document frequency and posting list address of a term obtained from the lexicon. After initialization, the readPosting( ) method may be used.
      • readPosting( )—Reads the next posting from the posting list.
  • When a user runs a query, the search system first parses the query, identifies the terms for which postings are needed to process the query, and locates each of these terms in the lexicon to obtain a document frequency and posting list address for each. Assuming a lexicon structured similarly to that shown in FIG. 4, a term's document frequency and posting list address can be retrieved without accessing secondary storage about half the time by doing a binary search of the lexicon index in main memory, which is very fast. If necessary, a disk seek can be used to find the term in the full lexicon in secondary storage by seeking to offset i*k in the full lexicon and reading the lexicon entries there, where i is the zero-based record offset in the lexicon index of the lexically greatest term that is lexically less than the sought term, and k is the block size of the blocks in the full lexicon. Having obtained a document frequency and posting list address for a term, the search system initializes a Posting List Reader, preparing it to read postings, as discussed below.
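  • The two-level lookup described above (binary search of the in-memory lexicon index, falling back to one block read at offset i*k) might be sketched as follows. The entry layout and the read_block callback are illustrative assumptions, not part of the disclosure.

```python
import bisect

def locate_term(lexicon_index, term, block_size, read_block):
    """Find (document_frequency, posting_list_address) for a term.

    lexicon_index: sorted list of (term, df, address) tuples for every
    second lexicon entry, held in main memory.
    read_block(offset): returns the list of (term, df, address) entries
    stored in the full-lexicon block at the given byte offset.
    """
    terms = [entry[0] for entry in lexicon_index]
    i = bisect.bisect_right(terms, term) - 1
    if i < 0:
        return None  # term sorts before every indexed term
    if lexicon_index[i][0] == term:
        # In-memory hit: no access to secondary storage needed.
        return lexicon_index[i][1], lexicon_index[i][2]
    # Miss: one disk seek to the block at offset i * block_size.
    for t, df, address in read_block(i * block_size):
        if t == term:
            return df, address
    return None
```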
  • Returning to FIG. 3, the Posting List Reader receives an initialize request (step 302) that includes a document frequency and a posting list address. The document frequency is the length of the posting list to read in number of postings, and the posting list address is the byte offset in the posting file where the posting list to read starts.
  • The Posting List Reader obtains a Posting List Length Approximation object (step 304) by calling the getPostingListLengthApproximation method on the Posting List Length Approximation Table pictured in FIG. 2, passing the document frequency to this getter. (The implementation of getPostingListLengthApproximation in turn translates the document frequency passed in to a rangeId using the documentFrequencyToRangeIdTranslator function described earlier and does a hash table lookup in the Posting List Length Approximation table based on the rangeId to obtain the Posting List Length Approximation object.)
  • The Posting List Reader next obtains the approximate size of the posting list to read (step 306) by getting the mean and standard deviation of posting list length from the Posting List Length Approximation object and adding the desired number of standard deviations to the mean. Let approximateReadSize be the approximate read size calculated in this step.
  • The next step in initializing the Posting List Reader is to build a predetermined buffer fill size strategy (step 308) for use with the Enhanced Buffered Reader. A predetermined buffer fill size strategy is an ordered sequence of (fillSize, numTimesToUse) pairs, where fillSize indicates how much of the Enhanced Buffered Reader's internal input buffer to fill when a buffer fill is needed, and numTimesToUse indicates how many times to use the associated fillSize. There are two cases to consider, based on the relative sizes of the bufsize (the Enhanced Buffered Reader's internal buffer size) and approximateReadSize.
  • Case 1: approximateReadSize<=bufsize; and
  • Case 2: approximateReadSize>bufsize.
  • A discussion of these cases follows.
  • Case 1: approximateReadSize<=bufsize
  • Build a two-stage predetermined buffer fill size strategy as indicated below in Table II.
  • TABLE II
    Stage   Fill Size             Number of Times to Use
    1       approximateReadSize   1
    2       8 kilobytes           Repeat as necessary
  • The above two-stage strategy, when installed in an Enhanced Buffered Reader and used to read the posting list, will with high probability result in a single disk seek and read of exactly approximateReadSize bytes. As many supplemental 8 kilobyte reads as necessary may then be issued to handle the relatively rare case when the approximateReadSize is insufficient.
  • Case 2: approximateReadSize>bufsize
  • For this discussion, let “/” represent the operation of integer division, and “%” represent the operation of integer modulo.
  • In this case, we build a predetermined buffer fill size strategy that generally has three stages, as indicated in the following table. However, the second stage is not necessary when bufsize divides approximateReadSize evenly.
  • TABLE III
    Stage   Fill Size                       Number of Times to Use
    1       bufsize                         approximateReadSize/bufsize
    2       approximateReadSize % bufsize   1
    3       8 kilobytes                     Repeat as necessary
  • The above strategy, when installed in an Enhanced Buffered Reader and used to read the posting list, will utilize the available input buffer of size bufsize bytes to read approximateReadSize bytes of data using a minimal number of disk seeks and minimal data transfer. The approximateReadSize is sufficient to read the entire posting list with high probability; however, as many supplemental 8 kilobyte reads as necessary will be issued to handle the relatively rare case when the approximateReadSize is insufficient.
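  • Both cases can be combined into a single strategy-building sketch. This is an illustrative Python sketch of Tables II and III, with None used as a stand-in for "repeat as necessary" and the function name assumed for exposition.

```python
def build_fill_strategy(approximate_read_size, bufsize, supplemental=8 * 1024):
    """Return the ordered (fillSize, numTimesToUse) pairs of Tables II and III.

    A numTimesToUse of None stands for "repeat as necessary".
    """
    strategy = []
    if approximate_read_size <= bufsize:
        # Case 1 (Table II): a single read covers the whole approximate size.
        strategy.append((approximate_read_size, 1))
    else:
        # Case 2 (Table III): fill the whole buffer repeatedly ...
        strategy.append((bufsize, approximate_read_size // bufsize))
        remainder = approximate_read_size % bufsize
        if remainder:
            # ... then one partial fill, skipped when bufsize divides evenly.
            strategy.append((remainder, 1))
    # Supplemental 8 KB reads handle the rare case the estimate falls short.
    strategy.append((supplemental, None))
    return strategy
```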
  • Referring once again to FIG. 3, the next step in initializing the Posting List Reader is to seek the Enhanced Buffered Reader to the start of the posting list (step 310). The posting list address that was passed to the initialize request (step 302) is forwarded to the Enhanced Buffered Reader's seek method.
  • Finally, the predetermined buffer fill size strategy of step 308 is installed in the Enhanced Buffered Reader (step 312), by calling the appropriate setter. The posting list reader is now ready to start processing read requests for postings (step 314). As the search system's search logic issues read requests as desired, the Enhanced Buffered Reader automatically initiates buffer refilling as needed using read sizes consistent with good runtime performance when accessing secondary storage.
  • As shown in FIG. 6, one example of a data processing system 600 suitable for storing and/or executing program code includes at least one processor 610 coupled directly or indirectly to memory elements through a system bus 620. As known in the art, the memory elements include, for instance, data buffers 630 and 640, local memory employed during actual execution of the program code, bulk storage 650, and cache memory which provides temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
  • Input/Output or I/O devices 660 (including, but not limited to, keyboards, displays, pointing devices, DASD, tape, CDs, DVDs, thumb drives and other memory media, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the available types of network adapters.
  • The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a computer program product for minimizing accesses to secondary storage for a posting list when searching an inverted index for a search term. The computer program product comprises a storage medium readable by a processing circuit and storing instructions for execution by a computer for performing a method. The method includes, for instance, automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
  • Methods and systems relating to one or more aspects of the present invention are also described and claimed herein. Further, services relating to one or more aspects of the present invention are also described and may be claimed herein.
  • Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.
  • In one aspect of the present invention, an application can be deployed for performing one or more aspects of the present invention. As one example, the deploying of an application comprises providing computer infrastructure operable to perform one or more aspects of the present invention.
  • As a further aspect of the present invention, a computing infrastructure can be deployed comprising integrating computer readable code into a computing system, in which the code in combination with the computing system is capable of performing one or more aspects of the present invention.
  • As yet a further aspect of the present invention, a process for integrating computing infrastructure comprising integrating computer readable code into a computer system may be provided. The computer system comprises a computer readable medium, in which the computer medium comprises one or more aspects of the present invention. The code in combination with the computer system is capable of performing one or more aspects of the present invention.
  • As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system”. Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • In one example, a computer program product includes, for instance, one or more computer readable media to store computer readable program code means or logic thereon to provide and facilitate one or more aspects of the present invention. The computer program product can take many different physical forms, for example, disks, platters, flash memory, etc., including those above.
  • Program code embodied on a computer readable medium may be transmitted using an appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language, such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
  • The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
  • The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below, if any, are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (33)

1. A method of minimizing accesses to secondary storage when searching an inverted index for a search term, the method comprising:
obtaining by at least one computing unit a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage; and
reading by the at least one computing unit at least a portion of the posting list into memory based on the predetermined size.
2. The method of claim 1, wherein the size is a length in bytes.
3. The method of claim 1, wherein if the size obtained is a predetermined minimum size or less, then the reading comprises reading all of the posting list at once.
4. The method of claim 3, wherein the reading comprises issuing a single read system call.
5. The method of claim 3, wherein the predetermined minimum size comprises a size of a main memory input buffer.
6. The method of claim 1, wherein if the predetermined size is greater than a predetermined minimum size, then the reading comprises performing a plurality of read operations.
7. The method of claim 6, wherein the predetermined minimum size comprises a size of a main memory input buffer.
8. The method of claim 7, wherein the performing comprises filling a largest available main memory input buffer a minimum number of times.
9. The method of claim 1, wherein the obtaining comprises:
partitioning all posting lists in the inverted index into a plurality of non-overlapping ranges, each range having a minimum document frequency and a maximum document frequency;
assigning a range ID to each posting list; and
using the range ID to look up the predetermined size.
10. The method of claim 9, wherein each successive maximum document frequency is twice that of an immediate prior one.
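The range-based lookup of claims 9 and 10 can be sketched as follows. Because each successive range's maximum document frequency is twice that of the prior one, the range ID reduces to the floor of the base-2 logarithm of the document frequency. All names, table values, and the buffer size below are hypothetical illustrations, not taken from the patent:

```python
import math

def range_id(document_frequency: int) -> int:
    """Range ID under the doubling scheme of claim 10: floor(log2(df))."""
    return document_frequency.bit_length() - 1

# Hypothetical approximation table: range ID -> predetermined posting list
# size in bytes (claim 2). These values are illustrative only.
SIZE_BY_RANGE = {0: 16, 1: 32, 2: 64, 3: 128, 4: 256}

BUFFER_SIZE = 128  # hypothetical main memory input buffer size (claims 5 and 7)

def plan_reads(document_frequency: int) -> int:
    """Number of read operations to issue for a posting list.

    If the predetermined size fits the input buffer, one read system call
    suffices (claims 3-4); otherwise the buffer is filled a minimum number
    of times (claims 6-8).
    """
    size = SIZE_BY_RANGE[range_id(document_frequency)]
    if size <= BUFFER_SIZE:
        return 1
    return math.ceil(size / BUFFER_SIZE)
```

For example, a term appearing in one document maps to range 0 (16 bytes, a single read), while a term in 16 documents maps to range 4 (256 bytes, two buffer fills) under these illustrative values.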
11. A computer system for minimizing accesses to secondary storage when searching an inverted index for a search term, the computer system comprising:
a memory; and
a processor in communication with the memory to perform a method, the method comprising:
obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, wherein the posting list is stored in secondary storage; and
reading at least a portion of the posting list into memory based on the predetermined size.
12. The system of claim 11, wherein the size is a length in bytes.
13. The system of claim 11, wherein if the size obtained is a predetermined minimum size or less, then the reading comprises reading all of the posting list at once.
14. The system of claim 13, wherein the reading comprises issuing a single read system call.
15. The system of claim 13, wherein the predetermined minimum size comprises a size of a main memory input buffer.
16. The system of claim 11, wherein if the predetermined size is greater than a predetermined minimum size, then the reading comprises performing a plurality of read operations.
17. The system of claim 16, wherein the predetermined minimum size comprises a size of a main memory input buffer.
18. The system of claim 17, wherein the performing comprises filling a largest available main memory input buffer a minimum number of times.
19. The system of claim 11, wherein the obtaining comprises:
partitioning all posting lists in the inverted index into a plurality of non-overlapping ranges, each range having a minimum document frequency and a maximum document frequency;
assigning a range ID to each posting list; and
using the range ID to look up the predetermined size.
20. The system of claim 19, wherein each successive maximum document frequency is twice that of an immediate prior one.
21. A program product for minimizing accesses to secondary storage when searching an inverted index for a search term, the program product comprising:
a storage medium readable by a processor and storing instructions for execution by the processor for performing a method, the method comprising:
obtaining by at least one computing unit a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage; and
reading by the at least one computing unit at least a portion of the posting list into memory based on the predetermined size.
22. The program product of claim 21, wherein the size is a length in bytes.
23. The program product of claim 21, wherein if the size obtained is a predetermined minimum size or less, then the reading comprises reading all of the posting list at once.
24. The program product of claim 23, wherein the reading comprises issuing a single read system call.
25. The program product of claim 23, wherein the predetermined minimum size comprises a size of a main memory input buffer.
26. The program product of claim 21, wherein if the predetermined size is greater than a predetermined minimum size, then the reading comprises performing a plurality of read operations.
27. The program product of claim 26, wherein the predetermined minimum size comprises a size of a main memory input buffer.
28. The program product of claim 27, wherein the performing comprises filling a largest available main memory input buffer a minimum number of times.
29. The program product of claim 21, wherein the obtaining comprises:
partitioning all posting lists in the inverted index into a plurality of non-overlapping ranges, each range having a minimum document frequency and a maximum document frequency;
assigning a range ID to each posting list; and
using the range ID to look up the predetermined size.
30. The program product of claim 29, wherein each successive maximum document frequency is twice that of an immediate prior one.
31. A data structure for use in minimizing accesses to data stored in secondary storage when searching an inverted index for a search term, the data structure comprising:
a posting list length approximation table, comprising a hash table, the hash table comprising:
a plurality of range IDs, each range ID corresponding to a subset of posting lists of predetermined similar size and representing a non-overlapping range of document frequencies; and
a posting list length approximation for each range ID.
32. The data structure of claim 31, wherein the posting list length approximation is a length in bytes.
33. The data structure of claim 32, wherein the posting list length approximation comprises a mean length and a standard deviation length.
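The data structure of claims 31 through 33 can be sketched as a hash table keyed by range ID, each entry holding a mean and standard deviation of posting list length in bytes. The table values, field names, and the use of the standard deviation to pad the estimate are hypothetical illustrations, not from the patent:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LengthApproximation:
    """Posting list length approximation in bytes (claims 32-33)."""
    mean_bytes: float
    stddev_bytes: float

# Hypothetical posting list length approximation table (claim 31): a hash
# table whose keys are range IDs, each corresponding to a non-overlapping
# range of document frequencies. Values are illustrative only.
approximation_table = {
    0: LengthApproximation(mean_bytes=12.0, stddev_bytes=3.0),
    1: LengthApproximation(mean_bytes=28.0, stddev_bytes=6.5),
    2: LengthApproximation(mean_bytes=60.0, stddev_bytes=11.0),
}

def estimated_read_size(rid: int, k: float = 2.0) -> float:
    """Estimate bytes to read as mean + k standard deviations, one
    plausible way to make overestimation likely (hypothetical use of
    the claim 33 statistics)."""
    entry = approximation_table[rid]
    return entry.mean_bytes + k * entry.stddev_bytes
```

Keying on a small integer range ID rather than on the term itself keeps the table tiny relative to the lexicon while still giving a usable per-range size estimate.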
US12/854,726 2009-08-12 2010-08-11 Estimation of postings list length in a search system using an approximation table Abandoned US20110040761A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/854,726 US20110040761A1 (en) 2009-08-12 2010-08-11 Estimation of postings list length in a search system using an approximation table

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US23341109P 2009-08-12 2009-08-12
US23342009P 2009-08-12 2009-08-12
US23342709P 2009-08-12 2009-08-12
US12/854,726 US20110040761A1 (en) 2009-08-12 2010-08-11 Estimation of postings list length in a search system using an approximation table

Publications (1)

Publication Number Publication Date
US20110040761A1 true US20110040761A1 (en) 2011-02-17

Family

ID=43589199

Family Applications (4)

Application Number Title Priority Date Filing Date
US12/854,726 Abandoned US20110040761A1 (en) 2009-08-12 2010-08-11 Estimation of postings list length in a search system using an approximation table
US12/854,775 Abandoned US20110040762A1 (en) 2009-08-12 2010-08-11 Segmenting postings list reader
US12/854,755 Active 2030-09-21 US8205025B2 (en) 2009-08-12 2010-08-11 Efficient buffered reading with a plug-in for input buffer size determination
US13/460,515 Abandoned US20120260011A1 (en) 2009-08-12 2012-04-30 Efficient buffered reading with a plug-in for input buffer size determination

Family Applications After (3)

Application Number Title Priority Date Filing Date
US12/854,775 Abandoned US20110040762A1 (en) 2009-08-12 2010-08-11 Segmenting postings list reader
US12/854,755 Active 2030-09-21 US8205025B2 (en) 2009-08-12 2010-08-11 Efficient buffered reading with a plug-in for input buffer size determination
US13/460,515 Abandoned US20120260011A1 (en) 2009-08-12 2012-04-30 Efficient buffered reading with a plug-in for input buffer size determination

Country Status (1)

Country Link
US (4) US20110040761A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012126180A1 (en) 2011-03-24 2012-09-27 Microsoft Corporation Multi-layer search-engine index
US9152697B2 (en) 2011-07-13 2015-10-06 International Business Machines Corporation Real-time search of vertically partitioned, inverted indexes
EP2888679A1 (en) * 2012-08-24 2015-07-01 Yandex Europe AG Computer-implemented method of and system for searching an inverted index having a plurality of posting lists
US8739151B1 (en) * 2013-03-15 2014-05-27 Genetec Inc. Computer system using in-service software upgrade
US10983973B2 (en) * 2013-05-21 2021-04-20 Facebook, Inc. Database sharding with incorporated updates
US9910860B2 (en) * 2014-02-06 2018-03-06 International Business Machines Corporation Split elimination in MapReduce systems
US9971770B2 (en) * 2014-11-25 2018-05-15 Sap Se Inverted indexing


Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4754399A (en) * 1983-12-28 1988-06-28 Hitachi, Ltd. Data transfer control system for controlling data transfer between a buffer memory and input/output devices
AU652371B2 (en) * 1990-06-29 1994-08-25 Fujitsu Limited Data transfer system
US5537552A (en) * 1990-11-27 1996-07-16 Canon Kabushiki Kaisha Apparatus for selectively comparing pointers to detect full or empty status of a circular buffer area in an input/output (I/O) buffer
JPH0820964B2 (en) * 1991-09-13 1996-03-04 インターナショナル・ビジネス・マシーンズ・コーポレイション Memory control device and method
US6820144B2 (en) * 1999-04-06 2004-11-16 Microsoft Corporation Data format for a streaming information appliance
US6813731B2 (en) * 2001-02-26 2004-11-02 Emc Corporation Methods and apparatus for accessing trace data
GB2381401B (en) * 2001-10-23 2005-12-21 Thirdspace Living Ltd Data switch
US20040243556A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and including a document common analysis system (CAS)
US7567959B2 (en) * 2004-07-26 2009-07-28 Google Inc. Multiple index based information retrieval system
JP4305378B2 (en) * 2004-12-13 2009-07-29 ソニー株式会社 Data processing system, access control method, apparatus thereof, and program thereof
US20070198754A1 (en) * 2006-02-07 2007-08-23 International Business Machines Corporation Data transfer buffer control for performance
JP2008033721A (en) * 2006-07-31 2008-02-14 Matsushita Electric Ind Co Ltd Dma transfer control device
CA2675216A1 (en) * 2007-01-10 2008-07-17 Nick Koudas Method and system for information discovery and text analysis

Patent Citations (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3987419A (en) * 1974-12-05 1976-10-19 Goodyear Aerospace Corporation High speed information processing system
US4158235A (en) * 1977-04-18 1979-06-12 Burroughs Corporation Multi port time-shared associative buffer storage pool
US5179662A (en) * 1989-08-31 1993-01-12 International Business Machines Corporation Optimized i/o buffers having the ability to increase or decrease in size to meet system requirements
US5263159A (en) * 1989-09-20 1993-11-16 International Business Machines Corporation Information retrieval based on rank-ordered cumulative query scores calculated from weights of all keywords in an inverted index file for minimizing access to a main database
US6439783B1 (en) * 1994-07-19 2002-08-27 Oracle Corporation Range-based query optimizer
US5784698A (en) * 1995-12-05 1998-07-21 International Business Machines Corporation Dynamic memory allocation that enables efficient use of buffer pool memory segments
US5915249A (en) * 1996-06-14 1999-06-22 Excite, Inc. System and method for accelerated query evaluation of very large full-text databases
US6067584A (en) * 1996-09-09 2000-05-23 National Instruments Corporation Attribute-based system and method for configuring and controlling a data acquisition task
US5916309A (en) * 1997-05-12 1999-06-29 Lexmark International Inc. System for dynamically determining the size and number of communication buffers based on communication parameters at the beginning of the reception of message
US6067547A (en) * 1997-08-12 2000-05-23 Microsoft Corporation Hash table expansion and contraction for use with internal searching
US6161154A (en) * 1997-11-26 2000-12-12 National Instruments Corporation System and method for extracting and restoring a video buffer from/to a video acquisition cycle
US6349308B1 (en) * 1998-02-25 2002-02-19 Korea Advanced Institute Of Science & Technology Inverted index storage structure using subindexes and large objects for tight coupling of information retrieval with database management systems
US6463486B1 (en) * 1999-04-06 2002-10-08 Microsoft Corporation System for handling streaming information using a plurality of reader modules by enumerating output pins and associated streams of information
US20050246457A1 (en) * 1999-04-06 2005-11-03 Microsoft Corporation System for handling streaming information using a plurality of reader modules by enumerating output pins and associated streams of information
US6542967B1 (en) * 1999-04-12 2003-04-01 Novell, Inc. Cache object store
US7099898B1 (en) * 1999-08-12 2006-08-29 International Business Machines Corporation Data access system
US7330916B1 (en) * 1999-12-02 2008-02-12 Nvidia Corporation Graphic controller to manage a memory and effective size of FIFO buffer as viewed by CPU can be as large as the memory
US6546456B1 (en) * 2000-09-08 2003-04-08 International Business Machines Corporation Method and apparatus for operating vehicle mounted disk drive storage device
US6993604B2 (en) * 2000-11-15 2006-01-31 Seagate Technology Llc Dynamic buffer size allocation for multiplexed streaming
US20060080482A1 (en) * 2000-11-15 2006-04-13 Seagate Technology Llc Dynamic buffer size allocation for multiplexed streaming
US20030069877A1 (en) * 2001-08-13 2003-04-10 Xerox Corporation System for automatically generating queries
US7058642B2 (en) * 2002-03-20 2006-06-06 Intel Corporation Method and data structure for a low memory overhead database
US7266622B2 (en) * 2002-03-25 2007-09-04 International Business Machines Corporation Method, computer program product, and system for automatic application buffering
US7487141B1 (en) * 2003-06-19 2009-02-03 Sap Ag Skipping pattern for an inverted index
US20050028156A1 (en) * 2003-07-30 2005-02-03 Northwestern University Automatic method and system for formulating and transforming representations of context used by information services
US7533245B2 (en) * 2003-08-01 2009-05-12 Illinois Institute Of Technology Hardware assisted pruned inverted index component
US7143263B2 (en) * 2003-10-16 2006-11-28 International Business Machines Corporation System and method of adaptively reconfiguring buffers
US20080140639A1 (en) * 2003-12-29 2008-06-12 International Business Machines Corporation Processing a Text Search Query in a Collection of Documents
US7370037B2 (en) * 2003-12-29 2008-05-06 International Business Machines Corporation Methods for processing a text search query in a collection of documents
US7337165B2 (en) * 2003-12-29 2008-02-26 International Business Machines Corporation Method and system for processing a text search query in a collection of documents
US7213094B2 (en) * 2004-02-17 2007-05-01 Intel Corporation Method and apparatus for managing buffers in PCI bridges
US7146466B2 (en) * 2004-03-23 2006-12-05 International Business Machines System for balancing multiple memory buffer sizes and method therefor
US7480750B2 (en) * 2004-05-14 2009-01-20 International Business Machines Corporation Optimization of buffer pool sizes for data storage
US7536408B2 (en) * 2004-07-26 2009-05-19 Google Inc. Phrase-based indexing in an information retrieval system
US20060248037A1 (en) * 2005-04-29 2006-11-02 International Business Machines Corporation Annotation of inverted list text indexes using search queries
US20070078890A1 (en) * 2005-10-05 2007-04-05 International Business Machines Corporation System and method for providing an object to support data structures in worm storage
US7487178B2 (en) * 2005-10-05 2009-02-03 International Business Machines Corporation System and method for providing an object to support data structures in worm storage
US20090049086A1 (en) * 2005-10-05 2009-02-19 International Business Machines Corporation System and method for providing an object to support data structures in worm storage
US20070112813A1 (en) * 2005-11-08 2007-05-17 Beyer Kevin S Virtual cursors for XML joins
US20070156958A1 (en) * 2006-01-03 2007-07-05 Emc Corporation Methods, systems, and computer program products for optimized copying of logical units (LUNs) in a redundant array of inexpensive disks (RAID) environment using buffers that are smaller than LUN delta map chunks
US20070255698A1 (en) * 2006-04-10 2007-11-01 Garrett Kaminaga Secure and granular index for information retrieval
US20080040307A1 (en) * 2006-08-04 2008-02-14 Apple Computer, Inc. Index compression
US20080059420A1 (en) * 2006-08-22 2008-03-06 International Business Machines Corporation System and Method for Providing a Trustworthy Inverted Index to Enable Searching of Records
US20080082554A1 (en) * 2006-10-03 2008-04-03 Paul Pedersen Systems and methods for providing a dynamic document index
US20080228743A1 (en) * 2007-03-15 2008-09-18 International Business Machines Corporation System and method for multi-dimensional aggregation over large text corpora
US20100161617A1 (en) * 2007-03-30 2010-06-24 Google Inc. Index server architecture using tiered and sharded phrase posting lists
US20090132521A1 (en) * 2007-08-31 2009-05-21 Powerset, Inc. Efficient Storage and Retrieval of Posting Lists
US20090094416A1 (en) * 2007-10-05 2009-04-09 Yahoo! Inc. System and method for caching posting lists
US20090112843A1 (en) * 2007-10-29 2009-04-30 International Business Machines Corporation System and method for providing differentiated service levels for search index
US20090164437A1 (en) * 2007-12-20 2009-06-25 Torbjornsen Oystein Method for dynamic updating of an index, and a search engine implementing the same
US20090164424A1 (en) * 2007-12-25 2009-06-25 Benjamin Sznajder Object-Oriented Twig Query Evaluation

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11003644B2 (en) * 2012-05-18 2021-05-11 Splunk Inc. Directly searchable and indirectly searchable using associated inverted indexes raw machine datastore
EP3000060A4 (en) * 2013-05-21 2017-01-18 Facebook, Inc. Database sharding with update layer
US10977229B2 (en) 2013-05-21 2021-04-13 Facebook, Inc. Database sharding with update layer
WO2015078273A1 (en) * 2013-11-29 2015-06-04 Tencent Technology (Shenzhen) Company Limited Method and apparatus for search
US10452691B2 (en) 2013-11-29 2019-10-22 Tencent Technology (Shenzhen) Company Limited Method and apparatus for generating search results using inverted index
US11334353B2 (en) * 2017-05-18 2022-05-17 Nec Corporation Multiparty computation method, apparatus and program
US10545960B1 (en) * 2019-03-12 2020-01-28 The Governing Council Of The University Of Toronto System and method for set overlap searching of data lakes

Also Published As

Publication number Publication date
US8205025B2 (en) 2012-06-19
US20120260011A1 (en) 2012-10-11
US20110040762A1 (en) 2011-02-17
US20110040905A1 (en) 2011-02-17

Similar Documents

Publication Publication Date Title
US20110040761A1 (en) Estimation of postings list length in a search system using an approximation table
US11036799B2 (en) Low RAM space, high-throughput persistent key value store using secondary memory
US10678654B2 (en) Systems and methods for data backup using data binning and deduplication
US8977623B2 (en) Method and system for search engine indexing and searching using the index
US10114908B2 (en) Hybrid table implementation by using buffer pool as permanent in-memory storage for memory-resident data
KR102564170B1 (en) Method and device for storing data object, and computer readable storage medium having a computer program using the same
CN106980665B (en) Data dictionary implementation method and device and data dictionary management system
US10255234B2 (en) Method for storing data elements in a database
US10810174B2 (en) Database management system, database server, and database management method
CN105468644B (en) Method and equipment for querying in database
US11169968B2 (en) Region-integrated data deduplication implementing a multi-lifetime duplicate finder
CN110532347A (en) A kind of daily record data processing method, device, equipment and storage medium
CN111475105A (en) Monitoring data storage method, device, server and storage medium
CN111045994B (en) File classification retrieval method and system based on KV database
US11520818B2 (en) Method, apparatus and computer program product for managing metadata of storage object
CN112306957A (en) Method and device for acquiring index node number, computing equipment and storage medium
US20130218851A1 (en) Storage system, data management device, method and program
US20230138113A1 (en) System for retrieval of large datasets in cloud environments
US8051090B2 (en) File management method of a ring buffer and related file management apparatus
US7720805B1 (en) Sequential unload processing of IMS databases
CN111723266A (en) Mass data processing method and device
WO2022141650A1 (en) Memory-frugal index design in storage engine
US20230385240A1 (en) Optimizations for data deduplication operations
US20210011881A1 (en) System and method for insertable and removable file system
US8819089B2 (en) Memory efficient representation of relational data with constant time random access to the data itself

Legal Events

Date Code Title Description
AS Assignment

Owner name: GLOBALSPEC, INC., NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FLATLAND, STEINAR;DALTON, JEFF J.;REEL/FRAME:024825/0968

Effective date: 20100809

AS Assignment

Owner name: COMERICA BANK, MICHIGAN

Free format text: SECURITY AGREEMENT;ASSIGNOR:GLOBALSPEC, INC.;REEL/FRAME:026146/0641

Effective date: 20031229

AS Assignment

Owner name: GLOBALSPEC, INC., NEW YORK

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:COMERICA BANK;REEL/FRAME:028464/0833

Effective date: 20120626

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION