US20090089266A1 - Method of finding candidate sub-queries from longer queries - Google Patents
Method of finding candidate sub-queries from longer queries
- Publication number
- US20090089266A1 (application number US11/863,045)
- Authority
- US
- United States
- Prior art keywords
- query
- queries
- subsequences
- input query
- tokens
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24534—Query rewriting; Transformation
- G06F16/24539—Query rewriting; Transformation using cached or materialised query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3349—Reuse of stored results of previous queries
Definitions
- Search engines are a powerful tool for sifting through vast amounts of stored information in a structured and discriminating scheme.
- Popular search engines such as MSN®, Google® and Yahoo!® service tens of millions of queries for information every day.
- a typical search engine for use in finding documents on the World Wide Web operates by a coordinated set of programs including a spider (also referred to as a “crawler” or “bot”) that gathers information from web pages on the World Wide Web in order to create entries for a search engine index, or log; an indexing program that creates the log from the web pages that have been read; and a search program that receives a search query, compares it to the entries in the log, and returns results appropriate to the search query.
- a current area of significant research in the field of search engine technology is how to improve the efficiency and quality of results for a given search query.
- So called concept-based searching involves using statistical analysis on various search criteria in order to identify and suggest alternative search queries that are highly semantically related to the input search query. Identifying alternative, highly correlated search queries can help focus and improve the search results for a given search.
- companies and advertisers present advertising when particular queries are entered. It would be extremely beneficial to such companies and advertisers to associate their advertising with particular queries as well as other semantically related queries.
- queries are correlated together depending on the degree to which results returned in the respective queries are the same. Thus, if first and second queries return nearly identical search results, these two queries would be considered highly correlated with each other.
- Another popular search technology relates to analyzing and comparing the semantic input queries themselves to the entries in the database log. If two queries are found to be semantically related, then the search results returned by the respective queries should be highly correlated.
- Embodiments of the present system relate to a method of identifying queries stored in a log that are semantically related to an input query which may include a large number of terms.
- a set of one or more subsequences are generated for each query stored in the log, and these sets of subsequences are stored in a lookup table.
- a set of one or more subsequences are also generated for the input query.
- Matching queries are obtained by comparing the input query subsequences against the subsequences stored in the lookup table.
- the subsequences in the lookup table and of the input query are generated by hashing of the respective query terms, or tokens, to a value between 0 and 1 using a known technique of min-hashing.
- the present system then constructs the subsequences of the query based on the values of the hashed tokens.
- the one or more subsequences of a given query are the k-min hashes of the query, where k is an integer which may vary between 1 and m.
- the upper bound of k is m. m may be arbitrarily selected as some percentage of the number of search terms in a query.
- Once a k-min hash is obtained, it is ordered so that the tokens in the min hash appear in the same order in which they appear in the query from which the min hash is derived.
- the ordered k-min hash sequences of the input query are then compared against the ordered k-min hash sequences in the lookup table. Where there is a match between a k-min hash of the entered query and a k-min hash of a stored log entry, the stored and entered queries may be semantically related, and the results for the matching stored log entry are returned and provided to the user as search results.
- FIG. 1 is a block diagram of a computing environment capable of performing embodiments of the present method.
- FIG. 2 is a block diagram of a search engine environment capable of carrying out embodiments of the present method.
- FIG. 3 is a block diagram for forming a lookup table of min hashes for queries stored within a log.
- FIG. 4 shows an example of a query stored within a database log where the tokens of the query are hashed to a value of between 0 and 1.
- FIG. 5 shows different min hashes which may be formed from the stored query of FIG. 4 .
- FIG. 6 shows the min hashes of FIG. 5 ordered in the same sequence as the stored query of FIG. 4 .
- FIG. 7 is a flowchart of an embodiment of the present system for finding stored search queries which are highly correlated to a long input query.
- FIG. 8 shows an example of an input query where the tokens of the query are hashed to a value of between 0 and 1.
- FIG. 9 shows different min hashes which may be formed from the input query of FIG. 8 .
- FIG. 10 shows the min hashes of FIG. 9 ordered in the same sequence as the input query of FIG. 8 .
- FIG. 11 shows an example of min hashes of an input query compared against min hashes within a lookup table.
- FIG. 12 is a flowchart of an alternative method of the present system for finding stored queries which are correlated to long input queries.
- FIG. 1 illustrates an example of a suitable general computing system environment 100 on which the invention may be implemented.
- the computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing system environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary computing system environment 100 .
- the invention is operational with numerous other general purpose or special purpose computing systems, environments or configurations.
- Examples of well known computing systems, environments and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, laptop and palm computers, hand held devices, distributed computing environments that include any of the above systems or devices, and the like.
- the invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
- program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types.
- the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communication network.
- program modules may be located in both local and remote computer storage media including memory storage devices.
- an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110 .
- Components of computer 110 may include, but are not limited to, a processing unit 120 , a system memory 130 , and a system bus 121 that couples various system components including the system memory to the processing unit 120 .
- the system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
- such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
- Computer 110 typically includes a variety of computer readable media.
- Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media.
- Computer readable media may comprise computer storage media and communication media.
- Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
- Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVDs) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110 .
- Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
- modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
- communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
- the system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132 .
- BIOS basic input/output system
- RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120 .
- FIG. 1 illustrates operating system 134 , application programs 135 , other program modules 136 , and program data 137 .
- the computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media.
- FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152 , and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media.
- removable/non-removable, volatile/ nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, DVDs, digital video tape, solid state RAM, solid state ROM, and the like.
- the hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140 , and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150 .
- hard disk drive 141 is illustrated as storing operating system 144 , application programs 145 , other program modules 146 , and program data 147 . These components can either be the same as or different from operating system 134 , application programs 135 , other program modules 136 , and program data 137 . Operating system 144 , application programs 145 , other program modules 146 , and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
- a user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161 , commonly referred to as a mouse, trackball or touch pad.
- Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, or the like.
- These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus 121 , but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB).
- a monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190 .
- computers may also include other peripheral output devices such as speakers 197 and printer 196 , which may be connected through an output peripheral interface 195 .
- the computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180 .
- the remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110 , although only a memory storage device 181 has been illustrated in FIG. 1 .
- the logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173 , but may also include other networks.
- Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
- When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170 .
- When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communication over the WAN 173 , such as the Internet.
- The modem 172 , which may be internal or external, may be connected to the system bus 121 via the user input interface 160 , or other appropriate mechanism.
- program modules depicted relative to the computer 110 may be stored in the remote memory storage device.
- FIG. 1 illustrates remote application programs 185 as residing on memory device 181 . It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
- FIG. 2 is a block diagram of a search processing environment 200 including software modules and data structure on which the present invention may be implemented.
- the search processing environment 200 can operate with and/or as part of the computing system environment 100 described above.
- Search processing environment 200 may be a crawler-based system having three major elements.
- First is the spider, also called the crawler 202 .
- the spider visits a web page 290 a , 290 b , reads it, and then follows links to other pages within the site.
- the spider 202 returns to the site on a regular basis to look for changes.
- the basic algorithm executed by any web crawler takes a list of seed URLs as its input and repeatedly: remove a URL from the URL list, determine the IP address of its host name, download the corresponding document, and extract any links contained in it.
- For each of the extracted links, the crawler will translate it to an absolute URL (if necessary), and add it to the list of URLs to download, provided it has not been encountered before. If desired, the crawler will process the downloaded document in other ways (e.g., index its content).
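By way of illustration only, the crawl loop described above might be sketched as follows. The `fetch` and `extract_links` callbacks, the `max_pages` limit and the function name `crawl` are assumptions introduced for this sketch; host name resolution and downloading are delegated to `fetch` rather than implemented here.

```python
from collections import deque
from urllib.parse import urljoin


def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Crawl breadth-first from the seed URLs.

    `fetch(url)` returns a document or None, and `extract_links(document)`
    returns an iterable of links; both are hypothetical, caller-supplied
    callbacks.
    """
    frontier = deque(seed_urls)   # the list of URLs still to download
    seen = set(seed_urls)         # URLs that have been encountered before
    store = {}                    # url -> downloaded document
    while frontier and len(store) < max_pages:
        url = frontier.popleft()  # remove a URL from the URL list
        document = fetch(url)     # resolve the host and download (delegated)
        if document is None:
            continue
        store[url] = document
        for link in extract_links(document):
            absolute = urljoin(url, link)  # translate to an absolute URL
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return store
```

Keeping a `seen` set separate from the frontier is what enforces the condition that a link is only queued if it has not been encountered before.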
- the store 210 is a repository containing a copy of every web page that the spider finds. If a web page changes, then data store 210 is updated with new information.
- the data store further includes a log 206 of all search queries received by the search engine 212 , explained below. Additionally, in accordance with the present invention, the data store 210 further includes a lookup table 208 which includes a number of ordered subsequences of each entry within log 206 . The number of ordered subsequences of each particular log entry may vary in embodiments, but may advantageously be less than all possible subsequences of the particular log entry.
- the lookup table 208 is explained in greater detail hereinafter.
- the third part of the search processing environment 200 is search engine 212 .
- This is the program that sifts through the millions of pages recorded in the log to find matches to a search and rank them in order of what it believes is most relevant. Searching through the log involves a user building a query and submitting it through the search engine 212 .
- the query can be quite simple, a single word at minimum, but embodiments of the present system are particularly useful in handling long queries.
- a user of computing device 225 accesses search processing environment 200 via a web browser 216 on the client side and a web server 214 on the host side. Once a communication link is established between client and host, a user of computing device 225 may perform query searches as described above.
- the present system operates by hashing all terms, or tokens, of an input search query to a value of between 0 and 1 using a known min hashing algorithm.
- the present system then constructs subsequences of the input query based on the values of the hashed tokens.
- the term “k-min hash” is used herein to refer to a min hash of k tokens, where k is an integer which may vary between 1 and m.
- m may be arbitrarily selected as some percentage of the number of search terms in a query. In one embodiment, m may be selected to be between 30% to 70% of the number of search terms (rounded up or down to a whole value) or between 50% to 60% of the number of search terms (again, rounded up or down to a whole value). m may be selected other ways based on the number of terms in a query. For example, for queries of 4 or more terms, m may increase by one for the addition of every three terms, such as for example shown partially in the following table 1.
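As a minimal sketch of the percentage-based selection of m, assuming a default fraction of 50% and rounding up (the text permits rounding either way), the upper bound might be computed as follows. The function name and the clamping to the range 1 through the query length are assumptions of this sketch.

```python
import math


def max_min_hash_size(num_terms, fraction=0.5):
    """Return m, the upper bound on k, as a fraction of the query length.

    Rounding up is an assumption; the fraction is a tunable parameter.
    """
    if num_terms < 1:
        raise ValueError("a query needs at least one term")
    m = math.ceil(num_terms * fraction)
    # Keep m within 1..num_terms so at least a 1-min hash always exists.
    return max(1, min(m, num_terms))
```

For a six-term query and the default fraction, this yields m = 3.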
- k-min hash sequences are compared against k-min hash sequences in lookup table 208 which are similarly calculated from the stored queries in log 206 .
- the stored and entered queries may be semantically related, and the results for the matching stored log entry are returned and provided to the user as search results.
- the log 206 includes all stored query entries over some historical period.
- the log may be regenerated periodically from the most recent query submissions to reflect current search trends.
- a lookup table 208 exists of subsequences of each log entry in log 206 .
- the generation of the subsequences is now explained with reference to the flowchart of FIG. 3 .
- a hash generator may hash the terms forming each of the logged query entries.
- the hash generator may use the min hash algorithm to generate a number between 0 and 1 for each of the terms within a search query. Details of the min hash algorithm are known and set forth for example in A.
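For illustration, one way a hash generator could map each token to a number between 0 and 1 is sketched below. The use of SHA-256 is an assumption standing in for whatever hash function an implementation of the min hash algorithm actually uses; the function names are likewise hypothetical.

```python
import hashlib


def token_hash(token):
    """Map a token deterministically to a float in [0, 1).

    SHA-256 is an assumed stand-in for the hash function.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Keep 53 bits so the quotient is exactly representable as a float.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def hash_query(query):
    """Return (token, hashed value) pairs in the order the tokens appear."""
    return [(token, token_hash(token)) for token in query.split()]
```

Because the mapping is deterministic, the same token receives the same value whether it occurs in a logged query or in an input query, which is what makes the later comparison of subsequences meaningful.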
- An example of the min hash algorithm is shown in FIG. 4 .
- a given query comprised of the terms, or tokens, “C D E F X” may be logged within log 206 .
- the tokens in the illustrative query are shown as letters, but may represent words, letters, numbers or combinations thereof. The length is also by way of example and may be longer or shorter than shown.
- Upon application of the min hash algorithm to each of the tokens in the query “C D E F X,” it may be determined that the respective tokens have the following hashed values:
- an algorithm may be used to obtain a number of k-min hashes for each logged search query.
- the number, m may be some arbitrarily selected number smaller than the total length of the particular logged query being examined.
- m may be selected as four.
- the minimum hashed value is 0.05, associated with the term E.
- Embodiments of the present system obtain k-min hashes for a given stored query, but the algorithm also maintains the order of the terms as presented in the stored query. That is, once it is determined which terms comprise a k-min hash, the terms in the k-min hash are organized in the same order in which they appear in the query. Thus, referring to FIGS. 3 and 6 , in step 306 , the ordered k-min hash terms are obtained.
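The construction of ordered k-min hashes may be sketched as follows. Only the value 0.05 for token E comes from the example above; the remaining hashed values are hypothetical placeholders chosen for illustration, and breaking ties by token position is an assumption of this sketch.

```python
def ordered_k_min_hashes(tokens, values, m):
    """Return {k: tuple of tokens} for k = 1..m.

    Each tuple holds the k tokens with the smallest hashed values,
    rearranged into the order in which they appear in the query.
    """
    # Rank token positions by hashed value (ties broken by position).
    ranked = sorted(range(len(tokens)), key=lambda i: (values[tokens[i]], i))
    result = {}
    for k in range(1, min(m, len(tokens)) + 1):
        chosen = sorted(ranked[:k])  # restore the original query order
        result[k] = tuple(tokens[i] for i in chosen)
    return result


# Hypothetical hashed values, except E = 0.05.
values = {"C": 0.10, "D": 0.20, "E": 0.05, "F": 0.80, "X": 0.70}
hashes = ordered_k_min_hashes("C D E F X".split(), values, m=4)
# hashes[1] == ("E",)
# hashes[2] == ("C", "E")
# hashes[3] == ("C", "D", "E")
# hashes[4] == ("C", "D", "E", "X")
```

Ranking positions rather than tokens keeps the sketch well defined when a token is repeated within a query.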
- each query in the log 206 may have a set of associated subsequence min hashes stored in lookup table 208 .
- the log 206 may be periodically updated, at which times, the subsequences for the queries in log 206 may be recalculated and stored in lookup table 208 .
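A lookup table of this kind might be built along the following lines. This is a sketch under stated assumptions: SHA-256 stands in for the hash function, m is taken as half the query length rounded up, and the table is modelled as a plain dictionary from each logged query to its set of ordered subsequences.

```python
import hashlib
import math


def token_hash(token):
    """Assumed stand-in hash: map a token to a float in [0, 1)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def ordered_subsequences(query, fraction=0.5):
    """Return the set of ordered k-min hashes of a query for k = 1..m."""
    tokens = query.split()
    m = max(1, math.ceil(len(tokens) * fraction))
    ranked = sorted(range(len(tokens)), key=lambda i: (token_hash(tokens[i]), i))
    return {tuple(tokens[i] for i in sorted(ranked[:k])) for k in range(1, m + 1)}


def build_lookup_table(log):
    """Map every non-empty logged query to its set of ordered subsequences."""
    return {query: ordered_subsequences(query) for query in log if query.split()}
```

Rebuilding the table after a log update amounts to calling `build_lookup_table` again on the refreshed log.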
- the search engine algorithm receives a search query in step 350 .
- a user may input a query 400 , as shown in FIG. 8 , of A B C D E F.
- the tokens are shown as letters, but may represent words, letters, numbers or combinations thereof.
- the length of query 400 is also by way of example and may be longer or shorter than shown.
- each of the tokens in the query received in step 350 are hashed to a value of between 0 and 1 using the known min hash algorithm as described above with respect to step 300 in the flowchart of FIG. 3 .
- the value for m may be arbitrarily chosen based on the length of the input query 400 , and in embodiments may be shorter than the length of query 400 .
- FIG. 10 shows the same k-min hashes in the order in which they appear within query 400 .
- all k-min hashes and ordered k-min hashes are computed before the comparison step 360 explained below.
- the ordered min hashes may be determined one at a time and then the comparison performed. In this embodiment, only after a given ordered k-min hash results in no matches would the next ordered k-min hash be determined.
- In step 360 , the search engine algorithm of the present system compares the first ordered min hash term for query 400 against the min hash terms in lookup table 208 for each of the queries stored in log 206 .
- the search engine starts with the largest hashed sequence for comparison against the sequences in lookup table 208 .
- the min hash of query 400 is first compared against the set of min hashes in table 208 for the first query stored in log 206 .
- If a match is found between the k-min hash of the query 400 and a min hash in the set of min hashes for the first query in log 206 , that first query is stored in a buffer in step 362 . The algorithm then checks whether there are additional queries stored in log 206 . If so, the next log entry is taken in step 368 and the set of min hashes in table 208 for that next log entry is compared with the current k-min hash of input query 400 .
- If in step 366 it is determined that all of the log entries have been compared against the current k-min hash of query 400 , the algorithm next checks in step 370 whether one or more matches were found for the k-min hash of input query 400 . If no matches are found, the algorithm next determines whether there are additional k-min hashes for input query 400 . If there are additional k-min hashes (i.e., k has not yet decreased to one), the next k-min hash is taken in step 374 and the algorithm returns to step 360 . In the embodiment of FIG. 7 , the next k-min hash is obtained by decreasing k by 1.
- the algorithm initially takes the min hash having four tokens from input query 400 and compares that sequence against all stored min hashes in table 208 . If no matches are found, the algorithm next proceeds to the 3-min hash having three tokens from input query 400 and repeats steps 360 through 368 to see if any min hashes in table 208 match the tokens in the 3-min hash of input query 400 , and so on.
- If k is at 1 after steps 360 through 370 and no matches have been found in step 370 , then not even a single min hash token of input query 400 matches a stored min hash in table 208 , and the algorithm indicates that no matches were identified in step 376 . Although this outcome is theoretically possible, in practice at least one match for a small enough value of k will generally be found, and step 376 will not be reached.
- If, in step 370 , one or more matches have been found for a given k-min hash of input query 400 , the algorithm next checks in step 380 whether multiple matches have been stored in the buffer in step 364 . If there was a single k-min hash of table 208 that was found to match the k-min hash of input query 400 , the query stored in log 206 from which the matched k-min hash of table 208 is taken is returned in step 382 . In step 384 , a search is performed by search engine 212 using the stored query identified in step 382 , and the results for that search of the identified query in log 206 are returned to the user as the most closely correlated search results to input query 400 .
- If it is determined in step 380 that multiple min hashes were found in table 208 to match a given k-min hash of input query 400 , the algorithm may return the most popular query of the matched min hashes in step 386 , and perform a search to obtain the search results for that most popular query, which are then returned to the user in step 384 .
- the information of how many times users entered each stored query is also stored in data store 210 . Where multiple queries in log 206 are identified as matching the input query, the most popular is the most frequently entered of the matching stored queries.
- FIG. 11 also shows a few example queries in log 206 and their associated set of k-min hashes in lookup table 208 . As shown and as indicated above, different length queries in log 206 may have differing numbers of min hashes.
- the search engine algorithm may initially try to find matches for the min hash with the largest number of tokens, i.e., A C D E.
- the search engine algorithm may use that min hash in steps 360 through 370 and find no matches.
- the next min hash, C D E is then compared against the hashes within min hash table 208 in steps 360 through 370 .
- two log entries may be identified having min hashes which match the min hash C D E of input query 400 . These log entries may be C D E R and C D E F X.
- the search engine algorithm performs step 386 of identifying the more popular log entry.
- the log entry C D E R may be more popular than log entry C D E F X.
- the results for search query C D E R are obtained (1, 2, 3 . . . ) and returned as the most closely correlated search results within data store 210 for input query A B C D E F.
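The example above can be traced with a short sketch of the descending search of FIG. 7. The hashed values (other than E = 0.05), the values of m chosen for each log entry, and the popularity counts are all hypothetical; they were picked only so that the 4-min hash of the input query is A C D E and its 3-min hash is C D E, as in the example.

```python
# Hypothetical hashed values, except E = 0.05.
HASH = {"A": 0.30, "B": 0.90, "C": 0.10, "D": 0.20,
        "E": 0.05, "F": 0.80, "R": 0.60, "X": 0.70}


def ordered_k_min_hash(tokens, k):
    """Return the k smallest-valued tokens in their original query order."""
    ranked = sorted(range(len(tokens)), key=lambda i: (HASH[tokens[i]], i))
    return tuple(tokens[i] for i in sorted(ranked[:k]))


def subsequences(query, m):
    """Return the set of ordered k-min hashes of a query for k = 1..m."""
    tokens = query.split()
    return {ordered_k_min_hash(tokens, k) for k in range(1, m + 1)}


def find_candidate(input_query, m, table, popularity):
    """Try k = m, m-1, ..., 1 and return (most popular match, matched hash)."""
    tokens = input_query.split()
    for k in range(m, 0, -1):
        probe = ordered_k_min_hash(tokens, k)
        matches = [q for q, subs in table.items() if probe in subs]
        if matches:
            return max(matches, key=lambda q: popularity[q]), probe
    return None, None  # no match even for k = 1


# Hypothetical log entries with assumed m values and entry counts.
table = {"C D E R": subsequences("C D E R", 3),
         "C D E F X": subsequences("C D E F X", 4)}
popularity = {"C D E R": 120, "C D E F X": 45}

best, matched = find_candidate("A B C D E F", 4, table, popularity)
# best == "C D E R", matched == ("C", "D", "E")
```

The 4-min hash A C D E finds no match, so the search falls back to C D E, which matches both log entries; the popularity count then selects C D E R.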
- the search results for all matched queries may be returned and presented to the user as the most correlated results for the user's input query.
- the algorithm may begin with the smallest k-min hash and work upward until the largest k-min hash having a match within table 208 is identified.
- a query is received in step 450 , and the terms of the query are hashed to a value between 0 and 1 in step 452 as described above.
- the k-min hash of the search query is obtained.
- the value of k is initially selected as some minimum value. The value may be 1, or it may be greater than 1.
- the k-min hash for the initial value of k obtained in step 454 is ordered in the same sequence as the tokens appear in the initial search query.
- Steps 460 through 470 of FIG. 12 are the same as steps 360 through 370 described above with respect to FIG. 7 . Namely, the min hash for the first value of k is compared against the stored min hashes in lookup table 208 , and the associated stored queries for any matches found are stored in a buffer.
- If one or more matches are found for a given value of k, k is incremented until the point where no matches are found. At that point, the matches found at k-1 are returned as the matching candidate results. Accordingly, in step 470 if a match was found and stored in the buffer, k is incremented in step 472 and the next k-min hash is again obtained in step 454 and steps 456 through 470 are repeated.
- If no match is found in step 470 , the search engine algorithm checks in step 474 whether k is in fact at its initial value. If so, this indicates that no matches were found for any of the min hashes of the search query entered in step 450 , and the algorithm indicates that no matches were found in step 476 . As discussed above, if the starting value of k is 1, it is unlikely that step 476 will be reached. However, in this embodiment, it is contemplated that initial values of k may be greater than 1, making it more likely that step 476 will be reached.
- If k is not at its initial value in step 474 , the match(es) found for the previous value of k are retrieved from the memory buffer in step 478 .
- In step 480 , the algorithm checks whether there were multiple matches for the previous value of k. If there was a single match, the query stored in log 206 from which the matched hash of table 208 is taken is returned in step 482 .
- In step 484 , the results for the identified query in log 206 are returned to the user as the most closely correlated search results to the input query.
- If it is determined in step 480 that multiple min hashes were found in table 208 to match the previous k-min hash of the input query, the algorithm may return the most popular query of the matched min hashes in step 486 , and return the search results for that most popular query to the user in step 484 .
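The ascending variant of FIG. 12 might be sketched as follows. As before, the hashed values (other than E = 0.05), the per-entry m values and the function names are hypothetical, and the final selection of the most popular query is left to the caller.

```python
# Hypothetical hashed values, except E = 0.05.
HASH = {"A": 0.30, "B": 0.90, "C": 0.10, "D": 0.20,
        "E": 0.05, "F": 0.80, "R": 0.60, "X": 0.70}


def ordered_k_min_hash(tokens, k):
    """Return the k smallest-valued tokens in their original query order."""
    ranked = sorted(range(len(tokens)), key=lambda i: (HASH[tokens[i]], i))
    return tuple(tokens[i] for i in sorted(ranked[:k]))


def subsequences(query, m):
    """Return the set of ordered k-min hashes of a query for k = 1..m."""
    tokens = query.split()
    return {ordered_k_min_hash(tokens, k) for k in range(1, m + 1)}


def find_candidates_ascending(input_query, m, table, k_start=1):
    """Increase k from k_start until no match is found.

    Returns the matches buffered for the last value of k that had any,
    or an empty list if even the initial k found nothing.
    """
    tokens = input_query.split()
    buffered = []
    for k in range(k_start, m + 1):
        probe = ordered_k_min_hash(tokens, k)
        matches = [q for q, subs in table.items() if probe in subs]
        if not matches:
            break               # fall back to the matches found at k-1
        buffered = matches
    return buffered


table = {"C D E R": subsequences("C D E R", 3),
         "C D E F X": subsequences("C D E F X", 4)}
candidates = find_candidates_ascending("A B C D E F", 4, table)
# candidates == ["C D E R", "C D E F X"], found at k = 3
```

Starting from a small k means a match is almost always found early, at the cost of re-scanning the table for each larger k until the matches run out.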
- Stop words are common words such as “the,” “of,” etc. Such stop words will result in a low hash value when hashed per the min hash function. Accordingly, when the min hashes are obtained, many min hashes stored in lookup table 208 will similarly include stop words, resulting in a high number of matches to the min hashes of the input query. In a further embodiment of the present invention, it is therefore possible to weight the hash value of terms so that stop words receive higher hashed values than other, less common and more prohibitive terms in a given query.
- This weighting may be a TF-IDF (term frequency-inverse document frequency) weight, which is a known concept used in information retrieval and text mining.
- a TF-IDF weight is a statistical measure used to evaluate how important a word is within a given query. TF-IDF weight is explained in greater detail in Salton, G., Introduction to Modern Information Retrieval , McGraw Hill (1983).
- TF-IDF weight is computed as 1/
Abstract
Description
- Search engines are a powerful tool for sifting through vast amounts of stored information in a structured and discriminating scheme. Popular search engines such as MSN®, Google® and Yahoo!® service tens of millions of queries for information every day. A typical search engine for use in finding documents on the World Wide Web operates by a coordinated set of programs including a spider (also referred to as a “crawler” or “bot”) that gathers information from web pages on the World Wide Web in order to create entries for a search engine index, or log; an indexing program that creates the log from the web pages that have been read; and a search program that receives a search query, compares it to the entries in the log, and returns results appropriate to the search query.
- A current area of significant research in the field of search engine technology is how to improve the efficiency and quality of results for a given search query. So-called concept-based searching involves using statistical analysis on various search criteria in order to identify and suggest alternative search queries that are highly semantically related to the input search query. Identifying alternative, highly correlated search queries can help focus and improve the search results for a given search. Moreover, companies and advertisers present advertising when particular queries are entered. It would be extremely beneficial to such companies and advertisers to associate their advertising with particular queries as well as other semantically related queries.
- In an example of a prior art system employing concept-based searching, queries are correlated together depending on the degree to which results returned in the respective queries are the same. Thus, if first and second queries return nearly identical search results, these two queries would be considered highly correlated with each other. Another popular search technology relates to analyzing and comparing the semantic input queries themselves to the entries in the database log. If two queries are found to be semantically related, then the search results returned by the respective queries should be highly correlated.
- In search engines used for web searches and other database searches, long queries are often difficult to handle. Conventional approaches to searching use all query terms as a conjunction. Accordingly, long queries may produce no results. Moreover, processing long queries is computationally difficult. It may be possible to scan all the entries in the log, which may often include millions of entries, and compare each of the entries with the original query. Each of these comparisons in turn is an expensive operation (quadratic in the length of the strings). Therefore, this approach is not feasible for large query logs and long strings.
- Embodiments of the present system relate to a method of identifying queries stored in a log that are semantically related to an input query which may include a large number of terms. A set of one or more subsequences is generated for each query stored in the log, and these sets of subsequences are stored in a lookup table. A set of one or more subsequences is also generated for the input query. Matching queries are obtained by comparing the input query subsequences against the subsequences stored in the lookup table.
- The subsequences in the lookup table and of the input query are generated by hashing of the respective query terms, or tokens, to a value between 0 and 1 using a known technique of min-hashing. The present system then constructs the subsequences of the query based on the values of the hashed tokens. The one or more subsequences of a given query are the k-min hashes of the query, where k is an integer which may vary between 1 and m. For example, a k-min hash for k=2 is a min hash including the two tokens having the two lowest hashed values of all tokens in the input query. The upper bound of k is m. m may be arbitrarily selected as some percentage of the number of search terms in a query.
- Once a k-min hash is obtained, it is ordered so that the tokens in the min-hash appear in the same order in which the tokens appear in the query from which the min hash is derived. The ordered k-min hash sequences of the input query are then compared against the ordered k-min hash sequences in the lookup table. Where there is a match between a k-min hash of the entered query and a k-min hash of a stored log entry, the stored and entered queries may be semantically related, and the results for the matching stored log entry are returned and provided to the user as search results.
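As a minimal sketch of the selection and ordering steps described above (not the patented implementation), the following Python function picks the k lowest-hashed tokens of a query and restores their original order. The function name `ordered_k_min_hash` is hypothetical, and the hash values simply mirror the illustrative values used for the stored query C D E F X later in this description.

```python
def ordered_k_min_hash(tokens, hashed, k):
    """Return the k tokens with the lowest hash values, in their original query order."""
    # Positions of the k lowest-hashed tokens (the unordered k-min hash).
    lowest = sorted(range(len(tokens)), key=lambda i: hashed[tokens[i]])[:k]
    # Re-sort the positions so the tokens appear as they do in the query.
    return tuple(tokens[i] for i in sorted(lowest))


# Illustrative hash values for a five-token query.
tokens = ["C", "D", "E", "F", "X"]
hashed = {"C": 0.1, "D": 0.15, "E": 0.05, "F": 0.6, "X": 0.5}

# Ordered k-min hashes for k = 1 to m, with m = 4 chosen for illustration.
subsequences = [ordered_k_min_hash(tokens, hashed, k) for k in range(1, 5)]
```

Selecting by position rather than by token keeps the original order recoverable with a single additional sort.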
- FIG. 1 is a block diagram of a computing environment capable of performing embodiments of the present method.
- FIG. 2 is a block diagram of a search engine environment capable of carrying out embodiments of the present method.
- FIG. 3 is a block diagram for forming a lookup table of min hashes for queries stored within a log.
- FIG. 4 shows an example of a query stored within a database log where the tokens of the query are hashed to a value of between 0 and 1.
- FIG. 5 shows different min hashes which may be formed from the stored query of FIG. 4.
- FIG. 6 shows the min hashes of FIG. 5 ordered in the same sequence as the stored query of FIG. 4.
- FIG. 7 is a flowchart of an embodiment of the present system for finding stored search queries which are highly correlated to a long input query.
- FIG. 8 shows an example of an input query where the tokens of the query are hashed to a value of between 0 and 1.
- FIG. 9 shows different min hashes which may be formed from the input query of FIG. 8.
- FIG. 10 shows the min hashes of FIG. 9 ordered in the same sequence as the input query of FIG. 8.
- FIG. 11 shows an example of min hashes of an input query compared against min hashes within a lookup table.
- FIG. 12 is a flowchart of an alternative method of the present system for finding stored queries which are correlated to long input queries.
- Embodiments of the invention will now be described with reference to FIGS. 1-12, which in general relate to methods for finding semantically related search engine queries for long input queries. The method uses hashing and an efficient algorithm for comparing ordered sub-sequences of a long input query against entries in the stored query log to find the best semantically matching candidates. The methods described herein can be performed on a variety of processing systems. FIG. 1 illustrates an example of a suitable general computing system environment 100 on which the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing system environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary computing system environment 100.
- The invention is operational with numerous other general purpose or special purpose computing systems, environments or configurations. Examples of well known computing systems, environments and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, laptop and palm computers, hand held devices, distributed computing environments that include any of the above systems or devices, and the like.
- The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
- With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
- Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVDs) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
- The
system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.
- The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, DVDs, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.
- The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. These components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus 121, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor 191, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
- The
computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
- When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communication over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on memory device 181. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
- FIG. 2 is a block diagram of a search processing environment 200 including software modules and data structure on which the present invention may be implemented. The search processing environment 200 can operate with and/or as part of the computing system environment 100 described above. Search processing environment 200 may be a crawler-based system having three major elements. First is the spider, also called the crawler 202. The spider visits a web page, and the spider 202 returns to the site on a regular basis to look for changes. The basic algorithm executed by any web crawler takes a list of seed URLs as its input and repeatedly removes a URL from the URL list, determines the IP address of its host name, downloads the corresponding document, and extracts any links contained in it. For each of the extracted links, the crawler will translate it to an absolute URL (if necessary), and add it to the list of URLs to download, provided it has not been encountered before. If desired, the crawler will process the downloaded document in other ways (e.g., index its content).
- Everything the spider finds goes into the second part of the search engine,
data store 210. The store 210 is a repository containing a copy of every web page that the spider finds. If a web page changes, then data store 210 is updated with new information. The data store further includes a log 206 of all search queries received by the search engine 212, explained below. Additionally, in accordance with the present invention, the data store 210 further includes a lookup table 208 which includes a number of ordered subsequences of each entry within log 206. The number of ordered subsequences of each particular log entry may vary in embodiments, but may advantageously be less than all possible subsequences of the particular log entry. The lookup table 208 is explained in greater detail hereinafter.
- The third part of the search processing environment 200 is search engine 212. This is the program that sifts through the millions of pages recorded in the log to find matches to a search and rank them in order of what it believes is most relevant. Searching through the log involves a user building a query and submitting it through the search engine 212. The query can be quite simple, a single word at minimum, but embodiments of the present system are particularly useful in handling long queries.
- In practice, a user of computing device 225 accesses search processing environment 200 via a web browser 216 on the client side and a web server 214 on the host side. Once a communication link is established between client and host, a user of computing device 225 may perform query searches as described above.
- As explained in the Background section, long search queries present special difficulties in that it is rare that logged search entries will match all of the terms in the long search query, and a brute force search of all terms in the search query against all logged entries consumes excessive time and resources. Accordingly, a method for finding semantically related candidates for long search queries according to an embodiment of the present system will now be explained with reference to FIGS. 3-11.
- In general, the present system operates by hashing all terms, or tokens, of an input search query to a value of between 0 and 1 using a known min hashing algorithm. The present system then constructs subsequences of the input query based on the values of the hashed tokens. The term “k-min hash” is used herein to refer to a min hash of k tokens, where k is an integer which may vary between 1 and m. A k-min hash for k=1 is a min hash including a single token having the lowest hashed value of all tokens in the input query. A k-min hash for k=2 is a min hash including the two tokens having the two lowest hashed values of all tokens in the input query. A k-min hash for k=3 is a min hash including the three tokens having the three lowest hashed values, etc.
- The upper bound of k is m. m may be arbitrarily selected as some percentage of the number of search terms in a query. In one embodiment, m may be selected to be between 30% and 70% of the number of search terms (rounded up or down to a whole value) or between 50% and 60% of the number of search terms (again, rounded up or down to a whole value). m may be selected in other ways based on the number of terms in a query. For example, for queries of 4 or more terms, m may increase by one for the addition of every three terms, as shown partially in the following Table 1.
-
TABLE 1
Number of Terms in Query:    m:
4-6                          3
7-9                          4
10-12                        5
13-15                        6
16-18                        7
19-21                        8
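The rule tabulated above (m starts at 3 for four terms and grows by one for every three additional terms) can be expressed as a small Python helper. This is a sketch only: the function name `max_k` is hypothetical, extending the rule beyond 21 terms is an extrapolation of the partial table, and the handling of queries shorter than four terms is an assumption, since the table does not cover them.

```python
def max_k(num_terms: int) -> int:
    """Return m, the upper bound of k, following the pattern of Table 1."""
    if num_terms < 4:
        # Assumption: the table does not cover shorter queries, so reject them here.
        raise ValueError("Table 1 only covers queries of 4 or more terms")
    # 4-6 terms -> 3, 7-9 -> 4, 10-12 -> 5, and so on.
    return 3 + (num_terms - 4) // 3


table = {n: max_k(n) for n in (4, 6, 7, 12, 13, 19, 21)}
```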
It is understood that these values of m are selected by way of example only and may vary in alternative embodiments to be any of a variety of values less than the number of search terms. Although unnecessarily long, m may be selected to be equal to the number of search terms in a query in further embodiments.
- k-min hash sequences are compared against k-min hash sequences in lookup table 208 which are similarly calculated from the stored queries in log 206. Where there is a match between a k-min hash of the entered query and a k-min hash of a stored log entry, the stored and entered queries may be semantically related, and the results for the matching stored log entry are returned and provided to the user as search results.
- The log 206 includes all stored query entries over some historical period. The log may be regenerated periodically from the most recent query submissions to reflect current search trends. As indicated above, a lookup table 208 exists of subsequences of each log entry in log 206. The generation of the subsequences is now explained with reference to the flowchart of FIG. 3. In step 300, a hash generator may hash the terms forming each of the logged query entries. The hash generator may use the min hash algorithm to generate a number between 0 and 1 for each of the terms within a search query. Details of the min hash algorithm are known and set forth for example in A. Broder, “On the resemblance and containment of documents,” In Compression and Complexity of Sequences (SEQUENCES '97), 1998; and E. Cohen, “Size estimation framework with applications to transitive closure and reachability,” Journal of Computer and System Sciences, 1997.
- An example of the min hash algorithm is shown in FIG. 4. A given query comprised of the terms, or tokens, “C D E F X” may be logged within log 206. In this example, the tokens in the illustrative query are shown as letters, but may represent words, letters, numbers or combinations thereof. The length is also by way of example and may be longer or shorter than shown. Upon application of the min hash algorithm to each of the tokens in the query “C D E F X,” it may be determined that the respective tokens have the following hashed values:
- C: 0.1
- D: 0.15
- E: 0.05
- F: 0.6
- X: 0.5
- These values will vary between 0 and 1, but the particular assigned values shown above are by way of example only.
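One way to obtain a repeatable value between 0 and 1 for each token is sketched below in Python. This is an illustrative stand-in, not the hash function of the cited references: the use of SHA-256 from the standard `hashlib` module and the function name `token_hash` are assumptions, and the sketch will not reproduce the example values listed above.

```python
import hashlib


def token_hash(token: str) -> float:
    """Map a token to a deterministic pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Keep the top 53 bits of the first 8 bytes so the quotient is an exact float below 1.
    top_bits = int.from_bytes(digest[:8], "big") >> 11
    return top_bits / 2**53


# Hash every token of a stored query; the same token always receives the same value.
query = ["C", "D", "E", "F", "X"]
hashed = {token: token_hash(token) for token in query}
```

Because the value depends only on the token, a token hashed while building the lookup table receives the same value when it appears in a later input query, which is what allows the two sets of subsequences to be compared.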
- In step 302, an algorithm may be used to obtain a number of k-min hashes for each logged search query. The number, m, may be some arbitrarily selected number smaller than the total length of the particular logged query being examined. Thus, in an embodiment, for the search query C D E F X, m may be selected as four. For the search query C D E F X having hashed values as described above and shown in FIG. 4, the minimum hashed value is 0.05, associated with the term E. Thus, referring now to FIG. 5, the k-min hash for k=1 is E. The two lowest hashed values in this example are 0.05 for E and 0.1 for C. Accordingly, the k-min hash for k=2 is E C. Continuing in this way, the k-min hash for k=3 is E C D. And the k-min hash for k=m=4 is E C D X.
- Embodiments of the present system obtain k-min hashes for a given stored query, but the algorithm also maintains the order of the terms as presented in the stored query. That is, once it is determined which terms comprise a k-min hash, the terms in the k-min hash are organized in the same order in which they appear in the query. Thus, referring to FIGS. 3 and 6, in step 306, the ordered k-min hash terms are obtained. The ordered k-min hash for k=1 is E. The ordered k-min hash for k=2 is C E. The ordered k-min hash for k=3 is C D E. And the ordered k-min hash for k=m=4 is C D E X.
- In
step 308, the ordered k-min hashes for k=1 to m are stored in the lookup table 208. Thus, each query in the log 206 may have a set of associated subsequence min hashes stored in lookup table 208. As indicated above, the log 206 may be periodically updated, at which times the subsequences for the queries in log 206 may be recalculated and stored in lookup table 208.
- As discussed in the Background section, long search queries input to conventional search engines often return no results. A method for finding the closest matching queries within log 206 to a long input query will now be explained with reference to the flowchart of FIG. 7. In embodiments, the method described with respect to FIG. 7 may be implemented after a search engine has determined that the log does not contain any exact matches to an input search query. Alternatively, the method described with respect to FIG. 7 may be used in the place of conventional searching techniques to return the best search results to an input query.
- Referring now to FIG. 7, the search engine algorithm according to the present system receives a search query in step 350. For example, a user may input a query 400, as shown in FIG. 8, of A B C D E F. In query 400, the tokens are shown as letters, but may represent words, letters, numbers or combinations thereof. The length of query 400 is also by way of example and may be longer or shorter than shown. In step 352, each of the tokens in the query received in step 350 is hashed to a value of between 0 and 1 using the known min hash algorithm as described above with respect to step 300 in the flowchart of FIG. 3.
- As shown in FIG. 8, upon application of the min hash algorithm to each of the tokens in the query “A B C D E F,” it may be determined that the terms have the following hashed values:
- A: 0.25
- B: 0.7
- C: 0.1
- D: 0.15
- E: 0.05
- F: 0.6
- These values will vary between 0 and 1, but the particular assigned values shown above are by way of example only.
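Taking the example hash values listed above for the input query A B C D E F, the comparison of its ordered k-min hashes against a lookup table, starting from the largest k as in the flowchart of FIG. 7, might be sketched in Python as follows. This is a hedged sketch, not the patented implementation: the function names, the dictionary used as the lookup table (in place of scanning each log entry in turn), the hash value for token R, the popularity counts, and the choice of m as one less than the stored query length are all assumptions made for illustration.

```python
def ordered_k_min_hash(tokens, hashed, k):
    """Return the k lowest-hashed tokens, arranged in their original query order."""
    lowest = sorted(range(len(tokens)), key=lambda i: hashed[tokens[i]])[:k]
    return tuple(tokens[i] for i in sorted(lowest))


def build_lookup(log, hashed):
    """Map each ordered k-min hash to the stored queries that produce it."""
    lookup = {}
    for stored in log:
        m = len(stored) - 1  # assumption: m is one less than the query length
        for k in range(1, m + 1):
            key = ordered_k_min_hash(stored, hashed, k)
            lookup.setdefault(key, []).append(stored)
    return lookup


def find_descending(query, hashed, m, lookup):
    """Try k = m, m-1, ..., 1 and return (k, matches) for the first k with a match."""
    for k in range(m, 0, -1):
        matches = lookup.get(ordered_k_min_hash(query, hashed, k), [])
        if matches:
            return k, matches
    return 0, []  # not even the single lowest-hashed token matched


# Hash values from the examples above; the value for R is hypothetical.
hashed = {"A": 0.25, "B": 0.7, "C": 0.1, "D": 0.15, "E": 0.05,
          "F": 0.6, "X": 0.5, "R": 0.4}
# Stored queries with hypothetical counts of how often each was entered.
log = {("C", "D", "E", "R"): 120, ("C", "D", "E", "F", "X"): 45}
lookup = build_lookup(log, hashed)

k, matches = find_descending(list("ABCDEF"), hashed, 4, lookup)
# When several stored queries match, return the most frequently entered one.
best = max(matches, key=log.get) if matches else None
```

In this sketch the 4-token subsequence A C D E finds no match, the 3-token subsequence C D E matches both stored queries, and the more frequently entered one is selected.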
- In the next steps, the algorithm obtains the k-min hashes of input search query 400 for k=1 to m and orders the tokens of each k-min hash in the same sequence in which they appear in input query 400. As indicated above, the value for m may be arbitrarily chosen based on the length of the input query 400, and in embodiments may be shorter than the length of query 400. For example, referring to FIG. 9, assuming a value of m=4 for the input query 400, the k-min hash for k=4 is E C D A. The k-min hash for k=3 is E C D. The k-min hash for k=2 is E C and the k-min hash for k=1 is E. FIG. 10 shows the same k-min hashes in the order in which they appear within query 400. Namely, the ordered k-min hash for k=4 is A C D E. The ordered k-min hash for k=3 is C D E. The ordered k-min hash for k=2 is C E. The ordered k-min hash for k=1 is E.
- In the embodiment of
FIG. 7, all k-min hashes and ordered k-min hashes are computed before the comparison step 360 explained below. In an alternative embodiment, the ordered min hashes may be determined one at a time and then the comparison performed. In this embodiment, only after a given ordered k-min hash results in no matches would the next ordered k-min hash be determined.
- In step 360, the search engine algorithm of the present system compares the first ordered min hash term for query 400 against the min hash terms in lookup table 208 for each of the queries stored in log 206. In the embodiment of FIG. 7, the search engine starts with the largest hashed sequence for comparison against the sequences in lookup table 208. Accordingly, the first min hash used is the min hash for k=m. In steps 362 through 368, the k-min hash of query 400 for k=m is compared against all of the min hashes stored in table 208. In particular, the min hash of query 400 is first compared against the set of min hashes in table 208 for the first query stored in log 206. If a match is found between the k-min hash of the query 400 and a min hash in the set of min hashes for the first query in log 206, that first query is stored in a buffer in step 362. The algorithm then checks whether there are additional queries stored in log 206. If so, the next log entry is taken in step 368 and the set of min hashes in table 208 for that next log entry are compared with the current k-min hash of input query 400.
- If in step 366 it is determined that all of the log entries have been compared against the current k-min hash of query 400, the algorithm next checks in step 370 whether one or more matches were found for the k-min hash of input query 400. If no matches are found, the algorithm next determines whether there are additional k-min hashes for input query 400. If there are additional k-min hashes (i.e., k has not yet decreased to one), the next k-min hash is taken in step 374 and the algorithm returns to step 360. In the embodiment of FIG. 7, the next k-min hash is obtained by decreasing k by 1. Thus, in an embodiment where m equals, for example, 4, the algorithm initially takes the min hash having four tokens from input query 400 and compares that sequence against all stored min hashes in table 208. If no matches are found, the algorithm next proceeds to the 3-min hash having three tokens from input query 400 and repeats steps 360 through 368 to see if any min hashes in table 208 match the tokens in the 3-min hash of input query 400, and so on.
- If k is at 1 after steps 360 through 370 and no matches have been found in step 370, that means that not even a single min hash token of input query 400 matches a stored min hash in table 208, and the algorithm indicates that no matches were identified in step 376. Although theoretically possible, in practice, at least one match for a small enough value of k will generally be found, and step 376 will not be reached.
- If, in
step 370, one or more matches have been found for a given k-min hash ofinput query 400, the algorithm next checks instep 380 whether multiple matches have been stored in the buffer instep 364. If there was a single k-min hash of table 208 that was found to match the k-min hash ofinput query 400, the query stored inlog 206 from which the matched k-min hash of table 208 is taken is returned instep 382. Instep 384, a search is performed bysearch engine 212 using the stored query identified insteps 382, and the results for that search of the identified query inlog 206 are returned to the user as the most closely correlated search results to inputquery 400. - Conversely, if it is determined in
step 380 that multiple min hashes were found in table 208 to match a given k-min hash ofinput query 400, the algorithm may return the most popular query of the matched min hashes instep 386, and perform a search to obtain the search results for that most popular query, which are then returned to the user instep 384. The information of how many times users entered each stored query is also stored indata store 210. Where there are multiple matches inlog 206 identified as matching the input query, the most popular will be the most frequently entered query of the matching stored query. - An example of the method described in the flowchart of
FIG. 7 is shown inFIG. 11 .FIG. 11 shows theinput query 400 described above of A B C D E F, having ordered min hashes of A C D E for k=4, C D E for k=3, C E for k=2, and E for k=1.FIG. 11 also shows a few example queries inlog 206 and their associated set of k-min hashes in lookup table 208. As shown and as indicated above, different length queries inlog 206 may have differing numbers of min hashes. In the example ofFIG. 11 , the search engine algorithm may initially try to find matches for the min hash with the largest number of tokens, i.e., A C D E. In the example ofFIG. 11 , the search engine algorithm may use that min hash insteps 360 through 370 and find no matches. The next min hash, C D E, is then compared against the hashes within min hash table 208 insteps 360 through 370. Upon comparison, two log entries may be identified having min hashes which match the min hash C D E ofinput query 400. These log entries may be C D E R and C D E F X. - As more than one log entry was identified in the example of
FIG. 11, the search engine algorithm performs step 386 of identifying the more popular log entry. For example, the log entry C D E R may be more popular than log entry C D E F X. Thus, as shown, the results for search query C D E R are obtained (1, 2, 3 . . . ) and returned as the most closely correlated search results within data store 210 for input query A B C D E F. In alternative embodiments, it is understood that the search results for all matched queries (C D E R and C D E F X in the example of FIG. 11) may be returned and presented to the user as the most correlated results for the user's input query.
- In the embodiments described above with respect to
FIGS. 7 through 11, the algorithm begins with the largest k-min hash of the input query (k=m) to identify the stored query in log 206 with the largest min hash matching that of the input query. This allows search results to be identified quickly and with few resources. In an alternative embodiment shown in the flowchart of FIG. 12, it is also understood that the algorithm may begin with the smallest k-min hash and work upward until the largest k-min hash having a match within table 208 is identified.
- In particular, a query is received in step 450, and the terms of the query are hashed to a value between 0 and 1 in
step 452 as described above. In step 454, the k-min hash of the search query is obtained. In the embodiment of FIG. 12, the value of k is initially selected as some minimum value. The value may be 1, or it may be greater than 1. In step 456, the k-min hash for the initial value of k obtained in step 454 is ordered in the same sequence as the tokens appear in the initial search query.
-
Steps 460 through 470 of FIG. 12 are the same as steps 360 through 370 described above with respect to FIG. 7. Namely, the min hash for the first value of k is compared against the stored min hashes in lookup table 208, and the associated stored queries for any matches found are stored in a buffer.
- In the embodiment of
FIG. 12, if one or more matches are found for a given value of k, k is incremented until the point where no matches are found. At that point, the matches found at k-1 are returned as the matching candidate results. Accordingly, if a match was found in step 470 and stored in the buffer, k is incremented in step 472, the next k-min hash is obtained in step 454, and steps 456 through 470 are repeated.
- In the event that no match was found in
step 470, the search engine algorithm checks in step 474 whether k is in fact at its initial value. If so, this indicates that no matches were found for any of the min hashes of the search query entered in step 450, and the algorithm indicates that no matches were found in step 476. As discussed above, if the starting value of k is 1, it is unlikely that step 476 will be reached. However, in this embodiment, it is contemplated that initial values of k may be greater than 1, making it more likely that step 476 will be reached.
- Assuming that k is not at its initial value in
step 474, the match(es) found for the previous value of k are retrieved from the memory buffer in step 478. In step 480, the algorithm checks whether there were multiple matches for the previous value of k. If there was a single match, the query stored in log 206 from which the matched hash of table 208 is taken is returned in step 482. In step 484, the results for the identified query in log 206 are returned to the user as the most closely correlated search results to the input query.
- Conversely, if it is determined in
step 480 that multiple min hashes were found in table 208 to match the previous k-min hash of the input query, the algorithm may return the most popular query of the matched min hashes in step 486, and return the search results for that most popular query to the user in step 484.
- Search queries often include common terms, or “stop words,” such as “the,” “of,” etc. Such stop words will result in a low hash value when hashed per the min hash function. Accordingly, when the min hashes are obtained, many min hashes stored in lookup table 208 will similarly include stop words and result in a high number of matches to the min hashes of the input query. To counter this, in a further embodiment of the present invention, it is possible to weight the hash value of terms so that stop words receive higher hashed values than other, less common and more discriminating terms in a given query.
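- The two lookup orders described above (largest k first per FIG. 7, smallest k first per FIG. 12) can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `TOY_HASH`, `QUERY_COUNTS`, `build_table`, `lookup_descending` and `lookup_ascending` are hypothetical, the hash values are hand-picked only so that the k-min hashes of A B C D E F reproduce the example of FIG. 11, the popularity counts are invented, and lookup table 208 is assumed to be a plain dictionary keyed by ordered token tuples.

```python
from collections import defaultdict

# Hypothetical hash values in (0, 1), chosen only so that the k-min hashes of
# "A B C D E F" reproduce FIG. 11 (E < C < D < A < the rest).
TOY_HASH = {"E": 0.10, "C": 0.20, "D": 0.30, "A": 0.40,
            "R": 0.50, "B": 0.80, "F": 0.90, "X": 0.95}

def k_min_hash(tokens, k, hash_fn):
    """Return the k tokens with the smallest hash values, in query order."""
    unique = list(dict.fromkeys(tokens))          # drop repeats, keep order
    smallest = set(sorted(unique, key=hash_fn)[:k])
    return tuple(t for t in unique if t in smallest)

def build_table(query_counts, hash_fn, k_max):
    """Map every k-min hash (k = 1..k_max) to the logged queries producing it."""
    table = defaultdict(set)
    for query in query_counts:
        tokens = query.split()
        for k in range(1, min(k_max, len(set(tokens))) + 1):
            table[k_min_hash(tokens, k, hash_fn)].add(query)
    return table

def most_popular(matches, query_counts):
    """Pick the most frequently entered query among the matches."""
    return max(sorted(matches), key=lambda q: query_counts[q])

def lookup_descending(query, table, query_counts, hash_fn, k_max):
    """FIG. 7 style: start at the largest k, stop at the first k with a match."""
    tokens = query.split()
    for k in range(min(k_max, len(set(tokens))), 0, -1):
        matches = table.get(k_min_hash(tokens, k, hash_fn))
        if matches:
            return most_popular(matches, query_counts)
    return None  # no min hash of the input query matched

def lookup_ascending(query, table, query_counts, hash_fn, k_max, k_start=1):
    """FIG. 12 style: grow k until a miss, then use the matches found at k - 1."""
    tokens = query.split()
    previous = None
    for k in range(k_start, min(k_max, len(set(tokens))) + 1):
        matches = table.get(k_min_hash(tokens, k, hash_fn))
        if not matches:
            break
        previous = matches
    return most_popular(previous, query_counts) if previous else None

# Hypothetical popularity counts for the two log entries named in FIG. 11.
QUERY_COUNTS = {"C D E R": 120, "C D E F X": 45}
TABLE = build_table(QUERY_COUNTS, TOY_HASH.__getitem__, k_max=4)

print(lookup_descending("A B C D E F", TABLE, QUERY_COUNTS,
                        TOY_HASH.__getitem__, k_max=4))
print(lookup_ascending("A B C D E F", TABLE, QUERY_COUNTS,
                       TOY_HASH.__getitem__, k_max=4))
```

With these assumed values, both orders find no match for A C D E, find two matches for C D E, and select C D E R as the more popular entry. In general the two orders need not agree, since the ascending search stops at the first value of k that has no match.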
- In one embodiment, the weighting of hash values described above may be a TF-IDF (term frequency-inverse document frequency) weight, which is a known concept used in information retrieval and text mining. In general, a TF-IDF weight is a statistical measure used to evaluate how important a word is within a given query. TF-IDF weight is explained in greater detail in Salton, G., Introduction to Modern Information Retrieval, McGraw Hill (1983). In general, TF-IDF weight is computed as (1/|q|) · log(N/(1+f)), where |q| is the length of the query, N is the number of queries in the query log, and f is the number of queries in which the term occurs. Biasing the computed min hash values for the respective tokens in a search query in this way makes it less likely that correlations from log 206 are matched based on stop words.
- The foregoing detailed description of the inventive system has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the inventive system to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. The described embodiments were chosen in order to best explain the principles of the inventive system and its practical application to thereby enable others skilled in the art to best utilize the inventive system in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the inventive system be defined by the claims appended hereto.
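- As an illustration of the TF-IDF weight formula given in the description above, the following short Python sketch evaluates it for a common term and a rare term. The function name `tfidf_weight`, the use of the natural logarithm, and all counts are assumptions made for the example; the description does not state the logarithm base or how the weight is combined with the hash value, so that step is not shown.

```python
import math

def tfidf_weight(query_length, log_size, queries_with_term):
    """(1/|q|) * log(N / (1 + f)); natural logarithm assumed."""
    return (1.0 / query_length) * math.log(log_size / (1 + queries_with_term))

# Hypothetical counts: a 4-token query against a log of 1,000,000 queries.
stop_word_weight = tfidf_weight(4, 1_000_000, 400_000)  # a term such as "the"
rare_term_weight = tfidf_weight(4, 1_000_000, 99)       # an uncommon term

print(stop_word_weight, rare_term_weight)
```

Under these assumed counts the rare term receives a much larger weight than the stop word, which is the property the weighting relies on. Note that the formula yields a weight of zero or less once a term occurs in nearly every logged query (1 + f ≥ N).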
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/863,045 US7765204B2 (en) | 2007-09-27 | 2007-09-27 | Method of finding candidate sub-queries from longer queries |
Publications (2)
Publication Number | Publication Date |
---|---|
US20090089266A1 | 2009-04-02 |
US7765204B2 | 2010-07-27 |
Family
ID=40509517
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/863,045 Expired - Fee Related US7765204B2 (en) | 2007-09-27 | 2007-09-27 | Method of finding candidate sub-queries from longer queries |
Country Status (1)
Country | Link |
---|---|
US (1) | US7765204B2 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8392446B2 (en) * | 2007-05-31 | 2013-03-05 | Yahoo! Inc. | System and method for providing vector terms related to a search query |
US8620902B2 (en) * | 2011-06-01 | 2013-12-31 | Lexisnexis, A Division Of Reed Elsevier Inc. | Computer program products and methods for query collection optimization |
US9396187B2 (en) * | 2011-06-28 | 2016-07-19 | Broadcom Corporation | System and method for using network equipment to provide targeted advertising |
Patent Citations (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5202985A (en) * | 1988-04-14 | 1993-04-13 | Racal-Datacom, Inc. | Apparatus and method for displaying data communication network configuration after searching the network |
US6169986B1 (en) * | 1998-06-15 | 2001-01-02 | Amazon.Com, Inc. | System and method for refining search queries |
US6363377B1 (en) * | 1998-07-30 | 2002-03-26 | Sarnoff Corporation | Search data processor |
US6804677B2 (en) * | 2001-02-26 | 2004-10-12 | Ori Software Development Ltd. | Encoding semi-structured data for efficient search and browsing |
US20020120598A1 (en) * | 2001-02-26 | 2002-08-29 | Ori Software Development Ltd. | Encoding semi-structured data for efficient search and browse |
US20050033733A1 (en) * | 2001-02-26 | 2005-02-10 | Ori Software Development Ltd. | Encoding semi-structured data for efficient search and browsing |
US6691109B2 (en) * | 2001-03-22 | 2004-02-10 | Turbo Worx, Inc. | Method and apparatus for high-performance sequence comparison |
US20030055813A1 (en) * | 2001-05-15 | 2003-03-20 | Microsoft Corporation | Query optimization by sub-plan memoization |
US7136845B2 (en) * | 2001-07-12 | 2006-11-14 | Microsoft Corporation | System and method for query refinement to enable improved searching based on identifying and utilizing popular concepts related to users' queries |
US20070112714A1 (en) * | 2002-02-01 | 2007-05-17 | John Fairweather | System and method for managing knowledge |
US7472114B1 (en) * | 2002-09-18 | 2008-12-30 | Symantec Corporation | Method and apparatus to define the scope of a search for information from a tabular data source |
US20050027723A1 (en) * | 2002-09-18 | 2005-02-03 | Chris Jones | Method and apparatus to report policy violations in messages |
US20050086252A1 (en) * | 2002-09-18 | 2005-04-21 | Chris Jones | Method and apparatus for creating an information security policy based on a pre-configured template |
US20060168006A1 (en) * | 2003-03-24 | 2006-07-27 | Mr. Marvin Shannon | System and method for the classification of electronic communication |
US7051023B2 (en) * | 2003-04-04 | 2006-05-23 | Yahoo! Inc. | Systems and methods for generating concept units from search queries |
US20040225645A1 (en) * | 2003-05-06 | 2004-11-11 | Rowney Kevin T. | Personal computing device -based mechanism to detect preselected data |
US20050108339A1 (en) * | 2003-05-15 | 2005-05-19 | Matt Gleeson | Method and apparatus for filtering email spam using email noise reduction |
US20050132197A1 (en) * | 2003-05-15 | 2005-06-16 | Art Medlar | Method and apparatus for a character-based comparison of documents |
US20050108340A1 (en) * | 2003-05-15 | 2005-05-19 | Matt Gleeson | Method and apparatus for filtering email spam based on similarity measures |
US20040254920A1 (en) * | 2003-06-16 | 2004-12-16 | Brill Eric D. | Systems and methods that employ a distributional analysis on a query log to improve search results |
US20050055341A1 (en) * | 2003-09-05 | 2005-03-10 | Paul Haahr | System and method for providing search query refinements |
US20050165838A1 (en) * | 2004-01-26 | 2005-07-28 | Fontoura Marcus F. | Architecture for an indexer |
US7424467B2 (en) * | 2004-01-26 | 2008-09-09 | International Business Machines Corporation | Architecture for an indexer with fixed width sort and variable width sort |
US20070271268A1 (en) * | 2004-01-26 | 2007-11-22 | International Business Machines Corporation | Architecture for an indexer |
US20050283473A1 (en) * | 2004-06-17 | 2005-12-22 | Armand Rousso | Apparatus, method and system of artificial intelligence for data searching applications |
US20060095521A1 (en) * | 2004-11-04 | 2006-05-04 | Seth Patinkin | Method, apparatus, and system for clustering and classification |
US7574409B2 (en) * | 2004-11-04 | 2009-08-11 | Vericept Corporation | Method, apparatus, and system for clustering and classification |
US20060224589A1 (en) * | 2005-02-14 | 2006-10-05 | Rowney Kevin T | Method and apparatus for handling messages containing pre-selected data |
US20060184549A1 (en) * | 2005-02-14 | 2006-08-17 | Rowney Kevin T | Method and apparatus for modifying messages based on the presence of pre-selected data |
US20060195425A1 (en) * | 2005-02-28 | 2006-08-31 | Microsoft Corporation | Composable query building API and query language |
US20060218123A1 (en) * | 2005-03-28 | 2006-09-28 | Sybase, Inc. | System and Methodology for Parallel Query Optimization Using Semantic-Based Partitioning |
US20060253439A1 (en) * | 2005-05-09 | 2006-11-09 | Liwei Ren | Matching engine for querying relevant documents |
US20060282456A1 (en) * | 2005-06-10 | 2006-12-14 | Microsoft Corporation | Fuzzy lookup table maintenance |
US7584204B2 (en) * | 2005-06-10 | 2009-09-01 | Microsoft Corporation | Fuzzy lookup table maintenance |
US20070005556A1 (en) * | 2005-06-30 | 2007-01-04 | Microsoft Corporation | Probabilistic techniques for detecting duplicate tuples |
US20070124698A1 (en) * | 2005-11-15 | 2007-05-31 | Microsoft Corporation | Fast collaborative filtering through approximations |
US20070208703A1 (en) * | 2006-03-03 | 2007-09-06 | Microsoft Corporation | Web forum crawler |
US7599931B2 (en) * | 2006-03-03 | 2009-10-06 | Microsoft Corporation | Web forum crawler |
US20080243764A1 (en) * | 2007-03-29 | 2008-10-02 | Microsoft Corporation | Group joins to navigate data relationships |
US20080256143A1 (en) * | 2007-04-11 | 2008-10-16 | Data Domain, Inc. | Cluster storage using subsegmenting |
US20090049062A1 (en) * | 2007-08-14 | 2009-02-19 | Krishna Prasad Chitrapura | Method for Organizing Structurally Similar Web Pages from a Web Site |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100114857A1 (en) * | 2008-10-17 | 2010-05-06 | John Edwards | User interface with available multimedia content from multiple multimedia websites |
US8321401B2 (en) * | 2008-10-17 | 2012-11-27 | Echostar Advanced Technologies L.L.C. | User interface with available multimedia content from multiple multimedia websites |
US8903863B2 (en) | 2008-10-17 | 2014-12-02 | Echostar Technologies L.L.C. | User interface with available multimedia content from multiple multimedia websites |
US20100131538A1 (en) * | 2008-11-24 | 2010-05-27 | Yahoo! Inc. | Identifying and expanding implicitly temporally qualified queries |
US8156111B2 (en) * | 2008-11-24 | 2012-04-10 | Yahoo! Inc. | Identifying and expanding implicitly temporally qualified queries |
US20140075202A1 (en) * | 2012-09-12 | 2014-03-13 | Infosys Limited | Method and system for securely accessing different services based on single sign on |
US9449167B2 (en) * | 2012-09-12 | 2016-09-20 | Infosys Limited | Method and system for securely accessing different services based on single sign on |
US10303682B2 (en) * | 2013-09-21 | 2019-05-28 | Oracle International Corporation | Automatic verification and triage of query results |
US11126620B2 (en) | 2013-09-21 | 2021-09-21 | Oracle International Corporation | Automatic verification and triage of query results |
US20210391908A1 (en) * | 2020-06-10 | 2021-12-16 | Peking University | Method, system, computer device, and storage medium for non-contact determination of a sensing boundary |
Also Published As
Publication number | Publication date |
---|---|
US7765204B2 (en) | 2010-07-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: MICROSOFT CORPORATION, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GOLLAPUDI, SREENIVAS;PANIGRAHY, RINA;REEL/FRAME:019891/0577 Effective date: 20070927 |
|
FEPP | Fee payment procedure |
Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
REMI | Maintenance fee reminder mailed | ||
LAPS | Lapse for failure to pay maintenance fees | ||
STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20140727 |
|
AS | Assignment |
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034542/0001 Effective date: 20141014 |