WO2009047773A2 - A method application and sysyem for processing computerized search queries - Google Patents

A method application and sysyem for processing computerized search queries Download PDF

Info

Publication number
WO2009047773A2
Authority
WO
WIPO (PCT)
Prior art keywords
search
term
specific
server
database
Prior art date
Application number
PCT/IL2008/001354
Other languages
French (fr)
Other versions
WO2009047773A8 (en)
WO2009047773A3 (en)
Inventor
Dror Feitelson
Original Assignee
Yissum Research Development Company Of The Hebrew University Of Jerusalem
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yissum Research Development Company Of The Hebrew University Of Jerusalem
Publication of WO2009047773A2
Publication of WO2009047773A8
Publication of WO2009047773A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • The present invention relates generally to the field of data processing. More specifically, the present invention relates to a method, application and system for processing computerized search queries.
  • Search engines generally perform the following operations: Web crawling, Indexing and Searching.
  • Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. These pages are retrieved by a Web crawler (sometimes also known as a spider) — an automated Web browser which follows every link it sees. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries.
  • Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find.
  • This cached page always holds the actual search text since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it.
  • This problem might be considered to be a mild form of linkrot, and Google's handling of it increases usability by satisfying user expectations that the search terms will be on the returned webpage. This satisfies the principle of least astonishment since the user normally expects the search terms to be on the returned pages.
  • When a user enters a query into a search engine (typically by using key words), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text.
  • Search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query.
  • Some search engines provide an advanced feature called proximity search which allows users to define the distance between keywords.
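The Boolean operators and the proximity search described above can be illustrated with a small positional index. This is only a minimal sketch: the function names (`build_index`, `docs_with`, `near`), the sample documents, and the whitespace tokenization are assumptions made for illustration and are not taken from the source.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def docs_with(index, word):
    """Set of document ids that contain the word."""
    return set(index.get(word, {}))

def near(index, a, b, max_distance):
    """Documents where words a and b occur within max_distance positions."""
    hits = set()
    for doc_id in docs_with(index, a) & docs_with(index, b):
        if any(abs(i - j) <= max_distance
               for i in index[a][doc_id] for j in index[b][doc_id]):
            hits.add(doc_id)
    return hits

# Hypothetical sample documents.
docs = {
    1: "english news from israel",
    2: "israel weather today",
    3: "news about the weather in israel",
}
index = build_index(docs)
both = docs_with(index, "israel") & docs_with(index, "news")        # AND
either = docs_with(index, "weather") | docs_with(index, "news")     # OR
without = docs_with(index, "israel") - docs_with(index, "weather")  # NOT
close = near(index, "news", "israel", 2)                            # proximity
```

Set intersection, union and difference stand in for AND, OR and NOT; the proximity check only needs the stored word positions.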
  • Load balancing can be done at several levels.
  • The top level is in the Domain Name System (“DNS”) network: DNS servers can be set up so that successive address resolution requests are directed to distinct IP addresses.
  • The basic scheme only allows simple distribution schemes like round-robin, but more sophisticated schemes with full Perl support have also been devised.
  • Another option is to use a special hardware router that distributes flows, or to use URL re-writing in the web server.
  • The Google cluster (ca. 2002) used both DNS, to distribute requests among multiple geographically spread clusters, and hardware routers, for load balancing within each such cluster.
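The round-robin distribution mentioned above, in which successive address resolution requests are answered with distinct IP addresses, may be sketched as follows. The class name `RoundRobinResolver`, the host name `search.example`, and the addresses (taken from a documentation-only range) are assumptions for illustration; a real DNS server is considerably more involved.

```python
from itertools import cycle

class RoundRobinResolver:
    """Toy resolver that rotates through the addresses registered for a name."""

    def __init__(self, records):
        # One endless iterator per host name.
        self._cycles = {name: cycle(addrs) for name, addrs in records.items()}

    def resolve(self, name):
        """Return the next address in rotation for the given name."""
        return next(self._cycles[name])

# Hypothetical host name and addresses.
resolver = RoundRobinResolver(
    {"search.example": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]}
)
answers = [resolver.resolve("search.example") for _ in range(4)]
```

After the third request the rotation wraps around, so the fourth answer repeats the first address.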
  • A search engine (e.g. Google, AOL, AltaVista, etc.) may include two or more servers or two or more server clusters, each of which may be associated with a different Internet Protocol ("IP") address.
  • Each of the two or more servers/clusters may be associated with a different port number, or another unique identifier, within the same IP address.
  • Each of the servers may be functionally associated with a database server containing some or all of the data stored and provided by the search engine (e.g.
  • At least one of the servers may be functionally associated with a database storing a predominantly (e.g. only) search-term-specific-subset of all the data stored by the search engine, which database may be referred to as a search-term-specific-database.
  • A search-term-specific-subset of data may include search engine data searchable and/or otherwise associated with one or a set of specific search terms.
  • One or more servers/clusters of a search engine may be functionally associated with a database including all the data stored and provided by the search engine. It should be understood that the term server may also apply to server clusters.
  • A search engine may include one or more search-term-specific database replication modules residing on one or more servers, which replication module(s) may replicate to a first search-term-specific-database, continuously or intermittently, a first search-term-specific-subset of the search engine's total searchable data.
  • The same or a second search-term-specific database replication module may replicate to a second search-term-specific-database, continuously or intermittently, a second search-term-specific-subset of the search engine's total searchable data, which second search-term-specific-subset may be associated with at least one different search term than the first search-term-specific-subset.
  • The predictive module may track and record what percentage of search queries including the terms "Israel & English & News" results in the search client device requesting data from the website www.jpost.com.
  • The client's target data request information may be obtained from the client device via a floating agent application sent to the client along with the search engine interface.
  • The predictive module may maintain one or more predictive target data tables with records of some or substantially all combinations of search terms which correlate with a high probability (e.g. greater than 50%) that the client device issuing a search request with those specific search terms will request data from a specific target.
  • One or more records may also include an identifier of the specific target to which the combination of terms relates. It should be understood that any method of obtaining, tracking and recording the above mentioned statistical data, known today or to be devised in the future, may be used as part of the present invention.
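The tracking performed by the predictive module can be sketched as a simple tally over observed (search terms, requested target) pairs. This is a minimal sketch under stated assumptions: the function name `build_predictive_table`, the observation format, and the sample data are hypothetical, and the source expressly leaves the statistical method open.

```python
from collections import Counter, defaultdict

def build_predictive_table(observations, threshold=0.5):
    """Return {term set: (target, share)} for term sets whose most frequently
    requested target accounts for more than `threshold` of their queries.

    `observations` is an iterable of (terms, target) pairs; target is None
    when the client requested no target after the query.
    """
    totals = Counter()
    by_target = defaultdict(Counter)
    for terms, target in observations:
        key = frozenset(t.lower() for t in terms)
        totals[key] += 1
        if target is not None:
            by_target[key][target] += 1

    table = {}
    for key, total in totals.items():
        if not by_target[key]:
            continue  # no target was ever requested for this term set
        target, hits = by_target[key].most_common(1)[0]
        share = hits / total
        if share > threshold:
            table[key] = (target, share)
    return table

# Hypothetical observations, as might be reported by a client-side agent.
observations = (
    [(["Israel", "English", "News"], "www.jpost.com")] * 3
    + [(["Israel", "English", "News"], "www.example.org")]
    + [(["weather"], None), (["weather"], "weather.example")]
)
predictive_table = build_predictive_table(observations)
```

In this sample, three of the four "Israel, English, News" queries lead to www.jpost.com, so that term set is recorded; "weather" reaches exactly 50% and is therefore not recorded under a strict greater-than-50% rule.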
  • A search query routing module may direct a search query from a search engine client application to a specific server functionally associated with a given search-term-specific-database when the search query includes a term associated with the given search-term-specific-database.
  • The search query routing module may include or be otherwise functionally associated with a routing table, which routing table may define correlations between at least one specific search term and the IP address, or other server designator, of a server functionally associated with a search-term-specific-database populated with and adapted to provide data associated with the at least one specific search term.
  • The routing table may be based upon the table used and/or updated by the one or more replication modules.
  • The routing module may compare one or more terms within a given search query against entries in the routing table, and in the event the routing table includes an entry corresponding to one or more of the search query terms, the routing module may route the search query to the server designated by the corresponding entry. According to some embodiments of the present invention, in the event that the routing module does not identify an entry in the routing table corresponding to one or more of the terms in the search query, the routing module may route the search query to a (default search engine) server functionally associated with a database populated with the complete data of the search engine (e.g. the default search engine IP address).
  • The routing table may be integral with the routing module or may reside on one of the servers associated with the search engine.
  • The routing module may update the routing table intermittently. Updating of the routing table may be based on updates of the table used and/or updated by the one or more replication modules. According to some embodiments of the present invention, the routing table and the replication module table may be substantially identical or the same table.
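The routing table lookup with a default fallback, as described above, might look like the following. It is a sketch only: the function name `route`, the (IP address, port) form of the server designator, and all addresses (from a documentation-only range) are assumptions, and the two entries sharing one IP address illustrate the port-number variant mentioned earlier.

```python
# Hypothetical designator of the default server holding the complete database.
DEFAULT_SERVER = ("198.51.100.1", 80)

def route(query, routing_table, default=DEFAULT_SERVER):
    """Return the designator of the server that should handle the query.

    The first query term found in the routing table wins; if no term
    matches, the query goes to the default search engine server.
    """
    for term in query.lower().split():
        if term in routing_table:
            return routing_table[term]
    return default

# Hypothetical routing table: search term -> (IP address, port).
routing_table = {
    "news": ("198.51.100.2", 80),
    "weather": ("198.51.100.2", 8080),  # same IP address, different port
}

news_server = route("Israel news", routing_table)
fallback_server = route("cheap flights", routing_table)
```

Picking the first matching term is one possible policy; an implementation could equally prefer the most popular or most specific matching entry.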
  • The routing table may include one or more entries defining correlations between one or a set of specific search terms and the IP address, or other designator, of a predicted search target (website, content, etc.).
  • When one or a combination of the search terms in a given search query is identified by the routing module as having a high probabilistic correlation with a given search target, that is, some high percentage of search queries including this given set of search terms (e.g. above 50%, 60%, 70%, 80%...) result in the client device requesting data from the given search target, the routing module may return to the client device the server identifier or other designator of the predicted target. Routing table entries correlating specific sets of search terms with specific predicted search targets may be provided and intermittently updated by one or more functional modules on one or more of the servers associated with the search engine. The predicted search target entries in the routing table may be updated based on data from one or more records of a predictive target data table, or the like, as described above. According to further embodiments of the present invention, the routing module may direct or redirect a web browser on the client device to the predicted target.
  • The routing module may be implemented at any node along a path between a search engine client and a search engine server.
  • The routing module may be an application or an applet transmitted to and/or otherwise running on a client device.
  • The routing module may be transmitted to, and may run within, a web browser on a client device when the browser requests and receives a search engine's landing/home page.
  • The routing module may be part of a browser plug-in.
  • The routing module may be part of a local DNS plug-in application.
  • Fig. 1 shows a functional block diagram of a three layer web-server architecture to which some embodiments of the present invention are applicable;
  • Figs. 2A & 2B show two graphs of a heavy tail of a search term distribution determined in accordance with some embodiments of the present invention: (A) a Zipf count-rank plot, and (B) an LLCD plot;
  • Fig. 3 shows a mass-count plot of a distribution of search terms determined in accordance with some embodiments of the present invention;
  • Fig. 4 shows a table with examples of the four possible classes of query terms according to some embodiments of the present invention;
  • Fig. 5 shows a graph showing query coverage achieved by increasing numbers of query words according to some embodiments of the present invention;
  • Fig. 6 shows a functional block diagram including a server-side routing module, a search-term-specific replication module, a predictive module, and a routing module server/updater according to some embodiments of the present invention;
  • Fig. 7A shows a flowchart including the steps of an exemplary method of generating a predictive table according to some embodiments of the present invention;
  • Fig. 7B shows a flowchart including the steps of an exemplary method of generating a search-term-specific replication table according to some embodiments of the present invention;
  • Fig. 8 shows a functional block diagram including a client-side routing module and client-side (floating) reporting agent, facilitating client search query routing to a specific search-term-specific server cluster and/or to a specific predictive target;
  • Fig. 9 shows a flowchart including the steps of an exemplary method of routing search queries according to some embodiments of the present invention.
  • Embodiments of the present invention may include apparatuses for performing the operations herein. This apparatus may be specially constructed for the desired purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer.
  • Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions, and capable of being coupled to a computer system bus.
  • IP networking is a set of communications protocols that implement the protocol stack on which the Internet and most commercial networks run. It has also been referred to as the TCP/IP protocol suite, which is named after two of the most important protocols in it: the Transmission Control Protocol (TCP) and the Internet Protocol (IP), which were also the first two networking protocols defined.
  • The Internet Protocol suite, like many protocol suites, can be viewed as a set of layers. Each layer solves a set of problems involving the transmission of data, and provides a well-defined service to the upper layer protocols based on using services from some lower layers. Upper layers are logically closer to the user and deal with more abstract data, relying on lower layer protocols to translate data into forms that can eventually be physically transmitted.
  • The TCP/IP reference model consists of four layers.
  • The IP suite uses encapsulation to provide abstraction of protocols and services.
  • A protocol at a higher level uses a protocol at a lower level to help accomplish its aims.
  • The Internet protocol stack has never been altered by the IETF from the four layers defined in RFC 1122. The IETF makes no effort to follow the seven-layer OSI model and does not refer to it in standards-track protocol specifications and other architectural documents.
  • Protocols listed in the accompanying table: DNS, TFTP, TLS/SSL, FTP, Gopher, HTTP, IMAP, IRC.
  • RFC 3439, on Internet architecture, contains a section entitled "Layering Considered Harmful": Emphasizing layering as the key driver of architecture is not a feature of the TCP/IP model, but rather of OSI. Much confusion comes from attempts to force OSI-like layering onto an architecture that minimizes their use.
  • Today, most commercial operating systems include and install the TCP/IP stack by default. For most users, there is no need to look for implementations. TCP/IP is included in all commercial Unix systems, Mac OS X, and all free-software Unix-like systems such as Linux distributions and BSD systems, as well as Microsoft Windows.
  • Mobile devices may connect with and access data from an enterprise data system over a communication network, at least some portion of which may be a wireless network.
  • While the term "wireless network" may technically be used to refer to any type of network that is wireless, the term is most commonly used to refer to a telecommunications network whose interconnections between nodes are implemented without the use of wires, such as a computer network (which is a type of communications network).
  • Wireless telecommunications networks are generally implemented with some type of remote information transmission system that uses electromagnetic waves, such as radio waves, for the carrier and this implementation usually takes place at the physical level or "layer" of the network. (For example, see the Physical Layer of the OSI Model).
  • The GSM network is divided into three major systems, which are the switching system, the base station system, and the operation and support system (Global System for Mobile Communication (GSM)).
  • The cell phone connects to the base system station, which then connects to the operation and support station; it then connects to the switching station, where the call is transferred to where it needs to go (Global System for Mobile Communication (GSM)).
  • PCS: Personal Communications Service (personal communication system). Not a single standard; this covers both CDMA and GSM networks operating at 1900 MHz in North America.
  • GSM: Global System for Mobile Communications. Global standard for digital mobile communication, common in most countries except South Korea and Japan.
  • GPRS: General Packet Radio Service. Upgraded packet-based service within the GSM framework; gives higher data rates and always-on service.
  • UMTS: Universal Mobile Telephone Service (3rd generation cell phone network), based on the W-CDMA radio access network.
  • NMT: Nordic Mobile Telephony, an analog system originally developed by PTTs in the Nordic countries.
  • AMPS: Advanced Mobile Phone System, introduced in the Americas in about 1984.
  • D-AMPS: Digital Advanced Mobile Phone Service (Digital AMPS), also known as TDMA.
  • The heavily-used words dominate. For example, those 70% of the words that only appear once account for only 6% of the total instances.
  • The joint ratio metric indicated in Fig. 3 is a generalization of the proverbial 20/80 rule. In this case, we found that 9% of the words (and specifically, the more heavily used ones) account for 91% of the instances, and vice versa: 91% of the words (those that are used less often) account for only 9% of the instances.
  • the N1/2 metric shows the effect of the most highly used words. In this case, it shows that a full half of the instances are actually repetitions of only 0.05% of the words (not 5 percent, but 5/1000)
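The two mass-count metrics discussed above can be computed from a list of per-word usage counts. The sketch below assumes the following readings of the text: the joint ratio is the point at which the top p% of words account for (100 - p)% of the instances, and N1/2 is the share of the most heavily used words that accounts for half of all instances. The function name `mass_count_metrics` and the sample counts are hypothetical and do not reproduce the measured data.

```python
def mass_count_metrics(counts):
    """Return (n_half, joint) for a list of per-word instance counts.

    n_half: smallest share of words (most used first) covering at least
            half of all instances.
    joint:  (word share, instance share) at the first point where the two
            shares sum to at least 100%.
    """
    ordered = sorted(counts, reverse=True)
    n, total = len(ordered), sum(ordered)
    running = 0
    n_half = None
    joint = None
    for k, count in enumerate(ordered, start=1):
        running += count
        if n_half is None and 2 * running >= total:
            n_half = k / n
        # Integer form of: k / n + running / total >= 1
        if joint is None and k * total + running * n >= n * total:
            joint = (k / n, running / total)
    return n_half, joint

# Hypothetical counts: 10 distinct words, 100 instances in total.
counts = [60, 20, 5, 4, 3, 2, 2, 2, 1, 1]
n_half, joint = mass_count_metrics(counts)
```

For these sample counts the single most used word already covers half of the instances, and the top 20% of the words account for 80% of the instances, that is, a 20/80 joint ratio.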
  • Directed client-side or server-side routing preprocessing is performed based on the use of those query words that are highly popular. It was determined that performing some preliminary analysis, for example within the client's web browser, before submitting the query to the server may alleviate some default server load. It was found in simulations that an analysis (i.e. preprocessing) of search terms using a table lookup with a table containing a few hundred of the most popular query terms can be effective to that end. The lookup may map each such term, or set of terms, to a server that can be specialized to handle queries on this topic, thereby reducing the load on the default gateway/server, and allowing for partitioning of the backend database.
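How much of the query stream a lookup table of popular terms can cover may be estimated as sketched below, in the spirit of the query coverage graph of Fig. 5. The function name `coverage_by_top_words`, the whitespace tokenization, the alphabetical tie-breaking, and the sample queries are assumptions for illustration only.

```python
from collections import Counter

def coverage_by_top_words(queries, k):
    """Fraction of queries containing at least one of the k most popular words.

    Popularity is the number of queries a word appears in; ties are broken
    alphabetically so the result is deterministic.
    """
    tokenized = [set(q.lower().split()) for q in queries]
    word_counts = Counter(w for words in tokenized for w in words)
    ranked = sorted(word_counts, key=lambda w: (-word_counts[w], w))
    top = set(ranked[:k])
    covered = sum(1 for words in tokenized if words & top)
    return covered / len(tokenized)

# Hypothetical query log.
queries = [
    "free music",
    "free games",
    "free download",
    "music download",
    "weather",
]
coverage_1 = coverage_by_top_words(queries, 1)
coverage_3 = coverage_by_top_words(queries, 3)
```

In this small log, a table holding only the single most popular word already covers 60% of the queries, and a table of three words covers 80%; queries left uncovered would go to the default server.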
  • Query words were classified into four distinct classes:
  • Topical: words that serve to define the topic of the query, mainly nouns and proper names.
  • Navigation: a special case of topical words, where the intent is to find a specific web site quickly.
  • Adjectives: words that modify a query more than defining it.
  • A single search-term-specific database replication module may be used to populate multiple search-term-specific-databases by replicating to each search-term-specific-database, in parallel or in series, data from the complete search engine database relevant to each respective search-term-specific-database.
  • A table including entries identifying which search-term-specific-database is associated with which search terms may be used and/or updated by the one or more replication modules.
  • Turning now to Fig. 6, there is shown a functional block diagram of a server architecture including a server-side routing module, a search-term-specific replication module, a predictive search target module, and a routing module server/updater according to some embodiments of the present invention.
  • Operation of the exemplary server-side arrangement, as shown in Fig. 6, may be understood better in view of Figs. 7A & 7B, where Fig. 7A shows a flowchart including the steps of an exemplary method of generating a predictive table according to some embodiments of the present invention, and Fig. 7B shows a flowchart including the steps of an exemplary method of generating a search-term-specific replication table according to some embodiments of the present invention.
  • The server architecture of Fig. 6 may be part of a search engine cluster, where searchable data is stored on a default/complete database including all the searchable data of the search engine.
  • The web/html servers may respond to an initial client device request for data by providing the client with code including a search engine interface application.
  • The web/html servers may also provide the client with code including a client-side routing module.
  • The web/html servers may provide the client with code including a reporting agent application.
  • The servers may include or be otherwise associated with a routing module.
  • Search queries received from a client device may be processed by application servers, which application servers may utilize searchable data stored on the search engine database(s) to generate a list of results, possibly including hyperlinks to servers, correlated with search terms of a given search query.
  • The generated list may be provided to the client device, directly or through web/html servers.
  • The predictive search target module may intercept client devices' search queries, from either the web/html or application servers, and may also intercept/receive resulting search target information from associated reporting agent applications running on client devices (Step 7000A).
  • The predictive module may track multiple queries and associated search target information and may determine correlations between one or more sets of search query terms and respective search targets (Step 7100A).
  • The correlations may be stored, and updated as needed, in a predictive table(s) (Step 7200A), and data from records of the predictive table(s) may be used to populate and/or update records in a routing table (Step 7300A) functionally associated with a (server-side or client-side) routing module according to some embodiments of the present invention.
  • A routing module server/updater may be used to populate and/or update records in a routing table (Step 7300A) functionally associated with a (server-side or client-side) routing module based on data from the predictive table(s).
  • A search-term-specific database replication module may intercept and identify frequently used sets of search terms in search queries from client devices (Step 7000B).
  • The sets of terms may be tabulated, statistically ranked, and those sets of terms having a probabilistic rank above some threshold value (e.g. set of terms present in greater than 0.1% of all search queries) may be recorded in a search-term-specific data table (Step 7100B).
  • One or more replication modules may replicate portions of the default/complete database to search-term-specific databases based on records in the replication table (Step 7200B). Data from the replication table(s) may also be used to populate and/or update records in a routing table functionally associated with a (server-side or client-side) routing module (Step 7300B).
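Steps 7000B through 7200B above can be sketched in a few lines. The sketch makes several simplifying assumptions that are not in the source: each "set of terms" is a single word, frequent terms are assigned to databases in round-robin order, and a document belongs to a term-specific subset if it contains the term. All function names, database names, and sample data are hypothetical.

```python
from collections import Counter

def frequent_terms(queries, threshold=0.001):
    """Steps 7000B/7100B: terms present in more than `threshold` of all queries."""
    seen = Counter()
    for query in queries:
        seen.update(set(query.lower().split()))
    return sorted(t for t, c in seen.items() if c / len(queries) > threshold)

def build_replication_table(terms, databases):
    """Assign each frequent term to a database (round-robin is an assumption)."""
    return {term: databases[i % len(databases)] for i, term in enumerate(terms)}

def replicate(complete_db, replication_table):
    """Step 7200B: copy matching documents into each term-specific database."""
    subsets = {db: {} for db in set(replication_table.values())}
    for doc_id, text in complete_db.items():
        words = set(text.lower().split())
        for term, db in replication_table.items():
            if term in words:
                subsets[db][doc_id] = text
    return subsets

# Hypothetical query log and complete database; a high threshold is used
# only because the sample is tiny (the text suggests e.g. 0.1%).
queries = ["israel news", "israel weather", "news today", "rare"]
terms = frequent_terms(queries, threshold=0.4)
replication_table = build_replication_table(terms, ["db-a", "db-b"])
complete_db = {
    1: "Israel election results",
    2: "world news roundup",
    3: "cooking tips",
}
subsets = replicate(complete_db, replication_table)
```

The resulting `replication_table` has the same term-to-database shape that a routing table needs, which matches the statement that the two tables may be substantially identical.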
  • Fig. 6 shows the replication module replicating portions of the default/complete database to search-term-specific databases. Also shown is a server-side routing module routing client requests including search term sets found in the routing table to either a server with a search-term-specific database or to a predictive target, depending upon which is indicated in the routing table.
  • Turning now to Fig. 8, there is shown a functional block diagram including a client-side routing module and client-side (floating) reporting agent, facilitating client search query routing to a specific search-term-specific server cluster and/or to a specific predictive target. Operation of the client-side routing module may be understood in view of Fig.
  • the routing module may intercept search queries input into a search engine interface (Step 9000) and may compare the terms of those queries against entries in a routing table (Step 9100). If the terms do not match any entries (Step 9200C), the routing module may route the query directly to the default search engine server (Step 9300C). However, if the terms match an entry associated with a search-term-specific database (Step 9200A), the routing module may forward or route the query to a server associated with the search-term-specific database (Step 9300A). If the search terms match an entry associated with a predictive target (Step 9200B), the routing module may send a data request (i.e. direct or redirect the client device) to the predictive target (Step 9300B).
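The routing decision of Steps 9000 to 9300 can be illustrated with a minimal Python sketch. It is only one possible reading of the flowchart: the names `RouteEntry`, `route_query` and `DEFAULT_SERVER`, the example addresses, and the rule of preferring the largest matching term set are assumptions made for illustration and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Tuple

# Hypothetical designator of the default (complete-database) search server.
DEFAULT_SERVER = "search.example.com"


@dataclass(frozen=True)
class RouteEntry:
    kind: str    # "specific_db" or "predictive_target" (assumed labels)
    target: str  # server designator or predicted target address


def route_query(terms: Iterable[str],
                routing_table: Dict[FrozenSet[str], RouteEntry]) -> Tuple[str, str]:
    """Return (action, destination) for one query, following Steps 9000-9300."""
    query_terms = frozenset(t.lower() for t in terms)          # Step 9000
    # Step 9100: collect table entries whose term set is contained in the query.
    matches = [(key, entry) for key, entry in routing_table.items()
               if key <= query_terms]
    if not matches:                                            # Step 9200C
        return ("forward", DEFAULT_SERVER)                     # Step 9300C
    # Assumption: when several entries match, prefer the largest term set.
    _, entry = max(matches, key=lambda match: len(match[0]))
    if entry.kind == "specific_db":                            # Step 9200A
        return ("forward", entry.target)                       # Step 9300A
    return ("redirect", entry.target)                          # Steps 9200B, 9300B
```

With a table holding one entry for {"hotel"} and one for {"israel", "english", "news"}, a query containing "hotel" would be forwarded to the specialized server, the three-term news query would be redirected to its predicted target, and any other query would fall through to the default server.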

Abstract

Disclosed is a method, application and system for processing computerized search requests or queries. According to some embodiments of the present invention, there may be provided a routing module adapted to route a search query to either a first server functionally associated with a first search-term-specific-database, when the search query includes one or more terms associated with the first search-term-specific-database, or to a predictive target, when the search query includes one or more terms associated with the predictive target. According to further embodiments of the present invention, there is provided a search-term-specific data replication module, which may be adapted to replicate search-term-specific portions of a search engine's complete searchable database to a search-term-specific-database. A predictive search target module may be adapted to generate and/or update a table of predictive targets and associated search terms.

Description

A METHOD, APPLICATION AND SYSTEM FOR PROCESSING COMPUTERIZED SEARCH QUERIES
FIELD OF THE INVENTION
[001] The present invention relates generally to the field of data processing. More specifically, the present invention relates to a method, application and system for processing computerized search queries.
BACKGROUND
[002] Search engines generally perform the following operations: Web crawling, Indexing and Searching. Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated Web browser which follows every link it sees. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. This cached page always holds the actual search text since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. This problem might be considered to be a mild form of linkrot, and Google's handling of it increases usability by satisfying user expectations that the search terms will be on the returned webpage. This satisfies the principle of least astonishment since the user normally expects the search terms to be on the returned pages. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that may no longer be available elsewhere.
[003] When a user enters a query into a search engine (typically by using key words), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Some search engines provide an advanced feature called proximity search which allows users to define the distance between keywords.
[004] The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of webpages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve.
[005] Large-scale web servers, such as those used by major search engines and e-commerce sites, are actually implemented as a cluster of servers. A central problem with this architecture is one of load balancing: incoming requests need to be routed to a lightly loaded server, and in particular, it is highly undesirable that some servers remain idle while requests are waiting to be served.
[006] Load balancing can be done at several levels. The top level is in the Domain Name System ("DNS") network: DNS servers can be set up so that successive address resolution requests are directed to distinct IP addresses. The basic scheme only allows simple distribution schemes like round-robin, but more sophisticated schemes with full Perl support have also been devised. Another option is to use a special hardware router that distributes flows, or to use URL re-writing in the web server. As an example, the Google cluster (ca. 2002) used both DNS to distribute requests among multiple geographically spread clusters, and hardware routers for load balancing within each such cluster.
[007] As distinct web requests are usually completely independent, serving them on different cluster nodes is relatively simple. However, two potential bottlenecks remain. One is the entry point to the cluster, where the requests are allocated to the different nodes. The other is the backend database, which must remain consistent.
[008] There is a need in the field of query processing and data management for improved methods and systems for query processing.
SUMMARY OF THE INVENTION
[009] The present invention is a method, application and system for computerized search queries. According to some embodiments of the present invention, a search engine (e.g. Google, AOL, AltaVista, etc.) may include two or more servers or two or more server clusters, each of which servers/clusters may be associated with a different Internet Protocol ("IP") Address. Alternatively, each of the two or more servers/clusters may be associated with a different port number, or another unique identifier, within the same IP address. Each of the servers may be functionally associated with a database server containing some or all of the data stored and provided by the search engine (e.g. excerpts and links to websites and webpages available on the internet), and at least one of the servers may be functionally associated with a database storing a predominantly (e.g. only) search-term-specific-subset of all the data stored by the search engine, which database may be referred to as a search-term-specific-database. A search-term-specific-subset of data may include search engine data, searchable and/or otherwise, associated with one or a set of specific search terms. One or more servers/clusters of a search engine may be functionally associated with a database including all the data stored and provided by the search engine. It should be understood that the term server may also apply to server clusters.
[0010] According to some embodiments of the present invention, a search engine may include one or more search-term-specific database replication modules residing on one or more servers, which replication module(s) may replicate to a first search-term-specific-database, continuously or intermittently, a first search-term-specific-subset of the search engine's total searchable data. According to further embodiments of the present invention, the same or a second search-term-specific database replication module may replicate to a second search-term-specific-database, continuously or intermittently, a second search-term-specific-subset of the search engine's total searchable data, which second search-term-specific-subset may be associated with at least one different search term than the first search-term-specific-subset. According to some embodiments of the present invention, a single search-term-specific database replication module may be used to populate multiple search-term-specific-databases by replicating to each search-term-specific-database, in parallel or in series, data from the complete search engine database relevant to each respective search-term-specific-database. A table including entries identifying which search-term-specific-database is associated with which search terms may be used and/or updated by the one or more replication modules.
[0011] Integral or otherwise functionally associated with the one or more search engine servers may be a predictive search target module, which predictive module may track and compile statistical information correlating terms in received search queries with client device search targets. For example, the predictive module may track and record what percentage of search queries including the terms "Israel & English & News" results in the search client device requesting data from the website www.jpost.com. The client's target data request information may be obtained from the client device via a floating agent application sent to the client along with the search engine interface. The predictive module may maintain one or more predictive target data tables with records of some or substantially all combinations of search terms which correlate with a high probability (e.g. greater than 50%) that the client device issuing a search request with those specific search terms will request data from a specific target. One or more records may also include an identifier of the specific target to which the combination of terms relates. It should be understood that any method of obtaining, tracking and recording the above mentioned statistical data, known today or to be devised in the future, may be used as part of the present invention.
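As an illustration of the predictive target data table described in paragraph [0011], the following Python sketch tabulates reported (search terms, requested target) pairs and keeps only the term sets whose most frequent target exceeds a probability threshold. The function name `build_predictive_table`, its parameters, and the assumption that the reporting agent delivers one (terms, target) pair per observed request are hypothetical; the specification leaves the tracking method open.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple


def build_predictive_table(
        observations: Iterable[Tuple[Iterable[str], str]],
        min_probability: float = 0.5,
        min_queries: int = 1) -> Dict[FrozenSet[str], Tuple[str, float]]:
    """Map each term set to (most likely target, observed probability).

    Only term sets whose top target exceeds `min_probability` are kept,
    mirroring the "greater than 50%" example in the text.
    """
    targets_by_terms = defaultdict(Counter)
    for terms, target in observations:
        key = frozenset(t.lower() for t in terms)
        targets_by_terms[key][target] += 1

    table = {}
    for term_set, counts in targets_by_terms.items():
        total = sum(counts.values())
        target, hits = counts.most_common(1)[0]
        probability = hits / total
        # Assumed safeguard: ignore term sets seen fewer than min_queries times.
        if total >= min_queries and probability > min_probability:
            table[term_set] = (target, probability)
    return table
```

In practice a larger `min_queries` value would probably be needed so that a term set seen only a handful of times does not produce a spurious high-probability record.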
[0012] According to further embodiments of the present invention, there may be provided a search query routing module, which routing module may direct a search query from a search engine client application to a specific server functionally associated with a given search-term-specific-database when the search query includes a term associated with the given search-term-specific-database. The search query routing module may include or be otherwise functionally associated with a routing table, which routing table may define correlations between at least one specific search term and the IP address, or other server designator, of a server functionally associated with a search-term-specific-database populated with and adapted to provide data associated with the at least one specific search term. The routing table may be based upon the table used and/or updated by the one or more replication modules.
[0013] The routing module may compare one or more terms within a given search query against entries in the routing table, and in the event the routing table includes an entry corresponding to one or more of the search query terms, the routing module may route the search query to the server designated by the corresponding entry. According to some embodiments of the present invention, in the event that the routing module does not identify an entry in the routing table corresponding to one or more of the terms in the search query, the routing module may route the search query to a (default search engine) server functionally associated with a database populated with the complete data of the search engine (e.g. the default search engine IP address).
[0014] The routing table may be integral with the routing module or may reside on one of the servers associated with the search engine. In the event the routing table resides on the client, the routing module, or an associated application or process, may update the routing table intermittently. Updating of the routing table may be based on updates of the table used and/or updated by the one or more replication modules. According to some embodiments of the present invention, the routing table and the replication module table may be substantially identical or the same table.
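The relationship between the replication table and the search-term-specific databases (paragraphs [0010] and [0014]) can be sketched as follows. This is a simplified illustration only: it works on single terms rather than term sets, models the complete database as a dictionary from URL to page text, and assigns terms to servers round-robin. The function names and the 0.1% default threshold are assumptions chosen for the example.

```python
from collections import Counter
from itertools import cycle
from typing import Dict, Iterable, List, Set


def frequent_terms(queries: Iterable[Iterable[str]],
                   threshold: float = 0.001) -> Set[str]:
    """Return terms present in more than `threshold` of all queries."""
    counts = Counter()
    total = 0
    for query in queries:
        total += 1
        # Count each term at most once per query.
        counts.update({term.lower() for term in query})
    return {term for term, n in counts.items() if n / total > threshold}


def replicate(complete_db: Dict[str, str], term: str) -> Dict[str, str]:
    """Copy the records whose text contains `term` into a term-specific subset."""
    return {url: text for url, text in complete_db.items()
            if term in text.lower().split()}


def build_replication_table(terms: Iterable[str],
                            servers: List[str]) -> Dict[str, str]:
    """Assign each term to a server designator (round-robin, an assumption)."""
    return dict(zip(sorted(terms), cycle(servers)))
```

Because the resulting table maps each term to a server designator, the same structure could serve as the routing table, which matches the remark that the two tables may be substantially identical.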
[0015] According to further embodiments of the present invention, the routing table may include one or more entries defining correlations between one or a set of specific search terms and the IP address, or other designator, of a predicted search target (website, content, etc.). When one or a combination of the search terms in a given search query are identified by the routing module as having a high probabilistic correlation with a given search target, that is, some high percentage of search queries including this given set of search terms (e.g. above 50%, 60%, 70%, 80%, etc.) result in the client device requesting data from the given search target (e.g. 80% of search queries with the terms "Israel & English & News" end up with the client device sending a request for data to www.jpost.com), the routing module may return to the client device the server identifier or other designator of the predicted target. Routing table entries correlating specific sets of search terms with specific predicted search targets may be provided and intermittently updated by one or more functional modules on one or more of the servers associated with the search engine. The predicted search target entries in the routing table may be updated based on data from one or more records of a predictive target data table, or the like, as described above. According to further embodiments of the present invention, the routing module may direct or redirect a web-browser on the client device to the predicted target. It should be understood that any method of directing or redirecting, known today or to be devised in the future, may be used as part of the present invention. Likewise, it should be understood that any method or technique of determining the probabilistic correlation between any set of search terms and a predicted search target, known today or to be devised in the future, may be applicable to the present invention.
[0016] The routing module may be implemented at any node along a path between a search engine client and a search engine server. According to some embodiments of the present invention, the routing module may be an application or an applet transmitted to and/or otherwise running on a client device. For example, the routing module may be transmitted to, and may run within, a web browser on a client device when the browser requests and receives a search engine's landing/home page. According to further embodiments of the present invention, the routing module may be part of a browser plug-in. According to yet further embodiments, the routing module may be part of a local DNS plug-in application.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
[0018] Fig. 1 shows a functional block diagram of a three layer web-server architecture to which some embodiments of the present invention are applicable;
[0019] Figs. 2A & 2B show two graphs of a heavy tail of a search term distribution determined in accordance with some embodiments of the present invention: (A) a Zipf count-rank plot, and (B) an LLCD plot;
[0020] Fig. 3 shows a mass-count plot of a distribution of search terms determined in accordance with some embodiments of the present invention;
[0021] Fig. 4 shows a table with examples of the four possible classes of query terms according to some embodiments of the present invention;
[0022] Fig. 5 shows a graph showing query coverage achieved by increasing numbers of query words according to some embodiments of the present invention;
[0023] Fig. 6 shows a functional block diagram including a server-side routing module, a search-term-specific replication module, a predictive module, and a routing module server/updater according to some embodiments of the present invention;
[0024] Fig. 7A shows a flowchart including the steps of an exemplary method of generating a predictive table according to some embodiments of the present invention;
[0025] Fig. 7B shows a flowchart including the steps of an exemplary method of generating a search-term-specific replication table according to some embodiments of the present invention;
[0026] Fig. 8 shows a functional block diagram including a client-side routing module and client-side (floating) reporting agent, facilitating client search query routing to a specific search-term-specific server cluster and/or to a specific predictive target; and
[0027] Fig. 9 shows a flowchart including the steps of an exemplary method of routing search queries according to some embodiments of the present invention.
[0028] It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
DETAILED DESCRIPTION
[0029] In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
[0030] Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as "processing", "computing", "calculating", "determining", or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices.
[0031] Embodiments of the present invention may include apparatuses for performing the operations herein. Such an apparatus may be specially constructed for the desired purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions, and capable of being coupled to a computer system bus.
[0032] The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the desired method. The desired structure for a variety of these systems will appear from the description below. In addition, embodiments of the present invention are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the inventions as described herein.
[0033] Terms in this application relating to distributed data networking, such as send or receive, may be interpreted in reference to Internet protocol suite, which is a set of communications protocols that implement the protocol stack on which the Internet and most commercial networks run. It has also been referred to as the TCP/IP protocol suite, which is named after two of the most important protocols in it: the Transmission Control Protocol (TCP) and the Internet Protocol (IP), which were also the first two networking protocols defined. Today's IP networking represents a synthesis of two developments that began in the 1970s, namely LANs (Local Area Networks) and the Internet, both of which have revolutionized computing.
[0034] The Internet Protocol suite — like many protocol suites — can be viewed as a set of layers. Each layer solves a set of problems involving the transmission of data, and provides a well-defined service to the upper layer protocols based on using services from some lower layers. Upper layers are logically closer to the user and deal with more abstract data, relying on lower layer protocols to translate data into forms that can eventually be physically transmitted. The TCP/IP reference model consists of four layers.
Layers in the Internet Protocol suite
[0035] The IP suite uses encapsulation to provide abstraction of protocols and services.
Generally a protocol at a higher level uses a protocol at a lower level to help accomplish its aims. The Internet protocol stack has never been altered, by the IETF, from the four layers defined in RFC 1122. The IETF makes no effort to follow the seven-layer OSI model and does not refer to it in standards-track protocol specifications and other architectural documents.
4. Application: DNS, TFTP, TLS/SSL, FTP, Gopher, HTTP, IMAP, IRC, NNTP, POP3, SIP, SMTP, SNMP, SSH, TELNET, ECHO, RTP, PNRP, rlogin, ENRP
[Figure imgf000013_0001: remaining rows of the layer table]
[0036] Some textbooks have attempted to map the Internet Protocol suite model onto the seven layer OSI Model. The mapping often splits the Internet Protocol suite's Network access layer into a Data link layer on top of a Physical layer, and the Internet layer is mapped to the OSI's Network layer. These textbooks are secondary sources that contravene the intent of RFC1122 and other IETF primary sources. The IETF has repeatedly stated that Internet protocol and architecture development is not intended to be OSI-compliant.
[0037] RFC3439, on Internet architecture, contains a section entitled: "Layering Considered Harmful": Emphasizing layering as the key driver of architecture is not a feature of the TCP/IP model, but rather of OSI. Much confusion comes from attempts to force OSI-like layering onto an architecture that minimizes their use.
[0038] Today, most commercial operating systems include and install the TCP/IP stack by default. For most users, there is no need to look for implementations. TCP/IP is included in all commercial Unix systems, Mac OS X, and all free-software Unix-like systems such as Linux distributions and BSD systems, as well as Microsoft Windows.
[0039] Unique implementations include Lightweight TCP/IP, an open source stack designed for embedded systems, and KA9Q NOS, a stack and associated protocols for amateur packet radio systems and personal computers connected via serial lines.
[0040] According to some embodiments of the present invention, mobile devices may connect with and access data from an enterprise data system over a communication network, at least some portion of which may be a wireless network. While the term wireless network may technically be used to refer to any type of network that is wireless, the term is most commonly used to refer to a telecommunications network whose interconnections between nodes are implemented without the use of wires, such as a computer network (which is a type of communications network). Wireless telecommunications networks are generally implemented with some type of remote information transmission system that uses electromagnetic waves, such as radio waves, for the carrier, and this implementation usually takes place at the physical level or "layer" of the network. (For example, see the Physical Layer of the OSI Model). Various wireless technologies and standards exist, including:
1. Global System for Mobile Communications (GSM): The GSM network is divided into three major systems which are the switching system, the base station system, and the operation and support system (Global System for Mobile Communication (GSM)). The cell phone connects to the base system station which then connects to the operation and support station; it then connects to the switching station where the call is transferred where it needs to go (Global System for Mobile Communication (GSM)). This is used for cellular phones, is the most common standard and is used for a majority of cellular providers.
2. Personal Communications Service (PCS): PCS is a radio band that can be used by mobile phones in North America. Sprint happened to be the first service to set up a PCS.
3. D-AMPS: D-AMPS, which stands for Digital Advanced Mobile Phone Service, is an upgraded version of AMPS but it is being phased out due to advancement in technology. The newer GSM networks are replacing the older system.
4. Wireless MAN - metropolitan area network.
5. Wireless LAN - local area networks.
6. Wireless PAN - personal area networks.
7. GSM - Global standard for digital mobile communication, common in most countries except South Korea and Japan.
8. PCS - Personal communication system - not a single standard, this covers both CDMA and GSM networks operating at 1900 MHz in North America.
9. Mobitex - pager-based network in the USA and Canada, built by Ericsson, now used by PDAs such as the Palm VII and Research in Motion BlackBerry.
10. GPRS - General Packet Radio Service, upgraded packet-based service within the GSM framework, gives higher data rates and always-on service.
11. UMTS - Universal Mobile Telephone Service (3rd generation cell phone network), based on the W-CDMA radio access network.
12. AX.25 - amateur packet radio.
13. NMT - Nordic Mobile Telephony, analog system originally developed by PTTs in the Nordic countries.
14. AMPS - Advanced Mobile Phone System introduced in the Americas in about 1984.
15. D-AMPS - Digital AMPS, also known as TDMA.
16. Wi-Fi - Wireless Fidelity, widely used for Wireless LAN, and based on IEEE 802.11 standards.
17. WiMAX - A solution for BWA (Broadband Wireless Access) that conforms to the IEEE 802.16 standard.
18. Canopy - A wide-area broadband wireless solution from Motorola.
DISCUSSION OF STATISTICAL BASIS FOR AN EXEMPLARY EMBODIMENT OF THE PRESENT INVENTION AND A PARTIAL DESCRIPTION OF THE EXEMPLARY EMBODIMENT:
[0041] The following discussion and findings are based on an analysis of large-scale query logs from the AOL web search engine including more than 20 million queries from more than 650 thousand users. The sample size used is considered sufficiently large that it may be expected that similar results would be obtained for other search engines and for e-commerce sites. Based on this analysis, it was found that the popularity of different search terms follows the Zipf distribution: when query words are ranked according to their popularity, the number of times a word is seen is inversely proportional to its rank. This finding is visualized in the graphs of Figs. 2A & 2B, where both ranks and usage counts are shown in logarithmic scale. Also shown is an LLCD (log-log complementary distribution) plot of the tail of the distribution. In this plot, the straight line indicates a power-law (heavy) tail.
[0042] It was found that the distribution of query terms exhibits mass-count disparity: most of the distinct terms are seldom used, but most of the words seen are actually repeated instances of a small part of the full set of terms. In the following discussion of the statistical basis of some embodiments of the present invention, we will use "word" to denote a distinct query term, and "instance" to denote the appearance of a word in a query. The mass-count distribution of words is visualized in Fig. 3, where the top graph shows the distribution of how many times each word is used. More than 70% of the distinct words are only used once, and only 4% are used 10 times or more. But if we look at all the instances of these words (including repetitions), as shown by the bottom graph, the heavily-used words dominate. For example, those 70% of the words that only appear once account for only 6% of the total instances. The joint ratio metric indicated in Fig. 3 is a generalization of the proverbial 20/80 rule. In this case, we found that 9% of the words (and specifically, the more heavily used ones) account for 91% of the instances, and vice versa: 91% of the words (those that are used less often) account for only 9% of the instances.
[0043] The N1/2 metric shows the effect of the most highly used words. In this case, it shows that a full half of the instances are actually repetitions of only 0.05% of the words (not 5 percent, but 5/10,000).
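The two mass-count metrics discussed in paragraphs [0042] and [0043] can be computed from a list of per-word usage counts, as in the Python sketch below. It assumes one common reading of the metrics: the joint ratio is the point where the fraction of the most used words and the fraction of instances they account for sum to one, and N1/2 is the smallest fraction of the most used words that covers at least half of all instances. The function names are hypothetical and the sketch is not taken from the analysis described above.

```python
from typing import List, Tuple


def n_half(counts: List[int]) -> float:
    """Fraction of the most used words that accounts for half of all instances."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    running = 0
    for k, count in enumerate(ordered, start=1):
        running += count
        if 2 * running >= total:
            return k / len(ordered)
    raise ValueError("counts must be a non-empty list of usage counts")


def joint_ratio(counts: List[int]) -> Tuple[float, float]:
    """Return (word fraction, instance fraction) closest to summing to 1."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    best_gap, best_pair = float("inf"), (0.0, 0.0)
    running = 0
    for k, count in enumerate(ordered, start=1):
        running += count
        word_fraction = k / len(ordered)
        instance_fraction = running / total
        gap = abs(word_fraction + instance_fraction - 1.0)
        if gap < best_gap:
            best_gap, best_pair = gap, (word_fraction, instance_fraction)
    return best_pair
```

For a toy vocabulary of ten words in which one word is used 81 times and the other nine once each, the joint ratio is 10/90: the single most used word (10% of the words) accounts for 90% of the instances.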
[0044] According to this exemplary embodiment of the present invention, directed client-side or server-side routing preprocessing is performed based on the use of those query words that are highly popular. It was determined that performing some preliminary analysis, for example within the client's web browser, before submitting the query to the server may alleviate some default server load. It was found in simulations that an analysis (i.e. preprocessing) of search terms using a table lookup with a table containing a few hundred of the most popular query terms can be effective to that end. The lookup may map each such term, or set of terms, to a server that can be specialized to handle queries on this topic, thereby reducing the load on the default gateway/server, and allowing for partitioning of the backend database.
[0045] It was, however, also found that not all popular query words may be used in this way. As part of, and resulting from, our simulations in this exemplary embodiment, query words were classified into four distinct classes:
[0046] English — words that are simply very common in spoken language, but do not really convey any semantics regarding the query.
[0047] Topical — words that serve to define the topic of the query, mainly nouns and proper names.
[0048] Navigation — a special case of topical words, where the intent is to find a specific web site quickly.
[0049] Adjectives — words that modify a query more than defining it.
[0050] Examples of these four classes, from the AOL dataset, are given in the table of Fig. 4.
[0051] It was determined that the most highly popular terms (e.g. most of the top ten) are actually simple English words, and some adjectives are also quite popular. But the majority of highly popular terms are indeed topical.
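Since only topical and navigational words are useful for routing, a lookup table could be built by walking the popularity ranking and skipping words of the other two classes. The Python sketch below illustrates this under stated assumptions: the class labels, the function name `select_routing_terms`, and the premise that a word-to-class mapping is already available (for example from manual labeling) are all hypothetical, since the text does not say how the classification was performed.

```python
from typing import Dict, List, Tuple

# Hypothetical labels for the four classes described in the text.
ENGLISH, TOPICAL, NAVIGATION, ADJECTIVE = (
    "english", "topical", "navigation", "adjective")


def select_routing_terms(popular_terms: List[Tuple[str, int]],
                         term_class: Dict[str, str],
                         table_size: int) -> Dict[str, str]:
    """Pick the most popular topical or navigational words for the lookup table.

    `popular_terms` holds (word, usage count) pairs. Words that are unlabeled,
    common English words, or adjectives are skipped.
    """
    selected: Dict[str, str] = {}
    ranked = sorted(popular_terms, key=lambda pair: pair[1], reverse=True)
    for word, _count in ranked:
        if len(selected) >= table_size:
            break
        word_class = term_class.get(word)
        if word_class in (TOPICAL, NAVIGATION):
            selected[word] = word_class
    return selected
```

Navigational entries selected this way would be candidates for direct redirection to a predictive target, while topical entries would be candidates for routing to a specialized server.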
[0052] On the bottom of the graph of Fig. 5 are navigational queries labeled "direct". These are queries where the preprocessing can actually redirect the user directly to the desired site (i.e. predictive target), without submitting the query to the server at all. For example, this would be the desired behavior if the user presses the "I'm feeling lucky" button on Google's search page. Based on the AOL data, we estimate that with about 1000 terms, of which less than 100 are navigational, about 4.5% of the total queries can be handled in this way. This is the fraction of queries that included only a single term, which we had classified as navigational.
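The "direct" case described above, in which a query consisting of a single navigational term is resolved without contacting the server, may be sketched as follows. The terms and target addresses are hypothetical placeholders, as is the function name direct_target.

```python
# Hypothetical navigational terms and the sites they are assumed to denote.
NAVIGATIONAL_TARGETS = {
    "example": "http://www.example.com/",
    "exampleshop": "http://shop.example.com/",
}


def direct_target(query):
    """Return a target URL for a single-term navigational query, else None."""
    terms = query.lower().split()
    if len(terms) != 1:
        # Only queries made of exactly one term are handled directly.
        return None
    return NAVIGATIONAL_TARGETS.get(terms[0])
```

A return value of None in this sketch indicates that the query should be submitted to a server in the usual manner.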
[0053] In the middle of the graph of Fig. 5 are queries that can be routed to specialized servers labeled "classified". It was found that a mere 200 query terms are sufficient to cover about 20% of all the queries; 900 terms cover nearly 40%. The meaning of "cover" here is that this fraction of the queries includes at least one highly popular topical word. Routing such queries directly to specialized servers may achieve two benefits: First, it may reduce the load on the main/default server's entry point; Second, it may allow for better service of these queries. For example, many topical words are actually place names. These can be routed to servers that are primed with additional data to better serve localized queries. Another example is sets of topical terms that are all related to the same theme (e.g. beach, airlines, weather, hotel, hotels, inn, Disney, golf, travel, airport, and vacation). These can be directed to a specialized server that is based on domain knowledge relating to this specific theme. The remaining queries may simply be sent to the main/default server, just like all queries are sent today.
[0054] Our simulations indicate that by using a lookup table of fewer than 1000 words, the number of queries reaching the main/default server may be reduced by one third.
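The notion of coverage used above, namely the fraction of queries that include at least one term from a given table, may be computed as in the following sketch. The sample query log is invented for the example and does not reproduce the AOL data.

```python
def coverage(queries, popular_terms):
    """Fraction of queries containing at least one term from popular_terms."""
    if not queries:
        return 0.0
    covered = sum(
        1 for q in queries if popular_terms & set(q.lower().split())
    )
    return covered / len(queries)


# Invented sample log: two of the four queries contain a tabulated term.
sample_log = ["hotel paris", "cheap flights", "weather today", "python tutorial"]
sample_fraction = coverage(sample_log, {"hotel", "weather"})
```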
[0055] The following publications are hereby incorporated by reference in their entirety:
[0056] "Stanford::DNSserver - a DNS name server framework for Perl". URL http://www.stanford.edu/~riepel/lbnamed/Stanford-DNSserver/DNSserver.html.
[0057] L. A. Barroso, J. Dean, and U. Hölzle, "Web search for a planet: the Google cluster architecture". IEEE Micro 23(2), pp. 22-28, Mar-Apr 2003.
[0058] R. S. Engelschall, "Apache 1.3 URL rewriting guide". URL http://httpd.apache.org/docs/1.3/misc/rewriteguide.html, Dec 1997.
[0059] D. G. Feitelson, "Metrics for mass-count disparity". In Modeling, Anal. & Simulation of Comput. & Telecomm. Syst., pp. 61-68, Sep 2006.
[0060] B. J. Jansen, A. Spink, and J. Pedersen, "A temporal comparison of AltaVista web searching". J. Am. Soc. Inf. Sci. 56(6), pp. 559-570, 2005.
[0061] G. Pass, A. Chowdhury, and C. Torgeson, "A picture of search". In 1st Intl. Conf. Scalable Information Syst., Jun 2006.
[0062] G. K. Zipf, Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
[0063] The present invention is a method, application and system for computerized search queries. According to some embodiments of the present invention, a search engine (e.g. Google, AOL, AltaVista, etc.) may include two or more servers or two or more server clusters, each of which servers/clusters may be associated with a different Internet Protocol ("IP") Address. Alternatively, each of the two or more servers/clusters may be associated with a different port number, or another unique identifier, within the same IP address. Each of the servers may be functionally associated with a database server containing some or all of the data stored and provided by the search engine (e.g. excerpts and links to websites and webpages available on the internet), and at least one of the servers may be functionally associated with a database storing a predominantly (e.g. only) search-term-specific-subset of all the data stored by the search engine, which database may be referred to as a search-term-specific-database. A search-term-specific-subset of data may include search engine data searchable and/or otherwise associated with one or a set of specific search terms. One or more servers/clusters of a search engine may be functionally associated with a database including all the data stored and provided by the search engine. It should be understood that the term server may also apply to server clusters.
[0064] According to some embodiments of the present invention, a search engine may include one or more search-term-specific database replication modules residing on one or more servers, which replication module(s) may replicate to a first search-term-specific-database, continuously or intermittently, a first search-term-specific-subset of the search engine's total searchable data.
According to further embodiments of the present invention, the same or a second search-term-specific database replication module may replicate to a second search-term-specific-database, continuously or intermittently, a second search-term-specific-subset of the search engine's total searchable data, which second search-term-specific-subset may be associated with at least one different search term than the first search-term-specific-subset. According to some embodiments of the present invention, a single search-term-specific database replication module may be used to populate multiple search-term-specific-databases by replicating to each search-term-specific-database, in parallel or in series, data from the complete search engine database relevant to each respective search-term-specific-database. A table including entries identifying which search-term-specific-database is associated with which search terms may be used and/or updated by the one or more replication modules.
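A minimal sketch of such replication, assuming the complete database is held as a list of records and the replication table maps each search-term-specific-database to its set of terms, is given below. The record layout, the database names, and the function name replicate are assumptions of the example; an actual replication module may operate on any storage technology.

```python
def replicate(complete_db, replication_table):
    """Copy to each search-term-specific database the records matching its terms.

    complete_db:       list of dicts with "url" and "text" keys (assumed layout)
    replication_table: dict mapping a database name to a set of search terms
    """
    specific_dbs = {name: [] for name in replication_table}
    for record in complete_db:
        words = set(record["text"].lower().split())
        for name, terms in replication_table.items():
            # A record is replicated when it shares at least one term.
            if words & terms:
                specific_dbs[name].append(record)
    return specific_dbs
```

In this sketch a record associated with terms of two different tables would be copied to both search-term-specific-databases, and a record matching no table would remain only in the complete database.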
[0065] Integral or otherwise functionally associated with the one or more search engine servers may be a predictive search target module, which predictive module may track and compile statistical information correlating terms in received search queries with client device search targets. For example, the predictive module may track and record what percentage of search queries including the terms "Israel & English & News" results in the search client device requesting data from the website www.jpost.com. The client's target data request information may be obtained from the client device via a floating agent application sent to the client along with the search engine interface. The predictive module may maintain one or more predictive target data tables with records of some or substantially all combinations of search terms which correlate with a high probability (e.g. greater than 50%) that the client device issuing a search request with those specific search terms will request data from a specific target. One or more records may also include an identifier of the specific target to which the combination of terms relates. It should be understood that any method of obtaining, tracking and recording the above mentioned statistical data, known today or to be devised in the future, may be used as part of the present invention.
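One possible way of compiling such statistics is sketched below, assuming that each observation consists of a query and the target subsequently requested by the client device. The class name PredictiveModule, its methods, and the placeholder target other.example.com are hypothetical; the 50% threshold follows the example given above.

```python
from collections import Counter, defaultdict


class PredictiveModule:
    """Tracks which targets follow which sets of search terms (illustrative)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        # Maps a set of terms to a counter of the targets requested afterwards.
        self.counts = defaultdict(Counter)

    def record(self, query, target):
        key = frozenset(query.lower().split())
        self.counts[key][target] += 1

    def predictive_table(self):
        """Return {term set: target} where one target exceeds the threshold."""
        table = {}
        for key, targets in self.counts.items():
            total = sum(targets.values())
            target, hits = targets.most_common(1)[0]
            if hits / total > self.threshold:
                table[key] = target
        return table
```

For instance, if four of five recorded queries with the terms "Israel English News" were followed by a request to www.jpost.com, the resulting table in this sketch would hold one record relating that term set to that target.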
[0066] According to further embodiments of the present invention, there may be provided a search query routing module, which routing module may direct a search query from a search engine client application to a specific server functionally associated with a given search-term-specific-database when the search query includes a term associated with the given search-term-specific-database. The search query routing module may include or be otherwise functionally associated with a routing table, which routing table may define correlations between at least one specific search term and the IP address, or other server designator, of a server functionally associated with a search-term-specific-database populated with and adapted to provide data associated with the at least one specific search term. The routing table may be based upon the table used and/or updated by the one or more replication modules.
[0067] The routing module may compare one or more terms within a given search query against entries in the routing table, and in the event the routing table includes an entry corresponding to one or more of the search query terms, the routing module may route the search query to the server designated by the corresponding entry. According to some embodiments of the present invention, in the event that the routing module does not identify an entry in the routing table corresponding to one or more of the terms in the search query, the routing module may route the search query to a (default search engine) server functionally associated with a database populated with the complete data of the search engine (e.g. the default search engine IP address).
[0068] The routing table may be integral with the routing module or may reside on one of the servers associated with the search engine. In the event the routing table resides on the client, the routing module, or an associated application or process, may update the routing table intermittently.
Updating of the routing table may be based on updates of the table used and/or updated by the one or more replication modules. According to some embodiments of the present invention, the routing table and the replication module table may be substantially identical or the same table.
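The relationship between the replication module table and the routing table may be illustrated by the following sketch, in which a routing table is derived from a replication table and then consulted for each query. The server designators use addresses from the range reserved for documentation, and all names (build_routing_table, route, the database names) are assumptions made for the example.

```python
# Hypothetical designator of the default (complete database) server.
DEFAULT_SERVER = "192.0.2.1:80"


def build_routing_table(replication_table, db_servers):
    """Derive {term: server designator} from the replication table.

    replication_table: dict mapping a database name to a set of search terms
    db_servers:        dict mapping a database name to its server designator
    """
    routing = {}
    for db_name, terms in replication_table.items():
        for term in sorted(terms):
            # Keep the first designator if a term appears in several tables.
            routing.setdefault(term, db_servers[db_name])
    return routing


def route(query, routing_table):
    """Return the designator of the server that should receive the query."""
    for term in query.lower().split():
        if term in routing_table:
            return routing_table[term]
    return DEFAULT_SERVER
```

Rebuilding the routing table whenever the replication table changes is one simple way, among others, of keeping the two tables consistent.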
[0069] According to further embodiments of the present invention, the routing table may include one or more entries defining correlations between one or a set of specific search terms and the IP address, or other designator, of a predicted search target (website, content, etc.). When one or a combination of the search terms in a given search query is identified by the routing module as having a high probabilistic correlation with a given search target, that is, when some high percentage of search queries including this given set of search terms (e.g. above 50%, 60%, 70%, 80%, etc.) result in the client device requesting data from the given search target (e.g. 80% of search queries with the terms "Israel & English & News" end up with the client device sending a request for data to www.jpost.com), the routing module may return to the client device the server identifier or other designator of the predicted target. Routing table entries correlating specific sets of search terms with specific predicted search targets may be provided and intermittently updated by one or more functional modules on one or more of the servers associated with the search engine. The predicted search target entries in the routing table may be updated based on data from one or more records of a predictive target data table, or the like, as described above. According to further embodiments of the present invention, the routing module may direct or redirect a web browser on the client device to the predicted target. It should be understood that any method of directing or redirecting, known today or to be devised in the future, may be used as part of the present invention. Likewise, it should be understood that any method or technique of determining the probabilistic correlation between any set of search terms and a predicted search target, known today or to be devised in the future, may be applicable to the present invention.
[0070] The routing module may be implemented at any node along a path between a search engine client and a search engine server. According to some embodiments of the present invention, the routing module may be an application or an applet transmitted to and/or otherwise running on a client device. For example, the routing module may be transmitted to, and may run within, a web browser on a client device when the browser requests and receives a search engine's landing/home page. According to further embodiments of the present invention, the routing module may be part of a browser plug-in. According to yet further embodiments, the routing module may be part of a local DNS plug-in application.
[0071] Turning now to Fig. 6, there is shown a functional block diagram of a server architecture including a server-side routing module, a search-term-specific replication module, a predictive search target module, and a routing module server/updater according to some embodiments of the present invention. Operation of the exemplary server-side arrangement, as shown in Fig. 6, may be understood better in view of Figs. 7A & 7B, where Fig. 7A shows a flowchart including the steps of an exemplary method of generating a predictive table according to some embodiments of the present invention, and Fig. 7B shows a flowchart including the steps of an exemplary method of generating a search-term-specific replication table according to some embodiments of the present invention.
[0072] The server architecture of Fig. 6 may be part of a search engine cluster, where searchable data is stored on a default/complete database including all the searchable data of the search engine. The web/html servers may respond to an initial client device request for data by providing the client with code including a search engine interface application. According to further embodiments of the present invention, the web/html servers may also provide the client with code including a client-side routing module. According to yet further embodiments of the present invention, the web/html servers may provide the client with code including a reporting agent application. According to an alternate embodiment, as shown in Fig. 6, the servers may include or be otherwise associated with a routing module.
[0073] Search queries received from a client device may be processed by application servers, which application servers may utilize searchable data stored on the search engine database(s) to generate a list of results, possibly including hyperlinks to servers, correlated with search terms of a given search query. The generated list may be provided to the client device, directly or through web/html servers.
[0074] According to some embodiments of the present invention, the predictive search target module may intercept a client device's search queries, from either the web/html or application servers, and may also intercept/receive resulting search target information from associated reporting agent applications running on client devices (Step 7000A). The predictive module may track multiple queries and associated search target information and may determine correlations between one or more sets of search query terms and respective search targets (Step 7100A). The correlations may be stored, and updated as needed, in a predictive table(s) (Step 7200A), and data from records of the predictive table(s) may be used to populate and/or update records in a routing table (Step 7300A) functionally associated with a (server-side or client-side) routing module according to some embodiments of the present invention. A routing module server/updater may be used to populate and/or update records in a routing table (Step 7300A) functionally associated with a (server-side or client-side) routing module based on data from the predictive table(s).
[0075] A search-term-specific database replication module, or some other module functionally associated with the replication module, may intercept and identify frequently used sets of search terms in search queries from client devices (Step 7000B). The sets of terms may be tabulated, statistically ranked, and those sets of terms having a probabilistic rank above some threshold value (e.g. set of terms present in greater than 0.1% of all search queries) may be recorded in a search-term-specific data table (Step 7100B). One or more replication modules may replicate portions of the default/complete database to search-term-specific databases based on records in the replication table (Step 7200B). Data from the replication table(s) may also be used to populate and/or update records in a routing table functionally associated with a (server-side or client-side) routing module (Step 7300B).
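Steps 7000B and 7100B may be sketched as follows. For simplicity the sketch tabulates single terms rather than sets of terms, which is an assumption of the example; the default threshold of 0.1% follows the value mentioned above, and the function name frequent_terms is hypothetical.

```python
from collections import Counter


def frequent_terms(queries, threshold=0.001):
    """Return the terms present in more than `threshold` of all queries."""
    if not queries:
        return set()
    counts = Counter()
    for q in queries:
        # Count each term at most once per query.
        counts.update(set(q.lower().split()))
    total = len(queries)
    return {term for term, c in counts.items() if c / total > threshold}
```

The set returned in this sketch corresponds to the candidate entries of the search-term-specific data table, from which replication and routing records may then be derived.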
[0076] Fig. 6 shows the replication module replicating portions of the default/complete database to search-term-specific databases. Also shown is a server-side routing module routing client requests including search term sets found in the routing table to either a server with a search-term-specific database or to a predictive target, depending upon which is indicated in the routing table.
[0077] Turning now to Fig. 8, there is shown a functional block diagram including a client-side routing module and client-side (floating) reporting agent, facilitating client search query routing to a specific search-term-specific server cluster and/or to a specific predictive target. Operation of the client-side routing module may be understood in view of Fig. 9, which shows a flowchart including the steps of an exemplary method of routing search queries according to some embodiments of the present invention. The routing module may intercept search queries input into a search engine interface (Step 9000) and may compare the terms of those queries against entries in a routing table (Step 9100). If the terms do not match any entries (Step 9200C), the routing module may route the query directly to the default search engine server (Step 9300C). However, if the terms match an entry associated with a search-term-specific database (Step 9200A), the routing module may forward or route the query to a server associated with the search-term-specific database (Step 9300A). If the search terms match an entry associated with a predictive target (Step 9200B), the routing module may send a data request (i.e. direct or redirect the client device) to the predictive target (Step 9300B).
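The decision flow of Fig. 9 may be illustrated by the sketch below, in which each routing table entry is marked as relating either to a search-term-specific database or to a predictive target. The Entry structure, the matching rule (all terms of an entry must appear in the query), and the server names are assumptions of the example rather than features required by the method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    terms: frozenset   # set of search terms that must all appear in the query
    kind: str          # "database" or "target"
    destination: str   # server designator or predicted target


DEFAULT_SERVER = "search.example.com"  # hypothetical default search engine server


def route_query(query, table):
    """Return an (action, destination) pair for the given query."""
    terms = set(query.lower().split())                  # Step 9000
    for entry in table:                                 # Step 9100
        if entry.terms <= terms:
            if entry.kind == "database":                # Step 9200A
                return ("forward", entry.destination)   # Step 9300A
            return ("redirect", entry.destination)      # Steps 9200B, 9300B
    return ("forward", DEFAULT_SERVER)                  # Steps 9200C, 9300C
```

Entries are examined in table order in this sketch, so placing predictive target entries before database entries gives them precedence when both match; other precedence rules are equally possible.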
[0078] While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

Claims

What is claimed:
1. A system for processing search engine queries comprising: a routing module adapted to route a request to a non-default search engine server when the search engine query includes one or more terms associated with an entry in a routing table.
2. The system according to claim 1, wherein the non-default server to which the request is routed is a server associated with a first predictive target and the search query includes one or more terms associated with the first predictive target.
3. The system according to claim 2, wherein the non-default server to which the request is routed is a server associated with a second predictive target and the search query includes one or more terms associated with the second predictive target.
4. The system according to claim 1, wherein the non-default server to which the request is routed is a first server functionally associated with a first search-term-specific-database and the search query includes one or more terms associated with the first search-term-specific-database.
5. The system according to claim 4, wherein the request includes search query terms.
6. The system according to claim 5, wherein said routing module is adapted to route a search query to a second server functionally associated with a second search-term-specific-database when the search query includes one or more terms associated with the second search-term-specific-database.
7. The system according to claim 1, wherein said routing module is functionally associated with a routing table, wherein the routing table includes one or more entries such that each entry correlates one or more search terms with a server identifier.
8. The system according to claim 7, wherein said routing module is adapted to route the search query to a default server of the search engine when the search query does not include one or more terms associated with an entry in the routing table.
9. The system according to claim 1, wherein said routing module runs on a search engine client device.
10. The system according to claim 1, wherein said routing module is uploaded and runs in conjunction with a web-browser.
11. The system according to claim 10, wherein said routing module is an application running within a browser or is part of a web-browser plug-in.
12. The system according to claim 1, wherein said routing module is part of a domain name system ("DNS") local to the client device.
13. The system according to claim 1, wherein said routing module resides on a server of the search engine.
14. A system for providing a search engine comprising: a set of servers and databases; and a first search-term-specific database replication module residing on one or more servers and adapted to replicate to a first search-term-specific-database a first search-term-specific-subset of searchable data.
15. The system according to claim 13, further comprising a search-term-specific replication module table usable to generate or update a routing table for a search query routing module.
16. The system according to claim 14, wherein the first search-term-specific database replication module or a second search-term-specific replication module is adapted to replicate to a second search-term-specific-database a second search-term-specific-subset of searchable data.
17. A system for providing a search engine comprising: a set of servers and databases; and a predictive search target module residing on one or more servers and adapted to populate a predictive search target table.
18. The system according to claim 17, further comprising a routing module updater adapted to update a routing table functionally associated with a routing module based on the predictive search target table.
19. The system according to claim 17, wherein said predictive search target module is adapted to receive target information from a reporting agent.
20. The system according to claim 19, wherein the reporting agent resides on a client device and is adapted to monitor and transmit information correlating search terms input on the device and the search target of the device.
PCT/IL2008/001354 2007-10-09 2008-10-12 A method application and sysyem for processing computerized search queries WO2009047773A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US97858707P 2007-10-09 2007-10-09
US60/978,587 2007-10-09

Publications (3)

Publication Number Publication Date
WO2009047773A2 true WO2009047773A2 (en) 2009-04-16
WO2009047773A8 WO2009047773A8 (en) 2009-07-16
WO2009047773A3 WO2009047773A3 (en) 2010-03-11

Family

ID=40549699

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2008/001354 WO2009047773A2 (en) 2007-10-09 2008-10-12 A method application and sysyem for processing computerized search queries

Country Status (1)

Country Link
WO (1) WO2009047773A2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6175869B1 (en) * 1998-04-08 2001-01-16 Lucent Technologies Inc. Client-side techniques for web server allocation
US20030055818A1 (en) * 2001-05-04 2003-03-20 Yaroslav Faybishenko Method and system of routing messages in a distributed search network
US7251681B1 (en) * 2000-06-16 2007-07-31 Cisco Technology, Inc. Content routing services protocol

Also Published As

Publication number Publication date
WO2009047773A8 (en) 2009-07-16
WO2009047773A3 (en) 2010-03-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08837506

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08837506

Country of ref document: EP

Kind code of ref document: A2