WO2009000173A1

WO2009000173A1 - Searching method, searching system and searching server

Info

Publication number: WO2009000173A1
Application number: PCT/CN2008/070598
Authority: WO
Inventors: Liang Sun
Original assignee: Tencent Technology (Shenzhen) Company Limited
Priority date: 2007-06-26
Filing date: 2008-03-27
Publication date: 2008-12-31
Also published as: CN100462979C; CN101071442A

Abstract

A searching method is provided, which includes: determining the type of a keyword to be searched; when the keyword is a high frequency keyword, each of N searching servers accesses the part of the index table of the high frequency keyword that it stores, where N is an integer greater than 1; when the keyword is a low frequency keyword, one of the N searching servers accesses the complete index table of the low frequency keyword, which it stores; and determining the text that contains the keyword to be searched according to the accessed index table. A searching system and a searching server are also provided. The solution effectively improves search speed.

Description

Search method, retrieval system and retrieval server

TECHNICAL FIELD

The present invention relates to the field of communications technologies, and in particular, to a retrieval method, a retrieval system, and a retrieval server.

BACKGROUND OF THE INVENTION

When performing a search, a user needs to input a search string. Usually, the search string contains one or more keywords. When the keywords are separated by spaces, a space between two keywords indicates that an "and" retrieval operation is performed between them. Each keyword can consist of one or more morphemes. A morpheme is the smallest language unit that can express independent semantics, usually a Chinese word produced by a word segmentation system. The word segmentation system may divide a keyword into different numbers of morphemes. If the keyword is divided into two morphemes, it is a binary compound morpheme; if it is divided into three morphemes, it is a ternary compound morpheme. When searching, the system must find the collection of all documents containing the input search string in a short time and present that collection as a list of document identifiers.

Among the various Internet search engine technologies, back-end retrieval cluster technology is one of the core technologies. It governs how multiple search servers cooperate to provide retrieval services over a larger set of data. The number of documents that a single retrieval server can manage is limited: if too many documents are stored, the system can hardly return the desired results within a time acceptable to the user during normal retrieval operations. Users typically accept a response time of no more than 1 second, so a search cluster consisting of multiple search servers is needed to support search services over a larger data set.

The most important operation in the retrieval process is access to the inverted index. The inverted index is a data structure used to speed up retrieval of the search string. It can exist as a disk file or be loaded into memory, and it consists of at least a dictionary file and an inverted table file. The inverted table file stores a plurality of inverted entries, and each inverted entry records the correspondence between a keyword in the search string and the documents that contain it. Therefore, improving the reading speed of the inverted entries improves retrieval efficiency. The time to read an inverted entry from the inverted table file includes the disk addressing time and the time required to read the data. When the amount of data read is relatively small, the reading time of the inverted entry mainly depends on the disk addressing time. When the amount of data read is relatively large, the reading time mainly depends on the time required to read the data.

An existing distributed index file retrieval model based on document partitioning is shown in FIG. 1. The system includes a retrieval proxy server and a plurality of parallel retrieval servers managed by the retrieval proxy server. Each retrieval server is allocated 1/N of the documents in the full set of documents, where N is the total number of retrieval servers. In the indexing phase, the parallel retrieval servers complete their respective indexing tasks in parallel. In the retrieval phase, the retrieval proxy server sends read requests to all retrieval servers at the same time. After each retrieval server completes its local retrieval, it returns the retrieval results to the retrieval proxy server, and the retrieval proxy server finally aggregates the results of all retrieval servers according to a specific weight-based sorting method. The document-partition-based retrieval system thus has an independent structural design: the degree of coupling between the retrieval servers is small, and each retrieval server is equivalent to a retrieval subsystem that can be loaded independently. However, in Internet search services, most search strings are composed of two or more keywords. After matching the document identifiers for each keyword, the retrieval server needs to perform position offset matching within the documents, which results in multiple I/O accesses to the disk. Moreover, when the search string includes a high frequency morpheme, the document identification lists and position offset lists that need to be read are large. For example, the inverted entries of high frequency morphemes such as "China", "Net", and "We" usually account for a large proportion of the entire inverted index data. This index data cannot be read in a short time, so most of the retrieval time is consumed by file input and output read operations. As a result, the overall concurrency of the retrieval system is degraded, which slows both the retrieval speed and the response to the search string.

An existing distributed index file retrieval model based on index entry partitioning is shown in FIG. 2. The system includes a retrieval proxy server and N groups of parallel retrieval servers managed by the retrieval proxy server, where N is an integer greater than 1, and each group of retrieval servers is allocated 1/N of the documents in the full set of documents. Each group of retrieval servers contains three retrieval servers. Generally, the inverted entries corresponding to different index keywords are stored on different retrieval servers according to the hash value of the keyword taken modulo the number of servers. For example, if hash("China") % 3 = 1, the inverted data block of the index keyword "China" is stored on retrieval server No. 1 of the group. In this way, the inverted entries of all index keywords that would otherwise be stored on a single retrieval server are evenly distributed over 3 retrieval servers, thereby speeding up access to the inverted entries. However, in a retrieval system based on index entry partitioning, when the search string includes two or more keywords, a single retrieval server in a group cannot perform the retrieval independently and must cooperate with the other retrieval servers in the group to complete it. This increases the degree of data coupling between the retrieval servers, which makes data backup more complicated and lowers the retrieval speed. In addition, this cooperative operation must be performed for every retrieval, which increases the amount of communication between the retrieval servers.

SUMMARY OF THE INVENTION

Embodiments of the present invention provide a retrieval method, a retrieval system, and a retrieval server, which can improve the speed of retrieval.

A retrieval method, including:

Determining the type of keyword to be retrieved;

When the keyword is a high frequency keyword, each of n search servers reads the part of the index entries of the high frequency keyword that it stores, where n is an integer greater than 1;

When the keyword is a low frequency keyword, one of the n search servers reads all index entries of the low frequency keyword stored by the search server;

Determining a search result of the keyword to be retrieved according to the read index table item.

A retrieval system comprising:

a cluster proxy server, configured to determine a type of a keyword to be retrieved; when the keyword is a high frequency keyword, send, to each of the n search servers, a command to read the part of the index entries of the high frequency keyword stored by that server; when the keyword is a low frequency keyword, send, to one retrieval server of the n retrieval servers, a command to read all index entries of the low frequency keyword stored by that server, where n is an integer greater than 1; and determine, according to the index table entries read by the retrieval servers, a retrieval result of the keyword to be retrieved;

The search server is configured to: when receiving a command to read a part of the index entries of the high frequency keyword stored by itself, read that part of the index entries of the high frequency keyword; and when receiving a command to read all index entries of the low frequency keyword stored by itself, read all index entries of the low frequency keyword.

A retrieval server, comprising: a read management module, configured to receive at least one of a command to read a part of an index entry of a high frequency keyword stored by itself and a command to read all index entries of a low frequency keyword stored by itself;

a keyword reading module, configured to: when receiving a command to read a part of the index entries of the high frequency keyword stored by itself, read that part of the index entries of the high frequency keyword; and when receiving a command to read all index entries of the low frequency keyword stored by itself, read all index entries of the low frequency keyword.

A cluster proxy server, including:

a first module, configured to determine a type of a keyword to be retrieved;

a second module, configured to: when the keyword is a high frequency keyword, send, to each of the n search servers, a command to read the part of the index entries of the high frequency keyword stored by that server; and when the keyword is a low frequency keyword, send, to one retrieval server of the n retrieval servers, a command to read all index entries of the low frequency keyword stored by that server, where n is an integer greater than 1;

And a third module, configured to determine, according to the index table entry read by the retrieval server, a retrieval result of the keyword to be retrieved.

It can be seen from the above technical solution that, in an embodiment of the present invention, on the one hand, the inverted entries of a high frequency keyword are stored across multiple servers in the cluster, and during retrieval those servers read the inverted entries of the high frequency keyword in parallel. A large number of inverted entries can therefore be read within the system design time, and the subsequent processing is not delayed beyond the time overhead of a single logical operation, which improves the retrieval speed. On the other hand, all the inverted entries of a low frequency keyword are stored by one retrieval server, and during retrieval only that server reads the inverted entries of the low frequency keyword. It is therefore unnecessary to read a small number of inverted entries separately on multiple search servers, which saves the resources of the search servers in the cluster and improves the retrieval speed.

In addition, the embodiment of the present invention can effectively improve the degree of coupling between the retrieval servers in the retrieval cluster and increase the capability for dynamic resource allocation between the servers. By treating the memory resources, disk input and output resources, and CPU (central processing unit) resources of the search servers in the cluster as a whole and planning them in a unified way, the maximum concurrency of the cluster is ensured, thereby further improving the retrieval speed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a distributed index file retrieval model based on document partitioning;

FIG. 2 is a schematic diagram of a distributed index file retrieval model based on index entry partitioning;

FIG. 3 is a flowchart of a retrieval method according to an embodiment of the present invention;

FIG. 4 is a flowchart of a retrieval method according to another embodiment of the present invention;

FIG. 5 is a schematic diagram of searching a specific search string by applying the method of the present invention;

FIG. 6 is a structural diagram of a retrieval system in an embodiment of the present invention;

FIG. 7 is a schematic diagram of a retrieval model of a retrieval system in an embodiment of the present invention;

FIG. 8 is a flowchart of retrieval using the retrieval model in FIG. 7;

FIG. 9 is a structural diagram of a retrieval server in an embodiment of the present invention.

The present invention is further described in detail below with reference to the drawings and embodiments.

The flow of the retrieval method in the embodiment of the present invention is as shown in FIG. 3.

Step 301: Parse the search string to be retrieved to generate a search expression consisting of keywords.

Step 302: Send a read request for the keyword to each search server in the cluster. The read request for the keyword includes a read-ahead request for the inverted entry of the keyword.

Here, the inverted entry of a keyword is an array that records the identifiers of all the documents containing the keyword. For each such document it holds the identifier of the document, the weight of the keyword in the document, and the position offsets of the keyword in the document. The basic structure is as follows:

$$t:\ \langle d_1, w_{d_1,t}, loc_1, loc_2, \ldots, loc_{f_{d_1,t}} \rangle\ \langle d_2, \ldots \rangle\ \ldots\ \langle d_{f_t}, \ldots \rangle$$

where $t$ represents a keyword in the search string, $d_i$ represents the identifier of a document in the series of documents containing the keyword $t$, $w_{d_i,t}$ represents the weight of the keyword $t$ in the document $d_i$, and $loc_j$ represents a position offset at which the keyword $t$ appears in the current document, usually expressed in two bytes. Using the inverted entry, a keyword in the search string can be located quickly. Usually, the inverted index file of each search string is composed of N inverted entries, where N is the number of keywords in the search string.

Step 303: The retrieval server in the cluster reads the inverted entry of the keyword according to the frequency with which the keyword hits documents.

According to the frequency with which they hit documents, the keywords in the search expression can be classified into high frequency keywords, which comprise ultra-high frequency keywords and medium-high frequency keywords, and low frequency keywords.

In the embodiment of the present invention, statistics may be gathered on the index data before retrieval is performed to determine the number of documents hit by each keyword, and the type of the keyword to be retrieved is determined according to a preset document frequency threshold. When the keyword is an ultra-high frequency keyword and/or a medium-high frequency keyword, the inverted entries of the keyword are segmented and stored across the retrieval servers in the cluster, and each retrieval server stores a part of the inverted entries of the keyword. For example, when the cluster includes n retrieval servers, all index entries of the high frequency keyword are divided into n parts, and the m-th retrieval server stores the m-th part of the index entries of the keyword, where n is an integer greater than 1 and m is an integer greater than 1 and less than or equal to n. When the keyword is a low frequency keyword, all the inverted entries of the keyword are stored by one retrieval server in the cluster. For example, the full set of low frequency keywords is divided into n parts, and the m-th retrieval server stores all index entries of the low frequency keywords in the m-th part.

In the retrieval phase, when the keyword is an ultra-high frequency keyword and/or a medium-high frequency keyword, each retrieval server in the cluster reads the inverted entries of the high frequency keyword that it stores. When the keyword is a low frequency keyword, all of the inverted entries of the low frequency keyword are read by the retrieval server that stores them.

When the cluster includes n search servers, where n is an integer greater than 1, the segmentation of the inverted entries of a high frequency keyword works as follows: the document identifier in each inverted entry of the high frequency keyword is taken modulo n, and the inverted entries having the same modulus value are stored as a group on the retrieval server corresponding to that modulus value. In the retrieval phase, the retrieval server corresponding to a modulus value reads the inverted entries having that modulus value. Similarly, the word identifier (word ID) of each low frequency keyword is taken modulo n, and the low frequency keywords having the same modulus value are grouped and stored by one retrieval server.
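The placement rule described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation in the patent: the function name `build_server_stores`, the dictionary-based stores, and the sample word IDs are assumptions introduced here.

```python
def build_server_stores(postings, word_ids, high_frequency, n):
    """Distribute inverted entries over n retrieval servers.

    postings:       keyword -> list of document identifiers (hypothetical layout)
    word_ids:       keyword -> integer word ID (hypothetical values)
    high_frequency: set of keywords already classified as high frequency
    Returns a list of n dicts, one per server, mapping keyword -> document list.
    """
    stores = [dict() for _ in range(n)]
    for keyword, doc_ids in postings.items():
        if keyword in high_frequency:
            # High frequency: each entry goes to the server matching doc_id % n.
            for doc_id in doc_ids:
                stores[doc_id % n].setdefault(keyword, []).append(doc_id)
        else:
            # Low frequency: the whole entry list goes to server word_id % n.
            stores[word_ids[keyword] % n][keyword] = list(doc_ids)
    return stores


postings = {"China": [16, 38, 207], "Xu Jianjun": [38, 295]}
word_ids = {"China": 7, "Xu Jianjun": 5}  # assumed word IDs, for illustration only
stores = build_server_stores(postings, word_ids, {"China"}, 3)
```

With these sample values, the entries of "China" are spread over all three servers, while every entry of "Xu Jianjun" lands on the single server selected by its word ID.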

Further, in the embodiment of the present invention, the retrieval server compresses the eight-byte document identifier in the inverted entry of the keyword into a four-byte document number.

Step 304: The retrieval servers in the cluster perform logical operations on the inverted entries of the keywords and then output the search results.

When the search string currently being retrieved includes both high frequency keywords and low frequency keywords, logical operations are performed on the inverted entries of the different keywords. Specifically, the search server storing the inverted entries of the low frequency keyword takes the document identifier of each inverted entry of the low frequency keyword modulo n, and sends the inverted entries corresponding to each modulus value to the retrieval server corresponding to that modulus value. Each search server in the cluster then performs logical operations on the inverted entries of the high frequency keyword and the low frequency keyword, and the search results of the search string are obtained by aggregating the logical operation results of all search servers.

The logical operation may be any one of an "and" operation, an "or" operation, and a "not" operation, or any combination thereof.

The flow of the retrieval method in another embodiment of the present invention is shown in FIG. 4. Each cluster in this embodiment includes n retrieval servers, where n is an integer greater than 1.

Step 401: Parse the search string to be retrieved to generate a search expression consisting of keywords.

Usually, the search string input by the user may be a short sentence or may include a plurality of keywords. Such a search string is a raw string that has not been formatted for the computer, so it is parsed to generate a search expression that the computer can recognize. The search expression may contain one or more keywords. If the keywords input by the user do not include a separator, a logical relationship between the keywords is established by the parsing process. If the keywords input by the user include separators, then, for example, a space between keywords indicates that an "and" retrieval operation is performed on the preceding and following keywords, a "|" between keywords indicates that an "or" operation is performed on the preceding and following keywords, and a "!" before a keyword indicates a "not" operation on that keyword.
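A minimal sketch of such a parser follows. The source does not state operator precedence or the form of the resulting expression, so two assumptions are made here: "|" binds more tightly than a space, and the expression is represented as a list of "and"-ed clauses, each a list of "or"-ed (keyword, negated) pairs. The function name `parse_search_string` is hypothetical.

```python
def parse_search_string(search_string):
    """Parse a raw search string into a nested-list search expression.

    Assumed grammar: whitespace separates "and"-ed clauses, "|" separates
    "or"-ed alternatives within a clause, and a leading "!" negates a keyword.
    """
    expression = []
    for term in search_string.split():
        clause = []
        for alternative in term.split("|"):
            if alternative in ("", "!"):
                continue  # skip empty fragments such as a trailing "|"
            negated = alternative.startswith("!")
            keyword = alternative[1:] if negated else alternative
            clause.append((keyword, negated))
        if clause:
            expression.append(clause)
    return expression


simple = parse_search_string("China Xu")
mixed = parse_search_string("a|b !c")
```

Here `simple` holds two single-keyword clauses joined by "and", and `mixed` holds the clause (a or b) followed by the negated keyword c.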

In the present embodiment, it is assumed that the search string includes both high frequency keywords and low frequency keywords.

Step 402: Send a read request for the keyword to each search server in the cluster, wherein the read request for the keyword includes a read-ahead request for the inverted entry of the keyword.

Step 403: Determine whether the keyword is a high frequency keyword or a low frequency keyword. If it is a high frequency keyword, perform step 404; if it is a low frequency keyword, perform step 405.

According to the number of inverted entries corresponding to each keyword, that is, the number of documents hit by the keyword, the keywords in the search expression are divided into high frequency keywords and low frequency keywords. In particular, the high frequency keywords can be further divided into medium-high frequency keywords and ultra-high frequency keywords.

In the embodiment of the present invention, statistics may be gathered on the index data before the search is performed to determine the number of documents hit by each keyword, that is, the number of inverted entries corresponding to the keyword, and the type of the keyword to be retrieved is then determined according to a preset document frequency threshold.
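The classification step can be illustrated as follows. This is a sketch under stated assumptions: the function names, the in-memory index layout, and the choice to treat a hit count equal to the threshold as high frequency are all introduced here for illustration; the source only says that a preset threshold is used.

```python
def count_document_hits(index):
    """Count, for each keyword, the number of distinct documents it hits.

    index: keyword -> list of document identifiers (hypothetical layout).
    """
    return {keyword: len(set(doc_ids)) for keyword, doc_ids in index.items()}


def classify_keywords(document_hits, threshold):
    """Label each keyword "high" or "low" against a preset hit-count threshold.

    Treating hits == threshold as "high" is an assumption of this sketch.
    """
    return {
        keyword: "high" if hits >= threshold else "low"
        for keyword, hits in document_hits.items()
    }


hits = count_document_hits({"China": [16, 38, 100], "Xu Jianjun": [38]})
types = classify_keywords(hits, threshold=3)
```

Because the statistics are gathered before retrieval, the lookup at query time reduces to reading a precomputed label for each keyword.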

Step 404: The n retrieval servers in the cluster respectively read a part of the inverted entry of the high frequency keyword, and then perform step 407.

For reading the inverted entries of high frequency keywords, a technique similar to a disk RAID (Redundant Array of Independent Disks) system can be used: the n retrieval servers in the cluster each store a part of the very large inverted entries of a high frequency keyword, and during retrieval the n retrieval servers read them in parallel. The reading of the oversized inverted entries can thus be completed within the system design time, and the subsequent logical operations are not delayed beyond the time overhead of a single logical operation.

Step 405: The retrieval server storing the low frequency keyword to be retrieved in the cluster reads all the inverted entries of the low frequency keyword.

The inverted entries of the low frequency keyword are read by one of the search servers in the cluster, which avoids the existing situation in which a small number of inverted entries are read on each of multiple search servers.

Usually, the data block of the inverted entries of a low frequency keyword is smaller than the minimum read block of the disk, for example 64K, and for any data block smaller than 64K the disk read takes the same amount of time. In the prior art, the inverted entries of a low frequency keyword are divided into n blocks and then read by n servers, which not only fails to increase the reading speed but also wastes the resources of multiple search servers in the cluster. Applying the embodiment of the present invention effectively avoids these problems.

In this embodiment, the retrieval server further compresses the eight-byte document identifier in the inverted entry of the keyword into a four-byte document number when the index is built.

The document identifier in the inverted entry is used to locate the document. For web pages on the Internet, each web page has a unique URL (Uniform Resource Locator). By processing the URL string of a web page with a signature algorithm, a 64-bit (8-byte) globally unique integer corresponding to the URL string is obtained, which serves as the document identifier of the document. However, because of the large number of web pages on the Internet, the storage space occupied by the document identifiers is also large. In this embodiment, the inverted entries of a keyword are stored across n retrieval servers, which is equivalent to sending different documents to different search servers, so each retrieval server receives a certain number of documents. Assuming that this number is N, where N is an integer greater than 0, each search server in this embodiment numbers the documents assigned to it, converting each document identifier into an integer from 0 to N-1 that serves as the document number of the document. In this way, for the same document, the document number is much shorter than the original document identifier, which saves storage space and improves the reading speed.
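The two steps described above, deriving a 64-bit identifier from a URL and renumbering the documents held by one server, might look like the sketch below. The source does not name the signature algorithm, so BLAKE2b with an 8-byte digest is used here purely as a stand-in; the names `document_id` and `DocNumberTable` and the example URLs are also assumptions of this sketch.

```python
import hashlib


def document_id(url):
    """Derive a 64-bit document identifier from a URL string.

    BLAKE2b with an 8-byte digest is an assumed stand-in for the
    unspecified signature algorithm.
    """
    digest = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class DocNumberTable:
    """Per-server mapping between 64-bit identifiers and numbers 0..N-1."""

    def __init__(self):
        self._numbers = {}  # document identifier -> local document number
        self._ids = []      # local document number -> document identifier

    def number_for(self, doc_id):
        # Assign the next free number the first time an identifier is seen.
        if doc_id not in self._numbers:
            self._numbers[doc_id] = len(self._ids)
            self._ids.append(doc_id)
        return self._numbers[doc_id]

    def id_for(self, number):
        return self._ids[number]


table = DocNumberTable()
id_a = document_id("http://example.com/a")
id_b = document_id("http://example.com/b")
numbers = [table.number_for(id_a), table.number_for(id_b), table.number_for(id_a)]
```

The reverse list keeps the mapping invertible, so a search result expressed as document numbers can still be resolved back to the original identifiers.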

Step 406: The document number of each inverted entry of the low frequency keyword is taken modulo n, and the entry is sent to the search server corresponding to the resulting modulus value.

Step 407: The n retrieval servers in the cluster perform logical operations on the inverted entries that have been read.

The logical operations in this step are performed according to the logical relationship between the keywords in the search string to be retrieved, wherein the logical operations include any one or any combination of "and", "or", and "not" operations.

Step 408: The search result of the search string is obtained by aggregating the logical operation results of the n search servers.

In the above embodiment, the search string includes both a high frequency keyword and a low frequency keyword as an example. In practical applications, when the search string includes only high frequency keywords, steps 405-406 need not be performed, and when the search string includes only low frequency keywords, step 404 need not be performed.

The following takes the search string "China Xu Jianjun" as an example. The process of searching for the search string "China Xu Jianjun" is shown in FIG. 5. The cluster includes three search servers: search server 0, search server 1, and search server 2.

First, the search string "China Xu Jianjun" is analyzed to generate a search expression consisting of the keywords "China" and "Xu Jianjun".

Secondly, the three search servers in the cluster determine the type of the keyword to be retrieved according to the number of hits of the keyword to be retrieved, and read the inverted entry of the keyword according to the type of the keyword.

Here, "China" is a high frequency keyword that appears very frequently in documents, while "Xu Jianjun" is a specific person's name which, if the person is not a celebrity, is a low frequency keyword that appears very rarely in documents. In this embodiment, it is assumed that the document number list in the inverted entries of the high frequency keyword "China" is {16, 38, 100, 207, 319, 872, 903, 1081, 2331, 5618}, and that the document number list in the inverted entries of the low frequency keyword "Xu Jianjun" is {38, 295, 307, 971, 2331}.

Since the three search servers in the cluster each store a part of the inverted entries of the high frequency keyword "China", each of the three search servers reads its part of those inverted entries. Each document number corresponding to the high frequency keyword "China" is taken modulo 3, and the retrieval server corresponding to each modulus value reads the inverted entries with that modulus value. For example, document number 16 modulo 3 is 1, so search server 1 in the cluster reads the inverted entry of document number 16. Accordingly, search server 0 reads the inverted entries of document numbers {207, 903, 2331}, search server 1 reads the inverted entries of document numbers {16, 100, 319, 1081}, and search server 2 reads the inverted entries of document numbers {38, 872, 5618}.

Each of the three search servers in the cluster stores all the inverted entries of a different set of low frequency keywords. Assume that all the inverted entries of the low frequency keyword "Xu Jianjun" are stored on search server 2. Search server 2 then reads all the inverted entries of the low frequency keyword "Xu Jianjun", that is, the inverted entries of document numbers {38, 295, 307, 971, 2331}.

Next, after the search servers in the cluster finish reading the inverted entries of the keywords, the inverted entries of the low frequency keyword "Xu Jianjun" are distributed to the three search servers in the cluster.

After each document number corresponding to the low frequency keyword is taken modulo 3, the inverted entries corresponding to each modulus value are sent to the retrieval server corresponding to that modulus value. In the present embodiment, for the low frequency keyword "Xu Jianjun", the inverted entry of document number {2331} is sent to search server 0, the inverted entries of document numbers {295, 307} are sent to search server 1, and the inverted entries of document numbers {38, 971} are sent to search server 2, which yields the intermediate result of the search.

Finally, the three servers in the cluster perform the "and" operation on the inverted entries of the high frequency keyword "China" and the low frequency keyword "Xu Jianjun" and obtain the search results.

After the "and" operation, the search result of search server 0 is the document with document number 2331, the search result of search server 1 is empty, and the search result of search server 2 is the document with document number 38. After the search results of the three search servers are aggregated, the result of searching for the search string "China Xu Jianjun" is the documents with document numbers {2331, 38}.
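The worked example can be replayed with a short simulation. This is a sketch only: the function names are hypothetical, the servers are modeled as list positions rather than processes, and set intersection stands in for whatever "and" implementation a real retrieval server would use.

```python
def partition_by_modulo(doc_numbers, n):
    """Group document numbers by doc_number % n, one group per server."""
    parts = [[] for _ in range(n)]
    for doc_number in doc_numbers:
        parts[doc_number % n].append(doc_number)
    return parts


def retrieve_and(high_postings, low_postings, n):
    """Simulate the "and" retrieval of one high and one low frequency keyword."""
    # The high frequency entries are already stored segmented across servers.
    high_parts = partition_by_modulo(high_postings, n)
    # The low frequency entries are read by one server, then redistributed.
    low_parts = partition_by_modulo(low_postings, n)
    # Each server intersects the two lists it holds locally.
    per_server = [
        sorted(set(high) & set(low)) for high, low in zip(high_parts, low_parts)
    ]
    # The proxy aggregates the per-server results.
    merged = sorted(doc for part in per_server for doc in part)
    return per_server, merged


china = [16, 38, 100, 207, 319, 872, 903, 1081, 2331, 5618]
xu_jianjun = [38, 295, 307, 971, 2331]
per_server, merged = retrieve_and(china, xu_jianjun, 3)
# per_server == [[2331], [], [38]] and merged == [38, 2331]
```

Because both lists are partitioned by the same modulus, any document common to both keywords ends up on the same server, so no cross-server comparison is needed during the "and" step.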

Fig. 6 shows a retrieval system in an embodiment of the present invention.

As shown in FIG. 6, the retrieval system in this embodiment includes: a caching proxy server 610, a cluster proxy server 620, and a retrieval server 630.

The cache proxy server 610 is configured to parse the search string to be retrieved to generate a search expression consisting of keywords, receive the search result from the cluster proxy server 620, and output the search result as needed. The cluster proxy server 620 is configured to receive the retrieval expression from the caching proxy server 610, determine the type of each keyword in the retrieval expression, and send a read command to the retrieval server 630 according to the type of the keyword; and to receive the retrieval result from the retrieval server 630 and send it to the caching proxy server 610. The retrieval server 630 is configured to read the inverted entries of the keyword according to the read command from the cluster proxy server 620, determine the retrieval result of the keyword to be retrieved, and return the retrieval result to the cluster proxy server 620. When the search string includes at least two keywords, the search server 630 is further configured to perform logical operations on the inverted entries of the at least two keywords after obtaining the inverted entries of each keyword, and to determine the search results corresponding to the at least two keywords.

A schematic diagram of a retrieval model using the system of the present invention is shown in FIG. 7. The cache proxy server, the cluster proxy servers, and the retrieval servers in the schematic diagram are arranged as a tree: the system includes one cache proxy server, the cache proxy server is connected to n cluster proxy servers, each cluster proxy server is connected to n retrieval servers, and each group of n retrieval servers constitutes a cluster retrieval subsystem.

The cache proxy server is a separate process and can reside on a hardware server. At the time of retrieval, the caching proxy server caches the query request of the externally input search string, and parses the search string to be retrieved to generate a search expression consisting of keywords. For example, the caching proxy server can invoke a retrieval interpreter in the retrieval server to parse the externally entered retrieval string into a retrieval expression that the machine can understand. When each retrieval cluster subsystem returns the retrieval result to the cluster proxy server, the cache proxy server summarizes the results of all cluster proxy servers and returns them to the external user.

The cluster proxy server is a separate process that can reside on a single hardware server. At retrieval time, the cluster proxy server determines the type of each keyword in the retrieval expression and, according to that type, sends a read command to the retrieval servers in its cluster subsystem: when the keyword is a high frequency keyword, it sends to each retrieval server a command to read the part of the index entries of the high frequency keyword that the server itself stores; when the keyword is a low frequency keyword, it sends to one of the retrieval servers a command to read all index entries of the low frequency keyword that the server itself stores. When the retrieval servers return their results, the cluster proxy server aggregates the returned results to determine the retrieval result of the keyword to be retrieved, and returns the aggregated result to the upper-level cache proxy server.
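The dispatch rule described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the names `ReadCommand` and `plan_reads` are hypothetical, servers are numbered from 0 for convenience, and the choice of the owning server for a low frequency keyword (word ID modulo n) is borrowed from the partitioning rule stated in the claims.

```python
from dataclasses import dataclass

HIGH, LOW = "high", "low"  # keyword types as determined by the cluster proxy server


@dataclass(frozen=True)
class ReadCommand:
    server: int   # index of the target retrieval server, 0..n-1 (assumed numbering)
    word_id: int  # identifier of the keyword whose index entries are read
    scope: str    # "part" = the share stored on that server, "all" = every entry


def plan_reads(word_id: int, kind: str, n: int) -> list[ReadCommand]:
    """Return the read commands a cluster proxy would send for one keyword."""
    if kind == HIGH:
        # High frequency: every retrieval server reads its own part in parallel.
        return [ReadCommand(server, word_id, "part") for server in range(n)]
    # Low frequency: a single server holds all entries (assumed owner: word_id % n).
    return [ReadCommand(word_id % n, word_id, "all")]
```

For a cluster of four retrieval servers, a high frequency keyword yields four "part" commands, while a low frequency keyword yields a single "all" command.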

Each retrieval server is a separate process that can reside on a hardware server. It is the basic retrieval unit and, under the scheduling of the upper-level cluster proxy server, performs the basic underlying retrieval operations, including reading the inverted entries of a keyword according to the read command of the cluster proxy server and returning them to the cluster proxy server. When it receives a command to read the part of the index entries of a high frequency keyword that it stores, it reads that part of the index entries of the high frequency keyword; when it receives a command to read all index entries of a low frequency keyword that it stores, it reads all index entries of the low frequency keyword. When the search string includes at least two keywords, the retrieval server further performs logical operations such as "and" or "not" on the inverted entries of the at least two keywords to determine the index entries corresponding to the at least two keywords.
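As an illustration of the logical operations mentioned above, the following sketch combines the document numbers of two inverted entry lists. It assumes that each list is reduced to a list of integer document numbers and that "not" means "in the left list but not in the right list"; the function name `combine` is hypothetical.

```python
def combine(op: str, left: list[int], right: list[int]) -> list[int]:
    """Apply a logical operation to the document numbers of two keywords."""
    a, b = set(left), set(right)
    if op == "and":
        result = a & b  # documents containing both keywords
    elif op == "not":
        result = a - b  # documents containing the left keyword but not the right
    else:
        raise ValueError(f"unsupported operation: {op}")
    return sorted(result)
```

For example, `combine("and", [1, 3, 5], [3, 5, 7])` keeps only the documents shared by both keywords.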

With the embodiment of the present invention, retrieval speed can be significantly improved. Experiments show that among 15 million web documents randomly downloaded from the Internet, the total number of unary, binary, and ternary morphemes that each hit more than 1,000 documents does not exceed 500,000. It can then be inferred that in 100 million documents, the number of morphemes hitting 6,000 to 10,000 documents will not exceed 500,000. Assume that, when storing the relationship between keywords and documents, 8 bytes are used to store the document identifier, 3 bytes to store the weight of the keyword, and 2 bytes to store the compressed keyword position offset. When a keyword hits 5,000 documents, the inverted entries of the keyword occupy about 64k of storage; when a keyword hits 10,000 documents, they occupy about 128k, and the read time is 8 milliseconds.

In the retrieval model provided by the present invention as shown in FIG. 7, if a group of 16 retrieval servers is used, the inverted entries are split according to their storage space, which includes the document identifier, the weight, and the position offset. For a morpheme whose inverted entries occupy 64k or more, the inverted entries of the morpheme are stored by multiple retrieval servers; for a morpheme whose inverted entries occupy 64k or less, all the inverted entries of the morpheme are stored by one retrieval server. In addition, after the document identifier is compressed into a document number, it uses less than 2 bytes of space for storage. In this way, for a morpheme with a storage space below 64k, the time for reading the inverted entries of one morpheme is less than 8 milliseconds. For a morpheme of 64k or more, 64k to 128k of inverted entries are stored on each retrieval server, so the group can store (64k to 128k) / 7 × 16 = approximately 150,000 to 300,000 inverted entries. Then, in 100 million documents, for medium-high frequency keywords with a hit rate below three thousandths, each read also completes within 8 milliseconds. It can be seen that for low frequency keywords and medium-high frequency keywords, all the inverted entries can be read within one read time. For high frequency morphemes with a hit rate above three thousandths, only the part with higher weight may be stored and the part with lower weight deactivated, so that the maximum storage space of the inverted entries of each high frequency morpheme does not exceed 1M, that is, the read time does not exceed 50 ms.
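The storage figures in this estimate can be checked with a short calculation. The sketch below is only a worked example: the function names are hypothetical, and reading the divisor 7 in the source formula as the per-entry size in bytes after the document identifier has been compressed is an assumption.

```python
UNCOMPRESSED_ENTRY_BYTES = 8 + 3 + 2  # document identifier + weight + position offset
COMPRESSED_ENTRY_BYTES = 7            # divisor used in the source formula (assumed per-entry size)


def inverted_entry_bytes(hits: int, entry_bytes: int = UNCOMPRESSED_ENTRY_BYTES) -> int:
    """Storage needed for the inverted entries of a keyword hitting `hits` documents."""
    return hits * entry_bytes


def cluster_capacity(per_server_bytes: int, servers: int = 16,
                     entry_bytes: int = COMPRESSED_ENTRY_BYTES) -> int:
    """Number of inverted entries a group of servers can hold for one morpheme."""
    return per_server_bytes // entry_bytes * servers


print(inverted_entry_bytes(5_000))   # 65000 bytes, about 64k
print(inverted_entry_bytes(10_000))  # 130000 bytes, about 128k
print(cluster_capacity(64 * 1024))   # 149792 entries, about 150,000
print(cluster_capacity(128 * 1024))  # 299584 entries, about 300,000
```

The upper figure of about 300,000 entries corresponds to a hit rate of three thousandths in 100 million documents, which is the threshold used in the text.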

The flowchart for applying the search model in Figure 7 is shown in Figure 8. In this embodiment, at least two keywords are included in the search string.

Step 801: The cache proxy server parses the search string to be retrieved to generate a search expression consisting of the key words.

Step 802: The cluster proxy server determines the type of each keyword in the retrieval expression, and sends a command to read the inverted row item to the retrieval server according to the type of each keyword.

Step 803: After receiving the read command, the retrieval server reads the inverted entries of the keyword.

Step 804: The retrieval server performs logical operations on the inverted entries of the at least two keywords. For example, logical operations are performed on the document numbers in the inverted entries to obtain the result of the logical operation on the keywords.

Step 805: Each retrieval server sends the result of the logical operation to the upper-level cluster proxy server, which aggregates the results to obtain an intermediate result.

Step 806: Each cluster proxy server sends its intermediate result to the upper-level cache proxy server, which aggregates the intermediate results and outputs the final result.
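Steps 802 to 805 can be sketched for the case in which every keyword is a high frequency keyword joined by "and". The sketch assumes that index entries are reduced to integer document numbers and that they are partitioned by document number modulo n, as in the partitioning rule of the claims; the names `shard_entries` and `search_and` are hypothetical, and a single process stands in for the n retrieval servers.

```python
def shard_entries(doc_numbers: list[int], n: int) -> list[list[int]]:
    """Split one keyword's document numbers into n parts by modulo (assumed rule)."""
    shards: list[list[int]] = [[] for _ in range(n)]
    for doc in doc_numbers:
        shards[doc % n].append(doc)
    return shards


def search_and(index: dict[str, list[int]], keywords: list[str], n: int) -> list[int]:
    """Simulate steps 802-805 for an 'and' query over high frequency keywords."""
    shards = {kw: shard_entries(index[kw], n) for kw in keywords}
    per_server = []
    for server in range(n):
        # Step 803: the server reads its own part of each keyword's entries.
        parts = [set(shards[kw][server]) for kw in keywords]
        # Step 804: the server intersects the parts locally.
        per_server.append(set.intersection(*parts))
    # Step 805: the cluster proxy aggregates the per-server results.
    return sorted(set().union(*per_server))
```

Because every keyword is split by the same modulo rule, a given document falls on the same server for all keywords, so intersecting locally and taking the union at the cluster proxy gives the same result as intersecting the complete lists.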

Fig. 9 shows the structure of a retrieval server in the embodiment of the present invention. In this embodiment, the search string to be retrieved includes at least two keywords.

The retrieval server includes: a retrieval interpretation module 910, a read management module 920, a keyword reading module 930, a logical operation module 940, and an identification conversion module 950.

The retrieval interpretation module 910 is configured to parse the search string to be retrieved to generate a retrieval expression composed of keywords, for the upper-layer server to call. The read management module 920 is configured to receive at least one of a command to read a part of the index entries of a high frequency keyword stored by the server itself and a command to read all index entries of a low frequency keyword stored by the server itself. The keyword reading module 930 is configured to: when the command to read a part of the index entries of the high frequency keyword stored by the server itself is received, read that part of the index entries of the high frequency keyword; and when the command to read all index entries of the low frequency keyword stored by the server itself is received, read all index entries of the low frequency keyword. The inverted entries include the document number generated by compressing the document identifier. The logical operation module 940 is configured to, when there are at least two keywords to be retrieved that have a logical relationship, perform logical operations on the index entries corresponding to the at least two keywords according to the logical relationship, and determine the index entries corresponding to the at least two keywords. The identification conversion module 950 is configured to compress the eight-byte document identifier in the inverted entries of a keyword into a four-byte document number.
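The text does not state how the identification conversion module performs the compression. One possible approach, shown here purely as an assumption, is to assign each eight-byte document identifier a dense sequential number that fits in four bytes; the class name `DocNumberTable` and its methods are hypothetical.

```python
import struct


class DocNumberTable:
    """Maps 8-byte document identifiers to dense 4-byte document numbers (assumed scheme)."""

    def __init__(self) -> None:
        self._to_number: dict[int, int] = {}
        self._to_id: list[int] = []

    def compress(self, doc_id: int) -> int:
        """Return the document number for doc_id, assigning a new one if needed."""
        number = self._to_number.get(doc_id)
        if number is None:
            number = len(self._to_id)
            if number > 0xFFFFFFFF:
                raise OverflowError("document number no longer fits in four bytes")
            self._to_number[doc_id] = number
            self._to_id.append(doc_id)
        return number

    def expand(self, number: int) -> int:
        """Recover the original document identifier from a document number."""
        return self._to_id[number]


table = DocNumberTable()
doc_id = 0x1122334455667788
packed_id = struct.pack("<Q", doc_id)                      # 8 bytes on disk
packed_number = struct.pack("<I", table.compress(doc_id))  # 4 bytes on disk
```

Under this scheme the lookup table must be kept so that document numbers in a retrieval result can be converted back to document identifiers.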

In the above embodiment, the indexing method for the document is an inverted index, and the corresponding index entry is an inverted list item, which is only an example of the present invention and is not intended to limit the present invention. When the embodiment of the present invention is applied, other index methods may be used to read the index table corresponding to the index method.

As can be seen from the above embodiments, in the embodiment of the present invention, on the one hand, the inverted entries of a high frequency keyword are stored by multiple servers in the cluster, and at retrieval time the inverted entries of the high frequency keyword are read in parallel by the multiple servers, so that a large number of inverted entries can be read within the time allowed by the system design, the time overhead of a single logical operation is not increased, and retrieval speed is improved. On the other hand, all inverted entries of a low frequency keyword are stored by one retrieval server, and at retrieval time only that server reads the inverted entries of the low frequency keyword. It is therefore unnecessary to read a small number of inverted entries separately on multiple retrieval servers, which saves the storage resources of the multiple retrieval servers in the cluster and improves retrieval speed.

In addition, the embodiment of the present invention can effectively improve the degree of coupling between the retrieval servers in the retrieval cluster, increase the capability for dynamic resource allocation among the servers, and plan the resources of the multiple retrieval servers in the cluster uniformly, so that the overall concurrency capability of the cluster is maximized, which further improves retrieval speed.

While the invention has been described by the embodiments of the present invention, it will be understood that

Claims

1. A retrieval method, comprising:

Determining the type of keyword to be retrieved;

Determining a search result of the keyword to be retrieved according to the read index table item.

2. The method according to claim 1, further comprising: dividing all index entries of each high frequency keyword into n parts, wherein the mth retrieval server stores the mth part of the index entries of each high frequency keyword;

dividing all low frequency keywords into n parts, wherein the mth retrieval server stores all index entries of the mth part of the low frequency keywords; wherein m is an integer greater than 1 and less than or equal to n.

3. The method of claim 2, wherein

The dividing of all index entries of each high frequency keyword into n parts includes: taking the modulus of the identifier used to distinguish documents in the index entries, where the modulus is n, and taking the index entries whose identifier has a modulo value of m as the mth part of the index entries;

The dividing of all low frequency keywords into n parts includes: taking the modulus of the word identifier (word ID) corresponding to each low frequency keyword, where the modulus is n, and taking the low frequency keywords whose word ID has a modulo value of m as the mth part of the low frequency keywords.

4. The method according to claim 1, wherein when there are at least two keywords to be retrieved that have a logical relationship, the method further comprises: performing, according to the logical relationship, a logical operation on the index entries that have been read for the at least two keywords to be retrieved; and determining the retrieval result of the at least two keywords based on the result of the logical operation.

5. The method according to claim 4, wherein when the at least two keywords to be retrieved include at least one low frequency keyword, performing the logical operation on the index entries that have been read includes:

dividing all the index entries that have been read for the at least one low frequency keyword into n parts; sending the n parts of the index entries of the at least one low frequency keyword to the n retrieval servers, respectively; and performing, by each retrieval server, a logical operation on the index entries it has read and the received part of the index entries of the at least one low frequency keyword.

6. The method according to claim 4 or 5, wherein the logical operation of the index entry comprises:

7. The method according to any one of claims 1 to 5, further comprising:

Compressing the identifier used to distinguish documents in the index entries into a document number;

Updating the identifier used to distinguish documents to the document number.

8. The method according to claim 7, wherein compressing the identifier used to distinguish documents in the index entries into the document number comprises:

Compressing the document identifier in the index entries into a four-byte document number.

9. The method according to any one of claims 1 to 5, wherein the index entry is an inverted entry.

10. The method according to any one of claims 1 to 5, wherein the high frequency keyword comprises an ultra high frequency keyword and a medium high frequency keyword.

11. A retrieval system, comprising:

12. The retrieval system according to claim 11, wherein the retrieval server is further configured to: when there are at least two keywords to be retrieved that have a logical relationship, perform, according to the logical relationship, a logical operation on the index entries that have been read for the at least two keywords to be retrieved, and determine the index entries corresponding to the at least two keywords.

13. A retrieval server, comprising: a read management module, configured to receive at least one of a command to read a part of an index entry of a high frequency keyword stored by itself and a command to read all index entries of a low frequency keyword stored by itself;

14. The retrieval server according to claim 13, further comprising: a logical operation module, configured to: when there are at least two keywords to be retrieved that have a logical relationship, perform, according to the logical relationship, a logical operation on the index entries that have been read for the at least two keywords to be retrieved, to determine the index entries corresponding to the at least two keywords.

15. The retrieval server according to claim 13, further comprising: an identifier conversion module, configured to compress the identifier used to distinguish documents in the index entries of the high frequency keyword or the low frequency keyword into a document number.

16. A cluster proxy server, comprising:

a first module, configured to determine a type of a keyword to be retrieved;

17. The cluster proxy server according to claim 16, further comprising:

a fourth module, configured to divide all index entries of each high frequency keyword into n parts; and send, to the mth search server, a command to store the mth partial index entry of each high frequency keyword, where m is an integer greater than 1 and less than or equal to n;

And a fifth module, configured to divide all low frequency keywords into n parts; and send, to the mth search server, a command to store all index entries of the mth part of the low frequency keywords.