WO2009000173A1 - Searching method, searching system and searching server - Google Patents

Searching method, searching system and searching server

Publication number
WO2009000173A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
read
retrieval
index
server
Application number
PCT/CN2008/070598
Other languages
French (fr)
Chinese (zh)
Inventor
Liang Sun
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2009000173A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing

Definitions

  • The present invention relates to the field of communications technologies, and in particular to a retrieval method, a retrieval system, and a retrieval server.
  • A search string contains one or more keywords.
  • Keywords are separated by spaces; a space between two keywords indicates an "AND" relationship between them.
  • Each keyword can consist of one or more morphemes.
  • A morpheme is the smallest language unit that can express independent semantics, usually a Chinese word produced by a word segmentation system. The word segmentation system can divide a keyword into different numbers of morphemes.
  • If a keyword is divided into two morphemes, it is a binary compound morpheme; if it is divided into three morphemes, it is a ternary compound morpheme.
  • Retrieving an input search string means finding, in a short time, the collection of all documents that contain the search string, and presenting that collection as a list of document identifiers.
  • Back-end retrieval cluster technology is one of the core technologies of a search system. It governs how multiple retrieval servers cooperate to provide retrieval services over a larger data set. The number of documents that a single retrieval server can manage is limited: if too many documents are stored, the system can hardly return the desired results within a time acceptable to the user during normal retrieval operations. Users usually accept a delay of no more than 1 second, so a retrieval cluster consisting of multiple retrieval servers is needed to support retrieval over a larger data set.
  • The inverted index is a data structure used to speed up retrieval of the search string. It can exist as a disk file or be loaded into memory, and it consists of at least a dictionary file and an inverted table file. The inverted table file holds a plurality of inverted entries, and each inverted entry records the correspondence between a keyword in the search string and the documents. Improving the reading speed of the inverted entries therefore improves retrieval efficiency.
  • The time to read an inverted entry from the inverted table file consists of the disk addressing time and the time required to read the data.
  • When the amount of data read is relatively small, the reading time of the inverted entry depends mainly on the disk addressing time; when the amount of data read is relatively large, it depends mainly on the data reading time.
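  • The relationship between addressing time and data reading time can be illustrated with a minimal sketch. The function names, the 8 ms addressing time, and the throughput figure are assumptions chosen for illustration, not values taken from the description of the prior art.

```python
def read_time_ms(size_bytes, seek_ms=8.0, throughput_bytes_per_ms=50_000.0):
    """Estimated time to read one inverted entry: addressing time plus transfer time.

    seek_ms and throughput_bytes_per_ms are illustrative assumptions.
    """
    return seek_ms + size_bytes / throughput_bytes_per_ms


def seek_share(size_bytes, seek_ms=8.0, throughput_bytes_per_ms=50_000.0):
    """Fraction of the total read time that is spent on disk addressing."""
    total = read_time_ms(size_bytes, seek_ms, throughput_bytes_per_ms)
    return seek_ms / total


if __name__ == "__main__":
    for size in (4 * 1024, 64 * 1024, 100 * 1024 * 1024):
        print(size, round(read_time_ms(size), 2), round(seek_share(size), 4))
```

  Under these assumed figures, a 4 KB read is almost entirely addressing time, while a 100 MB read is almost entirely transfer time, which matches the two cases described above.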
  • The system includes a retrieval proxy server and a plurality of parallel retrieval servers managed by the retrieval proxy server.
  • Each retrieval server is allocated one N-th (1/N) of the documents in the full document set, where N is the total number of retrieval servers.
  • The retrieval proxy server sends the read request to every retrieval server at the same time. After each retrieval server completes its local retrieval, it returns its retrieval results to the retrieval proxy server, which finally aggregates the results of all retrieval servers according to a specific weight-based sort order.
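  • The document-partitioned flow described above can be sketched as follows. This is a simplified illustration, not the prior-art system itself: the shard layout (a dictionary from document identifier to per-keyword weight), the function names, and the use of a single numeric weight for sorting are all assumptions.

```python
import heapq


def local_search(shard, keyword):
    """Search one retrieval server's shard, given as {doc_id: {keyword: weight}}."""
    return [(weights[keyword], doc_id)
            for doc_id, weights in shard.items() if keyword in weights]


def proxy_search(shards, keyword, top_k=10):
    """Send the request to every shard, then merge the hits by descending weight."""
    hits = []
    for shard in shards:  # a real proxy would issue these requests concurrently
        hits.extend(local_search(shard, keyword))
    return [doc_id for _, doc_id in heapq.nlargest(top_k, hits)]


if __name__ == "__main__":
    shards = [{1: {"china": 0.5}, 4: {"net": 0.9}}, {2: {"china": 0.8}}]
    print(proxy_search(shards, "china"))
```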
  • The document partition-based retrieval system has an independent structural design: the coupling between retrieval servers is low, and each retrieval server is equivalent to a retrieval subsystem that can be loaded independently.
  • Most search strings are composed of two or more keywords.
  • After matching the document identifiers for each keyword, the retrieval server must also match the position offsets within the document, which results in multiple disk I/O accesses.
  • When a high frequency morpheme is included in the search string, the document identifier lists and position offset lists that must be read are large. For example, the inverted entry data of high frequency morphemes such as "China", "Net", and "We" usually accounts for a large proportion of the entire inverted index data. The index data cannot be read in a short time, so most of the retrieval time is consumed by file input and output. As a result, the overall concurrency of the retrieval system degrades, and both the retrieval speed and the response speed for the search string become slower.
  • An existing distributed index file retrieval model based on index entry partitioning is shown in FIG. 2.
  • The system includes a retrieval proxy server and N groups of parallel retrieval servers managed by the retrieval proxy server, where N is an integer greater than 1; each group of retrieval servers is allocated one N-th (1/N) of the documents in the full document set.
  • Each group of retrieval servers contains three retrieval servers.
  • The inverted entries of all indexed keywords are evenly distributed across the 3 retrieval servers of the group, which speeds up access to the inverted entries.
  • However, a single retrieval server in a group cannot perform a retrieval independently; it must cooperate with the other retrieval servers in the group to complete the retrieval. This increases the data coupling between the retrieval servers, which makes data backup more complicated and lowers the retrieval speed.
  • This cooperation is needed for every retrieval, which increases the amount of communication between the retrieval servers.
  • A retrieval method, including:
  • when the keyword is a high frequency keyword, each of n retrieval servers reads the part of the index entries of the high frequency keyword that it stores, where n is an integer greater than 1;
  • when the keyword is a low frequency keyword, one of the n retrieval servers reads all index entries of the low frequency keyword that it stores.
  • A retrieval system, comprising:
  • a cluster proxy server, configured to: determine the type of a keyword to be retrieved; when the keyword is a high frequency keyword, send to each of n retrieval servers a command to read the part of the index entries of the high frequency keyword that the server itself stores; when the keyword is a low frequency keyword, send to one of the n retrieval servers a command to read all index entries of the low frequency keyword that the server itself stores, where n is an integer greater than 1; and determine, according to the index entries read by the retrieval servers, the retrieval result of the keyword to be retrieved;
  • a retrieval server, configured to: on receiving a command to read the part of the index entries of the high frequency keyword that it stores, read that part of the index entries; on receiving a command to read all index entries of the low frequency keyword that it stores, read all index entries of the low frequency keyword.
  • A retrieval server, comprising: a read management module, configured to receive at least one of a command to read the part of the index entries of a high frequency keyword that the server itself stores and a command to read all index entries of a low frequency keyword that the server itself stores;
  • a keyword reading module, configured to: on receiving a command to read the part of the index entries of the high frequency keyword that the server itself stores, read that part of the index entries; on receiving a command to read all index entries of the low frequency keyword that the server itself stores, read all index entries of the low frequency keyword.
  • A cluster proxy server, including:
  • a first module, configured to determine the type of a keyword to be retrieved;
  • a second module, configured to: when the keyword is a high frequency keyword, send to each of n retrieval servers a command to read the part of the index entries of the high frequency keyword that the server itself stores; when the keyword is a low frequency keyword, send to one of the n retrieval servers a command to read all index entries of the low frequency keyword that the server itself stores, where n is an integer greater than 1;
  • a third module, configured to determine, according to the index entries read by the retrieval servers, the retrieval result of the keyword to be retrieved.
  • The inverted entries of a high frequency keyword are stored across multiple servers in the cluster, and during retrieval the inverted entries of the high frequency keyword are read by those servers in parallel. A large number of inverted entries can therefore be read within the system design time without delaying subsequent processing or adding to the time overhead of a single logical operation, which improves the retrieval speed.
  • All the inverted entries of a low frequency keyword are stored by one retrieval server, and during retrieval only that server reads the inverted entries of the low frequency keyword. It is therefore unnecessary to read a small number of inverted entries separately on multiple retrieval servers, which saves the resources of multiple retrieval servers in the cluster and improves the retrieval speed.
  • FIG. 1 is a schematic diagram of a distributed index file retrieval model based on document partitioning;
  • FIG. 2 is a schematic diagram of a distributed index file retrieval model based on index entry partitioning;
  • FIG. 3 is a flowchart of a retrieval method according to an embodiment of the present invention;
  • FIG. 4 is a flowchart of a retrieval method according to another embodiment of the present invention;
  • FIG. 5 is a schematic diagram of searching a specific search string by applying the method of the present invention;
  • FIG. 6 is a structural diagram of a retrieval system in an embodiment of the present invention;
  • FIG. 7 is a schematic diagram of a retrieval model of a retrieval system in an embodiment of the present invention;
  • FIG. 8 is a flowchart of retrieval using the retrieval model in FIG. 7;
  • FIG. 9 is a structural diagram of a retrieval server in an embodiment of the present invention.
  • The present invention is further described in detail below with reference to the drawings and embodiments.
  • The flow of the retrieval method in the embodiment of the present invention is shown in FIG. 3.
  • Step 301: Parse the search string to be retrieved to generate a search expression consisting of keywords.
  • Step 302: Send a read request for the keyword to each retrieval server in the cluster.
  • The read request for the keyword includes a read-ahead request for the inverted entry of the keyword.
  • The inverted entry of a keyword is an array that records every document containing the keyword. For each such document it holds the identifier of the document, the weight of the keyword in the document, and the position offsets of the keyword in the document. The basic structure is as follows: $t\ \langle d_1, w_{d_1,t}, loc_1, loc_2, \ldots, loc_{f_{d_1,t}} \rangle\ \langle d_2, \ldots \rangle\ \ldots\ \langle d_{f_t}, \ldots \rangle$, where $t$ is the keyword, $d_i$ is a document identifier, $w_{d_i,t}$ is the weight of $t$ in document $d_i$, and $loc_j$ is a position offset.
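  • The structure of an inverted entry can be sketched as a small data type. The class name `Posting`, the helper `add_occurrence`, and the default weight are hypothetical names and values introduced here for illustration; only the three fields (document identifier, weight, position offsets) come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Posting:
    """One element of an inverted entry: a document that contains the keyword."""
    doc_id: int                       # identifier of the document
    weight: float                     # weight of the keyword in the document
    offsets: List[int] = field(default_factory=list)  # position offsets


InvertedIndex = Dict[str, List[Posting]]


def add_occurrence(index: InvertedIndex, keyword: str, doc_id: int,
                   offset: int, weight: float = 1.0) -> None:
    """Record that `keyword` occurs at `offset` in document `doc_id`."""
    postings = index.setdefault(keyword, [])
    for posting in postings:
        if posting.doc_id == doc_id:
            posting.offsets.append(offset)
            return
    postings.append(Posting(doc_id, weight, [offset]))


if __name__ == "__main__":
    index: InvertedIndex = {}
    add_occurrence(index, "china", 16, 3)
    add_occurrence(index, "china", 16, 9)
    add_occurrence(index, "china", 38, 1)
    print(index["china"])
```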
  • Step 303: The retrieval servers in the cluster read the inverted entries of the keyword according to the frequency with which the keyword hits documents.
  • According to the frequency with which they hit documents, the keywords in the search expression can be classified into low frequency keywords and high frequency keywords, where the high frequency keywords comprise ultra-high frequency keywords and medium-high frequency keywords.
  • The index data may be counted before the retrieval is performed to determine the number of documents hit by each keyword, and the type of the keyword to be retrieved is then determined according to a preset document frequency threshold.
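  • A minimal sketch of this classification step follows. The function name and both threshold values are assumptions for illustration; the description only states that a preset threshold on the number of hit documents is used.

```python
def keyword_type(hit_count, high_threshold, ultra_threshold):
    """Classify a keyword by the number of documents it hits.

    high_threshold and ultra_threshold are preset values; the concrete
    numbers passed in the example below are hypothetical.
    """
    if hit_count >= ultra_threshold:
        return "ultra-high"
    if hit_count >= high_threshold:
        return "medium-high"
    return "low"


def is_high_frequency(hit_count, high_threshold):
    """Ultra-high and medium-high keywords are both treated as high frequency."""
    return hit_count >= high_threshold


if __name__ == "__main__":
    for hits in (5, 6000, 1_000_000):
        print(hits, keyword_type(hits, high_threshold=6000, ultra_threshold=100_000))
```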
  • When the keyword is an ultra-high frequency keyword and/or a medium-high frequency keyword, the inverted entries of the keyword are segmented and stored across the retrieval servers in the cluster, and each retrieval server stores a part of the inverted entries of the keyword. For example, when the cluster includes n retrieval servers, all index entries of the high frequency keyword are divided into n parts, and the m-th retrieval server stores the m-th part of the index entries of the keyword, where n is an integer greater than 1 and m is an integer greater than or equal to 1 and less than or equal to n.
  • When the keyword is a low frequency keyword, all the inverted entries of the keyword are stored by one retrieval server in the cluster. For example, the full set of low frequency keywords is divided into n parts, and the m-th retrieval server stores all index entries of the low frequency keywords in the m-th part.
  • In the retrieval phase, for a high frequency keyword, each retrieval server in the cluster reads the inverted entries of the high frequency keyword that it stores; for a low frequency keyword, all the inverted entries are read by the retrieval server that stores the inverted entries of that low frequency keyword.
  • Segmenting the inverted entries of a high frequency keyword includes: taking the document identifier in each inverted entry of the high frequency keyword modulo n, and storing the inverted entries that have the same modulus value as a group on the retrieval server corresponding to that modulus value. In the retrieval phase, the retrieval server corresponding to a modulus value reads the inverted entries that have that modulus value.
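  • The modulo-based segmentation can be sketched as below. For brevity the sketch partitions bare document identifiers rather than full inverted entries, and the function name is an assumption.

```python
def partition_by_doc_id(doc_ids, n):
    """Split the document identifiers of one high frequency keyword into n groups.

    Group m holds the identifiers whose value modulo n equals m, and would be
    stored on (and later read by) the retrieval server with index m.
    """
    parts = [[] for _ in range(n)]
    for doc_id in doc_ids:
        parts[doc_id % n].append(doc_id)
    return parts


if __name__ == "__main__":
    print(partition_by_doc_id(range(7), 3))
```

  Because the same modulus rule is applied at storage time and at retrieval time, each server can read its own group without consulting the others.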
  • For low frequency keywords, the word identifier (word ID) corresponding to each low frequency keyword is taken modulo n, and the low frequency keywords that have the same modulus value are grouped and stored by one retrieval server.
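  • Placement of low frequency keywords by word ID can be sketched in the same way. The keyword names and word ID values in the example are hypothetical, as is the function name.

```python
def group_low_frequency_keywords(word_ids, n):
    """Group low frequency keywords by word ID modulo n.

    word_ids maps each keyword to its word ID. All inverted entries of the
    keywords in group m would be stored on the retrieval server with index m.
    """
    groups = {m: [] for m in range(n)}
    for keyword, word_id in word_ids.items():
        groups[word_id % n].append(keyword)
    return groups


if __name__ == "__main__":
    print(group_low_frequency_keywords({"a": 10, "b": 11, "c": 13}, 3))
```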
  • The retrieval server compresses the eight-byte document identifier in the inverted entry of the keyword into a four-byte document number.
  • Step 304: The retrieval servers in the cluster perform logical operations on the inverted entries of the keywords and then output the search results.
  • The retrieval server that stores the inverted entries of a low frequency keyword takes the document identifier of each inverted entry of that keyword modulo n, and sends the inverted entries for each modulus value to the retrieval server corresponding to that modulus value.
  • Each retrieval server in the cluster performs logical operations on the inverted entries of the high frequency keyword and the low frequency keyword; the search result of the search string is obtained by summarizing the logical operation results of all retrieval servers.
  • The logical operation may be an AND operation, an OR operation, a NOT operation, or any combination of them.
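  • The three logical operations can be illustrated on lists of document identifiers. This is a sketch using Python sets, not the implementation described in the embodiment; in particular, treating NOT as "in the first list but not in the second" is an assumption made here.

```python
def op_and(a, b):
    """Documents that contain both keywords."""
    return sorted(set(a) & set(b))


def op_or(a, b):
    """Documents that contain at least one of the keywords."""
    return sorted(set(a) | set(b))


def op_not(a, b):
    """Documents in a that do not appear in b (assumed meaning of NOT)."""
    return sorted(set(a) - set(b))


if __name__ == "__main__":
    print(op_and([38, 872, 5618], [38, 971]))
    print(op_or([1, 2], [2, 3]))
    print(op_not([1, 2, 3], [2]))
```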
  • Each cluster shown in this embodiment includes n retrieval servers, where n is an integer greater than one.
  • Step 401: Parse the search string to be retrieved to generate a search expression consisting of keywords.
  • The search string entered by the user may be a short sentence or may include a plurality of keywords. Such search strings are raw strings that have not been formatted for the computer, so the search string is parsed to generate a search expression that the computer can recognize.
  • The search expression may contain one or more keywords. If the keywords entered by the user do not include a separator, a logical relationship exists between the keywords after parsing. If the keywords entered by the user include a separator, the separator determines the operation: keywords separated by a space are combined with an "AND" operation, keywords separated by "|" are combined with an "OR" operation, and "!" before a keyword indicates a "NOT" operation on that keyword.
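  • A minimal sketch of such a parser is given below. It assumes that the OR separator is written as "|", that AND binds more tightly than OR, and that the result is a list of OR-groups; none of these choices is specified in the description, and the function name is hypothetical.

```python
def parse_search_string(raw):
    """Parse a raw search string into OR-groups of AND-terms.

    Returns a list of groups; each group is a list of (keyword, negated)
    pairs that are combined with AND, and the groups are combined with OR.
    """
    expression = []
    for group in raw.split("|"):          # "|" as OR separator is an assumption
        terms = []
        for token in group.split():       # whitespace separates AND-terms
            negated = token.startswith("!")
            keyword = token[1:] if negated else token
            if keyword:
                terms.append((keyword, negated))
        if terms:
            expression.append(terms)
    return expression


if __name__ == "__main__":
    print(parse_search_string("China Xu"))
    print(parse_search_string("a|!b c"))
```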
  • Step 402: Send a read request for the keyword to each retrieval server in the cluster; the read request includes a read-ahead request for the inverted entry of the keyword.
  • Step 403: Determine whether the keyword is a high frequency keyword or a low frequency keyword. If it is a high frequency keyword, perform step 404; if it is a low frequency keyword, perform step 405.
  • The keywords in the search expression are divided into high frequency keywords and low frequency keywords.
  • The high frequency keywords can be further divided into medium-high frequency keywords and ultra-high frequency keywords.
  • The index data may be counted before the search is performed to determine the number of documents hit by each keyword, that is, the number of inverted entries corresponding to the keyword; the type of the keyword to be retrieved is then determined according to a preset document frequency threshold.
  • Step 404: The n retrieval servers in the cluster each read a part of the inverted entries of the high frequency keyword; then perform step 407.
  • For reading the inverted entries of high frequency keywords, a technique similar to a disk RAID (Redundant Array of Independent Disks) system can be used: the n retrieval servers in the cluster each store a part of the inverted entries of an ultra-large-scale high frequency keyword, and during retrieval the entries are read by the n retrieval servers in parallel. Reading the oversized inverted entries can thus be completed within the system design time, without delaying the subsequent logical operations or adding to the time overhead of a single logical operation.
  • Step 405: The retrieval server in the cluster that stores the low frequency keyword to be retrieved reads all the inverted entries of the low frequency keyword.
  • The inverted entries of the low frequency keyword are read by one retrieval server in the cluster, which avoids the existing situation in which a small number of inverted entries are read on multiple retrieval servers.
  • The data block of the inverted entries of a low frequency keyword is smaller than the minimum read data block of the disk, for example 64K, and the disk takes the same time to read any data block smaller than 64K.
  • If the inverted entries of a low frequency keyword were divided into n blocks and read by n servers, the reading speed would not increase, and the resources of multiple retrieval servers in the cluster would be wasted.
  • When the index is built, the retrieval server further compresses the eight-byte document identifier in the inverted entry of the keyword into a four-byte document number.
  • The document identifier in the inverted entry is used to locate the document.
  • For example, each web page has a unique URL (Uniform Resource Locator). Processing the URL string of the web page with a signature algorithm yields a 64-bit (8-byte) globally unique integer corresponding to the URL string, which is the document identifier of the document.
  • The inverted entries of the keyword are stored on n retrieval servers, so each retrieval server is assigned a certain number of documents. Assume that this number is N, where N is an integer greater than 0. Each retrieval server then numbers the documents assigned to it, converting each document identifier into an integer from 0 to N-1 that serves as the document number of the document. For the same document, the document number is much shorter than the original document identifier, which saves storage space and improves the reading speed.
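  • The two steps above (deriving a 64-bit identifier from a URL, then renumbering documents locally) can be sketched as follows. The description does not name the signature algorithm; taking the first 8 bytes of a SHA-1 digest is a stand-in chosen for illustration, and the function names and example URLs are hypothetical.

```python
import hashlib


def document_identifier(url):
    """Derive a 64-bit (8-byte) integer from a URL string.

    Truncated SHA-1 is an assumed stand-in for the unspecified signature
    algorithm; collisions are unlikely but not strictly impossible.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign_document_numbers(doc_ids):
    """Map each document identifier on one server to a number from 0 to N-1."""
    unique_ids = sorted(set(doc_ids))
    if len(unique_ids) > 2 ** 32:
        raise ValueError("too many documents for a four-byte document number")
    return {doc_id: number for number, doc_id in enumerate(unique_ids)}


if __name__ == "__main__":
    urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
    ids = [document_identifier(u) for u in urls]
    print(assign_document_numbers(ids))
```

  The server would keep this mapping so that a document number found during retrieval can be converted back to the original document identifier.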
  • Step 406: The document numbers in the inverted entries of the low frequency keyword are taken modulo n, and each inverted entry is sent to the retrieval server corresponding to its modulus value.
  • Step 407: The n retrieval servers in the cluster perform logical operations on the inverted entries that have been read.
  • The logical operations in this step follow the logical relationship between the keywords in the search string to be retrieved, and include an AND operation, an OR operation, a NOT operation, or any combination of them.
  • Step 408: The search result of the search string is obtained by summarizing the logical operation results of the n retrieval servers.
  • The description above takes a search string that includes both a high frequency keyword and a low frequency keyword as an example.
  • When the search string includes only high frequency keywords, steps 405-406 need not be performed; when the search string includes only low frequency keywords, step 404 need not be performed.
  • The cluster includes three retrieval servers: retrieval server 0, retrieval server 1, and retrieval server 2.
  • The search string "China Xu Jianjun" is parsed to generate a search expression consisting of the keywords "China" and "Xu Jianjun".
  • The three retrieval servers in the cluster determine the type of each keyword to be retrieved according to the number of documents it hits, and read the inverted entries of the keyword according to its type.
  • The three retrieval servers in the cluster each store a part of the inverted entries of the high frequency keyword "China", and each reads its own part.
  • Each document number corresponding to the high frequency keyword "China" is taken modulo 3, and the retrieval server corresponding to each modulus value reads the inverted entries for that modulus value. For example, document number 16 modulo 3 is 1, so retrieval server 1 in the cluster reads the inverted entry of document number 16.
  • Retrieval server 0 reads the inverted entries of document numbers {207, 903, 2331}, retrieval server 1 reads the inverted entries of document numbers {16, 100, 319, 1081}, and retrieval server 2 reads the inverted entries of document numbers {38, 872, 5618}.
  • The three retrieval servers in the cluster store all the inverted entries of different low frequency keywords. Assume that all the inverted entries of the low frequency keyword "Xu Jianjun" are stored on retrieval server 2.
  • Retrieval server 2 stores and reads all the inverted entries of the low frequency keyword "Xu Jianjun", that is, the inverted entries of document numbers {38, 295, 307, 971, 2331}.
  • The inverted entries of the low frequency keyword "Xu Jianjun" are then distributed to the three retrieval servers in the cluster: the inverted entries for each modulus value are sent to the retrieval server corresponding to that modulus value.
  • The inverted entry of document number {2331} is sent to retrieval server 0, the inverted entries of document numbers {295, 307} are sent to retrieval server 1, and the inverted entries of document numbers {38, 971} are sent to retrieval server 2, which gives the intermediate result of the search.
  • The three retrieval servers in the cluster perform the AND operation on the inverted entries of the high frequency keyword "China" and the low frequency keyword "Xu Jianjun" to obtain the search results.
  • The search result of retrieval server 0 is the document with document number 2331, the search result of retrieval server 1 is empty, and the search result of retrieval server 2 is the document with document number 38. The search results of the three retrieval servers are then summarized.
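  • The worked example above can be reproduced with a short sketch. The document numbers are those given in the example; the function names, and the use of plain document numbers in place of full inverted entries, are simplifications introduced here.

```python
CHINA = [16, 38, 100, 207, 319, 872, 903, 1081, 2331, 5618]   # high frequency
XU_JIANJUN = [38, 295, 307, 971, 2331]                        # low frequency


def partition(doc_numbers, n):
    """Group document numbers by their value modulo n (one group per server)."""
    parts = [[] for _ in range(n)]
    for number in doc_numbers:
        parts[number % n].append(number)
    return parts


def search(high, low, n):
    """AND a high frequency keyword with a low frequency keyword on n servers."""
    high_parts = partition(high, n)  # each server reads its own stored part
    low_parts = partition(low, n)    # one server reads all, then redistributes
    per_server = [sorted(set(h) & set(l)) for h, l in zip(high_parts, low_parts)]
    merged = sorted(number for result in per_server for number in result)
    return per_server, merged


if __name__ == "__main__":
    per_server, merged = search(CHINA, XU_JIANJUN, 3)
    print(per_server)  # [[2331], [], [38]]
    print(merged)      # [38, 2331]
```

  Because both keywords are partitioned by the same modulus, any document that contains both keywords lands on the same server, so each server can perform the AND operation locally.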
  • FIG. 6 shows a retrieval system in an embodiment of the present invention.
  • The retrieval system in this embodiment includes a cache proxy server 610, a cluster proxy server 620, and a retrieval server 630.
  • The cache proxy server 610 is configured to parse the search string to be retrieved to generate a search expression consisting of keywords, to receive the search result from the cluster proxy server 620, and to output the search result as needed.
  • The cluster proxy server 620 is configured to receive the search expression from the cache proxy server 610, determine the type of each keyword in the search expression, and send a read command to the retrieval server 630 according to the type of the keyword; and to receive the retrieval result from the retrieval server 630 and send it to the cache proxy server 610.
  • The retrieval server 630 is configured to read the inverted entries of the keyword according to the read command from the cluster proxy server 620, determine the retrieval result of the keyword to be retrieved, and return the retrieval result to the cluster proxy server 620.
  • The retrieval server 630 is further configured to perform logical operations on the inverted entries of at least two keywords after obtaining the inverted entries of each keyword, and to determine the search result corresponding to the at least two keywords.
  • A schematic diagram of a retrieval model using the system of the present invention is shown in FIG. 7.
  • The cache proxy server, the cluster proxy servers, and the retrieval servers in the schematic diagram are arranged as a tree: the system includes a cache proxy server, the cache proxy server is connected to n cluster proxy servers, each cluster proxy server is connected to n retrieval servers, and each group of n retrieval servers constitutes a cluster retrieval subsystem.
  • The cache proxy server is a separate process and can reside on a hardware server.
  • The cache proxy server caches query requests for externally entered search strings, and parses the search string to be retrieved to generate a search expression consisting of keywords.
  • The cache proxy server can invoke the retrieval interpreter in a retrieval server to parse the externally entered search string into a search expression that the machine can understand.
  • The cache proxy server summarizes the results of all cluster proxy servers and returns them to the external user.
  • A cluster proxy server is a separate process that can reside on a single hardware server.
  • The cluster proxy server determines the type of each keyword in the search expression and, according to that type, sends a read command to the retrieval servers in its cluster subsystem: when the keyword is a high frequency keyword, it sends to the retrieval servers a command to read the part of the index entries of the high frequency keyword that each of them stores; when the keyword is a low frequency keyword, it sends to one of the retrieval servers a command to read all index entries of the low frequency keyword that it stores.
  • The cluster proxy server summarizes the returned search results to determine the search result of the keyword to be retrieved, and returns the summarized search result to the upper cache proxy server.
  • Each retrieval server is a separate process that can reside on a hardware server. It is the basic retrieval unit: under the scheduling of the upper cluster proxy server, it performs the basic underlying retrieval operations, including reading the inverted entries of keywords according to the read command from the cluster proxy server and returning them to the cluster proxy server.
  • On receiving a command to read the part of the index entries of a high frequency keyword that it stores, the retrieval server reads that part; on receiving a command to read all index entries of a low frequency keyword that it stores, it reads all index entries of the low frequency keyword.
  • The retrieval server further performs logical operations such as "and", "or", and "not" on the inverted entries of at least two keywords to determine the index entries corresponding to the at least two keywords.
  • In this way, the retrieval speed can be remarkably improved.
  • The total number of unary, binary, and ternary morphemes that hit more than 1,000 documents does not exceed 500,000. It can then be inferred that in 100 million documents, the number of morphemes hitting 6000-10000 documents will not exceed 500,000.
  • The document identifier is stored in 8 bytes, and 3-byte storage is also used.
  • The storage space of the inverted entry of the keyword is 64k. When the keyword hits 10000 documents, the inverted entry of the keyword has a storage space of 128k and a read time of 8 milliseconds.
  • The inverted entries are divided according to their storage space, which includes the space for the document identifier, the weight, and the position offsets. For a morpheme whose storage space is 64k or more, the inverted entries of the morpheme are stored by a plurality of retrieval servers; for a morpheme whose storage space is 64k or less, all the inverted entries of the morpheme are stored by one retrieval server.
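  • The storage decision based on the 64k boundary can be sketched as below. The per-field byte widths for the weight and the position offsets are assumptions, as is reading "64k" as 64 × 1024 bytes; the text leaves the case of exactly 64k ambiguous, and this sketch arbitrarily treats it as the multi-server case.

```python
THRESHOLD_BYTES = 64 * 1024  # "64k" interpreted as 64 KiB (assumption)


def entry_size_bytes(offsets_per_document, id_bytes=8, weight_bytes=4, offset_bytes=4):
    """Estimate the storage space of one keyword's inverted entries.

    offsets_per_document holds one list of position offsets per hit document.
    weight_bytes and offset_bytes are assumed widths, not values from the text.
    """
    return sum(id_bytes + weight_bytes + offset_bytes * len(offsets)
               for offsets in offsets_per_document)


def servers_for_entry(size_bytes, n):
    """Number of retrieval servers that store the inverted entries."""
    return n if size_bytes >= THRESHOLD_BYTES else 1


if __name__ == "__main__":
    size = entry_size_bytes([[1, 2], [3]])
    print(size, servers_for_entry(size, 3))
    print(servers_for_entry(128 * 1024, 3))
```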
  • The flowchart for retrieval using the retrieval model in FIG. 7 is shown in FIG. 8.
  • In this embodiment, at least two keywords are included in the search string.
  • Step 801: The cache proxy server parses the search string to be retrieved to generate a search expression consisting of keywords.
  • Step 802: The cluster proxy server determines the type of each keyword in the search expression, and sends a command to read the inverted entries to the retrieval servers according to the type of each keyword.
  • Step 803: After receiving the read command, the retrieval server reads the inverted entries of the keyword.
  • Step 804: The retrieval server performs logical operations on the inverted entries of the at least two keywords. For example, logical operations are performed on the document numbers in the inverted entries to obtain the result of the logical operation on the keywords.
  • Step 805: Each retrieval server sends the result of the logical operation to the upper cluster proxy server, which aggregates the results to obtain an intermediate result.
  • Step 806: Each cluster proxy server sends the intermediate result to the upper cache proxy server, which summarizes the intermediate results and outputs the final result.
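  • Steps 803 to 806 can be sketched as a three-level aggregation. This is an illustration only: the function names are hypothetical, results are represented as sets of document identifiers, only the AND operation is shown, and a real system would communicate between processes rather than call functions directly.

```python
def retrieval_server(entries):
    """Steps 803-804: AND the document sets of all keywords read by one server."""
    result = None
    for docs in entries:
        result = set(docs) if result is None else result & set(docs)
    return result or set()


def cluster_proxy(server_entries):
    """Step 805: aggregate the results of all retrieval servers in one cluster."""
    intermediate = set()
    for entries in server_entries:
        intermediate |= retrieval_server(entries)
    return intermediate


def cache_proxy(clusters):
    """Step 806: summarize the intermediate results of all cluster proxy servers."""
    final = set()
    for server_entries in clusters:
        final |= cluster_proxy(server_entries)
    return sorted(final)


if __name__ == "__main__":
    clusters = [
        [[[1, 2], [2, 3]], [[5], [6]]],  # cluster with two retrieval servers
        [[[7, 8], [8]]],                 # cluster with one retrieval server
    ]
    print(cache_proxy(clusters))
```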
  • FIG. 9 shows the structure of a retrieval server in the embodiment of the present invention.
  • In this embodiment, the search string to be retrieved includes at least two keywords.
  • The retrieval server includes a retrieval interpretation module 910, a read management module 920, a keyword reading module 930, a logical operation module 940, and an identifier conversion module 950.
  • The retrieval interpretation module 910 is configured to parse the search string to be retrieved to generate a search expression composed of keywords, and can be called by the upper-layer server.
  • The read management module 920 is configured to receive at least one of a command to read the part of the index entries of a high frequency keyword stored by the server itself and a command to read all index entries of a low frequency keyword stored by the server itself.
  • The keyword reading module 930 is configured to: on receiving a command to read the part of the index entries of the high frequency keyword stored by the server itself, read that part of the index entries; on receiving a command to read all index entries of the low frequency keyword stored by the server itself, read all index entries of the low frequency keyword.
  • The inverted entry includes the document number generated by compressing the document identifier.
  • The logical operation module 940 is configured to, when at least two keywords having a logical relationship are to be retrieved, perform logical operations on the index entries corresponding to the at least two keywords according to that logical relationship, and determine the index entries corresponding to the at least two keywords.
  • The identifier conversion module 950 is configured to compress the eight-byte document identifier in the inverted entry of the keyword into a four-byte document number.
  • In the embodiments above, the indexing method for the documents is an inverted index.
  • The corresponding index entries are inverted list entries. This is only an example and is not intended to limit the present invention.
  • Other indexing methods may be used, in which case the index entries corresponding to that indexing method are read.
  • The inverted list entries of a high-frequency keyword are stored across multiple servers in the cluster and are read in parallel by those servers during retrieval.
  • A very large number of inverted list entries can therefore be read within the time the system is designed for, the time cost of a single logical operation is not delayed, and retrieval speed improves.
  • All inverted list entries of a low-frequency keyword are stored by one retrieval server and are read only by that server during retrieval. There is thus no need to read small numbers of inverted list entries separately on multiple retrieval servers, which saves the storage resources of the retrieval servers in the cluster and improves retrieval speed.
  • Applying the embodiments of the present invention effectively increases the degree of coupling between the retrieval servers in a retrieval cluster and improves dynamic resource allocation between servers. Planning the resources of the retrieval servers in the cluster as a whole ensures the overall concurrency of the cluster to the greatest extent, which further improves retrieval speed.

Abstract

A retrieval method is provided, which includes: determining the type of a keyword to be retrieved; when the keyword is a high-frequency keyword, reading, by each of N retrieval servers, the part of the index entries of the high-frequency keyword that is stored on that server, where N is an integer greater than 1; when the keyword is a low-frequency keyword, reading, by one of the N retrieval servers, all index entries of the low-frequency keyword, which are stored on that server; and determining, according to the index entries read, the documents that contain the keyword to be retrieved. A retrieval system and a retrieval server are also provided. The solution effectively improves retrieval speed.

Description

Retrieval method, retrieval system and retrieval server
TECHNICAL FIELD
The present invention relates to the field of communications technologies, and in particular to a retrieval method, a retrieval system, and a retrieval server.
BACKGROUND OF THE INVENTION
To perform a search, a user enters a search string, which usually contains one or more keywords. When the keywords are separated by spaces, each space denotes an "AND" retrieval operation between the adjacent keywords. Each keyword may consist of one or more morphemes. A morpheme is the smallest linguistic unit that can express independent meaning, usually a Chinese word produced by a word segmentation system. A word segmentation system may split a keyword into a varying number of morphemes: a keyword split into two morphemes is a binary compound morpheme, and a keyword split into three morphemes is a ternary compound morpheme. During retrieval, the set of all documents containing the input search string must be found in a short time, and the document set is displayed as a list of document identifiers.
Among Internet search engine technologies, back-end retrieval clustering is one of the core technologies. It governs the cooperation among multiple retrieval servers so that retrieval services can be provided for larger data sets. The number of documents a single retrieval server can manage is limited: if too many documents are stored, the system can hardly return the required results within a time acceptable to the user during normal retrieval. Users usually accept a delay of no more than 1 second, so a retrieval cluster composed of multiple retrieval servers is needed to support retrieval over a larger data set.
The main operation in retrieval is access to the inverted index. An inverted index is a data structure used to speed up the retrieval of search strings. It may exist as a disk file or be loaded into memory, and it consists of at least a dictionary file and an inverted list file. The inverted list file holds multiple inverted list entries, each of which records the correspondence between a keyword of the search string and documents. Reading inverted list entries faster therefore improves retrieval efficiency accordingly. The time to read inverted list entries from the inverted list file comprises the disk seek time of each access and the time needed to read the data. When little data is read, the read time depends mainly on the disk seek time; when much data is read, it depends mainly on the time needed to read the data.
An existing distributed index file retrieval model based on document partitioning is shown in FIG. 1. The system includes one retrieval proxy server and multiple parallel retrieval servers managed by it. Each retrieval server is assigned 1/N of the full document set, where N is the total number of retrieval servers. In the indexing phase, the parallel retrieval servers complete the indexing tasks on their own servers in parallel. In the retrieval phase, the retrieval proxy server sends the read request to every retrieval server at the same time; each retrieval server performs its local retrieval and returns the results to the retrieval proxy server, which finally merges the results of all retrieval servers according to a specific weight-based ranking. A retrieval system based on document partitioning thus has an independent structure: coupling between retrieval servers is low, and each retrieval server amounts to a retrieval subsystem that can be loaded independently. In Internet retrieval services, however, most search strings consist of two or more keywords, so a retrieval server must match document identifiers for each keyword and then match position offsets within the documents, which causes multiple disk input/output accesses. Moreover, when the search string contains high-frequency morphemes, the document identifier lists and position offset lists to be read are very large. For example, the inverted list entries of high-frequency morphemes such as "中国" ("China"), "网" ("net"), and "我们" ("we") usually account for a large proportion of the whole inverted index, and this index data cannot be read in a short time. Most of the retrieval time is therefore spent on file input/output reads, which lowers the overall concurrency of the retrieval system and slows its retrieval and response for search strings.
An existing distributed index file retrieval model based on index-term partitioning is shown in FIG. 2. The system includes one retrieval proxy server and N parallel groups of retrieval servers managed by it, where N is an integer greater than 1, and each group is assigned 1/N of the full document set. Each group contains three retrieval servers. Usually, according to a hash value taken modulo the number of servers, the different inverted list entries corresponding to the same index keyword are stored on different retrieval servers. For example, if hash("中国") % 3 = 1, the inverted list entry data block of the index keyword "中国" ("China") is stored on retrieval server No. 1 of the group. In this way, the inverted list entries of all index keywords that were originally stored on a single retrieval server are distributed evenly over three retrieval servers, which speeds up access to the inverted list entries. In a retrieval system based on index-term partitioning, however, when the search string includes two or more keywords, a single retrieval server in a group cannot complete the retrieval independently and must cooperate with the other retrieval servers in the group. This increases data coupling between retrieval servers, complicates data backup, and reduces retrieval speed. In addition, operations must be carried out each time a retrieval is completed, which increases the communication traffic between the retrieval servers.
SUMMARY OF THE INVENTION
Embodiments of the present invention provide a retrieval method, a retrieval system, and a retrieval server that can improve retrieval speed.
A retrieval method, including:
determining the type of a keyword to be retrieved;
when the keyword is a high-frequency keyword, reading, by each of n retrieval servers, the part of the index entries of the high-frequency keyword stored on that server, where n is an integer greater than 1;
when the keyword is a low-frequency keyword, reading, by one of the n retrieval servers, all index entries of the low-frequency keyword stored on that server; and
determining a retrieval result for the keyword to be retrieved according to the index entries read.
A retrieval system, including:
a cluster proxy server, configured to: determine the type of a keyword to be retrieved; when the keyword is a high-frequency keyword, send to each of n retrieval servers a command to read the part of the index entries of the high-frequency keyword stored on that server; when the keyword is a low-frequency keyword, send to one of the n retrieval servers a command to read all index entries of the low-frequency keyword stored on that server, where n is an integer greater than 1; and determine a retrieval result for the keyword to be retrieved according to the index entries read by the retrieval servers; and
the retrieval servers, each configured to: read a part of the index entries of the high-frequency keyword on receiving a command to read the part of the index entries of the high-frequency keyword stored on that server; and read all index entries of the low-frequency keyword on receiving a command to read all index entries of the low-frequency keyword stored on that server.
A retrieval server, including:
a read management module, configured to receive at least one of a command to read a part of the index entries of a high-frequency keyword stored on the server and a command to read all index entries of a low-frequency keyword stored on the server; and
a keyword reading module, configured to read a part of the index entries of the high-frequency keyword on receiving the command to read the part of the index entries of the high-frequency keyword stored on the server, and to read all index entries of the low-frequency keyword on receiving the command to read all index entries of the low-frequency keyword stored on the server.
A cluster proxy server, including:
a first module, configured to determine the type of a keyword to be retrieved;
a second module, configured to: when the keyword is a high-frequency keyword, send to each of n retrieval servers a command to read the part of the index entries of the high-frequency keyword stored on that server; and when the keyword is a low-frequency keyword, send to one of the n retrieval servers a command to read all index entries of the low-frequency keyword stored on that server, where n is an integer greater than 1; and
a third module, configured to determine a retrieval result for the keyword to be retrieved according to the index entries read by the retrieval servers.
As the above technical solutions show, in the embodiments of the present invention, on the one hand, the inverted list entries of a high-frequency keyword are stored across multiple servers in the cluster and are read in parallel by those servers during retrieval. A very large number of inverted list entries can therefore be read within the time the system is designed for, and the time cost of a single logical operation is not delayed in the subsequent logical operations, which improves retrieval speed. On the other hand, all inverted list entries of a low-frequency keyword are stored on one retrieval server and are read only by that server during retrieval. There is thus no need to read small numbers of inverted list entries separately on multiple retrieval servers, which saves the storage resources of the retrieval servers in the cluster and improves retrieval speed.
In addition, applying the embodiments of the present invention effectively increases the degree of coupling between the retrieval servers within a retrieval cluster and improves the ability to allocate resources dynamically between servers. By planning the memory, disk input/output, and CPU (central processing unit) resources of the retrieval servers in the cluster as a whole, the overall concurrency of the cluster is ensured to the greatest extent, which further improves retrieval speed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of an existing distributed index file retrieval model based on document partitioning;
FIG. 2 is a schematic diagram of an existing distributed index file retrieval model based on index-term partitioning;
FIG. 3 is a flowchart of a retrieval method in an embodiment of the present invention;
FIG. 4 is a flowchart of a retrieval method in another embodiment of the present invention;
FIG. 5 is a schematic diagram of retrieving a specific search string using the method of the present invention;
FIG. 6 is a structural diagram of a retrieval system in an embodiment of the present invention;
FIG. 7 is a schematic diagram of a retrieval model using the retrieval system in an embodiment of the present invention;
FIG. 8 is a flowchart of retrieval using the retrieval model of FIG. 7;
FIG. 9 is a structural diagram of a retrieval server in an embodiment of the present invention.
MODE FOR CARRYING OUT THE INVENTION
To help those skilled in the art better understand the solutions of the present invention, and to make the above objects, features, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and specific embodiments.
The flow of the retrieval method in an embodiment of the present invention is shown in FIG. 3.
Step 301: Parse the search string to be retrieved and generate a search expression composed of keywords.
Step 302: Send read requests for the keywords to each retrieval server in the cluster. The read request for a keyword includes a read-ahead request for the keyword's inverted list entries.
The inverted list entry of a keyword is an array that records the identifiers of all documents containing the keyword. For each such document, the array holds the identifier of the document, the weight of the keyword in the document, and the position offsets of the keyword in the document. Its basic structure is:
$$t \quad \langle d_1, w_{d_1,t}, loc_1, loc_2, \ldots, loc_{f_{d_1,t}} \rangle \, \langle d_2, \ldots \rangle \, \ldots \, \langle d_{f_t}, \ldots \rangle$$
Here $t$ denotes a keyword in the search string, $d_i$ denotes the identifier of one of the documents containing the keyword $t$, $w_{d_i,t}$ denotes the weight of the keyword $t$ in document $d_i$, and $loc_i$ denotes a position offset at which the keyword $t$ occurs in the current document, usually stored in two bytes. With the inverted list entries, a keyword of the search string can be looked up quickly. Usually the inverted index file for a search string consists of N inverted list entries, where N is the total number of keywords in the search string.
Step 303: The retrieval servers in the cluster read the inverted list entries of a keyword according to how frequently the keyword hits documents.
According to how frequently they hit documents, the keywords in a search expression can be divided into high-frequency keywords, which comprise ultra-high-frequency keywords and medium-high-frequency keywords, and low-frequency keywords.
In an embodiment of the present invention, statistics may be collected on the index data before retrieval to determine the number of documents hit by each keyword, and the type of a keyword to be retrieved is determined according to a preset document frequency threshold. When the keyword is an ultra-high-frequency and/or medium-high-frequency keyword, its inverted list entries are partitioned and stored jointly by the retrieval servers in the cluster, each retrieval server storing a part of the entries. For example, when the cluster includes n retrieval servers, all index entries of the high-frequency keyword are divided into n parts, and the m-th retrieval server stores the m-th part of the index entries, where n is an integer greater than 1 and m is an integer greater than 1 and less than or equal to n. When the keyword is a low-frequency keyword, all of its inverted list entries are stored by one retrieval server in the cluster. For example, all low-frequency keywords are divided into n parts, and the m-th retrieval server stores all index entries of the m-th part of the low-frequency keywords.
In the retrieval phase, when the keyword is an ultra-high-frequency and/or medium-high-frequency keyword, each retrieval server in the cluster reads the inverted list entries of that high-frequency keyword stored on itself; when the keyword is a low-frequency keyword, the retrieval server that stores its inverted list entries reads all of them.
Where the cluster contains n retrieval servers, n being an integer greater than 1, the inverted list entries of a keyword are partitioned as follows. The document identifiers in the inverted list entries of a high-frequency keyword are taken modulo n, and the entries with the same residue are stored as a group on the retrieval server corresponding to that residue; in the retrieval phase, the retrieval server corresponding to the residue reads the entries with that residue. Similarly, the word identifier (word ID) of each low-frequency keyword is taken modulo n, and the low-frequency keywords with the same residue form a group that is stored by one retrieval server.
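The partitioning rule described above can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the patent's implementation: the `Posting` type, the `build_shards` function, the in-memory dictionaries, and the rule that a keyword is high-frequency when it hits at least `high_freq_threshold` documents are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Posting:
    """One inverted list entry: document identifier, weight, position offsets."""
    doc_id: int
    weight: float
    offsets: Tuple[int, ...]


def build_shards(
    index: Dict[int, List[Posting]],  # word ID -> inverted list entries
    n: int,                           # number of retrieval servers in the cluster
    high_freq_threshold: int,         # hypothetical threshold on documents hit
) -> List[Dict[int, List[Posting]]]:
    """Return one word ID -> entries mapping per retrieval server."""
    shards = [defaultdict(list) for _ in range(n)]
    for word_id, postings in index.items():
        if len(postings) >= high_freq_threshold:
            # High-frequency keyword: split its entries by doc_id mod n.
            for posting in postings:
                shards[posting.doc_id % n][word_id].append(posting)
        else:
            # Low-frequency keyword: keep all entries together, by word ID mod n.
            shards[word_id % n][word_id].extend(postings)
    return [dict(shard) for shard in shards]
```

With this layout a high-frequency keyword is read by all n servers in parallel, while a low-frequency keyword costs a single read on a single server.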
Further, in an embodiment of the present invention, the retrieval server compresses the eight-byte document identifier in a keyword's inverted list entries into a four-byte document number.
Step 304: The retrieval servers in the cluster perform logical operations on the inverted list entries of the keywords and output the retrieval result.
When the search string to be retrieved includes both high-frequency and low-frequency keywords, logical operations are performed on the inverted list entries of the different keywords. Specifically, the retrieval server that stores the inverted list entries of a low-frequency keyword takes the document identifiers of those entries modulo n and sends the entries for each residue to the retrieval server corresponding to that residue. Each retrieval server in the cluster then performs the logical operations on the inverted list entries of the high-frequency and low-frequency keywords, and the results of the logical operations of all retrieval servers are aggregated to obtain the retrieval result for the search string.
The logical operation may be any one or any combination of AND, OR, and NOT operations.
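As a rough illustration of step 304 for a query that ANDs one high-frequency and one low-frequency keyword, the sketch below routes the low-frequency entries by document identifier modulo n, intersects locally on each server, and merges the partial results. The function name and the use of plain sets of document identifiers are assumptions made for brevity; weights and position offsets are left out.

```python
from typing import List, Set


def and_high_low(high_parts: List[Set[int]], low_docs: Set[int]) -> Set[int]:
    """AND a high-frequency keyword with a low-frequency keyword.

    high_parts[m] holds the document identifiers of the high-frequency keyword
    stored on server m (those with doc_id % n == m). low_docs holds all document
    identifiers of the low-frequency keyword, stored on a single server.
    """
    n = len(high_parts)

    # The server owning the low-frequency keyword groups its entries by
    # doc_id mod n and sends each group to the server with that residue.
    routed: List[Set[int]] = [set() for _ in range(n)]
    for doc_id in low_docs:
        routed[doc_id % n].add(doc_id)

    # Each server performs the AND on the entries it now holds.
    partial = [high_parts[m] & routed[m] for m in range(n)]

    # The per-server results are aggregated into the final result.
    return set().union(*partial)
```

Routing by the same residue that was used to split the high-frequency keyword guarantees that any document present in both lists meets on exactly one server.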
The flow of the retrieval method in another embodiment of the present invention is shown in FIG. 4. In this embodiment each cluster contains n retrieval servers, where n is an integer greater than 1.
Step 401: Parse the search string to be retrieved and generate a search expression composed of keywords.
The search string entered by a user may be a short sentence or several keywords. Such a search string is a raw character string that has not been formatted for the computer; parsing it produces a search expression that the computer can recognize. A search expression may contain one or more keywords. If the keywords entered by the user are not separated by delimiters, a logical AND relationship holds between the keywords after parsing. If the keywords are separated by delimiters, then keywords separated by a space are combined with an "AND" retrieval operation, keywords separated by "T" are combined with an "OR" operation, and a "!" before a keyword denotes a "NOT" operation on that keyword.
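The delimiter rules can be sketched as a small parser. Several points here are assumptions rather than statements from the source: the OR delimiter is not clearly legible in the text, so the sketch takes it as a parameter and uses "|" purely as a placeholder; AND is assumed to bind more tightly than OR; and the function name and output shape are hypothetical. Word segmentation of unseparated input is outside the scope of the sketch.

```python
from typing import List, Tuple

# A term is (keyword, negated); a group is a list of terms that are ANDed;
# the parsed expression is a list of groups that are ORed.
Term = Tuple[str, bool]


def parse_search_string(raw: str, or_sep: str = "|") -> List[List[Term]]:
    """Parse a raw search string into an OR of AND-groups.

    or_sep is a placeholder for the OR delimiter (an assumption).
    Spaces mean AND and a leading "!" means NOT.
    """
    groups: List[List[Term]] = []
    for part in raw.split(or_sep):
        terms: List[Term] = []
        for token in part.split():
            negated = token.startswith("!")
            keyword = token[1:] if negated else token
            if keyword:
                terms.append((keyword, negated))
        if terms:
            groups.append(terms)
    return groups
```

For example, `parse_search_string("a b|!c")` yields two groups: one requiring both `a` and `b`, and one requiring that `c` be absent.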
In this embodiment, the search string is assumed to include both high-frequency and low-frequency keywords.
Step 402: Send read requests for the keywords to each retrieval server in the cluster. The read request for a keyword includes a read-ahead request for the keyword's inverted list entries.
Step 403: Determine whether the keyword is a high-frequency keyword or a low-frequency keyword. If it is a high-frequency keyword, go to step 404; if it is a low-frequency keyword, go to step 405.
According to the number of inverted list entries of a keyword, that is, the number of documents the keyword hits, the keywords in the search expression are divided into high-frequency keywords and low-frequency keywords. High-frequency keywords may be further divided into medium-high-frequency keywords and ultra-high-frequency keywords.
In an embodiment of the present invention, statistics may be collected on the index data before retrieval to determine the number of documents hit by each keyword, that is, the number of inverted list entries of the keyword, and the type of a keyword to be retrieved is determined according to a preset document frequency threshold.
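A minimal sketch of the classification in step 403 is given below. The source only states that a preset document frequency threshold is used; the two threshold parameters, their inclusive comparison, and the label strings are assumptions introduced for illustration.

```python
from typing import Dict


def classify_keywords(
    doc_counts: Dict[str, int],  # keyword -> number of documents it hits
    high_threshold: int,         # hypothetical lower bound for high-frequency
    ultra_threshold: int,        # hypothetical lower bound for ultra-high-frequency
) -> Dict[str, str]:
    """Label each keyword as 'ultra-high', 'medium-high', or 'low' frequency."""
    if ultra_threshold < high_threshold:
        raise ValueError("ultra_threshold must not be below high_threshold")
    labels: Dict[str, str] = {}
    for keyword, count in doc_counts.items():
        if count >= ultra_threshold:
            labels[keyword] = "ultra-high"
        elif count >= high_threshold:
            labels[keyword] = "medium-high"
        else:
            labels[keyword] = "low"
    return labels
```

Both "ultra-high" and "medium-high" keywords would then follow the high-frequency path of step 404, and "low" keywords the path of step 405.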
Step 404: Each of the n retrieval servers in the cluster reads its part of the inverted list entries of the high-frequency keyword; then go to step 407.
The inverted list entries of a high-frequency keyword can be read with a technique similar to a disk RAID (redundant array of independent disks) system: the very large set of inverted list entries of the high-frequency keyword is stored across the n retrieval servers of the cluster and is read by them in parallel during retrieval. The very large set of entries can thus be read within the time the system is designed for, and the time cost of a single logical operation is not delayed in the subsequent logical operations.
Step 405: The retrieval server in the cluster that stores the low-frequency keyword to be retrieved reads all inverted list entries of that keyword.
The inverted list entries of a low-frequency keyword are read by one retrieval server in the cluster, which avoids the existing situation in which a small number of entries is read separately on each of multiple retrieval servers.
The data block holding the inverted list entries of a low-frequency keyword is usually smaller than the minimum read block of the disk, for example 64K, and the disk takes the same time to read any data block smaller than 64K. In the prior art, the inverted list entries of a low-frequency keyword are split into n blocks that are read by n servers, which does not speed up the read and wastes the resources of multiple retrieval servers in the cluster. Applying the embodiments of the present invention effectively avoids this problem.
This embodiment further includes the retrieval server compressing, when the index is built, the eight-byte document identifier in a keyword's inverted list entries into a four-byte document number.
The document identifier in an inverted list entry is used to locate the document. Every web page on the Internet has a unique URL (uniform resource locator). Applying a signature algorithm to the URL string of a web page yields a globally unique 64-bit (8-byte) integer for that URL string, which serves as the document identifier of the document. Because the number of web pages on the Internet is huge, these document identifiers occupy a large amount of storage. In this embodiment, storing the inverted list entries of keywords on n retrieval servers is equivalent to distributing different documents to different retrieval servers, so each retrieval server receives a certain number of documents. Suppose this number is N, where N is an integer greater than 0. Each retrieval server then renumbers the documents assigned to it, converting their document identifiers into integers from 0 to N-1, which serve as the document numbers of those documents. For the same document, the document number is much shorter than the original document identifier, which saves storage space and increases read speed.
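The identifier compression can be sketched as follows. The source does not name the signature algorithm, so `hashlib.blake2b` with an 8-byte digest is used here only as a stand-in; the function names and the choice to number documents in sorted order of their identifiers are likewise assumptions.

```python
import hashlib
import struct
from typing import Dict, Iterable


def url_to_doc_id(url: str) -> int:
    """Derive a 64-bit document identifier from a URL.

    blake2b with an 8-byte digest stands in for the unspecified signature algorithm.
    """
    digest = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def assign_doc_numbers(doc_ids: Iterable[int]) -> Dict[int, int]:
    """Map the document identifiers held by one server to numbers 0..N-1."""
    return {doc_id: number for number, doc_id in enumerate(sorted(set(doc_ids)))}


def pack_doc_number(number: int) -> bytes:
    """Store a document number in four bytes (unsigned, little-endian)."""
    return struct.pack("<I", number)
```

A server would keep the mapping returned by `assign_doc_numbers` (or its inverse) so that document numbers in the final result can be converted back to document identifiers; four bytes suffice as long as one server holds fewer than 2^32 documents.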
Step 406: Apply the modulo operation to the document numbers of the low-frequency keyword's inverted list entries, and send each entry to the retrieval server corresponding to its modulo value.
Step 407: The n retrieval servers in the cluster perform logical operations on the inverted list entries that have been read.
The logical operations in this step follow the logical relationships between the keywords in the search string to be retrieved, and include one or any combination of the AND, OR, and NOT operations.
Step 408: Aggregate the logical operation results of the n retrieval servers to obtain the retrieval result of the search string.
The above embodiment is described using a search string that contains both high-frequency and low-frequency keywords as an example. In practice, when the search string contains only high-frequency keywords, steps 405 to 406 need not be performed; when the search string contains only low-frequency keywords, step 404 need not be performed.
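The AND, OR, and NOT operations of step 407 can be sketched as set operations over document numbers. This is an illustrative sketch only: the document numbers are hypothetical, and NOT is assumed here to mean "in the first list but not in the second", which the text does not spell out.

```python
def and_op(a, b):
    """Documents present in both lists."""
    return sorted(set(a) & set(b))


def or_op(a, b):
    """Documents present in either list."""
    return sorted(set(a) | set(b))


def not_op(a, b):
    """Documents in a that are absent from b (assumed reading of NOT)."""
    return sorted(set(a) - set(b))


# Hypothetical document numbers held by one retrieval server.
first = [1, 4, 7, 10]
second = [4, 5, 10]

print(and_op(first, second))  # [4, 10]
print(or_op(first, second))   # [1, 4, 5, 7, 10]
print(not_op(first, second))  # [1, 7]
```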
The following description uses the search string "China Xu Jianjun" as an example. The retrieval flow for the search string "China Xu Jianjun" is shown in FIG. 5. The cluster contains three retrieval servers: retrieval server 0, retrieval server 1, and retrieval server 2.
First, the search string "China Xu Jianjun" is parsed to generate a retrieval expression composed of the keywords "China" and "Xu Jianjun".
Second, the three retrieval servers in the cluster determine the type of each keyword to be retrieved according to the number of documents the keyword hits, and read the keyword's inverted list entries according to its type.
Here, "China" is a high-frequency keyword that appears very often in documents, whereas "Xu Jianjun", being a specific personal name, is a low-frequency keyword that appears rarely in documents, provided the person is not a celebrity.
In this embodiment, assume that the document number list in the inverted list entries of the high-frequency keyword "China" is {16, 38, 100, 207, 319, 872, 903, 1081, 2331, 5618}, and that the document number list in the inverted list entries of the low-frequency keyword "Xu Jianjun" is {38, 295, 307, 971, 2331}.
Because each of the three retrieval servers in the cluster stores a part of the inverted list entries of the high-frequency keyword "China", the three servers each read their own part. Each document number of "China" is taken modulo 3, and the retrieval server corresponding to each modulo value reads the inverted list entries with that value. For example, document number 16 modulo 3 is 1, so retrieval server 1 reads the inverted list entry of document number 16. Accordingly, retrieval server 0 reads the inverted list entries with document numbers {207, 903, 2331}, retrieval server 1 reads those with document numbers {16, 100, 319, 1081}, and retrieval server 2 reads those with document numbers {38, 872, 5618}.
Each of the three retrieval servers in the cluster stores all inverted list entries of a different set of low-frequency keywords. Assume that all inverted list entries of the low-frequency keyword "Xu Jianjun" are stored on retrieval server 2. Retrieval server 2 then reads all inverted list entries of "Xu Jianjun", namely those with document numbers {38, 295, 307, 971, 2331}.
Third, after the retrieval servers in the cluster finish reading the keywords' inverted list entries, the inverted list entries of the low-frequency keyword "Xu Jianjun" are distributed to the three retrieval servers in the cluster.
The document numbers of the low-frequency keyword are taken modulo 3, and the inverted list entries for each modulo value are sent to the retrieval server corresponding to that value. In this embodiment, for the low-frequency keyword "Xu Jianjun", the inverted list entry with document number {2331} is sent to retrieval server 0, the entries with document numbers {295, 307} are sent to retrieval server 1, and the entries with document numbers {38, 971} are sent to retrieval server 2, giving the intermediate results of the retrieval.
Finally, each of the three servers in the cluster performs an AND operation on the inverted list entries of the high-frequency keyword "China" and the low-frequency keyword "Xu Jianjun", and obtains its retrieval result.
After the AND operation, the retrieval result of retrieval server 0 is the document with document number 2331, the retrieval result of retrieval server 1 is empty, and the retrieval result of retrieval server 2 is the document with document number 38. Aggregating the results of the three retrieval servers gives the retrieval result for the search string "China Xu Jianjun": the documents with document numbers {2331, 38}.
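The worked example above can be reproduced with a short simulation. The document number lists are the ones assumed in the text; the function name `partition_by_modulo` and the in-memory dictionaries standing in for the three retrieval servers are illustrative assumptions.

```python
from collections import defaultdict

N_SERVERS = 3
CHINA = [16, 38, 100, 207, 319, 872, 903, 1081, 2331, 5618]  # high-frequency keyword
XU_JIANJUN = [38, 295, 307, 971, 2331]                       # low-frequency keyword


def partition_by_modulo(doc_numbers, n):
    """Group document numbers by (document number mod n), one group per server."""
    parts = defaultdict(list)
    for number in doc_numbers:
        parts[number % n].append(number)
    return {server: parts.get(server, []) for server in range(n)}


# Each server reads its own part of the high-frequency keyword's entries.
china_parts = partition_by_modulo(CHINA, N_SERVERS)
# Server 2 reads all low-frequency entries, which are then redistributed by modulo.
xu_parts = partition_by_modulo(XU_JIANJUN, N_SERVERS)

# Each server performs the AND operation on the entries it now holds.
local_results = {
    server: sorted(set(china_parts[server]) & set(xu_parts[server]))
    for server in range(N_SERVERS)
}

# The per-server results are aggregated into the final retrieval result.
final_result = [n for server in range(N_SERVERS) for n in local_results[server]]

print(local_results)  # {0: [2331], 1: [], 2: [38]}
print(final_result)   # [2331, 38]
```

Because both keywords are partitioned by the same modulo rule, any document that contains both keywords lands on a single server, so each AND operation can run locally without further communication.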
FIG. 6 shows the retrieval system in an embodiment of the present invention.
As shown in FIG. 6, the retrieval system in this embodiment includes a cache proxy server 610, a cluster proxy server 620, and a retrieval server 630.
The cache proxy server 610 is configured to parse the search string to be retrieved and generate a retrieval expression composed of keywords, to receive the retrieval result from the cluster proxy server 620, and to output the retrieval result as required. The cluster proxy server 620 is configured to receive the retrieval expression from the cache proxy server 610, determine the type of each keyword in the retrieval expression, and send read commands to the retrieval server 630 according to the keyword type; it also receives the retrieval result from the retrieval server 630 and sends it to the cache proxy server 610. The retrieval server 630 is configured to read the inverted list entries of a keyword according to the read command from the cluster proxy server 620, determine the retrieval result of the keyword to be retrieved, and return the retrieval result to the cluster proxy server 620. When the search string includes at least two keywords, the retrieval server 630 is further configured to perform, after obtaining the inverted list entries of each keyword, logical operations on the inverted list entries of the at least two keywords and determine the retrieval result corresponding to the at least two keywords.
A schematic diagram of a retrieval model using the system of the present invention is shown in FIG. 7. The cache proxy server, the cluster proxy servers, and the retrieval servers in the diagram are arranged in a tree. The system includes one cache proxy server, under which n cluster proxy servers are connected. Under each cluster proxy server n retrieval servers are connected, and each group of n retrieval servers forms a cluster retrieval subsystem.
The cache proxy server is an independent process that can reside on a hardware server. During retrieval, the cache proxy server caches query requests for externally input search strings, and parses the search string to be retrieved to generate a retrieval expression composed of keywords. For example, the cache proxy server can call the retrieval interpreter in a retrieval server to parse the externally input search string into a machine-readable retrieval expression. After each cluster retrieval subsystem returns its retrieval result to its cluster proxy server, the cache proxy server aggregates the results of all cluster proxy servers and returns them to the external user.
The cluster proxy server is an independent process that can reside on a hardware server. During retrieval, the cluster proxy server determines the type of each keyword in the retrieval expression and sends read commands to the retrieval servers in the cluster subsystem according to that type. When the keyword is a high-frequency keyword, it sends each retrieval server a command to read the part of the keyword's index entries that the server itself stores. When the keyword is a low-frequency keyword, it sends one of the retrieval servers a command to read all of the keyword's index entries that the server itself stores. When the retrieval servers return their retrieval results, the cluster proxy server aggregates the returned results, determines the retrieval result of the keyword to be retrieved, and returns the aggregated result to the cache proxy server above it.
Each retrieval server is an independent process that can reside on a hardware server. It is the most basic retrieval unit and, under the scheduling of the cluster proxy server above it, performs the basic low-level retrieval operations, including reading a keyword's inverted list entries according to the cluster proxy server's read command and returning them to the cluster proxy server. When it receives a command to read the part of a high-frequency keyword's index entries that it stores, it reads that part; when it receives a command to read all index entries of a low-frequency keyword that it stores, it reads all of those entries. When the search string includes at least two keywords, the retrieval server also performs the corresponding logical operations, such as AND, OR, and NOT, on the inverted list entries of the at least two keywords and determines the index entries corresponding to the at least two keywords.
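The dispatch rule of the cluster proxy server can be sketched as below. This is a sketch under stated assumptions: the threshold `HIGH_FREQUENCY_THRESHOLD`, the `ReadCommand` structure, and the example hit counts and word IDs are all hypothetical. Selecting the owning server for a low-frequency keyword by word ID modulo n follows the partitioning rule stated in claim 3.

```python
from dataclasses import dataclass

HIGH_FREQUENCY_THRESHOLD = 5000  # hypothetical cut-off on the number of documents hit


@dataclass(frozen=True)
class ReadCommand:
    server: int   # index of the target retrieval server
    keyword: str
    scope: str    # "partial" for a high-frequency keyword, "all" for a low-frequency one


def keyword_type(hit_count: int) -> str:
    """Classify a keyword by the number of documents it hits."""
    return "high" if hit_count >= HIGH_FREQUENCY_THRESHOLD else "low"


def plan_reads(keyword, hit_count, word_id, n_servers):
    """Return the read commands the cluster proxy would send for one keyword."""
    if keyword_type(hit_count) == "high":
        # Every server reads the part of the entries it stores.
        return [ReadCommand(s, keyword, "partial") for s in range(n_servers)]
    # Only the server that owns this low-frequency keyword reads all of its entries.
    return [ReadCommand(word_id % n_servers, keyword, "all")]


high = plan_reads("China", hit_count=1_000_000, word_id=7, n_servers=3)
low = plan_reads("Xu Jianjun", hit_count=5, word_id=11, n_servers=3)
```

With these hypothetical values, the high-frequency keyword produces one command per server, while the low-frequency keyword produces a single command addressed to server 2.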
Applying the embodiments of the present invention can significantly increase retrieval speed. Experiments show that, among 15 million web documents randomly downloaded from the Internet, the total number of unary, binary, and ternary morphemes that each hit more than 1,000 documents does not exceed 500,000. It can therefore be inferred that, among 100 million documents, the number of morphemes that hit 6,000 to 10,000 documents will not exceed 500,000. Assume that the relationship between a keyword and a document is stored using 8 bytes for the document identifier, 3 bytes for the keyword's weight, and, after compression, 2 bytes for the keyword's position offset. When a keyword hits 5,000 documents, its inverted list entries occupy 64k of storage; when it hits 10,000 documents, they occupy 128k, with a read time of 8 milliseconds. In the retrieval model provided by the present invention and shown in FIG. 7, if the retrieval servers are grouped 16 to a group, the inverted list entries are partitioned according to their storage space, which covers the document identifier, the weight, and the position offset. For a morpheme whose entries occupy more than 64k, the inverted list entries are stored across multiple retrieval servers; for a morpheme whose entries occupy less than 64k, all of its inverted list entries are stored on one retrieval server. In addition, after the document identifier is compressed into a document number, it is stored in less than 2 bytes. As a result, for a morpheme below 64k, reading its inverted list entries takes less than 8 milliseconds each time. For a morpheme above 64k, each retrieval server stores 64k to 128k of inverted list entries, so (64k-128k)/7*16 = 150,000 to 300,000 inverted list entries can be stored. Thus, among 100 million documents, for medium-high-frequency keywords with a hit rate below three per thousand, each read also completes within 8 milliseconds. For both low-frequency keywords and medium-high-frequency keywords, therefore, all inverted list entries can be read within a single read time. For high-frequency morphemes with a hit rate above three per thousand, only the higher-weighted part may be stored and the lower-weighted part may be disabled, so that the maximum inverted list storage of each high-frequency morpheme does not exceed 1M, that is, the read time does not exceed 50 ms.
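The storage figures in this paragraph can be checked with a short calculation. It assumes that "k" denotes roughly 1,000 to 1,024 bytes and that a compressed entry occupies 7 bytes, which is the divisor used in the text's own expression; both are readings of the text rather than stated facts.

```latex
\begin{align*}
(8 + 3 + 2)\ \text{bytes} \times 5\,000 &= 65\,000\ \text{bytes} \approx 64\text{k} \\
(8 + 3 + 2)\ \text{bytes} \times 10\,000 &= 130\,000\ \text{bytes} \approx 128\text{k} \\
\frac{64 \times 1024}{7} \times 16 &\approx 149\,797 \approx 150\,000\ \text{entries} \\
\frac{128 \times 1024}{7} \times 16 &\approx 299\,593 \approx 300\,000\ \text{entries} \\
0.003 \times 10^{8} &= 300\,000\ \text{documents}
\end{align*}
```

The last line shows why the three-per-thousand hit rate is the boundary: a keyword that hits 0.3% of 100 million documents has about 300,000 entries, which is the most that 16 servers holding 128k each can store at 7 bytes per entry.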
The flowchart of retrieval using the retrieval model of FIG. 7 is shown in FIG. 8. In this embodiment, the search string includes at least two keywords.
Step 801: The cache proxy server parses the search string to be retrieved and generates a retrieval expression composed of keywords.
Step 802: The cluster proxy server determines the type of each keyword in the retrieval expression and, according to the type of each keyword, sends the retrieval servers commands to read inverted list entries.
Step 803: After receiving the read request, each retrieval server reads the keyword's inverted list entries.
Step 804: The retrieval servers perform logical operations on the inverted list entries of the at least two keywords. For example, a logical operation is performed on the document numbers in the inverted list entries to obtain the result of the logical operation on the keywords.
Step 805: Each retrieval server sends the result of its logical operation to the cluster proxy server above it, which aggregates the results into an intermediate result.
Step 806: Each cluster proxy server sends its intermediate result to the cache proxy server above it, which aggregates them into the final result and outputs it.
FIG. 9 shows the structure of the retrieval server in an embodiment of the present invention. In this embodiment, the search string to be retrieved includes at least two keywords.
The retrieval server includes a retrieval interpretation module 910, a read management module 920, a keyword reading module 930, a logical operation module 940, and an identifier conversion module 950.
The retrieval interpretation module 910 is configured to parse the search string to be retrieved and generate a retrieval expression composed of keywords, and can be called by the upper-level server. The read management module 920 is configured to receive at least one of a command to read the part of a high-frequency keyword's index entries stored on the server and a command to read all index entries of a low-frequency keyword stored on the server. The keyword reading module 930 is configured to read the part of the high-frequency keyword's index entries when the former command is received, and to read all index entries of the low-frequency keyword when the latter command is received. The inverted list entries include the document numbers generated by compressing the document identifiers. The logical operation module 940 is configured to perform, when at least two logically related keywords are to be retrieved, logical operations according to the logical relationship on the index entries that have been read for the at least two keywords, and to determine the index entries corresponding to the at least two keywords. The identifier conversion module 950 is configured to compress the eight-byte document identifier in a keyword's inverted list entries into a four-byte document number.
In the above embodiments, the documents are indexed with an inverted index and the corresponding index entries are inverted list entries. This is only an example of the present invention and is not intended to limit it. When the embodiments of the present invention are applied, other indexing methods may be used, and the index entries corresponding to the chosen indexing method are then read.
As the above embodiments show, in the embodiments of the present invention the inverted list entries of a high-frequency keyword are stored on multiple servers in the cluster, and during retrieval the servers read those entries in parallel. A very large number of inverted list entries can therefore be read within the system's design time without adding to the time cost of a single logical operation, which increases retrieval speed. In addition, all inverted list entries of a low-frequency keyword are stored on one retrieval server, and during retrieval only that server reads them. There is therefore no need to read a small number of inverted list entries on each of several retrieval servers, which saves the storage resources of multiple retrieval servers in the cluster and increases retrieval speed.
Moreover, applying the embodiments of the present invention effectively increases the coupling among the retrieval servers within a retrieval cluster and improves the ability to allocate resources dynamically among servers. Planning the resources of the multiple retrieval servers in the cluster as a whole maximizes the concurrency of the cluster, which further increases retrieval speed.
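Although the present invention has been described by way of embodiments, those of ordinary skill in the art will appreciate that the invention admits many variations and changes without departing from its spirit, and the appended claims are intended to cover such variations and changes without departing from the spirit of the invention.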

Claims

1. A retrieval method, characterized by comprising:
determining the type of a keyword to be retrieved;
when the keyword is a high-frequency keyword, reading, by each of n retrieval servers, the part of the index entries of the high-frequency keyword that the server itself stores, where n is an integer greater than 1;
when the keyword is a low-frequency keyword, reading, by one of the n retrieval servers, all index entries of the low-frequency keyword that the server itself stores;
determining the retrieval result of the keyword to be retrieved according to the index entries that have been read.
2. The method according to claim 1, characterized by further comprising:
dividing all index entries of each high-frequency keyword into n parts, the m-th retrieval server storing the m-th part of the index entries of each high-frequency keyword;
dividing all low-frequency keywords into n parts, the m-th retrieval server storing all index entries of the m-th part of the low-frequency keywords; where m is an integer greater than 1 and less than or equal to n.
3. The method according to claim 2, characterized in that
dividing all index entries of each high-frequency keyword into n parts comprises: taking the identifier used to distinguish documents in each index entry modulo n, and taking the index entries whose identifiers have modulo value m as the m-th part of the index entries;
dividing all low-frequency keywords into n parts comprises: taking the word identifier (word ID) of each low-frequency keyword modulo n, and taking the low-frequency keywords whose word IDs have modulo value m as the m-th part of the low-frequency keywords.
4. The method according to claim 1, characterized in that, when at least two logically related keywords are to be retrieved, the method further comprises: performing, according to the logical relationship, a logical operation on the index entries that have been read for the at least two keywords to be retrieved; and determining the retrieval result of the at least two keywords according to the result of the logical operation.
5. The method according to claim 4, characterized in that, when the at least two keywords to be retrieved include at least one low-frequency keyword, performing the logical operation on the index entries that have been read comprises:
dividing all index entries that have been read for the at least one low-frequency keyword into n parts; sending the n parts of the index entries of the at least one low-frequency keyword to the n retrieval servers respectively; and performing, by the n retrieval servers, the logical operation on the index entries they have read and on the received part of the index entries of the at least one low-frequency keyword.
6. The method according to claim 4 or 5, characterized in that performing the logical operation on the index entries comprises:
7. The method according to any one of claims 1 to 5, characterized by further comprising:
compressing the identifier used to distinguish documents in the index entries into a document number;
updating the identifier used to distinguish documents to the document number.
8. The method according to claim 7, characterized in that compressing the identifier used to distinguish documents in the index entries into a document number comprises:
compressing the document identifier in the index entries into a four-byte document number.
9. The method according to any one of claims 1 to 5, characterized in that the index entries are inverted list entries.
10. The method according to any one of claims 1 to 5, characterized in that the high-frequency keywords include ultra-high-frequency keywords and medium-high-frequency keywords.
11. A retrieval system, characterized by comprising:
a cluster proxy server, configured to determine the type of a keyword to be retrieved; when the keyword is a high-frequency keyword, to send each of n retrieval servers a command to read the part of the index entries of the high-frequency keyword that the server itself stores; when the keyword is a low-frequency keyword, to send one of the n retrieval servers a command to read all index entries of the low-frequency keyword that the server itself stores, where n is an integer greater than 1; and to determine the retrieval result of the keyword to be retrieved according to the index entries read by the retrieval servers;
the retrieval server, configured to read the part of the index entries of the high-frequency keyword when it receives the command to read the part of the index entries of the high-frequency keyword that it stores, and to read all index entries of the low-frequency keyword when it receives the command to read all index entries of the low-frequency keyword that it stores.
12. The retrieval system according to claim 11, characterized in that the retrieval server is further configured to perform, when at least two logically related keywords are to be retrieved, a logical operation according to the logical relationship on the index entries that have been read for the at least two keywords to be retrieved, and to determine the index entries corresponding to the at least two keywords.
13. A retrieval server, characterized by comprising:
a read management module, configured to receive at least one of a command to read the part of the index entries of a high-frequency keyword that the server itself stores and a command to read all index entries of a low-frequency keyword that the server itself stores;
a keyword reading module, configured to read the part of the index entries of the high-frequency keyword when the command to read the part of the index entries of the high-frequency keyword is received, and to read all index entries of the low-frequency keyword when the command to read all index entries of the low-frequency keyword is received.
14. The retrieval server according to claim 13, characterized by further comprising:
a logical operation module, configured to perform, when at least two logically related keywords are to be retrieved, a logical operation according to the logical relationship on the index entries that have been read for the at least two keywords to be retrieved, and to determine the index entries corresponding to the at least two keywords.
15. The retrieval server according to claim 13, characterized by further comprising:
an identifier conversion module, configured to compress the identifier used to distinguish documents in the index entries of the high-frequency keyword or the low-frequency keyword into a document number.
16. A cluster proxy server, characterized by comprising:
a first module, configured to determine the type of a keyword to be retrieved;
a second module, configured to send, when the keyword is a high-frequency keyword, each of n retrieval servers a command to read the part of the index entries of the high-frequency keyword that the server itself stores, and to send, when the keyword is a low-frequency keyword, one of the n retrieval servers a command to read all index entries of the low-frequency keyword that the server itself stores, where n is an integer greater than 1;
a third module, configured to determine the retrieval result of the keyword to be retrieved according to the index entries read by the retrieval servers.
17、根据权利要求 16所述的集群代理服务器, 其特征在于, 进一步 包括:  The cluster proxy server according to claim 16, further comprising:
第四模块, 用于将每个高频关键词的全部索引表项分割为 n部分; 向第 m台检索服务器发送存储所述每个高频关键词的第 m部分索引表 项的命令, 其中 m为大于 1小于等于 n的整数;  a fourth module, configured to divide all index entries of each high frequency keyword into n parts; and send, to the mth search server, a command to store the mth partial index entry of each high frequency keyword, where m is an integer greater than 1 and less than or equal to n;
第五模块, 用于将全部低频关键词分割为 n部分; 向第 m台检索服 务器发送存储所述第 m部分低频关键词的全部索引表项的命令。  And a fifth module, configured to divide all low frequency keywords into n parts; and send, to the mth search server, a command to store all index entries of the mth part of the low frequency keywords.
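The partitioning and routing recited for the cluster proxy server can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `partition_index`, `search` and `low_home` are hypothetical, retrieval servers are modeled as in-memory dictionaries numbered from 0 to n-1, the round-robin split is only one possible way of dividing index entries into n parts, and the proxy-side map from each low-frequency keyword to its server is an assumption about how the single target server might be chosen.

```python
from typing import Dict, List, Set, Tuple

Postings = List[int]  # document sequence numbers listed for one keyword


def partition_index(
    high: Dict[str, Postings], low: Dict[str, Postings], n: int
) -> Tuple[List[Dict[str, Postings]], Dict[str, int]]:
    """Distribute index entries over n simulated retrieval servers."""
    if n <= 1:
        raise ValueError("n must be an integer greater than 1")
    servers: List[Dict[str, Postings]] = [{} for _ in range(n)]

    # High-frequency keywords: every server stores one part of the entries.
    for keyword, postings in high.items():
        for m in range(n):
            servers[m][keyword] = postings[m::n]  # round-robin split (assumed)

    # Low-frequency keywords: each keyword is stored whole on one server.
    low_home: Dict[str, int] = {}
    for i, keyword in enumerate(sorted(low)):
        low_home[keyword] = i % n
        servers[i % n][keyword] = list(low[keyword])
    return servers, low_home


def search(
    keyword: str,
    servers: List[Dict[str, Postings]],
    high_keywords: Set[str],
    low_home: Dict[str, int],
) -> Postings:
    """Route a lookup according to the keyword type and merge the results."""
    if keyword in high_keywords:
        # Fan out: each server reads only the part it stores.
        parts = [server.get(keyword, []) for server in servers]
        return sorted(doc for part in parts for doc in part)
    home = low_home.get(keyword)
    if home is None:
        return []  # keyword is not indexed anywhere
    # Single read: one server holds all entries of a low-frequency keyword.
    return sorted(servers[home].get(keyword, []))
```

In this sketch a high-frequency lookup touches all n servers but each reads only a fraction of the entries, while a low-frequency lookup touches a single server; the merge in `search` plays the role of determining the retrieval result from the entries that were read.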
PCT/CN2008/070598 2007-06-26 2008-03-27 Searching method, searching system and searching server WO2009000173A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CNB2007101124514A CN100462979C (en) 2007-06-26 2007-06-26 Distributed indesx file searching method, searching system and searching server
CN200710112451.4 2007-06-26

Publications (1)

Publication Number Publication Date
WO2009000173A1 (en) 2008-12-31

Family

ID=38898665

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/070598 WO2009000173A1 (en) 2007-06-26 2008-03-27 Searching method, searching system and searching server

Country Status (2)

Country Link
CN (1) CN100462979C (en)
WO (1) WO2009000173A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095209A (en) * 2014-04-21 2015-11-25 北京金山网络科技有限公司 Document clustering method, document clustering device and network equipment

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100462979C (en) * 2007-06-26 2009-02-18 腾讯科技(深圳)有限公司 Distributed indesx file searching method, searching system and searching server
US8386929B2 (en) * 2010-06-22 2013-02-26 Microsoft Corporation Personal assistant for task utilization
US9229946B2 (en) 2010-08-23 2016-01-05 Nokia Technologies Oy Method and apparatus for processing search request for a partitioned index
CN102479207B (en) * 2010-11-29 2013-07-03 阿里巴巴集团控股有限公司 Information search method, system and device
US10192176B2 (en) 2011-10-11 2019-01-29 Microsoft Technology Licensing, Llc Motivation of task completion and personalization of tasks and lists
CN103064841A (en) * 2011-10-20 2013-04-24 北京中搜网络技术股份有限公司 Retrieval device and retrieval method
CN103810220B (en) * 2012-11-15 2018-02-27 腾讯科技(深圳)有限公司 A kind of microblogging searching method and device
CN103455619B (en) * 2013-09-12 2016-09-07 焦点科技股份有限公司 A kind of scoring treatment method and system based on Lucene slice structure
CN104679778B (en) * 2013-11-29 2019-03-26 腾讯科技(深圳)有限公司 A kind of generation method and device of search result
CN103678697A (en) * 2013-12-26 2014-03-26 乐视网信息技术(北京)股份有限公司 Reverse index storage method and system thereof
CN105335373A (en) * 2014-06-17 2016-02-17 阿里巴巴集团控股有限公司 Information searching method and apparatus
CN105608022B (en) * 2014-11-25 2017-08-01 南方电网科学研究院有限责任公司 The instruction distribution method and system of a kind of intelligent and safe chip based on drainage technique
CN104778200A (en) * 2015-01-13 2015-07-15 东莞中山大学研究院 Heterogeneous processing big data retrieval method combining historical data
CN106156166B (en) * 2015-04-16 2020-11-10 深圳市腾讯计算机系统有限公司 Relation chain query system, document retrieval method, index establishment method and device
CN106156000B (en) 2015-04-28 2020-03-17 腾讯科技(深圳)有限公司 Search method and search system based on intersection algorithm
CN105447162B (en) * 2015-12-01 2021-06-25 腾讯科技(深圳)有限公司 Group file searching method and device
CN105653646B (en) * 2015-12-28 2019-06-04 北京中电普华信息技术有限公司 System for dynamically querying and method under a kind of concurrent querying condition
CN106055622A (en) * 2016-05-26 2016-10-26 浪潮软件集团有限公司 Data searching method and system
CN107436911A (en) * 2017-05-24 2017-12-05 阿里巴巴集团控股有限公司 Fuzzy query method, device and inquiry system
CN107145603A (en) * 2017-06-08 2017-09-08 上海德衡数据科技有限公司 A kind of network documentation search engine framework for keyword
CN108520051A (en) * 2018-04-04 2018-09-11 湖南蚁坊软件股份有限公司 A method of promoting Apache Lucene modifier search performances
CN110532347B (en) * 2019-09-02 2023-12-22 北京博睿宏远数据科技股份有限公司 Log data processing method, device, equipment and storage medium
CN112836008B (en) * 2021-02-07 2023-03-21 中国科学院新疆理化技术研究所 Index establishing method based on decentralized storage data
CN113923209B (en) * 2021-09-29 2023-07-14 北京轻舟智航科技有限公司 Processing method for downloading batch data based on LevelDB
CN113824804A (en) * 2021-11-24 2021-12-21 飞狐信息技术(天津)有限公司 Keyword detection method and related device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050198068A1 (en) * 2004-03-04 2005-09-08 Shouvick Mukherjee Keyword recommendation for internet search engines
CN1975729A (en) * 2005-12-02 2007-06-06 国际商业机器公司 System of effectively searching text for keyword, and method thereof
CN101071442A (en) * 2007-06-26 2007-11-14 腾讯科技(深圳)有限公司 Distributed indesx file searching method, searching system and searching server

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20070084004A (en) * 2004-11-05 2007-08-24 가부시키가이샤 아이.피.비. Keyword extracting device
CN1936887A (en) * 2005-09-22 2007-03-28 国家计算机网络与信息安全管理中心 Automatic text classification method based on classification concept space

Also Published As

Publication number Publication date
CN100462979C (en) 2009-02-18
CN101071442A (en) 2007-11-14

Similar Documents

Publication Publication Date Title
WO2009000173A1 (en) Searching method, searching system and searching server
US11249971B2 (en) Segmenting machine data using token-based signatures
US7007015B1 (en) Prioritized merging for full-text index on relational store
US6754799B2 (en) System and method for indexing and retrieving cached objects
EP2973018B1 (en) A method to accelerate queries using dynamically generated alternate data formats in flash cache
US7058783B2 (en) Method and mechanism for on-line data compression and in-place updates
US7185019B2 (en) Performant and scalable merge strategy for text indexing
US6209003B1 (en) Garbage collection in an object cache
US8959077B2 (en) Multi-layer search-engine index
US20080082554A1 (en) Systems and methods for providing a dynamic document index
US20120327956A1 (en) Flow compression across multiple packet flows
Cambazoglu et al. Scalability challenges in web search engines
CN104679898A (en) Big data access method
WO2008154823A1 (en) Searching method, system and device
CN104778270A (en) Storage method for multiple files
Williams et al. What's Next? Index Structures for Efficient Phrase Querying.
US9262511B2 (en) System and method for indexing streams containing unstructured text data
US7627777B2 (en) Fault tolerance scheme for distributed hyperlink database
JP3499105B2 (en) Information search method and information search device
CN114968953A (en) Log storage and retrieval method, system, terminal equipment and medium
CN102201007A (en) Large-scale data retrieving system
Zhang et al. Efficient search in large textual collections with redundancy
Jonassen et al. A combined semi-pipelined query processing architecture for distributed full-text retrieval
US20060248056A1 (en) Fast rich application view initiation
Henrique et al. A new approach for verifying url uniqueness in web crawlers

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08715334

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 7315/CHENP/2009

Country of ref document: IN

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 10/06/2010)

122 Ep: pct application non-entry in european phase

Ref document number: 08715334

Country of ref document: EP

Kind code of ref document: A1