Retrieval method, retrieval system and retrieval server
TECHNICAL FIELD
The present invention relates to the field of communications technologies, and in particular to a retrieval method, a retrieval system and a retrieval server.
BACKGROUND OF THE INVENTION
To perform a search, a user inputs a search string, which usually contains one or more keywords. When the keywords are separated by spaces, a space between two keywords indicates that an "AND" retrieval operation is performed on them. Each keyword may consist of one or more morphemes. A morpheme is the smallest language unit that can express an independent meaning, usually a Chinese word produced by a word segmentation system. A word segmentation system may split a keyword into a varying number of morphemes: a keyword split into two morphemes is a binary compound morpheme, and a keyword split into three morphemes is a ternary compound morpheme. During retrieval, the set of all documents containing the input search string must be found within a short time, and the document set is presented as a list of document identifiers.
Among Internet search engine technologies, back-end retrieval clustering is one of the core technologies. It governs the cooperation among multiple retrieval servers so that retrieval services can be provided for larger data sets. The number of documents that a single retrieval server can manage is limited: if too many documents are stored, the system can hardly return the required results within a time acceptable to the user during normal retrieval. The time acceptable to a user is usually no more than 1 second, so a retrieval cluster consisting of multiple retrieval servers is needed to support retrieval over a larger data set.
The main operation in the retrieval process is access to the inverted index. An inverted index is a data structure used to speed up retrieval of a search string. It may exist as disk files or be loaded into memory, and it consists of at least a dictionary file and an inverted table file. The inverted table file stores multiple inverted entries, each of which records the correspondence between a keyword of the search string and documents. Improving the speed at which inverted entries are read therefore improves retrieval efficiency accordingly. The time needed to read an inverted entry from the inverted table file comprises the disk seek time of each access and the time needed to read the data. When the amount of data read is small, the read time depends mainly on the disk seek time; when the amount of data read is large, it depends mainly on the time needed to read the data.
An existing distributed index file retrieval model based on document partitioning is shown in FIG. 1. The system includes one retrieval proxy server and multiple parallel retrieval servers managed by the retrieval proxy server. Each retrieval server is allocated 1/N of the full document set, where N is the total number of retrieval servers. In the indexing phase, the parallel retrieval servers complete their own indexing tasks in parallel. In the retrieval phase, the retrieval proxy server sends the read request to every retrieval server at the same time; each retrieval server performs its local retrieval and returns its result to the retrieval proxy server, which finally merges the results of all retrieval servers according to a specific weight-based ordering. A retrieval system based on document partitioning thus has an independent structural design: the coupling between retrieval servers is low, and each retrieval server is in effect a retrieval subsystem that can be loaded independently. In Internet retrieval services, however, most search strings consist of two or more keywords. After matching the document identifiers for each keyword, a retrieval server must also match position offsets within the documents, which leads to multiple input/output accesses to the disk. Moreover, when the search string includes a high-frequency morpheme, the document identifier lists and position offset lists to be read are very large. For example, the inverted entries of high-frequency morphemes such as "中国" ("China"), "网" ("net") and "我们" ("we") usually account for a large proportion of the entire inverted index, and these index data cannot be read completely in a short time. Most of the retrieval time is therefore spent on file input/output read operations, which lowers the overall concurrency of the retrieval system and slows down both retrieval and response for the search string.
An existing distributed index file retrieval model based on index entry partitioning is shown in FIG. 2. The system includes one retrieval proxy server and N groups of parallel retrieval servers managed by the retrieval proxy server, where N is an integer greater than 1. Each group of retrieval servers is allocated 1/N of the full document set, and each group contains three retrieval servers. Generally, the inverted entries of index keywords are stored on different retrieval servers according to the value of a hash taken modulo the number of servers in the group. For example, if the hash value of "中国" ("China") modulo 3 equals 1, the inverted entry data block of the index keyword "中国" is stored on retrieval server No. 1 of the group. In this way, the inverted entries of all index keywords that were originally stored on a single retrieval server are distributed evenly over three retrieval servers, which speeds up access to the inverted entries. However, in a retrieval system based on index entry partitioning, when the search string includes two or more keywords, a single retrieval server in a group cannot complete the retrieval independently and must cooperate with the other retrieval servers in the group. This increases the data coupling between retrieval servers, makes data backup more complicated and lowers retrieval speed. In addition, each retrieval requires the retrieval servers to carry out operations jointly, which increases the communication traffic between the retrieval servers.
SUMMARY OF THE INVENTION
Embodiments of the present invention provide a retrieval method, a retrieval system and a retrieval server that can improve retrieval speed.
A retrieval method includes:
determining the type of a keyword to be retrieved;
when the keyword is a high-frequency keyword, reading, by each of n retrieval servers, the part of the index entries of the high-frequency keyword that the server itself stores, where n is an integer greater than 1;
when the keyword is a low-frequency keyword, reading, by one of the n retrieval servers, all index entries of the low-frequency keyword that the server itself stores; and
determining the retrieval result for the keyword to be retrieved according to the index entries that have been read.
A retrieval system includes:
a cluster proxy server, configured to determine the type of a keyword to be retrieved; when the keyword is a high-frequency keyword, send to each of n retrieval servers a command to read the part of the index entries of the high-frequency keyword stored on that server; when the keyword is a low-frequency keyword, send to one of the n retrieval servers a command to read all index entries of the low-frequency keyword stored on that server, where n is an integer greater than 1; and determine the retrieval result for the keyword to be retrieved according to the index entries read by the retrieval servers; and
the retrieval servers, each configured to read a part of the index entries of the high-frequency keyword when receiving a command to read the part of the index entries of the high-frequency keyword stored on itself, and to read all index entries of the low-frequency keyword when receiving a command to read all index entries of the low-frequency keyword stored on itself.
A retrieval server includes:
a read management module, configured to receive at least one of a command to read a part of the index entries of a high-frequency keyword stored on the server and a command to read all index entries of a low-frequency keyword stored on the server; and
a keyword reading module, configured to read a part of the index entries of the high-frequency keyword when the command to read the part of the index entries of the high-frequency keyword stored on the server is received, and to read all index entries of the low-frequency keyword when the command to read all index entries of the low-frequency keyword stored on the server is received.
A cluster proxy server includes:
a first module, configured to determine the type of a keyword to be retrieved;
a second module, configured to, when the keyword is a high-frequency keyword, send to each of n retrieval servers a command to read the part of the index entries of the high-frequency keyword stored on that server, and, when the keyword is a low-frequency keyword, send to one of the n retrieval servers a command to read all index entries of the low-frequency keyword stored on that server, where n is an integer greater than 1; and
a third module, configured to determine the retrieval result for the keyword to be retrieved according to the index entries read by the retrieval servers.
As can be seen from the above technical solutions, in the embodiments of the present invention, on the one hand, the inverted entries of a high-frequency keyword are stored on multiple servers in the cluster and are read by those servers in parallel during retrieval. A very large number of inverted entries can therefore be read within the time allowed by the system design, and the time overhead of a single subsequent logical operation is not extended, which improves retrieval speed. On the other hand, all inverted entries of a low-frequency keyword are stored on one retrieval server, and only that server reads them during retrieval. There is thus no need to read a small number of inverted entries on each of several retrieval servers, which saves the storage resources of the retrieval servers in the cluster and improves retrieval speed.
In addition, applying the embodiments of the present invention can effectively increase the coupling between the retrieval servers inside a retrieval cluster and strengthens the dynamic allocation of resources among the servers. By planning the memory resources, disk input/output resources and CPU (central processing unit) resources of the retrieval servers in the cluster as a whole, the concurrency of the cluster as a whole is preserved as far as possible, which further improves retrieval speed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of an existing distributed index file retrieval model based on document partitioning;
FIG. 2 is a schematic diagram of an existing distributed index file retrieval model based on index entry partitioning;
FIG. 3 is a flowchart of a retrieval method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a retrieval method according to another embodiment of the present invention;
FIG. 5 is a schematic diagram of retrieving a specific search string by applying the method of the present invention;
FIG. 6 is a structural diagram of a retrieval system according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a retrieval model that applies the retrieval system according to an embodiment of the present invention;
FIG. 8 is a flowchart of retrieval using the retrieval model of FIG. 7;
FIG. 9 is a structural diagram of a retrieval server according to an embodiment of the present invention.
MODE FOR CARRYING OUT THE INVENTION
To enable those skilled in the art to better understand the solutions of the present invention, and to make the above objects, features and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and specific embodiments.
The flow of the retrieval method in an embodiment of the present invention is shown in FIG. 3.
Step 301: Parse the search string to be retrieved and generate a search expression composed of keywords.
Step 302: Send read requests for the keywords to each retrieval server in the cluster. A read request for a keyword includes a read-ahead request for the inverted entry of that keyword.
The inverted entry of a keyword is an array that records the identifiers of all documents containing the keyword. For each such document, the array holds the document identifier, the weight of the keyword in the document, and the position offsets of the keyword in the document. The basic structure is as follows:
$t\ \langle d_1, w_{d_1,t}, loc_1, loc_2, \ldots, loc_{f_{d_1,t}} \rangle\ \langle d_2, \ldots \rangle\ \ldots\ \langle d_{f_t}, \ldots \rangle$
Here $t$ denotes a keyword in the search string, $d_i$ denotes the identifier of one of the documents containing the keyword $t$, $w_{d_i,t}$ denotes the weight of the keyword $t$ in document $d_i$, and $loc_i$ denotes a position offset at which the keyword $t$ appears in the current document, usually stored in two bytes. A keyword of the search string can be located quickly through its inverted entry. Usually, the inverted index data for a search string consists of N inverted entries, where N is the total number of keywords in the search string.
Step 303: The retrieval servers in the cluster read the inverted entries of the keywords according to how frequently each keyword hits documents.
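The inverted entry layout described above (one group of document identifier, weight and position offsets per matching document) can be pictured as a small data structure. The following is a minimal Python sketch rather than the storage format of the embodiment; the names `Posting`, `InvertedIndex` and `doc_ids`, and the sample data, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Posting:
    """One <d, w, loc_1..loc_f> group inside an inverted entry."""
    doc_id: int        # d_i: identifier of a document containing the keyword
    weight: float      # w_{d_i,t}: weight of the keyword in that document
    offsets: List[int] = field(default_factory=list)  # position offsets in the document


# Hypothetical in-memory index: keyword -> postings ordered by document identifier.
InvertedIndex = Dict[str, List[Posting]]


def doc_ids(index: InvertedIndex, keyword: str) -> List[int]:
    """Return the document identifiers recorded for a keyword (empty if unknown)."""
    return [posting.doc_id for posting in index.get(keyword, [])]


# Illustrative data only.
index: InvertedIndex = {
    "example": [
        Posting(doc_id=3, weight=0.5, offsets=[4, 19]),
        Posting(doc_id=8, weight=0.2, offsets=[7]),
    ]
}
print(doc_ids(index, "example"))
```

Keeping the postings ordered by document identifier is a common choice because it makes later AND/OR merging straightforward; the text itself does not state an ordering.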
According to how frequently they hit documents, the keywords in the search expression can be divided into high-frequency keywords, which comprise ultra-high-frequency keywords and medium-high-frequency keywords, and low-frequency keywords.
In the embodiments of the present invention, statistics may be collected on the index data before retrieval to determine the number of documents hit by each keyword, and the type of a keyword to be retrieved is determined according to a preset document frequency threshold. When a keyword is an ultra-high-frequency keyword and/or a medium-high-frequency keyword, its inverted entry is split and stored jointly by the retrieval servers in the cluster, each retrieval server storing a part of the inverted entry of the keyword. For example, when the cluster includes n retrieval servers, all index entries of the high-frequency keyword are divided into n parts and the m-th retrieval server stores the m-th part of the index entries of the keyword, where n is an integer greater than 1 and m is an integer greater than or equal to 1 and less than or equal to n. When a keyword is a low-frequency keyword, all of its inverted entries are stored by one retrieval server in the cluster. For example, all low-frequency keywords are divided into n parts, and the m-th retrieval server stores all index entries of the low-frequency keywords in the m-th part.
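The threshold-based classification described above can be sketched in a few lines. This is a hedged illustration only: the function name `classify_keywords`, the single threshold, the use of `>=` for the comparison and the sample counts are all assumptions, since the text says only that a preset document frequency threshold is used.

```python
from typing import Dict


def classify_keywords(doc_counts: Dict[str, int], threshold: int) -> Dict[str, str]:
    """Label each keyword "high" or "low" from the number of documents it hits.

    doc_counts is assumed to come from statistics gathered on the index
    before retrieval; threshold is the preset document frequency threshold.
    """
    return {
        keyword: ("high" if count >= threshold else "low")
        for keyword, count in doc_counts.items()
    }


# Illustrative counts only.
labels = classify_keywords({"common": 1_000_000, "rare": 5}, threshold=10_000)
print(labels)
```

A second, higher threshold could separate ultra-high-frequency from medium-high-frequency keywords in the same way.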
In the retrieval phase, when a keyword is an ultra-high-frequency keyword and/or a medium-high-frequency keyword, each retrieval server in the cluster reads the part of the inverted entry of the high-frequency keyword that it stores. When a keyword is a low-frequency keyword, the retrieval server that stores the inverted entries of the low-frequency keyword reads all of them.
When the cluster contains n retrieval servers, where n is an integer greater than 1, splitting the inverted entry of a keyword includes: taking the document identifiers in the inverted entry of the high-frequency keyword modulo n and storing the inverted entries with the same modulus as one group on the retrieval server corresponding to that modulus. In the retrieval phase, the retrieval server corresponding to a modulus reads the inverted entries with that modulus. Similarly, the word identifier (word ID) of each low-frequency keyword is taken modulo n, and the low-frequency keywords with the same modulus form one group that is stored by one retrieval server.
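The two modulo rules above (document identifier modulo n for a high-frequency keyword, word ID modulo n for low-frequency keywords) can be sketched as follows. This is a minimal sketch under the assumption that servers are numbered 0 to n-1 and that document identifiers and word IDs are plain integers; the function names and the sample data are hypothetical.

```python
from typing import Dict, List


def split_high_frequency(doc_ids: List[int], n: int) -> Dict[int, List[int]]:
    """Split one high-frequency keyword's postings across n servers by doc_id mod n."""
    shards: Dict[int, List[int]] = {server: [] for server in range(n)}
    for doc_id in doc_ids:
        shards[doc_id % n].append(doc_id)
    return shards


def assign_low_frequency(word_ids: Dict[str, int], n: int) -> Dict[int, List[str]]:
    """Group low-frequency keywords by word ID mod n; each group lives on one server."""
    groups: Dict[int, List[str]] = {server: [] for server in range(n)}
    for keyword, word_id in word_ids.items():
        groups[word_id % n].append(keyword)
    return groups


# Illustrative data only.
print(split_high_frequency([1, 2, 3, 4, 5, 6], n=2))
print(assign_low_frequency({"alpha": 10, "beta": 11, "gamma": 12}, n=2))
```

The difference between the two functions is the key being hashed: a high-frequency keyword is spread over all servers, whereas a low-frequency keyword is kept whole on a single server.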
Further, in the embodiments of the present invention, the retrieval server compresses the eight-byte document identifier in the inverted entry of a keyword into a four-byte document number.
Step 304: The retrieval servers in the cluster perform logical operations on the inverted entries of the keywords and then output the retrieval result.
When the search string to be retrieved includes both high-frequency and low-frequency keywords, logical operations are performed on the inverted entries of the different keywords. Specifically, the retrieval server that stores the inverted entries of a low-frequency keyword takes the document identifiers in those inverted entries modulo n and sends the inverted entries corresponding to each modulus to the retrieval server corresponding to that modulus. Each retrieval server in the cluster then performs the logical operation on the inverted entries of the high-frequency and low-frequency keywords, and the results of all retrieval servers are aggregated to obtain the retrieval result for the search string.
The logical operation may be any one or any combination of AND, OR and NOT operations.
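The redistribution and per-server logical operation of Step 304 can be sketched for the AND case. This is a simplified illustration, assuming servers numbered 0 to n-1, postings reduced to sets of document identifiers, and aggregation done by taking the union of the per-server results; the names `redistribute` and `cluster_and` are hypothetical.

```python
from typing import Dict, List, Set


def redistribute(low_freq_doc_ids: List[int], n: int) -> Dict[int, Set[int]]:
    """Route each low-frequency posting to the server numbered doc_id mod n."""
    routed: Dict[int, Set[int]] = {server: set() for server in range(n)}
    for doc_id in low_freq_doc_ids:
        routed[doc_id % n].add(doc_id)
    return routed


def cluster_and(high_shards: Dict[int, Set[int]], low_freq_doc_ids: List[int]) -> List[int]:
    """AND a sharded high-frequency keyword with a low-frequency keyword.

    high_shards maps server number (0..n-1) to the document identifiers that
    server stores for the high-frequency keyword.
    """
    n = len(high_shards)
    routed = redistribute(low_freq_doc_ids, n)
    merged: Set[int] = set()
    for server in range(n):
        # Each server intersects only its own shard with the postings sent to it.
        merged |= high_shards[server] & routed[server]
    return sorted(merged)


# Illustrative data only: server 0 holds even ids, server 1 holds odd ids.
print(cluster_and({0: {4, 10}, 1: {3, 9}}, [4, 5, 9]))
```

Because both keywords are partitioned by the same modulus, a document can only match on the server that owns its modulus, so no server needs another server's shard of the high-frequency keyword.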
The flow of the retrieval method in another embodiment of the present invention is shown in FIG. 4. Each cluster in this embodiment contains n retrieval servers, where n is an integer greater than 1.
Step 401: Parse the search string to be retrieved and generate a search expression composed of keywords.
The search string that a user inputs for retrieval may be a short sentence or may include several keywords. Such search strings are raw character strings that have not been formatted by a computer; parsing a search string produces a search expression that a computer can recognize. A search expression may contain one or more keywords. If the keywords input by the user are not separated by any separator, a logical AND relationship exists between the keywords after parsing. If the keywords input by the user are separated by separators, then, for example, a space between keywords indicates an "AND" retrieval operation on the keywords before and after it, a "|" between keywords indicates an "OR" operation on the keywords before and after it, and a "!" in front of a keyword indicates a "NOT" operation on that keyword.
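The separator conventions described above can be illustrated with a small parser. This is a sketch under stated assumptions, not the parser of the embodiment: the OR separator is taken to be `|`, OR is assumed to bind more loosely than AND, word segmentation of unseparated keywords is not modeled, and the names `Term` and `parse_search_string` are hypothetical.

```python
from typing import List, Tuple

Term = Tuple[bool, str]  # (negated, keyword)


def parse_search_string(raw: str) -> List[List[Term]]:
    """Parse a raw search string into OR-alternatives of AND-ed terms.

    "a b"  -> a AND b
    "a|b"  -> a OR b
    "!a"   -> NOT a
    """
    expression: List[List[Term]] = []
    for alternative in raw.split("|"):
        terms: List[Term] = []
        for token in alternative.split():
            negated = token.startswith("!")
            keyword = token[1:] if negated else token
            if keyword:  # ignore a bare "!"
                terms.append((negated, keyword))
        if terms:
            expression.append(terms)
    return expression


print(parse_search_string("china news|!sports"))
```

The nested-list result is one possible form of the "search expression"; each inner list is a conjunction and the outer list is a disjunction.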
In this embodiment, it is assumed that the search string includes both high-frequency keywords and low-frequency keywords.
Step 402: Send read requests for the keywords to each retrieval server in the cluster. A read request for a keyword includes a read-ahead request for the inverted entry of that keyword.
Step 403: Determine whether a keyword is a high-frequency keyword or a low-frequency keyword. If it is a high-frequency keyword, perform step 404; if it is a low-frequency keyword, perform step 405.
According to the number of inverted entries corresponding to a keyword, that is, the number of documents hit by the keyword, the keywords in the search expression are divided into high-frequency keywords and low-frequency keywords. In particular, high-frequency keywords may be further divided into medium-high-frequency keywords and ultra-high-frequency keywords.
In the embodiments of the present invention, statistics may be collected on the index data before retrieval to determine the number of documents hit by each keyword, that is, the number of inverted entries corresponding to the keyword, and the type of a keyword to be retrieved is determined according to a preset document frequency threshold.
Step 404: Each of the n retrieval servers in the cluster reads a part of the inverted entries of the high-frequency keyword, and then step 407 is performed.
For reading the inverted entries of a high-frequency keyword, a technique similar to a disk RAID (redundant array of independent disks) system can be used: the n retrieval servers in the cluster each store a part of the very large inverted entry of the high-frequency keyword, and during retrieval the n retrieval servers read in parallel. In this way, the very large inverted entry can be read within the time allowed by the system design, and the time overhead of a single subsequent logical operation is not extended.
Step 405: The retrieval server in the cluster that stores the low-frequency keyword to be retrieved reads all inverted entries of that low-frequency keyword.
The inverted entries of a low-frequency keyword are read by one retrieval server in the cluster, which avoids the existing situation in which a small number of inverted entries are read on each of several retrieval servers.
The data block of the inverted entries of a low-frequency keyword is usually smaller than the minimum read block of the disk, for example 64K; for any data block smaller than 64K, the disk takes the same time to read it. In the prior art, the inverted entries of a low-frequency keyword are split into n blocks that are then read by n servers, which does not increase the read speed and wastes the resources of multiple retrieval servers in the cluster. Applying the embodiments of the present invention effectively avoids this problem.
This embodiment further includes the retrieval server compressing, when building the index, the eight-byte document identifier in the inverted entry of a keyword into a four-byte document number.
The document identifier in an inverted entry is used to locate a document. For web pages on the Internet, each page has a unique URL (uniform resource locator). Applying a signature algorithm to the URL string of a web page yields a globally unique 64-bit (8-byte) integer corresponding to that URL string, which serves as the document identifier of the document. Because the number of web pages on the Internet is huge, the document identifiers also occupy a large amount of storage. In this embodiment, storing the inverted entries of keywords on n retrieval servers is equivalent to distributing different documents to different retrieval servers, so each retrieval server receives a certain number of documents. Suppose this number is N, where N is an integer greater than 0. Each retrieval server then renumbers the documents assigned to it, converting their document identifiers into integers from 0 to N-1 that serve as the document numbers of the documents. For the same document, the document number is thus much shorter than the original document identifier, which saves storage space and increases read speed.
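The renumbering described above can be sketched as follows. This is an illustration under assumptions: the text does not name the signature algorithm, so MD5 truncated to 8 bytes stands in for it here, local numbers are assigned in ascending order of document identifier, and the names `url_to_doc_id` and `build_doc_numbers` and the example URLs are hypothetical.

```python
import hashlib
from typing import Dict, Iterable


def url_to_doc_id(url: str) -> int:
    """Derive a 64-bit document identifier from a URL.

    Assumption: MD5 truncated to 8 bytes is used only as a stand-in
    for the unspecified signature algorithm.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_doc_numbers(doc_ids: Iterable[int]) -> Dict[int, int]:
    """Map the document identifiers held by one server to local numbers 0..N-1."""
    return {doc_id: number for number, doc_id in enumerate(sorted(set(doc_ids)))}


# Illustrative URLs only.
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
ids = [url_to_doc_id(u) for u in urls]
numbers = build_doc_numbers(ids)
print(sorted(numbers.values()))
```

As long as a server holds fewer than 2^32 documents, every local number fits in four bytes, which is where the saving over the eight-byte identifier comes from.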
Step 406: Take the document numbers in the inverted entries of the low-frequency keyword modulo n and send each inverted entry to the retrieval server corresponding to its modulus.
Step 407: The n retrieval servers in the cluster perform logical operations on the inverted entries that have been read.
The logical operations in this step are performed according to the logical relationships between the keywords in the search string to be retrieved, and include any one or any combination of AND, OR and NOT operations.
Step 408: Aggregate the results of the logical operations of the n retrieval servers to obtain the retrieval result for the search string.
The above embodiment is described using the example of a search string that includes both high-frequency and low-frequency keywords. In practice, when the search string includes only high-frequency keywords, steps 405 to 406 need not be performed; when the search string includes only low-frequency keywords, step 404 need not be performed.
The search string "中国徐建军" ("China Xu Jianjun") is used below as an example. The flow of retrieving this search string is shown in FIG. 5. The cluster contains three retrieval servers: retrieval server 0, retrieval server 1 and retrieval server 2.
First, the search string "中国徐建军" is parsed to generate a search expression composed of the keywords "中国" ("China") and "徐建军" ("Xu Jianjun").
Next, the three retrieval servers in the cluster determine the type of each keyword to be retrieved according to the number of documents it hits, and read the inverted entries of the keyword according to its type.
Here, "China" is a high-frequency keyword that appears very often in documents, whereas "Xu Jianjun", a specific personal name, is a low-frequency keyword that appears rarely in documents, assuming the person is not a celebrity.
In this embodiment, assume that the document number list in the inverted-list entries of the high-frequency keyword "China" is {16, 38, 100, 207, 319, 872, 903, 1081, 2331, 5618}, and that the document number list in the inverted-list entries of the low-frequency keyword "Xu Jianjun" is {38, 295, 307, 971, 2331}.
Because each of the three retrieval servers in the cluster stores a portion of the inverted-list entries of the high-frequency keyword "China", each of the three servers reads its own portion. Each document number for "China" is taken modulo 3, and the retrieval server corresponding to each modulus value reads the entries with that modulus value. For example, document number 16 modulo 3 is 1, so retrieval server 1 reads the entry for document number 16. Accordingly, retrieval server 0 reads the entries with document numbers {207, 903, 2331}, retrieval server 1 reads the entries with document numbers {16, 100, 319, 1081}, and retrieval server 2 reads the entries with document numbers {38, 872, 5618}.
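The modulo-3 assignment in this example can be reproduced with a short Python sketch. The helper name `partition_by_modulus` is hypothetical; the document numbers are those given in the text.

```python
from collections import defaultdict

def partition_by_modulus(doc_numbers, n_servers):
    """Group document numbers by (document number mod n_servers)."""
    shards = defaultdict(list)
    for doc in doc_numbers:
        shards[doc % n_servers].append(doc)
    return dict(shards)

# Document numbers of the high-frequency keyword "China" from the example.
china = [16, 38, 100, 207, 319, 872, 903, 1081, 2331, 5618]
shards = partition_by_modulus(china, 3)
# shards[0] == [207, 903, 2331]
# shards[1] == [16, 100, 319, 1081]
# shards[2] == [38, 872, 5618]
```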
Each of the three retrieval servers in the cluster stores all the inverted-list entries of different low-frequency keywords. Assume that all the inverted-list entries of the low-frequency keyword "Xu Jianjun" are stored on retrieval server 2. Retrieval server 2 therefore reads all the entries for "Xu Jianjun", namely the entries with document numbers {38, 295, 307, 971, 2331}.
Next, after the retrieval servers in the cluster have finished reading the inverted-list entries of the keywords, the entries of the low-frequency keyword "Xu Jianjun" are distributed to the three retrieval servers in the cluster.
The document numbers of the low-frequency keyword are taken modulo 3, and the entries for each modulus value are sent to the retrieval server corresponding to that value. In this embodiment, for the low-frequency keyword "Xu Jianjun", the entry with document number {2331} is sent to retrieval server 0, the entries with document numbers {295, 307} are sent to retrieval server 1, and the entries with document numbers {38, 971} are sent to retrieval server 2, yielding the intermediate result of the retrieval.
Finally, each of the three servers in the cluster performs an AND operation on the inverted-list entries of the high-frequency keyword "China" and the low-frequency keyword "Xu Jianjun" that it holds, and obtains its retrieval result.
After the AND operation, the retrieval result of retrieval server 0 is the document with document number 2331, the retrieval result of retrieval server 1 is empty, and the retrieval result of retrieval server 2 is the document with document number 38. Aggregating the results of the three retrieval servers gives the result of retrieving the search string "China Xu Jianjun": the documents with document numbers {2331, 38}.
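The routing, per-server AND, and aggregation of this worked example can be traced with the following Python sketch. It models each server's data as plain lists of document numbers, which is a simplification (real inverted-list entries also carry weights and position offsets), and all variable names are hypothetical.

```python
N_SERVERS = 3

# Entries of the high-frequency keyword already read by each server.
high_freq_shards = {
    0: [207, 903, 2331],
    1: [16, 100, 319, 1081],
    2: [38, 872, 5618],
}
# All entries of the low-frequency keyword, read by retrieval server 2.
low_freq_on_server_2 = [38, 295, 307, 971, 2331]

# Route each low-frequency entry to the server matching its modulus value.
routed = {s: [] for s in range(N_SERVERS)}
for doc in low_freq_on_server_2:
    routed[doc % N_SERVERS].append(doc)

# Each server ANDs its high-frequency shard with the entries routed to it.
partial_results = {
    s: [d for d in high_freq_shards[s] if d in set(routed[s])]
    for s in range(N_SERVERS)
}

# Aggregate the per-server results in server order.
final_result = [d for s in range(N_SERVERS) for d in partial_results[s]]
# final_result == [2331, 38]
```

Because both keywords are partitioned by the same modulus, a document that contains both keywords always ends up on the same server, so no cross-server comparison is needed for the AND.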
FIG. 6 shows a retrieval system in an embodiment of the present invention.
As shown in FIG. 6, the retrieval system in this embodiment includes a cache proxy server 610, a cluster proxy server 620, and a retrieval server 630.
The cache proxy server 610 is configured to parse the search string to be retrieved and generate a retrieval expression consisting of keywords, to receive the retrieval result from the cluster proxy server 620, and to output the retrieval result as required. The cluster proxy server 620 is configured to receive the retrieval expression from the cache proxy server 610, determine the type of each keyword in the retrieval expression, and send a read command to the retrieval server 630 according to the type of the keyword; it also receives the retrieval result from the retrieval server 630 and sends it to the cache proxy server 610. The retrieval server 630 is configured to read the inverted-list entries of a keyword according to the read command from the cluster proxy server 620, determine the retrieval result for the keyword to be retrieved, and return the retrieval result to the cluster proxy server 620. When the search string includes at least two keywords, the retrieval server 630 is further configured, after obtaining the inverted-list entries of each keyword, to perform logical operations on the entries of the at least two keywords and determine the retrieval result corresponding to the at least two keywords.
A schematic diagram of a retrieval model applying the system of the present invention is shown in FIG. 7. In this diagram the cache proxy server, the cluster proxy servers, and the retrieval servers are arranged in a tree. The system includes one cache proxy server, under which n cluster proxy servers are connected; under each cluster proxy server, n retrieval servers are connected, and each group of n retrieval servers forms a cluster retrieval subsystem.
The cache proxy server is an independent process and can reside on one hardware server. During retrieval, the cache proxy server caches the query requests for externally input search strings, and parses the search string to be retrieved to generate a retrieval expression consisting of keywords. For example, the cache proxy server can call the retrieval interpreter in a retrieval server to parse the externally input search string into a machine-readable retrieval expression. After each cluster retrieval subsystem returns its retrieval result to its cluster proxy server, the cache proxy server aggregates the results of all cluster proxy servers and returns them to the external user.
The cluster proxy server is an independent process and can reside on one hardware server. During retrieval, the cluster proxy server determines the type of each keyword in the retrieval expression and, according to the type of the keyword, sends read commands to the retrieval servers in the cluster subsystem. When the keyword is a high-frequency keyword, it sends each retrieval server a command to read the portion of the high-frequency keyword's index entries that the server itself stores; when the keyword is a low-frequency keyword, it sends one of the retrieval servers a command to read all of the low-frequency keyword's index entries stored on that server. When the retrieval servers return their retrieval results, the cluster proxy server aggregates the returned results, determines the retrieval result for the keyword to be retrieved, and returns the aggregated result to the cache proxy server above it.
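The dispatch behaviour of the cluster proxy server might be sketched as follows in Python. Several details here are assumptions rather than part of the description: the cutoff `HIGH_FREQ_THRESHOLD`, the command names `READ_PARTIAL` and `READ_ALL`, and the CRC32-based rule for choosing which server owns a low-frequency keyword are all hypothetical.

```python
import zlib

# Hypothetical cutoff on the number of documents a keyword hits.
HIGH_FREQ_THRESHOLD = 5000

def plan_read_commands(keyword, hit_count, n_servers):
    """Return a list of (server_index, command, keyword) tuples."""
    if hit_count >= HIGH_FREQ_THRESHOLD:
        # High-frequency keyword: every server reads the part it stores.
        return [(s, "READ_PARTIAL", keyword) for s in range(n_servers)]
    # Low-frequency keyword: a single server holds all entries.
    # Assumed placement rule: CRC32 of the keyword modulo the server count.
    owner = zlib.crc32(keyword.encode("utf-8")) % n_servers
    return [(owner, "READ_ALL", keyword)]

high_plan = plan_read_commands("中国", 1_000_000, 3)
low_plan = plan_read_commands("徐建军", 5, 3)
```

A deterministic hash is used so that the same keyword always maps to the same server across processes; any stable placement table would serve the same purpose.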
Each retrieval server is an independent process and can reside on one hardware server. It is the most basic retrieval unit and, under the scheduling of the cluster proxy server above it, performs the basic low-level retrieval operations, including reading the inverted-list entries of a keyword according to the read command from the cluster proxy server and returning them to the cluster proxy server. When it receives a command to read the portion of a high-frequency keyword's index entries that it stores, it reads that portion; when it receives a command to read all of a low-frequency keyword's index entries that it stores, it reads all of those entries. When the search string includes at least two keywords, the retrieval server also performs the corresponding logical operations, such as AND, OR, and NOT, on the inverted-list entries of the at least two keywords, and determines the index entries corresponding to the at least two keywords.
Applying the embodiments of the present invention can significantly increase retrieval speed. Experiments show that, among 15 million web documents randomly downloaded from the Internet, the total number of unary, binary, and ternary morphemes that each hit more than 1,000 documents does not exceed 500,000. It can therefore be inferred that, among 100 million documents, the number of morphemes that hit 6,000 to 10,000 documents will not exceed 500,000. Assume that, when the relationship between keywords and documents is stored, the document identifier takes 8 bytes, the keyword weight takes 3 bytes, and the compressed keyword position offset takes 2 bytes. Then the inverted-list entries of a keyword that hits 5,000 documents occupy about 64k, and those of a keyword that hits 10,000 documents occupy about 128k, with a read time of 8 milliseconds. In the retrieval model of the present invention shown in FIG. 7, if 16 retrieval servers form a group, the inverted-list entries are partitioned according to their storage size, which comprises the document identifier, the weight, and the position offset. For a morpheme whose entries occupy more than 64k, the entries are stored across multiple retrieval servers; for a morpheme whose entries occupy less than 64k, all the entries are stored on one retrieval server. In addition, once the document identifier has been compressed into a document number, it is stored in less than 2 bytes. As a result, for a morpheme below 64k, reading its inverted-list entries takes less than 8 milliseconds each time. For a morpheme above 64k, each retrieval server stores 64k to 128k of entries, so the group can hold (64k to 128k) / 7 × 16 ≈ 150,000 to 300,000 entries. Thus, among 100 million documents, a medium-to-high-frequency keyword with a hit rate below three per thousand can also be read within 8 milliseconds each time. For both low-frequency keywords and medium-to-high-frequency keywords, therefore, all the inverted-list entries can be read within a single read time. For a high-frequency morpheme with a hit rate above three per thousand, only the higher-weight portion need be stored and the lower-weight portion can be disabled, so that the maximum storage for the inverted-list entries of each high-frequency morpheme does not exceed 1M, that is, the read time does not exceed 50 ms.
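The storage figures quoted above can be checked with a few lines of Python arithmetic. The sketch assumes that "k" means 1,024 bytes and that the compressed document number occupies 2 bytes, so that a compact entry is 7 bytes; both are interpretations of the text, and the constant names are hypothetical.

```python
K = 1024  # assumption: "k" in the text means 1,024 bytes

ENTRY_BYTES_RAW = 8 + 3 + 2      # document identifier + weight + position offset
ENTRY_BYTES_COMPACT = 2 + 3 + 2  # document number (assumed 2 bytes) + weight + offset

size_5000_docs = 5000 * ENTRY_BYTES_RAW     # 65,000 bytes, roughly 64k
size_10000_docs = 10000 * ENTRY_BYTES_RAW   # 130,000 bytes, roughly 128k

SERVERS_PER_GROUP = 16
entries_low = 64 * K // ENTRY_BYTES_COMPACT * SERVERS_PER_GROUP    # 149,792
entries_high = 128 * K // ENTRY_BYTES_COMPACT * SERVERS_PER_GROUP  # 299,584

# A hit rate of three per thousand among 100 million documents.
hits_at_threshold = 100_000_000 * 3 // 1000  # 300,000
```

The upper capacity of about 300,000 entries matches the number of documents hit at a rate of three per thousand among 100 million documents, which is where that threshold in the text comes from.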
A flowchart of retrieval using the retrieval model of FIG. 7 is shown in FIG. 8. In this embodiment, the search string includes at least two keywords.
Step 801: The cache proxy server parses the search string to be retrieved and generates a retrieval expression consisting of keywords.
Step 802: The cluster proxy server determines the type of each keyword in the retrieval expression and, according to the type of each keyword, sends the retrieval servers a command to read the inverted-list entries.
Step 803: After receiving the read request, the retrieval server reads the inverted-list entries of the keyword.
Step 804: The retrieval server performs logical operations on the inverted-list entries of the at least two keywords. For example, it performs the logical operations on the document numbers in the inverted-list entries to obtain the result of the logical operation on the keywords.
Step 805: Each retrieval server sends the result of the logical operation to the cluster proxy server above it, which aggregates the results into an intermediate result.
Step 806: Each cluster proxy server sends its intermediate result to the cache proxy server above it, which aggregates them into the final result and outputs it.
FIG. 9 shows the structure of a retrieval server in an embodiment of the present invention. In this embodiment, the search string to be retrieved includes at least two keywords.
The retrieval server includes a retrieval interpretation module 910, a read management module 920, a keyword reading module 930, a logical operation module 940, and an identifier conversion module 950.
The retrieval interpretation module 910 is configured to parse the search string to be retrieved and generate a retrieval expression consisting of keywords, and can be called by an upper-layer server. The read management module 920 is configured to receive at least one of a command to read the portion of a high-frequency keyword's index entries stored on the server and a command to read all of a low-frequency keyword's index entries stored on the server. The keyword reading module 930 is configured to read the portion of the high-frequency keyword's index entries when the former command is received, and to read all of the low-frequency keyword's index entries when the latter command is received. The inverted-list entries include the document number generated by compressing the document identifier. The logical operation module 940 is configured, when at least two keywords with a logical relationship are to be retrieved, to perform logical operations according to that relationship on the index entries that have been read for the at least two keywords, and to determine the index entries corresponding to the at least two keywords. The identifier conversion module 950 is configured to compress the eight-byte document identifier in a keyword's inverted-list entry into a four-byte document number.
In the above embodiments, the documents are indexed with an inverted index and the corresponding index entries are inverted-list entries. This is only an example of the present invention and is not intended to limit it. When applying the embodiments of the present invention, other indexing methods may also be used, with the index entries corresponding to that indexing method being read.
As can be seen from the above embodiments, on the one hand, the inverted-list entries of a high-frequency keyword are stored across multiple servers in the cluster, and during retrieval these servers read the entries in parallel. A very large number of inverted-list entries can therefore be read within the time the system is designed for, without adding to the time cost of a single logical operation, which increases retrieval speed. On the other hand, all the inverted-list entries of a low-frequency keyword are stored on one retrieval server, and during retrieval only that server reads them. There is therefore no need to read a small number of entries separately on each of several retrieval servers, which saves storage resources on the retrieval servers in the cluster and increases retrieval speed.
In addition, applying the embodiments of the present invention can effectively increase the degree of coupling between the retrieval servers within a retrieval cluster and improves the ability to allocate resources dynamically between servers. By planning the resources of the multiple retrieval servers in the cluster as a whole, the concurrency of the cluster as a whole is maximized, which further increases retrieval speed.
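Although the present invention has been described by way of embodiments, those of ordinary skill in the art will appreciate that many variations and modifications of the present invention are possible without departing from its spirit, and it is intended that the appended claims cover such variations and modifications without departing from the spirit of the present invention.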