Publication number: CN102368260 A
Publication type: Application
Application number: CN 201110308830
Publication date: 7 March 2012
Filing date: 12 October 2011
Priority date: 12 October 2011
Publication number: 201110308830.7, CN 102368260 A, CN 102368260A, CN 201110308830, CN-A-102368260, CN102368260 A, CN102368260A, CN201110308830, CN201110308830.7
Inventors: 时迎超, 柴春光, 黄际洲
Applicant: 北京百度网讯科技有限公司
External links: SIPO, Espacenet
Method and device for generating domain requirement templates
CN 102368260 A
Abstract
The invention provides a method and a device for generating domain requirement templates. The method comprises the following steps: A, obtaining candidate requirement templates for a specific domain; B, extracting features of the candidate requirement templates; C, ranking the candidate requirement templates according to the extracted features; and D, selecting final requirement templates from the candidate requirement templates as the requirement templates of the specific domain. This yields a general-purpose method for generating high-quality domain requirement templates, which helps a search engine understand the intent behind users' searches.
Claims (28), translated from Chinese
1. A method for generating domain requirement templates, characterized in that the method comprises: A. obtaining candidate requirement templates for a specific domain; B. extracting features of the candidate requirement templates, the features comprising at least one of: a similarity feature characterizing how closely a candidate requirement template is related to the specific domain, a generalization-ability feature characterizing the ability of a candidate requirement template to cover user search requests (queries), and a boundary-word feature characterizing the effect that the non-generalized words in a candidate requirement template have on the correctness of that template; C. ranking the candidate requirement templates using the extracted features; D. selecting, according to the ranking result, final requirement templates from the candidate requirement templates as the requirement templates of the specific domain.
2. The method according to claim 1, characterized in that step A comprises: A1. selecting, from a search log, the user queries that match a preset qualifier of the specific domain; A2. replacing the parts of the selected queries that match preset slot keywords of the specific domain with wildcards, to obtain the candidate requirement templates.
3. The method according to claim 2, characterized in that, after step A2, the method further comprises: according to a preset requirement on the number of slots for the specific domain, filtering out, from the candidate requirement templates obtained in step A2, the candidate requirement templates that do not satisfy the slot-number requirement.
4. The method according to claim 1, characterized in that the step of extracting the similarity feature of a candidate requirement template W comprises: obtaining the core word vector of W and the core word vector of the specific domain; and calculating the similarity between the core word vector of W and the core word vector of the specific domain, and using that similarity as the similarity feature of W.
5. The method according to claim 4, characterized in that the step of obtaining the core word vector of W comprises: selecting, from the queries that W covers in the search log, the N1 queries with the highest query counts, and determining core words and their weights in the search results that a search engine returns for the N1 queries, so as to form the core word vector of W, where N1 is a positive integer.
6. The method according to claim 4, characterized in that the step of obtaining the core word vector of the specific domain comprises: using seed queries of the specific domain to obtain the search results returned by a search engine, and determining core words and their weights in those search results, so as to form the core word vector of the specific domain.
7. The method according to claim 6, characterized in that the seed queries of the specific domain are obtained in one of the following modes: mode one, selecting, from all candidate requirement templates of the specific domain, the N2 candidate requirement templates that cover the most queries in the search log, and, for each of the N2 candidate requirement templates, selecting from the queries it covers the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers; or, mode two, combining preset slot keywords of the specific domain with preset qualifiers of the specific domain to generate the seed queries of the specific domain; or, mode three, after selecting part of the seed queries using mode one, using a preset slot keyword dictionary of the specific domain to replace the slot keywords in the seed queries selected by mode one with other slot keywords from the dictionary, thereby obtaining expanded seed queries, the partial seed queries and the expanded seed queries together constituting the seed queries of the specific domain.
8. The method according to claim 1, characterized in that the step of extracting the generalization-ability feature of a candidate requirement template W comprises: determining the slot keyword sequences corresponding to W, counting the number of distinct slot keyword sequences among them, and calculating the generalization-ability feature of W from that number, where a slot keyword sequence corresponding to W is the sequence formed by the slot keywords in one query that W covers in the search log.
9. The method according to claim 1, characterized in that the step of extracting the boundary-word feature of a candidate requirement template W comprises: splitting all candidate requirement templates of the specific domain into segments; selecting positive segments from the resulting segments and determining the weight of each positive segment to generate a positive vector of the specific domain, and selecting negative segments from the resulting segments and determining the weight of each negative segment to generate a negative vector of the specific domain; determining the weights of the segments of W and forming a vector of W from the segments of W and their weights; and calculating the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtaining the boundary-word feature of W from the difference between S1 and S2.
10. The method according to claim 9, characterized in that generating the positive vector and the negative vector of the specific domain specifically comprises: determining the slot keyword sequences corresponding to each segment, where a slot keyword sequence corresponding to a segment is the sequence formed by the slot keywords in one query covered by a candidate requirement template that contains the segment; T1. if all slot keyword sequences corresponding to a segment are identical, taking the segment as a negative segment with weight 1; T2. if the slot keyword sequences corresponding to a segment are not all identical, but one slot keyword sequence accounts for a proportion P of all slot keyword sequences of the segment that exceeds a preset first threshold, taking the segment as a negative segment with weight P; T3. determining the number of distinct slot keyword sequences corresponding to each candidate requirement template of the specific domain and taking the maximum of these numbers as Z1, and, if a segment satisfies neither the condition in T1 nor the condition in T2 and the ratio of the number Z2 of distinct slot keyword sequences corresponding to the segment to Z1 exceeds a preset second threshold, taking the segment as a positive segment whose weight is the ratio of Z2 to Z1.
11. The method according to claim 9, characterized in that the step of determining the weights of the segments of W comprises: counting the number of times each segment of W occurs in W and using that count as the weight of the corresponding segment.
12. The method according to claim 1, characterized in that step C comprises: selecting a standard template set from the candidate requirement templates; using the standard template set to train the parameter corresponding to each extracted feature, and taking as the weight of each feature the parameter value at which the templates of the standard template set can no longer be ranked any higher among all candidate requirement templates; and calculating a score for each candidate requirement template from the extracted features and the feature weights, and ranking the candidate requirement templates by score.
13. The method according to claim 12, characterized in that the step of selecting the standard template set from the candidate requirement templates comprises: for each extracted feature, ranking the candidate requirement templates by the value of that feature and taking the top N3 candidate requirement templates as the template set of that feature, where N3 is a positive integer; and taking the intersection of the template sets of all features as the standard template set.
14. The method according to claim 1, characterized in that step D comprises: selecting the top N4 ranked candidate requirement templates as final requirement templates, where N4 is a positive integer; and obtaining a keyword set from the boundary words of the top M2 ranked candidate requirement templates, and selecting as final requirement templates those candidate requirement templates ranked after the top N4 whose boundary words all belong to the keyword set, where a boundary word is a word in a candidate requirement template that has not been generalized, a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word satisfies a requirement, and M2 is a positive integer less than or equal to N4.
15. A device for generating domain requirement templates, characterized in that the device comprises: a candidate template obtaining unit for obtaining candidate requirement templates for a specific domain; a feature extraction unit for extracting features of the candidate requirement templates, the feature extraction unit comprising at least one of a similarity feature extraction unit, a generalization-ability feature extraction unit, and a boundary-word feature extraction unit, where the similarity feature extraction unit extracts a similarity feature characterizing how closely a candidate requirement template is related to the specific domain, the generalization-ability feature extraction unit extracts a generalization-ability feature characterizing the ability of a candidate requirement template to cover user search requests (queries), and the boundary-word feature extraction unit extracts a boundary-word feature characterizing the effect that the non-generalized words in a candidate requirement template have on the correctness of that template; a ranking unit for ranking the candidate requirement templates using the features extracted by the feature extraction unit; and a selection unit for selecting, according to the ranking produced by the ranking unit, final requirement templates from the candidate requirement templates as the requirement templates of the specific domain.
16. The device according to claim 15, characterized in that the candidate template obtaining unit comprises: a qualifying unit for selecting, from a search log, the user queries that match a preset qualifier of the specific domain; and a generalization unit for replacing the parts of the queries selected by the qualifying unit that match preset slot keywords of the specific domain with wildcards, to obtain the candidate requirement templates.
17. The device according to claim 16, characterized in that the candidate template obtaining unit further comprises a filtering unit for filtering out, according to a preset requirement on the number of slots for the specific domain, the candidate requirement templates obtained by the generalization unit that do not satisfy the slot-number requirement.
18. The device according to claim 15, characterized in that the similarity feature extraction unit comprises: a template word vector generation unit for obtaining the core word vector of a candidate requirement template W when the similarity feature of W is extracted; a domain word vector generation unit for obtaining the core word vector of the specific domain; and a calculation unit for calculating the similarity between the core word vector of W and the core word vector of the specific domain, and using that similarity as the similarity feature of W.
19. The device according to claim 18, characterized in that the template word vector generation unit selects, from the queries that W covers in the search log, the N1 queries with the highest query counts, and determines core words and their weights in the search results that a search engine returns for the N1 queries, so as to form the core word vector of W, where N1 is a positive integer.
20. The device according to claim 18, characterized in that the domain word vector generation unit uses seed queries of the specific domain to obtain the search results returned by a search engine, and determines core words and their weights in those search results, so as to form the core word vector of the specific domain.
21. The device according to claim 20, characterized in that the domain word vector generation unit obtains the seed queries of the specific domain in one of the following modes: mode one, selecting, from all candidate requirement templates of the specific domain, the N2 candidate requirement templates that cover the most queries in the search log, and, for each of the N2 candidate requirement templates, selecting from the queries it covers the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers; or, mode two, combining preset slot keywords of the specific domain with preset qualifiers of the specific domain to generate the seed queries of the specific domain; or, mode three, after selecting part of the seed queries using mode one, using a preset slot keyword dictionary of the specific domain to replace the slot keywords in the seed queries selected by mode one with other slot keywords from the dictionary, thereby obtaining expanded seed queries, the partial seed queries and the expanded seed queries together constituting the seed queries of the specific domain.
22. The device according to claim 15, characterized in that, when extracting the generalization-ability feature of a candidate requirement template W, the generalization-ability feature extraction unit determines the slot keyword sequences corresponding to W, counts the number of distinct slot keyword sequences among them, and calculates the generalization-ability feature of W from that number, where a slot keyword sequence of W is the sequence formed by the slot keywords in one query that W covers in the search log.
23. The device according to claim 15, characterized in that the boundary-word feature extraction unit comprises: a segmentation unit for splitting all candidate requirement templates of the specific domain into segments; a positive and negative vector generation unit for selecting positive segments from the segments produced by the segmentation unit and determining the weight of each positive segment to generate a positive vector of the specific domain, and for selecting negative segments from the resulting segments and determining the weight of each negative segment to generate a negative vector of the specific domain; a template vector generation unit for determining, when the boundary-word feature of a candidate requirement template W is extracted, the weights of the segments of W and forming a vector of W from the segments of W and their weights; and a similarity calculation unit for calculating the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtaining the boundary-word feature of W from the difference between S1 and S2.
24. The device according to claim 23, characterized in that the positive and negative vector generation unit comprises: a slot keyword sequence determination unit for determining the slot keyword sequences corresponding to each segment, where a slot keyword sequence corresponding to a segment is the sequence formed by the slot keywords in one query covered by a candidate requirement template that contains the segment; and a positive and negative segment selection unit for selecting the positive and negative segments from the segments and determining their weights as follows: T1. if all slot keyword sequences corresponding to a segment are identical, taking the segment as a negative segment with weight 1; T2. if the slot keyword sequences corresponding to a segment are not all identical, but one slot keyword sequence accounts for a proportion P of all slot keyword sequences of the segment that exceeds a preset first threshold, taking the segment as a negative segment with weight P; T3. determining the number of distinct slot keyword sequences corresponding to each candidate requirement template of the specific domain and taking the maximum of these numbers as Z1, and, if a segment satisfies neither the condition in T1 nor the condition in T2 and the ratio of the number Z2 of distinct slot keyword sequences corresponding to the segment to Z1 exceeds a preset second threshold, taking the segment as a positive segment whose weight is the ratio of Z2 to Z1.
25. The device according to claim 23, characterized in that, when determining the weights of the segments of W, the template vector generation unit counts the number of times each segment of W occurs in W and uses that count as the weight of the corresponding segment.
26. The device according to claim 15, characterized in that the ranking unit comprises: a standard template set selection unit for selecting a standard template set from the candidate requirement templates; a training unit for using the standard template set to train the parameter corresponding to each extracted feature, and taking as the weight of each feature the parameter value at which the templates of the standard template set can no longer be ranked any higher among all candidate requirement templates; and a calculation and ranking unit for calculating a score for each candidate requirement template from the features extracted by the feature extraction unit and the feature weights obtained by the training unit, and ranking the candidate requirement templates by score.
27. The device according to claim 26, characterized in that the standard template set selection unit comprises: a template set determination unit for ranking, for each extracted feature, the candidate requirement templates by the value of that feature and taking the top N3 candidate requirement templates as the template set of that feature, where N3 is a positive integer; and an intersection unit for taking the intersection of the template sets of all features as the standard template set.
28. The device according to claim 15, characterized in that the selection unit comprises: a first selection unit for selecting the top N4 ranked candidate requirement templates as final requirement templates, where N4 is a positive integer; and a second selection unit for obtaining a keyword set from the boundary words of the top M2 ranked candidate requirement templates, and selecting as final requirement templates those candidate requirement templates ranked after the top N4 whose boundary words all belong to the keyword set, where a boundary word is a word in a candidate requirement template that has not been generalized, a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word satisfies a requirement, and M2 is a positive integer less than or equal to N4.
Description, translated from Chinese

Method and device for generating domain requirement templates

TECHNICAL FIELD

[0001] The present invention relates to natural language processing technology, and in particular to a method and a device for generating domain requirement templates.

BACKGROUND OF THE INVENTION

[0002] Search engines make it much easier for people to find the information they need. A traditional search engine serves users by looking up an index that contains the user's search keywords and returning the pages that match those keywords. For example, if the user's search request (query) is "Beijing car 4S dealership hiring sales supervisor", the user gets a results page listing recruitment websites, clicks through to one of those websites, fills in the relevant information there, and searches within the site to reach the information actually needed. If a search engine could better understand what the user is really trying to do, it could return information that truly matches the user's needs more accurately. Natural language processing is therefore very important to search engines. In natural language processing, domain-based requirement templates can be used to identify the purpose of a user's search. For example, if the user's query is "how to get from Dazhongsi to Xidan" and the query matches a requirement template of the transportation domain, the search engine can infer that the user has a transportation-related need and can return transportation-related applications directly. Whether high-quality domain requirement templates can be generated is therefore very important to a search engine's ability to understand the user's search intent correctly.

[0003] In the past, domain requirement templates were usually generated with a different mining method for each application. This wastes considerable manpower and resources, and the resulting way of generating domain requirement templates adapts poorly and is hard to change as the application changes.

SUMMARY OF THE INVENTION

[0004] The technical problem to be solved by the present invention is to provide a method and a device for generating domain requirement templates that overcome the poor adaptability of domain requirement template generation in the prior art.

[0005] The technical solution adopted by the present invention to solve this problem is a method for generating domain requirement templates, comprising: A. obtaining candidate requirement templates for a specific domain; B. extracting features of the candidate requirement templates, the features comprising at least one of: a similarity feature characterizing how closely a candidate requirement template is related to the specific domain, a generalization-ability feature characterizing the ability of a candidate requirement template to cover user search requests (queries), and a boundary-word feature characterizing the effect that the non-generalized words in a candidate requirement template have on the correctness of that template; C. ranking the candidate requirement templates using the extracted features; D. selecting, according to the ranking result, final requirement templates from the candidate requirement templates as the requirement templates of the specific domain.

[0006] According to a preferred embodiment of the present invention, step A comprises: A1. selecting, from a search log, the user queries that match a preset qualifier of the specific domain; A2. replacing the parts of the selected queries that match preset slot keywords of the specific domain with wildcards, to obtain the candidate requirement templates.
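Steps A1 and A2 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `build_candidate_templates`, the wildcard token `[slot]`, the qualifier "how to get", and plain substring matching are all assumptions made for illustration.

```python
def build_candidate_templates(queries, qualifiers, slot_keywords, wildcard="[slot]"):
    """Illustrative sketch of steps A1 and A2 (all names are hypothetical)."""
    # Replace longer keywords first so a short keyword never splits a longer one.
    ordered = sorted(slot_keywords, key=len, reverse=True)
    templates = set()
    for query in queries:
        # A1: keep only queries that contain a qualifier of the domain.
        if not any(qualifier in query for qualifier in qualifiers):
            continue
        # A2: replace every slot keyword in the query with the wildcard.
        template = query
        for keyword in ordered:
            template = template.replace(keyword, wildcard)
        if wildcard in template:
            templates.add(template)
    return templates


# Hypothetical search-log queries for a transportation domain.
queries = [
    "how to get from Dazhongsi to Xidan",
    "how to get from Xidan to Wudaokou",
    "Xidan weather",
]
candidates = build_candidate_templates(
    queries,
    qualifiers=["how to get"],
    slot_keywords=["Dazhongsi", "Xidan", "Wudaokou"],
)
```

Here the first two queries collapse into the single candidate template "how to get from [slot] to [slot]", while the third is dropped in A1 because it contains no qualifier.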

[0007] According to a preferred embodiment of the present invention, after step A2 the method further comprises: according to a preset requirement on the number of slots for the specific domain, filtering out, from the candidate requirement templates obtained in step A2, the candidate requirement templates that do not satisfy the slot-number requirement.

[0008] According to a preferred embodiment of the present invention, the step of extracting the similarity feature of a candidate requirement template W comprises: obtaining the core word vector of W and the core word vector of the specific domain; and calculating the similarity between the core word vector of W and the core word vector of the specific domain, and using that similarity as the similarity feature of W.
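The text does not say which similarity measure is applied to the two core word vectors. As one possible reading, the sketch below assumes cosine similarity over sparse vectors stored as word-to-weight dictionaries; the function name and the example words and weights are hypothetical.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse word-weight vectors (an assumed measure)."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in vec_a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)


# Hypothetical core word vectors: core word -> weight.
template_vector = {"route": 2.0, "bus": 1.0}
domain_vector = {"route": 2.0, "bus": 1.0, "subway": 2.0}
similarity_feature = cosine_similarity(template_vector, domain_vector)
```

With these inputs the dot product is 5 and the norms are sqrt(5) and 3, so the similarity feature is sqrt(5)/3, roughly 0.745.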

[0009] According to a preferred embodiment of the present invention, the step of obtaining the core word vector of W comprises: selecting, from the queries that W covers in the search log, the N1 queries with the highest query counts, and determining core words and their weights in the search results that a search engine returns for the N1 queries, so as to form the core word vector of W, where N1 is a positive integer.

[0010] According to a preferred embodiment of the present invention, obtaining the core-word vector of the specific domain comprises: using seed queries of the specific domain to obtain search results returned by the search engine, and determining core words and their weights from those search results, to form the core-word vector of the specific domain.

[0011] According to a preferred embodiment of the present invention, the seed queries of the specific domain are obtained in one of the following modes. Mode 1: from all candidate requirement templates of the specific domain, select the N2 candidate requirement templates that cover the most queries in the search log, and for each of these N2 templates select, from the queries it covers, the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers. Mode 2: combine preset slot keywords of the specific domain with preset qualifiers of the specific domain to generate the seed queries. Mode 3: select part of the seed queries using Mode 1, then use a preset slot keyword dictionary of the specific domain to replace the slot keywords in those seed queries with other slot keywords from the dictionary, yielding extended seed queries; the seed queries selected by Mode 1 together with the extended seed queries constitute the seed queries of the specific domain.
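Mode 1 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation; the function name `seed_queries` and the `coverage` mapping (template to covered queries with their query counts) are hypothetical names introduced here.

```python
def seed_queries(coverage, n2, m1=1):
    """Mode 1: pick seed queries from the templates that cover the most queries.

    coverage maps each candidate template to {query: query_count}, i.e. the
    queries the template covers in the search log (hypothetical layout).
    """
    # The N2 templates covering the largest number of distinct queries.
    top_templates = sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)[:n2]
    seeds = []
    for template in top_templates:
        queries = coverage[template]
        # The M1 most frequently issued queries covered by this template.
        seeds.extend(sorted(queries, key=queries.get, reverse=True)[:m1])
    return seeds


coverage = {
    "A": {"q1": 5, "q2": 9, "q3": 1},
    "B": {"q4": 7, "q5": 2},
    "C": {"q6": 100},
}
seeds = seed_queries(coverage, n2=2, m1=1)
```

Here templates A and B cover the most queries, so one seed query is taken from each.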

[0012] According to a preferred embodiment of the present invention, extracting the generalization capability feature of a candidate requirement template W comprises: determining the slot keyword sequences corresponding to W, counting the number of distinct slot keyword sequences among them, and computing the generalization capability feature of W from that number, where a slot keyword sequence corresponding to W is the sequence of slot keywords in one query that W covers in the search log.
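A minimal sketch of this count is given below. The text only says the feature is computed "from that number"; using the raw count as the feature value is an assumption made here, and `generalization_feature` is a hypothetical name.

```python
def generalization_feature(slot_sequences):
    """Number of distinct slot keyword sequences among the covered queries.

    slot_sequences holds one sequence of slot keywords per covered query.
    Assumption: the raw count itself is used as the feature value.
    """
    return len({tuple(sequence) for sequence in slot_sequences})


sequences = [
    ("Beijing", "Route 15"),
    ("Shanghai", "Suburban Route 14"),
    ("Beijing", "Route 15"),  # repeated query pattern, counted once
]
feature = generalization_feature(sequences)
```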

[0013] According to a preferred embodiment of the present invention, extracting the boundary word feature of a candidate requirement template W comprises: segmenting all candidate requirement templates of the specific domain into fragments; selecting positive fragments from the resulting fragments and determining the weight of each positive fragment to generate a positive vector of the specific domain, and selecting negative fragments from the resulting fragments and determining the weight of each negative fragment to generate a negative vector of the specific domain; determining the weights of the fragments of W and forming the vector of W from those fragments and their weights; and computing the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtaining the boundary word feature of W from the difference between S1 and S2.
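The computation of S1 and S2 can be sketched as below. The text does not name the similarity measure or the order of the difference; cosine similarity and S1 minus S2 are assumptions made here, and fragment weights for W are taken as occurrence counts. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def boundary_word_feature(fragments, positive_vector, negative_vector):
    """S1 - S2 for a template given its list of fragments."""
    # Weight of each fragment of W = number of times it occurs in W.
    w_vector = Counter(fragments)
    s1 = cosine(w_vector, positive_vector)
    s2 = cosine(w_vector, negative_vector)
    return s1 - s2


positive_vector = {"bus": 1.0, "route": 1.0}
negative_vector = {"forum": 1.0}
score = boundary_word_feature(["bus", "route"], positive_vector, negative_vector)
```

A template whose fragments resemble the positive vector scores close to 1, while one dominated by negative fragments scores close to -1.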

[0014] According to a preferred embodiment of the present invention, generating the positive vector and the negative vector of the specific domain comprises: determining the slot keyword sequences corresponding to each fragment, where a slot keyword sequence corresponding to a fragment is the sequence of slot keywords in one query covered by a candidate requirement template that contains the fragment; T1. if all slot keyword sequences corresponding to a fragment are identical, taking the fragment as a negative fragment with weight 1; T2. if the slot keyword sequences corresponding to a fragment are not all identical, but one slot keyword sequence accounts for a proportion P of all of the fragment's slot keyword sequences that exceeds a preset first threshold, taking the fragment as a negative fragment with weight P; T3. determining, for each candidate requirement template of the specific domain, the number of distinct slot keyword sequences corresponding to it, and taking the maximum Z1 of these numbers; if a fragment satisfies neither the condition in T1 nor the condition in T2, and the ratio of the number Z2 of distinct slot keyword sequences corresponding to the fragment to Z1 exceeds a preset second threshold, taking the fragment as a positive fragment with weight Z2/Z1.
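Rules T1 to T3 can be sketched as a single classification function. This is an illustrative sketch only; `classify_fragment` and its parameter names are hypothetical, and Z1 is assumed to have been computed beforehand over all candidate templates.

```python
from collections import Counter


def classify_fragment(sequences, z1, first_threshold, second_threshold):
    """Classify one fragment by rules T1-T3.

    sequences: the slot keyword sequences corresponding to the fragment.
    z1: maximum number of distinct slot keyword sequences of any template.
    Returns (label, weight) with label "negative", "positive" or None.
    """
    counts = Counter(tuple(sequence) for sequence in sequences)
    if not counts:
        return (None, 0.0)
    distinct = len(counts)
    # T1: every sequence is the same -> negative fragment, weight 1.
    if distinct == 1:
        return ("negative", 1.0)
    # T2: one sequence dominates with proportion P above the first threshold.
    p = max(counts.values()) / len(sequences)
    if p > first_threshold:
        return ("negative", p)
    # T3: many distinct sequences relative to Z1 -> positive fragment.
    ratio = distinct / z1
    if ratio > second_threshold:
        return ("positive", ratio)
    return (None, 0.0)


same = [("Beijing",), ("Beijing",)]
dominated = [("Beijing",)] * 9 + [("Shanghai",)]
varied = [("a",), ("b",), ("c",), ("d",), ("e",)]
```

With thresholds 0.8 and 0.3 and Z1 = 10, `same` is a negative fragment of weight 1, `dominated` a negative fragment of weight 0.9, and `varied` a positive fragment of weight 0.5.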

[0015] According to a preferred embodiment of the present invention, determining the weights of the fragments of W comprises: counting the number of times each fragment of W occurs in W and using that count as the weight of the corresponding fragment.

[0016] According to a preferred embodiment of the present invention, step C comprises: selecting a standard template set from the candidate requirement templates; using the standard template set to train the parameter corresponding to each extracted feature, and taking as each feature's weight the parameter value reached in training at which the templates in the standard template set can rank no higher among all candidate requirement templates; and computing a score for each candidate requirement template from the extracted features and their weights, and ranking the candidate requirement templates by score.

[0017] According to a preferred embodiment of the present invention, selecting the standard template set from the candidate requirement templates comprises: for each extracted feature, ranking the candidate requirement templates by feature value and taking the top N3 candidate requirement templates as the template set of that feature, where N3 is a positive integer; and taking the intersection of the template sets of all features as the standard template set.
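The intersection step can be sketched as follows, assuming that a higher feature value ranks first. The function name `standard_template_set` and the nested-dict layout of `features` are hypothetical.

```python
def standard_template_set(features, n3):
    """Intersect the top-N3 templates of every feature.

    features maps a feature name to {template: feature_value}.
    Assumption: larger feature values rank higher.
    """
    top_sets = []
    for values in features.values():
        ranked = sorted(values, key=values.get, reverse=True)
        top_sets.append(set(ranked[:n3]))
    return set.intersection(*top_sets) if top_sets else set()


features = {
    "similarity": {"a": 0.9, "b": 0.8, "c": 0.1},
    "generalization": {"a": 50, "b": 3, "c": 40},
    "boundary": {"a": 0.7, "b": 0.2, "c": 0.6},
}
standard = standard_template_set(features, n3=2)
```

Only template "a" is in the top two for all three features, so it alone forms the standard template set.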

[0018] According to a preferred embodiment of the present invention, step D comprises: selecting the candidate requirement templates ranked in the top N4 as final requirement templates, where N4 is a positive integer; and obtaining a keyword set from the boundary words of the candidate requirement templates ranked in the top M2, and also selecting as final requirement templates those candidate requirement templates ranked below the top N4 whose boundary words all belong to the keyword set, where a boundary word is a word in a candidate requirement template that has not been generalized, a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word meets a requirement, and M2 is a positive integer less than or equal to N4.
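Step D can be sketched as below. The lookup `related_words` (boundary word to its synonyms and high-mutual-information words) is a hypothetical stand-in for the thesaurus and mutual-information resources, and the sketch follows the wording literally in that the keyword set contains only such related words.

```python
def select_final(ranked, n4, m2, boundary_words, related_words):
    """Select final templates from a best-first ranking.

    ranked: candidate templates, best first.
    boundary_words: {template: set of non-generalized words}.
    related_words: {boundary word: set of synonyms / high-MI words}.
    """
    if m2 > n4:
        raise ValueError("M2 must be less than or equal to N4")
    final = list(ranked[:n4])
    # Keyword set derived from the boundary words of the top-M2 templates.
    keywords = set()
    for template in ranked[:m2]:
        for word in boundary_words[template]:
            keywords |= related_words.get(word, set())
    # Lower-ranked templates qualify if all their boundary words are keywords.
    for template in ranked[n4:]:
        if boundary_words[template] <= keywords:
            final.append(template)
    return final


ranked = ["t1", "t2", "t3", "t4"]
boundary_words = {"t1": {"route"}, "t2": {"line"}, "t3": {"path"}, "t4": {"forum"}}
related_words = {"route": {"path", "line"}}
final = select_final(ranked, n4=2, m2=1, boundary_words=boundary_words,
                     related_words=related_words)
```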

[0019] The present invention further provides a device for generating domain requirement templates, comprising: a candidate template obtaining unit for obtaining candidate requirement templates of a specific domain; a feature extraction unit for extracting features of the candidate requirement templates, the feature extraction unit including at least one of a similarity feature extraction unit, a generalization capability feature extraction unit, and a boundary word feature extraction unit, where the similarity feature extraction unit extracts a similarity feature characterizing how closely a candidate requirement template is tied to the specific domain, the generalization capability feature extraction unit extracts a generalization capability feature characterizing the ability of a candidate requirement template to cover user search requests (queries), and the boundary word feature extraction unit extracts a boundary word feature characterizing the effect that the non-generalized words in a candidate requirement template have on its correctness; a ranking unit for ranking the candidate requirement templates using the features extracted by the feature extraction unit; and a selection unit for selecting, according to the ranking produced by the ranking unit, final requirement templates from the candidate requirement templates as the requirement templates of the specific domain.

[0020] According to a preferred embodiment of the present invention, the candidate template obtaining unit comprises: a qualifying unit for selecting from the search log the user queries that match a preset qualifier of the specific domain; and a generalization unit for replacing the parts of the queries selected by the qualifying unit that match preset slot keywords of the specific domain with wildcards, to obtain the candidate requirement templates.

[0021] According to a preferred embodiment of the present invention, the candidate template obtaining unit further comprises a filtering unit for filtering out, from the candidate requirement templates obtained by the generalization unit, those that do not satisfy a preset slot-count requirement for the specific domain.

[0022] According to a preferred embodiment of the present invention, the similarity feature extraction unit comprises: a template word vector generation unit for obtaining the core-word vector of a candidate requirement template W when the similarity feature of W is extracted; a domain word vector generation unit for obtaining the core-word vector of the specific domain; and a computing unit for computing the similarity between the core-word vector of W and the core-word vector of the specific domain and using it as the similarity feature of W.

[0023] According to a preferred embodiment of the present invention, the template word vector generation unit selects, from the queries that W covers in the search log, the N1 queries with the highest query counts, and determines core words and their weights from the search results that the search engine returns for those N1 queries, to form the core-word vector of W, where N1 is a positive integer.

[0024] According to a preferred embodiment of the present invention, the domain word vector generation unit uses seed queries of the specific domain to obtain search results returned by the search engine, and determines core words and their weights from those search results, to form the core-word vector of the specific domain.

[0025] According to a preferred embodiment of the present invention, the domain word vector generation unit obtains the seed queries of the specific domain in one of the following modes. Mode 1: from all candidate requirement templates of the specific domain, select the N2 candidate requirement templates that cover the most queries in the search log, and for each of these N2 templates select, from the queries it covers, the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers. Mode 2: combine preset slot keywords of the specific domain with preset qualifiers of the specific domain to generate the seed queries. Mode 3: select part of the seed queries using Mode 1, then use a preset slot keyword dictionary of the specific domain to replace the slot keywords in those seed queries with other slot keywords from the dictionary, yielding extended seed queries; the seed queries selected by Mode 1 together with the extended seed queries constitute the seed queries of the specific domain.

[0026] According to a preferred embodiment of the present invention, when extracting the generalization capability feature of a candidate requirement template W, the generalization capability feature extraction unit determines the slot keyword sequences corresponding to W, counts the number of distinct slot keyword sequences among them, and computes the generalization capability feature of W from that number, where a slot keyword sequence of W is the sequence of slot keywords in one query that W covers in the search log.

[0027] According to a preferred embodiment of the present invention, the boundary word feature extraction unit comprises: a segmentation unit for segmenting all candidate requirement templates of the specific domain into fragments; a positive and negative vector generation unit for selecting positive fragments from the fragments produced by the segmentation unit and determining their weights to generate a positive vector of the specific domain, and for selecting negative fragments from those fragments and determining their weights to generate a negative vector of the specific domain; a template vector generation unit for determining, when the boundary word feature of a candidate requirement template W is extracted, the weights of the fragments of W and forming the vector of W from those fragments and their weights; and a similarity computing unit for computing the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtaining the boundary word feature of W from the difference between S1 and S2.

[0028] According to a preferred embodiment of the present invention, the positive and negative vector generation unit comprises: a slot keyword sequence determining unit for determining the slot keyword sequences corresponding to each fragment, where a slot keyword sequence corresponding to a fragment is the sequence of slot keywords in one query covered by a candidate requirement template that contains the fragment; and a positive and negative fragment selection unit for selecting positive and negative fragments and determining their weights as follows: T1. if all slot keyword sequences corresponding to a fragment are identical, the fragment is taken as a negative fragment with weight 1; T2. if the slot keyword sequences corresponding to a fragment are not all identical, but one slot keyword sequence accounts for a proportion P of all of the fragment's slot keyword sequences that exceeds a preset first threshold, the fragment is taken as a negative fragment with weight P; T3. the number of distinct slot keyword sequences corresponding to each candidate requirement template of the specific domain is determined and the maximum Z1 of these numbers is taken; if a fragment satisfies neither the condition in T1 nor the condition in T2, and the ratio of the number Z2 of distinct slot keyword sequences corresponding to the fragment to Z1 exceeds a preset second threshold, the fragment is taken as a positive fragment with weight Z2/Z1.

[0029] According to a preferred embodiment of the present invention, when determining the weights of the fragments of W, the template vector generation unit counts the number of times each fragment of W occurs in W and uses that count as the weight of the corresponding fragment.

[0030] According to a preferred embodiment of the present invention, the ranking unit comprises: a standard template set selection unit for selecting a standard template set from the candidate requirement templates; a training unit for using the standard template set to train the parameter corresponding to each extracted feature, and taking as each feature's weight the parameter value reached in training at which the templates in the standard template set can rank no higher among all candidate requirement templates; and a computing and ranking unit for computing a score for each candidate requirement template from the features extracted by the feature extraction unit and the feature weights obtained by the training unit, and ranking the candidate requirement templates by score.

[0031] According to a preferred embodiment of the present invention, the standard template set selection unit comprises: a template set determining unit for ranking, for each extracted feature, the candidate requirement templates by feature value and taking the top N3 candidate requirement templates as the template set of that feature, where N3 is a positive integer; and an intersection unit for taking the intersection of the template sets of all features as the standard template set.

[0032] According to a preferred embodiment of the present invention, the selection unit comprises: a first selection unit for selecting the candidate requirement templates ranked in the top N4 as final requirement templates, where N4 is a positive integer; and a second selection unit for obtaining a keyword set from the boundary words of the candidate requirement templates ranked in the top M2, and selecting as final requirement templates those candidate requirement templates ranked below the top N4 whose boundary words all belong to the keyword set, where a boundary word is a word in a candidate requirement template that has not been generalized, a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word meets a requirement, and M2 is a positive integer less than or equal to N4.

[0033] As can be seen from the above technical solutions, the present invention provides a general-purpose method for generating domain requirement templates. For any given domain, the method automatically mines candidate requirement templates and extracts features from them to assess their quality, so that high-quality requirement templates can be obtained from the candidates. The high-quality requirement templates obtained for each domain help a search engine understand the purpose behind user behavior.

[BRIEF DESCRIPTION OF THE DRAWINGS]

[0034] FIG. 1 is a schematic flowchart of the method for generating domain requirement templates according to the present invention;

[0035] FIG. 2 is a schematic flowchart of an embodiment of obtaining candidate requirement templates according to the present invention;

[0036] FIG. 3 is a schematic diagram of using a seed query to obtain data returned by a search engine according to the present invention;

[0037] FIG. 4 is a schematic structural block diagram of an embodiment of the device for generating domain requirement templates according to the present invention;

[0038] FIG. 5 is a schematic structural block diagram of an embodiment of the similarity feature extraction unit according to the present invention;

[0039] FIG. 6 is a schematic structural block diagram of an embodiment of the boundary word feature extraction unit according to the present invention;

[0040] FIG. 7 is a schematic structural block diagram of an embodiment of the standard template set selection unit according to the present invention.

[DETAILED DESCRIPTION]

[0041] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

[0042] Refer to FIG. 1, which is a schematic flowchart of the method for generating domain requirement templates according to the present invention. As shown in FIG. 1, the method comprises:

[0043] Step S101: obtain candidate requirement templates of a specific domain.

[0044] Step S102: extract features of the candidate requirement templates.

[0045] Step S103: rank the candidate requirement templates using the extracted features.

[0046] Step S104: according to the ranking result, select final requirement templates from the candidate requirement templates as the requirement templates of the specific domain.

[0047] The method is described in detail below through specific embodiments.

[0048] In the present invention, a specific domain is a scope that reflects the purpose of a user's search, such as the bus domain or the weather domain.

[0049] Refer to FIG. 2, which is a schematic flowchart of an embodiment of obtaining candidate requirement templates according to the present invention. In this embodiment, a domain qualifier dictionary and a slot keyword dictionary are used to process the user search requests (queries) in the user search log (query log), thereby generating candidate requirement templates.

[0050] The domain qualifier dictionary contains words related to each domain; the qualifiers of a specific domain are the words related to that domain. In this embodiment, the qualifiers of a specific domain are used to filter queries during selection. Only queries that contain a qualifier of the specific domain are generalized, and the candidate requirement templates produced by that generalization belong to the specific domain. The words in the domain qualifier dictionary can be collected as follows:

[0051] First, domain seed words can be mined from user queries to serve as domain qualifiers; the domain seed words can be configured manually or annotated manually in the search log.

[0052] Then, words synonymous with the domain seed words are found by looking up a thesaurus and are added as domain qualifiers. In addition, mutual information, which measures how closely two words are associated, can be used to select words from the search log that are strongly associated with the seed words; these are also used as domain qualifiers. The mutual information between words can be obtained from statistics over a large corpus; as this is prior art, it is not described further here. Taking the bus domain as an example, Table 1 gives some sample domain qualifiers:

[0053] Table 1

[0054]

Figure CN102368260AD00121

[0055] Generating candidate requirement templates is the process of generalizing queries. Generalization means replacing the parts of a user query that match slot keywords of the specific domain with wildcards. Slot keywords are the words used for generalization; they are identified by looking them up in a slot keyword dictionary, which can be built by collecting proper nouns of various kinds.

[0056] For example, generalizing the query "北京15路公交车路线" ("Beijing Route 15 bus route") yields the requirement template "[city name][bus line] bus route". Each "[]" symbol denotes one slot of the template, meaning that the content at that position can be replaced by anything that satisfies the wildcard's attribute; for example, the template above also matches "上海郊14路公共车路线" ("Shanghai Suburban Route 14 bus route").
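The qualifier filtering and slot replacement can be sketched as follows. This is a minimal illustration using English stand-ins for the example queries; the function names, the dictionary layout, and the substring-based matching are assumptions made here, not details given in the text.

```python
import re


def generalize(query, slot_dict):
    """Replace every slot keyword in the query with its bracketed wildcard."""
    if not slot_dict:
        return query
    # Longest keywords first, so a long keyword is not split by a shorter one.
    keywords = sorted(slot_dict, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    return pattern.sub(lambda m: "[" + slot_dict[m.group(0)] + "]", query)


def candidate_templates(queries, qualifiers, slot_dict):
    """Generalize only the queries that contain a domain qualifier."""
    return {generalize(q, slot_dict) for q in queries
            if any(word in q for word in qualifiers)}


# Hypothetical slot keyword dictionary: keyword -> wildcard attribute.
slot_dict = {
    "Beijing": "city name",
    "Shanghai": "city name",
    "Route 15": "bus line",
    "Suburban Route 14": "bus line",
}
queries = [
    "Beijing Route 15 bus route",
    "Shanghai Suburban Route 14 bus route",
    "Beijing weather",
]
templates = candidate_templates(queries, {"bus"}, slot_dict)
```

Both bus queries collapse to the same template, while the weather query is dropped because it contains no bus-domain qualifier.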

[0057] After the candidate requirement templates are obtained, they can optionally be filtered according to a slot-count requirement preset for the specific domain to which they belong. For example, in the train information domain, the variable information in a query generally involves only an origin and a destination, so the predetermined slot count for templates in this domain can be set to 2. Any template that does not meet the predetermined slot count is filtered out, which reduces the complexity of subsequent processing of the candidate requirement templates.
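A sketch of this filter is shown below, assuming that slots appear in a template as bracketed wildcards such as "[city]". The function names are hypothetical.

```python
import re


def slot_count(template):
    """Count bracketed wildcards such as "[city]" in a template."""
    return len(re.findall(r"\[[^\[\]]+\]", template))


def filter_by_slot_count(templates, required_slots):
    """Keep only the templates with exactly the required number of slots."""
    return [t for t in templates if slot_count(t) == required_slots]


train_templates = [
    "[city] to [city] train",
    "[city] train timetable",
    "[city] to [city] via [city] train",
]
kept = filter_by_slot_count(train_templates, required_slots=2)
```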

[0058] In this embodiment, the features extracted in step S102 include at least one of the following:

[0059] a similarity feature, which describes how closely a candidate requirement template is tied to the specific domain; a generalization capability feature, which describes the ability of a candidate requirement template to cover user search requests (queries); and a boundary word feature, which describes the effect that the non-generalized words in a candidate requirement template have on its correctness.

[0060] Embodiments of how these three features are computed are described below.

[0061] 1. Similarity feature

[0062] The similarity feature of a candidate requirement template W can be obtained by computing the cosine similarity between the core-word vector of W and the core-word vector of the specific domain to which W belongs, using formula (1):

[0063] sim_score = CossSimilarity (pattern_vector, seed_query_centroid) (1)

[0064] where sim_score denotes the similarity feature value of the candidate requirement template W, pattern_vector denotes the core-word vector of W, seed_query_centroid denotes the core-word vector of the specific domain, and CossSimilarity denotes the cosine similarity function.
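Formula (1) can be sketched with sparse vectors stored as dictionaries from core word to weight. The vector values below are made up for illustration, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors (core word -> weight)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Illustrative core-word vectors; the weights are invented for this example.
pattern_vector = {"bus": 0.8, "route": 0.6}
seed_query_centroid = {"bus": 0.6, "route": 0.8, "station": 0.0}
sim_score = cosine_similarity(pattern_vector, seed_query_centroid)
```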

[0065] A core-word vector is a vector whose features are core words. Therefore, to compute the similarity feature, it must first be determined how core words are selected.

[0066] To determine the core words of a specific domain, the seed queries of the domain can be used to obtain data returned by a search engine, and the core words are then determined from that data. Refer to FIG. 3, which is a schematic diagram of using a seed query to obtain data returned by a search engine according to the present invention. As shown in FIG. 3, the seed query is "北京15路公交车路线" ("Beijing Route 15 bus route"), for which the search engine returns multiple search results. The titles and text of these search results are preprocessed (sentence splitting, word segmentation, stop-word removal, and so on) to obtain a statistical corpus. For each word in the corpus, the number of sentences in which the word occurs and the number of sentences in which it co-occurs with a search term are counted, as is the number of sentences containing the search term, where the search terms are the words obtained by segmenting the seed query.

[0067] With this information, the weight of each word can be computed using formula (2) below. Words whose weight exceeds a set threshold are taken as core words, and their weights become the weights of the corresponding vector features.

[0068]

Centrality_sch_term(w) = [numerator illegible in source] / (log(sf(w) + 1) + log(sf(sch_term) + 1)) (2)

[0069] where Centrality_sch_term(w) is the weight of word w; co(w, sch_term) is the number of sentences in which word w and search term sch_term co-occur; sf(sch_term) is the number of sentences containing search term sch_term; sf(w) is the number of sentences containing word w; and idf(w) is the inverse document frequency of word w, which can be looked up in an inverse document frequency table compiled from a large-scale corpus.

[0070] The seed queries of a specific domain can be obtained in any of the following ways:

[0071] Embodiment 1:

[0072] From the candidate requirement templates of the specific domain, select the N2 templates that cover the most queries in the search log. For each of these N2 templates, select from the queries it covers the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers; preferably, M1 equals 1. For example, Table 2 below lists candidate requirement templates for the public-transit domain:

[0073] Table 2

Figure CN102368260AD00141

[0075] Assuming N2 = 2 and M1 = 1, Table 3 shows the seed queries obtained from the candidate requirement templates in Table 2 using Embodiment 1, together with their corresponding candidate requirement templates.

[0076] Table 3

Figure CN102368260AD00142

[0078] In this embodiment the seed queries come from real user queries and therefore better represent users' habits.
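The selection in Embodiment 1 can be sketched as follows. This is a minimal illustration, assuming the search log has already been reduced to a mapping from each candidate requirement template to a `Counter` of the queries it covers and their query counts; the function name `select_seed_queries` and the sample data are hypothetical.

```python
from collections import Counter


def select_seed_queries(coverage: dict, n2: int, m1: int = 1) -> list:
    """coverage maps template -> Counter(query -> query count)."""
    # Keep the N2 templates that cover the most distinct queries.
    top_templates = sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)[:n2]
    seeds = []
    for template in top_templates:
        # From each kept template, take its M1 most frequently issued queries.
        seeds.extend(query for query, _ in coverage[template].most_common(m1))
    return seeds


# Hypothetical log statistics, not taken from Table 2 or Table 3.
coverage = {
    "T1": Counter({"q1": 5, "q2": 9}),
    "T2": Counter({"q3": 1}),
    "T3": Counter({"q4": 2, "q5": 3, "q6": 7}),
}
seeds = select_seed_queries(coverage, n2=2, m1=1)
```

With N2 = 2 and M1 = 1, the sketch keeps T3 and T1 (three and two covered queries) and returns the most frequent query of each.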

[0079] Embodiment 2:

[0080] Combine slot keywords of the specific domain with qualifiers of the specific domain to generate seed queries.

[0081] Taking the generation of seed queries for the public-transit domain as an example, see Table 4:

[0082] Table 4

Figure CN102368260AD00151

[0084] Seed queries generated in this way have a simple structure.

[0085] Preferably, Embodiment 3 is used to obtain the seed queries.

[0086] Embodiment 3:

[0087] Select some seed queries using the method of Embodiment 1, then use a slot keyword dictionary to replace the slot keywords in the selected seed queries with other slot keywords of the specific domain, yielding expanded seed queries.

[0088] For example, Table 5 shows seed queries obtained with Embodiment 3.

[0089] Table 5

[0090]

Figure CN102368260AD00152
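The slot keyword replacement of Embodiment 3 might look like the sketch below. It assumes the slot keywords of a seed query are already known and that the slot keyword dictionary is a mapping from slot name to keyword list; `expand_seed_query`, the slot names, and the sample entries are hypothetical and are not the contents of Table 5.

```python
from itertools import product


def expand_seed_query(seed: str, seed_slots: dict, slot_dictionary: dict) -> list:
    """Replace the slot keywords of a seed query with other dictionary entries."""
    # Turn the seed into a pattern such as "{city} {line} bus route".
    pattern = seed
    for slot, keyword in seed_slots.items():
        pattern = pattern.replace(keyword, "{" + slot + "}", 1)
    slots = list(seed_slots)
    expanded = []
    for combo in product(*(slot_dictionary[s] for s in slots)):
        query = pattern.format(**dict(zip(slots, combo)))
        if query != seed:  # keep only genuinely new queries
            expanded.append(query)
    return expanded


expanded = expand_seed_query(
    "Beijing Route 15 bus route",
    {"city": "Beijing", "line": "Route 15"},
    {"city": ["Beijing", "Shanghai"], "line": ["Route 15", "Route 36"]},
)
```

Taking every combination of dictionary entries is a simplification; it can produce city and line pairs that do not exist in practice.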

[0091] The above process yields the core word vector of the specific domain. The process of obtaining the core word vector of a candidate requirement template is described next.

[0092] First, as with the core word vector of the specific domain, a statistical corpus must be obtained. From the queries that the candidate requirement template covers in the search log, select the N1 queries with the highest query counts as the queries to be searched. Then use these queries to obtain search results from a search engine and preprocess the titles and texts of those results to obtain the statistical corpus, where N1 is a positive integer.

[0093] In the resulting corpus, count the frequency of each word and compute the weight of each word by formula (3) below. Words whose weight exceeds a set threshold are taken as core words of the candidate requirement template, and the weight of a core word is the weight of the corresponding vector feature.

[0094] Weight(w) = log(tf(w) + 1) × log(idf(w) + 1) (3)

[0095] where Weight(w) is the weight of word w, tf(w) is the frequency of word w, and idf(w) is the inverse document frequency of word w, which can be looked up in an inverse document frequency table compiled from a large-scale corpus.
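Formula (3) and the thresholding step can be sketched as below. The text does not state the base of the logarithm, so the natural logarithm is assumed here; treating a word missing from the idf table as having idf 0 is a further assumption, as are the function name `core_word_vector` and the sample values.

```python
import math


def core_word_vector(tf: dict, idf: dict, threshold: float) -> dict:
    """Weight(w) = log(tf(w) + 1) * log(idf(w) + 1); keep weights above threshold."""
    weights = {
        word: math.log(count + 1) * math.log(idf.get(word, 0.0) + 1)
        for word, count in tf.items()
    }
    return {word: weight for word, weight in weights.items() if weight > threshold}


# Hypothetical corpus statistics: "the" is frequent but carries no idf weight.
tf = {"bus": 9, "the": 9}
idf = {"bus": math.e - 1, "the": 0.0}
vector = core_word_vector(tf, idf, threshold=1.0)
```

In this sample only "bus" survives the threshold, with weight log(10) × log(e) = log(10).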

[0096] Once the core word vector of the candidate requirement template and the core word vector of the specific domain are available, the similarity feature of the candidate requirement template can be computed by formula (1).

[0097] 2. Generalization ability feature

[0098] The generalization ability feature can be measured by the number of distinct slot keyword sequences among those corresponding to a candidate requirement template, where a slot keyword sequence corresponding to a candidate requirement template is the sequence formed by the slot keywords in one query that the template covers in the search log.

[0099] For example, the template "[city name][bus line] bus route" covers the queries "Beijing Route 15 bus route", "Shanghai Suburban Route 14 bus route", "Shenyang Tiexi Line 2 bus route", and "Beijing Route 15 bus route map query". Its slot keyword sequences are then "Beijing Route 15", "Shanghai Suburban Route 14", "Shenyang Tiexi Line 2", and "Beijing Route 15", of which the distinct ones are "Beijing Route 15", "Shanghai Suburban Route 14", and "Shenyang Tiexi Line 2". The generalization ability feature value of this template is therefore 3.

[0100] Preferably, the generalization ability feature is computed as follows. First determine the number of distinct slot keyword sequences corresponding to each candidate requirement template of the specific domain, and the maximum of these numbers. Then compute the generalization ability feature value of each candidate requirement template by formula (4):

[0101] general_score_i = log(pattern_dif_query_i + 1) / log(max_dif_query + 1) (4)

[0102] where general_score_i is the generalization ability feature value of candidate requirement template i, pattern_dif_query_i is the number of distinct slot keyword sequences corresponding to candidate requirement template i, and max_dif_query is the maximum, over all candidate requirement templates of the specific domain to which template i belongs, of the number of distinct slot keyword sequences.
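The distinct-sequence count and formula (4) can be sketched together. The sketch assumes each slot keyword sequence is stored as a tuple of slot keywords; `general_scores` and the second template are hypothetical. The ratio of two logarithms does not depend on the base, so the natural logarithm is used.

```python
import math


def general_scores(template_sequences: dict) -> dict:
    """template_sequences maps template -> list of slot keyword sequences (tuples)."""
    # pattern_dif_query_i: number of distinct slot keyword sequences per template.
    counts = {t: len(set(seqs)) for t, seqs in template_sequences.items()}
    max_dif_query = max(counts.values())
    if max_dif_query == 0:
        return {t: 0.0 for t in counts}  # assumption: avoid dividing by log(1) = 0
    return {t: math.log(c + 1) / math.log(max_dif_query + 1) for t, c in counts.items()}


template_sequences = {
    "[city name][bus line] bus route": [
        ("Beijing", "Route 15"),
        ("Shanghai", "Suburban Route 14"),
        ("Shenyang", "Tiexi Line 2"),
        ("Beijing", "Route 15"),  # repeated sequence, counted once
    ],
    "hypothetical template": [("Hangzhou",)],
}
scores = general_scores(template_sequences)
```

The first template has three distinct sequences, the largest count in this sample, so it scores 1; the second scores log(2) / log(4) = 0.5.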

[0103] 3. Boundary word feature

[0104] Boundary words are the words in a candidate requirement template that have not been generalized. These words affect the correctness of the finally generated template. For example, in the public-transit domain, a requirement template such as "[city name][bus line] bus route" clearly reflects the needs of the domain better than a template such as "bus card broke what to do [city name]".

[0105] In the present invention, the boundary word feature of candidate requirement template W is computed by formula (5) below.

[0106]-[0108] boundary_word_score = CosSimilarity(pattern_centroid, positive_centroid) - CosSimilarity(pattern_centroid, negative_centroid) (5)

[0109] where boundary_word_score is the boundary word feature of candidate requirement template W, CosSimilarity is the cosine similarity function, pattern_centroid is the vector formed from candidate requirement template W, positive_centroid is the positive vector of the specific domain, and negative_centroid is the negative vector of the specific domain.
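Formula (5) is a difference of two cosine similarities. A minimal sketch, assuming all three vectors are sparse mappings from segmentation fragment to weight; the helper names and the sample weights are hypothetical.

```python
import math


def cos_similarity(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse vectors keyed by fragment."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty vector is treated as dissimilar
    return dot / (norm_a * norm_b)


def boundary_word_score(pattern_centroid: dict, positive_centroid: dict,
                        negative_centroid: dict) -> float:
    """Similarity to the positive vector minus similarity to the negative vector."""
    return (cos_similarity(pattern_centroid, positive_centroid)
            - cos_similarity(pattern_centroid, negative_centroid))


# Hypothetical vectors; the weights are illustrative only.
pattern_centroid = {"[city name][bus line]": 1.0, "[bus line] bus route": 1.0}
positive_centroid = {"[city name][bus line]": 0.3}
negative_centroid = {"bus card broke": 1.0}
score = boundary_word_score(pattern_centroid, positive_centroid, negative_centroid)
```

A template that shares fragments only with the positive vector gets a positive score, and one that shares fragments only with the negative vector gets a negative score.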

[0110] The following describes how each variable in the formula is obtained.

[0111] Generating the positive and negative vectors of a specific domain comprises:

[0112] Segment all candidate requirement templates of the specific domain into n-grams (n > 1), preferably with n = 2, to obtain the segmentation fragments. An n-gram here is a combination of n words appearing in order, each word being of the smallest granularity that can express meaning, where n is a preset positive integer. For example, for the template "[city name][bus line] bus route", suppose its smallest meaning-bearing words are "[city name]", "[bus line]", and "bus route"; its 2-gram fragments are then "[city name][bus line]" and "[bus line] bus route". Likewise, for the template "bus card broke what to do [city name]", suppose its smallest meaning-bearing words are "bus card", "broke", "what to do", and "[city name]"; its 2-gram fragments are then "bus card broke", "broke what to do", and "what to do [city name]".
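The n-gram segmentation can be sketched as a sliding window over the template's smallest meaning-bearing words. The sketch assumes the template has already been tokenized into those words; `ngram_fragments` is a hypothetical name.

```python
def ngram_fragments(tokens: list, n: int = 2) -> list:
    """Return every run of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bus_route = ngram_fragments(["[city name]", "[bus line]", "bus route"])
bus_card = ngram_fragments(["bus card", "broke", "what to do", "[city name]"])
```

A template with fewer than n tokens yields no fragments.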

[0113] Select positive fragments and negative fragments from the segmentation fragments, where each positive fragment is a vector feature of the positive vector and each negative fragment is a vector feature of the negative vector, and determine the weight of each vector feature. The process comprises:

[0114] A. Determine the slot keyword sequences corresponding to each fragment, where a slot keyword sequence of a fragment is the sequence formed by the slot keywords in one query covered by a candidate requirement template that contains the fragment.

[0115] For example, for the fragment "[city name] bus", the candidate requirement templates containing it and the queries they cover are shown in Table 6:

[0116] Table 6

[0117]

Figure CN102368260AD00171

[0118] For the fragment "[city name] bus", the slot keyword sequences are then "Beijing Route 15", "Shanghai Route 36", "Beijing Route 15", and "Hangzhou".

[0119] B. Select positive vector features and negative vector features from the fragments, and determine the weight of each vector feature, as follows:

[0120] (1) If all slot keyword sequences of a fragment are identical, the fragment is taken as a negative vector feature with weight 1.

[0121] (2) If the slot keyword sequences of a fragment are not all identical, but one slot keyword sequence accounts for a proportion P of all the fragment's slot keyword sequences and P exceeds a preset first threshold, the fragment is taken as a negative vector feature with weight P. Preferably, the first threshold is 90%.

[0122] (3) Determine the number of distinct slot keyword sequences corresponding to each candidate requirement template of the specific domain, and let Z1 be the maximum of these numbers. If a fragment meets neither of the two conditions above, and the ratio of its number of distinct slot keyword sequences Z2 to Z1 exceeds a preset second threshold, the fragment is taken as a positive vector feature with weight Z2/Z1. Preferably, the second threshold is 1%.

[0123] For example, for the fragment "[city name] bus" above, the distinct slot keyword sequences are "Beijing Route 15", "Shanghai Route 36", and "Hangzhou", so the number of distinct sequences is 3. "Beijing Route 15" accounts for 2/4 of all slot keyword sequences, "Shanghai Route 36" for 1/4, and "Hangzhou" for 1/4, so the fragment meets neither condition (1) nor condition (2) and is not a negative vector feature. Assuming that the maximum number of distinct slot keyword sequences over the candidate requirement templates of the specific domain is 10 and that the second threshold is 1%, then since 3/10 exceeds 1%, the fragment is taken as a positive vector feature.
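Rules (1) to (3) can be sketched as a single classification function. The default thresholds follow the preferred values in the text (90% and 1%); the function name `classify_fragment`, the return convention, and the handling of a fragment with no slot keyword sequences are assumptions.

```python
from collections import Counter


def classify_fragment(sequences: list, z1: int,
                      first_threshold: float = 0.9,
                      second_threshold: float = 0.01) -> tuple:
    """Return (label, weight); label is "negative", "positive", or None."""
    if not sequences:
        return (None, 0.0)  # assumption: nothing to judge the fragment by
    counts = Counter(sequences)
    if len(counts) == 1:
        return ("negative", 1.0)  # rule (1): all sequences identical
    p = counts.most_common(1)[0][1] / len(sequences)
    if p > first_threshold:
        return ("negative", p)  # rule (2): one sequence dominates
    ratio = len(counts) / z1  # Z2 / Z1
    if ratio > second_threshold:
        return ("positive", ratio)  # rule (3)
    return (None, 0.0)


label, weight = classify_fragment(
    ["Beijing Route 15", "Shanghai Route 36", "Beijing Route 15", "Hangzhou"], z1=10
)
```

For the "[city name] bus" example with Z1 = 10, the sketch labels the fragment positive with weight 3/10.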

[0124] Taking the templates in Table 2 as an example, the positive and negative vectors obtained in this way are shown in Table 7 and Table 8, respectively:

[0125] Table 7

[0126]

Figure CN102368260AD00172
Figure CN102368260AD00181

[0127] Table 8

[0128]

Figure CN102368260AD00182

[0129] The vector features of the vector formed from candidate requirement template W are the segmentation fragments of W, obtained in a manner similar to that described for the positive and negative vectors. The feature weight can be determined by the number of times the corresponding fragment occurs in W.

[0130] For example, the template "[city name][bus line] bus route" contains the fragments "[city name][bus line]" and "[bus line] bus route". Since each fragment occurs once in the template, the vector features "[city name][bus line]" and "[bus line] bus route" each have feature weight 1. If a template is "[city name][bus line][city name][bus line]", the feature weight of its vector feature "[city name][bus line]" is 2.
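The count-based feature weights can be sketched with a `Counter` over the 2-gram fragments of a template. The sketch assumes the template is already tokenized into its smallest meaning-bearing words; `template_vector` is a hypothetical name.

```python
from collections import Counter


def template_vector(tokens: list, n: int = 2) -> Counter:
    """Map each n-gram fragment of a template to its number of occurrences."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


single = template_vector(["[city name]", "[bus line]", "bus route"])
repeated = template_vector(["[city name]", "[bus line]", "[city name]", "[bus line]"])
```

For the repeated template, the fragment ("[city name]", "[bus line]") receives weight 2. A Boolean variant would set every weight that is present to 1.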

[0131] The way in which the feature weights of a candidate requirement template's vector features are determined is not unique. Besides using the number of times a fragment occurs in the template, a Boolean value may also be used. The way feature weights are computed is not limited here.

[0132] Taking the candidate requirement templates in Table 2 as an example, the boundary word feature of each candidate requirement template is shown in Table 9:

[0133] Table 9

[0134]

Figure CN102368260AD00191

[0135] In step S103, the sorting process comprises:

[0136] 1. Select a standard template set from the candidate requirement templates, comprising:

[0137] For each extracted feature, sort the candidate requirement templates by feature value and take the top N3 templates as the template set of that feature, where N3 is a positive integer.

[0138] Take the intersection of the template sets of the features as the standard template set.

[0139] For example, sorting candidate requirement templates S1 to S10 by features 1, 2, and 3 gives Table 10:

[0140] Table 10

[0141]

Figure CN102368260AD00201

[0142] If N3 = 5, the template set of feature 1 is {S5, S6, S4, S2, S1}, the template set of feature 2 is {S4, S5, S2, S8, S1}, and the template set of feature 3 is {S2, S10, S5, S6, S1}. The intersection of the template sets of the features is then {S1, S2, S5}.
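The standard template set is a plain set intersection of per-feature top-N3 lists. A minimal sketch, assuming each feature's ranking is given as a list ordered from best to worst; only the top five entries stated in the text are used here, and `standard_template_set` is a hypothetical name.

```python
def standard_template_set(rankings: dict, n3: int) -> set:
    """Intersect the top-N3 templates of every feature ranking."""
    return set.intersection(*(set(ranked[:n3]) for ranked in rankings.values()))


rankings = {
    "feature 1": ["S5", "S6", "S4", "S2", "S1"],
    "feature 2": ["S4", "S5", "S2", "S8", "S1"],
    "feature 3": ["S2", "S10", "S5", "S6", "S1"],
}
standard = standard_template_set(rankings, n3=5)
```

With N3 = 5 this reproduces the set {S1, S2, S5} from the example.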

[0143] 2. Use the standard template set to train the parameter corresponding to each extracted feature. The parameter values at which the templates in the standard template set can rank no higher among all candidate requirement templates are taken as the weights of the corresponding features.

[0144] Formula (6) gives the score of each candidate requirement template when all candidate requirement templates are sorted on the basis of all extracted features. The higher the score, the better the quality of the candidate requirement template and the higher its rank.

[0145] total_score = λ1 × sim_score + λ2 × general_score + λ3 × boundary_word_score (6)

[0146] where sim_score, general_score, and boundary_word_score are the values of the similarity feature, the generalization ability feature, and the boundary word feature, respectively, and λ1, λ2, and λ3 are the parameters to be trained, representing the weights of the respective features.

[0147] The parameters are trained by gradient descent. Through successive iterations the parameter values are adjusted so that the templates in the standard template set rank as high as possible, until their ranks among all candidate requirement templates no longer improve. The parameter values at that point are the weights of the corresponding features.

[0148] 3. Use the extracted features and their weights to compute the score of each candidate requirement template, and sort the candidate requirement templates by score. That is, compute the score by formula (6), where λ1, λ2, and λ3 are the trained feature weights.

[0149] With the scores computed in this way, the candidate requirement templates can be sorted in descending order of score.
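Scoring by formula (6) and sorting can be sketched as follows. The λ values below are placeholders chosen for illustration, not trained weights, and the feature values are invented; `rank_templates` is a hypothetical name.

```python
def rank_templates(features: dict, lambdas: tuple) -> list:
    """features maps template -> (sim_score, general_score, boundary_word_score)."""

    def total_score(values: tuple) -> float:
        # total_score = l1 * sim_score + l2 * general_score + l3 * boundary_word_score
        return sum(lam * value for lam, value in zip(lambdas, values))

    return sorted(features, key=lambda t: total_score(features[t]), reverse=True)


features = {
    "A": (0.9, 1.0, 0.7),
    "B": (0.2, 0.5, -0.1),
    "C": (0.6, 0.8, 0.3),
}
ranked = rank_templates(features, lambdas=(0.5, 0.3, 0.2))
```

With these placeholder weights the totals are about 0.89, 0.23, and 0.60, so the order is A, C, B.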

[0150] When the final requirement templates are selected in step S104, the top N4 candidate requirement templates are taken as final requirement templates. In addition, the boundary words of the top M2 candidate requirement templates are used to select further final requirement templates from the candidate requirement templates ranked after the top N4, where M2 and N4 are positive integers and M2 ≤ N4.

[0151] Specifically:

[0152] Using a keyword dictionary, obtain the keyword set corresponding to the boundary words of the top M2 candidate requirement templates, where a keyword is a word synonymous with a boundary word or a word whose mutual information with a boundary word meets a requirement;

[0153] Among the candidate requirement templates ranked after the top N4, those whose boundary words all belong to the keyword set are taken as final requirement templates.

[0154] Suppose the templates ranked within the top M2 are "[city name][bus line] bus route", "[place name] to [place name] bus" (in the original, "[地点名]到[地点名]的公交车"), and "[city name] bus [bus line]", whose boundary words are "bus route" (公交车路线), "to" (到), "bus" (公交车), and the particle "的". Using the keyword dictionary, the keyword set corresponding to these boundary words is "公交/工交/工交车/公车/公共交通/公共交通线路/公共汽车/公交/公交车/公交联营车/公交路线/公交汽车/公交线/公交线路/公汽/共交/市区公交/公交车线路/的/到/到达", that is, variants of "bus", "public transit", and "bus route" together with "的", "到" (to), and "到达" (arrive). For the template "to [place name] bus route", which ranks after the top N4, the boundary words "to" and "bus route" are both in the keyword set, so this template can also be selected as a final template. The keywords in the keyword dictionary can be obtained by various existing techniques, such as synonym mining or mutual information computation, which are not detailed here.
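The two-stage selection can be sketched as below. It assumes the templates are already sorted by score, that boundary words are available per template, and that the keyword dictionary maps a boundary word to its related keywords. Including each boundary word itself in the keyword set and refusing to admit templates without boundary words are assumptions of this sketch; all names and sample data are hypothetical.

```python
def select_final_templates(ranked: list, boundary_words: dict,
                           keyword_dictionary: dict, n4: int, m2: int) -> list:
    """Keep the top N4 templates plus later ones whose boundary words are all known."""
    keyword_set = set()
    for template in ranked[:m2]:
        for word in boundary_words[template]:
            keyword_set.add(word)  # assumption: a boundary word matches itself
            keyword_set.update(keyword_dictionary.get(word, ()))
    final = list(ranked[:n4])
    for template in ranked[n4:]:
        words = boundary_words[template]
        # assumption: a template with no boundary words is not admitted
        if words and all(word in keyword_set for word in words):
            final.append(template)
    return final


ranked = ["T1", "T2", "T3", "T4"]
boundary_words = {
    "T1": ["bus route"],
    "T2": ["to", "bus"],
    "T3": ["to", "bus route"],
    "T4": ["bus card", "broke"],
}
keyword_dictionary = {"bus": ["public bus"], "bus route": ["bus line"]}
final = select_final_templates(ranked, boundary_words, keyword_dictionary, n4=2, m2=2)
```

Here T3 is admitted after the top two because both of its boundary words appear in the keyword set, while T4 is rejected.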

[0155] Refer to Fig. 4, a schematic structural block diagram of an embodiment of the device for generating domain templates in the present invention. As shown in Fig. 4, the device comprises a candidate requirement template acquisition unit 201, a feature extraction unit 202, a sorting unit 203, and a selection unit 204.

[0156] The candidate requirement template acquisition unit 201 is configured to obtain candidate requirement templates of a specific domain. Preferably, the candidate requirement template acquisition unit 201 comprises a qualification unit 2011 and a generalization unit 2012.

[0157] The qualification unit 2011 is configured to select, from the search log, the user search queries that match preset qualifiers of the specific domain, where the qualifiers of a specific domain are words related to that domain. The generalization unit 2012 is configured to replace the parts of the selected queries that match preset slot keywords of the specific domain with wildcards to obtain candidate requirement templates, where the slot keywords of a specific domain are the words of that domain used for generalization.

[0158] Further, the candidate requirement template acquisition unit 201 may also comprise a filtering unit configured to filter out, from the candidate requirement templates produced by the generalization unit, those that do not satisfy a preset slot-count requirement for the specific domain.

[0159] The feature extraction unit 202 is configured to extract features of the candidate requirement templates. Preferably, the feature extraction unit 202 comprises at least one of a similarity feature extraction unit 2021, a generalization ability feature extraction unit 2022, and a boundary word feature extraction unit 2023.

[0160] The similarity feature extraction unit 2021 is configured to extract the similarity feature of a candidate requirement template, which describes how closely the candidate requirement template is related to the specific domain. Refer to Fig. 5, a schematic structural block diagram of an embodiment of the similarity feature extraction unit in the present invention. As shown in Fig. 5, the similarity feature extraction unit 2021 comprises a template word vector generation unit 2021_1, a domain word vector generation unit 2021_2, and a calculation unit 2021_3.

[0161] The template word vector generation unit 2021_1 is configured to obtain the core word vector of W when the similarity feature of candidate requirement template W is extracted.

[0162] The domain word vector generation unit 2021_2 is configured to obtain the core word vector of the specific domain.

[0163] The calculation unit 2021_3 is configured to compute the similarity between the core word vector of the candidate requirement template and the core word vector of the specific domain, and to take this similarity as the similarity feature of W.

[0164] Preferably, when obtaining the core word vector of W, the template word vector generation unit 2021_1 selects, from the queries that W covers in the search log, the N1 queries with the highest query counts, and determines core words and their weights from the search results returned by a search engine for these N1 queries to form the core word vector of W, where N1 is any positive integer.

[0165] The domain word vector generation unit 2021_2 obtains the seed queries of the specific domain in one of the following modes:

[0166] Mode 1: From all candidate requirement templates of the specific domain, select the N2 templates that cover the most queries in the search log, and for each of these N2 templates select, from the queries it covers, the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers.

[0167] Mode 2: Combine preset slot keywords of the specific domain with preset qualifiers of the specific domain to generate the seed queries of the specific domain.

[0168] Mode 3: After some seed queries have been selected using Mode 1, use a preset slot keyword dictionary of the specific domain to replace the slot keywords in those seed queries with other slot keywords from the dictionary, yielding expanded seed queries. The selected seed queries and the expanded seed queries together constitute the seed queries of the specific domain.

[0169] Preferably, the domain word vector generation unit 2021_2 uses Mode 3 to obtain the seed queries of the specific domain.

[0170] 请继续参考图4。 [0170] Please continue to refer Figure 4. 泛化能力特征提取单元2022,用于提取候选需求模版的泛化能力特征。 Generalization of feature extraction unit 2022, a candidate for generalization feature extraction template demand. 所述泛化能力特征用于描述候选需求模版覆盖用户搜索请求query的能力。 The generalization features the ability for users to search for a candidate needs a template covering the description of query requests.

[0171] 优选地,泛化能力特征提取单元2022在提取候选需求模版W的泛化能力特征时, 确定W对应的槽关键词序列,统计W对应的槽关键词序列中互异的槽关键词序列的数量并依据该数量计算W的泛化能力特征,其中W对应的一个槽关键词序列是由W在搜索日志中覆盖的一个query中的槽关键词组成的序列。 When the [0171] Preferably, the generalization feature extraction unit 2022 extracts features generalization W templates candidate needs to determine a corresponding groove keyword sequence W, W statistics groove keyword sequence corresponding grooves mutually different keywords the number of sequences and computing generalization feature W pursuant to the number one slot W keyword sequence is a sequence corresponding to a query by a W in the search logs covering the slot keyword composition.

[0172] 边界词特征提取单元2023,用于提取候选需求模版的边界词特征。 [0172] word boundary feature extraction unit 2023, a candidate for the Boundary word feature extraction template demand. 所述边界词特征用于描述候选需求模版中未被泛化的词语对候选需求模版的正确性产生的影响。 Words used to describe the boundary feature template not affect a candidate needs the candidate needs generalization of words generated templates correctness.

[0173] 请参考图6,图6为本发明中边界词特征提取单元的实施例的结构示意框图。 [0173] Referring to Figure 6, which illustrates the present invention, the word boundary feature extraction structure schematic block diagram of an embodiment of the unit. 如图6所示,该实施例包括:切分单元2023_1、正负向量生成单元2023_2、模版向量生成单元2023_3及相似度计算单元2023_4。 As shown in Figure 6, this embodiment includes: segmentation unit 2023_1, positive and negative vector generation unit 2023_2, stencil vector generation unit 2023_3 and 2023_4 similarity calculating unit.

[0174] 其中切分单元2023_1用于将特定领域包含的所有候选需求模版切分为片段。 [0174] wherein the segmentation unit 2023_1 for all the candidate needs stencil cut into segments containing specific areas.

[0175] 正负向量生成单元2023_2用于从切分单元2023_1得到的各切分片段中选取正片段并确定正片段的权重以生成特定领域的正向量,从得到的各切分片段中选取负片段并确定负片段的权重以生成特定领域的负向量。 [0175] negative vector generation unit 2023_2 for each split fragment derived from the segmentation unit 2023_1 choose positive fragments and determine the weight of being heavy fragments to generate positive vector in specific areas, selected from each negative split fragments obtained fragments and determine the weight of negative fragment to produce a specific area of a negative vector. 优选地,正负向量生成单元2023_3包括槽关键词序列确定单元2023_21及正负片段选取单元2023_22。 Preferably, the positive and negative vector generation unit 2023_3 includes a slot keyword sequence determination unit 2023_21 and 2023_22 negative fragment selected unit.

[0176] The slot keyword sequence determination unit 2023_21 is configured to determine the slot keyword sequences corresponding to each fragment. A slot keyword sequence corresponding to a fragment is the sequence formed by the slot keywords in one query covered by a candidate requirement template that contains the fragment.

[0177] The positive and negative fragment selection unit 2023_22 is configured to select positive fragments and negative fragments from the fragments, and to determine their weights, in the following manner:

[0178] (1) If all slot keyword sequences corresponding to a fragment are identical, the fragment is taken as a negative fragment, and the weight of that negative fragment is 1;

[0179] (2) If the slot keyword sequences corresponding to a fragment are not all identical, but there is one slot keyword sequence whose proportion P among all slot keyword sequences of the fragment is greater than a preset first threshold, the fragment is taken as a negative fragment, and the weight of that negative fragment is the proportion P;

[0180] (3) The number of mutually distinct slot keyword sequences corresponding to each candidate requirement template of the specific domain is determined, and the maximum of these numbers, Z1, is obtained. If a fragment does not satisfy the conditions in T1 and T2 (items (1) and (2) above), and the ratio of Z2, the number of mutually distinct slot keyword sequences corresponding to the fragment, to Z1 is greater than a preset second threshold, the fragment is taken as a positive fragment, and the weight of that positive fragment is the ratio of Z2 to Z1.
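
Rules (1) to (3) can be sketched as a single pass over the fragments. This is an illustrative sketch under stated assumptions rather than the patented implementation: the function name `select_fragments`, the dictionary layout, the threshold values, and the sample data are all hypothetical, and Z1 is assumed to have been computed beforehand from the candidate requirement templates.

```python
from collections import Counter
from typing import Dict, List, Tuple

SlotSeq = Tuple[str, ...]


def select_fragments(
    fragment_seqs: Dict[str, List[SlotSeq]],
    z1: int,
    first_threshold: float,
    second_threshold: float,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (positive, negative) maps from fragment to weight.

    fragment_seqs maps each fragment to its slot keyword sequences,
    one entry per covered query (duplicates allowed).
    """
    positive: Dict[str, float] = {}
    negative: Dict[str, float] = {}
    for fragment, seqs in fragment_seqs.items():
        if not seqs or z1 <= 0:
            continue  # nothing to judge for this fragment
        counts = Counter(seqs)
        # Rule (1): every sequence is identical -> negative, weight 1.
        if len(counts) == 1:
            negative[fragment] = 1.0
            continue
        # Rule (2): one sequence dominates -> negative, weight P.
        p = max(counts.values()) / len(seqs)
        if p > first_threshold:
            negative[fragment] = p
            continue
        # Rule (3): enough distinct sequences relative to Z1 -> positive, weight Z2/Z1.
        ratio = len(counts) / z1
        if ratio > second_threshold:
            positive[fragment] = ratio
    return positive, negative


# Hypothetical data: Z1 = 5, first threshold 0.6, second threshold 0.5.
example = {
    "train tickets": [("Beijing", "Shanghai"), ("Shanghai", "Beijing"),
                      ("Guangzhou", "Shenzhen"), ("Beijing", "Tianjin")],
    "what is": [("tickets",), ("tickets",)],
    "how to": [("a",), ("a",), ("a",), ("b",)],
}
pos, neg = select_fragments(example, z1=5, first_threshold=0.6, second_threshold=0.5)
print(pos, neg)
```

A fragment that satisfies none of the three rules is left out of both vectors in this sketch.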

[0181] The template vector generation unit 2023_3 is configured, when the boundary word feature of a candidate requirement template W is extracted, to determine the weights of the fragments of W and to form the vector of W from those fragments and their weights. Preferably, when determining the weights of the fragments of W, the template vector generation unit 2023_3 counts the number of times each fragment of W occurs in W and uses that count as the weight of the corresponding fragment.

[0182] The similarity calculation unit 2023_4 is configured to calculate the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and to obtain the boundary word feature of W from the difference between S1 and S2.
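
A minimal sketch of the template vector and the S1 minus S2 computation follows. The text does not name a similarity measure, so cosine similarity is used here purely as an assumption; the function names and the sample vectors are hypothetical.

```python
import math
from collections import Counter
from typing import List, Mapping


def cosine(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse vectors (an assumed measure)."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def boundary_word_feature(
    template_fragments: List[str],
    positive: Mapping[str, float],
    negative: Mapping[str, float],
) -> float:
    """Boundary word feature of one template, computed as S1 - S2."""
    # Weight of each fragment = number of times it occurs in the template.
    template_vector = Counter(template_fragments)
    s1 = cosine(template_vector, positive)
    s2 = cosine(template_vector, negative)
    return s1 - s2


# Hypothetical domain vectors and a template segmented into two fragments.
positive_vector = {"train tickets": 0.8}
negative_vector = {"what is": 1.0}
feature = boundary_word_feature(["train tickets", "book"], positive_vector, negative_vector)
print(feature)
```

Under this choice of measure the feature lies between -1 and 1, with larger values indicating that the template's fragments resemble the positive vector more than the negative one.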

[0183] Please refer again to Figure 4. The sorting unit 203 is configured to sort the candidate requirement templates using the features extracted by the feature extraction unit 202. The sorting unit 203 includes a standard template set selection unit 2031, a training unit 2032, and a calculation and sorting unit 2033.

[0184] The standard template set selection unit 2031 is configured to select a standard template set from the candidate requirement templates. Please refer to Figure 7, which is a schematic structural block diagram of an embodiment of the standard template set selection unit of the present invention. As shown in Figure 7, the standard template set selection unit 2031 includes a template set determination unit 2031_1 and an intersection unit 2031_2. The template set determination unit 2031_1 is configured to sort the candidate requirement templates by feature value for each extracted feature, and to take, for each feature, the candidate requirement templates ranked in the top N3 positions as the template set of that feature, where N3 is a positive integer. The intersection unit 2031_2 is configured to take the intersection of the template sets of all features as the standard template set.
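
The top-N3 selection and intersection can be sketched as below. This is an illustration only: the function name `standard_template_set`, the feature names, and the values are hypothetical, and the sketch assumes that a larger feature value ranks higher.

```python
from typing import Dict, List, Set


def standard_template_set(features: Dict[str, Dict[str, float]], n3: int) -> Set[str]:
    """Intersect the per-feature top-N3 template sets.

    features maps a feature name to {template id: feature value}.
    Assumption: templates are ranked by descending feature value.
    """
    per_feature: List[Set[str]] = []
    for values in features.values():
        ranked = sorted(values, key=lambda template: values[template], reverse=True)
        per_feature.append(set(ranked[:n3]))
    if not per_feature:
        return set()
    return set.intersection(*per_feature)


# Hypothetical feature values for three candidate templates A, B, and C.
example_features = {
    "similarity": {"A": 0.9, "B": 0.8, "C": 0.1},
    "generalization": {"A": 30, "B": 5, "C": 20},
    "boundary": {"A": 0.5, "B": 0.4, "C": -0.2},
}
print(standard_template_set(example_features, n3=2))  # prints {'A'}
```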

[0185] Please refer again to Figure 4. The training unit 2032 is configured to use the standard template set to train the parameter corresponding to each extracted feature, and to take as the weight of each feature the parameter value at which, during training, the templates in the standard template set can rank no higher among all candidate requirement templates.

[0186] The calculation and sorting unit 2033 is configured to calculate a score for each candidate requirement template using the features extracted by the feature extraction unit 202 and the feature weights obtained by the training unit 2032, and to sort the candidate requirement templates according to that score. Preferably, the candidate requirement templates are sorted by score from high to low.
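
Scoring and sorting might look like the following sketch. The text does not state how features and weights are combined, so a weighted sum is assumed here; the function name `rank_templates`, the feature names, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_templates(
    features: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Score each template and sort from high to low.

    features maps a template id to {feature name: feature value}.
    Assumption: score = sum of weight * feature value over all features.
    """
    scores = {
        template: sum(weights[name] * value for name, value in values.items())
        for template, values in features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feature values and trained weights.
template_features = {
    "A": {"similarity": 0.9, "generalization": 0.6},
    "B": {"similarity": 0.4, "generalization": 0.9},
    "C": {"similarity": 0.2, "generalization": 0.1},
}
feature_weights = {"similarity": 0.5, "generalization": 0.5}
ranked = rank_templates(template_features, feature_weights)
print([template for template, _ in ranked])  # prints ['A', 'B', 'C']
```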

[0187] The selection unit 204 is configured to select final requirement templates from the candidate requirement templates, according to the sorting result of the sorting unit 203, as the requirement templates of the specific domain. Preferably, the selection unit 204 includes a first selection unit 2041 and a second selection unit 2042. The first selection unit 2041 is configured to select the candidate requirement templates ranked in the top N4 positions as final requirement templates, where N4 is a positive integer. The second selection unit 2042 is configured to obtain a keyword set from the boundary words of the candidate requirement templates ranked in the top M2 positions, and to select as final requirement templates those candidate requirement templates ranked after the top N4 positions whose boundary words all belong to the keyword set. Here, a boundary word is a word in a candidate requirement template that has not been generalized, a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word meets a requirement, and M2 is a positive integer less than or equal to N4.
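
The two selection steps can be sketched as follows. This is a hedged illustration, not the patented implementation: the function name `select_final_templates` and the `related_words` lookup (standing in for the synonym and mutual-information test, which the text does not detail) are hypothetical. The sketch also assumes that a template with no boundary words is not added by the second step.

```python
from typing import Dict, List, Set


def select_final_templates(
    ranked: List[str],
    boundary_words: Dict[str, Set[str]],
    related_words: Dict[str, Set[str]],
    n4: int,
    m2: int,
) -> List[str]:
    """Select the top N4 templates, then add lower-ranked templates whose
    boundary words all belong to the keyword set built from the top M2."""
    if not 0 < m2 <= n4:
        raise ValueError("M2 must be a positive integer no greater than N4")
    final = list(ranked[:n4])
    # Keyword set: words related to the boundary words of the top M2 templates.
    # related_words is a hypothetical precomputed lookup of synonyms and
    # words whose mutual information with the boundary word meets the requirement.
    keywords: Set[str] = set()
    for template in ranked[:m2]:
        for word in boundary_words.get(template, set()):
            keywords |= related_words.get(word, set())
    for template in ranked[n4:]:
        words = boundary_words.get(template, set())
        # Assumption: templates without boundary words are skipped here.
        if words and words <= keywords:
            final.append(template)
    return final


# Hypothetical ranking, boundary words, and related-word lookup.
ranking = ["T1", "T2", "T3", "T4"]
words_by_template = {"T1": {"ticket"}, "T2": {"price"}, "T3": {"fare"}, "T4": {"weather"}}
related = {"ticket": {"fare", "tickets"}, "price": {"cost"}}
print(select_final_templates(ranking, words_by_template, related, n4=2, m2=2))
# prints ['T1', 'T2', 'T3']
```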

[0188] The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Patent Citations
Cited Patent | Filing Date | Publication Date | Applicant | Title
CN1514387A * | 31 Dec 2002 | 21 Jul 2004 | 中国科学院计算技术研究所 | Sound distinguishing method in speech sound inquiry
CN101216853A * | 11 Jan 2008 | 9 Jul 2008 | 孟小峰 | Intelligent web enquiry interface system and its method
US6516312 * | 4 Apr 2000 | 4 Feb 2003 | International Business Machine Corporation | System and method for dynamically associating keywords with domain-specific search engine queries

Non-Patent Citations
Reference
1 * 刘亮亮等: "基于查询模板的特定领域中文问答系统的研究与实现", 《江苏科技大学学报(自然科学版)》, vol. 25, no. 2, 15 April 2011 (2011-04-15), pages 163-168

Referenced By
Citing Patent | Filing Date | Publication Date | Applicant | Title
CN103136221A * | 24 Nov 2011 | 5 Jun 2013 | 北京百度网讯科技有限公司 | Method capable of generating requirement template and requirement identification method and device

Classifications
International Classification | G06F17/30

Legal Events
Date | Code | Event | Description
7 Mar 2012 | C06 | Publication |
28 Aug 2013 | C10 | Entry into substantive examination |
14 Dec 2016 | C14 | Grant of patent or utility model |