CN102368260A - Method and device of producing domain required template - Google Patents


Info

Publication number
CN102368260A
Authority
CN
China
Prior art keywords
masterplate
demand
candidate
specific area
groove
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103088307A
Other languages
Chinese (zh)
Other versions
CN102368260B (en)
Inventor
柴春光
黄际洲
时迎超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110308830.7A
Publication of CN102368260A
Publication of CN102368260B
Legal status: Active

Abstract

The invention provides a method and a device for generating domain demand templates. The method comprises the following steps: A. obtaining candidate demand templates of a specific domain; B. extracting features of the candidate demand templates; C. ranking the candidate demand templates according to the extracted features; and D. selecting final demand templates from the candidate demand templates as the demand templates of the specific domain. This yields a general-purpose method for generating high-quality domain demand templates, which helps a search engine understand the purpose behind user behavior.

Description

A method and device for generating domain demand templates
[Technical Field]
The present invention relates to natural language processing, and in particular to a method and device for generating domain demand templates.
[Background Art]
Search engines make it much easier for people to find the information they need. A traditional search engine serves a user by looking up an index that contains the user's search keywords and returning the pages that match those keywords. For example, if a user's search request (query) is "Beijing automobile 4S store recruiting sales manager", the results are pages from recruitment websites; the user clicks through to a recruitment site, fills in the relevant information, searches within the site, and only then obtains the information actually needed. If a search engine could better understand the user's real purpose at search time, it could return information that meets the user's need more accurately. Natural language processing is therefore very important to search engines. In natural language processing, domain-based demand templates can be used to identify a user's search purpose. For example, if the user's query is "how to get from Dazhongsi to Xidan" and this query matches a demand template of the transportation domain, the search engine learns that the user has a transportation-domain need and can directly return a transportation-related application. Whether high-quality domain demand templates can be produced is therefore critical to a search engine's correct understanding of user search intent.
In the past, domain demand templates were usually generated with a different mining method for each application. This wastes considerable manpower and resources, and such methods adapt poorly: they are difficult to change as the application changes.
[Summary of the Invention]
The technical problem to be solved by the present invention is to provide a method and device for generating domain demand templates, so as to overcome the poor adaptability of domain demand templates generated with the prior art.
To solve this technical problem, the present invention provides a method for generating domain demand templates, comprising: A. obtaining candidate demand templates of a specific domain; B. extracting features of the candidate demand templates, the features comprising at least one of: a similarity feature characterizing how closely a candidate demand template is tied to the specific domain, a generalization ability feature characterizing the ability of a candidate demand template to cover user search requests (queries), and a boundary word feature characterizing the influence that the non-generalized words in a candidate demand template have on the correctness of that template; C. ranking the candidate demand templates using the extracted features; D. selecting, according to the ranking result, final demand templates from the candidate demand templates as the demand templates of the specific domain.
According to a preferred embodiment of the present invention, step A comprises: A1. selecting from the search log the user queries that match a preset qualifier of the specific domain; A2. replacing the parts of the selected queries that match preset slot keywords of the specific domain with wildcards, to obtain candidate demand templates.
According to a preferred embodiment of the present invention, the method further comprises, after step A2: filtering out, from the candidate demand templates obtained in step A2, those that do not satisfy a slot-count requirement preset for the specific domain.
According to a preferred embodiment of the present invention, the step of extracting the similarity feature of a candidate demand template W comprises: obtaining the core-word vector of W and the core-word vector of the specific domain; computing the similarity between the two vectors, and taking this similarity as the similarity feature of W.
According to a preferred embodiment of the present invention, the step of obtaining the core-word vector of W comprises: selecting, from the queries in the search log covered by W, the N1 queries with the highest query counts, and determining core words and their weights from the search results that the search engine returns for these N1 queries, to form the core-word vector of W, where N1 is a positive integer.
According to a preferred embodiment of the present invention, the step of obtaining the core-word vector of the specific domain comprises: obtaining the search results returned by the search engine for seed queries of the specific domain, and determining core words and their weights in these search results, to form the core-word vector of the specific domain.
According to a preferred embodiment of the present invention, the seed queries of the specific domain are obtained in one of the following ways: Way 1: from all candidate demand templates of the specific domain, select the N2 candidate demand templates that cover the most queries in the search log, and, for each of these N2 templates, select from the queries it covers the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers; or, Way 2: combine preset slot keywords of the specific domain with preset qualifiers of the specific domain to generate the seed queries of the specific domain; or, Way 3: after selecting part of the seed queries by Way 1, use a preset slot keyword dictionary of the specific domain to replace the slot keywords in those seed queries with other slot keywords from the dictionary, yielding expanded seed queries; the partial seed queries and the expanded seed queries together constitute the seed queries of the specific domain.
According to a preferred embodiment of the present invention, the step of extracting the generalization ability feature of a candidate demand template W comprises: determining the slot keyword sequences corresponding to W, counting the number of distinct slot keyword sequences among them, and computing the generalization ability feature of W from this number, where a slot keyword sequence corresponding to W is the sequence formed by the slot keywords in a query that W covers in the search log.
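As a non-limiting illustration, the counting described above can be sketched in Python. The function name `generalization_feature`, the tuple representation of a slot keyword sequence, and the use of the raw count as the feature value are assumptions made for this sketch; the text only states that the feature is calculated according to the number of distinct sequences.

```python
from typing import Iterable, Tuple

SlotSeq = Tuple[str, ...]


def generalization_feature(slot_sequences: Iterable[SlotSeq]) -> int:
    """Count the distinct slot keyword sequences among the queries a template covers.

    Assumption: the raw count is used directly as the feature value.
    """
    return len(set(slot_sequences))


# Hypothetical sequences extracted from three covered queries; two are identical.
covered = [("Beijing", "No. 15"), ("Shanghai", "No. 14"), ("Beijing", "No. 15")]
print(generalization_feature(covered))  # 2
```

A template whose covered queries fill its slots with many different values scores higher than one that always sees the same values.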
According to a preferred embodiment of the present invention, the step of extracting the boundary word feature of a candidate demand template W comprises: splitting all candidate demand templates of the specific domain into segments; selecting positive segments from the resulting segments and determining the weight of each positive segment to generate the positive vector of the specific domain, and selecting negative segments from the resulting segments and determining the weight of each negative segment to generate the negative vector of the specific domain; determining the weights of the segments of W and forming the vector of W from the segments of W and their weights; computing the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtaining the boundary word feature of W from the difference between S1 and S2.
According to a preferred embodiment of the present invention, the positive vector and the negative vector of the specific domain are generated as follows: determine the slot keyword sequences corresponding to each segment, where a slot keyword sequence corresponding to a segment is the sequence formed by the slot keywords in a query covered by a candidate demand template that contains the segment; T1. if all slot keyword sequences corresponding to a segment are identical, the segment is taken as a negative segment with weight 1; T2. if the slot keyword sequences corresponding to a segment are not all identical, but one slot keyword sequence accounts for a proportion P of all the segment's slot keyword sequences that is greater than a preset first threshold, the segment is taken as a negative segment with weight P; T3. determine the number of distinct slot keyword sequences corresponding to each candidate demand template of the specific domain and take the maximum Z1 of these numbers; if a segment satisfies neither T1 nor T2, and the ratio of the number Z2 of distinct slot keyword sequences corresponding to the segment to Z1 is greater than a preset second threshold, the segment is taken as a positive segment with weight Z2/Z1.
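As a non-limiting illustration, rules T1 to T3 can be sketched as a single classification function. The name `classify_segment`, its parameter names, and the choice to measure the proportion P over all occurrences of slot keyword sequences (repeats included) are assumptions of this sketch, not details fixed by the text.

```python
from collections import Counter
from typing import List, Optional, Tuple

SlotSeq = Tuple[str, ...]


def classify_segment(
    slot_sequences: List[SlotSeq],
    z1: int,
    first_threshold: float,
    second_threshold: float,
) -> Optional[Tuple[str, float]]:
    """Return ("negative" | "positive", weight), or None if no rule applies."""
    if not slot_sequences or z1 <= 0:
        return None
    counts = Counter(slot_sequences)
    distinct = len(counts)  # Z2: number of distinct sequences for this segment
    # T1: every covered query yields the same slot keyword sequence.
    if distinct == 1:
        return ("negative", 1.0)
    # T2: one sequence dominates (assumption: P is measured over occurrences).
    p = max(counts.values()) / len(slot_sequences)
    if p > first_threshold:
        return ("negative", p)
    # T3: many distinct sequences relative to the domain-wide maximum Z1.
    ratio = distinct / z1
    if ratio > second_threshold:
        return ("positive", ratio)
    return None
```

Segments that fall under none of the three rules contribute to neither vector in this sketch.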
According to a preferred embodiment of the present invention, the step of determining the weights of the segments of W comprises: counting the number of times each segment of W occurs in W, and taking this count as the weight of that segment.
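As a non-limiting illustration, the boundary word feature can be sketched as follows. Sparse vectors are represented as dictionaries from segment to weight, and cosine similarity is used for S1 and S2; the text does not name the similarity measure used here, so cosine is an assumption, as are the function names.

```python
import math
from collections import Counter
from typing import Dict, List, Mapping


def cosine(u: Mapping[str, float], v: Mapping[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def boundary_word_feature(
    segments_of_w: List[str],
    positive_vector: Dict[str, float],
    negative_vector: Dict[str, float],
) -> float:
    """Return S1 - S2 for template W, given the segments W was split into."""
    # The weight of each segment of W is the number of times it occurs in W.
    w_vector = Counter(segments_of_w)
    s1 = cosine(w_vector, positive_vector)
    s2 = cosine(w_vector, negative_vector)
    return s1 - s2
```

A template whose segments resemble the positive segments of the domain gets a value near 1, while one dominated by negative segments gets a value near -1.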
According to a preferred embodiment of the present invention, step C comprises: selecting a standard template set from the candidate demand templates; using the standard template set to train the parameter corresponding to each extracted feature, and taking as the weight of each feature the parameter value at which, during training, the ranks of the templates in the standard template set among all candidate demand templates can no longer be improved; computing a score for each candidate demand template from the extracted features and the feature weights, and ranking the candidate demand templates by score.
According to a preferred embodiment of the present invention, the step of selecting the standard template set from the candidate demand templates comprises: for each extracted feature, ranking the candidate demand templates by that feature's value and taking the top N3 candidate demand templates as the template set of that feature, where N3 is a positive integer; taking the intersection of the template sets of all features as the standard template set.
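As a non-limiting illustration, the selection of the standard template set can be sketched as an intersection of per-feature top-N3 sets. The function name, the nested-dictionary input format, and the assumption that a larger feature value ranks higher are choices made for this sketch.

```python
from typing import Dict, Set


def standard_template_set(features: Dict[str, Dict[str, float]], n3: int) -> Set[str]:
    """features maps feature name -> {template: feature value}.

    Assumption: templates are ranked in descending order of feature value.
    """
    top_sets = []
    for values in features.values():
        ranked = sorted(values, key=values.get, reverse=True)
        top_sets.append(set(ranked[:n3]))
    return set.intersection(*top_sets) if top_sets else set()


# Hypothetical feature values for three candidate templates.
features = {
    "similarity": {"t1": 0.9, "t2": 0.8, "t3": 0.1},
    "generalization": {"t1": 50, "t2": 3, "t3": 40},
    "boundary_word": {"t1": 0.7, "t2": 0.2, "t3": 0.6},
}
print(standard_template_set(features, n3=2))  # {'t1'}
```

Only templates that rank near the top on every feature are kept, which makes the set small but reliable as training data for the feature weights.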
According to a preferred embodiment of the present invention, step D comprises: selecting the candidate demand templates ranked in the top N4 as final demand templates, where N4 is a positive integer; obtaining a keyword set from the boundary words of the candidate demand templates ranked in the top M2, and also selecting as final demand templates those candidate demand templates ranked after the top N4 whose boundary words all belong to the keyword set; where a boundary word is a word in a candidate demand template that has not been generalized, a keyword is a word synonymous with a boundary word or a word whose mutual information with a boundary word meets a requirement, and M2 is a positive integer less than or equal to N4.
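As a non-limiting illustration, step D can be sketched as follows. The function `select_final` and the caller-supplied `expand` function (which stands in for the synonym and mutual-information lookup) are hypothetical; whether a boundary word itself belongs to the keyword set, and how templates without boundary words are treated, are not fixed by the text and are decided here by assumption.

```python
from typing import Callable, Dict, List, Set


def select_final(
    ranked: List[str],
    boundary_words: Dict[str, List[str]],
    expand: Callable[[str], Set[str]],
    n4: int,
    m2: int,
) -> List[str]:
    """ranked lists the candidate templates in ranking order, best first."""
    if m2 > n4:
        raise ValueError("M2 must be less than or equal to N4")
    final = list(ranked[:n4])
    # Build the keyword set from the boundary words of the top M2 templates.
    keyword_set: Set[str] = set()
    for template in ranked[:m2]:
        for word in boundary_words[template]:
            keyword_set |= expand(word)
    # Also accept lower-ranked templates whose boundary words are all keywords.
    for template in ranked[n4:]:
        words = boundary_words[template]
        # Assumption: a template with no boundary words is not accepted here.
        if words and all(word in keyword_set for word in words):
            final.append(template)
    return final


# Hypothetical data: `expand` returns the word itself plus its synonyms.
synonyms = {"route": {"path"}}
expand = lambda word: {word} | synonyms.get(word, set())
words = {"A": ["route"], "B": ["line"], "C": ["path"], "D": ["recruitment"]}
print(select_final(["A", "B", "C", "D"], words, expand, n4=2, m2=1))
# ['A', 'B', 'C']
```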
The present invention also provides a device for generating domain demand templates, comprising: a candidate template obtaining unit, configured to obtain candidate demand templates of a specific domain; a feature extraction unit, configured to extract features of the candidate demand templates, the feature extraction unit comprising at least one of a similarity feature extraction unit, a generalization ability feature extraction unit and a boundary word feature extraction unit, where the similarity feature extraction unit is configured to extract a similarity feature characterizing how closely a candidate demand template is tied to the specific domain, the generalization ability feature extraction unit is configured to extract a generalization ability feature characterizing the ability of a candidate demand template to cover user search requests (queries), and the boundary word feature extraction unit is configured to extract a boundary word feature characterizing the influence that the non-generalized words in a candidate demand template have on the correctness of that template; a ranking unit, configured to rank the candidate demand templates using the features extracted by the feature extraction unit; and a selection unit, configured to select, according to the ranking result of the ranking unit, final demand templates from the candidate demand templates as the demand templates of the specific domain.
According to a preferred embodiment of the present invention, the candidate template obtaining unit comprises: a qualifying unit, configured to select from the search log the user queries that match a preset qualifier of the specific domain; and a generalization unit, configured to replace with wildcards the parts of the queries selected by the qualifying unit that match preset slot keywords of the specific domain, to obtain candidate demand templates.
According to a preferred embodiment of the present invention, the candidate template obtaining unit further comprises a filtering unit, configured to filter out, from the candidate demand templates obtained by the generalization unit, those that do not satisfy a slot-count requirement preset for the specific domain.
According to a preferred embodiment of the present invention, the similarity feature extraction unit comprises: a template word vector generating unit, configured to obtain the core-word vector of a candidate demand template W when the similarity feature of W is extracted; a domain word vector generating unit, configured to obtain the core-word vector of the specific domain; and a computing unit, configured to compute the similarity between the core-word vector of W and the core-word vector of the specific domain, and to take this similarity as the similarity feature of W.
According to a preferred embodiment of the present invention, the template word vector generating unit selects, from the queries in the search log covered by W, the N1 queries with the highest query counts, and determines core words and their weights from the search results that the search engine returns for these N1 queries, to form the core-word vector of W, where N1 is a positive integer.
According to a preferred embodiment of the present invention, the domain word vector generating unit obtains the search results returned by the search engine for seed queries of the specific domain, and determines core words and their weights in these search results, to form the core-word vector of the specific domain.
According to a preferred embodiment of the present invention, the domain word vector generating unit obtains the seed queries of the specific domain in one of the following ways: Way 1: from all candidate demand templates of the specific domain, select the N2 candidate demand templates that cover the most queries in the search log, and, for each of these N2 templates, select from the queries it covers the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers; or, Way 2: combine preset slot keywords of the specific domain with preset qualifiers of the specific domain to generate the seed queries of the specific domain; or, Way 3: after selecting part of the seed queries by Way 1, use a preset slot keyword dictionary of the specific domain to replace the slot keywords in those seed queries with other slot keywords from the dictionary, yielding expanded seed queries; the partial seed queries and the expanded seed queries together constitute the seed queries of the specific domain.
According to a preferred embodiment of the present invention, when extracting the generalization ability feature of a candidate demand template W, the generalization ability feature extraction unit determines the slot keyword sequences corresponding to W, counts the number of distinct slot keyword sequences among them, and computes the generalization ability feature of W from this number, where a slot keyword sequence corresponding to W is the sequence formed by the slot keywords in a query that W covers in the search log.
According to a preferred embodiment of the present invention, the boundary word feature extraction unit comprises: a splitting unit, configured to split all candidate demand templates of the specific domain into segments; a positive and negative vector generating unit, configured to select positive segments from the segments obtained by the splitting unit and determine the weight of each positive segment to generate the positive vector of the specific domain, and to select negative segments from the segments and determine the weight of each negative segment to generate the negative vector of the specific domain; a template vector generating unit, configured to determine, when the boundary word feature of a candidate demand template W is extracted, the weights of the segments of W and to form the vector of W from the segments of W and their weights; and a similarity computing unit, configured to compute the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and to obtain the boundary word feature of W from the difference between S1 and S2.
According to a preferred embodiment of the present invention, the positive and negative vector generating unit comprises: a slot keyword sequence determining unit, configured to determine the slot keyword sequences corresponding to each segment, where a slot keyword sequence corresponding to a segment is the sequence formed by the slot keywords in a query covered by a candidate demand template that contains the segment; and a positive and negative segment selecting unit, configured to select positive and negative segments from the segments and determine their weights as follows: T1. if all slot keyword sequences corresponding to a segment are identical, the segment is taken as a negative segment with weight 1; T2. if the slot keyword sequences corresponding to a segment are not all identical, but one slot keyword sequence accounts for a proportion P of all the segment's slot keyword sequences that is greater than a preset first threshold, the segment is taken as a negative segment with weight P; T3. determine the number of distinct slot keyword sequences corresponding to each candidate demand template of the specific domain and take the maximum Z1 of these numbers; if a segment satisfies neither T1 nor T2, and the ratio of the number Z2 of distinct slot keyword sequences corresponding to the segment to Z1 is greater than a preset second threshold, the segment is taken as a positive segment with weight Z2/Z1.
According to a preferred embodiment of the present invention, when determining the weights of the segments of W, the template vector generating unit counts the number of times each segment of W occurs in W and takes this count as the weight of that segment.
According to a preferred embodiment of the present invention, the ranking unit comprises: a standard template set selecting unit, configured to select a standard template set from the candidate demand templates; a training unit, configured to use the standard template set to train the parameter corresponding to each extracted feature, and to take as the weight of each feature the parameter value at which, during training, the ranks of the templates in the standard template set among all candidate demand templates can no longer be improved; and a computing and ranking unit, configured to compute a score for each candidate demand template from the features extracted by the feature extraction unit and the feature weights obtained by the training unit, and to rank the candidate demand templates by score.
According to a preferred embodiment of the present invention, the standard template set selecting unit comprises: a template set determining unit, configured to rank, for each extracted feature, the candidate demand templates by that feature's value and to take the top N3 candidate demand templates as the template set of that feature, where N3 is a positive integer; and an intersection unit, configured to take the intersection of the template sets of all features as the standard template set.
According to a preferred embodiment of the present invention, the selection unit comprises: a first selecting unit, configured to select the candidate demand templates ranked in the top N4 as final demand templates, where N4 is a positive integer; and a second selecting unit, configured to obtain a keyword set from the boundary words of the candidate demand templates ranked in the top M2, and to select as final demand templates those candidate demand templates ranked after the top N4 whose boundary words all belong to the keyword set; where a boundary word is a word in a candidate demand template that has not been generalized, a keyword is a word synonymous with a boundary word or a word whose mutual information with a boundary word meets a requirement, and M2 is a positive integer less than or equal to N4.
As can be seen from the above technical solutions, the present invention provides a general-purpose method for generating domain demand templates. For any domain, candidate demand templates can be mined automatically by this method, and features extracted from the candidate demand templates are used to evaluate their quality, so that high-quality demand templates can be obtained from the candidates. The high-quality demand templates obtained for each domain help a search engine understand the purpose behind user behavior.
[Brief Description of the Drawings]
Fig. 1 is a flow diagram of the method for generating domain demand templates in the present invention;
Fig. 2 is a flow diagram of an embodiment of obtaining candidate demand templates in the present invention;
Fig. 3 is a schematic diagram of using seed queries to obtain the data returned by the search engine in the present invention;
Fig. 4 is a structural block diagram of an embodiment of the device for generating domain demand templates in the present invention;
Fig. 5 is a structural block diagram of an embodiment of the similarity feature extraction unit in the present invention;
Fig. 6 is a structural block diagram of an embodiment of the boundary word feature extraction unit in the present invention;
Fig. 7 is a structural block diagram of an embodiment of the standard template set selecting unit in the present invention.
[Detailed Description]
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described below with reference to the accompanying drawings and specific embodiments.
Please refer to Fig. 1, which is a flow diagram of the method for generating domain demand templates in the present invention. As shown in Fig. 1, the method comprises:
Step S101: obtain candidate demand templates of a specific domain.
Step S102: extract features of the candidate demand templates.
Step S103: rank the candidate demand templates using the extracted features.
Step S104: according to the ranking result, select final demand templates from the candidate demand templates as the demand templates of the specific domain.
The method is described in detail below through specific embodiments.
In the present invention, a specific domain is a scope that reflects the purpose of a user's search, such as the public transport domain or the weather domain; these domains reflect what the user is looking for when searching.
Please refer to Fig. 2, which is a flow diagram of an embodiment of obtaining candidate demand templates in the present invention. In this embodiment, a domain qualifier dictionary and a slot keyword dictionary are used to process the user search requests (queries) in the user search log (query log), thereby generating candidate demand templates.
The domain qualifier dictionary contains words related to each domain; the qualifiers of a specific domain are the words related to that domain. In this embodiment, the qualifiers of the specific domain are used to filter queries when they are selected: only queries containing a qualifier of the specific domain are generalized, and the candidate demand templates produced by generalization then belong to the specific domain. The words in the domain qualifier dictionary can be collected as follows:
First, domain seed words can be mined from user queries and used as domain qualifiers; the domain seed words can be configured manually or marked manually in the search log.
Then, by looking up a synonym dictionary, words synonymous with the domain seed words are obtained as domain qualifiers. In addition, mutual information, which measures how closely two words are associated, can be used to select words in the search log that are highly associated with the seed words, and these are also taken as domain qualifiers. The mutual information between words can be obtained from statistics over a large corpus; as this belongs to the prior art, it is not described further here. Taking the public transport domain as an example, Table 1 gives examples of some domain qualifiers:
Table 1
[Table 1 appears as an image in the original publication (Figure BDA0000098048170000091); its contents are not reproduced here.]
Generating candidate demand templates is a process of generalizing queries. Generalization means replacing with wildcards the parts of a user query that match slot keywords of the specific domain. Slot keywords are the words that get generalized; they are identified by looking them up in the slot keyword dictionary, which can be built by collecting various proper nouns.
For example, generalizing a query such as "Beijing No. 15 bus route" yields a demand template such as "[city name] [bus routes] bus route". Each "[]" symbol represents a slot of the template, meaning that the position can be filled by any value that satisfies the wildcard's attribute requirement; for example, the template above also matches "Shanghai suburban No. 14 bus route".
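As a non-limiting illustration, qualifier filtering and generalization can be sketched in Python using an English stand-in for the bus-route example. The function `generalize`, the substring-based qualifier match, and the tiny dictionaries are assumptions of this sketch; a production system would match on segmented words rather than raw substrings.

```python
import re
from typing import Dict, List, Optional


def generalize(
    query: str, qualifiers: List[str], slot_keywords: Dict[str, str]
) -> Optional[str]:
    """Return the candidate template for query, or None if no qualifier matches.

    slot_keywords maps a slot keyword to its slot type, e.g. "Beijing" -> "city name".
    """
    # Only queries containing a qualifier of the domain are generalized.
    if not any(qualifier in query for qualifier in qualifiers):
        return None
    if not slot_keywords:
        return query
    # Try longer keywords first, and substitute in a single pass so that
    # inserted slot names are never themselves rewritten.
    ordered = sorted(slot_keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(keyword) for keyword in ordered))
    return pattern.sub(lambda m: "[" + slot_keywords[m.group(0)] + "]", query)


qualifiers = ["bus route"]
slot_keywords = {"Beijing": "city name", "No. 15": "bus routes"}
print(generalize("Beijing No. 15 bus route", qualifiers, slot_keywords))
# [city name] [bus routes] bus route
```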
After the candidate demand templates are obtained, they can optionally be filtered according to a preset slot-count requirement for the specific field to which they belong. For example, in the train information query field, the variable information in a query generally involves only the origin and the destination, so the preset number of slots for templates in this field can be set to 2. Every template that does not meet this requirement is filtered out, which reduces the complexity of the subsequent processing of candidate demand templates.
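A slot-count filter of this kind might look like the sketch below. It assumes that slots are written as "[...]" and that "meeting the requirement" means having exactly the preset number of slots; the function name and the example templates are hypothetical.

```python
import re

def filter_by_slot_count(templates, required_slots):
    """Keep only templates whose number of "[...]" slots equals the preset count."""
    slot = re.compile(r"\[[^\]]*\]")
    return [t for t in templates if len(slot.findall(t)) == required_slots]

kept = filter_by_slot_count(
    ["train from [station] to [station]", "[station] train timetable"], 2)
print(kept)
```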
In the present embodiment, the characteristics extracted in step S102 include at least one of the following:
The similarity characteristic, which describes how closely a candidate demand template is related to the specific field; the generalization ability characteristic, which describes the ability of a candidate demand template to cover user search requests (queries); and the border word characteristic, which describes the influence that the non-generalized words in a candidate demand template have on the correctness of the template.
Embodiments of how these three characteristics are computed are described in detail below.
1. Similarity characteristic
The similarity characteristic of a candidate demand template W can be obtained by computing the cosine similarity between the core word vector of W and the core word vector of the specific field to which W belongs, using formula (1):
$$\mathrm{sim\_score} = \mathrm{CosSimilarity}(\mathrm{pattern\_vector},\ \mathrm{seed\_query\_centroid}) \qquad (1)$$
where sim_score is the similarity characteristic value of candidate demand template W, pattern_vector is the core word vector of W, seed_query_centroid is the core word vector of the specific field, and CosSimilarity is the cosine similarity function.
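Formula (1) can be illustrated with a small sketch that stores each core word vector as a dictionary from word to weight. The vectors and weights below are invented for illustration, and treating an empty vector as having similarity 0 is an assumption, not something the text specifies.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty vector is treated as unrelated
    return dot / (norm_u * norm_v)

# Hypothetical core word vectors (weights are made up for illustration).
pattern_vector = {"bus": 2.0, "route": 1.0}
seed_query_centroid = {"bus": 1.0, "station": 1.0}
sim_score = cosine_similarity(pattern_vector, seed_query_centroid)
print(round(sim_score, 4))
```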
A core word vector is a vector whose features are core words. Therefore, to compute the similarity characteristic, it must first be determined how the core words are chosen.
To determine the core words of the specific field, the seed queries of the field are submitted to a search engine, and the core words are determined from the returned data. Referring to Fig. 3, which is a schematic diagram of using a seed query to obtain data returned by a search engine in the present invention: the seed query is "Beijing 15 road bus routes", and this seed query retrieves multiple search results from the search engine. The titles and body text of these search results are preprocessed (sentence splitting, word segmentation, stop word removal and so on) to obtain a statistical corpus. For each word in the statistical corpus, the number of sentences in which the word occurs and the number of sentences in which the word co-occurs with a term are counted, and the number of sentences containing the term is also counted, where a term is a word obtained by segmenting the seed query.
With this information, formula (2) can be used to compute the weight of each word. Words whose weight exceeds a set threshold are taken as core words, and the weights of these core words are the weights of the corresponding vector features.
$$\mathrm{Centrality}_{sch\_term}(w) = \frac{\log\big(Co(w, sch\_term) + 1\big)}{\log\big(sf(w) + 1\big) + \log\big(sf(sch\_term) + 1\big)} \times \log\big(idf(w) + 1\big) \qquad (2)$$
where Centrality_sch_term(w) is the weight of word w; Co(w, sch_term) is the number of sentences in which w and the term sch_term co-occur; sf(sch_term) is the number of sentences containing sch_term; sf(w) is the number of sentences containing w; and idf(w) is the inverse document frequency of w, which can be looked up in an inverse document frequency table computed from a large-scale corpus.
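The sentence counts and formula (2) can be sketched as below for a single term. The toy corpus and the IDF value are invented, the helper names are hypothetical, and returning 0 when the denominator is zero is an assumption added to keep the sketch safe.

```python
import math
from collections import Counter

def sentence_counts(sentences, term):
    """Count sf(w) and Co(w, term) over tokenized sentences (lists of words)."""
    sf, co = Counter(), Counter()
    for tokens in sentences:
        words = set(tokens)
        sf.update(words)               # number of sentences containing each word
        if term in words:
            co.update(words - {term})  # sentences shared with the term
    return sf, co

def centrality(co_w, sf_w, sf_term, idf_w):
    """Formula (2) for one word w and one term sch_term."""
    denom = math.log(sf_w + 1) + math.log(sf_term + 1)
    if denom == 0.0:
        return 0.0  # assumption: no sentence evidence means zero weight
    return math.log(co_w + 1) / denom * math.log(idf_w + 1)

# Toy corpus and IDF value, invented for illustration only.
sentences = [["bus", "route", "map"], ["bus", "fare"], ["route", "map"]]
sf, co = sentence_counts(sentences, "route")
weight_map = centrality(co["map"], sf["map"], sf["route"], idf_w=1.0)
print(weight_map)
```

Words whose weight exceeds the chosen threshold would then be kept as core words of the field.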
To obtain the seed queries of the specific field, any of the following embodiments can be used:
Embodiment one:
Among the candidate demand templates of the specific field, select the N_2 templates that cover the largest number of queries in the search log, and for each of these N_2 templates select, from the queries it covers, the M_1 queries with the highest query counts as seed queries, where N_2 and M_1 are positive integers; preferably, M_1 equals 1. For example, Table 2 below lists candidate demand templates in the public transport field:
Table 2
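[Table 2 appears as an image in the original publication and is not reproduced here: candidate demand templates of the public transport field.]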
Assuming N_2 = 2 and M_1 = 1, Table 3 shows the seed queries obtained with embodiment one from the candidate demand templates in Table 2, together with their corresponding templates.
Table 3
| Seed query | Corresponding template |
| --- | --- |
| Beijing 15 road bus routes | [city name] [bus routes] bus route |
| Beijing Public Transport 23 tunnel | [city name] public transport [bus routes] |
In this embodiment, the seed queries come from real user queries and therefore better reflect users' habits.
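Embodiment one can be sketched as a two-level selection. The function name, the data layout (template mapped to its covered queries and their query counts) and all the counts below are hypothetical.

```python
def select_seed_queries(coverage, n2, m1=1):
    """coverage maps template -> {covered query: times it was issued in the log}."""
    # Templates covering the most distinct queries come first.
    top_templates = sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)[:n2]
    seeds = {}
    for template in top_templates:
        counts = coverage[template]
        # Within a template, keep the m1 most frequently issued queries.
        seeds[template] = sorted(counts, key=counts.get, reverse=True)[:m1]
    return seeds

# Hypothetical log statistics.
coverage = {
    "[city name] [bus routes] bus route": {
        "Beijing route 15 bus route": 90,
        "Shanghai route 14 bus route": 40,
        "Jinan route 12 bus route": 5,
    },
    "[city name] public transport [bus routes]": {
        "Beijing public transport route 23": 70,
        "Jinan public transport route 12": 10,
    },
    "[city name] phone": {"Beijing phone": 30},
}
seeds = select_seed_queries(coverage, n2=2, m1=1)
print(seeds)
```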
Embodiment two:
Groove keywords of the specific field are combined with determiners of the specific field to generate seed queries.
Taking the generation of seed queries for the public transport field as an example, see Table 4:
Table 4
| Generated seed query | Corresponding groove keyword | Corresponding field determiner |
| --- | --- | --- |
| Beijing 15 road bus routes | Beijing 15 tunnel | Bus route |
| Shanghai public transport | Shanghai | Public transport |
In this approach, the generated seed queries have a simple structure.
Preferably, embodiment three is used to obtain the seed queries.
Embodiment three:
Part of the seed queries are selected using the method of embodiment one; the groove keyword dictionary is then used to replace the groove keywords in the selected seed queries with other groove keywords of the specific field, which yields expanded seed queries.
For example, Table 5 shows seed queries obtained with embodiment three.
Table 5
| Selected seed query | Expanded seed query |
| --- | --- |
| Beijing 15 road bus routes | Shenyang 15 road bus routes |
| Beijing Public Transport 23 tunnel | Jinan public transport 12 tunnel |
The process above yields the core word vector of the specific field. The process of obtaining the core word vector of a candidate demand template is described below.
First, as with the core word vector of the specific field, a statistical corpus must be obtained. To do so, select from the queries covered by the candidate demand template in the search log the N_1 queries with the highest query counts, submit them to the search engine as the queries to be searched, and preprocess the titles and body text of the returned search results to obtain the statistical corpus, where N_1 is a positive integer.
In the statistical corpus, count the frequency with which each word occurs and compute the weight of each word according to formula (3). Words whose weight exceeds a set threshold are taken as the core words of the candidate demand template, and the weight of each core word is the weight of the corresponding vector feature.
Weight(w)=log(tf(w)+1)×log(idf(w)+1) (3)
where Weight(w) is the weight of word w, tf(w) is the frequency of w, and idf(w) is the inverse document frequency of w, which can be looked up in an inverse document frequency table computed from a large-scale corpus.
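Formula (3) and the threshold step can be sketched as follows. The token list, IDF table and threshold are invented, the function name is hypothetical, and using the raw occurrence count as tf(w) is an assumption.

```python
import math
from collections import Counter

def core_word_vector(tokens, idf, threshold):
    """Weight each word by formula (3) and keep those above the threshold."""
    tf = Counter(tokens)  # assumption: tf(w) is the raw count in the corpus
    weights = {w: math.log(n + 1) * math.log(idf.get(w, 0.0) + 1) for w, n in tf.items()}
    return {w: x for w, x in weights.items() if x > threshold}

# Toy corpus and IDF table, for illustration only.
tokens = ["bus", "bus", "bus", "the"]
idf = {"bus": 3.0, "the": 0.0}
vector = core_word_vector(tokens, idf, threshold=0.5)
print(vector)
```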
Once the core word vector of the candidate demand template and the core word vector of the specific field have been obtained, the similarity characteristic of the candidate demand template can be computed according to formula (1).
2. Generalization ability characteristic
The generalization ability characteristic can be measured by the number of distinct groove keyword sequences among the groove keyword sequences corresponding to a candidate demand template, where the groove keyword sequences corresponding to a candidate demand template are the sequences formed by the groove keywords in the queries that the template covers in the search log.
For example, the template "[city name] [bus routes] bus route" covers the queries "Beijing 15 road bus routes", "suburb, Shanghai 14 road bus routes", "Shenyang Tie Xi 2 line bus routes" and "Beijing 15 road bus route figure inquiry". Its groove keyword sequences are "Beijing 15 tunnel", "suburb, Shanghai 14 tunnel", "Shenyang Tie Xi 2 lines" and "Beijing 15 tunnel", of which the distinct sequences are "Beijing 15 tunnel", "suburb, Shanghai 14 tunnel" and "Shenyang Tie Xi 2 lines". The generalization ability characteristic value of this template is therefore 3.
Preferably, the generalization ability characteristic is computed as follows. First determine, for each candidate demand template of the specific field, the number of its distinct groove keyword sequences, and the maximum of these numbers; then compute the generalization ability characteristic value of each candidate demand template according to formula (4):
$$\mathrm{general\_score}_i = \frac{\log(\mathrm{pattern\_dif\_query}_i + 1)}{\log(\mathrm{max\_dif\_query} + 1)} \qquad (4)$$
where general_score_i is the generalization ability characteristic value of candidate demand template i, pattern_dif_query_i is the number of distinct groove keyword sequences corresponding to template i, and max_dif_query is the maximum, over all candidate demand templates of the specific field to which template i belongs, of the number of distinct groove keyword sequences.
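Formula (4) can be sketched as below, with each groove keyword sequence represented as a tuple. The data is illustrative (the first template mirrors the four-query example with one repeated sequence), and returning 0 when no sequences exist is an assumption.

```python
import math

def generalization_scores(slot_sequences):
    """slot_sequences maps template -> list of groove keyword sequences, one per covered query."""
    distinct = {t: len(set(seqs)) for t, seqs in slot_sequences.items()}
    max_distinct = max(distinct.values(), default=0)
    if max_distinct == 0:
        return {t: 0.0 for t in distinct}  # assumption: no coverage means score 0
    return {t: math.log(n + 1) / math.log(max_distinct + 1) for t, n in distinct.items()}

slot_sequences = {
    "[city name] [bus routes] bus route": [
        ("Beijing", "15"), ("Shanghai", "14"), ("Shenyang", "2"), ("Beijing", "15"),
    ],
    "[city name] phone": [("Beijing",)],
}
scores = generalization_scores(slot_sequences)
print(scores)
```

The template with the most distinct sequences always scores 1, and the logarithm dampens the gap between heavily and lightly covered templates.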
3. Border word characteristic
Border words are the words in a candidate demand template that have not been generalized. These non-generalized words affect the correctness of the finally generated templates. For example, in the public transport field, a demand template such as "[city name] [bus routes] bus route" clearly reflects the demand of the public transport field better than a template such as "mass transit card what if broken [city name]".
In the present invention, the border word characteristic of a candidate demand template W is computed by formula (5):
$$\mathrm{boundary\_word\_score} = \mathrm{CosSimilarity}(\mathrm{pattern\_centroid},\ \mathrm{positive\_centroid}) - \mathrm{CosSimilarity}(\mathrm{pattern\_centroid},\ \mathrm{negative\_centroid}) \qquad (5)$$
where boundary_word_score is the border word characteristic of candidate demand template W, CosSimilarity is the cosine similarity function, pattern_centroid is the vector formed from candidate demand template W, positive_centroid is the positive vector of the specific field, and negative_centroid is the negative vector of the specific field.
How each quantity in the formula is obtained is described below.
The process of generating the positive and negative vectors of the specific field comprises:
All candidate demand templates of the specific field are cut into fragments as n-grams (n > 1; preferably n = 2), where an n-gram is a combination of n consecutive words, each word being a minimum-granularity unit that can express meaning, and n is a preset positive integer. For example, for the template "[city name] [bus routes] bus route", suppose the minimum-granularity units are "[city name]", "[bus routes]" and "bus route"; the 2-gram cutting fragments of this template are then "[city name] [bus routes]" and "[bus routes] bus route". Likewise, for the template "what if mass transit card has broken [city name]", suppose the minimum-granularity units are "mass transit card", "has broken", "what if" and "[city name]"; the 2-gram cutting fragments of this template are then "mass transit card has broken", "what if broken" and "what if [city name]".
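The n-gram cutting can be sketched as follows, assuming the template has already been split into its minimum-granularity units (that segmentation step is outside this sketch, and the function name is hypothetical).

```python
def cut_into_ngrams(units, n=2):
    """Return the n-gram cutting fragments of a template given as a list of units."""
    return [" ".join(units[i:i + n]) for i in range(len(units) - n + 1)]

fragments = cut_into_ngrams(["[city name]", "[bus routes]", "bus route"])
print(fragments)
```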
Positive fragments and negative fragments are chosen from the cutting fragments: each positive fragment is a feature of the positive vector and each negative fragment is a feature of the negative vector, and the weight of each vector feature is determined. This process comprises:
A. Determine the groove keyword sequences corresponding to each cutting fragment, where a groove keyword sequence of a cutting fragment is the sequence formed by the groove keywords in a query covered by a candidate demand template that contains the fragment.
For example, for the cutting fragment "[city name] public transport", the candidate demand templates containing this fragment and the queries they cover are shown in Table 6:
Table 6
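[Table 6 appears as an image in the original publication and is not reproduced here: candidate demand templates containing the cutting fragment "[city name] public transport" and the queries they cover.]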
The groove keyword sequences of the cutting fragment "[city name] public transport" are then "Beijing 15 tunnel", "Shanghai 36 tunnel", "Beijing 15 tunnel" and "Hangzhou".
B. Choose the positive vector features and negative vector features from the cutting fragments, and determine the weight of each vector feature, as follows:
(1) If all groove keyword sequences of a cutting fragment are identical, the fragment is taken as a negative vector feature with weight 1.
(2) If the groove keyword sequences of a cutting fragment are not all identical, but some groove keyword sequence accounts for a proportion P of all groove keyword sequences of the fragment that is greater than a preset first threshold, the fragment is taken as a negative vector feature with weight P. Preferably, the first threshold is 90%.
(3) Determine the number of distinct groove keyword sequences corresponding to each candidate demand template of the specific field, and take the maximum Z_1 of these numbers. If a cutting fragment meets neither of the two conditions above, and the ratio of the number Z_2 of its distinct groove keyword sequences to Z_1 is greater than a preset second threshold, the fragment is taken as a positive vector feature with weight Z_2 / Z_1. Preferably, the second threshold is 1%.
For example, for the cutting fragment "[city name] public transport", the distinct groove keyword sequences are "Beijing 15 tunnel", "Shanghai 36 tunnel" and "Hangzhou", so the number of distinct groove keyword sequences is 3. "Beijing 15 tunnel" accounts for 2/4 of all groove keyword sequences, and "Shanghai 36 tunnel" and "Hangzhou" each account for 1/4, so the fragment meets neither condition (1) nor condition (2) and is not a negative vector feature. Suppose the maximum number of distinct groove keyword sequences over the candidate demand templates of the specific field is 10 and the second threshold is 1%; since 3/10 is greater than 1%, this cutting fragment is taken as a positive vector feature.
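Rules (1) to (3) can be sketched as a single classification function. The function name and return convention are hypothetical; the thresholds default to the preferred values of 90% and 1%, and the example reproduces the worked case above with Z_1 = 10.

```python
from collections import Counter

def classify_fragment(sequences, max_distinct, first_threshold=0.9, second_threshold=0.01):
    """Return (label, weight), where label is "negative", "positive" or None."""
    if not sequences or max_distinct <= 0:
        return (None, 0.0)
    counts = Counter(sequences)
    if len(counts) == 1:
        return ("negative", 1.0)           # rule (1): all sequences identical
    top_share = max(counts.values()) / len(sequences)
    if top_share > first_threshold:
        return ("negative", top_share)     # rule (2): one sequence dominates
    ratio = len(counts) / max_distinct     # Z_2 / Z_1
    if ratio > second_threshold:
        return ("positive", ratio)         # rule (3)
    return (None, 0.0)

sequences = ["Beijing 15", "Shanghai 36", "Beijing 15", "Hangzhou"]
result = classify_fragment(sequences, max_distinct=10)
print(result)
```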
Taking the templates in Table 2 as an example, the positive vector and negative vector obtained in this way are shown in Table 7 and Table 8 respectively:
Table 7
| Vector feature in the positive vector | Feature weight |
| --- | --- |
| [city name] [bus routes] | 1.000000 |
| [bus routes] bus route | 1.000000 |
| [city name] public transport | 0.666667 |
| Public transport [bus routes] | 0.666667 |
| [location name] arrives | 0.666667 |
| To [location name] | 1.000000 |
| [location name] | 0.666667 |
| Bus | 0.666667 |
Table 8
| Vector feature in the negative vector | Feature weight |
| --- | --- |
| [location name] bus route | 1.000000 |
| The public transport monthly ticket | 1.000000 |
| Monthly ticket [city name] | 1.000000 |
| Mass transit card [location name] | 1.000000 |
| [location name] recharge point | 1.000000 |
| Public transport [city name] | 1.000000 |
| [city name] phone | 1.000000 |
| Public transport [location name] | 1.000000 |
| [location name] catches a thief | 1.000000 |
| Mass transit card has broken | 1.000000 |
| What if broken | 1.000000 |
| What if [city name] | 1.000000 |
The features of the vector formed from a candidate demand template W are the cutting fragments of W, where the cutting is performed in the same way as described for the positive and negative vectors, and the weight of each feature can be determined by the number of times the corresponding cutting fragment occurs in W.
For example, the cutting fragments of the template "[city name] [bus routes] bus route" are "[city name] [bus routes]" and "[bus routes] bus route". Because each of these two fragments occurs once in the template, the feature weights of the vector features "[city name] [bus routes]" and "[bus routes] bus route" of this template are both 1. If a template were "[city name] [bus routes] [city name] [bus routes]", the feature weight of its vector feature "[city name] [bus routes]" would be 2.
The way of determining the feature weights of a candidate demand template's vector is not unique: besides using the number of occurrences of a cutting fragment in the template as the weight of the corresponding vector feature, a Boolean value can also be used. The way of computing the feature weights is not limited here.
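Putting the template vector and formula (5) together gives the sketch below. The positive and negative vectors are a tiny hypothetical subset in the style of Tables 7 and 8, the function names are illustrative, and occurrence counts are used as feature weights (the Boolean variant is not shown).

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse {feature: weight} vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def template_vector(units, n=2):
    """Weight of each cutting fragment = number of times it occurs in the template."""
    return Counter(" ".join(units[i:i + n]) for i in range(len(units) - n + 1))

def boundary_word_score(units, positive, negative, n=2):
    """Formula (5): similarity to the positive vector minus similarity to the negative vector."""
    pattern = template_vector(units, n)
    return cosine(pattern, positive) - cosine(pattern, negative)

# Hypothetical positive and negative vectors for illustration.
positive = {"[city name] [bus routes]": 1.0, "[bus routes] bus route": 1.0}
negative = {"monthly ticket [city name]": 1.0}
score = boundary_word_score(["[city name]", "[bus routes]", "bus route"], positive, negative)
print(round(score, 6))
```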
Taking the candidate demand templates in Table 2 as an example, the border word characteristic of each candidate demand template is shown in Table 9:
Table 9
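[Table 9 appears as an image in the original publication and is not reproduced here: border word characteristic of each candidate demand template in Table 2.]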
In step S103, the sorting process comprises:
1. Selecting a standard template set from the candidate demand templates, comprising:
For each extracted characteristic, sort the candidate demand templates by the value of that characteristic, and take the candidate demand templates ranked in the top N_3 positions as the template set of that characteristic, where N_3 is a positive integer.
Take the intersection of the template sets of all characteristics, and use this intersection as the standard template set.
For example, sorting candidate demand templates S1 to S10 by characteristics 1, 2 and 3 gives Table 10:
Table 10
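[Table 10 appears as an image in the original publication and is not reproduced here: rankings of candidate demand templates S1 to S10 under characteristics 1, 2 and 3.]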
If N_3 = 5, the template set of characteristic 1 is {S5, S6, S4, S2, S1}, the template set of characteristic 2 is {S4, S5, S2, S8, S1}, and the template set of characteristic 3 is {S2, S10, S5, S6, S1}; the intersection of the template sets of the characteristics is therefore {S1, S2, S5}.
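The top-N_3 intersection can be sketched as below. The function name and the data layout (each characteristic mapped to its templates sorted best-first) are assumptions; the example uses the top-5 lists quoted above.

```python
def standard_template_set(rankings, n3):
    """rankings maps each characteristic to its templates sorted best-first."""
    top_sets = [set(ranked[:n3]) for ranked in rankings.values()]
    return set.intersection(*top_sets) if top_sets else set()

rankings = {
    "characteristic 1": ["S5", "S6", "S4", "S2", "S1"],
    "characteristic 2": ["S4", "S5", "S2", "S8", "S1"],
    "characteristic 3": ["S2", "S10", "S5", "S6", "S1"],
}
standard_set = standard_template_set(rankings, n3=5)
print(sorted(standard_set))
```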
2. Using the standard template set to train the parameter corresponding to each extracted characteristic. The parameter values at which, during training, the ranking of the templates of the standard template set among all candidate demand templates can no longer be moved forward are taken as the weights of the corresponding characteristics.
Formula (6) gives the score of each candidate demand template when all candidate demand templates are sorted on the basis of all extracted characteristics. The higher the score, the better the quality of the candidate demand template and the higher its rank.
$$\mathrm{total\_score} = \lambda_1 \cdot \mathrm{sim\_score} + \lambda_2 \cdot \mathrm{general\_score} + \lambda_3 \cdot \mathrm{boundary\_word\_score} \qquad (6)$$
where sim_score, general_score and boundary_word_score are the values of the similarity characteristic, the generalization ability characteristic and the border word characteristic respectively, and λ_1, λ_2 and λ_3 are the parameters to be trained, which represent the weight of each characteristic.
The parameters are trained by gradient descent: the parameter values are adjusted over successive iterations so that the templates of the standard template set rank as high as possible, until the ranking of those templates among all candidate demand templates no longer improves. The parameter values at that point are the weights of the corresponding characteristics.
3. Using each extracted characteristic and its weight to compute the score of each candidate demand template, and sorting the candidate demand templates by this score. That is, formula (6) is used to compute the score of each candidate demand template, with λ_1, λ_2 and λ_3 in formula (6) being the characteristic weights obtained by training.
Once the scores of the candidate demand templates have been computed in this way, the templates can be sorted in descending order of score.
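The scoring and sorting step of formula (6) can be sketched as follows. The characteristic values and the weights are invented for illustration; in the described method the weights would come from the training step, which this sketch does not attempt to reproduce.

```python
def rank_templates(features, weights):
    """features maps template -> (sim_score, general_score, boundary_word_score)."""
    def total_score(template):
        # Formula (6): weighted sum of the three characteristic values.
        return sum(lam * value for lam, value in zip(weights, features[template]))
    return sorted(features, key=total_score, reverse=True)

# Hypothetical characteristic values and weights (lambda_1, lambda_2, lambda_3).
features = {"A": (0.9, 1.0, 0.8), "B": (0.2, 0.5, -0.6), "C": (0.6, 0.3, 0.1)}
ranked = rank_templates(features, weights=(0.5, 0.2, 0.3))
print(ranked)
```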
When the final demand templates are chosen in step S104, in addition to taking the candidate demand templates ranked in the top N_4 positions as final demand templates, the border words of the candidate demand templates ranked in the top M_2 positions can be used to choose further final demand templates from the candidate demand templates ranked after the top N_4 positions, where M_2 and N_4 are positive integers and M_2 ≤ N_4.
Specifically:
Using a keyword dictionary, obtain the keyword set corresponding to the border words of the candidate demand templates ranked in the top M_2 positions, where a keyword is a word synonymous with a border word or a word whose mutual information with a border word meets a requirement;
Among the candidate demand templates ranked after the top N_4 positions, take those whose border words all belong to the keyword set as final demand templates.
Suppose the templates ranked within the top M_2 positions are "[city name] [bus routes] bus route", "[location name] to [location name] bus" and "[city name] public transport [bus routes]", whose border words include "bus route", "to" and "bus". Through the keyword dictionary, the keyword set corresponding to these border words is obtained as "friendships/urban district public transport/bus routes of public transport/industry and traffic/industry and traffic car/bus/public transport/public transport line/motorbus/public transport/bus/public transport joint operation car/bus routes/bus/public transport line/public bus network/bus/altogether// to/arrival". Then, for the template "to [location name] bus route" ranked after the top N_4 positions, because its border words "to" and "bus route" are both in the keyword set, this template is also chosen as a final template. The keywords in the keyword dictionary can be obtained by various prior-art techniques, such as synonym mining or mutual information computation, which are not described in detail here.
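The selection of final templates can be sketched as below. The function name, the data layout and the tiny keyword dictionary are hypothetical. Two points are assumptions rather than statements from the text: a border word is counted as its own keyword, and a template with no border words is not promoted.

```python
def select_final_templates(ranked, border_words, keyword_dict, n4, m2):
    """ranked is sorted best-first; border_words maps template -> its border words."""
    keyword_set = set()
    for template in ranked[:m2]:
        for word in border_words[template]:
            keyword_set.add(word)  # assumption: a border word is its own keyword
            keyword_set.update(keyword_dict.get(word, ()))
    final = list(ranked[:n4])
    for template in ranked[n4:]:
        words = border_words[template]
        # assumption: a template without border words is not promoted
        if words and all(w in keyword_set for w in words):
            final.append(template)
    return final

ranked = [
    "[city name] [bus routes] bus route",
    "[location name] to [location name] bus",
    "to [location name] bus route",
    "[city name] phone",
]
border_words = {
    ranked[0]: ["bus route"],
    ranked[1]: ["to", "bus"],
    ranked[2]: ["to", "bus route"],
    ranked[3]: ["phone"],
}
keyword_dict = {"bus": ["motorbus"], "to": ["arrive at"]}  # hypothetical synonyms
final = select_final_templates(ranked, border_words, keyword_dict, n4=2, m2=2)
print(final)
```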
Referring to Fig. 4, which is a structural block diagram of an embodiment of the device for generating field templates in the present invention: the device comprises a candidate demand template acquiring unit 201, a feature extraction unit 202, a sequencing unit 203 and a choosing unit 204.
The candidate demand template acquiring unit 201 is used to obtain the candidate demand templates of the specific field. Preferably, the candidate demand template acquiring unit 201 comprises a qualification unit 2011 and a generalization unit 2012.
The qualification unit 2011 is used to select, from the search log, the user search requests (queries) that match a preset determiner of the specific field, where a determiner of the specific field is a word related to the specific field. The generalization unit 2012 is used to replace the parts of the selected queries that match preset groove keywords of the specific field with wildcards to obtain the candidate demand templates, where the groove keywords of the specific field are the words used for generalization in that field.
Further, the candidate demand template acquiring unit 201 can also comprise a filter unit, used to filter out, from the candidate demand templates obtained by the generalization unit, those that do not satisfy the preset slot-count requirement of the specific field.
The feature extraction unit 202 is used to extract the characteristics of the candidate demand templates. Preferably, the feature extraction unit 202 comprises at least one of a similarity feature extraction unit 2021, a generalization ability feature extraction unit 2022 and a border word feature extraction unit 2023.
The similarity feature extraction unit 2021 is used to extract the similarity characteristic of a candidate demand template, which describes how closely the candidate demand template is related to the specific field. Referring to Fig. 5, which is a structural block diagram of an embodiment of the similarity feature extraction unit in the present invention: the similarity feature extraction unit 2021 comprises a template word vector generation unit 2021_1, a field word vector generation unit 2021_2 and a computing unit 2021_3.
The template word vector generation unit 2021_1 is used to obtain the core word vector of W when the similarity characteristic of a candidate demand template W is extracted.
The field word vector generation unit 2021_2 is used to obtain the core word vector of the specific field.
The computing unit 2021_3 is used to compute the similarity between the core word vector of W and the core word vector of the specific field, and to take this similarity as the similarity characteristic of W.
Preferably, when obtaining the core word vector of W, the template word vector generation unit 2021_1 selects from the queries covered by W in the search log the N_1 queries with the highest query counts, and determines the core words and their weights from the search results that the search engine returns for these N_1 queries, so as to form the core word vector of W, where N_1 is any positive integer.
The ways in which the field word vector generation unit 2021_2 obtains the seed queries of the specific field include:
Mode one: from all candidate demand templates of the specific field, select the N_2 templates that cover the largest number of queries in the search log, and for each of these N_2 templates select, from the queries it covers, the M_1 queries with the highest query counts as seed queries, where N_2 and M_1 are positive integers.
Mode two: combine the preset groove keywords of the specific field with the preset determiners of the specific field to generate the seed queries of the specific field.
Mode three: after part of the seed queries are selected using mode one, use the preset groove keyword dictionary of the specific field to replace the groove keywords in the selected seed queries with other groove keywords in the dictionary, which yields expanded seed queries; the selected seed queries and the expanded seed queries together constitute the seed queries of the specific field.
Preferably, the field word vector generation unit 2021_2 uses mode three to obtain the seed queries of the specific field.
Referring again to Fig. 4, the generalization ability feature extraction unit 2022 is used to extract the generalization ability characteristic of a candidate demand template, which describes the ability of the candidate demand template to cover user search requests (queries).
Preferably, when extracting the generalization ability characteristic of a candidate demand template W, the generalization ability feature extraction unit 2022 determines the groove keyword sequences corresponding to W, counts the number of distinct groove keyword sequences among them, and computes the generalization ability characteristic of W from this number, where the groove keyword sequences corresponding to W are the sequences formed by the groove keywords in the queries that W covers in the search log.
The border word feature extraction unit 2023 is used to extract the border word characteristic of a candidate demand template, which describes the influence that the non-generalized words in the candidate demand template have on the correctness of the template.
Referring to Fig. 6, which is a structural block diagram of an embodiment of the border word feature extraction unit in the present invention: this embodiment comprises a cutting unit 2023_1, a positive and negative vector generation unit 2023_2, a template vector generation unit 2023_3 and a similarity calculating unit 2023_4.
Wherein 2023_1 all candidate's demand masterplate cuttings of being used for specific area is comprised in cutting unit are fragment.
Positive negative vector generation unit 2023_2 is used for each cutting fragment that 2023_1 obtains from the cutting unit and chooses the positive vector of the weight of the also definite positive fragment of positive fragment with the generation specific area, from each the cutting fragment that obtains, chooses the negative film section and confirms the negative vector of the weight of negative film section with the generation specific area.Preferably, positive negative vector generation unit 2023_3 comprises that the groove keyword sequence confirms that unit 2023_21 and positive and negative fragment choose unit 2023_22.
Wherein groove sequence speech confirms that unit 2023_21 is used for the groove keyword sequence of confirming that each cutting fragment is corresponding, and a corresponding groove keyword sequence of one of them cutting fragment is to comprise the sequence that the groove keyword among the query that candidate's demand masterplate of this cutting fragment covered is formed.
Positive and negative fragment is chosen the weight that unit 2023_22 is used for choosing from each cutting fragment according to following manner positive fragment and negative film section and definite positive fragment and negative film section:
(1) If all slot keyword sequences corresponding to a segment are identical, the segment is taken as a negative segment, and the weight of this negative segment is 1.
(2) If the slot keyword sequences corresponding to a segment are not all identical, but there is one slot keyword sequence whose proportion P among all slot keyword sequences of the segment is greater than a preset first threshold, the segment is taken as a negative segment, and the weight of this negative segment is the proportion P.
(3) The number of distinct slot keyword sequences corresponding to each candidate demand template contained in the specific domain is determined, and the maximum of these numbers is taken as Z1. If a segment satisfies neither condition (1) nor condition (2), and the ratio of the number Z2 of distinct slot keyword sequences corresponding to the segment to Z1 is greater than a preset second threshold, the segment is taken as a positive segment, and the weight of this positive segment is the ratio of Z2 to Z1.
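The three selection rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the function name `classify_segment`, the default threshold values, and the representation of a slot keyword sequence as a tuple of strings are all hypothetical choices made for the example.

```python
from collections import Counter

def classify_segment(slot_sequences, z1, first_threshold=0.8, second_threshold=0.3):
    """Classify one segment and return (label, weight).

    slot_sequences: list of slot keyword sequences (tuples) observed for the segment.
    z1: largest number of distinct slot keyword sequences over all candidate
        demand templates of the domain (must be positive).
    The two threshold defaults are illustrative assumptions.
    """
    if not slot_sequences:
        return None, 0.0
    counts = Counter(slot_sequences)
    distinct = len(counts)
    # Rule (1): every sequence is identical -> negative segment with weight 1.
    if distinct == 1:
        return "negative", 1.0
    # Rule (2): one sequence dominates -> negative segment with weight P.
    p = max(counts.values()) / len(slot_sequences)
    if p > first_threshold:
        return "negative", p
    # Rule (3): many distinct sequences relative to Z1 -> positive segment.
    ratio = distinct / z1
    if ratio > second_threshold:
        return "positive", ratio
    # The segment is neither positive nor negative.
    return None, 0.0
```

For instance, a segment that always co-occurs with the same slot keyword is classified as negative with weight 1, whereas a segment seen with four distinct slot keyword sequences when Z1 is 10 is classified as positive with weight 0.4 under the assumed thresholds.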
The template vector generation unit 2023_3 is configured, when the boundary word feature of a candidate demand template W is extracted, to determine the weights of the segments of W and to form the vector of W from the segments of W and their weights. Preferably, when determining the weights of the segments of W, the template vector generation unit 2023_3 counts the number of times each segment of W occurs in W and uses this count as the weight of the corresponding segment.
The similarity calculation unit 2023_4 is configured to calculate the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and to obtain the boundary word feature of W from the difference between S1 and S2.
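A minimal Python sketch of this computation is given below. The text only says "similarity" and "difference", so the use of cosine similarity and of the plain difference S1 - S2 as the feature value are assumptions; the function names and the dictionary representation of vectors are likewise hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def boundary_word_feature(template_segments, positive_vector, negative_vector):
    """Return S1 - S2 for a template W given as a list of its segments."""
    # Weight of each segment of W = number of times it occurs in W.
    w_vector = Counter(template_segments)
    s1 = cosine(w_vector, positive_vector)  # similarity to the positive vector
    s2 = cosine(w_vector, negative_vector)  # similarity to the negative vector
    return s1 - s2
```

Under these assumptions, a template whose segments appear mostly in the positive vector receives a value close to 1, and one whose segments appear mostly in the negative vector receives a value close to -1.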
Please continue to refer to Fig. 4. The ranking unit 203 is configured to rank the candidate demand templates using the features extracted by the feature extraction unit 202. The ranking unit 203 comprises a standard template set selection unit 2031, a training unit 2032, and a calculation and ranking unit 2033.
The standard template set selection unit 2031 is configured to select a standard template set from the candidate demand templates. Please refer to Fig. 7, which is a structural block diagram of an embodiment of the standard template set selection unit of the present invention. As shown in Fig. 7, the standard template set selection unit 2031 comprises a template set determination unit 2031_1 and an intersection unit 2031_2. The template set determination unit 2031_1 is configured, for each extracted feature, to rank the candidate demand templates by their value of that feature and to take the top N3 candidate demand templates as the template set of that feature, where N3 is a positive integer. The intersection unit 2031_2 is configured to take the intersection of the template sets of all features as the standard template set.
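The selection of the standard template set can be illustrated with the short Python sketch below. The function name `standard_template_set`, the nested-dictionary input format, and the assumption that a larger feature value ranks higher are hypothetical choices for the example.

```python
def standard_template_set(feature_values, n3):
    """Intersect the top-N3 templates of every feature.

    feature_values: dict mapping a feature name to a dict that maps each
                    candidate template to its value for that feature.
    n3: positive integer, number of top templates kept per feature.
    """
    top_sets = []
    for values in feature_values.values():
        # Rank templates by this feature, highest value first (assumption).
        ranked = sorted(values, key=values.get, reverse=True)
        top_sets.append(set(ranked[:n3]))
    # Templates that are in the top N3 for every feature.
    return set.intersection(*top_sets) if top_sets else set()

features = {
    "similarity": {"t1": 0.9, "t2": 0.8, "t3": 0.1},
    "generalization": {"t1": 0.7, "t2": 0.2, "t3": 0.9},
    "boundary_word": {"t1": 0.5, "t2": 0.4, "t3": 0.3},
}
standard = standard_template_set(features, n3=2)  # {"t1"}
```

Only `t1` is among the top two templates for all three features, so it alone forms the standard template set in this toy example.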
Please continue to refer to Fig. 4. The training unit 2032 is configured to train the parameter corresponding to each extracted feature using the standard template set, and to take as the weight of the corresponding feature the parameter value at which, during training, the rank of the templates of the standard template set among all candidate demand templates cannot be moved any further forward.
The calculation and ranking unit 2033 is configured to calculate a score for each candidate demand template using the features extracted by the feature extraction unit 202 and the feature weights obtained by the training unit 2032, and to rank the candidate demand templates by this score. Preferably, the candidate demand templates are ranked from the highest score to the lowest.
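The training and scoring steps can be sketched as follows. The text does not specify the scoring formula or the training procedure, so both the linear weighted sum and the exhaustive grid search used here are assumptions, as are all function names and the candidate weight values.

```python
from itertools import product

def score(features, weights):
    """Weighted sum of feature values (the linear form is an assumption)."""
    return sum(weights[name] * value for name, value in features.items())

def rank_templates(candidates, weights):
    """Return the templates sorted from the highest score to the lowest."""
    return sorted(candidates, key=lambda t: score(candidates[t], weights), reverse=True)

def train_weights(candidates, standard_set, grid=(0.0, 0.5, 1.0)):
    """Grid-search weights that move the standard templates as far forward as possible."""
    names = sorted(next(iter(candidates.values())))
    best_weights, best_cost = None, None
    for combo in product(grid, repeat=len(names)):
        if not any(combo):
            continue  # all-zero weights give an arbitrary ranking; skip them
        weights = dict(zip(names, combo))
        ranking = rank_templates(candidates, weights)
        # Cost = sum of rank positions of the standard templates (lower is better).
        cost = sum(ranking.index(t) for t in standard_set)
        if best_cost is None or cost < best_cost:
            best_weights, best_cost = weights, cost
    return best_weights

candidates = {
    "t1": {"sim": 0.9, "gen": 0.1},
    "t2": {"sim": 0.2, "gen": 0.9},
    "t3": {"sim": 0.5, "gen": 0.5},
}
weights = train_weights(candidates, standard_set={"t2"})
ranking = rank_templates(candidates, weights)
```

In this toy run the search settles on weights that favour the `gen` feature, because those weights place the standard template `t2` first in the ranking.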
The selection unit 204 is configured to select, according to the ranking result of the ranking unit 203, final demand templates from the candidate demand templates as the demand templates of the specific domain. Preferably, the selection unit 204 comprises a first selection unit 2041 and a second selection unit 2042. The first selection unit 2041 is configured to select the top N4 ranked candidate demand templates as final demand templates, where N4 is a positive integer. The second selection unit 2042 is configured to obtain a keyword set from the boundary words of the top M2 ranked candidate demand templates, and to select as final demand templates those candidate demand templates ranked after the top N4 whose boundary words all belong to the keyword set. Here a boundary word is a word in a candidate demand template that has not been generalized, a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word meets a requirement, M2 is a positive integer, and M2 is less than or equal to N4.
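The two-stage selection can be sketched in Python as below. The function `select_final_templates` and the callable `expand_keywords` are hypothetical: the text says only that keywords are synonyms of boundary words or words whose mutual information with them meets a requirement, so how the keyword set is built is left to the caller here.

```python
def select_final_templates(ranked, boundary_words, n4, m2, expand_keywords):
    """Select final demand templates from a ranked candidate list.

    ranked: candidate templates, best first.
    boundary_words: dict mapping each template to its non-generalized words.
    expand_keywords: hypothetical callable that maps a set of boundary words
                     to the keyword set (e.g. via synonyms or mutual information).
    """
    if m2 > n4:
        raise ValueError("M2 must be less than or equal to N4")
    # First stage: the top N4 templates are accepted directly.
    final = list(ranked[:n4])
    # Build the keyword set from the boundary words of the top M2 templates.
    seed_words = {w for t in ranked[:m2] for w in boundary_words[t]}
    keyword_set = expand_keywords(seed_words)
    # Second stage: accept lower-ranked templates whose boundary words all
    # belong to the keyword set. Skipping templates without any boundary
    # word is an assumption of this sketch.
    for template in ranked[n4:]:
        words = boundary_words[template]
        if words and all(w in keyword_set for w in words):
            final.append(template)
    return final

ranked = ["A", "B", "C", "D"]
boundary_words = {
    "A": ["guide"],
    "B": ["travel", "guide"],
    "C": ["tour", "guide"],
    "D": ["weather"],
}
final = select_final_templates(
    ranked, boundary_words, n4=2, m2=1,
    expand_keywords=lambda words: set(words) | {"tour"},  # toy synonym expansion
)
```

Here `A` and `B` are kept because they rank in the top two, `C` is added because both of its boundary words fall in the keyword set, and `D` is rejected.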
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (28)

1. A method for generating a domain demand template, characterized in that the method comprises:
A. obtaining candidate demand templates of a specific domain;
B. extracting features of the candidate demand templates, the features comprising at least one of: a similarity feature characterizing how closely a candidate demand template is related to the specific domain, a generalization ability feature characterizing the ability of a candidate demand template to cover user search requests (queries), and a boundary word feature characterizing the influence that the non-generalized words in a candidate demand template have on the correctness of that template;
C. ranking the candidate demand templates using the extracted features;
D. selecting, according to the ranking result, final demand templates from the candidate demand templates as the demand templates of the specific domain.
2. The method according to claim 1, characterized in that step A comprises:
A1. selecting, from the user queries in a search log, the queries that match a preset qualifier of the specific domain;
A2. replacing the part of each selected query that matches a preset slot keyword of the specific domain with a wildcard, to obtain the candidate demand templates.
3. The method according to claim 2, characterized in that, after step A2, the method further comprises: according to a preset slot number requirement for the specific domain, filtering out, from the candidate demand templates obtained in step A2, the candidate demand templates that do not satisfy the slot number requirement.
4. The method according to claim 1, characterized in that the step of extracting the similarity feature of a candidate demand template W comprises:
obtaining the core word vector of W and the core word vector of the specific domain;
calculating the similarity between the core word vector of W and the core word vector of the specific domain, and using this similarity as the similarity feature of W.
5. The method according to claim 4, characterized in that the step of obtaining the core word vector of W comprises:
selecting, from the queries covered by W, the N1 queries with the highest query counts in the search log, and determining core words and the weights of the core words from the search results returned by a search engine for the N1 queries, so as to form the core word vector of W, where N1 is a positive integer.
6. The method according to claim 4, characterized in that the step of obtaining the core word vector of the specific domain comprises:
obtaining the search results returned by a search engine for the seed queries of the specific domain, and determining core words and the weights of the core words from these search results, so as to form the core word vector of the specific domain.
7. The method according to claim 6, characterized in that the seed queries of the specific domain are obtained in one of the following manners:
Manner one: selecting, from all candidate demand templates contained in the specific domain, the N2 candidate demand templates that cover the largest number of queries in the search log, and, for each of the N2 candidate demand templates, selecting from the queries it covers the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers; or,
Manner two: combining the preset slot keywords of the specific domain with the preset qualifiers of the specific domain to generate the seed queries of the specific domain; or,
Manner three: after selecting part of the seed queries using manner one, using a preset slot keyword dictionary of the specific domain to replace the slot keywords in the seed queries selected by manner one with other slot keywords in the dictionary, so as to obtain expanded seed queries; the part of the seed queries and the expanded seed queries together constitute the seed queries of the specific domain.
8. The method according to claim 1, characterized in that the step of extracting the generalization ability feature of a candidate demand template W comprises:
determining the slot keyword sequences corresponding to W, counting the number of distinct slot keyword sequences among them, and calculating the generalization ability feature of W from this number, where a slot keyword sequence corresponding to W is the sequence formed by the slot keywords in a query that W covers in the search log.
9. The method according to claim 1, characterized in that the step of extracting the boundary word feature of a candidate demand template W comprises:
splitting all candidate demand templates contained in the specific domain into segments; selecting positive segments from the obtained segments and determining the weight of each positive segment, so as to generate the positive vector of the specific domain, and selecting negative segments from the obtained segments and determining the weight of each negative segment, so as to generate the negative vector of the specific domain;
determining the weights of the segments of W and forming the vector of W from the segments of W and their weights;
calculating the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtaining the boundary word feature of W from the difference between S1 and S2.
10. The method according to claim 9, characterized in that the process of generating the positive vector and the negative vector of the specific domain specifically comprises:
determining the slot keyword sequences corresponding to each segment, where one slot keyword sequence corresponding to a segment is the sequence formed by the slot keywords in a query covered by a candidate demand template that contains the segment;
T1. if all slot keyword sequences corresponding to a segment are identical, taking the segment as a negative segment, the weight of this negative segment being 1;
T2. if the slot keyword sequences corresponding to a segment are not all identical, but there is one slot keyword sequence whose proportion P among all slot keyword sequences of the segment is greater than a preset first threshold, taking the segment as a negative segment, the weight of this negative segment being the proportion P;
T3. determining the number of distinct slot keyword sequences corresponding to each candidate demand template contained in the specific domain, and taking the maximum of these numbers as Z1; if a segment satisfies neither condition T1 nor condition T2, and the ratio of the number Z2 of distinct slot keyword sequences corresponding to the segment to Z1 is greater than a preset second threshold, taking the segment as a positive segment, the weight of this positive segment being the ratio of Z2 to Z1.
11. The method according to claim 9, characterized in that the step of determining the weights of the segments of W comprises:
counting the number of times each segment of W occurs in W and using this count as the weight of the corresponding segment.
12. The method according to claim 1, characterized in that step C comprises:
selecting a standard template set from the candidate demand templates;
training the parameter corresponding to each extracted feature using the standard template set, and taking as the weight of the corresponding feature the parameter value at which, during training, the rank of the templates of the standard template set among all candidate demand templates cannot be moved any further forward;
calculating a score for each candidate demand template using the extracted features and the feature weights, and ranking the candidate demand templates by this score.
13. The method according to claim 12, characterized in that the step of selecting a standard template set from the candidate demand templates comprises:
for each extracted feature, ranking the candidate demand templates by their value of that feature and taking the top N3 candidate demand templates as the template set of that feature, where N3 is a positive integer;
taking the intersection of the template sets of all features as the standard template set.
14. The method according to claim 1, characterized in that step D comprises:
selecting the top N4 ranked candidate demand templates as final demand templates, where N4 is a positive integer;
obtaining a keyword set from the boundary words of the top M2 ranked candidate demand templates, and selecting as final demand templates those candidate demand templates ranked after the top N4 whose boundary words all belong to the keyword set; where a boundary word is a word in a candidate demand template that has not been generalized, a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word meets a requirement, M2 is a positive integer, and M2 is less than or equal to N4.
15. A device for generating a domain demand template, characterized in that the device comprises:
a candidate template acquisition unit, configured to obtain candidate demand templates of a specific domain;
a feature extraction unit, configured to extract features of the candidate demand templates, where the feature extraction unit comprises at least one of a similarity feature extraction unit, a generalization ability feature extraction unit and a boundary word feature extraction unit; the similarity feature extraction unit is configured to extract a similarity feature characterizing how closely a candidate demand template is related to the specific domain, the generalization ability feature extraction unit is configured to extract a generalization ability feature characterizing the ability of a candidate demand template to cover user search requests (queries), and the boundary word feature extraction unit is configured to extract a boundary word feature characterizing the influence that the non-generalized words in a candidate demand template have on the correctness of that template;
a ranking unit, configured to rank the candidate demand templates using the features extracted by the feature extraction unit;
a selection unit, configured to select, according to the ranking result of the ranking unit, final demand templates from the candidate demand templates as the demand templates of the specific domain.
16. The device according to claim 15, characterized in that the candidate template acquisition unit comprises:
a qualification unit, configured to select, from the user queries in a search log, the queries that match a preset qualifier of the specific domain;
a generalization unit, configured to replace the part of each query selected by the qualification unit that matches a preset slot keyword of the specific domain with a wildcard, to obtain the candidate demand templates.
17. The device according to claim 16, characterized in that the candidate template acquisition unit further comprises a filtering unit, configured to filter out, according to a preset slot number requirement for the specific domain, the candidate demand templates obtained by the generalization unit that do not satisfy the slot number requirement.
18. The device according to claim 15, characterized in that the similarity feature extraction unit comprises:
a template word vector generation unit, configured to obtain the core word vector of a candidate demand template W when the similarity feature of W is extracted;
a domain word vector generation unit, configured to obtain the core word vector of the specific domain;
a calculation unit, configured to calculate the similarity between the core word vector of W and the core word vector of the specific domain, and to use this similarity as the similarity feature of W.
19. The device according to claim 18, characterized in that the template word vector generation unit selects, from the queries covered by W, the N1 queries with the highest query counts in the search log, and determines core words and the weights of the core words from the search results returned by a search engine for the N1 queries, so as to form the core word vector of W, where N1 is a positive integer.
20. The device according to claim 18, characterized in that the domain word vector generation unit obtains the search results returned by a search engine for the seed queries of the specific domain, and determines core words and the weights of the core words from these search results, so as to form the core word vector of the specific domain.
21. The device according to claim 20, characterized in that the domain word vector generation unit obtains the seed queries of the specific domain in one of the following manners:
Manner one: selecting, from all candidate demand templates contained in the specific domain, the N2 candidate demand templates that cover the largest number of queries in the search log, and, for each of the N2 candidate demand templates, selecting from the queries it covers the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers; or,
Manner two: combining the preset slot keywords of the specific domain with the preset qualifiers of the specific domain to generate the seed queries of the specific domain; or,
Manner three: after selecting part of the seed queries using manner one, using a preset slot keyword dictionary of the specific domain to replace the slot keywords in the seed queries selected by manner one with other slot keywords in the dictionary, so as to obtain expanded seed queries; the part of the seed queries and the expanded seed queries together constitute the seed queries of the specific domain.
22. The device according to claim 15, characterized in that, when extracting the generalization ability feature of a candidate demand template W, the generalization ability feature extraction unit determines the slot keyword sequences corresponding to W, counts the number of distinct slot keyword sequences among them, and calculates the generalization ability feature of W from this number, where a slot keyword sequence corresponding to W is the sequence formed by the slot keywords in a query that W covers in the search log.
23. The device according to claim 15, characterized in that the boundary word feature extraction unit comprises:
a segmentation unit, configured to split all candidate demand templates contained in the specific domain into segments;
a positive/negative vector generation unit, configured to select positive segments from the segments obtained by the segmentation unit and determine the weight of each positive segment, so as to generate the positive vector of the specific domain, and to select negative segments from the obtained segments and determine the weight of each negative segment, so as to generate the negative vector of the specific domain;
a template vector generation unit, configured, when the boundary word feature of a candidate demand template W is extracted, to determine the weights of the segments of W and to form the vector of W from the segments of W and their weights;
a similarity calculation unit, configured to calculate the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and to obtain the boundary word feature of W from the difference between S1 and S2.
24. The device according to claim 23, characterized in that the positive/negative vector generation unit comprises:
a slot keyword sequence determination unit, configured to determine the slot keyword sequences corresponding to each segment, where one slot keyword sequence corresponding to a segment is the sequence formed by the slot keywords in a query covered by a candidate demand template that contains the segment;
a positive/negative segment selection unit, configured to select positive segments and negative segments from the segments, and to determine their weights, in the following manner:
T1. if all slot keyword sequences corresponding to a segment are identical, taking the segment as a negative segment, the weight of this negative segment being 1;
T2. if the slot keyword sequences corresponding to a segment are not all identical, but there is one slot keyword sequence whose proportion P among all slot keyword sequences of the segment is greater than a preset first threshold, taking the segment as a negative segment, the weight of this negative segment being the proportion P;
T3. determining the number of distinct slot keyword sequences corresponding to each candidate demand template contained in the specific domain, and taking the maximum of these numbers as Z1; if a segment satisfies neither condition T1 nor condition T2, and the ratio of the number Z2 of distinct slot keyword sequences corresponding to the segment to Z1 is greater than a preset second threshold, taking the segment as a positive segment, the weight of this positive segment being the ratio of Z2 to Z1.
25. The device according to claim 23, characterized in that, when determining the weights of the segments of W, the template vector generation unit counts the number of times each segment of W occurs in W and uses this count as the weight of the corresponding segment.
26. The device according to claim 15, characterized in that the ranking unit comprises:
a standard template set selection unit, configured to select a standard template set from the candidate demand templates;
a training unit, configured to train the parameter corresponding to each extracted feature using the standard template set, and to take as the weight of the corresponding feature the parameter value at which, during training, the rank of the templates of the standard template set among all candidate demand templates cannot be moved any further forward;
a calculation and ranking unit, configured to calculate a score for each candidate demand template using the features extracted by the feature extraction unit and the feature weights obtained by the training unit, and to rank the candidate demand templates by this score.
27. The device according to claim 26, characterized in that the standard template set selection unit comprises:
a template set determination unit, configured, for each extracted feature, to rank the candidate demand templates by their value of that feature and to take the top N3 candidate demand templates as the template set of that feature, where N3 is a positive integer;
an intersection unit, configured to take the intersection of the template sets of all features as the standard template set.
28. The device according to claim 15, characterized in that the selection unit comprises:
a first selection unit, configured to select the top N4 ranked candidate demand templates as final demand templates, where N4 is a positive integer;
a second selection unit, configured to obtain a keyword set from the boundary words of the top M2 ranked candidate demand templates, and to select as final demand templates those candidate demand templates ranked after the top N4 whose boundary words all belong to the keyword set; where a boundary word is a word in a candidate demand template that has not been generalized, a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word meets a requirement, M2 is a positive integer, and M2 is less than or equal to N4.
CN201110308830.7A 2011-10-12 Method and device for generating a domain demand template Active CN102368260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110308830.7A CN102368260B (en) 2011-10-12 Method and device for generating a domain demand template

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110308830.7A CN102368260B (en) 2011-10-12 Method and device for generating a domain demand template

Publications (2)

Publication Number Publication Date
CN102368260A true CN102368260A (en) 2012-03-07
CN102368260B CN102368260B (en) 2016-12-14

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136221A (en) * 2011-11-24 2013-06-05 北京百度网讯科技有限公司 Method capable of generating requirement template and requirement identification method and device
CN103823809A (en) * 2012-11-16 2014-05-28 百度在线网络技术(北京)有限公司 Query phrase classification method and device, and classification optimization method and device
CN105183721A (en) * 2015-08-13 2015-12-23 小米科技有限责任公司 Template construction method, and information extraction method and device
CN106971728A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of quick identification vocal print method and system
CN107480139A (en) * 2017-08-16 2017-12-15 深圳市空谷幽兰人工智能科技有限公司 The bulk composition extracting method and device of medical field
CN108228637A (en) * 2016-12-21 2018-06-29 中国电信股份有限公司 Natural language client auto-answer method and system
WO2020019565A1 (en) * 2018-07-27 2020-01-30 天津字节跳动科技有限公司 Search sorting method and apparatus, and electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6516312B1 (en) * 2000-04-04 2003-02-04 International Business Machine Corporation System and method for dynamically associating keywords with domain-specific search engine queries
CN1514387A (en) * 2002-12-31 2004-07-21 中国科学院计算技术研究所 Sound distinguishing method in speech sound inquiry
CN101216853A (en) * 2008-01-11 2008-07-09 孟小峰 Intelligent web enquiry interface system and its method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
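Liu Liangliang et al.: "Research and Implementation of a Domain-Specific Chinese Question Answering System Based on Query Templates", Journal of Jiangsu University of Science and Technology (Natural Science Edition) *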

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136221A (en) * 2011-11-24 2013-06-05 北京百度网讯科技有限公司 Method capable of generating requirement template and requirement identification method and device
CN103823809A (en) * 2012-11-16 2014-05-28 百度在线网络技术(北京)有限公司 Query phrase classification method and device, and classification optimization method and device
CN103823809B (en) * 2012-11-16 2018-06-08 百度在线网络技术(北京)有限公司 A kind of method, the method for Classified optimization and its device to query phrase classification
CN105183721A (en) * 2015-08-13 2015-12-23 小米科技有限责任公司 Template construction method, and information extraction method and device
CN106971728A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of quick identification vocal print method and system
CN108228637A (en) * 2016-12-21 2018-06-29 中国电信股份有限公司 Natural language client auto-answer method and system
CN108228637B (en) * 2016-12-21 2020-09-04 中国电信股份有限公司 Automatic response method and system for natural language client
CN107480139A (en) * 2017-08-16 2017-12-15 深圳市空谷幽兰人工智能科技有限公司 The bulk composition extracting method and device of medical field
WO2020019565A1 (en) * 2018-07-27 2020-01-30 天津字节跳动科技有限公司 Search sorting method and apparatus, and electronic device and storage medium
US11481402B2 (en) 2018-07-27 2022-10-25 Tianjin Bytedance Technology Co., Ltd. Search ranking method and apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant