CN102999520B - Method and apparatus for search need identification - Google Patents

Method and apparatus for search need identification

Info

Publication number
CN102999520B
Authority
CN
China
Prior art keywords
query
search results
demand type
gram
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110273327.2A
Other languages
Chinese (zh)
Other versions
CN102999520A (en)
Inventor
黄际洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110273327.2A
Publication of CN102999520A
Application granted
Publication of CN102999520B

Abstract

The invention provides a method and apparatus for search need identification. The method comprises: S1, obtaining a query to be identified; S2, obtaining search results for the query to be identified, determining the n-grams of the search result text, and determining a weight for each n-gram based on its occurrences in the search result text, thereby obtaining the core word vector of the query to be identified; S3, computing the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determining the demand type of the query to be identified from the similarity results. The invention can improve the accuracy of search need identification.

Description

Method and apparatus for search need identification
[technical field]
The present invention relates to the field of computer technology, and in particular to a method and apparatus for search need identification.
[background technology]
With the rapid worldwide development and maturation of the Internet, information resources on the network keep growing richer and the volume of data is expanding at high speed; obtaining information through search engines has become the main way people obtain information today. Providing users with more convenient and more accurate query services is the present and future direction of search engine technology.
In search engine technology, identifying the user's search need is an important part of improving the accuracy and effectiveness of search, and its effect is especially pronounced in structured search. Existing search need identification methods typically compute the similarity between the core word vector of the query and that of each demand type, and determine the demand type of the query from the similarity results. For example, the demand types with the top N similarities are identified as the demand types of the query, or the demand level of the query in each demand type is determined from the similarity value. However, because the query itself is short and carries little usable information, computing the similarity to each demand type directly from the query's own core word vector may produce a large deviation in semantic similarity, which lowers the accuracy of search need identification.
[summary of the invention]
The invention provides a method and apparatus for search need identification, so as to improve the accuracy of search need identification.
The specific technical solution is as follows:
A method for search need identification, the method comprising:
S1, obtaining a query to be identified;
S2, obtaining search results for the query to be identified, determining the n-grams of the search result text, and determining a weight for each n-gram based on its occurrences in the search result text, thereby obtaining the core word vector of the query to be identified;
S3, computing the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determining the demand type of the query to be identified from the similarity results.
According to a preferred embodiment of the present invention, obtaining the search results of the query to be identified in step S2 is: obtaining the top N1 search results of the query to be identified, where N1 is a preset positive integer.
According to a preferred embodiment of the present invention, determining the weight of each n-gram based on its occurrences in the search result text in step S2 specifically comprises:
assigning a weight to the n-gram according to its term frequency (TF) in the search result text and the corresponding value of n; or,
assigning a weight to the n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which it co-occurs with the query to be identified, the number of sentences in the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram.
According to a preferred embodiment of the present invention, the search result text comprises: the web page titles of the search results, or the sentences in the web pages of the search results that contain the query to be identified.
According to a preferred embodiment of the present invention, determining the core word vector of a demand type comprises:
S31, determining the seed query set of the demand type;
S32, searching with each seed query in the seed query set, extracting core words from the search result text, and determining a weight for each core word based on its occurrences in the search result text, thereby obtaining the core word vector of the demand type.
According to a preferred embodiment of the present invention, the seed query set of a demand type is determined by:
manual configuration; or,
manual annotation in a search log; or,
obtaining, from the search log of the vertical search for the demand type, the queries whose search count is higher than a preset first threshold to form the seed query set of the demand type; or,
obtaining, from the search log of web search for the demand type, the queries that led to a click on a website of the demand type or on a title containing a feature word of the demand type, and forming the seed query set of the demand type from those of the obtained queries whose search count is higher than a preset second threshold.
According to a preferred embodiment of the present invention, step S32 specifically comprises:
searching with each seed query in the seed query set of the demand type, determining the n-grams in the search result text, and determining a weight for each n-gram based on its occurrences in the search result text, thereby obtaining the core word vector of the demand type; or,
searching with each seed query in the seed query set of the demand type, performing word segmentation and stop-word removal on the search result text, counting the TF of each remaining word, selecting the words whose TF is higher than a preset term frequency threshold, and determining a weight for each selected word based on its term frequency, thereby obtaining the core word vector of the demand type; or,
searching with each seed query in the seed query set of the demand type, performing word segmentation and stop-word removal on the search result text, counting the TF and IDF of each remaining word, selecting the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determining a weight for each selected word based on its TF-IDF value, thereby obtaining the core word vector of the demand type; or,
searching with each seed query in the seed query set of the demand type, performing word segmentation and stop-word removal on the search result text, assigning a weight to each remaining word according to the number of sentences in the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the word, and selecting the words whose weight is higher than a preset weight threshold, thereby obtaining the core word vector of the demand type.
According to a preferred embodiment of the present invention, determining the weight of each n-gram based on its occurrences in the search result text comprises:
assigning a weight to each n-gram according to its TF in the search result text and the corresponding value of n; or,
assigning a weight to the n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the n-gram.
According to a preferred embodiment of the present invention, determining the demand type of the query to be identified from the similarity results in step S3 comprises:
determining the demand types with the top N2 similarity values, or the demand types whose similarity value exceeds a preset similarity threshold, as the demand type of the query to be identified, where N2 is a preset positive integer; or,
according to a preset correspondence between similarity values and similarity levels, determining the similarity level corresponding to the similarity value computed in step S3 as the demand level of the query to be identified in the corresponding demand type.
An apparatus for search need identification, the apparatus comprising:
an identification object acquisition unit, for obtaining a query to be identified;
a primary vector determining unit, for obtaining search results for the query to be identified, determining the n-grams of the search result text, and determining a weight for each n-gram based on its occurrences in the search result text, thereby obtaining the core word vector of the query to be identified;
a demand type determining unit, for computing the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determining the demand type of the query to be identified from the similarity results.
According to a preferred embodiment of the present invention, when obtaining the search results of the query to be identified, the primary vector determining unit specifically obtains the top N1 search results of the query to be identified, where N1 is a preset positive integer.
According to a preferred embodiment of the present invention, when determining the weight of each n-gram, the primary vector determining unit assigns a weight to the n-gram according to its term frequency (TF) in the search result text and the corresponding value of n; or,
assigns a weight to the n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which it co-occurs with the query to be identified, the number of sentences in the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram.
According to a preferred embodiment of the present invention, the search result text comprises: the web page titles of the search results, or the sentences in the web pages of the search results that contain the query to be identified.
According to a preferred embodiment of the present invention, the apparatus further comprises a secondary vector determining unit;
the secondary vector determining unit specifically comprises:
a seed query determining subelement, for determining the seed query set of a demand type;
a core word vector forming subelement, for obtaining the search results of each seed query in the seed query set, extracting core words from the search result text, and determining a weight for each core word based on its occurrences in the search result text, thereby obtaining the core word vector of the demand type.
According to a preferred embodiment of the present invention, the seed query determining subelement obtains a manually configured seed query set of the demand type; or,
obtains a seed query set of the demand type annotated manually in a search log; or,
obtains, from the search log of the vertical search for the demand type, the queries whose search count is higher than a preset first threshold to form the seed query set of the demand type; or,
obtains, from the search log of web search for the demand type, the queries that led to a click on a website of the demand type or on a title containing a feature word of the demand type, and forms the seed query set of the demand type from those of the obtained queries whose search count is higher than a preset second threshold.
According to a preferred embodiment of the present invention, the core word vector forming subelement obtains the search results of each seed query in the seed query set of the demand type, determines the n-grams in the search result text, and determines a weight for each n-gram based on its occurrences in the search result text, thereby obtaining the core word vector of the demand type; or,
obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation and stop-word removal on the search result text, counts the TF of each remaining word, selects the words whose TF is higher than a preset term frequency threshold, and determines a weight for each selected word based on its term frequency, thereby obtaining the core word vector of the demand type; or,
obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation and stop-word removal on the search result text, counts the TF and IDF of each remaining word, selects the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determines a weight for each selected word based on its TF-IDF value, thereby obtaining the core word vector of the demand type; or,
obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation and stop-word removal on the search result text, assigns a weight to each remaining word according to the number of sentences in the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the word, and selects the words whose weight is higher than a preset weight threshold, thereby obtaining the core word vector of the demand type.
According to a preferred embodiment of the present invention, when determining the weight of each n-gram based on its occurrences in the search result text, the core word vector forming subelement specifically assigns a weight to each n-gram according to its TF in the search result text and the corresponding value of n; or,
assigns a weight to the n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the n-gram.
According to a preferred embodiment of the present invention, the demand type determining unit determines the demand types with the top N2 similarity values, or the demand types whose similarity value exceeds a preset similarity threshold, as the demand type of the query to be identified, where N2 is a preset positive integer; or,
according to a preset correspondence between similarity values and similarity levels, determines the similarity level corresponding to the computed similarity value as the demand level of the query to be identified in the corresponding demand type.
As can be seen from the above technical solution, the present invention takes the n-grams of the search result text of the query to be identified, determines a weight for each n-gram based on its occurrences in the search result text to obtain the core word vector of the query to be identified, and then uses this core word vector to compute the similarity with the core word vector of each demand type, thereby identifying the demand type of the query to be identified. The invention thus draws on information richer than the query itself, namely the n-grams of its search result text, which express the semantics of the query more fully and thereby improve the accuracy of search need identification.
[accompanying drawing explanation]
Fig. 1 is a flowchart of the method provided by Embodiment One of the present invention;
Fig. 2 is a schematic diagram, provided by Embodiment One of the present invention, of a web page containing sentences that include the query to be identified;
Fig. 3 is a structural diagram of the apparatus provided by Embodiment Two of the present invention;
Fig. 4 is an example of applying the search need identification provided by the embodiments of the present invention to ranking in general search;
Fig. 5 is an example of applying the search need identification provided by the embodiments of the present invention to vertical search.
[embodiment]
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described below with reference to the drawings and specific embodiments.
Embodiment One
Fig. 1 is a flowchart of the method provided by Embodiment One of the present invention. As shown in Fig. 1, the method may comprise the following steps:
Step 101: obtain the query to be identified.
Step 102: obtain the search results of the query to be identified, determine the n-grams in the search result text, and determine a weight for each n-gram based on its occurrences in the search result text, thereby obtaining the core word vector of the query to be identified.
Because search results for a query are usually highly relevant to that query, this step extracts the core word vector from the search results obtained by searching with the query to be identified.
In addition, when a search engine searches for the query to be identified, the search results are ranked by relevance to the query. Therefore, to improve efficiency and reduce computation, the top N1 search results can be selected and the n-grams determined from the text of those N1 results, where N1 is a preset positive integer.
Because the pages of the search results may contain a large amount of information, much of which may have little semantic relevance to the query to be identified, the search result text used when determining n-grams can be: the web page titles, or the sentences in the web pages that contain the query to be identified.
Take the sentences in a web page that contain the query to be identified as an example. Suppose the query to be identified is "home cooking", and that one of the search results returned for this query is as shown in Fig. 2. The sentences in the web page that contain the query to be identified are:
Home cooking _ menu complete works is done in way _ home cooking menu _ of home cooking _ home cooking
Home cooking is essential in our lives
The way of home cooking is various, and as northeast home cooking, Guo Lin home cooking etc., it is how the simplest that cook home cooking menu
Cuisines are outstanding for you provide abundant simple home cooking menu complete works of
The n-grams are then determined from the above four sentences.
An n-gram is a combination of n minimum-granularity words occurring in order, where n is one or more preset positive integers. Taking the sentence "Home cooking is essential in our lives" as an example, with each segmented word of the original sentence glossed in English and separated by "|", if n is 1, 2, 3, or 4, the n-grams obtained are:
1-gram: home cooking | is | we | life | in | essential
2-gram: home cooking is | is we | we life | life in | in essential
3-gram: home cooking is we | is we life | we life in | life in essential
4-gram: home cooking is we life | is we life in | we life in essential
Here " " is filtered out as a stop word when determining the n-grams.
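The n-gram enumeration above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes the sentence has already been split into words by some word segmenter, and it assumes stop words are dropped before the n-grams are formed. The function name `extract_ngrams` and the `<stop>` placeholder token are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Sequence, Tuple


def extract_ngrams(
    tokens: Sequence[str],
    n_values: Iterable[int] = (1, 2, 3, 4),
    stop_words: FrozenSet[str] = frozenset(),
) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive kept tokens, for each n in n_values."""
    # Assumption: stop words are removed first, so n-grams span the gap.
    kept = [t for t in tokens if t not in stop_words]
    grams: List[Tuple[str, ...]] = []
    for n in n_values:
        # range() is empty when n exceeds the number of kept tokens.
        for start in range(len(kept) - n + 1):
            grams.append(tuple(kept[start:start + n]))
    return grams


# English glosses of the six segmented words, plus a placeholder stop word.
tokens = ["home cooking", "is", "we", "life", "in", "essential", "<stop>"]
grams = extract_ngrams(tokens, stop_words=frozenset({"<stop>"}))
```

With six kept words this yields 6 + 5 + 4 + 3 = 18 n-grams, matching the four lists above.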
The weight of each n-gram can be determined in ways including, but not limited to, the following two:
Mode one: assign a weight to each n-gram according to its term frequency (TF) in the search result text and the corresponding value of n. In general, the higher the TF of an n-gram in the search result text, the more important the n-gram; and the larger n is, the more information the n-gram carries and the higher its weight should be. This mode can therefore use TF*n as the weight of the n-gram.
Mode two: assign a weight to the n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which it co-occurs with the query to be identified, the number of sentences in the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram. This mode is based on information theory, and the formula can be as shown in formula (1).
$$\mathrm{Centrality}(w) = \frac{\log(\mathrm{Co}(w,q)+1)}{\log(\mathrm{sf}(w)+1)+\log(\mathrm{sf}(q)+1)} \times \log(\mathrm{idf}(w)+1) \tag{1}$$
Here w is the n-gram, q is the query to be identified, Centrality(w) is the weight of the n-gram, Co(w, q) is the number of sentences in which the n-gram and the query to be identified co-occur, sf(w) is the number of sentences in the search result text in which the n-gram occurs, sf(q) is the number of sentences in the search result text in which the query to be identified occurs, and idf(w) is the inverse document frequency of the n-gram.
It should be noted that formula (1) is only an example provided by the embodiment of the present invention; simple modifications and equivalent substitutions of this formula are not enumerated one by one, and all fall within the scope of the present invention.
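The two weighting modes can be sketched in Python as below. This is a sketch under stated assumptions rather than the patent's own code: the logarithm base is not given in the text, so the natural logarithm is assumed; sentence membership is tested by plain substring matching; the IDF value is supplied by the caller; and the function names are hypothetical. The centrality function reads formula (1) as a fraction multiplied by the IDF term.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def tf_times_n_weights(ngrams: Iterable[Tuple[str, ...]]) -> Dict[Tuple[str, ...], int]:
    """Mode one: weight = TF * n, where n is the length of the n-gram."""
    tf = Counter(ngrams)
    return {gram: count * len(gram) for gram, count in tf.items()}


def sentence_counts(sentences: Sequence[str], w: str, q: str) -> Tuple[int, int, int]:
    """Return (Co(w, q), sf(w), sf(q)) using substring matching (an assumption)."""
    sf_w = sum(1 for s in sentences if w in s)
    sf_q = sum(1 for s in sentences if q in s)
    co = sum(1 for s in sentences if w in s and q in s)
    return co, sf_w, sf_q


def centrality(co: int, sf_w: int, sf_q: int, idf_w: float) -> float:
    """Mode two: formula (1), with natural logarithms assumed."""
    denominator = math.log(sf_w + 1) + math.log(sf_q + 1)
    if denominator == 0:
        # Neither w nor q occurs in any sentence; treat the weight as zero.
        return 0.0
    return math.log(co + 1) / denominator * math.log(idf_w + 1)
```

Note that the ratio of logarithms does not depend on the base, but the final IDF factor does, so a different base only rescales every weight by the same constant.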
Step 103: compute the similarity between the core word vector of the query to be identified and the core word vector of each demand type, and determine the demand type of the query to be identified from the similarity results.
In the present invention the core word vector of each demand type is determined in advance. It can be determined as follows: determine the seed query set of the demand type; search with each seed query in the seed query set; extract core words from the search result text and determine a weight for each core word based on its occurrences in the search result text, thereby obtaining the core word vector of the demand type.
The seed queries that make up the seed query set of a demand type embody the need of the corresponding preset type. These seed query sets can be configured manually or annotated manually in search logs. Preferably, seed queries can also be mined from search logs. For example, the queries whose search count is higher than a preset first threshold are taken from the search log of the vertical search for the demand type as its seed queries; or the queries that led to a click on a website of the demand type, or on a title containing a feature word of the demand type, are taken from the search log of web search for the demand type, and those of the obtained queries whose search count is higher than a preset second threshold are used as the seed queries of the demand type; and so on.
For example, the seed queries in the seed query set of the game class can include: "downloads of standalone version mobile phone trivial games", "precious prompt fast lp608 mobile phone games download", "World of Warcraft download", "World of Warcraft", etc.
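The two log-mining options described above can be sketched as follows. The log layout is an assumption made for illustration only: the vertical search log is taken to be a flat list of query strings, and each web search log entry is taken to be a hypothetical `(query, clicked_host, clicked_title)` tuple. The function names are likewise hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def mine_seed_queries(log_queries: Iterable[str], min_count: int) -> Set[str]:
    """Keep the queries whose search count is strictly higher than min_count."""
    counts = Counter(log_queries)
    return {query for query, count in counts.items() if count > min_count}


def mine_from_web_log(
    entries: Iterable[Tuple[str, str, str]],
    type_sites: Set[str],
    feature_words: Set[str],
    min_count: int,
) -> Set[str]:
    """Keep queries whose click hit a site of the demand type or a title
    containing one of its feature words, then apply the count threshold."""
    matched = [
        query
        for query, host, title in entries
        if host in type_sites or any(word in title for word in feature_words)
    ]
    return mine_seed_queries(matched, min_count)
```

For instance, `mine_seed_queries(["a", "a", "a", "b"], 2)` keeps only `"a"`, since `"b"` was searched once.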
After searching with each seed query in a seed query set, core words can be extracted in the following ways:
First way: determine the n-grams in the search result text and determine a weight for each n-gram based on its occurrences in the search result text, thereby obtaining the core word vector of the demand type.
Because a search engine usually ranks the search results for a seed query by relevance to the seed query, the top N3 search results can be selected to improve efficiency and reduce computation, and the n-grams determined from the text of those N3 results, where N3 is a preset positive integer.
Because the pages of the search results may contain a large amount of information, much of which may have little semantic relevance to the seed query, the search result text used when determining n-grams can be: the web page titles, or the sentences in the web pages that contain the seed query. The same applies to the other ways below and is not repeated.
The weight of each n-gram can be determined in ways including, but not limited to, the following two:
Mode 1: assign a weight to each n-gram according to its TF in the search result text and the corresponding value of n. In general, the higher the TF of an n-gram in the search result text, the more important the n-gram; and the larger n is, the more information the n-gram carries and the higher its weight should be. This mode can therefore use TF*n as the weight of the n-gram.
Mode 2: assign a weight to the n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which it co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the n-gram. This mode is based on information theory, and the formula can be as shown in formula (2).
$$\mathrm{Centrality}(w) = \frac{\log(\mathrm{Co}(w,q)+1)}{\log(\mathrm{sf}(w)+1)+\log(\mathrm{sf}(q)+1)} \times \log(\mathrm{idf}(w)+1) \tag{2}$$
Here w is the n-gram, q is the corresponding seed query, Centrality(w) is the weight of the n-gram, Co(w, q) is the number of sentences in which the n-gram and this seed query co-occur, sf(w) is the number of sentences in the search result text in which the n-gram occurs, sf(q) is the number of sentences in the search result text in which this seed query occurs, and idf(w) is the inverse document frequency of the n-gram.
It should be noted that formula (2) is only an example provided by the embodiment of the present invention; simple modifications and equivalent substitutions of this formula are not enumerated one by one, and all fall within the scope of the present invention.
Second way: after performing word segmentation and stop-word removal on the search result text, count the term frequency of each remaining word, select the words whose term frequency is higher than a preset term frequency threshold, and determine a weight for each selected word based on its term frequency, thereby obtaining the core word vector of the demand type.
A word with a higher term frequency receives a larger weight.
Third way: after performing word segmentation and stop-word removal on the search result text, count the TF and IDF of each remaining word, select the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determine a weight for each selected word based on its TF-IDF value, thereby obtaining the core word vector of the demand type.
A word with a larger TF-IDF value receives a larger weight.
Fourth way: after performing word segmentation and stop-word removal on the search result text, assign a weight to each remaining word according to the number of sentences in the search result text in which the word occurs, the number of sentences in which it co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the word, and select the words whose weight is higher than a preset weight threshold, thereby obtaining the core word vector of the demand type.
The weight is computed as shown in formula (3).
$$\mathrm{Centrality}(w) = \frac{\log(\mathrm{Co}(w,q)+1)}{\log(\mathrm{sf}(w)+1)+\log(\mathrm{sf}(q)+1)} \times \log(\mathrm{idf}(w)+1) \tag{3}$$
Here w is a word remaining after stop-word removal, q is the corresponding seed query, Centrality(w) is the weight of word w, Co(w, q) is the number of sentences in which word w and this seed query co-occur, sf(w) is the number of sentences in the search result text in which word w occurs, sf(q) is the number of sentences in the search result text in which this seed query occurs, and idf(w) is the inverse document frequency of word w.
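The second and third ways of extracting core words (thresholding on TF or on TF-IDF) can be sketched as below. This is an illustrative sketch only: the text is assumed to be already segmented into a token list, the raw TF (or the TF-IDF product) is used directly as the weight, and the IDF table is a hypothetical input supplied by the caller; words missing from it are treated as having IDF 0.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Set


def core_words_by_tf(
    tokens: Iterable[str], stop_words: Set[str], min_tf: int
) -> Dict[str, int]:
    """Second way: keep words whose TF is higher than min_tf; weight = TF."""
    tf = Counter(t for t in tokens if t not in stop_words)
    return {word: count for word, count in tf.items() if count > min_tf}


def core_words_by_tfidf(
    tokens: Iterable[str],
    stop_words: Set[str],
    idf: Mapping[str, float],
    min_score: float,
) -> Dict[str, float]:
    """Third way: keep words whose TF-IDF is higher than min_score;
    weight = TF * IDF."""
    tf = Counter(t for t in tokens if t not in stop_words)
    scores = {word: count * idf.get(word, 0.0) for word, count in tf.items()}
    return {word: score for word, score in scores.items() if score > min_score}
```

The resulting dictionary, mapping each core word to its weight, can serve directly as a sparse core word vector for the demand type.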
The similarity between the core word vector of the query to be identified and the core word vector of a demand type can be computed as cosine similarity. Table 1 shows the similarity between several queries to be identified and each demand type.
Table 1
Query to be identified | Similarity with game class | Similarity with software class | Similarity with novel class
Network game repair sieve legend | 0.0026 | 0 | 0.4431
The novel of DNF | 0.0050 | 0.0001 | 0.3467
Story of a play or opera task in DNF | 0.3616 | 0.0128 | 0
Swordsman's love standalone version 3 attack strategy | 0.1631 | 0 | 0.0063
Swordsman's love reads the non-cigarette of step in full | 0 | 0 | 0.1205
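Cosine similarity between two core word vectors can be sketched as follows, assuming each vector is stored sparsely as a dictionary mapping a core word (or n-gram) to its weight. The representation and the function name are assumptions for illustration; the text only states that cosine similarity can be used.

```python
import math
from typing import Hashable, Mapping


def cosine_similarity(u: Mapping[Hashable, float], v: Mapping[Hashable, float]) -> float:
    """Cosine of the angle between two sparse weight vectors."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        # An empty or all-zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_u * norm_v)
```

Vectors that share no core words score 0, which is consistent with the zero entries in Table 1.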
After the similarities are determined, the demand types with the top N2 similarity values, or the demand types whose similarity value exceeds a preset similarity threshold, can be identified as the demand type of the query to be identified, where N2 is a preset positive integer. For the case shown in Table 1, supposing N2 is 1, "the novel of DNF" is identified as a novel class demand and "swordsman's love standalone version 3 attack strategy" as a game class demand.
Alternatively, the demand level of the query to be identified in each demand type can be identified from the similarity between its core word vector and the core word vector of each demand type, according to a preset correspondence between similarity values and similarity levels. For example, suppose it is preset that a similarity above 0.3 is a strong demand level, a similarity between 0.1 and 0.3 is a weak demand level, and a similarity below 0.1 is a no-demand level. Then in Table 1, "the novel of DNF" has a strong demand in the novel class and no demand in the game class or software class; "swordsman's love standalone version 3 attack strategy" has a weak demand in the game class and no demand in the software class or novel class.
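Both decision rules can be sketched in Python using the example thresholds from the text. The function names are hypothetical, and the handling of the exact boundary values 0.1 and 0.3 (treated here as weak) is an assumption, since the text does not say which side they fall on.

```python
from typing import Dict, List


def top_n_types(similarities: Dict[str, float], n2: int) -> List[str]:
    """Return the n2 demand types with the highest similarity values."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [demand_type for demand_type, _ in ranked[:n2]]


def types_above(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Return the demand types whose similarity exceeds the threshold."""
    return [t for t, s in similarities.items() if s > threshold]


def demand_level(similarity: float) -> str:
    """Map a similarity value to a demand level using the example thresholds."""
    if similarity > 0.3:
        return "strong"
    if similarity >= 0.1:  # assumption: both boundary values count as weak
        return "weak"
    return "none"


# Row "The novel of DNF" from Table 1.
novel_of_dnf = {"game": 0.0050, "software": 0.0001, "novel": 0.3467}
best = top_n_types(novel_of_dnf, 1)
levels = {t: demand_level(s) for t, s in novel_of_dnf.items()}
```

Applied to the Table 1 row above, the top-1 rule selects the novel class, and the level mapping gives a strong demand for novel and no demand for game or software, in line with the worked example.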
Be more than the detailed description that search need knowledge method for distinguishing provided by the present invention is carried out, be described in detail below by the device of embodiment two to search need identification provided by the invention.
Embodiment two,
Fig. 3 is a structural diagram of the device provided by embodiment two of the present invention. As shown in Fig. 3, the device can comprise: an identification object acquisition unit 300, a primary vector determining unit 310 and a demand type determining unit 320.
The identification object acquisition unit 300 obtains the query to be identified.
The primary vector determining unit 310 obtains the search results of the query to be identified, determines each n-gram of the search-result text, determines the weight of each n-gram based on its occurrence in the search-result text, and thereby obtains the core word vector of the query to be identified.
Because the search results for a query are usually highly relevant to that query, the primary vector determining unit 310 can submit the query to be identified to a search engine, obtain the search results the search engine returns, and use them to extract the core word vector of the query to be identified.
When a search engine searches for the query to be identified, it ranks the search results by their relevance to the query. Therefore, to improve efficiency and reduce computation, the primary vector determining unit 310 obtains only the top N1 search results of the query to be identified, where N1 is a preset positive integer.
When determining the weight of each n-gram, the primary vector determining unit 310 can use either of the following two ways:
The first way: assign a weight to each n-gram according to the TF of the n-gram in the search-result text and its n value. Usually, the higher the TF of an n-gram in the search-result text, the more important the n-gram is; and the larger n is, the more information the n-gram carries, so its weight should also be higher. Therefore, TF*n can be used as the weight of the n-gram in this way.
The second way: assign a weight to each n-gram according to the number of sentences in which the n-gram occurs in the search-result text, the number of sentences in which it co-occurs with the query to be identified, the number of sentences in which the query to be identified occurs in the search-result text, and the IDF of the n-gram. This way is based on information theory; the formula can be formula (1) in embodiment one and is not repeated here.
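The first way (TF*n) can be sketched as follows. The sketch assumes the search-result text has already been split into token lists by a word segmenter, takes TF as the raw occurrence count, and caps n at a hypothetical `max_n`; none of these details are fixed by the text.

```python
from collections import Counter


def ngram_weights(token_lists, max_n=3):
    """Weight every n-gram (1 <= n <= max_n) by TF * n.

    `token_lists` is a list of token sequences, one per title or sentence.
    N-grams are stored as tuples of tokens; TF is the raw count.
    """
    counts = Counter()
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counts[tuple(tokens[start:start + n])] += 1
    return {gram: tf * len(gram) for gram, tf in counts.items()}
```

The resulting dictionary can serve directly as the core word vector of the query to be identified.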
Because the pages of the search results may contain a large amount of information, much of which has little semantic relevance to the query to be identified, the search-result text mentioned above can comprise: the web page titles of the search results, or the sentences in the web pages of the search results that contain the query to be identified.
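A minimal sketch of restricting the search-result text in this way is given below. The result fields `"title"` and `"body"`, the sentence-splitting pattern and the function name are all hypothetical; the text does not specify how results are represented or how sentences are delimited.

```python
import re


def result_text(results, query, use_titles=True):
    """Collect the text used for core word extraction.

    Returns either the page titles, or the sentences of the page bodies
    that contain the query (substring match, as a simplifying assumption).
    """
    if use_titles:
        return [result["title"] for result in results]
    sentences = []
    for result in results:
        # Split on common Chinese and Western sentence terminators.
        for sentence in re.split(r"[。！？!?\n]+", result["body"]):
            sentence = sentence.strip()
            if sentence and query in sentence:
                sentences.append(sentence)
    return sentences
```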
The demand type determining unit 320 calculates the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determines the demand type of the query to be identified according to the calculated similarities.
Because the core word vector of each demand type needs to be determined in advance, the device can further comprise: a secondary vector determining unit 330.
The secondary vector determining unit 330 can specifically comprise: a seed query determining subunit 331 and a core word vector forming subunit 332.
The seed query determining subunit 331 determines the seed query set of a demand type. Specifically, the set can be obtained in any of the following ways:
The first way: obtain a seed query set of the demand type that has been configured manually.
The second way: obtain a seed query set of the demand type that has been labeled manually in the search log.
The third way: from the search log of the vertical search of the demand type, obtain the queries whose search count is higher than a preset first threshold to form the seed query set of the demand type.
The fourth way: from the web search log, obtain the queries whose corresponding clicks are on websites of the demand type or on titles containing feature words of the demand type, and form the seed query set of the demand type from those obtained queries whose search count is higher than a preset second threshold.
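The third way above can be sketched as a simple frequency filter over a vertical search log. Representing the log as a flat list of query strings and the names `seed_queries_from_log` and `min_count` are assumptions made for illustration.

```python
from collections import Counter


def seed_queries_from_log(logged_queries, min_count):
    """Return the queries searched more than `min_count` times.

    `logged_queries` is a list with one entry per search in the vertical
    search log of a demand type; `min_count` plays the role of the preset
    first threshold.
    """
    counts = Counter(logged_queries)
    return {query for query, count in counts.items() if count > min_count}
```

The fourth way would apply the same filter after first keeping only the log entries whose clicks point to sites or titles of the demand type.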
The core word vector forming subunit 332 obtains the search results of each seed query in the seed query set, extracts core words from the search-result text, determines the weight of each core word based on its occurrence in the search-result text, and thereby obtains the core word vector of the demand type. That is, the core word vector forming subunit 332 submits each seed query to a search engine for searching and obtains the search results the search engine returns.
Specifically, the core word vector forming subunit 332 can obtain the core word vector of the demand type in any of the following four ways:
Way one: obtain the search results of each seed query in the seed query set of the demand type, determine each n-gram in the search-result text, and determine the weight of each n-gram based on its occurrence in the search-result text, to obtain the core word vector of the demand type.
Here, when determining the weight of each n-gram based on its occurrence in the search-result text, a weight can be assigned to each n-gram according to its TF in the search-result text and its n value; or, a weight can be assigned to each n-gram according to the number of sentences in which the n-gram occurs in the search-result text, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in which the seed query occurs in the search-result text, and the IDF of the n-gram. Formula (2) in embodiment one can be used for this and is not repeated here.
Way two: obtain the search results of each seed query in the seed query set of the demand type, perform word segmentation and stop-word removal on the search-result text, count the TF of each word remaining after stop-word removal, select the words whose TF is higher than a preset word frequency threshold, and determine a weight for each selected word based on its word frequency, to obtain the core word vector of the demand type. The higher the word frequency of a word, the larger its weight.
Way three: obtain the search results of each seed query in the seed query set of the demand type, perform word segmentation and stop-word removal on the search-result text, count the TF and IDF of each word remaining after stop-word removal, select the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determine a weight for each selected word based on its TF-IDF value, to obtain the core word vector of the demand type. The larger the TF-IDF value of a word, the larger its weight.
Way four: obtain the search results of each seed query in the seed query set of the demand type, perform word segmentation and stop-word removal on the search-result text, assign a weight to each word remaining after stop-word removal according to the number of sentences in which the word occurs in the search-result text, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in which the seed query occurs in the search-result text, and the IDF of the word, and select the words whose weight is higher than a preset weight threshold, to obtain the core word vector of the demand type. Formula (3) in embodiment one can be used to assign the weight to each word and is not repeated here.
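Ways two and three can be sketched as follows, assuming the search-result text has already been segmented into token lists. Using the raw TF (or the TF-IDF value) directly as the weight is one simple choice that satisfies "higher frequency, larger weight"; it is an assumption, as are the function names and the externally supplied `idf` table.

```python
from collections import Counter


def core_word_vector_tf(token_lists, stop_words, min_tf):
    """Way two: keep words whose TF exceeds min_tf; weight = TF."""
    counts = Counter(
        token for tokens in token_lists for token in tokens if token not in stop_words
    )
    return {word: tf for word, tf in counts.items() if tf > min_tf}


def core_word_vector_tfidf(token_lists, stop_words, idf, min_score):
    """Way three: keep words whose TF-IDF exceeds min_score; weight = TF-IDF.

    `idf` maps a word to its inverse document frequency and is assumed to be
    computed elsewhere; unknown words get an IDF of 0.
    """
    counts = Counter(
        token for tokens in token_lists for token in tokens if token not in stop_words
    )
    scores = {word: tf * idf.get(word, 0.0) for word, tf in counts.items()}
    return {word: score for word, score in scores.items() if score > min_score}
```

Both functions return a sparse vector in the same dictionary form, so the vectors of different demand types can be compared with the query vector in the same way.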
After the similarities are determined, the demand type determining unit 320 can determine the demand types whose similarity values rank in the top N2, or the demand types whose similarity values exceed a preset similarity threshold, as the demand types of the query to be identified, where N2 is a preset positive integer; or, according to a preset correspondence between similarity values and similarity grades, it can determine the similarity grade corresponding to each calculated similarity value as the demand level of the query to be identified in the corresponding demand type.
After a demand type has been identified with the method or device provided by the embodiments of the present invention, it can be used in, but is not limited to, the following application scenarios:
1) Ranking in general web search. After the user enters a query, the demand type of the query can be identified by the above method and device, and the pages in the general search results that correspond to the demand type of the query are ranked higher.
For example, when the user enters the query "home cooking high definition", the query can be identified in general search as having a video class demand. The results page of the general search can then contain video information related to the TV series "home cooking"; this video information can be provided by the video vertical search and inserted into the general search results. In this way the video class pages can be placed at the front of the general search results, as shown in Fig. 4, which greatly improves user satisfaction and the search experience.
2) Vertical search. After the user enters a query, the demand type of the query can be identified by the above method and device, and the query is dispatched to the most suitable content resource or application provider for processing, so that matching results are returned to the user accurately and efficiently.
For example, when the user enters "from Baidu Building to Wudaokou", the query can be identified as having a map class demand. The query is passed to the map vertical search, which calculates the bus routes and then directly displays the bus travel map from Baidu Building to Wudaokou together with the relevant bus information, as shown in Fig. 5.
3) Information recommendation. After the user enters a query, the demand type of the query can be identified by the above method and device, and information is recommended to the user based on this demand type, for example advertisement recommendation or query recommendation on a knowledge Q&A platform.
For example, if the user enters the query "cheap MP3 player" and its demand type is identified as the shopping class, advertisements related to MP3 players can be recommended alongside the search results, so that the advertisements closely match the user's actual demand.
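Scenarios 1) and 2) above can be sketched as a stable re-ranking step and a dispatch table. The result field `"type"`, the handler mapping and both function names are hypothetical; how a real search system tags results or exposes its vertical searches is not described in the text.

```python
def promote_demand_type(results, demand_type):
    """Scenario 1: move results of the identified demand type to the front.

    Python's sorted() is stable, so the original relative order is kept
    inside both the promoted group and the remaining group.
    """
    return sorted(results, key=lambda result: result["type"] != demand_type)


def route_query(query, demand_type, vertical_handlers, default_handler):
    """Scenario 2: dispatch the query to the vertical search of its demand type."""
    handler = vertical_handlers.get(demand_type, default_handler)
    return handler(query)
```

Falling back to a default handler keeps queries with an unrecognized demand type on the general search path.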
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (16)

1. A method of search need identification, characterized in that the method comprises:
S1, obtaining a query to be identified;
S2, obtaining the search results of said query to be identified, determining each n-gram of the search-result text, and determining the weight of each n-gram based on the occurrence of each n-gram in the search-result text, to obtain the core word vector of said query to be identified;
wherein said determining the weight of each n-gram based on the occurrence of each n-gram in the search-result text specifically comprises:
assigning a weight to the n-gram according to the term frequency (TF) of the n-gram in the search-result text and the corresponding n value; or,
assigning a weight to the n-gram according to the number of sentences in which the n-gram occurs in the search-result text, the number of sentences in which it co-occurs with the query to be identified, the number of sentences in which the query to be identified occurs in the search-result text, and the inverse document frequency (IDF) of the n-gram;
S3, calculating the similarity between the core word vector of said query to be identified and the predetermined core word vector of each demand type, and determining the demand type of said query to be identified according to the calculated similarities.
2. The method according to claim 1, characterized in that obtaining the search results of said query to be identified in step S2 is: obtaining the top N1 search results among the search results of said query to be identified, said N1 being a preset positive integer.
3. The method according to claim 1 or 2, characterized in that said search-result text comprises: the web page titles of the search results, or the sentences in the web pages of the search results that contain said query to be identified.
4. The method according to claim 1, characterized in that determining the core word vector of a demand type comprises:
S31, determining the seed query set of the demand type;
S32, searching with each seed query in the seed query set, extracting core words from the search-result text, and determining the weight of each core word based on the occurrence of the core word in the search-result text, to obtain the core word vector of the demand type.
5. The method according to claim 4, characterized in that the seed query set of the demand type is determined by:
manual configuration; or,
manual labeling in the search log; or,
obtaining, from the search log of the vertical search of the demand type, the queries whose search count is higher than a preset first threshold to form the seed query set of the demand type; or,
obtaining, from the web search log, the queries whose corresponding clicks are on websites of the demand type or on titles containing feature words of the demand type, and forming the seed query set of the demand type from those obtained queries whose search count is higher than a preset second threshold.
6. The method according to claim 4, characterized in that said step S32 specifically comprises:
searching with each seed query in the seed query set of the demand type, determining each n-gram in the search-result text, and determining the weight of each n-gram based on the occurrence of each n-gram in the search-result text, to obtain the core word vector of the demand type; or,
searching with each seed query in the seed query set of the demand type, performing word segmentation and stop-word removal on the search-result text, counting the TF of each word remaining after stop-word removal, selecting the words whose TF is higher than a preset word frequency threshold, and determining a weight for each selected word based on its word frequency, to obtain the core word vector of the demand type; or,
searching with each seed query in the seed query set of the demand type, performing word segmentation and stop-word removal on the search-result text, counting the TF and IDF of each word remaining after stop-word removal, selecting the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determining a weight for each selected word based on its TF-IDF value, to obtain the core word vector of the demand type; or,
searching with each seed query in the seed query set of the demand type, performing word segmentation and stop-word removal on the search-result text, assigning a weight to each word remaining after stop-word removal according to the number of sentences in which the word occurs in the search-result text, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in which the seed query occurs in the search-result text, and the IDF of the word, and selecting the words whose weight is higher than a preset weight threshold, to obtain the core word vector of the demand type.
7. The method according to claim 6, characterized in that, in step S32, said determining the weight of each n-gram based on the occurrence of each n-gram in the search-result text comprises:
assigning a weight to each n-gram according to the TF of the n-gram in the search-result text and the corresponding n value; or,
assigning a weight to each n-gram according to the number of sentences in which the n-gram occurs in the search-result text, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in which the seed query occurs in the search-result text, and the IDF of the n-gram.
8. The method according to claim 1, characterized in that said determining the demand type of said query to be identified according to the calculated similarities in step S3 comprises:
determining the demand types whose similarity values rank in the top N2, or the demand types whose similarity values exceed a preset similarity threshold, as the demand types of said query to be identified, said N2 being a preset positive integer; or,
according to a preset correspondence between similarity values and similarity grades, determining the similarity grade corresponding to each similarity value calculated in said step S3 as the demand level of said query to be identified in the corresponding demand type.
9. A device for search need identification, characterized in that the device comprises:
an identification object acquisition unit, configured to obtain a query to be identified;
a primary vector determining unit, configured to obtain the search results of said query to be identified, determine each n-gram of the search-result text, and determine the weight of each n-gram based on the occurrence of each n-gram in the search-result text, to obtain the core word vector of said query to be identified;
wherein said primary vector determining unit, when determining the weight of each n-gram, assigns a weight to the n-gram according to the term frequency (TF) of the n-gram in the search-result text and the corresponding n value; or,
assigns a weight to the n-gram according to the number of sentences in which the n-gram occurs in the search-result text, the number of sentences in which it co-occurs with the query to be identified, the number of sentences in which the query to be identified occurs in the search-result text, and the inverse document frequency (IDF) of the n-gram;
a demand type determining unit, configured to calculate the similarity between the core word vector of said query to be identified and the predetermined core word vector of each demand type, and to determine the demand type of said query to be identified according to the calculated similarities.
10. The device according to claim 9, characterized in that said primary vector determining unit, when obtaining the search results of said query to be identified, specifically obtains the top N1 search results among the search results of said query to be identified, said N1 being a preset positive integer.
11. The device according to claim 9 or 10, characterized in that said search-result text comprises: the web page titles of the search results, or the sentences in the web pages of the search results that contain said query to be identified.
12. The device according to claim 9, characterized in that the device further comprises: a secondary vector determining unit;
said secondary vector determining unit specifically comprises:
a seed query determining subunit, configured to determine the seed query set of a demand type;
a core word vector forming subunit, configured to obtain the search results of each seed query in the seed query set, extract core words from the search-result text, and determine the weight of each core word based on the occurrence of the core word in the search-result text, to obtain the core word vector of the demand type.
13. The device according to claim 12, characterized in that said seed query determining subunit obtains a seed query set of the demand type that has been configured manually; or,
obtains a seed query set of the demand type that has been labeled manually in the search log; or,
obtains, from the search log of the vertical search of the demand type, the queries whose search count is higher than a preset first threshold to form the seed query set of the demand type; or,
obtains, from the web search log, the queries whose corresponding clicks are on websites of the demand type or on titles containing feature words of the demand type, and forms the seed query set of the demand type from those obtained queries whose search count is higher than a preset second threshold.
14. The device according to claim 12, characterized in that said core word vector forming subunit obtains the search results of each seed query in the seed query set of the demand type, determines each n-gram in the search-result text, and determines the weight of each n-gram based on the occurrence of each n-gram in the search-result text, to obtain the core word vector of the demand type; or,
obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation and stop-word removal on the search-result text, counts the TF of each word remaining after stop-word removal, selects the words whose TF is higher than a preset word frequency threshold, and determines a weight for each selected word based on its word frequency, to obtain the core word vector of the demand type; or,
obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation and stop-word removal on the search-result text, counts the TF and IDF of each word remaining after stop-word removal, selects the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determines a weight for each selected word based on its TF-IDF value, to obtain the core word vector of the demand type; or,
obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation and stop-word removal on the search-result text, assigns a weight to each word remaining after stop-word removal according to the number of sentences in which the word occurs in the search-result text, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in which the seed query occurs in the search-result text, and the IDF of the word, and selects the words whose weight is higher than a preset weight threshold, to obtain the core word vector of the demand type.
15. The device according to claim 14, characterized in that said core word vector forming subunit, when determining the weight of each n-gram based on the occurrence of each n-gram in the search-result text, specifically assigns a weight to each n-gram according to the TF of the n-gram in the search-result text and the corresponding n value; or,
assigns a weight to each n-gram according to the number of sentences in which the n-gram occurs in the search-result text, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in which the seed query occurs in the search-result text, and the IDF of the n-gram.
16. The device according to claim 9, characterized in that said demand type determining unit determines the demand types whose similarity values rank in the top N2, or the demand types whose similarity values exceed a preset similarity threshold, as the demand types of said query to be identified, said N2 being a preset positive integer; or,
according to a preset correspondence between similarity values and similarity grades, determines the similarity grade corresponding to each calculated similarity value as the demand level of said query to be identified in the corresponding demand type.
CN201110273327.2A 2011-09-15 2011-09-15 A kind of method and apparatus of search need identification Active CN102999520B (en)

Publications (2)

Publication Number Publication Date
CN102999520A CN102999520A (en) 2013-03-27
CN102999520B true CN102999520B (en) 2016-04-27

Family

ID=47928094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110273327.2A Active CN102999520B (en) 2011-09-15 2011-09-15 A kind of method and apparatus of search need identification

Country Status (1)

Country Link
CN (1) CN102999520B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794251B (en) * 2015-05-19 2018-04-27 苏州工讯科技有限公司 Industrial products vertical search engine aligning method based on search result utility analysis
CN106951422B (en) * 2016-01-07 2021-05-28 腾讯科技(深圳)有限公司 Webpage training method and device, and search intention identification method and device
CN107092621A (en) * 2016-11-24 2017-08-25 北京小度信息科技有限公司 Information search method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101820592A (en) * 2009-02-27 2010-09-01 华为技术有限公司 Method and device for mobile search
CN102096717A (en) * 2011-02-15 2011-06-15 百度在线网络技术(北京)有限公司 Search method and search engine

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120259829A1 (en) * 2009-12-30 2012-10-11 Xin Zhou Generating related input suggestions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
拟合用户偏好的个性化搜索;桑艳艳 等;《情报科学》;20080831;第26卷(第8期);第1249页 *

Also Published As

Publication number Publication date
CN102999520A (en) 2013-03-27

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant