CN102999520B

CN102999520B - Method and apparatus for search need identification

Info

Publication number: CN102999520B
Application number: CN201110273327.2A
Authority: CN
Inventors: 黄际洲
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2011-09-15
Filing date: 2011-09-15
Publication date: 2016-04-27
Anticipated expiration: 2031-09-15
Also published as: CN102999520A

Abstract

The invention provides a method and apparatus for search need identification. The method comprises: S1, obtaining a query to be identified; S2, obtaining search results of the query to be identified, determining each n-gram of the search result text, and determining the weight of each n-gram based on its occurrence in the search result text, thereby obtaining the core word vector of the query to be identified; S3, calculating the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determining the demand type of the query to be identified according to the similarity results. The invention can improve the accuracy of search need identification.

Description

Method and apparatus for search need identification

[technical field]

The present invention relates to the field of computer technology, and in particular to a method and apparatus for search need identification.

[background technology]

With the rapid development and maturation of the Internet worldwide, information resources on the network keep growing and the volume of data is expanding rapidly, so that search engines have become the main way people obtain information. Providing users with more convenient and more accurate query services is the current and future direction of search engine technology.

In search engine technology, identifying the user's search need is an important part of improving search accuracy and effectiveness, and its effect is especially pronounced in structured search. Existing search need identification methods usually compute the similarity between the core word vector of the query and the core word vector of each demand type, and determine the demand type of the query from the similarity results. For example, the demand types ranked in the top N by similarity are identified as the demand types of the query, or the demand level of the query in each demand type is determined from the similarity value. However, because the query itself is short and carries little usable information, computing the similarity directly from the core word vector of the query alone may introduce a large semantic deviation, which reduces the accuracy of search need identification.

[summary of the invention]

The invention provides a method and apparatus for search need identification, so as to improve the accuracy of search need identification.

The specific technical solution is as follows:

A method for search need identification, the method comprising:

S1, obtaining a query to be identified;

S2, obtaining search results of the query to be identified, determining each n-gram of the search result text, and determining the weight of each n-gram based on its occurrence in the search result text, thereby obtaining the core word vector of the query to be identified;

S3, calculating the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determining the demand type of the query to be identified according to the similarity results.

According to a preferred embodiment of the present invention, obtaining the search results of the query to be identified in step S2 comprises: obtaining the top N1 search results of the query to be identified, where N1 is a preset positive integer.

According to a preferred embodiment of the present invention, determining the weight of each n-gram based on its occurrence in the search result text in step S2 specifically comprises:

assigning a weight to each n-gram according to the term frequency (TF) of the n-gram in the search result text and the corresponding value of n; or,

assigning a weight to each n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the query to be identified, the number of sentences in the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram.

According to a preferred embodiment of the present invention, the search result text comprises: the webpage titles of the search results, or the sentences in the webpages of the search results that contain the query to be identified.

According to a preferred embodiment of the present invention, determining the core word vector of a demand type comprises:

S31, determining the seed query set of the demand type;

S32, searching with each seed query in the seed query set, extracting core words from the search result text, and determining the weight of each core word based on its occurrence in the search result text, thereby obtaining the core word vector of the demand type.

According to a preferred embodiment of the present invention, the seed query set of a demand type is determined in one of the following ways:

configuring it manually; or,

labeling queries manually in the search log; or,

obtaining, from the search log of the vertical search for the demand type, the queries whose search count exceeds a preset first threshold, and forming the seed query set of the demand type from them; or,

obtaining, from the web search log, the queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and forming the seed query set of the demand type from those obtained queries whose search count exceeds a preset second threshold.

According to a preferred embodiment of the present invention, step S32 specifically comprises:

searching with each seed query in the seed query set of the demand type, determining each n-gram in the search result text, and determining the weight of each n-gram based on its occurrence in the search result text, thereby obtaining the core word vector of the demand type; or,

searching with each seed query in the seed query set of the demand type, performing word segmentation and stop-word removal on the search result text, counting the TF of each remaining word, selecting the words whose TF exceeds a preset term frequency threshold, and determining the weight of each selected word based on its TF, thereby obtaining the core word vector of the demand type; or,

searching with each seed query in the seed query set of the demand type, performing word segmentation and stop-word removal on the search result text, counting the TF and IDF of each remaining word, selecting the words whose TF-IDF value exceeds a preset TF-IDF threshold, and determining the weight of each selected word based on its TF-IDF value, thereby obtaining the core word vector of the demand type; or,

searching with each seed query in the seed query set of the demand type, performing word segmentation and stop-word removal on the search result text, assigning a weight to each remaining word according to the number of sentences in the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the word, and selecting the words whose weight exceeds a preset weight threshold, thereby obtaining the core word vector of the demand type.

According to a preferred embodiment of the present invention, determining the weight of each n-gram based on its occurrence in the search result text comprises:

assigning a weight to each n-gram according to the TF of the n-gram in the search result text and the corresponding value of n; or,

assigning a weight to each n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the n-gram.

According to a preferred embodiment of the present invention, determining the demand type of the query to be identified according to the similarity results in step S3 comprises:

determining the demand types ranked in the top N2 by similarity value, or the demand types whose similarity value exceeds a preset similarity threshold, as the demand types of the query to be identified, where N2 is a preset positive integer; or,

determining, according to a preset correspondence between similarity values and similarity levels, that the similarity level corresponding to the similarity value calculated in step S3 is the demand level of the query to be identified in the corresponding demand type.

A device for search need identification, the device comprising:

an identification object acquisition unit, configured to obtain a query to be identified;

a primary vector determining unit, configured to obtain search results of the query to be identified, determine each n-gram of the search result text, and determine the weight of each n-gram based on its occurrence in the search result text, thereby obtaining the core word vector of the query to be identified;

a demand type determining unit, configured to calculate the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determine the demand type of the query to be identified according to the similarity results.

According to a preferred embodiment of the present invention, when obtaining the search results of the query to be identified, the primary vector determining unit specifically obtains the top N1 search results of the query to be identified, where N1 is a preset positive integer.

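According to a preferred embodiment of the present invention, when determining the weight of each n-gram, the primary vector determining unit assigns a weight to the n-gram according to the term frequency (TF) of the n-gram in the search result text and the corresponding value of n; or assigns a weight to the n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the query to be identified, the number of sentences in the search result text in which the query to be identified occurs, and the IDF of the n-gram.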

According to a preferred embodiment of the present invention, the device further comprises a secondary vector determining unit.

The secondary vector determining unit specifically comprises:

a seed query determining subunit, configured to determine the seed query set of a demand type;

a core word vector forming subunit, configured to obtain the search results of each seed query in the seed query set, extract core words from the search result text, and determine the weight of each core word based on its occurrence in the search result text, thereby obtaining the core word vector of the demand type.

According to a preferred embodiment of the present invention, the seed query determining subunit obtains a manually configured seed query set of the demand type; or,

obtains a seed query set of the demand type that was labeled manually in the search log; or,

obtains, from the search log of the vertical search for the demand type, the queries whose search count exceeds a preset first threshold, and forms the seed query set of the demand type from them; or,

obtains, from the web search log, the queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and forms the seed query set of the demand type from those obtained queries whose search count exceeds a preset second threshold.

According to a preferred embodiment of the present invention, the core word vector forming subunit obtains the search results of each seed query in the seed query set of the demand type, determines each n-gram in the search result text, and determines the weight of each n-gram based on its occurrence in the search result text, thereby obtaining the core word vector of the demand type; or,

obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation and stop-word removal on the search result text, counts the TF of each remaining word, selects the words whose TF exceeds a preset term frequency threshold, and determines the weight of each selected word based on its TF, thereby obtaining the core word vector of the demand type; or,

obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation and stop-word removal on the search result text, counts the TF and IDF of each remaining word, selects the words whose TF-IDF value exceeds a preset TF-IDF threshold, and determines the weight of each selected word based on its TF-IDF value, thereby obtaining the core word vector of the demand type; or,

obtains the search results of each seed query in the seed query set of the demand type, performs word segmentation and stop-word removal on the search result text, assigns a weight to each remaining word according to the number of sentences in the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the word, and selects the words whose weight exceeds a preset weight threshold, thereby obtaining the core word vector of the demand type.

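According to a preferred embodiment of the present invention, when determining the weight of each n-gram based on its occurrence in the search result text, the core word vector forming subunit specifically assigns a weight to each n-gram according to the TF of the n-gram in the search result text and the corresponding value of n; or assigns a weight to each n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the n-gram.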

According to a preferred embodiment of the present invention, the demand type determining unit determines the demand types ranked in the top N2 by similarity value, or the demand types whose similarity value exceeds a preset similarity threshold, as the demand types of the query to be identified, where N2 is a preset positive integer; or,

determines, according to a preset correspondence between similarity values and similarity levels, that the similarity level corresponding to the calculated similarity value is the demand level of the query to be identified in the corresponding demand type.

As can be seen from the above technical solution, the present invention takes the n-grams of the search result text of the query to be identified, determines the weight of each n-gram based on its occurrence in the search result text, and thereby obtains the core word vector of the query to be identified. It then calculates the similarity between this core word vector and the core word vector of each demand type, and identifies the demand type of the query to be identified. The invention thus uses information that is richer than the query itself, namely the n-grams of the search result text of the query to be identified, which express the semantics of the query more fully and thereby improve the accuracy of search need identification.

[accompanying drawing explanation]

Fig. 1 is a flow chart of the method provided by embodiment one of the present invention;

Fig. 2 is a schematic diagram of a webpage containing sentences that include the query to be identified, provided by embodiment one of the present invention;

Fig. 3 is a structural diagram of the device provided by embodiment two of the present invention;

Fig. 4 is an example diagram of search need identification applied to result ranking in general ("large") search, provided by an embodiment of the present invention;

Fig. 5 is an example diagram of search need identification applied to vertical search, provided by an embodiment of the present invention.

[embodiment]

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.

Embodiment one,

Fig. 1 is a flow chart of the method provided by embodiment one of the present invention. As shown in Fig. 1, the method may comprise the following steps:

Step 101: obtain the query to be identified.

Step 102: obtain the search results of the query to be identified, determine each n-gram in the search result text, and determine the weight of each n-gram based on its occurrence in the search result text, thereby obtaining the core word vector of the query to be identified.

Because the search results for a query are usually highly relevant to that query, this step extracts the core word vector from the search results obtained by searching with the query to be identified.

In addition, when a search engine searches for the query to be identified, it ranks the search results by their relevance to the query. Therefore, to improve efficiency and reduce computation, the top N1 search results can be selected and the n-grams determined from the text of these N1 search results, where N1 is a preset positive integer.

A search result page may contain a large amount of information, much of which may have little semantic relevance to the query to be identified. Therefore, the search result text used when determining n-grams can be the webpage title, or the sentences in the webpage that contain the query to be identified.
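The selection of search result text described above can be sketched as follows. This is only an illustrative sketch: the function name `select_result_text`, the `title`/`body` dictionary layout of a search result, and the naive punctuation-based sentence splitting are all assumptions, not details given in the patent. The sketch also collects both titles and query-bearing sentences, whereas the text allows either.

```python
import re

def select_result_text(query, results, n1=10):
    """Collect titles and query-bearing sentences from the top n1 results.

    `results` is a hypothetical list of dicts with "title" and "body" keys,
    assumed to be already ordered by relevance to the query.
    """
    texts = []
    for result in results[:n1]:
        texts.append(result["title"])
        # Naive sentence split on common Western and CJK terminators (assumption).
        for sentence in re.split(r"[.!?。！？\n]+", result["body"]):
            sentence = sentence.strip()
            if sentence and query in sentence:
                texts.append(sentence)
    return texts
```

Keeping only the top `n1` results and only sentences that mention the query limits both computation and semantically unrelated text, which is the motivation stated above.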

Taking sentences that contain the query to be identified as an example, suppose the query to be identified is "home cooking". Suppose that, after searching with this query, one of the returned search results is as shown in Fig. 2. The sentences in the webpage that contain the query to be identified are:

Home cooking _ menu complete works is done in way _ home cooking menu _ of home cooking _ home cooking

Home cooking is essential in our lives

There are many ways to prepare home cooking, such as Northeastern home cooking and Guo Lin home cooking; how to cook the simplest home cooking recipes

Cuisines are outstanding for you provide abundant simple home cooking menu complete works of

The n-grams are then determined from the above four sentences.

An n-gram is a combination of n minimum-granularity words occurring in order, where n is one or more preset positive integers. Taking the second sentence above ("home cooking is essential in our lives") as an example, if n is 1, 2, 3, or 4 and the sentence is segmented, in its original word order, into the words listed under 1-gram, the resulting n-grams are:

1-gram: home cooking; is; we; life; in; essential

2-gram: home cooking is; is we; we life; life in; in essential

3-gram: home cooking is we; is we life; we life in; life in essential

4-gram: home cooking is we life; is we life in; we life in essential

A stop word in the sentence is filtered out while the n-grams are determined.
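The enumeration above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `extract_ngrams`, the English token glosses, and the empty default stop-word set are assumptions, and word segmentation is assumed to have been performed already.

```python
def extract_ngrams(tokens, n_values=(1, 2, 3, 4), stop_words=frozenset()):
    """Return every n-gram (as a tuple of words) for each n in n_values.

    Stop words are dropped first, then n consecutive remaining words
    are combined in their original order.
    """
    kept = [t for t in tokens if t not in stop_words]
    ngrams = []
    for n in n_values:
        for i in range(len(kept) - n + 1):
            ngrams.append(tuple(kept[i:i + n]))
    return ngrams

# Hypothetical segmentation of the example sentence (English glosses).
tokens = ["home cooking", "is", "we", "life", "in", "essential"]
for gram in extract_ngrams(tokens, n_values=(2,)):
    print(" ".join(gram))
```

With six words remaining after stop-word removal, this yields 6, 5, 4, and 3 n-grams for n = 1, 2, 3, and 4 respectively, matching the counts in the lists above.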

The weight of each n-gram can be determined in ways including, but not limited to, the following two:

Mode one: assign a weight to each n-gram according to the term frequency (TF) of the n-gram in the search result text and the corresponding value of n. Generally, the higher the TF of an n-gram in the search result text, the more important the n-gram; and the larger the value of n, the more information the n-gram carries, so its weight should also be higher. Therefore, in this mode TF*n can be used as the weight of the n-gram.

Mode two: assign a weight to each n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the query to be identified, the number of sentences in the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram. This mode is based on information theory, and the formula can be as shown in formula (1).

$$\mathrm{Centrality}(w) = \frac{\log(\mathrm{Co}(w, q) + 1)}{\log(\mathrm{sf}(w) + 1) + \log(\mathrm{sf}(q) + 1)} \times \log(\mathrm{idf}(w) + 1) \tag{1}$$

Here, w is the n-gram, q is the query to be identified, Centrality (w) is the weight of the n-gram, Co (w, q) is the number of sentences in which the n-gram co-occurs with the query to be identified, sf (w) is the number of sentences in the search result text in which the n-gram occurs, sf (q) is the number of sentences in the search result text in which the query to be identified occurs, and idf (w) is the inverse document frequency of the n-gram.

It should be noted that formula (1) is only an example provided by this embodiment of the present invention. Simple modifications and equivalent replacements of this formula are not enumerated one by one, and all fall within the scope of the present invention.
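Both weighting modes can be sketched in Python as follows. The function names are hypothetical. Treating TF as a raw occurrence count, using the natural logarithm, counting sentences by substring matching, and returning 0 when the denominator of formula (1) is zero are all assumptions that the text does not specify.

```python
import math
from collections import Counter

def tf_times_n_weights(ngrams):
    """Mode one: weight = TF * n, with TF taken as a raw count (assumption)."""
    counts = Counter(ngrams)
    return {gram: tf * len(gram) for gram, tf in counts.items()}

def sentence_counts(sentences, ngram_text, query):
    """Count Co(w, q), sf(w) and sf(q) by substring matching (assumption)."""
    sf_w = sum(1 for s in sentences if ngram_text in s)
    sf_q = sum(1 for s in sentences if query in s)
    co = sum(1 for s in sentences if ngram_text in s and query in s)
    return co, sf_w, sf_q

def centrality(co_wq, sf_w, sf_q, idf_w):
    """Mode two: formula (1), using natural logarithms (assumption)."""
    denominator = math.log(sf_w + 1) + math.log(sf_q + 1)
    if denominator == 0:
        # Neither the n-gram nor the query occurs in any sentence.
        return 0.0
    return math.log(co_wq + 1) / denominator * math.log(idf_w + 1)
```

In mode one, n-grams are tuples of words as in the earlier enumeration, so `len(gram)` is n. In mode two, an n-gram that never co-occurs with the query gets weight 0 because log(0 + 1) = 0 in the numerator.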

Step 103: calculate the similarity between the core word vector of the query to be identified and the core word vector of each demand type, and determine the demand type of the query to be identified according to the similarity results.

In the present invention the core word vector of each demand type is determined in advance. The core word vector of a demand type can be determined as follows: determine the seed query set of the demand type; search with each seed query in the seed query set, extract core words from the search result text, and determine the weight of each core word based on its occurrence in the search result text, thereby obtaining the core word vector of the demand type.

The seed queries that make up the seed query set of a demand type reflect the need of the corresponding preset type. A seed query set can be configured manually, or labeled manually in the search log. More preferably, seed queries can also be mined from search logs. For example, the queries whose search count exceeds a preset first threshold are obtained from the search log of the vertical search for the demand type and used as seed queries of the demand type. Alternatively, the queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked are obtained from the web search log, and those among them whose search count exceeds a preset second threshold are used as seed queries of the demand type.

For example, the seed query set of the game class may include seed queries such as "downloads of standalone version mobile phone trivial games", "precious prompt fast lp608 mobile phone games download", "World of Warcraft's download", and "World of Warcraft".
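Mining seed queries from a vertical search log by search count can be sketched as follows. The function name `mine_seed_queries` and the representation of the log as a flat list of query strings are assumptions made for illustration; "higher than the threshold" is read as a strict comparison.

```python
from collections import Counter

def mine_seed_queries(vertical_log_queries, first_threshold):
    """Return queries whose search count exceeds the preset first threshold.

    `vertical_log_queries` is a hypothetical list with one entry per
    search issued in the vertical search of a given demand type.
    """
    counts = Counter(vertical_log_queries)
    return {query for query, count in counts.items() if count > first_threshold}

# Hypothetical log of a game vertical search.
log = ["World of Warcraft"] * 3 + ["weather"]
print(mine_seed_queries(log, first_threshold=2))
```

The web-search variant described above would apply the same counting step, but only to queries whose clicks landed on a site of the demand type or on a title containing one of its feature words.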

After searching with each seed query in a seed query set, the core words can be extracted in any of the following ways:

First way: determine each n-gram in the search result text and determine the weight of each n-gram based on its occurrence in the search result text, thereby obtaining the core word vector of the demand type.

Because a search engine usually ranks the search results for a seed query by their relevance to the seed query, the top N3 search results can be selected to improve efficiency and reduce computation, and the n-grams determined from the text of these N3 search results, where N3 is a preset positive integer.

A search result page may contain a large amount of information, much of which may have little semantic relevance to the seed query. Therefore, the search result text used when determining n-grams can be the webpage title, or the sentences in the webpage that contain the seed query. The same applies to the other ways below and is not repeated.

Mode 1: assign a weight to each n-gram according to the TF of the n-gram in the search result text and the corresponding value of n. Generally, the higher the TF of an n-gram in the search result text, the more important the n-gram; and the larger the value of n, the more information the n-gram carries, so its weight should also be higher. Therefore, in this mode TF*n can be used as the weight of the n-gram.

Mode 2: assign a weight to each n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the n-gram. This mode is based on information theory, and the formula can be as shown in formula (2).

$$\mathrm{Centrality}(w) = \frac{\log(\mathrm{Co}(w, q) + 1)}{\log(\mathrm{sf}(w) + 1) + \log(\mathrm{sf}(q) + 1)} \times \log(\mathrm{idf}(w) + 1) \tag{2}$$

Here, w is the n-gram, q is the corresponding seed query, Centrality (w) is the weight of the n-gram, Co (w, q) is the number of sentences in which the n-gram co-occurs with the seed query, sf (w) is the number of sentences in the search result text in which the n-gram occurs, sf (q) is the number of sentences in the search result text in which the seed query occurs, and idf (w) is the inverse document frequency of the n-gram.

It should be noted that formula (2) is only an example provided by this embodiment of the present invention. Simple modifications and equivalent replacements of this formula are not enumerated one by one, and all fall within the scope of the present invention.

Second way: perform word segmentation and stop-word removal on the search result text, count the term frequency of each remaining word, select the words whose term frequency exceeds a preset term frequency threshold, and determine the weight of each selected word based on its term frequency, thereby obtaining the core word vector of the demand type.

The higher the term frequency of a word, the larger its weight.

Third way: perform word segmentation and stop-word removal on the search result text, count the TF and IDF of each remaining word, select the words whose TF-IDF value exceeds a preset TF-IDF threshold, and determine the weight of each selected word based on its TF-IDF value, thereby obtaining the core word vector of the demand type.

The larger the TF-IDF value of a word, the larger its weight.

Fourth way: perform word segmentation and stop-word removal on the search result text, assign a weight to each remaining word according to the number of sentences in the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the word, and select the words whose weight exceeds a preset weight threshold, thereby obtaining the core word vector of the demand type.

The weight is calculated as shown in formula (3).

$$\mathrm{Centrality}(w) = \frac{\log(\mathrm{Co}(w, q) + 1)}{\log(\mathrm{sf}(w) + 1) + \log(\mathrm{sf}(q) + 1)} \times \log(\mathrm{idf}(w) + 1) \tag{3}$$

Here, w is a word remaining after stop-word removal, q is the corresponding seed query, Centrality (w) is the weight of word w, Co (w, q) is the number of sentences in which word w co-occurs with the seed query, sf (w) is the number of sentences in the search result text in which word w occurs, sf (q) is the number of sentences in the search result text in which the seed query occurs, and idf (w) is the inverse document frequency of word w.
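The threshold-based selection in the second and third ways can be sketched as follows. The function names, the use of raw counts as TF, and the smoothed IDF definition log(N / (1 + df)) computed from a hypothetical background document-frequency table are assumptions; the patent does not fix these details.

```python
import math
from collections import Counter

def core_words_by_tf(tokens, stop_words, tf_threshold):
    """Second way: keep words whose count exceeds the TF threshold."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return {word: tf for word, tf in counts.items() if tf > tf_threshold}

def core_words_by_tfidf(tokens, stop_words, doc_freq, num_docs, threshold):
    """Third way: keep words whose TF-IDF value exceeds the threshold.

    `doc_freq` maps a word to the number of background documents that
    contain it; `num_docs` is the background corpus size (both hypothetical).
    """
    counts = Counter(t for t in tokens if t not in stop_words)
    vector = {}
    for word, tf in counts.items():
        idf = math.log(num_docs / (1 + doc_freq.get(word, 0)))
        score = tf * idf
        if score > threshold:
            vector[word] = score
    return vector
```

Both functions return a mapping from core word to weight, which serves as the core word vector of the demand type. The fourth way has the same shape, with the Centrality value of formula (3) used in place of TF or TF-IDF.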

The similarity between the core word vector of the query to be identified and the core word vector of a demand type can be calculated as the cosine similarity. Table 1 shows the similarity between several queries to be identified and each demand type.

Table 1

Query to be identified	Similarity to game class	Similarity to software class	Similarity to novel class
Network game repair sieve legend	0.0026	0	0.4431
The novel of DNF	0.0050	0.0001	0.3467
Story of a play or opera task in DNF	0.3616	0.0128	0
Swordsman's love standalone version 3 attack strategy	0.1631	0	0.0063
Swordsman's love reads the non-cigarette of step in full	0	0	0.1205
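Cosine similarity between two core word vectors can be sketched as follows. Representing each vector as a dictionary from core word (or n-gram) to weight, and returning 0 for an empty vector, are assumptions made for this illustration.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b[key] for key, weight in vec_a.items() if key in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty or all-zero vector shares nothing with any other vector.
        return 0.0
    return dot / (norm_a * norm_b)
```

Only the keys shared by both vectors contribute to the dot product, so a query whose search results share no core words with a demand type gets similarity 0, as in several cells of Table 1.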

After the similarities are determined, the demand types ranked in the top N2 by similarity value, or the demand types whose similarity value exceeds a preset similarity threshold, can be identified as the demand types of the query to be identified, where N2 is a preset positive integer. For example, for the case shown in Table 1, suppose N2 is 1; then "the novel of DNF" is identified as a novel class demand, and "swordsman's love standalone version 3 attack strategy" as a game class demand.
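The two selection rules can be sketched as follows, using rows of Table 1 as input. The function names and the dictionary representation of the similarities are assumptions, and the 0.3 threshold in the usage line is chosen only for illustration.

```python
def top_n_types(similarities, n2=1):
    """Return the n2 demand types with the highest similarity values."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n2]]

def types_above_threshold(similarities, threshold):
    """Return the demand types whose similarity exceeds the threshold."""
    return [name for name, score in similarities.items() if score > threshold]

# Row "The novel of DNF" from Table 1.
row = {"game": 0.0050, "software": 0.0001, "novel": 0.3467}
print(top_n_types(row, n2=1))            # ['novel']
print(types_above_threshold(row, 0.3))   # ['novel']
```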

Alternatively, the demand level of the query to be identified in each demand type can be identified from the similarity between the core word vector of the query and the core word vector of each demand type, according to a preset correspondence between similarity values and similarity levels. For example, suppose it is preset that a similarity above 0.3 is the strong demand level, a similarity between 0.1 and 0.3 is the weak demand level, and a similarity below 0.1 is the no-demand level. Then in Table 1, "the novel of DNF" has strong demand in the novel class and no demand in the game class and software class; "swordsman's love standalone version 3 attack strategy" has weak demand in the game class and no demand in the software class and novel class.
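The mapping from similarity value to demand level in this example can be sketched as follows. The function name `demand_level` is hypothetical, and the handling of the exact boundary values 0.1 and 0.3 is an assumption, since the text does not say which level they belong to.

```python
def demand_level(similarity, strong=0.3, weak=0.1):
    """Map a similarity value to a demand level using preset thresholds."""
    if similarity > strong:
        return "strong"
    if similarity >= weak:
        # Boundary values 0.1 and 0.3 are treated as weak (assumption).
        return "weak"
    return "none"

# Row "The novel of DNF" from Table 1.
row = {"game": 0.0050, "software": 0.0001, "novel": 0.3467}
print({name: demand_level(score) for name, score in row.items()})
```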

The above is a detailed description of the method for search need identification provided by the present invention. The device for search need identification provided by the present invention is described in detail below through embodiment two.

Embodiment two,

Fig. 3 is a structural diagram of the device provided by embodiment two of the present invention. As shown in Fig. 3, the device may comprise: an identification object acquisition unit 300, a primary vector determining unit 310, and a demand type determining unit 320.

The identification object acquisition unit 300 obtains the query to be identified.

The primary vector determining unit 310 obtains the search results of the query to be identified, determines each n-gram of the search result text, and determines the weight of each n-gram based on its occurrence in the search result text, thereby obtaining the core word vector of the query to be identified.

Because the search results for a query are usually highly relevant to that query, the primary vector determining unit 310 can submit the query to be identified to a search engine, obtain the search results returned by the search engine, and use them to extract the core word vector of the query to be identified.

When a search engine searches for the query to be identified, it ranks the search results by their relevance to the query. Therefore, to improve efficiency and reduce computation, when obtaining the search results of the query to be identified, the primary vector determining unit 310 specifically obtains the top N1 search results of the query, where N1 is a preset positive integer.

When determining the weight of each n-gram, the primary vector determining unit 310 can adopt either of the following two ways:

First way: assign a weight to each n-gram according to its term frequency (TF) in the search result text and its n value. Usually, the higher the TF of an n-gram in the search result text, the more important the n-gram is; and the larger the n value, the more information the n-gram carries, so its weight should also be higher. Therefore, TF*n can be used as the weight of the n-gram in this way.
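A minimal sketch of the TF*n weighting, assuming the search result text has already been split into sentences and tokens. The function name `ngram_weights_tf_n` and the upper bound `max_n` are illustrative assumptions; the description does not fix a maximum n or a tokenization method.

```python
from collections import Counter


def ngram_weights_tf_n(sentences, max_n=3):
    """Return {n-gram tuple: TF * n} for all n-grams with 1 <= n <= max_n.

    `sentences` is a list of token lists. N-grams are counted within each
    sentence and never span a sentence boundary (an assumption).
    """
    tf = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                tf[tuple(tokens[i:i + n])] += 1
    # Weight = term frequency multiplied by the length n of the n-gram.
    return {gram: count * len(gram) for gram, count in tf.items()}


weights = ngram_weights_tf_n([["dnf", "novel"], ["dnf", "novel", "download"]])
print(weights[("dnf", "novel")])  # TF is 2 and n is 2
```

With this weighting, a bigram that occurs as often as a unigram receives twice the unigram's weight, which reflects the stated intent that longer n-grams carry more information.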

Second way: assign a weight to each n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the query to be identified, the number of sentences in the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram. This way is based on information theory; the formula is as shown in formula (1) in embodiment one and is not repeated here.

Because a search result page may contain a large amount of information, much of which has little semantic relevance to the query to be identified, the above search result text can comprise: the web page titles of the search results, or the sentences in the web pages of the search results that contain the query to be identified.
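The selection of search result text can be sketched as below, combining the top-N1 cut-off with the two text sources just described. The record layout (dicts with `title` and `body` keys), the function name, and the punctuation-based sentence splitting are all assumptions made for illustration.

```python
import re


def search_result_text(results, query, n1=10, use_titles=True):
    """Collect the text used for n-gram extraction from the top n1 results.

    `results` is assumed to be ordered by relevance, each item a dict with
    hypothetical keys "title" and "body".
    """
    top = results[:n1]
    if use_titles:
        return [r["title"] for r in top]
    sentences = []
    for r in top:
        # Naive sentence split on Western and Chinese end punctuation.
        for sentence in re.split(r"[.!?。！？]+", r["body"]):
            sentence = sentence.strip()
            if sentence and query in sentence:
                sentences.append(sentence)
    return sentences


results = [
    {"title": "DNF novel online", "body": "Read the DNF novel here. Contact us."},
    {"title": "Other page", "body": "Nothing relevant."},
]
print(search_result_text(results, "DNF novel", n1=1))
print(search_result_text(results, "DNF novel", n1=2, use_titles=False))
```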

The demand type determining unit 320 calculates the similarity between the core word vector of the query to be identified and the predetermined core word vector of each demand type, and determines the demand type of the query to be identified according to the similarity results.
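As an illustration of this step, the sketch below compares sparse core word vectors stored as dicts. Cosine similarity is used here as an assumption; this passage only requires that some similarity be computed. The vectors and demand type names in the usage lines are hypothetical.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty vector is treated as unrelated to everything.
    return dot / (norm_a * norm_b)


def similarity_by_type(query_vector, type_vectors):
    """Compute the similarity of the query vector to every demand type vector."""
    return {
        name: cosine_similarity(query_vector, vector)
        for name, vector in type_vectors.items()
    }


# Hypothetical vectors for illustration only.
query_vector = {"novel": 3.0, "chapter": 1.0}
type_vectors = {"novel": {"novel": 5.0, "read": 2.0}, "game": {"download": 4.0}}
print(similarity_by_type(query_vector, type_vectors))
```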

Because the core word vector of each demand type needs to be determined in advance, the device can also comprise: a secondary vector determining unit 330.

The secondary vector determining unit 330 can specifically comprise: a seed query determining subunit 331 and a core word vector forming subunit 332.

The seed query determining subunit 331 determines the seed query set of a demand type. Specifically, the set can be obtained in any of the following ways:

First way: obtain the seed query set of the demand type as configured manually.

Second way: obtain the seed query set of the demand type as manually labeled in the search log.

Third way: from the search log of the vertical search for the demand type, obtain the queries whose number of searches is higher than a preset first threshold to form the seed query set of this demand type.

Fourth way: from the search log of web search, obtain the queries for which a website of this demand type was clicked or a title containing a feature word of this demand type was clicked, and form the seed query set of this demand type from those obtained queries whose number of searches is higher than a preset second threshold.
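The third and fourth ways can be sketched as simple log filters. The log layouts are assumptions: the vertical search log is taken to be a list of query strings, and the web search log a list of `(query, clicked_site, clicked_title)` tuples. Counting one log record as one search is likewise an assumption, and all names are hypothetical.

```python
from collections import Counter


def seeds_from_vertical_log(logged_queries, first_threshold):
    """Third way: keep queries searched more than first_threshold times."""
    counts = Counter(logged_queries)
    return {q for q, c in counts.items() if c > first_threshold}


def seeds_from_web_log(records, type_sites, feature_words, second_threshold):
    """Fourth way: keep frequent queries whose clicks point at this demand type.

    A record qualifies when the clicked site belongs to the demand type or
    the clicked title contains one of the demand type's feature words.
    """
    counts = Counter()
    for query, clicked_site, clicked_title in records:
        site_hit = clicked_site in type_sites
        title_hit = any(word in clicked_title for word in feature_words)
        if site_hit or title_hit:
            counts[query] += 1
    return {q for q, c in counts.items() if c > second_threshold}


log = ["dnf novel", "dnf novel", "dnf novel", "rare query"]
print(seeds_from_vertical_log(log, first_threshold=2))
```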

The core word vector forming subunit 332 obtains the search results of each seed query in the seed query set, extracts core words from the search result text, determines the weight of each core word based on its occurrence in the search result text, and thereby obtains the core word vector of this demand type. That is, the core word vector forming subunit 332 submits each seed query to the search engine separately and obtains the search results that the search engine returns.

Specifically, the core word vector forming subunit 332 can adopt any of the following four modes to obtain the core word vector of this demand type:

Mode one: obtain the search results of each seed query in the seed query set of this demand type, determine each n-gram in the search result text, and determine the weight of each n-gram based on its occurrence in the search result text, obtaining the core word vector of this demand type.

When determining the weight of each n-gram based on its occurrence in the search result text, a weight can be assigned to each n-gram according to its TF in the search result text and its n value; or, a weight can be assigned according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the n-gram. Formula (2) in embodiment one can be used for the latter and is not repeated here.

Mode two: obtain the search results of each seed query in the seed query set of this demand type, perform word segmentation and stop-word removal on the search result text, count the TF of each word remaining after stop-word removal, select the words whose TF is higher than a preset word frequency threshold, and determine the weight of each selected word based on its word frequency, obtaining the core word vector of this demand type. The higher the word frequency of a word, the larger its weight.

Mode three: obtain the search results of each seed query in the seed query set of this demand type, perform word segmentation and stop-word removal on the search result text, count the TF and IDF of each word remaining after stop-word removal, select the words whose TF-IDF value is higher than a preset TF-IDF threshold, and determine the weight of each selected word based on its TF-IDF value, obtaining the core word vector of this demand type. The larger the TF-IDF value of a word, the larger its weight.

Mode four: obtain the search results of each seed query in the seed query set of this demand type, perform word segmentation and stop-word removal on the search result text, and assign a weight to each remaining word according to the number of sentences in the search result text in which the word occurs, the number of sentences in which the word co-occurs with the corresponding seed query, the number of sentences in the search result text in which the seed query occurs, and the IDF of the word; then select the words whose weights are higher than a preset weight threshold, obtaining the core word vector of this demand type. Formula (3) in embodiment one can be used to assign the weight to each word and is not repeated here.
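Of the four modes, mode two is the simplest to illustrate. The sketch below assumes whitespace tokenization in place of real word segmentation and uses the raw TF as the weight; both choices, along with the function name and the sample stop-word list, are assumptions.

```python
from collections import Counter


def core_words_by_tf(texts, stop_words, tf_threshold):
    """Mode two sketch: keep words whose TF exceeds tf_threshold.

    `texts` is the search result text collected for all seed queries of one
    demand type. The returned weight is the raw TF, so a higher word
    frequency gives a larger weight.
    """
    tf = Counter(
        token
        for text in texts
        for token in text.lower().split()  # stand-in for word segmentation
        if token not in stop_words
    )
    return {word: count for word, count in tf.items() if count > tf_threshold}


texts = ["the DNF novel", "read the novel online", "novel download"]
print(core_words_by_tf(texts, stop_words={"the"}, tf_threshold=1))
```

Mode three follows the same shape with TF-IDF in place of TF, given a reference corpus from which to compute IDF.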

After the similarities are determined, the demand type determining unit 320 can determine the demand types whose similarity values rank in the top N2, or the demand types whose similarity values exceed a preset similarity threshold, as the demand types of the query to be identified, where N2 is a preset positive integer; or, according to the preset correspondence between similarity values and similarity grades, determine the similarity grade corresponding to a calculated similarity value as the demand level of the query to be identified in the corresponding demand type.
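The top-N2 and threshold selection rules can be sketched as follows. The function name and the similarity values in the usage lines are hypothetical.

```python
def select_demand_types(similarity_by_type, top_n=None, threshold=None):
    """Pick demand types either by rank (top_n) or by a similarity threshold.

    Exactly one of top_n and threshold is expected; top_n takes precedence
    if both are given (an arbitrary choice made for this sketch).
    """
    ranked = sorted(similarity_by_type.items(), key=lambda kv: kv[1], reverse=True)
    if top_n is not None:
        return [name for name, _ in ranked[:top_n]]
    if threshold is not None:
        return [name for name, score in ranked if score > threshold]
    raise ValueError("either top_n or threshold must be given")


scores = {"novel": 0.45, "game": 0.08, "software": 0.02}  # hypothetical values
print(select_demand_types(scores, top_n=1))
print(select_demand_types(scores, threshold=0.05))
```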

Once the demand type has been identified by the above method or device provided by the embodiments of the present invention, it can be used in, but is not limited to, the following application scenarios:

1) Ranking in general web search. After a user inputs a query, the demand type of the query can be identified by the above method and device of the embodiments of the present invention, and the pages in the general web search results that correspond to the demand type of the query are ranked higher.

For example, when a user inputs the query "home cooking high definition", general web search can identify that the query has a video class demand. The result page of the general web search can then include video information related to the TV series "home cooking"; this video information can be provided by the video vertical search and inserted into the general web search results. In this way, the video pages can be placed at the front of the general web search results, as shown in Fig. 4, which greatly improves user satisfaction and the search experience.
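A minimal sketch of this re-ranking, assuming each result record carries a hypothetical `type` label. Results matching the identified demand type are moved to the front while the original relative order is otherwise preserved.

```python
def promote_by_demand_type(results, demand_type):
    """Move results of the identified demand type to the front.

    Python's sort is stable, so results keep their original relative order
    within the promoted group and within the remaining group.
    """
    return sorted(results, key=lambda r: r["type"] != demand_type)


results = [
    {"url": "a", "type": "web"},
    {"url": "b", "type": "video"},
    {"url": "c", "type": "web"},
]
print([r["url"] for r in promote_by_demand_type(results, "video")])
```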

2) Vertical search. After a user inputs a query, the demand type of the query can be identified by the above method and device of the embodiments of the present invention, the query is dispatched to the most suitable content resource or application provider for processing, and matching results are finally returned to the user accurately and efficiently.

For example, when a user inputs "from Baidu Building to Wudaokou", the query can be identified as having a map class demand and is supplied to the map vertical search. The map vertical search calculates the bus routes, and the bus travel map from Baidu Building to Wudaokou and the relevant bus information are then displayed directly, as shown in Fig. 5.
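The dispatch step can be sketched as a lookup from demand type to handler. The handler functions and the registry below are hypothetical stand-ins for real vertical search services.

```python
def route_query(query, demand_type, verticals, default_handler):
    """Send the query to the vertical search registered for its demand type.

    Falls back to default_handler (for example, general web search) when no
    vertical is registered for the demand type.
    """
    handler = verticals.get(demand_type, default_handler)
    return handler(query)


def map_search(query):
    return "map results for: " + query  # placeholder for a map vertical


def web_search(query):
    return "web results for: " + query  # placeholder for general web search


verticals = {"map": map_search}
print(route_query("from A to B", "map", verticals, web_search))
```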

3) Information recommendation. After a user inputs a query, the demand type of the query can be identified by the above method and device of the embodiments of the present invention, and information is recommended to the user based on this demand type, for example advertisement recommendation or query recommendation for a knowledge question-and-answer platform.

For example, if a user inputs the query "cheap MP3 player" and its demand type is identified as the shopping class, advertisements related to MP3 players can be recommended in the search results; such advertisements match the user's actual demand closely.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A search need identification method, characterized in that the method comprises:

S1: obtaining a query to be identified;

wherein determining the weight of each n-gram based on the occurrence of each n-gram in the search result text specifically comprises:

assigning a weight to the n-gram according to the number of sentences in the search result text in which the n-gram occurs, the number of sentences in which the n-gram co-occurs with the query to be identified, the number of sentences in the search result text in which the query to be identified occurs, and the inverse document frequency (IDF) of the n-gram;

2. The method according to claim 1, characterized in that obtaining the search results of the query to be identified in step S2 is: obtaining the top N1 search results among the search results of the query to be identified, where N1 is a preset positive integer.

3. method according to claim 1 and 2, is characterized in that, described Search Results text comprises: the web page title of Search Results, or comprises the sentence of described query to be identified in the webpage of Search Results.

4. The method according to claim 1, characterized in that determining the core word vector of a demand type comprises:

S31: determining the seed query set of this demand type;

5. The method according to claim 4, characterized in that the seed query set of a demand type is determined by:

manual configuration; or,

manual labeling in the search log; or,

6. The method according to claim 4, characterized in that step S32 specifically comprises:

7. The method according to claim 6, characterized in that, in step S32, determining the weight of each n-gram based on the occurrence of each n-gram in the search result text comprises:

8. The method according to claim 1, characterized in that determining the demand type of the query to be identified according to the similarity results in step S3 comprises:

9. A search need identification device, characterized in that the device comprises:

an identification object acquisition unit, configured to obtain a query to be identified;

wherein the primary vector determining unit, when determining the weight of each n-gram, assigns a weight to the n-gram according to the term frequency (TF) of the n-gram in the search result text and its n value; or,

10. The device according to claim 9, characterized in that the primary vector determining unit, when obtaining the search results of the query to be identified, specifically obtains the top N1 search results among the search results of the query to be identified, where N1 is a preset positive integer.

11. The device according to claim 9 or 10, characterized in that the search result text comprises: the web page titles of the search results, or the sentences in the web pages of the search results that contain the query to be identified.

12. The device according to claim 9, characterized in that the device further comprises: a secondary vector determining unit;

the secondary vector determining unit specifically comprises:

13. The device according to claim 12, characterized in that the seed query determining subunit obtains the seed query set of the demand type as configured manually; or,

14. The device according to claim 12, characterized in that the core word vector forming subunit obtains the search results of each seed query in the seed query set of this demand type, determines each n-gram in the search result text, and determines the weight of each n-gram based on its occurrence in the search result text, obtaining the core word vector of this demand type; or,

15. The device according to claim 14, characterized in that the core word vector forming subunit, when determining the weight of each n-gram based on its occurrence in the search result text, specifically assigns a weight to each n-gram according to its TF in the search result text and its n value; or,

16. The device according to claim 9, characterized in that the demand type determining unit determines the demand types whose similarity values rank in the top N2, or the demand types whose similarity values exceed a preset similarity threshold, as the demand types of the query to be identified, where N2 is a preset positive integer; or,