CN104615621A - Method and system for processing correlations in searches - Google Patents

Publication number
CN104615621A
Authority
CN
China
Prior art keywords
search results
query string
search
feature
results
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410294419.2A
Other languages
Chinese (zh)
Other versions
CN104615621B (en)
Inventor
贺海军
李雅凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201410294419.2A
Publication of CN104615621A
Application granted
Publication of CN104615621B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries

Abstract

The invention provides a method and a system for relevance processing in search. The method includes the steps of: obtaining a query string and searching according to the query string to obtain several search results; performing feature extraction on the obtained search results one by one according to multiple predefined features, so as to obtain the feature tag value corresponding to each feature in each search result; performing regression processing on the feature tag values corresponding to the features in each search result, so as to obtain the relevance score of the search result relative to the query string; and determining the search results most relevant to the query string according to the relevance scores, and displaying those search results. The method and system can improve the accuracy of relevance processing of search results.

Description

Method and system for relevance processing in search
Technical field
The present invention relates to computer application technology, and in particular to a method and a system for relevance processing in search.
Background
With the development of search technology, users increasingly rely on search engines to search for all kinds of query strings and obtain the corresponding search results. The search results that a search engine retrieves for a query string and shows on the search page are usually massive in number. Relevance processing therefore has to be applied to the search results, so that the user is given the search results that are most relevant to the query string.
However, conventional relevance processing in search is mostly based on a single attribute of the search results, for example the text coverage of a search result relative to the query string. In real applications this limits the accuracy of the relevance processing of search results.
Summary of the invention
In view of this, it is necessary to provide a relevance processing method in search that can improve the accuracy of the relevance processing of search results.
In addition, it is necessary to provide a relevance processing system in search that can improve the accuracy of the relevance processing of search results.
A relevance processing method in search comprises the following steps:
obtaining a query string, and searching according to the query string to obtain several search results;
performing feature extraction on the obtained search results one by one according to multiple predefined features, to obtain the feature tag value corresponding to each feature in each search result;
performing regression processing on the feature tag values corresponding to the features in each search result to obtain the relevance score of the search result relative to the query string;
determining the search results most relevant to the query string according to the relevance scores, and displaying those search results.
A relevance processing system in search comprises:
a query string search module, configured to obtain a query string and search according to the query string to obtain several search results;
a feature extraction module, configured to perform feature extraction on the obtained search results one by one according to multiple predefined features, to obtain the feature tag value corresponding to each feature in each search result;
a processing module, configured to perform regression processing on the feature tag values corresponding to the features in each search result to obtain the relevance score of the search result relative to the query string;
a relevance determining module, configured to determine the search results most relevant to the query string according to the relevance scores, and to display those search results.
In the above method and system, a query string is obtained and searched to obtain several search results; feature extraction is performed on the obtained search results one by one according to multiple predefined features, to obtain the feature tag value corresponding to each feature in each search result; regression processing is performed on these feature tag values to obtain the relevance score of each search result relative to the query string; and the search results most relevant to the query string are determined according to the relevance scores and displayed. Because the most relevant search results depend on multiple predefined features and are obtained by treating relevance as a regression problem, the accuracy of the relevance processing of search results is greatly improved.
Brief description of the drawings
Fig. 1 is a flowchart of the relevance processing method in search in one embodiment;
Fig. 2 is a flowchart of the method in Fig. 1 for performing regression processing on the feature tag values corresponding to the features in each search result to obtain the relevance score of the search result relative to the query string;
Fig. 3 is a flowchart of the method, in one embodiment, for building a regression model in advance from a given set of precise-search query strings and multiple features in the corresponding most relevant result data;
Fig. 4 is a flowchart of the method, in one embodiment, for obtaining the relevance label values of search results and the feature vectors corresponding to the search results, and optimizing the regression model according to the relevance label values and the feature vectors;
Fig. 5 is a structural diagram of the relevance processing system in search in one embodiment;
Fig. 6 is a structural diagram of the processing module in Fig. 5;
Fig. 7 is a structural diagram of the relevance processing system in search in another embodiment;
Fig. 8 is a structural diagram of the model construction module in Fig. 7;
Fig. 9 is a structural diagram of the optimization module in one embodiment;
Fig. 10 is a schematic diagram of a server architecture provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
As shown in Fig. 1, in one embodiment, a relevance processing method in search comprises the following steps:
Step 110: obtain a query string, and search according to the query string to obtain several search results.
In this embodiment, the query string that the user enters on the search page is obtained, and the search engine searches according to the query string to obtain several search results related to it.
For example, the search performed by the user may be a map search. In that case, the search results obtained by searching the map according to the query string are point of interest (POI) data, and each POI record contains information such as name, category, longitude, latitude, and importance (POIRank).
Step 130: perform feature extraction on the obtained search results one by one according to multiple predefined features, to obtain the feature tag value corresponding to each feature in each search result.
In this embodiment, multiple features are defined in advance so that the various attributes contained in each search result can be expressed through them. Feature extraction is performed on every search result according to the predefined features, so that each search result is expressed as feature tag values under those features. That is, each search result corresponds to multiple features, and each feature has its own feature tag value.
The predefined features differ from one search scenario to another, and the feature tag value corresponding to each feature is used to measure the degree of relevance between the search result and the query string.
For example, the predefined features in map search may include: the position of the current result; the total text score, importance, confidence, and authority of the current result; the title text score and the aggregated-alias text score of the current result; the title coverage and the aggregated-alias coverage of the current result; the differences from the first result in total text score, title text score, and importance; the differences from the previous result in total text score, title text score, and importance; the differences from the next result in total text score, title text score, and importance; the difference between the total text score of the current result and the Top N average total text score; the difference between the title text score of the current result and the Top N average title text score; and the difference between the importance of the current result and the Top N average importance. Here, the Top N average total text score is the mean total text score of the N search results with the highest total text scores, the Top N average title text score is the mean title text score of the N search results with the highest title text scores, and the Top N average importance is the mean importance of the N search results with the highest importance. N can be chosen flexibly as needed.
Specifically, the predefined features in map search and the expression of each feature (that is, how its feature tag value is obtained) are shown in the following table:
Here, a result as referred to above is a POI record obtained by a map search.
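The difference features listed above (current result versus the first, previous, and next results, and versus the Top N averages) can be sketched as follows. This is a minimal illustration, not the patent's implementation. The field names `text_score`, `title_score`, and `importance`, the function name `difference_features`, the sign convention (current value minus the other value), and the choice to use a difference of 0.0 when a result has no previous or next neighbor are all assumptions.

```python
from typing import Dict, List

# Assumed field names for total text score, title text score, and importance.
FIELDS = ("text_score", "title_score", "importance")


def difference_features(results: List[Dict[str, float]], top_n: int = 3) -> List[Dict[str, float]]:
    """Compute position and difference features for an ordered result list."""
    if not results:
        return []
    # Top N average per field: mean of the N highest values of that field.
    top_avg = {}
    for field in FIELDS:
        best = sorted((r[field] for r in results), reverse=True)[:top_n]
        top_avg[field] = sum(best) / len(best)

    features = []
    for pos, current in enumerate(results):
        # Assumption: a missing neighbor is replaced by the result itself,
        # which makes the corresponding difference 0.0.
        previous = results[pos - 1] if pos > 0 else current
        following = results[pos + 1] if pos + 1 < len(results) else current
        row = {"position": float(pos)}
        for field in FIELDS:
            row[field] = current[field]
            row[field + "_diff_first"] = current[field] - results[0][field]
            row[field + "_diff_prev"] = current[field] - previous[field]
            row[field + "_diff_next"] = current[field] - following[field]
            row[field + "_diff_top_avg"] = current[field] - top_avg[field]
        features.append(row)
    return features
```

Each returned dictionary holds the feature tag values of one result; a real system would add the remaining features (confidence, authority, coverage, and so on) in the same way.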
The same point of interest on a map may appear in several data sources, and its name, address, phone number, and so on may differ slightly between sources. The POI records coming from different data sources are therefore aggregated, so that the records from the different sources are merged into a single POI record. The name from one data source is chosen as the name of this POI record, and the names from the other data sources become its aggregated aliases.
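The aggregation just described can be sketched as below. The sketch assumes the records have already been matched as describing the same point of interest; the text does not say how that matching is done. The record layout (`source`, `name`) and the function name `aggregate_poi` are hypothetical.

```python
from typing import Dict, List


def aggregate_poi(records: List[Dict[str, str]], primary_source: str) -> Dict[str, object]:
    """Merge records for one POI: one name is kept, other names become aliases."""
    if not records:
        raise ValueError("at least one record is required")
    # Use the record from the preferred source, or fall back to the first record.
    primary = next((r for r in records if r["source"] == primary_source), records[0])
    aliases: List[str] = []
    for record in records:
        name = record["name"]
        if record is not primary and name != primary["name"] and name not in aliases:
            aliases.append(name)
    return {"name": primary["name"], "aliases": aliases}
```

For example, with records named "Peking University" from a source `a` and "PKU" from a source `b`, choosing `a` as the primary source yields the name "Peking University" with the aggregated alias "PKU".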
In addition, the query string is split into multiple terms by word segmentation, and a search result that exists as a text field is likewise split into multiple terms. The proportion of the terms in the search result that occur among the terms of the query string is the text coverage; title text coverage, aggregated-alias text coverage, and so on are defined accordingly.
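A minimal sketch of the text coverage computation follows. It assumes whitespace tokenization as a stand-in for a real word segmenter (Chinese text would need a proper segmentation step), and it reads the definition above as the fraction of the result's terms that also occur in the query. The function names `tokenize` and `text_coverage` are illustrative.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Stand-in for word segmentation: lowercase and split on whitespace."""
    return text.lower().split()


def text_coverage(query: str, result_text: str) -> float:
    """Fraction of the result's terms that also occur among the query's terms."""
    query_terms = set(tokenize(query))
    result_terms = tokenize(result_text)
    if not result_terms:
        return 0.0
    hits = sum(1 for term in result_terms if term in query_terms)
    return hits / len(result_terms)
```

Title coverage and aggregated-alias coverage would call the same function with the title or an alias as `result_text`.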
Step 150: perform regression processing on the feature tag values corresponding to the features in each search result to obtain the relevance score of the search result relative to the query string.
In this embodiment, logistic regression is applied to the feature tag values corresponding to the multiple features in each search result, to obtain the relevance score of that search result relative to the query string.
The higher the relevance score, the more relevant the corresponding search result is to the query string.
Step 170: determine the search results most relevant to the query string according to the relevance scores, and display those search results.
In this embodiment, once every search result obtained by the search has its relevance score, a preset number of search results with the highest relevance scores can be selected according to the score values. These are the search results most relevant to the query string, and they are shown on the search page.
Further, on the search page obtained when the user searches with a query string, only the search results determined to be most relevant to the query string are displayed directly. The other search results are not displayed directly; for example, they are folded and are shown in full only when the user clicks a button such as "View all results".
For example, when the query string "Peking University" is entered, only the POI record named "Peking University" is shown. The other POI records are folded and become visible only when the user selects "View all results".
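The selection in step 170 can be sketched as a split of the scored results into those shown directly and those folded. The function name `split_by_relevance` and the parameter `keep` (the preset number of results to show) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def split_by_relevance(results: Sequence[str], scores: Sequence[float],
                       keep: int = 1) -> Tuple[List[str], List[str]]:
    """Return (shown, folded): the `keep` highest-scoring results and the rest."""
    if len(results) != len(scores):
        raise ValueError("results and scores must have the same length")
    # Indices ordered by descending relevance score.
    order = sorted(range(len(results)), key=lambda i: scores[i], reverse=True)
    shown = [results[i] for i in order[:keep]]
    folded = [results[i] for i in order[keep:]]
    return shown, folded
```

The `folded` list corresponds to the results that stay hidden until the user asks to view all results.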
As shown in Fig. 2, in one embodiment, the above step 150 comprises:
Step 151: for each search result, form a feature vector from the feature tag values corresponding to the multiple features.
Step 153: taking the feature vector as input, obtain the relevance score of the search result relative to the query string according to a pre-built regression model.
In this embodiment, the pre-built regression model is obtained, namely the parameter set w, that is $\beta_0, \beta_1, \beta_2, \dots, \beta_n$, together with the corresponding relevance score formula. The feature vector and the parameter set w are then substituted into the following formula to calculate the relevance score of the corresponding search result relative to the query string:
$$p(y_i = +1 \mid x_i, w) = \sigma(y_i w^T x_i) = \frac{1}{1 + \exp(-w^T x_i)}$$
where $y_i \in \{-1, +1\}$ indicates whether the corresponding search result is a positive sample (+1) or a negative sample (-1), and $x_i \in R^n$ is an n-dimensional vector representing the values of the i-th search result on the n features.
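As a sketch of this scoring step, the function below evaluates the logistic formula for one feature vector. It assumes the weights are stored as `[beta_0, beta_1, ..., beta_n]` with `beta_0` as the intercept, matching the expansion of the linear term used later in the text; the function name `relevance_score` is illustrative.

```python
import math
from typing import Sequence


def relevance_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Relevance score p(y = +1 | x, w) = 1 / (1 + exp(-w^T x)).

    weights  = [beta_0, beta_1, ..., beta_n] (beta_0 is the intercept)
    features = [x_1, ..., x_n]
    """
    if len(weights) != len(features) + 1:
        raise ValueError("expected one weight per feature plus an intercept")
    z = weights[0] + sum(b * x for b, x in zip(weights[1:], features))
    # Numerically stable evaluation of the logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

The score always lies between 0 and 1, so scores from different results can be compared directly when ranking.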
In one embodiment, before the above step 153, the method further comprises the following step:
building a regression model in advance from a given set of precise-search query strings and multiple features in the corresponding most relevant result data.
In this embodiment, the given set of precise-search query strings contains multiple precise query strings, each of which is used to perform a precise search. For example, in map search, a precise query string in the set may be "Beijing Peking University", which specifies that the POI data for "Peking University" is to be searched within Beijing.
The given set of precise-search query strings can be collected from search logs or by other means. The corresponding most relevant result data can also be collected from search logs, for example when the search logs record the query strings together with their most relevant result data. It can also be obtained by running searches on the given set of precise-search query strings. The possibilities are not enumerated here.
Machine learning is performed on the multiple features in the given set of precise-search query strings and the corresponding most relevant result data in order to build the regression model. Suitable machine learning methods include, but are not limited to, decision trees, support vector machines, artificial neural networks, and gradient boosting decision trees.
Building the regression model from a large-scale set of precise-search query strings and multiple features greatly improves the accuracy with which the regression model identifies the most relevant search results during search.
As shown in Fig. 3, in one embodiment, the above step of building a regression model in advance from a given set of precise-search query strings and multiple features in the corresponding most relevant result data comprises:
Step 301: obtain the given set of precise-search query strings and the most relevant result data corresponding to the query strings in the set.
Step 303: perform feature extraction on the most relevant result data to obtain the feature vectors corresponding to the most relevant result data.
In this embodiment, feature extraction is performed on the most relevant result data according to the predefined features, so that the most relevant result data is expressed in terms of the predefined features and the corresponding feature vectors are obtained.
Specifically, because the given set of precise-search query strings contains multiple precise query strings, the most relevant result data corresponding to the set contains several search results for each precise query string.
The feature tag values of each search result are therefore extracted according to the predefined features, so that each search result can be expressed as an N×1 feature vector, where N is the number of predefined features. The whole of the most relevant result data can then be expressed as an M×N feature matrix, where M is the number of search results in the most relevant result data.
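The M×N layout can be illustrated with a short sketch that turns M per-result feature dictionaries into M rows of N values in a fixed feature order. The dictionary representation and the name `build_feature_matrix` are assumptions; the text only specifies the dimensions.

```python
from typing import Dict, List, Sequence


def build_feature_matrix(results: Sequence[Dict[str, float]],
                         feature_names: Sequence[str]) -> List[List[float]]:
    """Return an M x N matrix: one row per search result, one column per feature."""
    return [[float(result[name]) for name in feature_names] for result in results]
```

Fixing the column order through `feature_names` keeps each learned weight aligned with the same feature across all rows.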
Step 305: perform regression learning on the feature vectors corresponding to the most relevant result data to build the regression model.
In this embodiment, machine learning is performed on the feature vectors corresponding to the most relevant result data, to build a regression model for identifying the most relevant search results.
Specifically, M training samples $(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_M, y_M)$ are given, where $x_i \in R^n$ is an n-dimensional vector representing the i-th sample, namely the values of the i-th search result in the most relevant result data on the n predefined features, and $y_i \in \{-1, +1\}$ indicates whether the sample is a positive sample (+1) or a negative sample (-1). The regression model uses the logistic function to link the feature vector $x_i$ of the i-th sample to the probability that this sample is positive, that is:
$$p(y_i = +1 \mid x_i, w) = \sigma(y_i w^T x_i) = \frac{1}{1 + \exp(-w^T x_i)}$$
where $w^T x_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in}$ and the parameter w has the form $\beta_0, \beta_1, \beta_2, \dots, \beta_n$. The parameter w weights the n dimensions of x differently to compute $w^T x_i$, and the S-shaped logistic function then compresses this value into the range 0 to 1, which gives the probability that the sample is positive.
The goal of the machine learning is to find a suitable w such that the relevance scores p of the positive samples are all relatively large while the relevance scores of the negative samples are all relatively small.
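One common way to find such a w is to maximize the log-likelihood $\sum_i \log \sigma(y_i w^T x_i)$ by gradient ascent. The text does not name an optimizer, so the batch gradient ascent below, the learning rate, the number of epochs, and the function names are all assumptions chosen for a small runnable sketch.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples: Sequence[Sequence[float]], labels: Sequence[int],
                   learning_rate: float = 0.1, epochs: int = 500) -> List[float]:
    """Fit w = [beta_0, ..., beta_n] for labels in {-1, +1}.

    Maximizes sum_i log sigma(y_i * w^T x_i) by batch gradient ascent.
    """
    n = len(samples[0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for x, y in zip(samples, labels):
            xb = [1.0] + list(x)  # leading 1.0 pairs with the intercept beta_0
            margin = y * sum(wj * xj for wj, xj in zip(w, xb))
            # d/dw log sigma(margin) = (1 - sigma(margin)) * y * x
            coeff = (1.0 - sigmoid(margin)) * y
            for j, xj in enumerate(xb):
                grad[j] += coeff * xj
        w = [wj + learning_rate * gj / len(samples) for wj, gj in zip(w, grad)]
    return w
```

After training, positive samples should receive scores above 0.5 and negative samples scores below 0.5 on data that the model can separate.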
In another embodiment, after the above step of building a regression model in advance from a given set of precise-search query strings and multiple features in the corresponding most relevant result data, the method further comprises:
obtaining the relevance label values of search results and the feature vectors corresponding to the search results, and optimizing the regression model according to the relevance label values and the feature vectors.
In this embodiment, the regression model can also be optimized continuously according to the relevance label values of search results and the corresponding feature vectors, in order to obtain a more suitable regression model.
The relevance label value of a search result is obtained by labeling according to a preset rule, and this preset rule is related to the predefined features. The relevance label value takes one of two values, 0 and 1: the relevance label value of the most relevant search result is 1, and the relevance label value of every other search result is 0.
The relevance score of a search result is obtained from its feature vector through the regression model currently in use. The error between the relevance score and the relevance label value is then used to adjust the regression model, which optimizes the model currently in use and continuously improves the accuracy of relevance processing in search.
As shown in Fig. 4, in one embodiment, the above step of obtaining the relevance label values of search results and the corresponding feature vectors, and optimizing the regression model according to the relevance label values and the feature vectors, comprises:
Step 401: obtain the relevance label values of the search results and the feature vectors corresponding to the search results.
Step 403: obtain the relevance scores of the search results from the corresponding feature vectors and the regression model.
In this embodiment, the feature vector corresponding to a search result is substituted into the formula $p(y_i = +1 \mid x_i, w) = \sigma(y_i w^T x_i) = \frac{1}{1 + \exp(-w^T x_i)}$ to calculate the relevance score of the search result, where the parameter set w used is the one obtained by machine learning.
Step 405: optimize the regression model according to the relevance label values and the relevance scores of the search results.
In this embodiment, the relevance label value and the relevance score of each search result are compared to obtain the error between the two. The shortcomings of the regression model are then identified from this error in order to optimize the regression model and obtain a better prediction model.
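The text does not specify how the error is turned into an adjustment of the model. One plausible reading, shown here purely as an assumption, is a stochastic gradient step in which each weight moves in proportion to the error (label minus score) and the corresponding feature value. The function name `refine_model` and the learning rate are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def refine_model(weights: Sequence[float], features: Sequence[float], label: int,
                 learning_rate: float = 0.05) -> Tuple[List[float], float]:
    """One update of w = [beta_0, ..., beta_n] from a labeled result.

    label is the relevance label value (1 for the most relevant result, else 0).
    Returns the updated weights and the error (label - relevance score).
    """
    xb = [1.0] + list(features)  # leading 1.0 pairs with the intercept beta_0
    z = sum(w * x for w, x in zip(weights, xb))
    if z >= 0:
        score = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        score = e / (1.0 + e)
    error = label - score
    # Assumed update rule: gradient step on the log-likelihood for 0/1 labels.
    updated = [w + learning_rate * error * x for w, x in zip(weights, xb)]
    return updated, error
```

Repeating this update over newly labeled search results moves the scores toward the labels, which matches the stated aim of reducing the error between relevance score and relevance label value.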
As shown in Fig. 5, in one embodiment, a relevance processing system in search comprises a query string search module 510, a feature extraction module 530, a processing module 550, and a relevance determining module 570.
The query string search module 510 is configured to obtain a query string and search according to the query string to obtain several search results.
In this embodiment, the query string search module 510 obtains the query string that the user enters on the search page, and the search engine searches according to the query string to obtain several search results related to it.
For example, the search performed by the user may be a map search. In that case, the search results that the query string search module 510 obtains by searching the map according to the query string are POI data, and each POI record contains information such as name, category, longitude, latitude, and importance.
The feature extraction module 530 is configured to perform feature extraction on the obtained search results one by one according to multiple predefined features, to obtain the feature tag value corresponding to each feature in each search result.
In this embodiment, multiple features are defined in advance so that the various attributes contained in each search result can be expressed through them. The feature extraction module 530 performs feature extraction on every search result according to the predefined features, so that each search result is expressed as feature tag values under those features. That is, each search result corresponds to multiple features, and each feature has its own feature tag value.
The predefined features differ from one search scenario to another, and the feature tag value corresponding to each feature is used to measure the degree of relevance between the search result and the query string.
The processing module 550 is configured to perform regression processing on the feature tag values corresponding to the features in each search result to obtain the relevance score of the search result relative to the query string.
In this embodiment, the processing module 550 applies logistic regression to the feature tag values corresponding to the multiple features in each search result, to obtain the relevance score of that search result relative to the query string.
The higher the relevance score, the more relevant the corresponding search result is to the query string.
The relevance determining module 570 is configured to determine the search results most relevant to the query string according to the relevance scores, and to display those search results.
In this embodiment, once every search result obtained by the search has its relevance score, the relevance determining module 570 can select a preset number of search results with the highest relevance scores according to the score values. These are the search results most relevant to the query string, and they are shown on the search page.
Further, on the search page obtained when the user searches with a query string, only the search results determined to be most relevant to the query string are displayed directly. The other search results are not displayed directly; for example, they are folded and are shown in full only when the user clicks a button such as "View all results".
For example, when the query string "Peking University" is entered, only the POI record named "Peking University" is shown. The other POI records are folded and become visible only when the user selects "View all results".
As shown in Fig. 6, in one embodiment, the above processing module 550 comprises a vector forming unit 551 and a model input unit 553.
The vector forming unit 551 is configured to form, for each search result, a feature vector from the feature tag values corresponding to the multiple features.
The model input unit 553 is configured to take the feature vector as input and obtain the relevance score of the search result relative to the query string according to a pre-built regression model.
In this embodiment, the model input unit 553 obtains the pre-built regression model, namely the parameter set w, that is $\beta_0, \beta_1, \beta_2, \dots, \beta_n$, together with the corresponding relevance score formula. The feature vector and the parameter set w are then substituted into the following formula to calculate the relevance score of the corresponding search result relative to the query string:
$$p(y_i = +1 \mid x_i, w) = \sigma(y_i w^T x_i) = \frac{1}{1 + \exp(-w^T x_i)}$$
where $y_i \in \{-1, +1\}$ indicates whether the corresponding search result is a positive sample (+1) or a negative sample (-1), and $x_i \in R^n$ is an n-dimensional vector representing the values of the i-th search result on the n features.
As shown in Fig. 7, in one embodiment, the system further comprises a model construction module 710.
The model construction module 710 is configured to build a regression model in advance from a given set of precise-search query strings and multiple features in the corresponding most relevant result data.
In this embodiment, the given set of precise-search query strings contains multiple precise query strings, each of which is used to perform a precise search. For example, in map search, a precise query string in the set may be "Beijing Peking University", which specifies that the POI data for "Peking University" is to be searched within Beijing.
The given set of precise-search query strings can be collected from search logs or by other means. The corresponding most relevant result data can also be collected from search logs, for example when the search logs record the query strings together with their most relevant result data. It can also be obtained by running searches on the given set of precise-search query strings. The possibilities are not enumerated here.
The model construction module 710 performs machine learning on the multiple features in the given set of precise-search query strings and the corresponding most relevant result data in order to build the regression model. Suitable machine learning methods include, but are not limited to, decision trees, support vector machines, artificial neural networks, and gradient boosting decision trees.
By building the regression model from a large-scale set of precise-search query strings and multiple features, the model construction module 710 greatly improves the accuracy with which the regression model identifies the most relevant search results during search.
As shown in Figure 8, in one embodiment, the model construction module 710 comprises an acquiring unit 711, a feature processing unit 713 and a learning unit 715.
The acquiring unit 711 is configured to obtain the given set of precise-search query strings and the most relevant result data corresponding to the query strings in that set.
The feature processing unit 713 is configured to perform feature extraction on the most relevant result data to obtain the feature vectors corresponding to the most relevant result data.
In this embodiment, the feature processing unit 713 performs feature extraction on the most relevant result data according to the predefined multiple features, so that the most relevant result data is represented in terms of those features and the corresponding feature vectors are obtained.
Specifically, because the given set of precise-search query strings contains multiple precise query strings, the most relevant result data corresponding to the set contains a number of search results corresponding to each precise query string.
Accordingly, the feature processing unit 713 extracts, for each search result, the feature label value for each of the predefined features, so that each search result can be expressed as an N×1 feature vector, where N is the number of predefined features. The whole of the most relevant result data can then be expressed as an M×N feature matrix, where M is the number of search results in the most relevant result data.
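The mapping from search results to an M×N feature representation can be illustrated with a minimal sketch. The three feature functions below (`name_in_query`, `city_in_query`, `name_length_ratio`) and the dictionary layout of a search result are assumptions made purely for illustration; this section does not say which predefined features the embodiment uses.

```python
from typing import Callable, Dict, List, Tuple

Result = Dict[str, str]
Feature = Callable[[str, Result], float]


# Hypothetical predefined features; each returns one feature label value.
def name_in_query(query: str, result: Result) -> float:
    """1.0 if the result name appears verbatim in the query string."""
    return 1.0 if result["name"] in query else 0.0


def city_in_query(query: str, result: Result) -> float:
    """1.0 if the result's city appears in the query string."""
    return 1.0 if result["city"] in query else 0.0


def name_length_ratio(query: str, result: Result) -> float:
    """Length of the result name relative to the query length."""
    return len(result["name"]) / max(len(query), 1)


PREDEFINED_FEATURES: List[Feature] = [name_in_query, city_in_query, name_length_ratio]


def feature_vector(query: str, result: Result,
                   features: List[Feature] = PREDEFINED_FEATURES) -> List[float]:
    """Express one search result as N feature label values."""
    return [feature(query, result) for feature in features]


def feature_matrix(samples: List[Tuple[str, Result]],
                   features: List[Feature] = PREDEFINED_FEATURES) -> List[List[float]]:
    """Express M search results as an M x N list of rows."""
    return [feature_vector(query, result, features) for query, result in samples]


samples = [
    ("Beijing Peking University", {"name": "Peking University", "city": "Beijing"}),
    ("Beijing Peking University", {"name": "Peking University Hospital", "city": "Beijing"}),
]
matrix = feature_matrix(samples)  # M = 2 rows, N = 3 columns
```

Each row of `matrix` corresponds to one search result and each column to one predefined feature, so adding a feature only requires appending a function to `PREDEFINED_FEATURES`.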
The learning unit 715 is configured to perform regression learning on the feature vectors corresponding to the most relevant result data to build the regression model.
In this embodiment, the learning unit 715 performs machine learning on the feature vectors corresponding to the most relevant result data to build the regression model used to identify the most relevant search result.
Specifically, M training samples $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_M, y_M)$ are given, where $x_i \in \mathbb{R}^n$ is an n-dimensional vector representing the i-th sample, that is, the values of the i-th search result in the most relevant result data on the n predefined features, and $y_i \in \{-1, +1\}$ indicates whether the sample is a positive sample (+1) or a negative sample (-1). The regression model uses the logistic function to link the feature vector $x_i$ of the i-th sample to the probability that the sample is positive, that is:
$$p(y_i = +1 \mid x_i, w) = \sigma(y_i w^T x_i) = \frac{1}{1 + \exp(-w^T x_i)}$$
where $w^T x_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_n x_{in}$ and the parameter w takes the form $\beta_0, \beta_1, \beta_2, \ldots, \beta_n$. The parameter w weights the n dimensions of x differently to compute $w^T x_i$, which the S-shaped logistic function then maps into the range 0 to 1 to give the probability that the sample is positive.
The goal of the machine learning performed by the learning unit 715 is to find a suitable w such that the relevance scores P of the positive samples are all relatively large while the relevance scores of the negative samples are all relatively small.
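One way to search for such a w is batch gradient ascent on the log-likelihood of the logistic model. The sketch below is an assumption about how the learning could be carried out, not the embodiment's stated procedure; the learning rate, the epoch count and the toy training set are illustrative only.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(xs: Sequence[Sequence[float]], ys: Sequence[int],
                   lr: float = 0.5, epochs: int = 3000) -> List[float]:
    """Fit w = (beta_0, ..., beta_n) for labels y_i in {-1, +1}.

    Maximizes sum_i log sigma(y_i * w^T x_i) by batch gradient ascent.
    """
    n = len(xs[0])
    w = [0.0] * (n + 1)  # w[0] plays the role of beta_0
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for x, y in zip(xs, ys):
            xb = [1.0, *x]  # constant 1 multiplies beta_0
            margin = y * sum(wj * xj for wj, xj in zip(w, xb))
            coeff = (1.0 - sigmoid(margin)) * y
            for j, xj in enumerate(xb):
                grad[j] += coeff * xj
        w = [wj + lr * gj / len(xs) for wj, gj in zip(w, grad)]
    return w


def relevance_score(w: Sequence[float], x: Sequence[float]) -> float:
    """p(y = +1 | x, w) = 1 / (1 + exp(-w^T x))."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


# Toy data: only the sample matching on both hypothetical features is positive.
xs = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
ys = [+1, -1, -1, -1]
w = train_logistic(xs, ys)
```

After training on this toy set, the positive sample receives a relevance score above 0.5 and the negative samples receive scores below 0.5, which is the separation the learning goal describes.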
In another embodiment, the system described above further comprises an optimization module. The optimization module is configured to obtain the relevance label values of search results and the feature vectors corresponding to the search results, and to optimize the regression model according to the relevance label values and the feature vectors.
In this embodiment, the optimization module may continually optimize the regression model according to the relevance label values of search results and the feature vectors corresponding to the search results, so as to obtain a more suitable regression model.
The relevance label value of a search result is obtained by labeling according to a preset rule, and the preset rule is related to the predefined multiple features. The relevance label value takes one of two values, 0 and 1: the relevance label value of the most relevant search result is 1, and that of every other search result is 0.
The optimization module obtains the relevance score of a search result from its feature vector using the regression model currently in use, then adjusts the regression model according to the error between the relevance score and the relevance label value, so as to optimize the regression model currently in use and continually improve the accuracy of relevance processing in search.
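The error-driven adjustment can be sketched as a per-sample update in which the weights move in proportion to the difference between the relevance label value (0 or 1) and the current relevance score. The update rule, the learning rate and the sample data below are assumptions chosen for illustration; the text does not specify how the error is used to adjust the model.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, relevance label 0 or 1)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w: Sequence[float], x: Sequence[float]) -> float:
    """Relevance score of one search result under the current model."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def mean_abs_error(w: Sequence[float], samples: Sequence[Sample]) -> float:
    """Average gap between relevance label values and relevance scores."""
    return sum(abs(label - score(w, x)) for x, label in samples) / len(samples)


def optimize(w: Sequence[float], samples: Sequence[Sample],
             lr: float = 0.1, passes: int = 200) -> List[float]:
    """Adjust the model so that scores move toward the 0/1 labels."""
    w = list(w)
    for _ in range(passes):
        for x, label in samples:
            error = label - score(w, x)  # positive if the score is too low
            w[0] += lr * error
            for j, xj in enumerate(x, start=1):
                w[j] += lr * error * xj
    return w


labelled = [([1.0, 1.0], 1), ([1.0, 0.0], 0), ([0.0, 1.0], 0)]
current_w = [0.0, 0.0, 0.0]          # model currently in use (illustrative)
before = mean_abs_error(current_w, labelled)
optimized_w = optimize(current_w, labelled)
after = mean_abs_error(optimized_w, labelled)
```

Comparing `before` and `after` gives a simple check that the adjusted model fits the relevance label values more closely than the model it started from.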
As shown in Figure 9, in one embodiment, the optimization module comprises a value acquiring unit 901, a relevance computing unit 903 and a model optimization unit 905.
The value acquiring unit 901 is configured to obtain the feature label values of search results and the feature vectors corresponding to the search results.
The relevance computing unit 903 is configured to obtain the relevance score of a search result from the feature vector corresponding to the search result and the regression model.
In this embodiment, the relevance computing unit 903 inputs the feature vector corresponding to the search result into the formula $p(y_i = +1 \mid x_i, w) = \sigma(y_i w^T x_i) = \frac{1}{1 + \exp(-w^T x_i)}$ to calculate the relevance score of the search result, where the parameter set w used is obtained by machine learning.
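As a worked instance of this calculation, the sketch below scores two candidate search results with a fixed parameter set and orders them by relevance score. The values in `w`, the result identifiers and the two-feature vectors are hypothetical; in the embodiment w would come from machine learning.

```python
import math
from typing import List, Sequence, Tuple


def relevance_score(w: Sequence[float], x: Sequence[float]) -> float:
    """1 / (1 + exp(-w^T x)), with w[0] acting as beta_0."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)


def rank_results(w: Sequence[float],
                 candidates: Sequence[Tuple[str, Sequence[float]]]
                 ) -> List[Tuple[str, float]]:
    """Return (result id, relevance score) pairs, highest score first."""
    scored = [(rid, relevance_score(w, x)) for rid, x in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


w = [-3.0, 2.0, 2.0]  # hypothetical learned parameters beta_0, beta_1, beta_2
candidates = [
    ("poi_B", [1.0, 0.0]),  # w^T x = -1
    ("poi_A", [1.0, 1.0]),  # w^T x = +1
]
ranking = rank_results(w, candidates)
best_id, best_score = ranking[0]
```

Here `poi_A` has $w^T x = 1$ and `poi_B` has $w^T x = -1$, so `poi_A` is placed first as the most relevant result.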
The model optimization unit 905 is configured to optimize the regression model according to the relevance label values and the relevance scores of the search results.
In this embodiment, the model optimization unit 905 compares the relevance label value of each search result with its relevance score to obtain the error between the two, identifies the shortcomings of the regression model from this error, and optimizes the regression model accordingly to obtain a better prediction model.
Figure 10 is a schematic diagram of a server architecture provided by an embodiment of the present invention. The server 1000 may vary considerably depending on configuration or performance, and may comprise one or more central processing units (CPU) 1022 (for example, one or more processors), a memory 1032, and one or more storage media 1030 (for example, one or more mass storage devices) storing application programs 1042 or data 1044. The memory 1032 and the storage medium 1030 may provide transient or persistent storage. The program stored in the storage medium 1030 may comprise one or more modules (not shown in the figure), for example the query string search module 510, the feature extraction module 530, the processing module 550 and the correlation determining module 570 of Fig. 5, and each module may comprise a series of instruction operations for the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and to execute, on the server 1000, the series of instruction operations in the storage medium 1030. The server 1000 may also comprise one or more power supplies 1026, one or more wired or wireless network interfaces 550, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™ or FreeBSD™. The steps performed by the server in the embodiments shown in Figs. 1 to 4 above may be based on the server architecture shown in Figure 10.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc or the like.
A person of ordinary skill in the art will understand that all or part of the flows of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium; in the embodiments of the present invention, the program may be stored in the storage medium of a computer system and executed by at least one processor of that computer system to carry out the flows of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like.
The embodiments described above represent only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.

Claims (12)

1. A relevance processing method in search, comprising the steps of:
obtaining a query string, and performing a search according to the query string to obtain a number of search results;
performing feature extraction, one by one, on the obtained search results according to predefined multiple features, to obtain the feature label value corresponding to each feature in each search result;
performing regression processing on the feature label values corresponding to the features in each search result to obtain a relevance score of the search result relative to the query string;
determining, according to the relevance scores, the search result most relevant to the query string, and displaying that search result.
2. The method according to claim 1, characterized in that the step of performing regression processing on the feature label values corresponding to the features in each search result to obtain the relevance score of the search result relative to the query string comprises:
forming, in each search result, a feature vector from the feature label values corresponding to the multiple features;
taking the feature vector as input, obtaining the relevance score of the search result relative to the query string according to a regression model built in advance.
3. The method according to claim 2, characterized in that, before the step of taking the feature vector as input and obtaining the relevance score of the search result relative to the query string through the regression model built in advance, the method further comprises:
building the regression model in advance according to a given set of precise-search query strings and multiple features in the corresponding most relevant result data.
4. The method according to claim 3, characterized in that the step of building the regression model in advance according to the given set of precise-search query strings and the multiple features in the corresponding result data comprises:
obtaining the given set of precise-search query strings and the most relevant result data corresponding to the query strings in the set of precise-search query strings;
performing feature extraction on the most relevant result data to obtain the feature vectors corresponding to the most relevant result data;
performing regression learning on the feature vectors corresponding to the most relevant result data to build the regression model.
5. The method according to claim 3, characterized in that, after the step of building the regression model in advance according to the given set of precise-search query strings and the multiple features in the corresponding result data, the method further comprises:
obtaining the relevance label values of search results and the feature vectors corresponding to the search results, and optimizing the regression model according to the relevance label values and the feature vectors.
6. The method according to claim 5, characterized in that the step of obtaining the relevance label values of search results and the feature vectors corresponding to the search results, and optimizing the regression model according to the relevance label values and the feature vectors, comprises:
obtaining the feature label values of the search results and the feature vectors corresponding to the search results;
obtaining the relevance scores of the search results from the feature vectors corresponding to the search results and the regression model;
optimizing the regression model according to the relevance label values and the relevance scores of the search results.
7. A relevance processing system in search, characterized by comprising:
a query string search module, configured to obtain a query string and perform a search according to the query string to obtain a number of search results;
a feature extraction module, configured to perform feature extraction, one by one, on the obtained search results according to predefined multiple features, to obtain the feature label value corresponding to each feature in each search result;
a processing module, configured to perform regression processing on the feature label values corresponding to the features in each search result to obtain a relevance score of the search result relative to the query string;
a correlation determining module, configured to determine, according to the relevance scores, the search result most relevant to the query string, and to display that search result.
8. The system according to claim 7, characterized in that the processing module comprises:
a vector forming unit, configured to form, in each search result, a feature vector from the feature label values corresponding to the multiple features;
a model input unit, configured to take the feature vector as input and obtain the relevance score of the search result relative to the query string according to a regression model built in advance.
9. The system according to claim 8, characterized in that the system further comprises:
a model construction module, configured to build the regression model in advance according to a given set of precise-search query strings and multiple features in the corresponding most relevant result data.
10. The system according to claim 9, characterized in that the model construction module comprises:
an acquiring unit, configured to obtain the given set of precise-search query strings and the most relevant result data corresponding to the query strings in the set of precise-search query strings;
a feature processing unit, configured to perform feature extraction on the most relevant result data to obtain the feature vectors corresponding to the most relevant result data;
a learning unit, configured to perform regression learning on the feature vectors corresponding to the most relevant result data to build the regression model.
11. The system according to claim 9, characterized in that the system further comprises:
an optimization module, configured to obtain the relevance label values of search results and the feature vectors corresponding to the search results, and to optimize the regression model according to the relevance label values and the feature vectors.
12. The system according to claim 11, characterized in that the optimization module comprises:
a value acquiring unit, configured to obtain the feature label values of the search results and the feature vectors corresponding to the search results;
a relevance computing unit, configured to obtain the relevance scores of the search results from the feature vectors corresponding to the search results and the regression model;
a model optimization unit, configured to optimize the regression model according to the relevance label values and the relevance scores of the search results.
CN201410294419.2A 2014-06-25 2014-06-25 Correlation treatment method and system in search Active CN104615621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410294419.2A CN104615621B (en) 2014-06-25 2014-06-25 Correlation treatment method and system in search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410294419.2A CN104615621B (en) 2014-06-25 2014-06-25 Correlation treatment method and system in search

Publications (2)

Publication Number Publication Date
CN104615621A true CN104615621A (en) 2015-05-13
CN104615621B CN104615621B (en) 2017-11-21

Family

ID=53150069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410294419.2A Active CN104615621B (en) 2014-06-25 2014-06-25 Correlation treatment method and system in search

Country Status (1)

Country Link
CN (1) CN104615621B (en)

Also Published As

Publication number Publication date
CN104615621B (en) 2017-11-21

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant