CN1996316A - Search engine searching method based on web page correlation - Google Patents
- Publication number
- CN1996316A (application CN 200710056425)
- Authority
- CN
- China
- Prior art keywords
- user
- webpage
- search engine
- click
- method based
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to a search engine searching method based on web page correlation. The method gives the user two result lists within a single query: the information carried by the user's first click is used to resolve the problems of one meaning expressed by several words and one word carrying several meanings, and the inability of keyword-based search engines to determine the user's query intent. The user thus receives web pages that are related both to the keyword and to the page the user showed interest in, without any added operating complexity. In addition, click data are used to update the dissimilarity matrix, so that the dissimilarity reflects the judgment embodied in a large volume of user data. Because the analysis relies on page-level correlation that is statistically stable, the method can provide a service that is optimal in the statistical sense without long-term tracking of any particular user.
Description
Technical field
The invention belongs to the technical field of search engine searching in computer networks, and in particular relates to a search engine searching method based on web page correlation.
Background technology
Search engine technology uses combinations of keywords to find relevant information on the network, ranks that information by how well it matches the keywords, and returns it to the user. With the rapid growth of the Internet, search engines have become the main way for network users to obtain Internet resources. In recent years a wide variety of search engines have appeared around the world, and they play a very important role in how people acquire information. Today's main search engines fall into two groups: directory-based search engines and keyword-based search engines. A directory-based search engine classifies its web page library in advance; the user decides which category of page is needed and then looks it up in the corresponding directory. The most representative directory-based search engine at present is Yahoo [http://www.yahoo.com]. To give the user the best possible set of results, however, a very fine-grained classification is usually required, and applying existing manual or automatic classification techniques to the massive volume of information on the network is impractical. Moreover, even if a search engine did offer such a fine classification, the user's selection process would become very complicated, and there is no guarantee that the user's judgment would coincide exactly with the search engine's existing categories.
Most search engines on the Internet today use keyword-based query techniques; typical examples are Google [http://www.google.com] and Baidu [http://www.baidu.com].
The volume of information that such a search engine collects and indexes automatically is enormous, while a user's query usually consists of only a few words. Because words themselves are ambiguous, the search engine has difficulty determining what the user needs. The outcome is a huge number of results whose relevance cannot be guaranteed, so the user has to spend a great deal of effort browsing and filtering them. In short, the quality of the information that current search engines provide is not very high.
In addition, the ranking algorithms that search engines use generally include the following. (1) Ranking based on term-frequency statistics. Many early search engines ranked pages by term frequency. The term weight usually takes into account where the word occurs in the page; for example, a word in the title is weighted higher than a word in the body text. Because the number of network resources is enormous, however, two pages with the same term frequency may differ greatly in quality, and a page-to-keyword relevance computed from term frequency alone is unreliable, so the limitations of this algorithm are obvious. (2) Ranking based on hyperlink analysis. In traditional information retrieval theory, citation analysis is one of the important methods for determining the authority of academic literature: the authority of a document is determined by the quality and quantity of the citations it receives. Hyperlink-analysis ranking borrows this idea and applies it to the importance of network documents. Using the hyperlink structure of the network itself, it assigns every page an importance score based on how many times the page is cited and how important the citing pages are, and uses that score to help optimize ranking. What this algorithm yields, however, is the importance of the page itself rather than the relevance between the page and the keywords of the user's query. As a result, the pages ranked highest in a query result are often of high quality in themselves but not necessarily closely related to what the user is asking for.
Summary of the invention
To solve the above problems, the object of the present invention is to provide a search engine searching method based on web page correlation that can accurately identify the user's needs without increasing operating complexity, and can thereby improve the correlation between the search engine's results and those needs.
To achieve this object, the search engine searching method based on web page correlation provided by the invention comprises the following steps, carried out in order:
(1) while the search engine is running, record over a period of time the click behavior of network users on the search engine's result lists;
(2) use a method based on the vector space model to calculate and store the dissimilarity between all web pages;
(3) use the click data recorded in step 1 to update all the web page dissimilarities obtained in step 2;
(4) treat the web page dissimilarities obtained in step 3 as distances between web pages and reduce the dimensionality of these distance data with a dimensionality reduction algorithm, thereby obtaining a low-dimensional geometric representation of the dissimilarity data;
(5) on receiving a query request from a user, the search engine carries out the following steps:
(a) the search engine accepts the query keyword entered by the user, uses some relevance calculation method to produce an initial result list for that keyword, and presents the list to the user;
(b) after viewing the initial result list, the user clicks a link of interest;
(c) the search engine records the first link the user clicks and designates the corresponding web page as the target page; using the low-dimensional geometric representation obtained in step 4, it then calculates the dissimilarity between the target page and the pages corresponding to all links in the initial result list, and arranges those pages in ascending order of dissimilarity to form a new query result;
(d) the new query result is presented to the user; it is the final query result, related to the first page the user clicked and highly relevant to the query keyword the user entered.
The recording in step 1 runs on a monthly cycle, with long-term dynamic tracking.
The search engine searching method based on web page correlation provided by the invention has the following beneficial effects:
1. The invention provides the user with two result lists within a single query. Using the information carried by the user's first click, it effectively resolves the problems of one meaning expressed by several words and one word carrying several meanings, and overcomes the inability of keyword-based search engines to determine the user's query intent accurately. Providing a second result list based on this first click gives the user pages that are related both to the keyword and to the page the user is interested in, without making the user's operation any more complicated.
2. Experience and intuition suggest that only pages that are similar and highly correlated are likely to be visited together by a user, so click data contain users' judgments about how pages differ. Updating the dissimilarity matrix with click data judges the dissimilarity between pages from a new angle: it is a dissimilarity in the statistical sense, embodied in a large volume of data, and reflects the judgments that large numbers of users make while using the search engine. The invention therefore analyzes page-level correlation (dissimilarity), which is statistically stable, and can provide a user with a service optimized in the statistical sense without tracking that particular user's behavior over the long term.
Embodiment
The search engine searching method based on web page correlation provided by the invention collects users' click behavior data to determine the type of content a user actually needs, and at the same time uses the click data as one basis for judging the correlation between web pages, thereby improving the correlation between query results and user needs.
A user of a search engine does not normally click links in the result list at random but makes deliberate choices, so click data constitute implicit feedback that is rich in information. Because users tend to click the links that best match their needs, the search engine can infer a user's immediate need by tracking the links the user clicks, and so resolve the ambiguity of the query words. If the search engine can provide a dynamic query result that is related not only to the query words but also to the content of the link the user has just clicked, it can determine what the user means by the query words and adapt the results to the user's needs.
Within a single query a user's need is usually fairly narrow, and the user generally does not click without reason, so the links clicked during one query are strongly correlated with one another. The invention stores this co-click information in an n × n matrix and uses it as the basis for updating the correlation between web pages. That is, the invention maintains web page content dissimilarities derived from the click data of a large number of users; for each query request, it identifies the query topic and intent by tracking the user's click together with this dissimilarity information, and finally gives the user a final query result that is related to the first page the user clicked and highly relevant to the query keyword the user entered.
The search engine searching method based on web page correlation provided by the invention is described in detail below:
The search engine searching method based on web page correlation provided by the invention comprises the following steps, carried out in order:
(1) while the search engine is running, record over a period of time the click behavior of network users on the search engine's result lists; because click behavior data must accumulate, this step has to continue for some time while the search engine operates.
(2) use a method based on the vector space model to calculate and store the dissimilarity between all web pages; web page dissimilarity is the opposite of web page relevance and quantifies how much two pages differ: the more relevant two pages are, the smaller their dissimilarity.
In this process, the dissimilarity matrix D is first established and then kept up to date. The following data structures are maintained:
Co-access count matrix A: an n×n symmetric matrix that stores, for every pair of web pages, the number of times the two pages were visited together.
Click count vector B: an n×1 vector whose elements $b_i$ are non-negative integers in $[0, +\infty)$; each element stores the total number of clicks received by the corresponding web page.
Initial dissimilarity matrix $D^0$: an n×n symmetric matrix calculated from the vector space model. Let $Doc = \{doc_i \mid 1 \le i \le n\}$ denote a set of web pages. Under the vector space model, each web page $doc_i$ can be represented as a vector $\mathbf{doc}_i$, and the element $d_{ij}^0$ in row i, column j of $D^0$ is defined in terms of these vectors [equation not reproduced in this text], where $\|\cdot\|_2$ denotes the 2-norm. By definition, $d_{ij}^0$ is normalized to values in [0, 1], and the elements of $D^0$ satisfy the metric axioms (satisfying the metric axioms is a necessary property for D to admit a geometric embedding).
Click dissimilarity matrix C: an n×n matrix whose elements are defined directly [equation not reproduced in this text].
Dissimilarity matrix D: an n×n symmetric matrix. The element $d_{ij}$ in row i, column j stores the dissimilarity between the i-th and j-th web pages and is defined by [equation not reproduced in this text], where w is a user parameter, 0 < w < 1. w is set to 0 in the initial state and is raised gradually as the system's running time increases. After a sufficiently long time, w may take the value 1. w can also be adjusted to specific needs: if some web page has received only a few clicks, its click data are less reliable, so w can be given a smaller value, in which case the dissimilarity depends mainly on the value calculated by the VSM method.
Compressed representation Y of D: an n×d matrix obtained by processing D with a dimensionality reduction algorithm. Each element $d_{ij}$ of D is represented as the distance between the row vectors i and j of Y. The dissimilarity between all web pages can therefore be expressed by Euclidean distances between the vectors in Y.
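The defining equations for $D^0$ and D are not available in this text, so the following is only a minimal sketch under stated assumptions rather than the patent's own formulas. It assumes that $d_{ij}^0$ is the Euclidean distance between the unit-normalized page vectors, divided by $\sqrt{2}$ (which lies in [0, 1] for non-negative term weights), and that D is the convex combination $(1 - w) D^0 + w C$, which is consistent with the description of w above. The click matrix C is taken as an input, and the function names are hypothetical.

```python
import math


def unit(vec):
    """Scale a term-weight vector to unit 2-norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def initial_dissimilarity(docs):
    """ASSUMED form of D0: distance between unit vectors divided by sqrt(2)."""
    units = [unit(d) for d in docs]
    n = len(units)
    d0 = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(units[i], units[j]) / math.sqrt(2)
            d0[i][j] = d0[j][i] = dist  # keep the matrix symmetric
    return d0


def blend(d0, c, w):
    """ASSUMED combination: D = (1 - w) * D0 + w * C, element by element."""
    n = len(d0)
    return [[(1 - w) * d0[i][j] + w * c[i][j] for j in range(n)] for i in range(n)]
```

With w = 0 the blend returns $D^0$ unchanged, matching the stated initial state; with w = 1 it depends only on the click data.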
(3) use the click data recorded in step 1 to update all the web page dissimilarities obtained in step 2; the dissimilarity between any two web pages is updated as follows: (a) analyze the click data recorded in step 1; if the click data show that the two web pages appeared together in some query result and were both opened by the user at that time, the co-click count between the two pages is incremented by 1. Once all the click data from step 1 have been processed, the total co-click count between the two pages over the period covered by step 1 is obtained.
(4) treat the web page dissimilarities obtained in step 3 as distances between web pages and reduce the dimensionality of these distance data with a dimensionality reduction algorithm, thereby obtaining a low-dimensional geometric representation of the dissimilarity data; this yields the data the search engine needs to calculate web page dissimilarity when producing query results.
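Step (4) does not name a particular dimensionality reduction algorithm. As one possibility, the sketch below uses metric multidimensional scaling with SMACOF (Guttman transform) iterations, which places each page at a point so that Euclidean distances approximate the given dissimilarities. The choice of SMACOF, the function names, and the parameters `dim`, `iters`, and `seed` are all assumptions made for illustration.

```python
import math
import random


def pairwise(y):
    """Euclidean distances between all rows of the embedding y."""
    n = len(y)
    return [[math.dist(y[i], y[j]) for j in range(n)] for i in range(n)]


def stress(d, y):
    """Sum of squared gaps between target dissimilarities and embedded distances."""
    e = pairwise(y)
    n = len(d)
    return sum((d[i][j] - e[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))


def embed(d, dim=2, iters=200, seed=0):
    """Metric MDS by SMACOF iterations; returns an n x dim list of coordinates."""
    rng = random.Random(seed)
    n = len(d)
    y = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
    for _ in range(iters):
        e = pairwise(y)
        new_y = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j or e[i][j] == 0.0:
                    continue  # coincident points contribute nothing
                ratio = d[i][j] / e[i][j]
                for k in range(dim):
                    new_y[i][k] += ratio * (y[i][k] - y[j][k])
            new_y[i] = [v / n for v in new_y[i]]  # Guttman transform, unit weights
        y = new_y
    return y
```

Each iteration does not increase the stress, so running more iterations trades time for a closer fit. For a real page collection a library implementation would be the practical choice; this pure-Python version is only meant to show the idea on small inputs.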
In steps 3 and 4 above, the dissimilarity matrix is updated periodically. The update procedure is as follows:
1. Generate the initial dissimilarity matrix $D^0$ according to the vector space model.
2. For each query event, generate the query result set by some method (no particular algorithm is required). The links in the result set are presented to the user in order, each with a summary of the corresponding web page.
3. After viewing the list, the user clicks several links according to the need at hand. The search engine records the clicked links and increments the co-access count between the clicked web pages, as follows. For clicked web pages i and j, execute
$a_{ij} = a_{ij} + 1 \quad (4)$
$b_i = b_i + 1 \quad (5)$
$b_j = b_j + 1 \quad (6)$
If only one web page i was opened, execute
$b_i = b_i + 1 \quad (7)$
4. The search engine periodically recomputes D from A, B, and $D^0$, and reduces the dimensionality of D to obtain its compressed geometric representation Y. The dissimilarity between web pages is thus represented as Euclidean distance in a d-dimensional embedding space, with $d \ll n$.
5. When a new web page is added, the system calculates the dissimilarity between the new page and the other pages with the method based on the vector space model, and sets the w parameter for this page to 0. Once the page has received a certain number of clicks, w is adjusted to a reasonable non-zero value.
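The counting in equations (4) to (7) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `record_query_clicks` is hypothetical, and the sketch reads the equations as "each clicked page gains one click per query, and each pair of pages clicked in the same query gains one co-access", which is one interpretation when more than two pages are clicked.

```python
from itertools import combinations


def record_query_clicks(a, b, clicked):
    """Update co-access counts A and click totals B for one query event.

    a: n x n list of lists (symmetric co-access counts)
    b: list of n click totals
    clicked: page indices the user opened during this query
    """
    pages = sorted(set(clicked))  # ignore repeated clicks on the same page
    for i in pages:
        b[i] += 1  # equations (5), (6) and (7)
    for i, j in combinations(pages, 2):
        a[i][j] += 1  # equation (4)
        a[j][i] += 1  # mirror the update so A stays symmetric
```

When only one page is opened, `combinations` yields no pairs, so only the click total changes, as in equation (7).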
(5) on receiving a query request from a user, the search engine carries out the following steps:
(a) the search engine accepts the query keyword entered by the user, uses some relevance calculation method to produce an initial result list for that keyword, and presents the list to the user;
(b) after viewing the initial result list, the user clicks a link of interest;
(c) the search engine records the first link the user clicks and designates the corresponding web page as the target page; using the low-dimensional geometric representation obtained in step 4, it then calculates the dissimilarity between the target page and the pages corresponding to all links in the initial result list, and arranges those pages in ascending order of dissimilarity to form a new query result;
(d) the new query result is presented to the user; it is the final query result, related to the first page the user clicked and highly relevant to the query keyword the user entered.
In this step, when a user uses the search engine, the following procedure is carried out for each query request:
1. Generate the initial query result set r with a method based on the vector space model. Suppose r contains m web pages at this point.
2. After the user has examined the initial query results and clicked a link, the search engine records this link (called the target page; let its ID in the web page library be i). It then calculates the dissimilarity between the target page i and the other web pages in r (that is, the distances between the corresponding row vectors of Y), obtaining the difference vector $d_i$ [equation not reproduced in this text]
(it is also possible to calculate the dissimilarity between the target page and all other web pages and to take the pages with the smallest dissimilarity as an extension of the query result set).
The web pages in r are arranged in ascending order of the corresponding dissimilarity in $d_i$ and presented to the user; this is the final result that the search engine gives the user.
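The re-ranking described above can be sketched as follows, assuming Y is available as a list of coordinate rows indexed by page ID. The function names `rerank` and `nearest` are hypothetical; `nearest` illustrates the optional expansion to pages outside the initial result set.

```python
import math


def rerank(y, initial_results, target):
    """Order the pages of the initial result list by distance to the clicked page in Y."""
    return sorted(initial_results, key=lambda page: math.dist(y[page], y[target]))


def nearest(y, target, k):
    """Optional expansion: the k pages in the whole library closest to the target."""
    others = [p for p in range(len(y)) if p != target]
    return sorted(others, key=lambda p: math.dist(y[p], y[target]))[:k]


# Toy embedding of four pages; page 0 is the one the user clicked.
y = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
print(rerank(y, [3, 0, 2, 1], target=0))  # [0, 2, 3, 1]
```

In this sketch the target page stays in the list and sorts first because its distance to itself is zero; whether to show it again is a presentation choice the text leaves open.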
Claims (2)
1. A search engine searching method based on web page correlation, characterized in that the method comprises the following steps, carried out in order:
(1) while the search engine is running, record over a period of time the click behavior of network users on the search engine's result lists;
(2) use a method based on the vector space model to calculate and store the dissimilarity between all web pages;
(3) use the click data recorded in step 1 to update all the web page dissimilarities obtained in step 2;
(4) treat the web page dissimilarities obtained in step 3 as distances between web pages and reduce the dimensionality of these distance data with a dimensionality reduction algorithm, thereby obtaining a low-dimensional geometric representation of the dissimilarity data;
(5) on receiving a query request from a user, the search engine carries out the following steps:
(a) the search engine accepts the query keyword entered by the user, uses some relevance calculation method to produce an initial result list for that keyword, and presents the list to the user;
(b) after viewing the initial result list, the user clicks a link of interest;
(c) the search engine records the first link the user clicks and designates the corresponding web page as the target page; using the low-dimensional geometric representation obtained in step 4, it then calculates the dissimilarity between the target page and the pages corresponding to all links in the initial result list, and arranges those pages in ascending order of dissimilarity to form a new query result;
(d) the new query result is presented to the user; it is the final query result, related to the first page the user clicked and highly relevant to the query keyword the user entered.
2. The search engine searching method based on web page correlation according to claim 1, characterized in that the recording in step 1 runs on a monthly cycle, with long-term dynamic tracking.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200710056425 CN1996316A (en) | 2007-01-09 | 2007-01-09 | Search engine searching method based on web page correlation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200710056425 CN1996316A (en) | 2007-01-09 | 2007-01-09 | Search engine searching method based on web page correlation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1996316A true CN1996316A (en) | 2007-07-11 |
Family
ID=38251404
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 200710056425 Pending CN1996316A (en) | 2007-01-09 | 2007-01-09 | Search engine searching method based on web page correlation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1996316A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102024031A (en) * | 2010-11-25 | 2011-04-20 | 百度在线网络技术(北京)有限公司 | Method and equipment used for providing second search result based on real-time search |
CN102289459A (en) * | 2010-06-18 | 2011-12-21 | 微软公司 | Automatically generating training data |
CN102314462A (en) * | 2010-06-30 | 2012-01-11 | 北京搜狗科技发展有限公司 | Method and system for obtaining navigation result on input method platform |
CN102314461A (en) * | 2010-06-30 | 2012-01-11 | 北京搜狗科技发展有限公司 | Navigation prompt method and system |
CN102521321A (en) * | 2011-12-02 | 2012-06-27 | 华中科技大学 | Video search method based on search term ambiguity and user preferences |
CN102541857A (en) * | 2010-12-08 | 2012-07-04 | 腾讯科技(深圳)有限公司 | Webpage sorting method and device |
CN102609433A (en) * | 2011-12-16 | 2012-07-25 | 北京大学 | Method and system for recommending query based on user log |
CN102930041A (en) * | 2012-11-12 | 2013-02-13 | 江苏外博资讯有限公司 | Retrieval result real-time updating method based on user behavior information and system thereof |
CN103123653A (en) * | 2013-03-15 | 2013-05-29 | 山东浪潮齐鲁软件产业股份有限公司 | Search engine retrieving ordering method based on Bayesian classification learning |
CN104885075A (en) * | 2013-12-26 | 2015-09-02 | 陶德龙 | Method and apparatus for using key link to execute reverse search |
CN101887437B (en) * | 2009-05-12 | 2016-03-30 | 阿里巴巴集团控股有限公司 | A kind of Search Results generation method and information search system |
CN106528862A (en) * | 2016-11-30 | 2017-03-22 | 四川用联信息技术有限公司 | Search engine keyword optimization realized on the basis of improved mean value center algorithm |
CN106649536A (en) * | 2016-11-01 | 2017-05-10 | 四川用联信息技术有限公司 | Achievement of optimization of search engine keywords based on improved k Means algorithm |
CN107220307A (en) * | 2017-05-10 | 2017-09-29 | 清华大学 | Web search method and device |
CN111611489A (en) * | 2020-05-22 | 2020-09-01 | 北京字节跳动网络技术有限公司 | Search processing method and device, electronic equipment and storage medium |
-
2007
- 2007-01-09 CN CN 200710056425 patent/CN1996316A/en active Pending
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101887437B (en) * | 2009-05-12 | 2016-03-30 | 阿里巴巴集团控股有限公司 | A kind of Search Results generation method and information search system |
CN102289459A (en) * | 2010-06-18 | 2011-12-21 | 微软公司 | Automatically generating training data |
CN102314462A (en) * | 2010-06-30 | 2012-01-11 | 北京搜狗科技发展有限公司 | Method and system for obtaining navigation result on input method platform |
CN102314461A (en) * | 2010-06-30 | 2012-01-11 | 北京搜狗科技发展有限公司 | Navigation prompt method and system |
CN102024031B (en) * | 2010-11-25 | 2012-12-19 | 百度在线网络技术(北京)有限公司 | Method and equipment used for providing second search result based on real-time search |
CN102024031A (en) * | 2010-11-25 | 2011-04-20 | 百度在线网络技术(北京)有限公司 | Method and equipment used for providing second search result based on real-time search |
CN102541857A (en) * | 2010-12-08 | 2012-07-04 | 腾讯科技(深圳)有限公司 | Webpage sorting method and device |
CN102521321B (en) * | 2011-12-02 | 2013-07-31 | 华中科技大学 | Video search method based on search term ambiguity and user preferences |
CN102521321A (en) * | 2011-12-02 | 2012-06-27 | 华中科技大学 | Video search method based on search term ambiguity and user preferences |
CN102609433A (en) * | 2011-12-16 | 2012-07-25 | 北京大学 | Method and system for recommending query based on user log |
CN102609433B (en) * | 2011-12-16 | 2013-11-20 | 北京大学 | Method and system for recommending query based on user log |
CN102930041A (en) * | 2012-11-12 | 2013-02-13 | 江苏外博资讯有限公司 | Retrieval result real-time updating method based on user behavior information and system thereof |
CN103123653A (en) * | 2013-03-15 | 2013-05-29 | 山东浪潮齐鲁软件产业股份有限公司 | Search engine retrieving ordering method based on Bayesian classification learning |
CN104885075A (en) * | 2013-12-26 | 2015-09-02 | 陶德龙 | Method and apparatus for using key link to execute reverse search |
CN104885075B (en) * | 2013-12-26 | 2019-05-31 | 陶德龙 | A kind of method and device executing reverse search using crucial link |
CN106649536A (en) * | 2016-11-01 | 2017-05-10 | 四川用联信息技术有限公司 | Achievement of optimization of search engine keywords based on improved k Means algorithm |
CN106528862A (en) * | 2016-11-30 | 2017-03-22 | 四川用联信息技术有限公司 | Search engine keyword optimization realized on the basis of improved mean value center algorithm |
CN107220307A (en) * | 2017-05-10 | 2017-09-29 | 清华大学 | Web search method and device |
CN107220307B (en) * | 2017-05-10 | 2020-09-25 | 清华大学 | Webpage searching method and device |
CN111611489A (en) * | 2020-05-22 | 2020-09-01 | 北京字节跳动网络技术有限公司 | Search processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1996316A (en) | Search engine searching method based on web page correlation | |
CN101364239B (en) | Method for auto constructing classified catalogue and relevant system | |
Xue et al. | Optimizing web search using web click-through data | |
Haveliwala et al. | Evaluating strategies for similarity search on the web | |
Dou et al. | Evaluating the effectiveness of personalized web search | |
CN103020164B (en) | Semantic search method based on multi-semantic analysis and personalized sequencing | |
CN101520785B (en) | Information retrieval method and system therefor | |
US8352474B2 (en) | System and method for retrieving information using a query based index | |
Graus et al. | Dynamic collective entity representations for entity ranking | |
CN103455487A (en) | Extracting method and device for search term | |
CN102722501A (en) | Search engine and realization method thereof | |
Nasraoui et al. | A framework for mining evolving trends in web data streams using dynamic learning and retrospective validation | |
Kim et al. | Efficient distributed selective search | |
Sharma et al. | Web search result optimization by mining the search engine query logs | |
CN101814085A (en) | WEB data bank selection method based on WDB (World Data Bank) characteristics and user query requests | |
Alhaidari et al. | User preference based weighted page ranking algorithm | |
Aggarwal et al. | Information retrieval and search engines | |
Hsu et al. | Efficient and effective prediction of social tags to enhance web search | |
Yan et al. | Research on PageRank and hyperlink-induced topic search in web structure mining | |
Kunpeng et al. | A new query expansion method based on query logs mining | |
Lei et al. | Improved relevance ranking in WebGather | |
Zhang et al. | Is learning to rank effective for Web search? | |
Gupta et al. | Page ranking algorithms in online digital libraries: A survey | |
Xu et al. | Generating personalized web search using semantic context | |
Wang et al. | QueryFind: Search ranking based on users' feedback and expert's agreement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |