CN1996316A - Search engine searching method based on web page correlation - Google Patents

Search engine searching method based on web page correlation

Info

Publication number
CN1996316A
Authority
CN
China
Prior art keywords
user
webpage
search engine
click
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200710056425
Other languages
Chinese (zh)
Inventor
侯越先
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN 200710056425 priority Critical patent/CN1996316A/en
Publication of CN1996316A publication Critical patent/CN1996316A/en
Pending legal-status Critical Current

Abstract

The invention relates to a search engine searching method based on web page correlation. It provides results twice within a single query: using the user's first click, it addresses synonymy and polysemy and the inability of keyword-based search engines to determine the user's query intent, so that the user receives pages that are both relevant to the keyword and related to the page the user is interested in, without any added operational complexity. In addition, the method updates a dissimilarity matrix from a large volume of click data, which captures users' collective judgments about how pages relate. Because it relies on statistically stable page-level correlation analysis, it does not need to track particular users over the long term, yet it can provide a service that is optimized in the statistical sense.

Description

Search engine searching method based on web page correlation
Technical field
The invention belongs to the technical field of search engines in computer networks, and in particular relates to a search engine searching method based on web page correlation.
Background art
Search engine technology uses combinations of keywords to find relevant information on the network, ranks that information by how well it matches the keywords, and returns it to the user for inspection. With the rapid development of the Internet, search engines have become the main way for network users to obtain Internet resources. In recent years many search engines have appeared worldwide, and they play a very important role in how people acquire information. Current search engines fall mainly into directory-based search engines and keyword-based search engines. A directory-based search engine classifies its page library in advance; the user chooses which category of page is needed and then searches within the corresponding directory. The most representative directory-based search engine at present is Yahoo [http://www.yahoo.com]. However, presenting the user with the best set of results usually requires a very fine-grained classification, and it is impractical to apply existing manual and automatic classification techniques to the massive volume of information on the network. Moreover, even if the search engine did offer a very fine classification, the user's selection process would become very complicated, and there would be no guarantee that the user's judgment matches the search engine's existing classification exactly.
Most search engines on the Internet today use keyword-based query technology; typical representatives are Google [http://www.google.com] and Baidu [http://www.baidu.com].
The volume of information that this class of search engine collects and indexes automatically is extremely large, while a user's query usually consists of only a few words. Because words are themselves ambiguous, the search engine has difficulty determining what the user needs. The result is a very large number of search results with no guarantee of relevance, so the user must spend considerable effort browsing and filtering them. In short, the quality of the information provided by current search engines is not very high.
In addition, the ranking algorithms used by search engines generally include the following:
(1) Ranking algorithms based on word frequency statistics. Many early search engines ranked results by word frequency. The word weight generally takes into account where the word occurs in the page; for example, a word in the title receives a higher weight than a word in the body text. However, because the volume of Internet resources is enormous, two pages with the same word frequency may differ greatly in quality, and the relevance between a page and a keyword computed from word frequency is unreliable. The limitations of this algorithm are therefore obvious.
(2) Ranking algorithms based on hyperlink analysis. In traditional information retrieval theory, citation analysis is one of the important methods for determining the authority of an academic document: authority is determined by the quality and quantity of citations. Ranking algorithms based on hyperlink analysis borrow this idea and apply it to the importance of network documents. Using the hyperlink structure of the network itself, they assign every page an importance score according to how many times the page is cited and how important the citing pages are, and use this score to improve the ranking. What this algorithm yields, however, is the importance of the page itself, not the relevance between the page and the keywords of the user's query. It therefore often happens that the pages of highest intrinsic quality in the query results are not very relevant to the user's query needs.
Summary of the invention
To solve the above problems, the object of the present invention is to provide a search engine searching method based on web page correlation that identifies the user's needs accurately without increasing operational complexity, and thereby improves the correlation between the search engine's results and the user's needs.
To achieve this object, the search engine searching method based on web page correlation provided by the invention comprises the following steps, carried out in order:
(1) while the search engine is running, record users' click behavior on search result lists over a period of time;
(2) compute the dissimilarity between all web pages using a method based on the vector space model, and store it;
(3) update all the page dissimilarities obtained in step 2 with the click data recorded in step 1;
(4) treat the page dissimilarities obtained in step 3 as distances between pages and reduce the dimensionality of these distance data with a dimensionality reduction algorithm, thereby obtaining a low-dimensional geometric representation of the page dissimilarity data;
(5) on receiving a query request from a user, the search engine performs the following steps:
(a) the search engine accepts the query keyword entered by the user, produces an initial result list for that keyword using some relevance computation method, and presents it to the user;
(b) after viewing the initial result list, the user clicks a link of interest;
(c) the search engine records the first link the user clicks and designates the corresponding page as the target page; then, using the low-dimensional geometric representation of the page dissimilarity data obtained in step 4, it computes the dissimilarity between the target page and the pages corresponding to all links in the initial result list, and arranges those pages in ascending order of dissimilarity to form a new query result;
(d) the new query result is presented to the user; this is the final query result, which is related to the first page the user clicked and highly relevant to the keyword the user entered.
The recording in step 1 uses a cycle of one month and is carried out as long-term dynamic tracking.
The search engine searching method based on web page correlation provided by the invention has the following beneficial effects:
1. The invention can provide results twice within a single query. Using the information supplied by the user's first click, it effectively resolves synonymy (several words, one meaning) and polysemy (one word, several meanings), and so addresses the inability of keyword-based search engines to determine the user's query intent accurately. Providing a second set of results based on the first click gives the user pages that are relevant to the keyword and related to the page the user is interested in, without adding to the complexity of the user's operation.
2. Empirically and intuitively, only pages that are similar and highly related are likely to be visited together by a user, so click data embody users' judgments about page dissimilarity. Updating the dissimilarity matrix with click data assesses the dissimilarity between pages from a new angle: it is dissimilarity in the statistical sense, embodied in a large volume of data and reflecting the judgments that many users make while using the search engine. The invention therefore relies on page-level correlation (dissimilarity) analysis that is statistically stable; it does not need to track a specific user's behavior over the long term in order to provide that user with a service optimized in the statistical sense.
Embodiment
The search engine searching method based on web page correlation provided by the invention determines the type of information the user really needs by collecting the user's click behavior, and at the same time uses click data as one of the bases for judging the correlation between pages, thereby improving the correlation between query results and the user's needs.
Users of a search engine do not normally click links in the result list at random; they make deliberate choices, so click data are a form of implicit feedback that carries a great deal of information. Because users tend to click the links that best match their needs, the search engine can infer the user's immediate need, and resolve the ambiguity of the query word, by tracking and analyzing the links the user clicks. If the search engine can provide a dynamic query result that is related not only to the query word but also to the content of the link the user has just clicked, it can determine the meaning the user intended the query word to express and adapt the search results to the user's needs.
Within a single query the user's need is usually fairly narrow, and the user generally does not click without reason, so the links clicked during one query are strongly correlated with each other. The invention stores this co-click information in an n × n matrix and uses it as the basis for updating the correlation between pages. In other words, the invention maintains page content dissimilarities derived from the click data of a large number of users. For each query request it identifies the query topic and the query intent by tracking the user's click and consulting the page dissimilarity information, and finally gives the user a final query result that is related to the first page the user clicked and highly relevant to the keyword the user entered.
The search engine searching method based on web page correlation provided by the invention is described in detail below.
The method comprises the following steps, carried out in order:
(1) While the search engine is running, record users' click behavior on search result lists over a period of time. Because click data must be accumulated, this step has to run alongside the search engine for some time.
(2) Compute the dissimilarity between all web pages using a method based on the vector space model, and store it. Page dissimilarity is the opposite of page relevance and quantifies how much two pages differ: the higher the relevance of two pages, the smaller their dissimilarity.
In this process the dissimilarity matrix D is first established and then kept up to date. The following data structures are maintained:
Co-access count matrix A: an n × n symmetric matrix that stores, for every pair of pages, the number of times the two pages were visited together.
Click count vector B: an n × 1 vector whose elements $b_i$ are non-negative integers in [0, +∞); each element stores the total number of clicks received by the corresponding page.
Initial dissimilarity matrix $D^0$: an n × n symmetric matrix computed from the vector space model. Let $Doc = \{doc_i \mid 1 \le i \le n\}$ denote a set of web pages. Under the vector space model each page $doc_i$ can be represented as a vector $\mathbf{doc}_i$, and the element $d_{ij}^{0}$ in row i, column j of $D^0$ can be defined as:

$$ d_{ij}^{0} \equiv \frac{\left\| \dfrac{\mathbf{doc}_i}{\|\mathbf{doc}_i\|_2} - \dfrac{\mathbf{doc}_j}{\|\mathbf{doc}_j\|_2} \right\|_2}{\max\limits_{i,j} \left\{ \left\| \dfrac{\mathbf{doc}_i}{\|\mathbf{doc}_i\|_2} - \dfrac{\mathbf{doc}_j}{\|\mathbf{doc}_j\|_2} \right\|_2 \right\}} \tag{1} $$

Here $\|\cdot\|_2$ is the 2-norm. By definition, $d_{ij}^{0}$ is a normalized value lying in [0, 1], and the elements of $D^0$ satisfy the metric axioms (satisfying the metric axioms is a necessary property for D to admit a geometric embedding).
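As an illustration of equation (1), the following is a minimal Python sketch, not the patent's implementation. It assumes each page is already available as a list of term weights of equal length; the function names `normalize` and `initial_difference_matrix` and the sample vectors are hypothetical.

```python
import math


def normalize(vec):
    """Scale a term-weight vector to unit 2-norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def initial_difference_matrix(docs):
    """Build D0 as in equation (1): distances between unit vectors, divided by the largest one."""
    unit = [normalize(v) for v in docs]
    n = len(unit)
    raw = [[math.dist(unit[i], unit[j]) for j in range(n)] for i in range(n)]
    largest = max(max(row) for row in raw)
    if largest == 0:  # all pages identical: every dissimilarity is 0
        return raw
    return [[value / largest for value in row] for row in raw]


# Hypothetical term-weight vectors for three pages.
docs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
D0 = initial_difference_matrix(docs)
```

Normalizing each vector first makes the result depend on the direction of the term-weight vector rather than on page length, and dividing by the largest pairwise distance keeps every entry in [0, 1].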
Click dissimilarity matrix C: an n × n matrix whose elements are defined directly as

$$ c_{ij} \equiv \begin{cases} 1 - a_{ij} / \max\{b_i, b_j\}, & i \neq j \\ 0, & i = j \end{cases} \tag{2} $$
Dissimilarity matrix D: an n × n symmetric matrix. The element $d_{ij}$ in row i, column j stores the dissimilarity between page i and page j, and is defined as

$$ d_{ij} \equiv \begin{cases} w \cdot c_{ij} + (1 - w) \cdot d_{ij}^{0}, & i \neq j \\ 0, & i = j \end{cases} \tag{3} $$
Here w is a user-set parameter, 0 < w < 1. In the initial state w is set to 0, and its value is raised gradually as the system's running time increases; after a sufficiently long time w may take the value 1. The parameter can also be adjusted for particular needs. For example, if some pages have received very few clicks, their click data are less reliable, so w can be given a smaller value; the dissimilarity then depends mainly on the value computed by the VSM method.
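Equations (2) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `click_difference_matrix` and `combine` are hypothetical, a single global w is used, and the handling of a pair of pages that has received no clicks at all (where equation (2) would divide by zero) is an assumption not covered by the source.

```python
def click_difference_matrix(A, B):
    """Build C as in equation (2) from co-access counts A and click counts B."""
    n = len(B)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            most = max(B[i], B[j])
            # Assumption: with no clicks on either page, treat the pair as maximally different.
            C[i][j] = 1.0 - A[i][j] / most if most > 0 else 1.0
    return C


def combine(C, D0, w):
    """Build D as in equation (3): a weighted mix of click-based and VSM-based dissimilarity."""
    n = len(C)
    return [
        [0.0 if i == j else w * C[i][j] + (1 - w) * D0[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical counts for three pages.
A = [[0, 3, 0], [3, 0, 1], [0, 1, 0]]
B = [4, 6, 2]
D0 = [[0.0, 0.8, 0.4], [0.8, 0.0, 0.6], [0.4, 0.6, 0.0]]
C = click_difference_matrix(A, B)
D = combine(C, D0, w=0.5)
```

With w = 0 the result equals $D^0$, and with w = 1 it depends only on the click data, which matches the gradual increase of w described above.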
Compressed representation Y of D: an n × d matrix obtained by processing D with a dimensionality reduction algorithm. The element $d_{ij}$ of D is represented as the distance between the row vectors i and j of Y, so the dissimilarity between any two pages can be expressed as the Euclidean distance between vectors in Y.
(3) Update all the page dissimilarities obtained in step 2 with the click data recorded in step 1. The dissimilarity between any two pages is updated as follows: analyze the click data recorded in step 1; if the data show that the two pages appeared together in some query result and were both opened by the user at that time, increment the co-click count between the two pages by 1. Once all the click data from step 1 have been processed, the total co-click count between the two pages over the period covered by step 1 is known.
(4) Treat the page dissimilarities obtained in step 3 as distances between pages and reduce the dimensionality of these distance data with a dimensionality reduction algorithm, thereby obtaining a low-dimensional geometric representation of the page dissimilarity data. At this point the search engine has the data it needs to compute page dissimilarity when producing query results.
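The source does not name a particular dimensionality reduction algorithm. As one possible choice, the sketch below uses metric multidimensional scaling by stress majorization (the SMACOF Guttman transform), which looks for n points in d dimensions whose Euclidean distances approximate the entries of D. The function names, the iteration count and the random initialization are all assumptions made for illustration.

```python
import math
import random


def pairwise(X):
    """Euclidean distances between all rows of X."""
    n = len(X)
    return [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]


def stress(D, X):
    """Sum of squared differences between target distances D and embedded distances."""
    E = pairwise(X)
    n = len(D)
    return sum((D[i][j] - E[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))


def embed(D, dim, iters=200, seed=0):
    """Return an n x dim matrix Y whose row distances approximate D (unweighted SMACOF)."""
    rng = random.Random(seed)
    n = len(D)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    for _ in range(iters):
        E = pairwise(X)
        # Guttman transform: X <- (1/n) * B(X) * X
        B = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j and E[i][j] > 1e-12:
                    B[i][j] = -D[i][j] / E[i][j]
            B[i][i] = -sum(B[i][j] for j in range(n) if j != i)
        X = [
            [sum(B[i][k] * X[k][c] for k in range(n)) / n for c in range(dim)]
            for i in range(n)
        ]
    return X


# Hypothetical dissimilarities: four pages at the corners of a unit square.
s = math.sqrt(2.0)
D = [[0, 1, s, 1], [1, 0, 1, s], [s, 1, 0, 1], [1, s, 1, 0]]
Y = embed(D, dim=2)
```

Each iteration cannot increase the stress, so running longer never makes the fit worse, although the method may settle in a local minimum. For a realistic number of pages a library implementation working on sparse or sampled data would be needed; this pure-Python version is only meant to show the idea.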
In steps 3 and 4 above, the dissimilarity matrix must be updated periodically. The update process is as follows:
1. Generate the initial dissimilarity matrix $D^0$ from the vector space model.
2. For each query event, generate the query result set by some method (no particular algorithm is required). The links in the result set are presented to the user in order, each with a summary of the corresponding page.
3. After viewing the list, the user clicks several links according to the need at that time. The search engine records the clicked links and increments the co-access count between the clicked pages by 1, as follows. For clicked pages i and j, execute

$$ a_{ij} = a_{ij} + 1 \tag{4} $$
$$ b_i = b_i + 1 \tag{5} $$
$$ b_j = b_j + 1 \tag{6} $$

If only one page i is opened, execute

$$ b_i = b_i + 1 \tag{7} $$

4. The search engine periodically recomputes D from A, B and $D^0$, and reduces the dimensionality of D to obtain its compressed geometric representation Y. The dissimilarity between pages is thus represented as the Euclidean distance in a d-dimensional embedding space, with d ≪ n.
5. When a new page is added, the system computes the dissimilarity between the new page and the other pages using the method based on the vector space model, and sets the w parameter for that page to 0. Once the page has received a sufficient number of clicks, w is adjusted to a reasonable non-zero value.
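The counting in item 3 of the update process can be sketched as follows. The function name `record_session` and the representation of a query session as a list of clicked page IDs are hypothetical. One interpretation is also assumed here: each clicked page's count in B is incremented once per query, and A is incremented once for every unordered pair of clicked pages, so that B still holds the total clicks per page when more than two pages are clicked.

```python
from itertools import combinations


def record_session(A, B, clicked):
    """Update co-access matrix A and click vector B for the pages clicked in one query."""
    pages = sorted(set(clicked))  # ignore repeated clicks on the same page
    for i in pages:
        B[i] += 1  # equations (5)-(7): one more click for each opened page
    for i, j in combinations(pages, 2):
        A[i][j] += 1  # equation (4), applied symmetrically to keep A symmetric
        A[j][i] += 1


n = 4
A = [[0] * n for _ in range(n)]
B = [0] * n
record_session(A, B, [0, 2, 3])  # three pages opened in one query
record_session(A, B, [2])        # a query where only page 2 was opened
```

After these two sessions, page 2 has two clicks and shares one co-access with each of pages 0 and 3, while page 1 remains untouched.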
(5) On receiving a query request from a user, the search engine performs the following steps:
(a) the search engine accepts the query keyword entered by the user, produces an initial result list for that keyword using some relevance computation method, and presents it to the user;
(b) after viewing the initial result list, the user clicks a link of interest;
(c) the search engine records the first link the user clicks and designates the corresponding page as the target page; then, using the low-dimensional geometric representation of the page dissimilarity data obtained in step 4, it computes the dissimilarity between the target page and the pages corresponding to all links in the initial result list, and arranges those pages in ascending order of dissimilarity to form a new query result;
(d) the new query result is presented to the user; this is the final query result, which is related to the first page the user clicked and highly relevant to the keyword the user entered.
In this step, when a user uses the search engine, the following process is carried out for each query request:
1. Generate the initial query result set r using a method based on the vector space model. Suppose r contains m pages at this point.
2. After the user views the initial results and clicks a link, the search engine records that link (called the target page; let its ID in the page library be i). Compute the dissimilarity between the target page i and the other pages in r (that is, the distance between the corresponding row vectors in Y) to obtain the dissimilarity vector $d_i \equiv [d_{ij_1}, d_{ij_2}, \ldots, d_{ij_m}]^T$. (Alternatively, compute the dissimilarity between the target page and all other pages, and take the pages with the smallest dissimilarity as an expansion of the query result set.)
Arrange the pages in r in ascending order of the corresponding dissimilarity in $d_i$ and present them to the user; this is the final result that the search engine gives the user.
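The re-ranking described above amounts to sorting the initial result set by Euclidean distance to the target page in Y. A minimal sketch, with a hypothetical function name `rerank` and made-up coordinates for Y:

```python
import math


def rerank(Y, result_ids, target_id):
    """Sort the pages of the initial result set by distance to the clicked target page in Y."""
    return sorted(result_ids, key=lambda page: math.dist(Y[page], Y[target_id]))


# Hypothetical 2-dimensional embedding of five pages; the row index is the page ID.
Y = [[0.0, 0.0], [1.0, 0.0], [0.2, 0.1], [3.0, 3.0], [0.9, 0.2]]
initial_results = [0, 1, 2, 3, 4]
final_results = rerank(Y, initial_results, target_id=1)
print(final_results)  # [1, 4, 2, 0, 3]
```

Because only m row vectors of length d are compared, this step costs O(m·d) distance work plus a sort per query, which is why the dissimilarities are embedded in advance rather than looked up in the full n × n matrix.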

Claims (2)

1. A search engine searching method based on web page correlation, characterized in that the method comprises the following steps, carried out in order:
(1) while the search engine is running, record users' click behavior on search result lists over a period of time;
(2) compute the dissimilarity between all web pages using a method based on the vector space model, and store it;
(3) update all the page dissimilarities obtained in step 2 with the click data recorded in step 1;
(4) treat the page dissimilarities obtained in step 3 as distances between pages and reduce the dimensionality of these distance data with a dimensionality reduction algorithm, thereby obtaining a low-dimensional geometric representation of the page dissimilarity data;
(5) on receiving a query request from a user, the search engine performs the following steps:
(a) the search engine accepts the query keyword entered by the user, produces an initial result list for that keyword using some relevance computation method, and presents it to the user;
(b) after viewing the initial result list, the user clicks a link of interest;
(c) the search engine records the first link the user clicks and designates the corresponding page as the target page; then, using the low-dimensional geometric representation of the page dissimilarity data obtained in step 4, it computes the dissimilarity between the target page and the pages corresponding to all links in the initial result list, and arranges those pages in ascending order of dissimilarity to form a new query result;
(d) the new query result is presented to the user; this is the final query result, which is related to the first page the user clicked and highly relevant to the keyword the user entered.
2. The search engine searching method based on web page correlation according to claim 1, characterized in that the recording in step 1 uses a cycle of one month and is carried out as long-term dynamic tracking.
CN 200710056425 2007-01-09 2007-01-09 Search engine searching method based on web page correlation Pending CN1996316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200710056425 CN1996316A (en) 2007-01-09 2007-01-09 Search engine searching method based on web page correlation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200710056425 CN1996316A (en) 2007-01-09 2007-01-09 Search engine searching method based on web page correlation

Publications (1)

Publication Number Publication Date
CN1996316A true CN1996316A (en) 2007-07-11

Family

ID=38251404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200710056425 Pending CN1996316A (en) 2007-01-09 2007-01-09 Search engine searching method based on web page correlation

Country Status (1)

Country Link
CN (1) CN1996316A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101887437B (en) * 2009-05-12 2016-03-30 阿里巴巴集团控股有限公司 A kind of Search Results generation method and information search system
CN102289459A (en) * 2010-06-18 2011-12-21 微软公司 Automatically generating training data
CN102314462A (en) * 2010-06-30 2012-01-11 北京搜狗科技发展有限公司 Method and system for obtaining navigation result on input method platform
CN102314461A (en) * 2010-06-30 2012-01-11 北京搜狗科技发展有限公司 Navigation prompt method and system
CN102024031B (en) * 2010-11-25 2012-12-19 百度在线网络技术(北京)有限公司 Method and equipment used for providing second search result based on real-time search
CN102024031A (en) * 2010-11-25 2011-04-20 百度在线网络技术(北京)有限公司 Method and equipment used for providing second search result based on real-time search
CN102541857A (en) * 2010-12-08 2012-07-04 腾讯科技(深圳)有限公司 Webpage sorting method and device
CN102521321B (en) * 2011-12-02 2013-07-31 华中科技大学 Video search method based on search term ambiguity and user preferences
CN102521321A (en) * 2011-12-02 2012-06-27 华中科技大学 Video search method based on search term ambiguity and user preferences
CN102609433A (en) * 2011-12-16 2012-07-25 北京大学 Method and system for recommending query based on user log
CN102609433B (en) * 2011-12-16 2013-11-20 北京大学 Method and system for recommending query based on user log
CN102930041A (en) * 2012-11-12 2013-02-13 江苏外博资讯有限公司 Retrieval result real-time updating method based on user behavior information and system thereof
CN103123653A (en) * 2013-03-15 2013-05-29 山东浪潮齐鲁软件产业股份有限公司 Search engine retrieving ordering method based on Bayesian classification learning
CN104885075A (en) * 2013-12-26 2015-09-02 陶德龙 Method and apparatus for using key link to execute reverse search
CN104885075B (en) * 2013-12-26 2019-05-31 陶德龙 A kind of method and device executing reverse search using crucial link
CN106649536A (en) * 2016-11-01 2017-05-10 四川用联信息技术有限公司 Achievement of optimization of search engine keywords based on improved k Means algorithm
CN106528862A (en) * 2016-11-30 2017-03-22 四川用联信息技术有限公司 Search engine keyword optimization realized on the basis of improved mean value center algorithm
CN107220307A (en) * 2017-05-10 2017-09-29 清华大学 Web search method and device
CN107220307B (en) * 2017-05-10 2020-09-25 清华大学 Webpage searching method and device
CN111611489A (en) * 2020-05-22 2020-09-01 北京字节跳动网络技术有限公司 Search processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN1996316A (en) Search engine searching method based on web page correlation
CN101364239B (en) Method for auto constructing classified catalogue and relevant system
Xue et al. Optimizing web search using web click-through data
Haveliwala et al. Evaluating strategies for similarity search on the web
Dou et al. Evaluating the effectiveness of personalized web search
CN103020164B (en) Semantic search method based on multi-semantic analysis and personalized sequencing
CN101520785B (en) Information retrieval method and system therefor
US8352474B2 (en) System and method for retrieving information using a query based index
Graus et al. Dynamic collective entity representations for entity ranking
CN103455487A (en) Extracting method and device for search term
CN102722501A (en) Search engine and realization method thereof
Nasraoui et al. A framework for mining evolving trends in web data streams using dynamic learning and retrospective validation
Kim et al. Efficient distributed selective search
Sharma et al. Web search result optimization by mining the search engine query logs
CN101814085A (en) WEB data bank selection method based on WDB (World Data Bank) characteristics and user query requests
Alhaidari et al. User preference based weighted page ranking algorithm
Aggarwal et al. Information retrieval and search engines
Hsu et al. Efficient and effective prediction of social tags to enhance web search
Yan et al. Research on PageRank and hyperlink-induced topic search in web structure mining
Kunpeng et al. A new query expansion method based on query logs mining
Lei et al. Improved relevance ranking in WebGather
Zhang et al. Is learning to rank effective for Web search?
Gupta et al. Page ranking algorithms in online digital libraries: A survey
Xu et al. Generating personalized web search using semantic context
Wang et al. QueryFind: Search ranking based on users' feedback and expert's agreement

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication