CN100543744C - Method to webpage and website grading - Google Patents

Publication number
CN100543744C
Authority
CN
China
Prior art keywords
webpage
node
weight
link
rank
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006101658019A
Other languages
Chinese (zh)
Other versions
CN1996299A (en
Inventor
孙斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CNB2006101658019A priority Critical patent/CN100543744C/en
Publication of CN1996299A publication Critical patent/CN1996299A/en
Application granted granted Critical
Publication of CN100543744C publication Critical patent/CN100543744C/en

Abstract

A method for grading network nodes recursively determines the rank of each node from multiple properties of the linking relationships between nodes. The rank of a node is the sum of the ranks of the source nodes of its in-links, weighted by the forward weights of those in-links; or the sum of the ranks of the target nodes of its out-links, weighted by the reverse weights of those out-links; or the sum of the ranks of the nodes with which it has a co-citation relationship, weighted by the weights of those relationships; or the sum of the ranks of the nodes with which it has a co-reference relationship, weighted by the weights of those relationships; or a further weighted sum of these 4 kinds of weighted sums. A network node can be a webpage, or a super-page that represents the linking relationships among all webpages of a website. The rating results provided by the invention reflect the quality, importance, and authority of nodes more comprehensively and accurately, are more stable, and resist the influence of cheating better. The grading method of the invention can provide better technical effect for applications such as webpage collection, website collection, and search-result ranking.

Description

Method to webpage and website grading
Technical field
The present invention relates to the field of Internet information search technology, and in particular to methods for grading network nodes (for example webpages or websites) according to the linking relationships between them; for example, methods by which an Internet search engine uses the hypertext links between webpages to distinguish, measure, and grade the quality or importance of the webpages and websites it has indexed.
Background art
With the continuous development of computer and network technology, and especially with the growing popularity of Internet applications, searching the information on the network effectively has become an important daily activity and research topic. Search engines are now, like e-mail, among the most frequently used Internet applications. Improving Internet information search technology is therefore of great significance and value. After years of continuous research, development, and market competition, Internet search engine technology has progressed considerably and has formed a relatively mature technical system and business model. On the one hand, traditional document information retrieval techniques have been applied widely and deeply in search engines; on the other hand, new techniques that target the characteristics of network information have been developed and have had a positive, significant effect.
One major feature of network information is its rich linking relationships: information is distributed over the nodes of the network, and the nodes are associated with, refer to, or cite one another through links that carry some semantics. For example, the World-Wide Web (WWW) on the Internet is a huge network of information nodes connected by hypertext links (based on the Hypertext Transfer Protocol, HTTP). Its basic information node is the webpage, in which hypertext links (hereinafter "hyperlinks" or "links") can be placed without restriction on their number, target, or display form. The distribution of webpages also has a higher layer of structure: webpages are all accessed through websites, so websites form the information nodes of the next larger level of the WWW. In addition, the webpages within a website have an intermediate directory structure, and websites can form a still higher-level hierarchy through domain names. The information nodes of the network can therefore include webpages, websites, and nodes of other granularities, for example domain-name nodes or file-directory nodes of a certain level. The rich linking relationships and hierarchical structure among nodes are the key property that distinguishes network information from conventional text, image, audio, and video information. Making full use of these properties helps raise the technical level of network information search, and the Internet search engines in mainstream use today all make general use of the linking relationships of network information. Techniques of this kind are commonly called "link analysis". Their purpose is to analyze the content or attributes of information nodes, or to grade them, by means of the linking relationships between information nodes such as webpages or websites. Grading a node means assigning it one or more rank values so as to distinguish quantitatively properties such as its quality, importance, authority, or popularity.
U.S. Patent No. 6,285,999 (title: Method for node ranking in a linked database; inventor: Lawrence Page) discloses a link analysis method. This method, commonly called PageRank, is the link analysis method that has so far received the widest attention and study and has been applied successfully. (It is also the proprietary technology used by the Google.com search engine.) The method is based entirely on the linking relationships between nodes and on the directionality of hyperlinks. It assigns each node a rank score, which is the weighted sum of the scores of the nodes that link to it (the linking nodes), where the weight of each linking node's score is the reciprocal of that node's out-degree (the total number of its out-bound links). The page rank determined by PageRank is a global description of page popularity that is independent of the search query; it provides an indirect measure for distinguishing quantitatively the quality or importance of large numbers of webpages. Such a global rating can, on the one hand, guide the priority of webpage collection, so that important pages are collected or updated as early as possible; on the other hand, it can be combined with the conventional scoring of search results against specific query keywords to promote high-quality pages and thus order search results more effectively. Since it was proposed in 1998, PageRank has attracted wide attention and study in industry and academia, and a large number of related papers have been published. For a comprehensive discussion of the detailed properties, algorithms, parameter tuning, and improvements of PageRank, see Deeper Inside PageRank (authors: A. Langville and C. Meyer; journal: Internet Mathematics, Vol. 1, No. 3, pp. 335-380; URL: http://www.internetmathematics.org/volumes/1/3/Langville.pdf).
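As an illustration of the weighted sum described above, the following Python sketch iterates the basic PageRank update on a small made-up graph. It is a minimal sketch, not the implementation of the cited patent: the function name `pagerank`, the damping factor 0.85, the uniform jump term, and the example graph are all assumptions introduced here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterate a basic PageRank update (illustrative sketch only).

    links: dict mapping each page to the list of pages it links to.
    damping and the uniform jump term are assumed, commonly used choices.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts from the uniform jump share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for source in pages:
            targets = links.get(source, [])
            if not targets:
                continue  # pages without out-links pass nothing on in this sketch
            # Each linking page contributes rank / out-degree to every target.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical 3-page graph: a -> b, a -> c, b -> c, c -> a.
example = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(example)
```

In this example page c is linked from both a and b and ends up with the highest score, while b, which receives only half of a's score, ends up lowest.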
The PageRank method can also be used to grade websites. Just as the PageRank of a webpage is the probability that the page is selected by a random browsing process, the PageRank of a website is the probability that a viewer selects the website at random. The PageRank of a website can be defined simply as the sum of the PageRank values of all the webpages it contains, or it can be defined as a measure of some particular kind of website quality or trustworthiness. For example, a small set of high-quality websites can be picked out and each given, from experience, a high quality rank or credit rank (also called trust rank); the result of propagating the ranks of these high-quality websites to every other website is then computed by the PageRank method, which allows the quality or credit rating of the websites to be compared. The linking relationships between websites can be constructed from the linking relationships between webpages; for example, the page-level links can simply be merged onto the website nodes with intra-site links ignored, or different weights can be set for links between websites and links between the pages inside a website. The resulting website ratings can play an important role in scheduling webpage collection, in website collection, in anti-cheating, and in the final ordering of search results.
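The simplest construction mentioned above, merging page-level links onto website nodes and ignoring intra-site links, can be sketched as follows. The function name `build_site_graph`, the page identifiers, and the page-to-site mapping are hypothetical and serve only as an example.

```python
from collections import Counter


def build_site_graph(page_links, site_of):
    """Merge page-level links into site-level links, dropping intra-site links.

    page_links: iterable of (source_page, target_page) pairs.
    site_of: dict mapping each page to the site that hosts it.
    Returns a Counter mapping (source_site, target_site) to the number of
    page-level links merged into that site-level link.
    """
    site_links = Counter()
    for source, target in page_links:
        s, t = site_of[source], site_of[target]
        if s != t:  # links inside one website are ignored
            site_links[(s, t)] += 1
    return site_links


# Hypothetical data: two pages on site A, one each on sites B and C.
page_links = [("a/1", "a/2"), ("a/1", "b/1"), ("a/2", "b/1"), ("b/1", "c/1")]
site_of = {"a/1": "A", "a/2": "A", "b/1": "B", "c/1": "C"}
site_graph = build_site_graph(page_links, site_of)
```

The merged counts could later serve as a starting point for link weights; how they are turned into weights is left open here.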
Although the PageRank method brought innovative technology to Internet information search and has been very successful in the market, its exclusive reliance on linking relationships and its one-way propagation of page rank have also revealed some shortcomings. In particular, after PageRank came into wide use by search engines, a search-engine cheating technique appeared that exploits the one-way propagation of PageRank, called link spamming. A cheater only has to keep adding webpages that contain links to a certain webpage, and the PageRank of the target webpage keeps rising. Such cheating is hard to detect and handle within the PageRank mechanism itself; large amounts of manpower and resources must be spent on special checks using dedicated methods. These anti-link-spam methods are usually guarded strictly as trade secrets and are not disclosed, which itself demonstrates the fragility of the PageRank algorithm.
In general, after years of large-scale application and testing, the advantages and the many shortcomings of the PageRank method have become fairly clear. Its main shortcomings include the following (some of which are problems inherent in link analysis itself):
● It grades webpages only by their back links, i.e. in-bound links. The rank of a webpage always increases monotonically with the number of in-links, with no regard to whether an in-link is relevant or how relevant it is, so the page rank is easily manipulated by cheating such as link exchange and relayed link accumulation;
● It is based directly on the linking relationships between webpages and ignores how webpages aggregate at the website level and at other levels. The granularity of the linking relationships is too fine, so PageRank is expensive to compute and slow to update, and it fails for newly appearing webpages because they lack linking relationships;
● Most links in the webpages of one website are intra-site links, which makes it hard to grade the website accurately. Although different weights can be set for intra-site and inter-site links, there is no definite basis for setting the weight values for different websites;
● Websites of commercially competing companies have almost no links between them, even when their content is highly related, which harms the accuracy of grading. Competing websites usually share many co-citation and co-reference relationships (explained in detail below), but existing grading methods make no use of them;
● Out-bound links (also called forward links) tend to lower the rank of a webpage and the total rank of its website. This discourages page authors from actively creating out-bound links, in particular links to high-quality, more relevant websites and webpages with which they have no business relationship; instead, it commonly leads to large-scale exchange or trading of so-called "reciprocal links" between websites;
● The assumed uniform random-jump probability between webpages differs greatly from the way people actually browse. This shortcoming can usually be overcome by introducing a "personalization vector" (an external probability source for the page-browsing random process), but setting the personalization vector is a rather complex and computationally expensive problem, so in practice it is not widely used;
● It is independent of text content and based entirely on hyperlink relationships; that is, it ignores the content of documents altogether and so cannot fundamentally reduce problems such as semantic mismatch between documents and queries.
This shows that PageRank is still a fairly simple and elementary link analysis method. Some targeted improvements can be made for these problems. For example, to address the fact that PageRank is independent of page text and query terms, an extended PageRank can be designed relative to a set of predetermined query topics (also called topic-sensitive PageRank). But the application-specific nature of such improved methods and the complexity of implementing them raise wider problems, and their practical effect is not obvious. Most improved methods known at present are local adjustments or variants for specific situations; their new technical effects have not yet been verified in large-scale practical use, or they are hard to implement because their computational complexity is too high. More importantly, none of these known improvements changes the one-way propagation of PageRank, so they cannot provide substantial improvement or more effective resistance to cheating. In short, because of simplifications or omissions in many respects, PageRank and its existing improvements still fail to use the linking relationships between webpages accurately, comprehensively, or fully when grading webpages and websites, and they are easily affected by manual manipulation and link cheating.
It is therefore necessary to develop a network information node grading technique that is more comprehensive, finer-grained, more robust, and more resistant to cheating than the prior art, and that can be implemented efficiently, so as to provide methods and systems for grading webpages and websites with better technical effect.
Summary of the invention
One object of the invention is to propose a comprehensive webpage grading method that uses multiple properties of the linking relationships between webpages in a balanced way to grade webpages more comprehensively and stably. The properties used include the two-way nature of links, the co-citation and co-reference relationships derived from links, and attributes of these relationships such as frequency and weight.
Another object of the invention is to propose a website grading method that makes combined use of the diverse properties of the linking relationships between websites to grade websites comprehensively, finely, and stably.
A further object of the invention is to provide a computer-based webpage and website rating system that uses efficient algorithms to implement the above grading methods and can be applied to very large collections of webpages and websites, for example to grade the webpages and websites of the WWW within a region or worldwide.
To achieve the above objects, the technical solution adopted by the invention is a computer-implemented method for grading network nodes, which assigns each node a numerical value representing its rank according to the directed linking relationships between nodes, characterized by comprising the following steps:
a. Set at least two of the following kinds of weights: for at least some of the links, a forward weight for each; for at least some of the links, a reverse weight for each; for at least some of the co-citation relationships between nodes, a weight for each; for at least some of the co-reference relationships between nodes, a weight for each;
b. According to the weights set in step a, compute the corresponding weighted sums: if forward weights of links were set, compute for each node the sum of the ranks of the source nodes of its in-links, weighted by the forward weights of those in-links; if reverse weights of links were set, compute for each node the sum of the ranks of the target nodes of its out-links, weighted by the reverse weights of those out-links; if weights of co-citation relationships were set, compute for each node the sum of the ranks of the nodes with which it has a co-citation relationship, weighted by the weights of those relationships; if weights of co-reference relationships were set, compute for each node the sum of the ranks of the nodes with which it has a co-reference relationship, weighted by the weights of those relationships;
c. Form a further weighted sum of the weighted sums obtained in step b, and take it as the rank value of the node.
Here, the forward weight of a link, the reverse weight of a link, the co-citation weight, and the co-reference weight depend respectively on the out-degree of a node, the in-degree of a node, the co-citation frequency, and the co-reference frequency. The rank of a node may also include a constant rank term representing a prior probability distribution, such that this constant and the weight factors of the further weighted sum add up to 1. The network node may be a webpage, or the super-page corresponding to a website; a super-page is constructed by merging the webpages of a website and represents the total linking relationships among all of them.
Compared with the prior art, this technical solution has the following advantages: because it grades using multiple properties of the linking relationships between information nodes, the rating results of the method reflect more comprehensively and accurately the quality, importance, and authority that nodes acquire through linking relationships; the results are more stable; and the method makes link cheating harder and resists its influence better. The rating results of the method can therefore provide better technical effect for webpage collection, website collection, and search-result ranking.
Description of drawings
This specification includes 7 drawings.
Fig. 1 is a schematic diagram of the two-way rank propagation relationships and their weights used in the invention.
Fig. 2 is a schematic diagram of the co-citation relationship between nodes formed by links, as used in the invention.
Fig. 3 is a schematic diagram of the co-reference relationship between nodes formed by links, as used in the invention.
Fig. 4 is a flowchart of the webpage grading method of one embodiment of the invention.
Fig. 5 illustrates the rating result of the webpage grading method of the invention for a network of 3 webpages.
Fig. 6 is a flowchart of one embodiment of the invention that computes the rank vector of the nodes iteratively using the power method.
Fig. 7 is a flowchart of the website grading method of one embodiment of the invention.
Detailed description of the embodiments
The technical solution is further described below with reference to the drawings and embodiments. The following part first describes in detail how the method of the invention is used to grade webpage nodes; the final part then explains how the same idea is applied, in the same way, to grade websites using the linking relationships between websites.
Embodiments of the invention are implemented by an Internet search engine system. This search engine system is a computer system with a known software and hardware architecture that performs its functions by running specific instruction sequences (programs). The system consists of three subsystems, document collection, document indexing, and query processing, which respectively discover and collect webpages (HTML or XML documents) and files in other data formats from Internet server sites, index the documents in the document library, and process the query requests submitted by search users and return search results. By extracting, analyzing, and organizing the directed hyperlinks to other webpages that each webpage in the document library contains, the system establishes the linking relationships between webpages and between websites. These linking relationships are usually stored in one or more files in the form of a directed graph. The system numbers each webpage and website in the webpage library with an integer, called the document identifier (doc ID) of the webpage and the site identifier (site ID) of the website. In the discussion below, G denotes the directed graph formed by the linking relationships of webpages or websites; variable names such as i, j, or did denote webpage numbers, ranging from 1 to N (N is the total number of webpages); variable names such as I, J, or sid denote website numbers, ranging from 1 to $N_S$ ($N_S$ is the total number of websites). If webpage i (or website I) is in the directed graph G, this is written i ∈ G (or I ∈ G). If webpage i contains a link pointing to webpage j, this is written i → j; i is called the source webpage of link i → j and j its target webpage. If link i → j exists in G, this is written i → j ∈ G.
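The numbering of webpages and the directed graph G described above might be represented as in the following Python sketch. The helper names `assign_doc_ids` and `build_digraph`, the first-seen numbering order, and the example URLs are assumptions made for illustration; the description does not specify how the system stores the graph beyond "one or more files".

```python
def assign_doc_ids(urls):
    """Number distinct URLs 1..N in first-seen order (an assumed convention)."""
    doc_id = {}
    for url in urls:
        if url not in doc_id:
            doc_id[url] = len(doc_id) + 1
    return doc_id


def build_digraph(url_links):
    """Build the directed graph G from (source_url, target_url) pairs.

    Returns (doc_id, edges), where edges is a set of (i, j) pairs and
    (i, j) in edges corresponds to "i -> j is in G".
    """
    url_links = list(url_links)
    doc_id = assign_doc_ids(url for pair in url_links for url in pair)
    edges = {(doc_id[s], doc_id[t]) for s, t in url_links}
    return doc_id, edges


# Hypothetical links between three example pages.
doc_id, G = build_digraph([
    ("http://example.com/a", "http://example.com/b"),
    ("http://example.com/b", "http://example.com/c"),
    ("http://example.com/a", "http://example.com/c"),
])
```

With this representation, testing `(i, j) in G` answers whether webpage i is the source and webpage j the target of a link.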
■ Basic model:
Grading webpages (or websites) means using some numerical model to determine a value R(i) (or R(I)) for each webpage i (or website I) in G, so as to distinguish its quality, importance, or authority quantitatively. In the following discussion, R(i) denotes the rank of webpage i. Grading based on link analysis determines this rank value from the linking relationships between webpages or websites. The rank is global and independent of the user's query terms. The well-known PageRank method uses the hyperlink relationships between webpages to propagate the initial ranks of webpages in one direction along links; the final distribution of page ranks is the result of this one-way propagation reaching a steady state. Mathematically, this rank propagation is equivalent to a Markov chain over the N nodes with probability distribution P(i) = R(i), and the final rating is the probability distribution that the chain reaches in its stationary state. The main idea of PageRank is that a hyperlink between webpages acts as a citation or recommendation: a webpage recommended by many webpages has greater importance; a recommendation from an important webpage carries greater value; the rank of each webpage is passed outward in equal shares along the links it contains; and the rank a webpage obtains is the sum of all the ranks passed to it along the links that point to it. This sum is the weighted sum of the ranks of the webpages that link to it, where the weight of each linking webpage is the reciprocal of its total number of out-bound links (its out-degree).
As stated above, the one-way propagation of PageRank has a series of shortcomings: it is easily manipulated through artificially placed links, and it underuses the multiple properties of the linking relationships between nodes. The grading method of the invention grades by using multiple properties of linking relationships, so as to reflect more comprehensively, objectively, and accurately the differences in quality, importance, or authority that nodes acquire through linking relationships, and to reduce the influence of link cheating more effectively.
According to embodiments of the invention, the linking-relationship properties that can influence the rank of a webpage include at least the following 4 classes:
● forward links, and the forward weights of these links;
● reverse links, and the reverse weights of these links;
● co-citation relationships between nodes and their attributes;
● co-reference relationships between nodes and their attributes.
The rank of any webpage can be determined recursively from the ranks of other webpages according to some or all of these 4 classes of linking-relationship properties. This gives a practical algorithm for computing page ranks quantitatively from multiple linking-relationship properties: the rank R(i) of webpage i can be determined by a linear superposition (weighted sum) of the ranks R(j) of all other webpages j that have a linking relationship with webpage i. Specifically, according to an embodiment of the invention, the basic model for determining the rank R(i) of webpage i (i = 1, 2, ..., N) is:
$$R(i) = c_1 \sum_{j \to i \,\in\, G} W^{+}(j, i)\, R(j) + c_2 \sum_{i \to j \,\in\, G} W^{-}(i, j)\, R(j) + c_3 \sum_{j \in G} W_{C}(i, j)\, R(j) + c_4 \sum_{j \in G} W_{R}(i, j)\, R(j) + D(i), \tag{1}$$
All summations in the formula run over the index j, with j ≠ i (unless the linking relationships specify a link, co-citation, or co-reference from a webpage to itself). The functions $W^{+}(j, i)$, $W^{-}(i, j)$, $W_{C}(i, j)$, and $W_{R}(i, j)$ in the 4 summations are, respectively, the forward weight of link j → i, the reverse weight of link i → j, the co-citation weight of webpages i and j, and the co-reference weight of webpages i and j. $c_1$, $c_2$, $c_3$, $c_4$ are constant coefficients that represent the proportion each linking-relationship property contributes to the rank; their values can be determined by the model used in practice. D(i), i = 1, 2, ..., N, are N constants that represent a prior distribution of page ranks (that is, the rank value of each webpage in the absence of any linking-relationship influence, i.e. when the weighting functions $W^{+} = W^{-} = W_{C} = W_{R} = 0$). D(i) can also be rewritten in the following form:
$$D(i) = d \cdot E(i), \qquad d = \sum_{i \in G} D(i), \qquad \sum_{i \in G} E(i) = 1,$$
where E(i) = D(i)/d is a normalized vector that can be regarded as the prior probability distribution of page ranks.
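The decomposition of D(i) into a total d and a normalized vector E(i) can be checked numerically with a short Python sketch. The function name `split_prior` and the example values of D are hypothetical.

```python
def split_prior(D):
    """Split the prior constants D(i) into d = sum of D(i) and E(i) = D(i) / d."""
    d = sum(D.values())
    E = {i: value / d for i, value in D.items()}
    return d, E


# Hypothetical prior constants for three webpages.
D = {1: 0.05, 2: 0.1, 3: 0.05}
d, E = split_prior(D)
```

By construction the values of E sum to 1, so E can be read as a probability distribution over webpages.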
Formula (1) is in fact a further weighted sum of four weighted sums, one for each of the 4 classes of rank propagation through linking relationships; the coefficients $c_1$, $c_2$, $c_3$, $c_4$ are the weights of this outer sum. The linking-relationship properties and their weighting functions $W^{+}(j, i)$, $W^{-}(i, j)$, $W_{C}(i, j)$, $W_{R}(i, j)$ are described in detail below.
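To make the structure of formula (1) concrete, the following Python sketch evaluates its right-hand side once for a single webpage, given the current ranks and already computed weights. It is only a sketch under stated assumptions: the function name `rank_update`, the dictionary-based weight tables, and all numeric values are hypothetical, and how the co-citation and co-reference weights are obtained is not modelled here.

```python
def rank_update(i, R, edges, W_plus, W_minus, W_C, W_R, c, D):
    """Evaluate the right-hand side of formula (1) for webpage i.

    R: dict webpage -> current rank R(j).
    edges: set of (source, target) pairs, one per link source -> target.
    W_plus, W_minus, W_C, W_R: dicts keyed by webpage pairs; a missing key
        is treated as weight 0.
    c: tuple (c1, c2, c3, c4) of constant coefficients.
    D: dict webpage -> prior constant D(i).
    Self-relationships (j == i) are skipped in this sketch.
    """
    c1, c2, c3, c4 = c
    others = [j for j in R if j != i]
    # 1st term: in-links j -> i, weighted by the forward weights W+(j, i).
    forward = sum(W_plus.get((j, i), 0.0) * R[j] for j in others if (j, i) in edges)
    # 2nd term: out-links i -> j, weighted by the reverse weights W-(i, j).
    reverse = sum(W_minus.get((i, j), 0.0) * R[j] for j in others if (i, j) in edges)
    # 3rd and 4th terms: co-citation and co-reference weights.
    cocitation = sum(W_C.get((i, j), 0.0) * R[j] for j in others)
    coreference = sum(W_R.get((i, j), 0.0) * R[j] for j in others)
    return c1 * forward + c2 * reverse + c3 * cocitation + c4 * coreference + D[i]


# Hypothetical 3-page graph and weights.
edges = {(1, 3), (2, 3), (3, 1)}
R = {1: 0.2, 2: 0.3, 3: 0.5}
W_plus = {(1, 3): 1.0, (2, 3): 1.0, (3, 1): 1.0}
W_minus = {(1, 3): 0.5, (2, 3): 0.5, (3, 1): 1.0}
W_C = {(1, 2): 0.5, (2, 1): 0.5}
W_R = {}
c = (0.4, 0.2, 0.1, 0.1)
D = {1: 0.05, 2: 0.05, 3: 0.1}

r1 = rank_update(1, R, edges, W_plus, W_minus, W_C, W_R, c, D)
r3 = rank_update(3, R, edges, W_plus, W_minus, W_C, W_R, c, D)
```

Applying such an update to every webpage and repeating until the values stop changing would be one way to solve the recursive definition; the iteration scheme itself is not fixed by this sketch.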
As shown in Fig. 1, the webpages that have a direct linking relationship with any webpage i fall into two classes: the set of webpages j that link to webpage i, and the set of webpages j' that webpage i links to. The links contained in the former that point to webpage i are called the in-bound links of webpage i, or "in-links" for short; the number of in-links is called the in-degree of webpage i, written in-degree(i). The links contained in webpage i that point to other webpages are called the out-bound links of webpage i, or "out-links" for short; the number of out-links is called the out-degree of webpage i, written out-degree(i).
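The two degree functions defined above can be computed from a list of links in one pass each, as in this Python sketch. The function name `degrees` and the example edge list are hypothetical.

```python
from collections import Counter


def degrees(edges):
    """Count in-links and out-links per webpage.

    edges: iterable of (source, target) pairs, one per link source -> target.
    Returns (in_degree, out_degree) as Counters; webpages without links of
    the respective kind simply count as 0.
    """
    edges = list(edges)
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    return in_degree, out_degree


# Hypothetical links: 1 -> 3, 2 -> 3, 3 -> 1.
edges = [(1, 3), (2, 3), (3, 1)]
in_degree, out_degree = degrees(edges)
```

Here webpage 3 has two in-links and one out-link, while webpage 2 has no in-links at all.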
In the grading process, page rank is propagated through (direct or indirect) linking relationships. First, each webpage j counted in the in-degree of webpage i contributes directly to the rank R(i) of webpage i, the contribution of each webpage j being a certain fraction of its own rank R(j). This contribution is passed to webpage i in the forward direction of link j → i, and the proportionality factor $W^{+}(j, i)$ applied to the rank R(j) of webpage j is called the forward weight of link j → i. The rank R(i) of webpage i is thus, first of all, the sum of the ranks of the source webpages of its in-links, weighted by the forward weights of those in-links. This is the first term on the right-hand side of formula (1).
According to the invention, therefore, in-links from different webpages j to the same webpage i differ in importance. The importance of an in-link is represented by the forward weight $W^{+}(j, i)$ of link j → i. Under this grading principle, a recommendation carried by an important link from an important webpage has greater importance. Clearly, the more (out-)links webpage j itself contains, the smaller its rank contribution to each linked webpage should be. This relationship can be expressed through the out-degree of webpage j: the forward weight $W^{+}(j, i)$ of link j → i is taken to be inversely proportional to the out-degree out-degree(j) of webpage j, i.e. $W^{+}(j, i) \propto 1/\text{out-degree}(j)$. Introducing a scale factor $w^{+}(j, i)$, this relationship can be written as
$$W^{+}(j, i) = \frac{w^{+}(j, i)}{\text{out-degree}(j)}. \tag{2}$$
The scale factor $w^{+}(j, i)$ depends on several attributes related to link j → i (explained in detail below). In a simplified application model of the method, it can be taken as
$$w^{+}(j, i) \equiv 1.0 \ \text{ for every link } j \to i; \qquad w^{+}(j, i) = 0 \ \text{ when there is no link } j \to i. \tag{3}$$
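Formulas (2) and (3) can be sketched in Python as follows. The function name `forward_weights`, the optional `scale` argument used to pass per-link scale factors, and the example links are assumptions for illustration; with no scale factors supplied, every existing link gets the factor 1.0 of the simplified model.

```python
from collections import Counter


def forward_weights(edges, scale=None):
    """Compute W+(j, i) = w+(j, i) / out-degree(j) for every link j -> i.

    edges: iterable of (j, i) pairs, one per link j -> i.
    scale: optional dict of per-link scale factors w+(j, i); links not
        listed use 1.0, as in the simplified model of formula (3).
    Pairs without a link are absent from the result, i.e. have weight 0.
    """
    edges = set(edges)
    out_degree = Counter(j for j, _ in edges)
    scale = scale or {}
    return {(j, i): scale.get((j, i), 1.0) / out_degree[j] for (j, i) in edges}


# Hypothetical links: 1 -> 2, 1 -> 3, 2 -> 3.
links = [(1, 2), (1, 3), (2, 3)]
W_plus = forward_weights(links)
W_plus_scaled = forward_weights(links, scale={(1, 2): 2.0})
```

Webpage 1 has two out-links, so each carries forward weight 0.5, whereas the single out-link of webpage 2 carries weight 1.0.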
Secondly, according to the ranking method of the present invention, hyperlinks between webpages influence webpage rank in both directions. Corresponding to the mechanism described above, in which rank is passed forward along a link, passing rank backward along a link can also serve as a valuable mechanism for grading network nodes, and the present invention integrates this mechanism into the ranking method. As shown in Figure 1, each webpage j′ that webpage i links to (that is, each webpage counted in the out-degree of i) also influences the rank R(i) of webpage i. The main idea is as follows. The hyperlinks in a webpage are set entirely at the discretion of its author. Although authors cannot control the links that point to their pages, they can freely choose the websites and webpages that their own pages link to. Reverse rank transfer places an effective constraint on, and a positive incentive for, this discretionary behaviour: if an author actively links to high-quality webpages, then as a reward the rank of the author's page may improve markedly; if the page links to low-quality webpages, its rank increases only slightly and gains no substantial lift. Some anti-cheating techniques apply a similar mechanism, for example by penalizing to some degree (lowering the priority in webpage collection, updating and search-result ranking) webpages or websites that contain links to known cheating websites.
On the other hand, the rank benefit obtainable from linking to a high-quality webpage also depends on the in-degree of that webpage. If many links point to a given high-quality webpage, the rank that this webpage contributes to each webpage linking to it is smaller. By integrating the reverse rank-transfer mechanism into the ranking method, the present invention can therefore better balance the various artificially controllable factors.
Therefore, each webpage j′ that webpage i links to also contributes part of its rank R(j′) to the rank R(i) of webpage i. This contribution is the rank R(j′) of webpage j′ passed backward along the link i → j′ to webpage i, so its proportionality coefficient $W^{-}(i,j')$ is called the reverse weight of the link i → j′. The rank R(i) of webpage i thus also includes the sum of the ranks of the target webpages of its out-links, weighted by the reverse weights of those out-links, which is the second term on the right-hand side of formula (1).
As noted above, different out-links of the same webpage i differ in importance. The importance of an out-link is represented by the reverse weight $W^{-}(i,j')$ of the link i → j′. Under this grading principle, a high-quality (large-weight) link to a high-quality webpage can raise the quality of the linking page considerably, whereas a link to a low-quality webpage yields no substantial gain, even if that link has a large reverse weight.
As with the forward weight, for the reverse weight $W^{-}(i,j')$ of the link i → j′, the more in-links webpage j′ has, the smaller its rank contribution to each webpage i that links to it should be. This relationship can be expressed through the in-degree of webpage j′: the reverse weight $W^{-}(i,j')$ of the link i → j′ is inversely proportional to the in-degree in-degree(j′) of webpage j′, i.e. $W^{-}(i,j') \propto 1/\operatorname{in-degree}(j')$. Introducing a scale factor $w^{-}(i,j)$, this relationship can be written as
$$W^{-}(i,j) = \frac{w^{-}(i,j)}{\operatorname{in-degree}(j)}. \tag{4}$$
The factor $w^{-}(i,j)$ depends on several attributes of the link i → j (described in detail below). In the simplified application scenario it can be taken as
$$w^{-}(i,j) \equiv 1.0 \ \text{when the link } i \to j \in G \text{ exists}; \qquad w^{-}(i,j) = 0 \ \text{when there is no link } i \to j \in G. \tag{5}$$
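The reverse weight of formula (4) can be sketched in the same way for the simplified case $w^{-} = 1$ of formula (5). The function name `reverse_weights` and the page labels are hypothetical and serve only as an illustration.

```python
from collections import Counter

def reverse_weights(links):
    """Return {(i, j): W-(i, j)} for each link i -> j with w-(i, j) = 1."""
    in_degree = Counter(dst for _, dst in links)
    return {(i, j): 1.0 / in_degree[j] for i, j in links}

# Hypothetical network: "hub" is linked from three pages, "niche" from one.
links = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("a", "niche")]
weights = reverse_weights(links)
print(weights[("a", "hub")], weights[("a", "niche")])
```

In this toy network page "a" receives a third of the share that "hub" passes backward, because "hub" has three in-links, but the whole share of "niche", which only "a" links to.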
Combining the bidirectional character of links with the two-directional weights described above, according to the embodiment of the invention both a recommendation through an important link from an important webpage (an in-link) and a citation through an important link to an important webpage (an out-link) have a greater influence on the importance of a webpage. This mechanism encourages high-quality (large-weight) links to high-quality webpages, improves the overall quality of hyperlink relationships, and to a large extent reduces links to lower-quality information as well as link exchanges.
Thirdly, according to the ranking method of the present invention, rank is also transferred between webpages or websites that are related by co-citation or co-reference. That is, co-citation and co-reference act as a kind of indirect "reciprocal link" between webpages or websites, through which their rank values are passed to, and increase, one another.
As shown in Figure 2, there is no direct link between webpages 2 and 3, but another webpage, numbered 1, contains the links 1 → 2 and 1 → 3 to both of them. In other words, webpages 2 and 3 are both cited by webpage 1, or equivalently webpage 1 contains a co-citation of webpages 2 and 3. Webpages 2 and 3 thus form an indirect relationship through webpage 1, which is the co-citation relationship between webpages referred to above. Clearly, this is a mutual (that is, bidirectional) indirect link relationship.
In Figure 3 there is likewise no direct link between webpages 2 and 3, but both point to another webpage, numbered 1. Webpages 2 and 3 thus form another kind of indirect relationship through the direct links 2 → 1 and 3 → 1, namely the co-reference relationship. The link direction in co-reference is exactly opposite to that in co-citation (it amounts to a "reverse co-citation"). This is also a mutual, bidirectional relationship, corresponding to a bidirectional rank transfer.
Two webpages that are co-cited by many webpages, and two webpages that point to many of the same webpages, generally have a strong correlation, for example the same field or topic, or citations of resources of a similar type. Typically, the websites of companies that compete commercially do not link to one another, yet their content is strongly related. Taken as a whole, many co-citations and co-references can exist among such competing business websites: a fair number of third-party webpages cite them together, and they may also point to some of the same third-party webpages or websites. Prior art such as the PageRank method does not make use of these derived link properties. The ranking method of the present invention integrates both kinds of indirect link relationship into the rating model in order to further improve the objectivity and stability of the rating result.
Clearly, this "reciprocal link" relationship formed indirectly through third-party webpages or websites reflects fairly objectively how nodes are related in topic, content or type, and better reflects the global influence of the network's link structure on a node. At the same time it is much harder to manipulate than a direct one-way hyperlink, and is therefore highly resistant to link cheating. Cheating through co-citation and co-reference is far more difficult than cheating through link accumulation or link exchange. Taking economic cost, technical difficulty and competition together, it is in practice hard to raise the rank of one's own webpage significantly by artificially creating a large number of co-citations or co-references without at the same time raising the rank of competitors' webpages.
According to the embodiment of the invention, each webpage j that has a co-citation relationship with webpage i contributes part of its rank R(j) to the rank R(i) of webpage i; the proportionality coefficient $W^{C}(i,j)$ is called the co-citation weight of webpages i and j. Likewise, each webpage j that has a co-reference relationship with webpage i contributes part of its rank R(j) to R(i); the proportionality coefficient $W^{R}(i,j)$ is called the co-reference weight of webpages i and j. These two contributions form the third and fourth terms on the right-hand side of formula (1), respectively.
Further, the weights $W^{C}(i,j)$ and $W^{R}(i,j)$ can be specified by introducing two new functions, coci-degree(i, j) and coref-degree(i, j), which express the frequency of co-citation and of co-reference, respectively. For webpages i and j that are co-cited, the more third-party webpages cite both of them, the more likely i and j are, overall, to be browsed together, which shows up as a larger transition probability between the two webpages. The co-citation weight $W^{C}(i,j)$ is precisely the probability intensity of this transition from webpage j to webpage i. $W^{C}(i,j)$ is therefore a function of the number of co-citations (the co-citation frequency) between webpages i and j. Letting coci-degree(i, j) denote the contribution of the co-citation frequency to the probability of jumping from webpage i to webpage j, the co-citation weight satisfies $W^{C}(i,j) \propto \operatorname{coci-degree}(i,j)$. Introducing a scale factor $w^{C}(i,j)$, this relationship is written as
$$W^{C}(i,j) \propto w^{C}(i,j)\cdot\operatorname{coci-degree}(i,j). \tag{6}$$
The factor $w^{C}(i,j)$ depends on the attributes of webpages i and j (described in detail below). In the simplified application scenario it can be taken as
$$w^{C}(i,j) \equiv 1.0 \ \text{when a co-citation of } i \text{ and } j \text{ exists}; \qquad w^{C}(i,j) = 0 \ \text{when none exists}. \tag{7}$$
Correspondingly, the co-reference weight $W^{R}(i,j)$ can be regarded as the intensity of the transition probability from webpage j to webpage i induced by co-reference, and is a function proportional to the number of co-references (the co-reference frequency) between webpages i and j. Letting coref-degree(i, j) denote the contribution of the co-reference frequency to the probability of jumping from webpage i to webpage j, and introducing a scale factor $w^{R}(i,j)$, $W^{R}(i,j)$ can be expressed as
$$W^{R}(i,j) \propto w^{R}(i,j)\cdot\operatorname{coref-degree}(i,j). \tag{8}$$
The factor $w^{R}(i,j)$ depends on the attributes of webpages i and j (described in detail below). In the simplified application scenario it can be taken as
$$w^{R}(i,j) \equiv 1.0 \ \text{when a co-reference of } i \text{ and } j \text{ exists}; \qquad w^{R}(i,j) = 0 \ \text{when none exists}. \tag{9}$$
According to the embodiment of the invention, the coefficient coci-degree(i, j) is a function of the co-citation frequency coci_freq(i, j) between webpages i and j, namely
$$\operatorname{coci-degree}(i,j) = f(\operatorname{coci\_freq}(i,j)).$$
In the system configuration of the preferred embodiment, coci-degree(i, j) is proportional to the co-citation frequency between webpages i and j and may be defined as coci-degree(i, j) = coci_freq(i, j). For i = j, one may take coci_freq(i, i) = in-degree(i). That is,
$$\operatorname{coci-degree}(i,j) = \operatorname{coci\_freq}(i,j),\ i \ne j; \qquad \operatorname{coci-degree}(i,i) = \operatorname{in-degree}(i). \tag{10}$$
The present invention can also use other functional forms f for coci-degree(i, j), in order to capture other aspects of the effect of the co-citation relationship on webpage or website grading. For example, f(coci_freq) may take forms such as $\log(\text{coci\_freq})$ or $(\text{coci\_freq})^{1/2}$.
Similarly, according to the embodiment of the invention, the coefficient coref-degree(i, j) is a function of the co-reference frequency coref_freq(i, j) between webpages i and j, namely
$$\operatorname{coref-degree}(i,j) = g(\operatorname{coref\_freq}(i,j)).$$
In the preferred system configuration, coref-degree(i, j) is proportional to the co-reference frequency between webpages i and j and is defined as coref-degree(i, j) = coref_freq(i, j). For i = j, coref_freq(i, i) = out-degree(i). That is,
$$\operatorname{coref-degree}(i,j) = \operatorname{coref\_freq}(i,j),\ i \ne j; \qquad \operatorname{coref-degree}(i,i) = \operatorname{out-degree}(i). \tag{11}$$
The present invention can also use other functional forms g as needed, for example $\log(\operatorname{coref\_freq}(i,j))$ or $[\operatorname{coref\_freq}(i,j)]^{1/2}$.
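The frequencies behind formulas (10) and (11) can be counted directly from a list of links, as in the following Python sketch. The names `pair_frequencies` and `to_degree` and the sample links are assumptions for illustration. Only pairs with i ≠ j are produced; the i = j convention of formulas (10) and (11) is left out of the sketch.

```python
import math
from collections import defaultdict
from itertools import permutations

def pair_frequencies(links):
    """Return (coci_freq, coref_freq) as dicts keyed by ordered pairs (i, j), i != j."""
    out_links, in_links = defaultdict(set), defaultdict(set)
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)
    coci, coref = defaultdict(int), defaultdict(int)
    for targets in out_links.values():      # pages cited together by one source
        for pair in permutations(targets, 2):
            coci[pair] += 1
    for sources in in_links.values():       # pages pointing at one shared target
        for pair in permutations(sources, 2):
            coref[pair] += 1
    return dict(coci), dict(coref)

def to_degree(freq, f=lambda x: x):
    """Apply a functional form f (or g) to every frequency."""
    return {pair: f(value) for pair, value in freq.items()}

links = [(1, 2), (1, 3), (2, 3), (3, 1)]        # hypothetical links
coci_freq, coref_freq = pair_frequencies(links)
coci_degree = to_degree(coci_freq)               # identity form, as in formula (10)
coref_sqrt = to_degree(coref_freq, math.sqrt)    # one of the alternative forms
```

Both frequencies are symmetric in i and j, which matches the description of co-citation and co-reference as mutual relationships.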
■ Grading algorithm:
Combining the grading factors above, the grading flow of the embodiment of the invention is shown in Figure 4. In step 410, based on the link relationships between webpage nodes and following the description above, a forward weight $W^{+}$ and a reverse weight $W^{-}$ are set for each link between nodes, a weight $W^{C}$ is set for each co-citation between any two nodes, and a weight $W^{R}$ is set for each co-reference between any two nodes. Then, in step 420, according to the rating model of formula (1) and the four classes of link property described by formulas (2)~(11), the rank R(i) of each webpage i is determined from the following factors: the rank R(j) of each webpage j that links to webpage i, and the forward weights $W^{+}(j,i)$ of those links; the rank R(j) of each webpage j that webpage i links to, and the reverse weights $W^{-}(i,j)$ of those links; the rank R(j) of each webpage j that is co-cited with webpage i, and the co-citation weights $W^{C}(i,j)$; and the rank R(j) of each webpage j that is co-referent with webpage i, and the co-reference weights $W^{R}(i,j)$. From these factors the rank value R(i) of each webpage i can be solved for exactly.
The grading process above embodies a concrete algorithm, which can be described by the following probability-transfer formula. It is a system of N linear equations in N unknowns, in which the rank R(i) of a webpage is equivalent to the probability that webpage i is selected (browsed or clicked) at random:
$$
\begin{aligned}
R(i) ={}& c_1 \sum_{j \to i \in G} \frac{w^{+}(j,i)}{\operatorname{out-degree}(j)}\, R(j) + c_2 \sum_{i \to j \in G} \frac{w^{-}(i,j)}{\operatorname{in-degree}(j)}\, R(j) \\
&+ c_3 \sum_{j \in G,\, j \ne i} \frac{\operatorname{coci-degree}(i,j)\cdot w^{C}(i,j)}{\alpha(j)}\, R(j) \\
&+ c_4 \sum_{j \in G,\, j \ne i} \frac{\operatorname{coref-degree}(i,j)\cdot w^{R}(i,j)}{\beta(j)}\, R(j) + d\cdot E(i),
\end{aligned} \tag{12}
$$
In this formula, α(j) and β(j) are normalization factors of the probability matrices, and $w^{+}$, $w^{-}$, $w^{C}$ and $w^{R}$ are the weight factors of the four classes of probability-transition mechanism above. As required for a probability-transfer transformation, the constants $c_1, c_2, c_3, c_4$ and d satisfy
$$d = 1 - (c_1 + c_2 + c_3 + c_4). \tag{13}$$
The normalized vector E(i) satisfies $\sum_{i \in G} E(i) = 1$ and acts as an external probability source: d·E(i) is the probability that a page viewer picks node i at random from the whole network rather than following the links between webpages. It is called the "personalized grading vector" here, and its basic properties are the same as those of the personalization vector in PageRank. In the preferred configuration of the embodiment, every component of the external probability source vector E(i) is taken as 1/N, that is, a uniform prior probability distribution.
The ranks R(i) of the N webpages determined by the algorithm above form the steady-state probability distribution of the random process of browsing webpages along link relationships, and therefore satisfy the following non-negativity and normalization conditions:
$$\forall i:\ R(i) \ge 0; \qquad \sum_{i=1}^{N} R(i) \equiv 1. \tag{14}$$
Collecting the ranks R(i) of the N webpages into a column vector R, the formula above can be written in matrix form:
$$R = M(c_1,c_2,c_3,c_4)\cdot R, \tag{15}$$
where the matrix M is a linear combination of several matrices:
$$M(c_1,c_2,c_3,c_4) = c_1 M^{+} + c_2 M^{-} + c_3 M^{C} + c_4 M^{R} + d\,M^{0}. \tag{16}$$
For webpages i, j, x ∈ G, the matrices on the right-hand side are defined as follows:
$$M^{+}_{i,j} = \frac{w^{+}(j,i)}{\operatorname{out-degree}(j)} \quad (\text{for a link } j \to i), \tag{17}$$
$$M^{-}_{i,j} = \frac{w^{-}(i,j)}{\operatorname{in-degree}(j)} \quad (\text{for a link } i \to j), \tag{18}$$
$$M^{C}_{i,j} = \frac{\operatorname{coci-degree}(i,j)\cdot w^{C}(i,j)}{\alpha(j)} \quad (\text{for a co-citation } x \to i,\ x \to j), \tag{19}$$
$$M^{R}_{i,j} = \frac{\operatorname{coref-degree}(i,j)\cdot w^{R}(i,j)}{\beta(j)} \quad (\text{for a co-reference } i \to x,\ j \to x), \tag{20}$$
$$M^{0}_{i,j} = E(i) \quad \text{for every webpage } j = 1, 2, \ldots, N. \tag{21}$$
The derivation of the matrix $M^{0}$ uses the non-negativity and normalization properties (14) of the rank vector R: since $\sum_{j} R(j) = 1$, the term d·E(i) in formula (12) equals $d \sum_{j} M^{0}_{i,j} R(j)$.
Each of the matrices M, $M^{+}$, $M^{-}$, $M^{C}$ and $M^{R}$ above is the probability transfer matrix of a Markov chain, and all of them satisfy a basic property of such matrices: for any node j in G, the elements of column j of the transition matrix sum to 1, that is:
$$\forall j,\ \forall M^{k} \in \{M, M^{+}, M^{-}, M^{C}, M^{R}, M^{0}\}:\quad \sum_{i \in G} M^{k}_{i,j} = 1. \tag{22}$$
This property guarantees that the non-negativity and normalization of the vector R are preserved under the probability-transfer transformation. From (17) and (18), the following relations hold:
$$\sum_{i \in G} w^{+}(j,i) = \operatorname{out-degree}(j), \qquad \sum_{i \in G} w^{-}(i,j) = \operatorname{in-degree}(j). \tag{23}$$
For the normalization factors α and β, applying the probability-transfer-matrix property to (19) and (20) gives the definitions:
$$\alpha(j) = \sum_{i \in G,\, i \ne j} \operatorname{coci-degree}(i,j)\cdot w^{C}(i,j) \ \text{ if } \exists\, i: w^{C}(i,j) \ne 0; \quad \text{otherwise } \alpha(j) = 1, \tag{24}$$
$$\beta(j) = \sum_{i \in G,\, i \ne j} \operatorname{coref-degree}(i,j)\cdot w^{R}(i,j) \ \text{ if } \exists\, i: w^{R}(i,j) \ne 0; \quad \text{otherwise } \beta(j) = 1. \tag{25}$$
Thus the factor α(j) is the total (weighted) frequency of the co-citation relationships in which webpage j takes part, and the ratio coci-degree(i, j)/α(j) is the share of the rank R(j) of webpage j that is distributed to webpage i through co-citation. Likewise, β(j) is the total (weighted) frequency of the co-reference relationships in which webpage j takes part, and coref-degree(i, j)/β(j) is the share of R(j) distributed to webpage i through co-reference.
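The normalization of formulas (24) and (25) can be sketched as follows. The function names, the dictionary layout and the sample co-citation degrees are hypothetical. Testing whether the weighted total is positive stands in for the condition "some $w^{C}(i,j) \ne 0$", which is equivalent only when all degrees are positive.

```python
def normalizer(degree, weight, nodes):
    """alpha(j) or beta(j): weighted total frequency of column j, or 1.0 if empty."""
    totals = {j: 0.0 for j in nodes}
    for (i, j), deg in degree.items():
        if i != j:
            totals[j] += deg * weight.get((i, j), 0.0)
    return {j: (t if t > 0 else 1.0) for j, t in totals.items()}

def shares(degree, weight, norm):
    """Matrix entries of formula (19) or (20): the share of R(j) passed to page i."""
    return {(i, j): deg * weight.get((i, j), 0.0) / norm[j]
            for (i, j), deg in degree.items() if i != j}

nodes = [1, 2, 3, 4]
# Hypothetical co-citation degrees: page 3 is co-cited twice with 2, once with 4.
coci_degree = {(2, 3): 2, (3, 2): 2, (4, 3): 1, (3, 4): 1}
w_c = {pair: 1.0 for pair in coci_degree}
alpha = normalizer(coci_degree, w_c, nodes)
m_c = shares(coci_degree, w_c, alpha)
```

In this toy data page 3 hands two thirds of its co-citation share to page 2 and one third to page 4, while page 1 takes part in no co-citation and keeps α(1) = 1.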
The probability transfer matrix $M(c_1,c_2,c_3,c_4)$ above represents the unified model of the embodiment of the invention, and can be regarded as an enhanced PageRank model based on bidirectional link weights and bidirectional co-citation relationships. When some of the constants $c_1, c_2, c_3, c_4$ are set to 0, different simplified models are obtained from $M(c_1,c_2,c_3,c_4)$. For example, PageRank is in fact the special case represented by the matrix $M(1-d, 0, 0, 0)$, with the further assumption that all forward link weights $w^{+}(i,j) = 1$. Several other important simplified models include:
$$R^{+-} = M(c_1,c_2,0,0)\cdot R^{+-}, \qquad R^{+C} = M(c_1,0,c_3,0)\cdot R^{+C}, \qquad R^{+R} = M(c_1,0,0,c_4)\cdot R^{+R},$$
and
$$R^{+-C} = M(c_1,c_2,c_3,0)\cdot R^{+-C}, \qquad R^{+-R} = M(c_1,c_2,0,c_4)\cdot R^{+-R}, \qquad R^{+CR} = M(c_1,0,c_3,c_4)\cdot R^{+CR}.$$
Each of these rating models uses a subset of the grading factors and can be used to produce several rating results for the same network structure. These results can be applied, separately or jointly, to different purposes. For example, the rank vector $R^{+-}$ can be used on its own to measure, in part, the effect of creating high-quality hyperlinks, that is, of actively linking to high-quality webpages.
In addition, one problem needs special handling in order to guarantee that the elements of every column of each transition matrix sum to 1: in a real network link structure there are usually nodes whose out-degree or in-degree is 0. For example, for a non-webpage document j (a PDF file, a Word DOC file and so on), or a webpage j that has not yet been, or cannot be, downloaded successfully, out-degree(j) = 0; and for a website home page k that is not linked from any other webpage, in-degree(k) = 0. In the former case the corresponding column of $M^{+}$ is all zeros and therefore cannot satisfy the normalization formula (22). In the latter case the corresponding column of $M^{-}$ is all zeros and does not satisfy formula (22). The existence of such webpages may also cause $M^{C}$ and $M^{R}$ to have all-zero columns, which cannot satisfy formula (22).
In the corresponding Markov chain, these nodes with zero in-degree or out-degree are called "dangling nodes". The embodiment of the invention handles them with a standard mathematical technique. If the network has N nodes in total, then for a node whose in-degree is 0 the in-degree is corrected to N, and for a node whose out-degree is 0 the out-degree is corrected to N. The new links that a corrected node thereby acquires (called "virtual links") all have forward and reverse link weights of 1.0; that is, for any link whose source or target node is a corrected node, $w^{\pm} = 1.0$. In addition, corrected nodes take no part in the computation of co-citation and co-reference frequencies. (Other nodes are left unchanged.)
After this treatment no node in the network has an in-degree or out-degree of 0, so the matrix $M(c_1,c_2,c_3,c_4)$ above is a valid probability transfer matrix for any network link structure.
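One way to read the dangling-node correction is sketched below in Python for the simplified model with $w^{\pm} = 1$. It rests on an assumption that the text does not spell out: the N virtual links of a corrected node fill only that node's own matrix column with 1/N and leave the degrees of all other nodes unchanged. The function names and the sample network are hypothetical.

```python
def forward_matrix(links, n):
    """Dense M+ for nodes 0..n-1; entry [i][j] is the share of R(j) passed to i.

    A node with out-degree 0 is given n virtual links of weight 1.0, so its
    column becomes uniform (1/n in every row).
    """
    out_links = [[] for _ in range(n)]
    for src, dst in links:
        out_links[src].append(dst)
    m = [[0.0] * n for _ in range(n)]
    for j in range(n):
        targets = out_links[j] or list(range(n))   # virtual links if dangling
        for i in targets:
            m[i][j] += 1.0 / len(targets)
    return m

def reverse_matrix(links, n):
    """Dense M-: by formula (18) it equals M+ of the graph with every link reversed."""
    return forward_matrix([(dst, src) for src, dst in links], n)

# Hypothetical network: node 2 has no out-links, node 0 has no in-links.
links = [(0, 1), (0, 2), (1, 2)]
m_plus = forward_matrix(links, 3)
m_minus = reverse_matrix(links, 3)
```

After the correction every column of both matrices sums to 1, which is the property required by formula (22).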
The result obtained by the grading algorithm described by formula (12) or (15) is in fact the principal eigenvector of the N-dimensional matrix $M(c_1,c_2,c_3,c_4)$. The algorithm can be implemented efficiently (described in detail below).
■ Parameter and weight factor settings:
In the ranking method above, the model parameters $c_1, c_2, c_3, c_4$ and d can be adjusted for the specific application. The parameter d plays a special role. On the one hand it expresses the probability intensity with which a page viewer picks a webpage node at random instead of following links; on the other hand it is related to the convergence rate of the iterative computation of the grading algorithm: the larger d is, the faster the iteration converges, but the further the rating result departs from the actual network link structure. Mathematically, the purpose of introducing the parameter d (the external probability source) is to accelerate the Markov chain's approach to its steady state.
To converge quickly while departing little from the network structure, one can typically take $d \approx 10\%$, i.e. $c_1 + c_2 + c_3 + c_4 \approx 90\%$. The ratios among $c_1, c_2, c_3, c_4$ can in turn be adjusted as needed, thereby adjusting how much each link property contributes to the rank. To emphasize direct link relationships, $c_1$ and $c_2$ can be increased; to emphasize the indirect "reciprocal link" relationships formed through third-party webpages or websites, $c_3$ and $c_4$ can be increased. The relative proportion of $c_1$ to $c_2$, and of $c_3$ to $c_4$, can be adjusted in the same way.
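The role of d and of the coefficients can be seen in a minimal power-iteration sketch in Python. It is an illustration under stated assumptions, not the patent's implementation: the function name `grade`, the dense list-of-lists matrices, the uniform E(i) = 1/N and the renormalization after each step are all choices made here. The example uses only the forward and reverse matrices ($c_3 = c_4 = 0$) of a hypothetical 3-node network.

```python
def grade(matrices, coeffs, tol=1e-12, max_iter=10_000):
    """Iterate R <- sum_k c_k * M_k * R + d * E with E(i) = 1/N until R settles."""
    n = len(matrices[0])
    d = 1.0 - sum(coeffs)                      # formula (13)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [d / n] * n                      # d * E(i) with uniform E
        for c, m in zip(coeffs, matrices):
            for i in range(n):
                nxt[i] += c * sum(m[i][j] * r[j] for j in range(n))
        total = sum(nxt)
        nxt = [v / total for v in nxt]         # keep sum(R) = 1 (assumed safeguard)
        if sum(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r

# Hypothetical network 0->1, 0->2, 1->2, 2->0; entry [i][j] feeds R(j) into R(i).
M_PLUS = [[0.0, 0.0, 1.0],
          [0.5, 0.0, 0.0],
          [0.5, 1.0, 0.0]]
M_MINUS = [[0.0, 1.0, 0.5],
           [0.0, 0.0, 0.5],
           [1.0, 0.0, 0.0]]
ranks = grade([M_PLUS, M_MINUS], coeffs=[0.45, 0.45])   # d = 0.1
print(ranks)
```

Raising d shortens the iteration at the cost of pulling every R(i) toward 1/N, which is the trade-off described above.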
The related weight factor w of above-mentioned grading algorithm +, w -, w CAnd w RRepresent that respectively 4 kinds of linking relationship character between the webpage shift the strength factor (transmission ratio) of (being the rank transmission) to probability, they all are the functions of the multiple association attributes of concrete webpage i and j.
According to the embodiment of the invention, weight factor w +, w -, w CAnd w ROne or morely can get constant value.Simplify in the model of using w for one in this method +, w -, w CAnd w RAll is constant, and distinguishes by formula (3), (5), (7), (9) value, but integrating representation is:
w +=w -=w C=w R=1.0, when there being corresponding linking relationship;=0, when no corresponding relation. (26)
And as weight factor w CAnd w RBe taken as at 1 o'clock, for quoting altogether and co-reference of non-NULL between webpage i and the j, by above-mentioned definition, normalizing factor α and β are reduced to
α ( j ) = Σ i ∈ G , i ≠ j coci - degree ( i , j ) , β ( j ) = Σ i ∈ G , i ≠ j coref - degree ( i , j ) . - - - ( 27 )
Promptly be respectively common adduction relationship that webpage j participated in and total frequency of co-reference.
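As an example, the simplified model of this ranking method can be applied to the network shown in Figure 5, which contains N = 3 webpage (or website) nodes and 4 links. From these link relationships,
$$\operatorname{out-degree}(1) = 2, \quad \operatorname{out-degree}(2) = 1, \quad \operatorname{out-degree}(3) = 1;$$
$$\operatorname{in-degree}(1) = 1, \quad \operatorname{in-degree}(2) = 1, \quad \operatorname{in-degree}(3) = 2;$$
$$\operatorname{coci-degree}(2,3) = \operatorname{coci-degree}(3,2) = 1;$$
$$\operatorname{coref-degree}(1,2) = \operatorname{coref-degree}(2,1) = 1;$$
$$w^{-}(1,2) = w^{-}(1,3) = w^{-}(2,3) = w^{-}(3,1) = 1.0, \quad w^{-}(i,j) = 0 \ \text{for all other } i, j;$$
$$w^{+}(1,2) = w^{+}(1,3) = w^{+}(2,3) = w^{+}(3,1) = 1.0, \quad w^{+}(j,i) = 0 \ \text{for all other } j, i;$$
$$w^{C}(2,3) = w^{C}(3,2) = 1.0, \quad w^{C}(i,j) = 0 \ \text{for all other } i, j;$$
$$w^{R}(1,2) = w^{R}(2,1) = 1.0, \quad w^{R}(i,j) = 0 \ \text{for all other } i, j.$$
From the definitions of α and β and formula (27),
$$\alpha(1) = 1, \quad \alpha(2) = 1, \quad \alpha(3) = 1; \qquad \beta(1) = 1, \quad \beta(2) = 1, \quad \beta(3) = 1.$$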
Substituting these factors and the prior probability distribution E(i) = 1/3 into the grading formula (12) gives the following system of linear equations:
$$
\begin{aligned}
R(1) &= (c_2 + c_4)\cdot R(2) + (c_1 + c_2/2)\cdot R(3) + d/3,\\
R(2) &= (c_1/2 + c_4)\cdot R(1) + (c_2/2 + c_3)\cdot R(3) + d/3,\\
R(3) &= (c_1/2 + c_2)\cdot R(1) + (c_1 + c_3)\cdot R(2) + d/3,
\end{aligned}
$$
together with the constraint R(1) + R(2) + R(3) = 1.
Clearly, R(i) is a function of the parameters $c_1, c_2, c_3, c_4$ and d. As a simple example, let d = 0 and take the weighting coefficients $c_1 = c_2 = c_3 = c_4 = 1/4$ (equal weights); the rating result is then
$$R(1) = 36/121 \approx 0.2975, \qquad R(2) = 3/11 \approx 0.2727, \qquad R(3) = 52/121 \approx 0.4298.$$
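The quoted values can be checked with exact arithmetic. The Python sketch below solves the equations for R(1) and R(2) together with the constraint R(1) + R(2) + R(3) = 1, using equal weights and d = 0. Which equations to combine with the constraint is an assumption made here, since the text does not say; the helper `solve` is a generic Gauss-Jordan routine written for this illustration.

```python
from fractions import Fraction as F

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination, exactly."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]

c1 = c2 = c3 = c4 = F(1, 4)
d = 1 - (c1 + c2 + c3 + c4)          # formula (13); equals 0 here

# Rows: equation for R(1), equation for R(2), normalization constraint.
a = [
    [F(1), -(c2 + c4), -(c1 + c2 / 2)],
    [-(c1 / 2 + c4), F(1), -(c2 / 2 + c3)],
    [F(1), F(1), F(1)],
]
b = [d / 3, d / 3, F(1)]
ranks = solve(a, b)
print(ranks)   # [Fraction(36, 121), Fraction(3, 11), Fraction(52, 121)]
```

Using `fractions.Fraction` keeps the result in the same rational form as the values quoted above, so no rounding is involved in the comparison.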
When the method of the present invention is used for finer-grained webpage grading, the weight factors $w^{+}(j,i)$, $w^{-}(i,j)$, $w^{C}(i,j)$ and $w^{R}(i,j)$ can be defined and adjusted according to specific attributes of webpages i and j, so as to reflect more accurately how the four kinds of link property between webpages transfer webpage rank. For example, let the functions $A_1(i)$, $A_2(j)$ and $A_3(i,j)$ denote the effect on the weight factor of the attributes of webpage i, the attributes of webpage j, and the attributes of the link i → j or j → i, respectively. The link weight factor $w^{+}$ or $w^{-}$ can then be expressed as
$$w^{+,-}(i,j) = A_1(i)\cdot A_2(j)\cdot A_3(i,j),$$
and the co-citation and co-reference weight factors $w^{C}$ and $w^{R}$ can be expressed as
Figure C200610165801D0016144801QIETU
where x is a webpage that forms a co-citation or co-reference with i and j.
The attributes of a webpage include: the URL of the webpage and the attributes of that URL; the creation, collection and/or update time of the webpage; the number and frequency of visits to the webpage; the result of the webpage's previous grading; and so on. The URL attributes of a webpage include: attributes of the host name and domain name (domain registration information, host IP address and its region, etc.), the depth of the file directory, the file name and its length, and so on.
The attributes of a link i → j include the attributes of the link within webpage i and the attributes of webpage j. The latter are as described above. The former include: the position of the link in webpage i (whether it is at the top or in the middle of the page, etc.); the link text (including word length, number of keywords, subject category of the keywords, etc.); the typographic format of the link (including font size, colour, the relative size and visual effect of a linked image, and other HTML tag information); and the number of times and frequency with which the link is clicked in the page, together with information such as the source of the clicks. The attributes of the link i → j also include comparisons between the attributes of webpage i and those of webpage j, including: comparison attributes of the URLs of webpages i and j (for example the distance between the IP addresses of the two hosts or their actual geographic distance, and a comparison of the two file-directory depths); the difference in their visit counts and in their visitor sources; and the difference in the text attributes of webpages i and j (including the character length, number of keywords, and density of keywords and links of each, and the degree of similarity of the two texts).
In practical applications of the method, each weight factor can be adjusted according to some or all of the attributes above. For example, for the forward link weight factor $w^{+}(j,i)$, the weight of each out-link can be differentiated mainly according to attributes such as the position of the link in the source webpage j and its visual presentation, so as to model more accurately the proportion of rank that webpage j passes outward through each of its links. For the reverse link weight factor $w^{-}(i,j)$, the main consideration can be the correlation between webpages i and j (including the degree of correlation of title, link text, main content, host information in the URL, etc.): the more strongly the link i → j and the webpage j it points to are correlated with webpage i, the larger the proportion $w^{-}(i,j)$ of the rank of webpage j that is contributed to the rank of webpage i.
The co-citation weight factor $w^{C}(i,j)$ of two webpages i and j expresses the importance of a webpage j that is co-cited with webpage i within the set of all webpages co-cited with webpage i. The process of determining this weight is called co-citation weighting. In the simple case, all webpages co-cited with webpage i have the same importance, i.e. $w^{C}(i,j) = 1$, and the co-citation weight $W^{C}(i,j)$ in formula (1) is simply proportional to the co-citation frequency coci-degree(i, j). In the general case, the weight factor $w^{C}(i,j)$ is determined by comparing the attributes of webpage i with the relevant attributes of every webpage j co-cited with it, including the attributes of each webpage x that points to both i and j and the attributes of the links x → i and x → j. From these attributes a distance measure between webpages i and j can be determined, and a webpage j at a smaller distance from webpage i receives a larger weight factor $w^{C}(i,j)$.
The co-reference weight factor $w^{R}(i,j)$ expresses the importance of a webpage j that is co-referent with webpage i within the set of all webpages co-referent with webpage i. The process of determining this weight is called co-reference weighting. In the simple case, all webpages co-referent with webpage i have the same importance, i.e. $w^{R}(i,j) = 1$, and the co-reference weight $W^{R}(i,j)$ in formula (1) is simply proportional to the co-reference frequency coref-degree(i, j). In the general case, the weight factor $w^{R}(i,j)$ is determined from the webpage attributes and link attributes above and reflects a distance measure between webpages i and j: if the distance between webpages i and j is smaller, the weight factor $w^{R}(i,j)$ is correspondingly larger.
■ Implementation of the grading algorithm:
The result R(i) obtained by the grading algorithm of formula (12) or (15) is the principal eigenvector of the N-dimensional matrix $M(c_1,c_2,c_3,c_4)$ (the eigenvector corresponding to the largest eigenvalue). In the search engine system of the embodiment described above, an efficient implementation of this algorithm relies on a set of key data structures, namely the stored data and formats of the matrices $M^{+}$, $M^{-}$, $M^{C}$ and $M^{R}$. The external probability source vector E(i) used by the algorithm needs no special handling. When the uniform distribution E(i) = 1/N is used, the vector E(i) need not be stored and can be used directly in the computation; when E(i) is some other personalized vector, it can be stored in a file in which the components of E(i) are ordered by webpage number i.
According to the embodiment of the invention, the four matrices $M^{+}$, $M^{-}$, $M^{C}$ and $M^{R}$ are each stored as a sparse-matrix file, called the Outdegree file, Indegree file, Cocitation file and Coreference file, respectively. The computation of webpage node ranks by formula (12) can be implemented as follows:
● First, the webpages are analyzed and the links they contain are extracted to generate an Outdegree file (the sparse-matrix representation of $M^{+}$). Its record unit is the out-link information of one webpage, comprising the number of each linked webpage and the forward weight of each out-link. The format of each webpage record in the Outdegree file is:
$$\text{src\_did}:\ n,\ (\text{linked\_did}_1, w^{+}_1),\ \ldots,\ (\text{linked\_did}_n, w^{+}_n). \tag{28}$$
Here src_did is the number of the source webpage of the out-links, $\text{linked\_did}_j$ is the number of a linked webpage, $w^{+}_j$ is the forward weight $w^{+}(\text{src\_did}, \text{linked\_did}_j)$ of that link, and the integer n is the out-degree out-degree(src_did) of webpage src_did.
● Generate an Indegree file (the sparse matrix representation of $M^-$), which records all in-link information of each page, including the reverse weight of each in-link and the number of the page the link comes from. The record format of each page in the Indegree file is:
$$\text{linked\_did}: n, (\text{src\_did}_1, w^-_1), \ldots, (\text{src\_did}_n, w^-_n). \tag{29}$$
Here linked_did is the number of the target page being linked to, $\text{src\_did}_j$ is the number of a page the link comes from, $w^-_j$ is the reverse weight $w^-(\text{src\_did}_j, \text{linked\_did})$ of that link, and the integer n is the in-degree in-degree(linked_did) of page linked_did.
According to embodiments of the invention, the Indegree file can be generated from the Outdegree file as follows: using an efficient sparse matrix transposition algorithm, transpose the matrix spanned by src_did and linked_did in the Outdegree file (exchanging rows and columns); then compute the reverse weight $w^-(\text{src\_did}, \text{linked\_did})$ of each link src_did → linked_did from its various attributes.
● Generate a Cocitation file (the sparse matrix representation of $M^C$) from the Indegree file. It records the co-citation information of each page, in the record format:
$$\text{did}: n, (\text{coci\_did}_1, \text{coci\_degree}_1, w^C_1), \ldots, (\text{coci\_did}_n, \text{coci\_degree}_n, w^C_n). \tag{30}$$
Here the integer n is the number of triples that follow. For each page $\text{coci\_did}_i$ that has a co-citation relationship with page did, one triple records the frequency of that co-citation relationship, $\text{coci\_degree}_i = \text{coci-degree}(\text{coci\_did}_i, \text{did})$, and its weight, $w^C_i = w^C(\text{coci\_did}_i, \text{did})$. Because the normalizing factor α(did) associated with did can be obtained directly from its definition, it need not be stored in the Cocitation file.
● Generate a Coreference file (the sparse matrix representation of $M^R$) from the Outdegree file. It records the co-reference information of each page, in the record format:
$$\text{did}: n, (\text{coref\_did}_1, \text{coref\_degree}_1, w^R_1), \ldots, (\text{coref\_did}_n, \text{coref\_degree}_n, w^R_n). \tag{31}$$
Here the integer n is the number of triples that follow. For each page $\text{coref\_did}_i$ that has a co-reference relationship with page did, one triple records the frequency of that co-reference relationship, $\text{coref\_degree}_i = \text{coref-degree}(\text{coref\_did}_i, \text{did})$, and its weight, $w^R_i = w^R(\text{coref\_did}_i, \text{did})$. The normalizing factor β(did) associated with did can be obtained directly from its definition and need not be stored in the Coreference file.
● After the above four sparse matrix files have been generated, the power method can be used to carry out the iterative computation $R^{(n+1)} = M(c_1, c_2, c_3, c_4) \cdot R^{(n)}$.
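The four files listed above can be illustrated with a small in-memory sketch. It is not the patent's implementation: the function names, the dict-based representation and the fixed reverse weight of 1.0 are assumptions made for illustration. Co-citation is taken in its usual sense (two pages linked from the same third page) and co-reference as its counterpart (two pages linking to the same third page); this is also an assumption, since the definitions are given elsewhere in the document.

```python
from collections import defaultdict
from itertools import combinations

def transpose(outdegree):
    """Build Indegree records {dst: [(src, w_minus)]} from Outdegree records
    {src: [(dst, w_plus)]} by swapping rows and columns."""
    indegree = defaultdict(list)
    for src, links in outdegree.items():
        for dst, _w_plus in links:
            # Assumption: reverse weight fixed at 1.0; the text derives it
            # from the attributes of each link instead.
            indegree[dst].append((src, 1.0))
    return dict(indegree)

def pair_counts(records):
    """Count, for every pair of row ids, how many neighbours they share."""
    by_neighbour = defaultdict(set)
    for row, links in records.items():
        for neighbour, _w in links:
            by_neighbour[neighbour].add(row)
    counts = defaultdict(int)
    for rows in by_neighbour.values():
        for a, b in combinations(sorted(rows), 2):
            counts[(a, b)] += 1
    return dict(counts)

def format_record(did, links):
    """Serialise one record in the 'did:n,(id,w),...' layout of format (28)."""
    body = ",".join(f"({other},{w})" for other, w in links)
    return f"{did}:{len(links)},{body}"

# Hypothetical toy graph: pages 1 and 2 both link to 3 and 4; page 3 links to 4.
outdegree = {1: [(3, 1.0), (4, 1.0)], 2: [(3, 1.0), (4, 1.0)], 3: [(4, 1.0)]}
indegree = transpose(outdegree)
coreference = pair_counts(outdegree)  # pairs of pages sharing out-link targets
cocitation = pair_counts(indegree)    # pairs of pages sharing in-link sources
```

Counting pairs per shared neighbour avoids comparing every pair of pages directly, which matters when the matrices are sparse.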
The page rank R of the embodiment of the invention (including $R^{+}$, $R^{-}$, $R^{0}$, etc.) is the principal eigenvector (i.e. the eigenvector corresponding to the largest eigenvalue) of the corresponding matrix M. The power method for computing the principal eigenvector of a matrix is suited to this computation. It is an iterative computation that starts from an arbitrary non-zero initial vector $R^{(0)}$ and repeatedly multiplies it by the matrix M:
$$R^{(n+1)} = M \cdot R^{(n)} = M^2 \cdot R^{(n-1)} = \cdots = M^{n+1} \cdot R^{(0)}, \tag{32}$$
until the following increment is no greater than a specified error value δ:
$$\|R^{(n+1)} - R^{(n)}\|_1 = \sum_i |R^{(n+1)}(i) - R^{(n)}(i)| \le \delta. \tag{33}$$
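The iteration and its stopping rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: the matrix is a small dense list of lists (the embodiment uses sparse files), the function name and the `max_iter` guard are hypothetical, and no renormalization is done, which presumes that M preserves the sum of the vector.

```python
def power_method(matrix, r0, delta=1e-4, max_iter=1000):
    """Iterate R <- M R until the L1 change between iterates is at most delta."""
    r = list(r0)
    for step in range(1, max_iter + 1):
        new_r = [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in matrix]
        change = sum(abs(a - b) for a, b in zip(new_r, r))  # L1 increment
        r = new_r
        if change <= delta:
            return r, step
    return r, max_iter

# Hypothetical 2x2 example whose columns each sum to 1.
matrix = [[0.9, 0.2], [0.1, 0.8]]
rank, steps = power_method(matrix, [0.5, 0.5])
```

For this example the iterates approach the vector (2/3, 1/3), which satisfies M·R = R.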
According to the convergence properties of the power method, the convergence rate of the iteration $R^{(n+1)} = M \cdot R^{(n)}$ is, overall, the rate at which $(1-d)^m$ approaches 0, i.e. $\lim_{m \to \infty} (1-d)^m = 0$, where m is the number of iterations and d is the random jump probability coefficient in formula (12). From $(1-d)^m \le \delta$, the number of iterations needed to reach the specified error δ is
$$m = \log_{10}\delta \, / \, \log_{10}(1-d). \tag{34}$$
According to the embodiment of the invention, with the error δ set to 0.0001 and the random jump coefficient between network nodes d = 0.1, the required number of iterations can be estimated as at most m = 88 (since $\log_{10} 0.0001 / \log_{10} 0.9 \approx 87.4$).
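The estimate of formula (34) can be checked numerically. The helper name below is hypothetical; rounding up to the next integer is an assumption that matches the figure of 88 quoted in the text.

```python
import math

def estimated_iterations(delta, d):
    """Smallest integer m with (1 - d) ** m <= delta, following formula (34)."""
    return math.ceil(math.log10(delta) / math.log10(1.0 - d))

m = estimated_iterations(0.0001, 0.1)
```

A larger jump coefficient d shortens the iteration: with d = 0.15 and the same δ, the same estimate gives 57 iterations.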
The iterative computation flow of the power method is shown in Fig. 6. In step 610, the system opens the sparse matrix files of the four matrices $M^+$, $M^-$, $M^C$ and $M^R$ that make up the matrix $M(c_1, c_2, c_3, c_4)$, namely the Outdegree, Indegree, Cocitation and Coreference files described above. In step 620, the file holding the N-dimensional vector $R^{(0)}$, which represents the initial rank distribution of the pages, is opened and set for sequential reading (each record $R^{(0)}(i)$ in this file is generally 1, or the result of the previous computation).
In steps 630 to 640, the iterative computation is carried out as follows. For n = 0, 1, 2, ..., m-1, the current rank vector $R^{(n)}(i)$ is kept in a disk file, and an array representing the rank vector $R^{(n+1)}(i)$ is allocated in memory. The sparse matrix files of the four matrices $M^+$, $M^-$, $M^C$ and $M^R$ are read row by row, the components of $R^{(n)}(i)$ are read one by one, and according to formula (12) each rank $R^{(n)}(i)$ is passed on to the designated components of $R^{(n+1)}(i)$. After all components of $R^{(n)}(i)$ in the disk file have been traversed, the in-memory vector $R^{(n+1)}(i)$ is written to this file (i.e. the components of $R^{(n+1)}(i)$ replace those of $R^{(n)}(i)$); then, with $R^{(n+1)}(i)$ as the starting vector, the new vector $R^{(n+2)}(i)$ is computed in the same way. This process is repeated until the new vector $R^{(m)}(i)$ satisfies the predetermined precision. Then, in step 650, the page grading result is obtained as $R(i) = R^{(m)}(i)$.
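A single pass of this "read a record, pass its rank on" loop might look as follows. The sketch covers only the forward term and a uniform source E(i) = 1/N, and it assumes that term has the form $w^+(j, i) / \text{out-degree}(j) \cdot R(j)$; pages are numbered from 0, and the function and parameter names are hypothetical.

```python
def push_pass(records, rank, c1, d, n_pages):
    """One pass over Outdegree-style records, pushing each page's rank to the
    pages it links to (forward term only, uniform E(i) = 1/N)."""
    new_rank = [d / n_pages] * n_pages       # d * E(i) with E(i) = 1/N
    for src, links in records:               # records are read sequentially
        if not links:
            continue
        share = c1 * rank[src] / len(links)  # divide by out-degree(src)
        for dst, w_plus in links:
            new_rank[dst] += w_plus * share
    return new_rank

# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
records = [(0, [(1, 1.0), (2, 1.0)]), (1, [(2, 1.0)]), (2, [(0, 1.0)])]
rank = [1.0 / 3.0] * 3
new_rank = push_pass(records, rank, c1=0.9, d=0.1, n_pages=3)
```

Only the new vector has to be addressable at random; the records and the old vector are consumed in order, which is what makes the disk-based layout workable.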
In this computation, to preserve floating-point precision, each vector component $R^{(n)}(i)$ can be multiplied by the constant N (the total number of pages); after the computation is finished, each component $R^{(n)}(i)$ is divided by N to give the actual rank R(i) of the page.
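The scaling leaves the result unchanged because the iteration is linear. As a sketch, assume the update has the affine form $R^{(n+1)} = M \cdot R^{(n)} + d \cdot E$, where M stands for the combined link matrix (a notational assumption made here). Scaling both R and E by N then gives

$$
M \cdot \left( N R^{(n)} \right) + d \cdot (N E) = N \cdot \left( M \cdot R^{(n)} + d \cdot E \right) = N \cdot R^{(n+1)} .
$$

With E(i) = 1/N the scaled source term is simply d, and if the ranks sum to 1 the scaled components average 1 instead of 1/N.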
In addition, for very large page collections, all components of the vector R(i) usually cannot be held in the memory of a single computer during the above computation. According to the embodiment of the invention, the rank vector of a very large page collection can be computed by the following segmented processing: the document numbers i = 1, 2, ..., N of the pages are divided into s segments of equal length, so that each segment of the vector R(i) (i = 1, ..., N/s; N/s+1, ..., 2N/s; ...) can be held in memory; likewise, the rows of the sparse matrix files of the four matrices $M^+$, $M^-$, $M^C$ and $M^R$ are divided according to the same document-number segmentation, so that each sparse matrix file is split by matrix row number into s smaller files; then, by the above iterative algorithm, each segment of the new rank vector $R^{(n+1)}(i)$ is computed in turn from the file of the current rank vector $R^{(n)}(i)$ and the segmented sparse matrix files; the computed segments of $R^{(n+1)}(i)$ are written to the disk file in document-number order, giving the complete new rank vector $R^{(n+1)}(i)$; this process is repeated until the new vector $R^{(m)}(i)$ satisfies the predetermined precision.
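The segmentation can be sketched as follows. The names are hypothetical, pages are numbered from 0, the matrix rows are kept in a list rather than in s separate files, and segments differ in length by at most one when N is not a multiple of s (the text assumes equal lengths).

```python
def segment_bounds(n_pages, s):
    """Split page numbers 0..n_pages-1 into s contiguous, near-equal segments."""
    base, extra = divmod(n_pages, s)
    bounds, start = [], 0
    for k in range(s):
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def compute_segment(rows, rank, start, stop):
    """Components start..stop-1 of M R, using only the rows of that segment.
    rows[i] is a list of (column j, matrix entry M[i][j]) pairs."""
    return [sum(m_ij * rank[j] for j, m_ij in rows[i]) for i in range(start, stop)]

def segmented_step(rows, rank, s):
    new_rank = []
    for start, stop in segment_bounds(len(rank), s):
        # One segment of the new vector is produced at a time; it is appended
        # to a list here, where the text writes it to the disk file.
        new_rank.extend(compute_segment(rows, rank, start, stop))
    return new_rank

rows = [[(2, 1.0)], [(0, 0.5)], [(0, 0.5), (1, 1.0)]]
rank = [0.2, 0.3, 0.5]
whole = segmented_step(rows, rank, 1)
split = segmented_step(rows, rank, 2)
```

Because each row of M contributes to exactly one component of the new vector, the segments can be computed independently and in any order.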
According to the embodiment of the invention, the above segmented computation can also be carried out by distributed computing: s node computers connected by a high-speed network are used; the file of the current rank vector $R^{(n)}(i)$ is distributed to every node computer, and the segment files of the sparse matrices of the four matrices $M^+$, $M^-$, $M^C$ and $M^R$ are assigned to the node computers by document-number interval; each node computer computes one segment of the new vector $R^{(n+1)}(i)$; the computed segments are then combined into the new vector $R^{(n+1)}(i)$; $R^{(n+1)}(i)$ is then distributed to every node computer as the starting vector, and the new vector $R^{(n+2)}(i)$ is computed in the same segmented, distributed manner; this process is repeated until the new vector $R^{(m)}(i)$ assembled from the segments satisfies the specified precision.
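The distribute, compute and combine cycle can be mimicked on one machine. In the sketch below a thread pool stands in for the s node computers, which is an assumption for illustration only; the function names and the in-memory row blocks are likewise hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def node_task(args):
    """Work done by one node: its block of rows times the full current vector."""
    block_rows, rank = args
    return [sum(m_ij * rank[j] for j, m_ij in row) for row in block_rows]

def distributed_step(row_blocks, rank):
    """row_blocks[k] holds the matrix rows assigned to node k, in page order."""
    with ThreadPoolExecutor(max_workers=len(row_blocks)) as pool:
        # map() yields results in submission order, so the segments can be
        # concatenated directly into the new vector.
        segments = pool.map(node_task, [(block, rank) for block in row_blocks])
        return [value for segment in segments for value in segment]

row_blocks = [[[(2, 1.0)], [(0, 0.5)]], [[(0, 0.5), (1, 1.0)]]]
rank = [0.2, 0.3, 0.5]
new_rank = distributed_step(row_blocks, rank)
```

Every worker needs the whole current vector but only its own block of rows, which mirrors the allocation described in the text.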
On the other hand, generating the Coreference file (the sparse matrix representation of $M^R$) from the Outdegree file and the Cocitation file (the sparse matrix representation of $M^C$) from the Indegree file can also be sped up by segmentation and distributed computing: the rows of the Outdegree file and the Indegree file are divided by document-number segments and assigned to multiple node computers for processing, and the rows of the partial Coreference files and partial Cocitation files generated by the node computers are then merged in matrix row-number order, giving the required sparse matrix files of $M^R$ and $M^C$ respectively.
Some techniques can also be used in the above computation to further improve efficiency. In the first computation, the initial rank vector $R^{(0)}$ stored in the file can be chosen as the uniform probability distribution, i.e. R(i) = 1/N for every page i (N is the total number of pages). In subsequent update computations, R(i) = 1/N is used for each newly collected page i, while for each existing page j, R(j) can be taken as the result of the previous computation. In the power method, if the initial vector $R^{(0)}$ is chosen so that it is close to the vector finally converged to, the number of iterations is greatly reduced. For page collections that are updated infrequently or only slightly, choosing the previous grading result as the initial rank vector for the next computation can speed up the computation significantly. In addition, other methods from matrix computation for accelerating the convergence of eigenvector calculations can also be applied to the above computation.
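The choice of starting vector for an update run can be written down directly. The function name and the dict of previous results are hypothetical; no renormalization is applied because the text does not call for one.

```python
def initial_vector(page_ids, previous_rank):
    """Start vector for an update run: previous rank where known, else 1/N."""
    n = len(page_ids)
    return [previous_rank.get(pid, 1.0 / n) for pid in page_ids]

# Pages 1 and 2 were ranked before; pages 3 and 4 are newly collected.
start = initial_vector([1, 2, 3, 4], {1: 0.5, 2: 0.3})
```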
This computation method can also be applied to simplified rating models. According to the embodiment of the invention, one simplification is to fix each of the weight factors $w^+$, $w^-$, $w^C$ and $w^R$ in formula (12) to a constant, for example $w^+ = w^- = w^C = w^R = 1.0$ (when the corresponding link relationship exists). The above computation can then be optimized in time and space efficiency, for example: the sparse matrix files can be generated directly from the link relationships between pages, without analyzing the many attributes and metadata records of the links and pages that these weights involve; and the weight values need not be stored in the sparse matrix files.
■ Grading of websites:
The above ranking method and its algorithmic implementation are not limited to grading web pages; they apply directly to any network formed by nodes of any kind connected by directed links. The two-way rank inheritance property, the co-citation relationship and the co-reference relationship described above hold generally for all such networks. The grading algorithm of the invention is therefore equally applicable to grading websites, provided that some form of link relationship between websites is given in advance. Usually there are no direct links between websites, but various link relationships between websites can be derived by applying some transformation to the link relationships between pages. This transformation can take many forms. For each website link network obtained by a different transformation, the website nodes in it can be graded by the ranking method of the invention.
In the search engine system of the embodiment of the invention, each website is numbered with an integer that serves as its unique website identifier (site ID). Below, variable names such as I, J or sid denote website numbers, G denotes the directed graph formed by the website link relationships, and I → J denotes a link from website I to website J. According to the embodiment of the invention, the link relationships between websites can be constructed from the link relationships between pages as follows:
■ First, construct a super page for each website, representing all pages in the website. For example, the contents of all pages in the website (in particular the out-link URLs they contain) can simply be merged linearly into one large page file, which serves as the super page; or page layout can be used to distinguish pages under different directory paths of the website by typesetting, position, format and so on, so that the content of the super page is formed from the contents of multiple pages.
■ Then merge the links, i.e. convert the hyperlink relationships between pages into link relationships between the corresponding super pages, which represent the link relationships between websites.
Merging page links into super page links can be handled in several ways. Links between pages fall into two classes: intra-site links and inter-site links. Inter-site links, i.e. links between pages on different websites, can be reduced so that all inter-site page links between any two websites become a single link between the two corresponding super pages. There are two concrete ways to do this: one is simply to set the two-way weights $W^+$ and $W^-$ of the link between the corresponding super pages to a constant, for example 1.0; the other is to adjust the weights $W^+$ and $W^-$ of the link between the super pages according to the number of inter-site page links, so that the more links there are between the pages, the larger the weight of the corresponding super page link.
Intra-site links, i.e. links between pages on the same website, can also be handled in two ways. One is to ignore intra-site links, so that links between pages within the same website contribute nothing to the links between super pages and do not affect the weights of super page links. The other is to treat intra-site links as self-links of the corresponding super page (analogous to a hyperlink from one place to another within the same page); like ordinary links between super pages, these self-links carry the two-way weights $W^+$ and $W^-$. When intra-site links are retained as self-links of the website's super page, they affect the in-degree, out-degree and two-way link weights of the super page. On the other hand, self-links of a super page do not affect the co-citation relationships between super pages.
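The merging options described above can be sketched as one function with two switches. The names are hypothetical, and the linear dependence of the weight on the link count is an assumption: the text only states that more page links give a larger weight.

```python
from collections import Counter

def merge_links(page_links, site_of, keep_self_links=False, count_weighted=True):
    """Collapse page-level links (src_page, dst_page) into super page links.
    Returns {(src_site, dst_site): weight}."""
    counts = Counter()
    for src_page, dst_page in page_links:
        src_site, dst_site = site_of[src_page], site_of[dst_page]
        if src_site == dst_site and not keep_self_links:
            continue  # first option in the text: ignore intra-site links
        counts[(src_site, dst_site)] += 1
    if count_weighted:
        # Assumption: weight equal to the number of merged page links.
        return {edge: float(n) for edge, n in counts.items()}
    return {edge: 1.0 for edge in counts}  # constant-weight option

site_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
page_links = [("a1", "b1"), ("a2", "b1"), ("a1", "a2"), ("b1", "c1")]
by_count = merge_links(page_links, site_of)
with_self = merge_links(page_links, site_of, keep_self_links=True, count_weighted=False)
```

In this toy input the two page links from site A to site B collapse into one super page link, and the link a1 → a2 either disappears or becomes a self-link of A.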
After the link relationships between super pages have been constructed as described above, the in-degree in-degree(I) and out-degree out-degree(I) of each super page I, together with the co-citation frequency function coci-degree(I, J) and the co-reference frequency function coref-degree(I, J) between any two super pages I and J, can be obtained as before. The weight factors $w^+(J, I)$, $w^-(I, J)$, $w^C(I, J)$, $w^R(I, J)$ and the weight functions $W^+(J, I)$, $W^-(I, J)$, $W^C(I, J)$, $W^R(I, J)$ of the four classes of super page link relationships can then be set. The rating model described by formula (1) can thus be applied directly to super pages, and the grading algorithm described by formula (12) or (15) can be invoked directly. The super page rank vector R(I) of the websites is therefore computed by the same algorithm as the page rank vector R(i); it suffices to substitute the super page I for the page i in the description of the page grading algorithm above. This gives the following website grading algorithm:
$$
\begin{aligned}
R(I) ={}& c_1 \cdot \sum_{J \to I \in G} \frac{w^+(J, I)}{\text{out-degree}(J)} R(J) + c_2 \cdot \sum_{I \to J \in G} \frac{w^-(I, J)}{\text{in-degree}(J)} R(J) \\
&+ c_3 \cdot \sum_{J \in G} \frac{\text{coci-degree}(I, J) \cdot w^C(I, J)}{\alpha(J)} R(J) \\
&+ c_4 \cdot \sum_{J \in G} \frac{\text{coref-degree}(I, J) \cdot w^R(I, J)}{\beta(J)} R(J) + d \cdot E(I).
\end{aligned}
\tag{35}
$$
The concrete implementation of this algorithm in the system is exactly the same as the efficient implementation of the page grading algorithm above. The weight factors $w^+$, $w^-$, $w^C$ and $w^R$ represent the strength coefficients (transfer ratios) with which the four kinds of link relationship between super pages transfer rank between websites. As in page grading, each can be adjusted according to the various relevant attributes of the particular super pages I and J, so as to reflect the rank transfer between super pages more accurately. In a simplified website rating model, $w^+$, $w^-$, $w^C$ and $w^R$ can also be taken as the constant values shown in formula (26) (i.e. with value 1 or 0).
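One application of formula (35) in the simplified model (all weight factors equal to 1 where a relationship exists) can be sketched as follows. The function name and data layout are hypothetical, and the normalizing factors α and β are taken as inputs because their definitions are given elsewhere in the document.

```python
def rank_step(nodes, links, coci, coref, rank, c, d, E, alpha, beta):
    """One application of formula (35) with all weight factors w set to 1.
    links: list of (J, I) pairs meaning J -> I.
    coci, coref: {(I, J): degree}, stored for both orderings of each pair.
    alpha, beta: normalizing factors per node, supplied by the caller."""
    c1, c2, c3, c4 = c
    out_deg = {n: 0 for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for src, dst in links:
        out_deg[src] += 1
        in_deg[dst] += 1
    new_rank = {i: d * E[i] for i in nodes}
    for src, dst in links:
        new_rank[dst] += c1 * rank[src] / out_deg[src]  # forward term, J -> I
        new_rank[src] += c2 * rank[dst] / in_deg[dst]   # reverse term, I -> J
    for (i, j), degree in coci.items():
        new_rank[i] += c3 * degree * rank[j] / alpha[j]
    for (i, j), degree in coref.items():
        new_rank[i] += c4 * degree * rank[j] / beta[j]
    return new_rank

# Hypothetical example: sites A and B both link to site C.
nodes = ["A", "B", "C"]
links = [("A", "C"), ("B", "C")]
coref = {("A", "B"): 1, ("B", "A"): 1}  # A and B share the target C
ones = {n: 1.0 for n in nodes}          # placeholder alpha and beta values
third = {n: 1.0 / 3.0 for n in nodes}
new_rank = rank_step(nodes, links, {}, coref, third,
                     c=(0.4, 0.2, 0.2, 0.1), d=0.1, E=third,
                     alpha=ones, beta=ones)
```

Site C, which receives both links, ends up with the largest value, while A and B receive equal shares through the reverse and co-reference terms.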
In summary, the flow by which the ranking method of the invention grades websites is shown in Fig. 7. In step 710, a super page is constructed for each website, and the hyperlink relationships between pages are simplified and merged into link relationships between the corresponding super pages in the manner described above, giving the directed link graph between websites. Then, in step 720, according to the link relationships between super page nodes, a forward weight $W^+$ and a reverse weight $W^-$ are set for each link between nodes, a weight $W^C$ is set for each co-citation between any two nodes, and a weight $W^R$ is set for each co-reference between any two nodes. In step 730, according to the rating model of formula (1) and the four classes of link relationship properties of formulas (2) to (11), the rank R(I) of each super page I is determined from the following factors: the rank R(J) of each super page J that links to super page I and the forward weight $W^+(J, I)$ of those links; the rank R(J) of each super page J that super page I links to and the reverse weight $W^-(I, J)$ of those links; the rank R(J) of each super page J that has a co-citation relationship with super page I and the weight $W^C(I, J)$ of those co-citations; and the rank R(J) of each super page J that has a co-reference relationship with super page I and the weight $W^R(I, J)$ of those co-references. Using the grading algorithm above, the rank vector R(J) of the super pages is computed iteratively from these factors until the specified precision is met.
On the other hand, because the number of websites is much smaller than the number of pages, the network formed by super pages is usually much smaller than the network of pages. The website grading computation is therefore much faster than the page grading computation, and its memory and disk storage costs are much smaller. Thus, for a very large page collection, the rank of each website can first be obtained with the website ranking method of the invention, and the ranks of the pages within each website can then be estimated by an approximate method. There are many ways to estimate page ranks from the website rank, provided that the ranks of the pages within a site sum to the rank of the website. For example, the website's rank can be distributed to the pages under each directory in decreasing amounts according to directory depth, or the allocation ratio can be determined by the actual access frequency of the pages, or, for smaller websites, the rank can simply be distributed evenly. Although the page ranks obtained in this way are less precise than the page ranks described earlier, the computational complexity is lower and updates are faster. In particular, if the relevant weight factors take the constant values shown in formula (26), the time and space costs of the above website grading algorithm can be kept very low. For newly appearing pages, this ranking method can also be more effective than ranking directly on page link relationships, which usually fails because new pages lack link relationships.
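The constraint that page ranks within a site sum to the site rank is easy to enforce by distributing in proportion to per-page weights. The sketch below is illustrative: the function names are hypothetical, and the geometric decay with directory depth is one assumed way to realize "decreasing with directory depth".

```python
def distribute_rank(site_rank, page_weights):
    """Split a site's rank among its pages in proportion to page_weights,
    so that the page ranks sum to site_rank."""
    total = sum(page_weights.values())
    return {page: site_rank * w / total for page, w in page_weights.items()}

def depth_weights(paths, decay=0.5):
    """Assumed scheme: weight decay**depth, depth = number of '/' in the path."""
    return {p: decay ** p.strip("/").count("/") for p in paths}

paths = ["index.html", "docs/a.html", "docs/api/b.html"]
by_depth = distribute_rank(0.7, depth_weights(paths))
even = distribute_rank(0.6, {p: 1.0 for p in paths})
```

Access frequencies can be passed as `page_weights` in the same way, which covers the second option mentioned in the text.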
The embodiments of the invention use specific algorithm steps and data structures and are implemented on a specific application system. However, anyone familiar with the background art in this field will clearly understand that the scope of the invention is not limited to such algorithms and systems. The technical solution of the invention can be applied to many other different embodiments. The appended claims cover many variations and substitutions of the elements of this technical solution.

Claims (9)

1. A computer-implemented method for grading network nodes, which assigns each node a numerical value representing its rank according to the directed link relationships between the nodes, characterized by comprising the following steps:
a. Set at least two of the following kinds of weights:
(1) for the links between at least some of the nodes, set a forward weight for each link;
(2) for the links between at least some of the nodes, set a reverse weight for each link;
(3) for at least some of the nodes, set a weight for each co-citation relationship between any two of those nodes;
(4) for at least some of the nodes, set a weight for each co-reference relationship between any two of those nodes;
b. According to the weights set in step a, compute the following weighted sums:
(1) if the weight set is the forward weight of links, compute the weighted sum of the ranks of the source nodes of the node's in-links, weighted by the forward weights of those in-links;
(2) if the weight set is the reverse weight of links, compute the weighted sum of the ranks of the target nodes of the node's out-links, weighted by the reverse weights of those out-links;
(3) if the weight set is the weight of co-citation relationships between nodes, compute the weighted sum of the ranks of the nodes that have a co-citation relationship with the node, weighted by the weights of those co-citation relationships;
(4) if the weight set is the weight of co-reference relationships between nodes, compute the weighted sum of the ranks of the nodes that have a co-reference relationship with the node, weighted by the weights of those co-reference relationships;
c. Form a further weighted sum of the weighted sums obtained in step b, and use it as the rank value of the node.
2. The network node ranking method according to claim 1, characterized in that: the forward weight of a link, the reverse weight of a link, the co-citation weight and the co-reference weight depend respectively on the out-degree of the node, the in-degree of the node, the co-citation frequency and the co-reference frequency.
3. The network node ranking method according to claim 1, characterized in that: the rank of a node further includes a constant rank representing a prior probability distribution, and the weight factor of this constant and the weight factors of said further weighted sum add up to 1.
4. The network node ranking method according to any one of claims 1 to 3, characterized in that: said nodes are web pages.
5. The network node ranking method according to claim 4, characterized in that: the forward weight of a link, the reverse weight of a link, the co-citation weight and the co-reference weight are also set according to at least one of the factors listed below:
attributes of a page, including: the URL of the page and the attributes of that URL; the creation, collection or update time of the page; the number of visits and visit frequency of the page; or the result of the most recent grading of the page;
attributes of a link, including: the position of the link in the page; the link wording and link text; the typesetting format information of the link; the number of times and frequency with which the link is clicked and the source information of those who click it; and the distance between the two linked pages or comparative attributes of the text content they contain.
6. The network node ranking method according to any one of claims 1 to 3, characterized in that: said nodes are super pages corresponding to websites; each super page is constructed by merging the pages in a website, and the link relationships between super pages are obtained from the link relationships between the pages of the websites.
7. The network node ranking method according to claim 6, characterized in that: constructing the super page of a website includes directly mixing together the contents of the pages in the website, or placing each page at a different layout position in the super page.
8. The network node ranking method according to claim 6, characterized in that: the rank of a page is determined by the rank of the super page of the website it belongs to, by means including distributing the rank of the super page to the pages according to the file directory, or determining the allocation ratio according to the actual access frequency of the pages, or simply distributing the rank of the super page evenly among the pages.
9. A computer-implemented system for grading network nodes, which assigns each node a numerical value representing its rank according to the directed link relationships between the nodes, characterized by comprising the following means:
a. Means for setting at least two of the following kinds of weights:
for at least some of the links, a forward weight for each link;
for at least some of the links, a reverse weight for each link;
for the co-citation relationships of at least some of the nodes, a weight for each relationship;
for the co-reference relationships of at least some of the nodes, a weight for each relationship;
b. Means for computing the following weighted sums according to the weights set by means a:
(1) if the weight set is the forward weight of links, the weighted sum of the ranks of the source nodes of the node's in-links, weighted by the forward weights of those in-links;
(2) if the weight set is the reverse weight of links, the weighted sum of the ranks of the target nodes of the node's out-links, weighted by the reverse weights of those out-links;
(3) if the weight set is the weight of co-citation relationships between nodes, the weighted sum of the ranks of the nodes that have a co-citation relationship with the node, weighted by the weights of those co-citation relationships;
(4) if the weight set is the weight of co-reference relationships between nodes, the weighted sum of the ranks of the nodes that have a co-reference relationship with the node, weighted by the weights of those co-reference relationships;
c. Means for forming a further weighted sum of the weighted sums obtained by means b, as the rank value of the node.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006101658019A CN100543744C (en) 2006-12-12 2006-12-12 Method to webpage and website grading

Publications (2)

Publication Number Publication Date
CN1996299A CN1996299A (en) 2007-07-11
CN100543744C true CN100543744C (en) 2009-09-23

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350011B (en) * 2007-07-18 2011-09-07 中国科学院自动化研究所 Method for detecting search engine cheat based on small sample set
CN101493819B (en) * 2008-01-24 2011-09-14 中国科学院自动化研究所 Method for optimizing detection of search engine cheat
CN101976245A (en) * 2010-10-09 2011-02-16 吕琳媛 Sequencing method of node importance in network
CN102033914A (en) * 2010-11-29 2011-04-27 百度在线网络技术(北京)有限公司 Authority-based method and equipment for determining reliable description information of link resources
CN102541949B (en) * 2010-12-31 2014-04-02 百度在线网络技术(北京)有限公司 Method and equipment for determining authority values on basis of preset link relation of pages
CN102541946B (en) * 2010-12-31 2014-11-05 百度在线网络技术(北京)有限公司 Method and equipment for determining recommendation degree of hyperlink based on recommendation attribute of hyperlink
CN102222115B (en) * 2011-07-12 2013-09-11 厦门大学 Method for analyzing edge connectivity of research hotspot based on keyword concurrent
CN102243661B (en) * 2011-07-21 2014-04-23 中国科学院计算机网络信息中心 Website content quality assessment method and device
CN102750380B (en) * 2012-06-27 2014-10-15 山东师范大学 Page sorting method in combination with difference feature distribution and link feature
CN103778139B (en) * 2012-10-22 2017-09-19 阿里巴巴集团控股有限公司 Searching method and server
CN103870519B (en) * 2012-12-17 2019-03-12 北京千橡网景科技发展有限公司 The method and apparatus for calculating document quality value
CN103116660A (en) * 2013-03-15 2013-05-22 人民搜索网络股份公司 Method and device for acquiring website authority values
US11275748B2 (en) 2013-06-03 2022-03-15 Ent. Services Development Corporation Lp Influence score of a social media domain
CN103491197B (en) * 2013-10-12 2016-08-10 北京海联捷讯信息科技发展有限公司 Distributed automatic tour inspection system and resource collection method thereof
CN105335363B (en) * 2014-05-28 2018-12-07 华为技术有限公司 A kind of Object Push method and system
CN106453207B (en) * 2015-08-07 2021-01-29 北京奇虎科技有限公司 Advertisement material data website verification method and device
CN105608133B (en) * 2015-12-16 2019-07-02 北京神州绿盟信息安全科技股份有限公司 A kind of determination method and device of the key page
CN105608652A (en) * 2016-03-21 2016-05-25 四川九鼎智远知识产权运营有限公司 Intellectual property service system based on Internet of things
CN108121741B (en) * 2016-11-30 2021-12-28 百度在线网络技术(北京)有限公司 Website quality evaluation method and device
CN108009202B (en) * 2017-11-01 2022-02-08 昆明理工大学 Web page classification and sorting dynamic crawler method based on Viterbi algorithm
CN107943853A (en) * 2017-11-06 2018-04-20 浙江三米教育科技有限公司 Knowledge node selects test method and its institute's computation machine equipment and storage medium
CN110309189B (en) * 2018-03-13 2023-04-18 深圳市腾讯计算机系统有限公司 Method and device for acquiring heat of entity words
CN108460158A (en) * 2018-03-28 2018-08-28 天津大学 Differentiation Web page sequencing method based on PageRank
CN109359795A (en) * 2018-08-17 2019-02-19 苏州黑云信息科技有限公司 A kind of industry cluster digital resource use value ranking method based on semantic compatible degree
CN114925308B (en) * 2022-04-29 2023-10-03 北京百度网讯科技有限公司 Webpage processing method and device of website, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
信息检索中基于链接的网页排序算法. 王奇,宋国新,邵志清.华东理工大学学报,第26卷第5期. 2000 *

Also Published As

Publication number Publication date
CN1996299A (en) 2007-07-11

Similar Documents

Publication Publication Date Title
CN100543744C (en) Method to webpage and website grading
US7779001B2 (en) Web page ranking with hierarchical considerations
Eirinaki et al. Web path recommendations based on page ranking and markov models
CN100573513C (en) Method and system for ranking the documents of search results to improve diversity and information richness
Debreceny et al. The production and use of semantically rich accounting reports on the Internet: XML and XBRL
CN102982042B (en) Personalized content recommendation method, platform and system
Özkan et al. Evaluating the websites of academic departments through SEO criteria: a hesitant fuzzy linguistic MCDM approach
CN101719145B (en) Individuation searching method based on book domain ontology
Mihaila et al. Using Quality of Data Metadata for Source Selection and Ranking.
US20240070181A1 (en) Methods and apparatuses for content preparation and/or selection
CN103559252A (en) Method for recommending scenery spots probably browsed by tourists
CN102456064B (en) Method for realizing community discovery in social networking
Sidiropoulos et al. A new perspective to automatically rank scientific conferences using digital libraries
CN114896423A (en) Construction method and system of enterprise basic information knowledge graph
Yang et al. A model for book inquiry history analysis and book-acquisition recommendation of libraries
Espadas et al. Web site visibility evaluation
CN117033654A (en) Science and technology event map construction method for science and technology mist identification
van Gils et al. On the quality of resources on the Web: An information retrieval perspective
Antoniou et al. Context-similarity based hotlinks assignment: Model, metrics and algorithm
Luo et al. Generation of similarity knowledge flow for intelligent browsing based on semantic link networks
Dinucă Web structure mining
Zhao et al. The development of social network analysis research in mainland China: a literature review perspective
Wang et al. An Optimization Methods of Government Website Based on Search Engine
Yen et al. Towards effective web site designs: A framework for modeling, design evaluation and enhancement
Gu et al. Key analysis of smart tourism project setting and tourists' satisfaction degree based on data mining

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090923

Termination date: 20151212

EXPY Termination of patent right or utility model