CN102750380A

CN102750380A - Page sorting method in combination with difference feature distribution and link feature

Info

Publication number: CN102750380A
Application number: CN2012102158608A
Authority: CN
Inventors: 张化祥; 张悦童; 刘阳
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2012-06-27
Filing date: 2012-06-27
Publication date: 2012-10-24
Anticipated expiration: 2032-06-27
Also published as: CN102750380B

Abstract

The invention relates to a page ranking method that combines difference feature distribution with link features. The method comprises the following steps: first, calculating a page trust value with the TrustRank algorithm; analyzing the feature distributions of pages labeled as normal and as spam, and selecting as difference features those features whose distributions differ markedly between normal pages and spam pages; calculating the trust contribution value of the page difference features according to their distribution; calculating the page trust degree by combining the page trust value with the trust contribution value of the difference features; and ranking the pages according to the page trust degree. Because normal pages and spam pages differ in how their content features are distributed, combining this information with page link features raises the ranking of good pages and lowers the ranking of spam pages.

Description

A web page ranking method combining difference feature distribution and link features

Technical field

The present invention relates to a web page ranking method that combines difference feature distribution and link features, and belongs to the field of Internet information retrieval.

Background technology

Search engines are one of the main ways users find useful information. According to a 2009 survey [CNNIC (China Internet Network Information Center) [R]. The 23rd Report on the Development of the Internet in China, 2009: 1-3], 68% of people use search engines frequently, and 84.5% regard search engines as their main means of obtaining new information. Research shows [SILVERSTEIN C, MARAIS H, HENZINGER M, MORICZ M. Analysis of a very large Web search engine query log [C]. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, California, 1999, 33(1): 6-12] that most users look only at the first three pages of search engine results, so the higher a page ranks, the more clicks it receives and the greater the profit it brings. To rank higher in search results, site administrators work to improve page quality. Driven by commercial interest, however, some websites use deceptive means to trick search engines into ranking spam pages higher, which seriously interferes with users' access to useful information. Detecting spam pages is one of the major challenges search engines face [HENZINGER M R, MOTWANI R, SILVERSTEIN C. Challenges in web search engines [C]. Proceedings of ACM Special Interest Group on Information Retrieval (SIGIR) Forum, 2002, 36(2): 11-22].

At present, search engines rely mainly on content relevance and page importance to determine page ranking. Content relevance can be computed by information retrieval methods such as the TF/IDF algorithm [BAEZA-YATES R, RIBEIRO-NETO B. Modern information retrieval [M]. Addison Wesley Longman, 1999], while page importance is derived by link-analysis algorithms such as HITS [KLEINBERG J M. Authoritative sources in a hyperlinked environment [J]. Journal of the ACM, 1999, 46(5): 604-632], PageRank [BIANCHINI M, GORI M, SCARSELLI F. Inside PageRank [J]. Journal of the ACM, 2005, 5(1): 92-128] and TrustRank [GYONGYI Z, GARCIA-MOLINA H, PEDERSEN J. Combating web spam with TrustRank [C]. Proceedings of the 30th VLDB Conference, ACM Press, 2004: 576-587].

The PageRank algorithm ranks pages using the link structure between them: the more important a page, the higher its score and the higher it ranks. In PageRank, the score of page p is defined as:

r(p) = \alpha \cdot \sum_{q : (q, p) \in \varepsilon} \frac{r(q)}{o(q)} + (1 - \alpha) \cdot \frac{1}{N} \qquad (1)

where α is the attenuation coefficient and o(q) is the number of out-links of page q, that is, the number of hyperlinks in q that point to other pages. The notation q: (q, p) ∈ ε ranges over every page q that points to page p; (q, p) ∈ ε means that page q has an out-link pointing to page p; ε denotes the set of links; and N is the number of pages. The score of page p consists of two parts: one part comes from the pages that point to p, and the other is a contribution made to p by all pages. The PageRank values of all pages are computed as:

r = \alpha \cdot T \cdot r + (1 - \alpha) \cdot \frac{1}{N} \cdot 1_{N} \qquad (2)

where T is the N × N transition matrix of the whole web graph, r is an N × 1 vector holding the scores of the N pages, and 1_N is the N × 1 vector of ones. The web graph models the web as a graph G = (v, ε), where v is the set of pages and ε is the set of links between pages. Links that point to a page are called its in-links, and links from it to other pages are called its out-links. The number of in-links of page p is its in-degree, denoted i(p); the out-degree o(p) is the number of out-links of p. A page with no in-links is called an unreferenced page, a page with no out-links is called a non-referencing page, and an isolated page has neither in-links nor out-links. The transition matrix T is given by:

T(p, q) = \begin{cases} 0 & (p, q) \notin \varepsilon \\ 1 / o(p) & (p, q) \in \varepsilon \end{cases} \qquad (3)

where (p, q) ∉ ε means that page p has no out-link pointing to page q, and (p, q) ∈ ε means that page p has an out-link pointing to page q.

Building on PageRank, TrustRank uses trust propagation to assign a trust value to each page and ranks pages by trust value. The trust computation relies on the approximate isolation of good pages: good pages are expected not to point to spam pages. From a set of manually labeled pages, some pages are chosen to form a set S. The good pages in S, denoted S+, serve as the seed set [GYONGYI Z, GARCIA-MOLINA H. Seed selection in TrustRank. Technical report [R]. Stanford University, 2004], and the trust value of each page in the seed set is set to 1. The spam pages in S are denoted S-, and their trust value is set to 0. If a good page can reach a given page in M steps or fewer, that page is assigned a trust value of 1. In the trust propagation (T_M) formula, q →_M p denotes that there is a path from page q to page p of length at most M that contains no spam page.

During trust propagation, it cannot be determined whether the pages along a path are good pages, so the trust value should decrease gradually as the propagation distance grows. There are two ways to attenuate trust. The first is trust dampening: when page A has a link pointing to page B, the trust value of B is the product of the trust value of A and β, where β is the decay factor. The second is trust splitting: if page A points to n pages, each page it points to obtains 1/n of A's trust value, and the trust value of a page is the sum of the trust it receives over all of its in-links. The trust value formula for the whole web graph is:

TR = \beta \cdot T \cdot TR + (1 - \beta) \cdot d \qquad (6)

where β is the decay factor (usually 0.85), T is the transition matrix of the web graph, and d is the initial trust value of the good pages in the seed set. Formula (6) converges, so after a certain number of iterations (usually 20) the TR values are the trust values of the pages in the web graph.
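The iterations in formulas (2) and (6) can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function names `transition_matrix` and `propagate` and the four-page example graph are hypothetical. Because formula (3) stores 1/o(p) in the row of the source page p, the sketch sums over T[q][p] (that is, it applies the transpose) so that each page collects score from the pages linking to it, as formula (1) requires; that reading is an assumption.

```python
def transition_matrix(n, links):
    """Build T per formula (3): T[p][q] = 1/o(p) if page p links to page q, else 0."""
    targets = [set() for _ in range(n)]
    for p, q in links:
        targets[p].add(q)
    T = [[0.0] * n for _ in range(n)]
    for p in range(n):
        for q in targets[p]:
            T[p][q] = 1.0 / len(targets[p])
    return T


def propagate(T, d, damping=0.85, iterations=20):
    """Iterate score(p) = damping * sum_q T[q][p] * score(q) + (1 - damping) * d[p].

    With d uniform (1/N everywhere) this is the PageRank update of formula (2);
    with d equal to 1 on seed pages and 0 elsewhere it is the TrustRank update
    of formula (6). Using T[q][p] (the transpose) is an assumption, see above.
    """
    n = len(d)
    score = list(d)
    for _ in range(iterations):
        score = [
            damping * sum(T[q][p] * score[q] for q in range(n)) + (1.0 - damping) * d[p]
            for p in range(n)
        ]
    return score


# Hypothetical 4-page graph; page 0 is the only trusted seed.
links = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
T = transition_matrix(4, links)
pagerank = propagate(T, [0.25] * 4)
trust = propagate(T, [1.0, 0.0, 0.0, 0.0])
print(pagerank)
print(trust)
```

In this toy graph page 3 has no in-links and is not a seed, so it receives no trust at all, while it still receives the uniform share under the PageRank-style update.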

When pages are ranked with the above importance-based algorithms, only the link information of the pages is considered. Research shows [FETTERLY D, MANASSE M, NAJORK M. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages [C]. Proceedings of the Seventh Workshop on the Web and Databases (WebDB), pages 1-6, Paris, France, June 2004; NTOULAS A, NAJORK M. Detecting Spam Web Pages through Content Analysis [C]. The International World Wide Web Conference, May 23-26, 2006, Edinburgh, Scotland] that normal pages and spam pages exhibit different statistical characteristics in their content. For example, the title length, word count and amount of anchor text of normal pages approximately follow a normal distribution, whereas the distributions of these features for spam pages differ markedly from the normal distributions observed for normal pages.

TrustRank computes page trust values from link features and ranks pages by trust value, which lowers the ranking of spam pages. However, this method is not effective against all spam pages. For example, a page may offer useful resources that attract links from other websites while also containing many links, possibly hidden, that point to target spam pages; the trust value of those target spam pages can then be very high. In addition, the topology of some spam pages resembles that of normal pages. In such cases, taking the distribution of page difference features into account when computing page trust gives better ranking results.

Summary of the invention

The object of the invention is to solve the above problems by providing a web page ranking method that combines difference feature distribution and link features. The method ranks good pages higher and spam pages lower, reducing the influence of spam pages on search engine results.

To achieve this object, the invention adopts the following technical scheme:

A web page ranking method combining difference feature distribution and link features: first, the page trust value is calculated with the TrustRank algorithm; the feature distributions of pages already labeled as normal and as spam are analyzed, and the features whose distributions differ markedly between normal pages and spam pages are selected and called difference features; the trust contribution value of the page difference features is calculated according to the distribution of the difference features; the page trust degree is calculated by combining the page trust value with the trust contribution value of the difference features; and the pages are ranked according to the page trust degree.

Specifically, the method comprises the following steps:

Step 1. Use the TrustRank algorithm to calculate the trust value of each page in the web graph;

Step 2. Collect statistics on the content features of the pages in the web graph labeled as normal and as spam; from these statistics, analyze how the feature distributions of normal pages differ from those of spam pages; determine the features whose distributions differ markedly between normal pages and spam pages, called difference features; and determine an approximate distribution function for each difference feature of normal pages;

Step 3. Calculate the trust contribution value of the page difference features according to the difference feature distributions;

Step 4. Use the page trust value obtained in step 1 and the difference feature trust contribution value obtained in step 3 to calculate the trust degree of each page in the web graph;

Step 5. Rank the pages in the web graph according to the trust degree obtained in step 4: pages with a higher trust degree rank higher, and pages with a lower trust degree rank lower. The higher the trust degree of a page, the more likely it is a normal page; the lower the trust degree, the more likely it is a spam page.

The difference features in step 2 are illustrated using page title length as an example. Users generally query a search engine by entering keywords, so many spam sites, hoping for a higher ranking, pile a large number of keywords unrelated to the page content into the page title; this is known as keyword stuffing. The distribution of title length (in words) for normal pages resembles a normal distribution, whereas spam page titles, which are maliciously stuffed or built by repeating target keywords many times, follow no regular distribution. As the title length increases, so does the likelihood that the page is spam.

In step 2, the approximate distribution function of each difference feature of normal pages is taken to be a normal distribution function: the mean and variance of each difference feature are computed over the labeled normal pages, giving the normal distribution function corresponding to each difference feature.
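Fitting the normal approximation for one difference feature amounts to computing a mean and a standard deviation over the labeled normal pages. The sketch below assumes the population (not sample) standard deviation, which the text does not specify; the function name `fit_normal` and the sample title lengths are hypothetical.

```python
import statistics


def fit_normal(values):
    """Return (mu, sigma) for one difference feature over labeled normal pages.

    Assumption: the population standard deviation is used; the source text
    only says that the mean and variance are computed.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values, mu)
    return mu, sigma


# Hypothetical title lengths (in words) of pages labeled as normal.
title_lengths = [6, 7, 8, 8, 9, 10]
mu, sigma = fit_normal(title_lengths)
print(mu, sigma)
```

One such (mu, sigma) pair would be kept per difference feature and reused when evaluating every page.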

In step 3, the trust contribution value of the content features of page p is calculated as:

g(p) = \prod_{i=1}^{n} \left| f_{i}(x) - y_{pi}(x) \right| \qquad (7)

where

f_{i}(x) = \frac{1}{\sqrt{2\pi}\,\sigma_{i}} \exp\left( -\frac{(x - \mu_{i})^{2}}{2\sigma_{i}^{2}} \right)

is the normal distribution function corresponding to the i-th difference feature, μ_i is the mean of the i-th difference feature, σ_i is the standard deviation of the i-th difference feature, y_{pi}(x) is the proportion of pages whose i-th difference feature equals x, the value of that feature for page p, and n is the number of difference features.
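Formula (7) can be sketched as a product over the difference features of a page. The code below is illustrative only: the names `normal_pdf` and `trust_contribution` are hypothetical, and it assumes that y is the empirical share of pages in the collection whose feature value equals that of page p, which is one reading of the definition above.

```python
import math
from collections import Counter


def normal_pdf(x, mu, sigma):
    """Probability density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def trust_contribution(page_features, normal_params, value_counts, total_pages):
    """g(p) = product over features i of |f_i(x_i) - y_pi(x_i)| (formula (7)).

    page_features: the value x_i of each difference feature for page p
    normal_params: (mu_i, sigma_i) fitted on labeled normal pages
    value_counts:  per feature, a Counter of feature value -> number of pages
                   (assumption: counted over the whole page collection)
    """
    g = 1.0
    for x, (mu, sigma), counts in zip(page_features, normal_params, value_counts):
        f = normal_pdf(x, mu, sigma)
        y = counts[x] / total_pages  # a Counter returns 0 for unseen values
        g *= abs(f - y)
    return g


# Hypothetical single-feature case: title length 8, mu = 8, sigma = 3,
# and 10 of 100 pages have a title of length 8.
g = trust_contribution([8], [(8.0, 3.0)], [Counter({8: 10})], 100)
print(g)
```

Because g(p) is a product, a single feature whose observed share is close to the fitted density pulls the whole value toward zero.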

In step 4, the trust degree of page p is calculated as:

td(p) = \frac{TR(p)}{(1 + \lambda)^{(1 + \ln n)} \cdot g(p)} \qquad (8)

where TR(p) is the trust value of page p obtained in step 1; λ is a parameter that controls how strongly the value of g(p) penalizes the page trust value, and is set to 9; and ln n is the natural logarithm (base e) of n.
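Steps 4 and 5 can be sketched as follows. The code reads formula (8) as TR(p) divided by (1 + λ)^(1 + ln n) · g(p), which is the grouping as printed in this text; the originally typeset formula may group the terms differently, so this is an assumption. The names `trust_degree` and `rank_pages` and the sample values are hypothetical.

```python
import math


def trust_degree(tr, g, n, lam=9.0):
    """td(p) = TR(p) / ((1 + lam) ** (1 + ln n) * g(p)), one reading of formula (8).

    Assumption: g(p) > 0; this reading is undefined for g(p) = 0.
    """
    if g <= 0:
        raise ValueError("g(p) must be positive under this reading of formula (8)")
    return tr / ((1.0 + lam) ** (1.0 + math.log(n)) * g)


def rank_pages(pages, n, lam=9.0):
    """Step 5: sort pages by trust degree, highest first.

    pages maps a page id to a (TR(p), g(p)) pair.
    """
    scored = [(page, trust_degree(tr, g, n, lam)) for page, (tr, g) in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical pages with one difference feature (n = 1).
pages = {"a": (0.5, 0.05), "b": (0.5, 0.10), "c": (0.2, 0.05)}
for page, td in rank_pages(pages, n=1):
    print(page, td)
```

Pages "a" and "b" have the same TrustRank value, but "b" deviates more from the fitted distribution and therefore ranks lower.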

Beneficial effect of the invention: the invention provides a method that ranks pages by combining the distribution information of page difference features with link information. With this method, good pages rank higher and spam pages rank lower, which reduces the influence of spam pages on search engine results.

Description of drawings

Fig. 1 is a schematic diagram of trust propagation;

Fig. 2 is a schematic diagram of trust splitting;

Fig. 3 is a schematic diagram of page difference feature selection;

Fig. 4 is a schematic diagram of the difference feature distribution of spam pages (using page title length as an example);

Fig. 5 is a schematic diagram of the difference feature distribution of normal pages (using page title length as an example);

Fig. 6 is the overall flow chart of the page trust degree calculation.

Embodiment

The invention is further described below with reference to the drawings and an example.

As shown in Fig. 1 and Fig. 2, TrustRank calculates the trust value of each page in the web graph using trust propagation, trust splitting, or a combination of both. As shown in Fig. 3, statistics show that normal pages and spam pages differ in their distributions on some features; taking page title length as an example, the distributions for spam pages and normal pages are shown in Fig. 4 and Fig. 5 respectively.

As shown in Fig. 6, the detailed process of the invention is:

Step 3. Calculate the trust contribution value of the page difference features according to the extracted difference feature distributions;

g(p) = \prod_{i=1}^{n} \left| f_{i}(x) - y_{pi}(x) \right| \qquad (7)

where

f_{i}(x) = \frac{1}{\sqrt{2\pi}\,\sigma_{i}} \exp\left( -\frac{(x - \mu_{i})^{2}}{2\sigma_{i}^{2}} \right)

is the normal distribution function corresponding to the i-th difference feature, μ_i is the mean of the i-th difference feature, σ_i is the standard deviation of the i-th difference feature, y_{pi}(x) is the proportion of pages whose i-th difference feature equals x, the value of that feature for page p, and n is the number of difference features.

In step 4, the trust degree of page p is calculated as:

td(p) = \frac{TR(p)}{(1 + \lambda)^{(1 + \ln n)} \cdot g(p)} \qquad (8)

Taking page title length as an example, the trust contribution value of this feature, |f(x) - y(x)|, is calculated from the title length distribution, where f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) is the probability density function of the normal distribution, x is the title length variable, μ is the mean title length, σ is the standard deviation of title length, and y(x) is the proportion of pages whose title length is x.
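As a purely illustrative calculation with made-up numbers (not taken from the patent), suppose the title lengths of normal pages have μ = 8 and σ = 3, and 10% of pages have a title length of x = 8. The contribution of this single feature would then be:

```latex
f(8) = \frac{1}{\sqrt{2\pi} \cdot 3} \exp\left( -\frac{(8 - 8)^{2}}{2 \cdot 3^{2}} \right)
     = \frac{1}{3\sqrt{2\pi}} \approx 0.1330,
\qquad
|f(8) - y(8)| \approx |0.1330 - 0.1000| = 0.0330
```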

As shown in Fig. 6, the page trust value is combined with the trust contribution value of the difference features, and the trust degree of each page in the web graph is calculated by formula (8).

Claims

1. A web page ranking method combining difference feature distribution and link features, characterized in that: first, the page trust value is calculated with the TrustRank algorithm; the feature distributions of pages already labeled as normal and as spam are analyzed, and the features whose distributions differ markedly between normal pages and spam pages are selected and called difference features; the trust contribution value of the page difference features is then calculated according to the difference feature distributions; the page trust degree is calculated by combining the page trust value with the page content feature values; and the pages are ranked according to the page trust degree.

2. The web page ranking method combining difference feature distribution and link features according to claim 1, characterized in that the specific steps are as follows:

Step 3. Calculate the trust contribution value of the difference features of page p according to the difference feature distributions;

Step 4. Use the trust value of page p obtained in step 1 and the difference feature trust contribution value of page p obtained in step 3 to calculate the trust degree of page p in the web graph;

Step 5. Rank the pages in the web graph according to the trust degree obtained in step 4: pages with a higher trust degree rank higher, and pages with a lower trust degree rank lower; the higher the trust degree of a page, the more likely it is a normal page, and the lower the trust degree, the more likely it is a spam page.

3. The web page ranking method combining difference feature distribution and link features according to claim 2, characterized in that the difference features in step 2 are chosen as: the page word count, the page title word count, the ratio of anchor text word count to page content, the ratio of visible content to page content, and the compression ratio of the page content. For normal pages these five features essentially follow a normal distribution, whereas for spam pages their distributions show no obvious regularity. In step 2, the approximate distribution function of each difference feature of normal pages is taken to be a normal distribution function: the mean and variance of each difference feature are computed over the labeled normal pages, giving the normal distribution function corresponding to each difference feature.

4. The web page ranking method combining difference feature distribution and link features according to claim 2, characterized in that, in step 3, the trust contribution value of the content features of page p is calculated as:

g(p) = \prod_{i=1}^{n} \left| f_{i}(x) - y_{pi}(x) \right| \qquad (7)

where

f_{i}(x) = \frac{1}{\sqrt{2\pi}\,\sigma_{i}} \exp\left( -\frac{(x - \mu_{i})^{2}}{2\sigma_{i}^{2}} \right)

is the normal distribution function corresponding to the i-th difference feature, μ_i is the mean of the i-th difference feature, σ_i is the standard deviation of the i-th difference feature, y_{pi}(x) is the proportion of pages whose i-th difference feature equals x, the value of that feature for page p, and n = 5 is the number of difference features.

5. The web page ranking method combining difference feature distribution and link features according to claim 2, characterized in that, in step 4, the trust degree of page p is calculated as:

td(p) = \frac{TR(p)}{(1 + \lambda)^{(1 + \ln n)} \cdot g(p)} \qquad (8)

where TR(p) is the trust value of page p obtained in step 1; λ is a parameter that controls how strongly the value of g(p) penalizes the page trust value, and is set to 9; and ln n is the natural logarithm (base e) of n.