CN102750380A - Page sorting method in combination with difference feature distribution and link feature - Google Patents

Page sorting method in combination with difference feature distribution and link feature

Info

Publication number
CN102750380A
Authority
CN
China
Prior art keywords
webpage
page
characteristic
normal
difference characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012102158608A
Other languages
Chinese (zh)
Other versions
CN102750380B (en)
Inventor
张化祥
张悦童
刘阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN201210215860.8A priority Critical patent/CN102750380B/en
Publication of CN102750380A publication Critical patent/CN102750380A/en
Application granted granted Critical
Publication of CN102750380B publication Critical patent/CN102750380B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a page sorting method that combines difference feature distribution and link features. The method comprises the following steps: first, calculating the page trust value with the TrustRank algorithm; analyzing the feature distributions of pages labelled as normal and as spam, and selecting as difference features those features whose distributions differ markedly between normal pages and spam pages; calculating the trust contribution value of the page difference features according to their distribution; calculating the page trust degree by combining the page trust value with the trust contribution value of the page difference features; and sorting the pages according to the page trust degree. Because the method uses the different distributions of content features in normal and spam pages together with page link features, it raises the ranking of good pages and lowers the ranking of spam pages.

Description

A page sorting method combining difference feature distribution and link features
Technical field
The present invention relates to a page sorting method that combines difference feature distribution and link features, and belongs to the field of Internet information retrieval.
Background art
Search engines are one of the main ways users find useful information. According to a 2009 survey [CNNIC (China Internet Network Information Center). The 23rd report on the development of the Internet in China [R]. 2009: 1-3], 68% of people use search engines frequently, and 84.5% treat search engines as their main means of obtaining new information. Research shows [SILVERSTEIN C, MARAIS H, HENZINGER M, MORICZ M. Analysis of a very large Web search engine query log [C]. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, California, 1999, 33(1): 6-12] that most users look only at the first three pages of search results, so the higher a page ranks, the more clicks it receives and the more profit it brings. To rank higher in search results, site administrators work to improve page quality. Driven by commercial interest, however, some sites use deceptive techniques to fool search engines and raise the ranking of spam pages, which seriously interferes with users' access to useful information. Detecting spam pages is one of the major challenges search engines face [HENZINGER M R, MOTWANI R, SILVERSTEIN C. Challenges in web search engines [C]. Proceedings of ACM Special Interest Group on Information Retrieval (SIGIR) Forum, 2002, 36(2): 11-22].
At present, search engines rely mainly on content relevance and page importance to determine page ranking. Content relevance can be computed with information retrieval methods such as the TF-IDF algorithm [BAEZA-YATES R, RIBEIRO-NETO B. Modern information retrieval [M]. Addison Wesley Longman, 1999], while page importance is derived from link analysis algorithms such as HITS [KLEINBERG J M. Authoritative sources in a hyperlinked environment [J]. Journal of the ACM, 1999, 46(5): 604-632], the PageRank algorithm [BIANCHINI M, GORI M, SCARSELLI F. Inside PageRank [J]. Journal of the ACM, 2005, 5(1): 92-128] and the TrustRank algorithm [GYONGYI Z, GARCIA-MOLINA H, PEDERSEN J. Combating web spam with TrustRank [C]. Proceedings of the 30th VLDB Conference, ACM Press, 2004: 576-587].
The PageRank algorithm ranks pages using their link features: the more important a page, the higher its score and the higher it ranks. In the PageRank algorithm, the score of page p is defined as:
$$r(p)=\alpha\cdot\sum_{q:(q,p)\in\varepsilon}\frac{r(q)}{o(q)}+(1-\alpha)\cdot\frac{1}{N}\tag{1}$$
where α is the damping factor and o(q) is the number of out-links of page q, i.e. the number of hyperlinks in page q that point to other pages. The index q: (q, p) ∈ ε ranges over every page q that points to page p; (q, p) ∈ ε means that page q has an out-link pointing to page p, ε denotes the link set, and N is the number of pages. The score of page p has two parts: one part comes from the pages that point to p, and the other is the contribution made to p by all pages. The PageRank values of all pages are computed as:
$$r=\alpha\cdot T\cdot r+(1-\alpha)\cdot\frac{1}{N}\cdot 1_N\tag{2}$$
where T is the N × N transition matrix of the whole web graph, r is an N × 1 vector holding the scores of the N pages, and $1_N$ is the N × 1 vector of ones. The web graph models the web as a graph structure G = (v, ε), where v is the set of pages in the web graph and ε is the set of links between pages. Links that point to a page are called its in-links, and links from the page to other pages are called its out-links. The number of in-links of page p is its in-degree, written i(p); the out-degree o(p) is the number of out-links of page p. A page with no in-links is called an unreferenced page, a page with no out-links is called a non-referencing page, and an isolated page has neither in-links nor out-links. The transition matrix T is:
$$T(p,q)=\begin{cases}0 & (p,q)\notin\varepsilon\\ 1/o(p) & (p,q)\in\varepsilon\end{cases}\tag{3}$$
where (p, q) ∉ ε means that page p has no out-link pointing to page q, and (p, q) ∈ ε means that page p has an out-link pointing to page q. Building on PageRank, TrustRank assigns each page a trust value by trust propagation and ranks pages by trust value. The trust computation relies on the approximate isolation of good pages: good pages are expected not to point to spam pages. A set S of pages is chosen from manually labelled pages; the good pages in S, denoted S+, form the seed set [GYONGYI Z, GARCIA-MOLINA H. Seed selection in TrustRank. Technical report [R]. Stanford University, 2004], and the trust value of each page in the seed set is set to 1. The spam pages, denoted S-, are given a trust value of 0. If a page can be reached from a good page in M steps or fewer, its trust value is set to 1. The trust propagation (TM) formula is:
Figure BDA00001817632200033
Here $q \rightarrow_M p$ means that the path from page q to page p has a maximum length of M and contains no spam pages.
During trust propagation it cannot be determined whether every page on a path is a good page, so trust should decrease gradually as the propagation distance grows. There are two ways to attenuate trust. The first is trust dampening: when page A has a link pointing to page B, the trust value of page B is the trust value of page A multiplied by β, where β is the decay factor. The second is trust splitting: if page A points to n pages, each page it points to receives $\frac{1}{n}$ of A's trust value, and the trust value of a page is the sum of the trust it receives over all of its in-links. The trust value formula for the whole web graph is:
$$TR=\beta\cdot T\cdot TR+(1-\beta)\cdot d\tag{6}$$
where β is the decay factor (usually 0.85), T is the transition matrix of the web graph, and d is the initial trust value of the good pages in the seed set. Formula (6) converges, so after a certain number of iterations (usually 20) TR gives the trust value of each page in the web graph.
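The iteration in formula (6) can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name `trust_rank`, the edge-list input format, and the choice to pass trust along out-links with trust splitting (each page q hands TR(q)/o(q) to every page it links to, as in formula (1)) are assumptions made for the example. Seed pages start with trust 1, as described above.

```python
def trust_rank(num_pages, links, seeds, beta=0.85, iterations=20):
    """Iterate TR = beta * T * TR + (1 - beta) * d on a small web graph.

    links: iterable of (source, target) page-index pairs (hypothetical format).
    seeds: indices of hand-labelled good pages (the seed set S+).
    """
    # Out-degree o(q) of every page.
    out_degree = [0] * num_pages
    for source, _ in links:
        out_degree[source] += 1

    # d: initial trust, 1 for seed pages and 0 elsewhere.
    d = [0.0] * num_pages
    for s in seeds:
        d[s] = 1.0

    trust = d[:]
    for _ in range(iterations):
        # Trust splitting: each page divides its trust evenly over its out-links.
        received = [0.0] * num_pages
        for source, target in links:
            received[target] += trust[source] / out_degree[source]
        # Dampen the propagated trust by beta and re-inject the seed trust.
        trust = [beta * received[p] + (1 - beta) * d[p] for p in range(num_pages)]
    return trust


# Example graph: 0 -> 1 -> 2 and 3 -> 2, with page 0 as the only seed.
example_trust = trust_rank(4, [(0, 1), (1, 2), (3, 2)], seeds=[0])
```

Replacing d with the uniform vector 1/N turns the same loop into the PageRank iteration of formula (2). Page 3 in the example has no in-links from trusted pages, so it receives no trust.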
When pages are ranked with the above algorithms based on page importance, only the link information of the pages is considered. Research shows [FETTERLY D, MANASSE M, NAJORK M. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages [C]. In Proceedings of the Seventh Workshop on the Web and Databases (WebDB), pages 1-6, Paris, France, June 2004; NTOULAS A, NAJORK M. Detecting Spam Web Pages through Content Analysis [C]. The International World Wide Web Conference, May 23-26, 2006, Edinburgh, Scotland] that normal pages and spam pages exhibit different statistical characteristics in their content: features of normal pages such as title length, page word count, and amount of anchor text approximately follow a normal distribution, whereas the distributions of these features for spam pages differ markedly from those of normal pages.
TrustRank computes page trust values from link features and ranks pages by trust value, which lowers the ranking of spam pages. The method is not effective against all spam pages, however. For example, a page may offer useful resources that attract links from other sites while also containing many links, possibly hidden, that point to target spam pages; the trust value of those target spam pages may then be very high. In addition, the link topology of some spam pages resembles that of normal pages. In such cases, taking the distribution of page difference features into account when computing page trust gives a better ranking.
Summary of the invention
The object of the invention is to solve the above problems by providing a page sorting method that combines difference feature distribution and link features, so that good pages are ranked higher and spam pages lower, reducing the influence of spam pages on search engine results.
To achieve this object, the invention adopts the following technical solution:
A page sorting method combining difference feature distribution and link features first computes the page trust value with the TrustRank algorithm; analyzes the feature distributions of pages labelled as normal and as spam, and selects the features whose distributions differ markedly between normal pages and spam pages, called difference features; computes the trust contribution value of the page difference features from their distribution; computes the page trust degree by combining the page trust value with the trust contribution value of the page difference features; and sorts the pages by page trust degree.
Specifically, the method comprises the following steps:
Step 1. Compute the trust value of each page in the web graph using the TrustRank algorithm;
Step 2. Collect statistics on the content features of the pages in the web graph that have been labelled as normal or spam; use these statistics to analyze how the feature distributions of normal pages differ from those of spam pages; identify the features whose distributions differ markedly between normal pages and spam pages, called difference features; and determine the approximate distribution function of each difference feature for normal pages;
Step 3. Compute the trust contribution value of the page difference features from the difference feature distributions;
Step 4. Compute the trust degree of each page in the web graph from the page trust value obtained in step 1 and the difference feature trust contribution value obtained in step 3;
Step 5. Sort the pages in the web graph by the trust degree obtained in step 4, with pages of higher trust degree ranked first and pages of lower trust degree ranked last. The higher a page's trust degree, the more likely it is a normal page; the lower its trust degree, the more likely it is a spam page.
The difference feature in step 2 is illustrated here with page title length. Users generally query a search engine by entering keywords, so many spam sites pile large numbers of keywords unrelated to the page content into the page title; this is known as keyword stuffing. The title length (word count) of normal pages approximately follows a normal distribution, whereas spam page titles, which are maliciously stuffed or built from many repetitions of target keywords, follow no regular distribution. As title length increases, so does the likelihood that the page is spam.
In step 2, the distribution of each difference feature over normal pages is approximated by a normal distribution function: the mean and variance of each difference feature are computed over the labelled normal pages, which yields the normal distribution function corresponding to each difference feature.
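The fitting described in step 2 amounts to estimating a mean and a standard deviation per feature. The following sketch uses only the Python standard library; the function names, the sample title lengths, and the use of the population standard deviation (rather than the sample one) are assumptions for illustration.

```python
import math
import statistics


def fit_normal(values):
    """Return (mean, standard deviation) of one feature over labelled normal pages."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    return mu, sigma


def normal_pdf(x, mu, sigma):
    """Probability density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical title lengths of four labelled normal pages.
title_lengths = [6, 8, 8, 10]
mu, sigma = fit_normal(title_lengths)
density_at_mean = normal_pdf(mu, mu, sigma)
```

The same two functions would be applied once per difference feature, giving one (mean, standard deviation) pair for each.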
In step 3, the content-feature trust contribution value of page p is computed as:
$$g(p)=\prod_{i=1}^{n}\left|f_i(x)-y_{pi}(x)\right|\tag{7}$$
where
$$f_i(x)=\frac{1}{\sqrt{2\pi}\,\sigma_i}\,e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$$
is the normal distribution function corresponding to the i-th difference feature, $\mu_i$ is the mean of the i-th difference feature, and $\sigma_i$ is the standard deviation of the i-th difference feature. $y_{pi}(x)$ is the proportion of pages whose i-th difference feature takes the value x, where x is the value of that feature for page p, and n is the number of difference features.
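Formula (7) can be sketched as a product over features. The code below is an illustrative reading, not the patent's implementation: the dictionary-based inputs, the function names, and the treatment of the page proportion as an empirical share looked up per feature value (0 when the value was never observed) are all assumptions.

```python
import math


def normal_pdf(x, mu, sigma):
    """Probability density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def trust_contribution(page_features, normal_params, observed_share):
    """Compute g(p) as the product over features of |f_i(x) - y_pi(x)|.

    page_features:  {feature name: value x of that feature for page p}
    normal_params:  {feature name: (mu_i, sigma_i) fitted on normal pages}
    observed_share: {feature name: {value: share of pages with that value}}
    """
    g = 1.0
    for name, x in page_features.items():
        mu, sigma = normal_params[name]
        y = observed_share[name].get(x, 0.0)  # unseen values count as share 0
        g *= abs(normal_pdf(x, mu, sigma) - y)
    return g


# Hypothetical inputs for a page with two difference features.
g_example = trust_contribution(
    {"title_length": 8, "word_count": 500},
    {"title_length": (8.0, 2.0), "word_count": (500.0, 100.0)},
    {"title_length": {8: 0.1}, "word_count": {}},
)
```

Because g(p) is a product of absolute differences, it is never negative, and a single feature whose observed share matches the fitted density exactly drives the whole product to zero.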
In step 4, the trust degree of page p is computed as:
$$td(p)=\frac{TR(p)}{1+\lambda(1+\ln n)\cdot g(p)}\tag{8}$$
where TR(p) is the trust value of page p obtained in step 1; λ is a parameter, set to 9, that controls how strongly the value g(p) penalizes the page trust value; and ln n is the natural logarithm (base e) of n.
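Steps 4 and 5 can be sketched together: compute td(p) for every page with formula (8), then sort in descending order. The function names and the input layout (a dictionary mapping a page id to its TR(p) and g(p)) are hypothetical; λ defaults to 9 as stated above.

```python
import math


def trust_degree(trust_value, g, n_features, lam=9.0):
    """td(p) = TR(p) / (1 + lam * (1 + ln n) * g(p)), following formula (8)."""
    return trust_value / (1.0 + lam * (1.0 + math.log(n_features)) * g)


def rank_pages(pages, n_features, lam=9.0):
    """Sort page ids by descending trust degree.

    pages: {page id: (TR(p), g(p))} (hypothetical input layout).
    """
    scores = {
        page_id: trust_degree(tr, g, n_features, lam)
        for page_id, (tr, g) in pages.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Page "b" has the highest TrustRank value but a non-zero penalty g(p).
ranking = rank_pages({"a": (0.5, 0.0), "b": (0.6, 0.1), "c": (0.1, 0.0)}, n_features=5)
```

In this example page "b" drops below page "a" once the penalty is applied, which is the intended effect: a page whose content features deviate from the normal-page distributions loses rank even if its link-based trust is high.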
Beneficial effects of the invention: the invention proposes a method that ranks pages using both the distribution of page difference features and link information. With this method, good pages are ranked higher and spam pages lower, which reduces the influence of spam pages on search engine results.
Brief description of the drawings
Fig. 1 is a schematic diagram of trust propagation;
Fig. 2 is a schematic diagram of trust splitting;
Fig. 3 is a schematic diagram of page difference feature selection;
Fig. 4 is a schematic diagram of the difference feature distribution of spam pages (using page title length as an example);
Fig. 5 is a schematic diagram of the difference feature distribution of normal pages (using page title length as an example);
Fig. 6 is a flowchart of the overall page trust degree computation.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and an example.
As shown in Fig. 1 and Fig. 2, TrustRank computes the trust value of each page in the web graph using trust propagation, trust splitting, or a combination of the two. As shown in Fig. 3, statistics show that normal pages and spam pages differ in their distributions on some features; taking page title length as an example, the distributions are shown in Fig. 4 and Fig. 5 respectively.
As shown in Fig. 6, the detailed procedure of the invention is:
Step 1. Compute the trust value of each page in the web graph using the TrustRank algorithm;
Step 2. Collect statistics on the content features of the pages in the web graph that have been labelled as normal or spam; use these statistics to analyze how the feature distributions of normal pages differ from those of spam pages; identify the features whose distributions differ markedly between normal pages and spam pages, called difference features; and determine the approximate distribution function of each difference feature for normal pages;
Step 3. Compute the trust contribution value of the page difference features from the extracted difference feature distributions;
Step 4. Compute the trust degree of each page in the web graph from the page trust value obtained in step 1 and the difference feature trust contribution value obtained in step 3;
Step 5. Sort the pages in the web graph by the trust degree obtained in step 4, with pages of higher trust degree ranked first and pages of lower trust degree ranked last. The higher a page's trust degree, the more likely it is a normal page; the lower its trust degree, the more likely it is a spam page.
The difference feature in step 2 is illustrated here with page title length. Users generally query a search engine by entering keywords, so many spam sites pile large numbers of keywords unrelated to the page content into the page title; this is known as keyword stuffing. The title length (word count) of normal pages approximately follows a normal distribution, whereas spam page titles, which are maliciously stuffed or built from many repetitions of target keywords, follow no regular distribution. As title length increases, so does the likelihood that the page is spam.
In step 2, the distribution of each difference feature over normal pages is approximated by a normal distribution function: the mean and variance of each difference feature are computed over the labelled normal pages, which yields the normal distribution function corresponding to each difference feature.
In step 3, the content-feature trust contribution value of page p is computed as:
$$g(p)=\prod_{i=1}^{n}\left|f_i(x)-y_{pi}(x)\right|\tag{7}$$
where
$$f_i(x)=\frac{1}{\sqrt{2\pi}\,\sigma_i}\,e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$$
is the normal distribution function corresponding to the i-th difference feature, $\mu_i$ is the mean of the i-th difference feature, and $\sigma_i$ is the standard deviation of the i-th difference feature. $y_{pi}(x)$ is the proportion of pages whose i-th difference feature takes the value x, where x is the value of that feature for page p, and n is the number of difference features.
In step 4, the trust degree of page p is computed as:
$$td(p)=\frac{TR(p)}{1+\lambda(1+\ln n)\cdot g(p)}\tag{8}$$
where TR(p) is the trust value of page p obtained in step 1; λ is a parameter, set to 9, that controls how strongly the value g(p) penalizes the page trust value; and ln n is the natural logarithm (base e) of n.
Taking page title length as an example, the trust contribution value of this feature is computed from the title length distribution as |f(x) - y(x)|, where f(x) is the probability density function of the normal distribution, x is the title length variable, μ is the mean title length, σ is the standard deviation of title length, and y(x) is the proportion of pages whose title length is x.
As shown in Fig. 6, the page trust value is combined with the trust contribution of the difference features, and the trust degree of each page in the web graph is computed by formula (8).
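As a worked illustration of the title-length example, the numbers below are hypothetical and chosen only to make the arithmetic easy to follow: a fitted mean μ = 8, standard deviation σ = 3, a page with title length x = 8, a page proportion y(8) = 0.10, a single difference feature (n = 1, so ln n = 0), λ = 9, and a TrustRank value TR(p) = 0.5.

```latex
\begin{aligned}
f(8) &= \frac{1}{\sqrt{2\pi}\cdot 3}\,e^{-\frac{(8-8)^2}{2\cdot 3^2}} \approx 0.1330 \\
g(p) &= \left| f(8) - y(8) \right| \approx \left| 0.1330 - 0.10 \right| = 0.0330 \\
td(p) &= \frac{TR(p)}{1+\lambda(1+\ln n)\cdot g(p)}
       \approx \frac{0.5}{1 + 9\cdot(1+0)\cdot 0.0330}
       \approx \frac{0.5}{1.2968} \approx 0.386
\end{aligned}
```

Under these assumed values the content-based penalty lowers the page's score from 0.5 to about 0.386; a page with g(p) = 0 would keep its TrustRank value unchanged.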

Claims (5)

1. A page sorting method combining difference feature distribution and link features, characterized in that: the page trust value is first computed with the TrustRank algorithm; the feature distributions of pages labelled as normal and as spam are analyzed, and the features whose distributions differ markedly between normal pages and spam pages are selected and called difference features; the page difference feature trust contribution value is then computed from the difference feature distributions; the page trust degree is computed by combining the page trust value with the page content feature value; and the pages are sorted by page trust degree.
2. The page sorting method combining difference feature distribution and link features according to claim 1, characterized in that the specific steps are as follows:
Step 1. Compute the trust value of each page in the web graph using the TrustRank algorithm;
Step 2. Collect statistics on the content features of the pages in the web graph that have been labelled as normal or spam; use these statistics to analyze how the feature distributions of normal pages differ from those of spam pages; identify the features whose distributions differ markedly between normal pages and spam pages, called difference features; and determine the approximate distribution function of each difference feature for normal pages;
Step 3. Compute the trust contribution value of the difference features of page p from the difference feature distributions;
Step 4. Compute the trust degree of page p in the web graph from the trust value of page p obtained in step 1 and the difference feature trust contribution value of page p obtained in step 3;
Step 5. Sort the pages in the web graph by the trust degree obtained in step 4, with pages of higher trust degree ranked first and pages of lower trust degree ranked last; the higher a page's trust degree, the more likely it is a normal page, and the lower its trust degree, the more likely it is a spam page.
3. The page sorting method combining difference feature distribution and link features according to claim 2, characterized in that the difference features in step 2 are chosen as: page word count, page title word count, the ratio of anchor text word count to page content, the ratio of visible content to page content, and the compression ratio of page content. For normal pages these five features basically follow a normal distribution, whereas for spam pages their distributions show no obvious regularity. In step 2, the distribution of each difference feature over normal pages is approximated by a normal distribution function: the mean and variance of each difference feature are computed over the already labelled normal pages, which yields the normal distribution function corresponding to each difference feature.
4. The page sorting method combining difference feature distribution and link features according to claim 2, characterized in that, in step 3, the content-feature trust contribution value of page p is computed as:
$$g(p)=\prod_{i=1}^{n}\left|f_i(x)-y_{pi}(x)\right|\tag{7}$$
where
$$f_i(x)=\frac{1}{\sqrt{2\pi}\,\sigma_i}\,e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$$
is the normal distribution function corresponding to the i-th difference feature, $\mu_i$ is the mean of the i-th difference feature, and $\sigma_i$ is the standard deviation of the i-th difference feature. $y_{pi}(x)$ is the proportion of pages whose i-th difference feature takes the value x, where x is the value of that feature for page p, and n = 5 is the number of difference features.
5. The page sorting method combining difference feature distribution and link features according to claim 2, characterized in that, in step 4, the trust degree of page p is computed as:
$$td(p)=\frac{TR(p)}{1+\lambda(1+\ln n)\cdot g(p)}\tag{8}$$
where TR(p) is the trust value of page p obtained in step 1; λ is a parameter, set to 9, that controls how strongly the value g(p) penalizes the page trust value; and ln n is the natural logarithm (base e) of n.
CN201210215860.8A 2012-06-27 2012-06-27 Page sorting method in combination with difference feature distribution and link feature Expired - Fee Related CN102750380B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210215860.8A CN102750380B (en) 2012-06-27 2012-06-27 Page sorting method in combination with difference feature distribution and link feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210215860.8A CN102750380B (en) 2012-06-27 2012-06-27 Page sorting method in combination with difference feature distribution and link feature

Publications (2)

Publication Number Publication Date
CN102750380A true CN102750380A (en) 2012-10-24
CN102750380B CN102750380B (en) 2014-10-15

Family

ID=47030565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210215860.8A Expired - Fee Related CN102750380B (en) 2012-06-27 2012-06-27 Page sorting method in combination with difference feature distribution and link feature

Country Status (1)

Country Link
CN (1) CN102750380B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064984A (en) * 2013-01-25 2013-04-24 清华大学 Spam webpage identifying method and spam webpage identifying system
CN105930365A (en) * 2016-04-11 2016-09-07 天津大学 Network link topology reconstruction method based on content
CN108984630A (en) * 2018-06-20 2018-12-11 天津大学 Application method of the Node Contraction in Complex Networks importance in spam page detection
CN109831451A (en) * 2019-03-07 2019-05-31 北京华安普特网络科技有限公司 Preventing Trojan method based on firewall
CN109902236A (en) * 2019-03-07 2019-06-18 成都数之联科技有限公司 A kind of spam page down method based on non-probability model
CN111368092A (en) * 2020-02-21 2020-07-03 中国科学院电子学研究所苏州研究院 Knowledge graph construction method based on trusted webpage resources

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060069982A1 (en) * 2004-09-30 2006-03-30 Microsoft Corporation Click distance determination
CN1996299A (en) * 2006-12-12 2007-07-11 孙斌 Ranking method for web page and web site
CN102012934A (en) * 2010-11-30 2011-04-13 百度在线网络技术(北京)有限公司 Method and system for searching picture

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064984A (en) * 2013-01-25 2013-04-24 清华大学 Spam webpage identifying method and spam webpage identifying system
CN103064984B (en) * 2013-01-25 2016-08-10 清华大学 The recognition methods of spam page and system
CN105930365A (en) * 2016-04-11 2016-09-07 天津大学 Network link topology reconstruction method based on content
CN108984630A (en) * 2018-06-20 2018-12-11 天津大学 Application method of the Node Contraction in Complex Networks importance in spam page detection
CN108984630B (en) * 2018-06-20 2021-08-24 天津大学 Application method of node importance in complex network in spam webpage detection
CN109831451A (en) * 2019-03-07 2019-05-31 北京华安普特网络科技有限公司 Preventing Trojan method based on firewall
CN109902236A (en) * 2019-03-07 2019-06-18 成都数之联科技有限公司 A kind of spam page down method based on non-probability model
CN109902236B (en) * 2019-03-07 2021-06-11 成都数之联科技有限公司 Junk web page degradation method based on non-probability model
CN111368092A (en) * 2020-02-21 2020-07-03 中国科学院电子学研究所苏州研究院 Knowledge graph construction method based on trusted webpage resources
CN111368092B (en) * 2020-02-21 2020-12-04 中国科学院电子学研究所苏州研究院 Knowledge graph construction method based on trusted webpage resources

Also Published As

Publication number Publication date
CN102750380B (en) 2014-10-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141015

Termination date: 20200627