CN103064984A

CN103064984A - Spam webpage identifying method and spam webpage identifying system

Info

Publication number: CN103064984A
Application number: CN201310029963XA
Authority: CN
Inventors: 刘奕群; 马少平; 张敏; 金奕江; 张阔
Original assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Current assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Priority date: 2013-01-25
Filing date: 2013-01-25
Publication date: 2013-04-24
Anticipated expiration: 2033-01-25
Also published as: CN103064984B

Abstract

The invention provides a method and a system for identifying spam webpages. The method includes: acquiring the query logs of a search engine and preprocessing them to obtain preprocessed query logs; selecting, from the multiple queries and result webpages of the preprocessed query logs, a query-result set in which the user click rate of the queries and the number of occurrences of the result webpages exceed threshold values; generating a spam webpage sample set by manually screening and extracting multiple spam webpages from the query-result set; computing the spam score of each result webpage and the cheating score of each query in the query-result set according to the query-result set and the spam webpage sample set; and determining a result webpage to be a spam webpage when its spam score exceeds a threshold value, and adding it to the spam webpage set. By discovering and identifying spam webpages through the query logs of the search engine, the method reduces algorithm complexity and offers better generalizability and applicability.

Description

Method and system for identifying spam webpages

Technical field

The present invention relates to the technical field of intelligent network information processing, and in particular to a method and a system for identifying spam webpages.

Background art

The rapid growth of the volume of information on the Internet has made search engines an indispensable means of acquiring information in people's daily work and life. According to CNNIC statistics from December 2011, the number of search engine users among Chinese Internet users had reached 396 million, with a usage rate of nearly 80%, making search one of the most widely used Internet services. Search engines serve as an important entry point when users go online; obtaining a favorable rank in search engine results has therefore become an effective way for Internet resources to gain user attention quickly.

Under this mode of information acquisition, with search engines as the main entry point, the high traffic and high revenue brought by a high search rank tempt many Internet content providers to deceive search engine algorithms by cheating in order to obtain more favorable result ranks. Webpages that profit from such deception are spam webpages. A spam webpage is defined as a webpage that exploits defects in the algorithms executed by a search engine and uses deceptive means against the search engine to obtain a rank higher than its information quality deserves, in pursuit of direct or indirect benefit.

In 2003, Fetterly et al. concluded from a sampling analysis of English webpages that at least 8.1% of the pages were spam webpages, and in 2004 other researchers estimated that 10% to 15% of the content on the Web was spam. According to our sampling analysis of about 800 million Chinese webpages, conducted with the assistance of the Sogou search engine, approximately 15% of the webpages in Chinese Internet resources are spam webpages.

Spam webpages have a significant adverse effect on Internet users, on the Internet resource environment, and on search engines. For users, spam webpages occupy top positions in the result list to trick users into clicking, which makes it harder for users to find the useful information they want and lowers the efficiency of information acquisition; spam webpages are also often combined with viruses, Trojan software and the like, seriously affecting users' information security. For the Internet resource environment, because of national laws and regulations, search engines usually cannot provide paid-ranking advertising services for illegal web content such as pornography and gambling. Raising rank by cheating therefore becomes the main choice for websites that provide such content, so spam webpages are flooded with all kinds of illegal content, and illegal webpages that employ cheating techniques tend to have a wider harmful effect and damage the Internet resource environment more severely. For search engine systems, spam webpages fill the index with useless pages and waste large amounts of storage space and processing time, which increases the cost of processing each query, lowers search efficiency, and reduces users' trust in the search engine.

One kind of conventional spam webpage identification method targets content-based cheating. Such work analyzed the URL features and common-phrase features of spam pages, extracted page content features from 105 million webpages crawled by MSN Search, and used features such as title length, average word length, proportion of visible content, and content compression ratio to distinguish spam webpages from normal webpages. Later work built on this with more content features, including the amount of anchor text and the number of popular words in the page, and used learning-to-rank methods to fuse the features for spam webpage identification.

Another kind is spam webpage identification based on link structure analysis.

The TrustRank algorithm, proposed in 2004, opened a new way of identifying spam webpages using link structure information and can be applied to many types of spam webpages, including those based on content cheating and link cheating. Although the method lacks a way of coping with noisy data in the link graph, a considerable number of researchers have proposed link analysis algorithms for spam webpage identification based on improvements to TrustRank, including Anti-TrustRank and Truncated PageRank.

The above spam webpage identification work has achieved good results on relatively fixed webpage test sets. Many of the evaluation results reported by the internationally known Web Spam Challenge reach an identification accuracy above 80%, and the accuracy reported in many related papers often exceeds 90%. For various reasons, however, these algorithms still face great challenges when applied in the real Internet environment and struggle to achieve their full effect, which is why spam webpages still have a large impact on the use of search engines.

The main shortcomings of the prior art are as follows:

(1) These algorithms can often identify only one particular type of spam webpage and lack robustness. New forms of cheating keep emerging; an algorithm may perform very well on one class of spam webpage yet fail to identify other types, and once spam webpage authors adopt a new form of cheating, such algorithms tend to lose their effectiveness.

(2) As forms of cheating develop, many algorithms need large amounts of computation, storage, or bandwidth to identify spam, for example by building n-gram language models of webpage content, crawling a webpage multiple times, or deeply parsing page scripts. The efficiency of these algorithms is therefore incompatible with the online service requirements of a search engine, and they cannot be applied in actual search engine services.

Summary of the invention

The purpose of the present invention is to solve at least one of the above technical defects.

To achieve the above purpose, an embodiment of one aspect of the invention proposes a method for identifying spam webpages, comprising the following steps: S1: acquiring the query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log comprises multiple queries and result webpages; S2: selecting, from the multiple queries and result webpages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result webpages are greater than threshold values; S3: manually screening and extracting a plurality of spam webpages from the query-result set to generate a spam webpage sample set; S4: computing the spam score of each result webpage and the cheating score of each query in the query-result set according to the query-result set and the spam webpage sample set; and S5: if the spam score of a result webpage in the query-result set is greater than a threshold value, determining the result webpage to be a spam webpage and adding it to the spam webpage set.

The method according to the embodiment of the invention discovers and identifies spam webpages through search engine query log data, which reduces algorithm complexity. Its structure and parameters are simple, its identification results are comprehensive and reliable, and it has good generalizability and adaptability.

In an example of the invention, step S1 specifically comprises: S11: acquiring the query log of the search engine and converting the query log to GBK format; S12: organizing the converted query log to obtain the preprocessed query log.

In an example of the invention, step S2 specifically comprises: S21: segmenting each query of the preprocessed query log into a plurality of keywords, and constructing a first query-result set from each keyword and the result webpages clicked by users; S22: computing the user click frequency of the result webpages for each query in the first query-result set, and selecting the queries and result webpages whose user click rate is greater than a threshold value to generate a second query-result set; S23: computing the number of times each result appears in the second query-result set, and selecting the queries and result webpages whose number of occurrences is greater than a threshold value to generate the query-result set.

In an example of the invention, step S4 specifically comprises: S41: setting the initial cheating score of each query in the query-result set, and setting the initial spam score of each result webpage in the query-result set; S42: computing, for each query in the query-result set, the average of the spam scores of all result webpages associated with that query as the cheating score of the query; and S43: computing, for each result webpage in the query-result set, the average of the cheating scores of all queries associated with that result webpage, and, if the result webpage is not in the spam webpage sample set, using this average as the spam score of the webpage, otherwise leaving the spam score unchanged.

To achieve the above purpose, an embodiment of another aspect of the invention proposes a system for identifying spam webpages, comprising: a preprocessing module for acquiring the query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log comprises multiple queries and result webpages; a screening module for selecting, from the multiple queries and result webpages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result webpages are greater than threshold values; an extraction module for manually screening and extracting a plurality of spam webpages from the query-result set to generate a spam webpage sample set; a computing module for computing the spam score of each result webpage and the cheating score of each query in the query-result set according to the query-result set and the spam webpage sample set; a judging module for judging whether the spam score of a result webpage in the query-result set is greater than a threshold value, the result webpage being a spam webpage if so; and a processing module for adding the result webpage to the spam webpage set.

The system according to the embodiment of the invention discovers and identifies spam webpages through search engine query log data, which reduces algorithm complexity. Its structure and parameters are simple, its identification results are comprehensive and reliable, and it has good generalizability and adaptability.

In an example of the invention, the preprocessing module comprises: an acquisition and conversion unit for acquiring the query log of the search engine and converting the query log to GBK format; and a preprocessing unit for organizing the converted query log to obtain the preprocessed query log.

In an example of the invention, the screening module comprises: a construction unit for segmenting each query of the preprocessed query log into a plurality of keywords and constructing a first query-result set from each keyword and the result webpages clicked by users; a first computing unit for computing the user click frequency of the result webpages for each query in the first query-result set, and selecting the queries and result webpages whose user click rate is greater than a threshold value to generate a second query-result set; and a second computing unit for computing the number of times each result appears in the second query-result set, and selecting the queries and result webpages whose number of occurrences is greater than a threshold value to generate the query-result set.

In an example of the invention, the computing module comprises: a setting unit for setting the initial cheating score of each query in the query-result set and the initial spam score of each result webpage in the query-result set; a third computing unit for computing, for each query in the query-result set, the average of the spam scores of all result webpages associated with that query as the cheating score of the query; and a fourth computing unit for computing, for each result webpage in the query-result set, the average of the cheating scores of all queries associated with that result webpage, and, if the result webpage is not in the spam webpage sample set, using this average as the spam score of the webpage, otherwise leaving the spam score unchanged.

Additional aspects and advantages of the invention are given in part in the following description; in part they will become apparent from the description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 is a flowchart of a method for identifying spam webpages according to an embodiment of the invention;

Fig. 2 is a diagram of the structure of the preprocessed log according to an embodiment of the invention;

Fig. 3 is a schematic diagram of the computation of spam scores over the query-result set according to an embodiment of the invention;

Fig. 4 is a block diagram of a system for identifying spam webpages according to another embodiment of the invention.

Detailed description of the embodiments

Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which identical or similar reference numerals denote identical or similar elements, or elements with identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be interpreted as limiting it.

In the description of the invention, it should be understood that the terms "first", "second", "third" and "fourth" are used for descriptive purposes only and are not to be interpreted as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first", "second", "third" or "fourth" may explicitly or implicitly include one or more of that feature. In the description of the invention, "a plurality of" means two or more, unless specifically defined otherwise.

Fig. 1 is a flowchart of a method for identifying spam webpages according to an embodiment of the invention. As shown in Fig. 1, the method according to the embodiment of the invention comprises the following steps:

Step S101: acquire the query log of a search engine and preprocess the query log to obtain a preprocessed query log, wherein the preprocessed query log comprises multiple queries and result webpages.

Specifically, the query log of the search engine is first acquired and converted to GBK format. The converted query log is then organized to obtain the preprocessed query log, whose structure is shown in Fig. 2. Table 1 lists the content of the search engine query log after preprocessing.
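The conversion and organizing in step S101 can be sketched as follows. This is only an illustration under assumptions that the source does not state: the raw log is taken to be UTF-8 text with one tab-separated record per line holding a query and a clicked URL, and the helper names `to_gbk` and `preprocess_line` are hypothetical.

```python
def to_gbk(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Decode a raw log line from its source encoding and re-encode it as GBK."""
    text = raw.decode(source_encoding, errors="replace")
    # Characters with no GBK mapping are replaced with '?' instead of raising.
    return text.encode("gbk", errors="replace")


def preprocess_line(raw: bytes, source_encoding: str = "utf-8"):
    """Return (query, clicked_url) for a well-formed line, or None otherwise.

    The two-field, tab-separated layout is an assumption made for this sketch.
    """
    line = to_gbk(raw, source_encoding).decode("gbk").strip()
    fields = line.split("\t")
    if len(fields) != 2 or not fields[0] or not fields[1]:
        return None  # drop malformed or incomplete records
    return fields[0], fields[1]
```

Lines that do not match the assumed layout are simply discarded, which is one plausible reading of organizing the log and filtering out useless information.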

Table 1

In one embodiment of the invention, the log used contains all queries submitted to the Sogou search engine during the nine days from March 1 to March 9, 2011. It contains 8,443,963 distinct queries and 12,470,865 distinct clicked webpages, which belong to 1,055,001 distinct websites. The information contained in the log is shown in Table 2.

Table 2

The log information in Table 2 contains enough items for the automatic evaluation of search engines, so the log can be used to evaluate the performance of each Chinese search engine.

Step S102: select, from the multiple queries and result webpages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result webpages are greater than threshold values.

Specifically, each query of the preprocessed query log is segmented into a plurality of keywords, and a first query-result set is constructed from each keyword and the result webpages clicked by users. The user click frequency of the result webpages is then computed for each query in the first query-result set, and the queries and result webpages whose user click rate is greater than a threshold value are selected to generate a second query-result set. Next, the number of times each result appears in the second query-result set is computed, and the queries and result webpages whose number of occurrences is greater than a threshold value are selected to generate the query-result set.
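A minimal sketch of the two-stage filtering in step S102 is given below. The source does not define the click rate or the thresholds precisely, so several choices here are assumptions: the click rate of a (keyword, URL) pair is taken as its share of all clicks recorded for that keyword, the occurrence count of a URL is the number of pairs in the second set that contain it, both threshold defaults are placeholders, and whitespace splitting stands in for a real Chinese word segmenter.

```python
from collections import Counter


def build_query_result_set(click_log, segment=str.split,
                           min_click_rate=0.1, min_occurrences=2):
    """Filter (keyword, url) pairs by click rate, then by URL occurrence count.

    click_log is an iterable of (query, clicked_url) records.
    Returns a dict mapping (keyword, url) to its click count.
    """
    # First query-result set: every keyword paired with each clicked result.
    pair_clicks = Counter()
    for query, url in click_log:
        for keyword in segment(query):
            pair_clicks[(keyword, url)] += 1

    # Total clicks per keyword, used as the denominator of the click rate.
    keyword_clicks = Counter()
    for (keyword, _url), n in pair_clicks.items():
        keyword_clicks[keyword] += n

    # Second query-result set: pairs whose click rate exceeds the threshold.
    second = {pair: n for pair, n in pair_clicks.items()
              if n / keyword_clicks[pair[0]] > min_click_rate}

    # Final set: keep pairs whose URL appears often enough in the second set.
    url_occurrences = Counter(url for (_keyword, url) in second)
    return {pair: n for pair, n in second.items()
            if url_occurrences[pair[1]] > min_occurrences}
```

The click counts kept in the returned dictionary can later serve as association strengths if a weighted average is preferred over an equal-weight one.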

Step S103: manually screen and extract a plurality of spam webpages from the query-result set to generate a spam webpage sample set.

Specifically, a number of search results, for example 1000 query-result pairs, are randomly drawn from the query-result set, and each result webpage among them is labeled as spam or not. Labeling stops when the number of labeled spam webpages reaches a predetermined number, for example 200. If the number of spam webpages has not reached the predetermined number, a further 1000 pairs are drawn from the query-result set and labeled, and so on, until the number of spam webpages reaches the predetermined number. The labeled spam webpages form the spam webpage sample set.
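The batch-wise labeling procedure of step S103 can be sketched as a loop. The function name `collect_spam_samples` and the `is_spam` callback are hypothetical; in the method described above the judgment is made by a human annotator, which the callback merely stands in for.

```python
import random


def collect_spam_samples(result_urls, is_spam, target=200, batch_size=1000, seed=0):
    """Draw random batches of result URLs and label them until enough spam is found.

    is_spam is a placeholder for the manual labeling decision.
    Returns the set of URLs labeled as spam (at most `target` of them).
    """
    rng = random.Random(seed)
    pool = sorted(set(result_urls))  # de-duplicate; sort for reproducible shuffling
    rng.shuffle(pool)

    spam_samples = set()
    for start in range(0, len(pool), batch_size):
        for url in pool[start:start + batch_size]:
            if is_spam(url):
                spam_samples.add(url)
                if len(spam_samples) >= target:
                    return spam_samples  # stop labeling once the quota is met
    return spam_samples  # pool exhausted before reaching the quota
```

Shuffling once and walking through the pool in slices is equivalent to repeatedly drawing batches without replacement, so no page is labeled twice.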

Step S104: compute the spam score of each result webpage and the cheating score of each query in the query-result set according to the query-result set and the spam webpage sample set.

Specifically, the initial cheating score of each query in the query-result set is set to 0, and the initial spam score of each result webpage is set as follows: if the result webpage is in the spam webpage sample set, its initial spam score is set to 1; otherwise it is set to 0. Then, for each query, the average of the spam scores of all result webpages associated with it is computed and used as the cheating score of that query. Finally, for each result webpage, the average of the cheating scores of all queries associated with it is computed; if the result webpage is not in the spam webpage sample set, this average is used as its spam score, and otherwise its spam score is left unchanged. In an embodiment of the invention, the spam scores and cheating scores are updated in turn in this way repeatedly, typically for 20 to 30 iterations, and the spam score finally obtained is taken as the spam score of the result webpage.
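The alternating update of step S104 can be sketched as below, using equal-weight averages. The function name `propagate_scores` and the representation of the query-result set as (query, url) pairs are assumptions made for illustration.

```python
from collections import defaultdict


def propagate_scores(pairs, spam_samples, iterations=20):
    """Alternately update query cheating scores and webpage spam scores.

    pairs: iterable of (query, url) from the query-result set.
    spam_samples: set of URLs manually labeled as spam (held fixed at 1.0).
    Returns (spam_scores, cheating_scores) as two dicts.
    """
    urls_of = defaultdict(set)     # query -> associated result URLs
    queries_of = defaultdict(set)  # url -> associated queries
    for query, url in pairs:
        urls_of[query].add(url)
        queries_of[url].add(query)

    # Initial scores: 1 for labeled spam pages, 0 for everything else.
    spam = {url: 1.0 if url in spam_samples else 0.0 for url in queries_of}
    cheat = {query: 0.0 for query in urls_of}

    for _ in range(iterations):
        # Cheating score of a query: mean spam score of its result pages.
        for query, urls in urls_of.items():
            cheat[query] = sum(spam[u] for u in urls) / len(urls)
        # Spam score of an unlabeled page: mean cheating score of its queries.
        for url, queries in queries_of.items():
            if url not in spam_samples:
                spam[url] = sum(cheat[q] for q in queries) / len(queries)
    return spam, cheat
```

In this sketch the iteration count matters: only the labeled pages are held fixed, so the scores of unlabeled pages can only rise from one pass to the next, and the number of passes effectively controls how far the scores spread.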

Fig. 3 is a schematic diagram of the computation of spam scores over the query-result set according to an embodiment of the invention. As shown in Fig. 3, the query-result set contains the correspondence between queries and results, and the strength of association between the two is recorded by the frequency with which the query-result pair occurs (denoted w_ij in Fig. 3). Starting from the small manually labeled spam webpage sample set, the spam score of each webpage is computed step by step through iteration. Suppose URL₁ is a webpage in the spam webpage sample set (its spam score is 1) and URL₂ is not (its initial spam score is 0). In the first iteration, the cheating score of Query₁ and of Query₃ is then the average of the spam scores of URL₁ and URL₂ (either a direct equal-weight average or an average weighted by strength of association). Next, the spam score of URL₂ becomes the average of the cheating scores of Query₁ and Query₃ (again either equal-weight or weighted by strength of association), so that the spam score spreads from the sample set to other webpages. Proceeding in the same way yields the spam scores of all webpages.
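The two averaging rules can be written out as follows. The notation is introduced here for illustration and does not come from the source: $c^{(t)}(q)$ is the cheating score of query $q$ after iteration $t$, $s^{(t)}(u)$ is the spam score of webpage $u$, $U(q)$ and $Q(u)$ are the associated webpages and queries, $S$ is the spam webpage sample set, and $w_{qu}$ is the strength of association of the pair $(q, u)$. Setting every $w_{qu} = 1$ gives the equal-weight variant.

```latex
\begin{align}
  c^{(t)}(q) &= \frac{\sum_{u \in U(q)} w_{qu}\, s^{(t-1)}(u)}{\sum_{u \in U(q)} w_{qu}}, \\
  s^{(t)}(u) &=
  \begin{cases}
    1, & u \in S, \\[4pt]
    \dfrac{\sum_{q \in Q(u)} w_{qu}\, c^{(t)}(q)}{\sum_{q \in Q(u)} w_{qu}}, & u \notin S.
  \end{cases}
\end{align}
```

As a worked instance under the additional assumption that Query₁ and Query₃ are each associated only with URL₁ and URL₂, and URL₂ only with those two queries, the equal-weight rule gives $c^{(1)}(\text{Query}_1) = c^{(1)}(\text{Query}_3) = (1 + 0)/2 = 0.5$ and then $s^{(1)}(\text{URL}_2) = (0.5 + 0.5)/2 = 0.5$.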

Step S105: determine each result webpage in the query-result set whose spam score is greater than a threshold value to be a spam webpage, and add it to the spam webpage set.

In one embodiment of the invention, the spam score threshold used as the criterion for spam webpages can be set according to circumstances, for example to 0.8. The identified spam webpages are added to the spam webpage set for use as data for identifying spam webpages.
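The thresholding in step S105 amounts to a single filter. In this sketch the function name `identify_spam_pages` is hypothetical, the comparison is strict to match "greater than the threshold", and 0.8 is the example value mentioned above.

```python
def identify_spam_pages(spam_scores, spam_set, threshold=0.8):
    """Add every page whose spam score exceeds the threshold to spam_set.

    spam_scores: dict mapping URL to spam score.
    spam_set: existing set of known spam URLs, updated in place.
    Returns the set of URLs that crossed the threshold in this call.
    """
    newly_identified = {url for url, score in spam_scores.items()
                        if score > threshold}
    spam_set |= newly_identified
    return newly_identified
```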

Fig. 4 is a block diagram of a system for identifying spam webpages according to another embodiment of the invention. As shown in Fig. 4, the system according to the embodiment of the invention comprises a preprocessing module 100, a screening module 200, an extraction module 300, a computing module 400, a judging module 500, and a processing module 600.

The preprocessing module 100 is used to acquire the query log of a search engine and preprocess the query log to obtain a preprocessed query log, wherein the preprocessed query log comprises multiple queries and result webpages.

In one embodiment of the invention, the preprocessing module 100 comprises an acquisition and conversion unit 110 and a preprocessing unit 120.

The acquisition and conversion unit 110 is used to acquire the query log of the search engine and convert the query log to GBK format.

The preprocessing unit 120 is used to organize the converted query log to obtain the preprocessed query log.

In one embodiment of the invention, the query log of the search engine is acquired and its encoding is uniformly converted to GBK. The converted query log is organized and useless information is filtered out to obtain the preprocessed query log; Fig. 2 shows the structure of the preprocessed query log.

The screening module 200 is used to select, from the multiple queries and result webpages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result webpages are greater than threshold values.

In one embodiment of the invention, the screening module 200 comprises a construction unit 210, a first computing unit 220, and a second computing unit 230.

The construction unit 210 is used to segment each query of the preprocessed query log into a plurality of keywords and to construct a first query-result set from each keyword and the result webpages clicked by users.

The first computing unit 220 is used to compute the user click frequency of the result webpages for each query in the first query-result set, and to select the queries and result webpages whose user click rate is greater than a threshold value to generate a second query-result set.

The second computing unit 230 is used to compute the number of times each result appears in the second query-result set, and to select the queries and result webpages whose number of occurrences is greater than a threshold value to generate the query-result set.

In one embodiment of the invention, a number of search results, for example 1000 query-result pairs, are randomly drawn from the query-result set, and each result webpage among them is labeled as spam or not. Labeling stops when the number of labeled spam webpages reaches a predetermined number, for example 200. If the number of spam webpages has not reached the predetermined number, a further 1000 pairs are drawn from the query-result set and labeled, and so on, until the number of spam webpages reaches the predetermined number. The labeled spam webpages form the spam webpage sample set.

The extraction module 300 is used to manually screen and extract a plurality of spam webpages from the query-result set to generate the spam webpage sample set.

The computing module 400 is used to compute the spam score of each result webpage and the cheating score of each query in the query-result set according to the query-result set and the spam webpage sample set.

In one embodiment of the invention, the computing module 400 comprises a setting unit 410, a third computing unit 420, and a fourth computing unit 430.

The setting unit 410 is used to set the initial cheating score of each query in the query-result set and the initial spam score of each result webpage in the query-result set.

The third computing unit 420 is used to compute, for each query in the query-result set, the average of the spam scores of all result webpages associated with that query as the cheating score of the query.

The fourth computing unit 430 is used to compute, for each result webpage in the query-result set, the average of the cheating scores of all queries associated with that result webpage; if the result webpage is not in the spam webpage sample set, this average is used as the spam score of the webpage, and otherwise the spam score is left unchanged.

In an embodiment of the invention, the third computing unit and the fourth computing unit update the cheating scores and spam scores in turn repeatedly, typically for 20 to 30 iterations, and the spam score finally obtained is taken as the spam score of the result webpage.

The judging module 500 is used to judge whether the spam score of a result webpage in the query-result set is greater than a threshold value; if so, the result webpage is a spam webpage. In one embodiment of the invention, the spam score threshold used as the criterion for spam webpages can be set according to circumstances, for example to 0.8.

The processing module 600 is used to add the result webpage to the spam webpage set. The identified spam webpages are added to the spam webpage set for use as data for identifying spam webpages.

Although embodiments of the invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be interpreted as limiting the invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention without departing from its principle and purpose.

Claims

1. the recognition methods of a spam page is characterized in that, may further comprise the steps:

S1: obtain the inquiry log of search engine and described inquiry log is carried out pre-service acquisition pre-service inquiry log, wherein, described pre-service inquiry log comprises multiple queries and results web page;

S2: from the multiple queries of described pre-service inquiry log and results web page, filter out the occurrence number of user's clicking rate of described inquiry and described results web page greater than the inquiry-results set of threshold value;

S3: artificial screening extracts a plurality of spam pages and generates the set of spam page sample from described inquiry-results set;

S4: calculate the rubbish score of each results web page in described inquiry-results set and the cheating score of each inquiry according to described inquiry-results set and the set of spam page sample; And

S5: if the rubbish score of results web page is greater than threshold value then described results web page is spam page in described inquiry-results set, and described results web page added in the described spam page set.

2. the recognition methods of spam page according to claim 1 is characterized in that, described step S1 specifically comprises:

S11: obtain the inquiry log of search engine, and described inquiry log is converted to the GBK form;

S12: the inquiry log after the described conversion is put in order acquisition pre-service inquiry log.

3. The method for identifying spam webpages according to claim 1, characterized in that step S2 specifically comprises:

S21: segmenting each query of the preprocessed query log into a plurality of keywords, and constructing a first query-result set from each keyword and the result webpages clicked by users;

S22: computing the user click frequency on result webpages for each query in the first query-result set, and selecting therefrom the queries and result webpages whose user click rate is greater than a threshold to generate a second query-result set;

S23: counting the number of times each result occurs in the second query-result set, and selecting therefrom the queries and result webpages whose number of occurrences is greater than a threshold to generate the query-result set.
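By way of illustration only, steps S21 to S23 might be sketched as follows. The claims do not define the word segmenter or the exact click-rate formula, so this sketch assumes whitespace segmentation (a real system would likely use a Chinese word segmenter), defines the click rate of a (keyword, URL) pair as its share of all clicks recorded for that keyword, and counts a result's occurrences as the number of distinct pairs containing it. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (keyword, clicked result URL)


def build_first_set(clicks: Iterable[Pair]) -> List[Pair]:
    """S21 (sketch): split each query into keywords and pair each with the clicked URL."""
    return [(kw, url) for query, url in clicks for kw in query.split()]


def filter_by_click_rate(pairs: List[Pair], min_rate: float) -> Set[Pair]:
    """S22 (sketch): keep pairs whose share of the keyword's clicks exceeds min_rate."""
    pair_clicks = Counter(pairs)
    keyword_clicks = Counter(kw for kw, _ in pairs)
    return {
        pair
        for pair, n in pair_clicks.items()
        if n / keyword_clicks[pair[0]] > min_rate
    }


def filter_by_occurrence(pairs: Set[Pair], min_count: int) -> Set[Pair]:
    """S23 (sketch): keep pairs whose URL occurs in more than min_count pairs."""
    url_count = Counter(url for _, url in pairs)
    return {(kw, url) for kw, url in pairs if url_count[url] > min_count}
```

The two thresholds `min_rate` and `min_count` correspond to the thresholds named in S22 and S23; their values are not given in the claims and would need to be chosen empirically.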

4. The method for identifying spam webpages according to claim 1, characterized in that step S4 specifically comprises:

S41: setting an initial cheating score for each query in the query-result set, and setting an initial spam score for each result webpage in the query-result set;

S42: computing the mean of the spam scores of all result webpages associated with each query in the query-result set as the cheating score of the corresponding query; and

S43: computing the mean of the cheating scores of all queries associated with each result webpage in the query-result set; if the result webpage is not among the spam webpages, taking the mean of the cheating scores as the spam score of the corresponding webpage, and otherwise leaving the spam score unchanged.
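By way of illustration only, steps S41 to S43, together with the threshold test of step S5, might be sketched as follows. The claims do not state the initial score values or how many times S42 and S43 are repeated, so this sketch assumes an initial spam score of 1.0 for manually labeled sample pages and 0.0 otherwise, an initial cheating score of 0.0, and a fixed number of rounds. It also reads "not among the spam webpages" in S43 as "not in the manually labeled sample set". All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def propagate_scores(
    pairs: Iterable[Tuple[str, str]],
    spam_samples: Set[str],
    rounds: int = 10,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Alternate S42 and S43 over a bipartite query/result-webpage graph."""
    pages_of = defaultdict(set)    # query -> associated result webpages
    queries_of = defaultdict(set)  # result webpage -> associated queries
    for query, page in pairs:
        pages_of[query].add(page)
        queries_of[page].add(query)

    # S41 (assumed initial values): labeled samples start at 1.0, the rest at 0.0.
    spam = {p: (1.0 if p in spam_samples else 0.0) for p in queries_of}
    cheat = {q: 0.0 for q in pages_of}

    for _ in range(rounds):
        # S42: cheating score of a query = mean spam score of its result webpages.
        for query, pages in pages_of.items():
            cheat[query] = sum(spam[p] for p in pages) / len(pages)
        # S43: spam score of an unlabeled page = mean cheating score of its queries;
        # labeled sample pages keep their score.
        for page, queries in queries_of.items():
            if page not in spam_samples:
                spam[page] = sum(cheat[q] for q in queries) / len(queries)
    return spam, cheat


def flag_spam(spam: Dict[str, float], threshold: float) -> Set[str]:
    """S5 (sketch): pages whose spam score exceeds the threshold."""
    return {page for page, score in spam.items() if score > threshold}
```

Because labeled pages are never overwritten, they act as fixed sources from which scores spread to queries and then to other pages that share those queries.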

5. A system for identifying spam webpages, characterized in that it comprises:

a preprocessing module, configured to acquire a query log of a search engine and preprocess the query log to obtain a preprocessed query log, wherein the preprocessed query log comprises a plurality of queries and result webpages;

a screening module, configured to select, from the plurality of queries and result webpages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result webpages are each greater than a threshold;

an extraction module, configured to generate a spam webpage sample set from a plurality of spam webpages manually screened and extracted from the query-result set;

a computing module, configured to compute a spam score for each result webpage and a cheating score for each query in the query-result set according to the query-result set and the spam webpage sample set;

a judging module, configured to judge whether the spam score of a result webpage in the query-result set is greater than a threshold, the result webpage being a spam webpage if the spam score is greater than the threshold; and

a processing module, configured to add the result webpage to the spam webpage set.

6. The system for identifying spam webpages according to claim 5, characterized in that the preprocessing module comprises:

an acquisition and conversion unit, configured to acquire the query log of the search engine and convert the query log to GBK format;

a preprocessing unit, configured to organize the converted query log to obtain the preprocessed query log.

7. The system for identifying spam webpages according to claim 5, characterized in that the screening module comprises:

a construction unit, configured to segment each query of the preprocessed query log into a plurality of keywords, and to construct a first query-result set from each keyword and the result webpages clicked by users;

a first computing unit, configured to compute the user click frequency on result webpages for each query in the first query-result set, and to select therefrom the queries and result webpages whose user click rate is greater than a threshold to generate a second query-result set;

a second computing unit, configured to count the number of times each result occurs in the second query-result set, and to select therefrom the queries and result webpages whose number of occurrences is greater than a threshold to generate the query-result set.

8. The system for identifying spam webpages according to claim 5, characterized in that the computing module comprises:

a setting unit, configured to set an initial cheating score for each query in the query-result set, and to set an initial spam score for each result webpage in the query-result set;

a third computing unit, configured to compute the mean of the spam scores of all result webpages associated with each query in the query-result set as the cheating score of the corresponding query; and

a fourth computing unit, configured to compute the mean of the cheating scores of all queries associated with each result webpage in the query-result set, and, if the result webpage is not among the spam webpages, to take the mean of the cheating scores as the spam score of the corresponding webpage, and otherwise to leave the spam score unchanged.