CN103064984A - Spam webpage identifying method and spam webpage identifying system - Google Patents

Info

Publication number
CN103064984A
Authority
CN
China
Prior art keywords
inquiry
results
spam
web page
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310029963XA
Other languages
Chinese (zh)
Other versions
CN103064984B (en)
Inventor
刘奕群
马少平
张敏
金奕江
张阔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Original Assignee
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Beijing Sogou Technology Development Co Ltd
Priority to CN201310029963.XA
Publication of CN103064984A
Application granted
Publication of CN103064984B
Legal status: Active
Anticipated expiration legal-status Critical

Abstract

The invention provides a method and a system for identifying spam web pages. The method includes: acquiring query logs of a search engine and preprocessing them to obtain preprocessed query logs; selecting, from the queries and result web pages in the preprocessed query logs, a query-result set in which the user click rate of each query and the number of occurrences of each result web page exceed threshold values; generating a spam web page sample set by manually screening spam web pages out of the query-result set; computing a spam score for each result web page and a cheating score for each query in the query-result set, based on the query-result set and the spam web page sample set; and, when the spam score of a result web page exceeds a threshold value, determining that result web page to be a spam web page and adding it to the spam web page set. Because the method finds and identifies spam web pages from search engine query logs, it reduces algorithmic complexity and generalizes and adapts well.

Description

Method and system for identifying spam web pages
Technical field
The present invention relates to the technical field of intelligent processing of network information, and in particular to a method and a system for identifying spam web pages.
Background art
The rapid growth of information on the Internet has made search engines an indispensable means of acquiring information in people's daily work and life. According to CNNIC statistics from December 2011, the number of search engine users among Chinese Internet users had reached 396 million, a usage rate of nearly 80%, making search one of the most widely used Internet services. Search engines serve as an important entry point when users go online; obtaining a favorable ranking in search results has therefore become an effective way for Internet resources to gain user attention quickly.
Under this mode of information access, with search engines as the main entry point, the high traffic and high revenue brought by a high search ranking tempt many Internet content providers to deceive search engine algorithms by cheating in order to obtain a more favorable ranking. Web pages that profit from such deception are spam web pages. A spam web page is defined as a web page that exploits defects in a search engine's ranking algorithms and uses deceptive techniques aimed at the search engine to obtain a ranking higher than its information quality deserves, in pursuit of direct or indirect benefit.
In 2003, Fetterly et al. concluded from a sampling analysis of English web pages that at least 8.1% of them were spam web pages; in 2004, other researchers estimated that 10% to 15% of the content on the Web was spam. According to our own sampling analysis of about 800 million Chinese web pages, carried out with the assistance of the Sogou search engine, approximately 15% of the web pages in Chinese Internet resources are spam web pages.
Spam web pages have a significant adverse effect on Internet users, on the Internet resource environment, and on search engines. For users, spam web pages occupy top positions in result lists in order to trick users into clicking; this makes it harder for users to find the useful information they want and lowers the efficiency of information acquisition. Spam web pages are also often combined with viruses, Trojan horse software, and the like, which seriously affects users' information security. For the Internet resource environment, national laws and regulations usually prevent search engines from offering paid-placement advertising for illegal content such as pornography and gambling, so raising rankings by cheating becomes the main option for websites that provide such content. Spam web pages are therefore flooded with all kinds of illegal content, and illegal pages that employ cheating techniques tend to have a wider harmful effect and damage the Internet resource environment more severely. For search engine systems, spam web pages fill the index with useless pages and waste large amounts of storage space and processing time, which increases the cost of handling each query, lowers search efficiency, and reduces users' trust in the search engine.
One class of conventional spam web page identification methods targets content-based cheating. Such work analyzed the URL features and common-phrase features of spam pages and extracted page content features from 105 million web pages crawled by MSN Search, using features such as title length, average word length, the proportion of visible content, and the content compression ratio to distinguish spam web pages from normal ones. Later work built on this with additional content features, including the amount of anchor text and the number of popular words in the page, and used learning-to-rank methods to combine the features for spam web page identification.
Another class is spam web page identification based on link structure analysis. The TrustRank algorithm, proposed by Gyöngyi et al. in 2004, opened a new way of identifying spam web pages from link structure information and can be applied to many kinds of spam web pages, including content cheating and link cheating. Although the method has no way of coping with noisy data in the link graph, a considerable number of researchers have proposed link analysis algorithms for spam web page identification that build on improvements to TrustRank, including Anti-TrustRank and Truncated PageRank.
The identification work above has achieved good results on relatively fixed web page test collections. Many of the results reported in the internationally known spam evaluation campaign, the Web Spam Challenge, reach an identification accuracy above 80%, and the experimental accuracy reported in many related papers often exceeds 90%. For various reasons, however, these algorithms still face great challenges when applied to the real Internet environment and rarely achieve their full effect, which is why spam web pages continue to have a large impact on search engine use.
The main shortcomings of the prior art are as follows:
(1) These algorithms can often identify only one particular type of spam web page and lack robustness. New forms of cheating keep emerging; an algorithm may perform very well on one class of spam web pages yet fail to identify other types, and once spam authors adopt a new form of cheating, such algorithms tend to lose their effectiveness.
(2) As cheating techniques evolve, many algorithms need large amounts of computation, storage, or bandwidth to identify spam, for example by building n-gram language models of page content, crawling pages repeatedly, or deeply parsing page scripts. The efficiency of these algorithms is therefore incompatible with the online service requirements of a search engine, and they cannot be applied in an actual search engine service.
Summary of the invention
The present invention aims to solve at least one of the above technical defects.
To achieve the above object, an embodiment of one aspect of the present invention proposes a method for identifying spam web pages, comprising the following steps: S1: obtaining a query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log comprises a plurality of queries and result web pages; S2: selecting, from the plurality of queries and result web pages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result web pages are greater than thresholds; S3: manually screening the query-result set to extract a plurality of spam web pages and generate a spam web page sample set; S4: calculating a spam score for each result web page and a cheating score for each query in the query-result set, according to the query-result set and the spam web page sample set; and S5: if the spam score of a result web page in the query-result set is greater than a threshold, determining that the result web page is a spam web page and adding it to the spam web page set.
The method according to the embodiment of the present invention finds and identifies spam web pages from search engine query log data, which reduces algorithmic complexity. Its structure and parameters are simple, its identification results are comprehensive and reliable, and it generalizes and adapts well.
In one example of the present invention, step S1 specifically comprises: S11: obtaining the query log of the search engine and converting the query log to GBK encoding; S12: organizing the converted query log to obtain the preprocessed query log.
In one example of the present invention, step S2 specifically comprises: S21: segmenting each query of the preprocessed query log into a plurality of keywords, and building a first query-result set from each keyword and the result web pages that users clicked; S22: calculating the user click frequency of the result web pages for each query in the first query-result set, and selecting the queries and result web pages whose user click rate is greater than a threshold to generate a second query-result set; S23: calculating the number of times each result occurs in the second query-result set, and selecting the queries and result web pages whose number of occurrences is greater than a threshold to generate the query-result set.
In one example of the present invention, step S4 specifically comprises: S41: setting an initial cheating score for each query in the query-result set and an initial spam score for each result web page in the query-result set; S42: calculating, for each query in the query-result set, the mean spam score of all result web pages associated with that query, and using it as the cheating score of the query; and S43: calculating, for each result web page in the query-result set, the mean cheating score of all queries associated with that result web page; if the result web page is not in the spam web page sample set, using this mean as the spam score of the page, and otherwise leaving the spam score unchanged.
To achieve the above object, an embodiment of another aspect of the present invention proposes a system for identifying spam web pages, comprising: a preprocessing module, configured to obtain a query log of a search engine and preprocess the query log to obtain a preprocessed query log, wherein the preprocessed query log comprises a plurality of queries and result web pages; a screening module, configured to select, from the plurality of queries and result web pages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result web pages are greater than thresholds; an extraction module, configured to extract, by manual screening, a plurality of spam web pages from the query-result set and generate a spam web page sample set; a computing module, configured to calculate a spam score for each result web page and a cheating score for each query in the query-result set, according to the query-result set and the spam web page sample set; a judging module, configured to judge whether the spam score of a result web page in the query-result set is greater than a threshold, the result web page being a spam web page if so; and a processing module, configured to add the result web page to the spam web page set.
The system according to the embodiment of the present invention finds and identifies spam web pages from search engine query log data, which reduces algorithmic complexity. Its structure and parameters are simple, its identification results are comprehensive and reliable, and it generalizes and adapts well.
In one example of the present invention, the preprocessing module comprises: an acquisition and conversion unit, configured to obtain the query log of the search engine and convert the query log to GBK encoding; and a preprocessing unit, configured to organize the converted query log to obtain the preprocessed query log.
In one example of the present invention, the screening module comprises: a construction unit, configured to segment each query of the preprocessed query log into a plurality of keywords and build a first query-result set from each keyword and the result web pages that users clicked; a first computing unit, configured to calculate the user click frequency of the result web pages for each query in the first query-result set and select the queries and result web pages whose user click rate is greater than a threshold to generate a second query-result set; and a second computing unit, configured to calculate the number of times each result occurs in the second query-result set and select the queries and result web pages whose number of occurrences is greater than a threshold to generate the query-result set.
In one example of the present invention, the computing module comprises: a setting unit, configured to set an initial cheating score for each query in the query-result set and an initial spam score for each result web page in the query-result set; a third computing unit, configured to calculate, for each query in the query-result set, the mean spam score of all result web pages associated with that query and use it as the cheating score of the query; and a fourth computing unit, configured to calculate, for each result web page in the query-result set, the mean cheating score of all queries associated with that result web page, to use this mean as the spam score of the page if the result web page is not in the spam web page sample set, and otherwise to leave the spam score unchanged.
Additional aspects and advantages of the present invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Description of drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a method for identifying spam web pages according to an embodiment of the present invention;
Fig. 2 is a diagram of the organization of the preprocessed log according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the calculation of spam scores over the query-result set according to an embodiment of the present invention;
Fig. 4 is a block diagram of a system for identifying spam web pages according to another embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
In the description of the present invention, it should be understood that the terms "first", "second", "third", and "fourth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first", "second", "third", or "fourth" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality of" means two or more, unless specifically defined otherwise.
Fig. 1 is a flowchart of a method for identifying spam web pages according to an embodiment of the present invention. As shown in Fig. 1, the method according to the embodiment of the present invention comprises the following steps:
Step S101: obtain a query log of a search engine and preprocess it to obtain a preprocessed query log, where the preprocessed query log comprises a plurality of queries and result web pages.
Specifically, the query log of the search engine is first obtained and converted to GBK encoding. The converted query log is then organized to obtain the preprocessed query log, whose structure is shown in Fig. 2. Table 1 lists the content of the search engine query log after preprocessing.
Table 1: content of the search engine query log after preprocessing (table image BDA00002779716100051, not reproduced in this text)
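The encoding conversion in step S101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `to_gbk` and `normalize_log`, the assumed UTF-8 source encoding, and the choice to drop empty lines as a stand-in for the "organize" step are all assumptions made for the example.

```python
def to_gbk(raw_line: bytes, source_encoding: str = "utf-8") -> bytes:
    """Re-encode one raw log line as GBK.

    source_encoding is an assumption; the patent does not say how the
    raw logs are encoded. Characters that GBK cannot represent are
    replaced with '?' instead of aborting the whole run.
    """
    text = raw_line.decode(source_encoding, errors="replace")
    return text.encode("gbk", errors="replace")


def normalize_log(raw_lines, source_encoding="utf-8"):
    """Convert every line to GBK and drop blank lines.

    Dropping blank lines is only a placeholder for the 'organize' step,
    whose details the patent leaves open.
    """
    converted = (to_gbk(line, source_encoding) for line in raw_lines)
    return [line for line in converted if line.strip()]
```

Replacing unrepresentable characters keeps a single malformed record from stopping the conversion of a multi-day log, at the cost of losing those characters.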
In one embodiment of the present invention, the log used contains all queries submitted to the Sogou search engine during the nine days from March 1 to March 9, 2011. It contains 8,443,963 distinct queries and 12,470,865 distinct clicked web pages, which belong to 1,055,001 distinct websites. The information contained in the log is shown in Table 2.
Table 2: information contained in the log (table image BDA00002779716100052, not reproduced in this text)
The log information in Table 2 contains enough items for automatic search engine evaluation, so this log can be used to evaluate the performance of Chinese search engines.
Step S102: select, from the plurality of queries and result web pages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result web pages are greater than thresholds.
Specifically, each query in the preprocessed query log is segmented into a plurality of keywords, and a first query-result set is built from each keyword and the result web pages that users clicked. The user click frequency of the result web pages is then calculated for each query in the first query-result set, and the queries and result web pages whose user click rate is greater than a threshold are selected to generate a second query-result set. Finally, the number of times each result occurs in the second query-result set is calculated, and the queries and result web pages whose number of occurrences is greater than a threshold are selected to generate the query-result set.
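The three screening stages of step S102 can be sketched as below. The patent does not define "click rate" or "number of occurrences" precisely, so this sketch makes two labeled assumptions: the click rate of a keyword-URL pair is that pair's share of all clicks recorded for the keyword, and the occurrence count of a URL is its total click count within the second set. The function name, the `tokenize` callback, and the input shape (an iterable of `(query, clicked_url)` pairs) are likewise hypothetical.

```python
from collections import Counter


def build_query_result_set(click_log, tokenize, min_click_rate, min_occurrences):
    """Return {(keyword, url): click_count} for pairs passing both filters.

    click_log: iterable of (query, clicked_url) pairs (assumed log shape).
    tokenize:  function splitting a query into keywords (assumed; a real
               system for Chinese queries would need a word segmenter).
    """
    # First set: pair every keyword of a query with the URL the user clicked.
    pair_clicks = Counter()
    for query, url in click_log:
        for keyword in tokenize(query):
            pair_clicks[(keyword, url)] += 1

    # Second set: keep pairs whose share of the keyword's clicks exceeds
    # the click-rate threshold (assumed definition of click rate).
    keyword_clicks = Counter()
    for (keyword, _url), n in pair_clicks.items():
        keyword_clicks[keyword] += n
    second = {
        pair: n
        for pair, n in pair_clicks.items()
        if n / keyword_clicks[pair[0]] > min_click_rate
    }

    # Final set: keep pairs whose URL occurs often enough in the second set
    # (assumed definition: total clicks on that URL within the second set).
    url_occurrences = Counter()
    for (_keyword, url), n in second.items():
        url_occurrences[url] += n
    return {
        pair: n
        for pair, n in second.items()
        if url_occurrences[pair[1]] > min_occurrences
    }
```

For a quick trial, `str.split` can serve as the tokenizer for space-separated queries; both thresholds are strict ("greater than"), following the wording of the step.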
Step S103: manually screen the query-result set to extract a plurality of spam web pages and generate a spam web page sample set.
Specifically, a number of search results, for example 1000 query-result pairs, are drawn at random from the query-result set, and each result web page among them is labeled as spam or not spam. Labeling stops once the number of labeled spam web pages reaches a predetermined quantity, for example 200. If the number of spam web pages has not reached the predetermined quantity, another 1000 pairs are drawn from the query-result set and labeled, and so on, until the number of spam web pages reaches the predetermined quantity. The labeled spam web pages form the spam web page sample set.
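The batch-wise labeling loop of step S103 can be sketched as follows. The labeling itself is manual in the patent; here a hypothetical `is_spam` callback stands in for the human annotator, and the function name, the fixed random seed, and the de-duplication of URLs are assumptions made for the example.

```python
import random


def collect_spam_samples(result_urls, is_spam, batch_size=1000, target=200, seed=0):
    """Draw random batches for labeling until `target` spam pages are found.

    is_spam is a stand-in for the human annotator's judgment.
    Returns the set of URLs labeled as spam (the sample set).
    """
    rng = random.Random(seed)
    pool = list(dict.fromkeys(result_urls))  # de-duplicate, keep order
    rng.shuffle(pool)                        # random drawing without replacement
    spam_samples = set()
    for start in range(0, len(pool), batch_size):
        for url in pool[start:start + batch_size]:
            if is_spam(url):
                spam_samples.add(url)
                if len(spam_samples) >= target:
                    return spam_samples      # enough spam labeled: stop early
    return spam_samples                      # pool exhausted before the target
```

The defaults of 1000 and 200 mirror the example figures in the text; if the pool runs out first, the function simply returns whatever spam pages were found.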
Step S104: calculate a spam score for each result web page and a cheating score for each query in the query-result set, according to the query-result set and the spam web page sample set.
Specifically, the initial cheating score of each query in the query-result set is set to 0, and the initial spam score of each result web page in the query-result set is set as follows: if the result web page is in the spam web page sample set, its initial spam score is set to 1; otherwise it is set to 0. Then, for each query in the query-result set, the mean spam score of all result web pages associated with the query is calculated and used as the cheating score of that query. Finally, for each result web page in the query-result set, the mean cheating score of all queries associated with the page is calculated; if the result web page is not in the spam web page sample set, this mean becomes the spam score of the page, and otherwise the spam score is left unchanged. In an embodiment of the present invention, the spam scores and cheating scores are updated in this order repeatedly, typically 20 to 30 times, and the spam score obtained at the end is taken as the spam score of the result web page.
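The alternating update of step S104 can be sketched as below, using the equal-weight mean. The function name, the input shape (an iterable of `(query, url)` pairs), and the default of 20 iterations (the lower end of the 20 to 30 range mentioned in the text) are assumptions for illustration, not details fixed by the patent.

```python
from collections import defaultdict


def propagate_spam_scores(pairs, spam_seeds, iterations=20):
    """Propagate spam scores over the bipartite query-result graph.

    pairs:      iterable of (query, url) edges of the query-result set.
    spam_seeds: URLs in the manually labeled spam sample set.
    Returns (spam_score_by_url, cheating_score_by_query).
    """
    urls_of = defaultdict(set)      # query -> associated result pages
    queries_of = defaultdict(set)   # result page -> associated queries
    for query, url in pairs:
        urls_of[query].add(url)
        queries_of[url].add(query)

    # Initialization: queries start at 0; seed pages at 1, other pages at 0.
    cheating = {q: 0.0 for q in urls_of}
    spam = {u: (1.0 if u in spam_seeds else 0.0) for u in queries_of}

    for _ in range(iterations):
        # A query's cheating score is the mean spam score of its results.
        for q, urls in urls_of.items():
            cheating[q] = sum(spam[u] for u in urls) / len(urls)
        # A non-seed page takes the mean cheating score of its queries;
        # seed pages keep their score of 1.
        for u, queries in queries_of.items():
            if u not in spam_seeds:
                spam[u] = sum(cheating[q] for q in queries) / len(queries)
    return spam, cheating
```

Each iteration touches every edge of the query-result set a constant number of times, so the cost is linear in the number of query-result pairs per pass, without any page content being fetched or parsed.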
Fig. 3 is a schematic diagram of the calculation of spam scores over the query-result set according to an embodiment of the present invention. As shown in Fig. 3, the query-result set contains the correspondence between queries and results, and the strength of association between a query and a result is recorded by the frequency with which the query-result pair occurs (denoted w_ij in Fig. 3). Starting from the small, manually labeled spam web page sample set, the spam score of every web page is computed step by step through iteration. Suppose URL1 is a web page in the spam web page sample set (its spam score is 1) and URL2 is not (its initial spam score is 0). Then, in the first iteration, the keyword cheating scores of Query1 and Query3 are the mean of the spam scores of URL1 and URL2 (either a direct mean with equal weights or a mean weighted by association strength). Next, the spam score of URL2 becomes the mean of the keyword cheating scores of Query1 and Query3 (again either equally weighted or weighted by association strength), so that spam scores spread from the sample set to other web pages. Proceeding in the same way yields the spam scores of all web pages.
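The weighted variant mentioned above can be written out as a pair of update rules. The notation is an assumption introduced here for illustration and does not appear in the patent: s(u) is the spam score of page u, c(q) the cheating score of query q, R(q) the set of result pages associated with q, Q(u) the set of queries associated with u, S the spam web page sample set, w_{qu} the association strength between q and u, and t the iteration index.

```latex
c^{(t)}(q) = \frac{\sum_{u \in R(q)} w_{qu}\, s^{(t-1)}(u)}{\sum_{u \in R(q)} w_{qu}}
\qquad
s^{(t)}(u) =
\begin{cases}
1, & u \in S \\[4pt]
\dfrac{\sum_{q \in Q(u)} w_{qu}\, c^{(t)}(q)}{\sum_{q \in Q(u)} w_{qu}}, & u \notin S
\end{cases}
```

Setting every w_{qu} to 1 recovers the equal-weight mean. In the two-page example, an equal-weight first iteration gives c(Query1) = c(Query3) = (1 + 0)/2 = 0.5, after which URL2 receives s = 0.5.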
Step S105: a result web page in the query-result set whose spam score is greater than the threshold is determined to be a spam web page and is added to the spam web page set.
In one embodiment of the present invention, the spam score threshold used as the criterion for spam web pages can be set as circumstances require, for example to 0.8. The identified spam web pages are added to the spam web page set, where they serve as data for identifying spam web pages.
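The final decision in step S105 amounts to a threshold test. A minimal sketch follows, assuming the scores are held in a dictionary keyed by URL; the function name and the default of 0.8 (taken from the example value in the text) are illustrative choices.

```python
def flag_spam_pages(spam_scores, known_spam, threshold=0.8):
    """Return the spam page set extended with every page scoring above the threshold.

    spam_scores: {url: spam score}, for example the output of the iteration step.
    known_spam:  the existing spam page set (assumed to be a collection of URLs).
    """
    newly_flagged = {url for url, score in spam_scores.items() if score > threshold}
    return set(known_spam) | newly_flagged
```

The comparison is strict, following the "greater than the threshold" wording, so a page scoring exactly 0.8 is not flagged.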
The method according to the embodiment of the present invention finds and identifies spam web pages from search engine query log data, which reduces algorithmic complexity. Its structure and parameters are simple, its identification results are comprehensive and reliable, and it generalizes and adapts well.
Fig. 4 is a block diagram of a system for identifying spam web pages according to another embodiment of the present invention. As shown in Fig. 4, the system according to the embodiment of the present invention comprises a preprocessing module 100, a screening module 200, an extraction module 300, a computing module 400, a judging module 500, and a processing module 600.
The preprocessing module 100 is configured to obtain a query log of a search engine and preprocess it to obtain a preprocessed query log, where the preprocessed query log comprises a plurality of queries and result web pages.
In one embodiment of the present invention, the preprocessing module 100 comprises an acquisition and conversion unit 110 and a preprocessing unit 120.
The acquisition and conversion unit 110 is configured to obtain the query log of the search engine and convert it to GBK encoding.
The preprocessing unit 120 is configured to organize the converted query log to obtain the preprocessed query log.
In one embodiment of the present invention, the query log of the search engine is obtained and its encoding is uniformly converted to GBK. The converted query log is organized and useless information is filtered out to obtain the preprocessed query log; Fig. 2 shows the structure of the preprocessed query log.
The screening module 200 is configured to select, from the plurality of queries and result web pages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result web pages are greater than thresholds.
In one embodiment of the present invention, the screening module 200 comprises a construction unit 210, a first computing unit 220, and a second computing unit 230.
The construction unit 210 is configured to segment each query of the preprocessed query log into a plurality of keywords and to build a first query-result set from each keyword and the result web pages that users clicked.
The first computing unit 220 is configured to calculate the user click frequency of the result web pages for each query in the first query-result set and to select the queries and result web pages whose user click rate is greater than a threshold to generate a second query-result set.
The second computing unit 230 is configured to calculate the number of times each result occurs in the second query-result set and to select the queries and result web pages whose number of occurrences is greater than a threshold to generate the query-result set.
The extraction module 300 is configured to extract, by manual screening, a plurality of spam web pages from the query-result set and generate a spam web page sample set.
In one embodiment of the present invention, a number of search results, for example 1000 query-result pairs, are drawn at random from the query-result set, and each result web page among them is labeled as spam or not spam. Labeling stops once the number of labeled spam web pages reaches a predetermined quantity, for example 200. If the number of spam web pages has not reached the predetermined quantity, another 1000 pairs are drawn from the query-result set and labeled, and so on, until the number of spam web pages reaches the predetermined quantity. The labeled spam web pages form the spam web page sample set.
The computing module 400 is configured to calculate a spam score for each result web page and a cheating score for each query in the query-result set, according to the query-result set and the spam web page sample set.
In one embodiment of the present invention, the computing module 400 comprises a setting unit 410, a third computing unit 420, and a fourth computing unit 430.
The setting unit 410 is configured to set an initial cheating score for each query in the query-result set and an initial spam score for each result web page in the query-result set.
The third computing unit 420 is configured to calculate, for each query in the query-result set, the mean spam score of all result web pages associated with that query and to use it as the cheating score of the query.
The fourth computing unit 430 is configured to calculate, for each result web page in the query-result set, the mean cheating score of all queries associated with that result web page; if the result web page is not in the spam web page sample set, this mean is used as the spam score of the page, and otherwise the spam score is left unchanged.
In an embodiment of the present invention, the third computing unit and the fourth computing unit update the spam scores and cheating scores in this order repeatedly, typically 20 to 30 times, and the spam score obtained at the end is taken as the spam score of the result web page.
The judging module 500 determines whether the spam score of each result webpage in the query-result set is greater than a threshold; a result webpage whose score exceeds the threshold is determined to be a spam webpage. In one embodiment of the invention, the spam score threshold used as the criterion can be chosen according to the circumstances, for example 0.8.
The processing module 600 adds such result webpages to the spam webpage set. The identified spam webpages in this set then serve as data for identifying further spam webpages.
According to the system of the embodiment of the invention, finding and identifying spam webpages from search engine query log data reduces algorithm complexity; the structure and parameters are simple, the recognition results are comprehensive and reliable, and the system generalizes and adapts well.
Although embodiments of the present invention have been shown and described above, it will be understood that these embodiments are exemplary and shall not be construed as limiting the invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the invention without departing from its principles and spirit.
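The judging and processing steps can be illustrated with a short Python sketch. The function name `classify_spam` and its arguments are hypothetical; the default threshold of 0.8 is the example value given above, and the comparison is strict because the text requires the score to be greater than the threshold.

```python
def classify_spam(spam_scores, spam_set, threshold=0.8):
    """Return the spam webpage set extended with newly identified pages.

    spam_scores: dict mapping URL -> spam score.
    spam_set: existing set of known spam webpages (not modified in place).
    """
    # A page is spam only if its score is strictly greater than the threshold.
    flagged = {url for url, score in spam_scores.items() if score > threshold}
    return spam_set | flagged


example_set = classify_spam({"a": 0.9, "b": 0.8, "c": 0.1}, {"seed"})
```

Here page "b" is not flagged, since a score equal to the threshold does not exceed it.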

Claims (8)

1. A method for identifying spam webpages, characterized by comprising the following steps:
S1: acquiring a query log of a search engine and preprocessing the query log to obtain a preprocessed query log, wherein the preprocessed query log comprises a plurality of queries and result webpages;
S2: selecting, from the plurality of queries and result webpages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result webpages are greater than thresholds;
S3: manually screening and extracting a plurality of spam webpages from the query-result set to generate a spam webpage sample set;
S4: computing the spam score of each result webpage and the cheating score of each query in the query-result set according to the query-result set and the spam webpage sample set; and
S5: if the spam score of a result webpage in the query-result set is greater than a threshold, determining the result webpage to be a spam webpage and adding the result webpage to the spam webpage set.
2. The method for identifying spam webpages according to claim 1, characterized in that step S1 specifically comprises:
S11: acquiring the query log of the search engine and converting the query log to GBK format;
S12: organizing the converted query log to obtain the preprocessed query log.
3. The method for identifying spam webpages according to claim 1, characterized in that step S2 specifically comprises:
S21: segmenting each query of the preprocessed query log into a plurality of keywords, and constructing a first query-result set from each keyword and the result webpages clicked by users;
S22: computing the frequency with which users click result webpages for each query in the first query-result set, and selecting therefrom the queries and result webpages whose user click rate is greater than a threshold to generate a second query-result set;
S23: counting the number of times each result occurs in the second query-result set, and selecting therefrom the queries and result webpages whose number of occurrences is greater than a threshold to generate the query-result set.
4. The method for identifying spam webpages according to claim 1, characterized in that step S4 specifically comprises:
S41: setting the initial cheating score of each query in the query-result set, and setting the initial spam score of each result webpage in the query-result set;
S42: computing the mean of the spam scores of all result webpages associated with each query in the query-result set as the cheating score of the corresponding query; and
S43: computing the mean of the cheating scores of all queries associated with each result webpage in the query-result set, and, if the result webpage is not among the spam webpages, taking the mean of the cheating scores as the spam score of the corresponding webpage, and otherwise leaving the spam score unchanged.
5. A system for identifying spam webpages, characterized by comprising:
a preprocessing module, configured to acquire a query log of a search engine and preprocess the query log to obtain a preprocessed query log, wherein the preprocessed query log comprises a plurality of queries and result webpages;
a screening module, configured to select, from the plurality of queries and result webpages of the preprocessed query log, a query-result set in which the user click rate of the queries and the number of occurrences of the result webpages are greater than thresholds;
an extraction module, configured to generate a spam webpage sample set from a plurality of spam webpages manually screened and extracted from the query-result set;
a computing module, configured to compute the spam score of each result webpage and the cheating score of each query in the query-result set according to the query-result set and the spam webpage sample set;
a judging module, configured to determine whether the spam score of a result webpage in the query-result set is greater than a threshold, the result webpage being a spam webpage if the score is greater than the threshold; and
a processing module, configured to add the result webpage to the spam webpage set.
6. The system for identifying spam webpages according to claim 5, characterized in that the preprocessing module comprises:
an acquisition and conversion unit, configured to acquire the query log of the search engine and convert the query log to GBK format;
a preprocessing unit, configured to organize the converted query log to obtain the preprocessed query log.
7. The system for identifying spam webpages according to claim 5, characterized in that the screening module comprises:
a construction unit, configured to segment each query of the preprocessed query log into a plurality of keywords, and to construct a first query-result set from each keyword and the result webpages clicked by users;
a first computing unit, configured to compute the frequency with which users click result webpages for each query in the first query-result set, and to select therefrom the queries and result webpages whose user click rate is greater than a threshold to generate a second query-result set;
a second computing unit, configured to count the number of times each result occurs in the second query-result set, and to select therefrom the queries and result webpages whose number of occurrences is greater than a threshold to generate the query-result set.
8. The system for identifying spam webpages according to claim 5, characterized in that the computing module comprises:
a setting unit, configured to set the initial cheating score of each query in the query-result set and the initial spam score of each result webpage in the query-result set;
a third computing unit, configured to compute the mean of the spam scores of all result webpages associated with each query in the query-result set as the cheating score of the corresponding query; and
a fourth computing unit, configured to compute the mean of the cheating scores of all queries associated with each result webpage in the query-result set, and, if the result webpage is not among the spam webpages, to take the mean of the cheating scores as the spam score of the corresponding webpage, and otherwise to leave the spam score unchanged.
CN201310029963.XA 2013-01-25 2013-01-25 The recognition methods of spam page and system Active CN103064984B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310029963.XA CN103064984B (en) 2013-01-25 2013-01-25 The recognition methods of spam page and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310029963.XA CN103064984B (en) 2013-01-25 2013-01-25 The recognition methods of spam page and system

Publications (2)

Publication Number Publication Date
CN103064984A true CN103064984A (en) 2013-04-24
CN103064984B CN103064984B (en) 2016-08-10

Family

ID=48107614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310029963.XA Active CN103064984B (en) 2013-01-25 2013-01-25 The recognition methods of spam page and system

Country Status (1)

Country Link
CN (1) CN103064984B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814093A (en) * 2010-04-02 2010-08-25 南京邮电大学 Similarity-based semi-supervised learning spam page detection method
US20110314122A1 (en) * 2010-06-17 2011-12-22 Microsoft Corporation Discrepancy detection for web crawling
CN102184208A (en) * 2011-04-29 2011-09-14 武汉慧人信息科技有限公司 Junk web page detection method based on multi-dimensional data abnormal cluster mining
CN102750380A (en) * 2012-06-27 2012-10-24 山东师范大学 Page sorting method in combination with difference feature distribution and link feature

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598460A (en) * 2013-10-30 2015-05-06 腾讯科技(深圳)有限公司 Method and device for identifying garbage anchor text
CN104598460B (en) * 2013-10-30 2018-11-02 腾讯科技(深圳)有限公司 The recognition methods of rubbish Anchor Text and device
CN103595732B (en) * 2013-11-29 2017-09-15 北京奇虎科技有限公司 A kind of method and device of network attack evidence obtaining
CN103595732A (en) * 2013-11-29 2014-02-19 北京奇虎科技有限公司 Method and device for obtaining evidence of network attack
CN104933055B (en) * 2014-03-18 2020-01-31 腾讯科技(深圳)有限公司 Webpage identification method and webpage identification device
CN104933055A (en) * 2014-03-18 2015-09-23 腾讯科技(深圳)有限公司 Webpage identification method and webpage identification device
CN106844371B (en) * 2015-12-03 2020-09-08 阿里巴巴集团控股有限公司 Search optimization method and device
CN106844371A (en) * 2015-12-03 2017-06-13 阿里巴巴集团控股有限公司 Chess game optimization method and apparatus
CN106844685B (en) * 2017-01-26 2020-07-28 百度在线网络技术(北京)有限公司 Method, device and server for identifying website
CN106844685A (en) * 2017-01-26 2017-06-13 百度在线网络技术(北京)有限公司 Method, device and server for recognizing website
CN110147472A (en) * 2017-07-14 2019-08-20 北京搜狗科技发展有限公司 Detection method, device and the detection device for website of practising fraud of cheating website
CN110147472B (en) * 2017-07-14 2021-10-15 北京搜狗科技发展有限公司 Detection method and device for cheating sites and detection device for cheating sites
CN109255069A (en) * 2018-07-31 2019-01-22 阿里巴巴集团控股有限公司 A kind of discrete text content risks recognition methods and system
CN109361957A (en) * 2018-10-18 2019-02-19 广州酷狗计算机科技有限公司 Send the method and apparatus for thumbing up request
CN109361957B (en) * 2018-10-18 2021-02-12 广州酷狗计算机科技有限公司 Method and device for sending praise request
CN109831451A (en) * 2019-03-07 2019-05-31 北京华安普特网络科技有限公司 Preventing Trojan method based on firewall

Also Published As

Publication number Publication date
CN103064984B (en) 2016-08-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant