CN104951539A - Internet data center harmful information monitoring system - Google Patents

Publication number
CN104951539A
Authority
CN
China
Prior art keywords
module
search
harmful information
webpage
web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510343226.6A
Other languages
Chinese (zh)
Other versions
CN104951539B (en)
Inventor
彭光辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Aierpu Science & Technology Co Ltd
Original Assignee
Chengdu Aierpu Science & Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Aierpu Science & Technology Co Ltd
Priority to CN201510343226.6A
Publication of CN104951539A
Application granted
Publication of CN104951539B
Current legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The invention discloses an internet data center harmful information monitoring system. The system comprises a crawler system and a harmful information monitoring system. The harmful information monitoring system obtains webpage data in an internet data center through the crawler system and analyzes the webpage data for harmful content. The crawler system comprises multiple crawler clusters composed of crawler nodes and crawler root nodes. Each crawler node comprises a multi-thread webpage acquisition module, a webpage library, an encoding identification and processing module, an automatic webpage content extraction module, a URL filter, a URL deduplication module and a URL scheduling module. The harmful information monitoring system achieves more precise search through a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit. The system provides a strong data collection function: the crawler clusters carry out comprehensive real-time monitoring of dynamic and static webpages, data related to sensitive words can be collected from massive data, and harmful webpages are actively discovered.

Description

Internet data center harmful information monitoring system
Technical field
The present invention relates to an internet data center harmful information monitoring system.
Background
With the rapid development of networks, the World Wide Web has become the carrier of a vast amount of information, and extracting and using this information effectively has become a huge challenge. Search engines, as tools that help people retrieve information, have become the entrance and guide through which users access the World Wide Web. However, these general-purpose search engines have certain limitations.
In today's increasingly active online community environment, every internet user may become a publisher and disseminator of harmful information, and the channels through which harmful content spreads are ever wider, including blogs, news sites, forums, microblogs and other channels. Web crawlers are the foundational technology that makes search engines possible, and the arrival of the big data era and the rapid development of internet technology give web crawler research even greater significance. To cope with challenges such as the sharp growth in the volume of web data, the short update cycle of online text and dynamically changing webpage structures, efficient web crawlers that run continuously have become a research focus in harmful information mining.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing an internet data center harmful information monitoring system. The system provides a powerful data collection function: multiple crawler clusters carry out comprehensive real-time monitoring of dynamic and static webpages, data related to sensitive words is collected from massive data so that harmful webpages are actively discovered, and more precise search is achieved through a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit.
The object of the invention is achieved through the following technical solution: an internet data center harmful information monitoring system comprising a crawler system and a harmful information monitoring system, wherein the harmful information monitoring system obtains webpage data in the internet data center through the crawler system and analyzes it for harmful content.
The crawler system comprises one or more crawler clusters. Each crawler cluster includes multiple crawler nodes and one crawler root node, which together form a distributed data acquisition network. The crawler root node controls and manages the crawler nodes in its cluster and communicates with the harmful information monitoring system; the crawler nodes collect harmful information from the network. Each crawler node is composed of the following modules:
1. Multi-thread webpage acquisition module: comprises multiple webpage acquisition channels and webpage parsing modules; each type of webpage is acquired through the acquisition channel and parsing module that match it.
2. Webpage library: stores the webpages acquired by the multi-thread webpage acquisition module.
3. Encoding identification and processing module: automatically identifies the encoding of a webpage and converts it.
4. Automatic webpage content extraction module: comprises a dynamic webpage content extraction module and a static webpage content extraction module and, after encoding conversion, captures the URLs of pages that contain harmful information according to a sensitive-word lexicon.
5. URL filter: filters out URLs that do not need to be downloaded.
6. URL deduplication module: judges whether a filtered URL is identical to a URL already held in the URL store; if it is, the URL receives no further processing.
7. URL scheduling module: controls the multi-thread webpage acquisition module to download the corresponding webpages according to the deduplicated URL queue.
8. Webpage deduplication module: judges whether the content of a webpage is identical to content that has already been downloaded; if it is, the webpage receives no further processing and is deleted from the webpage library.
The webpage deduplication module comprises a fingerprint calculation module, a fingerprint library and a fingerprint deduplication module. The fingerprint calculation module computes a fingerprint from the content of a webpage using a webpage fingerprint algorithm. The fingerprint deduplication module compares the generated fingerprint with the fingerprints in the fingerprint library; if an identical or similar fingerprint exists, the webpage content is judged to have been downloaded already. The fingerprint library stores the fingerprint data, and the fingerprint libraries of all crawler nodes are updated synchronously.
9. Interval crawling module: automatically generates interval rules from webpage scores and website weights, and controls the automatic webpage content extraction module to crawl webpages at the corresponding intervals.
10. Crawl rule setting module: controls the automatic webpage content extraction module to perform the corresponding crawl actions on webpages according to the configured crawl rules.
11. Anti-crawler capture module: when a webpage is protected by an anti-crawler program, this module is started to force acquisition of the target webpage.
12. Acquisition monitoring module: forwards the working state, acquisition tasks, acquisition depth and log information of the crawler node to the crawler root node for aggregation, and receives control from the crawler root node.
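As an illustration only, the interplay of the URL filter, URL deduplication module and URL scheduling module (items 5 to 7 above) might be sketched as follows. The patent does not disclose filter rules or data structures, so the skipped file suffixes, the in-memory set standing in for the URL store, and the names `should_download` and `UrlScheduler` are all assumptions.

```python
from collections import deque
from typing import Optional
from urllib.parse import urldefrag, urlsplit

# Hypothetical filter rule: file types the crawler does not need to download.
SKIPPED_SUFFIXES = (".jpg", ".png", ".gif", ".css", ".zip", ".exe")


def should_download(url: str) -> bool:
    """URL filter: keep only http(s) URLs that do not end in a skipped suffix."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    return not parts.path.lower().endswith(SKIPPED_SUFFIXES)


class UrlScheduler:
    """URL deduplication plus a FIFO queue that feeds the acquisition threads."""

    def __init__(self) -> None:
        self.seen = set()     # stands in for the URL store (assumption)
        self.queue = deque()  # deduplicated URL queue

    def submit(self, url: str) -> bool:
        """Queue a URL unless it is filtered out or already known."""
        url, _fragment = urldefrag(url)  # "#section" does not change the page
        if not should_download(url) or url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self) -> Optional[str]:
        """Hand the next URL to a downloader thread, or None when idle."""
        return self.queue.popleft() if self.queue else None
```

In a multi-threaded crawler node the queue and the set would need a lock or a thread-safe replacement; that is left out here to keep the sketch short.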
The harmful information monitoring system comprises a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit.
The harmful information search unit comprises a local search port and a network search port. The local search port starts the search engine of the local crawler node and performs the harmful information search task locally. The network search port starts the search engines of multiple crawler nodes so that the nodes perform the harmful information search task simultaneously, and the search results are returned to the local crawler node through the same network search port.
The harmful information search unit further comprises one of, or a combination of, a keyword filter, a tag field filter, a metadata field filter and a time filter; precise search is achieved through these filters and their combinations.
The keyword processing unit generates keyword search instructions, and the harmful information search unit performs the harmful information search task according to these instructions.
The fuzzy matching unit matches approximate vocabulary similar to the input search string. While the harmful information search unit searches for the search string, it also searches for the approximate vocabulary and returns the corresponding results.
The automatic word segmentation unit automatically extracts keywords from the input search string, so that the harmful information search unit can perform a precise search using the extracted keywords.
The keyword search instruction comprises an ID number, category, event title, keyword options, excluded-keyword options, weight and start time. The excluded-keyword options ensure that a webpage containing any keyword listed in them is not matched as a harmful information webpage.
The harmful information monitoring system further comprises an automatic summary generation unit, which dynamically generates a webpage summary of the target webpage according to the input search string and its approximate vocabulary.
The automatic summary generation unit also uses the keyword processing unit to analyze the keywords of a webpage and automatically extracts key fields to generate the webpage summary.
The harmful information monitoring system further comprises a result statistical analysis unit, which analyzes and compiles statistics on the returned search results. The statistical analysis unit comprises a task public-opinion chart generation module, a report generation module, a task article statistics module, a task trend analysis module and a task overview analysis module.
The task public-opinion chart generation module generates a task public-opinion chart from the search conditions and search results, covering harmful information content statistics, matched keyword count statistics and webpage count statistics by category.
The report generation module generates reports from the search result information.
The task trend analysis module generates increment charts.
The task overview analysis module generates task lists, website distribution charts and media distribution charts.
The harmful information monitoring system further comprises a firewall, through which the crawler system safely crawls webpage data in the internet data center.
The beneficial effects of the invention are as follows. The proposed internet data center harmful information monitoring system can collect data related to sensitive words from massive data and thus actively discover harmful content. It records related information such as the sites on which harmful content is distributed, its channels of dissemination, reply rate, click rate and participants, which assists in analyzing the popularity, importance and development trend of harmful webpages and thus allows harmful content to be analyzed accurately. Virtual identities of suspects can be configured for focused monitoring, and their scope of activity, disseminated content, activity times and so on can be analyzed from the collected data. Analysis of data for characterizing statements can be configured. Event popularity can be located and analyzed quickly.
The invention also has the following functional characteristics:
1) Multi-thread acquisition: different strategies are customized for different types of websites, and acquisition supports multiple threads, enabling fast information collection;
2) Distributed acquisition: large-scale data acquisition is carried out by multiple crawler clusters and numerous crawler nodes;
3) Acquisition monitoring: the working state of crawler nodes, acquisition tasks, acquisition depth, logs, system operation reports and so on are monitored and managed;
4) Automatic webpage content extraction: many kinds of dynamic and static webpages can be acquired, for example HTM, HTML, SHTML, XML, PHP, ASP, JSP and JavaScript pages;
5) Automatic encoding identification and conversion: encodings such as GBK, GB2312, BIG5, UTF-8, UTF-16, BIGENDIAN and ISO8859-1 are identified automatically, and the system automatically converts pages to UTF;
6) Incremental update: a crawler node acquires only webpages that were newly generated or changed since the last update and does not re-acquire webpages already downloaded, which keeps information updates efficient; the user can also configure full acquisition as required;
7) Anti-crawler capture: for websites that deploy anti-crawler programs, corresponding strategies are configured so that their pages can still be captured;
8) Interval crawling: interval rules are generated automatically from webpage scores, website weights and similar factors, and webpages are crawled at the corresponding intervals;
9) User-defined crawl rules: the user can also set crawl rules manually.
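Feature 8 states that interval rules are derived from webpage scores and website weights but gives no formula. The sketch below shows one possible rule purely for illustration: the product of the two values, the linear interpolation, the default bounds and the name `crawl_interval_minutes` are all assumptions rather than anything disclosed in the patent.

```python
def crawl_interval_minutes(page_score: float, site_weight: float,
                           min_minutes: float = 10.0,
                           max_minutes: float = 1440.0) -> float:
    """Map a page score and a site weight, each in [0, 1], to a revisit interval.

    Hypothetical rule: the priority is the product of the two values, and the
    interval is interpolated linearly from max_minutes (priority 0) down to
    min_minutes (priority 1).
    """
    for value in (page_score, site_weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("score and weight must lie in [0, 1]")
    priority = page_score * site_weight
    return max_minutes - priority * (max_minutes - min_minutes)
```

Under this rule a highly scored page on a heavily weighted site is revisited every few minutes, while a low-priority page is revisited about once a day.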
Brief description of the drawings
Fig. 1 is a structural block diagram of the crawler system of the present invention;
Fig. 2 is a structural block diagram of a crawler node in the present invention;
Fig. 3 is a structural block diagram of the harmful information monitoring system in the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings, but the scope of protection of the invention is not limited to what follows.
An internet data center harmful information monitoring system comprises a crawler system and a harmful information monitoring system; the harmful information monitoring system obtains webpage data in the internet data center through the crawler system and analyzes it for harmful content.
(1) crawler system
As shown in Fig. 1, the crawler system is responsible for discovering, crawling and normalizing raw data from the internet. Depending on the internet applications involved, it comprises one or more crawler clusters. Each crawler cluster includes multiple crawler nodes and one crawler root node, which together form a distributed data acquisition network. The crawler root node controls and manages the crawler nodes in its cluster and communicates with the harmful information monitoring system; the crawler nodes collect harmful information from the network.
As shown in Fig. 2, each crawler node in the present invention is composed of the following modules:
1. Multi-thread webpage acquisition module: comprises multiple webpage acquisition channels and webpage parsing modules; each type of webpage is acquired through the acquisition channel and parsing module that match it. The webpage parsing modules include a DNS parsing module, an HTTP parsing module, an FTP parsing module, a GOPHER parsing module and so on.
This module implements multi-thread acquisition: different strategies can be customized for different types of websites, and acquisition supports multiple threads, enabling fast information collection.
2. Webpage library: stores the webpages acquired by the multi-thread webpage acquisition module.
3. Encoding identification and processing module: automatically identifies the encoding of a webpage and converts it. Encodings such as GBK, GB2312, BIG5, UTF-8, UTF-16, BIGENDIAN and ISO8859-1 are identified automatically, and the system automatically converts pages to UTF.
4. Automatic webpage content extraction module: comprises a dynamic webpage content extraction module and a static webpage content extraction module and, after encoding conversion, captures the URLs of pages that contain harmful information according to a sensitive-word lexicon. Many kinds of dynamic and static webpages can be acquired, for example HTM, HTML, SHTML, XML, PHP, ASP, JSP and JavaScript pages.
5. URL filter: filters out URLs that do not need to be downloaded.
6. URL deduplication module: judges whether a filtered URL is identical to a URL already held in the URL store; if it is, the URL receives no further processing. This implements incremental update: a crawler node acquires only webpages that were newly generated or changed since the last update and does not re-acquire webpages already downloaded, which keeps information updates efficient. The user can also configure full acquisition as required.
7. URL scheduling module: controls the multi-thread webpage acquisition module to download the corresponding webpages according to the deduplicated URL queue.
8. Webpage deduplication module: judges whether the content of a webpage is identical to content that has already been downloaded; if it is, the webpage receives no further processing and is deleted from the webpage library.
9. Fingerprint calculation module, fingerprint library and fingerprint deduplication module: the fingerprint calculation module computes a fingerprint from the content of a webpage using a webpage fingerprint algorithm. The fingerprint deduplication module compares the generated fingerprint with the fingerprints in the fingerprint library; if an identical or similar fingerprint exists, the webpage content is judged to have been downloaded already. The fingerprint library stores the fingerprint data, and the fingerprint libraries of all crawler nodes are updated synchronously.
10. Interval crawling module: automatically generates interval rules from webpage scores and website weights, and controls the automatic webpage content extraction module to crawl webpages at the corresponding intervals.
11. Crawl rule setting module: controls the automatic webpage content extraction module to perform the corresponding crawl actions on webpages according to the configured crawl rules.
12. Anti-crawler capture module: when a webpage is protected by an anti-crawler program, this module is started to force acquisition of the target webpage.
13. Acquisition monitoring module: forwards the working state, acquisition tasks, acquisition depth and log information of the crawler node to the crawler root node for aggregation, and receives control from the crawler root node.
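The encoding identification and processing module (item 3 above) is described only by its behavior. A minimal sketch of one way to approximate it is trial decoding: check for a byte-order mark, then try strict decoders in a fixed order. The candidate order, the choice of UTF-8 as the target (the text says only "UTF") and the name `to_utf8` are assumptions, and trial decoding can misidentify short or ambiguous byte sequences.

```python
import codecs
from typing import Tuple

# Candidate encodings tried in order. Strict UTF-8 goes first because it
# rejects most byte sequences produced by the other encodings.
CANDIDATES = ("utf-8", "gb2312", "gbk", "big5")


def to_utf8(raw: bytes) -> Tuple[bytes, str]:
    """Return the page re-encoded as UTF-8 plus the encoding that was guessed."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):], "utf-8"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the byte-order mark to pick the endianness.
        return raw.decode("utf-16").encode("utf-8"), "utf-16"
    for name in CANDIDATES:
        try:
            return raw.decode(name).encode("utf-8"), name
        except UnicodeDecodeError:
            continue
    # ISO 8859-1 maps every byte to a character, so it is the last resort.
    return raw.decode("iso-8859-1").encode("utf-8"), "iso-8859-1"
```

A production crawler would normally consult the HTTP `Content-Type` header and any `<meta charset>` declaration before falling back to guessing of this kind.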
The crawler node further comprises a tag counter and a tag count log file; the tag counter records the number of downloads in the webpage library and writes this data to the tag count log file.
The crawler system further comprises a full-text database, an index database and a sorting database, all of which are connected to the crawler nodes and the crawler root node.
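The patent names a "webpage fingerprint algorithm" for the fingerprint calculation module without specifying it. SimHash is one well-known technique that yields similar fingerprints for similar content, so it is used here as a stand-in. The 64-bit width, the Hamming-distance threshold of 3, the word tokenizer and the names `simhash`, `hamming` and `FingerprintStore` are all assumptions.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumption)


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint over the word tokens of a page."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        for i in range(BITS):
            weights[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class FingerprintStore:
    """Stands in for the fingerprint library plus the fingerprint deduplication module."""

    def __init__(self, max_distance: int = 3) -> None:
        self.max_distance = max_distance
        self.fingerprints = []

    def is_duplicate(self, text: str) -> bool:
        """Return True for identical or similar content; otherwise store it."""
        fp = simhash(text)
        if any(hamming(fp, old) <= self.max_distance for old in self.fingerprints):
            return True
        self.fingerprints.append(fp)
        return False
```

Because `\w+` treats an unbroken run of Chinese characters as a single token, Chinese pages would need to be segmented into words before fingerprinting. The synchronized update of fingerprint libraries across crawler nodes is not modeled here.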
(2) Harmful information monitoring system
As shown in Fig. 1, the harmful information monitoring system comprises a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit.
1. Harmful information search unit: comprises a local search port and a network search port. The local search port starts the search engine of the local crawler node and performs the harmful information search task locally. The network search port starts the search engines of multiple crawler nodes so that the nodes perform the harmful information search task simultaneously, and the search results are returned to the local crawler node through the same network search port.
The harmful information search unit further comprises one of, or a combination of, a keyword filter, a tag field filter, a metadata field filter and a time filter. Precise search is achieved through these filters and their combinations, for example by assigning search weights to keywords or by weighted combined search over multiple metadata fields.
Keyword filter: supports combinations of keywords in logical expressions, including AND, OR, NOT and so on.
Tag field filter: supports restricted searches that combine multiple tag fields with logical AND, OR and NOT.
Metadata field filter: multiple metadata fields can be defined, and search results are selected by parameter.
Time filter: supports ranking by date, relevance and combinations with other fields.
Tag field search indexes text under tag fields; the user can select specific tag combinations and obtain correspondingly restricted results.
The harmful information search unit performs network-wide searches based on hot words of sudden harmful online events, quickly retrieving the quantity of harmful information about the event, the sites on which it is distributed and its popularity.
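The keyword filter's AND, OR and NOT combinations can be pictured as a small expression tree evaluated against page text. The sketch below is illustrative only: the nested-tuple representation, the case-insensitive substring test and the name `matches` are assumptions, not the patent's query format.

```python
def matches(expr, text: str) -> bool:
    """Evaluate a keyword expression against page text.

    An expression is either a keyword string or a tuple whose first element
    is "AND", "OR" or "NOT" and whose remaining elements are sub-expressions.
    """
    if isinstance(expr, str):
        return expr.lower() in text.lower()
    op, *args = expr
    if op == "AND":
        return all(matches(arg, text) for arg in args)
    if op == "OR":
        return any(matches(arg, text) for arg in args)
    if op == "NOT":
        (arg,) = args  # NOT takes exactly one operand
        return not matches(arg, text)
    raise ValueError(f"unknown operator: {op!r}")


# Example query: (casino OR lottery) AND NOT news
QUERY = ("AND", ("OR", "casino", "lottery"), ("NOT", "news"))
```

With this query, `matches(QUERY, "Online casino bonus")` is true, while a page containing "Casino news report" is rejected by the NOT branch.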
2. Keyword processing unit: generates keyword search instructions. The harmful information search unit uses Boolean logical expressions and performs the harmful information search task according to these instructions.
The keyword search instruction comprises an ID number, category, event title, keyword options, excluded-keyword options, weight and start time. The excluded-keyword options ensure that a webpage containing any keyword listed in them is not matched as a harmful information webpage.
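The fields of the keyword search instruction map naturally onto a record type. The following is a minimal sketch assuming Python field names, types and defaults that the patent does not specify. It also assumes that matching any one keyword is enough for a hit, whereas the text only states the exclusion behavior.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class KeywordSearchInstruction:
    """Hypothetical record mirroring the fields listed in the description."""
    id: int
    category: str
    event_title: str
    keywords: List[str]
    excluded_keywords: List[str] = field(default_factory=list)
    weight: float = 1.0
    start_time: datetime = field(default_factory=datetime.now)

    def is_hit(self, page_text: str) -> bool:
        """True when the page has a keyword and none of the excluded keywords."""
        if any(word in page_text for word in self.excluded_keywords):
            return False  # exclusion takes precedence, as the text requires
        return any(word in page_text for word in self.keywords)
```

Checking the excluded keywords first reflects the stated rule that a page containing any excluded keyword is never matched as a harmful information webpage.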
3. Fuzzy matching unit: matches approximate vocabulary similar to the input search string. While the harmful information search unit searches for the search string, it also searches for the approximate vocabulary and returns the corresponding results.
The user can input a word, a passage or even an entire article; the system analyzes the concepts expressed in the user's search conditions and then finds the results the user cares about according to conceptual relevance. If the user does not know how to spell the content being queried, fuzzy search can be used: besides the corresponding search results, the system also returns other vocabulary close to the input string, allowing the user to find other related results.
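As a rough illustration of returning vocabulary close to the input string, the sketch below uses `difflib.get_close_matches` from the Python standard library. The patent does not say which similarity measure the fuzzy matching unit uses, so the ratio-based measure, the 0.75 cutoff and the name `approximate_terms` are assumptions.

```python
import difflib
from typing import Iterable, List


def approximate_terms(query: str, vocabulary: Iterable[str],
                      limit: int = 3, cutoff: float = 0.75) -> List[str]:
    """Return up to `limit` vocabulary entries that closely resemble the query."""
    candidates = [word for word in vocabulary if word != query]
    return difflib.get_close_matches(query, candidates, n=limit, cutoff=cutoff)


# A misspelled query still leads to the intended term.
print(approximate_terms("lotery", ["lottery", "battery", "weather"]))
# prints ['lottery']
```

The search unit would then run the original string and each returned term as separate searches and merge the results. Conceptual matching on whole passages, which the description also mentions, is beyond what this character-level measure can do.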
4. Automatic word segmentation unit: automatically extracts keywords from the input search string, so that the harmful information search unit can perform a precise search using the extracted keywords. Automatic word segmentation is the foundation of Chinese information processing and analysis. The unit is based on a dictionary and rules, also draws on language-model methods based on probability analysis, and can perform segmentation suited to the particular requirements of different applications.
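The dictionary-based part of word segmentation can be illustrated with forward maximum matching, a classic technique that takes the longest dictionary word at each position. This is a deliberately simplified stand-in: it omits the rules and probabilistic language model that the description mentions, and the dictionary contents, the `max_len` limit and the name `segment` are assumptions.

```python
from typing import List, Set


def segment(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first; a single character is always accepted,
        # so the loop is guaranteed to advance.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


DICTIONARY = {"有害", "信息", "有害信息", "监测", "系统"}
print(segment("有害信息监测系统", DICTIONARY))
# prints ['有害信息', '监测', '系统']
```

Greedy matching of this kind resolves ambiguities purely by length, which is one reason practical segmenters add statistical models on top of the dictionary.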
5. Automatic summary generation unit: dynamically generates a webpage summary of the target webpage according to the input search string and its approximate vocabulary. Different summaries of the same webpage are generated for different input search strings. From the summary the user can judge whether the webpage needs to be opened and examined, and the dynamic summaries also help the user understand the relationships among the webpages in the returned results.
The automatic summary generation unit also uses the keyword processing unit to analyze the keywords of a webpage and automatically extracts key fields to generate the webpage summary. When the user views the content of a specific webpage, the unit can also generate a summary of the article automatically; in this case the webpage does not need to be analyzed against the search string and its approximate vocabulary.
The automatic summary generation unit takes word frequency, part of speech and position information into account to extract and analyze keywords accurately and automatically, and generates the webpage summary from the keywords it has analyzed.
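A query-dependent summary of the kind described above can be approximated by scoring sentences and keeping the best ones. The sketch below combines term frequency with a position bonus and leaves out part of speech. The scoring formula, the sentence splitter and the name `summarize` are assumptions made for illustration.

```python
import re
from typing import Iterable


def summarize(text: str, terms: Iterable[str], max_sentences: int = 2) -> str:
    """Pick the sentences that best match the search terms, in original order."""
    terms = [term.lower() for term in terms if term]
    sentences = [s.strip()
                 for s in re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
                 if s.strip()]
    scored = []
    for index, sentence in enumerate(sentences):
        lowered = sentence.lower()
        frequency = sum(lowered.count(term) for term in terms)
        position_bonus = 1.0 / (index + 1)  # earlier sentences score higher
        scored.append((frequency + position_bonus, index))
    best = sorted(scored, reverse=True)[:max_sentences]
    return " ".join(sentences[i] for _, i in sorted(best, key=lambda item: item[1]))
```

Calling `summarize` on the same page with different terms yields different summaries, which mirrors the dynamic behavior described in the text. With an empty term list, only the position bonus applies and the leading sentences are returned.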
6. Result statistical analysis unit: analyzes and compiles statistics on the returned search results. The statistical analysis unit comprises a task public-opinion chart generation module, a report generation module, a task article statistics module, a task trend analysis module and a task overview analysis module.
The task public-opinion chart generation module generates a task public-opinion chart from the search conditions and search results, covering harmful information content statistics, matched keyword count statistics and webpage count statistics by category.
The report generation module generates reports from the search result information, including histograms, line charts, single-bar charts, double-bar charts, triple-bar charts, multi-charts and X-Y charts.
The task trend analysis module generates increment charts, including daily, weekly and monthly increment charts.
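The data behind a daily, weekly or monthly increment chart is a count of newly found pages per period. The sketch below assumes each search result carries a discovery date. The bucket label formats, the use of ISO week numbers and the name `increments` are assumptions.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable


def increments(hit_dates: Iterable[date], period: str = "day") -> Dict[str, int]:
    """Count newly found pages per day, ISO week or month, sorted by period."""
    def bucket(d: date) -> str:
        if period == "day":
            return d.isoformat()
        if period == "week":
            year, week, _weekday = d.isocalendar()
            return f"{year}-W{week:02d}"
        if period == "month":
            return f"{d.year}-{d.month:02d}"
        raise ValueError(f"unknown period: {period!r}")

    counts = Counter(bucket(d) for d in hit_dates)
    return dict(sorted(counts.items()))


hits = [date(2015, 6, 15), date(2015, 6, 15), date(2015, 6, 16), date(2015, 7, 1)]
print(increments(hits, "month"))
# prints {'2015-06': 3, '2015-07': 1}
```

The resulting mapping can be passed directly to a plotting library to draw the increment chart. Periods with no hits are simply absent, so a caller wanting a continuous axis would need to fill them with zeros.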
The task overview analysis module generates graphical task lists, website distribution charts and media distribution charts.
The search results include information on the sites on which harmful content is distributed, its channels of dissemination, reply rate, click rate and participants.
The statistical analysis unit provides users with powerful query functions. It analyzes and presents both real-time and historical data, and applies data mining to historical data, which includes historical data, inspection data, network data and monitoring node data. Query conditions can be set flexibly as required, and multiple statistical chart forms are provided, such as single-bar charts, double-bar charts, triple-bar charts, multi-charts and X-Y charts (plotted from coordinate points). The unit can also be combined with a scheduling service to generate reports in multiple output formats, such as Word, PDF and Excel, and send them to designated users. This enriches the decision analysis functions and makes it easier for users to query data, analyze trends and formulate adjustment plans. The system is also extensible and supports user-edited views.
The harmful information monitoring system of the present invention further comprises a firewall, through which the crawler system safely crawls webpage data in the internet data center.

Claims (9)

1. Internet data center's harmful information monitoring system, it comprises crawler system and harmful information monitoring system, harmful information monitoring system obtains the web data in Internet data center by crawler system, and Harmful analysis is carried out to it, it is characterized in that: described crawler system comprises one or more reptile cluster, and each reptile cluster includes multiple reptile node and a reptile root node, form a distributed data acquisition network, wherein, reptile root node is used for carrying out control and management to the reptile node in this reptile cluster, and intercom mutually with harmful information monitoring system, reptile node is used for the harmful information in collection network, described each reptile node forms by following multiple module:
Multithreading web retrieval module, comprises multiple web retrieval passage and web analysis module, for dissimilar webpage, is gathered it by the web retrieval passage that matches with it and web analysis module;
Web page library, stores the webpage that multithreading web retrieval module gathers;
Code identification processing module, automatically identifies the type of coding of webpage, and carries out code conversion process to it;
The automatic extraction module of web page contents, comprises dynamic web content extraction module and static web contents extraction module, there is the URL of harmful Intelligence Page according to responsive dictionary according to responsive dictionary after capturing code conversion process;
Url filtering device, filters the URL not needing to download;
URL duplicate removal module, whether consistent with the URL stored in URL storer for judging the URL after filtering, if consistent, no longer follow-up process is carried out to this URL;
URL scheduler module, according to the URL queue after duplicate removal, controls multithreading web retrieval module and downloads corresponding webpage;
The harmful information monitoring system comprises a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit;
The harmful information search unit comprises a local search port and a network search port; the local search port starts the search engine of the local crawler node so that the harmful information search task is performed locally; the network search port starts the search engines of multiple crawler nodes so that the harmful information search task is performed by those crawler nodes simultaneously, and the search results are returned to the local crawler node through the same network search port;
The harmful information search unit further comprises one or a combination of a keyword filter, a tag field filter, a metadata field filter and a time filter, and performs precise searches through these filters and their combinations;
The keyword processing unit generates keyword search instructions, and the harmful information search unit performs harmful information search tasks according to these instructions;
The fuzzy matching unit matches approximate terms similar to the input search string, so that while the harmful information search unit searches for the search string it also searches for the approximate terms and returns their search results;
The automatic word segmentation unit automatically extracts keywords from the input search string, so that the harmful information search unit performs a precise search based on the automatically extracted keywords.
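The URL filter, URL de-duplication module and URL scheduling module of claim 1 can be illustrated with a minimal Python sketch. The claim does not specify any filter rule or data structure, so the suffix list, the `should_download` function and the `UrlScheduler` class below are hypothetical names chosen only for illustration.

```python
from collections import deque
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical filter rule: skip non-HTTP schemes and common static assets.
SKIPPED_SUFFIXES = (".jpg", ".png", ".gif", ".css", ".js", ".zip")


def should_download(url: str) -> bool:
    """URL filter: reject URLs that do not need to be downloaded."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    return not parts.path.lower().endswith(SKIPPED_SUFFIXES)


class UrlScheduler:
    """Queues filtered, de-duplicated URLs for the page collection workers."""

    def __init__(self) -> None:
        self.seen = set()     # stands in for the URL store used for de-duplication
        self.queue = deque()  # URL queue after de-duplication

    def offer(self, url: str) -> bool:
        """Return True if the URL passed the filter and was not seen before."""
        if not should_download(url) or url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self) -> Optional[str]:
        """Hand the next URL to a collection worker, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

In this sketch a multithreaded collection module would call `next_url()` from each worker thread; a real deployment would need a thread-safe queue and a URL store shared across crawler nodes, neither of which is shown here.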
2. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the crawler node further comprises a webpage de-duplication module, which determines whether the content of a webpage matches that of a webpage already downloaded and, if so, performs no further processing on the webpage and deletes it from the webpage library.
3. The Internet data center harmful information monitoring system according to claim 2, characterized in that: the webpage de-duplication module comprises a fingerprint calculation module, a fingerprint library and a fingerprint de-duplication module; the fingerprint calculation module computes a fingerprint from the content of a webpage using a webpage fingerprint algorithm; the fingerprint de-duplication module compares the generated fingerprint with the fingerprints in the fingerprint library and, if an identical or similar fingerprint exists, determines that the webpage content has already been downloaded; the fingerprint library stores fingerprint data, and the fingerprint libraries of the crawler nodes are updated synchronously.
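The claim does not name a particular webpage fingerprint algorithm. As one possible reading, the following Python sketch uses a SimHash-style 64-bit fingerprint and treats two fingerprints as similar when their Hamming distance is small. The choice of SimHash, the SHA-256 token hash, the threshold of 3 bits and all names are assumptions made for illustration, and synchronization between crawler nodes is not shown.

```python
import hashlib
import re

FINGERPRINT_BITS = 64


def fingerprint(text: str) -> int:
    """Compute a 64-bit SimHash-style fingerprint of the page text."""
    weights = [0] * FINGERPRINT_BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(FINGERPRINT_BITS):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << bit for bit, w in enumerate(weights) if w > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class FingerprintStore:
    """Stands in for the fingerprint library plus the de-duplication check."""

    def __init__(self, max_distance: int = 3) -> None:
        self.max_distance = max_distance  # assumed similarity threshold
        self.fingerprints = []

    def is_duplicate(self, text: str) -> bool:
        """Return True if an identical or similar fingerprint exists; otherwise record it."""
        fp = fingerprint(text)
        for known in self.fingerprints:
            if hamming(fp, known) <= self.max_distance:
                return True
        self.fingerprints.append(fp)
        return False
```

The linear scan over stored fingerprints keeps the sketch short; a large fingerprint library would need an index to find near matches efficiently.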
4. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the crawler node further comprises an interval crawling module, which automatically generates an interval rule from the webpage score and the website weight and controls the automatic webpage content extraction module to crawl the webpage at the corresponding interval;
The crawler node further comprises a crawling rule setting module, which, according to the configured crawling rules, controls the automatic webpage content extraction module to perform the corresponding crawling actions on webpages;
The crawler node further comprises an anti-crawler capture module; when a webpage is protected by an anti-crawler program, the anti-crawler capture module is started to force collection of the target webpage;
The crawler node further comprises a collection monitoring module, which forwards the working state, collection tasks, collection depth and log information of the crawler node to the crawler root node for aggregation and receives control from the crawler root node.
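The claim states that the interval rule is derived from the webpage score and the website weight but gives no formula. The Python sketch below shows one hypothetical mapping, in which both inputs are assumed to lie in [0, 1] and a higher combined priority yields a shorter revisit interval. The function name, the linear formula and the default bounds are all assumptions.

```python
def crawl_interval_seconds(page_score: float, site_weight: float,
                           shortest: float = 300.0,
                           longest: float = 86400.0) -> float:
    """Map a page score and site weight (each assumed in [0, 1]) to a revisit interval."""
    # Combine the two signals into one priority and clamp it to [0, 1].
    priority = min(max(page_score * site_weight, 0.0), 1.0)
    # Priority 1 gives the shortest interval, priority 0 the longest.
    return shortest + (longest - shortest) * (1.0 - priority)
```

With these assumed defaults, a page with the highest score on the highest-weight site would be revisited every five minutes, and a page with zero priority once a day.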
5. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the keyword search instruction comprises an ID number, a category, an event title, a keyword option, an excluded keyword option, a weight and a start time; the excluded keyword option ensures that a webpage containing any keyword listed in the excluded keyword option is not matched as a harmful information webpage.
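The fields of the keyword search instruction and the effect of the excluded keyword option can be sketched in Python as follows. The field names are English renderings of the items listed in the claim; the class name, the field types and the plain substring matching in `matches` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class KeywordSearchInstruction:
    id: int
    category: str
    event_title: str
    keywords: List[str] = field(default_factory=list)
    excluded_keywords: List[str] = field(default_factory=list)
    weight: float = 1.0
    start_time: Optional[datetime] = None

    def matches(self, page_text: str) -> bool:
        """Return True if the page should be reported as a harmful information page."""
        # Any excluded keyword vetoes the match, regardless of other keywords.
        if any(word in page_text for word in self.excluded_keywords):
            return False
        return any(word in page_text for word in self.keywords)
```

For example, an instruction with keyword "casino" and excluded keyword "news report" would match a page that mentions a casino, but not a page that contains both phrases.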
6. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the harmful information monitoring system further comprises an automatic summary generation unit, which dynamically generates a webpage summary of the target webpage according to the input search string and its approximate terms;
The automatic summary generation unit also performs keyword analysis on the webpage through the keyword processing unit and automatically extracts key fields to generate the webpage summary.
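As a rough illustration of generating a summary from a search string and its approximate terms, the Python sketch below selects the sentences of a page that contain either the search word or a close match found with the standard `difflib` module. It assumes a single-word query and whitespace-delimited text; Chinese text would first need the automatic word segmentation unit. The function names, the similarity cutoff of 0.8 and the sentence-selection strategy are assumptions, not the method of the claim.

```python
import difflib
import re
from typing import List


def approximate_terms(query: str, vocabulary: List[str], cutoff: float = 0.8) -> List[str]:
    """Return vocabulary words that closely resemble the query word."""
    return difflib.get_close_matches(query, vocabulary, n=5, cutoff=cutoff)


def summarize(page_text: str, query: str, max_sentences: int = 2) -> str:
    """Build a summary from sentences containing the query or an approximate term."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", page_text) if s.strip()]
    vocabulary = sorted(set(re.findall(r"\w+", page_text.lower())))
    terms = set(approximate_terms(query.lower(), vocabulary)) | {query.lower()}
    # Keep the sentences that share at least one word with the term set.
    hits = [s for s in sentences if terms & set(re.findall(r"\w+", s.lower()))]
    return " ".join(hits[:max_sentences])
```

Because the approximate terms are drawn from the page's own vocabulary, a misspelled query can still surface the relevant sentence.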
7. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the harmful information monitoring system further comprises a result statistical analysis unit, which analyzes and compiles statistics on the returned search results, and which comprises a task public opinion chart generation module, a report generation module, a task article statistics module, a task trend analysis module and a task distribution analysis module.
8. The Internet data center harmful information monitoring system according to claim 7, characterized in that: the task public opinion chart generation module generates a task public opinion chart according to the search conditions and search results, including harmful information content statistics, matched keyword count statistics and webpage count statistics by category;
The report generation module generates reports from the search result information;
The task trend analysis module generates an increment chart;
The task distribution analysis module generates a task list, a website distribution chart and a media distribution chart.
9. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the harmful information monitoring system further comprises a firewall, and the crawler system crawls the webpage data in the Internet data center securely through the firewall.
CN201510343226.6A 2015-06-19 2015-06-19 Internet data center's harmful information monitoring system Expired - Fee Related CN104951539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510343226.6A CN104951539B (en) 2015-06-19 2015-06-19 Internet data center's harmful information monitoring system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510343226.6A CN104951539B (en) 2015-06-19 2015-06-19 Internet data center's harmful information monitoring system

Publications (2)

Publication Number Publication Date
CN104951539A true CN104951539A (en) 2015-09-30
CN104951539B CN104951539B (en) 2017-12-22

Family

ID=54166197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510343226.6A Expired - Fee Related CN104951539B (en) 2015-06-19 2015-06-19 Internet data center's harmful information monitoring system

Country Status (1)

Country Link
CN (1) CN104951539B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231661A (en) * 2008-02-19 2008-07-30 上海估家网络科技有限公司 Method and system for digging object grade knowledge
US7743045B2 (en) * 2005-08-10 2010-06-22 Google Inc. Detecting spam related and biased contexts for programmable search engines
CN102841898A (en) * 2011-06-23 2012-12-26 张家港凯纳信息技术有限公司 Network information monitoring and analyzing system
CN103023714A (en) * 2012-11-21 2013-04-03 上海交通大学 Activeness and cluster structure analyzing system and method based on network topics
CN103310026A (en) * 2013-07-08 2013-09-18 焦点科技股份有限公司 Lightweight common webpage topic crawler method based on search engine
CN103902667A (en) * 2014-03-14 2014-07-02 浪潮电子信息产业股份有限公司 Simple network information collector achieving method based on meta-search
US8782037B1 (en) * 2010-06-20 2014-07-15 Remeztech Ltd. System and method for mark-up language document rank analysis

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447081A (en) * 2015-11-04 2016-03-30 国云科技股份有限公司 Cloud platform-oriented government affair and public opinion monitoring method
CN105743901B (en) * 2016-03-07 2019-04-09 携程计算机技术(上海)有限公司 Server, anti-crawler system and anti-crawler verification method
CN105743901A (en) * 2016-03-07 2016-07-06 携程计算机技术(上海)有限公司 Server, anti-crawler system and anti-crawler verification method
WO2017177872A1 (en) * 2016-04-11 2017-10-19 中兴通讯股份有限公司 Data collection method and apparatus, and storage medium
CN105974811A (en) * 2016-07-05 2016-09-28 无锡市华东电力设备有限公司 Smart home control method and system
CN106302797A (en) * 2016-08-31 2017-01-04 北京锐安科技有限公司 A kind of cookie accesses De-weight method and device
US11516255B2 (en) 2016-09-16 2022-11-29 Oracle International Corporation Dynamic policy injection and access visualization for threat detection
CN109792439B (en) * 2016-09-16 2021-08-27 甲骨文国际公司 Dynamic policy injection and access visualization for threat detection
CN109792439A (en) * 2016-09-16 2019-05-21 甲骨文国际公司 Dynamic strategy injection and access visualization for threat detection
US11265329B2 (en) 2017-03-31 2022-03-01 Oracle International Corporation Mechanisms for anomaly detection and access management
CN109886764A (en) * 2017-12-06 2019-06-14 航天信息股份有限公司 A kind of commodity De-weight method and system based on material combinations
CN108304481A (en) * 2017-12-29 2018-07-20 成都三零凯天通信实业有限公司 A kind of visible image content supervision method towards multichannel internet new media data
CN110020256A (en) * 2017-12-30 2019-07-16 惠州学院 The method and system of the harmful video of identification based on User ID and trailer content
CN108536788A (en) * 2018-03-29 2018-09-14 合肥俊刚机械科技有限公司 A kind of data capture method and its system based on distributed reptile
CN108550380A (en) * 2018-04-12 2018-09-18 北京深度智耀科技有限公司 A kind of drug safety information monitoring method and device based on public network
CN109145233A (en) * 2018-08-27 2019-01-04 山东浪潮商用系统有限公司 internet information acquisition system
CN109286613A (en) * 2018-08-28 2019-01-29 刘琦 Control system is led in a kind of monitoring of network public-opinion
CN109783619A (en) * 2018-12-14 2019-05-21 广东创我科技发展有限公司 A kind of data filtering method for digging
CN110399554A (en) * 2019-07-12 2019-11-01 苏州浪潮智能科技有限公司 A kind of detection method, device and the storage system of web site contents specific information
CN110543595A (en) * 2019-08-12 2019-12-06 南京莱斯信息技术股份有限公司 in-station search system and method
CN111191098A (en) * 2019-12-25 2020-05-22 山石网科通信技术股份有限公司 Data filtering method and device
CN111191098B (en) * 2019-12-25 2022-10-18 山石网科通信技术股份有限公司 Data filtering method and device
CN112905888A (en) * 2020-09-10 2021-06-04 中数通信息有限公司 Keyword discovery method and system based on information monitoring and electronic equipment
CN112148956A (en) * 2020-09-30 2020-12-29 上海交通大学 Hidden net threat information mining system and method based on machine learning
CN112632355A (en) * 2020-11-26 2021-04-09 武汉虹旭信息技术有限责任公司 Fragment content processing method and device for harmful information
CN114238962A (en) * 2021-09-29 2022-03-25 睿贸恒诚(山东)科技发展有限责任公司 Harmful information filtering system and method based on mobile internet

Also Published As

Publication number Publication date
CN104951539B (en) 2017-12-22

Similar Documents

Publication Publication Date Title
CN104951539A (en) Internet data center harmful information monitoring system
CN104899324B (en) One kind monitoring systematic sample training system based on IDC harmful informations
Hasan et al. TwitterNews+: a framework for real time event detection from the Twitter data stream
CN104899323A (en) Crawler system used for IDC harmful information monitoring platform
CN108416034B (en) Information acquisition system based on financial heterogeneous big data and control method thereof
CN105718587A (en) Network content resource evaluation method and evaluation system
CN109815382B (en) Method and system for sensing and acquiring large-scale network data
Das et al. A CV parser model using entity extraction process and big data tools
Jha et al. A review on the study and analysis of big data using data mining techniques
Wu et al. Extracting topics based on Word2Vec and improved Jaccard similarity coefficient
CN104598536B (en) A kind of distributed network information structuring processing method
Dueñas-Fernández et al. Detecting trends on the web: A multidisciplinary approach
Demirbaga HTwitt: a hadoop-based platform for analysis and visualization of streaming Twitter data
CN112328806A (en) Data processing method, system, computer equipment and storage medium
CN104965894A (en) Data analysis system for IDC hazardous information monitoring platform
Viet et al. Analyzing recent research trends of computer science from academic open-access digital library
Pandya et al. Mated: metadata-assisted twitter event detection system
He et al. Research on the dynamic monitoring system model of university network public opinion under the big data environment
Wu et al. Sub-event discovery and retrieval during natural hazards on social media data
KR101880474B1 (en) Keyword-based service provide method for high value added content information service and method and recording medium storing program for executing the same and recording medium storing program for executing the same
KR102413961B1 (en) Method for providing news analysis service using robotic process automation monitoring
Jung Discovering social bursts by using link analytics on large-scale social networks
Xu et al. Research on Tibetan hot words, sensitive words tracking and public opinion classification
Zhang et al. Research on keyword extraction and sentiment orientation analysis of educational texts
Qureshi et al. Detecting social polarization and radicalization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171222

Termination date: 20180619
