CN104252530B - Single-machine crawler fetching method and system - Google Patents

Single-machine crawler fetching method and system

Info

Publication number
CN104252530B
Authority
CN
China
Prior art keywords
url
current
web data
data
crawler capturing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410458191.6A
Other languages
Chinese (zh)
Other versions
CN104252530A (en
Inventor
廖耀华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201410458191.6A
Publication of CN104252530A
Application granted
Publication of CN104252530B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The invention discloses a single-machine crawler fetching method and system. The method includes: obtaining at least one seed comprising a URL, a site ID and a type, and taking the URL of the seed as the current URL, the site ID of the seed as the current site ID, and the type of the seed as the current type; obtaining at least one strategy and determining at least one crawl parameter according to the strategy; obtaining the rule corresponding to the current type; and fetching web page data from the current URL according to the crawl parameter and parsing the web page data according to the rule to obtain parsed data. Because the crawl parameters are determined by strategies, problems that arise during crawling can be overcome promptly, which improves efficiency, extends the crawl time, and accommodates multiple types of websites.

Description

Single-machine crawler fetching method and system
Technical field
The present invention relates to web crawler technology, and in particular to a single-machine crawler fetching method and system.
Background
The Internet holds a massive amount of data and information. Converting that data and information into what one actually wants, and then analyzing and processing it, is a difficult problem. Web crawlers emerged to address it.
Most current crawlers merely implement the basic function of crawling web pages. They have no good solution for repeated crawling, for falling into infinite-loop traps, or for countering anti-crawling measures (extending the crawl time). In addition, current single-machine crawlers have poor compatibility and cannot satisfy the need to crawl several kinds of websites at the same time.
Summary of the invention
In view of this, and to address the technical problems that existing single-machine web crawlers have low efficiency, a short crawl time, and an inability to crawl multiple types of websites at the same time, it is necessary to provide a single-machine crawler fetching method and system.
A single-machine crawler fetching method, including:
obtaining at least one seed comprising a URL, a site ID and a type, taking the URL of the seed as the current URL, the site ID of the seed as the current site ID, and the type of the seed as the current type;
obtaining at least one strategy, and determining at least one crawl parameter according to the strategy;
obtaining a rule corresponding to the current type according to the current type;
fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data.
A single-machine crawler fetching system, including:
a seed receiving module, for obtaining at least one seed comprising a URL, a site ID and a type, taking the URL of the seed as the current URL, the site ID of the seed as the current site ID, and the type of the seed as the current type;
a strategy module, for obtaining at least one strategy and determining at least one crawl parameter according to the strategy;
a rule module, for obtaining a rule corresponding to the current type according to the current type;
a parsing module, for fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data.
The present invention determines the crawl parameters by means of strategies, so that problems arising during crawling can be overcome promptly, thereby improving efficiency, extending the crawl time, and accommodating multiple types of websites.
Brief description of the drawings
Fig. 1 is a flowchart of a single-machine crawler fetching method of the present invention;
Fig. 2 is a structural module diagram of a single-machine crawler fetching system of the present invention;
Fig. 3 is a structural module diagram of a preferred embodiment of a single-machine crawler fetching system of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 shows a flowchart of a single-machine crawler fetching method of the present invention, including:
Step 11: obtain at least one seed comprising a URL, a site ID and a type; take the URL of the seed as the current URL, the site ID of the seed as the current site ID, and the type of the seed as the current type;
Step 12: obtain at least one strategy, and determine at least one crawl parameter according to the strategy;
Step 13: obtain a rule corresponding to the current type according to the current type;
Step 14: fetch web page data from the current URL according to the crawl parameter, and parse the web page data according to the rule to obtain parsed data.
The strategies in step 12 determine the crawl parameters: different strategies yield different crawl parameters, and step 14 fetches web page data using the crawl parameters determined in step 12. Because the crawl parameters are determined by the strategies of step 12, different crawl requirements can be met by configuring different strategies, which improves efficiency, extends the crawl time, and accommodates multiple types of websites.
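As an illustration only, steps 11 to 14 might be sketched in Python as follows. The names `Seed`, `crawl_one`, the `fetch` callback and the dictionary representation of crawl parameters are assumptions made for this sketch and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Seed:
    url: str        # page to fetch
    site_id: int    # site ID of the seed
    seed_type: str  # e.g. "home", "list", "detail" (labels are assumptions)


# A strategy inspects the seed and contributes crawl parameters.
Strategy = Callable[[Seed], Dict[str, object]]
# A rule turns fetched page text into parsed data.
Rule = Callable[[str], Dict[str, object]]


def crawl_one(seed: Seed,
              strategies: List[Strategy],
              rules: Dict[str, Rule],
              fetch: Callable[[str, Dict[str, object]], str]) -> Dict[str, object]:
    # Step 11: take the current URL and current type from the seed.
    current_url, current_type = seed.url, seed.seed_type
    # Step 12: let every strategy determine its crawl parameters.
    params: Dict[str, object] = {}
    for strategy in strategies:
        params.update(strategy(seed))
    # Step 13: look up the rule that corresponds to the current type.
    rule = rules[current_type]
    # Step 14: fetch with the crawl parameters, then parse with the rule.
    page = fetch(current_url, params)
    return rule(page)
```

Passing `fetch` in as a callback keeps the sketch independent of any particular HTTP library; a real crawler would plug its own request function in at that point.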
In one embodiment, in step 14, if an exception occurs while fetching the web page data or while parsing the web page data, the exception is saved.
By monitoring the exceptions that occur while fetching or parsing the web page data, exceptions can be fed back to the user promptly, preventing wasted resources.
In one embodiment, the strategies include: a seed infinite-loop handling strategy, a browser identifier switching strategy, a dynamic cookie update strategy and/or a proxy IP switching strategy.
Among the strategies of this embodiment, the seed infinite-loop handling strategy prevents repeated crawling and infinite-loop traps, while the browser identifier switching strategy, the dynamic cookie update strategy and/or the proxy IP switching strategy extend the crawl time.
In one embodiment:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether fetching web page data from the current URL is allowed or refused; if the exception is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse fetching web page data from the current URL; otherwise the crawl parameter is set to allow fetching web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when fetching web page data from the current URL; if the exception is a failure to fetch web page data from the current URL, or the preset timer time is reached, the browser identifier used when fetching web page data from the current URL is changed to another browser identifier; otherwise the browser identifier used when fetching web page data from the current URL is not changed;
The dynamic cookie update strategy is specifically: the crawl parameter is whether updating the cookie used when fetching web page data from the current URL is allowed or refused; if the preset timer time is reached, the crawl parameter is set to allow updating the cookie used when fetching web page data from the current URL; otherwise the crawl parameter is set to refuse updating that cookie;
The proxy IP switching strategy is specifically: the crawl parameter is whether updating the proxy IP used when fetching web page data from the current URL is allowed or refused; if the exception is a failure to fetch web page data from the current URL, or the preset timer time is reached, the crawl parameter is set to allow updating the proxy IP used when fetching web page data from the current URL; otherwise the crawl parameter is set to refuse updating that proxy IP.
This embodiment further details the seed infinite-loop handling strategy, the browser identifier switching strategy, the dynamic cookie update strategy and the proxy IP switching strategy. The seed infinite-loop handling strategy, the browser identifier switching strategy and the proxy IP switching strategy adjust the crawl parameters according to exceptions, whereas the dynamic cookie update strategy adjusts the crawl parameters by timed updates.
Specifically, the seed infinite-loop handling strategy mainly addresses infinite-loop traps on websites. After the crawler fetches web page data from a URL, it extracts new URLs from that data and then fetches new web page data from those new URLs. Some websites, however, set infinite-loop traps: the new URL extracted from the web page data is a URL that already exists, so the crawler falls into an infinite loop and crawling is disrupted. When the seed infinite-loop handling strategy detects the exception that the current URL has fallen into an infinite loop, it sets the crawl parameter to refuse fetching web page data from the current URL, thereby avoiding the loop.
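A minimal sketch of such a strategy is shown below. The patent does not say how the loop exception is detected; treating a revisit of an already-fetched URL as that exception, and the names `LoopGuard` and `allow_fetch`, are assumptions for illustration.

```python
from typing import Dict, Set


class LoopGuard:
    """Remembers fetched URLs and refuses a seed that loops back to one."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def crawl_parameter(self, current_url: str) -> Dict[str, bool]:
        # Assumption: seeing a known URL again is the "infinite loop" exception.
        looping = current_url in self._seen
        self._seen.add(current_url)
        # The crawl parameter is simply allow or refuse for this URL.
        return {"allow_fetch": not looping}
```

A production crawler would probably normalize URLs (for example, strip fragments and sort query parameters) before the membership check, since traps often vary the URL text while serving the same page.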
Specifically, the browser identifier switching strategy imitates user behavior as closely as possible. Different users use different browsers, so imitating user behavior requires changing the browser type or version. The browser type and version are indicated by a browser identifier (for example, user-agent). When crawling, the crawler simulates a virtual browser that is distinguished by its user-agent, and the user-agent value is determined by the browser's type and version number, so changing the user-agent value is equivalent to switching browsers. Therefore, when the exception of failing to fetch web page data from the current URL is detected, or the preset timer time is reached, the browser identifier is changed to extend the crawler's crawl time.
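The switching condition described here (fetch failure, or the preset timer time being reached) could be sketched as follows. The class name, the round-robin order and the explicit `now` argument are assumptions; the patent only requires that the identifier change to another one.

```python
import itertools
from typing import Iterable


class UserAgentSwitcher:
    """Rotates the user-agent on fetch failure or when the timer elapses."""

    def __init__(self, user_agents: Iterable[str], interval_s: float) -> None:
        self._cycle = itertools.cycle(list(user_agents))
        self._interval_s = interval_s
        self._last_switch = 0.0
        self.current = next(self._cycle)

    def crawl_parameter(self, now: float, fetch_failed: bool) -> str:
        timer_due = now - self._last_switch >= self._interval_s
        if fetch_failed or timer_due:
            # Switch to the next identifier and restart the timer.
            self.current = next(self._cycle)
            self._last_switch = now
        # Otherwise the identifier is left unchanged.
        return self.current
```

Taking `now` as an argument (for example from `time.monotonic()`) rather than reading the clock inside the method makes the timer behavior easy to test.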
Specifically, the dynamic cookie update strategy is implemented mainly by timed updates: when the preset timer time is reached, the cookie is updated. Updating the cookie is equivalent to establishing a new session with the website being crawled, which extends the crawl time.
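A timed allow/refuse parameter of this kind might look like the sketch below. The names `CookieRefreshStrategy`, `allow_cookie_update` and `apply_cookie_parameter` are hypothetical, and how fresh cookies are obtained is left out because the patent does not describe it.

```python
from typing import Dict


class CookieRefreshStrategy:
    """Allows a cookie update only when the preset interval has elapsed."""

    def __init__(self, interval_s: float) -> None:
        self._interval_s = interval_s
        self._last_refresh = 0.0

    def crawl_parameter(self, now: float) -> Dict[str, bool]:
        due = now - self._last_refresh >= self._interval_s
        if due:
            self._last_refresh = now  # restart the timer
        return {"allow_cookie_update": due}


def apply_cookie_parameter(cookies: Dict[str, str],
                           fresh: Dict[str, str],
                           params: Dict[str, bool]) -> Dict[str, str]:
    # Swap in the fresh cookies only when the strategy allows the update.
    return dict(fresh) if params["allow_cookie_update"] else cookies
```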
Specifically, the proxy IP switching strategy mainly addresses the situation where a website blocks an IP that has been fetching web page data for a long time (an IP is a network address, for example an IPv4 or IPv6 address; IPv4 addresses of the form XXX.XXX.XXX.XXX are the most commonly used). A single machine generally has only one IP, so a single-machine crawler fetches through proxy IPs. When the proxy IP switching strategy detects the exception of failing to fetch web page data from the current URL, or the preset timer time is reached, it changes the proxy IP to avoid being blocked.
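One possible shape for this strategy is sketched below. The proxy list, the round-robin choice and the returned dictionary keys are assumptions; the addresses used in the usage note come from the 192.0.2.0/24 documentation range and are placeholders only.

```python
from typing import Dict, List


class ProxySwitchStrategy:
    """Allows a proxy change on fetch failure or when the timer elapses."""

    def __init__(self, proxies: List[str], interval_s: float) -> None:
        self._proxies = proxies
        self._index = 0
        self._interval_s = interval_s
        self._last_switch = 0.0

    def crawl_parameter(self, now: float, fetch_failed: bool) -> Dict[str, object]:
        allow = fetch_failed or (now - self._last_switch >= self._interval_s)
        if allow:
            # Move to the next proxy in the pool and restart the timer.
            self._index = (self._index + 1) % len(self._proxies)
            self._last_switch = now
        return {"allow_proxy_update": allow, "proxy": self._proxies[self._index]}
```

For example, with the pool `["192.0.2.1:8080", "192.0.2.2:8080"]` and a 60-second interval, a failed fetch at time 2 switches to the second proxy, and the first proxy is selected again once 60 seconds have passed since that switch.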
In one embodiment, in step 14, parsing the web page data according to the rule to obtain parsed data specifically includes:
If the current type is home page URL, the web page data is parsed according to the rule corresponding to home page URLs to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed of type pagination URL;
If the current type is pagination URL, the web page data is parsed according to the rule corresponding to pagination URLs to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed of type detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to home page URLs to obtain parsed data, and the web page content in the parsed data is saved.
This embodiment classifies URLs so that different types of URL can be parsed with different rules, yielding more accurate parsing results.
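The type progression described in this embodiment (home page, then pagination page, then detail page) can be illustrated with the following sketch. The type labels, the `NEXT_TYPE` table and the use of a single `href` regular expression in place of real per-type rules are all assumptions made to keep the example short.

```python
import re
from typing import List, Tuple

# Which seed type the URLs found on a page of a given type are saved as.
NEXT_TYPE = {"home": "list", "list": "detail"}

SeedTuple = Tuple[str, int, str]  # (url, site_id, seed_type)


def parse_page(current_type: str, site_id: int,
               page: str) -> Tuple[List[SeedTuple], List[str]]:
    new_seeds: List[SeedTuple] = []
    contents: List[str] = []
    if current_type in NEXT_TYPE:
        # Home and pagination pages yield URLs, saved as seeds of the next type.
        for url in re.findall(r'href="([^"]+)"', page):
            new_seeds.append((url, site_id, NEXT_TYPE[current_type]))
    else:
        # Detail pages yield content, which is saved directly.
        contents.append(page)
    return new_seeds, contents
```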
Fig. 2 shows a structural module diagram of a single-machine crawler fetching system of the present invention, including:
a seed receiving module 201, for obtaining at least one seed comprising a URL, a site ID and a type, taking the URL of the seed as the current URL, the site ID of the seed as the current site ID, and the type of the seed as the current type;
a strategy module 202, for obtaining at least one strategy and determining at least one crawl parameter according to the strategy;
a rule module 203, for obtaining a rule corresponding to the current type according to the current type;
a parsing module 204, for fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data.
In one embodiment, in the parsing module 204, if an exception occurs while fetching the web page data or while parsing the web page data, the exception is saved.
In one embodiment, the strategies include: a seed infinite-loop handling strategy, a browser identifier switching strategy, a dynamic cookie update strategy and/or a proxy IP switching strategy.
In one embodiment:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether fetching web page data from the current URL is allowed or refused; if the exception is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse fetching web page data from the current URL; otherwise the crawl parameter is set to allow fetching web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when fetching web page data from the current URL; if the exception is a failure to fetch web page data from the current URL, or the preset timer time is reached, the browser identifier used when fetching web page data from the current URL is changed to another browser identifier; otherwise the browser identifier used when fetching web page data from the current URL is not changed;
The dynamic cookie update strategy is specifically: the crawl parameter is whether updating the cookie used when fetching web page data from the current URL is allowed or refused; if the preset timer time is reached, the crawl parameter is set to allow updating the cookie used when fetching web page data from the current URL; otherwise the crawl parameter is set to refuse updating that cookie;
The proxy IP switching strategy is specifically: the crawl parameter is whether updating the proxy IP used when fetching web page data from the current URL is allowed or refused; if the exception is a failure to fetch web page data from the current URL, or the preset timer time is reached, the crawl parameter is set to allow updating the proxy IP used when fetching web page data from the current URL; otherwise the crawl parameter is set to refuse updating that proxy IP.
In one embodiment, in the parsing module 204, parsing the web page data according to the rule to obtain parsed data specifically includes:
If the current type is home page URL, the web page data is parsed according to the rule corresponding to home page URLs to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed of type pagination URL;
If the current type is pagination URL, the web page data is parsed according to the rule corresponding to pagination URLs to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed of type detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to home page URLs to obtain parsed data, and the web page content in the parsed data is saved.
Fig. 3 shows a structural module diagram of a preferred embodiment of a single-machine crawler fetching system of the present invention, including a seed generation module 310, a fetching module 320 and a data storage module 330.
The main function of the seed generation module 310 is to supply seeds to the fetching module. A seed can be a website URL or the SKU of a product. Seeds can be stored in text files or in a database, and the fetching module can obtain seeds in batches from the text files or database.
Each seed must carry a virtual site ID and a type. The virtual site ID tells the fetching module which rule file to call to parse the document, and the type field states what type the seed is: detail page URL, pagination URL or home page URL.
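As a sketch of batch-loading seeds from text, the function below assumes a tab-separated line format (`url<TAB>site_id<TAB>type`) and the type labels `home`, `list` and `detail`. The patent does not define a file format, so both are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Seed(NamedTuple):
    url: str
    site_id: int    # virtual site ID, selects the rule file
    seed_type: str  # "home", "list" or "detail"


VALID_TYPES = {"home", "list", "detail"}


def load_seeds(lines: Iterable[str]) -> List[Seed]:
    seeds: List[Seed] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        url, site_id, seed_type = line.split("\t")
        if seed_type not in VALID_TYPES:
            raise ValueError(f"unknown seed type: {seed_type!r}")
        seeds.append(Seed(url, int(site_id), seed_type))
    return seeds
```

Because `load_seeds` accepts any iterable of lines, it works equally with an open file object or with rows read from a database cursor and joined into the same format.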
The fetching module 320 is the core of the whole single-machine crawler. It consists of a rule file management submodule 321, a document parsing submodule 322, a strategy management submodule 323 and an exception reporting submodule 324. The rule file management submodule 321 manages the document parsing rules of the various websites and supplies parsing rules to the document parsing submodule 322. The document parsing submodule 322 obtains each website's rules from the rule file management submodule 321 and uses them to parse documents and extract the information the user is interested in. The strategy management submodule 323 is the optimization submodule of the fetching module; it can consist of a series of strategy management chains and, by analyzing the fetching module's crawl flow, can prevent repeated crawling and infinite-loop traps, extend the crawl time, and so on. The exception reporting submodule 324 reports the problems the fetching module 320 encounters during crawling and feeds them back to the user promptly, preventing wasted resources.
After obtaining a seed, the fetching module 320 analyzes the exception report information, calls the corresponding strategy management submodule, and requests the web page. The strategy management submodule 323 contains a series of predefined strategies stored in multiple strategy chains, for example: the seed infinite-loop handling strategy, the browser identifier switching strategy, the dynamic cookie update strategy, the proxy IP switching strategy, and so on. These strategies make the fetching module more efficient when requesting web pages.
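A strategy chain of this kind could be modeled as an ordered list of functions that each adjust the crawl parameters in light of the latest exception report. The sketch below is illustrative only: the report keys (`loop`, `fetch_failed`), the parameter names and the fixed replacement value `"UA-2"` are assumptions.

```python
from typing import Callable, Dict, List

Params = Dict[str, object]
Report = Dict[str, bool]  # e.g. {"loop": False, "fetch_failed": True}
Strategy = Callable[[Params, Report], Params]


def loop_strategy(params: Params, report: Report) -> Params:
    # Refuse the fetch when the report says the URL is stuck in a loop.
    return {**params, "allow_fetch": not report.get("loop", False)}


def user_agent_strategy(params: Params, report: Report) -> Params:
    # Switch the browser identifier after a failed fetch.
    if report.get("fetch_failed", False):
        return {**params, "user_agent": "UA-2"}
    return params


def run_chain(chain: List[Strategy], params: Params, report: Report) -> Params:
    # Apply each strategy in order; later ones see earlier adjustments.
    for strategy in chain:
        params = strategy(params, report)
    return params
```

Each strategy returns a new dictionary instead of mutating its input, so a chain can be re-run against the same starting parameters without side effects.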
After the web page information is obtained, the rule file management submodule 321 is called with the seed's virtual site ID to obtain the corresponding rule file, and the document parsing submodule 322 parses the document. Each seed has a type field that tells the document parsing submodule 322 what to parse: for a home page URL, pagination URLs are generally parsed out; for a pagination URL, detail page URLs need to be parsed out; for a detail page URL, the content can be parsed directly. Parsed results are stored separately: if a URL is parsed out, it must be tagged with a type and saved on its own for later use by the fetching module 320; if content is parsed out, it can be stored in a database or text file for direct use by the user.
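The lookup performed by the rule file management submodule might be approximated with an in-memory registry keyed by site ID and seed type, as sketched below. `RuleRegistry` and its methods are hypothetical names, and real rule files (their format is not given in the patent) are replaced here by plain Python callables.

```python
from typing import Callable, Dict, Tuple

Rule = Callable[[str], str]  # maps page text to parsed output


class RuleRegistry:
    """Stores one parsing rule per (site_id, seed_type) pair."""

    def __init__(self) -> None:
        self._rules: Dict[Tuple[int, str], Rule] = {}

    def register(self, site_id: int, seed_type: str, rule: Rule) -> None:
        self._rules[(site_id, seed_type)] = rule

    def lookup(self, site_id: int, seed_type: str) -> Rule:
        try:
            return self._rules[(site_id, seed_type)]
        except KeyError:
            raise LookupError(
                f"no rule for site {site_id}, type {seed_type!r}") from None
```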
Exception reporting runs through the whole crawl flow and is divided into two kinds: system-level errors and user-level errors. System-level errors are reported to the fetching module 320; once the fetching module receives an error of this kind, it calls the corresponding strategy management submodule 323 to optimize the crawl process. User-level errors cannot be handled by the system and must be fed back to the user, for example an error in the parsing module when parsing content.
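The two-way routing described above can be sketched as follows. The `Level` enum, the `ExceptionReporter` class and its two queues are assumptions made for illustration; the patent does not specify how reports are represented.

```python
from enum import Enum
from typing import List


class Level(Enum):
    SYSTEM = "system"  # handled by the strategy management submodule
    USER = "user"      # must be fed back to the user


class ExceptionReporter:
    def __init__(self) -> None:
        self.for_strategies: List[str] = []
        self.for_user: List[str] = []

    def report(self, level: Level, message: str) -> None:
        # System-level errors drive strategy changes; the rest go to the user.
        if level is Level.SYSTEM:
            self.for_strategies.append(message)
        else:
            self.for_user.append(message)
```

For instance, a failed fetch would be reported with `Level.SYSTEM` so that a proxy or user-agent switch can follow, while a content parsing error would be reported with `Level.USER`.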
The data storage module 330 stores the various kinds of data obtained from the fetching module. The data can be stored in a database or in documents, and the module can also supply data to the fetching module 320.
Some data needs to be saved for the system to reuse, and some data can be used directly by the user.
The present invention designs the fetching module of the single-machine crawler as submodules, which makes it highly extensible, and adds a strategy management submodule and an exception reporting submodule, which greatly optimize the whole crawl flow.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (8)

1. A single-machine crawler fetching method, characterized by including:
Step (11), obtaining at least one seed comprising a URL, a site ID and a type, taking the URL of the seed as the current URL, the site ID of the seed as the current site ID, and the type of the seed as the current type;
Step (12), obtaining at least one strategy, and determining at least one crawl parameter according to the strategy;
Step (13), obtaining a rule corresponding to the current type according to the current type;
Step (14), fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data;
In the step (14), parsing the web page data according to the rule to obtain parsed data specifically includes:
If the current type is home page URL, parsing the web page data according to the rule corresponding to home page URLs to obtain parsed data, and if the parsed data includes a URL, saving the URL in the parsed data as a seed of type pagination URL;
If the current type is pagination URL, parsing the web page data according to the rule corresponding to pagination URLs to obtain parsed data, and if the parsed data includes a URL, saving the URL in the parsed data as a seed of type detail page URL;
If the current type is detail page URL, parsing the web page data according to the rule corresponding to home page URLs to obtain parsed data, and saving the web page content in the parsed data.
2. The single-machine crawler fetching method according to claim 1, characterized in that, in the step (14), if an exception occurs while fetching the web page data or while parsing the web page data, the exception is saved.
3. The single-machine crawler fetching method according to claim 2, characterized in that the strategies include: a seed infinite-loop handling strategy, a browser identifier switching strategy, a dynamic cookie update strategy and/or a proxy IP switching strategy.
4. The single-machine crawler fetching method according to claim 3, characterized in that:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether fetching web page data from the current URL is allowed or refused; if the exception is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse fetching web page data from the current URL; otherwise the crawl parameter is set to allow fetching web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when fetching web page data from the current URL; if the exception is a failure to fetch web page data from the current URL, or the preset timer time is reached, the browser identifier used when fetching web page data from the current URL is changed to another browser identifier; otherwise the browser identifier used when fetching web page data from the current URL is not changed;
The dynamic cookie update strategy is specifically: the crawl parameter is whether updating the cookie used when fetching web page data from the current URL is allowed or refused; if the preset timer time is reached, the crawl parameter is set to allow updating the cookie used when fetching web page data from the current URL; otherwise the crawl parameter is set to refuse updating that cookie;
The proxy IP switching strategy is specifically: the crawl parameter is whether updating the proxy IP used when fetching web page data from the current URL is allowed or refused; if the exception is a failure to fetch web page data from the current URL, or the preset timer time is reached, the crawl parameter is set to allow updating the proxy IP used when fetching web page data from the current URL; otherwise the crawl parameter is set to refuse updating that proxy IP.
5. A single-machine crawler fetching system, characterized by including:
a seed receiving module, for obtaining at least one seed comprising a URL, a site ID and a type, taking the URL of the seed as the current URL, the site ID of the seed as the current site ID, and the type of the seed as the current type;
a strategy module, for obtaining at least one strategy and determining at least one crawl parameter according to the strategy;
a rule module, for obtaining a rule corresponding to the current type according to the current type;
a parsing module, for fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data;
In the parsing module, parsing the web page data according to the rule to obtain parsed data specifically includes:
If the current type is home page URL, parsing the web page data according to the rule corresponding to home page URLs to obtain parsed data, and if the parsed data includes a URL, saving the URL in the parsed data as a seed of type pagination URL;
If the current type is pagination URL, parsing the web page data according to the rule corresponding to pagination URLs to obtain parsed data, and if the parsed data includes a URL, saving the URL in the parsed data as a seed of type detail page URL;
If the current type is detail page URL, parsing the web page data according to the rule corresponding to home page URLs to obtain parsed data, and saving the web page content in the parsed data.
6. The single-machine crawler fetching system according to claim 5, characterized in that, in the parsing module, if an exception occurs while fetching the web page data or while parsing the web page data, the exception is saved.
7. The single-machine crawler fetching system according to claim 6, characterized in that the strategies include: a seed infinite-loop handling strategy, a browser identifier switching strategy, a dynamic cookie update strategy and/or a proxy IP switching strategy.
8. The single-machine crawler fetching system according to claim 7, characterized in that:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether fetching web page data from the current URL is allowed or refused; if the exception is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse fetching web page data from the current URL; otherwise the crawl parameter is set to allow fetching web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when fetching web page data from the current URL; if the exception is a failure to fetch web page data from the current URL, or the preset timer time is reached, the browser identifier used when fetching web page data from the current URL is changed to another browser identifier; otherwise the browser identifier used when fetching web page data from the current URL is not changed;
The dynamic cookie update strategy is specifically: the crawl parameter is whether updating the cookie used when fetching web page data from the current URL is allowed or refused; if the preset timer time is reached, the crawl parameter is set to allow updating the cookie used when fetching web page data from the current URL; otherwise the crawl parameter is set to refuse updating that cookie;
The proxy IP switching strategy is specifically: the crawl parameter is whether updating the proxy IP used when fetching web page data from the current URL is allowed or refused; if the exception is a failure to fetch web page data from the current URL, or the preset timer time is reached, the crawl parameter is set to allow updating the proxy IP used when fetching web page data from the current URL; otherwise the crawl parameter is set to refuse updating that proxy IP.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410458191.6A CN104252530B (en) 2014-09-10 2014-09-10 A kind of unit crawler capturing method and system

Publications (2)

Publication Number Publication Date
CN104252530A CN104252530A (en) 2014-12-31
CN104252530B true CN104252530B (en) 2017-09-15

Family

ID=52187420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410458191.6A Active CN104252530B (en) 2014-09-10 2014-09-10 A kind of unit crawler capturing method and system

Country Status (1)

Country Link
CN (1) CN104252530B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989151B (en) * 2015-03-02 2019-09-06 阿里巴巴集团控股有限公司 Webpage capture method and device
CN106021257B (en) * 2015-12-31 2019-10-18 广州华多网络科技有限公司 A kind of crawler capturing data method, apparatus and system for supporting online programming
CN107045507B (en) * 2016-02-05 2020-08-21 北京国双科技有限公司 Webpage crawling method and device
CN105956175B (en) * 2016-05-24 2017-09-05 考拉征信服务有限公司 The method and apparatus that web page contents are crawled
CN107451046B (en) * 2016-05-30 2020-11-17 腾讯科技(深圳)有限公司 Method and terminal for detecting threads
CN107957939B (en) * 2016-10-14 2021-02-26 北京京东尚科信息技术有限公司 Webpage interaction interface testing method and system
CN106599270B (en) * 2016-12-23 2020-08-21 浙江省公众信息产业有限公司 Network data capturing method and crawler
CN108536788A (en) * 2018-03-29 2018-09-14 合肥俊刚机械科技有限公司 A kind of data capture method and its system based on distributed crawler
CN109213912A (en) * 2018-08-16 2019-01-15 北京神州泰岳软件股份有限公司 A kind of method and network data crawl dispatching device of crawl network data
CN111881337B (en) * 2020-08-06 2021-06-01 成都信息工程大学 Data acquisition method and system based on Scapy framework and storage medium
CN112528120A (en) * 2020-12-21 2021-03-19 北京中安智达科技有限公司 Method for web data crawler to use browser to divide body and proxy

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178736A (en) * 2007-12-11 2008-05-14 腾讯科技(深圳)有限公司 Web page collecting method and web page collecting server
CN102254014A (en) * 2011-07-21 2011-11-23 华中科技大学 Adaptive information extraction method for webpage characteristics
CN102347930A (en) * 2010-07-26 2012-02-08 中国电信股份有限公司 Method and system for obtaining webpage content
CN103279507A (en) * 2013-05-16 2013-09-04 北京尚友通达信息技术有限公司 Webpage spider operational method and system
CN103942309A (en) * 2014-04-18 2014-07-23 乐得科技有限公司 Network data acquisition device and method and implementation method of acquisition process

Also Published As

Publication number Publication date
CN104252530A (en) 2014-12-31

Similar Documents

Publication Publication Date Title
CN104252530B (en) A kind of unit crawler capturing method and system
US20200404015A1 (en) System and method for cybersecurity analysis and score generation for insurance purposes
US20210092161A1 (en) Collaborative database and reputation management in adversarial information environments
CN105243159B (en) A kind of distributed network crawler system based on visualization script editing machine
CN104391979B (en) Malicious web crawler recognition method and device
CN104488231B (en) Method, apparatus and system for selectively monitoring flow
CN103902386B (en) Multi-thread network crawler processing method based on connection proxy optimal management
CN103179132B (en) A kind of method and device detecting and defend CC attack
CN105956175A (en) Webpage content crawling method and device
CN103279507B (en) Webpage spider operational method and system
CN103326947B (en) The learning method of PMTU, the sending method of data message and the network equipment
CN103546830B (en) A kind of processing method and system of video address failure
EP1713010A3 (en) Using attribute inheritance to identify crawl paths
CN105302815B (en) The filter method and device of the uniform resource position mark URL of webpage
US9055113B2 (en) Method and system for monitoring flows in network traffic
US11347620B2 (en) Parsing hierarchical session log data for search and analytics
CN103399871A (en) Equipment and method for capturing second-level domain information associated with main domain
CN104462242B (en) Webpage capacity of returns statistical method and device
CN106485148A (en) The implementation method of the malicious code behavior analysis sandbox being combined based on JS BOM
CN109446441B (en) General credible distributed acquisition and storage system for network community
CN106657422A (en) Method, apparatus and system for crawling website page
CN103354546A (en) Message filtering method and message filtering apparatus
CN105516114B (en) Method and device for scanning vulnerability based on webpage hash value and electronic equipment
US20180183799A1 (en) Method and system for defending against malicious website
CN108280094B (en) Application up-line and down-line data statistical method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant