CN104252530B - A single-machine crawler capturing method and system - Google Patents
A single-machine crawler capturing method and system
- Publication number: CN104252530B
- Application number: CN201410458191.6A
- Authority: CN (China)
- Prior art keywords: url, current, web data, data, crawler capturing
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Abstract
The present invention discloses a single-machine crawler capturing method and system. The method includes: obtaining at least one seed that includes a URL, a site number and a type, and taking the URL of the seed as the current URL, the site number of the seed as the current site number, and the type of the seed as the current type; obtaining at least one strategy and determining at least one crawl parameter according to the strategy; obtaining, according to the current type, the rule corresponding to the current type; and capturing web page data from the current URL according to the crawl parameter, then parsing the web page data according to the rule to obtain parsed data. Because the crawl parameters are determined by strategies, problems that arise during crawling can be overcome promptly, which improves efficiency, extends the time for which the crawler can keep crawling, and lets the crawler adapt to multiple types of websites.
Description
Technical field
The present invention relates to web crawler technology, and in particular to a single-machine crawler capturing method and system.
Background
The Internet holds massive amounts of data and information. Turning that data and information into what one actually wants, and then analyzing and processing it, is a relatively difficult task. Web crawlers emerged to address these problems.
At present, most crawlers only implement the basic function of crawling web pages. They handle repeated crawling, infinite-loop traps, and strategies against anti-crawler measures (extending the crawl time) poorly. In addition, current single-machine crawlers have poor compatibility across websites and cannot meet the need to crawl several kinds of websites at the same time.
Summary of the invention
In view of this, and to address the technical problems that existing single-machine web crawlers work inefficiently, can crawl only for a short time, and cannot crawl multiple types of websites at the same time, it is necessary to provide a single-machine crawler capturing method and system.
A single-machine crawler capturing method, including:
Obtaining at least one seed that includes a URL, a site number and a type, and taking the URL of the seed as the current URL, the site number of the seed as the current site number, and the type of the seed as the current type;
Obtaining at least one strategy, and determining at least one crawl parameter according to the strategy;
Obtaining, according to the current type, the rule corresponding to the current type;
Capturing web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data.
A single-machine crawler capturing system, including:
A seed receiving module, configured to obtain at least one seed that includes a URL, a site number and a type, and to take the URL of the seed as the current URL, the site number of the seed as the current site number, and the type of the seed as the current type;
A strategy module, configured to obtain at least one strategy and determine at least one crawl parameter according to the strategy;
A rule module, configured to obtain, according to the current type, the rule corresponding to the current type;
A parsing module, configured to capture web page data from the current URL according to the crawl parameter, and to parse the web page data according to the rule to obtain parsed data.
Because the present invention determines the crawl parameters by strategy, problems that arise during crawling can be overcome promptly, which improves efficiency, extends the crawl time, and lets the crawler adapt to multiple types of websites.
Brief description of the drawings
Fig. 1 is a workflow diagram of the single-machine crawler capturing method of the present invention;
Fig. 2 is a structural module diagram of the single-machine crawler capturing system of the present invention;
Fig. 3 is a structural module diagram of a preferred embodiment of the single-machine crawler capturing system of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 shows the workflow of the single-machine crawler capturing method of the present invention, which includes:
Step 11: obtain at least one seed that includes a URL, a site number and a type, and take the URL of the seed as the current URL, the site number of the seed as the current site number, and the type of the seed as the current type;
Step 12: obtain at least one strategy, and determine at least one crawl parameter according to the strategy;
Step 13: obtain, according to the current type, the rule corresponding to the current type;
Step 14: capture web page data from the current URL according to the crawl parameter, and parse the web page data according to the rule to obtain parsed data.
The strategies in step 12 determine the crawl parameters: different strategies yield different crawl parameters, and step 14 captures web page data using the crawl parameters determined in step 12. Because the crawl parameters are determined by the strategies of step 12, different strategies can be configured to meet different crawling needs, which improves efficiency, extends the crawl time, and lets the crawler adapt to multiple types of websites.
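Steps 11 to 14 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the names `Seed`, `crawl_one`, `fixed_user_agent`, `fake_fetch` and `title_rule` are hypothetical, and the fetch function is injected so that the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Seed:
    url: str       # step 11: becomes the current URL
    site_id: int   # step 11: becomes the current site number
    type: str      # step 11: becomes the current type


Params = Dict[str, object]
Strategy = Callable[[Seed, Params], None]   # step 12: fills in crawl parameters
Rule = Callable[[str], Dict[str, str]]      # step 13: one parsing rule per type


def crawl_one(seed: Seed,
              strategies: List[Strategy],
              rules: Dict[str, Rule],
              fetch: Callable[[str, Params], str]) -> Dict[str, str]:
    params: Params = {}
    for strategy in strategies:          # step 12: strategies set the parameters
        strategy(seed, params)
    rule = rules[seed.type]              # step 13: rule chosen by current type
    page = fetch(seed.url, params)       # step 14: capture the page data ...
    return rule(page)                    # ... then parse it with the rule


# Hypothetical strategy, fetcher and rule, used only to exercise the sketch.
def fixed_user_agent(seed: Seed, params: Params) -> None:
    params["user_agent"] = "ExampleBrowser/1.0"


def fake_fetch(url: str, params: Params) -> str:
    return "<title>%s via %s</title>" % (url, params["user_agent"])


def title_rule(page: str) -> Dict[str, str]:
    return {"title": page[len("<title>"):-len("</title>")]}


result = crawl_one(Seed("http://example.com/item/1", 7, "detail"),
                   [fixed_user_agent], {"detail": title_rule}, fake_fetch)
print(result)
```

Passing the strategies and rules in as plain callables keeps the fetch step unaware of which strategies are configured, which mirrors the separation between step 12 and step 14.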
In one embodiment, in step 14, if an abnormal condition occurs while capturing the web page data or while analyzing the web page data, the abnormal condition is saved.
By monitoring the abnormal conditions that occur while capturing or analyzing the web page data, abnormal conditions can be fed back to the user promptly, which prevents resources from being wasted.
In one embodiment, the strategies include: a seed infinite-loop handling strategy, a browser identifier switching strategy, a dynamic cookie update strategy and/or a proxy IP switching strategy.
Among the strategies of this embodiment, the seed infinite-loop handling strategy prevents repeated crawling and falling into infinite-loop traps, while the browser identifier switching strategy, the dynamic cookie update strategy and/or the proxy IP switching strategy extend the crawl time.
In one embodiment:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether capturing web page data from the current URL is allowed or refused; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse capturing web page data from the current URL; otherwise the crawl parameter is set to allow capturing web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when capturing web page data from the current URL; if the abnormal condition is that capturing web page data from the current URL has failed, or a preset time is reached, the browser identifier used when capturing web page data from the current URL is changed to another browser identifier; otherwise the browser identifier used when capturing web page data from the current URL is not changed;
The dynamic cookie update strategy is specifically: the crawl parameter is whether updating the cookie used when capturing web page data from the current URL is allowed or refused; if a preset time is reached, the crawl parameter is set to allow updating that cookie; otherwise the crawl parameter is set to refuse updating that cookie;
The proxy IP switching strategy is specifically: the crawl parameter is whether updating the proxy IP used when capturing web page data from the current URL is allowed or refused; if the abnormal condition is that capturing web page data from the current URL has failed, or a preset time is reached, the crawl parameter is set to allow updating that proxy IP; otherwise the crawl parameter is set to refuse updating that proxy IP.
This embodiment further specifies the seed infinite-loop handling strategy, the browser identifier switching strategy, the dynamic cookie update strategy and the proxy IP switching strategy. The seed infinite-loop handling strategy, the browser identifier switching strategy and the proxy IP switching strategy adjust crawl parameters according to abnormal conditions, whereas the dynamic cookie update strategy adjusts crawl parameters by timed updates.
Specifically, the seed infinite-loop handling strategy is mainly intended to deal with infinite-loop traps set by websites. After capturing web page data from a URL, the crawler extracts new URLs from that data and then captures new web page data from those new URLs. Some websites, however, set infinite-loop traps: the new URL extracted from the web page data is a URL that already exists, so the crawler falls into an infinite loop and crawling is disrupted. When the seed infinite-loop handling strategy detects the abnormal condition that the current URL has fallen into an infinite loop, it sets the crawl parameter to refuse capturing web page data from the current URL, thereby avoiding the infinite loop.
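One way to picture this strategy is a guard that remembers which URLs have already been captured. The text does not say how the loop is detected, so the seen-set below is an assumption, and `LoopGuard`, `allow_fetch` and the two-page trap site are hypothetical names and data used only for illustration.

```python
from typing import List, Set


class LoopGuard:
    """Hypothetical helper: refuses URLs that were already captured."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def allow_fetch(self, url: str) -> bool:
        # A URL seen before would send the crawler round the same cycle again,
        # so the crawl parameter becomes "refuse" for it.
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


guard = LoopGuard()
# A trap site: page "a" links to "b", and "b" links back to "a".
links = {"http://example.com/a": "http://example.com/b",
         "http://example.com/b": "http://example.com/a"}
url = "http://example.com/a"
crawled: List[str] = []
while guard.allow_fetch(url):
    crawled.append(url)   # stands in for capturing and parsing the page
    url = links[url]      # the "new" URL extracted from the page
print(crawled)
```

The loop stops on the third iteration, when the extracted URL turns out to be one that was already captured.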
Specifically, the browser identifier switching strategy imitates user behavior as closely as possible. Different users use different browsers, so imitating user behavior requires changing the type or version of the browser. The type and version of a browser are indicated by a browser identifier (for example, the user-agent). When crawling, the crawler simulates a virtual browser that is distinguished by its user-agent; the user-agent value is determined by the type and version number of the browser, so changing the user-agent value is equivalent to switching browsers. Therefore, when the abnormal condition that capturing web page data from the current URL has failed is detected, or when a preset time is reached, the browser identifier is changed in order to extend the crawl time of the crawler.
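The switching rule can be sketched with the Python standard library, where the identifier is sent as the `User-Agent` request header. The identifier pool, the helper names `next_agent_index` and `build_request`, and the 60-second interval are assumptions made for the example; no request is actually sent.

```python
import urllib.request

# Placeholder identifiers; a real pool would hold full browser User-Agent strings.
USER_AGENTS = ("ExampleBrowser/1.0", "ExampleBrowser/2.0", "OtherBrowser/5.1")


def next_agent_index(index: int, fetch_failed: bool,
                     elapsed_s: float, interval_s: float) -> int:
    """Move to the next identifier after a failed capture or when the timer is due."""
    if fetch_failed or elapsed_s >= interval_s:
        return (index + 1) % len(USER_AGENTS)
    return index


def build_request(url: str, index: int) -> urllib.request.Request:
    # The browser identifier travels in the User-Agent header of the request.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENTS[index]})


index = 0
index = next_agent_index(index, fetch_failed=False, elapsed_s=10.0, interval_s=60.0)
request_a = build_request("http://example.com/", index)   # identifier unchanged
index = next_agent_index(index, fetch_failed=True, elapsed_s=10.0, interval_s=60.0)
request_b = build_request("http://example.com/", index)   # identifier switched
```

`urllib.request.Request` stores header names in capitalized form, so the value is read back with `get_header("User-agent")`.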
Specifically, the dynamic cookie update strategy is implemented mainly by timed updates: when the preset time is reached, the cookie is updated. Updating the cookie is equivalent to establishing a new session with the website being crawled, which extends the crawl time.
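A timed cookie update might look like the sketch below, which uses the standard `http.cookiejar` module. It assumes that "updating the cookie" can be modeled by clearing the cookie jar so that the next response from the site starts a fresh session; `TimedCookieSession`, `refresh_if_due`, `demo_cookie` and the 300-second interval are hypothetical.

```python
import http.cookiejar
import urllib.request


class TimedCookieSession:
    """Hypothetical helper: drops all cookies once a preset interval has passed."""

    def __init__(self, interval_s: float, start: float) -> None:
        self.jar = http.cookiejar.CookieJar()
        # Requests made through this opener read and store cookies in self.jar.
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.jar))
        self._interval_s = interval_s
        self._last_reset = start

    def refresh_if_due(self, now: float) -> bool:
        """Clear the jar when the interval has elapsed; report whether it did."""
        if now - self._last_reset < self._interval_s:
            return False
        self.jar.clear()
        self._last_reset = now
        return True


def demo_cookie(value: str) -> http.cookiejar.Cookie:
    # Stand-in for a session cookie that a crawled site would normally set.
    return http.cookiejar.Cookie(
        version=0, name="session", value=value, port=None, port_specified=False,
        domain="example.com", domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True, secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={})


session = TimedCookieSession(interval_s=300.0, start=0.0)
session.jar.set_cookie(demo_cookie("old"))
early = session.refresh_if_due(now=100.0)   # interval not reached, cookie kept
kept = len(session.jar)
late = session.refresh_if_due(now=300.0)    # interval reached, cookies dropped
left = len(session.jar)
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the timing rule easy to check in isolation.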
Specifically, the proxy IP switching strategy mainly addresses the situation in which a website blocks an IP that has been capturing web page data for a long time (an IP is a network address, for example an IPv4 or IPv6 address; IPv4 addresses of the form XXX.XXX.XXX.XXX are the most commonly used). A single machine generally has only one IP, so a single-machine crawler crawls through proxy IPs. When the proxy IP switching strategy detects the abnormal condition that capturing web page data from the current URL has failed, or when a preset time is reached, it changes the proxy IP to avoid being blocked.
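The following sketch shows one possible shape for this strategy using `urllib.request.ProxyHandler` from the standard library. The class name `ProxySwitcher`, the round-robin order, the 600-second interval and the proxy addresses (taken from the 192.0.2.0/24 documentation range) are assumptions; building the opener does not contact any proxy.

```python
import urllib.request
from typing import List


class ProxySwitcher:
    """Hypothetical helper: rotates through a pool of proxy addresses."""

    def __init__(self, proxies: List[str], interval_s: float, start: float) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = proxies
        self._index = 0
        self._interval_s = interval_s
        self._last_switch = start

    @property
    def current(self) -> str:
        return self._proxies[self._index]

    def update(self, now: float, fetch_failed: bool) -> bool:
        """Return the crawl parameter: True if the proxy IP was allowed to change."""
        allowed = fetch_failed or now - self._last_switch >= self._interval_s
        if allowed:
            self._index = (self._index + 1) % len(self._proxies)
            self._last_switch = now
        return allowed

    def opener(self) -> urllib.request.OpenerDirector:
        # Requests made through this opener are routed via the current proxy.
        handler = urllib.request.ProxyHandler(
            {"http": self.current, "https": self.current})
        return urllib.request.build_opener(handler)


switcher = ProxySwitcher(["http://192.0.2.10:8080", "http://192.0.2.11:8080"],
                         interval_s=600.0, start=0.0)
first = switcher.current
unchanged = switcher.update(now=60.0, fetch_failed=False)     # nothing to do
after_failure = switcher.update(now=120.0, fetch_failed=True)  # switch proxy
opener = switcher.opener()
```

A new opener is built after each switch so that later requests leave through the new address.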
In one embodiment, in step 14, parsing the web page data according to the rule to obtain parsed data specifically includes:
If the current type is home page URL, the web page data is parsed according to the rule corresponding to the home page URL to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed whose type is pagination URL;
If the current type is pagination URL, the web page data is parsed according to the rule corresponding to the pagination URL to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed whose type is detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to the detail page URL to obtain parsed data, and the web page content in the parsed data is saved.
This embodiment classifies URLs so that different types of URL are parsed with different rules, which yields more accurate parsing results.
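The type-driven branching can be sketched as below. The type labels `home`, `paging` and `detail`, the `Seed` class and the single `href` regular expression are assumptions; in the described method each type has its own rule, whereas this sketch reuses one link pattern for brevity.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Seed:
    url: str
    site_id: int
    type: str   # "home", "paging" or "detail" (hypothetical labels)


# URLs found on a page of one type become seeds of the next type.
NEXT_TYPE = {"home": "paging", "paging": "detail"}
HREF = re.compile(r'href="([^"]+)"')


def parse(seed: Seed, page: str) -> Tuple[List[Seed], List[str]]:
    """Return (new seeds to save, page contents to save)."""
    if seed.type == "detail":
        return [], [page]                      # detail page: keep the content
    child_type = NEXT_TYPE[seed.type]
    new_seeds = [Seed(url, seed.site_id, child_type)
                 for url in HREF.findall(page)]
    return new_seeds, []


home = Seed("http://example.com/", 7, "home")
home_page = ('<a href="http://example.com/list?p=1">1</a>'
             '<a href="http://example.com/list?p=2">2</a>')
seeds, contents = parse(home, home_page)
detail_result = parse(Seed("http://example.com/item/1", 7, "detail"), "<p>hello</p>")
```

New seeds inherit the site number of the seed that produced them, so the matching rule can be found again when they are crawled.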
Fig. 2 shows the structural modules of the single-machine crawler capturing system of the present invention, which includes:
A seed receiving module 201, configured to obtain at least one seed that includes a URL, a site number and a type, and to take the URL of the seed as the current URL, the site number of the seed as the current site number, and the type of the seed as the current type;
A strategy module 202, configured to obtain at least one strategy and determine at least one crawl parameter according to the strategy;
A rule module 203, configured to obtain, according to the current type, the rule corresponding to the current type;
A parsing module 204, configured to capture web page data from the current URL according to the crawl parameter, and to parse the web page data according to the rule to obtain parsed data.
In one embodiment, in the parsing module 204, if an abnormal condition occurs while capturing the web page data or while analyzing the web page data, the abnormal condition is saved.
In one embodiment, the strategies include: a seed infinite-loop handling strategy, a browser identifier switching strategy, a dynamic cookie update strategy and/or a proxy IP switching strategy.
In one embodiment:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether capturing web page data from the current URL is allowed or refused; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse capturing web page data from the current URL; otherwise the crawl parameter is set to allow capturing web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when capturing web page data from the current URL; if the abnormal condition is that capturing web page data from the current URL has failed, or a preset time is reached, the browser identifier used when capturing web page data from the current URL is changed to another browser identifier; otherwise the browser identifier used when capturing web page data from the current URL is not changed;
The dynamic cookie update strategy is specifically: the crawl parameter is whether updating the cookie used when capturing web page data from the current URL is allowed or refused; if a preset time is reached, the crawl parameter is set to allow updating that cookie; otherwise the crawl parameter is set to refuse updating that cookie;
The proxy IP switching strategy is specifically: the crawl parameter is whether updating the proxy IP used when capturing web page data from the current URL is allowed or refused; if the abnormal condition is that capturing web page data from the current URL has failed, or a preset time is reached, the crawl parameter is set to allow updating that proxy IP; otherwise the crawl parameter is set to refuse updating that proxy IP.
In one embodiment, in the parsing module 204, parsing the web page data according to the rule to obtain parsed data specifically includes:
If the current type is home page URL, the web page data is parsed according to the rule corresponding to the home page URL to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed whose type is pagination URL;
If the current type is pagination URL, the web page data is parsed according to the rule corresponding to the pagination URL to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed whose type is detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to the detail page URL to obtain parsed data, and the web page content in the parsed data is saved.
Fig. 3 shows the structural modules of a preferred embodiment of the single-machine crawler capturing system of the present invention, which includes a seed generation module 310, a capture module 320 and a data storage module 330.
The main function of the seed generation module 310 is to provide seeds for the capture module. A seed can be a website URL or the SKU of a product. Seeds can be stored in a text file or a database, and the capture module can fetch seeds in batches from the text file or database.
Each seed must have a virtual site number and a type. The virtual site number tells the capture module which rule file to call to parse the document, and the type field describes what type the seed belongs to: detail page URL, pagination URL or home page URL.
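Batch loading of seeds from a text source could be sketched as follows. The tab-separated layout (value, virtual site number, type), the `Seed` tuple and the `read_seeds` helper are assumptions made for the example; the text only says that seeds live in a text file or database and are fetched in batches.

```python
import csv
import io
from typing import Iterable, Iterator, List, NamedTuple


class Seed(NamedTuple):
    value: str     # a website URL (the text also allows a product SKU)
    site_id: str   # virtual site number, used to pick the rule file
    type: str      # "home", "paging" or "detail" (hypothetical labels)


def read_seeds(lines: Iterable[str], batch_size: int) -> Iterator[List[Seed]]:
    """Yield seeds in batches from tab-separated lines, skipping blank lines."""
    batch: List[Seed] = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue
        value, site_id, seed_type = row
        batch.append(Seed(value, site_id, seed_type))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


text = ("http://example.com/\t7\thome\n"
        "http://example.com/list?p=1\t7\tpaging\n"
        "\n"
        "http://example.com/item/1\t7\tdetail\n")
batches = list(read_seeds(io.StringIO(text), batch_size=2))
```

An open text file can be passed in place of the `io.StringIO` object, since `csv.reader` accepts any iterable of lines.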
The capture module 320 is the core of the whole single-machine crawler. It consists of a rule file management submodule 321, a document parsing submodule 322, a strategy management submodule 323 and an exception reporting submodule 324. The rule file management submodule 321 manages the document parsing rules for the various websites and supplies parsing rules to the document parsing submodule 322. The document parsing submodule 322 obtains the rules for each website from the rule file management submodule 321 and parses documents with these rules to obtain the information the user is interested in. The strategy management submodule 323, as the optimization submodule of the capture module, can be composed of a series of strategy management chains; by analyzing the crawl flow of the capture module, it can prevent repeated crawling and falling into infinite-loop traps, extend the crawl time, and so on. The exception reporting submodule 324 reports the problems that the capture module 320 encounters while crawling and feeds them back to the user promptly, which prevents resources from being wasted.
After obtaining a seed, the capture module 320 analyzes the information reported by the exception reporting submodule, calls the corresponding strategy management submodule, and requests the web page. The strategy management submodule 323 contains a series of defined strategies stored in multiple strategy chains, for example the seed infinite-loop handling strategy, the browser identifier (user-agent) switching strategy, the dynamic cookie update strategy and the proxy IP switching strategy. These strategies help the capture module request web pages more efficiently.
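A strategy chain of this kind can be pictured as a list of small functions that each read the exception report and set one crawl parameter. The report keys (`in_loop`, `fetch_failed`, `timer_expired`), the parameter names and `run_chain` are hypothetical; the sketch only illustrates the chaining idea and the conditions given for the four strategies.

```python
from typing import Callable, Dict, List

Report = Dict[str, bool]    # abnormal conditions and timer state (assumed keys)
Params = Dict[str, bool]    # crawl parameters produced by the chain
Policy = Callable[[Report, Params], None]


def loop_policy(report: Report, params: Params) -> None:
    params["allow_fetch"] = not report.get("in_loop", False)


def user_agent_policy(report: Report, params: Params) -> None:
    params["switch_user_agent"] = (report.get("fetch_failed", False)
                                   or report.get("timer_expired", False))


def cookie_policy(report: Report, params: Params) -> None:
    params["refresh_cookie"] = report.get("timer_expired", False)


def proxy_policy(report: Report, params: Params) -> None:
    params["switch_proxy"] = (report.get("fetch_failed", False)
                              or report.get("timer_expired", False))


POLICY_CHAIN: List[Policy] = [loop_policy, user_agent_policy,
                              cookie_policy, proxy_policy]


def run_chain(report: Report) -> Params:
    """Apply every policy in order and collect the resulting crawl parameters."""
    params: Params = {}
    for policy in POLICY_CHAIN:
        policy(report, params)
    return params


after_failure = run_chain({"fetch_failed": True})
after_loop = run_chain({"in_loop": True})
print(after_failure)
```

Adding a strategy then amounts to appending one more function to `POLICY_CHAIN`, without touching the code that requests pages.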
After the web page information is obtained, the rule file management submodule 321 is called with the virtual site number of the seed to obtain the corresponding rule file, and the document parsing submodule 322 parses the document. Each seed has a type field that tells the document parsing submodule 322 what to parse: for a home page URL, pagination URLs are generally parsed out; for a pagination URL, detail page URLs need to be parsed out; for a detail page URL, the content can be parsed directly. Parsed results are stored separately. If a URL is parsed out, it must be tagged with a type and saved separately for later use by the capture module 320; if content is parsed out, it can be stored in a database or text file for the user to use directly.
Exception reporting runs through the whole crawl flow and covers two kinds of error: system-level errors and user-level errors. System-level errors should be reported to the capture module 320; once the capture module receives an error of this kind, it calls the corresponding strategy management submodule 323 to optimize the crawl process. User-level errors cannot be handled by the system and need to be fed back to the user, for example a content-parsing error in the parsing module.
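The two-way routing of exception reports can be sketched as below. The names `Level`, `ExceptionReport` and `Reporter`, and the use of two in-memory lists as destinations, are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Level(Enum):
    SYSTEM = "system"   # handled by the crawler through its strategies
    USER = "user"       # cannot be handled automatically, shown to the user


@dataclass
class ExceptionReport:
    level: Level
    message: str


@dataclass
class Reporter:
    """Hypothetical router: separates reports by who has to act on them."""
    for_strategies: List[ExceptionReport] = field(default_factory=list)
    for_user: List[ExceptionReport] = field(default_factory=list)

    def report(self, item: ExceptionReport) -> None:
        if item.level is Level.SYSTEM:
            self.for_strategies.append(item)
        else:
            self.for_user.append(item)


reporter = Reporter()
reporter.report(ExceptionReport(Level.SYSTEM, "capture failed for current URL"))
reporter.report(ExceptionReport(Level.USER, "content parsing error"))
```

The capture module would read `for_strategies` before its next request, while `for_user` holds the items that need human attention.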
The data storage module 330 stores the various kinds of data obtained from the capture module. The data can be stored in a database or in documents, and the module can also supply data to the capture module 320.
Some data needs to be saved for the system to reuse, and some data can be used directly by the user.
The present invention designs the capture module of the single-machine crawler as submodules, which makes it highly extensible, and adds the strategy management submodule and the exception reporting submodule, which greatly optimizes the whole crawl flow.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.
Claims (8)
1. A single-machine crawler capturing method, characterized by including:
Step (11): obtaining at least one seed that includes a URL, a site number and a type, and taking the URL of the seed as the current URL, the site number of the seed as the current site number, and the type of the seed as the current type;
Step (12): obtaining at least one strategy, and determining at least one crawl parameter according to the strategy;
Step (13): obtaining, according to the current type, the rule corresponding to the current type;
Step (14): capturing web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data;
In step (14), parsing the web page data according to the rule to obtain parsed data specifically includes:
If the current type is home page URL, the web page data is parsed according to the rule corresponding to the home page URL to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed whose type is pagination URL;
If the current type is pagination URL, the web page data is parsed according to the rule corresponding to the pagination URL to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed whose type is detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to the detail page URL to obtain parsed data, and the web page content in the parsed data is saved.
2. The single-machine crawler capturing method according to claim 1, characterized in that, in step (14), if an abnormal condition occurs while capturing the web page data or while analyzing the web page data, the abnormal condition is saved.
3. The single-machine crawler capturing method according to claim 2, characterized in that the strategies include: a seed infinite-loop handling strategy, a browser identifier switching strategy, a dynamic cookie update strategy and/or a proxy IP switching strategy.
4. The single-machine crawler capturing method according to claim 3, characterized in that:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether capturing web page data from the current URL is allowed or refused; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse capturing web page data from the current URL; otherwise the crawl parameter is set to allow capturing web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when capturing web page data from the current URL; if the abnormal condition is that capturing web page data from the current URL has failed, or a preset time is reached, the browser identifier used when capturing web page data from the current URL is changed to another browser identifier; otherwise the browser identifier used when capturing web page data from the current URL is not changed;
The dynamic cookie update strategy is specifically: the crawl parameter is whether updating the cookie used when capturing web page data from the current URL is allowed or refused; if a preset time is reached, the crawl parameter is set to allow updating that cookie; otherwise the crawl parameter is set to refuse updating that cookie;
The proxy IP switching strategy is specifically: the crawl parameter is whether updating the proxy IP used when capturing web page data from the current URL is allowed or refused; if the abnormal condition is that capturing web page data from the current URL has failed, or a preset time is reached, the crawl parameter is set to allow updating that proxy IP; otherwise the crawl parameter is set to refuse updating that proxy IP.
5. A single-machine crawler capturing system, characterized by including:
A seed receiving module, configured to obtain at least one seed that includes a URL, a site number and a type, and to take the URL of the seed as the current URL, the site number of the seed as the current site number, and the type of the seed as the current type;
A strategy module, configured to obtain at least one strategy and determine at least one crawl parameter according to the strategy;
A rule module, configured to obtain, according to the current type, the rule corresponding to the current type;
A parsing module, configured to capture web page data from the current URL according to the crawl parameter, and to parse the web page data according to the rule to obtain parsed data;
In the parsing module, parsing the web page data according to the rule to obtain parsed data specifically includes:
If the current type is home page URL, the web page data is parsed according to the rule corresponding to the home page URL to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed whose type is pagination URL;
If the current type is pagination URL, the web page data is parsed according to the rule corresponding to the pagination URL to obtain parsed data; if the parsed data includes a URL, the URL in the parsed data is saved as a seed whose type is detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to the detail page URL to obtain parsed data, and the web page content in the parsed data is saved.
6. The single-machine crawler capturing system according to claim 5, characterized in that, in the parsing module, if an abnormal condition occurs while capturing the web page data or while analyzing the web page data, the abnormal condition is saved.
7. The single-machine crawler capturing system according to claim 6, characterized in that the strategies include: a seed infinite-loop handling strategy, a browser identifier switching strategy, a dynamic cookie update strategy and/or a proxy IP switching strategy.
8. The single-machine crawler capturing system according to claim 7, characterized in that:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether capturing web page data from the current URL is allowed or refused; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse capturing web page data from the current URL; otherwise the crawl parameter is set to allow capturing web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when capturing web page data from the current URL; if the abnormal condition is that capturing web page data from the current URL has failed, or a preset time is reached, the browser identifier used when capturing web page data from the current URL is changed to another browser identifier; otherwise the browser identifier used when capturing web page data from the current URL is not changed;
The dynamic cookie update strategy is specifically: the crawl parameter is whether updating the cookie used when capturing web page data from the current URL is allowed or refused; if a preset time is reached, the crawl parameter is set to allow updating that cookie; otherwise the crawl parameter is set to refuse updating that cookie;
The proxy IP switching strategy is specifically: the crawl parameter is whether updating the proxy IP used when capturing web page data from the current URL is allowed or refused; if the abnormal condition is that capturing web page data from the current URL has failed, or a preset time is reached, the crawl parameter is set to allow updating that proxy IP; otherwise the crawl parameter is set to refuse updating that proxy IP.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410458191.6A (published as CN104252530B) | 2014-09-10 | 2014-09-10 | A single-machine crawler capturing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104252530A (en) | 2014-12-31 |
CN104252530B (en) | 2017-09-15 |
Family ID: 52187420
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410458191.6A (Active; published as CN104252530B) | 2014-09-10 | 2014-09-10 | A single-machine crawler capturing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104252530B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105989151B (en) * | 2015-03-02 | 2019-09-06 | 阿里巴巴集团控股有限公司 | Webpage capture method and device |
CN106021257B (en) * | 2015-12-31 | 2019-10-18 | 广州华多网络科技有限公司 | A kind of crawler capturing data method, apparatus and system for supporting online programming |
CN107045507B (en) * | 2016-02-05 | 2020-08-21 | 北京国双科技有限公司 | Webpage crawling method and device |
CN105956175B (en) * | 2016-05-24 | 2017-09-05 | 考拉征信服务有限公司 | The method and apparatus that web page contents are crawled |
CN107451046B (en) * | 2016-05-30 | 2020-11-17 | 腾讯科技(深圳)有限公司 | Method and terminal for detecting threads |
CN107957939B (en) * | 2016-10-14 | 2021-02-26 | 北京京东尚科信息技术有限公司 | Webpage interaction interface testing method and system |
CN106599270B (en) * | 2016-12-23 | 2020-08-21 | 浙江省公众信息产业有限公司 | Network data capturing method and crawler |
CN108536788A (en) * | 2018-03-29 | 2018-09-14 | 合肥俊刚机械科技有限公司 | A kind of data capture method and its system based on distributed reptile |
CN109213912A (en) * | 2018-08-16 | 2019-01-15 | 北京神州泰岳软件股份有限公司 | A kind of method and network data crawl dispatching device of crawl network data |
CN111881337B (en) * | 2020-08-06 | 2021-06-01 | 成都信息工程大学 | Data acquisition method and system based on Scapy framework and storage medium |
CN112528120A (en) * | 2020-12-21 | 2021-03-19 | 北京中安智达科技有限公司 | Method for web data crawler to use browser to divide body and proxy |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101178736A (en) * | 2007-12-11 | 2008-05-14 | 腾讯科技(深圳)有限公司 | Web page collecting method and web page collecting server |
CN102254014A (en) * | 2011-07-21 | 2011-11-23 | 华中科技大学 | Adaptive information extraction method for webpage characteristics |
CN102347930A (en) * | 2010-07-26 | 2012-02-08 | 中国电信股份有限公司 | Method and system for obtaining webpage content |
CN103279507A (en) * | 2013-05-16 | 2013-09-04 | 北京尚友通达信息技术有限公司 | Webpage spider operational method and system |
CN103942309A (en) * | 2014-04-18 | 2014-07-23 | 乐得科技有限公司 | Network data acquisition device and method and implementation method of acquisition process |
Also Published As
Publication number | Publication date |
---|---|
CN104252530A (en) | 2014-12-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||