CN104252530B

CN104252530B - Single-machine crawler fetching method and system

Info

Publication number: CN104252530B
Application number: CN201410458191.6A
Authority: CN
Inventors: 廖耀华
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2014-09-10
Filing date: 2014-09-10
Publication date: 2017-09-15
Anticipated expiration: 2034-09-10
Also published as: CN104252530A

Abstract

The present invention discloses a single-machine crawler fetching method and system. The method includes: obtaining at least one seed comprising a URL, a website number, and a type, and taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type; obtaining at least one strategy and determining at least one crawl parameter according to the strategy; obtaining the rule corresponding to the current type according to the current type; and fetching web page data from the current URL according to the crawl parameter, then parsing the web page data according to the rule to obtain parsed data. Because the crawl parameters are determined by strategies, problems that arise during fetching can be overcome promptly, which improves working efficiency, extends the time the crawler can keep fetching, and lets the crawler adapt to multiple types of websites.

Description

A single-machine crawler fetching method and system

Technical field

The present invention relates to web crawler technology, and in particular to a single-machine crawler fetching method and system.

Background art

The Internet holds massive amounts of data and information. Converting that data and information into what one actually needs, and then analyzing and processing it, is a difficult problem. Web crawlers emerged to address this problem.

At present, most crawlers implement only the basic function of fetching web pages. They offer no good solutions for avoiding repeated fetching, avoiding infinite-loop traps, or countering anti-crawling measures (that is, extending the time the crawler can keep fetching). In addition, existing single-machine crawlers have poor compatibility and cannot serve the fetching needs of several kinds of websites at the same time.

Summary of the invention

In view of this, and to address the technical problems that existing single-machine web crawler mechanisms have low working efficiency, can fetch only for a short time, and cannot fetch multiple types of websites at the same time, it is necessary to provide a single-machine crawler fetching method and system.

A single-machine crawler fetching method, including:

obtaining at least one seed comprising a URL, a website number, and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;

obtaining at least one strategy, and determining at least one crawl parameter according to the strategy;

obtaining the rule corresponding to the current type according to the current type;

fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data.

A single-machine crawler fetching system, including:

a seed receiving module, configured to obtain at least one seed comprising a URL, a website number, and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;

a strategy module, configured to obtain at least one strategy and determine at least one crawl parameter according to the strategy;

a rule module, configured to obtain the rule corresponding to the current type according to the current type;

a parsing module, configured to fetch web page data from the current URL according to the crawl parameter, and parse the web page data according to the rule to obtain parsed data.

Because the present invention determines the crawl parameters by strategy, problems that arise during fetching can be overcome promptly, which improves working efficiency, extends the time the crawler can keep fetching, and lets the crawler adapt to multiple types of websites.

Brief description of the drawings

Fig. 1 is a workflow diagram of a single-machine crawler fetching method of the present invention;

Fig. 2 is a structural block diagram of a single-machine crawler fetching system of the present invention;

Fig. 3 is a structural block diagram of the preferred embodiment of a single-machine crawler fetching system of the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a workflow diagram of a single-machine crawler fetching method of the present invention, including:

Step 11: obtain at least one seed comprising a URL, a website number, and a type; take the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;

Step 12: obtain at least one strategy, and determine at least one crawl parameter according to the strategy;

Step 13: obtain the rule corresponding to the current type according to the current type;

Step 14: fetch web page data from the current URL according to the crawl parameter, and parse the web page data according to the rule to obtain parsed data.

The strategies in step 12 determine the crawl parameters: different strategies yield different crawl parameters, and step 14 fetches web page data using the crawl parameters determined in step 12. Because the crawl parameters are determined by the strategies of step 12, different fetching needs can be met by configuring different strategies, which improves working efficiency, extends the time the crawler can keep fetching, and lets the crawler adapt to multiple types of websites.
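Steps 11 to 14 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `Seed`, `crawl_one`, and the `fetch` callable are hypothetical, the type labels are placeholders, and crawl parameters are modeled as a plain dictionary that each strategy may update.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Seed:
    url: str
    site_id: str    # website number
    seed_type: str  # e.g. "home", "paging", "detail" (hypothetical labels)


# A strategy takes the seed and the parameters so far and returns updated parameters.
Strategy = Callable[[Seed, Dict[str, object]], Dict[str, object]]
# A rule turns raw web page data into parsed data.
Rule = Callable[[str], Dict[str, object]]
# The fetch function is injected so the sketch does not depend on any HTTP library.
Fetch = Callable[[str, Dict[str, object]], str]


def crawl_one(seed: Seed, strategies: List[Strategy],
              rules: Dict[str, Rule], fetch: Fetch) -> Dict[str, object]:
    # Step 11: the seed supplies the current URL, website number and type.
    current_url, current_type = seed.url, seed.seed_type
    # Step 12: every strategy contributes to the crawl parameters.
    params: Dict[str, object] = {}
    for strategy in strategies:
        params = strategy(seed, params)
    # Step 13: select the rule that corresponds to the current type.
    rule = rules[current_type]
    # Step 14: fetch with the crawl parameters, then parse with the rule.
    page = fetch(current_url, params)
    return rule(page)
```

Passing `fetch` in as an argument keeps the flow testable offline; a real crawler would supply a function that issues the HTTP request using the parameters.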

In one embodiment, in step 14, if an abnormal condition occurs while fetching the web page data or while analyzing the web page data, the abnormal condition is saved.

By monitoring the abnormal conditions that occur while fetching or analyzing the web page data, abnormal conditions can be fed back to the user promptly, preventing wasted resources.

In one embodiment, the strategies include: a seed infinite-loop handling strategy, a browser identifier switching strategy, a dynamic cookie update strategy, and/or a proxy IP switching strategy.

Among the strategies of this embodiment, the seed infinite-loop handling strategy prevents repeated fetching and infinite-loop traps, while the browser identifier switching strategy, the dynamic cookie update strategy, and/or the proxy IP switching strategy extend the time the crawler can keep fetching.

In one embodiment:

The seed infinite-loop handling strategy is specifically: the crawl parameter is whether fetching web page data from the current URL is allowed or refused. If the abnormal condition is that the current URL is trapped in an infinite loop, the crawl parameter is set to refuse fetching web page data from the current URL; otherwise the crawl parameter is set to allow fetching web page data from the current URL;

The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when fetching web page data from the current URL. If the abnormal condition is a failure to fetch web page data from the current URL, or the preset time interval is reached, the browser identifier used when fetching web page data from the current URL is changed to another browser identifier; otherwise the browser identifier is not changed;

The dynamic cookie update strategy is specifically: the crawl parameter is whether updating the cookie used when fetching web page data from the current URL is allowed or refused. If the preset time interval is reached, the crawl parameter is set to allow updating the cookie; otherwise the crawl parameter is set to refuse updating the cookie;

The proxy IP switching strategy is specifically: the crawl parameter is whether updating the proxy IP used when fetching web page data from the current URL is allowed or refused. If the abnormal condition is a failure to fetch web page data from the current URL, or the preset time interval is reached, the crawl parameter is set to allow updating the proxy IP; otherwise the crawl parameter is set to refuse updating the proxy IP.

This embodiment further specifies the seed infinite-loop handling strategy, the browser identifier switching strategy, the dynamic cookie update strategy, and the proxy IP switching strategy. The seed infinite-loop handling strategy, the browser identifier switching strategy, and the proxy IP switching strategy adjust the crawl parameters according to abnormal conditions (the latter two also respond to the preset time interval), whereas the dynamic cookie update strategy adjusts the crawl parameters only through timed updates.

Specifically, the seed infinite-loop handling strategy mainly addresses infinite-loop traps set by websites. After fetching web page data from a URL, the crawler extracts new URLs from that data and then fetches new web page data from the new URLs. Some websites, however, set infinite-loop traps: the new URLs extracted from the web page data are URLs that already exist, so the crawler falls into an infinite loop and fetching is disrupted. Under the seed infinite-loop handling strategy, when the abnormal condition of the current URL being trapped in an infinite loop is detected, fetching web page data from the current URL is refused, which avoids the loop.
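One simple way to realize the allow/refuse parameter described above is to remember every URL that has already been fetched. The sketch below assumes that "trapped in an infinite loop" can be approximated by "this exact URL was seen before"; the class name `LoopGuard` is hypothetical, and the patent does not prescribe a detection method.

```python
class LoopGuard:
    """Decides whether fetching a URL is allowed, refusing URLs already seen."""

    def __init__(self) -> None:
        self._seen = set()

    def allow_fetch(self, url: str) -> bool:
        # A URL that reappears is treated as a sign of an infinite-loop trap.
        if url in self._seen:
            return False
        self._seen.add(url)
        return True
```

A production crawler would likely normalize URLs before comparing them (for example, dropping fragments), since traps often vary the URL slightly on each visit.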

Specifically, the browser identifier switching strategy imitates user behavior as closely as possible. Different users use different browsers, so imitating user behavior requires changing the browser type or version. The type and version of a browser are indicated by a browser identifier (for example, the User-Agent). When fetching, the crawler simulates a virtual browser that is distinguished by its User-Agent; the User-Agent value is determined by the browser type and version number, so changing the User-Agent value is equivalent to switching browsers. Therefore, when the abnormal condition of failing to fetch web page data from the current URL is detected, or when the preset time interval is reached, the browser identifier is changed to extend the time the crawler can keep fetching.
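The switching rule (change the identifier on a fetch failure or when the preset interval elapses, otherwise keep it) might look like the sketch below. The class name `UserAgentSwitcher`, the round-robin order, and the injectable `clock` are assumptions made for illustration; the User-Agent strings themselves would be supplied by the operator.

```python
import itertools
import time
from typing import Callable, Sequence


class UserAgentSwitcher:
    """Rotates through User-Agent strings on failure or after a time interval."""

    def __init__(self, user_agents: Sequence[str], interval_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not user_agents:
            raise ValueError("at least one User-Agent string is required")
        self._cycle = itertools.cycle(user_agents)
        self._interval = interval_seconds
        self._clock = clock
        self.current = next(self._cycle)
        self._last_switch = clock()

    def update(self, fetch_failed: bool) -> str:
        """Return the User-Agent to use for the next request."""
        now = self._clock()
        if fetch_failed or now - self._last_switch >= self._interval:
            # Switching the User-Agent value is equivalent to switching browsers.
            self.current = next(self._cycle)
            self._last_switch = now
        return self.current
```

The returned string would be sent as the `User-Agent` request header by whatever HTTP client the crawler uses.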

Specifically, the dynamic cookie update strategy is implemented mainly through timed updates: when the preset time interval is reached, the cookie is updated. Updating the cookie is equivalent to establishing a new session with the website being fetched, which extends the time the crawler can keep fetching.
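As a rough sketch of the timed update, the code below clears a standard-library `http.cookiejar.CookieJar` once the interval has elapsed, so that the next request starts a new session. Treating "update the cookie" as "discard the stored cookies" is an assumption, as are the class name `CookieRefresher` and the injectable `clock`.

```python
import http.cookiejar
import time
from typing import Callable


class CookieRefresher:
    """Allows a cookie update only when the preset interval has elapsed."""

    def __init__(self, interval_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.jar = http.cookiejar.CookieJar()
        self._interval = interval_seconds
        self._clock = clock
        self._last_refresh = clock()

    def maybe_refresh(self) -> bool:
        """Return True if the cookies were refreshed, False if the update was refused."""
        now = self._clock()
        if now - self._last_refresh >= self._interval:
            # Dropping all cookies makes the site issue a new session next time.
            self.jar.clear()
            self._last_refresh = now
            return True
        return False
```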

Specifically, the proxy IP switching strategy mainly addresses the situation in which a website blocks an IP (a network address, for example an IPv4 or IPv6 address; typically an IPv4 address of the form XXX.XXX.XXX.XXX) that has been fetching web page data for a long time. A single machine generally has only one IP, so a single-machine crawler fetches through proxy IPs. Under the proxy IP switching strategy, when the abnormal condition of failing to fetch web page data from the current URL is detected, or when the preset time interval is reached, the proxy IP is changed to avoid being blocked.
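The following sketch shows one possible way to hold a pool of proxies and move to the next one when a fetch fails or the timer expires. The class `ProxySwitcher`, the helper `on_fetch_result`, and the proxy addresses in the usage are hypothetical; applying the proxy through `urllib.request.ProxyHandler` is simply one standard-library option, not something the patent specifies.

```python
import urllib.request
from typing import Sequence


class ProxySwitcher:
    """Holds a pool of proxy addresses and exposes the one currently in use."""

    def __init__(self, proxies: Sequence[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy address is required")
        self._proxies = list(proxies)
        self._index = 0

    @property
    def current(self) -> str:
        return self._proxies[self._index]

    def switch(self) -> str:
        # Move to the next proxy, wrapping around at the end of the pool.
        self._index = (self._index + 1) % len(self._proxies)
        return self.current

    def build_opener(self) -> urllib.request.OpenerDirector:
        # Route both HTTP and HTTPS requests through the current proxy.
        handler = urllib.request.ProxyHandler(
            {"http": self.current, "https": self.current})
        return urllib.request.build_opener(handler)


def on_fetch_result(switcher: ProxySwitcher, failed: bool, timer_expired: bool) -> str:
    """Switch the proxy on failure or timer expiry; otherwise keep it."""
    if failed or timer_expired:
        switcher.switch()
    return switcher.current
```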

In one embodiment, in step 14, parsing the web page data according to the rule to obtain parsed data specifically includes:

If the current type is home page URL, the web page data is parsed according to the rule corresponding to home page URLs to obtain parsed data; if the parsed data includes URLs, each URL in the parsed data is saved as a seed of type paging URL;

If the current type is paging URL, the web page data is parsed according to the rule corresponding to paging URLs to obtain parsed data; if the parsed data includes URLs, each URL in the parsed data is saved as a seed of type detail page URL;

If the current type is detail page URL, the web page data is parsed according to the rule corresponding to detail page URLs to obtain parsed data, and the web page content in the parsed data is saved.

This embodiment classifies URLs so that URLs of different types are parsed with different rules, which yields more accurate parsing results.
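The three-way handling of parsed data described above can be sketched as a small dispatch function. The type labels `"home"`, `"paging"`, and `"detail"`, the dictionary keys `"urls"` and `"content"`, and the function name `handle_parsed` are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Seed:
    url: str
    site_id: str
    seed_type: str  # "home", "paging" or "detail" (hypothetical labels)


# Type given to URLs discovered on a page of the key's type.
NEXT_TYPE = {"home": "paging", "paging": "detail"}


def handle_parsed(seed: Seed, parsed: Dict[str, object]) -> Tuple[List[Seed], List[object]]:
    """Return (new seeds to save, page content to save) for one parsed page."""
    if seed.seed_type in NEXT_TYPE:
        # Home and paging pages yield URLs, which become seeds of the next type.
        child_type = NEXT_TYPE[seed.seed_type]
        urls = parsed.get("urls", [])
        return [Seed(u, seed.site_id, child_type) for u in urls], []
    if seed.seed_type == "detail":
        # Detail pages yield content, which is saved directly.
        content = parsed.get("content")
        return [], ([content] if content is not None else [])
    raise ValueError(f"unknown seed type: {seed.seed_type!r}")
```

New seeds inherit the website number of the page they were found on, so the same site's rules continue to apply at the next level.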

Fig. 2 is a structural block diagram of a single-machine crawler fetching system of the present invention, including:

a seed receiving module 201, configured to obtain at least one seed comprising a URL, a website number, and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;

a strategy module 202, configured to obtain at least one strategy and determine at least one crawl parameter according to the strategy;

a rule module 203, configured to obtain the rule corresponding to the current type according to the current type;

a parsing module 204, configured to fetch web page data from the current URL according to the crawl parameter, and parse the web page data according to the rule to obtain parsed data.

In one embodiment, in the parsing module 204, if an abnormal condition occurs while fetching the web page data or while analyzing the web page data, the abnormal condition is saved.

In one embodiment:

In one embodiment, in the parsing module 204, parsing the web page data according to the rule to obtain parsed data specifically includes:

Fig. 3 is a structural block diagram of the preferred embodiment of a single-machine crawler fetching system of the present invention, including a seed generation module 310, a fetching module 320, and a data storage module 330.

The main function of the seed generation module 310 is to provide seeds for the fetching module. A seed can be a website URL or the SKU of a product. Seeds can be stored in a text file or a database, and the fetching module can obtain seeds in batches from the text file or database.

Each seed must carry the virtual number of a website and a type. The virtual website number tells the fetching module which rule file to call to parse the document, and the type field describes what type the seed belongs to: detail page URL, paging URL, or home page URL.
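To make the seed format concrete, the sketch below reads seeds in a batch from a text source. The tab-separated layout (value, virtual website number, type), the `#` comment convention, the type labels, and the name `load_seeds` are all assumptions; the patent only states that seeds carry these fields and can be stored in a text file or database.

```python
import csv
import io
from typing import Iterable, List, NamedTuple

ALLOWED_TYPES = {"home", "paging", "detail"}  # hypothetical labels for the three types


class Seed(NamedTuple):
    value: str      # a website URL or a product SKU
    site_id: str    # virtual website number, used to pick the rule file
    seed_type: str  # one of ALLOWED_TYPES


def load_seeds(lines: Iterable[str]) -> List[Seed]:
    """Parse tab-separated seed lines, skipping blank lines and '#' comments."""
    seeds = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 3:
            raise ValueError(f"expected 3 fields, got {len(row)}: {row!r}")
        seed = Seed(*row)
        if seed.seed_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown seed type: {seed.seed_type!r}")
        seeds.append(seed)
    return seeds
```

Because `load_seeds` accepts any iterable of lines, it works equally with an open file object or an in-memory `io.StringIO`.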

The fetching module 320 is the core of the whole single-machine crawler. It consists of a rule file management submodule 321, a document parsing submodule 322, a strategy management submodule 323, and an exception reporting submodule 324. The main role of the rule file management submodule 321 is to manage the document parsing rules of all kinds of websites and to provide parsing rules to the document parsing submodule 322. The document parsing submodule 322 obtains the rules for each website from the rule file management submodule 321 and parses documents with these rules to obtain the information the user is interested in. The strategy management submodule 323, as the optimization submodule of the fetching module, can be composed of a series of strategy management chains; by analyzing the fetching flow of the fetching module, it can prevent repeated fetching, avoid infinite-loop traps, extend the fetching time, and so on. The exception reporting submodule 324 reports the various problems the fetching module 320 encounters during fetching and feeds them back to the user promptly, preventing wasted resources.

After obtaining a seed, the fetching module 320 analyzes the information in the exception reports, calls the corresponding strategy management submodule, and requests the web page. The strategy management submodule 323 contains a series of predefined strategies stored in multiple strategy chains, such as the seed infinite-loop handling strategy, the browser identifier switching strategy, the dynamic cookie update strategy, and the proxy IP switching strategy. These strategies help the fetching module request web pages more efficiently.

After the web page information is obtained, the rule file management submodule 321 is called with the virtual number of the seed to obtain the corresponding rule file, and the document is parsed by the document parsing submodule 322. Each seed has a type field that tells the document parsing submodule 322 what content to parse: for a home page URL, paging URLs are generally parsed out; for a paging URL, detail page URLs need to be parsed out; for a detail page URL, the content can be parsed directly. The parsed results are stored separately. If a parsed result is a URL, it must be tagged with a type and saved separately for later use by the fetching module 320; if a parsed result is content, it can be stored in a database or text file for direct use by the user.

Exception reports run through the whole fetching flow and fall into two kinds: system-level errors and user-level errors. System-level errors are reported to the fetching module 320; once the fetching module receives an error of this kind, it calls the corresponding strategy management submodule 323 to optimize the fetching process. User-level errors are errors the system cannot handle itself and must be fed back to the user, for example an error in the parsing module when parsing content.
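The two-way routing of exception reports could be sketched as below. The names `ErrorLevel`, `ErrorReport`, and `ExceptionReporter` are hypothetical, and the two lists stand in for whatever channel actually delivers reports to the strategy management submodule and to the user.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ErrorLevel(Enum):
    SYSTEM = "system"  # handled by the crawler itself through its strategies
    USER = "user"      # cannot be handled automatically, fed back to the user


@dataclass(frozen=True)
class ErrorReport:
    level: ErrorLevel
    url: str
    message: str


class ExceptionReporter:
    """Routes each report either to the strategies or to the user."""

    def __init__(self) -> None:
        self.for_strategies: List[ErrorReport] = []
        self.for_user: List[ErrorReport] = []

    def report(self, error: ErrorReport) -> None:
        if error.level is ErrorLevel.SYSTEM:
            # For example a failed fetch, which may trigger a proxy or User-Agent switch.
            self.for_strategies.append(error)
        else:
            # For example a rule that no longer matches the page layout.
            self.for_user.append(error)
```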

The data storage module 330 stores the various kinds of data obtained from the fetching module. The data can be stored in a database or in documents, and the module can also provide data to the fetching module 320.

Some data must be saved for the system to reuse, while other data can be used directly by the user.

The present invention divides the fetching module of the single-machine crawler into submodules, which makes it highly extensible, and adds a strategy management submodule and an exception reporting submodule, which greatly optimize the whole fetching flow.

The embodiments described above express only several implementations of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims

1. A single-machine crawler fetching method, characterized by including:

Step (11): obtaining at least one seed comprising a URL, a website number, and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;

Step (12): obtaining at least one strategy, and determining at least one crawl parameter according to the strategy;

Step (13): obtaining the rule corresponding to the current type according to the current type;

Step (14): fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data;

wherein, in step (14), parsing the web page data according to the rule to obtain parsed data specifically includes:

if the current type is home page URL, parsing the web page data according to the rule corresponding to home page URLs to obtain parsed data, and, if the parsed data includes URLs, saving each URL in the parsed data as a seed of type paging URL;

if the current type is paging URL, parsing the web page data according to the rule corresponding to paging URLs to obtain parsed data, and, if the parsed data includes URLs, saving each URL in the parsed data as a seed of type detail page URL;

if the current type is detail page URL, parsing the web page data according to the rule corresponding to detail page URLs to obtain parsed data, and saving the web page content in the parsed data.

2. The single-machine crawler fetching method according to claim 1, characterized in that, in step (14), if an abnormal condition occurs while fetching the web page data or while analyzing the web page data, the abnormal condition is saved.

3. The single-machine crawler fetching method according to claim 2, characterized in that the strategies include: a seed infinite-loop handling strategy, a browser identifier switching strategy, a dynamic cookie update strategy, and/or a proxy IP switching strategy.

4. The single-machine crawler fetching method according to claim 3, characterized in that:

the seed infinite-loop handling strategy is specifically: the crawl parameter is whether fetching web page data from the current URL is allowed or refused; if the abnormal condition is that the current URL is trapped in an infinite loop, the crawl parameter is set to refuse fetching web page data from the current URL; otherwise the crawl parameter is set to allow fetching web page data from the current URL;

the browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when fetching web page data from the current URL; if the abnormal condition is a failure to fetch web page data from the current URL, or the preset time interval is reached, the browser identifier used when fetching web page data from the current URL is changed to another browser identifier; otherwise the browser identifier is not changed;

the dynamic cookie update strategy is specifically: the crawl parameter is whether updating the cookie used when fetching web page data from the current URL is allowed or refused; if the preset time interval is reached, the crawl parameter is set to allow updating the cookie; otherwise the crawl parameter is set to refuse updating the cookie;

the proxy IP switching strategy is specifically: the crawl parameter is whether updating the proxy IP used when fetching web page data from the current URL is allowed or refused; if the abnormal condition is a failure to fetch web page data from the current URL, or the preset time interval is reached, the crawl parameter is set to allow updating the proxy IP; otherwise the crawl parameter is set to refuse updating the proxy IP.

5. A single-machine crawler fetching system, characterized by including:

a seed receiving module, configured to obtain at least one seed comprising a URL, a website number, and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;

a parsing module, configured to fetch web page data from the current URL according to the crawl parameter, and parse the web page data according to the rule to obtain parsed data;

wherein, in the parsing module, parsing the web page data according to the rule to obtain parsed data specifically includes:

6. The single-machine crawler fetching system according to claim 5, characterized in that, in the parsing module, if an abnormal condition occurs while fetching the web page data or while analyzing the web page data, the abnormal condition is saved.

7. The single-machine crawler fetching system according to claim 6, characterized in that the strategies include: a seed infinite-loop handling strategy, a browser identifier switching strategy, a dynamic cookie update strategy, and/or a proxy IP switching strategy.

8. The single-machine crawler fetching system according to claim 7, characterized in that: