CN102262635A - Page crawler system and page crawler method

Info

Publication number: CN102262635A
Application number: CN2010101899986A
Authority: CN
Inventors: 肖小剑; 李天武
Original assignee: Beijing Venus Information Security Technology Co Ltd; Beijing Venus Information Technology Co Ltd
Current assignee: Beijing Venus Information Security Technology Co Ltd; Beijing Venus Information Technology Co Ltd
Priority date: 2010-05-25
Filing date: 2010-05-25
Publication date: 2011-11-30

Abstract

The invention discloses a page crawler system and a page crawler method and overcomes the technical defect that a dynamic uniform resource locator (URL) cannot be effectively extracted in the prior art. The page crawler method comprises the following steps of: setting a first duplication removal queue; receiving a target page; crawling the target page by using a static crawler; regarding a URL which cannot be analyzed by the static crawler in the target page as the dynamic URL; submitting the dynamic URL to the first duplication removal queue; and continuously crawling the dynamic URL in the first duplication removal queue by using a dynamic crawler. By the invention, the technical defect that the dynamic URL cannot be effectively extracted in the prior art is overcome, the search efficiency and performance of pages are effectively improved, and the security application of the pages can be maintained.

Description

Page crawler system and method

Technical field

The present invention relates to web page search technology, and in particular to a page crawler system and method.

Background art

A web crawler is a program that automatically fetches web pages. It downloads pages from the internet on behalf of a search engine and is an important component of a search engine. A traditional crawler starts from the uniform resource locators (URLs) of one or several initial pages and obtains the URLs on those pages. While fetching pages, it continually extracts new URLs from the current page and puts them into a queue for further analysis, repeating this cycle until it has traversed the whole internet or a stop condition of the system is met.

In terms of application scope, crawlers are used mainly in search engines such as Google and Baidu and in specialized vertical search engines (such as job search engines). They are also used for collecting virus samples, for network security monitoring, for cloud security, and so on.

Web pages can be divided into dynamic pages and static pages according to whether they contain scripts executed on the browser side. The URLs in a static page are embedded directly in the HTML file as Hypertext Markup Language (HTML) hyperlinks; such a URL is generally called a static URL (or static link). A dynamic page contains, in addition to static URLs, a large number of dynamic URLs (or dynamic links) that can be obtained only by executing browser-side scripts. The dominant browser-side scripting language on the internet today is JavaScript.

A crawler that can extract only static URLs is generally called a static crawler, and a crawler that can extract dynamic URLs is called a dynamic crawler.

Static URLs can be extracted fairly easily by analyzing the HTML hyperlink tags of a page file. A dynamic URL, however, is in fact just a piece of script code in the page file and may have no HTML markup at all, so the corresponding URL cannot be obtained by analyzing hyperlink tags. This is the greatest weakness of a static crawler: it cannot obtain dynamic URLs.

In view of this, a web crawler technique that can effectively extract dynamic URLs is needed.

Summary of the invention

The technical problem to be solved by the present invention is to provide a page crawler system and method that overcome the inability of the prior art to extract dynamic URLs effectively.

To solve the above technical problem, the present invention provides a page crawler method, comprising:

setting a first deduplication queue;

receiving a target page;

crawling the target page with a static crawler;

treating any uniform resource locator (URL) in the target page that the static crawler cannot analyze as a dynamic URL;

submitting the dynamic URL to the first deduplication queue;

using a dynamic crawler to continue crawling the dynamic URLs in the first deduplication queue.

Preferably, when the first deduplication queue is set, a second deduplication queue is also set;

when the static crawler crawls the target page, the static URLs in the target page are also obtained;

the static URLs are submitted to the second deduplication queue;

the static crawler is used to continue crawling the static URLs in the second deduplication queue.

Preferably, the step of using the dynamic crawler to continue crawling the dynamic URLs in the first deduplication queue comprises: submitting the dynamic URLs obtained to the first deduplication queue, and submitting the static URLs obtained to the second deduplication queue. The step of using the static crawler to continue crawling the static URLs in the second deduplication queue comprises: submitting the dynamic URLs obtained to the first deduplication queue, and submitting the static URLs obtained to the second deduplication queue.

Preferably, the method further comprises:

stopping crawling when the dynamic URLs in the first deduplication queue and the static URLs in the second deduplication queue have all been crawled, or according to a stop condition.

Preferably, the step of setting the first deduplication queue and the second deduplication queue comprises:

setting the first deduplication queue and the second deduplication queue by means of a database or an in-memory linked-list structure.

To solve the above technical problem, the present invention also provides a page crawler system, comprising:

a setting module, configured to set a first deduplication queue;

a receiving module, configured to receive a target page;

a static crawler module, configured to crawl the target page with a static crawler;

a dynamic crawler module, configured to treat any uniform resource locator (URL) in the target page that the static crawler cannot analyze as a dynamic URL, and to use a dynamic crawler to continue crawling the dynamic URLs in the first deduplication queue;

a submission module, configured to submit the dynamic URL to the first deduplication queue.

Preferably, the setting module is further configured to set a second deduplication queue;

the static crawler module is further configured to obtain the static URLs in the target page when the static crawler crawls the target page, and to use the static crawler to continue crawling the static URLs in the second deduplication queue;

the submission module is further configured to submit the static URLs to the second deduplication queue.

Preferably, the dynamic crawler module is configured to use the dynamic crawler to continue crawling the dynamic URLs in the first deduplication queue, submitting the dynamic URLs obtained to the first deduplication queue and the static URLs obtained to the second deduplication queue; the static crawler module is configured to use the static crawler to continue crawling the static URLs in the second deduplication queue, submitting the dynamic URLs obtained to the first deduplication queue and the static URLs obtained to the second deduplication queue.

Preferably, the system further comprises:

a stop module, configured to stop crawling when the dynamic URLs in the first deduplication queue and the static URLs in the second deduplication queue have all been crawled, or according to a stop condition.

Preferably, the setting module is configured to set the first deduplication queue and the second deduplication queue by means of a database or an in-memory linked-list structure.

Compared with the prior art, one embodiment of the present invention effectively overcomes the inability of the prior art to extract dynamic URLs. Another embodiment effectively combines the page search technique for static URLs written in a static script language with the page search technique for dynamic URLs written in a dynamic scripting language. Both the static URLs and the dynamic URLs in a page can thus be extracted effectively and searched with static crawling and dynamic crawling respectively, which improves page search efficiency and performance and helps to maintain the secure use of web pages.

Other features and advantages of the present invention are set forth in the description that follows; some will become apparent from the description, and others can be understood by practising the invention. The objects and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the description, the claims and the accompanying drawings.

Brief description of the drawings

The accompanying drawings provide a further understanding of the present invention and form part of the description. Together with the embodiments of the invention they serve to explain the invention and do not limit it. In the drawings:

Fig. 1 is a schematic diagram of a static crawler flow in the prior art;

Fig. 2 is a schematic diagram of the composition of a system embodiment of the present invention;

Fig. 3 is a schematic flowchart of a method embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to the drawings and examples, so that the way in which the invention applies technical means to solve the technical problem and achieve its technical effect can be fully understood and put into practice.

First, provided they do not conflict, the embodiments of the present invention and the features within them may be combined with one another, and all such combinations fall within the scope of protection of the invention. In addition, the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.

Fig. 1 is a schematic diagram of a static crawler flow in the prior art. As shown in Fig. 1, the static crawler flow mainly comprises the following steps:

Step S110: receive a page URL and initialize the static crawler, including setting the number of threads, the crawl depth, the maximum URL length, and so on;

Step S120: parse the DOM tree of the page at that URL, find the URL-bearing tags that match target tags such as <a> and <img>, and add the attributes of those tags (such as href="..." and src="...") to a link array; the link array stores the URL links contained in one page;

Step S130: add the link array to the deduplication queue. The deduplication queue is introduced to prevent pages from being crawled repeatedly, which could leave the crawl of a site never finishing. The deduplication queue is a linked-list queue structure: when a new URL is added, it is compared with the elements (URLs) already in the queue to see whether it is present. If it is present, the new URL is not enqueued; if it is not, the new URL is appended to the tail of the queue;

Step S140: determine whether the deduplication queue still contains unprocessed URLs; if so, go to step S150; otherwise, go to step S160;

Step S150: fetch the page of the next URL in the deduplication queue and return to step S120;

Step S160: record the URL data in the deduplication queue and finish.
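The prior-art flow of steps S110 to S160 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the names `LinkExtractor`, `static_crawl` and `SITE` are hypothetical, pages come from an in-memory dictionary instead of HTTP requests, and a Python set plus a deque stands in for the linked-list deduplication queue.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href/src values from URL-bearing tags (step S120)."""

    URL_ATTRS = {"a": "href", "img": "src", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.links = []  # the "link array" of one page

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def static_crawl(start_url, fetch, max_pages=100):
    """Crawl from start_url; fetch(url) returns HTML text or None."""
    seen = {start_url}           # every URL ever enqueued (deduplication)
    queue = deque([start_url])   # URLs not yet processed, in FIFO order
    visited = []
    while queue and len(visited) < max_pages:   # S140
        url = queue.popleft()                   # S150
        visited.append(url)
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)                       # S120
        for link in parser.links:               # S130
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited                              # S160


# Hypothetical in-memory site used in place of real HTTP requests.
SITE = {
    "/index": '<a href="/a">A</a><img src="/logo.png"><a href="/b">B</a>',
    "/a": '<a href="/index">home</a><a href="/b">B</a>',
    "/b": '<a href="/a">A</a>',
}

result = static_crawl("/index", SITE.get)
print(result)
```

Because every URL is checked against `seen` before it is enqueued, the mutual links between `/index`, `/a` and `/b` do not cause the crawl to loop.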

The core idea of the present invention is to parse the target page to be analyzed, to treat the URLs in the target page that the static crawler cannot analyze (for example, URLs written in a dynamic scripting language such as JavaScript) as dynamic URLs, and to obtain the static URLs produced by HTML hyperlink tags. When the static crawler, while analyzing a page, encounters a URL that it cannot extract (that is, a dynamic URL), it submits the current page (the page the static crawler is analyzing) to the dynamic crawler. It then continues analyzing the current page, extracting the static URLs that it can analyze itself and marking them. When the dynamic crawler analyzes the current page, it simulates user operations such as clicking and entering characters in order to analyze the URLs that the static crawler could not analyze (the unmarked URLs in the current page), and thereby obtains the corresponding dynamic URLs.

The operation of the dynamic crawler is the crawling of dynamic URLs written in a dynamic scripting language; the operation of the static crawler is the crawling of static URLs produced by HTML hyperlink tags. Dynamic crawling includes, for example, using a DOM control recognizer to simulate a user's click (on a button control), text input (on an input control) or selection (on a select control), or clicking a hyperlink generated by JavaScript (such as <a href="javascript...">), and obtaining the URL that results from the operation. The main operation of the static crawler is, for example, to find the tags that may contain hyperlinks (such as a, img and iframe), extract their attributes (such as href="..." and src="...") and add them to the deduplication queue.
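How a static crawler might separate the URLs it can extract from the signs that a page also needs the dynamic crawler can be sketched as below. The detection rules are assumptions made for illustration (a `javascript:` href, an `onclick` attribute, or a button, input or select control); the patent does not specify them, and the names `PageClassifier` and `classify` are hypothetical.

```python
from html.parser import HTMLParser


class PageClassifier(HTMLParser):
    """Separates statically extractable URLs from signs of dynamic URLs."""

    STATIC_TAGS = {"a": "href", "img": "src", "iframe": "src"}
    INTERACTIVE_TAGS = {"button", "input", "select"}

    def __init__(self):
        super().__init__()
        self.static_urls = []       # URLs the static crawler extracts and marks
        self.needs_dynamic = False  # True if the page should go to the dynamic crawler

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Assumed heuristic: controls a user can operate may produce dynamic URLs.
        if tag in self.INTERACTIVE_TAGS or "onclick" in attr_map:
            self.needs_dynamic = True
        value = attr_map.get(self.STATIC_TAGS.get(tag, ""))
        if not value:
            return
        if value.strip().lower().startswith("javascript:"):
            self.needs_dynamic = True   # script-generated link, not a static URL
        else:
            self.static_urls.append(value)


def classify(html):
    """Return (static URLs, whether the dynamic crawler is needed)."""
    parser = PageClassifier()
    parser.feed(html)
    return parser.static_urls, parser.needs_dynamic


page = (
    '<a href="/news">News</a>'
    '<a href="javascript:go(1)">Next</a>'
    '<button onclick="load()">More</button>'
    '<img src="/logo.png">'
)
static_urls, needs_dynamic = classify(page)
print(static_urls, needs_dynamic)
```

Resolving the dynamic URLs themselves would require executing the page's scripts in a browser engine, which this sketch deliberately leaves out.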

Fig. 2 is a schematic diagram of the composition of a system embodiment of the present invention. As shown in Fig. 2, the system embodiment mainly comprises a setting module 210, a receiving module 220, a static crawling module 230, a dynamic crawling module 240, a submission module 250 and a stop module 260, wherein:

the setting module 210 sets a first deduplication queue and a second deduplication queue;

the receiving module 220 receives a target page;

the static crawling module 230, connected to the setting module 210 and the receiving module 220, crawls the target page with a static crawler to obtain the static uniform resource locators (URLs) produced by HTML hyperlink tags in the target page, and also uses the static crawler to crawl the static URLs in the second deduplication queue, obtaining static URLs and dynamic URLs;

the dynamic crawling module 240, connected to the setting module 210 and the receiving module 220, crawls the target page with a dynamic crawler to obtain dynamic URLs, and also uses the dynamic crawler to crawl the dynamic URLs in the first deduplication queue, obtaining static URLs and dynamic URLs; if the target page contains URLs that the static crawling module 230 cannot crawl, this indicates that the target page contains dynamic URLs written in a dynamic scripting language;

the submission module 250, connected to the setting module 210, the static crawling module 230 and the dynamic crawling module 240, submits the static URLs to the second deduplication queue and the dynamic URLs to the first deduplication queue;

the stop module 260, connected to the static crawling module 230 and the dynamic crawling module 240, stops crawling when the static URLs in the second deduplication queue and the dynamic URLs in the first deduplication queue have all been crawled, or according to a stop condition.

In this system embodiment the submission module 250 is a standalone module. In other embodiments of the invention it may instead be integrated as a component into the static crawling module 230 and the dynamic crawling module 240 respectively.

The static crawling module 230 crawls in the order in which the submission module 250 submitted the static URLs to the second deduplication queue. That is, a static URL that entered the second deduplication queue earlier is crawled earlier, and one that entered later is crawled later.

The dynamic crawling module 240 crawls in the order in which the submission module 250 submitted the dynamic URLs to the first deduplication queue. That is, a dynamic URL that entered the first deduplication queue earlier is crawled earlier, and one that entered later is crawled later.

When submitting a static URL, the submission module 250 checks whether the second deduplication queue already contains the same static URL, and if so does not submit it. Likewise, when submitting a dynamic URL, the submission module 250 checks whether the first deduplication queue already contains the same dynamic URL, and if so does not submit it.

In practice, static crawling and dynamic crawling generally run at the same time.

Fig. 3 is a schematic flowchart of a method embodiment of the present invention. With reference to the system embodiment shown in Fig. 2, the method embodiment shown in Fig. 3 mainly comprises the following steps:

Step S310: set a first deduplication queue and a second deduplication queue;

Step S320: receive a target page;

Step S330: crawl the target page with a static crawler and obtain the static uniform resource locators (URLs) produced by HTML hyperlink tags in the target page;

Step S340: submit the static URLs to the second deduplication queue;

Step S350: if, during step S330, the static crawler encounters a URL that it cannot crawl, this indicates that the target page contains dynamic URLs written in a dynamic scripting language; in that case, crawl the target page with a dynamic crawler and obtain the dynamic URLs in the target page;

Step S360: submit the dynamic URLs to the first deduplication queue;

Step S370: using the static crawler and the dynamic crawler respectively, continue crawling the static URLs in the second deduplication queue and the dynamic URLs in the first deduplication queue. In both cases, the static URLs obtained are submitted to the second deduplication queue and the dynamic URLs obtained are submitted to the first deduplication queue. This continues until the static URLs in the second deduplication queue and the dynamic URLs in the first deduplication queue have all been crawled, or until crawling is stopped according to a stop condition.
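Steps S310 to S370 can be sketched as a loop over two deduplication queues. The sketch makes several simplifying assumptions that are not in the patent: a hypothetical `extract(url)` callback returns the static and dynamic URLs of a page in one call, the two crawlers take turns instead of running concurrently, and `max_pages` stands in for the unspecified stop condition.

```python
from collections import deque


class DedupQueue:
    """FIFO queue that silently rejects URLs it has already accepted."""

    def __init__(self):
        self._seen = set()
        self._pending = deque()

    def submit(self, url):
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def pop(self):
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)


def hybrid_crawl(target, extract, max_pages=1000):
    """extract(url) -> (static_urls, dynamic_urls); hypothetical callback."""
    dynamic_q = DedupQueue()   # first deduplication queue   (S310)
    static_q = DedupQueue()    # second deduplication queue  (S310)
    crawled = []

    def process(url, crawler):
        static_urls, dynamic_urls = extract(url)
        crawled.append((crawler, url))
        for u in static_urls:      # S340
            static_q.submit(u)
        for u in dynamic_urls:     # S360
            dynamic_q.submit(u)

    static_q.submit(target)        # S320: the target page is crawled statically first
    # S370: keep going until both queues are empty or the stop condition is met.
    while (static_q or dynamic_q) and len(crawled) < max_pages:
        url = static_q.pop()
        if url is not None:
            process(url, "static")
        url = dynamic_q.pop()
        if url is not None:
            process(url, "dynamic")
    return crawled


# Hypothetical site: url -> (static URLs, dynamic URLs) found on that page.
SITE = {
    "/": (["/s1"], ["/d1"]),
    "/s1": (["/s2", "/"], []),
    "/d1": (["/s1"], ["/d2"]),
}

crawled = hybrid_crawl("/", lambda u: SITE.get(u, ([], [])))
print(crawled)
```

Each crawler feeds both queues, so a static URL discovered during dynamic crawling (here `/s1` on `/d1`) is still handled by the cheaper static crawler, and it is crawled only once.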

In step S310, the second deduplication queue and the first deduplication queue may be set by means of a database or by means of an in-memory linked-list structure. A deduplication queue built on an in-memory linked list performs better than one built on a database, because data operations in memory are faster than database operations, and memory is always local whereas the database may be either local or remote. Given a reasonably designed algorithm, therefore, an in-memory deduplication queue outperforms a database deduplication queue.

With a database-backed deduplication queue, each newly parsed URL is submitted to a database table: if the URL is already in the table it is not submitted; otherwise it is appended to the end of the table. After a URL has been crawled, the first record in the table that has not yet been crawled becomes the next URL to crawl.
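A database-backed deduplication queue of this kind might look like the following sketch, which uses SQLite from the Python standard library. The table layout (`urls` with a `crawled` flag), the class name `DbDedupQueue` and the use of a UNIQUE constraint for the duplicate check are assumptions for illustration; the patent does not name a database or a schema.

```python
import sqlite3


class DbDedupQueue:
    """Deduplication queue stored in a database table (sketch)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " url TEXT UNIQUE NOT NULL,"
            " crawled INTEGER NOT NULL DEFAULT 0)"
        )

    def submit(self, url):
        """Append url to the table unless it is already there."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,)
        )
        self.conn.commit()
        return cur.rowcount == 1   # 0 rows changed means a duplicate was ignored

    def next_url(self):
        """Return the first record not yet crawled and mark it as crawled."""
        row = self.conn.execute(
            "SELECT id, url FROM urls WHERE crawled = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.conn.execute("UPDATE urls SET crawled = 1 WHERE id = ?", (row[0],))
        self.conn.commit()
        return row[1]


db_queue = DbDedupQueue()
first = db_queue.submit("/a")
second = db_queue.submit("/b")
duplicate = db_queue.submit("/a")
print(first, second, duplicate, db_queue.next_url(), db_queue.next_url())
```

Crawled rows are kept, not deleted, so a URL that was crawled earlier is still rejected if it is submitted again.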

With a deduplication queue built on an in-memory linked list, each newly parsed URL is submitted to the deduplication linked list: if the URL is already in the list it is not submitted; otherwise it is appended to the tail of the list. After a URL has been crawled, the URL in the first node of the list that has not yet been crawled becomes the next URL to crawl.
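The in-memory variant can be sketched as a singly linked list with a cursor pointing at the first node not yet crawled. The names `Node`, `LinkedDedupQueue` and `cursor` are hypothetical, and the linear duplicate scan follows the description literally; pairing the list with a hash set would make the check constant-time, but that is a design choice beyond what the text states.

```python
class Node:
    def __init__(self, url):
        self.url = url
        self.crawled = False
        self.next = None


class LinkedDedupQueue:
    """Deduplication queue kept as an in-memory linked list (sketch)."""

    def __init__(self):
        self.head = None
        self.tail = None
        self.cursor = None   # first node that has not been crawled yet

    def submit(self, url):
        node = self.head
        while node is not None:        # compare with every URL already in the list
            if node.url == url:
                return False           # duplicate: do not enqueue
            node = node.next
        new = Node(url)
        if self.tail is None:
            self.head = self.tail = new
        else:
            self.tail.next = new       # append at the tail
            self.tail = new
        if self.cursor is None:
            self.cursor = new
        return True

    def next_url(self):
        if self.cursor is None:
            return None
        node = self.cursor
        node.crawled = True
        self.cursor = node.next
        return node.url


mem_queue = LinkedDedupQueue()
accepted = [mem_queue.submit(u) for u in ("/a", "/b", "/a")]
order = [mem_queue.next_url(), mem_queue.next_url(), mem_queue.next_url()]
print(accepted, order)
```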

Steps S340 and S350 need not be executed in a strict order; the static crawler and the dynamic crawler may crawl at the same time.

The system embodiment shown in Fig. 2 and the method embodiment shown in Fig. 3 are illustrated below with a simple example.

Suppose page P contains four URLs, a, b, c and d, where:

a and b are dynamic URLs generated by a dynamic scripting language, and a contains two URLs, e and f, where e is a dynamic URL generated by a dynamic scripting language and f is a static URL generated by a static script language;

c and d are static URLs generated by a static script language, and c contains two URLs, g and h, where g is a dynamic URL generated by a dynamic scripting language and h is a static URL generated by a static script language.

After dynamic crawling and static crawling are started, page P is received and parsed. The static URLs obtained from page P are c and d, and the dynamic URLs are a and b. The static URLs c and d are submitted to the second deduplication queue, and the dynamic URLs a and b are submitted to the first deduplication queue. When the static crawler crawls c, it obtains g (a dynamic URL) and h (a static URL); g is submitted to the first deduplication queue and h to the second deduplication queue. When the dynamic crawler crawls a, it obtains e (a dynamic URL) and f (a static URL); e is submitted to the first deduplication queue and f to the second deduplication queue.

Note that if e is submitted to the first deduplication queue earlier than g, e is crawled dynamically first (if e is submitted later than g, g is crawled dynamically first). Likewise, if f is submitted to the second deduplication queue earlier than h, f is crawled statically first (if f is submitted later than h, h is crawled statically first). In other words, dynamic crawling proceeds in the order in which the dynamic URLs were submitted to the first deduplication queue, and static crawling proceeds in the order in which the static URLs were submitted to the second deduplication queue.
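The example with page P can be traced with a short script. The interleaving is an assumption chosen for illustration: the two crawlers take turns, with the static crawler going first, so c is crawled before a and therefore g and h are submitted before e and f. With concurrent crawlers the submission order, and hence the crawl order within each queue, could differ.

```python
from collections import deque

# Links of each page in the example: (static URLs, dynamic URLs).
LINKS = {
    "P": (["c", "d"], ["a", "b"]),
    "a": (["f"], ["e"]),
    "c": (["h"], ["g"]),
}

first_queue, second_queue = deque(), deque()   # dynamic URLs, static URLs
seen_dynamic, seen_static = set(), set()
order = []                                     # (crawler, url) in crawl order


def submit(static_urls, dynamic_urls):
    """Submit URLs to their queues, skipping any that were submitted before."""
    for u in static_urls:
        if u not in seen_static:
            seen_static.add(u)
            second_queue.append(u)
    for u in dynamic_urls:
        if u not in seen_dynamic:
            seen_dynamic.add(u)
            first_queue.append(u)


submit(*LINKS["P"])
while first_queue or second_queue:
    if second_queue:                           # static crawler's turn
        url = second_queue.popleft()
        order.append(("static", url))
        submit(*LINKS.get(url, ([], [])))
    if first_queue:                            # dynamic crawler's turn
        url = first_queue.popleft()
        order.append(("dynamic", url))
        submit(*LINKS.get(url, ([], [])))

print(order)
```

Under this interleaving g enters the first deduplication queue before e and h enters the second deduplication queue before f, so g is crawled before e and h before f, which is the first-submitted, first-crawled rule described above.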

The present invention merges the crawling of dynamic URLs and the crawling of static URLs into a hybrid crawl that combines the two. The dynamic URLs and the static URLs in a page are crawled separately, and the URLs found by each crawl are again divided into dynamic and static, which preserves crawler efficiency and improves crawling capacity. The dynamic crawler function and the static crawler function of the invention can be deployed together or in a distributed manner, which makes crawling more flexible.

Although a dynamic crawler can extract both dynamic URLs and static URLs, it does so by simulating user operations, so it is difficult to deploy on its own in products that require high efficiency (deployed alone in such products, it is generally close to useless). Of course, if a project or requirement places no demands on crawler performance, the dynamic crawler can also be deployed on its own.

The dynamic crawler is very powerful, but because it requires IE rendering its performance is low. High performance, by contrast, is the strength of the static crawler. Combining the two crawlers therefore lets each contribute its strength: the static crawler handles static links and the dynamic crawler handles dynamic links, which improves performance and extends functionality.

The two crawlers interact as follows. When the dynamic crawler finds a static URL, it hands the URL to the static crawler and continues processing the dynamic URLs it finds itself. When the static crawler finds a dynamic URL, it hands the URL to the dynamic crawler and continues processing the static URLs it finds itself.

In a concrete application example of the present invention, one task scans many sites. If the URLs of all sites were stored in just two tables, one for static URLs and one for dynamic URLs, queries could be slow and management inconvenient for the control center (or any other module that uses the crawler results). Each site is therefore assigned its own subtask, and each time a subtask is executed four tables are generated with randomly generated names. The names are recorded in the Tablelist table, in the fields for crawler control (controlofspider), static URLs (urlofstaticspider), dynamic URLs (urlofdynamicspider) and host list (hostlist). The four tables are the crawler interaction control table, the static page table, the dynamic page table and the hostlist table.

The crawler interaction control table controls whether the crawlers have finished. It has two fields, the static crawler completion flag (StaticSatus) and the dynamic crawler completion flag (DynamicSatus). When the static crawler has finished crawling the static page table, it sets StaticSatus=0 and checks the value of DynamicSatus; if DynamicSatus=0, the scan is complete. Otherwise, it sets StaticSatus=1 and continues parsing URLs. When the dynamic crawler has finished crawling the dynamic page table, it sets DynamicSatus=0 and checks the value of StaticSatus; if StaticSatus=0, the scan is complete. Otherwise, it sets DynamicSatus=1 and continues parsing URLs.
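One possible reading of this completion protocol is sketched below. It rests on an interpretation the text does not spell out: a crawler sets its flag to 0 when it has no pending URLs and back to 1 when new URLs arrive, and the scan is complete only when both flags are 0. The function name `update_status` and the `pending_count` parameter are hypothetical; the field names StaticSatus and DynamicSatus are taken from the text as written.

```python
# Crawler interaction control table, reduced to one in-memory record.
# Assumed meaning of the flags: 1 = still crawling, 0 = finished.
control = {"StaticSatus": 1, "DynamicSatus": 1}


def update_status(control, own_flag, other_flag, pending_count):
    """Update this crawler's flag; return True when the whole scan is complete."""
    if pending_count > 0:
        control[own_flag] = 1          # new URLs arrived: keep crawling
        return False
    control[own_flag] = 0              # this crawler's page table is drained
    return control[other_flag] == 0    # complete only if the other crawler is done too


steps = [
    update_status(control, "StaticSatus", "DynamicSatus", 0),  # static idle, dynamic busy
    update_status(control, "DynamicSatus", "StaticSatus", 2),  # dynamic still has work
    update_status(control, "StaticSatus", "DynamicSatus", 1),  # dynamic handed over a URL
    update_status(control, "DynamicSatus", "StaticSatus", 0),  # dynamic idle, static busy
    update_status(control, "StaticSatus", "DynamicSatus", 0),  # both idle: scan complete
]
print(steps, control)
```

In a real deployment the flags would live in the database table and the check-and-set would need to be atomic; the sketch ignores concurrency.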

The static page table stores the static URLs found by the dynamic crawler, which wait there for the static crawler to crawl them. After the static crawler has crawled all the URLs in its own deduplication queue, it accesses the static page table and looks for records whose Status is 0; if there are any, it adds them to its deduplication queue and continues crawling. If there are none, the static crawler has finished for the time being. Whether the static crawler has really crawled all static URLs depends on whether new, uncrawled static URLs later appear in the static page table; if they do, crawling continues.

The dynamic page table stores the dynamic URLs found by the static crawler, which wait there for the dynamic crawler to crawl them. After the dynamic crawler has crawled all the URLs in its own deduplication queue, it accesses the dynamic page table and looks for records whose Status is 0; if there are any, it adds them to its deduplication queue and continues crawling. If there are none, the dynamic crawler has finished for the time being. Whether the dynamic crawler has really crawled all dynamic URLs depends on whether new, uncrawled dynamic URLs later appear in the dynamic page table; if they do, crawling continues.
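The hand-off through a page table can be sketched as follows. The table is modelled as a list of records in memory, and marking a record with Status 1 once it has been picked up is an assumption; the text only says that records with Status 0 are the ones waiting. The function name `pull_pending` is hypothetical.

```python
from collections import deque


def pull_pending(page_table, seen, queue):
    """Move records with Status == 0 from a hand-off table into a crawler's own queue."""
    moved = 0
    for record in page_table:
        if record["Status"] == 0:
            record["Status"] = 1           # assumed marker for "picked up"
            if record["url"] not in seen:  # the crawler's own deduplication check
                seen.add(record["url"])
                queue.append(record["url"])
                moved += 1
    return moved


# Static page table: static URLs found by the dynamic crawler.
static_page_table = [
    {"url": "/f", "Status": 0},
    {"url": "/h", "Status": 1},   # already picked up earlier
    {"url": "/k", "Status": 0},   # waiting, but the static crawler already knows it
]
static_seen = {"/k"}
static_queue = deque()

moved_first = pull_pending(static_page_table, static_seen, static_queue)
moved_again = pull_pending(static_page_table, static_seen, static_queue)
print(moved_first, moved_again, list(static_queue))
```

A second call that moves nothing corresponds to the crawler having finished for the time being; it would poll again later in case the other crawler has added new records.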

Note that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Those skilled in the art will also understand that the modules and steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of several computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not restricted to any specific combination of hardware and software.

Although the embodiments of the present invention are disclosed above, the content described is only an embodiment adopted to make the invention easier to understand and is not intended to limit the invention. Any person skilled in the art to which the invention belongs may make modifications and changes to the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention remains as defined by the appended claims.

Claims

1. A page crawler method, characterized by comprising:

setting a first deduplication queue;

receiving a target page;

crawling the target page with a static crawler;

submitting the dynamic URL to the first deduplication queue;

2. The method according to claim 1, characterized in that:

when the first deduplication queue is set, a second deduplication queue is also set;

3. The method according to claim 2, characterized in that:

the step of using the dynamic crawler to continue crawling the dynamic URLs in the first deduplication queue comprises: submitting the dynamic URLs obtained to the first deduplication queue, and submitting the static URLs obtained to the second deduplication queue;

the step of using the static crawler to continue crawling the static URLs in the second deduplication queue comprises: submitting the dynamic URLs obtained to the first deduplication queue, and submitting the static URLs obtained to the second deduplication queue.

4. The method according to claim 2, characterized in that the method further comprises:

5. The method according to claim 2 or 4, characterized in that the step of setting the first deduplication queue and the second deduplication queue comprises:

6. A page crawler system, characterized by comprising:

a receiving module, configured to receive a target page;

7. The system according to claim 6, characterized in that:

the setting module is further configured to set a second deduplication queue;

8. The system according to claim 7, characterized in that:

the dynamic crawler module is configured to use the dynamic crawler to continue crawling the dynamic URLs in the first deduplication queue, submitting the dynamic URLs obtained to the first deduplication queue and the static URLs obtained to the second deduplication queue;

the static crawler module is configured to use the static crawler to continue crawling the static URLs in the second deduplication queue, submitting the dynamic URLs obtained to the first deduplication queue and the static URLs obtained to the second deduplication queue.

9. The system according to claim 6, characterized in that the system further comprises:

10. The system according to claim 7 or 9, characterized in that:

the setting module is configured to set the first deduplication queue and the second deduplication queue by means of a database or an in-memory linked-list structure.