CN102982161A

CN102982161A - Method and device for acquiring webpage information

Info

Publication number: CN102982161A
Application number: CN2012105168743A
Authority: CN
Inventors: 徐锐波; 路轶
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Priority date: 2012-12-05
Filing date: 2012-12-05
Publication date: 2013-03-20

Abstract

The invention discloses a method and a device for acquiring web page information. The method comprises: a crawling step of crawling web pages from a site server; a page information parsing step of extracting specified page information from designated positions of the web pages according to a preset page extraction rule; and a storing step of storing the specified page information in a structured manner. With the method and device provided by the invention, after web pages are crawled from the site server, the information of the entire web pages is not stored directly; instead, specified page information is extracted from designated positions of the web pages according to the page extraction rule and stored in a structured manner. The page extraction rule can be customized according to user requirements, and because the web page information is parsed, the requirement for customized extraction of web page information is met.

Description

Method and device for acquiring web page information

Technical field

The present invention relates to the technical field of computer networks, and in particular to a method and a device for acquiring web page information.

Background technology

A web crawler (also known as a web spider or web robot, and in some communities more often called a web chaser) is a program or script that automatically acquires web page content. It is an important component of a search engine, and search engine optimization is to a large extent optimization aimed at web crawlers.

Web crawlers are generally divided into traditional crawlers and focused crawlers. A traditional crawler starts from the URLs (Uniform Resource Locators) of one or several initial pages and obtains the URLs on those initial pages; while crawling web pages, it continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is satisfied. The workflow of a focused crawler is more complicated: it filters out links irrelevant to the topic according to a web page analysis algorithm, keeps the useful links, and puts them into the queue of URLs waiting to be crawled; it then selects from the queue, according to a search strategy, the URL of the web page to crawl next, and repeats this process until a system condition is reached. In addition, all web pages fetched by the crawler are stored by the system, analyzed and filtered to some extent, and indexed for later query and retrieval.

Both kinds of web crawler obtain the information of the entire web page and then store it directly. Such crawlers do not parse the web page information and therefore cannot meet the requirement for customized extraction of web page information.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a method for acquiring web page information, and a corresponding device for acquiring web page information, that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, a method for acquiring web page information is provided, comprising:

a crawling step of crawling web pages from a site server;

a page information parsing step of extracting specified page information from a designated position of the web page according to a preset page extraction rule; and

a storing step of storing the specified page information in a structured manner.

According to another aspect of the present invention, a device for acquiring web page information is provided, comprising:

a web page grabber, adapted to crawl web pages from a site server;

a page information parser, adapted to extract specified page information from a designated position of the web page according to a preset page extraction rule; and

an action processor, adapted to store the specified page information in a structured manner.

With the method and device for acquiring web page information provided by the present invention, after a web page is crawled from the site server, the information of the entire web page is not stored directly; instead, specified page information is extracted from a designated position of the web page according to the page extraction rule, and this specified page information is stored in a structured manner. The page extraction rule can be customized according to user requirements, and because the web page information is parsed, the requirement for customized extraction of web page information is met.

The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objects, features, and advantages of the present invention become more apparent, specific embodiments of the present invention are set forth below.

Description of drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:

Fig. 1 shows a flowchart of a method for acquiring web page information according to an embodiment of the present invention;

Fig. 2 shows a structural block diagram of a device for acquiring web page information according to an embodiment of the present invention; and

Fig. 3 shows a structural block diagram of a system for acquiring web page information according to an embodiment of the present invention.

Embodiment

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.

Fig. 1 shows a flowchart of a method 100 for acquiring web page information according to an embodiment of the present invention. As shown in Fig. 1, method 100 starts at step S101, the crawling step, in which web pages are crawled from a site server. The crawler system may crawl web pages from the site server in the following three ways: 1) Downloading web pages directly from the site server; this method may be adopted for websites whose anti-crawl strategy permits it. 2) Downloading web pages from the site server by browser rendering; because some websites use AJAX (Asynchronous JavaScript and XML), browser rendering is needed to obtain the complete page structure. The crawler system is equipped with rendering modules for several kernels, such as the IE kernel, the Gecko (Firefox) kernel, and the Chrome kernel. 3) To prevent the crawler system from accessing a site server so frequently that the site server bans its IP address, the crawler system can download web pages from the site server through a proxy server; downloading through a proxy server helps ensure the timeliness and continuity of crawling. These three methods can basically solve the crawling problem for various types of websites.

Subsequently, method 100 proceeds to step S102, the page information parsing step, in which specified page information is extracted from a designated position of the web page according to a preset page extraction rule. The crawler system analyzes the page structure of each web page and extracts the specified page information according to the page extraction rule. The page extraction rule is customized and can be configured manually. Optionally, the page extraction rule sets the HTML tags before and after the designated position. Because the effective information of a page is always inside HTML tags, the designated position is generally also delimited by HTML tags: it is defined by the HTML tags before and after it, and the content at the designated position is the specified page information to be extracted. For example, for a web page from a certain site server, if the "game name" field of the web page is to be extracted, the customized page extraction rule should include the HTML tags <div> before and after this field. When the crawler system analyzes the web page, it extracts the information between the two <div> tags, namely the "game name".
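The tag-delimited rule described above can be sketched as follows. This is a minimal illustration that assumes a rule is simply a pair of literal strings preceding and following the designated position; the names `ExtractionRule` and `extract_field` and the sample markup are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractionRule:
    """Hypothetical rule: a field name plus the markup before and after it."""
    field: str
    before: str  # HTML that immediately precedes the wanted text
    after: str   # HTML that immediately follows the wanted text


def extract_field(html: str, rule: ExtractionRule) -> Optional[str]:
    """Return the text between rule.before and rule.after, or None if absent."""
    start = html.find(rule.before)
    if start == -1:
        return None
    start += len(rule.before)
    end = html.find(rule.after, start)
    if end == -1:
        return None
    return html[start:end].strip()


# Sample page and rule (both invented for illustration).
page = '<html><body><div class="name">Space Quest</div><p>other</p></body></html>'
rule = ExtractionRule(field="game_name", before='<div class="name">', after="</div>")
game_name = extract_field(page, rule)
```

Plain string search keeps the rule easy to configure by hand, which matches the manual configuration mentioned above; a production system might prefer a real HTML parser for robustness.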

For a web page that links to a download file (for example, a software package), the specified page information extracted from it generally includes the download file link and, optionally, the parent page link of the web page. These links are extracted so that the corresponding download file can subsequently be downloaded according to them. The parent page link is used for tracing the source: when the download file is downloaded, its source, including the parent page or website, can also be found, which facilitates subsequent maintenance of the data and supports corresponding query functions.

Further, the crawler system can crawl web pages from the site server in two modes: a full crawl mode and an incremental crawl mode. Which mode is used depends on the requirements. For example, a new game website server may contain many new games; in that case all web pages of the site server need to be traversed, that is, a full crawl is performed to fetch all games, and unified processing (page information parsing and storage) is done afterwards. After all games on the game website server have been crawled, the site server still adds new games every day; in that case the incremental crawl mode is needed to fetch the games updated each day.

For a site server in full crawl mode, a one-time task is delivered, that is, the web pages of the site server are crawled once. The task scheduler is first notified of the name of the site server to be crawled; the task scheduler looks up the crawl rules of that site server itself, and the full crawl can then be carried out. The task scheduler delivers the crawl task to a specific worker process, and the crawl task performed may include the following. First, the initial page is crawled from the site server. The initial page is parsed to obtain the URLs of the new web pages that the initial page links to. The new web pages are crawled from the site server according to their URLs. Starting from the initial page, a typical site server has ten or more layers of recursion; the task scheduler starts crawling from the initial page and recursively crawls deeper web pages according to the links in each page. That is, a full-crawl recursion sub-step is then performed: the new web pages are parsed to obtain the URLs of the further new web pages they link to, and those further new web pages are crawled from the site server. This full-crawl recursion sub-step is repeated until the crawl stop condition is satisfied. Usually crawling only the first few layers of web pages is enough to meet the requirements, so the crawler system can set a recursion depth for each site server; the crawl stop condition is satisfied once the recursion has reached the set depth for that site server. After the web pages of a site server have been fully crawled, they are processed together, which includes extracting specified page information from the designated positions of the crawled initial page and all new web pages according to the preset page extraction rule.
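A depth-limited full crawl of this kind might look like the sketch below. It is an illustration under stated assumptions, not the patented implementation: the site is simulated by an in-memory dictionary, `fetch` is an injected callable, the names `LinkParser`, `parse_links`, and `full_crawl` are hypothetical, and an iterative layer-by-layer loop is used in place of literal recursion.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def parse_links(html: str) -> List[str]:
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def full_crawl(fetch: Callable[[str], str], start_url: str, max_depth: int) -> Dict[str, str]:
    """Fetch the initial page, then follow links for max_depth further layers."""
    pages: Dict[str, str] = {}
    frontier = [start_url]
    for _depth in range(max_depth + 1):  # depth 0 is the initial page
        next_frontier: List[str] = []
        for url in frontier:
            if url in pages:             # skip pages that were already crawled
                continue
            html = fetch(url)
            pages[url] = html
            next_frontier.extend(parse_links(html))
        frontier = next_frontier
    return pages


# Simulated site server (assumption: real code would issue HTTP requests).
SITE = {
    "/": '<a href="/a">A</a><a href="/b">B</a>',
    "/a": '<a href="/c">C</a>',
    "/b": '<a href="/">home</a>',
    "/c": '<a href="/d">D</a>',
    "/d": "leaf",
}
pages = full_crawl(lambda url: SITE.get(url, ""), "/", max_depth=2)
```

With `max_depth=2` the page `/d` is never fetched, which mirrors the stop condition based on a configured recursion depth. Extraction would then run over every entry of `pages` in one unified pass.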

For a site server in incremental crawl mode, periodic task scheduling is performed, that is, the web pages of the site server are crawled according to the scheduling period that the crawler system sets for that site server. The scheduling period may differ between site servers, for example 1 hour for some and 3 hours for others, depending on how quickly the site server is updated. The crawler system sorts the site servers that need incremental crawling by scheduling period to form a scheduling queue and checks this queue at a preset interval (for example, every 10 minutes); a site server whose scheduling time is greater than the current time is treated as a site server to be crawled. The task scheduler then delivers the crawl task to a specific worker process. In the worker process, the steps performed may include the following. First, the initial page is crawled from the site server. Specified page information is extracted from the designated position of the initial page according to the preset page extraction rule. The initial page is parsed to obtain the URLs of the new web pages it links to. The new web pages are crawled from the site server according to their URLs. Specified page information is extracted from the designated positions of the new web pages according to the preset page extraction rule. An incremental recursion sub-step is then performed: the new web pages are parsed to obtain the URLs of the further new web pages they link to; the further new web pages are crawled from the site server; and specified page information is extracted from the designated positions of the further new web pages according to the preset page extraction rule. This incremental recursion sub-step is repeated until the crawl stop condition is satisfied. The crawler system can set a recursion depth for each site server, and the crawl stop condition is satisfied once the recursion has reached the set depth for that site server. The main difference from the full crawl mode is that the incremental crawl mode parses web pages while crawling them; in addition, the incremental recursion sub-step is performed when the system time reaches the time specified by the scheduling period set for the site server.
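The "parse while crawling" behaviour of the incremental mode can be sketched with a generator that yields extracted information as soon as each page is fetched. This is a hedged illustration: the regular expressions stand in for a real link parser and a configured extraction rule, the site is an in-memory dictionary, and the names `incremental_crawl` and `extract_title` are hypothetical.

```python
import re
from typing import Callable, Iterator, Optional, Tuple

HREF = re.compile(r'href="([^"]+)"')          # simplistic link finder (assumption)
TITLE = re.compile(r"<h1>(.*?)</h1>", re.S)   # stand-in for the extraction rule


def extract_title(html: str) -> Optional[str]:
    """Return the text between <h1> and </h1>, or None if there is none."""
    match = TITLE.search(html)
    return match.group(1).strip() if match else None


def incremental_crawl(
    fetch: Callable[[str], str], start_url: str, max_depth: int
) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (url, extracted info) immediately after each page is fetched."""
    seen = set()
    frontier = [start_url]
    for _depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            if url in seen:
                continue
            seen.add(url)
            html = fetch(url)
            yield url, extract_title(html)    # extraction happens during the crawl
            next_frontier.extend(HREF.findall(html))
        frontier = next_frontier


# Simulated site server with two newly added pages (invented data).
SITE = {
    "/": '<h1>Home</h1><a href="/new1">1</a><a href="/new2">2</a>',
    "/new1": "<h1>Game One</h1>",
    "/new2": "<h1>Game Two</h1>",
}
results = list(incremental_crawl(lambda u: SITE.get(u, ""), "/", max_depth=1))
```

Because results are produced page by page, a caller can store or forward each record without waiting for the whole site to be traversed.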

Optionally, in this method the task scheduler passes crawl tasks through gearman to downstream worker processes for handling. The method uses gearman as an inter-process message queue; communicating between processes through gearman enables parallel scaling and highly concurrent processing. The web pages described above are stored in redis as a sorted set organized by time, and web page monitoring tasks are scheduled precisely by calling the redis interface. Redis is a key-value in-memory database: the whole database is loaded into memory for operation, and the data is periodically flushed to disk by an asynchronous operation. Because it operates purely in memory, redis performs very well and can handle more than 100,000 read and write operations per second, which improves the performance of the crawler system.
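To illustrate what a time-ordered sorted set offers, the sketch below uses a small in-memory stand-in rather than a redis client, so that it runs without a server. The class `TimeOrderedSet` and its methods are hypothetical; they loosely correspond to the redis sorted-set commands ZADD and ZRANGEBYSCORE. Which score range counts as "due" is left to the caller.

```python
import bisect
from typing import Dict, List, Tuple


class TimeOrderedSet:
    """In-memory stand-in for a sorted set: unique members ordered by score."""

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}
        self._ordered: List[Tuple[float, str]] = []

    def add(self, member: str, score: float) -> None:
        """Insert a member, or move it if it is already present."""
        if member in self._scores:
            self._ordered.remove((self._scores[member], member))
        self._scores[member] = score
        bisect.insort(self._ordered, (score, member))

    def range_by_score(self, low: float, high: float) -> List[str]:
        """Return members with low <= score <= high, in score order."""
        start = bisect.bisect_left(self._ordered, (low, ""))
        return [m for s, m in self._ordered[start:] if s <= high]


# Scores are timestamps in seconds (invented values for illustration).
tasks = TimeOrderedSet()
tasks.add("http://example.com/page-a", 1000.0)
tasks.add("http://example.com/page-b", 4600.0)
tasks.add("http://example.com/page-c", 1800.0)
tasks.add("http://example.com/page-a", 2000.0)  # reschedule page-a
window = tasks.range_by_score(1500.0, 2500.0)
```

Keeping members unique while ordering them by timestamp lets a scheduler reschedule a page with a single call and query any time window cheaply.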

After step S102, method 100 proceeds to step S103, the storing step, in which the specified page information is stored in a structured manner. Structured storage means storing the specified page information together with a structural description of it. For example, the structural description of the "game name" information is "game name", and the structural description of the "download file link" information is "download file link". Optionally, XML (Extensible Markup Language) can be used for structured storage, that is, each item of specified page information is stored in an XML node; this facilitates processing by subsequent modules and also simplifies the system architecture. With structured storage, the user can know exactly what information the crawler system has crawled.
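A minimal sketch of XML-based structured storage, using Python's standard `xml.etree.ElementTree`. The node names `page`, `game_name`, and `download_link`, the function `to_xml_record`, and the sample values are assumptions made for illustration; the source does not specify a schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def to_xml_record(info: Dict[str, str], root_tag: str = "page") -> str:
    """Store each item of specified page information in its own XML node."""
    root = ET.Element(root_tag)
    for name, value in info.items():
        node = ET.SubElement(root, name)  # the node name is the structural description
        node.text = value
    return ET.tostring(root, encoding="unicode")


record = to_xml_record({
    "game_name": "Space Quest",
    "download_link": "http://example.com/files/space-quest.zip",
})
parsed = ET.fromstring(record)  # a downstream module can read the fields back by name
```

Because every value is labelled by its node name, later modules can look up a field directly instead of re-parsing the original page.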

Optionally, after step S103, method 100 proceeds to step S104, in which the related resources of the web page are downloaded from the site server according to the specified page information, and the related resources together with the correspondence between the related resources and the specified page information are further stored. Taking a software package link as the specified page information, the software package can be downloaded from the site server according to the link, and the software package and the correspondence between the software package and its link are further stored. With this method, the crawler system can crawl any information and download files visible on a web page, for example a software package and its related information, such as software name, update time, software size, software author, platform, and software description; it can also crawl resources such as news and pictures from portal websites.
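Step S104 can be sketched as a function that downloads the resource named by the extracted link and records which link produced which stored resource. Everything here is illustrative: `download_related`, the two dictionaries used as stores, the injected `fetch_bytes` callable, and the choice of a SHA-256 digest as the storage key are assumptions, not details from the source.

```python
import hashlib
from typing import Callable, Dict


def download_related(
    info: Dict[str, str],
    fetch_bytes: Callable[[str], bytes],
    resources: Dict[str, bytes],
    relations: Dict[str, str],
) -> str:
    """Download info['download_link'], store the bytes, and record the mapping."""
    link = info["download_link"]
    data = fetch_bytes(link)
    key = hashlib.sha256(data).hexdigest()  # content-derived key (assumption)
    resources[key] = data                   # the related resource itself
    relations[link] = key                   # correspondence: link -> stored resource
    return key


# Simulated download source (assumption: real code would issue an HTTP request).
FAKE_FILES = {"http://example.com/files/space-quest.zip": b"PK\x03\x04 demo"}
resources: Dict[str, bytes] = {}
relations: Dict[str, str] = {}
key = download_related(
    {"game_name": "Space Quest",
     "download_link": "http://example.com/files/space-quest.zip"},
    FAKE_FILES.__getitem__,
    resources,
    relations,
)
```

Storing the correspondence separately from the resource makes it possible to answer both "what did this link download?" and "where did this file come from?".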

Optionally, according to a strategy customized in advance, the crawler system can also process the crawled information and the downloaded resources accordingly, for example by sending mail or pushing to distributed storage. For site servers from which only web page content needs to be downloaded, such as portals and news sites, only the required information needs to be crawled; the crawled information is pushed to a specified interface and specific people are then notified by mail. For some software package site servers, the software package and its related information need to be obtained; after the necessary information has been crawled, the package is downloaded and unpacked. Because software packages are usually large, they need to be pushed to distributed storage.

With the method for acquiring web page information provided by this embodiment, after a web page is crawled from the site server, the information of the entire web page is not stored directly; instead, specified page information is extracted from a designated position of the web page according to the page extraction rule, and this specified page information is stored in a structured manner. The page extraction rule can be customized according to user requirements, and because the web page information is parsed, the requirement for customized extraction of web page information is met. Taking the crawling of web page information from a game website as an example, the download links of all games on the website can be obtained directly by this method and stored in a structured manner, so that the user can know exactly what information the crawler system has crawled.

Fig. 2 shows a structural block diagram of a device for acquiring web page information according to an embodiment of the present invention. As shown in Fig. 2, the web page information acquisition device 200 comprises a web page grabber 210, a page information parser 220, and an action processor 230. Optionally, the device 200 may further comprise a web page link parser 240, a downloader 250, and a task scheduler 260.

The web page grabber 210 is adapted to crawl web pages from the site server. Optionally, the web page grabber 210 is adapted to download web pages directly from the site server, or to download them by browser rendering, or to download them through a proxy server. The web page grabber 210 comprises a primary web page grabber 211 and a recursive web page grabber 212. The primary web page grabber 211 is adapted to crawl the initial page from the site server; the web page link parser 240 is adapted to parse the initial page and obtain the URLs of the new web pages that the initial page links to; and the recursive web page grabber 212 is adapted to crawl the new web pages from the site server. The web page link parser 240 is further adapted to parse the new web pages and obtain the URLs of the further new web pages they link to; the recursive web page grabber 212 is further adapted to crawl those further new web pages from the site server; and the web page link parser 240 and the recursive web page grabber 212 work repeatedly until the crawl stop condition is satisfied.

The page information parser 220 is adapted to extract specified page information from a designated position of the web page according to a preset page extraction rule. Optionally, the page extraction rule sets the HTML tags before and after the designated position, and the page information parser 220 is further adapted to extract from the web page the specified page information located between those HTML tags. Further, the page information parser 220 is adapted to extract specified page information from the designated positions of the initial page and the new web pages according to the preset page extraction rule.

The action processor 230 is adapted to store the specified page information in a structured manner. Structured storage means storing the specified page information together with a structural description of it; with structured storage, the user can know exactly what information the crawler system has crawled.

The downloader 250 is adapted to download the related resources of the web page from the site server according to the specified page information. The action processor 230 is further adapted to store the related resources of the web page and the correspondence between the related resources and the specified page information.

The task scheduler 260 is adapted to deliver the corresponding tasks to the web page grabber 210 by a distributed calling method (such as gearman). The task scheduler 260 and the web page grabber 210 can crawl web pages in the full crawl mode or the incremental crawl mode; for the detailed process, see the description of the method embodiment.

The web page information acquisition device 200 may further comprise a cache database, for example redis, adapted to store web pages as a sorted set organized by time; web page monitoring tasks are scheduled precisely by calling the redis interface.

Fig. 3 shows a structural block diagram of a system for acquiring web page information according to an embodiment of the present invention. As shown in Fig. 3, the system comprises the web page information acquisition device 200 and a site server 100; for the specific structure of the device 200, see the related description of the above embodiment. The device 200 obtains web pages and their related resources from the site server 100.

With the device for acquiring web page information provided by the present invention, after a web page is crawled from the site server, the device does not store the information of the entire web page directly; instead, it extracts specified page information from a designated position of the web page according to the page extraction rule and stores this specified page information in a structured manner. The page extraction rule can be customized according to user requirements, and because the web page information is parsed, the requirement for customized extraction of web page information is met. Taking the crawling of web page information from a game website as an example, the download links of all games on the website can be obtained directly by this device and stored in a structured manner, so that the user can know exactly what information the crawler system has crawled.

The algorithms and displays provided here are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described here can be implemented in various programming languages, and that the above description of a specific language is given to disclose the best mode of the present invention.

In the specification provided here, numerous specific details are described. It is understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.

Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the present invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the present invention.

Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments can be combined into one module, unit, or component, and can furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

In addition, those skilled in the art will understand that although some embodiments described here include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.

The components of the various embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of these. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for acquiring web page information according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference symbol placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words can be interpreted as names.

Claims

1. A method for acquiring web page information, comprising:

a crawling step of crawling web pages from a site server;

2. The method according to claim 1, wherein the page extraction rule sets the HTML tags before and after the designated position, and the page information parsing step specifically comprises: extracting from the web page the specified page information located between the HTML tags before and after the designated position.

3. The method according to claim 1 or 2, wherein the crawling step comprises:

crawling an initial page from the site server;

parsing the initial page to obtain the URL of a new web page that the initial page links to;

crawling the new web page from the site server;

a full-crawl recursion sub-step of parsing the new web page to obtain the URL of a further new web page that the new web page links to, and crawling the further new web page from the site server; and repeating this full-crawl recursion sub-step until a crawl stop condition is satisfied;

and wherein the page information parsing step comprises: extracting specified page information from the designated positions of the initial page and the new web pages according to the preset page extraction rule.

4. The method according to claim 1 or 2, wherein the crawling step and the page information parsing step comprise:

crawling an initial page from the site server;

extracting specified page information from the designated position of the initial page according to the preset page extraction rule;

crawling the new web page from the site server;

extracting specified page information from the designated position of the new web page according to the preset page extraction rule;

an incremental recursion sub-step of parsing the new web page to obtain the URL of a further new web page that the new web page links to; crawling the further new web page from the site server; and extracting specified page information from the designated position of the further new web page according to the preset page extraction rule; and repeating this incremental recursion sub-step until a crawl stop condition is satisfied.

5. The method according to claim 4, wherein the incremental recursion sub-step is performed when the system time reaches the time specified by the scheduling period set for the site server.

6. The method according to any one of claims 1 to 5, wherein the crawling step comprises:

downloading web pages directly from the site server;

or downloading web pages from the site server by browser rendering;

or downloading web pages from the site server through a proxy server.

7. The method according to any one of claims 1 to 6, further comprising, after the storing step:

downloading related resources of the web page from the site server according to the specified page information; and

further storing the related resources of the web page and the correspondence between the related resources and the specified page information.

8. A device for acquiring web page information, comprising:

a web page grabber, adapted to crawl web pages from a site server;

9. The device according to claim 8, wherein the page extraction rule sets the HTML tags before and after the designated position, and the page information parser is further adapted to extract from the web page the specified page information located between the HTML tags before and after the designated position.

10. The device according to claim 8 or 9, further comprising a web page link parser;

wherein the web page grabber comprises a primary web page grabber and a recursive web page grabber;

the primary web page grabber is adapted to crawl an initial page from the site server, the web page link parser is adapted to parse the initial page and obtain the URL of a new web page that the initial page links to, and the recursive web page grabber is adapted to crawl the new web page from the site server;

the web page link parser is further adapted to parse the new web page and obtain the URL of a further new web page that the new web page links to; the recursive web page grabber is further adapted to crawl the further new web page from the site server; and the web page link parser and the recursive web page grabber work repeatedly until a crawl stop condition is satisfied.

11. The device according to claim 10, wherein the page information parser is specifically adapted to extract specified page information from the designated positions of the initial page and the new web pages according to the preset page extraction rule.

12. The device according to any one of claims 8 to 11, wherein the web page grabber is further adapted to download web pages directly from the site server; or to download web pages from the site server by browser rendering; or to download web pages from the site server through a proxy server.

13. The device according to any one of claims 8 to 11, further comprising:

a downloader, adapted to download related resources of the web page from the site server according to the specified page information;

wherein the action processor is further adapted to store the related resources of the web page and the correspondence between the related resources and the specified page information.

14. The device according to any one of claims 8 to 11, further comprising a task scheduler;

wherein the task scheduler is adapted to deliver the corresponding tasks to the web page grabber by a distributed calling method.