CN102982161A - Method and device for acquiring webpage information - Google Patents

Method and device for acquiring webpage information

Publication number
CN102982161A
Authority
CN
China
Legal status
Pending
Application number
CN2012105168743A
Other languages
Chinese (zh)
Inventor
徐锐波
路轶
Current Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd

Abstract

The invention discloses a method and a device for acquiring webpage information. The method comprises: a crawl step of crawling webpages from a site server; a page-information parsing step of extracting specified page information from specified positions of the webpages according to a preset page extraction rule; and a storing step of storing the specified page information in structured form. With the method and device provided by the invention, after webpages are crawled from the site server, the information of the entire webpage is not stored directly; instead, specified page information is extracted from specified positions of the webpages according to the page extraction rule and stored in structured form. The page extraction rule can be customized according to user requirements, and because the webpage information is parsed, the need for customized extraction of webpage information is met.

Description

Method and device for acquiring webpage information
Technical field
The present invention relates to the technical field of computer networks, and in particular to a method and a device for acquiring webpage information.
Background art
A web crawler (also known as a web spider or web robot, and in some communities more often called a web chaser) is a program or script that automatically fetches webpage content. It is an important component of a search engine; search engine optimization is to a large extent optimization aimed at web crawlers.
Web crawlers are generally divided into traditional crawlers and focused crawlers. A traditional crawler starts from the URLs (Uniform/Universal Resource Locator) of one or several initial webpages and obtains the URLs on those initial webpages; while crawling webpages, it continuously extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met. The workflow of a focused crawler is more complex: it filters out links irrelevant to the topic according to a webpage analysis algorithm, keeps the useful links, and puts them into the queue of URLs waiting to be crawled; it then selects the URL of the next webpage to crawl from the queue according to a search strategy, and repeats this process until a stop condition of the system is reached. In addition, all webpages fetched by the crawler are stored by the system, analyzed and filtered to some degree, and indexed for later query and retrieval.
Both kinds of web crawler obtain the information of the whole webpage and then store it directly. Such crawlers do not parse the webpage information and therefore cannot meet the need for customized extraction of webpage information.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a method for acquiring webpage information, and a corresponding device for acquiring webpage information, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for acquiring webpage information is provided, comprising:
a crawl step: crawling webpages from a site server;
a page-information parsing step: extracting specified page information from a specified position of the webpage according to a preset page extraction rule;
a storing step: storing the specified page information in structured form.
According to another aspect of the present invention, a device for acquiring webpage information is provided, comprising:
a webpage grabber, adapted to crawl webpages from a site server;
a page information parser, adapted to extract specified page information from a specified position of the webpage according to a preset page extraction rule;
an action processor, adapted to store the specified page information in structured form.
With the method and device for acquiring webpage information provided by the invention, after webpages are crawled from the site server, the information of the whole webpage is not stored directly; instead, specified page information is extracted from specified positions of the webpage according to the page extraction rule and stored in structured form. The page extraction rule can be customized according to user requirements, and because the webpage information is parsed, the need for customized extraction of webpage information is met.
The above is only an overview of the technical solution of the present invention. In order that the technical means of the invention may be understood more clearly and implemented in accordance with the specification, and in order that the above and other objects, features and advantages of the invention may be more readily apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for acquiring webpage information according to an embodiment of the invention;
Fig. 2 shows a structural block diagram of a device for acquiring webpage information according to an embodiment of the invention; and
Fig. 3 shows a structural block diagram of a system for acquiring webpage information according to an embodiment of the invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a flowchart of a method 100 for acquiring webpage information according to an embodiment of the invention. As shown in Fig. 1, method 100 begins at step S101, the crawl step, in which webpages are crawled from a site server. The crawler system can crawl webpages from the site server using the following three methods: 1) Downloading webpages directly from the site server; this method can be used for websites that have no anti-crawl strategy. 2) Downloading webpages from the site server by browser rendering; because some websites use AJAX (Asynchronous JavaScript and XML), browser rendering is needed to obtain the complete page structure. The crawler system is equipped with rendering modules for several engines, such as the IE engine, the Gecko (Firefox) engine and the Chrome engine. 3) To prevent the crawler system's IP from being blocked by a site server because the system accesses that server too frequently, the crawler system can download webpages from the site server through a proxy server; downloading through a proxy server can ensure the timeliness and continuity of crawling. These three methods can largely solve the crawling problem for the various types of website.
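As an illustration of methods 1) and 3), the following is a minimal Python sketch using only the standard library. It is not the implementation described in the patent: the function names, the User-Agent string and the proxy address are hypothetical, and method 2) (browser rendering) is left out because it requires a rendering engine outside the standard library.

```python
import urllib.request


def build_opener_for(proxy_url=None, user_agent="example-crawler/0.1"):
    """Return an opener for direct download, or for download through a proxy."""
    handlers = []
    if proxy_url:
        # Method 3: route HTTP and HTTPS requests through the given proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [("User-Agent", user_agent)]  # hypothetical agent name
    return opener


def download(url, proxy_url=None, timeout=10):
    """Fetch the raw bytes of a page (method 1 when proxy_url is None)."""
    opener = build_opener_for(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

A caller could switch a site from direct download to proxy download simply by passing a proxy address, which keeps the choice of crawl method a per-site configuration detail.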
Method 100 then proceeds to step S102, the page-information parsing step, in which specified page information is extracted from a specified position of the webpage according to a preset page extraction rule. The crawler system analyzes the page structure of each webpage and extracts the specified page information according to the page extraction rule. The page extraction rule is customized and can be configured manually. Optionally, the page extraction rule sets the HTML tags before and after the specified position. Because the effective information in a page is all contained in HTML tags, the specified position is generally delimited by the HTML tags before and after it, and the content at that position is the specified page information to be extracted. For example, for a webpage from a certain site server, if the "game name" field is to be extracted, the customized page extraction rule should include the HTML <div> tags before and after this field. When the crawler system analyzes the webpage, it extracts the information between the two <div> tags, namely the "game name".
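The idea of a rule that names the HTML before and after the wanted value can be sketched as follows. This is only an illustration under assumptions: the `ExtractionRule` class, its field names, and the sample page with a `class="name"` attribute are all hypothetical and do not appear in the patent.

```python
import re
from dataclasses import dataclass


@dataclass
class ExtractionRule:
    field: str   # structural name of the value, e.g. "game_name"
    before: str  # HTML text immediately before the wanted value
    after: str   # HTML text immediately after the wanted value


def extract(html, rule):
    """Return the text between rule.before and rule.after, or None if absent."""
    pattern = re.escape(rule.before) + r"(.*?)" + re.escape(rule.after)
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical page and rule for the "game name" example.
page = '<html><body><div class="name">Space Miner</div><p>other</p></body></html>'
rule = ExtractionRule(field="game_name", before='<div class="name">', after="</div>")
print(extract(page, rule))  # Space Miner
```

Because the rule is plain data, a different site only needs a different `before`/`after` pair rather than new code, which matches the manual configuration the text describes.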
For a webpage that links to a download file (for example a software package), the specified page information extracted generally includes the download file link and, optionally, the parent page link of the webpage. This link information is extracted so that the corresponding download file can later be downloaded from it. The parent page link is used for tracing the source: when the download file is downloaded, its origin, including the parent page or the website, can also be found, which facilitates later data maintenance and supports the corresponding query functions.
Further, the crawler system can crawl webpages from a site server in two modes: full crawl and incremental crawl. Which mode is used depends on requirements. For example, a new game site server may contain many new games; in that case all webpages of the site server need to be traversed, that is, a full crawl is performed to fetch all games, and unified processing (page-information parsing and storage) is done afterwards. After all games on the game site server have been crawled, the site server still adds new games every day; the incremental crawl mode is then used to fetch the games updated each day.
A site server crawled in full mode is given a one-time task delivery, that is, the webpages of the site server are crawled once. First, the task scheduler is notified of the name of the site server to be crawled; the task scheduler looks up the crawl rules of that site server itself, and the full crawl can then be carried out. The task scheduler delivers the crawl task to a specific worker process, and the crawl task performed may comprise the following. First, the initial webpage is crawled from the site server. The initial webpage is parsed to obtain the addresses of the new webpages it links to. The new webpages are crawled from the site server according to their addresses. Starting from the initial page, a site server typically has more than ten levels of recursion, or even more. The task scheduler starts crawling from the initial page and recursively crawls deeper webpages by following the links in each webpage; that is, a full-crawl recursion sub-step is then performed, in which the new webpages are parsed to obtain the addresses of the further webpages they link to, and those further webpages are crawled from the site server. This full-crawl recursion sub-step is repeated until the crawl stop condition is met. Usually the crawler system only needs the first few levels of webpages, so it can set a recursion depth for each site server; the stop condition is met once the crawl has recursed to the set depth for that site server. After a full crawl of a site server, the crawled webpages are processed together, which includes extracting the specified page information from the specified positions of the initial webpage and of all new webpages according to the preset page extraction rule.
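The full crawl with a depth limit can be sketched as below. This is an assumption-laden illustration, not the patent's code: it uses an iterative queue instead of literal recursion, a simple `href` regular expression in place of a real link parser, and an in-memory dictionary `SITE` (hypothetical) standing in for the site server so the example runs without network access.

```python
import re
from collections import deque
from urllib.parse import urljoin

LINK_RE = re.compile(r'href="([^"]+)"')


def full_crawl(start_url, fetch, max_depth):
    """Fetch pages level by level down to max_depth; parsing happens afterwards."""
    pages = {}                        # url -> html of every page fetched
    seen = set()
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:              # page missing or download failed
            continue
        pages[url] = html
        if depth < max_depth:         # stop condition: configured recursion depth
            for href in LINK_RE.findall(html):
                queue.append((urljoin(url, href), depth + 1))
    return pages


# Hypothetical site used instead of a real site server.
SITE = {
    "http://example.test/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": "leaf",
}
pages = full_crawl("http://example.test/", SITE.get, max_depth=1)
print(sorted(pages))
```

With `max_depth=1` the sketch fetches the initial page and the pages it links to, but not the page `/c` one level deeper; extraction rules would then be applied to everything in `pages` in a single pass.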
A site server crawled in incremental mode is scheduled periodically, that is, its webpages are crawled according to the scheduling period that the crawler system sets for the site server. The scheduling period can differ from one site server to another, for example 1 hour for some and 3 hours for others, depending on how quickly the site server is updated. The crawler system sorts the site servers that need incremental crawling by scheduling period to form a scheduling queue and checks this queue at a preset interval (for example every 10 minutes); site servers whose scheduled time has been reached are treated as site servers to be crawled. The task scheduler then delivers the crawl task to a specific worker process. In the worker process, the steps performed may comprise the following. First, the initial webpage is crawled from the site server. The specified page information is extracted from the specified position of the initial webpage according to the preset page extraction rule. The initial webpage is parsed to obtain the addresses of the new webpages it links to. The new webpages are crawled from the site server according to their addresses. The specified page information is extracted from the specified positions of the new webpages according to the preset page extraction rule. In an incremental recursion sub-step, the new webpages are parsed to obtain the addresses of the further webpages they link to; those further webpages are crawled from the site server; and the specified page information is extracted from their specified positions according to the preset page extraction rule. This incremental recursion sub-step is repeated until the crawl stop condition is met. The crawler system can set a recursion depth for each site server, and the stop condition is met once the crawl has recursed to the set depth for that site server. The main difference from the full crawl mode is that the incremental mode parses webpages while crawling them; in addition, the incremental recursion sub-step is performed when the time specified by the scheduling period set for the site server arrives.
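The periodic scheduling queue can be sketched with a priority queue. This is a minimal sketch under assumptions: the class and method names are hypothetical, times are plain numbers of seconds, and a site is treated as due when its next scheduled time is less than or equal to the time of the check.

```python
import heapq


class IncrementalScheduler:
    """Scheduling queue of site servers ordered by their next crawl time."""

    def __init__(self):
        self._heap = []  # entries: (next_run_time, site_name, period_seconds)

    def add_site(self, name, period_seconds, first_run):
        heapq.heappush(self._heap, (first_run, name, period_seconds))

    def due_sites(self, now):
        """Return the sites whose scheduled time has arrived and reschedule them."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, name, period = heapq.heappop(self._heap)
            due.append((name, period))
        for name, period in due:
            # The next run is one scheduling period after this check.
            heapq.heappush(self._heap, (now + period, name, period))
        return [name for name, _ in due]


scheduler = IncrementalScheduler()
scheduler.add_site("games", period_seconds=3600, first_run=0)    # 1-hour period
scheduler.add_site("news", period_seconds=10800, first_run=0)    # 3-hour period
print(scheduler.due_sites(0))      # ['games', 'news']
print(scheduler.due_sites(600))    # []  (a check 10 minutes later finds nothing due)
print(scheduler.due_sites(3600))   # ['games']
```

A driver loop would call `due_sites` at the preset interval and hand each returned site to a worker process as an incremental crawl task.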
Optionally, in this method the task scheduler passes crawl tasks to downstream worker processes through gearman. The method uses gearman as an inter-process message queue; inter-process communication through gearman enables parallel scaling and highly concurrent processing. The webpages are stored in redis as an ordered set indexed by time, and webpage monitoring tasks are scheduled precisely by calling the redis interface. Redis is a key-value in-memory database: the whole database is loaded into memory for operation, and the data is periodically flushed to disk by an asynchronous operation. Because it operates purely in memory, redis performs very well and can handle more than 100,000 read/write operations per second, which improves the performance of the crawler system.
After step S102, method 100 proceeds to step S103, the storing step, in which the specified page information is stored in structured form. Structured storage means storing the specified page information together with a structural description of it; for example, the structural description of the "game name" information is "game name", and the structural description of the "download file link" information is "download file link". Optionally, XML (Extensible Markup Language) can be used for structured storage, with each item of specified page information stored in an XML node; this simplifies processing by subsequent modules and also simplifies the system architecture. With structured storage, the user can know exactly what information the crawler system has crawled.
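Storing each item of specified page information in its own XML node might look like the following sketch. The root element name `page`, the node names and the sample values are assumptions chosen for illustration; the patent does not fix a schema.

```python
import xml.etree.ElementTree as ET


def to_xml(record):
    """Serialize one record, using each structural description as a node name."""
    root = ET.Element("page")                 # hypothetical root element
    for name, value in record.items():
        node = ET.SubElement(root, name)      # e.g. <game_name>
        node.text = value
    return ET.tostring(root, encoding="unicode")


record = {
    "game_name": "Space Miner",
    "download_link": "http://example.test/files/space-miner.zip",
}
xml_text = to_xml(record)
print(xml_text)
```

A downstream module can read a field back with `ET.fromstring(xml_text).findtext("game_name")` without knowing anything about the page the value came from.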
Optionally, after step S103, method 100 proceeds to step S104, in which the related resources of the webpage are downloaded from the site server according to the specified page information, and the related resources are stored together with the correspondence between the related resources and the specified page information. Taking a software package link as the specified page information, the software package is downloaded from the site server according to the link, and the software package is stored together with the correspondence between the software package and its link. In this way the crawler system can crawl any information and download file visible on a webpage, for example a software package and its related information, such as software name, update time, software size, software author, platform and software description, and can also crawl resources such as news and pictures from portal websites.
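Step S104 can be sketched as a function that downloads the resource named by a link and records which link (and, if present, which parent page) it came from. Everything named here is hypothetical: the record keys, the in-memory `store` dictionary, and the use of a SHA-256 digest as the storage key are illustrative choices, not details from the patent.

```python
import hashlib


def download_and_link(record, fetch, store):
    """Download record['download_link'] and store the file with its origin."""
    link = record["download_link"]
    data = fetch(link)                          # bytes of the related resource
    key = hashlib.sha256(data).hexdigest()      # assumed storage key
    store[key] = {
        "bytes": data,
        "source_link": link,                    # resource <-> page info mapping
        "parent_page": record.get("parent_page"),
    }
    return key


# Stand-in for a real download so the example runs without network access.
def fake_fetch(url):
    return b"fake package contents"


store = {}
record = {
    "download_link": "http://example.test/files/space-miner.zip",
    "parent_page": "http://example.test/games/space-miner",
}
key = download_and_link(record, fake_fetch, store)
print(store[key]["source_link"])
```

Keeping the parent page alongside the file is what allows the source of a download to be traced later, as described for the parent page link.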
Optionally, according to a strategy customized in advance, the crawler system can also process the crawled information and the downloaded resources accordingly, for example by sending e-mail or pushing to distributed storage. For site servers from which only webpage content is needed, such as portals and news sites, only the required information is crawled; the crawled information is pushed to a specified interface and specific people are then notified by e-mail. For software package site servers, the software packages and their related information need to be obtained: after the necessary information has been crawled, the packages are downloaded and unpacked; because software packages are usually large, they need to be pushed to distributed storage.
With the method for acquiring webpage information provided by this embodiment, after webpages are crawled from the site server, the information of the whole webpage is not stored directly; instead, specified page information is extracted from specified positions of the webpage according to the page extraction rule and stored in structured form. The page extraction rule can be customized according to user requirements, and because the webpage information is parsed, the need for customized extraction of webpage information is met. Taking the crawling of a game website as an example, the download links of all games on the website can be obtained directly by this method and stored in structured form, so the user can know exactly what information the crawler system has crawled.
Fig. 2 shows a structural block diagram of a device for acquiring webpage information according to an embodiment of the invention. As shown in Fig. 2, the webpage information acquisition device 200 comprises a webpage grabber 210, a page information parser 220 and an action processor 230. Optionally, the webpage information acquisition device 200 may further comprise a webpage link parser 240, a downloader 250 and a task scheduler 260.
The webpage grabber 210 is adapted to crawl webpages from a site server. Optionally, the webpage grabber 210 is adapted to download webpages directly from the site server; or to download webpages from the site server by browser rendering; or to download webpages from the site server through a proxy server. The webpage grabber 210 comprises a primary webpage grabber 211 and a recursive webpage grabber 212. The primary webpage grabber 211 is adapted to crawl the initial webpage from the site server; the webpage link parser 240 is adapted to parse the initial webpage and obtain the addresses of the new webpages it links to; and the recursive webpage grabber 212 is adapted to crawl the new webpages from the site server. The webpage link parser 240 is further adapted to parse the new webpages and obtain the addresses of the further webpages they link to; the recursive webpage grabber 212 is further adapted to crawl those further webpages from the site server; the webpage link parser 240 and the recursive webpage grabber 212 work repeatedly until the crawl stop condition is met.
The page information parser 220 is adapted to extract specified page information from a specified position of the webpage according to a preset page extraction rule. Optionally, the page extraction rule sets the HTML tags before and after the specified position; the page information parser 220 is further adapted to extract from the webpage the specified page information between the HTML tags before and after the specified position. Further, the page information parser 220 is adapted to extract specified page information from the specified positions of the initial webpage and the new webpages according to the preset page extraction rule.
The action processor 230 is adapted to store the specified page information in structured form. Structured storage means storing the specified page information together with a structural description of it; with structured storage, the user can know exactly what information the crawler system has crawled.
The downloader 250 is adapted to download the related resources of the webpage from the site server according to the specified page information. The action processor 230 is further adapted to store the related resources of the webpage together with the correspondence between the related resources and the specified page information.
The task scheduler 260 is adapted to deliver the corresponding task to the webpage grabber 210 by a distributed call method (such as gearman). The task scheduler 260 and the webpage grabber 210 can crawl webpages in full crawl mode or incremental crawl mode; for the detailed process, see the description of the method embodiment.
The webpage information acquisition device 200 may further comprise a cache database, for example redis, adapted to store webpages as an ordered set indexed by time; webpage monitoring tasks are scheduled precisely by calling the redis interface.
Fig. 3 shows a structural block diagram of a system for acquiring webpage information according to an embodiment of the invention. As shown in Fig. 3, the system comprises the webpage information acquisition device 200 and a site server 100; for the specific structure of the webpage information acquisition device 200, see the description of the above embodiment. The webpage information acquisition device 200 obtains webpages and the related resources of the webpages from the site server 100.
With the device for acquiring webpage information provided by the invention, after the device crawls webpages from the site server, the information of the whole webpage is not stored directly; instead, specified page information is extracted from specified positions of the webpage according to the page extraction rule and stored in structured form. The page extraction rule can be customized according to user requirements, and because the webpage information is parsed, the need for customized extraction of webpage information is met. Taking the crawling of a game website as an example, the download links of all games on the website can be obtained directly by this device and stored in structured form, so the user can know exactly what information the crawler system has crawled.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the description above of a specific language is given in order to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of the embodiments can be combined into one module or unit or component, and can furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for acquiring webpage information according to embodiments of the invention. The invention may also be implemented as an apparatus or device program (for example, a computer program and a computer program product) for carrying out part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.

Claims (14)

1. A method for acquiring web page information, comprising:
a crawling step of crawling a web page from a site server;
a page information parsing step of extracting specified page information from a specified position of the web page according to a preset page extraction rule; and
a storing step of storing the specified page information in a structured manner.
2. The method according to claim 1, wherein the page extraction rule sets the HTML tags before and after the specified position, and the page information parsing step specifically comprises: extracting, from the web page, the specified page information located between the HTML tags before and after the specified position.
3. The method according to claim 1 or 2, wherein the crawling step comprises:
crawling an initial page from the site server;
parsing the initial page to obtain the URLs of new web pages linked from the initial page;
crawling the new web pages from the site server; and
a full recursion sub-step of parsing the new web pages to obtain the URLs of further new web pages linked from the new web pages, and crawling the further new web pages from the site server, the full recursion sub-step being repeated until a crawl stop condition is satisfied;
and wherein the page information parsing step comprises: extracting, according to the preset page extraction rule, the specified page information from the specified positions of the initial page and the new web pages.
4. The method according to claim 1 or 2, wherein the crawling step and the page information parsing step comprise:
crawling an initial page from the site server;
extracting, according to the preset page extraction rule, the specified page information from the specified position of the initial page;
parsing the initial page to obtain the URLs of new web pages linked from the initial page;
crawling the new web pages from the site server;
extracting, according to the preset page extraction rule, the specified page information from the specified positions of the new web pages; and
an incremental recursion sub-step of parsing the new web pages to obtain the URLs of further new web pages linked from the new web pages, crawling the further new web pages from the site server, and extracting, according to the preset page extraction rule, the specified page information from the specified positions of the further new web pages, the incremental recursion sub-step being repeated until a crawl stop condition is satisfied.
5. The method according to claim 4, wherein the incremental recursion sub-step is performed when the system time reaches the time specified by the scheduling period set for the site server.
6. The method according to any one of claims 1 to 5, wherein the crawling step comprises:
downloading the web page directly from the site server;
or downloading the web page from the site server by a browser rendering method;
or downloading the web page from the site server through a proxy server.
7. The method according to any one of claims 1 to 6, further comprising, after the storing step:
downloading related resources of the web page from the site server according to the specified page information; and
further storing the related resources of the web page and the correspondence between the related resources of the web page and the specified page information.
8. A device for acquiring web page information, comprising:
a web page crawler adapted to crawl a web page from a site server;
a page information parser adapted to extract specified page information from a specified position of the web page according to a preset page extraction rule; and
an action processor adapted to store the specified page information in a structured manner.
9. The device according to claim 8, wherein the page extraction rule sets the HTML tags before and after the specified position, and the page information parser is further adapted to extract, from the web page, the specified page information located between the HTML tags before and after the specified position.
10. The device according to claim 8 or 9, further comprising: a web page link parser;
wherein the web page crawler comprises an initial web page crawler and a recursive web page crawler;
the initial web page crawler is adapted to crawl an initial page from the site server; the web page link parser is adapted to parse the initial page to obtain the URLs of new web pages linked from the initial page; the recursive web page crawler is adapted to crawl the new web pages from the site server;
the web page link parser is further adapted to parse the new web pages to obtain the URLs of further new web pages linked from the new web pages; the recursive web page crawler is further adapted to crawl the further new web pages from the site server; and the web page link parser and the recursive web page crawler operate repeatedly until a crawl stop condition is satisfied.
11. The device according to claim 10, wherein the page information parser is specifically adapted to extract, according to the preset page extraction rule, the specified page information from the specified positions of the initial page and the new web pages.
12. The device according to any one of claims 8 to 11, wherein the web page crawler is further adapted to download the web page directly from the site server; or to download the web page from the site server by a browser rendering method; or to download the web page from the site server through a proxy server.
13. The device according to any one of claims 8 to 11, further comprising:
a downloader adapted to download related resources of the web page from the site server according to the specified page information;
wherein the action processor is further adapted to store the related resources of the web page and the correspondence between the related resources of the web page and the specified page information.
14. The device according to any one of claims 8 to 11, further comprising: a task scheduler;
wherein the task scheduler is adapted to deliver corresponding tasks to the web page crawler according to a distributed invocation method.
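
As an informal illustration of the crawl, parse and store steps recited in the method claims (it is not part of the claims or of the patent text), the following Python sketch extracts the text between two HTML tags on every page reached by following links from an initial page. Everything specific in it is an assumption made for the example: the in-memory SITE dictionary stands in for the site server, the names fetch, extract_between, extract_links and crawl are hypothetical, the stop condition is taken to be "no new pages remain or max_pages is reached", and the link following is written as a loop over a queue rather than as literal recursion.

```python
import re
from urllib.parse import urljoin

# Hypothetical in-memory "site server": URL -> HTML.
# A real crawler would download the pages over HTTP instead.
SITE = {
    "http://example.test/": '<html><h1>Home</h1><a href="/a">A</a><a href="/b">B</a></html>',
    "http://example.test/a": '<html><h1>Page A</h1><a href="/b">B</a></html>',
    "http://example.test/b": '<html><h1>Page B</h1><a href="/">home</a></html>',
}


def fetch(url):
    """Crawl step: return the page's HTML, or None if the page is missing."""
    return SITE.get(url)


def extract_between(html, before, after):
    """Parsing step: return the text located between the tags `before` and `after`."""
    start = html.find(before)
    if start == -1:
        return None
    start += len(before)
    end = html.find(after, start)
    if end == -1:
        return None
    return html[start:end].strip()


def extract_links(html, base_url):
    """Return the absolute URLs of the pages linked from `html`."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]*)"', html)]


def crawl(start_url, rule, max_pages=100):
    """Follow links from start_url until no new pages remain or max_pages is reached."""
    records = []
    seen = {start_url}
    queue = [start_url]
    while queue and len(records) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue
        # Storing step: keep one structured record per page.
        records.append({"url": url, "title": extract_between(html, *rule)})
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return records


# Assumed extraction rule: the tags before and after the wanted position.
RULE = ("<h1>", "</h1>")
records = crawl("http://example.test/", RULE)
```

In this sketch the extraction rule is just a pair of tag strings, so a different field can be targeted by passing a different pair. The seen set is what keeps the loop from revisiting pages that link back to each other.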
CN2012105168743A 2012-12-05 2012-12-05 Method and device for acquiring webpage information Pending CN102982161A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012105168743A CN102982161A (en) 2012-12-05 2012-12-05 Method and device for acquiring webpage information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012105168743A CN102982161A (en) 2012-12-05 2012-12-05 Method and device for acquiring webpage information

Publications (1)

Publication Number Publication Date
CN102982161A true CN102982161A (en) 2013-03-20

Family

ID=47856178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012105168743A Pending CN102982161A (en) 2012-12-05 2012-12-05 Method and device for acquiring webpage information

Country Status (1)

Country Link
CN (1) CN102982161A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294781A (en) * 2013-05-14 2013-09-11 百度在线网络技术(北京)有限公司 Method and equipment used for processing page data
CN103399908A (en) * 2013-07-30 2013-11-20 北京北纬通信科技股份有限公司 Method and system for fetching business data
CN103412857A (en) * 2013-09-04 2013-11-27 广东全通教育股份有限公司 System and method for realizing Chinese-English translation of webpage
CN103761230A (en) * 2013-10-17 2014-04-30 北京奇虎科技有限公司 Method and device for capturing media content information of webpage by search engine
CN104346350A (en) * 2013-07-26 2015-02-11 南京中兴力维软件有限公司 Method and system for inquiring tree node of asynchronous tree
CN104462532A (en) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and device for extracting webpage text
CN105095395A (en) * 2015-06-30 2015-11-25 北京金山安全软件有限公司 Information processing method and device
CN105160209A (en) * 2015-08-31 2015-12-16 佛山市恒南微科技有限公司 System for investigating and managing regional enterprise software copyright announcement
CN105243159A (en) * 2015-10-28 2016-01-13 福建亿榕信息技术有限公司 Visual script editor-based distributed web crawler system
CN105426225A (en) * 2015-12-28 2016-03-23 上海瀚之友信息技术服务有限公司 Recharging platform updating method and system
CN106202348A (en) * 2016-07-04 2016-12-07 中山大学 A kind of web page form information extraction method
CN106227823A (en) * 2016-07-21 2016-12-14 知几科技(深圳)有限公司 A kind of webpage update detection method, info web capture and rendering method
CN108009171A (en) * 2016-10-27 2018-05-08 腾讯科技(北京)有限公司 A kind of method and apparatus for extracting content-data
CN108121728A (en) * 2016-11-29 2018-06-05 北京京东尚科信息技术有限公司 The method and apparatus that data are extracted from database
CN109063110A (en) * 2018-07-28 2018-12-21 安徽捷兴信息安全技术有限公司 A kind of grasping means and device using application message in store
CN109582885A (en) * 2018-10-31 2019-04-05 阿里巴巴集团控股有限公司 It is a kind of that the method and device that block chain deposits card is carried out to webpage by webpage monitoring
CN110618934A (en) * 2019-08-15 2019-12-27 重庆金融资产交易所有限责任公司 Front-end automatic test debugging method and device and computer readable storage medium
CN111061971A (en) * 2019-12-16 2020-04-24 百度在线网络技术(北京)有限公司 Method and device for extracting information

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178736A (en) * 2007-12-11 2008-05-14 腾讯科技(深圳)有限公司 Web page collecting method and web page collecting server
CN101404026A (en) * 2008-11-25 2009-04-08 北京邮电大学 Crawler system construction method for video-previewing search engine
CN101520798A (en) * 2009-03-06 2009-09-02 苏州锐创通信有限责任公司 Webpage classification technology based on vertical search and focused crawler
CN102043862A (en) * 2010-12-29 2011-05-04 重庆新媒农信科技有限公司 Directional web data extraction method
CN102236713A (en) * 2011-07-05 2011-11-09 广东星海数字家庭产业技术研究院有限公司 Digital television interaction service page information extraction method and device
US20120191693A1 (en) * 2009-08-25 2012-07-26 Vizibility Inc. Systems and methods of identifying and handling abusive requesters
US20120246139A1 (en) * 2010-10-21 2012-09-27 Bindu Rama Rao System and method for resume, yearbook and report generation based on webcrawling and specialized data collection
CN102708178A (en) * 2012-05-08 2012-10-03 上海互联网软件有限公司 Data fetching method of B (browser)/S (server) structural system
US20120259833A1 (en) * 2011-04-11 2012-10-11 Vistaprint Technologies Limited Configurable web crawler

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178736A (en) * 2007-12-11 2008-05-14 腾讯科技(深圳)有限公司 Web page collecting method and web page collecting server
CN101404026A (en) * 2008-11-25 2009-04-08 北京邮电大学 Crawler system construction method for video-previewing search engine
CN101520798A (en) * 2009-03-06 2009-09-02 苏州锐创通信有限责任公司 Webpage classification technology based on vertical search and focused crawler
US20120191693A1 (en) * 2009-08-25 2012-07-26 Vizibility Inc. Systems and methods of identifying and handling abusive requesters
US20120246139A1 (en) * 2010-10-21 2012-09-27 Bindu Rama Rao System and method for resume, yearbook and report generation based on webcrawling and specialized data collection
CN102043862A (en) * 2010-12-29 2011-05-04 重庆新媒农信科技有限公司 Directional web data extraction method
US20120259833A1 (en) * 2011-04-11 2012-10-11 Vistaprint Technologies Limited Configurable web crawler
CN102236713A (en) * 2011-07-05 2011-11-09 广东星海数字家庭产业技术研究院有限公司 Digital television interaction service page information extraction method and device
CN102708178A (en) * 2012-05-08 2012-10-03 上海互联网软件有限公司 Data fetching method of B (browser)/S (server) structural system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
翁岩青: "网页抓取策略研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, 15 May 2011 (2011-05-15) *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294781A (en) * 2013-05-14 2013-09-11 百度在线网络技术(北京)有限公司 Method and equipment used for processing page data
CN104346350A (en) * 2013-07-26 2015-02-11 南京中兴力维软件有限公司 Method and system for inquiring tree node of asynchronous tree
CN104346350B (en) * 2013-07-26 2019-09-20 南京中兴力维软件有限公司 The tree node querying method and system of Asynchronous Tree
CN103399908B (en) * 2013-07-30 2017-02-08 北京北纬通信科技股份有限公司 Method and system for fetching business data
CN103399908A (en) * 2013-07-30 2013-11-20 北京北纬通信科技股份有限公司 Method and system for fetching business data
CN103412857A (en) * 2013-09-04 2013-11-27 广东全通教育股份有限公司 System and method for realizing Chinese-English translation of webpage
CN103761230A (en) * 2013-10-17 2014-04-30 北京奇虎科技有限公司 Method and device for capturing media content information of webpage by search engine
CN104462532A (en) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and device for extracting webpage text
CN104462532B (en) * 2014-12-23 2017-07-07 北京奇虎科技有限公司 The method and apparatus that Web page text is extracted
CN105095395A (en) * 2015-06-30 2015-11-25 北京金山安全软件有限公司 Information processing method and device
CN105160209A (en) * 2015-08-31 2015-12-16 佛山市恒南微科技有限公司 System for investigating and managing regional enterprise software copyright announcement
CN105243159B (en) * 2015-10-28 2019-06-25 福建亿榕信息技术有限公司 A kind of distributed network crawler system based on visualization script editing machine
CN105243159A (en) * 2015-10-28 2016-01-13 福建亿榕信息技术有限公司 Visual script editor-based distributed web crawler system
CN105426225A (en) * 2015-12-28 2016-03-23 上海瀚之友信息技术服务有限公司 Recharging platform updating method and system
CN105426225B (en) * 2015-12-28 2018-08-14 上海瀚之友信息技术服务有限公司 One kind supplementing platform update method and system with money
CN106202348A (en) * 2016-07-04 2016-12-07 中山大学 A kind of web page form information extraction method
CN106227823A (en) * 2016-07-21 2016-12-14 知几科技(深圳)有限公司 A kind of webpage update detection method, info web capture and rendering method
CN108009171A (en) * 2016-10-27 2018-05-08 腾讯科技(北京)有限公司 A kind of method and apparatus for extracting content-data
CN108009171B (en) * 2016-10-27 2020-06-30 腾讯科技(北京)有限公司 Method and device for extracting content data
CN108121728A (en) * 2016-11-29 2018-06-05 北京京东尚科信息技术有限公司 The method and apparatus that data are extracted from database
CN109063110A (en) * 2018-07-28 2018-12-21 安徽捷兴信息安全技术有限公司 A kind of grasping means and device using application message in store
CN109582885A (en) * 2018-10-31 2019-04-05 阿里巴巴集团控股有限公司 It is a kind of that the method and device that block chain deposits card is carried out to webpage by webpage monitoring
CN110618934A (en) * 2019-08-15 2019-12-27 重庆金融资产交易所有限责任公司 Front-end automatic test debugging method and device and computer readable storage medium
CN111061971A (en) * 2019-12-16 2020-04-24 百度在线网络技术(北京)有限公司 Method and device for extracting information

Similar Documents

Publication Publication Date Title
CN102982161A (en) Method and device for acquiring webpage information
CN102982162B (en) The acquisition system of info web
CN102882991B (en) A kind of browser and carry out the method for domain name mapping
CN102843445B (en) A kind of browser and carry out the method for domain name mapping
US20200410031A1 (en) Systems and methods for cloud computing
CN103744853B (en) The method and device of Research of Search Engine Website Snapshot System information is provided
CN102831252B (en) A kind of method for upgrading index data base and device, searching method and system
CN105243159A (en) Visual script editor-based distributed web crawler system
CN104036011A (en) Webpage element display method and browser device.
CN102880607A (en) Dynamic network content grabbing method and dynamic network content crawler system
US8356048B2 (en) Systems and methods for improved forums
CN103631875A (en) Method for carrying out network search on browser side and browser
CN104516982A (en) Method and system for extracting Web information based on Nutch
CN108366096A (en) A kind of information subscribing method, terminal and computer readable storage medium
CN103092817A (en) Data collection method and data collection device based on script engine
CN110147475A (en) A kind of network data acquisition system of distributed deployment
US9454535B2 (en) Topical mapping
CN102981848B (en) Webpage main body element process browser and method
WO2015200277A1 (en) Search results for native applications
CN103020266A (en) Method and device for extracting webpage text content
CN103678487A (en) Method and device for generating web page snapshot
CN103577552A (en) Webpage picture processing method and device
CN103177115A (en) Method and device of extracting page link of webpage
CN102968428A (en) Efficient data extraction by a remote application
CN102855334A (en) Browser and method for acquiring domain name system (DNS) resolving data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130320