CN103593396A - Network resource extracting method and device based on browser - Google Patents

Info

Publication number
CN103593396A
Authority
CN
China
Prior art keywords
webpage
message
operated
browser
document message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310464253.XA
Other languages
Chinese (zh)
Inventor
徐锐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201310464253.XA priority Critical patent/CN103593396A/en
Publication of CN103593396A publication Critical patent/CN103593396A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a browser-based method for extracting network resources. The method includes: extracting a document message from a message queue containing multiple document messages, or receiving a document message from such a queue delivered through task scheduling, wherein each document message contains the URL (uniform resource locator) node of a webpage to be operated on and an operation strategy for operating on that webpage; using a browser to open the webpage corresponding to the URL node in the document message; operating on the webpage according to the operation strategy contained in the document message; and outputting the result of the operation. The invention further discloses a browser-based device for extracting network resources. The method and the device exploit the browser's strong support for network technologies: complex matters such as the HTTP (Hypertext Transfer Protocol) communication process, encryption and JS events are handled by the browser, which reduces the time a user spends extracting network resources.

Description

Browser-based method and device for extracting network resources
Technical field
The present invention relates to computer network technology, and in particular to a method and device for extracting network resources.
Background
At present, more and more network technologies are in wide use, for example asynchronous request processing, which aims to reduce development cost, and encrypted-link techniques, which encrypt resource links by means of JS or cookies in order to prevent crawling by web spiders. Whatever the purpose, such resources are difficult to crawl, and where the encryption algorithm cannot be cracked, crawling cannot be automated.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for extracting network resources that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for extracting network resources is provided, comprising: extracting a document message from a message queue containing a plurality of document messages, wherein each document message contains the URL node of a webpage to be operated on and an operation strategy for operating on that webpage; opening, with a browser, the webpage corresponding to the URL node contained in the document message; operating on the webpage according to the operation strategy contained in the document message; and outputting the result of the operation on the webpage.
According to another aspect of the present invention, a device for extracting network resources is provided, comprising: a message acquisition module adapted to extract a document message from a message queue of a plurality of document messages, wherein each document message contains the URL node of a webpage to be operated on and an operation strategy for operating on that webpage; a webpage opening module adapted to open, with a browser, the webpage corresponding to the URL node contained in the document message; a webpage operation module adapted to operate on the webpage according to the operation strategy contained in the document message; and a result output module adapted to output the result of the operation on the webpage.
The method and device of the present invention exploit the browser's strong support for network technologies by handing the complex HTTP communication process, encryption, JS events and similar techniques to the browser, which saves a great deal of manpower. With the invention, a user only needs to focus on the simple manual browser operations and to convey the operation steps, as configuration information, to the extraction device or extraction method of the invention in order to obtain the final valid webpage information or resources. In addition, the invention provides a basis for web crawlers to automate the crawling of complex webpages and resources.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flowchart of the browser-based method for extracting network resources according to one embodiment of the present invention;
Fig. 2 shows a block diagram of the browser-based device for extracting network resources according to another embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
In existing techniques for downloading network resources, webpages or resources may use asynchronous request processing, or their links may be encrypted by means of JS or cookies. As a result, on the one hand, common network tools such as curl, urllib and sockets, which download directly, cannot obtain the webpage information or the network link of the resource. On the other hand, the HTTP communication protocol, the encryption algorithm and so on have to be analyzed case by case, which consumes a great deal of manpower and carries the risk that they cannot be cracked; this is highly unfavorable to automating the production process.
Because a browser has strong support for network technologies (it supports asynchronous request processing and automatically decrypts and loads webpage resources when it opens a webpage), webpage information or the network link of a resource can be obtained directly through the browser. The present invention therefore proposes a method and device for extracting network resources which, by means of a browser, obtain the final valid network information or resources according to customized steps.
Fig. 1 shows the method for extracting network resources according to one embodiment of the present invention.
As shown in Fig. 1, first, at step S110, a document message is extracted from a message queue containing a plurality of document messages, or a document message from such a queue is received by way of task scheduling, wherein each document message contains the URL node of a webpage to be operated on and an operation strategy for operating on that webpage. The document message may be created in XML, JSON or protobuf, protobuf being a serialization format defined by Google. The operation strategy may comprise operation steps for operating on the webpage, wherein each operation step corresponds to one OPTION node of the XML, JSON or protobuf document, and each OPTION node contains the following attributes: the type of operation performed on the webpage, and the part of the webpage that is operated on. The type of operation may be set as required and includes clicking the left mouse button, clicking the right mouse button, downloading a file, and obtaining the DOM structure of the webpage. The part of the webpage that is operated on is given by the coordinates clicked on the webpage, the name of the clicked control, or the recorded URL that is clicked.
The following describes in detail how a document message is created as an XML document.
When a user wants to perform some operation on the webpages of a website, such as downloading resources from the website or rendering a webpage, an XML document can be written according to the characteristics of the website's pages and the steps the user actually performs on a page. The XML document stores the URL node of the webpage the user needs to operate on and the operation steps for that webpage, wherein each operation step corresponds to one OPTION node of the XML, and each OPTION node may have the following attributes:
<1> The type of operation performed on the webpage, type, which can be set as follows:
0 represents a left mouse click; it is a click operation that can be used to obtain the rendering result of a webpage, open a webpage, trigger a JS event, and so on;
1 represents a right mouse click; it is a click operation that usually pops up a menu;
2 represents downloading a file; it performs a download, for example by selecting the "Save Target As" option in the menu that pops up after a right mouse click;
3 represents obtaining the DOM structure of the webpage; it is mainly used when the rendering result of a webpage is wanted.
The operation types given above are common user operations; other user operations can also be defined as required.
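The four type values listed above can be captured in a small enumeration. This is only a sketch: the numeric values 0 to 3 come from the description, while the Python names (`OperationType`, `parse_operation_type`) are assumptions introduced here for illustration.

```python
from enum import IntEnum


class OperationType(IntEnum):
    """Values of the `type` attribute as listed in the description.

    The member names are hypothetical; only the numbers come from the text.
    """
    LEFT_CLICK = 0     # left mouse click: render a page, open a page, trigger a JS event
    RIGHT_CLICK = 1    # right mouse click: usually pops up a menu
    DOWNLOAD_FILE = 2  # e.g. choose "Save Target As" in the popped-up menu
    GET_DOM = 3        # obtain the DOM structure, i.e. the rendered page


def parse_operation_type(raw: str) -> OperationType:
    """Convert the text of a `type` attribute into an OperationType.

    Raises ValueError for values outside 0..3; an implementation that adds
    further user operations would extend the enumeration instead.
    """
    return OperationType(int(raw))
```

Keeping the values in one enumeration makes it easy to add further operation types later, which the description explicitly allows.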
Here is an example of obtaining the rendering result of a webpage by means of XML. Suppose a user needs information on page A (where A is the URL of the page) that only becomes available after control H is clicked. The manual steps required are as follows:
a1) Left-click control H on page A; the control then executes JS, which changes the DOM structure of the webpage.
b1) Obtain the rendering result of page A (the result after operation a1)).
Corresponding to these manual steps, the document message is written in XML as follows:
(1) The URL node is set to A;
(2) The operation steps performed on webpage A:
Operation step one: corresponding to manual step a1) above, the option is set in XML to type=0 and click_info=the name of control H, written in the following form:
Figure BDA0000392455970000041
Operation step two: corresponding to manual step b1) above, the option is set in XML to type=3, written in the following form:
Figure BDA0000392455970000042
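The XML listings for the two steps above survive only as figure references, so the following is a hedged reconstruction rather than the patent's own markup. It assumes that each operation step is an `option` element carrying `type` and `click_info` as XML attributes, and that the message has a root element named `document`; the attribute names come from the text, while the element layout and the function name `build_render_message` are assumptions.

```python
import xml.etree.ElementTree as ET


def build_render_message(page_url: str, control_name: str) -> str:
    """Build a document message for the 'click control H, then read the DOM' example."""
    doc = ET.Element("document")  # hypothetical root element name
    ET.SubElement(doc, "url").text = page_url
    # Step one (a1): left-click (type=0) the control identified by its name.
    ET.SubElement(doc, "option", {"type": "0", "click_info": control_name})
    # Step two (b1): obtain the DOM structure (type=3) after the click has run.
    ET.SubElement(doc, "option", {"type": "3"})
    return ET.tostring(doc, encoding="unicode")


if __name__ == "__main__":
    # Placeholder URL and control name standing in for "A" and "H".
    print(build_render_message("http://example.com/A", "H"))
```

The order of the `option` elements carries the order of the manual steps, so a consumer would execute them in document order.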
<2> The part of the webpage that is operated on, which determines the object of the operation in <1> above. It can be expressed in one of the following three forms:
2a) The coordinates of the click, represented by coord. Because coordinates give a fixed point, MSAA (Microsoft Active Accessibility) does not need to be used to search the webpage for the control. Once the click coordinates are known, the operation in <1> has a definite object. This form is normally used when the coordinates of the control to be clicked are known: after the coordinates are filled in, the click can be performed directly, skipping the step of locating the page control through an MSAA program. For example, on a resource download site where the download button is always at coordinates (X, Y) on every page, clicking those coordinates directly triggers the download.
2b) The name of the clicked control, represented by click_info. In this form, the operation in <1> is performed on the control with the given name. This form is normally used when the coordinates of the control to be clicked are not fixed. For example, on a resource download site the download button "Download to computer" may appear at a different position on each page while its name stays the same; the coordinates of the control can then be obtained by searching each page for the fixed control name "Download to computer", after which the control is clicked to trigger the download.
2c) The recorded URL that is clicked, represented by click_url. If the URL of the link is known exactly, the operation in <1> likewise has a definite object. On typical software resource sites, the details page of a piece of software has a "click to enter the download page" button whose link is constant, while other parameters can be passed through cookies. Where the link is constant in this way, the link address to click can be specified. Of course, where the control name of the button is constant, the method of 2b) above can also be used to perform the download.
Among the attributes of each OPTION node, item <1> above is a mandatory field; of the three forms in item <2>, the user selects one to fill in according to need.
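The rule just stated (type is mandatory, and one of coord, click_info and click_url is chosen) can be checked before a message is queued. The sketch below is an illustration under one assumption: steps such as type=3 in the earlier example carry no target at all, so the check enforces "at most one" target rather than "exactly one". The function name `validate_option` is hypothetical.

```python
from typing import Mapping

# The three ways of naming the operated part of the page, as given in the text.
TARGET_ATTRS = ("coord", "click_info", "click_url")


def validate_option(attrs: Mapping[str, str]) -> None:
    """Raise ValueError if an OPTION node's attributes break the stated rules."""
    if "type" not in attrs:
        raise ValueError("type is a mandatory field")
    # Collect every target form that has been filled in with a non-empty value.
    filled = [name for name in TARGET_ATTRS if attrs.get(name)]
    if len(filled) > 1:
        raise ValueError(f"fill in only one of {TARGET_ATTRS}, got {filled}")
```

Rejecting a malformed node at this point keeps the error close to whoever wrote the message, instead of surfacing later as a failed click in the browser.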
For example, a user wants to download the Android game "the happy family of dioctahedral smectite" from the webpage http://mm.10086.cn/android/info/216774.html. The download link of the game is http://mm.10086.cn/download/android/216774, and the name of the control corresponding to this URL is "Download to computer". However, the website mm.10086.cn applies cookie encryption to the download link, so the resource cannot be downloaded from the site directly. The document message is therefore written in XML according to the steps actually taken to download the software manually, which are:
1) open the page in the IE browser;
2) right-click the download link and then select "Save Target As" to perform the download.
The document message written in XML according to these manual download steps comprises:
(1) The URL node, set to:
<url>http://mm.10086.cn/download/android/216774</url>;
(2) The operation steps performed on webpages of the website http://mm.10086.cn/:
(a) Operation step one is set with one OPTION node of the XML; this OPTION node is based on actual step 2) of the manual download above and has the following attributes:
<a1> The type of operation performed on the webpage is set to 1, a right mouse click, which corresponds to "right-click" in actual step 2) of the manual download above;
<a2> The part of the webpage that is operated on uses "the name of the clicked control, click_info". Because every software download page of the website http://mm.10086.cn/ shows a control named "Download to computer", the user only needs to right-click this control name to reveal the download link, and coord and click_url need not be filled in.
Operation step one can be written in XML in the following form:
Figure BDA0000392455970000061
(b) Operation step two is set with another OPTION node of the XML; this OPTION node is based on actual step 2) of the manual download above and has the following attribute:
<b1> The type of operation performed on the webpage is set to 2, that is, downloading a file. This corresponds to "select 'Save Target As' to perform the download" in actual step 2) of the manual download above;
Operation step two can be written in XML in the following form:
Figure BDA0000392455970000062
The 10086 website http://mm.10086.cn/ hosts on the order of ten thousand applications, each on its own webpage, but all these pages share the same layout. When a user downloads resources from multiple pages of the site, the same operations are performed on each. Therefore, to obtain network resources more efficiently, the download of the resource on each page of the site is written, following the manual steps, as an XML document message similar to the one above: ten thousand applications correspond to ten thousand URLs, and one XML document message is produced for each URL. These XML document messages are placed in a message queue; the information in each XML document message is then taken from the queue, the customized operations are carried out, and the result of the operations is output. Alternatively, the XML document messages can be delivered by way of task scheduling.
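Producing one document message per URL and queuing them can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the in-process `queue.Queue` stands in for whatever message queue is actually used, the `document` root element and the attribute layout of `option` are assumptions, and the function names are hypothetical. The two steps (type=1 on a named control, then type=2) follow the download example above.

```python
import queue
import xml.etree.ElementTree as ET
from typing import Iterable


def make_download_message(url: str, control_name: str) -> str:
    """One document message: right-click the named control, then download."""
    doc = ET.Element("document")  # hypothetical root element name
    ET.SubElement(doc, "url").text = url
    ET.SubElement(doc, "option", {"type": "1", "click_info": control_name})
    ET.SubElement(doc, "option", {"type": "2"})
    return ET.tostring(doc, encoding="unicode")


def fill_queue(urls: Iterable[str], control_name: str) -> "queue.Queue[str]":
    """Create one message per URL and put them all into a queue."""
    messages: "queue.Queue[str]" = queue.Queue()
    for url in urls:
        messages.put(make_download_message(url, control_name))
    return messages
```

Because every page shares the same layout, only the URL varies between messages; the operation steps are written once and reused.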
After the document message has been extracted in step S110 above, step S120 is executed: the webpage corresponding to the URL node contained in the document message is opened with a browser. Because the browser supports asynchronous request processing and automatically decrypts and loads webpage resources when opening a webpage, the webpage can be opened directly, and the resources on it downloaded, by means of the browser.
In the example for step S110 above, the URL node set in the document message is:
<url>http://mm.10086.cn/download/android/216774</url>
so at step S120 the webpage corresponding to this URL node is opened in the browser.
Next, at step S130, the webpage is operated on according to the operation strategy contained in the document message. In the example for step S110 above, the operation strategy contained in the document message is the set of operation steps for the webpage, all of which are written in the XML document. According to that XML document, a right mouse click is first performed on the control named "Download to computer" shown on the webpage http://mm.10086.cn/download/android/216774.html, and then "Save Target As" is selected, which downloads the Android game "the happy family of dioctahedral smectite".
Next, at step S140, the result of the operation on the webpage is output. Once step S130 has been executed, the corresponding network resource can be obtained from the webpage. In the example for step S110 above, the installation package file akoopf_sina_0724.apk of the Android game "the happy family of dioctahedral smectite" is downloaded. The operation result can then be base64-encoded, stored in the result node of the XML, and the XML output; other output formats, such as JSON and protobuf, can also be used.
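Storing a base64-encoded result in a result node can be sketched with the standard library. The text names the result node and the base64 encoding; where the node sits in the document (here, as a child of the root) and the function names `append_result` and `read_result` are assumptions made for illustration.

```python
import base64
import xml.etree.ElementTree as ET


def append_result(message_xml: str, payload: bytes) -> str:
    """Return the message with a <result> child holding base64(payload)."""
    doc = ET.fromstring(message_xml)
    result = ET.SubElement(doc, "result")
    # base64 keeps arbitrary bytes (for example a downloaded file) XML-safe.
    result.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(doc, encoding="unicode")


def read_result(message_xml: str) -> bytes:
    """Decode the payload stored by append_result."""
    encoded = ET.fromstring(message_xml).findtext("result")
    return base64.b64decode(encoded)
```

Encoding the payload is what allows binary results such as an installation package to travel inside a text format; a JSON output would need the same step.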
In step S130 above, when "<2> the part of the webpage that is operated on" in an XML document message takes the form "name of the clicked control", MSAA can be used to read the DOM structure of the page in the browser and thereby locate the coordinates of the named control.
For the other two forms of "<2> the part of the webpage that is operated on", namely "click coordinates" and "recorded URL that is clicked", the position of the operation object is already explicit and no further locating is needed.
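The distinction drawn here, where only a control name needs a lookup while coordinates and URLs are used as given, can be sketched as a small resolver. Everything below is an assumption made for illustration: the `find_control` callback stands in for the MSAA-based lookup (no real MSAA call is shown), the "x,y" text format of coord is invented, and `resolve_target` is a hypothetical name.

```python
from typing import Callable, Mapping, Optional, Tuple, Union

Point = Tuple[int, int]
Target = Union[Tuple[str, Point], Tuple[str, str]]


def resolve_target(
    attrs: Mapping[str, str],
    find_control: Callable[[str], Optional[Point]],
) -> Optional[Target]:
    """Turn an OPTION node's target attribute into something clickable."""
    if attrs.get("coord"):
        # Coordinates are already explicit; assumed to be written as "x,y".
        x, y = attrs["coord"].split(",")
        return ("point", (int(x), int(y)))
    if attrs.get("click_url"):
        # A recorded URL is also explicit and needs no lookup.
        return ("url", attrs["click_url"])
    if attrs.get("click_info"):
        # Only a control name has to be located on the rendered page.
        point = find_control(attrs["click_info"])
        if point is None:
            raise LookupError(f"control not found: {attrs['click_info']!r}")
        return ("point", point)
    return None  # steps without a target, such as obtaining the DOM
```

Passing the lookup in as a callback keeps the resolver independent of the accessibility technology used to find controls.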
Fig. 2 shows the browser-based device for extracting network resources according to one embodiment of the present invention. The browser-based extraction device 200 of the invention is mainly intended for batch processing of network resources. To this end, according to the layout of a website's pages, the steps for manually operating each page of the site are written in XML, JSON or protobuf as a document message, and the document messages are then put into a message queue for use by the extraction device 200. The document messages are written in XML, JSON or protobuf in the manner described for step S110 above; for brevity the description is not repeated here.
As shown in Fig. 2, the browser-based extraction device 200 of the invention comprises a message acquisition module 210, a webpage opening module 220, a webpage operation module 230 and a result output module 240. The message acquisition module 210 extracts a document message from a message queue containing a plurality of document messages, or receives a document message from such a queue delivered by way of task scheduling, wherein each document message contains the URL node of a webpage to be operated on and an operation strategy for operating on that webpage. The operation strategy may comprise operation steps for operating on the webpage; each operation step corresponds to one OPTION node of the XML, JSON or protobuf document, and each OPTION node may contain the following attributes: <1> the type of operation performed on the webpage, and <2> the part of the webpage that is operated on. The type of operation can be set to include:
0 represents clicking the left mouse button;
1 represents clicking the right mouse button;
2 represents downloading a file; and
3 represents obtaining the DOM structure of the webpage.
The part of the webpage that is operated on determines the object of the operation in <1> and is given by the coordinates clicked on the webpage, the name of the clicked control, or the recorded URL that is clicked. For details, see the description of step S110 above.
The webpage opening module 220 opens, with a browser, the webpage corresponding to the URL node contained in the document message. Because the browser supports asynchronous request processing and automatically decrypts and loads webpage resources when opening a webpage, the webpage can be opened directly and the resources on it downloaded by means of the browser even if the resources on the website are encrypted, which solves the problem that website resources cannot be downloaded directly in the background.
The webpage operation module 230 operates on the webpage according to the operation strategy contained in the document message. Because the operation strategy for the webpage is written into the document message in XML, JSON or protobuf, the operation on the webpage can be completed by executing the operation strategy, for example the operation steps, in the XML document message.
The result output module 240 outputs the result of the operations that the webpage operation module 230 performs on the webpage. The operation result can be base64-encoded, stored in the result node of the XML, and the XML output; other output formats, such as JSON and protobuf, can also be used.
In an XML document message, when "<2> the part of the webpage that is operated on" takes the form "name of the clicked control", MSAA can be used to read the DOM structure of the page in the browser and thereby locate the coordinates of the named control. For the other two forms of "<2> the part of the webpage that is operated on", namely "click coordinates" and "recorded URL that is clicked", the position of the operation object is already explicit and no further locating is needed.
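The cooperation of the four modules can be sketched as a single loop. This is a minimal illustration, not the device itself: the `Browser` interface, the in-memory `Message` tuple and the `RecordingBrowser` stand-in are all hypothetical, and no real browser is driven. The comments map each line to the module and step numbers used in the description.

```python
import queue
from typing import List, Protocol, Tuple

Step = Tuple[int, str]             # (type, target); hypothetical form of an OPTION node
Message = Tuple[str, List[Step]]   # (URL, operation steps)


class Browser(Protocol):
    """Hypothetical interface to whatever actually drives the browser."""
    def open(self, url: str) -> None: ...
    def perform(self, op_type: int, target: str) -> bytes: ...


def run_extraction(messages: "queue.Queue[Message]", browser: Browser) -> List[bytes]:
    """Drain the queue, running each message through the four stages."""
    results: List[bytes] = []
    while True:
        try:
            url, steps = messages.get_nowait()   # message acquisition (210, S110)
        except queue.Empty:
            return results
        browser.open(url)                        # webpage opening (220, S120)
        output = b""
        for op_type, target in steps:            # webpage operation (230, S130)
            output = browser.perform(op_type, target)
        results.append(output)                   # result output (240, S140)


class RecordingBrowser:
    """Stand-in used only to exercise the loop; it records calls instead of browsing."""
    def __init__(self) -> None:
        self.log: list = []

    def open(self, url: str) -> None:
        self.log.append(("open", url))

    def perform(self, op_type: int, target: str) -> bytes:
        self.log.append((op_type, target))
        return f"{op_type}:{target}".encode()
```

In this sketch the result of a message is taken to be the output of its last step, which matches the two examples in the description (the DOM after a click, the file after a right-click).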
With the browser-based method and device of the invention, resources can be obtained from a class of pages or websites in which every webpage has the same layout; only one set of manual steps needs to be customized to automate the extraction of network resources. An Android resource site, for example, offers numerous applications for download on pages that all share the same layout, so the browser-based method and device of the invention can download those applications automatically.
The browser-based method and device of the invention can also be used by web search engines to crawl network resources. Web search engines mainly crawl with spider (crawler) technology. However, the resources on websites are usually encrypted to prevent direct crawling in the background. By opening a website's resources through a browser, the method and device of the invention can automatically crawl large numbers of network resources, for example the ten thousand or so applications on one website.
The algorithms and displays provided here are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described here can be implemented in a variety of programming languages, and that the above description of a specific language is given to disclose the best mode of the invention.
In the instructions that provided herein, a large amount of details have been described.Yet, can understand, embodiments of the invention can not put into practice in the situation that there is no these details.In some instances, be not shown specifically known method, structure and technology, so that not fuzzy understanding of this description.
Similarly, be to be understood that, in order to simplify the disclosure and to help to understand one or more in each inventive aspect, in the above in the description of exemplary embodiment of the present invention, each feature of the present invention is grouped together into single embodiment, figure or sometimes in its description.Yet, the method for the disclosure should be construed to the following intention of reflection: the present invention for required protection requires than the more feature of feature of clearly recording in each claim.Or rather, as reflected in claims below, inventive aspect is to be less than all features of disclosed single embodiment above.Therefore, claims of following embodiment are incorporated to this embodiment thus clearly, and wherein each claim itself is as independent embodiment of the present invention.
Those skilled in the art are appreciated that and can the module in the equipment in embodiment are adaptively changed and they are arranged in one or more equipment different from this embodiment.Module in embodiment or unit or assembly can be combined into a module or unit or assembly, and can put them into a plurality of submodules or subelement or sub-component in addition.At least some in such feature and/or process or unit are mutually repelling, and can adopt any combination to combine all processes or the unit of disclosed all features in this instructions (comprising claim, summary and the accompanying drawing followed) and disclosed any method like this or equipment.Unless clearly statement in addition, in this instructions (comprising claim, summary and the accompanying drawing followed) disclosed each feature can be by providing identical, be equal to or the alternative features of similar object replaces.
In addition, those skilled in the art can understand, although embodiment more described herein comprise some feature rather than further feature included in other embodiment, the combination of the feature of different embodiment means within scope of the present invention and forms different embodiment.For example, in the following claims, the one of any of embodiment required for protection can be used with array mode arbitrarily.
All parts embodiment of the present invention can realize with hardware, or realizes with the software module moved on one or more processor, or realizes with their combination.It will be understood by those of skill in the art that and can use in practice microprocessor or digital signal processor (DSP) to realize the some or all functions according to the some or all parts in the browser client of the embodiment of the present invention.The present invention for example can also be embodied as, for carrying out part or all equipment or device program (, computer program and computer program) of method as described herein.Realizing program of the present invention and can be stored on computer-readable medium like this, or can there is the form of one or more signal.Such signal can be downloaded and obtain from internet website, or provides on carrier signal, or provides with any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A browser-based method for extracting network resources, comprising:
extracting a document message from a message queue containing a plurality of document messages, or receiving a document message from such a message queue delivered by task scheduling, wherein each document message contains a URL node of a web page to be operated on and an operation strategy for operating on that web page;
opening, with a browser, the web page corresponding to the URL node contained in the document message;
operating on the web page according to the operation strategy contained in the document message; and
outputting the operation result of operating on the web page.
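As a rough illustration of the four steps of claim 1, the following Python sketch drains a queue of JSON document messages and hands each one to a browser driver. The field names `url`, `options`, `mode` and `target`, the `FakeBrowser` class and the `process_queue` function are all assumptions made for this example; the claims do not prescribe them, and a real deployment would substitute an actual browser automation layer for the stand-in used here.

```python
import json
import queue


class FakeBrowser:
    """Hypothetical stand-in for a real browser; it only records what it is asked to do."""

    def __init__(self):
        self.log = []

    def open(self, url):
        self.log.append(("open", url))

    def operate(self, option):
        # A real browser would click, download, or read the DOM here.
        self.log.append(("operate", option["mode"], option["target"]))
        return {"mode": option["mode"], "status": "done"}


def process_queue(messages, browser):
    """Take document messages off the queue, open each page, apply its strategy, collect results."""
    results = []
    while True:
        try:
            raw = messages.get_nowait()
        except queue.Empty:
            break
        message = json.loads(raw)
        # Open the page named by the URL node.
        browser.open(message["url"])
        # Follow the operation strategy step by step.
        steps = [browser.operate(opt) for opt in message["options"]]
        # Output the operation result for this page.
        results.append({"url": message["url"], "result": steps})
    return results


messages = queue.Queue()
messages.put(json.dumps({
    "url": "http://example.com/",
    "options": [{"mode": "left_click", "target": "download_button"}],
}))
browser = FakeBrowser()
results = process_queue(messages, browser)
```

Passing the browser in as an argument keeps the queue-handling logic testable without launching a real browser.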
2. The method according to claim 1, wherein
the document message is created in XML, JSON or protobuf.
3. The method according to claim 2, wherein
the operation strategy for operating on the web page comprises operation steps for operating on the web page, wherein each operation step corresponds to one OPTION node of the XML, JSON or protobuf, and each OPTION node comprises the following attributes:
the operation mode to be performed on the web page; and
the part of the web page to be operated on.
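To make the message layout of claims 2 and 3 concrete, here is a minimal sketch of an XML document message and a parser for it. Only the URL node and the OPTION nodes come from the claims; the `MESSAGE` root element, the attribute names `mode` and `target`, and the function `parse_document_message` are hypothetical choices for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document message; only the URL and OPTION node names come from the claims.
DOCUMENT_MESSAGE = """
<MESSAGE>
  <URL>http://example.com/list</URL>
  <OPTION mode="left_click" target="next_page"/>
  <OPTION mode="get_dom" target="body"/>
</MESSAGE>
"""


def parse_document_message(xml_text):
    """Return the page URL and the ordered list of operation steps."""
    root = ET.fromstring(xml_text.strip())
    url = root.findtext("URL").strip()
    # Each OPTION node is one operation step: what to do and where to do it.
    steps = [
        {"mode": node.get("mode"), "target": node.get("target")}
        for node in root.findall("OPTION")
    ]
    return url, steps


url, steps = parse_document_message(DOCUMENT_MESSAGE)
```

Because `findall` returns nodes in document order, the steps are replayed in the order in which they were written into the message.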
4. The method according to claim 3, wherein
the operation mode to be performed on the web page comprises:
clicking the left mouse button;
clicking the right mouse button;
downloading a file; and
obtaining the DOM structure of the web page.
5. The method according to claim 3 or 4, wherein
the part of the web page to be operated on comprises the coordinates clicked on the web page, the name of the clicked control, or the URL of a recorded click.
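The operation modes of claim 4 and the target kinds of claim 5 could be validated before a step is executed, along the lines of the sketch below. The enum values, the helper names `classify_target` and `describe_step`, and the rule that a two-element sequence means coordinates are assumptions for illustration, not part of the claims.

```python
from enum import Enum


class Mode(Enum):
    # The four operation modes listed in claim 4; the string values are hypothetical.
    LEFT_CLICK = "left_click"
    RIGHT_CLICK = "right_click"
    DOWNLOAD = "download"
    GET_DOM = "get_dom"


def classify_target(target):
    """Classify a target as coordinates, a URL, or a control name."""
    if isinstance(target, (tuple, list)) and len(target) == 2:
        return "coordinates"
    if isinstance(target, str) and target.startswith(("http://", "https://")):
        return "url"
    return "control_name"


def describe_step(mode_value, target):
    """Validate the mode and summarize the step as a readable string."""
    mode = Mode(mode_value)  # raises ValueError for a mode outside the four listed
    return f"{mode.name} on {classify_target(target)} {target!r}"
```

For instance, `describe_step("left_click", "submit_button")` yields a summary naming the `LEFT_CLICK` mode and a `control_name` target, while an unknown mode is rejected before any browser action is attempted.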
6. The method according to any one of claims 1-5, wherein
outputting the operation result of operating on the web page comprises:
encoding the operation result and storing it in a result node of the XML, JSON or protobuf; and
outputting the XML, JSON or protobuf file.
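The output step of claim 6 might look like the following sketch for the JSON case. The claim says only that the result is "encoded"; Base64 is an assumption chosen here because it lets arbitrary bytes (a downloaded file, a serialized DOM) sit safely inside a text format. The node name `result` and the functions `attach_result` and `write_output` are likewise hypothetical.

```python
import base64
import json


def attach_result(message, raw_result):
    """Return a copy of the message with the encoded result stored in a `result` node."""
    # Base64 is an assumed encoding; the claim does not name one.
    encoded = base64.b64encode(raw_result).decode("ascii")
    output = dict(message)
    output["result"] = encoded
    return output


def write_output(message, raw_result):
    """Serialize the message plus its result node as JSON text."""
    return json.dumps(attach_result(message, raw_result), ensure_ascii=False)


message = {"url": "http://example.com/", "options": [{"mode": "get_dom", "target": "body"}]}
output_text = write_output(message, b"<html><body>hello</body></html>")
```

A consumer recovers the original bytes by parsing the JSON and Base64-decoding the `result` node; the input message itself is left unmodified.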
7. A browser-based device for extracting network resources, comprising:
a message acquisition module adapted to extract a document message from a message queue containing a plurality of document messages, or to receive a document message from such a message queue delivered by task scheduling, wherein each document message contains a URL node of a web page to be operated on and an operation strategy for operating on that web page;
a web page opening module adapted to open, with a browser, the web page corresponding to the URL node contained in the document message;
a web page operation module adapted to operate on the web page according to the operation strategy contained in the document message; and
a result output module adapted to output the operation result of operating on the web page.
8. The device according to claim 7, wherein
the document message is created in XML, JSON or protobuf.
9. The device according to claim 8, wherein
the operation strategy for operating on the web page comprises operation steps for operating on the web page, wherein each operation step corresponds to one OPTION node of the XML, JSON or protobuf, and each OPTION node comprises the following attributes:
the operation mode to be performed on the web page; and
the part of the web page to be operated on.
10. The device according to claim 9, wherein
the operation mode to be performed on the web page comprises:
clicking the left mouse button;
clicking the right mouse button;
downloading a file; and
obtaining the DOM structure of the web page.
CN201310464253.XA 2013-10-08 2013-10-08 Network resource extracting method and device based on browser Pending CN103593396A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310464253.XA CN103593396A (en) 2013-10-08 2013-10-08 Network resource extracting method and device based on browser

Publications (1)

Publication Number Publication Date
CN103593396A (en) 2014-02-19

Family

ID=50083537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310464253.XA Pending CN103593396A (en) 2013-10-08 2013-10-08 Network resource extracting method and device based on browser

Country Status (1)

Country Link
CN (1) CN103593396A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6615237B1 (en) * 2000-02-04 2003-09-02 Microsoft Corporation Automatic searching for data in a network
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for extracting network data and web crawler system
CN101118553A (en) * 2007-08-09 2008-02-06 姜边 Internet information acquisition method facing field and oriented by policy
CN103268361A (en) * 2013-06-07 2013-08-28 百度在线网络技术(北京)有限公司 Extracting method, device and system of hidden URL (Uniform Resource Locator) in webpage

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462811A (en) * 2014-12-05 2015-03-25 云中万维(北京)科技有限公司 Network game data processing method
CN105117222A (en) * 2015-08-19 2015-12-02 北京奇虎科技有限公司 Method and apparatus for providing android package (APK) customization service
CN108038233A (en) * 2017-12-26 2018-05-15 福建中金在线信息科技有限公司 Method and device for collecting articles, electronic equipment and storage medium
CN108038233B (en) * 2017-12-26 2021-07-23 福建中金在线信息科技有限公司 Method and device for collecting articles, electronic equipment and storage medium
WO2020015186A1 (en) * 2018-07-19 2020-01-23 平安科技(深圳)有限公司 Method and apparatus for real-time update of page data and electronic device
CN109559174A (en) * 2018-11-30 2019-04-02 上海连尚网络科技有限公司 Method for dotting popularization resource and counting click of popularization resource
CN109559174B (en) * 2018-11-30 2021-04-16 上海连尚网络科技有限公司 Method for dotting popularization resource and counting click of popularization resource
CN112642157A (en) * 2020-12-31 2021-04-13 广州华多网络科技有限公司 Agent development control method and corresponding device, equipment and medium
CN112642157B (en) * 2020-12-31 2023-04-28 广州华多网络科技有限公司 Agent development control method and corresponding device, equipment and medium thereof

Similar Documents

Publication Publication Date Title
KR102220127B1 (en) Method and apparatus for customized software development kit (sdk) generation
CN107895009B (en) Distributed internet data acquisition method and system
US8266202B1 (en) System and method for auto-generating JavaScript proxies and meta-proxies
US7958232B1 (en) Dashboard for on-the-fly AJAX monitoring
KR102218995B1 (en) Method and apparatus for code virtualization and remote process call generation
CN103593396A (en) Network resource extracting method and device based on browser
US8639743B1 (en) System and method for on-the-fly rewriting of JavaScript
US8527860B1 (en) System and method for exposing the dynamic web server-side
EP2775407B1 (en) Method and system for performing local invocation with webpage
CN103150513B (en) The method of the implantation information in interception application program and device
US9798524B1 (en) System and method for exposing the dynamic web server-side
US10877825B2 (en) System for offline object based storage and mocking of rest responses
US8819539B1 (en) On-the-fly rewriting of uniform resource locators in a web-page
US10007532B1 (en) Data infrastructure for cross-platform cross-device API inter-connectivity
CN102982162B (en) The acquisition system of info web
KR20090080981A (en) Aggregating portlets for use within a client environment without relying upon server resources
CN110224896B (en) Network performance data acquisition method and device and storage medium
CN103019817B (en) A kind of method and apparatus mutual for the page
CN103595770A (en) Method and device for achieving file downloading through SDK
JP2006195979A (en) Web application architecture
CN103177115A (en) Method and device of extracting page link of webpage
CN108701130A (en) Hints model is updated using auto-browsing cluster
CN103034495A (en) Browser capable of isolating plug-in in webpage and webpage plug-in isolating method
CN105516333A (en) Interactive method and system based on webpage
CN113934913A (en) Data capture method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140219
