CN101178736A - Web page collecting method and web page collecting server - Google Patents

Publication number
CN101178736A
Authority
CN
China
Prior art keywords
webpage
buffer area
web page
extracting
grasp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007101985301A
Other languages
Chinese (zh)
Other versions
CN100501746C (en)
Inventor
王为
纪宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CNB2007101985301A
Publication of CN101178736A
Application granted
Publication of CN100501746C
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a web page fetching method and a web page fetching server. The method comprises: A. receiving a web page request; B. determining whether the requested web page has been fetched before; if so, executing step C; otherwise, fetching the web page and ending the flow; C. determining whether the fetch interval of the requested web page is greater than a preset time threshold; if so, executing step D; otherwise, not fetching the web page and ending the flow; D. querying whether the web page has been updated; if so, fetching the web page; otherwise, not fetching it. The server comprises a web page request receiving module, a judging module, a query module, and a fetching module. The invention lightens the load on the web page fetching server, reduces network bandwidth consumption, and improves the efficiency of web page fetching.

Description

Web page fetching method and web page fetching server
Technical field
The present invention relates to the field of information processing, and in particular to a web page fetching method and a web page fetching server in a wireless search web page conversion system.
Background art
With the development of Internet technology, wireless Internet technology is also advancing rapidly. People can contact others anytime and anywhere through mobile communication terminals (for example mobile phones and wireless handheld computers). As telecommunication tariffs fall and 3G technology spreads, the wireless Internet is expected to grow considerably and change the way people live.
Web pages are currently the most abundant resource on the Internet, but they are written in HyperText Markup Language (HTML) and designed for personal computers (PCs). Because of the limited screen size, processing power, and network bandwidth of mobile communication terminals, these pages cannot be browsed directly on such terminals. To address this, a markup language called Wireless Markup Language (WML) was designed for writing web pages that can be displayed on mobile communication terminals.
Wireless Internet users also need to search for information, so a search engine similar to those on PCs is needed to help them. Because HTML pages currently far outnumber WML pages, most search results are found in HTML pages. A wireless search web page conversion system has therefore emerged that automatically converts HTML pages into WML pages, which wireless Internet users can browse directly on their mobile communication terminals.
The wireless search web page conversion system comprises a web page fetching server, a conversion server, and a storage server. In the basic process, the web page fetching server first obtains the request of a mobile communication terminal user and isolates the address of the original HTML page. It then fetches the HTML page automatically and passes it to the conversion server, which parses it and converts it into a WML page. The WML page is stored in the storage server, where mobile communication terminals can access it.
The existing technical scheme for how the web page fetching server fetches HTML pages is as follows:
The Map data structure of the Standard Template Library (STL) is used as a cache that stores URL objects. The key of a URL object is the Message-Digest algorithm 5 (MD5) value of the web page URL, and its value is the time at which the page was fetched. In addition, a single time threshold is set for the fetch interval of all web pages, typically 10 minutes.
A mobile communication terminal finds web pages through a wireless search engine. After the user clicks a search result, the terminal sends the corresponding web page request to the wireless search web page conversion system. On receiving the request, the system isolates the URL address of the requested page and computes the MD5 value of that address. With this MD5 value as the key and the current time as the value, it searches the cache of the web page fetching server. If a URL object with the same key exists, the system reads the fetch time of that URL object and compares it with the current time. If the difference is greater than or equal to the set time threshold, the system overwrites the URL object in the cache, that is, updates its value to the current time, fetches the page of this URL object again, and has the conversion server convert it into a WML page and store it in the storage server. If the difference is less than the set time threshold, the page does not need to be fetched again: the web page fetching server simply discards the web page request, and the storage server returns the currently stored WML page corresponding to this URL object to the requesting mobile communication terminal.
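The prior-art scheme described above can be illustrated with a minimal Python sketch. It is not the original implementation (which uses an STL Map); the names url_key, should_refetch and THRESHOLD_SECONDS are hypothetical, and treating a URL with no cache entry as "fetch and record" is an assumption, since the text only describes the case where the key already exists.

```python
import hashlib

# Single fetch-interval threshold applied to every page (10 minutes, as in the text).
THRESHOLD_SECONDS = 10 * 60


def url_key(url: str) -> str:
    """Return the MD5 hex digest of the URL, used as the cache key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def should_refetch(cache: dict, url: str, now: float) -> bool:
    """Prior-art rule: refetch once the entry is at least THRESHOLD_SECONDS old.

    Assumption: a URL that has no cache entry yet is fetched and recorded.
    """
    key = url_key(url)
    last_fetch = cache.get(key)
    if last_fetch is None or now - last_fetch >= THRESHOLD_SECONDS:
        cache[key] = now  # overwrite the stored fetch time with the current time
        return True
    return False  # request is dropped; the stored WML page is served instead
```

Note that the rule never asks whether the page actually changed, which is the shortcoming the following paragraphs discuss.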
The above prior art has the following shortcomings:
The prior art sets a single fetch-interval threshold for all types of web pages and cannot adapt flexibly to how often different types of pages are updated. Suppose the threshold is set to 10 minutes. For frequently updated pages, such as forum and comment pages, a 10-minute fetch interval is too long. Conversely, pages with a very low update frequency, such as news pages, are unlikely to change after publication, yet the current system cannot adapt to this and still fetches them again every 10 minutes. When the fetch interval of a page exceeds the set threshold, that is, when the page has expired from the cache, this does not mean that the page content has been updated and needs to be fetched again; in fact, most pages on the Internet have long update cycles.
The prior-art wireless search web page conversion system therefore cannot adapt to long page update cycles. It repeatedly fetches many pages whose content has not been updated, which increases the load on the web page fetching server, consumes excessive network bandwidth, and lowers fetching efficiency.
Summary of the invention
In view of this, one technical problem to be solved by the present invention is to provide a web page fetching method that lightens the load on the web page fetching server, reduces network bandwidth consumption, and improves fetching efficiency.
Another technical problem to be solved by the present invention is to provide a web page fetching server that lightens its own system load, reduces network bandwidth consumption, and improves fetching efficiency.
To achieve the above objects, the main technical scheme of the present invention is as follows:
A web page fetching method, comprising:
A. receiving a web page request;
B. determining whether the requested web page has been fetched before; if so, executing step C; otherwise, fetching the web page and ending the flow;
C. determining whether the fetch interval of the requested web page is greater than a preset time threshold; if so, executing step D; otherwise, not fetching the web page and ending the flow;
D. querying whether the web page has been updated; if it has, fetching the web page; otherwise, not fetching the web page.
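Steps B to D above amount to a three-way decision. The following Python sketch shows one way to express it; the names Action, decide and page_updated are hypothetical, and passing the update check as a callable is a design assumption made so that the origin server is only queried when step D is actually reached.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    FETCH = "fetch"
    SKIP = "skip"


def decide(fetched_before: bool, interval: float, threshold: float,
           page_updated: Callable[[], bool]) -> Action:
    """Return the action for one request, following steps B, C and D."""
    # Step B: a page that has never been fetched is always fetched.
    if not fetched_before:
        return Action.FETCH
    # Step C: within the threshold, the page is not fetched again.
    if interval <= threshold:
        return Action.SKIP
    # Step D: past the threshold, fetch only if the page really changed.
    return Action.FETCH if page_updated() else Action.SKIP
```

For example, decide(True, 400, 300, lambda: False) returns Action.SKIP: the cache entry has expired, but the page is unchanged, so no fetch takes place.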
Preferably, the method sets up in advance a cache area and a time threshold corresponding to the cache area;
and, when a web page is fetched for the first time, an object is created for the web page and stored in the cache area, the object comprising the identifier of the web page and the request time, the stored time being further updated to the current time in the subsequent step D;
in step B, whether the web page has been fetched is determined by whether the identifier of the requested web page exists in the cache; in step C, the fetch interval is the difference between the current time and the time contained in the web page's object in the cache area, and the time threshold is the time threshold corresponding to the cache area.
Preferably, cache areas of different levels are set up according to differences in web page fetch frequency, cache areas of different levels corresponding to different fetch-interval time thresholds; and the object of a web page is moved between cache areas of different levels according to the fetch frequency of the web page.
Preferably, a fetch frequency level value is set for the cache area of each level, and a fetch count is further set in the object of each web page, the initial value of the fetch count being 0;
step D further comprises: if the web page has been updated, adding 1 to the fetch count in the web page's object, and if the web page has not been updated, subtracting 1 from the fetch count in the web page's object; and comparing the fetch count of the web page's object with the fetch frequency level value of the cache area to which the object belongs: if the fetch count is greater than the fetch frequency level value, moving the object to the next higher-level cache area, which has a shorter time threshold; if the fetch count is less than the fetch frequency level value, moving the object to the next lower-level cache area, which has a longer time threshold.
Preferably, querying whether the web page has been updated in step D is specifically: determining whether the web page has been updated according to a return code of the Hypertext Transfer Protocol (HTTP).
Preferably, the web page is a HyperText Markup Language (HTML) web page.
A web page fetching server, comprising:
a web page request receiving module, configured to receive a web page request;
a judging module, configured to determine whether the requested web page has been fetched before and what its fetch interval is, to trigger the fetching module when the page has not been fetched, and to trigger the query module when the fetch interval is greater than a preset time threshold;
a query module, configured to query whether the web page has been updated and to trigger the fetching module when it has;
a fetching module, configured to fetch the web page.
Preferably, the server further comprises a cache area, configured to store the objects of fetched web pages, the cache area having a corresponding time threshold; the judging module determines from the web page objects in the cache area whether a web page has been fetched and what its fetch interval is, and the time threshold used for the comparison is the time threshold corresponding to the cache area.
Preferably, the cache area has at least two levels, each level corresponding to a different web page fetch frequency and fetch-interval time threshold;
and the web page fetching server further comprises an object migration module, configured to move the object of a web page between cache areas of different levels according to the fetch frequency of the web page.
Preferably, the web page is a HyperText Markup Language (HTML) web page.
With the present invention, the web page fetching server does not need to re-fetch a page that a user requests again within a certain time threshold; the stored result is returned directly from the storage server instead. Once the fetch interval of a page exceeds the preset time threshold, the server further checks whether the page has been updated, and fetches it only if it has. The invention therefore avoids repeatedly fetching many pages whose content has not been updated, which lightens the load on the web page fetching server, reduces network bandwidth consumption, and improves fetching efficiency.
In addition, the present invention uses a multi-level cache mechanism to further improve fetching efficiency. Cache areas of different levels are set up according to differences in fetch frequency, each corresponding to web pages with a different update frequency, and the object of a web page is moved between levels according to the page's fetch frequency. This brings the refresh frequency of the URL object close to the real update frequency of the page content and improves the accuracy of the cache areas.
Description of drawings
Fig. 1 is a flow chart of an embodiment of the web page fetching method of the present invention;
Fig. 2 is a schematic diagram of a structure of the web page fetching server of the present invention and its relationship with external systems.
Embodiment
The present invention is described in further detail below with reference to specific embodiments and the drawings.
The web page fetching method of the present invention applies to the web page fetching server in a wireless search web page conversion system. The server uses a caching mechanism to ensure that the same HTML page is not fetched repeatedly within a certain time range. Once the preset time threshold has elapsed, it checks from HTTP header information whether the content of the HTML page has been updated, and thereby decides whether the page needs to be fetched again. When an HTML page needs to be fetched, the web page fetching server fetches it from the server hosting the page and sends it to the conversion server of the wireless search web page conversion system. The conversion server converts it into a WML page and stores it in the storage server of the system, where mobile communication terminal users can access it.
Fig. 1 is a flow chart of an embodiment of the web page fetching method of the present invention. In this embodiment, three cache areas are initialized in the web page fetching server at startup to store URL objects. Each URL object corresponds to one HTML page, is keyed by the MD5 value of the page's URL address, and contains the request time of the HTML page and update, the actual fetch count of the HTML page; update is an integer value. Internally, each cache area is implemented with the STL Map data structure. The three cache areas are divided into three levels by fetch interval, and a fetch-interval time threshold is set for each: in this embodiment, for example, 5 minutes for the first cache area, 10 minutes for the second, and 20 minutes for the third. Each cache area also has a corresponding updateLevel value, which represents the fetch frequency of the HTML pages whose URL objects are in that level and is equivalent to a grade of how frequently those pages are updated. The size, time threshold, and updateLevel value of each cache area are stored in a configuration file. The web page fetching server reads this file at startup, and its management thread can update the values dynamically while the server is running.
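The three-level cache described in this embodiment might be modelled as below. This is only a sketch in Python rather than the STL Map of the embodiment; the class and function names are hypothetical, and the update_level value of 3 is a placeholder assumption, since the text does not state concrete updateLevel values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class UrlObject:
    request_time: float  # time of the last request or check, in seconds
    update: int = 0      # signed counter, starts at 0


@dataclass
class CacheTier:
    threshold_seconds: int  # fetch-interval threshold for this level
    update_level: int       # updateLevel of this level (placeholder value below)
    entries: Dict[str, UrlObject] = field(default_factory=dict)  # MD5 key -> object


def build_tiers() -> List[CacheTier]:
    """Create the three levels with the 5, 10 and 20 minute thresholds from the text."""
    # update_level=3 is an assumed value, not taken from the patent.
    return [CacheTier(5 * 60, 3), CacheTier(10 * 60, 3), CacheTier(20 * 60, 3)]


def find(tiers: List[CacheTier], key: str) -> Optional[Tuple[int, UrlObject]]:
    """Search the levels in order; return (level index, object) or None."""
    for index, tier in enumerate(tiers):
        if key in tier.entries:
            return index, tier.entries[key]
    return None
```

In a real deployment the thresholds and updateLevel values would come from the configuration file mentioned above rather than being hard-coded.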
Referring to Fig. 1, in this embodiment the web page fetching server performs the following steps:
Step 101: receive the user's web page request, that is, a URL request, isolate the URL address of the requested web page, and compute the MD5 value of this URL address.
The web page request originates from a mobile communication terminal. The terminal finds web pages through a wireless search engine; after the user clicks a search result, the terminal sends the corresponding web page request to the wireless search web page conversion system. Both the web page fetching server and the storage server of the system receive this request: the web page fetching server carries out the subsequent fetching operations according to it, and the storage server looks up the corresponding WML page according to it.
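Step 101 can be sketched in Python as follows. The patent does not specify how the original URL is embedded in the request, so the query parameter name url and the gateway address used here are purely hypothetical assumptions; only the MD5 keying comes from the text.

```python
import hashlib
from urllib.parse import parse_qs, urlparse


def extract_target_url(request_url: str) -> str:
    """Isolate the original page URL from a request.

    Assumption: the request carries the target in a query parameter named "url".
    """
    query = parse_qs(urlparse(request_url).query)
    return query["url"][0]  # parse_qs already percent-decodes the value


def md5_key(url: str) -> str:
    """Compute the MD5 value of the URL address, used as the cache key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Hypothetical request as it might arrive from a mobile terminal.
request = "http://gateway.example/convert?url=http%3A%2F%2Fexample.com%2Fnews.html"
target = extract_target_url(request)
key = md5_key(target)
```

The key is a 32-character hexadecimal string, which keeps cache lookups independent of URL length.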
Step 102: using the MD5 value of the URL as the key, search the three cache areas of the web page fetching server in turn for the URL object. If it is found in none of them, execute step 103; otherwise, execute step 104.
Step 103: generate a URL object keyed by the MD5 value of the URL and containing the request time of this URL and update, whose initial value is 0. Insert the URL object into a designated cache area, normally the first cache area. At the same time, open a network connection and fetch the actual content of the HTML page, that is, fetch the HTML page corresponding to this URL from the server hosting it. After sending the fetched HTML page to the conversion server, the web page fetching server ends the flow for this URL request. The conversion server then converts the HTML page into a WML page and stores it in the storage server, where mobile communication terminal users can access it.
When webpage extracting server is asked some URL for the first time, if successfully grasp webpage, the return state of this webpage place server can be 200, and content is a web data, the time that has this webpage of attribute-bit of a Last-Modified to be modified at last on the website simultaneously, form is similar:
Last-Modified:Wed,17?Oct?2007?12:45:30?GMT。
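As an illustration only, a Last-Modified value in this format can be parsed with Python's standard library. The function names parse_last_modified and age_seconds are hypothetical and not part of the patent.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_last_modified(header_value: str) -> datetime:
    """Parse an HTTP date such as a Last-Modified value into an aware datetime."""
    return parsedate_to_datetime(header_value)


def age_seconds(header_value: str, now: datetime) -> float:
    """Seconds elapsed between the Last-Modified time and 'now' (an aware datetime)."""
    return (now - parse_last_modified(header_value)).total_seconds()


last_modified = parse_last_modified("Wed, 17 Oct 2007 12:45:30 GMT")
# Ten minutes after the header time, chosen only for the example.
now = datetime(2007, 10, 17, 12, 55, 30, tzinfo=timezone.utc)
elapsed = age_seconds("Wed, 17 Oct 2007 12:45:30 GMT", now)
```

Storing the parsed value lets a later conditional request reuse the server's own timestamp instead of the local request time.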
Step 104: if the corresponding URL object is found in one of the cache areas, read the value of this URL object and subtract the time value in the URL object from the current time. If the difference is within the time threshold corresponding to this cache area, execute step 105; otherwise, execute step 106.
Step 105: the HTML page does not need to be fetched again. The web page fetching server ignores this URL request and ends the flow for it.
Step 106: taking the request time of the URL object in the cache area as the starting value, use the 304 return code of the HTTP protocol to determine whether the content of the HTML page at the URL has actually been updated. If the HTML page has been updated, execute step 108; otherwise, execute step 107.
The detailed process by which step 106 determines whether the HTML page content has actually been updated comprises:
Step 61: the web page fetching server takes the request time of the URL object in the cache area as the start time of the update query (If-Modified-Since) and sends the server hosting the HTML page a request asking whether the page has been updated, the request containing the URL address from the web page request. The start time may also be the last-modified time of the page contained in the previous status 200 response. For example:
If-Modified-Since: Wed, 17 Oct 2007 12:45:30 GMT
Step 62: the server hosting the URL address checks whether the HTML page corresponding to the URL address has been updated, and conveys the result through the HTTP 304 return code in its response to the web page fetching server;
Step 63: the web page fetching server determines from the HTTP 304 return code whether the HTML page content has actually been updated: a 304 return code with an empty body indicates that the HTML page has not been modified; any other response indicates that it has been modified.
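Steps 61 to 63 correspond to an HTTP conditional request. The sketch below builds the If-Modified-Since header and interprets the status code without performing any network I/O; the function names are hypothetical, and treating status codes other than 200 and 304 as errors is an assumption that the patent does not address.

```python
from email.utils import formatdate


def conditional_headers(last_time: float) -> dict:
    """Build the request header for step 61 from a POSIX timestamp."""
    return {"If-Modified-Since": formatdate(last_time, usegmt=True)}


def page_was_modified(status_code: int) -> bool:
    """Interpret the origin server's answer as in step 63."""
    if status_code == 304:  # Not Modified: empty body, page unchanged
        return False
    if status_code == 200:  # full response: page changed since last_time
        return True
    # Assumption: any other status is treated as an error by this sketch.
    raise ValueError(f"unexpected HTTP status {status_code}")


# 1192625130 is 2007-10-17 12:45:30 UTC, the time used in the example above.
headers = conditional_headers(1192625130.0)
```

These headers could then be attached to an ordinary GET request for the page; only a 200 response carries a body that needs to be passed on for conversion.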
The present invention then decides, according to a caching policy, whether a web page stays in its current cache level or has its cache level adjusted. This is done in steps 107 to 109, as follows.
Step 107: set the request time of the URL object in the cache area to the current time and subtract 1 from the update value of the URL object. The actual content of the HTML page does not need to be fetched again. Execute step 109.
Step 108: set the request time of the URL object in the cache area to the current time and add 1 to the update value of the URL object. At the same time, open a network connection and fetch the actual content of the page again.
Step 109: adjust the cache level of the URL object according to its update value and the updateLevel value of the cache area to which it belongs. Specifically:
In a given cache level, for example the second-level cache area, when the update value of a URL object is greater than the updateLevel value of that level, the object is moved to the next higher-level cache area, which has a shorter time threshold (that is, more frequent updates), here the first-level cache area. After the move, the update value of the URL object is reset to zero. A URL object that is already in the first-level cache area is not moved.
In a given cache level, again taking the second-level cache area as an example, when the update value of a URL object is less than the negative of the updateLevel value of that level, the object is moved to the next lower-level cache area, which has a longer time threshold (that is, less frequent updates), here the third-level cache area. After the move, the update value of the URL object is reset to zero. A URL object that is already in the last-level cache area is not moved.
Through step 109, the update value is used to move URL objects dynamically between cache levels, which brings the refresh frequency of a URL object close to the real update frequency of the page content and improves the accuracy of the cache areas.
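The counter adjustment and migration rule of steps 107 to 109 can be condensed into one pure function. This is a sketch under stated assumptions: the function name adjust_tier and its signature are hypothetical, levels are numbered from 0 (shortest threshold) upward, and the comparison against the negative of updateLevel follows the embodiment.

```python
from typing import Tuple


def adjust_tier(index: int, update: int, updated: bool,
                update_level: int, tier_count: int) -> Tuple[int, int]:
    """Return (new level index, new update value) after one update check.

    index 0 is the level with the shortest threshold; tier_count - 1 the longest.
    """
    # Steps 107/108: count up when the page changed, down when it did not.
    update += 1 if updated else -1
    # Step 109: promote to a shorter threshold, unless already in the first level.
    if update > update_level and index > 0:
        return index - 1, 0
    # Step 109: demote to a longer threshold, unless already in the last level.
    if update < -update_level and index < tier_count - 1:
        return index + 1, 0
    return index, update
```

With update_level set to 2 and three levels, a second-level object whose counter rises from 2 to 3 moves to the first level and restarts at 0, while one that falls from -2 to -3 moves to the third level.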
Fig. 2 is a schematic diagram of a structure of the web page fetching server of the present invention and its relationship with external systems. Referring to Fig. 2, the web page fetching server 200 comprises:
A web page request receiving module 201, configured to receive web page requests.
A judging module 202, configured to determine whether the requested web page has been fetched before and what its fetch interval is. If the page has not been fetched, it triggers the fetching module 204; if the page has been fetched and the fetch interval is greater than the preset time threshold, it triggers the query module 203; if the page has been fetched and the fetch interval is less than or equal to the preset time threshold, it ignores the web page request.
A query module 203, configured to query whether the web page has been updated, to trigger the fetching module 204 when it has, and to discard the web page request when it has not. The query method is described in step 106: taking the request time of the URL object in the cache area as the starting value, the 304 return code of the HTTP protocol is used to determine whether the content of the HTML page at the URL has actually been updated.
A fetching module 204, configured to fetch the HTML page corresponding to the URL from the server 500 hosting the page and send the fetched HTML page to the conversion server 300. The conversion server 300 then converts the HTML page into a WML page and stores it in the storage server 400, where mobile communication terminal users can access it.
A cache area 205 is provided in the web page fetching server to store the objects of fetched web pages. Each URL object is keyed by the MD5 value of the page's URL address and contains the request time of the HTML page and update, the actual fetch count of the HTML page. A fetch-interval time threshold is set for the cache area. Using the MD5 code of the URL address of the requested page as the key, the judging module 202 checks whether a URL object corresponding to that MD5 code exists in the cache area. If it exists, the corresponding HTML page has been fetched before; otherwise it has not, and the HTML page needs to be fetched. The judging module 202 also subtracts the time value in the URL object from the current time to obtain the fetch interval and compares it with the time threshold of the cache area. If the fetch interval is greater than the threshold, the query module 203 is triggered; for a page that has not been fetched before, the fetching module 204 is triggered directly to fetch the page.
As described in the method above, the cache area may have at least two levels, for example the three levels shown in the figure, each level corresponding to a different web page fetch frequency and fetch-interval time threshold. The web page fetching server further comprises an object migration module 206, configured to move the object of a web page between cache areas of different levels according to the page's fetch frequency; for the specific migration method, see steps 107 to 109 above.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A web page fetching method, characterized in that it comprises:
A. receiving a web page request;
B. determining whether the requested web page has been fetched before; if so, executing step C; otherwise, fetching the web page and ending the flow;
C. determining whether the fetch interval of the requested web page is greater than a preset time threshold; if so, executing step D; otherwise, not fetching the web page and ending the flow;
D. querying whether the web page has been updated; if it has, fetching the web page; otherwise, not fetching the web page.
2. The web page fetching method according to claim 1, characterized in that the method sets up in advance a cache area and a time threshold corresponding to the cache area;
and, when a web page is fetched for the first time, an object is created for the web page and stored in the cache area, the object comprising the identifier of the web page and the request time, the stored time being further updated to the current time in the subsequent step D;
in step B, whether the web page has been fetched is determined by whether the identifier of the requested web page exists in the cache; in step C, the fetch interval is the difference between the current time and the time contained in the web page's object in the cache area, and the time threshold is the time threshold corresponding to the cache area.
3. The web page fetching method according to claim 2, characterized in that cache areas of different levels are set up according to differences in web page fetch frequency, cache areas of different levels corresponding to different fetch-interval time thresholds; and the object of a web page is moved between cache areas of different levels according to the fetch frequency of the web page.
4. The web page fetching method according to claim 3, characterized in that
a fetch frequency level value is set for the cache area of each level, and a fetch count is further set in the object of each web page, the initial value of the fetch count being 0;
step D further comprises: if the web page has been updated, adding 1 to the fetch count in the web page's object, and if the web page has not been updated, subtracting 1 from the fetch count in the web page's object; and comparing the fetch count of the web page's object with the fetch frequency level value of the cache area to which the object belongs: if the fetch count is greater than the fetch frequency level value, moving the object to the next higher-level cache area, which has a shorter time threshold; if the fetch count is less than the fetch frequency level value, moving the object to the next lower-level cache area, which has a longer time threshold.
5. The web page fetching method according to claim 1, characterized in that querying whether the web page has been updated in step D is specifically: determining whether the web page has been updated according to a return code of the Hypertext Transfer Protocol (HTTP).
6. The web page fetching method according to any one of claims 1 to 5, characterized in that the web page is a HyperText Markup Language (HTML) web page.
7. A web page fetching server, characterized in that it comprises:
a web page request receiving module, configured to receive a web page request;
a judging module, configured to determine whether the requested web page has been fetched before and what its fetch interval is, to trigger the fetching module when the page has not been fetched, and to trigger the query module when the fetch interval is greater than a preset time threshold;
a query module, configured to query whether the web page has been updated and to trigger the fetching module when it has;
a fetching module, configured to fetch the web page.
8. webpage according to claim 7 grasps server, it is characterized in that, further comprises buffer area, be used to store the object that grasps webpage, and this buffer area has the time corresponding threshold value; Described judge module judges according to the web object in the described buffer area whether webpage grasped and grasped at interval, and the described time threshold that is used for comparison is this buffer area time corresponding threshold value.
9. The web page crawling server according to claim 8, characterized in that the buffer area has at least two levels, and each level of buffer area corresponds to a different web page crawl frequency and a different crawl interval time threshold;
and the web page crawling server further comprises an object migration module, configured to move the object of a web page between buffer areas of different levels according to the crawl frequency of the web page.
10. The web page crawling server according to any one of claims 7 to 9, characterized in that the web page is an HTML (Hypertext Markup Language) web page.
CNB2007101985301A 2007-12-11 2007-12-11 Web page collecting method and web page collecting server Active CN100501746C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007101985301A CN100501746C (en) 2007-12-11 2007-12-11 Web page collecting method and web page collecting server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007101985301A CN100501746C (en) 2007-12-11 2007-12-11 Web page collecting method and web page collecting server

Publications (2)

Publication Number Publication Date
CN101178736A true CN101178736A (en) 2008-05-14
CN100501746C CN100501746C (en) 2009-06-17

Family

ID=39404989

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101985301A Active CN100501746C (en) 2007-12-11 2007-12-11 Web page collecting method and web page collecting server

Country Status (1)

Country Link
CN (1) CN100501746C (en)

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101303700B (en) * 2008-06-13 2010-04-21 成都市华为赛门铁克科技有限公司 Method and system for collecting web page
CN101826074A (en) * 2009-03-04 2010-09-08 上海众恒信息产业股份有限公司 Data exchange method for isolated system and data exchange device
CN101902438A (en) * 2009-05-25 2010-12-01 北京启明星辰信息技术股份有限公司 Method and device for automatically identifying web crawlers
CN101917479A (en) * 2010-08-20 2010-12-15 北京新岸线网络技术有限公司 Method and device for improving grouped data service in mobile network
WO2010149024A1 (en) * 2009-06-23 2010-12-29 北京搜狗科技发展有限公司 Update notification method and browser
CN101984634A (en) * 2010-11-22 2011-03-09 北京酷我科技有限公司 Server-side automatic steering method and system adapting to resource synchronous mechanism
CN101986659A (en) * 2010-10-27 2011-03-16 青岛普加智能信息有限公司 Real-time data transmission method and system
CN101459571B (en) * 2008-12-16 2011-04-06 北京大学 Method, system and apparatus for website mirroring
CN102184253A (en) * 2011-05-30 2011-09-14 北京搜狗科技发展有限公司 Method and system used for pushing grabbed and updated messages of network resource
CN102196506A (en) * 2010-03-15 2011-09-21 华为技术有限公司 Network resource access control method, system and device
CN102253941A (en) * 2010-05-21 2011-11-23 卓望数码技术(深圳)有限公司 Cache updating method and cache updating device
CN102347930A (en) * 2010-07-26 2012-02-08 中国电信股份有限公司 Method and system for obtaining webpage content
CN102364461A (en) * 2011-06-30 2012-02-29 广州市动景计算机科技有限公司 Page content data acquisition method and server
CN102594787A (en) * 2011-01-14 2012-07-18 腾讯科技(深圳)有限公司 Data grab method, system and routing server
CN102609416A (en) * 2011-01-21 2012-07-25 富泰华工业(深圳)有限公司 Webpage information storage control and method
CN102609481A (en) * 2012-01-20 2012-07-25 苏州简拔林网络科技有限公司 Method for updating and gathering comment information in real time
CN102831252A (en) * 2012-09-21 2012-12-19 北京奇虎科技有限公司 Method and device for updating index database and search method and system
CN102915363A (en) * 2012-10-18 2013-02-06 北京奇虎科技有限公司 Website storing method and system
CN102129441B (en) * 2010-01-14 2013-02-27 深圳市深信服电子科技有限公司 Web page information identifying and processing method and device
CN102982161A (en) * 2012-12-05 2013-03-20 北京奇虎科技有限公司 Method and device for acquiring webpage information
CN103020313A (en) * 2013-01-08 2013-04-03 北京航空航天大学 Capturing method based on detection of webpage refreshing period
CN103218452A (en) * 2013-04-27 2013-07-24 人民搜索网络股份公司 Method and device for recognizing valid interlinkage in Hub webpage
WO2013135003A1 (en) * 2012-03-15 2013-09-19 中兴通讯股份有限公司 Embedded network proxy system, terminal device and proxy method
CN103399933A (en) * 2013-08-08 2013-11-20 人民搜索网络股份公司 Method and system for grabbing webpage contents of network print media
CN103905441A (en) * 2014-03-28 2014-07-02 广州华多网络科技有限公司 Data acquisition method and device
CN104252530A (en) * 2014-09-10 2014-12-31 北京京东尚科信息技术有限公司 Single-computer crawler grabbing method and system
CN104462493A (en) * 2014-12-18 2015-03-25 北京奇虎科技有限公司 Method and device for grabbing question and answer webpages
CN104462492A (en) * 2014-12-18 2015-03-25 北京奇虎科技有限公司 Method and device for grabbing question and answer webpages
CN104967698A (en) * 2015-02-13 2015-10-07 腾讯科技(深圳)有限公司 Network data crawling method and apparatus
CN106055638A (en) * 2016-05-30 2016-10-26 国家基础地理信息中心 Network geographic information updating method and network geographic information updating system
CN102609416B (en) * 2011-01-21 2016-12-14 富泰华工业(深圳)有限公司 Webpage information storage control and method
CN106371830A (en) * 2016-08-25 2017-02-01 北京量科邦信息技术有限公司 Interactive method for realizing close and back control of native APP and WEB pages
CN106547773A (en) * 2015-09-21 2017-03-29 北京国双科技有限公司 The method and device of adjustment event opening speed
CN106557484A (en) * 2015-09-25 2017-04-05 北京国双科技有限公司 The update method and device of webpage thermodynamic Background
CN106897127A (en) * 2015-12-21 2017-06-27 北京奇虎科技有限公司 A kind of method and server for picture capture treatment
CN106897126A (en) * 2015-12-21 2017-06-27 北京奇虎科技有限公司 A kind of picture grasping means and server
CN107102997A (en) * 2016-02-22 2017-08-29 北京国双科技有限公司 data crawling method and device
CN108600342A (en) * 2018-03-30 2018-09-28 连尚(新昌)网络科技有限公司 A kind of message display method, equipment and storage medium
CN110020065A (en) * 2017-07-19 2019-07-16 阿里巴巴集团控股有限公司 A kind of website identification method and device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102118361B (en) * 2009-12-31 2014-07-23 北京金山软件有限公司 Method and device for controlling data transmission based on network protocol

Cited By (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101303700B (en) * 2008-06-13 2010-04-21 成都市华为赛门铁克科技有限公司 Method and system for collecting web page
CN101459571B (en) * 2008-12-16 2011-04-06 北京大学 Method, system and apparatus for website mirroring
CN101826074A (en) * 2009-03-04 2010-09-08 上海众恒信息产业股份有限公司 Data exchange method for isolated system and data exchange device
CN101902438A (en) * 2009-05-25 2010-12-01 北京启明星辰信息技术股份有限公司 Method and device for automatically identifying web crawlers
CN101902438B (en) * 2009-05-25 2013-05-15 北京启明星辰信息技术股份有限公司 Method and device for automatically identifying web crawlers
WO2010149024A1 (en) * 2009-06-23 2010-12-29 北京搜狗科技发展有限公司 Update notification method and browser
CN102129441B (en) * 2010-01-14 2013-02-27 深圳市深信服电子科技有限公司 Web page information identifying and processing method and device
CN102196506A (en) * 2010-03-15 2011-09-21 华为技术有限公司 Network resource access control method, system and device
CN102196506B (en) * 2010-03-15 2013-12-04 华为技术有限公司 Network resource access control method, system and device
CN102253941A (en) * 2010-05-21 2011-11-23 卓望数码技术(深圳)有限公司 Cache updating method and cache updating device
CN102347930B (en) * 2010-07-26 2015-09-09 中国电信股份有限公司 Web page contents acquisition methods and system
CN102347930A (en) * 2010-07-26 2012-02-08 中国电信股份有限公司 Method and system for obtaining webpage content
CN101917479A (en) * 2010-08-20 2010-12-15 北京新岸线网络技术有限公司 Method and device for improving grouped data service in mobile network
CN101986659A (en) * 2010-10-27 2011-03-16 青岛普加智能信息有限公司 Real-time data transmission method and system
CN101984634B (en) * 2010-11-22 2013-06-26 北京酷我科技有限公司 Server-side automatic steering method and system adapting to resource synchronous mechanism
CN101984634A (en) * 2010-11-22 2011-03-09 北京酷我科技有限公司 Server-side automatic steering method and system adapting to resource synchronous mechanism
CN102594787A (en) * 2011-01-14 2012-07-18 腾讯科技(深圳)有限公司 Data grab method, system and routing server
CN102594787B (en) * 2011-01-14 2016-01-20 腾讯科技(深圳)有限公司 Data grab method, system and routing server
CN102609416A (en) * 2011-01-21 2012-07-25 富泰华工业(深圳)有限公司 Webpage information storage control and method
CN102609416B (en) * 2011-01-21 2016-12-14 富泰华工业(深圳)有限公司 Webpage information storage control and method
CN102184253A (en) * 2011-05-30 2011-09-14 北京搜狗科技发展有限公司 Method and system used for pushing grabbed and updated messages of network resource
CN102364461A (en) * 2011-06-30 2012-02-29 广州市动景计算机科技有限公司 Page content data acquisition method and server
CN106599239A (en) * 2011-06-30 2017-04-26 广州市动景计算机科技有限公司 Webpage content data acquisition method and server
CN102609481A (en) * 2012-01-20 2012-07-25 苏州简拔林网络科技有限公司 Method for updating and gathering comment information in real time
WO2013135003A1 (en) * 2012-03-15 2013-09-19 中兴通讯股份有限公司 Embedded network proxy system, terminal device and proxy method
CN102831252B (en) * 2012-09-21 2015-11-25 北京奇虎科技有限公司 A kind of method for upgrading index data base and device, searching method and system
CN102831252A (en) * 2012-09-21 2012-12-19 北京奇虎科技有限公司 Method and device for updating index database and search method and system
CN102915363A (en) * 2012-10-18 2013-02-06 北京奇虎科技有限公司 Website storing method and system
CN102982161A (en) * 2012-12-05 2013-03-20 北京奇虎科技有限公司 Method and device for acquiring webpage information
CN103020313A (en) * 2013-01-08 2013-04-03 北京航空航天大学 Capturing method based on detection of webpage refreshing period
CN103218452A (en) * 2013-04-27 2013-07-24 人民搜索网络股份公司 Method and device for recognizing valid interlinkage in Hub webpage
CN103218452B (en) * 2013-04-27 2016-08-10 人民搜索网络股份公司 A kind of method and apparatus identifying effectively link in Hub page
CN103399933A (en) * 2013-08-08 2013-11-20 人民搜索网络股份公司 Method and system for grabbing webpage contents of network print media
CN103399933B (en) * 2013-08-08 2017-01-18 人民搜索网络股份公司 Method and system for grabbing webpage contents of network print media
CN103905441B (en) * 2014-03-28 2017-08-25 广州华多网络科技有限公司 Data capture method and device
CN103905441A (en) * 2014-03-28 2014-07-02 广州华多网络科技有限公司 Data acquisition method and device
CN104252530B (en) * 2014-09-10 2017-09-15 北京京东尚科信息技术有限公司 A kind of unit crawler capturing method and system
CN104252530A (en) * 2014-09-10 2014-12-31 北京京东尚科信息技术有限公司 Single-computer crawler grabbing method and system
CN104462492B (en) * 2014-12-18 2018-01-16 北京奇虎科技有限公司 The method and apparatus for capturing question and answer class webpage
CN104462493A (en) * 2014-12-18 2015-03-25 北京奇虎科技有限公司 Method and device for grabbing question and answer webpages
CN104462493B (en) * 2014-12-18 2018-08-03 北京奇虎科技有限公司 The method and apparatus for capturing question and answer class webpage
CN104462492A (en) * 2014-12-18 2015-03-25 北京奇虎科技有限公司 Method and device for grabbing question and answer webpages
CN104967698B (en) * 2015-02-13 2018-11-23 腾讯科技(深圳)有限公司 A kind of method and apparatus crawling network data
CN104967698A (en) * 2015-02-13 2015-10-07 腾讯科技(深圳)有限公司 Network data crawling method and apparatus
CN106547773A (en) * 2015-09-21 2017-03-29 北京国双科技有限公司 The method and device of adjustment event opening speed
CN106557484A (en) * 2015-09-25 2017-04-05 北京国双科技有限公司 The update method and device of webpage thermodynamic Background
CN106897127A (en) * 2015-12-21 2017-06-27 北京奇虎科技有限公司 A kind of method and server for picture capture treatment
CN106897126A (en) * 2015-12-21 2017-06-27 北京奇虎科技有限公司 A kind of picture grasping means and server
CN107102997A (en) * 2016-02-22 2017-08-29 北京国双科技有限公司 data crawling method and device
CN106055638A (en) * 2016-05-30 2016-10-26 国家基础地理信息中心 Network geographic information updating method and network geographic information updating system
CN106371830A (en) * 2016-08-25 2017-02-01 北京量科邦信息技术有限公司 Interactive method for realizing close and back control of native APP and WEB pages
CN110020065A (en) * 2017-07-19 2019-07-16 阿里巴巴集团控股有限公司 A kind of website identification method and device
CN110020065B (en) * 2017-07-19 2023-04-25 阿里巴巴集团控股有限公司 Website identification method and device
CN108600342A (en) * 2018-03-30 2018-09-28 连尚(新昌)网络科技有限公司 A kind of message display method, equipment and storage medium
CN108600342B (en) * 2018-03-30 2020-01-10 连尚(新昌)网络科技有限公司 Message display method, device and storage medium

Also Published As

Publication number Publication date
CN100501746C (en) 2009-06-17

Similar Documents

Publication Publication Date Title
CN100501746C (en) Web page collecting method and web page collecting server
US6954754B2 (en) Apparatus and methods for managing caches on a mobile device
CN100464308C (en) Method and system for updating user vocabulary synchronouslly
CN101334792B (en) Personalized service recommendation system and method
CN104182408B (en) A kind of webpage offline access method and device
CN110519401A (en) Improve method, apparatus, equipment and the storage medium of network Access Success Rate
CN102164186B (en) Method and system for realizing cloud search service
US10489476B2 (en) Methods and devices for preloading webpages
CN105095226A (en) Method and apparatus for loading webpage resource
CN101702173A (en) Method and device for increasing access speed of mobile portal dynamic page
JP2006511134A (en) Method for automatically replicating data objects between a mobile device and a server
CN102480397A (en) Method and equipment for accessing internet pages
CN104298790A (en) Browser accelerating method and browser device with accelerator
EP1512264B1 (en) Communication system, mobile device and method for storing pages on a mobile device
CN102819554A (en) Favorite data processing method and device and server
CN103701929A (en) Method and device for realizing business data caching
CN105468707A (en) Cache-based data processing method and device
CN102591887B (en) Network data pre-head method and system
CN103916474A (en) Method, device and system for determining caching time
CN103473326A (en) Method and device providing searching advices
CN103617278A (en) Control method and device for address bar searching
CN100489861C (en) Data searching method, system and device
CN102129437A (en) Domain name matching method and browser
CN101299854B (en) Mobile terminal and data maintenance method thereof
CN104348628A (en) Method and device for obtaining local Root authority

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant