CN101739427A - Crawler capturing method and device thereof - Google Patents

Publication number
CN101739427A
Authority
CN
China
Prior art keywords
webpage
weights
described webpage
result
current weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200810226245A
Other languages
Chinese (zh)
Other versions
CN101739427B (en)
Inventor
孙宏伟
胡珉
罗治国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Application filed by China Mobile Communications Group Co Ltd
Priority to CN2008102262450A
Publication of CN101739427A
Application granted
Publication of CN101739427B
Legal status: Expired - Fee Related

Abstract

The invention discloses a crawler capturing method and a device thereof, aimed at solving the poor timeliness of existing crawler capturing technology. The main technical scheme is as follows: the current weight of a web page is determined according to the rank of the web page in the current search results and/or the order in which the web page is clicked by users; the result weight of the web page is determined according to its current weight and historical weight; and when the result weight reaches a preset threshold, the information in the web page is captured again. With this scheme, the rank of a web page in the current search results and/or the order in which users click it influences the period at which the crawler captures the information in the page. The capture period for web pages with high user attention can therefore be shortened, which keeps the information in those pages timely and improves the user experience.

Description

Crawler capturing method and device thereof
Technical field
The present invention relates to the field of Internet information search, and in particular to a crawler capturing method and a device thereof.
Background
Search engines are now widely used on the Internet. A user only needs to enter a few keywords describing the required information, and a search engine such as Baidu or Google returns a large amount of information related to those keywords.
The information sources of a search engine are varied. Some information comes from bid advertising: the advertiser pays an advertising fee to the search engine operator, and the operator publishes a brief description of the advertisement and a link to it in its search engine. Much more information is non-advertising information, such as news and academic information, which the search engine operator must find and capture itself and add to the search engine. Given the massive amount of information on the Internet, how to distinguish the information the operator cares about from other useless information, and how to add it to the search engine by category, is a problem that search engine operators urgently need to solve.
Crawler capturing technology addresses this problem: according to set conditions, it crawls the information that meets those conditions out of the massive information on the Internet. Applied to a search engine, it can effectively solve the problem of capturing useful information of all kinds. A crawler has to traverse web pages when capturing information. Given the massive number of web pages on the Internet, traversing all of them is nearly impossible, and even if it were done it would consume so much time and so many resources that the captured information would have very poor timeliness. To address this shortcoming, the solution generally adopted at present is to have the crawler capture information from a certain number of web pages within a certain range; these are generally web pages that prior statistics show to have both a high probability of containing useful information and a large quantity of it. These web pages form a search list that is recorded as the search range of the crawler. At fixed time intervals, the crawler checks whether links to new information pages have appeared on the search list; if so, it downloads each information page via its link and extracts the useful information from it.
Capturing web page information at fixed time intervals in this way reduces, to some extent, the time and resources consumed by each capture. In practice, however, different web pages receive different levels of user attention. If pages with high user attention and pages with low user attention are captured at the same frequency, the capture frequency is clearly too low for the high-attention pages, so their information cannot be captured and updated in time. The information in such pages then has poor timeliness, that is, the pages may contain a certain amount of outdated or invalid information, which in turn lowers user satisfaction with the search engine.
Summary of the invention
The invention provides an optimized crawler capturing method and a device thereof, in order to solve the poor timeliness of existing crawler capturing technology.
The embodiments of the invention are realized through the following technical solutions:
An embodiment of the invention provides a crawler capturing method, comprising:
determining the current weight of a web page according to the rank of the web page in the current search results and/or the order in which the web page is clicked by users;
determining the result weight of the web page according to the current weight and the historical weight of the web page;
capturing the information in the web page again when the result weight reaches a set threshold.
An embodiment of the invention also provides a crawler capturing device, comprising:
a current weight determining unit, configured to determine the current weight of a web page according to the rank of the web page in the current search results and/or the order in which the web page is clicked by users;
a result weight determining unit, configured to determine the result weight of the web page according to the current weight determined by the current weight determining unit and the historical weight of the web page;
an information capturing unit, configured to capture the information in the web page again when the result weight determined by the result weight determining unit reaches the set threshold.
With the above technical solution, an embodiment of the invention can determine the current weight of a web page according to the rank of the web page in the current search results and/or the order in which the web page is clicked by users, then determine the result weight of the web page according to its current weight and historical weight, and capture the information in the web page again when the result weight reaches the set threshold. In general, the rank of a web page in the current search results and/or the order in which it is clicked by users reflects the user attention of the web page well. On this basis, the embodiments of the invention use the rank and/or the click order to influence the period at which the crawler captures the information in the web page. Under this scheme, the capture period is shortened for web pages with high user attention, which raises the capture frequency for those pages, keeps their information timely, and improves the user experience.
Description of drawings
Fig. 1 is the first flowchart of crawler capturing in an embodiment of the invention;
Fig. 2 is the second flowchart of crawler capturing in an embodiment of the invention;
Fig. 3 is the third flowchart of crawler capturing in an embodiment of the invention;
Fig. 4 is the first schematic diagram of the crawler capturing device in an embodiment of the invention;
Fig. 5 is the second schematic diagram of the crawler capturing device in an embodiment of the invention;
Fig. 6 is the third schematic diagram of the crawler capturing device in an embodiment of the invention.
Embodiments
To improve the timeliness of the information captured by a crawler, and thereby improve user satisfaction with the search engine, the embodiments of the invention propose a crawler capturing method and a device thereof. The main implementation principle, the specific implementation process, and the beneficial effects that can be achieved are explained in detail below with reference to the accompanying drawings.
In a search engine system based on a computer or a computer network, the search results returned for a user query usually comprise a list of web page links, and the web pages in the list are generally sorted from high to low by the degree of relevance between the information in the web page and the query keywords. Based on this feature of the search results returned by a search engine, one embodiment of the invention proposes a method that uses the rank of a web page in the search results to influence the period at which the crawler captures the information in the web page. As shown in Fig. 1, the method comprises the following steps:
Step 101: determine the current weight of a web page according to the rank of the web page in the current search results.
In this step, the current weight of a web page identifies the rank of the web page in the search results. Specifically, the current weight decreases as the rank of the web page goes from front to back, and it can be determined from the rank by linear decrement, exponential decay, or a similar scheme. Further, only the top n web pages in the search results may be selected and only their current weights computed; web pages ranked after the n-th can be treated by default as pages with a low user click rate, and their current weight defaults to 0.
For example, when linear decrement is used to determine the current weight, the current weight a of the web page ranked k-th in the search results is:
$$a = \begin{cases} \dfrac{n - k + 1}{n}\, a_0, & k \le n \\ 0, & k > n \end{cases}$$
where $a_0$ is the current weight of the web page ranked 1st (this weight can be a system default value).
A simpler example of linear decrement is as follows: the top 10 web pages in the search results are assumed by default to be pages with a high user click rate. Among these 10 pages, the page ranked 1st is assigned a current weight of 10, the page ranked 2nd a current weight of 9, and so on, down to a current weight of 1 for the page ranked 10th. Pages ranked after the 10th are treated by default as pages with a low user click rate and are all assigned a current weight of 0.
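The linear decrement rule can be sketched in Python as follows. The function name `linear_current_weight` and the defaults n = 10 and a_0 = 10 are illustrative assumptions taken from the simpler example; the text does not prescribe an implementation.

```python
def linear_current_weight(k: int, n: int = 10, a0: float = 10.0) -> float:
    """Current weight of the web page ranked k-th (1-based) in the search results.

    Implements a = (n - k + 1) / n * a0 for k <= n, and 0 for k > n.
    The defaults n = 10 and a0 = 10 are illustrative assumptions.
    """
    if k < 1:
        raise ValueError("rank k must be >= 1")
    if k > n:
        return 0.0  # pages ranked after the n-th default to weight 0
    return a0 * (n - k + 1) / n


# Weights for ranks 1..12 with the default parameters.
weights = [linear_current_weight(k) for k in range(1, 13)]
print(weights)
```

With these defaults the weights run from 10 for the page ranked 1st down to 1 for the page ranked 10th, and are 0 afterwards, matching the simpler example.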
Step 102: determine the result weight of the web page according to its current weight and historical weight.
In this step, the result weight of the web page can preferably be determined in one of the following two modes:
Mode one: add the current weight of the web page to its historical weight to obtain the result weight of the web page.
Mode two: subtract the current weight of the web page from its historical weight to obtain the result weight of the web page.
In the initial state, the historical weight of a web page can be set to a different initial value depending on the mode used. For example, for mode one the historical weight can be set to 0, and for mode two it can be set to 100.
The two modes above are only the preferred ways of determining the result weight in this embodiment; other ways can be adopted according to the specific strategy. In particular, the proportion that the current weight contributes to the result weight can be set, for example: result weight = historical weight + current weight × q, where q is greater than 0 and less than 1.
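The two modes and the proportion q can be sketched as a single Python function. The name `result_weight`, the mode labels "add" and "subtract", and the choice to apply q in mode two as well are assumptions made for illustration; the text gives the weighted formula only for the additive case.

```python
def result_weight(history: float, current: float,
                  mode: str = "add", q: float = 1.0) -> float:
    """Combine the historical weight and the current weight of a web page.

    mode="add"      -> mode one: history + q * current
    mode="subtract" -> mode two: history - q * current
    q = 1.0 reproduces the plain modes; 0 < q < 1 scales the share of the
    current weight. Applying q in mode two is an assumption of this sketch.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if mode == "add":
        return history + q * current
    if mode == "subtract":
        return history - q * current
    raise ValueError("mode must be 'add' or 'subtract'")


print(result_weight(0.0, 10.0))                     # mode one, initial history 0
print(result_weight(100.0, 10.0, mode="subtract"))  # mode two, initial history 100
print(result_weight(20.0, 10.0, q=0.5))             # weighted variant
```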
Step 103: judge whether the result weight of the web page reaches the set threshold t; if so, execute step 104; otherwise, execute step 105.
The setting of the threshold t in step 103 depends on the mode used in step 102 to determine the result weight. For example, when mode one is used, the threshold t is greater than the initial historical weight of the web page; when mode two is used, the threshold t is less than the initial historical weight of the web page.
Step 104: capture the information in the web page again, and initialize the historical weight of the web page.
In step 104, the information in the web page can be captured again immediately once step 103 determines that the result weight has reached the set threshold. Alternatively, the web page whose result weight has reached the threshold can first be recorded in a preset capture list (the same web page is recorded only once), and the information in it captured again after a set duration, to reduce the use of system resources.
Step 105: update the historical weight of the web page with its result weight, and return to step 101.
In the above flow, a web page whose result weight has not reached the set threshold within a set period (for example, three months) can be captured again at the end of the period, and its historical weight initialized.
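Steps 101 to 105 can be sketched as a small Python loop for a single web page, using mode one. The names `PageState`, `rank_weight`, and `observe_rank`, the threshold value 25, and the use of a counter as a stand-in for the actual capture are all assumptions of this sketch; the periodic fallback capture is omitted.

```python
from dataclasses import dataclass

INITIAL_HISTORY = 0.0  # initial historical weight for mode one, as in the text
THRESHOLD = 25.0       # hypothetical threshold t, greater than the initial value


@dataclass
class PageState:
    url: str
    history: float = INITIAL_HISTORY
    crawl_count: int = 0  # stand-in for real capture work


def rank_weight(rank: int, n: int = 10, a0: float = 10.0) -> float:
    """Step 101: linear decrement weight for a 1-based rank."""
    return a0 * (n - rank + 1) / n if 1 <= rank <= n else 0.0


def observe_rank(page: PageState, rank: int, threshold: float = THRESHOLD) -> bool:
    """Process one appearance of the page; return True if a recapture is triggered."""
    result = page.history + rank_weight(rank)  # step 102, mode one
    if result >= threshold:                    # step 103
        page.crawl_count += 1                  # step 104: capture again
        page.history = INITIAL_HISTORY         # step 104: initialize history
        return True
    page.history = result                      # step 105: update history
    return False


page = PageState("http://example.com/news")
events = [observe_rank(page, rank) for rank in (1, 2, 3, 1)]
print(events, page.history, page.crawl_count)
```

In this run the accumulated weight goes 10, 19, then 27, which reaches the threshold and resets the history, so a page that keeps ranking near the top is recaptured after only a few appearances.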
After the search engine returns the search results, the user browses the keywords of each web page link and clicks the relevant links according to actual needs. However, the user does not necessarily click the links in the order in which they are displayed, and may skip the links shown first and directly click links further down. Based on this feature of user clicks, a further embodiment of the invention proposes a method that uses the order in which web pages are clicked by users to influence the period at which the crawler captures the information in the web page. As shown in Fig. 2, the method comprises the following steps:
Step 201: determine the current weight of a web page according to the order in which the web page is clicked by users.
In this step, the current weight of a web page identifies the order in which the web page is clicked by users. Specifically, the current weight decreases as the click order goes from front to back, and it can be determined from the click order by linear decrement, exponential decay, or a similar scheme. Further, only the first m web pages may be selected and only their current weights computed; web pages after the m-th can be treated by default as pages with a low user click rate, and their current weight defaults to 0.
For example, when exponential decay is used to determine the current weight, the current weight b of the web page that is the j-th one clicked by the user is:
$$b = \begin{cases} b_0 (1 - c)^{\,j - 1}, & j \le m \\ 0, & j > m \end{cases}$$
where $b_0$ is the current weight of the first web page clicked by the user (this weight can be a system default value) and c ∈ (0, 1) is the attenuation coefficient: the larger c is, the faster the current weight decays as the click order goes from front to back.
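The exponential decay rule can be sketched in Python as follows. The function name `exponential_current_weight` and the defaults m = 10, b_0 = 10, and c = 0.5 are illustrative assumptions; the text leaves these values to the system.

```python
def exponential_current_weight(j: int, m: int = 10,
                               b0: float = 10.0, c: float = 0.5) -> float:
    """Current weight of the web page that is the j-th one clicked (1-based).

    Implements b = b0 * (1 - c) ** (j - 1) for j <= m, and 0 for j > m.
    The defaults are illustrative assumptions.
    """
    if j < 1:
        raise ValueError("click position j must be >= 1")
    if not 0 < c < 1:
        raise ValueError("attenuation coefficient c must be in (0, 1)")
    if j > m:
        return 0.0  # pages after the m-th default to weight 0
    return b0 * (1 - c) ** (j - 1)


print([exponential_current_weight(j) for j in (1, 2, 3, 11)])
```

A larger c makes the weight fall off faster: with c = 0.5 the third click still receives a quarter of b_0, whereas with c = 0.9 it receives only one hundredth.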
Step 202: determine the result weight of the web page according to its current weight and historical weight.
Step 202 follows the same basic principle as step 102 above and is not described in detail here.
Step 203: judge whether the result weight of the web page reaches the set threshold t; if so, execute step 204; otherwise, execute step 205.
Step 203 follows the same basic principle as step 103 above and is not described in detail here.
Step 204: capture the information in the web page again, and initialize the historical weight of the web page.
Step 205: update the historical weight of the web page with its result weight, and return to step 201.
In the above flow, a web page whose result weight has not reached the set threshold within a set period (for example, three months) can be captured again at the end of the period, and its historical weight initialized.
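The end-of-period fallback can be sketched as a simple check over the time each page was last captured. The function name `due_for_periodic_capture`, the dictionary of timestamps, and the approximation of three months as 90 days are assumptions of this sketch.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def due_for_periodic_capture(last_captured: Dict[str, datetime],
                             now: datetime,
                             period: timedelta = timedelta(days=90)) -> List[str]:
    """Return the URLs not captured within the set period (about three months).

    A page whose result weight reached the threshold would have been captured,
    and its timestamp refreshed, earlier; only the remaining pages are returned.
    """
    return sorted(url for url, ts in last_captured.items() if now - ts >= period)


last_captured = {
    "http://example.com/a": datetime(2024, 1, 1),
    "http://example.com/b": datetime(2024, 3, 15),
}
due = due_for_periodic_capture(last_captured, now=datetime(2024, 4, 5))
print(due)
```

After capturing the returned pages, the caller would reset their historical weights to the initial value, as the flow describes.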
Further, for the same web page, its rank in the search results and the order in which it is clicked by users may be inconsistent. Based on this characteristic, one embodiment of the invention proposes a method in which the rank of a web page in the search results and the order in which it is clicked by users jointly influence the period at which the crawler captures the information in the web page. As shown in Fig. 3, the method comprises the following steps:
Step 301: determine a first current weight of the web page according to the rank of the web page in the current search results.
Step 301 follows the same basic principle as step 101 above and is not described in detail here.
Step 302: determine the result weight of the web page according to its historical weight and the first current weight.
This step follows the same basic principle as step 102 or step 202 above and is not described in detail here.
Step 303: judge whether the result weight of the web page reaches the set threshold t; if so, execute step 304; otherwise, execute step 305.
Step 304: capture the information in the web page again, and initialize the historical weight of the web page.
Step 305: update the historical weight of the web page with its result weight.
Step 306: determine a second current weight of the web page according to the order in which the web page is clicked by users.
Step 306 follows the same basic principle as step 201 above and is not described in detail here.
Step 307: determine the result weight of the web page according to the second current weight and the historical weight of the web page.
This step follows the same basic principle as step 102 or step 202 above and is not described in detail here. Note that the historical weight here is the historical weight as updated in step 305.
Step 308: judge whether the result weight of the web page reaches the set threshold t; if so, execute step 304; otherwise, execute step 309.
Step 309: update the historical weight of the web page with its result weight, and return to step 301.
In the above flow, the rank of the web page in the search results influences the capture period first, and the order in which the web page is clicked by users influences it second. In a further embodiment of the invention, the click order can influence the capture period first and the rank in the search results second. The detailed process is essentially the same as the above flow; the difference is that step 301 determines the first current weight of the web page according to the order in which it is clicked by users, and step 306 determines the second current weight according to its rank in the search results.
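One pass through steps 301 to 309 can be sketched as a function that applies the two current weights in sequence, using mode one. The name `two_stage_update`, its return convention, and the example values are assumptions of this sketch; swapping the two weight arguments gives the alternative ordering.

```python
from typing import Tuple


def two_stage_update(history: float, first_weight: float, second_weight: float,
                     threshold: float, initial: float = 0.0) -> Tuple[float, bool]:
    """Apply the first and then the second current weight to the historical weight.

    Returns (new_history, recapture). Each stage adds its weight (mode one) and
    checks the threshold; reaching it triggers a recapture and resets the
    history to its initial value (step 304). Otherwise the history is updated
    (steps 305 and 309) and the next stage uses the updated value.
    """
    for current in (first_weight, second_weight):
        result = history + current   # steps 302 and 307
        if result >= threshold:      # steps 303 and 308
            return initial, True     # step 304
        history = result             # steps 305 and 309
    return history, False


print(two_stage_update(0.0, 10.0, 5.0, threshold=12.0))  # second stage triggers
print(two_stage_update(0.0, 3.0, 4.0, threshold=12.0))   # no trigger, history 7
```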
In the above embodiment, one of the two factors (the rank of the web page in the search results, or the order in which it is clicked by users) first influences the capture period, that is, the result weight of the web page; when the result weight does not reach the set threshold, the other factor then influences the capture period. The invention also proposes an embodiment in which the rank of the web page in the current search results and the order in which it is clicked by users influence the current weight of the web page at the same time, and thereby the result weight. Specifically:
First step: determine a first weight of the web page according to its rank in the current search results; the first weight decreases as the rank goes from front to back.
This process follows the same basic principle as step 101 above and is not described in detail here.
Second step: determine a second weight of the web page according to the order in which it is clicked by users; the second weight decreases as the click order goes from front to back.
This process follows the same basic principle as step 201 above and is not described in detail here.
Third step: determine the current weight of the web page according to the first weight and the second weight. For example, the first weight and the second weight can be added to give the current weight.
The first and second steps are numbered only for convenience of description and have no strict execution order; the second step may be executed before the first, or the two may be executed simultaneously.
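The three steps can be sketched in Python with a linear decrement for the rank and an exponential decay for the click order, summed as in the example. The function name `combined_current_weight`, all default parameters, and the use of `None` for a page that was not clicked are assumptions of this sketch.

```python
from typing import Optional


def combined_current_weight(rank: int, click_pos: Optional[int],
                            n: int = 10, a0: float = 10.0,
                            m: int = 10, b0: float = 10.0, c: float = 0.5) -> float:
    """Current weight from both the search rank and the click order.

    First weight: linear decrement on the 1-based rank.
    Second weight: exponential decay on the 1-based click position
    (None means the page was not clicked and contributes 0).
    The two weights are added, as in the example in the text.
    """
    first = a0 * (n - rank + 1) / n if 1 <= rank <= n else 0.0
    if click_pos is not None and 1 <= click_pos <= m:
        second = b0 * (1 - c) ** (click_pos - 1)
    else:
        second = 0.0
    return first + second


print(combined_current_weight(rank=2, click_pos=1))     # ranked 2nd, clicked first
print(combined_current_weight(rank=1, click_pos=None))  # ranked 1st, not clicked
print(combined_current_weight(rank=11, click_pos=3))    # outside top n, clicked third
```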
An embodiment of the invention also provides a crawler capturing device. As shown in Fig. 4, it comprises a current weight determining unit 401, a result weight determining unit 402, and an information capturing unit 403, wherein:
The current weight determining unit 401 is configured to determine the current weight of a web page according to the rank of the web page in the current search results and/or the order in which the web page is clicked by users.
The result weight determining unit 402 is configured to determine the result weight of the web page according to the current weight determined by the current weight determining unit 401 and the historical weight of the web page.
The information capturing unit 403 is configured to capture the information in the web page again when the result weight determined by the result weight determining unit 402 reaches the set threshold.
In one embodiment, when the current weight determined by the current weight determining unit 401 is based on the rank of the web page in the current search results or on the order in which the web page is clicked by users, the current weight decreases as the rank, or respectively the click order, goes from front to back. The result weight determining unit 402 is further configured to add the current weight of the web page to its historical weight to obtain the result weight of the web page, or to subtract the current weight from the historical weight to obtain the result weight.
In one embodiment, the current weight determining unit 401 is further configured to: determine a first weight of the web page according to its rank in the current search results, the first weight decreasing as the rank goes from front to back; determine a second weight of the web page according to the order in which it is clicked by users, the second weight decreasing as the click order goes from front to back; and determine the current weight of the web page according to the first weight and the second weight.
In one embodiment, as shown in Fig. 5, the device shown in Fig. 4 can further comprise a historical weight initialization unit 404, configured to initialize the historical weight of the web page when the result weight determined by the result weight determining unit 402 reaches the set threshold.
In one embodiment, as shown in Fig. 6, the device shown in Fig. 4 can further comprise a historical weight updating unit 405, configured to update the historical weight of the web page with its result weight when the result weight determined by the result weight determining unit 402 does not reach the set threshold.
Preferably, in the device shown in Fig. 6, the current weight determining unit 401 is further configured to: after the historical weight updating unit 405 updates the historical weight of the web page, and when the current weight of the web page was determined according to its rank in the current search results, determine the current weight of the web page according to the order in which it is clicked by users; or, after the historical weight updating unit 405 updates the historical weight of the web page, and when the current weight of the web page was determined according to the order in which it is clicked by users, determine the current weight of the web page according to its rank in the search results.
The historical weight initialization unit 404 and the historical weight updating unit 405 may be the same unit.
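How units 401 to 405 cooperate can be sketched as one Python class with a method per unit. The class name `CrawlerCaptureDevice`, the injected `fetch` callable, the rank-only linear weighting, and the threshold value are assumptions of this sketch, not the structure the text prescribes.

```python
from typing import Callable, Dict


class CrawlerCaptureDevice:
    """Illustrative composition of units 401 to 405 (mode one, rank-based)."""

    def __init__(self, threshold: float, fetch: Callable[[str], None],
                 initial_history: float = 0.0, n: int = 10, a0: float = 10.0):
        self.threshold = threshold
        self.fetch = fetch  # hypothetical callable that captures a page
        self.initial_history = initial_history
        self.n, self.a0 = n, a0
        self.history: Dict[str, float] = {}

    def current_weight(self, rank: int) -> float:
        """Unit 401: linear decrement weight from the 1-based rank."""
        return self.a0 * (self.n - rank + 1) / self.n if 1 <= rank <= self.n else 0.0

    def result_weight(self, url: str, current: float) -> float:
        """Unit 402: historical weight plus current weight."""
        return self.history.get(url, self.initial_history) + current

    def process(self, url: str, rank: int) -> bool:
        result = self.result_weight(url, self.current_weight(rank))
        if result >= self.threshold:
            self.fetch(url)                           # unit 403: capture again
            self.history[url] = self.initial_history  # unit 404: initialize
            return True
        self.history[url] = result                    # unit 405: update
        return False


fetched = []
device = CrawlerCaptureDevice(threshold=15.0, fetch=fetched.append)
outcomes = [device.process("http://example.com/a", rank) for rank in (1, 5, 1)]
print(outcomes, fetched)
```

Keeping initialization and update in the same method reflects the remark that units 404 and 405 may be the same unit.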
With the above technical solution, an embodiment of the invention can determine the current weight of a web page according to the rank of the web page in the current search results and/or the order in which the web page is clicked by users, then determine the result weight of the web page according to its current weight and historical weight, and capture the information in the web page again when the result weight reaches the set threshold. In general, the rank of a web page in the current search results and/or the order in which it is clicked by users reflects the user attention of the web page well. On this basis, the embodiments of the invention use the rank of the web page in the current search results or the order in which it is clicked by users to influence the period at which the crawler captures the information in the web page. Under this scheme, the capture period is shortened for web pages with high user attention, which raises the capture frequency for those pages, keeps their information timely, and improves the user experience.
Further, in the embodiments of the invention, the rank of a web page in the current search results and the order in which it is clicked by users can jointly influence the period at which the crawler captures the information in the web page. This keeps timely the web pages that rank near the top of the returned search results and are frequently clicked by users, that is, it ensures that these pages have been captured recently, and in turn improves user satisfaction with the search engine.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (14)

1. a crawler capturing method is characterized in that, comprising:
According to the ordering of webpage in the current search result or/and described webpage by the order that the user clicks, is determined the current weight of described webpage;
According to the current weight and the historical weights of described webpage, determine the weights as a result of described webpage;
When weights reach setting threshold, grasp the information in the described webpage when described as a result again.
2. the method for claim 1 is characterized in that,
When the current weight of described webpage when the ordering in the current search result is determined according to described webpage, the current weight of described webpage according to described webpage the ordering in the current search result successively decrease from front to back; Or the order of being clicked by the user according to described webpage when the current weight of described webpage is when determining, the current weight of described webpage is successively decreased by the order that the user clicks from front to back according to described webpage;
According to the current weight and the historical weights of described webpage, determine the weights as a result of described webpage, comprising:
The historical weights of described webpage are added current weight, obtain the weights as a result of described webpage correspondence; Perhaps, the historical weights of described webpage are deducted current weight, obtain the weights as a result of described webpage correspondence.
3. the method for claim 1 is characterized in that, described according to webpage in the current search result ordering and described webpage by the order that the user clicks, determine that the current weight of described webpage comprises:
According to the ordering of described webpage in the current search result, determine first weights of described webpage; Described first weights according to described webpage the ordering in the current search result successively decrease from front to back;
And, by the order that the user clicks, determine second weights of described webpage according to described webpage; Described second weights are successively decreased by the order that the user clicks from front to back according to described webpage;
According to described first weights and described second weights, determine the current weight of described webpage.
4. The method of claim 1, 2 or 3, characterized in that, when the result weight reaches the set threshold, the method further comprises:
Initializing the historical weight of the web page.
5. The method of claim 1, 2 or 3, characterized in that, when the result weight does not reach the set threshold, the method further comprises:
Updating the historical weight of the web page with the result weight of the web page.
6. The method of claim 5, characterized in that, after the historical weight of the web page is updated, and when the current weight of the web page was determined according to the rank of the web page in the current search results, the method further comprises:
Determining the current weight of the web page according to the order in which the web page is clicked by users;
Determining the result weight of the web page according to the current weight and the historical weight of the web page;
When the result weight reaches the set threshold, capturing the information in the web page again.
7. The method of claim 5, characterized in that, after the historical weight of the web page is updated, and when the current weight of the web page was determined according to the order in which the web page is clicked by users, the method further comprises:
Determining the current weight of the web page according to the rank of the web page in the search results;
Determining the result weight of the web page according to the current weight and the historical weight of the web page;
When the result weight reaches the set threshold, capturing the information in the web page again.
8. A crawler capturing device, characterized by comprising:
A current weight determining unit, configured to determine a current weight of a web page according to the rank of the web page in the current search results and/or the order in which the web page is clicked by users;
A result weight determining unit, configured to determine a result weight of the web page according to the current weight determined by the current weight determining unit and a historical weight of the web page;
An information capturing unit, configured to capture the information in the web page again when the result weight determined by the result weight determining unit reaches a set threshold.
9. The device of claim 8, characterized in that, when the current weight determined by the current weight determining unit is determined according to the rank of the web page in the current search results, the current weight decreases from front to back with that rank; or, when the current weight determined by the current weight determining unit is determined according to the order in which the web page is clicked by users, the current weight decreases from front to back with that click order;
The result weight determining unit is further configured to: add the current weight to the historical weight of the web page to obtain the result weight corresponding to the web page; or, subtract the current weight from the historical weight of the web page to obtain the result weight corresponding to the web page.
10. The device of claim 8, characterized in that the current weight determining unit is further configured to: determine a first weight of the web page according to the rank of the web page in the current search results, the first weight decreasing from front to back with that rank; determine a second weight of the web page according to the order in which the web page is clicked by users, the second weight decreasing from front to back with that click order; and determine the current weight of the web page according to the first weight and the second weight.
11. The device of claim 8, characterized by further comprising:
A historical weight initialization unit, configured to initialize the historical weight of the web page when the result weight determined by the result weight determining unit reaches the set threshold.
12. The device of claim 8, 9, 10 or 11, characterized by further comprising:
A historical weight updating unit, configured to update the historical weight of the web page with the result weight of the web page when the result weight determined by the result weight determining unit does not reach the set threshold.
13. The device of claim 12, characterized in that the current weight determining unit is further configured to, after the historical weight updating unit has updated the historical weight of the web page, and when the current weight of the web page was determined according to the rank of the web page in the current search results, determine the current weight of the web page according to the order in which the web page is clicked by users.
14. The device of claim 12, characterized in that the current weight determining unit is further configured to, after the historical weight updating unit has updated the historical weight of the web page, and when the current weight of the web page was determined according to the order in which the web page is clicked by users, determine the current weight of the web page according to the rank of the web page in the search results.
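
The weighting scheme recited in the method claims can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `rank_weight`, `combined_weight`, `PageState` and `apply_weight`, the linear decrease, the numeric defaults, the initial historical weight of 0, and the use of a sum to combine the first and second weights are all assumptions made for illustration. Only the additive variant of the result weight is shown.

```python
from dataclasses import dataclass


def rank_weight(position: int, max_weight: float = 10.0, step: float = 1.0) -> float:
    """Weight that decreases from front to back with a 1-based position.

    The linear shape and the default values are assumptions; the claims only
    require that the weight decrease as the position moves toward the back.
    """
    if position < 1:
        raise ValueError("position is 1-based")
    return max(max_weight - step * (position - 1), 0.0)


def combined_weight(search_rank: int, click_order: int) -> float:
    """Current weight derived from a first weight (search rank) and a second
    weight (click order). Summing the two is an assumption; the claims only
    say the current weight is determined according to both."""
    return rank_weight(search_rank) + rank_weight(click_order)


@dataclass
class PageState:
    url: str
    history_weight: float = 0.0  # assumed initial value


def apply_weight(page: PageState, current_weight: float, threshold: float = 25.0) -> bool:
    """Fold a current weight into the page state.

    Returns True when the result weight reaches the threshold, meaning the
    page should be captured again.
    """
    result_weight = page.history_weight + current_weight  # additive variant
    if result_weight >= threshold:
        page.history_weight = 0.0  # initialize the historical weight
        return True
    page.history_weight = result_weight  # carry the result weight forward
    return False
```

Under these assumed defaults, a page that ranks first in three successive searches accumulates 10, then 20, then 30, so the third call to `apply_weight` returns `True` and the historical weight is reset. For the subtractive variant, the historical weight would start from a positive value and the threshold test would check whether the result weight has fallen to the threshold.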
CN2008102262450A 2008-11-10 2008-11-10 Crawler capturing method and device thereof Expired - Fee Related CN101739427B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008102262450A CN101739427B (en) 2008-11-10 2008-11-10 Crawler capturing method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008102262450A CN101739427B (en) 2008-11-10 2008-11-10 Crawler capturing method and device thereof

Publications (2)

Publication Number Publication Date
CN101739427A true CN101739427A (en) 2010-06-16
CN101739427B CN101739427B (en) 2012-07-04

Family

ID=42462918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008102262450A Expired - Fee Related CN101739427B (en) 2008-11-10 2008-11-10 Crawler capturing method and device thereof

Country Status (1)

Country Link
CN (1) CN101739427B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184253A (en) * 2011-05-30 2011-09-14 北京搜狗科技发展有限公司 Method and system used for pushing grabbed and updated messages of network resource
CN102347930A (en) * 2010-07-26 2012-02-08 中国电信股份有限公司 Method and system for obtaining webpage content
CN102790700A (en) * 2011-05-19 2012-11-21 北京启明星辰信息技术股份有限公司 Method and device for recognizing webpage crawler
CN102929672A (en) * 2012-10-31 2013-02-13 北京奇虎科技有限公司 Application upgrade system and method
CN103577557A (en) * 2013-10-21 2014-02-12 北京奇虎科技有限公司 Device and method for determining capturing frequency of network resource point
CN103778165A (en) * 2012-10-26 2014-05-07 广州市邦富软件有限公司 Dynamic collecting adjusting algorithm for spider dispatching center
CN103838877A (en) * 2014-03-26 2014-06-04 北京奇虎科技有限公司 Method and device for pushing timeliness information webpage results based on search
CN103945278A (en) * 2013-01-21 2014-07-23 中国科学院声学研究所 Video content and content source crawling method
CN104202401A (en) * 2012-10-16 2014-12-10 北京奇虎科技有限公司 Application upgrading system
CN102929671B (en) * 2012-10-31 2015-11-25 北京奇虎科技有限公司 Server, application upgrade method and application upgrade system
CN105159953A (en) * 2015-08-14 2015-12-16 中国联合网络通信集团有限公司 Search method and search platform on basis of keywords
CN106294364A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 Realize the method and apparatus that web crawlers captures webpage
CN106921703A (en) * 2015-12-25 2017-07-04 阿里巴巴集团控股有限公司 The method of cross-border data syn-chronization, system, and domestic and overseas data center
CN108429721A (en) * 2017-02-15 2018-08-21 腾讯科技(深圳)有限公司 A kind of recognition methods of web crawlers and device
CN108763537A (en) * 2018-05-31 2018-11-06 河南科技大学 A kind of increment mechanical reptile method based on Time Perception
CN113626673A (en) * 2021-07-30 2021-11-09 彩讯科技股份有限公司 Page data acquisition method, system, terminal and storage medium

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102347930A (en) * 2010-07-26 2012-02-08 中国电信股份有限公司 Method and system for obtaining webpage content
CN102347930B (en) * 2010-07-26 2015-09-09 中国电信股份有限公司 Web page contents acquisition methods and system
CN102790700B (en) * 2011-05-19 2015-06-10 北京启明星辰信息技术股份有限公司 Method and device for recognizing webpage crawler
CN102790700A (en) * 2011-05-19 2012-11-21 北京启明星辰信息技术股份有限公司 Method and device for recognizing webpage crawler
CN102184253A (en) * 2011-05-30 2011-09-14 北京搜狗科技发展有限公司 Method and system used for pushing grabbed and updated messages of network resource
CN104202401B (en) * 2012-10-16 2019-03-22 北京奇虎科技有限公司 Application upgrade system
CN104202401A (en) * 2012-10-16 2014-12-10 北京奇虎科技有限公司 Application upgrading system
CN103778165A (en) * 2012-10-26 2014-05-07 广州市邦富软件有限公司 Dynamic collecting adjusting algorithm for spider dispatching center
CN102929672A (en) * 2012-10-31 2013-02-13 北京奇虎科技有限公司 Application upgrade system and method
CN102929671B (en) * 2012-10-31 2015-11-25 北京奇虎科技有限公司 Server, application upgrade method and application upgrade system
CN102929672B (en) * 2012-10-31 2015-11-25 北京奇虎科技有限公司 Application upgrade system and method
CN103945278A (en) * 2013-01-21 2014-07-23 中国科学院声学研究所 Video content and content source crawling method
CN103577557A (en) * 2013-10-21 2014-02-12 北京奇虎科技有限公司 Device and method for determining capturing frequency of network resource point
CN103838877A (en) * 2014-03-26 2014-06-04 北京奇虎科技有限公司 Method and device for pushing timeliness information webpage results based on search
CN103838877B (en) * 2014-03-26 2017-04-12 北京奇虎科技有限公司 Method and device for pushing timeliness information webpage results based on search
CN106294364A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 Realize the method and apparatus that web crawlers captures webpage
CN106294364B (en) * 2015-05-15 2020-04-10 阿里巴巴集团控股有限公司 Method and device for realizing web crawler to capture webpage
CN105159953B (en) * 2015-08-14 2018-09-14 中国联合网络通信集团有限公司 Searching method based on keyword and search platform
CN105159953A (en) * 2015-08-14 2015-12-16 中国联合网络通信集团有限公司 Search method and search platform on basis of keywords
CN106921703A (en) * 2015-12-25 2017-07-04 阿里巴巴集团控股有限公司 The method of cross-border data syn-chronization, system, and domestic and overseas data center
CN108429721A (en) * 2017-02-15 2018-08-21 腾讯科技(深圳)有限公司 A kind of recognition methods of web crawlers and device
CN108429721B (en) * 2017-02-15 2020-08-04 腾讯科技(深圳)有限公司 Identification method and device for web crawler
CN108763537A (en) * 2018-05-31 2018-11-06 河南科技大学 A kind of increment mechanical reptile method based on Time Perception
CN108763537B (en) * 2018-05-31 2021-05-18 河南科技大学 Incremental machine crawler method based on time perception
CN113626673A (en) * 2021-07-30 2021-11-09 彩讯科技股份有限公司 Page data acquisition method, system, terminal and storage medium

Also Published As

Publication number Publication date
CN101739427B (en) 2012-07-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120704

Termination date: 20201110