CN101739427A

CN101739427A - Crawler capturing method and device thereof

Info

Publication number: CN101739427A
Application number: CN200810226245A
Authority: CN
Inventors: 孙宏伟; 胡珉; 罗治国
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2008-11-10
Filing date: 2008-11-10
Publication date: 2010-06-16
Anticipated expiration: 2028-11-10
Also published as: CN101739427B

Abstract

The invention discloses a crawler capturing method and a device thereof, aiming to solve the poor timeliness of existing crawler capturing technology. The main technical scheme is as follows: the current weight of a webpage is determined according to the rank of the webpage in the current search results and/or the order in which the webpage is clicked by users; the result weight of the webpage is determined according to the current weight and the historical weight of the webpage; and when the result weight reaches a preset threshold, the information in the webpage is captured again. With this scheme, the rank of a webpage in the current search results and/or the order in which it is clicked by users influences the period at which the crawler captures the information in the webpage, so the capture period for webpages with high user attention can be shortened, which keeps the information in such webpages timely and improves the user experience.

Description

Crawler capturing method and device thereof

Technical field

The present invention relates to the field of Internet information search, and in particular to a crawler capturing method and a device thereof.

Background technology

Search engines, such as Baidu and Google, are now widely used on the Internet. By entering only some keywords describing the information they need, people can use a search engine to find a large amount of information related to those keywords.

The information in a search engine comes from various sources. Some of it is paid advertising: an advertiser pays the search engine operator through bid advertising, and the operator publishes a brief description of the advertisement and a link to it in its search engine. Far more of it is non-advertising information, such as news and academic information, which the search engine operator has to find, capture and add to the search engine itself. Faced with the massive amount of information on the Internet, how to separate the large amount of information the operator cares about from other useless information, and add it to the search engine by category, has become a problem that search engine operators urgently need to solve.

Crawler capturing technology emerged to solve this problem: according to set conditions, it crawls the information meeting those conditions out of the massive information on the Internet. Applied to a search engine, it can effectively solve the problem of capturing various kinds of useful information. A crawler needs to traverse webpages when capturing information. Given the massive number of webpages on the Internet, traversing all of them is almost impossible, and even if it were done it would take a great deal of time and resources, so the information captured by the crawler would have very poor timeliness. To address this defect, the solution generally adopted at present is to have the crawler capture information from a certain number of webpages within a certain range, generally webpages that prior statistics show to have a relatively high probability of containing useful information and a relatively large quantity of it. These webpages can be formed into a search list and recorded as the search scope of the crawler, so that at fixed intervals the crawler checks whether there are new information page links on the search list and, if so, downloads the information page via the link and extracts the useful information from it.

This approach of capturing webpage information at fixed time intervals reduces, to some extent, the time and resources consumed by each capture. In practice, however, different webpages receive different levels of user attention. If webpages with high user attention and webpages with low user attention are captured at the same frequency, the capture frequency is clearly too low for the webpages with high user attention. The information in those webpages then cannot be captured and updated in time, so its timeliness is poor, that is, the captured webpages may contain some outdated or invalid information, which in turn reduces user satisfaction with the search engine.

Summary of the invention

The invention provides an optimized crawler capturing method and a device thereof, in order to solve the poor timeliness of existing crawler capturing technology.

The embodiments of the invention are achieved through the following technical solutions:

An embodiment of the invention provides a crawler capturing method, comprising:

determining the current weight of a webpage according to the rank of the webpage in the current search results and/or the order in which the webpage is clicked by users;

determining the result weight of the webpage according to the current weight and the historical weight of the webpage;

when the result weight reaches a set threshold, capturing the information in the webpage again.

An embodiment of the invention also provides a crawler capturing device, comprising:

a current weight determining unit, configured to determine the current weight of a webpage according to the rank of the webpage in the current search results and/or the order in which the webpage is clicked by users;

a result weight determining unit, configured to determine the result weight of the webpage according to the current weight determined by the current weight determining unit and the historical weight of the webpage;

an information capturing unit, configured to capture the information in the webpage again when the result weight determined by the result weight determining unit reaches a set threshold.

With the above technical solution, an embodiment of the invention can determine the current weight of a webpage according to the rank of the webpage in the current search results and/or the order in which the webpage is clicked by users, then determine the result weight of the webpage according to its current weight and historical weight, and capture the information in the webpage again when the result weight reaches a set threshold. In general, the rank of a webpage in the current search results and/or the order in which it is clicked by users reflects well the user attention the webpage receives. On this basis, the embodiment uses the rank of the webpage in the current search results and/or the order in which it is clicked by users to influence the period at which the crawler captures the information in the webpage. Under this scheme, the capture period can be shortened for webpages with high user attention, which raises the capture frequency for such webpages, keeps their information timely and improves the user experience.

Description of drawings

Fig. 1 is the first flowchart of crawler capturing in an embodiment of the invention;

Fig. 2 is the second flowchart of crawler capturing in an embodiment of the invention;

Fig. 3 is the third flowchart of crawler capturing in an embodiment of the invention;

Fig. 4 is the first schematic diagram of the crawler capturing device in an embodiment of the invention;

Fig. 5 is the second schematic diagram of the crawler capturing device in an embodiment of the invention;

Fig. 6 is the third schematic diagram of the crawler capturing device in an embodiment of the invention.

Embodiment

In order to improve the timeliness of the information captured by a crawler and to improve user satisfaction with the search engine, the embodiments of the invention propose a crawler capturing method and a device thereof. The main implementation principle, the specific implementation process and the beneficial effects of the embodiments are explained in detail below with reference to the accompanying drawings.

In a search engine system based on a computer or a computer network, the search results returned for a user query usually comprise a list of webpage links, and the webpages in this list are generally sorted from high to low by the relevance between the information in the webpage and the query keywords. Based on this feature of the search results returned by a search engine, one embodiment of the invention proposes a method of using the rank of a webpage in the search results to influence the period at which the crawler captures the information in the webpage. As shown in Fig. 1, the method comprises the following steps:

Step 101: determine the current weight of a webpage according to the rank of the webpage in the current search results.

In this step, the current weight of a webpage identifies the rank of the webpage in the search results. Specifically, the current weight decreases as the rank of the webpage in the search results goes from front to back, and it can be determined from the rank by linear decrease, exponential decay or similar means. Further, only the top n webpages in the search results may be selected and only their current weights calculated; webpages ranked after the n-th can be assumed to be webpages with a low user click rate, and their current weight defaults to 0.

For example, when the current weight is determined by linear decrease, the current weight a of the webpage ranked k-th in the search results is:

a = \begin{cases} \dfrac{n - k + 1}{n}\, a_{0}, & k \leq n \\ 0, & k > n \end{cases}

Wherein, a ₀Be the current weight (these weights can be system default value) that comes the 1st webpage correspondence.

When adopting the linear decrease mode to determine the webpage current weight, simpler being exemplified as: come preceding 10 webpage among the default search result and be the high webpage of user's degree of click, at these 10 webpages, can distribute current weight 10 for the webpage that comes the 1st, the webpage that comes the 2nd distributes current weight 9, and the like, for the webpage that comes the 10th distributes current weight 1, correspondence comes the 10th later webpage, be defaulted as the low webpage of user's degree of click, these webpages are all distributed current weight 0.
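The linear decrease rule can be illustrated with a minimal Python sketch. The function name `linear_current_weight` and the defaults n = 10 and a₀ = 10 are illustrative assumptions taken from the simpler example above, not values fixed by the embodiment.

```python
def linear_current_weight(k: int, n: int = 10, a0: float = 10.0) -> float:
    """Current weight a of the webpage ranked k-th (1-based) in the search results.

    n and a0 are assumed defaults matching the top-10 example in the text.
    """
    if k < 1:
        raise ValueError("rank k must be >= 1")
    if k > n:
        return 0.0  # webpages ranked after the n-th default to weight 0
    # Same value as (n - k + 1) / n * a0; multiplying first limits rounding error.
    return (n - k + 1) * a0 / n


if __name__ == "__main__":
    for k in (1, 2, 10, 11):
        print(k, linear_current_weight(k))
```

With these assumed defaults the sketch reproduces the weights 10, 9, ..., 1 for ranks 1 to 10 and 0 for any later rank.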

Step 102: determine the result weight of the webpage according to the current weight and the historical weight of the webpage.

In this step, the result weight of the webpage can preferably be determined in either of the following two modes:

Mode one: add the current weight of the webpage to its historical weight to obtain the result weight of the webpage.

Mode two: subtract the current weight of the webpage from its historical weight to obtain the result weight of the webpage.

In the initial state, the historical weight of a webpage can be set to different initial values depending on the mode adopted. For example, it can be set to 0 for mode one and to 100 for mode two.

Further, the two modes above are only the preferred ways of determining the result weight in this embodiment; other ways can be adopted according to the specific strategy. In particular, the proportion of the current weight in the result weight can be set, for example, result weight = historical weight + current weight × q, where q is greater than 0 and less than 1.
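The two modes and the weighted variant can be sketched as one small function. This is only an illustration: the names `result_weight` and `INITIAL_HISTORY` are hypothetical, q = 1 is used here to stand for the unweighted modes, and applying q to mode two is an assumption, since the text gives the weighted formula only for addition.

```python
# Initial historical weights from the text: 0 for mode one, 100 for mode two.
INITIAL_HISTORY = {"add": 0.0, "subtract": 100.0}


def result_weight(history: float, current: float,
                  mode: str = "add", q: float = 1.0) -> float:
    """Combine the historical weight and the current weight of a webpage.

    mode="add" is mode one, mode="subtract" is mode two. q scales the
    current weight; q = 1 gives the plain modes (assumption: the text
    only describes 0 < q < 1, and only for addition).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if mode == "add":
        return history + q * current
    if mode == "subtract":
        return history - q * current
    raise ValueError(f"unknown mode: {mode}")
```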

Step 103: judge whether the result weight of the webpage reaches the set threshold t; if so, execute step 104, otherwise execute step 105.

In step 103, the setting of the threshold t depends on the mode used to determine the result weight in step 102. For example, when mode one is used, the threshold t is greater than the historical weight of the webpage in the initial state; when mode two is used, the threshold t is less than the historical weight of the webpage in the initial state.

Step 104: capture the information in the webpage again, and initialize the historical weight of the webpage.

In step 104, the information in the webpage can be captured again immediately when step 103 determines that the result weight of the webpage reaches the set threshold. Alternatively, the webpage whose result weight reaches the set threshold can first be recorded in a preset capture list (the same webpage is recorded only once), and the information in the webpage is captured again after a set duration, so as to reduce the use of system resources.
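One way to picture the capture list with its "recorded only once" property is the following Python sketch. The class name `CaptureList`, its methods and the example URLs are hypothetical; how the set duration is timed is left to the caller and is not modeled here.

```python
from typing import Dict, List


class CaptureList:
    """Pending re-capture list that records each webpage at most once."""

    def __init__(self) -> None:
        # A dict keeps insertion order and gives O(1) duplicate checks.
        self._pending: Dict[str, None] = {}

    def record(self, url: str) -> bool:
        """Record url; return False if it was already recorded."""
        if url in self._pending:
            return False
        self._pending[url] = None
        return True

    def drain(self) -> List[str]:
        """Return all recorded webpages in recording order and empty the list."""
        urls = list(self._pending)
        self._pending.clear()
        return urls
```

A scheduler could call `drain()` once per set duration and re-capture every returned webpage in a single batch.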

Step 105: update the historical weight of the webpage with its result weight, and return to step 101.

In the above flow, a webpage whose result weight does not reach the set threshold within a set period (such as three months) can be captured again at the end of the period, and its historical weight is initialized.
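Steps 102 to 105 can be sketched for mode one as follows. This is a minimal illustration under stated assumptions: "reaches the threshold" is read as `>=`, the names `PageState`, `observe` and `observations_until_capture` are hypothetical, and the forced re-capture at the end of the set period is not modeled.

```python
from dataclasses import dataclass


@dataclass
class PageState:
    history: float = 0.0  # initial historical weight for mode one


def observe(state: PageState, current: float, threshold: float,
            initial_history: float = 0.0) -> bool:
    """Apply steps 102-105 once; return True if the webpage should be re-captured."""
    result = state.history + current      # step 102, mode one
    if result >= threshold:               # step 103 (assumed: "reaches" means >=)
        state.history = initial_history   # step 104: caller re-captures the webpage
        return True
    state.history = result                # step 105
    return False


def observations_until_capture(current: float, threshold: float) -> int:
    """Count how many identical observations pass before a re-capture triggers."""
    if current <= 0:
        raise ValueError("current weight must be positive to ever reach the threshold")
    state = PageState()
    count = 1
    while not observe(state, current, threshold):
        count += 1
    return count
```

With an assumed threshold of 30, a webpage that always ranks 1st (current weight 10) triggers a re-capture on its 3rd observation, while a webpage that always ranks 10th (current weight 1) needs 30 observations, which is the shortening of the capture period for high-attention webpages that the flow aims at.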

After the search engine returns the search results to the user, the user browses the keywords of each webpage link and clicks the relevant links according to actual needs. The user does not necessarily click webpages in the display order of the links in the search results, and may skip the links shown at the front and directly click links further back. Based on this feature of how users click webpages, another embodiment of the invention proposes a method of using the order in which a webpage is clicked by users to influence the period at which the crawler captures the information in the webpage. As shown in Fig. 2, the method comprises the following steps:

Step 201: determine the current weight of a webpage according to the order in which the webpage is clicked by users.

In this step, the current weight of a webpage identifies the order in which the webpage is clicked by users. Specifically, the current weight decreases as the click order goes from front to back, and it can be determined from the click order by linear decrease, exponential decay or similar means. Further, only the first m webpages in the click order may be selected and only their current weights calculated; webpages after the m-th can be assumed to be webpages with a low user click rate, and their current weight defaults to 0.

For example, when the current weight is determined by exponential decay, the current weight b of the webpage that is the j-th one clicked by the user is:

b = \begin{cases} b_{0}\,(1 - c)^{\,j - 1}, & j \leq m \\ 0, & j > m \end{cases}

where b₀ is the current weight of the first webpage clicked by the user (this weight may be a system default value), and c ∈ (0, 1) is the attenuation coefficient: the larger c is, the faster the current weight decays as the click order goes from front to back.
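The exponential decay rule can be sketched in Python as follows. The function name `exponential_current_weight` and the defaults m = 10, b₀ = 10 and c = 0.5 are illustrative assumptions; the text does not fix these values.

```python
def exponential_current_weight(j: int, m: int = 10,
                               b0: float = 10.0, c: float = 0.5) -> float:
    """Current weight b of the webpage that is the j-th one clicked (1-based).

    m, b0 and c are assumed example values; c must lie in (0, 1).
    """
    if j < 1:
        raise ValueError("click order j must be >= 1")
    if not 0 < c < 1:
        raise ValueError("attenuation coefficient c must be in (0, 1)")
    if j > m:
        return 0.0  # webpages clicked after the m-th default to weight 0
    return b0 * (1 - c) ** (j - 1)
```

With c = 0.5 the weight halves at every position in the click order (10, 5, 2.5, ...); a larger c makes the weights of later clicks fall off faster.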

Step 202: determine the result weight of the webpage according to the current weight and the historical weight of the webpage.

Step 202 follows the same basic principle as step 102 above and is not described in detail here.

Step 203: judge whether the result weight of the webpage reaches the set threshold t; if so, execute step 204, otherwise execute step 205.

Step 203 follows the same basic principle as step 103 above and is not described in detail here.

Step 204: capture the information in the webpage again, and initialize the historical weight of the webpage.

Step 205: update the historical weight of the webpage with its result weight, and return to step 201.

Further, for the same webpage, its rank in the search results and the order in which it is clicked by users may be inconsistent. Based on this, one embodiment of the invention proposes a method of using both the rank of a webpage in the search results and the order in which the webpage is clicked by users to jointly influence the period at which the crawler captures the information in the webpage. As shown in Fig. 3, the method comprises the following steps:

Step 301: determine the first current weight of a webpage according to the rank of the webpage in the current search results.

Step 301 follows the same basic principle as step 101 above and is not described in detail here.

Step 302: determine the result weight of the webpage according to the historical weight and the first current weight of the webpage.

This step follows the same basic principle as step 102 or step 202 above and is not described in detail here.

Step 303: judge whether the result weight of the webpage reaches the set threshold t; if so, execute step 304, otherwise execute step 305.

Step 304: capture the information in the webpage again, and initialize the historical weight of the webpage.

Step 305: update the historical weight of the webpage with its result weight.

Step 306: determine the second current weight of the webpage according to the order in which the webpage is clicked by users.

Step 306 follows the same basic principle as step 201 above and is not described in detail here.

Step 307: determine the result weight of the webpage according to the second current weight and the historical weight of the webpage.

This step follows the same basic principle as step 102 or step 202 above and is not described in detail here. Note that the historical weight of the webpage here is the historical weight as updated in step 305.

Step 308: judge whether the result weight of the webpage reaches the set threshold t; if so, execute step 304, otherwise execute step 309.

Step 309: update the historical weight of the webpage with its result weight, and return to step 301.

In the above flow, the rank of the webpage in the search results is used first to influence the crawler capture period, and the order in which the webpage is clicked by users is used second. In another embodiment of the invention, the click order can be used first and the rank in the search results second. The detailed process is basically the same as the above flow; the difference is that in step 301 the first current weight of the webpage is determined according to the order in which the webpage is clicked by users, and in step 306 the second current weight of the webpage is determined according to the rank of the webpage in the search results.
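One pass through the two-stage flow of Fig. 3 can be sketched as a single function. This is an illustration only, assuming mode one (addition) and reading "reaches the threshold" as `>=`; the name `two_stage_update` is hypothetical. Swapping the two weight arguments gives the alternative embodiment in which the click order is applied first.

```python
from typing import Tuple


def two_stage_update(history: float, first_weight: float, second_weight: float,
                     threshold: float,
                     initial_history: float = 0.0) -> Tuple[float, bool]:
    """Run steps 302-309 once; return (new historical weight, re-capture needed)."""
    result = history + first_weight       # step 302
    if result >= threshold:               # step 303
        return initial_history, True      # step 304: re-capture and initialize
    history = result                      # step 305
    result = history + second_weight      # step 307 uses the updated history
    if result >= threshold:               # step 308
        return initial_history, True      # step 304
    return result, False                  # step 309
```

For example, with an assumed threshold of 30, a first current weight of 10 and a second current weight of 5, the historical weight goes from 0 to 15 after one pass, and the second pass reaches the threshold in its second stage (15 + 10 = 25, then 25 + 5 = 30).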

In the above embodiments, one of the rank of the webpage in the search results and the order in which it is clicked by users is first used to influence the crawler capture period, that is, to influence the result weight of the webpage; when the result weight does not reach the set threshold, the other of the two is then used to influence the capture period. The invention also proposes an embodiment in which the rank of the webpage in the current search results and the order in which it is clicked by users influence the current weight of the webpage at the same time, and thereby its result weight. Specifically:

First step: determine the first weight of the webpage according to its rank in the current search results; the first weight decreases as the rank goes from front to back.

This process follows the same basic principle as step 101 above and is not described in detail here.

Second step: determine the second weight of the webpage according to the order in which it is clicked by users; the second weight decreases as the click order goes from front to back.

This process follows the same basic principle as step 201 above and is not described in detail here.

Third step: determine the current weight of the webpage according to the first weight and the second weight. For example, the first weight and the second weight can be added to obtain the current weight of the webpage.

The first step and the second step above are numbered only for convenience of description and have no strict execution order; the second step may be executed before the first step, or the two may be executed simultaneously.
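The three steps can be sketched as one function that adds a rank-based first weight and a click-based second weight. The sketch assumes linear decrease for the rank and exponential decay for the click order, as in the earlier examples; the function name, the default parameter values, and the treatment of a missing rank or click (`None`) as weight 0 are all assumptions for illustration.

```python
from typing import Optional


def combined_current_weight(k: Optional[int], j: Optional[int],
                            n: int = 10, a0: float = 10.0,
                            m: int = 10, b0: float = 10.0,
                            c: float = 0.5) -> float:
    """Current weight as first weight (rank k) plus second weight (click order j).

    k or j may be None when the webpage is not ranked or was not clicked;
    that case is treated as weight 0 (an assumption, not stated in the text).
    """
    # First step: linear decrease over the top n ranks.
    first = (n - k + 1) * a0 / n if k is not None and 1 <= k <= n else 0.0
    # Second step: exponential decay over the first m clicks.
    second = b0 * (1 - c) ** (j - 1) if j is not None and 1 <= j <= m else 0.0
    # Third step: add the two weights.
    return first + second
```

Because the two weights are computed independently, the order in which the first and second steps run does not affect the result.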

An embodiment of the invention also provides a crawler capturing device which, as shown in Fig. 4, comprises a current weight determining unit 401, a result weight determining unit 402 and an information capturing unit 403, wherein:

The current weight determining unit 401 is configured to determine the current weight of a webpage according to the rank of the webpage in the current search results and/or the order in which the webpage is clicked by users.

The result weight determining unit 402 is configured to determine the result weight of the webpage according to the current weight determined by the current weight determining unit 401 and the historical weight of the webpage.

The information capturing unit 403 is configured to capture the information in the webpage again when the result weight determined by the result weight determining unit 402 reaches a set threshold.

In one embodiment, when the current weight determined by the current weight determining unit 401 is determined according to the rank of the webpage in the current search results or the order in which the webpage is clicked by users, the current weight decreases as that rank or click order goes from front to back. The result weight determining unit 402 is further configured to add the current weight of the webpage to its historical weight to obtain the result weight of the webpage, or to subtract the current weight from the historical weight to obtain the result weight of the webpage.

In one embodiment, the current weight determining unit 401 is further configured to: determine the first weight of the webpage according to its rank in the current search results, the first weight decreasing as the rank goes from front to back; determine the second weight of the webpage according to the order in which it is clicked by users, the second weight decreasing as the click order goes from front to back; and determine the current weight of the webpage according to the first weight and the second weight.

In one embodiment, as shown in Fig. 5, the device shown in Fig. 4 may further comprise a historical weight initialization unit 404, configured to initialize the historical weight of the webpage when the result weight determined by the result weight determining unit 402 reaches the set threshold.

In one embodiment, as shown in Fig. 6, the device shown in Fig. 4 may further comprise a historical weight updating unit 405, configured to update the historical weight of the webpage with its result weight when the result weight determined by the result weight determining unit 402 does not reach the set threshold.

Preferably, in the device shown in Fig. 6, the current weight determining unit 401 is also configured to: after the historical weight updating unit 405 updates the historical weight of the webpage, if the current weight of the webpage was determined according to the rank of the webpage in the current search results, determine the current weight of the webpage according to the order in which the webpage is clicked by users; or, after the historical weight updating unit 405 updates the historical weight of the webpage, if the current weight of the webpage was determined according to the order in which the webpage is clicked by users, determine the current weight of the webpage according to the rank of the webpage in the search results.

The historical weight initialization unit 404 and the historical weight updating unit 405 may be the same unit.

With the above technical solution, an embodiment of the invention can determine the current weight of a webpage according to the rank of the webpage in the current search results and/or the order in which the webpage is clicked by users, then determine the result weight of the webpage according to its current weight and historical weight, and capture the information in the webpage again when the result weight reaches a set threshold. In general, the rank of a webpage in the current search results and/or the order in which it is clicked by users reflects well the user attention the webpage receives. On this basis, the embodiment uses the rank of the webpage in the current search results or the order in which it is clicked by users to influence the period at which the crawler captures the information in the webpage. Under this scheme, the capture period can be shortened for webpages with high user attention, which raises the capture frequency for such webpages, keeps their information timely and improves the user experience.

Further, in an embodiment of the invention, the rank of a webpage in the current search results and the order in which it is clicked by users can jointly influence the period at which the crawler captures the information in the webpage. This keeps timely the webpages that rank near the front of the search results returned by the search engine and that users frequently click, that is, it ensures that these webpages are freshly captured, which in turn improves user satisfaction with the search engine.

Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is also intended to include them.

Claims

1. A crawler capturing method, characterized by comprising:

2. The method of claim 1, characterized in that:

when the current weight of the webpage is determined according to the rank of the webpage in the current search results, the current weight of the webpage decreases as that rank goes from front to back; or when the current weight of the webpage is determined according to the order in which the webpage is clicked by users, the current weight of the webpage decreases as that click order goes from front to back;

determining the result weight of the webpage according to the current weight and the historical weight of the webpage comprises:

adding the current weight to the historical weight of the webpage to obtain the result weight of the webpage; or subtracting the current weight from the historical weight of the webpage to obtain the result weight of the webpage.

3. The method of claim 1, characterized in that determining the current weight of the webpage according to the rank of the webpage in the current search results and the order in which the webpage is clicked by users comprises:

determining the first weight of the webpage according to the rank of the webpage in the current search results, the first weight decreasing as that rank goes from front to back;

and determining the second weight of the webpage according to the order in which the webpage is clicked by users, the second weight decreasing as that click order goes from front to back;

determining the current weight of the webpage according to the first weight and the second weight.

4. The method of claim 1, 2 or 3, characterized in that, when the result weight reaches the set threshold, the method further comprises:

initializing the historical weight of the webpage.

5. The method of claim 1, 2 or 3, characterized in that, when the result weight does not reach the set threshold, the method further comprises:

updating the historical weight of the webpage with the result weight of the webpage.

6. The method of claim 5, characterized in that, after the historical weight of the webpage is updated, and when the current weight of the webpage was determined according to the rank of the webpage in the current search results, the method further comprises:

determining the current weight of the webpage according to the order in which the webpage is clicked by users.

7. The method of claim 5, characterized in that, after the historical weight of the webpage is updated, and when the current weight of the webpage was determined according to the order in which the webpage is clicked by users, the method further comprises:

determining the current weight of the webpage according to the rank of the webpage in the search results.

8. A crawler capturing device, characterized by comprising:

9. The device of claim 8, characterized in that, when the current weight determined by the current weight determining unit is determined according to the rank of the webpage in the current search results, the current weight decreases as that rank goes from front to back; or when the current weight determined by the current weight determining unit is determined according to the order in which the webpage is clicked by users, the current weight decreases as that click order goes from front to back;

the result weight determining unit is further configured to: add the current weight to the historical weight of the webpage to obtain the result weight of the webpage; or subtract the current weight from the historical weight of the webpage to obtain the result weight of the webpage.

10. The device of claim 8, characterized in that the current weight determining unit is further configured to: determine the first weight of the webpage according to the rank of the webpage in the current search results, the first weight decreasing as that rank goes from front to back; determine the second weight of the webpage according to the order in which the webpage is clicked by users, the second weight decreasing as that click order goes from front to back; and determine the current weight of the webpage according to the first weight and the second weight.

11. The device of claim 8, characterized by further comprising:

a historical weight initialization unit, configured to initialize the historical weight of the webpage when the result weight determined by the result weight determining unit reaches the set threshold.

12. The device of claim 8, 9, 10 or 11, characterized by further comprising:

a historical weight updating unit, configured to update the historical weight of the webpage with the result weight of the webpage when the result weight determined by the result weight determining unit does not reach the set threshold.

13. The device of claim 12, characterized in that the current weight determining unit is also configured to, after the historical weight updating unit updates the historical weight of the webpage, and when the current weight of the webpage was determined according to the rank of the webpage in the current search results, determine the current weight of the webpage according to the order in which the webpage is clicked by users.

14. The device of claim 12, characterized in that the current weight determining unit is also configured to, after the historical weight updating unit updates the historical weight of the webpage, and when the current weight of the webpage was determined according to the order in which the webpage is clicked by users, determine the current weight of the webpage according to the rank of the webpage in the search results.