CN101105801A - Automatic positioning method of network key resource page - Google Patents

Info

Publication number
CN101105801A
Authority
CN
China
Prior art keywords
user
inquiry
page
search engine
clicking rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007100985319A
Other languages
Chinese (zh)
Other versions
CN100507918C (en
Inventor
岑荣伟
刘奕群
张敏
金奕江
马少平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CNB2007100985319A (CN100507918C)
Publication of CN101105801A
Application granted
Publication of CN100507918C
Legal status: Active
Anticipated expiration

Abstract

An automatic method for locating key resource pages on the Web, belonging to the field of Internet information processing. The method first screens timely and representative query topics out of large volumes of search engine user query and click records, and extracts the result pages that users clicked together with the "user click rate" of each page, yielding a set of query topics, a candidate set of key resource pages, and a "user click distribution". It then fuses the "user click distributions" obtained from the query and click records of different search engines, weighting them by "query confidence" information, to obtain a fused "user click distribution". Finally, it judges the pages in the candidate set according to the fused "user click distribution" to obtain the key resource pages for each topic. The method is automatic, accurate, and objective, and can be carried out quickly by computer.

Description

An automatic method for locating network key resource pages
Technical field
The invention belongs to the field of Internet information processing, and in particular relates to an automatic method for locating network key resource pages based on the analysis and mining of user behavior.
Background art
A search engine is a computer system that collects information on the Internet according to some strategy, organizes and processes it, and then provides an information retrieval service to users. It comprises three parts: the computer network, the computer hardware system, and the software that runs on the hardware. Its main role is to help users quickly and efficiently obtain high-quality information from the Internet that meets their needs.
At present, a general-purpose search engine comprises three parts: information collection, information organization, and user query. The search engine collects information with a tool called a Web crawler, organizes the fetched information with an indexer, and then uses a searcher to process user queries and return a list of relevant results that satisfies the user's query need.
From the user's point of view, a search engine provides a page containing a search box. The user enters keywords that reflect the query need, submits them to the search engine through the browser, receives a list of search results related to the input, and clicks through to the information sought.
A network key resource page can be understood as a page that is authoritative and credible for a query topic and most useful to a user seeking information. According to the authoritative definition from the Web retrieval track of the Text Retrieval Conference (TREC), a key resource page should be the entry page of a key website that provides reliable information about a topic. (The entry page here is not necessarily a "home page" in the ordinary sense: it may be the entry page of a large website, or the entry page of a sub-site or of a collection of pages of one kind.) A key resource page is therefore "key" because it gives the user a reliable entry point to information on a topic. Through key resource pages, users can quickly find the information they need. Moreover, a topic has far fewer key resource pages than relevant pages (relevant pages easily number in the hundreds or thousands, while key resource pages usually number from a few to a few dozen), which helps users focus on the handful of pages most pertinent to their query topic.
More than 80% of search needs in current Web information retrieval can be met by key resource finding; for this majority of needs, only key resource pages are the results the user wants. Finding the key resources for a given query topic automatically and accurately is therefore very important for improving the effectiveness of Web information retrieval tools, and key resource location has become one of the most actively studied problems in Web information retrieval research and application. This is reflected at the International ACM SIGIR Conference on Research and Development in Information Retrieval, the highest-level international meeting in information retrieval research, held by the ACM Special Interest Group on Information Retrieval (SIGIR), where key resource finding has been a focus of discussion in recent years in both the number and the quality of papers. Key resource finding is a current focus of Web information retrieval, and fruitful theoretical and experimental results have been obtained. On the whole, however, research on key resource location remains at a fairly low level: precision at 10 documents (P@10), used as the evaluation criterion, has hovered around 20%, and many query-independent features that distinguish Web data from general data have not been fully investigated.
Key resource location techniques fall into two broad classes according to their starting point. The first class starts from the page and judges whether a page is a key resource according to page features such as text content and hyperlink relations. Such key resource pages are usually also called high-quality pages and, when relevant to a query topic, search target pages. The second class starts from the need: given a query topic, it filters the pages relevant to that topic out of a large number of pages. The two classes differ in their methods and in the environments where they apply.
The first class, topic-independent key resource location, grades pages by page quality evaluation and decides whether each is a key resource page. Existing techniques evaluate page quality mainly from the hyperlink relations between pages and from features of the pages themselves; the main algorithms are PageRank and HITS. Such techniques can be applied to tiered indexing in search engines, result ranking, and related settings to improve the retrieval speed and accuracy of search engines.
The second class, topic-dependent location, starts directly from user needs and associates key resource pages with topics, so it can greatly reduce the number of pages irrelevant to the given topic and significantly improve the usefulness of stored pages and the utilization of storage resources. Topic-dependent key resource location has many practical applications in strong demand, for example: building directory-style Web search from query topics and key resource pages; annotating answers for query topics in order to evaluate search engines; and improving the precision of returned query results. None of these needs can be met by the first class of techniques, because the key resource pages must be associated with specific topics.
Existing topic-dependent key resource judgment basically takes a given topic and decides manually whether each page is a key resource for that topic, which requires a great deal of human labor. The Text Retrieval Conference (TREC), organized by the U.S. National Institute of Standards and Technology (NIST), has over years of experience developed a technique that reduces this labor, the core of which is called pooling. Even so, the shortcomings of existing topic-dependent techniques are obvious. Although the manual annotation workload is greatly reduced, locating topic-dependent key resource pages at large scale remains hard to carry out, and the subjectivity that manual annotation introduces is hard to avoid. This is far from sufficient for the application and analysis of large-scale, real-time Web information retrieval (where the corpus exceeds one billion pages and large numbers of query topics arrive every day or every few days). In addition, the ordinary search process can itself be regarded as a means of location, but a search engine returns too many result pages, retrieval precision is low, and user satisfaction is not high.
In a real commercial search engine, users click on returned results according to their own understanding and satisfaction, and this click behavior is easy to record. Such records of user queries and clicks are usually called search engine logs. Query click records reflect not only users' query interests but also their choices among and judgments of the results. Screening relevant query topics from user query click information is therefore feasible. Existing statistical studies show that in everyday searching the most frequent 1% of queries account for more than 70% of all query submissions. By computing statistics over user click information and finding the commonly used queries, one can therefore represent most user query needs; analyzing the associated click behavior then allows the key resource pages for each topic to be located automatically and effectively.
Summary of the invention
The object of the invention is to address the shortcomings of existing methods by proposing a network key resource location method based on user behavior analysis. The method uses existing user queries and clicks on several search engines. From a macroscopic statistical viewpoint, it extracts the query topics that users care about, analyzes the user click distribution characteristics of key resource pages, and selects the corresponding key resource pages. Because the analysis fuses user behavior from several search engines, it avoids the bias and shortcomings that the index size and retrieval strategy of a single search engine would introduce, and to some extent guarantees the recall and accuracy of key resource location. In addition, because the selection of query topics and the location of key resource pages are completed automatically by computer, the results reflect query topics and key resource pages in a timely, accurate, and objective way.
The method is described as follows:
1. Using information such as user query frequency and result clicks, automatically screen out query topics that are timely, reflect most users' query needs, and can be annotated accurately;
2. According to user behavior on each search engine, compute for each query topic the clicked pages and their click rates, obtaining the key resource page candidate set and the user click distribution formed by all clicked pages for the topic and their click rates;
3. Using a fusion method, derive the overall user click distribution of each query topic from its user click distributions on the individual search engines;
4. According to the fused click distribution of each topic, screen out the corresponding key resource pages.
The invention is characterized in that:
It is carried out on a computer and comprises the following steps in order (steps 1 and 2 are performed independently on each search engine log):
Step 1. Query topic screening
Step 1.1 Data preprocessing
The query topics, key resource pages, and related information used during location all come from the user logs of several search engines. To be usable for the automatic location of network key resource pages, a search engine user log must contain at least the following items:
Table 1: Items that a search engine user log must contain for key resource location

| Name | Recorded content | Record length (bits) |
| --- | --- | --- |
| Query | The query submitted by the user | 256 |
| URL | The address of the result the user clicked for this query | 256 |
| Id | A user identifier assigned automatically by the system; a user is automatically assigned a different identifier each time the search engine is used | 32 |
A search engine service provider can easily obtain this information from its Web servers, which ensures the feasibility of the method. Because search engines differ in the storage format and presentation of their user logs, the specific processing differs slightly, but basically the following steps are needed to preprocess a user log:
Step 1.1.1 Convert the encoding of the user log from the format recorded by the server to GBK, the national-standard Chinese character encoding.
Step 1.1.2 Organize the user log according to the items listed in Table 1, removing information outside those items and arranging the log as strings of those items.
Step 1.1.3 Use string matching (for example an improved KMP string pattern matching algorithm) to filter noise out of the user queries, including prohibited query words and query words used for online product promotion, keeping only the items that directly reflect the query needs and behavior of ordinary search engine users.
After preprocessing, the items listed in Table 1 have been extracted and are used in the following steps of the method.
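The three preprocessing steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the tab-separated input layout, the `ClickRecord` type, the `BLOCKED_TERMS` list, and the default source encoding are all hypothetical, and plain substring matching stands in for the improved KMP matcher mentioned in step 1.1.3.

```python
from typing import Iterable, Iterator, NamedTuple


class ClickRecord(NamedTuple):
    query: str    # Table 1 "Query": the query the user submitted
    url: str      # Table 1 "URL": the result address the user clicked
    user_id: str  # Table 1 "Id": system-assigned user identifier


# Hypothetical noise terms; the source does not list the real ones.
BLOCKED_TERMS = ("blocked-term", "promo-campaign")


def preprocess(raw_lines: Iterable[bytes],
               source_encoding: str = "utf-8") -> Iterator[ClickRecord]:
    """Decode raw log lines, keep only the Table 1 items, drop noisy queries."""
    for raw in raw_lines:
        # Step 1.1.1: decode from the server's encoding (assumed here).
        text = raw.decode(source_encoding, errors="replace").rstrip("\r\n")
        # Step 1.1.2: keep only the first three fields (assumed tab-separated).
        fields = text.split("\t")
        if len(fields) < 3:
            continue
        query, url, user_id = (field.strip() for field in fields[:3])
        if not query or not url:
            continue
        # Step 1.1.3: drop queries containing a blocked term (substring match).
        if any(term in query for term in BLOCKED_TERMS):
            continue
        yield ClickRecord(query, url, user_id)


def to_gbk_line(record: ClickRecord) -> bytes:
    """Serialize one cleaned record as a GBK-encoded, tab-separated line."""
    return "\t".join(record).encode("gbk", errors="replace")
```

Decoding to Python strings first and encoding to GBK only on output keeps the filtering logic independent of the byte encoding each server happens to use.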
Step 1.2 Query topic selection
The required query topic set S is selected according to the following rule:
If a query Q in the search engine log was submitted by fewer than 20 distinct users, it is excluded from S;
Otherwise, Q is put into the query set S.
Screening query topics by the number of querying users ensures that the selected queries reflect current query trends, are timely and widely followed, and are reasonably representative. In addition, selecting queries with more users reduces the large fluctuations that the click behavior of individual users would otherwise cause during key resource location.
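The selection rule above can be sketched as a count of distinct user identifiers per query. The record layout, `(query, url, user_id)` tuples, and the function name are assumptions made for illustration; only the threshold of 20 comes from the text.

```python
from collections import defaultdict

MIN_DISTINCT_USERS = 20  # threshold stated in step 1.2


def select_query_set(records, min_users=MIN_DISTINCT_USERS):
    """Return the set S of queries submitted by at least `min_users` distinct users.

    `records` is an iterable of (query, url, user_id) tuples (assumed layout).
    """
    users_per_query = defaultdict(set)
    for query, _url, user_id in records:
        users_per_query[query].add(user_id)
    # Queries with fewer than `min_users` distinct users are excluded from S.
    return {q for q, users in users_per_query.items() if len(users) >= min_users}
```

Counting a set of identifiers rather than raw records means that one user clicking many results for the same query does not inflate the count.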
Step 2. Extraction of user click rate features from a single search engine log
Step 2.1 Extract the "user click rate" of each page
Each query Q in the query set S has a series of clicked result pages. From the user query and click information of Table 1 we obtain the addresses (URLs) of these clicked result pages and compute, for this query, the "user click rate" of each page URL, that is, the share of the query's user clicks that fell on this page. For query Q, the "user click rate" of each page is computed as:

$$\text{user click rate}(URL \mid Q) = \frac{\text{number of user clicks on result } URL \text{ for query } Q}{\text{total number of user clicks for query } Q} \qquad (1)$$

Here, the "number of user clicks on result URL for query Q" is obtained by counting the records in which query Q is paired with the clicked URL, and the "total number of user clicks for query Q" is obtained by counting all user clicks for query Q.
By definition, the number of user clicks on a result URL is necessarily less than or equal to the total number of user clicks for query Q, so the "user click rate" lies between 0 and 1. For a query Q, the "user click rates" of all the result pages its users clicked sum to 1.
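Formula (1) amounts to normalizing per-URL click counts within each query. A minimal sketch follows, assuming the log has already been reduced to `(query, url, user_id)` tuples; the function name and record layout are illustrative, not taken from the source.

```python
from collections import Counter, defaultdict


def user_click_rates(records):
    """Compute formula (1): per-query click rate of every clicked URL.

    `records` is an iterable of (query, url, user_id) tuples (assumed layout).
    Returns {query: {url: click_rate}}.
    """
    clicks = defaultdict(Counter)
    for query, url, _user_id in records:
        clicks[query][url] += 1  # one record is one click on `url` for `query`
    rates = {}
    for query, per_url in clicks.items():
        total = sum(per_url.values())  # total clicks for this query
        rates[query] = {url: n / total for url, n in per_url.items()}
    return rates
```

Because every click for a query is counted exactly once in both numerator and denominator, the rates for each query sum to 1, as the text states.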
Step 2.2 Generate the key resource page candidate set of a query
From all the pages that users clicked for query Q and their "user click rates", the key resource page candidate set of Q is generated by the following rule:
If the "user click rate" of a page is less than 0.05, the page is rejected;
Otherwise, the page is added to the key resource page candidate set of the query.
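The candidate rule above is a simple threshold filter over the click rates of one query. The sketch below assumes the rates are held in a `{url: click_rate}` dictionary; the function name is hypothetical, and the surviving rates are deliberately left unnormalized because the source does not mention renormalizing them.

```python
MIN_CLICK_RATE = 0.05  # threshold stated in step 2.2


def candidate_distribution(rates_for_query, min_rate=MIN_CLICK_RATE):
    """Keep pages whose click rate is at least `min_rate`.

    `rates_for_query` maps url -> "user click rate" for one query.
    The returned mapping is the candidate set together with its click rates.
    """
    return {url: rate for url, rate in rates_for_query.items() if rate >= min_rate}
```

For example, with rates `{"a": 0.80, "b": 0.16, "c": 0.04}` only `a` and `b` remain in the candidate set.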
For query Q, step 2.1 determined the "user click rate" of each page its users clicked. A page with a large "user click rate" is one on whose relevance to the query topic users largely agree. Conversely, a page with a small "user click rate" has little user endorsement and is less likely to be relevant to the query. For query Q, pages with small "user click rates" are numerous on the one hand and only weakly related to the given topic on the other, so they are removed from the candidate set in advance to avoid useless later processing.
Step 2.3 Generate the "user click distribution" of a query
For query Q, collecting the pages in its candidate set together with their "user click rates" yields the "user click distribution" of the query.
The "user click distribution" of query Q describes the key resource page candidate set of the query topic, together with the confidence and support for each page as a key resource page of Q: the larger a page's "user click rate", the more likely the page is to be a key resource page of the query.
Step 3. Fusion of user click distributions across several search engine logs
Step 3.1 Extract the "query confidence" of each search engine user log for a query topic
From the number of users who submitted query topic Q in each search engine user log SE, we compute the "query confidence" of each search engine log for Q. It quantifies how reliable the "user click distribution" obtained for this topic from each search engine log is. For query Q, the "query confidence" of search engine log SE_j is computed as:

$$\text{query confidence}(SE_j \mid Q) = \frac{\log(\text{total number of users of query } Q \text{ in } SE_j)}{\sum_i \log(\text{total number of users of query } Q \text{ in } SE_i)} \qquad (2)$$

Here, the "total number of users of query Q in search engine log SE_j" is obtained by counting the distinct Id values for query Q in SE_j. The denominator is the sum, over all search engine logs, of the logarithm of the number of users, which normalizes the "query confidence" values.
By definition, the numerator log(total number of users of query Q in SE_j) is necessarily less than or equal to the denominator, so the "query confidence" lies between 0 and 1.
The formula reflects that, for a query Q and a search engine log SE, the "query confidence" is sensitive to the number of users of Q when that number is small; when the number of users is large, its influence on the "query confidence" is relatively weaker.
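Formula (2) can be sketched as a log-scaled normalization of per-engine user counts. The function name and the `{engine: user_count}` input are assumptions; the source does not specify the logarithm base (the ratio is the same for any base) or what to do when every logarithm is zero, so the equal-weight fallback below is an assumption as well.

```python
import math


def query_confidence(users_per_engine):
    """Compute formula (2) for one query.

    `users_per_engine` maps engine name -> number of distinct users who
    submitted the query in that engine's log. Engines with no users are skipped.
    """
    logs = {engine: math.log(n) for engine, n in users_per_engine.items() if n >= 1}
    total = sum(logs.values())
    if total == 0.0:
        # Assumption (not in the source): if every engine has exactly one
        # user, log(1) = 0 everywhere, so fall back to equal weights.
        return {engine: 1.0 / len(logs) for engine in logs} if logs else {}
    return {engine: value / total for engine, value in logs.items()}
```

With 100 users on one engine and 10 on another, the weights are 2/3 and 1/3 rather than 10/11 and 1/11, which illustrates how the logarithm damps the influence of large user counts.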
Step 3.2 Fusion of several search engine user logs
Let P(CRP | Q) denote the fused "user click rate" of a clicked result page CRP for query Q. By the law of total probability over conditional distributions, it is computed as:

$$P(CRP \mid Q) = \sum_i P(SE_i \mid Q)\, P(CRP \mid SE_i, Q) \qquad (3)$$

Here, P(SE_i | Q) is the support that search engine user log SE_i provides for query Q, computed as the "query confidence" of formula (2); P(CRP | SE_i, Q) is the click rate of result page CRP for query Q in search engine log SE_i, computed as the "user click rate" of that result page on that search engine from formula (1). By the basic properties of probability, P(CRP | Q) necessarily lies between 0 and 1.
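The fusion formula is a confidence-weighted sum of the per-engine click rates. A minimal sketch, assuming the per-engine distributions and confidences for one query are already available as dictionaries (the names and data layout are illustrative):

```python
def fuse_click_distributions(distributions, confidence):
    """Fuse per-engine click distributions for one query by a weighted sum.

    `distributions` maps engine -> {url: click_rate}   (P(CRP | SE_i, Q))
    `confidence`    maps engine -> query confidence    (P(SE_i | Q))
    Returns {url: fused click rate}                    (P(CRP | Q))
    """
    fused = {}
    for engine, distribution in distributions.items():
        weight = confidence.get(engine, 0.0)
        for url, rate in distribution.items():
            # A page absent from an engine's log contributes rate 0 there.
            fused[url] = fused.get(url, 0.0) + weight * rate
    return fused
```

A page clicked on only one engine therefore keeps only that engine's share of the weight, which is how the fusion tempers the bias of any single log.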
As in step 2.2, from the fused "user click rates" P(CRP | Q) we obtain the "user click distribution" of query Q after the information from several search engine logs has been fused.
The fused "user click distribution" of a query removes the bias present in a "user click distribution" obtained from a single search engine log.
Step 4. Judgment of the key resource pages relevant to a query
For each query Q in the query set S selected in step 1 and its key resource page candidate set, the fused "user click distribution" of Q is obtained by step 3, and the key resource pages of query topic Q are screened by the following rule:
For each query Q, its key resource pages are the M pages with the largest fused "user click rates", taken consecutively in descending order starting from the page with the largest fused "user click rate", where M satisfies: the fused "user click rates" of the first M pages sum to more than 0.9, while those of the first M-1 pages sum to less than 0.9.
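The rule for choosing M can be sketched as a cumulative sum over pages sorted by fused click rate. The function name is hypothetical, and returning every page when the cumulative sum never exceeds 0.9 is an assumption, since the source does not cover that case.

```python
def select_key_pages(fused, mass=0.9):
    """Return the top-M pages whose fused click rates first sum to more than `mass`.

    `fused` maps url -> fused "user click rate" for one query.
    """
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for url, rate in ranked:
        selected.append(url)
        cumulative += rate
        if cumulative > mass:
            break  # the first M pages now sum to more than `mass`
    # Assumption: if the sum never exceeds `mass`, all pages are returned.
    return selected
```

For fused rates `{"a": 0.7, "b": 0.25, "c": 0.05}` the first page alone sums to 0.7 and the first two to 0.95, so M = 2 and pages `a` and `b` are selected.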
Following steps 1 to 4, the query topics and the key resource pages for each topic are obtained automatically, realizing the automatic location of topic-dependent key resource pages.
To verify the validity, reliability, and applicability of the invention, we designed and ran related experiments.
First, the correctness of key resource page location was tested.
As data sources we used the user query click records of 4 commonly used search engines. We also selected 314 query topics and manually annotated the topic-relevant pages for these queries by pooling. The pool drew on well-known search engines including Sogou, Baidu, Google, Zhongsou, Yisou, and Sina, with the top 20 results returned by each search engine serving as candidate answers in the pool. The average accuracy of the automatically located key resource pages was 0.661, and the non-error rate was 0.885. (For a topic, accuracy is the proportion of topic-relevant pages among all automatically annotated pages; the non-error rate is the proportion of pages that remain after irrelevant pages are removed among all automatically annotated pages. The two differ because some pages do not appear in the pool and therefore cannot be judged.) Table 2 lists the key resource pages found for some of the query topics:
Table 2: Some query topics and their key resource page location results

| Query word | Automatically annotated result page URL |
| --- | --- |
| Waiting alone | http://www.mtime.com/movie/17683 |
| | http://www.colordance.com/dzdd.html |
| | http://ent.sina.com.cn/m/c/f/waitingalone/index.html |
| The Chinese Central Television (CCTV) | http://www.cctv.com |
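The two measures reported in the experiment, accuracy and non-error rate, can be sketched as follows. The function name and the representation of pool judgments (a dictionary mapping a URL to `True` for relevant or `False` for irrelevant, with unjudged pages simply absent) are assumptions for illustration.

```python
def accuracy_and_non_error_rate(labelled_pages, judgments):
    """Score the automatically annotated pages of one topic against pool judgments.

    `labelled_pages`: list of URLs the method marked as key resource pages.
    `judgments`: url -> True (relevant) or False (irrelevant); URLs that never
    appeared in the pool are missing and count as unjudged.
    Returns (accuracy, non_error_rate).
    """
    total = len(labelled_pages)
    if total == 0:
        raise ValueError("no automatically annotated pages to score")
    relevant = sum(1 for url in labelled_pages if judgments.get(url) is True)
    irrelevant = sum(1 for url in labelled_pages if judgments.get(url) is False)
    # Unjudged pages lower accuracy but do not count as errors.
    return relevant / total, (total - irrelevant) / total
```

With four annotated pages of which two are judged relevant, one irrelevant, and one unjudged, the accuracy is 0.5 and the non-error rate is 0.75.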
This key resource location method can be used to examine the retrieval performance of each search engine in different topic domains. Using query logs, we located key resource pages for the query topics in the various domains of the Baidu "roll of the hour" top list and the Yahoo "weathervane" top list, obtaining the query topics and key resource pages that users currently care about in each domain, and used the location results to examine the retrieval effectiveness of the major domestic search engines in each domain. Table 3 lists the retrieval effectiveness ranking of the major search engines in the software and sports domains (for the Baidu and Yahoo lists respectively, using the common retrieval evaluation metric MAP).
Table 3: Retrieval effectiveness ranking, software domain

| Search engine | Baidu "roll of the hour" top queries (MAP/rank) | Yahoo "weathervane" top queries (MAP/rank) |
| --- | --- | --- |
| Baidu | 0.8120/1 | 0.7667/1 |
| Google | 0.7072/3 | 0.6979/2 |
| Yahoo | 0.7234/2 | 0.6786/3 |
| Sogou | 0.6632/4 | 0.6241/4 |
| Zhongsou | 0.6538/5 | 0.6023/5 |
| Sina | 0.5171/6 | 0.4934/6 |

Table 3 (continued): Retrieval effectiveness ranking, sports domain

| Search engine | Baidu "roll of the hour" top queries (MAP/rank) | Yahoo "weathervane" top queries (MAP/rank) |
| --- | --- | --- |
| Baidu | 0.7488/1 | 0.3132/2 |
| Google | 0.7242/2 | 0.4715/1 |
| Yahoo | 0.6281/3 | 0.3078/3 |
| Sogou | 0.6144/4 | 0.2901/4 |
| Zhongsou | 0.5033/6 | 0.2724/6 |
| Sina | 0.5907/5 | 0.2763/5 |
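The MAP values in Table 3 treat the automatically located key resource pages as the answers for each query topic. The sketch below shows the standard mean average precision computation under that reading; the function names and data layout are assumptions, and the source does not describe how its MAP figures were actually computed (for example, the result list depth used).

```python
def average_precision(ranked_urls, key_pages):
    """Average precision of one ranked result list against a set of key pages.

    Assumes `ranked_urls` contains no duplicate URLs.
    """
    if not key_pages:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, url in enumerate(ranked_urls, start=1):
        if url in key_pages:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(key_pages)


def mean_average_precision(results_by_query, key_pages_by_query):
    """Mean of the average precision over all query topics with key pages."""
    scores = [
        average_precision(results_by_query.get(query, []), pages)
        for query, pages in key_pages_by_query.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

An engine that ranks the key pages of a topic at positions 1 and 3 scores (1 + 2/3) / 2 for that topic; averaging such scores over all topics of a domain gives one MAP value per engine.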
The invention automatically discovers the query topics that reflect user interest from the user behavior logs of several search engines and automatically locates the key resource pages of each topic. The method largely avoids the bias introduced by a single search engine log and achieves a degree of fairness and impartiality. It also allows the automatic location of network key resource pages to be applied in many settings that are currently very difficult for information retrieval research and application, such as the automatic evaluation of search engines with the key resource location method illustrated here.
Description of drawings
Fig. 1. Flow of the network key resource page location method;
Fig. 2. Organization of single search engine information after preprocessing;
Fig. 3. Description of the fusion algorithm;
Fig. 4. Key resource page judgment flow.
Embodiment
Fig. 1 shows the flow of the method. The method is widely adaptable and applicable to network key resource page location. It is described in detail below through the screening of query topics and the location of key resource pages on the logs of four commonly used search engines provided by the Sogou search engine website.
1. Data preprocessing
The logs used are the user query click records of four commonly used search engines collected by the Sogou search engine company over 28 days (given as November 8, 2006 to November 28, 2006), with 55,647,885 non-empty query click records in total (32,184,307, 9,105,887, 4,766,920, and 9,590,771 records for the four search engines respectively). Each record contains the following information:
Table 4: Items contained in the user logs of the 4 commonly used search engines provided by the Sogou search engine

| Name | Recorded content |
| --- | --- |
| FromUrl | URL of the search result list page on which the user clicked |
| ToUrl | URL of the result the user clicked |
| Time | Date and time at which the click occurred |
| Id | User identifier assigned automatically by the system |
The FromUrl field identifies the search engine to which the log record belongs, and the query keywords are usually contained in a variable of this address. ToUrl is the result page the user clicked. These logs therefore contain the data items of Table 1 and can be used for key resource page location.
Log preprocessing comprises: filtering out records that are not search engine log records (such as navigation between pages within a search engine's own site); classifying the log records by search engine to obtain separate user query click records for each of the four commonly used search engines; extracting the query keywords from the variables of FromUrl, decoding the URL encoding, and finally transcoding everything uniformly to GBK; filtering out information not needed for Table 1 and related noise; and computing statistics such as the number of users per query and the "user click rates".
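Extracting the search engine and the query keywords from FromUrl can be sketched with Python's standard `urllib.parse` module. The host names and parameter names in `QUERY_PARAMS` are hypothetical placeholders (the source does not list the real ones), and trying UTF-8 before GBK is an assumption about how the percent-encoded bytes might be decoded.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical mapping: result-page host -> name of its query variable.
QUERY_PARAMS = {
    "search.example-a.com": "query",
    "www.example-b.com": "wd",
}


def extract_query(from_url, encodings=("utf-8", "gbk")):
    """Return (engine_host, query) from a FromUrl value, or None if not usable.

    URLs whose host is not a known search engine are treated as non-search
    records and yield None.
    """
    parts = urlsplit(from_url)
    host = parts.hostname or ""
    param = QUERY_PARAMS.get(host)
    if param is None:
        return None
    for encoding in encodings:
        try:
            # errors="strict" makes an unsuitable encoding raise instead of
            # silently producing replacement characters.
            params = parse_qs(parts.query, encoding=encoding, errors="strict")
        except UnicodeDecodeError:
            continue
        values = params.get(param, [])
        query = values[0].strip() if values else ""
        return (host, query) if query else None
    return None
```

Since a few GBK byte sequences are also valid UTF-8, a production version would more likely fix the encoding per search engine than guess it per record.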
2. Query topic set screening
User queries to a search engine are repetitive and concentrated: a topic that users care about is submitted as a query frequently. We also exploit this concentration of query topics in the macroscopic analysis for key resource page location. The screening of the query topic set is described below; it is carried out independently on the query log of each search engine.
Query topic set screening on a single search engine log:
Each query that appears in a search engine log is screened by its user query volume. If its total query count is less than 20, the query is considered to lack enough macroscopic user click information for effective analysis and to be insufficiently representative of the topics users care about, and the query topic is rejected. Otherwise the query is kept. Our earlier analysis of the Sogou log found more than 30,000 queries with a user query count greater than 100, and the clicks for these queries accounted for about 70% of all clicks. This agrees with earlier findings that in a search engine a small number of queries are submitted repeatedly and account for most of the search engine's service. Screening query topics by the number of querying users ensures that the selected queries reflect users' query trends and focus, are timely and widely followed, and are reasonably representative. In addition, selecting queries with more users limits the large uncertainty that the click behavior of individual users would otherwise cause during key resource location.
3. Fusion of log information from multiple search engines
The fusion algorithm of Fig. 3 describes how information from multiple search engine logs is fused to obtain, for each query topic, the query distribution over its key resource page set. First, the log of each search engine is used to compute that engine's partial query distribution and query confidence for the query topic. Formula (3) is then used to compute the overall query distribution over the key resource page set of each query topic across the multiple logs, that is, the fused query distribution.
Using the fused query distribution, rather than the distribution from any single search engine log, better avoids two kinds of bias: the bias in result ranking introduced by a single search engine log, and the bias in the data set caused by a single search engine's limited resources.
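The fusion can be sketched as a confidence-weighted sum of the per-engine click distributions. This reading is an assumption based on how the claims define the terms P(SE_i | Q) and P(CRP | SE_i, Q); the formula images themselves are not reproduced in this text. The query confidences are taken as given inputs, and the engine names are placeholders.

```python
def fuse_click_distributions(per_engine, confidence):
    """Fuse per-engine click distributions for one query.

    per_engine: {engine: {url: click_rate}}   # P(CRP | SE_i, Q)
    confidence: {engine: weight}              # P(SE_i | Q), assumed to sum to 1
    Returns {url: fused_click_rate}           # P(CRP | Q)
    """
    fused = {}
    for engine, dist in per_engine.items():
        weight = confidence.get(engine, 0.0)  # engines without a weight add nothing
        for url, rate in dist.items():
            fused[url] = fused.get(url, 0.0) + weight * rate
    return fused


# Usage with two hypothetical engines:
fused = fuse_click_distributions(
    {"engineA": {"a": 0.5, "b": 0.5}, "engineB": {"a": 1.0}},
    {"engineA": 0.75, "engineB": 0.25},
)
```

If the confidences sum to 1 and each per-engine distribution sums to 1, the fused values also sum to 1, so they can be treated as a distribution in the judgment step.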
4. Judgment of topic-relevant key resource pages
The judgment of topic-relevant key resource pages follows the flow shown in Fig. 4. This localization flow picks the pages with a high user click rate from the key resource candidate set of a query. The user click rate here is the one obtained by fusing the query logs of multiple search engines. As the screening flow in the figure shows, the key resource page judgment for a topic finishes only when the fused "user click rate" is greater than 0.1 and the fused "user click rate" values of all key resource pages relevant to the topic sum to more than 0.9.
This judgment process designates as key resource pages of a query topic only the clicked result pages with a high degree of user approval, rather than every page that users clicked. It rejects pages that were clicked by mistake or because of misleading search result pages, and so largely guarantees the quality of the located network key resource pages and their relevance to the topic.
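The selection rule can be sketched as a cumulative cut-off over pages ranked by fused click rate. The sketch follows the rule as worded in the claim (take the top pages until their fused click rates sum to more than 0.9). It does not apply the per-page 0.1 condition, because its exact role is defined by Fig. 4, which is not reproduced here. Function and parameter names are assumptions.

```python
def select_key_pages(fused, coverage=0.9):
    """Return the top-ranked pages whose fused click rates sum to > coverage.

    fused: {url: fused_click_rate} for one query.
    If the total never exceeds `coverage`, every page is returned.
    """
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    selected, total = [], 0.0
    for url, rate in ranked:
        selected.append(url)
        total += rate
        if total > coverage:
            break  # the first M pages now cover more than `coverage`
    return selected
```

For a query dominated by one page, for example a navigational query, this returns a single key resource page; for a query whose clicks are spread out, it returns several.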
Following the steps above, network key resource pages can be located automatically: the macro-level behavior of search engine users is used to find the query topics that users pay attention to, and the network key resource pages for those topics are located effectively.

Claims (1)

1. An automatic positioning method for network key resource pages, characterized in that the method comprises the following steps in sequence:
Step (1). The computer screens query topics from the search engine user log of each search engine system according to the following steps:
Step (1.1). Data preprocessing, whose steps are as follows:
Step (1.1.1). The computer collects user logs through the search engine web server and converts the encoding of the server's records to the Chinese national standard GBK encoding;
Step (1.1.2). Remove from the user log of step (1.1.1) all information other than the following content items: the query (Query, hereinafter Q) submitted by the user, the result address URL clicked by the user for that query, and the user identification number ID automatically assigned by the search engine system; and organize each resulting log record into a character string containing these content items;
Step (1.1.3). Use string matching to filter noise out of the user queries obtained in step (1.1.2), keeping only the content items that directly reflect the query needs and behavior of ordinary search engine users;
Step (1.2). Select the query topic set S:
If the number of times a query Q is issued by different users in the user log is less than 20, the query is excluded from the set S; otherwise, the query topic is placed in the query topic set S;
Step (2). For each query Q, extract the user click rate according to the following steps:
Step (2.1). Compute the user click rate for each query Q as follows:
Figure A2007100985310002C1
This user click rate lies between 0 and 1; for a query Q, the user click rates of all result page URLs clicked by users sum to 1;
Step (2.2). Generate the key resource page candidate set of query Q:
If the user click rate of a page is less than 0.05, the page is rejected; otherwise, the page is added to the key resource page candidate set of query Q;
Step (2.3). Generate the user click rate distribution of query Q:
For query Q, collect the pages in its candidate set together with their user click rates to obtain the user click rate distribution corresponding to query Q;
Step (3). Fuse the user click rate distributions of query Q across the logs of multiple search engines, as follows:
Step (3.1). Compute the query confidence of a single search engine user log for query Q as follows:
The query confidence on search engine user log SE_j is:
Figure A2007100985310003C1
This query confidence of SE_j lies between 0 and 1;
Step (3.2). Fuse the user logs of multiple search engines:
After fusion, the user click rate of a clicked result page CRP for query Q is denoted P(CRP | Q):
Figure A2007100985310003C2
where P(SE_i | Q) denotes the support that SE_i provides for query Q, given by the query confidence obtained in step (3.1),
and P(CRP | SE_i, Q) denotes the click rate of the clicked result page for query Q in search engine log SE_i, given by the user click rate obtained in step (2.1);
Step (3.3). From the fused user click rate P(CRP | Q) obtained in step (3.2), obtain for query Q the fused user click distribution across the search engine user logs SE_i;
Step (4). Judge the key resource pages relevant to query Q:
For each query Q selected in step (1) and its key resource page candidate set obtained in step (2), use the fused user click distribution of the query obtained in step (3) to screen the key resource pages of query Q by the following rule:
For each query Q, the first M consecutive pages in descending order of fused user click rate are the key resource pages of query Q, where M satisfies: starting from the page with the highest fused user click rate, the fused user click rates of the first M pages sum to more than 0.9, while those of the first M-1 pages sum to less than 0.9.
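Steps (2.1) to (2.3) of the claim can be sketched as follows. The click rate formula is contained in a figure that is not reproduced in this text, so the sketch assumes the click rate of a URL is the fraction of all clicks for query Q that land on that URL, which is consistent with the stated property that the rates for a query sum to 1. The function name and input layout are illustrative.

```python
from collections import Counter


def click_rate_distribution(clicked_urls, min_rate=0.05):
    """Return {url: click_rate} for the candidate pages of one query.

    clicked_urls: list of result URLs clicked for the query, one per click.
    Pages whose click rate is below `min_rate` are rejected (step 2.2).
    """
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    if total == 0:
        return {}  # no clicks recorded for this query
    rates = {url: n / total for url, n in counts.items()}  # assumed form of step 2.1
    return {url: r for url, r in rates.items() if r >= min_rate}
```

The returned mapping is the per-engine user click rate distribution that step (3) takes as input. After low-rate pages are rejected, the remaining rates may sum to slightly less than 1.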

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100985319A CN100507918C (en) 2007-04-20 2007-04-20 Automatic positioning method of network key resource page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007100985319A CN100507918C (en) 2007-04-20 2007-04-20 Automatic positioning method of network key resource page

Publications (2)

Publication Number Publication Date
CN101105801A true CN101105801A (en) 2008-01-16
CN100507918C CN100507918C (en) 2009-07-01

Family

ID=38999699

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100985319A Active CN100507918C (en) 2007-04-20 2007-04-20 Automatic positioning method of network key resource page

Country Status (1)

Country Link
CN (1) CN100507918C (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101241512B (en) * 2008-03-10 2012-01-11 北京搜狗科技发展有限公司 Search method for redefining enquiry word and device therefor
US8849822B2 (en) 2009-05-12 2014-09-30 Alibaba Group Holding Limited Method for generating search result and system for information search
US9672290B2 (en) 2009-05-12 2017-06-06 Alibaba Group Holding Limited Method for generating search result and system for information search
CN101887437B (en) * 2009-05-12 2016-03-30 阿里巴巴集团控股有限公司 A kind of Search Results generation method and information search system
CN101887437A (en) * 2009-05-12 2010-11-17 阿里巴巴集团控股有限公司 Search result generating method and information search system
CN102043705A (en) * 2009-10-19 2011-05-04 阿里巴巴集团控股有限公司 Statistical method and apparatus for input behavior
CN103136210A (en) * 2011-11-23 2013-06-05 北京百度网讯科技有限公司 Method and device for mining query with similar requirements
CN102364475A (en) * 2011-11-24 2012-02-29 迈普通信技术股份有限公司 System and method for sequencing search results based on identity recognition
CN102609439A (en) * 2011-12-23 2012-07-25 浙江大学 Window-based probability query method for fuzzy data in high-dimensional environment
CN103544169B (en) * 2012-07-12 2017-05-10 百度在线网络技术(北京)有限公司 method and device for adjusting page
CN103544169A (en) * 2012-07-12 2014-01-29 百度在线网络技术(北京)有限公司 Method and device for adjusting page
CN104699705A (en) * 2013-12-06 2015-06-10 腾讯科技(深圳)有限公司 Method, server and system for pushing information
CN104699705B (en) * 2013-12-06 2018-09-04 腾讯科技(深圳)有限公司 Information-pushing method, server and system
CN104298785B (en) * 2014-11-12 2017-05-03 中南大学 Searching method for public searching resources
CN104298785A (en) * 2014-11-12 2015-01-21 中南大学 Searching method for public searching resources
WO2016091051A1 (en) * 2014-12-12 2016-06-16 北京奇虎科技有限公司 Method and device for identifying web page type
CN110209764A (en) * 2018-09-10 2019-09-06 腾讯科技(北京)有限公司 The generation method and device of corpus labeling collection, electronic equipment, storage medium
WO2020052405A1 (en) * 2018-09-10 2020-03-19 腾讯科技(深圳)有限公司 Corpus annotation set generation method and apparatus, electronic device, and storage medium
CN112749333A (en) * 2020-07-24 2021-05-04 腾讯科技(深圳)有限公司 Resource searching method and device, computer equipment and storage medium
CN112749333B (en) * 2020-07-24 2024-01-16 腾讯科技(深圳)有限公司 Resource searching method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN100507918C (en) 2009-07-01

Similar Documents

Publication Publication Date Title
CN100507918C (en) Automatic positioning method of network key resource page
CN100507920C (en) Search engine retrieving result reordering method based on user behavior information
CN100440224C (en) Automatization processing method of rating of merit of search engine
CN102073725B (en) Method for searching structured data and search engine system for implementing same
CN101329687B (en) Method for positioning news web page
US6640218B1 (en) Estimating the usefulness of an item in a collection of information
US8015065B2 (en) Systems and methods for assigning monetary values to search terms
US7831474B2 (en) System and method for associating an unvalued search term with a valued search term
CN102722498B (en) Search engine and implementation method thereof
US20020042784A1 (en) System and method for automatically searching and analyzing intellectual property-related materials
US8380693B1 (en) System and method for automatically identifying classified websites
CN101609450A (en) Web page classification method based on training set
CN101382954B (en) Method and system for providing web site collection name
CN101178728A (en) Web side navigation method and system
US8838643B2 (en) Context-aware parameterized action links for search results
CN101477554A (en) User interest based personalized meta search engine and search result processing method
CN102722499B (en) Search engine and implementation method thereof
US9367638B2 (en) Surfacing actions from social data
CN102737021B (en) Search engine and realization method thereof
CN102254039A (en) Searching engine-based network searching method
CN1991829A (en) Searching method of search engine system
CN101782998A (en) Intelligent judging method for illegal on-line product information and system
CN1996316A (en) Search engine searching method based on web page correlation
US20110184815A1 (en) System and method for sharing profits with one or more content providers
CN102214183A (en) Search engine query method for combining feedback contents of pages with fixed ranking

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee
CP03 Change of name, title or address

Address after: Mailbox 100084, sub-box 82, Beijing, Tsinghua University Patent Office; postcode: 100084

Co-patentee after: Beijing Sogou Technology Development Co Ltd

Patentee after: Tsinghua University

Address before: Mailbox 100084, sub-box 82, Beijing, Tsinghua University Patent Office

Patentee before: Tsinghua University