CN101246498A - News web page searching method - Google Patents


Info

Publication number
CN101246498A
Authority
CN
China
Prior art keywords
doc
web page
news web
time
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA200810088028XA
Other languages
Chinese (zh)
Other versions
CN101246498B (en)
Inventor
刘云峰
唐年鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN200810088028XA
Publication of CN101246498A
Application granted
Publication of CN101246498B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a search method for news web pages, comprising the steps of: A. creating a news web page index, and determining, according to text content, time and reprint status, a weight parameter of each news web page with respect to the relevant search string; B. when searching for news web pages, retrieving matching news web page information from the news web page index according to the input search string; C. outputting the sorted search results. The invention achieves effective ranking of news web page search results and improves the accuracy of news search, bringing the results closer to the user's actual search needs.

Description

A search method for news web pages
Technical field
The present invention relates to Internet search technology, and in particular to a search method for news web pages.
Background art
Web search engines are among the most frequently used services on the Internet today. A search engine aggregates information from a very large number of websites. Its main function is to help users search those websites; it may also classify selected high-quality websites so that users can look up related material more easily.
At present, many search techniques use inverse document frequency (IDF) for web page search. IDF was first proposed by Spärck Jones of Cambridge University and is an important technique in information retrieval. IDF measures how much a word contributes to expressing the meaning of a document. For example, if the word "vector" occurs many times in a document and occurs in only a few documents overall, then "vector" distinguishes that document well, and the subject of the document is very likely related to vectors. Conversely, for a common word such as "can", a high term frequency (TF, i.e. the number of occurrences) in a document does not show that the word distinguishes that document. The IDF weight is generally computed as IDF = lg(N/n), where N is the total number of documents and n is the number of documents in which the word occurs.
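As a minimal sketch of the classic formula IDF = lg(N/n) quoted above, the following Python function computes the weight for one word. The function name `classic_idf` and the choice of a base-10 logarithm for "lg" are assumptions made for illustration; the text does not fix the base.

```python
import math


def classic_idf(total_docs: int, docs_with_word: int) -> float:
    """Classic IDF = lg(N / n); base 10 is assumed for 'lg'."""
    if total_docs <= 0 or docs_with_word <= 0:
        raise ValueError("both counts must be positive")
    return math.log10(total_docs / docs_with_word)


# A word found in few documents gets a high weight,
# a word found in every document gets a weight of zero.
rare_word_idf = classic_idf(1000, 10)
common_word_idf = classic_idf(1000, 1000)
```

A rare word such as "vector" therefore contributes more to the score than a word that appears almost everywhere.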
However, web search requires different techniques for different search targets; for example, news search differs greatly from image search and music search. News search is the technique closest to general web page search: each indexed text is relatively long, and the indexed object does not carry the many descriptive attributes that an image, for example, has. News web page search differs from plain web page search in that it must give more weight to timeliness while still taking relevance into account. If this is handled poorly, the returned results may be relevant but not recent, or recent but not relevant and not the news currently receiving wide attention, and so fail to meet the user's search needs. Ranking news search results by relevance is therefore comparatively difficult.
The prior art offers no news web page search technique whose ranked results are both strongly relevant and strongly timely. As a result, prior-art news search results are not very accurate and cannot satisfy users' real search needs.
Summary of the invention
In view of this, the technical problem to be solved by the present invention is to provide a search method for news web pages that ranks news web pages effectively, improves the accuracy of news search, and brings the search results closer to the user's real search needs.
To achieve the above object, the main technical solution of the present invention is as follows:
A search method for news web pages, comprising:
A. building a news web page index, and determining, according to text content, time and reprint status, the weight parameter of each news web page with respect to the relevant search string;
B. when searching news web pages, retrieving matching news web page information from the news web page index according to the input search string, and sorting it by the weight parameter corresponding to the input search string;
C. outputting the sorted search results.
Preferably, in step A, for a specific news web page and a specific search string, the weight parameter of the news web page with respect to the search string is determined as follows:
A1. determining the text relevance weight $W_{text}(Q, doc)$ of the search string and the news web page, and determining the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page, where $W_{num}(doc)$ mainly comprises a time span weight and reprint status weights;
A2. setting adjustment parameters $\lambda_{text}$ and $\lambda_{num}$ for $W_{text}(Q, doc)$ and $W_{num}(doc)$ respectively;
A3. combining $\lambda_{text} \cdot W_{text}(Q, doc)$ and $\lambda_{num} \cdot W_{num}(doc)$ to obtain the weight parameter of the news web page with respect to the search string.
Preferably, step A3 performs the combination according to the formula $Weight(Q, doc) = (1 + \lambda_{text} \cdot W_{text}(Q, doc)) \cdot (1 + \lambda_{num} \cdot W_{num}(doc))$ to obtain the weight parameter $Weight(Q, doc)$ of the news web page with respect to the search string.
Preferably, in step A1, the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page is determined as follows:
A11. determining the time span weight $W_{time}$, the reprint rate weight $W_{du}$, and the reprint speed $W_{dv}$;
A12. setting adjustment parameters $\lambda_{time}$, $\lambda_{du}$ and $\lambda_{dv}$ for $W_{time}$, $W_{du}$ and $W_{dv}$ respectively;
A13. combining $\lambda_{time} \cdot W_{time}$, $\lambda_{du} \cdot W_{du}$ and $\lambda_{dv} \cdot W_{dv}$ to obtain the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page.
Preferably, in step A13, the combination is performed according to the formula $W_{num}(doc) = (1 + \lambda_{time} \cdot W_{time}) \cdot (1 + \lambda_{du} \cdot W_{du}) \cdot (1 + \lambda_{dv} \cdot W_{dv})$ to obtain the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page.
Preferably, in step A11, $W_{time}$ is determined according to the formula $W_{time} = \frac{MaxTimeSpanWeight}{\left(\frac{t + b}{b}\right)^{\alpha}}$, where $MaxTimeSpanWeight$ is the maximum time span weight, $b$ is a preset time unit, $\alpha$ is a preset constant, and $t$ is the time span from the moment the news page appeared on the network to the current time.
Preferably, in step A11, the reprint rate weight $W_{du}$ is determined by counting the number of times the news web page has been reprinted and comparing the count with a reference value; the resulting ratio is $W_{du}$.
Preferably, in step A11, the reprint speed $W_{dv}$ is determined according to the formula $W_{dv} = W_{dn} / (T_e - T_b)$, where $W_{dn}$ is the number of times the news web page was reprinted within a predetermined time, $T_b$ is the time of the first reprint, and $T_e$ is the time of the last reprint.
Preferably, step A11 further determines the authority weight $W_{GA}$ of the news web page, step A12 sets an adjustment parameter $\lambda_{GA}$ for $W_{GA}$, and step A13 combines $\lambda_{time} \cdot W_{time}$, $\lambda_{du} \cdot W_{du}$, $\lambda_{dv} \cdot W_{dv}$ and $\lambda_{GA} \cdot W_{GA}$ to obtain the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page.
Preferably, in step A13, the combination is performed according to the formula $W_{num}(doc) = (1 + \lambda_{time} \cdot W_{time}) \cdot (1 + \lambda_{du} \cdot W_{du}) \cdot (1 + \lambda_{dv} \cdot W_{dv}) \cdot (1 + \lambda_{GA} \cdot W_{GA})$ to obtain the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page.
Preferably, in step A11, $W_{GA}$ is determined according to the formula $W_{GA} = W_{page} \cdot W_{st}$, where $W_{page}$ is a page weight reflecting the rank of the website to which the news web page belongs and the page position at which the news web page is located, and $W_{st}$ is a special-topic weight reflecting whether the news web page is a special-topic page.
Preferably, in step A1, the text relevance weight $W_{text}(Q, doc)$ of the search string and the news web page is determined as follows:
a11. determining the relevance weight $W_{TI}(Q, doc)$ of the search string's hits in the title of the news web page and the relevance weight $W_{tx}(Q, doc)$ of the search string's hits in the body text of the news web page;
a12. setting adjustment parameters $\lambda_{TI}$ and $\lambda_{tx}$ for $W_{TI}(Q, doc)$ and $W_{tx}(Q, doc)$ respectively, with $\lambda_{TI} + \lambda_{tx} = 1$;
a13. determining the text relevance weight of the search string and the news web page according to the formula $W_{text}(Q, doc) = \lambda_{TI} \cdot W_{TI}(Q, doc) + \lambda_{tx} \cdot W_{tx}(Q, doc)$.
Preferably, in step a11, $W_{TI}(Q, doc)$ is determined according to the formula $W_{TI}(Q, doc) = \frac{\sum_{q \in Q} W_{IDF}(q) \cdot HitTitle(q, doc)}{\sum_{q \in Q} W_{IDF}(q)}$, where $Q$ is the search string, $q$ is a search term in the search string, $W_{IDF}(q)$ is the inverse document frequency (IDF) weight of $q$, and $HitTitle(q, doc)$ is 1 when $q$ is contained in the title of the news web page and 0 otherwise.
Preferably, in step a11, $W_{tx}(Q, doc)$ is determined according to the formula $W_{tx}(Q, doc) = \log_2(1 + WTF(Q, doc))$, and $WTF(Q, doc)$ is determined according to the formula $WTF(Q, doc) = \sum_{pos \in \{POS\}} \left( \sum_{q \in Q} W_{IDF}(q) \cdot \log_2(1 + tf(q, pos)) \right)$, where $\{POS\}$ is the set of macro positions in the body text of the news web page, $pos$ is one of those macro positions, $tf(q, pos)$ is the frequency with which search term $q$ occurs at macro position $pos$ in the body text, and $W_{IDF}(q)$ is the inverse document frequency (IDF) weight of $q$.
Preferably, $W_{IDF}(q)$ is determined according to the formula $W_{IDF}(q) = \log\left(\frac{N - df(q) + 0.5}{df(q) + 0.5} + 1.0\right)$, where $N$ is the total number of documents in the index and $df(q)$ is the number of documents that search term $q$ hits.
In the present invention, news web page search considers not only the relevance ranking techniques of traditional full-text search but also the objective characteristic that news web pages must balance timeliness with relevance. Weights are computed from factors such as time, reprint rate, reprint speed and even website weight, so that the final ranking reflects strong text relevance, clearly reflects how much attention a news web page is receiving, and is strongly timely. This improves the accuracy of news search and brings the search results closer to the user's real search needs.
Brief description of the drawings
Fig. 1 is the main flowchart of the method of the present invention;
Fig. 2a and Fig. 2b are example curves of $W_{time}$ as a function of time.
Detailed description of the embodiments
The present invention is described in further detail below with reference to specific embodiments and the drawings.
Fig. 1 is the main flowchart of the method of the present invention. Referring to Fig. 1, the flow comprises:
Step 101: build a news web page index, and determine, according to text content, time and reprint status, the weight parameter $Weight(Q, doc)$ of each news web page with respect to the relevant search string.
Step 102: when searching news web pages, retrieve matching news web page information from the news web page index according to the input search string, and sort it by the weight parameter corresponding to the input search string.
Step 103: output the sorted search results.
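The three steps of the flow can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `build_index`, `search` and `toy_weight` are hypothetical, terms are split on whitespace, a page matches if it contains any query term, and the weight is computed at query time by a caller-supplied function rather than stored in the index as step 101 describes.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Step 101 (simplified): map each term to the pages that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(query: str,
           index: Dict[str, Set[str]],
           weight: Callable[[List[str], str], float]) -> List[str]:
    """Steps 102 and 103: retrieve matching pages, then sort by weight, highest first."""
    terms = query.lower().split()
    matched: Set[str] = set()
    for term in terms:
        matched |= index.get(term, set())
    return sorted(matched, key=lambda doc_id: weight(terms, doc_id), reverse=True)


docs = {"a": "flood warning issued", "b": "flood relief flood update"}
index = build_index(docs)


def toy_weight(terms: List[str], doc_id: str) -> float:
    """Stand-in for Weight(Q, doc): counts how often the query terms occur."""
    return float(sum(docs[doc_id].split().count(t) for t in terms))


ranked = search("flood", index, toy_weight)
```

In a real system the stand-in `toy_weight` would be replaced by the combined weight described in the remainder of this section.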
Step 101 is the key technical feature of the present invention. In step 101, for a specific news web page and a specific search string, the weight parameter of the news web page with respect to the search string is determined as follows:
Step 111: determine the text relevance weight $W_{text}(Q, doc)$ of the search string and the news web page, and determine the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page, where $W_{num}(doc)$ mainly comprises a time span weight and reprint status weights. Here $Q$ is the search string, $Q = \{q_1, q_2, \ldots, q_n\}$, $q$ is a search term in the search string, and $n$ is the number of search terms after the search string is segmented.
Step 112: set adjustment parameters $\lambda_{text}$ and $\lambda_{num}$ for $W_{text}(Q, doc)$ and $W_{num}(doc)$ respectively.
Step 113: combine $\lambda_{text} \cdot W_{text}(Q, doc)$ and $\lambda_{num} \cdot W_{num}(doc)$ to obtain the weight parameter of the news web page with respect to the search string. The adjustment parameters $\lambda_{text}$ and $\lambda_{num}$ can be increased or decreased as needed, which correspondingly increases or decreases the influence of $W_{text}(Q, doc)$ and $W_{num}(doc)$.
Specifically, the weight parameter $Weight(Q, doc)$ of the news web page with respect to the search string can be computed according to the following formula (1):
$$Weight(Q, doc) = (1 + \lambda_{text} \cdot W_{text}(Q, doc)) \cdot (1 + \lambda_{num} \cdot W_{num}(doc)) \tag{1}$$
In formula (1), the terms $\lambda_{text} \cdot W_{text}(Q, doc)$ and $\lambda_{num} \cdot W_{num}(doc)$ are combined by multiplication. Addition can also be used, i.e. $Weight(Q, doc) = (1 + \lambda_{text} \cdot W_{text}(Q, doc)) + (1 + \lambda_{num} \cdot W_{num}(doc))$, but multiplication is better suited to ranking news web pages: as long as any one of the weights has a high value, multiplication pulls the total weight up and greatly improves the position of the news web page in the search results.
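The difference between the multiplicative form of formula (1) and the additive alternative can be sketched as below. The function names and the sample weight values are illustrative assumptions, not values from the text.

```python
def combined_weight(w_text: float, w_num: float,
                    lam_text: float = 1.0, lam_num: float = 1.0) -> float:
    """Formula (1): multiplicative combination of text and numerical weights."""
    return (1 + lam_text * w_text) * (1 + lam_num * w_num)


def combined_weight_additive(w_text: float, w_num: float,
                             lam_text: float = 1.0, lam_num: float = 1.0) -> float:
    """The additive alternative mentioned alongside formula (1)."""
    return (1 + lam_text * w_text) + (1 + lam_num * w_num)


# Hypothetical page with moderate text relevance but a high numerical weight.
multiplied = combined_weight(0.5, 8.0)          # (1 + 0.5) * (1 + 8.0)
added = combined_weight_additive(0.5, 8.0)      # (1 + 0.5) + (1 + 8.0)
```

With these sample values the product exceeds the sum, which illustrates how one high weight lifts the total more strongly under multiplication.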
The following describes how to determine $W_{text}(Q, doc)$ and $W_{num}(doc)$ respectively.
I. Method of determining $W_{text}(Q, doc)$.
The fields indexed for news search include a title field and a body field. When determining $W_{text}(Q, doc)$, the hits of the search string in both the title field and the body field must be considered.
Specifically, the process of determining the text relevance weight $W_{text}(Q, doc)$ of the search string and the news web page comprises:
Step a11: determine the relevance weight $W_{TI}(Q, doc)$ of the search string's hits in the title of the news web page and the relevance weight $W_{tx}(Q, doc)$ of the search string's hits in the body text of the news web page.
Step a12: set adjustment parameters $\lambda_{TI}$ and $\lambda_{tx}$ for $W_{TI}(Q, doc)$ and $W_{tx}(Q, doc)$ respectively, with $\lambda_{TI} + \lambda_{tx} = 1$.
Step a13: determine the text relevance weight $W_{text}(Q, doc)$ of the search string and the news web page according to the following formula (2).
$$W_{text}(Q, doc) = \lambda_{TI} \cdot W_{TI}(Q, doc) + \lambda_{tx} \cdot W_{tx}(Q, doc) \tag{2}$$
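Formula (2) is a weighted sum in which the two adjustment parameters add up to 1. A minimal sketch follows; the function name `text_weight`, the default split of 0.6 for the title, and the restriction of the title parameter to the interval from 0 to 1 are assumptions, since the text only requires that the two parameters sum to 1.

```python
def text_weight(w_ti: float, w_tx: float, lam_ti: float = 0.6) -> float:
    """Formula (2): W_text = lam_TI * W_TI + lam_tx * W_tx with lam_TI + lam_tx = 1."""
    if not 0.0 <= lam_ti <= 1.0:
        raise ValueError("lam_ti is assumed to lie in [0, 1]")
    lam_tx = 1.0 - lam_ti  # enforces the constraint lam_TI + lam_tx = 1
    return lam_ti * w_ti + lam_tx * w_tx


# Equal split between title and body relevance.
balanced = text_weight(1.0, 2.0, lam_ti=0.5)
```

Raising `lam_ti` makes title hits count for more at the expense of body hits, and vice versa.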
In step a11, $W_{TI}(Q, doc)$ is determined according to the following formula (3):
$$W_{TI}(Q, doc) = \frac{\sum_{q \in Q} W_{IDF}(q) \cdot HitTitle(q, doc)}{\sum_{q \in Q} W_{IDF}(q)} \tag{3}$$
In formula (3), $Q$ is the search string, $q$ is a search term in the search string, $W_{IDF}(q)$ is the inverse document frequency (IDF) weight of $q$, and $HitTitle(q, doc)$ is 1 when $q$ is contained in the title of the news web page and 0 otherwise. Formula (3) considers only how well the title of the news web page covers $Q = \{q_1, q_2, \ldots, q_n\}$ and does not consider the term frequency (TF) of each term $q$ in the title. The reason is that in a news title the important keywords generally occur only once, whereas unimportant words tend to occur more often.
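The IDF-weighted title coverage of formula (3) can be sketched as follows. The function name, the representation of the title as a set of terms, and the sample IDF values are assumptions for illustration.

```python
from typing import Dict, List, Set


def title_weight(query_terms: List[str],
                 title_terms: Set[str],
                 idf: Dict[str, float]) -> float:
    """Formula (3): share of the query's total IDF mass that is covered by the title."""
    total = sum(idf[q] for q in query_terms)
    if total == 0:
        return 0.0
    # HitTitle(q, doc) is 1 if q is in the title, otherwise 0.
    hit = sum(idf[q] for q in query_terms if q in title_terms)
    return hit / total


# Hypothetical IDF weights: the rare term dominates the score.
sample_idf = {"earthquake": 3.0, "relief": 1.5, "the": 0.5}
coverage = title_weight(["earthquake", "relief", "the"],
                        {"earthquake", "the"}, sample_idf)
```

Because each term is counted at most once, repeating a word in the title does not raise the score, which matches the reasoning given for ignoring TF in titles.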
In step a11, $W_{tx}(Q, doc)$ is determined according to the following formula (4):
$$W_{tx}(Q, doc) = \log_2(1 + WTF(Q, doc)) \tag{4}$$
In formula (4), $WTF(Q, doc)$ is the weighted frequency, in the news web page document, of the search string $Q$ treated abstractly as a single term; this value can be regarded as a generalized TF.
Specifically, $WTF(Q, doc)$ is determined according to the following formula (5):
$$WTF(Q, doc) = \sum_{pos \in \{POS\}} \left( \sum_{q \in Q} W_{IDF}(q) \cdot \log_2(1 + tf(q, pos)) \right) \tag{5}$$
In formula (5), $\{POS\}$ is the set of macro positions in the body text of the news web page, $pos$ is one of those macro positions, $tf(q, pos)$ is the frequency with which search term $q$ occurs at macro position $pos$ in the body text, and $W_{IDF}(q)$ is the IDF weight of $q$.
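Formulas (4) and (5) can be sketched together as below. The representation of each position in the set POS as a dictionary of per-term counts, and the function names, are assumptions; the text does not say how positions are delimited.

```python
import math
from typing import Dict, List


def weighted_tf(query_terms: List[str],
                tf_by_position: List[Dict[str, int]],
                idf: Dict[str, float]) -> float:
    """Formula (5): sum over positions and query terms of IDF * log2(1 + tf)."""
    total = 0.0
    for tf_at_pos in tf_by_position:
        for q in query_terms:
            total += idf[q] * math.log2(1 + tf_at_pos.get(q, 0))
    return total


def body_weight(query_terms: List[str],
                tf_by_position: List[Dict[str, int]],
                idf: Dict[str, float]) -> float:
    """Formula (4): W_tx = log2(1 + WTF)."""
    return math.log2(1 + weighted_tf(query_terms, tf_by_position, idf))


# Hypothetical page with two positions: the term occurs once, then three times.
positions = [{"flood": 1}, {"flood": 3}]
wtf_value = weighted_tf(["flood"], positions, {"flood": 2.0})
w_tx_value = body_weight(["flood"], positions, {"flood": 2.0})
```

The two nested logarithms dampen the effect of heavy repetition, so a term repeated many times in the body gains progressively less weight.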
$W_{IDF}(q)$ is determined according to the following formula (6):
$$W_{IDF}(q) = \log\left(\frac{N - df(q) + 0.5}{df(q) + 0.5} + 1.0\right) \tag{6}$$
In formula (6), $N$ is the total number of documents in the index and $df(q)$ is the number of documents that search term $q$ hits. Formula (6) can of course be transformed simply; for example, $W_{IDF}(q)$ can be determined according to the formula $W_{IDF}(q) = \log\left(\frac{N + 0.5}{df(q) + 0.5} + 1.0\right)$, or more simply according to the formula $W_{IDF}(q) = \log\left(\frac{N}{df(q)}\right)$. However, the embodiment using formula (6) better widens the ranking distance between the news web pages matching different search terms, so that news web pages found through hot-news search terms rank nearer the top, which better reflects the timeliness and relevance of the search results.
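The three IDF variants mentioned here can be compared with a short sketch. The function names are hypothetical, and the natural logarithm is an assumption because the text writes "log" without a base.

```python
import math


def idf_formula6(n_docs: int, df: int) -> float:
    """Formula (6): log((N - df + 0.5) / (df + 0.5) + 1.0); natural log assumed."""
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)


def idf_variant(n_docs: int, df: int) -> float:
    """First alternative: log((N + 0.5) / (df + 0.5) + 1.0)."""
    return math.log((n_docs + 0.5) / (df + 0.5) + 1.0)


def idf_simple(n_docs: int, df: int) -> float:
    """Simplest alternative: log(N / df); requires df > 0."""
    return math.log(n_docs / df)


# The +1.0 inside the logarithm keeps formula (6) positive even when a term
# occurs in every document, where the simple form drops to zero.
everywhere_f6 = idf_formula6(1000, 1000)
everywhere_simple = idf_simple(1000, 1000)
```

For any `df` between 0 and `n_docs` the argument of the logarithm in `idf_formula6` stays above 1, so the weight never becomes negative.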
II. Method of determining $W_{num}(doc)$.
The numerical relevance weight $W_{num}(doc)$ of the search string and the news web page is determined as follows:
Step A11: determine the time span weight $W_{time}$, the reprint rate weight $W_{du}$, and the reprint speed $W_{dv}$.
Step A12: set adjustment parameters $\lambda_{time}$, $\lambda_{du}$ and $\lambda_{dv}$ for $W_{time}$, $W_{du}$ and $W_{dv}$ respectively. The adjustment parameters can be set as needed; increasing or decreasing one of them correspondingly increases or decreases the influence of $W_{time}$, $W_{du}$ or $W_{dv}$.
Step A13: combine $\lambda_{time} \cdot W_{time}$, $\lambda_{du} \cdot W_{du}$ and $\lambda_{dv} \cdot W_{dv}$ to obtain the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page. Specifically, the combination can be performed according to the following formula (7).
$$W_{num}(doc) = (1 + \lambda_{time} \cdot W_{time}) \cdot (1 + \lambda_{du} \cdot W_{du}) \cdot (1 + \lambda_{dv} \cdot W_{dv}) \tag{7}$$
In formula (7), the factors $(1 + \lambda_{time} \cdot W_{time})$, $(1 + \lambda_{du} \cdot W_{du})$ and $(1 + \lambda_{dv} \cdot W_{dv})$ are combined by multiplication. Addition can also be used, i.e. $W_{num}(doc) = (1 + \lambda_{time} \cdot W_{time}) + (1 + \lambda_{du} \cdot W_{du}) + (1 + \lambda_{dv} \cdot W_{dv})$. However, multiplication is better suited to ranking news web pages: as long as any one of the weights has a high value, the value of $W_{num}(doc)$ is pulled up and the position of the corresponding news web page is greatly improved.
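Formula (7) can be sketched as a product of three factors. The function name and the sample adjustment parameters are assumptions chosen only to make the arithmetic easy to follow.

```python
def numeric_weight(w_time: float, w_du: float, w_dv: float,
                   lam_time: float = 1.0, lam_du: float = 1.0,
                   lam_dv: float = 1.0) -> float:
    """Formula (7): product of (1 + lambda * W) over time, reprint rate and reprint speed."""
    return ((1 + lam_time * w_time)
            * (1 + lam_du * w_du)
            * (1 + lam_dv * w_dv))


# Hypothetical values: (1 + 0.1*50) * (1 + 0.1*10) * (1 + 0.5*2) = 6 * 2 * 2
example_w_num = numeric_weight(50.0, 10.0, 2.0,
                               lam_time=0.1, lam_du=0.1, lam_dv=0.5)
```

Setting an adjustment parameter to 0 turns its factor into 1, which removes that signal from the product without affecting the others.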
In step A11 above, the value range of $W_{time}$ is [0, 100], and $W_{time}$ is determined according to formula (8):
$$W_{time} = \frac{MaxTimeSpanWeight}{\left(\frac{t + b}{b}\right)^{\alpha}} \tag{8}$$
In formula (8), $MaxTimeSpanWeight$ is the maximum time span weight, which can for example be 100 here; $b$ is the time unit corresponding to $t$. In this embodiment $t$ is in seconds, so $b = 3600$, the number of seconds in one hour. $\alpha$ is the exponent of the power function and can be preset to 0.5 here. $t$ is the time span from the moment the news page appeared on the network to the current time.
Fig. 2a and Fig. 2b are example curves of $W_{time}$ as a function of time. The horizontal axis of Fig. 2a is in minutes and its vertical axis is the value of the time span weight $W_{time}$ (Weight); the horizontal axis of Fig. 2b is in hours and its vertical axis is the value of the time span weight $W_{time}$ (Weight). The data in Fig. 2a and Fig. 2b include:
0 minutes: $W_{time} = 100$;
3 hours: $W_{time} = 50$;
15 hours: $W_{time} = 25$;
24 hours: $W_{time} = 20$.
As Fig. 2a and Fig. 2b show, formula (8) fully reflects the timeliness of news web pages: the more recent the page, the higher the weight, and once a certain period has passed the weight decays quickly, so that the latest news web pages rank much higher.
In step A11, the value range of the reprint rate weight $W_{du}$ is [0, 100]. It is determined as follows: the search engine backend counts the number of times the news web page has been reprinted and compares the count with a reference value; the resulting ratio is $W_{du}$.
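One possible reading of the reprint rate weight is sketched below. The text states only that the ratio of the reprint count to a reference value is used and that the range is [0, 100]; scaling the ratio by 100 and capping it at 100 are assumptions, as are the function and parameter names.

```python
def reprint_rate_weight(reprint_count: int,
                        reference_count: int,
                        max_weight: float = 100.0) -> float:
    """W_du as a ratio of reprints to a reference value, scaled and capped (assumed)."""
    if reprint_count < 0 or reference_count <= 0:
        raise ValueError("reprint_count must be >= 0 and reference_count > 0")
    ratio = reprint_count / reference_count
    return min(max_weight, max_weight * ratio)


# 50 reprints against a hypothetical reference value of 200.
quarter = reprint_rate_weight(50, 200)
# A page reprinted more often than the reference value is capped.
capped = reprint_rate_weight(500, 200)
```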
In step A11, the value range of the reprint speed $W_{dv}$ is [0, 100]. It is determined according to the following formula (9):
$$W_{dv} = \frac{W_{dn}}{T_e - T_b} \tag{9}$$
In formula (9), $W_{dn}$ is the number of times the news web page was reprinted within a predetermined time (for example 48 hours), $T_b$ is the time of the first reprint, and $T_e$ is the time of the last reprint. In a preferred embodiment, given how news web pages typically appear, only the reprint speed within 48 hours is generally considered: the maximum value of $T_e - T_b$ is 48 hours, the unit of $T_e - T_b$ is hours, and a span of less than one hour is counted as one hour.
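Formula (9) with the 48-hour window can be sketched as follows. Timestamps in seconds, the function name, and the reading that any span shorter than one hour is raised to exactly one hour (rather than rounding every partial hour up) are assumptions; the result is not clamped to [0, 100] here.

```python
def reprint_speed(reprint_count: int,
                  first_reprint_ts: float,
                  last_reprint_ts: float,
                  window_hours: float = 48.0) -> float:
    """Formula (9): W_dv = W_dn / (T_e - T_b), with the span measured in hours."""
    if last_reprint_ts < first_reprint_ts:
        raise ValueError("last reprint cannot precede first reprint")
    span_hours = (last_reprint_ts - first_reprint_ts) / 3600.0
    span_hours = max(1.0, span_hours)           # under one hour counts as one hour
    span_hours = min(window_hours, span_hours)  # span is limited to the window
    return reprint_count / span_hours


HOUR = 3600
fast = reprint_speed(30, 0, HOUR // 2)   # 30 reprints within half an hour
slow = reprint_speed(20, 0, 10 * HOUR)   # 20 reprints over ten hours
```

A story picked up widely within minutes of publication therefore scores far higher than one that accumulates the same number of reprints over a day.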
Step A11 may further determine the authority weight $W_{GA}$ of the news web page, and step A12 sets an adjustment parameter $\lambda_{GA}$ for $W_{GA}$; increasing or decreasing $\lambda_{GA}$ correspondingly increases or decreases the influence of $W_{GA}$. Step A13 then combines $\lambda_{time} \cdot W_{time}$, $\lambda_{du} \cdot W_{du}$, $\lambda_{dv} \cdot W_{dv}$ and $\lambda_{GA} \cdot W_{GA}$ to obtain the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page. Specifically, the combination is performed according to the following formula (10).
$$W_{num}(doc) = (1 + \lambda_{time} \cdot W_{time}) \cdot (1 + \lambda_{du} \cdot W_{du}) \cdot (1 + \lambda_{dv} \cdot W_{dv}) \cdot (1 + \lambda_{GA} \cdot W_{GA}) \tag{10}$$
Of course, the formula
$$W_{num}(doc) = (1 + \lambda_{time} \cdot W_{time}) + (1 + \lambda_{du} \cdot W_{du}) + (1 + \lambda_{dv} \cdot W_{dv}) + (1 + \lambda_{GA} \cdot W_{GA})$$
can also be used to determine $W_{num}(doc)$, but multiplication is better suited to ranking news web pages: as long as any one of the weights has a high value, the value of $W_{num}(doc)$ is pulled up and the position of the corresponding news web page is greatly improved.
The authority weight $W_{GA}$ of the news web page can be determined according to formula (11):
$$W_{GA} = W_{page} \cdot W_{st} \tag{11}$$
In formula (11), $W_{page}$ is the page weight, a weight that the search engine backend computes from the rank of the website hosting the news web page and the page position. $W_{st}$ is the special-topic weighting: if a page is a special-topic page, $W_{st}$ is a constant greater than 1, for example $W_{st} = 1.2$; if a page is not a special-topic page, $W_{st} = 1$.
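Formula (11) can be sketched as below. The function and parameter names are hypothetical, the page weight is taken as a given input because the text does not say how it is computed, and 1.2 is the example constant from the text.

```python
def authority_weight(w_page: float,
                     is_special_topic: bool,
                     special_topic_factor: float = 1.2) -> float:
    """Formula (11): W_GA = W_page * W_st, with W_st > 1 only for special-topic pages."""
    if special_topic_factor <= 1.0:
        raise ValueError("special_topic_factor is expected to be greater than 1")
    w_st = special_topic_factor if is_special_topic else 1.0
    return w_page * w_st


# Hypothetical page weight of 10 for both a special-topic and an ordinary page.
boosted = authority_weight(10.0, is_special_topic=True)
plain = authority_weight(10.0, is_special_topic=False)
```

The resulting value would then enter formula (10) as the fourth factor, scaled by its own adjustment parameter.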
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims (15)

1. A search method for news web pages, characterized by comprising:
A. building a news web page index, and determining, according to text content, time and reprint status, the weight parameter of each news web page with respect to the relevant search string;
B. when searching news web pages, retrieving matching news web page information from the news web page index according to the input search string, and sorting it by the weight parameter corresponding to the input search string;
C. outputting the sorted search results.
2. The method according to claim 1, characterized in that, in step A, for a specific news web page and a specific search string, the weight parameter of the news web page with respect to the search string is determined as follows:
A1. determining the text relevance weight $W_{text}(Q, doc)$ of the search string and the news web page, and determining the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page, where $W_{num}(doc)$ mainly comprises a time span weight and reprint status weights;
A2. setting adjustment parameters $\lambda_{text}$ and $\lambda_{num}$ for $W_{text}(Q, doc)$ and $W_{num}(doc)$ respectively;
A3. combining $\lambda_{text} \cdot W_{text}(Q, doc)$ and $\lambda_{num} \cdot W_{num}(doc)$ to obtain the weight parameter of the news web page with respect to the search string.
3. The method according to claim 2, characterized in that step A3 performs the combination according to the formula $Weight(Q, doc) = (1 + \lambda_{text} \cdot W_{text}(Q, doc)) \cdot (1 + \lambda_{num} \cdot W_{num}(doc))$ to obtain the weight parameter $Weight(Q, doc)$ of the news web page with respect to the search string.
4. The method according to claim 2, characterized in that, in step A1, the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page is determined as follows:
A11. determining the time span weight $W_{time}$, the reprint rate weight $W_{du}$, and the reprint speed $W_{dv}$;
A12. setting adjustment parameters $\lambda_{time}$, $\lambda_{du}$ and $\lambda_{dv}$ for $W_{time}$, $W_{du}$ and $W_{dv}$ respectively;
A13. combining $\lambda_{time} \cdot W_{time}$, $\lambda_{du} \cdot W_{du}$ and $\lambda_{dv} \cdot W_{dv}$ to obtain the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page.
5. The method according to claim 4, characterized in that, in step A13, the combination is performed according to the formula $W_{num}(doc) = (1 + \lambda_{time} \cdot W_{time}) \cdot (1 + \lambda_{du} \cdot W_{du}) \cdot (1 + \lambda_{dv} \cdot W_{dv})$ to obtain the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page.
6. The method according to claim 4, characterized in that, in step A11, $W_{time}$ is determined according to the formula $W_{time} = \frac{MaxTimeSpanWeight}{\left(\frac{t + b}{b}\right)^{\alpha}}$, where $MaxTimeSpanWeight$ is the maximum time span weight, $b$ is a preset time unit, $\alpha$ is a preset constant, and $t$ is the time span from the moment the news page appeared on the network to the current time.
7. The method according to claim 4, characterized in that, in step A11, the reprint rate weight $W_{du}$ is determined by counting the number of times the news web page has been reprinted and comparing the count with a reference value; the resulting ratio is $W_{du}$.
8. The method according to claim 4, characterized in that, in step A11, the reprint speed $W_{dv}$ is determined according to the formula $W_{dv} = W_{dn} / (T_e - T_b)$, where $W_{dn}$ is the number of times the news web page was reprinted within a predetermined time, $T_b$ is the time of the first reprint, and $T_e$ is the time of the last reprint.
9. The method according to claim 4, characterized in that step A11 further determines the authority weight $W_{GA}$ of the news web page, step A12 sets an adjustment parameter $\lambda_{GA}$ for $W_{GA}$, and step A13 combines $\lambda_{time} \cdot W_{time}$, $\lambda_{du} \cdot W_{du}$, $\lambda_{dv} \cdot W_{dv}$ and $\lambda_{GA} \cdot W_{GA}$ to obtain the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page.
10. The method according to claim 9, characterized in that, in step A13, the combination is performed according to the formula $W_{num}(doc) = (1 + \lambda_{time} \cdot W_{time}) \cdot (1 + \lambda_{du} \cdot W_{du}) \cdot (1 + \lambda_{dv} \cdot W_{dv}) \cdot (1 + \lambda_{GA} \cdot W_{GA})$ to obtain the numerical relevance weight $W_{num}(doc)$ of the search string and the news web page.
11. The method according to claim 9, characterized in that, in step A11, $W_{GA}$ is determined according to the formula $W_{GA} = W_{page} \cdot W_{st}$, where $W_{page}$ is a page weight reflecting the rank of the website to which the news web page belongs and the page position at which the news web page is located, and $W_{st}$ is a special-topic weight reflecting whether the news web page is a special-topic page.
12, method according to claim 2 is characterized in that, in the steps A 1, determines the text relevant weights W of described retrieval string and described news web page Text(Q, concrete grammar doc) is:
A11, determine the relevance weight W that described retrieval string hits in described news web page title TI(Q, doc) and the relevance weight W that in described news web page text, hits of described retrieval string Tx(Q, doc);
A12, be described W TI(Q, doc) and W Tx(Q doc) is provided with the adjusting parameter lambda respectively TIAnd λ Tx, and λ TI+ λ Tx=1;
A13, according to formula W Text(Q, doc)=λ TI* W TI(Q, doc)+λ Tx* W Tx(Q doc) determines the text relevant weights W of described retrieval string and described news web page Text(Q, doc).
13, method according to claim 12 is characterized in that, among the step a11, according to formula W TI ( Q , doc ) = Σ q ∈ Q W IDF ( q ) * HitTitle ( q , doc ) Σ q ∈ Q W IDF ( q ) Determine described W TI(Q, doc), wherein, described Q is the retrieval string, q is a term in the retrieval string, W IDF(q) be the inversed document frequency IDF weight of q, (q doc) gets 1, otherwise gets 0 HitTitle when q is included in the described news web page title.
14, method according to claim 12 is characterized in that, among the step a11, according to formula W Tx(Q, doc)=log 2(1+WTF (Q, doc)) determines described W Tx(Q, doc); (Q is doc) according to formula for WTF WTF ( Q , doc ) = Σ pos ∈ { POS } ( Σ q ∈ Q W IDF ( q ) * log 2 ( 1 + tf ( q , pos ) ) ) Determine that wherein { POS} is the huge location sets of news web page text, and pos is one of them huge position, and (q pos) is the frequency that occurs in the huge position of term q in described news web page text, W to tf IDF(q) be the inversed document frequency IDF weight of q.
15. The method according to claim 13 or 14, characterized in that $W_{IDF}(q)$ is determined according to the formula

$$W_{IDF}(q) = \log\left( \frac{N - df(q) + 0.5}{df(q) + 0.5} + 1.0 \right)$$

where $N$ is the total number of documents in the index and $df(q)$ is the number of documents hit by term $q$.
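As a worked illustration of the IDF formula in claim 15, the following sketch computes $W_{IDF}(q)$ from the index size and the document frequency. The claim does not state the base of the logarithm; the natural logarithm is assumed here, and the function name `w_idf` and the input validation are likewise assumptions.

```python
import math

def w_idf(n_docs: int, df: int) -> float:
    """W_IDF(q) = log((N - df(q) + 0.5) / (df(q) + 0.5) + 1.0)."""
    if df < 0 or df > n_docs:
        raise ValueError("df must satisfy 0 <= df <= n_docs")
    # Assumption: natural logarithm; the claim leaves the base unspecified.
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
```

For any df between 0 and N the fraction is positive, so adding 1.0 keeps the argument of the logarithm above 1 and the weight stays positive even for terms that occur in almost every document.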
CN200810088028XA 2008-03-27 2008-03-27 News web page searching method Active CN101246498B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810088028XA CN101246498B (en) 2008-03-27 2008-03-27 News web page searching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810088028XA CN101246498B (en) 2008-03-27 2008-03-27 News web page searching method

Publications (2)

Publication Number Publication Date
CN101246498A true CN101246498A (en) 2008-08-20
CN101246498B CN101246498B (en) 2010-07-14

Family

ID=39946949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810088028XA Active CN101246498B (en) 2008-03-27 2008-03-27 News web page searching method

Country Status (1)

Country Link
CN (1) CN101246498B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102598038B (en) * 2009-10-30 2015-02-18 乐天株式会社 Characteristic content determination program, characteristic content determination device, characteristic content determination method, recording medium, content generation device, and related content insertion device
CN102598038A (en) * 2009-10-30 2012-07-18 乐天株式会社 Characteristic content determination program, characteristic content determination device, characteristic content determination method, recording medium, content generation device, and related content insertion device
CN102314435A (en) * 2010-06-30 2012-01-11 腾讯科技(深圳)有限公司 Method for searching webpage content and system
CN102117332A (en) * 2011-03-10 2011-07-06 辜进荣 Given time-based searching method
CN102236710A (en) * 2011-06-30 2011-11-09 百度在线网络技术(北京)有限公司 Method and equipment for displaying news information in query result
CN103324637A (en) * 2012-03-23 2013-09-25 腾讯科技(深圳)有限公司 Method and system for mining hotspot message
CN103324637B (en) * 2012-03-23 2017-12-12 深圳市世纪光速信息技术有限公司 A kind of hot information method for digging and system
CN104298674B (en) * 2013-07-17 2019-05-14 腾讯科技(北京)有限公司 The method and apparatus for showing article
CN104298674A (en) * 2013-07-17 2015-01-21 腾讯科技(北京)有限公司 Method and device for displaying articles
CN103530321A (en) * 2013-09-18 2014-01-22 上海交通大学 Sequencing system based on machine learning
CN103530321B (en) * 2013-09-18 2016-09-07 上海交通大学 A kind of ordering system based on machine learning
CN103810295A (en) * 2014-03-06 2014-05-21 北京邮电大学 Method and device for extracting internet data
WO2015143911A1 (en) * 2014-03-26 2015-10-01 北京奇虎科技有限公司 Method and device for pushing webpages containing time-relevant information
CN104182442A (en) * 2014-03-28 2014-12-03 无锡天脉聚源传媒科技有限公司 News searching method and device
CN105447009A (en) * 2014-08-12 2016-03-30 阿里巴巴集团控股有限公司 Method and device for generating query strings
CN105320770A (en) * 2015-10-30 2016-02-10 江苏省电力公司电力科学研究院 Instant assistance search system based on web page keyword
CN108614825A (en) * 2016-12-12 2018-10-02 中移(杭州)信息技术有限公司 A kind of web page characteristics extracting method and device
CN108614825B (en) * 2016-12-12 2022-04-15 中移(杭州)信息技术有限公司 Webpage feature extraction method and device
CN106649810A (en) * 2016-12-29 2017-05-10 山东舜网传媒股份有限公司 Ajax-based news webpage dynamic data grabbing method and system
CN106649810B (en) * 2016-12-29 2019-05-28 山东舜网传媒股份有限公司 The grasping means and system of news web page dynamic data based on Ajax

Also Published As

Publication number Publication date
CN101246498B (en) 2010-07-14

Similar Documents

Publication Publication Date Title
CN101246498B (en) News web page searching method
CN100472522C (en) A method, system, and computer program product for searching for, navigating among, and ranking of documents in a personal web
CN103377226B (en) A kind of intelligent search method and system thereof
CN1755678B (en) System and method for incorporating anchor text into ranking of search results
KR100462292B1 (en) A method for providing search results list based on importance information and a system thereof
Soboroff et al. Overview of the TREC 2006 Enterprise Track.
US6640218B1 (en) Estimating the usefulness of an item in a collection of information
CN101055580B (en) System, method and user interface for retrieving documents
Jones et al. Query word deletion prediction
CN102722501B (en) Search engine and realization method thereof
CN100433007C (en) Method for providing research result
US20100191740A1 (en) System and method for ranking web searches with quantified semantic features
CN102722499B (en) Search engine and implementation method thereof
CN101297291A (en) Suggesting and refining user input based on original user input
WO2002027541A1 (en) A method and apparatus for concept-based searching across a network
CN1609845A (en) Method and apparatus for improving readability of automatic generated abstract by machine
CN101853272A (en) Search engine technology based on relevance feedback and clustering
CN103838735A (en) Data retrieval method for improving retrieval efficiency and quality
CN102521321A (en) Video search method based on search term ambiguity and user preferences
Park et al. Techniques for improving web retrieval effectiveness
CN104636403B (en) Handle the method and device of inquiry request
WO2010037314A1 (en) A method for searching and the device and system thereof
CN105808739A (en) Search result ranking method based on Borda algorithm
CN105740448A (en) Topic-oriented multi-microblog time sequence abstracting method
CN102750380B (en) Page sorting method in combination with difference feature distribution and link feature

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.

Effective date: 20131016

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 518044 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20131016

Address after: A Tencent Building in Shenzhen Nanshan District City, Guangdong streets in Guangdong province science and technology 518057 16

Patentee after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: Shenzhen Futian District City, Guangdong province 518044 Zhenxing Road, SEG Science Park 2 East Room 403

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.