Summary of the invention
In view of this, the technical problem to be solved by the present invention is to provide a search method for news web pages that ranks news web pages effectively, improves the accuracy of news search, and brings the search results closer to the user's real search needs.
To achieve the above object, the main technical solution of the present invention is as follows:
A search method for news web pages, comprising:
A. Building a news web page index, and determining, from text content, time, and reprint status, the weight parameter of each news web page with respect to the relevant search string;
B. When news web pages are searched, retrieving matching news web page information from the news web page index according to the input search string, and sorting it by the weight parameter corresponding to the input search string;
C. Outputting the sorted search results.
Preferably, in step A, for a specific news web page and a specific search string, the weight parameter of the news web page with respect to the search string is determined as follows:
A1. Determine the text relevance weight W_text(Q, doc) of the search string and the news web page; determine the numerical relevance weight W_num(doc) of the search string and the news web page, where W_num(doc) mainly comprises a time span weight and a reprint status weight;
A2. Set adjustment parameters λ_text and λ_num for W_text(Q, doc) and W_num(doc), respectively;
A3. Combine λ_text * W_text(Q, doc) and λ_num * W_num(doc) to obtain the weight parameter of the news web page with respect to the search string.
Preferably, step A3 performs the combination according to the formula Weight(Q, doc) = (1 + λ_text * W_text(Q, doc)) * (1 + λ_num * W_num(doc)), obtaining the weight parameter Weight(Q, doc) of the news web page with respect to the search string.
Preferably, in step A1, the numerical relevance weight W_num(doc) of the search string and the news web page is determined as follows:
A11. Determine the time span weight W_time, the reprint rate weight W_du, and the reprint speed W_dv;
A12. Set adjustment parameters λ_time, λ_du, and λ_dv for W_time, W_du, and W_dv, respectively;
A13. Combine λ_time * W_time, λ_du * W_du, and λ_dv * W_dv to obtain the numerical relevance weight W_num(doc) of the search string and the news web page.
Preferably, in step A13, the combination is performed according to the formula W_num(doc) = (1 + λ_time * W_time) * (1 + λ_du * W_du) * (1 + λ_dv * W_dv), obtaining the numerical relevance weight W_num(doc) of the search string and the news web page.
Preferably, in step A11, W_time is determined according to the formula W_time = MaxTimeSpanWeight / (1 + t/b)^α, where MaxTimeSpanWeight is the maximum time span weight, b is a preset time unit, α is a preset constant, and t is the time span from the news web page's appearance on the network to the current time.
Preferably, in step A11, the reprint rate weight W_du is determined by counting the number of times the news web page has been reprinted and comparing that count with a reference value; the resulting ratio is W_du.
Preferably, in step A11, the reprint speed W_dv is determined according to the formula W_dv = W_dn / (T_e - T_b), where W_dn is the number of times the news web page is reprinted within a predetermined time, T_b is the time of the first reprint, and T_e is the time of the last reprint.
Preferably, step A11 further determines an authority weight W_GA of the news web page, step A12 sets an adjustment parameter λ_GA for W_GA, and step A13 combines λ_time * W_time, λ_du * W_du, λ_dv * W_dv, and λ_GA * W_GA to obtain the numerical relevance weight W_num(doc) of the search string and the news web page.
Preferably, in step A13, the combination is performed according to the formula W_num(doc) = (1 + λ_time * W_time) * (1 + λ_du * W_du) * (1 + λ_dv * W_dv) * (1 + λ_GA * W_GA), obtaining the numerical relevance weight W_num(doc) of the search string and the news web page.
Preferably, in step A11, W_GA is determined according to the formula W_GA = W_page * W_st, where W_page is a page weight reflecting the rank of the website to which the news web page belongs and the page position of the news web page, and W_st is a special-topic weight reflecting whether the news web page is a special-topic page.
Preferably, in step A1, the text relevance weight W_text(Q, doc) of the search string and the news web page is determined as follows:
a11. Determine the relevance weight W_TI(Q, doc) of the search string's hits in the title of the news web page and the relevance weight W_tx(Q, doc) of the search string's hits in the body text of the news web page;
a12. Set adjustment parameters λ_TI and λ_tx for W_TI(Q, doc) and W_tx(Q, doc), respectively, with λ_TI + λ_tx = 1;
a13. Determine the text relevance weight W_text(Q, doc) of the search string and the news web page according to the formula W_text(Q, doc) = λ_TI * W_TI(Q, doc) + λ_tx * W_tx(Q, doc).
Preferably, in step a11, W_TI(Q, doc) is determined according to a formula in which Q is the search string, q is a term in the search string, W_IDF(q) is the inverse document frequency (IDF) weight of q, and HitTitle(q, doc) is 1 when q appears in the title of the news web page and 0 otherwise.
Preferably, in step a11, W_tx(Q, doc) is determined according to the formula W_tx(Q, doc) = log_2(1 + WTF(Q, doc)); WTF(Q, doc) is determined according to a formula in which {POS} is the set of macro positions in the body text of the news web page, pos is one such macro position, tf(q, pos) is the frequency with which term q occurs at macro position pos in the body text, and W_IDF(q) is the inverse document frequency (IDF) weight of q.
Preferably, W_IDF(q) is determined according to a formula in which N is the total number of documents in the index and df(q) is the number of documents that term q hits.
In the present invention, news web page search considers not only the relevance ranking techniques of traditional full-text search but also the objective fact that news web pages emphasize timeliness and relevance. Weights are computed from factors such as time, reprint rate, reprint speed, and even website weight, so that the final ranking has strong text relevance, strongly reflects how much attention a news web page has received, and is highly timely. This improves the accuracy of news search and brings the search results closer to the user's real search needs.
Embodiment
The present invention is described in further detail below through specific embodiments and the accompanying drawings.
Fig. 1 is the main flowchart of the method of the present invention. Referring to Fig. 1, the flow comprises:
Step 101. Build a news web page index, and determine, from text content, time, and reprint status, the weight parameter Weight(Q, doc) of each news web page with respect to the relevant search string.
Step 102. When news web pages are searched, retrieve matching news web page information from the news web page index according to the input search string, and sort it by the weight parameter corresponding to the input search string.
Step 103. Output the sorted search results.
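As a rough illustration of steps 101 to 103, the sketch below builds a tiny in-memory index, retrieves the pages that match a search string, and sorts them by weight. The names `NewsPage`, `build_index`, and `search`, the whitespace tokenizer, and the fixed `weight` field are all assumptions made for illustration; in the method itself the weight is computed per search string as Weight(Q, doc).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class NewsPage:
    url: str
    title: str
    body: str
    weight: float  # hypothetical stand-in for Weight(Q, doc)


def build_index(pages):
    """Step 101 (simplified): map each lowercase token to the pages containing it."""
    index = defaultdict(set)
    for i, page in enumerate(pages):
        for token in (page.title + " " + page.body).lower().split():
            index[token].add(i)
    return index


def search(query, pages, index):
    """Steps 102 and 103: retrieve matching pages, then sort by descending weight."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(term, set())
    return sorted((pages[i] for i in hits), key=lambda p: p.weight, reverse=True)


pages = [
    NewsPage("a", "Flood warning issued", "rivers rise after storm", 3.0),
    NewsPage("b", "Storm update", "flood damage reported", 7.5),
    NewsPage("c", "Election results", "votes counted", 9.0),
]
index = build_index(pages)
results = [p.url for p in search("flood", pages, index)]
```

Page `c` has the highest weight but does not match the search string, so it is never returned; matching happens first and the weight only orders the matches.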
Step 101 is the key technical feature of the present invention. In step 101, for a specific news web page and a specific search string, the weight parameter of the news web page with respect to the search string is determined as follows:
Step 111. Determine the text relevance weight W_text(Q, doc) of the search string and the news web page, and determine the numerical relevance weight W_num(doc) of the search string and the news web page, where W_num(doc) mainly comprises a time span weight and a reprint status weight. Here Q is the search string, Q = {q_1, q_2, ..., q_n}, q is a term in the search string, and n is the number of terms obtained by segmenting the search string.
Step 112. Set adjustment parameters λ_text and λ_num for W_text(Q, doc) and W_num(doc), respectively.
Step 113. Combine λ_text * W_text(Q, doc) and λ_num * W_num(doc) to obtain the weight parameter of the news web page with respect to the search string. The adjustment parameters λ_text and λ_num can be increased or decreased as needed, which correspondingly increases or decreases the influence of W_text(Q, doc) and W_num(doc).
Specifically, the weight parameter Weight(Q, doc) of the news web page with respect to the search string can be computed according to formula (1):
Weight(Q, doc) = (1 + λ_text * W_text(Q, doc)) * (1 + λ_num * W_num(doc))    (1)
In formula (1), λ_text * W_text(Q, doc) and λ_num * W_num(doc) are combined by multiplication. Addition can also be used, that is, Weight(Q, doc) = (1 + λ_text * W_text(Q, doc)) + (1 + λ_num * W_num(doc)). Multiplication, however, is better suited to ranking news web pages: if any one weight is high, multiplication pulls the total weight up and greatly promotes the news web page's position in the search results.
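The difference between the multiplicative and additive combinations can be seen in a small sketch. The function name `combined_weight`, the default adjustment parameters of 1.0, and the sample weight values are assumptions chosen only to make the arithmetic easy to follow.

```python
def combined_weight(w_text, w_num, lam_text=1.0, lam_num=1.0, multiplicative=True):
    """Combine text and numerical relevance per formula (1) or its additive variant."""
    text_factor = 1.0 + lam_text * w_text
    num_factor = 1.0 + lam_num * w_num
    if multiplicative:
        return text_factor * num_factor
    return text_factor + num_factor


# Hypothetical pages: one with balanced weights, one with a high numerical weight.
balanced = combined_weight(2.0, 2.0)                             # (1+2) * (1+2)
hot = combined_weight(1.0, 9.0)                                  # (1+1) * (1+9)
balanced_add = combined_weight(2.0, 2.0, multiplicative=False)   # (1+2) + (1+2)
hot_add = combined_weight(1.0, 9.0, multiplicative=False)        # (1+1) + (1+9)
```

With these sample values the page with the high numerical weight scores 20 against 9 under multiplication but only 12 against 6 under addition, so multiplication separates the two pages more strongly.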
The following describes how W_text(Q, doc) and W_num(doc) are determined.
1. Determining W_text(Q, doc).
The index built for news search has a title field and a body field; when determining W_text(Q, doc), the hits of the search string in both the title field and the body field must be considered.
Specifically, the text relevance weight W_text(Q, doc) of the search string and the news web page is determined as follows:
Step a11. Determine the relevance weight W_TI(Q, doc) of the search string's hits in the title of the news web page and the relevance weight W_tx(Q, doc) of the search string's hits in the body text of the news web page.
Step a12. Set adjustment parameters λ_TI and λ_tx for W_TI(Q, doc) and W_tx(Q, doc), respectively, with λ_TI + λ_tx = 1.
Step a13. Determine the text relevance weight W_text(Q, doc) of the search string and the news web page according to formula (2):
W_text(Q, doc) = λ_TI * W_TI(Q, doc) + λ_tx * W_tx(Q, doc)    (2)
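Formula (2) is a convex-style blend of the title and body weights. A minimal sketch, assuming the hypothetical function name `text_weight` and an illustrative default of λ_TI = 0.6 (the source fixes only the constraint λ_TI + λ_tx = 1, not the values):

```python
def text_weight(w_ti, w_tx, lam_ti=0.6):
    """Formula (2): W_text = lam_TI * W_TI + lam_tx * W_tx, with lam_TI + lam_tx = 1."""
    lam_tx = 1.0 - lam_ti  # enforce the constraint from step a12
    return lam_ti * w_ti + lam_tx * w_tx


# Hypothetical title and body weights for one page.
w_text = text_weight(w_ti=1.0, w_tx=0.5, lam_ti=0.6)
```

Deriving λ_tx from λ_TI inside the function keeps the two parameters from drifting apart when one of them is tuned.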
In step a11, W_TI(Q, doc) is determined according to formula (3):
In formula (3), Q is the search string, q is a term in the search string, W_IDF(q) is the inverse document frequency (IDF) weight of q, and HitTitle(q, doc) is 1 when q appears in the title of the news web page and 0 otherwise. Formula (3) considers only how well the title of the news web page covers Q = {q_1, q_2, ..., q_n}; it does not consider the term frequency (TF) of each term q in the title. The reason is that important key words generally occur only once in a news title, whereas unimportant words tend to occur more often.
In step a11, W_tx(Q, doc) is determined according to formula (4):
W_tx(Q, doc) = log_2(1 + WTF(Q, doc))    (4)
In formula (4), WTF(Q, doc) is the weighted frequency in the news web page document of the search string Q, with Q treated as a whole as a single abstract term; this value can be regarded as a generalized TF.
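The logarithm in formula (4) damps the effect of large weighted frequencies. The sketch below takes WTF(Q, doc) as a given input; the function name `body_weight` and the non-negativity check are assumptions added for illustration.

```python
import math


def body_weight(wtf):
    """Formula (4): W_tx = log2(1 + WTF)."""
    if wtf < 0:
        # Assumption: a weighted frequency is never negative.
        raise ValueError("WTF must be non-negative")
    return math.log2(1.0 + wtf)


# Each doubling of (1 + WTF) adds one unit of body weight.
values = [body_weight(x) for x in (0, 1, 3, 7)]
```

A page whose weighted frequency rises from 3 to 7 gains the same amount of body weight as one whose weighted frequency rises from 0 to 1, so repeating a term many times yields diminishing returns.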
Specifically, WTF(Q, doc) is determined according to formula (5):
In formula (5), {POS} is the set of macro positions in the body text of the news web page, pos is one such macro position, tf(q, pos) is the frequency with which term q occurs at macro position pos in the body text, and W_IDF(q) is the IDF weight of q.
W_IDF(q) is determined according to formula (6):
In formula (6), N is the total number of documents in the index and df(q) is the number of documents that term q hits. Formula (6) can of course be transformed in simple ways, and W_IDF(q) can instead be determined by such a transformed or simpler formula. The form of formula (6), however, better separates the ranking of the news web pages corresponding to different terms, so that in the results returned for a hot-news term the top-ranked results better reflect timeliness and relevance.
2. Determining W_num(doc).
The numerical relevance weight W_num(doc) of the search string and the news web page is determined as follows:
Step A11. Determine the time span weight W_time, the reprint rate weight W_du, and the reprint speed W_dv.
Step A12. Set adjustment parameters λ_time, λ_du, and λ_dv for W_time, W_du, and W_dv, respectively. These adjustment parameters can be set as needed; increasing or decreasing one of them correspondingly increases or decreases the influence of its W_time, W_du, or W_dv.
Step A13. Combine λ_time * W_time, λ_du * W_du, and λ_dv * W_dv to obtain the numerical relevance weight W_num(doc) of the search string and the news web page. Specifically, the combination can be performed according to formula (7):
W_num(doc) = (1 + λ_time * W_time) * (1 + λ_du * W_du) * (1 + λ_dv * W_dv)    (7)
In formula (7), the factors (1 + λ_time * W_time), (1 + λ_du * W_du), and (1 + λ_dv * W_dv) are multiplied together. Addition can also be used, that is, W_num(doc) = (1 + λ_time * W_time) + (1 + λ_du * W_du) + (1 + λ_dv * W_dv). Multiplication, however, is better suited to ranking news web pages: if any one weight is high, the value of W_num(doc) is pulled up and the ranking position of the corresponding news web page is greatly promoted.
In step A11, W_time takes values in [0, 100] and is determined according to formula (8):
W_time = MaxTimeSpanWeight / (1 + t/b)^α    (8)
In formula (8), MaxTimeSpanWeight is the maximum time span weight, which can be set to 100, for example; b is a time unit corresponding to t: in this embodiment t is in seconds, so b = 3600, the number of seconds in one hour; α is the constant exponent of the power function and can be preset to 0.5; and t is the time span from the news web page's appearance on the network to the current time.
Fig. 2a and Fig. 2b are example plots of W_time against time. In Fig. 2a the horizontal axis is in minutes and in Fig. 2b it is in hours; in both, the vertical axis is the value of the time span weight W_time (Weight). The figures give the following data points:
0 minutes: W_time = 100;
3 hours: W_time = 50;
15 hours: W_time = 25;
24 hours: W_time = 20.
As Fig. 2a and Fig. 2b show, the method of formula (8) fully reflects the timeliness of news web pages: the more recent the page, the higher the weight, and once a certain period has passed the weight decays rapidly, so that pages carrying the latest news are ranked much further forward.
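The decay can be checked numerically. The sketch below assumes the power-law form W_time = MaxTimeSpanWeight / (1 + t/b)^α, which is an inference rather than a quotation: it was chosen because, with MaxTimeSpanWeight = 100, b = 3600 seconds, and α = 0.5, it reproduces all four data points listed above. The function name `time_span_weight` is hypothetical.

```python
def time_span_weight(t_seconds, max_weight=100.0, b=3600.0, alpha=0.5):
    """W_time = max_weight / (1 + t/b)**alpha (form inferred from the listed data points)."""
    if t_seconds < 0:
        raise ValueError("t_seconds must be non-negative")
    return max_weight / (1.0 + t_seconds / b) ** alpha


# Evaluate at the ages quoted for Fig. 2a and Fig. 2b (0, 3, 15, and 24 hours).
samples = {hours: time_span_weight(hours * 3600) for hours in (0, 3, 15, 24)}
```

Under this form the weight halves in the first three hours but takes a further twelve hours to halve again, which matches the described pattern of a steep early drop followed by a long tail.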
In step A11, the reprint rate weight W_du takes values in [0, 100]. It is determined as follows: the search engine back end counts the number of times the news web page has been reprinted and compares that count with a reference value; the resulting ratio is W_du.
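One possible reading of this ratio is sketched below. The text states only that the ratio of the reprint count to a reference value is W_du and that W_du lies in [0, 100]; scaling the ratio by 100 and capping it at 100 are assumptions, as are the function name `reprint_rate_weight` and the sample reference value of 200.

```python
def reprint_rate_weight(reprint_count, reference_count, max_weight=100.0):
    """Ratio of reprint count to a reference value, scaled and capped to [0, max_weight].

    The scaling by max_weight and the cap are assumptions, not taken from the text.
    """
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    ratio = max(reprint_count, 0) / reference_count
    return min(max_weight, max_weight * ratio)


# Hypothetical reference value of 200 reprints.
quarter = reprint_rate_weight(50, 200)
capped = reprint_rate_weight(500, 200)
```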
In step A11, the reprint speed W_dv takes values in [0, 100] and is determined according to formula (9):
W_dv = W_dn / (T_e - T_b)    (9)
In formula (9), W_dn is the number of times the news web page is reprinted within a predetermined time (for example, 48 hours), T_b is the time of the first reprint, and T_e is the time of the last reprint. In a preferred embodiment, given how news web pages typically emerge, only the reprint speed within 48 hours is considered: the maximum value of T_e - T_b is 48 hours, T_e - T_b is measured in hours, and a span of less than one hour is counted as one hour.
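A minimal sketch of formula (9) with the one-hour floor and 48-hour ceiling of the preferred embodiment follows. The function name `reprint_speed` and the convention of passing times as hours are assumptions; the sketch does not enforce the [0, 100] range, since the text does not say how out-of-range values are handled.

```python
def reprint_speed(reprints_in_window, t_begin_hours, t_end_hours, window_hours=48.0):
    """Formula (9): W_dv = W_dn / (T_e - T_b), with the span measured in hours.

    Spans shorter than one hour count as one hour, and spans are capped at the
    window length, following the preferred embodiment.
    """
    span = t_end_hours - t_begin_hours
    if span < 0:
        raise ValueError("t_end_hours must not precede t_begin_hours")
    span = min(max(span, 1.0), window_hours)
    return reprints_in_window / span


burst = reprint_speed(30, 0.0, 0.5)     # half an hour is counted as one hour
steady = reprint_speed(24, 2.0, 14.0)   # 24 reprints over 12 hours
```

The one-hour floor also avoids division by zero when a page has been reprinted only once, so that T_e equals T_b.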
Step A11 can additionally determine an authority weight W_GA of the news web page, and step A12 can set an adjustment parameter λ_GA for W_GA; increasing or decreasing λ_GA correspondingly increases or decreases the influence of W_GA. Step A13 then combines λ_time * W_time, λ_du * W_du, λ_dv * W_dv, and λ_GA * W_GA to obtain the numerical relevance weight W_num(doc) of the search string and the news web page. Specifically, the combination is performed according to formula (10):
W_num(doc) = (1 + λ_time * W_time) * (1 + λ_du * W_du) * (1 + λ_dv * W_dv) * (1 + λ_GA * W_GA)    (10)
The following formula can of course also be used to determine W_num(doc):
W_num(doc) = (1 + λ_time * W_time) + (1 + λ_du * W_du) + (1 + λ_dv * W_dv) + (1 + λ_GA * W_GA)
Multiplication, however, is better suited to ranking news web pages: if any one weight is high, the value of W_num(doc) is pulled up and the ranking position of the corresponding news web page is greatly promoted.
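The effect described here can be illustrated with a sketch that handles any number of factors. The function name `numeric_weight`, the dictionary layout, the adjustment parameters of 0.01, and the sample weights are all assumptions chosen for illustration.

```python
import math


def numeric_weight(weights, lambdas, multiplicative=True):
    """Combine the factors (1 + lambda_k * W_k) by product (formula (10)) or by sum."""
    factors = [1.0 + lambdas[name] * w for name, w in weights.items()]
    return math.prod(factors) if multiplicative else sum(factors)


# Hypothetical values: a brand-new page versus an old page, all else equal.
lambdas = {"time": 0.01, "du": 0.01, "dv": 0.01, "GA": 0.01}
fresh = {"time": 100.0, "du": 0.0, "dv": 0.0, "GA": 0.0}
stale = {"time": 0.0, "du": 0.0, "dv": 0.0, "GA": 0.0}

ratio_mul = numeric_weight(fresh, lambdas) / numeric_weight(stale, lambdas)
ratio_add = (numeric_weight(fresh, lambdas, multiplicative=False)
             / numeric_weight(stale, lambdas, multiplicative=False))
```

With these sample values the fresh page scores twice the stale page under multiplication but only 1.25 times under addition, because in the sum the single high factor is diluted by the constant 1 contributed by every other factor.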
The authority weight W_GA of the news web page can be determined according to formula (11):
W_GA = W_page * W_st    (11)
In formula (11), W_page is the page weight, a value that the search engine back end computes by combining the rank of the website hosting the news web page with the page position. W_st is the special-topic weight: if a page is a special-topic page, W_st equals a constant greater than 1, for example W_st = 1.2; if it is not, W_st = 1.
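Formula (11) amounts to a conditional boost. In the sketch below, the function name `authority_weight` and the sample page weight of 10 are assumptions; the factor 1.2 is the example value given above.

```python
def authority_weight(page_weight, is_special_topic, special_topic_factor=1.2):
    """Formula (11): W_GA = W_page * W_st, with W_st = 1 for ordinary pages."""
    if special_topic_factor <= 1.0:
        raise ValueError("special_topic_factor must be greater than 1")
    w_st = special_topic_factor if is_special_topic else 1.0
    return page_weight * w_st


# Hypothetical page weight of 10 for the same page, with and without the boost.
ordinary = authority_weight(10.0, is_special_topic=False)
special = authority_weight(10.0, is_special_topic=True)
```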
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any variation or replacement that a person familiar with the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.