CN104408089A

CN104408089A - Data integration method supporting diversification of information retrieving results

Info

Publication number: CN104408089A
Application number: CN201410642955.7A
Authority: CN
Inventors: 李洁玉; 黄春兰; 吴胜利
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2014-11-13
Filing date: 2014-11-13
Publication date: 2015-03-11

Abstract

The invention discloses a data fusion method supporting the diversification of information retrieval results. The method is based on a complementarity weight allocation strategy derived from sub-topic coverage. The complementarity weight is calculated as follows: given t information retrieval systems, each system retrieves a result r1, r2, ..., rt from the same database for a given query q; a super result r is constructed from two results ri and rj; ri, rj and r are evaluated with a performance metric, and the resulting values are recorded as p(ri), p(rj) and p(r); the degree to which ri complements rj is calculated from these values; and the complementarity weight ci of each result ri (1 ≤ i ≤ t) is then calculated. The complementarity weight can be used directly as the weight in the linear combination or as one component of that weight. The method takes novelty into account in addition to diversity, quantifies how much a result complements the whole, and can be used to fuse various types of data such as text and images.

Description

A data fusion method supporting the diversification of information retrieval results

Technical field

The invention belongs to the field of information retrieval and specifically relates to the weight allocation strategy of the linear combination method in data fusion.

Background

In information retrieval, relevance has always been an important criterion for judging the quality of retrieval results. A good ranking should not return a large number of irrelevant results. Traditional information retrieval systems usually rank documents by their degree of relevance to the given query, which is quite reasonable when there are few relevant documents. When there are many relevant documents, however, the retrieval result may contain many redundant ones. Many current information retrieval systems, web search in particular, therefore consider not only relevance but also the diversity or novelty among documents when computing how well a document matches a query.

The present invention approaches the retrieval diversification problem from the perspective of data fusion. Earlier studies [1,2] showed that data fusion can improve retrieval performance, but they focused mainly on relevance; some data fusion methods therefore need to be adjusted in order to diversify retrieval results.

The linear combination method is a typical data fusion method. It is particularly flexible, and the key to good fusion performance is weight assignment: different weighting methods lead to different fusion results. Existing weighting strategies consider two factors. One is the performance (or effectiveness) of the component information retrieval systems: a system with better retrieval performance should be given a larger weight, and a poorer one a smaller weight. The other is the dissimilarity between the component systems: if the result of one system differs more from the results of the other systems, it should receive a larger weight, and otherwise a smaller one. Reference [3] considers only performance-based weighting and investigates the fusion performance obtained with different functions of performance as weights. Given the evaluation value p of an information retrieval system under some metric (such as MAP), candidate weighting schemes include p^0.5, p, p² and p³. References [4,5] describe methods that consider only similarity, measuring the similarity between two results by the overlap rate of identical documents in the results of the two systems. Reference [6] combines the two factors.

However, both of these factors are considered from the perspective of relevance. Reference [7] combines relevance and diversity, again considering an effectiveness weight and a dissimilarity weight for each information retrieval system. The effectiveness weight is defined using a diversity-oriented evaluation metric (such as ERR-IA@20). For the similarity (or dissimilarity) weight, two different calculation methods are proposed. The first computes set coverage. Consider the top n documents in each of the t component results, and suppose that a document d_ij in result r_i appears c_ij times in the other t-1 results. The dissimilarity of result r_i from the other results is defined as:

\[ \mathit{dis}_i = \frac{1}{n} \sum_{j=1}^{n} \frac{t - 1 - c_{ij}}{t - 1} \tag{1} \]
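As an illustration of formula (1), the following Python sketch computes the set-coverage dissimilarity of each result. The function name `coverage_dissimilarity` and the representation of a result as a list of document identifiers are assumptions made for this example; it also assumes every result contains at least n documents.

```python
from typing import List, Sequence


def coverage_dissimilarity(results: Sequence[Sequence[str]], n: int) -> List[float]:
    """Formula (1): dis_i for each of the t results, using the top-n documents."""
    t = len(results)
    if t < 2:
        raise ValueError("at least two results are required")
    tops = [list(r[:n]) for r in results]
    top_sets = [set(top) for top in tops]
    dis = []
    for i, top in enumerate(tops):
        total = 0.0
        for doc in top:
            # c_ij: how many of the other t-1 results also contain this document
            c_ij = sum(1 for k, s in enumerate(top_sets) if k != i and doc in s)
            total += (t - 1 - c_ij) / (t - 1)
        dis.append(total / n)
    return dis
```

A result whose top-n documents appear in no other result receives the maximum value 1, and a result whose documents all appear in every other result receives 0.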

The second method determines the dissimilarity weight by comparing the rank positions of documents in the retrieval results. Suppose that a pair of retrieval results r_A and r_B each contain n documents, that m documents appear in both r_A and r_B, and that a further n-m documents in each result appear only in that result. First, the difference between the two results is calculated, where p_A(d) and p_B(d) denote the positions of document d in r_A and r_B respectively:

\[ v(r_A, r_B) = \frac{1}{n} \left\{ \sum_{\substack{i = 1, 2, \ldots, m \\ d_i \in r_A \wedge d_i \in r_B}} \frac{|p_A(d_i) - p_B(d_i)|}{m} + \sum_{\substack{i = 1, 2, \ldots, n-m \\ d_i \in r_A \wedge d_i \notin r_B}} \frac{|p_A(d_i) - (n + i)|}{n - m} + \sum_{\substack{i = 1, 2, \ldots, n-m \\ d_i \notin r_A \wedge d_i \in r_B}} \frac{|p_B(d_i) - (n + i)|}{n - m} \right\} \tag{2} \]

The dissimilarity weight of result r_i (1 ≤ i ≤ t) is then calculated as:

\[ \mathit{dis}_i = \frac{1}{t - 1} \sum_{\substack{j = 1, 2, \ldots, t \\ j \neq i}} v(r_i, r_j) \tag{3} \]
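The rank-based dissimilarity of formulas (2) and (3) can be sketched in Python as follows. This is one reading of the formulas, not a definitive implementation: the function names are hypothetical, the index i of a document that appears in only one result is assumed to follow its rank order within that result, and a sum is assumed to be skipped when its denominator (m or n-m) is zero.

```python
from typing import List, Sequence


def rank_difference(r_a: Sequence[str], r_b: Sequence[str]) -> float:
    """Formula (2): rank-position difference v(r_A, r_B) between two results of length n."""
    n = len(r_a)
    if n == 0 or len(r_b) != n:
        raise ValueError("both results must contain the same non-zero number of documents")
    pos_a = {d: k for k, d in enumerate(r_a, start=1)}
    pos_b = {d: k for k, d in enumerate(r_b, start=1)}
    common = [d for d in r_a if d in pos_b]
    only_a = [d for d in r_a if d not in pos_b]  # kept in rank order (assumption)
    only_b = [d for d in r_b if d not in pos_a]
    m = len(common)
    total = 0.0
    if m > 0:
        total += sum(abs(pos_a[d] - pos_b[d]) for d in common) / m
    if n - m > 0:
        # A document missing from the other result is compared with position n + i.
        total += sum(abs(pos_a[d] - (n + i)) for i, d in enumerate(only_a, start=1)) / (n - m)
        total += sum(abs(pos_b[d] - (n + i)) for i, d in enumerate(only_b, start=1)) / (n - m)
    return total / n


def rank_dissimilarity_weights(results: Sequence[Sequence[str]]) -> List[float]:
    """Formula (3): mean of v(r_i, r_j) over the other t-1 results."""
    t = len(results)
    if t < 2:
        raise ValueError("at least two results are required")
    return [
        sum(rank_difference(results[i], results[j]) for j in range(t) if j != i) / (t - 1)
        for i in range(t)
    ]
```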

When the retrieval task requires diversity to be considered, reference [6] does not explicitly propose any weighting factor related to diversity or novelty. Under a diversification objective, a relevant document may be relevant to several sub-topics of the query, that is, it may cover several sub-topics. Different retrieval results may have different sub-topic coverage distributions: some results may cover more or less the same sub-topics, while others may cover entirely different ones. For the whole obtained by combining the retrieval results, a result that adds to the sub-topics covered by the whole should be relatively important, and a result that adds little is relatively unimportant. The present invention therefore proposes a feature based on sub-topic coverage to describe the degree to which a retrieval result complements the fused whole.

[1] S. Wu and S. McClean. Performance prediction of data fusion for information retrieval. Information Processing & Management, 42(4):899–915, July 2006.

[2] G. V. Cormack, C. L. A. Clarke, and S. Buttcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd Annual International ACM SIGIR Conference, pages 758–759, Boston, MA, USA, July 2009.

[3] S. Wu, Y. Bi, X. Zeng, et al. Assigning appropriate weights for the linear combination data fusion method in information retrieval. Information Processing & Management, 45(4):413–426, 2009.

[4] S. Wu and S. McClean. Improving high accuracy retrieval by eliminating the uneven correlation effect in data fusion. Journal of the American Society for Information Science and Technology, 57(14):1962–1973, 2006.

[5] S. Wu and S. McClean. Data fusion with correlation weights. In Advances in Information Retrieval, pages 275–286. Springer Berlin Heidelberg, 2005.

[6] S. Wu. Data Fusion in Information Retrieval. Heidelberg: Springer, 2012, pages 97–101.

[7] S. Wu and C. Huang. Search result diversification via data fusion. In Proceedings of the 37th Annual International ACM SIGIR Conference, pages 827–830, Gold Coast, QLD, Australia, 2014.

Summary of the invention

The object of the present invention is to provide a data fusion method supporting the diversification of information retrieval results, so that the fused result not only performs well in terms of relevance but also improves in diversity and covers more, and more novel, sub-topics.

To solve the above technical problem, the present invention takes the perspective of diversification and proposes a new complementarity weight based on sub-topic coverage for linearly combining multiple retrieval results. The specific technical scheme is as follows:

A data fusion method supporting the diversification of information retrieval results, characterized in that the weight of each information retrieval system is first obtained on a set of training data, and the results of all information retrieval systems are then fused by the linear combination method. The specific steps are as follows:

Step 1: suppose there are t information retrieval systems in total. For the same query q, each information retrieval system searches the same database and returns an ordered sequence of documents, namely the retrieval result r_i (1 ≤ i ≤ t);

Step 2: select a retrieval result r_i and another retrieval result r_j, and construct a super result r from r_i and r_j. Let S_i(k) denote the set of sub-topics covered by the document at position k in r_i, and S_j(k) the set of sub-topics covered by the document at the same position k in r_j; the set of sub-topics covered by the document at position k in the super result is then S_i(k) ∪ S_j(k). Applying this construction to every k (k = 1, 2, 3, ..., n, where n is the length of the retrieval results) yields the super result r of r_i and r_j, where 1 ≤ i ≤ t, 1 ≤ j ≤ t and i ≠ j;

Step 3: evaluate r_i, r_j and r with the performance metric ERR-IA@20, and denote the resulting performance values by p(r_i), p(r_j) and p(r) respectively. From these values, calculate the degree c_i(j) to which r_i complements r_j, using the following formula:

\[ c_i(j) = \frac{p(r) - p(r_j)}{p(r)} \tag{4} \]

Step 4: repeat Steps 2 and 3 to calculate the degree c_i to which retrieval result r_i complements the other t-1 results (r_1, r_2, ..., r_t, excluding r_i), and take c_i as the complementarity weight of r_i. c_i is calculated as follows:

\[ c_i = \frac{1}{t - 1} \sum_{\substack{j = 1 \\ j \neq i}}^{t} c_i(j) \tag{5} \]

Applying the above formulas to each r_i (1 ≤ i ≤ t) gives the complementarity weights c_i of a group of retrieval results for query q;

Step 5: the above four steps can be repeated for a number of different queries, giving several groups of complementarity weights for the information retrieval systems. Each information retrieval system has one complementarity weight per query, and hence several complementarity weights over several queries. The complementarity weight of each information retrieval system is then taken to be the mean of its complementarity weights over these queries;

Step 6: take the complementarity weight of each information retrieval system as its final weight w_i, and linearly combine the scores of document d in the t retrieval results to obtain the overall score g(d) of the document, using the following formula:

\[ g(d) = \sum_{i=1}^{t} w_i \cdot s_i(d) \]

where s_i(d) is the score of document d in retrieval result r_i.
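Steps 5 and 6 can be sketched in Python as follows. The function names and data layout are assumptions made for this example: per-query weights are given as one list per training query, each result is a mapping from document identifier to normalized score, and a document missing from a result is assumed to contribute a score of 0 for that system.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def average_weights(per_query: Sequence[Sequence[float]]) -> List[float]:
    """Step 5: mean weight of each system over the training queries."""
    if not per_query:
        raise ValueError("at least one training query is required")
    t = len(per_query[0])
    return [sum(w[i] for w in per_query) / len(per_query) for i in range(t)]


def linear_combination(
    weights: Sequence[float], scores: Sequence[Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Step 6: g(d) = sum_i w_i * s_i(d), returned as a ranking by descending g(d)."""
    if len(weights) != len(scores):
        raise ValueError("one weight per retrieval result is required")
    fused: Dict[str, float] = defaultdict(float)
    for w, result in zip(weights, scores):
        for doc, s in result.items():
            fused[doc] += w * s
    # Ties are broken by document identifier so the output is deterministic.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```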

The weight takes the perspective of diversification and accounts for the novelty and complementarity of each information retrieval system's result.

The present invention has the following beneficial effect: weight assignment takes into account a weighting factor related to diversity or novelty, which improves the effectiveness of the fused result and its diversity.

Embodiment

The technical scheme of the present invention is described in further detail below with reference to specific embodiments.

Embodiment 1

Suppose that, for a given query, two information retrieval systems return the retrieval results r_1 and r_2, where r_1 = <d_1, d_2, d_3> and r_2 = <d_4, d_5, d_6> are ordered lists of documents. Document d_1 is relevant to sub-topics 1 and 2, d_3 to sub-topic 3, d_4 to sub-topics 1 and 3, and d_5 to sub-topic 4; d_2 and d_6 are irrelevant. A super result r = <d_7, d_8, d_9> is constructed from r_1 and r_2. By the construction rule, the sub-topics covered by the document at each position in r are the union of the sub-topic sets covered by the documents at the same position in r_1 and r_2. For d_7, the covered sub-topics are therefore {1,2} ∪ {1,3} = {1,2,3}. Table 1 shows the construction process and the resulting r. S_1(k) denotes the set of sub-topics covered by the document at position k in r_1 (a document covers exactly the sub-topics to which it is relevant); S_2(k) and S_r(k) denote the sub-topic coverage of r_2 and r respectively.

Table 1. Example of the super result construction process

This gives the super result r, in which d_7 is relevant to sub-topics 1, 2 and 3, d_8 to sub-topic 4, and d_9 to sub-topic 3.
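The construction in Embodiment 1 amounts to a position-wise set union. A minimal Python sketch, assuming each result is represented as a list of sub-topic sets (one per rank position) and using the hypothetical function name `super_result`:

```python
from typing import List, Sequence, Set


def super_result(s_i: Sequence[Set[int]], s_j: Sequence[Set[int]]) -> List[Set[int]]:
    """Position-wise union S_i(k) ∪ S_j(k) of two equally long results."""
    if len(s_i) != len(s_j):
        raise ValueError("both results must have the same length")
    return [a | b for a, b in zip(s_i, s_j)]


# Sub-topic coverage of r_1 = <d_1, d_2, d_3> and r_2 = <d_4, d_5, d_6> from Embodiment 1;
# an empty set marks an irrelevant document.
s1 = [{1, 2}, set(), {3}]
s2 = [{1, 3}, {4}, set()]
s_r = super_result(s1, s2)  # coverage of d_7, d_8, d_9
```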

Embodiment 2

Suppose that, for a given query, three information retrieval systems return the retrieval results r_1, r_2 and r_3, and that the super results r_12 (from r_1 and r_2), r_13 (from r_1 and r_3) and r_23 (from r_2 and r_3) have been constructed. Assume that evaluating these results with ERR-IA@20 gives the performance values p_1 = 0.3, p_2 = 0.2, p_3 = 0.4, p_12 = 0.5, p_13 = 0.7 and p_23 = 0.6. The complementarity weights of the three results are calculated as follows:

First, the pairwise complementarity degrees of the retrieval results are calculated according to formula 4:

c_1(2) = (0.5 - 0.2)/0.5 = 0.6;

c_2(1) = (0.5 - 0.3)/0.5 = 0.4;

c_1(3) = (0.7 - 0.4)/0.7 = 3/7;

c_3(1) = (0.7 - 0.3)/0.7 = 4/7;

c_2(3) = (0.6 - 0.4)/0.6 = 1/3;

c_3(2) = (0.6 - 0.2)/0.6 = 2/3;

Formula 5 is then used to calculate the complementarity weight of each information retrieval system:

c_1 = (c_1(2) + c_1(3))/2 = (0.6 + 3/7)/2 = 0.514286;

c_2 = (c_2(1) + c_2(3))/2 = (0.4 + 1/3)/2 = 0.366667;

c_3 = (c_3(1) + c_3(2))/2 = (4/7 + 2/3)/2 = 0.619048.
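The calculation in Embodiment 2 can be reproduced by evaluating formulas 4 and 5 directly. The Python sketch below uses exact fractions to avoid rounding; the function name `complementarity_weights` and the dictionary layout (system index to performance value, unordered pair of indices to super result performance) are assumptions made for this example.

```python
from fractions import Fraction
from typing import Dict, FrozenSet


def complementarity_weights(
    p: Dict[int, Fraction], p_super: Dict[FrozenSet[int], Fraction]
) -> Dict[int, Fraction]:
    """Formulas 4 and 5: complementarity weight c_i of every result r_i."""
    systems = sorted(p)
    t = len(systems)
    if t < 2:
        raise ValueError("at least two results are required")
    weights = {}
    for i in systems:
        total = Fraction(0)
        for j in systems:
            if j == i:
                continue
            p_r = p_super[frozenset((i, j))]  # performance of the super result of r_i and r_j
            total += (p_r - p[j]) / p_r       # formula 4: c_i(j)
        weights[i] = total / (t - 1)          # formula 5: mean over the other t-1 results
    return weights


# Performance values from Embodiment 2.
p = {1: Fraction(3, 10), 2: Fraction(2, 10), 3: Fraction(4, 10)}
p_super = {
    frozenset((1, 2)): Fraction(5, 10),
    frozenset((1, 3)): Fraction(7, 10),
    frozenset((2, 3)): Fraction(6, 10),
}
c = complementarity_weights(p, p_super)
```

Calling `float(c[i])` converts each exact weight to a decimal for comparison with hand calculations.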

Experimental setup and results

The experiments use runs submitted to the diversity task of the TREC (Text REtrieval Conference) Web Track in the three years from 2009 to 2011. Three groups of runs, one per year, were chosen as experimental data; each group contains the 8 best runs of that year as ranked by ERR-IA@20. Information about these runs is given in Table 2. The mean and variance of retrieval performance were calculated for each group, since both can affect the performance of the final fused result.

Table 2. Information on the three groups of runs submitted to the TREC Web Track diversity task

Each group of results involves 50 queries, and 5-fold cross-validation is used to evaluate the fused results. The queries are divided evenly into 5 groups by query number: 1-10, 11-20, 21-30, 31-40 and 41-50. Four of the groups are used to train the weights and the remaining group is used for the linear combination; every such choice is tried, so that fused results are obtained for all queries.
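The cross-validation split described above can be sketched as follows. The generator name `five_fold_splits` is hypothetical, and the sketch assumes the number of queries is divisible by the number of folds, as it is for 50 queries and 5 folds.

```python
from typing import Iterator, List, Tuple


def five_fold_splits(num_queries: int = 50, folds: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training queries, held-out queries) for consecutive query-number groups."""
    if num_queries % folds != 0:
        raise ValueError("num_queries must be divisible by folds")
    size = num_queries // folds
    # Groups 1-10, 11-20, ..., 41-50 for the default arguments.
    groups = [list(range(k * size + 1, (k + 1) * size + 1)) for k in range(folds)]
    for held_out in range(folds):
        train = [q for k, group in enumerate(groups) if k != held_out for q in group]
        yield train, groups[held_out]
```

Weights are trained on the first element of each pair and applied to the queries in the second, so that every query is fused exactly once with weights it did not contribute to.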

For the weight assignment strategy, three factors are first considered separately: the effectiveness, dissimilarity and complementarity of an information retrieval system, with corresponding weights denoted p, v and c. On the training data, the performance value obtained with ERR-IA@20 is used as the performance weight p of an information retrieval system; the dissimilarity weight v is obtained from formulas 2 and 3, using the top 100 documents of each ranked list; and the complementarity weight c is calculated from formulas 4 and 5. Several combinations of p, v and c are then tried as the final weight of an information retrieval system: p*v, p²*v, p*v², p*c, p²*c, p*c² and p²*v*c.
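The seven combinations can be written down as a small lookup table. This is only an illustrative sketch: the names `SCHEMES` and `combined_weight` and the ASCII scheme labels are assumptions made for this example.

```python
from typing import Callable, Dict

# Each scheme maps the three per-system weights (p, v, c) to a final weight.
SCHEMES: Dict[str, Callable[[float, float, float], float]] = {
    "p*v": lambda p, v, c: p * v,
    "p^2*v": lambda p, v, c: p ** 2 * v,
    "p*v^2": lambda p, v, c: p * v ** 2,
    "p*c": lambda p, v, c: p * c,
    "p^2*c": lambda p, v, c: p ** 2 * c,
    "p*c^2": lambda p, v, c: p * c ** 2,
    "p^2*v*c": lambda p, v, c: p ** 2 * v * c,
}


def combined_weight(scheme: str, p: float, v: float, c: float) -> float:
    """Final weight of one system under the named combination scheme."""
    return SCHEMES[scheme](p, v, c)
```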

In addition, score normalization is indispensable in data fusion and directly affects the final fusion performance. The present invention uses a method based on the logarithmic function to normalize the scores of the retrieval results of all component information retrieval systems. This method uses the formula score(i) = max{1 - 0.2*ln(i+1), 0} to calculate the score of the document at rank position i.
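The normalization formula translates directly into code. In the sketch below the function name `log_score` is hypothetical and rank positions are assumed to start at 1.

```python
import math


def log_score(rank: int) -> float:
    """score(i) = max{1 - 0.2*ln(i+1), 0} for the document at rank position i (i >= 1)."""
    if rank < 1:
        raise ValueError("rank positions start at 1")
    return max(1.0 - 0.2 * math.log(rank + 1), 0.0)
```

Because ln(i+1) reaches 5 just below i + 1 = 149, the score is positive for positions up to 147 and is 0 from position 148 onward.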

The experimental results are shown in Tables 3 and 4. Two diversity-oriented evaluation metrics are used to measure the performance of each result: ERR-IA@20 in Table 3 and α-nDCG@20 in Table 4. The best component result is used as the baseline for performance comparison, and the two classical data fusion methods CombSum and CombMNZ are also tested. Each linear combination fusion method is denoted by the way its weights are calculated.

Table 3. Performance of a group of data fusion methods measured by ERR-IA@20

Note: figures in bold are the best results within a group or overall on average.

Table 4. Performance of a group of data fusion methods measured by α-nDCG@20

As Tables 3 and 4 show, all data fusion methods perform better than the best component result. CombSum and CombMNZ also perform well. When the effectiveness weight p, the dissimilarity weight v and the complementarity weight c are considered individually and compared with CombSum and CombMNZ, the complementarity weight c and the effectiveness weight p prove more useful. When only one kind of weight is used, the complementarity weight c outperforms the effectiveness weight p; in addition, using the square of the complementarity weight, c², is more helpful than using c alone.

The weights were also used in combination, in 7 different ways: pv, p²v, pv², pc, p²c, pc² and p²vc. In terms of average performance over the three groups of data, the combinations of the effectiveness weight and the complementarity weight perform best of the 7, pc² in particular, while the other combinations perform relatively poorly.

Overall, the fusion methods represented by c², pc and pc² are better on average than those using the other weight assignments. Compared with CombSum and CombMNZ, the improvement is not particularly large, but the differences are all significant; compared with the best component result, the improvement is substantial. The complementarity weight of an information retrieval system is therefore a useful weight, and it is effective for search result diversification with the linear combination method.

Claims

1. A data fusion method supporting the diversification of information retrieval results, characterized in that the weight of each information retrieval system is first obtained on a set of training data, and the results of all information retrieval systems are then fused by the linear combination method, the specific steps being as follows:

\[ c_i(j) = \frac{p(r) - p(r_j)}{p(r)} \]

\[ c_i = \frac{1}{t - 1} \sum_{\substack{j = 1 \\ j \neq i}}^{t} c_i(j) \]

Applying the above formulas to each r_i (1 ≤ i ≤ t) gives the complementarity weights c_i of a group of retrieval results for query q;

\[ g(d) = \sum_{i=1}^{t} w_i \cdot s_i(d) \]

where s_i(d) is the score of document d in retrieval result r_i.