CN104424279A

CN104424279A - Text relevance calculating method and device

Info

Publication number: CN104424279A
Application number: CN201310388496.XA
Authority: CN
Inventors: 赫南; 张文斌; 姚伶伶; 王莉峰; 何琪; 张博
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2013-08-30
Filing date: 2013-08-30
Publication date: 2015-03-18
Anticipated expiration: 2033-08-30
Also published as: CN104424279B

Abstract

Embodiments of the invention provide a text relevance calculating method and a device thereof. The method comprises the following steps: receiving a first character string and a second character string; calculating a text relevance characteristic value of the first character string and the second character string, and calculating a semantic relevance characteristic value of the first character string and the second character string; fitting the text relevance characteristic value and the semantic relevance characteristic value into a relevance characteristic value of the first character string and the second character string based on the logistic regression model. The text relevance calculating method and the device thereof increase the precision of relevance judgment, save storage space and reduce the cost.

Description

Method and device for calculating text relevance

Technical field

Embodiments of the present invention relate to the technical field of Internet applications and, more specifically, to a method and device for calculating text relevance.

Background technology

With the rapid development of computer and network technology, the Internet plays an increasingly important role in daily life, study, and work, and new Internet applications keep emerging.

Search advertising is a very important business in the Internet advertising ecosystem. It depends on search engines and, in essence, sells matches based on keywords. In the promotion database, besides providing the title and description used to display an advertisement, the advertiser also adds keywords that have some relevance to the advertisement (namely bid terms), and specifies the match type, the bid, and the targeted traffic (namely users who match the retrieval intent). In the classic matching flow, bid terms form a direct index to advertisements. When a user's query word "matches" an advertiser's bid term, that is, when their relevance reaches a certain degree, the preliminary condition for triggering the advertisement is considered to be met (other targeting and filtering stages are ignored here), and the corresponding advertisement (title and description) can be pulled for further selection, such as click-through rate estimation, ad ranking, and display strategy selection.

In the retrieval (Retrieve) stage, the advertising system takes the user's query string and applies multiple online and offline strategies to match bid terms. The bid terms found here are all short texts, related to the advertisement title and description, that the advertiser specified when filling in the advertising material. Measuring the relevance between a query word (query) and a candidate bid term (bidterm) in the system is, in essence, measuring the relevance between short texts online.

Traditionally there are many methods based on literal string matching; their offline and online evaluation methods also differ, and all of them have limitations. Sahami et al. of Google proposed using the web search results of a short text as a semantic expansion and calculating the semantic relevance between short texts on that basis, which works better than purely word-based methods. Metzler of the University of Massachusetts, Dumais of Microsoft, et al. have also tried multiple short-text representations for calculating semantic relevance.

However, when the traditional calculation method based on the vector space model of words in documents is applied to short texts, it faces the problem of sparse features. Meanwhile, because the word segmentation result of a short text depends on the language model, consistency among different segmentations cannot be guaranteed, which aggravates the sparsity of the vectors to some extent. The traditional calculation method based on the vector space model of words in documents therefore has the shortcoming of low relevance prediction accuracy.

Moreover, the traditional calculation method based on the vector space model of words in documents needs a large amount of storage space to store term vectors, which wastes storage space and increases cost.

Summary of the invention

Embodiments of the present invention propose a method for calculating text relevance, so as to improve the accuracy of relevance prediction.

Embodiments of the present invention also propose a device for calculating text relevance, so as to improve the accuracy of relevance prediction.

The technical solution of the embodiments of the present invention is as follows:

A method for calculating text relevance, the method comprising:

receiving a first character string and a second character string;

calculating a text relevance feature value of the first character string and the second character string, and a semantic relevance feature value of the first character string and the second character string;

fitting the text relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string based on a logistic regression model.

A device for calculating text relevance, the device comprising a character string receiving unit, a relevance feature value calculating unit, and a relevance feature value fitting unit, wherein:

the character string receiving unit is configured to receive a first character string and a second character string;

the relevance feature value calculating unit is configured to calculate a text relevance feature value of the first character string and the second character string, and a semantic relevance feature value of the first character string and the second character string;

the relevance feature value fitting unit is configured to fit the text relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string based on a logistic regression model.

As can be seen from the above technical solution, in the embodiments of the present invention a first character string and a second character string are received; a text relevance feature value and a semantic relevance feature value of the two character strings are calculated; and the text relevance feature value and the semantic relevance feature value are fitted into a relevance feature value of the two character strings based on a logistic regression model. The embodiments of the present invention thus avoid the calculation method based on the vector space model of words in documents, and with it the problem of sparse features, thereby improving the accuracy of relevance prediction, saving storage space, and reducing cost.

Moreover, the embodiments of the present invention propose string-level text relevance features based on edit distance, longest common subsequence, and the like. These features express the text similarity between short strings along multiple dimensions and cope better with the many cases in which short texts are non-standard or word segmentation is inaccurate or inconsistent.

In addition, the embodiments of the present invention propose relevance features based on text classification and probabilistic latent semantic analysis. These features fully exploit the implicit relations between a short text and the words that form it, so that the category association and topic association between two short texts can be calculated, supplementing the text relevance features.

Furthermore, the embodiments of the present invention propose a relevance feature based on the web search results of individual words. The amount of dictionary resources it relies on is controllable, and per-unit storage space and calculation speed improve markedly, which makes lightweight online calculation of the semantic relevance between short strings possible.

Brief description of the drawings

Fig. 1 is a flowchart of the method for calculating text relevance according to an embodiment of the present invention;

Fig. 2 is a structural diagram of the device for calculating text relevance according to an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.

Many applications involve calculating the relevance of two short texts. The relevance of two short texts refers to the degree of semantic association between them; the two are not necessarily literally similar. Relevance is a broader concept than similarity (Similarity) and matters in many products and systems. A short text is a character string of short length; in some network applications, for example, one of no more than 38 Chinese characters.

A bid term (Bidterm) is the term an advertiser submits for bidding in a bid-based advertising system; a query word (Query) is the search keyword a user submits to a search engine. Query words and bid terms are generally text strings of short length, and both can be referred to as short texts.

Fig. 1 is a flowchart of the method for calculating text relevance according to an embodiment of the present invention.

As shown in Fig. 1, the method comprises:

Step 101: receive a first character string and a second character string.

Here, the first character string and the second character string are preferably short texts. For example, they can be a query word and a bid term, respectively.

Step 102: calculate a text relevance feature value of the first character string and the second character string, and a semantic relevance feature value of the first character string and the second character string.

Text-level relevance features mainly measure the text similarity between short strings. They use only the text of the short strings and can be computed on the fly by efficient, optimized algorithms.

For example, a relevance feature value of the first character string and the second character string based on edit distance can be calculated, and/or a relevance feature value of the two character strings based on the longest common subsequence.

Semantic-level relevance features mainly measure the similarity of concept and meaning between short strings.

In one embodiment, calculating the semantic relevance feature value of the first character string and the second character string comprises:

building an industry category feature word dictionary (for example, a first-level industry category feature word dictionary);

for the first character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the word's global inverse document frequency weight, and summing the products to obtain the category distribution of the first character string; for the second character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the word's global inverse document frequency weight, and summing the products to obtain the category distribution of the second character string;

calculating the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic relevance feature value of the two character strings.

Preferably, building the industry category feature word dictionary comprises:

classifying each webpage by full-text matching against a manually labeled set of industry category feature words;

for webpages that have been assigned a category, performing full-text word segmentation, extracting category feature words, and merging the extracted category feature words into the set of industry category feature words, so as to build the industry category feature word dictionary.

For the first character string, the topic distribution of each word is obtained, the topic distribution of every word in the first character string is multiplied by the word's global inverse document frequency weight, and the products are summed to obtain the topic distribution of the first character string; for the second character string, the topic distribution of each word is obtained, the topic distribution of every word in the second character string is multiplied by the word's global inverse document frequency weight, and the products are summed to obtain the topic distribution of the second character string;

the cosine similarity of the topic distributions of the first character string and the second character string is calculated to obtain the semantic relevance feature value of the two character strings.

In one embodiment, calculating the semantic relevance feature value of the first character string and the second character string comprises: calculating a relevance feature value of the two character strings based on statistical machine translation.

In one embodiment, calculating the semantic relevance feature value of the first character string and the second character string comprises: calculating a word-granularity semantic relevance feature value of the two character strings based on web search results.

In practice, several calculation methods can be used at the same time to calculate the text relevance feature value of the first character string and the second character string. For example, both a relevance feature value based on edit distance and a relevance feature value based on the longest common subsequence can be calculated, and both can then serve as calculated text relevance feature values in the fitting calculation of step 103.

Similarly, several calculation methods can be used at the same time to calculate the semantic relevance feature value of the first character string and the second character string.

For example, calculating the semantic relevance feature value of the first character string and the second character string comprises at least one of the following:

calculating a relevance feature value of the two character strings based on edit distance; calculating a relevance feature value of the two character strings based on the longest common subsequence; calculating a relevance feature value of the two character strings based on text classification; calculating a topic relevance feature value of the two character strings based on probabilistic latent semantic analysis (PLSA); calculating a relevance feature value of the two character strings based on statistical machine translation; calculating a word-granularity relevance feature value of the two character strings based on web search results.

All of the calculated semantic relevance feature values then take part in the fitting calculation of step 103.

Step 103: fit the text relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string based on a logistic regression model.

Here, a feature vector is built from the calculated text relevance feature value and semantic relevance feature value of the first character string and the second character string;

training samples are built from the feature vector, and a binary logistic regression model is trained on these samples to obtain the weight of the text relevance feature value, the weight of the semantic relevance feature value, and the bias;

the relevance feature value is then calculated from the text relevance feature value and its weight, the semantic relevance feature value and its weight, and the bias.

The method for calculating text relevance according to the embodiments of the present invention is described in more detail below.

The problem solved by the present invention is formally defined as follows:

Given two short texts T₁ and T₂, calculate the semantic relevance R(T₁, T₂) that reflects their degree of semantic association, where R(T₁, T₂) ∈ [0, 1].

For a short text T, its string length is denoted |T| and its word segmentation result is written T = t₁t₂...tₙ; the word segmentation results of T₁ and T₂ are then T₁ = t₁₁t₁₂...t₁ₙ and T₂ = t₂₁t₂₂...t₂ₙ, respectively.

First, relevance features along multiple dimensions are calculated for the two short texts; a logistic regression model is then used to fit the relevance feature scores of these dimensions into one final semantic relevance score.

The details are as follows:

Calculating the text relevance feature value between two short texts means calculating text-level relevance features. Because text-level relevance features mainly measure the text similarity between short strings and use only the text of the short strings, they can be computed on the fly by efficient, optimized algorithms.

For example:

(1) Calculating the text relevance feature value based on edit distance

The edit distance (Edit Distance), also known as the Levenshtein distance, is the minimum number of editing operations needed to change one of two character strings into the other. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character.

The edit distance EditDist(T₁, T₂) of two short texts T₁ and T₂ can be calculated by a dynamic programming algorithm with time complexity O(|T₁| * |T₂|).

The relevance feature of two short texts based on edit distance is calculated as follows:

R_{ed} (T_{1}, T_{2}) = 1 - \frac{2 EditDist (T_{1}, T_{2})}{| T_{1} | + | T_{2} |} .
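The edit-distance feature can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names `edit_distance` and `edit_relevance` are hypothetical, distances are counted per character, and returning 1.0 for two empty strings is an assumption the source does not address.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace (or keep) a character
        prev = curr
    return prev[-1]


def edit_relevance(t1: str, t2: str) -> float:
    """R_ed = 1 - 2 * EditDist(T1, T2) / (|T1| + |T2|), as in the formula above."""
    total = len(t1) + len(t2)
    if total == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1.0 - 2.0 * edit_distance(t1, t2) / total


print(edit_relevance("kitten", "sitting"))
```

Note that, as the formula is written, the value can fall below 0 when the strings differ greatly in length, for example for "a" and "bcd".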

(2) Calculating the text relevance feature value based on the longest common subsequence

A subsequence of a character string is the string obtained by deleting some characters from it while keeping the remaining characters in their original order.

The longest common subsequence of two character strings is the longest of all the subsequences they have in common. The length LCS(T₁, T₂) of the longest common subsequence of two short texts T₁ and T₂ can be calculated by a dynamic programming algorithm with time complexity O(|T₁| * |T₂|).

The relevance feature of two short texts based on the longest common subsequence is calculated as follows:

R_{lcs} (T_{1}, T_{2}) = \frac{2 LCS (T_{1}, T_{2})}{| T_{1} | + | T_{2} |} .
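A matching sketch for the longest-common-subsequence feature follows. The names `lcs_length` and `lcs_relevance` are hypothetical, and the handling of two empty strings is again an assumption.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)             # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))   # drop a character from a or b
        prev = curr
    return prev[-1]


def lcs_relevance(t1: str, t2: str) -> float:
    """R_lcs = 2 * LCS(T1, T2) / (|T1| + |T2|), as in the formula above."""
    total = len(t1) + len(t2)
    if total == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 2.0 * lcs_length(t1, t2) / total


print(lcs_relevance("kitten", "sitting"))
```

Because the LCS length never exceeds the shorter string's length, this feature stays within [0, 1].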

Calculating the semantic relevance feature value between two short texts means calculating semantic-level relevance features, which mainly measure the similarity of concept and meaning between short strings.

The semantic relevance feature value between two short texts can be calculated in the following ways:

(1) Calculating the semantic relevance feature value based on text classification

As an example, the embodiments of the present invention mainly classify short texts with a feature-word-based method, whose basic flow is:

First, starting from an initial, manually labeled set of first-level industry category feature words (a set containing a small number of manually labeled first-level industry category feature words), hundreds of millions of webpages are each classified by full-text matching;

for webpages that have been assigned a category, full-text word segmentation is performed and category feature words are extracted; the weight contribution (namely the weight vector) of each extracted category feature word to its category is calculated; the category feature words extracted from the webpages are then merged into the set of first-level industry category feature words;

once feature words have been extracted from all webpages, a comprehensive set of first-level industry category feature words is obtained automatically, and the first-level industry category feature word dictionary is built from it. The dictionary is described by p(c|w), where c denotes a category and w denotes a word; that is, each word has a category distribution.

Given two short texts T₁ and T₂, for each short text the category distribution of each word is obtained from p(c|w); the category distribution of each word of the short text is multiplied by the word's global IDF weight and the products are summed, which yields the category distribution p(c|T) of the short text.

Using the cosine formula, the text classification similarity of the two short texts T₁ and T₂ is:

R_{category} (T_{1}, T_{2}) = \frac{p (c \mid T_{1}) \cdot p (c \mid T_{2})}{\| p (c \mid T_{1}) \| \, \| p (c \mid T_{2}) \|} .
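The aggregation and cosine steps can be sketched as below, assuming the dictionary p(c|w) and the global IDF weights are already available as plain Python dicts. The names `text_distribution`, `cosine`, and `category_relevance`, as well as the toy dictionary entries, are made up for illustration and do not come from the source.

```python
import math


def text_distribution(words, word_dist, idf):
    """Sum the per-word distributions p(c|w), each scaled by the word's IDF weight."""
    dist = {}
    for w in words:
        for c, p in word_dist.get(w, {}).items():  # words missing from the dictionary add nothing
            dist[c] = dist.get(c, 0.0) + idf.get(w, 0.0) * p
    return dist


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a text with no known words has zero relevance
    return dot / (norm_u * norm_v)


def category_relevance(words1, words2, word_dist, idf):
    """Cosine similarity between the category distributions of two segmented texts."""
    return cosine(text_distribution(words1, word_dist, idf),
                  text_distribution(words2, word_dist, idf))


# Hypothetical toy dictionary p(c|w) and IDF weights.
word_dist = {
    "phone": {"electronics": 0.9, "travel": 0.1},
    "laptop": {"electronics": 1.0},
    "flight": {"travel": 1.0},
}
idf = {"phone": 1.0, "laptop": 1.5, "flight": 2.0}
print(category_relevance(["phone"], ["laptop"], word_dist, idf))
```

The same two helper functions apply unchanged to topic distributions p(z|w), since only the meaning of the dict keys differs.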

(2) Calculating the semantic relevance feature value based on PLSA topic relevance

The PLSA model is an unsupervised machine learning model used to identify latent topic (Topic) information in documents and to mine their latent semantic relations. The PLSA model assumes that when a user writes a document, the topic distribution of the document is chosen first, and suitable words are then chosen according to that topic distribution, forming a complete document. In mathematical terms:

A document is selected with probability p(d); each document belongs to a topic with probability p(z|d); and, given a topic, each word is generated with probability p(w|z). The joint probability model formed by this process is:

p (d, w) = p (d) \, p (w \mid d)

p (w \mid d) = \sum_{z \in Z} p (w \mid z) \, p (z \mid d);

Given two short texts T₁ and T₂, for each short text the topic distribution of each word is obtained from p(z|w); the topic distribution of every word of the short text is multiplied by the word's global IDF weight and the products are summed, which yields the topic distribution p(z|T) of the short text.

Using the cosine formula, the PLSA similarity of the two short texts is:

R_{plsa} (T_{1}, T_{2}) = \frac{p (z \mid T_{1}) \cdot p (z \mid T_{2})}{\| p (z \mid T_{1}) \| \, \| p (z \mid T_{2}) \|} .
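Written out, the aggregation step described above takes the following form. The symbol idf(t_i) for the global IDF weight of word t_i is notation introduced here. The source does not state how p(z|w) is derived from the trained PLSA parameters, so the Bayes-rule expression in the last line is only one plausible assumption.

```latex
% Topic distribution of a short text T = t_1 t_2 ... t_n (IDF-weighted sum)
p(z \mid T) = \sum_{i=1}^{n} \mathrm{idf}(t_i) \, p(z \mid t_i)

% The sum need not be normalized: cosine similarity is unchanged when
% either vector is multiplied by a positive constant \alpha
\frac{(\alpha u) \cdot v}{\| \alpha u \| \, \| v \|} = \frac{u \cdot v}{\| u \| \, \| v \|},
\qquad \alpha > 0

% Assumption: per-word topic distribution via Bayes' rule
p(z \mid w) = \frac{p(w \mid z) \, p(z)}{\sum_{z' \in Z} p(w \mid z') \, p(z')}
```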

(3) Calculating the semantic relevance feature value based on statistical machine translation

The idea of the translation probability of bilingual sentence pairs in statistical machine translation can quite naturally be applied to relevance modeling of short texts.

Given two short texts T₁ and T₂, the probability that T₁ occurs given T₂ is P(T₁|T₂), i.e. the likelihood.

Obviously, the more relevant T₁ and T₂ are, the larger the likelihood. Because texts vary endlessly, modeling the likelihood directly is difficult, so it is rewritten using Bayes' formula as follows:

P (T_{1} | T_{2}) = \frac{P (T_{2} | T_{1}) P (T_{1})}{P (T_{2})};

where P(T₂|T₁) is the translation model in machine translation and represents the probability that T₁ is translated into T₂; P(T₁) and P(T₂) are the language models of T₁ and T₂, which characterize the probability that T₁ and T₂, respectively, are legitimate short texts.

Under the bag-of-words (BOW) model assumption,

P (T_{2} | T_{1}) = \prod_{j} P (t_{2 j} | T_{1}) = \prod_{j} \sum_{i} P (t_{2 j} | t_{1 i});

where P(t₂ⱼ|t₁ᵢ) is the translation probability from word t₁ᵢ to word t₂ⱼ, i.e. the word alignment dictionary. The translation probabilities between words can be obtained by training on a parallel corpus with the EM algorithm.

In a specific application, the translation model and the language model can be trained with the open-source machine translation software Moses on large-scale web search logs and advertisers' bid terms.

The relevance feature of two short texts T₁ and T₂ based on the machine translation model is designed as follows:

R_{mt} (T_{1}, T_{2}) = \frac{P (T_{1} | T_{2}) + P (T_{2} | T_{1})}{2} .
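The bag-of-words translation probability and the symmetric average can be sketched as follows. This is an illustration under stated assumptions: the word alignment dictionary is a plain dict keyed by (source word, target word), both directions are computed directly from that dict rather than through Bayes' formula and the language models, and the names `translation_prob` and `mt_relevance` and all probabilities are made up.

```python
def translation_prob(source_words, target_words, word_table):
    """P(target | source) = prod_j sum_i P(t_j | s_i) under the bag-of-words assumption."""
    prob = 1.0
    for t in target_words:
        # word_table[(s, t)] is P(t | s); unseen pairs contribute 0.
        prob *= sum(word_table.get((s, t), 0.0) for s in source_words)
    return prob


def mt_relevance(words1, words2, word_table):
    """R_mt = (P(T1 | T2) + P(T2 | T1)) / 2, as in the formula above."""
    return (translation_prob(words2, words1, word_table)
            + translation_prob(words1, words2, word_table)) / 2.0


# Hypothetical word alignment dictionary: (source, target) -> P(target | source).
word_table = {
    ("cheap", "inexpensive"): 0.6,
    ("inexpensive", "cheap"): 0.4,
    ("flight", "flight"): 0.9,
}
t1 = ["cheap", "flight"]
t2 = ["inexpensive", "flight"]
print(mt_relevance(t1, t2, word_table))
```

In this toy example P(T₂|T₁) = 0.6 × 0.9 and P(T₁|T₂) = 0.4 × 0.9, so the feature is their mean. A single target word with no aligned source word drives the whole product to zero, which is one way to see the coverage problem discussed for monolingual data.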

In statistical machine translation, this method works well for translation mapping between different languages. Within a single language (for example, when both short strings are Chinese), however, experiments show that the coverage of the translation dictionary is limited, and raising the coverage requires a considerably larger parallel corpus. The embodiments of the present invention borrow the idea of machine translation to construct a relevance feature between short texts.

(4) Calculating the semantic relevance feature value based on word-granularity web search results

The core of the machine-translation-based relevance feature above is the word alignment dictionary. Inspired by this word-granularity mapping, the embodiments of the present invention further propose a relevance feature based on the web search results of individual words to characterize the relevance between short texts.

Given a word, the N feature words with the largest TF-IDF values are extracted from its web search results (N is 64 in the actual system), and the feature vector V(t) = (w₁, w₂, ..., wₙ) formed by the TF-IDF values of these feature words serves as the representation of the word's meaning. The relevance of two words t₁ and t₂ based on their web search results is then defined as follows:

R_{bow} (t_{1}, t_{2}) = \frac{V (t_{1}) \cdot V (t_{2})}{\| V (t_{1}) \| \times \| V (t_{2}) \|};

The relevance feature of two short texts T₁ and T₂ based on the web search results of their words is designed as follows:

R_{bow} (T_{1}, T_{2}) = \frac{\sum_{i} \max_{j} (R_{bow} (t_{1 i}, t_{2 j})) + \sum_{j} \max_{i} (R_{bow} (t_{1 i}, t_{2 j}))}{2};
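A short sketch of the word-level and text-level formulas follows, assuming each word's TF-IDF feature vector has already been built from its web search results and is stored as a sparse dict. The names `word_relevance` and `bow_relevance` and the toy vectors are hypothetical.

```python
import math


def word_relevance(t1, t2, vectors):
    """Cosine similarity of the TF-IDF feature vectors V(t1) and V(t2)."""
    u, v = vectors.get(t1, {}), vectors.get(t2, {})
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a word without a stored vector matches nothing
    return dot / (norm_u * norm_v)


def bow_relevance(words1, words2, vectors):
    """Sum of best matches in both directions, divided by 2, as in the formula above."""
    if not words1 or not words2:
        return 0.0  # assumption: an empty text has zero relevance
    forward = sum(max(word_relevance(a, b, vectors) for b in words2) for a in words1)
    backward = sum(max(word_relevance(a, b, vectors) for a in words1) for b in words2)
    return (forward + backward) / 2.0


# Hypothetical TF-IDF vectors; a real system would keep the top N = 64 feature words.
vectors = {
    "cheap": {"price": 1.0, "low": 1.0},
    "inexpensive": {"price": 1.0, "low": 1.0},
    "flight": {"airline": 1.0},
}
print(bow_relevance(["cheap", "flight"], ["inexpensive", "flight"], vectors))
```

As the formula is printed, the best-match scores are summed rather than averaged, so the value grows with the number of words and can exceed 1 for multi-word texts; the example above evaluates to about 2.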

With word-granularity features, only the TF-IDF feature vectors of common words need to be stored, which greatly reduces disk space overhead; there is no need to store massive numbers of long query strings. Each query string can be expressed by the finer-grained features of its words, and the relevance between short texts can be measured by the formula above.

With the above algorithms, multiple relevance feature values (text relevance and/or semantic relevance) can be calculated, and these values can then be merged into one overall relevance feature value.

Specifically:

As described above, relevance feature values along several different dimensions can be calculated between short strings. The features actually selected include, but are not limited to, edit distance, longest common subsequence, classification, the PLSA topic model, and word-granularity relevance. Finally, a logistic regression model is used to fit all the relevance feature values into one overall semantic relevance score.

A sample in the training corpus of the semantic relevance model is generally a pair of short texts together with a relevance score assigned by an editor, and the model is expected to output a relevance score between 0 and 1. Logistic regression, however, is a classification model: it requires each training sample to be a feature vector with a class label, and what the model outputs is also a class label.

The embodiments of the present invention therefore proceed as follows:

for each pair of short texts labeled by an editor, the aforementioned relevance feature scores are calculated and assembled into one feature vector;

M training samples are formed from each feature vector; if the editor's score is S (S ∈ [0, 1]), a number of these samples determined by S are given the class label 1, and the remaining samples are labeled 0;

a binary logistic regression model is trained to obtain the weights w₁, w₂, ..., wₙ of the relevance features and the bias b;

for two given short texts T₁ and T₂, the aforementioned relevance feature scores R₁, R₂, ..., Rₙ are calculated first, and the Sigmoid function is then used to calculate the final relevance score as

R (T_{1}, T_{2}) = \frac{1}{1 + e^{- (\sum_{i} R_{i} w_{i} + b)}};

The input domain of the Sigmoid function is (-∞, +∞) and its output range is (0, 1), which makes it well suited to calculating a relevance score.

The embodiments of the present invention can be applied in many fields. For example, in the retrieval system of an actual search advertising service, the logistic regression model can be used for the preliminary selection of bid terms: a threshold is set on the relevance score between short strings, and the bid terms semantically most relevant to the query string are kept as candidates.
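The fitting and scoring steps can be sketched with a small pure-Python binary logistic regression. Several things here are assumptions rather than details from the source: batch gradient descent as the training procedure, the learning rate and epoch count, the function names, and the toy feature vectors with their 0/1 labels. The construction of the M labeled samples from an editor score is not reproduced.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relevance_score(features, weights, bias):
    """R(T1, T2) = sigmoid(sum_i R_i * w_i + b), as in the formula above."""
    return sigmoid(sum(r * w for r, w in zip(features, weights)) + bias)


def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss (assumed procedure)."""
    n, m = len(samples[0]), len(samples)
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(samples, labels):
            err = relevance_score(x, weights, bias) - y  # prediction minus 0/1 label
            for k in range(n):
                grad_w[k] += err * x[k]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


# Hypothetical training data: each row is a vector of relevance feature scores.
samples = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
labels = [1, 1, 0, 0]
weights, bias = train_logistic(samples, labels)
print(relevance_score([0.9, 0.8], weights, bias))
print(relevance_score([0.1, 0.2], weights, bias))
```

Although the model is trained on class labels, the Sigmoid output is used directly as a continuous relevance score, which is what allows a threshold to be applied afterwards.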

In sum, traditional based on document in word vector space model computing method in, short text faces the problem that feature is sparse.Meanwhile, because the word segmentation result of short text depends on language model, the consistent of different word segmentation can not be ensured, also can aggravate the sparse of vector to a certain extent.

For this problem, embodiment of the present invention proposes based on feature based on the text relevant of the character string such as editing distance, longest common subsequence aspect, they can express text similarity between short string from multiple dimension, better can process that a lot of short text is lack of standardization, participle is forbidden or inconsistent situation.

And, tradition is based on literal similar correlation calculations method, mainly utilize traditional BOW (bag-of-words) model, generally be based upon on the basis of feature independence hypothesis, measure the correlativity of short text according to the match condition of proper vector, but in practice, many times there is a lot of incidence relations between feature, special in running into the situation such as polysemy and adopted many words, semantically have skew, cause coulometer not calculated accurately really.

For this problem, embodiment of the present invention proposes the correlative character analyzed based on text classification, probability implicit semantic.It fully can excavate the implication relation between short text and the word forming short text, thus the classification contact calculated between two short texts and theme contact, formed and the feature of text relevant is supplemented.

And traditional computing method based on short text Webpage searching result are utilize external resource to form literal expansion to short string in essence.From effect, spreading result depends critically upon the correlativity quality of the products such as selected search engine.From performance, its Search Results huge amount relied on, each short string needs to store corresponding result, requires very high to download and computing velocity; Two synonyms but literally have difference slightly, the short text that even word order is different, Search Results also may differ widely, and needs to store respectively.In addition, indexed results is also meeting regular update, and the spreading result of respective stored also needs to change thereupon, and how to ensure that expansion quality does not decline, the renewal expense how equilibrium criterion upgrades is all the problem that can not avoid.

Embodiments of the present invention propose relevance features based on the web search results of individual words. The number of dictionary resources relied on is controllable, and the per-unit storage space and computing speed are significantly improved, making it possible to compute lightweight semantic relevance between short strings online.

Based on the detailed analysis above, embodiments of the present invention also propose a text relevance calculation device.

Fig. 2 is a structural diagram of the text relevance calculation device according to an embodiment of the present invention.

As shown in Fig. 2, the device comprises a character string receiving unit 201, a relevance feature value calculating unit 202 and a relevance feature value fitting unit 203, wherein:

The character string receiving unit 201 is configured to receive a first character string and a second character string;

The relevance feature value calculating unit 202 is configured to calculate a text relevance feature value of the first character string and the second character string, and a semantic relevance feature value of the first character string and the second character string;

The relevance feature value fitting unit 203 is configured to fit the text relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string based on a logistic regression model.

In one embodiment:

The relevance feature value calculating unit 202 is configured to calculate a relevance feature value of the first character string and the second character string based on edit distance, and/or calculate a relevance feature value of the first character string and the second character string based on longest common subsequence.
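The two string-level features named in this embodiment can be sketched as follows. This is a minimal illustration, not the formula of the invention: the function names and the normalization by the length of the longer string are assumptions chosen so that both features fall in the range 0 to 1.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance_feature(a: str, b: str) -> float:
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def lcs_feature(a: str, b: str) -> float:
    """Assumed normalization: LCS length over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest
```

Both functions operate on characters, so they do not depend on word segmentation, which is consistent with the motivation given above for short, non-standard text.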

In one embodiment:

The relevance feature value calculating unit 202 is configured to build a first-level industry category feature word dictionary; for the first character string, obtain the category distribution of each word according to the first-level industry category feature word dictionary, multiply the category distribution of each word by the global inverse document frequency (IDF) weight of that word, and sum the results to obtain the category distribution of the first character string; for the second character string, obtain the category distribution of each word according to the first-level industry category feature word dictionary, multiply the category distribution of each word by the global IDF weight of that word, and sum the results to obtain the category distribution of the second character string; and calculate the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
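The category-based feature described above can be sketched as follows. The dictionary contents, the two example categories and the handling of words missing from the dictionary (they are skipped) are illustrative assumptions; the input strings are assumed to be already segmented into words.

```python
import math
from typing import Dict, List


def string_distribution(words: List[str],
                        word_dist: Dict[str, List[float]],
                        idf: Dict[str, float],
                        num_classes: int) -> List[float]:
    """Sum each word's category distribution, weighted by its global IDF."""
    total = [0.0] * num_classes
    for w in words:
        dist = word_dist.get(w)
        if dist is None:          # assumption: out-of-dictionary words are skipped
            continue
        weight = idf.get(w, 0.0)
        for k, p in enumerate(dist):
            total[k] += weight * p
    return total


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


# Hypothetical two-category dictionary: [travel, finance].
WORD_DIST = {
    "flight": [0.9, 0.1],
    "hotel": [0.8, 0.2],
    "loan": [0.05, 0.95],
    "cheap": [0.5, 0.5],
}
IDF = {"flight": 2.0, "hotel": 2.0, "loan": 2.5, "cheap": 0.5}


def category_feature(words_a: List[str], words_b: List[str]) -> float:
    """Semantic relevance feature: cosine of the two category distributions."""
    da = string_distribution(words_a, WORD_DIST, IDF, 2)
    db = string_distribution(words_b, WORD_DIST, IDF, 2)
    return cosine_similarity(da, db)
```

With this toy dictionary, `category_feature(["cheap", "flight"], ["hotel"])` is higher than `category_feature(["cheap", "flight"], ["loan"])`, even though neither pair shares a word.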

In one embodiment:

The relevance feature value calculating unit 202 is configured to classify each web page by full-text matching based on a manually labeled set of first-level industry category feature words; and, for web pages that have been assigned a category, perform word segmentation on the full text, extract category feature words, and merge the extracted category feature words into the set of first-level industry category feature words to build the first-level industry category feature word dictionary.

In one embodiment:

The relevance feature value calculating unit 202 is configured to, for the first character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the first character string by the global IDF weight of that word, and sum the results to obtain the topic distribution of the first character string; for the second character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the second character string by the global IDF weight of that word, and sum the results to obtain the topic distribution of the second character string; and calculate the cosine similarity of the topic distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
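The topic-based feature of this embodiment can be written compactly as follows. The notation is assumed for illustration and does not appear in the original text: $s$ is a character string treated as a set of segmented words, $K$ is the number of topics, $P(z_k \mid w)$ is the probability of topic $z_k$ given word $w$, and $\mathrm{idf}(w)$ is the global IDF weight of $w$.

```latex
\mathbf{t}(s) = \sum_{w \in s} \mathrm{idf}(w)\,
  \bigl(P(z_1 \mid w),\, \dots,\, P(z_K \mid w)\bigr),
\qquad
\mathrm{rel}_{\text{topic}}(s_1, s_2) =
  \frac{\mathbf{t}(s_1) \cdot \mathbf{t}(s_2)}
       {\lVert \mathbf{t}(s_1) \rVert \, \lVert \mathbf{t}(s_2) \rVert}
```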

In one embodiment:

The relevance feature value calculating unit 202 is configured to calculate a relevance feature value of the first character string and the second character string based on statistical machine translation, and/or calculate a semantic relevance feature value of the first character string and the second character string based on word-granularity web search results.

In one embodiment:

The relevance feature value fitting unit 203 is configured to construct a feature vector from the calculated text relevance feature value and semantic relevance feature value of the first character string and the second character string; build training samples from the feature vectors, and train a binary logistic regression model on the training samples to obtain the weight of the text relevance feature value, the weight of the semantic relevance feature value and the bias; and calculate the relevance feature value using the text relevance feature value and its weight, the semantic relevance feature value and its weight, and the bias.
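The fitting step can be illustrated with a small binary logistic regression trained by batch gradient descent. This is a sketch under assumptions: the training data, learning rate and epoch count are invented for illustration, and the original text does not specify the optimizer. Each feature vector here holds one text relevance feature and one semantic relevance feature; a label of 1 marks a relevant pair.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(samples: List[List[float]],
                              labels: List[int],
                              lr: float = 0.5,
                              epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit feature weights and a bias by batch gradient descent on log loss."""
    n = len(samples[0])
    m = len(samples)
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def relevance(x: List[float], w: List[float], b: float) -> float:
    """Fitted relevance value: sigmoid of the weighted feature sum plus bias."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical training samples: [text relevance feature, semantic relevance feature].
X = [[0.9, 0.8], [0.7, 0.9], [0.8, 0.6], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
Y = [1, 1, 1, 0, 0, 0]
WEIGHTS, BIAS = train_logistic_regression(X, Y)
```

The output of `relevance` lies strictly between 0 and 1, so it can be compared against a threshold to decide whether a query and a purchased keyword are relevant enough to trigger an advertisement.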

In one embodiment:

The relevance feature value calculating unit 202 is configured to perform at least one of the following:

Calculating an edit-distance-based relevance feature value of the first character string and the second character string;

Calculating a longest-common-subsequence-based relevance feature value of the first character string and the second character string;

Calculating a text-classification-based relevance feature value of the first character string and the second character string;

Calculating a topic relevance feature value of the first character string and the second character string based on probabilistic latent semantic analysis (PLSA);

Calculating a statistical-machine-translation-based relevance feature value of the first character string and the second character string;

Calculating a relevance feature value of the first character string and the second character string based on word-granularity web search results.

In practice, the text relevance calculation method proposed by embodiments of the present invention may be implemented in various forms. For example, following an application programming interface of a certain specification, the method may be written as a plug-in installed on a server, or packaged as an application for users to download and use. When written as a plug-in, it may be implemented in plug-in formats such as ocx, dll and cab. The method may also be implemented by specific technologies such as a Flash plug-in, RealPlayer plug-in, MMS plug-in, MIDI staff plug-in or ActiveX plug-in.

The text relevance calculation method proposed by embodiments of the present invention may be stored on various storage media as instructions or instruction sets. These storage media include, but are not limited to: floppy disks, CDs, DVDs, hard disks, flash memory, USB flash drives, CF cards, SD cards, MMC cards, SM cards, memory sticks (Memory Stick), xD cards and the like.

In addition, the text relevance calculation method proposed by embodiments of the present invention may also be applied to storage media based on NAND flash, such as USB flash drives, CF cards, SD cards, SDHC cards, MMC cards, SM cards, memory sticks and xD cards.

In summary, in embodiments of the present invention, a first character string and a second character string are received; a text relevance feature value of the first character string and the second character string and a semantic relevance feature value of the first character string and the second character string are calculated; and the text relevance feature value and the semantic relevance feature value are fitted into a relevance feature value of the first character string and the second character string based on a logistic regression model. Embodiments of the present invention thus avoid the document-oriented calculation method based on the word vector space model and therefore avoid the feature sparsity problem, which improves the accuracy of relevance judgment, saves storage space and reduces cost.

The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A text relevance calculation method, characterized in that the method comprises:

Receiving a first character string and a second character string;

2. The text relevance calculation method according to claim 1, characterized in that calculating the text relevance feature value of the first character string and the second character string comprises:

Calculating a relevance feature value of the first character string and the second character string based on edit distance, and/or calculating a relevance feature value of the first character string and the second character string based on longest common subsequence.

3. The text relevance calculation method according to claim 1, characterized in that calculating the semantic relevance feature value of the first character string and the second character string comprises:

Building an industry category feature word dictionary;

4. The text relevance calculation method according to claim 3, characterized in that

Building the industry category feature word dictionary comprises:

5. The text relevance calculation method according to claim 1, characterized in that

Calculating the semantic relevance feature value of the first character string and the second character string comprises:

6. The text relevance calculation method according to claim 1, characterized in that

Calculating the semantic relevance feature value of the first character string and the second character string comprises: calculating a relevance feature value of the first character string and the second character string based on statistical machine translation.

7. The text relevance calculation method according to claim 1, characterized in that

Calculating the semantic relevance feature value of the first character string and the second character string comprises: calculating a semantic relevance feature value of the first character string and the second character string based on word-granularity web search results.

8. The text relevance calculation method according to any one of claims 1-7, characterized in that fitting the text relevance feature value and the semantic relevance feature value into the relevance feature value based on the logistic regression model comprises:

Constructing a feature vector from the calculated text relevance feature value and semantic relevance feature value of the first character string and the second character string;

9. The text relevance calculation method according to any one of claims 1-7, characterized in that

Calculating the semantic relevance feature value of the first character string and the second character string comprises at least one of the following:

10. A text relevance calculation device, characterized in that the device comprises a character string receiving unit, a relevance feature value calculating unit and a relevance feature value fitting unit, wherein:

11. The text relevance calculation device according to claim 10, characterized in that

The relevance feature value calculating unit is configured to calculate a relevance feature value of the first character string and the second character string based on edit distance, and/or calculate a relevance feature value of the first character string and the second character string based on longest common subsequence.

12. The text relevance calculation device according to claim 10, characterized in that

The relevance feature value calculating unit is configured to build an industry category feature word dictionary; for the first character string, obtain the category distribution of each word according to the industry category feature word dictionary, multiply the category distribution of each word by the global inverse document frequency (IDF) weight of that word, and sum the results to obtain the category distribution of the first character string; for the second character string, obtain the category distribution of each word according to the industry category feature word dictionary, multiply the category distribution of each word by the global IDF weight of that word, and sum the results to obtain the category distribution of the second character string; and calculate the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.

13. The text relevance calculation device according to claim 12, characterized in that

The relevance feature value calculating unit is configured to classify each web page by full-text matching based on a manually labeled set of industry category feature words; and, for web pages that have been assigned a category, perform word segmentation on the full text, extract category feature words, and merge the extracted category feature words into the set of industry category feature words to build the industry category feature word dictionary.

14. The text relevance calculation device according to claim 10, characterized in that

The relevance feature value calculating unit is configured to, for the first character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the first character string by the global IDF weight of that word, and sum the results to obtain the topic distribution of the first character string; for the second character string, obtain the topic distribution of each word, multiply the topic distribution of each word in the second character string by the global IDF weight of that word, and sum the results to obtain the topic distribution of the second character string; and calculate the cosine similarity of the topic distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.

15. The text relevance calculation device according to claim 10, characterized in that

The relevance feature value calculating unit is configured to calculate a relevance feature value of the first character string and the second character string based on statistical machine translation, and/or calculate a semantic relevance feature value of the first character string and the second character string based on word-granularity web search results.

16. The text relevance calculation device according to any one of claims 10-15, characterized in that

The relevance feature value fitting unit is configured to construct a feature vector from the calculated text relevance feature value and semantic relevance feature value of the first character string and the second character string; build training samples from the feature vectors, and train a binary logistic regression model on the training samples to obtain the weight of the text relevance feature value, the weight of the semantic relevance feature value and the bias; and calculate the relevance feature value using the text relevance feature value and its weight, the semantic relevance feature value and its weight, and the bias.

17. The text relevance calculation device according to any one of claims 10-15, characterized in that

The relevance feature value calculating unit is configured to perform at least one of the following: