CN104424279A - Text relevance calculating method and device - Google Patents

Info

Publication number
CN104424279A
Authority
CN
China
Prior art keywords
character string
eigenwert
character
word
correlative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310388496.XA
Other languages
Chinese (zh)
Other versions
CN104424279B (en)
Inventor
赫南
张文斌
姚伶伶
王莉峰
何琪
张博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201310388496.XA
Publication of CN104424279A
Application granted
Publication of CN104424279B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3347: Query execution using vector based model
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374: Thesaurus

Abstract

Embodiments of the invention provide a text relevance calculating method and a device thereof. The method comprises the following steps: receiving a first character string and a second character string; calculating a text relevance characteristic value of the first character string and the second character string, and calculating a semantic relevance characteristic value of the first character string and the second character string; fitting the text relevance characteristic value and the semantic relevance characteristic value into a relevance characteristic value of the first character string and the second character string based on the logistic regression model. The text relevance calculating method and the device thereof increase the precision of relevance judgment, save storage space and reduce the cost.

Description

Method and device for calculating text relevance
Technical field
Embodiments of the present invention relate to the technical field of Internet applications and, more specifically, to a method and device for calculating text relevance.
Background
With the rapid development of computer and network technology, the Internet plays an ever larger role in daily life, study, and work, and new Internet applications keep emerging.
Search advertising is a very important business in the Internet advertising ecosystem. It depends on search engines and, in essence, sells matches based on keywords. In the promotion database, besides the title and description used to display an advertisement, an advertiser also adds keywords that have some relevance to the advertisement (the bid terms), and specifies a match type and a bid so as to target matching traffic (that is, users whose retrieval intent fits). In the classic matching flow, bid terms form a direct index to advertisements. When a user's query "matches" an advertiser's bid term, that is, when their relevance reaches a certain level, the preliminary condition for triggering the advertisement is considered met (other targeting and filtering stages are ignored here for now). The corresponding advertisement (title, description) can then be pulled for further selection, such as click-through rate estimation, ad ranking, and display strategy selection.
In the retrieval (Retrieve) stage, the ad system takes the user's query string and applies multiple online and offline strategies to match bid terms. The bid terms found here are all short texts, related to the ad title and description, that advertisers specified when filling in their material. Measuring the relevance between a query and a candidate bid term (bidterm) in the online system is therefore, in essence, measuring the relevance between short texts.
Traditionally there are many methods based on literal string matching, and offline and online evaluation methods also differ; all of them have limitations. Sahami et al. of Google proposed using the web search results of a short text as a semantic expansion and computing the semantic relevance between short texts on that basis, which works better than purely word-based methods. Metzler of the University of Massachusetts and Dumais et al. of Microsoft also tried several short-text representations for computing semantic relevance.
However, with traditional computing methods based on the vector space model of words in documents, short texts face the problem of sparse features. Moreover, because the word segmentation result of a short text depends on the language model, different segmentations cannot be guaranteed to be consistent, which aggravates vector sparsity to some extent. Such methods therefore suffer from low relevance prediction accuracy.
In addition, these methods need a large amount of storage space for the term vectors, which wastes storage and raises cost.
Summary of the invention
Embodiments of the present invention propose a method for calculating text relevance, so as to improve the accuracy of relevance prediction.
Embodiments of the present invention also propose a device for calculating text relevance, so as to improve the accuracy of relevance prediction.
The technical solution of the embodiments is as follows:
A method for calculating text relevance, comprising:
receiving a first character string and a second character string;
calculating a text relevance feature value of the first character string and the second character string, and a semantic relevance feature value of the first character string and the second character string;
fitting, based on a logistic regression model, the text relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string.
A device for calculating text relevance, comprising a character string receiving unit, a relevance feature value computing unit, and a relevance feature value fitting unit, wherein:
the character string receiving unit is configured to receive a first character string and a second character string;
the relevance feature value computing unit is configured to calculate the text relevance feature value of the first character string and the second character string, and the semantic relevance feature value of the first character string and the second character string;
the relevance feature value fitting unit is configured to fit, based on a logistic regression model, the text relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string.
As can be seen from the above technical solution, embodiments of the present invention receive a first character string and a second character string, calculate their text relevance feature value and semantic relevance feature value, and fit these values, based on a logistic regression model, into a relevance feature value of the two character strings. Embodiments of the present invention thus avoid computing methods based on the vector space model of words in documents, and hence the sparse-feature problem, which improves the accuracy of relevance prediction, saves storage space, and reduces cost.
Moreover, embodiments of the present invention propose string-level text relevance features based on edit distance, longest common subsequence, and the like. These features express the text similarity between short strings from several dimensions and better handle the many cases where short texts are non-standard or where word segmentation is inaccurate or inconsistent.
In addition, embodiments of the present invention propose relevance features based on text classification and probabilistic latent semantic analysis. These fully mine the implicit relations between a short text and the words that form it, and from them compute the category and topic associations between two short texts, supplementing the text relevance features.
Furthermore, embodiments of the present invention propose a relevance feature based on the web search results of words. The number of dictionary resources it relies on is controllable, and storage space and computing speed improve markedly, which makes lightweight online computation of semantic relevance between short strings possible.
Brief description of the drawings
Fig. 1 is a flowchart of the text relevance calculation method according to an embodiment of the present invention;
Fig. 2 is a structural diagram of the text relevance calculation device according to an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings.
Many applications need to calculate the relevance of two short texts. The relevance of two short texts refers to the degree of semantic association between them; the texts need not be literally similar. Relevance is a broader concept than similarity (Similarity) and matters in many products and systems. A short text is a character string of short length; in some network applications, for example, no more than 38 Chinese characters.
A bid term (Bidterm) is the term an advertiser submits for bidding in a bidding advertisement system; a query (Query) is the search keyword a user submits to a search engine. Queries and bid terms are generally short text strings, and both can be referred to as short texts.
Fig. 1 is a flowchart of the text relevance calculation method according to an embodiment of the present invention.
As shown in Fig. 1, the method comprises:
Step 101: receive a first character string and a second character string.
Here, the first character string and the second character string are preferably short texts. For example, they may be a query and a bid term, respectively.
Step 102: calculate the text relevance feature value of the first character string and the second character string, and the semantic relevance feature value of the first character string and the second character string.
Text-level relevance features mainly measure the text similarity between short strings. They use only the text of the short strings and can be computed on the fly with efficient, optimized algorithms.
For example, a relevance feature value of the first and second character strings based on edit distance can be calculated, and/or a relevance feature value based on the longest common subsequence.
Semantic-level relevance features mainly measure the similarity in concept and meaning between short strings.
In one embodiment, calculating the semantic relevance feature value of the first character string and the second character string comprises:
building an industry category feature word dictionary (for example, a first-level industry category feature word dictionary);
for the first character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying the category distribution of each word by the word's global inverse document frequency (IDF) weight, and summing the results to obtain the category distribution of the first character string; doing the same for the second character string to obtain the category distribution of the second character string;
calculating the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
Preferably, building the industry category feature word dictionary comprises:
classifying each web page by full-text matching against a manually labeled set of industry category feature words;
segmenting the full text of the web pages that have been assigned a category, extracting category feature words, and merging the extracted category feature words into the industry category feature word set, so as to build the industry category feature word dictionary.
In one embodiment, calculating the semantic relevance feature value of the first character string and the second character string comprises:
for the first character string, obtaining the topic distribution of each word, multiplying the topic distribution of each word in the first character string by the word's global inverse document frequency (IDF) weight, and summing the results to obtain the topic distribution of the first character string; doing the same for the second character string to obtain the topic distribution of the second character string;
calculating the cosine similarity of the topic distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
In one embodiment, calculating the semantic relevance feature value of the first character string and the second character string comprises: calculating a relevance feature value of the first character string and the second character string based on statistical machine translation.
In one embodiment, calculating the semantic relevance feature value of the first character string and the second character string comprises: calculating a semantic relevance feature value of the first character string and the second character string based on the word granularity of web search results.
In practice, several calculation methods can be used at the same time to compute text relevance feature values of the first character string and the second character string. For example, a relevance feature value based on edit distance and a relevance feature value based on the longest common subsequence can both be calculated, and both then take part, as the calculated text relevance feature values, in the fitting of step 103.
Similarly, several calculation methods can be used at the same time to compute semantic relevance feature values of the first character string and the second character string.
For example, calculating the semantic relevance feature value of the first character string and the second character string comprises at least one of the following:
calculating a relevance feature value of the first character string and the second character string based on edit distance; calculating a relevance feature value based on the longest common subsequence; calculating a relevance feature value based on text classification; calculating a topic relevance feature value based on probabilistic latent semantic analysis (PLSA); calculating a relevance feature value based on statistical machine translation; calculating a relevance feature value based on the word granularity of web search results.
All the semantic relevance feature values calculated then take part in the fitting of step 103.
Step 103: fit, based on a logistic regression model, the text relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string.
Here, a feature vector is built from the calculated text relevance feature value and semantic relevance feature value of the first character string and the second character string;
training samples are built from the feature vector, and a binary logistic regression model is trained on them to obtain the weight of the text relevance feature value, the weight of the semantic relevance feature value, and the bias;
the relevance feature value is then calculated from the text relevance feature value and its weight, the semantic relevance feature value and its weight, and the bias.
The text relevance calculation method of the embodiments of the present invention is described in more detail below.
The problem solved by the present invention is formally defined as follows:
Given two short texts $T_1$ and $T_2$, compute the semantic relevance $R(T_1, T_2)$ that reflects their degree of semantic association, where $R(T_1, T_2) \in [0, 1]$.
For a short text $T$, its string length is denoted $|T|$ and its word segmentation result is written $T = t_1 t_2 \ldots t_n$; the word segmentation results of $T_1$ and $T_2$ are then $T_1 = t_{11} t_{12} \ldots t_{1n}$ and $T_2 = t_{21} t_{22} \ldots t_{2n}$.
First, relevance features along multiple dimensions are computed for the two short texts; a logistic regression model then fits the feature scores of these dimensions into one final semantic relevance score.
Specifically:
Text relevance feature values between two short texts are text-level relevance features. Because they mainly measure the text similarity between short strings and use only the text of the short strings, they can be computed on the fly with efficient, optimized algorithms.
For example:
(1) Text relevance feature value based on edit distance
Edit distance (Edit Distance), also known as Levenshtein distance, is the minimum number of edit operations needed to change one character string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
The edit distance $\mathrm{EditDist}(T_1, T_2)$ of two short texts $T_1$, $T_2$ can be computed by a dynamic programming algorithm with time complexity $O(|T_1| \cdot |T_2|)$.
The relevance feature of two short texts based on edit distance is computed as:
$$R_{ed}(T_1, T_2) = 1 - \frac{2\,\mathrm{EditDist}(T_1, T_2)}{|T_1| + |T_2|}$$
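The edit-distance feature above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names `edit_distance` and `r_ed` are hypothetical, and returning 1.0 for two empty strings is an assumption made to avoid division by zero.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via O(|a| * |b|) dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca with cb
        prev = curr
    return prev[-1]


def r_ed(t1: str, t2: str) -> float:
    """R_ed = 1 - 2 * EditDist(T1, T2) / (|T1| + |T2|)."""
    total = len(t1) + len(t2)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - 2.0 * edit_distance(t1, t2) / total
```

Note that, as the formula is written, the score can drop below 0 when the strings share nothing and differ in length: `r_ed("a", "bcd")` has edit distance 3 and total length 4, giving 1 - 6/4 = -0.5.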
(2) Text relevance feature value based on longest common subsequence
A subsequence of a character string is the string obtained by deleting some characters (possibly none) from that string; unlike a substring, its characters need not be contiguous.
The longest common subsequence of two character strings is the longest of all the subsequences they share. For two short texts $T_1$, $T_2$, the length $\mathrm{LCS}(T_1, T_2)$ of the longest common subsequence can be computed by a dynamic programming algorithm with time complexity $O(|T_1| \cdot |T_2|)$.
The relevance feature of two short texts based on the longest common subsequence is computed as:
$$R_{lcs}(T_1, T_2) = \frac{2\,\mathrm{LCS}(T_1, T_2)}{|T_1| + |T_2|}$$
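A minimal Python sketch of the longest-common-subsequence feature follows. The names `lcs_length` and `r_lcs` are hypothetical, and the value returned for two empty strings is an assumption.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, O(|a| * |b|) dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a character of a or of b
        prev = curr
    return prev[-1]


def r_lcs(t1: str, t2: str) -> float:
    """R_lcs = 2 * LCS(T1, T2) / (|T1| + |T2|)."""
    total = len(t1) + len(t2)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 2.0 * lcs_length(t1, t2) / total
```

Because the common subsequence is never longer than either string, this score always lies in [0, 1].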
Semantic relevance feature values between two short texts are semantic-level relevance features, which mainly measure the similarity in concept and meaning between short strings.
The semantic relevance feature values between two short texts can be computed in the following ways:
(1) Semantic relevance feature value based on text classification
As an example, embodiments of the present invention mainly classify short texts with a feature-word-based method, whose basic flow is:
First, starting from a manually labeled initial set of first-level industry category feature words (the set contains a small number of manually labeled first-level industry category feature words), hundreds of millions of web pages are each classified by full-text matching;
The full text of the pages that have been assigned a category is segmented and category feature words are extracted; the weight contribution (that is, the weight vector) of each extracted feature word to its category is computed, and the feature words extracted from the pages are merged into the first-level industry category feature word set;
Once feature words have been extracted from all pages, a comprehensive set of first-level industry category feature words is obtained automatically, and from it the first-level industry category feature word dictionary is built. The dictionary can be described by $p(c \mid w)$, where $c$ denotes a category and $w$ a word; that is, each word has a category distribution.
Given two short texts $T_1$, $T_2$, for each short text the category distribution of each word is obtained from $p(c \mid w)$; the category distribution of each word is multiplied by the word's global IDF weight and the results are summed, giving the category distribution $p(c \mid T)$ of the short text.
Using the cosine formula, the text classification similarity of the two short texts $T_1$, $T_2$ is:
$$R_{category}(T_1, T_2) = \frac{p(c \mid T_1) \cdot p(c \mid T_2)}{\|p(c \mid T_1)\| \, \|p(c \mid T_2)\|}$$
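The category feature can be sketched as follows, assuming the dictionary $p(c \mid w)$ and the IDF weights are already available. The tables `P_C_GIVEN_W` and `IDF` below are made-up toy values for illustration only, and all function names are hypothetical.

```python
import math

# Toy stand-ins (assumptions): a real dictionary would be mined from web pages.
P_C_GIVEN_W = {
    "flight": {"travel": 0.9, "finance": 0.1},
    "hotel": {"travel": 0.8, "finance": 0.2},
    "loan": {"travel": 0.1, "finance": 0.9},
}
IDF = {"flight": 1.0, "hotel": 1.0, "loan": 2.0}


def category_distribution(words):
    """p(c|T): sum of each word's p(c|w), weighted by the word's IDF."""
    dist = {}
    for w in words:
        for c, p in P_C_GIVEN_W.get(w, {}).items():
            dist[c] = dist.get(c, 0.0) + IDF.get(w, 0.0) * p
    return dist


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0  # assumption: unknown words give relevance 0


def r_category(words1, words2):
    return cosine(category_distribution(words1), category_distribution(words2))
```

With these toy values, `r_category(["flight"], ["hotel"])` is larger than `r_category(["flight"], ["loan"])`, since the first pair shares a dominant category.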
(2) Semantic relevance feature value based on PLSA topic relevance
The PLSA model is an unsupervised machine learning model used to identify the topic (Topic) information latent in documents and to mine their latent semantic relations. The PLSA model assumes that a user writing a document first chooses the topic distribution of the document and then chooses suitable words according to that topic distribution, thereby forming a complete document. In mathematical terms:
A document is selected with probability $p(d)$, each document belongs to a topic with probability $p(z \mid d)$, and, given a topic, each word is generated with probability $p(w \mid z)$. The joint probability model formed by this process is:
$$p(d, w) = p(d)\, p(w \mid d)$$
$$p(w \mid d) = \sum_{z \in Z} p(w \mid z)\, p(z \mid d)$$
The PLSA model parameters are trained with the EM algorithm, giving $p(z \mid d)$ and $p(w \mid z)$. Bayes' formula, $p(z \mid w) = p(w \mid z)\, p(z) / p(w)$, then gives $p(z \mid w)$.
Given two short texts $T_1$, $T_2$, for each short text the topic distribution of each word is obtained from $p(z \mid w)$; the topic distribution of each word in the short text is multiplied by the word's global IDF weight and the results are summed, giving the topic distribution $p(z \mid T)$ of the short text.
Using the cosine formula, the PLSA similarity of the two short texts is:
$$R_{plsa}(T_1, T_2) = \frac{p(z \mid T_1) \cdot p(z \mid T_2)}{\|p(z \mid T_1)\| \, \|p(z \mid T_2)\|}$$
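The sketch below illustrates only the steps after PLSA training: inverting $p(w \mid z)$ to $p(z \mid w)$ with Bayes' formula, building $p(z \mid T)$, and taking the cosine. EM training itself is not shown. The parameters `P_Z`, `P_W_GIVEN_Z`, and `IDF` are toy values assumed for illustration, and the function names are hypothetical.

```python
import math

# Toy stand-ins (assumptions) for parameters that EM training would produce.
P_Z = [0.5, 0.5]  # p(z) for two topics
P_W_GIVEN_Z = [
    {"flight": 0.6, "hotel": 0.3, "loan": 0.1},  # p(w | z = 0)
    {"flight": 0.1, "hotel": 0.1, "loan": 0.8},  # p(w | z = 1)
]
IDF = {"flight": 1.0, "hotel": 1.0, "loan": 2.0}


def p_z_given_w(word):
    """Bayes: p(z|w) = p(w|z) p(z) / p(w), with p(w) = sum_z p(w|z) p(z)."""
    joint = [P_W_GIVEN_Z[z].get(word, 0.0) * P_Z[z] for z in range(len(P_Z))]
    p_w = sum(joint)
    return [x / p_w for x in joint] if p_w else [0.0] * len(P_Z)


def topic_distribution(words):
    """p(z|T): IDF-weighted sum of the words' topic distributions."""
    dist = [0.0] * len(P_Z)
    for w in words:
        for z, p in enumerate(p_z_given_w(w)):
            dist[z] += IDF.get(w, 0.0) * p
    return dist


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def r_plsa(words1, words2):
    return cosine(topic_distribution(words1), topic_distribution(words2))
```

In this toy setup "flight" and "hotel" both load mostly on topic 0, so their score is higher than that of "flight" and "loan".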
(3) Semantic relevance feature value based on statistical machine translation
The idea of translation probability between bilingual sentence pairs in statistical machine translation lends itself naturally to modeling the relevance of short texts.
Given two short texts $T_1$, $T_2$, let $P(T_1 \mid T_2)$ be the probability that $T_1$ occurs given $T_2$, that is, the likelihood.
Clearly, the more relevant $T_1$ and $T_2$ are, the larger this likelihood. Because texts vary endlessly, modeling the likelihood directly is difficult, so it is rewritten with Bayes' formula:
$$P(T_1 \mid T_2) = \frac{P(T_2 \mid T_1)\, P(T_1)}{P(T_2)}$$
Here $P(T_2 \mid T_1)$ is the translation model in machine translation and represents the probability that $T_1$ is translated into $T_2$; $P(T_1)$ and $P(T_2)$ are the language models of $T_1$ and $T_2$ and describe the probability that $T_1$ and $T_2$, respectively, are legitimate short texts.
Under the bag-of-words (BOW) model assumption:
$$P(T_2 \mid T_1) = \prod_j P(t_{2j} \mid T_1) = \prod_j \sum_i P(t_{2j} \mid t_{1i})$$
Here $P(t_{2j} \mid t_{1i})$ is the translation probability from word $t_{1i}$ to word $t_{2j}$, that is, the word alignment dictionary. Word-to-word translation probabilities can be trained on a parallel corpus with the EM algorithm.
In a concrete application, the translation model and language model can be trained with the open-source machine translation software moses on large-scale web search logs and advertisers' bid terms.
The relevance feature of two short texts $T_1$, $T_2$ based on the machine translation model is designed as:
$$R_{mt}(T_1, T_2) = \frac{P(T_1 \mid T_2) + P(T_2 \mid T_1)}{2}$$
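The bag-of-words translation probability and the symmetric feature can be sketched as below. This is a simplification under stated assumptions: the word alignment table `TABLE` holds made-up toy probabilities, both directions are estimated directly from that table, and the language-model terms $P(T_1)$ and $P(T_2)$ from the Bayes rewrite are left out. The function names are hypothetical.

```python
# Toy word alignment dictionary (assumption): TABLE[source][target] = P(target | source).
TABLE = {
    "cheap": {"cheap": 0.8, "discount": 0.2},
    "discount": {"discount": 0.7, "cheap": 0.3},
    "flight": {"flight": 0.9, "ticket": 0.1},
    "ticket": {"ticket": 0.6, "flight": 0.4},
}


def translation_prob(source_words, target_words, table):
    """P(target | source) = prod_j sum_i P(t_j | s_i) under the bag-of-words assumption."""
    prob = 1.0
    for t in target_words:
        prob *= sum(table.get(s, {}).get(t, 0.0) for s in source_words)
    return prob


def r_mt(words1, words2, table):
    """R_mt = (P(T1 | T2) + P(T2 | T1)) / 2."""
    p_1_given_2 = translation_prob(words2, words1, table)
    p_2_given_1 = translation_prob(words1, words2, table)
    return 0.5 * (p_1_given_2 + p_2_given_1)
```

For `["cheap", "flight"]` and `["discount", "ticket"]` the two directions evaluate to 0.3 × 0.4 = 0.12 and 0.2 × 0.1 = 0.02, so the feature is 0.07. A target word that no source word can produce drives that direction to 0, which reflects the limited-coverage issue a monolingual alignment dictionary may face.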
In statistical machine translation, this method maps translations between different languages very well. Within a single language (for example, when both short strings are Chinese), however, experiments show that the translation dictionary has limited coverage, and raising coverage requires a much larger parallel corpus. Embodiments of the present invention borrow the idea of machine translation to construct a relevance feature between short texts.
(4) Semantic relevance feature value based on the word granularity of web search results
The core of the machine-translation-based relevance feature above is the word alignment dictionary. Inspired by this word-granularity mapping, embodiments of the present invention further propose a relevance feature based on the web search results of words to describe the relevance between short texts.
Given a word, the $N$ feature words with the largest TF-IDF values are extracted from its web search results ($N$ is 64 in the real system), and the feature vector formed from their TF-IDF values, $V(t) = (w_1, w_2, \ldots, w_N)$, serves as a representation of the word's meaning. The relevance of two words $t_1$, $t_2$ based on their web search results is then defined as:
$$R_{bow}(t_1, t_2) = \frac{V(t_1) \cdot V(t_2)}{\|V(t_1)\| \times \|V(t_2)\|}$$
The relevance feature of two short texts $T_1$, $T_2$ based on the web search results of words is designed as:
$$R_{bow}(T_1, T_2) = \frac{\sum_i \max_j R_{bow}(t_{1i}, t_{2j}) + \sum_j \max_i R_{bow}(t_{1i}, t_{2j})}{2}$$
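A minimal sketch of the word-granularity feature follows, assuming each word's TF-IDF feature vector has already been extracted from its web search results. The table `VECTORS` is a toy stand-in with made-up values (two or fewer feature words per word instead of 64), and the function names are hypothetical.

```python
import math

# Toy TF-IDF vectors (assumption): feature word -> TF-IDF value, per word.
VECTORS = {
    "flight": {"airline": 0.8, "ticket": 0.6},
    "plane": {"airline": 0.6, "ticket": 0.8},
    "loan": {"bank": 1.0},
}


def r_bow_words(t1, t2):
    """Cosine similarity of the two words' TF-IDF feature vectors."""
    u, v = VECTORS.get(t1, {}), VECTORS.get(t2, {})
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def r_bow_texts(words1, words2):
    """(sum_i max_j R(t1i, t2j) + sum_j max_i R(t1i, t2j)) / 2."""
    if not words1 or not words2:
        return 0.0  # assumption: an empty text has no relevance to anything
    forward = sum(max(r_bow_words(a, b) for b in words2) for a in words1)
    backward = sum(max(r_bow_words(a, b) for a in words1) for b in words2)
    return (forward + backward) / 2.0
```

As the formula is written, each sum grows with the number of words, so the score of multi-word texts can exceed 1; a deployment might normalize by text length, but that would be a departure from the formula shown here.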
With word-granularity features, only the TF-IDF feature vectors of common words need to be stored, which greatly reduces disk space; there is no need to store a massive number of long retrieval strings. Each retrieval string can be expressed through the features of its finer-grained words, and the relevance between short texts can be measured with the formula above.
With the algorithms above, multiple relevance feature values (text relevance and/or semantic relevance) can be calculated and then merged into one overall relevance feature value.
Specifically:
As described above, relevance feature values along several dimensions can be calculated between short strings. The features selected include, but are not limited to, edit distance, longest common subsequence, classification, the PLSA topic model, and word-granularity relevance. Finally, a logistic regression model fits all relevance feature values into one overall semantic relevance score.
A training sample for a semantic relevance model is generally a pair of short texts with a relevance score given by a human editor, and the model is expected to output a relevance score between 0 and 1. Logistic regression, however, is a classification model: it requires each training sample to be a feature vector with a class label, and its output is likewise a class label.
Embodiments of the present invention therefore proceed as follows:
For each editor-labeled pair of short texts, compute the aforementioned relevance feature scores and assemble them into a feature vector;
Form M training samples from each feature vector. If the editor's score is $S$ ($S \in [0, 1]$), a number of these samples set according to $S$ is labeled 1 and the remaining samples are labeled 0 (the expression for that number is missing from this text);
Train a binary logistic regression model to obtain the weights $w_1, w_2, \ldots, w_n$ of the relevance features and the bias $b$;
For two given short texts $T_1$, $T_2$, first compute the aforementioned relevance feature scores $R_1, R_2, \ldots, R_n$, then compute the final relevance score with the Sigmoid function:
$$R(T_1, T_2) = \frac{1}{1 + e^{-\left(\sum_i w_i R_i + b\right)}}$$
The Sigmoid function takes inputs in $(-\infty, +\infty)$ and produces outputs in $(0, 1)$, which makes it well suited to computing relevance scores.
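The fitting step can be sketched end to end in plain Python. Several things here are assumptions rather than the patent's method: the number of positive copies per pair is taken as `round(m * S)`, training uses simple batch gradient descent on the log loss, and the hyperparameters, toy data, and function names are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expand_samples(features, score, m=10):
    """Turn one (feature vector, editor score S) pair into m binary samples.
    Assumption: round(m * S) copies are labeled 1 and the rest 0."""
    positives = round(m * score)
    return [(features, 1)] * positives + [(features, 0)] * (m - positives)


def train(samples, n_features, lr=0.5, epochs=2000):
    """Binary logistic regression fitted by batch gradient descent (assumed optimizer)."""
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in samples:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(samples)
    return w, b


def relevance(features, w, b):
    """R(T1, T2) = sigmoid(sum_i w_i * R_i + b)."""
    return sigmoid(sum(wi * ri for wi, ri in zip(w, features)) + b)


# Toy editor-labeled pairs: ([R_1, R_2], S). Values are made up.
labeled = [([0.9, 0.8], 0.9), ([0.1, 0.2], 0.1), ([0.5, 0.5], 0.5)]
samples = [s for feats, score in labeled for s in expand_samples(feats, score)]
w, b = train(samples, n_features=2)
```

Replicating each pair into positive and negative copies lets a classifier that only accepts 0/1 labels approximate a graded score: the fitted probability for a feature vector tends toward the fraction of its copies labeled 1.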
Embodiments of the present invention can be applied in many fields. For example, in a production search advertising retrieval system, the logistic regression model can be used for preliminary selection of purchased keywords: a threshold is set on the relevance score between short strings, and only the purchased keywords most semantically relevant to the query string are retained as candidates.
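The threshold filtering described above might look like the following sketch. The function names, the threshold value, and the character-overlap scorer are hypothetical; the scorer is only a stand-in for the fitted logistic regression relevance score.

```python
def char_overlap(a, b):
    """Placeholder scorer (an assumption): Jaccard overlap of character sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def filter_candidates(query, keywords, score_fn, threshold=0.5):
    """Keep keywords whose relevance to the query reaches the threshold,
    ordered from most to least relevant."""
    scored = [(kw, score_fn(query, kw)) for kw in keywords]
    kept = [(kw, s) for kw, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


candidates = filter_candidates(
    "red shoes", ["blue car", "red shoe"], char_overlap, threshold=0.5
)
```

In practice the threshold would be tuned against labeled data; 0.5 here is arbitrary.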
In summary, in traditional computation methods based on a document word vector space model, short texts suffer from feature sparsity. Moreover, because the word segmentation of short texts depends on a language model, consistent segmentation across different texts cannot be guaranteed, which further aggravates vector sparsity to some extent.
To address this problem, embodiments of the present invention propose text relevance features at the character string level, such as edit distance and longest common subsequence. These features express the textual similarity between short strings along multiple dimensions and better handle the many cases where short texts are nonstandard or word segmentation is inaccurate or inconsistent.
Furthermore, traditional relevance calculation methods based on literal similarity mainly use the traditional bag-of-words (BOW) model, which generally rests on the assumption that features are independent and measures the relevance of short texts by how well their feature vectors match. In practice, however, features are often interrelated. In particular, with polysemy (one word, several meanings) and synonymy (one meaning, several words), the semantics drift and the relevance calculation becomes inaccurate.
To address this problem, embodiments of the present invention propose relevance features based on text classification and probabilistic latent semantic analysis. These fully exploit the implicit relationships between a short text and the words that compose it, and from them derive the category and topic relationships between two short texts, supplementing the text relevance features.
Moreover, traditional computation methods based on web search results for short texts essentially use an external resource to expand the literal content of a short string. In terms of effectiveness, the expansion depends heavily on the relevance quality of the chosen search engine or similar product. In terms of performance, the volume of search results relied upon is huge: the results must be stored for every short string, which places heavy demands on download and computation speed. Two short texts that are synonymous but differ slightly in wording, or even only in word order, may yield very different search results, which must be stored separately. In addition, index results are updated regularly, so the stored expansions must change with them; how to keep expansion quality from degrading and how to balance the cost of data updates are unavoidable problems.
Embodiments of the present invention propose relevance features based on word-level web search results. The number of dictionary resources relied upon is controllable, storage per unit and computation speed are markedly improved, and lightweight online computation of semantic relevance between short strings becomes possible.
Based on the detailed analysis above, embodiments of the present invention also propose a text relevance calculation device.
Fig. 2 is a structural diagram of a text relevance calculation device according to an embodiment of the present invention.
As shown in Fig. 2, the device comprises a character string receiving unit 201, a relevance feature value computing unit 202, and a relevance feature value fitting unit 203, wherein:
The character string receiving unit 201 is configured to receive a first character string and a second character string;
The relevance feature value computing unit 202 is configured to calculate a text relevance feature value of the first character string and the second character string and a semantic relevance feature value of the first character string and the second character string;
The relevance feature value fitting unit 203 is configured to fit the text relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string based on a logistic regression model.
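The three units of Fig. 2 can be pictured as a small pipeline. The sketch below is only illustrative: the class name, method names, weights, and the two toy feature functions are assumptions, and the "semantic" feature here is a simple word-overlap placeholder rather than any of the semantic features described in this document.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

FeatureFn = Callable[[str, str], float]


@dataclass
class RelevanceDevice:
    text_features: Sequence[FeatureFn]
    semantic_features: Sequence[FeatureFn]
    weights: Sequence[float]  # one weight per feature, text features first
    bias: float

    def receive(self, first, second):
        """Role of unit 201: accept the two character strings."""
        return str(first), str(second)

    def compute_features(self, first, second):
        """Role of unit 202: compute text and semantic relevance feature values."""
        functions = (*self.text_features, *self.semantic_features)
        return [fn(first, second) for fn in functions]

    def fit(self, features):
        """Role of unit 203: combine the feature values with the logistic function."""
        z = sum(r * w for r, w in zip(features, self.weights)) + self.bias
        return 1.0 / (1.0 + math.exp(-z))

    def relevance(self, first, second):
        a, b = self.receive(first, second)
        return self.fit(self.compute_features(a, b))


def exact_match(a, b):
    """Toy text feature (assumption): 1.0 if the strings are identical."""
    return 1.0 if a == b else 0.0


def shared_word_ratio(a, b):
    """Toy placeholder for a semantic feature (assumption): word-set overlap."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


device = RelevanceDevice([exact_match], [shared_word_ratio], [1.0, 1.0], -1.0)
same = device.relevance("a b", "a b")
different = device.relevance("a b", "c d")
```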
In one embodiment:
The relevance feature value computing unit 202 is configured to calculate an edit-distance-based relevance feature value of the first character string and the second character string, and/or a longest-common-subsequence-based relevance feature value of the first character string and the second character string.
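As a rough illustration of these two character-string-level features, the sketch below computes the Levenshtein edit distance and the longest common subsequence length by dynamic programming. Normalizing each by the length of the longer string to obtain a value in [0, 1] is an assumption made here for illustration; this passage does not state the normalization, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance_relevance(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def lcs_relevance(a, b):
    """Assumed normalization: LCS length / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest
```

Both functions operate on characters, so they need no word segmentation, which matches the motivation given earlier for handling nonstandard or inconsistently segmented short texts.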
In one embodiment:
The relevance feature value computing unit 202 is configured to: build a first-level industry category feature word dictionary; for the first character string, obtain the category distribution of each word from the first-level industry category feature word dictionary, multiply each word's category distribution by that word's global inverse document frequency (IDF) weight, and sum the results to obtain the category distribution of the first character string; for the second character string, obtain the category distribution of each word from the first-level industry category feature word dictionary, multiply each word's category distribution by that word's global IDF weight, and sum the results to obtain the category distribution of the second character string; and calculate the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
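The IDF-weighted accumulation and cosine comparison can be sketched as below. The dictionary contents, IDF values, category names, and function names are invented for illustration, and the strings are assumed to be already segmented into words.

```python
import math
from collections import defaultdict


def string_distribution(words, word_dist, idf):
    """Sum each word's category distribution scaled by that word's IDF weight."""
    total = defaultdict(float)
    for word in words:
        for category, p in word_dist.get(word, {}).items():
            total[category] += p * idf.get(word, 0.0)
    return dict(total)


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical feature word dictionary and IDF table.
word_dist = {
    "phone": {"electronics": 0.9, "telecom": 0.1},
    "case": {"electronics": 0.6, "fashion": 0.4},
    "dress": {"fashion": 1.0},
}
idf = {"phone": 1.2, "case": 0.8, "dress": 1.5}

first = string_distribution(["phone", "case"], word_dist, idf)
second = string_distribution(["phone"], word_dist, idf)
third = string_distribution(["dress"], word_dist, idf)
```

The same two functions would apply unchanged if the per-word distributions were over topics rather than categories.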
In one embodiment:
The relevance feature value computing unit 202 is configured to: classify each web page by full-text matching against a manually labeled set of first-level industry category feature words; segment the full text of the web pages that have a category attribute, extract category feature words, and merge the extracted category feature words into the set of first-level industry category feature words, thereby building the first-level industry category feature word dictionary.
In one embodiment:
The relevance feature value computing unit 202 is configured to: for the first character string, obtain the topic distribution of each word, multiply the topic distribution of every word in the first character string by that word's global inverse document frequency (IDF) weight, and sum the results to obtain the topic distribution of the first character string; for the second character string, obtain the topic distribution of each word, multiply the topic distribution of every word in the second character string by that word's global IDF weight, and sum the results to obtain the topic distribution of the second character string; and calculate the cosine similarity of the topic distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
In one embodiment:
The relevance feature value computing unit 202 is configured to calculate a relevance feature value of the first character string and the second character string based on statistical machine translation, and/or a semantic relevance feature value of the first character string and the second character string based on word-granularity web search results.
In one embodiment:
The relevance feature value fitting unit 203 is configured to: construct a feature vector from the calculated text relevance feature value and semantic relevance feature value of the first character string and the second character string; build training samples from the feature vector and train a binary logistic regression model on the training samples to obtain the weight of the text relevance feature value, the weight of the semantic relevance feature value, and the bias; and calculate the relevance feature value from the text relevance feature value and its weight, the semantic relevance feature value and its weight, and the bias.
In one embodiment:
The relevance feature value computing unit 202 is configured to perform at least one of the following:
Calculating an edit-distance-based relevance feature value of the first character string and the second character string;
Calculating a longest-common-subsequence-based relevance feature value of the first character string and the second character string;
Calculating a text-classification-based relevance feature value of the first character string and the second character string;
Calculating a topic relevance feature value of the first character string and the second character string based on probabilistic latent semantic analysis (PLSA);
Calculating a relevance feature value of the first character string and the second character string based on statistical machine translation;
Calculating a relevance feature value of the first character string and the second character string based on word-granularity web search results.
In practice, the text relevance calculation method proposed by embodiments of the present invention can be implemented in many ways. For example, following an application programming interface of a given specification, the method can be written as a plug-in installed in a server, or packaged as an application for users to download and use. When written as a plug-in, it can be implemented in a variety of plug-in formats such as ocx, dll, and cab. The method can also be implemented with specific technologies such as a Flash plug-in, RealPlayer plug-in, MMS plug-in, MIDI staff plug-in, or ActiveX plug-in.
The text relevance calculation method proposed by embodiments of the present invention can be stored on various storage media in the form of instructions or instruction sets. These storage media include, but are not limited to: floppy disk, CD, DVD, hard disk, flash memory, USB flash drive, CF card, SD card, MMC card, SM card, Memory Stick, xD card, etc.
In addition, the text relevance calculation method proposed by embodiments of the present invention can also be applied to storage media based on NAND flash, such as a USB flash drive, CF card, SD card, SDHC card, MMC card, SM card, Memory Stick, xD card, etc.
In summary, in embodiments of the present invention, a first character string and a second character string are received; a text relevance feature value of the first character string and the second character string and a semantic relevance feature value of the first character string and the second character string are calculated; and the text relevance feature value and the semantic relevance feature value are fitted into a relevance feature value of the first character string and the second character string based on a logistic regression model. Embodiments of the present invention thus avoid computation methods based on a document word vector space model and therefore avoid the feature sparsity problem, which improves the accuracy of relevance prediction, saves storage space, and reduces cost.
Moreover, embodiments of the present invention propose text relevance features at the character string level, such as edit distance and longest common subsequence. These features express the textual similarity between short strings along multiple dimensions and better handle the many cases where short texts are nonstandard or word segmentation is inaccurate or inconsistent.
In addition, embodiments of the present invention propose relevance features based on text classification and probabilistic latent semantic analysis, which fully exploit the implicit relationships between a short text and the words that compose it, and from them derive the category and topic relationships between two short texts, supplementing the text relevance features.
Furthermore, embodiments of the present invention propose relevance features based on word-level web search results. The number of dictionary resources relied upon is controllable, storage per unit and computation speed are markedly improved, and lightweight online computation of semantic relevance between short strings becomes possible.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (17)

1. A text relevance calculation method, characterized in that the method comprises:
Receiving a first character string and a second character string;
Calculating a text relevance feature value of the first character string and the second character string and a semantic relevance feature value of the first character string and the second character string;
Fitting the text relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string based on a logistic regression model.
2. The text relevance calculation method according to claim 1, characterized in that said calculating the text relevance feature value of the first character string and the second character string comprises:
Calculating an edit-distance-based relevance feature value of the first character string and the second character string, and/or calculating a longest-common-subsequence-based relevance feature value of the first character string and the second character string.
3. The text relevance calculation method according to claim 1, characterized in that said calculating the semantic relevance feature value of the first character string and the second character string comprises:
Building an industry category feature word dictionary;
For the first character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying each word's category distribution by that word's global inverse document frequency (IDF) weight, and summing the results to obtain the category distribution of the first character string; for the second character string, obtaining the category distribution of each word from the industry category feature word dictionary, multiplying each word's category distribution by that word's global IDF weight, and summing the results to obtain the category distribution of the second character string;
Calculating the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
4. The text relevance calculation method according to claim 3, characterized in that
Said building the industry category feature word dictionary comprises:
Classifying each web page by full-text matching against a manually labeled set of industry category feature words;
Segmenting the full text of the web pages that have a category attribute, extracting category feature words, and merging the extracted category feature words into the set of industry category feature words to build the industry category feature word dictionary.
5. The text relevance calculation method according to claim 1, characterized in that
Said calculating the semantic relevance feature value of the first character string and the second character string comprises:
For the first character string, obtaining the topic distribution of each word, multiplying the topic distribution of every word in the first character string by that word's global inverse document frequency (IDF) weight, and summing the results to obtain the topic distribution of the first character string; for the second character string, obtaining the topic distribution of each word, multiplying the topic distribution of every word in the second character string by that word's global IDF weight, and summing the results to obtain the topic distribution of the second character string;
Calculating the cosine similarity of the topic distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
6. The text relevance calculation method according to claim 1, characterized in that
Said calculating the semantic relevance feature value of the first character string and the second character string comprises: calculating a relevance feature value of the first character string and the second character string based on statistical machine translation.
7. The text relevance calculation method according to claim 1, characterized in that
Said calculating the semantic relevance feature value of the first character string and the second character string comprises: calculating a semantic relevance feature value of the first character string and the second character string based on word-granularity web search results.
8. The text relevance calculation method according to any one of claims 1-7, characterized in that said fitting the text relevance feature value and the semantic relevance feature value into the relevance feature value based on the logistic regression model comprises:
Constructing a feature vector from the calculated text relevance feature value and semantic relevance feature value of the first character string and the second character string;
Building training samples from the feature vector and training a binary logistic regression model on the training samples to obtain the weight of the text relevance feature value, the weight of the semantic relevance feature value, and the bias;
Calculating the relevance feature value from the text relevance feature value and its weight, the semantic relevance feature value and its weight, and the bias.
9. The text relevance calculation method according to any one of claims 1-7, characterized in that
Said calculating the semantic relevance feature value of the first character string and the second character string comprises at least one of the following:
Calculating an edit-distance-based relevance feature value of the first character string and the second character string;
Calculating a longest-common-subsequence-based relevance feature value of the first character string and the second character string;
Calculating a text-classification-based relevance feature value of the first character string and the second character string;
Calculating a topic relevance feature value of the first character string and the second character string based on probabilistic latent semantic analysis (PLSA);
Calculating a relevance feature value of the first character string and the second character string based on statistical machine translation;
Calculating a relevance feature value of the first character string and the second character string based on word-granularity web search results.
10. A text relevance calculation device, characterized in that the device comprises a character string receiving unit, a relevance feature value computing unit, and a relevance feature value fitting unit, wherein:
The character string receiving unit is configured to receive a first character string and a second character string;
The relevance feature value computing unit is configured to calculate a text relevance feature value of the first character string and the second character string and a semantic relevance feature value of the first character string and the second character string;
The relevance feature value fitting unit is configured to fit the text relevance feature value and the semantic relevance feature value into a relevance feature value of the first character string and the second character string based on a logistic regression model.
11. The text relevance calculation device according to claim 10, characterized in that
The relevance feature value computing unit is configured to calculate an edit-distance-based relevance feature value of the first character string and the second character string, and/or a longest-common-subsequence-based relevance feature value of the first character string and the second character string.
12. The text relevance calculation device according to claim 10, characterized in that
The relevance feature value computing unit is configured to: build an industry category feature word dictionary; for the first character string, obtain the category distribution of each word from the industry category feature word dictionary, multiply each word's category distribution by that word's global inverse document frequency (IDF) weight, and sum the results to obtain the category distribution of the first character string; for the second character string, obtain the category distribution of each word from the industry category feature word dictionary, multiply each word's category distribution by that word's global IDF weight, and sum the results to obtain the category distribution of the second character string; and calculate the cosine similarity of the category distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
13. The text relevance calculation device according to claim 12, characterized in that
The relevance feature value computing unit is configured to: classify each web page by full-text matching against a manually labeled set of industry category feature words; segment the full text of the web pages that have a category attribute, extract category feature words, and merge the extracted category feature words into the set of industry category feature words to build the industry category feature word dictionary.
14. The text relevance calculation device according to claim 10, characterized in that
The relevance feature value computing unit is configured to: for the first character string, obtain the topic distribution of each word, multiply the topic distribution of every word in the first character string by that word's global inverse document frequency (IDF) weight, and sum the results to obtain the topic distribution of the first character string; for the second character string, obtain the topic distribution of each word, multiply the topic distribution of every word in the second character string by that word's global IDF weight, and sum the results to obtain the topic distribution of the second character string; and calculate the cosine similarity of the topic distributions of the first character string and the second character string to obtain the semantic relevance feature value of the first character string and the second character string.
15. The text relevance calculation device according to claim 10, characterized in that
The relevance feature value computing unit is configured to calculate a relevance feature value of the first character string and the second character string based on statistical machine translation, and/or a semantic relevance feature value of the first character string and the second character string based on word-granularity web search results.
16. The text relevance calculation device according to any one of claims 10-15, characterized in that
The relevance feature value fitting unit is configured to: construct a feature vector from the calculated text relevance feature value and semantic relevance feature value of the first character string and the second character string; build training samples from the feature vector and train a binary logistic regression model on the training samples to obtain the weight of the text relevance feature value, the weight of the semantic relevance feature value, and the bias; and calculate the relevance feature value from the text relevance feature value and its weight, the semantic relevance feature value and its weight, and the bias.
17. The text relevance calculation device according to any one of claims 10-15, characterized in that
The relevance feature value computing unit is configured to perform at least one of the following:
Calculating an edit-distance-based relevance feature value of the first character string and the second character string;
Calculating a longest-common-subsequence-based relevance feature value of the first character string and the second character string;
Calculating a text-classification-based relevance feature value of the first character string and the second character string;
Calculating a topic relevance feature value of the first character string and the second character string based on probabilistic latent semantic analysis (PLSA);
Calculating a relevance feature value of the first character string and the second character string based on statistical machine translation;
Calculating a relevance feature value of the first character string and the second character string based on word-granularity web search results.
CN201310388496.XA 2013-08-30 2013-08-30 A kind of correlation calculations method and apparatus of text Active CN104424279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310388496.XA CN104424279B (en) 2013-08-30 2013-08-30 A kind of correlation calculations method and apparatus of text

Publications (2)

Publication Number Publication Date
CN104424279A true CN104424279A (en) 2015-03-18
CN104424279B CN104424279B (en) 2018-11-20

Family

ID=52973259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310388496.XA Active CN104424279B (en) 2013-08-30 2013-08-30 A kind of correlation calculations method and apparatus of text

Country Status (1)

Country Link
CN (1) CN104424279B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090259651A1 (en) * 2008-04-11 2009-10-15 Microsoft Corporation Search results ranking using editing distance and document information
CN101777042A (en) * 2010-01-21 2010-07-14 西南科技大学 Neural network and tag library-based statement similarity algorithm
CN102184169A (en) * 2011-04-20 2011-09-14 北京百度网讯科技有限公司 Method, device and equipment used for determining similarity information among character string information
CN102332012A (en) * 2011-09-13 2012-01-25 南方报业传媒集团 Chinese text sorting method based on correlation study between sorts
CN102622338A (en) * 2012-02-24 2012-08-01 北京工业大学 Computer-assisted computing method of semantic distance between short texts
CN103049569A (en) * 2012-12-31 2013-04-17 武汉传神信息技术有限公司 Text similarity matching method on basis of vector space model

Cited By (70)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445963B (en) * 2015-08-10 2021-11-23 北京奇虎科技有限公司 Advertisement index keyword automatic generation method and device of APP platform
CN106445963A (en) * 2015-08-10 2017-02-22 北京奇虎科技有限公司 Advertisement index keyword automatic generation method and apparatus for APP platform
CN108027812A (en) * 2015-09-18 2018-05-11 迈克菲有限责任公司 System and method for multipath language translation
CN106776493A (en) * 2015-11-19 2017-05-31 腾讯科技(深圳)有限公司 Information filtering method and information filtrating device
CN106776493B (en) * 2015-11-19 2020-03-03 腾讯科技(深圳)有限公司 Information filtering method and information filtering device
CN105630767A (en) * 2015-12-22 2016-06-01 北京奇虎科技有限公司 Text similarity comparison method and device
CN105630766A (en) * 2015-12-22 2016-06-01 北京奇虎科技有限公司 Multi-news correlation calculation method and apparatus
CN105630767B (en) * 2015-12-22 2018-06-15 北京奇虎科技有限公司 The comparative approach and device of a kind of text similarity
CN105528335A (en) * 2015-12-22 2016-04-27 北京奇虎科技有限公司 Method and device for determining correlation among news
CN105630766B (en) * 2015-12-22 2018-11-06 北京奇虎科技有限公司 Correlation calculations method and apparatus between more news
CN105528335B (en) * 2015-12-22 2018-10-09 北京奇虎科技有限公司 The method and apparatus for determining correlation between news
US10217025B2 (en) 2015-12-22 2019-02-26 Beijing Qihoo Technology Company Limited Method and apparatus for determining relevance between news and for calculating relevance among multiple pieces of news
CN105528336B (en) * 2015-12-23 2018-09-21 北京奇虎科技有限公司 The method and apparatus that more mark posts determine article correlation
CN105528336A (en) * 2015-12-23 2016-04-27 北京奇虎科技有限公司 Method and device for determining article correlation by multiple marks
CN105550904A (en) * 2015-12-30 2016-05-04 芜湖乐锐思信息咨询有限公司 Product layout analysis system based on network operation
CN105678571A (en) * 2015-12-30 2016-06-15 芜湖乐锐思信息咨询有限公司 Networked product planning analysis system based on Internet
CN105427138A (en) * 2015-12-30 2016-03-23 芜湖乐锐思信息咨询有限公司 Neural network model-based product market share analysis method and system
CN105654346A (en) * 2015-12-30 2016-06-08 芜湖乐锐思信息咨询有限公司 Analysis system based on product refinement operation
CN105550905A (en) * 2015-12-30 2016-05-04 芜湖乐锐思信息咨询有限公司 Product selling analysis system based on network
CN106951422A (en) * 2016-01-07 2017-07-14 腾讯科技(深圳)有限公司 The method and apparatus of webpage training, the method and apparatus of search intention identification
CN105930468B (en) * 2016-04-22 2019-05-17 江苏金鸽网络科技有限公司 A kind of rule-based information correlativity determination method
CN105930468A (en) * 2016-04-22 2016-09-07 江苏金鸽网络科技有限公司 Rule-based information relativity judgment method
CN106095845A (en) * 2016-06-02 2016-11-09 腾讯科技(深圳)有限公司 File classification method and device
WO2018006629A1 (en) * 2016-07-06 2018-01-11 北京搜狗科技发展有限公司 Prescription matching method and device, and device for prescription matching
CN106339371B (en) * 2016-08-30 2019-04-30 齐鲁工业大学 A kind of English-Chinese meaning of a word mapping method and device based on term vector
CN106339371A (en) * 2016-08-30 2017-01-18 齐鲁工业大学 English and Chinese word meaning mapping method and device based on word vectors
CN106484678A (en) * 2016-10-13 2017-03-08 北京智能管家科技有限公司 A kind of short text similarity calculating method and device
CN106657016A (en) * 2016-11-10 2017-05-10 北京奇艺世纪科技有限公司 Illegal user name recognition method and system
CN108205757B (en) * 2016-12-19 2022-05-27 创新先进技术有限公司 Method and device for verifying legality of electronic payment service
CN108205757A (en) * 2016-12-19 2018-06-26 阿里巴巴集团控股有限公司 Method and device for verifying legality of electronic payment service
CN108241867A (en) * 2016-12-26 2018-07-03 阿里巴巴集团控股有限公司 A kind of sorting technique and device
CN108268465A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 A kind of text search technology towards mixed data model
CN108388480A (en) * 2017-02-03 2018-08-10 百度在线网络技术(北京)有限公司 Short string Correlation Calibration method and apparatus
CN108388480B (en) * 2017-02-03 2021-06-11 百度在线网络技术(北京)有限公司 Short string correlation verification method and device
CN107066443A (en) * 2017-03-27 2017-08-18 成都优译信息技术股份有限公司 Multilingual sentence similarity acquisition methods and system are applied to based on linear regression
CN107301248B (en) * 2017-07-19 2020-07-21 百度在线网络技术(北京)有限公司 Word vector construction method and device of text, computer equipment and storage medium
CN107301248A (en) * 2017-07-19 2017-10-27 百度在线网络技术(北京)有限公司 Term vector construction method and device, computer equipment, the storage medium of text
CN109325509A (en) * 2017-07-31 2019-02-12 北京国双科技有限公司 Similarity determines method and device
CN107844559A (en) * 2017-10-31 2018-03-27 国信优易数据有限公司 A kind of file classifying method, device and electronic equipment
CN110019801B (en) * 2017-12-01 2021-03-23 北京搜狗科技发展有限公司 Text relevance determining method and device
CN110019801A (en) * 2017-12-01 2019-07-16 北京搜狗科技发展有限公司 A kind of determination method and apparatus of text relevant
CN108182182A (en) * 2017-12-27 2018-06-19 传神语联网网络科技股份有限公司 Document matching process, device and computer readable storage medium in translation database
CN108536800B (en) * 2018-04-03 2022-04-19 有米科技股份有限公司 Text classification method, system, computer device and storage medium
CN108536800A (en) * 2018-04-03 2018-09-14 有米科技股份有限公司 File classification method, system, computer equipment and storage medium
CN110738220A (en) * 2018-07-02 2020-01-31 百度在线网络技术(北京)有限公司 Method and device for analyzing emotion polarity of sentence and storage medium
CN110738220B (en) * 2018-07-02 2022-09-30 百度在线网络技术(北京)有限公司 Method and device for analyzing emotion polarity of sentence and storage medium
CN110895656B (en) * 2018-09-13 2023-12-29 北京橙果转话科技有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN110895656A (en) * 2018-09-13 2020-03-20 武汉斗鱼网络科技有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN110929498B (en) * 2018-09-20 2023-05-09 中国移动通信有限公司研究院 Method and device for calculating similarity of short text and readable storage medium
CN110929498A (en) * 2018-09-20 2020-03-27 中国移动通信有限公司研究院 Short text similarity calculation method and device and readable storage medium
CN109522551B (en) * 2018-11-09 2024-02-20 天津新开心生活科技有限公司 Entity linking method and device, storage medium and electronic equipment
CN109522551A (en) * 2018-11-09 2019-03-26 天津新开心生活科技有限公司 Entity link method, apparatus, storage medium and electronic equipment
CN109271641B (en) * 2018-11-20 2023-09-08 广西三方大供应链技术服务有限公司 Text similarity calculation method and device and electronic equipment
CN109271641A (en) * 2018-11-20 2019-01-25 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus and electronic equipment
CN111460110A (en) * 2019-01-22 2020-07-28 阿里巴巴集团控股有限公司 Abnormal text detection method, abnormal text sequence detection method and device
CN111460110B (en) * 2019-01-22 2023-04-25 阿里巴巴集团控股有限公司 Abnormal text detection method, abnormal text sequence detection method and device
CN109947919B (en) * 2019-03-12 2020-05-15 北京字节跳动网络技术有限公司 Method and apparatus for generating text matching model
WO2020182122A1 (en) * 2019-03-12 2020-09-17 北京字节跳动网络技术有限公司 Text matching model generation method and device
CN109947919A (en) * 2019-03-12 2019-06-28 北京字节跳动网络技术有限公司 Method and apparatus for generating text matches model
CN111191087A (en) * 2019-12-31 2020-05-22 歌尔股份有限公司 Character matching method, terminal device and computer-readable storage medium
CN111191087B (en) * 2019-12-31 2023-11-07 歌尔股份有限公司 Character matching method, terminal device and computer readable storage medium
CN111382255A (en) * 2020-03-17 2020-07-07 北京百度网讯科技有限公司 Method, apparatus, device and medium for question and answer processing
CN111522918A (en) * 2020-04-24 2020-08-11 天津易维数科信息科技有限公司 Data aggregation method and device, electronic equipment and computer readable storage medium
CN112749252B (en) * 2020-07-14 2023-11-03 腾讯科技(深圳)有限公司 Text matching method and related device based on artificial intelligence
CN112749252A (en) * 2020-07-14 2021-05-04 腾讯科技(深圳)有限公司 Text matching method based on artificial intelligence and related device
CN112185573B (en) * 2020-09-25 2023-11-03 志诺维思(北京)基因科技有限公司 Similar character string determining method and device based on LCS and TF-IDF
CN112185573A (en) * 2020-09-25 2021-01-05 志诺维思(北京)基因科技有限公司 LCS and TF-IDF based similar character string determination method and device
CN113239666B (en) * 2021-05-13 2023-09-29 深圳市智灵时代科技有限公司 Text similarity calculation method and system
CN113239666A (en) * 2021-05-13 2021-08-10 深圳市智灵时代科技有限公司 Text similarity calculation method and system
CN113254596A (en) * 2021-06-22 2021-08-13 湖南大学 User quality inspection requirement classification method and system based on rule matching and deep learning

Also Published As

Publication number Publication date
CN104424279B (en) 2018-11-20

Similar Documents

Publication Publication Date Title
CN104424279A (en) Text relevance calculating method and device
Sidorov et al. Syntactic n-grams as machine learning features for natural language processing
CN102831184B (en) Method and system for predicting social emotion from text descriptions of social events
US11893537B2 (en) Linguistic analysis of seed documents and peer groups
CN110134799B (en) BM25 algorithm-based text corpus construction and optimization method
CN108388660A (en) A kind of improved electric business product pain spot analysis method
CN112069312B (en) Text classification method based on entity recognition and electronic device
Mertoğlu et al. Automated fake news detection in the age of digital libraries
Malandrakis et al. SAIL: A hybrid approach to sentiment analysis
CN115329085A (en) Social robot classification method and system
Gong et al. A semantic similarity language model to improve automatic image annotation
Bulut et al. Generating campaign ads & keywords for programmatic advertising
Hu et al. Retrieval-based language model adaptation for handwritten Chinese text recognition
CN112667940A (en) Webpage text extraction method based on deep learning
Ay et al. Turkish abstractive text document summarization using text to text transfer transformer
Rubtsova et al. Aspect extraction from reviews using conditional random fields
Pan et al. Video clip recommendation model by sentiment analysis of time-sync comments
Qiu et al. Automatic corpus expansion for chinese word segmentation by exploiting the redundancy of web information
CN103646017A (en) Acronym generating system for naming and working method thereof
Quazi et al. Twitter sentiment analysis using machine learning
Drury A Text Mining System for Evaluating the Stock Market's Response To News
Wadawadagi et al. A multi-layer approach to opinion polarity classification using augmented semantic tree kernels
Milošević et al. From web crawled text to project descriptions: automatic summarizing of social innovation projects
KR101126186B1 (en) Apparatus and Method for disambiguation of morphologically ambiguous Korean verbs, and Recording medium thereof
Rämö et al. Using contextual and cross-lingual word embeddings to improve variety in template-based NLG for automated journalism

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant