CN102831119A - Short text clustering equipment and short text clustering method - Google Patents

Info

Publication number
CN102831119A
CN102831119A CN2011101605614A CN201110160561A
Authority
CN
China
Prior art keywords
text
theme
short text
short
collection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101605614A
Other languages
Chinese (zh)
Other versions
CN102831119B (en)
Inventor
赵凯
胡长建
王大亮
许洪志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datatang (Beijing) Technology Co., Ltd.
Original Assignee
NEC China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC China Co Ltd filed Critical NEC China Co Ltd
Priority to CN201110160561.4A priority Critical patent/CN102831119B/en
Publication of CN102831119A publication Critical patent/CN102831119A/en
Application granted granted Critical
Publication of CN102831119B publication Critical patent/CN102831119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides short text clustering equipment comprising a topic analysis unit, a vector generation unit, and a clustering unit. The topic analysis unit performs topic analysis on each text in an auxiliary text collection and a short text collection, thereby obtaining the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection. The vector generation unit normalizes these probabilities to generate a vector for each short text. The clustering unit clusters the short texts in the short text collection based on the generated vectors. The invention further provides a short text clustering method. With the equipment and method, the auxiliary text topics and the short text topics are discovered separately, so that short texts can be clustered more accurately.

Description

Short text clustering apparatus and method
Technical field
The present invention relates to the field of natural language processing, and in particular to a short text clustering apparatus and method.
Background art
With the widespread use of SMS, microblogs, search engines, online advertising, and the like, short texts are used more and more frequently. These texts are usually short; for example, an SMS cannot exceed 70 characters, and a result returned by a search engine generally contains only a few dozen words.
Short texts differ considerably from long texts (for example, news articles). In a long text, a topic can be described fully, so a reader can learn nearly all of its content from the text. In contrast, because the number of words in a short text is limited, usually only the core content of a topic is described, and much related information is omitted.
Traditional text mining methods are usually designed for long texts and run into difficulty when applied to short texts, for example in clustering. Since clustering usually relies on word co-occurrence information, and short texts contain far less such information than long texts, the clustering quality suffers. Consider the following two news texts L1 and L2:
L1: "Tsinghua University's Fourth Teaching Building has been renamed the 'Jeanswest Building', drawing ridicule on campus and on the Internet. The main objection is that the image of a Tsinghua teaching building and the Jeanswest apparel brand simply do not match. From the perspective of due procedure in naming university buildings, Tsinghua clearly leaves something to criticize. Setting that aside, the substantive question that concerns Tsinghua scholars is a single one: from the so-called brand-image angle of naming a teaching building, does the 'Jeanswest Building' diminish Tsinghua's image?"
L2: "Recently, a teaching building at Tsinghua University was named 'Jeanswest', causing a great stir online. Isn't Jeanswest an apparel brand? How did a Tsinghua teaching building become 'Jeanswest'? At noon on the 23rd, a sign reading 'Jeanswest Building' was hung on the outer wall of Tsinghua's Fourth Teaching Building. To the lower right of those characters hangs another sign, dedicated to introducing the Jeanswest apparel brand. Naming a teaching building after a commercial brand has set off a dispute among Tsinghua students and netizens. Some think universities are becoming overly commercialized and should not name buildings after enterprises, while Sina Weibo user Young_pig thinks that the enterprise sponsors the school and that a naming sponsorship does not harm the school's image."
Because L1 and L2 share words such as 'Tsinghua University', 'Fourth Teaching Building', 'Jeanswest', 'apparel', 'university', 'naming', and 'image', it is easy to judge that they are very similar and to group them into one cluster. The following two short texts S1 and S2, however, are not so easy to group into one cluster, because the only important word they share is 'Tsinghua' (the other shared word, 'also', is so commonly used that it carries little weight and is usually removed before clustering):
S1: "I hear the Jeanswest Building and Tsinghua's image just don't match"
S2: "Isn't it just an apparel brand? Tsinghua's naming is too commercialized"
To improve the accuracy of short text clustering, the prior art proposes using auxiliary information to help with clustering. For example, to cluster short texts such as S1 and S2 above, long texts such as L1 and L2 are introduced as auxiliary information. S1 is more similar to L1 (sharing words such as 'Jeanswest', 'Tsinghua', 'image', 'match'), and S2 is more similar to L2 (sharing words such as 'apparel', 'Tsinghua', 'naming', 'commercialization'). Because L1 and L2 are very similar, S1 and S2 are also judged similar and can be grouped into one cluster.
Reference 1 (X.H. Phan, L.M. Nguyen, S. Horiguchi, "Learning to classify short and sparse text & web with hidden topics from large-scale data collections", WWW 2008) describes a method of clustering according to auxiliary texts. As shown in Fig. 1, the method comprises the following steps:
At step S100, topic analysis is performed on the auxiliary text collection to obtain a number of topics and the corresponding vocabulary. Specifically, reference 1 uses texts downloaded from Wikipedia as the auxiliary information to form the auxiliary text collection. The topic analysis uses Latent Dirichlet Allocation (LDA). Fig. 2 shows the LDA model. LDA is a generative model whose main idea is to simulate the generative process of a text: for each word, first select a topic from a distribution, then select a word from that topic. Referring to Fig. 2, the LDA algorithm flow comprises:
1. For each topic k ∈ [1, K], draw a sample from the Dir(β) distribution to obtain the word distribution φ_k under that topic.
2. For each text m ∈ [1, M],
2.1 draw a sample from the Dir(α) distribution to obtain a topic distribution θ_m;
2.2 for each word n,
2.2.1 draw a sample from the multinomial distribution Multinomial(θ_m) to obtain a topic z_{m,n};
2.2.2 draw a sample from the multinomial distribution Multinomial(φ_{z_{m,n}}) to obtain a word w_{m,n}.
Algorithm 1-LDA
Here, the value of α represents the prior weight of each topic before sampling, and the value of β represents the prior distribution of the words of each topic. Both are predetermined parameters, called hyperparameters.
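For illustration only (a minimal sketch, not code from reference 1), the generative process of Algorithm 1 can be written with NumPy; all sizes and the fixed 50-word text length below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes, not values from reference 1.
K, M, V = 10, 100, 5000      # topics, texts, vocabulary size
alpha, beta = 0.1, 0.01      # symmetric hyperparameters

# Step 1: for each topic k, draw a word distribution phi_k ~ Dir(beta).
phi = rng.dirichlet(np.full(V, beta), size=K)    # shape (K, V)

docs = []
for m in range(M):
    # Step 2.1: draw a topic distribution theta_m ~ Dir(alpha).
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(50):                          # placeholder text length
        z = rng.choice(K, p=theta)               # step 2.2.1: topic z_{m,n}
        w = rng.choice(V, p=phi[z])              # step 2.2.2: word w_{m,n}
        words.append(w)
    docs.append(words)
```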
The task of LDA is to estimate the parameters Φ = {φ_1, ..., φ_K} and θ_d. The joint density of all observed and hidden variables is:

p(w_m, z_m, θ_m, Φ | α, β) = ∏_{n=1}^{N_m} p(w_{m,n} | φ_{z_{m,n}}) p(z_{m,n} | θ_m) · p(θ_m | α) · p(Φ | β)

The likelihood function of a single text is obtained by integrating out the hidden variables:

p(w_m | α, β) = ∫∫ p(θ_m | α) p(Φ | β) ∏_{n=1}^{N_m} ∑_{k=1}^{K} p(w_{m,n} | φ_k) p(z_{m,n} = k | θ_m) dΦ dθ_m

The likelihood function over the whole text collection is the product over its texts:

p(W | α, β) = ∏_{m=1}^{M} p(w_m | α, β)
In theory, Φ and θ can be solved for by maximizing the above likelihood function. However, this maximization has no analytic solution, so in practice the parameters are estimated approximately. For example, reference 1 uses Gibbs sampling (Gibbs Sampling) to estimate the parameters. Reference 2 (Thomas L. Griffiths, Mark Steyvers, "Finding scientific topics", Proceedings of the National Academy of Sciences of the United States of America, Vol. 101, Suppl. 1, 6 April 2004, pp. 5228-5235) and reference 3 (Gregor Heinrich, "Parameter estimation for text analysis", Technical Report, 2004) describe in detail the process and algorithm for implementing LDA with Gibbs sampling.
At step S110, based on the topics obtained at step S100, inference is performed on the short text collection to obtain the topics corresponding to these short texts. The inference also uses Gibbs sampling.
At step S120, a training sample set is constructed based on the result of step S110. The training data take the form of vectors, that is, each short text corresponds to one vector. A corresponding vector is generated for each short text in the short text collection, and each short text is then given a class label, thus forming the training sample set.
At step S130, a machine learning method is selected to classify the training sample set so as to obtain a classification model. Several methods are available, for example decision trees, SVM, and maximum entropy. Reference 1 uses the maximum entropy method.
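As a sketch only (reference 1 does not publish code, and the data here are hypothetical), a maximum entropy classifier over such topic vectors can be realized as multinomial logistic regression, for example with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training set: one topic-proportion vector per short text
# (from step S120), plus a class label for each text.
X_train = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.1, 0.7]])
y_train = np.array([0, 1, 2])

# With its default settings, LogisticRegression fits a multinomial
# (softmax) model, which coincides with maximum entropy classification.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print(clf.predict(np.array([[0.6, 0.3, 0.1]])))  # class of a new short text
```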
However, in step S110, which infers topics for the short texts over the topics formed from the auxiliary texts, reference 1 assumes that the topics of the short texts can all be covered by the auxiliary texts. In many real situations this assumption does not hold, or holds only poorly, because many situations and events are newly emerging, and no comprehensive knowledge base can be guaranteed to cover the topics appearing in all of them. In such cases, the auxiliary texts can cover only some of the topics of the short texts. Therefore, the method of reference 1 cannot discover and exploit the newly emerging topics inherent in the short texts, which degrades the quality of classification or clustering.
Summary of the invention
To solve the above technical problem, the present invention proposes performing topic analysis on each text in an auxiliary text collection and a short text collection, so as to obtain the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection. The present invention does not require the existence of an auxiliary text collection that can cover all topics of the short texts; it only requires the auxiliary texts and the short texts to be partially related. Specifically, the present invention uses a double latent Dirichlet allocation (Double Latent Dirichlet Allocation, DLDA) to cluster the short texts. By building two LDAs and adding a switch between them, DLDA discovers the auxiliary text topics and the short text topics separately, and can determine the probability of any short text corresponding to an auxiliary text topic and to a short text topic.
According to an aspect of the present invention, there is provided a short text clustering apparatus comprising: a topic analysis unit, which performs topic analysis on each text in an auxiliary text collection and a short text collection to obtain the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection; a vector generation unit, which normalizes the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection to generate a vector; and a clustering unit, which clusters the short texts in the short text collection based on the generated vectors.
Preferably, the topic analysis unit determines, by a switch parameter, whether a word in each text of the auxiliary text collection and the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if the word corresponds to a topic of the auxiliary text collection, the topic analysis unit performs topic analysis through a first latent Dirichlet allocation, and if the word corresponds to a topic of the short text collection, the topic analysis unit performs topic analysis through a second latent Dirichlet allocation.
Preferably, the topic analysis unit uses a Gibbs sampling algorithm to estimate the parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation, wherein the sampling probability of a topic of the auxiliary text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration, and the sampling probability of a topic of the short text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration.
Preferably, the vector generation unit generates the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
Preferably, the value of the switch parameter obeys a binomial distribution.
Preferably, the topic analysis unit determines the switch parameter so as to ensure that a word in an auxiliary text has a higher probability of corresponding to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text has a higher probability of corresponding to a topic of the short text collection than to a topic of the auxiliary text collection.
According to another aspect of the present invention, there is provided a short text clustering method comprising: a topic analysis step of performing topic analysis on each text in an auxiliary text collection and a short text collection to obtain the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection; a vector generation step of normalizing the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection to generate a vector; and a clustering step of clustering the short texts in the short text collection based on the generated vectors.
Preferably, the topic analysis step comprises: determining, by a switch parameter, whether a word in each text of the auxiliary text collection and the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if the word corresponds to a topic of the auxiliary text collection, performing topic analysis through a first latent Dirichlet allocation, and if the word corresponds to a topic of the short text collection, performing topic analysis through a second latent Dirichlet allocation.
Preferably, a Gibbs sampling algorithm is used to estimate the parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation, wherein the sampling probability of a topic of the auxiliary text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration, and the sampling probability of a topic of the short text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration.
Preferably, the vector generation step comprises: generating the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
Preferably, the value of the switch parameter obeys a binomial distribution.
Preferably, the switch parameter is determined so as to ensure that a word in an auxiliary text has a higher probability of corresponding to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text has a higher probability of corresponding to a topic of the short text collection than to a topic of the auxiliary text collection.
The present invention discovers the auxiliary text topics and the short text topics separately, and can therefore cluster short texts more accurately.
Description of drawings
The above and other features of the present invention will become more apparent through the following detailed description in conjunction with the accompanying drawings, in which:
Fig. 1 shows the flowchart of a prior-art short text clustering method;
Fig. 2 shows the block diagram of the LDA model adopted by the short text clustering method of Fig. 1;
Fig. 3 shows the block diagram of a short text clustering apparatus according to an embodiment of the present invention;
Fig. 4 shows the block diagram of the DLDA model adopted by the short text clustering apparatus according to an embodiment of the present invention; and
Fig. 5 shows the flowchart of a short text clustering method according to an embodiment of the present invention.
Embodiment
Below, the principles and implementation of the present invention will become apparent from the description of specific embodiments in conjunction with the drawings. It should be noted that the present invention is not limited to the specific embodiments described hereinafter. In addition, for brevity, detailed descriptions of well-known techniques not directly related to the present invention are omitted.
Fig. 3 shows the block diagram of a short text clustering apparatus 30 according to an embodiment of the present invention. As shown in Fig. 3, the short text clustering apparatus 30 comprises a topic analysis unit 310, a vector generation unit 320, and a clustering unit 330.
The topic analysis unit 310 performs topic analysis on each text in the auxiliary text collection and the short text collection to obtain their respective topics. In a specific embodiment, the topic analysis unit 310 adopts the DLDA model shown in Fig. 4 to perform the topic analysis. As can be seen from Fig. 4, DLDA comprises two LDAs, corresponding respectively to the topic analysis of the auxiliary texts and of the short texts (where "aux" denotes the auxiliary texts and "tar" denotes the short texts). To coordinate the two LDAs, a switch variable γ is introduced. For each word, the switch variable γ is responsible for selecting whether its topic is taken from the auxiliary texts or from the short texts.
In the present embodiment, the topic analysis unit 310 analyzes the topics of the auxiliary texts and the short texts through the following algorithm:
1. For each topic z ∈ [1, ..., K_aux] of the auxiliary text collection, draw a sample from the Dir(β_aux) distribution to obtain the word distribution φ^aux_z under that topic.
2. For each topic z ∈ [1, ..., K_tar] of the short text collection, draw a sample from the Dir(β_tar) distribution to obtain the word distribution φ^tar_z under that topic.
3. For each text collection c ∈ [aux, tar],
3.1 for each text d ∈ [1, ..., D_c],
3.2 draw from the Dir(α_aux) distribution a topic distribution θ^aux_d over the topics of the auxiliary text collection;
3.3 draw from the Dir(α_tar) distribution a topic distribution θ^tar_d over the topics of the short text collection;
3.4 draw from the Beta(γ^c) distribution a binomial distribution π_d;
3.5 for each word w_{d,n},
3.5.1 draw a switch value x_{d,n} from the binomial distribution π_d;
3.5.2 if x_{d,n} = aux, draw a topic z_{d,n} from the auxiliary text collection's multinomial distribution Multinomial(θ^aux_d);
3.5.3 if x_{d,n} = tar, draw a topic z_{d,n} from the short text collection's multinomial distribution Multinomial(θ^tar_d);
3.5.4 draw a word w_{d,n} from the multinomial distribution Multinomial(φ^{x_{d,n}}_{z_{d,n}}).
Algorithm 2-DLDA
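For illustration, the generative process of Algorithm 2 can be sketched with NumPy as below. The collection sizes and hyperparameters reuse the concrete example that follows (100 auxiliary texts, 50 short texts, K_aux = 15, K_tar = 10); the value of γ^tar and the fixed length of 30 words per text are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 5000                                         # vocabulary size (example below)
K = {"aux": 15, "tar": 10}                       # topics per collection
alpha = {"aux": 0.3, "tar": 0.2}
beta = {"aux": 0.01, "tar": 0.01}
gamma = {"aux": (0.5, 0.2), "tar": (0.2, 0.5)}   # gamma["tar"] is an assumption
n_docs = {"aux": 100, "tar": 50}

# Steps 1-2: topic-word distributions phi for each collection's topics.
phi = {c: rng.dirichlet(np.full(V, beta[c]), size=K[c]) for c in ("aux", "tar")}

corpus = {}
for c in ("aux", "tar"):                         # step 3
    docs = []
    for _ in range(n_docs[c]):                   # step 3.1
        # Steps 3.2-3.3: one topic distribution per collection, for this text.
        theta = {cc: rng.dirichlet(np.full(K[cc], alpha[cc]))
                 for cc in ("aux", "tar")}
        pi = rng.beta(*gamma[c])                 # step 3.4: P(word uses an aux topic)
        words = []
        for _ in range(30):                      # 30 words per text, as in the example
            x = "aux" if rng.random() < pi else "tar"  # step 3.5.1: switch value
            z = rng.choice(K[x], p=theta[x])           # steps 3.5.2-3.5.3: topic
            words.append(rng.choice(V, p=phi[x][z]))   # step 3.5.4: word
        docs.append(words)
    corpus[c] = docs
```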
A concrete application example is described below. Suppose there are 100 auxiliary texts and 50 short texts. Take K_aux = 15, K_tar = 10, α_aux = 0.3, α_tar = 0.2, and β_aux = β_tar = 0.01. It should be noted that α and β are usually multi-dimensional vectors and the value of each dimension may in theory differ, but in practical applications the value of each dimension is usually reduced to the same single value. Take γ^aux = (γ^aux_aux, γ^aux_tar) = (0.5, 0.2), and set γ^tar analogously. Note that the settings here should satisfy γ^c_c > γ^c_{~c}, where c denotes one text collection and ~c denotes the other text collection; for example, if c = aux then ~c = tar, and conversely, if c = tar then ~c = aux.
First, the topic analysis unit 310 counts all the vocabulary of the auxiliary texts and short texts (a word occurring multiple times is counted only once), denoted V. Here, suppose V comprises 5000 words.
Then, for each topic z ∈ [1, ..., 15] of the auxiliary text collection, the topic analysis unit 310 draws a sample from the Dir(0.01) distribution to obtain the word distribution under that topic, for example φ^aux_1 = [0.001, 0, ...]. The dimension of this vector is 5000 and its values sum to 1; the meaning is that, for this topic, the probability of choosing the first word is 0.001, the probability of choosing the second word is 0, and so on. In the same way, φ^aux_2, ..., φ^aux_15 are obtained.
In addition, for each topic z ∈ [1, ..., 10] of the short text collection, the topic analysis unit 310 draws a sample from the Dir(0.01) distribution to obtain the word distribution under each topic, for example φ^tar_1, whose dimension is likewise 5000, and similarly φ^tar_2, ..., φ^tar_10.
Then, for the first auxiliary text d = 1 of the auxiliary text collection c = aux, the topic analysis unit 310 draws from the Dir(0.3) distribution a topic distribution θ^aux_1 over the topics of the auxiliary text collection. The dimension of this vector is 15 and its values sum to 1; the meaning is that the probability of choosing the first topic is 0.1, the probability of choosing the second topic is 0.2, and so on. In addition, the topic analysis unit 310 draws from the Dir(0.2) distribution a topic distribution θ^tar_1 over the topics of the short text collection, a vector whose dimension is 10 and whose values sum to 1. Next, the topic analysis unit 310 draws from the Beta(0.5, 0.2) distribution a binomial distribution π_1 = [0.7, 0.3], whose meaning is that the probability of choosing an auxiliary text topic is 0.7 and the probability of choosing a short text topic is 0.3. Suppose text 1 comprises 30 words. For the first word w_{1,1}, a switch value x_{1,1} = aux is sampled from π_1. Because x_{1,1} = aux, a topic is then sampled from the auxiliary text collection's multinomial distribution Multinomial(θ^aux_1) = [0.1, 0.2, 0, ..., 0.034]; suppose the 15th topic is drawn, i.e., z_{1,1} = 15. Afterwards, the topic analysis unit 310 draws a sample from the multinomial distribution Multinomial(φ^aux_15); suppose the 1200th word is drawn, so that w_{1,1} = "TV".
Solving DLDA requires solving for the parameters θ and φ. Concrete solution methods include the variational method (Variational Method), expectation propagation (Expectation Propagation), and Gibbs sampling (Gibbs Sampling). In the present embodiment, the Gibbs sampling algorithm described below is used to solve for the parameters θ and φ.
1. For all auxiliary text topics z ∈ [1, ..., K_aux], all words w, and all texts d, set n^{aux,z}_w = 0 and n^{aux,z}_d = 0. For all short text topics z ∈ [1, ..., K_tar], all words w, and all texts d, set n^{tar,z}_w = 0 and n^{tar,z}_d = 0. For all texts d, set n^{aux}_d = 0 and n^{tar}_d = 0.
Here, n^{aux,z}_w and n^{tar,z}_w denote the number of times word w was selected when auxiliary text topic z or short text topic z, respectively, was selected; n^{aux,z}_d and n^{tar,z}_d denote the number of times text d selected auxiliary text topic z or short text topic z, respectively; n^{aux}_d and n^{tar}_d denote the number of words of text d that selected auxiliary text topics or short text topics, respectively.
2. For each text collection c ∈ [aux, tar],
2.1 for each text d ∈ [1, ..., D_c],
2.1.1 for each word w,
2.1.1.1 draw a switch value x from the binomial distribution π = [0.5, 0.5];
2.1.1.2 if x = aux, draw a topic z from the multinomial distribution Multinomial(1/K_aux), then set n^{aux,z}_d = n^{aux,z}_d + 1, n^{aux,z}_w = n^{aux,z}_w + 1, and n^{aux}_d = n^{aux}_d + 1;
2.1.1.3 if x = tar, draw a topic z from the multinomial distribution Multinomial(1/K_tar), then set n^{tar,z}_d = n^{tar,z}_d + 1, n^{tar,z}_w = n^{tar,z}_w + 1, and n^{tar}_d = n^{tar}_d + 1.
3. Loop:
3.1 for each text collection c ∈ [aux, tar],
3.1.1 for each text d ∈ [1, ..., D_c],
3.1.1.1 for each word w,
3.1.1.1.1 let x and z be the text collection and topic obtained for w by the sampling of the previous iteration, and set n^{x,z}_d = n^{x,z}_d - 1, n^{x,z}_w = n^{x,z}_w - 1, and n^{x}_d = n^{x}_d - 1; then
3.1.1.1.2 draw a switch value x from the binomial distribution π = [n^{aux}_d / (n^{aux}_d + n^{tar}_d), n^{tar}_d / (n^{aux}_d + n^{tar}_d)];
3.1.1.1.3 if x = aux, draw a topic z from the multinomial distribution determined by formula (1) (see below), then set n^{aux,z}_d = n^{aux,z}_d + 1, n^{aux,z}_w = n^{aux,z}_w + 1, and n^{aux}_d = n^{aux}_d + 1;
3.1.1.1.4 if x = tar, draw a topic z from the multinomial distribution determined by formula (2) (see below), then set n^{tar,z}_d = n^{tar,z}_d + 1, n^{tar,z}_w = n^{tar,z}_w + 1, and n^{tar}_d = n^{tar}_d + 1.
3.2 If the convergence condition is reached, compute the parameters according to formulas (3) and (4) (see below) and exit the loop; otherwise, continue the loop.
Algorithm 3-Gibbs sampling
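A condensed NumPy sketch of the per-word update of Algorithm 3 is given below (initialization omitted). The count-array layout, the function name, and the folding of steps 3.1.1.1.2-3.1.1.1.4 into one joint draw over (x, z) pairs are assumptions of this sketch; the joint probabilities are the right-hand sides of formulas (1) and (2) described below:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 5000                       # vocabulary size, as in the example above
K = {"aux": 15, "tar": 10}     # topic counts per collection

def resample_word(d, w, x_old, z_old, n_wz, n_dz, n_d, alpha, beta, gamma_d):
    """One Gibbs step for one word occurrence (steps 3.1.1.1.1-3.1.1.1.4).

    Assumed layout: n_wz[c] is a (V, K[c]) word-topic count array, n_dz[c]
    a (D, K[c]) text-topic count array, n_d[c] a (D,) array of words of each
    text assigned to collection c; gamma_d[c] is the gamma component for
    choosing collection c, i.e. gamma_x^{c_i} with x = c.
    """
    # Step 3.1.1.1.1: remove the current word's previous assignment.
    n_wz[x_old][w, z_old] -= 1
    n_dz[x_old][d, z_old] -= 1
    n_d[x_old][d] -= 1

    # Joint draw over (x, z): formulas (1) and (2) give the unnormalized
    # probability of every (collection, topic) pair.
    probs, pairs = [], []
    for c in ("aux", "tar"):
        word_part = (n_wz[c][w] + beta[c]) / (n_wz[c].sum(axis=0) + V * beta[c])
        text_part = (n_dz[c][d] + alpha[c]) / (n_dz[c][d].sum() + K[c] * alpha[c])
        switch_part = n_d[c][d] + gamma_d[c]
        p = word_part * text_part * switch_part
        probs.extend(p)
        pairs.extend((c, z) for z in range(K[c]))
    probs = np.asarray(probs)
    x_new, z_new = pairs[rng.choice(len(pairs), p=probs / probs.sum())]

    # Record the new assignment.
    n_wz[x_new][w, z_new] += 1
    n_dz[x_new][d, z_new] += 1
    n_d[x_new][d] += 1
    return x_new, z_new
```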
Formulas (1)-(4), referred to in the above Gibbs sampling algorithm, are described in detail below.
Formula (1): for all auxiliary text topics z ∈ [1, ..., K_aux],

p(x_i = x, z_i = z | w_i = w, x_{¬i}, z_{¬i}, w_{¬i}, α, β, γ) ∝ [ (n^{aux,z}_{w,¬i} + β^c_w) / ∑_{v=1}^{V} (n^{aux,z}_{v,¬i} + β^c_v) ] · [ (n^{aux,z}_{d,¬i} + α^c_z) / ∑_{k=1}^{K_aux} (n^{aux,k}_{d,¬i} + α^c_k) ] · (n^{aux}_{d,¬i} + γ^{c_i}_x)
The meaning of formula (1) is: the probability of sampling text collection x and topic z is proportional to three values (the three factors multiplied on the right-hand side of formula (1)). In the third factor, n^{aux}_{d,¬i} + γ^{c_i}_x, the count n^{aux}_{d,¬i} is the number of times an auxiliary text topic was selected, over all words other than the current word, in the sampling of the previous iteration; the purpose of adding γ^{c_i}_x is to prevent this count from being 0. The meaning of the second factor is the proportion with which text d chose auxiliary text topic z in the previous iteration, excluding the current word; the meaning of the first factor is the proportion with which word w was chosen when auxiliary text topic z was chosen in the previous iteration, excluding the current word. The subscript ¬i means "excluding the selection of the current word (w_i)" (corresponding to step 3.1.1.1.1).
Formula (2): for all short text topics z ∈ [1, ..., K_tar],

p(x_i = x, z_i = z | w_i = w, x_{¬i}, z_{¬i}, w_{¬i}, α, β, γ) ∝ [ (n^{tar,z}_{w,¬i} + β^c_w) / ∑_{v=1}^{V} (n^{tar,z}_{v,¬i} + β^c_v) ] · [ (n^{tar,z}_{d,¬i} + α^c_z) / ∑_{k=1}^{K_tar} (n^{tar,k}_{d,¬i} + α^c_k) ] · (n^{tar}_{d,¬i} + γ^{c_i}_x)
The meaning of formula (2) is similar to that of formula (1), with the auxiliary text topics replaced by the short text topics. Here c_i denotes the text collection to which text d belongs (i.e., the auxiliary text collection or the short text collection).
Formula (3): θ^c_{d,z} = (n^{c,z}_d + α^c_z) / ∑_{k=1}^{K_c} (n^{c,k}_d + α^c_k)
Formula (4): φ^c_{z,w} = (n^{c,z}_w + β^c_w) / ∑_{v=1}^{V} (n^{c,z}_v + β^c_v), where c ∈ [aux, tar].
The convergence condition can take various forms, for example: reaching a predefined number of iterations, the parameters θ and φ changing very little, or the likelihood function of the text collection changing very little.
Solving with the above Gibbs sampling algorithm yields, for each short text, the probability of corresponding to each topic of the auxiliary text collection and each topic of the short text collection (i.e., the result of formula (3)).
The vector generation unit 320 normalizes the probabilities of the corresponding topics and then generates a vector. Note that the vector here is generated over the union of the topics of the auxiliary text collection and the topics of the short text collection. For any short text d, each dimension of the vector comes from formula (3), with c taking aux or tar and z taking each topic:
f_d = [ θ^{aux}_{d,1}/S^{aux}_1, ..., θ^{aux}_{d,K_aux}/S^{aux}_{K_aux}, θ^{tar}_{d,1}/S^{tar}_1, ..., θ^{tar}_{d,K_tar}/S^{tar}_{K_tar} ]
where the superscripts range over {aux, tar} and S^c_z is the normalization factor of topic z. For example, the vector generation unit 320 may generate a vector such as f_d = [0.1, 0.5, 0, 0, 0.02, ..., 0].
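A sketch of this construction, taking the θ matrices of formula (3) as input and treating S as per-topic normalization factors (an assumption, since the text does not define S explicitly):

```python
import numpy as np

# Rows: formula (3) values theta_{d,z}^c for each short text d; a tiny
# hypothetical example with K_aux = K_tar = 2.
theta = {"aux": np.array([[0.2, 0.8],
                          [0.5, 0.5]]),
         "tar": np.array([[0.9, 0.1],
                          [0.3, 0.7]])}

# Assumed normalization factors S_z^c: per-topic sums over all short texts.
S = {c: theta[c].sum(axis=0) for c in ("aux", "tar")}

# f_d concatenates the normalized aux-topic and tar-topic dimensions.
f = np.concatenate([theta["aux"] / S["aux"], theta["tar"] / S["tar"]], axis=1)
print(f.shape)   # (number of short texts, K_aux + K_tar)
```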
The clustering unit 330 performs short text clustering based on the vectors generated by the vector generation unit 320. Specifically, after the above vectors have been generated for all short texts, a clustering method such as K-means can be used to perform short text clustering, thereby obtaining the clustering result of the short texts.
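For example, with scikit-learn's K-means (one of several clustering methods that can be used here; the data and cluster count are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# f: the (number of short texts, K_aux + K_tar) matrix built above.
f = np.array([[0.10, 0.80, 0.90, 0.10],
              [0.12, 0.78, 0.88, 0.12],
              [0.85, 0.15, 0.10, 0.90]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(f)   # cluster index of each short text
print(labels)                    # e.g. [0 0 1]
```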
Fig. 5 shows the flowchart of a short text clustering method 50 according to an embodiment of the present invention. As shown in Fig. 5, the short text clustering method 50 comprises steps S510-S530.
At step S510, topic analysis is performed on each text in the auxiliary text collection and the short text collection to obtain their respective topics. Specifically, the DLDA algorithm described above can be used to analyze the topics of the auxiliary texts and the short texts. In solving the DLDA model, the variational method (Variational Method), expectation propagation (Expectation Propagation), or Gibbs sampling (Gibbs Sampling), among others, can be used. Preferably, the Gibbs sampling algorithm described above is used to realize DLDA.
At step S520, based on the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection, a vector is generated after normalizing those probabilities. Preferably, the vector is generated over the union of the topics of the auxiliary text collection and the topics of the short text collection.
At step S530, the short texts are clustered based on the generated vectors. For example, after the vectors have been generated for all short texts, a clustering method such as K-means can be used to perform short text clustering.
The results of applying the short text clustering apparatus or method of the present invention to an online advertisement collection are described below. Suppose 182,209 online advertisements covering 42 product series have been collected from a commercial website, each text containing 29.06 words on average. In addition, 99,737 web pages have been collected according to the product names as auxiliary texts, each text containing 560.4 words on average. Each product series serves as one cluster.
The evaluation criterion H(x̃) adopts the following entropy form:

H(x̃) = -∑_{c∈C} p(c|x̃) log_2 p(c|x̃),

where x̃ denotes a cluster produced by the computer, C denotes the set of correct cluster classes, and c is one correct cluster (a certain product series), with

p(c|x̃) = |{x ∈ x̃ : l(x) = c}| / |x̃|,

where l(x) denotes the correct cluster label of short text x and |x̃| denotes the number of texts in the cluster. The smaller H(x̃) is, the better the algorithm performs.
Table 1 below lists the results of applying the DLDA method of the present invention and several other methods to the online advertisement collection:
Table 1 [reproduced in the original as an image; it lists the entropy H(x̃) obtained by each method, with DLDA's value the smallest]
In Table 1, Direct denotes directly applying a clustering method. LDA-one generates topics on the auxiliary text collection and then performs inference for the short texts over these topics (similar to reference 1). LDA-both generates topics on the union of the auxiliary text collection and the short text collection. STC is a clustering method adopted for domain transfer. It can be seen that, because the entropy H(x̃) of DLDA's result is the smallest, DLDA performs best at short text clustering.
Although the present invention has been shown above in conjunction with its preferred embodiments, those skilled in the art will appreciate that various modifications, substitutions, and changes may be made to the present invention without departing from its spirit and scope. Therefore, the present invention should not be limited by the above embodiments, but should be defined by the appended claims and their equivalents.

Claims (12)

1. A short text clustering apparatus, comprising:
a topic analysis unit, which performs topic analysis on each text in an auxiliary text collection and a short text collection to obtain the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection;
a vector generation unit, which normalizes the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection to generate a vector; and
a clustering unit, which clusters the short texts in the short text collection based on the generated vectors.
2. The short text clustering apparatus according to claim 1, wherein the topic analysis unit determines, by a switch parameter, whether a word in each text of the auxiliary text collection and the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if the word corresponds to a topic of the auxiliary text collection, the topic analysis unit performs topic analysis through a first latent Dirichlet allocation, and if the word corresponds to a topic of the short text collection, the topic analysis unit performs topic analysis through a second latent Dirichlet allocation.
3. The short text clustering apparatus according to claim 2, wherein the topic analysis unit uses a Gibbs sampling algorithm to estimate the parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation, wherein the sampling probability of a topic of the auxiliary text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration, and the sampling probability of a topic of the short text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration.
4. The short text clustering apparatus according to claim 1, wherein the vector generation unit generates the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
5. The short text clustering apparatus according to claim 2, wherein the value of the switch parameter obeys a binomial distribution.
6. The short text clustering apparatus according to claim 2, wherein the topic analysis unit determines the switch parameter so as to ensure that a word in an auxiliary text has a higher probability of corresponding to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text has a higher probability of corresponding to a topic of the short text collection than to a topic of the auxiliary text collection.
7. A short text clustering method, comprising:
a topic analysis step of performing topic analysis on each text in an auxiliary text collection and a short text collection to obtain the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection;
a vector generation step of normalizing the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection to generate a vector; and
a clustering step of clustering the short texts in the short text collection based on the generated vectors.
8. The short text clustering method according to claim 7, wherein the topic analysis step comprises: determining, by a switch parameter, whether a word in each text of the auxiliary text collection and the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if the word corresponds to a topic of the auxiliary text collection, performing topic analysis through a first latent Dirichlet allocation, and if the word corresponds to a topic of the short text collection, performing topic analysis through a second latent Dirichlet allocation.
9. The short text clustering method according to claim 8, wherein a Gibbs sampling algorithm is used to estimate the parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation, wherein the sampling probability of a topic of the auxiliary text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration, and the sampling probability of a topic of the short text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration.
10. The short text clustering method according to claim 7, wherein the vector generation step comprises: generating the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
11. The short text clustering method according to claim 8, wherein the value of the switch parameter obeys a binomial distribution.
12. The short text clustering method according to claim 8, wherein the switch parameter is determined so as to ensure that a word in an auxiliary text has a higher probability of corresponding to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text has a higher probability of corresponding to a topic of the short text collection than to a topic of the auxiliary text collection.
CN201110160561.4A 2011-06-15 2011-06-15 Short text clustering apparatus and method Active CN102831119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110160561.4A CN102831119B (en) Short text clustering apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110160561.4A CN102831119B (en) Short text clustering apparatus and method

Publications (2)

Publication Number Publication Date
CN102831119A true CN102831119A (en) 2012-12-19
CN102831119B CN102831119B (en) 2016-08-17

Family

ID=47334262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110160561.4A Active CN102831119B (en) Short text clustering apparatus and method

Country Status (1)

Country Link
CN (1) CN102831119B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530316A (en) * 2013-09-12 2014-01-22 浙江大学 Science subject extraction method based on multi-view learning
CN103729422A (en) * 2013-12-23 2014-04-16 武汉传神信息技术有限公司 Information fragment associative output method and system
CN104573070A (en) * 2015-01-26 2015-04-29 清华大学 Text clustering method special for mixed length text sets
CN104850617A (en) * 2015-05-15 2015-08-19 百度在线网络技术(北京)有限公司 Short text processing method and apparatus
CN107992477A (en) * 2017-11-30 2018-05-04 北京神州泰岳软件股份有限公司 Text subject determines method, apparatus and electronic equipment
CN108228721A (en) * 2017-12-08 2018-06-29 复旦大学 Fast text clustering method on large corpora
CN108628875A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of extracting method of text label, device and server
CN111090995A (en) * 2019-11-15 2020-05-01 合肥工业大学 Short text topic identification method and system
CN111897912A (en) * 2020-07-13 2020-11-06 上海乐言信息科技有限公司 Active learning short text classification method and system based on sampling frequency optimization
CN113407679A (en) * 2021-06-30 2021-09-17 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1828608A (en) * 2006-04-13 2006-09-06 北大方正集团有限公司 Multiple file summarization method based on sentence relation graph
CN101187919A (en) * 2006-11-16 2008-05-28 北大方正集团有限公司 Method and system for abstracting batch single document for document set
US20090164417A1 (en) * 2004-09-30 2009-06-25 Nigam Kamal P Topical sentiments in electronically stored communications

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090164417A1 (en) * 2004-09-30 2009-06-25 Nigam Kamal P Topical sentiments in electronically stored communications
CN1828608A (en) * 2006-04-13 2006-09-06 北大方正集团有限公司 Multiple file summarization method based on sentence relation graph
CN101187919A (en) * 2006-11-16 2008-05-28 北大方正集团有限公司 Method and system for abstracting batch single document for document set

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530316A (en) * 2013-09-12 2014-01-22 浙江大学 Science subject extraction method based on multi-view learning
CN103729422A (en) * 2013-12-23 2014-04-16 武汉传神信息技术有限公司 Information fragment associative output method and system
CN104573070B (en) * 2015-01-26 2018-06-15 清华大学 A kind of Text Clustering Method for mixing length text set
CN104573070A (en) * 2015-01-26 2015-04-29 清华大学 Text clustering method special for mixed length text sets
CN104850617A (en) * 2015-05-15 2015-08-19 百度在线网络技术(北京)有限公司 Short text processing method and apparatus
CN104850617B (en) * 2015-05-15 2018-04-20 百度在线网络技术(北京)有限公司 Short text processing method and processing device
CN108628875A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of extracting method of text label, device and server
CN108628875B (en) * 2017-03-17 2022-08-30 腾讯科技(北京)有限公司 Text label extraction method and device and server
CN107992477A (en) * 2017-11-30 2018-05-04 北京神州泰岳软件股份有限公司 Text subject determines method, apparatus and electronic equipment
CN108228721A (en) * 2017-12-08 2018-06-29 复旦大学 Fast text clustering method on large corpora
CN108228721B (en) * 2017-12-08 2021-06-04 复旦大学 Fast text clustering method on large corpus
CN111090995A (en) * 2019-11-15 2020-05-01 合肥工业大学 Short text topic identification method and system
CN111090995B (en) * 2019-11-15 2023-03-31 合肥工业大学 Short text topic identification method and system
CN111897912A (en) * 2020-07-13 2020-11-06 上海乐言信息科技有限公司 Active learning short text classification method and system based on sampling frequency optimization
CN111897912B (en) * 2020-07-13 2021-04-06 上海乐言科技股份有限公司 Active learning short text classification method and system based on sampling frequency optimization
CN113407679A (en) * 2021-06-30 2021-09-17 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium
CN113407679B (en) * 2021-06-30 2023-10-03 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN102831119B (en) 2016-08-17

Similar Documents

Publication Publication Date Title
CN102831119A (en) Short text clustering equipment and short text clustering method
CN103390051B (en) A kind of topic detection and tracking method based on microblog data
CN103699525B (en) A kind of method and apparatus automatically generating summary based on text various dimensions feature
Kovachev et al. Learn-as-you-go: new ways of cloud-based micro-learning for the mobile web
CN105095394B (en) webpage generating method and device
CN110866126A (en) College online public opinion risk assessment method
CN105512245A (en) Enterprise figure building method based on regression model
CN104536956A (en) A Microblog platform based event visualization method and system
CN106250550A (en) A kind of method and apparatus of real time correlation news content recommendation
CN105608200A (en) Network public opinion tendency prediction analysis method
CN104090955A (en) Automatic audio/video label labeling method and system
CN104008203A (en) User interest discovering method with ontology situation blended in
Wicaksono et al. Automatically building a corpus for sentiment analysis on Indonesian tweets
CN101819585A (en) Device and method for constructing forum event dissemination pattern
CN103020712B (en) A kind of distributed sorter of massive micro-blog data and method
CN109670039A (en) Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering
CN104850650A (en) Short-text expanding method based on similar-label relation
Tang et al. Social media-based disaster research: Development, trends, and obstacles
CN103823890A (en) Microblog hot topic detection method and device aiming at specific group
CN104978332A (en) UGC label data generating method, UGC label data generating device, relevant method and relevant device
CN103886020A (en) Quick search method of real estate information
CN104102658A (en) Method and device for mining text contents
CN103077234A (en) Voice website navigation system and method
CN102163189B (en) Method and device for extracting evaluative information from critical texts
CN103942240A (en) Method for building intelligent substation comprehensive data information application platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20171207

Address after: 18th Floor, Block B, No. 18 Zhongguancun Street, Haidian District, Beijing 100190

Patentee after: Datatang (Beijing) Technology Co., Ltd.

Address before: 20th Floor, Shining Tower, No. 35 Xueyuan Road, Haidian District, Beijing 100191

Patentee before: NEC (China) Co., Ltd.