CN102831119A - Short text clustering equipment and short text clustering method - Google Patents
- Publication number
- CN102831119A, CN2011101605614A, CN201110160561A
- Authority
- CN
- China
- Prior art keywords
- text
- theme
- short text
- short
- collection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention provides a short text clustering apparatus comprising a topic analysis unit, a vector generation unit, and a clustering unit. The topic analysis unit performs topic analysis on each text in an auxiliary text collection and a short text collection, obtaining, for each short text in the short text collection, the probabilities that it corresponds to topics of the auxiliary text collection and to topics of the short text collection. The vector generation unit normalizes these probabilities to generate a vector, and the clustering unit clusters the short texts in the short text collection based on the generated vectors. The invention also provides a corresponding short text clustering method. The apparatus and method discover the auxiliary-text topics and the short-text topics separately, and can therefore cluster short texts more accurately.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a short text clustering apparatus and method.
Background technology
With the widespread use of SMS, microblogs, search engines, online advertising, and the like, short texts are used more and more frequently. These texts are usually short: an SMS message, for example, cannot exceed 70 characters, and a search-engine result snippet generally contains only a few dozen words.
Short texts differ considerably from long texts (for example, news articles). In a long text, a topic can be described fully, so a reader can learn nearly all of the topic's content from the text alone. In contrast, because a short text's length is limited, it usually describes only the core of a topic, and much related information is omitted.
Traditional text-mining methods are usually designed for long texts and run into difficulty when applied to short texts, for example in clustering. Clustering typically relies on word co-occurrence information, and short texts contain far fewer words and co-occurrences than long texts, so clustering quality suffers. Consider the following two news texts L1 and L2:
L1: "Tsinghua University's Fourth Teaching Building has been renamed the 'Jeanswest Building', drawing ridicule on campus and online. The main objection is that the image of a Tsinghua teaching building does not fit the Jeanswest apparel brand. From the standpoint of due procedure in naming university buildings, Tsinghua clearly leaves room for criticism. Setting that aside, the substantive question scholars at Tsinghua are debating is whether the 'Jeanswest Building' damages Tsinghua's image, that is, the brand-image angle of naming a teaching building."
L2: "Recently, a teaching building at Tsinghua University was named 'Jeanswest', causing a stir online. Isn't Jeanswest an apparel brand? How did a Tsinghua teaching building become 'Jeanswest'? At noon on the 23rd, a plaque reading 'Jeanswest Building' was hung on the outer wall of Tsinghua's Fourth Teaching Building. Below and to the right hangs another plaque devoted to introducing the Jeanswest apparel brand. Naming the teaching building after a brand has provoked dispute among Tsinghua students and netizens. Some feel universities are becoming too commercialized and should not be named after enterprises, while Sina microblog user Young_pig argues that the enterprise sponsors the school and the naming does not harm the school's image."
Because L1 and L2 share words such as "Tsinghua", "Fourth Teaching Building", "Jeanswest", "apparel", "university", "naming", and "image", it is easy to judge that they are very similar and to group them into one cluster. The following two short texts S1 and S2, however, are not so easily clustered together, because the only important word they share is "Tsinghua" (the word "too" is so common that it is usually removed before clustering):
S1: "Some say the Jeanswest Building does not fit Tsinghua's image"
S2: "Isn't that just an apparel brand? Tsinghua's naming is too commercialized"
To improve the accuracy of short text clustering, the prior art has proposed using auxiliary information. For example, to cluster short texts such as S1 and S2, long texts such as L1 and L2 can be introduced as auxiliary information: S1 is similar to L1 (sharing words such as "Jeanswest", "Tsinghua", "image", "fit"), and S2 is similar to L2 (sharing words such as "apparel", "Tsinghua", "naming", "commercialization"). Since L1 and L2 are similar to each other, S1 and S2 can be regarded as similar as well and grouped into one cluster.
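The bridging argument above can be made concrete with a toy word-overlap (Jaccard) computation. The token lists below are hypothetical English stand-ins for the segmented Chinese sentences, chosen only to mirror the overlaps described in the text.

```python
# Toy illustration of the bridging idea: short texts s1 and s2 share
# almost no words, but each overlaps strongly with one of two long
# texts l1 and l2, which in turn overlap with each other.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

l1 = ["Tsinghua", "teaching-building", "Jeanswest", "apparel", "brand",
      "image", "naming", "campus", "procedure"]
l2 = ["Tsinghua", "teaching-building", "Jeanswest", "apparel", "brand",
      "naming", "commercialization", "sponsor", "network"]
s1 = ["Jeanswest", "building", "Tsinghua", "image"]
s2 = ["apparel", "brand", "Tsinghua", "naming", "commercialization"]

print(jaccard(s1, s2))   # the two short texts barely overlap
print(jaccard(s1, l1))   # but each is close to one long text
print(jaccard(s2, l2))
print(jaccard(l1, l2))   # and the two long texts are close to each other
```

With these stand-in tokens, the direct similarity of S1 and S2 is far lower than each short text's similarity to its matching long text, which is exactly the gap the auxiliary texts bridge.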
Reference 1 (X.-H. Phan, L.-M. Nguyen, S. Horiguchi, "Learning to classify short and sparse text & web with hidden topics from large-scale data collections", WWW 2008) describes a method of clustering short texts with the help of auxiliary texts. As shown in Fig. 1, the method comprises the following steps:
At step S100, topic analysis is performed on the auxiliary text collection to obtain a number of topics and their associated vocabulary. Specifically, reference 1 uses texts downloaded from Wikipedia as the auxiliary information, forming the auxiliary text collection. The topic analysis uses Latent Dirichlet Allocation (LDA); Fig. 2 shows the LDA model. LDA is a generative model whose central idea is to simulate the generative process of text: for each word, first select a topic from a distribution, then select a word from that topic. Referring to Fig. 2, the LDA algorithm proceeds as follows:
1. For each topic k ∈ [1, K], draw a sample from the Dir(β) distribution to obtain the word distribution φ_k of that topic.
2. For each text m ∈ [1, M]:
2.1. Draw a sample from the Dir(α) distribution to obtain the topic distribution θ_m of that text.
2.2. For each word n: draw a topic z_{m,n} from Multinomial(θ_m), then draw the word w_{m,n} from Multinomial(φ_{z_{m,n}}).
Algorithm 1 - LDA
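The generative process of Algorithm 1 can be sketched as follows. This is a didactic simulation with toy sizes (the topic count, vocabulary size, and hyperparameter values here are illustrative assumptions, not values from the patent).

```python
# Minimal sketch of the LDA generative process (Algorithm 1): sample a
# word distribution per topic, then for each document sample a topic
# mixture and generate words from it.
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 20, 5, 8        # topics, vocabulary, documents, words/doc
alpha, beta = 0.3, 0.01         # symmetric hyperparameters

phi = rng.dirichlet([beta] * V, size=K)      # step 1: word dist per topic
docs = []
for m in range(M):
    theta = rng.dirichlet([alpha] * K)       # step 2.1: topic mixture of doc m
    words = []
    for n in range(N):                       # step 2.2: for each word slot
        z = rng.choice(K, p=theta)           # pick a topic first...
        w = rng.choice(V, p=phi[z])          # ...then a word from that topic
        words.append(w)
    docs.append(words)

print(len(docs), len(docs[0]))
```

Running the loop forward like this is only a simulation; inference (below) runs the process in reverse to recover φ and θ from observed words.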
Here α determines the prior weight distribution of the topics, and β the prior distribution of the words within each topic. They are predetermined parameters, called hyperparameters.
The task of LDA is to estimate the parameters φ and θ_d. The joint density of all observed and hidden variables is:

p(w_m, z_m, θ_m, Φ | α, β) = p(θ_m | α) p(Φ | β) Π_{n=1}^{N_m} p(z_{m,n} | θ_m) p(w_{m,n} | φ_{z_{m,n}})

The likelihood of one text is obtained by integrating out the hidden variables:

p(w_m | α, β) = ∫∫ p(θ_m | α) p(Φ | β) Π_{n=1}^{N_m} Σ_{k=1}^{K} p(z_{m,n}=k | θ_m) p(w_{m,n} | φ_k) dΦ dθ_m

and the likelihood over the whole text collection is the product over all texts:

p(W | α, β) = Π_{m=1}^{M} p(w_m | α, β)
In theory, φ and θ could be obtained by maximizing the above likelihood; however, the maximization has no closed-form solution, so in practice approximate estimation is used. For example, reference 1 uses Gibbs sampling to estimate the parameters. Reference 2 (Thomas L. Griffiths, Mark Steyvers, "Finding scientific topics", Proceedings of the National Academy of Sciences of the United States of America, Vol. 101, Suppl. 1, 6 April 2004, pp. 5228-5235) and reference 3 (Gregor Heinrich, "Parameter estimation for text analysis", Technical Report, 2004) describe in detail the process and algorithms for realizing LDA with Gibbs sampling.
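A collapsed Gibbs sampler of the kind described in references 2 and 3 can be sketched compactly: each word's topic is resampled from a conditional proportional to (topic-word count + β) times (document-topic count + α). The toy corpus and hyperparameters below are assumptions for illustration, not the patent's data.

```python
# Compact collapsed Gibbs sampler for LDA on a toy corpus: docs 0-1 use
# word ids 0-3, docs 2-3 use word ids 4-7, so two topics should separate.
import numpy as np

rng = np.random.default_rng(1)
docs = [[0, 0, 1, 2], [0, 1, 1, 3], [4, 5, 5, 6], [4, 4, 6, 7]]
K, V, alpha, beta = 2, 8, 0.5, 0.1

z = [[rng.integers(K) for _ in d] for d in docs]   # random topic init
nkw = np.zeros((K, V)); ndk = np.zeros((len(docs), K))
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        nkw[z[d][i], w] += 1; ndk[d, z[d][i]] += 1

for _ in range(200):                               # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            nkw[k, w] -= 1; ndk[d, k] -= 1         # remove current word
            p = (nkw[:, w] + beta) / (nkw.sum(1) + V * beta) * (ndk[d] + alpha)
            k = rng.choice(K, p=p / p.sum())       # resample its topic
            nkw[k, w] += 1; ndk[d, k] += 1
            z[d][i] = k

theta = (ndk + alpha) / (ndk + alpha).sum(1, keepdims=True)
print(theta.round(2))  # with this data the two word groups typically separate
```

The removal of the current word before computing the conditional is the same "exclude the current word's sample" device that the DLDA sampler of the invention uses below.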
At step S110, inference is performed on the short text collection based on the topics obtained in step S100, yielding the topics corresponding to the short texts. The inference also uses Gibbs sampling.
At step S120, a training sample set is constructed from the result of step S110. The training data take vector form, that is, each short text corresponds to one vector. A vector is generated for every short text in the short text collection, and each short text is then given a class label; together these constitute the training sample set.
At step S130, a machine learning method is selected and trained on the sample set to obtain a classification model. Many methods are available, such as decision trees, SVMs, and maximum entropy; reference 1 uses maximum entropy.
However, in step S110, where inference is performed on the short texts over the topics formed from the auxiliary texts, reference 1 assumes that the topics of the short texts are all covered by the auxiliary texts. In many real situations this assumption does not hold, or holds only poorly, because many situations and events are new, and no knowledge base, however comprehensive, can be guaranteed to cover every topic that arises. In such cases the auxiliary texts cover only some of the short texts' topics. The method of reference 1 therefore cannot discover or exploit the new topics intrinsic to the short texts, which degrades classification or clustering quality.
Summary of the invention
To solve the above technical problem, the present invention proposes performing topic analysis on each text in both an auxiliary text collection and a short text collection, so as to obtain, for each short text in the short text collection, the probabilities that it corresponds to topics of the auxiliary text collection and to topics of the short text collection. The invention does not require an auxiliary text collection that covers all topics of the short texts; it only requires the auxiliary texts to be partially related to the short texts. Specifically, the invention uses two coupled Latent Dirichlet Allocations (Double Latent Dirichlet Allocation, DLDA) to cluster the short texts. By building two groups of LDA and adding a switch between them, DLDA discovers the auxiliary-text topics and the short-text topics separately, and can determine the probability that any short text corresponds to an auxiliary-text topic or a short-text topic.
According to one aspect of the present invention, a short text clustering apparatus is provided, comprising: a topic analysis unit that performs topic analysis on each text in an auxiliary text collection and a short text collection, to obtain the probability that each short text in the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; a vector generation unit that normalizes each short text's probabilities over the topics of the auxiliary text collection and the short text collection to generate a vector; and a clustering unit that clusters the short texts in the short text collection based on the generated vectors.
Preferably, the topic analysis unit determines through a switch parameter whether a word in a text of the auxiliary text collection or the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if it corresponds to a topic of the auxiliary text collection, said topic analysis unit performs topic analysis through a first Latent Dirichlet Allocation, and if it corresponds to a topic of the short text collection, through a second Latent Dirichlet Allocation.
Preferably, the topic analysis unit uses a Gibbs sampling algorithm to estimate the parameters used in the first and second Latent Dirichlet Allocations, wherein the sampling frequency of a topic of the auxiliary text collection is proportional to the number of times that topic was selected for all other words in the previous iteration, excluding the current word's sample, and the sampling frequency of a topic of the short text collection is proportional to the number of times that topic was selected for all other words in the previous iteration, excluding the current word's sample.
Preferably, the vector generation unit generates the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
Preferably, the value of the switch parameter obeys a binomial distribution.
Preferably, the topic analysis unit determines the switch parameter so as to ensure that a word in an auxiliary text is more likely to correspond to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text is more likely to correspond to a topic of the short text collection than to a topic of the auxiliary text collection.
According to another aspect of the present invention, a short text clustering method is provided, comprising: a topic analysis step of performing topic analysis on each text in an auxiliary text collection and a short text collection, to obtain the probability that each short text in the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; a vector generation step of normalizing each short text's probabilities over the topics of the auxiliary text collection and the short text collection to generate a vector; and a clustering step of clustering the short texts in the short text collection based on the generated vectors.
Preferably, the topic analysis step comprises: determining through a switch parameter whether a word in a text of the auxiliary text collection or the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; and performing topic analysis through a first Latent Dirichlet Allocation if it corresponds to a topic of the auxiliary text collection, or through a second Latent Dirichlet Allocation if it corresponds to a topic of the short text collection.
Preferably, a Gibbs sampling algorithm is used to estimate the parameters used in the first and second Latent Dirichlet Allocations, wherein the sampling frequency of a topic of the auxiliary text collection is proportional to the number of times that topic was selected for all other words in the previous iteration, excluding the current word's sample, and the sampling frequency of a topic of the short text collection is proportional to the number of times that topic was selected for all other words in the previous iteration, excluding the current word's sample.
Preferably, the vector generation step comprises generating the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
Preferably, the value of the switch parameter obeys a binomial distribution.
Preferably, the switch parameter is determined so as to ensure that a word in an auxiliary text is more likely to correspond to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text is more likely to correspond to a topic of the short text collection than to a topic of the auxiliary text collection.
The present invention discovers the auxiliary-text topics and the short-text topics separately, and can therefore cluster short texts more accurately.
Description of drawings
The above and other features of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows a flowchart of a prior-art short text clustering method;
Fig. 2 shows a block diagram of the LDA model adopted by the short text clustering method of Fig. 1;
Fig. 3 shows a block diagram of a short text clustering apparatus according to an embodiment of the invention;
Fig. 4 shows a block diagram of the DLDA model adopted by the short text clustering apparatus according to an embodiment of the invention; and
Fig. 5 shows a flowchart of a short text clustering method according to an embodiment of the invention.
Embodiment
The principles and implementation of the present invention will become apparent from the following description of specific embodiments taken in conjunction with the drawings. It should be noted that the invention is not limited to the specific embodiments described below. In addition, for simplicity, detailed descriptions of known techniques not directly related to the invention are omitted.
Fig. 3 shows a block diagram of a short text clustering apparatus 30 according to an embodiment of the invention. As shown in Fig. 3, the short text clustering apparatus 30 comprises a topic analysis unit 310, a vector generation unit 320, and a clustering unit 330.
The topic analysis unit 310 performs topic analysis on each text in the auxiliary text collection and the short text collection to obtain their respective topics. In a specific embodiment, the topic analysis unit 310 adopts the DLDA model shown in Fig. 4. As can be seen from Fig. 4, DLDA comprises two groups of LDA, corresponding respectively to the topic analysis of the auxiliary texts and of the short texts (where "aux" denotes the auxiliary texts and "tar" the short texts). To coordinate the two groups of LDA, a switch variable (with a Beta prior parameterized by γ) is introduced; for each word, the switch decides whether its topic is selected from the auxiliary-text topics or from the short-text topics.
In the present embodiment, the topic analysis unit 310 analyzes the topics of the auxiliary texts and the short texts through the following algorithm:
1. For each topic z ∈ [1, ..., K_aux] of the auxiliary text collection, draw a sample from the Dir(β_aux) distribution to obtain the word distribution φ_z^aux of that topic.
2. For each topic z ∈ [1, ..., K_tar] of the short text collection, draw a sample from the Dir(β_tar) distribution to obtain the word distribution φ_z^tar of that topic.
3. For each text collection c ∈ [aux, tar]:
3.1. For each text d ∈ [1, ..., D_c]:
3.2. Draw a sample from the Dir(α_aux) distribution to obtain a topic distribution θ_d^aux over the auxiliary-text topics.
3.3. Draw a sample from the Dir(α_tar) distribution to obtain a topic distribution θ_d^tar over the short-text topics.
3.4. Draw a sample from the Beta(γ_c) distribution to obtain a binomial distribution π_d.
3.5. For each word w_{d,n}:
3.5.1. Draw a switch value x_{d,n} from the binomial distribution π_d.
3.5.2. If x_{d,n} = aux, draw a topic z_{d,n} from the multinomial distribution Multinomial(θ_d^aux) of the auxiliary text collection, then draw the word w_{d,n} from Multinomial(φ_{z_{d,n}}^aux).
3.5.3. If x_{d,n} = tar, draw a topic z_{d,n} from the multinomial distribution Multinomial(θ_d^tar) of the short text collection, then draw the word w_{d,n} from Multinomial(φ_{z_{d,n}}^tar).
Algorithm 2 - DLDA
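The generative side of Algorithm 2 can be sketched as follows. Each document draws topic mixtures over both the auxiliary topics and the short-text topics, plus a per-document switch probability π; each word first flips the switch to decide which topic group it comes from. The sizes and hyperparameter values are illustrative assumptions, not the patent's values.

```python
# Sketch of the DLDA generative process (Algorithm 2) with a Beta-
# distributed per-document switch between the two topic groups.
import numpy as np

rng = np.random.default_rng(2)
K_aux, K_tar, V = 4, 3, 30
a_aux, a_tar, b = 0.3, 0.2, 0.01
gamma = (0.5, 0.2)                             # Beta prior of the switch

phi_aux = rng.dirichlet([b] * V, size=K_aux)   # word dists, aux topics
phi_tar = rng.dirichlet([b] * V, size=K_tar)   # word dists, short-text topics

def generate_doc(n_words):
    th_aux = rng.dirichlet([a_aux] * K_aux)    # step 3.2
    th_tar = rng.dirichlet([a_tar] * K_tar)    # step 3.3
    pi = rng.beta(*gamma)                      # step 3.4: P(switch = aux)
    words, switches = [], []
    for _ in range(n_words):                   # steps 3.5.1 - 3.5.3
        if rng.random() < pi:                  # switch chose the aux group
            z = rng.choice(K_aux, p=th_aux)
            w = rng.choice(V, p=phi_aux[z])
            switches.append("aux")
        else:                                  # switch chose the tar group
            z = rng.choice(K_tar, p=th_tar)
            w = rng.choice(V, p=phi_tar[z])
            switches.append("tar")
        words.append(w)
    return words, switches

words, switches = generate_doc(20)
print(len(words), set(switches) <= {"aux", "tar"})
```

The switch is what lets a short text draw some words from auxiliary-text topics and the rest from its own, newly discovered topics.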
A concrete application example is described below. Suppose there are 100 auxiliary texts and 50 short texts. Take K_aux = 15, K_tar = 10, α_aux = 0.3, α_tar = 0.2, β_aux = β_tar = 0.01. It should be noted that α and β are usually multi-dimensional vectors, and in theory the value of each dimension may differ, but in practical application they are usually reduced to a single common value. Note also that the Beta prior of the switch is set per collection, Beta(γ_c, γ_~c), where ~c denotes the other text collection with respect to c: if c = aux, then ~c = tar, and conversely if c = tar, then ~c = aux.
First, the topic analysis unit 310 counts all vocabulary of the auxiliary texts and the short texts (a word occurring multiple times is counted once), denoted V. Here, suppose V comprises 5000 words.
Then, for each topic z ∈ [1, ..., 15] of the auxiliary text collection, the topic analysis unit 310 draws a sample from the Dir(0.01) distribution to obtain the word distribution of each topic, for example φ_z^aux = [0.001, 0, ...]. This vector has 5000 dimensions, and the values of all dimensions sum to 1; the meaning is that, for this topic, the probability of choosing the first word is 0.001, the probability of choosing the second word is 0, and so on.
In addition, for each topic z ∈ [1, ..., 10] of the short text collection, the topic analysis unit 310 draws a sample from the Dir(0.01) distribution to obtain the word distribution φ_z^tar of each topic; this vector also has 5000 dimensions.
Then, for the first auxiliary text d = 1 in the auxiliary text collection c = aux, the topic analysis unit 310 draws a sample from the Dir(0.3) distribution to obtain a topic distribution θ_1^aux over the auxiliary-text topics. This vector has 15 dimensions, and the values of all dimensions sum to 1; the meaning is that the probability of choosing the first topic is 0.1, the probability of choosing the second topic is 0.2, and so on. The topic analysis unit 310 also draws a sample from the Dir(0.2) distribution to obtain a topic distribution θ_1^tar over the short-text topics; this vector has 10 dimensions summing to 1. Next, the topic analysis unit 310 draws a sample from the Beta(0.5, 0.2) distribution to obtain a binomial distribution π_1 = [0.7, 0.3], meaning that the probability of choosing an auxiliary-text topic is 0.7 and the probability of choosing a short-text topic is 0.3. Suppose text 1 comprises 30 words. For the first word w_{1,1}, a switch value x_{1,1} = aux is sampled from π_1. Because x_{1,1} = aux, a topic is then sampled from the multinomial distribution θ_1^aux of the auxiliary text collection; suppose the 15th topic is drawn, z_{1,1} = 15. Afterwards, the topic analysis unit 310 performs one sampling from the multinomial distribution φ_15^aux; suppose the 1200th word is drawn, so w_{1,1} = "TV".
Solving DLDA requires estimating the parameters θ and φ. The concrete solving method can adopt the variational method (Variational Method), expectation propagation (Expectation Propagation), Gibbs sampling (Gibbs Sampling), or the like. In the present embodiment, the Gibbs sampling algorithm described below is adopted to estimate the parameters.
1. For all auxiliary-text topics z ∈ [1, ..., K_aux], all short-text topics z ∈ [1, ..., K_tar], all words w, and all texts d, initialize the count tables:
- n_{z,w}^{aux} and n_{z,w}^{tar}: the number of times word w was selected when auxiliary-text topic z, respectively short-text topic z, was selected;
- n_{d,z}^{aux} and n_{d,z}^{tar}: the number of times text d selected auxiliary-text topic z, respectively short-text topic z;
- n_d^{aux} and n_d^{tar}: the number of words in text d assigned to auxiliary-text topics, respectively to short-text topics.
2. For each text collection c ∈ [aux, tar]:
2.1. For each text d ∈ [1, ..., D_c]:
2.1.1. For each word w:
2.1.1.1. Draw a switch value x from the binomial distribution π = [0.5, 0.5].
2.1.1.2. If x = aux, draw a topic z from the multinomial distribution Multinomial(1/K_aux).
2.1.1.3. If x = tar, draw a topic z from the multinomial distribution Multinomial(1/K_tar).
3. Loop:
3.1. For each text collection c ∈ [aux, tar]:
3.1.1. For each text d ∈ [1, ..., D_c]:
3.1.1.1. For each word w:
3.1.1.1.1. Let x and z be the text collection and topic obtained for w by sampling in the previous iteration, and remove w's contribution from the count tables.
3.1.1.1.2. Draw a switch value x from the binomial distribution determined by the current counts.
3.1.1.1.3. If x = aux, draw a topic z from the multinomial distribution determined by formula (1) (see below).
3.1.1.1.4. If x = tar, draw a topic z from the multinomial distribution determined by formula (2) (see below).
3.2. If the convergence condition is reached, calculate the parameters according to formulas (3) and (4) (see below) and exit the loop; otherwise, continue the loop.
Algorithm 3 - Gibbs sampling
Formulas (1)-(4) mentioned in the above Gibbs sampling algorithm are described in detail below.
Formula (1): for all auxiliary-text topics z ∈ [1, ..., K_aux],

p(x = aux, z | ...) ∝ (n_{z,w}^{aux,¬i} + β_aux) / (Σ_{w'} n_{z,w'}^{aux,¬i} + V·β_aux) × (n_{d,z}^{aux,¬i} + α_aux) / (n_d^{aux,¬i} + K_aux·α_aux) × (n_d^{aux,¬i} + γ_aux^{(c_i)}) / (n_d^{¬i} + γ_aux^{(c_i)} + γ_tar^{(c_i)})    (1)

The meaning of formula (1) is that the probability of sampling text collection x and topic z is proportional to three values (the three factors multiplied on the right-hand side). In the third factor, n_d^{aux,¬i} is the number of times an auxiliary-text topic was selected for all the other words of text d in the previous iteration, excluding the current word's sample; the purpose of adding γ is to avoid this count being zero. The second factor is the proportion with which text d chose auxiliary-text topic z in the previous iteration, excluding the current word's sample; the first factor is the proportion with which word w was chosen when auxiliary-text topic z was chosen, again excluding the current word's sample. The superscript ¬i means "excluding the selection of the current word w_i" (corresponding to the meaning of step 3.1.1.1.1).
Formula (2): for all short-text topics z ∈ [1, ..., K_tar],

p(x = tar, z | ...) ∝ (n_{z,w}^{tar,¬i} + β_tar) / (Σ_{w'} n_{z,w'}^{tar,¬i} + V·β_tar) × (n_{d,z}^{tar,¬i} + α_tar) / (n_d^{tar,¬i} + K_tar·α_tar) × (n_d^{tar,¬i} + γ_tar^{(c_i)}) / (n_d^{¬i} + γ_aux^{(c_i)} + γ_tar^{(c_i)})    (2)

The meaning of formula (2) is analogous to that of formula (1), with the auxiliary-text topics replaced by the short-text topics. Here c_i denotes the text collection to which text d belongs (that is, the auxiliary text collection or the short text collection), and γ^{(c_i)} denotes the Beta parameters of that collection.
Formulas (3) and (4): after convergence, the counts are normalized to yield the text-topic probabilities θ and the topic-word probabilities φ:

θ_{d,z}^{c} = (n_{d,z}^{c} + α_c) / (Σ_{z'} n_{d,z'}^{c} + K_c·α_c)    (3)

φ_{z,w}^{c} = (n_{z,w}^{c} + β_c) / (Σ_{w'} n_{z,w'}^{c} + V·β_c)    (4)

where c ∈ [aux, tar].
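One inner resampling step of Algorithm 3 can be sketched as follows. The function computes the unnormalized probabilities of formula (1) over the auxiliary topics and formula (2) over the short-text topics (the count tables are assumed to already exclude the current word, as in step 3.1.1.1.1) and draws a (switch, topic) pair. For brevity this sketch uses a single pair of switch hyperparameters rather than per-collection γ values; the uniform count tables below are placeholders, not real statistics.

```python
# One resampling step of the DLDA Gibbs sampler: joint draw of the
# switch value x and the topic z from the conditionals of formulas
# (1) and (2), given the counts with the current word removed.
import numpy as np

rng = np.random.default_rng(3)

def resample(w, d, counts, hyper):
    nkw_aux, nkw_tar, ndk_aux, ndk_tar, nd_x = counts
    a_aux, a_tar, b_aux, b_tar, g_aux, g_tar = hyper
    V = nkw_aux.shape[1]
    K_aux, K_tar = nkw_aux.shape[0], nkw_tar.shape[0]
    # formula (1): word term * doc-topic term * switch term, aux group
    p_aux = ((nkw_aux[:, w] + b_aux) / (nkw_aux.sum(1) + V * b_aux)
             * (ndk_aux[d] + a_aux) / (ndk_aux[d].sum() + K_aux * a_aux)
             * (nd_x[d, 0] + g_aux) / (nd_x[d].sum() + g_aux + g_tar))
    # formula (2): the same three factors over the short-text topic group
    p_tar = ((nkw_tar[:, w] + b_tar) / (nkw_tar.sum(1) + V * b_tar)
             * (ndk_tar[d] + a_tar) / (ndk_tar[d].sum() + K_tar * a_tar)
             * (nd_x[d, 1] + g_tar) / (nd_x[d].sum() + g_aux + g_tar))
    p = np.concatenate([p_aux, p_tar])
    i = rng.choice(len(p), p=p / p.sum())
    return ("aux", int(i)) if i < len(p_aux) else ("tar", int(i) - len(p_aux))

K_aux, K_tar, V, D = 2, 2, 5, 3
counts = (np.ones((K_aux, V)), np.ones((K_tar, V)),
          np.ones((D, K_aux)), np.ones((D, K_tar)), np.ones((D, 2)))
group, z = resample(w=0, d=0, counts=counts,
                    hyper=(0.3, 0.2, 0.01, 0.01, 0.5, 0.2))
print(group, z)
```

Concatenating the two unnormalized vectors and drawing once realizes steps 3.1.1.1.2-3.1.1.1.4 as a single joint draw over (x, z).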
There can be multiple convergence conditions, for example: a predefined number of iterations is reached, the parameters θ and φ change very little, or the likelihood function of the text collection changes very little.
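The convergence test of step 3.2 can be read as follows: stop when the relative change of a monitored quantity (for example a log-likelihood trace) drops below a tolerance, or when a maximum iteration count is reached. The tolerance and iteration cap below are illustrative assumptions.

```python
# One way to implement the convergence condition: small relative change
# of a monitored trace, or a hard cap on the number of iterations.
def converged(trace, tol=1e-4, max_iter=500):
    if len(trace) >= max_iter:
        return True                     # predefined iteration count reached
    if len(trace) < 2:
        return False                    # not enough history to compare
    prev, cur = trace[-2], trace[-1]
    return abs(cur - prev) <= tol * abs(prev)   # relative change tiny

print(converged([-1000.0, -900.0]))         # still moving
print(converged([-850.0, -850.0000001]))    # change negligible
```

Any of the three conditions named above fits this shape; only the monitored quantity changes.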
By solving with the above Gibbs sampling algorithm, the probability of each short text corresponding to each topic of the auxiliary text collection and each topic of the short text collection can be obtained (that is, the result of formula (3)).
The vector generation unit 320 normalizes the probabilities of the corresponding topics and then generates a vector. Note that the vector here is generated over the union of the topics of the auxiliary text collection and the topics of the short text collection: for any short text d, the vector has one dimension for each choice of c ∈ [aux, tar] in formula (3) and each topic z. For example, the vector generation unit 320 may generate a vector such as f_d = [0.1, 0.5, 0, 0, 0.02, ..., 0].
The clustering unit 330 performs short text clustering based on the vectors generated by the vector generation unit 320. Specifically, after the above vectors have been generated for all short texts, a clustering method such as K-means can be used to perform short text clustering, thereby obtaining the clustering result of the short texts.
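The last two stages can be sketched together: normalize each short text's raw topic weights over the combined auxiliary and short-text topics into a probability vector, then cluster the vectors with a plain K-means. The synthetic data and the minimal K-means below are illustrative assumptions, not the patent's implementation.

```python
# Sketch of vector generation (normalization over K_aux + K_tar topic
# dimensions) followed by K-means clustering of the resulting vectors.
import numpy as np

rng = np.random.default_rng(4)

def kmeans(X, k, iters=50):
    centers = X[rng.choice(len(X), k, replace=False)]   # init on data points
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)     # recompute means
    return labels

# raw (unnormalized) topic weights: K_aux + K_tar = 5 dimensions
raw = np.abs(rng.normal(size=(10, 5)))
raw[:5, :2] *= 10     # first 5 texts lean on the auxiliary topics
raw[5:, 3:] *= 10     # last 5 lean on the short-text topics
vecs = raw / raw.sum(axis=1, keepdims=True)   # normalize to probabilities
labels = kmeans(vecs, k=2)
print(labels)
```

Because the dimensions cover both topic groups, texts dominated by newly discovered short-text topics can still form their own clusters instead of being forced onto auxiliary topics.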
Fig. 5 shows a flowchart of a short text clustering method 50 according to an embodiment of the invention. As shown in Fig. 5, the short text clustering method 50 comprises steps S510-S530.
At step S510, topic analysis is performed on each text in the auxiliary text collection and the short text collection to obtain their respective topics. Specifically, the DLDA algorithm described above can be adopted to analyze the topics of the auxiliary texts and the short texts. In solving the DLDA model, the variational method, expectation propagation, Gibbs sampling, or the like can be adopted; preferably, the Gibbs sampling algorithm described above is adopted to realize DLDA.
At step S520, according to each short text's probabilities of corresponding to the topics of the auxiliary text collection and the topics of the short text collection, the probabilities of the corresponding topics are normalized to generate a vector. Preferably, this vector is generated over the union of the topics of the auxiliary text collection and the topics of the short text collection.
At step S530, the short texts are clustered based on the generated vectors. For example, after the vectors have been generated for all short texts, a clustering method such as K-means can be used to perform short text clustering.
The results of applying the short text clustering apparatus or method of the present invention to a collection of online advertisements are described below. Suppose 182,209 online advertisements covering 42 product series were collected from a commercial website, each containing 29.06 words on average. In addition, 99,737 web pages were collected by product name as auxiliary texts, each containing 560.4 words on average. Each product series serves as one cluster.
The evaluation measure compares the clusters Ω produced by the computer against the correct cluster classes C, where each c ∈ C is one correct cluster (a certain product series), l(x) denotes the correct cluster label of a short text x, and |c| denotes the number of texts in a cluster. The smaller the measure, the better the algorithm performs.
Table 1 below lists the results of applying the DLDA method of the present invention and several other methods to the online advertisement collection:
Table 1
In Table 1, "Direct" denotes directly applying the clustering method; "LDA-one" generates topics on the auxiliary text collection and then performs inference for the short texts on those topics (similar to reference 1); "LDA-both" generates topics on the union of the auxiliary text collection and the short text collection; and "STC" is a clustering method adopted for domain transfer. It can be seen that, because the result of DLDA is the smallest, DLDA performs best at short text clustering.
Although the present invention has been shown above in conjunction with its preferred embodiments, those skilled in the art will appreciate that various modifications, substitutions, and changes can be made without departing from the spirit and scope of the invention. The invention should therefore not be limited by the above embodiments, but should be defined by the appended claims and their equivalents.
Claims (12)
1. A short text clustering apparatus, comprising:
a topic analysis unit that performs topic analysis on each text in an auxiliary text collection and a short text collection, to obtain the probability that each short text in the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection;
a vector generation unit that normalizes each short text's probabilities over the topics of the auxiliary text collection and the short text collection to generate a vector; and
a clustering unit that clusters the short texts in the short text collection based on the generated vectors.
2. The short text clustering apparatus according to claim 1, wherein said topic analysis unit determines through a switch parameter whether a word in a text of the auxiliary text collection or the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if it corresponds to a topic of the auxiliary text collection, said topic analysis unit performs topic analysis through a first Latent Dirichlet Allocation, and if it corresponds to a topic of the short text collection, said topic analysis unit performs topic analysis through a second Latent Dirichlet Allocation.
3. The short text clustering equipment according to claim 2, wherein the topic analysis unit uses a Gibbs sampling algorithm to estimate the parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation, wherein the sampling probability of a topic of the auxiliary text collection is proportional to the number of occurrences of that topic among all other words after the current word's sampled assignment from the previous iteration has been removed, and the sampling probability of a topic of the short text collection is proportional to the number of occurrences of that topic among all other words after the current word's sampled assignment from the previous iteration has been removed.
4. The short text clustering equipment according to claim 1, wherein the vector generation unit generates the vector on the union of the topics of the auxiliary text collection and the topics of the short text collection.
5. The short text clustering equipment according to claim 2, wherein the value of the switch parameter follows a binomial distribution.
6. The short text clustering equipment according to claim 2, wherein the topic analysis unit determines the switch parameter so as to ensure that a word in an auxiliary text has a greater probability of corresponding to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text has a greater probability of corresponding to a topic of the short text collection than to a topic of the auxiliary text collection.
7. A short text clustering method, comprising:
a topic analysis step of performing topic analysis on each text in an auxiliary text collection and a short text collection, to obtain, for each short text in the short text collection, the probabilities that the short text corresponds to the topics of the auxiliary text collection and to the topics of the short text collection;
a vector generation step of normalizing each short text's probabilities of corresponding to the topics of the auxiliary text collection and to the topics of the short text collection, to generate a vector; and
a clustering step of clustering the short texts in the short text collection based on the generated vectors.
8. The short text clustering method according to claim 7, wherein the topic analysis step comprises: determining, through a switch parameter, whether a word in each text of the auxiliary text collection and the short text collection corresponds to a topic of the auxiliary text collection or to a topic of the short text collection; if the word corresponds to a topic of the auxiliary text collection, performing topic analysis through a first latent Dirichlet allocation, and if the word corresponds to a topic of the short text collection, performing topic analysis through a second latent Dirichlet allocation.
9. The short text clustering method according to claim 8, wherein a Gibbs sampling algorithm is used to estimate the parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation, wherein the sampling probability of a topic of the auxiliary text collection is proportional to the number of occurrences of that topic among all other words after the current word's sampled assignment from the previous iteration has been removed, and the sampling probability of a topic of the short text collection is proportional to the number of occurrences of that topic among all other words after the current word's sampled assignment from the previous iteration has been removed.
10. The short text clustering method according to claim 7, wherein the vector generation step comprises: generating the vector on the union of the topics of the auxiliary text collection and the topics of the short text collection.
11. The short text clustering method according to claim 8, wherein the value of the switch parameter follows a binomial distribution.
12. The short text clustering method according to claim 8, wherein the switch parameter is determined so as to ensure that a word in an auxiliary text has a greater probability of corresponding to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text has a greater probability of corresponding to a topic of the short text collection than to a topic of the auxiliary text collection.
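The switch-parameter sampling described in claims 2, 3, 5, 8, 9, and 11 can be sketched as one sweep of a collapsed Gibbs sampler: a Bernoulli (two-outcome binomial) switch decides, for each word, whether its topic is drawn from the auxiliary-collection topics or the short-text-collection topics, and the topic is then sampled with probability proportional to its count among all other words after removing the current word's previous assignment. This is a simplified sketch under assumed details: the smoothing hyperparameter `alpha`, the fixed switch probability `p_aux`, and the count structure are illustrative, not the patent's exact sampler.

```python
import random

def gibbs_step(doc_words, assignments, counts, n_aux, n_short,
               p_aux=0.5, alpha=0.1, seed=None):
    """One Gibbs sweep over a document.
    assignments[i] = (is_aux, topic) for the i-th word;
    counts[(is_aux, topic)] = number of words currently assigned to that topic."""
    rng = random.Random(seed)
    for i, _word in enumerate(doc_words):
        is_aux, topic = assignments[i]
        counts[(is_aux, topic)] -= 1        # remove the current word's assignment
        use_aux = rng.random() < p_aux      # Bernoulli switch parameter
        n_topics = n_aux if use_aux else n_short
        # sampling weight proportional to the remaining counts (plus smoothing)
        weights = [counts.get((use_aux, t), 0) + alpha for t in range(n_topics)]
        total = sum(weights)
        r, new_topic = rng.random() * total, 0
        for t, w in enumerate(weights):     # inverse-CDF draw of the new topic
            if r < w:
                new_topic = t
                break
            r -= w
        assignments[i] = (use_aux, new_topic)
        counts[(use_aux, new_topic)] = counts.get((use_aux, new_topic), 0) + 1
    return assignments
```

Because each iteration removes exactly one assignment and adds exactly one, the total topic-count mass is conserved across the sweep, which is the invariant a collapsed Gibbs sampler relies on.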
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110160561.4A CN102831119B (en) | 2011-06-15 | 2011-06-15 | Short text clustering apparatus and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102831119A true CN102831119A (en) | 2012-12-19 |
CN102831119B CN102831119B (en) | 2016-08-17 |
Family
ID=47334262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110160561.4A Active CN102831119B (en) | Short text clustering apparatus and method | 2011-06-15 | 2011-06-15 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102831119B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103530316A (en) * | 2013-09-12 | 2014-01-22 | 浙江大学 | Science subject extraction method based on multi-view learning |
CN103729422A (en) * | 2013-12-23 | 2014-04-16 | 武汉传神信息技术有限公司 | Information fragment associative output method and system |
CN104573070A (en) * | 2015-01-26 | 2015-04-29 | 清华大学 | Text clustering method special for mixed length text sets |
CN104850617A (en) * | 2015-05-15 | 2015-08-19 | 百度在线网络技术(北京)有限公司 | Short text processing method and apparatus |
CN107992477A (en) * | 2017-11-30 | 2018-05-04 | 北京神州泰岳软件股份有限公司 | Text subject determines method, apparatus and electronic equipment |
CN108228721A (en) * | 2017-12-08 | 2018-06-29 | 复旦大学 | Fast text clustering method on large corpora |
CN108628875A (en) * | 2017-03-17 | 2018-10-09 | 腾讯科技(北京)有限公司 | A kind of extracting method of text label, device and server |
CN111090995A (en) * | 2019-11-15 | 2020-05-01 | 合肥工业大学 | Short text topic identification method and system |
CN111897912A (en) * | 2020-07-13 | 2020-11-06 | 上海乐言信息科技有限公司 | Active learning short text classification method and system based on sampling frequency optimization |
CN113407679A (en) * | 2021-06-30 | 2021-09-17 | 竹间智能科技(上海)有限公司 | Text topic mining method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1828608A (en) * | 2006-04-13 | 2006-09-06 | 北大方正集团有限公司 | Multiple file summarization method based on sentence relation graph |
CN101187919A (en) * | 2006-11-16 | 2008-05-28 | 北大方正集团有限公司 | Method and system for abstracting batch single document for document set |
US20090164417A1 (en) * | 2004-09-30 | 2009-06-25 | Nigam Kamal P | Topical sentiments in electronically stored communications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | | Effective date of registration: 2017-12-07. Patentee after: Data Hall (Beijing) Polytron Technologies Inc., Block B, No. 18 Zhongguancun Street, Haidian District, Beijing 100190. Patentee before: NEC (China) Co., Ltd., World Building, No. 35 Xueyuan Road, Haidian District, Beijing 100191. |
Effective date of registration: 20171207 Address after: 100190 Zhongguancun street, Haidian District, Beijing, No. 18, block B, block 18 Patentee after: Data Hall (Beijing) Polytron Technologies Inc Address before: 100191 Haidian District, Xueyuan Road, No. 35, the world building, the second floor of the building on the ground floor, No. 20 Patentee before: NEC (China) Co., Ltd. |