CN102831119B - Short text clustering apparatus and method - Google Patents
Short text clustering apparatus and method
- Publication number: CN102831119B (application CN201110160561.4A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention provides a short text clustering device, including: a topic analysis unit that performs topic analysis on each text in an auxiliary text collection and a short text collection, to obtain, for each short text in the short text collection, the probabilities that it corresponds to the topics of the auxiliary text collection and to the topics of the short text collection; a vector generation unit that normalizes, for each short text, these probabilities to generate a vector; and a clustering unit that clusters the short texts in the short text collection based on the generated vectors. The invention also provides a short text clustering method. The invention discovers auxiliary-text topics and short-text topics separately, so that short texts can be clustered more accurately.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a short text clustering apparatus and method.
Background art
With the wide use of SMS, microblogs, search engines, online advertising and the like, short texts are used more and more frequently. These texts are very short: an SMS, for example, cannot exceed 70 characters, and a result returned by a search engine generally contains only a few dozen words.
Short texts differ considerably from long texts (such as news articles). For example, in a long text a topic can be fully described, so that a reader can learn nearly all of its content from that one text. In contrast, because the number of words in a short text is limited, usually only the core of a topic is described, and much related information is omitted.
Traditional text mining methods are generally designed for long texts and run into difficulty when applied to short texts; clustering is one example. Clustering usually relies on co-occurrence information (words appearing together), and short texts carry far less such information than long texts, so clustering quality suffers. Consider the two news texts L1 and L2 below:
L1: "The No. 4 teaching building of Tsing-Hua University has been renamed the 'Jeanwest Building', drawing ridicule on campus and on the Internet. The main objection is that the image of a Tsing-Hua teaching building and the image of the Jeanwest apparel brand do not match. From the standpoint of due process in naming university buildings, Tsing-Hua clearly has something to answer for. Setting that aside, and focusing only on the substantive issue of brand image in naming a teaching building: does the 'Jeanwest Building' really damage Tsing-Hua's image?"
L2: "Recently, a teaching building of Tsing-Hua University was named 'Jeanwest', causing an uproar online. Isn't Jeanwest an apparel brand? How can a Tsing-Hua teaching building also be 'Jeanwest'? At noon on the 23rd, a plaque reading 'Jeanwest Building' was hung on the outer wall of the No. 4 teaching building; below it to the right another plaque was hung, specially introducing the Jeanwest apparel brand. Naming a teaching building after a brand has caused a dispute among Tsing-Hua students and netizens. Some believe that universities are overly commercialized and should not sell naming rights to companies. The Sina blogger @Young_pig, however, argues that the company is providing sponsorship and that the naming does not harm the school's image."
Because L1 and L2 share words such as "Tsing-Hua University", "No. 4 teaching building", "Jeanwest", "apparel", "university", "naming" and "image", it is easy to determine that they are very similar and can be grouped into one cluster. The two short texts S1 and S2 below, however, are not easy to group together, because the only important word they share is "Tsing-Hua University" (the word "also" is so commonly used that it carries almost no weight and is usually removed before clustering):
S1: "I heard the Jeanwest Building also does not match Tsing-Hua's image"
S2: "Isn't it just an apparel brand? Tsing-Hua's naming is overly commercialized"
To improve the correctness of short text clustering, the prior art has proposed using auxiliary information to help the clustering. For example, to cluster short texts such as S1 and S2 above, long texts such as L1 and L2 are introduced as auxiliary information: S1 is more similar to L1 (sharing words such as "Jeanwest", "Tsing-Hua University", "image", "not match"), and S2 is more similar to L2 (sharing words such as "apparel", "Tsing-Hua University", "naming", "commercialized"). Since L1 and L2 are themselves similar, S1 and S2 are also similar and can be grouped into one cluster.
Reference 1 (X. H. Phan, L. M. Nguyen, S. Horiguchi, "Learning to classify short and sparse text & web with hidden topics from large-scale data collections", WWW 2008) describes a method of clustering short texts with the aid of auxiliary text. As shown in Fig. 1, the method comprises the following steps:
In step S100, topic analysis is performed on the auxiliary text collection to obtain a number of topics and their corresponding vocabularies. Specifically, Reference 1 uses text downloaded from Wikipedia as the auxiliary information, forming the auxiliary text collection. The topic analysis uses Latent Dirichlet Allocation (LDA). Fig. 2 shows the LDA model. LDA is a generative model whose main idea is to simulate the process by which a text is generated: for each word, first select a topic from a distribution, then select a word from that topic. With reference to Fig. 2, the LDA algorithm is:
1. For each topic k ∈ [1, K], draw a sample from the distribution Dir(β), obtaining a word distribution φ_k for that topic.
2. For each text m ∈ [1, M]:
 2.1. Draw a sample from the distribution Dir(α), obtaining a topic distribution θ_m.
 2.2. For each word position n:
  2.2.1. Draw a topic z_{m,n} from the multinomial distribution θ_m.
  2.2.2. Draw a word w_{m,n} from the multinomial distribution φ_{z_{m,n}}.
Algorithm 1 - LDA
Here the value of α represents the prior weight of each topic before sampling, and the value of β represents the prior distribution of the words of each topic. Both are predetermined parameters, called hyperparameters.
The task of LDA is to estimate the parameters φ and θ_d. The joint density of all observed and hidden variables is:
p(w, z, θ, φ | α, β) = ∏_{k=1}^{K} p(φ_k | β) · ∏_{m=1}^{M} p(θ_m | α) ∏_{n=1}^{N_m} p(z_{m,n} | θ_m) p(w_{m,n} | φ_{z_{m,n}})
The likelihood of one text is:
p(w_m | α, β) = ∫ p(θ_m | α) ∏_{n=1}^{N_m} Σ_z p(z | θ_m) p(w_{m,n} | φ_z) dθ_m
and the likelihood of the whole text collection is the product over all texts:
p(W | α, β) = ∏_{m=1}^{M} p(w_m | α, β)
In theory, φ and θ could be solved for by maximizing the above likelihood, but this has no closed-form solution, so in practice the parameters are estimated approximately. For example, Reference 1 uses Gibbs sampling for the estimation. Reference 2 (Thomas L. Griffiths, Mark Steyvers, "Finding scientific topics", Proceedings of the National Academy of Sciences of the United States of America, Vol. 101, Suppl. 1, 6 April 2004, pp. 5228-5235) and Reference 3 (Gregor Heinrich, "Parameter estimation for text analysis", Technical Report, 2004) describe in detail the process and algorithms for realizing LDA with Gibbs sampling.
In step S110, inference is performed on the short text collection based on the topics obtained in step S100, to obtain the topics corresponding to the short texts. The inference again uses Gibbs sampling.
In step S120, a training sample set is constructed based on the result of step S110. The training data take vector form: each short text corresponds to one vector. A vector is generated for every short text in the short text collection, and each short text is then given a class label, forming the training sample set.
In step S130, a machine learning method is selected and applied to the training sample set in order to obtain a classification model. Many methods are available, such as decision trees, SVM, and maximum entropy; Reference 1 uses the maximum entropy method.
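Steps S120 and S130 can be illustrated as follows. Reference 1 trains a maximum-entropy classifier on the topic vectors; as a dependency-free stand-in, the sketch below fits a nearest-centroid classifier instead, and the topic vectors and labels are invented for illustration:

```python
import numpy as np

# Each short text is represented by its (normalized) topic-probability
# vector; a class label is attached to form the training sample set.
X = np.array([[0.8, 0.1, 0.1],   # short texts mostly about topic 0
              [0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8],   # short texts mostly about topic 2
              [0.2, 0.1, 0.7]])
y = np.array([0, 0, 1, 1])

# "Training": one centroid per class (stand-in for maximum entropy).
centroids = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])

def predict(v):
    # Assign a new topic vector to the class of the nearest centroid.
    return int(np.argmin(((centroids - v) ** 2).sum(axis=1)))

print(predict(np.array([0.75, 0.15, 0.1])))  # → 0
print(predict(np.array([0.1, 0.2, 0.7])))    # → 1
```

Any of the classifiers the text lists (decision trees, SVM, maximum entropy) could replace the centroid rule without changing the surrounding pipeline.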
However, when the short texts are inferred on the topics formed from the auxiliary text in step S110, Reference 1 assumes that the topics of the short texts are covered by the auxiliary text. In many real situations this assumption does not hold, or holds only poorly, because many situations and events are newly emerging, and no comprehensive knowledge base can be guaranteed to cover the topics appearing in all of them. In such cases the auxiliary text covers only part of the topics of the short texts. The method of Reference 1 therefore cannot discover and exploit the newly emerging topics inherent in the short texts, which degrades the quality of classification or clustering.
Summary of the invention
To solve the above technical problem, the present invention performs topic analysis on each text in an auxiliary text collection and a short text collection, to obtain for each short text in the short text collection the probabilities that it corresponds to the topics of the auxiliary text collection and to the topics of the short text collection. The present invention does not require an auxiliary text collection that covers all topics of the short texts; it only requires that the auxiliary text be partially related to the short texts. Specifically, the present invention clusters short texts using two coupled Latent Dirichlet Allocations (Double Latent Dirichlet Allocation, DLDA). By building two LDAs and adding a switch between them, DLDA discovers auxiliary-text topics and short-text topics separately, and can determine the probability that any short text corresponds to an auxiliary-text topic or a short-text topic.
According to one aspect of the present invention, there is provided a short text clustering device, including: a topic analysis unit that performs topic analysis on each text in an auxiliary text collection and a short text collection, to obtain for each short text in the short text collection the probabilities that it corresponds to the topics of the auxiliary text collection and to the topics of the short text collection; a vector generation unit that normalizes, for each short text, these probabilities to generate a vector; and a clustering unit that clusters the short texts in the short text collection based on the generated vectors.
Preferably, the topic analysis unit determines by a switch parameter whether a word in a text of the auxiliary text collection or the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if it corresponds to a topic of the auxiliary text collection, the topic analysis unit performs topic analysis through a first Latent Dirichlet Allocation, and if it corresponds to a topic of the short text collection, through a second Latent Dirichlet Allocation.
Preferably, the topic analysis unit uses a Gibbs sampling algorithm to estimate the parameters used in the first Latent Dirichlet Allocation and the second Latent Dirichlet Allocation, wherein the sampling frequency of a topic of the auxiliary text collection is proportional to the number of times all other words selected that topic of the auxiliary text collection in the previous iteration after the current word's selection is removed, and the sampling frequency of a topic of the short text collection is proportional to the number of times all other words selected that topic of the short text collection in the previous iteration after the current word's selection is removed.
Preferably, the vector generation unit generates the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
Preferably, the value of the switch parameter obeys a binomial distribution.
Preferably, the topic analysis unit determines the switch parameter so as to ensure that a word in an auxiliary text is more likely to correspond to a topic of the auxiliary text collection than to a topic of the short text collection, and a word in a short text is more likely to correspond to a topic of the short text collection than to a topic of the auxiliary text collection.
According to another aspect of the present invention, there is provided a short text clustering method, including: a topic analysis step of performing topic analysis on each text in an auxiliary text collection and a short text collection, to obtain for each short text in the short text collection the probabilities that it corresponds to the topics of the auxiliary text collection and to the topics of the short text collection; a vector generation step of normalizing, for each short text, these probabilities to generate a vector; and a clustering step of clustering the short texts in the short text collection based on the generated vectors.
Preferably, the topic analysis step includes: determining by a switch parameter whether a word in a text of the auxiliary text collection or the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if it corresponds to a topic of the auxiliary text collection, performing topic analysis through a first Latent Dirichlet Allocation, and if it corresponds to a topic of the short text collection, through a second Latent Dirichlet Allocation.
Preferably, a Gibbs sampling algorithm is used to estimate the parameters used in the first Latent Dirichlet Allocation and the second Latent Dirichlet Allocation, wherein the sampling frequency of a topic of the auxiliary text collection is proportional to the number of times all other words selected that topic of the auxiliary text collection in the previous iteration after the current word's selection is removed, and the sampling frequency of a topic of the short text collection is proportional to the number of times all other words selected that topic of the short text collection in the previous iteration after the current word's selection is removed.
Preferably, the vector generation step includes generating the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
Preferably, the value of the switch parameter obeys a binomial distribution.
Preferably, the switch parameter is determined so as to ensure that a word in an auxiliary text is more likely to correspond to a topic of the auxiliary text collection than to a topic of the short text collection, and a word in a short text is more likely to correspond to a topic of the short text collection than to a topic of the auxiliary text collection.
The present invention discovers auxiliary-text topics and short-text topics separately, and can therefore cluster short texts more accurately.
Brief description of the drawings
The above and other features of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows a flowchart of a prior-art short text clustering method;
Fig. 2 shows a block diagram of the LDA model used by the short text clustering method of Fig. 1;
Fig. 3 shows a block diagram of a short text clustering device according to an embodiment of the present invention;
Fig. 4 shows a block diagram of the DLDA model used by the short text clustering device according to an embodiment of the present invention; and
Fig. 5 shows a flowchart of a short text clustering method according to an embodiment of the present invention.
Detailed description of the invention
Below, the principles and implementation of the present invention will become apparent from the description of specific embodiments in conjunction with the accompanying drawings. It should be noted that the present invention is not limited to the specific embodiments described below. In addition, for brevity, detailed descriptions of well-known techniques not directly related to the present invention are omitted.
Fig. 3 shows a block diagram of a short text clustering device 30 according to an embodiment of the present invention. As shown in Fig. 3, the short text clustering device 30 includes a topic analysis unit 310, a vector generation unit 320 and a clustering unit 330.
The topic analysis unit 310 performs topic analysis on each text in the auxiliary text collection and the short text collection to obtain their respective topics. In a specific embodiment, the topic analysis unit 310 performs the topic analysis using the DLDA model shown in Fig. 4. As can be seen from Fig. 4, DLDA includes two LDAs, corresponding respectively to topic analysis of the auxiliary texts and of the short texts (where "aux" denotes auxiliary text and "tar" denotes short text). To coordinate the two LDAs, a switch mechanism governed by a hyperparameter γ is introduced: for each word, a switch variable chooses whether its topic is selected from the auxiliary-text topics or from the short-text topics.
In this embodiment, the topic analysis unit 310 analyzes the topics of the auxiliary texts and short texts by the following algorithm:
1. For each topic z ∈ [1, ..., K_aux] of the auxiliary text collection, draw a sample from the distribution Dir(β_aux), obtaining a word distribution φ^aux_z for that topic.
2. For each topic z ∈ [1, ..., K_tar] of the short text collection, draw a sample from the distribution Dir(β_tar), obtaining a word distribution φ^tar_z for that topic.
3. For each text collection c ∈ [aux, tar]:
 3.1. For each text d ∈ [1, ..., D_c]:
  3.2. Draw a sample from the distribution Dir(α_aux), obtaining an auxiliary-collection topic distribution θ^aux_d.
  3.3. Draw a sample from the distribution Dir(α_tar), obtaining a short-collection topic distribution θ^tar_d.
  3.4. Draw a sample from the distribution Beta(γ_c), obtaining a binomial distribution π_d.
  3.5. For each word w_{d,n}:
   3.5.1. Draw a switch value x_{d,n} from the binomial distribution π_d.
   3.5.2. If x_{d,n} = aux, draw a topic z_{d,n} from the multinomial distribution θ^aux_d of the auxiliary text collection.
   3.5.3. If x_{d,n} = tar, draw a topic z_{d,n} from the multinomial distribution θ^tar_d of the short text collection.
   3.5.4. Draw a word w_{d,n} from the multinomial distribution φ of the chosen topic.
Algorithm 2 - DLDA
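The two-LDA structure of Algorithm 2 can be sketched in code as follows. Every size, hyperparameter value and name here is our own illustration of the generative process, not the patent's notation:

```python
import numpy as np

rng = np.random.default_rng(1)

K_aux, K_tar, V = 15, 10, 5000
phi_aux = rng.dirichlet([0.01] * V, size=K_aux)  # step 1: aux topic-word dists
phi_tar = rng.dirichlet([0.01] * V, size=K_tar)  # step 2: tar topic-word dists

def generate_doc(n_words=30, gamma=(0.5, 0.2)):
    theta_aux = rng.dirichlet([0.3] * K_aux)     # step 3.2
    theta_tar = rng.dirichlet([0.2] * K_tar)     # step 3.3
    pi = rng.beta(*gamma)                        # step 3.4: switch probability
    words = []
    for _ in range(n_words):
        if rng.random() < pi:                    # step 3.5.1: switch x = aux
            z = rng.choice(K_aux, p=theta_aux)   # step 3.5.2: aux topic
            w = rng.choice(V, p=phi_aux[z])
        else:                                    # switch x = tar
            z = rng.choice(K_tar, p=theta_tar)   # step 3.5.3: tar topic
            w = rng.choice(V, p=phi_tar[z])
        words.append(int(w))                     # step 3.5.4: emit the word
    return words

doc = generate_doc()
print(len(doc))
```

The per-word switch is what lets one document mix topics from both collections, which is the point of DLDA.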
A concrete application example is described below. Assume there are 100 auxiliary texts and 50 short texts. Take K_aux = 15, K_tar = 10, α_aux = 0.3, α_tar = 0.2, β_aux = β_tar = 0.01. Note that α and β are multi-dimensional vectors whose components may in theory take different values, but in practice every component is usually reduced to the same value. Note also that the Beta hyperparameters γ should be set so that a word in a text of collection c is a priori more likely to take a topic of c than a topic of the other collection ~c, where c and ~c denote one text collection and the other: if c = aux then ~c = tar, and if c = tar then ~c = aux.
First, the topic analysis unit 310 counts the full vocabulary of the auxiliary texts and short texts (each distinct word counted once), denoted V. Suppose here that V contains 5000 words.
Then, for each topic z ∈ [1, ..., 15] of the auxiliary text collection, the topic analysis unit 310 samples from the distribution Dir(0.01) to obtain the word distribution φ^aux_z of that topic, for example φ^aux_1 = [0.001, 0, ...]. The dimension of this vector is 5000 and its components sum to 1; the meaning is that, for this topic, the probability of choosing the first word is 0.001, the probability of choosing the second word is 0, and so on.
Similarly, for each topic z ∈ [1, ..., 10] of the short text collection, the topic analysis unit 310 samples from the distribution Dir(0.01) to obtain the word distribution φ^tar_z of that topic, again a vector of dimension 5000.
Then, for the first auxiliary text d = 1 in the auxiliary text collection c = aux, the topic analysis unit 310 samples from the distribution Dir(0.3) to obtain an auxiliary-collection topic distribution θ^aux_1 = [0.1, 0.2, ...]. The dimension of this vector is 15 and its components sum to 1; the meaning is that the probability of choosing the first topic is 0.1, the probability of choosing the second topic is 0.2, and so on. The topic analysis unit 310 also samples from the distribution Dir(0.2) to obtain a short-collection topic distribution θ^tar_1, a vector of dimension 10 whose components sum to 1. Next, the topic analysis unit 310 samples from the distribution Beta(0.5, 0.2) to obtain a binomial distribution π_1 = [0.7, 0.3], meaning that the probability of choosing an auxiliary-text topic is 0.7 and the probability of choosing a short-text topic is 0.3. Assume text 1 contains 30 words. For the first word w_{1,1}, a switch value x_{1,1} = aux is sampled from π_1. Since x_{1,1} = aux, a topic is sampled from the multinomial distribution θ^aux_1 of the auxiliary text collection; assume the 15th topic is drawn, z_{1,1} = 15. Afterwards, the topic analysis unit 310 performs a sampling from the multinomial distribution φ^aux_15; if, say, the 1200th word is drawn, then w_{1,1} = "TV".
Solving DLDA requires solving for the parameters φ and θ. Concrete solution methods include the variational method, expectation propagation, or Gibbs sampling. In this embodiment, the Gibbs sampling algorithm described below is used to solve for the parameters.
1. For all auxiliary-text topics z ∈ [1, ..., K_aux], all words w and all texts d, initialize the counts N^aux_{z,w} = 0 and N^aux_{d,z} = 0; for all short-text topics z ∈ [1, ..., K_tar], all words w and all texts d, initialize N^tar_{z,w} = 0 and N^tar_{d,z} = 0; for all texts d, initialize N^aux_d = 0 and N^tar_d = 0. Here N^aux_{z,w} and N^tar_{z,w} are the number of times word w was selected while topic z was selected as an auxiliary-text topic or a short-text topic, respectively; N^aux_{d,z} and N^tar_{d,z} are the number of times text d chose auxiliary-text topic z or short-text topic z, respectively; and N^aux_d and N^tar_d are the number of times the words in text d selected an auxiliary-text topic or a short-text topic, respectively.
2. For each text collection c ∈ [aux, tar]:
 2.1. For each text d ∈ [1, ..., D_c]:
  2.1.1. For each word w:
   2.1.1.1. Sample a switch value x from the binomial distribution π = [0.5, 0.5].
   2.1.1.2. If x = aux, sample a topic z from the uniform multinomial distribution Multinomial(1/K_aux) and increment the corresponding counts.
   2.1.1.3. If x = tar, sample a topic z from the uniform multinomial distribution Multinomial(1/K_tar) and increment the corresponding counts.
3. Loop:
 3.1. For each text collection c ∈ [aux, tar]:
  3.1.1. For each text d ∈ [1, ..., D_c]:
   3.1.1.1. For each word w:
    3.1.1.1.1. Let x and z be the text collection and topic assigned to w by the sampling of the previous iteration, and remove w's current assignment from the counts.
    3.1.1.1.2. Sample a new switch value x from the binomial distribution determined by the counts N^aux_d and N^tar_d together with the hyperparameter γ.
    3.1.1.1.3. If x = aux, sample a topic z from the multinomial distribution determined by formula (1) (see below) and update the counts accordingly.
    3.1.1.1.4. If x = tar, sample a topic z from the multinomial distribution determined by formula (2) (see below) and update the counts accordingly.
 3.2. If the convergence condition is reached, compute the parameters according to formulas (3) and (4) (see below) and exit the loop; otherwise, continue the loop.
Algorithm 3 - Gibbs sampling
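One pass of step 3 of Algorithm 3 can be sketched as follows: remove the current word's assignment, score every (collection, topic) pair by the three factors that the description of formulas (1) and (2) gives, resample, and add the new assignment back. The count-table layout and hyperparameter values are our own simplification (a single document, symmetric priors):

```python
import numpy as np

rng = np.random.default_rng(2)
K_aux, K_tar, V = 3, 2, 6
alpha, beta, gamma = 0.3, 0.01, 0.5

# Count tables (here pre-filled with ones as if one iteration had run):
n_zw = {"aux": np.ones((K_aux, V)), "tar": np.ones((K_tar, V))}  # topic-word
n_dz = {"aux": np.ones(K_aux),      "tar": np.ones(K_tar)}       # doc-topic
n_dc = {"aux": float(K_aux),        "tar": float(K_tar)}         # doc-collection

def resample(w, x_old, z_old):
    # Remove the current word's assignment (the "minus current word" step).
    n_zw[x_old][z_old, w] -= 1
    n_dz[x_old][z_old] -= 1
    n_dc[x_old] -= 1
    probs, labels = [], []
    for c, K in (("aux", K_aux), ("tar", K_tar)):
        for z in range(K):
            # Factor 1: proportion of word w among words assigned topic z of c.
            p_word = (n_zw[c][z, w] + beta) / (n_zw[c][z].sum() + V * beta)
            # Factor 2: proportion with which this text chose topic z of c.
            p_topic = (n_dz[c][z] + alpha) / (n_dz[c].sum() + K * alpha)
            # Factor 3: how often this text's words chose collection c at all.
            p_switch = n_dc[c] + gamma
            probs.append(p_word * p_topic * p_switch)
            labels.append((c, z))
    probs = np.array(probs) / sum(probs)
    c_new, z_new = labels[rng.choice(len(labels), p=probs)]
    n_zw[c_new][z_new, w] += 1   # add the new assignment back
    n_dz[c_new][z_new] += 1
    n_dc[c_new] += 1
    return c_new, z_new

c, z = resample(w=0, x_old="aux", z_old=0)
print(c, z)
```

Iterating this update over every word of every text until convergence is what Algorithm 3 does; the final counts then yield the parameter estimates of formulas (3) and (4).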
The formulas (1)-(4) referred to in the Gibbs sampling algorithm above are described in detail below.
Formula (1): for all auxiliary-text topics z ∈ [1, ..., K_aux],
p(x_i = aux, z_i = z | ·) ∝ (N^aux_{z,w,-i} + β_aux) / (Σ_v N^aux_{z,v,-i} + V·β_aux) · (N^aux_{d,z,-i} + α_aux) / (Σ_{z'} N^aux_{d,z',-i} + K_aux·α_aux) · (N^aux_{d,-i} + γ)
The meaning of formula (1) is that the probability of sampling the pair (text collection x, topic z) is proportional to three quantities, namely the three factors multiplied on the right-hand side. In the third factor, N^aux_{d,-i} is the number of times all other words selected auxiliary-text topics after the current word's selection in the previous iteration is removed; γ is added so that this number cannot be 0. The second factor is the proportion with which text d chose auxiliary-text topic z after the current word's selection in the previous iteration is removed. The first factor is the proportion with which word w was chosen when auxiliary-text topic z was chosen, after the current word's selection in the previous iteration is removed. The subscript -i means "excluding the selection of the current word w_i" (corresponding to the meaning of step 3.1.1.1.1).
Formula (2): for all short-text topics z ∈ [1, ..., K_tar], the meaning is analogous to formula (1), with auxiliary-text topics replaced by short-text topics:
p(x_i = tar, z_i = z | ·) ∝ (N^tar_{z,w,-i} + β_tar) / (Σ_v N^tar_{z,v,-i} + V·β_tar) · (N^tar_{d,z,-i} + α_tar) / (Σ_{z'} N^tar_{d,z',-i} + K_tar·α_tar) · (N^tar_{d,-i} + γ)
Here c_i denotes the text collection (i.e., the auxiliary text collection or the short text collection) to which text d belongs.
Formula (3):
θ^c_{d,z} = (N^c_{d,z} + α_c) / (Σ_{z'} N^c_{d,z'} + K_c·α_c)
Formula (4):
φ^c_{z,w} = (N^c_{z,w} + β_c) / (Σ_v N^c_{z,v} + V·β_c)
where c ∈ [aux, tar].
The convergence condition can take multiple forms, for example: a preset number of iterations is reached, the change in φ and θ is very small, or the likelihood of the text collection changes little.
Solving with the above Gibbs sampling algorithm yields, for each short text, the probabilities that it corresponds to the topics of the auxiliary text collection and to the topics of the short text collection (i.e., the result of formula (3)).
The vector generation unit 320 normalizes the probabilities of the corresponding topics and generates a vector. Note that the vector here is generated over the union of the topics of the auxiliary text collection and the topics of the short text collection. For any short text d, each dimension of the vector is the value of formula (3) with c taking aux or tar and z taking any one topic, where x ∈ {aux, tar}. For example, the vector generation unit 320 may generate a vector such as f_d = [0.1, 0.5, 0, 0, 0.02, ..., 0].
The clustering unit 330 performs short text clustering based on the vectors generated by the vector generation unit 320. Specifically, after the above vector has been generated for every short text, a clustering method such as K-means can be used to perform the short text clustering and obtain the clustering result for the short texts.
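The vector generation unit and clustering unit can be sketched together as follows; the hand-rolled K-means and the random topic distributions are illustrative stand-ins for whatever clustering method and inferred distributions are actually used:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_vector(theta_aux, theta_tar):
    # Concatenate over the union of both topic sets, then normalize to sum 1.
    v = np.concatenate([theta_aux, theta_tar])
    return v / v.sum()

def kmeans(X, k, iters=20):
    # Minimal K-means: random init, alternate assignment and mean update.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

# 8 short texts with 4 auxiliary topics and 3 short-text topics (invented).
vecs = np.array([make_vector(rng.dirichlet([0.3] * 4), rng.dirichlet([0.2] * 3))
                 for _ in range(8)])
labels = kmeans(vecs, k=2)
print(vecs.shape, labels)
```

Because each vector covers both topic sets, two short texts can land in the same cluster either through shared auxiliary-text topics or through shared short-text topics.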
Fig. 5 shows a flowchart of a short text clustering method 50 according to an embodiment of the present invention. As shown in Fig. 5, the short text clustering method 50 includes steps S510-S530.
In step S510, topic analysis is performed on each text in the auxiliary text collection and the short text collection to obtain their respective topics. Specifically, the DLDA algorithm described above can be used to analyze the topics of the auxiliary texts and short texts. In the solution procedure of the DLDA algorithm, the variational method, expectation propagation, or Gibbs sampling can be used; preferably, DLDA is realized with the Gibbs sampling algorithm described above.
In step S520, based on the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection, a vector is generated after normalizing these probabilities. Preferably, the vector is generated over the union of the topics of the auxiliary text collection and the topics of the short text collection.
In step S530, the short texts are clustered based on the generated vectors. For example, after a vector has been generated for every short text, a clustering method such as K-means can be used to perform the short text clustering.
The result of applying the short text clustering device or method of the present invention to a collection of online advertisements is described below. Assume that 182209 online advertisements covering 42 product series were collected from a commercial website, each text containing 29.06 words on average. In addition, 99737 web pages were collected by product name as auxiliary texts, each containing 560.4 words on average. Each product series serves as one cluster.
The evaluation criterion selects the following entropy form:
Entropy = Σ_ĉ (|ĉ| / N) · ( -Σ_c p(c | ĉ) log p(c | ĉ) )
where Ĉ denotes the clustering produced by the computer, C denotes the correct cluster classes, c is one correct cluster (a product series), ĉ is one computed cluster, |ĉ| is the number of texts in that cluster, N is the total number of texts, and p(c | ĉ) is the proportion of texts in ĉ whose correct cluster label L(x) is c. The smaller the entropy, the better the algorithm performs.
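The entropy criterion can be sketched as follows, using invented clusterings; lower values indicate purer clusters:

```python
import math
from collections import Counter

def clustering_entropy(clusters):
    # clusters: one list of true labels per computed cluster.
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        label_counts = Counter(c)
        # Entropy of the true-label distribution inside this cluster.
        h = -sum((m / len(c)) * math.log(m / len(c))
                 for m in label_counts.values())
        total += (len(c) / n) * h  # weight by cluster size
    return total

pure  = [["a", "a", "a"], ["b", "b", "b"]]   # perfect clustering
mixed = [["a", "b", "a"], ["b", "a", "b"]]   # clusters mix both labels
print(clustering_entropy(pure))              # → 0.0
print(round(clustering_entropy(mixed), 3))
```

A perfect clustering scores 0; the more the true labels are mixed within the computed clusters, the higher the entropy.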
Table 1 below lists the results of applying the DLDA method according to the present invention and several other methods to the online advertisement collection:
Table 1
In Table 1, "Direct" means applying the clustering method directly. "LDA-one" produces topics on the auxiliary text collection and then infers the short texts on those topics (similar to Reference 1). "LDA-both" produces topics on the union of the auxiliary text collection and the short text collection. "STC" is another prior-art clustering method. It can be seen that, because the entropy result of DLDA is the smallest, DLDA performs best at short text clustering.
Although the present invention has been shown above in conjunction with its preferred embodiments, those skilled in the art will appreciate that various modifications, substitutions and changes can be made to the present invention without departing from its spirit and scope. Therefore, the present invention should not be limited by the above embodiments, but should be defined by the appended claims and their equivalents.
Claims (12)
1. A short text clustering device, comprising:
a subject analysis unit that performs subject analysis on each text in an auxiliary text set and a short text set, to obtain, for each
short text in the short text set, the probabilities that the short text belongs to the topics of the auxiliary text set and to the topics of the short text set;
a vector generating unit that normalizes each short text's probabilities of belonging to the topics of the auxiliary text set and the topics
of the short text set, to generate a vector; and
a clustering unit that clusters the short texts in the short text set based on the generated vectors.
2. The short text clustering device according to claim 1, wherein the subject analysis unit determines,
by a switch parameter, whether a word in each text of the auxiliary text set and the short text set belongs to a topic of the auxiliary
text set or a topic of the short text set; if the word belongs to a topic of the auxiliary text set, the subject analysis unit performs
subject analysis by a first latent Dirichlet allocation, and if the word belongs to a topic of the short text set, the subject analysis unit
performs subject analysis by a second latent Dirichlet allocation.
3. The short text clustering device according to claim 2, wherein the subject analysis unit estimates the
parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation using a Gibbs sampling algorithm,
wherein the sampling frequency of a topic of the auxiliary text set is proportional to the number of times all other words are assigned
to that topic of the auxiliary text set after the current word's assignment from the previous iteration is removed, and the sampling
frequency of a topic of the short text set is proportional to the number of times all other words are assigned to that topic of the short
text set after the current word's assignment from the previous iteration is removed.
4. The short text clustering device according to claim 1, wherein the vector generating unit generates
the vector over the union of the topics of the auxiliary text set and the topics of the short text set.
5. The short text clustering device according to claim 2, wherein the value of the switch parameter obeys a binomial distribution.
6. The short text clustering device according to claim 2, wherein the subject analysis unit determines the
switch parameter so as to ensure that a word in an auxiliary text is more likely to belong to a topic of the auxiliary text set than to a
topic of the short text set, and that a word in a short text is more likely to belong to a topic of the short text set than to a topic of
the auxiliary text set.
7. A short text clustering method, comprising:
a subject analysis step of performing subject analysis on each text in an auxiliary text set and a short text set, to obtain, for each
short text in the short text set, the probabilities that the short text belongs to the topics of the auxiliary text set and to the topics of the short text set;
a vector generation step of normalizing each short text's probabilities of belonging to the topics of the auxiliary text set and the topics
of the short text set, to generate a vector; and
a clustering step of clustering the short texts in the short text set based on the generated vectors.
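The vector generation step above amounts to concatenating each short text's probabilities over the two topic sets and normalizing the result. The following sketch illustrates this; the function name and the input probabilities are illustrative, and in the patent the probabilities come from the subject analysis step.

```python
# Vector generation: combine a short text's probabilities over the auxiliary-set
# topics and the short-set topics, then normalize so the components sum to 1.
def make_vector(p_aux_topics, p_short_topics):
    combined = list(p_aux_topics) + list(p_short_topics)
    s = sum(combined)
    return [p / s for p in combined]  # normalized over both topic sets together

# Made-up unnormalized topic weights for one short text.
v = make_vector([0.2, 0.6], [0.1, 0.3])
```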
8. The short text clustering method according to claim 7, wherein the subject analysis step comprises:
determining, by a switch parameter, whether a word in each text of the auxiliary text set and the short text set belongs to a topic of the
auxiliary text set or a topic of the short text set; if the word belongs to a topic of the auxiliary text set, performing subject analysis
by a first latent Dirichlet allocation, and if the word belongs to a topic of the short text set, performing subject analysis by a second latent Dirichlet allocation.
9. The short text clustering method according to claim 8, wherein the parameters used in the first latent
Dirichlet allocation and the second latent Dirichlet allocation are estimated using a Gibbs sampling algorithm, wherein the sampling frequency
of a topic of the auxiliary text set is proportional to the number of times all other words are assigned to that topic of the auxiliary text
set after the current word's assignment from the previous iteration is removed, and the sampling frequency of a topic of the short text set is
proportional to the number of times all other words are assigned to that topic of the short text set after the current word's assignment from
the previous iteration is removed.
10. The short text clustering method according to claim 7, wherein the vector generation step comprises:
generating the vector over the union of the topics of the auxiliary text set and the topics of the short text set.
11. The short text clustering method according to claim 8, wherein the value of the switch parameter obeys a binomial distribution.
12. The short text clustering method according to claim 8, wherein the switch parameter is determined so as
to ensure that a word in an auxiliary text is more likely to belong to a topic of the auxiliary text set than to a topic of the short text
set, and that a word in a short text is more likely to belong to a topic of the short text set than to a topic of the auxiliary text set.
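The Gibbs sampling update described in the claims above can be sketched for a single word within one of the two topic sets: the current word's previous assignment is removed, and a new topic is drawn with weight proportional to how often all other words are currently assigned to each topic. This is a standard collapsed LDA update; the switch between the auxiliary-set and short-set topics is omitted for brevity, and the smoothing hyperparameters alpha and beta are assumed conventional values, not taken from the patent.

```python
# One collapsed Gibbs sampling step for one word's topic assignment.
import random

def gibbs_step(doc_topic_counts, topic_word_counts, topic_totals,
               doc, word, current_topic, vocab_size,
               alpha=0.1, beta=0.01, rng=random.Random(0)):
    # Remove the current word's assignment from the previous iteration.
    doc_topic_counts[doc][current_topic] -= 1
    topic_word_counts[current_topic][word] -= 1
    topic_totals[current_topic] -= 1
    # Sampling weight per topic, computed from the counts of all OTHER words.
    k = len(topic_totals)
    weights = [
        (doc_topic_counts[doc][t] + alpha)
        * (topic_word_counts[t].get(word, 0) + beta)
        / (topic_totals[t] + vocab_size * beta)
        for t in range(k)
    ]
    new_topic = rng.choices(range(k), weights=weights)[0]
    # Record the new assignment.
    doc_topic_counts[doc][new_topic] += 1
    topic_word_counts[new_topic][word] = topic_word_counts[new_topic].get(word, 0) + 1
    topic_totals[new_topic] += 1
    return new_topic

# Tiny made-up state: one document, two topics, one vocabulary word.
doc_topic = [[2, 1]]
topic_word = [{"ad": 2}, {"ad": 1}]
totals = [2, 1]
t = gibbs_step(doc_topic, topic_word, totals, doc=0, word="ad",
               current_topic=0, vocab_size=5)
```

A full sampler would loop this step over every word of every text until the topic assignments stabilize, with the switch parameter deciding, per word, which of the two topic sets the step operates on.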
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110160561.4A CN102831119B (en) | 2011-06-15 | 2011-06-15 | Short text clustering Apparatus and method for |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102831119A CN102831119A (en) | 2012-12-19 |
CN102831119B true CN102831119B (en) | 2016-08-17 |
Family
ID=47334262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110160561.4A Active CN102831119B (en) | 2011-06-15 | 2011-06-15 | Short text clustering Apparatus and method for |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102831119B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103530316B (en) * | 2013-09-12 | 2016-06-01 | 浙江大学 | A kind of science subject extraction method based on multi views study |
CN103729422A (en) * | 2013-12-23 | 2014-04-16 | 武汉传神信息技术有限公司 | Information fragment associative output method and system |
CN104573070B (en) * | 2015-01-26 | 2018-06-15 | 清华大学 | A kind of Text Clustering Method for mixing length text set |
CN104850617B (en) * | 2015-05-15 | 2018-04-20 | 百度在线网络技术(北京)有限公司 | Short text processing method and processing device |
CN108628875B (en) * | 2017-03-17 | 2022-08-30 | 腾讯科技(北京)有限公司 | Text label extraction method and device and server |
CN107992477B (en) * | 2017-11-30 | 2019-03-29 | 北京神州泰岳软件股份有限公司 | Text subject determines method and device |
CN108228721B (en) * | 2017-12-08 | 2021-06-04 | 复旦大学 | Fast text clustering method on large corpus |
CN111090995B (en) * | 2019-11-15 | 2023-03-31 | 合肥工业大学 | Short text topic identification method and system |
CN111897912B (en) * | 2020-07-13 | 2021-04-06 | 上海乐言科技股份有限公司 | Active learning short text classification method and system based on sampling frequency optimization |
CN113407679B (en) * | 2021-06-30 | 2023-10-03 | 竹间智能科技(上海)有限公司 | Text topic mining method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1828608A (en) * | 2006-04-13 | 2006-09-06 | 北大方正集团有限公司 | Multiple file summarization method based on sentence relation graph |
CN101187919A (en) * | 2006-11-16 | 2008-05-28 | 北大方正集团有限公司 | Method and system for abstracting batch single document for document set |
US20090164417A1 (en) * | 2004-09-30 | 2009-06-25 | Nigam Kamal P | Topical sentiments in electronically stored communications |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102831119B (en) | Apparatus and method for short text clustering | |
CN105095394B (en) | webpage generating method and device | |
CN103699525B (en) | A kind of method and apparatus automatically generating summary based on text various dimensions feature | |
Li et al. | Comparison of word embeddings and sentence encodings as generalized representations for crisis tweet classification tasks | |
CN107220386A (en) | Information-pushing method and device | |
CN101819585A (en) | Device and method for constructing forum event dissemination pattern | |
CN109670039A (en) | Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering | |
Jha et al. | Educational data mining using improved apriori algorithm | |
CN103164428B (en) | Determine the method and apparatus of the correlativity of microblogging and given entity | |
CN103020712B (en) | A kind of distributed sorter of massive micro-blog data and method | |
Tang et al. | Social media-based disaster research: Development, trends, and obstacles | |
CN103886020A (en) | Quick search method of real estate information | |
CN103488637B (en) | A kind of method carrying out expert Finding based on dynamics community's excavation | |
Amin et al. | Community detection and mining using complex networks tools in social internet of things | |
Rani et al. | A survey of tools for social network analysis | |
CN115048506A (en) | Test question generation system, method and device based on knowledge graph and storage medium | |
CN104516873A (en) | Method and device for building emotion model | |
Mishra | Information extraction from digital social trace data with applications to social media and scholarly communication data | |
Lu | The research of the knowledge management technology in the education | |
Wang et al. | IdeaGraph: turning data into human insights for collective intelligence | |
He et al. | Design of shared Internet of Things system for English translation teaching using deep learning text classification | |
CN104503959A (en) | Method and equipment for predicting user emotion tendency | |
Zhang | Integration of art teaching resources in vertical social network | |
Setyawan et al. | Sentiment Analysis of Public Responses on Indonesia Government Using Naïve Bayes and Support Vector Machine | |
Rao | Extremism Video Detection In Social Media |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20171207
Address after: 100190 Zhongguancun street, Haidian District, Beijing, No. 18, block B, block 18
Patentee after: Data Hall (Beijing) Polytron Technologies Inc
Address before: 100191 Haidian District, Xueyuan Road, No. 35, the world building, the second floor of the building on the ground floor, No. 20
Patentee before: NEC (China) Co., Ltd.