CN102831119B - Short text clustering apparatus and method - Google Patents
Short text clustering apparatus and method
- Publication number: CN102831119B (application CN201110160561.4A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention provides a short text clustering device, including: a topic analysis unit that performs topic analysis on each text in an auxiliary text collection and a short text collection, to obtain, for each short text in the short text collection, the probabilities that it corresponds to the topics of the auxiliary text collection and to the topics of the short text collection; a vector generation unit that normalizes, for each short text, these probabilities to generate a vector; and a clustering unit that clusters the short texts in the short text collection based on the generated vectors. The invention also provides a short text clustering method. The invention discovers auxiliary-text topics and short-text topics separately, so that short texts can be clustered more accurately.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a short text clustering apparatus and method.
Background art
With the wide use of SMS, microblogs, search engines, online advertising and the like, short texts are used more and more frequently. These texts are very short: an SMS, for example, cannot exceed 70 characters, and a result returned by a search engine generally contains only a few dozen words.
Short texts differ considerably from long texts (such as news articles). For example, in a long text a topic can be fully described, so that a reader can learn nearly all of its content from that one text. In contrast, because the number of words in a short text is limited, usually only the core of a topic is described, and much related information is omitted.
Traditional text mining methods are generally designed for long texts and run into difficulty when applied to short texts; clustering is one example. Clustering usually relies on co-occurrence information (words appearing together), and short texts carry far less such information than long texts, so clustering quality suffers. Consider the two news texts L1 and L2 below:
L1: "The No. 4 teaching building of Tsing-Hua University has been renamed the 'Jeanwest Building', drawing ridicule on campus and on the Internet. The main objection is that the image of a Tsing-Hua teaching building and the image of the Jeanwest apparel brand do not match. From the standpoint of due process in naming university buildings, Tsing-Hua clearly has something to answer for. Setting that aside, and focusing only on the substantive issue of brand image in naming a teaching building: does the 'Jeanwest Building' really damage Tsing-Hua's image?"
L2: "Recently, a teaching building of Tsing-Hua University was named 'Jeanwest', causing an uproar online. Isn't Jeanwest an apparel brand? How can a Tsing-Hua teaching building also be 'Jeanwest'? At noon on the 23rd, a plaque reading 'Jeanwest Building' was hung on the outer wall of the No. 4 teaching building; below it to the right another plaque was hung, specially introducing the Jeanwest apparel brand. Naming a teaching building after a brand has caused a dispute among Tsing-Hua students and netizens. Some believe that universities are overly commercialized and should not sell naming rights to companies. The Sina blogger @Young_pig, however, argues that the company is providing sponsorship and that the naming does not harm the school's image."
Because L1 and L2 share words such as "Tsing-Hua University", "No. 4 teaching building", "Jeanwest", "apparel", "university", "naming" and "image", it is easy to determine that they are very similar and can be grouped into one cluster. The two short texts S1 and S2 below, however, are not easy to group together, because the only important word they share is "Tsing-Hua University" (the word "also" is so commonly used that it carries almost no weight and is usually removed before clustering):
S1: "I heard the Jeanwest Building also does not match Tsing-Hua's image"
S2: "Isn't it just an apparel brand? Tsing-Hua's naming is overly commercialized"
To improve the correctness of short text clustering, the prior art has proposed using auxiliary information to help the clustering. For example, to cluster short texts such as S1 and S2 above, long texts such as L1 and L2 are introduced as auxiliary information: S1 is more similar to L1 (sharing words such as "Jeanwest", "Tsing-Hua University", "image", "not match"), and S2 is more similar to L2 (sharing words such as "apparel", "Tsing-Hua University", "naming", "commercialized"). Since L1 and L2 are themselves similar, S1 and S2 are also similar and can be grouped into one cluster.
Reference 1 (X. H. Phan, L. M. Nguyen, S. Horiguchi, "Learning to classify short and sparse text & web with hidden topics from large-scale data collections", WWW 2008) describes a method of clustering short texts with the aid of auxiliary text. As shown in Fig. 1, the method comprises the following steps:
In step S100, topic analysis is performed on the auxiliary text collection to obtain a number of topics and their corresponding vocabularies. Specifically, Reference 1 uses text downloaded from Wikipedia as the auxiliary information, forming the auxiliary text collection. The topic analysis uses Latent Dirichlet Allocation (LDA). Fig. 2 shows the LDA model. LDA is a generative model whose main idea is to simulate the process by which a text is generated: for each word, first select a topic from a distribution, then select a word from that topic. With reference to Fig. 2, the LDA algorithm is:
1. For each topic k ∈ [1, K], draw a sample from the distribution Dir(β), obtaining a word distribution φ_k for that topic.
2. For each text m ∈ [1, M]:
 2.1. Draw a sample from the distribution Dir(α), obtaining a topic distribution θ_m.
 2.2. For each word position n:
  2.2.1. Draw a topic z_{m,n} from the multinomial distribution θ_m.
  2.2.2. Draw a word w_{m,n} from the multinomial distribution φ_{z_{m,n}}.
Algorithm 1 - LDA
Here the value of α represents the prior weight of each topic before sampling, and the value of β represents the prior distribution of the words of each topic. Both are predetermined parameters, called hyperparameters.
The task of LDA is to estimate the parameters φ and θ_d. The joint density of all observed and hidden variables is:
p(w, z, θ, φ | α, β) = ∏_{k=1}^{K} p(φ_k | β) · ∏_{m=1}^{M} p(θ_m | α) ∏_{n=1}^{N_m} p(z_{m,n} | θ_m) p(w_{m,n} | φ_{z_{m,n}})
The likelihood of one text is:
p(w_m | α, β) = ∫ p(θ_m | α) ∏_{n=1}^{N_m} Σ_z p(z | θ_m) p(w_{m,n} | φ_z) dθ_m
and the likelihood of the whole text collection is the product over all texts:
p(W | α, β) = ∏_{m=1}^{M} p(w_m | α, β)
In theory, φ and θ could be solved for by maximizing the above likelihood, but this has no closed-form solution, so in practice the parameters are estimated approximately. For example, Reference 1 uses Gibbs sampling for the estimation. Reference 2 (Thomas L. Griffiths, Mark Steyvers, "Finding scientific topics", Proceedings of the National Academy of Sciences of the United States of America, Vol. 101, Suppl. 1, 6 April 2004, pp. 5228-5235) and Reference 3 (Gregor Heinrich, "Parameter estimation for text analysis", Technical Report, 2004) describe in detail the process and algorithms for realizing LDA with Gibbs sampling.
In step S110, inference is performed on the short text collection based on the topics obtained in step S100, to obtain the topics corresponding to the short texts. The inference again uses Gibbs sampling.
In step S120, a training sample set is constructed based on the result of step S110. The training data take vector form: each short text corresponds to one vector. A vector is generated for every short text in the short text collection, and each short text is then given a class label, forming the training sample set.
In step S130, a machine learning method is selected and applied to the training sample set in order to obtain a classification model. Many methods are available, such as decision trees, SVM, and maximum entropy; Reference 1 uses the maximum entropy method.
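Steps S120 and S130 can be illustrated as follows. Reference 1 trains a maximum-entropy classifier on the topic vectors; as a dependency-free stand-in, the sketch below fits a nearest-centroid classifier instead, and the topic vectors and labels are invented for illustration:

```python
import numpy as np

# Each short text is represented by its (normalized) topic-probability
# vector; a class label is attached to form the training sample set.
X = np.array([[0.8, 0.1, 0.1],   # short texts mostly about topic 0
              [0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8],   # short texts mostly about topic 2
              [0.2, 0.1, 0.7]])
y = np.array([0, 0, 1, 1])

# "Training": one centroid per class (stand-in for maximum entropy).
centroids = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])

def predict(v):
    # Assign a new topic vector to the class of the nearest centroid.
    return int(np.argmin(((centroids - v) ** 2).sum(axis=1)))

print(predict(np.array([0.75, 0.15, 0.1])))  # → 0
print(predict(np.array([0.1, 0.2, 0.7])))    # → 1
```

Any of the classifiers the text lists (decision trees, SVM, maximum entropy) could replace the centroid rule without changing the surrounding pipeline.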
However, when the short texts are inferred on the topics formed from the auxiliary text in step S110, Reference 1 assumes that the topics of the short texts are covered by the auxiliary text. In many real situations this assumption does not hold, or holds only poorly, because many situations and events are newly emerging, and no comprehensive knowledge base can be guaranteed to cover the topics appearing in all of them. In such cases the auxiliary text covers only part of the topics of the short texts. The method of Reference 1 therefore cannot discover and exploit the newly emerging topics inherent in the short texts, which degrades the quality of classification or clustering.
Summary of the invention
To solve the above technical problem, the present invention performs topic analysis on each text in an auxiliary text collection and a short text collection, to obtain for each short text in the short text collection the probabilities that it corresponds to the topics of the auxiliary text collection and to the topics of the short text collection. The present invention does not require an auxiliary text collection that covers all topics of the short texts; it only requires that the auxiliary text be partially related to the short texts. Specifically, the present invention clusters short texts using two coupled Latent Dirichlet Allocations (Double Latent Dirichlet Allocation, DLDA). By building two LDAs and adding a switch between them, DLDA discovers auxiliary-text topics and short-text topics separately, and can determine the probability that any short text corresponds to an auxiliary-text topic or a short-text topic.
According to one aspect of the present invention, there is provided a short text clustering device, including: a topic analysis unit that performs topic analysis on each text in an auxiliary text collection and a short text collection, to obtain for each short text in the short text collection the probabilities that it corresponds to the topics of the auxiliary text collection and to the topics of the short text collection; a vector generation unit that normalizes, for each short text, these probabilities to generate a vector; and a clustering unit that clusters the short texts in the short text collection based on the generated vectors.
Preferably, the topic analysis unit determines by a switch parameter whether a word in a text of the auxiliary text collection or the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if it corresponds to a topic of the auxiliary text collection, the topic analysis unit performs topic analysis through a first Latent Dirichlet Allocation, and if it corresponds to a topic of the short text collection, through a second Latent Dirichlet Allocation.
Preferably, the topic analysis unit uses a Gibbs sampling algorithm to estimate the parameters used in the first Latent Dirichlet Allocation and the second Latent Dirichlet Allocation, wherein the sampling frequency of a topic of the auxiliary text collection is proportional to the number of times all other words selected that topic of the auxiliary text collection in the previous iteration after the current word's selection is removed, and the sampling frequency of a topic of the short text collection is proportional to the number of times all other words selected that topic of the short text collection in the previous iteration after the current word's selection is removed.
Preferably, the vector generation unit generates the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
Preferably, the value of the switch parameter obeys a binomial distribution.
Preferably, the topic analysis unit determines the switch parameter so as to ensure that a word in an auxiliary text is more likely to correspond to a topic of the auxiliary text collection than to a topic of the short text collection, and a word in a short text is more likely to correspond to a topic of the short text collection than to a topic of the auxiliary text collection.
According to another aspect of the present invention, there is provided a short text clustering method, including: a topic analysis step of performing topic analysis on each text in an auxiliary text collection and a short text collection, to obtain for each short text in the short text collection the probabilities that it corresponds to the topics of the auxiliary text collection and to the topics of the short text collection; a vector generation step of normalizing, for each short text, these probabilities to generate a vector; and a clustering step of clustering the short texts in the short text collection based on the generated vectors.
Preferably, the topic analysis step includes: determining by a switch parameter whether a word in a text of the auxiliary text collection or the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if it corresponds to a topic of the auxiliary text collection, performing topic analysis through a first Latent Dirichlet Allocation, and if it corresponds to a topic of the short text collection, through a second Latent Dirichlet Allocation.
Preferably, a Gibbs sampling algorithm is used to estimate the parameters used in the first Latent Dirichlet Allocation and the second Latent Dirichlet Allocation, wherein the sampling frequency of a topic of the auxiliary text collection is proportional to the number of times all other words selected that topic of the auxiliary text collection in the previous iteration after the current word's selection is removed, and the sampling frequency of a topic of the short text collection is proportional to the number of times all other words selected that topic of the short text collection in the previous iteration after the current word's selection is removed.
Preferably, the vector generation step includes generating the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
Preferably, the value of the switch parameter obeys a binomial distribution.
Preferably, the switch parameter is determined so as to ensure that a word in an auxiliary text is more likely to correspond to a topic of the auxiliary text collection than to a topic of the short text collection, and a word in a short text is more likely to correspond to a topic of the short text collection than to a topic of the auxiliary text collection.
The present invention discovers auxiliary-text topics and short-text topics separately, and can therefore cluster short texts more accurately.
Brief description of the drawings
The above and other features of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows a flowchart of a prior-art short text clustering method;
Fig. 2 shows a block diagram of the LDA model used by the short text clustering method of Fig. 1;
Fig. 3 shows a block diagram of a short text clustering device according to an embodiment of the present invention;
Fig. 4 shows a block diagram of the DLDA model used by the short text clustering device according to an embodiment of the present invention; and
Fig. 5 shows a flowchart of a short text clustering method according to an embodiment of the present invention.
Detailed description of the invention
Below, the principles and implementation of the present invention will become apparent from the description of specific embodiments in conjunction with the accompanying drawings. It should be noted that the present invention is not limited to the specific embodiments described below. In addition, for brevity, detailed descriptions of well-known techniques not directly related to the present invention are omitted.
Fig. 3 shows a block diagram of a short text clustering device 30 according to an embodiment of the present invention. As shown in Fig. 3, the short text clustering device 30 includes a topic analysis unit 310, a vector generation unit 320 and a clustering unit 330.
The topic analysis unit 310 performs topic analysis on each text in the auxiliary text collection and the short text collection to obtain their respective topics. In a specific embodiment, the topic analysis unit 310 performs the topic analysis using the DLDA model shown in Fig. 4. As can be seen from Fig. 4, DLDA includes two LDAs, corresponding respectively to topic analysis of the auxiliary texts and of the short texts (where "aux" denotes auxiliary text and "tar" denotes short text). To coordinate the two LDAs, a switch mechanism governed by a hyperparameter γ is introduced: for each word, a switch variable chooses whether its topic is selected from the auxiliary-text topics or from the short-text topics.
In this embodiment, the topic analysis unit 310 analyzes the topics of the auxiliary texts and short texts by the following algorithm:
1. For each topic z ∈ [1, ..., K_aux] of the auxiliary text collection, draw a sample from the distribution Dir(β_aux), obtaining a word distribution φ^aux_z for that topic.
2. For each topic z ∈ [1, ..., K_tar] of the short text collection, draw a sample from the distribution Dir(β_tar), obtaining a word distribution φ^tar_z for that topic.
3. For each text collection c ∈ [aux, tar]:
 3.1. For each text d ∈ [1, ..., D_c]:
  3.2. Draw a sample from the distribution Dir(α_aux), obtaining an auxiliary-collection topic distribution θ^aux_d.
  3.3. Draw a sample from the distribution Dir(α_tar), obtaining a short-collection topic distribution θ^tar_d.
  3.4. Draw a sample from the distribution Beta(γ_c), obtaining a binomial distribution π_d.
  3.5. For each word w_{d,n}:
   3.5.1. Draw a switch value x_{d,n} from the binomial distribution π_d.
   3.5.2. If x_{d,n} = aux, draw a topic z_{d,n} from the multinomial distribution θ^aux_d of the auxiliary text collection.
   3.5.3. If x_{d,n} = tar, draw a topic z_{d,n} from the multinomial distribution θ^tar_d of the short text collection.
   3.5.4. Draw a word w_{d,n} from the multinomial distribution φ of the chosen topic.
Algorithm 2 - DLDA
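The two-LDA structure of Algorithm 2 can be sketched in code as follows. Every size, hyperparameter value and name here is our own illustration of the generative process, not the patent's notation:

```python
import numpy as np

rng = np.random.default_rng(1)

K_aux, K_tar, V = 15, 10, 5000
phi_aux = rng.dirichlet([0.01] * V, size=K_aux)  # step 1: aux topic-word dists
phi_tar = rng.dirichlet([0.01] * V, size=K_tar)  # step 2: tar topic-word dists

def generate_doc(n_words=30, gamma=(0.5, 0.2)):
    theta_aux = rng.dirichlet([0.3] * K_aux)     # step 3.2
    theta_tar = rng.dirichlet([0.2] * K_tar)     # step 3.3
    pi = rng.beta(*gamma)                        # step 3.4: switch probability
    words = []
    for _ in range(n_words):
        if rng.random() < pi:                    # step 3.5.1: switch x = aux
            z = rng.choice(K_aux, p=theta_aux)   # step 3.5.2: aux topic
            w = rng.choice(V, p=phi_aux[z])
        else:                                    # switch x = tar
            z = rng.choice(K_tar, p=theta_tar)   # step 3.5.3: tar topic
            w = rng.choice(V, p=phi_tar[z])
        words.append(int(w))                     # step 3.5.4: emit the word
    return words

doc = generate_doc()
print(len(doc))
```

The per-word switch is what lets one document mix topics from both collections, which is the point of DLDA.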
A concrete application example is described below. Assume there are 100 auxiliary texts and 50 short texts. Take K_aux = 15, K_tar = 10, α_aux = 0.3, α_tar = 0.2, β_aux = β_tar = 0.01. Note that α and β are multi-dimensional vectors whose components may in theory take different values, but in practice every component is usually reduced to the same value. Note also that the Beta hyperparameters γ should be set so that a word in a text of collection c is a priori more likely to take a topic of c than a topic of the other collection ~c, where c and ~c denote one text collection and the other: if c = aux then ~c = tar, and if c = tar then ~c = aux.
First, the topic analysis unit 310 counts the full vocabulary of the auxiliary texts and short texts (each distinct word counted once), denoted V. Suppose here that V contains 5000 words.
Then, for each topic z ∈ [1, ..., 15] of the auxiliary text collection, the topic analysis unit 310 samples from the distribution Dir(0.01) to obtain the word distribution φ^aux_z of that topic, for example φ^aux_1 = [0.001, 0, ...]. The dimension of this vector is 5000 and its components sum to 1; the meaning is that, for this topic, the probability of choosing the first word is 0.001, the probability of choosing the second word is 0, and so on.
Similarly, for each topic z ∈ [1, ..., 10] of the short text collection, the topic analysis unit 310 samples from the distribution Dir(0.01) to obtain the word distribution φ^tar_z of that topic, again a vector of dimension 5000.
Then, for the first auxiliary text d = 1 in the auxiliary text collection c = aux, the topic analysis unit 310 samples from the distribution Dir(0.3) to obtain an auxiliary-collection topic distribution θ^aux_1 = [0.1, 0.2, ...]. The dimension of this vector is 15 and its components sum to 1; the meaning is that the probability of choosing the first topic is 0.1, the probability of choosing the second topic is 0.2, and so on. The topic analysis unit 310 also samples from the distribution Dir(0.2) to obtain a short-collection topic distribution θ^tar_1, a vector of dimension 10 whose components sum to 1. Next, the topic analysis unit 310 samples from the distribution Beta(0.5, 0.2) to obtain a binomial distribution π_1 = [0.7, 0.3], meaning that the probability of choosing an auxiliary-text topic is 0.7 and the probability of choosing a short-text topic is 0.3. Assume text 1 contains 30 words. For the first word w_{1,1}, a switch value x_{1,1} = aux is sampled from π_1. Since x_{1,1} = aux, a topic is sampled from the multinomial distribution θ^aux_1 of the auxiliary text collection; assume the 15th topic is drawn, z_{1,1} = 15. Afterwards, the topic analysis unit 310 performs a sampling from the multinomial distribution φ^aux_15; if, say, the 1200th word is drawn, then w_{1,1} = "TV".
Solving DLDA requires solving for the parameters φ and θ. Concrete solution methods include the variational method, expectation propagation, or Gibbs sampling. In this embodiment, the Gibbs sampling algorithm described below is used to solve for the parameters.
1. For all auxiliary-text topics z ∈ [1, ..., K_aux], all words w and all texts d, initialize the counts N^aux_{z,w} = 0 and N^aux_{d,z} = 0; for all short-text topics z ∈ [1, ..., K_tar], all words w and all texts d, initialize N^tar_{z,w} = 0 and N^tar_{d,z} = 0; for all texts d, initialize N^aux_d = 0 and N^tar_d = 0. Here N^aux_{z,w} and N^tar_{z,w} are the number of times word w was selected while topic z was selected as an auxiliary-text topic or a short-text topic, respectively; N^aux_{d,z} and N^tar_{d,z} are the number of times text d chose auxiliary-text topic z or short-text topic z, respectively; and N^aux_d and N^tar_d are the number of times the words in text d selected an auxiliary-text topic or a short-text topic, respectively.
2. For each text collection c ∈ [aux, tar]:
 2.1. For each text d ∈ [1, ..., D_c]:
  2.1.1. For each word w:
   2.1.1.1. Sample a switch value x from the binomial distribution π = [0.5, 0.5].
   2.1.1.2. If x = aux, sample a topic z from the uniform multinomial distribution Multinomial(1/K_aux) and increment the corresponding counts.
   2.1.1.3. If x = tar, sample a topic z from the uniform multinomial distribution Multinomial(1/K_tar) and increment the corresponding counts.
3. Loop:
 3.1. For each text collection c ∈ [aux, tar]:
  3.1.1. For each text d ∈ [1, ..., D_c]:
   3.1.1.1. For each word w:
    3.1.1.1.1. Let x and z be the text collection and topic assigned to w by the sampling of the previous iteration, and remove w's current assignment from the counts.
    3.1.1.1.2. Sample a new switch value x from the binomial distribution determined by the counts N^aux_d and N^tar_d together with the hyperparameter γ.
    3.1.1.1.3. If x = aux, sample a topic z from the multinomial distribution determined by formula (1) (see below) and update the counts accordingly.
    3.1.1.1.4. If x = tar, sample a topic z from the multinomial distribution determined by formula (2) (see below) and update the counts accordingly.
 3.2. If the convergence condition is reached, compute the parameters according to formulas (3) and (4) (see below) and exit the loop; otherwise, continue the loop.
Algorithm 3 - Gibbs sampling
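One pass of step 3 of Algorithm 3 can be sketched as follows: remove the current word's assignment, score every (collection, topic) pair by the three factors that the description of formulas (1) and (2) gives, resample, and add the new assignment back. The count-table layout and hyperparameter values are our own simplification (a single document, symmetric priors):

```python
import numpy as np

rng = np.random.default_rng(2)
K_aux, K_tar, V = 3, 2, 6
alpha, beta, gamma = 0.3, 0.01, 0.5

# Count tables (here pre-filled with ones as if one iteration had run):
n_zw = {"aux": np.ones((K_aux, V)), "tar": np.ones((K_tar, V))}  # topic-word
n_dz = {"aux": np.ones(K_aux),      "tar": np.ones(K_tar)}       # doc-topic
n_dc = {"aux": float(K_aux),        "tar": float(K_tar)}         # doc-collection

def resample(w, x_old, z_old):
    # Remove the current word's assignment (the "minus current word" step).
    n_zw[x_old][z_old, w] -= 1
    n_dz[x_old][z_old] -= 1
    n_dc[x_old] -= 1
    probs, labels = [], []
    for c, K in (("aux", K_aux), ("tar", K_tar)):
        for z in range(K):
            # Factor 1: proportion of word w among words assigned topic z of c.
            p_word = (n_zw[c][z, w] + beta) / (n_zw[c][z].sum() + V * beta)
            # Factor 2: proportion with which this text chose topic z of c.
            p_topic = (n_dz[c][z] + alpha) / (n_dz[c].sum() + K * alpha)
            # Factor 3: how often this text's words chose collection c at all.
            p_switch = n_dc[c] + gamma
            probs.append(p_word * p_topic * p_switch)
            labels.append((c, z))
    probs = np.array(probs) / sum(probs)
    c_new, z_new = labels[rng.choice(len(labels), p=probs)]
    n_zw[c_new][z_new, w] += 1   # add the new assignment back
    n_dz[c_new][z_new] += 1
    n_dc[c_new] += 1
    return c_new, z_new

c, z = resample(w=0, x_old="aux", z_old=0)
print(c, z)
```

Iterating this update over every word of every text until convergence is what Algorithm 3 does; the final counts then yield the parameter estimates of formulas (3) and (4).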
The formulas (1)-(4) referred to in the Gibbs sampling algorithm above are described in detail below.
Formula (1): for all auxiliary-text topics z ∈ [1, ..., K_aux],
p(x_i = aux, z_i = z | ·) ∝ (N^aux_{z,w,-i} + β_aux) / (Σ_v N^aux_{z,v,-i} + V·β_aux) · (N^aux_{d,z,-i} + α_aux) / (Σ_{z'} N^aux_{d,z',-i} + K_aux·α_aux) · (N^aux_{d,-i} + γ)
The meaning of formula (1) is that the probability of sampling the pair (text collection x, topic z) is proportional to three quantities, namely the three factors multiplied on the right-hand side. In the third factor, N^aux_{d,-i} is the number of times all other words selected auxiliary-text topics after the current word's selection in the previous iteration is removed; γ is added so that this number cannot be 0. The second factor is the proportion with which text d chose auxiliary-text topic z after the current word's selection in the previous iteration is removed. The first factor is the proportion with which word w was chosen when auxiliary-text topic z was chosen, after the current word's selection in the previous iteration is removed. The subscript -i means "excluding the selection of the current word w_i" (corresponding to the meaning of step 3.1.1.1.1).
Formula (2): for all short-text topics z ∈ [1, ..., K_tar], the meaning is analogous to formula (1), with auxiliary-text topics replaced by short-text topics:
p(x_i = tar, z_i = z | ·) ∝ (N^tar_{z,w,-i} + β_tar) / (Σ_v N^tar_{z,v,-i} + V·β_tar) · (N^tar_{d,z,-i} + α_tar) / (Σ_{z'} N^tar_{d,z',-i} + K_tar·α_tar) · (N^tar_{d,-i} + γ)
Here c_i denotes the text collection (i.e., the auxiliary text collection or the short text collection) to which text d belongs.
Formula (3):
θ^c_{d,z} = (N^c_{d,z} + α_c) / (Σ_{z'} N^c_{d,z'} + K_c·α_c)
Formula (4):
φ^c_{z,w} = (N^c_{z,w} + β_c) / (Σ_v N^c_{z,v} + V·β_c)
where c ∈ [aux, tar].
The convergence condition can take multiple forms, for example: a preset number of iterations is reached, the change in φ and θ is very small, or the likelihood of the text collection changes little.
Solving with the above Gibbs sampling algorithm yields, for each short text, the probabilities that it corresponds to the topics of the auxiliary text collection and to the topics of the short text collection (i.e., the result of formula (3)).
The vector generation unit 320 normalizes the probabilities of the corresponding topics and generates a vector. Note that the vector here is generated over the union of the topics of the auxiliary text collection and the topics of the short text collection. For any short text d, each dimension of the vector is the value of formula (3) with c taking aux or tar and z taking any one topic, where x ∈ {aux, tar}. For example, the vector generation unit 320 may generate a vector such as f_d = [0.1, 0.5, 0, 0, 0.02, ..., 0].
The clustering unit 330 performs short text clustering based on the vectors generated by the vector generation unit 320. Specifically, after the above vector has been generated for every short text, a clustering method such as K-means can be used to perform the short text clustering and obtain the clustering result for the short texts.
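The vector generation unit and clustering unit can be sketched together as follows; the hand-rolled K-means and the random topic distributions are illustrative stand-ins for whatever clustering method and inferred distributions are actually used:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_vector(theta_aux, theta_tar):
    # Concatenate over the union of both topic sets, then normalize to sum 1.
    v = np.concatenate([theta_aux, theta_tar])
    return v / v.sum()

def kmeans(X, k, iters=20):
    # Minimal K-means: random init, alternate assignment and mean update.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

# 8 short texts with 4 auxiliary topics and 3 short-text topics (invented).
vecs = np.array([make_vector(rng.dirichlet([0.3] * 4), rng.dirichlet([0.2] * 3))
                 for _ in range(8)])
labels = kmeans(vecs, k=2)
print(vecs.shape, labels)
```

Because each vector covers both topic sets, two short texts can land in the same cluster either through shared auxiliary-text topics or through shared short-text topics.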
Fig. 5 shows a flowchart of a short text clustering method 50 according to an embodiment of the present invention. As shown in Fig. 5, the short text clustering method 50 includes steps S510-S530.
In step S510, topic analysis is performed on each text in the auxiliary text collection and the short text collection to obtain their respective topics. Specifically, the DLDA algorithm described above can be used to analyze the topics of the auxiliary texts and short texts. In the solution procedure of the DLDA algorithm, the variational method, expectation propagation, or Gibbs sampling can be used; preferably, DLDA is realized with the Gibbs sampling algorithm described above.
In step S520, based on the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection, a vector is generated after normalizing these probabilities. Preferably, the vector is generated over the union of the topics of the auxiliary text collection and the topics of the short text collection.
In step S530, the short texts are clustered based on the generated vectors. For example, after a vector has been generated for every short text, a clustering method such as K-means can be used to perform the short text clustering.
The result of applying the short text clustering device or method of the present invention to a collection of online advertisements is described below. Assume that 182209 online advertisements covering 42 product series were collected from a commercial website, each text containing 29.06 words on average. In addition, 99737 web pages were collected by product name as auxiliary texts, each containing 560.4 words on average. Each product series serves as one cluster.
The evaluation criterion selects the following entropy form:
Entropy = Σ_ĉ (|ĉ| / N) · ( -Σ_c p(c | ĉ) log p(c | ĉ) )
where Ĉ denotes the clustering produced by the computer, C denotes the correct cluster classes, c is one correct cluster (a product series), ĉ is one computed cluster, |ĉ| is the number of texts in that cluster, N is the total number of texts, and p(c | ĉ) is the proportion of texts in ĉ whose correct cluster label L(x) is c. The smaller the entropy, the better the algorithm performs.
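The entropy criterion can be sketched as follows, using invented clusterings; lower values indicate purer clusters:

```python
import math
from collections import Counter

def clustering_entropy(clusters):
    # clusters: one list of true labels per computed cluster.
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        label_counts = Counter(c)
        # Entropy of the true-label distribution inside this cluster.
        h = -sum((m / len(c)) * math.log(m / len(c))
                 for m in label_counts.values())
        total += (len(c) / n) * h  # weight by cluster size
    return total

pure  = [["a", "a", "a"], ["b", "b", "b"]]   # perfect clustering
mixed = [["a", "b", "a"], ["b", "a", "b"]]   # clusters mix both labels
print(clustering_entropy(pure))              # → 0.0
print(round(clustering_entropy(mixed), 3))
```

A perfect clustering scores 0; the more the true labels are mixed within the computed clusters, the higher the entropy.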
Table 1 below lists the results of applying the DLDA method according to the present invention and several other methods to the online advertisement collection:
Table 1
In Table 1, "Direct" means applying the clustering method directly. "LDA-one" produces topics on the auxiliary text collection and then infers the short texts on those topics (similar to Reference 1). "LDA-both" produces topics on the union of the auxiliary text collection and the short text collection. "STC" is another prior-art clustering method. It can be seen that, because the entropy result of DLDA is the smallest, DLDA performs best at short text clustering.
Although the present invention has been shown above in conjunction with its preferred embodiments, those skilled in the art will appreciate that various modifications, substitutions and changes can be made to the present invention without departing from its spirit and scope. Therefore, the present invention should not be limited by the above embodiments, but should be defined by the appended claims and their equivalents.
Claims (12)
1. A short text clustering device, comprising:
a subject analysis unit that performs subject analysis on each text in an auxiliary text set and a short text set, to obtain, for each
short text in the short text set, the probabilities that the short text belongs to the topics of the auxiliary text set and to the topics of the short text set;
a vector generating unit that normalizes each short text's probabilities of belonging to the topics of the auxiliary text set and the topics
of the short text set, to generate a vector; and
a clustering unit that clusters the short texts in the short text set based on the generated vectors.
2. The short text clustering device according to claim 1, wherein the subject analysis unit determines,
by a switch parameter, whether a word in each text of the auxiliary text set and the short text set belongs to a topic of the auxiliary
text set or a topic of the short text set; if the word belongs to a topic of the auxiliary text set, the subject analysis unit performs
subject analysis by a first latent Dirichlet allocation, and if the word belongs to a topic of the short text set, the subject analysis unit
performs subject analysis by a second latent Dirichlet allocation.
3. The short text clustering device according to claim 2, wherein the subject analysis unit estimates the
parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation using a Gibbs sampling algorithm,
wherein the sampling frequency of a topic of the auxiliary text set is proportional to the number of times all other words are assigned
to that topic of the auxiliary text set after the current word's assignment from the previous iteration is removed, and the sampling
frequency of a topic of the short text set is proportional to the number of times all other words are assigned to that topic of the short
text set after the current word's assignment from the previous iteration is removed.
4. The short text clustering device according to claim 1, wherein the vector generating unit generates
the vector over the union of the topics of the auxiliary text set and the topics of the short text set.
5. The short text clustering device according to claim 2, wherein the value of the switch parameter obeys a binomial distribution.
6. The short text clustering device according to claim 2, wherein the subject analysis unit determines the
switch parameter so as to ensure that a word in an auxiliary text is more likely to belong to a topic of the auxiliary text set than to a
topic of the short text set, and that a word in a short text is more likely to belong to a topic of the short text set than to a topic of
the auxiliary text set.
7. A short text clustering method, comprising:
a subject analysis step of performing subject analysis on each text in an auxiliary text set and a short text set, to obtain, for each
short text in the short text set, the probabilities that the short text belongs to the topics of the auxiliary text set and to the topics of the short text set;
a vector generation step of normalizing each short text's probabilities of belonging to the topics of the auxiliary text set and the topics
of the short text set, to generate a vector; and
a clustering step of clustering the short texts in the short text set based on the generated vectors.
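The vector generation step above amounts to concatenating each short text's probabilities over the two topic sets and normalizing the result. The following sketch illustrates this; the function name and the input probabilities are illustrative, and in the patent the probabilities come from the subject analysis step.

```python
# Vector generation: combine a short text's probabilities over the auxiliary-set
# topics and the short-set topics, then normalize so the components sum to 1.
def make_vector(p_aux_topics, p_short_topics):
    combined = list(p_aux_topics) + list(p_short_topics)
    s = sum(combined)
    return [p / s for p in combined]  # normalized over both topic sets together

# Made-up unnormalized topic weights for one short text.
v = make_vector([0.2, 0.6], [0.1, 0.3])
```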
8. The short text clustering method according to claim 7, wherein the subject analysis step comprises:
determining, by a switch parameter, whether a word in each text of the auxiliary text set and the short text set belongs to a topic of the
auxiliary text set or a topic of the short text set; if the word belongs to a topic of the auxiliary text set, performing subject analysis
by a first latent Dirichlet allocation, and if the word belongs to a topic of the short text set, performing subject analysis by a second latent Dirichlet allocation.
9. The short text clustering method according to claim 8, wherein the parameters used in the first latent
Dirichlet allocation and the second latent Dirichlet allocation are estimated using a Gibbs sampling algorithm, wherein the sampling frequency
of a topic of the auxiliary text set is proportional to the number of times all other words are assigned to that topic of the auxiliary text
set after the current word's assignment from the previous iteration is removed, and the sampling frequency of a topic of the short text set is
proportional to the number of times all other words are assigned to that topic of the short text set after the current word's assignment from
the previous iteration is removed.
10. The short text clustering method according to claim 7, wherein the vector generation step comprises:
generating the vector over the union of the topics of the auxiliary text set and the topics of the short text set.
11. The short text clustering method according to claim 8, wherein the value of the switch parameter obeys a binomial distribution.
12. The short text clustering method according to claim 8, wherein the switch parameter is determined so as
to ensure that a word in an auxiliary text is more likely to belong to a topic of the auxiliary text set than to a topic of the short text
set, and that a word in a short text is more likely to belong to a topic of the short text set than to a topic of the auxiliary text set.
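The Gibbs sampling update described in the claims above can be sketched for a single word within one of the two topic sets: the current word's previous assignment is removed, and a new topic is drawn with weight proportional to how often all other words are currently assigned to each topic. This is a standard collapsed LDA update; the switch between the auxiliary-set and short-set topics is omitted for brevity, and the smoothing hyperparameters alpha and beta are assumed conventional values, not taken from the patent.

```python
# One collapsed Gibbs sampling step for one word's topic assignment.
import random

def gibbs_step(doc_topic_counts, topic_word_counts, topic_totals,
               doc, word, current_topic, vocab_size,
               alpha=0.1, beta=0.01, rng=random.Random(0)):
    # Remove the current word's assignment from the previous iteration.
    doc_topic_counts[doc][current_topic] -= 1
    topic_word_counts[current_topic][word] -= 1
    topic_totals[current_topic] -= 1
    # Sampling weight per topic, computed from the counts of all OTHER words.
    k = len(topic_totals)
    weights = [
        (doc_topic_counts[doc][t] + alpha)
        * (topic_word_counts[t].get(word, 0) + beta)
        / (topic_totals[t] + vocab_size * beta)
        for t in range(k)
    ]
    new_topic = rng.choices(range(k), weights=weights)[0]
    # Record the new assignment.
    doc_topic_counts[doc][new_topic] += 1
    topic_word_counts[new_topic][word] = topic_word_counts[new_topic].get(word, 0) + 1
    topic_totals[new_topic] += 1
    return new_topic

# Tiny made-up state: one document, two topics, one vocabulary word.
doc_topic = [[2, 1]]
topic_word = [{"ad": 2}, {"ad": 1}]
totals = [2, 1]
t = gibbs_step(doc_topic, topic_word, totals, doc=0, word="ad",
               current_topic=0, vocab_size=5)
```

A full sampler would loop this step over every word of every text until the topic assignments stabilize, with the switch parameter deciding, per word, which of the two topic sets the step operates on.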
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110160561.4A CN102831119B (en) | 2011-06-15 | 2011-06-15 | Short text clustering Apparatus and method for |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102831119A CN102831119A (en) | 2012-12-19 |
CN102831119B true CN102831119B (en) | 2016-08-17 |
Family
ID=47334262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110160561.4A Active CN102831119B (en) | 2011-06-15 | 2011-06-15 | Short text clustering Apparatus and method for |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102831119B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103530316B (en) * | 2013-09-12 | 2016-06-01 | 浙江大学 | A kind of science subject extraction method based on multi views study |
CN103729422A (en) * | 2013-12-23 | 2014-04-16 | 武汉传神信息技术有限公司 | Information fragment associative output method and system |
CN104573070B (en) * | 2015-01-26 | 2018-06-15 | 清华大学 | A kind of Text Clustering Method for mixing length text set |
CN104850617B (en) * | 2015-05-15 | 2018-04-20 | 百度在线网络技术(北京)有限公司 | Short text processing method and processing device |
CN108628875B (en) * | 2017-03-17 | 2022-08-30 | 腾讯科技(北京)有限公司 | Text label extraction method and device and server |
CN107992477B (en) * | 2017-11-30 | 2019-03-29 | 北京神州泰岳软件股份有限公司 | Text subject determines method and device |
CN108228721B (en) * | 2017-12-08 | 2021-06-04 | 复旦大学 | Fast text clustering method on large corpus |
CN111090995B (en) * | 2019-11-15 | 2023-03-31 | 合肥工业大学 | Short text topic identification method and system |
CN111897912B (en) * | 2020-07-13 | 2021-04-06 | 上海乐言科技股份有限公司 | Active learning short text classification method and system based on sampling frequency optimization |
CN113407679B (en) * | 2021-06-30 | 2023-10-03 | 竹间智能科技(上海)有限公司 | Text topic mining method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1828608A (en) * | 2006-04-13 | 2006-09-06 | 北大方正集团有限公司 | Multiple file summarization method based on sentence relation graph |
CN101187919A (en) * | 2006-11-16 | 2008-05-28 | 北大方正集团有限公司 | Method and system for abstracting batch single document for document set |
US20090164417A1 (en) * | 2004-09-30 | 2009-06-25 | Nigam Kamal P | Topical sentiments in electronically stored communications |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102831119B (en) | Apparatus and method for short text clustering | |
CN105095394B (en) | webpage generating method and device | |
CN103699525B (en) | A kind of method and apparatus automatically generating summary based on text various dimensions feature | |
Li et al. | Comparison of word embeddings and sentence encodings as generalized representations for crisis tweet classification tasks | |
CN107220386A (en) | Information-pushing method and device | |
CN101819585A (en) | Device and method for constructing forum event dissemination pattern | |
CN109670039A (en) | Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering | |
Jha et al. | Educational data mining using improved apriori algorithm | |
CN103164428B (en) | Determine the method and apparatus of the correlativity of microblogging and given entity | |
CN103020712B (en) | A kind of distributed sorter of massive micro-blog data and method | |
Tang et al. | Social media-based disaster research: Development, trends, and obstacles | |
CN103886020A (en) | Quick search method of real estate information | |
CN103488637B (en) | A kind of method carrying out expert Finding based on dynamics community's excavation | |
Amin et al. | Community detection and mining using complex networks tools in social internet of things | |
Rani et al. | A survey of tools for social network analysis | |
CN115048506A (en) | Test question generation system, method and device based on knowledge graph and storage medium | |
CN104516873A (en) | Method and device for building emotion model | |
Mishra | Information extraction from digital social trace data with applications to social media and scholarly communication data | |
Lu | The research of the knowledge management technology in the education | |
Wang et al. | IdeaGraph: turning data into human insights for collective intelligence | |
He et al. | Design of shared Internet of Things system for English translation teaching using deep learning text classification | |
CN104503959A (en) | Method and equipment for predicting user emotion tendency | |
Zhang | Integration of art teaching resources in vertical social network | |
Setyawan et al. | Sentiment Analysis of Public Responses on Indonesia Government Using Naïve Bayes and Support Vector Machine | |
Rao | Extremism Video Detection In Social Media |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20171207
Address after: 100190 Zhongguancun street, Haidian District, Beijing, No. 18, block B, block 18
Patentee after: Data Hall (Beijing) Polytron Technologies Inc
Address before: 100191 Haidian District, Xueyuan Road, No. 35, the world building, the second floor of the building on the ground floor, No. 20
Patentee before: NEC (China) Co., Ltd.