CN102831119A - Short text clustering equipment and short text clustering method - Google Patents

Info

Publication number
CN102831119A
CN102831119A CN2011101605614A CN201110160561A
Authority
CN
China
Prior art keywords
text
theme
short text
short
collection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101605614A
Other languages
Chinese (zh)
Other versions
CN102831119B (en)
Inventor
赵凯
胡长建
王大亮
许洪志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datatang (Beijing) Technology Co., Ltd.
Original Assignee
NEC China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC China Co Ltd filed Critical NEC China Co Ltd
Priority to CN201110160561.4A priority Critical patent/CN102831119B/en
Publication of CN102831119A publication Critical patent/CN102831119A/en
Application granted granted Critical
Publication of CN102831119B publication Critical patent/CN102831119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides short text clustering equipment comprising a topic analysis unit, a vector generation unit, and a clustering unit. The topic analysis unit performs topic analysis on each text in an auxiliary text collection and a short text collection, thereby obtaining the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection. The vector generation unit normalizes these probabilities to generate a vector for each short text. The clustering unit clusters the short texts in the short text collection based on the generated vectors. The invention further provides a short text clustering method. With the equipment and method, the auxiliary text topics and the short text topics are discovered separately, so that short texts can be clustered more accurately.

Description

Short text clustering apparatus and method
Technical field
The present invention relates to the field of natural language processing, and in particular to a short text clustering apparatus and method.
Background art
With the widespread use of SMS, microblogs, search engines, online advertising, and the like, short texts are used more and more frequently. These texts are usually short; for example, an SMS cannot exceed 70 characters, and a result returned by a search engine generally contains only a few dozen words.
Short texts differ considerably from long texts (for example, news articles). In a long text, a topic can be described fully, so a reader can learn nearly all of its content from the text. In contrast, because the number of words in a short text is limited, usually only the core content of a topic is described, and much related information is omitted.
Traditional text mining methods are usually designed for long texts and run into difficulty when applied to short texts, for example in clustering. Since clustering usually relies on word co-occurrence information, and short texts contain far less such information than long texts, the clustering quality suffers. Consider the following two news texts L1 and L2:
L1: "Tsinghua University's Fourth Teaching Building has been renamed the 'Jeanswest Building', drawing ridicule on campus and on the Internet. The main objection is that the image of a Tsinghua teaching building and the Jeanswest apparel brand simply do not match. From the perspective of due procedure in naming university buildings, Tsinghua clearly leaves something to criticize. Setting that aside, the substantive question that concerns Tsinghua scholars is a single one: from the so-called brand-image angle of naming a teaching building, does the 'Jeanswest Building' diminish Tsinghua's image?"
L2: "Recently, a teaching building at Tsinghua University was named 'Jeanswest', causing a great stir online. Isn't Jeanswest an apparel brand? How did a Tsinghua teaching building become 'Jeanswest'? At noon on the 23rd, a sign reading 'Jeanswest Building' was hung on the outer wall of Tsinghua's Fourth Teaching Building. To the lower right of those characters hangs another sign, dedicated to introducing the Jeanswest apparel brand. Naming a teaching building after a commercial brand has set off a dispute among Tsinghua students and netizens. Some think universities are becoming overly commercialized and should not name buildings after enterprises, while Sina Weibo user Young_pig thinks that the enterprise sponsors the school and that a naming sponsorship does not harm the school's image."
Because L1 and L2 share words such as 'Tsinghua University', 'Fourth Teaching Building', 'Jeanswest', 'apparel', 'university', 'naming', and 'image', it is easy to judge that they are very similar and to group them into one cluster. The following two short texts S1 and S2, however, are not so easy to group into one cluster, because the only important word they share is 'Tsinghua' (the other shared word, 'also', is so commonly used that it carries little weight and is usually removed before clustering):
S1: "I hear the Jeanswest Building and Tsinghua's image just don't match"
S2: "Isn't it just an apparel brand? Tsinghua's naming is too commercialized"
To improve the accuracy of short text clustering, the prior art proposes using auxiliary information to help with clustering. For example, to cluster short texts such as S1 and S2 above, long texts such as L1 and L2 are introduced as auxiliary information. S1 is more similar to L1 (sharing words such as 'Jeanswest', 'Tsinghua', 'image', 'match'), and S2 is more similar to L2 (sharing words such as 'apparel', 'Tsinghua', 'naming', 'commercialization'). Because L1 and L2 are very similar, S1 and S2 are also judged similar and can be grouped into one cluster.
Reference 1 (X.H. Phan, L.M. Nguyen, S. Horiguchi, "Learning to classify short and sparse text & web with hidden topics from large-scale data collections", WWW 2008) describes a method of clustering according to auxiliary texts. As shown in Fig. 1, the method comprises the following steps:
At step S100, topic analysis is performed on the auxiliary text collection to obtain a number of topics and the corresponding vocabulary. Specifically, reference 1 uses texts downloaded from Wikipedia as the auxiliary information to form the auxiliary text collection. The topic analysis uses Latent Dirichlet Allocation (LDA). Fig. 2 shows the LDA model. LDA is a generative model whose main idea is to simulate the generative process of a text: for each word, first select a topic from a distribution, then select a word from that topic. Referring to Fig. 2, the LDA algorithm flow comprises:
1. For each topic k ∈ [1, K], draw a sample from the Dir(β) distribution to obtain the word distribution φ_k under that topic.
2. For each text m ∈ [1, M],
2.1 draw a sample from the Dir(α) distribution to obtain a topic distribution θ_m;
2.2 for each word n,
2.2.1 draw a sample from the multinomial distribution Multinomial(θ_m) to obtain a topic z_{m,n};
2.2.2 draw a sample from the multinomial distribution Multinomial(φ_{z_{m,n}}) to obtain a word w_{m,n}.
Algorithm 1-LDA
Here, the value of α represents the prior weight of each topic before sampling, and the value of β represents the prior distribution of the words of each topic. Both are predetermined parameters, called hyperparameters.
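For illustration only (a minimal sketch, not code from reference 1), the generative process of Algorithm 1 can be written with NumPy; all sizes and the fixed 50-word text length below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes, not values from reference 1.
K, M, V = 10, 100, 5000      # topics, texts, vocabulary size
alpha, beta = 0.1, 0.01      # symmetric hyperparameters

# Step 1: for each topic k, draw a word distribution phi_k ~ Dir(beta).
phi = rng.dirichlet(np.full(V, beta), size=K)    # shape (K, V)

docs = []
for m in range(M):
    # Step 2.1: draw a topic distribution theta_m ~ Dir(alpha).
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(50):                          # placeholder text length
        z = rng.choice(K, p=theta)               # step 2.2.1: topic z_{m,n}
        w = rng.choice(V, p=phi[z])              # step 2.2.2: word w_{m,n}
        words.append(w)
    docs.append(words)
```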
The task of LDA is to estimate the parameters Φ = {φ_1, ..., φ_K} and θ_d. The joint density of all observed and hidden variables is:

p(w_m, z_m, θ_m, Φ | α, β) = ∏_{n=1}^{N_m} p(w_{m,n} | φ_{z_{m,n}}) p(z_{m,n} | θ_m) · p(θ_m | α) · p(Φ | β)

The likelihood function of a single text is obtained by integrating out the hidden variables:

p(w_m | α, β) = ∫∫ p(θ_m | α) p(Φ | β) ∏_{n=1}^{N_m} ∑_{k=1}^{K} p(w_{m,n} | φ_k) p(z_{m,n} = k | θ_m) dΦ dθ_m

The likelihood function over the whole text collection is the product over its texts:

p(W | α, β) = ∏_{m=1}^{M} p(w_m | α, β)
In theory, Φ and θ can be solved for by maximizing the above likelihood function. However, this maximization has no analytic solution, so in practice the parameters are estimated approximately. For example, reference 1 uses Gibbs sampling (Gibbs Sampling) to estimate the parameters. Reference 2 (Thomas L. Griffiths, Mark Steyvers, "Finding scientific topics", Proceedings of the National Academy of Sciences of the United States of America, Vol. 101, Suppl. 1, 6 April 2004, pp. 5228-5235) and reference 3 (Gregor Heinrich, "Parameter estimation for text analysis", Technical Report, 2004) describe in detail the process and algorithm for implementing LDA with Gibbs sampling.
At step S110, based on the topics obtained at step S100, inference is performed on the short text collection to obtain the topics corresponding to these short texts. The inference also uses Gibbs sampling.
At step S120, a training sample set is constructed based on the result of step S110. The training data take the form of vectors, that is, each short text corresponds to one vector. A corresponding vector is generated for each short text in the short text collection, and each short text is then given a class label, thus forming the training sample set.
At step S130, a machine learning method is selected to classify the training sample set so as to obtain a classification model. Several methods are available, for example decision trees, SVM, and maximum entropy. Reference 1 uses the maximum entropy method.
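As a sketch only (reference 1 does not publish code, and the data here are hypothetical), a maximum entropy classifier over such topic vectors can be realized as multinomial logistic regression, for example with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training set: one topic-proportion vector per short text
# (from step S120), plus a class label for each text.
X_train = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.1, 0.7]])
y_train = np.array([0, 1, 2])

# With its default settings, LogisticRegression fits a multinomial
# (softmax) model, which coincides with maximum entropy classification.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print(clf.predict(np.array([[0.6, 0.3, 0.1]])))  # class of a new short text
```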
However, in step S110, which infers topics for the short texts over the topics formed from the auxiliary texts, reference 1 assumes that the topics of the short texts can all be covered by the auxiliary texts. In many real situations this assumption does not hold, or holds only poorly, because many situations and events are newly emerging, and no comprehensive knowledge base can be guaranteed to cover the topics appearing in all of them. In such cases, the auxiliary texts can cover only some of the topics of the short texts. Therefore, the method of reference 1 cannot discover and exploit the newly emerging topics inherent in the short texts, which degrades the quality of classification or clustering.
Summary of the invention
To solve the above technical problem, the present invention proposes performing topic analysis on each text in an auxiliary text collection and a short text collection, so as to obtain the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection. The present invention does not require the existence of an auxiliary text collection that can cover all topics of the short texts; it only requires the auxiliary texts and the short texts to be partially related. Specifically, the present invention uses a double latent Dirichlet allocation (Double Latent Dirichlet Allocation, DLDA) to cluster the short texts. By building two LDAs and adding a switch between them, DLDA discovers the auxiliary text topics and the short text topics separately, and can determine the probability of any short text corresponding to an auxiliary text topic and to a short text topic.
According to an aspect of the present invention, there is provided a short text clustering apparatus comprising: a topic analysis unit, which performs topic analysis on each text in an auxiliary text collection and a short text collection to obtain the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection; a vector generation unit, which normalizes the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection to generate a vector; and a clustering unit, which clusters the short texts in the short text collection based on the generated vectors.
Preferably, the topic analysis unit determines, by a switch parameter, whether a word in each text of the auxiliary text collection and the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if the word corresponds to a topic of the auxiliary text collection, the topic analysis unit performs topic analysis through a first latent Dirichlet allocation, and if the word corresponds to a topic of the short text collection, the topic analysis unit performs topic analysis through a second latent Dirichlet allocation.
Preferably, the topic analysis unit uses a Gibbs sampling algorithm to estimate the parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation, wherein the sampling probability of a topic of the auxiliary text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration, and the sampling probability of a topic of the short text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration.
Preferably, the vector generation unit generates the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
Preferably, the value of the switch parameter obeys a binomial distribution.
Preferably, the topic analysis unit determines the switch parameter so as to ensure that a word in an auxiliary text has a higher probability of corresponding to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text has a higher probability of corresponding to a topic of the short text collection than to a topic of the auxiliary text collection.
According to another aspect of the present invention, there is provided a short text clustering method comprising: a topic analysis step of performing topic analysis on each text in an auxiliary text collection and a short text collection to obtain the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection; a vector generation step of normalizing the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection to generate a vector; and a clustering step of clustering the short texts in the short text collection based on the generated vectors.
Preferably, the topic analysis step comprises: determining, by a switch parameter, whether a word in each text of the auxiliary text collection and the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if the word corresponds to a topic of the auxiliary text collection, performing topic analysis through a first latent Dirichlet allocation, and if the word corresponds to a topic of the short text collection, performing topic analysis through a second latent Dirichlet allocation.
Preferably, a Gibbs sampling algorithm is used to estimate the parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation, wherein the sampling probability of a topic of the auxiliary text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration, and the sampling probability of a topic of the short text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration.
Preferably, the vector generation step comprises: generating the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
Preferably, the value of the switch parameter obeys a binomial distribution.
Preferably, the switch parameter is determined so as to ensure that a word in an auxiliary text has a higher probability of corresponding to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text has a higher probability of corresponding to a topic of the short text collection than to a topic of the auxiliary text collection.
The present invention discovers the auxiliary text topics and the short text topics separately, and can therefore cluster short texts more accurately.
Description of drawings
The above and other features of the present invention will become more apparent through the following detailed description in conjunction with the accompanying drawings, in which:
Fig. 1 shows the flowchart of a prior-art short text clustering method;
Fig. 2 shows the block diagram of the LDA model adopted by the short text clustering method of Fig. 1;
Fig. 3 shows the block diagram of a short text clustering apparatus according to an embodiment of the present invention;
Fig. 4 shows the block diagram of the DLDA model adopted by the short text clustering apparatus according to an embodiment of the present invention; and
Fig. 5 shows the flowchart of a short text clustering method according to an embodiment of the present invention.
Embodiment
Below, the principles and implementation of the present invention will become apparent from the description of specific embodiments in conjunction with the drawings. It should be noted that the present invention is not limited to the specific embodiments described hereinafter. In addition, for brevity, detailed descriptions of well-known techniques not directly related to the present invention are omitted.
Fig. 3 shows the block diagram of a short text clustering apparatus 30 according to an embodiment of the present invention. As shown in Fig. 3, the short text clustering apparatus 30 comprises a topic analysis unit 310, a vector generation unit 320, and a clustering unit 330.
The topic analysis unit 310 performs topic analysis on each text in the auxiliary text collection and the short text collection to obtain their respective topics. In a specific embodiment, the topic analysis unit 310 adopts the DLDA model shown in Fig. 4 to perform the topic analysis. As can be seen from Fig. 4, DLDA comprises two LDAs, corresponding respectively to the topic analysis of the auxiliary texts and of the short texts (where "aux" denotes the auxiliary texts and "tar" denotes the short texts). To coordinate the two LDAs, a switch variable γ is introduced. For each word, the switch variable γ is responsible for selecting whether its topic is taken from the auxiliary texts or from the short texts.
In the present embodiment, the topic analysis unit 310 analyzes the topics of the auxiliary texts and the short texts through the following algorithm:
1. For each topic z ∈ [1, ..., K_aux] of the auxiliary text collection, draw a sample from the Dir(β_aux) distribution to obtain the word distribution φ^aux_z under that topic.
2. For each topic z ∈ [1, ..., K_tar] of the short text collection, draw a sample from the Dir(β_tar) distribution to obtain the word distribution φ^tar_z under that topic.
3. For each text collection c ∈ [aux, tar],
3.1 for each text d ∈ [1, ..., D_c],
3.2 draw from the Dir(α_aux) distribution a topic distribution θ^aux_d over the topics of the auxiliary text collection;
3.3 draw from the Dir(α_tar) distribution a topic distribution θ^tar_d over the topics of the short text collection;
3.4 draw from the Beta(γ^c) distribution a binomial distribution π_d;
3.5 for each word w_{d,n},
3.5.1 draw a switch value x_{d,n} from the binomial distribution π_d;
3.5.2 if x_{d,n} = aux, draw a topic z_{d,n} from the auxiliary text collection's multinomial distribution Multinomial(θ^aux_d);
3.5.3 if x_{d,n} = tar, draw a topic z_{d,n} from the short text collection's multinomial distribution Multinomial(θ^tar_d);
3.5.4 draw a word w_{d,n} from the multinomial distribution Multinomial(φ^{x_{d,n}}_{z_{d,n}}).
Algorithm 2-DLDA
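For illustration, the generative process of Algorithm 2 can be sketched with NumPy as below. The collection sizes and hyperparameters reuse the concrete example that follows (100 auxiliary texts, 50 short texts, K_aux = 15, K_tar = 10); the value of γ^tar and the fixed length of 30 words per text are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 5000                                         # vocabulary size (example below)
K = {"aux": 15, "tar": 10}                       # topics per collection
alpha = {"aux": 0.3, "tar": 0.2}
beta = {"aux": 0.01, "tar": 0.01}
gamma = {"aux": (0.5, 0.2), "tar": (0.2, 0.5)}   # gamma["tar"] is an assumption
n_docs = {"aux": 100, "tar": 50}

# Steps 1-2: topic-word distributions phi for each collection's topics.
phi = {c: rng.dirichlet(np.full(V, beta[c]), size=K[c]) for c in ("aux", "tar")}

corpus = {}
for c in ("aux", "tar"):                         # step 3
    docs = []
    for _ in range(n_docs[c]):                   # step 3.1
        # Steps 3.2-3.3: one topic distribution per collection, for this text.
        theta = {cc: rng.dirichlet(np.full(K[cc], alpha[cc]))
                 for cc in ("aux", "tar")}
        pi = rng.beta(*gamma[c])                 # step 3.4: P(word uses an aux topic)
        words = []
        for _ in range(30):                      # 30 words per text, as in the example
            x = "aux" if rng.random() < pi else "tar"  # step 3.5.1: switch value
            z = rng.choice(K[x], p=theta[x])           # steps 3.5.2-3.5.3: topic
            words.append(rng.choice(V, p=phi[x][z]))   # step 3.5.4: word
        docs.append(words)
    corpus[c] = docs
```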
A concrete application example is described below. Suppose there are 100 auxiliary texts and 50 short texts. Take K_aux = 15, K_tar = 10, α_aux = 0.3, α_tar = 0.2, and β_aux = β_tar = 0.01. It should be noted that α and β are usually multi-dimensional vectors and the value of each dimension may in theory differ, but in practical applications the value of each dimension is usually reduced to the same single value. Take γ^aux = (γ^aux_aux, γ^aux_tar) = (0.5, 0.2), and set γ^tar analogously. Note that the settings here should satisfy γ^c_c > γ^c_{~c}, where c denotes one text collection and ~c denotes the other text collection; for example, if c = aux then ~c = tar, and conversely, if c = tar then ~c = aux.
First, the topic analysis unit 310 counts all the vocabulary of the auxiliary texts and short texts (a word occurring multiple times is counted only once), denoted V. Here, suppose V comprises 5000 words.
Then, for each topic z ∈ [1, ..., 15] of the auxiliary text collection, the topic analysis unit 310 draws a sample from the Dir(0.01) distribution to obtain the word distribution under that topic, for example φ^aux_1 = [0.001, 0, ...]. The dimension of this vector is 5000 and its values sum to 1; the meaning is that, for this topic, the probability of choosing the first word is 0.001, the probability of choosing the second word is 0, and so on. In the same way, φ^aux_2, ..., φ^aux_15 are obtained.
In addition, for each topic z ∈ [1, ..., 10] of the short text collection, the topic analysis unit 310 draws a sample from the Dir(0.01) distribution to obtain the word distribution under each topic, for example φ^tar_1, whose dimension is likewise 5000, and similarly φ^tar_2, ..., φ^tar_10.
Then, for the first auxiliary text d = 1 of the auxiliary text collection c = aux, the topic analysis unit 310 draws from the Dir(0.3) distribution a topic distribution θ^aux_1 over the topics of the auxiliary text collection. The dimension of this vector is 15 and its values sum to 1; the meaning is that the probability of choosing the first topic is 0.1, the probability of choosing the second topic is 0.2, and so on. In addition, the topic analysis unit 310 draws from the Dir(0.2) distribution a topic distribution θ^tar_1 over the topics of the short text collection, a vector whose dimension is 10 and whose values sum to 1. Next, the topic analysis unit 310 draws from the Beta(0.5, 0.2) distribution a binomial distribution π_1 = [0.7, 0.3], whose meaning is that the probability of choosing an auxiliary text topic is 0.7 and the probability of choosing a short text topic is 0.3. Suppose text 1 comprises 30 words. For the first word w_{1,1}, a switch value x_{1,1} = aux is sampled from π_1. Because x_{1,1} = aux, a topic is then sampled from the auxiliary text collection's multinomial distribution Multinomial(θ^aux_1) = [0.1, 0.2, 0, ..., 0.034]; suppose the 15th topic is drawn, i.e., z_{1,1} = 15. Afterwards, the topic analysis unit 310 draws a sample from the multinomial distribution Multinomial(φ^aux_15); suppose the 1200th word is drawn, so that w_{1,1} = "TV".
Solving DLDA requires solving for the parameters θ and φ. Concrete solution methods include the variational method (Variational Method), expectation propagation (Expectation Propagation), and Gibbs sampling (Gibbs Sampling). In the present embodiment, the Gibbs sampling algorithm described below is used to solve for the parameters θ and φ.
1. For all auxiliary text topics z ∈ [1, ..., K_aux], all words w, and all texts d, set n^{aux,z}_w = 0 and n^{aux,z}_d = 0. For all short text topics z ∈ [1, ..., K_tar], all words w, and all texts d, set n^{tar,z}_w = 0 and n^{tar,z}_d = 0. For all texts d, set n^{aux}_d = 0 and n^{tar}_d = 0.
Here, n^{aux,z}_w and n^{tar,z}_w denote the number of times word w was selected when auxiliary text topic z or short text topic z, respectively, was selected; n^{aux,z}_d and n^{tar,z}_d denote the number of times text d selected auxiliary text topic z or short text topic z, respectively; n^{aux}_d and n^{tar}_d denote the number of words of text d that selected auxiliary text topics or short text topics, respectively.
2. For each text collection c ∈ [aux, tar],
2.1 for each text d ∈ [1, ..., D_c],
2.1.1 for each word w,
2.1.1.1 draw a switch value x from the binomial distribution π = [0.5, 0.5];
2.1.1.2 if x = aux, draw a topic z from the multinomial distribution Multinomial(1/K_aux), then set n^{aux,z}_d = n^{aux,z}_d + 1, n^{aux,z}_w = n^{aux,z}_w + 1, and n^{aux}_d = n^{aux}_d + 1;
2.1.1.3 if x = tar, draw a topic z from the multinomial distribution Multinomial(1/K_tar), then set n^{tar,z}_d = n^{tar,z}_d + 1, n^{tar,z}_w = n^{tar,z}_w + 1, and n^{tar}_d = n^{tar}_d + 1.
3. Loop:
3.1 for each text collection c ∈ [aux, tar],
3.1.1 for each text d ∈ [1, ..., D_c],
3.1.1.1 for each word w,
3.1.1.1.1 let x and z be the text collection and topic obtained for w by the sampling of the previous iteration, and set n^{x,z}_d = n^{x,z}_d - 1, n^{x,z}_w = n^{x,z}_w - 1, and n^{x}_d = n^{x}_d - 1; then
3.1.1.1.2 draw a switch value x from the binomial distribution π = [n^{aux}_d / (n^{aux}_d + n^{tar}_d), n^{tar}_d / (n^{aux}_d + n^{tar}_d)];
3.1.1.1.3 if x = aux, draw a topic z from the multinomial distribution determined by formula (1) (see below), then set n^{aux,z}_d = n^{aux,z}_d + 1, n^{aux,z}_w = n^{aux,z}_w + 1, and n^{aux}_d = n^{aux}_d + 1;
3.1.1.1.4 if x = tar, draw a topic z from the multinomial distribution determined by formula (2) (see below), then set n^{tar,z}_d = n^{tar,z}_d + 1, n^{tar,z}_w = n^{tar,z}_w + 1, and n^{tar}_d = n^{tar}_d + 1.
3.2 If the convergence condition is reached, compute the parameters according to formulas (3) and (4) (see below) and exit the loop; otherwise, continue the loop.
Algorithm 3-Gibbs sampling
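A condensed NumPy sketch of the per-word update of Algorithm 3 is given below (initialization omitted). The count-array layout, the function name, and the folding of steps 3.1.1.1.2-3.1.1.1.4 into one joint draw over (x, z) pairs are assumptions of this sketch; the joint probabilities are the right-hand sides of formulas (1) and (2) described below:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 5000                       # vocabulary size, as in the example above
K = {"aux": 15, "tar": 10}     # topic counts per collection

def resample_word(d, w, x_old, z_old, n_wz, n_dz, n_d, alpha, beta, gamma_d):
    """One Gibbs step for one word occurrence (steps 3.1.1.1.1-3.1.1.1.4).

    Assumed layout: n_wz[c] is a (V, K[c]) word-topic count array, n_dz[c]
    a (D, K[c]) text-topic count array, n_d[c] a (D,) array of words of each
    text assigned to collection c; gamma_d[c] is the gamma component for
    choosing collection c, i.e. gamma_x^{c_i} with x = c.
    """
    # Step 3.1.1.1.1: remove the current word's previous assignment.
    n_wz[x_old][w, z_old] -= 1
    n_dz[x_old][d, z_old] -= 1
    n_d[x_old][d] -= 1

    # Joint draw over (x, z): formulas (1) and (2) give the unnormalized
    # probability of every (collection, topic) pair.
    probs, pairs = [], []
    for c in ("aux", "tar"):
        word_part = (n_wz[c][w] + beta[c]) / (n_wz[c].sum(axis=0) + V * beta[c])
        text_part = (n_dz[c][d] + alpha[c]) / (n_dz[c][d].sum() + K[c] * alpha[c])
        switch_part = n_d[c][d] + gamma_d[c]
        p = word_part * text_part * switch_part
        probs.extend(p)
        pairs.extend((c, z) for z in range(K[c]))
    probs = np.asarray(probs)
    x_new, z_new = pairs[rng.choice(len(pairs), p=probs / probs.sum())]

    # Record the new assignment.
    n_wz[x_new][w, z_new] += 1
    n_dz[x_new][d, z_new] += 1
    n_d[x_new][d] += 1
    return x_new, z_new
```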
Formulas (1)-(4), referred to in the above Gibbs sampling algorithm, are described in detail below.
Formula (1): for all auxiliary text topics z ∈ [1, ..., K_aux],

p(x_i = x, z_i = z | w_i = w, x_{¬i}, z_{¬i}, w_{¬i}, α, β, γ) ∝ [ (n^{aux,z}_{w,¬i} + β^c_w) / ∑_{v=1}^{V} (n^{aux,z}_{v,¬i} + β^c_v) ] · [ (n^{aux,z}_{d,¬i} + α^c_z) / ∑_{k=1}^{K_aux} (n^{aux,k}_{d,¬i} + α^c_k) ] · (n^{aux}_{d,¬i} + γ^{c_i}_x)
The meaning of formula (1) is: the probability of sampling text collection x and topic z is proportional to three values (the three factors multiplied on the right-hand side of formula (1)). In the third factor, n^{aux}_{d,¬i} + γ^{c_i}_x, the count n^{aux}_{d,¬i} is the number of times an auxiliary text topic was selected, over all words other than the current word, in the sampling of the previous iteration; the purpose of adding γ^{c_i}_x is to prevent this count from being 0. The meaning of the second factor is the proportion with which text d chose auxiliary text topic z in the previous iteration, excluding the current word; the meaning of the first factor is the proportion with which word w was chosen when auxiliary text topic z was chosen in the previous iteration, excluding the current word. The subscript ¬i means "excluding the selection of the current word (w_i)" (corresponding to step 3.1.1.1.1).
Formula (2): for all short text topics z ∈ [1, ..., K_tar],

p(x_i = x, z_i = z | w_i = w, x_{¬i}, z_{¬i}, w_{¬i}, α, β, γ) ∝ [ (n^{tar,z}_{w,¬i} + β^c_w) / ∑_{v=1}^{V} (n^{tar,z}_{v,¬i} + β^c_v) ] · [ (n^{tar,z}_{d,¬i} + α^c_z) / ∑_{k=1}^{K_tar} (n^{tar,k}_{d,¬i} + α^c_k) ] · (n^{tar}_{d,¬i} + γ^{c_i}_x)
The meaning of formula (2) is similar to that of formula (1), with the auxiliary text topics replaced by the short text topics. Here c_i denotes the text collection to which text d belongs (i.e., the auxiliary text collection or the short text collection).
Formula (3): θ^c_{d,z} = (n^{c,z}_d + α^c_z) / ∑_{k=1}^{K_c} (n^{c,k}_d + α^c_k)
Formula (4): φ^c_{z,w} = (n^{c,z}_w + β^c_w) / ∑_{v=1}^{V} (n^{c,z}_v + β^c_v), where c ∈ [aux, tar].
The convergence condition can take various forms, for example: reaching a predefined number of iterations, the parameters θ and φ changing very little, or the likelihood function of the text collection changing very little.
Solving with the above Gibbs sampling algorithm yields, for each short text, the probability of corresponding to each topic of the auxiliary text collection and each topic of the short text collection (i.e., the result of formula (3)).
The vector generation unit 320 normalizes the probabilities of the corresponding topics and then generates a vector. Note that the vector here is generated over the union of the topics of the auxiliary text collection and the topics of the short text collection. For any short text d, each dimension of the vector comes from formula (3), with c taking aux or tar and z taking each topic:
f_d = [ θ^{aux}_{d,1}/S^{aux}_1, ..., θ^{aux}_{d,K_aux}/S^{aux}_{K_aux}, θ^{tar}_{d,1}/S^{tar}_1, ..., θ^{tar}_{d,K_tar}/S^{tar}_{K_tar} ]
where the superscripts range over {aux, tar} and S^c_z is the normalization factor of topic z. For example, the vector generation unit 320 may generate a vector such as f_d = [0.1, 0.5, 0, 0, 0.02, ..., 0].
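A sketch of this construction, taking the θ matrices of formula (3) as input and treating S as per-topic normalization factors (an assumption, since the text does not define S explicitly):

```python
import numpy as np

# Rows: formula (3) values theta_{d,z}^c for each short text d; a tiny
# hypothetical example with K_aux = K_tar = 2.
theta = {"aux": np.array([[0.2, 0.8],
                          [0.5, 0.5]]),
         "tar": np.array([[0.9, 0.1],
                          [0.3, 0.7]])}

# Assumed normalization factors S_z^c: per-topic sums over all short texts.
S = {c: theta[c].sum(axis=0) for c in ("aux", "tar")}

# f_d concatenates the normalized aux-topic and tar-topic dimensions.
f = np.concatenate([theta["aux"] / S["aux"], theta["tar"] / S["tar"]], axis=1)
print(f.shape)   # (number of short texts, K_aux + K_tar)
```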
The clustering unit 330 performs short text clustering based on the vectors generated by the vector generation unit 320. Specifically, after the above vectors have been generated for all short texts, a clustering method such as K-means can be used to perform short text clustering, thereby obtaining the clustering result of the short texts.
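For example, with scikit-learn's K-means (one of several clustering methods that can be used here; the data and cluster count are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# f: the (number of short texts, K_aux + K_tar) matrix built above.
f = np.array([[0.10, 0.80, 0.90, 0.10],
              [0.12, 0.78, 0.88, 0.12],
              [0.85, 0.15, 0.10, 0.90]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(f)   # cluster index of each short text
print(labels)                    # e.g. [0 0 1]
```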
Fig. 5 shows the flowchart of a short text clustering method 50 according to an embodiment of the present invention. As shown in Fig. 5, the short text clustering method 50 comprises steps S510-S530.
At step S510, topic analysis is performed on each text in the auxiliary text collection and the short text collection to obtain their respective topics. Specifically, the DLDA algorithm described above can be used to analyze the topics of the auxiliary texts and the short texts. In solving the DLDA model, the variational method (Variational Method), expectation propagation (Expectation Propagation), or Gibbs sampling (Gibbs Sampling), among others, can be used. Preferably, the Gibbs sampling algorithm described above is used to realize DLDA.
At step S520, based on the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection, a vector is generated after normalizing those probabilities. Preferably, the vector is generated over the union of the topics of the auxiliary text collection and the topics of the short text collection.
At step S530, the short texts are clustered based on the generated vectors. For example, after the vectors have been generated for all short texts, a clustering method such as K-means can be used to perform short text clustering.
The results of applying the short text clustering apparatus or method of the present invention to an online advertisement collection are described below. Suppose 182,209 online advertisements covering 42 product series have been collected from a commercial website, each text containing 29.06 words on average. In addition, 99,737 web pages have been collected according to the product names as auxiliary texts, each text containing 560.4 words on average. Each product series serves as one cluster.
The evaluation criterion H(x̃) adopts the following entropy form:

H(x̃) = -∑_{c∈C} p(c|x̃) log_2 p(c|x̃),

where x̃ denotes a cluster produced by the computer, C denotes the set of correct cluster classes, and c is one correct cluster (a certain product series), with

p(c|x̃) = |{x ∈ x̃ : l(x) = c}| / |x̃|,

where l(x) denotes the correct cluster label of short text x and |x̃| denotes the number of texts in the cluster. The smaller H(x̃) is, the better the algorithm performs.
Table 1 below lists the results of applying the DLDA method of the present invention and several other methods to the online advertisement collection:
Table 1 [reproduced in the original as an image; it lists the entropy H(x̃) obtained by each method, with DLDA's value the smallest]
In Table 1, Direct denotes directly applying a clustering method. LDA-one generates topics on the auxiliary text collection and then performs inference for the short texts over these topics (similar to reference 1). LDA-both generates topics on the union of the auxiliary text collection and the short text collection. STC is a clustering method adopted for domain transfer. It can be seen that, because the entropy H(x̃) of DLDA's result is the smallest, DLDA performs best at short text clustering.
Although the present invention has been shown above in conjunction with its preferred embodiments, those skilled in the art will appreciate that various modifications, substitutions, and changes may be made to the present invention without departing from its spirit and scope. Therefore, the present invention should not be limited by the above embodiments, but should be defined by the appended claims and their equivalents.

Claims (12)

1. A short text clustering apparatus, comprising:
a topic analysis unit, which performs topic analysis on each text in an auxiliary text collection and a short text collection to obtain the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection;
a vector generation unit, which normalizes the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection to generate a vector; and
a clustering unit, which clusters the short texts in the short text collection based on the generated vectors.
2. The short text clustering apparatus according to claim 1, wherein the topic analysis unit determines, by a switch parameter, whether a word in each text of the auxiliary text collection and the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if the word corresponds to a topic of the auxiliary text collection, the topic analysis unit performs topic analysis through a first latent Dirichlet allocation, and if the word corresponds to a topic of the short text collection, the topic analysis unit performs topic analysis through a second latent Dirichlet allocation.
3. The short text clustering apparatus according to claim 2, wherein the topic analysis unit uses a Gibbs sampling algorithm to estimate the parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation, wherein the sampling probability of a topic of the auxiliary text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration, and the sampling probability of a topic of the short text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration.
4. The short text clustering apparatus according to claim 1, wherein the vector generation unit generates the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
5. The short text clustering apparatus according to claim 2, wherein the value of the switch parameter obeys a binomial distribution.
6. The short text clustering apparatus according to claim 2, wherein the topic analysis unit determines the switch parameter so as to ensure that a word in an auxiliary text has a higher probability of corresponding to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text has a higher probability of corresponding to a topic of the short text collection than to a topic of the auxiliary text collection.
7. A short text clustering method, comprising:
a topic analysis step of performing topic analysis on each text in an auxiliary text collection and a short text collection to obtain the probability of each short text in the short text collection corresponding to a topic of the auxiliary text collection and a topic of the short text collection;
a vector generation step of normalizing the probabilities of each short text corresponding to the topics of the auxiliary text collection and the topics of the short text collection to generate a vector; and
a clustering step of clustering the short texts in the short text collection based on the generated vectors.
8. The short text clustering method according to claim 7, wherein the topic analysis step comprises: determining, by a switch parameter, whether a word in each text of the auxiliary text collection and the short text collection corresponds to a topic of the auxiliary text collection or a topic of the short text collection; if the word corresponds to a topic of the auxiliary text collection, performing topic analysis through a first latent Dirichlet allocation, and if the word corresponds to a topic of the short text collection, performing topic analysis through a second latent Dirichlet allocation.
9. The short text clustering method according to claim 8, wherein a Gibbs sampling algorithm is used to estimate the parameters used in the first latent Dirichlet allocation and the second latent Dirichlet allocation, wherein the sampling probability of a topic of the auxiliary text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration, and the sampling probability of a topic of the short text collection is proportional to the number of times that topic was selected, over all words other than the current word, in the sampling of the previous iteration.
10. The short text clustering method according to claim 7, wherein the vector generation step comprises: generating the vector over the union of the topics of the auxiliary text collection and the topics of the short text collection.
11. The short text clustering method according to claim 8, wherein the value of the switch parameter obeys a binomial distribution.
12. The short text clustering method according to claim 8, wherein the switch parameter is determined so as to ensure that a word in an auxiliary text has a higher probability of corresponding to a topic of the auxiliary text collection than to a topic of the short text collection, and that a word in a short text has a higher probability of corresponding to a topic of the short text collection than to a topic of the auxiliary text collection.
CN201110160561.4A 2011-06-15 2011-06-15 Short text clustering apparatus and method Active CN102831119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110160561.4A CN102831119B (en) Short text clustering apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110160561.4A CN102831119B (en) Short text clustering apparatus and method

Publications (2)

Publication Number Publication Date
CN102831119A true CN102831119A (en) 2012-12-19
CN102831119B CN102831119B (en) 2016-08-17

Family

ID=47334262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110160561.4A Active CN102831119B (en) Short text clustering apparatus and method

Country Status (1)

Country Link
CN (1) CN102831119B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530316A (en) * 2013-09-12 2014-01-22 浙江大学 Science subject extraction method based on multi-view learning
CN103729422A (en) * 2013-12-23 2014-04-16 武汉传神信息技术有限公司 Information fragment associative output method and system
CN104573070A (en) * 2015-01-26 2015-04-29 清华大学 Text clustering method special for mixed length text sets
CN104850617A (en) * 2015-05-15 2015-08-19 百度在线网络技术(北京)有限公司 Short text processing method and apparatus
CN107992477A (en) * 2017-11-30 2018-05-04 北京神州泰岳软件股份有限公司 Text subject determines method, apparatus and electronic equipment
CN108228721A (en) * 2017-12-08 2018-06-29 复旦大学 Fast text clustering method on large corpora
CN108628875A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of extracting method of text label, device and server
CN111090995A (en) * 2019-11-15 2020-05-01 合肥工业大学 Short text topic identification method and system
CN111897912A (en) * 2020-07-13 2020-11-06 上海乐言信息科技有限公司 Active learning short text classification method and system based on sampling frequency optimization
CN113407679A (en) * 2021-06-30 2021-09-17 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1828608A (en) * 2006-04-13 2006-09-06 北大方正集团有限公司 Multiple file summarization method based on sentence relation graph
CN101187919A (en) * 2006-11-16 2008-05-28 北大方正集团有限公司 Method and system for abstracting batch single document for document set
US20090164417A1 (en) * 2004-09-30 2009-06-25 Nigam Kamal P Topical sentiments in electronically stored communications

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090164417A1 (en) * 2004-09-30 2009-06-25 Nigam Kamal P Topical sentiments in electronically stored communications
CN1828608A (en) * 2006-04-13 2006-09-06 北大方正集团有限公司 Multiple file summarization method based on sentence relation graph
CN101187919A (en) * 2006-11-16 2008-05-28 北大方正集团有限公司 Method and system for abstracting batch single document for document set

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530316A (en) * 2013-09-12 2014-01-22 浙江大学 Science subject extraction method based on multi-view learning
CN103729422A (en) * 2013-12-23 2014-04-16 武汉传神信息技术有限公司 Information fragment associative output method and system
CN104573070B (en) * 2015-01-26 2018-06-15 清华大学 A kind of Text Clustering Method for mixing length text set
CN104573070A (en) * 2015-01-26 2015-04-29 清华大学 Text clustering method special for mixed length text sets
CN104850617A (en) * 2015-05-15 2015-08-19 百度在线网络技术(北京)有限公司 Short text processing method and apparatus
CN104850617B (en) * 2015-05-15 2018-04-20 百度在线网络技术(北京)有限公司 Short text processing method and processing device
CN108628875A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of extracting method of text label, device and server
CN108628875B (en) * 2017-03-17 2022-08-30 腾讯科技(北京)有限公司 Text label extraction method and device and server
CN107992477A (en) * 2017-11-30 2018-05-04 北京神州泰岳软件股份有限公司 Text subject determines method, apparatus and electronic equipment
CN108228721A (en) * 2017-12-08 2018-06-29 复旦大学 Fast text clustering method on large corpora
CN108228721B (en) * 2017-12-08 2021-06-04 复旦大学 Fast text clustering method on large corpus
CN111090995A (en) * 2019-11-15 2020-05-01 合肥工业大学 Short text topic identification method and system
CN111090995B (en) * 2019-11-15 2023-03-31 合肥工业大学 Short text topic identification method and system
CN111897912A (en) * 2020-07-13 2020-11-06 上海乐言信息科技有限公司 Active learning short text classification method and system based on sampling frequency optimization
CN111897912B (en) * 2020-07-13 2021-04-06 上海乐言科技股份有限公司 Active learning short text classification method and system based on sampling frequency optimization
CN113407679A (en) * 2021-06-30 2021-09-17 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium
CN113407679B (en) * 2021-06-30 2023-10-03 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN102831119B (en) 2016-08-17

Similar Documents

Publication Publication Date Title
CN102831119A (en) Short text clustering equipment and short text clustering method
CN103390051B (en) A kind of topic detection and tracking method based on microblog data
CN103699525B (en) A kind of method and apparatus automatically generating summary based on text various dimensions feature
Kovachev et al. Learn-as-you-go: new ways of cloud-based micro-learning for the mobile web
CN105095394B (en) webpage generating method and device
CN110866126A (en) College online public opinion risk assessment method
CN105512245A (en) Enterprise figure building method based on regression model
CN104536956A (en) A Microblog platform based event visualization method and system
CN106250550A (en) A kind of method and apparatus of real time correlation news content recommendation
CN105608200A (en) Network public opinion tendency prediction analysis method
CN104090955A (en) Automatic audio/video label labeling method and system
CN104008203A (en) User interest discovering method with ontology situation blended in
Wicaksono et al. Automatically building a corpus for sentiment analysis on Indonesian tweets
CN101819585A (en) Device and method for constructing forum event dissemination pattern
CN103020712B (en) A kind of distributed sorter of massive micro-blog data and method
CN109670039A (en) Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering
CN104850650A (en) Short-text expanding method based on similar-label relation
Tang et al. Social media-based disaster research: Development, trends, and obstacles
CN103823890A (en) Microblog hot topic detection method and device aiming at specific group
CN104978332A (en) UGC label data generating method, UGC label data generating device, relevant method and relevant device
CN103886020A (en) Quick search method of real estate information
CN104102658A (en) Method and device for mining text contents
CN103077234A (en) Voice website navigation system and method
CN102163189B (en) Method and device for extracting evaluative information from critical texts
CN103942240A (en) Method for building intelligent substation comprehensive data information application platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20171207

Address after: 18th Floor, Block B, No. 18 Zhongguancun Street, Haidian District, Beijing 100190

Patentee after: Datatang (Beijing) Technology Co., Ltd.

Address before: 20th Floor, Shining Tower, No. 35 Xueyuan Road, Haidian District, Beijing 100191

Patentee before: NEC (China) Co., Ltd.