CN105159882A - Method and apparatus for determining microblog hot topic - Google Patents

Info

Publication number
CN105159882A
Authority
CN
China
Prior art keywords
microblogging
sentence
new
effective
root
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510591206.0A
Other languages
Chinese (zh)
Inventor
张玉清
周传锋
李北格
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
China University of Geosciences Beijing
Original Assignee
China University of Geosciences Beijing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences Beijing
Priority to CN201510591206.0A
Publication of CN105159882A
Legal status: Pending

Abstract

Embodiments of the invention disclose a method and an apparatus for determining microblog hot topics. The method comprises: obtaining microblogs of opinion leaders; splitting the microblogs into sentences to extract effective sentences, and replacing each effective sentence with a shorter effective sentence of similar meaning, so as to obtain new microblogs that form a new microblog set; and clustering the effective sentences of the new microblogs in the new microblog set to determine microblog hot topics. With the technical solution provided by the embodiments, hot topics can be extracted in real time, so that current public opinion can be monitored conveniently.

Description

A method and apparatus for determining microblog hot topics
Technical field
The present invention relates to the field of computer application technology, and in particular to a method and apparatus for determining microblog hot topics.
Background
With the arrival of the Web 2.0 era, the number of microblog users has become enormous, status updates are frequent, information spreads rapidly, and users are relatively concentrated on a few microblog platforms. Analysis and research based on microblog data is therefore a direction that deserves close attention.
Microblogs have a vast user base, public-opinion information is generated and spread quickly on microblog platforms, and the number of microblog users is growing rapidly, so analysis based on microblog data has attracted wide public attention.
To use microblogs effectively for analyzing public opinion, acquiring microblog data is particularly important. Sina Weibo, for example, has a large number of active users who produce nearly 100 million microblog posts every day. Microblog users are divided into ordinary users and verified users. Among verified users, celebrities with many followers, high visibility, and a certain appeal and influence are known as opinion leaders. The microblogs they publish or forward become hot topics more easily, so acquiring the microblog data of opinion leaders in real time is one of the important methods of public-opinion analysis.
At present, acquiring the microblog data of opinion leaders is very convenient, but determining microblog hot topics in real time, so that current public opinion can be monitored, remains difficult.
Summary of the invention
In view of this, embodiments of the present invention provide a method and apparatus for determining microblog hot topics, so that hot topics can be extracted in real time and current public opinion can be monitored.
Other features and advantages of the present disclosure will become apparent from the detailed description below, or will be learned in part through practice of the present disclosure.
In a first aspect, an embodiment of the present invention provides a method for determining microblog hot topics, comprising:
obtaining microblogs of opinion leaders;
splitting the microblogs into sentences and extracting effective sentences, and replacing each effective sentence with a shorter effective sentence of similar meaning, so as to obtain new microblogs that form a new microblog set;
clustering the effective sentences of the new microblogs in the new microblog set to determine microblog hot topics.
Further, obtaining microblogs of opinion leaders comprises: using a directed web crawler to obtain the microblogs published by opinion leaders, or using follow mode to obtain the microblogs published by opinion leaders.
Further, replacing an effective sentence with a shorter effective sentence of similar meaning comprises:
performing regular-expression matching on the effective sentences in the microblog sentence set, and, among effective sentences matched as similar, replacing the longer effective sentence with the shorter one.
Further, matching the effective sentences in the microblog sentence set comprises:
adding a wildcard before and after each word in the shorter effective sentence to form a matching condition, and determining whether the longer effective sentence satisfies the matching condition.
Further, matching the effective sentences in the microblog sentence set comprises: traversing the microblog set and performing pairwise regular-expression matching on all effective sentences in the microblog set.
Further, clustering the effective sentences of the new microblogs in the new microblog set to determine microblog hot topics comprises:
traversing each new microblog in the new microblog set, assigning each new microblog a sequential number as its root, and labeling the effective sentences contained in each new microblog according to the roots: if an effective sentence appears for the first time, its label is set to the root of the new microblog that contains it; otherwise, its label is set to the root of the microblog in which the effective sentence first appeared;
determining the category of each new microblog according to its root and the labels of its effective sentences, the categories including at least ancestor microblogs and subordinate microblogs, wherein an ancestor microblog is a new microblog all of whose effective sentences appear for the first time, and a subordinate microblog is a new microblog whose effective sentence labels, other than those equal to its own root, are all the root of one and the same ancestor microblog or the roots of subordinate microblogs under that ancestor microblog;
finding the ancestor microblog of each subordinate microblog, and merging the original microblogs corresponding to new microblogs that share the same ancestor microblog, to determine microblog hot topics.
Further, the categories also include noise microblogs, a noise microblog being a new microblog that is subordinate to different ancestor microblogs;
determining the category of each new microblog according to its root and the labels of its effective sentences comprises:
if the labels of all effective sentences in a new microblog are identical and equal to the root of that new microblog, determining that the new microblog is an ancestor microblog;
if, apart from the root of the new microblog itself, the effective sentence labels in a new microblog contain one other label, determining that the new microblog is a subordinate microblog, subordinate to the new microblog whose root equals that label;
if, apart from the root of the new microblog itself, the effective sentence labels in a new microblog contain at least two different labels, and the new microblogs whose roots equal those labels are subordinate microblogs of the same ancestor microblog, determining that the new microblog is a subordinate microblog, subordinate to that same ancestor microblog; otherwise, determining that the new microblog is a noise microblog.
Further, after clustering the effective sentences of the new microblogs in the new microblog set to determine microblog hot topics, the method also comprises:
counting sentence frequencies over the new microblogs related to each determined hot topic, and using the effective sentence with the highest frequency as the title of the hot topic.
Further, after clustering the effective sentences of the new microblogs in the new microblog set to determine microblog hot topics, the method also comprises:
collecting statistics on the microblogs related to each determined hot topic, determining the popularity of each hot topic from the statistics, and ranking the hot topics by popularity.
In a second aspect, an embodiment of the present invention further provides an apparatus for determining microblog hot topics, comprising:
a microblog acquiring unit, configured to obtain microblogs of opinion leaders;
a sentence splitting unit, configured to split the microblogs into sentences and extract effective sentences, and to replace each effective sentence with a shorter effective sentence of similar meaning, so as to obtain new microblogs that form a new microblog set;
a hot topic determining unit, configured to cluster the effective sentences of the new microblogs in the new microblog set to determine microblog hot topics.
The beneficial technical effects of the technical solution proposed by the embodiments of the present invention are as follows:
In the embodiments of the present invention, microblogs of opinion leaders are obtained, the microblogs are split into sentences and effective sentences are extracted, each effective sentence is replaced with a shorter effective sentence of similar meaning to obtain new microblogs that form a new microblog set, and the effective sentences of the new microblogs in the new microblog set are clustered to determine microblog hot topics. Hot topics can thus be extracted in real time, so that current public opinion can be monitored.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from the content of the embodiments and these drawings without creative effort.
Fig. 1 is a flowchart of the method for determining microblog hot topics according to Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method for determining microblog hot topics according to Embodiment 2 of the present invention;
Fig. 3 is a flowchart of the method for determining microblog hot topics according to Embodiment 3 of the present invention;
Fig. 4 is a structural block diagram of the apparatus for determining microblog hot topics according to Embodiment 4 of the present invention.
Detailed description
To make the technical problem solved, the technical solution adopted, and the technical effect achieved by the present invention clearer, the technical solutions of the embodiments are described in further detail below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The technical solution of the present invention is further described below through specific embodiments with reference to the drawings.
Embodiment 1
Fig. 1 is a flowchart of the method for determining microblog hot topics according to this embodiment. This embodiment is applicable to situations in which hot topics need to be extracted in real time. As shown in Fig. 1, the method of this embodiment comprises:
Step S101: obtain microblogs of opinion leaders.
Microblogs of opinion leaders can be obtained in several ways, for example by using a directed web crawler to obtain the microblogs they publish, or by using follow mode to obtain the microblogs they publish.
Step S102: split the microblogs into sentences and extract effective sentences, and replace each effective sentence with a shorter effective sentence of similar meaning, so as to obtain new microblogs that form a new microblog set.
Microblog platforms usually limit the length of a microblog; Sina Weibo, for example, allows at most 140 Chinese characters. Microblog content is therefore highly condensed, its wording is terse, and similar or repeated sentences occur frequently. Consequently, if two microblogs share a similar or identical sentence, the two microblogs are assigned to the same topic.
In this embodiment, the microblogs are first split into sentences and effective sentences are extracted (for example, sentences whose character count exceeds a certain number). Each effective sentence is then replaced with a shorter effective sentence of similar meaning, so that sentences expressing similar meanings are treated as identical sentences in the analysis.
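The sentence splitting and effective sentence extraction described above might be sketched as follows. This is only an illustration: the delimiter set, the length threshold, and the function name extract_effective_sentences are assumptions, since the description states only that microblogs are split into sentences and that an effective sentence exceeds some character count.

```python
import re

# Assumed delimiter set: common Chinese and ASCII sentence punctuation.
SENTENCE_DELIMITERS = r"[。！？；，,.!?;\n]+"
# Assumed threshold: the description does not give a concrete number.
MIN_EFFECTIVE_LENGTH = 3


def extract_effective_sentences(microblog, min_length=MIN_EFFECTIVE_LENGTH):
    """Split one microblog at punctuation and keep sentences longer than min_length."""
    parts = (part.strip() for part in re.split(SENTENCE_DELIMITERS, microblog))
    return [part for part in parts if len(part) > min_length]


# Gloss: "Heavy rain in Beijing today. Good! Several subway lines suspended"
effective = extract_effective_sentences("今天北京暴雨。好！地铁多条线路停运")
```

Very short fragments such as interjections are dropped here because they exceed no threshold; a real system would tune both the delimiters and the threshold to its data.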
Replacing an effective sentence with a shorter effective sentence of similar meaning can be done in several ways. For example, a wildcard can be added before and after each word in the shorter effective sentence to form a matching condition, and the longer effective sentence is then checked against that condition. This matching can be performed pairwise among all effective sentences. Alternatively, a set of shorter effective sentences can be generated by a preset algorithm or specified manually, and every effective sentence in a microblog is matched against each sentence in that set; if a match succeeds, the effective sentence in the microblog is replaced with the shorter effective sentence it matched.
After these operations, the new microblogs in the new microblog set contain many identical shorter effective sentences.
Further, the effective sentences in the microblog sentence set can also be matched in several ways. For example, the microblog set is traversed and all effective sentences in it are matched pairwise: a wildcard is added before and after each word in the shorter effective sentence to form a matching condition, and if the longer effective sentence satisfies that condition, the longer of the two similar sentences is replaced with the shorter one.
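A minimal sketch of the wildcard matching condition and the pairwise substitution is given below. It assumes that each "word" is a single character (a plausible reading for Chinese text, but not confirmed by the description), and the names build_matching_condition and substitute_shorter are hypothetical.

```python
import re


def build_matching_condition(short_sentence):
    """Put a wildcard before and after every character: .*c1.*c2.* ... cn.*"""
    body = ".*".join(re.escape(ch) for ch in short_sentence)
    return re.compile(".*" + body + ".*", re.DOTALL)


def substitute_shorter(microblogs):
    """microblogs: one list of effective sentences per microblog.

    Returns the new microblogs, in which every longer sentence that satisfies
    the matching condition of a shorter sentence is replaced by that sentence.
    """
    # Sort by length (then text, for determinism) so the shortest match wins.
    sentences = sorted({s for mb in microblogs for s in mb}, key=lambda s: (len(s), s))
    replacement = {}
    for long_s in sentences:
        for short_s in sentences:
            if len(short_s) >= len(long_s):
                break  # only strictly shorter sentences may replace long_s
            if build_matching_condition(short_s).fullmatch(long_s):
                replacement[long_s] = short_s
                break
    return [[replacement.get(s, s) for s in mb] for mb in microblogs]


# Gloss: "It rained heavily in Beijing today" is replaced by "Beijing heavy rain".
new_microblogs = substitute_shorter([["北京今天下了暴雨", "地铁停运"], ["北京暴雨"]])
```

The pairwise loop is quadratic in the number of distinct sentences, which matches the pairwise matching the description names; the alternative of matching against a preset set of short sentences would replace the inner loop with a pass over that set.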
Step S103: cluster the effective sentences of the new microblogs in the new microblog set to determine microblog hot topics.
This step can be carried out in several ways, for example by collecting statistics on each effective sentence and determining hot topics from the statistics.
As another example, hot topics can also be determined by the following method:
traversing each new microblog in the new microblog set, assigning each new microblog a sequential number as its root, and labeling the effective sentences contained in each new microblog according to the roots: if an effective sentence appears for the first time, its label is set to the root of the new microblog that contains it; otherwise, its label is set to the root of the microblog in which the effective sentence first appeared;
determining the category of each new microblog according to its root and the labels of its effective sentences, the categories including at least ancestor microblogs and subordinate microblogs, wherein an ancestor microblog is a new microblog all of whose effective sentences appear for the first time, and a subordinate microblog is a new microblog whose effective sentence labels, other than those equal to its own root, are all the root of one and the same ancestor microblog or the roots of subordinate microblogs under that ancestor microblog;
finding the ancestor microblog of each subordinate microblog, and merging the original microblogs corresponding to new microblogs that share the same ancestor microblog, to determine microblog hot topics.
Noise microblogs, which are subordinate to different ancestor microblogs, can be ignored. In this embodiment, determining the category of each new microblog according to its root and the labels of its effective sentences may further comprise:
if the labels of all effective sentences in a new microblog are identical and equal to the root of that new microblog, determining that the new microblog is an ancestor microblog;
if, apart from the root of the new microblog itself, the effective sentence labels in a new microblog contain one other label, determining that the new microblog is a subordinate microblog, subordinate to the new microblog whose root equals that label;
if, apart from the root of the new microblog itself, the effective sentence labels in a new microblog contain at least two different labels, and the new microblogs whose roots equal those labels are subordinate microblogs of the same ancestor microblog, determining that the new microblog is a subordinate microblog, subordinate to that same ancestor microblog; otherwise, determining that the new microblog is a noise microblog.
At this point, the determination of microblog hot topics in this embodiment is complete. The above steps can be repeated at regular intervals. The time span of the microblogs obtained in operation S101 can be set to a first preset duration, for example 24 hours, and the period for determining microblog hot topics can be set to a second preset duration, for example 2 hours. Each time microblogs for a new second preset duration are obtained, the microblogs of the earliest second preset duration are removed from the new microblog set. In this way the new microblog set to be clustered always holds 24 hours of data, which keeps the data current.
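The rolling 24-hour window described above could be kept with a simple queue, as in the sketch below. The class name MicroblogWindow and the (timestamp, microblog) record shape are assumptions made for illustration; the sketch also assumes that batches arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class MicroblogWindow:
    """Keeps only microblogs from the last `window` (the first preset duration)."""

    def __init__(self, window=timedelta(hours=24)):
        self.window = window
        self.items = deque()  # (timestamp, microblog), oldest first

    def add_batch(self, batch, now):
        """Add one period's microblogs (the second preset duration) and evict old ones."""
        self.items.extend(sorted(batch, key=lambda item: item[0]))
        cutoff = now - self.window
        while self.items and self.items[0][0] < cutoff:
            self.items.popleft()
        return [microblog for _, microblog in self.items]


t0 = datetime(2015, 9, 1, 0, 0)
window = MicroblogWindow()
first = window.add_batch([(t0, "a")], now=t0 + timedelta(hours=2))
second = window.add_batch([(t0 + timedelta(hours=23), "b")], now=t0 + timedelta(hours=25))
```

Calling add_batch once per clustering period (for example every 2 hours) returns the set to cluster, which always spans at most the last 24 hours.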
After the microblog hot topics are determined, sentence frequencies can also be counted over the new microblogs related to each determined hot topic, and the effective sentence with the highest frequency is used as the title of the hot topic.
If more than one hot topic is determined in the above operations, statistics can also be collected on the microblogs related to each determined hot topic, the popularity of each hot topic is determined from the statistics, and the hot topics are ranked by popularity.
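Titling and ranking can be sketched with simple counting, as below. The description does not say which statistic defines popularity; using the number of related microblogs is an assumption of this sketch, and the function names topic_title and rank_topics are hypothetical.

```python
from collections import Counter


def topic_title(topic_microblogs):
    """Return the most frequent effective sentence among a topic's new microblogs."""
    counts = Counter(s for microblog in topic_microblogs for s in microblog)
    return counts.most_common(1)[0][0]


def rank_topics(topics):
    """Rank topics by popularity, here assumed to be the number of microblogs."""
    return sorted(topics, key=len, reverse=True)


# Each topic is a list of new microblogs; each microblog is a list of sentences.
topics = [
    [["A", "B"], ["A"]],
    [["C"], ["C", "D"], ["C"]],
]
ranked = rank_topics(topics)
titles = [topic_title(topic) for topic in ranked]
```

Other popularity measures, such as forward or comment counts, would fit the same structure by changing the sort key, provided that data is collected with the microblogs.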
In this embodiment, microblogs of opinion leaders are obtained, the microblogs are split into sentences and effective sentences are extracted, each effective sentence is replaced with a shorter effective sentence of similar meaning to obtain new microblogs that form a new microblog set, and the effective sentences of the new microblogs in the new microblog set are clustered to determine microblog hot topics. Hot topics can thus be extracted in real time, so that current public opinion can be monitored.
Embodiment 2
Fig. 2 is a flowchart of the method for determining microblog hot topics according to this embodiment. As shown in Fig. 2, the method of this embodiment comprises:
Step S201: obtain microblogs of opinion leaders.
Step S202: split the microblogs into sentences and extract effective sentences, and replace each effective sentence with a shorter effective sentence of similar meaning, so as to obtain new microblogs that form a new microblog set.
Step S203: traverse each new microblog in the new microblog set, assign each new microblog a sequential number as its root, and label the effective sentences contained in each new microblog according to the roots.
For example, this step may comprise: first assigning a root to each new microblog in traversal order, R(i) = i (i = 0, 1, 2, 3, ...), where R(i) is the i-th new microblog traversed. The effective sentences split from each new microblog are then labeled in turn: if an effective sentence appears for the first time, its label is set to the root of the current new microblog; otherwise, its label is set to the root of the microblog in which the effective sentence first appeared.
The traversal of microblogs in this embodiment is illustrated below, with letters standing in for sentences:
For example, microblog R(0) has root 0 and contains effective sentences A, B, C;
microblog R(1) has root 1 and contains effective sentences A, B;
microblog R(2) has root 2 and contains effective sentences D, A, B;
microblog R(3) has root 3 and contains effective sentences B, C, D;
microblog R(4) has root 4 and contains effective sentences C, D.
For microblog R(0), the root is 0. Effective sentences A, B, and C all appear for the first time, so every effective sentence label in R(0) is the root 0 of this microblog: A is labeled 0, B is labeled 0, and C is labeled 0.
Next consider microblog R(1), whose root is 1. Effective sentence A first appeared in R(0), so in R(1) the label of A is the root 0 of R(0). Likewise, B first appeared in R(0), so in R(1) the label of B is also the root 0 of R(0).
Next consider microblog R(2), whose root is 2. Effective sentence D appears for the first time, so in R(2) the label of D is the root 2 of R(2). A first appeared in R(0), so in R(2) the label of A is the root 0 of R(0). B first appeared in R(0), so in R(2) the label of B is the root 0 of R(0).
For microblog R(3), the root is 3. Effective sentence B first appeared in R(0), so in R(3) the label of B is the root 0 of R(0). C first appeared in R(0), so in R(3) the label of C is also the root 0 of R(0). D first appeared in R(2), so the label of D is the root 2 of R(2).
For microblog R(4), the root is 4. Effective sentence C first appeared in R(0), so in R(4) the label of C is the root 0 of R(0). D first appeared in R(2), so in R(4) the label of D is the root 2 of R(2).
The above analysis gives the following, where X0, X1, and X2 denote the labels of the first, second, and third effective sentences of a microblog:
Microblog R(0) contains effective sentences A, B, C; its root is 0; X0=0, X1=0, X2=0;
Microblog R(1) contains effective sentences A, B; its root is 1; X0=0, X1=0;
Microblog R(2) contains effective sentences D, A, B; its root is 2; X0=2, X1=0, X2=0;
Microblog R(3) contains effective sentences B, C, D; its root is 3; X0=0, X1=0, X2=2;
Microblog R(4) contains effective sentences C, D; its root is 4; X0=0, X1=2.
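The labeling in this example can be reproduced with a short sketch. The function name label_sentences is hypothetical; the logic follows the rule stated in step S203, with a dictionary remembering the root of the microblog in which each sentence first appeared.

```python
def label_sentences(new_microblogs):
    """Return, for each new microblog R(i), the labels of its effective sentences.

    The root of R(i) is its traversal index i. A sentence seen for the first
    time gets the current root as its label; otherwise it gets the root of the
    microblog in which it first appeared.
    """
    first_seen = {}  # effective sentence -> root of its first microblog
    labels = []
    for root, sentences in enumerate(new_microblogs):
        row = []
        for sentence in sentences:
            first_seen.setdefault(sentence, root)
            row.append(first_seen[sentence])
        labels.append(row)
    return labels


# The five microblogs of the example, with letters standing in for sentences.
example = [list("ABC"), list("AB"), list("DAB"), list("BCD"), list("CD")]
labels = label_sentences(example)
# labels == [[0, 0, 0], [0, 0], [2, 0, 0], [0, 0, 2], [0, 2]]
```

Each inner list corresponds to the X0, X1, X2 values listed for R(0) through R(4).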
Step S204: determine the category of each new microblog according to its root and the labels of its effective sentences.
The categories include at least ancestor microblogs and subordinate microblogs, wherein an ancestor microblog is a new microblog all of whose effective sentences appear for the first time, and a subordinate microblog is a new microblog whose effective sentence labels, other than those equal to its own root, are all the root of one and the same ancestor microblog or the roots of subordinate microblogs under that ancestor microblog;
Further, the categories also include noise microblogs, a noise microblog being a new microblog that is subordinate to different ancestor microblogs;
determining the category of each new microblog according to its root and the labels of its effective sentences comprises:
if the labels of all effective sentences in a new microblog are identical and equal to the root of that new microblog, determining that the new microblog is an ancestor microblog;
if, apart from the root of the new microblog itself, the effective sentence labels in a new microblog contain one other label, determining that the new microblog is a subordinate microblog, subordinate to the new microblog whose root equals that label;
if, apart from the root of the new microblog itself, the effective sentence labels in a new microblog contain at least two different labels, and the new microblogs whose roots equal those labels are subordinate microblogs of the same ancestor microblog, determining that the new microblog is a subordinate microblog, subordinate to that same ancestor microblog; otherwise, determining that the new microblog is a noise microblog.
Step S205: find the ancestor microblog of each subordinate microblog, and merge the original microblogs corresponding to new microblogs that share the same ancestor microblog, to determine microblog hot topics.
After the previous step has determined which microblog each new microblog is subordinate to, a microblog tree has in effect been built for each ancestor microblog. The ancestor microblog of any microblog is found by tracing back through this tree, from each microblog to the microblog it is subordinate to.
For example, in the example from the previous step, all effective sentence labels of microblog R(0) are identical, all 0, and equal to the root 0 of that new microblog, so new microblog R(0) is determined to be an ancestor microblog;
all effective sentence labels of microblog R(1) are the root 0 of the same ancestor microblog R(0), so new microblog R(1) is determined to be a subordinate microblog of ancestor microblog R(0);
among the effective sentence labels of microblog R(2), the first is the root 2 of that new microblog itself, and the others are the root of the same ancestor microblog R(0), so new microblog R(2) is determined to be a subordinate microblog of ancestor microblog R(0);
the first and second effective sentence labels of microblog R(3) are the root 0 of ancestor microblog R(0), and the third is the root of microblog R(2), which is itself a subordinate microblog of ancestor microblog R(0). The labels of R(3) therefore all belong to the same ancestor microblog or to a subordinate microblog under it, so new microblog R(3) is determined to be a subordinate microblog of ancestor microblog R(0);
the first effective sentence label of microblog R(4) is the root 0 of microblog R(0), and the second is the root of microblog R(2), which is itself a subordinate microblog of ancestor microblog R(0). The labels of R(4) therefore all belong to the same ancestor microblog or to a subordinate microblog under it, so new microblog R(4) is determined to be a subordinate microblog of ancestor microblog R(0).
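The classification and merging walked through above might be sketched as follows. This is one interpretation of the rules, not a definitive implementation: each label is traced back to its ancestor, a microblog whose other labels all trace to one ancestor is treated as subordinate, and anything else (including a microblog that traces back to a noise microblog, a case the description does not cover) is treated as noise. The name cluster_by_ancestor is hypothetical.

```python
def cluster_by_ancestor(labels):
    """labels[i] holds the effective sentence labels of new microblog R(i)."""
    parent = {}    # root -> root it is subordinate to (None for noise)
    category = {}  # root -> "ancestor", "subordinate", or "noise"

    def find_ancestor(root):
        # Trace back through the microblog tree to the ancestor's root.
        while root is not None and parent[root] != root:
            root = parent[root]
        return root

    for root, row in enumerate(labels):
        others = {label for label in row if label != root}
        if not others:
            category[root], parent[root] = "ancestor", root
            continue
        ancestors = {find_ancestor(label) for label in others}
        if len(ancestors) == 1 and None not in ancestors:
            category[root] = "subordinate"
            # One other label: subordinate to that microblog; several: to the shared ancestor.
            parent[root] = next(iter(others)) if len(others) == 1 else next(iter(ancestors))
        else:
            category[root], parent[root] = "noise", None

    # Merge microblogs that share an ancestor into one topic; noise is ignored.
    topics = {}
    for root in range(len(labels)):
        if category[root] != "noise":
            topics.setdefault(find_ancestor(root), []).append(root)
    return category, topics


# Labels from the worked example: R(0)..R(4).
category, topics = cluster_by_ancestor([[0, 0, 0], [0, 0], [2, 0, 0], [0, 0, 2], [0, 2]])
# topics == {0: [0, 1, 2, 3, 4]}: all five microblogs merge under ancestor R(0).
```

Because a label always refers to a microblog traversed earlier, processing the microblogs in traversal order guarantees that every label can already be traced back when it is needed.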
Step S206, to determined hot issue the new microblogging of being correlated with add up, determine the temperature of described hot issue according to statistics, according to described temperature, described hot issue sorted.
If determined hot issue is more than one in aforesaid operations, also can to determined hot issue the microblogging of being correlated with add up, determine the temperature of described hot issue according to statistics, according to described temperature, described hot issue sorted.
Step S207, to determined hot issue the microblogging of being correlated with carry out sentence frequency statistics, using the title of effective sentence the highest for the frequency of occurrences as described hot issue.
After determining microblog hot topic, can also to determined hot issue the new microblogging of being correlated with carry out sentence frequency statistics, using the title of effective sentence the highest for the frequency of occurrences as described hot issue.
According to the effective sentence labels and the root of a new microblog, determine the category to which the new microblog belongs;
If all effective sentence labels of the new microblog are identical and equal to the root of that microblog, determine that the new microblog is an ancestor microblog.
On the basis of Embodiment One, this embodiment further discloses an implementation of clustering the effective sentences of the new microblogs in the new microblog set to determine microblog hot topics. Each new microblog in the new microblog set is traversed and assigned a sequential number as its root; the effective sentences in each new microblog are labeled according to the microblog roots; the category of each new microblog is determined from its root and its effective sentence labels; the ancestor microblog of each subordinate microblog is looked up; and the original microblogs corresponding to new microblogs that share the same ancestor microblog are merged to determine a microblog hot topic. Hot topics can thus be determined in real time from the linguistic features of the microblogs, so that current public opinion can be monitored.
Embodiment three
Fig. 3 is a flowchart of the method for determining a microblog hot topic according to this embodiment. As shown in Fig. 3, the method for determining a microblog hot topic according to this embodiment comprises:
Step S301: opinion leader microblog database.
Use a focused web crawler, or the follow mode, to obtain the microblogs published by opinion leaders, including the content, ID and time of each microblog, to form microblog set D0.
Step S302: split the microblogs into sentences.
Split each obtained microblog into sentences by detecting punctuation marks, and record the ID of the microblog that contains each sentence. The sentences are then matched pairwise using regular-expression matching: the shorter sentence is matched against the longer one, and if the match succeeds the longer sentence is replaced with the shorter one.
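The sentence splitting of step S302 might look like the sketch below. The delimiter set, the minimum length `MIN_LEN` and the function name `split_microblogs` are assumptions made for illustration; the source only says that punctuation marks are detected and that the ID of the containing microblog is recorded.

```python
import re
from typing import Dict, List, Tuple

# Assumed delimiter set: common Chinese and ASCII punctuation plus newlines.
_DELIMITERS = re.compile(r"[。！？；，.!?;,\n]+")
MIN_LEN = 4  # hypothetical minimum length of an effective sentence


def split_microblogs(d0: Dict[str, str], min_len: int = MIN_LEN) -> List[Tuple[str, str]]:
    """Split every microblog in D0 into (microblog_id, sentence) pairs."""
    pairs: List[Tuple[str, str]] = []
    for blog_id, text in d0.items():
        for piece in _DELIMITERS.split(text):
            sentence = piece.strip()
            # Keep only sentences long enough to count as effective.
            if len(sentence) >= min_len:
                pairs.append((blog_id, sentence))
    return pairs
```

Splitting on an ASCII period also cuts URLs and decimal numbers apart, so a production version would probably strip or protect those first.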
Step S303: match the sentences pairwise.
Traverse the microblogs of all followed opinion leaders and perform pairwise regular-expression matching on all the sentences obtained.
The matching process is as follows: both sentences being matched must be effective sentences (sentences whose number of words exceeds a certain threshold). The shorter of the two sentences is used as the regular-expression string: the symbol "*" is added before and after each word in that sentence, and the resulting expression is matched against the longer sentence.
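The matching rule above can be illustrated with Python's `re` module. The sketch assumes that the "*" in the text means "any run of characters", which is written `.*` in regular-expression syntax, and that each "word" is a single character, as is usual for Chinese text. The names `build_pattern` and `is_similar` are hypothetical.

```python
import re


def build_pattern(short: str) -> re.Pattern:
    """Wrap every character of the shorter sentence in '.*' wildcards."""
    # re.escape keeps characters such as '?' or '(' literal.
    body = ".*".join(re.escape(ch) for ch in short)
    return re.compile(".*" + body + ".*", re.DOTALL)


def is_similar(a: str, b: str) -> bool:
    """Return True if the shorter sentence matches the longer one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    return build_pattern(short).fullmatch(long_) is not None
```

Under this reading the match succeeds exactly when the characters of the shorter sentence appear in the longer sentence in the same order, so a plain subsequence check would give the same answer without regular-expression backtracking.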
Step S304: determine whether the match succeeded; if so, perform step S305; otherwise return to step S303.
Step S305: replace the long sentence with the short sentence.
Step S306: obtain the new microblog set.
If the match succeeds, the two sentences are similar sentences. The longer sentence is replaced with the shorter sentence in the microblog, which yields the new microblog set D1.
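Steps S303 to S306 can be sketched end to end as below. Several points are assumptions: the input is taken to be a mapping from microblog ID to its already extracted effective sentences, an ordered-subsequence check stands in for the wildcard match, and when several shorter sentences match, the shortest one is chosen. The names `is_subsequence` and `shorten_sentences` are hypothetical.

```python
from typing import Dict, List


def is_subsequence(short: str, long_: str) -> bool:
    """True if the characters of `short` occur in `long_` in the same order."""
    remaining = iter(long_)
    return all(ch in remaining for ch in short)


def shorten_sentences(d0: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build D1 by replacing each sentence with its shortest matching sentence."""
    # Distinct sentences, shortest first, so the first hit is the shortest match.
    ordered = sorted({s for sents in d0.values() for s in sents}, key=lambda s: (len(s), s))
    replacement: Dict[str, str] = {}
    for index, long_ in enumerate(ordered):
        for short in ordered[:index]:
            if len(short) < len(long_) and is_subsequence(short, long_):
                replacement[long_] = short
                break
    return {blog_id: [replacement.get(s, s) for s in sents] for blog_id, sents in d0.items()}
```

The pairwise comparison is quadratic in the number of distinct sentences, which fits the source's description of matching all sentences pairwise.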
Step S307: determine whether all sentence labels of a new microblog are identical; if so, perform step S308; otherwise perform step S311.
Traverse all microblogs in D1. During the traversal, assign each microblog a root R(i) = i (i = 0, 1, 2, 3, ...) and split each microblog into sentences in turn. Each effective sentence receives a sentence label xj (j = 0, 1, 2, ...), assigned by the following rule: if the sentence appears for the first time, its label is the root of the current microblog; if the sentence has already appeared in an earlier microblog, its label is the root of the first microblog in which it appeared.
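The label assignment rule can be sketched in a few lines. The sketch assumes D1 is given as a list of microblogs in traversal order, each a list of effective sentences, so that the list index plays the role of the root R(i); the name `assign_labels` is hypothetical.

```python
from typing import Dict, List


def assign_labels(d1: List[List[str]]) -> List[List[int]]:
    """Label each sentence with the root of the first microblog containing it."""
    first_seen: Dict[str, int] = {}
    labels: List[List[int]] = []
    for root, sentences in enumerate(d1):  # root R(i) = i
        # setdefault stores `root` only when the sentence is new,
        # otherwise it returns the root recorded at first appearance.
        labels.append([first_seen.setdefault(s, root) for s in sentences])
    return labels
```

For example, with microblogs `[["A", "B"], ["A", "C"], ["D"], ["B", "C"]]` the labels come out as `[[0, 0], [0, 1], [2], [0, 1]]`.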
After all sentence labels of a microblog have been assigned, determine the category to which the current microblog belongs.
Step S308: determine that this new microblog is an ancestor microblog.
If the labels of all sentences in the current microblog are identical and equal to the root of the current microblog, i.e. x0 = x1 = x2 = ... = i, the microblog is an ancestor microblog;
Step S309: attach the subordinate microblog to its ancestor microblog according to the labels.
Step S310: determine the hot topics; end.
By finding the ancestor microblog of every microblog, the microblogs with the same R(i) are merged to form the final hot topic list. Sentence frequency statistics are then performed on the microblogs in each hot topic, and the effective sentence with the highest frequency of occurrence is used as the title of the topic.
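The merge in steps S309 and S310 can be sketched as a walk up the root chain followed by a grouping. It assumes `root[i]` holds R(i) after classification (an ancestor points to itself, a subordinate microblog points to an earlier microblog) and that noise microblogs are marked with `None`; that marker and the names `find_ancestor` and `merge_topics` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def find_ancestor(i: int, root: List[Optional[int]]) -> int:
    """Follow R(i) upward until a microblog that is its own root is reached."""
    while root[i] != i:
        i = root[i]  # a subordinate root always points to an earlier microblog
    return i


def merge_topics(root: List[Optional[int]], originals: List[str]) -> Dict[int, List[str]]:
    """Group the original microblogs by the root of their ancestor microblog."""
    topics: Dict[int, List[str]] = defaultdict(list)
    for i, text in enumerate(originals):
        if root[i] is None:  # assumed marker for a noise microblog
            continue
        topics[find_ancestor(i, root)].append(text)
    return dict(topics)
```

Because a subordinate root xk is always smaller than i, the chain strictly decreases and the loop terminates.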
The original microblogs are retained. Clustering is performed to extract topics: microblogs that contain identical sentences after sentence matching are merged, and the number of microblogs in each topic is then counted to determine the popularity of the topic.
Clustering is performed once at set intervals. The time window can be set to 24 hours; each new 2 hours of incoming microblogs then pushes the oldest 2 hours of microblogs out of the cluster set, so that the cluster set always holds 24 hours of data, which keeps the data current.
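The rolling 24-hour cluster set can be sketched with a double-ended queue. The class name `ClusterWindow`, the `(publish time, text)` tuple layout and the assumption that batches arrive in chronological order are all illustrative choices, not details from the source.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Iterable, List, Tuple

Post = Tuple[datetime, str]  # (publish time, microblog text)


class ClusterWindow:
    """Keep only the microblogs published within the last `span`."""

    def __init__(self, span: timedelta = timedelta(hours=24)) -> None:
        self.span = span
        self._posts: Deque[Post] = deque()  # oldest post on the left

    def add_batch(self, batch: Iterable[Post], now: datetime) -> List[str]:
        """Append a new batch, evict expired posts, return the cluster set."""
        self._posts.extend(sorted(batch, key=lambda post: post[0]))
        cutoff = now - self.span
        while self._posts and self._posts[0][0] < cutoff:
            self._posts.popleft()
        return [text for _, text in self._posts]
```

Calling `add_batch` every 2 hours with the newly collected microblogs reproduces the behaviour described above: the newest 2 hours enter the set and anything older than 24 hours leaves it.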
Step S311: determine whether the microblog has two different labels; if so, perform step S312; otherwise perform step S313.
Step S312: determine that this new microblog is a subordinate microblog, and perform step S309.
If, besides the root of the current microblog, the sentence labels include one other sentence label xk < i, the microblog is a subordinate microblog, and its root is set to R(i) = xk;
Step S313: determine that this new microblog is an ambiguous microblog.
An ambiguous microblog, as the term is used in this embodiment, is a provisional category for a microblog that cannot yet be determined to be a subordinate microblog or a noise microblog; it needs to be judged further by a preset algorithm.
If, besides the root of the current microblog, the sentence labels include several different sentence labels, the microblog is an ambiguous microblog.
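The three-way decision of steps S307, S311 and S313 can be sketched as follows. The category strings and the function name `classify` are hypothetical, and the sketch assumes that every microblog has at least one effective sentence.

```python
from typing import List, Tuple


def classify(i: int, labels: List[int]) -> Tuple[str, int]:
    """Return (category, root) for microblog i given its sentence labels."""
    others = {x for x in labels if x != i}  # labels other than the microblog's own root
    if not others:
        return "ancestor", i  # x0 = x1 = ... = i
    if len(others) == 1:
        return "subordinate", others.pop()  # R(i) = xk with xk < i
    return "ambiguous", i  # several foreign labels: decide in a second pass
```

For example, `classify(1, [0, 1])` gives `("subordinate", 0)`, while `classify(3, [0, 1, 3])` gives `("ambiguous", 3)`.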
Step S314: determine whether the labels belong to the same ancestor; if so, perform step S312; otherwise perform step S315.
After all microblogs have been traversed, the ambiguous microblogs are processed once more. If the sentence labels of an ambiguous microblog include several labels other than the root of the microblog itself, determine whether those labels all belong to one ancestor microblog. If they do, the microblog is a subordinate microblog; if they belong to several different ancestor microblogs, the microblog is a noise microblog and is ignored.
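The second pass over ambiguous microblogs can be sketched as below. It assumes `root` is a mutable list holding R(i) for every microblog, with ancestors and still unresolved microblogs pointing to themselves; the names `find_ancestor` and `resolve_ambiguous` and the category strings are hypothetical.

```python
from typing import List


def find_ancestor(i: int, root: List[int]) -> int:
    """Follow R(i) upward until a microblog that is its own root is reached."""
    while root[i] != i:
        i = root[i]
    return i


def resolve_ambiguous(i: int, labels: List[int], root: List[int]) -> str:
    """Decide whether ambiguous microblog i is subordinate or noise."""
    ancestors = {find_ancestor(x, root) for x in labels if x != i}
    if len(ancestors) == 1:
        root[i] = ancestors.pop()  # all foreign labels lead to one ancestor
        return "subordinate"
    return "noise"  # labels lead to different ancestors: ignore the microblog
```

With `root = [0, 1, 0, 3]`, a microblog 3 labeled `[0, 0, 2]` resolves to a subordinate of microblog 0, because microblog 2 is itself subordinate to 0; labeled `[0, 1, 3]` it would be noise.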
Step S315: determine that this new microblog is a noise microblog, ignore it, and end.
The technical solution of this embodiment makes full use of the features of the microblog content itself, requires little data processing, achieves high hot-topic coverage, and has good real-time performance.
Embodiment four
Fig. 4 is a structural block diagram of the apparatus for determining a microblog hot topic according to this embodiment. As shown in Fig. 4, the apparatus for determining a microblog hot topic according to this embodiment comprises:
A microblog acquiring unit 410, configured to obtain the microblogs of opinion leaders;
A sentence splitting unit 420, configured to split the microblogs into sentences to extract effective sentences, and to replace the effective sentences with shorter effective sentences of similar semantics, so as to obtain new microblogs that form a new microblog set;
A hot topic determining unit 430, configured to cluster the effective sentences of the new microblogs in the new microblog set to determine microblog hot topics.
Further, the microblog acquiring unit 410 is configured to: obtain the microblogs published by opinion leaders using a focused web crawler, or obtain the microblogs published by opinion leaders in follow mode.
Further, the sentence splitting unit 420 is configured to:
Match the effective sentences in the microblog sentence set, and replace the longer effective sentence of each pair of matched similar effective sentences with the shorter effective sentence.
Further, the sentence splitting unit 420 is configured to:
Add wildcards before and after each word in the shorter effective sentence to form a matching condition, and determine whether the longer effective sentence satisfies the matching condition.
Further, the sentence splitting unit 420 is configured to: traverse the microblog set and match all effective sentences in the microblog set pairwise.
Further, the hot topic determining unit 430 is configured to:
Traverse each new microblog in the new microblog set, assign each new microblog a sequential number as its root, and label the effective sentences contained in each new microblog according to the microblog roots: if an effective sentence appears for the first time, its label is set to the root of the new microblog in which it appears; otherwise its label is set to the root of the first microblog in which it appeared;
Determine the category of each new microblog according to its root and its effective sentence labels, the categories including at least ancestor microblog and subordinate microblog, where an ancestor microblog is a new microblog in which all effective sentences appear for the first time, and a subordinate microblog is a new microblog whose effective sentence labels, other than its own root, all belong to the root of the same ancestor microblog or to the roots of subordinate microblogs under that ancestor microblog;
Look up the ancestor microblog of each subordinate microblog, and merge the original microblogs corresponding to the new microblogs that share the same ancestor microblog to determine a microblog hot topic.
Further, the categories also include noise microblog, a noise microblog being a new microblog that is subordinate to different ancestor microblogs;
The hot topic determining unit 430 is configured to:
If the labels of all effective sentences in a new microblog are identical and equal to the root of the new microblog, determine that the new microblog is an ancestor microblog;
If, besides the root of the new microblog, the effective sentence labels of the new microblog include one other effective sentence label, determine that the new microblog is a subordinate microblog, subordinate to the new microblog whose root is that effective sentence label;
If, besides the root of the new microblog, the effective sentence labels of the new microblog include at least two different sentence labels, and the new microblogs whose roots are those sentence labels are subordinate microblogs of the same ancestor microblog, determine that the new microblog is a subordinate microblog of that same ancestor microblog; otherwise determine that the new microblog is a noise microblog.
Further, the apparatus also comprises a topic title determining unit (not shown in Fig. 4), configured to, after the effective sentences of the new microblogs in the new microblog set have been clustered to determine microblog hot topics, perform sentence frequency statistics on the microblogs related to each determined hot topic and use the effective sentence with the highest frequency of occurrence as the title of the hot topic.
Further, the apparatus also comprises a topic ranking unit (not shown in Fig. 4), configured to, after the effective sentences of the new microblogs in the new microblog set have been clustered to determine microblog hot topics,
Count the microblogs related to each determined hot topic, determine the popularity of each hot topic from the count, and rank the hot topics by popularity.
The apparatus for determining a microblog hot topic provided by this embodiment can perform the methods for determining a microblog hot topic provided by Embodiments One, Two and Three of the present invention, and has the functional modules and beneficial effects corresponding to the methods performed.
All or part of the technical solutions provided by the above embodiments can be implemented by software programming, with the software program stored in a readable storage medium, for example a hard disk, optical disc or floppy disk in a computer.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them; without departing from the concept of the present invention, it may include other equivalent embodiments, and its scope is determined by the appended claims.

Claims (10)

1. A method for determining a microblog hot topic, characterized by comprising:
Obtaining the microblogs of an opinion leader;
Splitting the microblogs into sentences to extract effective sentences, and replacing the effective sentences with shorter effective sentences of similar semantics, to obtain new microblogs that form a new microblog set;
Clustering the effective sentences of the new microblogs in the new microblog set to determine the microblog hot topic.
2. The method for determining a microblog hot topic according to claim 1, characterized in that obtaining the microblogs of an opinion leader comprises: obtaining the microblogs published by the opinion leader using a focused web crawler, or obtaining the microblogs published by the opinion leader in follow mode.
3. The method for determining a microblog hot topic according to claim 1, characterized in that replacing the effective sentences with shorter effective sentences of similar semantics comprises:
Performing regular-expression matching on the effective sentences in the microblog sentence set, and replacing the longer effective sentence of each pair of matched similar effective sentences with the shorter effective sentence.
4. The method for determining a microblog hot topic according to claim 3, characterized in that performing regular-expression matching on the effective sentences in the microblog sentence set comprises:
Adding wildcards before and after each word in the shorter effective sentence to form a matching condition, and determining whether the longer effective sentence satisfies the matching condition.
5. The method for determining a microblog hot topic according to claim 3, characterized in that matching the effective sentences in the microblog sentence set comprises: traversing the microblog set and matching all effective sentences in the microblog set pairwise.
6. The method for determining a microblog hot topic according to claim 1, characterized in that clustering the effective sentences of the new microblogs in the new microblog set to determine the microblog hot topic comprises:
Traversing each new microblog in the new microblog set, assigning each new microblog a sequential number as its root, and labeling the effective sentences contained in each new microblog according to the microblog roots: if an effective sentence appears for the first time, its label is set to the root of the new microblog in which it appears; otherwise its label is set to the root of the first microblog in which it appeared;
Determining the category of each new microblog according to its root and its effective sentence labels, the categories including at least ancestor microblog and subordinate microblog, where an ancestor microblog is a new microblog in which all effective sentences appear for the first time, and a subordinate microblog is a new microblog whose effective sentence labels, other than its own root, all belong to the root of the same ancestor microblog or to the roots of subordinate microblogs under that ancestor microblog;
Looking up the ancestor microblog of each subordinate microblog, and merging the original microblogs corresponding to the new microblogs that share the same ancestor microblog to determine the microblog hot topic.
7. The method for determining a microblog hot topic according to claim 6, characterized in that the categories also include noise microblog, a noise microblog being a new microblog that is subordinate to different ancestor microblogs;
Determining the category of each new microblog according to its root and its effective sentence labels comprises:
If the labels of all effective sentences in a new microblog are identical and equal to the root of the new microblog, determining that the new microblog is an ancestor microblog;
If, besides the root of the new microblog, the effective sentence labels of the new microblog include one other effective sentence label, determining that the new microblog is a subordinate microblog, subordinate to the new microblog whose root is that effective sentence label;
If, besides the root of the new microblog, the effective sentence labels of the new microblog include at least two different sentence labels, and the new microblogs whose roots are those sentence labels are subordinate microblogs of the same ancestor microblog, determining that the new microblog is a subordinate microblog of that same ancestor microblog; otherwise determining that the new microblog is a noise microblog.
8. The method for determining a microblog hot topic according to claim 1, characterized in that, after clustering the effective sentences of the new microblogs in the new microblog set to determine the microblog hot topic, the method further comprises:
Performing sentence frequency statistics on the new microblogs related to each determined hot topic, and using the effective sentence with the highest frequency of occurrence as the title of the hot topic.
9. The method for determining a microblog hot topic according to claim 1, characterized in that, after clustering the effective sentences of the new microblogs in the new microblog set to determine the microblog hot topic, the method further comprises:
Counting the new microblogs related to each determined hot topic, determining the popularity of each hot topic from the count, and ranking the hot topics by popularity.
10. An apparatus for determining a microblog hot topic, characterized by comprising:
A microblog acquiring unit, configured to obtain the microblogs of an opinion leader;
A sentence splitting unit, configured to split the microblogs into sentences to extract effective sentences, and to replace the effective sentences with shorter effective sentences of similar semantics, so as to obtain new microblogs that form a new microblog set;
A hot topic determining unit, configured to cluster the effective sentences of the new microblogs in the new microblog set to determine the microblog hot topic.
CN201510591206.0A 2015-09-16 2015-09-16 Method and apparatus for determining microblog hot topic Pending CN105159882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510591206.0A CN105159882A (en) 2015-09-16 2015-09-16 Method and apparatus for determining microblog hot topic

Publications (1)

Publication Number Publication Date
CN105159882A true CN105159882A (en) 2015-12-16

Family

ID=54800741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510591206.0A Pending CN105159882A (en) 2015-09-16 2015-09-16 Method and apparatus for determining microblog hot topic

Country Status (1)

Country Link
CN (1) CN105159882A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100191773A1 (en) * 2009-01-27 2010-07-29 Palo Alto Research Center Incorporated System And Method For Providing Default Hierarchical Training For Social Indexing
CN102945290A (en) * 2012-12-03 2013-02-27 北京奇虎科技有限公司 Hot microblog topic digging device and method
CN102982157A (en) * 2012-12-03 2013-03-20 北京奇虎科技有限公司 Device and method used for mining microblog hot topics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHUANFENG ZHOU 等: "Hot Topics Extraction from Chinese Micro-blog Based on Sentence", 《UIC-ATC-SCALCOM-CBDCOM-IOP 2015》 *
谷保平 等: "热点特征深挖下的高效微博热门话题预测", 《科技通报》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109739975A (en) * 2018-11-15 2019-05-10 东软集团股份有限公司 Focus incident abstracting method, device, readable storage medium storing program for executing and electronic equipment
CN109739975B (en) * 2018-11-15 2021-03-09 东软集团股份有限公司 Hot event extraction method and device, readable storage medium and electronic equipment
CN109635286A (en) * 2018-11-26 2019-04-16 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of policy analysis of central issue
CN109635286B (en) * 2018-11-26 2022-04-12 平安科技(深圳)有限公司 Policy hotspot analysis method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151216

RJ01 Rejection of invention patent application after publication