CN105335347A - Method and device for determining emotion and reason thereof for specific topic - Google Patents

Method and device for determining emotion and reason thereof for specific topic

Info

Publication number
CN105335347A
Authority
CN
China
Prior art keywords
mood
word
theme
descriptor
reason
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410239139.1A
Other languages
Chinese (zh)
Inventor
宋双永
孟遥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201410239139.1A priority Critical patent/CN105335347A/en
Publication of CN105335347A publication Critical patent/CN105335347A/en
Pending legal-status Critical Current

Abstract

The present invention discloses a method and device for determining an emotion and a reason thereof for a specific topic. The method for determining an emotion and a reason thereof for a specific topic of the present invention comprises: collecting multiple documents for a specific topic; setting a topic model, so that the number of topics is equal to a predetermined number of emotional types, wherein topic words of each topic comprise an emotional word of a corresponding emotion, and an emotional word of each emotion appears only in topic words of a corresponding topic; analyzing the multiple documents by using the set topic model, so as to obtain the predetermined number of topics and the topic words of each topic; and from the topic words of each topic, determining a reason associated with the emotion corresponding to the topic.

Description

Method and apparatus for determining a mood and the reason thereof for a specific topic
Technical field
The present invention relates generally to the field of information processing. Specifically, it relates to a method and apparatus for analyzing the moods expressed in multiple documents on a specific topic and the reasons for those moods.
Background
In recent years, with the development of the Internet, people have been posting more and more content online, including both information about their personal lives and comments on hot events, products/services, film and television stars, and the like.
The corresponding platforms include blogs, microblogs, bulletin board systems (BBS), forums, and so on.
If put to good use, such information can help enterprises understand how the public views their products/services, and can help governments or institutions analyze public opinion.
People's comments are usually targeted: they express an emotional tendency, that is, a mood, toward a specific topic. Another important piece of mood-related information is the reason for the mood. It is therefore desirable to accurately identify both the mood toward a specific topic and the reason for that mood.
Traditional mood-reason recognition techniques can be divided into sentence-level recognition and document-collection-level recognition. Sentence-level recognition extracts, from a single sentence, the word expressing the user's mood and the clause describing the reason for that mood. For example, in the sentence "It is raining again, how annoying", "annoying" expresses the user's annoyance, and the reason for this mood is "it is raining again".
Document-collection-level recognition detects, from a large amount of text, the various reasons why users have a certain mood, and ranks the reasons by some criterion to find the main reason behind the group mood. For example, if in a text data set about a certain mobile phone brand the word that most often co-occurs with "angry" and "indignant" is "crash", then the phone's crashing problem is evidently what makes users indignant, and the manufacturer should attend to this problem in its phone design in order to improve user experience and reviews.
Traditional methods analyze at the word level: they characterize mood by mood words and use word-based statistics such as word frequency and co-occurrence frequency to find possible reasons associated with the mood words. What such methods mine is usually the explicit association between mood words and mood reasons. They cannot find implicit associations between moods and mood reasons, nor can they resolve the correspondence between one mood word and multiple mood reasons within a single sentence.
A method and apparatus that can accurately determine the mood toward a specific topic and the reason thereof are therefore desired.
Summary of the invention
A brief summary of the present invention is given below to provide a basic understanding of certain aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that follows.
In view of the above problems of the prior art, an object of the present invention is to provide a method and apparatus that can accurately determine the mood toward a specific topic and the reason thereof.
To achieve the above object, according to one aspect of the present invention, there is provided a method for determining the mood toward a specific topic and the reason thereof, the method comprising: collecting multiple documents on the specific topic; setting a topic model such that the number of themes equals a predetermined number of mood categories, the descriptors of each theme include mood words of one corresponding mood, and the mood words of each mood appear only in the descriptors of the one corresponding theme; analyzing the multiple documents with the set topic model to obtain the predetermined number of themes and the descriptors of each theme; and determining, from the descriptors of each theme, the reason associated with the mood corresponding to that theme.
According to another aspect of the present invention, there is provided an apparatus for determining the mood toward a specific topic and the reason thereof, the apparatus comprising: a collecting device configured to collect multiple documents on the specific topic; a setting device configured to set a topic model such that the number of themes equals a predetermined number of mood categories, the descriptors of each theme include mood words of one corresponding mood, and the mood words of each mood appear only in the descriptors of the one corresponding theme; the set topic model, for analyzing the multiple documents to obtain the predetermined number of themes and the descriptors of each theme; and a determining device configured to determine, from the descriptors of each theme, the reason associated with the mood corresponding to that theme.
In addition, according to a further aspect of the present invention, a storage medium is provided. The storage medium contains machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the above method according to the present invention.
In addition, according to yet another aspect of the present invention, a program product is provided. The program product contains machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the above method according to the present invention.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will be more readily understood from the following description of embodiments of the invention with reference to the accompanying drawings. The components in the drawings merely illustrate the principles of the invention. In the drawings, identical or similar technical features or components are denoted by identical or similar reference numerals. In the drawings:
Fig. 1 is a flowchart of a method for determining the mood toward a specific topic and the reason thereof according to an embodiment of the present invention;
Fig. 2 shows the sub-steps of step S4;
Fig. 3 shows the sub-steps of step S42;
Fig. 4 is a block diagram of an apparatus for determining the mood toward a specific topic and the reason thereof according to an embodiment of the present invention;
Fig. 5 shows a theme-supervised LDA model as an example of the topic model; and
Fig. 6 is a schematic block diagram of a computer that can be used to implement the method and apparatus according to embodiments of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that many implementation-specific decisions must be made in developing any such actual implementation in order to achieve the developer's specific goals, such as compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. Moreover, it should be appreciated that although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the apparatus structures and/or processing steps closely related to the solution according to the invention and omit other details of little relevance. It should further be pointed out that elements and features described in one drawing or embodiment of the invention may be combined with elements and features shown in one or more other drawings or embodiments.
The method and apparatus according to the present invention can analyze a large number of documents on a specific topic and determine the mood toward that topic and the reason for the mood. The main approach is to use a topic model to perform the analysis at the level of moods rather than at the level of mood words.
The flow of the method for determining the mood toward a specific topic and the reason thereof according to an embodiment of the present invention is described below with reference to Fig. 1.
Fig. 1 is a flowchart of the method for determining the mood toward a specific topic and the reason thereof according to an embodiment of the present invention. As shown in Fig. 1, the method 100 comprises the following steps: collecting multiple documents on the specific topic (step S1); setting a topic model such that the number of themes equals the predetermined number of mood categories, the descriptors of each theme include mood words of one corresponding mood, and the mood words of each mood appear only in the descriptors of the one corresponding theme (step S2); analyzing the multiple documents with the set topic model to obtain the predetermined number of themes and the descriptors of each theme (step S3); and determining, from the descriptors of each theme, the reason associated with the mood corresponding to that theme (step S4).
In step S1, multiple documents on the specific topic are collected.
As mentioned above, the present invention aims to mine people's moods and the reasons for them, so the object of the mood must also be made clear. The invention is carried out for a specific topic and analyzes people's moods toward that topic and the reasons for them.
A topic may be anything that can attract user attention and comment, such as a hot event occurring during a certain period, the latest and most popular films and television dramas, or newly released electronic products.
Moods and reasons are judged on the basis of documents. Documents are, for example, blog posts, microblog posts, forum posts, and BBS posts.
After the multiple documents on the specific topic are collected, preprocessing and word segmentation are performed to remove useless information and unify the format of the data to be processed.
The specific preprocessing is adjusted flexibly according to the type of document being preprocessed.
For example, for microblog-type documents, preprocessing generally includes: removing URLs (Uniform Resource Locators), removing repost text/symbols, removing the "@" symbol and the user name that follows it, removing pictures, and separating out emoticons as single words.
The main purpose of preprocessing is to remove information irrelevant to judging the mood and its reason, so that irrelevant information does not interfere with the judgment and relevant information stands out.
For example, URLs and pictures are generally unrelated to mood, and the present invention processes text and does not mine information from pictures, so they are deleted.
Reposting is very common on microblogs. The microblog system adds text/symbols such as "repost" to such posts; these do not help in judging the mood and its reason, so they are deleted.
Microblog users often mention other users in the form "@ + user name". Such a mention merely passes information along and is unrelated to mood, so it is deleted.
In addition, the emoticons in a microblog post are separated out and treated as single words.
Microblog users often add emoticons to their posts. Some emoticons are unrelated to mood or cannot be mapped directly to a particular mood, for example "[grimacing]", "[just]", and "[human skeleton]". Other emoticons can reflect the user's mood, for example "giggle", "heartily", and "sadness".
As will be seen below, the present invention can also map emoticons to moods. Therefore, in preprocessing, these emoticons are first treated as single words.
Some preprocessing operations have been listed above using microblogs as an example, but the present invention is not limited thereto.
Those skilled in the art can design similar or other preprocessing for other types of content to be processed.
After preprocessing, Chinese word segmentation is performed on the remaining content.
Word segmentation is the process of splitting the text of a sentence or paragraph into multiple words. It is a commonly used technique in natural language processing and is not described further here.
After word segmentation, a document such as a microblog post is converted into a set of words. The subsequent analysis operates on such sets of words.
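As an illustration only, the microblog preprocessing of step S1 could be sketched in Python as follows. The regular expressions, the repost markers, and the function name `preprocess` are assumptions introduced for this sketch and are not taken from the specification. A real system would replace the final tokenization with a Chinese word segmenter; here a simple pattern stands in for it.

```python
import re

# All patterns and markers below are illustrative assumptions.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\S+")  # "@" plus the user name that follows it
REPOST_MARKERS = ("Repost", "转发微博")  # hypothetical repost text
# A bracketed emoticon code is one token; otherwise split on whitespace.
TOKEN_RE = re.compile(r"\[[^\[\]]+\]|[^\s\[]+")


def preprocess(post):
    """Strip mood-irrelevant content and return the post as a list of words."""
    text = URL_RE.sub(" ", post)
    text = MENTION_RE.sub(" ", text)
    for marker in REPOST_MARKERS:
        text = text.replace(marker, " ")
    # Stand-in for word segmentation: emoticons survive as single words.
    return TOKEN_RE.findall(text)


tokens = preprocess("Repost @alice the phone crashes again[tear] http://t.cn/abc")
```

Treating the bracketed emoticon as its own token is what later allows emoticons to appear among the descriptors of a theme.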
In step S2, a topic model is set such that the number of themes equals the predetermined number of mood categories, the descriptors of each theme include mood words of one corresponding mood, and the mood words of each mood appear only in the descriptors of the one corresponding theme.
The basic idea of the present invention is to use a topic model for mood analysis, treating moods as themes.
The present invention defines a user mood as an abstract theme. This is because a mood cannot be fully described simply by one or a few mood words; it also encompasses user emoticons, implicit ways of expressing emotion, and so on. These are all manifestations of the user's mood and should not be ignored.
By defining moods as themes, a topic model can be used to analyze mood themes.
Topic models include, but are not limited to, LDA (Latent Dirichlet Allocation), LSA (Latent Semantic Analysis), and PLSA (Probabilistic Latent Semantic Analysis).
Given input text and a value for the number of themes, a topic model analyzes the text and outputs that number of themes together with the descriptors characterizing each theme. That is, the topic model outputs as many themes as the input theme count specifies, and each theme corresponds to a group of descriptors.
Unlike the conventional art, the method of the present invention is not confined to analysis at the mood-word level; it analyzes at the mood level and treats moods as themes.
Specifically, as many themes are set as there are specified moods, and the themes are set to be mood themes. Each theme output by the topic model is then in effect a mood theme corresponding to one mood, and the mood words characterizing that mood theme should be included in its descriptors.
Therefore, a mood dictionary is obtained in advance, before the topic model is used for analysis. The mood dictionary specifies the number of mood categories and the multiple mood words corresponding to each mood.
For example, a kind of mood is " happiness ", and the mood word of its correspondence comprises " pleasantly surprised ", " happiness ", " happy ", " being glad and thankful ".Another kind of mood is " admiration ", and the mood word of its correspondence comprises " admiration ", " revering ", " respect ", " admiring ", " admiring ", " respect ", " respect ".Another mood is " sadness ", and the mood word of its correspondence comprises " crying ", " grief ", " sigh ", " grief ", " sadness ", " profound grief ", " sentiment ", " sad ".
In step S2, the topic model is set according to the mood dictionary.
Specifically, as described above, the topic model is set such that the number of themes equals the predetermined number of mood categories.
In addition, the topic model is set such that the descriptors of each theme include mood words of one corresponding mood and the mood words of each mood appear only in the descriptors of the one corresponding theme.
That is, each theme corresponds to one mood: the descriptors of a theme contain mood words of only its corresponding mood and none of any other mood, and the mood words of that mood do not appear in the descriptors of any other theme.
Specifically, the topic model is set such that the probability of the mood words of the one corresponding mood contained in each theme's descriptors appearing in the descriptors of any other theme is zero.
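The zero-probability constraint above amounts to a binary mask over (word, theme) pairs derived from the mood dictionary. A minimal Python sketch follows; the function name `build_mask` and the dictionary layout (mood name mapped to a list of mood words) are assumptions made for illustration.

```python
def build_mask(vocab, mood_dict):
    """Return (moods, mask) where mask[w][t] is 1 if word w may appear in theme t.

    Theme t corresponds to moods[t]. A mood word is allowed only in the theme
    of its own mood; every other word is allowed in every theme.
    """
    moods = list(mood_dict)
    owner = {w: t for t, mood in enumerate(moods) for w in mood_dict[mood]}
    mask = []
    for word in vocab:
        if word in owner:
            mask.append([1 if t == owner[word] else 0 for t in range(len(moods))])
        else:
            mask.append([1] * len(moods))
    return moods, mask


moods, mask = build_mask(
    ["happy", "grief", "crash"],
    {"happiness": ["happy"], "sadness": ["grief"]},
)
```

In this toy example the number of themes is two because the hypothetical dictionary lists two moods, and "crash", which is not a mood word, remains free to appear under either theme.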
Fig. 5 shows a theme-supervised LDA model as an example of the topic model.
The LDA model defines each text document as a mixture of multiple themes, and each theme is in turn expressed by different words (descriptors) according to a certain probability distribution. In Fig. 5, e is the binary-valued distribution of mood words over mood themes obtained from the mood dictionary. Accordingly, e is used to adjust the distribution of mood words over mood themes obtained by the LDA model: the probability of a mood word appearing in any theme other than the theme of its own mood category is defined as 0, while the weight of the mood word within its corresponding mood theme is obtained through the iterations of the model. φ and θ denote the "theme-word" distribution and the "document-theme" distribution, respectively, and α and β are calculation factors derived from the Dirichlet function. φ is calculated as shown in formula (1):
$$\phi'^{(j)}_{i} = \frac{C^{WT}_{ij} + \beta}{\sum_{k=1}^{W} C^{WT}_{kj} + W\beta} \qquad (1)$$
Here, $C^{WT}$ is the "theme-word" matrix, W is the number of words, T is the number of themes (the number of mood categories), and $C^{WT}_{ij}$ is the number of times word i is assigned to theme j. The first $C^{WT}$ is obtained by assuming that words are uniformly distributed over themes. Each subsequent $C^{WT}$ is obtained by Gibbs sampling of the words under each theme according to the "theme-word" distribution probability matrix φ obtained in the preceding iteration, so that $C^{WT}$ and φ are computed in alternation.
Similarly, θ is calculated as shown in formula (2):
$$\theta'^{(d)}_{j} = \frac{C^{DT}_{dj} + \alpha}{\sum_{k=1}^{T} C^{DT}_{dk} + T\alpha} \qquad (2)$$
Here, $C^{DT}$ is the "document-theme" matrix, D is the number of documents, T is the number of themes (the number of mood categories), and $C^{DT}_{dj}$ is the number of times document d is assigned to theme j. The first $C^{DT}$ is obtained by assuming that documents are uniformly distributed over themes. Each subsequent $C^{DT}$ is obtained by Gibbs sampling of the documents under each theme according to the "document-theme" distribution probability matrix θ obtained in the preceding iteration, so that $C^{DT}$ and θ are computed in alternation.
In the above formulas, φ and θ denote the results output by the previous iteration, and θ' and φ' denote the results (matrices) computed in the current iteration, of which $\phi'^{(j)}_{i}$ and $\theta'^{(d)}_{j}$ are elements. The matrices C are obtained from φ and θ, θ' and φ' are computed from the matrices C and become the new φ and θ, and the cycle repeats until convergence.
The φ and θ finally obtained represent the probability distributions of "theme-word" and "document-theme", respectively.
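To make formulas (1) and (2) concrete, the following Python sketch computes φ' and θ' from given count matrices. It covers only the normalization step, not the Gibbs sampling that produces the counts. The function names are hypothetical, and the optional mask argument, together with the renormalization after masking, is an assumption of this sketch about how the zero-probability constraint might be applied.

```python
def phi_from_counts(c_wt, beta, mask=None):
    """Formula (1): phi[i][j] = (C_ij + beta) / (sum_k C_kj + W * beta)."""
    W, T = len(c_wt), len(c_wt[0])
    col_sums = [sum(c_wt[k][j] for k in range(W)) for j in range(T)]
    phi = [[(c_wt[i][j] + beta) / (col_sums[j] + W * beta) for j in range(T)]
           for i in range(W)]
    if mask is not None:
        # Assumed handling: zero out forbidden (word, theme) pairs, then
        # renormalize each theme so its word probabilities sum to 1.
        for j in range(T):
            for i in range(W):
                phi[i][j] *= mask[i][j]
            total = sum(phi[i][j] for i in range(W))
            for i in range(W):
                phi[i][j] /= total
    return phi


def theta_from_counts(c_dt, alpha):
    """Formula (2): theta[d][j] = (C_dj + alpha) / (sum_k C_dk + T * alpha)."""
    T = len(c_dt[0])
    return [[(row[j] + alpha) / (sum(row) + T * alpha) for j in range(T)]
            for row in c_dt]


phi = phi_from_counts([[2, 0], [0, 3], [1, 1]], beta=0.5,
                      mask=[[1, 0], [0, 1], [1, 1]])
theta = theta_from_counts([[3, 1]], alpha=1.0)
```

With the toy counts above, word 1 (a mood word of theme 1 under the mask) receives probability zero under theme 0, while each column of `phi` and each row of `theta` still sums to one.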
Each theme can therefore be regarded as an abstract representation of a specific mood, formed by combining different words (descriptors) and emoticons (treated as single words) with different probabilities.
In Fig. 5, $\theta^{(d)}$ is θ, written so as to indicate its dependence on document d, and $\phi^{(t)}$ is φ, written so as to indicate its dependence on theme t. $N_d$ denotes the set of all words in document d.
In step S3, the topic model set in step S2 is used to analyze the multiple documents. The theme corresponding to each document can be determined from the document-theme probability distribution, and the descriptors of each theme can be determined from the theme-word probability distribution. The number of themes equals the number of mood categories, the preset mood categories and the themes generated by the topic model are in one-to-one correspondence, and the descriptors of each theme include at least some of the mood words of the corresponding mood category.
Next, the reason corresponding to each mood is determined.
In step S4, the reason associated with the mood corresponding to each theme is determined from the descriptors of that theme.
Fig. 2 shows the sub-steps of step S4.
Descriptors characterize a theme. The descriptors other than the mood words that express the mood are therefore closely related to the mood theme, and the reason for the mood is likely to be found among them.
Because the probability of a descriptor other than the mood words of the corresponding mood reflects how strongly that word correlates with the mood theme, a given number of words with the highest probabilities, excluding the corresponding mood words, are extracted from the descriptors of the mood theme as candidate words (step S41). These candidate words are likely to represent the reason for the mood.
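Step S41 can be illustrated with a short Python sketch that ranks the words of one theme by probability and skips the mood words. The function name and the sample vocabulary and probabilities are made up for this example.

```python
def candidate_words(theme_probs, vocab, mood_words, k):
    """Return the k most probable words of one theme, excluding mood words."""
    ranked = sorted(
        ((p, w) for p, w in zip(theme_probs, vocab) if w not in mood_words),
        key=lambda pw: pw[0],
        reverse=True,
    )
    return [w for _, w in ranked[:k]]


candidates = candidate_words(
    [0.4, 0.3, 0.1, 0.2],
    ["angry", "crash", "battery", "screen"],
    mood_words={"angry"},
    k=2,
)
```

The value of k corresponds to the "given number" in the text and would be chosen by the implementer.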
However, because of the word segmentation performed earlier, each candidate word is a single word. The reason for a mood is sometimes expressed not by a single word but by a phrase composed of multiple words.
The candidate words therefore need to be integrated (step S42). The criterion for integration is mutual information.
Fig. 3 shows the sub-steps of step S42.
Specifically, the candidate words are first combined in every possible way to obtain n-grams (n-word phrases), where n is a positive integer greater than 1 (step S421). Only combinations that occur in the multiple documents are retained; that is, the n-grams obtained are those that are formed from candidate words and occur in the multiple documents.
Preferred values of n are 2 and 3.
Then, the n-grams that satisfy a predetermined condition are selected as the reasons associated with the mood corresponding to the theme (step S422).
The predetermined condition includes: all pairwise (binary) mutual information values of the n-gram are greater than a predetermined threshold.
The pairwise mutual information of an n-gram refers to the mutual information between words adjacent to each other in the n-gram.
Pairwise mutual information reflects how tightly two adjacent words are bound together. If all pairwise mutual information values of an n-gram exceed the predetermined threshold, the n-gram should therefore be regarded as a single unit.
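Steps S421 and S422 might be sketched in Python as below. The specification does not give a formula for mutual information, so this sketch assumes pointwise mutual information, log(p(x, y) / (p(x) p(y))), estimated from unigram and adjacent-pair counts over the segmented documents. It also assumes that an n-gram "occurs" in a document when its words appear contiguously. The function name and parameters are hypothetical.

```python
import math
from collections import Counter


def select_ngrams(docs, candidates, threshold, sizes=(2, 3)):
    """Count n-grams of candidate words whose adjacent pairs all exceed threshold."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(x, y):
        # Assumed estimator: pointwise mutual information of an adjacent pair.
        p_xy = bigrams[(x, y)] / n_bi
        return math.log(p_xy / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))

    selected = Counter()
    for doc in docs:
        for n in sizes:
            for i in range(len(doc) - n + 1):
                gram = tuple(doc[i:i + n])
                if all(w in candidates for w in gram) and all(
                        pmi(x, y) > threshold for x, y in zip(gram, gram[1:])):
                    selected[gram] += 1
    return selected


docs = [
    ["phone", "very", "frequently", "crashes"],
    ["phone", "frequently", "crashes"],
    ["phone", "is", "fine"],
]
selected = select_ngrams(docs, {"very", "frequently", "crashes"}, threshold=0.0)
```

The returned counter maps each retained n-gram to the number of times it occurs, which is the occurrence count a later merging step would need.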
In addition, the n-grams selected through steps S421 and S422 may be merged to remove the redundancy caused by n-grams that contain one another.
For example, the trigram "very frequently crashes" contains the bigram "frequently crashes".
Because the judgment is based on pairwise mutual information, both the trigram "very frequently crashes" and the bigram "frequently crashes" may be regarded as phrases indicating a mood reason. However, one contains the other, and outputting both would be repetitive and redundant.
Therefore, n-grams that satisfy the predetermined condition and stand in a containment relation are merged (step S423).
Merging is performed, for example, on the basis of phrase length and number of occurrences.
In general, the longer a phrase is and the more often it occurs, the better.
As an example, the evaluation value $T_{value}$ of a phrase to be merged is calculated according to the following formula (3):
$$T_{value} = T_{frequency} \times T_{length} \qquad (3)$$
where $T_{length}$ is the length of the phrase, that is, the number of words it contains, and $T_{frequency}$ is the number of times the phrase occurs in the multiple documents.
The candidate phrases can be merged using this $T_{value}$.
Take a trigram and a bigram as an example. If a trigram contains a bigram and the $T_{value}$ of the trigram is greater than that of the bigram, the bigram is merged (discarded). Otherwise, the trigram is deleted.
Those skilled in the art will understand that formula (3) is merely an example. The evaluation value of a phrase to be merged may also be calculated from the phrase length and the number of occurrences in various other suitable ways, such as arithmetic addition or weighted addition.
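The merging rule of step S423 with formula (3) can be sketched in Python as follows. Phrases are represented as tuples of words mapped to their occurrence counts; this representation and the function name `merge_phrases` are assumptions of the sketch.

```python
def merge_phrases(counts):
    """Merge phrases in a containment relation using T_value = frequency * length.

    counts maps each phrase (a tuple of words) to its number of occurrences.
    """
    def t_value(phrase):
        return counts[phrase] * len(phrase)

    def contains(longer, shorter):
        n = len(shorter)
        return len(longer) > n and any(
            longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

    removed = set()
    for long_p in counts:
        for short_p in counts:
            if contains(long_p, short_p):
                if t_value(long_p) > t_value(short_p):
                    removed.add(short_p)  # the shorter phrase is merged away
                else:
                    removed.add(long_p)   # otherwise the longer one is deleted
    return [p for p in counts if p not in removed]


kept = merge_phrases({
    ("frequently", "crashes"): 5,
    ("very", "frequently", "crashes"): 4,
})
```

In this toy input the trigram scores 4 × 3 = 12 against 5 × 2 = 10 for the bigram, so only the trigram is kept. Swapping in an additive or weighted score would only require changing `t_value`.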
It will be appreciated that step S423 is a preferred step.
Through the processing of steps S421 to S423, the reasons for the moods are obtained.
For example, the moods and reasons obtained by analyzing multiple documents may be as shown in the table below.
Mood | Mood reasons
Admiration | Nuclear power station staff; Self-Defense Forces; rescue team members
Sadness | Adoption of post-earthquake orphans; the Libyan war
Sympathy | Nearby residents; evacuation exits; radioactive radiation
Table 1
In addition, the method of the present invention can also determine the relation between emoticons and moods.
As mentioned above, documents also contain emoticons, and each emoticon is treated as a separate word. Some emoticons correspond explicitly to a mood, but there are also implicit relations that cannot be determined directly. With the method of the present invention, the correspondence between emoticons and moods can be compiled statistically for reference.
Specifically, the relation between an emoticon and the mood associated with a theme is judged according to the frequency with which the emoticon appears as a descriptor of that theme.
That is, the emoticons that appear with high frequency under each mood theme are found, and the mood corresponding to the mood theme is considered to correspond to the emoticons that appear with high frequency under it.
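One possible reading of this step is sketched below in Python. It assumes that emoticons are the tokens written in square brackets and uses the theme-word probability as a stand-in for the frequency with which an emoticon appears under a theme. Both choices, as well as the function name and sample data, are assumptions of this sketch.

```python
def emoticons_by_mood(phi, vocab, moods, k=5):
    """For each mood theme, list the k highest-scoring bracketed emoticons.

    phi[i][j] is the score of word vocab[i] under theme j (mood moods[j]).
    """
    result = {}
    for j, mood in enumerate(moods):
        scored = [(phi[i][j], w) for i, w in enumerate(vocab)
                  if w.startswith("[") and w.endswith("]")]
        scored.sort(key=lambda pw: pw[0], reverse=True)
        result[mood] = [w for _, w in scored[:k]]
    return result


table = emoticons_by_mood(
    [[0.1, 0.5], [0.6, 0.2], [0.3, 0.3]],
    ["[tear]", "crash", "[candle]"],
    ["anger", "sympathy"],
    k=1,
)
```

The resulting mapping has the same shape as a mood-to-emoticon table: each mood is paired with the emoticons most strongly associated with its theme.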
An example result is shown in the table below.
Mood | Emoticons
Admiration | [lovely], [fist], [gentle breeze], [din pushes away and hits], [having prostrated oneself]
Sadness | [cursing in rage], [indignation], [g shock], [g shakes the hand], [good Embarrassing]
Sympathy | [sadness], [candle], [shaking hands], [tear], [sick]
Table 2
As Table 2 shows, the association between emoticons and user moods differs somewhat from common understanding. For example, emoticons such as [cursing in rage] and [indignation] can also convey sadness, and the [sadness] emoticon can also express sympathy. Using this result, when a user's post contains no obvious mood word, the user's mood can be inferred from the emoticons in the post.
An apparatus for determining the mood toward a specific topic and the reason thereof according to an embodiment of the present invention is described below with reference to Fig. 4.
Fig. 4 is a block diagram of the apparatus for determining the mood toward a specific topic and the reason thereof according to an embodiment of the present invention. As shown in Fig. 4, the determining apparatus 400 according to the present invention comprises: a collecting device 41 configured to collect multiple documents on the specific topic; a setting device 42 configured to set a topic model such that the number of themes equals the predetermined number of mood categories, the descriptors of each theme include mood words of one corresponding mood, and the mood words of each mood appear only in the descriptors of the one corresponding theme; the set topic model 43, for analyzing the multiple documents to obtain the predetermined number of themes and the descriptors of each theme; and a determining device 44 configured to determine, from the descriptors of each theme, the reason associated with the mood corresponding to that theme.
In one embodiment, the setting device 42 is further configured to set the probability of the mood words of the one corresponding mood contained in each theme's descriptors appearing in the descriptors of any other theme to zero.
In one embodiment, determining device 44 comprises: extracting unit, is configured to: extract probability in the descriptor of each theme except described mood word, that occur the highest to the candidate word of determined number; Selection unit, be configured to: the n unit word formed for that occur in described multiple document, described candidate word, based on the binary mutual information of n unit word, select the n unit word conformed to a predetermined condition, as the reason that the described mood corresponding with this theme is associated, n be greater than 1 positive integer.
In one embodiment, the binary mutual information of n unit word refers to the mutual information between binary word adjacent to each other in n unit word.Predetermined condition refers to that all binary mutual informations of n unit word are all greater than predetermined threshold.
In one embodiment, the determining device 44 further comprises: a merging unit configured to merge, among the n-grams that satisfy the predetermined condition, those n-grams between which a containment relation exists, based on the length and the number of occurrences of each n-gram.
In one embodiment, the merging unit comprises: an evaluation value calculating subunit configured to calculate an evaluation value of each n-gram based on the length and the number of occurrences of the n-gram; and a merging subunit configured to retain one of the n-grams according to the calculated evaluation values and the containment relation between the n-grams.
In one embodiment, the evaluation value calculating subunit is further configured to calculate the product of the length of the n-gram and the number of occurrences of the n-gram as the evaluation value of the n-gram.
In one embodiment, the merging subunit is further configured to: if one n-gram contains another n-gram and the evaluation value of the containing n-gram is greater than the evaluation value of the other n-gram, retain the containing n-gram and delete the other n-gram.
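The merging rule can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the length of an n-gram is taken to be its number of words, containment means a contiguous sub-sequence, and in the reverse case (the contained n-gram has the higher evaluation value) the containing n-gram is dropped. The function names `contains` and `merge_ngrams` are hypothetical.

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside the strictly longer tuple."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def merge_ngrams(counts):
    """Keep the higher-scoring member of every containment pair.

    counts maps an n-gram (tuple of words) to its number of occurrences.
    The evaluation value is length * occurrences, as in the embodiment.
    """
    score = {g: len(g) * c for g, c in counts.items()}
    dropped = set()
    for a in counts:
        for b in counts:
            if contains(a, b):
                if score[a] > score[b]:
                    dropped.add(b)   # rule stated in the embodiment
                elif score[b] > score[a]:
                    dropped.add(a)   # assumption: symmetric treatment
    return {g: score[g] for g in counts if g not in dropped}


# Toy counts (hypothetical).
counts = {("flight", "delay"): 3,
          ("flight", "delay", "refund"): 3,
          ("lost", "luggage"): 2}
merged = merge_ngrams(counts)
```

Multiplying length by occurrence count favours a longer phrase when it is roughly as frequent as its sub-phrase, which tends to keep the more specific reason, and favours the shorter phrase when the longer one is rare.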
In one embodiment, the documents contain emoticons. The apparatus further comprises: a relation judging unit configured to judge, according to the frequency with which an emoticon appears as a topic word, the relation between the emoticon and the emotion associated with the topic to which the topic word corresponds.
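One possible reading of this embodiment is sketched below: for each emoticon, compare how often it appears as a topic word under each emotion's topic and relate it to the emotion where it appears most often. The data layout (a per-emotion frequency table), the function name `emoticon_emotions`, and the bracketed emoticon tokens are all assumptions for illustration; the specification does not fix how the frequency is measured.

```python
def emoticon_emotions(topic_word_freq, emoticons):
    """Map each emoticon to (most related emotion, share of its occurrences).

    topic_word_freq maps an emotion label to {topic word: frequency}.
    Emoticons that never appear as a topic word are left out of the result.
    """
    result = {}
    for emo in emoticons:
        freqs = {emotion: words.get(emo, 0)
                 for emotion, words in topic_word_freq.items()}
        total = sum(freqs.values())
        if total == 0:
            continue
        # sorted() makes tie-breaking deterministic (alphabetical by emotion).
        best = max(sorted(freqs), key=freqs.get)
        result[emo] = (best, freqs[best] / total)
    return result


# Toy frequencies (hypothetical).
topic_word_freq = {
    "joy": {"[smile]": 8, "discount": 5},
    "anger": {"[smile]": 2, "[rage]": 6},
}
relations = emoticon_emotions(topic_word_freq, ["[smile]", "[rage]", "[cry]"])
```

The share returned alongside the emotion gives a rough indication of how exclusively the emoticon is tied to that emotion.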
In one embodiment, the determination apparatus 400 further comprises: a dictionary parsing unit configured to obtain, from an emotion dictionary obtained in advance, the predetermined number of emotion types and the emotion words corresponding to each emotion.
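A dictionary parsing step of this kind could look like the sketch below. The file format (one tab-separated "word, emotion" pair per line, with blank lines and lines starting with "#" ignored) and the function name `parse_emotion_dictionary` are assumptions; the specification does not describe the format of the emotion dictionary.

```python
def parse_emotion_dictionary(lines):
    """Return (number of emotion types, {emotion: set of emotion words})."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        word, emotion = line.split("\t")
        lexicon.setdefault(emotion, set()).add(word)
    return len(lexicon), lexicon


# Toy dictionary content (hypothetical).
lines = ["# word<TAB>emotion", "happy\tjoy", "glad\tjoy", "", "furious\tanger"]
num_emotions, lexicon = parse_emotion_dictionary(lines)
```

The returned count would then serve as the number of topics, and the per-emotion word sets as the emotion words bound to each topic.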
Since the processing performed by each device and unit included in the determination apparatus 400 according to the present invention is similar to the processing in the corresponding steps of the method 100 described above, a detailed description of these devices and units is omitted here for brevity.
In addition, it should be noted that each constituent device and unit of the above apparatus may be configured by software, firmware, hardware, or a combination thereof. The specific means or manner of configuration is well known to those skilled in the art and is not repeated here. When implemented by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 600 shown in Fig. 6), and the computer, when the various programs are installed, is capable of performing various functions.
Fig. 6 shows a schematic block diagram of a computer that can be used to implement the method and apparatus according to embodiments of the present invention.
In Fig. 6, a central processing unit (CPU) 601 performs various processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores, as needed, the data required when the CPU 601 performs the various processing. The CPU 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.
The following components are connected to the input/output interface 605: an input section 606 (including a keyboard, a mouse, and the like), an output section 607 (including a display such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker, and the like), the storage section 608 (including a hard disk and the like), and a communication section 609 (including a network interface card such as a LAN card, a modem, and the like). The communication section 609 performs communication processing via a network such as the Internet. A drive 610 may also be connected to the input/output interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory is mounted on the drive 610 as needed, so that a computer program read from it is installed into the storage section 608 as needed.
When the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 611.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 611 shown in Fig. 6, which stores the program and is distributed separately from the apparatus in order to provide the program to the user. Examples of the removable medium 611 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608, or the like, in which the program is stored and which is distributed to the user together with the apparatus containing it.
The present invention also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the above method according to embodiments of the present invention can be performed.
Accordingly, a storage medium carrying the above program product storing machine-readable instruction code is also included in the disclosure of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disc, a magneto-optical disk, a memory card, a memory stick, and the like.
In the above description of specific embodiments of the present invention, features described and/or illustrated for one embodiment may be used in the same or a similar manner in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
In addition, the method of the present invention is not limited to being performed in the chronological order described in the specification, and may also be performed in another chronological order, in parallel, or independently. Therefore, the order of execution of the method described in this specification does not limit the technical scope of the present invention.
Although the present invention has been disclosed above through the description of specific embodiments thereof, it should be understood that all of the above embodiments and examples are illustrative and not restrictive. Those skilled in the art may devise various modifications, improvements, or equivalents of the present invention within the spirit and scope of the appended claims. Such modifications, improvements, or equivalents should also be considered to fall within the scope of protection of the present invention.
Supplementary Notes
1. A method for determining an emotion and a reason thereof for a specific topic, comprising:
collecting multiple documents for the specific topic;
setting a topic model such that the number of topics equals a predetermined number of emotion types, the topic words of each topic comprise an emotion word of a corresponding emotion, and an emotion word of each emotion appears only in the topic words of the corresponding topic;
analyzing the multiple documents by using the set topic model, so as to obtain the predetermined number of topics and the topic words of each topic; and
determining, from the topic words of each topic, a reason associated with the emotion corresponding to that topic.
2. The method according to Supplementary Note 1, wherein setting the topic model comprises: setting to zero the probability that the emotion word of the corresponding emotion included in the topic words of each topic appears in the topic words of any other topic.
3. The method according to Supplementary Note 1, wherein determining, from the topic words of each topic, the reason associated with the emotion corresponding to that topic comprises:
extracting, from the topic words of each topic other than the emotion words, a given number of candidate words having the highest occurrence probability; and
selecting, from among the n-grams that occur in the multiple documents and are formed of the candidate words, the n-grams that satisfy a predetermined condition based on the pairwise (binary) mutual information of each n-gram, as the reason associated with the emotion corresponding to that topic, where n is a positive integer greater than 1.
4. The method according to Supplementary Note 3, wherein the pairwise mutual information of an n-gram refers to the mutual information between adjacent words in the n-gram, and the predetermined condition is that all pairwise mutual information values of the n-gram are greater than a predetermined threshold.
5. The method according to Supplementary Note 3 or 4, wherein determining, from the topic words of each topic, the reason associated with the emotion corresponding to that topic further comprises:
merging, among the n-grams that satisfy the predetermined condition, those n-grams between which a containment relation exists, based on the length and the number of occurrences of each n-gram.
6. The method according to Supplementary Note 5, wherein the merging comprises:
calculating an evaluation value of each n-gram based on the length and the number of occurrences of the n-gram; and
retaining one of the n-grams according to the calculated evaluation values and the containment relation between the n-grams.
7. The method according to Supplementary Note 6, wherein calculating the evaluation value of the n-gram comprises:
calculating the product of the length of the n-gram and the number of occurrences of the n-gram as the evaluation value of the n-gram.
8. The method according to Supplementary Note 6, wherein the retaining comprises:
if one n-gram contains another n-gram and the evaluation value of the containing n-gram is greater than the evaluation value of the other n-gram, retaining the containing n-gram and deleting the other n-gram.
9. The method according to Supplementary Note 1, wherein the documents contain emoticons;
the method further comprising: judging, according to the frequency with which an emoticon appears as a topic word, the relation between the emoticon and the emotion associated with the topic to which the topic word corresponds.
10. The method according to Supplementary Note 1, further comprising: obtaining, from an emotion dictionary obtained in advance, the predetermined number of emotion types and the emotion words corresponding to each emotion.
11. An apparatus for determining an emotion and a reason thereof for a specific topic, comprising:
a collecting device configured to collect multiple documents for the specific topic;
a setting device configured to set a topic model such that the number of topics equals a predetermined number of emotion types, the topic words of each topic comprise an emotion word of a corresponding emotion, and an emotion word of each emotion appears only in the topic words of the corresponding topic;
the set topic model, used to analyze the multiple documents so as to obtain the predetermined number of topics and the topic words of each topic; and
a determining device configured to determine, from the topic words of each topic, a reason associated with the emotion corresponding to that topic.
12. The apparatus according to Supplementary Note 11, wherein the setting device is further configured to set to zero the probability that the emotion word of the corresponding emotion included in the topic words of each topic appears in the topic words of any other topic.
13. The apparatus according to Supplementary Note 11, wherein the determining device comprises:
an extracting unit configured to extract, from the topic words of each topic other than the emotion words, a given number of candidate words having the highest occurrence probability; and
a selecting unit configured to select, from among the n-grams that occur in the multiple documents and are formed of the candidate words, the n-grams that satisfy a predetermined condition based on the pairwise (binary) mutual information of each n-gram, as the reason associated with the emotion corresponding to that topic, where n is a positive integer greater than 1.
14. The apparatus according to Supplementary Note 13, wherein the pairwise mutual information of an n-gram refers to the mutual information between adjacent words in the n-gram, and the predetermined condition is that all pairwise mutual information values of the n-gram are greater than a predetermined threshold.
15. The apparatus according to Supplementary Note 13 or 14, wherein the determining device further comprises:
a merging unit configured to merge, among the n-grams that satisfy the predetermined condition, those n-grams between which a containment relation exists, based on the length and the number of occurrences of each n-gram.
16. The apparatus according to Supplementary Note 15, wherein the merging unit comprises:
an evaluation value calculating subunit configured to calculate an evaluation value of each n-gram based on the length and the number of occurrences of the n-gram; and
a merging subunit configured to retain one of the n-grams according to the calculated evaluation values and the containment relation between the n-grams.
17. The apparatus according to Supplementary Note 16, wherein the evaluation value calculating subunit is further configured to:
calculate the product of the length of the n-gram and the number of occurrences of the n-gram as the evaluation value of the n-gram.
18. The apparatus according to Supplementary Note 16, wherein the merging subunit is further configured to:
if one n-gram contains another n-gram and the evaluation value of the containing n-gram is greater than the evaluation value of the other n-gram, retain the containing n-gram and delete the other n-gram.
19. The apparatus according to Supplementary Note 11, wherein the documents contain emoticons;
the apparatus further comprising: a relation judging unit configured to judge, according to the frequency with which an emoticon appears as a topic word, the relation between the emoticon and the emotion associated with the topic to which the topic word corresponds.
20. The apparatus according to Supplementary Note 11, further comprising: a dictionary parsing unit configured to obtain, from an emotion dictionary obtained in advance, the predetermined number of emotion types and the emotion words corresponding to each emotion.

Claims (10)

1. A method for determining an emotion and a reason thereof for a specific topic, comprising:
collecting multiple documents for the specific topic;
setting a topic model such that the number of topics equals a predetermined number of emotion types, the topic words of each topic comprise an emotion word of a corresponding emotion, and an emotion word of each emotion appears only in the topic words of the corresponding topic;
analyzing the multiple documents by using the set topic model, so as to obtain the predetermined number of topics and the topic words of each topic; and
determining, from the topic words of each topic, a reason associated with the emotion corresponding to that topic.
2. The method according to claim 1, wherein setting the topic model comprises: setting to zero the probability that the emotion word of the corresponding emotion included in the topic words of each topic appears in the topic words of any other topic.
3. The method according to claim 1, wherein determining, from the topic words of each topic, the reason associated with the emotion corresponding to that topic comprises:
extracting, from the topic words of each topic other than the emotion words, a given number of candidate words having the highest occurrence probability; and
selecting, from among the n-grams that occur in the multiple documents and are formed of the candidate words, the n-grams that satisfy a predetermined condition based on the pairwise (binary) mutual information of each n-gram, as the reason associated with the emotion corresponding to that topic, where n is a positive integer greater than 1.
4. The method according to claim 3, wherein the pairwise mutual information of an n-gram refers to the mutual information between adjacent words in the n-gram, and the predetermined condition is that all pairwise mutual information values of the n-gram are greater than a predetermined threshold.
5. The method according to claim 3 or 4, wherein determining, from the topic words of each topic, the reason associated with the emotion corresponding to that topic further comprises:
merging, among the n-grams that satisfy the predetermined condition, those n-grams between which a containment relation exists, based on the length and the number of occurrences of each n-gram.
6. The method according to claim 1, wherein the documents contain emoticons;
the method further comprising: judging, according to the frequency with which an emoticon appears as a topic word, the relation between the emoticon and the emotion associated with the topic to which the topic word corresponds.
7. An apparatus for determining an emotion and a reason thereof for a specific topic, comprising:
a collecting device configured to collect multiple documents for the specific topic;
a setting device configured to set a topic model such that the number of topics equals a predetermined number of emotion types, the topic words of each topic comprise an emotion word of a corresponding emotion, and an emotion word of each emotion appears only in the topic words of the corresponding topic;
the set topic model, used to analyze the multiple documents so as to obtain the predetermined number of topics and the topic words of each topic; and
a determining device configured to determine, from the topic words of each topic, a reason associated with the emotion corresponding to that topic.
8. The apparatus according to claim 7, wherein the setting device is further configured to set to zero the probability that the emotion word of the corresponding emotion included in the topic words of each topic appears in the topic words of any other topic.
9. The apparatus according to claim 7, wherein the determining device comprises:
an extracting unit configured to extract, from the topic words of each topic other than the emotion words, a given number of candidate words having the highest occurrence probability; and
a selecting unit configured to select, from among the n-grams that occur in the multiple documents and are formed of the candidate words, the n-grams that satisfy a predetermined condition based on the pairwise (binary) mutual information of each n-gram, as the reason associated with the emotion corresponding to that topic, where n is a positive integer greater than 1.
10. The apparatus according to claim 9, wherein the determining device further comprises:
a merging unit configured to merge, among the n-grams that satisfy the predetermined condition, those n-grams between which a containment relation exists, based on the length and the number of occurrences of each n-gram.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410239139.1A CN105335347A (en) 2014-05-30 2014-05-30 Method and device for determining emotion and reason thereof for specific topic

Publications (1)

Publication Number Publication Date
CN105335347A true CN105335347A (en) 2016-02-17

Family

ID=55285892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410239139.1A Pending CN105335347A (en) 2014-05-30 2014-05-30 Method and device for determining emotion and reason thereof for specific topic

Country Status (1)

Country Link
CN (1) CN105335347A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080319974A1 (en) * 2007-06-21 2008-12-25 Microsoft Corporation Mining geographic knowledge using a location aware topic model
US20090099996A1 (en) * 2007-10-12 2009-04-16 Palo Alto Research Center Incorporated System And Method For Performing Discovery Of Digital Information In A Subject Area
CN101876985A (en) * 2009-11-26 2010-11-03 西北工业大学 WEB text sentiment theme recognizing method based on mixed model
CN101901230A (en) * 2009-05-31 2010-12-01 国际商业机器公司 Information retrieval method, user comment processing method and system thereof
CN103399916A (en) * 2013-07-31 2013-11-20 清华大学 Internet comment and opinion mining method and system on basis of product features

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SOPHIA YAT MEI LEE 等: "DETECTING EMOTION CAUSES WITH A LINGUISTIC RULE-BASED APPROACH", 《COMPUTATIONAL INTELLIGENCE》 *
叶璐: "新闻文本的读者情绪自动预测方法研究", 《中国优秀硕士学位论文全文数据库-信息科技辑》 *
李逸薇等: "基于序列标注模型的情绪原因识别方法", 《中文信息学报》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359181A (en) * 2018-09-27 2019-02-19 深圳前海微众银行股份有限公司 The recognition methods of negative emotions reason, equipment and computer readable storage medium
CN109359181B (en) * 2018-09-27 2021-11-19 深圳前海微众银行股份有限公司 Negative emotion reason identification method, device and computer-readable storage medium
CN110855554A (en) * 2019-11-08 2020-02-28 腾讯科技(深圳)有限公司 Content aggregation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 2018-09-21)