CN105335352A

CN105335352A - Entity identification method based on Weibo emotion

Info

Publication number: CN105335352A
Application number: CN201510864383.1A
Authority: CN
Inventors: 崔晓辉; 朱卫平; 张威风; 杨威; 王志波; 李伟
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2015-11-30
Filing date: 2015-11-30
Publication date: 2016-02-17

Abstract

The invention provides an entity recognition method based on Weibo emotion. Four emotion keyword dictionaries are generated, using the Circumplex model of emotion as the sentiment analysis model. Weibo data are collected through an API, preprocessed, and vectorized; four machine learning algorithms are trained on the data set and evaluated with 5-fold cross-validation; the best-performing classifier is then used to classify a new data set; finally, entity extraction is performed on the classified data.

Description

Entity recognition method based on microblog emotion

Technical field

The present invention relates to the field of collecting and analyzing big data on networks, and specifically to an entity recognition method based on microblog emotion.

Technical background

In China, microblogging is a new social media platform that has emerged only in recent years, so domestic research on sentiment analysis of microblog short texts started late. Among the earlier work on feature extraction, Ye Qiang, Zhang Ziqiong, and Luo Zhenxiong built on the widely used N-POS language model to extract Chinese phrases and proposed the 2-POS model of two-word Chinese subjective phrases, laying a foundation for emotion recognition in Chinese text. Later, Xu Jun used machine learning methods such as naive Bayes and maximum entropy for text sentiment mining and classification; the results showed that machine learning methods achieve satisfactory results in sentiment-based classification of Chinese text, with accuracy above 90%. For film reviews, Hu Yi applied the N-Gram language model, naive Bayes classification, and the support vector machine (SVM) to sentiment classification and found that when training samples are limited, the N-Gram language model has higher classification accuracy and good extensibility.

Building on these studies, research on sentiment-based text mining has continued to grow and the related research fields have expanded. For example, Pang Lei and others used three classification methods (naive Bayes, SVM, and maximum entropy) to classify stock comments on Sina Weibo into bullish and bearish, positive and negative attitudes. Fu Xianghua and other scholars studied sentiment analysis of Chinese blogs from different angles: a multi-aspect topic sentiment mining method for Chinese blogs was proposed, based on a document topic generation model and the HowNet dictionary; dictionary-based statistical sentiment analysis was introduced into microblog sentiment analysis; and SOAD (sentiment orientation analysis based on syntactic dependency), an algorithm based on syntactic dependency parsing, was proposed for analyzing the sentiment orientation of blog search results.

More broadly, with the development of the Internet, many scholars abroad have in recent years begun sentiment mining research in a wider range of fields, including travel blogs, legal blogs, and film and television reviews. Sentiment mining aims to use dedicated classification techniques to extract positive or negative attitudes from consumer comments on a specific product or service. Using the results of sentiment classification, consumers can obtain the information needed to make a purchase decision, and businesses can learn how users react and how their competitors perform. With the widespread use of computer technology, sentiment mining of comment content has become a recent research trend and is widely applied in many fields.

Named entity recognition (NER), also called entity recognition, refers to identifying entities with specific meaning in a text, mainly personal names, place names, organization names, and proper nouns. In recent years, with the rapid development of information retrieval and search engine technology, Chinese named entity recognition has become a hot topic in natural language processing research. According to the current state of domestic research, there are four main approaches to Chinese named entity recognition: statistics-based methods, rule-based methods, methods combining rules and statistics, and machine-learning-based methods.

(1) Statistics-based methods

The statistical models used for Chinese named entity recognition mainly include the hidden Markov model (HMM), decision tree models, support vector machine models, maximum entropy models, and conditional random field models. Asahara used support vector machines to recognize Chinese personal names and organization names automatically and achieved fairly good results.

(2) Rule-based methods

Rule-based named entity recognition mainly uses two kinds of information: constraining components and the words of the named entity itself. Tan used a transformation-based, error-driven method to learn contextual rules for place-name entities and then applied these rules to recognize Chinese place names automatically; in testing on a certain data set, the accuracy of this method reached 97%.

(3) Methods combining rules and statistics

Some current mainstream Chinese named entity recognition systems combine rules and statistics: they first use a statistical method to make a preliminary identification of entities and then use rules to correct and filter the results. Huang Degen used extensive statistics gathered from a large amount of real text to compute, for each name, the confidence of its continued word formation and of its word formation, and then recognized Chinese personal names automatically in combination with certain rules.

(4) Machine-learning-based methods

Named entity recognition in English is much simpler than in Chinese because English does not have the difficulties introduced by word segmentation, and segmentation accuracy is a key factor affecting Chinese named entity recognition. English named entity recognition is comparatively mature: using support vector machines to classify English words, place-name and personal-name recognition accuracy can exceed 99%.

Microblogging, as a major form of social networking media, is increasingly popular. People tend to obtain news, commentary, entertainment, and other information from microblogs, and the influence of microblogs on the spread of online public opinion is growing. Microblog posts carry emotional features of different tendencies, and mining these features is significant for public opinion monitoring, marketing, and rumor control. Most sentiment analysis divides text sentiment into three classes: positive, neutral, and negative. Applying such coarse-grained sentiment analysis directly to microblogs offers limited help in understanding people and is not enough to truly take the pulse of society and listen to its emotions.

Summary of the invention

In view of the deficiencies of the prior art, the present invention provides an entity recognition technique based on microblog emotion. The invention has high recognition accuracy and fast processing speed and is suitable for accurate recognition on large-scale data.

To achieve the above object, the present invention adopts the following technical solution: an entity recognition method based on microblog emotion, comprising the following steps:

Step 1. Training stage: select the optimal machine learning algorithm;

Step 1.1: construct four emotion word dictionaries according to the Circumplex model of emotion;

The four emotion word dictionaries are mapped onto a two-dimensional coordinate system whose four quadrants correspond to: happy and active, happy but inactive, unhappy but active, and unhappy and inactive;

Step 1.2: using a web API collection technique, obtain microblog data from the microblog platform with the four classes of emotion words as keywords, as training data;

Step 1.3: preprocess the collected training data to generate a standardized training data set;

Step 1.4: extract keywords from the training data and vectorize the training data set according to the vector space model;

Punctuation marks and emoticons are likewise vectorized as tokens, so that the emotion of the text can be analyzed more effectively and appropriately. To vectorize them, emoticons and punctuation marks are replaced with corresponding English words, which are then converted to word vectors. For example, a smiling face is replaced with happy, and the word vector of happy is (1,0,0,1,1,2).
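The replacement of emoticons and punctuation by words can be sketched as follows. This is a minimal illustration only: the symbol table `SYMBOL_TO_WORD` and the function name `replace_symbols` are hypothetical, and only the "smiling face to happy" pair comes from the description.

```python
import re

# Hypothetical symbol table; only "smiling face -> happy" is given in the text.
SYMBOL_TO_WORD = {
    ":)": "happy",
    ":(": "sad",
    "!": "exclaim",
    "?": "question",
}

def replace_symbols(text: str) -> str:
    """Replace known emoticons and punctuation marks with English words."""
    # Try longer symbols first so that multi-character emoticons win.
    symbols = sorted(SYMBOL_TO_WORD, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in symbols)
    replaced = re.sub(pattern, lambda m: f" {SYMBOL_TO_WORD[m.group(0)]} ", text)
    # Collapse the extra whitespace introduced by the substitution.
    return " ".join(replaced.split())
```

After this step the text contains only ordinary words, so the same word-vector lookup can be applied to every token.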

Step 1.5: using each of the preset machine learning algorithms, perform emotion classification with 5-fold cross-validation on the vectorized training data set;

Step 1.6: compute the accuracy and recall of each machine learning algorithm over the 5 cross-validation runs, and select the algorithm with the highest mean accuracy and recall as the optimal machine learning classification algorithm.
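The selection rule of step 1.6 can be sketched as below. The sketch assumes that "mean" is the average of the mean accuracy and the mean recall over the five folds; the function name `select_best`, the algorithm names, and the scores are placeholders for illustration, not measured results.

```python
from statistics import mean

def select_best(fold_scores):
    """fold_scores maps an algorithm name to a list of (accuracy, recall)
    pairs, one pair per cross-validation fold."""
    def combined(name):
        pairs = fold_scores[name]
        avg_accuracy = mean(a for a, _ in pairs)
        avg_recall = mean(r for _, r in pairs)
        # Assumption: the two averages are weighted equally.
        return (avg_accuracy + avg_recall) / 2
    return max(fold_scores, key=combined)

# Placeholder scores, for illustration only.
example_scores = {
    "algorithm_a": [(0.70, 0.60)] * 5,
    "algorithm_b": [(0.80, 0.70)] * 5,
}
best = select_best(example_scores)
```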

Step 2. Experimental stage: using the optimal machine learning classification algorithm obtained in step 1, obtain the recognized emotion entities.

Step 2.1: obtain a vectorized experimental data set by the same method as steps 1.1 to 1.4;

Step 2.2: use the optimal machine learning classification algorithm obtained in step 1 to classify the experimental data set, obtaining four emotion data sets;

Step 2.3: perform entity extraction on each of the four emotion data sets to obtain the recognized emotion entities.

Further, the preprocessing in step 1.3 comprises correcting misspelled words, deleting irrelevant words, deleting ambiguous microblogs, and synonym conversion. Correcting misspelled words means fixing words that are spelled incorrectly; deleting irrelevant words means deleting words that contribute nothing to sentiment analysis; deleting ambiguous microblogs means deleting microblogs whose single text belongs to different emotion classes; synonym conversion means replacing a word with another word of equivalent meaning.

Preferably, in step 1.4, the TF-IDF algorithm is used to extract keywords; if the text contains emoticons or punctuation marks, common emoticons and punctuation marks that express tone are converted into corresponding words.

Preferably, in step 1.4, the open-source tool word2vec is used to build word vectors, and the training data set is vectorized according to the vector space model.

Preferably, in step 2.3, the SENNA deep learning toolkit is used to perform entity extraction on each of the four emotion data sets.

Preferably, in step 1.5, the preset machine learning algorithms are four in number: naive Bayes, logistic regression, support vector machine, and K-nearest-neighbor.

The present invention performs classification and entity recognition through machine learning and deep learning, carrying out finer-grained entity recognition on microblog emotion with high recognition accuracy and good results. It provides the following benefits:

1. Through data processing and analysis, finer-grained sentiment analysis can be performed;

2. The fine-grained sentiment analysis obtained reflects the emotional state of the microblog user population;

3. It helps governments, organizations, and individuals understand and grasp social sentiment.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention.

Detailed description of the embodiments

To make the technical means, inventive features, objectives, and effects of the present invention easy to understand, the invention is further described below with reference to embodiments.

The volume of microblog data is very large, and classifying it manually would cost a great deal of manpower, material, and money. The hashtag topic labels provided in microblogs are therefore used as the emotion labels of the posts: if a microblog is tagged with the hashtag of an emotion class, the microblog is considered to belong to that emotion class.
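Hashtag-based labeling might look like the following sketch. The hashtag sets in `EMOTION_HASHTAGS` and the function name `label_from_hashtags` are hypothetical, since the description does not enumerate the tags; posts with no emotion hashtag, or with hashtags from more than one class, are left unlabeled here as an assumption.

```python
import re

# Hypothetical hashtags for the four emotion classes.
EMOTION_HASHTAGS = {
    "happy_active": {"excited", "thrilled"},
    "happy_inactive": {"relaxed", "calm"},
    "unhappy_active": {"angry", "nervous"},
    "unhappy_inactive": {"sad", "bored"},
}

def label_from_hashtags(post: str):
    """Return the emotion class implied by the post's hashtags, or None."""
    tags = {t.lower() for t in re.findall(r"#(\w+)", post)}
    matches = [label for label, words in EMOTION_HASHTAGS.items() if tags & words]
    # Keep the label only when exactly one class matches.
    return matches[0] if len(matches) == 1 else None
```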

An entity recognition method based on microblog emotion comprises the following steps:

Step 1. Training stage: select the optimal machine learning algorithm;

Step 1.1: construct four emotion word dictionaries according to the Circumplex model of emotion. The four emotion word dictionaries are mapped onto a two-dimensional coordinate system whose four quadrants correspond to: happy and active, happy but inactive, unhappy but active, and unhappy and inactive;
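The mapping of emotion words onto the four classes can be sketched with two coordinates per word, one for pleasantness and one for activity. The seed words and their coordinates below are hypothetical placeholders, as are the names `quadrant`, `SEED_WORDS`, and `dictionaries`.

```python
from collections import defaultdict

def quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) point to one of the four emotion classes."""
    pleasant = "happy" if valence >= 0 else "unhappy"
    activation = "active" if arousal >= 0 else "inactive"
    return f"{pleasant}_{activation}"

# Hypothetical seed words with made-up coordinates in [-1, 1].
SEED_WORDS = {
    "excited": (0.7, 0.8),
    "calm": (0.6, -0.6),
    "angry": (-0.7, 0.7),
    "bored": (-0.6, -0.7),
}

# Build one dictionary (a set of words) per emotion class.
dictionaries = defaultdict(set)
for word, (valence, arousal) in SEED_WORDS.items():
    dictionaries[quadrant(valence, arousal)].add(word)
```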

Step 1.3: preprocess the collected training data to generate a standardized training data set. Preprocessing comprises correcting misspelled words, deleting irrelevant words, deleting ambiguous data, and synonym conversion.

Correcting misspelled words means fixing words that are spelled incorrectly, for example changing eta to eat. Deleting irrelevant words means deleting words that contribute nothing to sentiment analysis, such as the and of, which carry no practical meaning. Deleting ambiguous microblogs means deleting microblogs whose single text belongs to different emotion classes. Synonym conversion means replacing a word with an equivalent word.
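The four preprocessing operations can be sketched as below. The lookup tables are hypothetical stand-ins (only eta to eat and the stop words the and of come from the description), and a real system would need a proper spelling corrector and synonym resource.

```python
# Hypothetical lookup tables for illustration.
SPELLING_FIXES = {"eta": "eat"}
STOP_WORDS = {"the", "of", "a", "an"}
SYNONYMS = {"glad": "happy", "joyful": "happy"}

def preprocess(text: str) -> list:
    """Correct spelling, drop irrelevant words, and unify synonyms."""
    tokens = text.lower().split()
    tokens = [SPELLING_FIXES.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in tokens]

def drop_ambiguous(labelled_posts):
    """Remove every text that appears with more than one emotion label."""
    labels_by_text = {}
    for text, label in labelled_posts:
        labels_by_text.setdefault(text, set()).add(label)
    return [(t, l) for t, l in labelled_posts if len(labels_by_text[t]) == 1]
```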

Step 1.4: extract keywords from the training data using the TF-IDF algorithm; if the text contains emoticons or punctuation marks, common emoticons and punctuation marks that express tone are converted into corresponding words.

The open-source tool word2vec is used to build word vectors, and the training data set is vectorized according to the vector space model. The vectorization covers not only words but also punctuation marks and emoticons.

The vector space model is a classical text feature model, proposed by Salton et al. in the 1960s and applied successfully in the SMART text retrieval system.

Building word vectors: a word vector represents a word as a vector; for example, happy can be represented by the vector (0,1,3,4,1,1).

Word2vec is an efficient tool, open-sourced by Google in 2013, for representing words as real-valued vectors. This tool is used to represent each word as a vector.

Vectorizing the data set: keywords are extracted from each data item using the mature TF-IDF algorithm, which yields a group of keywords; the keywords are then converted into word vectors, and the data item is represented by this group of word vectors. For example, from the data item "I want to go home" the three keywords I, go, and home can be extracted; their word vectors are (1,0,1,0,1,3), (0,1,2,3,0,0), and (1,1,3,2,1,6), so the data item can be represented by these three vectors.
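A minimal sketch of keyword extraction followed by vector lookup is given below. It uses the plain tf × log(N/df) weighting, which is one common TF-IDF variant and an assumption here; the three-document corpus is hypothetical, and the toy vectors in `WORD_VECTORS` are copied from the example above rather than produced by word2vec.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring tokens of each tokenised document."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result

# Toy vectors taken from the example in the text.
WORD_VECTORS = {
    "i": (1, 0, 1, 0, 1, 3),
    "go": (0, 1, 2, 3, 0, 0),
    "home": (1, 1, 3, 2, 1, 6),
}

# Hypothetical mini-corpus of tokenised posts.
docs = [
    ["i", "want", "to", "go", "home"],
    ["i", "want", "to", "eat"],
    ["i", "want", "to", "sleep"],
]
keywords = tfidf_keywords(docs, top_k=2)
doc_vectors = [WORD_VECTORS[k] for k in keywords[0] if k in WORD_VECTORS]
```

On this tiny corpus the word i occurs in every document and therefore scores zero, so the selected keywords differ from the three named in the example; which keywords are chosen depends on the corpus.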

5-fold cross-validation: the data set is randomly divided into 5 equal parts, 4 of which serve as the training set and 1 as the test set. The machine learning algorithm is trained on the training set; once trained, it yields a decision function, which is then tested on the held-out test set, and the accuracy and recall of the classification are computed. This process is repeated 5 times, so that each part serves once as the test set.
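The splitting procedure can be sketched with the standard library alone. The function name `five_fold_splits` and the fixed random seed are illustrative assumptions.

```python
import random

def five_fold_splits(data, seed=0):
    """Yield five (train, test) pairs; each fold is the test set exactly once."""
    items = list(data)
    random.Random(seed).shuffle(items)
    # Deal the shuffled items into 5 folds of (nearly) equal size.
    folds = [items[i::5] for i in range(5)]
    for i in range(5):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(five_fold_splits(range(20)))
```

Each classifier would be trained on `train` and scored on `test` inside the loop, giving five accuracy and recall values per algorithm.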

This method presets four machine learning algorithms, as follows:

1. Naive Bayes

The basic principle of naive Bayes is: for a given data item to be classified, compute the probability of each class given that this data item has occurred; this probability is usually called the posterior probability. The data item is assigned to the class with the largest posterior probability.

The formula is as follows:

P(C_{k} \mid x) = \frac{P(C_{k}) \, P(x \mid C_{k})}{P(x)}

In the formula, P(C_k) is the probability of event C_k, P(x) is the probability of event x, P(x|C_k) is the probability of event x given that C_k has occurred, and P(C_k|x) is the probability of C_k given that x has occurred.

The program logic is as follows: C_k denotes a class and x denotes the data item to be classified. For a fixed number of classes, P(C_k) is fixed; here, for example, it is 0.25 (1/4). For a given data item, P(x) is also fixed. It therefore suffices to find the class with the largest P(x|C_k) to obtain the largest P(C_k|x). P(x|C_k) is the probability that x occurs in class C_k and is estimated from the training set; for example, if class C_k contains 100 items in the training set and x accounts for 10 of them, the probability is 0.1.
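The logic above can be sketched for tokenised text as follows. Equal class priors follow the description; treating tokens as independent, working in log space, and add-one smoothing are assumptions made so the sketch handles unseen words, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: iterable of (tokens, label). Returns per-class word counts
    and the vocabulary."""
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, vocab

def classify(tokens, word_counts, vocab):
    """Pick the class with the largest log P(x | C_k); priors are equal."""
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Add-one smoothing is an assumption, not stated in the text.
        score = sum(
            math.log((counts[t] + 1) / (total + len(vocab))) for t in tokens
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```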

2. Logistic regression

Logistic regression has much in common with other regression analyses such as multiple linear regression; these regression models all belong to the family of generalized linear models. Within this family, the regression analyses differ mainly in their dependent variable. Building a logistic regression involves the following key steps:

1. Establish the prediction function, which gives the probability that a certain event occurs.

2. Construct the logistic function, i.e. the sigmoid function. Because the prediction function is an approximate probability function obtained from the raw training data, its value may fall outside the valid range of a probability, for example below 0. The logistic function is therefore introduced; it maps any number between negative infinity and positive infinity to a value between 0 and 1.

3. Use gradient descent to obtain the regression parameters. In the training stage of the logistic regression classifier, the likelihood function is derived from the form of the logistic function; the parameters are usually estimated by maximum likelihood, with gradient descent used to find their optimal values.

The program logic is as follows: let the feature values of the data set be X1, X2, X3, ... with corresponding weights W1, W2, W3, ..., and let Z = W1 × X1 + W2 × X2 + W3 × X3 + .... The sigmoid function maps the result into the interval (0,1): p = sigmoid(Z) = 1/(1+exp(-Z)). Gradient descent on the training data then yields the maximum likelihood estimate of each weight. Once the weights are known, the expression of the function is determined, the probability of each class can be computed, and new data can be classified.
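The program logic can be sketched as a two-class classifier trained by batch gradient descent. The bias term b, the learning rate, and the epoch count are assumptions; extending the sketch to the four emotion classes (for example one classifier per class) is likewise an assumption and is not shown.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights w and bias b by gradient descent on the negative
    log-likelihood; y contains 0/1 labels."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the loss with respect to Z
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict_proba(x, w, b) -> float:
    """Probability of the positive class for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```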

3. Support vector machine

The support vector machine is a supervised learning algorithm widely applied in statistical classification and regression. From the training data it constructs one or more hyperplanes in a high-dimensional space, so that data that are inseparable in a low-dimensional space can be separated in the high-dimensional one. In text classification, the support vector machine is one of the best classification algorithms.

The program logic is as follows: the main purpose of training a support vector machine is to find the equation of the hyperplane separating two classes. Let the equation be W^T X + b = 0, where W and X denote a matrix and a vector respectively and X here is a word vector. Slack variables and a penalty factor are introduced, and the method of Lagrange multipliers is used to obtain the optimal separating hyperplane; once the hyperplane function is obtained, other vectors X can be classified.
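A linear two-class version can be sketched as below. Note the substitution: instead of solving the Lagrangian dual described above, the sketch minimises the equivalent soft-margin objective 0.5·||w||² + C·Σ hinge loss by subgradient descent, where C plays the role of the penalty factor. The function names, the learning rate, and the epoch count are assumptions.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=1000):
    """Subgradient descent on the soft-margin objective; y is in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5 * ||w||^2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # the point violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= C * yi * xj
                grad_b -= C * yi
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_predict(x, w, b) -> int:
    """Classify x by the side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```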

4. K-nearest-neighbor algorithm

The K-nearest-neighbor algorithm is one of the most mature machine learning algorithms and also one of the simplest. Its basic idea is: given some labeled data, if most of the K data points nearest to a sample in feature vector space belong to one class, the sample is assigned to that class.

The program logic is as follows: the training vectors are projected into an N-dimensional space. For a new data vector X, the K points nearest to X are found; if class A is the most frequent among these K points, the new data item is assigned to class A.
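This logic can be sketched in a few lines. Euclidean distance and the default K = 3 are assumptions, since the description fixes neither; the function name `knn_classify` is hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """train: list of (vector, label) pairs. Returns the majority label
    among the k training vectors nearest to x."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```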

Step 2.3: use the SENNA deep learning toolkit to perform entity extraction on each of the four emotion data sets.

The above is the basic principle and main implementation of the present invention. The invention extracts microblog content, applies deep learning to big data, improves the precision of sentiment analysis, and recognizes microblog emotion entities. It helps governments, organizations, and institutions study the emotions of the general public and is of considerable value in public opinion analysis, social events, and event early warning.

Claims

1. An entity recognition method based on microblog emotion, characterized by comprising the following steps:

Step 1. Training stage: select the optimal machine learning algorithm;

Step 1.2: using a web API collection technique, obtain microblog data from the microblog platform with the four classes of emotion words as keywords, as training data;

Step 1.6: compute the accuracy and recall of each machine learning algorithm over the 5 cross-validation runs, and select the algorithm with the highest mean accuracy and recall as the optimal machine learning classification algorithm;

Step 2. Experimental stage: using the optimal machine learning classification algorithm obtained in step 1, obtain the recognized emotion entities.

2. The entity recognition method based on microblog emotion according to claim 1, characterized in that the preprocessing in step 1.3 comprises correcting misspelled words, deleting irrelevant words, deleting ambiguous microblogs, and synonym conversion; correcting misspelled words means fixing words that are spelled incorrectly; deleting irrelevant words means deleting words that contribute nothing to sentiment analysis; deleting ambiguous microblogs means deleting microblogs whose single text belongs to different emotion classes; synonym conversion means replacing a word with another word of equivalent meaning.

3. The entity recognition method based on microblog emotion according to claim 1, characterized in that in step 1.4 the TF-IDF algorithm is used to extract keywords, and if the text contains emoticons or punctuation marks, common emoticons and punctuation marks that express tone are converted into corresponding words.

4. The entity recognition method based on microblog emotion according to claim 1, characterized in that in step 1.4 the open-source tool word2vec is used to build word vectors, and the training data set is vectorized according to the vector space model.

5. The entity recognition method based on microblog emotion according to claim 1, characterized in that in step 2.3 the SENNA deep learning toolkit is used to perform entity extraction on each of the four emotion data sets.

6. The entity recognition method based on microblog emotion according to claim 1, characterized in that in step 1.5 the preset machine learning algorithms are four in number: naive Bayes, logistic regression, support vector machine, and K-nearest-neighbor.