CN104573046A

CN104573046A - Comment analyzing method and system based on term vector

Info

Publication number: CN104573046A
Application number: CN201510027614.3A
Authority: CN
Inventors: 廖博森
Original assignee: Chengdu Pinguo Technology Co Ltd
Current assignee: Chengdu Pinguo Technology Co Ltd
Priority date: 2015-01-20
Filing date: 2015-01-20
Publication date: 2015-04-29
Anticipated expiration: 2035-01-20
Also published as: CN104573046B

Abstract

The invention discloses a comment analysis method and system based on term vectors, relating to technical fields such as sentiment analysis and natural language processing. User comments are analyzed automatically by machine, which improves working efficiency. In the method, user comments are collected to form a comment corpus; each comment in the corpus is converted into a sentence vector of identical dimension; several comment types are set, and each comment is labeled with its type according to manually entered labels; a classifier is trained with the sentence vectors as input and the comment type of each sentence vector as output; a new comment is acquired and converted into a sentence vector; and that sentence vector is input into the classifier to obtain the comment type of the new comment.

Description

A comment analysis method and system based on term vectors

Technical field

The present invention relates to technical fields such as sentiment analysis and natural language processing.

Background technology

With the development of e-commerce, users post more and more comments about products online. Analyzing these comments reveals users' views of and suggestions for a product, which helps improve the product and raise service quality. However, as the number of users keeps growing, the volume of comments grows sharply as well. If comments are still read manually to understand user opinions, work efficiency drops greatly, and users' opinions and suggestions about a product or service cannot be understood in time.

Summary of the invention

In view of the above, the present invention proposes a method and system that analyze comments by machine, performing user comment analysis automatically and improving work efficiency.

The comment analysis method based on term vectors of the present invention comprises:

Step 1: collect user comments to form a comment corpus;

Step 2: convert each comment in the comment corpus into a sentence vector of identical dimension;

Step 3: set several comment types, and label each comment with the comment type it belongs to according to manually entered labels;

Step 4: train a classifier with the sentence vectors as input and the comment type corresponding to each sentence vector as output;

Step 5: obtain a new comment and convert it into a sentence vector;

Step 6: input the sentence vector of the new comment into the classifier to obtain the comment type of the new comment.

Step 2 further comprises:

Step 21: segment each comment into several basic tokens, and de-duplicate the basic tokens to obtain a comment dictionary;

Step 22: convert each basic token into a term vector, the term vectors of all basic tokens having the same dimension;

Step 23: sum the term vectors of the basic tokens in each comment to obtain the sentence vector of that comment.

Step 5 further comprises:

Step 51: segment the new comment into several basic tokens;

Step 52: look up, in the comment dictionary, the term vector of each basic token obtained in step 51;

Step 53: sum the term vectors of the basic tokens of the new comment to obtain the sentence vector of the new comment.

Step 22 further comprises: feeding the basic tokens into a neural network model as input, so that the neural network model obtains the term vector of each basic token through unsupervised learning.

Preferably, the term vector dimension is 200.

Step 3 further comprises performing the following processing on the comments of each comment type:

Step 31: calculate the key weight of each basic token in each comment of the comment type;

Step 32: sort the basic tokens of all comments of the comment type in descending order of key weight;

Step 33: select the first n distinct basic tokens as the keywords of the comment type, where n is a natural number greater than 0 and less than or equal to 5.
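Steps 32 and 33 can be sketched in Python as follows. This is only a minimal illustration: the function name `select_keywords` and the toy weights are assumptions made for the example, and the key weights themselves are taken as already computed in step 31.

```python
def select_keywords(weighted_tokens, n=5):
    """Return the first n distinct tokens after a descending sort by key weight.

    weighted_tokens: iterable of (token, key_weight) pairs gathered from all
    comments of one comment type; the same token may occur several times.
    """
    if not 0 < n <= 5:
        raise ValueError("n must be a natural number in the range 1..5")
    # Step 32: descending sort by key weight.
    ranked = sorted(weighted_tokens, key=lambda pair: pair[1], reverse=True)
    # Step 33: keep the first n tokens that differ from each other.
    keywords = []
    for token, _weight in ranked:
        if token not in keywords:
            keywords.append(token)
        if len(keywords) == n:
            break
    return keywords


# Toy weights, invented purely for illustration.
pairs = [("crash", 0.9), ("slow", 0.4), ("crash", 0.7), ("noise", 0.6)]
print(select_keywords(pairs, n=2))  # ['crash', 'noise']
```

If a comment type has fewer than n distinct tokens, the sketch simply returns all of them.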

The present invention also provides a comment analysis system based on term vectors, comprising:

Comment collection module, for collecting user comments to form a comment corpus;

Sample sentence vector conversion module, for converting each comment in the comment corpus into a sentence vector of identical dimension;

Comment type labeling module, for setting several comment types and labeling each comment with the comment type it belongs to according to manually entered labels;

Classifier training module, for training a classifier with the sentence vectors as input and the comment type corresponding to each sentence vector as output;

Comment sentence vector conversion module, for obtaining a new comment and converting it into a sentence vector;

Classifier, for computing the comment type of the new comment from the sentence vector of the new comment.
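As a rough illustration of how the comment sentence vector conversion module and the classifier cooperate on a new comment, the following Python sketch wires two callables together. The class name `CommentAnalyzer` and both toy functions are hypothetical stand-ins that only make the sketch runnable; they are not the components described here.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class CommentAnalyzer:
    """Chains the conversion module and the classifier for new comments."""
    to_sentence_vector: Callable[[str], Vector]  # comment sentence vector conversion module
    classify: Callable[[Vector], int]            # trained classifier

    def analyze(self, comment: str) -> int:
        return self.classify(self.to_sentence_vector(comment))


# Hypothetical stand-ins; a real system would plug in the trained modules.
def toy_vectorizer(comment: str) -> Vector:
    return [float(len(comment.split()))]


def toy_classifier(vector: Vector) -> int:
    return 5 if vector[0] >= 3 else 1


analyzer = CommentAnalyzer(toy_vectorizer, toy_classifier)
print(analyzer.analyze("works very well"))  # 5
```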

The sample sentence vector conversion module further comprises:

Sample segmentation module, for segmenting each comment in the comment corpus into several basic tokens and de-duplicating the basic tokens to obtain a comment dictionary;

Sample term vector conversion module, for converting each basic token into a term vector, the term vectors of all basic tokens having the same dimension;

Sample term vector summation module, for summing the term vectors of the basic tokens in each comment to obtain the sentence vector of each comment in the comment corpus.

The comment sentence vector conversion module further comprises:

Comment segmentation module, for segmenting the new comment into several basic tokens;

Comment term vector conversion module, for looking up, in the comment dictionary, the term vector of each basic token of the new comment;

Comment term vector summation module, for summing the term vectors of the basic tokens of the new comment to obtain the sentence vector of the new comment.

The sample term vector conversion module is further used for feeding the basic tokens into a neural network model as input, so that the neural network model obtains the term vector of each basic token through unsupervised learning.

Preferably, the term vector dimension is 200.

The comment type labeling module further comprises:

Key weight computation module, for calculating the key weight of each basic token in each comment of a comment type;

Sorting module, for sorting the basic tokens of all comments of the comment type in descending order of key weight;

Keyword selection module, for selecting the first n distinct basic tokens as the keywords of the comment type, where n is a natural number greater than 0 and less than or equal to 5.

In summary, by adopting the above technical scheme, the invention has the following beneficial effects:

The invention automates comment analysis by machine, which greatly improves work efficiency.

The invention uses a neural network model to compute the vectors of basic tokens. Term vectors obtained in this way not only represent their basic tokens accurately but also capture the relationships between words, giving a higher degree of intelligence.

The invention obtains sentence vectors by summing term vectors, which keeps the sentence vector dimension from growing. Because training in effect maps words into a new topic space, summing the term vectors also represents well how the sentence maps into that feature space. This avoids sentence feature vectors that are too sparse or too high-dimensional and represents sentence features well in a low-dimensional space, without harming classification performance.

Embodiment

All features disclosed in this specification, and all steps of any method or process disclosed, may be combined in any manner, except for mutually exclusive features and/or steps.

Any feature disclosed in this specification may, unless specifically stated otherwise, be replaced by an alternative feature that is equivalent or serves a similar purpose. That is, unless specifically stated otherwise, each feature is only one example of a series of equivalent or similar features.

A specific embodiment of the present invention comprises the following steps:

Step 1: collect user comments to form a comment corpus. Specifically, a web crawler can be used to collect users' comments from major websites to form the comment corpus. A web crawler is a program that automatically fetches web page content and is an important component of search engines. The more comments are collected, the more complete the resulting comment corpus.

Step 2: convert each comment in the comment corpus into a sentence vector of identical dimension. This further comprises using word segmentation software to split each comment into basic tokens; after every comment in the corpus has been segmented, all the resulting basic tokens are de-duplicated to obtain the comment dictionary. Each basic token in the comment dictionary is then converted into a term vector.
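The segmentation and de-duplication part of this step might look like the following Python sketch. The `segment` function is a stand-in that lowercases and splits on whitespace; it is an assumption made so the example runs without dependencies, whereas real comments would go through dedicated word segmentation software. The sample comments are invented.

```python
def segment(comment):
    # Stand-in tokenizer (assumption): lowercase and split on whitespace.
    return comment.lower().split()


def build_dictionary(corpus):
    """Segment every comment, then de-duplicate tokens in first-seen order."""
    segmented = [segment(comment) for comment in corpus]
    dictionary = list(dict.fromkeys(tok for toks in segmented for tok in toks))
    return segmented, dictionary


corpus = ["App crashes often", "Good app", "crashes on start"]
segmented, dictionary = build_dictionary(corpus)
print(dictionary)  # ['app', 'crashes', 'often', 'good', 'on', 'start']
```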

This embodiment uses deep learning to train the term vector model:

To highlight the advantage of the term vectors used in the invention, the limitations of the traditional bag-of-words model are described first.

In the traditional bag-of-words model, each word is represented as one feature of a feature vector. Suppose a dictionary contains 10 words; each word then needs a 10-dimensional vector. For example, "good" can be represented as v('good') = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0] and "bad" as v('bad') = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], and so on.

Representing words this way has two limitations. First, when the dictionary is very large, for example on the order of ten million words, ten-million-dimensional vectors are needed and the curse of dimensionality arises, so feature selection or feature extraction becomes necessary. Second, such a representation makes it hard to discover relationships between words: 'fantastic' and 'good' are similar, for example, but their similarity is hard to measure under the bag-of-words model.
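Both limitations can be seen in a small Python sketch. The three-word dictionary is invented for illustration, and cosine similarity is used here only as one common way to compare vectors; the text does not prescribe a particular similarity measure.

```python
import math


def one_hot(word, dictionary):
    """Vector with a single 1 at the word's position in the dictionary."""
    vec = [0] * len(dictionary)
    vec[dictionary.index(word)] = 1
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


dictionary = ["fantastic", "good", "bad"]  # toy dictionary (assumption)
print(one_hot("good", dictionary))  # [0, 1, 0]: one dimension per dictionary word
print(cosine(one_hot("fantastic", dictionary), one_hot("good", dictionary)))  # 0.0
```

The vector length equals the dictionary size, and any two different words have similarity 0, so 'fantastic' is no closer to 'good' than to 'bad'.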

These two limitations motivated us to improve the word representation. We use a neural network model: all basic tokens of the comment dictionary are fed into the model as training samples, and the model obtains 200-dimensional term vector features through unsupervised learning. In other embodiments, the term vector dimension may also be 50, 100, 150, and so on.

Summing the term vectors of all basic tokens in a comment of the comment corpus gives the sentence vector of that comment.

Consider a comment S, where w_{i} denotes the i-th basic token of the comment after segmentation. Then:

S = w_{1}, w_{2}, \ldots, w_{i}, \ldots, w_{n}, where n is the number of words in the sentence.

In this embodiment, each basic token w_{i} is represented as a vector of length 200:

V_{w_{i}} = \{v_{1}, v_{2}, v_{3}, \ldots, v_{i}, \ldots, v_{200}\}, where each dimension gives the value of the word along one abstract dimension.

By the summation principle of this embodiment, the sentence vector of the comment is:

V_{S} = \sum_{w_{i} \in S} V_{w_{i}}.

All comments in the comment corpus are thus represented as 200-dimensional feature vectors, which avoids the curse of dimensionality and lets the relationships between words be reflected in the features.

The benefit is that the dimension of the sentence vector stays constant no matter how many tokens a comment contains. In the traditional approach, each basic token in the comment is replaced by its term vector; if the comment has 10 basic tokens, its sentence vector reaches 10 × 200 = 2000 dimensions, which again risks the curse of dimensionality.
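The contrast between summing and concatenating can be sketched in Python. The 3-dimensional vectors below are toy values chosen for readability (the embodiment uses 200 dimensions), and the function names are illustrative assumptions.

```python
def sentence_vector(tokens, term_vectors):
    """Element-wise sum of the term vectors of the tokens in one comment."""
    dim = len(next(iter(term_vectors.values())))
    total = [0.0] * dim
    for tok in tokens:
        for k, value in enumerate(term_vectors[tok]):
            total[k] += value
    return total


def concatenated_vector(tokens, term_vectors):
    """Traditional alternative: place the term vectors one after another."""
    return [v for tok in tokens for v in term_vectors[tok]]


# Toy 3-dimensional term vectors (assumption).
term_vectors = {
    "app": [1.0, 0.0, 2.0],
    "crashes": [0.5, 1.0, -1.0],
    "often": [0.0, 2.0, 0.5],
}
short = ["app", "crashes"]
longer = ["app", "crashes", "often"]

print(sentence_vector(short, term_vectors))   # [1.5, 1.0, 1.0]
print(sentence_vector(longer, term_vectors))  # [1.5, 3.0, 1.5]
print(len(concatenated_vector(short, term_vectors)),
      len(concatenated_vector(longer, term_vectors)))  # 6 9
```

The summed vector keeps the term vector dimension for both comments, while the concatenated vector grows with the number of tokens.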

Because training in effect maps words into a new topic space, summing the term vectors of the basic tokens of a sentence represents well how the sentence maps into that feature space. The results confirm this: the approach avoids sentence feature vectors that are too sparse or too high-dimensional, represents sentence features well in a low-dimensional space, and does not harm classification performance.

Step 3: set several comment types, and label each comment with the comment type it belongs to according to manually entered labels:

We manually classify and label the comments using 5 comment types, numbered 1 to 5 (1 is very bad, 2 is poor, 3 is average, 4 is fairly good, 5 is very good).

Step 4: train a classifier with the sentence vectors as input and the comment type corresponding to each sentence vector as output:

This embodiment uses the well-performing GBDT (Gradient Boosting Decision Tree) classification algorithm. Since the training set of sentence vectors is labeled, this is supervised learning, and it yields a sentiment classifier.

GBDT is an iterative decision tree algorithm whose training is based on boosting. Its main idea is that each new model is built along the gradient descent direction of the loss function of the previously built model. During training we tuned two GBDT parameters: the number of decision trees, nTree, and the maximum depth of each decision tree, depth. Practical experience shows that results are good when nTree is set to twice the number of input features and depth is kept within 10.
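The boosting idea can be illustrated with a deliberately simplified Python sketch: regression with squared loss on a single feature, using depth-1 trees (stumps). This is not the 5-class GBDT classifier of the embodiment; the names `fit_stump` and `gradient_boost`, the learning rate, and the toy data are all assumptions. Here `n_trees` plays the role of nTree, and the stump depth of 1 plays the role of depth.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # initial constant model
    trees = []

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    for _ in range(n_trees):
        # For squared loss, the negative gradient is the residual y - F(x),
        # so each new tree is fitted to what the current model still gets wrong.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict


# Toy data (assumption): one feature, two clearly separated target levels.
xs, ys = [1, 2, 3, 4], [1.0, 1.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
print(round(model(1), 3), round(model(4), 3))  # 1.0 5.0
```

In a library implementation, the two tuned parameters would typically map to settings such as `n_estimators` and `max_depth` of scikit-learn's `GradientBoostingClassifier`; the text does not say which implementation was used. With 200-dimensional sentence vectors, the stated rule of thumb would give nTree = 400.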

Step 5: obtain a new comment and convert it into a sentence vector:

Specifically, the new comment is segmented with word segmentation software to obtain its basic tokens. The term vector of each basic token of the new comment is looked up in the comment dictionary, and the term vectors are summed to obtain the sentence vector.
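A minimal Python sketch of this lookup-and-sum step is given below. The comment dictionary is modeled as a mapping from token to term vector, with toy 2-dimensional values. Skipping tokens that are missing from the dictionary is an assumption of this sketch; the text does not say how unseen words are handled.

```python
def new_comment_vector(tokens, comment_dictionary, dim):
    """Sum the term vectors of the tokens found in the comment dictionary."""
    total = [0.0] * dim
    for tok in tokens:
        vec = comment_dictionary.get(tok)
        if vec is None:
            continue  # assumption: tokens not in the dictionary are ignored
        total = [a + b for a, b in zip(total, vec)]
    return total


# Toy dictionary (assumption): token -> term vector.
comment_dictionary = {"app": [1.0, 0.0], "slow": [0.0, 2.0]}
print(new_comment_vector(["app", "slow", "unseen"], comment_dictionary, dim=2))  # [1.0, 2.0]
```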

To make each comment type more indicative, keywords can be extracted for each comment type.

Therefore, in another embodiment of the present invention, step 3 further comprises:

Step 33: select the first several distinct basic tokens as the keywords of the comment type. In this embodiment, the first 5 distinct basic tokens are chosen as keywords, for example, class 1: crash, freeze, noise, shake, sluggish.

This embodiment computes the key weight of each basic token using TF-IDF combined with part of speech.

That is, the key weight of a word consists of two parts:

W_{w_{i,j}} = P_{w_{i,j}} \times T_{w_{i,j}}.

Here T_{w_{i,j}} is the TF-IDF weight, P_{w_{i,j}} is the part-of-speech weight, and w_{i,j} denotes the i-th basic token in the j-th comment.

The two parts are computed as follows:

T_{w_{i,j}} = \frac{n_{i,j}}{\sum_{i} n_{i,j}} \times \log\left(\frac{|D|}{|\{k : w_{i,j} \in d_{k}\}|}\right).

Here n_{i,j} is the number of occurrences of basic token i in the j-th comment, \sum_{i} n_{i,j} is the total number of basic tokens in that comment, |D| is the number of comments in the comment corpus, d_{k} denotes the k-th comment, and |\{k : w_{i,j} \in d_{k}\}| is the number of comments in the corpus that contain the basic token w_{i,j}.

P_{w_{i,j}} is a piecewise factor whose value depends on the part of speech of the basic token. In general we consider the weight largest for adjectives, followed by verbs, nouns, adverbs, and other parts of speech. For example, P_{w_{i,j}} is 1 if the basic token is an adjective, 0.8 if it is a verb, 0.6 if it is a noun, 0.2 if it is an adverb, and 0 for any other part of speech.
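The key weight calculation can be sketched in Python as follows. The part-of-speech weights are the example values given above; the natural logarithm, the function name `key_weights`, the toy corpus, and the hand-written part-of-speech tags are assumptions made for illustration (the text does not state the logarithm base).

```python
import math

# Example part-of-speech weights from the text; any other tag gets 0.
POS_WEIGHT = {"adjective": 1.0, "verb": 0.8, "noun": 0.6, "adverb": 0.2}


def key_weights(corpus, j, pos_tags):
    """Key weight W = P * T for every distinct token of comment j.

    corpus: list of comments, each a list of tokens.
    pos_tags: mapping from token to a part-of-speech name.
    """
    comment = corpus[j]
    weights = {}
    for token in set(comment):
        tf = comment.count(token) / len(comment)             # n_ij / sum_i n_ij
        doc_freq = sum(1 for doc in corpus if token in doc)  # |{k : w in d_k}|
        idf = math.log(len(corpus) / doc_freq)               # log(|D| / doc_freq)
        weights[token] = POS_WEIGHT.get(pos_tags.get(token), 0.0) * tf * idf
    return weights


# Toy corpus and tags (assumptions).
corpus = [["app", "crashes", "crashes"], ["good", "app"],
          ["slow", "app"], ["good", "camera"]]
pos_tags = {"app": "noun", "crashes": "verb", "good": "adjective",
            "slow": "adjective", "camera": "noun"}
w = key_weights(corpus, 0, pos_tags)
print(w["crashes"] > w["app"])  # True
```

In this toy corpus "crashes" outranks "app" because it is frequent in its comment, rare in the corpus, and carries the larger verb weight.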

The part of speech of each basic token is likewise obtained from the word segmentation software when the comment is segmented.

The present invention is not limited to the foregoing embodiments. It extends to any new feature or any new combination disclosed in this specification, and to the steps of any new method or process disclosed or any new combination thereof.

Claims

1. A comment analysis method based on term vectors, characterized by comprising:

Step 1: collect user comments to form a comment corpus;

Step 5: obtain a new comment and convert it into a sentence vector;

2. The comment analysis method based on term vectors according to claim 1, characterized in that step 2 further comprises:

Step 23: sum the term vectors of the basic tokens in each comment to obtain the sentence vector of that comment;

Step 5 further comprises:

Step 51: segment the new comment into several basic tokens;

3. The comment analysis method based on term vectors according to claim 2, characterized in that step 22 further comprises: feeding the basic tokens into a neural network model as input, so that the neural network model obtains the term vector of each basic token through unsupervised learning.

4. The comment analysis method based on term vectors according to claim 2 or 3, characterized in that the term vector dimension is 200.

5. The comment analysis method based on term vectors according to claim 2, characterized in that step 3 further comprises performing the following processing on the comments of each comment type:

6. A comment analysis system based on term vectors, characterized by comprising:

Comment collection module, for collecting user comments to form a comment corpus;

7. The comment analysis system based on term vectors according to claim 6, characterized in that the sample sentence vector conversion module further comprises:

Sample term vector summation module, for summing the term vectors of the basic tokens in each comment to obtain the sentence vector of each comment in the comment corpus;

The comment sentence vector conversion module further comprises:

8. The comment analysis system based on term vectors according to claim 7, characterized in that the sample term vector conversion module is further used for feeding the basic tokens into a neural network model as input, so that the neural network model obtains the term vector of each basic token through unsupervised learning.

9. The comment analysis system based on term vectors according to claim 7 or 8, characterized in that the term vector dimension is 200.

10. The comment analysis system based on term vectors according to claim 7, characterized in that the comment type labeling module further comprises: