CN103246728A

CN103246728A - Emergency detection method based on document lexical feature variations

Info

Publication number: CN103246728A
Application number: CN2013101702967A
Authority: CN
Inventors: 王厚峰; 张龙凯
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2013-05-10
Filing date: 2013-05-10
Publication date: 2013-08-14

Abstract

Disclosed is an emergency detection method based on document lexical feature variations. The method comprises: using a computer to crawl news articles published in a designated time period from the current political news reports of news websites; pre-processing every document, wherein the pre-processing includes Chinese word segmentation and part-of-speech tagging, keeping content words and filtering out the other words; obtaining, as a comparison document set, news documents from the same time period as the target documents in each of the preceding k years, together with news documents from the r days before and the r days after; similarly subjecting the comparison document set to Chinese word segmentation and part-of-speech tagging and keeping the content words; extracting all clue words of the target document set from a database; and clustering the clue word set to form emergency descriptions. With this technical scheme, the event space can be mapped back to the clue word space, and clue word subsets are output by clustering, each subset corresponding to a description of one emergency.

Description

An emergency detection method based on document lexical feature variations

Technical field

The present invention proposes an emergency detection method based on changes in lexical features across document sets from comparable time periods. By analyzing differences in word usage between such document sets, the method infers possible new events, in particular emergencies. The invention belongs to the field of text mining and information retrieval.

Background technology

The Internet provides a very convenient means for people to pass on messages, express opinions, and obtain information, and it has become an ocean of information. How to make full use of network information resources, mine information of interest, and track hot events has become a problem of particular concern.

An emergency is an event that may disturb social stability, and it always receives close attention from government agencies and the relevant enterprise departments. In today's society, the Internet has become the main channel for event reporting and information dissemination. Once an emergency occurs, it is usually followed by a large number of follow-up reports. Rapidly detecting emergencies on the Internet and tracking how they evolve plays an important role in government decision-making and in maintaining social stability.

Because emergencies are sudden and unusual, the related reports also differ in wording and expression. The present invention operates on document sets and detects possible emergencies by analyzing changes in word-usage patterns.

Summary of the invention

For convenience of explanation, the following concepts are defined:

Content word: here, a noun, verb, or adjective.

Clue word: also called a discriminative word; a word that can be used to detect an emergency and to express the content of the event. Burst clue words can distinguish reports of emergencies from ordinary reports, in particular from regular (periodic) reports. Here, burst clue words are content words.

Target document set: the document set to be mined. It comprises a number of documents, each corresponding to one online current political news article.

Comparison document set: the document set contrasted with the target document set. Changes in word usage are detected by comparing the target documents with it, in order to judge whether the target documents contain an emergency. Generally, news articles from the same time period in the several years before the target documents are used as the comparison document set.

Emergency: the content jointly expressed by a group of documents in the target document set that differ greatly from the comparison document set; it can be represented by a group of clue words. In a news document set, an event that occurs at time A but did not occur in the same period of several earlier years can be regarded as an emergency.

The purpose of the present invention is to provide a simple method that conveniently detects the emergencies contained in a target document set without manual intervention.

The principle of the present invention is as follows. A chosen measure is used to find the words whose usage differs markedly between the target document set and the comparison document set, and these words are taken as burst clue words. The clue word set is then clustered, and the clustering result is mapped to events, thereby finding the emergencies in the target document set. The measure can be selected as required: for example the TF-IDF method, or another user-defined method. TF-IDF is a classic computation in information retrieval, in which TF(t) denotes the frequency with which word t occurs in a document (term frequency), DF(t) denotes the number of documents in which word t occurs (document frequency), and IDF(t), the inverse document frequency of t, can be the reciprocal of DF(t) or some other variant of it. If word t occurs very frequently in a certain document and seldom occurs in other documents, then t is a markedly distinctive word, and it also measures a certain difference between that document and the others. The computation of TF and IDF is described in detail in the embodiment section below.

The technical scheme of the present invention is as follows:

An emergency detection method based on document lexical feature variations (see Fig. 1), characterized in that it comprises the following steps:

Step 1: use a computer to crawl, from the current political news reports of news websites (for example Tencent, Sina), the news articles of a designated time period (for example, a given day); each article is represented as one document, and all documents in the time period constitute the target document set; pre-process each document, including Chinese word segmentation and part-of-speech tagging; keep the content words and filter out the other words; store each target document and its processing result in the computer's database;

Step 2: obtain, as the comparison document set, the news documents from the same time period as the target documents in each of the preceding k years, and the news documents from the r days before and the r days after; likewise apply Chinese word segmentation and part-of-speech tagging to the comparison document set, keep the content words, and store each comparison document and its processing result in the computer's database; the values of k and r can be set as required;

Step 3: extract from the database all clue words of the target document set;

Step 4: cluster the clue words to form emergency descriptions.

The emergency detection method is characterized in that step 3 is implemented as follows:

S31: obtain from the database all content words of the target document set and of the comparison document set, together with their word frequencies;

S32: use a chosen information criterion to compute, for each content word in the target document set, how much it differs from the same word in the comparison document set;

S33: sort the words in a chosen order and select the top-ranked words as the clue words of the target document set.

The emergency detection method is characterized in that step 4 is implemented as follows:

S41: build the correlation matrix between the clue words;

S42: on the basis of the correlation matrix built in step S41, cluster the clue word set to obtain several subsets, each subset representing one class and corresponding to one event;

S43: rank all the classes obtained by clustering and output the top-ranked classes, which represent the emergencies.

The emergency detection method is characterized in that, in step 1, a web crawler is used to crawl news documents from the designated news websites every day.

The emergency detection method is characterized in that, in step S32, the TF-IDF value is used as the information criterion; another user-defined information criterion can also be used.

The emergency detection method is characterized in that, in step S33, the words are sorted in descending order of TF-IDF value.

The emergency detection method is characterized in that, in step S41, the degree of correlation between two clue words can be computed by any effective method, such as mutual information or the chi-square value; if there are n clue words, the correlation matrix is an n × n matrix, denoted V, where V(i, j) is the degree of correlation between clue word i and clue word j.

The emergency detection method is characterized in that, in step S42, the clustering method is an existing typical algorithm, such as hierarchical clustering or graph-based clustering; another user-defined clustering algorithm can also be used.

The emergency detection method is characterized in that, in step S43, the classes are sorted in descending order of the frequency with which their word sets occur in the target document set; other criteria can also be used.

With the technical scheme provided by the invention, the event space can be mapped back to the clue word space; clue word subsets are output by clustering, and each subset corresponds to the description of one emergency.

Description of drawings

Fig. 1 is a schematic flow chart of the method of the invention.

Fig. 2 shows an example of the emergencies obtained.

Embodiment

The present invention is further described below by way of example. The example is provided to help in understanding the invention, and those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to the content disclosed in the example; the scope of protection is defined by the claims.

In this example, suppose the target document set is the set of news documents from May 2008 (for example, current political news obtained from www.qq.com), and the comparison document set is the set of news documents from every May between 2000 and 2007. The emergencies to be detected are then events that occurred in May 2008 and are not periodic events that recur every May. It should be noted in particular that, when emergencies are analyzed in practice, one day's news documents are generally used as the target document set, and the comparison document set may also draw on documents within a certain time window before and after that day. For example, to analyze the emergencies of May 12, 2008, the comparison set can include documents from the r days before May 12 (for example, the preceding 10 days) to the r days after it.
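The choice of comparison dates can be sketched as follows. This is only an illustration under one reading of the text: the same calendar day in each of the preceding k years, plus the r days on either side of the target day. The function name `comparison_dates` and its parameters are hypothetical and do not come from the patent.

```python
from datetime import date, timedelta
from typing import List


def comparison_dates(target: date, k: int, r: int) -> List[date]:
    """Return the dates whose news would form the comparison document set."""
    dates = []
    # Same calendar day in each of the preceding k years.
    for years_back in range(1, k + 1):
        try:
            dates.append(target.replace(year=target.year - years_back))
        except ValueError:
            # 29 February has no counterpart in a non-leap year; skip it.
            continue
    # The r days before and the r days after the target day (target excluded).
    for offset in range(1, r + 1):
        dates.append(target - timedelta(days=offset))
        dates.append(target + timedelta(days=offset))
    return sorted(dates)


window = comparison_dates(date(2008, 5, 12), k=2, r=1)
```

For May 12, 2008 with k = 2 and r = 1, the window holds May 12 of 2006 and 2007 together with May 11 and May 13, 2008.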

First, word information must be obtained. Here, a word together with its part of speech is used as the word information in a document. For example, for the word "earthquake", whose part of speech is noun (denoted "NN"), the string "earthquake#NN" represents this specific word. Only the content words in a document are considered.
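A minimal sketch of this representation is given below, assuming the text has already been segmented and tagged by some external tool. Only the noun tag "NN" appears in the source; the verb and adjective tags "VV" and "JJ", the name `CONTENT_TAGS`, and the function `content_word_keys` are assumptions made for illustration.

```python
# Assumed part-of-speech tags for nouns, verbs and adjectives.
# Only "NN" is mentioned in the source; "VV" and "JJ" are placeholders.
CONTENT_TAGS = {"NN", "VV", "JJ"}


def content_word_keys(tagged_tokens):
    """Keep content words only and encode each one as 'word#TAG'."""
    return [f"{word}#{tag}" for word, tag in tagged_tokens if tag in CONTENT_TAGS]


# Output of a hypothetical segmenter and tagger for one sentence.
tagged = [("earthquake", "NN"), ("strike", "VV"), ("in", "P"), ("severe", "JJ")]
keys = content_word_keys(tagged)
```

Keeping the tag inside the key means that the same surface form used as a noun and as a verb is counted as two different words.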

The measure of difference can be an existing standard or a user-defined one. Here the TF-IDF value is adopted. The main idea of TF-IDF is that if a word or phrase occurs with high frequency in one article and seldom occurs in other articles, the word is considered to have good discriminative power. The present invention takes words with such discriminative power as the burst clue words of the target documents. When TF is computed, the target document set is treated as a single target document (generally, one day is one aggregation unit). Let N be the total number of content word occurrences in it and n the number of occurrences of content word t; then the frequency of t in the target document set is:

TF (t) = \frac{n}{N}

Suppose the total number of documents in the comparison document set is M and the number of documents in which word t occurs is m; then the inverse document frequency of t is:

IDF (t) = \log_{2} \frac{M}{m}

The TF-IDF value of word t is therefore computed as:

TF\text{-}IDF (t) = TF (t) \times IDF (t)

After the TF-IDF value of each word has been computed, the words are sorted in descending order of TF-IDF value, and the top k words are selected as clue words (here k is the number of clue words, not the number of years k used in step 2).
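The computation above can be sketched in a few lines. The sketch follows the stated definitions (TF over the pooled target set, IDF over the comparison set), but the source does not say what to do when a word never occurs in the comparison set (m = 0); treating that case as m = 1 is an assumption, as are the names `tf_idf_scores`, `top_clue_words` and `top_k`. A non-empty comparison set is assumed.

```python
import math
from collections import Counter


def tf_idf_scores(target_docs, comparison_docs):
    """Score each content word of the target set against the comparison set."""
    # The whole target set is treated as one document, as in the text.
    counts = Counter(word for doc in target_docs for word in doc)
    N = sum(counts.values())   # total content-word occurrences in the target set
    M = len(comparison_docs)   # number of comparison documents
    doc_freq = Counter(word for doc in comparison_docs for word in set(doc))
    scores = {}
    for word, n in counts.items():
        # Assumption (not in the source): a word absent from the comparison
        # set is treated as if m = 1, which avoids division by zero.
        m = max(doc_freq[word], 1)
        scores[word] = (n / N) * math.log2(M / m)
    return scores


def top_clue_words(scores, top_k):
    """Return the top_k words in descending order of TF-IDF value."""
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


target_docs = [["earthquake#NN", "rescue#NN", "earthquake#NN"],
               ["earthquake#NN", "meeting#NN"]]
comparison_docs = [["meeting#NN", "economy#NN"], ["meeting#NN"],
                   ["rescue#NN", "meeting#NN"], ["economy#NN"]]
scores = tf_idf_scores(target_docs, comparison_docs)
clues = top_clue_words(scores, top_k=2)
```

In this toy data "meeting#NN" is frequent in the comparison set, so its score is low, while "earthquake#NN" is frequent in the target set and unseen before, so it ranks first.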

With the clue word set available, the next step is to compute the degree of correlation between clue words. Mutual information is taken as the example here. Mutual information is a useful measure in information theory for the correlation between two sets of events: the greater the correlation, the greater the mutual information value. It is commonly used as a measure of association between feature words and classes; if two feature words belong to the same class, their mutual information is large. The mutual information of two words w_{1} and w_{2} is computed as:

MI (w_{1}, w_{2}) = \log_{2} \frac{p (w_{1}, w_{2})}{p (w_{1}) p (w_{2})}

where p(w_{1}, w_{2}) denotes the probability that w_{1} and w_{2} appear together in one document, computed as f(w_{1}, w_{2})/T. Here f(w_{1}, w_{2}) is the number of target documents in which w_{1} and w_{2} both appear, and T is the total number of documents in the target document set. Similarly, p(w_{1}) is computed as f(w_{1})/T, where f(w_{1}) is the number of target documents in which w_{1} appears; p(w_{2}) is computed in the same way.
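The following sketch builds the n × n correlation matrix V from document-level co-occurrence counts, using the mutual information formula above. The source does not define the value for pairs that never co-occur (the logarithm is undefined there) or for the diagonal; leaving both at 0 is an assumption, and the function name `mutual_information_matrix` is hypothetical.

```python
import math
from itertools import combinations


def mutual_information_matrix(clue_words, target_docs):
    """Build the symmetric matrix V with V[i][j] = MI(clue_words[i], clue_words[j])."""
    T = len(target_docs)
    doc_sets = [set(doc) for doc in target_docs]
    # f[w]: number of target documents that contain w.
    f = {w: sum(w in d for d in doc_sets) for w in clue_words}
    n = len(clue_words)
    V = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        wi, wj = clue_words[i], clue_words[j]
        f_ij = sum(wi in d and wj in d for d in doc_sets)
        if f_ij == 0:
            # Assumption: pairs that never co-occur keep the value 0.
            continue
        mi = math.log2((f_ij / T) / ((f[wi] / T) * (f[wj] / T)))
        V[i][j] = V[j][i] = mi
    return V


docs = [["earthquake", "rescue"], ["earthquake", "rescue"],
        ["torch", "relay"], ["torch"]]
words = ["earthquake", "rescue", "torch"]
V = mutual_information_matrix(words, docs)
```

Here "earthquake" and "rescue" always appear together, so p(w1, w2) = 0.5 and p(w1) p(w2) = 0.25, giving a mutual information of 1; "earthquake" and "torch" never co-occur and stay at 0.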

Next, a clustering algorithm is applied to the clue word set on the basis of the correlation values to obtain the event set. The clustering algorithm can be an existing one, such as hierarchical clustering or spectral clustering, or another user-defined clustering algorithm. Here, the clustering method proposed by Newman (Newman, 2004) [1] is used.
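As a rough illustration of this step, the sketch below applies a simplified greedy modularity agglomeration in the spirit of Newman's method to a correlation matrix V. It is not the patent's implementation: using only positive correlations as edge weights and stopping at the first merge that no longer increases modularity are both assumptions, V is assumed symmetric, and the name `newman_greedy_clusters` is hypothetical.

```python
def newman_greedy_clusters(words, V):
    """Greedily merge the pair of communities with the largest modularity gain."""
    n = len(words)
    # Assumption: only positive correlations count as edge weights.
    total = sum(V[i][j] for i in range(n) for j in range(n)
                if i != j and V[i][j] > 0)
    if total == 0:
        return [[w] for w in words]
    communities = {i: [i] for i in range(n)}
    # e[i][j]: share of all edge-end weight that links communities i and j.
    e = {i: {j: V[i][j] / total for j in range(n) if j != i and V[i][j] > 0}
         for i in range(n)}
    # a[i]: share of all edge-end weight attached to community i.
    a = {i: sum(e[i].values()) for i in range(n)}
    while True:
        best_gain, pair = 0.0, None
        for i in e:
            for j, e_ij in e[i].items():
                if i < j:
                    gain = 2 * (e_ij - a[i] * a[j])
                    if gain > best_gain:
                        best_gain, pair = gain, (i, j)
        if pair is None:
            break  # no merge increases modularity any further
        i, j = pair
        communities[i].extend(communities.pop(j))
        # Redirect every link of j to the merged community i.
        for k, weight in e.pop(j).items():
            e[k].pop(j, None)
            if k != i:
                e[i][k] = e[i].get(k, 0.0) + weight
                e[k][i] = e[k].get(i, 0.0) + weight
        a[i] += a.pop(j)
    return [[words[x] for x in members] for members in communities.values()]


words = ["earthquake", "rescue", "relief", "torch", "relay", "olympic"]
V = [[0.0] * 6 for _ in range(6)]
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    V[i][j] = V[j][i] = w
clusters = newman_greedy_clusters(words, V)
```

In this toy graph two tightly linked triples are joined by one weak link, and the procedure keeps them as two separate clusters, each of which would be read as one event.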

The clustering result is a partition of the clue word set into several subsets, each representing one class of events. The subsets are then ranked. Many ranking methods are possible; here the subsets are ranked by the total frequency with which their clue words occur in the target document set, and the top-ranked event classes are taken as the emergencies of the time period.
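The ranking just described amounts to a single sort. In the sketch below, `target_counts` is assumed to map each clue word to its number of occurrences in the target document set; the function name `rank_clusters` and the parameter `top` are hypothetical.

```python
def rank_clusters(clusters, target_counts, top=3):
    """Order clusters by the total target-set frequency of their clue words."""
    def total_frequency(cluster):
        return sum(target_counts.get(word, 0) for word in cluster)
    return sorted(clusters, key=total_frequency, reverse=True)[:top]


clusters = [["torch", "relay"], ["earthquake", "rescue"]]
target_counts = {"torch": 5, "relay": 3, "earthquake": 20, "rescue": 7}
ranked = rank_clusters(clusters, target_counts)
```

With these counts the earthquake cluster (total 27) is ranked ahead of the torch cluster (total 8).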

Embodiment 1:

The extraction of emergencies is illustrated below with one day as the time unit.

Fig. 2 lists the top three events extracted when the news documents of each day from May 11 to May 14, 2008 are used in turn as the target document set. The Wenchuan earthquake occurred on May 12, 2008, so choosing the news around that date helps in observing the effectiveness of the proposed method. The comparison document set is drawn from People's Daily articles of the same month in 1998, and the target documents are the current political news of May 11 to 14 from www.qq.com. To obtain clue words, the TF-IDF computation introduced above is used to score the discriminative power of the content words, and the 500 most discriminative words are taken as clue words. The degree of correlation between clue words is computed with the mutual information method introduced above. The clue words are clustered with the method proposed by Newman.

As can be seen from Fig. 2, the three main events on May 11 correspond to the Olympic torch relay, hand, foot and mouth disease, and Mother's Day; the three main events on May 12 correspond to the Wenchuan earthquake, discussion of the college entrance examination, and the Olympic torch relay; the three events on May 13 are all related to the Wenchuan earthquake and correspond to the leadership's attention to the earthquake, rescue and disaster relief, and the areas affected by the earthquake; the three events on May 14 still center on the Wenchuan earthquake: donations, rescue and disaster relief, and relief supplies.

It should be noted that reports on Mother's Day appear on May 11, because Mother's Day in 2008 fell on exactly May 11. According to the idea of the present invention, Mother's Day is periodic and should not be extracted as an event. It appears because the comparison document set used is People's Daily from 1998, which contains few articles related to Mother's Day. The college entrance examination topic appears on May 12 because May 12, 2008 was the first day on which Beijing candidates could submit their college application preferences and also the start of the Ministry of Education's "College Entrance Examination Online Consultation Week", so more reports on this topic appeared.

It can also be seen from Fig. 2 that the emergencies extracted by the method of the invention basically match the way emergencies evolve, namely from onset to climax and then gradual decline. For the Wenchuan earthquake, May 12 was the onset, and one of the three extracted events concerns the earthquake; the event then quickly reached its climax, and the top three events on May 13 and May 14 all concern the Wenchuan earthquake, each representing a different sub-event.

List of references

[1] Newman M E J. Fast algorithm for detecting community structure in networks. Physical Review E, 2004, 69(6): 066133.

[2] Detection method for network hotspots and public opinion. 200910308542.4.

[3] A network public opinion hotspot prediction and analysis method. 200910214401.6.

[4] A classification processing method for Internet public opinion information. 200810147719.2.

Claims

1. An emergency detection method based on document lexical feature variations, characterized in that it comprises the following steps:

Step 1: use a computer to crawl, from the current political news reports of news websites, the news articles of a designated time period; each article is represented as one document, and all documents in the time period constitute the target document set; pre-process each document, including Chinese word segmentation and part-of-speech tagging; keep the content words and filter out the other words; store each target document and its processing result in the computer's database;

Step 2: obtain, as the comparison document set, the news documents from the same time period as the target documents in each of the preceding k years, and the news documents from the r days before and the r days after; likewise apply Chinese word segmentation and part-of-speech tagging to the comparison document set, keep the content words, and store each comparison document and its processing result in the computer's database;

Step 4: cluster the clue words to form emergency descriptions.

2. The emergency detection method as claimed in claim 1, characterized in that step 3 is implemented as follows:

3. The emergency detection method as claimed in claim 1, characterized in that step 4 is implemented as follows:

S41: build the correlation matrix between the clue words;

4. The emergency detection method as claimed in claim 1, characterized in that, in step 1, a web crawler is used to crawl news documents from the designated news websites every day.

5. The emergency detection method as claimed in claim 2, characterized in that, in step S32, the TF-IDF value is used as the information criterion.

6. The emergency detection method as claimed in claim 5, characterized in that, in step S33, the words are sorted in descending order of TF-IDF value.

7. The emergency detection method as claimed in claim 1, characterized in that, in step S41, the degree of correlation between two clue words is computed as mutual information or as the chi-square value.

8. The emergency detection method as claimed in claim 1, characterized in that, in step S42, the clustering method is hierarchical clustering or graph-based clustering.

9. The emergency detection method as claimed in claim 1, characterized in that, in step S43, the classes are sorted in descending order of the frequency with which their word sets occur in the target document set.