CN103246728A - Emergency detection method based on document lexical feature variations - Google Patents

Emergency detection method based on document lexical feature variations

Info

Publication number
CN103246728A
Authority
CN
China
Prior art keywords
word
document
detection method
news
clue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013101702967A
Other languages
Chinese (zh)
Inventor
王厚峰
张龙凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN2013101702967A priority Critical patent/CN103246728A/en
Publication of CN103246728A publication Critical patent/CN103246728A/en
Pending legal-status Critical Current

Abstract

Disclosed is an emergency detection method based on document lexical feature variations. The method comprises: using a computer to crawl, from the current political news reports of news websites, the news articles of a designated time period; pre-processing every document, where the pre-processing includes Chinese word segmentation and part-of-speech tagging; keeping content words and filtering out all other words; obtaining, as a comparison document set, news documents from the same time period as the target documents in the previous k years, together with those from r days before and r days after; applying the same Chinese word segmentation and part-of-speech tagging to the comparison document set and keeping its content words; extracting all clue words of the target document set from a database; and clustering the clue word set to form emergency descriptions. With this technical scheme, the event space is mapped back to the clue word space, and clustering outputs clue word subsets, each of which corresponds to the description of one emergency.

Description

An emergency detection method based on document lexical feature variations
Technical field
The present invention proposes an emergency detection method based on changes of lexical features between document sets from comparable time periods. By analyzing differences in word usage between such document sets, it infers possible new events, in particular emergencies. The invention belongs to the field of text mining and information retrieval.
Background
The growth of online information gives people a very convenient means of passing on messages, expressing opinions and obtaining information. The network has become an ocean of information. How to make full use of network information resources, mine the information of interest and track hot events has become a problem of particular concern.
An emergency is an event that may disturb social stability, and it always receives close attention from government agencies and relevant enterprises. In today's society the network has become the main channel for reporting events and spreading information. Once an emergency occurs, a large number of follow-up reports usually appear. Quickly detecting emergencies on the network and tracking how they evolve plays an important role in government decision-making and in maintaining social stability.
Because emergencies are sudden and unusual, reports about them also differ from other reports in wording and expression. The present invention works on document sets and detects possible emergencies by analyzing changes in patterns of word usage.
Summary of the invention
For ease of explanation, the following concepts are defined:
Content word: here, a noun, verb or adjective.
Clue word: also called a distinguishing word; a word that can be used to detect an emergency and to express the content of the event. Burst clue words distinguish emergency reports from ordinary reports, in particular from regularly recurring reports. Here, burst clue words are content words.
Target document set: the document set to be mined. It contains a number of documents, each corresponding to one online current political news article.
Comparison document set: the document set that is contrasted with the target document set. By comparing word usage between the two sets, the method judges whether the target documents contain an emergency. News articles from the same time period in the several years before the target documents are generally used as the comparison document set.
Emergency: the content jointly expressed by a group of documents in the target document set that differ greatly from the comparison document set; it can be represented by a group of clue words. In a news document set, an event that occurs at time A but did not occur in the same period of several preceding years can be regarded as an emergency.
The purpose of the invention is to provide a simple method that conveniently detects the emergencies contained in a target document set without manual intervention.
The principle of the invention is as follows: a chosen measure is used to find the words that differ markedly between the target document set and the comparison document set, and these words are taken as burst clue words; the clue word set is then clustered, and the clustering result is mapped to events, thereby finding the emergencies of the target document set. The measure can be chosen as needed, for example the TF-IDF method or another user-written method. TF-IDF is a classic computation in information retrieval, in which TF(t) denotes the frequency with which word t occurs in a document (term frequency), DF(t) denotes the number of documents in which t occurs (document frequency), and IDF(t) is the inverse document frequency of t, which may be the reciprocal of DF(t) or some variant of it. If word t occurs very frequently in a certain document while seldom occurring in other documents, then t is a markedly distinctive word, and it also measures a kind of difference between that document and the others. The computation of TF and IDF is described in detail in the embodiment section below.
The technical scheme of the invention is as follows:
An emergency detection method based on document lexical feature variations (see Fig. 1), characterized in that it comprises the following steps:
Step 1: use a computer to crawl, from the current political news reports of news websites (for example Tencent, Sina), the news articles of a designated time period (for example, a certain day); each article is represented as one document, and all documents in the time period form the target document set; pre-process each document, including Chinese word segmentation and part-of-speech tagging; keep content words and filter out other words; store each target document and its processing result in the computer's database;
Step 2: obtain, as the comparison document set, news documents from the same time period as the target documents in the previous k years, together with those from r days before and r days after; apply the same Chinese word segmentation and part-of-speech tagging to the comparison document set, keep content words, and store each comparison document and its processing result in the computer's database; the values of k and r can be set as needed;
Step 3: extract from the database all clue words of the target document set;
Step 4: cluster the clue words to form emergency descriptions.
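The text does not spell out how the k-year and r-day parameters of step 2 combine. As an illustration only, the sketch below assumes a window of r days on either side of the same calendar date in each of the previous k years; the function name `comparison_dates` and this interpretation are assumptions, not part of the patent.

```python
from datetime import date, timedelta


def comparison_dates(target, k, r):
    """List the calendar days whose news would form the comparison set.

    Assumption: for each of the previous k years, take the same calendar
    date as the target plus r days before and r days after it.
    """
    days = []
    for back in range(1, k + 1):
        try:
            anchor = target.replace(year=target.year - back)
        except ValueError:
            # 29 February has no counterpart in a non-leap year; skip it.
            continue
        for offset in range(-r, r + 1):
            days.append(anchor + timedelta(days=offset))
    return days


# Example: target day 12 May 2008, two previous years, one day either side.
window = comparison_dates(date(2008, 5, 12), k=2, r=1)
```

A crawler would then fetch the articles published on each returned day; how articles are fetched and stored is left open here.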
The emergency detection method is characterized in that step 3 is implemented as follows:
S31: obtain from the database all content words of the target document set and of the comparison document set, together with their word frequencies;
S32: use an information criterion to compute how much each content word in the target document set differs from the same word in the comparison document set;
S33: sort the words in a given order and select the top-ranked words as the clue words of the target document set.
The emergency detection method is characterized in that step 4 is implemented as follows:
S41: build the correlation matrix between the clue words;
S42: on the basis of the correlation matrix built in step S41, cluster the clue word set to obtain several subsets, each subset representing one class and corresponding to one event;
S43: sort all classes obtained by the clustering and output the top-ranked classes, which represent the emergencies.
The emergency detection method is characterized in that, in step 1, a web crawler is used to crawl news documents from the designated news websites every day.
The emergency detection method is characterized in that, in step S32, the TF-IDF value is used as the information criterion; another user-written information criterion may also be used.
The emergency detection method is characterized in that, in step S33, the words are sorted in descending order of TF-IDF value.
The emergency detection method is characterized in that, in step S41, the degree of correlation between two clue words can be computed by any effective method, such as mutual information or the chi-square value; if there are n clue words, the correlation matrix is an n × n matrix V, where V(i, j) is the degree of correlation between clue word i and clue word j.
The emergency detection method is characterized in that, in step S42, the clustering method is an existing typical algorithm, such as hierarchical clustering or graph-based clustering, or another user-written clustering algorithm.
The emergency detection method is characterized in that, in step S43, the classes are sorted in descending order of the frequency with which their word sets occur in the target document set; other criteria may also be used.
With the technical scheme provided by the invention, the event space is mapped back to the clue word space; clustering outputs clue word subsets, each of which corresponds to the description of one emergency.
Description of drawings
Fig. 1 is a schematic flow chart of the method of the invention.
Fig. 2 shows an example of emergency extraction.
Embodiment
The invention is further described below through examples. Note that the examples are provided to aid understanding of the invention; those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to what the examples disclose, and the scope of protection is defined by the claims.
In this example, suppose the target document set is the set of news documents from May 2008 (for example, current political news obtained from www.qq.com), and the comparison document set is the set of news documents from every May of 2000 through 2007. The emergencies to be detected are then events that occurred in May 2008 and are not periodic events recurring every May. Note in particular that, when emergencies are actually analyzed, one day's news documents are generally used as the target document set, and the comparison document set can be taken from a time window before and after that date. For example, to analyze the emergencies of May 12, 2008, the comparison document set can cover the period from r days (for example, 10 days) before May 12 to r days after.
First, word information must be obtained. Here, a word together with its part of speech is used as the word information of a document. For example, the word "earthquake" with part of speech noun (written "NN") is represented as "earthquake#NN". Only the content words of a document are considered.
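The word representation above can be sketched as follows. The input is assumed to be the (word, tag) pairs produced by some segmenter and part-of-speech tagger, which is not shown. Only the noun tag "NN" appears in the text; the verb and adjective tags "VV" and "JJ" and the name `content_tokens` are assumptions made for illustration.

```python
# Tags treated as content words. "NN" comes from the text; "VV" (verb) and
# "JJ" (adjective) are assumed and depend on the tagger actually used.
CONTENT_TAGS = {"NN", "VV", "JJ"}


def content_tokens(tagged_words):
    """Keep content words only and encode each as 'word#TAG'."""
    return [f"{word}#{tag}" for word, tag in tagged_words if tag in CONTENT_TAGS]


# One tagged sentence; the determiner "DT" is filtered out.
tokens = content_tokens([("earthquake", "NN"), ("the", "DT"), ("strike", "VV")])
```

Encoding the tag into the token keeps a noun and a verb with the same surface form apart in all later counts.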
The measure of difference can be an existing standard or a self-defined one. Here the TF-IDF value is used. The main idea of TF-IDF is that if a word or phrase occurs frequently in one article and seldom in other articles, it is considered to discriminate well between classes. The invention takes the words with such discriminating power as the burst clue words of the target documents. When TF is computed, the whole target document set is treated as a single target document (generally, one day forms one aggregation unit). Let the total number of content word occurrences in it be N and the number of occurrences of content word t be n; the frequency of word t in the target document set is then:
$$\mathrm{TF}(t) = \frac{n}{N}$$
Suppose the total number of documents in the comparison document set is M and the number of those documents in which word t occurs is m; the inverse document frequency of t is then:
$$\mathrm{IDF}(t) = \log_2 \frac{M}{m}$$
The TF-IDF value of word t is therefore computed as
TF-IDF(t) = TF(t) × IDF(t)
After the TF-IDF value of every word has been computed, the words are sorted in descending order of TF-IDF value and the top k words are chosen as clue words.
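A minimal sketch of this scoring follows, with TF counted over the pooled target set and IDF counted over the comparison set, as described above. The text does not say what to do when a word never occurs in the comparison set (m = 0); the sketch assumes m is then treated as 1, which gives such words the largest possible IDF. The name `clue_words` and the tie-breaking by word are likewise assumptions.

```python
import math
from collections import Counter


def clue_words(target_docs, comparison_docs, top_k):
    """Return the top_k (word, score) pairs ranked by TF-IDF.

    Each document is a list of content-word tokens.
    """
    counts = Counter(tok for doc in target_docs for tok in doc)
    total = sum(counts.values())                     # N
    n_docs = len(comparison_docs)                    # M
    doc_freq = Counter(tok for doc in comparison_docs for tok in set(doc))
    scores = {}
    for word, n in counts.items():
        m = max(doc_freq.get(word, 0), 1)            # assumption for m = 0
        scores[word] = (n / total) * math.log2(n_docs / m)
    # Descending by score; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


target = [["quake", "news", "quake"], ["quake", "news"]]
comparison = [["news"], ["news"], ["sport"], ["news", "sport"]]
top = clue_words(target, comparison, top_k=1)
```

In this toy input "quake" is frequent in the target set and absent from the comparison set, so it ranks above "news", which is common in both.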
Once the clue word set is available, the next task is to compute the degree of correlation between clue words. Mutual information is used here as an example. Mutual information is a useful measure in information theory for the correlation between two sets of events: the stronger the correlation, the larger the mutual information. It is commonly used as a measure between feature words and classes; if two feature words belong to the same class, their mutual information is large. The mutual information of two words $w_1$ and $w_2$ is computed as:
$$\mathrm{MI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}$$
Here $p(w_1, w_2)$ is the probability that $w_1$ and $w_2$ appear in the same document, computed as $f(w_1, w_2)/T$, where $f(w_1, w_2)$ is the number of target documents in which $w_1$ and $w_2$ both appear and T is the total number of documents in the target document set. Similarly, $p(w_1)$ is computed as $f(w_1)/T$, where $f(w_1)$ is the number of target documents in which $w_1$ appears. $p(w_2)$ is computed in the same way.
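The correlation computation can be sketched as below, using document-level co-occurrence counts in the target set. The text does not say how pairs that never co-occur are handled (the logarithm is undefined for them); the sketch assumes they are simply left out of the result. The name `mi_matrix` and the sparse dictionary layout are illustrative choices.

```python
import math


def mi_matrix(clues, target_docs):
    """Mutual information for every co-occurring pair of clue words.

    Returns a dict {(w1, w2): MI}; pairs that never share a document are
    omitted (an assumption, since log2(0) is undefined).
    """
    doc_sets = [set(doc) for doc in target_docs]
    total = len(doc_sets)                                       # T
    freq = {w: sum(w in d for d in doc_sets) for w in clues}    # f(w)
    matrix = {}
    for idx, w1 in enumerate(clues):
        for w2 in clues[idx + 1:]:
            joint = sum(w1 in d and w2 in d for d in doc_sets)  # f(w1, w2)
            if joint == 0:
                continue
            p12 = joint / total
            matrix[(w1, w2)] = math.log2(
                p12 / ((freq[w1] / total) * (freq[w2] / total))
            )
    return matrix


docs = [["a", "b"], ["a", "b"], ["c"], ["c"]]
V = mi_matrix(["a", "b", "c"], docs)
```

Since MI is symmetric, only one entry per unordered pair is stored; a dense n × n matrix V could be filled from this dictionary if needed.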
Next, a clustering algorithm is applied to the clue word set on the basis of the correlation values to obtain the event set. The clustering algorithm can be an existing one, such as hierarchical clustering or spectral clustering, or another user-written clustering algorithm. Here the clustering method proposed by Newman (Newman, 2004) [1] is chosen.
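As a rough illustration of this kind of clustering, the sketch below performs greedy agglomeration by modularity gain in the spirit of reference [1]: every clue word starts in its own community, the pair of connected communities with the largest modularity gain is merged repeatedly, and the partition with the highest modularity is kept. It is not the inventors' implementation. It assumes the correlation values have been turned into positive edge weights (for example by keeping only pairs with positive mutual information), and clue words without any edge do not appear in the output.

```python
def newman_greedy(edges):
    """Greedy modularity clustering of a weighted undirected graph.

    edges maps (u, v) pairs with u != v to positive weights.
    Returns the best partition found, as a list of sets of nodes.
    """
    total = sum(edges.values())
    members, e, a = {}, {}, {}
    for (u, v), w in edges.items():
        for x in (u, v):
            if x not in members:
                members[x], e[x], a[x] = {x}, {}, 0.0
        half = w / (2 * total)          # share of edge ends between u and v
        e[u][v] = e[u].get(v, 0.0) + half
        e[v][u] = e[v].get(u, 0.0) + half
        a[u] += half
        a[v] += half
    q = -sum(x * x for x in a.values())  # modularity of all-singleton start
    best_q, best = q, [set(c) for c in members.values()]
    while True:
        # Modularity gain of merging each pair of connected communities.
        gains = [(2 * (e[i][j] - a[i] * a[j]), i, j)
                 for i in e for j in e[i] if i < j]
        if not gains:
            break
        dq, i, j = max(gains)
        for k, w in e.pop(j).items():    # fold community j into community i
            del e[k][j]
            if k != i:
                e[i][k] = e[i].get(k, 0.0) + w
                e[k][i] = e[i][k]
        a[i] += a.pop(j)
        members[i] |= members.pop(j)
        q += dq
        if q > best_q:
            best_q, best = q, [set(c) for c in members.values()]
    return best


# Two triangles joined by a single bridge edge.
graph = {("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1,
         ("d", "e"): 1, ("d", "f"): 1, ("e", "f"): 1, ("c", "d"): 1}
clusters = newman_greedy(graph)
```

Recomputing all gains on every pass keeps the sketch short; for 500 clue words this is affordable, while larger graphs would call for the incremental bookkeeping described in the literature.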
The clustering result is a partition of the clue word set into several subsets, each of which represents one class of event. The subsets are then sorted, and several sorting methods are possible. Here the subsets are ordered by the total frequency with which their clue words occur in the target document set, and the top-ranked event classes are taken as the emergencies of the time period.
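The ranking step can be sketched as follows, ordering clusters by the summed occurrence count of their clue words in the target document set. The name `rank_events` and the parameter `top_n` are illustrative assumptions.

```python
from collections import Counter


def rank_events(clusters, target_docs, top_n):
    """Return the top_n clusters by total clue-word frequency in the target set."""
    counts = Counter(tok for doc in target_docs for tok in doc)
    ranked = sorted(
        clusters,
        key=lambda cluster: sum(counts[word] for word in cluster),
        reverse=True,
    )
    return ranked[:top_n]


docs = [["quake", "rescue", "quake"], ["torch"]]
events = rank_events([{"torch"}, {"quake", "rescue"}], docs, top_n=1)
```

Each returned set of clue words is then read as the description of one emergency.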
Embodiment 1:
The extraction of emergencies is illustrated below with one day as the time unit.
Fig. 2 lists the top three events extracted when the news documents of each day from May 11 to May 14, 2008 are used in turn as the target document set. The Wenchuan earthquake occurred on May 12, 2008; choosing the news around that date helps to observe the effectiveness of the proposed method. The comparison document set consists of People's Daily articles from the same month of 1998, and the target documents are the current political news of May 11 to 14 from www.qq.com. Clue words are obtained by scoring the distinctiveness of content words with the TF-IDF computation introduced above and taking the 500 most distinctive words. The correlation between clue words is computed with the mutual information method introduced above. The clue words are clustered with the method proposed by Newman.
As Fig. 2 shows, the three main events of May 11 are the Olympic torch relay, hand, foot and mouth disease, and Mother's Day; the three main events of May 12 are the Wenchuan earthquake, discussion of college entrance examination issues, and the Olympic torch relay; the three events of May 13 all relate to the Wenchuan earthquake, namely the leadership's attention to the earthquake, rescue and disaster relief, and the areas affected by the earthquake; the three events of May 14 still center on the Wenchuan earthquake, namely donations, rescue and disaster relief, and relief supplies.
Note that reports on Mother's Day appear on May 11; Mother's Day in 2008 fell exactly on May 11. According to the idea of the invention, Mother's Day is periodic and should not be extracted as an event. It appears because the comparison document set is the People's Daily of 1998, which contains few articles related to Mother's Day. College entrance examination issues appear on May 12 because May 12, 2008 was the first day on which Beijing candidates submitted their college application preferences and also the start of the Ministry of Education's "college entrance examination online consultation week", so many reports on college entrance examination topics appeared.
Fig. 2 also shows that the emergencies extracted with the method of the invention largely match the way an emergency evolves, that is, from its onset to a climax and then fading away. For the Wenchuan earthquake, May 12 is the onset, and one of the three extracted events concerns the earthquake; the event then quickly reaches its climax, and the top three events of both May 13 and May 14 all concern the Wenchuan earthquake, each representing a different sub-event.
References
[1] Newman M E J. Fast algorithm for detecting community structure in networks. Physical Review E, 2004, 69(6): 066133
[2] Detection method for network hot spots and public opinion - 200910308542.4
[3] A network public opinion hot spot prediction and analysis method - 200910214401.6
[4] A classification processing method for Internet public opinion information - 200810147719.2

Claims (9)

1. An emergency detection method based on document lexical feature variations, characterized in that it comprises the following steps:
Step 1: use a computer to crawl, from the current political news reports of news websites, the news articles of a designated time period; each article is represented as one document, and all documents in the time period form the target document set; pre-process each document, including Chinese word segmentation and part-of-speech tagging; keep content words and filter out other words; store each target document and its processing result in the computer's database;
Step 2: obtain, as the comparison document set, news documents from the same time period as the target documents in the previous k years, together with those from r days before and r days after; apply the same Chinese word segmentation and part-of-speech tagging to the comparison document set, keep content words, and store each comparison document and its processing result in the computer's database;
Step 3: extract from the database all clue words of the target document set;
Step 4: cluster the clue words to form emergency descriptions.
2. The emergency detection method of claim 1, characterized in that step 3 is implemented as follows:
S31: obtain from the database all content words of the target document set and of the comparison document set, together with their word frequencies;
S32: use an information criterion to compute how much each content word in the target document set differs from the same word in the comparison document set;
S33: sort the words in a given order and select the top-ranked words as the clue words of the target document set.
3. The emergency detection method of claim 1, characterized in that step 4 is implemented as follows:
S41: build the correlation matrix between the clue words;
S42: on the basis of the correlation matrix built in step S41, cluster the clue word set to obtain several subsets, each subset representing one class and corresponding to one event;
S43: sort all classes obtained by the clustering and output the top-ranked classes, which represent the emergencies.
4. The emergency detection method of claim 1, characterized in that, in step 1, a web crawler is used to crawl news documents from the designated news websites every day.
5. The emergency detection method of claim 2, characterized in that, in step S32, the TF-IDF value is used as the information criterion.
6. The emergency detection method of claim 5, characterized in that, in step S33, the words are sorted in descending order of TF-IDF value.
7. The emergency detection method of claim 1, characterized in that, in step S41, the degree of correlation between two clue words is computed as mutual information or the chi-square value.
8. The emergency detection method of claim 1, characterized in that, in step S42, the clustering method is hierarchical clustering or graph-based clustering.
9. The emergency detection method of claim 1, characterized in that, in step S43, the classes are sorted in descending order of the frequency with which their word sets occur in the target document set.
CN2013101702967A 2013-05-10 2013-05-10 Emergency detection method based on document lexical feature variations Pending CN103246728A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013101702967A CN103246728A (en) 2013-05-10 2013-05-10 Emergency detection method based on document lexical feature variations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013101702967A CN103246728A (en) 2013-05-10 2013-05-10 Emergency detection method based on document lexical feature variations

Publications (1)

Publication Number Publication Date
CN103246728A true CN103246728A (en) 2013-08-14

Family

ID=48926248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013101702967A Pending CN103246728A (en) 2013-05-10 2013-05-10 Emergency detection method based on document lexical feature variations

Country Status (1)

Country Link
CN (1) CN103246728A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766163A (en) * 2015-03-23 2015-07-08 重庆晨网网络科技有限公司 News interview management system
CN106202561A (en) * 2016-07-29 2016-12-07 北京联创众升科技有限公司 Digitized contingency management case library construction methods based on the big data of text and device
CN106547875A (en) * 2016-11-02 2017-03-29 哈尔滨工程大学 A kind of online incident detection method of the microblogging based on sentiment analysis and label
CN108959344A (en) * 2018-04-10 2018-12-07 天津大学 One kind being directed to the dynamic analysis method of vocational education
CN109767026A (en) * 2018-11-30 2019-05-17 三峡大学 A kind of news program time delay prediction method and device
CN110399478A (en) * 2018-04-19 2019-11-01 清华大学 Event finds method and apparatus
CN112732904A (en) * 2020-10-15 2021-04-30 中科曙光南京研究院有限公司 Abnormal emergency detection method and system based on text processing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100125540A1 (en) * 2008-11-14 2010-05-20 Palo Alto Research Center Incorporated System And Method For Providing Robust Topic Identification In Social Indexes
CN102222070A (en) * 2010-04-16 2011-10-19 英业达股份有限公司 Interacted system and method of approximate vocabularies
CN102360378A (en) * 2011-10-10 2012-02-22 南京大学 Outlier detection method for time-series data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GE XU et al.: "Using Multiple Resources in Graph-Based Semi-supervised Sentiment Classification", 《WEB INTELLIGENCE AND INTELLIGENT AGENT TECHNOLOGY (WI-IAT), 2012 IEEE/WIC/ACM INTERNATIONAL CONFERENCES ON》, vol. 3, 7 December 2012 (2012-12-07), pages 132 - 136 *
Liu Xingxing (刘星星): "Research on hot event discovery and automatic extraction of event content features", 《China Master's Theses Full-text Database, Information Science and Technology》, no. 11, 15 November 2009 (2009-11-15), pages 1 - 37 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766163A (en) * 2015-03-23 2015-07-08 重庆晨网网络科技有限公司 News interview management system
CN106202561A (en) * 2016-07-29 2016-12-07 北京联创众升科技有限公司 Digitized contingency management case library construction methods based on the big data of text and device
CN106202561B (en) * 2016-07-29 2019-10-01 北京联创众升科技有限公司 Digitlization contingency management case base construction method and device based on text big data
CN106547875A (en) * 2016-11-02 2017-03-29 哈尔滨工程大学 A kind of online incident detection method of the microblogging based on sentiment analysis and label
CN106547875B (en) * 2016-11-02 2020-05-15 哈尔滨工程大学 Microblog online emergency detection method based on emotion analysis and label
CN108959344A (en) * 2018-04-10 2018-12-07 天津大学 One kind being directed to the dynamic analysis method of vocational education
CN110399478A (en) * 2018-04-19 2019-11-01 清华大学 Event finds method and apparatus
CN109767026A (en) * 2018-11-30 2019-05-17 三峡大学 A kind of news program time delay prediction method and device
CN112732904A (en) * 2020-10-15 2021-04-30 中科曙光南京研究院有限公司 Abnormal emergency detection method and system based on text processing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130814