CN102831248A - Network hotspot mining method and network hotspot mining device

Info

Publication number: CN102831248A
Application number: CN2012103468279A
Authority: CN
Inventors: 林英杰; 马良; 陈强
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Priority date: 2012-09-18
Filing date: 2012-09-18
Publication date: 2012-12-19
Anticipated expiration: 2032-09-18
Also published as: CN105912670A; CN102831248B

Abstract

The invention discloses a network hotspot mining method and a network hotspot mining device. The device comprises a classification storage module, a filter extraction module, a sorting and combination module, and a hotspot counting module. The classification storage module is adapted to collect network data, classify it, and store it by category. The filter extraction module is adapted to filter the network data of each category according to preset filter rules and to extract keywords from the filtered network data of each category. The sorting and combination module is adapted to sort the keywords extracted from the same item of network data and to combine the sorted keywords, obtaining the keyword groups of each item of network data in each category. The hotspot counting module is adapted to count the occurrences of each keyword group within its category, obtain the network hotspot keyword groups of each category, and display them by category. With this technical scheme, network hotspots can be mined at a more macroscopic level, so that the mining result better reflects the objective state of internet public opinion and more specifically reflects the hotspots of a given field.

Description

Network hotspot mining method and device

Technical field

The present invention relates to the field of Internet communication, and in particular to a network hotspot mining method and device.

Background art

With the development of the Internet, more and more websites have introduced user-generated content (UGC) features. Large numbers of internet users post their opinions and disclose all kinds of news in forums, blogs, and microblogs, and thousands of topics are produced every day. Obtaining network hotspots quickly from this mass of internet information plays a guiding role in understanding social developments and in tracking public opinion trends on the internet.

At present, the hotspot mining method generally adopted in the prior art computes a heat value for each text by a weighted calculation, under predetermined conditions, over the text's forward count, click count, and reply count within a specific time period, and then obtains the hottest texts by sorting on the heat value. However, this prior-art scheme has the following problems: 1. Because only the attributes of individual texts are counted, the hot topics obtained reflect only the popularity of a particular article at the micro level and cannot reflect, at the macro level, the popularity of a particular point of interest among internet users. 2. Because the statistical sample is the full data set and no statistical analysis is performed on the text content, the results are not targeted and cannot reflect the hotspots of a particular field. 3. The prior-art scheme can only count texts with identical features and identical content, so the results are highly repetitive and hard to read.

Summary of the invention

The present invention provides a network hotspot mining method and device to solve the problems that prior-art network hotspot mining results are not macroscopic, cannot reflect the hotspots of a particular field, are highly repetitive, and are hard to read.

The present invention provides a network hotspot mining method, comprising: collecting network data, classifying the network data, and storing it by category; filtering the network data of each category according to preset filter rules, and extracting keywords from the filtered network data of each category; sorting the keywords extracted from the same item of network data, and combining the sorted keywords of that item to obtain the keyword groups of each item of network data in each category; and counting the occurrences of each keyword group within its category, obtaining the network hotspot keyword groups of each category, and displaying them by category.

Optionally, the network data comprises: a text title, the article content corresponding to the text title, and the text attributes corresponding to the text title.

Optionally, the text attributes further comprise at least one of the following: the uniform resource locator (URL) of the text, the source forum or blog of the text, the source column of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.

Optionally, classifying the network data and storing it by category further comprises: classifying the network data according to the article content using automatic text classification, obtaining the classification label corresponding to the network data, and storing the corresponding text title, classification label, and text attributes in an engine; and collecting network data from the engine once every predetermined interval, and storing the collected network data by classification label in different XML files on a designated server.

Optionally, the filter rules further comprise at least one of the following: deleting network data whose text title does not meet a predetermined character count; deleting network data whose publication time does not meet the requirement; deleting network data whose URL contains a predetermined domain name, where the predetermined domain name is one in a preset domain name blacklist, or, alternatively, retaining network data whose URL contains a predetermined domain name; deleting network data whose source column is a predetermined column, where the predetermined column is one in a preset column blacklist, or, alternatively, retaining network data whose source column is a predetermined column; deleting network data whose source does not meet the requirement, where the source is one of: forum, blog, or all posts; deleting network data whose reply count does not meet the requirement; deleting network data whose view count does not meet the requirement; deleting network data whose author does not meet the requirement; and deduplicating the network data.

Optionally, before keywords are extracted from the filtered network data of each category using word segmentation, the method further comprises: filtering prefixes out of the text title according to a preset prefix dictionary.

Optionally, extracting keywords from the filtered network data of each category using word segmentation further comprises: segmenting the filtered text titles of each category using word segmentation to obtain segmentation results, and taking the segmentation results as the keywords.

Optionally, before the keywords extracted from the same item of network data are sorted, the method further comprises: filtering common words out of the extracted keywords according to a preset common-word dictionary.

Optionally, combining the sorted keywords of the same item of network data further comprises: combining the sorted keywords belonging to the same text title according to $C_n^r$, where n is the total number of keywords belonging to the same text title, r ≤ n, and 2 ≤ r ≤ 5.

Optionally, after the sorted keywords of the same item of network data are combined to obtain the keyword groups of each item of network data in each category, the method further comprises: filtering spam groups out of the keyword groups according to a preset spam dictionary.

Optionally, counting the occurrences of each keyword group within its category and obtaining the network hotspot keyword groups of each category further comprises: counting the number of distinct text titles in which each keyword group occurs within its category, and arranging the keyword groups whose occurrence count exceeds a predetermined threshold in a predetermined order, to obtain the network hotspot keyword groups of each category.

Optionally, after the network hotspot keyword groups of each category are obtained, the method further comprises: merging identical network hotspot keyword groups within the same category; calculating the heat value corresponding to each network hotspot keyword group in each category; and searching for links to the hot events corresponding to the network hotspot keyword groups in each category.

Optionally, displaying by category further comprises: presenting a hotspot report to the user, where the hotspot report comprises: the category to which each network hotspot keyword group belongs, the network hotspot keyword groups of each category within a predetermined time period, the heat value of each network hotspot keyword group in each category, and links to the hot events corresponding to the network hotspot keyword groups in each category; the predetermined time period comprises at least one of the following: hourly, daily, weekly, and monthly.

The present invention also provides a network hotspot mining device, comprising: a classification storage module, adapted to collect network data, classify the network data, and store it by category; a filter extraction module, adapted to filter the network data of each category according to preset filter rules and to extract keywords from the filtered network data of each category; a sorting and combination module, adapted to sort the keywords extracted from the same item of network data and to combine the sorted keywords of that item, obtaining the keyword groups of each item of network data in each category; and a hotspot counting module, adapted to count the occurrences of each keyword group within its category, obtain the network hotspot keyword groups of each category, and display them by category.

Optionally, the network data further comprises: a text title, the article content corresponding to the text title, and the text attributes corresponding to the text title.

Optionally, the classification storage module is further adapted to: classify the network data according to the article content using automatic text classification, obtain the classification label corresponding to the network data, and store the corresponding text title, classification label, and text attributes in an engine; and collect network data from the engine once every predetermined interval, and store the collected network data by classification label in different XML files on a designated server.

Optionally, the filter extraction module is further adapted to: before keywords are extracted from the filtered network data of each category using word segmentation, filter prefixes out of the text title according to a preset prefix dictionary.

Optionally, the filter extraction module is further adapted to: segment the filtered text titles of each category using word segmentation to obtain segmentation results, and take the segmentation results as the keywords.

Optionally, the sorting and combination module is further adapted to: before the keywords extracted from the same item of network data are sorted, filter common words out of the extracted keywords according to a preset common-word dictionary.

Optionally, the sorting and combination module is further adapted to: combine the sorted keywords belonging to the same text title according to $C_n^r$, where n is the total number of keywords belonging to the same text title, r ≤ n, and 2 ≤ r ≤ 5.

Optionally, the sorting and combination module is further adapted to: after the sorted keywords of the same item of network data are combined to obtain the keyword groups of each item of network data in each category, filter spam groups out of the keyword groups according to a preset spam dictionary.

Optionally, the hotspot counting module is further adapted to: count the number of distinct text titles in which each keyword group occurs within its category, and arrange the keyword groups whose occurrence count exceeds a predetermined threshold in a predetermined order, to obtain the network hotspot keyword groups of each category.

Optionally, the hotspot counting module is further adapted to: merge identical network hotspot keyword groups within the same category; calculate the heat value corresponding to each network hotspot keyword group in each category; and search for links to the hot events corresponding to the network hotspot keyword groups in each category.

Optionally, the hotspot counting module is further adapted to: present a hotspot report to the user, where the hotspot report comprises: the category to which each network hotspot keyword group belongs, the network hotspot keyword groups of each category within a predetermined time period, the heat value of each network hotspot keyword group in each category, and links to the hot events corresponding to the network hotspot keyword groups in each category; the predetermined time period comprises at least one of the following: hourly, daily, weekly, and monthly.

The beneficial effects of the present invention are as follows:

Hotspot mining is realized using the hot-word calculation principle, and text classification is combined with hotspot mining. This solves the problems that prior-art network hotspot mining results are not macroscopic, cannot reflect the hotspots of a particular field, are highly repetitive, and are hard to read. Network hotspots can be mined at a more macroscopic level, reflecting the macro-level popularity of a particular point of interest among internet users, so that the mining result better reflects the objective state of internet public opinion; repeated articles with identical content are merged more easily; and the hotspots of a particular field are reflected in a more targeted way.

The above is only an overview of the technical scheme of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:

Fig. 1 is a flowchart of the network hotspot mining method of an embodiment of the invention;

Fig. 2 is a schematic diagram of the filter rules of an embodiment of the invention;

Fig. 3 is a detailed flow diagram of the network hotspot mining method of an embodiment of the invention;

Fig. 4 is a structural diagram of the network hotspot mining device of an embodiment of the invention.

Detailed description of the embodiments

Exemplary embodiments of the present disclosure are described below in more detail with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.

To solve the problems that prior-art network hotspot mining results are not macroscopic, cannot reflect the hotspots of a particular field, are highly repetitive, and are hard to read, the invention provides a network hotspot mining method and device. The method and device of the embodiments are realized using automatic text classification and hot-word calculation. The invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.

According to an embodiment of the invention, a network hotspot mining method is provided. Fig. 1 is a flowchart of the network hotspot mining method of the embodiment. As shown in Fig. 1, the method comprises the following processing:

Step 101: collect network data, classify the network data, and store it by category.

The network data in step 101 specifically comprises: a text title, the article content corresponding to the text title, and the text attributes corresponding to the text title. The text attributes specifically comprise at least one of the following: the uniform resource locator (Uniform/Universal Resource Locator, abbreviated URL) of the text, the source forum or blog of the text, the source column of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.

In step 101, classifying the network data and storing it by category specifically comprises:

Step 1: classify the network data according to the article content using automatic text classification, obtain the classification label corresponding to the network data, and store the corresponding text title, classification label, and text attributes in an engine. Here, automatic text classification means using machine learning, with model parameters learned from a small sample, to automatically assign category labels to a text set (or other entities or objects) according to a given classification system or standard.

Step 2: collect network data from the engine once every predetermined interval, and store the collected network data by classification label in different XML files on a designated server. The predetermined interval may be, for example, 1 hour, 6 hours, or 1 day; in embodiments of the invention it can be set flexibly according to the characteristics of the collected data (for example, its update rate).
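The storage half of step 101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (a dict with a `label` key plus attribute fields), the function name `store_by_category`, and the one-file-per-label naming are all assumptions made for the example, and the classifier and the engine are left out entirely.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def store_by_category(records, out_dir):
    """Write one XML file per classification label and return {label: path}.

    `records` is a list of dicts; the hypothetical "label" key holds the
    classification label, every other key is treated as a text attribute.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the collected records by their classification label.
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["label"]].append(rec)

    paths = {}
    for label, items in grouped.items():
        root = ET.Element("posts", {"category": label})
        for rec in items:
            post = ET.SubElement(root, "post")
            for field, value in rec.items():
                if field != "label":
                    ET.SubElement(post, field).text = str(value)
        path = str(out_dir / f"{label}.xml")
        ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
        paths[label] = path
    return paths
```

A scheduler calling this function at the chosen interval (hourly, every 6 hours, daily) would complete the picture; that part is omitted here.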

Step 102: filter the network data of each category according to preset filter rules, and extract keywords from the filtered network data of each category.

Preferably, Fig. 2 is a schematic diagram of the filter rules of the embodiment. As shown in Fig. 2, the filter rules specifically comprise at least one of the following: 1. deleting network data whose text title does not meet a predetermined character count; 2. deleting network data whose publication time does not meet the requirement; 3. deleting network data whose URL contains a predetermined domain name, where the predetermined domain name is one in a preset domain name blacklist, or, alternatively, retaining network data whose URL contains a predetermined domain name; 4. deleting network data whose source column is a predetermined column, where the predetermined column is one in a preset column blacklist, or, alternatively, retaining network data whose source column is a predetermined column; 5. deleting network data whose source does not meet the requirement, where the source is one of: forum, blog, or all posts; 6. deleting network data whose reply count does not meet the requirement; 7. deleting network data whose view count does not meet the requirement; 8. deleting network data whose author does not meet the requirement; 9. deduplicating the network data.

It should be noted that the filter rules of the embodiment are not limited to the 9 rules listed above. Filter rules can be set as needed; for example, a rule may delete network data whose article length does not exceed a predetermined character-count threshold.

In addition, in step 102, before the keywords are extracted, the text title can be prefix-filtered according to a preset prefix dictionary so that the needed keywords are extracted more accurately. For example, unwanted prefixes such as "Mop University Student Base" or "Tianya Zatan" are filtered out; these prefixes take no part in keyword extraction. In embodiments of the invention, word segmentation can be used to extract keywords from the filtered network data of each category. Specifically, the filtered text titles of each category are segmented, and the segmentation results are taken as the keywords. It should be noted that word segmentation is a mature keyword extraction technique in the prior art, and embodiments of the invention may also use other techniques to extract keywords.
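A minimal sketch of prefix filtering followed by keyword extraction is shown below. The function names, the set of trailing separator characters, and the default whitespace tokenizer are assumptions for illustration only; Chinese titles would need a real word segmenter passed in as `segment`, which the source does not name.

```python
def strip_prefix(title, prefixes):
    """Remove a known board prefix from the start of a title, if present."""
    # Try longer prefixes first so that a short prefix cannot shadow a long one.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if title.startswith(prefix):
            # Also drop separator characters left behind after the prefix.
            return title[len(prefix):].lstrip(" :：-|]】")
    return title


def extract_keywords(title, prefixes, segment=str.split):
    """Strip the prefix, then segment the remaining title into keywords.

    `segment` is a placeholder tokenizer (whitespace splitting by default).
    """
    return segment(strip_prefix(title, prefixes))
```

For instance, with the prefix dictionary `{"Tianya Zatan"}`, the title `"Tianya Zatan: fuel prices rise"` is reduced to the keywords `fuel`, `prices`, `rise`, so the board name never reaches the combination step.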

Step 103: sort the keywords extracted from the same item of network data, and combine the sorted keywords of that item to obtain the keyword groups of each item of network data in each category.

Step 103 is realized through hot-word calculation. Hot-word calculation means automatically segmenting and grouping the web page text collected in real time, computing the high-frequency hot keywords, filtering them according to predefined dictionaries and preset rules, and outputting real-time internet hot vocabulary.

In step 103, before the keywords extracted from the same item of network data are sorted, common words among the extracted keywords can be filtered out according to a preset common-word dictionary. Common words here are words such as "original", "repost", and "image set", which need to be filtered out.

Also in step 103, keyword combination means combining the sorted keywords belonging to the same text title according to $C_n^r$, where n is the total number of keywords belonging to the same text title, r ≤ n, and 2 ≤ r ≤ 5.
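As a worked count, assuming the combination referred to here is the standard binomial coefficient (the symbol $N(n)$ is introduced only for this illustration and does not appear in the source), the number of keyword groups produced from a title with $n$ distinct keywords is:

```latex
N(n) = \sum_{r=2}^{\min(n,5)} C_n^r, \qquad C_n^r = \frac{n!}{r!\,(n-r)!}

N(3) = C_3^2 + C_3^3 = 3 + 1 = 4

N(6) = C_6^2 + C_6^3 + C_6^4 + C_6^5 = 15 + 20 + 15 + 6 = 56
```

A title with three keywords therefore yields four groups, and the cap r ≤ 5 means a six-keyword title yields 56 groups, omitting only the single six-word group.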

After step 103 has been carried out, spam groups among the keyword groups can preferably be filtered out according to a preset spam dictionary.

Step 104: count the occurrences of each keyword group within its category, obtain the network hotspot keyword groups of each category, and display them by category.

Step 104 specifically comprises the following processing: count the number of distinct text titles in which each keyword group occurs within its category, and arrange the keyword groups whose occurrence count exceeds a predetermined threshold in a predetermined order, obtaining the network hotspot keyword groups of each category. The predetermined order may be descending order of occurrence count.
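The per-category counting in step 104 can be sketched as below. The input shape (one `(category, groups)` pair per title), the function name, and the default threshold value are assumptions for the example; ties are broken alphabetically only to make the output deterministic.

```python
from collections import Counter, defaultdict


def hot_groups_by_category(title_groups, threshold=2):
    """Count, per category, how many distinct titles contain each keyword group.

    `title_groups` is an iterable of (category, groups) pairs, one per title.
    Returns {category: [(group, title_count), ...]} with only groups whose
    count exceeds `threshold`, sorted by count in descending order.
    """
    counts = defaultdict(Counter)
    for category, groups in title_groups:
        # set() ensures a group is counted at most once per title.
        counts[category].update(set(groups))

    return {
        category: sorted(
            ((group, n) for group, n in counter.items() if n > threshold),
            key=lambda item: (-item[1], item[0]),
        )
        for category, counter in counts.items()
    }
```

Because counts are kept in a separate `Counter` per category, a group that is frequent in one field does not inflate the ranking of another field.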

After the network hotspot keyword groups of each category have been obtained, identical network hotspot keyword groups within the same category can be merged, the heat value corresponding to each network hotspot keyword group in each category can be calculated, and links to the hot events corresponding to the network hotspot keyword groups in each category can be searched for, so as to provide the user with more comprehensive hotspot information.

In step 104, displaying by category means presenting a hotspot report to the user, where the hotspot report comprises: the category to which each network hotspot keyword group belongs, the network hotspot keyword groups of each category within a predetermined time period, the heat value of each network hotspot keyword group in each category, and links to the hot events corresponding to the network hotspot keyword groups in each category. The predetermined time period comprises at least one of the following: hourly, daily, weekly, and monthly.

The technical scheme of the embodiment is illustrated below with reference to the drawings.

Fig. 3 is a detailed flow diagram of the network hotspot mining method of the embodiment. As shown in Fig. 3, the method specifically comprises the following processing:

Step 301: generate a classification model from a custom corpus through the machine learning module, classify the collected network data with the classification model, and store the classification labels together with the text attributes in the engine.

Step 302: collect data from the engine once per hour, and store the data by category in different Extensible Markup Language (XML) files on a designated server.

Step 303: filter the data according to the following filter rules and save the filtered data in a database. The user can manage the filter rules through the filter-rule management backend.

Specifically, the filter rules of the embodiment comprise:

1. Title filter: retain data whose title length is between 5 and 30 characters;

2. Posting-time filter: retain posts whose posting time is the current day;

3. Domain name filter: (1) using fuzzy matching, retain posts whose URL contains a given domain name or word; or (2) retain, by domain name, posts from 30 current-affairs forums and 20 automobile forums, as well as posts whose URL contains "auto"; or retain all posts that satisfy these two kinds of rules, (1) and (2);

4. Column filter: filter by the URL of the board seed; posts whose column title contains certain characters can also be retained, for example, posts whose column title contains the words "entertainment" or "gossip";

5. Domain name blacklist filter: perform a delete operation on the results retained above, removing posts with a certain second-level domain or with a certain word in the second-level URL; for example, among results whose top-level domain is xinhuanet.com, remove those whose domain is 120ask.xinhuanet.com;

6. Column blacklist: perform a delete operation on the results retained above, removing posts with a certain seed or with a certain word in the column name; for example, remove posts whose column name is "Newcomer check-in";

7. Source filter: retain data that matches the filter source, where the filter source is one of: forum, blog, or all posts;

8. Reply-count and click-count filter: retain data whose reply count is within 0-1000; retain data whose click count is within 0-10000;

9. Deduplication: deduplicate according to the URL of the post; posts with the same top-level domain are counted as one post;

10. Filter fields: title, URL, source forum, source board, posting time, author, reply count, view count, and so on;

11. Filter logic order: rules 3 and 4 above are in an "or" relationship with each other; the other filter rules are in an "and" relationship.
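The way these rules combine (rule 11) can be sketched as a single predicate. Everything named here is hypothetical: the `Post` fields, the function `passes_filters`, and the parameter names are assumptions for illustration. Title length is measured in characters via `len`, the source, author, and deduplication rules are not modeled, and the domain and column checks use plain substring matching as a stand-in for the fuzzy matching the text mentions.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass
class Post:
    title: str
    url: str
    column: str
    posted: date
    replies: int
    views: int


def passes_filters(post, today, domains, column_words,
                   domain_blacklist=(), column_blacklist=()):
    """Return True if the post survives the filter rules sketched here."""
    host = urlparse(post.url).hostname or ""
    domain_ok = any(d in post.url for d in domains)            # rule 3
    column_ok = any(w in post.column for w in column_words)    # rule 4
    return (
        5 <= len(post.title) <= 30                             # rule 1
        and post.posted == today                               # rule 2
        and (domain_ok or column_ok)                           # rule 11: 3 OR 4
        and host not in domain_blacklist                       # rule 5
        and not any(w in post.column for w in column_blacklist)  # rule 6
        and 0 <= post.replies <= 1000                          # rule 8
        and 0 <= post.views <= 10000                           # rule 8
    )
```

Keeping the domain and column checks as one OR-ed term while everything else is AND-ed mirrors rule 11 directly: a post reaches the later stages if either its URL or its column matches, but it must still pass every other rule.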

Step 304: extract keywords from all text titles. One title has several keywords; the title is segmented using word segmentation, and the segmentation results are the title's keywords. Preferably, the title is prefix-filtered before segmentation; prefixes such as "Mop University Student Base" or "Tianya Zatan" take no part in segmentation. The user can manage the prefixes to be filtered through the prefix management backend.

Step 305: hotspot keyword group calculation:

Step 1: filter out common words in the segmentation results (for example, words such as "original", "repost", and "image set"). The user can manage the common words to be filtered through the common-word management backend.

Step 2: sort the filtered keywords (for example, if the keywords extracted from a title are b, c, a, they become a, b, c after sorting).

Step 3: combine the keywords of each title. The keywords of each title are combined according to $C_n^r$, with the combination formula $C_n^r = \frac{n!}{r!\,(n-r)!}$; only groups of 2 to 5 words are kept.

The sorting and combination of keywords into groups is illustrated below with an example.

Title one yields the keywords b, a, c; after sorting they are a, b, c, forming the groups ab, bc, ac, abc.

Title two yields the keywords c, b, d; after sorting they are b, c, d, forming the groups bc, cd, bd, bcd.

Title three yields the keywords b, c, forming the group bc.

The group ranking formed from these three titles is therefore: bc (3), ab (1), ac (1), cd (1), bd (1), abc (1), bcd (1).

Step 4: filter out spam groups, removing groups such as "query ### prize", "### phone", "### consultation", and "mobile ### prize". The user can manage the spam groups to be filtered through the spam-group management backend.

Step 306: form the hotspot keyword group ranking list. Count the number of titles behind each hotspot keyword group, sort in descending order of title count, and keep the groups whose title count is more than 2; this parameter can be adjusted according to the actual data.
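The sorting, combination, and counting in steps 305 and 306 can be sketched with the standard library. The function name `keyword_groups` and the plain string concatenation of keywords are assumptions chosen to match the single-letter example with titles one to three above; real keywords would need a separator or tuple keys. Common-word and spam filtering are omitted.

```python
from collections import Counter
from itertools import combinations


def keyword_groups(keywords, min_size=2, max_size=5):
    """Sort a title's keywords and return every combination of 2 to 5 of them."""
    ordered = sorted(set(keywords))
    groups = []
    for r in range(min_size, min(len(ordered), max_size) + 1):
        # combinations() preserves the sorted order inside each group, so the
        # same set of words always produces the same group string.
        groups.extend("".join(combo) for combo in combinations(ordered, r))
    return groups


# Keywords extracted from the three example titles.
titles = [["b", "a", "c"], ["c", "b", "d"], ["b", "c"]]

ranking = Counter()
for title_keywords in titles:
    ranking.update(keyword_groups(title_keywords))

# Groups ordered by the number of titles behind them, most frequent first.
ranked = ranking.most_common()
```

Run on the three example titles, `bc` is counted three times and the other six groups once each, which matches the ranking given in the example. Sorting before combining is what lets titles that list the same words in a different order contribute to the same group.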

In summary, the technical solution of the embodiment of the invention mines hotspots using the hot-word calculation principle and combines text classification with hotspot mining. This solves the problems in the prior art that network hotspot mining results are not macroscopic, cannot reflect the hotspots of individual fields, and are highly repetitive and hard to read. Network hotspots can be mined at a more macroscopic level, reflecting how much attention netizens as a whole pay to a given hotspot, so that the mining result better reflects the objective facts of internet public opinion; repeated articles with the same content are merged more easily; and the hotspots of a particular field can be reflected in a more targeted way.

According to an embodiment of the invention, a network hotspot mining device is provided. Fig. 4 is a structural diagram of the network hotspot mining device of the embodiment of the invention. As shown in Fig. 4, the device comprises: a classification and storage module 40, a filtering and extraction module 42, a sorting and combination module 44 and a hotspot statistics module 46. Each module is described in detail below.

The classification and storage module 40 is adapted to collect network data, classify the network data and store it by category;

The network data specifically comprises: a text title, the article content corresponding to the text title, and text attributes corresponding to the text title. The text attributes comprise at least one of the following: the URL corresponding to the text, the source forum/blog of the text, the source section of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.

The classification and storage module 40 is specifically adapted to: 1. classify the network data by article content using automatic text classification, obtain the classification label corresponding to the network data, and store the corresponding text title, classification label and text attributes in an engine. Automatic text classification here means using machine learning, with model parameters learned from a small sample, to label a text collection (or other entities or objects) automatically according to a given classification system or standard. 2. collect network data from the engine once every predetermined interval, and store the collected network data by classification label in different XML files on a designated server. The predetermined interval may be 1 hour, 6 hours or 1 day; in the embodiment of the invention it can be set flexibly according to the characteristics of the collected data (for example, its update rate).
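Storage by classification label could look roughly like the sketch below, which builds one XML document per label. The record keys (`label`, `url`, `title`), the element names and the function name `build_category_xml` are assumptions; the source only states that data is stored by label in different XML files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_category_xml(records):
    """Group records by classification label and return one XML string per label."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["label"]].append(record)

    documents = {}
    for label, items in grouped.items():
        root = ET.Element("category", {"name": label})
        for item in items:
            post = ET.SubElement(root, "post", {"url": item["url"]})
            ET.SubElement(post, "title").text = item["title"]
        documents[label] = ET.tostring(root, encoding="unicode")
    return documents
```

Each returned string could then be written to a file named after its label, and the whole function run once per collection interval.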

The filtering and extraction module 42 is adapted to filter the network data under each category according to preset filtering rules, and to extract keywords from the filtered network data under each category;

In the embodiment of the invention, Fig. 2 is a schematic diagram of the filtering rules. As shown in Fig. 2, the filtering rules comprise at least one of the following: 1. delete network data whose text title does not meet a predetermined character count; 2. delete network data whose publication time does not meet the requirement; 3. delete network data whose URL contains a predetermined domain name, where the predetermined domain name is a domain name in a preset domain name blacklist, or alternatively keep network data whose URL contains a predetermined domain name; 4. delete network data whose source section is a predetermined section, where the predetermined section is a section in a preset section blacklist, or alternatively keep network data whose source section is a predetermined section; 5. delete network data whose source does not meet the requirement, where the source is one of: forum, blog, or all posts; 6. delete network data whose reply count does not meet the requirement; 7. delete network data whose view count does not meet the requirement; 8. delete network data whose author does not meet the requirement; 9. deduplicate the network data.
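A few of these rules are sketched below under stated assumptions. The field names of `Post`, the thresholds in `FilterRules`, substring matching for blacklisted domains, and deduplication by exact title are all hypothetical choices; the source leaves the concrete criteria to configuration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Post:
    title: str
    url: str
    published: datetime
    replies: int
    views: int


@dataclass
class FilterRules:
    min_title_len: int = 4            # rule 1 (assumed bounds)
    max_title_len: int = 40
    earliest: datetime = datetime(2012, 1, 1)  # rule 2 (assumed cut-off)
    domain_blacklist: tuple = ()      # rule 3
    min_replies: int = 0              # rule 6
    min_views: int = 0                # rule 7


def apply_rules(posts, rules):
    """Return the posts that pass every configured rule, in input order."""
    kept, seen_titles = [], set()
    for post in posts:
        if not rules.min_title_len <= len(post.title) <= rules.max_title_len:
            continue
        if post.published < rules.earliest:
            continue
        if any(domain in post.url for domain in rules.domain_blacklist):
            continue
        if post.replies < rules.min_replies or post.views < rules.min_views:
            continue
        if post.title in seen_titles:  # rule 9: exact-title deduplication (assumption)
            continue
        seen_titles.add(post.title)
        kept.append(post)
    return kept
```

Keeping the rules in a separate configuration object mirrors the idea that the rules are managed from a back end rather than hard-coded.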

In addition, to extract the required keywords more effectively, the filtering and extraction module 42 is further adapted to apply prefix filtering to the text titles, before extracting keywords, according to a preset prefix dictionary, for example filtering out unwanted prefixes such as "Mop University Student Base" or "Tianya Zatan". These prefixes do not take part in keyword extraction. Moreover, in the embodiment of the invention, the filtering and extraction module 42 may use word segmentation to extract keywords from the filtered network data under each category; specifically, it may segment the filtered text titles under each category, obtain the segmentation result, and take the segmentation result as the keywords. It should be noted that word segmentation is a mature keyword extraction technique in the prior art, and the embodiment of the invention may also use other techniques to extract keywords.

The sorting and combination module 44 is adapted to sort the keywords extracted from the same item of network data and to combine the sorted keywords of the same item of network data, obtaining the keyword phrases of each item of network data under each category;

The sorting and combination module 44 performs the above processing using hot-word calculation. Hot-word calculation means automatically segmenting and grouping web page text collected in real time, computing high-frequency hotspot keywords, filtering them according to predefined dictionaries and preset rules, and outputting real-time internet hotspot vocabulary.

Before sorting the keywords extracted from the same item of network data, the sorting and combination module 44 may filter the common words among the extracted keywords according to a preset common-word dictionary. Common words are words such as "original", "reprint" and "photo set", which need to be filtered out.

Keyword combination by the sorting and combination module 44 means that the module combines the sorted keywords belonging to the same text title according to $C_n^r = \frac{n!}{r!\,(n-r)!}$, where n is the total number of keywords belonging to the same text title, r ≤ n and 2 ≤ r ≤ 5.
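As a worked check of how many phrases one title produces, the following uses n and r as defined above. The symbol P(n) for the total number of phrases per title is introduced here only for illustration and does not appear in the source.

```latex
% Number of phrases generated from one title with n keywords,
% keeping only phrase lengths r with 2 <= r <= min(n, 5).
\[
  C_n^r = \frac{n!}{r!\,(n-r)!}, \qquad
  P(n) = \sum_{r=2}^{\min(n,5)} C_n^r
\]
% n = 3 (keywords a, b, c): phrases ab, ac, bc, abc.
\[
  P(3) = C_3^2 + C_3^3 = 3 + 1 = 4
\]
% n = 6: phrases of 6 words are discarded because r <= 5.
\[
  P(6) = C_6^2 + C_6^3 + C_6^4 + C_6^5 = 15 + 20 + 15 + 6 = 56
\]
```

The cap r ≤ 5 keeps the number of phrases per title bounded in practice, since long titles would otherwise generate a rapidly growing number of combinations.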

Preferably, after combining the sorted keywords of the same item of network data and obtaining the keyword phrases of each item of network data under each category, the sorting and combination module 44 is further adapted to filter the junk phrases among the keyword phrases according to a preset junk dictionary.

The hotspot statistics module 46 is adapted to count the number of occurrences of the keyword phrases under the category to which they belong, to obtain the network hotspot phrases under each category, and to display them by category.

The hotspot statistics module 46 is specifically adapted to: count the number of occurrences of each keyword phrase in different text titles under the category to which it belongs, and arrange the keyword phrases whose number of occurrences is greater than a predetermined threshold in a predetermined order, obtaining the network hotspot phrases under each category.

After the network hotspot phrases under each category have been obtained, the hotspot statistics module 46 is further adapted to: merge identical network hotspot phrases under the same category; calculate the heat value corresponding to the network hotspot phrases under each category; and search for links to the hotspot events corresponding to the network hotspot phrases under each category.

Display by category by the hotspot statistics module 46 means showing a hotspot report to the user. The hotspot report comprises: the category to which each network hotspot phrase belongs, the network hotspot phrases under each category within a predetermined time period, the heat value corresponding to the network hotspot phrases under each category, and the links to the hotspot events corresponding to the network hotspot phrases under each category. The predetermined time period comprises at least one of the following: hourly, daily, weekly and monthly.
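Merging identical phrases and assembling the report could be sketched as follows. The source does not define how the heat value is calculated; taking it as the summed title count is purely an assumption, as are the function name `build_report` and the tuple layout of the input.

```python
from collections import defaultdict


def build_report(entries, period):
    """Build a per-category hotspot report.

    entries: iterable of (category, phrase, title_count, link) tuples.
    period:  label of the reporting period, e.g. "daily".
    """
    merged = defaultdict(lambda: {"heat": 0, "links": []})
    for category, phrase, title_count, link in entries:
        item = merged[(category, phrase)]   # identical phrases in a category merge here
        item["heat"] += title_count         # assumption: heat value = summed title count
        if link not in item["links"]:
            item["links"].append(link)

    report = defaultdict(list)
    for (category, phrase), item in merged.items():
        report[category].append({
            "phrase": phrase,
            "heat": item["heat"],
            "links": item["links"],
            "period": period,
        })
    for rows in report.values():
        rows.sort(key=lambda row: -row["heat"])  # hottest phrase first
    return dict(report)
```

A real heat value might also weigh reply counts or view counts from the text attributes; the source leaves this open.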

Step 301: a classification model is generated from a custom corpus by a machine learning module; the classification and storage module 40 classifies the collected network data with the classification model and stores the classification labels in the engine together with the text attributes.

Step 302: the classification and storage module 40 collects data from the engine once per hour and stores the data by category in different Extensible Markup Language (XML) files on a designated server.

Step 303: the filtering and extraction module 42 filters the data according to the filtering rules below and saves the filtered data in a database; the user can manage the filtering rules through a data filtering rule management back end.

Specifically, Fig. 3 is a preferred schematic diagram of the filtering rules of the embodiment of the invention; the filtering rules according to the embodiment of the invention are as shown in Fig. 3.

Step 304: the filtering and extraction module 42 extracts keywords from all text titles. A title may have several keywords; the title is segmented using word segmentation, and the segmentation result is taken as the keywords of the title. Preferably, prefix filtering is applied to the title before segmentation, so that prefixes such as "Mop University Student Base" or "Tianya Zatan" do not take part in segmentation. The user can manage the prefixes to be filtered through a prefix management back end;

Step 305: the sorting and combination module 44 computes the hotspot phrases in the same way as in steps 1 to 4 of step 305 described above: common words are filtered out, the remaining keywords of each title are sorted and combined according to $C_n^r = \frac{n!}{r!\,(n-r)!}$, only phrases of 2 to 5 words are kept, and junk phrases are filtered out;

Step 306: the hotspot statistics module 46 forms the hotspot phrase ranking list: it counts the number of titles behind each hotspot phrase, sorts in descending order of title count, and keeps the phrases whose title count is greater than 2; this parameter can be adjusted according to the actual data.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may furthermore be divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.

Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the network hotspot mining device according to the embodiment of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not denote any order; these words may be interpreted as names.

Claims

1. A network hotspot mining device, characterized by comprising:

a classification and storage module, adapted to collect network data, classify said network data and store it by category;

a filtering and extraction module, adapted to filter the network data under each category according to preset filtering rules, and to extract keywords from the filtered network data under each category;

a sorting and combination module, adapted to sort said keywords extracted from the same item of network data and to combine the sorted keywords of the same item of network data, obtaining the keyword phrases of each item of network data under each category;

a hotspot statistics module, adapted to count the number of occurrences of said keyword phrases under the category to which they belong, to obtain the network hotspot phrases under each category, and to display them by category.

2. The device according to claim 1, characterized in that said network data further comprises: a text title, the article content corresponding to said text title, and text attributes corresponding to said text title.

3. The device according to claim 1 or 2, characterized in that said text attributes further comprise at least one of the following: the uniform resource locator (URL) corresponding to the text, the source forum/blog of the text, the source section of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.

4. The device according to any one of claims 1 to 3, characterized in that said classification and storage module is further adapted to:

classify said network data by said article content using automatic text classification, obtain the classification label corresponding to said network data, and store the corresponding text title, classification label and text attributes in an engine;

collect network data from said engine once every predetermined interval, and store the collected network data by said classification label in different XML files on a designated server.

5. The device according to any one of claims 1 to 4, characterized in that said filtering rules further comprise at least one of the following:

deleting network data whose text title does not meet a predetermined character count;

deleting network data whose publication time does not meet the requirement;

deleting network data whose URL contains a predetermined domain name, wherein said predetermined domain name is a domain name in a preset domain name blacklist; or keeping network data whose URL contains a predetermined domain name;

deleting network data whose source section is a predetermined section, wherein said predetermined section is a section in a preset section blacklist; or keeping network data whose source section is a predetermined section;

deleting network data whose source does not meet the requirement, wherein said source comprises: forum, blog, or all posts;

deleting network data whose reply count does not meet the requirement;

deleting network data whose view count does not meet the requirement;

deleting network data whose author does not meet the requirement; and

deduplicating the network data.

6. The device according to any one of claims 1 to 5, characterized in that said filtering and extraction module is further adapted to: before using word segmentation to extract keywords from the filtered network data under each category, apply prefix filtering to said text titles according to a preset prefix dictionary.

7. The device according to any one of claims 1 to 6, characterized in that said filtering and extraction module is further adapted to: segment the filtered text titles under each category using word segmentation, obtain the segmentation result, and take said segmentation result as said keywords.

8. The device according to any one of claims 1 to 7, characterized in that said sorting and combination module is further adapted to: before sorting said keywords extracted from the same item of network data, filter the common words among the extracted keywords according to a preset common-word dictionary.

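9. The device according to any one of claims 1 to 8, characterized in that said sorting and combination module is further adapted to: combine the sorted keywords belonging to the same text title according to $C_n^r$, wherein n is the total number of keywords belonging to the same text title, r ≤ n and 2 ≤ r ≤ 5.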

10. The device according to any one of claims 1 to 8, characterized in that said sorting and combination module is further adapted to: after combining the sorted keywords of the same item of network data and obtaining the keyword phrases of each item of network data under each category, filter the junk phrases among said keyword phrases according to a preset junk dictionary.

11. The device according to any one of claims 1 to 10, characterized in that said hotspot statistics module is further adapted to: count the number of occurrences of said keyword phrases in different text titles under the category to which they belong, and arrange the keyword phrases whose number of occurrences is greater than a predetermined threshold in a predetermined order, obtaining the network hotspot phrases under each category.

12. The device according to any one of claims 1 to 11, characterized in that said hotspot statistics module is further adapted to: merge identical network hotspot phrases under the same category; calculate the heat value corresponding to the network hotspot phrases under each category; and search for links to the hotspot events corresponding to the network hotspot phrases under each category.

13. The device according to any one of claims 1 to 12, characterized in that said hotspot statistics module is further adapted to: show a hotspot report to the user, wherein said hotspot report comprises: the category to which each network hotspot phrase belongs, the network hotspot phrases under each category within a predetermined time period, the heat value corresponding to the network hotspot phrases under each category, and the links to the hotspot events corresponding to the network hotspot phrases under each category, and said predetermined time period comprises at least one of the following: hourly, daily, weekly and monthly.

14. A network hotspot mining method, characterized by comprising:

collecting network data, classifying said network data and storing it by category;

filtering the network data under each category according to preset filtering rules, and extracting keywords from the filtered network data under each category;

sorting said keywords extracted from the same item of network data, and combining the sorted keywords of the same item of network data, obtaining the keyword phrases of each item of network data under each category;

counting the number of occurrences of said keyword phrases under the category to which they belong, obtaining the network hotspot phrases under each category, and displaying them by category.

15. The method according to claim 14, characterized in that said network data comprises: a text title, the article content corresponding to said text title, and text attributes corresponding to said text title.

16. The method according to claim 14 or 15, characterized in that said text attributes further comprise at least one of the following: the uniform resource locator (URL) corresponding to the text, the source forum/blog of the text, the source section of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.

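17. The method according to any one of claims 14 to 16, characterized in that classifying said network data and storing it by category further comprises:

classifying said network data by said article content using automatic text classification, obtaining the classification label corresponding to said network data, and storing the corresponding text title, classification label and text attributes in an engine;

collecting network data from said engine once every predetermined interval, and storing the collected network data by said classification label in different XML files on a designated server.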

18. The method according to any one of claims 14 to 17, characterized in that said filtering rules further comprise at least one of the following:

deleting network data whose publication time does not meet the requirement;

deleting network data whose view count does not meet the requirement;

deleting network data whose author does not meet the requirement; and

deduplicating the network data.

19. The method according to any one of claims 14 to 18, characterized in that, before extracting keywords from the filtered network data under each category, said method further comprises:

applying prefix filtering to said text titles according to a preset prefix dictionary.

20. The method according to any one of claims 14 to 19, characterized in that extracting keywords from the filtered network data under each category further comprises:

segmenting the filtered text titles under each category using word segmentation, obtaining the segmentation result, and taking said segmentation result as said keywords.

21. The method according to any one of claims 14 to 20, characterized in that, before sorting said keywords extracted from the same item of network data, said method further comprises:

filtering the common words among the extracted keywords according to a preset common-word dictionary.

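22. The method according to any one of claims 14 to 21, characterized in that combining the sorted keywords of the same item of network data further comprises:

combining the sorted keywords belonging to the same text title according to $C_n^r$, wherein n is the total number of keywords belonging to the same text title, r ≤ n and 2 ≤ r ≤ 5.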

23. The method according to any one of claims 14 to 22, characterized in that, after combining the sorted keywords of the same item of network data and obtaining the keyword phrases of each item of network data under each category, said method further comprises:

filtering the junk phrases among said keyword phrases according to a preset junk dictionary.

24. The method according to any one of claims 14 to 23, characterized in that counting the number of occurrences of said keyword phrases under the category to which they belong and obtaining the network hotspot phrases under each category further comprises:

counting the number of occurrences of said keyword phrases in different text titles under the category to which they belong, and arranging the keyword phrases whose number of occurrences is greater than a predetermined threshold in a predetermined order, obtaining the network hotspot phrases under each category.

25. The method according to any one of claims 14 to 24, characterized in that, after obtaining the network hotspot phrases under each category, said method further comprises:

merging identical network hotspot phrases under the same category;

calculating the heat value corresponding to the network hotspot phrases under each category;

searching for links to the hotspot events corresponding to the network hotspot phrases under each category.

26. The method according to any one of claims 14 to 25, characterized in that said displaying by category further comprises:

showing a hotspot report to the user, wherein said hotspot report comprises: the category to which each network hotspot phrase belongs, the network hotspot phrases under each category within a predetermined time period, the heat value corresponding to the network hotspot phrases under each category, and the links to the hotspot events corresponding to the network hotspot phrases under each category, and said predetermined time period comprises at least one of the following: hourly, daily, weekly and monthly.