CN102831248A - Network hotspot mining method and network hotspot mining device - Google Patents

Network hotspot mining method and network hotspot mining device

Info

Publication number
CN102831248A
Authority
CN
China
Prior art keywords
network data
text
network
phrase
categories
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012103468279A
Other languages
Chinese (zh)
Other versions
CN102831248B (en)
Inventor
林英杰
马良
陈强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201610225018.0A (patent CN105912670A)
Priority to CN201210346827.9A (patent CN102831248B)
Publication of CN102831248A
Application granted
Publication of CN102831248B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

The invention discloses a network hotspot mining method and a network hotspot mining device. The device comprises a classification storage module, a filter extraction module, a sequencing combination module, and a hotspot counting module. The classification storage module is adapted to collect network data, classify it, and store it by category. The filter extraction module is adapted to filter the network data of each category according to a preset filtering rule and to extract keywords from the filtered network data of each category. The sequencing combination module is adapted to sort the keywords extracted from the same piece of network data and combine the sorted keywords to obtain the keyword phrases of each piece of network data in each category. The hotspot counting module is adapted to count the occurrences of each keyword phrase within its category, obtain the network hotspot phrases of each category, and display them by category. With this technical scheme, network hotspots can be mined at a more macroscopic level, so that the mining results better reflect the objective facts of Internet public opinion and reflect the hotspots of a particular field in a more targeted way.

Description

Network hotspot mining method and device
Technical field
The present invention relates to the field of Internet communication, and in particular to a network hotspot mining method and device.
Background art
With the development of the Internet, more and more websites have introduced user-generated content (UGC) features. Large numbers of netizens post their opinions and break news of all kinds in forums, blogs, and microblogs, and thousands of topics emerge every day. Obtaining network hotspots more quickly from this mass of Internet information therefore plays a guiding role in understanding social developments and tracking public opinion on the Internet.
At present, the hotspot mining method commonly used in the prior art computes a heat value for each text by applying a predetermined weighted calculation to the forward count, click count, and reply count of the text within a specific time period, and then obtains the hottest texts by sorting on the heat value. However, this prior-art scheme has the following problems: 1. Because statistics are computed only on the attributes of individual texts, the hot topics obtained reflect only the popularity of a particular article at the micro level and cannot reflect, at the macro level, the popularity of a topic that netizens are focusing on. 2. Because the sample set for the statistics is the full data set and no statistical analysis based on text content is performed, the results are not targeted and cannot reflect the hotspots of each field separately. 3. The prior-art scheme can only count texts with identical features and identical content, so the results are highly repetitive and hard to read.
Summary of the invention
The present invention provides a network hotspot mining method and device to solve the problems in the prior art that network hotspot mining results are not macroscopic, cannot reflect the hotspots of individual fields, are highly repetitive, and are hard to read.
The present invention provides a network hotspot mining method, comprising: collecting network data, classifying the network data, and storing it by category; filtering the network data under each category according to preset filtering rules, and extracting keywords from the filtered network data under each category; sorting the keywords extracted from the same piece of network data, and combining the sorted keywords of that piece of network data to obtain the keyword phrases of each piece of network data under each category; and counting the number of occurrences of each keyword phrase under the category to which it belongs, obtaining the network hotspot phrases under each category, and displaying them by category.
Optionally, the network data comprises: a text title, the article content corresponding to the text title, and the text attributes corresponding to the text title.
Optionally, the text attributes further comprise at least one of the following: the uniform resource locator (URL) of the text, the source forum/blog of the text, the source column of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.
Optionally, classifying the network data and storing it by category further comprises: using automatic text classification to classify the network data according to the article content, obtaining the classification label corresponding to the network data, and storing the corresponding text title, classification label, and text attributes in an engine; and collecting network data from the engine once every predetermined interval, and storing the collected network data by classification label in different XML files on a designated server.
Optionally, the filtering rules further comprise at least one of the following: deleting network data whose text title does not meet a predetermined character count; deleting network data whose publication time does not meet the requirement; deleting network data whose URL contains a predetermined domain name, where the predetermined domain name is a domain name in a preset domain name blacklist, or alternatively retaining network data whose URL contains a predetermined domain name; deleting network data whose source column is a predetermined column, where the predetermined column is a column in a preset column blacklist, or alternatively retaining network data whose source column is a predetermined column; deleting network data whose source does not meet the requirement, where the source is one of: forum, blog, or all posts; deleting network data whose reply count does not meet the requirement; deleting network data whose view count does not meet the requirement; deleting network data whose author does not meet the requirement; and deduplicating the network data.
Optionally, before keywords are extracted from the filtered network data under each category using word segmentation, the method further comprises: filtering prefixes from the text title according to a preset prefix dictionary.
Optionally, extracting keywords from the filtered network data under each category using word segmentation further comprises: segmenting the filtered text titles under each category to obtain segmentation results, and using the segmentation results as the keywords.
Optionally, before the keywords extracted from the same piece of network data are sorted, the method further comprises: filtering common words out of the extracted keywords according to a preset common-word dictionary.
Optionally, combining the sorted keywords of the same piece of network data further comprises: combining the sorted keywords belonging to the same text title in groups of r, where n is the total number of keywords belonging to the same text title, r ≤ n, and 2 ≤ r ≤ 5.
Optionally, after the sorted keywords of the same piece of network data are combined to obtain the keyword phrases of each piece of network data under each category, the method further comprises: filtering junk phrases out of the keyword phrases according to a preset junk dictionary.
Optionally, counting the number of occurrences of each keyword phrase under the category to which it belongs and obtaining the network hotspot phrases under each category further comprises: counting the number of different text titles under the category in which each keyword phrase occurs, and arranging the keyword phrases whose occurrence count exceeds a predetermined threshold in a predetermined order to obtain the network hotspot phrases under each category.
Optionally, after the network hotspot phrases under each category are obtained, the method further comprises: merging identical network hotspot phrases under the same category; calculating the heat value corresponding to each network hotspot phrase under each category; and searching for links to the hotspot events corresponding to the network hotspot phrases under each category.
Optionally, displaying by category further comprises: presenting a hotspot report to the user, where the hotspot report comprises: the category to which each network hotspot phrase belongs, the network hotspot phrases under each category within a predetermined time period, the heat value corresponding to each network hotspot phrase under each category, and links to the hotspot events corresponding to the network hotspot phrases under each category; the predetermined time period comprises at least one of the following: hourly, daily, weekly, and monthly.
The present invention also provides a network hotspot mining device, comprising: a classification storage module, adapted to collect network data, classify the network data, and store it by category; a filter extraction module, adapted to filter the network data under each category according to preset filtering rules and to extract keywords from the filtered network data under each category; a sequencing combination module, adapted to sort the keywords extracted from the same piece of network data and to combine the sorted keywords of that piece of network data to obtain the keyword phrases of each piece of network data under each category; and a hotspot counting module, adapted to count the number of occurrences of each keyword phrase under the category to which it belongs, obtain the network hotspot phrases under each category, and display them by category.
Optionally, the network data comprises: a text title, the article content corresponding to the text title, and the text attributes corresponding to the text title.
Optionally, the text attributes further comprise at least one of the following: the uniform resource locator (URL) of the text, the source forum/blog of the text, the source column of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.
Optionally, the classification storage module is further adapted to: use automatic text classification to classify the network data according to the article content, obtain the classification label corresponding to the network data, and store the corresponding text title, classification label, and text attributes in an engine; and collect network data from the engine once every predetermined interval, and store the collected network data by classification label in different XML files on a designated server.
Optionally, the filtering rules further comprise at least one of the following: deleting network data whose text title does not meet a predetermined character count; deleting network data whose publication time does not meet the requirement; deleting network data whose URL contains a predetermined domain name, where the predetermined domain name is a domain name in a preset domain name blacklist, or alternatively retaining network data whose URL contains a predetermined domain name; deleting network data whose source column is a predetermined column, where the predetermined column is a column in a preset column blacklist, or alternatively retaining network data whose source column is a predetermined column; deleting network data whose source does not meet the requirement, where the source is one of: forum, blog, or all posts; deleting network data whose reply count does not meet the requirement; deleting network data whose view count does not meet the requirement; deleting network data whose author does not meet the requirement; and deduplicating the network data.
Optionally, the filter extraction module is further adapted to: before keywords are extracted from the filtered network data under each category using word segmentation, filter prefixes from the text title according to a preset prefix dictionary.
Optionally, the filter extraction module is further adapted to: segment the filtered text titles under each category using word segmentation to obtain segmentation results, and use the segmentation results as the keywords.
Optionally, the sequencing combination module is further adapted to: before the keywords extracted from the same piece of network data are sorted, filter common words out of the extracted keywords according to a preset common-word dictionary.
Optionally, the sequencing combination module is further adapted to: combine the sorted keywords belonging to the same text title in groups of r, where n is the total number of keywords belonging to the same text title, r ≤ n, and 2 ≤ r ≤ 5.
Optionally, the sequencing combination module is further adapted to: after the sorted keywords of the same piece of network data are combined to obtain the keyword phrases of each piece of network data under each category, filter junk phrases out of the keyword phrases according to a preset junk dictionary.
Optionally, the hotspot counting module is further adapted to: count the number of different text titles under the category in which each keyword phrase occurs, and arrange the keyword phrases whose occurrence count exceeds a predetermined threshold in a predetermined order to obtain the network hotspot phrases under each category.
Optionally, the hotspot counting module is further adapted to: merge identical network hotspot phrases under the same category; calculate the heat value corresponding to each network hotspot phrase under each category; and search for links to the hotspot events corresponding to the network hotspot phrases under each category.
Optionally, the hotspot counting module is further adapted to: present a hotspot report to the user, where the hotspot report comprises: the category to which each network hotspot phrase belongs, the network hotspot phrases under each category within a predetermined time period, the heat value corresponding to each network hotspot phrase under each category, and links to the hotspot events corresponding to the network hotspot phrases under each category; the predetermined time period comprises at least one of the following: hourly, daily, weekly, and monthly.
The beneficial effects of the present invention are as follows:
By performing hotspot mining on the principle of hot-word computation, and by combining text classification with hotspot mining, the invention solves the problems in the prior art that network hotspot mining results are not macroscopic, cannot reflect the hotspots of individual fields, are highly repetitive, and are hard to read. It can mine network hotspots at a more macroscopic level and reflect the popularity of the topics netizens focus on, so that the mining results better reflect the objective facts of Internet public opinion; it merges repeated articles with identical content more easily; and it reflects the hotspots of a particular field in a more targeted way.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a flowchart of the network hotspot mining method of an embodiment of the invention;
Fig. 2 is a schematic diagram of the filtering rules of an embodiment of the invention;
Fig. 3 is a detailed flow diagram of the network hotspot mining method of an embodiment of the invention;
Fig. 4 is a structural diagram of the network hotspot mining device of an embodiment of the invention.
Embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
To solve the problems in the prior art that network hotspot mining results are not macroscopic, cannot reflect the hotspots of individual fields, are highly repetitive, and are hard to read, the present invention provides a network hotspot mining method and device, which in the embodiments are implemented using automatic text classification and hot-word computation. The invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.
According to an embodiment of the present invention, a network hotspot mining method is provided. Fig. 1 is a flowchart of the network hotspot mining method of the embodiment of the invention. As shown in Fig. 1, the method comprises the following processing:
Step 101: collect network data, classify the network data, and store it by category.
The network data in step 101 comprises: a text title, the article content corresponding to the text title, and the text attributes corresponding to the text title. The text attributes comprise at least one of the following: the uniform resource locator (Uniform/Universal Resource Locator, abbreviated URL) of the text, the source forum/blog of the text, the source column of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.
In step 101, classifying the network data and storing it by category specifically comprises:
Step 1: use automatic text classification to classify the network data according to the article content, obtain the classification label corresponding to the network data, and store the corresponding text title, classification label, and text attributes in an engine. Here, automatic text classification means using machine learning, with model parameters learned from a small sample, to automatically label a set of texts (or other entities or objects) according to a given taxonomy or standard.
Step 2: collect network data from the engine once every predetermined interval, and store the collected network data by classification label in different XML files on a designated server. The predetermined interval may be, for example, 1 hour, 6 hours, or 1 day; in the embodiment of the invention it can be set flexibly according to the characteristics of the collected data (for example, its update rate).
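The two sub-steps above can be sketched roughly as follows. This is a minimal illustration only: the patent does not name a classification algorithm, so a small multinomial Naive Bayes classifier is swapped in as an assumption, whitespace splitting stands in for real word segmentation, and the function names, record fields, and XML element names are all hypothetical. The sketch returns one XML string per classification label and leaves writing the files and scheduling the periodic collection to the caller.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def train(samples):
    """Learn word counts per label from a small labeled sample.

    samples: list of (tokens, label) pairs (assumed input format).
    """
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for tokens, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def classify(model, tokens):
    """Return the most probable label (Naive Bayes with add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def store_by_label(records, model):
    """Label each record by its article content, then build one XML
    document (as a string) per classification label."""
    roots = {}
    for rec in records:
        label = classify(model, rec["content"].split())
        if label not in roots:
            roots[label] = ET.Element("category", {"label": label})
        item = ET.SubElement(roots[label], "item")
        ET.SubElement(item, "title").text = rec["title"]
        ET.SubElement(item, "url").text = rec["url"]
    return {label: ET.tostring(root, encoding="unicode")
            for label, root in roots.items()}
```

Keeping one XML document per label mirrors the per-category files described in step 2, so that later filtering and counting can run on each category independently.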
Step 102: filter the network data under each category according to preset filtering rules, and extract keywords from the filtered network data under each category.
Preferably, Fig. 2 is a schematic diagram of the filtering rules of the embodiment of the invention. As shown in Fig. 2, in this embodiment the filtering rules comprise at least one of the following: 1. deleting network data whose text title does not meet a predetermined character count; 2. deleting network data whose publication time does not meet the requirement; 3. deleting network data whose URL contains a predetermined domain name, where the predetermined domain name is a domain name in a preset domain name blacklist, or alternatively retaining network data whose URL contains a predetermined domain name; 4. deleting network data whose source column is a predetermined column, where the predetermined column is a column in a preset column blacklist, or alternatively retaining network data whose source column is a predetermined column; 5. deleting network data whose source does not meet the requirement, where the source is one of: forum, blog, or all posts; 6. deleting network data whose reply count does not meet the requirement; 7. deleting network data whose view count does not meet the requirement; 8. deleting network data whose author does not meet the requirement; 9. deduplicating the network data.
It should be noted that the filtering rules in the embodiment of the invention are not limited to the nine rules listed above. The filtering rules can be set as required; for example, a filtering rule may be set to delete network data whose article length does not exceed a predetermined character-count threshold.
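As a rough illustration of how a few of these rules might be applied in one pass, the sketch below combines a domain name blacklist, a minimum reply count, and URL-based deduplication. The record fields (`url`, `replies`), the function names, and the choice of exactly these three rules are assumptions made for the example; the patent leaves the concrete rule set configurable.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the host part of a URL, or an empty string if there is none."""
    return urlparse(url).hostname or ""


def apply_filters(records, domain_blacklist, min_replies=0):
    """Drop blacklisted domains, low-reply posts, and duplicate URLs."""
    seen_urls = set()
    kept = []
    for rec in records:
        host = host_of(rec["url"])
        # Domain blacklist: match the host itself or any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in domain_blacklist):
            continue
        # Reply-count rule (hypothetical lower bound).
        if rec["replies"] < min_replies:
            continue
        # Deduplication: keep only the first record seen for each URL.
        if rec["url"] in seen_urls:
            continue
        seen_urls.add(rec["url"])
        kept.append(rec)
    return kept
```

Additional rules, such as the article-length rule mentioned above, could be added as further `continue` checks in the same loop.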
In addition, in step 102, before the keywords are extracted, the text title can be filtered for prefixes according to a preset prefix dictionary so that the required keywords are extracted more reliably; for example, unneeded prefixes such as "Mop University Student Base" and "Tianya Zatan" are filtered out. These prefixes take no part in keyword extraction. In the embodiment of the invention, word segmentation can be used to extract keywords from the filtered network data under each category. Specifically, word segmentation is applied to the filtered text titles under each category to obtain segmentation results, and the segmentation results are used as the keywords. It should be noted that word segmentation is a mature keyword extraction technique in the prior art; the embodiment of the invention may also use other techniques to extract keywords.
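A minimal sketch of prefix filtering followed by keyword extraction is shown below. The function names are hypothetical, and plain whitespace splitting is used only as a stand-in for a real word segmenter (Chinese titles would need a dedicated segmentation tool, which the caller can pass in through the `segment` parameter).

```python
def strip_prefix(title, prefixes):
    """Remove the longest matching prefix from the prefix dictionary, if any."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if title.startswith(prefix):
            # Also drop separator characters left behind by the prefix.
            return title[len(prefix):].lstrip(" :-")
    return title


def extract_keywords(title, prefixes, segment=str.split):
    """Strip known prefixes, then segment the title into keywords.

    segment: any callable mapping a string to a list of words; the
    default whitespace split is a placeholder assumption.
    """
    return segment(strip_prefix(title, prefixes))
```

Because the prefix is removed before segmentation, its words never reach the keyword list, which matches the statement that prefixes take no part in keyword extraction.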
Step 103: sort the keywords extracted from the same piece of network data, and combine the sorted keywords of that piece of network data to obtain the keyword phrases of each piece of network data under each category.
Step 103 is implemented using hot-word computation, which means: automatically segmenting, grouping, and merging web page text collected in real time, computing high-frequency hotspot keywords, filtering them according to predefined dictionaries and preset rules, and outputting a real-time list of Internet hot words.
In step 103, before the keywords extracted from the same piece of network data are sorted, common words among the extracted keywords can be filtered out according to a preset common-word dictionary. Common words are words such as "original", "reprint", and "picture group", which need to be filtered out.
Also in step 103, combining keywords means: combining the sorted keywords belonging to the same text title in groups of r, where n is the total number of keywords belonging to the same text title, r ≤ n, and 2 ≤ r ≤ 5.
After step 103 has been carried out, in the embodiment of the invention, junk phrases among the keyword phrases can preferably be filtered out according to a preset junk dictionary.
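The common-word filtering, sorting, combination, and junk-phrase filtering of step 103 can be sketched as below. The sort order is not specified in the text, so lexicographic order is assumed here (its purpose in this sketch is that the same set of words always yields the same phrase regardless of word order in the title); removing duplicate keywords within one title is likewise an assumption, and the function and parameter names are hypothetical.

```python
from itertools import combinations


def keyword_phrases(keywords, common_words=frozenset(),
                    junk_phrases=frozenset(), r_min=2, r_max=5):
    """Build keyword phrases of r words each, for 2 <= r <= 5 and r <= n."""
    # Drop common words, remove duplicates, and sort (assumed lexicographic).
    words = sorted(set(keywords) - set(common_words))
    n = len(words)
    phrases = []
    for r in range(r_min, min(r_max, n) + 1):
        for combo in combinations(words, r):
            # Junk phrases are tuples of words listed in the junk dictionary.
            if combo not in junk_phrases:
                phrases.append(combo)
    return phrases
```

For example, the keywords `["phone", "new", "original", "launch"]` with `"original"` in the common-word dictionary leave n = 3 sorted words, giving three 2-word phrases and one 3-word phrase.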
Step 104: count the number of occurrences of each keyword phrase under the category to which it belongs, obtain the network hotspot phrases under each category, and display them by category.
Step 104 specifically comprises the following processing: counting the number of different text titles under the category in which each keyword phrase occurs, and arranging the keyword phrases whose occurrence count exceeds a predetermined threshold in a predetermined order to obtain the network hotspot phrases under each category. The predetermined order may be descending order of occurrence count.
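The counting in step 104 can be sketched for a single category as follows. The input format (one list of phrase tuples per text title) and the function name are assumptions for illustration; ties in the count are broken alphabetically here only to make the output order deterministic.

```python
from collections import Counter


def hot_phrases(phrases_per_title, threshold):
    """Count in how many different titles each phrase occurs and return
    the phrases above the threshold, most frequent first."""
    counts = Counter()
    for phrases in phrases_per_title:
        # A phrase is counted at most once per title.
        counts.update(set(phrases))
    hot = [(phrase, count) for phrase, count in counts.items()
           if count > threshold]
    hot.sort(key=lambda pc: (-pc[1], pc[0]))
    return hot
```

Running this once per category keeps the statistics field-specific, which is the point of classifying the data before counting.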
After the network hotspot phrases under each category have been obtained, identical network hotspot phrases under the same category can be merged; the heat value corresponding to each network hotspot phrase under each category can be calculated; and links to the hotspot events corresponding to the network hotspot phrases under each category can be searched for, so as to give the user more comprehensive hotspot information.
In step 104, displaying by category means: presenting a hotspot report to the user, where the hotspot report comprises: the category to which each network hotspot phrase belongs, the network hotspot phrases under each category within a predetermined time period, the heat value corresponding to each network hotspot phrase under each category, and links to the hotspot events corresponding to the network hotspot phrases under each category; the predetermined time period comprises at least one of the following: hourly, daily, weekly, and monthly.
The technical scheme of the embodiment of the invention is illustrated below with reference to the drawings.
Fig. 3 is a detailed flow diagram of the network hotspot mining method of the embodiment of the invention. As shown in Fig. 3, the method specifically comprises the following processing:
Step 301: a machine learning module generates a classification model from a custom corpus; the classification model is used to classify the collected network data, and the classification labels are stored in the engine together with the text attributes.
Step 302: data is collected from the engine once per hour and stored by category in different Extensible Markup Language (XML) files on a designated server.
Step 303: the data is filtered according to the following filtering rules and the filtered data is kept in a database; the user can manage the filtering rules through a data filtering rule management back end.
Specifically, the filtering rules according to the embodiment of the invention include:
1, title filters: the data filter of number of words between 5-30 word of title come in;
2, the temporal filtering of posting is that the model on the same day filters into the time of posting;
3, domain name is filtered: fuzzy matching is taked in (1), can have the model of corresponding domain name or word to filter among the URL with model; Perhaps, filter into by the model of domain name with band auto among the URL of 30 tame current events forums, 20 tame automobile forums and model (2); Perhaps, satisfy all will filtering into of (1), (2) these two kinds of rules.
4, column filters: the URL according to the plate seed filters; Also can the model of certain Chinese character of column title band be filtered into; For example, filter out the model of the band amusement of column title or Eight Diagrams printed words;
5, the domain name blacklist filters: the top result who filters out is carried out deletion action, the model with certain word among certain second level domain or the secondary URL is filtered out; And, be among the result of xinhuanet.com at TLD, be filtering out of 120ask.xinhuanet.com with domain name;
6, column blacklist: the top result who filters out is carried out deletion action, filter out the model with certain word in certain seed or the column name; And, be filtering out of reporting of new person with the column name;
7, source filtering: will meet the data filter that filters the source and come in, wherein, filter the source and be meant: forum, blog still are whole models;
8, replying the number clicks filters: will reply the data filter of number within 0-1000 and come in; The data filter of clicks within 0-10000 come in;
9, disappear heavily and to handle: the URL according to model disappears heavily model of calculations that TLD is identical;
10, filtered fields comprises: title, URL, source forum, come active plate, the time of posting, author, answer number, browse the number etc.
11, filter logic order: above-mentioned the 3rd filtering rule and the 4th filtering rule be " or " relation, between other filtering rules be " with " relation.
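The logical relationship in rule 11 can be sketched as a single predicate. This is a minimal sketch, not the patented implementation: the `Post` fields, the dictionary contents, the use of `len()` as the title length and the URL-only deduplication are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    title: str
    url: str
    column: str      # name of the column (board) the post came from
    source: str      # "forum" or "blog"
    posted: date
    replies: int
    clicks: int


# Hypothetical rule parameters; the source manages these from a back end.
DOMAIN_WORDS = ("auto",)
COLUMN_WORDS = ("entertainment", "gossip")
DOMAIN_BLACKLIST = ("120ask.xinhuanet.com",)
COLUMN_BLACKLIST = ("newcomer check-in",)
ALLOWED_SOURCES = {"forum", "blog"}


def keep(post: Post, today: date) -> bool:
    """Return True if the post passes rules 1 to 8."""
    domain_ok = any(w in post.url for w in DOMAIN_WORDS)            # rule 3
    column_ok = any(w in post.column for w in COLUMN_WORDS)         # rule 4
    return (
        5 <= len(post.title) <= 30                                  # rule 1
        and post.posted == today                                    # rule 2
        and (domain_ok or column_ok)                                # rule 3 OR rule 4
        and not any(d in post.url for d in DOMAIN_BLACKLIST)        # rule 5
        and not any(c in post.column for c in COLUMN_BLACKLIST)     # rule 6
        and post.source in ALLOWED_SOURCES                          # rule 7
        and 0 <= post.replies <= 1000                               # rule 8
        and 0 <= post.clicks <= 10000
    )


def deduplicate(posts):
    """Rule 9, simplified: keep the first post seen for each URL."""
    seen, unique = set(), []
    for post in posts:
        if post.url not in seen:
            seen.add(post.url)
            unique.append(post)
    return unique
```

Only rules 3 and 4 are joined with `or`; every other rule is joined with `and`, so a post rejected by a blacklist is dropped even when it matches a domain or column keyword.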
Step 304: centre words are extracted from all text titles. A title may have several centre words; the title is segmented by word segmentation technology, and the segmentation result is taken as the centre words of the title. Preferably, prefix filtering is applied to the title before segmentation, and the filtered prefixes do not take part in segmentation, for example prefixes such as "cat pounces on the university student base" and "ends of the earth tittle-tattle". The user can manage the prefixes to be filtered through the prefix management back end;
Step 305: hotspot phrase calculation:
Step 1: the common words in the segmentation result (for example, words such as "original", "reprint" and "photo set") are filtered out; the user can manage the common words to be filtered through the common-word management back end;
Step 2: the centre words remaining after filtering are sorted (for example, if the centre words extracted from a title are b, c, a, they become a, b, c after sorting);
Step 3: the centre words of each title are combined. The $n$ centre words of each title are combined $r$ at a time, giving $C_n^r$ combinations, where the combination formula is $C_n^r = \frac{n!}{r!\,(n-r)!}$; only phrases of 2 to 5 words are kept;
The sorting and combination of centre words into phrases is illustrated below with an example.
Title one yields the centre words b, a, c; after sorting they are a, b, c, which form the phrases ab, bc, ac, abc.
Title two yields the centre words c, b, d; after sorting they are b, c, d, which form the phrases bc, cd, bd, bcd.
Title three yields the centre words b, c, which form the phrase bc.
The phrase ranking formed by these three titles is therefore: bc (3), ab (1), ac (1), cd (1), bd (1), abc (1), bcd (1).
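The worked example above can be reproduced with a short sketch. It assumes that duplicate centre words within one title are collapsed and that a phrase is represented as a tuple of words; the function names are illustrative and do not come from the source.

```python
from collections import Counter
from itertools import combinations


def title_phrases(centre_words, min_len=2, max_len=5):
    """Sort a title's centre words and return every combination of 2 to 5 words."""
    words = sorted(set(centre_words))
    phrases = []
    for r in range(min_len, min(max_len, len(words)) + 1):
        phrases.extend(combinations(words, r))
    return phrases


def rank_phrases(titles):
    """Count, for each phrase, how many titles produce it, highest count first."""
    counts = Counter()
    for centre_words in titles:
        counts.update(title_phrases(centre_words))
    return counts.most_common()


ranking = rank_phrases([["b", "a", "c"], ["c", "b", "d"], ["b", "c"]])
```

Sorting before combining is what makes the phrases from different titles comparable: b, c extracted in either order always yields the same key `("b", "c")`, so its count accumulates across titles.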
Step 4: junk phrases are filtered out, removing junk phrases such as "inquiry ### prize-winning", "### phone", "### consulting" and "mobile phone ### prize-winning"; the user can manage the junk phrases to be filtered through the junk-phrase management back end;
Step 306: the hotspot phrase ranking list is formed. The number of titles behind each hotspot phrase is counted, the phrases are sorted in descending order of title count, and phrases with a title count of more than 2 are kept; this parameter can be adjusted according to the actual data;
In summary, by means of the technical solution of the embodiment of the present invention, hotspot mining is realized using the hot-word calculation principle, and text classification technology is combined with hotspot mining technology. This solves the problems in the prior art that network hotspot mining results are not macroscopic, cannot reflect the hotspots of a given field on a per-field basis, are highly repetitive and are poorly readable. Network hotspots can be mined at a more macroscopic level, reflecting the overall degree of attention that netizens pay to a given hotspot, so that the mining results better reflect the objective facts of Internet public opinion; articles with identical content that appear repeatedly are merged more easily; and the hotspots of a particular field can be reflected in a more targeted way.
According to an embodiment of the present invention, a network hotspot mining device is provided. Fig. 4 is a schematic structural diagram of the network hotspot mining device of the embodiment of the present invention. As shown in Fig. 4, the network hotspot mining device according to the embodiment of the present invention comprises: a classification and storage module 40, a filtering and extraction module 42, a sorting and combination module 44 and a hotspot statistics module 46. Each module of the embodiment of the present invention is described in detail below.
The classification and storage module 40 is adapted to collect network data, classify the network data and store it by category;
The network data specifically comprises: a text title, the article content corresponding to the text title, and the text attributes corresponding to the text title. The text attributes specifically comprise at least one of the following: the URL corresponding to the text, the source forum/blog of the text, the source column of the text, the publication time of the text, the author of the text, the reply count of the text and the view count of the text.
The classification and storage module 40 is specifically adapted to: 1. perform text classification on the network data according to the article content using automatic text classification technology, obtain the classification label corresponding to the network data, and store the corresponding text title, the corresponding classification label and the corresponding text attributes in the engine; here, automatic text classification technology means using the principles of machine learning, relying on model parameters learned from a small sample, to automatically label a text collection (or other entities or objects) with categories according to a given classification system or standard. 2. Collect network data from the engine once every predetermined time, and store the collected network data by category, according to the classification label, in different XML files on a designated server. The predetermined time may be 1 hour, 6 hours or 1 day; in the embodiment of the present invention, the predetermined time can be set flexibly according to the characteristics of the collected data (for example, its update speed).
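Storing collected records by classification label in separate XML files could look like the sketch below. The record layout (`label`, `title`, `url`, `author`), the element names and the one-file-per-label naming are assumptions for illustration; the source does not specify an XML schema, and the periodic scheduling is left out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def store_by_category(records, out_dir):
    """Write one XML file per classification label and return the file paths."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["label"]].append(rec)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for label, items in grouped.items():
        root = ET.Element("posts", {"category": label})
        for rec in items:
            post = ET.SubElement(root, "post")
            # Hypothetical subset of the text attributes listed in the source.
            for field in ("title", "url", "author"):
                ET.SubElement(post, field).text = rec.get(field, "")
        path = out / f"{label}.xml"
        ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)
        paths[label] = path
    return paths
```

A caller would invoke `store_by_category` once per collection interval (for example hourly) with the records fetched from the engine since the previous run.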
The filtering and extraction module 42 is adapted to filter the network data under each category according to the preset filtering rules, and to extract centre words from the filtered network data under each category;
In the embodiment of the present invention, Fig. 2 is a schematic diagram of the filtering rules of the embodiment of the present invention. As shown in Fig. 2, the filtering rules specifically comprise at least one of the following:
1. deleting network data whose text title does not meet the predetermined length;
2. deleting network data whose publication time does not meet the requirements;
3. deleting network data whose URL contains a predetermined domain name, where the predetermined domain name is a domain name in a preset domain name blacklist; or keeping network data whose URL contains a predetermined domain name;
4. deleting network data whose source column is a predetermined column, where the predetermined column is a column in a preset column blacklist; or keeping network data whose source column is a predetermined column;
5. deleting network data whose source does not meet the requirements, where the source comprises: forum, blog, or all posts;
6. deleting network data whose reply count does not meet the requirements;
7. deleting network data whose view count does not meet the requirements;
8. deleting network data whose author does not meet the requirements;
9. deduplicating the network data.
It should be noted that the filtering rules in the embodiment of the present invention are not limited to the 9 rules listed above. In the embodiment of the present invention, the filtering rules can be set as required; for example, a filtering rule may be set to delete network data whose article length does not exceed a predetermined length threshold.
In addition, before the centre words are extracted, and in order to better extract the centre words that are needed, the filtering and extraction module 42 is further adapted to apply prefix filtering to the text title according to a preset prefix dictionary, for example filtering out unwanted prefixes such as "cat pounces on the university student base" and "ends of the earth tittle-tattle". These prefixes do not take part in centre word extraction. Moreover, in the embodiment of the present invention, the filtering and extraction module 42 can use word segmentation technology to extract centre words from the filtered network data under each category; specifically, the filtering and extraction module 42 can use word segmentation technology to segment the filtered text titles under each category, obtain the segmentation result, and take the segmentation result as the centre words. It should be noted that the word segmentation technology is a mature centre word extraction technique in the prior art, and the embodiment of the present invention may also use other techniques to extract centre words.
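Prefix filtering followed by segmentation can be sketched as below. The prefix dictionary contents, the separator characters stripped after a prefix, and the whitespace split used in place of a real Chinese word segmenter are all assumptions; any segmenter that maps a string to a list of words can be passed as `segment`.

```python
def strip_prefix(title, prefixes):
    """Remove the longest matching prefix from the start of a title."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if title.startswith(prefix):
            # Also drop separator characters that commonly follow a board name.
            return title[len(prefix):].lstrip(" :-|")
    return title


def centre_words(title, prefixes, segment=str.split):
    """Strip the prefix, then segment the remainder into centre words."""
    return segment(strip_prefix(title, prefixes))
```

Trying the longest prefix first avoids leaving a fragment behind when one dictionary entry is itself the start of another.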
The sorting and combination module 44 is adapted to sort the centre words extracted from the same piece of network data, and to combine the sorted centre words of the same piece of network data, obtaining the centre phrases of each piece of network data under each category;
The sorting and combination module 44 realizes the above processing through hot-word calculation technology. Hot-word calculation technology means: web page text collected in real time is automatically segmented, grouped and merged; high-frequency hot keywords are calculated and filtered according to predefined dictionaries and preset rules; and real-time Internet hot vocabulary is output.
Before sorting the centre words extracted from the same piece of network data, the sorting and combination module 44 can filter out the common words among the extracted centre words according to a preset common-word dictionary. Common words are words such as "original", "reprint" and "photo set"; these words need to be filtered out.
The centre word combination performed by the sorting and combination module 44 means: the sorting and combination module 44 combines the sorted centre words belonging to the same text title according to $C_n^r$, where $n$ is the total number of centre words belonging to the same text title, $r \le n$ and $2 \le r \le 5$.
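Assuming the combination count referred to here is the standard binomial coefficient, the number of centre phrases produced by one title with $n$ centre words follows directly from the constraints $r \le n$ and $2 \le r \le 5$:

```latex
\[
  P(n) \;=\; \sum_{r=2}^{\min(n,5)} C_n^r ,
  \qquad
  C_n^r \;=\; \frac{n!}{r!\,(n-r)!}
\]
\[
  P(3) = C_3^2 + C_3^3 = 3 + 1 = 4 ,
  \qquad
  P(5) = C_5^2 + C_5^3 + C_5^4 + C_5^5 = 10 + 10 + 5 + 1 = 26
\]
```

The value $P(3) = 4$ matches a title with three centre words a, b, c, which yields the four phrases ab, ac, bc and abc. The symbol $P(n)$ is introduced here only for illustration.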
Preferably, after the sorted centre words of the same piece of network data have been combined and the centre phrases of each piece of network data under each category have been obtained, the sorting and combination module 44 is further adapted to filter out the junk phrases among the centre phrases according to a preset junk dictionary.
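Junk phrase filtering against a dictionary can be sketched as follows. The sketch assumes that `###` in a dictionary entry is a placeholder matching arbitrary text; the source does not define the meaning of `###`, so this reading, the entries and the function names are hypothetical.

```python
import re


def compile_junk_pattern(entry, wildcard="###"):
    """Turn a dictionary entry such as 'mobile phone ### prize' into a regex."""
    parts = [re.escape(part) for part in entry.split(wildcard)]
    return re.compile(".*".join(parts))


def remove_junk(phrases, junk_dictionary):
    """Drop every phrase that fully matches an entry of the junk dictionary."""
    patterns = [compile_junk_pattern(entry) for entry in junk_dictionary]
    return [p for p in phrases if not any(pat.fullmatch(p) for pat in patterns)]
```

Entries without the placeholder behave as exact matches, so the same dictionary can hold both literal junk phrases and patterns.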
The hotspot statistics module 46 is adapted to count the number of occurrences of each centre phrase under the category it belongs to, obtain the network hotspot phrases under each category, and display them by category.
The hotspot statistics module 46 is specifically adapted to: count the number of occurrences of each centre phrase in the different text titles under the category it belongs to, arrange the centre phrases whose number of occurrences is greater than a predetermined threshold in a predetermined order, and obtain the network hotspot phrases under each category.
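The per-category counting, thresholding and ordering can be sketched as below. The input layout (a mapping from category to a list of per-title phrase lists), the descending order with an alphabetical tie-break, and the default threshold are assumptions for illustration.

```python
from collections import Counter


def hot_phrases(title_phrases_by_category, threshold=1):
    """Rank, per category, the centre phrases found in more than `threshold` titles."""
    ranking = {}
    for category, titles in title_phrases_by_category.items():
        counts = Counter()
        for phrases in titles:
            counts.update(set(phrases))  # each title counts at most once per phrase
        kept = [(p, n) for p, n in counts.items() if n > threshold]
        # Descending by count; ties are broken alphabetically for a stable order.
        ranking[category] = sorted(kept, key=lambda item: (-item[1], item[0]))
    return ranking
```

Counting inside each category separately is what lets the same phrase be a hotspot in one field and absent from the ranking of another.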
After the network hotspot phrases under each category have been obtained, the hotspot statistics module 46 is further adapted to: merge identical network hotspot phrases under the same category; calculate the heat value corresponding to the network hotspot phrases under each category; and search for the links of the hot events corresponding to the network hotspot phrases under each category.
The display by category performed by the hotspot statistics module 46 means showing a hotspot report to the user. The hotspot report comprises: the category to which each network hotspot phrase belongs, the network hotspot phrases under each category within a predetermined time period, the heat value corresponding to the network hotspot phrases under each category, and the links of the hot events corresponding to the network hotspot phrases under each category. The predetermined time period comprises at least one of the following: hourly, daily, weekly and monthly.
The technical solution of the embodiment of the present invention is illustrated below with reference to the accompanying drawings.
Fig. 3 is a schematic diagram of the detailed process of the network hotspot mining method of the embodiment of the present invention. As shown in Fig. 3, the network hotspot mining method according to the embodiment of the present invention specifically comprises the following processing:
Step 301: a classification model is generated from a user-defined corpus by the machine learning module; the classification and storage module 40 performs text classification on the collected network data with the classification model, and stores the classification label in the engine together with the text attributes.
Step 302: the classification and storage module 40 collects data from the engine once per hour, and stores the data by category in different Extensible Markup Language (XML) files on a designated server.
Step 303: the filtering and extraction module 42 filters the data according to the following filtering rules, and the filtered data is kept in the database; the user can manage the filtering rules through the data-filtering-rule management back end.
Specifically, Fig. 3 is a preferred schematic diagram of the filtering rules of the embodiment of the present invention. As shown in Fig. 3, the filtering rules according to the embodiment of the present invention comprise:
1. Title filtering: data whose title length is between 5 and 30 characters is kept;
2. Posting-time filtering: posts whose posting time is the current day are kept;
3. Domain name filtering: (1) fuzzy matching is used, so that posts whose URL contains the corresponding domain name or word are kept; or (2) posts are kept according to the domain names of 30 current-affairs forums and 20 automobile forums, together with posts whose URL contains "auto"; or all posts satisfying rules (1) and (2) are kept.
4. Column filtering: posts are filtered according to the URL of the board seed; posts whose column title contains certain Chinese characters may also be kept, for example posts whose column title contains the word "entertainment" or "gossip";
5. Domain name blacklist filtering: a deletion is performed on the results kept above, removing posts with a certain second-level domain or with a certain word in the second-level URL; and, among the results whose top-level domain is xinhuanet.com, those whose domain name is 120ask.xinhuanet.com are removed;
6. Column blacklist: a deletion is performed on the results kept above, removing posts with a certain word in a certain seed or column name; and posts whose column name is "newcomer check-in" are removed;
7. Source filtering: data matching the filter source is kept, where the filter source is one of: forum, blog, or all posts;
8. Reply count and click count filtering: data with a reply count within 0-1000 is kept; data with a click count within 0-10000 is kept;
9. Deduplication: posts are deduplicated according to their URL, and posts with the same top-level domain are counted as one post;
10. The filtered fields comprise: title, URL, source forum, source board, posting time, author, reply count, view count, etc.
11. Filter logic order: the 3rd filtering rule and the 4th filtering rule above are in an "or" relationship, and the other filtering rules are in an "and" relationship with one another.
Step 304: the filtering and extraction module 42 extracts centre words from all text titles. A title may have several centre words; the title is segmented by word segmentation technology, and the segmentation result is taken as the centre words of the title. Preferably, prefix filtering is applied to the title before segmentation, and the filtered prefixes do not take part in segmentation, for example prefixes such as "cat pounces on the university student base" and "ends of the earth tittle-tattle". The user can manage the prefixes to be filtered through the prefix management back end;
Step 305: the sorting and combination module 44 performs the hotspot phrase calculation:
Step 1: the common words in the segmentation result (for example, words such as "original", "reprint" and "photo set") are filtered out; the user can manage the common words to be filtered through the common-word management back end;
Step 2: the centre words remaining after filtering are sorted (for example, if the centre words extracted from a title are b, c, a, they become a, b, c after sorting);
Step 3: the centre words of each title are combined. The $n$ centre words of each title are combined $r$ at a time, giving $C_n^r$ combinations, where the combination formula is $C_n^r = \frac{n!}{r!\,(n-r)!}$; only phrases of 2 to 5 words are kept;
The sorting and combination of centre words into phrases is illustrated below with an example.
Title one yields the centre words b, a, c; after sorting they are a, b, c, which form the phrases ab, bc, ac, abc.
Title two yields the centre words c, b, d; after sorting they are b, c, d, which form the phrases bc, cd, bd, bcd.
Title three yields the centre words b, c, which form the phrase bc.
The phrase ranking formed by these three titles is therefore: bc (3), ab (1), ac (1), cd (1), bd (1), abc (1), bcd (1).
Step 4: junk phrases are filtered out, removing junk phrases such as "inquiry ### prize-winning", "### phone", "### consulting" and "mobile phone ### prize-winning"; the user can manage the junk phrases to be filtered through the junk-phrase management back end;
Step 306: the hotspot statistics module 46 forms the hotspot phrase ranking list. The number of titles behind each hotspot phrase is counted, the phrases are sorted in descending order of title count, and phrases with a title count of more than 2 are kept; this parameter can be adjusted according to the actual data;
In summary, by means of the technical solution of the embodiment of the present invention, hotspot mining is realized using the hot-word calculation principle, and text classification technology is combined with hotspot mining technology. This solves the problems in the prior art that network hotspot mining results are not macroscopic, cannot reflect the hotspots of a given field on a per-field basis, are highly repetitive and are poorly readable. Network hotspots can be mined at a more macroscopic level, reflecting the overall degree of attention that netizens pay to a given hotspot, so that the mining results better reflect the objective facts of Internet public opinion; articles with identical content that appear repeatedly are merged more easily; and the hotspots of a particular field can be reflected in a more targeted way.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the above description of a specific language is given in order to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It will be understood, however, that embodiments of the invention can be practised without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and to aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all the features of a single previously disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components in the embodiments may be combined into one module or unit or component, and may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the network hotspot mining device according to the embodiment of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.

Claims (26)

1. A network hotspot mining device, characterized by comprising:
a classification and storage module, adapted to collect network data, classify said network data and store it by category;
a filtering and extraction module, adapted to filter the network data under each category according to preset filtering rules, and to extract centre words from the filtered network data under each category;
a sorting and combination module, adapted to sort said centre words extracted from the same piece of network data, and to combine the sorted centre words of the same piece of network data, obtaining the centre phrases of each piece of network data under each category;
a hotspot statistics module, adapted to count the number of occurrences of said centre phrases under the category they belong to, obtain the network hotspot phrases under each category, and display them by category.
2. The device according to claim 1, characterized in that said network data further comprises: a text title, the article content corresponding to said text title, and the text attributes corresponding to said text title.
3. The device according to claim 1 or 2, characterized in that said text attributes further comprise at least one of the following: the uniform resource locator (URL) corresponding to the text, the source forum/blog of the text, the source column of the text, the publication time of the text, the author of the text, the reply count of the text and the view count of the text.
4. The device according to any one of claims 1 to 3, characterized in that said classification and storage module is further adapted to:
perform text classification on said network data according to said article content using automatic text classification technology, obtain the classification label corresponding to said network data, and store the corresponding text title, the corresponding classification label and the corresponding text attributes in an engine;
collect network data from said engine once every predetermined time, and store the collected network data by category, according to said classification label, in different XML files on a designated server.
5. The device according to any one of claims 1 to 4, characterized in that said filtering rules further comprise at least one of the following:
deleting network data whose text title does not meet the predetermined length;
deleting network data whose publication time does not meet the requirements;
deleting network data whose URL contains a predetermined domain name, wherein said predetermined domain name is a domain name in a preset domain name blacklist; or keeping network data whose URL contains a predetermined domain name;
deleting network data whose source column is a predetermined column, wherein said predetermined column is a column in a preset column blacklist; or keeping network data whose source column is a predetermined column;
deleting network data whose source does not meet the requirements, wherein said source comprises: forum, blog, or all posts;
deleting network data whose reply count does not meet the requirements;
deleting network data whose view count does not meet the requirements;
deleting network data whose author does not meet the requirements; and
deduplicating the network data.
6. The device according to any one of claims 1 to 5, characterized in that said filtering and extraction module is further adapted to: before using word segmentation technology to extract centre words from the filtered network data under each category, apply prefix filtering to said text title according to a preset prefix dictionary.
7. The device according to any one of claims 1 to 6, characterized in that said filtering and extraction module is further adapted to: use word segmentation technology to segment the filtered text titles under each category, obtain the segmentation result, and take said segmentation result as said centre words.
8. The device according to any one of claims 1 to 7, characterized in that said sorting and combination module is further adapted to: before sorting said centre words extracted from the same piece of network data, filter out the common words among said extracted centre words according to a preset common-word dictionary.
9. The device according to any one of claims 1 to 8, characterized in that said sorting and combination module is further adapted to: combine the sorted centre words belonging to the same text title according to $C_n^r$, wherein $n$ is the total number of centre words belonging to the same text title, $r \le n$ and $2 \le r \le 5$.
10. The device according to any one of claims 1 to 8, characterized in that said sorting and combination module is further adapted to: after combining the sorted centre words of the same piece of network data and obtaining the centre phrases of each piece of network data under each category, filter out the junk phrases among said centre phrases according to a preset junk dictionary.
11. The device according to any one of claims 1 to 10, characterized in that said hotspot statistics module is further adapted to: count the number of occurrences of said centre phrases in the different text titles under the category they belong to, arrange the centre phrases whose number of occurrences is greater than a predetermined threshold in a predetermined order, and obtain the network hotspot phrases under each category.
12. The device according to any one of claims 1 to 11, characterized in that said hotspot statistics module is further adapted to: merge identical network hotspot phrases under the same category; calculate the heat value corresponding to the network hotspot phrases under each category; and search for the links of the hot events corresponding to the network hotspot phrases under each category.
13. The device according to any one of claims 1 to 12, characterized in that said hotspot statistics module is further adapted to: show a hotspot report to the user, wherein said hotspot report comprises: the category to which each network hotspot phrase belongs, the network hotspot phrases under each category within a predetermined time period, the heat value corresponding to the network hotspot phrases under each category, and the links of the hot events corresponding to the network hotspot phrases under each category; said predetermined time period comprises at least one of the following: hourly, daily, weekly and monthly.
14. A network hotspot mining method, characterized by comprising:
collecting network data, classifying the network data, and storing it by category;
filtering the network data under each category according to preset filtering rules, and extracting center words from the filtered network data under each category;
sorting the center words extracted from the same network data, and combining the sorted center words of the same network data to obtain the center phrases of each piece of network data under each category;
counting the number of occurrences of the center phrases under the categories to which they belong, obtaining the network hotspot phrases under each category respectively, and displaying them by category.
15. The method according to claim 14, characterized in that the network data comprises: a text title, article content corresponding to the text title, and text attributes corresponding to the text title.
16. The method according to claim 14 or 15, characterized in that the text attributes further comprise at least one of: the uniform resource locator (URL) corresponding to the text, the source forum/blog of the text, the source column of the text, the publication time of the text, the author of the text, the number of replies to the text, and the number of views of the text.
17. The method according to any one of claims 14 to 16, characterized in that classifying the network data and storing it by category further comprises:
performing text classification on the network data according to the article content using automatic text classification technology, obtaining the classification label corresponding to the network data, and storing the corresponding text title, the corresponding classification label, and the corresponding text attributes in an engine;
collecting network data from the engine once every predetermined interval, and storing the collected network data, according to the classification labels, in different XML files on a designated server.
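The per-category XML storage step of claim 17 can be illustrated with a minimal sketch. The record layout (`label`, `title`, `attrs`), the element names, and the one-file-per-label naming are assumptions made for illustration; the claim only requires that collected data be stored in different XML files according to classification label.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def store_by_category(records, out_dir):
    """Write one XML file per classification label; return {label: file path}.

    Each record is a dict with hypothetical keys: "label", "title", and an
    optional "attrs" dict of text attributes (URL, publication time, ...).
    Labels and attribute names are assumed to be valid file/XML names.
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["label"]].append(rec)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for label, items in grouped.items():
        root = ET.Element("category", name=label)
        for rec in items:
            item = ET.SubElement(root, "item")
            ET.SubElement(item, "title").text = rec["title"]
            # Each text attribute becomes a child element of the item.
            for key, value in rec.get("attrs", {}).items():
                ET.SubElement(item, key).text = str(value)
        path = str(out_dir / f"{label}.xml")
        ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
        paths[label] = path
    return paths
```

A periodic collector would call `store_by_category` once per interval with the records fetched from the engine; scheduling is left out of the sketch.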
18. The method according to any one of claims 14 to 17, characterized in that the filtering rules further comprise at least one of:
deleting network data whose text title does not meet a predetermined word count;
deleting network data whose publication time does not meet the specified requirement;
deleting network data whose URL contains a predetermined domain name, wherein the predetermined domain name is a domain name in a preset domain-name blacklist; or, retaining network data whose URL contains a predetermined domain name;
deleting network data whose source column is a predetermined column, wherein the predetermined column is a column in a preset column blacklist; or, retaining network data whose source column is a predetermined column;
deleting network data whose source does not meet the specified requirement, wherein the source comprises: forum, blog, or all posts;
deleting network data whose number of replies does not meet the specified requirement;
deleting network data whose number of views does not meet the specified requirement;
deleting network data whose author does not meet the specified requirement; and
deduplicating the network data.
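Several of the filtering rules in claim 18 can be sketched as a single pass over the collected items. The `Post` fields, the default limits, and the choice to deduplicate on exact title are illustrative assumptions; the claim does not fix thresholds or a deduplication key, and the column, source, and author rules are omitted here for brevity.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlparse


@dataclass(frozen=True)
class Post:  # hypothetical record type
    title: str
    url: str
    published: datetime
    replies: int
    views: int


def filter_posts(posts, *, since, min_len=4, max_len=40,
                 domain_blacklist=frozenset(), min_replies=0, min_views=0):
    """Keep posts that pass every rule; drop the rest."""
    kept, seen_titles = [], set()
    for p in posts:
        # Rule: title length must fall inside the predetermined range.
        if not (min_len <= len(p.title) <= max_len):
            continue
        # Rule: publication time must not be older than the cut-off.
        if p.published < since:
            continue
        # Rule: drop posts whose host is a blacklisted domain or a subdomain of one.
        host = urlparse(p.url).hostname or ""
        if any(host == d or host.endswith("." + d) for d in domain_blacklist):
            continue
        # Rules: minimum number of replies and views.
        if p.replies < min_replies or p.views < min_views:
            continue
        # Rule: deduplication (assumed here to mean identical titles).
        if p.title in seen_titles:
            continue
        seen_titles.add(p.title)
        kept.append(p)
    return kept
```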
19. The method according to any one of claims 14 to 18, characterized in that, before extracting center words from the filtered network data under each category, the method further comprises:
performing prefix filtering on the text titles according to a preset prefix dictionary.
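As an illustration of the prefix filtering in claim 19, the sketch below strips dictionary prefixes from the start of a title until none remains. The example prefixes (such as a reply marker or a repost tag) and the repeated, longest-first stripping are assumptions; the claim only states that a preset prefix dictionary is used.

```python
def strip_prefixes(title, prefixes):
    """Repeatedly remove any leading prefix found in the prefix dictionary."""
    # Try longer prefixes first so a short prefix cannot shadow a longer one.
    ordered = sorted((p for p in prefixes if p), key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        title = title.lstrip()
        for prefix in ordered:
            if title.startswith(prefix):
                title = title[len(prefix):]
                changed = True
                break
    return title.strip()
```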
20. The method according to any one of claims 14 to 19, characterized in that extracting center words from the filtered network data under each category further comprises:
segmenting the filtered text titles under each category using word segmentation technology to obtain segmentation results, and using the segmentation results as the center words.
21. The method according to any one of claims 14 to 20, characterized in that, before sorting the center words extracted from the same network data, the method further comprises:
filtering out common words from the extracted center words according to a preset common-word dictionary.
22. The method according to any one of claims 14 to 21, characterized in that combining the sorted center words of the same network data further comprises:
combining the sorted center words belonging to the same text title according to the combination formula $C_n^r$ (formula image FDA00002153254600051), wherein n is the total number of center words belonging to the same text title, r ≤ n and 2 ≤ r ≤ 5.
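The combination step of claim 22 can be sketched with the standard library. For a title with n center words, choosing every r with 2 ≤ r ≤ min(5, n) yields $\sum_{r=2}^{\min(5,n)} C_n^r$ candidate phrases. The lexicographic sort order and the removal of duplicate words are assumptions; the claims only state that the center words are sorted before being combined.

```python
from itertools import combinations


def center_phrases(center_words, r_min=2, r_max=5):
    """Return all r-word combinations (r_min <= r <= min(r_max, n)) of the sorted words."""
    # Sorting first means the same set of words always produces the same phrase tuple.
    words = sorted(set(center_words))
    n = len(words)
    phrases = []
    for r in range(r_min, min(r_max, n) + 1):
        phrases.extend(combinations(words, r))
    return phrases
```

For example, six distinct center words give 15 + 20 + 15 + 6 = 56 candidate phrases, while a title with a single center word gives none.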
23. The method according to any one of claims 14 to 22, characterized in that, after combining the sorted center words of the same network data to obtain the center phrases of each piece of network data under each category, the method further comprises:
filtering out junk phrases from the center phrases according to a preset junk dictionary.
24. The method according to any one of claims 14 to 23, characterized in that counting the number of occurrences of the center phrases under the categories to which they belong and obtaining the network hotspot phrases under each category respectively further comprises:
counting the number of occurrences of each center phrase in different text titles under the category to which it belongs, arranging the center phrases whose number of occurrences exceeds a predetermined threshold in a predetermined order, and obtaining the network hotspot phrases under each category respectively.
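The counting step of claim 24 can be sketched as follows for a single category. Counting a phrase at most once per title reflects the wording "in different text titles"; the descending-count order is an assumed choice of "predetermined order", and the function and parameter names are hypothetical.

```python
from collections import Counter


def hot_phrases(phrases_per_title, threshold):
    """Return (phrase, count) pairs whose count exceeds the threshold.

    phrases_per_title: one iterable of center phrases per text title,
    all taken from the same category.
    """
    counts = Counter()
    for phrases in phrases_per_title:
        # A phrase counts once per title, however often it repeats inside it.
        counts.update(set(phrases))
    hot = [(phrase, count) for phrase, count in counts.items() if count > threshold]
    # Assumed order: most frequent first, ties broken by the phrase itself.
    hot.sort(key=lambda pc: (-pc[1], pc[0]))
    return hot
```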
25. The method according to any one of claims 14 to 24, characterized in that, after obtaining the network hotspot phrases under each category respectively, the method further comprises:
merging identical network hotspot phrases under the same category;
calculating the heat value corresponding to each network hotspot phrase under each category;
searching for links to the hotspot events corresponding to the network hotspot phrases under each category.
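The merging and heat calculation of claim 25 might look like the sketch below. The claims do not define how the heat value is computed, so the weighted sum of occurrences, replies, and views, together with the weights themselves, is purely a hypothetical formula; treating phrases with the same words in any order as identical is likewise an assumption. The link search is not shown.

```python
from collections import defaultdict


def merge_and_score(entries, w_count=1.0, w_replies=0.5, w_views=0.01):
    """Merge identical phrases per category and return {(category, phrase): heat}.

    entries: iterable of (category, phrase_words, occurrences, replies, views).
    The heat formula and weights are illustrative assumptions only.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for category, words, count, replies, views in entries:
        # Phrases with the same words (in any order) share one key.
        key = (category, tuple(sorted(words)))
        totals[key][0] += count
        totals[key][1] += replies
        totals[key][2] += views
    return {
        key: w_count * c + w_replies * r + w_views * v
        for key, (c, r, v) in totals.items()
    }
```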
26. The method according to any one of claims 14 to 25, characterized in that the displaying by category further comprises:
displaying a hotspot report to the user, wherein the hotspot report comprises: the category to which each network hotspot phrase belongs, the network hotspot phrases under each category within a predetermined time period, the heat value corresponding to each network hotspot phrase under each category, and the links to the hotspot events corresponding to the network hotspot phrases under each category; the predetermined time period comprises at least one of: hourly, daily, weekly, and monthly.
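A plain-text rendering of the hotspot report in claim 26 is sketched below. The row layout, the output format, and the period names are assumptions; the claim only lists which fields the report contains and which time periods it may cover.

```python
PERIODS = {"hourly", "daily", "weekly", "monthly"}


def render_report(period, rows):
    """Render rows of (category, phrase, heat, link) as a text report."""
    if period not in PERIODS:
        raise ValueError(f"unsupported period: {period!r}")
    by_category = {}
    for category, phrase, heat, link in rows:
        by_category.setdefault(category, []).append((phrase, heat, link))

    lines = [f"Hotspot report ({period})"]
    for category in sorted(by_category):
        lines.append(f"[{category}]")
        # Within a category, list the hottest phrase first.
        for phrase, heat, link in sorted(by_category[category], key=lambda x: -x[1]):
            lines.append(f"  {phrase}\theat={heat:.1f}\t{link}")
    return "\n".join(lines)
```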
CN201210346827.9A 2012-09-18 2012-09-18 Network hotspot mining method and device Expired - Fee Related CN102831248B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610225018.0A CN105912670A (en) 2012-09-18 2012-09-18 Method and device for network hotspot excavation
CN201210346827.9A CN102831248B (en) 2012-09-18 2012-09-18 Network focus method for digging and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210346827.9A CN102831248B (en) 2012-09-18 2012-09-18 Network focus method for digging and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201610225018.0A Division CN105912670A (en) 2012-09-18 2012-09-18 Method and device for network hotspot excavation

Publications (2)

Publication Number Publication Date
CN102831248A true CN102831248A (en) 2012-12-19
CN102831248B CN102831248B (en) 2016-05-11

Family

ID=47334383

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201210346827.9A Expired - Fee Related CN102831248B (en) 2012-09-18 2012-09-18 Network hotspot mining method and device
CN201610225018.0A Pending CN105912670A (en) 2012-09-18 2012-09-18 Method and device for network hotspot excavation

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201610225018.0A Pending CN105912670A (en) 2012-09-18 2012-09-18 Method and device for network hotspot excavation

Country Status (1)

Country Link
CN (2) CN102831248B (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324718A (en) * 2013-06-25 2013-09-25 百度在线网络技术(北京)有限公司 Topic venation digging method and system based on massive searching logs
CN103544294A (en) * 2013-10-30 2014-01-29 北京京东尚科信息技术有限公司 Keyword popularity automatic control method
CN103580997A (en) * 2013-11-19 2014-02-12 湖南蚁坊软件有限公司 Extraction method and device for hot microblogs in vertical field
CN103761234A (en) * 2013-10-29 2014-04-30 北京奇虎科技有限公司 Method and device for optimizing search ranking of network resource point
CN103902596A (en) * 2012-12-28 2014-07-02 中国电信股份有限公司 High-frequency page content clustering method and system
CN104714820A (en) * 2013-12-17 2015-06-17 青岛龙泰天翔通信科技有限公司 Cloud on-line updating method
CN105095175A (en) * 2014-04-18 2015-11-25 北京搜狗科技发展有限公司 Method and device for obtaining truncated web title
CN105095318A (en) * 2014-05-22 2015-11-25 北京启明星辰信息安全技术有限公司 Method and device for realizing hotspot analysis
CN105373551A (en) * 2014-08-25 2016-03-02 阿里巴巴集团控股有限公司 Method for determining sensitive resource processing policy and server
CN105989176A (en) * 2015-03-05 2016-10-05 北大方正集团有限公司 Data processing method and device
CN107133201A (en) * 2017-04-21 2017-09-05 东莞中国科学院云计算产业技术创新与育成中心 The hot information acquisition method and device recognized based on text code
CN107315838A (en) * 2017-07-17 2017-11-03 深圳源广安智能科技有限公司 A kind of efficient network hotspot digging system
CN108108346A (en) * 2016-11-25 2018-06-01 广东亿迅科技有限公司 The theme feature word abstracting method and device of document
CN108712403A (en) * 2018-05-04 2018-10-26 哈尔滨工业大学(威海) The illegal domain name method for digging of similitude is constructed based on domain name
CN108881968A (en) * 2017-05-15 2018-11-23 北京国双科技有限公司 A kind of network video advertisement put-on method and system
CN110516066A (en) * 2019-07-23 2019-11-29 同盾控股有限公司 A kind of content of text safety protecting method and device
CN110765115A (en) * 2019-09-27 2020-02-07 上海麦克风文化传媒有限公司 Method for combining multiple sorting categories
CN110888986A (en) * 2019-12-06 2020-03-17 北京明略软件系统有限公司 Information pushing method and device, electronic equipment and computer readable storage medium
CN110929160A (en) * 2019-12-02 2020-03-27 上海麦克风文化传媒有限公司 Method for optimizing system sequencing result
CN111580921A (en) * 2020-05-15 2020-08-25 北京字节跳动网络技术有限公司 Content creation method and device
CN112380339A (en) * 2020-11-23 2021-02-19 北京达佳互联信息技术有限公司 Hot event mining method and device and server

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182191B (en) * 2016-12-08 2022-01-18 腾讯科技(深圳)有限公司 Hotspot data processing method and device
CN107423444B (en) * 2017-08-10 2020-05-19 世纪龙信息网络有限责任公司 Hot word phrase extraction method and system
CN107967299B (en) * 2017-11-03 2020-05-12 中国农业大学 Agricultural public opinion-oriented automatic hot word extraction method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101420356A (en) * 2008-05-30 2009-04-29 北京天腾时空信息科技有限公司 Network content classified processing method and apparatus
US20090265315A1 (en) * 2008-04-18 2009-10-22 Yahoo! Inc. System and method for classifying tags of content using a hyperlinked corpus of classified web pages
CN101923544A (en) * 2009-06-15 2010-12-22 北京百分通联传媒技术有限公司 Method for monitoring and displaying Internet hot spots
CN102004792A (en) * 2010-12-07 2011-04-06 百度在线网络技术(北京)有限公司 Method and system for generating hot-searching word

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101788988B (en) * 2009-01-22 2012-06-27 蔡亮华 Information extraction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
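Luo Yin (罗引): "Research on Internet Public Opinion Discovery and Opinion Mining Technology" (互联网舆情发现与观点挖掘技术研究), master's thesis, University of Electronic Science and Technology of China (《电子科技大学硕士学位论文》), 15 April 2011 (2011-04-15) *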

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902596A (en) * 2012-12-28 2014-07-02 中国电信股份有限公司 High-frequency page content clustering method and system
CN103324718A (en) * 2013-06-25 2013-09-25 百度在线网络技术(北京)有限公司 Topic venation digging method and system based on massive searching logs
CN103324718B (en) * 2013-06-25 2016-08-10 百度在线网络技术(北京)有限公司 Method and system based on humongous search Web log mining topic venation
CN103761234A (en) * 2013-10-29 2014-04-30 北京奇虎科技有限公司 Method and device for optimizing search ranking of network resource point
CN103544294A (en) * 2013-10-30 2014-01-29 北京京东尚科信息技术有限公司 Keyword popularity automatic control method
CN103544294B (en) * 2013-10-30 2017-02-01 北京京东尚科信息技术有限公司 Keyword popularity automatic control method
CN103580997A (en) * 2013-11-19 2014-02-12 湖南蚁坊软件有限公司 Extraction method and device for hot microblogs in vertical field
CN103580997B (en) * 2013-11-19 2017-09-29 湖南蚁坊软件有限公司 The extracting method and its device of a kind of popular microblogging in vertical field
CN104714820A (en) * 2013-12-17 2015-06-17 青岛龙泰天翔通信科技有限公司 Cloud on-line updating method
CN105095175A (en) * 2014-04-18 2015-11-25 北京搜狗科技发展有限公司 Method and device for obtaining truncated web title
CN105095175B (en) * 2014-04-18 2019-04-30 北京搜狗科技发展有限公司 Obtain the method and device of truncated web page title
CN105095318A (en) * 2014-05-22 2015-11-25 北京启明星辰信息安全技术有限公司 Method and device for realizing hotspot analysis
CN105095318B (en) * 2014-05-22 2019-02-26 北京启明星辰信息安全技术有限公司 A kind of method and apparatus for realizing analysis of central issue
CN105373551A (en) * 2014-08-25 2016-03-02 阿里巴巴集团控股有限公司 Method for determining sensitive resource processing policy and server
CN105989176A (en) * 2015-03-05 2016-10-05 北大方正集团有限公司 Data processing method and device
CN108108346A (en) * 2016-11-25 2018-06-01 广东亿迅科技有限公司 The theme feature word abstracting method and device of document
CN108108346B (en) * 2016-11-25 2021-12-24 广东亿迅科技有限公司 Method and device for extracting theme characteristic words of document
CN107133201A (en) * 2017-04-21 2017-09-05 东莞中国科学院云计算产业技术创新与育成中心 The hot information acquisition method and device recognized based on text code
CN107133201B (en) * 2017-04-21 2021-03-16 东莞中国科学院云计算产业技术创新与育成中心 Hot spot information acquisition method and device based on text code recognition
CN108881968A (en) * 2017-05-15 2018-11-23 北京国双科技有限公司 A kind of network video advertisement put-on method and system
CN108881968B (en) * 2017-05-15 2020-10-30 北京国双科技有限公司 Network video advertisement putting method and system
CN107315838A (en) * 2017-07-17 2017-11-03 深圳源广安智能科技有限公司 A kind of efficient network hotspot digging system
CN108712403A (en) * 2018-05-04 2018-10-26 哈尔滨工业大学(威海) The illegal domain name method for digging of similitude is constructed based on domain name
CN110516066A (en) * 2019-07-23 2019-11-29 同盾控股有限公司 A kind of content of text safety protecting method and device
CN110765115A (en) * 2019-09-27 2020-02-07 上海麦克风文化传媒有限公司 Method for combining multiple sorting categories
CN110929160A (en) * 2019-12-02 2020-03-27 上海麦克风文化传媒有限公司 Method for optimizing system sequencing result
CN110888986A (en) * 2019-12-06 2020-03-17 北京明略软件系统有限公司 Information pushing method and device, electronic equipment and computer readable storage medium
CN111580921A (en) * 2020-05-15 2020-08-25 北京字节跳动网络技术有限公司 Content creation method and device
CN111580921B (en) * 2020-05-15 2021-10-22 北京字节跳动网络技术有限公司 Content creation method and device
CN112380339A (en) * 2020-11-23 2021-02-19 北京达佳互联信息技术有限公司 Hot event mining method and device and server

Also Published As

Publication number Publication date
CN105912670A (en) 2016-08-31
CN102831248B (en) 2016-05-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160511

Termination date: 20210918
