CN103885989A - Method and device for estimating new word document frequency - Google Patents

Publication number
CN103885989A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210566103.5A
Other languages
Chinese (zh)
Other versions
CN103885989B (en)
Inventor
蔡兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Wuhan Co Ltd
Original Assignee
Tencent Technology Wuhan Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-12-24
Filing date
2012-12-24
Publication date
2014-06-25
Application filed by Tencent Technology Wuhan Co Ltd
Priority to CN201210566103.5A
Publication of CN103885989A
Application granted
Publication of CN103885989B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention discloses a method and a device for estimating the document frequency of new words. The method includes: acquiring a first document set and a second document set, wherein the documents in the first document set were generated earlier than those in the second document set; counting the document frequency of each preset common word in the first document set and in the second document set; counting the document frequency of each preset new word in the second document set; acquiring the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set; and obtaining the document frequency of each preset new word in the first document set from that fitting relation and the new word's document frequency in the second document set. The method increases the accuracy of new-word document frequency statistics and overcomes the large error that traditional methods show when counting the document frequency of new words. This is significant for applying new words in technical fields such as feature selection, keyword extraction, and vector space model representation.

Description

Method and device for estimating new word document frequency
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and a device for estimating the document frequency of new words.
Background
With the development of Internet technology, new words keep increasing and have gradually become a common phenomenon on the Internet. New words, also called unregistered words, are meaningful words that never appeared before and have recently become popular. New words usually arise from hot events and hot figures and often carry a large amount of information, so they are indispensable feature items for techniques such as text classification and keyword extraction. Document frequency (DF), a classic measure of information, is also widely used in these related fields, for example in vector space models, feature selection, and feature weighting.
Generally, the document frequency of a word is the number of documents in a massive document collection in which the word appears. Traditional document frequency calculation is based on statistics over such a collection. Roughly, a large document set (for example, 1,000,000 documents) is first randomly sampled from the full corpus; each document in the set is then segmented into words, and the number of documents in which each word appears is counted. That count is taken as the word's document frequency.
This statistical approach is fairly stable and is accurate for common words. However, because new words appear only in a small number of highly timely documents, it produces a large error for new words: the counted document frequency is generally far below the true value.
Traditional document frequency calculation based on massive document set statistics is therefore poorly suited to new words, and finding a better way to calculate the document frequency of new words is particularly important.
Summary of the invention
The main purpose of the present invention is to provide a method and a device for estimating the document frequency of new words, so as to improve the accuracy of new-word document frequency statistics.
To achieve this purpose, the present invention proposes a method for estimating the document frequency of new words, comprising:
Acquiring a first document set and a second document set, wherein the documents in the first document set were generated earlier than those in the second document set;
Counting the document frequency of each preset common word in the first document set and in the second document set, and counting the document frequency of each preset new word in the second document set;
Acquiring the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set;
Obtaining the document frequency of each preset new word in the first document set according to the fitting relation and the document frequency of the preset new word in the second document set.
The present invention also proposes a device for estimating the document frequency of new words, comprising:
A document set acquisition module, configured to acquire a first document set and a second document set, wherein the documents in the first document set were generated earlier than those in the second document set;
A statistics module, configured to count the document frequency of each preset common word in the first document set and in the second document set, and to count the document frequency of each preset new word in the second document set;
A fitting relation acquisition module, configured to acquire the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set;
A new word document frequency acquisition module, configured to obtain the document frequency of each preset new word in the first document set according to the fitting relation and the document frequency of the preset new word in the second document set.
The method and device proposed by the present invention determine a massive document set (the first document set) and a new document set (the second document set), count the document frequencies of common words in both sets, find the relation between these two document frequencies, and finally use the document frequency of a new word in the new document set to estimate its document frequency in the massive document set. This improves the accuracy of new-word document frequency statistics and overcomes the large error that traditional statistical methods show for new words. The present invention is also significant for applying new words in technical fields such as feature selection, keyword extraction, and vector space model representation.
Brief description of the drawings
Fig. 1 is a flow chart of a preferred embodiment of the method for estimating new word document frequency according to the present invention;
Fig. 2 is a schematic diagram of an example document frequency fitting curve in the preferred embodiment of the method;
Fig. 3 is a structural diagram of a preferred embodiment of the device for estimating new word document frequency according to the present invention;
Fig. 4 is a structural diagram of the fitting relation acquisition module in the preferred embodiment of the device.
To make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.
Detailed description
The solution of the embodiments of the present invention is mainly as follows: determine a massive document set (the first document set) and a new document set (the second document set), count the document frequencies of common words in both sets, find the relation between these two document frequencies, and finally use the document frequency of a new word in the new document set to estimate its document frequency in the massive document set. This improves the accuracy of new-word document frequency statistics and overcomes the large error that traditional statistical methods show for new words.
As shown in Fig. 1, a preferred embodiment of the present invention proposes a method for estimating the document frequency of new words, comprising:
Step S101: acquire a first document set and a second document set, wherein the documents in the first document set were generated earlier than those in the second document set.
New words often appear only in highly timely pages, so traditional document frequency calculation based on massive document set statistics has a large error for them. This embodiment therefore introduces the concept of a new document set and estimates the document frequency of new words in the massive document set from both the massive document set and the new document set.
Specifically, two document collections are first determined: a massive document set A (the first document set of this embodiment) and a new document set B (the second document set of this embodiment), wherein:
As a preferred scheme, the massive document set A contains about 1,000,000 documents in total, randomly selected from the full corpus; the documents in A were mostly generated two or more years ago.
The new document set B contains about 50,000 documents in total, which can be crawled from the home pages of major portal websites; the documents in B were mostly generated within the most recent month.
It should be noted that the generation time of the documents in the massive document set A need not be limited to two years ago; it may be, for example, one year ago. Likewise, the generation time of the documents in the new document set B need not be limited to the most recent month; it may be, for example, within the most recent half month.
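The selection of the two document sets by generation time can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_document_sets`, the `created` field, and the idea of drawing both sets from a single timestamped corpus are hypothetical. In the embodiment itself, set A is randomly sampled from the full corpus and set B is crawled from portal home pages.

```python
import random
from datetime import datetime, timedelta


def split_document_sets(docs, now, min_age_a=timedelta(days=730),
                        max_age_b=timedelta(days=30),
                        size_a=1_000_000, seed=0):
    """Return (set_a, set_b) chosen by document generation time.

    `docs` is a list of dicts with a hypothetical `created` datetime field.
    """
    # Candidates for A: documents generated at least `min_age_a` ago.
    old_docs = [d for d in docs if now - d["created"] >= min_age_a]
    # B: documents generated within the last `max_age_b`.
    set_b = [d for d in docs if now - d["created"] <= max_age_b]
    # A is a random sample of the old documents, capped at `size_a`.
    rng = random.Random(seed)
    set_a = rng.sample(old_docs, min(size_a, len(old_docs)))
    return set_a, set_b


now = datetime(2012, 12, 24)
docs = [
    {"id": 1, "created": datetime(2010, 1, 1)},    # older than two years
    {"id": 2, "created": datetime(2012, 12, 10)},  # within the last month
    {"id": 3, "created": datetime(2012, 6, 1)},    # belongs to neither set
]
set_a, set_b = split_document_sets(docs, now)
```

The two age thresholds are parameters because the embodiment allows other windows, such as one year for A or half a month for B.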
Step S102: count the document frequency of each preset common word in the first document set and in the second document set, and count the document frequency of each preset new word in the second document set.
Here, preset common words are words that appear frequently; about 70,000 common words are currently defined. Preset new words are words that, with the development of Internet technology, appear in highly timely documents; they usually arise from hot events and hot figures and have existed for only a short time.
Let w denote a common word and t denote a new word. After the two document sets A and B are determined, the document frequency of each common word w is counted in A and in B, giving DF_A_w and DF_B_w respectively. DF_A_w is the true document frequency of the common word w in the massive document set A, and DF_B_w is used later for comparison with the new words in the new document set B.
In addition, the document frequency DF_B_t of each new word t in the new document set B is counted. Once the fitting relation between the document frequencies of common words in A and in B has been acquired, DF_B_t is used to obtain the document frequency DF_A_t of the new word in the massive document set A.
The document frequencies of the common words w in A and B, and of the new words t in B, can be counted as follows:
First, each document in the document set (A or B) is segmented into words. Then the number of documents in which each word appears is counted, and this count is taken as the word's document frequency.
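The counting scheme above can be sketched as follows. This is a minimal illustration that assumes word segmentation has already been performed, so each document is given as a list of tokens; the function name `document_frequency` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def document_frequency(tokenized_docs: Iterable[List[str]]) -> Counter:
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        # A word counts once per document, however often it repeats there.
        df.update(set(tokens))
    return df


# Hypothetical, already segmented documents standing in for set B.
docs_b = [
    ["new", "phone", "launch", "phone"],
    ["phone", "review"],
    ["launch", "event"],
]
df_b = document_frequency(docs_b)
```

Converting each token list to a set before counting is what separates document frequency from raw term frequency.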
Step S103: acquire the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set.
Step S104: obtain the document frequency of each preset new word in the first document set according to the fitting relation and the document frequency of the preset new word in the second document set.
In steps S103 and S104, after the document frequency DF_A_w of each common word w in the massive document set A and its document frequency DF_B_w in the new document set B have been obtained, the relation between the document frequencies of common words in A and in B is analyzed.
First, all common words are sorted by their document frequency in the massive document set A in ascending order, giving a sorted sequence. The sorted sequence is then divided into groups. Here the group interval is 100: 0-100 is one group, 101-200 is the next, and so on.
Next, for each group, the average DF_B_w of all common words in the group is calculated. Then, with the average DF_B_w of each group as the abscissa and the sort value at the center of that group (that is, the document frequency in A at the group's middle position) as the ordinate, the points are plotted to obtain the document frequency fitting curve. The fitting curve obtained from the data of the first 50 groups is shown in Fig. 2.
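The sorting, grouping, and averaging described above can be sketched as follows. The function name `fitting_points` is hypothetical, and taking the ordinate as the document frequency in A of the word at the middle position of each group is one reading of "the sort value at the center of the group"; treat it as an assumption.

```python
from statistics import mean
from typing import Dict, List, Tuple


def fitting_points(df_a: Dict[str, int], df_b: Dict[str, int],
                   group_size: int = 100) -> List[Tuple[float, float]]:
    """Build (mean DF in B, DF in A at group center) points for common words."""
    # Sort common words by their document frequency in A, ascending.
    ordered = sorted(df_a, key=lambda w: df_a[w])
    points = []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        # Abscissa: average document frequency of the group's words in B.
        x = mean(df_b.get(w, 0) for w in group)
        # Ordinate: document frequency in A of the word at the group center.
        y = df_a[group[len(group) // 2]]
        points.append((x, float(y)))
    return points


# Tiny hypothetical data set, grouped two words at a time.
df_a = {"a": 10, "b": 20, "c": 30, "d": 40}
df_b = {"a": 1, "b": 3, "c": 5, "d": 7}
points = fitting_points(df_a, df_b, group_size=2)
```

Averaging within groups smooths the noise of individual words, so the resulting points trace the overall trend between the two document sets.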
The scatter plot in Fig. 2 shows that the document frequencies of common words in the massive document set A and in the new document set B follow an approximately linear fitting relation; that is, the document frequencies of a common word in the two document sets A and B are linearly related.
Since a new word will eventually become a common word and stabilize, the document frequency DF_B_t of the new word in the new document set B is taken as the abscissa, and the ordinate value obtained from the linear fitting curve shown in Fig. 2 is the document frequency DF_A_t of the new word in the massive document set A.
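The lookup on the fitted line can be sketched as follows. The text states only that the relation is approximately linear and that the ordinate is read from the curve; using ordinary least squares to obtain the line is an assumption made here for illustration, and the names `fit_line` and `estimate_df_a` and the sample points are hypothetical.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_df_a(df_b_t: float, slope: float, intercept: float) -> float:
    """Estimate DF_A_t of a new word from its DF_B_t using the fitted line."""
    return slope * df_b_t + intercept


# Hypothetical (mean DF_B_w, DF_A_w at group center) points for common words.
points = [(1.0, 25.0), (2.0, 45.0), (3.0, 65.0)]
slope, intercept = fit_line(points)
df_a_t = estimate_df_a(10.0, slope, intercept)
```

In this toy data the points lie exactly on y = 20x + 5, so a new word found in 10 documents of B is estimated at 205 documents in A.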
Compared with traditional document frequency calculation, which relies solely on statistics over a massive document collection and therefore has a large error, this embodiment improves the accuracy of new-word document frequency statistics and thus overcomes the defect of the traditional statistical method. This embodiment is also significant for applying new words in technical fields such as feature selection, keyword extraction, and vector space model representation.
As shown in Fig. 3, a preferred embodiment of the present invention proposes a device for estimating the document frequency of new words, comprising a document set acquisition module 201, a statistics module 202, a fitting relation acquisition module 203, and a new word document frequency acquisition module 204, wherein:
The document set acquisition module 201 is configured to acquire a first document set and a second document set, wherein the documents in the first document set were generated earlier than those in the second document set;
The statistics module 202 is configured to count the document frequency of each preset common word in the first document set and in the second document set, and to count the document frequency of each preset new word in the second document set;
The fitting relation acquisition module 203 is configured to acquire the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set;
The new word document frequency acquisition module 204 is configured to obtain the document frequency of each preset new word in the first document set according to the fitting relation and the document frequency of the preset new word in the second document set.
New words often appear only in highly timely pages, so traditional document frequency calculation based on massive document set statistics has a large error for them. This embodiment therefore introduces the concept of a new document set and estimates the document frequency of new words in the massive document set from both the massive document set and the new document set.
Specifically, two document collections are first determined: a massive document set A (the first document set of this embodiment) and a new document set B (the second document set of this embodiment), wherein:
As a preferred scheme, the massive document set A contains about 1,000,000 documents in total, randomly selected from the full corpus; the documents in A were mostly generated two or more years ago.
The new document set B contains about 50,000 documents in total, which can be crawled from the home pages of major portal websites; the documents in B were mostly generated within the most recent month.
It should be noted that the generation time of the documents in the massive document set A need not be limited to two years ago; it may be, for example, one year ago. Likewise, the generation time of the documents in the new document set B need not be limited to the most recent month; it may be, for example, within the most recent half month.
Then, the document frequency of each preset common word is counted in the first document set and in the second document set, and the document frequency of each preset new word is counted in the second document set.
Here, preset common words are words that appear frequently; about 70,000 common words are currently defined. Preset new words are words that, with the development of Internet technology, appear in highly timely documents; they usually arise from hot events and hot figures and have existed for only a short time.
Let w denote a common word and t denote a new word. After the two document sets A and B are determined, the document frequency of each common word w is counted in A and in B, giving DF_A_w and DF_B_w respectively. DF_A_w is the true document frequency of the common word w in the massive document set A, and DF_B_w is used later for comparison with the new words in the new document set B.
In addition, the document frequency DF_B_t of each new word t in the new document set B is counted. Once the fitting relation between the document frequencies of common words in A and in B has been acquired, DF_B_t is used to obtain the document frequency DF_A_t of the new word in the massive document set A.
The document frequencies of the common words w in A and B, and of the new words t in B, can be counted as follows:
First, each document in the document set (A or B) is segmented into words. Then the number of documents in which each word appears is counted, and this count is taken as the word's document frequency.
After the document frequency DF_A_w of each common word w in the massive document set A and its document frequency DF_B_w in the new document set B have been obtained, the relation between the document frequencies of common words in A and in B is analyzed.
First, all common words are sorted by their document frequency in the massive document set A in ascending order, giving a sorted sequence. The sorted sequence is then divided into groups. Here the group interval is 100: 0-100 is one group, 101-200 is the next, and so on.
Next, for each group, the average DF_B_w of all common words in the group is calculated. Then, with the average DF_B_w of each group as the abscissa and the sort value at the center of that group (that is, the document frequency in A at the group's middle position) as the ordinate, the points are plotted to obtain the document frequency fitting curve. The fitting curve obtained from the data of the first 50 groups is shown in Fig. 2.
The scatter plot in Fig. 2 shows that the document frequencies of common words in the massive document set A and in the new document set B follow an approximately linear fitting relation; that is, the document frequencies of a common word in the two document sets A and B are linearly related.
Since a new word will eventually become a common word and stabilize, the document frequency DF_B_t of the new word in the new document set B is taken as the abscissa, and the ordinate value obtained from the linear fitting curve shown in Fig. 2 is the document frequency DF_A_t of the new word in the massive document set A.
In a specific implementation, as shown in Fig. 4, the fitting relation acquisition module 203 may comprise a sorting unit 2031, a segmentation unit 2032, a calculation unit 2033, and a plotting unit 2034, wherein:
The sorting unit 2031 is configured to sort all preset common words by their document frequency in the first document set in ascending order, obtaining a sorted sequence;
The segmentation unit 2032 is configured to divide the sorted sequence into groups;
The calculation unit 2033 is configured to calculate, for each group, the average document frequency in the second document set of all preset common words in the group;
The plotting unit 2034 is configured to plot the document frequency fitting curve, with the average document frequency of each group as the abscissa and the sort value at the center of that group as the ordinate.
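How the four units might chain together can be sketched as a small class. This is an illustrative arrangement rather than the device's actual structure: the class name `FittingRelationModule`, its method names, and the choice of the middle word's document frequency in the first set as the ordinate are all assumptions.

```python
from statistics import mean
from typing import Dict, List, Tuple


class FittingRelationModule:
    """Hypothetical counterpart of module 203, with one method per unit."""

    def __init__(self, group_size: int = 100) -> None:
        self.group_size = group_size

    def sort(self, df_a: Dict[str, int]) -> List[str]:
        # Sorting unit 2031: ascending by document frequency in the first set.
        return sorted(df_a, key=lambda w: df_a[w])

    def segment(self, ordered: List[str]) -> List[List[str]]:
        # Segmentation unit 2032: split the sorted sequence into groups.
        step = self.group_size
        return [ordered[i:i + step] for i in range(0, len(ordered), step)]

    def average(self, groups: List[List[str]],
                df_b: Dict[str, int]) -> List[float]:
        # Calculation unit 2033: mean document frequency in the second set.
        return [mean(df_b.get(w, 0) for w in g) for g in groups]

    def curve(self, groups: List[List[str]], averages: List[float],
              df_a: Dict[str, int]) -> List[Tuple[float, int]]:
        # Plotting unit 2034: (group average, sort value at group center).
        return [(x, df_a[g[len(g) // 2]]) for g, x in zip(groups, averages)]

    def run(self, df_a: Dict[str, int],
            df_b: Dict[str, int]) -> List[Tuple[float, int]]:
        groups = self.segment(self.sort(df_a))
        return self.curve(groups, self.average(groups, df_b), df_a)


module = FittingRelationModule(group_size=2)
curve = module.run({"a": 10, "b": 20, "c": 30, "d": 40},
                   {"a": 1, "b": 3, "c": 5, "d": 7})
```

Keeping each unit as a separate method mirrors the unit boundaries of Fig. 4 and lets each stage be checked on its own.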
The method and device for estimating new word document frequency according to the embodiments of the present invention determine a massive document set (the first document set) and a new document set (the second document set), count the document frequencies of common words in both sets, find the relation between these two document frequencies, and finally use the document frequency of a new word in the new document set to estimate its document frequency in the massive document set. This improves the accuracy of new-word document frequency statistics and overcomes the large error that traditional statistical methods show for new words. The present invention is also significant for applying new words in technical fields such as feature selection, keyword extraction, and vector space model representation.
The foregoing describes only preferred embodiments of the present invention and does not limit the scope of its claims. Any equivalent structure or process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims (10)

1. A method for estimating the document frequency of new words, characterized by comprising:
Acquiring a first document set and a second document set, wherein the documents in the first document set were generated earlier than those in the second document set;
Counting the document frequency of each preset common word in the first document set and in the second document set, and counting the document frequency of each preset new word in the second document set;
Acquiring the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set;
Obtaining the document frequency of each preset new word in the first document set according to the fitting relation and the document frequency of the preset new word in the second document set.
2. The method according to claim 1, characterized in that the step of acquiring the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set comprises:
Sorting all preset common words by their document frequency in the first document set in ascending order to obtain a sorted sequence;
Dividing the sorted sequence into groups;
Calculating, for each group, the average document frequency in the second document set of all preset common words in the group;
Plotting the document frequency fitting curve, with the average document frequency of each group as the abscissa and the sort value at the center of that group as the ordinate.
3. The method according to claim 2, characterized in that the step of obtaining the document frequency of each preset new word in the first document set according to the fitting relation and the document frequency of the preset new word in the second document set comprises:
Taking the document frequency of the preset new word in the second document set as the abscissa and looking up the corresponding ordinate on the document frequency fitting curve, which is the document frequency of the preset new word in the first document set.
4. The method according to claim 1, 2 or 3, characterized in that the step of acquiring the first document set and the second document set comprises:
Randomly selecting a first predetermined number of documents from a given full document collection as the first document set; crawling a second predetermined number of new documents from the home pages of predetermined portal websites as the second document set; the first predetermined number being greater than the second predetermined number.
5. The method according to claim 4, characterized in that the documents in the first document set were generated at least two years ago, and the documents in the second document set were generated within the most recent month.
6. A device for estimating the document frequency of new words, characterized by comprising:
A document set acquisition module, configured to acquire a first document set and a second document set, wherein the documents in the first document set were generated earlier than those in the second document set;
A statistics module, configured to count the document frequency of each preset common word in the first document set and in the second document set, and to count the document frequency of each preset new word in the second document set;
A fitting relation acquisition module, configured to acquire the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set;
A new word document frequency acquisition module, configured to obtain the document frequency of each preset new word in the first document set according to the fitting relation and the document frequency of the preset new word in the second document set.
7. The device according to claim 6, characterized in that the fitting relation acquisition module comprises:
A sorting unit, configured to sort all preset common words by their document frequency in the first document set in ascending order to obtain a sorted sequence;
A segmentation unit, configured to divide the sorted sequence into groups;
A calculation unit, configured to calculate, for each group, the average document frequency in the second document set of all preset common words in the group;
A plotting unit, configured to plot the document frequency fitting curve, with the average document frequency of each group as the abscissa and the sort value at the center of that group as the ordinate.
8. The device according to claim 7, characterized in that the new word document frequency acquisition module is further configured to take the document frequency of the preset new word in the second document set as the abscissa and look up the corresponding ordinate on the document frequency fitting curve, which is the document frequency of the preset new word in the first document set.
9. The device according to claim 6, 7 or 8, characterized in that the document set acquisition module is further configured to randomly select a first predetermined number of documents from a given full document collection as the first document set, and to crawl a second predetermined number of new documents from the home pages of predetermined portal websites as the second document set; the first predetermined number being greater than the second predetermined number.
10. The device according to claim 9, characterized in that the documents in the first document set were generated at least two years ago, and the documents in the second document set were generated within the most recent month.
CN201210566103.5A 2012-12-24 2012-12-24 Estimate the method and device of neologisms document frequency Active CN103885989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210566103.5A CN103885989B (en) 2012-12-24 2012-12-24 Estimate the method and device of neologisms document frequency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210566103.5A CN103885989B (en) 2012-12-24 2012-12-24 Method and device for estimating new word document frequency

Publications (2)

Publication Number Publication Date
CN103885989A (en) 2014-06-25
CN103885989B (en) 2017-12-01

Family

ID=50954884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210566103.5A Active CN103885989B (en) Method and device for estimating new word document frequency 2012-12-24 2012-12-24

Country Status (1)

Country Link
CN (1) CN103885989B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6549897B1 (en) * 1998-10-09 2003-04-15 Microsoft Corporation Method and system for calculating phrase-document importance
WO2007005742A2 (en) * 2005-07-01 2007-01-11 Ebrary, Inc. Method and apparatus for document clustering and document sketching
CN101196904A (en) * 2007-11-09 2008-06-11 清华大学 News keyword extraction method based on word frequency and multi-component grammar
CN102662952A (en) * 2012-03-02 2012-09-12 成都康赛电子科大信息技术有限责任公司 Chinese text parallel data mining method based on hierarchy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘志平等: "最小二乘原理及其matlab实现", 《开发应用》 *
曹素青等: "一个中文文本自动分类数学模型", 《情报学报》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241611A (en) * 2016-12-26 2018-07-03 北京国双科技有限公司 Keyword extraction method and extraction equipment
CN108241611B (en) * 2016-12-26 2021-08-17 北京国双科技有限公司 Keyword extraction method and extraction equipment
CN112883186A (en) * 2019-11-29 2021-06-01 智慧芽信息科技(苏州)有限公司 Method, system, equipment and storage medium for generating information map
CN112883186B (en) * 2019-11-29 2024-04-12 智慧芽信息科技(苏州)有限公司 Method, system, equipment and storage medium for generating information map

Also Published As

Publication number Publication date
CN103885989B (en) 2017-12-01

Similar Documents

Publication Publication Date Title
CN104866572B (en) Network short text clustering method
CN106649272B (en) Named entity recognition method based on a hybrid model
CN104077407B (en) Intelligent data search system and method
CN104424308A (en) Web page classification standard acquisition method and device and web page classification method and device
CN101593200A (en) Chinese web page classification method based on keyword frequency analysis
CN102929928A (en) Multidimensional-similarity-based personalized news recommendation method
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
CN103578008A (en) Method and device for recommending garment products
CN105893380B (en) Improved feature selection method for text classification
WO2011057497A1 (en) Method and device for mining and evaluating vocabulary quality
CN109558587B (en) Method for classifying public opinion tendency recognition aiming at category distribution imbalance
CN106296368A (en) Vehicle recommendation system and method
CN101393555A (en) Spam blog detection method
CN105630768A (en) Cascaded conditional random field-based product name recognition method and device
CN105023178B (en) Ontology-based e-commerce recommendation method
CN102880647A (en) Method and device for acquiring an alias of an organization
CN101751425A (en) Method and device for acquiring document set abstracts
CN103559303A (en) Evaluation and selection method for data mining algorithm
CN103838754A (en) Information searching device and method
CN105512333A (en) Product comment theme searching method based on emotional tendency
CN102081598A (en) Method for detecting duplicated texts
CN105740480A (en) Air ticket recommending method and system
CN104123318A (en) Method and system for displaying interest points in map
CN102364467A (en) Network search method and system
CN103106234A (en) Searching method and device of webpage content

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant