CN103885989A

CN103885989A - Method and device for estimating new word document frequency

Info

Publication number: CN103885989A
Application number: CN201210566103.5A
Authority: CN
Inventors: 蔡兵
Original assignee: Tencent Technology Wuhan Co Ltd
Current assignee: Tencent Technology Wuhan Co Ltd
Priority date: 2012-12-24
Filing date: 2012-12-24
Publication date: 2014-06-25
Anticipated expiration: 2032-12-24
Also published as: CN103885989B

Abstract

The invention discloses a method and a device for estimating new word document frequency. The method includes: acquiring a first document set and a second document set, wherein the document data contained in the first document set was generated earlier than the document data contained in the second document set; counting the document frequency of each preset common word in the first document set and in the second document set; counting the document frequency of each preset new word in the second document set; acquiring the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set; and acquiring the document frequency of the preset new words in the first document set according to that fitting relation and the document frequency of the preset new words in the second document set. The method increases the accuracy of new word document frequency statistics and overcomes the large error of traditional methods when counting new word document frequency. It is significant for applying new words in technical fields such as feature selection, keyword extraction, and vector space model representation.

Description

Method and device for estimating new word document frequency

Technical field

The present invention relates to the field of Internet technology, and in particular to a method and device for estimating new word document frequency.

Background technology

With the development of Internet technology, new words are increasingly numerous and have gradually become a common phenomenon on the Internet. New words, also called unregistered words, are meaningful words that never appeared before and have recently become popular. New words usually arise from trending events and trending figures, often carry a large amount of information, and are indispensable feature items for technologies such as text classification and keyword extraction. Document frequency (DF), a classic measure of information, is also widely used in these related fields, for example in vector space models, feature selection, and feature weighting.

Generally, document frequency refers to the number of documents in a massive document collection in which a word occurs. Traditional document frequency calculation is based on statistics over such a collection. Roughly, a large number of documents (for example, 1,000,000) is first randomly selected from the full document collection; each document in the set is then segmented into words, and the number of documents in which each word occurs is counted. That document count is taken as the document frequency of the word.

This method based on massive-collection statistics is fairly stable and fairly accurate for common words. However, because new words appear only in a small number of highly time-sensitive documents, the traditional method has a large error for new words, and the counted value is generally much smaller than the actual value.

Therefore, traditional document frequency calculation based on massive document set statistics is poorly suited to new words, and finding a better way to calculate new word document frequency is particularly important.

Summary of the invention

The main purpose of the present invention is to provide a method and device for estimating new word document frequency, aiming to improve the accuracy of new word document frequency statistics.

To achieve the above purpose, the present invention proposes a method for estimating new word document frequency, comprising:

acquiring a first document set and a second document set, wherein the document data in the first document set was generated earlier than the document data in the second document set;

counting the document frequency of each preset common word in the first document set and in the second document set, and counting the document frequency of each preset new word in the second document set;

acquiring the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set;

acquiring the document frequency of the preset new words in the first document set according to the fitting relation and the document frequency of the preset new words in the second document set.

The present invention also proposes a device for estimating new word document frequency, comprising:

a document set acquisition module, configured to acquire a first document set and a second document set, wherein the document data in the first document set was generated earlier than the document data in the second document set;

a statistics module, configured to count the document frequency of each preset common word in the first document set and in the second document set, and to count the document frequency of each preset new word in the second document set;

a fitting relation acquisition module, configured to acquire the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set;

a new word document frequency acquisition module, configured to acquire the document frequency of the preset new words in the first document set according to the fitting relation and the document frequency of the preset new words in the second document set.

In the method and device for estimating new word document frequency proposed by the present invention, a massive document set (the first document set) and a new document set (the second document set) are determined, the document frequencies of common words in the massive document set and in the new document set are counted, and the relation between these two document frequencies is found. Finally, the document frequency of a new word in the new document set is used to estimate its document frequency in the massive document set. This improves the accuracy of new word document frequency statistics and remedies the large error of traditional statistical methods for new words. The invention is significant for applying new words in technical fields such as feature selection, keyword extraction, and vector space model representation.

Description of the drawings

Fig. 1 is a schematic flowchart of a preferred embodiment of the method for estimating new word document frequency according to the present invention;

Fig. 2 is a schematic diagram of an example document frequency fitting curve in the preferred embodiment of the method for estimating new word document frequency according to the present invention;

Fig. 3 is a schematic structural diagram of a preferred embodiment of the device for estimating new word document frequency according to the present invention;

Fig. 4 is a schematic structural diagram of the fitting relation acquisition module in the preferred embodiment of the device for estimating new word document frequency according to the present invention.

To make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.

Embodiment

The solution of the embodiment of the present invention is mainly as follows: a massive document set (the first document set) and a new document set (the second document set) are determined, the document frequencies of common words in the massive document set and in the new document set are counted, and the relation between these two document frequencies is found. Finally, the document frequency of a new word in the new document set is used to estimate its document frequency in the massive document set, so as to improve the accuracy of new word document frequency statistics and remedy the large error of traditional statistical methods for new words.

As shown in Fig. 1, a preferred embodiment of the present invention proposes a method for estimating new word document frequency, comprising:

Step S101: acquire a first document set and a second document set, wherein the document data in the first document set was generated earlier than the document data in the second document set;

Because new words often appear only in highly time-sensitive pages, traditional document frequency calculation based on massive document set statistics has a large error for them. This embodiment therefore introduces the concept of a new document set and estimates the document frequency of new words in the massive document set based on both the massive document set and the new document set.

Specifically, two document collections are first determined: a massive document set A (the first document set in this embodiment) and a new document set B (the second document set in this embodiment), where:

As a preferred scheme, massive document set A contains about 1,000,000 documents in total, randomly selected from the full document collection; the documents in A are mostly data from more than two years ago.

New document set B contains about 50,000 documents in total, which can be crawled from the home pages of major portal websites; the documents in B are mostly data from within the most recent month.

Note that the generation time of the document data in massive document set A need not be limited to more than two years ago; it may be, for example, one year ago. Likewise, the generation time of the document data in new document set B need not be limited to the most recent month; it may be, for example, within the most recent half month.
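As an illustration of Step S101, the following is a minimal Python sketch of how the two sets might be assembled from a dated corpus. It is not the patent's implementation: the function name `split_document_sets`, the `(created_on, text)` tuple layout, and the default age thresholds are assumptions made for this example, and set B is filtered from the same corpus by date here rather than crawled from portal home pages.

```python
import random
from datetime import date, timedelta


def split_document_sets(documents, today, size_a, size_b,
                        min_age_a=timedelta(days=730),
                        max_age_b=timedelta(days=30),
                        seed=0):
    """Build set A (old, randomly sampled) and set B (recent).

    documents: list of (created_on: date, text: str) tuples; this layout
    is a hypothetical choice for the sketch, not taken from the source.
    """
    # Candidates for A: documents at least min_age_a old (about two years).
    old = [d for d in documents if today - d[0] >= min_age_a]
    # Candidates for B: documents at most max_age_b old (about one month).
    recent = [d for d in documents if today - d[0] <= max_age_b]

    rng = random.Random(seed)
    set_a = rng.sample(old, min(size_a, len(old)))  # random selection
    # Newest documents first, capped at size_b.
    set_b = sorted(recent, key=lambda d: d[0], reverse=True)[:size_b]
    return set_a, set_b
```

With the preferred figures from the text, `size_a` would be about 1,000,000 and `size_b` about 50,000; both thresholds are parameters because the embodiment allows other time windows.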

Step S102: count the document frequency of each preset common word in the first document set and in the second document set, and count the document frequency of each preset new word in the second document set;

Here, a preset common word is a word that occurs frequently; about 70,000 common words are currently defined. A preset new word is a word that, with the development of Internet technology, appears in highly time-sensitive documents; new words usually arise from trending events and trending figures and have existed for only a short time.

Let w denote a common word and t a new word. After the two document sets A and B are determined, the document frequency of each common word w is counted in A and in B, denoted DF_A_w and DF_B_w respectively. DF_A_w is the true document frequency of common word w in massive document set A, and DF_B_w is used later for comparison with the new words in new document set B.

In addition, the document frequency DF_B_t of each new word t in new document set B is counted, so that once the fitting relation between the common-word document frequencies in A and B has been obtained, the document frequency DF_A_t of the new word in massive document set A can be obtained from DF_B_t.

The document frequencies of the common words w in A and B, and of the new words t in B, can be counted as follows:

First, each document in the document set (A or B) is segmented into words; then the number of documents in which each word occurs is counted, and that document count is taken as the document frequency of the word.
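The counting scheme just described can be sketched in a few lines of Python. The function name `document_frequency` and the `tokenize` parameter are assumptions for this example; whitespace splitting stands in for a real word segmenter, which Chinese text would require.

```python
from collections import Counter


def document_frequency(documents, tokenize=str.split):
    """Return a Counter mapping each word to the number of documents
    that contain it at least once."""
    df = Counter()
    for text in documents:
        # A set ensures a word is counted once per document,
        # no matter how often it repeats inside that document.
        df.update(set(tokenize(text)))
    return df
```

Running this once over set A and once over set B would yield the values the text calls DF_A_w, DF_B_w and DF_B_t, after restricting the result to the preset common words and preset new words.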

Step S103: acquire the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set;

Step S104: acquire the document frequency of the preset new words in the first document set according to the fitting relation and the document frequency of the preset new words in the second document set.

In steps S103 and S104, after the document frequency DF_A_w of each common word w in massive document set A and its document frequency DF_B_w in new document set B have been obtained, the relation between the common-word document frequencies in A and B is analyzed.

First, all common words are sorted by their document frequency in massive document set A in ascending order, giving a sorted sequence. The sorted sequence is then divided into groups. Here the group interval is 100: 0-100 is one group, 101-200 is the next, and so on.
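The sorting and grouping step might look like the sketch below. The name `sort_and_group` is hypothetical, and the sketch uses uniform chunks of `group_size` words, which is one reading of the "0-100, 101-200" interval in the text; ties in DF_A are broken alphabetically only to make the order deterministic.

```python
def sort_and_group(df_a, group_size=100):
    """Sort common words by their document frequency in set A (ascending)
    and cut the sorted sequence into consecutive groups.

    df_a: dict mapping each common word to DF_A_w.
    """
    # Ascending by DF_A; the word itself is a tie-breaker (an assumption).
    ordered = sorted(df_a, key=lambda w: (df_a[w], w))
    return [ordered[i:i + group_size]
            for i in range(0, len(ordered), group_size)]
```

The last group may be shorter than `group_size` when the number of common words is not a multiple of it.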

Next, for each group, the average DF_B_w of all common words in the group is calculated. Then, with the average DF_B_w of each group as the abscissa and the sorted value at the center of that group as the ordinate, the points are plotted to obtain the document frequency fitting curve. The document frequency fitting curve obtained from the data of the first 50 groups is shown in Fig. 2.
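A possible way to compute the plotted points is sketched below. It rests on an interpretation that the source does not spell out: the "sorted value at the center of the group" is taken to be DF_A of the word at the group's midpoint, which is consistent with the ordinate later being read off as DF_A_t. The name `curve_points` is hypothetical.

```python
def curve_points(groups, df_a, df_b):
    """Return one (x, y) point per group.

    x: mean DF_B over the group's words (0 if a word is absent from B).
    y: DF_A of the word at the group's midpoint (assumed reading of
       "the sorted value at the center of the group").
    """
    points = []
    for group in groups:
        mean_b = sum(df_b.get(w, 0) for w in group) / len(group)
        center_word = group[len(group) // 2]
        points.append((mean_b, df_a[center_word]))
    return points
```

The group mean of DF_A would be an alternative ordinate; the midpoint value is used here only because it matches the wording of the text more closely.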

The scatter plot in Fig. 2 shows that the document frequencies of common words in massive document set A and in new document set B follow an approximately linear fitting relation; that is, there is a linear relationship between the document frequencies of common words in the two document sets A and B.

Considering that a new word will eventually stabilize and become a common word, the document frequency DF_B_t of the new word in new document set B is taken as the abscissa, and the ordinate obtained from the linear fitting curve shown in Fig. 2 is the document frequency DF_A_t of the new word in massive document set A.
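The final estimate can be sketched as an ordinary least-squares line through the group points, followed by evaluating that line at DF_B_t. The source only states that the relation is approximately linear; least squares is an assumed fitting technique here, and the names `fit_line` and `estimate_df_a` are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept.

    points: iterable of (mean DF_B of a group, DF_A at the group center).
    """
    points = list(points)
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all abscissas are equal; a line cannot be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_df_a(df_b_t, slope, intercept):
    """Estimate DF_A_t from the new word's DF_B_t using the fitted line."""
    return slope * df_b_t + intercept
```

For instance, if the fitted line had slope 20 and intercept 0 (made-up values), a new word seen in 5 documents of set B would be assigned an estimated document frequency of 100 in set A.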

Compared with traditional document frequency calculation, which relies only on massive-collection statistics and therefore has a large error, this embodiment improves the accuracy of new word document frequency statistics through the above scheme and remedies the defect of traditional statistical methods. This embodiment is also significant for applying new words in technical fields such as feature selection, keyword extraction, and vector space model representation.

As shown in Fig. 3, a preferred embodiment of the present invention proposes a device for estimating new word document frequency, comprising a document set acquisition module 201, a statistics module 202, a fitting relation acquisition module 203 and a new word document frequency acquisition module 204, where:

the document set acquisition module 201 is configured to acquire a first document set and a second document set, wherein the document data in the first document set was generated earlier than the document data in the second document set;

the statistics module 202 is configured to count the document frequency of each preset common word in the first document set and in the second document set, and to count the document frequency of each preset new word in the second document set;

the fitting relation acquisition module 203 is configured to acquire the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set;

the new word document frequency acquisition module 204 is configured to acquire the document frequency of the preset new words in the first document set according to the fitting relation and the document frequency of the preset new words in the second document set.

After the two document sets are acquired, the document frequency of each preset common word is counted in the first document set and in the second document set, and the document frequency of each preset new word is counted in the second document set.

After the document frequency DF_A_w of each common word w in massive document set A and its document frequency DF_B_w in new document set B have been obtained, the relation between the common-word document frequencies in A and B is analyzed.

In a specific implementation, as shown in Fig. 4, the fitting relation acquisition module 203 may comprise a sorting unit 2031, a segmenting unit 2032, a calculating unit 2033 and a drawing unit 2034, where:

the sorting unit 2031 is configured to sort all preset common words by their document frequency in the first document set in ascending order, giving a sorted sequence;

the segmenting unit 2032 is configured to divide the sorted sequence into groups;

the calculating unit 2033 is configured to calculate, for each group, the average document frequency in the second document set of all preset common words in the group;

the drawing unit 2034 is configured to plot, with the average document frequency of each group as the abscissa and the sorted value at the center of that group as the ordinate, the document frequency fitting curve.
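The division of the device into modules 201 to 204 can be pictured as a small pipeline. The sketch below is only one possible arrangement: the class name `NewWordDfEstimator`, its field names, and the choice to pass each module in as a plain callable are all assumptions for illustration, not structures taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class NewWordDfEstimator:
    # Module 201: returns (first document set, second document set).
    acquire: Callable[[], Tuple[List[str], List[str]]]
    # Module 202: maps a document set to {word: document frequency}.
    count: Callable[[List[str]], Dict[str, int]]
    # Module 203: maps (DF_A, DF_B) of the common words to a fitted
    # curve, represented as a function from DF_B to DF_A.
    fit: Callable[[Dict[str, int], Dict[str, int]], Callable[[float], float]]

    def estimate(self, common_words, new_words):
        """Module 204: estimate DF_A_t for each preset new word t."""
        set_a, set_b = self.acquire()
        df_a_all, df_b_all = self.count(set_a), self.count(set_b)
        # Restrict the statistics to the preset common words.
        df_a = {w: df_a_all.get(w, 0) for w in common_words}
        df_b = {w: df_b_all.get(w, 0) for w in common_words}
        curve = self.fit(df_a, df_b)
        return {t: curve(df_b_all.get(t, 0)) for t in new_words}
```

Keeping each module behind a callable makes it straightforward to swap, say, the sorting, segmenting, calculating and drawing units of module 203 for a different fitting procedure without touching the rest.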

In the method and device for estimating new word document frequency of the embodiment of the present invention, a massive document set (the first document set) and a new document set (the second document set) are determined, the document frequencies of common words in the massive document set and in the new document set are counted, and the relation between these two document frequencies is found. Finally, the document frequency of a new word in the new document set is used to estimate its document frequency in the massive document set. This improves the accuracy of new word document frequency statistics and remedies the large error of traditional statistical methods for new words. The invention is significant for applying new words in technical fields such as feature selection, keyword extraction, and vector space model representation.

The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A method for estimating new word document frequency, characterized by comprising:

2. The method according to claim 1, characterized in that the step of acquiring the fitting relation between the document frequencies of the preset common words in the first document set and in the second document set comprises:

sorting all preset common words by their document frequency in the first document set in ascending order, giving a sorted sequence;

dividing the sorted sequence into groups;

calculating, for each group, the average document frequency in the second document set of all preset common words in the group;

plotting, with the average document frequency of each group as the abscissa and the sorted value at the center of that group as the ordinate, the document frequency fitting curve.

3. The method according to claim 2, characterized in that the step of acquiring the document frequency of the preset new words in the first document set according to the fitting relation and the document frequency of the preset new words in the second document set comprises:

taking the document frequency of the preset new word in the second document set as the abscissa and looking up the corresponding ordinate on the document frequency fitting curve, which is the document frequency of that preset new word in the first document set.

4. The method according to claim 1, 2 or 3, characterized in that the step of acquiring the first document set and the second document set comprises:

randomly selecting a first predetermined quantity of documents from a given full document collection as the first document set; and crawling a second predetermined quantity of new documents from the home pages of predetermined portal websites as the second document set; wherein the first predetermined quantity is greater than the second predetermined quantity.

5. The method according to claim 4, characterized in that the document data in the first document set was generated at least two years ago, and the document data in the second document set was generated within the past month.

6. A device for estimating new word document frequency, characterized by comprising:

7. The device according to claim 6, characterized in that the fitting relation acquisition module comprises:

a sorting unit, configured to sort all preset common words by their document frequency in the first document set in ascending order, giving a sorted sequence;

a segmenting unit, configured to divide the sorted sequence into groups;

a calculating unit, configured to calculate, for each group, the average document frequency in the second document set of all preset common words in the group;

a drawing unit, configured to plot, with the average document frequency of each group as the abscissa and the sorted value at the center of that group as the ordinate, the document frequency fitting curve.

8. The device according to claim 7, characterized in that the new word document frequency acquisition module is further configured to take the document frequency of the preset new word in the second document set as the abscissa and look up the corresponding ordinate on the document frequency fitting curve, which is the document frequency of that preset new word in the first document set.

9. The device according to claim 6, 7 or 8, characterized in that the document set acquisition module is further configured to randomly select a first predetermined quantity of documents from a given full document collection as the first document set, and to crawl a second predetermined quantity of new documents from the home pages of predetermined portal websites as the second document set; wherein the first predetermined quantity is greater than the second predetermined quantity.

10. The device according to claim 9, characterized in that the document data in the first document set was generated at least two years ago, and the document data in the second document set was generated within the past month.