CN103744959B - Webpage class feature vector extracting method based on ant colony algorithm - Google Patents
Webpage class feature vector extracting method based on ant colony algorithm
- Publication number: CN103744959B (application CN201410004815.7A; also published as CN103744959A)
- Authority: CN (China)
- Prior art keywords: word, ant, coco, max, words
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
Abstract
The invention relates to a method for extracting feature words with an improved ant colony algorithm. The method comprises the following steps: during preprocessing, all information is stored in hash tables, where coco_prepare stores the information of each article (the article id, its words, and the number of occurrences of each word) and readhdfs_prepare stores the statistics of each class's training set (the term frequency of each word, its document count, and its number of co-occurrences with the class name); the parameters of the ant colony algorithm are then set, including the number of ants M, the number of iterations N, the number of steps per ant, which is the number of feature words K, the initial path pheromone matrix adMatrixs, the local update decay rate p1, the global update decay rate p2, and the amount of pheromone m released by each ant. The method is the first to apply the ant colony algorithm to the problem of extracting accurate feature vectors for classes when no accurate sample set is available.
Description
Technical field
The present invention relates to text mining and is applied to web page classification.
Background
Web text mining refers to methods and tools for extracting useful information from massive numbers of web pages, and web page classification is one of its main aspects. It is well known in machine learning that training a classifier requires a sample set that accurately represents each class, to serve as the training set and the test set. There are three main ways to obtain a sample set: (1) use an existing public corpus; (2) manually collect samples for each class name; (3) use a web crawler. In practice, class names are defined by the needs of the application, so the corpora of method (1) rarely meet real needs, and method (2) is time-consuming and labor-intensive, so method (3) is usually used to obtain the training set. However, information on the internet is complex and redundant. The main problem addressed here is how to extract accurate information from a sample set that is not very accurate, use it as the feature values of a class, and obtain a weight for each feature word, without making the result so complex that a learning algorithm cannot process it.
The ant colony algorithm is a probabilistic algorithm for finding optimal paths in a graph and is a kind of simulated evolutionary algorithm. Ant colony optimization has been applied to many combinatorial optimization problems. In this setting, the candidate feature set contains a very large number of words, while the feature words that best represent a class must be few and accurate, which makes this a problem that the ant colony algorithm can solve.
Summary of the invention
When no standard text training set is available, the present invention obtains the sample set with a web crawler. Such a sample set clearly contains a lot of noise, such as advertisements, and because of this noise ordinary text vector extraction algorithms cannot obtain an accurate class feature vector. To solve this problem, the likelihood that an ant selects a word is determined by a weighted combination of four metrics: term frequency, document frequency, the co-occurrence rate of the feature word with the class name, and the co-occurrence rate between feature words. The ant colony algorithm is designed on this basis to extract an accurate class feature vector within acceptable complexity. The resulting feature vector serves as an important part of the classifier used for web page classification.
The technical solution provided by the present invention is as follows:
A method for extracting feature words with an improved ant colony algorithm, characterized by the following process:
During preprocessing, all information is stored in hash tables. The table coco_prepare stores the information of each article: the article id, each word, and the number of times the word occurs. The table readhdfs_prepare stores the statistics of each class's training set: the term frequency of each word, its document count, and its number of co-occurrences with the class name. The parameters of the ant colony algorithm are then set: the number of ants m; the number of iterations n; the number of steps each ant takes, which is the number of feature words k; the initial path pheromone matrix admatrixs; the local update decay rate p1 and the global update decay rate p2; and the amount of pheromone m released by each ant (a quantity distinct from the number of ants, although both are written m).
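The preprocessing tables and parameters described above could be laid out roughly as follows. This is a minimal sketch, not the patent's implementation: the class names `ClassStats` and `AcoParams`, the field name `deposit`, all numeric values, and the choice to index the pheromone matrix by a pair of consecutive words are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ClassStats:
    """One entry of readhdfs_prepare (hypothetical layout)."""
    tf: int  # term frequency of the word in the class's training set
    df: int  # number of documents that contain the word
    co: int  # number of co-occurrences with the class name


@dataclass
class AcoParams:
    """Ant colony parameters; the default values are illustrative only."""
    m: int = 10           # number of ants
    n: int = 50           # number of iterations
    k: int = 20           # steps per ant = number of feature words
    p1: float = 0.1       # local update decay rate
    p2: float = 0.1       # global update decay rate
    deposit: float = 1.0  # pheromone amount released by an ant (the second "m")


# coco_prepare: article id -> {word: occurrence count in that article}
coco_prepare = {
    "doc1": {"ant": 3, "colony": 2, "path": 1},
    "doc2": {"ant": 1, "pheromone": 4},
}

# readhdfs_prepare: word -> class-level statistics
readhdfs_prepare = {
    "ant": ClassStats(tf=4, df=2, co=2),
    "colony": ClassStats(tf=2, df=1, co=1),
    "path": ClassStats(tf=1, df=1, co=1),
    "pheromone": ClassStats(tf=4, df=1, co=0),
}

# admatrixs: pheromone on the path between two words, with an assumed
# initial value of 1.0 for paths that have not been updated yet.
INITIAL_PHEROMONE = 1.0
admatrixs = defaultdict(lambda: INITIAL_PHEROMONE)
```

Plain dictionaries play the role of the hash tables here; the source does not say how the tables are stored beyond naming them.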
Step 1. The m ants are placed at random on their first word. Because the candidate set is large, the range of this random choice is restricted manually: the tf*idf value of each feature word is computed, and the 20 to 30 words with the largest values form the range from which each ant's first path node is chosen at random.
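The restricted random start could be sketched as below. The source does not give its idf formula, so this sketch assumes the common form idf = log(total_docs / df); the function names `start_candidates` and `place_ants` are hypothetical.

```python
import math
import random


def start_candidates(stats, total_docs, top=20):
    """Return the `top` words with the largest tf*idf value.

    stats maps word -> (tf, df). The idf form log(total_docs / df) is an
    assumption; the source only names the tf*idf value.
    """
    scored = {
        word: tf * math.log(total_docs / df)
        for word, (tf, df) in stats.items()
        if df > 0
    }
    return sorted(scored, key=scored.get, reverse=True)[:top]


def place_ants(candidates, m, rng=random):
    """Pick a random first word for each of the m ants from the restricted range."""
    return [rng.choice(candidates) for _ in range(m)]
```

For example, `place_ants(start_candidates(stats, total_docs, top=20), m=10)` yields ten starting words, all drawn from the twenty highest-scoring candidates.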
Step 2. For each ant, a tabu list result is maintained. It holds the words the ant has already visited, and every word in it is excluded from the candidate set candidate for the next step. During selection, the likelihood that a word is chosen is calculated as follows:
(1) take the term frequency tf of the word from table readhdfs_prepare, obtain the maximum term frequency tf_max and the minimum term frequency tf_min over all words, and normalize: tf=(tf-tf_min)/(tf_max-tf_min);
(2) take the document count df of the word from table readhdfs_prepare, obtain the maximum document count df_max and the minimum document count df_min over all words, and normalize: df=(df-df_min)/(df_max-df_min);
(3) take the co-occurrence count co of the word with the class name from table readhdfs_prepare, obtain the maximum co-occurrence count co_max and the minimum co-occurrence count co_min over all words, and normalize: co=(co-co_min)/(co_max-co_min);
(4) from table coco_prepare, obtain coco, the sum of the co-occurrence counts between this word and all words in the result table, compute the maximum coco_max and the minimum coco_min of this value over all words, and normalize: coco=(coco-coco_min)/(coco_max-coco_min);
(5) obtain the pheromone value r on the corresponding path;
(6) using the proportions tf_per, df_per, co_per, coco_per assigned to the four parts, calculate the selection likelihood p=tf_per*tf+df_per*df+co_per*co+coco_per*coco+r;
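The min-max normalization and the weighted sum of sub-steps (1) to (6) can be written compactly. The sketch below follows the formulas in the text; the equal default weights and the handling of the case max == min (returning 0.0) are assumptions, since the source does not specify either.

```python
def min_max(value, lo, hi):
    """Scale value into [0, 1] as (value - lo) / (hi - lo).

    Returning 0.0 when hi == lo is an assumption made to avoid division by zero.
    """
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def selection_score(tf, df, co, coco, r, weights=(0.25, 0.25, 0.25, 0.25)):
    """p = tf_per*tf + df_per*df + co_per*co + coco_per*coco + r.

    tf, df, co and coco are expected to be already normalized with min_max;
    r is the pheromone value on the corresponding path.
    """
    tf_per, df_per, co_per, coco_per = weights
    return tf_per * tf + df_per * df + co_per * co + coco_per * coco + r
```

Note that r is added unscaled, as in the source formula, so the pheromone term can dominate the four normalized metrics when it grows larger than 1.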
Step 3. After p has been calculated for all candidate words, the word with the largest p value is taken as the word selected by the ant in this step, and its weight value is recorded. The word is added to the ant's tabu list result and deleted from the ant's candidate set, and the ant continues to the next step.
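One ant's walk with its tabu list could look like the following sketch. The function name `walk` and the callback signature `score(result, word)` are hypothetical; the recorded weight is assumed to be the p value of the chosen word, and ties are broken alphabetically only to make the example deterministic.

```python
def walk(start, words, score, k):
    """Build one ant's path of up to k words by greedy selection.

    score(result, word) is a caller-supplied function returning the selection
    likelihood p of `word` given the words already in the tabu list `result`.
    """
    result = [start]  # tabu list: words already visited
    weights = {}      # weight recorded for each selected word (assumed to be p)
    candidate = sorted(set(words) - {start})
    while len(result) < k and candidate:
        chosen = max(candidate, key=lambda w: score(result, w))
        weights[chosen] = score(result, chosen)
        result.append(chosen)     # add to the tabu list
        candidate.remove(chosen)  # delete from the candidate set
    return result, weights
```

Because the choice is a plain maximum rather than a probabilistic draw, two ants that start on the same word and see the same pheromone values follow the same path; the random start of Step 1 is what spreads the ants out.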
Step 4. After all ants have completed k steps, m paths have been obtained and the pheromone content on these paths is updated with the formula: r=(1-p1)*r+p1*m, where the m in the formula is the amount of pheromone released by an ant;
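A sketch of this local update follows. It assumes that pheromone is stored per pair of consecutive words on a path and that untouched pairs start at an initial value of 1.0; the source names the matrix admatrixs but does not state its indexing, and the parameter name `deposit` stands in for the pheromone amount m.

```python
def local_update(pheromone, paths, p1, deposit, initial=1.0):
    """Apply r = (1 - p1) * r + p1 * deposit to every edge of every ant's path.

    pheromone maps (previous_word, next_word) -> r. Indexing by consecutive
    word pairs and the initial value are assumptions.
    """
    for path in paths:
        for edge in zip(path, path[1:]):
            r = pheromone.get(edge, initial)
            pheromone[edge] = (1 - p1) * r + p1 * deposit
    return pheromone
```

An edge used by several ants in the same iteration is updated once per ant, so its value moves further toward `deposit` than an edge used only once.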
Step 5. The best of the m paths is selected, its fitness value is recorded as fit_best, and a global update is performed: the pheromone content of each feature word on this path is updated to r=(1-p2)*r+p2*(1/fit_best);
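The global update can be sketched in the same style. The source does not define how the fitness value is computed or how the best path is chosen, so this function simply takes the best path and its fit_best as inputs; the edge-based indexing and the initial value are again assumptions.

```python
def global_update(pheromone, best_path, fit_best, p2, initial=1.0):
    """Apply r = (1 - p2) * r + p2 * (1 / fit_best) along the best path.

    pheromone maps (previous_word, next_word) -> r. How fit_best is
    computed is left to the caller because the source does not define it.
    """
    for edge in zip(best_path, best_path[1:]):
        r = pheromone.get(edge, initial)
        pheromone[edge] = (1 - p2) * r + p2 * (1 / fit_best)
    return pheromone
```

The caller must make sure fit_best is non-zero before calling this function.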
Step 6. Check whether the iteration count is less than n. If so, continue with the next iteration and repeat Steps 1 to 5 to complete the local feature value extraction. If it equals n, the words on the best path of the last iteration are the desired feature words that represent the class.
The innovations and advantages of the technical solution of the present invention are:
1. The ant colony algorithm is introduced for the first time to solve the problem of extracting accurate feature vectors for classes when no accurate sample set is available. When training a classifier in practice, one no longer has to do without an accurate training set, spend a great deal of time and labor collecting samples manually, or compromise on classification accuracy.
2. For the selection of feature words, the co-occurrence rate of a feature word with the class name and the co-occurrence rate between feature words are proposed for the first time as criteria for choosing a candidate word as a feature word.
3. The range of each ant's first selection is restricted. The tf*idf value is used as the measure of how well a word represents the class, and the random range of the ant's first step is limited to words with large tf*idf values, which makes the accuracy higher.
Brief description of the drawings
The present invention is described in further detail below with reference to the accompanying drawings and an embodiment:
Fig. 1 is the structure chart of a class.
Fig. 2 is the feature word extraction process for a class.
Fig. 3 is the feature word extraction by the improved ant colony algorithm.
Detailed description
The present invention is built on a traditional search engine. Classes are taken from the manually curated directory of dmoz. For each class name, a web crawler fetches the first 200 results returned by a full-text search engine, and the page body is extracted as the sample set after noise such as web page tags and advertisements is removed. A word segmenter then segments the training set, stop words and low-frequency words are removed, and the following statistics are collected: the frequency of each word in each article, the total number of documents in which the word occurs, the number of times the word co-occurs with the class name, and the total number of articles. Finally, the improved ant colony algorithm extracts the feature words and their weights, which yields the class and its feature words. The concrete structure of a class is shown in Fig. 1. A classifier is built in this way. The classifier is used to classify and organize the crawled information, a web index builder then builds an index network over the classified pages, and the constructed result is stored in a database so that it can serve the recommendation step of the user service flow.
The present invention extracts class feature words with an improved ant colony algorithm. When an accurate classifier is wanted but no accurate sample set is available, the ant colony algorithm, which is suited to combinatorial optimization, extracts the feature words that represent a particular class from a high-dimensional candidate set of feature words, laying the foundation for an accurate classifier.
The steps for extracting class feature words are shown in Fig. 2:
1) Use the class name as the keyword for an existing search engine, and use a web crawler to fetch the information and page content of the first 200 search results.
2) Remove noise such as web page tags and advertisements, and extract the page body as the sample set of the class.
3) Segment the training set with a word segmenter, and remove stop words and low-frequency words.
4) Count the frequency of each word in each article, the total number of documents in which the word occurs, the number of times the word co-occurs with the class name, and the total number of articles.
5) Extract the feature words with the improved ant colony algorithm.
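Steps 3) and 4) above, starting from already segmented articles, might be sketched as follows. The function name `collect_stats`, the threshold name `min_tf`, and the decision to count a co-occurrence with the class name once per article that contains both the word and the class name are assumptions; crawling and word segmentation are outside the scope of this sketch.

```python
from collections import Counter


def collect_stats(docs, class_name, stop_words=frozenset(), min_tf=1):
    """Build coco_prepare and readhdfs_prepare from segmented articles.

    docs maps article id -> list of tokens. Words with a total frequency
    below min_tf are treated as low-frequency words and dropped.
    """
    per_article = {}
    tf, df, co = Counter(), Counter(), Counter()
    for doc_id, tokens in docs.items():
        counts = Counter(t for t in tokens if t not in stop_words)
        per_article[doc_id] = counts
        has_class_name = class_name in tokens
        for word, count in counts.items():
            tf[word] += count  # total occurrences in the training set
            df[word] += 1      # number of articles containing the word
            if has_class_name:
                co[word] += 1  # articles shared with the class name (assumption)
    keep = {w for w, c in tf.items() if c >= min_tf}
    coco_prepare = {
        d: {w: c for w, c in counts.items() if w in keep}
        for d, counts in per_article.items()
    }
    readhdfs_prepare = {w: {"tf": tf[w], "df": df[w], "co": co[w]} for w in keep}
    return coco_prepare, readhdfs_prepare, len(docs)
```

The third return value is the total number of articles, which a tf*idf computation for the restricted random start would need.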
Specifically, the process of extracting feature words with the improved ant colony algorithm is shown in Fig. 3:
During preprocessing, all information is stored in hash tables. The table coco_prepare stores the information of each article: the article id, each word, and the number of times the word occurs. The table readhdfs_prepare stores the statistics of each class's training set: the term frequency of each word, its document count, and its number of co-occurrences with the class name. The parameters of the ant colony algorithm are then set: the number of ants m; the number of iterations n; the number of steps each ant takes, which is the number of feature words k; the initial path pheromone matrix admatrixs; the local update decay rate p1 and the global update decay rate p2; and the amount of pheromone m released by each ant (a quantity distinct from the number of ants, although both are written m).
Step 1. The m ants are placed at random on their first word. Because the candidate set is large, the range of this random choice is restricted manually: the tf*idf value of each feature word is computed, and the 20 words with the largest values form the range from which each ant's first path node is chosen at random.
Step 2. For each ant, a tabu list result is maintained. It holds the words the ant has already visited, and every word in it is excluded from the candidate set candidate for the next step. During selection, the likelihood that a word is chosen is calculated as follows:
(1) take the term frequency tf of the word from table readhdfs_prepare, obtain the maximum term frequency tf_max and the minimum term frequency tf_min over all words, and normalize: tf=(tf-tf_min)/(tf_max-tf_min);
(2) take the document count df of the word from table readhdfs_prepare, obtain the maximum document count df_max and the minimum document count df_min over all words, and normalize: df=(df-df_min)/(df_max-df_min);
(3) take the co-occurrence count co of the word with the class name from table readhdfs_prepare, obtain the maximum co-occurrence count co_max and the minimum co-occurrence count co_min over all words, and normalize: co=(co-co_min)/(co_max-co_min);
(4) from table coco_prepare, obtain coco, the sum of the co-occurrence counts between this word and all words in the result table, compute the maximum coco_max and the minimum coco_min of this value over all words, and normalize: coco=(coco-coco_min)/(coco_max-coco_min);
(5) obtain the pheromone value r on the corresponding path;
(6) using the proportions tf_per, df_per, co_per, coco_per assigned to the four parts, calculate the selection likelihood p=tf_per*tf+df_per*df+co_per*co+coco_per*coco+r;
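Sub-step (4) is the only metric that depends on the ant's current tabu list. A sketch of the raw (not yet normalized) value follows; it assumes that two words co-occur once for every article in which both appear, which the source does not spell out, and the function name `coco_sum` is hypothetical.

```python
def coco_sum(coco_prepare, word, result):
    """Sum the co-occurrence counts between `word` and every word in `result`.

    coco_prepare maps article id -> {word: count}. One co-occurrence is
    counted per article containing both words (an assumption).
    """
    total = 0
    for counts in coco_prepare.values():
        if word in counts:
            total += sum(1 for chosen in result if chosen in counts)
    return total
```

Because the tabu list grows with every step, this value and its minimum and maximum over the candidates have to be recomputed at each step, unlike tf, df and co, which are fixed for the class.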
Step 3. After p has been calculated for all candidate words, the word with the largest p value is taken as the word selected by the ant in this step, and its weight value is recorded. The word is added to the ant's tabu list result and deleted from the ant's candidate set, and the ant continues to the next step.
Step 4. After all ants have completed k steps, m paths have been obtained and the pheromone content on these paths is updated with the formula: r=(1-p1)*r+p1*m, where the m in the formula is the amount of pheromone released by an ant;
Step 5. The best of the m paths is selected, its fitness value is recorded as fit_best, and a global update is performed: the pheromone content of each feature word on this path is updated to r=(1-p2)*r+p2*(1/fit_best);
Step 6. Continue with the next iteration and repeat Steps 1 to 5 until the feature value extraction is complete. The words on the best path of the last iteration are the desired feature words that represent the class.
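Putting the six steps together, the whole procedure could be sketched as one function. This is an illustration under several assumptions that the source leaves open: pheromone is indexed by pairs of consecutive words with an initial value of 1.0, the fitness of a path is the sum of the p values of its selected words, the best path is the one with the largest fitness, co-occurrence is counted per article, and the four weights are equal. The function name `extract_features` and the parameter name `deposit` are hypothetical.

```python
import random


def norm(x, lo, hi):
    """Min-max normalization; 0.0 when all values are equal (assumption)."""
    return 0.0 if hi == lo else (x - lo) / (hi - lo)


def extract_features(stats, coco_prepare, starts, m, n, k, p1, p2, deposit,
                     weights=(0.25, 0.25, 0.25, 0.25), initial=1.0, seed=0):
    """stats maps word -> (tf, df, co); returns the best path of the last iteration."""
    rng = random.Random(seed)
    words = sorted(stats)
    # Minimum and maximum of tf, df and co over all words.
    bounds = [(min(s[i] for s in stats.values()),
               max(s[i] for s in stats.values())) for i in range(3)]
    pheromone = {}
    best_path = []

    def coco(word, result):
        # Articles shared by `word` and each tabu word, summed over the tabu list.
        return sum(1 for counts in coco_prepare.values() if word in counts
                   for chosen in result if chosen in counts)

    for _ in range(n):
        paths = []
        for _ant in range(m):
            result = [rng.choice(starts)]  # Step 1: restricted random start
            fitness = 0.0
            while len(result) < k:
                candidate = [w for w in words if w not in result]
                if not candidate:
                    break
                raw = {w: coco(w, result) for w in candidate}
                lo, hi = min(raw.values()), max(raw.values())

                def p(w):
                    # Step 2: weighted sum of four normalized metrics plus pheromone.
                    s = stats[w]
                    parts = [norm(s[i], *bounds[i]) for i in range(3)]
                    parts.append(norm(raw[w], lo, hi))
                    r = pheromone.get((result[-1], w), initial)
                    return sum(a * b for a, b in zip(weights, parts)) + r

                chosen = max(candidate, key=p)  # Step 3: greedy choice
                fitness += p(chosen)
                result.append(chosen)
            paths.append((fitness, result))
        for _, path in paths:  # Step 4: local update on every path
            for edge in zip(path, path[1:]):
                r = pheromone.get(edge, initial)
                pheromone[edge] = (1 - p1) * r + p1 * deposit
        fit_best, best_path = max(paths, key=lambda t: t[0])
        if fit_best > 0:  # Step 5: global update on the best path
            for edge in zip(best_path, best_path[1:]):
                pheromone[edge] = (1 - p2) * pheromone[edge] + p2 * (1 / fit_best)
    return best_path  # Step 6: best path of the last iteration
```

The `starts` argument is expected to hold the words with the largest tf*idf values, for example the top 20 as in this embodiment.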
Claims (1)
1. A web page class feature vector extraction method based on an ant colony algorithm, characterized by the following process:
during preprocessing, all information is stored in hash tables, wherein coco_prepare stores the information of each article, including the article id, each word, and the number of times the word occurs; readhdfs_prepare stores the statistics of each class's training set, including the term frequency of each word, its document count, and its number of co-occurrences with the class name; the parameters of the ant colony algorithm are set: the number of ants m; the number of iterations n; the number of steps each ant takes, which is the number of feature words k; the initial path pheromone matrix admatrixs; the local update decay rate p1 and the global update decay rate p2; and the amount of pheromone m released by each ant;
Step 1, the m ants are placed at random on their first word; because the candidate set is large, the range of this random choice is restricted manually: the tf*idf value of each feature word is computed, and the 20 to 30 words with the largest values form the range from which each ant's first path node is chosen at random;
Step 2, for each ant, a tabu list result is maintained, which holds the words the ant has already visited, and every word in it is excluded from the candidate set candidate for the next step; during selection, the likelihood that a word is chosen is calculated as follows:
(1) take the term frequency tf of the word from table readhdfs_prepare, obtain the maximum term frequency tf_max and the minimum term frequency tf_min over all words, and normalize: tf=(tf-tf_min)/(tf_max-tf_min);
(2) take the document count df of the word from table readhdfs_prepare, obtain the maximum document count df_max and the minimum document count df_min over all words, and normalize: df=(df-df_min)/(df_max-df_min);
(3) take the co-occurrence count co of the word with the class name from table readhdfs_prepare, obtain the maximum co-occurrence count co_max and the minimum co-occurrence count co_min over all words, and normalize: co=(co-co_min)/(co_max-co_min);
(4) from table coco_prepare, obtain coco, the sum of the co-occurrence counts between this word and all words in the result table, compute the maximum coco_max and the minimum coco_min of this value over all words, and normalize: coco=(coco-coco_min)/(coco_max-coco_min);
(5) obtain the pheromone value r on the corresponding path;
(6) using the proportions tf_per, df_per, co_per, coco_per assigned to the four parts, calculate the selection likelihood p=tf_per*tf+df_per*df+co_per*co+coco_per*coco+r;
Step 3, after p has been calculated for all candidate words, the word with the largest p value is taken as the word selected by the ant in this step, and its weight value is recorded; the word is added to the ant's tabu list result and deleted from the ant's candidate set, and the ant continues to the next step;
Step 4, after all ants have completed k steps, m paths have been obtained and the pheromone content on these paths is updated with the formula: r=(1-p1)*r+p1*m, where the m in the formula is the amount of pheromone released by an ant;
Step 5, the best of the m paths is selected, its fitness value is recorded as fit_best, and a global update is performed: the pheromone content of each feature word on this path is updated to r=(1-p2)*r+p2*(1/fit_best);
Step 6, check whether the iteration count is less than n; if so, continue with the next iteration and repeat Steps 1 to 5 to complete the local feature value extraction; if it equals n, the words on the best path of the last iteration are the desired feature words that represent the class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410004815.7A CN103744959B (en) | 2014-01-06 | 2014-01-06 | Webpage class feature vector extracting method based on ant colony algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103744959A CN103744959A (en) | 2014-04-23 |
CN103744959B true CN103744959B (en) | 2017-01-25 |
Family
ID=50501977
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410004815.7A Active CN103744959B (en) | 2014-01-06 | 2014-01-06 | Webpage class feature vector extracting method based on ant colony algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103744959B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5640553A (en) * | 1995-09-15 | 1997-06-17 | Infonautics Corporation | Relevance normalization for documents retrieved from an information retrieval system in response to a query |
CN102222098A (en) * | 2011-06-20 | 2011-10-19 | 北京邮电大学 | Method and system for pre-fetching webpage |
CN102254004A (en) * | 2011-07-14 | 2011-11-23 | 北京邮电大学 | Method and system for modeling Web in weblog excavation |
Also Published As
Publication number | Publication date |
---|---|
CN103744959A (en) | 2014-04-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |