CN103744959B

CN103744959B - Webpage class feature vector extracting method based on ant colony algorithm

Info

Publication number: CN103744959B
Application number: CN201410004815.7A
Authority: CN
Inventors: 蒋昌俊; 陈闳中; 闫春钢; 丁志军; 王鹏伟; 孙海春; 邓晓栋; 刘俊俊
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2014-01-06
Filing date: 2014-01-06
Publication date: 2017-01-25
Anticipated expiration: 2034-01-06
Also published as: CN103744959A

Abstract

The invention relates to a method for extracting feature words using an improved ant colony algorithm. The method comprises the following steps: during preprocessing, all information is stored in hash tables, wherein coco_prepare stores the information of each article, consisting of the article id, the words and the number of occurrences of each word, and readhdfs_prepare stores the statistics of the training set of each class, consisting of the word frequency of each word, the number of documents and the number of co-occurrences with the class name; the parameters of the ant colony algorithm are then set, including the ant number M, the iteration number N, the number of steps each ant takes, namely the feature word number K, the initial path pheromone matrix adMatrixs, the local update decay rate p1, the global update decay rate p2 and the amount of pheromone m released by each ant. The method introduces the ant colony algorithm for the first time to solve the problem of extracting accurate feature vectors for classes when accurate sample sets are lacking.

Description

A web page class feature vector extraction method based on the ant colony algorithm

Technical field

The present invention relates to text mining and is applied to web page classification.

Background technology

Web text mining is a set of methods and tools for extracting useful information from massive numbers of web pages, and web page classification is one of its main aspects. It is well known in the field of machine learning that training a classifier requires a sample set that accurately represents each class, to serve as the training set and the test set. There are three main ways to obtain a sample set: (1) using an existing public corpus; (2) manually collecting samples according to the class name; (3) using a web crawler. In practice, class names are defined according to differing requirements, so the corpora of method (1) rarely meet actual needs, and method (2) is time-consuming and laborious; method (3) is therefore the practical way to obtain a training set. Internet information, however, is complex and redundant. The main problem addressed here is how to extract accurate information from a sample set that is not very accurate, to serve as the feature values of a class, and how to obtain the weight of each feature word, without the result becoming so complex that the learning algorithm cannot process it.

The ant colony algorithm is a probabilistic algorithm for finding optimal paths in a graph and is a kind of simulated evolutionary algorithm. Ant colony optimization has been applied to many combinatorial optimization problems. In the practical situation here, the candidate feature set contains a great many feature words, whereas the feature words that best represent a class must be few in number and accurate; this is exactly the kind of problem that the ant colony algorithm can solve.

Summary of the invention

In the absence of a standard text training set, the present invention obtains the sample set with a web crawler. A sample set obtained this way clearly contains a great deal of noise such as advertisements, and because of this noise, common text vector extraction algorithms cannot obtain an accurate feature vector for a class. To solve this problem, the probability that an ant selects a given word is determined by a weighted sum of four metrics: word frequency, document frequency, the co-occurrence rate between the feature word and the class name, and the co-occurrence rate between feature words. The ant colony algorithm is designed on this basis to extract an accurate class feature vector within acceptable complexity. The resulting feature vector serves as an important part of the classifier used for web page classification.

The technical solution provided by the present invention is:

A method for extracting feature words using an improved ant colony algorithm, characterized by the following detailed process:

During preprocessing, all information is stored in hash tables. The table coco_prepare stores the information of each article: the article id, and each word with its number of occurrences. The table readhdfs_prepare stores the statistics of the training set of each class: the word frequency of each word, the number of documents containing it, and the number of times it co-occurs with the class name. The parameters of the ant colony algorithm are then set: the number of ants m; the number of iterations n; the number of steps each ant takes, which is the number of feature words k; the initial path pheromone matrix admatrixs; the local update decay rate p1 and the global update decay rate p2; and the amount of pheromone m released by each ant.
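The tables and parameters above can be pictured with a minimal Python sketch. The layout is an assumption made for illustration: the text does not fix how the hash tables are keyed, so the nested dicts, the `AcoParams` class, the `init_pheromone` helper, the initial pheromone value of 1.0 and all sample values below are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class AcoParams:
    ants: int          # number of ants
    iterations: int    # number of iterations n
    steps: int         # steps per ant, i.e. number of feature words k
    p1: float          # local update decay rate
    p2: float          # global update decay rate
    release: float     # amount of pheromone released by each ant


def init_pheromone(words, initial=1.0):
    """Build admatrixs as a dict keyed by ordered word pairs (assumed layout)."""
    return {(a, b): initial for a, b in permutations(words, 2)}


# coco_prepare: article id -> {word: number of occurrences} (made-up sample data)
coco_prepare = {
    "doc1": {"football": 2, "league": 1},
    "doc2": {"football": 1, "score": 3},
}

# readhdfs_prepare: per-class statistics for each word (made-up sample data)
readhdfs_prepare = {
    "tf": {"football": 3, "league": 1, "score": 3},   # word frequency
    "df": {"football": 2, "league": 1, "score": 1},   # documents containing the word
    "co": {"football": 2, "league": 1, "score": 1},   # co-occurrences with the class name
}

params = AcoParams(ants=10, iterations=50, steps=2, p1=0.1, p2=0.1, release=1.0)
admatrixs = init_pheromone(readhdfs_prepare["tf"])
```

Keying the pheromone matrix by word pairs keeps it sparse-friendly for large candidate sets; a dense two-dimensional array would serve equally well for small vocabularies.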

First step: the m ants are placed at random on their first word. Because the candidate set is large, the range of this random choice is restricted manually. Specifically, the tf*idf value of each feature word is calculated, and the 20 to 30 words with the largest values are taken as the random range for the first path of each ant.

Second step: for each ant, a taboo list result is maintained that holds the words the ant has already visited; all words in this list are excluded from the candidate set candidate for the next step. During selection, the probability of each candidate word being chosen is calculated as follows:
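The restriction of the starting range can be sketched as follows. The text does not state which idf variant is used, so the sketch assumes idf = log(total_docs / df); the function names `seed_candidates` and `place_ants` are hypothetical.

```python
import math
import random


def seed_candidates(tf, df, total_docs, top_n=20):
    """Return the top_n words ranked by tf*idf (idf variant is an assumption)."""
    scores = {
        word: tf[word] * math.log(total_docs / df[word])
        for word in tf
        if df.get(word, 0) > 0
    }
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:top_n]


def place_ants(seeds, ants, rng=random):
    """Put each ant on a first word drawn at random from the restricted range."""
    return [rng.choice(seeds) for _ in range(ants)]
```

For example, `place_ants(seed_candidates(tf, df, total_docs, top_n=20), ants=10)` would give ten starting words drawn only from the twenty highest-scoring candidates.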

(1) take the word frequency tf of the word from table readhdfs_prepare, obtain the maximum word frequency tf_max and the minimum word frequency tf_min over all words, and normalize: tf = (tf - tf_min) / (tf_max - tf_min);

(2) take the document count df of the word from table readhdfs_prepare, obtain the maximum document count df_max and the minimum document count df_min over all words, and normalize: df = (df - df_min) / (df_max - df_min);

(3) take the count co of co-occurrences of the word with the class name from table readhdfs_prepare, obtain the maximum co-occurrence count co_max and the minimum co-occurrence count co_min over all words, and normalize: co = (co - co_min) / (co_max - co_min);

(4) from table coco_prepare, obtain the sum coco of the co-occurrence counts between the word and all words in the result list, compute the maximum coco_max and the minimum coco_min of this co-occurrence over all words, and normalize: coco = (coco - coco_min) / (coco_max - coco_min);

(5) obtain the pheromone value r on the corresponding path;

(6) according to the proportions tf_per, df_per, co_per and coco_per assigned to the four parts, calculate the selection probability: p = tf_per * tf + df_per * df + co_per * co + coco_per * coco + r;
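Sub-steps (1) to (6) amount to a min-max normalization followed by a weighted sum. The sketch below illustrates this under stated assumptions: the equal weights of 0.25 are placeholders (the text gives no values for tf_per, df_per, co_per, coco_per), a constant column is mapped to 0.0 to avoid division by zero, and co-occurrence between two words is taken to mean the number of articles containing both, which the text does not specify.

```python
def normalise(table):
    """Min-max normalise a {word: value} table; a constant table maps to 0.0."""
    lo, hi = min(table.values()), max(table.values())
    if hi == lo:
        return {word: 0.0 for word in table}
    return {word: (value - lo) / (hi - lo) for word, value in table.items()}


def coco_sum(word, result, coco_prepare):
    """Sum, over the words already chosen, the articles in which `word` co-occurs.

    Assumed definition: two words co-occur when one article contains both.
    """
    return sum(
        1
        for chosen in result
        for counts in coco_prepare.values()
        if word in counts and chosen in counts
    )


def selection_score(word, tf, df, co, coco, r, weights=(0.25, 0.25, 0.25, 0.25)):
    """p = tf_per*tf + df_per*df + co_per*co + coco_per*coco + r.

    tf, df, co and coco are already-normalised {word: value} tables;
    r is the pheromone value on the path leading to `word`.
    """
    tf_per, df_per, co_per, coco_per = weights
    return (
        tf_per * tf[word]
        + df_per * df[word]
        + co_per * co[word]
        + coco_per * coco[word]
        + r
    )
```

Because r is added without normalization, its scale relative to the four weighted terms determines how strongly accumulated pheromone steers the ants.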

Third step: after p has been calculated for all candidate words, the word with the largest p value is taken as the word selected by the ant at this step, and its weight is recorded. The word is added to the taboo list result of the ant and deleted from the candidate set of the ant, and the ant proceeds to the next step.
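A single selection step with the taboo list might look like the sketch below. It assumes the p values have already been computed into a `scores` dict and that `result` is a list of (word, weight) pairs; both the function name `take_step` and this representation are hypothetical.

```python
def take_step(scores, result):
    """Pick the unvisited word with the largest p and append it to the taboo list.

    scores: {word: p} for every word in the vocabulary.
    result: list of (word, weight) pairs already chosen by this ant (mutated).
    """
    visited = {word for word, _ in result}
    candidates = {word: p for word, p in scores.items() if word not in visited}
    if not candidates:
        raise ValueError("no candidate words left for this ant")
    best = max(candidates, key=candidates.get)
    result.append((best, candidates[best]))  # record the word and its weight
    return best, candidates[best]
```

Filtering against the taboo list on each call has the same effect as deleting the chosen word from a separately stored candidate set.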

Fourth step: after all ants have completed k steps, the pheromone content of the m paths obtained is updated. The update formula is: r = (1 - p1) * r + p1 * m, where the m in the formula is the amount of pheromone released by an ant;
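The local update can be sketched as follows. This is an illustration only: it assumes the pheromone matrix is a dict keyed by (previous word, next word) pairs, that each path is a list of words, and that pairs not yet in the dict start at a hypothetical default of 1.0.

```python
def local_update(admatrixs, paths, p1, m, default=1.0):
    """Apply r = (1 - p1) * r + p1 * m to every edge on every ant's path.

    An edge walked by several ants is updated once per ant.
    """
    for path in paths:
        for edge in zip(path, path[1:]):
            r = admatrixs.get(edge, default)
            admatrixs[edge] = (1 - p1) * r + p1 * m
```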

Fifth step: the optimal path is chosen from the m paths, its fitness value is recorded as fit_best, and a global update is performed, updating the pheromone content of each feature word on this path as: r = (1 - p2) * r + p2 * (1 / fit_best);
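The global update can be sketched in the same way. The text does not define how the fitness of a path is computed or whether a larger or smaller value is better, so the sketch takes the chosen best path and its fit_best as inputs; the dict layout and the default of 1.0 for unseen pairs are assumptions.

```python
def global_update(admatrixs, best_path, fit_best, p2, default=1.0):
    """Apply r = (1 - p2) * r + p2 * (1 / fit_best) along the best path only."""
    if fit_best == 0:
        raise ValueError("fit_best must be non-zero")
    for edge in zip(best_path, best_path[1:]):
        r = admatrixs.get(edge, default)
        admatrixs[edge] = (1 - p2) * r + p2 * (1 / fit_best)
```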

Sixth step: determine whether the number of iterations completed is less than n. If so, continue with the next iteration, repeating the first to fifth steps to complete local feature extraction. If it equals n, the words on the optimal path of the last iteration are the desired feature words that represent the class.
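Taken together, the six steps form the loop sketched below. It is a simplified illustration, not the patented implementation: `score(word, path)` stands for the weighted sum of the four normalized metrics and is supplied by the caller, `fitness(path)` is a caller-supplied positive function for which larger is assumed to be better, and unseen pheromone entries default to 1.0. All of these are assumptions not fixed by the text.

```python
import random


def run_aco(words, score, seeds, ants, iterations, steps, p1, p2, m, fitness, rng=None):
    """Return the best path (list of feature words) found in the last iteration."""
    rng = rng or random.Random(0)
    pheromone = {}  # (previous word, next word) -> r, default 1.0
    best_path = None
    for _ in range(iterations):
        paths = []
        for _ in range(ants):
            path = [rng.choice(seeds)]  # first step: random start in restricted range
            while len(path) < steps:
                prev = path[-1]
                candidates = [w for w in words if w not in path]  # taboo list
                # second and third steps: pick the candidate with the largest p
                nxt = max(
                    candidates,
                    key=lambda w: score(w, path) + pheromone.get((prev, w), 1.0),
                )
                path.append(nxt)
            paths.append(path)
        # fourth step: local update on every path
        for path in paths:
            for edge in zip(path, path[1:]):
                pheromone[edge] = (1 - p1) * pheromone.get(edge, 1.0) + p1 * m
        # fifth step: global update on the best path of this iteration
        best_path = max(paths, key=fitness)
        fit_best = fitness(best_path)
        for edge in zip(best_path, best_path[1:]):
            pheromone[edge] = (1 - p2) * pheromone.get(edge, 1.0) + p2 * (1 / fit_best)
    return best_path
```

With a fixed per-word score, for instance `score = lambda w, path: table[w]` and `fitness = lambda path: sum(table[w] for w in path)`, the loop reduces to repeated greedy selection biased by the evolving pheromone values.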

The innovations and advantages of the technical solution of the present invention are:

1. The ant colony algorithm is introduced for the first time to solve the problem of extracting an accurate feature vector for a class when no accurate sample set is available. When training a classifier in practice, one therefore no longer has to cope with the lack of an accurate training set, spend large amounts of time and manpower collecting samples by hand, or compromise on classification accuracy.

2. In selecting feature words, the co-occurrence rate between a feature word and the class name and the co-occurrence rate between feature words are proposed for the first time as criteria for choosing a word as a feature word.

3. The range of each ant's first selection is restricted. The tf*idf value is used as a measure of how well a word represents the class, and the random range of the ant's first step is limited to the words with larger tf*idf values, which makes the accuracy higher.

Brief description of the drawings

The present invention is described in further detail below with reference to the accompanying drawings and embodiments:

Fig. 1 is a structure diagram of a class.

Fig. 2 shows the feature word extraction process for a class.

Fig. 3 shows feature word extraction using the improved ant colony algorithm.

Specific embodiment

The present invention builds on a traditional search engine. Classes are extracted from the manually classified directory of dmoz. For each class name, a web crawler crawls the first 200 results returned by a full-text search engine, and the web page text is extracted as the sample set after noise such as web page tags and advertisements has been removed. A word segmenter is then applied to the training set, stop words and low-frequency words are removed, and the following are counted: the frequency of each word in each article, the total number of documents in which the word appears, the number of times the word co-occurs with the class name, and the total number of articles. Finally, the improved ant colony algorithm extracts the feature words and obtains their weights, yielding the class and its feature words. The concrete structure of a class is shown in Fig. 1. A classifier is built in this way. The classifier is used to classify and organize the crawled information; a web index builder then constructs an index network over the classified web pages to support service workflow recommendation for users, and the constructed result is stored in a database.

The present invention extracts class feature words with an improved ant colony algorithm. Where an accurate classifier is wanted but no accurate sample set is available, the ant colony algorithm, which is capable of combinatorial optimization, is used to extract the feature words that represent a particular class from a high-dimensional set of candidate feature words, laying the foundation for an accurate classifier.

The specific implementation steps of the class feature word extraction of the present invention are shown in Fig. 2:

1) Using the class name as the keyword for an existing search engine, crawl the first 200 search results and their web page contents with a web crawler.

2) Remove noise such as web page tags and advertisements, and extract the web page text as the sample set for the class.

3) Segment the training set into words with a word segmenter, and remove stop words and low-frequency words.

4) Count the frequency of each word in each article, the total number of documents in which the word appears, the number of times the word co-occurs with the class name, and the total number of articles.

5) Extract the feature words using the improved ant colony algorithm.
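The counting in step 4) can be sketched with the Python standard library. The sketch assumes the articles have already been segmented and filtered into token lists, and it takes co-occurrence with the class name to mean that the word and the class name appear in the same article; the function name `collect_stats` and the returned layout are hypothetical.

```python
from collections import Counter


def collect_stats(articles, class_name):
    """Build the per-article and per-class statistics from tokenised articles.

    articles: {article id: list of tokens} after segmentation and stop-word removal.
    Returns (coco_prepare, readhdfs_prepare) as plain dicts.
    """
    coco_prepare = {aid: dict(Counter(tokens)) for aid, tokens in articles.items()}
    tf, df, co = Counter(), Counter(), Counter()
    for counts in coco_prepare.values():
        tf.update(counts)          # add per-article occurrence counts
        df.update(counts.keys())   # one more document for each distinct word
        if class_name in counts:
            # assumed definition: word shares an article with the class name
            co.update(word for word in counts if word != class_name)
    readhdfs_prepare = {
        "tf": dict(tf),
        "df": dict(df),
        "co": dict(co),
        "total_docs": len(articles),
    }
    return coco_prepare, readhdfs_prepare
```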

Specifically, the detailed process of extracting feature words with the improved ant colony algorithm is shown in Fig. 3:

First step: the m ants are placed at random on their first word. Because the candidate set is large, the range of this random choice is restricted manually. Specifically, the tf*idf value of each feature word is calculated, and the 20 words with the largest values are taken as the random range for the first path of each ant.

(5) obtain the pheromone value r on the corresponding path;

Sixth step: continue with the next iteration, repeating the first to fifth steps to complete feature extraction; the words on the optimal path of the last iteration are the desired feature words that represent the class.

Claims

1. A web page class feature vector extraction method based on the ant colony algorithm, characterized by the following detailed process:

during preprocessing, all information is stored in hash tables, wherein coco_prepare stores the information of each article, including the article id and each word with its number of occurrences; readhdfs_prepare stores the statistics of the training set of each class, including the word frequency of each word, the number of documents, and the number of co-occurrences with the class name; the parameters of the ant colony algorithm are set: the number of ants m; the number of iterations n; the number of steps each ant takes, which is the number of feature words k; the initial path pheromone matrix admatrixs; the local update decay rate p1 and the global update decay rate p2; the amount of pheromone m released by each ant;

first step: the m ants are placed at random on their first word; because the candidate set is large, the range of this random choice is restricted manually, specifically: the tf*idf value of each feature word is calculated, and the 20 to 30 words with the largest values are taken as the random range for the first path of each ant;

second step: for each ant, a taboo list result is maintained that holds the words the ant has already visited, and all words in this list are excluded from the candidate set candidate for the next step; during selection, the probability of being chosen is calculated according to the following steps:

(1) take the word frequency tf of the word from table readhdfs_prepare, obtain the maximum word frequency tf_max and the minimum word frequency tf_min over all words, and normalize: tf = (tf - tf_min) / (tf_max - tf_min);

(2) take the document count df of the word from table readhdfs_prepare, obtain the maximum document count df_max and the minimum document count df_min over all words, and normalize: df = (df - df_min) / (df_max - df_min);

(3) take the count co of co-occurrences of the word with the class name from table readhdfs_prepare, obtain the maximum co-occurrence count co_max and the minimum co-occurrence count co_min over all words, and normalize: co = (co - co_min) / (co_max - co_min);

(4) from table coco_prepare, obtain the sum coco of the co-occurrence counts between the word and all words in the result list, compute the maximum coco_max and the minimum coco_min of this co-occurrence over all words, and normalize: coco = (coco - coco_min) / (coco_max - coco_min);

(5) obtain the pheromone value r on the corresponding path;

(6) according to the proportions tf_per, df_per, co_per and coco_per assigned to the four parts, calculate the selection probability p = tf_per * tf + df_per * df + co_per * co + coco_per * coco + r;

third step: after p has been calculated for all candidate words, the word with the largest p value is taken as the word selected by the ant at this step, and its weight is recorded; the word is added to the taboo list result of the ant and deleted from the candidate set of the ant, and the ant proceeds to the next step;

fourth step: after all ants have completed k steps, the pheromone content of the m paths obtained is updated, the update formula being: r = (1 - p1) * r + p1 * m;

fifth step: the optimal path is chosen from the m paths, its fitness value is recorded as fit_best, and a global update is performed, updating the pheromone content of each feature word on this path as: r = (1 - p2) * r + p2 * (1 / fit_best);

sixth step: determine whether the number of iterations is less than n; if so, continue with the next iteration, repeating the first to fifth steps to complete local feature extraction; if it equals n, the words on the optimal path of the last iteration are the desired feature words that represent the class.