CN103744959B - Webpage class feature vector extracting method based on ant colony algorithm - Google Patents

Publication number
CN103744959B
CN103744959B CN201410004815.7A CN201410004815A CN103744959B CN 103744959 B CN103744959 B CN 103744959B CN 201410004815 A CN201410004815 A CN 201410004815A CN 103744959 B CN103744959 B CN 103744959B
Authority
CN
China
Prior art keywords
word
ant
coco
max
words
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410004815.7A
Other languages
Chinese (zh)
Other versions
CN103744959A (en)
Inventor
蒋昌俊
陈闳中
闫春钢
丁志军
王鹏伟
孙海春
邓晓栋
刘俊俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-01-06
Filing date
2014-01-06
Publication date
2017-01-25
Application filed by Tongji University
Priority to CN201410004815.7A
Publication of CN103744959A
Application granted
Publication of CN103744959B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models

Abstract

The invention relates to a method for extracting feature words using an improved ant colony algorithm. The method comprises the following steps: in preprocessing, all information is stored in hash tables, where coco_prepare stores the information of each article, consisting of the article id and each word with its number of occurrences, and readhdfs_prepare stores the statistics of the training set of each class, consisting of each word's term frequency, document count, and number of co-occurrences with the class name; the parameters of the ant colony algorithm are then set, including the number of ants M, the number of iterations N, the number of steps per ant (the number of feature words) K, the initial path pheromone matrix adMatrixs, the local update decay rate p1, the global update decay rate p2, and the amount of pheromone m released by an ant. According to the invention, the ant colony algorithm is introduced for the first time to solve the problem of extracting accurate feature vectors for classes when accurate sample sets are lacking.

Description

A webpage class feature vector extraction method based on the ant colony algorithm
Technical field
The present invention relates to text mining and is applied to webpage classification.
Background art
Web text mining is a set of methods and tools for extracting useful information from massive numbers of web pages, and webpage classification is one of its main aspects. It is well known in machine learning that training a classifier requires a sample set that accurately represents each class, to serve as the training set and the test set. There are three main ways to obtain a sample set: (1) use an existing public corpus; (2) manually collect samples according to the class name; (3) use a web crawler. In practice, class names are defined according to varying requirements, so the corpora of method (1) rarely meet real needs, and method (2) is time-consuming and labor-intensive; method (3) is therefore what can be used in practice to obtain a training set. Internet information, however, is complex and redundant. The problem this invention mainly addresses is how to extract accurate information from a not very accurate sample set to serve as the feature values of a class, and how to obtain the weight of each feature word, without making the result so complex that a learning algorithm cannot process it.
The ant colony algorithm is a probabilistic algorithm for finding optimal paths in a graph and is a kind of simulated evolutionary algorithm. Ant colony optimization has been applied to many combinatorial optimization problems. In the practical setting here, the candidate feature set contains a great many feature words, while the feature words that best represent a class must be few and accurate, so this is exactly the kind of problem the ant colony algorithm can solve.
Summary of the invention
In the absence of a standard text training set, the present invention obtains the sample set with a web crawler. Such a sample set obviously contains a lot of noise, such as advertisements, and because of this noise an ordinary text vector extraction algorithm cannot obtain an accurate class feature vector. To solve this problem, we use the weighted values of four metrics to determine how likely an ant is to select a word: term frequency, document frequency, the co-occurrence rate of the feature word with the class name, and the co-occurrence rate between feature words. An ant colony algorithm designed on this basis extracts an accurate class feature vector within acceptable complexity. The resulting feature vector serves as an important part of the classifier used for webpage classification.
The technical solution provided by the present invention is:
A method for extracting feature words using an improved ant colony algorithm, characterized by the following process:
In preprocessing, all information is stored in hash tables: coco_prepare stores the information of each article, consisting of the article id and each word with its number of occurrences; readhdfs_prepare stores the statistics of the training set of each class, consisting of each word's term frequency, document count, and number of co-occurrences with the class name. The parameters of the ant colony algorithm are then set: the number of ants M; the number of iterations N; the number of steps each ant takes, which is the number of feature words K; the initial path pheromone matrix adMatrixs; the local update decay rate p1 and the global update decay rate p2; and the amount of pheromone m released by an ant.
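As an illustration of the preprocessing state described above, the following is a minimal Python sketch. The class name `AcoParams`, the helper `init_pheromone`, all default values, the sample table contents, and the choice to key the pheromone matrix by ordered word pairs are assumptions made for this example; the source names only the two tables, adMatrixs, and the parameters.

```python
from dataclasses import dataclass


@dataclass
class AcoParams:
    """Parameters named in the text; every default value is hypothetical."""
    M: int = 10            # number of ants
    N: int = 50            # number of iterations
    K: int = 20            # steps per ant = number of feature words
    p1: float = 0.1        # local update decay rate
    p2: float = 0.1        # global update decay rate
    m: float = 1.0         # amount of pheromone released by an ant
    tf_per: float = 0.25   # weight of term frequency
    df_per: float = 0.25   # weight of document count
    co_per: float = 0.25   # weight of co-occurrence with the class name
    coco_per: float = 0.25 # weight of co-occurrence with chosen words


# coco_prepare: article id -> {word: occurrences in that article}
coco_prepare = {
    "doc1": {"ant": 3, "colony": 2, "path": 1},
    "doc2": {"ant": 1, "pheromone": 4},
}

# readhdfs_prepare: word -> term frequency, document count, class-name co-occurrences
readhdfs_prepare = {
    "ant": {"tf": 4, "df": 2, "co": 2},
    "colony": {"tf": 2, "df": 1, "co": 1},
    "path": {"tf": 1, "df": 1, "co": 1},
    "pheromone": {"tf": 4, "df": 1, "co": 0},
}


def init_pheromone(words, initial=0.1):
    """Build adMatrixs as a dict keyed by (from_word, to_word) pairs (assumed layout)."""
    return {(a, b): initial for a in words for b in words if a != b}


adMatrixs = init_pheromone(list(readhdfs_prepare))
```

Plain dictionaries stand in for the hash tables here; a sparse dictionary keyed by word pairs avoids allocating a dense matrix over a large vocabulary.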
Step 1: the M ants are each placed at random on a first word. Because the candidate set is large, the random range of the ants is restricted manually: the tf*idf value of each feature word is calculated, and the 20 to 30 words with the largest values are taken as the random range for the first step of each ant's path.
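The restricted random start can be sketched as follows. The source does not give the idf formula, so this example assumes idf = log(total_docs / df); the function names `start_candidates` and `place_ants` and the `stats` layout (word -> {"tf", "df"}) are likewise hypothetical.

```python
import math
import random


def start_candidates(stats, total_docs, top_n=20):
    """Return the top_n words ranked by tf*idf, largest first."""
    def tfidf(word):
        s = stats[word]
        # Assumed idf definition; the source only names "tf*idf".
        return s["tf"] * math.log(total_docs / s["df"])
    return sorted(stats, key=tfidf, reverse=True)[:top_n]


def place_ants(stats, total_docs, n_ants, top_n=20, rng=random):
    """Put each ant on a word drawn at random from the restricted range."""
    pool = start_candidates(stats, total_docs, top_n)
    return [rng.choice(pool) for _ in range(n_ants)]
```

Passing a seeded `random.Random` instance as `rng` makes the placement reproducible, which helps when comparing parameter settings.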
Step 2: for each ant, a tabu list result is maintained that holds the words the ant has already visited; all words in this list are excluded from the candidate set candidate for the next step. During selection, the likelihood of a word being chosen is calculated as follows:
(1) take the term frequency tf of the word from table readhdfs_prepare, obtain the maximum term frequency tf_max and minimum term frequency tf_min over all words, and normalize: tf = (tf - tf_min) / (tf_max - tf_min);
(2) take the document count df of the word from table readhdfs_prepare, obtain the maximum document count df_max and minimum document count df_min over all words, and normalize: df = (df - df_min) / (df_max - df_min);
(3) take the number of co-occurrences co of the word with the class name from table readhdfs_prepare, obtain the maximum co-occurrence count co_max and minimum co-occurrence count co_min over all words, and normalize: co = (co - co_min) / (co_max - co_min);
(4) obtain from table coco_prepare the total number of co-occurrences coco between the word and all the words in the result list, compute the maximum coco_max and minimum coco_min of this total over all words, and normalize: coco = (coco - coco_min) / (coco_max - coco_min);
(5) obtain the pheromone value r on the corresponding path;
(6) using the proportions tf_per, df_per, co_per and coco_per assigned to the four parts, calculate the selection likelihood p = tf_per*tf + df_per*df + co_per*co + coco_per*coco + r;
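Sub-steps (1) to (6) amount to min-max normalization followed by a weighted sum plus the pheromone term. The sketch below is one way to write that, under stated assumptions: `stats` maps each word to its tf, df and co values, `coco` maps each word to its co-occurrence total, `weights` holds the four proportions, and a metric whose maximum equals its minimum is normalized to 0.0 (the source does not say how to handle that case).

```python
def min_max(value, lo, hi):
    """Scale value into [0, 1]; return 0.0 when all values are equal (assumption)."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def selection_score(word, stats, coco, r, weights):
    """p = tf_per*tf + df_per*df + co_per*co + coco_per*coco + r."""
    p = r
    for key in ("tf", "df", "co"):
        values = [s[key] for s in stats.values()]
        p += weights[key] * min_max(stats[word][key], min(values), max(values))
    p += weights["coco"] * min_max(coco[word], min(coco.values()), max(coco.values()))
    return p
```

Because the pheromone value r is added without normalization, its scale relative to the four weighted terms determines how strongly earlier iterations steer later ones.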
Step 3: after the calculation is complete for all candidate words, the word with the largest p value is selected as the word chosen by the ant in this step, and its weight value is recorded; the word is added to this ant's tabu list result and deleted from this ant's candidate set, and the ant proceeds to the next step.
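A single move of one ant might look like the sketch below. The function name `take_step`, the `weights_log` dictionary used to record weight values, and the `score` callable (which is assumed to return the p value of a word) are hypothetical names introduced for this example.

```python
def take_step(candidate, result, weights_log, score):
    """Move one ant one step: choose the candidate word with the largest p value."""
    best = max(candidate, key=score)
    weights_log[best] = score(best)  # record the weight value of the chosen word
    result.append(best)              # add the word to the ant's tabu list
    candidate.remove(best)           # delete the word from the ant's candidate set
    return best
```

Note that the choice is a deterministic argmax, as the text describes, rather than the probabilistic roulette-wheel selection used in many ant colony variants; randomness enters only through the starting word.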
Step 4: after all ants have completed K steps, M paths are obtained and the pheromone content of the paths is updated with the formula: r = (1 - p1) * r + p1 * m, where m is the amount of pheromone released by an ant;
Step 5: the best of the M paths is chosen, its fitness value is recorded as fit_best, and a global update is performed, which updates the pheromone content of each feature word on this path to: r = (1 - p2) * r + p2 * (1 / fit_best);
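The two update formulas can be sketched as follows. This example assumes that pheromone is stored per edge between consecutive words of a path, in a dictionary keyed by (from_word, to_word); the source does not spell out the storage layout or how the fitness value is computed, so `fit_best` is simply passed in by the caller.

```python
def local_update(pheromone, paths, p1, m):
    """Apply r = (1 - p1) * r + p1 * m to every edge of every ant's path."""
    for path in paths:
        for edge in zip(path, path[1:]):
            # An edge shared by several ants is updated once per ant.
            pheromone[edge] = (1 - p1) * pheromone[edge] + p1 * m


def global_update(pheromone, best_path, fit_best, p2):
    """Apply r = (1 - p2) * r + p2 * (1 / fit_best) along the best path."""
    for edge in zip(best_path, best_path[1:]):
        pheromone[edge] = (1 - p2) * pheromone[edge] + p2 * (1 / fit_best)
```

For instance, with r = 0.5, p1 = 0.1 and m = 1.0, the local update gives r = 0.9 * 0.5 + 0.1 * 1.0 = 0.55.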
Step 6: determine whether the iteration count is less than N. If so, continue with the next iteration, repeating steps 1 to 5 to complete local feature value extraction. If it equals N, the words on the best path of the last iteration are the desired feature words that can represent the class.
The innovations and advantages of the technical solution of the present invention are:
1. The ant colony algorithm is introduced for the first time to solve the problem of extracting an accurate feature vector for a class when no accurate sample set is available. When actually training a classifier, one no longer has to go without an accurate training set, spend a great deal of time and manpower collecting samples by hand, or compromise on classification accuracy.
2. For selecting feature words, the co-occurrence rate of a feature word with the class name and the co-occurrence rate between feature words are proposed for the first time as criteria for choosing a word as a feature word.
3. The range of each ant's first selection is restricted: the tf*idf value is used to measure how representative a word is of the class, and the random range of the ant's first step is limited to words with larger tf*idf values, which makes the accuracy higher.
Brief description of the drawings
The present invention is described in further detail below with reference to the accompanying drawings and embodiments:
Fig. 1 is the structure diagram of a class.
Fig. 2 is the feature word extraction process for a class.
Fig. 3 shows feature word extraction with the improved ant colony algorithm.
Detailed description of the embodiments
The present invention is built on top of a traditional search engine. Classes are extracted according to the manually classified directory of dmoz; a web crawler then crawls the first 200 results returned by a full-text search engine for each class name, and after noise such as web page tags and advertisements is removed, the page body text is extracted as the sample set. A word segmenter then segments the training set, stop words and low-frequency words are removed, and the following are counted: the frequency of each word in each article, the total number of documents in which the word appears, the number of times the word co-occurs with the class name, and the total number of articles. Finally, the improved ant colony algorithm extracts the feature words and obtains their weight values, achieving the goal of obtaining the class and its feature words. The concrete structure of a class is shown in Fig. 1. A classifier is built in this way. The classifier is used to classify and organize the crawled information; a web index builder then constructs an index network over the classified web pages, to support recommending service flows to users, and the constructed result is stored in a database.
The present invention realizes the extraction of class feature words with an improved ant colony algorithm. Using the ant colony algorithm's capability for combinatorial optimization, in a setting where an accurate classifier is wanted but no accurate sample set is available, it extracts the feature words that can represent a particular class from a high-dimensional candidate set of feature words, laying the foundation for obtaining an accurate classifier.
The concrete steps for extracting class feature words in the present invention are shown in Fig. 2:
1) Using the class name as the keyword for an existing search engine, crawl the information and page content of the first 200 search results with a web crawler.
2) Remove noise such as web page tags and advertisements, and extract the page body text as the sample set of the class.
3) Segment the training set with a word segmenter and remove stop words and low-frequency words.
4) Count the frequency of each word in each article, the total number of documents in which the word appears, the number of times the word co-occurs with the class name, and the total number of articles.
5) Extract feature words using the improved ant colony algorithm.
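The counting in step 4) can be sketched in a few lines of Python. The function name `build_tables` and the input layout (article id mapped to a token list that has already been segmented and filtered) are assumptions for this example, as is the definition of co-occurrence with the class name: a word is counted once for every article that contains both the word and the class name.

```python
from collections import Counter


def build_tables(articles, class_name):
    """articles: {article_id: tokens after segmentation and stop-word removal}."""
    # Per-article word counts (the coco_prepare table).
    coco_prepare = {aid: dict(Counter(tokens)) for aid, tokens in articles.items()}
    # Per-word class statistics (the readhdfs_prepare table).
    readhdfs_prepare = {}
    for counts in coco_prepare.values():
        has_class_name = class_name in counts
        for word, n in counts.items():
            s = readhdfs_prepare.setdefault(word, {"tf": 0, "df": 0, "co": 0})
            s["tf"] += n   # total occurrences across the class's articles
            s["df"] += 1   # number of articles containing the word
            if has_class_name and word != class_name:
                s["co"] += 1  # assumed: one co-occurrence per shared article
    return coco_prepare, readhdfs_prepare, len(articles)
```

The third return value is the total number of articles, which an idf-style weighting would need later.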
Specifically, the detailed process of extracting feature words with the improved ant colony algorithm is shown in Fig. 3:
In preprocessing, all information is stored in hash tables: coco_prepare stores the information of each article, consisting of the article id and each word with its number of occurrences; readhdfs_prepare stores the statistics of the training set of each class, consisting of each word's term frequency, document count, and number of co-occurrences with the class name. The parameters of the ant colony algorithm are then set: the number of ants M; the number of iterations N; the number of steps each ant takes, which is the number of feature words K; the initial path pheromone matrix adMatrixs; the local update decay rate p1 and the global update decay rate p2; and the amount of pheromone m released by an ant.
Step 1: the M ants are each placed at random on a first word. Because the candidate set is large, the random range of the ants is restricted manually: the tf*idf value of each feature word is calculated, and the 20 words with the largest values are taken as the random range for the first step of each ant's path.
Step 2: for each ant, a tabu list result is maintained that holds the words the ant has already visited; all words in this list are excluded from the candidate set candidate for the next step. During selection, the likelihood of a word being chosen is calculated as follows:
(1) take the term frequency tf of the word from table readhdfs_prepare, obtain the maximum term frequency tf_max and minimum term frequency tf_min over all words, and normalize: tf = (tf - tf_min) / (tf_max - tf_min);
(2) take the document count df of the word from table readhdfs_prepare, obtain the maximum document count df_max and minimum document count df_min over all words, and normalize: df = (df - df_min) / (df_max - df_min);
(3) take the number of co-occurrences co of the word with the class name from table readhdfs_prepare, obtain the maximum co-occurrence count co_max and minimum co-occurrence count co_min over all words, and normalize: co = (co - co_min) / (co_max - co_min);
(4) obtain from table coco_prepare the total number of co-occurrences coco between the word and all the words in the result list, compute the maximum coco_max and minimum coco_min of this total over all words, and normalize: coco = (coco - coco_min) / (coco_max - coco_min);
(5) obtain the pheromone value r on the corresponding path;
(6) using the proportions tf_per, df_per, co_per and coco_per assigned to the four parts, calculate the selection likelihood p = tf_per*tf + df_per*df + co_per*co + coco_per*coco + r;
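Sub-step (4) depends on the words an ant has already chosen, so it has to be recomputed at every step. The sketch below shows one way to derive it from coco_prepare; the function names `coco_sum` and `coco_table` are hypothetical, and two words are assumed to co-occur once for each article that contains both, since the source does not define the count more precisely.

```python
def coco_sum(word, result, coco_prepare):
    """Total co-occurrences of `word` with the words already in `result`."""
    total = 0
    for counts in coco_prepare.values():
        if word in counts:
            # Assumed: one co-occurrence per chosen word present in the same article.
            total += sum(1 for chosen in result if chosen in counts)
    return total


def coco_table(candidate, result, coco_prepare):
    """coco value for every candidate word, ready for min-max normalization."""
    return {w: coco_sum(w, result, coco_prepare) for w in candidate}
```

This term rewards candidates that appear alongside the words already on the path, which pushes each ant toward a mutually related set of words.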
Step 3: after the calculation is complete for all candidate words, the word with the largest p value is selected as the word chosen by the ant in this step, and its weight value is recorded; the word is added to this ant's tabu list result and deleted from this ant's candidate set, and the ant proceeds to the next step.
Step 4: after all ants have completed K steps, M paths are obtained and the pheromone content of the paths is updated with the formula: r = (1 - p1) * r + p1 * m, where m is the amount of pheromone released by an ant;
Step 5: the best of the M paths is chosen, its fitness value is recorded as fit_best, and a global update is performed, which updates the pheromone content of each feature word on this path to: r = (1 - p2) * r + p2 * (1 / fit_best);
Step 6: continue with the next iteration, repeating steps 1 to 5, to complete feature value extraction; the words on the best path of the last iteration are the desired feature words that can represent the class.
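Putting the six steps together, the overall loop might be sketched as follows. Several choices here are assumptions rather than details from the source: `base_score(word, result)` is a caller-supplied function returning the weighted sum of the four normalized metrics, `start_pool` is the tf*idf-restricted starting range, pheromone is stored per edge with a default of 0.0, a path is taken to contain K words including the starting word, the fitness of a path is the sum of the p values of its chosen words, and the best path is the one with the highest fitness.

```python
import random


def run_aco(base_score, start_pool, words, M, N, K, p1, p2, m, seed=0):
    """Return the best path (list of feature words) of the last iteration."""
    rng = random.Random(seed)
    pheromone = {}  # (from_word, to_word) -> r, missing edges count as 0.0
    best_path = []
    for _ in range(N):
        paths = []
        for _ant in range(M):
            result = [rng.choice(start_pool)]       # step 1: restricted random start
            candidate = set(words) - set(result)
            fitness = 0.0
            while len(result) < K and candidate:    # steps 2 and 3
                last = result[-1]

                def p(w):
                    return base_score(w, result) + pheromone.get((last, w), 0.0)

                # sorted() makes tie-breaking deterministic
                chosen = max(sorted(candidate), key=p)
                fitness += p(chosen)
                result.append(chosen)
                candidate.remove(chosen)
            paths.append((fitness, result))
        for _fit, path in paths:                    # step 4: local update
            for edge in zip(path, path[1:]):
                pheromone[edge] = (1 - p1) * pheromone.get(edge, 0.0) + p1 * m
        fit_best, best_path = max(paths, key=lambda fp: fp[0])  # step 5
        if fit_best > 0:                            # guard against division by zero
            for edge in zip(best_path, best_path[1:]):
                pheromone[edge] = (1 - p2) * pheromone[edge] + p2 * (1 / fit_best)
    return best_path                                # step 6: path of the last iteration
```

A small call with fixed scores shows the reinforcement effect: once a path is chosen, its edges gain pheromone and the same path is preferred in later iterations.

```python
base = {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.7}
path = run_aco(lambda w, result: base[w], ["a"], list(base),
               M=2, N=3, K=3, p1=0.1, p2=0.1, m=1.0)
```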

Claims (1)

1. A webpage class feature vector extraction method based on the ant colony algorithm, characterized by the following process:
In preprocessing, all information is stored in hash tables, where coco_prepare stores the information of each article, consisting of the article id and each word with its number of occurrences; readhdfs_prepare stores the statistics of the training set of each class, consisting of each word's term frequency, document count, and number of co-occurrences with the class name; the parameters of the ant colony algorithm are set: the number of ants M; the number of iterations N; the number of steps each ant takes, which is the number of feature words K; the initial path pheromone matrix adMatrixs; the local update decay rate p1 and the global update decay rate p2; the amount of pheromone m released by an ant;
Step 1: the M ants are each placed at random on a first word; because the candidate set is large, the random range of the ants is restricted manually: the tf*idf value of each feature word is calculated, and the 20 to 30 words with the largest values are taken as the random range for the first step of each ant's path;
Step 2: for each ant, a tabu list result is maintained that holds the words the ant has already visited, and all words in this list are excluded from the candidate set candidate for the next step; during selection, the likelihood of a word being chosen is calculated as follows:
(1) take the term frequency tf of the word from table readhdfs_prepare, obtain the maximum term frequency tf_max and minimum term frequency tf_min over all words, and normalize: tf = (tf - tf_min) / (tf_max - tf_min);
(2) take the document count df of the word from table readhdfs_prepare, obtain the maximum document count df_max and minimum document count df_min over all words, and normalize: df = (df - df_min) / (df_max - df_min);
(3) take the number of co-occurrences co of the word with the class name from table readhdfs_prepare, obtain the maximum co-occurrence count co_max and minimum co-occurrence count co_min over all words, and normalize: co = (co - co_min) / (co_max - co_min);
(4) obtain from table coco_prepare the total number of co-occurrences coco between the word and all the words in the result list, compute the maximum coco_max and minimum coco_min of this total over all words, and normalize: coco = (coco - coco_min) / (coco_max - coco_min);
(5) obtain the pheromone value r on the corresponding path;
(6) using the proportions tf_per, df_per, co_per and coco_per assigned to the four parts, calculate the selection likelihood p = tf_per*tf + df_per*df + co_per*co + coco_per*coco + r;
Step 3: after the calculation is complete for all candidate words, the word with the largest p value is selected as the word chosen by the ant in this step, and its weight value is recorded; the word is added to this ant's tabu list result and deleted from this ant's candidate set, and the ant proceeds to the next step;
Step 4: after all ants have completed K steps, M paths are obtained and the pheromone content of the paths is updated with the formula: r = (1 - p1) * r + p1 * m, where m is the amount of pheromone released by an ant;
Step 5: the best of the M paths is chosen, its fitness value is recorded as fit_best, and a global update is performed, which updates the pheromone content of each feature word on this path to: r = (1 - p2) * r + p2 * (1 / fit_best);
Step 6: determine whether the iteration count is less than N; if so, continue with the next iteration, repeating steps 1 to 5 to complete local feature value extraction; if it equals N, the words on the best path of the last iteration are the desired feature words that can represent the class.
CN201410004815.7A 2014-01-06 2014-01-06 Webpage class feature vector extracting method based on ant colony algorithm Active CN103744959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410004815.7A CN103744959B (en) 2014-01-06 2014-01-06 Webpage class feature vector extracting method based on ant colony algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410004815.7A CN103744959B (en) 2014-01-06 2014-01-06 Webpage class feature vector extracting method based on ant colony algorithm

Publications (2)

Publication Number Publication Date
CN103744959A CN103744959A (en) 2014-04-23
CN103744959B CN103744959B (en) 2017-01-25

Family

ID=50501977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410004815.7A Active CN103744959B (en) 2014-01-06 2014-01-06 Webpage class feature vector extracting method based on ant colony algorithm

Country Status (1)

Country Link
CN (1) CN103744959B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5640553A (en) * 1995-09-15 1997-06-17 Infonautics Corporation Relevance normalization for documents retrieved from an information retrieval system in response to a query
CN102222098A (en) * 2011-06-20 2011-10-19 北京邮电大学 Method and system for pre-fetching webpage
CN102254004A (en) * 2011-07-14 2011-11-23 北京邮电大学 Method and system for modeling Web in weblog excavation

Also Published As

Publication number Publication date
CN103744959A (en) 2014-04-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant