CN104731909A

CN104731909A - Commodity information extraction method based on HERITRIX and HTMLPARSER

Info

Publication number: CN104731909A
Application number: CN201510129487.8A
Authority: CN
Inventors: 焦毓葳; 徐宏伟; 崔乐乐
Original assignee: Inspur Group Co Ltd
Current assignee: Inspur Group Co Ltd
Priority date: 2015-03-24
Filing date: 2015-03-24
Publication date: 2015-06-24

Abstract

The invention discloses a commodity information extraction method based on HERITRIX and HTMLPARSER. The method comprises parsing a Web page with HtmlParser and extracting the hyperlinks in the page to obtain useful information, and extending the crawling logic of Heritrix by modifying its modules so that commodity page information is captured accurately. Compared with the prior art, parsing Web pages with HtmlParser allows the hyperlinks in a page to be extracted quickly to obtain useful information, and extending the crawling logic of Heritrix allows commodity page information to be captured accurately and improves crawling efficiency.

Description

A commodity information extraction method based on HERITRIX and HTMLPARSER

Technical field

The present invention relates to the technical field of computer data mining and processing, and specifically to a commodity information extraction method based on HERITRIX and HTMLPARSER.

Background technology

Web page parsing means that a program automatically analyzes the content of a web page and obtains information from it so that the information can be processed further. Page parsing is an indispensable and very important part of how a web crawler obtains data. Every web page contains many hyperlinks, and much of the information on the web is reached through those hyperlinks, so obtaining them effectively has become an important step in Web mining.

With the diversification of information, general-purpose search engines aimed at all users can no longer meet the deeper, more specialized and more detailed query needs of specific users, and vertical search engines have emerged in response. They were proposed as a new search engine service model to address the problems of general-purpose search engines: too much information, inaccurate queries and insufficient depth. Web crawlers play a very important role in a search engine. HtmlParser and Heritrix, working together, serve as tools for collecting information from the network and can effectively extract the key information of commodity web pages.

Summary of the invention

The technical task of the present invention is to provide a commodity information extraction method based on HERITRIX and HTMLPARSER.

The technical task of the present invention is achieved as follows. The commodity information extraction method is: parse the Web page with HtmlParser and extract the hyperlinks in the page to obtain useful information; extend the crawling logic of Heritrix by modifying its modules so that commodity page information is captured accurately.

The steps for extracting the hyperlinks in a web page with HtmlParser are as follows:

Step 1: import the HtmlParser package;

Step 2: pass the page information to HtmlParser and set the page encoding;

Step 3: obtain the nodelist from the parser;

Step 4: loop over the nodelist; the entries stored in the nodelist are the URLs of the page's commodity information.

The crawling logic of Heritrix is extended by modifying its modules so that a single host can be crawled with multiple threads.

Compared with the prior art, the commodity information extraction method based on HERITRIX and HTMLPARSER of the present invention parses Web pages with HtmlParser and can extract the hyperlinks in a page to obtain useful information, with the advantage of fast extraction; it extends the crawling logic of Heritrix, can capture commodity page information accurately, and improves crawling efficiency.

Description of the drawings

Figure 1 is a schematic diagram of crawling information with HtmlParser.

Embodiment

Embodiment 1:

The commodity information extraction method is: parse the Web page with HtmlParser and extract the hyperlinks in the page to obtain useful information; extend the crawling logic of Heritrix by modifying its modules, so that a single host is crawled with multiple threads and commodity page information is captured accurately.

Step 1: import the HtmlParser package;

Step 2: pass the page information to HtmlParser and set the page encoding;

Step 3: obtain the nodelist from the parser.

Embodiment 2:

The commodity information extraction method is: extend the crawling logic of Heritrix and modify its modules so that commodity page information is captured accurately. A class ExtractorForPcOnline is added to the package org.archive.crawler.extractor to analyze web page content and select candidate URLs.
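As an illustration only, the selection of candidate URLs can be sketched in plain Java. The class name CandidateLinkFilter, the method selectCandidates and the URL pattern are hypothetical and do not come from the original disclosure; in practice this logic would presumably sit inside a Heritrix extractor class such as ExtractorForPcOnline, which would hand the selected links back to the crawler.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical candidate-link filter; not part of Heritrix or of the patent. */
final class CandidateLinkFilter {
    // Assumed shape of a commodity detail page, e.g. http://product.example.com/12345.html
    private static final Pattern PRODUCT_PAGE =
            Pattern.compile("^https?://product\\.example\\.com/\\d+\\.html$");

    /** Keeps only links that look like commodity pages, in order, without duplicates. */
    static List<String> selectCandidates(List<String> links) {
        Set<String> candidates = new LinkedHashSet<>();
        for (String link : links) {
            if (link != null && PRODUCT_PAGE.matcher(link).matches()) {
                candidates.add(link);
            }
        }
        return new ArrayList<>(candidates);
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        links.add("http://product.example.com/12345.html");
        links.add("http://www.example.com/news/index.html");
        links.add("http://product.example.com/12345.html");
        System.out.println(selectCandidates(links));
    }
}
```

Filtering at extraction time keeps navigation and news pages out of the crawl frontier, which is one plausible reading of how the method captures commodity pages accurately.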

Multithreaded crawling of a single host is achieved by extending queue-assignment-policy.
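One possible way to extend the queue assignment policy is sketched here under stated assumptions. Heritrix 1.x by default keys its URI queues by host name, so all pages of one host wait in a single queue. If the queue key is instead derived from a hash of the whole URL, the URLs of one host are spread over several queues that different threads can work on. The sketch shows only this key computation in plain Java; the class name, the constant QUEUES_PER_HOST and the key format are hypothetical, and the patent does not state which hash is used.

```java
import java.net.URI;

/** Hypothetical queue-key computation; not the Heritrix API itself. */
final class HashedQueueKeySketch {
    /** Assumed number of queues that the URLs of one host are spread over. */
    static final int QUEUES_PER_HOST = 50;

    /**
     * Builds a queue key of the form "host#bucket".
     * URI.create throws IllegalArgumentException for a malformed URL.
     */
    static String queueKeyFor(String url) {
        String host = URI.create(url).getHost();
        // floorMod keeps the bucket in [0, QUEUES_PER_HOST) even for negative hash codes.
        int bucket = Math.floorMod(url.hashCode(), QUEUES_PER_HOST);
        return (host == null ? "unknown" : host) + "#" + bucket;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            String url = "http://www.example.com/item/" + i;
            System.out.println(url + " -> " + queueKeyFor(url));
        }
    }
}
```

Spreading one host over several queues also weakens any per-queue politeness delay toward that host, so the number of queues is a trade-off between crawl speed and load on the target site.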

The hyperlinks in the web page are extracted with HtmlParser in order to extract the commodity information:

Step 1: import the HtmlParser package;

import org.htmlparser.Parser;

import org.htmlparser.filters.NodeClassFilter;

import org.htmlparser.tags.LinkTag;

import org.htmlparser.util.*;

Step 2: pass the page information to HtmlParser and set the page encoding;

Parser parser = new Parser(url);

parser.setEncoding(pageEncoding);

Step 3: obtain the nodelist from the parser;

NodeList nodelist = parser.parse(new NodeClassFilter(LinkTag.class));

Step 4: loop over the nodelist; the entries stored in the nodelist are the URLs of the page's commodity information;

Within the loop, String linkName = ((LinkTag) nodelist.elementAt(i)).getLinkText(); obtains the text of each page link.
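For readers without the HtmlParser jar at hand, the same four steps can be imitated with the JDK alone. The following is a simplified stand-in and not the HtmlParser API: it uses a regular expression, so it only copes with simple anchor tags whose href value is quoted. The class name, the method name and the sample page are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** JDK-only stand-in for the HtmlParser steps; all names are illustrative. */
final class LinkExtractionSketch {
    // Matches <a ... href="URL" ...>TEXT</a>; handles only quoted href values.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns a map of link URL to link text, in document order. */
    static Map<String, String> extractLinks(String html) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {                       // loop over every anchor found
            String url = m.group(1).trim();
            // Drop nested tags such as <b> so that only the visible text remains.
            String text = m.group(2).replaceAll("<[^>]+>", "").trim();
            links.put(url, text);
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"http://product.example.com/1.html\">Phone A</a>"
                + "<a href='http://product.example.com/2.html' class=\"x\">Phone <b>B</b></a>"
                + "</body></html>";
        extractLinks(page).forEach((url, text) -> System.out.println(text + " -> " + url));
    }
}
```

A real HTML parser such as HtmlParser is more robust than this pattern on malformed markup, which is the reason the method relies on it.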

Explanation of terms:

Heritrix is an open-source web crawler written in Java that users can use to capture the resources they want from the network. Its outstanding feature is its good extensibility, which makes it easy for users to implement their own crawling logic.

HtmlParser is a library written in pure Java for parsing HTML (an application of the Standard Generalized Markup Language). It does not depend on other Java library files and is mainly used to transform or extract HTML. It parses HTML at high speed and reliably.

URL: a uniform resource locator (URL) is a concise representation of the location of, and the access method for, a resource available on the internet; it is the standard address of a resource on the internet. Each file on the internet has a unique URL, and the information it contains indicates where the file is located and how a browser should handle it.
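To make the parts of a URL concrete, the short sketch below splits one into its access method (scheme), host and path using the JDK class java.net.URI. The class name UrlPartsSketch and the sample address are hypothetical.

```java
import java.net.URI;

/** Illustrative only: shows the access method and location encoded in a URL. */
final class UrlPartsSketch {
    /** Formats the scheme, host and path of the given URL as one line of text. */
    static String describe(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException if malformed
        return "scheme=" + uri.getScheme()
                + ", host=" + uri.getHost()
                + ", path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // prints "scheme=http, host=product.example.com, path=/12345.html"
        System.out.println(describe("http://product.example.com/12345.html"));
    }
}
```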

From the embodiments above, those skilled in the art can readily implement the present invention. It should be understood, however, that the present invention is not limited to the embodiments described above. On the basis of the disclosed embodiments, those skilled in the art can combine different technical features in any way to realize different technical solutions.

Claims

1. A commodity information extraction method based on HERITRIX and HTMLPARSER, characterized in that the commodity information extraction method is: parse the Web page with HtmlParser and extract the hyperlinks in the page to obtain useful information; extend the crawling logic of Heritrix by modifying its modules so that commodity page information is captured accurately.

2. The commodity information extraction method based on HERITRIX and HTMLPARSER according to claim 1, characterized in that the steps for extracting the hyperlinks in a web page with HtmlParser are as follows:

Step 1: import the HtmlParser package;

Step 2: pass the page information to HtmlParser and set the page encoding;

Step 3: obtain the nodelist from the parser.

3. The commodity information extraction method based on HERITRIX and HTMLPARSER according to claim 1, characterized in that the crawling logic of Heritrix is extended by modifying its modules so that a single host is crawled with multiple threads.