CN104731909A - Commodity information extraction method based on HERITRIX and HTMLPARSER - Google Patents
Commodity information extraction method based on HERITRIX and HTMLPARSER
- Publication number
- CN104731909A
- Authority
- CN
- China
- Prior art keywords
- heritrix
- htmlparser
- webpage
- commodity information
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a commodity information extraction method based on Heritrix and HtmlParser. The method comprises: parsing a Web page with HtmlParser and extracting the hyperlinks in the page to obtain useful information; and extending the crawling logic of Heritrix by modifying Heritrix modules, so that commodity page information is captured accurately. Compared with the prior art, parsing the Web page with HtmlParser allows the hyperlinks in the page to be extracted and useful information to be obtained at high speed, and extending the crawling logic of Heritrix allows commodity page information to be captured accurately and improves crawling efficiency.
Description
Technical field
The present invention relates to the technical field of computer data mining and processing, and specifically to a commodity information extraction method based on Heritrix and HtmlParser.
Background art
Web page parsing means that a program automatically analyzes the content of a web page and obtains information from it for further processing. Parsing is an indispensable and important part of how a web crawler acquires data. Every web page contains many hyperlinks, and much of the information on the Web is reached through them, so obtaining these hyperlinks effectively is an important step in Web mining.
With the diversification of information, general-purpose search engines aimed at all users can no longer satisfy the deeper, more specialized, and more detailed query needs of specific users, and vertical search engines have emerged in response. The vertical search engine is a new search service model proposed to address the shortcomings of general-purpose search engines: too much information, imprecise queries, and insufficient depth. Web crawlers play a very important role in a search engine. Used together, HtmlParser and Heritrix are tools for collecting information from the network and can effectively extract the key information of commodity web pages.
Summary of the invention
The technical task of the present invention is to provide a commodity information extraction method based on Heritrix and HtmlParser.
The technical task of the present invention is achieved as follows. The commodity information extraction method comprises: parsing a Web page with HtmlParser and extracting the hyperlinks in the page to obtain useful information; and extending the crawling logic of Heritrix by modifying Heritrix modules, so that commodity page information is captured accurately.
The steps of extracting the hyperlinks in a web page with HtmlParser are as follows:
Step 1: import the HtmlParser package;
Step 2: pass the page information to HtmlParser and set the page encoding;
Step 3: obtain the node list from the parser;
Step 4: iterate over the node list; the information stored in the node list is the URLs of the page's commodity information.
Extending the crawling logic of Heritrix by modifying Heritrix modules enables multithreaded crawling of a single host.
Compared with the prior art, the commodity information extraction method based on Heritrix and HtmlParser of the present invention parses Web pages with HtmlParser, so that the hyperlinks in a page can be extracted and useful information obtained quickly; and it extends the crawling logic of Heritrix, so that commodity page information can be captured accurately and crawling efficiency is improved.
Description of the drawings
Fig. 1 is a schematic diagram of crawling information with HtmlParser.
Embodiment
Embodiment 1:
The commodity information extraction method comprises: parsing a Web page with HtmlParser and extracting the hyperlinks in the page to obtain useful information; and extending the crawling logic of Heritrix by modifying Heritrix modules to enable multithreaded crawling of a single host, so that commodity page information is captured accurately.
The steps of extracting the hyperlinks in a web page with HtmlParser are as follows:
Step 1: import the HtmlParser package;
Step 2: pass the page information to HtmlParser and set the page encoding;
Step 3: obtain the node list from the parser;
Step 4: iterate over the node list; the information stored in the node list is the URLs of the page's commodity information.
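Step 2 matters because a page decoded with the wrong encoding yields garbled commodity names. The following is a minimal sketch, not part of the disclosed method, of how the page encoding might be chosen before parsing. It uses only the Java standard library; the class name PageEncodingSketch, its method names, and the rule of reading the charset from a meta tag with a caller-supplied fallback are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: picks the encoding declared by the page itself.
final class PageEncodingSketch {
    // Matches charset=... inside a <meta> tag, e.g. <meta charset="gbk"> or
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the page, or the fallback if none is declared or supported. */
    static Charset detectCharset(byte[] rawPage, Charset fallback) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives intact.
        String probe = new String(rawPage, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(probe);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    /** Decodes the raw page bytes with the detected (or fallback) charset. */
    static String decode(byte[] rawPage, Charset fallback) {
        return new String(rawPage, detectCharset(rawPage, fallback));
    }

    public static void main(String[] args) {
        byte[] raw = "<html><head><meta charset=\"utf-8\"></head><body>item</body></html>"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(detectCharset(raw, StandardCharsets.ISO_8859_1));
    }
}
```

The detected charset name could then be handed to the parser in Step 2 as the page encoding.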
Embodiment 2:
In this embodiment, the commodity information extraction method extends the crawling logic of Heritrix by modifying Heritrix modules, so that commodity page information can be captured accurately. A class ExtractorForPcOnline is added to the package org.archive.crawler.extractor to analyze web page content and select candidate URLs;
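The selection of candidate URLs can be pictured as a filter over the links found in a page. The following is a minimal sketch using only the Java standard library, not the ExtractorForPcOnline class itself and not the Heritrix extractor API. The class name CandidateUrlFilterSketch and the URL rule (product pages under /product/ ending in a numeric id plus .html) are hypothetical; the actual rule depends on the URL layout of the target site, which the disclosure does not give.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the candidate-selection logic of a custom extractor.
final class CandidateUrlFilterSketch {
    // Assumed rule: http(s)://host/product/[optional/sub/dirs/]<digits>.html
    private static final Pattern PRODUCT_URL =
        Pattern.compile("^https?://[^/]+/product/(?:[\\w-]+/)*\\d+\\.html$");

    /** True if the URL looks like a commodity detail page under the assumed rule. */
    static boolean isCandidate(String url) {
        return PRODUCT_URL.matcher(url).matches();
    }

    /** Keeps only the links that should be scheduled for crawling. */
    static List<String> selectCandidates(List<String> extractedLinks) {
        List<String> kept = new ArrayList<>();
        for (String link : extractedLinks) {
            if (isCandidate(link)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
            "http://shop.example.com/product/phones/12345.html",
            "http://shop.example.com/about.html");
        System.out.println(selectCandidates(links));
    }
}
```

Discarding non-commodity links before they are scheduled keeps the crawl focused on commodity pages.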
Multithreaded crawling of a single host is achieved by extending the queue-assignment-policy;
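The idea behind extending the queue assignment policy can be sketched as follows. A policy that keys queues by hostname alone places every URL of one host in a single queue, so those URLs tend to be fetched one after another; adding a hash bucket to the key spreads them over several queues that different threads can work on. The sketch uses only the Java standard library and is not the Heritrix policy class. The class name QueueKeySketch, the key format "host#bucket", and the use of String.hashCode are assumptions for illustration; the disclosure does not state which hash or key format is used.

```java
import java.net.URI;

// Hypothetical illustration of queue keys; not the Heritrix API.
final class QueueKeySketch {
    /** Host-only key: every URL of one host lands in the same queue. */
    static String hostKey(String url) {
        return URI.create(url).getHost();
    }

    /** Bucketed key: URLs of one host are spread over `buckets` queues. */
    static String bucketedKey(String url, int buckets) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(url.hashCode(), buckets);
        return hostKey(url) + "#" + bucket;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            String url = "http://shop.example.com/product/" + i + ".html";
            System.out.println(bucketedKey(url, 8) + "  <-  " + url);
        }
    }
}
```

One trade-off of this design is that spreading a host over several queues also weakens per-host politeness, so the number of buckets should be chosen with the load on the target site in mind.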
The hyperlinks in the web page are extracted with HtmlParser to obtain the commodity information:
Step 1: import the HtmlParser packages:
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.*;
Step 2: pass the page information to HtmlParser and set the page encoding:
Parser parser = new Parser(url);
parser.setEncoding(pageEncoding);
Step 3: obtain the node list from the parser, keeping only link tags:
NodeList nodelist = parser.parse(new NodeClassFilter(LinkTag.class));
Step 4: iterate over the node list; the information stored in the node list is the URLs of the page's commodity information;
Inside the loop, String linkName = ((LinkTag) nodelist.elementAt(i)).getLinkText(); obtains the text of each page link, and getLink() on the same LinkTag returns its URL.
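The loop in Step 4 produces, for every link, a URL and a link text. To show that result without depending on the HtmlParser library, the following sketch does the same job with java.util.regex from the Java standard library. This is a substitute technique used only for illustration: regular expressions are far less robust on real-world HTML than a proper parser. The class name LinkListSketch and its nested Link type are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex-based stand-in for the HtmlParser link loop; illustration only.
final class LinkListSketch {
    /** One extracted anchor: target URL plus visible text. */
    static final class Link {
        final String url;
        final String text;

        Link(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    // Matches <a ... href="..."> inner </a>; group 1 = href value, group 2 = inner markup.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns every anchor with a quoted href, in document order. */
    static List<Link> extractLinks(String html) {
        List<Link> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            // Drop nested tags such as <b> or <img> to keep only the visible text.
            String text = m.group(2).replaceAll("<[^>]*>", "").trim();
            links.add(new Link(m.group(1), text));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<ul><li><a href=\"/product/1.html\">Phone <b>X</b></a></li>"
            + "<li><a class='p' href='/product/2.html'>Laptop Y</a></li></ul>";
        for (Link link : extractLinks(html)) {
            System.out.println(link.text + " -> " + link.url);
        }
    }
}
```

In the method described here, the url field corresponds to the commodity page address and the text field to the link name obtained in the loop.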
Explanation of terms:
Heritrix is an open-source web crawler developed in Java; users can use it to fetch the resources they want from the network. Its outstanding feature is its good extensibility, which makes it easy for users to implement their own crawling logic.
HtmlParser is a library written in pure Java for parsing HTML (an application of the Standard Generalized Markup Language). It does not depend on other Java library files and is mainly used to transform or extract HTML. It parses HTML very quickly and tolerates malformed markup.
URL: a uniform resource locator is a concise representation of the location of, and the access method for, a resource available on the Internet; it is the address of a standard resource on the Internet. Each file on the Internet has a unique URL, which indicates where the file is located and how a browser should handle it.
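Because links extracted from a page are often relative, a crawler usually has to turn them into full URLs before scheduling them. The following is a minimal sketch of that step with java.net.URI from the Java standard library; the class name UrlPartsSketch, the method name toAbsolute, and the example addresses are assumptions for illustration and do not come from the disclosure.

```java
import java.net.URI;

// Hypothetical helper showing the parts of a URL and relative-link resolution.
final class UrlPartsSketch {
    /** Resolves a possibly relative href against the URL of the page it was found on. */
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        URI page = URI.create("http://shop.example.com/list/index.html");
        // The access method (scheme), the host, and the path of the resource.
        System.out.println(page.getScheme() + " | " + page.getHost() + " | " + page.getPath());
        System.out.println(toAbsolute(page.toString(), "../product/1.html"));
    }
}
```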
From the embodiments above, those skilled in the art can readily implement the present invention. It should be understood, however, that the present invention is not limited to the embodiments described above. On the basis of the disclosed embodiments, those skilled in the art may combine different technical features in any way to realize different technical solutions.
Claims (3)
1. A commodity information extraction method based on Heritrix and HtmlParser, characterized in that the method comprises: parsing a Web page with HtmlParser and extracting the hyperlinks in the page to obtain useful information; and extending the crawling logic of Heritrix by modifying Heritrix modules, so that commodity page information is captured accurately.
2. The commodity information extraction method based on Heritrix and HtmlParser according to claim 1, characterized in that the steps of extracting the hyperlinks in a web page with HtmlParser are as follows:
Step 1: import the HtmlParser package;
Step 2: pass the page information to HtmlParser and set the page encoding;
Step 3: obtain the node list from the parser;
Step 4: iterate over the node list; the information stored in the node list is the URLs of the page's commodity information.
3. The commodity information extraction method based on Heritrix and HtmlParser according to claim 1, characterized in that extending the crawling logic of Heritrix by modifying Heritrix modules enables multithreaded crawling of a single host.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510129487.8A CN104731909A (en) | 2015-03-24 | 2015-03-24 | Commodity information extraction method based on HERITRIX and HTMLPARSER |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510129487.8A CN104731909A (en) | 2015-03-24 | 2015-03-24 | Commodity information extraction method based on HERITRIX and HTMLPARSER |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104731909A (en) | 2015-06-24 |
Family
ID=53455796
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510129487.8A (Pending; CN104731909A) | Commodity information extraction method based on HERITRIX and HTMLPARSER | 2015-03-24 | 2015-03-24 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104731909A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080098300A1 (en) * | 2006-10-24 | 2008-04-24 | Brilliant Shopper, Inc. | Method and system for extracting information from web pages |
CN101470752A (en) * | 2007-12-29 | 2009-07-01 | 指点通(北京)科技有限公司 | Search engine method based on keyword resolution scheduling |
CN101937469A (en) * | 2010-09-15 | 2011-01-05 | 深圳市任子行网络技术股份有限公司 | Information capture method of video website |
CN102968495A (en) * | 2012-11-29 | 2013-03-13 | 河海大学 | Vertical search engine and method for searching contrast association shopping information |
Non-Patent Citations (1)
Title |
---|
Liu Wenhao et al.: "Research on web page commodity information extraction based on Heritrix and HTMLParser", Computer CD Software and Applications * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106095979B (en) | URL merging processing method and device | |
TW201250492A (en) | Method and system of extracting web page information | |
CN103559235A (en) | Online social network malicious webpage detection and identification method | |
WO2011017929A1 (en) | Method and apparatus for positioning effective information quickly by mobile phone browser | |
CN104462547A (en) | Configurable webpage data acquisition method and system | |
CN103678511A (en) | Method and device for extracting webpage content according to visualized template | |
CN101571860A (en) | Method and device for generating dynamic website as well as method and device for extracting structural data | |
CN106294885A (en) | A kind of data collection towards isomery webpage and mask method | |
CN103778238A (en) | Method for automatically building classification tree from semi-structured data of Wikipedia | |
CN103970898A (en) | Method and device for extracting information based on multistage rule base | |
CN104142985A (en) | Semi-automatic vertical crawler generation tool and method | |
CN104899219A (en) | Screening method and system of pseudo-static URL (Uniform Resource Locator) and webpage crawling method and system | |
CN102760150A (en) | Webpage extraction method based on attribute reproduction and labeled path | |
CN104391706A (en) | Reverse engineering based model base structuring method | |
CN103678509A (en) | Method and device for generating webpage template | |
CN104991904A (en) | Page data acquisition method of dynamic webpage | |
CN102664925A (en) | Method and apparatus for displaying searching result | |
CN105528357A (en) | Webpage content extraction method based on similarity of URLs and similarity of webpage document structures | |
JP4231298B2 (en) | Information extraction rule creation system, information extraction rule creation program, information extraction system, and information extraction program | |
CN103778156A (en) | Method and device for searching for data and server for data search | |
CN104317845A (en) | Method and system for automatic extraction of deep web data | |
CN103678510A (en) | Method and device for providing visualized label for webpage | |
CN104008213A (en) | Method and device for finding and counting webpage information updating | |
CN101807187A (en) | Browsing information-based instant search method | |
CN108363711B (en) | Method and device for detecting dark chain in webpage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20150624 |