CN104731909A - Commodity information extraction method based on HERITRIX and HTMLPARSER - Google Patents


Info

Publication number: CN104731909A
Application number: CN201510129487.8A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 焦毓葳, 徐宏伟, 崔乐乐
Original assignee: Inspur Group Co Ltd
Current assignee: Inspur Group Co Ltd (the listed assignees may be inaccurate)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Priority date: 2015-03-24 (the priority date is an assumption and is not a legal conclusion)
Filing date: 2015-03-24
Publication date: 2015-06-24
Prior art keywords: heritrix, htmlparser, webpage, commodity information, information

Abstract

The invention discloses a commodity information extraction method based on Heritrix and HtmlParser. The method comprises parsing a web page with HtmlParser and extracting the hyperlinks in the page to obtain useful information; and extending the crawl logic with Heritrix and modifying Heritrix modules so that commodity web page information is captured accurately. Compared with the prior art, the method has the following advantages: parsing the web page with HtmlParser allows the hyperlinks in the page to be extracted to obtain useful information, and extraction is fast; extending the crawl logic with Heritrix allows commodity web page information to be captured accurately and improves crawling efficiency.

Description

A commodity information extraction method based on HERITRIX and HTMLPARSER
Technical field
The present invention relates to the technical field of computer data mining and processing, and specifically to a commodity information extraction method based on Heritrix and HtmlParser.
Background art
Web page parsing means that a program automatically analyzes the content of a web page and obtains information from it for further processing. It is an indispensable and very important part of how a web crawler acquires data. Every web page contains many hyperlinks, and much of the information on the web is reachable only through them, so obtaining these hyperlinks efficiently has become an important step in web mining.
With the diversification of information, general-purpose search engines aimed at all users cannot meet the deeper, more specialized, and more detailed query needs of specific users, and vertical search engines have emerged in response. They are a new search engine service model proposed to address the shortcomings of general-purpose search engines: too much information, imprecise results, and insufficient depth. Web crawlers play a very important role in search engines. Used together, HtmlParser and Heritrix serve as tools for collecting information from the network and can effectively extract the key information of commodity web pages.
Summary of the invention
The technical task of the present invention is to provide a commodity information extraction method based on Heritrix and HtmlParser.
The technical task is achieved as follows. The commodity information extraction method parses a web page with HtmlParser and extracts the hyperlinks in the page to obtain useful information; it extends the crawl logic with Heritrix and modifies Heritrix modules so that commodity web page information is captured accurately.
The steps for extracting the hyperlinks in a web page with HtmlParser are as follows:
Step 1: import the HtmlParser package;
Step 2: pass the page information to HtmlParser and set the page encoding;
Step 3: obtain the NodeList from the parser;
Step 4: loop over the NodeList; the entries stored in the NodeList are the URLs of the page's commodity information.
Extending the crawl logic with Heritrix and modifying Heritrix modules enables multithreaded crawling of the same host.
Compared with the prior art, the commodity information extraction method based on Heritrix and HtmlParser of the present invention parses web pages with HtmlParser, so that the hyperlinks in a page can be extracted to obtain useful information, with the advantage of fast extraction; and it extends the crawl logic with Heritrix, so that commodity web page information can be captured accurately and crawling efficiency is improved.
Description of the drawings
Figure 1 is a schematic diagram of crawling information with HtmlParser.
Detailed description
Embodiment 1:
The commodity information extraction method parses a web page with HtmlParser and extracts the hyperlinks in the page to obtain useful information; it extends the crawl logic with Heritrix and modifies Heritrix modules to enable multithreaded crawling of the same host, so that commodity web page information is captured accurately.
The steps for extracting the hyperlinks in a web page with HtmlParser are as follows:
Step 1: import the HtmlParser package;
Step 2: pass the page information to HtmlParser and set the page encoding;
Step 3: obtain the NodeList from the parser;
Step 4: loop over the NodeList; the entries stored in the NodeList are the URLs of the page's commodity information.
Embodiment 2:
The commodity information extraction method extends the crawl logic with Heritrix and modifies Heritrix modules so that commodity web page information can be captured accurately. A class ExtractorForPcOnline is added to the package org.archive.crawler.extractor to analyze web page content and select candidate links;
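The patent does not show the body of ExtractorForPcOnline. As a rough illustration of what selecting candidate links could look like, the sketch below keeps only the links whose URL matches a product-page pattern. It uses only the Java standard library rather than the Heritrix extractor API, and the class name CandidateSelectorSketch, the example.com URLs, and the URL pattern are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: keep only links that look like product detail pages. */
final class CandidateSelectorSketch {

    // Assumed shape of a product-detail URL; a real site needs its own pattern.
    static final Pattern PRODUCT_PAGE =
        Pattern.compile("^https?://product\\.example\\.com/[a-z_]+/\\d+\\.html$");

    /** Returns the subset of links whose whole URL matches PRODUCT_PAGE. */
    static List<String> selectCandidates(List<String> links) {
        List<String> candidates = new ArrayList<>();
        for (String link : links) {
            if (PRODUCT_PAGE.matcher(link).matches()) {
                candidates.add(link);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
            "http://product.example.com/notebook/123.html",
            "http://www.example.com/about.html",
            "http://product.example.com/mobile/9.html");
        // Prints the two product URLs and skips the "about" page.
        for (String candidate : selectCandidates(links)) {
            System.out.println(candidate);
        }
    }
}
```

In an actual Heritrix extractor, the links that survive this filter would be passed on to the crawler for scheduling; that wiring is omitted here.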
Multithreaded crawling of the same host is achieved by extending the queue-assignment-policy;
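The patent does not say how the queue-assignment-policy is extended. Heritrix's default hostname-based policy places all URLs of one host in a single queue, so that host is normally worked by one thread at a time. One plausible approach, assumed here, is to derive the queue key from a hash of the full URL so that a host's URLs are spread over several queues. The sketch below shows only that key computation with the Java standard library; SpreadQueuePolicySketch, queueKey, and queuesPerHost are hypothetical names, not Heritrix API.

```java
import java.net.URI;

/** Hypothetical sketch: spread one host's URLs over several crawl queues. */
final class SpreadQueuePolicySketch {

    /**
     * Builds a queue key of the form "host#bucket". URLs of the same host
     * can land in different buckets, so several threads can work on that host.
     */
    static String queueKey(String url, int queuesPerHost) {
        if (queuesPerHost <= 0) {
            throw new IllegalArgumentException("queuesPerHost must be positive");
        }
        String host = URI.create(url).getHost();
        if (host == null) {
            host = "unknown-host"; // fallback for URLs without a host part
        }
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(url.hashCode(), queuesPerHost);
        return host + "#" + bucket;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            String url = "http://shop.example.com/item/" + i;
            System.out.println(url + " -> " + queueKey(url, 4));
        }
    }
}
```

Spreading one host over more queues raises the request rate against that host, so a real crawl would still need a politeness limit per host.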
The hyperlinks in the web page are extracted with HtmlParser to obtain the commodity information:
Step 1: import the HtmlParser packages:
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.*;
Step 2: pass the page information to HtmlParser and set the page encoding:
Parser parser = new Parser(url);
parser.setEncoding(pageEncoding);
Step 3: obtain the NodeList from the parser:
NodeList nodeList = parser.parse(new NodeClassFilter(LinkTag.class));
Step 4: loop over the NodeList; the entries stored in the NodeList are the URLs of the page's commodity information;
Inside the loop, String linkName = ((LinkTag) nodeList.elementAt(i)).getLinkText(); obtains the name (link text) of each page link.
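The four steps above can be collected into one runnable unit. So that it runs without the HtmlParser jar, the sketch below swaps in java.util.regex as a stand-in for HtmlParser; it is not the patent's implementation, and a regular expression is far less tolerant of real-world HTML than a proper parser. The names LinkSketch, Link, and extractLinks are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Stand-in for the HtmlParser link loop, using only java.util.regex. */
final class LinkSketch {

    /** One extracted hyperlink: its target URL and its visible text. */
    static final class Link {
        final String url;
        final String text;

        Link(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    // Matches <a ... href="..."> ... </a>; group 1 is the URL, group 2 the inner markup.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Collects every anchor in the page, in document order. */
    static List<Link> extractLinks(String html) {
        List<Link> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            // Strip nested tags from the anchor body to approximate the link text.
            String text = m.group(2).replaceAll("<[^>]*>", "").trim();
            links.add(new Link(m.group(1), text));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<ul><li><a href=\"/product/1.html\">Laptop <b>A</b></a></li>"
                    + "<li><a class='x' href='/product/2.html'>Phone B</a></li></ul>";
        for (Link link : extractLinks(html)) {
            System.out.println(link.text + " -> " + link.url);
        }
    }
}
```

Each Link pairs the URL from Step 4 with the link name obtained inside the loop, so later stages can use either.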
Glossary:
Heritrix is an open-source web crawler developed in Java, which users can use to capture the resources they want from the network. Its outstanding feature is its good extensibility, which makes it easy for users to implement their own crawl logic.
HtmlParser is a library, written in pure Java, for parsing HTML (an application of the Standard Generalized Markup Language). It does not depend on other Java library files and is mainly used to transform or extract HTML. It parses HTML very quickly and reliably.
URL: a Uniform Resource Locator is a concise representation of the location of, and the access method for, a resource available on the internet; it is the address of a standard resource on the internet. Each file on the internet has a unique URL, which contains information indicating where the file is located and how a browser should handle it.
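To make the URL definition concrete, the sketch below uses java.net.URI from the Java standard library to split a URL into its access method (scheme) and location (host and path), and to resolve a relative hyperlink against the page it was found on, which a crawler has to do before it can fetch an extracted link. The class name UrlSketch and the example.com addresses are hypothetical.

```java
import java.net.URI;

/** Hypothetical sketch: inspect a URL and resolve relative hyperlinks. */
final class UrlSketch {

    /** Turns an href found on pageUrl into an absolute URL string. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        URI page = URI.create("http://www.example.com/list/index.html?page=2");
        System.out.println("scheme = " + page.getScheme()); // access method
        System.out.println("host   = " + page.getHost());   // where the resource lives
        System.out.println("path   = " + page.getPath());
        System.out.println("query  = " + page.getQuery());

        // A relative link extracted from the page becomes a fetchable address.
        System.out.println(resolve("http://www.example.com/list/index.html",
                                   "../product/1.html"));
    }
}
```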
From the embodiments above, those skilled in the art can readily implement the present invention. It should be understood, however, that the present invention is not limited to the embodiments described above. On the basis of the disclosed embodiments, those skilled in the art can combine different technical features in any way to arrive at different technical solutions.

Claims (3)

1. A commodity information extraction method based on HERITRIX and HTMLPARSER, characterized in that the commodity information extraction method comprises: parsing a web page with HtmlParser and extracting the hyperlinks in the page to obtain useful information; and extending the crawl logic with Heritrix and modifying Heritrix modules so that commodity web page information is captured accurately.
2. The commodity information extraction method based on HERITRIX and HTMLPARSER according to claim 1, characterized in that the steps for extracting the hyperlinks in a web page with HtmlParser are as follows:
Step 1: import the HtmlParser package;
Step 2: pass the page information to HtmlParser and set the page encoding;
Step 3: obtain the NodeList from the parser;
Step 4: loop over the NodeList; the entries stored in the NodeList are the URLs of the page's commodity information.
3. The commodity information extraction method based on HERITRIX and HTMLPARSER according to claim 1, characterized in that extending the crawl logic with Heritrix and modifying Heritrix modules enables multithreaded crawling of the same host.
CN201510129487.8A 2015-03-24 2015-03-24 Commodity information extraction method based on HERITRIX and HTMLPARSER Pending CN104731909A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150624