CN102426600A - Intranet information acquisition method based on meta-search - Google Patents

Intranet information acquisition method based on meta-search

Info

Publication number
CN102426600A
Authority
CN
China
Prior art keywords
search
information
focus
thread
search engine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103508110A
Other languages
Chinese (zh)
Inventor
杨更
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JUNGONGSIBO INFORMATION TECHNOLOGY INDUSTRY Co Ltd
Original Assignee
JUNGONGSIBO INFORMATION TECHNOLOGY INDUSTRY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JUNGONGSIBO INFORMATION TECHNOLOGY INDUSTRY Co Ltd
Priority to CN2011103508110A
Publication of CN102426600A
Legal status: Pending

Abstract

The invention relates to an intranet information acquisition method based on meta-search. Targeting intranet information systems, it gathers and aggregates sensitive information through the built-in search engine of each information system. This preserves the independence of every information system and lets the acquisition system be embedded easily into varied and complex intranet environments. The method has two advantages: when the intranet gains a new information system, the only change required is one additional search configuration; and monitoring and archiving of sensitive words is simple to configure, with no need to customize a dedicated acquisition template.

Description

An intranet information acquisition method based on meta-search
Technical field
The present invention relates to an intranet information acquisition method based on meta-search.
Background art
Effective monitoring and archiving of the massive volume of information on an intranet presupposes an effective acquisition system. Most existing acquisition systems crawl websites directly, which has three drawbacks. First, it is inefficient and places a heavy load on the acquisition system; earlier research adopted distributed acquisition to improve efficiency, but that raises the hardware requirements of the acquisition system. Second, the system must cope with many different kinds of websites, parsing the formats of the acquisition sources is complicated, and the system struggles to keep up with frequent URL changes. Finally, traditional acquisition systems mostly treat archiving as their main purpose and lack analysis and reorganization of the acquired content, so that in the face of a vast amount of content it is hard to spot focus topics and trends in time.
Summary of the invention
The purpose of the invention is to provide a structurally simple intranet information acquisition method based on meta-search.
The method targets the websites and information systems published on the intranet and comprises the following steps: starting the acquisition program in a time-staggered manner; constructing search conditions from the sensitive words for the built-in search engine of each intranet information system; and automatically acquiring the search results.
The key acquisition flow is as follows:
1. Time-staggered start of acquisition threads
For n focuses and the x search engines currently configured, an acquisition task can, once started, make up to n*22 visits to the search engines and parse up to n*x*100 of the latest items. Deduplication, focus analysis, statistics updates, fetching the main body of target pages and similar operations all cause frequent access to the network and the database. If the acquisition threads of all search engines were started at the same moment, they would put excessive pressure on the server hardware and the network environment, and a client that visits a search engine too frequently is also easily placed on its abnormal-access blacklist. The acquisition task therefore starts its acquisition threads in a time-staggered manner.
Before acquisition, the interval of m seconds between the starts of two focus threads is calculated from the current acquisition period (for example, 1 hour) and the number of focuses n, as m=50*60/n. The main process sleeps for m seconds before constructing the next focus thread, so that threads start and exit at intervals throughout the acquisition period, which ensures that multiple threads are not all executing at the same point in time. Beginning m seconds after the last thread has started, the number of currently active threads is checked every 20 seconds: if the thread count is greater than 1, execution continues for another 20 seconds; if the thread count is less than 1, the acquisition program exits.
This mechanism lets the program use resources evenly and effectively, largely prevents the hardware from becoming unresponsive under overly frequent operations, and also contributes to the stability of the running program.
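The staggered start described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the injectable `sleep` parameter and the simplified exit test (stop once no worker thread is alive) are all hypothetical; only the formula m=50*60/n and the 20-second polling interval come from the text.

```python
import threading
import time


def start_interval_seconds(n_focuses, window_seconds=50 * 60):
    """Interval m between two focus-thread starts; m = 50*60/n for a 1-hour period."""
    if n_focuses <= 0:
        raise ValueError("n_focuses must be positive")
    return window_seconds / n_focuses


def run_staggered(focuses, collect, interval, poll=20, sleep=time.sleep):
    """Start one thread per focus, sleeping `interval` seconds between starts.

    `collect` is a hypothetical per-focus acquisition function. `sleep` is
    injectable so that the schedule can be exercised without real waiting.
    """
    threads = []
    for i, focus in enumerate(focuses):
        if i > 0:
            sleep(interval)  # main thread sleeps m seconds before the next start
        worker = threading.Thread(target=collect, args=(focus,))
        worker.start()
        threads.append(worker)
    sleep(interval)  # wait m seconds after the last thread has started
    # Simplified exit test (assumption): poll until no worker is still alive.
    while any(t.is_alive() for t in threads):
        sleep(poll)
    return len(threads)
```

With 60 focuses and the 50-minute start window, `start_interval_seconds(60)` gives one thread start every 50 seconds.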
2. Constructing the search engine link
Each focus combination configured in the system is stored in two fields: included words (multiple words separated by spaces) and excluded words (multiple words separated by spaces). The focus combination has to be split, encoded and recombined before the search engine link is constructed.
First, the words are extracted from the focus combination and converted to URL encoding, and the search engine's own AND and NOT operators are added to recombine the focus words. For example, "loans and deposits+interest rate-housing loan" (the query 存贷款 利率 -房贷) becomes "%E5%AD%98%E8%B4%B7%E6%AC%BE+%E5%88%A9%E7%8E%87+-%E6%88%BF%E8%B4%B7" after conversion. This is then merged with the search engine link, the number of items per page, the encoding format and similar information to obtain the URL, for example http://www.google.com.hk/search?q=%E5%AD%98%E8%B4%B7%E6%AC%BE+%E5%88%A9%E7%8E%87+-%E6%88%BF%E8%B4%B7&um=1&ie=UTF-8&tbs=nws:1&source=og&sa=N&tab=wn&hl=zh-CN&num=100, which searches Google News for "loans and deposits" and "interest rate" without "housing loan", reading 100 items at a time.
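The encoding step can be illustrated with a short Python sketch. The function names and the reduced parameter set (`ie`, `num`) are assumptions for illustration; the "+" separator and the "-" prefix for excluded words follow the example URL in the text, and other engines may use different operators.

```python
from urllib.parse import quote


def build_query(include_words, exclude_words):
    """Percent-encode each word and join with '+'; excluded words get a '-' prefix."""
    terms = [quote(word) for word in include_words.split()]
    terms += ["-" + quote(word) for word in exclude_words.split()]
    return "+".join(terms)


def build_search_url(base, include_words, exclude_words, num=100, encoding="UTF-8"):
    """Merge the encoded query with the engine link, encoding format and page size."""
    query = build_query(include_words, exclude_words)
    return f"{base}?q={query}&ie={encoding}&num={num}"


# Both fields are space-separated, as in the stored focus combination.
print(build_search_url("http://www.google.com.hk/search", "存贷款 利率", "房贷"))
```

Encoding each word separately keeps the "+" and "-" operators literal, whereas encoding the whole query string at once would escape them.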
3. Accessing web pages with a simulated browser
Pages are fetched in simulated-browser mode, with the browser User-Agent set to Mozilla/4.0 (compatible; MSIE 8.0) and automatic HTTP redirection disabled. The request loop runs at most 5 times (for some websites an unlimited number of attempts would loop forever), accumulating cookies on each pass until the HTTP connection status is normal. At that point the accumulated cookies record the redirects of the simulated browser; opening the link once more as a simulated browser with these cookies returns the correct page.
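A rough Python sketch of the cookie-accumulation loop follows. To stay independent of any particular HTTP library, the transport is passed in as `send`, a hypothetical callable that performs one request without following redirects and returns `(status, set_cookies, body)`. Treating status 200 as "normal" and raising an error after 5 failed attempts are assumptions; the text does not say what happens when the limit is reached.

```python
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0)"


def _headers(cookies):
    """Build request headers carrying the cookies accumulated so far."""
    headers = {"User-Agent": USER_AGENT}
    if cookies:
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return headers


def fetch_with_cookies(url, send, max_attempts=5):
    """Accumulate cookies over at most `max_attempts` requests, then fetch the page.

    `send(url, headers)` is a hypothetical transport that does not follow
    redirects and returns (status, set_cookies_dict, body).
    """
    cookies = {}
    for _ in range(max_attempts):
        status, set_cookies, _body = send(url, _headers(cookies))
        cookies.update(set_cookies)
        if status == 200:  # assumption: 200 is the "normal" connection status
            break
    else:
        raise RuntimeError(f"no normal HTTP status after {max_attempts} attempts")
    # Open the link once more with the accumulated cookies to get the real page.
    _status, _set_cookies, body = send(url, _headers(cookies))
    return body
```

The bounded `for` loop with an `else` branch mirrors the 5-iteration cap, which is what prevents the endless loop mentioned in the text.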
Many high-traffic websites use GZIP compression for data transfer, and reading such a response as an ordinary data stream yields garbled bytes. The transfer format must therefore be determined before the document stream is read: calling connection.getHeaderField("Content-Encoding") shows whether the page was sent GZIP-compressed. If it was, the document stream must be read through a GZIPInputStream; otherwise everything received, English characters included, is garbled.
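The same check can be sketched in Python, where `gzip.decompress` plays the role of Java's GZIPInputStream. The function name `read_body` and the idea of passing in the header value as a string are assumptions made for illustration.

```python
import gzip


def read_body(raw, content_encoding):
    """Return the decompressed body if Content-Encoding says gzip, else the raw bytes."""
    if content_encoding and "gzip" in content_encoding.lower():
        return gzip.decompress(raw)
    return raw


# A compressed payload round-trips back to the original bytes.
compressed = gzip.compress(b"<html>hello</html>")
print(read_body(compressed, "gzip"))
```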
For most pages, the character set can be obtained from the HTTP header by calling getHeaderField("Content-Type") on the connection and taking the substring after "charset=". If the header does not carry it, the first 1024 characters are read and the HTML head is checked against several regular expressions to obtain the encoding. Once the encoding is known, the document stream can be read directly to obtain the page source.
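The two-stage lookup can be sketched in Python as below. The patent does not list its regular expressions, so the two patterns here are illustrative assumptions, as are the function name, the UTF-8 fallback and the choice to inspect the first 1024 bytes rather than characters.

```python
import re

# Hypothetical patterns; the patent only says "several regular expressions".
META_PATTERNS = [
    re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.I),
    re.compile(rb'<\?xml[^>]+encoding=["\']([\w-]+)', re.I),
]


def detect_charset(content_type, body, default="utf-8"):
    """Try the Content-Type header first, then the first 1024 bytes of the page."""
    if content_type:
        marker = "charset="
        pos = content_type.lower().find(marker)
        if pos != -1:
            value = content_type[pos + len(marker):].split(";")[0]
            return value.strip().strip('"\'')
    head = body[:1024]
    for pattern in META_PATTERNS:
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii")
    return default  # assumption: fall back to UTF-8 when nothing is declared
```

Checking the header first avoids reading the body at all in the common case; the head scan only runs when the server omits the charset.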
4. Parsing the acquired result pages
Parsing the source of the result pages returned by a search engine yields the information items on each page, such as title, link, publication time and summary.
Parsing relies mainly on a combination of regular expressions configured in advance. Regular expressions are built from the distinctive HTML tags at the start and end of each information field in the page source, and they match the strings that contain the required information. For example, the regular expression that extracts titles from the source of a Google results page is "<a href=(.*)</a></h3>"; removing the "<.*>" tags from the matched string yields the title.
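The title extraction can be sketched in Python as follows. The patterns are adapted from the ones quoted in the text: the sketch swaps in a non-greedy `(.*?)` and a `<[^>]*>` tag pattern so that several results on one page are matched separately, which is an assumption about the intended behavior. The sample markup is made up for illustration.

```python
import re
from html import unescape

# Non-greedy variants of the patterns quoted in the text (assumption).
TITLE_RE = re.compile(r"<a href=(.*?)</a></h3>", re.S)
TAG_RE = re.compile(r"<[^>]*>")


def extract_titles(page_source):
    """Match each title fragment, then strip the HTML tags it contains."""
    titles = []
    for match in TITLE_RE.finditer(page_source):
        text = TAG_RE.sub("", match.group(0))
        titles.append(unescape(text).strip())
    return titles


sample = (
    '<h3 class="r"><a href="http://a.example/1">Rate <em>cut</em> news</a></h3>'
    '<h3 class="r"><a href="http://a.example/2">Second</a></h3>'
)
print(extract_titles(sample))
```

Keeping the patterns in configuration, as the text describes, means a change in a search engine's result markup needs only a pattern update rather than a code change.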
For news pages, the number of related news items must also be obtained. After Baidu hot news is collected, each hot-news title is searched with Baidu news title search and Google News title search, and the results are stored as the related news of that hot-news item (linked by the docid of the main hot-news item). For Google hot news, the secondary page behind the "more related" link is fetched and used as the related news of the hot-news item.
Because it adopts the above technical scheme, the invention has the following advantages:
1. When the intranet gains a new information system, the only change to the system is one additional search configuration;
2. Monitoring and archiving of sensitive words is simple to configure, with no need to customize a dedicated acquisition template.
Description of drawings
Fig. 1 is the acquisition flow chart of the method of the invention.
Embodiment
The invention is described in detail below with reference to the accompanying drawing and an embodiment.
Fig. 1 shows the flow of the acquisition module of a meta-search-based intranet sensitive-information discovery system:
First, the acquisition program is started on schedule (1); the program reads the configured focuses (the sensitive-word policy), constructs the search engine links and obtains the search results (2); the results are deduplicated and the main information of each web page is obtained (3); the results are stored in two parts, with field information such as target and focus stored in data tables (4) and the extracted page body stored in txt format (5); the body content is placed in the local file system (8), where the indexing program processes it and builds an index (6); the list data is placed in the local database (7); the database and the file system together provide a full-text search service to the page program (9).

Claims (1)

1. An intranet information acquisition method based on meta-search, characterized in that it comprises the following steps: starting the acquisition program in a time-staggered manner; constructing search conditions from the sensitive words for the built-in search engine of each intranet information system; and automatically acquiring the search results;
wherein, for the time-staggered start of acquisition threads: before acquisition, the interval of m seconds between the starts of two focus threads is calculated from the current acquisition period and the number of focuses n, with m=50*60/n; the main process sleeps for m seconds before constructing the next focus thread, so that threads start and exit at intervals within the acquisition period, which ensures that multiple threads are not all executing at the same point in time; beginning m seconds after the last thread has started, the number of currently active threads is checked every 20 seconds; if the thread count is greater than 1, execution continues for another 20 seconds; if the thread count is less than 1, the acquisition program exits;
for constructing the search engine link: each focus combination configured in the system is stored in two fields, included words and excluded words; the words are first extracted from the focus combination and converted to URL encoding, the search engine's AND and NOT operators are added to recombine the focus words, and the search engine link is then constructed;
for accessing web pages with a simulated browser: pages are fetched in simulated-browser mode, with the browser User-Agent set to Mozilla/4.0 (compatible; MSIE 8.0) and automatic HTTP redirection disabled; the request loop runs at most 5 times, accumulating cookies until the HTTP connection status is normal, at which point the accumulated cookies record the redirects of the simulated browser; opening the link once more as a simulated browser with these cookies returns the correct page;
for parsing the acquired result pages: parsing of the page source relies mainly on a combination of regular expressions configured in advance; regular expressions are built from the distinctive HTML tags at the start and end of each information field in the page source, and they match the strings that contain the required information.
CN2011103508110A 2011-11-08 2011-11-08 Intranet information acquisition method based on meta-search Pending CN102426600A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103508110A CN102426600A (en) 2011-11-08 2011-11-08 Intranet information acquisition method based on meta-search

Publications (1)

Publication Number Publication Date
CN102426600A true CN102426600A (en) 2012-04-25

Family

ID=45960580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103508110A Pending CN102426600A (en) 2011-11-08 2011-11-08 Intranet information acquisition method based on meta-search

Country Status (1)

Country Link
CN (1) CN102426600A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080098300A1 (en) * 2006-10-24 2008-04-24 Brilliant Shopper, Inc. Method and system for extracting information from web pages
CN101727485A (en) * 2009-12-10 2010-06-09 湖南科技大学 WSDL collection method based on focused search

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902667A (en) * 2014-03-14 2014-07-02 浪潮电子信息产业股份有限公司 Simple network information collector achieving method based on meta-search
CN112528205A (en) * 2020-12-22 2021-03-19 中科院计算技术研究所大数据研究院 Webpage main body information extraction method and device and storage medium
CN112528205B (en) * 2020-12-22 2021-10-29 中科院计算技术研究所大数据研究院 Webpage main body information extraction method and device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120425