CN102426600A

CN102426600A - Intranet information acquisition method based on meta-search

Info

Publication number: CN102426600A
Application number: CN2011103508110A
Authority: CN
Inventors: 杨更
Original assignee: JUNGONGSIBO INFORMATION TECHNOLOGY INDUSTRY Co Ltd
Current assignee: JUNGONGSIBO INFORMATION TECHNOLOGY INDUSTRY Co Ltd
Priority date: 2011-11-08
Filing date: 2011-11-08
Publication date: 2012-04-25

Abstract

The invention relates to an intranet information acquisition method based on meta-search. The method targets intranet information systems and collects and aggregates sensitive information through the built-in search engine of each information system. This preserves the independence of each information system and allows the acquisition system to be embedded easily into different, complex intranet environments. The method has the following advantages: when the set of intranet information systems expands, the only change needed is one additional search configuration; and configuring the monitoring and archiving of sensitive words is simple, with no need to customize a dedicated acquisition template.

Description

An intranet information acquisition method based on meta-search

Technical field

The present invention relates to an intranet information acquisition method based on meta-search.

Background technology

To monitor and archive the massive amount of information in an intranet effectively, an effective acquisition system is a prerequisite. Most existing acquisition systems crawl websites directly. First, this is inefficient and places a heavy load on the acquisition system. Some research has used distributed acquisition systems to improve efficiency, but this raises the hardware requirements of the acquisition system. Second, such systems must deal with many different kinds of website, the format analysis of each acquisition source is complicated, and the acquisition system struggles to keep up with frequent URL changes. Finally, traditional acquisition systems mostly treat archiving as their main purpose and lack analysis and reorganization of the collected content, so hot topics and trends are hard to spot in time amid a vast body of content.

Summary of the invention

The purpose of the present invention is to provide a structurally simple intranet information acquisition method based on meta-search.

The intranet information acquisition method based on meta-search of the present invention targets the websites and information systems published on the intranet and comprises the following steps: starting the acquisition program in a time-shared manner; constructing search conditions from the sensitive words for the built-in search engine of each intranet information system; and automatically collecting the search results.

The key acquisition flow is as follows:

1. Time-shared start of acquisition threads

For n focuses and the x search engines currently configured, starting an acquisition task can trigger at most n*22 visits to the search engines and parse at most n*x*100 of the latest items. Operations such as deduplication, hot-topic analysis, statistics updates and fetching the main body of target pages cause frequent access to the network and the database. If the acquisition threads for every search engine were started at the same moment, they would put excessive pressure on the server hardware and the network environment, and frequent visits to a search engine are also likely to land the client on its abnormal-access blacklist. The acquisition task therefore starts its acquisition threads in a time-shared manner.

Before acquisition, the interval of m seconds between the start of two focus threads is calculated from the current acquisition period (for example, 1 hour) and the number of focuses n, with m = 50*60/n. The main process sleeps for m seconds before constructing the next focus thread, so that threads start and exit at intervals throughout the acquisition period and multiple threads do not execute at the same point in time. Starting m seconds after the last thread has been started, the number of currently active threads is checked every 20 seconds: if the thread count is greater than 1, execution continues for another 20 seconds; if the thread count is less than 1, the acquisition program exits.

This mechanism first of all ensures that the program uses resources evenly and effectively, largely avoids hardware becoming unresponsive because of overly frequent operations, and also provides some assurance of the program's runtime stability.
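The time-shared start described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class and method names are hypothetical, the interval and polling period are passed in milliseconds so the sketch can be exercised quickly, and the exit condition is read as "no acquisition thread is still alive".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the time-shared thread start; names are not from the patent. */
final class StaggeredCollector {

    /** Interval m in seconds between two focus threads, m = 50*60/n (one-hour period). */
    static long intervalSeconds(int focusCount) {
        if (focusCount <= 0) {
            throw new IllegalArgumentException("focusCount must be positive");
        }
        return 50L * 60L / focusCount;
    }

    /**
     * Starts one thread per focus task, sleeping intervalMillis before each
     * later start, then polls every pollMillis until no thread is alive.
     */
    static void runStaggered(List<Runnable> focusTasks, long intervalMillis, long pollMillis)
            throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < focusTasks.size(); i++) {
            if (i > 0) {
                Thread.sleep(intervalMillis); // main process sleeps m before the next thread
            }
            Thread t = new Thread(focusTasks.get(i), "focus-" + i);
            t.start();
            threads.add(t);
        }
        Thread.sleep(intervalMillis); // wait m after the last thread has started
        while (threads.stream().anyMatch(Thread::isAlive)) {
            Thread.sleep(pollMillis); // the text uses a 20-second check interval
        }
    }
}
```

With a one-hour period and 10 focuses, `intervalSeconds(10)` gives 300, so a caller would pass `300_000` and `20_000` milliseconds to `runStaggered`.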

2. Constructing the search engine link

Each focus combination configured in the system is stored in two fields: included words (multiple words separated by spaces) and excluded words (multiple words separated by spaces). The focus combination must be split, encoded and recombined before the search engine link is constructed.

First, the words are extracted from the focus combination and URL-encoded, and the AND/NOT combination syntax specific to the search engine is added to form the recombined focus terms. For example, "存贷款+利率-房贷" (deposits and loans + interest rate - housing loan) becomes "%E5%AD%98%E8%B4%B7%E6%AC%BE+%E5%88%A9%E7%8E%87+-%E6%88%BF%E8%B4%B7" after conversion. This is then merged with the search engine link, the number of items per page, the encoding format and other information to obtain the URL, for example http://www.google.com.hk/search?q=%E5%AD%98%E8%B4%B7%E6%AC%BE+%E5%88%A9%E7%8E%87+-%E6%88%BF%E8%B4%B7&um=1&ie=UTF-8&tbs=nws:1&source=og&sa=N&tab=wn&hl=zh-CN&num=100, which searches Google News for "deposit and loan interest rate" without "housing loan" and reads 100 items at a time.
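The splitting, encoding and recombination step can be sketched in Java as follows. The class and method names are hypothetical, and the sketch assumes Google-style syntax in which included words are joined with "+" and excluded words are prefixed with "-"; other built-in search engines may need a different combination syntax. Only the `q`, `ie` and `num` parameters from the example URL are reproduced.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of search-link construction; names are not from the patent. */
final class SearchLinkBuilder {

    /** Splits a space-separated word field, ignoring empty input. */
    private static List<String> splitWords(String field) {
        List<String> words = new ArrayList<>();
        if (field == null || field.trim().isEmpty()) {
            return words;
        }
        for (String w : field.trim().split("\\s+")) {
            words.add(w);
        }
        return words;
    }

    private static String encode(String word) {
        try {
            return URLEncoder.encode(word, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** URL-encodes each word and joins them: included words with "+", excluded with "+-". */
    static String encodeFocus(String includeWords, String excludeWords) {
        List<String> parts = new ArrayList<>();
        for (String w : splitWords(includeWords)) {
            parts.add(encode(w));
        }
        for (String w : splitWords(excludeWords)) {
            parts.add("-" + encode(w)); // assumed NOT syntax
        }
        return String.join("+", parts);
    }

    /** Merges the engine link, encoded query, encoding format and page size into a URL. */
    static String buildUrl(String engineLink, String encodedQuery, int pageSize) {
        return engineLink + "?q=" + encodedQuery + "&ie=UTF-8&num=" + pageSize;
    }
}
```

Calling `encodeFocus` with the included words "存贷款 利率" and the excluded word "房贷" reproduces the encoded string in the example above.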

3. Simulating browser access to web pages

Pages are accessed by simulating a browser. The simulated browser's User-Agent is set to "Mozilla/4.0 (compatible; MSIE 8.0)" and automatic HTTP redirection is disabled. The request is repeated at most 5 times (accessing some websites without a limit on the number of attempts can cause an endless loop), accumulating the cookies returned each time, until the HTTP connection status is normal. The cookies accumulated at that point record the redirect operations of the simulated browser; opening the link once more with these cookies returns the correct page.
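A minimal Java sketch of this cookie-accumulating loop is shown below, using `java.net.HttpURLConnection`. The class and helper names are hypothetical, "normal" status is assumed to mean HTTP 200, and the final request is assumed to reopen the original link with the accumulated cookies; cookie attributes such as path and expiry are ignored for brevity.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of simulated browser access; names are not from the patent. */
final class BrowserSimulator {
    static final String USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0)";
    static final int MAX_ATTEMPTS = 5;

    /** Folds Set-Cookie header values into name -> value; later values overwrite earlier ones. */
    static void accumulate(Map<String, String> jar, List<String> setCookieHeaders) {
        if (setCookieHeaders == null) {
            return;
        }
        for (String header : setCookieHeaders) {
            String pair = header.split(";", 2)[0].trim(); // drop Path, Expires, ...
            int eq = pair.indexOf('=');
            if (eq > 0) {
                jar.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
    }

    /** Renders the accumulated cookies as a Cookie request header value. */
    static String cookieHeader(Map<String, String> jar) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : jar.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    private static HttpURLConnection connect(String link, Map<String, String> jar)
            throws IOException {
        HttpURLConnection c = (HttpURLConnection) URI.create(link).toURL().openConnection();
        c.setInstanceFollowRedirects(false); // redirects are handled manually
        c.setRequestProperty("User-Agent", USER_AGENT);
        if (!jar.isEmpty()) {
            c.setRequestProperty("Cookie", cookieHeader(jar));
        }
        return c;
    }

    /** Tries at most MAX_ATTEMPTS times, then reopens the link with the accumulated cookies. */
    static HttpURLConnection open(String link) throws IOException {
        Map<String, String> jar = new LinkedHashMap<>();
        String current = link;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            HttpURLConnection c = connect(current, jar);
            int status = c.getResponseCode();
            accumulate(jar, c.getHeaderFields().get("Set-Cookie"));
            String location = c.getHeaderField("Location");
            c.disconnect();
            if (status == HttpURLConnection.HTTP_OK) {
                return connect(link, jar); // assumption: reopen the original link
            }
            if (location != null) {
                current = URI.create(current).resolve(location).toString();
            }
        }
        throw new IOException("No normal response after " + MAX_ATTEMPTS + " attempts: " + link);
    }
}
```

The fixed attempt limit is what prevents the endless loop mentioned above on sites that redirect indefinitely.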

Many high-traffic websites use GZIP compression for data transfer, and reading such a response with a conventional data stream yields garbled text. The transfer format must therefore be determined before the document stream is read: calling connection.getHeaderField("Content-Encoding") shows whether the page was transferred with GZIP compression. If it was, the document stream must be read through a GZIPInputStream; otherwise everything received, including English characters, is garbled.
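The check can be sketched in Java as follows. The class and method names are hypothetical; the sketch assumes that any Content-Encoding value containing "gzip" should be unwrapped and that every other value is passed through unchanged.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

/** Hypothetical sketch of GZIP-aware body reading; names are not from the patent. */
final class BodyStreams {

    /** Wraps the raw stream in a GZIPInputStream when the encoding header says gzip. */
    static InputStream decode(String contentEncoding, InputStream raw) throws IOException {
        if (contentEncoding != null
                && contentEncoding.toLowerCase(Locale.ROOT).contains("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw; // not compressed: read the document stream as-is
    }

    /** Reads the Content-Encoding header of an open connection and picks the stream. */
    static InputStream open(URLConnection connection) throws IOException {
        return decode(connection.getHeaderField("Content-Encoding"),
                connection.getInputStream());
    }
}
```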

For most pages, the character set can be obtained from the HTTP headers by calling connection.getHeaderField("Content-Type") and taking the substring after "charset=" as the encoding. If the header carries no character set, the first 1024 characters are read and the HTML header is checked with several regular expressions to obtain the encoding. Once the encoding is known, the document stream can be read directly to obtain the page source.
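A minimal Java sketch of this two-stage detection follows. The names are hypothetical, and a single regular expression stands in for the several expressions the text mentions; it covers both the `<meta charset=...>` form and the `<meta http-equiv="Content-Type" content="...; charset=...">` form. The `head` argument is assumed to have been decoded with a single-byte charset such as ISO-8859-1, since the real encoding is not yet known at that point.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of encoding detection; names are not from the patent. */
final class CharsetSniffer {
    private static final String KEY = "charset=";
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** Takes the value after "charset=" in a Content-Type header, or null if absent. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int i = contentType.toLowerCase(Locale.ROOT).indexOf(KEY);
        if (i < 0) {
            return null;
        }
        String value = contentType.substring(i + KEY.length());
        int end = value.indexOf(';');
        if (end >= 0) {
            value = value.substring(0, end);
        }
        value = value.replace("\"", "").trim();
        return value.isEmpty() ? null : value;
    }

    /** Looks for a meta charset declaration in the first 1024 characters of the page. */
    static String fromHtmlHead(String head) {
        if (head == null) {
            return null;
        }
        String prefix = head.length() > 1024 ? head.substring(0, 1024) : head;
        Matcher m = META.matcher(prefix);
        return m.find() ? m.group(1) : null;
    }

    /** Header first, then HTML head, then the caller's fallback. */
    static String detect(String contentType, String head, String fallback) {
        String cs = fromContentType(contentType);
        if (cs == null) {
            cs = fromHtmlHead(head);
        }
        return cs != null ? cs : fallback;
    }
}
```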

4. Parsing the content of collected result pages

By parsing the source of the pages returned by the search engine, the information in each page can be obtained, such as title, link, publication time and summary.

Parsing the page source relies mainly on a combination of regular expressions configured in advance. Regular expressions are built from the distinctive HTML tags at the start and end of each information field in the page source, and are used to obtain the strings that contain the required information. For example, the regular expression that extracts titles from the source of a Google results page is "<a href=(.*)</a></h3>"; removing the "<.*>" tags from the matched string yields the title.
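The title extraction can be sketched in Java as follows. The class name is hypothetical, and the sketch deliberately departs from the patterns quoted above in one respect: it uses a reluctant `.*?` and a `<[^>]*>` tag pattern, because the greedy forms would span several results when a page contains more than one title. It also assumes that result links are the only `<a href=` anchors before each `</a></h3>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of regex-based title extraction; names are not from the patent. */
final class TitleExtractor {
    // Reluctant variant of the configured expression "<a href=(.*)</a></h3>".
    private static final Pattern TITLE =
            Pattern.compile("<a href=(.*?)</a></h3>", Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns the text of every matched title with all HTML tags removed. */
    static List<String> titles(String pageSource) {
        List<String> result = new ArrayList<>();
        Matcher m = TITLE.matcher(pageSource);
        while (m.find()) {
            // The whole match starts with the <a ...> tag, so stripping tags leaves the text.
            result.add(TAG.matcher(m.group()).replaceAll("").trim());
        }
        return result;
    }
}
```

Other fields such as link, publication time and summary would each use their own configured start and end tags in the same way.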

For news pages, the number of related news items must also be obtained. After Baidu hot news is collected, each hot news title is searched with Baidu news title search and Google News title search, and the results are used as the related news for that hot news item (associated through the docid of the main hot news item). For Google hot news, the second-level page is fetched through the "more related" link and used as the related news for the hot news item.

Because the present invention adopts the above technical scheme, it has the following advantages:

1. When the set of intranet information systems expands, the only change to the system is the addition of one search configuration;

2. Monitoring and archiving of sensitive words is simple to configure, with no need to customize a dedicated acquisition template.

Description of drawings

Fig. 1 is the acquisition flow chart of the method of the invention.

Embodiment

The present invention is described in detail below with reference to the accompanying drawing and an embodiment.

As shown in Fig. 1, the flow of the acquisition module of an intranet sensitive information discovery system based on meta-search is as follows:

First, the acquisition program is started according to the acquisition schedule (1). The program reads the configured focuses (the sensitive word strategy), constructs the search engine links and obtains the search results (2). The results are deduplicated and the main information of each web page is obtained (3). The results are stored in two parts: short-field information such as the target and the focus is stored in the form of data tables (4), and the extracted main body of each web page is formatted as txt (5). The body content is saved to the local file system (8) for the indexing program to process and build an index (6), while the tables are saved to the local database (7). The database and the file system together provide a full-text search service for the page program (9).

Claims

1. An intranet information acquisition method based on meta-search, characterized in that it comprises the following steps: starting the acquisition program in a time-shared manner; constructing search conditions from the sensitive words for the built-in search engine of each intranet information system; and automatically collecting the search results;

wherein the time-shared start of acquisition threads comprises: before acquisition, calculating the interval of m seconds between the start of two focus threads from the current acquisition period and the number of focuses n, with m = 50*60/n; the main process sleeping for m seconds before constructing the next focus thread, so that threads start and exit at intervals throughout the acquisition period and multiple threads do not execute at the same point in time; and, starting m seconds after the last thread has been started, checking the number of currently active threads every 20 seconds, continuing execution for another 20 seconds if the thread count is greater than 1, and exiting the acquisition program if the thread count is less than 1;

the construction of the search engine link comprises: storing each focus combination configured in the system in two fields, namely included words and excluded words; first extracting the words from the focus combination and URL-encoding them; adding the AND/NOT combination syntax of the search engine to form the recombined focus terms; and then constructing the search engine link;

the simulated browser access to web pages comprises: accessing pages by simulating a browser, with the simulated browser's User-Agent set to "Mozilla/4.0 (compatible; MSIE 8.0)" and automatic HTTP redirection disabled; repeating the request at most 5 times while accumulating the returned cookies until the HTTP connection status is normal, the cookies accumulated at that point recording the redirect operations of the simulated browser; and opening the link once more with these cookies to obtain the correct page;

the parsing of the content of collected result pages comprises: parsing the page source mainly by means of a combination of regular expressions configured in advance, the regular expressions being built from the distinctive HTML tags at the start and end of each information field in the page source and being used to obtain the strings that contain the required information.