CN102426600A - Intranet information acquisition method based on meta-search - Google Patents
Intranet information acquisition method based on meta-search
- Publication number
- CN102426600A
- Authority
- CN
- China
- Prior art keywords
- search
- information
- focus
- thread
- search engine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to an intranet information acquisition method based on meta-search. Targeting intranet information systems, it gathers and aggregates sensitive information through the built-in search engine of each information system. This preserves the independence of each information system and allows the acquisition system to be embedded easily into different, complex intranet environments. The method has two advantages: when the set of intranet information systems grows, the only change required is adding one search configuration; and configuring the monitoring and archiving of sensitive words is simple, with no special acquisition template to customize.
Description
Technical field
The present invention relates to an intranet information acquisition method based on meta-search.
Background technology
An effective acquisition system is a prerequisite for monitoring and archiving the massive amount of information on an intranet. Most existing acquisition systems crawl websites directly, which has three drawbacks. First, this approach is inefficient and places a high load on the acquisition system. Earlier research used distributed acquisition systems to improve collection efficiency, but these impose higher hardware requirements. Second, the system has to cope with many different kinds of websites, the format analysis of each source is complex, and the system struggles to keep up with frequent URL changes. Finally, most traditional acquisition systems treat archiving as their main purpose and lack analysis and reorganization of the collected content, so hotspots and trends are hard to find in time among the vast amount of content.
Summary of the invention
The purpose of the invention is to provide a structurally simple intranet information acquisition method based on meta-search.
The intranet information acquisition method based on meta-search of the present invention targets the websites and information systems published on the intranet and comprises the following steps: starting the collection program in a time-staggered manner; constructing search conditions from sensitive words for the built-in search engine of each intranet information system; and automatically collecting the search results.
The key collection flow is as follows:
1. Starting collection threads in a time-staggered manner
For n focuses and x configured search engines, once a collection task starts there can be at most n*22 visits to the search engines and at most n*x*100 new items parsed. Deduplication, hotspot analysis, statistics updates, and fetching of the target page bodies all cause frequent access to the network and the database. If the collection threads of every search engine were started at the same moment, they would put excessive pressure on the server hardware and the network environment. Frequent access to a search engine is also likely to land the collector on its abnormal-access blacklist. Collection tasks are therefore executed with time-staggered thread startup.
Before collection, the interval of m seconds between the starts of two focus threads is calculated from the current collection period (for example, 1 hour) and the number of focuses n, with m=50*60/n. The main process sleeps m seconds before constructing the next focus thread, so threads start and exit at intervals throughout the collection period, which ensures that multiple threads do not execute simultaneously at the same point in time. Starting m seconds after the last thread has started, the number of currently active threads is checked every 20 seconds. If the thread count is greater than 1, execution continues for another 20 seconds; if the thread count is less than 1, the collection program exits.
This mechanism ensures that the program uses resources evenly and effectively, largely avoids hardware becoming unresponsive because of overly frequent operations, and gives some assurance of the program's runtime stability.
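The staggered startup described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the class and method names are hypothetical, the interval formula assumes the 1-hour collection period used in the example, and because the text leaves the case of exactly one active thread open, the sketch simply keeps polling while any worker is still alive.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of time-staggered thread startup; all names here are hypothetical. */
class StaggeredStarter {

    /** Interval in seconds between two focus-thread starts: m = 50*60/n. */
    static long intervalSeconds(int focusCount) {
        if (focusCount <= 0) {
            throw new IllegalArgumentException("focusCount must be positive");
        }
        return 50L * 60L / focusCount;
    }

    /**
     * Starts one thread per focus task, sleeping intervalMillis before each
     * subsequent thread is constructed. After the last start it waits one more
     * interval, then polls every pollMillis until no worker is alive.
     */
    static void runStaggered(List<Runnable> focusTasks, long intervalMillis, long pollMillis) {
        List<Thread> workers = new ArrayList<>();
        try {
            for (int i = 0; i < focusTasks.size(); i++) {
                if (i > 0) {
                    Thread.sleep(intervalMillis);
                }
                Thread worker = new Thread(focusTasks.get(i), "focus-" + i);
                worker.start();
                workers.add(worker);
            }
            Thread.sleep(intervalMillis);
            // Assumption: keep waiting while at least one worker is active.
            while (activeCount(workers) > 0) {
                Thread.sleep(pollMillis);
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and stop waiting.
            Thread.currentThread().interrupt();
        }
    }

    /** Number of worker threads that have not yet finished. */
    static long activeCount(List<Thread> workers) {
        return workers.stream().filter(Thread::isAlive).count();
    }
}
```

For a 1-hour period with 10 focuses, `intervalSeconds(10)` gives 300, so a caller would pass 300000 ms as the interval and 20000 ms as the polling period.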
2. Constructing the search engine link
Each focus combination in the system configuration is stored in two fields: included words (multiple words separated by spaces) and excluded words (multiple words separated by spaces). The focus combination has to be split, encoded, and recombined before the search engine link is constructed.
First, the words are extracted from the focus combination and URL-encoded. The search engine's specific AND/NOT combination syntax is then added to recombine the focus words. For example, "存贷款 利率 -房贷" (deposits and loans, interest rate, excluding housing loans) becomes "%E5%AD%98%E8%B4%B7%E6%AC%BE+%E5%88%A9%E7%8E%87+-%E6%88%BF%E8%B4%B7" after conversion. This string is then merged with the search engine link, the collection page count, the encoding format, and similar information to obtain a URL such as http://www.google.com.hk/search?q=%E5%AD%98%E8%B4%B7%E6%AC%BE+%E5%88%A9%E7%8E%87+-%E6%88%BF%E8%B4%B7&um=1&ie=UTF-8&tbs=nws:1&source=og&sa=N&tab=wn&hl=zh-CN&num=100, which searches Google News for "deposit and loan interest rate" without "housing loan" and reads 100 items at a time.
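As a rough sketch of the split, encode, and recombine step, the following Java code turns the two configured fields into a query string and appends it to a search link. The class and method names are hypothetical, and the "-" prefix for excluded words and the `q`, `ie`, and `num` parameters follow the Google example in the text; another built-in search engine would need its own syntax.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch of search link construction; all names here are hypothetical. */
class SearchLinkBuilder {

    /** Encodes space-separated included and excluded words as "w1+w2+-x1". */
    static String encodeFocus(String includeWords, String excludeWords) {
        List<String> parts = new ArrayList<>();
        for (String word : splitWords(includeWords)) {
            parts.add(encode(word));
        }
        for (String word : splitWords(excludeWords)) {
            parts.add("-" + encode(word)); // NOT operator, Google-style
        }
        return String.join("+", parts);
    }

    /** Merges the encoded focus with the search link, encoding, and page size. */
    static String buildUrl(String searchLink, String includeWords, String excludeWords, int pageSize) {
        return searchLink + "?q=" + encodeFocus(includeWords, excludeWords)
                + "&ie=UTF-8&num=" + pageSize;
    }

    private static List<String> splitWords(String field) {
        if (field == null || field.trim().isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(field.trim().split("\\s+"));
    }

    private static String encode(String word) {
        try {
            return URLEncoder.encode(word, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Keeping the operator syntax in one method means that supporting a different built-in search engine only requires swapping that method, which matches the "one added search configuration" idea in the text.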
3. Accessing web pages by simulating a browser
Pages are accessed by simulating a browser. The simulated User-Agent is Mozilla/4.0 (compatible; MSIE 8.0), and automatic HTTP redirection is disabled. The program loops at most 5 times (accessing some websites without a limit on the number of attempts can cause an endless loop), accumulating the cookies it receives until the HTTP connection status is normal. At that point the accumulated cookies record the redirect operations of the simulated browser, and opening the link again with these cookies yields the correct page.
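The cookie accumulation loop might look roughly like the sketch below. The class and method names are hypothetical. As a simplifying assumption, the sketch re-requests the same link with the cookies gathered so far instead of following the `Location` header, and it treats HTTP 200 as the "normal" connection status.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of browser simulation with accumulated cookies; names are hypothetical. */
class CookieFetcher {
    static final String USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0)";
    static final int MAX_ROUNDS = 5;

    /** Merges the name=value part of each Set-Cookie header into the jar. */
    static void mergeCookies(Map<String, String> jar, List<String> setCookieHeaders) {
        if (setCookieHeaders == null) {
            return;
        }
        for (String header : setCookieHeaders) {
            String pair = header.split(";", 2)[0].trim(); // drop Path, HttpOnly, ...
            int eq = pair.indexOf('=');
            if (eq > 0) {
                jar.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
    }

    /** Renders the jar as the value of a Cookie request header. */
    static String cookieHeader(Map<String, String> jar) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : jar.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    /** Requests the link up to MAX_ROUNDS times, accumulating cookies until HTTP 200. */
    static HttpURLConnection open(String link) throws IOException {
        Map<String, String> jar = new LinkedHashMap<>();
        for (int round = 0; round < MAX_ROUNDS; round++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
            conn.setInstanceFollowRedirects(false); // no automatic redirection
            conn.setRequestProperty("User-Agent", USER_AGENT);
            if (!jar.isEmpty()) {
                conn.setRequestProperty("Cookie", cookieHeader(jar));
            }
            if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
                return conn;
            }
            for (Map.Entry<String, List<String>> h : conn.getHeaderFields().entrySet()) {
                if ("Set-Cookie".equalsIgnoreCase(h.getKey())) {
                    mergeCookies(jar, h.getValue());
                }
            }
            conn.disconnect();
        }
        throw new IOException("no normal response after " + MAX_ROUNDS + " attempts");
    }
}
```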
Many high-traffic websites use GZIP compression for data transfer, and a conventional stream reader then receives only garbled bytes. The transfer format therefore has to be determined before the document stream is read. Calling connection.getHeaderField("Content-Encoding") shows whether the page was transferred with GZIP compression. If it was, the document stream must be read through a GZIPInputStream; otherwise everything received, including English characters, is garbled.
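A minimal sketch of the GZIP check follows. `BodyReader` and its methods are hypothetical names; in real use the first argument would come from `connection.getHeaderField("Content-Encoding")` and the second from `connection.getInputStream()`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

/** Sketch of GZIP-aware body reading; all names here are hypothetical. */
class BodyReader {

    /** Wraps the raw stream in a GZIPInputStream when the header says gzip. */
    static InputStream decode(String contentEncoding, InputStream raw) throws IOException {
        if (contentEncoding != null
                && contentEncoding.toLowerCase(Locale.ROOT).contains("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    /** Reads the whole stream and decodes it with the given character set name. */
    static String readAll(InputStream in, String charsetName) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toString(charsetName);
    }
}
```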
For ordinary pages, the character set can first be sought in the HTTP headers by calling getHeaderField("Content-Type") on the connection and taking the substring after "charset=". If it is absent, the first 1024 characters are read, the HTML header is inspected, and the encoding is obtained with several regular expressions. Once the encoding is known, the document stream can be read directly to obtain the page source.
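The two-stage encoding lookup can be sketched as below. The names are hypothetical, and the single `<meta ... charset=...>` pattern is only a stand-in for the several regular expressions the text mentions, whose exact form is not given.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of character set detection; all names here are hypothetical. */
class CharsetSniffer {
    // Assumed pattern: covers <meta charset="..."> and the http-equiv form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the value after "charset=" in a Content-Type header, or null. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int start = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (start < 0) {
            return null;
        }
        String value = contentType.substring(start + "charset=".length());
        int semicolon = value.indexOf(';');
        if (semicolon >= 0) {
            value = value.substring(0, semicolon);
        }
        value = value.trim().replace("\"", "");
        return value.isEmpty() ? null : value;
    }

    /** Looks for a charset declaration in the first 1024 characters of the page. */
    static String fromHtmlHead(String html) {
        String head = html.length() > 1024 ? html.substring(0, 1024) : html;
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** HTTP header first, then the HTML head, then the caller's fallback. */
    static String detect(String contentType, String htmlPrefix, String fallback) {
        String charset = fromContentType(contentType);
        if (charset == null && htmlPrefix != null) {
            charset = fromHtmlHead(htmlPrefix);
        }
        return charset != null ? charset : fallback;
    }
}
```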
4. Parsing the content of the collected result pages
By parsing the source of the result pages returned by the search engine, the information in each page can be obtained, such as title, link, publication time, and summary.
Parsing of the page source relies mainly on a combination of regular expressions configured in advance. The distinctive HTML tags at the start and end of each information field in the page source are turned into a regular expression, which extracts the string containing the required information. For example, the regular expression for extracting a title from the source of a Google results page is "<a href=(.*)</a></h3>", and removing the "<.*>" markup from the matched string yields the title.
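The title extraction can be illustrated with the sketch below, which uses hypothetical names and a made-up HTML fragment. It deviates from the expressions quoted in the text in one labeled way: the sketch uses the reluctant forms `(.*?)` and `<[^>]*>`, because the greedy forms would span several results on one line and strip the title text together with the tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of regex-based title extraction; all names here are hypothetical. */
class TitleExtractor {
    // Assumption: reluctant variant of the "<a href=(.*)</a></h3>" expression.
    private static final Pattern TITLE_LINK =
            Pattern.compile("<a href=(.*?)</a></h3>", Pattern.DOTALL);
    // Assumption: per-tag variant of the "<.*>" markup expression.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns the plain-text titles found in a result page's source. */
    static List<String> extractTitles(String pageSource) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE_LINK.matcher(pageSource);
        while (m.find()) {
            // Remove every tag from the matched string, leaving the title text.
            titles.add(TAG.matcher(m.group()).replaceAll("").trim());
        }
        return titles;
    }
}
```

In a configured system, each search engine would carry its own set of such expressions for title, link, publication time, and summary.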
For news pages, the related news items must also be obtained. After Baidu hot news has been collected, each hot-news title is searched with Baidu news title search and Google News title search, and the results are stored as the related news of that hot-news item (associated through the docid of the main hot-news item). For Google hot news, the secondary page is fetched through the item's "more related" link, and its contents are used as the related news of the hot-news item.
Because it adopts the above technical scheme, the present invention has the following advantages:
1. When the set of intranet information systems expands, the only change needed in the system is one added search configuration;
2. Monitoring and archiving of sensitive words is simple to configure, and no special collection template needs to be customized.
Description of drawings
Fig. 1 is the collection flowchart of the method of the invention.
Embodiment
The invention is described in detail below with reference to the accompanying drawing and an embodiment.
Fig. 1 shows the acquisition module flow of an intranet sensitive-information discovery system based on meta-search:
First, the collection program is started on a schedule (1). The program reads the configured focuses (the sensitive-word policy), constructs the search engine links, and obtains the search results (2). The results are deduplicated and the main information of each web page is obtained (3). The results are stored in two parts: field information such as target and focus is stored as data table rows (4), and the extracted page body is formatted as txt (5). The body content is placed in the local file system (8), where the indexing program processes it and builds an index (6); the list is stored in the local database (7). The database and the file system together provide the full-text search service for the page program (9).
Claims (1)
1. An intranet information acquisition method based on meta-search, characterized in that it comprises the following steps: starting the collection program in a time-staggered manner; constructing search conditions from sensitive words for the built-in search engine of each intranet information system; and automatically collecting the search results;
Wherein starting collection threads in a time-staggered manner comprises: before collection, calculating from the current collection period and the number of focuses n the interval of m seconds between the starts of two focus threads, m=50*60/n; the main process sleeps m seconds before constructing the next focus thread, so that threads start and exit at intervals within the collection period, which ensures that multiple threads do not execute simultaneously at the same point in time; starting m seconds after the last thread has started, the number of currently active threads is checked every 20 seconds; if the thread count is greater than 1, execution continues for 20 seconds; if the thread count is less than 1, the collection program exits;
Constructing the search engine link comprises: each focus combination in the system configuration is stored in two fields, included words and excluded words; the words are first extracted from the focus combination and URL-encoded, the search engine's AND/NOT combination syntax is added to recombine the focus words, and the search engine link is then constructed;
Accessing web pages by simulating a browser comprises: pages are accessed by simulating a browser, with the simulated User-Agent set to Mozilla/4.0 (compatible; MSIE 8.0) and automatic HTTP redirection disabled; the program loops at most 5 times, accumulating the cookies it receives until the HTTP connection status is normal, at which point the accumulated cookies record the redirect operations of the simulated browser; opening the link again with these cookies yields the correct page;
Parsing the content of the collected result pages comprises: parsing of the page source relies mainly on a combination of regular expressions configured in advance; the distinctive HTML tags at the start and end of each information field in the page source are turned into a regular expression, which extracts the string containing the required information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011103508110A CN102426600A (en) | 2011-11-08 | 2011-11-08 | Intranet information acquisition method based on meta-search |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011103508110A CN102426600A (en) | 2011-11-08 | 2011-11-08 | Intranet information acquisition method based on meta-search |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102426600A true CN102426600A (en) | 2012-04-25 |
Family
ID=45960580
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011103508110A Pending CN102426600A (en) | 2011-11-08 | 2011-11-08 | Intranet information acquisition method based on meta-search |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102426600A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080098300A1 (en) * | 2006-10-24 | 2008-04-24 | Brilliant Shopper, Inc. | Method and system for extracting information from web pages |
CN101727485A (en) * | 2009-12-10 | 2010-06-09 | 湖南科技大学 | WSDL collection method based on focused search |
2011
- 2011-11-08 CN CN2011103508110A patent/CN102426600A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103902667A (en) * | 2014-03-14 | 2014-07-02 | 浪潮电子信息产业股份有限公司 | Simple network information collector achieving method based on meta-search |
CN112528205A (en) * | 2020-12-22 | 2021-03-19 | 中科院计算技术研究所大数据研究院 | Webpage main body information extraction method and device and storage medium |
CN112528205B (en) * | 2020-12-22 | 2021-10-29 | 中科院计算技术研究所大数据研究院 | Webpage main body information extraction method and device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101329687B (en) | Method for positioning news web page | |
CN100405371C (en) | Method and system for abstracting new word | |
US9165085B2 (en) | System and method for publishing aggregated content on mobile devices | |
CN102930059B (en) | Method for designing focused crawler | |
AU2009276354B2 (en) | Providing posts to discussion threads in response to a search query | |
CN103365924A (en) | Method, device and terminal for searching information | |
WO2008098502A1 (en) | Method and device for creating index as well as method and system for retrieving | |
JP5084858B2 (en) | Summary creation device, summary creation method and program | |
KR102222287B1 (en) | Web Crawler System for Collecting a Structured and Unstructured Data in Hidden URL | |
WO2013146736A1 (en) | Synonym relation determination device, synonym relation determination method, and program thereof | |
JP2006309515A (en) | Information delivery method and information delivery server | |
US20110219017A1 (en) | System and methods for citation database construction and for allowing quick understanding of scientific papers | |
CN102375813A (en) | Duplicate detection system and method for search engines | |
JP4875911B2 (en) | Content identification method and apparatus | |
CN105302876A (en) | Regular expression based URL filtering method | |
CN102117275B (en) | Method and device for collecting webpage data of direction site based on internet | |
CN102426600A (en) | Intranet information acquisition method based on meta-search | |
JP5466133B2 (en) | Document search apparatus with image and document search program with image | |
JP2010026724A (en) | Web page providing apparatus, method for interlocking web page with ranking and program thereof | |
CN101344892A (en) | Information processing apparatus, information processing method and computer readable information recording medium | |
CN105207852A (en) | Method for directionally acquiring network data based on distributed mode | |
CN104063506A (en) | Method and device for identifying repeated web pages | |
KR101600616B1 (en) | Method for analyzing service of heterogeneous contents | |
KR101362090B1 (en) | Method for providing retrieval service using integrated data base and server thereof | |
CN103544294B (en) | Keyword popularity automatic control method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20120425 |