CN101261635A - Passive type network information automatic highly effective collection system and method - Google Patents

Publication number
CN101261635A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA200810066892XA
Other languages
Chinese (zh)
Other versions
CN101261635B (en
Inventor
陈清财
王晓龙
郭鸿志
马天明
翁家才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN200810066892XA priority Critical patent/CN101261635B/en
Publication of CN101261635A publication Critical patent/CN101261635A/en
Application granted granted Critical
Publication of CN101261635B publication Critical patent/CN101261635B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a passive, automatic, high-efficiency network information collection system and method. The system comprises an information acquisition component running at the information-requesting end and an information collection-and-sending component running at the information-providing end; the two components are communicatively connected. The method comprises the following steps: an effective communication link is established between the information acquisition component and the information collection-and-sending component; the information acquisition component then obtains the providing-end information stored by the information collection-and-sending component, either in response to a notification from that component or at regular intervals. When the technique is applied to a search engine, website content on the Internet such as text and audio/video feature information can be obtained promptly, quickly, and efficiently, and both system overhead and network bandwidth usage can be significantly reduced.

Description

A passive, automatic, high-efficiency network information collection system and method
Technical field
The present invention relates to automatic network information acquisition technology, and in particular to a passive, automatic, high-efficiency network information collection system and method. The technique is mainly applied in the field of search engines.
Background art
Information acquisition is the prerequisite for any subsequent computer processing of information (for example, information retrieval and search engines). At present, web information is obtained mainly by web crawlers. As society becomes informatized ever faster, computer applications no longer rely on programmed logic alone; they increasingly need to process, summarize, and generalize large amounts of information and to mine useful information from it.
The web crawler, currently the main means of obtaining web information, is gradually showing weaknesses in the face of new requirements. The basic principle of a web crawler is to traverse an initial list of page links, fetch the content of each page, add any new links found in the fetched pages to the list, and repeat this traversal recursively. This leads to many problems: a large amount of repeated access, untimely updates, redundant transmitted content, heavy network bandwidth usage, and high access pressure on servers. New computer applications, web applications in particular, place new demands on the timeliness and completeness of information and on reducing system overhead, and the traditional web crawler can no longer keep up. For example, blog search needs blog content to be updated promptly, so that a new article can appear in search results within a few hours of being posted. Multimedia websites such as audio and video sites hold huge amounts of information; transmitting it by traditional methods takes up so much network bandwidth that the website is overloaded. Moreover, for reasons such as copyright restrictions, these websites often do not allow search engines to download the original audio and video content, which has limited the development of content-based audio and video search services.
To address these technical problems, many ideas and methods for improving the traditional web crawler have been proposed recently. Some improve access efficiency or raise update frequency by refining the crawling strategy, for example by using different crawl intervals for different websites or by restricting the crawler to information from certain specific fields. Others rely on the webmaster to assist the crawler: for example, when a website has a major update the webmaster submits a sitemap, and the crawler schedules its crawl according to that sitemap. Although these methods can improve some aspects of crawler performance and strengthen crawlers in particular fields, they are still based on the traditional crawler architecture. Apart from partially improving the content update efficiency of search engine crawlers, they do not break away from the traditional crawler structure, and therefore cannot thoroughly solve problems such as timely content updates, repeated transmission of website content, and large-scale collection of data such as audio and video features.
Summary of the invention
The prior art has the following technical problems. Web information acquisition causes a large amount of repeated access, untimely updates, redundant transmitted content, heavy network bandwidth usage, and high server access pressure. Existing solutions can improve some aspects of crawler performance and strengthen crawlers in particular fields, but they do not break away from the traditional web crawler structure, and so cannot thoroughly solve timely content updates, repeated transmission of website content, or large-scale collection of feature information for audio, video, and similar data. To solve these problems, the invention provides a passive, automatic, high-efficiency network information collection system.
To solve the same problems, the invention also provides a passive, automatic, high-efficiency network information collection method.
The technical solution adopted by the invention to solve the problems of the prior art is a passive, automatic, high-efficiency network information collection system comprising: an information acquisition component running at the information-requesting end; and an information collection-and-sending component running at the information-providing end. The information acquisition component and the information collection-and-sending component are communicatively connected.
According to a preferred embodiment of the invention: the information-requesting end is a search engine server; the information-providing end is a website server; the information acquisition component is a server component installed at the information-requesting end; and the information collection-and-sending component is a client component installed at the information-providing end.
According to a preferred embodiment of the invention: the network information collection method comprises the following steps. Step one: an effective link is established between the information acquisition component and the information collection-and-sending component. Step two: in response to a notification from the information collection-and-sending component, the information acquisition component obtains the providing-end information stored by the information collection-and-sending component.
According to a preferred embodiment of the invention: step one comprises the following sub-steps: (1) the information acquisition component looks for new websites that run an information collection-and-sending component and obtains the relevant information about that component from the website; (2) using the information obtained, the information acquisition component sends a registration request to the information collection-and-sending component and supplies its own corresponding information.
According to a preferred embodiment of the invention: sub-step (2) is specifically as follows: using the information obtained, the information acquisition component sends a registration request to the information collection-and-sending component and supplies its own corresponding information; the information collection-and-sending component, based on the information received, decides manually or automatically whether to accept the registration request, saves the information of each accepted information acquisition component in a list, and sends that component a notification that registration succeeded.
According to a preferred embodiment of the invention: step two comprises the following sub-steps: (1) the information collection-and-sending component detects updates to the relevant content of the website on which it resides and stores the updated content; (2) the information collection-and-sending component sends a content download notification to all successfully registered information acquisition components; (3) a peer-to-peer (P2P) transmission network is set up between the information acquisition components and the information collection-and-sending component, and the information collection-and-sending component provides a seed file for the information acquisition components to download.
According to a preferred embodiment of the invention: sub-step (1) is specifically as follows: the information collection-and-sending component detects updates to the relevant content of the website on which it resides and, depending on the type of the updated content, collects the updated information itself or extracts the feature information corresponding to it, then packs the data and stores it in a designated file.
According to a preferred embodiment of the invention: sub-step (2) is specifically as follows: once the newly added information reaches a certain amount, or updates have been accumulating for a certain length of time, the information collection-and-sending component sends a content download notification to all successfully registered information acquisition components and specifies a download time window in the notification; each information acquisition component that receives the notification decides, according to its own situation, whether to download the information in the agreed window. Alternatively, the information acquisition component may on its own initiative periodically download the relevant information from the information collection-and-sending component.
According to a preferred embodiment of the invention: sub-step (3) is specifically as follows: when the agreed time window arrives, the information acquisition components that were notified and have decided to download the updated content, and that serve different search engines, connect to the information collection-and-sending component. The information collection-and-sending component initiates, and the information acquisition components join, a peer-to-peer (P2P) transmission network. The information collection-and-sending component provides the seed file to be downloaded and divides it into several parts according to factors such as the number of participating information acquisition components and the size of the seed file. Each information acquisition component is responsible for downloading one or more of the parts and, after downloading, shares them with the other information acquisition components that need that content.
According to a preferred embodiment of the invention: in step two, the information acquisition component obtains the providing-end information stored by the information collection-and-sending component as follows: each information acquisition component sets up a peer-to-peer (P2P) network with the information collection-and-sending component, and the information collection-and-sending component provides the seed file for the information acquisition components to download.
The beneficial effect of the invention is that, when the technique is applied to a search engine, website content on the Internet such as text and audio/video feature information can be obtained promptly, quickly, and efficiently, and system overhead and network bandwidth usage can be significantly reduced.
Description of drawings
Fig. 1 is a structural diagram of the network information collection system of the invention;
Fig. 2 is a schematic diagram of new-site discovery;
Fig. 3 is a schematic diagram of server registration by the information acquisition component;
Fig. 4 is a schematic diagram of update notification by the client component (the information collection-and-sending component);
Fig. 5 is a schematic diagram of data download based on the P2P protocol;
Fig. 6 is a flowchart of the network information collection method of the invention.
Embodiment
The passive, automatic, high-efficiency network information collection system and method of the invention are described in detail below with reference to the drawings and specific embodiments:
Fig. 1 shows the structure of the network information collection system. As shown in Fig. 1, the system comprises: an information acquisition component running at the information-requesting end; and an information collection-and-sending component running at the information-providing end. The information acquisition component and the information collection-and-sending component are communicatively connected.
In an embodiment of the invention, the information-requesting end is a search engine server; the figure shows a second search engine and a third search engine, but in practice the system is not limited to these two. The information-providing end is a website server. The information acquisition component is a server component installed on the search engine at the information-requesting end, and the information collection-and-sending component is a client component installed on the website server at the information-providing end.
The information acquisition component (the server component) of the system runs at the information-requesting end (for example, a search engine server). Its main responsibilities include, but are not limited to: (1) looking for new websites; (2) registering with the website's client component; (3) waiting for the content download notification sent by the client component; (4) setting up a temporary peer-to-peer transmission network with the corresponding client component and the other server components that need the same content, and downloading the required information.
The information collection-and-sending component (the client component) of the system runs at the information-providing end (for example, a website server providing access to text or audio/video information). Its main responsibilities include, but are not limited to: (1) receiving registration requests from different server components; (2) maintaining a table of successfully registered server components; (3) promptly monitoring, collecting, and packing updates to the website's text content, and extracting and packing the various features of the website's image, audio, or video content; (4) sending content update notifications according to the information needs of each registered server component; (5) helping the server components set up a temporary content transmission network at the agreed time, and offering the packed website content file to this network as the seed file for download.
Based on the responsibilities and division of labor above, the invention provides a network information collection method. The specific implementation below addresses the problems of the prior art: the need to set up an HTTP connection for every web page, timely content updates, repeated downloading of content from websites, acquisition of audio and video features, and large-scale download of data such as audio and video feature information. The specific steps, and how each solves the corresponding problem, are described as follows:
In the following description, the information-requesting end is a search engine server; the information-providing end is a website server; the information acquisition component is the server component installed at the information-requesting end; and the information collection-and-sending component is the client component installed at the information-providing end.
(1) Website discovery: The server component first finds new websites that run a client component, downloads from each the website's client information table stored in an agreed directory of the site, and determines from this table information such as the client component's connection port. New sites can be discovered in two ways. In the first, the server analyzes the content of website pages it has already obtained to derive a list of new sites, and then visits the sites in this list one by one. The second discovers new sites quickly and effectively through a third-party site list service. As shown in the new-site discovery diagram of Fig. 2, each website registers itself with the third-party site list server after installing the client component, and the server component can then obtain the list of websites simply by querying this third-party server.
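The third-party variant of the discovery step can be sketched as follows. This is only an illustration under stated assumptions: the file name `client-info.xml`, the use of plain HTTP, and both function names are hypothetical, since the patent says only that the table sits in an agreed directory of the site.

```python
from urllib.parse import urlunsplit

# Hypothetical location of the client information table (not given in the patent).
CLIENT_INFO_PATH = "/client-info.xml"


def find_new_sites(registry_sites, known_sites):
    """Return sites listed by the third-party registry that are not yet known.

    Order follows the registry list; duplicates and case differences are ignored.
    """
    known = {site.strip().lower() for site in known_sites}
    seen = set()
    new_sites = []
    for site in registry_sites:
        key = site.strip().lower()
        if key and key not in known and key not in seen:
            seen.add(key)
            new_sites.append(key)
    return new_sites


def client_info_url(domain):
    """Build the URL from which the server component would fetch the table."""
    return urlunsplit(("http", domain, CLIENT_INFO_PATH, "", ""))
```

The server component would then fetch each URL returned by `client_info_url` and read the listening port from the table before attempting registration.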
(2) Server registration: See Fig. 3, the schematic diagram of server registration. Using the client component information it has obtained, the server component sends a registration request to the client component and supplies its own server component information. Based on the information received, the client component decides, manually or automatically, whether to accept the request. If it accepts, it stores the server component's information in its own server component list and sends the server component a registration-success notification; otherwise it directly sends the server component a registration-refused notification.
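A minimal sketch of the client component's side of this exchange is given below. The automatic acceptance rule (accept only servers whose subject matches the site's subject), the class name, and the notification strings are assumptions made for illustration; the patent leaves the decision policy open and also allows a manual decision.

```python
from dataclasses import dataclass, field


@dataclass
class ClientComponent:
    """Keeps the list of server components that registered successfully."""

    site_subject: str  # hypothetical policy input: the subject this site covers
    registered: dict = field(default_factory=dict)

    def handle_register(self, server_name, server_ip, subject):
        """Accept or refuse a registration request and return the notification."""
        if subject != self.site_subject:
            # Refused: nothing is stored, the server is told directly.
            return "registration-refused"
        # Accepted: remember the server so it can be notified of updates later.
        self.registered[server_name] = {"ip": server_ip, "subject": subject}
        return "registration-ok"
```

Only servers present in `registered` would later receive content download notifications.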
(3) Content update notification: See Fig. 4, the schematic diagram of update notification by the client component. After registering successfully, the server component waits for content update notifications from the client component. The client component monitors all content updates on the website where it resides and, depending on the type of the updated content, packs either the updated information itself or the various features corresponding to it into a designated file. Once the newly added information reaches a certain amount, or updates have been accumulating for a certain length of time, the client component sends a content download notification to all successfully registered server components and specifies a download time window in the notification. Each server component that receives the notification decides, according to its own situation, whether to download the information in the agreed window. Because the client component is responsible for the updated content, and each client component analyzes and monitors the content of its own website, feature extraction for copyrighted data such as audio and video can be performed at the content provider, which protects copyright well.
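The "certain amount or certain length of time" trigger can be illustrated with a small accumulator. The thresholds, the field names of the notice, and the injectable clock are all assumptions for the sketch; the patent does not fix any of them.

```python
import time


class UpdateMonitor:
    """Accumulates update sizes and decides when to notify registered servers."""

    def __init__(self, max_bytes, max_seconds, now=time.time):
        self.max_bytes = max_bytes      # size threshold (assumed unit: bytes)
        self.max_seconds = max_seconds  # age threshold (assumed unit: seconds)
        self.now = now                  # clock, injectable for testing
        self.pending_bytes = 0
        self.window_start = None        # time of the first pending update

    def record_update(self, size_bytes):
        if self.window_start is None:
            self.window_start = self.now()
        self.pending_bytes += size_bytes

    def should_notify(self):
        if self.window_start is None:
            return False  # nothing pending
        elapsed = self.now() - self.window_start
        return self.pending_bytes >= self.max_bytes or elapsed >= self.max_seconds

    def build_notice(self, download_delay, window_length):
        """Create a notice carrying the agreed download window, then reset."""
        start = self.now() + download_delay
        notice = {
            "bytes": self.pending_bytes,
            "download_from": start,
            "download_until": start + window_length,
        }
        self.pending_bytes = 0
        self.window_start = None
        return notice
```

Each server component that receives such a notice can compare the window with its own schedule and decide whether to take part in the download.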
(4) Content download: See Fig. 5, the schematic diagram of data download based on the P2P protocol. When the agreed time window arrives, the server components that were notified and have decided to download the updated content, serving different search engines, connect to the client component one after another. The client component initiates, and the server components join, a temporary peer-to-peer transmission network. The client component provides the seed file to be downloaded and divides it into several parts according to factors such as the number of participating server components and the size of the seed file. Each server component is responsible for downloading one or more of the parts. To reduce the load on the website hosting the client component, however, each part may be downloaded from the website by only one server component, or by at most N server components simultaneously, where N is specified by the webmaster; after downloading, a server component shares the part with the other server components that need it. In this way, large-scale data such as audio or video feature information can be obtained without increasing the pressure on the website. In theory, each server component only needs to upload an additional amount of data no larger than the total amount of data it obtains. Compared with what a traditional crawler spends on finding new pages, on judging whether page content has been updated, and on the extra HTTP request set up for every individual page, this cost is clearly acceptable. More importantly, this transmission method solves the problem of downloading audio and video feature files, which a traditional crawler cannot.
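How the seed file might be divided and assigned can be sketched as below. The choice of one part per participating server and the round-robin assignment are assumptions for illustration; the patent says only that the split depends on factors such as the number of servers and the file size, and that at most N servers may fetch a given part from the website.

```python
def split_and_assign(total_size, servers, max_per_part=1):
    """Split a file of total_size bytes into len(servers) contiguous parts.

    Each part lists the servers allowed to fetch it directly from the website
    (at most max_per_part of them); all other servers are expected to obtain
    that part from their peers.
    """
    if not servers:
        return []
    n_parts = len(servers)
    base, extra = divmod(total_size, n_parts)
    allowed = min(max_per_part, len(servers))
    parts = []
    offset = 0
    for i in range(n_parts):
        # The first `extra` parts are one byte longer so the sizes sum exactly.
        length = base + (1 if i < extra else 0)
        downloaders = [servers[(i + j) % len(servers)] for j in range(allowed)]
        parts.append({"offset": offset, "length": length, "from_site": downloaders})
        offset += length
    return parts
```

With `max_per_part=1`, the website serves every byte exactly once regardless of how many search engines take part, which is the load-limiting property the step describes.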
The four steps above are the main working steps of the system. In some cases, for example for a new crawler, the website may also need to provide historical data in addition to recently updated information. If the historical data file is not very large, it can be downloaded as updated data in step (4) above; if the historical data is very large, the optional step (5) below is provided for downloading it.
(5) Historical data download: The historical data of a website has two characteristics. First, because the information is older, it is generally somewhat less important to a search engine than the latest data. Second, because it has accumulated over a long time, its volume is generally much larger than that of updated data, so the number of downloads must be strictly controlled. These two characteristics mean that, when offering historical data for download, the client component mainly needs to control how often the historical data is opened for download; the notification method and the actual download process are the same as in steps (3) and (4). To determine a suitable download frequency, the time interval between historical data downloads must be estimated. One possible estimate sets the download interval to $T_w = \min\{N_c / N_f \cdot \lambda,\ T_c \cdot \beta\}$, where $\lambda, \beta \propto L / L_c$ are two coefficients that depend on the current historical data size $L$; $N_c$ and $T_c$ are reference values, given by the webmaster for a given historical data size $L_c$, for the total number $N$ of newly registered servers and for the maximum waiting time $T$; and $N_f$ is the current frequency at which new server components register with the client component. A server component usually faces a large number of website client components, so keeping historical data available for download for as long as possible greatly helps the server optimize its downloads, and also gives the server a basis for judging the state of the client.
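The interval estimate can be written out as a short function. Two points are assumptions rather than statements of the patent: the first term is read as (N_c / N_f) multiplied by λ, and the proportionality constants `k_lambda` and `k_beta` are hypothetical parameters, since the text says only that λ and β are proportional to L / L_c.

```python
def history_download_interval(L, L_c, N_c, T_c, N_f, k_lambda=1.0, k_beta=1.0):
    """Estimate T_w = min(N_c / N_f * lambda, T_c * beta).

    L    current historical data size
    L_c  reference historical data size chosen by the webmaster
    N_c  reference number of newly registered servers for size L_c
    T_c  reference maximum waiting time for size L_c
    N_f  current registration frequency (new servers per unit time)
    """
    if L_c <= 0 or N_f <= 0:
        raise ValueError("L_c and N_f must be positive")
    ratio = L / L_c
    lam = k_lambda * ratio   # lambda grows with the size of the historical data
    beta = k_beta * ratio    # beta grows with the size of the historical data
    return min(N_c / N_f * lam, T_c * beta)
```

Under this reading, a larger archive lengthens both terms, so historical data is opened for download less often, while a burst of new registrations (large N_f) shortens the first term so that new servers do not wait long.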
The method above is summarized in the flowchart of Fig. 6.
As the basis for successful cooperation between server components and client components, communication between a server component and a client component, and among server components, must use a consistent communication protocol.
Which protocol implementation is adopted does not affect the function or main efficiency of the invention. However, settling on an extensible, XML-based standard protocol is useful for sharing information across the whole Internet as widely as possible and for achieving optimal acquisition efficiency. Although not mandatory, the following key data structures are essential for client components and server components to communicate with each other, and any protocol therefore needs to define them concretely. An exemplary XML-based definition is given here:
(1) Site information description file
For the site information description file, one embodiment follows the approach of traditional crawlers and places an XML file similar to "robot.xml" in the root directory of each website. A template for the site information description file, defined in XML Schema, is given below:
<xsd:element name="client"> <!-- information about the client component resident on the website -->
  <xsd:complexType>
    <xsd:all>
      <xsd:element name="port" type="xsd:unsignedShort"/> <!-- listening port of the client component -->
      <xsd:element name="domain" type="xsd:anyURI"/> <!-- domain name of the website -->
      <xsd:element ref="subject" minOccurs="0"/> <!-- subject of the website content (optional) -->
      <xsd:element ref="changefreq" minOccurs="0"/> <!-- typical update frequency (optional) -->
      <xsd:element ref="timezone"/> <!-- time zone of the server hosting the client component -->
    </xsd:all>
  </xsd:complexType>
</xsd:element>
<xsd:element name="subject">
  <xsd:simpleType>
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="finance"/>
      <xsd:enumeration value="education"/>
      <xsd:enumeration value="infotech"/>
      <!-- …… the values above are only examples of possible categories and can be extended as needed -->
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>
<xsd:element name="changefreq">
  <xsd:simpleType>
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="updated continuously"/>
      <xsd:enumeration value="per hour"/>
      <xsd:enumeration value="every day"/>
      <xsd:enumeration value="weekly"/>
      <xsd:enumeration value="every month"/>
      <xsd:enumeration value="every year"/>
      <xsd:enumeration value="never updated"/>
      <!-- can be extended as needed -->
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>
<xsd:element name="timezone">
  <xsd:simpleType>
    <xsd:restriction base="xsd:unsignedByte">
      <!-- numeric types are bounded with minInclusive/maxInclusive, not minLength/maxLength -->
      <xsd:minInclusive value="0"/>
      <xsd:maxInclusive value="23"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>
An example site information description follows:
<?xml version="1.0" encoding="UTF-8"?>
<client xmlns="www.hitsz.edu.cn">
  <port>8088</port>
  <domain>www.hitsz.edu.cn</domain>
  <subject>education</subject>
  <changefreq>every day</changefreq>
</client>
In above-mentioned template, most important parts is<client〉<port element, this element tell server component and if the client set up communicate the listening port that must know.Though not necessarily, provide<subject〉unit tell that usually the server component theme that this website relates generally to is helpful for the search engine at server place, especially those vertical search engines of being absorbed in specific area information.Other optional information comprises renewal frequency, and website domain name etc. also better provide information retrieval service to have very great help to server component and corresponding search engine.Another element<timezone〉be to consider that server component may come from different areas with client component, in order to guarantee temporal consistance, inform affiliated time zone so need to determine each other difference.
A larger website may need several client components, each serving a different part of its content. In that case the site information table can contain multiple <client> elements, each describing one client component.
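As a rough illustration of how a server component might read such a site information file, the following Python sketch parses one or more <client> entries with the standard library. The function name parse_site_info and the layout of the returned dictionaries are assumptions made for this example and are not part of the patent; the sample document follows the example above and uses the <subject> element name from the schema.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example document above.
NS = {"c": "www.hitsz.edu.cn"}

SITE_INFO = """<?xml version="1.0" encoding="UTF-8"?>
<client xmlns="www.hitsz.edu.cn">
  <port>8088</port>
  <domain>www.hitsz.edu.cn</domain>
  <subject>education</subject>
  <changefreq>every day</changefreq>
</client>"""


def parse_site_info(xml_text):
    """Hypothetical helper: return one dict per <client> element."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    client_tag = "{%s}client" % NS["c"]
    # The root is either a single <client> or a container of several.
    clients = [root] if root.tag == client_tag else root.findall("c:client", NS)
    result = []
    for client in clients:
        port = client.findtext("c:port", namespaces=NS)
        if port is None:
            # <port> is the one element the server cannot do without.
            raise ValueError("<port> is required")
        tz = client.findtext("c:timezone", namespaces=NS)
        result.append({
            "port": int(port),
            "domain": client.findtext("c:domain", namespaces=NS),
            "subject": client.findtext("c:subject", namespaces=NS),
            "changefreq": client.findtext("c:changefreq", namespaces=NS),
            "timezone": int(tz) if tz is not None else None,
        })
    return result


if __name__ == "__main__":
    print(parse_site_info(SITE_INFO))
```

Optional elements that are absent are reported as None, so the caller can tell "not provided" apart from an empty value.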
(2) Server component registration information
When registering with a client component, the server component must carry out the necessary interaction with it. This mainly consists of providing the client component with basic identification information for the server component and its search engine, the server component's listening port, the types of information it requires, and so on. For reference, a template of the main registration information sent when a server component registers is given below:
<xsd:element name="serverRegister">
<xsd:complexType>
<xsd:all>
<xsd:element name="serverIP" type="xsd:string"/>
<xsd:element name="serverName" type="xsd:string"/>
<xsd:element ref="subject"/>
<xsd:element ref="contentType" minOccurs="0"/>
<xsd:element ref="timezone" minOccurs="0"/>
</xsd:all>
</xsd:complexType>
</xsd:element>
<xsd:element name="contentType">
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:enumeration value="audio"/>
<xsd:enumeration value="video"/>
<xsd:enumeration value="image"/>
<xsd:enumeration value="text"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
In addition to the elements already defined in the site information data structure above, the newly added <contentType> element indicates the type of information the server component requires. The main types are audio ("audio"), video ("video"), image ("image"), and text ("text"), and further types can of course be added. The <contentType> element can also serve as an optional element of the site information file to describe a website that provides only one particular type of information. A website that provides mixed types of information can either describe them with several <contentType> elements or omit the element.
In response to the above registration information, the client component, after deciding whether to accept or refuse the server component's registration, normally sends back a response message. This response can be a simple generic reply or a more complex XML message.
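The registration exchange can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the subject-based acceptance policy, and the response strings "REGISTERED" and "REFUSED" are all hypothetical, since the text leaves the decision rule and the response format open.

```python
import xml.etree.ElementTree as ET

# Values of the contentType enumeration in the schema above.
ALLOWED_CONTENT_TYPES = {"audio", "video", "image", "text"}


def build_register_message(server_ip, server_name, subject,
                           content_type=None, timezone=None):
    """Server side (hypothetical helper): build a serverRegister message."""
    root = ET.Element("serverRegister")
    ET.SubElement(root, "serverIP").text = server_ip
    ET.SubElement(root, "serverName").text = server_name
    ET.SubElement(root, "subject").text = subject
    if content_type is not None:
        if content_type not in ALLOWED_CONTENT_TYPES:
            raise ValueError("unknown contentType: %r" % content_type)
        ET.SubElement(root, "contentType").text = content_type
    if timezone is not None:
        if not 0 <= timezone <= 23:
            raise ValueError("timezone must be in the range 0..23")
        ET.SubElement(root, "timezone").text = str(timezone)
    return ET.tostring(root, encoding="unicode")


def handle_registration(message, registered, accepted_subjects):
    """Client side (hypothetical policy): accept only servers whose
    subject the site serves, and record them in `registered`."""
    root = ET.fromstring(message)
    subject = root.findtext("subject")
    if subject not in accepted_subjects:
        return "REFUSED"
    registered[root.findtext("serverName")] = {
        "ip": root.findtext("serverIP"),
        "subject": subject,
        "contentType": root.findtext("contentType"),
    }
    return "REGISTERED"


if __name__ == "__main__":
    registry = {}
    msg = build_register_message("192.0.2.10", "example-engine",
                                 "education", content_type="text", timezone=8)
    print(handle_registration(msg, registry, {"education"}), registry)
```

Keeping the accepted servers in a dictionary mirrors the idea of saving registered server components in a list so that later update notifications can be sent to each of them.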
(3) Website content update notification
This message is a content update and download notification sent by a client component to every server component that has successfully registered with it. The notification tells the server component the period during which the content was updated, the number and types of web pages or audio and video feature files included, and the size of the update content to be downloaded; where possible, it can also indicate the subjects the content relates to. Besides the information about the update content itself, the notification should also state the period during which the update content is open to server components for download, the open download port, the open protocol type, the list of files to download, and other related information. The following XML Schema gives a template for an update notification message.
<xsd:element name="update">
<xsd:complexType>
<xsd:all>
<xsd:element name="clientID" type="xsd:ID"/>
<xsd:element name="downloadPort" type="xsd:unsignedInt"/>
<xsd:element name="updatedFile" type="updatedFileType" minOccurs="1"/>
</xsd:all>
</xsd:complexType>
</xsd:element>
<xsd:complexType name="durationType">
<xsd:all>
<xsd:element name="startTime" type="xsd:dateTime"/>
<xsd:element name="dueTime" type="xsd:dateTime"/>
</xsd:all>
</xsd:complexType>
<xsd:complexType name="updatedFileType">
<xsd:all>
<xsd:element name="fileFullPath" type="xsd:string" minOccurs="1"/>
<xsd:element name="dataSize" type="xsd:unsignedLong"/>
<xsd:element ref="subject" minOccurs="0"/>
<xsd:element ref="contentType" minOccurs="0"/>
<xsd:element name="isHistory" type="xsd:boolean" minOccurs="1"/>
<xsd:element name="updateDuration" type="durationType" minOccurs="1"/>
<xsd:element name="downloadDuration" type="durationType" minOccurs="1"/>
</xsd:all>
</xsd:complexType>
In the above data structure, each content update message contains one or more <updatedFile> elements. Each element describes one update file packaged by the client; such a file usually contains one or more types of network information updated over a period of time, such as text web pages, audio feature files, or video feature files. The server component can use the information provided in the <updatedFile> elements to decide whether it needs to download the corresponding update.
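How a server component might use the <updatedFile> entries to decide what to download can be sketched in Python as follows. The function name select_downloads, the filtering criteria (wanted content types and a size limit), and the sample file paths are assumptions for illustration only; the sample notification also leaves out the updateDuration and downloadDuration elements for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical notification with two packaged update files.
UPDATE = """<update>
  <clientID>site-1</clientID>
  <downloadPort>8088</downloadPort>
  <updatedFile>
    <fileFullPath>/updates/text-001.pack</fileFullPath>
    <dataSize>2048</dataSize>
    <subject>education</subject>
    <contentType>text</contentType>
    <isHistory>false</isHistory>
  </updatedFile>
  <updatedFile>
    <fileFullPath>/updates/video-001.pack</fileFullPath>
    <dataSize>5000000</dataSize>
    <contentType>video</contentType>
    <isHistory>false</isHistory>
  </updatedFile>
</update>"""


def select_downloads(notification, wanted_types, max_bytes):
    """Return the paths of the update files this server wants to fetch."""
    root = ET.fromstring(notification)
    selected = []
    for entry in root.findall("updatedFile"):
        content_type = entry.findtext("contentType")
        size = int(entry.findtext("dataSize"))
        # contentType is optional: a package without it is kept,
        # because it may hold mixed content.
        if content_type is not None and content_type not in wanted_types:
            continue
        # Skip packages larger than this server is willing to download.
        if size > max_bytes:
            continue
        selected.append(entry.findtext("fileFullPath"))
    return selected


if __name__ == "__main__":
    print(select_downloads(UPDATE, {"text"}, 10_000_000))
```

A text-only search engine would call select_downloads with wanted_types={"text"} and so skip the video feature package without transferring it.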
The beneficial effect of the present invention is as follows: when this technology is applied in a search engine, website content on the Internet, such as text information and audio and video feature information, can be obtained promptly, quickly, and efficiently, while system overhead and network bandwidth usage are significantly reduced.
The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the invention shall not be regarded as limited to these descriptions. A person of ordinary skill in the technical field of the invention may make a number of simple deductions or substitutions without departing from the inventive concept, and all of these shall be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A passive system for the automatic and efficient collection of network information, characterized in that the system comprises:
an information acquisition working part running on an information demand end;
an information collection and sending working part running on an information provider end;
wherein the information acquisition working part and the information collection and sending working part are communicatively connected.
2. The passive system for the automatic and efficient collection of network information according to claim 1, characterized in that:
the information demand end is a search engine server end;
the information provider end is a website server end;
the information acquisition working part is a server component arranged on the information demand end;
the information collection and sending working part is a client component arranged on the information provider end.
3. A passive method for the automatic and efficient collection of network information, characterized in that the method comprises the steps of:
A: the information acquisition working part establishes an effective connection with the information collection and sending working part;
B: the information acquisition working part, according to a notification from the information collection and sending working part, obtains the information provider end information stored by the information collection and sending working part.
4. The passive method for the automatic and efficient collection of network information according to claim 3, characterized in that step A comprises the sub-steps of:
A1: the information acquisition working part queries for new websites on which an information collection and sending working part is running, and obtains the website-related information of that information collection and sending working part;
A2: the information acquisition working part, according to the obtained information about the information collection and sending working part, sends a registration request to the information collection and sending working part and provides it with the corresponding information about the information acquisition working part.
5. The passive method for the automatic and efficient collection of network information according to claim 4, characterized in that step A2 is specifically: the information acquisition working part, according to the obtained information about the information collection and sending working part, sends a registration request to the information collection and sending working part and provides it with the corresponding information about the information acquisition working part; the information collection and sending working part, according to the received information about the information acquisition working part, decides manually or automatically whether to accept the registration request, saves the relevant information of each information acquisition working part whose registration is approved in a list, and sends a registration success notification to that information acquisition working part.
6. The passive method for the automatic and efficient collection of network information according to claim 3, characterized in that step B comprises the sub-steps of:
B1: the information collection and sending working part detects updates to the relevant content of the website on which it resides and stores the updated content;
B2: the information collection and sending working part sends a content download notification to all successfully registered information acquisition working parts;
B3: a point-to-point (P2P) transmission network is established between the information acquisition working parts and the information collection and sending working part, and the information collection and sending working part provides a seed file for the information acquisition working parts to download.
7. The passive method for the automatic and efficient collection of network information according to claim 6, characterized in that step B1 is specifically: the information collection and sending working part detects updates to the relevant content of the website on which it resides and, according to the type of the updated content, performs information collection or feature collection on the updated information or on the feature information corresponding to it, packages the data, and stores it in a specific file.
8. The passive method for the automatic and efficient collection of network information according to claim 6, characterized in that step B2 is specifically: when the newly added information has accumulated to a certain amount, or the update time has accumulated to a certain duration, the information collection and sending working part sends a content download notification to all successfully registered information acquisition working parts and specifies a content download time period in the notification; each information acquisition working part that receives the notification decides, according to its own situation, whether to download the information within the agreed time; alternatively, the information acquisition working part periodically and actively downloads the relevant information from the information collection and sending working part.
9. The passive method for the automatic and efficient collection of network information according to claim 6, characterized in that step B3 is specifically: after the agreed time period arrives, the information acquisition working parts that have been notified and have decided to download the updated content, and that serve different search engines, communicate with the information collection and sending working part; a point-to-point (P2P) transmission network is initiated by the information collection and sending working part and joined by those information acquisition working parts; the information collection and sending working part provides the seed file for download and divides it into several parts according to factors such as the number of participating information acquisition working parts and the size of the seed file; each information acquisition working part is responsible for downloading one or more of the parts and, after downloading, shares them with the other information acquisition working parts that need that content.
10. The passive method for the automatic and efficient collection of network information according to claim 3, characterized in that, in step B, the information acquisition working part obtains the information provider end information stored by the information collection and sending working part in the following manner: a P2P network is established between each information acquisition working part and the information collection and sending working part, and the information collection and sending working part provides a seed file for the information acquisition working parts to download.
CN200810066892XA 2008-04-29 2008-04-29 Passive type network information automatic highly effective collection system and method Expired - Fee Related CN101261635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810066892XA CN101261635B (en) 2008-04-29 2008-04-29 Passive type network information automatic highly effective collection system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810066892XA CN101261635B (en) 2008-04-29 2008-04-29 Passive type network information automatic highly effective collection system and method

Publications (2)

Publication Number Publication Date
CN101261635A true CN101261635A (en) 2008-09-10
CN101261635B CN101261635B (en) 2010-09-01

Family

ID=39962095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810066892XA Expired - Fee Related CN101261635B (en) 2008-04-29 2008-04-29 Passive type network information automatic highly effective collection system and method

Country Status (1)

Country Link
CN (1) CN101261635B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033950A (en) * 2010-12-23 2011-04-27 哈尔滨工业大学 Construction method and identification method of automatic electronic product named entity identification system
CN102334325A (en) * 2009-02-26 2012-01-25 高通股份有限公司 Methods and apparatus for enhanced overlay state maintenance
CN102347930A (en) * 2010-07-26 2012-02-08 中国电信股份有限公司 Method and system for obtaining webpage content
CN105138279A (en) * 2015-08-03 2015-12-09 深圳市美贝壳科技有限公司 Intelligent management method of household equipment data
CN107193828A (en) * 2016-03-14 2017-09-22 百度在线网络技术(北京)有限公司 Novel webpage capture method and apparatus

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102334325A (en) * 2009-02-26 2012-01-25 高通股份有限公司 Methods and apparatus for enhanced overlay state maintenance
CN102334325B (en) * 2009-02-26 2015-04-15 高通股份有限公司 Methods and apparatus for enhanced overlay state maintenance
US9240927B2 (en) 2009-02-26 2016-01-19 Qualcomm Incorporated Methods and apparatus for enhanced overlay state maintenance
CN102347930A (en) * 2010-07-26 2012-02-08 中国电信股份有限公司 Method and system for obtaining webpage content
CN102033950A (en) * 2010-12-23 2011-04-27 哈尔滨工业大学 Construction method and identification method of automatic electronic product named entity identification system
CN105138279A (en) * 2015-08-03 2015-12-09 深圳市美贝壳科技有限公司 Intelligent management method of household equipment data
CN107193828A (en) * 2016-03-14 2017-09-22 百度在线网络技术(北京)有限公司 Novel webpage capture method and apparatus

Also Published As

Publication number Publication date
CN101261635B (en) 2010-09-01

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100901

Termination date: 20180429
