CN103440139A - Acquisition method and tool for microblog IDs (identities) of mainstream microblog websites - Google Patents


Info

Publication number
CN103440139A
Authority
CN
China
Prior art keywords
microblogging
module
api
file
authenticated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013104123487A
Other languages
Chinese (zh)
Inventor
闫丹凤
杨翔
张丽莹
蓝田
黄俊霖
唐皓瑾
邹文涛
徐佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN2013104123487A
Publication of CN103440139A
Legal status: Pending

Abstract

The invention discloses an acquisition method and tool for microblog IDs of mainstream microblog websites. The system architecture is divided into two layers, an acquisition layer and a storage layer, with clear interfaces between the layers and the rest of the system; each layer consists of several loosely coupled modules, which makes it easy to extend the functions of each layer. The acquisition layer crawls the microblog IDs of verified users and continuously collects the fan (follower) IDs of those verified users; the storage layer stores the microblog IDs in a local database and exposes an open microblog ID indexing and retrieval function. Users of the system can be developers of third-party applications based on microblog data, who can use the microblog IDs supplied by the system to build applications that further fetch and analyze microblog content, or administrators of microblog websites, who can compute statistics over the supplied microblog IDs to analyze indicators such as microblog activity and influence.

Description

Acquisition method and tool for microblog IDs of mainstream microblog websites
Technical field
The present invention relates to ID acquisition techniques for social networking sites, and in particular to a method and tool for acquiring microblog IDs based on the open APIs of mainstream microblog websites.
Background art
As one of the social-networking applications, microblogging provides a platform for information sharing, dissemination and retrieval built on user relationships and carries a large amount of social-network information. Through the web, WAP and various clients, users form personal communities, post text updates of about 140 characters, and share them instantly. According to the 31st Statistical Report on Internet Development in China released by the China Internet Network Information Center (CNNIC), by the end of December 2012 China had 309 million microblog users, an increase of 58.73 million over the end of 2011. The stage of explosive growth has ended, but annual growth still reached 23.5%. This huge user base has brought an enormous volume of information to microblogs, and applications that reuse microblog information have appeared accordingly. Sina, Tencent and other mainstream microblog websites have all opened APIs for microblog operations in order to support and encourage the development of third-party applications based on microblog data. The API-based third-party development model treats microblog information as basic data and builds higher-level applications such as classification, analysis and retrieval of microblogs; it increases the value of microblog data and is an effective model for promoting the reuse of social-network information.
Some existing applications already reuse microblog information to a certain extent, but collecting the information is still a difficult task. By the end of April 2013, Sina Weibo had about 503 million registered users and 46.2 million daily active users; its user ID is an int64 (8 bytes), so the total ID data volume is about 4 GB. Tencent Weibo had about 540 million registered users and 100 million daily active users; its user ID is a 32-character string (32 bytes), so the total ID data volume is about 16 GB. Sohu Weibo had about 100 million registered users. Taking Sina Weibo as an example, and assuming that each user posts 50 microblogs on average, that each microblog contains about 20 Chinese characters, and that each microblog has about 4 comments, the approximate data volume of the microblog information is 4G*50*20*2*(4+1)=400T. Faced with such massive microblog information, data can currently only be collected from the URL links of known microblogs, which covers only a small range. There is therefore an urgent need for a collection tool that can gather the unique identifier of a microblog account, the microblog ID, more completely and comprehensively and that supports large-volume storage, so that concrete microblog information can then be collected through the microblog IDs, the reuse of microblog information can be realized on a large scale, and better data support can be provided for upper-layer applications.
Summary of the invention
In view of this, the purpose of the present invention is to provide a method and tool for acquiring microblog IDs from mainstream microblog websites. Based on a web crawler and the open microblog APIs, and combined with existing indexing and unstructured (NoSQL) database storage techniques, the invention proposes a complete acquisition method and tool for microblog IDs of mainstream microblog websites. It can automatically collect the IDs of existing microblog accounts, which makes it convenient for developers to reuse them and provides better data support for upper-layer applications. Its distributed data storage gives the invention good scalability.
To achieve the above goal, the invention provides an acquisition method and tool for microblog IDs of mainstream microblog websites whose architecture is divided into two layers, an acquisition layer and a storage layer. The acquisition layer collects the microblog IDs; the storage layer stores them locally and provides an open retrieval function. Specifically:
The acquisition layer crawls verified-user IDs and collects the fan IDs of verified users; it consists of a web crawling module and a microblog API module.
The storage layer is responsible for deduplicating and storing the microblog IDs obtained by the acquisition layer, provides a microblog ID query interface, and reserves fields to give later operations a degree of extensibility.
The functions of the modules in the acquisition layer are as follows:
The web crawling module is responsible for crawling the microblog IDs on the verified-user pages of Sina Weibo and Tencent Weibo; its work mainly consists of crawling and parsing web pages and storing the results locally. Page crawling and parsing are implemented as browser plug-ins, and local storage is implemented as web-server code. The module requests the verified-user home page and the first-, second- and third-level category pages, parses the category names and their URLs from the pages, requests and parses the microblog IDs in the category pages at each level, and stores the microblog IDs in verified-user ID files under the local verified-user ID collection directory. The verified-user ID directory is a local directory whose subdirectories are created automatically according to the category levels and names of the microblog website; a verified-user ID file separates the IDs with newlines and is named "<last-level category name>.txt". The module comprises a microblog ID crawling submodule and a storage submodule.
The microblog API module uses the microblog APIs provided by the Sina Weibo and Tencent Weibo open platforms to obtain the fan IDs of the verified users. It first obtains authorization tokens from the two open platforms, then constructs the parameters required by the API interfaces of the respective platforms, obtains the microblog ID data in JSON format, and writes the parsed microblog IDs into fan ID files under the local fan ID collection directory. The fan ID collection directory is a directory that stores fan ID files; a fan ID file is a text file holding the IDs obtained by this module, with one ID per line; each file holds a certain number of IDs and is named "<current timestamp>.txt".
The functions of the modules in the storage layer are as follows:
The deduplication and indexing module deduplicates the collected microblog IDs and builds indexes for them; the index names for Sina Weibo and Tencent Weibo are index_A and index_B respectively. It periodically exports microblog IDs into files in a designated directory for the data storage module to process, and a flag field in the index prevents IDs from being exported twice. The index has two columns, the microblog ID and the flag, and both are stored without tokenization. The flag has three values: "not exported", "exported" and "imported into HBase". "Not exported" means the microblog ID has not been exported from Lucene; "exported" means it has been exported from Lucene but not yet imported into HBase; "imported into HBase" means it has been imported into HBase. The module comprises an index building submodule, an ID export submodule and a flag update submodule. The index building submodule deduplicates the collected IDs, adds the deduplicated IDs to the index, and sets their flag to "not exported". The ID export submodule periodically retrieves the IDs that have not yet been exported and writes them into microblog ID export files in the local microblog ID export directory (the export directory must be created before the system runs; the export files are created automatically at each export). The export files are named "year-month-day_hour-minute.txt" and handed to the data storage module; the flag of the exported IDs is set to "exported". The flag update submodule checks whether the data storage module has processed the exported IDs and, when it has, sets their flag to "imported into HBase".
The data storage module reads the microblog ID export files from the local microblog ID export directory and stores the microblog IDs in the files in a distributed way using HBase, the storage tool of the open-source distributed system Hadoop, while providing a microblog ID query interface. The module is run periodically by the Linux crontab utility. On each run it reads the export files in the local microblog ID export directory and stores the Sina Weibo IDs and the Tencent Weibo IDs in different HBase tables. The row key of a table has the form "DDDDD + microblog ID", where DDDDD is five decimal digits; this prefix is label space reserved for later data analysis and can be used to mark the state or attributes of the microblog ID. The module periodically retrieves from Lucene the microblog IDs whose flag equals "not exported", calls the HBase API to store them in the corresponding HBase table, and changes their flag in Lucene to "exported" to prevent them from being exported again.
The functions of the submodules of the web crawling module are as follows:
The microblog ID crawling submodule requests and parses the verified-user category pages by means of a Chrome browser plug-in and sends the parsed IDs to the storage submodule of this module. Separate plug-ins are written for Sina Weibo and Tencent Weibo. For each plug-in a root directory is created containing a manifest.json configuration file, which specifies that the two Javascript files jquery.js and content_script.js are loaded once the Sina Weibo verified-user home page has finished loading; jquery.js is the jQuery library and content_script.js contains the crawling logic of this submodule. The permissions field of the configuration file allows the plug-in to request the verified-user pages and any resource of the local web server. The configuration of the Tencent Weibo plug-in is essentially the same, except that the Sina Weibo verified-user home page is replaced by the Tencent Weibo one. Because the verified-user pages of Sina Weibo and Tencent Weibo are structured differently, the crawling processes also differ.
For the Sina Weibo verified-user pages, jQuery is first used to obtain all first-level li elements inside the ul elements of the div elements with class=nav_barMain; the first a element of each li element gives the link of a first-level category through its href attribute and the category name through its text. Next, the ul element inside each first-level li element is obtained; the text and href attributes of the a elements in its li elements are the second-level category names and links. Then, for each second-level category, a request is sent to its link with the $.post function provided by jQuery, and regular expressions are used to extract the third-level category names and links from the responseText returned by the post call. For every third-level category link, the parameters letter=a, letter=b, ..., letter=z are appended to form 26 new links, a post request is sent to each of them, and regular expressions are used to extract all microblog IDs from the returned pages. Finally, the microblog IDs and the corresponding categories are assembled into Javascript arrays and sent, with the $.post function of jQuery, to the processing page of the storage submodule of this module.
For the Tencent Weibo verified-user pages, jQuery is first used to obtain all first-level li elements inside the ul element of the div element with class=peopleNav; in each li element, the text and href attribute of the first a element are the first-level category name and link. Next, jQuery is used to trigger the mouseover event of each li so that the page automatically sends AJAX requests that fetch the second-level categories under each first-level category and writes them into the current page. Then the div element with class=navLayer inside each first-level li element is obtained; the text of the strong element in its dt element is the second-level category name, and the text and href attributes of all a elements in the dd element that follows the dt element are the third-level category names and links. For each third-level category link, the parameters sort=char&char=a, sort=char&char=b, ..., sort=char&char=z are appended to form 26 new links, a post request is sent to each of them, and regular expressions are used to extract all microblog IDs from the returned pages. The remaining steps are similar to those for Sina Weibo; apart from sending the IDs and their category names to the Tencent-specific processing page of the storage submodule, the other steps are essentially identical.
The storage submodule consists of saveId.php files written in PHP, one for Sina Weibo and one for Tencent Weibo. Each receives the Get/Post requests of the corresponding microblog ID crawling submodule and obtains the category-name array and the microblog ID array. A verified-user ID collection directory is created for each category name in the array except the last and named after that category (it is not created again if it already exists); a verified-user ID file named "<last-level category name>.txt" is created, and the IDs in the ID array are written into it one per line (if the file already exists, the IDs are only appended). Before a directory or file is created, any "/" or "\" characters that would appear in its name are replaced with ",". For example, if the category-name array is ["Entertainment", "Entertainment industry", "Planning/Publicity"], the directory "Entertainment/Entertainment industry/" is created, the file "Planning,Publicity.txt" is created inside it, and the IDs are written into that file.
The microblog API module obtains authorization tokens automatically so that the fan IDs of verified users can be fetched through the APIs continuously over a long period. The authorization token of Sina Weibo has to be obtained automatically by simulating HTTPS requests, while the authorization token of Tencent Weibo only requires periodically calling the token-refresh API provided by Tencent Weibo, which keeps it valid automatically for three months. When the number of fan IDs obtained through the APIs reaches a certain threshold, the fan IDs are written into a fan ID file in the local fan ID collection directory; the file separates the IDs with newlines and is named "<current timestamp>.txt". The module comprises a Sina authorization-token submodule, a Tencent authorization-token submodule and a fan collection submodule.
The functions of the submodules of the microblog API module are as follows:
The Sina authorization-token submodule. The authorization page of the Sina Weibo open platform uses the HTTPS protocol and the link is protected by encryption, so establishing the connection with the conventional java.net.URLConnection approach fails. A different approach is therefore needed. After careful investigation, the parameters required to log in to the Sina Weibo authorization page are determined. Using the interfaces of the Apache HttpClient open-source toolkit, a PostMethod object is created and the parameters required to log in to Sina Weibo are added to it. A list of message headers is then created, each element of which is initialized as an HttpClient Header object, and the message headers of the request, including Referer, Host and User_Agent, are filled in. A Protocol object is created and given the parameters for opening the link to the server: the protocol name is set to "https", the port to 443, and the ProtocolSocketFactory parameter is obtained by constructing a MySSLSocketFactory object (provided by the microblog platform). Finally an HttpClient object is created, the message headers are submitted through its interface, the server's response is obtained, the returned message is parsed, and the access token is cut out of the string. A sketch of this procedure is given below.
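The following Java sketch illustrates the procedure just described, using the Apache Commons HttpClient 3.x classes named in the text (PostMethod, Header, HttpClient). The login URL, the form-field names and the token-extraction details are illustrative assumptions rather than values taken from the Sina Weibo open platform, and the Protocol/MySSLSocketFactory registration described above is indicated only by a comment.

```java
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;

public class SinaTokenFetcher {

    public String fetchAccessToken(String appKey, String redirectUri,
                                   String userName, String password) throws Exception {
        // The description first registers a custom HTTPS handler, roughly:
        //   Protocol.registerProtocol("https",
        //       new Protocol("https", new MySSLSocketFactory(), 443));
        // MySSLSocketFactory is the platform-provided socket factory mentioned above.

        // Build the POST request that simulates logging in to the authorization
        // page (the URL and the form-field names are assumptions).
        PostMethod post = new PostMethod("https://api.weibo.com/oauth2/authorize");
        post.addParameter("client_id", appKey);
        post.addParameter("redirect_uri", redirectUri);
        post.addParameter("userId", userName);
        post.addParameter("passwd", password);
        post.addParameter("action", "submit");

        // Fill in the message headers mentioned in the description.
        post.setRequestHeader(new Header("Referer", "https://api.weibo.com/oauth2/authorize"));
        post.setRequestHeader(new Header("Host", "api.weibo.com"));
        post.setRequestHeader(new Header("User-Agent", "Mozilla/5.0"));

        // Send the request and cut the access token out of the response.
        HttpClient client = new HttpClient();
        client.executeMethod(post);
        String body = post.getResponseBodyAsString();
        post.releaseConnection();
        return extractToken(body);
    }

    // Naive string cutting, as described: take the value after "access_token=".
    private String extractToken(String body) {
        int idx = body.indexOf("access_token=");
        if (idx < 0) {
            return null;
        }
        String rest = body.substring(idx + "access_token=".length());
        int end = rest.indexOf('&');
        return end > 0 ? rest.substring(0, end) : rest;
    }
}
```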
The Tencent authorization-token submodule obtains a new token by periodically calling the access-token refresh API provided by Tencent Weibo, so that the token does not expire within three months. This submodule requires one manual authorization by the user before the system runs; during the following three months the authorization token only has to be refreshed periodically and no further manual authorization is needed.
The fan collection submodule reads the verified-user ID files from the verified-user ID collection directory, obtains and traverses the verified-user IDs, calls the open APIs of Sina Weibo and Tencent Weibo with the authorization tokens, obtains at most 5000 randomly chosen fan IDs of each verified user, and buffers the fan IDs in a cache file. When the number of buffered fan IDs reaches a certain threshold, they are written into a fan ID file in the fan ID collection directory, named "<current timestamp>.txt". Although a single traversal of the verified-user IDs can obtain at most 5000 fan IDs per verified user, repeated traversals make the collected IDs increasingly complete. Repeated traversals, however, require more API calls, and Sina Weibo and Tencent Weibo limit the call frequency, which reduces the collection efficiency. The efficiency is improved, without violating the intent of the call-frequency limits of Sina Weibo and Tencent Weibo, by using several microblog-application test users in a polling fashion and by crawling with several machines at the same time. A microblog-application test user (hereafter "test user") is the smallest unit that can obtain an authorization token. The microblog platforms limit the API call frequency at three levels: the test-user level (the number of API calls made by one test user per hour may not exceed the limit), the application level (the total number of API calls made per hour by all test users of one microblog application may not exceed the limit) and the IP level (the total number of API calls made per hour by all microblog applications on one machine may not exceed the limit). Multi-test-user polling means that several test users obtain authorization tokens in the same program and the tokens are then used in turn when calling the APIs, which increases the number of API calls per hour and lets the call frequency reach the IP-level limit. Because the application-level and IP-level limits of Sina Weibo and Tencent Weibo are equal, there is no need to investigate raising the collection rate by running several microblog applications on one machine. Multi-machine crawling means deploying the multi-test-user polling program on several machines at the same time; since each machine can reach the IP-level limit, the overall call frequency can reach (number of machines) x (IP-level limit). When multi-machine collection is used, each machine buffers the microblog IDs it collects in its own cache file, and when the number of buffered IDs reaches a certain threshold the cached file is uploaded over FTP to the fan ID collection directory on the machine that hosts the indexing and deduplication module. Suppose, for example, that the fan IDs of 55000 Sina Weibo verified users are to be collected and that the verified-user IDs have to be traversed 5 times. Without multi-test-user polling or multi-machine collection, the test-user-level limit (150 calls/hour) means the collection needs (55000*5)/150 = 1833.33 hours = 76.4 days. If the microblog IDs must be collected faster without posing a risk to the microblog website, multi-test-user polling can be used: the IP-level limit (1000 calls/hour) allows at most 7 test users, and the task then needs only (55000*5)/1000 = 275 hours = 11.5 days. If the collection must be faster still, several machines can be used; the call frequency then reaches 1000 x (number of machines) per hour, and with 5 machines the task needs only (55000*5)/(1000*5) = 55 hours, about 2.3 days. Users can choose a suitable collection scheme according to their actual needs and conditions. A sketch of the fan collection loop with token polling is given after this paragraph.
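As a rough illustration of the fan collection submodule with multi-test-user polling, the following Java sketch rotates over several access tokens, requests the follower IDs of one verified user, and flushes the buffered fan IDs to a "<current timestamp>.txt" file once a threshold is reached. The follower-ID endpoint, its parameters, the response parsing and the flush threshold are assumptions made for illustration and would need to be adjusted to the actual open-platform interfaces.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FanCollector {
    private static final int FLUSH_THRESHOLD = 100_000;          // assumed threshold
    private static final Pattern ID_PATTERN = Pattern.compile("\\d{5,}");

    private final List<String> tokens;        // one token per test user, used in turn
    private final Path fanIdDir;              // local fan ID collection directory
    private final List<String> buffer = new ArrayList<>();
    private int next = 0;

    public FanCollector(List<String> tokens, Path fanIdDir) {
        this.tokens = tokens;
        this.fanIdDir = fanIdDir;
    }

    /** Fetch up to 5000 follower IDs of one verified user, polling the tokens. */
    public void collectFansOf(String verifiedUserId) throws Exception {
        String token = tokens.get(next);
        next = (next + 1) % tokens.size();    // round-robin over test users

        // Placeholder URL in the style of the Sina Weibo open API (an assumption).
        URL url = new URL("https://api.weibo.com/2/friendships/followers/ids.json"
                + "?access_token=" + token + "&uid=" + verifiedUserId + "&count=5000");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) json.append(line);
        }

        // Naive extraction of the numeric IDs from the JSON response.
        Matcher m = ID_PATTERN.matcher(json);
        while (m.find()) buffer.add(m.group());

        if (buffer.size() >= FLUSH_THRESHOLD) flush();
    }

    /** Write the buffered IDs, one per line, into "<current timestamp>.txt". */
    private void flush() throws Exception {
        Path file = fanIdDir.resolve(System.currentTimeMillis() + ".txt");
        Files.write(file, buffer, StandardCharsets.UTF_8);
        buffer.clear();
    }
}
```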
The functions of the submodules of the deduplication and indexing module are as follows:
The index building submodule is written in Java and run periodically by the Linux crontab utility. On each run it reads the fan ID files from the fan ID collection directory and puts the IDs in the files into a HashSet container, which deduplicates them automatically (a HashSet is a hash-based collection that guarantees its elements are unique). The IDs in the HashSet are then looked up in the Lucene index; any ID that is already present in the index is removed from the HashSet. The remaining IDs in the HashSet are added to the Lucene index, the newly added IDs are flagged "not exported", and the processed fan ID files are deleted. A sketch of this submodule is given below.
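A minimal Java sketch of the index building submodule, written against a recent Lucene API; the field names "id" and "flag", the analyzer and the index path are illustrative assumptions.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class IndexBuilder {

    public static void indexFanIdFile(Path fanIdFile, Path indexDir) throws Exception {
        // 1. HashSet-based deduplication of the IDs read from the file.
        Set<String> ids = new HashSet<>(Files.readAllLines(fanIdFile));

        try (FSDirectory dir = FSDirectory.open(indexDir)) {
            // 2. Drop IDs that are already present in the index (if it exists yet).
            if (DirectoryReader.indexExists(dir)) {
                try (DirectoryReader reader = DirectoryReader.open(dir)) {
                    IndexSearcher searcher = new IndexSearcher(reader);
                    ids.removeIf(id -> {
                        try {
                            return searcher.count(new TermQuery(new Term("id", id))) > 0;
                        } catch (Exception e) {
                            return false;
                        }
                    });
                }
            }
            // 3. Add the remaining IDs, flagged "not exported"; both fields are
            //    stored as whole strings, i.e. not tokenized.
            try (IndexWriter writer =
                         new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                for (String id : ids) {
                    Document doc = new Document();
                    doc.add(new StringField("id", id, Field.Store.YES));
                    doc.add(new StringField("flag", "not exported", Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
        }
        // 4. Delete the processed fan ID file, as described above.
        Files.delete(fanIdFile);
    }
}
```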
The ID export submodule is run periodically by the Linux crontab utility. On each run it calls the Lucene API to retrieve a certain number of microblog IDs flagged "not exported", creates a microblog ID export file in the previously created local microblog ID export directory, writes the exported IDs into the file one per line, names the file "year-month-day_hour-minute.txt" according to the current system time, hands it to the data storage module, and sets the flag of the exported IDs to "exported". A sketch is given below.
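A corresponding sketch of the ID export submodule under the same assumptions; the batch size stands in for the unspecified "certain number" of IDs.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class IdExporter {
    private static final int BATCH_SIZE = 50_000;   // assumed "certain number"

    public static void export(Path indexDir, Path exportDir) throws Exception {
        try (FSDirectory dir = FSDirectory.open(indexDir);
             DirectoryReader reader = DirectoryReader.open(dir);
             IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher
                    .search(new TermQuery(new Term("flag", "not exported")), BATCH_SIZE)
                    .scoreDocs;
            if (hits.length == 0) return;

            List<String> ids = new ArrayList<>();
            for (ScoreDoc hit : hits) {
                String id = searcher.doc(hit.doc).get("id");
                ids.add(id);
                // Replace the document so that its flag becomes "exported".
                Document updated = new Document();
                updated.add(new StringField("id", id, Field.Store.YES));
                updated.add(new StringField("flag", "exported", Field.Store.YES));
                writer.updateDocument(new Term("id", id), updated);
            }

            // One ID per line, file named after the current system time.
            String name = new SimpleDateFormat("yyyy-MM-dd_HH-mm").format(new Date()) + ".txt";
            Files.write(exportDir.resolve(name), ids, StandardCharsets.UTF_8);
        }
    }
}
```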
The flag update submodule is run periodically by the Linux crontab utility. On each run it checks the microblog ID export files in the local microblog ID export directory; if a file has been renamed by the data storage module to the form "Hbase-year-month-day_hour-minute.txt", the flag of the IDs contained in that file is set to "imported into HBase" in Lucene and the file is deleted. A sketch is given below.
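A sketch of the flag update submodule under the same assumptions: it looks for export files renamed with the "Hbase-" prefix, re-flags their IDs and deletes the files.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class FlagUpdater {

    public static void run(Path indexDir, Path exportDir) throws IOException {
        try (FSDirectory dir = FSDirectory.open(indexDir);
             IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
             DirectoryStream<Path> files =
                     Files.newDirectoryStream(exportDir, "Hbase-*.txt")) {

            for (Path file : files) {
                for (String id : Files.readAllLines(file)) {
                    if (id.isEmpty()) continue;
                    Document updated = new Document();
                    updated.add(new StringField("id", id, Field.Store.YES));
                    updated.add(new StringField("flag", "imported into HBase", Field.Store.YES));
                    writer.updateDocument(new Term("id", id), updated);
                }
                Files.delete(file);   // the renamed file has been fully processed
            }
        }
    }
}
```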
The method comprises the following operation steps:
(1) the tool collects microblog IDs from the microblog websites through the acquisition layer;
(2) the tool deduplicates the microblog IDs and stores them in a distributed way through the storage layer.
Step (1) further comprises the following operations:
(1.1) the web crawling module in the acquisition layer crawls the IDs of all verified users from the two major microblog websites, Sina Weibo and Tencent Weibo, by means of browser plug-ins and stores them in local files;
(1.2) the microblog API module in the acquisition layer first obtains the API authorization tokens of Sina Weibo and Tencent Weibo automatically, then obtains the fan IDs of the verified users by calling the APIs, and finally stores the fan IDs in fan ID files in the local fan ID collection directory.
In step (1.1), the crawling and storing operations of the web crawling module further comprise the following:
(1.1.1) The verified-user category pages are requested and parsed by means of Chrome browser plug-ins, and the parsed IDs are sent to the storage submodule of the module. Separate plug-ins are written for Sina Weibo and Tencent Weibo. For each plug-in a root directory is created containing a manifest.json configuration file, which specifies that the two Javascript files jquery.js and content_script.js are loaded once the Sina Weibo verified-user home page has finished loading; jquery.js is the jQuery library and content_script.js contains the crawling logic of this submodule. The permissions field of the configuration file allows the plug-in to request the verified-user pages and any resource of the local web server. The configuration of the Tencent Weibo plug-in is essentially the same, except that the Sina Weibo verified-user home page is replaced by the Tencent Weibo one. Because the verified-user pages of Sina Weibo and Tencent Weibo are structured differently, the crawling processes also differ.
(1.1.2) For the Sina Weibo verified-user pages, the crawling process is:
First, jQuery is used to obtain all first-level li elements inside the ul elements of the div elements with class=nav_barMain.
The first a element of each li element is obtained; its href attribute is the link of a first-level category and its text is the name of the first-level category.
The ul element inside each first-level li element is obtained; the text and href attributes of the a elements in its li elements are the second-level category names and links.
Then, for each second-level category, a request is sent to its link with the $.post function provided by jQuery, and regular expressions are used to extract the third-level category names and links from the responseText returned by the post call.
For every third-level category link, the parameters letter=a, letter=b, ..., letter=z are appended to form 26 new links, a post request is sent to each of them, and regular expressions are used to extract all microblog IDs from the returned pages.
Finally, the microblog IDs and the corresponding categories are assembled into Javascript arrays, and the $.post function provided by jQuery is called to send the arrays to the processing page of the storage submodule of the module.
(1.1.3) The Tencent Weibo microblog ID crawling submodule of the web crawling module crawls the Tencent Weibo verified-user pages as follows:
First, jQuery is used to obtain all first-level li elements inside the ul element of the div element with class=peopleNav; in each li element, the text and href attribute of the first a element are the first-level category name and link.
jQuery is used to trigger the mouseover event of each li so that the page automatically sends AJAX requests that fetch the second-level categories under each first-level category and writes them into the current page.
Then the div element with class=navLayer inside each first-level li element is obtained; the text of the strong element in its dt element is the second-level category name, and the text and href attributes of all a elements in the dd element that follows the dt element are the third-level category names and links.
For each third-level category link, the parameters sort=char&char=a, sort=char&char=b, ..., sort=char&char=z are appended to form 26 new links, a post request is sent to each of them, and regular expressions are used to extract all microblog IDs from the returned pages.
The remaining steps are similar to those for Sina Weibo; apart from sending the IDs and their category names to the corresponding processing page of the storage submodule, the other steps are essentially identical.
(1.1.4) The storage submodule of the web crawling module consists of saveId.php files written in PHP, one for Sina Weibo and one for Tencent Weibo. Each receives the Get/Post requests sent by the microblog ID crawling submodules in (1.1.2) and (1.1.3) and obtains the category-name array and the microblog ID array. A verified-user ID collection directory is created for each category name in the array except the last and named after that category (it is not created again if it already exists); a verified-user ID file named "<last-level category name>.txt" is created, and the IDs in the ID array are written into it one per line (if the file already exists, the IDs are only appended). Before a directory or file is created, any "/" or "\" characters that would appear in its name are replaced with ",". For example, if the category-name array is ["Entertainment", "Entertainment industry", "Planning/Publicity"], the directory "Entertainment/Entertainment industry/" is created and the file "Planning,Publicity.txt" is created inside it. The same naming rules are illustrated in the sketch below.
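The patent implements this submodule in PHP (saveId.php). Purely for illustration, the following Java method reproduces the same directory and file naming rules; it is a transliteration under stated assumptions, not the PHP code of the invention.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class VerifiedIdStore {

    public static void save(Path collectionRoot, List<String> categories,
                            List<String> microblogIds) throws IOException {
        // e.g. ["Entertainment", "Entertainment industry", "Planning/Publicity"]
        Path dir = collectionRoot;
        for (int i = 0; i < categories.size() - 1; i++) {
            dir = dir.resolve(sanitize(categories.get(i)));
        }
        Files.createDirectories(dir);   // no-op if the directories already exist

        String last = sanitize(categories.get(categories.size() - 1));
        Path file = dir.resolve(last + ".txt");   // e.g. "Planning,Publicity.txt"

        // Append the IDs, one per line; creates the file on first use.
        Files.write(file,
                microblogIds,
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Replace characters that are illegal in directory or file names. */
    private static String sanitize(String name) {
        return name.replace("/", ",").replace("\\", ",");
    }
}
```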
In step (1.2), the collection and storage operations of the microblog API module further comprise the following:
(1.2.1) The Sina authorization-token submodule of the microblog API module automatically obtains the access token of the Sina Weibo SDK by simulating HTTPS requests and keeps the token from expiring; the other modules only need to use the obtained token, pass authentication following the OAuth2.0 protocol, and can then call the APIs provided by the SDK. The concrete procedure is:
After careful investigation, the parameters required to log in to the Sina Weibo authorization page are determined.
Using the interfaces provided by the Apache HttpClient open-source toolkit, a PostMethod object is created and the parameters required to log in to Sina Weibo are added to it.
A list of message headers is created; each element of the list is initialized as an HttpClient Header object, and the message headers of the request, including Referer, Host and User_Agent, are filled in.
A Protocol object is created and given the parameters for opening the link to the server: the protocol name is set to "https", the port to 443, and the ProtocolSocketFactory parameter is obtained by constructing a MySSLSocketFactory object (provided by the microblog platform).
An HttpClient object is created and the message headers are submitted by calling its interface.
After the above steps the server's response is obtained; the returned message is parsed and the string is cut to obtain the AccessToken.
(1.2.2) The Tencent authorization-token submodule of the microblog API module uses the access token obtained after a single manual authorization performed by the user before the program runs, and obtains new tokens by periodically calling the access-token refresh API provided by Tencent Weibo so that the token does not expire within three months. Manual authorization proceeds as follows: the authorization program is run, a browser pops up and shows the Tencent Weibo authorization page, the page redirects automatically after the user name and password are entered, the part of the redirected page's URL after the "#" is copied and pasted into the program followed by a carriage return, and the authorization of one test user is complete.
(1.2.3) The fan collection submodule of the microblog API module reads the verified-user ID files from the verified-user ID collection directory (the directory and files are produced by (1.1.4)) and stores all IDs in the files in an array;
(1.2.4) for each verified-user ID in the array of (1.2.3), the open APIs of Sina Weibo and Tencent Weibo are called with the authorization tokens, at most 5000 randomly chosen fan IDs of that verified user are obtained, and the fan IDs are kept in a cache file;
(1.2.5) when the number of fan IDs reaches a certain threshold, the IDs in the cache file are written into a fan ID file in the fan ID collection directory, named "<current timestamp>.txt";
(1.2.6) a single traversal of the verified-user IDs through the above steps obtains at most 5000 random fan IDs per verified user, so steps (1.2.3), (1.2.4) and (1.2.5) are repeated; traversing the verified-user IDs and fetching their fan IDs repeatedly makes the IDs collected by the system increasingly complete;
(1.2.7) the repeated traversals described in (1.2.6) require more API calls, but because the number of API calls is limited, repeated traversal takes a long time. The collection efficiency is improved, without violating the intent of the call-frequency limits of Sina Weibo and Tencent Weibo, by using several microblog-application test users in a polling fashion and by crawling with several machines at the same time. A microblog-application test user (hereafter "test user") is the smallest unit that can obtain an authorization token. The microblog platforms limit the API call frequency at three levels:
Table 1. API call frequency limits
Test-user level: the number of API calls made by one test user within one hour may not exceed the limit (150 calls/hour for Sina Weibo).
Application level: the total number of API calls made within one hour by all test users of one microblog application may not exceed the limit (equal to the IP-level limit for Sina Weibo and Tencent Weibo).
IP level: the total number of API calls made within one hour by all microblog applications on one machine may not exceed the limit (1000 calls/hour for Sina Weibo).
Multi-test-user polling means that several test users obtain authorization tokens in the same program and the tokens are then used in turn when calling the APIs, which increases the number of API calls per hour and lets the call frequency reach the IP-level limit.
Because the application-level and IP-level limits of Sina Weibo and Tencent Weibo are equal, there is no need to investigate raising the collection rate by running several microblog applications on one machine.
Multi-machine crawling means deploying the multi-test-user polling program on several machines at the same time; since each machine can reach the IP-level limit, the overall call frequency can reach (number of machines) x (IP-level limit).
When multi-machine collection is used, each machine buffers the microblog IDs it collects in its own cache file, and when the number of buffered IDs reaches a certain threshold the cached file is uploaded over FTP to the fan ID collection directory on the machine that hosts the indexing and deduplication module.
With mechanisms such as multi-test-user polling and multi-machine deployment, the performance of the system improves markedly. Suppose the fan IDs of 55000 Sina Weibo verified users are to be collected and the verified-user IDs have to be traversed 5 times. Without multi-test-user polling or multi-machine collection, the test-user-level limit (150 calls/hour) means the collection needs (55000*5)/150 = 1833.33 hours = 76.4 days. If the microblog IDs must be collected faster without posing a risk to the microblog website, multi-test-user polling can be used: the IP-level limit (1000 calls/hour) allows at most 7 test users, and the task then needs only (55000*5)/1000 = 275 hours = 11.5 days. If the collection must be faster still, several machines can be used; the call frequency then reaches 1000 x (number of machines) per hour, and with 5 machines the task needs only (55000*5)/(1000*5) = 55 hours, about 2.3 days. Users can choose a suitable collection scheme according to their actual needs and conditions.
Step (2) further comprises the following operations:
(2.1) The deduplication and indexing module in the storage layer calls the Lucene API to build indexes for the collected microblog IDs; the index names for Sina Weibo and Tencent Weibo are index_A and index_B respectively, and microblog IDs are periodically exported into files in a designated directory for the data storage module to process. The index has two columns, the microblog ID and the flag; both can be searched and both are stored without tokenization. The flag has three values: "not exported", "exported" and "imported into HBase". "Not exported" means the microblog ID has not been exported from Lucene; "exported" means it has been exported from Lucene but not yet imported into HBase; "imported into HBase" means it has been imported into HBase.
(2.2) The data storage module is responsible for taking the collected microblog IDs exported by the deduplication and indexing module and storing them in HBase, while providing a microblog ID query interface. The module is run periodically by the Linux crontab utility. On each run it calls the Lucene API to search the two indexes index_A and index_B and stores the Sina Weibo IDs and the Tencent Weibo IDs in different HBase tables. The row key of a table has the form "DDDDD + microblog ID", where DDDDD is five decimal digits; this prefix is label space reserved for later data analysis and can be used to mark the state or attributes of the microblog ID. The module periodically retrieves from Lucene the microblog IDs whose flag equals "not exported", calls the HBase API to store them in the corresponding HBase table, and changes their flag in Lucene to "exported" to prevent them from being exported again.
In step (2.1), the operations of the deduplication and indexing module further comprise the following:
(2.1.1) The index building submodule of the indexing module calls the indexing API provided by Lucene and is run periodically by the Linux crontab utility. On each run it reads the fan ID files from the fan ID collection directory and puts the IDs in the files into a HashSet container, which deduplicates them automatically (a HashSet is a hash-based collection that guarantees its elements are unique). The retrieval API of Lucene is then called to look up the IDs of the HashSet in the Lucene index, and any ID already present in the index is removed from the HashSet. The indexing API of Lucene is then called to add the remaining IDs of the HashSet to the Lucene index, the newly added IDs are flagged "not exported", and the processed fan ID files are deleted.
(2.1.2) The ID export submodule of the indexing module is run periodically by the Linux crontab utility. On each run it calls the Lucene API to retrieve a certain number of microblog IDs flagged "not exported", creates a microblog ID export file in the previously created local microblog ID export directory, writes the exported IDs into the file one per line, names the file "year-month-day_hour-minute.txt" according to the current system time, and sets the flag of the exported IDs to "exported".
(2.1.3) The flag update submodule of the indexing module is run periodically by the Linux crontab utility. On each run it checks whether the local microblog ID export directory contains files named in the form "Hbase-year-month-day_hour-minute.txt"; such a file means its IDs have been processed by the data storage module, so the flag of the IDs in these files is set to "imported into HBase" in Lucene and the files are deleted.
In step (2.2), the storage operations of the data storage module further comprise the following:
(2.2.1) using the crontab utility provided by Linux, the files named "year-month-day_hour-minute.txt" produced by (2.1.2) are read periodically from the microblog ID export directories of Sina Weibo and Tencent Weibo, and the microblog IDs in the files are obtained;
(2.2.2) the HBase API is called to store the Sina Weibo and Tencent Weibo microblog IDs obtained in (2.2.1) in the corresponding HBase microblog ID tables: the Sina Weibo IDs go into table author_A and the Tencent Weibo IDs into table author_B. The row key of a table has the form "DDDDD + microblog ID", where DDDDD is five decimal digits; this prefix is label space reserved for later data analysis and can be used to mark the state or attributes of the microblog ID;
(2.2.3) the field-update API provided by Lucene is called, and the files in the microblog ID export directories of Sina Weibo and Tencent Weibo that have been processed by (2.2.1) and (2.2.2) are given the prefix "Hbase-", i.e. renamed to the form "Hbase-year-month-day_hour-minute.txt". A sketch of this storage step is given below.
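A sketch of the HBase side of the data storage module for a single export file, written against a recent HBase client API (the client API current at the filing date differs). The column family "info", the qualifier "id" and the fixed "00000" prefix are illustrative assumptions, while the table names author_A/author_B and the "DDDDD + microblog ID" row-key layout follow the description above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.nio.file.Files;
import java.nio.file.Path;

public class MicroblogIdStore {

    /** Store the IDs of one export file and rename it with the "Hbase-" prefix. */
    public static void storeExportFile(Path exportFile,
                                       String tableName /* author_A or author_B */)
            throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf(tableName))) {

            for (String id : Files.readAllLines(exportFile)) {
                if (id.isEmpty()) continue;
                // Row key = five reserved decimal digits + microblog ID; the prefix is
                // left as "00000" until later analysis assigns it a meaning.
                String rowKey = "00000" + id;
                Put put = new Put(Bytes.toBytes(rowKey));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("id"), Bytes.toBytes(id));
                table.put(put);
            }
        }
        // Mark the file as processed so the flag update submodule can pick it up.
        Files.move(exportFile, exportFile.resolveSibling("Hbase-" + exportFile.getFileName()));
    }
}
```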
Brief description of the drawings
Fig. 1 is a schematic diagram of the architecture of the acquisition method and tool for microblog IDs of mainstream microblog websites according to the present invention.
Fig. 2 is a schematic diagram of the operation flow of the acquisition method and tool for microblog IDs of mainstream microblog websites.
Embodiments
To make the purpose, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, the architecture of the invention is divided into two layers, the acquisition layer and the storage layer; the interfaces between the layers and the system are clear, each layer consists of several modules, and the loose coupling between the modules makes it easy to extend the functions of each layer. The acquisition layer collects the microblog IDs; the storage layer stores them locally in a distributed way and provides a retrieval function. Specifically:
The acquisition layer crawls verified-user IDs and collects the fan IDs of verified users; it consists of a web crawling module and a microblog API module.
The storage layer is responsible for deduplicating the microblog IDs obtained by the acquisition layer and storing them in a distributed way, and provides a microblog ID query interface; it consists of a deduplication module and a data storage module.
The functions of the modules in the acquisition layer are as follows:
The web crawling module is responsible for crawling the microblog IDs on the verified-user pages of Sina Weibo and Tencent Weibo; its work mainly consists of crawling and parsing web pages and storing the results locally. Page crawling and parsing are implemented as browser plug-ins, and local storage is implemented as web-server code. The module requests the verified-user home page and the first-, second- and third-level category pages, parses the category names and their URLs from the pages, requests and parses the microblog IDs in the category pages at each level, and stores the microblog IDs in verified-user ID files under the local verified-user ID collection directory. The verified-user ID directory is a local directory whose subdirectories are created automatically according to the category levels and names of the microblog website; a verified-user ID file separates the IDs with newlines and is named "<last-level category name>.txt". The module comprises a microblog ID crawling submodule and a storage submodule.
The microblog API module uses the microblog APIs provided by the Sina Weibo and Tencent Weibo open platforms to obtain the fan IDs of the verified users. It first obtains authorization tokens from the two open platforms, then constructs the parameters required by the corresponding API interfaces according to the business needs, sends Get/Post requests, obtains the microblog ID data in JSON format, and writes the parsed microblog IDs into fan ID files under the fan ID collection directory. The fan ID collection directory is a directory that stores fan ID files; a fan ID file is a text file holding the IDs obtained by this module, with one ID per line; each file holds a certain number of IDs and is named "<current timestamp>.txt".
The functions of the modules in the storage layer are as follows:
The deduplication and indexing module deduplicates the collected microblog IDs and builds indexes for them; the index names for Sina Weibo and Tencent Weibo are index_A and index_B respectively. It periodically exports microblog IDs into files in a designated directory for the data storage module to process, and a flag field in the index prevents IDs from being exported twice. The index has two columns, the microblog ID and the flag, and both are stored without tokenization. The flag has three values: "not exported", "exported" and "imported into HBase". "Not exported" means the microblog ID has not been exported from Lucene; "exported" means it has been exported from Lucene but not yet imported into HBase; "imported into HBase" means it has been imported into HBase. The module comprises an index building submodule, an ID export submodule and a flag update submodule. The index building submodule deduplicates the collected IDs, adds the deduplicated IDs to the index, and sets their flag to "not exported". The ID export submodule periodically retrieves the IDs that have not yet been exported and writes them into microblog ID export files in the local microblog ID export directory (the export directory must be created before the system runs; the export files are created automatically at each export). The export files are named "year-month-day_hour-minute.txt" and handed to the data storage module; the flag of the exported IDs is set to "exported". The flag update submodule checks whether the data storage module has processed the exported IDs and, when it has, sets their flag to "imported into HBase".
The data storage module reads the microblog ID export files from the local microblog ID export directory and stores the microblog IDs in the files in a distributed way using HBase, the storage tool of the open-source distributed system Hadoop, while providing a microblog ID query interface. The module is run periodically by the Linux crontab utility. On each run it reads the export files in the local microblog ID export directory and stores the Sina Weibo IDs and the Tencent Weibo IDs in different HBase tables. The row key of a table has the form "DDDDD + microblog ID", where DDDDD is five decimal digits; this prefix is label space reserved for later data analysis and can be used to mark the state or attributes of the microblog ID. The module periodically retrieves from Lucene the microblog IDs whose flag equals "not exported", calls the HBase API to store them in the corresponding HBase table, and changes their flag in Lucene to "exported" to prevent them from being exported again.
The functions of the submodules of the web crawling module are as follows:
The Sina Weibo microblog ID crawling submodule parses the verified-user home page of Sina Weibo by means of a browser plug-in, obtains the first-level category list and its URLs, and sends AJAX requests to the URLs in the first-level list to fetch the corresponding pages. If a category has no next-level categories, AJAX requests are sent to the verified-user pages for each initial letter under that category, the verified-user IDs are parsed from the returned pages, and the parsed microblog IDs together with their categories are sent with Get/Post requests to the storage submodule of this module. Otherwise, the second-level category list and its URLs are parsed from the first-level category page and the same operation is applied.
The Tencent Weibo microblog ID crawling submodule parses the verified-user home page of Tencent Weibo by means of a browser plug-in, obtains the first- and second-level category lists and their URLs, sends AJAX requests to the verified-user pages for each initial letter under the second-level categories, parses the verified-user IDs from the returned pages, and sends the parsed microblog IDs together with their categories with Get/Post requests to the storage submodule of this module.
The storage submodule uses a web server to receive the Get/Post requests of the Sina Weibo and Tencent Weibo microblog ID crawling submodules, creates first- and second-level directories named after the first- and second-level categories (any "/" or "\" characters in a directory name are replaced with "," before the directory is created), and stores the microblog IDs in the corresponding files according to their categories.
Microblogging API module, robotization is obtained authorization token to support constantly to obtain by the mode of API chronically the bean vermicelli ID of authenticated.Wherein the authorization token of Sina's microblogging need to obtain by the mode robotization of simulation HTTPS request, and the token that the authorization token of Tengxun's microblogging only need to regularly call Tengxun's microblogging to be provided refreshes API can realize that robotization obtains.When the bean vermicelli ID quantity of obtaining by the mode of API reaches certain threshold value, bean vermicelli ID is deposited in the bean vermicelli ID file of bean vermicelli ID collection catalogue, bean vermicelli ID file is with the form name of " current time stamp .txt ".Comprise that Sina's authorization token obtains submodule, Tengxun's authorization token obtains submodule and bean vermicelli gathers submodule.
The function of each submodule of described microblogging API module is respectively:
Sina's authorization token obtains submodule, by the simulation HTTPS request microblogging SDK of automatic acquisition Sina access token, guarantee that token is not expired, other modules only need be utilized the token obtained, follow the OAuth2.0 agreement by authentication, the API that can provide SDK is called.Concrete mode is: by the needed parameter of structure Sina microblogging authorization page login action, characteristics according to the HTTPS agreement, the HttpClient Open-Source Tools that utilizes Apache company to provide, the required message header of structure request, arrange PostMethod, adds that the general Socket interface that open platform provides is communicated with the communication of client to server, send the Post request, thereby successfully obtain the response of server end, and resolution response message body, authentication token obtained.
The Tencent authorization token acquisition submodule starts from the access token obtained after a one-time manual authorization and keeps it valid by periodically calling the access-token refresh API provided by Tencent Weibo to obtain a new token, so that the token does not expire within three months. The manual authorization proceeds as follows: run the authorization program, which opens a browser showing the Tencent Weibo authorization page; after the user name and password are entered the page redirects automatically; copy the part of the redirected page's URL after the "#" and paste it into the program, then press Enter; this completes the authorization of one test user.
The fan collection submodule reads the verified-user ID files from the verified-user ID collection directory, traverses the verified-user IDs, and uses the authorization tokens to call the open APIs of Sina Weibo and Tencent Weibo to obtain, at random, up to 5,000 fan IDs of each verified user. The fan IDs are buffered in a cache file; when the number of cached fan IDs reaches a certain threshold, the IDs in the cache file are written into a fan ID file under the fan ID collection directory, named in the form "<current timestamp>.txt". By using several microblog-application test users and capturing on several machines simultaneously, this submodule works around the limits that Sina Weibo and Tencent Weibo place on the number of API calls. A microblog-application test user (hereinafter "test user") is the smallest unit that can obtain an authorization token. Because one test user within one application on one IP address may make at most 150 API calls per hour, and the total number of API calls per hour made by several test users across one or more applications on one IP address may not exceed 1,000, the submodule uses multiple test users and multiple machines to increase the number of available API calls.
Referring to Fig. 2, the overall operation flow of the described method and tool for acquiring microblog IDs from mainstream microblog websites is: crawl the microblog IDs of Sina and Tencent verified users and store them in local files; obtain the Sina and Tencent authorization tokens, call the relevant APIs with these tokens to obtain the fan IDs of the verified users, and store them in local files (because of the limits on the number of Sina and Tencent API calls, this step takes a comparatively long time); periodically deduplicate the collected fan IDs and build an index for them; periodically retrieve IDs from the index and store them in HBase.
The method comprises the following steps:
(1) the tool collects microblog IDs from the microblog websites through the acquisition layer;
(2) the system deduplicates the microblog IDs and stores them in a distributed manner through the storage layer.
Step (1) further comprises the following operations:
(1.1) the network crawling module in the acquisition layer crawls the IDs of all verified users of the two major microblog websites, Sina and Tencent, by means of browser plug-ins, and stores them in local files;
(1.2) the microblog API module in the acquisition layer first automatically obtains the API-call authorization tokens of Sina Weibo and Tencent Weibo, then obtains the fan IDs of the verified users by calling the APIs, and stores the fan IDs in fan ID files under the local fan ID collection directory.
In step (1.1), the crawling and storage operations of the network crawling module further comprise the following:
(1.1.1) Chrome browser plug-ins request and parse the verified-user category pages and send the parsed IDs to the storage submodule of this module. Separate plug-ins are written for crawling Sina Weibo and Tencent Weibo. A plug-in root directory is created for each plug-in, and a manifest.json configuration file is created in the root directory. The relevant configuration in the manifest.json file of the Sina crawling plug-in is:
(The manifest.json configuration code appears only as an image in the original publication.)
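A minimal sketch of what such a manifest.json plausibly looks like, reconstructed from the field description that follows, is given here; the plug-in name, version, exact match pattern and local-server permission are illustrative assumptions rather than the configuration actually filed:
{
  "manifest_version": 2,
  "name": "Sina verified-user ID crawler",
  "version": "1.0",
  "content_scripts": [
    {
      "matches": ["http://verified.weibo.com/fame/*"],
      "js": ["jquery.js", "content_script.js"]
    }
  ],
  "permissions": [
    "http://verified.weibo.com/*",
    "http://localhost/*"
  ]
}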
The content_scripts field specifies that, after the page http://verified.weibo.com/fame/ has loaded, two Javascript files, jquery.js and content_script.js, are loaded: jquery.js is the jQuery library, and content_script.js contains this submodule's logic for crawling verified-user IDs. The permissions field allows the plug-in to request the verified-user pages and any resource of the local WebServer.
The configuration of the Tencent crawling plug-in is essentially the same as that of the Sina plug-in, except that "http://verified.weibo.com/fame/" in the matches and permissions fields is replaced with "http://zhaoren.t.qq.com/people.php".
Because the verified-user pages of Sina Weibo and Tencent Weibo are structured differently, the crawling procedures implemented in the two content_script.js files also differ.
(1.1.2) In the Sina microblog ID crawling submodule of the network crawling module, the code in content_script.js for the Sina verified-user pages executes as follows:
First, jQuery is used to obtain all first-level li elements inside the ul element of every div element with class=nav_barMain; the first a element of each li element is then taken, whose href attribute is the link of a first-level category and whose text is the title of that first-level category. Next, the ul element inside each first-level li element is obtained; the text and href attributes of the a elements in its li elements are the second-level category titles and links.
After the second-level category titles and links have been obtained, a request is sent to each second-level category link with jQuery's $.post function. In the responseText returned to the post callback, the regular expression /<span class="cat_B"><a class="W_linkc" href="([A-Za-z0-9./=%_~;#:;+-]+)">(((\u[0-9a-f]{4})|(\/)|\w)+)<\/a><\/span>/ matches all strings that contain a sub-category title and link (the "/" at either end marks the beginning and end of the regular expression), and the regular expression /href="([A-Za-z0-9./=%_~;#:;+-]+)">(((\u[0-9a-f]{4})|(\/)|\w)+)<\/a>/ extracts the link and title from them; in the title, "\u" is replaced with "%u" and JavaScript's unescape function is applied, so that the escape-encoded Chinese characters are converted into UTF-8 characters the browser can recognize.
For every sub-category link, the parameters letter=a, letter=b, ..., letter=z are appended to form 26 new links, and a Post request is sent to each of them. In the returned pages, the regular expression /action-data="uid=[0-9]+"/ matches all strings containing verified-user microblog IDs, and the regular expression /[0-9]+/ extracts the microblog IDs from them. Finally, the microblog IDs and their corresponding categories are placed in Javascript arrays, and jQuery's $.post function sends the arrays to the processing page http://localhost/sina/saveId.php of this module's storage submodule.
(1.1.3) In the Tencent microblog ID crawling submodule of the network crawling module, the code in content_script.js for the Tencent verified-user pages executes as follows:
First, jQuery is used to obtain all first-level li elements inside the ul element of the div element with class=peopleNav; in each li element, the text and href attribute of the first a element are a first-level category name and link. Next, jQuery triggers the mouseover event of each li element, which makes the page automatically send AJAX requests that fetch the second-level categories under each first-level category and write them into the current page.
Then the div element with class=navLayer inside each first-level li element is obtained; the text of the strong element inside its dt element is the second-level category title, and the text and href attributes of all a elements in the dd element following the dt element are the sub-category titles and links. For each sub-category link, the parameters sort=char&char=a, sort=char&char=b, ..., sort=char&char=z are appended to form 26 new links, and a Post request is sent to each of them. In the returned pages, the regular expression /<a target="_blank" href="http:\/\/t.qq.com\/[a-zA-Z0-9-_]{6,20}" title="[\u0000-\uffffa-zA-Z0-9-_]{1,12}(@[\u0000-\uffffa-zA-Z0-9-_]{1,20})">[\u0000-\u9999a-zA-Z0-9-_]{1,12}<\/a>/ matches all strings that contain verified-user microblog IDs, and the regular expression /"http:\/\/t.qq.com\/([a-zA-Z0-9-_]{6,20})"/ extracts the microblog IDs from them.
Finally, the microblog IDs and their corresponding categories are placed in Javascript arrays, and jQuery's $.post function sends the arrays to the processing page http://localhost/qq/saveId.php of this module's storage submodule.
(1.1.4) The storage submodule of the network crawling module uses saveId.php files written in PHP for Sina and Tencent to receive the Get/Post requests from the Sina and Tencent microblog ID crawling submodules and obtain the category-name array and the microblog ID array. The strings in the arrays are converted from UTF-8 to GBK encoding with PHP's iconv function (the storage submodule runs on Windows, where directory and file names are GBK-encoded). A verified-user ID collection directory is created for, and named after, every category name in the array except the last one (an existing directory is not created again); a verified-user ID file is created with the name "<last-level category name>.txt"; and the IDs in the ID array are written into the verified-user ID file separated by newlines (if the file already exists, the IDs are simply appended). Before directories and files are created, characters that may not appear in directory or file names, such as "/", are replaced with ",". For example, if the category-name array is ["amusement", "entertainment industry", "planning/publicity"], the directory "amusement/entertainment industry/" is created, and the file "planning,publicity.txt" is created in that directory.
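The storage submodule itself is written in PHP; purely for illustration, and for consistency with the other sketches in this description, the same directory-and-file logic can be expressed in Java roughly as follows (class and method names are illustrative, and the UTF-8 to GBK conversion performed by iconv is omitted):
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class IdFileStore {

    // Replace characters that may not appear in directory or file names (e.g. "/") with ",".
    private static String sanitize(String name) {
        return name.replace("/", ",");
    }

    /**
     * categories: e.g. ["amusement", "entertainment industry", "planning/publicity"]
     * ids:        verified-user IDs belonging to that category
     * root:       local verified-user ID collection directory
     */
    public static void save(File root, List<String> categories, List<String> ids) throws IOException {
        // Build the directory from all but the last category level ("amusement/entertainment industry/").
        File dir = root;
        for (int i = 0; i < categories.size() - 1; i++) {
            dir = new File(dir, sanitize(categories.get(i)));
        }
        dir.mkdirs(); // no-op if the directory already exists

        // The file is named after the last category level, e.g. "planning,publicity.txt".
        File idFile = new File(dir, sanitize(categories.get(categories.size() - 1)) + ".txt");

        // Append the IDs, one per line; if the file already exists, only new lines are added.
        try (FileWriter out = new FileWriter(idFile, true)) {
            for (String id : ids) {
                out.write(id);
                out.write(System.lineSeparator());
            }
        }
    }
}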
In step (1.2), the acquisition and storage operations of the microblog API module further comprise the following:
(1.2.1) The Sina authorization token acquisition submodule of the microblog API module automatically obtains the access token of the Sina Weibo SDK by simulating HTTPS requests and keeps the token from expiring; other modules only need to use the obtained token, pass authentication according to the OAuth 2.0 protocol, and can then call the APIs provided by the SDK. The concrete procedure is as follows:
Careful investigation shows that logging in to the Sina Weibo authorization page requires the following parameters:
Table 2: parameters required to log in to the Sina Weibo authorization page (the table content is reproduced as images in the original publication)
Using the relevant interfaces of the Apache HttpClient open-source tool, a PostMethod object is created, and the parameters required to log in to Sina Weibo are added to it.
A request-header list is created, each element of which is initialized as an HttpClient Header object; the request message headers are constructed and filled with the relevant information, including Referer, Host and User-Agent.
A Protocol object is created and filled with the parameters needed to open the connection to the server: the protocol name is set to "https", the port to 443, and the ProtocolSocketFactory parameter is obtained by creating a MySSLSocketFactory object (provided by the microblog platform).
An HttpClient object is created, and the request together with the header list is submitted by calling the relevant interface.
After these steps the server's response message is obtained; the response is then parsed and the string is cut, yielding the AccessToken.
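As an illustration of step (1.2.1), a sketch using the Apache Commons HttpClient 3.x API might look as follows. The concrete form parameters and header values, and the assumption that the token can be cut out of the response body after the key "access_token", are illustrative (Table 2, which lists the real parameters, is published only as an image); MySSLSocketFactory stands for the socket factory supplied with the microblog SDK:
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.protocol.Protocol;
import org.apache.commons.httpclient.protocol.ProtocolSocketFactory;

public class SinaTokenFetcher {

    public String fetchAccessToken(String authorizeUrl, ProtocolSocketFactory sslFactory,
                                   String userName, String password,
                                   String clientId, String redirectUri) throws Exception {
        // 1. PostMethod carrying the parameters the authorization page's login action expects
        //    (the parameter names here are illustrative stand-ins for those of Table 2).
        PostMethod post = new PostMethod(authorizeUrl);
        post.addParameter("userId", userName);
        post.addParameter("passwd", password);
        post.addParameter("client_id", clientId);
        post.addParameter("redirect_uri", redirectUri);
        post.addParameter("action", "submit");

        // 2. Message header list: Referer, Host and User-Agent, as described above.
        List<Header> headers = new ArrayList<Header>();
        headers.add(new Header("Referer", authorizeUrl));
        headers.add(new Header("Host", "api.weibo.com"));
        headers.add(new Header("User-Agent", "Mozilla/5.0"));
        for (Header h : headers) {
            post.addRequestHeader(h);
        }

        // 3. Register an "https" protocol on port 443 backed by the SDK-supplied socket factory.
        Protocol.registerProtocol("https", new Protocol("https", sslFactory, 443));

        // 4. Execute the request, then cut the token out of the response body.
        HttpClient client = new HttpClient();
        client.executeMethod(post);
        String body = post.getResponseBodyAsString();
        post.releaseConnection();

        int start = body.indexOf("access_token=");
        if (start < 0) {
            throw new IllegalStateException("no access token in response");
        }
        start += "access_token=".length();
        int end = body.indexOf('&', start);
        return end > start ? body.substring(start, end) : body.substring(start);
    }
}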
(1.2.2) For the Tencent authorization token acquisition submodule of the microblog API module, the user performs a one-time manual authorization before the program runs; starting from the access token obtained in this way, the submodule periodically calls the access-token refresh API provided by Tencent Weibo to obtain a new token, so that the token does not expire within three months. Within the following three months only the periodic refresh is needed and no further manual authorization is required. The manual authorization proceeds as follows: run the authorization program, which opens a browser showing the Tencent Weibo authorization page; after the user name and password are entered the page redirects automatically; copy the part of the redirected page's URL after the "#" and paste it into the program, then press Enter; this completes the authorization of one test user.
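A minimal sketch of this periodic refresh, assuming only a generic scheduler, is given below; callRefreshApi() is a hypothetical placeholder for the refresh interface of the Tencent Weibo open platform, whose URL and parameters are not reproduced here:
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TencentTokenRefresher {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile String accessToken;

    public TencentTokenRefresher(String tokenFromManualAuthorization) {
        // Seed with the token obtained from the one-time manual authorization.
        this.accessToken = tokenFromManualAuthorization;
    }

    /** Refresh well inside the three-month validity window, e.g. every 30 days. */
    public void start() {
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                accessToken = callRefreshApi(accessToken);
            }
        }, 30, 30, TimeUnit.DAYS);
    }

    public String getAccessToken() {
        return accessToken;
    }

    // Placeholder for the platform-specific token-refresh call.
    private String callRefreshApi(String oldToken) {
        throw new UnsupportedOperationException("platform-specific refresh call");
    }
}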
(1.2.3) The fan collection submodule of the microblog API module reads the verified-user ID files from the verified-user ID collection directory (the directory and files are produced in (1.1.4)) and stores all IDs in the files into an array;
(1.2.4) for each verified-user ID in the array of (1.2.3), the authorization tokens are used to call the open APIs of Sina Weibo and Tencent Weibo, up to 5,000 fan IDs of that verified user are obtained at random, and the fan IDs are kept in a cache file;
(1.2.5) when the number of cached fan IDs reaches a certain threshold, the IDs in the cache file are written into a fan ID file under the fan ID collection directory; the fan ID file is named in the form "<current timestamp>.txt";
(1.2.6) one traversal of the verified-user IDs through the above steps obtains at most 5,000 randomly chosen fan IDs per verified user, so steps (1.2.3), (1.2.4) and (1.2.5) must be repeated: the verified-user IDs are traversed repeatedly and their fan IDs captured again, so that the set of IDs collected by the system gradually approaches completeness;
(1.2.7) the repeated traversals described in (1.2.6) require more API calls, and because the number of API calls is limited, they take a comparatively long time. By using several microblog-application test users in rotation and capturing on several machines simultaneously, the collection efficiency is improved without violating the intent of Sina Weibo's and Tencent Weibo's limits on the number of API calls. A microblog-application test user (hereinafter "test user") is the smallest unit that can obtain an authorization token.
The microblog platforms limit the number of API calls at three levels:
Table 1: API call-count limits by level (the table content is reproduced as an image in the original publication)
Collection with multiple test users in rotation means that several test users each obtain an authorization token within the same program, and the tokens are then used in rotation to call the API, increasing the number of API calls available per hour; in this way the API call frequency can reach the IP-level upper limit.
For Sina Weibo and Tencent Weibo, the application-level and IP-level upper limits on API call frequency are equal, so there is no need to study raising the collection rate by having several microblog applications collect simultaneously on one machine.
Capturing on multiple machines simultaneously means deploying the multi-test-user rotation capture program on several machines at once; the API call frequency of each machine can reach the IP-level upper limit, so the overall API call frequency can reach (number of machines) × (IP-level API call limit). When several machines capture simultaneously, each machine buffers the microblog IDs it collects in its own cache file; when the number of cached IDs reaches a certain threshold, the cached file is uploaded through an FTP interface to the fan ID collection directory on the machine that hosts the deduplication and index module.
With mechanisms such as multi-test-user rotation and multi-machine deployment, the performance of the system improves markedly. Suppose the fan IDs of 55,000 Sina Weibo verified users are to be collected and the verified-user IDs must be traversed 5 times. Without multi-test-user rotation or multi-machine collection, the test-user-level limit on API call frequency (150 calls/hour) means the collection would take (55,000 × 5) / 150 = 1,833.33 hours, about 76.4 days. If microblog IDs must be collected faster without posing potential risks to the microblog websites, multiple test users can be used: because of the IP-level limit (1,000 calls/hour), at most 7 test users can be used, and the task then takes only (55,000 × 5) / 1,000 = 275 hours, about 11.5 days. If the collection speed must be raised further, multiple machines can be used, in which case the API call frequency reaches 1,000 × (number of machines) per hour; with 5 machines the task takes only (55,000 × 5) / (1,000 × 5) = 55 hours, about 2.3 days. Users can choose a suitable collection scheme according to their actual needs and conditions.
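The following sketch illustrates the multi-test-user rotation under the hourly budgets quoted above; TestUser and IdCache are hypothetical abstractions (the real submodule calls the platforms' fan-list APIs and flushes the cache to a "<current timestamp>.txt" file), and the 150 and 1,000 calls-per-hour constants are the limits stated in the text:
import java.util.Arrays;
import java.util.List;

public class FanIdPoller {

    /** Hypothetical stand-in for one authorized test user; fetchFanIds wraps the platform's fan-list API. */
    public interface TestUser {
        List<String> fetchFanIds(String verifiedUserId); // returns up to 5,000 random fan IDs per call
    }

    /** Hypothetical cache abstraction; a real implementation flushes to a "<current timestamp>.txt" file. */
    public interface IdCache {
        void addAll(List<String> ids);
    }

    private static final int PER_USER_HOURLY_LIMIT = 150;  // test-user-level limit quoted above
    private static final int PER_IP_HOURLY_LIMIT = 1000;   // IP-level limit quoted above

    private final List<TestUser> testUsers;
    private final IdCache cache;

    public FanIdPoller(List<TestUser> testUsers, IdCache cache) {
        this.testUsers = testUsers;
        this.cache = cache;
    }

    /** One hourly round: rotate through the test users' tokens until the hourly budgets are used up. */
    public void collectOneHour(List<String> verifiedUserIds) {
        int ipBudget = PER_IP_HOURLY_LIMIT;
        int[] userBudget = new int[testUsers.size()];
        Arrays.fill(userBudget, PER_USER_HOURLY_LIMIT);

        int u = 0;
        for (String verifiedId : verifiedUserIds) {
            if (ipBudget == 0) {
                break;                                   // IP-level quota for this hour exhausted
            }
            int tried = 0;
            while (userBudget[u] == 0 && tried++ < testUsers.size()) {
                u = (u + 1) % testUsers.size();          // skip test users whose quota is used up
            }
            if (userBudget[u] == 0) {
                break;                                   // every test user is exhausted
            }
            cache.addAll(testUsers.get(u).fetchFanIds(verifiedId));
            userBudget[u]--;
            ipBudget--;
            u = (u + 1) % testUsers.size();              // poll the tokens in rotation
        }
    }
}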
Step (2) further comprises the following operations:
(2.1) The deduplication and index module in the storage layer builds an index for the collected microblog IDs and periodically exports the microblog IDs into files under a designated directory for the data storage module to process. The index has two columns, the microblog ID and a flag; both columns are searchable and both are stored without tokenization. The flag takes three values: "not exported", "exported" and "imported into HBase". "Not exported" means the microblog ID has not been exported from Lucene; "exported" means the ID has been exported from Lucene but not yet imported into HBase; "imported into HBase" means the ID has been imported into HBase.
(2.2) The data storage module is responsible for exporting the collected microblog IDs from the deduplication and index module, storing them in a distributed manner with HBase, the storage tool of the open-source distributed system Hadoop, and providing a microblog ID query interface. The module is executed periodically with the Linux crontab facility. On each run it calls the APIs provided by Lucene to retrieve data from the two indices index_A and index_B, and stores the Sina microblog IDs and the Tencent microblog IDs in different HBase tables. The row key of each table has the form "DDDDD" + microblog ID, where DDDDD is a five-digit decimal number reserved for future analysis. On each run the module retrieves from Lucene the microblog IDs whose flag equals "not exported", calls the APIs provided by HBase to store these IDs into the corresponding HBase tables, and changes the flag of these IDs in Lucene to "exported" so that they are not exported again.
In step (2.1), the operations of the deduplication and index module further comprise the following:
(2.1.1) The index building submodule of the index module calls the index API provided by Lucene and is executed periodically with the Linux crontab facility. On each run it reads the fan ID files from the fan ID collection directory and puts the IDs in the files into a HashSet container, which removes duplicates automatically (the HashSet container is based on a hash algorithm and has set semantics, guaranteeing that its elements do not repeat). It then calls the search API provided by Lucene to look up the IDs of the HashSet container in the Lucene index; any ID already present in the index is deleted from the HashSet container. Next it calls the index API provided by Lucene to merge the IDs remaining in the HashSet container into the Lucene index, marks the newly merged IDs as "not exported", and deletes the processed fan ID files (a sketch of this step is given after this list).
(2.1.2) The ID export submodule of the index module is executed periodically with the Linux crontab facility. On each run it calls the APIs provided by Lucene to retrieve a certain number of microblog IDs marked "not exported" from Lucene, creates a microblog ID export file in the locally pre-created microblog ID export directory, and writes the exported microblog IDs into the file separated by newlines; the file is named after the current system time in the form "year-month-day_hour-minute.txt", and the flag of the corresponding IDs is set to "exported".
(2.1.3) The flag update submodule of the index module is executed periodically with the Linux crontab facility. On each run it checks whether the local microblog ID export directory contains files named in the form "Hbase-year-month-day_hour-minute.txt"; if so, the IDs in such a file have already been processed by the data storage module, so the flag of these IDs is set to "imported into HBase" in Lucene and the file is deleted.
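The sketch referred to in (2.1.1) follows; it assumes Lucene 4.x, and the field names "id" and "flag" together with the flag value "not exported" are illustrative renderings of the index layout described above:
import java.io.File;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IdIndexer {

    /** Deduplicate the collected fan IDs against the index and add the new ones, flagged "not exported". */
    public static void mergeIntoIndex(File indexDir, Set<String> collectedIds) throws Exception {
        Directory dir = FSDirectory.open(indexDir);
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_44,
                new StandardAnalyzer(Version.LUCENE_44));
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            // The HashSet passed in has already removed duplicates inside the batch itself;
            // here every ID that is already present in the index is dropped as well.
            Set<String> newIds = new HashSet<>();
            try (DirectoryReader reader = DirectoryReader.open(writer, true)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                for (String id : collectedIds) {
                    if (searcher.search(new TermQuery(new Term("id", id)), 1).totalHits == 0) {
                        newIds.add(id);
                    }
                }
            }
            // Both columns use StringField, i.e. they are stored without tokenization,
            // matching the "not analyzed" storage mode described above.
            for (String id : newIds) {
                Document doc = new Document();
                doc.add(new StringField("id", id, Field.Store.YES));
                doc.add(new StringField("flag", "not exported", Field.Store.YES));
                writer.addDocument(doc);
            }
            writer.commit();
        }
    }
}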
In step (2.2), the storage operations of the data storage module further comprise the following:
(2.2.1) Using the crontab facility provided by Linux, the files named in the form "year-month-day_hour-minute.txt" produced by (2.1.2) are read periodically from the Sina and Tencent microblog ID export directories, and the microblog IDs in the files are obtained;
(2.2.2) the APIs provided by HBase are called to store the Sina and Tencent microblog IDs obtained in (2.2.1) into the corresponding HBase microblog ID tables: the Sina microblog IDs go into table author_A and the Tencent microblog IDs into table author_B. The row key of each table has the form "DDDDD" + microblog ID, where DDDDD is a five-digit decimal number; this numeric prefix is label space reserved for later data analysis and can be used to mark the state or attributes of the microblog ID (see the sketch after this list);
(2.2.3) the field update API provided by Lucene is called, and the files in the Sina and Tencent microblog ID export directories that have been processed by (2.2.1) and (2.2.2) are given the prefix "Hbase-", i.e. renamed to the form "Hbase-year-month-day_hour-minute.txt".
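The sketch referred to in (2.2.2) follows; it assumes the HBase 0.9x client API, uses "00000" as the reserved five-digit prefix, and the column family and qualifier names are illustrative, since the description fixes only the row-key format and the table names author_A and author_B:
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class IdHbaseWriter {

    /**
     * Store exported Sina Weibo IDs into table "author_A" (Tencent IDs go into "author_B" in the same way).
     * The row key is "DDDDD" + microblog ID, with "00000" used here as the reserved five-digit prefix.
     */
    public static void store(List<String> microblogIds) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "author_A");
        try {
            for (String id : microblogIds) {
                Put put = new Put(Bytes.toBytes("00000" + id));
                // Column family "info" and qualifier "id" are illustrative choices.
                put.add(Bytes.toBytes("info"), Bytes.toBytes("id"), Bytes.toBytes(id));
                table.put(put);
            }
        } finally {
            table.close();
        }
    }
}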

Claims (10)

1. A method and tool for acquiring microblog IDs from mainstream microblog websites, characterized in that: the system architecture is divided into two layers, an acquisition layer and a storage layer; the interfaces between the layers and the system are clear; each layer consists of several loosely coupled modules, which facilitates extending the functions of each layer; the acquisition layer collects the microblog IDs, and the storage layer stores them in a local database and provides an open retrieval function; wherein:
the acquisition layer carries out the crawling of verified-user IDs and the collection of the fan IDs of verified users, and consists of a network crawling module and a microblog API module;
the storage layer is responsible for deduplicating the microblog IDs obtained by the acquisition layer, storing the microblog IDs in a distributed manner and providing a microblog ID query interface, and consists of a deduplication and index module and a data storage module.
2. The method and tool for acquiring microblog IDs from mainstream microblog websites according to claim 1, characterized in that the functions of the modules of the acquisition layer are as follows:
the network crawling module is responsible for crawling the microblog IDs in the verified-user pages of Sina Weibo and Tencent Weibo, which mainly comprises page crawling, ID parsing and local ID storage; the page crawling and ID parsing are done by means of browser plug-ins, and the local storage is done by WebServer code; the flow comprises requesting the verified-user home page and the first-level, second-level and sub-category pages, parsing the category names and corresponding URLs in the pages, requesting and parsing the microblog IDs in the category pages at each level, and storing the microblog IDs into the verified-user ID files under the local verified-user ID collection directory;
the microblog API module uses the microblog APIs provided by the Sina Weibo and Tencent Weibo open platforms to obtain the fan IDs of the verified users; its flow comprises first obtaining the authorization tokens of the two microblog open platforms, then constructing, according to the API interfaces of the different microblog websites, the appropriate parameters for the corresponding interfaces, obtaining microblog ID data in JSON format, and storing the parsed microblog IDs into the fan ID files under the local fan ID collection directory.
3. The method and tool for acquiring microblog IDs from mainstream microblog websites according to claim 1, characterized in that the functions of the modules of the storage layer are as follows:
the deduplication and index module uses the open-source indexing tool Lucene to deduplicate the microblog IDs collected by the microblog API module and build an index for them, and periodically exports microblog IDs from the index for the data storage module to process; the exported microblog IDs are stored in the microblog ID export files under the local microblog ID export directory;
the data storage module reads the microblog ID export files from the local microblog ID export directory, stores the microblog IDs in the files in a distributed manner with HBase, the storage tool of the open-source distributed system Hadoop, and provides a microblog ID query interface.
4. The acquisition layer according to claim 2, characterized in that the functions of the submodules of the network crawling module are as follows:
the microblog ID crawling submodules request and parse the verified-user category pages by means of browser plug-ins and send the parsed IDs to the storage submodule of this module; because browser plug-ins are written in Javascript, Javascript libraries such as jQuery are used to simplify requesting pages and parsing HTML documents; because plug-in code runs automatically after the browser has loaded a designated page, the code is set to run after the user has logged in to the microblog website, which avoids having to implement the login process; Javascript's ability to send Get/Post requests to any resource under the same domain is used to request the pages of all verified users; and the ability of browser plug-ins to send cross-domain XMLHttpRequest requests (XMLHttpRequest being the basis on which Javascript sends Get/Post requests) is used to send the crawled IDs by Get/Post requests to a self-written WebServer for local storage;
the storage submodule uses a WebServer to receive the Get/Post requests from the Sina and Tencent microblog ID crawling submodules, creates local first- and second-level verified-user ID collection directories named after the first- and second-level categories, and stores the microblog IDs, according to their categories, into the verified-user ID files under the corresponding verified-user ID collection directories.
5. The acquisition layer according to claim 2, characterized in that the microblog API module automatically obtains authorization tokens so that the fan IDs of verified users can be collected continuously over a long period through the API; the authorization token of Sina Weibo must be obtained automatically by simulating HTTPS requests, whereas the authorization token of Tencent Weibo can be obtained automatically within three months simply by periodically calling the token-refresh API provided by Tencent Weibo; when the number of fan IDs obtained through the API reaches a certain threshold, the fan IDs are written into a fan ID file under the fan ID collection directory, the IDs being separated by newlines and the file being named in the form "<current timestamp>.txt"; the module comprises a Sina authorization token acquisition submodule, a Tencent authorization token acquisition submodule and a fan collection submodule.
6. The storage layer according to claim 3, characterized in that the deduplication and index module deduplicates the collected microblog IDs and builds an index for them, periodically exports microblog IDs into files under a designated directory for the data storage module to process, and sets a flag field in the index to prevent IDs from being exported repeatedly; the index has two columns, the microblog ID and the flag, both stored without tokenization; the flag takes three values, "not exported", "exported" and "imported into HBase", where "not exported" means the microblog ID has not been exported from Lucene, "exported" means the ID has been exported from Lucene but not yet imported into HBase, and "imported into HBase" means the ID has been imported into HBase; the module comprises an index building submodule, an ID export submodule and a flag update submodule; the index building submodule deduplicates the collected IDs, merges the deduplicated IDs into the index and sets the flag field of these IDs to "not exported"; the ID export submodule periodically retrieves the not-yet-exported IDs from the index and writes them into microblog ID export files in the local microblog ID export directory (the microblog ID export directory must be created before the system runs; the export files are created automatically at each export), the files being named in the form "year-month-day_hour-minute.txt", offers them to the data storage module for processing, and sets the flag of the corresponding IDs to "exported"; the flag update submodule checks whether the data storage module has processed the exported IDs and, if so, sets the flag of the corresponding IDs to "imported into HBase".
The Sina microblog IDs and the Tencent microblog IDs are stored in different HBase tables; the row key of each table has the form "DDDDD" + microblog ID, where DDDDD is a five-digit decimal number, the numeric prefix being label space reserved for later data analysis that can be used to mark the state or attributes of the microblog ID; this module periodically reads the files named in the form "year-month-day_hour-minute.txt" from the microblog ID export directory, calls the APIs provided by HBase to store the microblog IDs in the files into the corresponding HBase tables, and renames the files to "Hbase-year-month-day_hour-minute.txt".
7. The microblog API module according to claim 5, characterized in that the functions of the submodules are as follows:
the Sina authorization token acquisition submodule automatically obtains the access token of the Sina Weibo SDK by simulating HTTPS requests and keeps the token from expiring; other modules only need to use the obtained token, pass authentication according to the OAuth 2.0 protocol, and can then call the APIs provided by the SDK;
the Tencent authorization token acquisition submodule obtains a new token by periodically calling the access-token refresh API provided by Tencent Weibo, so that the token does not expire within three months; the user must perform one manual authorization (in the manner prescribed by the Tencent Weibo development platform) before the system runs, and within the following three months only the periodic token refresh is needed, without further manual authorization;
the fan collection submodule reads the verified-user ID files from the verified-user ID collection directory, traverses the verified-user IDs, uses the authorization tokens to call the open APIs of Sina Weibo and Tencent Weibo to obtain at random up to 5,000 fan IDs of each verified user, buffers the fan IDs in a cache file and, when the number of cached fan IDs reaches a certain threshold, writes the IDs in the cache file into a fan ID file under the fan ID collection directory.
8. The deduplication and index module according to claim 6, characterized in that the functions of the submodules are as follows:
the index building submodule periodically reads the fan ID files from the fan ID collection directory, puts the IDs in the files into a Set container to remove duplicates automatically, looks up the microblog IDs of the Set container in the Lucene index, deletes from the Set container any ID already present in the index, then stores the remaining IDs of the Set container into the Lucene index, marks the newly merged IDs as "not exported" and deletes the processed fan ID files;
the ID export submodule periodically retrieves a certain number of microblog IDs marked "not exported" from Lucene, creates a microblog ID export file in the locally pre-created microblog ID export directory, writes the exported microblog IDs into the file separated by newlines, names the file after the current system time in the form "year-month-day_hour-minute.txt", and sets the flag of the corresponding IDs to "exported";
the flag update submodule regularly checks the microblog ID export files in the local microblog ID export directory; if a file has been processed by the storage module and renamed to the form "Hbase-year-month-day_hour-minute.txt", the flag of the IDs in that file is set to "imported into HBase" in Lucene and the file is deleted.
9. The microblog API module according to claim 7, characterized in that, for the Sina authorization token acquisition submodule, the Sina Weibo open platform grants test users an authorization valid for only 24 hours, so an authorized test user would have to log in to the authorization page and apply for a new AccessToken every 24 hours, which is very inconvenient for third-party application development teams; the Sina authorization token acquisition submodule automates this manual process, so that low-privilege third-party development teams can ignore this restriction, collect data automatically and provide better services to microblog users.
10. The microblog API module according to claim 7, characterized in that the fan collection submodule improves the collection efficiency by using several microblog-application test users in rotation and capturing on several machines simultaneously, without violating the intent of Sina Weibo's and Tencent Weibo's limits on the number of API calls.
Collection with multiple test users in rotation means that several test users each obtain an authorization token within the same program, and the tokens are then used in rotation to call the API, increasing the number of API calls available per hour; in this way the API call frequency can reach the IP-level upper limit.
Capturing on multiple machines simultaneously means deploying the multi-test-user rotation capture program on several machines at once; the API call frequency of each machine can reach the IP-level upper limit, so the overall API call frequency can reach (number of machines) × (IP-level API call limit).
When several machines capture simultaneously, each machine buffers the microblog IDs it collects in its own cache file; when the number of cached IDs reaches a certain threshold, the cached file is uploaded through an FTP interface to the fan ID collection directory on the machine that hosts the index and deduplication module.
CN2013104123487A 2013-09-11 2013-09-11 Acquisition method and tool facing microblog IDs (identitiesy) of mainstream microblog websites Pending CN103440139A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013104123487A CN103440139A (en) 2013-09-11 2013-09-11 Acquisition method and tool facing microblog IDs (identitiesy) of mainstream microblog websites

Publications (1)

Publication Number Publication Date
CN103440139A true CN103440139A (en) 2013-12-11

Family

ID=49693830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013104123487A Pending CN103440139A (en) 2013-09-11 2013-09-11 Acquisition method and tool facing microblog IDs (identitiesy) of mainstream microblog websites

Country Status (1)

Country Link
CN (1) CN103440139A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110246457A1 (en) * 2010-03-30 2011-10-06 Yahoo! Inc. Ranking of search results based on microblog data
CN102622443A (en) * 2012-03-13 2012-08-01 北京邮电大学 Customized screening system and method for microblog

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yu Liubao: "MapReduce-based microblog text collection platform", Computer Science (《计算机科学》), 30 November 2012 (2012-11-30), pages 143-145 *
Zhou Xiaoli: "Design and implementation of an Internet public-opinion monitoring system based on a web crawler and Lucene indexing", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》), 31 August 2013 (2013-08-31), pages 138-747 *
Chen Shunhua et al.: "Distributed crawling technology based on microblog APIs", Telecommunications Science (《电信科学》), 31 August 2013 (2013-08-31), pages 146-150 *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391690A (en) * 2014-11-04 2015-03-04 中国石油天然气股份有限公司 Application development system and method
CN104391690B (en) * 2014-11-04 2019-06-11 中国石油天然气股份有限公司 A kind of application development system and method
CN104462566A (en) * 2014-12-26 2015-03-25 中科宇图天下科技有限公司 Environmental protection information grid capturing method
CN104462566B (en) * 2014-12-26 2017-11-21 中科宇图天下科技有限公司 A kind of environmental protection information grid grasping means
CN104598631A (en) * 2015-02-05 2015-05-06 北京航空航天大学 Distributed data processing platform
CN104598631B (en) * 2015-02-05 2017-11-14 北京航空航天大学 Distributed data processing platform
CN105989176A (en) * 2015-03-05 2016-10-05 北大方正集团有限公司 Data processing method and device
CN104866313A (en) * 2015-05-22 2015-08-26 国云科技股份有限公司 Universal method for uploading files in JSP
CN105045803A (en) * 2015-05-27 2015-11-11 国家计算机网络与信息安全管理中心 Acquisition method and system of social network relationships
CN104915911A (en) * 2015-06-19 2015-09-16 四川大学 Method and device for issuing case management microblog
CN105279392A (en) * 2015-09-28 2016-01-27 深圳华大基因科技服务有限公司 Cloud platform-based big data analysis device
CN105279392B (en) * 2015-09-28 2018-07-24 深圳华大基因科技服务有限公司 A kind of big data analysis device based on cloud platform
CN105389482A (en) * 2015-09-28 2016-03-09 深圳华大基因科技服务有限公司 Massive data analysis method based on cloud platform
CN105553770B (en) * 2015-12-15 2019-12-24 北京奇虎科技有限公司 Data acquisition control method and device
CN105553770A (en) * 2015-12-15 2016-05-04 北京奇虎科技有限公司 Data acquisition control method and device
CN105515914A (en) * 2015-12-24 2016-04-20 无线生活(杭州)信息科技有限公司 Method and device for distributing traffic for push message
CN105515914B (en) * 2015-12-24 2019-01-25 无线生活(杭州)信息科技有限公司 A kind of method and device for distributing flow for Push message
CN106202190A (en) * 2016-06-27 2016-12-07 乐视控股(北京)有限公司 A kind of browser account information storage method and mobile terminal
CN106709280A (en) * 2016-12-08 2017-05-24 北京旷视科技有限公司 Method, client and server for processing information
CN106919675A (en) * 2017-02-24 2017-07-04 浙江大华技术股份有限公司 A kind of date storage method and device
CN106919675B (en) * 2017-02-24 2019-12-20 浙江大华技术股份有限公司 Data storage method and device
CN108536691A (en) * 2017-03-01 2018-09-14 中兴通讯股份有限公司 Web page crawl method and apparatus
CN107103079A (en) * 2017-04-25 2017-08-29 中科院微电子研究所昆山分所 The live broadcasting method and system of a kind of dynamic website
CN107465768A (en) * 2017-07-11 2017-12-12 上海精数信息科技有限公司 Short chain based on Implicit authorization clicks on monitoring method and system
CN108345662A (en) * 2018-02-01 2018-07-31 福建师范大学 A kind of microblog data weighted statistical method of registering considering user distribution area differentiation
CN109710831A (en) * 2018-12-28 2019-05-03 四川新网银行股份有限公司 A kind of network crawler system based on browser plug-in
CN110046293A (en) * 2019-03-01 2019-07-23 清华大学 A kind of user identification relevancy method and device
CN110008691A (en) * 2019-04-16 2019-07-12 苏州浪潮智能科技有限公司 A kind of method, system and the equipment of open interface service call
CN111131268A (en) * 2019-12-27 2020-05-08 南京邮电大学 User data acquisition and storage system and method based on microblog platform
CN115168690A (en) * 2022-09-06 2022-10-11 深圳市明源云科技有限公司 Data query method and device based on browser plug-in, electronic equipment and medium
CN115168690B (en) * 2022-09-06 2022-12-27 深圳市明源云科技有限公司 Data query method and device based on browser plug-in, electronic equipment and medium
CN116340270A (en) * 2023-05-31 2023-06-27 深圳市科力锐科技有限公司 Concurrent traversal enumeration method, device, equipment and storage medium
CN116340270B (en) * 2023-05-31 2023-07-28 深圳市科力锐科技有限公司 Concurrent traversal enumeration method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103440139A (en) Acquisition method and tool facing microblog IDs (identitiesy) of mainstream microblog websites
CN103559235B (en) A kind of online social networks malicious web pages detection recognition methods
CN100559374C (en) The intercepting of info web unit, the method that merges
CN102355488B (en) Crawler seed obtaining method and equipment and crawler crawling method and equipment
US8898272B1 (en) Identifying information in resource locators
CN103546326B (en) Website traffic statistic method
CN109905288B (en) Application service classification method and device
CN104869009A (en) Website data statistics system and method
CN103023714A (en) Activeness and cluster structure analyzing system and method based on network topics
CN107437026B (en) Malicious webpage advertisement detection method based on advertisement network topology
Zhang et al. Developing a dark web collection and infrastructure for computational and social sciences
CN102999572A (en) User behavior mode digging system and user behavior mode digging method
CN105589953A (en) Unexpected public health event internet text extraction method
Malik et al. A framework for collecting youtube meta-data
Bhargav et al. Pattern discovery and users classification through web usage mining
CN104615627A (en) Event public sentiment information extracting method and system based on micro-blog platform
CN106021418A (en) News event clustering method and device
CN102968591A (en) Malicious-software characteristic clustering analysis method and system based on behavior segment sharing
CN107862039A (en) Web data acquisition methods, system and Data Matching method for pushing
CN103905434A (en) Method and device for processing network data
CN103618742A (en) Method and system for acquiring sub domain names and webmaster permission verification method
Xu et al. A novel model for user clicks identification based on hidden semi-Markov
KR20120090131A (en) Method, system and computer readable recording medium for providing search results
Nemeslaki et al. Web crawler research methodology
CN111611508B (en) Identification method and device for actual website access of user

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20131211