Detailed Description of Embodiments
Although the present invention is described below with reference to the accompanying drawings, which show preferred embodiments of the invention, it should be understood before this description that those of ordinary skill in the art may modify the invention described herein while still obtaining its technical effects. The following description should therefore be understood as a broad disclosure addressed to those of ordinary skill in the art, and its content is not intended to limit the exemplary embodiments described herein.
Referring to Fig. 1, the method of the present invention for searching for similar users of a video website comprises the following steps.
Step 1: perform statistical analysis of the content viewed by users. The users' video viewing records over a period of time (for example, one week) are collected and combined with video content descriptors to obtain the number of times and the frequency with which each user viewed each kind of video content. The video content descriptors consist mainly of video tags, keywords, and words segmented from video titles. All three give a brief, abstract description of the video content and characterize it efficiently; different videos with similar content are likely to share tags or keywords. Counting each user's viewing frequency on different content from the viewing records, in combination with the video content descriptors, therefore reflects the user's interest preferences effectively.
Step 1 further comprises:
Step 1.1: using the viewing records of the video users, count the number of times each user viewed each video during the period, obtaining a video viewing list of the form "user ID---video ID---view count";
Step 1.2: from the video information, extract a video information list of the form "video ID---tag 1, tag 2, ..., tag i", and combine it with the video viewing list to generate a content viewing list of the form "user ID---tag i---view count";
Step 1.3: merge the content viewing records that have the same user ID, and use the view count of tag i to calculate the viewing frequency of tag i, that is, each user's viewing frequency on tag i during the period. The calculation is:
$$tf_i = \frac{C_i}{\sum_{j \in T} C_j}$$
where $tf_i$ is the frequency of tag i, $C_i$ and $C_j$ are the numbers of times the user viewed tag i and tag j respectively, and $T$ is the set of all tags viewed by this user.
Step 1 thus yields the video content each user viewed in the most recent period and the viewing frequency of each kind of content.
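A minimal Python sketch of steps 1.1 to 1.3, assuming that the frequency of a tag is its view count divided by the user's total view count over all tags in T. The function name `tag_frequencies`, the input shapes, and the sample data are illustrative assumptions and are not part of the original disclosure; only tags are used as descriptors here.

```python
from collections import defaultdict


def tag_frequencies(view_counts, video_tags):
    """Compute per-user tag viewing frequencies.

    view_counts: iterable of (user_id, video_id, count) tuples (step 1.1 list).
    video_tags:  dict mapping video_id -> list of tags (step 1.2 list).
    Returns a dict: user_id -> {tag: frequency}.
    """
    # Step 1.2: expand each video view into per-tag view counts,
    # merging records that share a user ID (first half of step 1.3).
    tag_counts = defaultdict(lambda: defaultdict(int))
    for user_id, video_id, count in view_counts:
        for tag in video_tags.get(video_id, []):
            tag_counts[user_id][tag] += count

    # Step 1.3: normalize each user's tag counts by that user's total.
    freqs = {}
    for user_id, counts in tag_counts.items():
        total = sum(counts.values())
        if total > 0:
            freqs[user_id] = {tag: c / total for tag, c in counts.items()}
    return freqs


# Hypothetical sample data.
views = [("u1", "v1", 3), ("u1", "v2", 1), ("u2", "v2", 2)]
tags = {"v1": ["phone", "review"], "v2": ["phone"]}
freqs = tag_frequencies(views, tags)
```

In this sample, user `u1` accumulates 4 views on `phone` and 3 on `review`, so the two frequencies are 4/7 and 3/7.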
Step 2: build an inverted index of users. From the viewing records obtained by the statistical analysis in step 1, an inverted index of users is built on the video content descriptors. Each index entry uses a video content descriptor as the index key and, as the index value, the IDs of all users who viewed that descriptor together with their viewing frequencies.
Step 2 further comprises:
Step 2.1: taking a tag as the index key, collect all users who viewed the tag and each user's viewing frequency, and calculate the total number of users who viewed the tag;
Step 2.2: apply a hash function to the tags and partition the index file into blocks;
Step 2.3: store the viewing information of each tag in the block corresponding to its hash value.
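The index construction in steps 2.1 to 2.3 could be sketched as follows. This is only an illustration: the use of SHA-256, the block count of 16, the in-memory list standing in for file blocks, and the names `block_id`, `build_index`, and `lookup` are all assumptions, since the description does not specify a hash function or a storage layout.

```python
import hashlib

NUM_BLOCKS = 16  # hypothetical number of index blocks


def block_id(tag, num_blocks=NUM_BLOCKS):
    """Map a tag to a block number using a stable hash (assumed: SHA-256)."""
    digest = hashlib.sha256(tag.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_blocks


def build_index(freqs, num_blocks=NUM_BLOCKS):
    """Build a blocked inverted index from user -> {tag: frequency}.

    Each block is a dict: tag -> {"user_count": int, "postings": {user: tf}}.
    """
    blocks = [dict() for _ in range(num_blocks)]
    for user_id, tag_freqs in freqs.items():
        for tag, tf in tag_freqs.items():
            block = blocks[block_id(tag, num_blocks)]
            entry = block.setdefault(tag, {"user_count": 0, "postings": {}})
            entry["postings"][user_id] = tf
            # Step 2.1: total number of users who viewed this tag.
            entry["user_count"] = len(entry["postings"])
    return blocks


def lookup(blocks, tag):
    """Recompute the hash to find the block, then return the tag's entry or None."""
    return blocks[block_id(tag, len(blocks))].get(tag)


# Hypothetical sample data.
sample_freqs = {"u1": {"phone": 0.75, "review": 0.25}, "u2": {"phone": 1.0}}
index = build_index(sample_freqs)
phone_entry = lookup(index, "phone")
```

Because the block number is derived from the tag alone, a search only needs to read the one block that can contain the tag.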
Step 3: search for similar users and calculate similarity. Using the video viewing records of the seed users, with video content descriptors as search keys, similar users are searched for in the index file and the similarity of each matching user is calculated at the same time, giving preliminary search results.
Step 3 further comprises:
Step 3.1: analyze the viewing records of the seed users and search for each seed-user tag, obtaining the total number of users who viewed the tag, their user IDs, and the corresponding viewing frequencies;
Step 3.2: calculate a similarity for each user returned by the search, as follows:
where $S_{ui}$ is the similarity of user u on tag i, $tf_{ui}$ is the frequency with which user u viewed tag i, $D$ is the total number of users, and $P_i$ is the total number of users who viewed tag i;
Step 3.4: analyze together the results returned by the searches on all tags viewed by the seed users, and calculate a comprehensive similarity for each returned user, as follows:
where $Score_u$ is the comprehensive similarity of user u and $S_{ui}$ is the similarity of user u on tag i.
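One way to realize step 3 is TF-IDF-style weighting. The sketch below assumes $S_{ui} = tf_{ui} \cdot \log(D / P_i)$ and $Score_u = \sum_i S_{ui}$ over the seed tags; both forms are assumptions chosen to match the symbols defined above, not formulas confirmed by the description. The function name `search_similar`, the plain-dict index, and the sample data are likewise hypothetical.

```python
import math
from collections import defaultdict


def search_similar(seed_tags, index, total_users):
    """Score every user who viewed at least one seed tag.

    seed_tags:   iterable of tags viewed by the seed users.
    index:       dict tag -> {user_id: viewing frequency tf_ui}.
    total_users: D, the total number of users.
    Returns a dict: user_id -> comprehensive similarity (assumed sum of S_ui).
    """
    scores = defaultdict(float)
    for tag in seed_tags:
        postings = index.get(tag)
        if not postings:
            continue  # nobody viewed this tag
        # Assumed rarity weight: log(D / P_i), with P_i = users who viewed tag i.
        weight = math.log(total_users / len(postings))
        for user_id, tf in postings.items():
            scores[user_id] += tf * weight  # accumulate assumed S_ui
    return dict(scores)


# Hypothetical sample index and query.
sample_index = {"phone": {"u1": 0.5, "u2": 1.0}, "review": {"u1": 0.5}}
scores = search_similar(["phone", "review"], sample_index, total_users=4)
```

Under this assumption, tags viewed by few users contribute more to the score than tags viewed by almost everyone.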
Step 4: rank the search results. The preliminary search results are sorted in descending order of similarity and then filtered to obtain the final set of similar users. Specifically, the returned users are sorted in descending order of comprehensive similarity, a suitable similarity threshold may be applied to filter the results, and the sorted and filtered results are output.
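Step 4 might look like the following sketch. The threshold value, the optional exclusion of the seed users themselves, and the name `rank_and_filter` are assumptions; the description only states that a suitable similarity threshold may be used.

```python
def rank_and_filter(scores, threshold=0.0, exclude=()):
    """Sort users by descending score and drop low-scoring or excluded users.

    scores:    dict user_id -> comprehensive similarity.
    threshold: minimum score to keep (hypothetical default of 0.0).
    exclude:   user IDs to drop, e.g. the seed users (an assumed filter).
    Returns a list of (user_id, score), highest score first; ties are
    broken by user ID so the output order is deterministic.
    """
    excluded = set(exclude)
    kept = [
        (user_id, score)
        for user_id, score in scores.items()
        if score >= threshold and user_id not in excluded
    ]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical scores.
sample_scores = {"a": 4.1, "b": 7.4, "c": 0.3, "seed": 9.0}
ranked = rank_and_filter(sample_scores, threshold=1.0, exclude=["seed"])
```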
Referring to Fig. 2, the present invention also provides a system for searching for similar users of a video website, comprising:
A statistical analysis device, which performs statistical analysis of the content viewed by users: it collects the users' video viewing records over a period of time and combines them with video content descriptors to obtain the number of times and the frequency with which each user viewed each kind of video content, the video content descriptors consisting of video tags, keywords, and words segmented from video titles.
The statistical analysis device uses the viewing records of the video users to count the number of times each user viewed each video during the period, obtaining a video viewing list of the form "user ID---video ID---view count"; from the video information it extracts a video information list of the form "video ID---tag 1, tag 2, ..., tag i" and combines it with the video viewing list to generate a content viewing list of the form "user ID---tag i---view count"; it then merges the content viewing records that have the same user ID and uses the view count of tag i to calculate the viewing frequency of tag i, that is, each user's viewing frequency on tag i during the period. The calculation is:
$$tf_i = \frac{C_i}{\sum_{j \in T} C_j}$$
where $tf_i$ is the frequency of tag i, $C_i$ and $C_j$ are the numbers of times the user viewed tag i and tag j respectively, and $T$ is the set of all tags viewed by this user.
An indexing device, which builds an inverted index of users: from the viewing records obtained by the statistical analysis, it builds an inverted index of users on the video content descriptors. Each index entry uses a video content descriptor as the index key and, as the index value, the IDs of all users who viewed that descriptor together with their viewing frequencies.
The indexing device takes a tag as the index key, collects all users who viewed the tag and each user's viewing frequency, and calculates the total number of users who viewed the tag; it applies a hash function to the tags and partitions the index file into blocks; and it stores the viewing information of each tag in the block corresponding to its hash value.
A computing device, which searches for similar users and calculates similarity: using the video viewing records of the seed users, with video content descriptors as search keys, it searches the index file for similar users and at the same time calculates the similarity of each matching user, giving preliminary search results.
The computing device analyzes the viewing records of the seed users and searches for each seed-user tag, obtaining the total number of users who viewed the tag, their user IDs, and the corresponding viewing frequencies;
it calculates a similarity for each user returned by the search, as follows:
where $S_{ui}$ is the similarity of user u on tag i, $tf_{ui}$ is the frequency with which user u viewed tag i, $D$ is the total number of users, and $P_i$ is the total number of users who viewed tag i;
it analyzes together the results returned by the searches on all tags viewed by the seed users and calculates a comprehensive similarity for each returned user, as follows:
where $Score_u$ is the comprehensive similarity of user u and $S_{ui}$ is the similarity of user u on tag i.
A sorting device, which ranks the search results: it sorts the preliminary search results in descending order of similarity and filters them to obtain the final set of similar users.
The sorting device sorts the returned users in descending order of comprehensive similarity, may apply a suitable similarity threshold to filter the results, and outputs the sorted and filtered results.
The system and method of the present invention are further described below through two examples.
Example 1: similar-audience search on a video website.
A website has a video set S = {V_1, ..., V_n}. Each video has a group of content descriptors (that is, tags), and different videos may share descriptors. The website also holds the users' viewing records for the most recent week, R = {U_1---V_x---C_1x, ..., U_n---V_y---C_ny}, in the form "user ID---video ID---view count".
Step 1: using tags as the description of video content, count from the tag information of each video the number of times each user viewed each tag during the week, obtaining viewing records of the form "user ID---tag---view count". The viewing records with the same user ID are merged to obtain all tags viewed by each user, and the viewing frequency of each tag is calculated as:
$$tf_i = \frac{C_i}{\sum_{j \in T} C_j}$$
where $tf_i$ is the frequency of tag i, $C_i$ and $C_j$ are the numbers of times the user viewed tag i and tag j respectively, and $T$ is the set of all tags viewed by this user. This gives each user's viewing frequency on each tag; part of the viewing records is shown below:
Table 1: Example of user viewing-content records
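As a hypothetical worked example of the frequency calculation (the counts are illustrative and are not taken from Table 1), assume the frequency of a tag is its view count divided by the user's total view count, and suppose a user viewed tag A three times and tag B once during the week, so that $T = \{A, B\}$:

$$tf_A = \frac{C_A}{C_A + C_B} = \frac{3}{3 + 1} = 0.75, \qquad tf_B = \frac{C_B}{C_A + C_B} = \frac{1}{3 + 1} = 0.25$$

Under this assumption the frequencies of a single user always sum to 1.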
Step 2: with tags as index keys, build an inverted index over the viewing information. Each tag is hashed to obtain a hash value; the inverted index file is partitioned into a suitable number of blocks, and a mapping from hash values to file blocks is established; the viewing information of each tag (the total number of users who viewed the tag, the IDs of all users who viewed it, and each user's viewing frequency) is stored in the file block corresponding to the tag's hash value.
Step 3: for the given seed-user viewing records, search the inverted index file using the viewed tags. For each tag viewed by a seed user, the same hash function is used to compute the hash value and locate the corresponding block of the inverted index file; the viewing information in that block is read to obtain the total number of users who viewed the tag, all their user IDs, and their viewing frequencies; and the similarity of each user who viewed the tag is calculated as follows:
where $S_{ui}$ is the similarity of user u on tag i, $tf_{ui}$ is the frequency with which user u viewed tag i, $D$ is the total number of users, and $P_i$ is the total number of users who viewed tag i.
The results returned by the searches on all tags viewed by the seed users are then analyzed together, and a comprehensive similarity is calculated for each returned user, as follows:
where $Score_u$ is the comprehensive similarity of user u and $S_{ui}$ is the similarity of user u on tag i.
Step 4: sort the users in descending order of comprehensive similarity, apply filtering, and output the sorted results.
The resulting seed file and search results are shown below:
Table 2: Seed users
User ID | Comprehensive similarity
1414805406362bou | 7.457061423316192
1411422657876HQS | 7.457061423316192
1414897033491tst | 6.188232499062661
1414225525441rHY | 5.067268407706754
1376735750584cE7 | 4.97137438163828
1413197549819YYw | 4.97137438163828
1414929307620uum | 4.125488415218207
1401986230544u2n | 4.125488415218207
1396228567787C4I | 4.125488415218207
1413550544110F75 | 4.125488415218207
1414835997319Vst | 4.125488415218207
14148266200180f2 | 4.125488415218207
1413333051347w4D | 4.125488415218207
1403043694606LSF | 4.125488415218207
Table 3: Partial search results returned
Table 4: Content viewing records of some similar users
Example 2: audience expansion for a product.
A product has identified a small target group U = {U1, ..., Um} and is to be promoted on a video website, with the requirement that the promotion audience be users whose interests are similar to those of the target group U. The website also holds the users' viewing records for the most recent week, R = {U_1---V_x---C_1x, ..., U_n---V_y---C_ny}, in the form "user ID---video ID---view count".
Step 1: using the website's viewing records, look up the video viewing records of the users in the target group U and combine them with the video information to obtain the target group's viewing records by video tag. Using the information about the product, screen the tags viewed by the target group and filter out irrelevant tags. The filtered viewing records serve as the seed for the search. Then, for all viewing records of the most recent week, using tags as the description of video content, count from the tag information of each video the number of times each user viewed each tag during the week, obtaining viewing records of the form "user ID---tag---view count".
The viewing records with the same user ID are merged to obtain all tags viewed by each user, and the viewing frequency of each tag is calculated as:
$$tf_i = \frac{C_i}{\sum_{j \in T} C_j}$$
where $tf_i$ is the frequency of tag i, $C_i$ and $C_j$ are the numbers of times the user viewed tag i and tag j respectively, and $T$ is the set of all tags viewed by this user. This gives each user's viewing frequency on each tag.
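The seed construction at the start of step 1 could be sketched as follows. How tag relevance to the product is judged is not specified in the description, so this sketch simply assumes a given set of relevant tags; the name `build_seed_tags`, the `relevant_tags` parameter, and the sample data are hypothetical.

```python
def build_seed_tags(target_users, view_counts, video_tags, relevant_tags):
    """Collect the product-relevant tags viewed by the target group.

    target_users:  iterable of user IDs in the target group U.
    view_counts:   iterable of (user_id, video_id, count) tuples.
    video_tags:    dict mapping video_id -> list of tags.
    relevant_tags: set of tags judged relevant to the product (assumed input).
    Returns the set of seed tags used as search keys.
    """
    targets = set(target_users)
    seed = set()
    for user_id, video_id, _count in view_counts:
        if user_id not in targets:
            continue  # only the target group's records form the seed
        for tag in video_tags.get(video_id, []):
            if tag in relevant_tags:  # drop tags unrelated to the product
                seed.add(tag)
    return seed


# Hypothetical sample data.
sample_views = [("c1", "v1", 2), ("c2", "v2", 1), ("x", "v3", 5)]
sample_tags = {"v1": ["Honor 3C", "music"], "v2": ["iPhone 5s"], "v3": ["cooking"]}
seed_tags = build_seed_tags(
    ["c1", "c2"], sample_views, sample_tags, {"Honor 3C", "iPhone 5s", "cooking"}
)
```

Here `cooking` is listed as relevant but is excluded from the seed because only a non-target user viewed it.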
Step 2: with tags as index keys, build an inverted index over the viewing information. Each tag is hashed to obtain a hash value; the inverted index file is partitioned into a suitable number of blocks, and a mapping from hash values to file blocks is established; the viewing information of each tag (the total number of users who viewed the tag, the IDs of all users who viewed it, and each user's viewing frequency) is stored in the file block corresponding to the tag's hash value.
Step 3: for the given seed-user viewing records, search the inverted index file using the viewed tags. For each tag viewed by a seed user, the same hash function is used to compute the hash value and locate the corresponding block of the inverted index file; the viewing information in that block is read to obtain the total number of users who viewed the tag, all their user IDs, and their viewing frequencies; and the similarity of each user who viewed the tag is calculated as follows:
where $S_{ui}$ is the similarity of user u on tag i, $tf_{ui}$ is the frequency with which user u viewed tag i, $D$ is the total number of users, and $P_i$ is the total number of users who viewed tag i.
The results returned by the searches on all tags viewed by the seed users are then analyzed together, and a comprehensive similarity is calculated for each returned user, as follows:
where $Score_u$ is the comprehensive similarity of user u and $S_{ui}$ is the similarity of user u on tag i.
Step 4: sort the users in descending order of comprehensive similarity, apply filtering, and output the sorted results.
An example of the content viewed by this product's target customers is as follows:
Target customer | Video tag viewed by the target customer | Product information of interest to the target customer
Customer 1 | Huawei Honor 3C | Honor 3C Unicom edition
Customer 2 | iPhone 5s | iPhone 5s phone case
Customer 3 | Meizu MX3 | MX3
Customer 4 | Xplay launch event | Bubukao xplay3
Table 5: Example of customer viewing content
Part of the search results is shown below:
User ID | Comprehensive similarity
1403339995050JHU | 6.344749147553775
1414920358781u4V | 6.344749147553775
14046215115455ID | 6.344749147553775
1414887141725RG9 | 6.344749147553775
1403888781082S88 | 6.344749147553775
1408775203633njo | 6.344749147553775
1400511822703RkC | 6.344749147553775
1414852321322EFa | 6.007725367960162
1414934708013Eut | 6.007725367960162
141126880285943L | 6.007725367960162
1414923557154foW | 6.007725367960162
1414856887921mCx | 6.007725367960162
Table 6: Partial search results returned
Table 7: Viewing records of some similar users
Having described the preferred embodiments of the present invention in detail, those of ordinary skill in the art will clearly understand that various changes and modifications may be made without departing from the scope and spirit of the appended claims, and that the present invention is not limited to the implementations of the exemplary embodiments given in the specification.