CN103593469B - Method and device for computing associated keywords using complementary information - Google Patents

Publication number
CN103593469B
CN103593469B CN201310620943.XA CN201310620943A CN103593469B CN 103593469 B CN103593469 B CN 103593469B CN 201310620943 A CN201310620943 A CN 201310620943A CN 103593469 B CN103593469 B CN 103593469B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310620943.XA
Other languages
Chinese (zh)
Other versions
CN103593469A (en)
Inventor
刘伟
姚键
潘柏宇
卢述奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Youku Network Technology Beijing Co Ltd
Original Assignee
1Verge Internet Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 1Verge Internet Technology Beijing Co Ltd
Priority to CN201310620943.XA
Publication of CN103593469A
Application granted
Publication of CN103593469B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data

Abstract

A method and device for computing associated keywords using complementary information. The method takes user query data as its basis and supplements it with post-query playback data and user-uploaded video data to obtain a unified event set. An association rule algorithm is then used to find association rules related to persons and events in the event set, and the associated keywords are finally parsed from the association rules. The invention combines the complementary strengths of user queries, videos played after queries, and user-uploaded data, and thereby avoids the biased person-related keyword results obtained when a single data source is used. Adding post-query playback data yields the keywords users are genuinely interested in; adding user-uploaded data avoids the problem that users do not know which keywords to search for; and raising the thresholds for events and association rules yields higher accuracy.

Description

Method and device for computing associated keywords using complementary information
Technical field
The application relates to a keyword calculation method and device for search and, in particular, to a method and device for computing associated keywords using complementary information.
Background
Video service websites play the roles of playing, discovering, and broadcasting media. People often want to know about content related to a particular person, so they browse or search video websites for content related to the person they care about. Current video websites let users browse person-related content through editor-curated content sections and hot-search lists, and let users reach more person-related content through video search. However, manual editing is slow and is limited by the editors' information sources and working hours, so the content is neither broad nor timely enough; a hot-search list covers only the few dozen most-searched persons and cannot satisfy users' wide range of interests; and a search on a person keyword may present the user with much content that he or she does not care about. Likewise, a user may wish to search for a certain landmark event and obtain the keywords related to that event. How to obtain the keywords related to a person or event by searching for that person or event has therefore become a technical problem in urgent need of a solution.
Summary of the invention
The object of the invention is to provide a high-accuracy method and device that automatically discover the keywords associated with a person or event, that find such keywords efficiently, and that cover a large number of persons, thereby solving the problems of coverage breadth and efficient updating.
To achieve this object, the invention adopts the following technical solutions:
A method for computing associated keywords using complementary information, comprising the following steps:
Step S110, constructing a unified event set: collect the records related to searches or videos, all of which together form the event set; perform word segmentation on each record in the event set; scan the segmented text records in order and assign each word an increasing numeric value according to the order in which it first appears, as the word ID of that word, so that every record is converted into a sequence of numbers; and save each word and its corresponding word ID to a dictionary file;
Step S120, computing the average occurrence count of word IDs: traverse the event set and count the number of occurrences of each word ID, where multiple occurrences of the same word ID within one event are counted only once; divide the total number of occurrences of all word IDs by the number of word IDs to obtain the average occurrence count of word IDs;
Step S130, building the level-1 item set: traverse all word IDs and find those whose occurrence count exceeds the average occurrence count; each such word ID becomes a level-1 item, and all level-1 items together form the level-1 item set;
Step S140, building the next-level item set: the item set formed in the preceding step is called the base item set, and each base item contains n word IDs, n ≥ 1; find pairs of base items that satisfy the following condition and apply the union operation to them,
the condition being: the two base items comprise a first base item and a second base item; with the word IDs in each of the two base items sorted in ascending order, the first n-1 word IDs of the first base item are identical to those of the second base item, and the n-th word ID of the first base item is smaller than the n-th word ID of the second base item,
the union of the two base items yields a next-level item containing n+1 word IDs; traverse the event set and count the events that contain all word IDs of the next-level item; if that event count exceeds the average occurrence count, retain the next-level item, otherwise discard it; all retained next-level items together form the next-level item set;
Step S150, deciding whether to continue building: following the method of step S140, determine whether a higher-level item set can be built; if so, return to step S140, otherwise proceed to step S160;
Step S160, screening association rules: first define a threshold TH for screening association rules; for each final item D in the final item set obtained, screen association rules as follows:
the final item D contains m word IDs; take 1 to m-1 of these word IDs to form multiple proper subsets E; for each proper subset E, count in the event set the events that contain the final item D and the events that contain the proper subset E, denoted Cnt(D) and Cnt(E) respectively; compute Cnt(D)/Cnt(E) to obtain the probability P(D|E); if P(D|E) is greater than TH, the proper subset is considered to imply the final item, an association rule is formed, and it is recorded to obtain the association rule set;
Step S170, text restoration: using the dictionary file, traverse the association rule set and restore the text of every association rule by looking up each word ID in the proper subset E and the final item D in the dictionary file to obtain the original text; the words in the proper subset are considered to lead to the remaining words of the final item that are not in the proper subset.
Preferably, in the step of constructing a unified event set, the records related to searches or videos comprise user query data, data on videos played after user queries, and user-uploaded video data, where each user query, each video playback, and each user video upload serves as one record.
Preferably, in the step of building the next-level item set, when the level-2 item set is built, the word IDs in the level-1 item set are combined in pairs to obtain multiple level-2 items, each containing two elements; for each level-2 item obtained, count in the event set the events that contain the level-2 item; if that event count exceeds the average occurrence count, retain the item, otherwise discard it; the retained level-2 items together form the level-2 item set.
Preferably, the threshold TH is set manually such that the association rules it screens out essentially reflect the correlation between the user query data and the other words in the playback data or user-uploaded video data.
Preferably, in the text restoration step, proper subsets containing only person names are selected to obtain the keywords related to persons.
The invention also discloses a device for computing associated keywords using complementary information, comprising the following units:
Unified event set construction unit: collects the records related to searches or videos, all of which together form the event set; performs word segmentation on each record in the event set; scans the segmented text records in order and assigns each word an increasing numeric value according to the order in which it first appears, as the word ID of that word, so that every record is converted into a sequence of numbers; and saves each word and its corresponding word ID to a dictionary file;
Word ID average occurrence count unit: traverses the event set and counts the number of occurrences of each word ID, where multiple occurrences of the same word ID within one event are counted only once; divides the total number of occurrences of all word IDs by the number of word IDs to obtain the average occurrence count of word IDs;
Level-1 item set building unit: traverses all word IDs and finds those whose occurrence count exceeds the average occurrence count; each such word ID becomes a level-1 item, and all level-1 items together form the level-1 item set;
Next-level item set building unit: the item set formed by the preceding unit is called the base item set, and each base item contains n word IDs, n ≥ 1; the unit finds pairs of base items that satisfy the following condition and applies the union operation to them,
the condition being: the two base items comprise a first base item and a second base item; with the word IDs in each of the two base items sorted in ascending order, the first n-1 word IDs of the first base item are identical to those of the second base item, and the n-th word ID of the first base item is smaller than the n-th word ID of the second base item,
the union of the two base items yields a next-level item containing n+1 word IDs; the unit traverses the event set and counts the events that contain all word IDs of the next-level item; if that event count exceeds the average occurrence count, the next-level item is retained, otherwise it is discarded; all retained next-level items together form the next-level item set;
Continue-building decision unit: determines, according to the next-level item set building unit, whether a higher-level item set can be built; if so, control returns to the next-level item set building unit, otherwise it passes to the association rule screening unit;
Association rule screening unit: first defines a threshold TH for screening association rules; for each final item D in the final item set obtained, association rules are screened as follows:
the final item D contains m word IDs; 1 to m-1 of these word IDs are taken to form multiple proper subsets E; for each proper subset E, the events in the event set that contain the final item D and those that contain the proper subset E are counted, denoted Cnt(D) and Cnt(E) respectively; Cnt(D)/Cnt(E) is computed to obtain the probability P(D|E); if P(D|E) is greater than TH, the proper subset is considered to imply the final item, an association rule is formed, and it is recorded to obtain the association rule set;
Text restoration unit: using the dictionary file, traverses the association rule set and restores the text of every association rule by looking up each word ID in the proper subset E and the final item D in the dictionary file to obtain the original text; the words in the proper subset are considered to lead to the remaining words of the final item that are not in the proper subset.
Preferably, in the unified event set construction unit, the records related to searches or videos comprise user query data, data on videos played after user queries, and user-uploaded video data, where each user query, each video playback, and each user video upload serves as one record.
Preferably, in the next-level item set building unit, when the level-2 item set is built, the word IDs in the level-1 item set are combined in pairs to obtain multiple level-2 items, each containing two elements; for each level-2 item obtained, the events in the event set that contain the level-2 item are counted; if that event count exceeds the average occurrence count, the item is retained, otherwise it is discarded; the retained level-2 items together form the level-2 item set.
Preferably, the threshold TH is set manually such that the association rules it screens out essentially reflect the correlation between the user query data and the other words in the playback data or user-uploaded video data.
Preferably, in the text restoration unit, proper subsets containing only person names are selected to obtain the keywords related to persons.
The invention thus combines the complementary strengths of user queries, videos played after queries, and user-uploaded data, and avoids the biased person-related keyword results obtained when a single data source is used. Adding post-query playback data yields the keywords users are genuinely interested in; adding user-uploaded data avoids the problem that users do not know which keywords to search for; and raising the thresholds for events and association rules yields higher accuracy.
Brief description of the drawings
Fig. 1 is a flowchart of the associated keyword calculation method according to an embodiment of the invention;
Fig. 2 is a module block diagram of the associated keyword calculation device according to an embodiment of the invention.
Detailed description
The invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention rather than the entire structure.
The invention takes user query data as its basis and supplements it with post-query playback data and user-uploaded video data to obtain a unified event set. An association rule algorithm is used to find association rules related to persons and events in the event set. Finally, the associated keywords are parsed from the association rules.
Embodiment 1:
Referring to Fig. 1, a flowchart of the method for computing associated keywords using complementary information according to the invention is shown. The method comprises the following steps:
Step S110, constructing a unified event set: collect the records related to searches or videos, all of which together form the event set; perform word segmentation on each record in the event set; scan the segmented text records in order and assign each word an increasing numeric value according to the order in which it first appears, as the word ID of that word (for example, the values may start from 1 and increase sequentially), so that every text record is converted into a sequence of numbers; and save each word and its corresponding word ID, for example to a dictionary file.
Preferably, the records related to searches or videos comprise user query data, data on videos played after user queries, and user-uploaded video data, where each user query, each video playback, and each user video upload serves as one record.
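The ID assignment of step S110 can be sketched as follows. This is a minimal illustration, assuming the records have already been segmented into word lists; the function name `build_event_set`, the in-memory dictionary used in place of a dictionary file, and the sample tokens are assumptions made for the example and do not come from the patent.

```python
def build_event_set(segmented_records):
    """Convert segmented records into lists of word IDs.

    IDs start at 1 and increase in order of each word's first appearance.
    Returns the event set and the word -> ID dictionary.
    """
    word_to_id = {}
    events = []
    for words in segmented_records:
        ids = []
        for word in words:
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id) + 1  # next unused ID
            ids.append(word_to_id[word])
        events.append(ids)
    return events, word_to_id


# Hypothetical, already-segmented records (one per query, playback, or upload).
records = [["a"], ["a", "b", "c"], ["a", "d", "b"]]
events, dictionary = build_event_set(records)
print(events)      # [[1], [1, 2, 3], [1, 4, 2]]
print(dictionary)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}
```

In practice the dictionary would be written to a file so that the text restoration step can map IDs back to words.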
Step S120, computing the average occurrence count of word IDs: traverse the event set and count the number of occurrences of each word ID, where multiple occurrences of the same word ID within one event are counted only once; divide the total number of occurrences of all word IDs by the number of word IDs to obtain the average occurrence count of word IDs.
Step S130, building the level-1 item set: traverse all word IDs and find those whose occurrence count exceeds the average occurrence count; each such word ID becomes a level-1 item, and all level-1 items together form the level-1 item set;
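Steps S120 and S130 can be sketched together as below. This is a minimal sketch under the assumption that an event is a list of integer word IDs; the function name `level1_items` and the sample events are illustrative only.

```python
from collections import Counter


def level1_items(events):
    """Return (average occurrence count, sorted level-1 items)."""
    counts = Counter()
    for event in events:
        counts.update(set(event))  # a word ID counts at most once per event
    average = sum(counts.values()) / len(counts)
    # Keep only word IDs that occur in more events than the average.
    items = sorted((word_id,) for word_id, c in counts.items() if c > average)
    return average, items


# Hypothetical events: word ID 1 appears twice in the first event but counts once.
events = [[1, 1, 2], [1, 3], [1, 2]]
average, items = level1_items(events)
print(average, items)  # 2.0 [(1,)]
```

Items are represented as sorted tuples so that the prefix comparison used when building higher levels is a simple slice comparison.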
Step S140, building the next-level item set: the item set formed in the preceding step is called the base item set, and each base item contains n word IDs, n ≥ 1; find pairs of base items that satisfy the following condition and apply the union operation to them, which is equivalent to the logical OR operation,
the condition being: the two base items comprise a first base item and a second base item, for example a first base item A and a second base item B; with the word IDs in each item sorted in ascending order, the first n-1 word IDs of the first base item A are identical to those of the second base item B, and the n-th word ID of the first base item A is smaller than the n-th word ID of the second base item B,
the union of the two base items yields a next-level item C containing n+1 word IDs; traverse the event set and count the events that contain all word IDs of the next-level item C; if that event count exceeds the average occurrence count, the next-level item C is retained, otherwise it is discarded;
all retained next-level items, that is, the items of level n+1, together form the next-level item set, that is, the level-(n+1) item set;
In particular, when the level-2 item set is built, the word IDs in the level-1 item set are combined in pairs to obtain a number of level-2 items, each containing two elements; for each level-2 item obtained, count in the event set the events that contain the level-2 item; if that event count exceeds the average occurrence count, retain the item, otherwise discard it; the retained level-2 items together form the level-2 item set;
Note in particular that, to accommodate the special case of building the level-2 item set, when n=1 in step S140, n-1=0, and the "0th" word IDs of all base items can be regarded as identical, so the prefix condition is considered satisfied. Therefore, when the level-2 item set is built, the level-1 items are united directly, whereas when any other next-level item set is built, the union is applied only if the first n-1 word IDs are identical.
Step S150, deciding whether to continue building: following the method of step S140, determine whether a higher-level item set can be built, that is, whether there exist two base items from which a higher-level item can be constructed whose event count in the event set exceeds the average occurrence count; if so, return to step S140, otherwise proceed to step S160;
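The join condition of step S140 and the loop of step S150 can be sketched as follows. This is a minimal sketch, assuming items are sorted tuples of word IDs and that the final item set is the last non-empty level; the names `join`, `count_events`, and `final_items` are hypothetical. The sample data is the five-event set of Embodiment 2 with its threshold of 2.6.

```python
def join(a, b):
    """Union two sorted n-ID items if they share the first n-1 IDs and a[-1] < b[-1].

    For n = 1 the shared prefix is empty, so any two level-1 items qualify.
    """
    if a[:-1] == b[:-1] and a[-1] < b[-1]:
        return a + (b[-1],)
    return None


def count_events(item, event_sets):
    """Number of events that contain every word ID of the item."""
    return sum(1 for event in event_sets if event.issuperset(item))


def final_items(events, level1, threshold):
    """Build higher levels until none can be built; return the last level."""
    event_sets = [set(event) for event in events]
    current = list(level1)
    while True:
        candidates = {join(a, b) for a in current for b in current}
        candidates.discard(None)  # pairs that failed the join condition
        kept = sorted(c for c in candidates
                      if count_events(c, event_sets) > threshold)
        if not kept:
            return current
        current = kept


events = [[1], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8],
          [1, 2, 3, 4, 5, 6, 8, 9], [1, 3, 4, 6, 8, 10]]
level1 = [(1,), (2,), (3,), (4,), (6,), (8,)]
final = final_items(events, level1, 2.6)
print(final)  # [(1, 3, 4, 6, 8)]
```

Comparing every ordered pair keeps the sketch short; a production implementation would group items by prefix to avoid the quadratic scan.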
Step S160, screening association rules: first define a threshold TH for screening association rules; for each final item D in the final item set obtained, screen association rules as follows:
The final item D contains m word IDs; take 1 to m-1 of these word IDs to form a number of proper subsets E; for each proper subset E, count in the event set the events that contain the final item D and the events that contain the proper subset E, denoted Cnt(D) and Cnt(E) respectively; compute Cnt(D)/Cnt(E) to obtain the probability P(D|E); if P(D|E) is greater than TH, the proper subset is considered to imply the final item, an association rule is formed, and it is recorded to obtain the association rule set.
Preferably, the threshold TH is set manually such that the association rules it screens out essentially reflect the correlation between the user query data, that is, the query string, and the other words in the playback data or user-uploaded video data.
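The screening of step S160 can be sketched as below. This is a minimal sketch, assuming the final item is a tuple of word IDs; the function name `screen_rules` and the rule representation (a triple of subset, final item, and probability) are assumptions for illustration. The data is the five-event set of Embodiment 2.

```python
from itertools import combinations


def screen_rules(final_item, events, th):
    """Return rules (E, D, P(D|E)) for every proper subset E with P(D|E) > th."""
    event_sets = [set(event) for event in events]

    def cnt(ids):
        return sum(1 for event in event_sets if event.issuperset(ids))

    cnt_d = cnt(final_item)
    rules = []
    for size in range(1, len(final_item)):  # subsets of 1 to m-1 word IDs
        for subset in combinations(final_item, size):
            cnt_e = cnt(subset)
            if cnt_e == 0:
                continue  # defensive guard; E never occurs
            probability = cnt_d / cnt_e
            if probability > th:
                rules.append((subset, final_item, probability))
    return rules


events = [[1], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8],
          [1, 2, 3, 4, 5, 6, 8, 9], [1, 3, 4, 6, 8, 10]]
final_item = (1, 3, 4, 6, 8)
rules = screen_rules(final_item, events, 0.7)
print(len(rules))  # 29: every proper subset except (1,), whose probability is 0.6
```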
Step S170, text restoration: using the dictionary file, traverse the association rule set and restore the text of every association rule by looking up each word ID in the proper subset E and the final item D in the dictionary file to obtain the original text; the words in the proper subset are considered to lead to the remaining words of the final item that are not in the proper subset.
Preferably, proper subsets containing only person names are selected, so that the keywords related to persons are correctly identified.
Of course, proper subsets containing events may also be selected to obtain the keywords related to events.
Normally, symbol text should be removed during text restoration.
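Text restoration with name selection and symbol removal can be sketched as follows. This is a minimal sketch: the patent does not say how person names or symbols are recognized, so the `names` and `symbols` sets, the function names, and all sample words below are hypothetical.

```python
def restore_rule(subset, final_item, id_to_word, symbols=frozenset()):
    """Map a rule's word IDs back to text.

    Returns (words of the proper subset, remaining words of the final item),
    with symbol text dropped from the remaining words.
    """
    left = [id_to_word[i] for i in subset]
    right = [id_to_word[i] for i in final_item
             if i not in subset and id_to_word[i] not in symbols]
    return left, right


def person_rules(rules, id_to_word, names):
    """Keep only rules whose proper subset consists entirely of person names."""
    return [(e, d) for e, d in rules
            if all(id_to_word[i] in names for i in e)]


# Hypothetical dictionary, name list, and symbol list.
id_to_word = {1: "Alice", 2: "concert", 3: "Bob", 4: "!"}
names = {"Alice", "Bob"}
symbols = {"!"}
final_item = (1, 2, 3, 4)
rules = [((2,), final_item), ((3,), final_item), ((1, 3), final_item)]

selected = person_rules(rules, id_to_word, names)
restored = [restore_rule(e, d, id_to_word, symbols) for e, d in selected]
print(restored)  # [(['Bob'], ['Alice', 'concert']), (['Alice', 'Bob'], ['concert'])]
```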
The invention thus combines the complementary strengths of user queries, videos played after queries, and user-uploaded data, and avoids the biased person-related keyword results obtained when a single data source is used. Adding post-query playback data yields the keywords users are genuinely interested in; adding user-uploaded data avoids the problem that users do not know which keywords to search for; and raising the thresholds for events and association rules yields higher accuracy.
Embodiment 2:
This embodiment discloses a concrete example of the associated keyword calculation method of Embodiment 1:
From the three classes of data (user queries, videos played after user queries, and user uploads) there are five events in total. After they are combined and segmented into words, the following is obtained:
Event 1: Guo Degang (1)
Event 2: Guo Degang (1) latest (2) crosstalk (3)
Event 3: Guo Degang (1) Yu Qian (4) 2013 (5) latest (2) crosstalk (3) 《 (6) you be MashiMaro (7) 》 (8)
Event 4: Guo Degang (1) Yu Qian (4) 2013 (5) latest (2) crosstalk (3) 《 (6) sparking chicken (9) 》 (8)
Event 5: Guo Degang (1) Yu Qian (4) crosstalk (3) 《 (6) timid wealth and rank (10) 》 (8)
The number in parentheses after each word is the ID assigned to that word. There are 10 word IDs in total, and the occurrences of all words add up to 26, so each word occurs 26/10 = 2.6 times on average; the threshold used below is therefore 2.6.
The resulting event set is expressed as:
{{1},{1,2,3},{1,2,3,4,5,6,7,8},{1,2,3,4,5,6,8,9},{1,3,4,6,8,10}}
First, the level-1 frequent item set is found by counting the number of events in which each word occurs:
Word ID    Frequency
1          5
2          3
3          4
4          3
5          2
6          3
7          1
8          3
9          1
10         1
As shown, the counts of word IDs 1, 2, 3, 4, 6, and 8 are greater than the threshold 2.6, so these form the level-1 item set, expressed as {{1},{2},{3},{4},{6},{8}}.
Next, the level-2 item set is constructed. The candidate level-2 items are:
{
{1,2},{1,3},{1,4},{1,6},{1,8},
{2,3},{2,4},{2,6},{2,8},
{3,4},{3,6},{3,7},
{4,6},{4,8},
{6,8}
}
The frequency with which each candidate level-2 item occurs in the events is:
Level-2 item (two word IDs)    Frequency
{1,2}                          3
{1,3}                          4
{1,4}                          3
{1,6}                          3
{1,8}                          3
{2,3}                          3
{2,4}                          2
{2,6}                          2
{2,8}                          2
{3,4}                          3
{3,6}                          3
{3,7}                          1
{4,6}                          3
{4,8}                          3
{6,8}                          3
Because the counts of the level-2 items {2,4}, {2,6}, {2,8}, and {3,7} are below the threshold 2.6, they are deleted, and the resulting level-2 item set is:
{
{1,2},{1,3},{1,4},{1,6},{1,8},
{2,3},
{3,4},{3,6},
{4,6},{4,8},
{6,8}
}
Next, the level-3 item set is constructed. The candidate level-3 items are:
{
{1,2,3},{1,2,4},{1,2,6},{1,2,8},
{1,3,4},{1,3,6},{1,3,8},
{1,4,6},{1,4,8},
{1,6,8},
{3,4,6},
{4,6,8},
}
Level-3 items are obtained from level-2 items as follows. For every two level-2 items, for example {3,4} and {3,6}, the elements within each item are sorted by ID value (here they are already sorted), giving {3,4} and {3,6}. Because the first n-1 (2-1=1) elements of the two level-2 items are identical and the second value of the former, 4, is smaller than the second value of the latter, 6, the level-3 item {3,4,6} is obtained. By contrast, for the two level-2 items {3,4} and {4,8} (both sorted internally by ID), the first n-1 (2-1=1) elements differ, so they cannot be merged into {3,4,8}.
Counting the frequency with which each level-3 item occurs gives:
Level-3 item    Frequency
{1,2,3}         3
{1,2,4}         2
{1,2,6}         2
{1,2,8}         2
{1,3,4}         3
{1,3,6}         3
{1,3,8}         3
{1,4,6}         3
{1,4,8}         3
{1,6,8}         3
{3,4,6}         3
{4,6,8}         3
Because the counts of the level-3 items {1,2,4}, {1,2,6}, and {1,2,8} are below the threshold 2.6, the resulting level-3 item set is:
{
{1,2,3},
{1,3,4},{1,3,6},{1,3,8},
{1,4,6},{1,4,8},
{1,6,8},
{3,4,6},
{4,6,8},
}
Next, the level-4 item set is constructed. The candidate level-4 items are:
{
{1,3,4,6},{1,3,4,8},{1,3,6,8},
{1,4,6,8},
}
The event counts of all of these level-4 items are greater than the threshold 2.6.
Continuing, the level-5 item set is constructed. The only candidate level-5 item is:
{
{1,3,4,6,8},
}
The event count of this single level-5 item is 3, which is greater than the threshold 2.6, and no higher-level item set can be obtained. This level-5 item is therefore the final item.
Construct the proper subsets of this final item; they are:
{
{1},{3},{4},{6},{8},
{1,3},{1,4},{1,6},{1,8},
{3,4},{3,6},{3,8},
{4,6},{4,8},
{6,8},
{1,3,4},{1,3,6},{1,3,8},
{1,4,6},{1,4,8},
{1,6,8},
{3,4,6},{3,4,8},{3,6,8},
{4,6,8},
{1,3,4,6},{1,3,4,8},{1,3,6,8},{1,4,6,8},{3,4,6,8},
}
The frequencies with which these proper subsets occur in the event set are:
Subset       Frequency
{1}          5
{3}          4
{4}          3
{6}          3
{8}          3
{1,3}        4
{1,4}        3
{1,6}        3
{1,8}        3
{3,4}        3
{3,6}        3
{3,8}        3
{4,6}        3
{4,8}        3
{6,8}        3
{1,3,4}      3
{1,3,6}      3
{1,3,8}      3
{1,4,6}      3
{1,4,8}      3
{1,6,8}      3
{3,4,6}      3
{3,4,8}      3
{3,6,8}      3
{4,6,8}      3
{1,3,4,6}    3
{1,3,4,8}    3
{1,3,6,8}    3
{1,4,6,8}    3
{3,4,6,8}    3
The proper subset {1} occurs 5 times, so
P({1,3,4,6,8}|{1}) = 3/5 = 0.6.
If the threshold TH is set to 0.55, this association rule is valid. Translated into words using the dictionary, it reads
{Guo Degang} -> {Guo Degang, crosstalk, Yu Qian, 《, 》}
That is, the keywords associated with Guo Degang are "crosstalk" and "Yu Qian". The title marks 《 》 are filtered out by a symbol rule.
If the threshold TH is set to 0.7, the frequencies of the subsets show that only {1}->{1,3,4,6,8} fails to reach the threshold; that is, the valid association rules are all rules other than {1}->{1,3,4,6,8}.
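The comparison against TH = 0.7 can be written out from the subset frequencies listed above, with D = {1,3,4,6,8} and Cnt(D) = 3. The typeset form below is an illustrative restatement of the text's arithmetic, not part of the patent:

```latex
\begin{aligned}
P(D \mid E) &= \frac{\mathrm{Cnt}(D)}{\mathrm{Cnt}(E)}, \qquad D = \{1,3,4,6,8\},\ \mathrm{Cnt}(D) = 3 \\
E = \{1\}:\quad & P = \tfrac{3}{5} = 0.6 < 0.7 \quad \text{(rule rejected)} \\
E = \{3\} \text{ or } \{1,3\}:\quad & P = \tfrac{3}{4} = 0.75 > 0.7 \quad \text{(rule kept)} \\
\text{any other } E \ (\mathrm{Cnt}(E) = 3):\quad & P = \tfrac{3}{3} = 1 > 0.7 \quad \text{(rule kept)}
\end{aligned}
```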
Restoring the text of the subsets and selecting those that contain only person names (that is, only word 1, Guo Degang, and word 4, Yu Qian) gives:
{4}->{1,3,4,6,8}
{1,4}->{1,3,4,6,8}
These two rules are expressed in text as:
1. {Yu Qian} -> {Guo Degang, crosstalk, Yu Qian, 《, 》}
2. {Guo Degang, Yu Qian} -> {Guo Degang, crosstalk, Yu Qian, 《, 》}
This can be understood as follows:
The keywords associated with the person "Yu Qian" are "Guo Degang" and "crosstalk".
The keyword associated with the person combination {"Guo Degang" + "Yu Qian"} is "crosstalk".
As can be seen, association rule calculation identifies person-related keywords fairly well.
The above example uses only subsets containing person names. Subsets containing events may also be selected, and those skilled in the art will understand that event-related keywords can likewise be correctly identified.
Embodiment 3:
The invention also discloses a kind of association keyword calculation element adopting complementary information, comprise as lower unit:
Unified event sets tectonic element 210: add the record relevant with search or video, add up all records and obtain event sets, word process is cut to each record in described event sets, the text entry of word has been cut in order scanning, and give order-assigned digital value increased progressively that each word occurs the earliest according to it, as the word id of this word, thus every bar record is converted to the sequence of several numerals, and the word id preserving each word and its correspondence is to lexicon file;
Word ID average occurrence count statistics unit 220: traverses the event set and counts the number of times each word ID occurs, where multiple occurrences of the same word ID within one event are counted only once; totals the occurrence counts of all word IDs and the number of word IDs, and obtains the average occurrence count of a word ID;
First-level item set construction unit 230: traverses all word IDs and finds those whose occurrence count exceeds the average occurrence count; each such word ID becomes a first-level item, and all first-level items together form the first-level item set;
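A minimal Python sketch of these two units is given below, assuming the event set is a list of word ID lists. The function name and the sample events are illustrative only.

```python
from collections import Counter


def first_level_items(events):
    """Count each word ID once per event, average the counts,
    and keep the word IDs whose count exceeds the average."""
    counts = Counter()
    for event in events:
        counts.update(set(event))  # repeated IDs in one event count once
    average = sum(counts.values()) / len(counts)
    level1 = sorted((wid,) for wid, c in counts.items() if c > average)
    return level1, average


# Made-up events: word 1 occurs in 3 events, word 2 in 2, words 3 and 4 in 1 each.
events = [[1, 2, 1], [1, 3], [1, 2], [4]]
level1, average = first_level_items(events)
print(average)  # (3 + 2 + 1 + 1) / 4
print(level1)
```

Each first-level item is stored as a one-element tuple so that later levels can extend it.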
Higher-level item set construction unit 240: the item set just formed by the preceding unit is called the original item set, in which each original item contains n word IDs, n ≥ 1; two original items satisfying the following condition are found and combined by an "AND" operation,
The condition is: the two original items comprise a first original item and a second original item; with the word IDs in each of the two original items sorted in ascending order, the first n-1 word IDs of the first original item are identical to those of the second original item, and the n-th word ID of the first original item is smaller than the n-th word ID of the second original item,
The "AND" operation on the two original items yields a higher-level item containing n+1 word IDs; the event set is traversed and the number of events containing all word IDs of the higher-level item is counted; if that event count exceeds the average occurrence count, the higher-level item is retained, otherwise it is discarded; all retained higher-level items together form the higher-level item set;
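The join condition and the retention test can be sketched in Python as follows. The sketch assumes each item is a tuple of word IDs sorted in ascending order and that the average occurrence count is passed in as a number; the function name and sample data are hypothetical.

```python
def build_next_level(items, events, average):
    """Join items of length n into items of length n+1 and keep those
    contained in more events than the average occurrence count."""
    event_sets = [set(e) for e in events]
    items = sorted(items)
    kept = []
    for i, first in enumerate(items):
        for second in items[i + 1:]:
            # Same first n-1 IDs, and the n-th ID of first is the smaller one.
            if first[:-1] == second[:-1] and first[-1] < second[-1]:
                candidate = first + (second[-1],)
                support = sum(1 for ev in event_sets if ev.issuperset(candidate))
                if support > average:
                    kept.append(candidate)
    return kept


# Made-up events; 2.75 is the average occurrence count for this data.
events = [[1, 2, 3], [1, 2, 3], [1, 2], [2, 3], [4]]
print(build_next_level([(1,), (2,), (3,)], events, average=2.75))
```

Calling the function again on its own output builds the next level, which is how the repeated construction could be driven until no further item is retained.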
Continue-building judging unit 250: determines, according to the higher-level item set construction unit, whether a still higher-level item set can be built; if so, returns to the higher-level item set construction unit; otherwise proceeds to the association rule screening unit;
Association rule screening unit 260: first defines a threshold TH for screening association rules, and for each final multi-item D in the resulting final multi-item set, obtains association rules by the following screening:
The final multi-item D contains m word IDs; 1 to m-1 word IDs are taken from it to form multiple proper subsets E; for each proper subset E, the numbers of events in the event set containing the final multi-item D and containing the proper subset E are counted and denoted Cnt(D) and Cnt(E) respectively; Cnt(D)/Cnt(E) is computed to obtain the probability value P(D|E); if P(D|E) is greater than TH, the proper subset is considered able to derive the final multi-item, an association rule is formed, and it is recorded and saved to obtain the association rule set;
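The screening by P(D|E) = Cnt(D)/Cnt(E) could be sketched in Python as below. The function name, the tuple representation of items and the sample events are assumptions made for illustration.

```python
from itertools import combinations


def screen_rules(final_items, events, th):
    """Return (E, D, P(D|E)) for every proper subset E of each final
    item D whose P(D|E) = Cnt(D) / Cnt(E) is greater than th."""
    event_sets = [set(e) for e in events]

    def cnt(ids):
        return sum(1 for ev in event_sets if ev.issuperset(ids))

    rules = []
    for d in final_items:
        cnt_d = cnt(d)
        for size in range(1, len(d)):          # subsets of 1 to m-1 word IDs
            for e in combinations(d, size):
                cnt_e = cnt(e)
                if cnt_e and cnt_d / cnt_e > th:
                    rules.append((e, tuple(d), cnt_d / cnt_e))
    return rules


# Made-up events: Cnt((1, 2)) = 3, Cnt((1,)) = 4, Cnt((2,)) = 5.
events = [[1, 2], [1, 2], [1, 2], [1], [2, 3], [2]]
print(screen_rules([(1, 2)], events, th=0.7))
```

With this data only the subset (1,) passes the 0.7 threshold, since 3/4 exceeds it while 3/5 does not.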
Text restoration unit 270: using the lexicon file, traverses the obtained association rule set and performs text restoration on each association rule; each word ID in the proper subset E and the final multi-item D is looked up in the lexicon file to obtain the original text, and the words in the proper subset are considered to yield the remaining words of the final multi-item, that is, those outside the proper subset.
Preferably, in the unified event set construction unit, the records related to search or video comprise user query data, video playback data following a user query, and user video upload data; each user query, each video playback, and each user video upload is treated as one record.
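As a rough illustration of how the three data sources might be merged into one event set, consider the Python sketch below. The function name, the argument names and the sample strings are all hypothetical; real logs would carry more fields.

```python
def build_event_set(queries, played_titles, uploaded_titles):
    """Treat every query, every playback and every upload as one record."""
    events = []
    events.extend(queries)          # one record per user query
    events.extend(played_titles)    # one record per playback after a query
    events.extend(uploaded_titles)  # one record per user upload
    return events


# Made-up log entries.
events = build_event_set(
    queries=["Guo Degang crosstalk"],
    played_titles=["Guo Degang Yu Qian crosstalk collection"],
    uploaded_titles=["Yu Qian interview"],
)
print(len(events))
```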
Preferably, in the higher-level item set construction unit, when the second-level item set is built, the word IDs in the first-level item set are combined in pairs to obtain multiple second-level items, each containing two elements; for each second-level item obtained, the number of events in the event set that contain the second-level item is counted; if that event count exceeds the average occurrence count, the item is retained, otherwise it is discarded; the retained second-level items are collected to form the second-level item set.
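For the second level specifically, the pairwise combination can be written directly with `itertools.combinations`, as in the following Python sketch. The function name and sample data are illustrative, and the average occurrence count is assumed to be computed beforehand.

```python
from itertools import combinations


def second_level_items(level1_ids, events, average):
    """Pair up first-level word IDs and keep the pairs that occur
    together in more events than the average occurrence count."""
    event_sets = [set(e) for e in events]
    kept = []
    for a, b in combinations(sorted(level1_ids), 2):
        support = sum(1 for ev in event_sets if a in ev and b in ev)
        if support > average:
            kept.append((a, b))
    return kept


# Made-up events; 2.75 is the average occurrence count for this data.
events = [[1, 2, 3], [1, 2, 3], [1, 2], [2, 3], [4]]
print(second_level_items([1, 2, 3], events, average=2.75))
```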
The threshold TH is set manually, such that the association rules screened out by the threshold TH essentially reflect the correlation between user query data and the other words of the video playback data or user video upload data.
Preferably, in the text restoration unit, proper subsets containing only personal names are chosen, to obtain the keywords related to a person.
Obviously, those skilled in the art should understand that each unit or step of the present invention described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device; alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they can each be made into an individual integrated circuit module, or multiple modules or steps among them can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, and the specific embodiments of the present invention shall not be deemed limited thereto. A person of ordinary skill in the technical field of the invention may make several simple deductions or substitutions without departing from the inventive concept, and all of these shall be regarded as falling within the scope of protection determined by the submitted claims.

Claims (10)

1. An association keyword calculation method adopting complementary information, comprising the following steps:
Unified event set construction step S110: collecting the records related to search or video and aggregating all records to obtain an event set; performing word segmentation on each record in the event set; scanning the segmented text records in order and assigning each word an incrementing numeric value according to the order of its first appearance, as the word ID of that word, so that each record is converted into a sequence of numbers; and saving each word and its corresponding word ID to a lexicon file;
Word ID average occurrence count statistics step S120: traversing the event set and counting the number of times each word ID occurs, where multiple occurrences of the same word ID within one event are counted only once; totaling the occurrence counts of all word IDs and the number of word IDs, and obtaining the average occurrence count of a word ID;
First-level item set construction step S130: traversing all word IDs and finding those whose occurrence count exceeds the average occurrence count; each such word ID becomes a first-level item, and all first-level items together form the first-level item set;
Higher-level item set construction step S140: the item set just formed by the preceding step is called the original item set, in which each original item contains n word IDs, n ≥ 1; two original items satisfying the following condition are found and combined by an "AND" operation,
The condition is: the two original items comprise a first original item and a second original item; with the word IDs in each of the two original items sorted in ascending order, the first n-1 word IDs of the first original item are identical to those of the second original item, and the n-th word ID of the first original item is smaller than the n-th word ID of the second original item,
The "AND" operation on the two original items yields a higher-level item containing n+1 word IDs; the event set is traversed and the number of events containing all word IDs of the higher-level item is counted; if that event count exceeds the average occurrence count, the higher-level item is retained, otherwise it is discarded; all retained higher-level items together form the higher-level item set;
Continue-building judging step S150: determining, according to the method of the higher-level item set construction step, whether a still higher-level item set can be built; if so, returning to the higher-level item set construction step S140; otherwise proceeding to the association rule screening step S160;
Association rule screening step S160: first defining a threshold TH for screening association rules, and for each final multi-item D in the resulting final multi-item set, obtaining association rules by the following screening:
The final multi-item D contains m word IDs; 1 to m-1 word IDs are taken from it to form multiple proper subsets E; for each proper subset E, the numbers of events in the event set containing the final multi-item D and containing the proper subset E are counted and denoted Cnt(D) and Cnt(E) respectively; Cnt(D)/Cnt(E) is computed to obtain the probability value P(D|E); if P(D|E) is greater than TH, the proper subset is considered able to derive the final multi-item, an association rule is formed, and it is recorded and saved to obtain the association rule set;
Text restoration step S170: using the lexicon file, traversing the obtained association rule set and performing text restoration on each association rule; each word ID in the proper subset E and the final multi-item D is looked up in the lexicon file to obtain the original text, and the words in the proper subset are considered to yield the remaining words of the final multi-item, that is, those outside the proper subset.
2. The association keyword calculation method adopting complementary information according to claim 1, characterized in that:
In the unified event set construction step, the records related to search or video comprise user query data, video playback data following a user query, and user video upload data; each user query, each video playback, and each user video upload is treated as one record.
3. The association keyword calculation method adopting complementary information according to claim 1, characterized in that:
In the higher-level item set construction step, when the second-level item set is built, the word IDs in the first-level item set are combined in pairs to obtain multiple second-level items, each containing two elements; for each second-level item obtained, the number of events in the event set that contain the second-level item is counted; if that event count exceeds the average occurrence count, the item is retained, otherwise it is discarded; the retained second-level items are collected to form the second-level item set.
4. The association keyword calculation method adopting complementary information according to any one of claims 1-3, characterized in that:
The threshold TH is set manually, such that the association rules screened out by the threshold TH reflect the correlation between user query data and the other words of the video playback data or user video upload data.
5. The association keyword calculation method adopting complementary information according to claim 4, characterized in that:
In the text restoration step, proper subsets containing only personal names are chosen, to obtain the keywords related to a person.
6. An association keyword calculation device adopting complementary information, comprising the following units:
Unified event set construction unit: collects the records related to search or video and aggregates all records to obtain an event set; performs word segmentation on each record in the event set; scans the segmented text records in order and assigns each word an incrementing numeric value according to the order of its first appearance, as the word ID of that word, so that each record is converted into a sequence of numbers; and saves each word and its corresponding word ID to a lexicon file;
Word ID average occurrence count statistics unit: traverses the event set and counts the number of times each word ID occurs, where multiple occurrences of the same word ID within one event are counted only once; totals the occurrence counts of all word IDs and the number of word IDs, and obtains the average occurrence count of a word ID;
First-level item set construction unit: traverses all word IDs and finds those whose occurrence count exceeds the average occurrence count; each such word ID becomes a first-level item, and all first-level items together form the first-level item set;
Higher-level item set construction unit: the item set just formed by the preceding unit is called the original item set, in which each original item contains n word IDs, n ≥ 1; two original items satisfying the following condition are found and combined by an "AND" operation,
The condition is: the two original items comprise a first original item and a second original item; with the word IDs in each of the two original items sorted in ascending order, the first n-1 word IDs of the first original item are identical to those of the second original item, and the n-th word ID of the first original item is smaller than the n-th word ID of the second original item,
The "AND" operation on the two original items yields a higher-level item containing n+1 word IDs; the event set is traversed and the number of events containing all word IDs of the higher-level item is counted; if that event count exceeds the average occurrence count, the higher-level item is retained, otherwise it is discarded; all retained higher-level items together form the higher-level item set;
Continue-building judging unit: determines, according to the higher-level item set construction unit, whether a still higher-level item set can be built; if so, returns to the higher-level item set construction unit; otherwise proceeds to the association rule screening unit;
Association rule screening unit: first defines a threshold TH for screening association rules, and for each final multi-item D in the resulting final multi-item set, obtains association rules by the following screening:
The final multi-item D contains m word IDs; 1 to m-1 word IDs are taken from it to form multiple proper subsets E; for each proper subset E, the numbers of events in the event set containing the final multi-item D and containing the proper subset E are counted and denoted Cnt(D) and Cnt(E) respectively; Cnt(D)/Cnt(E) is computed to obtain the probability value P(D|E); if P(D|E) is greater than TH, the proper subset is considered able to derive the final multi-item, an association rule is formed, and it is recorded and saved to obtain the association rule set;
Text restoration unit: using the lexicon file, traverses the obtained association rule set and performs text restoration on each association rule; each word ID in the proper subset E and the final multi-item D is looked up in the lexicon file to obtain the original text, and the words in the proper subset are considered to yield the remaining words of the final multi-item, that is, those outside the proper subset.
7. The association keyword calculation device adopting complementary information according to claim 6, characterized in that:
In the unified event set construction unit, the records related to search or video comprise user query data, video playback data following a user query, and user video upload data; each user query, each video playback, and each user video upload is treated as one record.
8. The association keyword calculation device adopting complementary information according to claim 6, characterized in that:
In the higher-level item set construction unit, when the second-level item set is built, the word IDs in the first-level item set are combined in pairs to obtain multiple second-level items, each containing two elements; for each second-level item obtained, the number of events in the event set that contain the second-level item is counted; if that event count exceeds the average occurrence count, the item is retained, otherwise it is discarded; the retained second-level items are collected to form the second-level item set.
9. The association keyword calculation device adopting complementary information according to any one of claims 6-8, characterized in that:
The threshold TH is set manually, such that the association rules screened out by the threshold TH reflect the correlation between user query data and the other words of the video playback data or user video upload data.
10. The association keyword calculation device adopting complementary information according to claim 9, characterized in that:
In the text restoration unit, proper subsets containing only personal names are chosen, to obtain the keywords related to a person.
CN201310620943.XA 2013-11-30 2013-11-30 A kind of association keyword calculation method and device adopting complementary information Active CN103593469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310620943.XA CN103593469B (en) 2013-11-30 2013-11-30 A kind of association keyword calculation method and device adopting complementary information

Publications (2)

Publication Number Publication Date
CN103593469A CN103593469A (en) 2014-02-19
CN103593469B true CN103593469B (en) 2016-04-20

Family

ID=50083610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310620943.XA Active CN103593469B (en) 2013-11-30 2013-11-30 A kind of association keyword calculation method and device adopting complementary information

Country Status (1)

Country Link
CN (1) CN103593469B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107402920B (en) * 2016-05-18 2020-02-07 北京京东尚科信息技术有限公司 Method and device for determining correlation complexity of relational database table
CN109344402B (en) * 2018-09-20 2023-08-04 中国科学技术信息研究所 New term automatic discovery and identification method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6289302B1 (en) * 1998-10-26 2001-09-11 Matsushita Electric Industrial Co., Ltd. Chinese generation apparatus for machine translation to convert a dependency structure of a Chinese sentence into a Chinese sentence
CN102012900A (en) * 2009-09-04 2011-04-13 阿里巴巴集团控股有限公司 An information retrieval method and system
CN103020212A (en) * 2012-12-07 2013-04-03 合一网络技术(北京)有限公司 Method and device for finding hot videos based on user query logs in real time

Also Published As

Publication number Publication date
CN103593469A (en) 2014-02-19

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee after: Youku network technology (Beijing) Co.,Ltd.

Address before: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee before: 1VERGE INTERNET TECHNOLOGY (BEIJING) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20200323

Address after: 310018 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee before: Youku network technology (Beijing) Co.,Ltd.
