Summary of the invention
The object of the invention is to provide a high-accuracy method and apparatus for automatically discovering keywords associated with persons or events. The method efficiently discovers keywords related to a person or event and can cover a large number of persons, thereby addressing the problems of coverage breadth and efficient updating.
To achieve this object, the present invention adopts the following technical solution:
A method for computing associated keywords using complementary information comprises the following steps:
Step S110, constructing a unified event set: records related to search or video are added, and all records are collected to obtain an event set. Each record in the event set undergoes word segmentation. The segmented text records are scanned sequentially, and each word is assigned an incrementing numeric value, in order of its first appearance, as its word ID, so that every record is converted into a sequence of numbers. Each word and its corresponding word ID are saved to a dictionary file;
Step S120, computing the average occurrence count of word IDs: the event set is traversed and the number of occurrences of each word ID is counted, where multiple occurrences of the same word ID within one event are counted only once. The total number of occurrences of all word IDs is divided by the number of distinct word IDs to obtain the average occurrence count;
Step S130, building the level-1 item set: all word IDs are traversed to find those whose occurrence count exceeds the average occurrence count. Each such word ID becomes a level-1 item, and all level-1 items together form the level-1 item set;
Step S140, building the next-level item set: the item set formed in the previous step is called the original item set, and each original item contains n word IDs, where n ≥ 1. Two original items that satisfy the following condition are found and combined by a union operation.
The condition is: the two original items are a first original item and a second original item; with the word IDs in each item sorted in ascending order, the first n-1 word IDs of the two items are identical, and the n-th word ID of the first original item is less than the n-th word ID of the second original item.
The union of the two original items yields a next-level item containing n+1 word IDs. The event set is traversed and the number of events containing all word IDs of the next-level item is counted. If this event count exceeds the average occurrence count, the next-level item is retained; otherwise it is discarded. All retained next-level items together form the next-level item set;
Step S150, deciding whether to continue building: following the method of step S140, it is determined whether a still higher-level item set can be built. If so, the method returns to step S140; otherwise it proceeds to step S160;
Step S160, screening association rules: a threshold TH for screening association rules is first defined. For each final item D in the resulting final item set, association rules are obtained as follows:
The final item D contains m word IDs. From it, 1 to m-1 word IDs are taken to form a plurality of proper subsets E. For each proper subset E, the numbers of events in the event set that contain the final item D and the proper subset E are counted and denoted Cnt(D) and Cnt(E), respectively. The probability P(D|E) = Cnt(D)/Cnt(E) is computed. If P(D|E) is greater than TH, the proper subset E is considered to imply the final item D, which forms an association rule; the rule is recorded, and the recorded rules form the association rule set;
Step S170, text restoration: using the dictionary file, the association rule set is traversed and each association rule is restored to text. Each word ID in the proper subset E and the final item D is looked up in the dictionary file to obtain the original text, and the words in the proper subset are considered to yield the remaining words of the final item, that is, those not in the proper subset.
Preferably, in step S110, the records related to search or video include user query data, data on videos played after a user query, and user-uploaded video data. Each user query, each video play, and each video upload is treated as one record.
Preferably, in step S140, when the level-2 item set is built, the word IDs in the level-1 item set are combined pairwise to obtain a plurality of level-2 items, each containing two elements. For each level-2 item, the number of events in the event set that contain it is counted. If this event count exceeds the average occurrence count, the item is retained; otherwise it is discarded. The retained level-2 items form the level-2 item set.
Preferably, the threshold TH is set manually so that the association rules it selects substantially reflect the correlation between user query data and the other words in the video play data or user-uploaded video data.
Preferably, in the text restoration step, proper subsets containing only personal names are selected to obtain person-related keywords.
The invention also discloses an apparatus for computing associated keywords using complementary information, comprising the following units:
Unified event set construction unit: records related to search or video are added, and all records are collected to obtain an event set. Each record in the event set undergoes word segmentation. The segmented text records are scanned sequentially, and each word is assigned an incrementing numeric value, in order of its first appearance, as its word ID, so that every record is converted into a sequence of numbers. Each word and its corresponding word ID are saved to a dictionary file;
Word ID average occurrence count unit: the event set is traversed and the number of occurrences of each word ID is counted, where multiple occurrences of the same word ID within one event are counted only once. The total number of occurrences of all word IDs is divided by the number of distinct word IDs to obtain the average occurrence count;
Level-1 item set building unit: all word IDs are traversed to find those whose occurrence count exceeds the average occurrence count. Each such word ID becomes a level-1 item, and all level-1 items together form the level-1 item set;
Next-level item set building unit: the item set formed by the preceding unit is called the original item set, and each original item contains n word IDs, where n ≥ 1. Two original items that satisfy the following condition are found and combined by a union operation.
The condition is: the two original items are a first original item and a second original item; with the word IDs in each item sorted in ascending order, the first n-1 word IDs of the two items are identical, and the n-th word ID of the first original item is less than the n-th word ID of the second original item.
The union of the two original items yields a next-level item containing n+1 word IDs. The event set is traversed and the number of events containing all word IDs of the next-level item is counted. If this event count exceeds the average occurrence count, the next-level item is retained; otherwise it is discarded. All retained next-level items together form the next-level item set;
Continue-building decision unit: according to the next-level item set building unit, it is determined whether a still higher-level item set can be built. If so, control returns to the next-level item set building unit; otherwise it passes to the association rule screening unit;
Association rule screening unit: a threshold TH for screening association rules is first defined. For each final item D in the resulting final item set, association rules are obtained as follows:
The final item D contains m word IDs. From it, 1 to m-1 word IDs are taken to form a plurality of proper subsets E. For each proper subset E, the numbers of events in the event set that contain the final item D and the proper subset E are counted and denoted Cnt(D) and Cnt(E), respectively. The probability P(D|E) = Cnt(D)/Cnt(E) is computed. If P(D|E) is greater than TH, the proper subset E is considered to imply the final item D, which forms an association rule; the rule is recorded, and the recorded rules form the association rule set;
Text restoration unit: using the dictionary file, the association rule set is traversed and each association rule is restored to text. Each word ID in the proper subset E and the final item D is looked up in the dictionary file to obtain the original text, and the words in the proper subset are considered to yield the remaining words of the final item, that is, those not in the proper subset.
Preferably, in the unified event set construction unit, the records related to search or video include user query data, data on videos played after a user query, and user-uploaded video data. Each user query, each video play, and each video upload is treated as one record.
Preferably, in the next-level item set building unit, when the level-2 item set is built, the word IDs in the level-1 item set are combined pairwise to obtain a plurality of level-2 items, each containing two elements. For each level-2 item, the number of events in the event set that contain it is counted. If this event count exceeds the average occurrence count, the item is retained; otherwise it is discarded. The retained level-2 items form the level-2 item set.
Preferably, the threshold TH is set manually so that the association rules it selects substantially reflect the correlation between user query data and the other words in the video play data or user-uploaded video data.
Preferably, in the text restoration unit, proper subsets containing only personal names are selected to obtain person-related keywords.
Therefore, the present invention combines the complementary strengths of user queries, videos played after queries, and user-uploaded data, and avoids the biased person-related keyword results obtained from a single data source. Adding post-query play data captures the keywords users are genuinely interested in, adding user-uploaded data avoids the problem of users not knowing which keywords to search for, and raising the event threshold and the association rule threshold yields higher accuracy.
Embodiment
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
The present invention takes user query data as the main body and supplements it with post-query play data and user-uploaded video data to obtain a unified event set. An association rule algorithm is used to find person-related and event-related association rules in the event set. Finally, the associated keywords are parsed from the association rules.
Embodiment 1:
Fig. 1 shows a flowchart of the method for computing associated keywords using complementary information according to the present invention. The method comprises the following steps:
Step S110, constructing a unified event set: records related to search or video are added, and all records are collected to obtain an event set. Each record in the event set undergoes word segmentation. The segmented text records are scanned sequentially, and each word is assigned an incrementing numeric value, in order of its first appearance, as its word ID, for example a value incrementing from 1. Every text record is thus converted into a sequence of numbers. Each word and its corresponding word ID are saved, for example to a dictionary file.
Preferably, the records related to search or video include user query data, data on videos played after a user query, and user-uploaded video data. Each user query, each video play, and each video upload is treated as one record.
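The word ID assignment in step S110 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `assign_word_ids` is hypothetical, the records are assumed to be already segmented into word lists, and an in-memory dict stands in for the dictionary file.

```python
def assign_word_ids(segmented_records):
    """Give each word an ID in order of first appearance, starting from 1."""
    word_to_id = {}   # stands in for the dictionary file (assumption)
    encoded = []      # each record becomes a sequence of word IDs
    for words in segmented_records:
        ids = []
        for word in words:
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id) + 1
            ids.append(word_to_id[word])
        encoded.append(ids)
    return word_to_id, encoded


# Illustrative, already-segmented records (hypothetical sample data).
records = [["Guo Degang"], ["Guo Degang", "latest", "crosstalk"]]
word_to_id, encoded = assign_word_ids(records)
print(encoded)  # [[1], [1, 2, 3]]
```

A word keeps the ID it received at its first appearance, so repeated words map to the same number in every later record.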
Step S120, computing the average occurrence count of word IDs: the event set is traversed and the number of occurrences of each word ID is counted, where multiple occurrences of the same word ID within one event are counted only once. The total number of occurrences of all word IDs is divided by the number of distinct word IDs to obtain the average occurrence count.
Step S130, building the level-1 item set: all word IDs are traversed to find those whose occurrence count exceeds the average occurrence count. Each such word ID becomes a level-1 item, and all level-1 items together form the level-1 item set;
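Steps S120 and S130 can be sketched together as below. This is only an illustrative reading of the text: the function name `level1_items` is hypothetical, items are represented as sorted tuples of word IDs by assumption, and the input is the ID-encoded event set given in Embodiment 2.

```python
from collections import Counter


def level1_items(events):
    """Return (per-ID event counts, average count, level-1 items)."""
    counts = Counter()
    for event in events:
        counts.update(set(event))  # a word ID counts once per event
    # Total occurrences divided by the number of distinct word IDs.
    average = sum(counts.values()) / len(counts)
    level1 = sorted((wid,) for wid, c in counts.items() if c > average)
    return counts, average, level1


# ID-encoded events taken from the worked example in Embodiment 2.
events = [[1], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8],
          [1, 2, 3, 4, 5, 6, 8, 9], [1, 3, 4, 6, 8, 10]]
counts, average, level1 = level1_items(events)
print(average)  # 2.6
print(level1)   # [(1,), (2,), (3,), (4,), (6,), (8,)]
```

Using the average as the cut-off means no separate support threshold has to be chosen by hand for this stage.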
Step S140, building the next-level item set: the item set formed in the previous step is called the original item set, and each original item contains n word IDs, where n ≥ 1. Two original items that satisfy the following condition are found and combined by a union operation, which is equivalent to the inclusive-OR operation in logic.
The condition is: the two original items are a first original item A and a second original item B; with the word IDs in each item sorted in ascending order, the first n-1 word IDs of A and B are identical, and the n-th word ID of A is less than the n-th word ID of B.
The union of the two original items yields a next-level item C containing n+1 word IDs. The event set is traversed and the number of events containing all word IDs of C is counted. If this event count exceeds the average occurrence count, C is retained; otherwise it is discarded;
All retained next-level items, that is, the level-(n+1) items, together form the next-level item set, that is, the level-(n+1) item set;
In particular, when the level-2 item set is built, the word IDs in the level-1 item set are combined pairwise to obtain a number of level-2 items, each containing two elements. For each level-2 item, the number of events in the event set that contain it is counted. If this event count exceeds the average occurrence count, the item is retained; otherwise it is discarded. The retained level-2 items form the level-2 item set;
It should be noted that, to accommodate the special case of building the level-2 item set in step S140, when n = 1 and therefore n-1 = 0, the "0-th" word of every original item is regarded as identical, so the condition is considered satisfied. Consequently, when the level-2 item set is built, every pair of level-1 items is combined directly by the union operation, whereas when any other next-level item set is built, the union operation is performed only if the first n-1 words are identical.
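The join condition and the count filter of step S140 can be sketched as follows. The names `join_items` and `next_level` are hypothetical, items are assumed to be sorted tuples of word IDs, and the toy events at the end are invented purely for illustration.

```python
def join_items(a, b):
    """Union two sorted n-ID items if the first n-1 IDs match and a's last ID is smaller."""
    if a[:-1] == b[:-1] and a[-1] < b[-1]:
        return a + (b[-1],)
    return None  # condition not met, no union


def next_level(items, events, threshold):
    """Join all qualifying pairs and keep those found in more than `threshold` events."""
    event_sets = [set(e) for e in events]
    kept = []
    for a in items:
        for b in items:
            candidate = join_items(a, b)
            if candidate is None:
                continue
            count = sum(1 for e in event_sets if e.issuperset(candidate))
            if count > threshold:
                kept.append(candidate)
    return kept


print(join_items((3, 4), (3, 6)))  # (3, 4, 6)
print(join_items((3, 4), (4, 8)))  # None
print(join_items((1,), (2,)))      # (1, 2), since for n = 1 the empty prefixes match

# Hypothetical toy data: three events and a threshold of 1.5.
toy_events = [[1, 2], [1, 2, 3], [2, 3]]
print(next_level([(1,), (2,), (3,)], toy_events, 1.5))  # [(1, 2), (2, 3)]
```

Because the prefix of a one-element tuple is empty, the n = 1 special case described above needs no separate branch in this sketch.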
Step S150, deciding whether to continue building: following the method of step S140, it is determined whether a still higher-level item set can be built, that is, whether there exist two original items from which a higher-level item can be constructed whose event count in the event set exceeds the average occurrence count. If so, the method returns to step S140; otherwise it proceeds to step S160;
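The loop formed by steps S140 and S150 can be sketched as below. The function name `mine_levels` is hypothetical, and treating the last non-empty level as the final item set is this sketch's reading of the text rather than a confirmed detail. The input is the ID-encoded event set and threshold from Embodiment 2.

```python
def mine_levels(level1, events, threshold):
    """Build item sets level by level until no higher level can be formed."""
    event_sets = [set(e) for e in events]

    def event_count(item):
        return sum(1 for e in event_sets if e.issuperset(item))

    levels = [sorted(level1)]
    while True:
        current = levels[-1]
        # Step S140: union pairs that share the first n-1 IDs.
        candidates = [a + (b[-1],) for a in current for b in current
                      if a[:-1] == b[:-1] and a[-1] < b[-1]]
        kept = [c for c in candidates if event_count(c) > threshold]
        if not kept:          # Step S150: nothing higher can be built
            return levels
        levels.append(kept)


events = [[1], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8],
          [1, 2, 3, 4, 5, 6, 8, 9], [1, 3, 4, 6, 8, 10]]
level1 = [(1,), (2,), (3,), (4,), (6,), (8,)]
levels = mine_levels(level1, events, 2.6)
print(len(levels))  # 5
print(levels[-1])   # [(1, 3, 4, 6, 8)]
```

The loop always terminates, since an item can never contain more word IDs than there are level-1 items.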
Step S160, screening association rules: a threshold TH for screening association rules is first defined. For each final item D in the resulting final item set, association rules are obtained as follows:
The final item D contains m word IDs. From it, 1 to m-1 word IDs are taken to form a number of proper subsets E. For each proper subset E, the numbers of events in the event set that contain the final item D and the proper subset E are counted and denoted Cnt(D) and Cnt(E), respectively. The probability P(D|E) = Cnt(D)/Cnt(E) is computed. If P(D|E) is greater than TH, the proper subset E is considered to imply the final item D, which forms an association rule; the rule is recorded, and the recorded rules form the association rule set.
Preferably, the threshold TH is set manually so that the association rules it selects substantially reflect the correlation between user query data, that is, the query string, and the other words in the video play data or user-uploaded video data.
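The screening in step S160 can be sketched as follows, assuming items are tuples of word IDs. The function name `screen_rules` is hypothetical, and the input is the event set and final item from Embodiment 2.

```python
from itertools import combinations


def screen_rules(final_item, events, th):
    """Return (E, P(D|E)) for every proper subset E of D with P(D|E) > th."""
    event_sets = [set(e) for e in events]

    def cnt(item):
        return sum(1 for e in event_sets if e.issuperset(item))

    cnt_d = cnt(final_item)
    rules = []
    for size in range(1, len(final_item)):        # 1 .. m-1 word IDs
        for subset in combinations(final_item, size):
            cnt_e = cnt(subset)
            if cnt_e == 0:                        # guard: E never occurs
                continue
            p = cnt_d / cnt_e                     # P(D|E) = Cnt(D)/Cnt(E)
            if p > th:
                rules.append((subset, p))
    return rules


events = [[1], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8],
          [1, 2, 3, 4, 5, 6, 8, 9], [1, 3, 4, 6, 8, 10]]
d = (1, 3, 4, 6, 8)
print(screen_rules(d, events, 0.55)[0])    # ((1,), 0.6)
print(len(screen_rules(d, events, 0.55)))  # 30
print(len(screen_rules(d, events, 0.7)))   # 29
```

Raising TH from 0.55 to 0.7 drops only the rule whose left side is {1}, because word ID 1 occurs in all five events while D occurs in three.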
Step S170, text restoration: using the dictionary file, the association rule set is traversed and each association rule is restored to text. Each word ID in the proper subset E and the final item D is looked up in the dictionary file to obtain the original text, and the words in the proper subset are considered to yield the remaining words of the final item, that is, those not in the proper subset.
Preferably, proper subsets containing only personal names are selected, so that person-related keywords are correctly identified.
Of course, proper subsets containing events can also be selected to obtain event-related keywords.
Normally, symbol tokens should be removed during text restoration.
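Text restoration, name-only selection, and symbol removal can be sketched as below. Everything named here is an assumption for illustration: the functions `restore_rule` and `is_name_only`, the in-memory dictionary, the name list, and the symbol list. The text does not specify how names or symbols are recognised.

```python
def restore_rule(subset, final_item, id_to_word, symbols=frozenset()):
    """Turn a rule E -> D into (words of E, remaining words of D without symbols)."""
    left = [id_to_word[i] for i in subset]
    right = [id_to_word[i] for i in final_item
             if i not in subset and id_to_word[i] not in symbols]
    return left, right


def is_name_only(subset, id_to_word, names):
    """True if every word in the subset is a known personal name."""
    return all(id_to_word[i] in names for i in subset)


# Hypothetical dictionary, name list and symbol list for illustration.
id_to_word = {1: "Guo Degang", 3: "crosstalk", 6: "《", 8: "》"}
names = {"Guo Degang"}
symbols = {"《", "》"}

rule = ((1,), (1, 3, 6, 8))  # proper subset E and final item D
if is_name_only(rule[0], id_to_word, names):
    print(restore_rule(rule[0], rule[1], id_to_word, symbols))
    # (['Guo Degang'], ['crosstalk'])
```

Filtering symbols on the right-hand side only keeps the left-hand side exactly as mined, which makes it easier to trace a keyword back to its rule.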
Therefore, the present invention combines the complementary strengths of user queries, videos played after queries, and user-uploaded data, and avoids the biased person-related keyword results obtained from a single data source. Adding post-query play data captures the keywords users are genuinely interested in, adding user-uploaded data avoids the problem of users not knowing which keywords to search for, and raising the event threshold and the association rule threshold yields higher accuracy.
Embodiment 2:
This embodiment gives a concrete example of the associated keyword computing method of Embodiment 1:
Three types of data (user queries, post-query plays, and user uploads) give five events in total. After merging and word segmentation, they are:
Event 1: Guo Degang (1)
Event 2: Guo Degang (1) latest (2) crosstalk (3)
Event 3: Guo Degang (1) Yu Qian (4) 2013 (5) latest (2) crosstalk (3) 《 (6) You Are MashiMaro (7) 》 (8)
Event 4: Guo Degang (1) Yu Qian (4) 2013 (5) latest (2) crosstalk (3) 《 (6) Sparking Chicken (9) 》 (8)
Event 5: Guo Degang (1) Yu Qian (4) crosstalk (3) 《 (6) Timid Wealths and Ranks (10) 》 (8)
The number in parentheses after each word is the ID assigned to that word. There are 10 word IDs in total, and all words together occur 26 times, so each word occurs 2.6 times on average. The threshold used below is therefore 2.6.
The resulting event set is expressed as:
{{1},{1,2,3},{1,2,3,4,5,6,7,8},{1,2,3,4,5,6,8,9},{1,3,4,6,8,10}}
First, the level-1 frequent item set is found by counting the number of events in which each word occurs. Counted from the event set above, word IDs 1 to 10 occur 5, 3, 4, 3, 2, 3, 1, 3, 1 and 1 times, respectively.
As can be seen, the counts of word IDs 1, 2, 3, 4, 6 and 8 are greater than the threshold 2.6, so these form the level-1 item set, expressed as {{1},{2},{3},{4},{6},{8}}.
Next, the level-2 item set is constructed. The candidate level-2 items are:
{
{1,2},{1,3},{1,4},{1,6},{1,8},
{2,3},{2,4},{2,6},{2,8},
{3,4},{3,6},{3,7},
{4,6},{4,8},
{6,8}
}
The number of events in which each candidate level-2 item occurs is then counted.
Because the counts of the level-2 items {2,4}, {2,6}, {2,8} and {3,7} are below the threshold 2.6, these items are deleted, and the resulting level-2 item set is:
{
{1,2},{1,3},{1,4},{1,6},{1,8},
{2,3},
{3,4},{3,6},
{4,6},{4,8},
{6,8}
}
Next, the level-3 item set is constructed. The candidate level-3 items are:
{
{1,2,3},{1,2,4},{1,2,6},{1,2,8},
{1,3,4},{1,3,6},{1,3,8},
{1,4,6},{1,4,8},
{1,6,8},
{3,4,6},
{4,6,8},
}
Level-3 items are obtained from level-2 items as follows. Take any two level-2 items, for example {3,4} and {3,6}, and sort the elements of each by ID value (here they are already sorted), giving {3,4} and {3,6}. Because the first n-1 (2-1=1) elements of the two items are identical and the second value of the former, 4, is less than the second value of the latter, 6, the level-3 item {3,4,6} is obtained. By contrast, for the two level-2 items {3,4} and {4,8} (each already sorted by ID), the first n-1 (2-1=1) elements differ, so the two cannot be merged to obtain {3,4,8}.
The number of events in which each candidate level-3 item occurs is then counted.
Because the counts of the level-3 items {1,2,4}, {1,2,6} and {1,2,8} are less than the threshold 2.6, the resulting level-3 item set is:
{
{1,2,3},
{1,3,4},{1,3,6},{1,3,8},
{1,4,6},{1,4,8},
{1,6,8},
{3,4,6},
{4,6,8},
}
Next, the level-4 item set is constructed. The candidate level-4 items are:
{
{1,3,4,6},{1,3,4,8},{1,3,6,8},
{1,4,6,8},
}
The event counts of these level-4 items are all greater than the threshold 2.6.
Next, the level-5 item set is constructed. The candidate level-5 items are:
{
{1,3,4,6,8},
}
This single level-5 item occurs in 3 events, which is greater than the threshold 2.6, and no higher-level item set can be obtained. This level-5 item is therefore the final item.
The proper subsets of this final item are constructed; they are:
{
{1},{3},{4},{6},{8},
{1,3},{1,4},{1,6},{1,8},
{3,4},{3,6},{3,8},
{4,6},{4,8},
{6,8},
{1,3,4},{1,3,6},{1,3,8},
{1,4,6},{1,4,8},
{1,6,8},
{3,4,6},{3,4,8},{3,6,8},
{4,6,8},
{1,3,4,6},{1,3,4,8},{1,3,6,8},{1,4,6,8},{3,4,6,8},
}
The number of events in the event set that contain each of these proper subsets is then counted.
For example, the proper subset {1} occurs 5 times, so
P({1,3,4,6,8}|{1}) = 3/5 = 0.6.
With the threshold TH set to 0.55, this association rule is valid. Translating it into words according to the dictionary gives:
{Guo Degang} -> {Guo Degang, crosstalk, Yu Qian, 《, 》}
The keywords associated with Guo Degang are "crosstalk" and "Yu Qian". The title marks 《 》 are filtered out by the symbol rule.
If the threshold TH is set to 0.7, the counts of the subsets show that only {1}->{1,3,4,6,8} fails to reach the threshold, so the valid association rules are all rules other than {1}->{1,3,4,6,8}.
The subsets are restored to text, and those containing only personal names (word 1, Guo Degang, and word 4, Yu Qian) are selected, giving:
{4}->{1,3,4,6,8}
{1,4}->{1,3,4,6,8}
These two rules are expressed in text as:
1. {Yu Qian} -> {Guo Degang, crosstalk, Yu Qian, 《, 》}
2. {Guo Degang, Yu Qian} -> {Guo Degang, crosstalk, Yu Qian, 《, 》}
They can be understood as follows:
The keywords associated with the person "Yu Qian" are "Guo Degang" and "crosstalk".
The keyword associated with the person combination {"Guo Degang" + "Yu Qian"} is "crosstalk".
As can be seen, computing association rules allows person-related keywords to be identified more accurately.
The above example uses only subsets containing personal names. Subsets containing events can also be selected; as those skilled in the art will appreciate, this likewise allows event-related keywords to be identified correctly.
Embodiment 3:
The invention also discloses an apparatus for computing associated keywords using complementary information, comprising the following units:
Unified event set construction unit 210: records related to search or video are added, and all records are collected to obtain an event set. Each record in the event set undergoes word segmentation. The segmented text records are scanned sequentially, and each word is assigned an incrementing numeric value, in order of its first appearance, as its word ID, so that every record is converted into a sequence of numbers. Each word and its corresponding word ID are saved to a dictionary file;
Word ID average occurrence count unit 220: the event set is traversed and the number of occurrences of each word ID is counted, where multiple occurrences of the same word ID within one event are counted only once. The total number of occurrences of all word IDs is divided by the number of distinct word IDs to obtain the average occurrence count;
Level-1 item set building unit 230: all word IDs are traversed to find those whose occurrence count exceeds the average occurrence count. Each such word ID becomes a level-1 item, and all level-1 items together form the level-1 item set;
Next-level item set building unit 240: the item set formed by the preceding unit is called the original item set, and each original item contains n word IDs, where n ≥ 1. Two original items that satisfy the following condition are found and combined by a union operation.
The condition is: the two original items are a first original item and a second original item; with the word IDs in each item sorted in ascending order, the first n-1 word IDs of the two items are identical, and the n-th word ID of the first original item is less than the n-th word ID of the second original item.
The union of the two original items yields a next-level item containing n+1 word IDs. The event set is traversed and the number of events containing all word IDs of the next-level item is counted. If this event count exceeds the average occurrence count, the next-level item is retained; otherwise it is discarded. All retained next-level items together form the next-level item set;
Continue-building decision unit 250: according to the next-level item set building unit, it is determined whether a still higher-level item set can be built. If so, control returns to the next-level item set building unit; otherwise it passes to the association rule screening unit;
Association rule screening unit 260: a threshold TH for screening association rules is first defined. For each final item D in the resulting final item set, association rules are obtained as follows:
The final item D contains m word IDs. From it, 1 to m-1 word IDs are taken to form a plurality of proper subsets E. For each proper subset E, the numbers of events in the event set that contain the final item D and the proper subset E are counted and denoted Cnt(D) and Cnt(E), respectively. The probability P(D|E) = Cnt(D)/Cnt(E) is computed. If P(D|E) is greater than TH, the proper subset E is considered to imply the final item D, which forms an association rule; the rule is recorded, and the recorded rules form the association rule set;
Text restoration unit 270: using the dictionary file, the association rule set is traversed and each association rule is restored to text. Each word ID in the proper subset E and the final item D is looked up in the dictionary file to obtain the original text, and the words in the proper subset are considered to yield the remaining words of the final item, that is, those not in the proper subset.
Preferably, in the unified event set construction unit, the records related to search or video include user query data, data on videos played after a user query, and user-uploaded video data. Each user query, each video play, and each video upload is treated as one record.
Preferably, in the next-level item set building unit, when the level-2 item set is built, the word IDs in the level-1 item set are combined pairwise to obtain a plurality of level-2 items, each containing two elements. For each level-2 item, the number of events in the event set that contain it is counted. If this event count exceeds the average occurrence count, the item is retained; otherwise it is discarded. The retained level-2 items form the level-2 item set.
Wherein, the threshold TH is set manually so that the association rules it selects substantially reflect the correlation between user query data and the other words in the video play data or user-uploaded video data.
Preferably, in the text restoration unit, proper subsets containing only personal names are selected to obtain person-related keywords.
Obviously, those skilled in the art should understand that each unit or step of the present invention described above can be implemented with a general-purpose computing device. The units or steps can be concentrated on a single computing device; alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, and the specific embodiments of the invention should not be regarded as limited to it. Those of ordinary skill in the art to which the invention belongs can make a number of simple deductions or substitutions without departing from the inventive concept, and all of these should be regarded as falling within the scope of protection determined by the submitted claims.