CN103593469A - Method and device for calculating associated keywords through complementary information - Google Patents


Publication number
CN103593469A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310620943.XA
Other languages
Chinese (zh)
Other versions
CN103593469B (en)
Inventor
刘伟
姚键
潘柏宇
卢述奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Youku Network Technology Beijing Co Ltd
Original Assignee
1Verge Internet Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 1Verge Internet Technology Beijing Co Ltd
Priority to CN201310620943.XA
Publication of CN103593469A
Application granted
Publication of CN103593469B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data

Abstract

The invention provides a method and device for calculating associated keywords using complementary information. The method takes users' query data as the main body and supplements it with post-query playback data and user-uploaded video data to obtain a unified event set; an association rule algorithm then finds the association rules relating to persons and events in the event set; finally, the associated keywords are parsed from the association rules. The method combines the complementary strengths of user query data, video playback data after a query, and user-uploaded data, and so avoids the biased person-related keyword results obtained from a single data source. Adding the post-query playback data yields the keywords users are genuinely interested in; adding the user-uploaded data covers the case where users do not know which keywords to search for; and raising the thresholds for events and association rules yields higher accuracy.

Description

Method and device for calculating associated keywords using complementary information
Technical field
The present application relates to a keyword calculation method and device for use in search, and in particular to a method and device for calculating associated keywords using complementary information.
Background
Video service websites play the role of discovering and playing media. People often want to know about content related to a particular person, so they browse or search video websites for content about the person they care about. At present, video websites let users browse person-related content through editor-curated content sections and hot-search lists, and offer video search so that users can find more of it. However, manual editing is slow and is limited by the editors' information sources and working hours, so the content is neither broad nor timely enough; hot-search lists cover only the few dozen most-searched persons and cannot satisfy users' wide range of interests; and searching by a person's name may present the user with a great deal of content they do not care about. People also want to search for a particular event and obtain the keywords related to it. How to obtain the keywords related to a person or event by searching for that person or event has therefore become a technical problem in urgent need of a solution.
Summary of the invention
The object of the invention is to provide a high-accuracy method and device for automatically discovering the keywords associated with a person or event, which efficiently discovers keywords related to persons or events, can cover a large number of persons, and solves the problems of breadth of coverage and efficient updating.
To achieve this object, the invention adopts the following technical solution:
A method for calculating associated keywords using complementary information, comprising the following steps:
Step S110, constructing a unified event set: add the records related to search or video, and accumulate all records to obtain the event set; perform word segmentation on each record in the event set; scan the segmented text records in order and assign each word an incrementing numeric value, according to the order in which it first appears, as its word id, so that every record is converted into a sequence of numbers; and save each word with its word id to a lexicon file;
Step S120, computing the average occurrence count of the word ids: traverse the event set and count the occurrences of each word id, where repeated occurrences of the same word id within one event are counted only once; sum the occurrence counts of all word ids and divide by the number of distinct word ids to obtain the average occurrence count;
Step S130, building the level-1 item set: traverse all word ids and find those whose occurrence count exceeds the average occurrence count; each such word id becomes a level-1 item, and all level-1 items together form the level-1 item set;
Step S140, building the next-higher-level item set: the item set formed in the preceding step is called the original item set, each original item containing n word ids, n ≥ 1; find pairs of original items that satisfy the condition below and take their union,
The condition is: the two original items are a first original item and a second original item; with the word ids in each item sorted in ascending order, the first n-1 word ids of the two items are identical, and the n-th word id of the first original item is smaller than the n-th word id of the second original item,
Taking the union of the two original items yields a higher-level item containing n+1 word ids; traverse the event set and count the events that contain all word ids of the higher-level item; if this event count exceeds the average occurrence count, retain the higher-level item, otherwise discard it; all retained higher-level items together form the higher-level item set;
Step S150, deciding whether to continue building: following the method of step S140, determine whether a still-higher-level item set can be built; if so, return to step S140; otherwise proceed to the association rule screening step S160;
Step S160, screening association rules: first define a threshold TH for screening association rules; then, for each final item D in the resulting final item set, obtain association rules as follows:
The final item D contains m word ids; take 1 to m-1 word ids from it to form a number of proper subsets E; for each proper subset E, count the events in the event set that contain the final item D and those that contain the proper subset E, denoted Cnt(D) and Cnt(E) respectively; compute Cnt(D)/Cnt(E) to obtain the probability P(D|E); if P(D|E) is greater than TH, the proper subset is considered to imply the final item, which forms one association rule; record and save the rules to obtain the association rule set;
Step S170, text restoration: using the lexicon file, traverse the association rule set and restore each rule to text by looking up each word id in the proper subset E and the final item D in the lexicon file; the words in the proper subset are then considered to lead to the remaining words of the final item, i.e. those outside the proper subset.
Preferably, in the step of constructing the unified event set, the records related to search or video comprise user query data, video playback data after a user query, and user-uploaded video data; each user query, each video playback and each user video upload is treated as one record.
Preferably, in the step of building the next-higher-level item set, when building the level-2 item set, the word ids in the level-1 item set are combined pairwise to obtain a number of level-2 items, each containing two elements; for each level-2 item obtained, the number of events in the event set that contain it is counted; if that event count exceeds the average occurrence count the item is retained, otherwise it is discarded; the retained level-2 items together form the level-2 item set.
Preferably, the threshold TH is set manually, so that the association rules it filters out broadly reflect the correlation between the user query data and the other words in the video playback data or user-uploaded video data.
Preferably, in the text restoration step, the proper subsets that contain only personal names are chosen, yielding the person-related keywords.
The invention also discloses a device for calculating associated keywords using complementary information, comprising the following units:
Unified event set construction unit: adds the records related to search or video and accumulates all records to obtain the event set; performs word segmentation on each record in the event set; scans the segmented text records in order and assigns each word an incrementing numeric value, according to the order in which it first appears, as its word id, so that every record is converted into a sequence of numbers; and saves each word with its word id to a lexicon file;
Word id average occurrence count unit: traverses the event set and counts the occurrences of each word id, where repeated occurrences of the same word id within one event are counted only once; sums the occurrence counts of all word ids and divides by the number of distinct word ids to obtain the average occurrence count;
Level-1 item set building unit: traverses all word ids and finds those whose occurrence count exceeds the average occurrence count; each such word id becomes a level-1 item, and all level-1 items together form the level-1 item set;
Higher-level item set building unit: the item set formed by the preceding unit is called the original item set, each original item containing n word ids, n ≥ 1; finds pairs of original items that satisfy the condition below and takes their union,
The condition is: the two original items are a first original item and a second original item; with the word ids in each item sorted in ascending order, the first n-1 word ids of the two items are identical, and the n-th word id of the first original item is smaller than the n-th word id of the second original item,
Taking the union of the two original items yields a higher-level item containing n+1 word ids; the unit traverses the event set and counts the events that contain all word ids of the higher-level item; if this event count exceeds the average occurrence count, the higher-level item is retained, otherwise it is discarded; all retained higher-level items together form the higher-level item set;
Continue-building decision unit: determines, according to the higher-level item set building unit, whether a still-higher-level item set can be built; if so, returns to the higher-level item set building unit; otherwise proceeds to the association rule screening unit;
Association rule screening unit: first defines a threshold TH for screening association rules; then, for each final item D in the resulting final item set, obtains association rules as follows:
The final item D contains m word ids; 1 to m-1 word ids are taken from it to form a number of proper subsets E; for each proper subset E, the events in the event set that contain the final item D and those that contain the proper subset E are counted, denoted Cnt(D) and Cnt(E) respectively; Cnt(D)/Cnt(E) is computed to obtain the probability P(D|E); if P(D|E) is greater than TH, the proper subset is considered to imply the final item, which forms one association rule; the rules are recorded and saved to obtain the association rule set;
Text restoration unit: using the lexicon file, traverses the association rule set and restores each rule to text by looking up each word id in the proper subset E and the final item D in the lexicon file; the words in the proper subset are then considered to lead to the remaining words of the final item, i.e. those outside the proper subset.
Preferably, in the unified event set construction unit, the records related to search or video comprise user query data, video playback data after a user query, and user-uploaded video data; each user query, each video playback and each user video upload is treated as one record.
Preferably, in the higher-level item set building unit, when building the level-2 item set, the word ids in the level-1 item set are combined pairwise to obtain a number of level-2 items, each containing two elements; for each level-2 item obtained, the number of events in the event set that contain it is counted; if that event count exceeds the average occurrence count the item is retained, otherwise it is discarded; the retained level-2 items together form the level-2 item set.
Preferably, the threshold TH is set manually, so that the association rules it filters out broadly reflect the correlation between the user query data and the other words in the video playback data or user-uploaded video data.
Preferably, in the text restoration unit, the proper subsets that contain only personal names are chosen, yielding the person-related keywords.
The invention therefore combines the complementary strengths of user queries, video playback after a query, and user-uploaded data, avoiding the biased person-related keyword results obtained from a single data source. Adding post-query playback data yields the keywords users are genuinely interested in; adding user-uploaded data covers the case where users do not know which keywords to search for; and raising the thresholds for events and association rules yields higher accuracy.
Brief description of the drawings
Fig. 1 is a flowchart of the associated keyword calculation method according to an embodiment of the invention;
Fig. 2 is a block diagram of the associated keyword calculation device according to an embodiment of the invention.
Detailed description
The invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
The invention takes users' query data as the main body and supplements it with post-query playback data and user-uploaded video data to obtain a unified event set. An association rule algorithm is used to find the person-related and event-related association rules in the event set. Finally, the associated keywords are parsed from the association rules.
Embodiment 1:
Referring to Fig. 1, a flowchart of the associated keyword calculation method using complementary information according to the invention is shown. The method comprises the following steps:
Step S110, constructing a unified event set: add the records related to search or video, and accumulate all records to obtain the event set; perform word segmentation on each record in the event set; scan the segmented text records in order and assign each word an incrementing numeric value, according to the order in which it first appears, as its word id, for example values incrementing from 1, so that every text record is converted into a sequence of numbers; and save each word with its word id, for example to a lexicon file.
Preferably, the records related to search or video comprise user query data, video playback data after a user query, and user-uploaded video data; each user query, each video playback and each user video upload is treated as one record.
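The id assignment in step S110 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the records have already been segmented into words (the source does not name a segmentation tool), keeps the lexicon in memory rather than in a file, and the function name `build_event_set` is hypothetical.

```python
def build_event_set(records):
    """Assign incrementing word ids in order of first appearance.

    `records` is a list of already-segmented records (lists of words).
    Returns (events, lexicon): each event is the list of word ids of one
    record, and lexicon maps every word to its id.
    """
    lexicon = {}
    events = []
    for words in records:
        ids = []
        for word in words:
            if word not in lexicon:
                lexicon[word] = len(lexicon) + 1  # ids start at 1
            ids.append(lexicon[word])
        events.append(ids)
    return events, lexicon


# Two short records, segmented by hand for illustration.
records = [
    ["Guo Degang"],
    ["Guo Degang", "latest", "crosstalk"],
]
events, lexicon = build_event_set(records)
print(events)
print(lexicon)
```

Writing `lexicon` out to a file afterwards would give the lexicon file that the text restoration step reads back.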
Step S120, computing the average occurrence count of the word ids: traverse the event set and count the occurrences of each word id, where repeated occurrences of the same word id within one event are counted only once; sum the occurrence counts of all word ids and divide by the number of distinct word ids to obtain the average occurrence count.
Step S130, building the level-1 item set: traverse all word ids and find those whose occurrence count exceeds the average occurrence count; each such word id becomes a level-1 item, and all level-1 items together form the level-1 item set;
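Steps S120 and S130 can be sketched together as below, using the five-event set from Embodiment 2 as input. The function name `frequent_single_items` and the representation of items as sorted tuples are assumptions made for illustration.

```python
from collections import Counter


def frequent_single_items(events):
    """Count each word id once per event, average the counts, keep ids above average."""
    counts = Counter()
    for event in events:
        counts.update(set(event))  # repeated ids within one event count once
    average = sum(counts.values()) / len(counts)
    level1 = sorted((wid,) for wid, c in counts.items() if c > average)
    return counts, average, level1


# The event set of Embodiment 2.
events = [
    [1],
    [1, 2, 3],
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 3, 4, 5, 6, 8, 9],
    [1, 3, 4, 6, 8, 10],
]
counts, average, level1 = frequent_single_items(events)
print(average)  # 26 occurrences over 10 distinct ids
print(level1)
```

Because the average is used as the support threshold, the threshold adapts to the data instead of being fixed in advance.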
Step S140, building the next-higher-level item set: the item set formed in the preceding step is called the original item set, each original item containing n word ids, n ≥ 1; find pairs of original items that satisfy the condition below and take their union (equivalent to a logical inclusive-OR),
The condition is: the two original items are a first original item and a second original item, for example a first original item A and a second original item B; with the word ids in each item sorted in ascending order, the first n-1 word ids of A and B are identical, and the n-th word id of A is smaller than the n-th word id of B,
Taking the union of the two original items yields a higher-level item C containing n+1 word ids; traverse the event set and count the events that contain all word ids of C; if this event count exceeds the average occurrence count, retain C, otherwise discard it;
All retained higher-level items, i.e. the level-(n+1) items, together form the higher-level item set, i.e. the level-(n+1) item set;
In particular, when building the level-2 item set, the word ids in the level-1 item set are combined pairwise to obtain a number of level-2 items, each containing two elements; for each level-2 item obtained, the number of events in the event set that contain it is counted; if that event count exceeds the average occurrence count the item is retained, otherwise it is discarded; the retained level-2 items together form the level-2 item set;
Note that, to accommodate the special case of building the level-2 item set, in step S140 when n = 1 and hence n-1 = 0, the zeroth word ids of all original items are regarded as identical, so the condition is considered satisfied. Consequently, when building the level-2 item set, every pair of level-1 items is united directly, whereas when building any other higher-level item set, two items can be united only if their first n-1 word ids are identical.
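The join condition of step S140 can be sketched as follows. Items are represented as ascending tuples of word ids, which is an assumption of this sketch, and `join_items` is a hypothetical name. The n = 1 special case needs no extra code, because the empty prefixes of two level-1 items always compare equal.

```python
def join_items(items):
    """Generate candidate (n+1)-word items from n-word items.

    Each item is a tuple of word ids in ascending order. Two items are
    joined when their first n-1 ids are identical and the last id of the
    first is smaller than the last id of the second.
    """
    items = sorted(items)
    candidates = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.append(a + (b[-1],))
    return candidates


print(join_items([(3, 4), (3, 6), (4, 8)]))  # (3,4) joins (3,6); (4,8) shares no prefix
print(join_items([(1,), (2,), (3,)]))        # n = 1: every pair is joined
```

The candidates returned here still have to be counted against the event set and filtered by the average occurrence count before they form the next level.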
Step S150, deciding whether to continue building: following the method of step S140, determine whether a still-higher-level item set can be built, i.e. whether there are two original items from which a still-higher-level item can be constructed whose event count in the event set exceeds the average occurrence count; if so, return to step S140; otherwise proceed to the association rule screening step S160;
Step S160, screening association rules: first define a threshold TH for screening association rules; then, for each final item in the resulting final item set, obtain association rules as follows:
The final item D contains m word ids; take 1 to m-1 word ids from it to form a number of proper subsets E; for each proper subset E, count the events in the event set that contain the final item D and those that contain the proper subset E, denoted Cnt(D) and Cnt(E) respectively; compute Cnt(D)/Cnt(E) to obtain the probability P(D|E); if P(D|E) is greater than TH, the proper subset is considered to imply the final item, which forms one association rule; record and save the rules to obtain the association rule set.
Preferably, the threshold TH is set manually, so that the association rules it filters out broadly reflect the correlation between the user query data, i.e. the query string, and the other words in the video playback data or user-uploaded video data.
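The screening in step S160 can be sketched as below. The names `count_events` and `screen_rules` and the small three-event data set are hypothetical and chosen only to keep the arithmetic easy to follow; the sketch assumes the final item occurs in at least one event, so Cnt(E) is never zero.

```python
from itertools import combinations


def count_events(events, item):
    """Number of events that contain every word id of `item`."""
    wanted = set(item)
    return sum(1 for event in events if wanted <= set(event))


def screen_rules(events, final_item, th):
    """Return (E, D, P(D|E)) for every proper subset E of D with P(D|E) > th."""
    cnt_d = count_events(events, final_item)
    rules = []
    for size in range(1, len(final_item)):  # subsets of 1 .. m-1 word ids
        for subset in combinations(final_item, size):
            confidence = cnt_d / count_events(events, subset)
            if confidence > th:
                rules.append((subset, final_item, confidence))
    return rules


# Toy data: ids 1 and 2 occur together twice, id 1 occurs once alone.
toy_events = [[1, 2], [1, 2], [1]]
print(screen_rules(toy_events, (1, 2), 0.7))
```

Here P(D|{1}) is 2/3 and falls below the threshold 0.7, while P(D|{2}) is 2/2, so only the rule from {2} survives.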
Step S170, text restoration: using the lexicon file, traverse the association rule set and restore each rule to text by looking up each word id in the proper subset E and the final item D in the lexicon file; the words in the proper subset are then considered to lead to the remaining words of the final item, i.e. those outside the proper subset.
Preferably, the proper subsets that contain only personal names are chosen, so that person-related keywords are identified correctly.
Of course, proper subsets that contain events can also be chosen to obtain the keywords related to an event.
Symbol text should normally be removed during text restoration.
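Text restoration can be sketched as a reverse lookup in the lexicon followed by removal of symbol words. The function name `restore_rule`, the in-memory lexicon and the choice of 《 and 》 as the symbols to drop are assumptions of this sketch; the word ids are the ones used in Embodiment 2, where id 4 is Yu Qian.

```python
def restore_rule(rule, lexicon, symbols=("《", "》")):
    """Turn a rule (subset E, final item D) of word ids back into words.

    Returns the words of E and the remaining words of D outside E,
    with symbol words removed.
    """
    subset, final_item = rule
    id_to_word = {wid: word for word, wid in lexicon.items()}
    antecedent = [id_to_word[w] for w in subset]
    consequent = [
        id_to_word[w]
        for w in final_item
        if w not in subset and id_to_word[w] not in symbols
    ]
    return antecedent, consequent


lexicon = {"Guo Degang": 1, "crosstalk": 3, "Yu Qian": 4, "《": 6, "》": 8}
print(restore_rule(((4,), (1, 3, 4, 6, 8)), lexicon))
print(restore_rule(((1, 4), (1, 3, 4, 6, 8)), lexicon))
```

Selecting only person-name subsets, as the preferred option describes, would need an additional name list or tagger that the source does not specify.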
The invention therefore combines the complementary strengths of user queries, playback after a query, and user-uploaded data, avoiding the biased person-related keyword results obtained from a single data source. Adding post-query playback data yields the keywords users are genuinely interested in; adding user-uploaded data covers the case where users do not know which keywords to search for; and raising the thresholds for events and association rules yields higher accuracy.
Embodiment 2:
This embodiment gives a concrete example of the associated keyword calculation method of Embodiment 1:
From the three kinds of data (user queries, playback after user queries, and user uploads), five events in total are obtained after merging and word segmentation:
Event 1: Guo Degang (1)
Event 2: Guo Degang (1) latest (2) crosstalk (3)
Event 3: Guo Degang (1) Yu Qian (4) 2013 (5) latest (2) crosstalk (3) 《 (6) you are MashiMaro (7) 》 (8)
Event 4: Guo Degang (1) Yu Qian (4) 2013 (5) latest (2) crosstalk (3) 《 (6) sparking chicken (9) 》 (8)
Event 5: Guo Degang (1) Yu Qian (4) crosstalk (3) 《 (6) timid wealths and ranks (10) 》 (8)
The number in parentheses after each word is the id assigned to that word. There are 10 word ids in total and the word occurrences sum to 26, so each word occurs 2.6 times on average; the threshold used below is therefore 2.6.
The resulting event set is expressed as:
{{1},{1,2,3},{1,2,3,4,5,6,7,8},{1,2,3,4,5,6,8,9},{1,3,4,6,8,10}}
First find the level-1 frequent item set by counting the number of occurrences of each word:
[Table: number of occurrences of each word id in the event set]
As can be seen, word ids 1, 2, 3, 4, 6 and 8 occur more often than the threshold 2.6, so they form the level-1 item set, expressed as {{1},{2},{3},{4},{6},{8}}.
Next, construct the level-2 item set. The candidate level-2 items are:
{
{1,2}, {1,3}, {1,4}, {1,6}, {1,8},
{2,3}, {2,4}, {2,6}, {2,8},
{3,4}, {3,6}, {3,8},
{4,6}, {4,8},
{6,8}
}
The number of events in the event set containing each of these level-2 items is then counted.
Because the level-2 items {2,4}, {2,6} and {2,8} each occur in only 2 events, below the threshold 2.6, they are deleted, giving the level-2 item set:
{
{1,2}, {1,3}, {1,4}, {1,6}, {1,8},
{2,3},
{3,4}, {3,6}, {3,8},
{4,6}, {4,8},
{6,8}
}
Then construct the level-3 item set. The candidate level-3 items are:
{
{1,2,3}, {1,2,4}, {1,2,6}, {1,2,8},
{1,3,4}, {1,3,6}, {1,3,8},
{1,4,6}, {1,4,8},
{1,6,8},
{3,4,6}, {3,4,8}, {3,6,8},
{4,6,8}
}
The level-3 items are obtained from the level-2 items as follows. Take any two level-2 items, for example {3,4} and {3,6}, and sort the elements within each by id (here they are already sorted), giving {3,4} and {3,6}. Because the first n-1 (2-1=1) elements of the two items are identical and the second value of the former, 4, is smaller than the second value of the latter, 6, the level-3 item {3,4,6} is obtained. By contrast, the two level-2 items {3,4} and {4,8} (each sorted by id) differ in their first n-1 (2-1=1) elements, so they cannot be merged into {3,4,8}; that item is instead obtained from {3,4} and {3,8}.
Count the number of events in which each level-3 item occurs:
[Table: number of events containing each candidate level-3 item]
Because the level-3 items {1,2,4}, {1,2,6} and {1,2,8} occur fewer times than the threshold 2.6, the resulting level-3 item set is:
{
{1,2,3},
{1,3,4}, {1,3,6}, {1,3,8},
{1,4,6}, {1,4,8},
{1,6,8},
{3,4,6}, {3,4,8}, {3,6,8},
{4,6,8}
}
Then construct the level-4 item set. The candidate level-4 items are:
{
{1,3,4,6}, {1,3,4,8}, {1,3,6,8},
{1,4,6,8},
{3,4,6,8}
}
Each of these level-4 items occurs in more events than the threshold 2.6.
Continue by constructing the level-5 item set. The only candidate level-5 item is:
{
{1,3,4,6,8}
}
This single level-5 item occurs in 3 events, which is above the threshold 2.6, and no higher-level item set can be built from it. This level-5 item is therefore the final item.
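The level-by-level construction above can be reproduced with a short loop. This is a sketch under stated assumptions rather than the patented implementation: items are ascending tuples of word ids, `mine_levels` and `count_events` are hypothetical names, and the loop simply stops when no candidate survives the threshold.

```python
from collections import Counter


def count_events(events, item):
    """Number of events (sets of word ids) containing every id of `item`."""
    return sum(1 for event in events if set(item) <= event)


def mine_levels(events):
    """Build level-1, level-2, ... item sets until no higher level survives."""
    events = [set(e) for e in events]
    counts = Counter(w for e in events for w in e)  # one count per event
    threshold = sum(counts.values()) / len(counts)
    level = sorted((w,) for w, c in counts.items() if c > threshold)
    levels = []
    while level:
        levels.append(level)
        # Join items that share their first n-1 ids (step S140).
        candidates = [
            a + (b[-1],)
            for i, a in enumerate(level)
            for b in level[i + 1:]
            if a[:-1] == b[:-1]
        ]
        level = [c for c in candidates if count_events(events, c) > threshold]
    return threshold, levels


events = [
    [1],
    [1, 2, 3],
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 3, 4, 5, 6, 8, 9],
    [1, 3, 4, 6, 8, 10],
]
threshold, levels = mine_levels(events)
print(threshold)
print(levels[-1])  # the final item
```

On the five events of this embodiment the loop stops after five levels, and the last level holds the single item (1, 3, 4, 6, 8).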
Construct the proper subsets of this final item. They are:
{
{1}, {3}, {4}, {6}, {8},
{1,3}, {1,4}, {1,6}, {1,8},
{3,4}, {3,6}, {3,8},
{4,6}, {4,8},
{6,8},
{1,3,4}, {1,3,6}, {1,3,8},
{1,4,6}, {1,4,8},
{1,6,8},
{3,4,6}, {3,4,8}, {3,6,8},
{4,6,8},
{1,3,4,6}, {1,3,4,8}, {1,3,6,8}, {1,4,6,8}, {3,4,6,8}
}
The number of events in the event set that contain each of these proper subsets is:
[Table: number of events containing each proper subset of the final item]
For example, the proper subset {1} occurs in 5 events, so
P({1,3,4,6,8}|{1}) = 3/5 = 0.6.
With the threshold TH set to 0.55, this association rule is valid. Translating it into words via the lexicon gives
{Guo Degang}->{Guo Degang, crosstalk, Yu Qian, 《, 》}
The keywords associated with Guo Degang are therefore "crosstalk" and "Yu Qian". The title marks 《 》 are filtered out by the symbol rule.
If the threshold TH is set to 0.7, the frequencies of the subsets show that only {1}->{1,3,4,6,8} fails to reach the threshold, so the valid association rules are all rules other than {1}->{1,3,4,6,8}.
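The two threshold settings can be checked by recomputing P(D|E) for every proper subset of the final item directly from the five events. The variable names below are illustrative only.

```python
from itertools import combinations

events = [
    {1},
    {1, 2, 3},
    {1, 2, 3, 4, 5, 6, 7, 8},
    {1, 2, 3, 4, 5, 6, 8, 9},
    {1, 3, 4, 6, 8, 10},
]
final_item = (1, 3, 4, 6, 8)


def count_events(item):
    """Number of events containing every word id of `item`."""
    return sum(1 for event in events if set(item) <= event)


cnt_d = count_events(final_item)
# P(D|E) = Cnt(D) / Cnt(E) for every non-empty proper subset E of D.
confidence = {
    subset: cnt_d / count_events(subset)
    for size in range(1, len(final_item))
    for subset in combinations(final_item, size)
}
failing_055 = [s for s, p in confidence.items() if p <= 0.55]
failing_07 = [s for s, p in confidence.items() if p <= 0.7]
print(confidence[(1,)], confidence[(3,)])
print(failing_055, failing_07)
```

In this recomputation {1} gives 3/5, {3} and {1,3} give 3/4, and every other subset gives 3/3, so no rule is rejected at TH = 0.55 and only the rule from {1} is rejected at TH = 0.7.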
Restoring the subsets to text and selecting those that contain only personal names (word 1, Guo Degang; word 4, Yu Qian) gives:
{4}->{1,3,4,6,8}
{1,4}->{1,3,4,6,8}
These two rules, expressed in text, are:
1, {Yu Qian}->{Guo Degang, crosstalk, Yu Qian, 《, 》}
2, {Guo Degang, Yu Qian}->{Guo Degang, crosstalk, Yu Qian, 《, 》}
These can be understood as follows:
The associated keywords of the person "Yu Qian" are "Guo Degang" and "crosstalk".
The associated keyword of the person combination {"Guo Degang" + "Yu Qian"} is "crosstalk".
It can be seen that computing association rules identifies person-related keywords more accurately.
The example above uses only subsets containing personal names; subsets containing events can also be chosen, and those skilled in the art will understand that the keywords related to an event can be identified correctly in the same way.
Embodiment 3:
The invention also discloses a kind of associated keyword calculation element that adopts complementary information, comprise as lower unit:
Unified event sets tectonic element 210: the record that interpolation is relevant with search or video, add up all records and obtain event sets, each record in described event sets is cut to word to be processed, the text entry of word has been cut in sequential scanning, and the digital value increasing progressively to the order that each word occurs the earliest according to it, as the word id of this word, thereby every record is converted to the sequence of several numerals, and preserves each word and its corresponding word id to lexicon file;
The average occurrence number statistic unit 220 of word id: travel through described event sets, add up the number of times that each word id occurs, the repeatedly appearance of same word id in an event only calculated once, adds up total number of times of all word id appearance and the quantity of word id, obtains the average occurrence number of word id;
Level-1 item set construction unit 230: traverses all word ids and finds those whose occurrence count exceeds the average occurrence count; each such word id becomes a level-1 item, and all level-1 items together form the level-1 item set;
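As a short illustration, the selection can be written as a filter over the per-word-id counts. Representing each level-1 item as a one-element tuple is a choice made here for convenience, and the counts and average are hypothetical values.

```python
def level1_items(counts, average):
    """Return word ids whose count exceeds the average, as 1-tuples."""
    return [(word_id,) for word_id in sorted(counts) if counts[word_id] > average]


counts = {1: 3, 2: 2, 3: 1, 4: 1}  # hypothetical per-word-id event counts
average = 1.75                     # 7 occurrences / 4 word ids
items = level1_items(counts, average)
print(items)  # [(1,), (2,)]
```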
Next-level item set construction unit 240: the item set just formed by the previous unit is called the original item set; each original item contains n word ids, n ≥ 1; two original items that satisfy the condition below are found and a union operation is performed on them,
The condition is: the two original items comprise a first original item and a second original item; with the word ids in each of the two original items sorted in ascending order, the first n-1 word ids of the first and second original items are identical, and the n-th word id of the first original item is smaller than the n-th word id of the second original item,
The union of the two original items yields a next-level item containing n+1 word ids; the event set is traversed and the number of events that contain all word ids of the next-level item is counted; if this event count exceeds the average occurrence count, the next-level item is retained, otherwise it is discarded; all retained next-level items together form the next-level item set;
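The join condition and the retention test can be sketched as below. Items are assumed to be ascending tuples of word ids, and the names `join` and `next_level`, the sample events, and the average of 1.5 are illustrative assumptions rather than part of the described device.

```python
def join(first, second):
    """Union of two sorted n-tuples if they meet the condition, else None."""
    # Same first n-1 word ids, and first's last id smaller than second's.
    if first[:-1] == second[:-1] and first[-1] < second[-1]:
        return first + (second[-1],)
    return None


def next_level(items, events, average):
    """Build the next-level item set from the original item set."""
    event_sets = [set(e) for e in events]
    kept = []
    for first in items:
        for second in items:
            candidate = join(first, second)
            if candidate is None:
                continue
            # Number of events containing every word id of the candidate.
            support = sum(1 for e in event_sets if e.issuperset(candidate))
            if support > average:
                kept.append(candidate)
    return kept


events = [[1, 2, 3], [1, 2, 3], [1, 2], [3, 4]]  # hypothetical event set
level2 = next_level([(1,), (2,), (3,)], events, 1.5)
level3 = next_level(level2, events, 1.5)
print(level2, level3)
```

Requiring the first original item's last word id to be the smaller one means each candidate is produced exactly once, from a single ordered pair of original items.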
Continue-building judgment unit 250: judges, according to the next-level item set construction unit, whether a still higher-level item set can be built; if so, control returns to the next-level item set construction unit, otherwise it passes to the association rule screening unit;
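One way to read this judgment is as a loop that keeps building until no higher-level item survives. The sketch below assumes that reading and treats the last non-empty level as the final item set; `build_all_levels` and the toy `toy_build_next` stub are hypothetical stand-ins for the construction unit.

```python
def build_all_levels(level1, build_next):
    """Repeat next-level construction until it produces no items."""
    levels = [level1]
    while levels[-1]:
        higher = build_next(levels[-1])
        if not higher:
            break  # no higher level can be built: go on to rule screening
        levels.append(higher)
    return levels


# Toy stand-in for the construction unit, keyed by the size of the items.
TOY_RESULTS = {1: [(1, 2), (1, 3), (2, 3)], 2: [(1, 2, 3)]}


def toy_build_next(items):
    return TOY_RESULTS.get(len(items[0]), [])


levels = build_all_levels([(1,), (2,), (3,)], toy_build_next)
final_items = levels[-1]  # assumption: the last non-empty level is final
print(final_items)  # [(1, 2, 3)]
```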
Association rule screening unit 260: first defines a threshold TH for screening association rules; for each final item D in the resulting final item set, association rules are obtained by the following screening:
The final item D contains m word ids; 1 to m-1 word ids are taken from it to form a plurality of proper subsets E; for each proper subset E, the numbers of events in the event set that contain the final item D and the proper subset E are counted and denoted Cnt(D) and Cnt(E), respectively; Cnt(D)/Cnt(E) is calculated to obtain the probability value P(D|E); if P(D|E) is greater than TH, the proper subset is considered able to derive the final item and an association rule is formed; the rules are recorded and saved to obtain the association rule set;
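The screening can be illustrated with `itertools.combinations`, which enumerates the proper subsets of sizes 1 to m-1. The function names, the sample events, and the threshold value 0.7 are assumptions for the illustration; the text only states that TH is defined beforehand.

```python
from itertools import combinations


def count_events(events, item):
    """Number of events that contain every word id of the item."""
    return sum(1 for e in events if set(item) <= set(e))


def screen_rules(final_items, events, th):
    """Return (E, D) pairs with P(D|E) = Cnt(D) / Cnt(E) greater than th."""
    rules = []
    for d in final_items:
        cnt_d = count_events(events, d)
        for size in range(1, len(d)):          # proper subsets: 1 .. m-1 ids
            for e in combinations(d, size):
                cnt_e = count_events(events, e)
                if cnt_e and cnt_d / cnt_e > th:
                    rules.append((e, d))
    return rules


events = [[1, 2, 3], [1, 2, 3], [1, 2], [3, 4]]  # hypothetical event set
rules = screen_rules([(1, 2, 3)], events, 0.7)
print(rules)
```

Here Cnt(D) is 2. Every single word id and the pair (1, 2) occur in 3 events, giving 2/3, which is below the threshold; the pairs (1, 3) and (2, 3) occur in 2 events, giving 1.0, so only those two rules are kept.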
Text restoration unit 270: using the lexicon file, traverses the association rule set obtained and performs text restoration on every association rule; each word id in the proper subset E and the final item D is looked up in the lexicon file to obtain the original text, and the words in the proper subset are considered to yield the remaining words of the final item, that is, those not in the proper subset.
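A small sketch of the restoration step is given below. It assumes the lexicon was saved as a word-to-id mapping and is inverted before lookup; the in-memory dictionary stands in for the lexicon file, and the rule and the names used are hypothetical.

```python
def restore_rule(rule, id_to_word):
    """Turn an (E, D) rule of word ids into (subset words, derived words)."""
    subset, final_item = rule
    subset_words = [id_to_word[i] for i in subset]
    # Remaining words of the final item, i.e. those not in the proper subset.
    derived_words = [id_to_word[i] for i in final_item if i not in subset]
    return subset_words, derived_words


# Hypothetical lexicon (word -> id), inverted for lookup by word id.
lexicon = {"Guo Degang": 1, "crosstalk": 3, "Yu Qian": 4}
id_to_word = {word_id: word for word, word_id in lexicon.items()}

restored = restore_rule(((4,), (1, 3, 4)), id_to_word)
print(restored)  # (['Yu Qian'], ['Guo Degang', 'crosstalk'])
```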
Preferably, in the unified event set construction unit, the records related to search or video comprise user query data, data of videos played after a user query, and user video upload data; each user query, each video playback, and each user video upload is treated as one record.
Preferably, for the next-level item set construction unit, when the level-2 item set is built, the word ids in the level-1 item set are combined pairwise to obtain a plurality of level-2 items, each containing two elements; for each level-2 item obtained, the number of events in the event set that contain the level-2 item is counted; if the event count exceeds the average occurrence count the item is retained, otherwise it is discarded; the retained level-2 items are collected to form the level-2 item set.
The threshold TH is set manually, such that the association rules screened by TH substantially reflect the correlation between user query data and the other words in the played video data or user-uploaded video data.
Preferably, in the text restoration unit, proper subsets containing only personal names are selected to obtain the keywords related to a person.
Obviously, those skilled in the art should understand that each of the above units or steps of the present invention can be implemented with a general-purpose computing device and can be concentrated on a single computing device. Alternatively, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific embodiments of the invention shall not be deemed limited to it. Those of ordinary skill in the art to which the invention belongs can make a number of simple deductions or substitutions without departing from the inventive concept, and all of these shall be regarded as falling within the scope of protection determined by the submitted claims.

Claims (10)

1. associated keyword computing method that adopt complementary information, comprise the steps:
Construct unified event sets step S110: the record that interpolation is relevant with search or video, add up all records and obtain event sets, each record in described event sets is cut to word to be processed, the text entry of word has been cut in sequential scanning, and the digital value increasing progressively to the order that each word occurs the earliest according to it, as the word id of this word, thereby every record is converted to the sequence of several numerals, and preserves each word and its corresponding word id to lexicon file;
The average occurrence number S120 of statistics word id: travel through described event sets, add up the number of times that each word id occurs, the repeatedly appearance of same word id in an event only calculated once, adds up total number of times of all word id appearance and the quantity of word id, obtains the average occurrence number of word id;
Build one-level item set step S130: travel through all word id, and find out, occurrence number surpasses the word id of average occurrence number, and each word id becomes an one-level item, add all one-level items and form the set of one-level item;
Build high one-level item set step S140: the item set for the previous step of firm formation, be called primitive term set, each primitive term contains n word id, and n31 finds out satisfied two primitive terms of condition below and carries out " also " computing,
Described condition is: described two primitive terms comprise the first primitive term and the second primitive term, by each word id in described two primitive terms according to from small to large sequence, the front n-1 item of the first primitive term and the second primitive term is identical, and the n item word id of the first primitive term is less than the n item word id of the second primitive term
Described two primitive terms are carried out to " also " computing, the high one-level item that contains n+1 item obtaining, traversal event sets, the event number that statistics contains all word id in described high one-level item, if described event number surpasses described average occurrence number, retain described high one-level item, otherwise abandon described high one-level item, add all high one-level items that retain, form the set of high one-level item;
Continue to build set determining step S150, according to the method for the high one-level item of described structure set step, can judgement build the set of higher one-level item, if can, return to the high one-level item of described structure set step S140, otherwise enter screening correlation rule step S160;
Screening correlation rule step S160; First define threshold value TH, for screening correlation rule, for each the final multinomial D in the final multinomial set obtaining, according to following way screening, obtain correlation rule:
Described final multinomial D contains m word id; therefrom take out 1 to m-1 word id and form a plurality of proper subclass E; for each proper subclass E; the event number that statistics contains final multinomial D and described proper subclass E respectively in described event sets; be designated as respectively Cnt (D) and Cnt (E); calculate Cnt (D)/Cnt (E) and obtain probable value P (D|E); if P (D|E) is greater than TH; think that described proper subclass can derive finally multinomial; form a correlation rule, and record preservation obtains correlation rule set;
Text reconstitution steps S170: utilize described lexicon file, the described correlation rule set that traversal has obtained, every correlation rule is carried out to text recovery, each word id in described proper subclass E and final multinomial D is obtained to original text according to lexicon file inquiry, and think word in proper subclass can obtain final multinomial in remaining word except proper subclass.
2. The associated keyword calculation method using complementary information according to claim 1, characterized in that:
In the unified event set construction step, the records related to search or video comprise user query data, data of videos played after a user query, and user video upload data; each user query, each video playback, and each user video upload is treated as one record.
3. The associated keyword calculation method using complementary information according to claim 1, characterized in that:
For the next-level item set construction step, when the level-2 item set is built, the word ids in the level-1 item set are combined pairwise to obtain a plurality of level-2 items, each containing two elements; for each level-2 item obtained, the number of events in the event set that contain the level-2 item is counted; if the event count exceeds the average occurrence count the item is retained, otherwise it is discarded; the retained level-2 items are collected to form the level-2 item set.
4. The associated keyword calculation method using complementary information according to any one of claims 1-3, characterized in that:
The threshold TH is set manually, such that the association rules screened by TH substantially reflect the correlation between user query data and the other words in the played video data or user-uploaded video data.
5. The associated keyword calculation method using complementary information according to claim 4, characterized in that:
In the text restoration step, proper subsets containing only personal names are selected to obtain the keywords related to a person.
6. An associated keyword calculation device using complementary information, comprising the following units:
Unified event set construction unit: adds records related to search or video and accumulates all records to obtain an event set; performs word segmentation on each record in the event set; scans the segmented text records in order and assigns each word an incrementing numeric value, in the order of its first appearance, as the word id of that word, so that every record is converted into a sequence of numbers; and saves each word with its corresponding word id to a lexicon file;
Word-id average occurrence count statistics unit: traverses the event set and counts the number of times each word id occurs, where multiple occurrences of the same word id within one event are counted only once; determines the total number of occurrences of all word ids and the number of distinct word ids, and from these obtains the average occurrence count of the word ids;
Level-1 item set construction unit: traverses all word ids and finds those whose occurrence count exceeds the average occurrence count; each such word id becomes a level-1 item, and all level-1 items together form the level-1 item set;
Next-level item set construction unit: the item set just formed by the previous unit is called the original item set; each original item contains n word ids, n ≥ 1; two original items that satisfy the condition below are found and a union operation is performed on them,
The condition is: the two original items comprise a first original item and a second original item; with the word ids in each of the two original items sorted in ascending order, the first n-1 word ids of the first and second original items are identical, and the n-th word id of the first original item is smaller than the n-th word id of the second original item,
The union of the two original items yields a next-level item containing n+1 word ids; the event set is traversed and the number of events that contain all word ids of the next-level item is counted; if this event count exceeds the average occurrence count, the next-level item is retained, otherwise it is discarded; all retained next-level items together form the next-level item set;
Continue-building judgment unit: judges, according to the next-level item set construction unit, whether a still higher-level item set can be built; if so, control returns to the next-level item set construction unit, otherwise it passes to the association rule screening unit;
Association rule screening unit: first defines a threshold TH for screening association rules; for each final item D in the resulting final item set, association rules are obtained by the following screening:
The final item D contains m word ids; 1 to m-1 word ids are taken from it to form a plurality of proper subsets E; for each proper subset E, the numbers of events in the event set that contain the final item D and the proper subset E are counted and denoted Cnt(D) and Cnt(E), respectively; Cnt(D)/Cnt(E) is calculated to obtain the probability value P(D|E); if P(D|E) is greater than TH, the proper subset is considered able to derive the final item and an association rule is formed; the rules are recorded and saved to obtain the association rule set;
Text restoration unit: using the lexicon file, traverses the association rule set obtained and performs text restoration on every association rule; each word id in the proper subset E and the final item D is looked up in the lexicon file to obtain the original text, and the words in the proper subset are considered to yield the remaining words of the final item, that is, those not in the proper subset.
7. The associated keyword calculation device using complementary information according to claim 6, characterized in that:
In the unified event set construction unit, the records related to search or video comprise user query data, data of videos played after a user query, and user video upload data; each user query, each video playback, and each user video upload is treated as one record.
8. The associated keyword calculation device using complementary information according to claim 6, characterized in that:
For the next-level item set construction unit, when the level-2 item set is built, the word ids in the level-1 item set are combined pairwise to obtain a plurality of level-2 items, each containing two elements; for each level-2 item obtained, the number of events in the event set that contain the level-2 item is counted; if the event count exceeds the average occurrence count the item is retained, otherwise it is discarded; the retained level-2 items are collected to form the level-2 item set.
9. The associated keyword calculation device using complementary information according to any one of claims 6-8, characterized in that:
The threshold TH is set manually, such that the association rules screened by TH substantially reflect the correlation between user query data and the other words in the played video data or user-uploaded video data.
10. The associated keyword calculation device using complementary information according to claim 9, characterized in that:
In the text restoration unit, proper subsets containing only personal names are selected to obtain the keywords related to a person.
Application number: CN201310620943.XA. Priority date: 2013-11-30. Filing date: 2013-11-30. Title: Method and device for calculating associated keywords using complementary information. Status: Active. Publication: CN103593469B (en)

Publications (2)

Publication Number Publication Date
CN103593469A (en) 2014-02-19
CN103593469B (en) 2016-04-20

Family

ID=50083610

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee after: Youku network technology (Beijing) Co.,Ltd.

Address before: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee before: 1VERGE INTERNET TECHNOLOGY (BEIJING) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20200323

Address after: 310018 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee before: Youku network technology (Beijing) Co.,Ltd.