WO2015043389A1 - Participle information push method and device based on video search - Google Patents

Participle information push method and device based on video search

Publication number
WO2015043389A1
Authority
WO
WIPO (PCT)
Prior art keywords
participle
video
resource data
word
video resource
Application number
PCT/CN2014/086519
Other languages
French (fr)
Chinese (zh)
Inventor
崔代超
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority claimed from CN201310462461.6A (external priority, patent CN103491205B)
Priority claimed from CN201310462214.6A (external priority, patent CN103500214B)
Priority claimed from CN201310462768.6A (external priority, patent CN103488787B)
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司
Publication of WO2015043389A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60 Network streaming of media packets
    • H04L65/61 Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • H04L65/612 Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio for unicast
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/55 Push-based network services

Definitions

  • The present invention relates to the technical field of the Internet, and in particular to a word segmentation information push method based on video search and a word segmentation information push device based on video search.
  • A video search engine is a vertical search technology, as distinct from comprehensive (general-purpose) search.
  • The video search engine crawls video results on the Internet and builds an index. Because it can return pure video results to the searcher, it can greatly reduce the time users spend finding a video.
  • Existing video search engines fall short in related recommendations: some offer no related recommendations at all, and those that do simply build an association system from users' search history data and manual collation, and implement recommendations on that basis.
  • Such a recommendation system is based on users' existing search habits, and its recall rate is low.
  • The range of users' searches is generally much smaller than the resources available on the Internet, so high-quality video on the Internet cannot be fully exploited.
  • Another method of search recommendation is to manually compile a resource association system, or obtain such a system from other knowledge systems, and apply it to the recommendation system. For example, a search for "square dance" yields recommended words such as "social dance", "belly dance", and "aerobics", and a search for "dota" yields recommended words such as "CrossFire" and "World of Warcraft". However, this system also has a low recall rate and generally cannot make recommendations for long-tail searches.
  • The present invention has been made in view of the above problems, in order to provide a video search-based word segmentation information push method and a corresponding video search-based word segmentation information push device that overcome the above problems or at least partially solve them.
  • A method for pushing word segmentation information based on video search, including:
  • wherein the co-occurrence rate is the probability that the one or more first participles and the second participle co-occur in the same video resource data;
  • A method for pushing an online play portal object based on video search, including:
  • wherein the co-occurrence rate is the probability that the one or more first participles and the second participle co-occur in the same video resource data;
  • A method for pushing an associated resource address based on video search, including:
  • wherein the co-occurrence rate is the probability that the one or more first participles and the second participle co-occur in the same video resource data;
  • A video search-based word segmentation information pushing apparatus, including:
  • a video search string receiving module, adapted to receive a video search string;
  • a first word segmentation mapping module, adapted to map the video search string into one or more first participles;
  • a second participle finding module, adapted to find associated second participles whose co-occurrence rate with the one or more first participles is higher than a preset threshold, wherein the co-occurrence rate is the probability that the one or more first participles and a second participle co-occur in the same video resource data;
  • a combined push module, adapted to push a combination of the one or more first participles and the one or more associated second participles.
  • A computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform the video search-based word segmentation information push method according to any one of claims 1-7.
  • A computer readable medium storing the computer program according to claim 17 is provided.
  • The invention makes push recommendations based on existing published content, which frees the search engine from dependence on users' search habits: video resource data for which the video library holds many relevant resources, but which users seldom search for, can still be pushed.
  • The user can directly perform further levels of searching based on the pushed combination, obtaining more results from a simple search without submitting the search multiple times, which reduces accesses to the server.
  • FIG. 1 is a flow chart showing the steps of an embodiment of a word segmentation information push method based on video search according to one embodiment of the present invention;
  • FIG. 2 is a flow chart showing the steps of an embodiment of a method for pushing an online play portal object based on video search according to one embodiment of the present invention;
  • FIG. 3 is a flow chart showing the steps of an embodiment of a method for pushing an associated resource address based on video search according to one embodiment of the present invention;
  • FIG. 4 is a block diagram showing an embodiment of a video search-based word segmentation information pushing apparatus according to one embodiment of the present invention;
  • FIG. 5 schematically shows a block diagram of a computing device for performing the method according to the invention; and
  • FIG. 6 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
  • Referring to FIG. 1, a flow chart of the steps of an embodiment of a word segmentation information push method based on video search according to one embodiment of the present invention is shown, which may specifically include the following steps:
  • Step 101: Receive a video search string.
  • The video search string may be video search information input by the user, and may be used to request a search for related video resource data.
  • The video search string can be a single word, that is, one semantically independent word such as "Mid-Autumn Festival", "Dragon Boat Festival", or "National Day". It can also be a compound word, that is, two or more semantically independent words, such as "Mid-Autumn moon cakes" or "National Day tourism".
  • Step 102: Map the video search string into one or more first participles.
  • The participles that can be mapped to may be preset, and may be used to calculate the co-occurrence rate between different participles.
  • One or more mapping rules may also be preset. They may include removing dirty words, modifiers, modal particles, overly broad words, and the like from the video search string; setting stop words, that is, common words such as "me" and "you" that are filtered out; and association correspondences, which map multiple expressions of the same thing to a single expression, for example associating "August 15th", "Mid-Autumn", and "moon cake festival" with "Mid-Autumn Festival". Other mapping rules may also be included, which is not limited by the embodiment of the present invention.
  • English text is word-based, with words separated by spaces, whereas Chinese text is character-based, and the characters in a sentence must be combined to express a meaning. Take the English sentence "I am a student" and its Chinese equivalent: a computer can easily tell from the spaces that "student" is a word, but cannot easily tell that the two Chinese characters for "learning" and "sheng" together form the single word "student".
  • Dividing a Chinese character sequence into meaningful words is Chinese word segmentation. For example, segmenting the Chinese sentence for "I am a student" yields four words, literally "me", "yes", "one", and "student".
  • Word segmentation based on string matching refers to matching the Chinese character string to be analyzed against the terms in a preset machine dictionary according to a certain strategy. If a string is found in the dictionary, the match succeeds (a word is identified).
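  • As an illustration of string-matching segmentation, the following minimal Python sketch uses forward maximum matching. This is one common matching strategy, assumed here because the text does not name a specific one. The function name, the `max_len` parameter, and the toy dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary term at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                # A single character is emitted as-is when nothing longer matches.
                words.append(piece)
                i += size
                break
    return words


# Hypothetical toy dictionary; the sentence means "I am a student".
toy_dictionary = {"我", "是", "一个", "学生"}
segments = forward_max_match("我是一个学生", toy_dictionary)
```

  • Longest-first matching prefers the two-character word for "student" over its individual characters, which is the ambiguity described above.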
  • In practice, a word segmentation system uses mechanical segmentation as a preliminary step and further improves segmentation accuracy by using various other linguistic information.
  • The word segmentation method based on feature scanning or mark segmentation first identifies and cuts out words with obvious features in the string to be analyzed. Using these words as breakpoints, the original string can be divided into smaller strings that are then passed to mechanical segmentation, which reduces the matching error rate. Alternatively, word segmentation can be combined with part-of-speech tagging: rich part-of-speech information helps the segmentation decisions, and the tagging process in turn checks and adjusts the segmentation results, thereby improving segmentation accuracy.
  • The word segmentation method based on understanding identifies words by having the computer simulate human understanding of the sentence.
  • The basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and to use syntactic and semantic information to resolve ambiguity. Such a system usually consists of three parts: a word segmentation subsystem, a syntactic and semantic subsystem, and a general control part. Under the coordination of the general control part, the word segmentation subsystem obtains syntactic and semantic information about words, sentences, and so on to resolve segmentation ambiguity, that is, it simulates the process by which a human understands the sentence. This method of word segmentation requires a large amount of linguistic knowledge and information.
  • The statistics-based word segmentation method relies on the observation that the frequency or probability with which characters co-occur adjacently in Chinese text reflects how likely they are to form a word. The frequency of each combination of adjacent co-occurring characters in the corpus can therefore be counted and their mutual information calculated, that is, the adjacent co-occurrence probability of two Chinese characters X and Y is calculated. Mutual information reflects how tightly two Chinese characters are bound. When the tightness is above a certain threshold, the character group may be considered to constitute a word. This method only needs to count the frequencies of character groups in the corpus and does not need a segmentation dictionary.
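  • The statistical idea can be sketched as follows. This is a minimal illustration that assumes pointwise mutual information, log2(p(XY) / (p(X) p(Y))), as the mutual-information measure; the text does not give a formula. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Return pointwise mutual information for every adjacent character pair in the corpus."""
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(zip(sentence, sentence[1:]))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    pmi = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi


# Hypothetical toy corpus built around the word for "student".
toy_corpus = ["学生", "学生", "我是学生", "我是一个学生"]
pmi_scores = adjacent_pmi(toy_corpus)
```

  • In this toy corpus the pair forming "student" scores higher than the pair that straddles the boundary before it, so a threshold between the two scores would keep the former as a word. A real corpus would need to be far larger for the scores to be meaningful.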
  • Step 102 may specifically include the following sub-steps:
  • Sub-step S11: extracting a participle mapped by the video search string;
  • The corresponding participle can be extracted directly according to a preset mapping rule.
  • For example, when the video search string is "My Mid-Autumn Festival" or another variant expression of the Mid-Autumn Festival, the mapped first participle can be "Mid-Autumn Festival".
  • The video search string can also be identical to the mapped first participle.
  • For example, when the video search string is "Mid-Autumn Festival", the mapped first participle can also be "Mid-Autumn Festival".
  • Sub-step S12: when the received video search string is a compound word, splitting the video search string into a plurality of search sub-words;
  • Sub-step S13: extracting a plurality of participles mapped by the plurality of search sub-words.
  • The video search string may be segmented according to a preset mapping rule to obtain the search sub-words, and the participle corresponding to each search sub-word is then extracted separately.
  • For example, the received video search string "Mid-Autumn Festival Mooncake" can be split into the two search sub-words "Mid-Autumn Festival" and "moon cake"; "Mid-Autumn Festival" is then mapped to "Mid-Autumn Festival" and "moon cake" to "moon cake", giving the first participles "Mid-Autumn Festival" and "moon cake".
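  • Sub-steps S11 to S13 can be sketched as follows. This is a minimal Python illustration, not the patented implementation: it assumes whitespace-separated English tokens in place of Chinese segmentation, and the stop-word list, association rules, preset participles, and function name are all hypothetical.

```python
STOP_WORDS = {"my", "me", "you"}  # hypothetical stop-word list
ASSOCIATIONS = {                   # hypothetical association rules
    "august 15th": "mid-autumn festival",
    "moon cake festival": "mid-autumn festival",
}
PRESET_PARTICIPLES = {"mid-autumn festival", "moon cake"}  # hypothetical preset participles


def map_to_first_participles(search_string):
    """Map a search string to preset participles using longest-phrase matching."""
    tokens = [t for t in search_string.lower().split() if t not in STOP_WORDS]
    participles = []
    i = 0
    while i < len(tokens):
        # Try the longest remaining phrase first, then shorter ones.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            phrase = ASSOCIATIONS.get(phrase, phrase)
            if phrase in PRESET_PARTICIPLES:
                if phrase not in participles:
                    participles.append(phrase)
                i = j
                break
        else:
            i += 1  # no mapping starts at this token; skip it
    return participles


single = map_to_first_participles("My Mid-Autumn Festival")
compound = map_to_first_participles("Mid-Autumn Festival moon cake")
```

  • Under these assumptions, a single-word query maps to one first participle and a compound query is split and mapped to several.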
  • Step 103: Search for associated second participles whose co-occurrence rate with the one or more first participles is higher than a preset threshold;
  • wherein the co-occurrence rate is the probability that the one or more first participles and a second participle co-occur in the same video resource data.
  • A second participle may be any participle other than the first participle among all the preset participles.
  • An associated second participle may be a second participle whose co-occurrence rate with the first participle is higher than the preset threshold.
  • The video resource data may include feature text information, which records information related to the video resource data and may also be used to extract participles.
  • The feature text information may include a video title, video keywords, and/or a video description.
  • For example, the feature text information can be as follows:
  • Video keywords: YY Reporter Life Information Dongguan Flooding
  • The co-occurrence rate may be the probability that the current one or more first participles and a second participle co-occur in the feature text information of the same video resource data. Specifically, it may be the co-occurrence rate of a single first participle with a second participle, or the co-occurrence rate of multiple first participles with a second participle.
  • In one embodiment, step 103 may specifically include the following sub-steps:
  • Sub-step S21: when the video search string is mapped to one first participle, extracting a preset index table corresponding to the first participle, wherein the index table includes information of the video resource data to which the first participle belongs and all the participles in that video resource data; all the participles in the video resource data are obtained by crawling the video resource data, extracting its feature text information, and performing word segmentation on the feature text information;
  • The search engine may crawl the video resource data on each website platform in advance by using a crawler and then build the index database: the feature text information of the video resource data is extracted and segmented, and an index table corresponding to each participle is established.
  • The index table may store information of the video resource data (which may be a video identifier such as an ID, an intranet address, or an external network address, or a record consisting of the current participle and other participles) and all the participles of the video resource data (including the first participle and the second participles, that is, the participles other than the first participle).
  • The feature text information may include a video title, video keywords, and/or a video description.
  • For example, an index table for "Mid-Autumn Festival" can be as follows:
  • In this example, the first participle is "Mid-Autumn Festival", and the information of the video resource data includes a video identifier.
  • Alternatively, the information of the video resource data may omit the video identifier and consist only of the records formed by the first participle and the second participles (that is, the second participles of each row form one record).
  • The above index table is only an example. Other index tables may be set by those skilled in the art according to actual conditions or needs, which is not limited by the embodiment of the present invention.
  • Each platform can be crawled periodically or irregularly, after which the index database, that is, each index table, is updated.
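  • The index-building step can be sketched as follows. This minimal Python illustration assumes each index table is a mapping from video identifier to the set of all participles of that video; the function name, the sample data, and the use of whitespace splitting as the segmenter are hypothetical.

```python
from collections import defaultdict


def build_index_tables(videos, segment):
    """Build one index table per participle.

    videos:  {video_id: feature_text}, where the feature text stands in for
             the title, keywords, and description of a crawled video.
    segment: callable that turns feature text into a list of participles.
    Returns  {participle: {video_id: set of all participles in that video}}.
    """
    tables = defaultdict(dict)
    for video_id, feature_text in videos.items():
        participles = set(segment(feature_text))
        for participle in participles:
            tables[participle][video_id] = participles
    return dict(tables)


# Hypothetical crawl result; whitespace splitting stands in for a real segmenter.
crawled = {
    "v1": "dota funny",
    "v2": "dota classic funny",
    "v3": "square-dance teaching",
}
index_tables = build_index_tables(crawled, str.split)
```

  • Re-running the function on a fresh crawl rebuilds every table, which corresponds to the periodic update described above.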
  • Sub-step S22: calculating the co-occurrence rate of the first participle and each second participle in the index table, where the co-occurrence rate is the ratio of the number of occurrences of the second participle in the index table to the total number of pieces of video resource data information in the index table, and a second participle is any participle of the video resource data other than the first participle;
  • The co-occurrence rate may also be expressed as the ratio of the number of pieces of video resource data to which the second participle belongs in the index table to the total number of pieces of video resource data information in the index table.
  • For example, suppose the index table of the participle "square dance" contains 100 pieces of video resource data information, the index table of "bing brother" contains 200 pieces, and 10 pieces of video resource data information appear in both index tables. By the definition above, the co-occurrence rate of "bing brother" in the "square dance" index table is 10/100 = 10%, and that of "square dance" in the "bing brother" index table is 10/200 = 5%.
  • Sub-step S23: extracting the second participles whose co-occurrence rate is higher than the preset threshold as the associated second participles.
  • The preset threshold may be set by a person skilled in the art according to actual conditions, which is not limited by the embodiment of the present invention.
  • The set of associated second participles extracted in the embodiment of the present invention may be empty, or may contain one or more participles.
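  • Sub-steps S21 to S23 can be sketched as follows. This is a minimal Python illustration under the assumption that an index table maps each video identifier to the set of participles of that video; the function name, the table contents, and the threshold value are hypothetical.

```python
from collections import Counter


def associated_participles(index_table, first, threshold):
    """Return {second_participle: co-occurrence rate} for rates above the threshold."""
    total = len(index_table)
    counts = Counter()
    for participles in index_table.values():
        # Count each second participle once per piece of video resource data.
        counts.update(p for p in set(participles) if p != first)
    rates = {p: n / total for p, n in counts.items()}
    return {p: r for p, r in rates.items() if r > threshold}


# Toy table mirroring the example: 100 videos contain "square dance",
# and 10 of them also contain "bing brother".
square_dance_table = {
    f"v{i}": {"square dance", "bing brother"} if i < 10 else {"square dance"}
    for i in range(100)
}
associated = associated_participles(square_dance_table, "square dance", threshold=0.05)
```

  • With a threshold of 5%, "bing brother" (rate 10/100) is kept; raising the threshold to 10% or above would leave the result empty, which matches the remark that the associated set may be empty.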
  • In another embodiment, step 103 may specifically include the following sub-steps:
  • Sub-step S31: when the video search string is mapped to a plurality of first participles, extracting a plurality of preset index tables respectively corresponding to the plurality of first participles, wherein each index table includes information of the video resource data to which the first participle belongs and all the participles in that video resource data; all the participles in the video resource data are obtained by crawling the video resource data, extracting its feature text information, and performing word segmentation on the feature text information;
  • The search engine may crawl the video resource data on each platform in advance by using a crawler and then build the index: the feature text information of the video resource data is extracted and segmented, and an index table corresponding to each participle is established.
  • The index table may store information of the video resource data (which may be a video identifier such as an ID, an intranet address, or an external network address, or a record consisting of the current participle and other participles) and all the participles of the video resource data (including the first participle and the second participles, that is, the participles other than the first participle).
  • The feature text information may include a video title, video keywords, and/or a video description.
  • Sub-step S32: extracting the second participles that appear together with each of the plurality of first participles as candidate participles, wherein a second participle is any participle of the video resource data other than the first participles;
  • When there are multiple first participles, there are correspondingly multiple index tables, and a candidate participle must appear in every index table, that is, the candidate participle appears together with each current first participle in that participle's index table.
  • Sub-step S33: calculating the co-occurrence rate of the first participle and the candidate participle in each index table, where the co-occurrence rate is the ratio of the number of occurrences of the candidate participle in the index table to the total number of pieces of video resource data information in the index table;
  • Sub-step S34: configuring corresponding weights for the co-occurrence rates of the plurality of first participles with the candidate participle, respectively;
  • The weight may be determined by the ratio, across the first participles, of the total number of pieces of video resource data information in their index tables: the larger the total in an index table, the greater the weight. For example, if the index table of "Mid-Autumn Festival" holds 900 pieces of video resource data information and the index table of "moon cake" holds 100, the weight of the co-occurrence rate of "Mid-Autumn Festival" and "moon" can be 0.9, and the weight of the co-occurrence rate of "moon cake" and "moon" can be 0.1.
  • The above weights are only examples. Other weights may be set by those skilled in the art according to actual conditions, for example according to current social hotspots (news rankings, microblog rankings, etc.) or according to the user's local and/or online operation behavior, which is not limited by the embodiment of the present invention.
  • Sub-step S35: calculating the average of the plurality of weighted co-occurrence rates as the co-occurrence rate of the plurality of first participles and the candidate participle;
  • That is, the weighted average of the multiple co-occurrence rates may be used as the final co-occurrence rate.
  • Sub-step S36: extracting the candidate participles whose co-occurrence rate is higher than the preset threshold as the associated second participles.
  • The preset threshold may be set by a person skilled in the art according to actual conditions, which is not limited by the embodiment of the present invention.
  • The set of associated second participles extracted in the embodiment of the present invention may be empty, or may contain one or more participles.
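  • Sub-steps S31 to S36 can be sketched as follows. This minimal Python illustration assumes weights proportional to index table size (as in the 900/100 example) that sum to 1, so the weighted average is the weighted sum. The function names, toy tables, per-table counts for "moon", and threshold are hypothetical.

```python
from collections import Counter


def cooccurrence_rates(index_table, first_participles):
    """Rate of each second participle within one index table."""
    total = len(index_table)
    counts = Counter()
    for participles in index_table.values():
        counts.update(set(participles) - set(first_participles))
    return {p: n / total for p, n in counts.items()}


def weighted_associated(index_tables, threshold):
    """index_tables: non-empty {first_participle: {video_id: set of participles}}."""
    firsts = list(index_tables)
    rates = {f: cooccurrence_rates(index_tables[f], firsts) for f in firsts}
    # S32: a candidate must appear in every index table.
    candidates = set.intersection(*(set(r) for r in rates.values()))
    # S34: weights proportional to the size of each index table (an assumption).
    grand_total = sum(len(table) for table in index_tables.values())
    weights = {f: len(index_tables[f]) / grand_total for f in firsts}
    # S35: weighted average of the per-table co-occurrence rates.
    final = {c: sum(weights[f] * rates[f][c] for f in firsts) for c in candidates}
    # S36: keep candidates above the threshold.
    return {c: r for c, r in final.items() if r > threshold}


def toy_table(first, size, moon_count):
    """Hypothetical table in which `moon_count` of `size` videos also contain 'moon'."""
    return {
        f"{first}-{i}": {first, "moon"} if i < moon_count else {first}
        for i in range(size)
    }


tables = {
    "Mid-Autumn Festival": toy_table("Mid-Autumn Festival", 900, 450),
    "moon cake": toy_table("moon cake", 100, 20),
}
result = weighted_associated(tables, threshold=0.3)
```

  • With these made-up counts the per-table rates are 0.5 and 0.2 and the weights are 0.9 and 0.1, so the combined rate for "moon" is 0.9 × 0.5 + 0.1 × 0.2 = 0.47.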
  • In yet another embodiment, step 103 may specifically include the following sub-steps:
  • Sub-step S41: when the video search string is mapped to a plurality of first participles, extracting a plurality of preset index tables respectively corresponding to the plurality of first participles, wherein each index table includes information of the video resource data to which the first participle belongs and all the participles in that video resource data; all the participles in the video resource data are obtained by crawling the video resource data, extracting its feature text information, and performing word segmentation on the feature text information;
  • The search engine may crawl the video resource data on each platform in advance by using a crawler and then build the index: the feature text information of the video resource data is extracted and segmented, and an index table corresponding to each participle is established.
  • The index table may store information of the video resource data (which may be a video identifier such as an ID, an intranet address, or an external network address, or a record consisting of the current participle and other participles) and all the participles of the video resource data (including the first participle and the second participles, that is, the participles other than the first participle).
  • The feature text information may include a video title, video keywords, and/or a video description.
  • Sub-step S42: determining the main participle by using the plurality of index tables, where the main participle is the first participle whose index table contains the largest total number of pieces of video resource data information;
  • First participles whose index tables contain only a small amount of video resource data information may be ignored.
  • For example, if the index table of "Mid-Autumn Festival" holds 900 pieces of video resource data information and the index table of "moon cake" holds 100, "Mid-Autumn Festival" can be set as the main participle.
  • Sub-step S43: calculating the co-occurrence rate of the main participle and each second participle in the corresponding index table, where the co-occurrence rate is the ratio of the number of occurrences of the second participle in the index table to the total number of pieces of video resource data information in the index table, and a second participle is any participle of the video resource data other than the first participles;
  • That is, the co-occurrence rate computed for the main participle can be used as the final co-occurrence rate.
  • Sub-step S44: extracting the second participles whose co-occurrence rate is higher than the preset threshold as the associated second participles.
  • The preset threshold may be set by a person skilled in the art according to actual conditions, which is not limited by the embodiment of the present invention.
  • The set of associated second participles extracted in the embodiment of the present invention may be empty, or may contain one or more participles.
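  • Sub-steps S41 to S44 can be sketched as follows. This minimal Python illustration again assumes that an index table maps video identifiers to sets of participles; the function name, toy tables, the count for "moon", and the threshold are hypothetical.

```python
from collections import Counter


def main_participle_associations(index_tables, threshold):
    """Pick the first participle with the largest index table and use only its rates."""
    # S42: the main participle has the most pieces of video resource data information.
    main = max(index_tables, key=lambda first: len(index_tables[first]))
    table = index_tables[main]
    counts = Counter()
    for participles in table.values():
        # S43: second participles exclude every first participle.
        counts.update(set(participles) - set(index_tables))
    rates = {p: n / len(table) for p, n in counts.items()}
    # S44: keep second participles above the threshold.
    return main, {p: r for p, r in rates.items() if r > threshold}


# Toy tables: 900 videos for "Mid-Autumn Festival" (450 also contain "moon"),
# 100 videos for "moon cake".
toy_tables = {
    "Mid-Autumn Festival": {
        f"a{i}": {"Mid-Autumn Festival", "moon"} if i < 450 else {"Mid-Autumn Festival"}
        for i in range(900)
    },
    "moon cake": {f"b{i}": {"moon cake"} for i in range(100)},
}
main, main_associated = main_participle_associations(toy_tables, threshold=0.3)
```

  • Compared with the weighted variant, this approach skips the smaller index tables entirely, trading some precision for less computation.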
  • Step 104: Push a combination of the one or more first participles and the one or more associated second participles.
  • For a single first participle, a combination of the current first participle and each of the one or more associated second participles may be pushed at a position such as the drop-down menu of the input box of the web page.
  • For example, if the video search string is "dota" and the associated second participles are "funny", "egg hurt", "2009", "sea Tao", "first perspective", and "classic", with co-occurrence rates of 40%, 35%, 30%, 25%, 20%, and 10% respectively, the combinations "dota funny", "dota egg hurt", "dota 2009", "dota sea Tao", "dota first perspective", and "dota classic" will be pushed.
  • For multiple first participles, a combination of the current plurality of first participles and each of the one or more associated second participles may be pushed at a position such as the drop-down menu of the input box of the web page.
  • For example, the video search string "square dance soldier brother" is mapped to the first participles "square dance" and "bing brother" ("soldier brother"). A second participle that appears together with both first participles, for example "teaching", can be used as the associated second participle, and the combination "square dance soldier brother teaching" will eventually be pushed.
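  • Step 104 can be sketched as follows. This minimal Python illustration assumes the pushed combinations are ordered by descending co-occurrence rate and joined with spaces; the text lists them in that order but does not state an ordering rule. The function name is hypothetical.

```python
def build_suggestions(first_participles, associated):
    """Combine the first participles with each associated second participle.

    first_participles: list of first participles, in query order.
    associated:        {second_participle: co-occurrence rate}.
    """
    prefix = " ".join(first_participles)
    # Assumption: rank suggestions by descending co-occurrence rate.
    ranked = sorted(associated.items(), key=lambda item: item[1], reverse=True)
    return [f"{prefix} {second}" for second, _ in ranked]


dota_rates = {
    "classic": 0.10,
    "funny": 0.40,
    "egg hurt": 0.35,
    "2009": 0.30,
    "sea Tao": 0.25,
    "first perspective": 0.20,
}
suggestions = build_suggestions(["dota"], dota_rates)
```

  • The resulting list could then be rendered in the drop-down menu of the search box.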
  • When a main participle has been determined, step 104 may specifically include the following sub-step:
  • Sub-step S51: pushing a combination of the main participle and the associated second participle.
  • A combination of the current main participle and each of the one or more associated second participles can be pushed at a position such as the drop-down menu of the input box of the web page.
  • For example, for the first participles "Mid-Autumn Festival" and "moon cake" mapped from the video search string "Mid-Autumn Festival Mooncake", "Mid-Autumn Festival" can be set as the main participle; if the associated second participle "moon" is obtained, the combination "Mid-Autumn Festival moon" can be pushed.
  • The invention makes push recommendations based on existing published content, which frees the search engine from dependence on users' search habits: video resource data for which the video library holds many relevant resources, but which users seldom search for, can still be pushed.
  • The user can directly perform further levels of searching based on the pushed combination, obtaining more results from a simple search without submitting the search multiple times, which reduces accesses to the server.
  • FIG. 2 is a flow chart showing the steps of an embodiment of a method for pushing an online play portal object based on a video search according to an embodiment of the present invention, which may specifically include the following steps:
  • Step 201 Receive a video search string.
  • Step 202 Map the video search string into one or more first word segments
  • the step 202 may specifically include the following sub-steps:
  • Sub-step S61 extracting a participle mapped by the video search string
  • Sub-step S62 when the received video search string is a compound word, splitting the video search string into a plurality of search sub-words;
  • Sub-step S63 extracting a plurality of word segments mapped by the plurality of search sub-words.
  • Step 203 Search for an associated second participle with a co-occurrence rate of the one or more first participles that is higher than a preset threshold.
  • the step 203 may specifically include the following sub-steps:
  • Sub-step S71 when the video search string is mapped to one first participle, extracting a preset index table corresponding to the first participle; wherein the index table includes information of the video resource data to which the first participle belongs, and all the participles in the video resource data; all the participles in the video resource data are obtained by capturing the video resource data, extracting feature text information of the video resource data, and performing word segmentation on the feature text information;
  • Sub-step S72 calculating a co-occurrence rate of the first participle and each second participle in the index table, where the co-occurrence rate is the ratio of the number of occurrences of each second participle in the index table to the total number of pieces of information of the video resource data in the index table; wherein the second participle is any participle, other than the first participle, among all the participles in the video resource data;
  • Sub-step S73 extracting the second participle with the co-occurrence rate higher than the preset threshold as the associated second participle.
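  • Sub-steps S71 to S73 can be illustrated with a short sketch. The data layout (a dictionary from video id to its set of participles) and the function names are assumptions made for the example; the text only fixes the definition of the rate as a count divided by the table size.

```python
from collections import Counter, defaultdict


def build_index_tables(videos):
    """Build one index table per participle.

    `videos` maps a video id to the set of participles segmented from its
    feature text; each table lists the videos a participle belongs to,
    together with all participles of those videos.
    """
    tables = defaultdict(dict)
    for video_id, participles in videos.items():
        for participle in participles:
            tables[participle][video_id] = set(participles)
    return tables


def co_occurrence_rates(first, tables):
    """S72: occurrences of each second participle / number of videos in the table."""
    table = tables.get(first, {})
    if not table:
        return {}
    counts = Counter(
        p for participles in table.values() for p in participles if p != first
    )
    return {second: count / len(table) for second, count in counts.items()}


def associated_second_participles(first, tables, threshold):
    """S73: keep second participles whose rate exceeds the preset threshold."""
    rates = co_occurrence_rates(first, tables)
    return sorted(s for s, r in rates.items() if r > threshold)


# Hypothetical segmented video library.
videos = {
    1: {"dota", "funny"},
    2: {"dota", "funny", "classic"},
    3: {"dota", "2009"},
    4: {"dota", "funny"},
    5: {"square dance", "teaching"},
}
tables = build_index_tables(videos)
rates = co_occurrence_rates("dota", tables)
associated = associated_second_participles("dota", tables, threshold=0.5)
```

  • Here the table for "dota" lists four videos, three of which also contain "funny", so only "funny" passes a threshold of 0.5.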
  • Alternatively, the step 203 may specifically include the following sub-steps:
  • Sub-step S81 when the video search string is mapped into a plurality of first participles, respectively extracting a plurality of preset index tables corresponding to the plurality of first participles; wherein each index table includes information of the video resource data to which the first participle belongs, and all the participles in the video resource data; all the participles in the video resource data are generated by capturing the video resource data, extracting the feature text information of the video resource data, and performing word segmentation on the feature text information;
  • the feature text information may include a video title, a video keyword, and/or a video description.
  • Sub-step S82 extracting a second participle that appears together with the plurality of first participles as a candidate participle; wherein the second participle is any participle, other than the first participles, among all the participles in the video resource data;
  • Sub-step S83 calculating a co-occurrence rate of the first participle and the candidate participle in each index table, where the co-occurrence rate is the ratio of the number of occurrences of the candidate participle in the index table to the total number of pieces of information of the video resource data in the index table;
  • Sub-step S84 configuring a corresponding weight for each of the co-occurrence rates of the plurality of first participles with the candidate participle;
  • Sub-step S85 calculating the average value of the plurality of weighted co-occurrence rates as the co-occurrence rate of the plurality of first participles and the candidate participle;
  • Sub-step S86 extracting the candidate participle with the co-occurrence rate higher than the preset threshold as the associated second participle.
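  • The multi-participle variant in sub-steps S82 to S86 might look like the following sketch. The text does not specify how the weights enter the average; taking the plain mean of weight times rate is one reading and is an assumption here, as are the function name and the example figures.

```python
def combined_co_occurrence(rates_by_first, weights):
    """Combine per-table co-occurrence rates for several first participles.

    rates_by_first: {first_participle: {second_participle: rate}}
    weights:        {first_participle: weight}
    """
    firsts = list(rates_by_first)
    if not firsts:
        return {}
    # S82: a candidate must co-occur with every first participle.
    candidates = set.intersection(*(set(rates_by_first[f]) for f in firsts))
    combined = {}
    for candidate in candidates:
        # S84: apply the weight configured for each first participle.
        weighted = [weights[f] * rates_by_first[f][candidate] for f in firsts]
        # S85: mean of the weighted rates (assumed interpretation).
        combined[candidate] = sum(weighted) / len(weighted)
    return combined


# Hypothetical per-table rates for the two first participles.
rates_by_first = {
    "square dance": {"teaching": 0.5, "music": 0.25},
    "soldier brother": {"teaching": 0.75},
}
weights = {"square dance": 1.0, "soldier brother": 1.0}
combined = combined_co_occurrence(rates_by_first, weights)
# S86: keep candidates above the preset threshold.
associated = [c for c, r in combined.items() if r > 0.5]
```

  • Another defensible reading is a weighted mean normalised by the sum of the weights; with equal weights of 1.0 the two readings coincide.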
  • Alternatively, the step 203 may specifically include the following sub-steps:
  • Sub-step S91 when the video search string is mapped to a plurality of first participles, respectively extracting a plurality of preset index tables corresponding to the plurality of first participles; wherein each index table includes information of the video resource data to which the first participle belongs, and all the participles in the video resource data; all the participles in the video resource data are generated by capturing the video resource data, extracting the feature text information of the video resource data, and performing word segmentation on the feature text information;
  • the feature text information may include a video title, a video keyword, and/or a video description.
  • Sub-step S92 determining the main participle by using the plurality of index tables, where the main participle is the first participle corresponding to the index table with the largest total number of pieces of information of the video resource data;
  • Sub-step S93 calculating a co-occurrence rate of the main participle and each second participle in the index table corresponding to the main participle, where the co-occurrence rate is the ratio of the number of occurrences of each second participle in the index table to the total number of pieces of information of the video resource data in the index table; wherein the second participle is any participle, other than the first participle, among all the participles in the video resource data;
  • Sub-step S94 extracting the second participle with the co-occurrence rate higher than the preset threshold as the associated second participle.
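  • A sketch of the main-participle variant (sub-steps S92 to S94) follows. The table layout and names are hypothetical, and excluding every first participle from the second participles is an assumption made so that "mooncake" is not recommended back to a user who already typed it.

```python
from collections import Counter


def pick_main_participle(tables):
    """S92: the first participle whose index table lists the most videos."""
    return max(tables, key=lambda first: len(tables[first]))


def associated_for_main(tables, threshold):
    """S93, S94: rates are computed in the main participle's table only."""
    main = pick_main_participle(tables)
    table = tables[main]
    firsts = set(tables)  # assumption: no first participle counts as a second one
    counts = Counter(
        p for participles in table.values() for p in participles - firsts
    )
    rates = {second: count / len(table) for second, count in counts.items()}
    return main, sorted(s for s, r in rates.items() if r > threshold)


# Hypothetical index tables: {first participle: {video id: participles}}.
tables = {
    "mid-autumn festival": {
        1: {"mid-autumn festival", "mooncake", "moon"},
        2: {"mid-autumn festival", "moon"},
        3: {"mid-autumn festival", "lantern"},
    },
    "mooncake": {1: {"mid-autumn festival", "mooncake", "moon"}},
}
main, associated = associated_for_main(tables, threshold=0.5)
```

  • Compared with the weighted-average variant, this one reads a single index table, which is cheaper but ignores how the candidate relates to the other first participles.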
  • Step 204 Obtain a network address of one or more video data resources that match the one or more first word segments and the associated second word segment;
  • a combination of the current first participle and one or more associated second participles can be obtained.
  • For example, the video search string is "dota", and the words that co-occur with it are "funny", "egg pain", "2009", "sea Tao", "first perspective" and "classic", with co-occurrence rates of 40%, 35%, 30%, 25%, 20% and 10%, respectively; the combinations obtained are "dota funny", "dota egg pain", "dota 2009", "dota sea Tao", "dota first perspective" and "dota classic".
  • a combination of the current plurality of first participles and one or more associated second participles can be obtained.
  • For example, the video search string is "square dance soldier brother", which is mapped to the first participles "square dance" and "soldier brother"; a second participle that appears together with both first participles is extracted, for example "teaching", which can serve as the associated second participle, finally yielding the combination "square dance soldier brother teaching".
  • step 204 may specifically include the following sub-steps:
  • Sub-step S101 Obtain a network address of one or more video data resources that match the primary participle and the associated second participle.
  • a combination of the current main participle and one or more associated second participles can be obtained. For example, for the first participles "Mid-Autumn Festival" and "mooncake" mapped from the video search string "Mid-Autumn Festival mooncake", "Mid-Autumn Festival" may be set as the main participle and the second participle "moon" obtained, finally yielding the combination "Mid-Autumn Festival moon".
  • the search for matching video data resources may be performed based on the combination of the first participle and the second participle.
  • the network address of each matched resource may then be recorded; it may be an intranet address or an external network address.
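  • Step 204 can be sketched as forming the combinations and then collecting the addresses of the resources that match them. The catalogue structure, the matching rule (a resource matches when its participles cover the whole combination) and the placeholder URLs are all assumptions for illustration.

```python
def build_combinations(first_participles, associated_seconds):
    """Join the first participle(s) with each associated second participle."""
    prefix = " ".join(first_participles)
    return [f"{prefix} {second}" for second in associated_seconds]


def matching_addresses(first_participles, second, catalogue):
    """Return addresses of resources whose participles cover the combination."""
    required = set(first_participles) | {second}
    return [item["url"] for item in catalogue if required <= item["participles"]]


# Hypothetical catalogue; the URLs are placeholders, not real resources.
catalogue = [
    {"url": "http://video.example.com/v/1", "participles": {"dota", "funny", "2009"}},
    {"url": "http://video.example.com/v/2", "participles": {"dota", "classic"}},
    {"url": "http://10.0.0.5/v/3", "participles": {"dota", "funny"}},  # intranet address
]

combinations = build_combinations(["dota"], ["funny", "classic"])
funny_addresses = matching_addresses(["dota"], "funny", catalogue)
```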
  • Step 205 Construct an entry object for playing the video data resource online according to the network addresses of the one or more video data resources.
  • the entry object can be an icon or button in the web page that links to the online play URL.
  • For example, an icon or a button may be configured in the current page and associated with the video data resource network address in the extended window.
  • When the entry object is triggered, the corresponding video data resource may be loaded from the database under that URL.
  • Step 206 Push the one or more entry objects for playing the video data resource online.
  • the entry object can be placed at any position of the current page; by triggering the entry object, the user accesses the network address of the corresponding video data resource, thereby loading the video data resource.
  • For example, the user inputs the video search string "steel" in the search box; the string itself can serve as the first participle, and the video data resource "Iron Man 3" is obtained as the match for the combination of the first participle and the associated second participle.
  • The entry object of this video resource is an icon that says "Read Now" to prompt the user; when the user clicks the icon, the play page of "Iron Man 3" opens.
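  • Steps 205 and 206 can be illustrated by rendering the entry object as an HTML link. This is only a sketch: the markup, the `play-entry` class name and the example URL are hypothetical, and the text does not prescribe any particular representation beyond an icon or button linked to the play URL.

```python
from html import escape


def build_entry_object(title, url, label="Read Now"):
    """Render a clickable entry object that links to the online play URL."""
    return (
        f'<a class="play-entry" href="{escape(url, quote=True)}" '
        f'title="{escape(title, quote=True)}">{escape(label)}</a>'
    )


# Placeholder address for the matched resource.
entry = build_entry_object("Iron Man 3", "https://video.example.com/play?id=3&hd=1")
```

  • Escaping the title and URL keeps metadata taken from crawled pages from breaking the markup of the page that hosts the entry object.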
  • The invention pushes recommendations based on existing published content, freeing the search engine from its dependence on users' search habits: video resource data that users rarely search for, but for which the video library holds many related resources, is also pushed out.
  • The user can directly obtain more video search results through the entry object, so that more results are obtained from a single simple search without submitting the search multiple times, thereby reducing the burden of accessing the server, reducing the occupation of network resources and improving the user experience.
  • Referring to FIG. 3, a flow chart of the steps of an embodiment of a method for pushing an associated resource address based on video search according to the present invention is shown, which may specifically include the following steps:
  • Step 301 when receiving a loading or playing request of the first video resource data, acquiring feature text information of the first video resource data;
  • the first video resource data may be located on the terminal device or may be located on the network, and the feature text information may be information carried by the video resource data.
  • the step 301 may specifically include the following sub-steps:
  • Sub-step S111 when receiving the play request of the first video resource data, receiving the feature text information of the first video resource data sent by the current terminal;
  • the feature text information of the first video resource data may be extracted by the terminal device, and then uploaded to the corresponding server side.
  • Sub-step S112 when receiving the loading request of the first video resource data, extracting the locally preset feature text information of the first video resource data.
  • the feature text information of the first video resource data may be extracted by the server side.
  • the feature text information may include a video title, a video keyword, and/or a video description.
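  • Step 301 and its sub-steps can be sketched as follows. The request fields (`kind`, `video_id`, `feature_text`) and the metadata layout are hypothetical; the text only states that the terminal may upload the feature text on a play request and that the server may extract it locally on a load request.

```python
def feature_text_from_metadata(metadata):
    """Join title, keywords and description, skipping missing fields."""
    parts = [
        metadata.get("title", ""),
        " ".join(metadata.get("keywords", [])),
        metadata.get("description", ""),
    ]
    return " ".join(part for part in parts if part)


def acquire_feature_text(request, local_metadata):
    """S111: use text uploaded by the terminal; S112: fall back to local data."""
    if request.get("kind") == "play" and request.get("feature_text"):
        return request["feature_text"]
    return feature_text_from_metadata(local_metadata[request["video_id"]])


# Hypothetical server-side metadata store.
local_metadata = {"v42": {"title": "Iron Man 3", "keywords": ["action", "marvel"]}}
loaded = acquire_feature_text({"kind": "load", "video_id": "v42"}, local_metadata)
played = acquire_feature_text(
    {"kind": "play", "video_id": "v42", "feature_text": "dota funny"}, local_metadata
)
```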
  • Step 302 Map the feature text information into one or more first word segments;
  • the step 302 may specifically include the following sub-steps:
  • Sub-step S121 extracting a participle mapped by the feature text information;
  • Sub-step S122 when the received feature text information is a compound word, splitting the feature text information into a plurality of search sub-words;
  • Sub-step S123 extracting a plurality of word segments mapped by the plurality of search sub-words.
  • Step 303 Search for an associated second participle whose co-occurrence rate with the one or more first participles is higher than a preset threshold.
  • the co-occurrence rate is the probability that the current one or more first participles and a second participle co-occur in the feature text information of the same video resource data; it may specifically include the co-occurrence rate of a single first participle with a second participle, and the co-occurrence rate of a plurality of first participles with a second participle.
  • the second participle may be a participle other than the first participle among all the preset participles.
  • the associated second participle may be a second participle whose co-occurrence rate with the first participle is higher than a preset threshold.
  • the video resource data may include feature text information, which records related information of the video resource data and from which participles may be extracted.
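  • The definitions above can be written compactly as follows. The notation is introduced here for illustration and is not from the original text: T(w) is the index table of first participle w, P(v) the set of participles of video entry v, α_i the weight configured for the i-th first participle, and θ the preset threshold. The unnormalised mean in the second formula is one reading of "average value of a plurality of co-occurrence rates configured with weights".

```latex
% Single first participle w_1 and second participle w_2
r(w_1, w_2) = \frac{\lvert \{\, v \in T(w_1) : w_2 \in P(v) \,\} \rvert}{\lvert T(w_1) \rvert},
\qquad w_2 \neq w_1

% Several first participles w_1, \dots, w_n and a candidate participle c
r(w_1, \dots, w_n;\, c) = \frac{1}{n} \sum_{i=1}^{n} \alpha_i \, r(w_i, c)

% c is an associated second participle when its rate exceeds the threshold
r(\,\cdot\,;\, c) > \theta
```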
  • the step 303 may specifically include the following sub-steps:
  • Sub-step S131 when the feature text information is mapped to one first participle, extracting a preset index table corresponding to the first participle; wherein the index table includes information of the video resource data to which the first participle belongs, and all the participles in the video resource data; all the participles in the video resource data are obtained by capturing the video resource data, extracting feature text information of the video resource data, and performing word segmentation on the feature text information;
  • the feature text information may include a video title, a video keyword, and/or a video description.
  • Sub-step S132 calculating a co-occurrence rate of the first participle and each second participle in the index table, where the co-occurrence rate is the ratio of the number of occurrences of each second participle in the index table to the total number of pieces of information of the video resource data in the index table; wherein the second participle is any participle, other than the first participle, among all the participles in the video resource data;
  • Sub-step S133 extracting the second participle with the co-occurrence rate higher than the preset threshold as the associated second participle.
  • Alternatively, the step 303 may specifically include the following sub-steps:
  • Sub-step S141 when the feature text information is mapped into a plurality of first participles, respectively extracting a plurality of preset index tables corresponding to the plurality of first participles; wherein each index table includes information of the video resource data to which the first participle belongs, and all the participles in the video resource data; all the participles in the video resource data are generated by capturing the video resource data, extracting the feature text information of the video resource data, and performing word segmentation on the feature text information;
  • the feature text information may include a video title, a video keyword, and/or a video description.
  • Sub-step S142 extracting a second participle that appears together with the plurality of first participles as a candidate participle; wherein the second participle is any participle, other than the first participles, among all the participles in the video resource data;
  • Sub-step S143 calculating a co-occurrence rate of the first participle and the candidate participle in each index table, where the co-occurrence rate is the ratio of the number of occurrences of the candidate participle in the index table to the total number of pieces of information of the video resource data in the index table;
  • Sub-step S144 configuring a corresponding weight for each of the co-occurrence rates of the plurality of first participles with the candidate participle;
  • Sub-step S145 calculating the average value of the plurality of weighted co-occurrence rates as the co-occurrence rate of the plurality of first participles and the candidate participle;
  • Sub-step S146 extracting the candidate participle with the co-occurrence rate higher than the preset threshold as the associated second participle.
  • Alternatively, the step 303 may specifically include the following sub-steps:
  • Sub-step S151 when the feature text information is mapped into a plurality of first participles, respectively extracting a plurality of preset index tables corresponding to the plurality of first participles; wherein each index table includes information of the video resource data to which the first participle belongs, and all the participles in the video resource data; all the participles in the video resource data are generated by capturing the video resource data, extracting the feature text information of the video resource data, and performing word segmentation on the feature text information;
  • the feature text information may include a video title, a video keyword, and/or a video description.
  • Sub-step S152 determining a main participle by using the plurality of index tables, where the main participle is a first participle corresponding to an index table with the largest total number of pieces of information of the video resource data;
  • Sub-step S153 calculating a co-occurrence rate of the main participle and each second participle in the index table corresponding to the main participle, where the co-occurrence rate is the ratio of the number of occurrences of each second participle in the index table to the total number of pieces of information of the video resource data in the index table; wherein the second participle is any participle, other than the first participle, among all the participles in the video resource data;
  • the co-occurrence rate of the main participle can be used as the final co-occurrence rate.
  • Sub-step S154 extracting the second participle with the co-occurrence rate higher than the preset threshold as the associated second participle.
  • Step 304 Obtain a network link address of the second video resource data that matches the one or more first word segments and the associated second word segment.
  • a combination of the current first participle and one or more associated second participles can be obtained.
  • For example, the feature text information is "dota", and the words that co-occur with it are "funny", "egg pain", "2009", "sea Tao", "first perspective" and "classic", with co-occurrence rates of 40%, 35%, 30%, 25%, 20% and 10%, respectively; the combinations obtained are "dota funny", "dota egg pain", "dota 2009", "dota sea Tao", "dota first perspective" and "dota classic".
  • a combination of the current plurality of first participles and one or more associated second participles can be obtained.
  • For example, the feature text information is "square dance soldier brother", which is mapped to the first participles "square dance" and "soldier brother"; a second participle that appears together with both first participles is extracted, for example "teaching", which can serve as the associated second participle, finally yielding the combination "square dance soldier brother teaching".
  • step 304 may specifically include the following sub-steps:
  • Sub-step S161 acquiring a network link address of the second video resource data that matches the main participle and the associated second participle.
  • a combination of the current main participle and one or more associated second participles can be obtained.
  • For example, for the first participles "Mid-Autumn Festival" and "mooncake" mapped from the feature text information "Mid-Autumn Festival mooncake", "Mid-Autumn Festival" may be set as the main participle and the second participle "moon" obtained, finally yielding the combination "Mid-Autumn Festival moon".
  • the search for matching video data resources may be performed based on the combination of the first participle and the second participle.
  • the network link address of each matched resource may then be recorded; it may be an intranet address or an external network address.
  • Step 305 Push a network link address of the second video resource data.
  • the network link address of the second video resource data may be placed at any position on the current page, or may be pushed by embedding it in an icon or a button; the user loads the video data resource by triggering the network link address of the second video resource data.
  • The invention pushes recommendations based on existing published content, freeing the search engine from its dependence on users' search habits: video resource data that users rarely search for, but for which the video library holds many related resources, is also pushed out.
  • The invention obtains the network link address of the second video resource data that matches the first participle and the second participle, and the user can directly obtain the video data resource through that address, so that more results are obtained from a single simple search without submitting the search multiple times, thereby reducing the burden of accessing the server, reducing the occupation of network resources and improving the user experience.
  • Referring to FIG. 4, a block diagram of an embodiment of a video search-based participle information pushing apparatus according to the present invention is shown. Specifically, the following modules may be included:
  • the video search string receiving module 401 is adapted to receive a video search string;
  • the first participle mapping module 402 is adapted to map the video search string into one or more first participles;
  • the second participle finding module 403 is adapted to search for an associated second participle whose co-occurrence rate with the one or more first participles is higher than a preset threshold; the co-occurrence rate is the probability that the current one or more first participles and a second participle co-occur in the same video resource data;
  • the combined push module 404 is adapted to push a combination of the one or more first participles and the one or more associated second participles.
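  • The module structure of FIG. 4 can be sketched as a small class that wires the four modules together. The class name, method names and the stand-in callables are hypothetical; the actual mapping and finding logic would be supplied by implementations of modules 402 and 403.

```python
class ParticiplePushDevice:
    """Wires together stand-ins for modules 401 to 404."""

    def __init__(self, map_to_first, find_associated, threshold):
        self.map_to_first = map_to_first        # first participle mapping module 402
        self.find_associated = find_associated  # second participle finding module 403
        self.threshold = threshold              # preset co-occurrence threshold

    def receive(self, search_string):
        """Video search string receiving module 401: entry point of the flow."""
        firsts = self.map_to_first(search_string)
        seconds = self.find_associated(firsts, self.threshold)
        return self.push(firsts, seconds)

    def push(self, firsts, seconds):
        """Combined push module 404: first participles plus each associated one."""
        return [" ".join(firsts + [second]) for second in seconds]


# Stand-in implementations with hypothetical co-occurrence rates.
example_rates = {"funny": 0.4, "classic": 0.1}
device = ParticiplePushDevice(
    map_to_first=lambda text: text.split(),
    find_associated=lambda firsts, threshold: [
        word for word, rate in example_rates.items() if rate > threshold
    ],
    threshold=0.3,
)
pushed = device.receive("dota")
```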
  • the first participle mapping module 402 may further be adapted to:
  • the video search string is split into a plurality of search sub-words; and a plurality of participles mapped by the plurality of search sub-words are extracted.
  • the second participle finding module 403 is further adapted to:
  • the co-occurrence rate is the ratio of the number of occurrences of each second participle in the index table to the total number of pieces of information of video resource data in the index table; wherein the second participle is any participle, other than the first participle, among all the participles in the video resource data;
  • the second participle finding module 403 is further adapted to:
  • each index table includes information of the video resource data to which the first participle belongs, and all the participles in the video resource data; all the participles in the video resource data are obtained by capturing the video resource data, extracting feature text information of the video resource data, and performing word segmentation on the feature text information;
  • the candidate participle with the co-occurrence rate higher than the preset threshold is extracted as the associated second participle.
  • the second participle finding module 403 is further adapted to:
  • each index table includes information of the video resource data to which the first participle belongs; all the participles in the video resource data are generated by capturing the video resource data, extracting the feature text information of the video resource data, and performing word segmentation on the feature text information;
  • determining a main participle by using the plurality of index tables, where the main participle is the first participle corresponding to the index table with the largest total number of pieces of information of video resource data;
  • the feature text information includes a video title, a video keyword, and/or a video description.
  • the combined push module 404 can also be adapted to:
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • Those skilled in the art will appreciate that some or all of the functions of some or all of the components of the video search-based participle information push device in accordance with embodiments of the present invention may be implemented in practice using a microprocessor or a digital signal processor (DSP).
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • FIG. 5 illustrates a computing device, such as a user terminal device or an application server, that can implement video search based word segmentation information push in accordance with the present invention.
  • the computing device conventionally includes a processor 510 and a computer program product or computer readable medium in the form of a memory 520.
  • the memory 520 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • Memory 520 has a memory space 530 for program code 531 for performing any of the method steps described above.
  • storage space 530 for program code may include various program code 531 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such computer program products are typically portable or fixed storage units as described with reference to FIG.
  • the storage unit may have storage segments, storage spaces, and the like that are similarly arranged to memory 520 in the computing device of FIG.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit includes computer readable code 531', i.e., code readable by a processor such as 510, that when executed by a computing device causes the computing device to perform each step of the methods described above.

Abstract

Disclosed are a participle information push method and device based on video search, a push method for an online play entry object based on video search, and a push method for an associated resource address based on video search. The method comprises: mapping a received video search character string, or feature text information of acquired video resource data, into one or more first participles; searching for associated second participles whose co-occurrence rate with the one or more first participles is higher than a preset threshold value, the co-occurrence rate being the probability that the current one or more first participles and the second participle appear together in the same video resource data; and pushing a combination of the one or more first participles and the one or more associated second participles, or an entry object of video resource data related thereto, or a network link address associated with the video resource data. In the present application, video resource data that users seldom search for but that has many related resources in the video database is pushed, so that good-quality resources in the video database are mined in depth and resource mining efficiency is improved. The index table expands continuously as Internet video content accumulates, and the number and scope of the content produced by the various video sites far exceed the words already searched by users, which helps to improve the recall rate.

Description

Participle information push method and device based on video search
Technical field
The present invention relates to the technical field of the Internet, and in particular to a participle information push method based on video search and a participle information push device based on video search.
Background
A video search engine is a vertical search technology, as distinct from general-purpose search. It crawls video results on the Internet and builds an index over them; because it returns purely video results to the searcher, it greatly reduces the time users spend looking for videos.
Statistics on video search show that entertainment, games, film and television, news, animation and similar types of video are what users mainly search for. This indicates that users' demand in video search is broad in nature. Users often have no strongly defined goal: the search result need not be one specific item, and there is room for expansion as long as the target falls within the range the user likes. Related recommendations are therefore often made to users in addition to the search results.
However, existing video search engines still fall short on related recommendations: some offer none at all, and those that do rely on simple means such as the user's search history data or a manually collated association system. A recommendation system of this kind is based on users' existing search habits, so its recall rate is low; moreover, because the range of what users search for is generally much smaller than the range of resources available on the Internet, it cannot fully mine the high-quality videos on the Internet.
Another search recommendation method relies on manually compiling a resource association system, or obtaining such a system from other knowledge systems, and applying it to the recommendation system. For example, searching a certain search engine for "square dance" returns recommended words such as "social dance", "belly dance" and "aerobics", and searching for "dota" returns recommended words such as "CrossFire" and "World of Warcraft"; but the recall rate of such a system is low, and it generally cannot give recommendations for long-tail searches.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a participle information push method based on video search, and a corresponding participle information push device based on video search, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a participle information push method based on video search is provided, including:
receiving a video search string;
mapping the video search string into one or more first participles;
searching for an associated second participle whose co-occurrence rate with the one or more first participles is higher than a preset threshold; the co-occurrence rate is the probability that the current one or more first participles and a second participle co-occur in the same video resource data;
pushing a combination of the one or more first participles and the one or more associated second participles.
According to another aspect of the present invention, a method for pushing an online play entry object based on video search is provided, including:
receiving a video search string;
mapping the video search string into one or more first participles;
searching for an associated second participle whose co-occurrence rate with the one or more first participles is higher than a preset threshold; the co-occurrence rate is the probability that the current one or more first participles and a second participle co-occur in the same video resource data;
obtaining a network address of one or more video data resources that match the one or more first participles and the associated second participle;
constructing an entry object for playing the video data resource online according to the network addresses of the one or more video data resources;
pushing the one or more entry objects for playing the video data resource online.
According to another aspect of the present invention, a method for pushing an associated resource address based on video search is provided, including:
when receiving a loading or playing request for first video resource data, acquiring feature text information of the first video resource data;
mapping the feature text information into one or more first participles;
searching for an associated second participle whose co-occurrence rate with the one or more first participles is higher than a preset threshold; the co-occurrence rate is the probability that the current one or more first participles and a second participle co-occur in the same video resource data;
obtaining a network link address of second video resource data that matches the one or more first participles and the associated second participle;
pushing the network link address of the second video resource data.
According to another aspect of the present invention, an apparatus for pushing word segment information based on video search is provided, comprising:
a video search string receiving module, adapted to receive a video search string;
a first word segment mapping module, adapted to map the video search string to one or more first word segments;
a second word segment search module, adapted to search for associated second word segments whose co-occurrence rate with the one or more first word segments is higher than a preset threshold, the co-occurrence rate being the probability that the one or more current first word segments and a second word segment appear together in the same video resource data;
a combination pushing module, adapted to push a combination of the one or more first word segments and the one or more associated second word segments.
According to yet another aspect of the present invention, a computer program is provided, comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method for pushing word segment information based on video search according to any one of claims 1-7.
According to a further aspect of the present invention, a computer-readable medium is provided, in which the computer program according to claim 17 is stored.
The beneficial effects of the present invention are as follows:
The present invention can push information based on content that has already been published, freeing the search engine from its dependence on users' search habits. Video resource data that users rarely search for, but for which the video library already holds many related resources, can thus be pushed out, so that high-quality resources in the video library are mined in depth and the efficiency of resource mining is improved. In addition, the index tables keep growing as Internet video content accumulates, and the quantity and breadth of the content produced by the major video sites far exceed the number of terms that users have already searched for, which helps to improve recall.
By pushing combinations of first and second word segments, the present invention lets users carry out further levels of search directly from those combinations, so that a simple search yields more results without repeated search submissions. This reduces the load on the server, reduces the use of network resources, and improves the user experience.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the present invention are more readily apparent, specific embodiments of the present invention are set forth below.
Brief Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
FIG. 1 schematically shows a flow chart of the steps of an embodiment of a method for pushing word segment information based on video search according to one embodiment of the present invention;
FIG. 2 schematically shows a flow chart of the steps of an embodiment of a method for pushing online playback entry objects based on video search according to one embodiment of the present invention;
FIG. 3 schematically shows a flow chart of the steps of an embodiment of a method for pushing associated resource addresses based on video search according to one embodiment of the present invention;
FIG. 4 schematically shows a structural block diagram of an embodiment of an apparatus for pushing word segment information based on video search according to one embodiment of the present invention;
FIG. 5 schematically shows a block diagram of a computing device for performing the method according to the present invention; and
FIG. 6 schematically shows a storage unit for holding or carrying program code that implements the method according to the present invention.
Detailed Description
The present invention is further described below with reference to the drawings and specific embodiments.
Referring to FIG. 1, a flow chart of the steps of an embodiment of a method for pushing word segment information based on video search according to one embodiment of the present invention is shown. The method may specifically include the following steps:
Step 101: receiving a video search string;
It should be noted that the video search string may be video search information entered by a user, and may be used to request a search for related video data resources.
In practical applications, the video search string may be a single word, that is, one semantically independent word such as "中秋" (Mid-Autumn), "端午" (Dragon Boat), or "国庆" (National Day). It may also be a compound, that is, two or more semantically independent words, such as "中秋月饼" (Mid-Autumn mooncakes), "端午粽子" (Dragon Boat Festival zongzi), or "国庆西藏旅游" (National Day travel in Tibet).
Step 102: mapping the video search string to one or more first word segments;
It should be noted that the word segments to which strings are mapped may be preset, and may be used to compute the co-occurrence rates between different word segments.
One or more mapping rules may likewise be preset. They may include removing words without real meaning from the video search string, such as dirty words, modifiers, modal particles, and overly broad words. They may include setting stop words, that is, common words such as "的", "我", and "你", which serve as the criteria for stopping when a phrase is split. They may also include matching association relationships, so that multiple expressions of the same thing correspond to a single expression, for example associating "八月十五" (the fifteenth day of the eighth lunar month), "中秋节" (Mid-Autumn Festival), and "月饼节" (Mooncake Festival) with "中秋" (Mid-Autumn). Other mapping rules may also be included; the embodiments of the present invention place no limitation on this.
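The mapping rules above can be illustrated with a minimal sketch. The rule tables `FILLER_WORDS` and `SYNONYMS` and the function `map_to_segment` are hypothetical names introduced here for illustration; the specification does not prescribe their contents or any particular data structure.

```python
# Hypothetical rule tables; the specification does not list their contents.
FILLER_WORDS = ["的", "了"]      # particles without real meaning, stripped first
SYNONYMS = {                     # several expressions of one thing -> one segment
    "八月十五": "中秋",
    "中秋节": "中秋",
    "月饼节": "中秋",
}


def map_to_segment(search_term: str) -> str:
    """Map a single-word search string to its canonical first word segment."""
    cleaned = search_term
    # Naive removal: a real system would segment first so that characters
    # inside longer words are not stripped by accident.
    for filler in FILLER_WORDS:
        cleaned = cleaned.replace(filler, "")
    # Terms without a synonym entry map to themselves.
    return SYNONYMS.get(cleaned, cleaned)


print(map_to_segment("月饼节"))    # 中秋
print(map_to_segment("中秋节了"))  # 中秋
```

A lookup table keeps the association rule easy to extend; whether the preset rules are stored this way is an assumption.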
English is written in units of words separated by spaces, whereas Chinese is written in units of characters, and all the characters of a sentence must be read together to convey a meaning. For example, the English sentence "I am a student" is "我是一个学生" in Chinese. A computer can easily tell from the spaces that "student" is a word, but it cannot easily tell that the two characters "学" and "生" together form one word. Cutting a sequence of Chinese characters into meaningful words is Chinese word segmentation. For example, segmenting "我是一个学生" yields: 我, 是, 一个, 学生.
Some commonly used word segmentation methods are described below:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched, according to a certain strategy, against the entries of a preset machine dictionary; if a string is found in the dictionary, the match succeeds (a word is recognized). Word segmentation systems in practical use treat this mechanical segmentation as a first-pass method and rely on various other linguistic information to further improve segmentation accuracy.
2. Segmentation based on feature scanning or marker splitting: words with obvious features are identified and cut out of the string to be analyzed first. Using these words as breakpoints, the original string is divided into smaller strings, to which mechanical segmentation is then applied, which reduces the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: rich part-of-speech information supports the segmentation decisions, and the tagging process in turn checks and adjusts the segmentation results, which improves segmentation accuracy.
3. Segmentation based on understanding: the computer simulates a person's understanding of a sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis at the same time as segmentation, and to use syntactic and semantic information to resolve ambiguity. Such a system usually consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Coordinated by the general control part, the segmentation subsystem obtains syntactic and semantic information about words, sentences, and so on to judge segmentation ambiguities; that is, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters occur adjacent to one another reflects well how credible it is that they form a word. The frequency of each combination of adjacent co-occurring characters in a corpus can therefore be counted, and their mutual information computed, together with the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly the characters are bound together; when it exceeds a certain threshold, the character group can be considered likely to form a word. This method only requires counting character-group frequencies in the corpus and needs no segmentation dictionary.
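As an illustration of the string-matching approach (method 1), the following is a minimal sketch of forward maximum matching, one common matching strategy. The specification does not state which strategy is used, so the choice of strategy, the function name `forward_max_match`, and the toy dictionary are all assumptions.

```python
def forward_max_match(text: str, dictionary: set[str]) -> list[str]:
    """Greedy left-to-right segmentation: take the longest dictionary entry
    that starts at the current position."""
    max_len = max((len(word) for word in dictionary), default=1)
    segments = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in dictionary:
                segments.append(candidate)
                i += size
                break
    return segments


# Toy dictionary, assumed for illustration only.
DICTIONARY = {"我", "是", "一个", "学生"}
print(forward_max_match("我是一个学生", DICTIONARY))
# ['我', '是', '一个', '学生']
```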
In a preferred embodiment of the present invention, step 102 may specifically include the following sub-steps:
Sub-step S11: extracting the one word segment to which the video search string is mapped;
Where the video search string is a single word, its corresponding word segment can be extracted directly according to the preset mapping rules. For example, if the video search string is "中秋节" (Mid-Autumn Festival), "我的中秋节" (my Mid-Autumn Festival), or "中秋节了" (it is Mid-Autumn Festival), the mapped first word segment may in each case be "中秋" (Mid-Autumn). Of course, the video search string may also be identical to the first word segment it maps to; for example, if the video search string is "中秋", the mapped first word segment may also be "中秋".
or,
Sub-step S12: when the received video search string is a compound, splitting the video search string into multiple search sub-words;
Sub-step S13: extracting the multiple word segments to which the multiple search sub-words are mapped.
Where the video search string is a compound, it can be segmented according to the preset mapping rules to obtain search sub-words, and the word segment corresponding to each search sub-word is then extracted. For example, if the received video search string is "中秋节月饼" (Mid-Autumn Festival mooncakes), it can be split into the two search sub-words "中秋节" and "月饼" (mooncakes); "中秋节" is then mapped to "中秋" and "月饼" to "月饼", giving the two first word segments "中秋" and "月饼".
Step 103: searching for associated second word segments whose co-occurrence rate with the one or more first word segments is higher than a preset threshold;
the co-occurrence rate being the probability that the one or more current first word segments and a second word segment appear together in the same video resource data;
It should be noted that a second word segment may be any of the preset word segments other than the first word segments. An associated second word segment may be a second word segment whose co-occurrence rate with the first word segments is higher than the preset threshold.
In practical applications, video resource data may include feature text information, which may record information about the video resource data and may also be used to extract word segments.
In a preferred embodiment of the present invention, the feature text information may include a video title, video keywords, and/or a video description.
For example, for video resource data titled "[Paike] Dongguan turns into Venice after a rainstorm, more than a thousand vehicles flooded and stalled - Watch online - XX site, HD video online", the feature text information may be as follows:
Video title (Title): [Paike] Dongguan turns into Venice after a rainstorm, more than a thousand vehicles flooded and stalled - Watch online - XX site, HD video online;
Video keywords (Keywords): YY reporter, daily life news, Dongguan, flooding;
Video description (Description): A rainstorm yesterday morning made residents in parts of Dongguan feel as though they had suddenly arrived in Venice. Cars on the road were flooded and stalled in the downpour, and some residents' homes were inundated as well.
Specifically, the co-occurrence rate may be the probability that the one or more current first word segments and a second word segment appear together in the feature text information of the same video resource data. It may include the co-occurrence rate of one first word segment with a second word segment, and the co-occurrence rate of multiple first word segments with a second word segment.
In a preferred embodiment of the present invention, step 103 may specifically include the following sub-steps:
Sub-step S21: when the video search string is mapped to one first word segment, extracting the preset index table corresponding to the first word segment; wherein the index table includes information on the video resource data to which the first word segment belongs, and all the word segments in the video resource data; the word segments in the video resource data are generated by crawling the video resource data, extracting its feature text information, and performing word segmentation on the feature text information;
In a specific implementation, a search engine may crawl the video resource data on the various website platforms in advance and then build an index library: the feature text information of the video resource data is extracted and segmented, and an index table is built for each word segment. The index table may store information on the video resource data (which may be a video identifier such as an ID, an intranet address, or an external network address, or a record composed of the current word segment and the other word segments) and all the word segments in the video resource data (including the first word segment and the second word segments other than the first word segment).
In a preferred embodiment of the present invention, the feature text information may include a video title, video keywords, and/or a video description.
For example, the index table for "中秋" (Mid-Autumn) may be as follows:
Figure PCTCN2014086519-appb-000001
Figure PCTCN2014086519-appb-000002
Here, the first word segment is "中秋" (Mid-Autumn), and the information on the video resource data includes a video identifier. Of course, the information on the video resource data may omit the video identifier and consist only of records formed from the first word segment and the second word segments (that is, the second word segments in each row form one record).
Of course, the above index table is only an example. When implementing the embodiments of the present invention, other index tables may be set according to the actual situation; the embodiments of the present invention place no limitation on this. In addition to the above index table, those skilled in the art may also adopt other index tables according to actual needs, and the embodiments of the present invention place no limitation on this either.
It should be noted that the video resource data on the various platforms may be crawled periodically or at irregular intervals, after which the index library, that is, each index table, is updated.
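The index construction described above can be sketched as follows. This is only an illustration under assumed names: `build_index_tables`, the record layout, and the video identifiers `v1` to `v3` are hypothetical, and the feature text is taken to be already segmented.

```python
from collections import defaultdict


def build_index_tables(videos: dict[str, list[str]]) -> dict[str, list[dict]]:
    """Build one index table per word segment.

    Each table holds one record per video whose feature text contains the
    segment; a record keeps the video identifier and all of that video's
    word segments.
    """
    tables = defaultdict(list)
    for video_id, segments in videos.items():
        unique = set(segments)
        for segment in unique:
            tables[segment].append({"video_id": video_id, "segments": unique})
    return dict(tables)


# Hypothetical video identifiers with pre-segmented feature text.
videos = {
    "v1": ["中秋", "月饼", "月亮"],
    "v2": ["中秋", "月亮"],
    "v3": ["月饼", "礼盒"],
}
tables = build_index_tables(videos)
print(sorted(record["video_id"] for record in tables["中秋"]))  # ['v1', 'v2']
```

Rebuilding the tables from a fresh crawl, as the periodic update describes, would amount to calling the function again on the new data.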
Sub-step S22: computing the co-occurrence rate of the first word segment with each second word segment in the index table, the co-occurrence rate being the ratio of the number of occurrences of each second word segment in the index table to the total number of pieces of video resource data information in the index table; wherein the second word segments are all the word segments in the video resource data other than the first word segment;
Because the number of occurrences of each second word segment in the index table equals the number of pieces of video resource data to which it belongs, the co-occurrence rate may also be expressed as the ratio of the number of pieces of video resource data to which each second word segment in the index table belongs to the total number of pieces of video resource data information in the index table.
For example, suppose the index table for the word segment "广场舞" (square dancing) contains 100 pieces of video resource data information in total, the index table for the word segment "兵哥哥" (soldier brother) contains 200, and 10 pieces contain both "广场舞" and "兵哥哥" and therefore appear in both index tables. Then for "广场舞", the co-occurrence rate of "广场舞" with "兵哥哥" is 10/100 = 10%, whereas for "兵哥哥", the co-occurrence rate of "兵哥哥" with "广场舞" is 10/200 = 5%.
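The ratio in sub-step S22, and the asymmetry shown in the example, can be reproduced with a short sketch. Here an index table is modeled as a list of records, each record being the set of word segments of one video; this representation and the function name `cooccurrence_rates` are assumptions for illustration.

```python
from collections import Counter


def cooccurrence_rates(first: str, index_table: list[set[str]]) -> dict[str, float]:
    """Rate of each second segment = records containing it / total records."""
    total = len(index_table)
    counts = Counter(
        segment
        for record in index_table
        for segment in record
        if segment != first
    )
    return {segment: n / total for segment, n in counts.items()}


# Synthetic tables matching the example: 10 shared records,
# 100 records for "广场舞" and 200 records for "兵哥哥".
shared = [{"广场舞", "兵哥哥"}] * 10
dance_table = shared + [{"广场舞"}] * 90
soldier_table = shared + [{"兵哥哥"}] * 190

print(cooccurrence_rates("广场舞", dance_table)["兵哥哥"])    # 0.1
print(cooccurrence_rates("兵哥哥", soldier_table)["广场舞"])  # 0.05
```

The rate depends on which index table serves as the denominator, which is why the two directions differ.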
Sub-step S23: extracting the second word segments whose co-occurrence rate is higher than the preset threshold as the associated second word segments.
In a specific implementation, the preset threshold may be set by those skilled in the art according to the actual situation; the embodiments of the present invention place no limitation on this. The associated second word segments extracted in the embodiments of the present invention may be empty, or may be one or more.
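Sub-step S23 amounts to a threshold filter, sketched below. The function name `associated_segments`, the sample rates, and the threshold values are hypothetical; ordering the result by descending rate is an added convenience, not something the specification requires.

```python
def associated_segments(rates: dict[str, float], threshold: float) -> list[str]:
    """Keep second segments whose rate is strictly above the threshold,
    ordered from highest to lowest rate."""
    ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    return [segment for segment, rate in ranked if rate > threshold]


# Hypothetical co-occurrence rates for one first word segment.
rates = {"月亮": 0.70, "月饼": 0.40, "礼盒": 0.05}
print(associated_segments(rates, 0.30))  # ['月亮', '月饼']
print(associated_segments(rates, 0.90))  # [] (the result may be empty)
```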
In a preferred embodiment of the present invention, step 103 may specifically include the following sub-steps:
Sub-step S31: when the video search string is mapped to multiple first word segments, extracting the multiple preset index tables corresponding to the multiple first word segments respectively; each index table includes information on the video resource data to which the first word segment belongs, and all the word segments in the video resource data; the word segments in the video resource data are generated by crawling the video resource data, extracting its feature text information, and performing word segmentation on the feature text information;
In a specific implementation, a search engine may crawl the video resource data on the various platforms in advance and then build an index library: the feature text information of the video resource data is extracted and segmented, and an index table is built for each word segment. The index table may store information on the video resource data (which may be a video identifier such as an ID, an intranet address, or an external network address, or a record composed of the current word segment and the other word segments) and all the word segments in the video resource data (including the first word segment and the second word segments other than the first word segment).
In a preferred embodiment of the present invention, the feature text information may include a video title, video keywords, and/or a video description.
Sub-step S32: extracting the second word segments that co-occur with the multiple first word segments as candidate word segments; wherein the second word segments are all the word segments in the video resource data other than the first word segments;
Specifically, when there are multiple first word segments, there is a corresponding number of index tables, and a candidate word segment must appear in every one of these index tables; that is, the candidate word segment co-occurs with each of the current first word segments in that first word segment's index table.
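The requirement that a candidate appear in every index table is a set intersection, as the sketch below shows. Each index table is again modeled as a list of records (sets of word segments); the function name `candidate_segments` and the sample data are assumptions.

```python
def candidate_segments(
    first_segments: list[str], tables: dict[str, list[set[str]]]
) -> set[str]:
    """Second segments that appear in the index table of every first segment."""
    firsts = set(first_segments)
    per_table = []
    for first in first_segments:
        seen = set().union(*tables[first])  # every segment in this table
        per_table.append(seen - firsts)     # drop the first segments themselves
    return set.intersection(*per_table) if per_table else set()


# Hypothetical index tables for the first segments "中秋" and "月饼".
tables = {
    "中秋": [{"中秋", "月饼", "月亮"}, {"中秋", "团圆"}],
    "月饼": [{"月饼", "中秋", "月亮"}, {"月饼", "礼盒"}],
}
print(candidate_segments(["中秋", "月饼"], tables))  # {'月亮'}
```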
Sub-step S33: computing, in each index table, the co-occurrence rate of the first word segment with the candidate word segment, the co-occurrence rate being the ratio of the number of occurrences of the candidate word segment in the index table to the total number of pieces of video resource data information in the index table;
For example, the video search string "中秋节月饼" (Mid-Autumn Festival mooncakes) can be mapped to the first word segments "中秋" (Mid-Autumn) and "月饼" (mooncakes). If one of the extracted candidate word segments is "月亮" (moon), the co-occurrence rate of "中秋" with "月亮" (assume 70%) and the co-occurrence rate of "月饼" with "月亮" (assume 60%) can be computed separately.
Sub-step S34: configuring multiple corresponding weights for the co-occurrence rates of the multiple first word segments with the candidate word segment;
The weights may be determined from the proportions of the total numbers of pieces of video resource data information in the index tables of the first word segments: the larger the total in an index table, the larger its weight. For example, if the total number of pieces of video resource data information in the index table for "中秋" is 900 and the total in the index table for "月饼" is 100, the weight of the co-occurrence rate of "中秋" with "月亮" may be 0.9, and the weight of the co-occurrence rate of "月饼" with "月亮" may be 0.1.
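The proportional weighting in this example can be written as a one-line normalization. The function name `table_size_weights` is hypothetical, and other weighting schemes are equally possible.

```python
def table_size_weights(totals: dict[str, int]) -> dict[str, float]:
    """Weight each first segment by its share of the combined record count."""
    grand_total = sum(totals.values())
    return {segment: n / grand_total for segment, n in totals.items()}


# Totals from the example: 900 records for "中秋", 100 for "月饼".
weights = table_size_weights({"中秋": 900, "月饼": 100})
print(weights)  # {'中秋': 0.9, '月饼': 0.1}
```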
Of course, the above weights are only an example. When implementing the embodiments of the present invention, other weights may be set according to the actual situation, for example weights based on current social hot topics (news rankings, microblog rankings, and so on) or weights based on the user's local and/or online behavior (video playback, news reading, and so on); the embodiments of the present invention place no limitation on this. In addition to the above weights, those skilled in the art may also adopt other weights according to actual needs, and the embodiments of the present invention place no limitation on this either.
Sub-step S35: computing the average of the multiple weighted co-occurrence rates as the co-occurrence rate of the multiple first word segments with the candidate word segment;
In the embodiments of the present invention, the weighted average of the multiple co-occurrence rates may be used as the final co-occurrence rate.
For example, the co-occurrence rate of "中秋" (Mid-Autumn), "月饼" (mooncakes), and "月亮" (moon) may be (70% × 0.9 + 60% × 0.1) / 2 = 34.5%.
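The arithmetic of this example can be checked with a short sketch. It follows the formula exactly as written in the example, that is, the sum of the weighted rates divided by the number of first word segments; the function name `combined_rate` is hypothetical.

```python
def combined_rate(rates: list[float], weights: list[float]) -> float:
    """Mean of the weighted rates: sum(rate * weight) / number of rates."""
    weighted = [rate * weight for rate, weight in zip(rates, weights)]
    return sum(weighted) / len(weighted)


# Rates and weights from the example: 70% at weight 0.9, 60% at weight 0.1.
rate = combined_rate([0.70, 0.60], [0.9, 0.1])
print(f"{rate:.1%}")  # 34.5%
```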
Sub-step S36: extracting the candidate word segments whose co-occurrence rate is higher than the preset threshold as the associated second word segments.
In a specific implementation, the preset threshold may be set by those skilled in the art according to the actual situation; the embodiments of the present invention place no limitation on this. The associated second word segments extracted in the embodiments of the present invention may be empty, or may be one or more.
In a preferred embodiment of the present invention, step 103 may specifically include the following sub-steps:
Sub-step S41: when the video search string is mapped to a plurality of first participles, extract the plurality of preset index tables corresponding to those first participles; each index table includes information on the video resource data to which the first participle belongs, and all participles in that video resource data; the participles in the video resource data are generated by crawling the video resource data, extracting its feature text information, and segmenting that feature text information;
In a specific implementation, a search engine may crawl the video resource data on each platform in advance and then build the index: the feature text information of the video resource data is extracted and segmented, and an index table is built for each participle. The index table may store information on the video resource data (a video identifier such as an ID, an intranet address or an external network address, or a record composed of the current participle and the other participles) and all participles in the video resource data (the first participle and the second participles other than the first participle).
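The index-building stage described above might look like the following sketch. The crawler is replaced by an in-memory dictionary, and a whitespace split stands in for a real Chinese word segmenter; `build_index_tables`, the video IDs and the sample titles are all hypothetical.

```python
from collections import defaultdict


def build_index_tables(videos, segment):
    """Map each participle to the records (video id, all participles) containing it."""
    tables = defaultdict(list)
    for video_id, feature_text in videos.items():
        words = set(segment(feature_text))
        for word in words:
            # Each record keeps the video identifier and every participle of that video.
            tables[word].append((video_id, words))
    return dict(tables)


# Hypothetical crawled feature text (for example, video titles).
videos = {
    "v1": "mid-autumn mooncake moon",
    "v2": "mid-autumn moon",
    "v3": "mooncake recipe",
}
tables = build_index_tables(videos, str.split)
print(len(tables["mid-autumn"]))  # number of records in the "mid-autumn" table
```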
In a preferred embodiment of the present invention, the feature text information may include a video title, video keywords and/or a video description.
Sub-step S42: determine a main participle using the plurality of index tables, the main participle being the first participle whose index table contains the largest total number of pieces of video resource data information;
To improve the user experience, when the amounts of video resource data for the first participles differ greatly, the first participles with little video resource data information may be ignored. For example, for the first participles “中秋” (Mid-Autumn) and “月饼” (mooncake) mapped from the video search string “中秋节月饼” (Mid-Autumn Festival mooncake), if the index table of “中秋” holds 900 pieces of video resource data information and the index table of “月饼” holds 100, “中秋” may be set as the main participle.
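Selecting the main participle reduces to taking the first participle with the largest index table. A minimal sketch, assuming the record counts have already been read from the index tables (`pick_main_participle` is a hypothetical name):

```python
def pick_main_participle(record_counts):
    """Return the first participle whose index table holds the most video records."""
    if not record_counts:
        raise ValueError("at least one first participle is required")
    return max(record_counts, key=record_counts.get)


# Counts taken from the example: 900 records for "中秋", 100 for "月饼".
main = pick_main_participle({"中秋": 900, "月饼": 100})
print(main)
```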
Sub-step S43: calculate the co-occurrence rate between the main participle and each second participle in its index table, the co-occurrence rate being the ratio of the number of occurrences of each second participle in the index table to the total number of pieces of video resource data information in the index table; the second participles are all participles in the video resource data other than the first participles;
In this embodiment of the present invention, the co-occurrence rate of the main participle may be used as the final co-occurrence rate.
Sub-step S44: extract the second participles whose co-occurrence rate is higher than a preset threshold as the associated second participles.
In a specific implementation, the preset threshold may be set by a person skilled in the art according to the actual situation; the embodiments of the present invention do not limit it. The set of associated second participles extracted in this embodiment may be empty, or may contain one or more participles.
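Sub-steps S43 and S44 can be sketched as a count over one index table followed by a threshold filter. The record layout (one set of participles per video), the sample data and the threshold of 0.5 are assumptions made for illustration only.

```python
from collections import Counter


def associated_participles(records, first_participles, threshold):
    """Rate = records containing the word / total records in the index table."""
    total = len(records)
    counts = Counter(
        word
        for words in records
        for word in set(words) - set(first_participles)
    )
    rates = {word: count / total for word, count in counts.items()}
    # Keep only second participles whose rate exceeds the preset threshold.
    return {word: rate for word, rate in rates.items() if rate > threshold}


# Hypothetical index table of the main participle "mid-autumn".
records = [
    {"mid-autumn", "moon"},
    {"mid-autumn", "moon", "mooncake"},
    {"mid-autumn", "lantern"},
    {"mid-autumn", "moon"},
]
result = associated_participles(records, {"mid-autumn", "mooncake"}, threshold=0.5)
print(result)
```

With this data "moon" appears in 3 of 4 records and passes the threshold, while "lantern" (1 of 4) does not; an empty result is also possible, as the text notes.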
Step 104: push the combination of the one or more first participles and the one or more associated second participles.
Specifically, after sub-step S23, the combinations of the single first participle with one or more associated second participles may be pushed at a position such as the drop-down menu of a web page's input box. For example, if the video search string is “dota” and the words with high co-occurrence rates are “搞笑” (funny), “蛋疼”, “2009”, “海涛” (Haitao), “第一视角” (first-person view) and “经典” (classic), with co-occurrence rates of 40%, 35%, 30%, 25%, 20% and 10% respectively, then the combinations “dota搞笑”, “dota蛋疼”, “dota2009”, “dota海涛”, “dota第一视角” and “dota经典” are pushed in that order.
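The ordering in the “dota” example can be reproduced with a sort on the co-occurrence rate. This is only a sketch: `build_suggestions` is a hypothetical helper, and concatenating without a separator simply mirrors the combinations quoted in the example.

```python
def build_suggestions(first_participle, rates, separator=""):
    """Order associated second participles by descending rate and prepend the first participle."""
    ordered = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    return [f"{first_participle}{separator}{word}" for word, _ in ordered]


# Rates from the example, deliberately listed out of order.
rates = {
    "经典": 0.10,
    "搞笑": 0.40,
    "2009": 0.30,
    "蛋疼": 0.35,
    "第一视角": 0.20,
    "海涛": 0.25,
}
suggestions = build_suggestions("dota", rates)
print(suggestions)
```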
After sub-step S36, the combination of the plurality of first participles with one or more associated second participles may be pushed at a position such as the drop-down menu of a web page's input box. For example, the video search string “广场舞兵哥哥” is mapped to the first participles “广场舞” (square dance) and “兵哥哥” (soldier brother), and the second participles that appear together with both first participles are extracted; if the second participle “教学” (tutorial) qualifies as an associated second participle, the combination “广场舞兵哥哥教学” is finally pushed.
In a preferred embodiment of the present invention, step 104 may specifically include the following sub-step:
Sub-step S51: push the main participle and the associated second participles.
After sub-step S44, the combination of the main participle with one or more associated second participles may be pushed at a position such as the drop-down menu of a web page's input box. For example, for the first participles “中秋” (Mid-Autumn) and “月饼” (mooncake) mapped from the video search string “中秋节月饼”, “中秋” may be set as the main participle; if the associated second participle “月亮” (moon) is obtained, the combination “中秋月亮” may be pushed.
The user can search for new video resource data by clicking a pushed combination in the drop-down menu.
The present invention can push suggestions based on content that has already been published, freeing the search engine from its dependence on users' search habits: video resource data that users rarely search for, but for which the video library already holds many related resources, is pushed out. This enables deep mining of high-quality resources in the video library and improves the efficiency of resource mining. In addition, the index tables keep growing as Internet video content accumulates, and the quantity and breadth of the content produced by the major video sites far exceed the set of words users have already searched for, which helps to improve recall.
By pushing combinations of first participles and second participles, the present invention lets the user carry out a deeper search directly from a combination, so that a simple search yields more results without submitting the search repeatedly. This reduces the load on the server, reduces the use of network resources, and improves the user experience.
Referring to FIG. 2, a flow chart of the steps of an embodiment of a video-search-based method for pushing an online playback entry object according to one embodiment of the present invention is shown, which may specifically include the following steps:
Step 201: receive a video search string;
Step 202: map the video search string to one or more first participles;
In a preferred embodiment of the present invention, step 202 may specifically include the following sub-steps:
Sub-step S61: extract the single participle to which the video search string is mapped;
or,
Sub-step S62: when the received video search string is a compound word, split the video search string into a plurality of search sub-words;
Sub-step S63: extract the plurality of participles to which the plurality of search sub-words are mapped.
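The text does not say how a search string is split into sub-words. One common technique, used here purely as an assumed stand-in, is greedy forward maximum matching against a dictionary of known participles; the function name and the dictionary contents below are hypothetical.

```python
def map_to_participles(query, dictionary):
    """Greedy forward maximum matching; characters that start no dictionary word are skipped."""
    longest = max(map(len, dictionary), default=0)
    participles, i = [], 0
    while i < len(query):
        for size in range(min(longest, len(query) - i), 0, -1):
            piece = query[i:i + size]
            if piece in dictionary:
                participles.append(piece)
                i += size
                break
        else:
            i += 1  # no dictionary word starts at this position
    return participles


# Hypothetical dictionary of preset participles.
dictionary = {"dota", "广场舞", "兵哥哥", "中秋", "月饼"}
print(map_to_participles("dota", dictionary))
print(map_to_participles("广场舞兵哥哥", dictionary))
print(map_to_participles("中秋节月饼", dictionary))
```

A query that matches a single dictionary entry corresponds to sub-step S61, and a compound query that yields several entries corresponds to sub-steps S62 and S63.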
Step 203: find the associated second participles whose co-occurrence rate with the one or more first participles is higher than a preset threshold;
In a preferred embodiment of the present invention, step 203 may specifically include the following sub-steps:
Sub-step S71: when the video search string is mapped to a single first participle, extract the preset index table corresponding to the first participle; the index table includes information on the video resource data to which the first participle belongs, and all participles in that video resource data; the participles in the video resource data are generated by crawling the video resource data, extracting its feature text information, and segmenting that feature text information;
Sub-step S72: calculate the co-occurrence rate between the first participle and each second participle in the index table, the co-occurrence rate being the ratio of the number of occurrences of each second participle in the index table to the total number of pieces of video resource data information in the index table; the second participles are all participles in the video resource data other than the first participle;
Sub-step S73: extract the second participles whose co-occurrence rate is higher than the preset threshold as the associated second participles.
In a preferred embodiment of the present invention, step 203 may specifically include the following sub-steps:
Sub-step S81: when the video search string is mapped to a plurality of first participles, extract the plurality of preset index tables corresponding to those first participles; each index table includes information on the video resource data to which the first participle belongs, and all participles in that video resource data; the participles in the video resource data are generated by crawling the video resource data, extracting its feature text information, and segmenting that feature text information;
In a preferred embodiment of the present invention, the feature text information may include a video title, video keywords and/or a video description.
Sub-step S82: extract the second participles that appear together with the plurality of first participles as candidate participles; the second participles are all participles in the video resource data other than the first participles;
Sub-step S83: in each index table, calculate the co-occurrence rate between the first participle and the candidate participle, the co-occurrence rate being the ratio of the number of occurrences of the candidate participle in the index table to the total number of pieces of video resource data information in the index table;
Sub-step S84: configure a corresponding weight for each co-occurrence rate between the plurality of first participles and the candidate participle;
Sub-step S85: compute the average of the weighted co-occurrence rates, and use it as the co-occurrence rate between the plurality of first participles and the candidate participle;
Sub-step S86: extract the candidate participles whose co-occurrence rate is higher than the preset threshold as the associated second participles.
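Sub-steps S82 to S86 can be sketched end to end as below. Treating "appears together with the plurality of first participles" as "appears in every first participle's index table" is an assumption, as are the function name, the sample tables, the weights and the threshold; every index table is assumed to be non-empty.

```python
def associated_for_multiple(tables, weights, threshold):
    """tables: first participle -> list of participle sets (one set per video record)."""
    firsts = set(tables)
    # S82: candidates are second participles present in every index table.
    per_table = [set().union(*records) - firsts for records in tables.values()]
    candidates = set.intersection(*per_table)
    result = {}
    for word in candidates:
        # S83: rate inside each index table; S84 and S85: weight, then average.
        weighted = [
            weights[first] * sum(word in record for record in records) / len(records)
            for first, records in tables.items()
        ]
        rate = sum(weighted) / len(weighted)
        if rate > threshold:  # S86
            result[word] = rate
    return result


# Hypothetical index tables for two first participles.
tables = {
    "square-dance": [
        {"square-dance", "soldier-brother", "tutorial"},
        {"square-dance", "tutorial"},
    ],
    "soldier-brother": [
        {"square-dance", "soldier-brother", "tutorial"},
        {"soldier-brother", "song"},
    ],
}
weights = {"square-dance": 0.6, "soldier-brother": 0.4}
result = associated_for_multiple(tables, weights, threshold=0.3)
print(result)
```

Here "song" is dropped at S82 because it never appears in the "square-dance" table, and "tutorial" scores (1.0*0.6 + 0.5*0.4)/2 = 0.4, which passes the assumed threshold.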
In a preferred embodiment of the present invention, step 203 may specifically include the following sub-steps:
Sub-step S91: when the video search string is mapped to a plurality of first participles, extract the plurality of preset index tables corresponding to those first participles; each index table includes information on the video resource data to which the first participle belongs, and all participles in that video resource data; the participles in the video resource data are generated by crawling the video resource data, extracting its feature text information, and segmenting that feature text information;
In a preferred embodiment of the present invention, the feature text information may include a video title, video keywords and/or a video description.
Sub-step S92: determine a main participle using the plurality of index tables, the main participle being the first participle whose index table contains the largest total number of pieces of video resource data information;
Sub-step S93: calculate the co-occurrence rate between the main participle and each second participle in its index table, the co-occurrence rate being the ratio of the number of occurrences of each second participle in the index table to the total number of pieces of video resource data information in the index table; the second participles are all participles in the video resource data other than the first participles;
Sub-step S94: extract the second participles whose co-occurrence rate is higher than the preset threshold as the associated second participles.
Step 204: obtain the network addresses of one or more video data resources that match the one or more first participles and the associated second participles;
After sub-step S73, the combinations of the single first participle with one or more associated second participles can be obtained. For example, if the video search string is “dota” and the words with high co-occurrence rates are “搞笑” (funny), “蛋疼”, “2009”, “海涛” (Haitao), “第一视角” (first-person view) and “经典” (classic), with co-occurrence rates of 40%, 35%, 30%, 25%, 20% and 10% respectively, then the combinations obtained are, in order, “dota搞笑”, “dota蛋疼”, “dota2009”, “dota海涛”, “dota第一视角” and “dota经典”.
After sub-step S86, the combination of the plurality of first participles with one or more associated second participles can be obtained. For example, the video search string “广场舞兵哥哥” is mapped to the first participles “广场舞” (square dance) and “兵哥哥” (soldier brother), and the second participles that appear together with both first participles are extracted; if the second participle “教学” (tutorial) qualifies as an associated second participle, the final combination is “广场舞兵哥哥教学”.
In a preferred embodiment of the present invention, step 204 may specifically include the following sub-step:
Sub-step S101: obtain the network addresses of one or more video data resources that match the main participle and the associated second participles.
After sub-step S94, the combination of the main participle with one or more associated second participles can be obtained. For example, for the first participles “中秋” (Mid-Autumn) and “月饼” (mooncake) mapped from the video search string “中秋节月饼”, “中秋” may be set as the main participle; if the associated second participle “月亮” (moon) is obtained, the final combination is “中秋月亮”.
In this embodiment of the present invention, matching video data resources may be searched for using the combination of the one or more first participles and the second participles; when a match is found, its network address is recorded, which may be an intranet address or an external network address.
Step 205: construct, from the network addresses of the one or more video data resources, entry objects for playing the video data resources online;
The entry object may be an icon or button in a web page that links to the online playback URL. In a specific implementation, an icon or button may be configured in the current page and associated, in an extended window, with the network address of the video data resource; when the user clicks the icon or button and the network address is triggered, the corresponding video data resource may be loaded from the URL held in the database.
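One way to picture an entry object is as a link element generated from the recorded network address. The sketch below is only an illustration: the `play-entry` class name, the example URL and the helper `build_entry_object` are hypothetical, and a real page would render the icon or button however its front end requires.

```python
from html import escape


def build_entry_object(play_url, label="Watch Now"):
    """Render a clickable entry object that links to the online playback URL."""
    href = escape(play_url, quote=True)  # escape &, <, > and quotes for the attribute
    return f'<a class="play-entry" href="{href}">{escape(label)}</a>'


# Hypothetical network address recorded in step 204.
entry = build_entry_object("https://video.example.com/play?id=42&src=search")
print(entry)
```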
Step 206: push the one or more entry objects for playing the video data resources online.
In practice, the entry object may be placed anywhere on the current page; by activating the entry object, the user triggers the network address of the corresponding video data resource, and the video data resource is then loaded.
For example, the user enters the video search string “钢铁” in the search box; the string itself serves as the first participle, and the video data resource matching “钢铁侠3” (Iron Man 3), the combination of the first participle and the associated second participle, is obtained. The entry object of that video resource is an icon labelled “立即观看” (Watch Now) to prompt the user; when the user clicks the icon, the browser goes to the playback page of “钢铁侠3”.
The present invention can push suggestions based on content that has already been published, freeing the search engine from its dependence on users' search habits: video resource data that users rarely search for, but for which the video library already holds many related resources, is pushed out. This enables deep mining of high-quality resources in the video library and improves the efficiency of resource mining. In addition, the index tables keep growing as Internet video content accumulates, and the quantity and breadth of the content produced by the major video sites far exceed the set of words users have already searched for, which helps to improve recall.
By pushing entry objects for playing video data resources online, the present invention lets the user obtain more video search results directly from an entry object, so that a simple search yields more results without submitting the search repeatedly. This reduces the load on the server, reduces the use of network resources, and improves the user experience.
Referring to FIG. 3, a flow chart of the steps of an embodiment of a video-search-based method for pushing associated resource addresses according to one embodiment of the present invention is shown, which may specifically include the following steps:
Step 301: when a loading or playback request for first video resource data is received, obtain the feature text information of the first video resource data;
It should be noted that the first video resource data may be located on the terminal device or on the network, and the feature text information may be information carried by the video resource data.
In a preferred embodiment of the present invention, step 301 may specifically include the following sub-steps:
Sub-step S111: when a playback request for the first video resource data is received, receive the feature text information of the first video resource data sent by the current terminal;
When the first video resource data is located on the terminal device, the terminal device may extract the feature text information of the first video resource data and upload it to the corresponding server side.
or,
Sub-step S112: when a loading request for the first video resource data is received, extract the locally preset feature text information of the video resource data.
When the first video resource data is located on the network, the server side may extract the feature text information of the first video resource data.
In a preferred embodiment of the present invention, the feature text information may include a video title, video keywords and/or a video description.
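The two branches of step 301 can be sketched as a small dispatch on the request type. The request layout (`kind`, `feature_text`, `video_id`), the metadata store and the field names are hypothetical and are not prescribed by the text.

```python
FIELDS = ("title", "keywords", "description")


def feature_text_for(request, server_metadata):
    """Return feature text for a play request (local video) or a load request (network video)."""
    if request["kind"] == "play":
        # S111: the terminal extracted the feature text and uploaded it with the request.
        fields = request["feature_text"]
    elif request["kind"] == "load":
        # S112: the server reads its own preset feature text for the video.
        fields = server_metadata[request["video_id"]]
    else:
        raise ValueError(f"unsupported request kind: {request['kind']!r}")
    parts = (fields.get(name, "") for name in FIELDS)
    return " ".join(part for part in parts if part)


# Hypothetical server-side metadata and requests.
server_metadata = {"v9": {"title": "Iron Man 3", "description": "trailer"}}
play = {"kind": "play", "feature_text": {"title": "dota", "keywords": "funny"}}
load = {"kind": "load", "video_id": "v9"}
print(feature_text_for(play, server_metadata))
print(feature_text_for(load, server_metadata))
```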
Step 302: map the feature text information to one or more first participles;
In a preferred embodiment of the present invention, step 302 may specifically include the following sub-steps:
Sub-step S121: extract the single participle to which the feature text information is mapped;
or,
Sub-step S122: when the received feature text information is a compound word, split the feature text information into a plurality of search sub-words;
Sub-step S123: extract the plurality of participles to which the plurality of search sub-words are mapped.
Step 303: find the associated second participles whose co-occurrence rate with the one or more first participles is higher than a preset threshold;
The co-occurrence rate is the probability that the current one or more first participles and a second participle appear together in the same video resource data;
Specifically, the co-occurrence rate may be the probability that the current one or more first participles and a second participle appear together in the feature text information of the same video resource data; it covers both the co-occurrence rate between a single first participle and a second participle and the co-occurrence rate between a plurality of first participles and a second participle.
It should be noted that a second participle may be any participle, among all preset participles, other than the first participles. An associated second participle may be a second participle whose co-occurrence rate with the first participles is higher than the preset threshold.
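For a single first participle, the probability above can be written as an empirical frequency over that participle's index table, which matches the ratio used in the sub-steps of step 303. The symbols T, r and \theta are notation introduced here for illustration and do not appear in the original text.

```latex
% Assumed notation: T(w_1) is the index table of first participle w_1;
% each record v in T(w_1) is the set of participles of one video.
r(w_1, w_2) = \frac{\left|\{\, v \in T(w_1) : w_2 \in v \,\}\right|}{\left|T(w_1)\right|},
\qquad
w_2 \text{ is an associated second participle of } w_1 \iff r(w_1, w_2) > \theta
```

Here \theta stands for the preset threshold, whose value is left to the implementer.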
In practice, the video resource data may include feature text information, which may be used to record information about the video resource data and may also be used to extract participles.
In a preferred embodiment of the present invention, step 303 may specifically include the following sub-steps:
Sub-step S131: when the feature text information is mapped to a single first participle, extract the preset index table corresponding to the first participle; the index table includes information on the video resource data to which the first participle belongs, and all participles in that video resource data; the participles in the video resource data are generated by crawling the video resource data, extracting its feature text information, and segmenting that feature text information;
In a preferred embodiment of the present invention, the feature text information may include a video title, video keywords and/or a video description.
Sub-step S132: calculate the co-occurrence rate between the first participle and each second participle in the index table, the co-occurrence rate being the ratio of the number of occurrences of each second participle in the index table to the total number of pieces of video resource data information in the index table; the second participles are all participles in the video resource data other than the first participle;
Sub-step S133: extract the second participles whose co-occurrence rate is higher than the preset threshold as the associated second participles.
In a preferred embodiment of the present invention, step 303 may specifically include the following sub-steps:
Sub-step S141: when the feature text information is mapped to a plurality of first participles, extract the plurality of preset index tables corresponding to those first participles; each index table includes information on the video resource data to which the first participle belongs, and all participles in that video resource data; the participles in the video resource data are generated by crawling the video resource data, extracting its feature text information, and segmenting that feature text information;
In a preferred embodiment of the present invention, the feature text information may include a video title, video keywords and/or a video description.
Sub-step S142: extract the second participles that appear together with the plurality of first participles as candidate participles; the second participles are all participles in the video resource data other than the first participles;
Sub-step S143: in each index table, calculate the co-occurrence rate between the first participle and the candidate participle, the co-occurrence rate being the ratio of the number of occurrences of the candidate participle in the index table to the total number of pieces of video resource data information in the index table;
Sub-step S144: configure a corresponding weight for each co-occurrence rate between the plurality of first participles and the candidate participle;
Sub-step S145: compute the average of the weighted co-occurrence rates, and use it as the co-occurrence rate between the plurality of first participles and the candidate participle;
Sub-step S146: extract the candidate participles whose co-occurrence rate is higher than the preset threshold as the associated second participles.
In a preferred embodiment of the present invention, step 303 may specifically include the following sub-steps:
Sub-step S151: when the feature text information is mapped to a plurality of first participles, extract the plurality of preset index tables corresponding to those first participles; each index table includes information on the video resource data to which the first participle belongs, and all participles in that video resource data; the participles in the video resource data are generated by crawling the video resource data, extracting its feature text information, and segmenting that feature text information;
In a preferred embodiment of the present invention, the feature text information may include a video title, video keywords and/or a video description.
Sub-step S152: determine a main participle using the plurality of index tables, the main participle being the first participle whose index table contains the largest total number of pieces of video resource data information;
Sub-step S153: calculate the co-occurrence rate between the main participle and each second participle in its index table, the co-occurrence rate being the ratio of the number of occurrences of each second participle in the index table to the total number of pieces of video resource data information in the index table; the second participles are all participles in the video resource data other than the first participles;
In this embodiment of the present invention, the co-occurrence rate of the main participle may be used as the final co-occurrence rate.
Sub-step S154: extract the second participles whose co-occurrence rate is higher than the preset threshold as the associated second participles.
Step 304: obtain the network link address of second video resource data that matches the one or more first word segments and the associated second word segment;
Specifically, after sub-step S133, a combination of the current single first word segment and one or more associated second word segments is obtained. For example, if the feature text information is "dota" and the words with the highest co-occurrence rates are "搞笑" (funny), "蛋疼", "2009", "海涛" (Haitao), "第一视角" (first-person view), and "经典" (classic), with co-occurrence rates of 40%, 35%, 30%, 25%, 20%, and 10% respectively, then the combinations obtained are, in order, "dota搞笑", "dota蛋疼", "dota2009", "dota海涛", "dota第一视角", and "dota经典".
After sub-step S146, a combination of the current multiple first word segments and one or more associated second word segments is obtained. For example, if the feature text information is "广场舞兵哥哥" (square dance, soldier brother), it is mapped to the first word segments "广场舞" and "兵哥哥", and the second word segments that appear together with both first word segments are extracted. If the second word segment "教学" (teaching) qualifies as an associated second word segment, the combination finally obtained is "广场舞兵哥哥教学".
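The ordering of combinations in the "dota" example can be reproduced with a short sketch. The function name `build_combinations` is hypothetical, the segments are joined with a space for readability (the original examples concatenate them directly), and romanized stand-ins replace the Chinese terms.

```python
def build_combinations(first_segments, rates, threshold=0.0):
    """Join the first segments with each associated second segment.

    rates maps a second segment to its co-occurrence rate (0.0 to 1.0).
    Combinations are ordered from the highest rate to the lowest.
    """
    ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    return [" ".join(list(first_segments) + [segment])
            for segment, rate in ranked if rate > threshold]


# Rates taken from the "dota" example; the keys are stand-ins for the original terms.
dota_rates = {"funny": 0.40, "danteng": 0.35, "2009": 0.30,
              "haitao": 0.25, "first-person": 0.20, "classic": 0.10}
combos = build_combinations(["dota"], dota_rates)
```

Passing several first word segments, as in the square-dance example, yields a single combination per associated second word segment.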
In a preferred embodiment of the present invention, step 304 may specifically include the following sub-step:
Sub-step S161: obtain the network link address of the second video resource data corresponding to the main word segment and the associated second word segment.
After sub-step S154, a combination of the current main word segment and one or more associated second word segments is obtained. For example, for the first word segments "中秋" (Mid-Autumn) and "月饼" (mooncake) mapped from the feature text information "中秋节月饼" (Mid-Autumn Festival mooncake), "中秋" may be set as the main word segment; if the associated second word segment "月亮" (moon) is obtained, the combination finally obtained is "中秋月亮".
In this embodiment of the present invention, matching video resource data may be searched for based on the combination of the one or more first word segments and the second word segment. When a match is found, its network link address is recorded; the address may be an intranet address or an external network address.
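The search-and-record step can be sketched as follows, assuming an in-memory video library in which each entry carries its link address and the word segments of its feature text. `VideoResource`, `find_matching_addresses`, and the example URLs are hypothetical; a real system would presumably query an index instead of scanning a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoResource:
    url: str             # network link address (intranet or external)
    segments: frozenset  # word segments from the video's feature text


def find_matching_addresses(combination, library):
    """Return the addresses of videos containing every segment of the combination."""
    wanted = set(combination)
    return [video.url for video in library if wanted <= video.segments]


library = [
    VideoResource("http://video.example/1", frozenset({"dota", "funny", "2009"})),
    VideoResource("http://video.example/2", frozenset({"dota", "classic"})),
    VideoResource("http://intranet.example/3", frozenset({"dota", "funny"})),
]
addresses = find_matching_addresses(["dota", "funny"], library)
```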
Step 305: push the network link address of the second video resource data.
In practical applications, the network link address of the second video resource data may be placed anywhere on the current page, or may be pushed by means of an embedded icon, button, or the like. The user can load the video resource data by triggering the network link address of the second video resource data.
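One way to embed such an address in a page is to render it as an HTML link. The sketch below is only an assumption about presentation: the function name `render_link` and the `related-video` class are hypothetical, and `html.escape` from the Python standard library is used so that the address and label cannot break the markup.

```python
from html import escape


def render_link(url, label):
    """Render a pushed network link address as an HTML anchor element."""
    return '<a class="related-video" href="{}">{}</a>'.format(
        escape(url, quote=True), escape(label))


link = render_link("http://video.example/v?id=1&t=2", "dota funny")
```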
The present invention can push information based on content that has already been published, freeing the search engine from its dependence on users' search habits. Video resource data that few users search for, but for which the video library already holds many related resources, can thus be pushed, which enables deep mining of high-quality resources in the video library and improves the efficiency of resource mining. In addition, the index tables keep expanding as Internet video content accumulates, and the volume and breadth of content produced by the major video sites will far exceed the number of terms that users have already searched for, which helps improve recall.
By obtaining the network link address of the second video resource data that matches the first word segment and the second word segment, the present invention lets the user obtain the video resource data directly from that address. The user gets more results from a single simple search without submitting multiple searches, which reduces the load on the server, reduces the use of network resources, and improves the user experience.
For simplicity of description, the method embodiments are expressed as a series of action combinations. However, those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in another order or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 4, a structural block diagram of an embodiment of a video-search-based word segment information pushing apparatus according to one embodiment of the present invention is shown. The apparatus may specifically include the following modules:
a video search string receiving module 401, adapted to receive a video search string;
a first word segment mapping module 402, adapted to map the video search string to one or more first word segments;
a second word segment finding module 403, adapted to find an associated second word segment whose co-occurrence rate with the one or more first word segments is higher than a preset threshold, the co-occurrence rate being the probability that the current one or more word segments and a second word segment appear together in the same video resource data;
a combination pushing module 404, adapted to push a combination of the one or more first word segments and the one or more associated second word segments.
In a preferred embodiment of the present invention, the first word segment mapping module 402 may further be adapted to:
extract the single word segment to which the video search string is mapped;
or,
when the received video search string is a compound word, split the video search string into a plurality of search sub-words, and extract the plurality of word segments to which the search sub-words are mapped.
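The text does not say how a compound search string is split. As one possible approach, the sketch below uses dictionary-based forward maximum matching, a common technique for Chinese word segmentation that is swapped in here purely for illustration; the function name `split_compound` and the vocabulary are hypothetical.

```python
def split_compound(query, vocabulary):
    """Split query into sub-words by forward maximum matching.

    At each position the longest vocabulary entry is taken; a character
    that starts no known entry is emitted on its own.
    """
    max_len = max(map(len, vocabulary), default=1)
    segments, i = [], 0
    while i < len(query):
        for size in range(min(max_len, len(query) - i), 0, -1):
            piece = query[i:i + size]
            if size == 1 or piece in vocabulary:
                segments.append(piece)
                i += size
                break
    return segments


vocabulary = {"广场舞", "兵哥哥", "教学"}
sub_words = split_compound("广场舞兵哥哥", vocabulary)
```

Each resulting sub-word would then be mapped to a first word segment, as described above.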
In a preferred embodiment of the present invention, the second word segment finding module 403 may further be adapted to:
when the video search string is mapped to one first word segment, extract a preset index table corresponding to the first word segment; wherein the index table includes information on the video resource data to which the first word segment belongs, and all word segments in the video resource data; all word segments in the video resource data are generated by crawling the video resource data, extracting feature text information of the video resource data, and performing word segmentation on the feature text information;
calculate the co-occurrence rate between the first word segment and each second word segment in the index table, the co-occurrence rate being the ratio of the number of occurrences of the second word segment in the index table to the total number of pieces of video resource data information in the index table; wherein a second word segment is any word segment in the video resource data other than the first word segment;
extract each second word segment whose co-occurrence rate is higher than the preset threshold as an associated second word segment.
In a preferred embodiment of the present invention, the second word segment finding module 403 may further be adapted to:
when the video search string is mapped to a plurality of first word segments, extract the plurality of preset index tables corresponding to the plurality of first word segments; wherein each index table includes information on the video resource data to which its first word segment belongs, and all word segments in the video resource data; all word segments in the video resource data are generated by crawling the video resource data, extracting feature text information of the video resource data, and performing word segmentation on the feature text information;
extract the second word segments that appear together with the plurality of first word segments as candidate word segments; wherein a second word segment is any word segment in the video resource data other than the first word segments;
calculate, in each index table, the co-occurrence rate between the first word segment and the candidate word segment, the co-occurrence rate being the ratio of the number of occurrences of the candidate word segment in the index table to the total number of pieces of video resource data information in the index table;
configure a corresponding weight for the co-occurrence rate between each of the plurality of first word segments and the candidate word segment;
calculate the average of the weighted co-occurrence rates as the co-occurrence rate between the plurality of first word segments and the candidate word segment;
extract each candidate word segment whose co-occurrence rate is higher than the preset threshold as an associated second word segment.
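The weighted variant for several first word segments can be sketched as below. Two readings are assumed and should be treated as such: a candidate is a second word segment that appears in every index table, and the combined rate is the plain mean of weight times rate (normalizing by the sum of the weights would be another reading). The function name, weights, and data are hypothetical.

```python
def weighted_cooccurrence(index_tables, weights, threshold):
    """Return {candidate: combined_rate} for candidates above the threshold.

    index_tables maps each first segment to a list of word-segment sets,
    one set per video resource; weights maps each first segment to a weight.
    """
    if not index_tables:
        return {}
    first_segments = set(index_tables)
    per_table = {}
    for first, table in index_tables.items():
        counts = {}
        for segments in table:
            # First segments themselves are never treated as second segments.
            for seg in set(segments) - first_segments:
                counts[seg] = counts.get(seg, 0) + 1
        per_table[first] = {seg: n / len(table) for seg, n in counts.items()}
    # Assumption: a candidate must co-occur with every first segment.
    candidates = set.intersection(*(set(r) for r in per_table.values()))
    result = {}
    for cand in candidates:
        weighted = [weights[first] * per_table[first][cand] for first in index_tables]
        combined = sum(weighted) / len(weighted)
        if combined > threshold:
            result[cand] = combined
    return result


tables = {
    "square-dance": [{"square-dance", "teaching"},
                     {"square-dance", "teaching", "music"},
                     {"square-dance", "teaching", "soldier-brother"},
                     {"square-dance", "music"}],
    "soldier-brother": [{"soldier-brother", "teaching", "square-dance"},
                        {"soldier-brother", "army"}],
}
weights = {"square-dance": 1.0, "soldier-brother": 1.0}
associated = weighted_cooccurrence(tables, weights, threshold=0.5)
```

Here "teaching" has rates of 0.75 and 0.5 in the two tables, so with equal weights its combined rate is 0.625, while "music" and "army" are dropped because each appears in only one table.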
In a preferred embodiment of the present invention, the second word segment finding module 403 may further be adapted to:
when the video search string is mapped to a plurality of first word segments, extract the plurality of preset index tables corresponding to the plurality of first word segments; wherein each index table includes information on the video resource data to which its first word segment belongs, and all word segments in the video resource data; all word segments in the video resource data are generated by crawling the video resource data, extracting feature text information of the video resource data, and performing word segmentation on the feature text information;
determine a main word segment by using the plurality of index tables, the main word segment being the first word segment whose index table contains the largest total number of pieces of video resource data information;
calculate the co-occurrence rate between the main word segment and each second word segment in the main word segment's index table, the co-occurrence rate being the ratio of the number of occurrences of the second word segment in that index table to the total number of pieces of video resource data information in that index table; wherein a second word segment is any word segment in the video resource data other than the first word segment;
extract each second word segment whose co-occurrence rate is higher than the preset threshold as an associated second word segment.
In a preferred embodiment of the present invention, the feature text information includes a video title, video keywords, and/or a video description.
In a preferred embodiment of the present invention, the combination pushing module 404 may further be adapted to:
push a combination of the main word segment and the associated second word segment.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the video-search-based word segment information pushing device according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program or a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, Fig. 5 shows a computing device, such as a user terminal device or an application server, that can implement the video-search-based word segment information pushing according to the present invention. The computing device conventionally includes a processor 510 and a computer program product or computer-readable medium in the form of a memory 520. The memory 520 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 520 has a storage space 530 for program code 531 for performing any of the method steps described above. For example, the storage space 530 for program code may include individual program codes 531, each implementing one of the steps of the above methods. The program codes can be read from, or written to, one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to Fig. 6. The storage unit may have storage segments, storage space, and the like arranged similarly to the memory 520 in the computing device of Fig. 5. The program code may, for example, be compressed in a suitable form. Typically, the storage unit includes computer-readable code 531', that is, code that can be read by a processor such as the processor 510 and that, when run by the computing device, causes the computing device to perform the steps of the methods described above.
References herein to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Note also that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the description provided herein. It is understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been selected mainly for readability and instructional purposes, and not to explain or limit the subject matter of the present invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the present invention, the disclosure made herein is illustrative and not restrictive, and the scope of the present invention is defined by the appended claims.

Claims (18)

  1. A video-search-based word segment information pushing method, comprising:
    receiving a video search string;
    mapping the video search string to one or more first word segments;
    finding an associated second word segment whose co-occurrence rate with the one or more first word segments is higher than a preset threshold, the co-occurrence rate being the probability that the current one or more first word segments and a second word segment appear together in the same video resource data;
    pushing a combination of the one or more first word segments and the one or more associated second word segments.
  2. The method according to claim 1, wherein the step of mapping the video search string to one or more first word segments comprises:
    extracting the single word segment to which the video search string is mapped;
    or,
    when the received video search string is a compound word, splitting the video search string into a plurality of search sub-words, and extracting the plurality of word segments to which the search sub-words are mapped.
  3. The method according to claim 1, wherein the step of finding an associated second word segment whose co-occurrence rate with the one or more first word segments is higher than a preset threshold comprises:
    when the video search string is mapped to one first word segment, extracting a preset index table corresponding to the first word segment; wherein the index table includes information on the video resource data to which the first word segment belongs, and all word segments in the video resource data; all word segments in the video resource data are generated by crawling the video resource data, extracting feature text information of the video resource data, and performing word segmentation on the feature text information;
    calculating the co-occurrence rate between the first word segment and each second word segment in the index table, the co-occurrence rate being the ratio of the number of occurrences of the second word segment in the index table to the total number of pieces of video resource data information in the index table; wherein a second word segment is any word segment in the video resource data other than the first word segment;
    extracting each second word segment whose co-occurrence rate is higher than the preset threshold as an associated second word segment.
  4. The method according to claim 1, wherein the step of finding an associated second word segment whose co-occurrence rate with the one or more first word segments is higher than a preset threshold comprises:
    when the video search string is mapped to a plurality of first word segments, extracting the plurality of preset index tables corresponding to the plurality of first word segments; wherein each index table includes information on the video resource data to which its first word segment belongs, and all word segments in the video resource data; all word segments in the video resource data are generated by crawling the video resource data, extracting feature text information of the video resource data, and performing word segmentation on the feature text information;
    extracting the second word segments that appear together with the plurality of first word segments as candidate word segments; wherein a second word segment is any word segment in the video resource data other than the first word segments;
    calculating, in each index table, the co-occurrence rate between the first word segment and the candidate word segment, the co-occurrence rate being the ratio of the number of occurrences of the candidate word segment in the index table to the total number of pieces of video resource data information in the index table;
    configuring a corresponding weight for the co-occurrence rate between each of the plurality of first word segments and the candidate word segment;
    calculating the average of the weighted co-occurrence rates as the co-occurrence rate between the plurality of first word segments and the candidate word segment;
    extracting each candidate word segment whose co-occurrence rate is higher than the preset threshold as an associated second word segment.
  5. The method according to claim 1, wherein the step of finding an associated second word segment whose co-occurrence rate with the one or more first word segments is higher than a preset threshold comprises:
    when the video search string is mapped to a plurality of first word segments, extracting the plurality of preset index tables corresponding to the plurality of first word segments; wherein each index table includes information on the video resource data to which its first word segment belongs, and all word segments in the video resource data; all word segments in the video resource data are generated by crawling the video resource data, extracting feature text information of the video resource data, and performing word segmentation on the feature text information;
    determining a main word segment by using the plurality of index tables, the main word segment being the first word segment whose index table contains the largest total number of pieces of video resource data information;
    calculating the co-occurrence rate between the main word segment and each second word segment in the main word segment's index table, the co-occurrence rate being the ratio of the number of occurrences of the second word segment in that index table to the total number of pieces of video resource data information in that index table; wherein a second word segment is any word segment in the video resource data other than the first word segment;
    extracting each second word segment whose co-occurrence rate is higher than the preset threshold as an associated second word segment.
  6. The method according to claim 3, 4, or 5, wherein the feature text information includes a video title, video keywords, and/or a video description.
  7. The method according to claim 5, wherein the step of pushing a combination of the one or more first word segments and the one or more associated second word segments comprises:
    pushing a combination of the main word segment and the associated second word segment.
  8. A video-search-based method for pushing an entry object for online playback, comprising:
    receiving a video search string;
    mapping the video search string to one or more first word segments;
    finding an associated second word segment whose co-occurrence rate with the one or more first word segments is higher than a preset threshold, the co-occurrence rate being the probability that the current one or more first word segments and a second word segment appear together in the same video resource data;
    obtaining the network addresses of one or more video data resources that match the one or more first word segments and the associated second word segment;
    constructing, according to the network addresses of the one or more video data resources, entry objects for playing the video data resources online;
    pushing the one or more entry objects for playing the video data resources online.
  9. A video-search-based method for pushing an associated resource address, comprising:
    when a load or play request for first video resource data is received, obtaining feature text information of the first video resource data;
    mapping the feature text information to one or more first word segments;
    finding an associated second word segment whose co-occurrence rate with the one or more first word segments is higher than a preset threshold, the co-occurrence rate being the probability that the current one or more first word segments and a second word segment appear together in the same video resource data;
    obtaining the network link address of second video resource data that matches the one or more first word segments and the associated second word segment;
    pushing the network link address of the second video resource data.
  10. A video-search-based word segment information pushing apparatus, comprising:
    a video search string receiving module, adapted to receive a video search string;
    a first word segment mapping module, adapted to map the video search string to one or more first word segments;
    a second word segment finding module, adapted to find an associated second word segment whose co-occurrence rate with the one or more first word segments is higher than a preset threshold, the co-occurrence rate being the probability that the current one or more word segments and a second word segment appear together in the same video resource data;
    a combination pushing module, adapted to push a combination of the one or more first word segments and the one or more associated second word segments.
  11. The apparatus according to claim 10, wherein the first word segment mapping module is further adapted to:
    extract the single word segment to which the video search string is mapped;
    or,
    when the received video search string is a compound word, split the video search string into a plurality of search sub-words, and extract the plurality of word segments to which the search sub-words are mapped.
  12. The apparatus according to claim 10, wherein the second word segment finding module is further adapted to:
    when the video search string is mapped to one first word segment, extract a preset index table corresponding to the first word segment; wherein the index table includes information on the video resource data to which the first word segment belongs, and all word segments in the video resource data; all word segments in the video resource data are generated by crawling the video resource data, extracting feature text information of the video resource data, and performing word segmentation on the feature text information;
    calculate the co-occurrence rate between the first word segment and each second word segment in the index table, the co-occurrence rate being the ratio of the number of occurrences of the second word segment in the index table to the total number of pieces of video resource data information in the index table; wherein a second word segment is any word segment in the video resource data other than the first word segment;
    extract each second word segment whose co-occurrence rate is higher than the preset threshold as an associated second word segment.
  13. The apparatus according to claim 10, wherein the second word segment finding module is further adapted to:
    when the video search string is mapped to a plurality of first word segments, extract the plurality of preset index tables corresponding to the plurality of first word segments; wherein each index table includes information on the video resource data to which its first word segment belongs, and all word segments in the video resource data; all word segments in the video resource data are generated by crawling the video resource data, extracting feature text information of the video resource data, and performing word segmentation on the feature text information;
    extract the second word segments that appear together with the plurality of first word segments as candidate word segments; wherein a second word segment is any word segment in the video resource data other than the first word segments;
    calculate, in each index table, the co-occurrence rate between the first word segment and the candidate word segment, the co-occurrence rate being the ratio of the number of occurrences of the candidate word segment in the index table to the total number of pieces of video resource data information in the index table;
    configure a corresponding weight for the co-occurrence rate between each of the plurality of first word segments and the candidate word segment;
    calculate the average of the weighted co-occurrence rates as the co-occurrence rate between the plurality of first word segments and the candidate word segment;
    extract each candidate word segment whose co-occurrence rate is higher than the preset threshold as an associated second word segment.
  14. The device of claim 10, wherein the second word segment lookup module is further adapted to:
    when the video search string is mapped to a plurality of first word segments, extract the plurality of preset index tables respectively corresponding to the plurality of first word segments; wherein each index table includes information on the video resource data to which the respective first word segment belongs and all word segments in that video resource data; and all word segments in the video resource data are generated by crawling the video resource data, extracting feature text information of the video resource data, and segmenting the feature text information;
    determine a main word segment using the plurality of index tables, the main word segment being the first word segment whose index table contains the largest total number of pieces of video resource data information;
    calculate the co-occurrence rate between the main word segment and each second word segment in its corresponding index table, the co-occurrence rate being the ratio of the number of occurrences of the respective second word segment in the index table to the total number of pieces of video resource data information in the index table; wherein the second word segments are the word segments, among all word segments in the video resource data, other than the first word segments;
    extract each second word segment whose co-occurrence rate is higher than a preset threshold as an associated second word segment.
  15. The device of claim 12, 13 or 14, wherein the feature text information comprises a video title, video keywords and/or a video description.
  16. The device of claim 14, wherein the combined push module is further adapted to:
    push a combination of the main word segment and the associated second word segment.
  17. A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to perform the video-search-based word segment information push method according to any one of claims 1-7.
  18. A computer-readable medium storing the computer program of claim 17.
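
The co-occurrence computations recited in claims 12 to 14 can be illustrated with a minimal Python sketch. It is not taken from the application. It assumes that each preset index table is simply a list of word-segment sets, one set per video containing the first word segment, and that the "number of occurrences" of a second word segment is the number of videos in the table that contain it. The function names, the `index_tables` dictionary, and the reading of the "average of the weighted co-occurrence rates" as the sum of weight times rate divided by the number of first word segments are all hypothetical choices made for illustration.

```python
from collections import Counter


def cooccurrence_rates(first, index_table):
    """Rate of each second word segment in the index table of `first`.

    index_table: list of sets of word segments, one set per video that
    contains `first` (an assumed layout, not specified by the claims).
    """
    total = len(index_table)
    if total == 0:
        return {}
    counts = Counter(w for video in index_table for w in set(video) if w != first)
    return {w: c / total for w, c in counts.items()}


def related_single(first, index_tables, threshold):
    """Claim 12 style: one first word segment, keep rates above the threshold."""
    rates = cooccurrence_rates(first, index_tables[first])
    return {w: r for w, r in rates.items() if r > threshold}


def related_weighted(firsts, index_tables, weights, threshold):
    """Claim 13 style: candidates co-occur with every first word segment;
    their per-table rates are weighted and averaged (assumed formula)."""
    per_term = {f: cooccurrence_rates(f, index_tables[f]) for f in firsts}
    candidates = set.intersection(*(set(r) for r in per_term.values())) - set(firsts)
    result = {}
    for w in candidates:
        avg = sum(weights[f] * per_term[f][w] for f in firsts) / len(firsts)
        if avg > threshold:
            result[w] = avg
    return result


def related_main(firsts, index_tables, threshold):
    """Claim 14 style: the main word segment is the first word segment whose
    index table lists the most videos; only its table is then used."""
    main = max(firsts, key=lambda f: len(index_tables[f]))
    return main, related_single(main, index_tables, threshold)
```

Under these assumptions, the single-term and main-word-segment variants scan only one index table, while the weighted variant needs every first word segment's table and restricts the result to word segments present in all of them.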
PCT/CN2014/086519 2013-09-30 2014-09-15 Participle information push method and device based on video search WO2015043389A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN201310462461.6A CN103491205B (en) 2013-09-30 2013-09-30 Method and device for pushing related resource addresses based on video search
CN201310462214.6 2013-09-30
CN201310462768.6 2013-09-30
CN201310462461.6 2013-09-30
CN201310462214.6A CN103500214B (en) 2013-09-30 2013-09-30 Word segmentation information pushing method and device based on video searching
CN201310462768.6A CN103488787B (en) 2013-09-30 2013-09-30 Method and device for pushing online playback entry objects based on video search

Publications (1)

Publication Number Publication Date
WO2015043389A1 (en) 2015-04-02

Family

ID=52742025

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/086519 WO2015043389A1 (en) 2013-09-30 2014-09-15 Participle information push method and device based on video search

Country Status (1)

Country Link
WO (1) WO2015043389A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040064447A1 (en) * 2002-09-27 2004-04-01 Simske Steven J. System and method for management of synonymic searching
CN101236567A (en) * 2008-02-04 2008-08-06 上海升岳电子科技有限公司 Method and terminal apparatus for accomplishing on-line network multimedia application
CN101957828A (en) * 2009-07-20 2011-01-26 阿里巴巴集团控股有限公司 Method and device for sequencing search results
CN102326144A (en) * 2008-12-12 2012-01-18 阿迪吉欧有限责任公司 Providing recommendations using information determined for domains of interest
CN103164405A (en) * 2011-12-08 2013-06-19 盛乐信息技术(上海)有限公司 Generation method for relevant video data bank, recommendation method and recommendation system for relevant videos
CN103488787A (en) * 2013-09-30 2014-01-01 北京奇虎科技有限公司 Method and device for pushing online playing entry objects based on video retrieval
CN103491205A (en) * 2013-09-30 2014-01-01 北京奇虎科技有限公司 Related resource address push method and device based on video retrieval
CN103500214A (en) * 2013-09-30 2014-01-08 北京奇虎科技有限公司 Word segmentation information pushing method and device based on video searching

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310018A (en) * 2018-12-11 2020-06-19 阿里巴巴集团控股有限公司 Determining method of timeliness search vocabulary and search engine
CN111310018B (en) * 2018-12-11 2024-03-01 阿里巴巴集团控股有限公司 Method for determining timeliness search vocabulary and search engine
CN113496411A (en) * 2020-03-18 2021-10-12 北京沃东天骏信息技术有限公司 Page pushing method, device and system, storage medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 14847257; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 14847257; Country of ref document: EP; Kind code of ref document: A1)