CN102402593A - Multi-modal approach to search query input

Info

Publication number: CN102402593A
Application number: CN201110345050XA
Authority: CN
Inventors: 刘激杨; 孙剑; 沈向洋; 杨晓松; 郭昱廷; 张磊; 李鹢; 柯启发; 刘策
Original assignee: Microsoft Corp
Current assignee: Microsoft Corp
Priority date: 2010-11-05
Filing date: 2011-11-04
Publication date: 2012-04-04
Also published as: JP2013541793A; AU2011323602A1; MX2013005056A; US20120117051A1; WO2012061275A1; EP2635984A1; IL225831A0; RU2013119973A; TW201220099A; EP2635984A4; KR20130142121A; IN2013CN03029A

Abstract

Search queries containing multiple modes of query input are used to identify responsive results. The search queries can be composed of combinations of keyword or text input, image input, video input, audio input, or other modes of input. The multiple modes of query input can be present in an initial search request, or an initial request containing a single type of query input can be supplemented with a second type of input. In addition to providing responsive results, in some embodiments additional query refinements or suggestions can be made based on the content of the query or the initially responsive results.

Description

Multi-modal approach to search query input

Background

Various methods for searching and retrieving information, such as by a search engine over a wide area network, are known in the art. Such methods typically employ text-based searching. Text-based searching employs a search query that comprises one or more textual elements such as words or phrases. The textual elements are compared to an index or other data structure to identify documents, such as web pages, that include matching or semantically similar textual content, metadata, file names, or other textual representations.

The known methods of text-based searching work relatively well for text-based documents; however, they are difficult to apply to image files and data. In order to search image files via a text-based query, the image file must be associated with one or more textual elements such as a title, file name, or other metadata or tags. The search engines and algorithms employed for text-based searching cannot search image files based on the content of the image, and thus are limited to identifying search result images based only on the data associated with the images.

Methods for content-based searching of images have been developed that analyze the content of an image to identify visually similar images. However, such methods can be limited with respect to identifying text-based documents that are relevant to the input of the image search.

Summary of the invention

In various embodiments, methods are provided for using multiple modes of input as part of a search query. The methods allow for search queries composed of combinations of keyword or text input, image input, video input, audio input, or other modes of input. A search for responsive documents can then be performed based on features extracted from the various modes of query input. The multiple modes of query input can be present in an initial search request, or an initial request containing a single type of query input can be supplemented with a second type of input. In addition to providing responsive results, in some embodiments additional query refinements or suggestions can be made based on the content of the query or the initially responsive results.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used, in isolation, as an aid in determining the scope of the claimed subject matter.

Brief description of the drawings

The invention is described in detail below with reference to the attached drawing figures, wherein:

Fig. 1 is a block diagram of an example computing environment suitable for implementing embodiments of the invention.

Fig. 2 schematically shows a network environment suitable for performing embodiments of the invention.

Fig. 3 schematically shows an example of the components of a user interface according to an embodiment of the invention.

Fig. 4 shows the relationship between various components and processes involved in performing an embodiment of the invention.

Figs. 5-9 show an example of extracting image features from an image according to an embodiment of the invention.

Figs. 10-12 show examples of methods according to various embodiments of the invention.

Detailed description

In various embodiments, systems and methods are provided for integrating keyword or text-based search input with other modes of search input. Examples of other modes of search input can include image input, video input, and audio input. More generally, the systems and methods can allow a search to be performed based on multiple modes of input in a query. Embodiments of the resulting multi-modal search systems and methods can give a user greater flexibility in providing input to a search engine. Additionally, when a user initiates a search with one type of input, such as an image input, a second type of input (or multiple other types of input) can then be used to refine or otherwise modify the responsive search results. For example, a user can enter one or more keywords to associate with an image input. In many situations, the association of additional keywords with the image input can provide a clearer indication of user intent than either an image input or a keyword input alone.

In some embodiments, searching for responsive results based on multi-modal search input is performed by using an index that includes terms relating to more than one type of data, such as an index that includes text-based keywords, image-based "keywords", video-based "keywords", and audio-based "keywords". One option for incorporating "keywords" for input modes other than text-based search can be to correlate the multi-modal features with artificial keywords. These artificial keywords can be referred to as descriptor keywords. For example, image features used for image-based searching can be correlated with descriptor keywords, so that the image-based search features appear in the same inverted index as traditional text-based keywords. For example, an image of the "Space Needle" building in Seattle may contain a plurality of image features. These image features can be extracted from the image and then correlated with descriptor keywords for incorporation into an inverted index with other text-based keyword terms.
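By way of illustration only, the shared inverted index described above might be sketched as follows. The document identifiers, the `img_` prefix on descriptor keywords, and the count-based scoring are assumptions made for this sketch and are not taken from the specification.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term (text keyword or descriptor keyword) to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def search(index, query_terms):
    """Order documents by the number of query terms they contain, highest first."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical documents: text keywords and image descriptor keywords share one term space.
documents = {
    "page_1": ["space", "needle", "seattle"],
    "image_7": ["img_0412", "img_0087", "space", "needle"],
    "image_9": ["img_0412", "img_5531"],
}
index = build_inverted_index(documents)
results = search(index, ["needle", "img_0412"])
```

Because both kinds of term live in the same index, a single lookup routine serves a query that mixes a typed keyword with a descriptor keyword derived from an image.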

In addition to incorporating descriptor keywords into a text-based keyword index, the descriptor keywords from an image (or another type of non-text input) can also be associated with traditional keyword terms. In the example above, the term "space needle" can be correlated with one or more descriptor keywords from an image of the Space Needle. This can allow for suggested or revised queries that include the descriptor keywords, and that are therefore better suited to performing an image-based search for other images similar to the Space Needle image. Such suggested queries can be provided to the user to allow for improved searching for other images related to the Space Needle image, or the suggested queries can be used automatically to identify such related images.

In the discussion below, the following definitions are used to describe aspects of performing a multi-modal search. A feature refers to any type of information that can be used as part of selection and/or ranking of a document as being responsive to a search query. Features from a text-based query typically include keywords. Features from an image-based query can include portions of an image identified as being distinctive, such as portions of an image that have contrasting intensity or portions of an image that correspond to a person's face for facial recognition. Features from an audio-based query can include variations in the volume level of the audio or other detectable audio patterns. A keyword refers to a conventional text-based search term. A keyword can refer to one or more words that are used as a single term for identifying a document responsive to a query. A descriptor keyword refers to a keyword that has been associated with a non-text based feature. Thus, a descriptor keyword can be used to identify an image-based feature, a video-based feature, an audio-based feature, or other non-text features. A responsive result refers to any document that is identified as relevant to a search query based on selection and/or ranking performed by a search engine. When a responsive result is displayed, the responsive result can be displayed by displaying the document itself, or an identifier of the document can be displayed. For example, the conventional hyperlinks returned by a text-based search engine (also known as "blue links") represent identifiers for, or links to, other documents. By clicking on a link, the represented document can be accessed. Identifiers for a document may or may not provide further information about the corresponding document.

Receiving a multi-modal search query

Features from multiple search modes can be extracted from a query and used to identify results responsive to the query. In an embodiment, multiple modes of query input can be provided by any convenient method. For example, a user interface for receiving query input can include a dialog box for receiving keyword query input. The user interface can also include a location for receiving an image selected by the user, such as an image query box that allows a user to "drop" a desired input image into the user interface. Alternatively, the image query box can receive a file location or network address as the source of the image input. A similar box or location can be provided for identifying an audio file, a video file, or another type of non-text input for use as a query input.

The multiple modes of query input do not need to be received at the same time. Instead, one type of query input can be provided initially, and then a second mode of input can be provided to refine the query. For example, an image of a movie star can be submitted as a query input. This will return a series of matching results that likely include the image. The word "actor" can then be typed into a search query box as a keyword, in order to refine the search results based on the user's desire to know the name of the movie star.

After multi-modal search information is received, it can be used as a search query to identify responsive results. The responsive results can be any type of document determined to be relevant by a search engine, regardless of the input mode of the search query. Thus, image items can be identified as responsive documents to a text-based query, or text-based items can be responsive documents to an audio-based query. Additionally, a query that includes more than one mode of input can also be used to identify responsive results of any available type. The responsive results displayed to a user can be in the form of the documents themselves, or in the form of identifiers for the responsive documents.

One or more indexes can be used to facilitate identification of responsive results. In an embodiment, a single index, such as an inverted index, can be used to store keywords and descriptor keywords based on all types of search modes. Alternatively, a single ranking system can use multiple indexes to store terms or features. Regardless of the number or form of the indexes, the one or more indexes can be used as part of an integrated selection and/or ranking method for identifying documents that are responsive to a query. The selection method and/or ranking method can incorporate features based on any available mode of query input.

Text-based keywords that are associated with other types of input can also be extracted for use. One option for incorporating multiple modes of information can be to use text information associated with another mode of query input. An image, video, or audio file will often have metadata associated with the file. This can include the title of the file, a subject of the file, or other text associated with the file. The other text can include text that is part of a document, such as a web page, where the media file appears as a link, or other text describing the media file. The metadata associated with an image, video, or audio file can be used to supplement a query input in a variety of ways. The text metadata can be used to form additional query suggestions that are provided to a user. The text can also be used automatically to supplement an existing search query, in order to modify the ranking of responsive results.

In addition to using metadata associated with an input query, the metadata associated with a responsive result can be used to modify a search query. For example, a search query based on an image may result in a known image of the Eiffel Tower as a responsive result. The metadata from the responsive result may indicate that the Eiffel Tower is the subject of the responsive image result. This metadata can be used to suggest additional queries to a user, or to automatically supplement the search query.

There are multiple ways to extract metadata. The metadata extraction technique may be predetermined, or it may be selected dynamically, either by a person or by an automated process. Metadata extraction techniques can include, but are not limited to: (1) parsing the filename for embedded metadata; (2) extracting metadata from the near-duplicate digital object; (3) extracting the surrounding text in a web page where the near-duplicate digital object is hosted; (4) extracting annotations and commentary associated with the near-duplicate from a web site supporting annotations and commentary where the near-duplicate digital media object is stored; and (5) extracting query keywords that were associated with the near-duplicate when a user selected the near-duplicate after a text query. In other embodiments, metadata extraction techniques may involve other operations.
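As a non-limiting illustration of technique (1), parsing a filename for embedded metadata, the following sketch splits a file name into candidate keywords. The function name, the splitting rules, and the sample path are hypothetical and chosen only for this example.

```python
import os
import re

def keywords_from_filename(path):
    """Split a file name into lowercase candidate keywords, dropping the extension and pure numbers."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # Insert a space at camelCase boundaries, then split on any non-alphanumeric character.
    stem = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", stem)
    tokens = re.split(r"[^A-Za-z0-9]+", stem)
    return [t.lower() for t in tokens if t and not t.isdigit()]

filename_keywords = keywords_from_filename("photos/EiffelTower_paris-2009.jpg")
```

Purely numeric tokens are discarded here on the assumption that they are more often camera counters or dates than useful search terms; an embodiment could equally keep them.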

Some of the metadata extraction techniques start with a body of text and sift out the most concise metadata. Accordingly, techniques such as parsing against a grammar and other token-based analysis may be utilized. For example, the text surrounding an image may include a caption or a lengthy paragraph. At least in the latter case, the lengthy paragraph may be parsed to extract terms of interest. By way of another example, annotations and commentary data are notorious for containing text abbreviations (e.g. IMHO for "in my humble opinion") and emotive particles (e.g. smiley faces and repeated exclamation points). IMHO, despite its seeming emphasis in annotations and commentary, is likely to be a candidate for filtering out when searching for metadata.
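The token-based filtering described above might, for illustration, look like the following sketch. The abbreviation stop list and the emoticon pattern are small hypothetical examples, not an exhaustive or prescribed set.

```python
import re

ABBREVIATIONS = {"imho", "lol", "btw"}        # hypothetical stop list of chat abbreviations
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")   # matches simple emoticons such as :-) ;) =D

def clean_comment(text):
    """Return lowercase word tokens with emoticons, punctuation runs, and abbreviations removed."""
    text = EMOTICON.sub(" ", text)
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return [t.lower() for t in tokens if t.lower() not in ABBREVIATIONS]

cleaned = clean_comment("IMHO the Eiffel Tower is great!!! :-)")
```

The remaining tokens could then be passed to whatever grammar-based or statistical step selects the terms of interest.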

In the event that multiple metadata extraction techniques are chosen, a reconciliation method can provide a way to reconcile potentially conflicting candidate metadata results. Reconciliation may be performed, for example, using statistical analysis and machine learning, or alternatively via rules engines.
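One very simple statistical form of reconciliation is a vote across techniques, sketched below. The specification does not prescribe this rule; the `min_votes` threshold, the technique names, and the sample terms are assumptions made for the example.

```python
from collections import Counter

def reconcile(candidates_by_technique, min_votes=2):
    """Keep terms proposed by at least `min_votes` techniques, most widely supported first."""
    votes = Counter()
    for terms in candidates_by_technique.values():
        votes.update(set(terms))  # each technique contributes at most one vote per term
    kept = [t for t, n in votes.items() if n >= min_votes]
    return sorted(kept, key=lambda t: (-votes[t], t))

# Hypothetical candidate metadata produced by three extraction techniques.
candidates = {
    "filename": ["eiffel", "tower"],
    "surrounding_text": ["eiffel", "tower", "paris", "holiday"],
    "query_log": ["eiffel", "paris"],
}
agreed = reconcile(candidates)
```

A rules engine or a learned model could replace the vote without changing how the reconciled terms are consumed downstream.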

Fig. 3 provides an example of a user interface suitable for receiving multi-modal search input and displaying responsive results according to an embodiment of the invention. In Fig. 3, the user interface provides input locations for three types of query input. Input box 311 can receive keyword input, such as the text-based input typically used by a conventional search engine. Input box 313 can receive an image and/or video file as input. An image or video file that is pasted or otherwise "dropped" into input box 313 can be analyzed using image analysis techniques to identify features that can be extracted for use in the search. Similarly, input box 315 can receive an audio file as input.

Area 320 contains a listing of responsive results. In the embodiment shown in Fig. 3, responsive results 332 and 342 are currently shown. Responsive result 332 is an identifier, such as a thumbnail, for an image document identified as responsive to a search. In addition to image result 332, a link or icon 334 is also provided that allows for a revised search that incorporates image result 332 (or the descriptor keywords associated with image result 332) as part of a revised query. Responsive result 342 corresponds to an identifier for a text-based document.

Area 340 contains a listing of suggested queries 347 based on the initial query. The suggested queries 347 can be generated using conventional query suggestion algorithms. Suggested queries 347 can also be based on metadata associated with input submitted in image/video input box 313 or audio input box 315. Still other suggested queries 347 can be based on metadata associated with a responsive result, such as responsive result 332.

Fig. 4 schematically shows the interaction of various systems and/or processes for performing a multi-modal search according to an embodiment of the invention. In the embodiment shown in Fig. 4, the multi-modal search corresponds to a search based on both keyword query input and image query input. In Fig. 4, a search is initiated based on receipt of a query. The query includes query keywords 405 and query image 407. To process the query image 407, an image understanding component 412 can be used to identify features within the image. The features extracted from query image 407 by image understanding component 412 can be assigned descriptor keywords by image text feature and image visual feature component 422. An example of methods that can be used by image understanding component 412 is described below in conjunction with Figs. 5-9. Image understanding component 412 can also include other types of image understanding methods, such as facial recognition methods or methods for analyzing color similarity in an image. Metadata analysis component 414 can identify metadata associated with query image 407. This can include information embedded in the image file and/or stored with the file by an operating system, such as a title for the image or annotations stored within the file. This can also include other text associated with the image, such as text in a URL path that is entered to identify the image for use in the search, or text located near the image for an image that is located on or embedded in a web page or other text-based document. Image text feature and image visual feature component 422 can identify keyword features based on the output from metadata analysis component 414.

After query keywords 405 and any additional features have been identified in image text feature and image visual feature component 422, the resulting query can optionally be altered or expanded in component 432. The query alteration or expansion can be based on features derived from metadata in metadata analysis component 414 and image text feature and image visual feature component 422. Another source for query alteration or expansion can be feedback from UI interaction component 462. This can include additional query information provided by a user as well as query suggestions 442 based on the responsive results from the current or prior queries. The optionally expanded or altered query can then be used to generate responsive results 452. In Fig. 4, result generation 452 involves using the query to identify responsive documents in database 475, which includes both text features and image features for the documents in the database. Database 475 can represent an inverted index or any other convenient type of storage format for identifying responsive results based on a query.
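For illustration, the optional expansion step of component 432 could be as simple as merging the terms from each source into one ordered, duplicate-free query, as in the sketch below. The function signature and the sample terms are hypothetical; the specification does not fix how the sources are combined.

```python
def expand_query(keywords, descriptor_keywords, metadata_terms, user_terms=()):
    """Merge terms from every source into one query, preserving order and removing duplicates."""
    expanded = []
    seen = set()
    for term in [*keywords, *descriptor_keywords, *metadata_terms, *user_terms]:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            expanded.append(key)
    return expanded

# Typed keywords, descriptor keywords from the query image, and terms from its metadata.
expanded = expand_query(["Space", "Needle"], ["img_0412"], ["seattle", "needle"])
```

The `user_terms` argument stands in for feedback arriving later from the user interface, such as an accepted query suggestion.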

Depending on the embodiment, result generation 452 can provide one or more types of results. In some situations, an identification of a most likely match may be desirable, such as one or a few highly ranked responsive results. This can be provided as an answer 444. Alternatively, a listing of responsive results in ranked order may be desirable. This can be provided as combined ranked results 446. In addition to answers or ranked results, one or more query suggestions 442 can also be provided to a user. The interaction with a user, including display of results and receipt of queries, can be handled by UI interaction component 462.

Multimedia-based search methods

Figs. 5-9 schematically show the processing of an example image 500 according to an embodiment of the invention. In Fig. 5, the image 500 is processed using an operator algorithm to identify a plurality of interest points 502. The operator algorithm includes any available algorithm that can be used to identify interest points 502 in an image 500. In an embodiment, the operator algorithm can be a difference of Gaussians algorithm or a Laplacian algorithm, as are known in the art. In an embodiment, the operator algorithm is configured to analyze the image 500 in two dimensions. Optionally, when the image 500 is a color image, the image 500 can be converted to grayscale.

Interest points 502 can include any point in the image 500 as depicted in Fig. 5, as well as regions 602, areas, groups of pixels, or features in the image 500 as depicted in Fig. 6. For clarity and brevity, interest points 502 and regions 602 are hereinafter referred to as interest points 502; however, reference to interest points 502 is intended to include both interest points 502 and regions 602. In an embodiment, an interest point 502 is located in an area of the image 500 that is stable and that includes a distinct or identifiable feature of the image 500. For example, an interest point 502 is located in an area of an image having sharp features with high contrast between the features, such as depicted at 502a and 602a. Conversely, an interest point is not located in an area with no distinct features or contrast, such as a region of constant color or grayscale as indicated by 504.

The operator algorithm identifies any number of interest points 502 in the image 500, such as, for example, thousands of interest points. The interest points 502 may be a combination of points 502 and regions 602 in the image 500, and their number may be based on the size of the image 500. The image processing component 412 computes a metric for each interest point 502 and ranks the interest points 502 according to the metric. The metric may include a measure of the signal strength or the signal-to-noise ratio of the image 500 at the interest point 502. The image processing component 412 selects a subset of the interest points 502 for further processing based on the ranking. In an embodiment, the one hundred most salient interest points 502, those having the highest signal-to-noise ratio, are selected; however, any desired number of interest points 502 may be selected. In another embodiment, a subset is not selected and all of the interest points are included in further processing.
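The ranking and subset selection described above can be sketched as a top-N selection over a per-point metric. The dictionary layout and the `snr` field name are assumptions for this example; the default of 100 follows the embodiment described in the text.

```python
import heapq

def select_interest_points(points, limit=100):
    """Return up to `limit` interest points with the highest metric, best first."""
    return heapq.nlargest(limit, points, key=lambda p: p["snr"])

# Hypothetical interest points with a precomputed signal-to-noise metric.
points = [
    {"x": 12, "y": 40, "snr": 3.1},
    {"x": 80, "y": 15, "snr": 9.4},
    {"x": 55, "y": 62, "snr": 6.7},
]
top_two = select_interest_points(points, limit=2)
```

When fewer points exist than the limit, all of them are returned, which also covers the embodiment in which no subset is selected.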

As depicted in Fig. 7, a set of patches 700 corresponding to the selected interest points 502 can be identified. Each patch 702 corresponds to a single selected interest point 502. A patch 702 includes an area of the image 500 that contains the respective interest point 502. The size of each patch 702 to be taken from the image 500 is determined for each selected interest point 502 based on an output from the operator algorithm. Each of the patches 702 may be of a different size, and the areas of the image 500 to be included in the patches 702 may overlap. Additionally, the shape of the patches 702 can be any desired shape, including a square, rectangle, triangle, circle, oval, or the like. In the illustrated embodiment, the patches 702 are square in shape.

The patches 702 can be normalized as depicted in Fig. 7. In an embodiment, the patches 702 are normalized so that each patch 702 conforms to an equal size, such as a square patch of X pixels by X pixels. Normalizing the patches 702 to an equal size may include increasing or decreasing the size and/or resolution of a patch 702, among other operations. The patches 702 may also be normalized via one or more other operations, such as applying contrast enhancement, despeckling, sharpening, and applying a grayscale, among others.
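Two of the normalization operations mentioned above, resizing to a common size and contrast enhancement, are sketched below for a grayscale patch stored as a list of rows. Nearest-neighbor resampling and linear contrast stretching are stand-ins chosen for brevity; the specification does not name particular resampling or enhancement methods.

```python
def resize_patch(patch, size):
    """Resample a 2-D list of gray values to size x size using nearest-neighbor sampling."""
    rows, cols = len(patch), len(patch[0])
    return [
        [patch[r * rows // size][c * cols // size] for c in range(size)]
        for r in range(size)
    ]

def stretch_contrast(patch):
    """Linearly rescale gray values to span 0..255; a flat patch becomes all zeros."""
    flat = [v for row in patch for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0 for _ in row] for row in patch]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in patch]

patch = [[10, 20], [30, 40]]
normalized = stretch_contrast(resize_patch(patch, 4))
```

After this step every patch has the same dimensions and value range, so descriptors computed from different patches are directly comparable.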

A descriptor can also be determined for each normalized patch. A descriptor can be a description of a patch that can be incorporated as a feature for use in an image search. A descriptor can be determined by calculating statistics of the pixels in a patch 702. In an embodiment, a descriptor is determined based on the statistics of the grayscale gradients of the pixels in a patch 702. The descriptor can be visually represented as a histogram for each patch, such as a descriptor 802 depicted in Fig. 8 (where the patches 702 of Fig. 7 correspond to the similarly located descriptors 802 in Fig. 8). The descriptor can also be described as a multi-dimensional vector, such as, by way of example and not limitation, a multi-dimensional vector representing pixel grayscale statistics for the pixels in a patch. A T2S2 36-dimensional vector is an example of a vector that represents pixel grayscale statistics.
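As a rough illustration of a descriptor built from grayscale gradient statistics, the sketch below computes a magnitude-weighted histogram of gradient orientations. This is a generic gradient histogram chosen for simplicity, not the T2S2 descriptor mentioned above, and the eight-bin layout is an assumption.

```python
import math

def gradient_histogram(patch, bins=8):
    """Histogram of gradient orientations, weighted by magnitude and normalized to sum to 1."""
    hist = [0.0] * bins
    rows, cols = len(patch), len(patch[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate the horizontal and vertical gradients.
            dx = patch[r][c + 1] - patch[r][c - 1]
            dy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0:
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * bins) % bins] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# A patch whose brightness increases left to right has all gradient energy in one direction.
ramp = [[0, 1, 2, 3] for _ in range(4)]
descriptor = gradient_histogram(ramp)
```

The resulting list is a multi-dimensional vector of the kind that can be compared against representative descriptors in a later quantization step.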

As depicted in Fig. 9, a quantization table 900 can be used to correlate a descriptor keyword 902 with each descriptor 802. The quantization table 900 can include any table, index, chart, or other data structure that can be used to map the descriptors 802 to the descriptor keywords 902. Various forms of quantization tables 900 are known in the art and are usable in embodiments of the invention. In an embodiment, the quantization table 900 is generated by first processing a large quantity of images (e.g. image 500), for example a million images, to identify descriptors 802 for each image. The descriptors 802 identified therefrom are then statistically analyzed to identify clusters or groups of descriptors 802 having similar, or statistically similar, values. For example, the values of the variables in the T2S2 vectors are similar. A representative descriptor 904 of each cluster is selected and assigned a location in the quantization table 900 as well as a corresponding descriptor keyword 902. The descriptor keywords 902 can include any desired indicator that identifies a corresponding representative descriptor 904. For example, the descriptor keywords 902 can include integer values as depicted in Fig. 9, or alphanumeric values, numeric values, symbols, text, or a combination thereof. In some embodiments, the descriptor keywords 902 can include a sequence of characters that identifies the descriptor keyword as being associated with a non-text based search mode. For example, all descriptor keywords can include a series of three integers followed by an underscore character as the first four characters in the keyword. This initial sequence can then be used to identify the descriptor keyword as being associated with an image.

For each descriptor 802, the most closely matching representative descriptor 904 can be identified in the quantization table 900. For example, the descriptor 802a depicted in Fig. 8 most closely corresponds to the representative descriptor 904a of the quantization table 900 in Fig. 9. The descriptor keyword 902 for each descriptor 802 is thereby associated with the image 500 (e.g. the descriptor 802a corresponds to the descriptor keyword 902 "1"). The descriptor keywords 902 associated with the image 500 may each be different from one another, or one or more of the descriptor keywords 902 can be associated with the image 500 multiple times (e.g. the image 500 might have descriptor keywords 902 of "1, 2, 3, 4" or "1, 2, 2, 3"). In an embodiment, to account for characteristics such as image variability, a descriptor 802 can be mapped to more than one descriptor keyword 902 by identifying more than one representative descriptor 904 that most closely matches the descriptor 802, together with the corresponding descriptor keywords 902. Based on the above, the content of an image 500 having a set of identified interest points 502 can be represented by a set of descriptor keywords 902.
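The lookup in the quantization table can be sketched as a nearest-neighbor search over representative descriptors. The table contents, the `img_` keyword names, the three-dimensional vectors, and the use of Euclidean distance are all assumptions made for this example.

```python
import math

# Hypothetical quantization table: descriptor keyword -> representative descriptor.
QUANTIZATION_TABLE = {
    "img_1": [0.9, 0.1, 0.0],
    "img_2": [0.1, 0.8, 0.1],
    "img_3": [0.0, 0.2, 0.8],
}

def descriptor_keywords(descriptor, table, k=1):
    """Return the keywords of the k representative descriptors closest to `descriptor`."""
    ranked = sorted(table, key=lambda kw: math.dist(descriptor, table[kw]))
    return ranked[:k]

def image_keywords(descriptors, table, k=1):
    """Represent an image by the descriptor keywords of all of its patch descriptors."""
    keywords = []
    for d in descriptors:
        keywords.extend(descriptor_keywords(d, table, k))
    return keywords

patch_descriptors = [[0.8, 0.2, 0.0], [0.1, 0.7, 0.2], [0.2, 0.9, 0.0]]
keywords = image_keywords(patch_descriptors, QUANTIZATION_TABLE)
```

Setting `k` above 1 corresponds to the embodiment in which a descriptor is mapped to more than one descriptor keyword to allow for image variability. A linear scan is used here for clarity; a table built from a large image collection would normally need an indexed or tree-based search.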

In another embodiment, other types of image-based searching can be integrated into a search scheme. For example, facial recognition methods can provide another type of image search. In addition to and/or in place of identifying descriptor keywords as described above, facial recognition methods can be used to determine the identity of people in an image. The identity of a person in an image can be used to supplement a search query. Another option can be to have a library of people for matching with facial recognition technology. Metadata for the various people can be included in the library, and this stored metadata can be used to supplement a search query.

The above provides a description of adapting an image-based search scheme to a text-based search scheme. A similar adaptation can be made for other modes of search, such as an audio-based search scheme. In an embodiment, any convenient type of audio-based search can be used. The method for audio-based searching can have one or more types of features that are used to identify audio files having similar characteristics. As described above, the audio features can be correlated with descriptor keywords. The descriptor keywords can have a format that indicates the keyword is related to an audio search, such as having the last four characters of the keyword correspond to a hyphen followed by four numerals.

Examples of searches based on multi-modal queries

Search example 1: Adding image information to a text-based query. One difficulty with conventional search methods is identifying desired results for common query terms. One type of search that can involve common query terms is a search for a person with a common name, such as "Steve Smith". If the keyword query "steve smith" is submitted to a search engine, a large number of results may be identified as responsive, and these results may correspond to a large number of different people sharing the same or a similar name.

In an embodiment, a search for a named entity can be improved by submitting a picture of the entity as part of the search query. For example, in addition to entering "steve smith" in a keyword text box, an image or video of the specific Mr. Smith of interest can be dropped into a location for receiving image-based query information. Facial recognition software can then be used to match the correct "Steve Smith" with the search query. Additionally, if the image or video contains other people, results based on the additional people can be assigned a lower ranking, because the keyword query indicates the person of interest. As a result, the combination of keywords and an image or video can be used to efficiently identify results corresponding to a person (or other entity) with a common name.

As a variation on the above, consider a situation where a user has an image or video of a person but does not know the person's name. The person could be a politician, an actor or actress, a sports figure, or any other person or entity that can be identified by facial recognition or image matching technology. In this situation, the image or video containing the entity can be submitted as a multi-modal search query together with one or more keywords. Here, the one or more keywords can represent information the user has about the entity, such as "politician" or "actress". The additional keywords can assist the image search in various ways. One benefit of having both an image or video and keywords is that results of interest to the user can be given a higher ranking. Submitting the keyword "actress" with an image indicates a desire to know the name of the person in the image, and will cause a result giving the actress's name to be ranked higher than a result for a movie that lists the actress in its cast. Additionally, for facial recognition or other image analysis techniques where an exact match is not achieved, the keywords can help rank the potentially responsive search results. If the facial recognition method identifies both a state legislator and an author as potential matches, the keyword "politician" can be used to return information about the state legislator as the higher-ranked result.
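The keyword-assisted ranking described above can be illustrated with a minimal sketch. It assumes a hypothetical face matcher has already produced a similarity score for each candidate and that each indexed entity carries descriptive tags; the `Candidate` type, the `keyword_boost` weight, and the sample scores are illustrative assumptions, not part of the described embodiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    face_score: float   # similarity from a hypothetical face matcher, 0..1
    tags: frozenset     # descriptive labels attached to the indexed entity


def rank_candidates(candidates, keywords, keyword_boost=0.25):
    """Order candidates by face similarity plus a bonus per matching keyword."""
    wanted = {k.lower() for k in keywords}

    def score(candidate):
        return candidate.face_score + keyword_boost * len(wanted & candidate.tags)

    return sorted(candidates, key=score, reverse=True)


# Two inexact face matches; the keyword decides which one ranks first.
candidates = [
    Candidate("Author A", 0.81, frozenset({"author"})),
    Candidate("Legislator B", 0.78, frozenset({"politician"})),
]
ranked = rank_candidates(candidates, ["politician"])
print([c.name for c in ranked])
```

With no keywords the face score alone decides the order; adding "politician" moves the legislator ahead of the author even though the raw face match is slightly weaker.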

Search example 2: query refinement for multi-modal queries. In this example, a user desires to obtain more information about a product found in a store, such as a music CD or a movie DVD. As a precursor to the search process, the user can take a picture of the cover of a music CD of interest. The picture can then be submitted as a search query. Using image recognition and/or matching, the CD cover can be matched to a stored image of the CD cover that includes additional metadata. The metadata can optionally include the name of the artist, the title of the CD, the titles of the individual songs on the CD, or any other data about the CD.

The stored image of the CD cover can be returned as a responsive result, possibly as the highest-ranked result. Depending on the embodiment, potential query modifications can be suggested to the user on the initial results page, or the user can click a link to access the potential query modifications. The query modifications can include suggestions based on the metadata, such as the name of the artist, the title of the CD, or the title of one of the popular songs on the CD. These query modifications can be offered to the user as links. Alternatively, the user can be given the option of adding some or all of the query metadata to the keyword search box. The user can also supplement a suggested modification with additional search terms. For example, the user could select the artist's name and then add the word "concert" to the query box. The additional word "concert" can be associated with the image as part of the search query. This could, for example, produce responsive results indicating the artist's upcoming concert dates. Other options for query suggestions or modifications can include pricing information, news related to the artist, lyrics for songs on the CD, or other types of suggestions. Optionally, some query modifications can be submitted for search automatically, to generate responsive results for the modified query without further action from the user. For example, adding the keyword "price" to a query based on a CD cover could be an automatic query modification, so that prices from various online retailers are returned with the initial search results page.
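As a rough illustration of metadata-driven query suggestions, the sketch below builds suggestion strings from the metadata of a matched CD cover. The metadata field names (`artist`, `title`, `popular_songs`), the default extra terms, and the sample values are assumptions made for this example only.

```python
def suggest_queries(metadata, extra_terms=("price", "concert", "lyrics")):
    """Build query suggestions from the metadata of a matched stored image."""
    suggestions = []
    # Suggest the artist name and the CD title as-is.
    for field_name in ("artist", "title"):
        value = metadata.get(field_name)
        if value:
            suggestions.append(value)
    # Suggest each popular song listed in the metadata.
    suggestions.extend(metadata.get("popular_songs", []))
    # Combine the artist name with common follow-up terms.
    artist = metadata.get("artist")
    if artist:
        suggestions.extend(f"{artist} {term}" for term in extra_terms)
    return suggestions


cd_metadata = {
    "artist": "Example Artist",
    "title": "Example Album",
    "popular_songs": ["Example Song"],
}
for suggestion in suggest_queries(cd_metadata):
    print(suggestion)
```

In a real results page each suggestion would typically be rendered as a link, and a suggestion such as the "price" variant could be submitted automatically alongside the initial query.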

Note that in the above example, the query image was submitted first, and keywords were then associated with the query as a refinement. A similar refinement can be performed by starting with a text keyword search and then refining it based on an image, video, or audio file.

Search example 3: improved mobile search. In this example, the user may know in general what is desired but may be uncertain how to express the search query. This type of mobile search can be used to search for any type of location, person, object, or other entity. Adding one or more keywords allows the user to receive responsive results based on the user's intent, rather than results based on the best image match. The keywords can be added to the search text box, for example, before the image is submitted as a search query. The keywords can optionally supplement any keywords that can be derived from metadata associated with the image, video, or audio file. For example, a user could take a picture of a restaurant and submit the picture together with the keyword "menu" as a search query. This would raise the ranking of results involving the restaurant's menu. Alternatively, the user could take a video of a type of cat and submit the search query with the word "species". In contrast to returning image or video results of other animals performing similar actions, this would increase the relevance of results identifying the type of cat. Still another option is to submit an image of a movie poster together with the keywords "movie soundtrack", in order to identify the songs played in the movie.

As another example, a user traveling in a city may want information about the schedule of the local public transit system. Unfortunately, the user does not know the name of the system. The user begins by entering <city name> and "public transit" as a keyword query. This returns a large number of results, and the user is not confident about which result will be most helpful. The user then notices a logo for the transit system at a nearby bus stop. The user takes a picture of the logo and uses it as part of the query to refine the search. The bus system associated with the logo is then returned as the highest-ranked result, giving the user confidence that the correct transit schedule has been identified.

Search example 4: multi-modal search involving audio files. In addition to video or images, other types of input modes can be used for searching. Audio files represent another example of a suitable query input. As described above for images or video, an audio file can be submitted as a search query in conjunction with keywords. Alternatively, the audio file can be submitted before or after another type of query input, as part of a query refinement. Note that in some embodiments, a multi-modal search query can include multiple types of query input without the user providing any keyword input. Thus, a user could provide an image and a video, or a video and an audio file. Still another option is to include multiple images, videos, and/or audio files, together with keywords, as the query input.

Having briefly described an overview of various embodiments of the invention, an exemplary operating environment suitable for implementing the invention is now described. Referring to the drawings in general, and initially to Fig. 1 in particular, an exemplary operating environment for implementing embodiments of the invention is shown and designated generally as computing device 100. Computing device 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Nor should computing device 100 be interpreted as having any dependency or requirement relating to any one or combination of the components illustrated.

Embodiments of the invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal digital assistant or other handheld device. Generally, program modules, including routines, programs, objects, components, data structures, and the like, refer to code that performs particular tasks or implements particular abstract data types. The invention may be practiced in a variety of system configurations, including handheld devices, consumer electronics, general-purpose computers, more specialized computing devices, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.

With continued reference to Fig. 1, computing device 100 includes a bus 110 that directly or indirectly couples the following devices: memory 112, one or more processors 114, one or more presentation components 116, input/output (I/O) ports 118, I/O components 120, and an illustrative power supply 122. Bus 110 represents what may be one or more buses (such as an address bus, a data bus, or a combination thereof). Although the various blocks of Fig. 1 are shown with lines for the sake of clarity, in reality delineating the various components is not so clear, and metaphorically the lines would more accurately be gray and fuzzy. For example, one may consider a presentation component such as a display device to be an I/O component. Additionally, many processors have memory. The inventors recognize that such is the nature of the art, and reiterate that the diagram of Fig. 1 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments of the invention. No distinction is made between categories such as "workstation", "server", "laptop", "handheld device", etc., as all are contemplated within the scope of Fig. 1 and are referred to as "computing device".

Computing device 100 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computing device 100, and include both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other holographic memory, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, carrier waves, or any other medium that can be used to encode desired information and that can be accessed by computing device 100. In one embodiment, the computer storage media can be selected from tangible computer storage media. In another embodiment, the computer storage media can be selected from non-transitory computer storage media.

Memory 112 includes computer storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard disk drives, optical disc drives, and the like. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O components 120. Presentation components 116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, and the like.

I/O ports 118 allow computing device 100 to be logically coupled to other devices, including I/O components 120, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, and the like.

With additional reference to Fig. 2, a block diagram depicting an exemplary network environment 200 suitable for use in embodiments of the invention is described. Environment 200 is only one example of an environment that can be used in embodiments of the invention, and may include any number of components in a wide variety of configurations. The description of environment 200 provided here is for illustrative purposes and is not intended to limit the configurations of environments in which embodiments of the invention can be implemented.

Environment 200 includes a network 202, a query input device 204, and a search engine server 206. Network 202 includes any computer network, such as, but not limited to, the Internet, an intranet, private and public local networks, and wireless data or telephone networks. Query input device 204 is any computing device, such as computing device 100, from which a search query can be provided. For example, query input device 204 might be a personal computer, a laptop, a server computer, a wireless phone or device, a personal digital assistant (PDA), or a digital camera, among others. In one embodiment, a plurality of query input devices 204, such as thousands or millions of query input devices 204, are connected to network 202.

Search engine server 206 includes any computing device, such as computing device 100, and provides at least a portion of the functionality for providing a content-based search engine. In one embodiment, a group of search engine servers 206 share or distribute the functionality required to provide search engine operations to a user population.

An image processing server 208 is also provided in environment 200. Image processing server 208 includes any computing device, such as computing device 100, and is configured to analyze, represent, and index the content of images, as described more fully below. Image processing server 208 includes a quantization table 210 that is stored in a memory of image processing server 208 or is remotely accessible by image processing server 208. Quantization table 210 is used by image processing server 208 to inform a mapping of image content, allowing image features to be searched and indexed.

Search engine server 206 and image processing server 208 are communicatively coupled to an image store 212 and an index 214. Image store 212 and index 214 include any available computer storage device, or a plurality thereof, such as a hard disk drive, flash memory, optical memory devices, and the like. Image store 212 provides data storage for image files that may be provided in response to a content-based search of an embodiment of the invention. Index 214 provides a search index for content-based searching of documents available via network 202, including the images stored in image store 212. Index 214 may utilize any indexing data structure or format, and preferably employs an inverted index format. Note that in some embodiments image store 212 can be optional.

An inverted index provides a mapping describing the locations of content in a data structure. For example, when searching documents for a particular keyword (including a descriptor keyword), the keyword is found in an inverted index that identifies the positions of the word in the documents and/or the presence of a feature in image documents, rather than searching the documents themselves to find the positions of the word or feature.
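A minimal sketch of such an inverted index follows. It assumes that image features have already been quantized into descriptor keywords; the `feat-0042` style tokens and the document ids are made up for this example and do not reflect a format specified by the description.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term (text keyword or descriptor keyword) to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def lookup(index, terms):
    """Return the ids of documents that contain every one of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Text keywords and (hypothetical) image descriptor keywords share one index.
documents = {
    "page1": ["steve", "smith", "biography"],
    "photo7": ["steve", "smith", "feat-0042"],
    "photo9": ["feat-0042", "feat-0107"],
}
index = build_inverted_index(documents)
print(lookup(index, ["smith", "feat-0042"]))
```

Because both kinds of term live in the same structure, a query that mixes a text keyword with an image-derived descriptor is answered by index lookups alone, without scanning any document.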

In one embodiment, one or more of search engine server 206, image processing server 208, image store 212, and index 214 are integrated in a single computing device, or are directly communicatively coupled so as to allow direct communication between the devices without traversing network 202.

Fig. 10 depicts a method according to an embodiment of the invention, or alternatively executable instructions for a method embodied on a computer storage medium according to an embodiment of the invention. In Fig. 10, an image, video, or audio file that includes a plurality of relevance features that can be extracted is acquired 1010. The image, video, or audio file is associated 1020 with at least one keyword. The image, video, or audio file and the associated keyword are submitted 1030 to a search engine as a query. At least one responsive result that is responsive to both the plurality of relevance features and the associated keyword is received 1040. The at least one responsive result is then displayed 1050.
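The client-side flow of Fig. 10 might be sketched as follows. The `MultiModalQuery` type, the `submit_multimodal_query` function, and the stand-in `fake_search_engine` are hypothetical names introduced for illustration; a real search engine interface is not specified in the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MultiModalQuery:
    media: bytes
    media_type: str                      # "image", "video", or "audio"
    keywords: List[str] = field(default_factory=list)


def submit_multimodal_query(media, media_type, keywords, search_engine):
    """Associate keywords with a media file, submit both, and display the results."""
    if media_type not in {"image", "video", "audio"}:
        raise ValueError(f"unsupported media type: {media_type}")
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Steps 1010 and 1020: the acquired file is associated with the keywords.
    query = MultiModalQuery(media, media_type, list(keywords))
    # Steps 1030 and 1040: submit the query and receive responsive results.
    results = search_engine(query)
    # Step 1050: display the results (here, simply print them).
    for rank, result in enumerate(results, start=1):
        print(f"{rank}. {result}")
    return results


def fake_search_engine(query):
    """Stand-in engine used only to make the sketch runnable."""
    joined = " ".join(query.keywords)
    return [f"{query.media_type} result for {joined}"]


submit_multimodal_query(b"\x89PNG", "image", ["menu"], fake_search_engine)
```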

Fig. 11 depicts another method according to an embodiment of the invention, or alternatively executable instructions for a method embodied on a computer storage medium according to an embodiment of the invention. In Fig. 11, a query including at least two query modes is received 1110. Relevance features corresponding to the at least two query modes are extracted 1120 from the query. A plurality of responsive results are selected 1130 based on the extracted relevance features. The plurality of responsive results are also ranked 1140 based on the extracted relevance features. One or more of the ranked responsive results are then displayed 1150.
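The steps of Fig. 11 can be approximated in a short sketch. It assumes image descriptors have already been extracted upstream into keyword-like tokens (`feat-0310` is a made-up token), and it ranks by simple feature overlap, which is a stand-in for whatever ranking function an actual embodiment would use.

```python
from collections import defaultdict

# Each document is described by text keywords and hypothetical descriptor keywords.
DOCS = {
    "doc_a": {"restaurant", "menu", "feat-0310"},
    "doc_b": {"restaurant", "reviews", "feat-0310"},
    "doc_c": {"menu", "recipes"},
}


def build_index(docs):
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search(query, docs, top_k=2):
    # Step 1120: extract relevance features from both query modes.
    features = {k.lower() for k in query.get("keywords", [])}
    features |= set(query.get("image_descriptors", []))
    # Step 1130: select every document that matches at least one feature.
    index = build_index(docs)
    selected = set()
    for feature in features:
        selected |= index.get(feature, set())
    # Step 1140: rank by number of matching features (ties broken by id).
    ranked = sorted(selected, key=lambda d: (-len(features & docs[d]), d))
    # Step 1150: return the top results for display.
    return ranked[:top_k]


results = search({"keywords": ["menu"], "image_descriptors": ["feat-0310"]}, DOCS)
print(results)
```

The document matching both the keyword and the image descriptor ranks ahead of documents matching only one mode, which mirrors the restaurant "menu" example given earlier.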

Fig. 12 depicts another method according to an embodiment of the invention, or alternatively executable instructions for a method embodied on a computer storage medium according to an embodiment of the invention. In Fig. 12, a query including at least one keyword is received 1210. A plurality of responsive results are displayed 1220 based on the received query. A supplemental query input including at least one of an image, a video, or an audio file is received 1230. The ranking of the plurality of responsive results is modified 1240 based on the supplemental query input. One or more of the responsive results are displayed 1250 based on the modified ranking.
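The re-ranking step of Fig. 12 could look roughly like the sketch below, which revisits the transit example: keyword results are first ordered by a text score, and a supplemental image then boosts results sharing its descriptors. The scores, the `feat-` tokens, and the additive `weight` are illustrative assumptions.

```python
def rerank(initial_results, supplemental_features, weight=1.0):
    """Re-score keyword results using features from a supplemental image, video, or audio input.

    initial_results: list of (doc_id, keyword_score, descriptor_set) tuples.
    Returns a list of (doc_id, revised_score), highest score first.
    """
    rescored = []
    for doc_id, score, descriptors in initial_results:
        overlap = len(supplemental_features & descriptors)
        rescored.append((doc_id, score + weight * overlap))
    rescored.sort(key=lambda item: (-item[1], item[0]))
    return rescored


# Steps 1210/1220: results for the keyword query, ordered by text score.
initial = [
    ("city_tourism", 0.9, set()),
    ("metro_rail", 0.8, {"feat-0555"}),
    ("bus_system", 0.7, {"feat-0201"}),
]
# Steps 1230/1240: a photo of the logo yields a (hypothetical) descriptor.
revised = rerank(initial, {"feat-0201"})
# Step 1250: display results in the revised order.
print([doc_id for doc_id, _ in revised])
```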

Additional embodiments

A first contemplated embodiment includes a method for performing a multi-modal search. The method includes receiving (1110) a query including at least two query modes; extracting (1120) relevance features from the query corresponding to the at least two query modes; selecting (1130) a plurality of responsive results based on the extracted relevance features; ranking (1140) the plurality of responsive results based on the extracted relevance features; and displaying (1150) one or more of the ranked responsive results.

A second embodiment includes the method of the first embodiment, wherein the query modes in the received query include two or more of a keyword, an image, a video, or an audio file.

A third embodiment includes any of the above embodiments, wherein the plurality of responsive documents are selected using an inverted index incorporating relevance features from the at least two query modes.

A fourth embodiment includes the third embodiment, wherein relevance features extracted from an image, video, or audio file are incorporated into the inverted index as descriptor keywords.

In a fifth embodiment, a method for performing a multi-modal search is provided. The method includes acquiring (1010) an image, a video, or an audio file that includes a plurality of relevance features that can be extracted; associating (1020) the image, video, or audio file with at least one keyword; submitting (1030) the image, video, or audio file and the associated keyword as a query to a search engine; receiving (1040) at least one responsive result that is responsive to both the plurality of relevance features and the associated keyword; and displaying (1050) the at least one responsive result.

A sixth embodiment includes any of the above embodiments, wherein the extracted relevance features correspond to a keyword and an image.

A seventh embodiment includes any of the above embodiments, further comprising: extracting metadata from an image, a video, or an audio file; identifying one or more keywords from the extracted metadata; and forming a second query including at least the relevance features extracted from the received query and the keywords identified from the extracted metadata.

An eighth embodiment includes the seventh embodiment, wherein ranking the plurality of responsive documents based on the extracted relevance features comprises ranking the plurality of responsive documents based on the second query.

A ninth embodiment includes the seventh or eighth embodiment, wherein the second query is displayed in association with the displayed responsive results.

A tenth embodiment includes any of the seventh through ninth embodiments, further comprising: automatically selecting a second plurality of responsive documents based on the second query; ranking the second plurality of responsive documents based on the second query; and displaying at least one document from the second plurality of responsive documents.

An eleventh embodiment includes any of the above embodiments, wherein an image or a video is acquired as an image or a video from a camera associated with an acquiring device.

A twelfth embodiment includes any of the above embodiments, wherein an image, a video, or an audio file is acquired by accessing a stored image, video, or audio file via a network.

A thirteenth embodiment includes any of the above embodiments, wherein the at least one responsive result comprises a text document, an image, a video, an audio file, an identity of a text document, an identity of an image, an identity of a video, an identity of an audio file, or a combination thereof.

A fourteenth embodiment includes any of the above embodiments, wherein the method further comprises displaying one or more query suggestions based on the submitted query and metadata corresponding to at least one responsive result.

In a fifteenth embodiment, a method for performing a multi-modal search is provided, including receiving (1210) a query including at least one keyword; displaying (1220) a plurality of responsive results based on the received query; receiving (1230) a supplemental query input including at least one of an image, a video, or an audio file; modifying (1240) a ranking of the plurality of responsive results based on the supplemental query input; and displaying (1250) one or more of the responsive results based on the modified ranking.

Embodiments of the invention have been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the invention pertains without departing from its scope.

From the foregoing, it will be seen that this invention is one well adapted to attain all the ends and objects set forth above, together with other advantages which are obvious and inherent to the structure.

It will be understood that certain features and subcombinations are of utility and may be employed without reference to other features and subcombinations. This is contemplated by and is within the scope of the claims.

Claims

1. A method for performing a multi-modal search, comprising:

receiving (1110) a query including at least two query modes;

extracting (1120) relevance features from the query corresponding to the at least two query modes;

selecting (1130) a plurality of responsive results based on the extracted relevance features;

ranking (1140) the plurality of responsive results based on the extracted relevance features; and

displaying (1150) one or more of the ranked responsive results.

2. the method for claim 1, the query pattern in the inquiry that is wherein received comprise in keyword, image, video or the audio file two or more a plurality of.

3. like each the described method in the above-mentioned claim, wherein use merging to select a plurality of response document from the index of the row of falling of the correlative character of said at least two kinds of query patterns.

4. method as claimed in claim 3, the correlative character that wherein from image, video or audio file, extracts are used as the descriptor keyword and merge to down in row's the index.

5. A method for performing a multi-modal search, comprising:

acquiring (1010) an image, a video, or an audio file that includes a plurality of relevance features that can be extracted;

associating (1020) the image, video, or audio file with at least one keyword;

submitting (1030) the image, video, or audio file and the associated keyword as a query to a search engine;

receiving (1040) at least one responsive result that is responsive to both the plurality of relevance features and the associated keyword; and

displaying (1050) the at least one responsive result.

6. The method of any of the above claims, wherein the extracted relevance features correspond to a keyword and an image.

7. The method of any of the above claims, further comprising:

extracting metadata from an image, a video, or an audio file;

identifying one or more keywords from the extracted metadata; and

forming a second query including at least the relevance features extracted from the received query and the keywords identified from the extracted metadata.

8. The method of claim 7, wherein ranking the plurality of responsive documents based on the extracted relevance features comprises ranking the plurality of responsive documents based on the second query.

9. The method of claim 7 or 8, wherein the second query is displayed in association with the displayed responsive results.

10. The method of any of claims 7-9, further comprising:

automatically selecting a second plurality of responsive documents based on the second query;

ranking the second plurality of responsive documents based on the second query; and

displaying at least one document from the second plurality of responsive documents.

11. The method of any of the above claims, wherein an image or a video is acquired as an image or a video from a camera associated with an acquiring device.

12. The method of any of the above claims, wherein an image, a video, or an audio file is acquired by accessing a stored image, video, or audio file via a network.

13. The method of any of the above claims, wherein the at least one responsive result comprises a text document, an image, a video, an audio file, an identity of a text document, an identity of an image, an identity of a video, an identity of an audio file, or a combination thereof.

14. The method of any of the above claims, wherein the method further comprises displaying one or more query suggestions based on the submitted query and metadata corresponding to at least one responsive result.

15. A method for performing a multi-modal search, comprising:

receiving (1210) a query including at least one keyword;

displaying (1220) a plurality of responsive results based on the received query;

receiving (1230) a supplemental query input including at least one of an image, a video, or an audio file;

modifying (1240) a ranking of the plurality of responsive results based on the supplemental query input; and

displaying (1250) one or more of the responsive results based on the modified ranking.

16. A computer-readable medium comprising executable instructions that, when executed on a computer, perform the method of any one of claims 1-15.