WO2008069791A1 - Method and apparatus for improving image retrieval and search using latent semantic indexing - Google Patents


Info

Publication number
WO2008069791A1
Authority
WO
WIPO (PCT)
Prior art keywords
term
document matrix
vectors
vector
document
Prior art date
Application number
PCT/US2006/046394
Other languages
French (fr)
Inventor
Jonathon S. Hare
Paul H. Lewis
Original Assignee
General Instrument Corporation
Priority date
Filing date
Publication date
Application filed by General Instrument Corporation filed Critical General Instrument Corporation
Priority to PCT/US2006/046394 priority Critical patent/WO2008069791A1/en
Publication of WO2008069791A1 publication Critical patent/WO2008069791A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 - Retrieval using metadata automatically derived from the content

Definitions

  • An aspect of the invention is implemented as a program product for execution by a processor. Program(s) of the program product define functions of embodiments and can be contained on a variety of signal-bearing media (computer-readable media), which include, but are not limited to: (i) information permanently stored on non-writable storage media (e.g., read-only memory devices within a computer, such as CD-ROM or DVD-ROM disks readable by a CD-ROM or DVD drive); (ii) alterable information stored on writable storage media (e.g., floppy disks within a diskette drive, a hard-disk drive, or read/writable CDs or DVDs); or (iii) information conveyed to a computer by a communications medium, such as through a computer or telephone network, including wireless communications. The latter embodiment specifically includes information downloaded from the Internet and other networks. Such signal-bearing media, when carrying computer-readable instructions that direct the functions of the invention, represent embodiments of the invention.

Abstract

A method of creating an image database and searching the database is disclosed. A term-document matrix that includes at least two different domain features is created. Latent semantic indexing is applied to the term-document matrix to decompose the term-document matrix. Then a plurality of new vectors are added to a decomposed document matrix of the term-document matrix using a fold-in technique to complete the creation of the searchable image database. Consequently, images are retrieved from the image database by providing a query vector. The query vector is compared against each one of a plurality of document vectors of the decomposed document matrix. Finally, a plurality of images that are similar to the query vector are returned.

Description

METHOD AND APPARATUS FOR IMPROVING IMAGE RETRIEVAL AND SEARCH USING LATENT SEMANTIC INDEXING
BACKGROUND OF THE INVENTION
1. Field of the Invention
[0001] The present invention relates generally to computer-based information retrieval and, in particular, to the retrieval of images stored in a computer database.
2. Description of the Background Art
[0002] Research into content-based image retrieval has been ongoing for many years, and many algorithms have been developed for finding images similar to a query image. However, these algorithms have not been widely deployed because searching for an image or images with an example query image is not a natural thing to do, and it also requires being able to find a suitable query image in the first place.
[0003] Image retrieval using descriptors based on the pixel content of salient regions has been shown to outperform existing methods for retrieval based on global descriptors and to avoid the segmentation problems found with region-based indexing and retrieval, while being robust to various image transforms. However, a common problem with current retrieval algorithms based on salient regions is the computational complexity caused by the high dimensionality of the problem. With a salient-region-based approach, the cost of comparison rises with the number of regions, as each region may have to be compared to every other region. This cost can be massive, with the number of regions per image feasibly reaching into the thousands.
[0004] Therefore, a need exists for a method of creating an image database and image retrieval that reduces the computational complexity of image retrieval with the robustness to utilize searches based on human language, visual language, or both.
SUMMARY OF THE INVENTION
[0005] An aspect of the invention relates to creating an image database. First a term-document matrix comprising at least two different domain features is created. Next, latent semantic indexing is applied to the term-document matrix. Finally, a plurality of new vectors are added to a decomposed document matrix of the term- document matrix using a fold-in technique.
[0006] Another aspect of the invention relates to image retrieval comprising creating a term-document matrix comprising at least two different domain features, applying latent semantic indexing to the term-document matrix, adding a plurality of new vectors to a decomposed document matrix of the term-document matrix using a fold-in technique, providing a query vector, comparing the query vector against each one of a plurality of document vectors of the decomposed document matrix and returning a plurality of images that are similar to the query vector.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
[0008] FIG. 1 illustrates a flowchart of a method for creating an image database and image retrieval from the image database;
[0009] FIG. 2 is a block diagram depicting an exemplary annotated image in accordance with the invention;
[0010] FIG. 3 illustrates an exemplary embodiment of a term-document matrix in accordance with the invention;
[0011] FIG. 4 illustrates an exemplary embodiment of the term-document matrix after latent semantic indexing is applied;
[0012] FIG. 5 illustrates an exemplary embodiment of the fold-in technique; and
[0013] FIG. 6 illustrates a high-level block diagram of a general-purpose computer suitable for use in performing the functions described herein.
[0014] To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
DETAILED DESCRIPTION OF THE INVENTION
[0015] FIG. 1 is a flow diagram depicting an exemplary embodiment of a method 100 for image retrieval from an image database in accordance with one or more aspects of the invention. FIG. 2 depicts an exemplary annotated image 200-1 of a small set of annotated images 200 in accordance with one or more aspects of the invention. The method 100 begins at step 110, where a small set of annotated images 200 is collected or generated from an entire image collection. Annotated image 200-1 may be described by at least two different domain features such as, for example, human language and visual language domains. Semantic annotations 210 are used to generate a human language vector 214. Human language vector 214 is a representation of the word occurrences in the semantic annotations 210 compared to the human language vocabulary 212.
[0016] For example, the first item of semantic annotation 210 represents the word 'sky' and 'sky' appears in the semantic annotations 210 only once. The closest term to 'sky' is located in human language vocabulary 212. The matching term found in human language vocabulary 212 is then assigned a value of '1' in the human language vector 214 for the first item. Any words that are in the semantic annotations 210, but are not found in the human language vocabulary 212, are represented as zeros in the human language vector 214. Although, in the exemplary embodiment, the human language vector 214 is shown with only five items, one skilled in the art will recognize that human language vector 214 may be any size suitable to capture all the words found in the semantic annotations 210.
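The counting scheme above can be sketched as follows; the five-word vocabulary and annotation list are hypothetical stand-ins for vocabulary 212 and annotations 210:

```python
import numpy as np

def human_language_vector(annotations, vocabulary):
    """Count how often each vocabulary word occurs in the annotations.

    Annotation words that are not in the vocabulary simply leave
    zeros behind, as described above.
    """
    index = {word: i for i, word in enumerate(vocabulary)}
    vec = np.zeros(len(vocabulary))
    for word in annotations:
        if word in index:
            vec[index[word]] += 1
    return vec

# Hypothetical five-term vocabulary and annotation set
vocabulary = ["sky", "tree", "water", "grass", "building"]
annotations = ["sky", "tree", "tree", "cloud"]  # "cloud" is out-of-vocabulary
print(human_language_vector(annotations, vocabulary))  # [1. 2. 0. 0. 0.]
```

In practice the vocabulary would hold thousands of terms, but the mechanics are the same.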
[0017] Annotated image 200-1 will also have visual language annotations 220. Visual language annotations 220 represent quantized local descriptors of salient regions of annotated image 200-1. Visual language annotations 220 may be represented by any descriptor capable of describing the image content as a set of discrete terms, such as, for example, RGB histograms, rg-chromaticity histograms or any other color-based histogram. The salient regions of annotated image 200-1 may be selected by any method known to those skilled in the art of image retrieval. The visual language annotations 220 are used to generate a visual language vector 224. Visual language vector 224, similar to human language vector 214, is a representation of the word occurrences in the visual language annotations 220 compared to the visual language vocabulary 222. Visual language vocabulary 222 may be represented, for example, by a matrix of vectors. Thus, visual language vector 224 is assigned by finding the closest match of each term of visual language annotations 220 with visual language vocabulary 222 in terms of a calculated distance, for example a Euclidean distance.
[0018] The method for generating both the human language vocabulary 212 and visual language vocabulary 222 may be any suitable method known in the art of image retrieval. In an exemplary embodiment, the human language vocabulary 212 and visual language vocabulary 222 were generated using a k-means clustering algorithm applied to a sample of local descriptors picked from a set of training images. Furthermore, in one embodiment, a human language vocabulary 212 and visual language vocabulary 222 are provided that include a multiplicity of terms (e.g., at least a few thousand terms). Finally, human language vectors 214 and visual language vectors 224 are generated for the remaining annotated images 200-2 through 200-n of the small set of annotated images 200.
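A minimal sketch of the quantization step, assuming the visual vocabulary 222 is a matrix whose rows are cluster centres (e.g., from k-means) and each local descriptor is assigned to its nearest row by Euclidean distance; all names and the toy 2-D descriptors are illustrative:

```python
import numpy as np

def visual_language_vector(descriptors, vocabulary):
    """Quantize each local descriptor to its nearest visual term
    (smallest Euclidean distance to a vocabulary row) and count occurrences."""
    vec = np.zeros(len(vocabulary))
    for d in descriptors:
        distances = np.linalg.norm(vocabulary - d, axis=1)  # distance to every visual term
        vec[np.argmin(distances)] += 1
    return vec

# Two toy 2-D "cluster centres" and three toy descriptors
vocabulary = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 1.0], [9.0, 9.0], [0.0, 2.0]])
print(visual_language_vector(descriptors, vocabulary))  # [2. 1.]
```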
[0019] Referring back to FIG. 1, in step 120, cross-language vectors are created by appending the human language vectors and visual language vectors. For example, as illustrated in FIG. 2, the human language vector 214 and visual language vector 224 of annotated image 200-1 are appended to generate a cross-language vector 230. Notably, cross-language vector 230 contains both human language and visual language elements. Similarly, cross-language vectors 230 are generated for the remaining annotated images 200-2 through 200-n of the small set of annotated images 200.
[0020] Once all of the cross-language vectors 230 are generated for each of the images 200-1 through 200-n within the small set of annotated images 200, the cross-language vectors 230 are combined into a term-document matrix. FIG. 3 depicts an exemplary embodiment of a term-document matrix 300 in accordance with one or more aspects of the invention. Each column 320-1 to 320-j represents one of the annotated images 200-1 through 200-n, also referred to as documents, of the small set of annotated images 200. Each row 310-1 to 310-i represents a term found in the documents. Within term-document matrix 300 are elements a_ij, each representing the frequency of term i in document j. A weighting is applied to each element a_ij in term-document matrix 300 because term-document matrix 300 is usually very sparse, since every word does not normally occur in every document. In one embodiment, the weighting is calculated such that:

a_ij = L(i, j) x G(i)   (1)

where L(i, j) represents the local weighting for term i in document j and G(i) is the global weighting for term i. In an exemplary embodiment, log-entropy weighting is used. Log-entropy weighting is defined as:
L(i, j) = log(tf_ij + 1)   (2)

G(i) = 1 + Σ_j [ (tf_ij / gf_i) log(tf_ij / gf_i) ] / log N   (3)

where tf_ij is the frequency of term i in document j, gf_i is the total number of times term i occurs in the entire collection, and N is the total number of documents in the collection. It should be noted that although a particular weighting process is described above, the present invention is not so limited. Namely, any weighting process, or no weighting process, can be used in accordance with the requirements of a particular implementation.
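As a sketch, log-entropy weighting can be applied to a raw term-document count matrix like so, assuming the standard form L(i, j) = log(tf_ij + 1) with a global weight built from the entropy of each term's distribution across documents:

```python
import numpy as np

def log_entropy_weight(tf):
    """Weight a term-document count matrix tf (terms x documents).

    Assumes the standard log-entropy form:
      L(i, j) = log(tf_ij + 1)
      G(i)    = 1 + sum_j (p_ij * log p_ij) / log N,  with p_ij = tf_ij / gf_i
    A term spread evenly over all documents gets G(i) near 0 (uninformative);
    a term concentrated in a single document gets G(i) = 1.
    """
    n_docs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)  # global frequency gf_i of each term
    p = np.divide(tf, gf, out=np.zeros_like(tf, dtype=float), where=gf > 0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)  # global weight G(i)
    return np.log(tf + 1.0) * g[:, None]          # a_ij = L(i, j) * G(i)

counts = np.array([[1.0, 1.0],   # term occurring evenly in both documents
                   [2.0, 0.0]])  # term concentrated in one document
print(log_entropy_weight(counts))  # row 0 is all zeros; row 1 is [log 3, 0]
```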
[0021] Once the term-document matrix 300 is generated, Latent Semantic Indexing (LSI) is applied in step 130 as depicted in FIG. 1. LSI, as known in the art of image retrieval, is a technique for information retrieval that is related to the vector-space model of information retrieval. In the vector-space model, documents are represented in a multidimensional space, such as, for example, term-document matrix 300. LSI takes the vector-space model one stage further by applying linear algebra to attempt to factor out noise and deal with issues of polysemy (words with multiple meanings) and synonymy (different words with the same meaning). LSI works by constructing the term-document matrix 300 and factoring it using Singular Value Decomposition (SVD). From the factored data, a rank-k estimate of the original term-document matrix 300 can be reconstructed that removes much of the noise and reduces the dimensionality, thereby reducing the computational complexity of performing image search and retrieval.
[0022] FIG. 4 depicts an exemplary embodiment of a result of applying LSI to term-document matrix 300 using SVD in accordance with one or more aspects of the invention. Term-document matrix 300 is decomposed into a product of three separate matrices of vectors. Matrix U 402 represents an i x m matrix of term vectors, where i represents the terms of term-document matrix 300 and m ≤ min(i, j). Matrix Σ 404 represents an m x m diagonal matrix of singular values. Matrix V 406 represents an m x j matrix of document vectors, where j represents the documents of term-document matrix 300.
[0023] The dimensionality of the term-document matrix 300 can be further reduced by using a rank-k approximation of term-document matrix 300, selecting the k largest singular values 412 within matrix Σ 404. The remaining values within matrix Σ 404 are set to zero. The rows and columns within matrix U 402 and matrix V 406 corresponding to these zeros are deleted, thereby creating an i x k matrix U 402 represented by shaded region 410 and a k x j matrix V 406 represented by shaded region 414. Consequently, the dimensions of term-document matrix 300 are reduced as represented by reduced term-document matrix 400.
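The decomposition and rank-k truncation can be sketched with NumPy (names are illustrative; `np.linalg.svd` returns singular values in descending order, so keeping the k largest is a slice):

```python
import numpy as np

def lsi_decompose(A, k):
    """Factor the term-document matrix A with SVD and keep only the k
    largest singular values: U_k (terms x k), s_k (k,), Vt_k (k x docs)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending
    return U[:, :k], s[:k], Vt[:k, :]

# Toy 4-term x 3-document matrix
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
U_k, s_k, Vt_k = lsi_decompose(A, 2)
A_k = U_k @ np.diag(s_k) @ Vt_k  # rank-2 estimate of A
```

This toy matrix already has rank 2, so the rank-2 estimate reproduces it exactly; for a real, noisy term-document matrix the truncation discards the smallest singular values and much of the noise with them.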
[0024] Referring back to FIG. 1, once LSI is applied and a reduced term-document matrix 400 is generated, remaining images from the entire collection of images may be added to the small set of annotated images 200 collected in step 110 by using a "fold-in" technique in step 140. The remaining images need only be annotated by visual language vectors. Consequently, the remaining images may be added without the need to semantically annotate each and every remaining image within the entire image collection. Notably, only a small set of the entire collection of images needs to be semantically annotated.
[0025] FIG. 5 depicts an exemplary illustration of the "fold-in" technique of step 140 in accordance with one or more aspects of the invention. The reduced decomposed matrices from step 130 are represented as matrix U_k 502, matrix Σ_k 504 and matrix V_k 506. The additional visual language vectors are generated similarly to visual language vector 224, as discussed above. The new vectors are projected into the reduced k-space as:

d_hat = d^T U_k Σ_k^(-1)   (4)
wherein d^T is the vector of visual language terms, padded by zeros in place of the unknown human language terms, and d_hat is the projected version of the d^T vector. If weighting is used as discussed above, then the same weighting must first be applied to d^T before projection. The new projected vector is then appended as a new column to the matrix V_k 506 and represented by shaded region 510. Thus, a complete term-document matrix 500 is created with all the images added via the "fold-in" technique. The additional added images are represented by shaded region 512.
[0026] Referring back to FIG. 1, the image database is now ready to be queried. A query is submitted in step 150. The query may include visual language, human language or a combination of the two. Notably, even though the remaining images added via the "fold-in" technique did not have semantic annotations that contain human language, these images may still be queried using human language, visual language or both. After the query is submitted, a query vector is created from the visual language terms, human language terms or both. The query vector is created similar to the way human language vector 214 and visual language vector 224 were created, as discussed above. Furthermore, the query vector is also weighted in the same way as each element a_ij was when creating term-document matrix 300, as discussed above. Finally, the query vector is reduced to k dimensions and represented as follows:
q̂ = qᵀUkΣk⁻¹ (5)
wherein qᵀ is the query vector, and q̂ is the projected version of the qᵀ vector. [0027] Referring back to FIG. 1, the next step 160 is to compare the query vector of equation (5) against each document represented by the columns of matrix Vkᵀ 506, including all the images added via the "fold-in" technique represented by shaded region 510. The comparison is based on a calculated distance, for example the Euclidean distance, between the query vector of equation (5) and each column of matrix Vkᵀ 506, including the columns for the images added via the "fold-in" technique.
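The fold-in projection of equation (4) above can be sketched in a few lines of NumPy. The toy term-document matrix, the chosen rank k, and the new document's term counts are illustrative assumptions only, not data from the embodiment:

```python
import numpy as np

# Toy term-document matrix A (terms x documents): rows are mixed
# human-language and visual-language terms, columns are documents.
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])

k = 2  # reduced rank for the LSI space
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# New document: visual-language term counts only, with zeros in the
# positions of the (unknown) human-language terms.
d = np.array([1., 0., 0., 1.])

# Equation (4): project into the reduced k-space, d_hat = d^T Uk Sk^-1.
d_hat = d @ Uk @ np.linalg.inv(Sk)

# Fold in: append the projected document as a new column of Vk^T,
# without recomputing the SVD over the whole collection.
Vtk_extended = np.hstack([Vtk, d_hat.reshape(-1, 1)])
print(Vtk_extended.shape)  # one more document column than before
```

If term weighting were in use, the same local and global weights would be applied to `d` before the projection, mirroring the treatment of the original matrix columns.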
[0028] Finally, the method ends at step 170 by returning matching results. In an exemplary embodiment, the results may be ranked in order of their distance, with the closest being the most similar.
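Steps 150 through 170 — projecting a query per equation (5), measuring the Euclidean distance to every document column of Vkᵀ, and ranking by distance — might be sketched as follows. The toy matrix, rank, and query values here are illustrative assumptions, and a full system would first apply the same term weighting used when building the matrix:

```python
import numpy as np

# Illustrative term-document matrix (terms x documents) and rank-k SVD.
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Query vector over the same term vocabulary (human-language terms,
# visual-language terms, or both); the values are made up.
q = np.array([1., 1., 0., 0.])

# Equation (5): q_hat = q^T Uk Sk^-1.
q_hat = q @ Uk @ np.linalg.inv(Sk)

# Step 160: Euclidean distance from the query to each document column
# of Vk^T.  Step 170: rank ascending, so the closest document is the
# most similar result.
dists = np.linalg.norm(Vtk.T - q_hat, axis=1)
ranking = np.argsort(dists)
print(ranking)  # document indices, most similar first
```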
[0029] FIG. 6 is a block diagram depicting an exemplary embodiment of a computer 600 suitable for implementing the processes and methods described above in accordance with one or more aspects of the invention. The computer 600 includes a processor 601, a memory 603, various support circuits 604, and an I/O interface 602. The processor 601 may include one or more of any type of microprocessor known in the art. The support circuits 604 for the processor 601 include conventional cache, power supplies, clock circuits, data registers, I/O interfaces, and the like. The I/O interface 602 may be directly coupled to the memory 603 or coupled through the processor 601. The memory 603 may include one or more of the following: random access memory, read only memory, magneto-resistive read/write memory, optical read/write memory, cache memory, magnetic read/write memory, and the like, as well as signal-bearing media as described below.
[0030] The memory 603 stores processor-executable instructions and/or data that may be executed by and/or used by the processor 601 as described further below. These processor-executable instructions may comprise hardware, firmware, software, and the like, or some combination thereof. Notably, the processor-executable instructions may be configured to cause the processor to perform the method 100 of FIG. 1. Although one or more aspects of the invention are disclosed as being implemented as processor(s) executing a software program, those skilled in the art will appreciate that the invention may be implemented in hardware, software, or a combination of hardware and software. Such implementations may include a number of processors independently executing various programs and dedicated hardware, such as ASICs. The computer 600 may be programmed with an operating system, which may be OS/2, Java Virtual Machine, Linux, Solaris, Unix, Windows, Windows95, Windows98, Windows NT, Windows2000, WindowsME, or WindowsXP, among other known platforms. At least a portion of an operating system may be disposed in the memory 603.
[0031] An aspect of the invention is implemented as a program product for execution by a processor. The program(s) of the program product define functions of embodiments and can be contained on a variety of signal-bearing media (computer readable media), which include, but are not limited to: (i) information permanently stored on non-writable storage media (e.g., read-only memory devices within a computer, such as CD-ROM or DVD-ROM disks readable by a CD-ROM drive or a DVD drive); (ii) alterable information stored on writable storage media (e.g., floppy disks within a diskette drive, a hard-disk drive, or read/writable CDs or DVDs); or (iii) information conveyed to a computer by a communications medium, such as through a computer or telephone network, including wireless communications. The latter embodiment specifically includes information downloaded from the Internet and other networks. Such signal-bearing media, when carrying computer-readable instructions that direct functions of the invention, represent embodiments of the invention.
[0032] While various embodiments have been described above, it should be understood that they are presented by way of example only, and not limiting. For example, although the invention disclosed herein was discussed in connection with an image with human language and visual language annotations in the exemplary embodiments, one skilled in the art would recognize that the method and system disclosed herein can also be used in connection with other documents containing multiple mixed domain features. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims

What is claimed is:
1. A method of creating an image database, comprising: creating a term-document matrix comprising at least two different domain features; applying latent semantic indexing to the term-document matrix; and adding a plurality of new vectors to a decomposed document matrix of the term-document matrix using a fold-in technique.
2. The method of claim 1, wherein the term-document matrix comprises a plurality of cross-language vectors of the at least two different domain features.
3. The method of claim 2, wherein each of the plurality of cross-language vectors comprises a combination of the at least two different domain features, wherein at least one of the at least two different domain features is a human language vector and another one of the at least two different domain features is a visual language vector.
4. The method of claim 3, wherein the human language vector represents annotations associated with a small set of annotated images that are generated from an entire image collection.
5. The method of claim 3, wherein the visual language vector represents a descriptor that describes an image's content as a set of discrete terms.
6. The method of claim 1, wherein the term-document matrix comprises a plurality of individual elements aij, wherein aij represents the frequency of a term i in a document j.
7. The method of claim 6, wherein each element aij is weighted such that: aij = L(i,j) x G(i), wherein L(i,j) represents the local weighting for term i in document j and G(i) is the global weighting for term i.
8. The method of claim 1, wherein latent semantic indexing further comprises applying Singular Value Decomposition (SVD) to the term-document matrix.
9. The method of claim 1, wherein the fold-in technique comprises adding the plurality of new vectors based only on a visual language vector associated with each one of the plurality of new vectors.
10. The method of claim 9, wherein the visual language vector is created based on values of a vocabulary vector that are closest to a plurality of visual language annotations of an image.
11. The method of claim 1, wherein the fold-in technique further comprises adding the plurality of new vectors by appending a new column for each one of the plurality of new vectors to the decomposed document matrix.
12. The method of claim 1, wherein the plurality of new vectors represents remaining un-annotated images in an entire image collection.
13. A method of image retrieval, comprising: creating a term-document matrix comprising at least two different domain features; applying latent semantic indexing to the term-document matrix; adding a plurality of new vectors to a decomposed document matrix of the term-document matrix using a fold-in technique; providing a query vector; comparing the query vector against each one of a plurality of document vectors of the decomposed document matrix; and returning a plurality of images that are similar to the query vector.
14. The method of claim 13, wherein the fold-in technique comprises adding the plurality of new vectors based only on a visual language vector associated with each one of the plurality of new vectors.
15. The method of claim 14, wherein the visual language vector is created based on values of a vocabulary vector that are closest to a plurality of visual language annotations of an image.
16. The method of claim 13, wherein the fold-in technique further comprises adding the plurality of new vectors by appending a new column for each one of the plurality of new vectors to the decomposed document matrix.
17. The method of claim 13, wherein the query vector comprises human language vectors, visual language vectors or both human and visual language vectors.
18. The method of claim 13, wherein the comparing step comprises calculating the Euclidean distance between the query vector and each one of the plurality of document vectors of the decomposed document matrix.
19. The method of claim 13, wherein the returning step ranks the plurality of images in order of a calculated distance between each one of the plurality of images and the query vector.
20. A computer-readable medium having stored thereon a plurality of instructions, the plurality of instructions including instructions which, when executed by a processor, cause the processor to perform the steps of a method of image retrieval, comprising: creating a term-document matrix comprising human language vectors and visual language vectors; applying latent semantic indexing to the term-document matrix; adding a plurality of new vectors to a decomposed document matrix of the term-document matrix using a fold-in technique; providing a query vector; comparing the query vector against each one of a plurality of document vectors of the decomposed document matrix; and returning a plurality of images that are similar to the query vector.
21. Apparatus for image retrieval, comprising: means for creating a term-document matrix comprising at least two different domain features; means for applying latent semantic indexing to the term-document matrix; means for adding a plurality of new vectors to a decomposed document matrix of the term-document matrix using a fold-in technique; means for providing a query vector; means for comparing the query vector against each one of a plurality of document vectors of the decomposed document matrix; and means for returning a plurality of images that are similar to the query vector.
PCT/US2006/046394 2006-12-04 2006-12-04 Method and apparatus for improving image retrieval and search using latent semantic indexing WO2008069791A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2006/046394 WO2008069791A1 (en) 2006-12-04 2006-12-04 Method and apparatus for improving image retrieval and search using latent semantic indexing

Publications (1)

Publication Number Publication Date
WO2008069791A1 true WO2008069791A1 (en) 2008-06-12

Family

ID=39492501

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/046394 WO2008069791A1 (en) 2006-12-04 2006-12-04 Method and apparatus for improving image retrieval and search using latent semantic indexing

Country Status (1)

Country Link
WO (1) WO2008069791A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8239334B2 (en) 2008-12-24 2012-08-07 Microsoft Corporation Learning latent semantic space for ranking
US8244711B2 (en) 2009-09-28 2012-08-14 Chin Lung Fong System, method and apparatus for information retrieval and data representation
WO2014197684A1 (en) * 2013-06-05 2014-12-11 Digitalglobe, Inc. System and method for multiresolution and multitemporal image search
US9075846B2 (en) 2012-12-12 2015-07-07 King Fahd University Of Petroleum And Minerals Method for retrieval of arabic historical manuscripts
CN108763244A (en) * 2013-08-14 2018-11-06 谷歌有限责任公司 It searches for and annotates in image
CN109344407A (en) * 2018-10-29 2019-02-15 北京天融信网络安全技术有限公司 Semantic-based document fingerprint construction method, storage medium and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020102021A1 (en) * 2000-12-15 2002-08-01 Pass Gregory S. Representing an image with a posterized joint histogram
US20040205461A1 (en) * 2001-12-28 2004-10-14 International Business Machines Corporation System and method for hierarchical segmentation with latent semantic indexing in scale space
US20040267740A1 (en) * 2000-10-30 2004-12-30 Microsoft Corporation Image retrieval systems and methods with semantic and feature based relevance feedback


Similar Documents

Publication Publication Date Title
US8065313B2 (en) Method and apparatus for automatically annotating images
US9298682B2 (en) Annotating images
US8126274B2 (en) Visual language modeling for image classification
US8150170B2 (en) Statistical approach to large-scale image annotation
Zhu et al. Theory of keyblock-based image retrieval
JP4295062B2 (en) Image search method and apparatus using iterative matching
US9195738B2 (en) Tokenization platform
US8577882B2 (en) Method and system for searching multilingual documents
US20090112830A1 (en) System and methods for searching images in presentations
US7580910B2 (en) Perturbing latent semantic indexing spaces
Xu et al. Attribute hashing for zero-shot image retrieval
US9805035B2 (en) Systems and methods for multimedia image clustering
JP2014533868A (en) Image search
US7555428B1 (en) System and method for identifying compounds through iterative analysis
WO2008069791A1 (en) Method and apparatus for improving image retrieval and search using latent semantic indexing
Poullot et al. Z-grid-based probabilistic retrieval for scaling up content-based copy detection
Foncubierta-Rodríguez et al. Medical image retrieval using bag of meaningful visual words: unsupervised visual vocabulary pruning with PLSA
CN108304381B (en) Entity edge establishing method, device and equipment based on artificial intelligence and storage medium
US11494431B2 (en) Generating accurate and natural captions for figures
CN109902162B (en) Text similarity identification method based on digital fingerprints, storage medium and device
US20070016567A1 (en) Searching device and program product
Taghva et al. Address extraction using hidden markov models
JP2004046612A (en) Data matching method and device, data matching program, and computer readable recording medium
US11687514B2 (en) Multimodal table encoding for information retrieval systems
US20050060308A1 (en) System, method, and recording medium for coarse-to-fine descriptor propagation, mapping and/or classification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06839006

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06839006

Country of ref document: EP

Kind code of ref document: A1