US20120250984A1 - Image segmentation for distributed target tracking and scene analysis - Google Patents
- Publication number
- US20120250984A1 (application US13/309,551)
- Authority
- US
- United States
- Prior art keywords
- segmentation
- image
- planes
- feature
- pixels
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/292—Multi-camera tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30241—Trajectory
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
- H04N7/181—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a plurality of remote sources
Definitions
- Embodiments of the present invention address and overcome one or more of the above shortcomings and drawbacks, by providing devices, systems, and methods for segmenting images.
- This technology is particularly well-suited for, but by no means limited to, real-time segmentation of images for identifying salient features of an image.
- Embodiments of the present invention are directed to a method for identifying salient segments in an image comprising the steps of: choosing a plurality of planes that divide a feature space into a plurality of cells; generating a first set of hash codes for a first set of pixels in an image based on a location in the feature space of a feature vector associated with each pixel in the first set, whereby the location of each feature vector relative to each of the plurality of planes contributes a binary value to each hash code; determining a second set of hash codes selected from the first set of hash codes, wherein the second set of hash codes indicates local maxima of clusters of pixels; assigning each of the first set of pixels to one of the second set of hash codes based on the Hamming distance between a first hash code from the first set of hash codes assigned to each pixel and each of the second set of hash codes; and identifying segments in the image based on groups of adjacent pixels sharing common hash codes.
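The claimed steps can be illustrated with a short, self-contained sketch. It is a rough approximation under assumed parameter names and a toy two-color image, not the patented implementation itself:

```python
import numpy as np

def segment(image, n_planes=8, k=1, seed=0):
    """Sketch of the claimed pipeline: hash pixels with random splitting
    planes, find population maxima in code space, then label pixels by
    Hamming distance to the nearest maximum."""
    rng = np.random.default_rng(seed)
    h, w, d = image.shape
    feats = image.reshape(-1, d).astype(float)

    # Choose random splitting planes and cast each plane through the
    # centroid of the projected values (the mean splitting value).
    normals = rng.standard_normal((n_planes, d))
    proj = feats @ normals.T                         # (N, n_planes)
    bits = proj > proj.mean(axis=0)                  # side of each plane
    codes = bits.astype(int) @ (1 << np.arange(n_planes))

    # Population of each cell (hash code) of the fractured feature space.
    pop = np.bincount(codes, minlength=1 << n_planes)

    # Precompute pairwise Hamming distances between all 2^n codes.
    popcnt = np.array([bin(i).count("1") for i in range(1 << n_planes)])
    every = np.arange(1 << n_planes)
    ham = popcnt[np.bitwise_xor.outer(every, every)]

    # A populated code whose population exceeds that of every code within
    # Hamming distance k is declared a cluster center.
    centers = []
    for c in every:
        if pop[c] == 0:
            continue
        neigh = ham[c] <= k
        neigh[c] = False
        if pop[c] > pop[neigh].max():
            centers.append(c)
    centers = np.array(centers)

    # Assign every pixel to the nearest center by Hamming distance.
    labels = centers[np.argmin(ham[codes][:, centers], axis=1)]
    return labels.reshape(h, w)

# Toy example: the two halves of a two-color image should come back as
# two distinct clusters.
img = np.zeros((8, 8, 3))
img[:, 4:] = [200.0, 30.0, 30.0]
labels = segment(img)
```

The sketch uses a dense 2^n-by-2^n Hamming table for clarity, which is only practical for small n; a production version would enumerate neighbors by bit flips instead.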
- the step of choosing a plurality of planes may further comprise selecting a predetermined number of random planes in the feature space.
- the feature space may be a three-dimensional color space.
- the step of choosing a plurality of planes may further comprise training the plurality of planes based on human interpretations of images.
- FIG. 1A is a two-dimensional view of a simplified version of the randomized hashing scheme, in which a feature space is fractured into regions by a set of randomly chosen splitting planes. Each region is associated with a hash code indicating where it falls with respect to the splitting planes;
- FIG. 1B is a view of the relationship between hash codes and the vertices of a four-dimensional hypercube in an exemplary embodiment;
- FIG. 2 is a flow chart showing operation of an exemplary algorithm for use in segmenting images in accordance with some embodiments of the random hashing scheme;
- FIG. 3 shows two histograms of the distributions of the GCE and Rand Index metrics over a sample data set as a result of the application of certain embodiments of the random hash scheme.
- FIG. 3 includes sample images along with the human segmentation and the machine segmentation.
- FIG. 4 shows a plurality of sample images, along with the human segmentation and the machine segmentation, including both prior art and an embodiment of the random hashing scheme.
- the segmentation scheme described in this application employs a feature based approach.
- Each pixel in the image is described by a feature vector which encodes a set of properties used to describe that pixel.
- Embodiments of the present invention can employ a simple color descriptor vector, which is an example of a feature vector, but some embodiments also use more sophisticated feature vectors such as a histogram of color values or a vector of texture coefficients.
- Some embodiments can employ an approach to segmenting natural images which leverages the idea of randomized hashing. The procedure aims to replace the problem of finding clusters in the feature space with the problem of finding local maxima in a graph whose topology approximates the geometry of the underlying feature space. In so doing the method can bypass the computational effort associated with computing distances between feature vectors which can comprise a significant fraction of the effort in other techniques such as k-means clustering and mean shift segmentation.
- the method can be controlled by a few parameters, namely: the number of random splitting planes, n; the Hamming distance threshold, k; and the window size that is used to average the color vectors, w.
- the algorithm can be made to produce over-segmentations or under-segmentations of the input imagery.
- the number of segments that are produced is implicitly controlled by these parameters rather than explicitly provided as an input to the algorithm.
- a feature space for use with some embodiments includes a 3D space with orthogonal axes for each color: red, green, and blue (RGB).
- the feature space uses axes for Y, Pb, and Pr.
- more than three dimensions can be used.
- other dimensions can include motion vectors, depth, or other information that can be obtained related to each pixel, so that each pixel (or group of pixels) can be mapped to the feature space.
- Entries in this feature vector characterize salient properties of the region surrounding that pixel such as color (e.g. RGB, YPbPr, or multispectral properties, such as IR, UV, or information from a FLIR camera), texture, frequency content, and/or motion properties.
- FIG. 1A depicts a simplified 2D version of the randomized hashing scheme in a feature space fractured into regions by a set of randomly chosen splitting planes. Each region is associated with a hash code indicating where it falls with respect to the splitting planes.
- the set of all hash codes can be associated with the vertices of a hypercube as shown in FIG. 1B, where the shading of the nodes indicates how many feature vectors are hashed to that code.
- the segmentation scheme proceeds by identifying local maxima in this hash code space.
- FIGS. 1A and 1B show a simplified view of this procedure in two dimensions.
- the random splitting planes 0, 1, 2, and 3 are used to hash the feature vectors into a set of disjoint cells based on their location.
- One can hash a set of vectors into a set of discrete bins in order to accelerate the search for nearest neighbors.
- This randomized hashing procedure tends to preserve locality, so points that are near to each other in the feature space are hashed to the same bin with high probability.
- the proposed segmentation scheme leverages these phenomena to cluster the feature vectors into groups.
- each splitting plane can also be considered a normal vector and a decision point (such as a mean) in the feature space. By projecting each pixel onto that normal vector and determining its position relative to the decision point, a bit associated with each normal vector can be assigned.
- Neighboring cells in the feature space differ by a single bit so the Hamming distance between the codes provides some indication of the distance between vectors in the feature space. More generally we can construct a correspondence between the set of all possible hash codes and the vertices of an n-dimensional hypercube.
- the topology of the hypercube 100 reflects the structure of the feature space since neighboring cells in feature space will correspond to neighboring vertices in the hypercube.
- the shaded nodes, 1001, 0000, and 0111, are those bins that have local maxima clusters in FIG. 1A.
- the clustering procedure can record how many feature vectors are mapped to that code.
- clusters in feature space will induce population maxima in the code space. That is, if we consider the hypercube as a graph we would expect to observe that some of the hash codes have a greater population than their neighbors. This allows us to replace the original problem of clustering vectors in the feature space in favor of the simpler problem of looking for population maxima in the code space graph.
- the algorithm interrogates all of the codes that differ from the current code by k bits or fewer.
- This parameter, k, is referred to as the Hamming distance threshold. If the code under consideration has a population greater than all of its neighbors, it is declared a local maximum and a cluster center. In this way the number of clusters recovered by the procedure is determined automatically based on the data, as opposed to being imposed a priori as in k-means. Note that this scheme can be used to distinguish up to 2^(n-k) local maxima.
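The neighborhood interrogation described here can be sketched directly. Function names and the toy population table are illustrative assumptions:

```python
from itertools import combinations

def neighbors_within(code, n_bits, k):
    """Yield every code within Hamming distance 1..k of `code`."""
    for d in range(1, k + 1):
        for positions in combinations(range(n_bits), d):
            flipped = code
            for p in positions:
                flipped ^= 1 << p   # flip bit p
            yield flipped

def is_local_maximum(code, population, n_bits, k):
    """A code is declared a cluster center when its population exceeds
    that of every code within the Hamming distance threshold k."""
    return all(population.get(q, 0) < population[code]
               for q in neighbors_within(code, n_bits, k))

# Toy population counts over 4-bit codes.
population = {0b1001: 40, 0b1000: 10, 0b0111: 25}
```

Here 0b1001 is a local maximum for k=1 because its only populated neighbor, 0b1000, has a smaller population; 0b1000 itself is not.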
- the normals associated with the splitting planes, u_i, can be chosen randomly or based on a priori knowledge.
- the splitting values, s_i, can be chosen by considering the distribution of the projected values, (v_i · u_i).
- the mean of the distribution, which corresponds to casting all of the splitting planes through the centroid of the distribution, is used.
- the median value, or the value midway between the maximum and minimum projected values, is used as the splitting value (i.e. the decision point for assigning bit values to pixels projected on the normal vector).
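A minimal sketch of these alternative splitting-value choices (the function and the `method` keyword are our naming, not the patent's):

```python
import numpy as np

def splitting_value(projections, method="mean"):
    """Pick the decision point s_i along a normal from the projected
    values v_i . u_i."""
    p = np.asarray(projections, dtype=float)
    if method == "mean":        # casts the plane through the centroid
        return p.mean()
    if method == "median":      # balances the population on both sides
        return np.median(p)
    return (p.max() + p.min()) / 2.0   # midway between extremes
```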
- each of the feature vectors is labeled with the hash code of the closest local maximum based on the Hamming distance.
- the Euclidean distance between the feature vector and the mean cluster vector is used to break the tie and decide the label.
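The Hamming-distance assignment with a Euclidean tie-break might look like the following sketch; the `centers` mapping from hash code to mean cluster vector is an assumed representation:

```python
import numpy as np

def assign_cluster(code, feat, centers):
    """centers: dict mapping each local-maximum hash code to the mean
    feature vector of its cluster."""
    dists = {c: bin(code ^ c).count("1") for c in centers}
    best = min(dists.values())
    tied = [c for c, d in dists.items() if d == best]
    if len(tied) == 1:
        return tied[0]
    # Tie: fall back to Euclidean distance to the mean cluster vector.
    return min(tied, key=lambda c: np.linalg.norm(feat - centers[c]))

centers = {0b0000: np.array([0.0, 0.0, 0.0]),
           0b1111: np.array([1.0, 1.0, 1.0])}
```

A code such as 0b0011 is equidistant (Hamming distance 2) from both centers, so the pixel's feature vector decides the label.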
- FIG. 2 shows method 300 for segmenting an image.
- a feature space is divided by a predetermined number, n, of splitting planes.
- This step can include randomly assigning feature planes and using a calibration image or a first image to position the splitting plane for a given orientation. For example, the position of each plane can be chosen such that feature vectors (e.g. those feature vectors associated with each pixel in a test image) within the feature space are evenly divided on either side of the splitting plane.
- each splitting plane can be created by choosing a random normal vector and assigning a decision point along that vector such that the decision point is at the mean or median of the distribution of feature vectors (e.g. pixels projected on that normal vector).
- step 302 takes into account predetermined splitting planes that have been created via prior training or have been manually assigned.
- the splitting planes can include a combination of random splitting planes and preassigned splitting planes.
- each feature vector in the image (or a subset of the feature vectors in the image) is hashed using the splitting planes.
- This process can be computationally simple, as each bit in the hash simply indicates on which side of the splitting plane the feature vector resides. Because only a bit is used for this hash, the additional computational overhead needed for deriving the Euclidean distance from the splitting plane is not necessary.
- Step 304 can be an iterative loop whereby each feature vector is taken and compared in succession to each of the n splitting planes. It will be appreciated that massive parallelism may be possible using an appropriate processor, such as a DSP or graphics processor, to perform this step.
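The per-plane bit test can be written as a plain loop to emphasize that each bit costs only a dot product and a comparison (names and plane values are illustrative):

```python
def hash_pixel(feat, planes):
    """planes: list of (normal_vector, splitting_value) pairs.
    Each bit costs one dot product and one comparison; no Euclidean
    distances are computed."""
    code = 0
    for i, (normal, split) in enumerate(planes):
        proj = sum(f * u for f, u in zip(feat, normal))
        if proj > split:
            code |= 1 << i          # bit i: positive side of plane i
    return code

# Two axis-aligned planes in a 3D color space (illustrative values).
planes = [((1.0, 0.0, 0.0), 0.5), ((0.0, 1.0, 0.0), 0.5)]
```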
- the algorithm proceeds to step 306 .
- the number of feature vectors resident in each cell in the feature space is counted.
- the population counts for each cell in the feature space are compared to choose a number of local maxima.
- the number of local maxima, M, is predetermined prior to image processing.
- the number of local maxima, M, is derived dynamically from the image based on the results of the residency counts of each cell in the feature space.
- the maximum number of local maxima, M, can be determined based on the number of splitting planes used and the Hamming distance requirements for clusters used in image segmentation. Once the local maxima are identified, these can be used as the centers of the clusters used for image segmentation.
- each feature vector under test can be assigned to each of the cluster centers determined in step 308 .
- this assignment has low computational overhead because the hash of each feature vector is compared to each of the cluster centers and the cluster having the nearest Hamming distance to the hash of the vector is selected as the cluster to which the feature vector will be assigned.
- ties between competing clusters (i.e. clusters that are the same Hamming distance away from the hashed feature vector) can be broken, for example, by using the Euclidean distance between the feature vector and the mean cluster vector.
- image segments can be identified within the image at step 312 .
- adjacent (i.e. connected) pixels in the image plane that have had their feature vectors assigned to the same cluster can be considered part of the same image segment.
- Any known technique can be used, such as minimum threshold sizes for segments of adjacent pixels. In this way, image segments can be rapidly deduced from the image by scanning pixels in the image plane and determining whether they share the same cluster in the feature space.
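One way to group adjacent pixels sharing a cluster label is a simple 4-connected flood fill; this is a generic sketch, not the patent's specific connected-components routine:

```python
from collections import deque

def connected_segments(labels):
    """Group 4-connected pixels sharing a cluster label into segments.
    `labels` is a 2D list of per-pixel cluster ids."""
    h, w = len(labels), len(labels[0])
    seg = [[-1] * w for _ in range(h)]
    next_id = 0
    for sy in range(h):
        for sx in range(w):
            if seg[sy][sx] != -1:
                continue
            # Flood fill from this unvisited pixel.
            seg[sy][sx] = next_id
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and seg[ny][nx] == -1
                            and labels[ny][nx] == labels[y][x]):
                        seg[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1
    return seg, next_id
```

Note that two regions with the same cluster label but no connecting path become distinct segments, which is the intended behavior.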
- step 314 may optionally adjust the splitting planes based on the results in step 312 .
- the results of step 312 can be compared via a consistency score with a desired segmentation result. Based on this result, the splitting planes can be adjusted to improve this consistency score.
- incremental improvement of splitting planes is done in real time based on other criteria, such as reducing the number of identified segments in step 312 or increasing the number of identified segments in step 312 to reach a desired range of segments.
- the proposed scheme can be similar in spirit to the Mean Shift segmentation algorithm which also seeks to identify modes in the distribution of feature vectors.
- the mean shift scheme uses a Parzen Window based scheme to estimate density in feature space.
- the proposed scheme uses hashing, which can be randomized or tailored to apriori information, to identify salient groupings of feature vectors.
- the segmentation scheme can make implicit use of the Johnson-Lindenstrauss theorem which justifies the use of random projection by bounding the distortion of the relative distances between the feature vectors induced by the projection process.
- the hashing scheme may use planes that are predetermined, selected from a bounded group of planes, estimated or calculated based on some selected criteria. For example, if a particular color is known to be important, the hashing scheme can take this into account, by using at least some predetermined feature planes to create the hash or introduce a bias into the selection process, which might otherwise be random.
- the hashing scheme may also include an adaptive selection algorithm to learn how to select more appropriate feature planes in future video frames. This learning process can be by any scheme known to a person of ordinary skill in the art, including using human feedback to help train parameters, genetic algorithms, decision trees, pruning, beam searching, or the like.
- n denotes the number of projection directions
- m denotes the dimension of the feature space
- N denotes the total number of pixels or feature vectors. Note that the scheme can avoid the explicit distance computations between the feature vectors that one uses in most agglomerative and k-means segmentation schemes in favor of randomized hashing.
- the proposed segmentation scheme can be carried out using training.
- For example, the Berkeley Segmentation Database, which contains 1633 manual segmentations of 300 color images, can be used for this purpose.
- the selected splitting planes can be improved.
- training can be used to assist in selecting the appropriate number of splitting planes or in selecting specific splitting planes to include.
- the manual segmentations provided by the users can be compared with the segmentations produced by the algorithm using two different measures, the Global Consistency Error and the Rand Index.
- the Global Consistency Error measure proposed by Martin, Fowlkes, Tal and Malik can capture the difference between two segmentations in a single number between 0 and 1, where lower numbers indicate lower error.
- the measure can be designed such that if one segmentation is a refinement of the other the score will be zero. This is a useful feature since it accounts for the fact that human subjects often choose to segment scenes to various levels, however, it also implies that machine segmentations that are strongly over or under segmented can also yield very low GCE scores which can be misleading.
- the Rand Index represents the fraction of the pixel pairs that are labeled consistently in the two segmentations; values that are closer to 1 indicate better segmentations. Unlike the GCE, the Rand Index will suffer if the machine segmentation is over or under segmented with respect to the human segmentation.
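The Rand Index as described can be computed naively over all pixel pairs; this is fine for illustration, though it is quadratic in the number of pixels:

```python
from itertools import combinations

def rand_index(seg_a, seg_b):
    """Fraction of pixel pairs on which the two segmentations agree
    about whether the pair shares a label."""
    agree = total = 0
    for (a1, b1), (a2, b2) in combinations(list(zip(seg_a, seg_b)), 2):
        agree += (a1 == a2) == (b1 == b2)
        total += 1
    return agree / total
```

Because only pairwise agreement matters, relabeling a segmentation (swapping label names) does not change its score.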
- a series of segmentation experiments can be carried out using feature spaces based on color information.
- the color values can be averaged over a square window of width w centered around each pixel. Effectively, this averaging is a preprocessing step that can aid in smoothing noise out of the image. Increasing the size of this window can increase the level of smoothing and leads to a coarser segmentation. It will be appreciated that other filtering schemes can be used to smooth or preprocess the image before performing image segmentation.
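The window-averaging preprocessing can be sketched as a simple box filter; the function name and edge handling are our assumptions:

```python
import numpy as np

def window_average(image, w):
    """Average the color vectors over a w-by-w window centered on each
    pixel; border pixels reuse the nearest edge value."""
    pad = w // 2
    padded = np.pad(image.astype(float),
                    ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(image.shape, dtype=float)
    for dy in range(w):
        for dx in range(w):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (w * w)
```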
- FIG. 3 shows histograms of the distributions of the GCE and Rand Index metrics over the entire data set.
- the first graph on the left of FIG. 3 indicates the distribution of the GCE values over all of the segmentations in the database while the graph on the right denotes the distribution of the Rand Index values.
- FIG. 4 shows a few of the images from the data set along with the human segmentation and the machine segmentation. This figure compares the output of the automated segmentation procedure to human labeled segmentations.
- the first and fourth rows contain the input imagery, the second and fifth rows contain human segmentations while the third and sixth rows contain machine segmentations.
- FIG. 5 provides a direct comparison of segmentations produced by the methods of the present invention with those produced by the Mean Shift procedure for a few randomly chosen images in the data set.
- This figure compares the output of the proposed segmentation scheme with the results obtained using the Edison segmentation tool.
- the first row corresponds to the input image, the second to a human segmentation, the third to the mean shift result and the fourth to the randomized hash result.
- One advantage of the proposed segmentation scheme is that the computational effort required scales linearly in the number of pixels and the operations required are simple and regular.
- a real time version of the scheme was implemented on a Macbook Pro laptop computer.
- This implementation was used to segment 640 by 480 video frames at a rate of 10 frames per second using a single core of an Intel Core 2 Duo processor running at 2.33 GHz.
- This rate includes the time taken for all phases of the algorithm: image acquisition, randomized hashing, local maxima detection, and connected components processing. Since almost all of the steps in the procedure are embarrassingly parallel, the algorithm is well suited to implementation on modern multi-core processors and GPUs and should be amenable to further acceleration.
- the proposed algorithm can be highly parallelizable and can be implemented in real time on modest hardware. This is an advantage since it means that the method could be used as a cheap preprocessing step in a variety of image interpretation applications much as edge detection is used today.
- the segmentation algorithm is used instead of edge detection in image processing in a real-time environment. This can allow objects to be identified when coupled with pattern matching or relative motion to a background.
- the method can be used on a mobile robot to produce a fast, rough segmentation of the scene into sky, ground, road, and tree regions.
- This mobile robot can use image segmentation as described herein as well as a ranging mechanism to model its surrounding environment as described in concurrently filed application titled “Scene Analysis Using Image And Range Data,” by C. J. Taylor, which is incorporated herein by reference.
- the segmentation scheme could be used as part of the loop in real time tracking applications where it would allow the system to automatically delineate targets.
- the lower processing requirements of hashing could make object detection faster and cheaper for real-time tracking.
- the segmentation scheme could be performed by CPUs on the cameras to detect an object and track it.
- a suitable camera network with tracking ability is described in concurrently filed application titled “Distributed Target Tracking Using Self Localizing Smart Camera Networks,” by C. J. Taylor, which is incorporated herein by reference.
- the real time segmentation scheme can be used as a pre-processing step which would suggest possible groupings in the image to higher level interpretation algorithms.
- a system could use this segmentation scheme as a preprocessing stage to any preferred image processing technique suitable for the application.
- the system could, for instance, focus its attention on regions based on their size, shape, texture or position in the image.
- Suitable processing environments can include Intel, PowerPC, ARM, or other CPU-based systems having memory and a processor, but can also include any suitable embedded systems, DSP, GPU, APU, or other multi-core processing environment including related hardware and memory.
- the algorithms taught herein can be implemented by dedicated logic.
- execution of these algorithm and techniques is not limited to a single processor environment, and can, in some contemplated embodiments, be performed in a client server environment, a cloud computing environment, multicore environment, multithreaded environment, mobile device or devices, etc.
Abstract
Pixels in a feature space can be divided using a hashing method. Random planes are selected within a feature space and a pixel's relationship to each plane determines a bit of a hash code. Clusters of pixels can be identified by local maxima in the hash cells in the feature space. Nearby pixels in the feature space can be further assigned to these local maxima based on Hamming distance. An image can be segmented by observing adjacent pixels sharing a common hash code.
Description
- The present application claims priority to provisional patent applications 61/418,789, 61/418,805, and 61/418,799 which are incorporated by reference in their entirety.
- The present application relates to co-pending patent applications entitled "Scene Analysis Using Image and Range Data" and "Distributed Target Tracking Using Self Localizing Smart Camera Networks," both of which are incorporated by reference in their entirety and filed on the same day as the present application entitled "Image Segmentation for Distributed Target Tracking and Scene Analysis."
- The present invention relates generally to machine vision systems and methods and specifically to image segmentation to determine salient features in an image. The present invention relates generally to machine vision systems and methods and specifically to object tracking and scene analysis for distributed or mobile applications.
- Segmentation is a method of breaking an image into coherent regions and is a common problem in Computer Vision. Many known methods of segmentation are computationally intensive, making them unsuitable for fast, low power, or low cost applications, such as for use with distributed or mobile devices. However, there is a need for an algorithm that is more amenable to real-time implementation.
- To date, most of the approaches that have been developed to tackle the segmentation problem can be broadly divided into two groups. The first group consists of algorithms that view the image as a graph and use various metrics to measure the difference in appearance between neighboring pixels or regions. Once the problem has been formulated in this way, the algorithms center on the problem of dividing this graph into pieces so as to maximize coherence. The Normalized Cut algorithm developed by Shi and Malik proceeds by recasting the graph segmentation problem in terms of a spectral analysis. (See Jianbo Shi and Jitendra Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:888-905, 2000.) This approach involves computing the distance between the pixels in the image and then solving a series of large but sparse eigenvector problems.
- Felzenszwalb and Huttenlocher proposed an efficient approach to grouping pixels in an image by making use of a spanning tree and showed that locally greedy grouping decisions can yield plausible results. (Pedro F. Felzenszwalb and Daniel P. Huttenlocher, "Efficient graph-based image segmentation," Int. J. Comput. Vision, 59(2): 167-181, 2004. ISSN 0920-5691.) This approach also revolves around the computation of multiple pairwise distance values. There remains a need for avoiding the computational costs associated with distance computation.
- Another broad class of segmentation schemes are termed feature based methods because these proceed by associating a feature vector with each pixel in the image. (See Dorin Comaniciu and Peter Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:603-619, 2002; W Y Ma and B. S. Manjunath, "Texture features and learning similarity," Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 0:425, 1996, ISSN 1063-6919. doi: http://doi.ieeecomputersociety.org/10.1109/CVPR.1996.517107; and Eduard Vazquez, Joost Weijer and Ramon Baldrich, "Image segmentation in the presence of shadows and highlights," In ECCV '08: Proceedings of the 10th European Conference on Computer Vision, pages 1-14, Berlin, Heidelberg, 2008, Springer-Verlag. ISBN 978-3-540-88692-1. doi: http://dx.doi.org/10.1007/978-3-540-88693-8_1.)
- Another clustering method is the k-means algorithm, which seeks to divide the population into k clusters using an Expectation Maximization approach. (See Morten Rufus Blas, Motilal Agrawal, Aravind Sundaresan, and Kurt Konolige, "Fast color/texture segmentation for outdoor robots," In IROS, pages 4078-4085, 2008.) K-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. One issue that needs to be addressed in applying this algorithm to segmentation problems is the question of choosing an appropriate value for k, which is typically not known beforehand. A second issue is the fact that the k-means scheme involves repeated rounds of distance computations. This means that the computational complexity grows with the number of pixels, the dimension of the feature space, and the number of clusters. K-means clustering is typically considered an NP-hard problem. While some approaches have been proposed to mitigate this problem, including the method developed by Elkan, which seeks to accelerate the process by invoking the triangle inequality, and Locality Sensitive Hashing schemes that search for near neighbors in the feature space, these have been unable to fully mitigate the distance computations required. (See Charles Elkan, "Using the triangle inequality to accelerate k-means," In International Conference on Machine Learning, 2003; Piotr Indyk and Rajeev Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," In STOC '98: Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604-613, New York, N.Y., USA, 1998. ACM. ISBN 0-89791-962-9. doi: http://doi.acm.org/10.1145/276698.276876.)
- There have also been attempts to apply the Mean Shift segmentation algorithm to subdivide color images into regions. (See Dorin Comaniciu and Peter Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:603-619, 2002.) This feature based approach proceeds by searching for modes of the distribution in the feature space using a Parzen Window based approach (sometimes referred to as the Parzen-Rosenblatt window, a well known non-parametric way of estimating the probability density function of a random variable). The method involves tracing the paths of various feature vectors as they evolve under the mean shift rule. This non-parametric estimation scheme can be very time consuming, which makes it less useful in situations where real time response is desired. The Parzen Window density estimation scheme employed in this approach also limits the dimension of the feature spaces to which it can be applied effectively. In contrast, the method proposed in this application can be applied to arbitrary feature spaces and has been implemented in real time on modest hardware.
- When labeled image data is available, algorithms that learn how to classify pixels and segment images have also been proposed. (See J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests for image categorization and segmentation,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1-8, 2008; Michael Maire, Pablo Arbelaez, Charles Fowlkes, and Jitendra Malik, “Using contours to detect and localize junctions in natural images,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1-8, June 2008.) These approaches rely on leveraging the training data to associate semantic labels with pixels and segments.
- With the emergence of low-cost, ubiquitous cameras, there exists a need for image segmentation methods that can be used reliably with low-cost computation and that can handle large volumes of image data. Similarly, it is desirable for a segmentation method to operate without the need for structured training data, which may not be readily available for use with low-cost or large volume cameras and processors.
- Embodiments of the present invention address and overcome one or more of the above shortcomings and drawbacks, by providing devices, systems, and methods for segmenting images. This technology is particularly well-suited for, but by no means limited to, real-time segmentation of images for identifying salient features of an image.
- Embodiments of the present invention are directed to a method for identifying salient segments in an image comprising the steps of choosing a plurality of planes that divide a feature space into a plurality of cells, generating a first set of hash codes for a first set of pixels in an image based on a location in the feature space of a feature vector associated with each pixel in the first set whereby the location of each feature vector relative to each of the plurality of planes contributes a binary value to each hash code, determining a second set of hash codes selected from the first set of hash codes, wherein the second set of hash codes indicates local maxima of clusters of pixels, assigning each of the first set of pixels to one of the second set of hash codes based on the hamming distance between a first hash code from the first set of hash codes assigned to each pixel and each of the second set of hash codes, and identifying segments in an image based on groups of adjacent pixels sharing common hash codes. The step of choosing a plurality of planes may further comprise selecting a predetermined number of random planes in the feature space. The feature space may be a three-dimensional color space. The step of choosing a plurality of planes may further comprise training the plurality of planes based on human interpretations of images.
- Additional features and advantages of the invention will be made apparent from the following detailed description of illustrative embodiments that proceeds with reference to the accompanying drawings.
- The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
- The foregoing and other aspects of the present invention are best understood from the following detailed description when read in connection with the accompanying drawings. For the purpose of illustrating the invention, there is shown in the drawings embodiments that are presently preferred, it being understood, however, that the invention is not limited to the specific instrumentalities disclosed. Included in the drawings are the following Figures:
-
FIG. 1A is a two-dimensional view of a simplified 2D version of the randomized hashing scheme in a feature space fractured into regions by a set of randomly chosen splitting planes. Each region is associated with a hash code indicating where it falls with respect to the splitting planes; -
FIG. 1B is a four-dimensional view of the relationship between hash codes and the vertices of a hypercube in an exemplary embodiment; -
FIG. 2 is a flow chart showing operation of an exemplary algorithm for use with segmenting images in accordance with some embodiments of the random hashing scheme; -
FIG. 3 shows two histograms of the distributions of the GCE and Rand Index metrics over a sample data set as a result of the application of certain embodiments of the random hash scheme. -
FIG. 4 includes sample images along with the human segmentation and the machine segmentation. -
FIG. 5 shows a plurality of sample images, along with the human segmentation and the machine segmentation, including both prior art and an embodiment of the random hashing scheme. - The segmentation scheme described in this application employs a feature based approach. Each pixel in the image is described by a feature vector which encodes a set of properties used to describe that pixel. Embodiments of the present invention can employ a simple color descriptor vector, which is an example of a feature vector, but some embodiments also use more sophisticated feature vectors such as a histogram of color values or a vector of texture coefficients. Some embodiments can employ an approach to segmenting natural images which leverages the idea of randomized hashing. The procedure aims to replace the problem of finding clusters in the feature space with the problem of finding local maxima in a graph whose topology approximates the geometry of the underlying feature space. In so doing, the method can bypass the computational effort associated with computing distances between feature vectors, which can comprise a significant fraction of the effort in other techniques such as k-means clustering and mean shift segmentation.
- The method can be controlled by a few parameters, namely the number of random splitting planes, n, the Hamming Distance threshold, k, and the window size that is used to average the color vectors, w. By adjusting these parameters the algorithm can be made to produce over-segmentations or under-segmentations of the input imagery. Importantly, the number of segments that are produced is implicitly controlled by these parameters rather than explicitly provided as an input to the algorithm.
- These techniques can be employed in a computing environment available to a person of ordinary skill in the art, including performing the prescribed calculations on a PC, embedded processor, mobile device, cloud computing environment, client-server environment, DSP, or dedicated hardware circuit capable of performing the methods disclosed herein.
- Given a set of feature vectors (e.g. pixels having predetermined properties, where each selected property is a dimension in a feature space), the goal of the segmentation procedure is to divide them into a set of clusters which capture the most salient groupings in the distribution. An example of a feature space for use with some embodiments includes a 3D space with orthogonal axes for each color: red, green, and blue (RGB). In some embodiments, the feature space uses axes for Y, Pb, and Pr. In some embodiments, more than three dimensions can be used. For example, other dimensions can include motion vectors, depth, or other information that can be obtained related to each pixel, so that each pixel (or group of pixels) can be mapped to the feature space.
- Entries in this feature vector characterize salient properties of the region surrounding that pixel such as color (e.g. RGB, YPbPr, or multispectral properties, such as IR, UV, or information from a FLIR camera), texture, frequency content, and/or motion properties. Once all of the pixels have been mapped to the feature space, the segmentation process is treated as a clustering problem where the goal can be to identify salient clusters in the population of feature vectors.
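- As an illustrative sketch (not part of the disclosed embodiments), the mapping from pixels to feature vectors can be as simple as flattening an image so that each pixel contributes one color vector. The snippet below assumes NumPy; the function name pixels_to_features is our own, and richer descriptors (texture, motion) would simply append additional columns.

```python
import numpy as np

def pixels_to_features(image):
    """Map an H x W x 3 color image to an (H*W) x 3 array of feature
    vectors, one per pixel. Richer descriptors would add columns."""
    h, w, c = image.shape
    return image.reshape(h * w, c).astype(np.float64)

# A tiny 2 x 2 RGB image yields four 3-dimensional feature vectors.
img = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)
feats = pixels_to_features(img)
```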
-
FIG. 1A depicts a simplified 2D version of the randomized hashing scheme in a feature space fractured into regions by a set of randomly chosen splitting planes. Each region is associated with a hash code indicating where it falls with respect to the splitting planes. The set of all hash codes can be associated with the vertices of a hypercube as shown in FIG. 1B, where the shading of the nodes indicates how many feature vectors are hashed to that code. The segmentation scheme proceeds by identifying local maxima in this hash code space. - This scheme employs a series of randomly chosen splitting planes.
FIGS. 1A and 1B show a simplified view of this procedure in two dimensions. Here the random splitting planes 0, 1, 2, and 3 are used to hash the feature vectors into a set of disjoint cells based on their location. - One can hash a set of vectors into a set of discrete bins in order to accelerate the search for nearest neighbors. One can further leverage the fact that this randomized hashing procedure tends to preserve locality, so points that are near to each other in the feature space are hashed to the same bin with high probability. The proposed segmentation scheme leverages these phenomena to cluster the feature vectors into groups. In
FIG. 1A , the n=4 splitting planes fracture the feature space into a set of 2^n disjoint convex cells, each of which corresponds to an n-bit hash code. It should be understood that each splitting plane can also be considered a normal vector and a decision point (such as a mean) in the feature space. By projecting each pixel onto that normal vector, and determining its position relative to the decision point, a bit associated with each normal vector can be assigned. - More specifically, each sample vector (e.g. pixel or object) in the feature space vj is assigned an n-bit hash code where the ith bit in the code, bij, is derived from the ith splitting plane as follows: bij=(vj·ui)>si, where ui denotes the normal associated with the ith splitting plane and si denotes the corresponding splitting value. Neighboring cells in the feature space differ by a single bit, so the Hamming distance between the codes provides some indication of the distance between vectors in the feature space. More generally we can construct a correspondence between the set of all possible hash codes and the vertices of an n-dimensional hypercube. The topology of the
hypercube 100 reflects the structure of the feature space since neighboring cells in feature space will correspond to neighboring vertices in the hypercube. In this example, the shaded nodes 1001, 0000, and 0111 are those bins that have local maxima clusters in FIG. 1A . - For each of the hash codes the clustering procedure can record how many feature vectors are mapped to that code. In some embodiments, clusters in feature space will induce population maxima in the code space. That is, if we consider the hypercube as a graph we would expect to observe that some of the hash codes have a greater population than their neighbors. This allows us to replace the original problem of clustering vectors in the feature space with the simpler problem of looking for population maxima in the code space graph.
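- The bit rule bij=(vj·ui)>si can be sketched in a few lines of NumPy. This is an illustrative implementation rather than the patented code; the names hash_features, normals, and splits are ours, and the splitting values here are placed at the centroid of the projections, one of the schemes described in the text.

```python
import numpy as np

def hash_features(features, normals, splits):
    """Hash each feature vector to an n-bit integer code: bit i is 1
    when the projection onto normal u_i exceeds the splitting value
    s_i. `features` is N x m, `normals` is n x m, `splits` length n."""
    proj = features @ normals.T                  # N x n projections v_j . u_i
    bits = (proj > splits).astype(np.uint32)     # one bit per splitting plane
    # Pack the n bits of each row into a single integer hash code.
    weights = (1 << np.arange(bits.shape[1])).astype(np.uint32)
    return bits @ weights

rng = np.random.default_rng(0)
features = rng.random((100, 3))            # e.g. 100 color vectors in [0, 1)
normals = rng.standard_normal((4, 3))      # n = 4 random splitting planes
splits = (features @ normals.T).mean(axis=0)   # planes through the centroid
codes = hash_features(features, normals, splits)
```

Each code is an integer in [0, 2^n), so nearby feature vectors tend to share codes or differ in only a few bits.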
- For every populated code in the hypercube the algorithm interrogates all of the codes that differ from the current code by k bits or fewer. This parameter, k, is referred to as the Hamming Distance Threshold. If the code under consideration has a population greater than all of its neighbors it is declared a local maximum and a cluster center. In this way the number of clusters recovered by the procedure is determined automatically based on the data as opposed to being imposed a priori as in k-means. Note that this scheme can be used to distinguish up to 2^(n-k) local maxima.
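- A minimal sketch of this local maxima search, assuming Python/NumPy and integer hash codes; local_maxima and its argument names are illustrative. A code is declared a cluster center when its population strictly exceeds that of every code within Hamming distance k.

```python
import numpy as np
from itertools import combinations

def local_maxima(codes, n, k):
    """Return hash codes whose population exceeds that of every code
    within Hamming distance k. `codes` holds integer n-bit codes."""
    pop = np.bincount(codes, minlength=1 << n)
    # Precompute all bit-flip masks with 1..k bits set.
    masks = [sum(1 << b for b in bits)
             for r in range(1, k + 1)
             for bits in combinations(range(n), r)]
    maxima = []
    for code in np.flatnonzero(pop):
        if all(pop[code] > pop[code ^ m] for m in masks):
            maxima.append(int(code))
    return maxima

# Toy example: two clusters hashed to codes 0 and 5 (n = 3, k = 1).
codes = np.array([0, 0, 0, 1, 5, 5, 5, 5, 4])
centers = local_maxima(codes, n=3, k=1)
```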
- The normals associated with the splitting planes, ui, can be chosen randomly or based on a priori knowledge. The splitting values, si, can be chosen by considering the distribution of the projected values, (vj·ui). In some embodiments, the mean of the distribution, which corresponds to casting all of the splitting planes through the centroid of the distribution, is used. In other embodiments, the median value, or the value midway between the maximum and minimum projected values, is used as the splitting value (i.e. the decision point for assigning bit values to pixels projected on the normal vector). These schemes tend to produce similar results in practice.
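- The three splitting-value schemes mentioned above (mean, median, and the midpoint of the projected range) can be sketched as follows; splitting_value is an illustrative helper of our own naming.

```python
import numpy as np

def splitting_value(projections, scheme="mean"):
    """Pick the splitting value s_i for one plane from the projected
    values (v_j . u_i), using one of the schemes in the text."""
    if scheme == "mean":
        return projections.mean()
    if scheme == "median":
        return np.median(projections)
    if scheme == "midrange":
        return 0.5 * (projections.max() + projections.min())
    raise ValueError(scheme)

proj = np.array([0.0, 1.0, 2.0, 7.0])   # projections of four vectors
```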
- After the local maxima have been identified, each of the feature vectors is labeled with the hash code of the closest local maximum based on the Hamming Distance. In the case where a feature vector is equidistant from two or more local maxima based on Hamming Distance, the Euclidean distance between the feature vector and the mean cluster vector is used to break the tie and decide the label. Once each of the pixels has been labeled with the index of its local maximum, a connected components procedure is run to divide the image into coherent connected regions.
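- The labeling rule with its Euclidean tie-break can be sketched as below, assuming NumPy; assign_labels and its arguments are illustrative names rather than the disclosed implementation.

```python
import numpy as np

def assign_labels(codes, features, maxima, cluster_means):
    """Label each feature vector with the index of the nearest local
    maximum by Hamming distance; Euclidean distance to the cluster
    mean breaks ties."""
    labels = np.empty(len(codes), dtype=np.int64)
    for j, (code, feat) in enumerate(zip(codes, features)):
        hd = [bin(int(code) ^ m).count("1") for m in maxima]
        best = min(hd)
        tied = [i for i, d in enumerate(hd) if d == best]
        if len(tied) == 1:
            labels[j] = tied[0]
        else:  # equidistant in code space: fall back on Euclidean distance
            dists = [np.linalg.norm(feat - cluster_means[i]) for i in tied]
            labels[j] = tied[int(np.argmin(dists))]
    return labels

# Two maxima at codes 0b00 and 0b11; code 0b10 and 0b01 are ties.
maxima = [0b00, 0b11]
cluster_means = np.array([[0.0], [1.0]])
codes = np.array([0b00, 0b10, 0b01])
features = np.array([[0.0], [0.9], [0.2]])
labels = assign_labels(codes, features, maxima, cluster_means)
```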
- The entire scheme is outlined below in pseudo-code. This algorithm is elaborated in
FIG. 2 . - Algorithm 1 Segmentation via Randomized Hashing
- 1: Hash each feature vector to an n-bit code using the n randomly chosen splitting planes
- 2: Maintain a count of the number of feature vectors mapped to each hash code
- 3: Identify local maxima in the code space—these are the cluster centers
- 4: Assign each feature vector to the closest local maximum
- 5: Run connected components on the labeled pixels to identify coherent connected regions.
-
FIG. 2 shows method 300 for segmenting an image. At step 302 a feature space is divided by a predetermined number, n, of splitting planes. This step can include randomly assigning feature planes and using a calibration image or a first image to position each splitting plane for a given orientation. For example, the position of each plane can be chosen such that feature vectors (e.g. those feature vectors associated with each pixel in a test image) within the feature space are evenly divided on either side of the splitting plane. As discussed above, each splitting plane can be created by choosing a random normal vector and assigning a decision point along that vector such that the decision point is at the mean or median of the distribution of feature vectors (e.g. pixels) projected on that normal vector. This can be done using a calibration image, the image under test, or the previous image under test. In some embodiments, step 302 takes into account predetermined splitting planes that have been created via prior training or have been manually assigned. At step 302, the splitting planes can include a combination of random splitting planes and preassigned splitting planes. - At
step 304, each feature vector in the image (or a subset of the feature vectors in the image) is hashed using the splitting planes. This process can be computationally simple, as each bit in the hash simply indicates on which side of the splitting plane the feature vector resides. Because only a bit is used for this hash, the additional computational overhead needed for deriving the Euclidean distance from the splitting plane is not necessary. Step 304 can be an iterative loop whereby each feature vector is taken and compared in succession to each of the n splitting planes. It will be appreciated that massive parallelism may be possible using the right processor, such as a DSP or graphics processor, to perform this step. - Once each feature vector is hashed into the cells created by the splitting planes, the algorithm proceeds to step 306. At
step 306, the number of feature vectors resident in each cell in the feature space is counted. At step 308, the population counts for each cell in the feature space are compared to choose a number of local maxima. In some embodiments the number of local maxima, M, is predetermined prior to image processing. In other embodiments the number of local maxima, M, is derived dynamically from the image based on the results of the residency counts of each cell in the feature space. As discussed, the maximum number of local maxima, M, can be determined based on the number of splitting planes used and the Hamming distance requirements for clusters used for image segmentation. Once the local maxima are identified, these can be used as the center of each cluster used for image segmentation. - At step 310 each feature vector under test can be assigned to one of the cluster centers determined in step 308. In some embodiments, this assignment has low computational overhead because the hash of each feature vector is compared to each of the cluster centers and the cluster having the nearest Hamming distance to the hash of the vector is selected as the cluster to which the feature vector will be assigned. In some embodiments, ties between competing clusters (i.e. clusters that are the same Hamming distance away from the hashed feature vector) can be resolved by estimating the Euclidean distance between the center of each cluster and the current feature vector. It will be appreciated that other techniques for resolving conflicts between equidistant clusters can be used, including assigning each feature vector to the equidistant cluster that has the least number of feature vectors currently assigned, or the most.
- Once each of the feature vectors is assigned to a cluster, image segments can be identified within the image at step 312. For example, adjacent (i.e. connected) pixels in the image plane that have had their feature vectors assigned to the same cluster can be considered part of the same image segment. Any known technique can be used, such as minimum threshold sizes for segments of adjacent pixels. In this way, image segments can be rapidly deduced from the image by scanning pixels in the image plane and determining whether they share the same cluster in the feature space.
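- One simple way to realize the connected components step is a flood fill over 4-connected pixels that share a cluster label. The sketch below is a plain Python/NumPy illustration, not the disclosed implementation; connected_components is our name for it.

```python
import numpy as np
from collections import deque

def connected_components(label_image):
    """Group 4-connected pixels that share the same cluster label into
    image segments. Returns an integer segment id per pixel."""
    h, w = label_image.shape
    seg = -np.ones((h, w), dtype=np.int64)
    next_id = 0
    for sy in range(h):
        for sx in range(w):
            if seg[sy, sx] >= 0:
                continue
            # Flood fill from this seed over same-label neighbors.
            q = deque([(sy, sx)])
            seg[sy, sx] = next_id
            while q:
                y, x = q.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and seg[ny, nx] < 0
                            and label_image[ny, nx] == label_image[y, x]):
                        seg[ny, nx] = next_id
                        q.append((ny, nx))
            next_id += 1
    return seg

# Two label-0 regions separated by a label-1 region give three segments.
labels = np.array([[0, 0, 1],
                   [1, 1, 1],
                   [0, 1, 1]])
seg = connected_components(labels)
```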
- In some embodiments,
algorithm 300 repeats after step 312, with the next captured image. In some embodiments, step 314 may optionally adjust the splitting planes based on the results in step 312. For example, in a training scheme the results of step 312 can be compared via a consistency score with a desired segmentation result. Based on this result, the splitting planes can be adjusted to improve this consistency score. In some embodiments, incremental improvement of splitting planes is done in real time based on other criteria, such as reducing the number of identified segments in step 312 or increasing the number of identified segments in step 312 to reach a desired range of segments. Once the splitting planes are adjusted the algorithm can return to step 302. - The proposed scheme can be similar in spirit to the Mean Shift segmentation algorithm which also seeks to identify modes in the distribution of feature vectors. Where the mean shift scheme uses a Parzen Window based scheme to estimate density in feature space, the proposed scheme uses hashing, which can be randomized or tailored to apriori information, to identify salient groupings of feature vectors.
- Like Locality Sensitive Hashing, the segmentation scheme can make implicit use of the Johnson-Lindenstrauss theorem, which justifies the use of random projection by bounding the distortion of the relative distances between the feature vectors induced by the projection process. Additionally or alternatively, the hashing scheme may use planes that are predetermined, selected from a bounded group of planes, or estimated or calculated based on some selected criteria. For example, if a particular color is known to be important, the hashing scheme can take this into account by using at least some predetermined feature planes to create the hash, or by introducing a bias into the selection process, which might otherwise be random. The hashing scheme may also include an adaptive selection algorithm to learn how to select more appropriate feature planes in future video frames. This learning process can be by any scheme known to a person of ordinary skill in the art, including using human feedback to help train parameters, genetic algorithms, decision trees, pruning, beam searching, or the like.
- From a computational perspective the principal effort revolves around computing the hash codes which involves O(nmN) operations where n denotes the number of projection directions, m denotes the dimension of the feature space and N denotes the total number of pixels or feature vectors. Note that the scheme can avoid the explicit distance computations between the feature vectors that one uses in most agglomerative and k-means segmentation schemes in favor of randomized hashing.
- In searching for the local maxima in the code space one can simply store the hash code populations in an array with 2^n entries. For each populated hash code the procedure involves interrogating on the order of C(n,1)+ . . . +C(n,k) neighboring codes, where C(n,i) denotes the binomial coefficient. For example, to run the local maxima detection algorithm on n=12 dimensions with a Hamming distance threshold of k=2, one can construct a table with 2^12=4096 entries, and each hash code would have C(12,1)+C(12,2)=12+66=78 neighbors. Typically many of the hash bins are empty, which further simplifies processing. For larger values of n one could employ a binary tree data structure to store and query the contents of the hash table efficiently.
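- The neighbor count for a given n and k is a sum of binomial coefficients, which can be checked directly with a short helper (num_neighbors is an illustrative name):

```python
from math import comb

def num_neighbors(n, k):
    """Number of n-bit codes within Hamming distance 1..k of a code:
    C(n,1) + C(n,2) + ... + C(n,k)."""
    return sum(comb(n, i) for i in range(1, k + 1))

# n = 12, k = 2 gives 12 + 66 = 78 neighbors per code, and the
# population table for n = 12 has 2**12 = 4096 entries.
neighbors = num_neighbors(12, 2)
table_size = 2 ** 12
```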
- In some embodiments, the proposed segmentation scheme can be carried out using training. For example, the Berkeley Segmentation Database, which contains 1633 manual segmentations of 300 color images, can be used for this purpose. By comparing the results of random hash segmentation of one or more of the images to one or more of the manual segmentation results, the selected splitting planes can be improved. For example, training can be used to assist in selecting the appropriate number of splitting planes or in selecting specific splitting planes to include. The manual segmentations provided by the users can be compared with the segmentations produced by the algorithm using two different measures, the Global Consistency Error and the Rand Index. The Global Consistency Error (GCE) developed by Martin, Fowlkes, Tal, and Malik can capture the difference between two segmentations in a single number between 0 and 1, where lower numbers indicate lower error. (See David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. 8th Int'l Conf. Computer Vision, pages 416-423, 2001.) The measure is designed such that if one segmentation is a refinement of the other the score will be zero. This is a useful feature since it accounts for the fact that human subjects often choose to segment scenes to various levels; however, it also implies that machine segmentations that are strongly over or under segmented can yield very low GCE scores, which can be misleading.
- To provide a different but related perspective on the algorithm one can record and report the Rand Index for each segmentation. This measure is commonly used in statistics to measure the quality of clustering algorithms. (See Ranjith Unnikrishnan, Caroline Pantofaru, and Martial Hebert, “Toward objective evaluation of image segmentation algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):929-944, 2007, ISSN 0162-8828, doi: http://doi.ieeecomputersociety.org/10.1109/TPAMI.2007.1046.) In order to compute the Rand Index one can consider every pair of pixels in the image and determine whether they are labeled consistently in the human and machine segmentations. That is, if the two pixels have the same label in the human segmentation they should have the same label in the machine segmentation, and vice versa. The Rand Index represents the fraction of the pixel pairs that are labeled consistently in the two segmentations; values that are closer to 1 indicate better segmentations. Unlike the GCE, the Rand Index will suffer if the machine segmentation is over or under segmented with respect to the human segmentation.
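- The pairwise consistency computation described above can be sketched naively (O(N^2) in the number of pixels, so suitable only for small examples); rand_index is an illustrative name.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of pixel pairs labeled consistently in two
    segmentations: same segment in both, or different in both."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

# Identical groupings score 1.0 even if the label names differ;
# a disagreement on one pixel lowers the score.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial = rand_index([0, 0, 1], [0, 1, 1])
```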
- A series of segmentation experiments can be carried out using feature spaces based on color information. In some embodiments, the color values can be averaged over a square window of width w centered around each pixel. Effectively, this averaging is a preprocessing step that can aid in smoothing noise out of the image. Increasing the size of this window can increase the level of smoothing and leads to a coarser segmentation. It will be appreciated that other filtering schemes can be used to smooth or preprocess the image before performing image segmentation.
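- The w x w window averaging can be sketched as a simple box filter; window_average is an illustrative, unoptimized helper (a practical implementation would use integral images or a separable filter), with edge pixels averaged over their available neighbors.

```python
import numpy as np

def window_average(channel, w):
    """Average one color channel over a w x w window centered on each
    pixel; a simple smoothing preprocessing step."""
    h, wd = channel.shape
    r = w // 2
    out = np.empty_like(channel, dtype=np.float64)
    for y in range(h):
        for x in range(wd):
            patch = channel[max(0, y - r):y + r + 1,
                            max(0, x - r):x + r + 1]
            out[y, x] = patch.mean()
    return out

# Smoothing a constant channel changes nothing; an isolated spike is
# spread over its window.
flat = window_average(np.full((4, 4), 5.0), 3)
spike = np.zeros((3, 3)); spike[1, 1] = 9.0
smoothed = window_average(spike, 3)
```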
- A first set of experiments was designed to determine how the performance of the segmentation scheme varied as the color space was varied. Experiments were carried out using the standard RGB values, the HSV color space, the LAB color space, and a color vector that concatenated the RGB and HSV values into a six dimensional color vector. These experiments were carried out using a randomly chosen subset of 150 of the segmentations in the database, and the average GCE and Rand Index values are reported. In all of these experiments the value of n was fixed at 12, the value of k was fixed at 1, and the value of w was fixed at 3. Table 1 summarizes the results of these experiments and indicates that the RGBHSV color space offers the best performance with respect to the two metrics.
-
TABLE 1
Results of running the segmentation procedure using various color spaces

Color     GCE      Rand
RGB       0.2805   0.7327
HSV       0.2421   0.7527
LAB       0.2578   0.7351
RGBHSV    0.2014   0.7614

- A second set of experiments explored how the performance of the scheme varied as we varied the number of splitting planes, n, and the Hamming Distance Threshold used to find local maxima, k. The experiments were carried out using the HSV color space on the same subset of 150 segmentations from the database. The mean GCE and Rand Index values were recorded for every combination of parameters and the results are summarized in Table 2. In practice, increasing values of the n parameter provide more ways to distinguish between feature vectors and can lead to over segmentation, while increasing the k parameter decreases the number of local maxima detected in the code space and leads to under segmentation.
-
TABLE 2
Results of running the segmentation procedure using various values for the n and k parameters

n     k     GCE      Rand Index
8     1     0.1952   0.7632
8     2     0.2511   0.7443
8     3     0.2450   0.7104
12    1     0.1652   0.7535
12    2     0.2250   0.7640
12    3     0.2482   0.7417
16    1     0.1005   0.7438
16    2     0.1670   0.7583
16    3     0.2236   0.7527

- A third set of experiments was carried out to investigate how the performance of the scheme varied as the window size parameter, w, was varied. The experiments were carried out using the HSV color space with the n and k parameters fixed at 12 and 1 respectively. Table 3 contains the results of these trials. Increasing the value of w increases the level of smoothing, which typically leads to under segmentation.
-
TABLE 3
Results of running the segmentation procedure using various values for the size of the smoothing window, w, in pixels

w     GCE      Rand Index
3     0.1219   0.7479
5     0.1350   0.7494
7     0.1477   0.7513
11    0.1660   0.7520
21    0.1992   0.7537

- A fourth experiment was run to compare the results of the automated segmentation procedure to each of the 1633 human segmentations in the database. For this experiment the HSV color space was employed, the number of splitting planes, n, was 12, the Hamming Distance threshold, k, was 2, and the window size, w, was 3. These parameter values were chosen to produce a visually pleasing over segmentation of the images rather than to optimize the GCE or Rand Index values. Over the entire database the mean GCE value was 0.2235 and the median GCE value was 0.2157; the mean Rand Index value was 0.7370 and the median Rand Index was 0.7833.
-
FIG. 3 shows histograms of the distributions of the GCE and Rand Index metrics over the entire data set. The graph on the left of FIG. 3 indicates the distribution of the GCE values over all of the segmentations in the database while the graph on the right denotes the distribution of the Rand Index values. -
FIG. 4 shows a few of the images from the data set along with the human segmentation and the machine segmentation. This figure compares the output of the automated segmentation procedure to human labeled segmentations. The first and fourth rows contain the input imagery, the second and fifth rows contain human segmentations while the third and sixth rows contain machine segmentations. -
FIG. 5 provides a direct comparison of segmentations produced by the methods of the present invention with those produced by the Mean Shift procedure for a few randomly chosen images in the data set. This figure compares the output of the proposed segmentation scheme with the results obtained using the Edison segmentation tool. The first row corresponds to the input image, the second to a human segmentation, the third to the mean shift result and the fourth to the randomized hash result. The parameters used for the Edison tool were (hs,hr,M)=(7,6.5,20) and the parameters used for the randomized method were (n,k,w)=(12,2,3). - One advantage of the proposed segmentation scheme is that the computational effort required scales linearly in the number of pixels and the operations required are simple and regular. In order to demonstrate this a real time version of the scheme was implemented on a Macbook Pro laptop computer. This implementation was used to segment 640 by 480 video frames at a rate of 10 frames per second using a single core of an
Intel Core 2 Duo processor running at 2.33 GHz. This rate includes the time taken for all phases of the algorithm: image acquisition, randomized hashing, local maxima detection, and connected components processing. Since almost all of the steps in the procedure are embarrassingly parallel, the algorithm is well suited to implementation on modern multi-core processors and GPUs and should be amenable to further acceleration. - The proposed algorithm can be highly parallelizable and can be implemented in real time on modest hardware. This is an advantage since it means that the method could be used as a cheap preprocessing step in a variety of image interpretation applications, much as edge detection is used today. In some embodiments, the segmentation algorithm is used instead of edge detection in image processing in a real time environment. This can allow objects to be identified when coupled with pattern matching or relative motion to a background.
- The method can be used on a mobile robot to produce a fast, rough segmentation of the scene into sky, ground, road and tree regions. This mobile robot can use image segmentation as described herein, as well as a ranging mechanism, to model its surrounding environment as described in the concurrently filed application titled "Scene Analysis Using Image And Range Data," by C. J. Taylor, which is incorporated herein by reference.
- Similarly, the segmentation scheme could be used as part of the loop in real time tracking applications where it would allow the system to automatically delineate targets. The lower processing requirements of hashing could make object detection faster and cheaper for real-time tracking. For example, if used in a distributed camera network, the segmentation scheme could be performed by CPUs on the cameras to detect an object and track it. A suitable camera network with tracking ability is described in concurrently filed application titled “Distributed Target Tracking Using Self Localizing Smart Camera Networks,” by C. J. Taylor, which is incorporated herein by reference.
- In any of these applications, the real time segmentation scheme can be used as a pre-processing step that suggests possible groupings in the image to higher level interpretation algorithms. In this way a person of ordinary skill in the art could use this segmentation scheme as a preprocessing stage to any preferred image processing technique suitable for the application. The system could, for instance, focus its attention on regions based on their size, shape, texture or position in the image.
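- As one illustration of such attention-by-region filtering, the segment label image can be summarized into per-region size and position statistics that a higher level algorithm could filter on. The helper below is a hypothetical sketch and assumes segments are supplied as an integer label image.

```python
import numpy as np

def region_stats(segments):
    """Per-segment pixel count and centroid from an integer label image,
    as a basis for filtering regions by size or position."""
    ys, xs = np.indices(segments.shape)
    stats = {}
    for label in np.unique(segments):
        mask = segments == label
        stats[int(label)] = {
            "size": int(mask.sum()),                       # area in pixels
            "centroid": (float(ys[mask].mean()),           # (row, col)
                         float(xs[mask].mean())),
        }
    return stats
```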
- While the segmentation algorithm has been discussed and implemented in the context of color descriptors, it could equally easily be applied to feature spaces with higher dimension and can involve any combination of features, including both color and texture values and/or motion vectors.
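- The hashing step itself is unchanged when the per-pixel descriptor grows: only the dimension of the random planes changes. The following hypothetical sketch stacks color with a crude texture cue (gradient magnitude) into a 4-dimensional feature vector; the choice of gradient magnitude as the texture channel is an assumption for illustration.

```python
import numpy as np

def pixel_features(image):
    """Stack RGB color with gradient magnitude (a crude texture cue)
    into a 4-D feature vector per pixel."""
    img = image.astype(float)
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    texture = np.sqrt(gx ** 2 + gy ** 2)
    return np.dstack([img, texture[..., None]]).reshape(-1, 4)

def hash_features(feats, n_planes=12, seed=0):
    """Same randomized-plane hashing as in the color-only case;
    the planes simply live in the higher-dimensional feature space."""
    rng = np.random.default_rng(seed)
    d = feats.shape[1]
    normals = rng.normal(size=(n_planes, d))
    anchors = feats[rng.integers(len(feats), size=n_planes)]
    biases = np.einsum('ij,ij->i', normals, anchors)
    bits = (feats @ normals.T) > biases
    return bits.astype(np.int64) @ (1 << np.arange(n_planes))
```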
- It should be readily apparent that the image processing techniques taught herein are suitable for execution in a computing environment that includes at least one processor. Suitable processing environments can include Intel, PowerPC, ARM, or other CPU-based systems having memory and a processor, but can also include any suitable embedded system, DSP, GPU, APU, or other multi-core processing environment including related hardware and memory. Similarly, the algorithms taught herein can be implemented by dedicated logic. Execution of these algorithms and techniques is not limited to a single processor environment, and can, in some contemplated embodiments, be performed in a client-server environment, a cloud computing environment, a multicore environment, a multithreaded environment, a mobile device or devices, etc.
- Although the invention has been described with reference to exemplary embodiments, it is not limited thereto. Those skilled in the art will appreciate that numerous changes and modifications may be made to the preferred embodiments of the invention and that such changes and modifications may be made without departing from the true spirit of the invention. It is therefore intended that the appended claims be construed to cover all such equivalent variations as fall within the true spirit and scope of the invention.
Claims (4)
1. A method for identifying salient segments in an image comprising the steps of:
choosing a plurality of planes that divide a feature space into a plurality of cells;
generating a first set of hash codes for a first set of pixels in an image based on a location in the feature space of a feature vector associated with each pixel in the first set, whereby the location of each feature vector relative to each of the plurality of planes contributes a binary value to each hash code;
determining a second set of hash codes selected from the first set of hash codes, wherein the second set of hash codes indicates local maxima of clusters of pixels;
assigning each of the first set of pixels to one of the second set of hash codes based on the Hamming distance between a first hash code from the first set of hash codes assigned to each pixel and each of the second set of hash codes; and
identifying segments in an image based on groups of adjacent pixels sharing common hash codes.
2. The method of claim 1, wherein the step of choosing a plurality of planes further comprises selecting a predetermined number of random planes in the feature space.
3. The method of claim 1, wherein the feature space is a three-dimensional color space.
4. The method of claim 1, wherein the step of choosing a plurality of planes further comprises training the plurality of planes based on human interpretations of images.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/309,551 US20120250984A1 (en) | 2010-12-01 | 2011-12-01 | Image segmentation for distributed target tracking and scene analysis |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US41880510P | 2010-12-01 | 2010-12-01 | |
US41879910P | 2010-12-01 | 2010-12-01 | |
US41878910P | 2010-12-01 | 2010-12-01 | |
US13/309,551 US20120250984A1 (en) | 2010-12-01 | 2011-12-01 | Image segmentation for distributed target tracking and scene analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
US20120250984A1 true US20120250984A1 (en) | 2012-10-04 |
Family
ID=46926741
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/309,543 Active 2033-09-09 US9449233B2 (en) | 2010-12-01 | 2011-12-01 | Distributed target tracking using self localizing smart camera networks |
US13/309,558 Active US8867793B2 (en) | 2010-12-01 | 2011-12-01 | Scene analysis using image and range data |
US13/309,551 Abandoned US20120250984A1 (en) | 2010-12-01 | 2011-12-01 | Image segmentation for distributed target tracking and scene analysis |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/309,543 Active 2033-09-09 US9449233B2 (en) | 2010-12-01 | 2011-12-01 | Distributed target tracking using self localizing smart camera networks |
US13/309,558 Active US8867793B2 (en) | 2010-12-01 | 2011-12-01 | Scene analysis using image and range data |
Country Status (1)
Country | Link |
---|---|
US (3) | US9449233B2 (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120170801A1 (en) * | 2010-12-30 | 2012-07-05 | De Oliveira Luciano Reboucas | System for Food Recognition Method Using Portable Devices Having Digital Cameras |
US8495425B2 (en) * | 2011-03-01 | 2013-07-23 | International Business Machines Corporation | System and method to efficiently identify bad components in a multi-node system utilizing multiple node topologies |
US20130197859A1 (en) * | 2012-01-30 | 2013-08-01 | International Business Machines Corporation | Tracking Entities by Means of Hash Values |
US20140040262A1 (en) * | 2012-08-03 | 2014-02-06 | Adobe Systems Incorporated | Techniques for cloud-based similarity searches |
US8867793B2 (en) | 2010-12-01 | 2014-10-21 | The Trustees Of The University Of Pennsylvania | Scene analysis using image and range data |
CN104574440A (en) * | 2014-12-30 | 2015-04-29 | 安科智慧城市技术(中国)有限公司 | Video movement target tracking method and device |
CN104637052A (en) * | 2015-01-22 | 2015-05-20 | 西南交通大学 | Object tracking method based on target guide significance detection |
CN105989611A (en) * | 2015-02-05 | 2016-10-05 | 南京理工大学 | Blocking perception Hash tracking method with shadow removing |
US20180293742A1 (en) * | 2014-09-19 | 2018-10-11 | Brain Corporation | Apparatus and methods for saliency detection based on color occurrence analysis |
US20190068940A1 (en) * | 2017-08-31 | 2019-02-28 | Disney Enterprises Inc. | Large-Scale Environmental Mapping In Real-Time By A Robotic System |
CN109598726A (en) * | 2018-10-26 | 2019-04-09 | 哈尔滨理工大学 | A kind of adapting to image target area dividing method based on SLIC |
US10298970B2 (en) * | 2014-12-12 | 2019-05-21 | Huawei Technologies Co., Ltd. | Image transmission method and apparatus |
CN109844807A (en) * | 2016-08-19 | 2019-06-04 | 讯宝科技有限责任公司 | For the mthods, systems and devices of size to be split and determined to object |
US20190180086A1 (en) * | 2017-06-30 | 2019-06-13 | Beijing Didi Infinity Technology And Development Co. Ltd. | Systems and methods for verifying authenticity of id photo |
CN111680176A (en) * | 2020-04-20 | 2020-09-18 | 武汉大学 | Remote sensing image retrieval method and system based on attention and bidirectional feature fusion |
WO2021086721A1 (en) * | 2019-10-31 | 2021-05-06 | Siemens Healthcare Diagnostics Inc. | Methods and apparatus for hashing and retrieval of training images used in hiln determinations of specimens in automated diagnostic analysis systems |
US11184604B2 (en) * | 2016-04-04 | 2021-11-23 | Compound Eye, Inc. | Passive stereo depth sensing |
US11270467B2 (en) | 2020-01-21 | 2022-03-08 | Compound Eye, Inc. | System and method for camera calibration |
CN115578694A (en) * | 2022-11-18 | 2023-01-06 | 合肥英特灵达信息技术有限公司 | Video analysis computing power scheduling method, system, electronic equipment and storage medium |
US11651581B2 (en) | 2019-11-27 | 2023-05-16 | Compound Eye, Inc. | System and method for correspondence map determination |
US11935249B2 (en) | 2020-01-21 | 2024-03-19 | Compound Eye, Inc. | System and method for egomotion estimation |
Families Citing this family (140)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9526156B2 (en) * | 2010-05-18 | 2016-12-20 | Disney Enterprises, Inc. | System and method for theatrical followspot control interface |
EP4290856A3 (en) | 2010-09-13 | 2024-03-06 | Contour IP Holding, LLC | Portable digital video camera configured for remote image acquisition control and viewing |
US8451344B1 (en) * | 2011-03-24 | 2013-05-28 | Amazon Technologies, Inc. | Electronic devices with side viewing capability |
US9117281B2 (en) * | 2011-11-02 | 2015-08-25 | Microsoft Corporation | Surface segmentation from RGB and depth images |
US9072929B1 (en) * | 2011-12-01 | 2015-07-07 | Nebraska Global Investment Company, LLC | Image capture system |
US9183631B2 (en) * | 2012-06-29 | 2015-11-10 | Mitsubishi Electric Research Laboratories, Inc. | Method for registering points and planes of 3D data in multiple coordinate systems |
US8995903B2 (en) | 2012-07-25 | 2015-03-31 | Gopro, Inc. | Credential transfer management camera network |
US9036016B2 (en) | 2012-07-25 | 2015-05-19 | Gopro, Inc. | Initial camera mode management system |
US9025014B2 (en) | 2012-07-25 | 2015-05-05 | Gopro, Inc. | Device detection camera system |
US8994800B2 (en) | 2012-07-25 | 2015-03-31 | Gopro, Inc. | Credential transfer management camera system |
US9189888B1 (en) * | 2013-01-14 | 2015-11-17 | Bentley Systems, Incorporated | Point cloud modeling based on user-provided seed |
DE102013002554A1 (en) * | 2013-02-15 | 2014-08-21 | Jungheinrich Aktiengesellschaft | Method for detecting objects in a warehouse and / or for spatial orientation in a warehouse |
TW201442511A (en) * | 2013-04-17 | 2014-11-01 | Aver Information Inc | Tracking shooting system and method |
CN103235825B (en) * | 2013-05-08 | 2016-05-25 | 重庆大学 | A kind of magnanimity face recognition search engine design method based on Hadoop cloud computing framework |
US9614898B1 (en) * | 2013-05-27 | 2017-04-04 | Surround.IO | Distributed event engine |
DE102013209940A1 (en) * | 2013-05-28 | 2014-12-04 | Conti Temic Microelectronic Gmbh | Camera system for vehicles |
WO2014208337A1 (en) * | 2013-06-28 | 2014-12-31 | シャープ株式会社 | Location detection device |
JP5438861B1 (en) * | 2013-07-11 | 2014-03-12 | パナソニック株式会社 | Tracking support device, tracking support system, and tracking support method |
US10038740B2 (en) | 2013-08-19 | 2018-07-31 | Nant Holdings Ip, Llc | Camera-to-camera interactions, systems and methods |
JP2015126474A (en) * | 2013-12-27 | 2015-07-06 | ソニー株式会社 | Information processing apparatus, imaging apparatus, information processing method, information processing program, and imaging system |
US10009099B2 (en) * | 2014-03-29 | 2018-06-26 | Intel Corporation | Techniques for communication with body-carried devices |
US9607207B1 (en) * | 2014-03-31 | 2017-03-28 | Amazon Technologies, Inc. | Plane-fitting edge detection |
CN104038729A (en) * | 2014-05-05 | 2014-09-10 | 重庆大学 | Cascade-type multi-camera relay tracing method and system |
US9844360B2 (en) | 2014-10-27 | 2017-12-19 | Clear Guide Medical, Inc. | System and devices for image targeting |
US9423669B2 (en) | 2014-11-04 | 2016-08-23 | Qualcomm Incorporated | Method and apparatus for camera autofocus based on Wi-Fi ranging technique |
US9600892B2 (en) * | 2014-11-06 | 2017-03-21 | Symbol Technologies, Llc | Non-parametric method of and system for estimating dimensions of objects of arbitrary shape |
US9396554B2 (en) | 2014-12-05 | 2016-07-19 | Symbol Technologies, Llc | Apparatus for and method of estimating dimensions of an object associated with a code in automatic response to reading the code |
CN104463899B (en) * | 2014-12-31 | 2017-09-22 | 北京格灵深瞳信息技术有限公司 | A kind of destination object detection, monitoring method and its device |
WO2016140680A1 (en) * | 2015-03-05 | 2016-09-09 | Hewlett Packard Enterprise Development Lp | Multi-level object re-identification |
CN106709899B (en) | 2015-07-15 | 2020-06-02 | 华为终端有限公司 | Method, device and equipment for calculating relative positions of two cameras |
US9928605B2 (en) * | 2015-09-25 | 2018-03-27 | Intel Corporation | Real-time cascaded object recognition |
US9953430B1 (en) * | 2015-10-29 | 2018-04-24 | Indoor Reality Inc. | Methods for detecting luminary fixtures |
US10352689B2 (en) | 2016-01-28 | 2019-07-16 | Symbol Technologies, Llc | Methods and systems for high precision locationing with depth values |
US10145955B2 (en) | 2016-02-04 | 2018-12-04 | Symbol Technologies, Llc | Methods and systems for processing point-cloud data with a line scanner |
US20170230637A1 (en) * | 2016-02-07 | 2017-08-10 | Google Inc. | Multiple camera computing system having camera-to-camera communications link |
US10721451B2 (en) | 2016-03-23 | 2020-07-21 | Symbol Technologies, Llc | Arrangement for, and method of, loading freight into a shipping container |
CN105844669B (en) * | 2016-03-28 | 2018-11-13 | 华中科技大学 | A kind of video object method for real time tracking based on local Hash feature |
US11356334B2 (en) * | 2016-04-15 | 2022-06-07 | Nec Corporation | Communication efficient sparse-reduce in distributed machine learning |
US9805240B1 (en) | 2016-04-18 | 2017-10-31 | Symbol Technologies, Llc | Barcode scanning and dimensioning |
US10497014B2 (en) * | 2016-04-22 | 2019-12-03 | Inreality Limited | Retail store digital shelf for recommending products utilizing facial recognition in a peer to peer network |
US9946256B1 (en) | 2016-06-10 | 2018-04-17 | Gopro, Inc. | Wireless communication device for communicating with an unmanned aerial vehicle |
US9998907B2 (en) * | 2016-07-25 | 2018-06-12 | Kiana Analytics Inc. | Method and apparatus for uniquely identifying wireless devices |
CN106295563B (en) * | 2016-08-09 | 2019-06-07 | 武汉中观自动化科技有限公司 | A kind of system and method that airbound target flying quality is assessed based on multi-vision visual |
CN106295594B (en) * | 2016-08-17 | 2019-10-15 | 北京大学 | A kind of across camera method for tracking target and device based on dynamic route tree |
GB2553108B (en) * | 2016-08-22 | 2020-07-15 | Canon Kk | Method, processing device and system for managing copies of media samples in a system comprising a plurality of interconnected network cameras |
US10044972B1 (en) | 2016-09-30 | 2018-08-07 | Gopro, Inc. | Systems and methods for automatically transferring audiovisual content |
US10397415B1 (en) | 2016-09-30 | 2019-08-27 | Gopro, Inc. | Systems and methods for automatically transferring audiovisual content |
DE102016120386A1 (en) * | 2016-10-26 | 2018-04-26 | Jungheinrich Aktiengesellschaft | Method for detecting objects in a warehouse and industrial truck with a device for detecting objects in a warehouse |
JP7256746B2 (en) * | 2016-10-31 | 2023-04-12 | ヴィザル・テクノロジー・ソシエテ・ア・レスポンサビリテ・リミテ | Apparatus and method for detecting optically modulated signals in a video stream |
US11042161B2 (en) | 2016-11-16 | 2021-06-22 | Symbol Technologies, Llc | Navigation control method and apparatus in a mobile automation system |
US10451405B2 (en) | 2016-11-22 | 2019-10-22 | Symbol Technologies, Llc | Dimensioning system for, and method of, dimensioning freight in motion along an unconstrained path in a venue |
CN108089152B (en) * | 2016-11-23 | 2020-07-03 | 杭州海康威视数字技术股份有限公司 | Equipment control method, device and system |
US10354411B2 (en) | 2016-12-20 | 2019-07-16 | Symbol Technologies, Llc | Methods, systems and apparatus for segmenting objects |
KR102629934B1 (en) * | 2016-12-22 | 2024-01-26 | 에스케이플래닛 주식회사 | Imaging apparatus, and control method thereof |
US10839203B1 (en) * | 2016-12-27 | 2020-11-17 | Amazon Technologies, Inc. | Recognizing and tracking poses using digital imagery captured from multiple fields of view |
US11665308B2 (en) | 2017-01-31 | 2023-05-30 | Tetavi, Ltd. | System and method for rendering free viewpoint video for sport applications |
GB2560177A (en) | 2017-03-01 | 2018-09-05 | Thirdeye Labs Ltd | Training a computational neural network |
GB2560387B (en) | 2017-03-10 | 2022-03-09 | Standard Cognition Corp | Action identification using neural networks |
US10699421B1 (en) | 2017-03-29 | 2020-06-30 | Amazon Technologies, Inc. | Tracking objects in three-dimensional space using calibrated visual cameras and depth cameras |
US20180285438A1 (en) * | 2017-03-31 | 2018-10-04 | Change Healthcase Holdings, Llc | Database system and method for identifying a subset of related reports |
CN107038753B (en) * | 2017-04-14 | 2020-06-05 | 中国科学院深圳先进技术研究院 | Stereoscopic vision three-dimensional reconstruction system and method |
US10726273B2 (en) | 2017-05-01 | 2020-07-28 | Symbol Technologies, Llc | Method and apparatus for shelf feature and object placement detection from shelf images |
US10591918B2 (en) | 2017-05-01 | 2020-03-17 | Symbol Technologies, Llc | Fixed segmented lattice planning for a mobile automation apparatus |
US10663590B2 (en) | 2017-05-01 | 2020-05-26 | Symbol Technologies, Llc | Device and method for merging lidar data |
US11093896B2 (en) | 2017-05-01 | 2021-08-17 | Symbol Technologies, Llc | Product status detection system |
US11367092B2 (en) | 2017-05-01 | 2022-06-21 | Symbol Technologies, Llc | Method and apparatus for extracting and processing price text from an image set |
US10949798B2 (en) | 2017-05-01 | 2021-03-16 | Symbol Technologies, Llc | Multimodal localization and mapping for a mobile automation apparatus |
US11449059B2 (en) | 2017-05-01 | 2022-09-20 | Symbol Technologies, Llc | Obstacle detection for a mobile automation apparatus |
WO2018201423A1 (en) | 2017-05-05 | 2018-11-08 | Symbol Technologies, Llc | Method and apparatus for detecting and interpreting price label text |
CN107358200B (en) * | 2017-07-13 | 2020-09-18 | 常州大学 | Multi-camera non-overlapping vision field pedestrian matching method based on sparse learning |
JP6928499B2 (en) * | 2017-07-21 | 2021-09-01 | 株式会社タダノ | Guide information display device and work equipment |
US10055853B1 (en) | 2017-08-07 | 2018-08-21 | Standard Cognition, Corp | Subject identification and tracking using image recognition |
US11200692B2 (en) | 2017-08-07 | 2021-12-14 | Standard Cognition, Corp | Systems and methods to check-in shoppers in a cashier-less store |
US11250376B2 (en) | 2017-08-07 | 2022-02-15 | Standard Cognition, Corp | Product correlation analysis using deep learning |
US10474991B2 (en) | 2017-08-07 | 2019-11-12 | Standard Cognition, Corp. | Deep learning-based store realograms |
US10650545B2 (en) | 2017-08-07 | 2020-05-12 | Standard Cognition, Corp. | Systems and methods to check-in shoppers in a cashier-less store |
US10853965B2 (en) | 2017-08-07 | 2020-12-01 | Standard Cognition, Corp | Directional impression analysis using deep learning |
US11232687B2 (en) | 2017-08-07 | 2022-01-25 | Standard Cognition, Corp | Deep learning-based shopper statuses in a cashier-less store |
US10133933B1 (en) | 2017-08-07 | 2018-11-20 | Standard Cognition, Corp | Item put and take detection using image recognition |
US11023850B2 (en) | 2017-08-07 | 2021-06-01 | Standard Cognition, Corp. | Realtime inventory location management using deep learning |
US10445694B2 (en) | 2017-08-07 | 2019-10-15 | Standard Cognition, Corp. | Realtime inventory tracking using deep learning |
US10474988B2 (en) | 2017-08-07 | 2019-11-12 | Standard Cognition, Corp. | Predicting inventory events using foreground/background processing |
US10127438B1 (en) | 2017-08-07 | 2018-11-13 | Standard Cognition, Corp | Predicting inventory events using semantic diffing |
US10572763B2 (en) | 2017-09-07 | 2020-02-25 | Symbol Technologies, Llc | Method and apparatus for support surface edge detection |
US10521914B2 (en) | 2017-09-07 | 2019-12-31 | Symbol Technologies, Llc | Multi-sensor object recognition system and method |
US11232294B1 (en) | 2017-09-27 | 2022-01-25 | Amazon Technologies, Inc. | Generating tracklets from digital imagery |
US10110994B1 (en) | 2017-11-21 | 2018-10-23 | Nokia Technologies Oy | Method and apparatus for providing voice communication with spatial audio |
US11030442B1 (en) | 2017-12-13 | 2021-06-08 | Amazon Technologies, Inc. | Associating events with actors based on digital imagery |
US11284041B1 (en) | 2017-12-13 | 2022-03-22 | Amazon Technologies, Inc. | Associating items with actors based on digital imagery |
JP2019121069A (en) * | 2017-12-28 | 2019-07-22 | キヤノン株式会社 | Image processing device, image processing method, and program |
US10706505B2 (en) * | 2018-01-24 | 2020-07-07 | GM Global Technology Operations LLC | Method and system for generating a range image using sparse depth data |
US10809078B2 (en) | 2018-04-05 | 2020-10-20 | Symbol Technologies, Llc | Method, system and apparatus for dynamic path generation |
US10823572B2 (en) | 2018-04-05 | 2020-11-03 | Symbol Technologies, Llc | Method, system and apparatus for generating navigational data |
US11327504B2 (en) | 2018-04-05 | 2022-05-10 | Symbol Technologies, Llc | Method, system and apparatus for mobile automation apparatus localization |
US10832436B2 (en) | 2018-04-05 | 2020-11-10 | Symbol Technologies, Llc | Method, system and apparatus for recovering label positions |
US10740911B2 (en) | 2018-04-05 | 2020-08-11 | Symbol Technologies, Llc | Method, system and apparatus for correcting translucency artifacts in data representing a support structure |
US11126863B2 (en) | 2018-06-08 | 2021-09-21 | Southwest Airlines Co. | Detection system |
CN109035295B (en) * | 2018-06-25 | 2021-01-12 | 广州杰赛科技股份有限公司 | Multi-target tracking method, device, computer equipment and storage medium |
US11482045B1 (en) | 2018-06-28 | 2022-10-25 | Amazon Technologies, Inc. | Associating events with actors using digital imagery and machine learning |
US11468698B1 (en) | 2018-06-28 | 2022-10-11 | Amazon Technologies, Inc. | Associating events with actors using digital imagery and machine learning |
US11468681B1 (en) | 2018-06-28 | 2022-10-11 | Amazon Technologies, Inc. | Associating events with actors using digital imagery and machine learning |
US11366865B1 (en) * | 2018-09-05 | 2022-06-21 | Amazon Technologies, Inc. | Distributed querying of computing hubs |
US11010920B2 (en) | 2018-10-05 | 2021-05-18 | Zebra Technologies Corporation | Method, system and apparatus for object detection in point clouds |
US11506483B2 (en) | 2018-10-05 | 2022-11-22 | Zebra Technologies Corporation | Method, system and apparatus for support structure depth determination |
US11090811B2 (en) | 2018-11-13 | 2021-08-17 | Zebra Technologies Corporation | Method and apparatus for labeling of support structures |
US11003188B2 (en) | 2018-11-13 | 2021-05-11 | Zebra Technologies Corporation | Method, system and apparatus for obstacle handling in navigational path generation |
US11416000B2 (en) | 2018-12-07 | 2022-08-16 | Zebra Technologies Corporation | Method and apparatus for navigational ray tracing |
US11079240B2 (en) | 2018-12-07 | 2021-08-03 | Zebra Technologies Corporation | Method, system and apparatus for adaptive particle filter localization |
US11100303B2 (en) | 2018-12-10 | 2021-08-24 | Zebra Technologies Corporation | Method, system and apparatus for auxiliary label detection and association |
US11015938B2 (en) | 2018-12-12 | 2021-05-25 | Zebra Technologies Corporation | Method, system and apparatus for navigational assistance |
US10731970B2 (en) | 2018-12-13 | 2020-08-04 | Zebra Technologies Corporation | Method, system and apparatus for support structure detection |
CA3028708A1 (en) | 2018-12-28 | 2020-06-28 | Zih Corp. | Method, system and apparatus for dynamic loop closure in mapping trajectories |
WO2020161646A2 (en) * | 2019-02-05 | 2020-08-13 | Rey Focusing Ltd. | Focus tracking system |
US11232575B2 (en) | 2019-04-18 | 2022-01-25 | Standard Cognition, Corp | Systems and methods for deep learning-based subject persistence |
US11402846B2 (en) | 2019-06-03 | 2022-08-02 | Zebra Technologies Corporation | Method, system and apparatus for mitigating data capture light leakage |
US11662739B2 (en) | 2019-06-03 | 2023-05-30 | Zebra Technologies Corporation | Method, system and apparatus for adaptive ceiling-based localization |
US11960286B2 (en) | 2019-06-03 | 2024-04-16 | Zebra Technologies Corporation | Method, system and apparatus for dynamic task sequencing |
US11200677B2 (en) | 2019-06-03 | 2021-12-14 | Zebra Technologies Corporation | Method, system and apparatus for shelf edge detection |
US11341663B2 (en) | 2019-06-03 | 2022-05-24 | Zebra Technologies Corporation | Method, system and apparatus for detecting support structure obstructions |
US11080566B2 (en) | 2019-06-03 | 2021-08-03 | Zebra Technologies Corporation | Method, system and apparatus for gap detection in support structures with peg regions |
US11151743B2 (en) | 2019-06-03 | 2021-10-19 | Zebra Technologies Corporation | Method, system and apparatus for end of aisle detection |
US11222460B2 (en) * | 2019-07-22 | 2022-01-11 | Scale AI, Inc. | Visualization techniques for data labeling |
CN112788227B (en) * | 2019-11-07 | 2022-06-14 | 富泰华工业(深圳)有限公司 | Target tracking shooting method, target tracking shooting device, computer device and storage medium |
US11507103B2 (en) | 2019-12-04 | 2022-11-22 | Zebra Technologies Corporation | Method, system and apparatus for localization-based historical obstacle handling |
US11107238B2 (en) | 2019-12-13 | 2021-08-31 | Zebra Technologies Corporation | Method, system and apparatus for detecting item facings |
US11822333B2 (en) | 2020-03-30 | 2023-11-21 | Zebra Technologies Corporation | Method, system and apparatus for data capture illumination control |
US11398094B1 (en) | 2020-04-06 | 2022-07-26 | Amazon Technologies, Inc. | Locally and globally locating actors by digital cameras and machine learning |
US11443516B1 (en) | 2020-04-06 | 2022-09-13 | Amazon Technologies, Inc. | Locally and globally locating actors by digital cameras and machine learning |
US11303853B2 (en) | 2020-06-26 | 2022-04-12 | Standard Cognition, Corp. | Systems and methods for automated design of camera placement and cameras arrangements for autonomous checkout |
US11620900B2 (en) * | 2020-06-26 | 2023-04-04 | Intel Corporation | Object tracking technology based on cognitive representation of a location in space |
US11361468B2 (en) | 2020-06-26 | 2022-06-14 | Standard Cognition, Corp. | Systems and methods for automated recalibration of sensors for autonomous checkout |
US11450024B2 (en) | 2020-07-17 | 2022-09-20 | Zebra Technologies Corporation | Mixed depth object detection |
US20220101532A1 (en) * | 2020-09-29 | 2022-03-31 | Samsung Electronics Co., Ltd. | Method and device for performing plane detection |
US11593915B2 (en) | 2020-10-21 | 2023-02-28 | Zebra Technologies Corporation | Parallax-tolerant panoramic image generation |
US11392891B2 (en) | 2020-11-03 | 2022-07-19 | Zebra Technologies Corporation | Item placement detection and optimization in material handling systems |
US11847832B2 (en) | 2020-11-11 | 2023-12-19 | Zebra Technologies Corporation | Object classification for autonomous navigation systems |
CN112907528B (en) | 2021-02-09 | 2021-11-09 | 南京航空航天大学 | Point cloud-to-image-based composite material laying wire surface defect detection and identification method |
KR20240005790A (en) * | 2021-04-30 | 2024-01-12 | 나이앤틱, 인크. | Repeatability prediction of points of interest |
US11954882B2 (en) | 2021-06-17 | 2024-04-09 | Zebra Technologies Corporation | Feature-based georegistration for mobile computing devices |
CN116027269B (en) * | 2023-03-29 | 2023-06-06 | 成都量芯集成科技有限公司 | Plane scene positioning method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020196976A1 (en) * | 2001-04-24 | 2002-12-26 | Mihcak M. Kivanc | Robust recognizer of perceptually similar content |
US20110235908A1 (en) * | 2010-03-23 | 2011-09-29 | Microsoft Corporation | Partition min-hash for partial-duplicate image determination |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5436672A (en) * | 1994-05-27 | 1995-07-25 | Symah Vision | Video processing system for modifying a zone in successive images |
EP0971242A1 (en) * | 1998-07-10 | 2000-01-12 | Cambridge Consultants Limited | Sensor signal processing |
US6567116B1 (en) * | 1998-11-20 | 2003-05-20 | James A. Aman | Multiple object tracking system |
US6441734B1 (en) * | 2000-12-12 | 2002-08-27 | Koninklijke Philips Electronics N.V. | Intruder detection through trajectory analysis in monitoring and surveillance systems |
US6847728B2 (en) * | 2002-12-09 | 2005-01-25 | Sarnoff Corporation | Dynamic depth recovery from multiple synchronized video streams |
JP2004198211A (en) * | 2002-12-18 | 2004-07-15 | Aisin Seiki Co Ltd | Apparatus for monitoring vicinity of mobile object |
US7007888B2 (en) * | 2003-11-25 | 2006-03-07 | The Boeing Company | Inertial position target measuring systems and methods |
US7421113B2 (en) | 2005-03-30 | 2008-09-02 | The Trustees Of The University Of Pennsylvania | System and method for localizing imaging devices |
US7489804B2 (en) * | 2005-09-26 | 2009-02-10 | Cognisign Llc | Apparatus and method for trajectory-based identification of digital data content |
US8325979B2 (en) * | 2006-10-30 | 2012-12-04 | Tomtom Global Content B.V. | Method and apparatus for detecting objects from terrestrial based mobile mapping data |
WO2008149925A1 (en) * | 2007-06-08 | 2008-12-11 | Nikon Corporation | Imaging device, image display device, and program |
US8116527B2 (en) * | 2009-10-07 | 2012-02-14 | The United States Of America As Represented By The Secretary Of The Army | Using video-based imagery for automated detection, tracking, and counting of moving objects, in particular those objects having image characteristics similar to background |
US9449233B2 (en) | 2010-12-01 | 2016-09-20 | The Trustees Of The University Of Pennsylvania | Distributed target tracking using self localizing smart camera networks |
- 2011
- 2011-12-01 US US13/309,543 patent/US9449233B2/en active Active
- 2011-12-01 US US13/309,558 patent/US8867793B2/en active Active
- 2011-12-01 US US13/309,551 patent/US20120250984A1/en not_active Abandoned
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020196976A1 (en) * | 2001-04-24 | 2002-12-26 | Mihcak M. Kivanc | Robust recognizer of perceptually similar content |
US20110235908A1 (en) * | 2010-03-23 | 2011-09-29 | Microsoft Corporation | Partition min-hash for partial-duplicate image determination |
Non-Patent Citations (1)
Title |
---|
Camillo J. Taylor and Anthony Cowley. Fast Segmentation via Randomized Hashing. In A. Cavallaro, S. Prince and D. Alexander, editors, Proceedings of the British Machine Vision Conference, pages 60.1-60.11. BMVA Press, September 2009. doi:10.5244/C.23.60 * |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8867793B2 (en) | 2010-12-01 | 2014-10-21 | The Trustees Of The University Of Pennsylvania | Scene analysis using image and range data |
US9449233B2 (en) | 2010-12-01 | 2016-09-20 | The Trustees Of The University Of Pennsylvania | Distributed target tracking using self localizing smart camera networks |
US8625889B2 (en) * | 2010-12-30 | 2014-01-07 | Samsung Electronics Co., Ltd. | System for food recognition method using portable devices having digital cameras |
US20120170801A1 (en) * | 2010-12-30 | 2012-07-05 | De Oliveira Luciano Reboucas | System for Food Recognition Method Using Portable Devices Having Digital Cameras |
US8495425B2 (en) * | 2011-03-01 | 2013-07-23 | International Business Machines Corporation | System and method to efficiently identify bad components in a multi-node system utilizing multiple node topologies |
US20130197859A1 (en) * | 2012-01-30 | 2013-08-01 | International Business Machines Corporation | Tracking Entities by Means of Hash Values |
US9600443B2 (en) * | 2012-01-30 | 2017-03-21 | International Business Machines Corporation | Tracking entities by means of hash values |
US10042818B2 (en) | 2012-01-30 | 2018-08-07 | International Business Machines Corporation | Tracking entities by means of hash values |
US20140040262A1 (en) * | 2012-08-03 | 2014-02-06 | Adobe Systems Incorporated | Techniques for cloud-based similarity searches |
US9165068B2 (en) * | 2012-08-03 | 2015-10-20 | Adobe Systems Incorporated | Techniques for cloud-based similarity searches |
US10810456B2 (en) * | 2014-09-19 | 2020-10-20 | Brain Corporation | Apparatus and methods for saliency detection based on color occurrence analysis |
US20180293742A1 (en) * | 2014-09-19 | 2018-10-11 | Brain Corporation | Apparatus and methods for saliency detection based on color occurrence analysis |
US10298970B2 (en) * | 2014-12-12 | 2019-05-21 | Huawei Technologies Co., Ltd. | Image transmission method and apparatus |
CN104574440A (en) * | 2014-12-30 | 2015-04-29 | 安科智慧城市技术(中国)有限公司 | Video moving-target tracking method and device |
CN104637052A (en) * | 2015-01-22 | 2015-05-20 | 西南交通大学 | Object tracking method based on target-guided saliency detection |
CN105989611B (en) * | 2015-02-05 | 2019-01-18 | 南京理工大学 | Block-based perceptual hash tracking method with shadow removal |
CN105989611A (en) * | 2015-02-05 | 2016-10-05 | 南京理工大学 | Block-based perceptual hash tracking method with shadow removal |
US11184604B2 (en) * | 2016-04-04 | 2021-11-23 | Compound Eye, Inc. | Passive stereo depth sensing |
CN109844807A (en) * | 2016-08-19 | 2019-06-04 | 讯宝科技有限责任公司 | Methods, systems, and apparatus for segmenting and dimensioning objects |
US20190180086A1 (en) * | 2017-06-30 | 2019-06-13 | Beijing Didi Infinity Technology And Development Co. Ltd. | Systems and methods for verifying authenticity of id photo |
US11003895B2 (en) * | 2017-06-30 | 2021-05-11 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for verifying authenticity of ID photo |
US10484659B2 (en) * | 2017-08-31 | 2019-11-19 | Disney Enterprises, Inc. | Large-scale environmental mapping in real-time by a robotic system |
US20190068940A1 (en) * | 2017-08-31 | 2019-02-28 | Disney Enterprises Inc. | Large-Scale Environmental Mapping In Real-Time By A Robotic System |
CN109598726A (en) * | 2018-10-26 | 2019-04-09 | 哈尔滨理工大学 | Adaptive image target-region segmentation method based on SLIC |
WO2021086721A1 (en) * | 2019-10-31 | 2021-05-06 | Siemens Healthcare Diagnostics Inc. | Methods and apparatus for hashing and retrieval of training images used in hiln determinations of specimens in automated diagnostic analysis systems |
US11651581B2 (en) | 2019-11-27 | 2023-05-16 | Compound Eye, Inc. | System and method for correspondence map determination |
US11869218B2 (en) | 2020-01-21 | 2024-01-09 | Compound Eye, Inc. | System and method for camera calibration |
US11270467B2 (en) | 2020-01-21 | 2022-03-08 | Compound Eye, Inc. | System and method for camera calibration |
US11935249B2 (en) | 2020-01-21 | 2024-03-19 | Compound Eye, Inc. | System and method for egomotion estimation |
CN111680176A (en) * | 2020-04-20 | 2020-09-18 | 武汉大学 | Remote sensing image retrieval method and system based on attention and bidirectional feature fusion |
CN115578694A (en) * | 2022-11-18 | 2023-01-06 | 合肥英特灵达信息技术有限公司 | Video analysis computing power scheduling method, system, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
US9449233B2 (en) | 2016-09-20 |
US8867793B2 (en) | 2014-10-21 |
US20120249802A1 (en) | 2012-10-04 |
US20120250978A1 (en) | 2012-10-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20120250984A1 (en) | Image segmentation for distributed target tracking and scene analysis | |
Lim et al. | Real-time image-based 6-dof localization in large-scale environments | |
Hartmann et al. | Recent developments in large-scale tie-point matching | |
Lai et al. | RGB-D object recognition: Features, algorithms, and a large scale benchmark | |
WO2016119117A1 (en) | Localization and mapping method | |
JP5261501B2 (en) | Permanent visual scene and object recognition | |
KR20130122662A (en) | Method and system for comparing images | |
Shahbazi et al. | Application of locality sensitive hashing to realtime loop closure detection | |
US10943098B2 (en) | Automated and unsupervised curation of image datasets | |
US9014486B2 (en) | Systems and methods for tracking with discrete texture traces | |
An et al. | Optimal colour‐based mean shift algorithm for tracking objects | |
Qin et al. | Loop closure detection in SLAM by combining visual CNN features and submaps | |
Li et al. | Salient object detection based on meanshift filtering and fusion of colour information | |
Carvalho et al. | Analysis of object description methods in a video object tracking environment | |
Gad et al. | Crowd density estimation using multiple features categories and multiple regression models | |
Gu et al. | Automatic searching of fish from underwater images via shape matching | |
Tal et al. | An accurate method for line detection and manhattan frame estimation | |
Lowry et al. | Logos: Local geometric support for high-outlier spatial verification | |
Essmaeel et al. | A new 3D descriptor for human classification: Application for human detection in a multi-kinect system | |
Arnfred et al. | Mirror match: Reliable feature point matching without geometric constraints | |
Sliti et al. | Efficient visual tracking via sparse representation and back-projection histogram | |
Taylor et al. | Fast Segmentation via Randomized Hashing. | |
Zhang et al. | A New Inlier Identification Scheme for Robust Estimation Problems. | |
Thinh et al. | Depth-aware salient object segmentation | |
Xu et al. | Label transfer for joint recognition and segmentation of 3D object |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: THE TRUSTEES OF THE UNIVERSITY OF PENNSYLVANIA, PE Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:TAYLOR, CAMILLO JOSE;REEL/FRAME:029887/0863 Effective date: 20120525 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |