US20090175544A1 - Finding structures in multi-dimensional spaces using image-guided clustering - Google Patents

Publication number
US20090175544A1
Authority
US
United States
Prior art keywords
multidimensional
image
pyramid
clusters
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US12/143,131
Other versions
US7558425B1 (en
Inventor
Tanveer Syeda-Mahmood
Peter J. Haas
John M. Lake
Guy Lohman
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US12/143,131 (US7558425B1)
Application granted
Publication of US7558425B1
Publication of US20090175544A1
Expired - Fee Related
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/231Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99933Query processing, i.e. searching
    • Y10S707/99941Database schema or data structure
    • Y10S707/99944Object-oriented database structure
    • Y10S707/99945Object-oriented database structure processing

Definitions

  • This invention relates to data clustering, and more particularly, to the clustering of multidimensional data to determine high-level structures.
  • Data clustering is the categorization of objects into different groups, or more precisely, the organizing of a collection of data into clusters, or subsets, based on quantitative information provided by one or more traits or characteristics shared by the data in each cluster.
  • a cluster is a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
  • the goal of clustering is to determine an intrinsic grouping, or structure, in a set of unlabeled data.
  • the functional dependency between two or more time series can lie along a curve.
  • FIG. 1 shows a graph of a functional dependency between a pair of time series that maps to a perceptible curve having a rotated U-like structure.
  • Clustering can be used to perform statistical data analysis in many fields, including machine learning, data mining, pattern recognition, medical imaging and other image analysis, and bioinformatics.
  • For applications dealing with sets of high-dimensional data such as multimedia processing applications (for example, content-based image and video retrieval, multimedia browsing, and multimedia transmission over networks), the finding of underlying high-level structures by clustering and categorization is a fundamental analysis operation.
  • a good clustering scheme should, for example, help to provide an efficient organization of content, as well as provide for better retrieval based upon semantic qualities.
  • In video retrieval, because of the larger number of additional features resulting from motion in time, efficient organization is particularly important.
  • In image-based retrieval, semantic quality retrieval is particularly important because clustering provides a means for grouping images into classes that share some common semantics.
  • The nature of multidimensional datasets presents a number of peculiarities that can lead to misleading or insufficient results using distance-based clustering, particularly for cases of grouping high-dimensional objects into high-level structures.
  • the number of feature dimensions in multidimensional datasets tends to be large in comparison to the number of data samples.
  • a single four-second action video, assuming a pair of features per frame (for instance, for representing the motion of the object centroid), can have at least 240 feature dimensions.
  • In image clustering, while color, texture, and shape features can encompass hundreds of features, the number of samples available for training could be comparatively small. This can result in a data space that is high-dimensional but sparse. The sparseness of the data points can make it difficult to identify the clusters because observation at multiple scales may be needed to spot the patterns.
  • a second issue that may arise is that the number of clusters for a multidimensional dataset is often unknown and more than one set of clusters may be possible. Different relative scalings can lead to groupings with different structures, even with measurements being taken in the same physical units. To make an informed decision as to relative scaling using existing clustering methods, either the number of clusters needs to be known a priori or a hierarchical clustering must be performed that yields several possible clusters without a specific recommendation on one.
  • In a hierarchical clustering, the process builds (agglomerative), or breaks up (divisive), a hierarchy of clusters.
  • the traditional representation of such a hierarchy of clusters is a tree structure called a dendrogram, which depicts the mergers or divisions which have been made at successive levels in the clustering process.
  • a bottom row of leaf nodes represents data and the set of remaining nodes represents the clusters to which the data belong at each successive stage of analysis.
  • the leaf nodes are spaced evenly along the horizontal axis, and the vertical axis gives the distance (or dissimilarity measure) at which any two clusters are joined.
  • Divisive methods begin at the top of the tree, while agglomerative methods begin at the bottom, and cutting the tree at a given height will give a clustering at a selected precision.
  • the top level of the hierarchy includes all data points as one cluster.
  • FIG. 2 illustrates a graph of functional dependencies between a pair of time series in which the noticeable structures are that of three separate lines radiating from common points. While different structures from within this graph may be obtained using hierarchical clustering methods, ideally, it would be desirable to have the result of clustering the dataset indicate the lower level structures (such as the individual splotches in FIG. 2 ) as well as the higher-level structures formed (such as the lines perceived in FIG. 2 ) without necessarily leading to a single cluster at the top level, unless that in fact matches how the data collection should be perceived.
  • a data processing system that comprises a processor, a random access memory for storing data and programs for execution by the processor, and computer readable instructions stored in the random access memory for execution by the processor to perform a method for clustering data points in a multidimensional dataset in a multidimensional image space.
  • the method comprises generating a multidimensional image from the multidimensional dataset; generating a pyramid of multidimensional images having varying resolution levels by successively performing a pyramidal sub-sampling of the multidimensional image; identifying data clusters at each resolution level of the pyramid by applying a set of perceptual grouping constraints; and determining levels of a clustering hierarchy by identifying each salient bend in a variation curve of a magnitude of identified data clusters as a function of pyramid resolution level.
  • FIG. 1 is a graph illustrating an exemplary functional dependency between a pair of time series.
  • FIG. 2 is a graph illustrating exemplary functional dependencies between a pair of time series.
  • FIGS. 3 a and 3 b are illustrations of exemplary arrangements of points to which perceptual grouping cues can be applied.
  • FIG. 4 is a flow diagram illustrating an exemplary embodiment of a clustering process in accordance with the present invention.
  • FIGS. 5 a - 5 f illustrate graphs of different levels of an exemplary pyramidal grouping generated while performing the exemplary clustering process illustrated in FIG. 4 .
  • FIG. 6 illustrates a graph of an exemplary variation curve of a magnitude of clusters identified from a dataset as a function of image resolution level.
  • FIG. 7 is a block diagram illustrating an exemplary hardware configuration of a computer system within which exemplary embodiments of the present invention can be implemented.
  • Perceptual grouping refers to the human visual ability to extract significant image relations from lower-level primitive image features without any knowledge of the image content and thereby group them to obtain meaningful higher-level structure.
  • perceptual grouping concepts as implemented in exemplary embodiments described herein implicitly use some aspects of processing that can be directly related to the pattern recognition processes of the human visual system.
  • the human visual system can detect many classes of patterns and statistically significant arrangements of image elements.
  • Perceptual grouping aims at reducing ambiguity in image data or in initial segmentation and thus at increasing the robustness and efficiency of subsequent processing steps.
  • FIGS. 3 a and 3 b show the close applicability of perceptual grouping to clustering.
  • FIG. 3 a illustrates an exemplary arrangement of points where proximity as well as continuity of orientation can help isolate the three lines (that is, the C and F splits and the main line A-H in the middle).
  • FIG. 3 b illustrates an exemplary case in which the density difference is the primary perceptual grouping cue used to separate the two objects (that is, the inner and outer discs).
  • Exemplary embodiments described herein embody methods for extracting perceptually salient clusters/groups from multidimensional datasets using perceptual grouping as a way of clustering.
  • the multidimensional feature space is modeled as a multidimensional image
  • clustering is posed as a problem of object extraction from sparse and noisy data
  • perceptual grouping constraints are utilized to successively group sample points into dense clusters in multidimensional spaces.
  • the grouping is carried over progressively sub-sampled images using a pyramid scheme to identify clustering levels.
  • Exemplary embodiments as described herein are applicable to suitable multidimensional datasets in any arbitrary application domain or field.
  • FIG. 4 illustrates a flow diagram of a process for performing clustering in accordance with an exemplary embodiment of the present invention.
  • Exemplary clustering process 100 consists of three main steps: At step 110 , a pyramidal image sampling is performed to create a multidimensional image pyramid by successively sub-sampling the original multidimensional image formed from the dataset. At step 120 , a perceptual grouping method is performed at each image size to assemble the clusters from the samples by extracting features from the subspaces formed in the previous image in the image pyramid. That is, perceptual grouping constraints are used to assemble the clusters at successive pyramidal levels. Finally, at step 130 , the cluster curve is obtained and bends in the curve are identified as curvature change points.
  • each of the $X_i = (f_{1i}, f_{2i}, \ldots, f_{Mi})$ is an M-dimensional feature vector in which the features are normalized so that $0 \le f_{ji} \le 1.0$
  • each of the $X_i$ is a point in an $\mathbb{R}^M$ space, which can be regarded as an M-dimensional image $I_k$ of size $L_k \times L_k$.
  • Each image bin at level k is an M-dimensional unit of size
  • a pyramidal image sampling scheme is performed to create a multidimensional image pyramid by successively sub-sampling the original multidimensional image formed from the dataset.
  • Any appropriate technique for pyramidal image sampling of the multidimensional dataset can be used.
  • image sampling can be performed, for example, in cross-section, radial section, or spiral (Archimedean and logarithmic) form, to obtain vector data from image greyscale values; such schemes have been shown to be successful in obtaining representative samples.
  • a logarithmic sampling scheme is used in the present exemplary embodiment. Because all the feature dimensions are normalized to be in the range [0,1], a square grid can be used. The scheme starts with an image of size $L_0 \times L_0$, and each successive image is of size $L_i \times L_i$ where
  • the pyramid sampling scheme is sufficient to bring out all groups that obey the selected grouping constraints.
  • the pyramid image sampling scheme can be used as a systematic way to explore such multi-level grouping by applying multi-resolution grouping constraints to achieve a meaningful semantic representation. That is, the multidimensional space is modeled as a multidimensional image with pyramidal sub-sampling for representing the features at lower resolution such that higher level structures are extracted from the dataset.
  • This sampling scheme exploits the characteristic of images that neighboring pixels are highly correlated.
  • An example of a pyramid is the Laplacian pyramid, which is obtained by convolving the image with a Gaussian kernel. The Laplacian is then computed as the difference between the original image and the low pass filtered image to create a sequence of band-pass filtered representations. This process is continued to obtain a set of band-pass filtered images, each being the difference between two levels of the Gaussian pyramid.
  • the Laplacian pyramid is a set of band pass filters at successively lower resolutions or image sizes.
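As a rough illustration of the pyramidal sub-sampling described above, the following Python sketch bins normalized feature vectors into square grids of decreasing size. It is a minimal sketch under stated assumptions: the names `bin_index` and `build_pyramid` are hypothetical, each level simply halves the grid size (the source describes a logarithmic schedule, but its formula for the successive sizes is not available here), and plain per-bin sample counts stand in for image intensities instead of Gaussian or Laplacian filtering.

```python
def bin_index(point, size):
    """Map a feature vector with components in [0, 1] to its grid bin."""
    # A component equal to 1.0 is clamped into the last bin.
    return tuple(min(int(f * size), size - 1) for f in point)


def build_pyramid(points, base_size):
    """Return [(size, {bin: sample_count})] ordered from fine to coarse.

    Assumption: the grid size is halved at every level. This schedule is
    illustrative only and is not taken from the source.
    """
    levels = []
    size = base_size
    while size >= 1:
        counts = {}
        for p in points:
            b = bin_index(p, size)
            counts[b] = counts.get(b, 0) + 1
        levels.append((size, counts))
        size //= 2
    return levels


samples = [(0.05, 0.05), (0.06, 0.04), (0.95, 0.95)]
pyramid = build_pyramid(samples, base_size=8)
```

Each entry of `pyramid` pairs a grid size with the occupied bins at that resolution, so coarser levels contain fewer, more heavily populated bins.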
  • perceptual grouping constraints are used at step 120 to cluster the multidimensional dataset into smaller numbers of clusters that are tractable in spatial and computational complexity at successive pyramidal levels, by extracting features from the subspaces formed in the previous image in the pyramid.
  • the cluster labels of the previous pyramid level serve as the intensity values for the current pyramid level and remain in the same cluster at the next pyramid level. This property not only ensures nested clusters across the cluster hierarchy, but also makes it possible to generate the groupings using a logarithmic scale of multi-resolution.
  • the image size is successively sub-sampled or equivalently reduced and those perceptual groups that persist the longest as the image shrinks are selected.
  • the perceptual grouping constraints of proximity, density, orientation similarity, and region contiguity are utilized to successively group sample points into dense clusters in multidimensional spaces. Due to the sparse and irregular nature of point distribution in clustering, the emergence of higher-level structures is rarely apparent through a single pass of the image. Initial grouping may yield some structures, and these structures may be further combined to yield another level of higher-level structures. This process can be repeated until a meaningful semantic representation is achieved.
  • the grouping constraints that are utilized determine whether pixels can be grouped.
  • a pair of points $(X_i, X_j)$ are considered proximal at pyramidal level $k$ if
  • Here, $|\cdot|$ stands for the absolute value. For the 2-dimensional case, the constraint corresponds to the use of a $3 \times 3$ neighborhood around a pixel.
  • the grouping will consider the clusters from the previous level as the grouping elements.
  • the proximity constraints state that they can be merged provided at least a pair of their respective image bins is adjacent.
  • the proximity constraint to group two clusters $c_i^{k-1}, c_j^{k-1}$ from level $k-1$ into one at level $k$ can be given as $\exists\, l, m$
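The cluster-level proximity test described above (two clusters may merge when at least one pair of their image bins is adjacent) can be sketched in Python as follows. This is a minimal sketch, assuming each cluster is represented as a collection of integer bin coordinates at the current pyramid level; the names `bins_adjacent` and `clusters_proximal` are hypothetical, and adjacency is taken to mean that every coordinate differs by at most one bin, which matches the 3 × 3 neighborhood mentioned for the 2-dimensional case.

```python
from itertools import product


def bins_adjacent(bin_a, bin_b):
    """True when two bins coincide or touch (a 3 x 3 neighborhood in 2-D)."""
    return all(abs(a - b) <= 1 for a, b in zip(bin_a, bin_b))


def clusters_proximal(bins_i, bins_j):
    """True when at least one pair of bins, one from each cluster, is adjacent."""
    return any(bins_adjacent(a, b) for a, b in product(bins_i, bins_j))


cluster_a = {(0, 0), (1, 1)}
cluster_b = {(2, 2), (3, 3)}
cluster_c = {(5, 5)}
```

Here `cluster_a` and `cluster_b` touch through bins (1, 1) and (2, 2), while `cluster_c` is too far from `cluster_a` to satisfy the test.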
  • the average density of a cluster obtained at pyramid level k is obtained by noting the average number of samples with the given cluster label in an image bin.
  • Let $N_c^k$ be the number of image bins for cluster $c$ at pyramid level $k$
  • Let $n_{cl}^k$ be the number of sample points belonging to cluster $c$ in bin $l$ at pyramid level $k$
  • the grouping constraint of density attempts to group clusters that have a small difference in density. That is, given two clusters from pyramid level $k-1$, $c_i^{k-1}$, $c_j^{k-1}$, the density constraint is
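The density bookkeeping described above can be sketched in Python. This is a minimal sketch under stated assumptions: a cluster is represented as a dictionary mapping each of its bins to the number of its samples in that bin, the names `average_density` and `density_compatible` are hypothetical, and the absolute-difference threshold `max_difference` is an illustrative stand-in because the exact form of the density constraint is not available here.

```python
def average_density(bin_counts):
    """Average number of samples per occupied bin for one cluster."""
    # Sum of per-bin sample counts divided by the number of bins.
    return sum(bin_counts.values()) / len(bin_counts)


def density_compatible(counts_i, counts_j, max_difference):
    """True when the two clusters differ little in average density.

    Assumption: a plain absolute difference compared with a threshold.
    """
    return abs(average_density(counts_i) - average_density(counts_j)) <= max_difference


sparse_a = {(0, 0): 4, (0, 1): 2}   # 6 samples over 2 bins
sparse_b = {(5, 5): 3}              # 3 samples over 1 bin
dense_c = {(9, 9): 10}              # 10 samples over 1 bin
```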
  • the orientation of the subspace can be characterized by the eigenvectors of the covariance matrix of the subspace.
  • the covariance matrix of the cluster $c$ is given by $X_c X_c^T$, where $X_c$ is the set of sample points that fall into the cluster $c$.
  • the orientation constraint of grouping operates to group those clusters from a previous pyramid level at the next level when there is a small difference in their orientation.
  • Let $v_i$, $v_j$ be the eigenvectors corresponding to the largest eigenvalues for two clusters $c_i^{k-1}$, $c_j^{k-1}$ respectively
  • the dot product represents the cosine of the angle between the above unit eigenvectors.
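The orientation comparison described above can be sketched in Python using only the standard library. This is a minimal sketch, not the patented implementation: the names `principal_direction` and `orientation_cosine` are hypothetical, the samples are mean-centered before the covariance is formed (an assumption; the text only gives the covariance as a product of the sample matrix with its transpose), the dominant eigenvector is found by power iteration, and the absolute value of the dot product is used because the sign of an eigenvector is arbitrary.

```python
import math


def principal_direction(points, iterations=200):
    """Unit eigenvector for the largest eigenvalue of the covariance matrix."""
    n, m = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(m)]
    centered = [[p[d] - mean[d] for d in range(m)] for p in points]
    cov = [[sum(row[a] * row[b] for row in centered) / n for b in range(m)]
           for a in range(m)]
    # Asymmetric start vector so it is unlikely to be orthogonal to the answer.
    v = [1.0 / (d + 1) for d in range(m)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate cluster with no spread
            break
        v = [x / norm for x in w]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def orientation_cosine(points_i, points_j):
    """|cos| of the angle between the principal directions of two clusters."""
    v_i = principal_direction(points_i)
    v_j = principal_direction(points_j)
    return abs(sum(a * b for a, b in zip(v_i, v_j)))


line_a = [(0.1 * i, 0.1 * i) for i in range(5)]        # rising diagonal
line_b = [(0.1 * i, 0.1 * i + 0.3) for i in range(5)]  # parallel to line_a
line_c = [(0.1 * i, 0.5 - 0.1 * i) for i in range(5)]  # falling diagonal
```

A value of `orientation_cosine` close to 1 indicates similar orientation, so a grouping rule could compare it against a chosen minimum cosine.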
  • region contiguity should be implemented as a three-way constraint to determine if two of the regions belonging to different groups being merged potentially intersect with a third region belonging to a group that is already assembled, which would result in physically implausible clusters.
  • the potential for having clusters consisting of intersecting regions belonging to different clusters is a particular concern at higher levels of the image pyramid.
  • the contiguity of two potential groups $c_i^{k-1}$ and $c_j^{k-1}$ can be detected if the potential minimum spanning tree (MST) $V_{ij}^k$ formed from their merger does not have an edge intersecting with the MST $V_l^k$ of a group $c_l^k$ already formed at this level, or with $V_m^{k-1}$ for the region $c_m^{k-1}$ at the previous scale.
  • $V_{ij}^k = V_i^{k-1} \cup V_j^{k-1} \cup \{E_{\min}\}$
  • $E_{\min} = \min\{E_{uv} : u \in c_i^{k-1},\ v \in c_j^{k-1}\}$
  • $E_{uv}$ is the distance between the M-dimensional points $u$ and $v$ belonging to groups $c_i^{k-1}$ and $c_j^{k-1}$ respectively.
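The region contiguity check described above can be sketched in Python for the 2-dimensional case. This is a minimal sketch under stated assumptions: all function names are hypothetical, each cluster's spanning tree is built with Prim's algorithm, the merged tree is formed as the two trees plus the shortest linking edge, and the edge-intersection test is a standard 2-D segment crossing test (the general M-dimensional case is not covered).

```python
import math


def mst_edges(points):
    """Prim's algorithm; returns the MST as a list of (point, point) edges."""
    if len(points) < 2:
        return []
    in_tree, remaining, edges = [0], set(range(1, len(points))), []
    while remaining:
        _, i, j = min((math.dist(points[i], points[j]), i, j)
                      for i in in_tree for j in remaining)
        edges.append((points[i], points[j]))
        in_tree.append(j)
        remaining.discard(j)
    return edges


def linking_edge(cluster_i, cluster_j):
    """Shortest edge between the two clusters, as (length, u, v)."""
    return min((math.dist(u, v), u, v) for u in cluster_i for v in cluster_j)


def merged_tree(cluster_i, cluster_j):
    """Edges of both cluster trees plus the shortest linking edge."""
    _, u, v = linking_edge(cluster_i, cluster_j)
    return mst_edges(cluster_i) + mst_edges(cluster_j) + [(u, v)]


def _orient(p, q, r):
    # Sign of the cross product (q - p) x (r - p).
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a, b, c, d):
    """True when 2-D segments ab and cd properly cross (touching is ignored)."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def contiguous(cluster_i, cluster_j, other_tree):
    """True when no edge of the merged tree crosses an edge of other_tree."""
    return not any(segments_cross(a, b, c, d)
                   for a, b in merged_tree(cluster_i, cluster_j)
                   for c, d in other_tree)


left = [(0.0, 0.0), (0.1, 0.0)]
right = [(0.3, 0.0), (0.4, 0.0)]
blocker = mst_edges([(0.2, -0.1), (0.2, 0.1)])  # cuts between left and right
bystander = mst_edges([(0.2, 0.3), (0.2, 0.5)])  # lies well above both
```

In this toy data the edge linking `left` and `right` crosses the `blocker` tree, so that merger would be rejected, while the `bystander` tree does not interfere.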
  • process 100 returns the number of clusters as the connected components grouped.
  • the number of distinct groups can be expected to decrease with coarseness of sub-sampling until the minimum is reached.
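The way the number of groups shrinks with coarser sub-sampling can be illustrated with a short Python sketch. This is a simplified illustration, assuming that only the proximity constraint is applied (density, orientation, and contiguity are omitted) and that the grid sizes are chosen by hand; the name `count_clusters` is hypothetical. Clusters are counted as connected components of occupied bins, where bins are connected when they touch.

```python
from itertools import product


def count_clusters(points, size):
    """Count connected components of occupied bins on a size^M grid."""
    occupied = {tuple(min(int(f * size), size - 1) for f in p) for p in points}
    dims = len(points[0])
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]
    seen, components = set(), 0
    for start in occupied:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first flood fill over touching bins
            current = stack.pop()
            for o in offsets:
                neighbor = tuple(c + d for c, d in zip(current, o))
                if neighbor in occupied and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components


samples = [(0.05, 0.05), (0.15, 0.05), (0.55, 0.55), (0.65, 0.55), (0.95, 0.05)]
curve = [count_clusters(samples, size) for size in (16, 8, 4, 2)]
```

For this toy dataset the counts never increase as the grid becomes coarser, which is the behavior the text describes for the cluster curve.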
  • the cluster curve is obtained and salient bends in the curve are identified as curvature change points to determine the clustering hierarchy.
  • the number of clusters as a function of image sample size can be represented as a monotonically decreasing curve.
  • FIG. 6 illustrates a graph of an exemplary variation of clusters for a sample set as a function of image resolution.
  • the salient bends W, X, Y, and Z indicated by the arrows on the curve are used as levels of hierarchical clustering.
  • Although the pyramidal representation gives the multi-resolution decomposition for purposes of grouping, not all levels may produce distinct changes in the grouping. Rather, distinct groups may emerge at certain levels of resolution and such emergence is usually marked with a distinct change in the number of clusters.
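One simple way to locate such distinct changes in the number of clusters is sketched below in Python. This is a heuristic sketch, not the method of the source: the name `salient_bends` and the threshold `min_curvature` are hypothetical, and the discrete second difference of the cluster counts is used as a stand-in for curvature, flagging levels where the curve flattens sharply after a steep drop.

```python
def salient_bends(cluster_counts, min_curvature):
    """Return pyramid-level indices where the cluster curve bends sharply.

    Assumption: curvature is approximated by the second difference
    n[k-1] - 2*n[k] + n[k+1], and only convex corners are reported.
    """
    bends = []
    for k in range(1, len(cluster_counts) - 1):
        curvature = (cluster_counts[k - 1]
                     - 2 * cluster_counts[k]
                     + cluster_counts[k + 1])
        if curvature >= min_curvature:
            bends.append(k)
    return bends


# Hypothetical cluster counts, one per pyramid level (fine to coarse).
counts = [40, 39, 38, 20, 19, 18, 6, 5]
levels = salient_bends(counts, min_curvature=10)
```

With these made-up counts, the flagged levels are the ones immediately after each large drop, which would then serve as candidate levels of the clustering hierarchy.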
  • FIGS. 5 a - 5 f are graphs illustrating the different levels of an exemplary pyramidal grouping performed in accordance with the present exemplary process.
  • multi-resolution decomposition is used to guide the perceptual grouping of data points using factors of proximity, orientation, and density.
  • the number of clusters decreases as a function of image size.
  • FIG. 5 a shows the original data.
  • the clusterings produced at each level in the hierarchy, using the bends on this curve as an indication of level, are shown in FIG.
  • the exemplary clustering process illustrated in FIG. 4 can be used to create a multidimensional image pyramid by successively sub-sampling the original multidimensional image formed from the data samples.
  • the exemplary clustering process is based upon perceptual grouping concepts that are applicable to high-dimensional spaces.
  • the approach identifies bends in cluster curves that it then uses to determine the cluster hierarchy.
  • the process can identify clusters which would be obvious using a visual metaphor but would otherwise be unrecognized using only proximity based measures.
  • the details of the exemplary clustering process illustrated in FIG. 4 can be summarized as follows:
  • Let $n_k$ be the number of clusters at pyramid level $k$.
  • the hierarchical levels for clustering are given by $\{L_{l_1}, \ldots, L_{l_S}\}$.
  • the corresponding clusters at each cluster hierarchy level are identified as $\{c_1^{l_1}, \ldots, c_{n_{l_1}}^{l_1}\}, \ldots, \{c_1^{l_S}, \ldots, c_{n_{l_S}}^{l_S}\}$.
  • the exemplary process has three parameters, namely, L 0 , ⁇ , ⁇ , that can be chosen as follows.
  • the starting image size for the pyramidal sampling can be taken as the minimum non-zero distance between pairs of sample points in the data set. Nevertheless, a starting sampling size based on the distance that is at 1/100th percentile in the sorted list of distances between pairs of points can usually be sufficient. In this case, supposing the pair-wise Euclidean distance at 1/100th percentile is d,
  • the process does not require a priori information regarding the number of clusters or starting points for the computation.
  • the clustering output can be generated in the form of a tabular list of point coordinates along with the corresponding cluster labels.
  • the output can also be visualized using any suitable technique for visualizing clustering results.
  • exemplary embodiments of the present invention can be implemented in software, firmware, hardware, or some combination thereof, and may be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system, or other apparatus adapted for carrying out the methods and/or functions described herein, is suitable.
  • a typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
  • Exemplary embodiments of the present invention can also be embedded in a computer program product, which comprises features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods.
  • Computer program means or computer program in the present context include any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after conversion to another language, code or notation, and/or reproduction in a different material form.
  • one or more aspects of exemplary embodiments of the present invention can be included in an article of manufacture (for example, one or more computer program products) having, for instance, computer usable media.
  • the media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention.
  • the article of manufacture can be included as a part of a computer system or sold separately.
  • at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the exemplary embodiments of the present invention described above can be provided.
  • FIG. 7 illustrates an exemplary computer system 10 upon which exemplary embodiments of the present invention can be implemented.
  • a processor or CPU 12 receives data and instructions for operating upon from on-board cache memory or further cache memory 18 , possibly through the mediation of a cache controller 20 , which can in turn receive such data from system read/write memory (“RAM”) 22 through a RAM controller 24 , or from various peripheral devices through a system bus 26 .
  • the data and instruction contents of RAM 22 will ordinarily have been loaded from peripheral devices such as a system disk 27 .
  • Alternative sources include communications interface 28 , which can receive instructions and data from other computer systems.
  • the above-described program or modules implementing exemplary embodiments of the present invention can work on processor 12 and the like to perform clustering.
  • the program or modules implementing exemplary embodiments may be stored in an external storage medium.
  • an optical recording medium such as a DVD or a PD
  • a magneto-optical recording medium such as an MD
  • a tape medium
  • a semiconductor memory such as an IC card, and the like
  • the program may be provided to computer system 10 through the network by using, as the recording medium, a storage device such as a hard disk or a RAM, which is provided in a server system connected to a dedicated communication network or the Internet.

Abstract

A data processing system is provided that comprises a processor, a random access memory for storing data and programs for execution by the processor, and computer readable instructions stored in the random access memory for execution by the processor to perform a method for clustering data points in a multidimensional dataset in a multidimensional image space. The method comprises generating a multidimensional image from the multidimensional dataset; generating a pyramid of multidimensional images having varying resolution levels by successively performing a pyramidal sub-sampling of the multidimensional image; identifying data clusters at each resolution level of the pyramid by applying a set of perceptual grouping constraints; and determining levels of a clustering hierarchy by identifying each salient bend in a variation curve of a magnitude of identified data clusters as a function of pyramid resolution level.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This application is a continuation of U.S. patent application Ser. No. 11/970,946, filed Jan. 8, 2008, the disclosure of which is incorporated by reference herein in its entirety.
    BACKGROUND OF THE INVENTION
    1. Field of the Invention
  • This invention relates to data clustering, and more particularly, to the clustering of multidimensional data to determine high-level structures.
    2. Description of Background
  • Data clustering (or just clustering) is the categorization of objects into different groups, or more precisely, the organizing of a collection of data into clusters, or subsets, based on quantitative information provided by one or more traits or characteristics shared by the data in each cluster. A cluster is a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters. The goal of clustering is to determine an intrinsic grouping, or structure, in a set of unlabeled data. For example, the functional dependency between two or more time series can lie along a curve. As an example, FIG. 1 shows a graph of a functional dependency between a pair of time series that maps to a perceptible curve having a rotated U-like structure. Clustering can be used to perform statistical data analysis in many fields, including machine learning, data mining, pattern recognition, medical imaging and other image analysis, and bioinformatics.
  • For applications dealing with sets of high-dimensional data such as multimedia processing applications (for example, content-based image and video retrieval, multimedia browsing, and multimedia transmission over networks), the finding of underlying high-level structures by clustering and categorization is a fundamental analysis operation. A good clustering scheme should, for example, help to provide an efficient organization of content, as well as provide for better retrieval based upon semantic qualities. In video retrieval, because of the larger number of additional features resulting from motion in time, efficient organization is particularly important. In image-based retrieval, semantic quality retrieval is particularly important because clustering provides a means for grouping images into classes that share some common semantics.
  • Even though clustering of multidimensional datasets is important to determining high-level structures, much of the focus in multidimensional data analysis has been on feature extraction and representation, and existing methods available from data mining and machine learning have been relied on for the clustering task. These methods are primarily based upon the similarity criterion of distance or proximity in which two or more objects belong to the same cluster if they are “close” according to a given distance function that defines a distance between elements of a set (for example, the simple Euclidean distance metric).
  • The nature of multidimensional datasets, however, presents a number of peculiarities that can lead to misleading or insufficient results using distance-based clustering, particularly for cases of grouping high-dimensional objects into high-level structures. First, the number of feature dimensions in multidimensional datasets tends to be large in comparison to the number of data samples. As an example, a single four-second action video, assuming a pair of features per frame (for instance, for representing the motion of the object centroid), can have at least 240 feature dimensions. Similarly, in image clustering, while color, texture, and shape features can encompass hundreds of features, the number of samples available for training could be comparatively small. This can result in a data space that is high-dimensional but sparse. The sparseness of the data points can make it difficult to identify the clusters because observation at multiple scales may be needed to spot the patterns.
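As a worked check of the 240-dimension figure above, assuming a frame rate of 30 frames per second (the text does not state one, so this value is an assumption):

```latex
4\ \text{s} \times 30\ \tfrac{\text{frames}}{\text{s}} \times 2\ \tfrac{\text{features}}{\text{frame}} = 240\ \text{feature dimensions}
```

A higher frame rate would give a larger count, which is consistent with the phrase "at least 240".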
  • A second issue that may arise is that the number of clusters for a multidimensional dataset is often unknown and more than one set of clusters may be possible. Different relative scalings can lead to groupings with different structures, even with measurements being taken in the same physical units. To make an informed decision as to relative scaling using existing clustering methods, either the number of clusters needs to be known a priori or a hierarchical clustering must be performed that yields several possible clusterings without a specific recommendation on one. In a hierarchical clustering, the process builds (agglomerative), or breaks up (divisive), a hierarchy of clusters. The traditional representation of such a hierarchy of clusters is a tree structure called a dendrogram, which depicts the mergers or divisions which have been made at successive levels in the clustering process. A bottom row of leaf nodes represents the data, and the remaining nodes represent the clusters to which the data belong at each successive stage of analysis. The leaf nodes are spaced evenly along the horizontal axis, and the vertical axis gives the distance (or dissimilarity measure) at which any two clusters are joined. Divisive methods begin at the top of the tree, while agglomerative methods begin at the bottom, and cutting the tree at a given height will give a clustering at a selected precision. The top level of the hierarchy includes all data points as one cluster.
  • As an example of the scaling issue, a clustering scenario is provided that involves a type of dataset for which the structure of the functional dependency between two or more time series can take a variety of forms. As an example, FIG. 2 illustrates a graph of functional dependencies between a pair of time series in which the noticeable structures are that of three separate lines radiating from common points. While different structures from within this graph may be obtained using hierarchical clustering methods, ideally, it would be desirable to have the result of clustering the dataset indicate the lower level structures (such as the individual splotches in FIG. 2) as well as the higher-level structures formed (such as the lines perceived in FIG. 2) without necessarily leading to a single cluster at the top level, unless that is in fact matching how the data collection should be perceived.
  • SUMMARY OF THE INVENTION
  • The shortcomings of the prior art can be overcome and additional advantages can be provided through exemplary embodiments of the present invention that are related to a data processing system that comprises a processor, a random access memory for storing data and programs for execution by the processor, and computer readable instructions stored in the random access memory for execution by the processor to perform a method for clustering data points in a multidimensional dataset in a multidimensional image space. The method comprises generating a multidimensional image from the multidimensional dataset; generating a pyramid of multidimensional images having varying resolution levels by successively performing a pyramidal sub-sampling of the multidimensional image; identifying data clusters at each resolution level of the pyramid by applying a set of perceptual grouping constraints; and determining levels of a clustering hierarchy by identifying each salient bend in a variation curve of a magnitude of identified data clusters as a function of pyramid resolution level.
  • The shortcomings of the prior art can also be overcome and additional advantages can also be provided through exemplary embodiments of the present invention that are related to computer program products and data processing systems corresponding to the above-summarized method, which are also described and claimed herein.
  • Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with advantages and features, refer to the description and to the drawings.
  • TECHNICAL EFFECTS
  • As a result of the summarized invention, technically we have achieved a solution that can be implemented to cluster data points in a multidimensional dataset in a multidimensional image space in a manner that performs pyramidal clustering by applying perceptual grouping constraints to identify multi-level structures in the dataset, can automatically determine the number of perceivable clusters at each level of an image pyramid, and can thereby determine a hierarchical clustering for the dataset.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The subject matter that is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description of exemplary embodiments of the present invention taken in conjunction with the accompanying drawings in which:
  • FIG. 1 is a graph illustrating an exemplary functional dependency between a pair of time series.
  • FIG. 2 is a graph illustrating exemplary functional dependencies between a pair of time series.
  • FIGS. 3 a and 3 b are illustrations of exemplary arrangements of points to which perceptual grouping cues can be applied.
  • FIG. 4 is a flow diagram illustrating an exemplary embodiment of a clustering process in accordance with the present invention.
  • FIGS. 5 a-5 f illustrate graphs of different levels of an exemplary pyramidal grouping generated while performing the exemplary clustering process illustrated in FIG. 4.
  • FIG. 6 illustrates a graph of an exemplary variation curve of a magnitude of clusters identified from a dataset as a function of image resolution level.
  • FIG. 7 is a block diagram illustrating an exemplary hardware configuration of a computer system within which exemplary embodiments of the present invention can be implemented.
  • The detailed description explains exemplary embodiments of the present invention, together with advantages and features, by way of example with reference to the drawings. The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted, or modified. All of these variations are considered a part of the claimed invention.
  • DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS
  • While the specification concludes with claims defining the features of the invention that are regarded as novel, it is believed that the invention will be better understood from a consideration of the description of exemplary embodiments in conjunction with the drawings. It is of course to be understood that the embodiments described herein are merely exemplary of the invention, which can be embodied in various forms. Therefore, specific structural and functional details disclosed in relation to the exemplary embodiments described herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention in virtually any appropriate form. Further, the terms and phrases used herein are not intended to be limiting but rather to provide an understandable description of the invention.
  • In exemplary embodiments of the present invention, concepts of perceptual grouping are exploited for clustering multidimensional datasets to determine the hierarchical structures. Perceptual grouping refers to the human visual ability to extract significant image relations from lower-level primitive image features without any knowledge of the image content and thereby group them to obtain meaningful higher-level structure. Thus, perceptual grouping concepts as implemented in exemplary embodiments described herein implicitly use some aspects of processing that can be directly related to the pattern recognition processes of the human visual system. The human visual system can detect many classes of patterns and statistically significant arrangements of image elements. Perceptual grouping aims at reducing ambiguity in image data or in initial segmentation and thus at increasing the robustness and efficiency of subsequent processing steps.
  • A number of factors have been known to influence the parts of an image that are combined to form an object. The use of perceptual grouping in pattern recognition involves clustering data points based upon factors embodying some or all of the relevant human perceivable attributes of parallelism, co-linearity, proximity, similarity, good continuation, orientation, connectivity, density, symmetry, etc. These factors are based on the Gestalt principles of psychology. As an example, FIGS. 3 a and 3 b show the close applicability of perceptual grouping to clustering. FIG. 3 a illustrates an exemplary arrangement of points where proximity as well as continuity of orientation can help isolate the three lines (that is, the C and F splits and the main line A-H in the middle). FIG. 3 b illustrates an exemplary case in which the density difference is the primary perceptual grouping cue used to separate the two objects (that is, the inner and outer discs).
  • Exemplary embodiments described herein embody methods for extracting perceptually salient clusters/groups from multidimensional datasets using perceptual grouping as a way of clustering. The multidimensional feature space is modeled as a multidimensional image, clustering is posed as a problem of object extraction from sparse and noisy data, and perceptual grouping constraints are utilized to successively group sample points into dense clusters in multidimensional spaces. To accommodate perceptual groups that may occur at different scales, the grouping is carried over progressively sub-sampled images using a pyramid scheme to identify clustering levels. Exemplary embodiments as described herein are applicable to suitable multidimensional datasets in any arbitrary application domain or field.
  • FIG. 4 illustrates a flow diagram of a process for performing clustering in accordance with an exemplary embodiment of the present invention. Exemplary clustering process 100 consists of three main steps: At step 110, a pyramidal image sampling is performed to create a multidimensional image pyramid by successively sub-sampling the original multidimensional image formed from the dataset. At step 120, a perceptual grouping method is performed at each image size to assemble the clusters from the samples by extracting features from the subspaces formed in the previous image in the image pyramid. That is, perceptual grouping constraints are used to assemble the clusters at successive pyramidal levels. Finally, at step 130, the cluster curve is obtained and bends in the curve are identified as curvature change points. By noting the image size at each bend, the corresponding subspaces are retained as levels of the clustering hierarchy. Throughout this process, while the sampling of the neighborhood is discrete, the clustering at each level is based upon the actual and not quantized location of each data sample. Each step of the present exemplary embodiment is described in greater detail below. First, however, some terminology for the model used will be outlined.
  • In an M-dimensional data set of N samples $X=(X_1, X_2, \ldots, X_N)$, where each $X_i=(f_{1i}, f_{2i}, \ldots, f_{Mi})$ is an M-dimensional feature vector in which the features are normalized so that $0 \le f_{ji} \le 1.0$, each $X_i$ is a point in an $\mathbb{R}^M$ space, which can be regarded as an M-dimensional image $I_k$ of size $L_k \times L_k$. Each sample then has an image bin coordinate at level k that is an M-tuple $D_i^k=[q_{1i}^k \ldots q_{Mi}^k]^T$, where $0 \le q_{ji}^k \le L_k-1$ are the bin coordinates representing the pixel in the image. Each image bin at level k is an M-dimensional unit of size
  • $$\left(\frac{1}{L_k} \times \frac{1}{L_k} \times \cdots \times \frac{1}{L_k}\right).$$
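  • As an illustration of the bin model above, the following minimal sketch maps a normalized feature vector to its integer bin coordinates at a given image size. The function name `bin_coordinate` is hypothetical, and placing a feature value of exactly 1.0 in the last bin is an assumption made so that coordinates stay within 0 to L_k − 1.

```python
from typing import List, Tuple


def bin_coordinate(sample: List[float], size: int) -> Tuple[int, ...]:
    """Map a feature vector with entries in [0, 1] to bin coordinates in 0..size-1."""
    coords = []
    for f in sample:
        if not 0.0 <= f <= 1.0:
            raise ValueError("features must be normalized to [0, 1]")
        # Assumption: a feature equal to 1.0 falls into the last bin.
        coords.append(min(int(f * size), size - 1))
    return tuple(coords)


samples = [[0.05, 0.95], [0.5, 0.5], [1.0, 0.0]]
print([bin_coordinate(x, 4) for x in samples])  # [(0, 3), (2, 2), (3, 0)]
```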
  • In this model, clusters become multidimensional regions or subspaces with image intensity formed from the cluster number, so that at each image size $L_k$, the sample $X_i$ belonging to cluster $c_l^k$ is represented by an intensity $c_l^k$ at the pixel $D_i^k=[q_{1i}^k \ldots q_{Mi}^k]^T$. The set of clusters at each image size $L_k$ is denoted by $C_k=\{c_1^k, c_2^k, \ldots, c_{n_k}^k\}$, where $n_k$ is the number of distinct subspaces at image size $L_k$.
  • The variation in the number of clusters as a function of image size is given by a 1D cluster curve $z=\{(n_k, L_k) \mid k=0, \ldots, T\}$. The bends in the curve, that is, points $z_p=(x(p), y(p))$ at which there is a significant change of curvature, are noted to form the clustering levels $C_p$ of the clustering hierarchy. It should be noted that while pyramidal levels and hierarchical clustering levels are distinguished, for some data distributions they may coincide.
  • Referring again to the exemplary embodiment shown in FIG. 4, at step 110, a pyramidal image sampling scheme is performed to create a multidimensional image pyramid by successively sub-sampling the original multidimensional image formed from the dataset. Any appropriate technique for pyramidal image sampling of the multidimensional dataset can be used. Several known methods have been implemented for image sampling—for example, in cross-section, radial section, or spiral (Archimedean and logarithmic) form—to obtain vector data from image greyscale values that have shown to be successful in obtaining representative samples. Following the convention in pyramid image representations, a logarithmic sampling scheme is used in the present exemplary embodiment. Because all the feature dimensions are normalized to be in the range [0,1], a square grid can be used. The scheme starts with an image of size L0×L0, and each successive image is of size Li×Li where
  • $$L_i = \frac{L_{i-1}}{2}$$
  • until an image size of 1×1 is reached. The pyramid sampling scheme is sufficient to bring out all groups that obey the selected grouping constraints.
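  • The halving scheme can be sketched as follows. The function name `pyramid_sizes` is hypothetical, and rounding up when the starting size is not a power of two is an assumption, since the description only states that each image is half the size of the previous one until a 1×1 image is reached.

```python
from typing import List


def pyramid_sizes(l0: int) -> List[int]:
    """Return the image sizes [L0, L1, ..., 1] obtained by repeated halving."""
    if l0 < 1:
        raise ValueError("starting image size must be at least 1")
    sizes = [l0]
    while sizes[-1] > 1:
        # Assumption: round up so that odd sizes still shrink towards 1.
        sizes.append((sizes[-1] + 1) // 2)
    return sizes


print(pyramid_sizes(16))  # [16, 8, 4, 2, 1]
```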
  • The pyramid image sampling scheme can be used as a systemic way to explore such multi-level grouping by applying multi-resolution grouping constraints to achieve a meaningful semantic representation. That is, the multidimensional space is modeled as a multidimensional image with pyramidal sub-sampling for representing the features at lower resolution such that higher level structures are extracted from the dataset. This sampling scheme exploits the characteristic of images that indicate that neighboring pixels are highly correlated. An example of a pyramid is the Laplacian pyramid, which is obtained by convolving the image with a Gaussian kernel. The Laplacian is then computed as the difference between the original image and the low pass filtered image to create a sequence of band-pass filtered representations. This process is continued to obtain a set of band-pass filtered images, each being the difference between two levels of the Gaussian pyramid. Thus, the Laplacian pyramid is a set of band pass filters at successively lower resolutions or image sizes.
  • In exemplary process 100, after pyramidal sampling is performed, perceptual grouping constraints are used at step 120 to cluster the multidimensional dataset into a smaller number of clusters that are tractable in spatial and computational complexity at successive pyramidal levels, by extracting features from the subspaces formed in the previous image in the pyramid. The cluster labels of the previous pyramid level serve as the intensity values for the current pyramid level, and samples grouped into a cluster at one level remain in the same cluster at the next pyramid level. This property not only ensures nested clusters across the cluster hierarchy, but also makes it possible to generate the groupings using a logarithmic scale of multi-resolution. Thus, the image size is successively sub-sampled or equivalently reduced and those perceptual groups that persist the longest as the image shrinks are selected.
  • In the present exemplary embodiment, the perceptual grouping constraints of proximity, density, orientation similarity, and region contiguity are utilized to successively group sample points into dense clusters in multidimensional spaces. Due to the sparse and irregular nature of the point distribution in clustering, the emergence of higher-level structures is rarely apparent through a single pass of the image. Initial grouping may yield some structures, and these structures may be further combined to yield another level of higher-level structures. This process can be repeated until a meaningful semantic representation is achieved. The grouping constraints that are utilized, which will now be described in greater detail, determine whether pixels can be grouped.
  • In the present exemplary embodiment, using the multidimensional image model outlined above, a pair of points $(X_i, X_j)$ are considered proximal at pyramidal level k if $|D_i^k - D_j^k| \le 1$ or, equivalently, $|q_{li}^k - q_{lj}^k| \le 1, \ \forall\, 1 \le l \le M$. The operation $|\cdot|$ stands for the absolute value. For the 2-dimensional case, this corresponds to the use of a 3×3 neighborhood around a pixel.
  • At each pyramid level, the grouping will consider the clusters from the previous level as the grouping elements. Thus, the proximity constraint states that two clusters can be merged provided at least one pair of their respective image bins is adjacent. By letting $D^k(c_i)=\{D_{1i}^k, \ldots, D_{li}^k\}$ be the set of image bins at level k occupied by the cluster $c_i^{k-1}$ (that is, at least one of the sample points of the cluster belongs to one of the image bins), the proximity constraint to group two clusters $c_i^{k-1}, c_j^{k-1}$ from level k-1 into one at level k can be given as $\exists\, l, m : |D_{li}^k - D_{mj}^k| \le 1$.
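  • The proximity constraint can be sketched as a per-coordinate adjacency test on bin coordinates. The names `bins_adjacent` and `clusters_proximal` are hypothetical, and clusters are assumed to be given as sets of the bin coordinates they occupy at the current level.

```python
from typing import Set, Tuple

Bin = Tuple[int, ...]


def bins_adjacent(a: Bin, b: Bin) -> bool:
    """True if every bin coordinate differs by at most 1 (a 3x3 neighborhood in 2-D)."""
    return all(abs(qa - qb) <= 1 for qa, qb in zip(a, b))


def clusters_proximal(bins_i: Set[Bin], bins_j: Set[Bin]) -> bool:
    """True if at least one pair of bins, one from each cluster, is adjacent."""
    return any(bins_adjacent(a, b) for a in bins_i for b in bins_j)


print(clusters_proximal({(0, 0), (0, 1)}, {(1, 2)}))  # True
print(clusters_proximal({(0, 0)}, {(2, 2)}))          # False
```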
  • Using the multidimensional image model outlined above, the average density of a cluster obtained at pyramid level k is obtained by noting the average number of samples with the given cluster label in an image bin. By letting $N_c^k$ be the number of image bins for cluster c at pyramid level k and $n_{cl}^k$ be the number of sample points belonging to cluster c in bin l at pyramid level k, the average density of the cluster c can be given by
  • $$\mathrm{Density}(c^k) = \frac{\sum_{l=1}^{N_c^k} n_{cl}^k}{N_c^k}.$$
  • The grouping constraint of density attempts to group clusters that have a small difference in density. That is, given two clusters from pyramid level k-1, $c_i^{k-1}, c_j^{k-1}$, the density constraint is $|\mathrm{Density}(c_i^{k-1}) - \mathrm{Density}(c_j^{k-1})| \le \tau$.
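  • A minimal sketch of the density computation and the density constraint follows. The names `average_density` and `density_compatible` are hypothetical, and a cluster is assumed to be given as a list holding the bin coordinate of each of its samples.

```python
from collections import Counter
from typing import List, Tuple

Bin = Tuple[int, ...]


def average_density(sample_bins: List[Bin]) -> float:
    """Average number of samples per occupied bin for one cluster."""
    if not sample_bins:
        raise ValueError("a cluster must contain at least one sample")
    counts = Counter(sample_bins)  # samples per occupied bin
    return sum(counts.values()) / len(counts)


def density_compatible(density_i: float, density_j: float, tau: float) -> bool:
    """Density constraint: the two densities differ by at most tau."""
    return abs(density_i - density_j) <= tau


d1 = average_density([(0, 0), (0, 0), (0, 1)])  # 3 samples in 2 bins
d2 = average_density([(1, 1)])                  # 1 sample in 1 bin
print(d1, d2, density_compatible(d1, d2, tau=0.5))  # 1.5 1.0 True
```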
  • Using the multidimensional image model outlined above, the orientation of the subspace can be characterized by the eigenvectors of the covariance matrix of the subspace. The covariance matrix of the cluster c is given by $X_c X_c^T$, where $X_c$ is the set of sample points that fall into the cluster c. The eigenvectors are obtained from the characteristic equation $(X_c X_c^T - \lambda I)v = 0$, where $\lambda$ are the eigenvalues obtained by solving the characteristic polynomial $\det(X_c X_c^T - \lambda I) = 0$. Because the covariance matrix is symmetric, the eigenvalues are real and the eigenvectors are orthogonal. In practice, the orientation of the cluster can be determined by the eigenvector corresponding to the largest eigenvalue.
  • The orientation constraint of grouping operates to group those clusters from a previous pyramid level at the next level when there is a small difference in their orientation. By letting $v_i, v_j$ be the eigenvectors corresponding to the largest eigenvalues for two clusters $c_i^{k-1}, c_j^{k-1}$ respectively, the two clusters are merged at the next pyramid level k if $\Theta(c_i^{k-1}, c_j^{k-1}) = |v_i \cdot v_j| \ge \rho$, where $0 \le \rho \le 1.0$. Here, the dot product represents the cosine of the angle between the above unit eigenvectors.
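  • The orientation measure can be sketched with a plain power iteration for the dominant eigenvector. All names below are hypothetical. Two choices are assumptions rather than part of the description: the samples are mean-centered before the matrix is formed, and power iteration (which can fail if the fixed starting vector happens to be orthogonal to the dominant direction) stands in for a full eigen-decomposition. Single-point clusters have no spread, so the level-0 value of 1.0 would have to be handled separately.

```python
import math
from typing import List

Vector = List[float]


def scatter_matrix(points: List[Vector]) -> List[List[float]]:
    """Sum of (x - mean)(x - mean)^T over the cluster (centering is an assumption)."""
    m = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(m)]
    s = [[0.0] * m for _ in range(m)]
    for p in points:
        c = [p[d] - mean[d] for d in range(m)]
        for a in range(m):
            for b in range(m):
                s[a][b] += c[a] * c[b]
    return s


def principal_direction(points: List[Vector], iterations: int = 100) -> Vector:
    """Unit eigenvector for the largest eigenvalue, found by power iteration."""
    s = scatter_matrix(points)
    m = len(s)
    v = [1.0 / (d + 1) for d in range(m)]  # fixed, non-symmetric starting vector
    for _ in range(iterations):
        w = [sum(s[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("orientation could not be determined")
        v = [x / norm for x in w]
    return v


def orientation_similarity(points_i: List[Vector], points_j: List[Vector]) -> float:
    """|v_i . v_j|, the absolute cosine of the angle between principal directions."""
    vi, vj = principal_direction(points_i), principal_direction(points_j)
    return abs(sum(a * b for a, b in zip(vi, vj)))


horizontal = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
shifted = [[0.0, 1.0], [1.0, 1.0], [3.0, 1.0]]
vertical = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
print(orientation_similarity(horizontal, shifted))   # close to 1.0
print(orientation_similarity(horizontal, vertical))  # close to 0.0
```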
  • Using the multidimensional image model outlined above, region contiguity should be implemented as a three-way constraint to determine if two of the regions belonging to different groups being merged potentially intersect with a third region belonging to a group that is already assembled, which would result in physically implausible clusters. The potential for having clusters consisting of intersecting regions belonging to different clusters is a particular concern at higher levels of the image pyramid. For purposes of perceptual grouping, the contiguity of two potential groups $c_i^{k-1}$ and $c_j^{k-1}$ can be detected if the potential minimum spanning tree (MST) formed from their merger $V_{ij}^k$ does not have an edge intersecting with the MST $V_l^k$ of a group $c_l^k$ already formed at this level or with $V_m^{k-1}$ for the region $c_m^{k-1}$ at the previous scale. The merger is $V_{ij}^k = V_i^{k-1} \cup V_j^{k-1} \cup \{E_{min}\}$, where $E_{min} = \min\{E_{uv},\ u \in c_i^{k-1},\ v \in c_j^{k-1}\}$ and $E_{uv}$ is the distance between the M-dimensional points u and v belonging to groups $c_i^{k-1}$ and $c_j^{k-1}$ respectively. The two groups $c_i^{k-1}$ and $c_j^{k-1}$ meet the contiguity constraint if $E_{min} \not\cap V_l^{k'}$, where $k'=k$ or $k-1$ (as the case may be) and $\not\cap$ denotes no proper line segment intersection.
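  • The "no proper line segment intersection" test can be sketched for the 2-dimensional case with the standard orientation (cross product) predicate. The restriction to two dimensions is an assumption made for illustration, the names `properly_intersect` and `edge_clear_of_tree` are hypothetical, and an MST is assumed to be given simply as a list of its edges.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Edge = Tuple[Point, Point]


def _orient(p: Point, q: Point, r: Point) -> float:
    """Signed area test: > 0 if r is left of the line p->q, < 0 if right, 0 if collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def properly_intersect(a: Point, b: Point, c: Point, d: Point) -> bool:
    """True if segments ab and cd cross at a point interior to both segments."""
    d1, d2 = _orient(c, d, a), _orient(c, d, b)
    d3, d4 = _orient(a, b, c), _orient(a, b, d)
    # Strict sign changes exclude touching endpoints and collinear overlap.
    return d1 * d2 < 0 and d3 * d4 < 0


def edge_clear_of_tree(edge: Edge, tree_edges: List[Edge]) -> bool:
    """Contiguity check: the joining edge crosses no edge of an existing tree."""
    return not any(properly_intersect(edge[0], edge[1], e[0], e[1]) for e in tree_edges)


tree = [((0.0, 2.0), (2.0, 0.0))]
print(edge_clear_of_tree(((0.0, 0.0), (2.0, 2.0)), tree))  # False, the edges cross
print(edge_clear_of_tree(((0.0, 0.0), (0.5, 0.5)), tree))  # True
```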
  • At each step of the pyramid, process 100 returns the number of clusters as the connected components grouped. Generally, the number of distinct groups can be expected to decrease with coarseness of sub-sampling until the minimum is reached. In the present exemplary embodiment, at step 130, the cluster curve is obtained and salient bends in the curve are identified as curvature change points to determine the clustering hierarchy. In practice, the number of clusters as a function of image sample size can be represented as a monotonically decreasing curve. FIG. 6 illustrates a graph of an exemplary variation of clusters for a sample set as a function of image resolution. The salient bends W, X, Y, and Z indicated by the arrows on the curve are used as levels of hierarchical clustering. Although the pyramidal representation gives the multi-resolution decomposition for purposes of grouping, not all levels may produce distinct changes in the grouping. Rather, distinct groups may emerge at certain levels of resolution and such emergence is usually marked with a distinct change in the number of clusters.
  • By plotting the variation in the number of clusters as a function of the image resolution, a pronounced bend can be observed at distinct points when salient groups emerge. These sharp changes in the cluster curve at points of steep slope or slope change can be used to signal the various levels in a hierarchical clustering. In particular, the last salient bend before the image shrinks to zero can be taken as the residual clusters at the top level. In exemplary embodiments, these change points can be detected using an algorithm for scale-space salient change detection in which the bends are identified as curvature change points, that is, where there is a zero crossing in the second derivative of the signal. By noting the image size at each bend, the corresponding subspaces are retained as levels of the clustering hierarchy. It should be noted that with the pyramidal grouping, the highest level need not be limited to a single cluster as in conventional agglomerative clustering because region grouping is dictated not only by proximity but also by orientation and change in density.
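  • As a rough illustration of the zero-crossing criterion, the sketch below takes the cluster counts at successive pyramid levels, forms the discrete second difference, and reports the levels at which its sign changes. This is a simplification and is not the scale-space change detection algorithm referred to above, which is not specified here. The name `bend_levels` and the sample counts are hypothetical.

```python
from typing import List


def bend_levels(counts: List[int]) -> List[int]:
    """Levels at which the discrete second difference of the cluster curve changes sign."""
    # second[idx] is the second difference centered on level idx + 1.
    second = [counts[k + 1] - 2 * counts[k] + counts[k - 1]
              for k in range(1, len(counts) - 1)]
    bends = []
    last_sign = 0
    for idx, value in enumerate(second):
        sign = (value > 0) - (value < 0)
        if sign != 0:
            # A change relative to the last non-zero sign marks a zero crossing.
            if last_sign != 0 and sign != last_sign:
                bends.append(idx + 1)
            last_sign = sign
    return bends


# Hypothetical cluster counts n_k for pyramid levels k = 0..7.
print(bend_levels([100, 60, 30, 28, 27, 10, 9, 9]))  # [4, 5]
```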
  • FIGS. 5 a-5 f are graphs illustrating the different levels of an exemplary pyramidal grouping performed in accordance with the present exemplary process. In this exemplary embodiment, multi-resolution decomposition is used to guide the perceptual grouping of data points using factors of proximity, orientation, and density. The number of clusters decreases as a function of image size. FIG. 5 a shows the original data. The clusterings produced at each level in the hierarchy, using the bends on the cluster curve as an indication of level, are shown in FIGS. 5 b-5 f. As can be seen, there are only a few levels of clustering, with the top level of the hierarchy representing the top-level structures perceived in the dataset without grouping all the data points into a single cluster, which is remarkably different from conventional hierarchical clustering schemes.
  • As described above, the exemplary clustering process illustrated in FIG. 4 can be used to create a multidimensional image pyramid by successively sub-sampling the original multidimensional image formed from the data samples. The exemplary clustering process is based upon perceptual grouping concepts that are applicable to high-dimensional spaces. The approach identifies bends in cluster curves that it then uses to determine the cluster hierarchy. The process can identify clusters which would be obvious using a visual metaphor but would otherwise be unrecognized using only proximity based measures. The details of the exemplary clustering process illustrated in FIG. 4 can be summarized as follows:
  • 1. At level 0, $c_i^0 = X_i$, $\mathrm{Density}(c_i) = 1.0$, $\Theta(c_i^0, c_j^0) = 1.0$, and $D^0(c_i) = \{D_i^0\}$.
  • 2. Given the subspaces (clusters from pyramid level k-1), the clusters at level k are assembled. Let $n_k$ be the number of clusters at pyramid level k.
  • 3. For k=1 to log L0 do
      For i=1 to n_{k-1} do
        Make-set(i);
      For i=1 to n_{k-1} do
        For j=i+1 to n_{k-1} do
          If (∃ l,m : |D_{li}^k − D_{mj}^k| ≦ 1) &&
             (|Density(c_i^{k-1}) − Density(c_j^{k-1})| ≦ τ) &&
             (Θ(c_i^{k-1}, c_j^{k-1}) ≧ ρ) &&
             (E_min has no proper intersection with V_l^{k'})
          {
            if (find-set(i) ≠ find-set(j))
              union(i,j)
          }
  • 4. Find salient bends in the curve plotting $n_k$ vs. $L_k$. Let the bends be at positions $\{(L_{l_1}, n_{l_1}), \ldots, (L_{l_S}, n_{l_S})\}$.
  • 5. The hierarchical levels for clustering are given by $\{L_{l_1}, \ldots, L_{l_S}\}$. The corresponding clusters at each cluster hierarchy level are identified as $\{\{c_1^{l_1}, \ldots, c_{n_{l_1}}^{l_1}\}, \ldots, \{c_1^{l_S}, \ldots, c_{n_{l_S}}^{l_S}\}\}$.
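  • The merging loop of step 3 can be sketched for a single pyramid level as follows. This is a reduced illustration, not the full process: only the proximity and density constraints are checked, the orientation and contiguity constraints are omitted, and density is evaluated on the current grid rather than carried over from the previous level. The names `bin_of`, `find`, and `merge_level` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Bin = Tuple[int, ...]


def bin_of(sample: Sequence[float], size: int) -> Bin:
    """Bin coordinates of a normalized sample on a grid with `size` bins per axis."""
    return tuple(min(int(f * size), size - 1) for f in sample)


def find(parent: Dict[int, int], i: int) -> int:
    """Find-set with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def merge_level(samples: List[Sequence[float]], labels: List[int],
                size: int, tau: float) -> List[int]:
    """Merge previous-level clusters on the current grid and return new labels."""
    ids = sorted(set(labels))
    bins = {c: [bin_of(s, size) for s, lab in zip(samples, labels) if lab == c]
            for c in ids}
    density = {c: len(b) / len(set(b)) for c, b in bins.items()}
    parent = {c: c for c in ids}  # Make-set for every previous-level cluster
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            proximal = any(all(abs(p - q) <= 1 for p, q in zip(u, v))
                           for u in set(bins[a]) for v in set(bins[b]))
            if proximal and abs(density[a] - density[b]) <= tau:
                ra, rb = find(parent, a), find(parent, b)
                if ra != rb:
                    parent[rb] = ra  # union
    return [find(parent, lab) for lab in labels]


samples = [[0.05, 0.05], [0.15, 0.05], [0.9, 0.9]]
print(merge_level(samples, [0, 1, 2], size=4, tau=0.5))  # [0, 0, 2]
```

  • Running `merge_level` repeatedly over the image sizes of the pyramid, feeding each result in as the next set of labels, yields the cluster counts from which the cluster curve of step 4 would be built.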
  • The exemplary process has three parameters, namely, L0, τ, ρ, that can be chosen as follows. The starting image size for the pyramidal sampling can be taken as the minimum non-zero distance between pairs of sample points in the data set. Nevertheless, a starting sampling size based on the distance that is at 1/100th percentile in the sorted list of distances between pairs of points can usually be sufficient. In this case, supposing the pair-wise Euclidean distance at 1/100th percentile is d,
  • $$L_0 = \frac{1}{d}.$$
  • The process does not require a priori information regarding the number of clusters or starting points for the computation.
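  • One possible reading of the choice of the starting size is sketched below. The parameter `fraction` is hypothetical: a value of 0.0 selects the minimum non-zero pairwise distance, and a small positive value selects a distance slightly further into the sorted list, which is one interpretation of the percentile-based choice described above. Rounding the result up to an integer grid size is also an assumption.

```python
import math
from itertools import combinations
from typing import List, Sequence


def starting_size(samples: List[Sequence[float]], fraction: float = 0.0) -> int:
    """L0 = 1/d, with d taken from the sorted non-zero pairwise Euclidean distances."""
    dists = sorted(d for d in (math.dist(a, b) for a, b in combinations(samples, 2))
                   if d > 0.0)
    if not dists:
        raise ValueError("need at least two distinct sample points")
    index = min(int(fraction * len(dists)), len(dists) - 1)
    d = dists[index]
    return math.ceil(1.0 / d)  # assumption: round up to an integer number of bins


print(starting_size([[0.0, 0.0], [0.25, 0.0], [1.0, 0.0]]))  # 4
```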
  • In exemplary embodiments, the clustering output can be generated in the form of a tabular list of point coordinates along with the corresponding cluster labels. The output can also be visualized using any suitable technique for visualizing clustering results.
  • The capabilities of exemplary embodiments of the present invention described above can be implemented in software, firmware, hardware, or some combination thereof, and may be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system, or other apparatus adapted for carrying out the methods and/or functions described herein, is suitable. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when loaded and executed, controls the computer system such that it carries out the methods described herein. Exemplary embodiments of the present invention can also be embedded in a computer program product, which comprises features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods.
  • Computer program means or computer program in the present context include any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after conversion to another language, code or notation, and/or reproduction in a different material form.
  • Therefore, one or more aspects of exemplary embodiments of the present invention can be included in an article of manufacture (for example, one or more computer program products) having, for instance, computer usable media. The media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately. Furthermore, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the exemplary embodiments of the present invention described above can be provided.
  • For instance, exemplary embodiments of the present invention can be implemented within the exemplary embodiment of a hardware configuration provided for a computer system in FIG. 7. FIG. 7 illustrates an exemplary computer system 10 upon which exemplary embodiments of the present invention can be implemented. A processor or CPU 12 receives data and instructions for operating upon from on-board cache memory or further cache memory 18, possibly through the mediation of a cache controller 20, which can in turn receive such data from system read/write memory (“RAM”) 22 through a RAM controller 24, or from various peripheral devices through a system bus 26. The data and instruction contents of RAM 22 will ordinarily have been loaded from peripheral devices such as a system disk 27. Alternative sources include communications interface 28, which can receive instructions and data from other computer systems.
  • The above-described program or modules implementing exemplary embodiments of the present invention can work on processor 12 and the like to perform clustering. The program or modules implementing exemplary embodiments may be stored in an external storage medium. In addition to system disk 27, an optical recording medium such as a DVD and a PD, a magneto-optical recording medium such as a MD, a tape medium, a semiconductor memory such as an IC card, and the like may be used as the storage medium. Moreover, the program may be provided to computer system 10 through the network by using, as the recording medium, a storage device such as a hard disk or a RAM, which is provided in a server system connected to a dedicated communication network or the Internet.
  • Although exemplary embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions and alterations can be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Variations described for exemplary embodiments of the present invention can be realized in any combination desirable for each particular application. Thus particular limitations, and/or embodiment enhancements described herein, which may have particular advantages to a particular application, need not be used for all applications. Also, not all limitations need be implemented in methods, systems, and/or apparatuses including one or more concepts described with relation to exemplary embodiments of the present invention.
  • While exemplary embodiments of the present invention have been described, it will be understood that those skilled in the art, both now and in the future, may make various modifications without departing from the spirit and the scope of the present invention as set forth in the following claims. These following claims should be construed to maintain the proper protection for the present invention.

Claims (5)

1. A data processing system comprising:
a processor;
a random access memory for storing data and programs for execution by the processor; and
computer readable instructions stored in the random access memory for execution by the processor to perform a method for clustering data points in a multidimensional dataset in a multidimensional image space, the method comprising:
generating a multidimensional image from the multidimensional dataset;
generating a pyramid of multidimensional images having varying resolution levels by successively performing a pyramidal sub-sampling of the multidimensional image;
identifying data clusters at each resolution level of the pyramid by applying a set of perceptual grouping constraints; and
determining levels of a clustering hierarchy by identifying each salient bend in a variation curve of a magnitude of identified data clusters as a function of pyramid resolution level.
2. The data processing system of claim 1, wherein the pyramidal sub-sampling performed comprises a logarithmic sampling strategy.
3. The data processing system of claim 1, wherein identifying data clusters at each resolution level comprises identifying data clusters at a top resolution level of the pyramid by applying the set of perceptual grouping constraints, and identifying data clusters at each successively lower resolution level of the pyramid by extracting clusters identified at the previous resolution level and applying the set of perceptual grouping constraints.
4. The data processing system of claim 1, wherein the set of perceptual grouping constraints comprises proximity, density, orientation similarity, and region contiguity.
5. The data processing system of claim 1, wherein determining levels of the clustering hierarchy comprises identifying salient bends at points at which there is a zero crossing in a second derivative of the variation curve.
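
The steps recited in claims 1 and 5 can be illustrated with a minimal Python sketch. It is not the patented implementation; it only shows the general shape of the technique under several assumptions that do not come from the claims: the data is two-dimensional and scaled to the unit square, each pyramid step halves the resolution by summing 2 x 2 blocks, the perceptual grouping constraints are replaced by plain 8-connected grouping of occupied cells, clusters are counted independently at each level rather than propagated between levels as in claim 3, and the "magnitude" of the clusters is taken to be their count. All function names are hypothetical.

```python
from collections import deque


def build_image(points, size):
    """Rasterise 2-D points in [0, 1) x [0, 1) into a size x size count grid."""
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


def subsample(grid):
    """One pyramid step: halve the resolution by summing each 2 x 2 block.

    Assumes a square grid; an odd trailing row or column is dropped.
    """
    half = len(grid) // 2
    return [
        [
            grid[2 * r][2 * c] + grid[2 * r][2 * c + 1]
            + grid[2 * r + 1][2 * c] + grid[2 * r + 1][2 * c + 1]
            for c in range(half)
        ]
        for r in range(half)
    ]


def build_pyramid(grid):
    """Return the list of grids from full resolution down to a single cell."""
    levels = [grid]
    while len(levels[-1]) > 1:
        levels.append(subsample(levels[-1]))
    return levels


def count_clusters(grid):
    """Count 8-connected groups of occupied cells.

    This is a simple stand-in for the perceptual grouping constraints.
    """
    n = len(grid)
    seen = [[False] * n for _ in range(n)]
    clusters = 0
    for r in range(n):
        for c in range(n):
            if grid[r][c] == 0 or seen[r][c]:
                continue
            clusters += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill over occupied neighbours
                cr, cc = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < n and 0 <= nc < n
                                and grid[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
    return clusters


def cluster_curve(pyramid):
    """Cluster count as a function of pyramid level (finest level first)."""
    return [count_clusters(level) for level in pyramid]


def salient_bends(curve):
    """Find sign changes in the discrete second derivative of the curve.

    second[i] is the second difference centred on curve[i + 1]. For each
    adjacent pair of second differences with strictly opposite signs, the
    curve index of the first one is reported. Exact zeros are ignored.
    """
    second = [curve[i - 1] - 2 * curve[i] + curve[i + 1]
              for i in range(1, len(curve) - 1)]
    return [i + 1 for i in range(len(second) - 1)
            if second[i] * second[i + 1] < 0]
```

As a usage example, `salient_bends([16, 12, 4, 2, 2])` returns `[1]`: the second differences are -4, 6 and 2, so the only sign change lies between curve indices 1 and 2. In this sketch each reported index would mark one candidate level of the clustering hierarchy.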

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
US12/143,131 US7558425B1 (en) 2008-01-08 2008-06-20 Finding structures in multi-dimensional spaces using image-guided clustering

Applications Claiming Priority (2)

Application Number Publication Number Priority Date Filing Date Title
US11/970,946 US7406200B1 (en) 2008-01-08 2008-01-08 Method and system for finding structures in multi-dimensional spaces using image-guided clustering
US12/143,131 US7558425B1 (en) 2008-01-08 2008-06-20 Finding structures in multi-dimensional spaces using image-guided clustering

Related Parent Applications (1)

Application Number Relation Publication Number Priority Date Filing Date Title
US11/970,946 Continuation US7406200B1 (en) 2008-01-08 2008-01-08 Method and system for finding structures in multi-dimensional spaces using image-guided clustering

Publications (2)

Publication Number Publication Date
US7558425B1 (en) 2009-07-07
US20090175544A1 (en) 2009-07-09

Family

ID=39643332

Family Applications (3)

Application Number Status Publication Number Priority Date Filing Date Title
US11/970,946 Expired - Fee Related US7406200B1 (en) 2008-01-08 2008-01-08 Method and system for finding structures in multi-dimensional spaces using image-guided clustering
US12/143,131 Expired - Fee Related US7558425B1 (en) 2008-01-08 2008-06-20 Finding structures in multi-dimensional spaces using image-guided clustering
US12/168,547 Expired - Fee Related US7519227B1 (en) 2008-01-08 2008-07-07 Finding structures in multi-dimensional spaces using image-guided clustering

Country Status (1)

Country Link
US (3) US7406200B1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120011186A1 (en) * 2010-07-08 2012-01-12 National Cheng Kung University Method for quantifying and analyzing intrinsic parallelism of an algorithm
US8340415B2 (en) 2010-04-05 2012-12-25 Microsoft Corporation Generation of multi-resolution image pyramids
JP2013530477A (en) * 2010-07-06 2013-07-25 ナショナル・チェン・クン・ユニヴァーシティ Method for quantifying and analyzing parallel processing of algorithms
US8547389B2 (en) 2010-04-05 2013-10-01 Microsoft Corporation Capturing image structure detail from a first image and color from a second image
US9336302B1 (en) 2012-07-20 2016-05-10 Zuci Realty Llc Insight and algorithmic clustering for automated synthesis
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080312552A1 (en) * 2007-06-18 2008-12-18 Qienyuan Zhou Method to detect change in tissue measurements
US9363143B2 (en) 2008-03-27 2016-06-07 International Business Machines Corporation Selective computation using analytic functions
US20090248722A1 (en) * 2008-03-27 2009-10-01 International Business Machines Corporation Clustering analytic functions
US8145672B2 (en) * 2008-10-10 2012-03-27 Exxonmobil Research And Engineering Company Method for clustering of large high-dimensional datasets
US20110190657A1 (en) * 2009-08-10 2011-08-04 Carl Zeiss Meditec, Inc. Glaucoma combinatorial analysis
US8990207B2 (en) * 2009-12-04 2015-03-24 University Of South Carolina Optimization and visual controls for regionalization
US8560544B2 (en) 2010-09-15 2013-10-15 International Business Machines Corporation Clustering of analytic functions
US9519705B2 (en) * 2011-01-25 2016-12-13 President And Fellows Of Harvard College Method and apparatus for selecting clusterings to classify a data set
US9357911B2 (en) 2011-05-09 2016-06-07 Carl Zeiss Meditec, Inc. Integration and fusion of data from diagnostic measurements for glaucoma detection and progression analysis
US8971665B2 (en) 2012-07-31 2015-03-03 Hewlett-Packard Development Company, L.P. Hierarchical cluster determination based on subgraph density
US9224071B2 (en) * 2012-11-19 2015-12-29 Microsoft Technology Licensing, Llc Unsupervised object class discovery via bottom up multiple class learning
US9183261B2 (en) 2012-12-28 2015-11-10 Shutterstock, Inc. Lexicon based systems and methods for intelligent media search
US9183215B2 (en) 2012-12-29 2015-11-10 Shutterstock, Inc. Mosaic display systems and methods for intelligent media search
EP3465470A1 (en) * 2016-06-07 2019-04-10 Open As App GmbH Method and device for generating an electronic document specification based on a n-dimensional data source
US9968251B2 (en) 2016-09-30 2018-05-15 Carl Zeiss Meditec, Inc. Combined structure-function guided progression analysis
US11726753B2 (en) 2016-12-03 2023-08-15 Thomas STACHURA Spreadsheet-based software application development
US10216494B2 (en) 2016-12-03 2019-02-26 Thomas STACHURA Spreadsheet-based software application development
CN110033051B (en) * 2019-04-18 2021-08-20 杭州电子科技大学 Fishing trawler behavior discrimination method based on multi-step clustering

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040013305A1 (en) * 2001-11-14 2004-01-22 Achi Brandt Method and apparatus for data clustering including segmentation and boundary detection
US20040202370A1 (en) * 2003-04-08 2004-10-14 International Business Machines Corporation Method, system and program product for representing a perceptual organization of an image
US20070185946A1 (en) * 2004-02-17 2007-08-09 Ronen Basri Method and apparatus for matching portions of input images

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8340415B2 (en) 2010-04-05 2012-12-25 Microsoft Corporation Generation of multi-resolution image pyramids
US8547389B2 (en) 2010-04-05 2013-10-01 Microsoft Corporation Capturing image structure detail from a first image and color from a second image
JP2013530477A (en) * 2010-07-06 2013-07-25 ナショナル・チェン・クン・ユニヴァーシティ Method for quantifying and analyzing parallel processing of algorithms
US20120011186A1 (en) * 2010-07-08 2012-01-12 National Cheng Kung University Method for quantifying and analyzing intrinsic parallelism of an algorithm
US9336302B1 (en) 2012-07-20 2016-05-10 Zuci Realty Llc Insight and algorithmic clustering for automated synthesis
US9607023B1 (en) 2012-07-20 2017-03-28 Ool Llc Insight and algorithmic clustering for automated synthesis
US10318503B1 (en) 2012-07-20 2019-06-11 Ool Llc Insight and algorithmic clustering for automated synthesis
US11216428B1 (en) 2012-07-20 2022-01-04 Ool Llc Insight and algorithmic clustering for automated synthesis
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis

Also Published As

Publication number Publication date
US7558425B1 (en) 2009-07-07
US7519227B1 (en) 2009-04-14
US7406200B1 (en) 2008-07-29

Similar Documents

Publication Publication Date Title
US7558425B1 (en) Finding structures in multi-dimensional spaces using image-guided clustering
Fischer et al. Bagging for path-based clustering
Lampert et al. Efficient subwindow search: A branch and bound framework for object localization
US11294624B2 (en) System and method for clustering data
US8644622B2 (en) Compact signature for unordered vector sets with application to image retrieval
US8548256B2 (en) Method for fast scene matching
US8200010B1 (en) Image segmentation by clustering web images
Tasse et al. Cluster-based point set saliency
US20120301014A1 (en) Learning to rank local interest points
US8971614B2 (en) Extracting object edges from images
Shirazi et al. Content-based image retrieval using texture color shape and region
CN111400528B (en) Image compression method, device, server and storage medium
Qi et al. Diagram structure recognition by bayesian conditional random fields
Zhang et al. Improved adaptive image retrieval with the use of shadowed sets
Liu et al. Human action recognition based on 3D SIFT and LDA model
Bazan et al. Quantitative analysis of similarity measures of distributions
Lee et al. Model-based detection, segmentation, and classification for image analysis using on-line shape learning
Takahashi et al. Applying manifold learning to plotting approximate contour trees
Castellano et al. Shape annotation by semi-supervised fuzzy clustering
Bouteldja et al. A comparative analysis of SVM, K-NN, and decision trees for high resolution satellite image scene classification
Khoder et al. Multicriteria classification method for dimensionality reduction adapted to hyperspectral images
Chamalis et al. Region merging for image segmentation based on unimodality tests
Chen et al. Robust semi-supervised manifold learning algorithm for classification
Al-Azzawy Eigenface and SIFT for gender classification
Nayef et al. Efficient symbol retrieval by building a symbol index from a collection of line drawings

Legal Events

Date Code Title Description
FEPP Fee payment procedure Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY
REMI Maintenance fee reminder mailed
FPAY Fee payment Year of fee payment: 4
SULP Surcharge for late payment
REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362
FP Lapsed due to failure to pay maintenance fee Effective date: 20170707