US20080021896A1 - Automatic query clustering - Google Patents

Automatic query clustering Download PDF

Info

Publication number
US20080021896A1
US20080021896A1 US11/838,126 US83812607A US2008021896A1 US 20080021896 A1 US20080021896 A1 US 20080021896A1 US 83812607 A US83812607 A US 83812607A US 2008021896 A1 US2008021896 A1 US 2008021896A1
Authority
US
United States
Prior art keywords
items
properties
property
clusterization
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/838,126
Inventor
Andrzej Turski
Richard Rashid
Matthew MacLaurin
Lili Cheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp filed Critical Microsoft Corp
Priority to US11/838,126 priority Critical patent/US20080021896A1/en
Assigned to MICROSOFT CORPORATION reassignment MICROSOFT CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CHENG, LILI, MACLAURIN, MATTHEW, RASHID, RICHARD F., TURSKI, ANDRZEJ
Publication of US20080021896A1 publication Critical patent/US20080021896A1/en
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC reassignment MICROSOFT TECHNOLOGY LICENSING, LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MICROSOFT CORPORATION
Abandoned legal-status Critical Current

Links

Images

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B62LAND VEHICLES FOR TRAVELLING OTHERWISE THAN ON RAILS
    • B62BHAND-PROPELLED VEHICLES, e.g. HAND CARTS OR PERAMBULATORS; SLEDGES
    • B62B5/00Accessories or details specially adapted for hand carts
    • B62B5/04Braking mechanisms; Locking devices against movement
    • B62B5/0485Braking mechanisms; Locking devices against movement by braking on the running surface, e.g. the tyre
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/14Details of searching files based on file metadata
    • G06F16/148File search processing
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60BVEHICLE WHEELS; CASTORS; AXLES FOR WHEELS OR CASTORS; INCREASING WHEEL ADHESION
    • B60B33/00Castors in general; Anti-clogging castors
    • B60B33/04Castors in general; Anti-clogging castors adjustable, e.g. in height; linearly shifting castors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99933Query processing, i.e. searching
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99933Query processing, i.e. searching
    • Y10S707/99934Query formulation, input preparation, or translation

Definitions

  • the present invention relates generally to computer systems, and more particularly to a system and methods that automatically organize information items into a smaller subset of items by analyzing an item distribution associated with various property clusters.
  • a database-based operating system One key aspect of a database-based operating system is an ability to find desired items quickly by executing a query that may involve a number of item properties. This should be compared to previous systems, which required the knowledge of a file location within a folder hierarchy to retrieve desired information, for example. While the query approach is very powerful, the success of newer systems generally depends on an ability to create a user interface (UI) that allows queries to be simple and intuitive for average users. In its native form, database queries (e.g., expressed in T-SQL language) are difficult to handle for professional programmers and typically inappropriate for end users.
  • UI user interface
  • a predefined query could be provided to find all the picture files on a disk (Picture Library), or all unread email.
  • a system may suggest grouping results in a certain manner, e.g., the pictures may be automatically put into groups according to the date taken.
  • Such patterns of predefined queries are useful for many common scenarios, but they are not general enough to unlock the full power of the database.
  • it may happen that all the pictures were taken the same day (or, maybe, the camera clock wasn't set,) in which case grouping by date is useless.
  • the situation is even worse when dealing with 3rd party properties (application-defined, administrator-defined, or user-defined). Since these properties are not known to the creators of the operating system, designing predefined queries for the properties may be almost impossible.
  • Another approach is to provide users an ability to query databases with textual queries that appear like a natural language. Such queries can be general enough from the database point of view and easy to understand for the users. However, if natural language queries that can take a totally free form are allowed, it is difficult to create a parser that will correctly understand the user's intention in each case. If some grammar restrictions are imposed, it becomes more difficult for users to form a syntactically correct query that can sometimes be worked around with chunk expressions. In either case, the very idea that the query text needs to be typed in may not be appealing to many users. Small kids, non-English users, and users of keyboard-less devices (e.g., Tablet PC) may all have problems with textual typing. Thus, there is a need for a query interface that has point-and-click simplicity for finding and retrieving information.
  • keyboard-less devices e.g., Tablet PC
  • the present invention relates to automatic retrieval and display of desired information into a subset of easily manageable information clusters.
  • navigating a large set of items such as by displaying the items as lists becomes problematic when attempting to find and retrieve desired information from such lists.
  • the present invention provides an improved point-and-click interface that facilitates navigating a large set of items classified by associated properties of the items. Items clustered by these properties can be presented in a folder-like manner (or other display type), whereby automatic clusterization may be performed by a different or subsequent property to split or organize query results into an easily manageable subset of clusters. These subsets may then be selected to retrieve desired information or to perform other clustering procedures (e.g., nested clusterization). The best property to cluster on can be determined by analyzing an item distribution in various property clusters.
  • One aspect of the present invention provides an automatic selection of a clusterization property.
  • a problem can be stated as follows: Given a starting set of items, and a set of item properties that can be used for grouping, which property associated with the set of items offers the best-clustered results?
  • the present invention attempts to provide a uniform grouping of results into a moderate number of clusters.
  • cases when there are just a few clusters with a large number of items, or a large number of clusters with just a few items in each cluster are typically not desired in order to efficiently find and retrieve desired information.
  • the above problem can be solved by assigning a clusterization score to each item property and selecting the property with the highest score.
  • the clusterization score can be calculated by multiplying together the number of items in each cluster.
  • a function to calculate the clusterization score as a product of cluster sizes has its maximum when the items are split into ⁇ N clusters, respective clusters having ⁇ N items.
  • the score is utilized to measure and compare how far the distribution is from an ideal distribution.
  • An example of an alternative score function can be based on a binomial distribution, for example. For these type distributions, the score value has a statistical interpretation that provides a number of ways N_total items can be split into clusters of given size.
  • clusterization that has most value for the user is the one that mitigates the largest number of alternative distributions.
  • clusterization scores can be calculated for all of the properties, wherein such calculations can be easily performed with a single pass through of a list of items.
  • FIG. 1 is a schematic block diagram of a clustering system in accordance with an aspect of the present invention.
  • FIG. 2 is a flow diagram illustrating an automatic query clustering process in accordance with an aspect of the present invention.
  • FIGS. 3-10 illustrate example user interfaces for automatic query clustering in accordance with an aspect of the present invention.
  • FIG. 11 is a schematic block diagram illustrating a suitable operating environment in accordance with an aspect of the present invention.
  • FIG. 12 is a schematic block diagram of a sample-computing environment with which the present invention can interact.
  • the present invention relates to a system and methodology for automatic clusterization and display of data items in a local or remote database system.
  • clusterization can be based on properties associated with the data items such as a type, location, people, date, time, user-defined, and so forth, wherein an initial property may be employed to form a first level of clusterization and a subsequent property may be automatically determined to form an optimized clusterization from which to find and retrieve desired information.
  • a computerized interface for organizing and retrieving data is provided.
  • the interface includes a property analyzer to determine an item distribution for at least two cluster properties and an organizer that forms new clusters based in part on the item distribution.
  • a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a server and the server can be a component.
  • One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers. Also, these components can execute from various computer readable media having various data structures stored thereon.
  • the components may communicate via local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems via the signal).
  • a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems via the signal).
  • the system 100 includes a data storage 110 that stores a plurality of data items 120 to be displayed at a user interface (not shown).
  • Such items 120 can include documents, files, folders, images, audio files, source code and so forth that can appear in various viewable states at the user interface which is described in more detail below.
  • the items 120 are also associated with various properties (e.g., metadata) describing such aspects as an item's type (e.g., image, document, spreadsheet, binary, and so forth), date created, people associated with the item, location, category, user-defined property, and so forth.
  • An aggregator 130 collects the items 120 and associated properties and presents the items to a property analyzer 140 that performs an analysis of respective items and properties. For example, such analysis can include automatically determining a score for various possible clustering scenarios or potential groupings for items.
  • a cluster organizer 150 Based upon the analysis by the analyzer 140 , a cluster organizer 150 presents an optimized grouping of new clusters 160 to a user.
  • the optimized grouping of clusters 160 facilitates finding and retrieving desired information from the data storage 110 which can include local storage mediums, remote storage mediums, or a combination of local and remote storage.
  • a default top-level clusterization can group items by item type.
  • a first level grouping by item type is useful and well understood by the users.
  • a second level clusterization by another property is not obvious and difficult to discover.
  • one aspect of the present invention is an automatic selection of a clusterization property. The problem can be stated as follows: Given as starting set of items, and a set of item properties that can be used for grouping, which property offers the best clustered results? By best or optimized clustered results, it is a goal to provide a uniform grouping of items into a moderate number of clusters.
  • the above goal can be achieved by assigning a clusterization score to each item property and selecting the property with the highest score.
  • a function to calculate the clusterization score as a product of cluster sizes has its maximum when the items are split into ⁇ N clusters, each cluster with ⁇ N items.
  • the score is used to measure and compare how far it is from the ideal or optimized distribution. It was found that the above score function produced reasonable results in test cases.
  • the score function utilized is an example. For example, other functions may be employed that provide different relative weights to the distributions off the ideal distribution.
  • score ( N _total)!/(( n _items cluster1 )!*( n _items cluster2 )!* . . . )
  • the score value has a statistical interpretation that it provides a number of ways N_total items can be split into clusters of given size.
  • the clusterization that has most value for the user is the one that mitigates the largest number of alternative distributions.
  • the clusterization scores for all of the properties are calculated. This can be easily achieved with a single pass through the list of all items as is described in more detail in the process outlined in FIG. 2 .
  • FIG. 2 is a flow diagram illustrating an automatic clusterization process 200 in accordance with an aspect of the present invention. While, for purposes of simplicity of explanation, the methodology is shown and described as a series of acts, it is to be understood and appreciated that the present invention is not limited by the order of acts, as some acts may, in accordance with the present invention, occur in different orders and/or concurrently with other acts from that shown and described herein. For example, those skilled in the art will understand and appreciate that a methodology could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all illustrated acts may be required to implement a methodology in accordance with the present invention.
  • the process 200 may be employed as follows: At 210 , initialize M hash tables. At 220 , iterate through N items. At 230 , for each item, iterate through M properties. At 240 , for each item property, calculate a hash value. A hash function is selected in such a way, that two property values going into the same cluster return the same hash value. For example, when clustering date/time property the hash function may be based on the date part only, ignoring the time part. At 250 , the hash tables are employed to track the number of clusters and number of items in each cluster. At 260 , a clusterization score is calculated for each property using the data from its associated hash table.
  • the properties on the list are ordered by the quality of clusters they may produce. If the number of items exceeds some threshold (e.g., more then 10 items,)
  • the results may be automatically clustered using the property on the top of the list at 280 .
  • other clusters may be suggested that are next in order as alternatives.
  • the above process automatically clusters the results by the message sender in test cases for email messages.
  • selecting the items of type Word Document for example, clusters were created based on the last modification date, whereas items of type C# source files were grouped by their folder containment (which corresponded to a grouping by programming project).
  • the general nature of the above approach allows determining a grouping algorithm which is most appropriate for a given set of items, including in the evaluation of custom and 3rd party properties as well.
  • FIGS. 3-10 illustrate various example user interfaces that illustrate one or more of the automatic clustering systems and processes previously described. It is noted that these interfaces may include a display having one or more display objects including such aspects as configurable icons, buttons, sliders, input boxes, selection options, menus, tabs and so forth having multiple configurable dimensions, shapes, colors, text, data and sounds to facilitate operations with the system 100 .
  • the interface can also include a plurality of other inputs or controls for adjusting and configuring one or more aspects of the present invention and as will be described in more detail below. This can include receiving user commands from a mouse, keyboard, speech input, web site, remote web service and/or other device such as a camera or video input to affect or modify operations of the interface or other aspects of the system 100 .
  • Property up/downgrading can be considered at different levels. At the application level, application designer can indicate which properties are the primary properties to be exposed in the user interface, and which are secondary or auxiliary ones. This is typically to be defined separately for each item type. Automatic query clustering described in the previous section generally considers the primary properties. Moreover, each item type should define property mapping for the properties that are common to all items. For example, a common Date property can be mapped to Date Taken for pictures, but mapped as Last Modification Date for documents. Similarly, a People field can be Author for documents but Sender for email, and so forth.
  • each property may have its weight which gets increased when the user switches from a different property clusterization to another one, and decreased when they switch out.
  • a final rating of each property is a product of the property weight and the clusterization score (calculated according to the formulas described above).
  • item type is defined by the file extension.
  • User-friendly names for the file types can be used as defined by current viewer programs. Different file extensions that resulted in the same friendly name were generally already grouped together (e.g., both .h and .hxx are called C/C++ Header Files).
  • C/C++ Header Files Different file extensions that resulted in the same friendly name were generally already grouped together (e.g., both .h and .hxx are called C/C++ Header Files).
  • one more level of hierarchy can be introduced by grouping all the files of similar type.
  • metagroups of Document Files, Picture Files, Music and Video Files, Programming Files, and Other Files were considered and processed. Also, people metagroups may be processed as class objects.
  • email addresses can be split into internal (derived from the corporate address book) and external (usually from the user's personal contact list.)
  • Some people may have multiple methods of communication, in which case they may end up in multiple clusters.
  • Property clusters unlike traditional folders, have no restriction that the item is in one place only.
  • Folders represent a user-created grouping of items. While it is expected that over time that property-based clustering of items will diminish the need for and significance of folders, folders can still be supported. Folders are generally organized hierarchically and folder clusters should resemble this hierarchy.
  • One disadvantage of the folder hierarchy is that it includes a number of directories of little interest to the user, like Program Files or the Windows directory. When using existing folders to organize items into clusters, an obvious improvement is to display only a part of the folder hierarchy that does contain some of the items in the view.
  • FIG. 3 is a sample interface 300 that contains programming files on (Volume C:).
  • the view includes the full folder structure.
  • clustering files by “category” includes only folders related to the actually selected set of items (subset of the full folder tree.)
  • FIG. 4 is an interface 400 that demonstrates clustering by folders.
  • Another aspect of a folder hierarchy is that it joins the concept of physical location (this or that disk, or an external share) with a logical one (placement in a folder hierarchy.) Since logical groups may be created that may span several physical locations, the physical location may be separated from the folder property and thus present folders having the same name together, regardless of their physical location. As can be appreciated, grouping by location is also provided.
  • FIG. 5 is an example interface 500 of a folder (VSS) that exists on two drives (Volume C: and Volume D:)
  • VSS folder
  • Volume C: and Volume D: drives
  • the interface 500 combines the folder content from the physical locations at 510 .
  • This functionality is based on the assumption that if two or more folders have the same name it happens on purpose. If this is not the case, the files can be easily separated by a location property 610 in a user interface 600 depicted in FIG. 6 .
  • FIG. 7 is an interface 700 illustrating clustering by date properties.
  • Clustering by date and time has a natural year/month/day/hour/minute hierarchy. However, there is also a concept of relative time—relative to now. It is believed both concepts are important. Date clusters include a number of predefined queries (dynamic groups) that include items from today, yesterday, and so forth.
  • One interesting item classifications is by associating items with people.
  • item properties that may be used to create such associations, for example, sender or recipient for email messages or attachments, author for documents, person pictured for photographs, and so forth.
  • Clustering items by people may pose a special challenge because of social connotations carried by any presentation of people hierarchy. For example, people may be grouped by some formal attributes, like Internal or External Contacts, but some of these groups may still be too large to be handled efficiently. For example, the list of internal contacts referenced by a sample email message has about 5,000 names.
  • the list can be ordered alphabetically or grouped by first letters (dictionary-like,) but any list that long is generally difficult to comprehend.
  • One problem is that names of the people significant to a user are obscured by names of little known people that are there by accident. It can be assumed that most significant contacts are those that were emailed by the user most often and most recently, or who were the authors or co-authors of documents on the user's disk, and so forth. Using some weighted analysis, a list of all people ordered by their relative significance to the user can be constructed.
  • FIG. 8 is an example interface 800 that illustrates semi-collapsed lists for viewing associated people. This can include a hierarchical expansion of a people list which nevertheless is presented to the user as a single flat alphabetically ordered list. When the list is first shown, it only contains the top few (10-20) most significant names in an alphabetical order. This allows for a simple one-click access to the information about the most relevant people. At the same time, the top names act as dictionary bookmarks—each one can be expanded to show the names of the second level or other tertiary levels.
  • indices are put at the beginning and end of every page to indicate the content of that page. This can be described as a “constant space” between consecutive indices.
  • the words chosen for indices are not particular in any way, they just happen to be at the beginning or end of the page.
  • the names chosen for indices are those that are on the top of an “importance” list. Using the dictionary analogy, these would be the words that are most frequently looked up. Moreover, these names are entries by themselves—clicking on the name selects it. This provides a one-click access to the most common entries, rather then scrolling to the page that contains the entry. On the other hand, there may be a variable number of second-order entries between indices. When the number of second-order entries is large enough, a third order index can be created, and so on.
  • FIG. 9 illustrates semi-collapsed groups 900 , whereas a group 1000 is shown in an expanded state in FIG. 10 when selected from the groups 900 .
  • FIG. 10 also depicts the group 1000 in a semi collapsed state at 1010 .
  • Typical ways to visualize groups is to show some representation of the group as a whole (collapsed view), or the collection of all the items in the group (expanded view.)
  • a standard Windows representation with the folder list on the left and the item list on the right, can be thought of as an expanded view for the currently visible folder and a collapsed view for all other folders.
  • Subfolders of the current folder are typically shown in a collapsed view, even if the thumbnail of the subfolder may contain a collage of a few items inside it. Sometimes more then one expanded group may be visible concurrently or when the items are shown grouped into stacks.
  • a third state is introduced which shows the first few items of the group—this is called the “squeezed” or “semi-collapsed” state of the group at 900 .
  • a single button is clicked repeatedly to cycle between expanded state at 1000 , squeezed at 1010 , and collapsed states at 900 .
  • the interface 900 is a File Viewer showing two semi-collapsed groups and the third small enough to be shown fully open at 910 .
  • squeezed state is that the group takes up less room on the screen than an open state, but gives the user more information about the group than the closed state. This allows more groups to be visible while still providing detailed information about the contents of the group. The user can more quickly evaluate the groups in a large set of items, which in turn provides more efficient evaluation and manipulation of large groups of items.
  • a second advantage is that the collapsed state still provides direct one-click access to the few visible items. Assuming the visible items were selected by their “importance” to the user (e.g., most recent, or most often accessed in the past,) the visible items are those that the user is most likely looking for. For example, to print a picture recently sent to somebody, the user can scroll to Pictures group and the file should be right on the top of the list (as the one of the most recently accessed.) This may be compared to the current viewers—if the picture thumbnail is shown in the folder icon, the user would still need to open the folder to access the file. At the end, the squeezed view is about half way between the collapsed and expanded ones: it tries to balance viewing and manipulating groups as whole with an access to individual items.
  • the semi-collapsed view provides a convenient way to access selected items from the group (without having to deal with all the items in it) users can be given control over which items show up in the semi-collapsed view and how many of the items appear.
  • the items may be sorted by a predetermined criteria and those items shown are from the top of the sorted list.
  • the user may change the criteria to sort on and the number of elements shown.
  • a convenient and useful way to sort documents is by the last modification date.
  • the semi-collapsed view may show the top n most recent documents from the list by default, and may have a button to show the next n.
  • An alternative is to have a button to show the remaining documents from today, yesterday, last week, last month, etc.
  • the order of the items shown is the same one used to limit the visible items.
  • another approach is to order the items in a way most convenient to the user and not necessarily the same one as the criteria to select the items. For example, people are generally best sorted alphabetically, even if the selection order is by “importance.”
  • the items in the squeezed group can be displayed as a semi-collapsed list.
  • the semi-collapsed list can be selectively expanded to show more items. (Alternatively, the whole group can be expanded to show all items.)
  • the semi-collapsed list view can be used for any type of items and when the sort order is different from the selection order. (can also use the semi-expanded list view if the sort and selection orders are the same).
  • An example is a list of favorite songs sorted alphabetically. The user can expand parts of the list to show less popular songs, but the next songs coming into view will be selected by their popularity.
  • clusters When creating a property hierarchy, higher level clusters typically include the content of all nested ones.
  • the Documents cluster includes all of Word Documents, Excel Worksheets, and so forth.
  • items from year 2003 include the items from individual months, which in turn include the individual days.
  • Any container (cluster or folder) may be considered a standalone item, to be manipulated as a single entity, or merely a group of items, used to organize the view.
  • the primary function of the item browser is to enable easily finding of the requested item(s).
  • traversing down property clusters is just one of the ways.
  • the browsing functionality can be greatly enhanced by allowing for horizontal searches, which go to some related items rather then drilling down the property hierarchy.
  • the browser should allow for organizing items in any user-defined manner.
  • searching for items users often work by associating the items together. For example, the exact date when the document was last edited may not be known, but the user may remember that it was just before an important meeting. The meeting itself may be easy to find, at which point the most relevant query is to “show all documents from the same date.”
  • an exemplary environment 1110 for implementing various aspects of the invention includes a computer 1112 .
  • the computer 1112 includes a processing unit 1114 , a system memory 1116 , and a system bus 1118 .
  • the system bus 1118 couples system components including, but not limited to, the system memory 1116 to the processing unit 1114 .
  • the processing unit 1114 can be any of various available processors. Dual microprocessors and other multiprocessor architectures also can be employed as the processing unit 1114 .
  • the system bus 1118 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, 16-bit bus, Industrial Standard Architecture (ISA), Micro-Channel Architecture (MSA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Advanced Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), and Small Computer Systems Interface (SCSI).
  • ISA Industrial Standard Architecture
  • MSA Micro-Channel Architecture
  • EISA Extended ISA
  • IDE Intelligent Drive Electronics
  • VLB VESA Local Bus
  • PCI Peripheral Component Interconnect
  • USB Universal Serial Bus
  • AGP Advanced Graphics Port
  • PCMCIA Personal Computer Memory Card International Association bus
  • SCSI Small Computer Systems Interface
  • the system memory 1116 includes volatile memory 1120 and nonvolatile memory 1122 .
  • the basic input/output system (BIOS) containing the basic routines to transfer information between elements within the computer 1112 , such as during start-up, is stored in nonvolatile memory 1122 .
  • nonvolatile memory 1122 can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory.
  • Volatile memory 1120 includes random access memory (RAM), which acts as external cache memory.
  • RAM is available in many forms such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).
  • SRAM synchronous RAM
  • DRAM dynamic RAM
  • SDRAM synchronous DRAM
  • DDR SDRAM double data rate SDRAM
  • ESDRAM enhanced SDRAM
  • SLDRAM Synchlink DRAM
  • DRRAM direct Rambus RAM
  • Computer 1112 also includes removable/non-removable, volatile/non-volatile computer storage media.
  • FIG. 11 illustrates, for example a disk storage 1124 .
  • Disk storage 1124 includes, but is not limited to, devices like a magnetic disk drive, floppy disk drive, tape drive, Jaz drive, Zip drive, LS-100 drive, flash memory card, or memory stick.
  • disk storage 1124 can include storage media separately or in combination with other storage media including, but not limited to, an optical disk drive such as a compact disk ROM device (CD-ROM), CD recordable drive (CD-R Drive), CD rewritable drive (CD-RW Drive) or a digital versatile disk ROM drive (DVD-ROM).
  • CD-ROM compact disk ROM device
  • CD-R Drive CD recordable drive
  • CD-RW Drive CD rewritable drive
  • DVD-ROM digital versatile disk ROM drive
  • a removable or non-removable interface is typically used such as interface 1126 .
  • FIG. 11 describes software that acts as an intermediary between users and the basic computer resources described in suitable operating environment 1110 .
  • Such software includes an operating system 1128 .
  • Operating system 1128 which can be stored on disk storage 1124 , acts to control and allocate resources of the computer system 1112 .
  • System applications 1130 take advantage of the management of resources by operating system 1128 through program modules 1132 and program data 1134 stored either in system memory 1116 or on disk storage 1124 . It is to be appreciated that the present invention can be implemented with various operating systems or combinations of operating systems.
  • Input devices 1136 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 1114 through the system bus 1118 via interface port(s) 1138 .
  • Interface port(s) 1138 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB).
  • Output device(s) 1140 use some of the same type of ports as input device(s) 1136 .
  • a USB port may be used to provide input to computer 1112 , and to output information from computer 1112 to an output device 1140 .
  • Output adapter 1142 is provided to illustrate that there are some output devices 1140 like monitors, speakers, and printers, among other output devices 1140 , that require special adapters.
  • the output adapters 1142 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 1140 and the system bus 1118 . It should be noted that other devices and/or systems of devices provide both input and output capabilities such as remote computer(s) 1144 .
  • Computer 1112 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 1144 .
  • the remote computer(s) 1144 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to computer 1112 .
  • only a memory storage device 1146 is illustrated with remote computer(s) 1144 .
  • Remote computer(s) 1144 is logically connected to computer 1112 through a network interface 1148 and then physically connected via communication connection 1150 .
  • Network interface 1148 encompasses communication networks such as local-area networks (LAN) and wide-area networks (WAN).
  • LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 1102.3, Token Ring/IEEE 1102.5 and the like.
  • WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).
  • ISDN Integrated Services Digital Networks
  • DSL Digital Subscriber Lines
  • Communication connection(s) 1150 refers to the hardware/software employed to connect the network interface 1148 to the bus 1118 . While communication connection 1150 is shown for illustrative clarity inside computer 1112 , it can also be external to computer 1112 .
  • the hardware/software necessary for connection to the network interface 1148 includes, for exemplary purposes only, internal and external technologies such as, modems including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.
  • FIG. 12 is a schematic block diagram of a sample-computing environment 1200 with which the present invention can interact.
  • the system 1200 includes one or more client(s) 1210 .
  • the client(s) 1210 can be hardware and/or software (e.g., threads, processes, computing devices).
  • the system 1200 also includes one or more server(s) 1230 .
  • the server(s) 1230 can also be hardware and/or software (e.g., threads, processes, computing devices).
  • the servers 1230 can house threads to perform transformations by employing the present invention, for example.
  • One possible communication between a client 1210 and a server 1230 may be in the form of a data packet adapted to be transmitted between two or more computer processes.
  • the system 1200 includes a communication framework 1250 that can be employed to facilitate communications between the client(s) 1210 and the server(s) 1230 .
  • the client(s) 1210 are operably connected to one or more client data store(s) 1260 that can be employed to store information local to the client(s) 1210 .
  • the server(s) 1230 are operably connected to one or more server data store(s) 1240 that can be employed to store information local to the servers 1230 .

Abstract

The present invention relates to a system and methodology for automatic clusterization and display of data items in a local or remote database system. Such clusterization can be based on properties associated with the data items such as a type, location, people, date, time, user-defined, and so forth, wherein an initial property may be employed to form a first level of clusterization and a subsequent property may be automatically determined to form an optimized clusterization from which to find and retrieve desired information. A computerized interface for organizing and retrieving data is provided. The interface includes a property analyzer to determine an item distribution for at least two cluster properties and an organizer that forms new clusters based in part on the item distribution.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a divisional of U.S. patent application Ser. No. 10/764,738, filed Jan. 26, 2004, entitled “AUTOMATIC QUERY CLUSTERING”. The entirety of the aforementioned application is incorporated herein by reference.
  • TECHNICAL FIELD
  • The present invention relates generally to computer systems, and more particularly to a system and methods that automatically organize information items into a smaller subset of items by analyzing an item distribution associated with various property clusters.
  • BACKGROUND OF THE INVENTION
  • One key aspect of a database-based operating system is an ability to find desired items quickly by executing a query that may involve a number of item properties. This should be compared to previous systems, which required the knowledge of a file location within a folder hierarchy to retrieve desired information, for example. While the query approach is very powerful, the success of newer systems generally depends on an ability to create a user interface (UI) that allows queries to be simple and intuitive for average users. In its native form, database queries (e.g., expressed in T-SQL language) are difficult to handle for professional programmers and typically inappropriate for end users.
  • One approach to the query problem is to expose user interface commands that provide direct access to some number of predefined queries. For example, a predefined query could be provided to find all the picture files on a disk (Picture Library), or all unread email. Furthermore, a system may suggest grouping results in a certain manner, e.g., the pictures may be automatically put into groups according to the date taken. Such patterns of predefined queries are useful for many common scenarios, but they are not general enough to unlock the full power of the database. Using the picture example, it may happen that all the pictures were taken the same day (or, maybe, the camera clock wasn't set,) in which case grouping by date is useless. The situation is even worse when dealing with 3rd party properties (application-defined, administrator-defined, or user-defined). Since these properties are not known to the creators of the operating system, designing predefined queries for the properties may be almost impossible.
  • Another approach is to provide users an ability to query databases with textual queries that appear like a natural language. Such queries can be general enough from the database point of view and easy to understand for the users. However, if natural language queries that can take a totally free form are allowed, it is difficult to create a parser that will correctly understand the user's intention in each case. If some grammar restrictions are imposed, it becomes more difficult for users to form a syntactically correct query that can sometimes be worked around with chunk expressions. In either case, the very idea that the query text needs to be typed in may not be appealing to many users. Small kids, non-English users, and users of keyboard-less devices (e.g., Tablet PC) may all have problems with textual typing. Thus, there is a need for a query interface that has point-and-click simplicity for finding and retrieving information.
  • SUMMARY OF THE INVENTION
  • The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is not intended to identify key/critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.
  • The present invention relates to automatic retrieval and display of desired information into a subset of easily manageable information clusters. In a file system user interface, navigating a large set of items such as by displaying the items as lists becomes problematic when attempting to find and retrieve desired information from such lists. The present invention provides an improved point-and-click interface that facilitates navigating a large set of items classified by associated properties of the items. Items clustered by these properties can be presented in a folder-like manner (or other display type), whereby automatic clusterization may be performed by a different or subsequent property to split or organize query results into an easily manageable subset of clusters. These subsets may then be selected to retrieve desired information or to perform other clustering procedures (e.g., nested clusterization). The best property to cluster on can be determined by analyzing an item distribution in various property clusters.
  • One aspect of the present invention provides an automatic selection of a clusterization property. In order to determine such properties, a problem can be stated as follows: Given a starting set of items, and a set of item properties that can be used for grouping, which property associated with the set of items offers the best-clustered results? By best-clustered results, the present invention attempts to provide a uniform grouping of results into a moderate number of clusters. Thus, cases when there are just a few clusters with a large number of items, or a large number of clusters with just a few items in each cluster are typically not desired in order to efficiently find and retrieve desired information.
  • The above problem can be solved by assigning a clusterization score to each item property and selecting the property with the highest score. The clusterization score can be calculated by multiplying together the number of items in each cluster. For N items, a function to calculate the clusterization score as a product of cluster sizes has its maximum when the items are split into √N clusters, respective clusters having √N items. For other distributions, the score is utilized to measure and compare how far the distribution is from an ideal distribution. An example of an alternative score function can be based on a binomial distribution, for example. For these type distributions, the score value has a statistical interpretation that provides a number of ways N_total items can be split into clusters of given size. The clusterization that has most value for the user is the one that mitigates the largest number of alternative distributions. To compare different properties that can be used for subsequent clusterization, clusterization scores can be calculated for all of the properties, wherein such calculations can be easily performed with a single pass through of a list of items.
  • To the accomplishment of the foregoing and related ends, certain illustrative aspects of the invention are described herein in connection with the following description and the annexed drawings. These aspects are indicative of various ways in which the invention may be practiced, all of which are intended to be covered by the present invention. Other advantages and novel features of the invention may become apparent from the following detailed description of the invention when considered in conjunction with the drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a schematic block diagram of a clustering system in accordance with an aspect of the present invention.
  • FIG. 2 is a flow diagram illustrating an automatic query clustering process in accordance with an aspect of the present invention.
  • FIGS. 3-10 illustrate example user interfaces for automatic query clustering in accordance with an aspect of the present invention.
  • FIG. 11 is a schematic block diagram illustrating a suitable operating environment in accordance with an aspect of the present invention.
  • FIG. 12 is a schematic block diagram of a sample-computing environment with which the present invention can interact.
  • DETAILED DESCRIPTION OF THE INVENTION
  • The present invention relates to a system and methodology for automatic clusterization and display of data items in a local or remote database system. Such clusterization can be based on properties associated with the data items such as a type, location, people, date, time, user-defined, and so forth, wherein an initial property may be employed to form a first level of clusterization and a subsequent property may be automatically determined to form an optimized clusterization from which to find and retrieve desired information. In one aspect, a computerized interface for organizing and retrieving data is provided. The interface includes a property analyzer to determine an item distribution for at least two cluster properties and an organizer that forms new clusters based in part on the item distribution.
  • As used in this application, the terms “component,” “analyzer,” “cluster,” “system,” and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers. Also, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate via local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems via the signal).
  • Referring initially to FIG. 1, a query clustering system 100 is illustrated in accordance with an aspect of the present invention. The system 100 includes a data storage 110 that stores a plurality of data items 120 to be displayed at a user interface (not shown). Such items 120 can include documents, files, folders, images, audio files, source code and so forth that can appear in various viewable states at the user interface which is described in more detail below. The items 120 are also associated with various properties (e.g., metadata) describing such aspects as an item's type (e.g., image, document, spreadsheet, binary, and so forth), date created, people associated with the item, location, category, user-defined property, and so forth. An aggregator 130 collects the items 120 and associated properties and presents the items to a property analyzer 140 that performs an analysis of respective items and properties. For example, such analysis can include automatically determining a score for various possible clustering scenarios or potential groupings for items.
  • Based upon the analysis by the analyzer 140, a cluster organizer 150 presents an optimized grouping of new clusters 160 to a user. The optimized grouping of clusters 160 facilitates finding and retrieving desired information from the data storage 110 which can include local storage mediums, remote storage mediums, or a combination of local and remote storage.
  • In one example of automatic clustering, a default top-level clusterization can group items by item type. In a user study, it was found that a first level grouping by item type is useful and well understood by the users. However, it was also found that a second level clusterization by another property is not obvious and difficult to discover. Thus, one aspect of the present invention is an automatic selection of a clusterization property. The problem can be stated as follows: Given as starting set of items, and a set of item properties that can be used for grouping, which property offers the best clustered results? By best or optimized clustered results, it is a goal to provide a uniform grouping of items into a moderate number of clusters.
  • The above goal can be achieved by assigning a clusterization score to each item property and selecting the property with the highest score. The clusterization score can be calculated by multiplying together the number of items in each cluster such as in the following equation:
    score=n_itemscluster1 * n_itemscluster2* . . .
  • For N items, a function to calculate the clusterization score as a product of cluster sizes has its maximum when the items are split into √N clusters, each cluster with √N items. For other distributions, the score is used to measure and compare how far it is from the ideal or optimized distribution. It was found that the above score function produced reasonable results in test cases. However, it is noted that the score function utilized is an example. For example, other functions may be employed that provide different relative weights to the distributions off the ideal distribution.
  • An example of an alternative score function is based on binomial distribution as follows:
    score=(N_total)!/((n_itemscluster1)!*(n_itemscluster2)!* . . . )
  • In this example, the score value has a statistical interpretation that it provides a number of ways N_total items can be split into clusters of given size. The clusterization that has most value for the user is the one that mitigates the largest number of alternative distributions. To compare different properties that can be used for subsequent clusterization, the clusterization scores for all of the properties are calculated. This can be easily achieved with a single pass through the list of all items as is described in more detail in the process outlined in FIG. 2.
  • FIG. 2 is a flow diagram illustrating an automatic clusterization process 200 in accordance with an aspect of the present invention. While, for purposes of simplicity of explanation, the methodology is shown and described as a series of acts, it is to be understood and appreciated that the present invention is not limited by the order of acts, as some acts may, in accordance with the present invention, occur in different orders and/or concurrently with other acts from that shown and described herein. For example, those skilled in the art will understand and appreciate that a methodology could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all illustrated acts may be required to implement a methodology in accordance with the present invention.
  • Assuming there are N items and M properties to compare, the process 200 may be employed as follows: At 210, initialize M hash tables. At 220, iterate through N items. At 230, for each item, iterate through M properties. At 240, for each item property, calculate a hash value. A hash function is selected in such a way, that two property values going into the same cluster return the same hash value. For example, when clustering date/time property the hash function may be based on the date part only, ignoring the time part. At 250, the hash tables are employed to track the number of clusters and number of items in each cluster. At 260, a clusterization score is calculated for each property using the data from its associated hash table.
  • At 270, the properties on the list are ordered by the quality of clusters they may produce. If the number of items exceeds some threshold (e.g., more then 10 items,)
  • the results may be automatically clustered using the property on the top of the list at 280. Also, other clusters may be suggested that are next in order as alternatives. For example, when selecting all the items of type email message, the above process automatically clusters the results by the message sender in test cases for email messages. However, selecting the items of type Word Document, for example, clusters were created based on the last modification date, whereas items of type C# source files were grouped by their folder containment (which corresponded to a grouping by programming project). The general nature of the above approach allows determining a grouping algorithm which is most appropriate for a given set of items, including in the evaluation of custom and 3rd party properties as well.
  • FIGS. 3-10 illustrate various example user interfaces that illustrate one or more of the automatic clustering systems and processes previously described. It is noted that these interfaces may include a display having one or more display objects including such aspects as configurable icons, buttons, sliders, input boxes, selection options, menus, tabs and so forth having multiple configurable dimensions, shapes, colors, text, data and sounds to facilitate operations with the system 100. In addition, the interface can also include a plurality of other inputs or controls for adjusting and configuring one or more aspects of the present invention and as will be described in more detail below. This can include receiving user commands from a mouse, keyboard, speech input, web site, remote web service and/or other device such as a camera or video input to affect or modify operations of the interface or other aspects of the system 100.
  • The following discussion describes various aspects of the present invention and is related to the example interfaces depicted in FIGS. 3-10. When designing a folder or other type structure, designers (whether it's an application programmer or an end user) have high degree of freedom, which allows hiding unimportant or rarely used items from the top level view by putting the items into a hidden folder. Similarly, when creating a property based browser, various mechanisms can be provided to hide properties that are meaningless or otherwise not very useful—even if the clustering algorithm determines a high score.
  • Property up/downgrading can be considered at different levels. At the application level, application designer can indicate which properties are the primary properties to be exposed in the user interface, and which are secondary or auxiliary ones. This is typically to be defined separately for each item type. Automatic query clustering described in the previous section generally considers the primary properties. Moreover, each item type should define property mapping for the properties that are common to all items. For example, a common Date property can be mapped to Date Taken for pictures, but mapped as Last Modification Date for documents. Similarly, a People field can be Author for documents but Sender for email, and so forth.
  • Generally, it is the user who should be able to decide which properties are best to view their respective data. There may be an explicit UI to promote or demote any particular property, but the present invention can also learn implicitly from user actions (e.g., via learning algorithms). Each property may have its weight which gets increased when the user switches from a different property clusterization to another one, and decreased when they switch out. A final rating of each property (used to decide which property to cluster on) is a product of the property weight and the clusterization score (calculated according to the formulas described above).
  • As discussed above, users generally prefer a hierarchical organization of item type clusters over a flat list. The hierarchy introduced some type of order and made it easier to find the requested item type value. The same should be true for any property that has more then a few different property values. The following describes specific example techniques for organizing property values into a hierarchical view.
  • In the case of regular files, item type is defined by the file extension. User-friendly names for the file types can be used as defined by current viewer programs. Different file extensions that resulted in the same friendly name were generally already grouped together (e.g., both .h and .hxx are called C/C++ Header Files). In addition, one more level of hierarchy can be introduced by grouping all the files of similar type. In a prototype, metagroups of Document Files, Picture Files, Music and Video Files, Programming Files, and Other Files were considered and processed. Also, people metagroups may be processed as class objects.
  • For example, a list of item type=people can be split into smaller sections by the type of communication channel that can be used to reach a given person. This includes groups of people that can be reached by post mail, by phone, by instant messaging, or by email, for example. Each of these groups can be divided further, if desired. For example, in a corporate environment, email addresses can be split into internal (derived from the corporate address book) and external (usually from the user's personal contact list.) Some people may have multiple methods of communication, in which case they may end up in multiple clusters. Property clusters, unlike traditional folders, have no restriction that the item is in one place only.
  • Folders represent a user-created grouping of items. While it is expected that over time that property-based clustering of items will diminish the need for and significance of folders, folders can still be supported. Folders are generally organized hierarchically and folder clusters should resemble this hierarchy. One disadvantage of the folder hierarchy is that it includes a number of directories of little interest to the user, like Program Files or the Windows directory. When using existing folders to organize items into clusters, an obvious improvement is to display only a part of the folder hierarchy that does contain some of the items in the view.
  • FIG. 3 is a sample interface 300 that contains programming files on (Volume C:). In Windows Explorer, for example, the view includes the full folder structure. In a prototype, clustering files by “category” includes only folders related to the actually selected set of items (subset of the full folder tree.)
  • FIG. 4 is an interface 400 that demonstrates clustering by folders. Another aspect of a folder hierarchy is that it joins the concept of physical location (this or that disk, or an external share) with a logical one (placement in a folder hierarchy.) Since logical groups may be created that may span several physical locations, the physical location may be separated from the folder property and thus present folders having the same name together, regardless of their physical location. As can be appreciated, grouping by location is also provided.
  • FIG. 5 is an example interface 500 of a folder (VSS) that exists on two drives (Volume C: and Volume D:) When looking at the “category” VSS, the interface 500 combines the folder content from the physical locations at 510. This functionality is based on the assumption that if two or more folders have the same name it happens on purpose. If this is not the case, the files can be easily separated by a location property 610 in a user interface 600 depicted in FIG. 6.
  • FIG. 7 is an interface 700 illustrating clustering by date properties. Clustering by date and time has a natural year/month/day/hour/minute hierarchy. However, there is also a concept of relative time—relative to now. It is believed both concepts are important. Date clusters include a number of predefined queries (dynamic groups) that include items from today, yesterday, and so forth.
  • One interesting item classifications is by associating items with people. There are many item properties that may be used to create such associations, for example, sender or recipient for email messages or attachments, author for documents, person pictured for photographs, and so forth. Clustering items by people may pose a special challenge because of social connotations carried by any presentation of people hierarchy. For example, people may be grouped by some formal attributes, like Internal or External Contacts, but some of these groups may still be too large to be handled efficiently. For example, the list of internal contacts referenced by a sample email message has about 5,000 names.
  • The list can be ordered alphabetically or grouped by first letters (dictionary-like,) but any list that long is generally difficult to comprehend. One problem is that names of the people significant to a user are obscured by names of little known people that are there by accident. It can be assumed that most significant contacts are those that were emailed by the user most often and most recently, or who were the authors or co-authors of documents on the user's disk, and so forth. Using some weighted analysis, a list of all people ordered by their relative significance to the user can be constructed.
  • However, presenting a long list of people names ordered by their calculated significance may not be an acceptable solution. The calculated order may be accidental and not correctly reflecting one's feeling of importance while finding names near the middle or bottom of the list may still be very difficult. Significance information should be used to select which names are shown first or on the top level, but order names alphabetically to make searching for a specific name easier and mitigating possible suggestions about people's relative importance.
  • FIG. 8 is an example interface 800 that illustrates semi-collapsed lists for viewing associated people. This can include a hierarchical expansion of a people list which nevertheless is presented to the user as a single flat alphabetically ordered list. When the list is first shown, it only contains the top few (10-20) most significant names in an alphabetical order. This allows for a simple one-click access to the information about the most relevant people. At the same time, the top names act as dictionary bookmarks—each one can be expanded to show the names of the second level or other tertiary levels.
  • This is somewhat similar to the hierarchy expansion, except that all the expanded names are shown on the top level as peers to the first level names. The latter is provided to mitigate connotations that one person is above the other one, which may be perceived negatively if it does not follow the organization hierarchy, for example. List expansion can be continued until the names from the bottom of the significance list come into view. However, since the expansion can be performed on the selected areas of the list, the total number of the visible names can be limited, typically just in tens. At any given time, the visible names are sorted alphabetically and presented as a single list. This makes it easy to find a requested name. It is noted, that a semi-collapsed list can be applied to many different classifications, not just people. A few obvious ones include a list of keywords (categories), and a list of dictionary (encyclopedia) entries.
  • The idea to use existing entries as catalog indices is common. In fact, this is the standard way to organize printed dictionaries. However, in the standard dictionary approach, indices are put at the beginning and end of every page to indicate the content of that page. This can be described as a “constant space” between consecutive indices. The words chosen for indices are not particular in any way, they just happen to be at the beginning or end of the page.
  • In the present invention, the names chosen for indices are those that are on the top of an “importance” list. Using the dictionary analogy, these would be the words that are most frequently looked up. Moreover, these names are entries by themselves—clicking on the name selects it. This provides a one-click access to the most common entries, rather then scrolling to the page that contains the entry. On the other hand, there may be a variable number of second-order entries between indices. When the number of second-order entries is large enough, a third order index can be created, and so on.
  • FIG. 9 illustrates semi-collapsed groups 900, whereas a group 1000 is shown in an expanded state in FIG. 10 when selected from the groups 900. FIG. 10 also depicts the group 1000 in a semi collapsed state at 1010. When presenting clusters (or other ways to group the items together,) another question is how the clusters are visualized on the screen. Typical ways to visualize groups is to show some representation of the group as a whole (collapsed view), or the collection of all the items in the group (expanded view.) In a standard Windows representation, with the folder list on the left and the item list on the right, can be thought of as an expanded view for the currently visible folder and a collapsed view for all other folders. Subfolders of the current folder are typically shown in a collapsed view, even if the thumbnail of the subfolder may contain a collage of a few items inside it. Sometimes more then one expanded group may be visible concurrently or when the items are shown grouped into stacks.
  • In file viewers which allow grouping and can display multiple groups concurrently, it is typical for the groups to be “collapsible”—the contents of a group may be individually shown or hidden. Nevertheless, the group can still exist in two states, and the expanded state allows interacting with the individual items in the group. In case of large groups, expanding one group obscures visibility of all others, which makes the multi-group view not as useful.
  • In the present invention, a third state is introduced which shows the first few items of the group—this is called the “squeezed” or “semi-collapsed” state of the group at 900. A single button is clicked repeatedly to cycle between expanded state at 1000, squeezed at 1010, and collapsed states at 900. The interface 900 is a File Viewer showing two semi-collapsed groups and the third small enough to be shown fully open at 910.
  • One advantage of the squeezed state is that the group takes up less room on the screen than an open state, but gives the user more information about the group than the closed state. This allows more groups to be visible while still providing detailed information about the contents of the group. The user can more quickly evaluate the groups in a large set of items, which in turn provides more efficient evaluation and manipulation of large groups of items.
  • A second advantage is that the collapsed state still provides direct one-click access to the few visible items. Assuming the visible items were selected by their “importance” to the user (e.g., most recent, or most often accessed in the past,) the visible items are those that the user is most likely looking for. For example, to print a picture recently sent to somebody, the user can scroll to Pictures group and the file should be right on the top of the list (as the one of the most recently accessed.) This may be compared to the current viewers—if the picture thumbnail is shown in the folder icon, the user would still need to open the folder to access the file. At the end, the squeezed view is about half way between the collapsed and expanded ones: it tries to balance viewing and manipulating groups as whole with an access to individual items.
  • Since the semi-collapsed view provides a convenient way to access selected items from the group (without having to deal with all the items in it) users can be given control over which items show up in the semi-collapsed view and how many of the items appear.
  • In one approach, the items may be sorted by a predetermined criteria and those items shown are from the top of the sorted list. The user may change the criteria to sort on and the number of elements shown. For example, a convenient and useful way to sort documents is by the last modification date. The semi-collapsed view may show the top n most recent documents from the list by default, and may have a button to show the next n. An alternative is to have a button to show the remaining documents from today, yesterday, last week, last month, etc. Typically, in all these cases, the order of the items shown is the same one used to limit the visible items. However, another approach is to order the items in a way most convenient to the user and not necessarily the same one as the criteria to select the items. For example, people are generally best sorted alphabetically, even if the selection order is by “importance.”
  • The items in the squeezed group can be displayed as a semi-collapsed list. The semi-collapsed list can be selectively expanded to show more items. (Alternatively, the whole group can be expanded to show all items.) The semi-collapsed list view can be used for any type of items and when the sort order is different from the selection order. (can also use the semi-expanded list view if the sort and selection orders are the same). An example is a list of favorite songs sorted alphabetically. The user can expand parts of the list to show less popular songs, but the next songs coming into view will be selected by their popularity.
  • When creating a property hierarchy, higher level clusters typically include the content of all nested ones. For example, the Documents cluster includes all of Word Documents, Excel Worksheets, and so forth. Similarly, items from year 2003 include the items from individual months, which in turn include the individual days. Any container (cluster or folder) may be considered a standalone item, to be manipulated as a single entity, or merely a group of items, used to organize the view.
  • The primary function of the item browser is to enable easily finding of the requested item(s). However, traversing down property clusters is just one of the ways. The browsing functionality can be greatly enhanced by allowing for horizontal searches, which go to some related items rather then drilling down the property hierarchy. Moreover, the browser should allow for organizing items in any user-defined manner. When searching for items, users often work by associating the items together. For example, the exact date when the document was last edited may not be known, but the user may remember that it was just before an important meeting. The meeting itself may be easy to find, at which point the most relevant query is to “show all documents from the same date.”
  • With reference to FIG. 11, an exemplary environment 1110 for implementing various aspects of the invention includes a computer 1112. The computer 1112 includes a processing unit 1114, a system memory 1116, and a system bus 1118. The system bus 1118 couples system components including, but not limited to, the system memory 1116 to the processing unit 1114. The processing unit 1114 can be any of various available processors. Dual microprocessors and other multiprocessor architectures also can be employed as the processing unit 1114.
  • The system bus 1118 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, 16-bit bus, Industrial Standard Architecture (ISA), Micro-Channel Architecture (MSA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Advanced Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), and Small Computer Systems Interface (SCSI).
  • The system memory 1116 includes volatile memory 1120 and nonvolatile memory 1122. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 1112, such as during start-up, is stored in nonvolatile memory 1122. By way of illustration, and not limitation, nonvolatile memory 1122 can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory 1120 includes random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).
  • Computer 1112 also includes removable/non-removable, volatile/non-volatile computer storage media. FIG. 11 illustrates, for example a disk storage 1124. Disk storage 1124 includes, but is not limited to, devices like a magnetic disk drive, floppy disk drive, tape drive, Jaz drive, Zip drive, LS-100 drive, flash memory card, or memory stick. In addition, disk storage 1124 can include storage media separately or in combination with other storage media including, but not limited to, an optical disk drive such as a compact disk ROM device (CD-ROM), CD recordable drive (CD-R Drive), CD rewritable drive (CD-RW Drive) or a digital versatile disk ROM drive (DVD-ROM). To facilitate connection of the disk storage devices 1124 to the system bus 1118, a removable or non-removable interface is typically used such as interface 1126.
  • It is to be appreciated that FIG. 11 describes software that acts as an intermediary between users and the basic computer resources described in suitable operating environment 1110. Such software includes an operating system 1128. Operating system 1128, which can be stored on disk storage 1124, acts to control and allocate resources of the computer system 1112. System applications 1130 take advantage of the management of resources by operating system 1128 through program modules 1132 and program data 1134 stored either in system memory 1116 or on disk storage 1124. It is to be appreciated that the present invention can be implemented with various operating systems or combinations of operating systems.
  • A user enters commands or information into the computer 1112 through input device(s) 1136. Input devices 1136 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 1114 through the system bus 1118 via interface port(s) 1138. Interface port(s) 1138 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output device(s) 1140 use some of the same type of ports as input device(s) 1136. Thus, for example, a USB port may be used to provide input to computer 1112, and to output information from computer 1112 to an output device 1140. Output adapter 1142 is provided to illustrate that there are some output devices 1140 like monitors, speakers, and printers, among other output devices 1140, that require special adapters. The output adapters 1142 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 1140 and the system bus 1118. It should be noted that other devices and/or systems of devices provide both input and output capabilities such as remote computer(s) 1144.
  • Computer 1112 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 1144. The remote computer(s) 1144 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to computer 1112. For purposes of brevity, only a memory storage device 1146 is illustrated with remote computer(s) 1144. Remote computer(s) 1144 is logically connected to computer 1112 through a network interface 1148 and then physically connected via communication connection 1150. Network interface 1148 encompasses communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 1102.3, Token Ring/IEEE 1102.5 and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).
  • Communication connection(s) 1150 refers to the hardware/software employed to connect the network interface 1148 to the bus 1118. While communication connection 1150 is shown for illustrative clarity inside computer 1112, it can also be external to computer 1112. The hardware/software necessary for connection to the network interface 1148 includes, for exemplary purposes only, internal and external technologies such as, modems including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.
  • FIG. 12 is a schematic block diagram of a sample-computing environment 1200 with which the present invention can interact. The system 1200 includes one or more client(s) 1210. The client(s) 1210 can be hardware and/or software (e.g., threads, processes, computing devices). The system 1200 also includes one or more server(s) 1230. The server(s) 1230 can also be hardware and/or software (e.g., threads, processes, computing devices). The servers 1230 can house threads to perform transformations by employing the present invention, for example. One possible communication between a client 1210 and a server 1230 may be in the form of a data packet adapted to be transmitted between two or more computer processes. The system 1200 includes a communication framework 1250 that can be employed to facilitate communications between the client(s) 1210 and the server(s) 1230. The client(s) 1210 are operably connected to one or more client data store(s) 1260 that can be employed to store information local to the client(s) 1210. Similarly, the server(s) 1230 are operably connected to one or more server data store(s) 1240 that can be employed to store information local to the servers 1230.
  • What has been described above includes examples of the present invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the present invention, but one of ordinary skill in the art may recognize that many further combinations and permutations of the present invention are possible. Accordingly, the present invention is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.

Claims (10)

1. A system for automatically clustering query results, comprising:
means for retrieving properties of a plurality of items;
means for determining a score for the plurality of items based upon the properties; and
means for automatically clustering data associated with the items based upon the determined score.
2. A method for automatic query clustering, comprising:
associating one or more properties with a plurality of data items;
determining a distribution for the data items based upon the properties; and
automatically clustering the data items based upon the determined distribution.
3. The method of claim 2, the distribution is determined from at least one of the following equations:

score=n_itemscluster1 *n_itemscluster2* . . .
score=(N_total)!/((n_itemscluster1)!*(n_itemscluster2)!* . . . )
4. The method of claim 2, further comprising processing N items and M properties.
5. The method of claim 4, further comprising at least one of initializing M hash tables, iterating through N items and, for each item, iterating through M properties.
6. The method of claim 5, further comprising calculating a hash value for each property.
7. The method of claim 6, further comprising calculating a clusterization score for each property using data from an associated hash table.
8. The method of claim 2, further comprising automatically organizing clusters based upon a predetermined threshold.
9. The method of claim 8, further comprising suggesting alternative cluster grouping.
10. The method of claim 8, further comprising organizing clusters based upon user-defined properties.
US11/838,126 2004-01-26 2007-08-13 Automatic query clustering Abandoned US20080021896A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/838,126 US20080021896A1 (en) 2004-01-26 2007-08-13 Automatic query clustering

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/764,738 US7257571B2 (en) 2004-01-26 2004-01-26 Automatic query clustering
US11/838,126 US20080021896A1 (en) 2004-01-26 2007-08-13 Automatic query clustering

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/764,738 Division US7257571B2 (en) 2004-01-26 2004-01-26 Automatic query clustering

Publications (1)

Publication Number Publication Date
US20080021896A1 true US20080021896A1 (en) 2008-01-24

Family

ID=34634626

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/764,738 Expired - Fee Related US7257571B2 (en) 2004-01-26 2004-01-26 Automatic query clustering
US11/838,126 Abandoned US20080021896A1 (en) 2004-01-26 2007-08-13 Automatic query clustering

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US10/764,738 Expired - Fee Related US7257571B2 (en) 2004-01-26 2004-01-26 Automatic query clustering

Country Status (14)

Country Link
US (2) US7257571B2 (en)
EP (1) EP1557774A3 (en)
JP (1) JP4101239B2 (en)
KR (1) KR101029403B1 (en)
CN (1) CN1648903A (en)
AU (1) AU2005200286A1 (en)
BR (1) BRPI0500784A (en)
CA (1) CA2494410A1 (en)
HK (1) HK1080164A1 (en)
MX (1) MXPA05001072A (en)
MY (1) MY145961A (en)
RU (1) RU2368948C2 (en)
TW (1) TW200529063A (en)
ZA (1) ZA200500736B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9047368B1 (en) * 2013-02-19 2015-06-02 Symantec Corporation Self-organizing user-centric document vault
US20160132521A1 (en) * 2014-11-07 2016-05-12 Semandex Networks Inc. Systems and methods for file clustering, multi-drive forensic analysis and data protection
CN108460049A (en) * 2017-02-21 2018-08-28 阿里巴巴集团控股有限公司 A kind of method and system of determining information category
US20180285458A1 (en) * 2017-04-04 2018-10-04 Yandex Europe Ag Method and system for generating an offline search engine result page

Families Citing this family (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8448069B2 (en) * 2004-04-23 2013-05-21 International Business Machines Corporation Object set property viewer
US8090698B2 (en) 2004-05-07 2012-01-03 Ebay Inc. Method and system to facilitate a search of an information resource
JP4533007B2 (en) * 2004-06-09 2010-08-25 キヤノン株式会社 Image management apparatus, image management method, and program
US9552141B2 (en) 2004-06-21 2017-01-24 Apple Inc. Methods and apparatuses for operating a data processing system
US7693860B2 (en) * 2005-03-18 2010-04-06 Microsoft Corporation Method and system to associate cell and item metadata
WO2006137223A1 (en) * 2005-06-22 2006-12-28 Nec Corporation Debug system, debug method, and program
US7739314B2 (en) * 2005-08-15 2010-06-15 Google Inc. Scalable user clustering based on set similarity
KR100713534B1 (en) * 2005-09-08 2007-04-30 삼성전자주식회사 Method for searching a user data in mobile communication terminal
US8234293B2 (en) * 2005-09-08 2012-07-31 Microsoft Corporation Autocompleting with queries to a database
US7792847B2 (en) * 2005-09-09 2010-09-07 Microsoft Corporation Converting structured reports to formulas
US8843481B1 (en) * 2005-09-30 2014-09-23 Yongyong Xu System and method of forming action based virtual communities and related search mechanisms
US7805433B2 (en) * 2005-10-14 2010-09-28 Microsoft Corporation Multidimensional cube functions
US20070168323A1 (en) * 2006-01-03 2007-07-19 Microsoft Corporation Query aggregation
WO2007088893A1 (en) * 2006-02-01 2007-08-09 Matsushita Electric Industrial Co., Ltd. Information sorting device and information retrieval device
KR100725411B1 (en) * 2006-02-06 2007-06-07 삼성전자주식회사 User interface for content browsing, method for the providing the user interface, and content browsing apparatus
US8214360B2 (en) * 2006-04-06 2012-07-03 International Business Machines Corporation Browser context based search disambiguation using existing category taxonomy
US20070239682A1 (en) * 2006-04-06 2007-10-11 Arellanes Paul T System and method for browser context based search disambiguation using a viewed content history
KR100806115B1 (en) 2006-05-02 2008-02-21 인하대학교 산학협력단 Design method of query classification component in multi-level dbms
US20070260600A1 (en) * 2006-05-08 2007-11-08 Mita Group Information discovery and group association
CN101122910A (en) * 2006-08-07 2008-02-13 鸿富锦精密工业(深圳)有限公司 Memory device and method for opening the stored file
US8266147B2 (en) * 2006-09-18 2012-09-11 Infobright, Inc. Methods and systems for database organization
WO2008034213A1 (en) * 2006-09-18 2008-03-27 Infobright Inc. A method and system for data compression in a relational database
US9817436B2 (en) 2007-01-07 2017-11-14 Apple Inc. Portable multifunction device, method, and graphical user interface for displaying user interface objects adaptively
JP4412342B2 (en) * 2007-03-30 2010-02-10 ソニー株式会社 CONTENT MANAGEMENT DEVICE, IMAGE DISPLAY DEVICE, IMAGING DEVICE, PROCESSING METHOD IN THEM, AND PROGRAM FOR CAUSING COMPUTER TO EXECUTE THE METHOD
US20090228826A1 (en) * 2008-03-04 2009-09-10 Microsoft Corporation Group filtering of items in a view
US7860866B2 (en) * 2008-03-26 2010-12-28 Microsoft Corporation Heuristic event clustering of media using metadata
WO2010060179A1 (en) * 2008-11-28 2010-06-03 Infobright Inc. Methods for organizing a relational database by using clustering operations
US20100299603A1 (en) * 2009-05-22 2010-11-25 Bernard Farkas User-Customized Subject-Categorized Website Entertainment Database
US20110116416A1 (en) * 2009-11-18 2011-05-19 Dobson Eric L System and method for geographically optimized wireless mesh networks
US9152318B2 (en) * 2009-11-25 2015-10-06 Yahoo! Inc. Gallery application for content viewing
WO2011112221A1 (en) 2010-03-12 2011-09-15 Exxonmobil Upstream Research Company Dynamic grouping of domain objects via smart groups
US8521748B2 (en) 2010-06-14 2013-08-27 Infobright Inc. System and method for managing metadata in a relational database
US8417727B2 (en) 2010-06-14 2013-04-09 Infobright Inc. System and method for storing data in a relational database
US9608941B2 (en) * 2011-02-07 2017-03-28 Google Inc. Methods and systems for clustering emails using email addresses
CN102253967B (en) * 2011-06-03 2013-03-27 用友软件股份有限公司 Data searching system and data searching method
US10078819B2 (en) * 2011-06-21 2018-09-18 Oath Inc. Presenting favorite contacts information to a user of a computing device
CN103049444B (en) * 2011-10-12 2016-09-28 阿里巴巴集团控股有限公司 A kind of storage method and system of data information classification structure
US9558278B2 (en) 2012-09-11 2017-01-31 Apple Inc. Integrated content recommendation
US9218118B2 (en) 2012-09-11 2015-12-22 Apple Inc. Media player playlist management
US10140372B2 (en) * 2012-09-12 2018-11-27 Gracenote, Inc. User profile based on clustering tiered descriptors
RU2611960C2 (en) * 2015-03-31 2017-03-01 Общество С Ограниченной Ответственностью "Яндекс" Method of clusterization of points of interest and server for its implementation
US20190228378A1 (en) * 2018-01-24 2019-07-25 Microsoft Technology Licensing, Llc Intelligent targeting of files needing attention
US11074408B2 (en) 2019-06-01 2021-07-27 Apple Inc. Mail application features

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5801699A (en) * 1996-01-26 1998-09-01 International Business Machines Corporation Icon aggregation on a graphical user interface
US6415282B1 (en) * 1998-04-22 2002-07-02 Nec Usa, Inc. Method and apparatus for query refinement
US20020169764A1 (en) * 2001-05-09 2002-11-14 Robert Kincaid Domain specific knowledge-based metasearch system and methods of using
US20030018652A1 (en) * 2001-04-30 2003-01-23 Microsoft Corporation Apparatus and accompanying methods for visualizing clusters of data and hierarchical cluster classifications
US6539374B2 (en) * 1999-06-03 2003-03-25 Microsoft Corporation Methods, apparatus and data structures for providing a uniform representation of various types of information
US6578032B1 (en) * 2000-06-28 2003-06-10 Microsoft Corporation Method and system for performing phrase/word clustering and cluster merging
US6606613B1 (en) * 1999-06-03 2003-08-12 Microsoft Corporation Methods and apparatus for using task models to help computer users complete tasks
US6732149B1 (en) * 1999-04-09 2004-05-04 International Business Machines Corporation System and method for hindering undesired transmission or receipt of electronic messages
US20040093321A1 (en) * 2002-11-13 2004-05-13 Xerox Corporation Search engine with structured contextual clustering
US20050278321A1 (en) * 2001-05-09 2005-12-15 Aditya Vailaya Systems, methods and computer readable media for performing a domain-specific metasearch, and visualizing search results therefrom

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6122405A (en) * 1993-08-27 2000-09-19 Martin Marietta Corporation Adaptive filter selection for optimal feature extraction
US5899992A (en) * 1997-02-14 1999-05-04 International Business Machines Corporation Scalable set oriented classifier
US6564197B2 (en) * 1999-05-03 2003-05-13 E.Piphany, Inc. Method and apparatus for scalable probabilistic clustering using decision trees
RU2177174C1 (en) * 2000-05-12 2001-12-20 Серебренников Олег Александрович Method for retrieval in formatted-data bases
US6910026B2 (en) * 2001-08-27 2005-06-21 Verity, Inc. Method and apparatus for determining classifier features with minimal supervision

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5801699A (en) * 1996-01-26 1998-09-01 International Business Machines Corporation Icon aggregation on a graphical user interface
US6415282B1 (en) * 1998-04-22 2002-07-02 Nec Usa, Inc. Method and apparatus for query refinement
US6732149B1 (en) * 1999-04-09 2004-05-04 International Business Machines Corporation System and method for hindering undesired transmission or receipt of electronic messages
US6539374B2 (en) * 1999-06-03 2003-03-25 Microsoft Corporation Methods, apparatus and data structures for providing a uniform representation of various types of information
US6606613B1 (en) * 1999-06-03 2003-08-12 Microsoft Corporation Methods and apparatus for using task models to help computer users complete tasks
US6578032B1 (en) * 2000-06-28 2003-06-10 Microsoft Corporation Method and system for performing phrase/word clustering and cluster merging
US20030200198A1 (en) * 2000-06-28 2003-10-23 Raman Chandrasekar Method and system for performing phrase/word clustering and cluster merging
US20030018652A1 (en) * 2001-04-30 2003-01-23 Microsoft Corporation Apparatus and accompanying methods for visualizing clusters of data and hierarchical cluster classifications
US20020169764A1 (en) * 2001-05-09 2002-11-14 Robert Kincaid Domain specific knowledge-based metasearch system and methods of using
US20050278321A1 (en) * 2001-05-09 2005-12-15 Aditya Vailaya Systems, methods and computer readable media for performing a domain-specific metasearch, and visualizing search results therefrom
US20040093321A1 (en) * 2002-11-13 2004-05-13 Xerox Corporation Search engine with structured contextual clustering

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9047368B1 (en) * 2013-02-19 2015-06-02 Symantec Corporation Self-organizing user-centric document vault
US20160132521A1 (en) * 2014-11-07 2016-05-12 Semandex Networks Inc. Systems and methods for file clustering, multi-drive forensic analysis and data protection
US9792289B2 (en) * 2014-11-07 2017-10-17 Semandex Networks Inc. Systems and methods for file clustering, multi-drive forensic analysis and data protection
CN108460049A (en) * 2017-02-21 2018-08-28 阿里巴巴集团控股有限公司 A kind of method and system of determining information category
CN108460049B (en) * 2017-02-21 2021-10-19 阿里巴巴集团控股有限公司 Method and system for determining information category
US20180285458A1 (en) * 2017-04-04 2018-10-04 Yandex Europe Ag Method and system for generating an offline search engine result page
US10713308B2 (en) * 2017-04-04 2020-07-14 Yandex Europe Ag Method and system for generating an offline search engine result page
US11347815B2 (en) * 2017-04-04 2022-05-31 Yandex Europe Ag Method and system for generating an offline search engine result page

Also Published As

Publication number Publication date
AU2005200286A1 (en) 2005-08-11
RU2368948C2 (en) 2009-09-27
ZA200500736B (en) 2006-10-25
TW200529063A (en) 2005-09-01
JP4101239B2 (en) 2008-06-18
KR101029403B1 (en) 2011-04-14
RU2005101735A (en) 2006-07-10
CA2494410A1 (en) 2005-07-26
EP1557774A3 (en) 2007-07-04
US20050165825A1 (en) 2005-07-28
BRPI0500784A (en) 2005-08-30
KR20050077036A (en) 2005-07-29
MXPA05001072A (en) 2005-10-05
JP2005235196A (en) 2005-09-02
US7257571B2 (en) 2007-08-14
HK1080164A1 (en) 2006-04-21
MY145961A (en) 2012-05-31
CN1648903A (en) 2005-08-03
EP1557774A2 (en) 2005-07-27

Similar Documents

Publication Publication Date Title
US7257571B2 (en) Automatic query clustering
US11409777B2 (en) Entity-centric knowledge discovery
Dinneen et al. The ubiquitous digital file: A review of file management research
US7409644B2 (en) File system shell
RU2367013C2 (en) System and method for dynamic generation of extension which allows for search selection
AU2003233558B2 (en) File system shell
US10387469B1 (en) System and methods for discovering, presenting, and accessing information in a collection of text contents
US20060190817A1 (en) Filtering a collection of items
KR20050075718A (en) Systems and methods for controlling a visible results set
US20080313158A1 (en) Database file management system, integration module and browsing interface of database file management system, database file management method
US8612882B1 (en) Method and apparatus for creating collections using automatic suggestions
Nahotko Knowledge Organization Affordances in a Faceted Online Public Access Catalog (OPAC)
Al Nasar et al. Personal information management systems and interfaces: An overview
Li Liew et al. Derivation of interaction environment and information object properties for enhanced integrated access and value‐adding to electronic documents
Wiesman et al. Information retrieval by metabrowsing
KR100831055B1 (en) Method for information search based on ontology
Stenmark et al. Design Implications for Personal Information Management: A Theoretical Evaluation of a Prototype Interface
CN117591481A (en) Method for realizing unified single-entry access of common files
Duke et al. Semantically enhanced search and browse
Damnjanovic et al. Interactive workspace for exploring heterogeneous data'in G
NZ533296A (en) File system shell
Yong et al. Content-based retrieval of video data with flexibly managed attributes
Robbins et al. Fast, Flexible Filtering with Phlat—Personal Search and Organization Made Easy

Legal Events

Date Code Title Description
AS Assignment

Owner name: MICROSOFT CORPORATION, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TURSKI, ANDRZEJ;RASHID, RICHARD F.;MACLAURIN, MATTHEW;AND OTHERS;REEL/FRAME:019687/0242

Effective date: 20040126

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034766/0001

Effective date: 20141014