CA2701171A1

CA2701171A1 - System and method for processing a query with a user feedback

Info

Publication number: CA2701171A1
Application number: CA2701171A
Authority: CA
Inventors: Matthew Colledge; Marc Carrier
Original assignee: Idilia Inc.; Matthew Colledge; Marc Carrier
Current assignee: Idilia Inc
Priority date: 2006-10-03
Filing date: 2007-10-03
Publication date: 2008-04-10
Also published as: EP2080125A1; CN101563685A; EP2080125A4; US20070136251A1; WO2008040121A1

Abstract

The invention provides a system and method of processing a query directed to a database. The invention comprises implementing the steps of: obtaining the query from a user; disambiguating the query using a knowledge base to obtain a set of identifiable senses associated with words in the query; obtaining a set of interpretations of the query; presenting the set of interpretations to the user; obtaining from the user a selected interpretation from the set; and obtaining and providing results for the selected query interpretation. The invention also allows updates to databases for users, sessions and common data relating to the best identified results for the queries, to improve and personalize disambiguation of subsequent queries by a user.

Description

Agent Ref. 70615/00033

CROSS REFERENCE TO PRIOR APPLICATIONS

[0001] This application claims priority from US application number 11/538,285, filed October 3, 2007, which is a Continuation in Part of U.S. application number 10/921,875, filed August 20, 2004, which claims priority from U.S. Provisional application number 60/496,681 filed on August 21, 2003. The contents of such prior applications are incorporated herein by reference in their entirety.

[0002] The present invention relates to Internet searching, and more particularly to Internet searching using semantic disambiguation and expansion. More particularly, the invention provides a query processing method and system enabling a user to select a desired query interpretation.

[0003] When working with large sets of data, such as a database of documents or web pages on the Internet, the volume of available data can make it difficult to find information of relevance. Various methods of searching are used in an attempt to find relevant information in such stores of information. Some of the best known systems are Internet search engines, such as Yahoo (trademark) and Google (trademark) which allow users to perform keyword-based searches. These searches typically involve matching keywords entered by the user with keywords in an index of web pages.
[0004] However, existing Internet search methods often produce results that are not particularly useful. The search may return many results, but only a few or none may be relevant to the user's query. On the other hand, the search may return only a small number of results, none of which are precisely what the user is seeking while having failed to return potentially relevant results.
[0005] One reason for some difficulties encountered in performing such searches is the ambiguity of words used in natural language. Specifically, difficulties are often encountered because one word can have several meanings. This difficulty has been addressed in the past by using a technique called word sense disambiguation, which involves changing words into word senses having specific semantic meanings. For example, the word "bank" could have the sense of "financial institution" or another definition attached to it.
[0006] US Patent 6,453,315 teaches meaning based information organization and retrieval. This patent teaches creating a semantic space by a lexicon of concepts and relations between concepts. Queries are mapped to meaning differentiators which represent the location of the query in the semantic space. Searching is accomplished by determining a semantic difference between differentiators to determine closeness in meaning. This system relies upon the user to refine the search based on the meanings determined by the system or alternatively to navigate through nodes found in the search results.
[0007] As known in the art, the evaluation of the efficiency of information retrieval is quantified by "precision" and "recall". Precision is quantified by dividing the number of correct results found in a search by the total number of results. Recall is quantified by dividing the number of correct results found in a search by the total number of possible correct results. Perfect (i.e. 100%) recall may be obtained simply by returning all possible results, except of course, this will give very poor precision. Most existing systems strive to balance the criteria of precision and recall. Increasing recall, for example by providing more possible results by use of synonyms, can consequentially reduce precision. On the other hand, increasing precision by narrowing the search results, for example by selecting results that match the exact sequence of words in a query, can reduce recall.
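The two measures defined in paragraph [0007] can be illustrated with a minimal Python sketch. The function name and the document identifiers are hypothetical and chosen only for illustration; they do not come from the specification.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single search.

    precision = correct results found / total results returned
    recall    = correct results found / total possible correct results
    """
    returned, relevant = set(returned), set(relevant)
    correct = returned & relevant
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
narrow = ["d1", "d2"]                                      # exact-match style search
broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]   # synonym-expanded search

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

The narrow search is fully precise but misses half of the relevant documents, while the broad search finds all of them at the cost of precision, which is the trade-off the paragraph describes.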
[0008] There is a need for a query processing system and method which addresses deficiencies in the prior art.

[0009] According to one aspect of the present invention, there is provided a method of searching information comprising the steps of disambiguating a query, disambiguating and indexing information according to keyword meanings, searching the indexed information to find information relevant to the query using keyword meanings in the query and other word meanings which are semantically related to the keyword meanings in the query, and returning search results which include information containing the keyword meanings and other semantically related word meanings.
[0010] The method may be applied to any database which is indexed using keywords. Preferably, the method is applied to a search of the Internet.
[0011] The semantic relations may be any logically or syntactically defined type of association between two words. Examples of such associations are synonymy, hyponymy etc.
[0012] The step of disambiguating the query may include assigning probability to word meanings. Similarly, the step of disambiguating the information may include attaching probabilities to word meanings.
[0013] The keyword meanings used in the method may be coarse groupings of finer word meanings.

[0014] In another aspect, a method of processing a query directed to a database is provided. The method comprises the steps of: obtaining the query from a user; and disambiguating the query using a knowledge base to obtain a set of identifiable meanings of words in the query, referred to as "interpretations" of the query. Further, if the set comprises more than one identifiable interpretation, then the following additional steps are executed: selecting one interpretation from the set as a best interpretation; utilizing the best interpretation of the query to identify relevant results from the database related to the best interpretation; re-disambiguating the remaining interpretations of the set by excluding results associated with the best interpretation; selecting a next best interpretation from the remaining interpretations; and utilizing the next best interpretation of the query to identify relevant results from the database related to the next best interpretation.
[0015] In a further aspect, the invention provides a method of processing a query directed to a database, the method comprising the steps of:
- obtaining the query from a user;
- disambiguating the query using a knowledge base to obtain a set of meanings for the one or more words;
- obtaining a set of interpretations of the query based on the set of meanings;
- presenting the user with the set of interpretations;
- obtaining from the user a selected interpretation from the set of interpretations; and
- identifying relevant results from the database related to the selected interpretation.
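The steps listed in paragraph [0015] can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the sense inventory `SENSES`, the sense labels, and the functions `interpretations` and `select` are hypothetical, and enumerating every combination of senses is one simple way to form interpretations, not necessarily the approach used by the embodiment.

```python
from itertools import product

# Toy sense inventory; the words and sense labels are illustrative only.
SENSES = {
    "java": ["java/island", "java/programming language", "java/coffee"],
    "holidays": ["holiday/vacation"],
}


def interpretations(query, senses=SENSES):
    """Enumerate candidate interpretations: one sense per known query word."""
    words = [w for w in query.lower().split() if w in senses]
    return [dict(zip(words, combo))
            for combo in product(*(senses[w] for w in words))]


def select(candidates, choice):
    """Return the interpretation the user picked from the presented list."""
    return candidates[choice]


candidates = interpretations("Java holidays")
for number, interpretation in enumerate(candidates):
    print(number, interpretation)   # present the interpretations to the user

chosen = select(candidates, 0)      # the user picks the first interpretation
print(chosen)
```

The selected interpretation would then be passed to the search step so that only results matching those word senses are identified.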
[0016] In another aspect, the invention provides a system for processing a query directed to a store of information, the system comprising:
- a means for obtaining the query from a user;
- a database comprising a knowledge base;
- a disambiguation module for disambiguating the query using the knowledge base to provide a set of meanings for the one or more words and to provide a set of interpretations of the query;
- a means for presenting the set of interpretations to the user;
- a means for obtaining from the user a selected interpretation from the set of interpretations;
- a processor for utilizing the selected interpretation to identify relevant results from the database;
- a means for presenting the results to the user.
[0017] The foregoing and other aspects of the invention will become more apparent from the following description of specific embodiments thereof and the accompanying drawings which illustrate, by way of example only, the principles of the invention. In the drawings, where like elements feature like reference numerals (and wherein individual elements bear unique alphabetical suffixes):
[0018] Figure 1 is a schematic representation of an information retrieval system providing word sense disambiguation associated with an embodiment of the invention;
[0019] Figure 2 is a schematic representation of words and word senses associated with the system of Fig. 1;
[0020] Figure 3A is a schematic representation of a representative semantic relationship of words for the system of Fig. 1;
[0021] Figure 3B is a diagram of data structures used to represent the semantic relationships of Fig. 3A for the system of Fig. 1;
[0022] Figure 4 is a flow diagram of a method performed by the system of Fig. 1 using the word senses of Fig. 2 and the semantic relationships of Fig. 3A;
[0023] Figure 5 is a flow diagram of a method of applying word sense disambiguation as provided by the system of Fig. 1 to query processing;
[0024] Figure 6 is a flow diagram of another method of applying word sense disambiguation as provided by the system of Fig. 1 to query processing; and
[0025] Figure 7 is a flow diagram of a method of applying personalization as provided by the system of Fig. 1 to query processing.
[0026] Figure 8 is a schematic representation of a database containing personalization information.
[0027] Figure 9 is a flow diagram of a method of applying personalization as provided by the system of Fig. 1 to query processing.

[0028] The description which follows, and the embodiments described therein, is provided by way of illustration of an example, or examples, of particular embodiments of the principles of the present invention. These examples are provided for the purposes of explanation, and not limitation, of those principles and of the invention. In the description, which follows, like parts are marked throughout the specification and the drawings with the same respective reference numerals.

[0029] The following terms will be used in the following description, and have the meanings shown below:
[0030] Computer readable storage medium: hardware for storing instructions or data for a computer. For example, magnetic disks, magnetic tape, optically readable medium such as CD ROMs, and semi-conductor memory such as PCMCIA cards. In each case, the medium may take the form of a portable item such as a small disk, floppy diskette, cassette, or it may take the form of a relatively large or immobile item such as hard disk drive, solid state memory card, or RAM.
[0031] Information: documents, web pages, emails, image descriptions, transcripts, stored text etc. that contain searchable content of interest to users, for example, contents related to news articles, news group messages, web logs, etc.
[0032] Module: a software or hardware component that performs certain steps and/or processes; may be implemented in software running on a general-purpose processor.
[0033] Natural language: a formulation of words intended to be understood by a person rather than a machine or computer.
[0034] Network: an interconnected system of devices configured to communicate over a communication channel using particular protocols. This could be a local area network, a wide area network, the Internet, or the like operating over communication lines or through wireless transmissions.
[0035] Query: a list of keywords indicative of desired search results; may utilize Boolean operators (e.g. "AND", "OR"); may be expressed in natural language. A query may comprise one or more words.
[0036] Query module: a hardware or software component to process a query.
[0037] Search engine: a hardware or software component to provide search results regarding information of interest to a user in response to a query from the user. The search results may be ranked and/or sorted by relevance.
[0038] Sense or word sense: a meaning of a word, such as a keyword contained in a query.
[0039] Interpretation: with respect to a query, an interpretation comprises a collection of word senses corresponding to one or more of the words in the query.
[0040] Referring to Figure 1, an information retrieval system associated with an embodiment is shown generally at reference 10. The system includes a store of information 12 which is accessible through a network 14. The store of information 12 may include documents, web pages, databases, and the like. Preferably, the network 14 is the Internet, and the store of information 12 comprises web pages. When the network 14 is the Internet, the protocols include TCP/IP (Transmission Control Protocol/Internet Protocol). Various clients 16 are connected to the network 14, by a wire in the case of a physical network or through a wireless transmitter and receiver. Each client 16 includes a network interface as will be understood by those skilled in the art. The network 14 provides the clients 16 with access to the content within the store of information 12. To enable the clients 16 to find particular information, documents, web pages, or the like within the store of information 12, the system 10 is configured to allow the clients 16 to search for information by submitting queries. The queries contain at least a list of keywords and may also have structure in the form of Boolean relationships such as "AND" and "OR." The queries may also be structured in natural language as a sentence or question.
[0041] The system includes a search engine 20 connected to the network 14 to receive the queries from the clients 16 to direct them to individual documents within the store of information 12. The search engine 20 may be implemented as dedicated hardware, or as software operating on a general purpose processor. The search engine operates to locate documents within the store of information 12 that are relevant to the query from the client.
[0042] The search engine 20 generally includes a processor 22. The engine may also be connected, either directly thereto, or indirectly over a network or other such communication means, to a display 24, an interface 26, and a computer readable storage medium 28. The processor 22 is coupled to the display 24 and to the interface 26, which may comprise user input devices such as a keyboard, mouse, or other suitable devices. If the display 24 is touch sensitive, then the display 24 itself can be employed as the interface 26. The computer readable storage medium 28 is coupled to the processor 22 for providing instructions to the processor 22 to instruct and/or configure processor 22 to perform steps or algorithms related to the operation of the search engine 20, as further explained below. Portions or all of the computer readable storage medium 28 may be physically located outside of the search engine 20 to accommodate, for example, very large amounts of storage. Persons skilled in the art will appreciate that various forms of search engines can be used with the present invention.
[0043] Optionally, and for greater computational speed, the search engine 20 may include multiple processors operating in parallel or any other multi-processing arrangement. Such use of multiple processors may enable the search engine 20 to divide tasks among various processors. Furthermore, the multiple processors need not be physically located in the same place, but rather may be geographically separated and interconnected over a network as will be understood by those skilled in the art.

[0044] Preferably, the search engine 20 includes a database 30 for storing an index of word senses and for storing a knowledge base used by search engine 20. The database 30 stores the index in a structured format to allow computationally efficient storage and retrieval as will be understood by those skilled in the art. The database 30 may be updated by adding additional keyword senses or by referencing existing keyword senses to additional documents. The database 30 also provides a retrieval capability for determining which documents contain a particular keyword sense. The database 30 may be divided and stored in multiple locations for greater efficiency.
[0045] According to an embodiment, the search engine 20 includes a word sense disambiguation module 32 for processing words in an input document or a query into word senses. A word sense is a given interpretation ascribed to a word, in view of the context of its usage and its neighbouring words. For example, the word "book" in the sentence "Book me a flight to New York" is ambiguous, because "book" can be a noun or a verb, each with multiple potential meanings. The result of processing of the words by the disambiguation module 32 is a disambiguated document or disambiguated query comprising word senses rather than ambiguous or uninterpreted words. The input document may be any unit of information in the store of information, or one of the queries received from clients. The word sense disambiguation module 32 distinguishes between word senses for each word in the document or query. The word sense disambiguation module 32 identifies which specific meaning of the word is the intended meaning using a wide range of interlinked linguistic techniques to analyze the syntax (e.g. part of speech, grammatical relations) and semantics (e.g. logical relations) in context. It may use a knowledge base of word senses which expresses explicit semantic relationships between word senses to assist in performing the disambiguation. The knowledge base may include relationships as described below with reference to Figures 3A and 3B.
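Paragraph [0045] does not specify the linguistic techniques used by the disambiguation module. As a rough illustration of what choosing a sense from context can look like, the Python sketch below uses a deliberately simple gloss-overlap heuristic in the spirit of the Lesk algorithm. This heuristic is a swapped-in stand-in, not the method of the embodiment, and the glosses, stop-word list and function names are assumptions made for the example.

```python
# Toy glosses; a real knowledge base would be far richer. Illustrative only.
GLOSSES = {
    "bank": {
        "bank/financial institution":
            "institution that accepts deposits and pays interest on money and loans",
        "bank/river bank":
            "sloping land beside a river or lake",
    },
}
STOP = {"a", "an", "the", "of", "on", "and", "or", "that", "to", "in"}


def tokens(text):
    """Lower-case, whitespace-split tokens with stop words removed."""
    return {w for w in text.lower().split() if w not in STOP}


def disambiguate(word, context):
    """Pick the sense whose gloss shares the most words with the context."""
    ctx = tokens(context) - {word}
    scored = {sense: len(tokens(gloss) & ctx)
              for sense, gloss in GLOSSES[word].items()}
    return max(scored, key=scored.get)


print(disambiguate("bank", "the bank raised interest on loans"))
print(disambiguate("bank", "we fished from the bank of the river"))
```

The first call selects the financial institution sense and the second selects the river bank sense, because only the neighbouring words differ.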
[0046] The search engine 20 includes an indexing module 34 for processing a disambiguated document to create the index of keyword senses and storing the index in the database 30. Index module 34 is a module which indexes data, such as data from documents, for use by search engine 20. In one embodiment, index module 34 is enabled to search for documents by crawling through the web using techniques known in the art. Upon locating a document, index module 34 provides it to disambiguation module 32 to provide a list of word senses for the content of the document. Index module 34 then indexes information regarding the word senses and the document in a database. The index includes an entry for each keyword sense relating to the documents in which it may be found. The index is preferably sorted and includes an indication of the locations of each indexed keyword sense. The index module 34 creates the index by processing the disambiguated document and adding each keyword sense to the index. Certain keywords may appear too many times to be useful and/or may contain very little semantic information, such as "a" or "the". These keywords may not be indexed.
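An index of keyword senses of the kind described in paragraph [0046] can be sketched as an inverted index keyed by word sense instead of by surface word. The Python below is a minimal sketch under assumptions: the input format (lists of `(word, sense)` pairs), the sense labels and the function name `build_index` are hypothetical.

```python
from collections import defaultdict

STOP_WORDS = {"a", "the"}  # keywords the text says may be left unindexed


def build_index(disambiguated_docs):
    """Map each keyword sense to {doc_id: [positions in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, doc_tokens in disambiguated_docs.items():
        for position, (word, sense) in enumerate(doc_tokens):
            if word in STOP_WORDS:
                continue  # too frequent to carry useful semantic information
            index[sense][doc_id].append(position)
    return index


# Hypothetical output of the disambiguation module for two documents.
docs = {
    "doc1": [("the", None), ("bank", "bank/financial institution"),
             ("pays", "pay/v1"), ("interest", "interest/money")],
    "doc2": [("the", None), ("river", "river/n1"),
             ("bank", "bank/river bank")],
}

index = build_index(docs)
print(dict(index["bank/financial institution"]))  # {'doc1': [1]}
print(dict(index["bank/river bank"]))             # {'doc2': [2]}
```

The single word "bank" ends up under two separate index entries, one per sense, so a query for one sense does not retrieve documents that use the other.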
[0047] The search engine 20 also includes a query module 36 for processing queries received from client 16. The query module 36 is configured to receive queries and transfer them to the disambiguation module 32 for processing. The query module 36 then finds results in the index that are relevant to the disambiguated query, as described further below. The results contain keyword senses semantically related to the word senses in the disambiguated query. The query module 36 provides the results to the client. The results may be ranked and/or scored for relevance to assist the client in interpreting them.
[0048] Referring to Figure 2, the relationship between words and word senses is shown generally by the reference 100. As seen in this example, certain words have multiple senses. Among many other possibilities, the word "bank" may represent: (i) a noun referring to a financial institution; (ii) a noun referring to a river bank; or (iii) a verb referring to an action to save money. The word sense disambiguation module 32 splits the ambiguous word "bank" into less ambiguous word senses for storage in the index. Similarly, the word "interest" has multiple meanings including: (i) a noun representing an amount of money payable relating to an outstanding investment or loan; (ii) a noun representing special attention given to something; or (iii) a noun representing a legal right in something.
[0049] Referring to Figures 3A and 3B, example semantic relationships between word senses are shown. These semantic relationships are precisely defined types of associations between two words based on meaning. The relationships are between word senses, that is, specific meanings of words.
[0050] Specifically in Fig. 3A, for example, a bank (in the sense of a river bank) is a type of terrain and a bluff (in the sense of a noun meaning a land formation) is also a type of terrain. A bank (in the sense of river bank) is a type of incline (in the sense of grade of the land). A bank in the sense of a financial institution is synonymous with a "banking company" or a "banking concern." A bank is also a type of financial institution, which is in turn a type of business. A bank (in the sense of financial institution) is related to interest (in the sense of money paid on investments) and is also related to a loan (in the sense of borrowed money) by the generally understood fact that banks pay interest on deposits and charge interest on loans.
[0051] It will be understood that there are many other types of semantic relationships that may be used. Although known in the art, following are some examples of semantic relationships between words: Words which are in synonymy are words which are synonyms to each other. A hypernym is a relationship where one word represents a whole class of specific instances. For example "transportation" is a hypernym for a class of words including "train", "chariot", "dogsled" and "car", as these words provide specific instances of the class. Meanwhile, a hyponym is a relationship where one word is a member of a class of instances. From the previous list, "train" is a hyponym of the class "transportation". A meronym is a relationship where one word is a constituent part of, the substance of, or a member of something. For example, for the relationship between "leg" and "knee", "knee" is a meronym to "leg", as a knee is a constituent part of a leg. Meanwhile, a holonym is a relationship where one word is the whole of which a meronym names a part. From the previous example, "leg" is a holonym to "knee". Any semantic relationships that fall into these categories may be used. In addition, any known semantic relationships that indicate specific semantic and syntactic relationships between word senses may be used.
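The paired relationships in paragraph [0051] (hyponym and hypernym, meronym and holonym) are inverses of one another, which the following Python sketch makes explicit. The triple layout and the helper names are assumptions made for illustration; the example words are the ones given in the paragraph.

```python
# (relation, source, target): "source is a <relation> of target".
RELATIONS = [
    ("hyponym", "train", "transportation"),
    ("hyponym", "car", "transportation"),
    ("meronym", "knee", "leg"),
]
INVERSE = {"hyponym": "hypernym", "meronym": "holonym"}


def with_inverses(relations):
    """Add the inverse triple for every relation that has one."""
    out = set(relations)
    for rel, src, dst in relations:
        if rel in INVERSE:
            out.add((INVERSE[rel], dst, src))
    return out


def related(word, relation, relations):
    """Words for which `word` stands in the given relation, sorted."""
    return sorted(dst for rel, src, dst in relations
                  if rel == relation and src == word)


all_relations = with_inverses(RELATIONS)
print(related("transportation", "hypernym", all_relations))  # ['car', 'train']
print(related("leg", "holonym", all_relations))              # ['knee']
```

Storing only one direction and deriving the other keeps the two views consistent, although a knowledge base could equally store both directions explicitly.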
[0052] It is known that there are ambiguities in interpretation when strings of keywords are provided as queries and that having an expanded list of keywords in a query increases the number of results found in the search. The embodiment provides a system and method to identify relevant, disambiguated lists of keywords for a query. Providing such a list delineated on the sense of words reduces the amount of extraneous information that is retrieved. The embodiment expands the query language without obtaining unrelated results due to extra, related senses of a word. These related senses may include synonyms. For example, expanding the "financial institution" sense of bank will not also expand the other senses such as "river-bank" or "to save". This allows information management software to identify more precisely the information for which a client is looking.
[0053] Expanding a query involves using one or both of the following steps:
[0054] 1. Adding to a disambiguated query keyword sense, any other word and its associated senses that are semantically related to the disambiguated keyword sense.
[0055] 2. Paraphrasing the query by parsing its syntactic structure and transforming it into other semantically equivalent queries. The index contains fields that identify syntactic structures and semantic equivalents for words. Paraphrasing is a term and concept known in the art.
[0056] It will be recognized that the use of word sense disambiguation in a search addresses the problem of retrieval relevance. Furthermore, users often express queries as they would express language. However, since the same meaning can be described in many different ways, users encounter difficulties when they do not express a query in the same specific manner in which the relevant information was initially classified.
[0057] For example if the user is seeking information about "Java" the island, and is interested in "holidays" on Java (island), the user would not retrieve useful documents that had been categorized using the keywords "Java" and "vacation". It will be recognized that the semantic expansion feature, according to an embodiment, addresses this issue. It has been recognized that deriving precise synonyms and sub-concepts for each key term in a naturally expressed query increases the volume of relevant retrievals. If this were performed using a thesaurus without word sense disambiguation, the result could be worsened. For example, semantically expanding the word "Java" without first establishing its precise meaning would yield a massive and unwieldy result set with results potentially selected based on word senses as diverse as "Indonesia" and "computer programming". It will be recognized that the described methods of interpreting the meaning of each word and then semantically expanding that meaning return a more comprehensive and simultaneously more targeted result set.
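The "Java" example can be sketched in Python to show why expansion is performed per word sense and not per word. The relation table `RELATED_SENSES`, its sense labels and the function `expand` are hypothetical and serve only as an illustration of the first expansion step described above.

```python
# Toy relation table keyed by word sense; all entries are illustrative.
RELATED_SENSES = {
    "java/island": ["indonesia/country"],
    "java/programming language": ["computer programming/activity"],
    "holiday/vacation": ["vacation/leisure trip"],
}


def expand(disambiguated_query):
    """Add senses related to each chosen sense, and to nothing else."""
    expanded = list(disambiguated_query)
    for sense in disambiguated_query:
        for extra in RELATED_SENSES.get(sense, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded


print(expand(["java/island", "holiday/vacation"]))
# ['java/island', 'holiday/vacation', 'indonesia/country', 'vacation/leisure trip']
```

Because the query was disambiguated to the island sense first, nothing related to computer programming is added, while the "vacation" sense is, so documents categorized under "Java" and "vacation" become retrievable.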
[0058] Referring to Fig. 3B, to assist in disambiguating such word senses, the embodiment utilizes knowledge base 400 of word senses capturing relationships of words as described above for Fig. 3A. Knowledge base 400 is associated with database 30 and is accessed to assist word sense disambiguation (WSD) module 32 in performing word sense disambiguation. Knowledge base 400 contains definitions of words for each of their word senses and also contains information on relations between pairs of word senses. These relations include the definition of the sense and the associated part of speech (noun, verb, etc.), fine sense synonyms, antonyms, hyponyms, meronyms, pertainyms, similar adjectives relations and other relationships known in the art. While prior art electronic dictionaries and lexical databases, such as WordNet (trademark), have been used in systems, knowledge base 400 provides an enhanced inventory of words and relations. Knowledge base 400 contains: (i) additional relations between word senses, such as the grouping of fine senses into coarse senses, new types of inflectional and derivational morphological relations, and other special purpose semantic relations; (ii) large-scale corrections of errors in data obtained from published sources; and (iii) additional words, word senses, and associated relations that are not present in other prior art knowledge bases.
[0059] In the embodiment, knowledge base 400 is a generalized graph data structure and is implemented as a table of nodes 402 and a table of edge relations 404 associating two nodes together. Each is described in turn. In other embodiments, other data structures, such as linked lists, may be used to implement knowledge base 400.

[0060] In table 402, each node is an element in a row of table 402. A record for each node may have as many as the following fields: an ID field 406, a type field 408 and an annotation field 410. There are two types of entries in table 402: a word and a word sense definition. For example, the word "bank" in ID field 406A is identified as a word by the "word" entry in type field 408A. Also, exemplary table 402 provides several definitions of words. To catalog the definitions and to distinguish definition entries in table 402 from word entries, labels are used to identify definition entries. For example, entry in ID field 406B is labeled "LABEL001". A corresponding definition in type field 408B identifies the label as a "fine sense" word relationship. A corresponding entry in annotation field 410B identifies the label as "Noun. A financial institution". As such, a "bank" can now be linked to this word sense definition. Furthermore an entry for the word "brokerage" may also be linked to this word sense definition. Alternate embodiments may use a common word with a suffix attached to it, in order to facilitate recognition of the word sense definition. For example, an alternative label could be "bank/n1", where the "/n1" suffix identifies the label as a noun (n) and the first meaning for that noun. It will be appreciated that other label variations may be used. Other identifiers to identify adjectives, adverbs and others may be used. The entry in type field 408 identifies the type associated with the word. There are several types available for a word, including: word, fine sense and coarse sense. Other types may also be provided. In the embodiment, when an instance of a word has a fine sense, that instance also has an entry in annotation field 410 to provide further particulars on that instance of the word.
[0061] Edge/Relations table 404 contains records indicating relationships between two entries in nodes table 402. Table 404 has the following entries: From node ID column 412, to node ID column 414, type column 416 and annotation column 418. Columns 412 and 414 are used to link two entries in table 402 together. Column 416 identifies the type of relation that links the two entries. A record has the ID of the origin and the destination node, the type of the relation, and may have annotations based on the type. Types of relations include "root word to word", "word to fine sense", "word to coarse sense", "coarse to fine sense", "derivation", "hyponym", "category", "pertainym", "similar", "has part". Other relations may also be tracked therein. Entries in annotation column 418 provide a (numeric) key to uniquely identify an edge type going from a word node to either a coarse node or fine node for a given part-of-speech.
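The node table 402 and edge table 404 can be pictured with the following Python sketch. The class and function names are hypothetical; the field comments map each attribute to the reference numeral used in the text, and the sample rows reuse the "bank", "brokerage" and "LABEL001" examples from paragraph [0060]. The annotation values for edges are left empty because the text does not give concrete keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str          # ID field 406
    node_type: str        # type field 408: "word", "fine sense", "coarse sense"
    annotation: str = ""  # annotation field 410


@dataclass(frozen=True)
class Edge:
    from_id: str          # from node ID column 412
    to_id: str            # to node ID column 414
    edge_type: str        # type column 416
    annotation: str = ""  # annotation column 418 (left empty in this sketch)


NODES = [
    Node("bank", "word"),
    Node("brokerage", "word"),
    Node("LABEL001", "fine sense", "Noun. A financial institution"),
]
EDGES = [
    Edge("bank", "LABEL001", "word to fine sense"),
    Edge("brokerage", "LABEL001", "word to fine sense"),
]


def fine_senses(word):
    """Annotations of the fine senses linked from a word node."""
    by_id = {node.node_id: node for node in NODES}
    return [by_id[edge.to_id].annotation for edge in EDGES
            if edge.from_id == word and edge.edge_type == "word to fine sense"]


def words_for(sense_id):
    """Word nodes linked to a given sense node, sorted."""
    return sorted(edge.from_id for edge in EDGES if edge.to_id == sense_id)


print(fine_senses("bank"))    # ['Noun. A financial institution']
print(words_for("LABEL001"))  # ['bank', 'brokerage']
```

Keeping nodes and relations in two flat tables means that adding a new relation type only requires new edge rows, with no change to the node records.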
[0062] Further detail is now provided on steps performed by the embodiment to perform a search utilizing results from disambiguating a word associated with a query. Referring to Figure 4, a process for performing such a search is shown generally by the reference 300. The process may be divided into two general stages. The first stage comprises pre-processing the information (or a subset of the information) to facilitate the second stage of responding to a query. In the first stage of pre-processing, each document in the store of information (or a subset of the store of information) is summarized to create the index in the database. At step 302, the word sense disambiguation module 32 distinguishes between word senses for each word in each document. The word sense disambiguation module 32 was defined above.
[0063] The search engine then applies the index module to the disambiguated information at step 304 to obtain an index of keyword senses. The index module 34 creates the index by processing the disambiguated document and adding each keyword sense to the index. Certain keywords may appear too many times to be useful, such as "a" or "the". Preferably, these keywords are not indexed. It will be recognized that this step effectively indexes one word as several different word senses. This index of word senses is stored in the database at step 306.
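As a rough sketch of the indexing stage, the fragment below builds an inverted index keyed by word sense rather than by word, skipping overly common keywords. The function name, the stop-word set and the sample documents are assumptions made for illustration; the documents are taken to be already disambiguated.

```python
from collections import defaultdict

# Overly common keywords that are not indexed (assumed, minimal list).
STOP_WORDS = {"a", "the"}

def build_sense_index(tagged_docs):
    """Map each keyword sense to the set of documents containing it.

    tagged_docs maps a document id to a list of (word, sense) pairs,
    i.e. the output of a prior disambiguation pass.
    """
    index = defaultdict(set)
    for doc_id, tagged in tagged_docs.items():
        for word, sense in tagged:
            if word.lower() in STOP_WORDS:
                continue  # too frequent to be useful
            index[sense].add(doc_id)
    return index

docs = {
    "d1": [("the", None), ("bank", "bank/n1"), ("loan", "loan/n1")],
    "d2": [("a", None), ("river", "river/n1"), ("bank", "bank/n2")],
}
index = build_sense_index(docs)
```

The word "bank" ends up under two separate index keys, one per sense, which is the effect the paragraph describes.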
[0064] In the second stage of the process, the search engine receives a query from one of the clients at step 308. The query is parsed into its word components and then each word can be analyzed for its context alone and in context with its neighbouring words. Parsing techniques for strings of words are known in the art and are not repeated here. The word sense disambiguation module 32 distinguishes between meanings for each word in the query at step 310. To assist in disambiguation, the module may make use of results that the user has previously selected or a previously disambiguated query entered by the user, as context in addition to words in the query itself.
[0065] In the preferred embodiment, as shown at step 312, using knowledge base 400 (Fig. 3B), the search engine expands the disambiguated query to include keyword senses which are semantically related to the specific keyword senses in the query. The expansion is performed on the basis of word sense and accordingly produces a list of word senses which are related to the meaning of the query. The semantic relationships may be those described above with reference to Figures 3A and 3B.
[0066] The search engine then compares the disambiguated and expanded query to word sense information in the database at step 314. Entries in the knowledge base whose word senses match the keyword senses in the query are selected to be results. As noted earlier, the knowledge base includes a database of indexed documents. The search engine then returns results to the client at step 316. In one embodiment, the results may be weighted according to the semantic relationship between the word senses found in the results and that of the keywords in the query. Thus, for example, a result containing a word sense with a synonymous relationship to the keyword senses in the query may be given a higher weighting as compared to a result containing word senses with a hyponym relationship. The results may also be weighted by a probability that a keyword sense in the disambiguated query and/or disambiguated document is correct. The results may also be weighted by other features of the document or web page corresponding to the results such as the frequency of the relevant word senses or their location in relation to each other, or other techniques for ranking results as will be understood by persons skilled in the art.
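One way to picture the weighting is shown below: each matched sense contributes a weight for its semantic relation, scaled by the probability that the sense is correct. The numeric weights, the scoring formula and the sample data are assumptions; the description states only that a synonymous match may outweigh a hyponym match.

```python
# Assumed relation weights: closer relations score higher.
RELATION_WEIGHT = {"exact": 1.0, "synonym": 0.9, "hyponym": 0.6}

def score(matches):
    """Score a result from its (relation, sense_probability) matches."""
    return sum(RELATION_WEIGHT[rel] * prob for rel, prob in matches)

# Two hypothetical results matching the query through different relations.
results = {
    "doc_a": [("synonym", 0.8)],
    "doc_b": [("hyponym", 0.8)],
}
ranked = sorted(results, key=lambda d: score(results[d]), reverse=True)
```

With equal sense probabilities, the document matched through a synonym ranks ahead of the one matched through a hyponym.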
[0067] It will be recognized that the first stage of the process may be performed as a pre-computation step, prior to interaction with the clients. The second stage could be performed several times without repeating the first stage. The first stage may be performed occasionally, or at regular intervals to maintain currency of the database. The database could also be updated incrementally by performing the first stage on subsets of the information, such as newly added or modified information.
[0068] Generally, the embodiment also utilizes word sense disambiguation to sense tag queries. In particular, the embodiment performs the following functions to sense tag queries:
1. Identifying likely senses of the query key words using word sense disambiguation;
2. Identifying other likely alternate interpretations of the query using word sense disambiguation;
3. Ranking each interpretation by its likelihood of being the intended meaning;
4. Using the alternate interpretations derived using word sense disambiguation to obtain confirmation from the user of the intended meaning and correct interpretation.
5. If required, updating the intended interpretation of the query for a given user.
[0069] Details of each of the five functions are provided below.
[0070] For the first function, system 10 uses disambiguation engine 32 and the knowledge base to identify a likely word sense for a query. In order to identify plausible word senses, a number of word sense disambiguation components, but not necessarily all, are used by the embodiment to identify their senses. One component accesses a set of rules associated with the words to determine the sense of a word. The rules identify the presence of any relation between word senses of the given word and adjacent words. In the embodiment, the rules are manually coded. One example of a rule is as follows: for two words in a sentence, if the two words have a common sense in their list of possible senses, then this common sense is determined to be the likely intended meaning. An application of this rule is found in the sentence: "He sold his interest in the company which amounted to a 25% stake." Therein, the words "interest" and "stake" share a common sense of "right, title, or legal share in something".
Other embodiments may use automatically coded rules.
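The shared-sense rule given as an example can be sketched in a few lines. The sense lists below are invented for illustration, apart from the common sense quoted in the text, and the function name is hypothetical.

```python
def shared_senses(senses_a, senses_b):
    """Return the senses that appear in both words' lists of possible senses."""
    return [s for s in senses_a if s in senses_b]

# Possible senses per word (illustrative, not taken from the knowledge base).
POSSIBLE = {
    "interest": ["curiosity", "charge for borrowing money",
                 "right, title, or legal share in something"],
    "stake": ["pointed post", "wager",
              "right, title, or legal share in something"],
}
common = shared_senses(POSSIBLE["interest"], POSSIBLE["stake"])
```

When exactly one sense is shared, as here, the rule takes it as the likely intended meaning of both words.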

[0071] A second process for the first function assigns senses to words by identifying any coherent topics which capture a main semantic meaning of the words. A topic is a vector of weighted senses. Coherence between topics is measured as a function of the likelihood that the senses in the topics are going to appear together in text. When multiple topics are identified in the text, each topic may be complementary or contradictory to the other topics. Contradictory topics may indicate different possible interpretations of the query. A contradictory topic is a different vector, with alternate senses of the same words, that also results in a vector of comparable length.
[0072] For the second function, the embodiment may use or re-use a disambiguation process to identify likely alternative word senses and analyze results of each process against the other results. Some of the processes are described below. It will be appreciated that the processes and algorithms may be considered to be components of the embodiment.
[0073] A first process for the second function repeats the disambiguation process for a query but constrains the sense of a word to a sense that had not been previously reported. The disambiguation of the query then selects an alternate sense for that word and may modify the sense of the remaining words. This process may be repeated for each sense of each word to obtain a set of alternate interpretations.
[0074] Another process for the second function re-disambiguates the query using the full set of algorithms, but constrains the algorithms to consider that one of the alternative topics is the most likely solution (to the exclusion of the previously identified most likely topic). Accordingly, when the other algorithms execute, their respective results will change. This can be systematically repeated for each identified topic to obtain a set of alternate interpretations.
[0075] Another algorithm for the second function assigns a sense from the set of known possible senses to one of the words and disambiguates the senses of the remaining words. This can be systematically repeated for each sense of each word to obtain a set of alternate interpretations.
[0076] Each of the algorithms for the second function may be used individually or in combination to generate a list of possible alternate interpretations of the query's meaning. Some of the generated interpretations may be duplicates of each other and only a single instance may be kept for further processing.
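The approach of pinning one word to each of its known senses, re-disambiguating the rest, and dropping duplicates might look like the sketch below. The `disambiguate` function is a trivial stand-in for disambiguation engine 32 (it simply picks the first listed sense), and all names and sample senses are assumptions.

```python
def disambiguate(words, senses, fixed=None):
    """Stand-in for the disambiguation engine: take the first listed sense
    of each word, except for words pinned to a sense in `fixed`."""
    fixed = fixed or {}
    return tuple(fixed.get(w, senses[w][0]) for w in words)

def alternate_interpretations(words, senses):
    """Pin each word to each of its senses in turn and keep unique results."""
    best = disambiguate(words, senses)
    seen, alternates = {best}, []
    for word in words:
        for sense in senses[word]:
            interpretation = disambiguate(words, senses, {word: sense})
            if interpretation not in seen:  # keep a single instance only
                seen.add(interpretation)
                alternates.append(interpretation)
    return best, alternates

SENSES = {
    "java": ["java/island", "java/language", "java/coffee"],
    "tutorial": ["tutorial/n1"],
}
best, alts = alternate_interpretations(["java", "tutorial"], SENSES)
```

A real engine would let the pinned sense influence the senses chosen for the remaining words; the stand-in here only shows the enumeration and de-duplication.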
[0077] For the third function, a ranking may be attributed to each result, which may be used to state an accuracy for each result. For example, a ranking may be based on the number of hits generated for each interpretation. Alternatively, a probability threshold may be set and a probability score may be assigned to the results of each process. If scores in the word sense distribution are above the threshold, then each such sense is retained. Alternatively, if the difference in scoring between the top sense and the second sense exceeds a certain delta value, then the top sense is deemed to be acceptable. Also, interpretations deemed to have a low probability score, because their score values are below a minimum acceptable threshold value, may be automatically discarded.
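The threshold and delta tests can be expressed compactly. The scores, threshold and delta below are made-up values chosen only to exercise the two checks, and the function names are hypothetical.

```python
def retained_senses(scores, threshold):
    """Keep every sense whose probability score is above the threshold."""
    return [sense for sense, p in scores.items() if p > threshold]

def top_is_acceptable(scores, delta):
    """Accept the top sense when it leads the runner-up by more than delta."""
    ordered = sorted(scores.values(), reverse=True)
    return len(ordered) < 2 or ordered[0] - ordered[1] > delta

# Assumed probability distribution over the senses of one query word.
scores = {"java/island": 0.55, "java/language": 0.35, "java/coffee": 0.10}
kept = retained_senses(scores, threshold=0.2)
clear_winner = top_is_acceptable(scores, delta=0.15)
```

Here the lowest-scoring sense falls below the threshold and is discarded, while the gap between the two leading senses is wide enough for the top sense to be accepted.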
[0078] For the fourth function, using word sense disambiguation, various algorithms are provided to obtain confirmation from the user of the intended meaning. The first algorithm is used to derive a question to be posed by system 10 related to the query. The second algorithm is used to selectively group the results of the disambiguation. A third algorithm is used to identify various meanings of a query and to obtain information from the user as to which meaning is intended prior to providing results. Each algorithm is discussed in turn below.
[0079] Referring to Fig. 5, algorithm 500 is shown representing the first algorithm of the fourth function. Algorithm 500 presents a user with a question asking if the intended meaning is the second likely interpretation while presenting the search results based on the first interpretation. As an example, if the original query contained only the keyword "java", the algorithm would identify that a likely meaning of the word "java" relates to either Indonesia or the programming language. For the example, it is presumed that "Indonesia" is the more confident interpretation and its results are displayed. However, as an added filter, the first algorithm generates the following question for the user: "Did you mean an object-oriented programming language?" If the user answers affirmatively to the question, then the results for the second interpretation are displayed.
[0080] In order to identify terms to use in the question, it is preferable that algorithm 500:
1. First, obtain the query (step 502);
2. Disambiguate the query to identify the most likely word senses as the first interpretation using disambiguation engine 32 (step 504);
3. After step 504, conduct, in parallel, steps in path 506 and path 508;
A) In path 506, the following steps are performed:
- Expand the query for semantically related senses; this may utilize word sense disambiguation to find suitable semantically related senses for the identified word sense (step 510), and may use the knowledge base describing word senses and the semantic relationships between the senses; then
- Compare the expanded set of query senses to an index of senses found in documents; the index may be generated by index module 34 (step 512);
B) In path 508, the following steps are performed:
- Identify the second most likely interpretation of the whole query providing alternate word senses for at least one word; this is preferably done by eliminating the effect of the first most likely word sense identified in step 504 from the possible set of results and then re-disambiguating the remaining senses amongst themselves using disambiguation engine 32 (step 514);
- From the selected second most likely interpretation, identify words that have a different meaning between the first and second interpretation (step 516);
- Between the best and the second most likely interpretations, identify a term or association which is semantically related only to the second word sense and not related to the first sense. This distinguishes the second word sense from the first. Further, the term may form part of a question phrase. In the example above, in the knowledge base, "Java" has a "type-of" association with the phrase "object-oriented programming language" and "Java" has an alternate "part-of" association with "Indonesia". As such the "type-of" association distinguishes the first and second senses for "Java" (step 518);
4. Return results and generate a question based on the keyword or association identified for the second most likely interpretation. Algorithm 500 preferably uses the first interpretation as being the intended meaning unless the user selects the question. If the question is selected, the displayed search results can be updated to the second interpretation and the intended meaning can also be updated (step 520);
5. If the second most likely interpretation was selected, then re-disambiguate the query, using the senses associated with the second most likely interpretation to re-compute the word sense probability distribution with the new input that confirms the intended meaning of the second most likely interpretation using disambiguation engine 32 (step 522); and
6. Store the results of the interpretation selected by the user for the query and update the knowledge base accordingly (step 524); and return to the beginning of paths 506 and 508.
[0081] In algorithm 500, in step 516 the descriptive term of the second word sense is identified by analyzing each semantic relation to other word senses of all of the senses of the query word. If the descriptive term has semantic relations appearing in more than one sense of the query word, then the descriptive term is discarded, as it does not differentiate the senses of the query word. Thereafter, the remaining semantically related word senses are ranked for their descriptive and differentiating attributes. These attributes include: their type of semantic relation, the frequency of their word senses, their parts-of-speech, the number of other semantically related word senses, and others.
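The selection of a differentiating term can be illustrated as follows. The relation data is invented except for the "type-of" and "part-of" associations named in the text, the function names are hypothetical, and the subsequent ranking of the surviving terms by their attributes is left out of the sketch.

```python
# Semantic relations per sense of the query word (sample data, mostly assumed).
RELATIONS = {
    "java/island": {("part-of", "Indonesia"), ("category", "proper noun")},
    "java/language": {("type-of", "object-oriented programming language"),
                      ("category", "proper noun")},
    "java/coffee": {("type-of", "beverage")},
}

def distinguishing_terms(target, relations):
    """Terms related to the target sense and to no other sense of the word."""
    others = set()
    for sense, rels in relations.items():
        if sense != target:
            others.update(term for _, term in rels)
    return sorted(term for _, term in relations[target] if term not in others)

def question_for(target, relations):
    """Build a confirmation question from the first differentiating term."""
    terms = distinguishing_terms(target, relations)
    return f"Did you mean: {terms[0]}?" if terms else None

prompt = question_for("java/language", RELATIONS)
```

The term "proper noun" is discarded because it is related to more than one sense, leaving "object-oriented programming language" as the phrase used in the question.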

[0082] It will be appreciated that algorithm 500 provides three levels of refinement to search queries. The first level is a first unconstrained pass at disambiguation to identify a first interpretation in step 504. The second level is to identify a second most likely interpretation, by constraining it to ignore the first answer. It will be appreciated that the results of the second level may still be ambiguous. As the first interpretation is effectively ignored for the second level by constraining the second level to consider only alternative senses, re-disambiguation at this point can better find the next best interpretation as the effects of the first interpretation from the set of word senses are eliminated. The third level is activated only when the user selects the question in step 520. In this level, as the user has provided feedback as to the intended meaning of the query (either directly via answering a question or indirectly by not answering a question), the meaning of the word in the query is no longer ambiguous. Its sense is now known with a high degree of certainty. Thereafter the further re-disambiguation in step 522 is based on the second most likely interpretation only, ignoring any additional interpretations which were located in step 514. For example, a query with the word "Java" may have been interpreted as an island in Indonesia in the first level of disambiguation. When the query is re-disambiguated and constrained to ignore that sense, the disambiguation engine may determine that an object-oriented programming language was the second best interpretation of that word. However, "java" could still refer to "coffee". Accordingly, in the last disambiguation, the meaning of "java" is confirmed to be an object-oriented language and its constraints can be updated to indicate that "java" in this context is neither the island nor coffee.
[0083] In an alternative embodiment to algorithm 500, a decision point (not shown) may be inserted immediately after step 504. At the decision point, the results of step 504 are analyzed and if there is confidence in the results, then path 506 is taken for processing results of step 504. If there is insufficient confidence in the results, then paths 506 and 508 are taken.
[0084] Referring to Fig. 6, algorithm 600 is shown representing the second algorithm of the fourth function. Algorithm 600 presents a user with results for two or more interpretations of a query and monitors which result the user selects to view to determine the intended meaning of the query. Algorithm 600 determines the intended meaning of a query via two methods:
1. In the first method, a most likely and at least one alternative interpretation of the query word are generated. However, the algorithm simply selects the most likely interpretation as being the correct interpretation. Only the most likely interpretation is selected if the ranking score is above a certain threshold. Subsequently, the sense tagging of each query keyword is confirmed accordingly.
2. In the second method, again a most likely and at least one alternative interpretation of the query word are generated. When the user selects a document associated with one of the interpretations, the algorithm re-disambiguates the query using the selected document as context. This method allows the senses of each word to be confirmed or corrected based on the content of the document. The document may provide additional context that allows other ambiguous query words in the alternate interpretation to be disambiguated with higher confidence.
[0085] Briefly, notable steps of algorithm 600 are as follows:
1. First, obtain the query (step 602, similar to step 502);
2. Disambiguate the query using disambiguation module 32 (step 604, similar to step 504);
3. Determine rankings for the results. In one alternative, the ranking value threshold for the ranking is set to a low value threshold (step 606);
4. If the threshold is met, then path 608 is taken. If the threshold is not met, then path 610 is taken.
A) In path 608, the following functions are performed for each interpretation of a query:
- Expand the query using word sense disambiguation using disambiguation engine 32 (step 612, similar to step 510); then,
- Compare the query sense to the index (step 614, similar to step 512);
B) In path 610, the following function is performed prior to steps 612 and 614:
- Use word sense disambiguation to identify a list of alternative interpretations of the query. The list is generated by first ignoring results associated with the highest ranked results (step 616, similar to step 514);
5. After step 614, return results of each interpretation and wait for input (step 618);
6. Obtain user feedback on the selected interpretation or selected document (step 620);
7. Re-disambiguate the query using the selected document as context, by ignoring other word senses (step 622, similar to step 520); and
8. Store the results of the interpretation selected by the user for the query (step 624).
[0086] For algorithm 600, various methods can be used to present to a user the different groups of results. Three exemplary methods are described. A first method clearly clusters results into separate groups of alternate interpretations. A word or description of each interpretation can optionally be included with each group using methods described earlier to identify descriptive and differentiating words semantically related to each interpretation. A second method displays results for the first interpretation with a link for each of the remaining interpretations, allowing the user to view the associated results. A third method merges results from each interpretation into a single list of results. The user is not aware that multiple interpretations of the query are displayed, but upon his selection of a result, the intended meaning can be identified as described above.
[0087] Another aspect of the embodiment enables disambiguation of a query to be personalized for each user and across each user session. This is preferably done in step 522 of algorithm 500 and step 624 in algorithm 600. Personalization of the word sense disambiguation enables the embodiment to assign different word senses to the same or related queries for different users. Personalization and customization of word sense disambiguation improves the quality of the search results obtained from the improved query senses due to automatic acquisition and use of the personalized information. It can readily be seen that personalization can enhance customer loyalty to a particular search engine service provider, because of the improved search results provided to each customer.
[0088] Referring to Fig. 8, personalization of queries requires tracking of information in database 30. This information is tracked in query personalization database 800 in database 30. Data for database 800 is derived from tagged senses identified when the embodiment disambiguates a query.
[0089] It will be appreciated that for a user of a search engine, there are at least three types of temporal relationships between the user and the search engine. The user is defined as a person that uses a search engine. A period of interactivity between the user and the search engine, with a clear beginning and end, is defined as a session. The session may be for a defined period of time. During the session, he may be looking for a few specific topics, e.g. vacation sites. The collective searches of all of the user's sessions define his user data. All of the user data of all of the users of the search engine define the common data for the search engine.
[0090] To track user, session and common information, query personalization database 800 is partitioned into three sets of data: a set of common data 802 relating to word sense tags used by all users; a set of per user data 804; and a set of per user session data 806. Other sets of data may also be tracked.
[0091] Data in database 800 is updated at sufficient intervals for each type of data with sense tagged queries or information transformed from the related queries. For example, per user session data 806 may be updated after each query; per user data 804 may be updated at the beginning or end of each session of a user; and common data 802 may be updated at periodic time intervals. A user can be identified to the embodiment by installing and evaluating cookies installed on his machine. It will be appreciated that if a user activates several sessions, separate cookies can be provided on his machine to identify each session.
[0092] Common data 802 may be stored in a consolidated common partition of query personalization database 800. Per user data 804 and per user session data 806 may be stored in a partition of query personalization database 800 that exists for each user. The sense tagged queries and derived information may be stored in a temporary partition that exists in the system's memory for each user session. Preferably, there is a file for the common data, for each user, and for each user session. Part of the data in these files is loaded into system memory as needed when disambiguating a query.
[0093] When disambiguating a query for a given user in a specific user session, the additional information from query personalization database 800 may be used by other components simultaneously. This can cause those components to generate different results under different circumstances. The common, per user, and per user session information derived from the sense tagged queries is used as input to the components in addition to the core disambiguation database. It will be appreciated that different data may affect different queries. Data associated with a session may only affect queries associated with that session. Data associated with one user may only affect queries associated with that user. Common data may affect any user.
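The scoping of common, per user and per user session data could be sketched as below. The class and method names are hypothetical, and for brevity the sketch updates all three partitions on every query, whereas the description allows each partition to be updated at its own interval.

```python
from collections import Counter

class PersonalizationStore:
    """Toy model of the three partitions of query personalization data."""

    def __init__(self):
        self.common = Counter()   # shared by all users
        self.per_user = {}        # user -> Counter
        self.per_session = {}     # (user, session) -> Counter

    def record(self, user, session, sense):
        """Record one confirmed sense tag from a query (simplified timing)."""
        self.per_session.setdefault((user, session), Counter())[sense] += 1
        self.per_user.setdefault(user, Counter())[sense] += 1
        self.common[sense] += 1

    def context(self, user, session):
        """Return only the data allowed to influence this user's query."""
        return (self.common,
                self.per_user.get(user, Counter()),
                self.per_session.get((user, session), Counter()))

store = PersonalizationStore()
store.record("alice", "s1", "java/language")
common, user_data, session_data = store.context("bob", "s9")
```

A query from a different user sees the shared count but none of the first user's per user or per session data.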
[0094] Referring to Fig. 7, algorithm 700 is shown which identifies notable steps of personalization of data. In particular for algorithm 700, its steps are as follows:
1. First, obtain the query (step 702);
2. Disambiguate the query using personalization data (step 704);
3. After step 704, conduct, in parallel, steps along path 706 and path 708;
A) In path 706, the following steps are performed:
- Expand the query for semantically related senses to find suitable semantically related senses for the identified word utilizing the knowledge base (step 710);
- Compare the expanded set of query senses to an index of the senses found in disambiguated documents (step 712);
- Return results of the query (step 714);
- Proceed to obtain user input/feedback (step 716);
B) In path 708, simply step 716 is done next;
4. Upon completion of paths 706 and 708, obtain user feedback on the selected interpretation or selected document (step 716); and
5. Update query personalization data (step 718).
[0095] For algorithm 700, for steps 716 and 718, conducting personalization of data involves: acquiring and storing of personalized data relating to a query; and using data to improve word sense disambiguation of queries. Each requirement is discussed in turn.
[0096] For acquiring and storing data, it is already assumed that a system exists for sense tagging initial queries of a user. A validated sense tagged query has a word sense assigned to each of the query keywords. It is preferable that the system has vetted the word senses such that there is high confidence that the word senses represent the intended meaning of the words.
[0097] As a user submits a query to a search engine, the sense tagged query as well as other information derived from it is stored in query personalization database 800. Information derived from the sense tagged queries is stored in a file for disambiguation algorithms of disambiguation engine 32. The disambiguation algorithms include: a priors algorithm; an example memory algorithm; an n-gram algorithm; a dependencies algorithm and a classifier algorithm. Details of each algorithm are described below. Other algorithms may also be used.
[0098] The priors algorithm predicts word senses by utilizing historical statistical data on frequency of appearances of various word senses. Specifically the algorithm assigns a probability to each word sense based on the frequency of the word sense in the input sense tagged text. Therein, senses in the input sense tagged text are counted and the frequency distribution of the senses for each word is preferably normalized. Note the input sense tagged text is not the text being disambiguated but is text that has previously been disambiguated and where the confidence that the intended meaning has been correctly identified is very high.
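A minimal sketch of the counting and normalization performed by the priors algorithm is given below. The function name and the sample history are assumptions; the input stands for previously disambiguated, high-confidence text or queries.

```python
from collections import Counter, defaultdict

def sense_priors(tagged_text):
    """Count senses per word and normalize each word's distribution.

    tagged_text is a list of passages, each a list of (word, sense) pairs.
    """
    counts = defaultdict(Counter)
    for passage in tagged_text:
        for word, sense in passage:
            counts[word][sense] += 1
    return {word: {sense: n / sum(c.values()) for sense, n in c.items()}
            for word, c in counts.items()}

# Hypothetical history of sense tagged queries.
history = [
    [("java", "java/language")],
    [("java", "java/language")],
    [("java", "java/island")],
    [("java", "java/language")],
]
priors = sense_priors(history)
```

Running the same function over all queries, one user's queries, or one session's queries would give the common, per user and per session distributions respectively.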
[0099] For optimization and performance reasons, the priors algorithm computes a frequency count for each sense from the sense tagged text and stores the frequency data as a file in database 800. The core database contains the frequency counts obtained from sense tagged text while the personalization database 800 holds the word sense frequency counts of sense tagged queries. Also, a consolidated file exists containing the frequency count of word senses of sense tagged queries from all users. A separate file exists in database 800 for each user containing the word sense frequency count of sense tagged queries associated with that user. These files containing the user information, user session data, and common data for all users represent the query personalized data. This data is stored in the personalization database 800. Thus, after the files are updated, the sense distribution derived from the last execution of the algorithm is available for the next execution of the priors algorithm.
[00100] Finally, the system maintains a frequency count of the sense tagged queries of a specific user's session either in memory or on a hard disk. Preferably, this data is not used when disambiguating a query with personalization information.
[00101] Therein, senses in the sense tagged query are counted and the frequency distribution of the senses for each word is preferably normalized. The set of queries used can be all queries from all users, all queries from one user, or the queries from one user session. The system updates the frequency count as each query is processed or at appropriate intervals. The normalization of the frequency distribution may be performed on a word-by-word basis when disambiguating that word in a new query or text.
[00102] The example memory algorithm predicts word senses for phrases (or word sequences). Phrases typically are defined as a series of consecutive words. A phrase can be two words long up to a full sentence. The algorithm accesses a list of phrases (word sequences) which provide a deemed correct sense for each word in that phrase. Preferably, the list comprises sentence fragments from input sense tagged text that occurred multiple times where the senses for each of the fragment's occurrences were identical. Preferably, when an analyzed phrase contains a word which has a sense which differs from a sense previously attributed to that word in that phrase, senses in the analyzed phrase are rejected and are not retained in the list of word sequences.
[00103] When disambiguating a new text or query, the example memory algorithm identifies whether parts of the text or query match the previously identified recurring sequences of words. If there is a match, the module assigns the word senses of the sequence to the matching words in the new text or query. Preferably, the algorithm initially searches for the longest match and does not assign the word senses if a word sense contradicts senses that have already been identified in the text or query. When analyzing a query, the algorithm searches for matches of sentence fragments from the query being processed to fragments in its associated list. When a match is located, the sense from the list is assigned to the fragment being processed. The algorithm maintains several lists to assist in its processing, including: a list of word sequences with correct senses that were derived from training input sense tagged text; a list derived from sense tagged queries from all users; a list derived from all queries of a user; and a list derived from the queries of a user's session.
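The matching behaviour of the example memory algorithm, longest phrase first and no contradiction of senses already identified, might be sketched as follows. The function name, the phrase list and the sense labels are all assumptions made for illustration.

```python
def apply_example_memory(words, memory, known=None):
    """Assign senses from remembered phrases to matching word sequences.

    memory maps a tuple of words to a tuple of senses (one per word).
    known maps word positions to senses that were already identified.
    """
    senses = dict(known or {})
    for phrase in sorted(memory, key=len, reverse=True):  # longest match first
        n = len(phrase)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) != phrase:
                continue
            proposed = dict(zip(range(i, i + n), memory[phrase]))
            # Skip the phrase if it contradicts a sense already identified.
            if any(senses.get(pos, s) != s for pos, s in proposed.items()):
                continue
            senses.update(proposed)
    return senses

# Hypothetical list of recurring sequences with their agreed senses.
MEMORY = {
    ("java", "virtual", "machine"): ("java/language", "virtual/a1", "machine/n1"),
    ("java", "coffee"): ("java/coffee", "coffee/n1"),
}
query = ["install", "java", "virtual", "machine"]
tagged = apply_example_memory(query, MEMORY)
blocked = apply_example_memory(query, MEMORY, known={1: "java/island"})
```

In the second call the remembered phrase is not applied, because the sense it proposes for "java" conflicts with the one already known.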
[00104] For optimization and performance reasons, the example memory algorithm stores data regarding identification of recurring sequences of word senses and frequency of that pattern as separate data in a file. This is done instead of processing the input sense tagged text each time the embodiment disambiguates new text. The example memory algorithm also stores a file containing information derived from the sense tagged queries. There is also a file for the common data; a file for each user; and a file for each user session. These files, holding the user, user session and common data, represent the query personalized data. Part of the data in these files is loaded into the system memory as needed when processing the disambiguation of a query. When the files are updated, on the next execution of the priors algorithm, the senses derived from the last execution of the algorithm become available for the knowledge base.
[00105] The n-gram algorithm predicts a sense of a single word by looking for recurring patterns of words or word senses in words around the single word. While, generically, the algorithm looks n words before or after the single word, n is typically set at two words. The algorithm utilizes a list of word pairs with a correct sense associated with each word. This list is derived from word pairs from input sense tagged text that occurred multiple times, where the senses for each of the word pair occurrences were identical. However, when a sense of at least one word differs, such word pair senses are rejected and are not retained in the list. When disambiguating text, the algorithm matches word pairs from the query or text being processed with word pairs present in the list maintained by the algorithm. A match is identified when a word pair is found and the sense of one of the two words is already present in the query or text being processed. When a match is identified, the algorithm assigns the sense relating to the second word in the word pair being processed. The n-gram algorithm maintains several lists, including: a list of word pairs with correct senses that it derived from training input sense tagged text, a list derived from sense tagged queries from all users, a list derived from all queries of a user, and a list derived from the queries of a user's session.
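By way of illustration only, the construction and use of the word pair list may be sketched as follows. This is a minimal sketch under stated assumptions: it considers adjacent word pairs only, treats "occurred multiple times" as at least two occurrences (the hypothetical min_count parameter), and uses hypothetical function names and sense labels.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

TaggedWord = Tuple[str, str]  # (word, sense label); labels are hypothetical
PairList = Dict[Tuple[str, str], Tuple[str, str]]


def build_pair_list(corpus: List[List[TaggedWord]], min_count: int = 2) -> PairList:
    """Keep adjacent pairs seen at least min_count times with identical senses each time."""
    seen = defaultdict(list)
    for sentence in corpus:
        for (w1, s1), (w2, s2) in zip(sentence, sentence[1:]):
            seen[(w1, w2)].append((s1, s2))
    return {
        pair: tagged[0]
        for pair, tagged in seen.items()
        if len(tagged) >= min_count and len(set(tagged)) == 1
    }


def apply_pairs(
    words: List[str],
    senses: List[Optional[str]],
    pair_list: PairList,
) -> List[Optional[str]]:
    """Fill in one unknown sense when the other word of a listed pair is already known."""
    result = list(senses)
    for i in range(len(words) - 1):
        stored = pair_list.get((words[i], words[i + 1]))
        if stored is None:
            continue
        left, right = result[i], result[i + 1]
        if left == stored[0] and right is None:
            result[i + 1] = stored[1]
        elif right == stored[1] and left is None:
            result[i] = stored[0]
    return result
```

A pair whose occurrences carry differing senses is dropped by build_pair_list, and apply_pairs leaves a pair untouched when neither of its words has a known sense.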
[00106] The n-gram algorithm differs from the example memory algorithm in that it operates over a fixed range of words and only attempts to predict the sense of a single word at a time. The example memory algorithm attempts to predict word senses of all the words in a sequence.
[00107] For optimization and performance reasons, the n-gram algorithm stores in a separate file information regarding recurring patterns of surrounding words or word senses, and the frequency of each pattern, which it has derived from input sense tagged text. This is done instead of processing the input sense tagged text each time the embodiment disambiguates new text. In addition to the file in the core database, the n-gram algorithm stores into system memory: a file of information derived from the sense tagged queries; a file for the common data; a file for each user; and a file for each user session. The files for the user, user session and common data represent the query personalization data. Part of the data in these files is loaded into the system memory as needed when processing the disambiguation of a query. Information in the user and user session files is updated when each new sense tagged query from a user becomes available. When the files are updated, on the next execution of the priors algorithm, the senses derived from the last execution of the algorithm become available for the knowledge base.
[00108] The dependencies algorithm is similar to the n-gram algorithm, but it generates a syntactic parse tree (e.g. adjective modifies noun, first noun modifies second noun in a noun phrase, etc.). It operates on associations between the head and the modifier in the parse tree.
[00109] The classifier algorithm predicts a sense of words by regrouping into topics possible senses for the words in a text segment. The senses with the strongest overlap (i.e., that can be best clustered) are deemed the most likely senses for the set of words in the segment. The overlap can be measured in terms of several different features (e.g., coarse senses, fine senses, etc.). The scope of the document text can vary from a few words to several sentences or paragraphs. The classifier algorithm uses words and word senses in previous queries of the user's session as additional context to personalize the disambiguation of the current query. The word senses of the previous queries are added to the set of possible topics.
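By way of illustration only, one way of measuring such overlap is sketched below. The description does not specify the clustering technique, so this sketch substitutes a simple exhaustive search that scores each combination of candidate senses by the number of topic features shared between them. The topic inventory, the function name pick_senses and the scoring rule are assumptions, and the exhaustive search is practical only for short segments.

```python
from itertools import combinations, product
from typing import Dict, Sequence, Set, Tuple

# Hypothetical inventory: candidate sense -> set of coarse topic features.
SenseTopics = Dict[str, Set[str]]


def pick_senses(
    candidates: Sequence[Sequence[str]],
    topics: SenseTopics,
    session_senses: Sequence[str] = (),
) -> Tuple[str, ...]:
    """Return one sense per word, maximising pairwise topic overlap.

    Senses from earlier queries in the session are added as extra context.
    """
    context = [topics.get(sense, set()) for sense in session_senses]
    best: Tuple[str, ...] = ()
    best_score = -1
    for combo in product(*candidates):
        groups = [topics.get(sense, set()) for sense in combo] + context
        score = sum(len(a & b) for a, b in combinations(groups, 2))
        if score > best_score:  # ties keep the first combination found
            best, best_score = combo, score
    return best
```

For a single-word query, the session context alone decides the outcome in this sketch, which mirrors the use of previous queries as additional context described above.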
[00110] Turning back to the process of using personalization data to improve word sense disambiguation of queries, when disambiguating a query, each disambiguation engine 32 component makes use of the core database and any available information in query personalization database 800. Each component can be configured to access the core database and the query personalization database 800 both independently and collectively in distinct steps during the word sense disambiguation process.
[00111] Figure 9 illustrates a further algorithm for a method for processing a query having alternate interpretations. As shown, the algorithm 900 first comprises receiving or obtaining a query 902 from a user as with the previously described algorithms. As indicated above, a query may comprise one or more words and may include Boolean terms. The query is then disambiguated to identify its interpretations 904. As discussed above, this step is executed by a disambiguation module of the system. In the disambiguation process, the word or words in the query are provided with a set of meanings and interpretations of the query are obtained by forming collections of related groups of such query word meanings. It will be understood that the length or detail of the query will determine the number of possible interpretations. For example, in a detailed query, only one or a few interpretations may be identified. In other situations, where the query is not detailed or comprises, for example, a single word, numerous interpretations would be possible.
[00112] The various interpretations of the query are then presented to the user 906. In this step, the interpretations may be first ranked by likelihood. Such ranking is discussed above. The presentation of the various interpretations may be done in various ways. For example, the interpretations can be presented in the form of questions such as "Did you mean ...?", prompting the user to choose one of the presented interpretations. The user may then be prompted to select an interpretation in any manner such as selecting directly from the list of interpretations, entering the number of a selection in an entry box, etc. Various other forms of presentation will be known to persons skilled in the art. In addition, as indicated above, the presentations can optionally be ranked in order of likelihood using the methodologies discussed above.
[00113] In situations where numerous interpretations are possible, the method may optionally involve listing only a select number of choices for the user. For example, in a further embodiment, the method shown in Figure 9 may optionally include a determination of a threshold likelihood (not shown) after step 904. In other words, the interpretations generated by the disambiguation module are ranked based on the likelihood of the interpretations matching the meaning intended by the user. Further, where various interpretations are identified, the method may involve ranking each interpretation in order of likelihood and listing only those that have a likelihood above a pre-determined value. It will be appreciated that in situations where only one interpretation meets such threshold likelihood, steps 906 and 908 may be bypassed.
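By way of illustration only, the optional threshold step may be sketched as follows. The function names, the representation of an interpretation as a label paired with a likelihood, and the numeric likelihoods in the usage example are hypothetical. The description does not address the case where no interpretation exceeds the threshold, and the sketch leaves that case to the caller.

```python
from typing import List, Tuple

Interpretation = Tuple[str, float]  # (label, likelihood); values are illustrative


def shortlist(interpretations: List[Interpretation], threshold: float) -> List[Interpretation]:
    """Rank by descending likelihood and keep only entries above the threshold."""
    ranked = sorted(interpretations, key=lambda item: item[1], reverse=True)
    return [item for item in ranked if item[1] > threshold]


def can_bypass_prompt(shortlisted: List[Interpretation]) -> bool:
    """Steps 906 and 908 may be skipped when exactly one interpretation remains."""
    return len(shortlisted) == 1


# Usage with made-up likelihoods for the query "java".
candidates = [("coffee", 0.2), ("programming language", 0.7), ("Indonesian island", 0.1)]
print(shortlist(candidates, 0.15))
print(can_bypass_prompt(shortlist(candidates, 0.5)))
```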
[00114] Once the user selects the desired interpretation of the query 908, the method involves the steps of expanding the query 910, comparing the expanded query to the database index 912, and returning the results of the query 914. These steps of the method have been discussed above.
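By way of illustration only, the overall flow of Figure 9 may be sketched as a sequence of calls. The function process_query and its four callable parameters are hypothetical placeholders for the disambiguation, presentation, expansion and index comparison components; the sketch shows only the order of the steps and not how any component works.

```python
from typing import Callable, List


def process_query(
    query: str,
    disambiguate: Callable[[str], List[str]],
    choose: Callable[[List[str]], str],
    expand: Callable[[str, str], List[str]],
    search: Callable[[List[str]], List[str]],
) -> List[str]:
    """Run the steps of Figure 9 in order on a query received at step 902."""
    interpretations = disambiguate(query)  # step 904: identify interpretations
    selected = choose(interpretations)     # steps 906 and 908: present and select
    expanded = expand(query, selected)     # step 910: expand the query
    return search(expanded)                # steps 912 and 914: compare and return
```

Note that no results are retrieved in this sketch until the choose callable has returned, which reflects the ordering of selection before results described above.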
[00115] Persons skilled in the art will appreciate that by presenting the user with a choice of interpretations prior to presenting results, the method of Figure 9 offers various advantages. Firstly, the method avoids the time spent presenting results for a most likely interpretation. This would be of value where the most likely interpretation determined by the method is not the intended meaning of the user. In addition, by initially presenting the user with only a list of interpretations, the user interface (i.e. screen) is not filled with potentially unwanted results.
[00116] It will also be appreciated that the method of Figure 9 is particularly suited, though not exclusively, to searches conducted using a mobile device such as a cellular telephone, personal digital assistant (such as a Blackberry™ device), and various other similar devices as will be known to persons skilled in the art. For example, as discussed above, one of the advantages of the method of Figure 9 is that it does not fill the user's screen with potentially unwanted results. This advantage is of particular relevance for users conducting searches on hand held devices such as PDAs or cell phones where the small size of the screen makes it necessary for the user to scroll through numerous results. In addition, the speed of information retrieval may be increased for mobile searching by avoiding the need for presenting potentially unwanted results.
[00117] A further advantage offered by the method of Figure 9 lies in the fact that queries submitted by a user need not be detailed since the method involves the initial step of interpreting the query and obtaining clarification from the user before proceeding. As will be appreciated, this is again of particular relevance to mobile searching where entry of key strokes is rendered more difficult as compared to desktop keyboards. Thus, the user is able to enter shorter or more ambiguous queries and the method will provide feedback on possible interpretations allowing the user to simply choose the desired interpretation. By way of example, a user may simply enter the term "java" as a query. Prior to accumulating and presenting results of a search, the method of Figure 9 would present the user with a choice of interpretations such as: coffee, programming language, and Indonesia. Results are then presented after one of the interpretations is selected.
[00118] As discussed above, one aspect of the present invention involves the personalization and/or customization of searches. That is, a user's prior search history may be used to aid in the interpretation of queries. The above discussion made reference to the creation of a query personalization database such as database 800 shown in Figure 8. Such personalization is an important feature since some queries are irresolvably ambiguous, unless it is known how the user entering the query makes use of word meanings. Thus, the method of the present invention is capable of learning how the user makes use of word meanings, either overall or during a session based on the choices made in conducting the present or past query or queries. It will be understood that this feature is very useful for minimizing the number of words a user needs to enter for a given query. This learning process is non-intrusive since it involves tracking the word meanings a user tends to use as opposed to tracking the sites visited, etc. For example, the query "Java" could be assumed to refer to the Indonesian island, if the previous query was about Indonesia, or if the user had a previous history (over several sessions) indicating a preference for this meaning of the word or for geographical meanings in general. Such personalization of queries is also adaptable to mobile searching. That is, given that mobile phones tend to be personal, information related to a user's prior query could easily be associated with a particular mobile phone number. As will be appreciated, this personalization step increases the precision of search results, while reducing the number of words (and therefore number of keystrokes) a user needs to enter to conduct a search.
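By way of illustration only, the "Java" example may be sketched with a small profile that counts the coarse categories of the meanings a user has selected. The class SenseProfile, the representation of a sense as a category paired with a label, and the rule of consulting the session before the longer history are assumptions made for this sketch; the description does not prescribe them.

```python
from collections import Counter
from typing import List, Optional, Tuple

Sense = Tuple[str, str]  # (coarse category, sense label); both hypothetical


class SenseProfile:
    """Counts the categories of selected meanings, per session and overall."""

    def __init__(self) -> None:
        self.session: Counter = Counter()
        self.history: Counter = Counter()

    def record(self, sense: Sense) -> None:
        """Note a meaning the user selected for a query."""
        category = sense[0]
        self.session[category] += 1
        self.history[category] += 1

    def end_session(self) -> None:
        """Discard session counts while keeping the overall history."""
        self.session.clear()

    def preferred(self, candidates: List[Sense]) -> Optional[Sense]:
        """Prefer a candidate whose category was used this session, else in the history."""
        for counts in (self.session, self.history):
            used = [sense for sense in candidates if counts[sense[0]] > 0]
            if used:
                return max(used, key=lambda sense: counts[sense[0]])
        return None
```

Only word meanings are recorded in this sketch, and not the sites visited, consistent with the non-intrusive learning described above.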
[00119] The above method of presenting a question for further defining the interpretation of a query offers a further advantage with respect to the personalization process. That is, in one aspect as defined above, the user selects one of the presented results, which then serves to further narrow the other results presented. This may be considered as indirect feedback to the system from the user. However, with the use of an initial question, the user is able to provide direct feedback by the selection of a specific interpretation of the query and is, in fact, encouraged to do so since no results are presented prior to or in conjunction with the question. As will be understood by persons skilled in the art, such direct feedback improves the quality of the personalization process. Moreover, as indicated above, the method of the invention utilizes a user's prior search history to further provide more accurate search results.
[00120] Although the invention has been described with reference to certain specific embodiments, various modifications thereof will be apparent to those skilled in the art without departing from the scope of the invention as outlined in the claims appended hereto. A person skilled in the art would have sufficient knowledge of at least one or more of the following disciplines: computer programming, machine learning and computational linguistics.


Claims

WE CLAIM:

1. A method of processing a query directed to a database, the query comprising one or more words, said method comprising the steps of:
- obtaining said query from a user;
- disambiguating the query using a knowledge base to obtain a set of meanings for said one or more words;
- obtaining a set of interpretations of said query based on the set of meanings;
- presenting the user with the set of interpretations;
- obtaining from the user a selected interpretation from the set of interpretations;
- identifying relevant results from said database related to said selected interpretation; and,
- presenting said relevant results to the user.

2. The method of claim 1 further comprising ranking said interpretations according to likelihood prior to presenting to the user.

3. The method of claim 2 wherein said set of interpretations comprises interpretations that meet a threshold level of likelihood.

4. The method of claim 3 wherein said step of disambiguating said query comprises utilizing an algorithm selected from: an example disambiguation algorithm, an n-word disambiguation algorithm, a priors disambiguation algorithm; a dependencies algorithm and a classifying algorithm.

5. A system for processing a query directed to a store of information, the query comprising one or more words, said system comprising:
- a means for obtaining said query from a user;
- a database comprising a knowledge base;
- a disambiguation module for disambiguating said query using said knowledge base to provide a set of meanings for said one or more words and to provide a set of interpretations of the query, each of said interpretations comprising a collection of query word meanings;
- a means for presenting said set of interpretations to the user;

- a means for obtaining from the user a selected interpretation from the set of interpretations;
- a processor for utilizing said selected interpretation to identify relevant results from said database;
- a means for presenting said results to the user.

6. The system of claim 5 further comprising a ranking module for ranking said query interpretations prior to presenting to said user.

7. The system of claim 6 wherein the means for obtaining said query comprises a mobile communication device.

8. The system of claim 7 wherein said mobile device comprises a cellular telephone or personal digital assistant.