US20050086215A1 - System and method for harmonizing content relevancy across structured and unstructured data - Google Patents


Info

Publication number
US20050086215A1
Authority
US
United States
Prior art keywords
structured data
function
intrinsic
scores
adjusted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/972,248
Inventor
Igor Perisic
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Entopia Inc
Original Assignee
Entopia Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Entopia Inc filed Critical Entopia Inc
Priority to US10/972,248
Assigned to ENTOPIA, INC. (assignment of assignors interest; see document for details). Assignors: PERISIC, IGOR
Publication of US20050086215A1
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/9538 Presentation of query results
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access

Definitions

  • the disclosed embodiments relate generally to information retrieval systems and methods, and more particularly to a system and method for harmonizing content relevancy across structured and unstructured data.
  • Searchers looking for information typically make use of an information retrieval system.
  • an information retrieval system typically consists of document management software, such as Applicant's QUANTUM™ suite, or iManage Inc.'s INFORITE™ or WORKSITE™ products.
  • Information retrieval from the Internet is typically undertaken using a search engine, such as YAHOO™ or GOOGLE™.
  • these information retrieval systems extract keywords from each document in a network. Such keywords typically contain no semantic or syntactic information. For each document, each keyword is then indexed into a searchable data structure with a link back to the document itself.
  • a user supplies the information retrieval system with a query containing one or more search terms, which may be separated by Boolean operators, such as “AND” or “OR.” These search terms can be further expanded through the use of a Thesaurus.
  • the information retrieval system attempts to locate information, such as documents, that match the searcher supplied (or expanded) keywords.
  • the information retrieval system searches through its databases to locate documents that contain at least one keyword that matches one of the search terms in the query (or its expanded version).
  • the information retrieval system then presents the searcher with a list of document records for the documents located.
  • the list is typically sorted based on document ranking, where each document is ranked according to the number of keyword to search term matches in that document relative to those for the other located documents.
  • GOOGLE™ and WISENUT™ rank document relevancy as a function of a network of links pointing to the document.
  • Methods based on Salton's work, such as ORACLE™ Text, rank document relevancy as a function of the number of relevant documents within the repository.
  • An example of unstructured data is free form text.
  • An example of structured data is data organized into one or more fields having one or more restrictions.
  • an unrestricted field can include whatever values the owner wants to provide.
  • a restricted field can include content which is constrained to a controlled vocabulary, size, or other parameter.
  • documents or other data objects are searched for keywords without considering whether the documents are structured or unstructured. Since structured documents may be more relevant to a searcher than unstructured documents, the nature of the document structure should be considered when determining its relevancy to a search request.
  • a search request is received from a requester including one or more search terms.
  • one or more objects are located that fulfill the search request.
  • a relevancy score is computed for each object based on whether the object includes structured or unstructured data. The relevancy scores enable the requester to determine the content relevancy of the located objects.
  • a method for retrieving information includes receiving a search request from a requester including one or more search terms; searching a plurality of objects based on at least one search term; identifying at least one object associated with at least one search term; and determining a relevancy score for the object based on whether the object includes structured or unstructured data.
  • the object can include structured data and the relevancy score can be determined at least in part on whether the structured data includes restricted or unrestricted fields.
  • the restricted fields can be associated with a first modifier value and the unrestricted fields can be associated with a second modifier value.
  • the first and second modifier values can be the same values or different values.
  • the modifier values can be adjustable and expandable.
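The restricted/unrestricted modifier scheme above can be sketched in a few lines. This is a hypothetical illustration, assuming the modifiers simply scale a field's score; the names and the concrete modifier values are not taken from the disclosure, which leaves them adjustable.

```python
# Illustrative sketch of field-modifier scoring; the concrete values
# below are assumptions, since the patent leaves them adjustable.
RESTRICTED_MODIFIER = 1.5    # first modifier value (assumed)
UNRESTRICTED_MODIFIER = 1.0  # second modifier value (assumed)

def score_field(base_score: float, restricted: bool) -> float:
    """Scale a field's intrinsic score by its associated modifier value."""
    modifier = RESTRICTED_MODIFIER if restricted else UNRESTRICTED_MODIFIER
    return base_score * modifier
```

With these assumed values, a restricted field outscores an unrestricted field of equal base score, reflecting the idea that constrained content may be more relevant.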
  • an information retrieval system includes a processor and a memory coupled to the processor.
  • the memory includes instructions, which, when executed by the processor, causes the processor to perform the operations of receiving a search request from a requester including one or more search terms; searching a plurality of objects based on at least one search term; identifying at least one object associated with at least one search term; and determining a relevancy score for the object based on whether the object includes structured or unstructured data.
  • a computer-readable medium includes instructions, which, when executed by a processor, causes the processor to perform the operations of receiving a search request from a requester including one or more search terms; searching a plurality of objects based on at least one search term; identifying at least one object associated with at least one search term; and determining a relevancy score for the object based on whether the object includes structured or unstructured data.
  • the relevancy score is determined based on the inclusion of structured or unstructured data and on one or more other factors, such as, for example, the expertise of users/searchers, creators and/or requesters.
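The claimed flow, receiving search terms, locating matching objects, and scoring each higher when it contains structured data, can be sketched as follows. The match counting, the `STRUCTURED_BONUS` value, and the object layout are illustrative assumptions, not taken from the disclosure.

```python
# Minimal sketch of the claimed method: locate objects matching any
# search term, then weight structured objects more heavily.
# STRUCTURED_BONUS and the dict layout are assumptions for illustration.
STRUCTURED_BONUS = 1.25

def search(objects, terms):
    """Return (object id, relevancy) pairs for objects matching any term,
    sorted by descending relevancy score."""
    results = []
    for obj in objects:
        matches = sum(obj["text"].lower().count(t.lower()) for t in terms)
        if matches == 0:
            continue
        score = float(matches)
        if obj.get("structured"):   # structured data may be more relevant
            score *= STRUCTURED_BONUS
        results.append((obj["id"], score))
    return sorted(results, key=lambda r: -r[1])
```

In this sketch a structured object with the same number of term matches as an unstructured one ranks above it, which is the harmonization effect the claims describe.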
  • FIG. 1 is a block diagram of a system architecture for a system for personalizing information retrieval.
  • FIG. 2 is a block diagram of a creator device, contributor device, or searcher device, as shown in FIG. 1 .
  • FIG. 3 is a block diagram of the information retrieval system and Repository of FIG. 1 .
  • FIG. 4 is a flow chart of document collection according to an embodiment of the invention.
  • FIG. 5 is a flow chart of a process for information retrieval according to an embodiment of the invention.
  • FIG. 6 is a flow chart of document collection for use with structured data according to an embodiment of the invention.
  • FIG. 7 is a flow chart of document collection for use with unstructured data according to an embodiment of the invention.
  • FIG. 8 is a flow chart of a process for information retrieval for structured and unstructured data according to an embodiment of the invention.
  • FIG. 9 is a flow chart of a process for adjusting field intrinsic scores according to an embodiment of the invention.
  • FIG. 1 is a block diagram of a system architecture 100 for a system for personalizing information retrieval.
  • An information retrieval system 102 is coupled to a repository 104 and to a network 110 .
  • Also coupled to the network 110 are a searcher device 108 , one or more creator device/s 106 , and one or more contributor device/s 112 .
  • Searcher device 108 , creator device/s 106 , contributor device/s 112 , and information retrieval system 102 are all computing devices, such as clients, servers, or the like.
  • the network is preferably a Local Area Network (LAN), but alternatively may be any network, such as the Internet.
  • Although the searcher device 108 , creator device/s 106 , contributor device/s 112 , and information retrieval system 102 are shown as distinct entities, they may be combined into one or more devices. Further details of the searcher device 108 , creator device/s 106 , contributor device/s 112 , and information retrieval system 102 can be found below in relation to FIGS. 2-5 .
  • the repository 104 is any storage device/s that is capable of storing data, such as a hard disk drive, magnetic media drive, or the like.
  • the repository 104 is preferably contained within the information retrieval system 102 , but is shown as a separate component for ease of explanation. Alternatively, the repository 104 may be dispersed throughout a network, and may even be located within the searcher device 108 , creator device/s 106 , and/or contributor device/s 112 .
  • Each creator device 106 is a computing device operated by a creator who creates one or more documents.
  • Each contributor device 112 is a computing device operated by a contributor who contributes to a document by, for example, adding to, commenting on, viewing, or otherwise accessing documents created by a creator/s.
  • the searcher device 108 is a computing device operated by a searcher who is conducting a search for relevant documents created by the creator/s or contributed to by the contributor/s.
  • the searcher, creator/s, and contributor/s are not limited to the above described roles and may take on any role at different times. Also, the searcher, creator/s, and contributor/s may browse the repository 104 without the use of the information retrieval system 102 .
  • FIG. 2 is a block diagram of a creator device 106 , contributor device 112 , or searcher device 108 , as shown in FIG. 1 .
  • the devices 106 / 108 / 112 preferably include the following components: at least one data processor or central processing unit (CPU) 202 ; a memory 214 ; input and/or output devices 206 , such as a monitor and keyboard; communications circuitry 204 for communicating with the network 110 ( FIG. 1 ) and information retrieval system 102 ( FIG. 1 ); and at least one bus 210 that interconnects these components.
  • Memory 214 preferably includes an operating system 216 , such as but not limited to, VXWORKS™, LINUX™, or WINDOWS™, having instructions for processing, accessing, storing, or searching data, etc.
  • Memory 214 also preferably includes communication procedures for communicating with the network 110 ( FIG. 1 ) and information retrieval system 102 ( FIG. 1 ); searching procedures 220 , such as proprietary search software, a Web-browser, or the like; application programs 222 , such as a word processor, email client, database, or the like; a unique user identifier 224 ; and a cache 226 for temporarily storing data.
  • the unique user identifier 224 may be supplied by the creator/searcher/contributor each time he or she performs a search, such as by supplying a username.
  • the unique user identifier 224 may be the user's login username, Media Access Control (MAC) address, Internet Protocol (IP) address, or the like.
  • FIG. 3 is a block diagram of the information retrieval system 102 and Repository 104 of FIG. 1 .
  • the repository 104 is preferably contained within the information retrieval system 102 .
  • the information retrieval system 102 preferably includes the following components: at least one data processor or central processing unit (CPU) 302 ; a memory 308 ; input and/or output devices 306 , such as a monitor and keyboard; communications circuitry 304 for communicating with the network 110 ( FIG. 1 ), creator device/s 106 ( FIG. 1 ), contributor device/s 112 ( FIG. 1 ), and/or searcher device 108 ( FIG. 1 ); and at least one bus 310 that interconnects these components.
  • Memory 308 preferably includes an operating system 312 , such as but not limited to, VXWORKS™, LINUX™, or WINDOWS™, having instructions for processing, accessing, storing, or searching data, etc.
  • Memory 308 also preferably includes communication procedures 314 for communicating with the network 110 ( FIG. 1 ), creator device/s 106 ( FIG. 1 ), contributor device/s 112 ( FIG. 1 ), and/or searcher device 108 ( FIG. 1 ); a collection engine 316 for receiving and storing documents; a search engine 324 ; intrinsic relevancy adjustment procedures 325 ; expertise adjustment procedures 326 ; a repository 104 , as shown in FIG. 1 ; and a cache 338 for temporarily storing data.
  • the collection engine 316 comprises a keyword extractor or parser 318 that extracts text and/or keywords from any suitable document, such as an ASCII or XML file, Portable Document Format (PDF) file, word processing file, or the like.
  • the collection engine 316 also preferably comprises a concept identifier 320 .
  • the concept identifier 320 is used to extract the document's important concepts.
  • the concept identifier may be a semantic, syntactic, or linguistic engine, or the like.
  • the concept identifier 320 is a semantic engine, such as TEXTANALYST™ made by MEGAPUTER INTELLIGENCE™ Inc.
  • the collection engine 316 also preferably comprises a metadata filter 322 for filtering and/or refining the concept/s identified by the concept identifier 320 .
  • metadata about each document is stored in the repository 104 . Further details of the processes performed by the collection engine 316 are discussed in relation to FIG. 4 .
  • metadata includes any data, other than raw content, associated with a document.
  • the search engine 324 is any standard search engine, such as a keyword search engine, statistical search engine, semantic search engine, linguistic search engine, natural language search engine, or the like. In a preferred embodiment, the search engine 324 is a semantic search engine.
  • the expertise adjustment procedures 326 are used to adjust an object's intrinsic score to an adjusted score based on the expertise of the searcher, creator/s, and/or contributor/s.
  • the content relevancy harmonizer 325 is used to adjust the object's field intrinsic score to an adjusted score based on whether the object includes structured and/or unstructured data.
  • the expertise adjustment procedures 326 and the content relevancy harmonizer 325 are described in further detail below in relation to FIGS. 5-9 .
  • a file collection 328 ( 1 )-(N) is created in the repository 104 for each object input into the system, such as a document or source.
  • Each file collection 328 ( 1 )-(N) preferably contains: metadata 330 ( 1 )-(N), such as associations between keywords, concepts, or the like; content 332 ( 1 )-(N), which is preferably ASCII or XML text or the content's original format; and contributions 334 ( 1 )-(N), such as contributor comments or the like.
  • each file collection contains content 332 ( 1 )-(N).
  • the repository 104 also contains user profiles 336 ( 1 )-(N) for each user, i.e., each searcher, creator, or contributor.
  • Each user profile 336 ( 1 )-(N) includes associated user activity, such as which files a user has created, commented on, opened, printed, viewed, or the like, and links to various file collections 328 ( 1 )-(N) that the user has created or contributed to. Further details of use of the repository 104 are discussed in relation to FIG. 5 .
  • FIG. 4 is a flow chart of document collection according to an embodiment of the invention.
  • a creator supplies an object, such as a document or source, to the searching procedures 220 ( FIG. 2 ) at step 402 .
  • the creator may, for example, supply any type of data file that contains text, such as an email, word processing document, text document, or the like.
  • a document comes from a source. Therefore, to supply a source, the creator may provide a link to a document, such as by providing a URL to a Web-page on the Internet, or supply a directory that contains multiple documents.
  • the creator also supplies his or her unique user identifier 224 ( FIG. 2 ), and any other data, such as privacy settings, or the like.
  • the unique user identifier may be supplied without the creator's knowledge, such as by the creator device 106 ( FIG. 1 ) automatically supplying its IP or MAC address.
  • the document, source, and/or other data is then sent to the information retrieval system 102 ( FIG. 1 ) by the communication procedures 218 ( FIG. 2 ).
  • the information retrieval system 102 receives the document, source, and/or other data at step 403 .
  • When supplied with a document, the keyword extractor or parser 318 parses the document and/or source into ASCII text at step 404 , and thereafter extracts the important keywords at step 408 .
  • when supplied with a source, the keyword extractor or parser 318 ( FIG. 3 ) first obtains the document/s from the source before parsing the text and extracting the important keywords.
  • Keywords, document, source, and other data are then stored at step 406 in the repository 104 as part of a file collection 328 ( 1 )-(N) ( FIG. 3 ).
  • the unique user identifier is used to associate or link each file collection 328 ( 1 )-(N) ( FIG. 3 ) created with a particular creator. This link between the creator and the file collection is stored in the creator's user profile 336 ( 1 )-(N) ( FIG. 3 ).
  • the user profile data can be updated by the user him/herself or more preferably by a system administrator.
  • the concept identifier 320 ( FIG. 3 ) then identifies the important concept/s from the extracted keywords at step 410 .
  • the metadata filter 322 ( FIG. 3 ) then refines the concept at step 412 .
  • the refined concept is then stored in the repository 104 as part of the metadata 330 ( 1 )-(N) ( FIG. 3 ) within a file collection 328 ( 1 )-(N) ( FIG. 3 ).
  • contributors can supply their contributions, at step 416 , such as additional comments, threads, or other activity to be associated with the file collection 328 ( 1 )-(N). These contributions are received by the information retrieval engine at step 418 and stored in the repository at step 420 , as contributions 334 ( 1 )-(N). Alternatively, contributions may be received and treated in the same manner as a document/source, i.e., steps 403 - 414 .
  • FIG. 5 is a flow chart of a process for information retrieval according to an embodiment of the invention.
  • a searcher using a searcher device 108 submits a search request to the information retrieval system 102 ( FIG. 1 ), at step 502 . Submittal of this search occurs using searching procedures 220 ( FIG. 2 ) and communication procedures 218 ( FIG. 2 ) on the searcher device 108 ( FIG. 1 ).
  • the search request preferably contains one or more search terms, and the unique user identifier 224 ( FIG. 2 ) of the searcher.
  • the search is preferably conducted to locate objects.
  • Objects preferably include: content objects, such as documents, comments, or folders; source objects; people objects, such as experts, peers, or workgroups; or the like.
  • a search for documents returns a list of relevant documents
  • a search for experts returns a list of experts with expertise in the relevant field.
  • a search for sources returns a list of sources from where relevant documents were obtained. For example, multiple relevant documents may be stored within a particular directory or website.
  • the search is received at step 504 by the information retrieval system 102 ( FIG. 1 ) using communications procedures 314 ( FIG. 3 ).
  • the information retrieval system 102 ( FIG. 1 ) searches the repository 104 for relevant objects at step 506 .
  • This search is undertaken by the search engine 324 ( FIG. 3 ), at step 506 , using any known or yet to be discovered search techniques.
  • the search undertakes a semantic analysis of each file collection 328 ( 1 )-(N) stored in the repository 104 .
  • the search engine 324 locates relevant objects 328 ( 1 )-(N) at step 508 and calculates an intrinsic score at step 510 for each located object.
  • By “located object” is meant any part of a file collection that is found to be relevant, including the content, source, metadata, etc.
  • Calculation of the intrinsic score is based on known, or yet to be discovered techniques for calculating relevancy of located objects based solely on the located objects themselves, the repository itself and the search terms. In its simplest form, such a search calculates the intrinsic score based on the number of times that a search term appears in the content 332 ( 1 )-(N) ( FIG. 3 ) of located objects. However, in a preferred embodiment, this calculation is also based on a semantic analysis of the relationship between words in the content 332 ( 1 )-(N) ( FIG. 3 ).
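The simplest form of the intrinsic score described above, counting how often the search terms appear in a located object's content, can be sketched in a few lines; the function name is an assumption for illustration.

```python
def intrinsic_score(content: str, terms: list[str]) -> int:
    """Simplest intrinsic score: total occurrences of each search term
    in the located object's content (whole-word, case-insensitive)."""
    words = content.lower().split()
    return sum(words.count(t.lower()) for t in terms)
```

A preferred embodiment would replace this count with a semantic analysis of word relationships, but the interface, content plus terms in, scalar score out, stays the same.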
  • the intrinsic score is then adjusted to an adjusted score by the expertise adjustment procedures 326 , at step 512 .
  • This adjustment takes the expertise of the creator/s, searcher, and/or contributor/s into account, as described in further detail below.
  • a list of the located objects is sorted at step 514 .
  • the list may be sorted by any field, such as by adjusted score, intrinsic score, source, most recently viewed, creator expertise, etc.
  • the list preferably containing a brief record for each located object, is then transmitted to the searcher device 108 ( FIG. 1 ) at step 516 .
  • Each record preferably contains the located object's adjusted score, creator, title, etc.
  • the list is then received by the searcher device at step 518 and displayed to the searcher at step 520 .
  • sorting of the list is performed by the searching procedures 220 ( FIG. 2 ) on the searcher device 108 ( FIG. 1 ).
  • IDS = Intrinsic Document Score; DCS = Document Content Score; CCS = Comments Content Score.
  • IDS = a*DCS + (1 − a)*CCS  (2), with “a” being a number between 0 and 1 and determining the importance of the content of a document relative to the content of its attached comments.
  • the DCS and CCS are calculated by any methodology or technique. Existing search engine algorithms may be used to fulfill this task. Also note that the DCS and CCS are not influenced by the searcher that entered the query. In this embodiment, the DCS and CCS can be any number between 2 and 100.
  • Extreme intrinsic scores i.e., scores near 2 or 100, are less influenced than scores near the middle, i.e., scores near 50.
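Formula (2) above is a plain convex combination and can be sketched directly; the default value of "a" below is an arbitrary illustration, since the disclosure only requires it to lie between 0 and 1.

```python
def intrinsic_document_score(dcs: float, ccs: float, a: float = 0.7) -> float:
    """Formula (2): weighted average of the Document Content Score (DCS)
    and the Comments Content Score (CCS).

    a in [0, 1] sets how much the document's own content counts relative
    to its attached comments; a = 0.7 is an assumed illustration.
    """
    assert 0.0 <= a <= 1.0, "a must lie between 0 and 1"
    return a * dcs + (1.0 - a) * ccs
```

Setting a = 1 ignores comments entirely (as in the worked example later in the document), while a = 0 scores a document purely by its comments.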
  • F(User cont.) = Σ i: all relevant documents [2*(W i,max + C i)*((DCS) i/100)^2] + Σ i: all nonrelevant documents [C i] + 2*Taxonomy  (8), where the first sum is over all relevant documents and the second sum is over all non-relevant documents that possessed a relevant comment, i.e., the comment was relevant but not the document.
  • (DCS) i is the intrinsic document relevancy score attained for the i-th relevant document.
  • W i,max is the user activity measure.
  • C i = 0.1*(1 − Exp(−(# relevant comments in Doc i by this user)/2))  (9), and is the reward assigned to matching comments made on documents, relevant or not. A matched comment is not necessarily attached to a relevant document.
  • W i,max accounts for the type of contribution (such as but not limited to creation, commenting, or highlighting). In short, W i,max is the maximum of the following weights (if applicable).
  • W i,edit = 1, if the user created or edited the i-th file collection.
  • W i,comment = 0.5*Max(0, 7 − Min comments(Level))/6, if the user commented on the i-th file collection. Since these comments are organized in a threaded discussion, the weight also depends on how remote a comment is from the file collection itself. For example, a comment on a comment on a comment to the original file collection receives a lesser weight than a comment on the original file collection. In the formula, Level measures how remote the comment is from the file collection. The least remote comment is taken into consideration, as long as it is closer than six comments away from the parent file collection.
  • the taxonomy in this preferred embodiment stands for folder names. Each user has built some part of the repository by naming folders, directories, or sub-directories. For example, creator 1 might have grouped his Hubble telescope pictures in a folder called “Space Images.” The term “Space Images” then becomes part of the user's taxonomy.
  • each folder has an administrator who bestows rights to users, such as the right to access the folder, the right to edit any documents within it, the right to edit only documents that the specific user created, or the right to view but not edit or contribute any document to the folder. Only the names of the folders that a user creates are part of his or her taxonomy.
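The user-contribution function of formula (8), together with the comment reward of formula (9) and the contribution weights above, can be combined into one sketch. The tuple layout for document records and the helper names are assumptions for illustration, not from the disclosure.

```python
import math

def comment_reward(n_relevant_comments: int) -> float:
    """C_i, formula (9): reward for a user's matching comments on
    document i, saturating at 0.1 as the comment count grows."""
    return 0.1 * (1.0 - math.exp(-n_relevant_comments / 2.0))

def comment_weight(level: int) -> float:
    """W_i,comment: decays with how remote the comment is (Level),
    reaching zero once the comment is 7 or more levels away."""
    return 0.5 * max(0, 7 - level) / 6.0

def user_contribution(relevant_docs, nonrelevant_docs, taxonomy_hits):
    """F(User cont.), formula (8).

    relevant_docs: (w_max, n_relevant_comments, dcs) tuples, one per
    relevant document. nonrelevant_docs: per-document counts of matching
    comments on documents whose content itself did not match.
    taxonomy_hits: number of matching folder names in the user's taxonomy.
    """
    total = sum(2.0 * (w_max + comment_reward(nc)) * (dcs / 100.0) ** 2
                for w_max, nc, dcs in relevant_docs)
    total += sum(comment_reward(nc) for nc in nonrelevant_docs)
    return total + 2.0 * taxonomy_hits
```

Note the asymmetry in the sketch: relevant documents contribute through both the activity weight and the comment reward, scaled by the squared normalized document score, while non-relevant documents contribute only their comment reward.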
  • NCO, NCOR and IWCOF are only calculated using non-confidential content objects.
  • the intrinsic score is increased to an adjusted score if the creator of the content objects is more knowledgeable about the searched subject matter than the person that entered the query, i.e., if the creator expertise is higher than the searcher expertise.
  • the intrinsic score is decreased to an adjusted score if the creator is less knowledgeable about the searched subject matter than the searcher, i.e., if the creator expertise is lower than the searcher expertise.
  • RSS_ADJ = intrinsic Source Content Score + expertise adjustment = SCS + R2(SCS)*W2(RS_EXP_ABS)
  • multiple documents may have been saved as multiple file collections from a single Web-site.
  • R2(SCS) determines the maximal amount of the expertise adjustment, or, in other words, the range for the alteration due to the expertise of the creator of the document taken from the source, which depends on the level of the intrinsic source score, i.e., SCS.
  • RS_EXP_ABS(Searcher) is the absolute relevance score of the searcher.
  • the intrinsic score for the source is adjusted upward to an adjusted score if the maximum creator expertise of all creators for a particular source exceeds the searcher expertise.
  • the intrinsic score for the source is lowered to an adjusted score if the creator expertise of all creators for a particular source is lower than the searcher expertise.
  • RSS_ADJ = Min(100, Max(1, Round(RSS_ADJ)))  (20), where Round(d) rounds the number d to its nearest integer.
  • the adjusted score for each document (RS_ADJ) or the adjusted score for sources (RSS_ADJ) is calculated based on the expertise of the searcher, creator/s, and/or contributor/s.
  • Such adjusted scores provide a significant improvement in precision over that attainable through conventional information retrieval systems.
  • an adjusted relevancy score is calculated.
  • Peers are other users that have a similar expertise or come from a similar, or the same, department as the searcher.
  • the adjusted relevancy score uses the expertise values and adjusts them with respect to the searcher's expertise. This is similar to re-sorting the list with respect to the searcher, but instead recalculates the values themselves.
  • the adjusted relevancy score is a measure of the difference between two levels of expertise.
  • the square root maps the difference to a continuous and monotone measure while diminishing the importance of differences when two experts are far apart. It is also asymmetric in the sense that it favors expertise above the searcher expertise. Finally, recall that
  • Table 1 sets out the environment in which a search is conducted. Furthermore, in this illustration, the factor a (from formula 2, determining the importance of the content of a document relative to its attached comments) has been arbitrarily set to 1.
  • TABLE 1
    Number of users: 100; number of experts: 10.
    Total number of file collections: 1000; number of relevant file collections: 10; number of relevant comments: 10.
    Departments of experts and names:
    Marketing: Adam M., Bryan M., Christie M., David M.
    Engineering: Eric E., Fred E., Gail E.
    Finance: Hugo F., Henry F.
    Legal: Ivan L.
  • 100 users having a total number of 1000 file collections in the repository yields 10 experts and 10 relevant file collections. There are also 10 comments that are found to be relevant.
  • the enterprise in which the example takes place has four departments, namely marketing, engineering, finance, and legal. For ease of explanation, each employee's last name begins with the department in which they work.
  • an Intrinsic Document Score is calculated for each located document. This score is a weighted average between a Document Content Score (DCS) and a Comment Content Score (CCS).
  • the DCS and CCS are calculated using any standard search engine techniques.
  • CCS is the Comment Content Score, calculated by any means, such as a semantic engine, frequency of words, etc.
  • W1(RS_EXP_ABS) is then calculated using formula 6 (for different searcher expertise) to yield the following results:
    TABLE 4: W1(RS_EXP_ABS) by searcher expertise
    Searcher Expertise:  0     5     10    15    20    25    30    35    40    45
    Adam M.:             0.29  0.24  0.19  0.14  0.09  0.04  -0.01 -0.06 -0.11 -0.16
    Bryan M.:            0.30  0.25  0.20  0.15  0.10  0.05  0.00  -0.05 -0.10 -0.15
    Christie M.:         0.32  0.27  0.22  0.17  0.12  0.07  0.02  -0.03 -0.08 -0.13
    David M.:            0.25  0.20  0.15  0.10  0.05  0.00  -0.05 -0.10 -0.15 -0.20
    Eric E.
  • EA Expertise Adjustment
  • This entry is DCE +
  • each document and/or source is adjusted to an adjusted score based on the expertise of the users.
  • a document and/or source that may have been less relevant is adjusted so that it is more relevant, or vice versa. In this way, the precision of document and/or source relevancy is improved.
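  • A minimal sketch of this expertise adjustment, assuming the adjusted score is the intrinsic score scaled by an expertise weight W (such as the W(RS_EXP_ABS) values of Table 4); the multiplicative form is an assumption, not the patent's formula:

```python
def adjust_for_expertise(intrinsic_score: float, expertise_weight: float) -> float:
    """Scale an intrinsic score by an expertise weight W.  A positive W
    promotes the document and/or source; a negative W (possible in
    Table 4) demotes it.  The multiplicative combination shown here is
    an illustrative assumption."""
    return intrinsic_score * (1.0 + expertise_weight)
```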
  • FIG. 6 is a flow chart of document collection for use with structured data according to an embodiment of the invention.
  • a creator supplies an object, such as a document or source, to the searching procedures 220 ( FIG. 2 ) at step 602 .
  • the creator may for example, supply any type of data file that contains text, such as an email, word processing document, text document, or the like.
  • a document comes from a source. Therefore, to supply a source, the creator may provide a link to a document, such as by providing a URL to a Web-page on the Internet, or supply a directory that contains multiple documents.
  • the creator also supplies his or her unique user identifier 224 ( FIG. 2 ), and any other data, such as privacy settings, or the like.
  • the unique user identifier may be supplied without the creator's knowledge, such as by the creator device 106 ( FIG. 1 ) automatically supplying its IP or MAC address.
  • the document, source, and/or other data is then sent to the information retrieval system 102 ( FIG. 1 ) by the communication procedures 218 ( FIG. 2 ).
  • the information retrieval system 102 ( FIG. 1 ) receives the document, source, and/or other data at step 603 .
  • the document is examined for structured data. If the document is a structured document (step 607), then for each field in the document, the keyword extractor or parser 318 (FIG. 3) parses the field into ASCII text at step 604, and thereafter extracts the important keywords at step 608. However, when supplied with a source, the keyword extractor or parser 318 (FIG. 3) first obtains the document/s from the source before parsing the text and extracting the important keywords. Also at step 604, each field is assigned a Field ID, a Field Type (e.g., restricted, unrestricted, etc.), and a Field Modifier value, which are stored in the repository 104 with the field's associated text. The Field Type and Field Modifier value are used to adjust field intrinsic scores, as described more fully with respect to FIGS. 8 and 9.
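  • The per-field data recorded at step 604 can be sketched as a simple record; the example values below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ParsedField:
    """A field extracted from a structured document at step 604: its
    Field ID, its Field Type (restricted or unrestricted), the Field
    Modifier value used later to adjust field intrinsic scores, and the
    ASCII text parsed from the field."""
    field_id: str
    field_type: str        # "restricted" or "unrestricted"
    field_modifier: float
    text: str

# Hypothetical record for an SFA-style account-name field:
name_field = ParsedField("SFAAccountName", "restricted", 95.0, "Acme Corp")
```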
  • Keywords, document, source, and other data are then stored at step 606 in the repository 104 as part of a file collection 328(1)-(N) (FIG. 3).
  • the unique user identifier is used to associate or link each file collection 328(1)-(N) (FIG. 3) created with a particular creator. This link between the creator and the file collection is stored in the creator's user profile 336(1)-(N) (FIG. 3).
  • the user profile data can be updated by the user him/herself or more preferably by a system administrator.
  • the concept identifier 320 ( FIG. 3 ) then identifies the important concept/s from the extracted keywords at step 610 .
  • the metadata filter 322 ( FIG. 3 ) then refines the concept at step 612 .
  • the refined concept is then stored 614 in the repository 104 as part of the metadata 330 ( 1 )-(N) ( FIG. 3 ) within a file collection 328 ( 1 )-(N) ( FIG. 3 ).
  • the next field in the document is processed by steps 604 - 614 . If there are no more fields in the document (step 605 ), then the process stops (i.e., the last field has been processed).
  • contributors can supply their contributions, at step 616 , such as additional comments, threads, or other activity to be associated with the file collection 328 ( 1 )-(N). These contributions are received by the information retrieval engine at step 618 and stored in the repository at step 620 , as contributions 334 ( 1 )-(N). Alternatively, contributions may be received and treated in the same manner as a document/source, i.e., steps 603 - 614 .
  • FIG. 7 is a flow chart of document collection for use with unstructured data according to an embodiment of the invention. If the document is an unstructured document (step 607), then the keyword extractor or parser 318 (FIG. 3) parses the document and/or source into ASCII text at step 704, and thereafter extracts the important keywords at step 708. However, when supplied with a source, the keyword extractor or parser 318 (FIG. 3) first obtains the document/s from the source before parsing the text and extracting the important keywords.
  • Keywords, document, source, and other data are then stored at step 706 in the repository 104 as part of a file collection 328(1)-(N) (FIG. 3).
  • the unique user identifier is used to associate or link each file collection 328(1)-(N) (FIG. 3) created with a particular creator. This link between the creator and the file collection is stored in the creator's user profile 336(1)-(N) (FIG. 3).
  • the user profile data can be updated by the user him/herself or more preferably by a system administrator.
  • the concept identifier 320 ( FIG. 3 ) then identifies the important concept/s from the extracted keywords at step 710 .
  • the metadata filter 322 ( FIG. 3 ) then refines the concept at step 712 .
  • the refined concept is then stored 714 in the repository 104 as part of the metadata 330 ( 1 )-(N) ( FIG. 3 ) within a file collection 328 ( 1 )-(N) ( FIG. 3 ).
  • contributors can supply their contributions, at step 716 , such as additional comments, threads, or other activity to be associated with the file collection 328 ( 1 )-(N). These contributions are received by the information retrieval engine at step 718 and stored in the repository at step 720 , as contributions 334 ( 1 )-(N). Alternatively, contributions may be received and treated in the same manner as a document/source, i.e., steps 703 - 714 .
  • FIG. 8 is a flow chart of a process for information retrieval for structured and unstructured data according to an embodiment of the invention. While the process described below includes a number of steps that appear to occur in a specific order, it should be apparent that the process steps are not limited to any particular order, and, moreover, the process can include more or fewer steps, which can be executed serially or in parallel (e.g., using parallel processors or a multi-threading environment).
  • a searcher using a searcher device 108 submits a search request to the information retrieval system 102 ( FIG. 1 ), at step 802 . Submittal of this search occurs using searching procedures 220 ( FIG. 2 ) and communication procedures 218 ( FIG. 2 ) on the searcher device 108 ( FIG. 1 ).
  • the search request preferably contains one or more search terms, and the unique user identifier 224 ( FIG. 2 ) of the searcher.
  • the search is preferably conducted to locate structured and unstructured objects.
  • Objects preferably include: content objects, such as documents, comments, or folders; source objects; people objects, such as experts, peers, or workgroups; or the like.
  • a search for documents returns a list of relevant documents
  • a search for experts returns a list of experts with expertise in the relevant field.
  • a search for sources returns a list of sources from where relevant documents were obtained. For example, multiple relevant documents may be stored within a particular directory or website.
  • the search is received at step 804 by the information retrieval system 102 ( FIG. 1 ) using communications procedures 314 ( FIG. 3 ).
  • the information retrieval system 102 ( FIG. 1 ) searches the repository 104 for relevant objects at step 806 .
  • This search is undertaken by the search engine 324 ( FIG. 3 ), at step 806 , using any known or yet to be discovered search techniques.
  • the search undertakes a semantic analysis of each file collection 328 ( 1 )-(N) stored in the repository 104 .
  • the search engine 324 locates relevant objects 328 ( 1 )-(N) at step 808 .
  • by a located object is meant any part of a file collection that is found to be relevant, including the content, source, metadata, etc.
  • the calculation of intrinsic scores is based on known, or yet to be discovered techniques for calculating relevancy of located objects based solely on the located objects themselves, the repository 104 itself and the search terms. In its simplest form, such a search calculates the intrinsic score based on the number of times that a search term appears in the content 332 ( 1 )-(N) ( FIG. 3 ) of located objects.
  • this calculation is also based on a semantic analysis of the relationship between words in the content 332 ( 1 )-(N) ( FIG. 3 ).
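  • The simplest form of the intrinsic score described above is a term count; a sketch with no semantic analysis:

```python
def intrinsic_score(content: str, search_terms: list[str]) -> int:
    """Simplest intrinsic score: the number of times the search terms
    appear in the content of a located object (whole-word matches on a
    naive whitespace tokenization)."""
    words = content.lower().split()
    return sum(words.count(term.lower()) for term in search_terms)
```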
  • the intrinsic score and field intrinsic scores differ in that the former is computed for the entire document and the latter is computed for each field in a structured document.
  • the search engine 324 determines if the located objects are structured or unstructured (step 810 ). If a located object is unstructured, then the search engine 324 calculates an intrinsic score at step 812 for the unstructured object. If more located objects are available (step 813 ), then step 812 is repeated until all the located objects have been processed. If all the located objects have been processed, the search engine 324 adjusts the intrinsic scores based on expertise (step 814 ), sorts the structured and unstructured objects (step 816 ) based on the adjusted scores, transfers the list to the searcher (step 818 ), where it is received (step 820 ) and displayed to the searcher (step 822 ). Other than step 810 , all of the foregoing steps were previously described with respect to FIG. 5 .
  • the search engine 324 determines the fields of the object that match the search query (step 824) and calculates field intrinsic scores for matching fields (step 826).
  • the Field ID stored in the repository 104 during collection ( FIG. 6 ) can be used to determine which fields match the search query.
  • the scores are adjusted to harmonize content relevancy (step 828). If more located objects are available (step 813), then steps 824-828 are repeated until all the located objects have been processed. If all the located objects have been processed, the search engine 324 adjusts the adjusted field intrinsic scores to account for expertise (step 814). The adjustment for expertise has been previously described; the adjustment to harmonize content relevancy is described more fully below.
  • the structured and unstructured objects are sorted (step 816 ) by their adjusted scores and transferred to the searcher (step 818 ), where they are received (step 820 ) and displayed (step 822 ) to the searcher.
  • field intrinsic scores are adjusted differently based on the logical operator or operators used in the search query.
  • logical operators include but are not limited to: AND, OR and TMTB (“The More The Better”).
  • TMTB is an accumulate type operator that looks for objects that match as many keywords associated with a search query as possible.
  • while OR and AND operators are the most common logical operators used in search engine queries, it should be apparent that the formulae described below can be adapted to other types of operators, including proximity operators (e.g., ADJACENT, WITH, NEAR, FOLLOWED BY, etc.).
  • SFA Sales Force Automation
  • CRM customer relationship management
  • the structured objects generated by the SFA system include a mix of restricted and unrestricted fields.
  • a restricted field is a field that is constrained by one or more parameters, such as a controlled vocabulary or a controlled size.
  • An unrestricted field can include any value the owner wants to provide, such as free form text.
  • an SFA document could include records having the following four fields: SFAAccountName (the name of the account); SFAAccountDescription (a brief description of the account); SFAAccountIndustry (the industry to which the account belongs); and SFAAccountAttachment (documents that might be attached to the record).
  • SFAAccountName and SFAAccountIndustry are examples of restricted fields and SFAAccountDescription and SFAAccountAttachment are examples of unrestricted fields.
  • each field is assigned a field modifier value, which is adjustable and expandable.
  • the field modifier value can be adjusted according to a profile of the user and/or a profile of the data source.
  • Table 13 summarizes the SFA fields described above with some examples of corresponding field modifier values. Note that in this example the restricted fields are weighed more heavily based on the assumption that a keyword match to a restricted field may be more relevant. It should be apparent, however, that the field modifiers can be selected and adjusted, as necessary, depending on the search engine design. TABLE 13 Summary of SFA Fields & Modifiers Field Field Type Field Modifier SFAAccountName Restricted 95 SFAAccountDescription Unrestricted 0.6 SFAAccountIndustry Restricted 100 SFAAccountAttachment Unrestricted 0.4
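  • The modifiers of Table 13 can be applied as a straightforward lookup and multiplication (field names and modifier values taken from the table; the multiplicative application is an assumption consistent with the formulae below):

```python
# Field modifier values from Table 13 (restricted fields weighed more heavily).
FIELD_MODIFIERS = {
    "SFAAccountName": 95.0,        # restricted
    "SFAAccountDescription": 0.6,  # unrestricted
    "SFAAccountIndustry": 100.0,   # restricted
    "SFAAccountAttachment": 0.4,   # unrestricted
}

def modified_field_score(field: str, field_intrinsic_score: float) -> float:
    """Weigh a raw field intrinsic score by its field modifier, so that
    keyword matches in restricted fields count more heavily."""
    return field_intrinsic_score * FIELD_MODIFIERS[field]
```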
  • AIR_Structured = f_operator(Field, Modifier parameters),   (22)
    where AIR_Structured is the adjusted intrinsic relevancy score for a structured object and the function f_operator depends on the operator. Specific formulae for the most common logical operators are described in turn below.
  • the OR logical operator looks for the presence of keywords associated with the query of the user. If a match is found in a document on any one keyword, the document is retrieved.
  • the Field Intrinsic score is, for example in the case of an unrestricted field, the raw score as provided by a semantic engine or a full-text engine for that given keyword.
  • formula (23) uses the maximal subscores to define the intrinsic value of the document.
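  • A sketch of the OR adjustment, assuming (as the description of formula (23) suggests) that the maximal modified subscore defines the adjusted intrinsic value of the document; the data layout is a hypothetical one:

```python
def adjusted_score_or(per_keyword_field_scores: dict[str, dict[str, float]],
                      modifiers: dict[str, float]) -> float:
    """For each matched keyword, take the best field intrinsic score
    after applying the field modifiers; the maximum over all keywords
    then serves as the adjusted intrinsic relevancy of the object."""
    best = 0.0
    for field_scores in per_keyword_field_scores.values():
        keyword_best = max(
            (score * modifiers[field] for field, score in field_scores.items()),
            default=0.0,
        )
        best = max(best, keyword_best)
    return best
```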
  • the growth of the score within each keyword is preferably convex as scores start accumulating and concave as they continue to accumulate.
  • a difference between two high scores (e.g., the scores 1000 and 1010) matters less than the same difference between two low scores (e.g., the scores 10 and 20).
  • in this way, the score function behaves like a utility function with respect to relevancy.
  • similarly, a difference between a first pair of low scores (e.g., 0 and 2) matters less than the difference between a second pair of low scores (e.g., 10 and 12).
  • is again a normalizing constant.
  • one or more parameters can be added, as necessary, to tune the behavior of the inverse logit function.
  • the TMTB operator is an accumulation operator that tends to provide the highest scores for those objects that are relevant to more terms. It also, however, allows objects that do not match all concepts to still receive a high score if the matches they do have are good.
  • AIR_Structured^TMTB = (1/λ) · round(λ · (1 − Exp(−FQC/NQC))) · MAX over Keywords [β · logit⁻¹(Σ over Fields (Field Intrinsic score · Field modifier))],   (30)
    where NQC is the number of concepts present in the query, FQC is the number of concepts that were matched to the object, and λ and β are normalizing constants.
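  • A sketch of the TMTB adjustment of formula (30); the names lam and beta stand in for normalizing constants whose exact values the text leaves unspecified, and the data layout is hypothetical:

```python
import math

def inv_logit(x: float) -> float:
    """Inverse logit (logistic) function used to shape accumulated scores."""
    return 1.0 / (1.0 + math.exp(-x))

def adjusted_score_tmtb(per_keyword_field_scores: dict[str, dict[str, float]],
                        modifiers: dict[str, float], nqc: int,
                        lam: float = 10.0, beta: float = 1.0) -> float:
    """TMTB adjustment: a rounded coverage factor rewards objects that
    match more of the query's concepts (FQC out of NQC), while the best
    inverse-logit-shaped sum of modified field scores lets objects with
    good partial matches still score highly."""
    fqc = len(per_keyword_field_scores)  # concepts matched to the object
    coverage = round(lam * (1.0 - math.exp(-fqc / nqc))) / lam
    best = max(
        beta * inv_logit(sum(score * modifiers[field]
                             for field, score in field_scores.items()))
        for field_scores in per_keyword_field_scores.values()
    )
    return coverage * best
```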
  • FIG. 9 is a flow chart of an embodiment of a process for determining a relevancy score for a located object. While the process described below includes a number of steps that appear to occur in a specific order, it should be apparent that the process steps are not limited to any particular order, and, moreover, the process can include more or fewer steps, which can be executed serially or in parallel (e.g., using parallel processors or a multi-threading environment).
  • the process begins by initializing a keyword counter x to one (step 902 ) or any other suitable number.
  • the fields in the located object containing keyword x matches are determined (step 904 ) and field intrinsic scores are computed (step 906 ).
  • search queries with N terms are matched (step 903 ) to M concepts (where M is less than or equal to N). If multiple terms in a search query are matched to the same concept, then only one term in the query will be used in the matching step 904 to reduce computation time. For example, if a searcher types “Car OR Car,” then only one unique concept (i.e., a type of vehicle) was specified by the searcher.
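  • The concept de-duplication of step 903 can be sketched as follows; concept_of is a hypothetical term-to-concept lookup:

```python
def unique_concepts(query_terms: list[str],
                    concept_of: dict[str, str]) -> list[str]:
    """Map the N query terms to M concepts (M <= N) and keep only one
    term per concept, so a query such as "Car OR Car" is matched once,
    reducing computation time in the matching step."""
    seen, kept = set(), []
    for term in query_terms:
        concept = concept_of.get(term.lower(), term.lower())
        if concept not in seen:
            seen.add(concept)
            kept.append(term)
    return kept
```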
  • the calculation of field intrinsic scores can be based on the number of times that a search term appears in a particular field of the located object, or on a semantic analysis of the relationship between keywords and content.
  • the appropriate field modifiers are applied to the field intrinsic scores (step 908 ).
  • the modifiers are selected based on whether the fields with the keyword matches are restricted or unrestricted. For example, in the SFA system previously described, the fields SFAAccountName and SFAAccountIndustry are restricted and the fields SFAAccountDescription and SFAAccountAttachment are unrestricted.
  • a first option is to determine a maximum field intrinsic score from the set of modified field intrinsic scores for keyword x (step 910 ).
  • a second option sums the modified field intrinsic scores (step 912 ).
  • a third option applies an inverse logit function to the sum of the modified field intrinsic scores (step 914 ). Note that in some embodiments steps 912 and 914 may also include a normalizing step.
  • the scores are adjusted relative to the types of operators used in the query (e.g., AND, OR, TMTB, NEAR, ADJACENT, WITH, FOLLOWED BY, etc.) (step 918). If a logical OR operator is used with the keywords, then a relevancy score for the located object is determined from the maximum of the field intrinsic scores for keywords 1, . . . , N (step 920). If a logical AND operator is used with the keywords, then a relevancy score for the located object is determined from the minimum of the field intrinsic scores for keywords 1, . . . , N. If another type of operator is used (e.g., TMTB, etc.), then a relevancy score for the located object is determined using the appropriate formula (step 924).
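  • The operator-dependent combination of per-keyword scores described above can be sketched as:

```python
def combine_keyword_scores(scores: list[float], operator: str) -> float:
    """Combine the per-keyword field intrinsic scores for a located
    object: the maximum for a logical OR, the minimum for a logical
    AND; other operators (e.g., TMTB) use their own formulas."""
    if operator == "OR":
        return max(scores)
    if operator == "AND":
        return min(scores)
    raise NotImplementedError(f"operator {operator} requires its own formula")
```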

Abstract

A search request is received from a requester including one or more search terms. In response to the search request, one or more objects are located that fulfill the search request. A relevancy score is computed for the one or more objects based on whether the object(s) include structured or unstructured data. The relevancy scores enable the requestor to determine the content relevancy of one or more objects.

Description

    CROSS-RELATED APPLICATION
  • This application is a continuation-in-part of U.S. patent application Ser. No. 10/172,165, filed Jun. 14, 2002, which application is incorporated by reference herein in its entirety.
  • TECHNICAL FIELD
  • The disclosed embodiments relate generally to information retrieval systems and methods, and more particularly to a system and method for harmonizing content relevancy across structured and unstructured data.
  • BACKGROUND
  • With the proliferation of corporate networks and the Internet, an ever increasing amount of information is being made available in electronic form. Such information includes documents, graphics, video, audio, or the like. While corporate information is typically well indexed and stored on corporate databases within a corporate network, information on the Internet is generally highly disorganized.
  • Searchers looking for information typically make use of an information retrieval system. In corporate networks, such an information retrieval system typically consists of document management software, such as Applicant's QUANTUM™ suite, or iManage Inc's INFORITE™ or WORKSITE™ products. Information retrieval from the Internet, however, is typically undertaken using a search engine, such as YAHOO™ or GOOGLE™.
  • Generally speaking, these information retrieval systems extract keywords from each document in a network. Such keywords typically contain no semantic or syntactic information. For each document, each keyword is then indexed into a searchable data structure with a link back to the document itself. To search the network, a user supplies the information retrieval system with a query containing one or more search terms, which may be separated by Boolean operators, such as “AND” or “OR.” These search terms can be further expanded through the use of a Thesaurus. In response to the query, which might have been expanded, the information retrieval system attempts to locate information, such as documents, that match the searcher supplied (or expanded) keywords. In doing so, the information retrieval system searches through its databases to locate documents that contain at least one keyword that matches one of the search terms in the query (or its expanded version). The information retrieval system then presents the searcher with a list of document records for the documents located. The list is typically sorted based on document ranking, where each document is ranked according to the number of keyword to search term matches in that document relative to those for the other located documents. An example of a search engine that uses such a technique, where document relevancy is based solely on the content of the document, is INTELISEEK™. However, most documents retrieved in response to such a query have been found to be irrelevant.
  • In an attempt to improve precision, a number of advanced information retrieval techniques have been developed. These techniques include syntactic processing, natural language processing, semantic processing, or the like. Details of such techniques can be found in U.S. Pat. Nos. 5,933,822; 6,182,068; 6,311,194; and 6,199,067, all of which are incorporated herein by reference.
  • However, even these advanced information retrieval techniques have not been able to reach the level of precision required by today's corporations. In fact, a recent survey found that forty-four percent of users say that they are frustrated with search engine results. See Internet Usage High, Satisfaction low: Web Navigation Frustrate Many Consumers, Berrier Associates—sponsored by Realnames Corporation (April 2000).
  • In addition, other advanced techniques have also proven to lack adequate precision. For example, GOOGLE™ and WISENUT™ rank document relevancy as a function of a network of links pointing to the document, while methods based on Salton's work (such as ORACLE™ text) rank document relevancy as a function of the number of relevant documents within the repository.
  • This lack of precision is at least partially caused by current information retrieval systems not taking the personal profiles of the document creator, searcher, and any contributors into account. In other words, when trying to assess the relevancy of documents within a network, most information retrieval systems ignore the searcher that performs the query, i.e., most information retrieval systems adopt a one-size-fits-all approach. For example, when a neurologist and a high school student both perform a search for "brain AND scan," an identical list of located documents is presented to both the neurologist and the high school student. However, the neurologist is interested in high level documents containing detailed descriptions of brain scanning techniques, while the student is only interested in basic information on brain scans for a school project. As can be seen, a document query that does not take the searcher into account can retrieve irrelevant and imprecise results.
  • Moreover, not only should the profession of a searcher affect a search result, but also the expertise of the searcher within the search domain. For example, a medical doctor who is a recognized world expert would certainly assign different relevancy scores to the returned documents than, say, an intern would. This means that information retrieval systems should be highly dynamic and consider the current expertise level of the searcher and/or creator/s at the time of the query.
  • In addition, the current lack of precision is at least partially caused by the treatment of documents as static entities. Current information retrieval techniques typically do not take into account the dynamic nature of documents. For example, after creation, documents may be commented on, printed, viewed, copied, etc. To this end, document relevancy should consider the activity around a document.
  • Another problem encountered with conventional information retrieval techniques is the handling of structured and unstructured data. An example of unstructured data is free form text. An example of structured data is data organized into one or more fields having one or more restrictions. For example, an unrestricted field can include whatever values the owner wants to provide. A restricted field, however, can include content which is constrained to a controlled vocabulary, size, or other parameter. In conventional information retrieval systems, documents or other data objects are searched for keywords without considering whether the documents are structured or unstructured. Since structured documents may be more relevant to a searcher than unstructured documents, the nature of the document structure should be considered when determining its relevancy to a search request.
  • Therefore, a need exists in the art for a system and method for retrieving information that can yield a significant improvement in precision over that attainable through conventional information retrieval systems. Moreover, such a system and method should personalize information retrieval based on user expertise and whether the information is structured or unstructured.
  • SUMMARY
  • A search request is received from a requester including one or more search terms. In response to the search request, one or more objects are located that fulfill the search request. A relevancy score is computed for each object based on whether the object includes structured or unstructured data. The relevancy scores enable the requestor to determine the content relevancy of the located objects.
  • In some embodiments, a method for retrieving information includes receiving a search request from a requester including one or more search terms; searching a plurality of objects based on at least one search term; identifying at least one object associated with at least one search term; and determining a relevancy score for the object based on whether the object includes structured or unstructured data. The object can include structured data and the relevancy score can be determined at least in part on whether the structured data includes restricted or unrestricted fields. The restricted fields can be associated with a first modifier value and the unrestricted fields can be associated with a second modifier value. The first and second modifier values can be the same values or different values. The modifier values can be adjustable and expandable.
  • In some embodiments, an information retrieval system includes a processor and a memory coupled to the processor. The memory includes instructions, which, when executed by the processor, causes the processor to perform the operations of receiving a search request from a requester including one or more search terms; searching a plurality of objects based on at least one search term; identifying at least one object associated with at least one search term; and determining a relevancy score for the object based on whether the object includes structured or unstructured data.
  • In some embodiments, a computer-readable medium includes instructions, which, when executed by a processor, causes the processor to perform the operations of receiving a search request from a requester including one or more search terms; searching a plurality of objects based on at least one search term; identifying at least one object associated with at least one search term; and determining a relevancy score for the object based on whether the object includes structured or unstructured data.
  • In some embodiments, the relevancy score is determined based on the inclusion of structured or unstructured data and on one or more other factors, such as, for example, the expertise of users/searchers, creators and/or requesters.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram of a system architecture for a system for personalizing information retrieval.
  • FIG. 2 is a block diagram of a creator device, contributor device, or searcher device, as shown in FIG. 1.
  • FIG. 3 is a block diagram of the information retrieval system and Repository of FIG. 1.
  • FIG. 4 is a flow chart of document collection according to an embodiment of the invention.
  • FIG. 5 is a flow chart of a process for information retrieval according to an embodiment of the invention.
  • FIG. 6 is a flow chart of document collection for use with structured data according to an embodiment of the invention.
  • FIG. 7 is a flow chart of document collection for use with unstructured data according to an embodiment of the invention.
  • FIG. 8 is a flow chart of a process for information retrieval for structured and unstructured data according to an embodiment of the invention.
  • FIG. 9 is a flow chart of a process for adjusting field intrinsic scores according to an embodiment of the invention.
  • DETAILED DESCRIPTION OF EMBODIMENTS
  • FIG. 1 is a block diagram of a system architecture 100 for a system for personalizing information retrieval. An information retrieval system 102 is coupled to a repository 104 and to a network 110. Also coupled to the network 110 are a searcher device 108, one or more creator device/s 106, and one or more contributor device/s 112. Searcher device 108, creator device/s 106, contributor device/s 112, and information retrieval system 102 are all computing devices, such as clients, servers, or the like. The network is preferably a Local Area Network (LAN), but alternatively may be any network, such as the Internet. It should be appreciated that although searcher device 108, creator device/s 106, contributor device/s 112, and information retrieval system 102 are shown as distinct entities, they may be combined into one or more devices. Further details of the searcher device 108, creator device/s 106, contributor device/s 112, and information retrieval system 102 can be found below in relation to FIGS. 2-5.
  • The repository 104 is any storage device/s that is capable of storing data, such as a hard disk drive, magnetic media drive, or the like. The repository 104 is preferably contained within the information retrieval system 102, but is shown as a separate component for ease of explanation. Alternatively, the repository 104 may be dispersed throughout a network, and may even be located within the searcher device 108, creator device/s 106, and/or contributor device/s 112.
  • Each creator device 106 is a computing device operated by a creator who creates one or more documents. Each contributor device 112 is a computing device operated by a contributor who contributes to a document by, for example, adding to, commenting on, viewing, or otherwise accessing documents created by a creator/s. The searcher device 108 is a computing device operated by a searcher who is conducting a search for relevant documents created by the creator/s or contributed to by the contributor/s. The searcher, creator/s, and contributor/s are not limited to the above described roles and may take on any role at different times. Also, the searcher, creator/s, and contributor/s may browse the repository 104 without the use of the information retrieval system 102.
  • FIG. 2 is a block diagram of a creator device 106, contributor device 112, or searcher device 108, as shown in FIG. 1. The devices 106/108/112 preferably include the following components: at least one data processor or central processing unit (CPU) 202; a memory 214; input and/or output devices 206, such as a monitor and keyboard; communications circuitry 204 for communicating with the network 110 (FIG. 1) and information retrieval system 102 (FIG. 1); and at least one bus 210 that interconnects these components.
  • Memory 214 preferably includes an operating system 216, such as but not limited to, VXWORKS™, LINUX™, or WINDOWS™ having instructions for processing, accessing, storing, or searching data, etc. Memory 214 also preferably includes communication procedures for communicating with the network 110 (FIG. 1) and information retrieval system 102 (FIG. 1); searching procedures 220, such as proprietary search software, a Web-browser, or the like; application programs 222, such as a word processor, email client, database, or the like; a unique user identifier 224; and a cache 226 for temporarily storing data. The unique user identifier 224 may be supplied by the creator/searcher/contributor each time he or she performs a search, such as by supplying a username. Alternatively, the unique user identifier 224 may be the user's login username, Media Access Control (MAC) address, Internet Protocol (IP) address, or the like.
  • FIG. 3 is a block diagram of the information retrieval system 102 and Repository 104 of FIG. 1. As mentioned in relation to FIG. 1, the repository 104 is preferably contained within the information retrieval system 102. The information retrieval system 102 preferably includes the following components: at least one data processor or central processing unit (CPU) 302; a memory 308; input and/or output devices 306, such as a monitor and keyboard; communications circuitry 304 for communicating with the network 110 (FIG. 1), creator device/s 106 (FIG. 1), contributor device/s 112 (FIG. 1), and/or searcher device 108 (FIG. 1); and at least one bus 310 that interconnects these components.
  • Memory 308 preferably includes an operating system 312, such as but not limited to, VXWORKS™, LINUX™, or WINDOWS™ having instructions for processing, accessing, storing, or searching data, etc. Memory 308 also preferably includes communication procedures 314 for communicating with the network 110 (FIG. 1), creator device/s 106 (FIG. 1), contributor device/s 112 (FIG. 1), and/or searcher device 108 (FIG. 1); a collection engine 316 for receiving and storing documents; a search engine 324; intrinsic relevancy adjustment procedures 325; expertise adjustment procedures 326; a repository 104, as shown in FIG. 1; and a cache 338 for temporarily storing data.
  • The collection engine 316 comprises a keyword extractor or parser 318 that extracts text and/or keywords from any suitable document, such as an ASCII or XML file, Portable Document Format (PDF) file, word processing file, or the like. The collection engine 316 also preferably comprises a concept identifier 320. The concept identifier 320 is used to extract the document's important concepts. The concept identifier may be a semantic, synaptic, or linguistic engine, or the like. In a preferred embodiment the concept identifier 320 is a semantic engine, such as TEXTANALYST™ made by MEGAPUTER INTELLIGENCE™ Inc. Furthermore, the collection engine 316 also preferably comprises a metadata filter 322 for filtering and/or refining the concept/s identified by the concept identifier 320. Once the metadata filter 322 has filtered and/or refined the concept, metadata about each document is stored in the repository 104. Further details of the processes performed by the collection engine 316 are discussed in relation to FIG. 4. In addition to refined concepts, metadata includes any data, other than raw content, associated with a document.
  • The search engine 324 is any standard search engine, such as a keyword search engine, statistical search engine, semantic search engine, linguistic search engine, natural language search engine, or the like. In a preferred embodiment, the search engine 324 is a semantic search engine.
  • The expertise adjustment procedures 326 are used to adjust an object's intrinsic score to an adjusted score based on the expertise of the searcher, creator/s, and/or contributor/s. The content relevancy harmonizer 325 is used to adjust the object's intrinsic score to an adjusted score based on whether the object includes structured and/or unstructured data. The expertise adjustment procedures 326 and the content relevancy harmonizer 325 are described in further detail below in relation to FIGS. 5-9.
  • A file collection 328(1)-(N) is created in the repository 104 for each object input into the system, such as a document or source. Each file collection 328(1)-(N) preferably contains: metadata 330(1)-(N), such as associations between keywords, concepts, or the like; content 332(1)-(N), which is preferably ASCII or XML text or the content's original format; and contributions 334(1)-(N), such as contributor comments or the like. At a minimum, each file collection contains content 332(1)-(N). The repository 104 also contains user profiles 336(1)-(N) for each user, i.e., each searcher, creator, or contributor. Each user profile 336(1)-(N) includes associated user activity, such as which files a user has created, commented on, opened, printed, viewed, or the like, and links to various file collections 328(1)-(N) that the user has created or contributed to. Further details of use of the repository 104 are discussed in relation to FIG. 5.
  • FIG. 4 is a flow chart of document collection according to an embodiment of the invention. A creator supplies an object, such as a document or source, to the searching procedures 220 (FIG. 2) at step 402. To supply a document, the creator may for example, supply any type of data file that contains text, such as an email, word processing document, text document, or the like. A document comes from a source of the document. Therefore, to supply a source, the creator may provide a link to a document, such as by providing a URL to a Web-page on the Internet, or supply a directory that contains multiple documents. In a preferred embodiment, the creator also supplies his or her unique user identifier 224 (FIG. 2), and any other data, such as privacy settings, or the like. The unique user identifier may be supplied without the creator's knowledge, such as by the creator device 106 (FIG. 1) automatically supplying its IP or MAC address.
  • The document, source, and/or other data is then sent to the information retrieval system 102 (FIG. 1) by the communication procedures 218 (FIG. 2). The information retrieval system 102 (FIG. 1) receives the document, source, and/or other data at step 403. When supplied with a document, the keyword extractor or parser 318 (FIG. 3) parses the document and/or source into ASCII text at step 404, and thereafter extracts the important keywords at step 408. However, when supplied with a source, the keyword extractor or parser 318 (FIG. 3) first obtains the document/s from the source before parsing them into text and extracting the important keywords.
  • Extraction of important keywords is undertaken using any suitable technique. These keywords, document, source, and other data are then stored at step 406 in the repository 104 as part of a file collection 328(1)-(N) (FIG. 3). Also, the unique user identifier is used to associate or link each file collection 328(1)-(N) (FIG. 3) created with a particular creator. This link between the creator and the file collection is stored in the creator's user profile 336(1)-(N) (FIG. 3). The user profile data can be updated by the user him/herself or, more preferably, by a system administrator.
  • In a preferred embodiment, the concept identifier 320 (FIG. 3) then identifies the important concept/s from the extracted keywords at step 410. Again, in a preferred embodiment, the metadata filter 322 (FIG. 3) then refines the concept at step 412. The refined concept is then stored in the repository 104 as part of the metadata 330(1)-(N) (FIG. 3) within a file collection 328(1)-(N) (FIG. 3).
  • At any time, contributors can supply their contributions, at step 416, such as additional comments, threads, or other activity to be associated with the file collection 328(1)-(N). These contributions are received by the information retrieval engine at step 418 and stored in the repository at step 420, as contributions 334(1)-(N). Alternatively, contributions may be received and treated in the same manner as a document/source, i.e., steps 403-414.
  • FIG. 5 is a flow chart of a process for information retrieval according to an embodiment of the invention. A searcher using a searcher device 108 (FIG. 1) submits a search request to the information retrieval system 102 (FIG. 1), at step 502. Submittal of this search occurs using searching procedures 220 (FIG. 2) and communication procedures 218 (FIG. 2) on the searcher device 108 (FIG. 1). The search request preferably contains one or more search terms, and the unique user identifier 224 (FIG. 2) of the searcher.
  • The search is preferably conducted to locate objects. Objects preferably include: content objects, such as documents, comments, or folders; source objects; people objects, such as experts, peers, or workgroups; or the like. A search for documents returns a list of relevant documents, and a search for experts returns a list of experts with expertise in the relevant field. A search for sources returns a list of sources from where relevant documents were obtained. For example, multiple relevant documents may be stored within a particular directory or website.
  • The search is received at step 504 by the information retrieval system 102 (FIG. 1) using communications procedures 314 (FIG. 3). The information retrieval system 102 (FIG. 1) then searches the repository 104 for relevant objects at step 506. This search is undertaken by the search engine 324 (FIG. 3), at step 506, using any known or yet to be discovered search techniques. In a preferred embodiment, the search undertakes a semantic analysis of each file collection 328(1)-(N) stored in the repository 104.
  • The search engine 324 (FIG. 3) then locates relevant objects 328(1)-(N) at step 508 and calculates an intrinsic score at step 510 for each located object. By “located object,” it is meant any part of a file collection that is found to be relevant, including the content, source, metadata, etc. Calculation of the intrinsic score is based on known, or yet to be discovered techniques for calculating relevancy of located objects based solely on the located objects themselves, the repository itself and the search terms. In its simplest form, such a search calculates the intrinsic score based on the number of times that a search term appears in the content 332(1)-(N) (FIG. 3) of located objects. However, in a preferred embodiment, this calculation is also based on a semantic analysis of the relationship between words in the content 332(1)-(N) (FIG. 3).
  • The intrinsic score is then adjusted to an adjusted score by the expertise adjustment procedures 326, at step 512. This adjustment takes the expertise of the creator/s, searcher, and/or contributor/s into account, as described in further detail below.
  • Once the intrinsic score has been adjusted to an adjusted score, a list of the located objects is sorted at step 514. The list may be sorted by any field, such as by adjusted score, intrinsic score, source, most recently viewed, creator expertise, etc. The list, preferably containing a brief record for each located object, is then transmitted to the searcher device 108 (FIG. 1) at step 516. Each record preferably contains the located object's adjusted score, creator, title, etc. The list is then received by the searcher device at step 518 and displayed to the searcher at step 520. In an alternative embodiment, sorting of the list is performed by the searching procedures 220 (FIG. 2) on the searcher device 108 (FIG. 1).
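As a minimal sketch of the sorting in step 514, assuming each brief record is a small mapping (the field names here are illustrative, not from the patent):

```python
# Sketch of step 514: sort the list of located-object records by adjusted
# score, highest first. Record fields are illustrative, not from the patent.
records = [
    {"title": "doc A", "creator": "Adam M.", "adjusted_score": 72},
    {"title": "doc B", "creator": "Gail E.", "adjusted_score": 92},
    {"title": "doc C", "creator": "Ivan L.", "adjusted_score": 65},
]
records.sort(key=lambda r: r["adjusted_score"], reverse=True)
```

The same key function can be swapped for any other field (intrinsic score, source, creator expertise, etc.) to support the alternative sort orders mentioned above.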
  • Preferred algorithms for adjusting the intrinsic score (step 512 of FIG. 5) will now be described. It should be appreciated that these algorithms are merely exemplary and in no way limit the invention other than as claimed. Calculation of the adjusted score from the intrinsic score is dependent on the objects searched for, such as documents, comments, sources, experts, or peers.
  • Expertise Adjustment when Searching for Documents
  • Search term(s) entered by the searcher may or may not be extended to form a query. Such possible extensions include, but are not limited to, synonyms or stemming of search term(s). Once the intrinsic score has been calculated according to step 510 above, the adjusted score (RS_ADJ) for each located document is calculated as follows:
    RS ADJ=Intrinsic Document Score+Expertise Adjustment=IDS+EA  (1)
    where the Intrinsic Document Score (IDS) is a weighted average between a Document Content Score (DCS) and a Comments Content Score (CCS).
    IDS=a*DCS+(1−a)*CCS  (2)
    with “a” being a number between 0 and 1 and determining the importance of the content of a document relative to the content of its attached comments.
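As a sketch, formula (2) is a one-liner; the default weight a = 0.7 below is an arbitrary illustration, not a value from the patent:

```python
def intrinsic_document_score(dcs: float, ccs: float, a: float = 0.7) -> float:
    """Formula (2): IDS = a*DCS + (1 - a)*CCS, with 0 <= a <= 1.
    The default a = 0.7 is an illustrative choice, not from the patent."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie between 0 and 1")
    return a * dcs + (1.0 - a) * ccs
```

Setting a = 1 ignores the comments entirely, which is exactly the choice made in the worked example later in this text.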
  • The DCS and CCS are calculated by any methodology or technique. Existing search engine algorithms may be used to fulfill this task. Also note that the DCS and CCS are not influenced by the searcher that entered the query. In this embodiment, the DCS and CCS can be any number between 2 and 100. The Expertise Adjustment (EA) is calculated as follows:
    EA=DCE+CCE  (3)
    where DCE is the Document Creator Expertise adjustment and CCE is the Comments Contributors Expertise adjustment. The DCE adjustment takes into account all activity performed by a given user and is computed as follows:
    DCE=R1(DCS)*W1(RS EXP ABS)  (4)
    where R1(DCS) determines the maximal amount of the expertise adjustment, or, in other words, the range for the alteration due to the expertise of the creator of the document. This depends on the level of the DCS. The range function is given by:
    R1(DCS) = 20 * (1 - |DCS - 50| / 100)  (5)
  • Extreme intrinsic scores, i.e., scores near 2 or 100, are less influenced than scores near the middle, i.e., scores near 50. The maximum possible change in a score is 20 when DCS=50 and linearly decreases to 10 when DCS=100 or 2.
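A sketch of the range function of formula (5):

```python
def range_r1(dcs: float) -> float:
    """Formula (5): maximal expertise adjustment available at a given DCS.
    Largest (20) for mid-range scores, shrinking linearly toward extremes."""
    return 20.0 * (1.0 - abs(dcs - 50.0) / 100.0)
```

range_r1(50) returns 20 and range_r1(100) returns 10, matching the behavior described above.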
  • W1(RS_EXP_ABS) determines what percentage of the available range R1(DCS), positively or negatively, is considered for adjusting the intrinsic score. It is given by:
    W1(RS_EXP_ABS) = (RS_EXP_ABS(Creator) - RS_EXP_ABS(Searcher)) / 100  (6)
    where RS-EXP-ABS denotes the absolute relevance score of a user, that is, the user expertise, be it searcher expertise, creator expertise, or contributor expertise. The calculation of RS-EXP-ABS occurs as follows:
    RS-EXP-ABS=3*F(User contribution)*G(Company expertise)*H(Query specificity)  (7)
    where F (User contribution) accounts for the relevancy of all contributions made by the user, considering all documents created, all comments contributed, and the user's definition of his or her folders within the software. These folders (private or public) constitute the user's personal taxonomy. G (Company expertise) accounts for the company expertise about the query, i.e., whether a few or most employees in a company have produced something relevant to the query. H (Query specificity) accounts for the specificity of the query within the repository, i.e., whether many or just a few file collections were created.
  • In detail:
    F(User cont.) = Σ_{i: all relevant documents} [2 * (Wi,max + Ci) * ((DCS)i / 100)^2] + Σ_{i: all nonrelevant documents} Ci + 2 * Taxonomy  (8)
    where the first sum is over all relevant documents and the second sum is over all non-relevant documents that possessed a relevant comment, i.e., the comment was relevant but not the document. (DCS)i is the intrinsic document relevancy score attained for the i-th relevant document. Also, Wi,max is the user activity measure. Ci is calculated as follows:
    Ci = 0.1 * (1 - Exp(-(# relevant comments in Doc i by this user) / 2))  (9)
    and is the reward assigned to matching comments made on documents, relevant or not. A matched comment is not necessarily attached to a relevant document.
  • Wi,max accounts for the type of contribution (such as, but not limited to, creation, commenting, or highlighting). In short, Wi,max is the maximum of the following weights (if applicable):
    • Wi,edit = 1, if the user created or edited the i-th file collection.
    • Wi,comment = 0.5 * Max(0, 7 - Min_comments(Level)) / 6, if the user commented on the i-th file collection. Since these comments are organized in a threaded discussion, the weight also depends on how remote a comment is from the file collection itself. For example, a comment on a comment on a comment to the original file collection receives a lesser weight than a comment on the original file collection. In the formula, Level measures how remote the comment is from the file collection, and the minimum is taken over the user's comments on that collection; only the least remote comment is taken into consideration, as long as it is closer than six comments away from the parent file collection.
    • Wi,rename=0.8, if the user renamed i-th file collection.
    • Wi,highlight=0.8, if the user highlighted some subparts of i-th file collection.
    • Wi,link = 0.8, if the user linked the file collection to another file collection or "external" URL.
    Taxonomy = 1 if the query term is found within the user's taxonomy, and 0 otherwise.
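The selection of Wi,max can be sketched as follows; the activity labels and function name are illustrative, not from the patent:

```python
def max_activity_weight(activities, comment_level=None):
    """Wi,max (sketch): maximum weight over a user's activities on the i-th
    file collection. Activity labels here are illustrative."""
    weights = [0.0]
    if "edit" in activities:  # user created or edited the collection
        weights.append(1.0)
    if "comment" in activities and comment_level is not None:
        # Wi,comment: decays with the comment's remoteness (Level) from the
        # parent collection, reaching zero beyond six levels away.
        weights.append(0.5 * max(0, 7 - comment_level) / 6.0)
    for act in ("rename", "highlight", "link"):
        if act in activities:
            weights.append(0.8)
    return max(weights)
```

A direct comment (Level 1) earns the full comment weight of 0.5, while creating or editing dominates all other activities at weight 1.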
  • The taxonomy in this preferred embodiment stands for folder names. Each user has built some part of the repository by naming folders, directories, or sub-directories. For example, creator 1 might have grouped his Hubble telescope pictures in a folder called "Space Images." The term "Space Images" then becomes part of the user's taxonomy.
  • Within an organization or enterprise, some of the taxonomy (folder structure) has been defined by the organization or enterprise itself and has "no owners." In this case, each folder has an administrator who bestows rights to users, such as the right to access the folder, the right to edit any documents within it, the right to edit only documents that the specific user created, or the right to view but not edit or contribute any document to the folder. Only the names of the folders that a user creates are part of his or her taxonomy.
    G(Company expertise) = 1 + Log(P / E) = IEF,  (10)
    where Log is the logarithmic function base 10; P is the total number of users; and E is the number of relevant experts. The number of relevant experts is calculated by determining how many unique creators and contributors either created or contributed to the located documents. IEF stands for Inverse Expertise Frequency.
  • This adjustment raises the adjusted scores when there are few relevant experts within the company.
    H(Query specificity) = 1 + (1 / Log(NCO)) * Log(NCO / NCOR) = 2 - Log(NCOR) / Log(NCO) = IWCOF  (11)
    where Log is the logarithmic function base 10; NCO is the total number of content objects available in the database at the time of the query; and NCOR is the total number of relevant content objects for a given query. IWCOF stands for the Inverse Weighted Content Objects Frequency. Preferably, in this embodiment, NCO, NCOR and IWCOF are only calculated using non-confidential content objects.
  • IWCOF is similar to IEF as it adjusts the score by slightly raising the adjusted score when only a few relevant content objects are found in the database. Therefore, the absolute relevance score for a given user (or the user expertise) is:
    RS-EXP-ABS = 3 * (1 + Log(P / E)) * (2 - Log(NCOR) / Log(NCO)) * (Σ_{i: all relevant documents} [2 * (Wi,max + Ci) * ((DCS)i / 100)^2] + Σ_{i: all nonrelevant documents} Ci + 2 * Taxonomy) = 3 * IEF * IWCOF * F(User contribution)  (12)
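Assembled from formulas (10) through (12), the expertise computation can be sketched compactly (parameter names here are illustrative):

```python
import math

def user_expertise(f_user: float, num_users: int, num_experts: int,
                   nco: int, ncor: int) -> float:
    """Formula (12): RS-EXP-ABS = 3 * IEF * IWCOF * F(User contribution)."""
    ief = 1.0 + math.log10(num_users / num_experts)       # formula (10)
    iwcof = 2.0 - math.log10(ncor) / math.log10(nco)      # formula (11)
    return 3.0 * ief * iwcof * f_user
```

With the worked example's values below (P = 100, E = 10, NCO = 1000, NCOR = 10, and F(Adam M.) = 2.89), this yields about 28.9, which rounds to the 29 shown in Table 3.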
  • Using the above equations, the intrinsic score is increased to an adjusted score if the creator of the content objects is more knowledgeable about the searched subject matter than the person that entered the query, i.e., if the creator expertise is higher than the searcher expertise. On the other hand, the intrinsic score is decreased to an adjusted score if the creator is less knowledgeable about the searched subject matter than the searcher, i.e., if the creator expertise is lower than the searcher expertise.
  • To calculate the Comments Contributors Expertise Adjustment (CCE), the following equations are used:
    CCE = 5 * (2 * Exp(Δx) / (1 + Exp(Δx)) - 1)  (13)
    where
    Δx = (1/50) * Σ_{Distinct Contributors} (RS_EXP_ABS(Contributor) - RS_EXP_ABS(Searcher))  (14)
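The squashing in formula (13) is a scaled logistic, equal to 5*tanh(Δx/2), so the comments adjustment always stays within (-5, 5). A sketch, with parameter names chosen here for illustration:

```python
import math

def cce(contributor_expertises, searcher_expertise):
    """Formulas (13)-(14): Comments Contributors Expertise adjustment,
    bounded within (-5, 5) by a logistic squashing of delta-x."""
    dx = sum(e - searcher_expertise for e in contributor_expertises) / 50.0
    return 5.0 * (2.0 * math.exp(dx) / (1.0 + math.exp(dx)) - 1.0)
```

For the worked example's file collection 11 (distinct contributors with expertise 30 and 32, searcher expertise 0), Δx = 1.24 and the adjustment is about 2.76, matching Table 5.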
  • Once these adjustments have been computed, one has to ensure that the relevancy score from (1) is in the appropriate range and that, in this preferred embodiment, it is an integer. This is obtained as follows:
    RS_ADJ = Min(100, Max(1, Round(RS_ADJ)))  (15)
    where Round(d) rounds the number d to its nearest integer.
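A direct transcription of formula (15); math.floor(d + 0.5) stands in for Round so that exact halves round up, since Python's built-in round() sends halves to the nearest even integer:

```python
import math

def clamp_adjusted_score(rs_adj: float) -> int:
    """Formula (15): round to the nearest integer and clamp into [1, 100]."""
    return min(100, max(1, math.floor(rs_adj + 0.5)))
```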
    Expertise Adjustment when Searching for Sources
  • Once the intrinsic score has been calculated according to step 510 above, the adjusted score for sources (RSS_ADJ) for each source is calculated as follows:
    RSS_ADJ = intrinsic Source Content Score + expertise adjustment = SCS + R2(SCS) * W2(RS_EXP_ABS)  (16)
    where SCS is the intrinsic Source Content Score, which is, preferably in this embodiment, defined as the maximum of all the intrinsic Document Content Scores (DCS) of the file collections that were created from the source, i.e.,
    SCS = MAX(DCS)  (17)
  • For example, multiple documents may have been saved as multiple file collections from a single Web-site.
  • R2(SCS) determines the maximal amount of the expertise adjustment, or, in other words, the range for the alteration due to the expertise of the creator of the document taken from the source. This depends on the level of the intrinsic source score, i.e., SCS. The range function is given by:
    R2(SCS) = 20 * (1 - |SCS - 50| / 100)  (18)
  • Extreme scores are less influenced than scores in the middle. The maximum possible change in a score is 20 when SCS=50 and linearly decreases to 10 when SCS=100 or 2.
  • W2(RS_EXP_ABS) determines what percentage of the available range for the expertise adjustment, R2(SCS), positively or negatively, is considered for building the scoring. It is given by:
    W2(RS_EXP_ABS) = (MAX(RS_EXP_ABS(Creator)) - RS_EXP_ABS(Searcher)) / 100  (19)
    where RS_EXP_ABS is the absolute relevance score of the expert (as defined previously). MAX(RS_EXP_ABS(Creator)) is the maximum of absolute expertise scores over all creators that have created file collections from this source. RS_EXP_ABS(Searcher) is the absolute relevance score of the searcher. In other words, the intrinsic score for the source is adjusted upward to an adjusted score if the maximum creator expertise of all creators for a particular source exceeds the searcher expertise. On the other hand, the intrinsic score for the source is lowered to an adjusted score if the creator expertise of all creators for a particular source is lower than the searcher expertise.
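Formulas (16) through (19) combine into a short routine; this sketch assumes the creators' expertise values have already been computed:

```python
def adjusted_source_score(scs, creator_expertises, searcher_expertise):
    """Formulas (16)-(19): adjust the intrinsic Source Content Score by the
    best creator's expertise relative to the searcher's expertise."""
    r2 = 20.0 * (1.0 - abs(scs - 50.0) / 100.0)                  # formula (18)
    w2 = (max(creator_expertises) - searcher_expertise) / 100.0  # formula (19)
    return scs + r2 * w2                                         # formula (16)
```

For the worked example's cnn.com source (SCS = 85, creators with expertise 29 and 13) and a searcher expertise of 0, this yields 85 + 13 * 0.29 = 88.77 before rounding and clamping.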
  • Once this adjustment has been computed, one has to ensure that the relevancy score is in the appropriate range and that, in this preferred embodiment, it is an integer. This is obtained as follows:
    RSS_ADJ = Min(100, Max(1, Round(RSS_ADJ)))  (20)
    where Round(d) rounds the number d to its nearest integer.
  • In this way, the adjusted score for each document (RS_ADJ) or the adjusted score for sources (RSS_ADJ) is calculated based on the expertise of the searcher, creator/s, and/or contributor/s. Such adjusted scores provide a significant improvement in precision over that attainable through conventional information retrieval systems.
  • Expertise Adjustment when Searching for Peers
  • When users are looking for peers rather than experts, an adjusted relevancy score is calculated. Peers are other users that have a similar expertise or come from a similar, or the same, department as the searcher. The adjusted relevancy score uses the expertise values and adjusts them with respect to the searcher's expertise. This is similar to resorting the list with respect to the searcher, but it instead recalculates the values themselves.
  • Once the expertise for each user has been determined, it is adjusted with respect to the searcher expertise. The adjusted relative or personalized relevancy score for an expert is defined by:
    Adjusted Rel = 100 - 10 * Sqrt(|RS-EXP-ABS - RS-EXP-ABS(Searcher) + 10|)  (21)
  • The adjusted relevancy score is a measure of the difference between two levels of expertise. The square root maps the difference to a continuous and monotone measure while diminishing the importance of differences when two experts are far apart. It is also asymmetric in the sense that it favors expertise above the searcher expertise. Finally, recall that |K| represents the absolute value of K (i.e., the difference).
  • An example of a method for personalizing information retrieval using the above formulae will now be described. It should, however, be appreciated that this example is described herein merely for ease of explanation, and in no way limits the invention to the scenario described. Table 1 sets out the environment in which a search is conducted. Furthermore, in this illustration, the factor a (from formula 2, determining the importance of the content of a document relative to its attached comments) has been arbitrarily set to 1.
    TABLE 1
    Number of users: 100; number of experts: 10.
    Total number of file collections: 1000; number of relevant file collections: 10; number of relevant comments: 10.

    Departments of experts and names:
      Marketing: Adam M., Bryan M., Christie M., David M.
      Engineering: Eric E., Fred E., Gail E.
      Finance: Hugo F., Henry F.
      Legal: Ivan L.

    File collections, creators, and contributors (total # of contributions, # of relevant contributions):
      11: creator Adam M.; contributors Bryan M. (2, 2), Christie M. (1, 0)
      101: creator Adam M.
      201: creator David M.; contributors David M. (2), Hugo F. (3)
      301: creator David M.; contributor David M. (1)
      401: creator Christie M.; contributors Adam M. (1), Christie M. (3, 1), David M. (1), Eric E. (2), Fred E. (2, 2), Hugo F. (3), Ivan L. (5)
      501: creator Gail E.; contributors Eric E. (1, 0), Fred E. (5, 0), Gail E. (4, 0)
      601: creator Eric E.
      701: creator Henry F.; contributors Henry F. (6, 0), Hugo F. (7, 1), Bryan M. (1, 1)
      801: creator Hugo F.
      901: creator Ivan L.; contributor Henry F. (1, 0)
      999: creator John I.; contributors Bryan M. (2, 2), Fred E. (3, 1)

    Intrinsic scores (file collection DCS score; attached comments' intrinsic CCS scores, by author):
      11: DCS 85; CCS: Bryan M. 1, Bryan M. 1, Christie M. 0
      101: DCS 85
      201: DCS 100; CCS: David M. 0, Hugo F. 0
      301: DCS 50; CCS: David M. 0
      401: DCS 75; CCS: Adam M. 0, Christie M. 1, David M. 0, Eric E. 0, Fred E. 1, Fred E. 1, Hugo F. 0, Ivan L. 0
      501: DCS 80; CCS: Eric E. 0, Fred E. 0, Gail E. 0
      601: DCS 80
      701: DCS 40; CCS: Henry F. 0, Hugo F. 1, Hugo F. 0, Bryan M. 1
      801: DCS 60
      901: DCS 70; CCS: Henry F. 0
      999: DCS 0; CCS: Bryan M. 1, Bryan M. 1, Fred E. 1, Fred E. 0

    Taxonomy matches: Christie M., Bryan M.

    Original sources of file collections:
      11: cnn.com
      101: nytimes.com
      201: microsoft.com
      301: bbc.com
      401: nytimes.com
      501: cnn.com
      601: nytimes.com
      701: latimes.com
      801: bbc.com
      901: corporate intranet
    (The source name here is truncated to the "root level" for simplification purposes. In reality it is the entire URL tag, for example, http://www.cnn.com/2002/WORLD/meast/03/26/arab.league/index.html.)
  • For this example, 100 users having a total number of 1000 file collections in the repository yields 10 experts and 10 relevant file collections. There are also 10 comments that are found to be relevant. The enterprise in which the example takes place has four departments, namely marketing, engineering, finance, and legal. For ease of explanation, each employee's last name begins with the department in which they work.
  • Once the repository 104 (FIG. 1) has been searched (step 506, FIG. 5) and all relevant documents located (step 508, FIG. 5), an Intrinsic Document Score (IDS) is calculated for each located document. This score is a weighted average between a Document Content Score (DCS) and a Comment Content Score (CCS), each calculated using any standard search engine technique, such as a semantic engine, word frequency, etc.
  • Using formulae 7-12 above, the expertise of each searcher, creator, and/or contributor is then calculated. The calculations for F(User contribution) yield the results in Table 2 below.
    TABLE 2
    F(user contribution) calculations
    (per user: file collection, Wi, Ci, first-sum term of formula (8) and its value; then any second-sum term and taxonomy match)

    Adam M.:
      11:  Wi = 1,   Ci = 0,     2 * 1 * (85/100)^2 = 1.445
      101: Wi = 1,   Ci = 0,     2 * 1 * (85/100)^2 = 1.445
      F(Adam M.) = 2.89
    Bryan M.:
      11:  Wi = 0.5, Ci = 0.063, 2 * (0.5 + 0.063) * (0.85)^2 = 0.814
      701: Wi = 0.5, Ci = 0.039, 2 * (0.5 + 0.039) * (0.4)^2 = 0.172
      999: Wi = 0.5, Ci = 0.063, 2 * (0.5 + 0.063) * 0 = 0; second sum: 0.063; taxonomy match: 2
      F(Bryan M.) = 3.049
    Christie M.:
      401: Wi = 1,   Ci = 0.039, 2 * (1 + 0.039) * (75/100)^2 = 1.169; second sum: 0.039; taxonomy match: 2
      F(Christie M.) = 3.208
    David M.:
      201: Wi = 1,   Ci = 0,     2 * 1 * 1 = 2
      301: Wi = 1,   Ci = 0,     2 * 1 * (0.5)^2 = 0.5
      F(David M.) = 2.5
    Eric E.:
      601: Wi = 1,   Ci = 0,     2 * 1 * (0.8)^2 = 1.28
      F(Eric E.) = 1.28
    Fred E.:
      401: Wi = 0.5, Ci = 0.063, 2 * (0.5 + 0.063) * (0.75)^2 = 0.633
      999: Wi = 0.5, Ci = 0.039, 2 * (0.5 + 0.039) * 0 = 0; second sum: 0.039
      F(Fred E.) = 0.672
    Gail E.:
      501: Wi = 1,   Ci = 0,     2 * 1 * (0.8)^2 = 1.28
      F(Gail E.) = 1.28
    Hugo F.:
      801: Wi = 1,   Ci = 0,     2 * 1 * (0.6)^2 = 0.72
      F(Hugo F.) = 0.72
    Ivan L.:
      901: Wi = 1,   Ci = 0,     2 * 1 * (0.7)^2 = 0.98
      F(Ivan L.) = 0.98
  • Using formulae 10 and 11, G(Company Expertise) is calculated to be 2, while H(Query Specificity) is calculated to be 1.667. These values and the values in Table 2 are plugged into formula 7 to arrive at the following expertise values:
    TABLE 3
    Name RS-EXP-ABS
    Adam M. 29
    Bryan M. 30
    Christie M. 32
    David M. 25
    Eric E. 13
    Fred E. 7
    Gail E. 13
    Hugo F. 7
    Henry F. 3
    Ivan L. 10
  • W1(RS_EXP_ABS) is then calculated using formula 6 (for different searcher expertise) to yield the following results:
    TABLE 4
    W1(RS_EXP_ABS) for different values of the searcher expertise
    Name / Searcher expertise:  0  5  10  15  20  25  30  35  40  45
    Adam M. 0.29 0.24 0.19 0.14 0.09 0.04 −0.01 −0.06 −0.11 −0.16
    Bryan M. 0.3 0.25 0.2 0.15 0.1 0.05 0 −0.05 −0.1 −0.15
    Christie M. 0.32 0.27 0.22 0.17 0.12 0.07 0.02 −0.03 −0.08 −0.13
    David M. 0.25 0.2 0.15 0.1 0.05 0 −0.05 −0.1 −0.15 −0.2
    Eric E. 0.13 0.08 0.03 −0.02 −0.07 −0.12 −0.17 −0.22 −0.27 −0.32
    Fred E. 0.07 0.02 −0.03 −0.08 −0.13 −0.18 −0.23 −0.28 −0.33 −0.38
    Gail E. 0.13 0.08 0.03 −0.02 −0.07 −0.12 −0.17 −0.22 −0.27 −0.32
    Hugo F. 0.07 0.02 −0.03 −0.08 −0.13 −0.18 −0.23 −0.28 −0.33 −0.38
    Henry F. 0.03 −0.02 −0.07 −0.12 −0.17 −0.22 −0.27 −0.32 −0.37 −0.42
    Ivan L. 0.1 0.05 0 −0.05 −0.1 −0.15 −0.2 −0.25 −0.3 −0.35
  • DCE and CCE are then calculated using formulae 4, 5, 13, and 14 (for different searcher expertise) to yield the following results:
    TABLE 5
    DCE Calculations
    File collection ID R1 W (searcher exp = 0) DCE (0)
    11 13 0.29 3.77
    101 13 0.29 3.77
    201 10 0.25 2.5
    301 20 0.25 5
    401 15 0.32 4.8
    501 14 0.13 1.82
    601 14 0.13 1.82
    701 18 0.03 0.54
    801 18 0.07 1.26
    901 16 0.1 1.6
    File collection ID R1 W (searcher exp = 30) DCE (30)
    11 13 −0.01 −0.13
    101 13 −0.01 −0.13
    201 10 −0.05 −0.5
    301 20 −0.05 −1
    401 15 0.02 0.3
    501 14 −0.17 −2.38
    601 14 −0.17 −2.38
    701 18 −0.27 −4.86
    801 18 −0.23 −4.14
    901 16 −0.2 −3.2
    CCE Calculations
    File collection ID  Δx (searcher exp = 0)  Δx (searcher exp = 30)  CCE (0)  CCE (30)
    11 1.24 0.04 2.76 0.1
    101 0 0 0 0
    201 0.64 −0.56 1.55 −1.36
    301 0.5 −0.1 1.22 −0.25
    401 2.46 −1.74 4.21 −3.51
    501 0.66 −1.14 1.59 −2.58
    601 0 0 0 0
    701 0.8 −1 1.9 −2.31
    801 0 0 0 0
    901 0.06 −0.54 0.15 −1.32
  • The Expertise Adjustment (EA) is then calculated according to formula 3 to yield the following results for EA:
    TABLE 6
    Expertise Adjustment (EA); values for DCE and CCE are taken from Table 5 above
    File collection ID Searcher expertise = 0 Searcher expertise = 30
    11 6.53 −0.03
    101 3.77 −0.13
    201 4.05 −1.86
    301 6.22 −1.25
    401 9.01 −3.21
    501 3.41 −4.96
    601 1.82 −2.38
    701 2.44 −7.17
    801 1.26 −4.14
    901 1.75 −4.52
    Each entry is DCE + CCE for the corresponding searcher expertise (0 or 30).
  • Finally, the adjusted score (RS_ADJ) for each located document is calculated using formula 1 to yield the following results:
    TABLE 7
    File collection ID  Document intrinsic score  RS_ADJ (Searcher exp = 0)  RS_ADJ (Searcher exp = 30)
    11 85 92 85
    101 85 89 85
    201 100 100 98
    301 50 56 49
    401 75 84 72
    501 80 83 75
    601 80 82 78
    701 40 42 33
    801 60 61 56
    901 70 72 65
  • In a similar manner, the adjusted scores are calculated when searching for sources as per tables 8-12 below.
    TABLE 8
    Source name  File collections created from source  Creators of file collections
    cnn.com  11,501 Adam M., Gail E.
    microsoft.com 201 David M.
    nytimes.com 101, 401, 601 Adam M., Christie M., Eric E.
    bbc.com 801 Hugo F.
    latimes.com 701 Henry F.
    corporate intranet 901 Ivan L.
  • TABLE 9
    SCS calculations
    Source SCS
    cnn.com 85
    microsoft.com 100
    nytimes.com 85
    bbc.com 60
    latimes.com 40
    corporate intranet 70
  • TABLE 10
    R2 calculations
    cnn.com 13
    microsoft.com 10
    nytimes.com 13
    bbc.com 18
    latimes.com 18
    corporate intranet 16
  • TABLE 11
    W2 Calculations
    Source  W2 (Searcher Expertise = 0)  W2 (Searcher Expertise = 30)
    cnn.com 0.29 −0.01
    microsoft.com 0.25 −0.05
    nytimes.com 0.32 0.02
    bbc.com 0.07 −0.23
    latimes.com 0.03 −0.27
    corporate intranet 0.1 −0.2
  • TABLE 12
    Adjusted relevancy scores
    Source name  RSS_ADJ (Searcher Expertise = 0)  RSS_ADJ (Searcher Expertise = 30)  SCS intrinsic score
    cnn.com 89 85 85
    microsoft.com 100 100 100
    nytimes.com 89 85 85
    bbc.com 61 56 60
    latimes.com 41 35 40
    corporate intranet 72 67 70
  • As can be seen, the intrinsic score of each document and/or source is adjusted to an adjusted score based on the expertise of the users. In other words, a document and/or source that may have been less relevant is adjusted so that it is more relevant, or vice versa. In this way, the precision of document and/or source relevancy is improved.
  • Harmonizing Content Relevancy Across Structured and Unstructured Data
  • FIG. 6 is a flow chart of document collection for use with structured data according to an embodiment of the invention. A creator supplies an object, such as a document or source, to the searching procedures 220 (FIG. 2) at step 602. To supply a document, the creator may, for example, supply any type of data file that contains text, such as an email, word processing document, text document, or the like. Every document originates from a source. Therefore, to supply a source, the creator may provide a link to a document, such as by providing a URL to a Web-page on the Internet, or supply a directory that contains multiple documents. In a preferred embodiment, the creator also supplies his or her unique user identifier 224 (FIG. 2), and any other data, such as privacy settings, or the like. The unique user identifier may be supplied without the creator's knowledge, such as by the creator device 106 (FIG. 1) automatically supplying its IP or MAC address.
  • The document, source, and/or other data is then sent to the information retrieval system 102 (FIG. 1) by the communication procedures 218 (FIG. 2). The information retrieval system 102 (FIG. 1) receives the document, source, and/or other data at step 603.
  • The document is examined for structured data. If the document is a structured document (step 607), then for each field in the document, the keyword extractor or parser 318 (FIG. 3) parses the field into ASCII text at step 604, and thereafter extracts the important keywords at step 608. However, when supplied with a source, the keyword extractor or parser 318 (FIG. 3) first obtains the document(s) from the source before parsing the text and extracting the important keywords. Also at step 604, each field is assigned a Field ID, a Field Type (e.g., restricted, unrestricted, etc.) and a Field Modifier value, which are stored in the repository 104 with the associated text. The Field Type and Field Modifier value are used to adjust field intrinsic scores, as described more fully with respect to FIGS. 8 and 9.
  • Extraction of important keywords is undertaken using any suitable technique. These keywords, document, source, and other data are then stored at step 606 in the repository 104 as part of a file collection 328(1)-(N) (FIG. 3). Also, the unique user identifier is used to associate or link each file collection 328(1)-(N) (FIG. 3) created with a particular creator. This link between the creator and the file collection is stored in the creator's user profile 336(1)-(N) (FIG. 3). The user profile data can be updated by the user him/herself or, more preferably, by a system administrator.
  • In a preferred embodiment, the concept identifier 320 (FIG. 3) then identifies the important concept/s from the extracted keywords at step 610. Again, in a preferred embodiment, the metadata filter 322 (FIG. 3) then refines the concept at step 612. The refined concept is then stored 614 in the repository 104 as part of the metadata 330(1)-(N) (FIG. 3) within a file collection 328(1)-(N) (FIG. 3). After storing 614 the metadata, the next field in the document is processed by steps 604-614. If there are no more fields in the document (step 605), then the process stops (i.e., the last field has been processed).
  • At any time during the process, contributors can supply their contributions, at step 616, such as additional comments, threads, or other activity to be associated with the file collection 328(1)-(N). These contributions are received by the information retrieval engine at step 618 and stored in the repository at step 620, as contributions 334(1)-(N). Alternatively, contributions may be received and treated in the same manner as a document/source, i.e., steps 603-614.
  • FIG. 7 is a flow chart of document collection for use with unstructured data according to an embodiment of the invention. If the document is an unstructured document (step 607), then the keyword extractor or parser 318 (FIG. 3) parses the document and/or source into ASCII text at step 704, and thereafter extracts the important keywords at step 708. However, when supplied with a source, the keyword extractor or parser 318 (FIG. 3) first obtains the document(s) from the source before parsing the text and extracting the important keywords.
  • Extraction of important keywords is undertaken using any suitable technique. These keywords, document, source, and other data are then stored at step 706 in the repository 104 as part of a file collection 328(1)-(N) (FIG. 3). Also, the unique user identifier is used to associate or link each file collection 328(1)-(N) (FIG. 3) created with a particular creator. This link between the creator and the file collection is stored in the creator's user profile 336(1)-(N) (FIG. 3). The user profile data can be updated by the user him/herself or, more preferably, by a system administrator.
  • In a preferred embodiment, the concept identifier 320 (FIG. 3) then identifies the important concept/s from the extracted keywords at step 710. Again, in a preferred embodiment, the metadata filter 322 (FIG. 3) then refines the concept at step 712. The refined concept is then stored 714 in the repository 104 as part of the metadata 330(1)-(N) (FIG. 3) within a file collection 328(1)-(N) (FIG. 3).
  • At any time during the process, contributors can supply their contributions, at step 716, such as additional comments, threads, or other activity to be associated with the file collection 328(1)-(N). These contributions are received by the information retrieval engine at step 718 and stored in the repository at step 720, as contributions 334(1)-(N). Alternatively, contributions may be received and treated in the same manner as a document/source, i.e., steps 703-714.
  • FIG. 8 is a flow chart of a process for information retrieval for structured and unstructured data according to an embodiment of the invention. While the process described below includes a number of steps that appear to occur in a specific order, it should be apparent that the process steps are not limited to any particular order, and, moreover, the process can include more or fewer steps, which can be executed serially or in parallel (e.g., using parallel processors or a multi-threading environment).
  • A searcher using a searcher device 108 (FIG. 1) submits a search request to the information retrieval system 102 (FIG. 1), at step 802. Submittal of this search occurs using searching procedures 220 (FIG. 2) and communication procedures 218 (FIG. 2) on the searcher device 108 (FIG. 1). The search request preferably contains one or more search terms, and the unique user identifier 224 (FIG. 2) of the searcher.
  • The search is preferably conducted to locate structured and unstructured objects. Objects preferably include: content objects, such as documents, comments, or folders; source objects; people objects, such as experts, peers, or workgroups; or the like. A search for documents returns a list of relevant documents, and a search for experts returns a list of experts with expertise in the relevant field. A search for sources returns a list of sources from where relevant documents were obtained. For example, multiple relevant documents may be stored within a particular directory or website.
  • The search is received at step 804 by the information retrieval system 102 (FIG. 1) using communications procedures 314 (FIG. 3). The information retrieval system 102 (FIG. 1) then searches the repository 104 for relevant objects at step 806. This search is undertaken by the search engine 324 (FIG. 3), at step 806, using any known or yet to be discovered search techniques. In a preferred embodiment, the search undertakes a semantic analysis of each file collection 328(1)-(N) stored in the repository 104.
  • The search engine 324 (FIG. 3) then locates relevant objects 328(1)-(N) at step 808. By “located object,” it is meant any part of a file collection that is found to be relevant, including the content, source, metadata, etc. The calculation of intrinsic scores is based on known or yet-to-be-discovered techniques for calculating relevancy of located objects based solely on the located objects themselves, the repository 104 itself, and the search terms. In its simplest form, such a search calculates the intrinsic score based on the number of times that a search term appears in the content 332(1)-(N) (FIG. 3) of located objects. However, in a preferred embodiment, this calculation is also based on a semantic analysis of the relationship between words in the content 332(1)-(N) (FIG. 3). The intrinsic score and field intrinsic scores differ in that the former is computed for the entire document and the latter is computed for each field in a structured document.
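  • The simplest-form intrinsic score described above is a term-frequency count. The sketch below illustrates that idea only; the function name and tokenization rule are illustrative assumptions, not the system's actual scoring engine:

```python
import re
from collections import Counter

def intrinsic_score(content: str, search_terms: list[str]) -> int:
    """Simplest-form intrinsic score: the total number of times the
    search terms appear in the located object's content."""
    counts = Counter(re.findall(r"[a-z0-9]+", content.lower()))
    return sum(counts[term.lower()] for term in search_terms)

print(intrinsic_score("The cat sat on the cat mat.", ["cat", "mat"]))  # 3
```

A preferred embodiment would replace this raw count with a semantic analysis, but the interface (content plus search terms in, a score out) stays the same.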
  • After the objects are located, the search engine 324 determines if the located objects are structured or unstructured (step 810). If a located object is unstructured, then the search engine 324 calculates an intrinsic score at step 812 for the unstructured object. If more located objects are available (step 813), then step 812 is repeated until all the located objects have been processed. If all the located objects have been processed, the search engine 324 adjusts the intrinsic scores based on expertise (step 814), sorts the structured and unstructured objects (step 816) based on the adjusted scores, transfers the list to the searcher (step 818), where it is received (step 820) and displayed to the searcher (step 822). Other than step 810, all of the foregoing steps were previously described with respect to FIG. 5.
  • If a located object is a structured object, then the search engine 324 determines the fields of the object that match the search query (step 824) and calculates field intrinsic scores for the matching fields (step 826). The Field ID stored in the repository 104 during collection (FIG. 6) can be used to determine which fields match the search query. After the field intrinsic scores are calculated (step 826), the scores are adjusted to harmonize content relevancy (step 828). If more located objects are available (step 813), then steps 824-828 are repeated until all the located objects have been processed. If all the located objects have been processed, the search engine 324 adjusts the adjusted field intrinsic scores to account for expertise (step 814). The adjustment for expertise has been previously described, and the adjustment to harmonize content relevancy is described more fully below.
  • After the field intrinsic scores are adjusted at step 814, the structured and unstructured objects are sorted (step 816) by their adjusted scores and transferred to the searcher (step 818), where they are received (step 820) and displayed (step 822) to the searcher.
  • Adjusting Field Intrinsic Scores To Harmonize Content Relevancy
  • In some embodiments, field intrinsic scores are adjusted differently based on the logical operator or operators used in the search query. These logical operators include but are not limited to: AND, OR and TMTB (“The More The Better”). Note that TMTB is an accumulate-type operator that looks for objects that match as many keywords associated with a search query as possible. While the OR and AND operators are the most common logical operators used in search engine queries, it should be apparent that the formulae described below can be adapted to other types of operators, including proximity operators (e.g., ADJACENT, WITH, NEAR, FOLLOWED BY, etc.).
  • An example of a data source that generates documents including structured objects is a Sales Force Automation (SFA) system. A typical SFA system includes software and systems that support sales staff lead generation, contact, scheduling, performance tracking and other functions. SFA functions are normally integrated with base systems that provide order, product, inventory status and other information and may be included as part of a larger customer relationship management (CRM) system.
  • The structured objects generated by the SFA system include a mix of restricted and unrestricted fields. A restricted field is a field that is constrained by one or more parameters, such as a controlled vocabulary or a controlled size. An unrestricted field can include any value the owner wants to provide, such as free form text. For example, an SFA document could include records having the following four fields: SFAAccountName (the name of the account); SFAAccountDescription (a brief description of the account); SFAAccountIndustry (the industry to which the account belongs); and SFAAccountAttachment (documents that might be attached to the record). In this example, SFAAccountName and SFAAccountIndustry are examples of restricted fields and SFAAccountDescription and SFAAccountAttachment are examples of unrestricted fields.
  • To account for the differences in relevancy between restricted and unrestricted fields, each field is assigned a field modifier value, which is adjustable and expandable. For example, the field modifier value can be adjusted according to a profile of the user and/or a profile of the data source.
  • Table 13 below summarizes the SFA fields described above with some examples of corresponding field modifier values. Note that in this example the restricted fields are weighted more heavily based on the assumption that a keyword match to a restricted field may be more relevant. It should be apparent, however, that the field modifiers can be selected and adjusted, as necessary, depending on the search engine design.
    TABLE 13
    Summary of SFA Fields & Modifiers
    Field Field Type Field Modifier
    SFAAccountName Restricted 95
    SFAAccountDescription Unrestricted 0.6
    SFAAccountIndustry Restricted 100
    SFAAccountAttachment Unrestricted 0.4
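  • As a rough illustration (not the system's implementation), the field modifiers of Table 13 can be applied as a lookup-and-multiply step; the dictionary and function names below are hypothetical:

```python
# Field modifier values mirroring Table 13 (restricted fields weighted more heavily).
FIELD_MODIFIERS = {
    "SFAAccountName": 95,          # restricted
    "SFAAccountDescription": 0.6,  # unrestricted
    "SFAAccountIndustry": 100,     # restricted
    "SFAAccountAttachment": 0.4,   # unrestricted
}

def modified_field_scores(field_intrinsic_scores: dict) -> dict:
    """Weight each field intrinsic score by its field modifier value."""
    return {field: score * FIELD_MODIFIERS[field]
            for field, score in field_intrinsic_scores.items()}

print(modified_field_scores({"SFAAccountIndustry": 0.5}))  # {'SFAAccountIndustry': 50.0}
```

In a deployed system these modifier values would be adjustable, e.g., tuned per user profile or data-source profile as the text describes.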
  • For each operator, the general formula to calculate its adjusted intrinsic relevancy score is of the form:
    AIR_Structured = f_operator(Field, Modifier parameters),  (22)
    where AIR_Structured is the adjusted intrinsic relevancy score for a structured object and the function depends on the operator. Specific formulae for the most common logical operators are described in turn below.
    The “OR” Operator
  • The OR logical operator looks for the presence of keywords associated with the user's query. If a match is found in a document on any one keyword, the document is retrieved. In some embodiments, a formula for an OR operator can be determined by
    AIR_Structured_OR = Max_Keywords[ Max_Fields( Field Intrinsic score * Field modifier ) ],  (23)
  • where the Field Intrinsic score is, for example in the case of an unrestricted field, the raw score as provided by a semantic engine or a full-text engine for that given keyword. Note that formula (23) uses the maximal subscores to define the intrinsic value of the document. Another approach is to let all fields participate:
    AIR_Structured_OR = Max_Keywords[ (1/λ) * Σ_Fields( Field Intrinsic score * Field modifier ) ],  (24)
    where λ is a normalizing parameter. Finally, the growth of the score within each keyword is preferably concave as scores accumulate and convex as they start accumulating. For example, as the score accumulates, a difference between two high scores (e.g., the scores 1000 and 1010) is less significant than the difference between two low scores (e.g., the scores 10 and 20). That is, for high scores the score function behaves like a utility function with respect to relevancy. On the other hand, a difference between a first pair of low scores (e.g., 0 and 2) could be less significant than the difference between a second pair of low scores (e.g., 10 and 12) due to uncertainties and lack of relevancy associated with low scores. These properties lead to:
    AIR_Structured_OR = Max_Keywords[ θ * logit⁻¹{ Σ_Fields( Field Intrinsic score * Field modifier ) } ],  (25)
  • where θ is again a normalizing constant. In some embodiments, one or more parameters can be added, as necessary, to tune the behavior of the inverse logit function.
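  • The three OR variants above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the per-field values are assumed to be already multiplied by their field modifiers, and the names `air_or` and `inv_logit` are hypothetical.

```python
import math

def inv_logit(x: float) -> float:
    """Inverse logit (logistic) function: logit^-1(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def air_or(scores, mode="max", lam=1.0, theta=1.0):
    """OR-operator adjusted intrinsic relevancy score.
    scores: {keyword: {field: field_intrinsic_score * field_modifier}}
    mode selects formula (23) 'max', (24) 'sum', or (25) 'logit'."""
    per_keyword = []
    for field_scores in scores.values():
        if mode == "max":      # formula (23): only the best field counts
            per_keyword.append(max(field_scores.values()))
        elif mode == "sum":    # formula (24): all fields, normalized by lambda
            per_keyword.append(sum(field_scores.values()) / lam)
        else:                  # formula (25): squash the summed score
            per_keyword.append(theta * inv_logit(sum(field_scores.values())))
    return max(per_keyword)   # OR keeps the best keyword

scores = {"budget": {"name": 2.0, "body": 5.0}, "forecast": {"body": 1.0}}
print(air_or(scores, mode="max"))  # 5.0
```

The inverse logit in mode `"logit"` gives exactly the convex-then-concave growth described above: it rises slowly for small sums, steeply in the middle, and flattens for large sums.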
  • The AND Operator
  • The AND operator looks for objects that match all keywords associated with the search query. In some embodiments, a simple scoring mechanism is used that assumes that all keywords are equal (i.e., they all need to be present). The relevancy is then associated with the weakest keyword. This leads to the following “logit” type of formula:
    AIR_Structured_AND = Min_Keywords[ θ * logit⁻¹{ Σ_Fields( Field Intrinsic score * Field modifier ) } ],  (26)
    where the MAX calculation used in the OR formulas (23) through (25) is switched to a MIN calculation.
  • In an alternative embodiment, an average “concept” relevancy score can be determined by
    AIR_Structured_AND = ( AVE[·] + MAX[·] ) / 2,  (27)
    where
    AVE[·] = Average_Keywords[ θ * logit⁻¹{ Σ_Fields( Field Intrinsic score * Field modifier ) } ],  (28)
    MAX[·] = Max_Keywords[ θ * logit⁻¹{ Σ_Fields( Field Intrinsic score * Field modifier ) } ].  (29)
  • Note that by averaging the relevancy score, an overall score is determined that does not overweight a single potential low score.
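  • Both AND variants can be sketched as below, reusing the same per-keyword “logit” score as the OR sketch. This is an assumption-laden illustration, not the patented implementation; the per-field values are again assumed to be pre-multiplied by their field modifiers.

```python
import math

def inv_logit(x: float) -> float:
    """Inverse logit (logistic) function: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def keyword_scores(scores, theta=1.0):
    """theta * logit^-1 of the summed modified field scores, per keyword."""
    return [theta * inv_logit(sum(fs.values())) for fs in scores.values()]

def air_and_min(scores, theta=1.0):
    """Formula (26): relevancy follows the weakest matched keyword."""
    return min(keyword_scores(scores, theta))

def air_and_avg(scores, theta=1.0):
    """Formulas (27)-(29): blend the average and maximum keyword scores
    so a single low keyword score is not over-weighted."""
    ks = keyword_scores(scores, theta)
    return (sum(ks) / len(ks) + max(ks)) / 2
```

With one weak keyword and one strong keyword, `air_and_min` collapses to the weak score while `air_and_avg` lands between them, which is the stated motivation for the averaging variant.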
  • The TMTB Operator
  • The TMTB operator is an accumulation operator that tends to provide the highest scores for those objects that are relevant to more terms. It also, however, allows objects that do not match all concepts to still receive a high score if those matches are good. Using the “logit” style of formula for illustration gives:
    AIR_Structured_TMTB = (1/λ) * round( β * (1 − Exp(−FQC/NQC)) ) * Max_Keywords[ θ * logit⁻¹{ Σ_Fields( Field Intrinsic score * Field modifier ) } ],  (30)
    where NQC is the number of concepts present in the query and FQC is the number of concepts that were matched to the object.
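  • Formula (30) can be sketched as follows. The parameter values (β, λ, θ) and the convention that FQC equals the number of keyword entries in the score map are illustrative assumptions, not values prescribed by the patent:

```python
import math

def inv_logit(x: float) -> float:
    """Inverse logit (logistic) function: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def air_tmtb(scores, nqc, beta=10.0, lam=1.0, theta=1.0):
    """Formula (30): an accumulation factor that grows with the fraction
    of query concepts matched (FQC/NQC) scales the best keyword score."""
    fqc = len(scores)  # number of query concepts matched to this object
    factor = round(beta * (1.0 - math.exp(-fqc / nqc))) / lam
    best = max(theta * inv_logit(sum(fs.values())) for fs in scores.values())
    return factor * best
```

An object matching 1 of 2 query concepts gets a smaller accumulation factor than one matching both, yet can still score well if its single match is strong, which is the stated intent of TMTB.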
    Field Extensions
  • In the above approaches, no field was an exclusion field. It is also possible, however, to enforce that one of the matches has to occur in a specific field. For example, in embodiments using the MAX, OR scheme described above, the formula would be:
    AIR_Structured_OR = 1{Match ∈ Set_i} * Max_Keywords[ Max_Fields( Field Intrinsic score * Field modifier ) ],  (31)
    where 1{Match ∈ Set_i} is an indicator function which equals 1 if the match occurred within the fields specified by Set_i.
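  • The indicator gate of formula (31) can be sketched as below; the function name and the zero-score convention for a failed indicator are illustrative assumptions:

```python
def air_or_required(scores, required_fields):
    """Formula (31): the MAX/OR score of formula (23), gated by an
    indicator that at least one match fell within a required field."""
    matched = any(field in required_fields
                  for field_scores in scores.values()
                  for field in field_scores)
    if not matched:
        return 0.0  # indicator 1{Match in Set_i} is zero
    return max(max(fs.values()) for fs in scores.values())
```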
  • FIG. 9 is a flow chart of an embodiment of a process for determining a relevancy score for a located object. While the process described below includes a number of steps that appear to occur in a specific order, it should be apparent that the process steps are not limited to any particular order, and, moreover, the process can include more or fewer steps, which can be executed serially or in parallel (e.g., using parallel processors or a multi-threading environment).
  • The process begins by initializing a keyword counter x to one (step 902) or any other suitable number. For the first keyword in a search query including N keywords, the fields in the located object containing keyword x matches are determined (step 904) and field intrinsic scores are computed (step 906). In some embodiments, prior to executing the matching step 904, search queries with N terms are matched (step 903) to M concepts (where M is less than or equal to N). If multiple terms in a search query are matched to the same concept, then only one term in the query will be used in the matching step 904 to reduce computation time. For example, if a searcher types “Car OR Car,” then only one unique concept (i.e., a type of vehicle) was specified by the searcher. Similarly, if a searcher types “A Car”, then only one unique concept was specified. Thus, in the above examples the keyword “Car” would be included in the matching step 904, and the second instance of the term “Car” and the term “A” would be excluded from the matching step 904. The foregoing additional steps would be used in, for example, the calculation of an adjusted intrinsic relevancy score based on the TMTB operator, as described with respect to equation 30.
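  • The term-to-concept reduction of step 903 can be sketched as a deduplication pass; the term-to-concept map and function name are hypothetical:

```python
def dedupe_terms(terms, concept_of):
    """Step 903 sketch: keep only the first query term per concept;
    terms that map to no concept (e.g., stopwords like "A") are dropped."""
    seen, kept = set(), []
    for term in terms:
        concept = concept_of.get(term.lower())
        if concept is not None and concept not in seen:
            seen.add(concept)
            kept.append(term)
    return kept

concepts = {"car": "vehicle"}  # hypothetical term-to-concept map
print(dedupe_terms(["Car", "Car"], concepts))  # ['Car']
print(dedupe_terms(["A", "Car"], concepts))    # ['Car']
```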
  • In some embodiments, the calculation of field intrinsic scores can be based on the number of times that a search term appears in a particular field of the located object, or on a semantic analysis of the relationship between keywords and content. Next, the appropriate field modifiers are applied to the field intrinsic scores (step 908). In some embodiments, the modifiers are selected based on whether the fields with the keyword matches are restricted or unrestricted. For example, in the SFA system previously described, the fields SFAAccountName and SFAAccountIndustry are restricted and the fields SFAAccountDescription and SFAAccountAttachment are unrestricted.
  • After the field intrinsic scores are modified, there are several options for determining the relevancy score for the located object. A first option is to determine a maximum field intrinsic score from the set of modified field intrinsic scores for keyword x (step 910). A second option sums the modified field intrinsic scores (step 912). A third option applies an inverse logit function to the sum of the modified field intrinsic scores (step 914). Note that in some embodiments steps 912 and 914 may also include a normalizing step. In step 916, a check is made for more keywords. If there are more keywords, then the keyword counter is incremented, and the process continues at step 904, where fields with matches for the next keyword (i.e., keyword x=x+1) are retrieved. If there are no more keywords, then the scores are adjusted relative to the types of operators used in the query (e.g., AND, OR, TMTB, NEAR, ADJACENT, WITH, FOLLOWED BY, etc.) (step 918). If a logical OR operator is used with the keywords, then a relevancy score for the located object is determined from the maximum of the field intrinsic scores for keywords 1, . . . , N (step 920). If a logical AND operator is used with the keywords, then a relevancy score for the located object is determined from a minimum of the field intrinsic scores for keywords 1, . . . , N (step 922). If another type of operator is used (e.g., TMTB, etc.), then a relevancy score for the located object is determined using the appropriate formulas (step 924).
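  • The operator dispatch at the end of the FIG. 9 process can be sketched as below, using the inverse-logit per-keyword score of formulas (25) and (26). The function names are hypothetical and the TMTB/proximity branches are deliberately left out, as they use their own formulas:

```python
import math

def inv_logit(x: float) -> float:
    """Inverse logit (logistic) function: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def relevancy_score(scores, operator, theta=1.0):
    """Operator dispatch: OR takes the maximum per-keyword score and AND
    takes the minimum; other operators need their own formulas.
    scores: {keyword: {field: modified field intrinsic score}}."""
    per_keyword = [theta * inv_logit(sum(fs.values())) for fs in scores.values()]
    if operator == "OR":
        return max(per_keyword)
    if operator == "AND":
        return min(per_keyword)
    raise NotImplementedError("TMTB/proximity operators use their own formulas")
```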
  • While the foregoing description and drawings represent preferred embodiments of the present invention, it will be understood that various additions, modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined in the accompanying claims. In particular, it will be clear to those skilled in the art that the present invention may be embodied in other specific forms, structures, arrangements, proportions, and with other elements, materials, and components, without departing from the spirit or essential characteristics thereof. The presently disclosed embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims, and not limited to the foregoing description. Furthermore, it should be noted that the order in which the process is performed may vary without substantially altering the outcome of the process.

Claims (75)

1. A method for retrieving information, comprising:
receiving a search request from a requester including one or more search terms;
searching a plurality of objects based on at least one search term;
identifying at least one object associated with at least one search term; and
determining a relevancy score for the object based on whether the object includes structured or unstructured data.
2. The method of claim 1, wherein the plurality of objects includes some objects containing structured data and the relevancy score is determined at least in part on whether the structured data includes restricted or unrestricted fields.
3. The method of claim 2, wherein restricted fields are associated with a first set of modifier values and unrestricted fields are associated with a second set of modifier values.
4. The method of claim 3, wherein at least some of the modifiers are adjustable.
5. The method of claim 1, further comprising:
adjusting the relevancy score based on contributor expertise.
6. The method of claim 1, further comprising:
adjusting the relevancy score based on creator expertise.
7. The method of claim 1, further comprising:
adjusting the relevancy score based on requestor expertise.
8. The method of claim 1, wherein the relevancy score is determined from a function that is based on one or more operators used in the search request.
9. The method of claim 8, wherein the one or more operators include at least one Boolean operator.
10. The method of claim 8, wherein the one or more operators include at least one proximity operator.
11. The method of claim 8, wherein the one or more operators include a combination of Boolean and proximity operators.
12. The method of claim 8, wherein the step of determining a relevancy score, comprises:
identifying objects that include structured data;
searching for a first keyword match in the structured data;
determining a first set of intrinsic scores for the first keyword based at least in part on the number of matches in the structured data;
searching for a second keyword match in the structured data, when a second keyword is present in the search request;
determining a second set of intrinsic scores for the second keyword, when present in the search request, based at least in part on the number of matches in the structured data;
searching for subsequent keywords matches in the structured data, when subsequent keywords are present in the search request;
determining subsequent sets of intrinsic scores for subsequent keywords, when present in the search query, based at least in part on the number of matches in the structured data;
modifying the first, second and subsequent sets of intrinsic scores with modifier values to produce first, second and subsequent sets of modified intrinsic scores, wherein the modifier values are based at least in part on whether the structured data is restricted or unrestricted;
determining an adjusted intrinsic score across one or more fields within the structured data from the first set of modified intrinsic scores associated with the first keyword;
determining an adjusted intrinsic score across one or more fields within the structured data from the second set of modified intrinsic scores associated with the second keyword;
determining adjusted intrinsic scores across one or more fields within the structured data from the subsequent sets of modified intrinsic scores associated with the subsequent keywords; and
selecting the relevancy score to be determined by a function aggregating the adjusted intrinsic scores of the first, second and subsequent keywords.
13. The method of claim 12, wherein the adjusted intrinsic score is dependent on one or more operators within the search query.
14. The method of claim 13, wherein the one or more operators include at least one Boolean operator.
15. The method of claim 13, wherein the one or more operators include at least one proximity operator.
16. The method of claim 13, wherein the one or more operators include a combination of Boolean and proximity operators.
17. The method of claim 12, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are normalized.
18. The method of claim 12, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are weighted by a convex function.
19. The method of claim 12, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are weighted by a concave function.
20. The method of claim 12, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are weighted by a partly concave and a partly convex function.
21. The method of claim 20, wherein the partly concave and partly convex function is an inverse logit function.
22. The method of claim 12, wherein the aggregation function is either a Maximum or a Minimum over the adjusted intrinsic scores.
23. The method of claim 12, wherein the aggregation function is based on a sum or average over the adjusted intrinsic scores.
24. The method of claim 12, wherein the aggregation function is one of a group of aggregation functions including a convex function, a concave function or a partly convex and partly concave function over the adjusted intrinsic scores.
25. The method of claim 24, wherein the aggregation function is an inverse logit function.
26. A computer-readable medium having instructions stored thereon, which, when executed by a processor, causes the processor to perform the operations of:
receiving a search request from a requester including one or more search terms;
searching a plurality of objects based on at least one search term;
identifying at least one object associated with at least one search term; and
determining a relevancy score for the object based on whether the object includes structured or unstructured data.
27. The computer-readable medium of claim 26, wherein the plurality of objects includes some objects containing structured data and the relevancy score is determined at least in part on whether the structured data includes restricted or unrestricted fields.
28. The computer-readable medium of claim 27, wherein restricted fields are associated with a first set of modifier values and unrestricted fields are associated with a second set of modifier values.
29. The computer-readable medium of claim 26, wherein at least some of the modifiers are adjustable.
30. The computer-readable medium of claim 26, further comprising:
adjusting the relevancy score based on contributor expertise.
31. The computer-readable medium of claim 26, further comprising:
adjusting the relevancy score based on creator expertise.
32. The computer-readable medium of claim 26, further comprising:
adjusting the relevancy score based on requestor expertise.
33. The computer-readable medium of claim 26, wherein the relevancy score is determined from a function that is based on one or more operators used in the search request.
34. The computer-readable medium of claim 33, wherein the one or more operators includes at least one Boolean operator.
35. The computer-readable medium of claim 33, wherein the one or more operators includes at least one proximity operator.
36. The computer-readable medium of claim 33, wherein the one or more operators includes a combination of Boolean and proximity operators.
37. The computer-readable medium of claim 33, wherein the step of determining a relevancy score comprises:
identifying objects that include structured data;
searching for a first keyword match in the structured data;
determining a first set of intrinsic scores for the first keyword based at least in part on the number of matches in the structured data;
searching for a second keyword match in the structured data, when a second keyword is present in the search request;
determining a second set of intrinsic scores for the second keyword, when present in the search request, based at least in part on the number of matches in the structured data;
searching for subsequent keyword matches in the structured data, when subsequent keywords are present in the search request;
determining subsequent sets of intrinsic scores for subsequent keywords, when present in the search query, based at least in part on the number of matches in the structured data;
modifying the first, second and subsequent sets of intrinsic scores with modifier values to produce first, second and subsequent sets of modified intrinsic scores, wherein the modifier values are based at least in part on whether the structured data is restricted or unrestricted;
determining an adjusted intrinsic score across one or more fields within the structured data from the first set of modified intrinsic scores associated with the first keyword;
determining an adjusted intrinsic score across one or more fields within the structured data from the second set of modified intrinsic scores associated with the second keyword;
determining adjusted intrinsic scores across one or more fields within the structured data from the subsequent sets of modified intrinsic scores associated with the subsequent keywords; and
selecting the relevancy score to be determined by a function aggregating the adjusted intrinsic scores of the first, second and subsequent keywords.
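The scoring pipeline recited in claim 37 can be sketched end to end. Everything concrete below is an assumption for illustration only: the modifier values, the use of a simple match count as the intrinsic score, and the choice of sum and average as the adjustment and aggregation functions are all placeholders where the claims deliberately leave the functions open.

```python
# Hypothetical modifier values; the claims only require that restricted and
# unrestricted fields receive different sets of modifiers.
RESTRICTED_MODIFIER = 2.0
UNRESTRICTED_MODIFIER = 1.0

def relevancy_score(structured_fields, keywords):
    """structured_fields: list of (field_text, is_restricted) pairs.
    keywords: the first, second and subsequent keywords of the request."""
    adjusted = []
    for kw in keywords:
        modified = []
        for text, is_restricted in structured_fields:
            # Intrinsic score: number of matches of the keyword in the field.
            matches = text.lower().split().count(kw.lower())
            # Modify by a value that depends on restricted vs. unrestricted.
            modifier = RESTRICTED_MODIFIER if is_restricted else UNRESTRICTED_MODIFIER
            modified.append(matches * modifier)
        # Adjusted intrinsic score across fields for this keyword
        # (sum chosen here as one possible adjustment function).
        adjusted.append(sum(modified))
    # Aggregation function over the per-keyword adjusted scores
    # (average chosen here; max, min or sum would also satisfy the claims).
    return sum(adjusted) / len(adjusted)
```

For example, with a restricted title field and an unrestricted body field, a keyword matching once in each would score `1 * 2.0 + 1 * 1.0 = 3.0` under these assumed modifiers.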
38. The computer-readable medium of claim 37, wherein the adjusted intrinsic score is dependent on one or more operators within the search query.
39. The computer-readable medium of claim 37, wherein the one or more operators include at least one Boolean operator.
40. The computer-readable medium of claim 37, wherein the one or more operators include at least one proximity operator.
41. The computer-readable medium of claim 37, wherein the one or more operators include a combination of Boolean and proximity operators.
42. The computer-readable medium of claim 37, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are normalized.
43. The computer-readable medium of claim 37, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are weighted by a convex function.
44. The computer-readable medium of claim 37, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are weighted by a concave function.
45. The computer-readable medium of claim 37, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are weighted by a partly concave and a partly convex function.
46. The computer-readable medium of claim 45, wherein the partly concave and partly convex function is an inverse logit function.
47. The computer-readable medium of claim 37, wherein the aggregation function is either a Maximum or a Minimum over the adjusted intrinsic scores.
48. The computer-readable medium of claim 37, wherein the aggregation function is based on a sum or average over the adjusted intrinsic scores.
49. The computer-readable medium of claim 37, wherein the aggregation function is one of a group of aggregation functions including a convex function, a concave function or a partly convex and partly concave function over the adjusted intrinsic scores.
50. The computer-readable medium of claim 49, wherein the aggregation function is an inverse logit function.
51. An information retrieval system, comprising:
a processor;
a memory coupled to the processor and including instructions, which, when executed by the processor, causes the processor to perform the operations of:
receiving a search request from a requester including one or more search terms;
searching a plurality of objects based on at least one search term;
identifying at least one object associated with at least one search term; and
determining a relevancy score for the object based on whether the object includes structured or unstructured data.
52. The system of claim 51, wherein the plurality of objects includes some objects containing structured data and the relevancy score is determined at least in part on whether the structured data includes restricted or unrestricted fields.
53. The system of claim 52, wherein restricted fields are associated with a first set of modifier values and unrestricted fields are associated with a second set of modifier values.
54. The system of claim 51, wherein at least some of the modifiers are adjustable.
55. The system of claim 51, further comprising:
adjusting the relevancy score based on contributor expertise.
56. The system of claim 51, further comprising:
adjusting the relevancy score based on creator expertise.
57. The system of claim 51, further comprising:
adjusting the relevancy score based on requestor expertise.
58. The system of claim 51, wherein the relevancy score is determined from a function that is based on one or more operators used in the search request.
59. The system of claim 58, wherein the one or more operators include at least one Boolean operator.
60. The system of claim 58, wherein the one or more operators include at least one proximity operator.
61. The system of claim 58, wherein the one or more operators include a combination of Boolean and proximity operators.
62. The system of claim 58, wherein the step of determining a relevancy score comprises:
identifying objects that include structured data;
searching for a first keyword match in the structured data;
determining a first set of intrinsic scores for the first keyword based at least in part on the number of matches in the structured data;
searching for a second keyword match in the structured data, when a second keyword is present in the search request;
determining a second set of intrinsic scores for the second keyword, when present in the search request, based at least in part on the number of matches in the structured data;
searching for subsequent keyword matches in the structured data, when subsequent keywords are present in the search request;
determining subsequent sets of intrinsic scores for subsequent keywords, when present in the search query, based at least in part on the number of matches in the structured data;
modifying the first, second and subsequent sets of intrinsic scores with modifier values to produce first, second and subsequent sets of modified intrinsic scores, wherein the modifier values are based at least in part on whether the structured data is restricted or unrestricted;
determining an adjusted intrinsic score across one or more fields within the structured data from the first set of modified intrinsic scores associated with the first keyword;
determining an adjusted intrinsic score across one or more fields within the structured data from the second set of modified intrinsic scores associated with the second keyword;
determining adjusted intrinsic scores across one or more fields within the structured data from the subsequent sets of modified intrinsic scores associated with the subsequent keywords; and
selecting the relevancy score to be determined by a function aggregating the adjusted intrinsic scores of the first, second and subsequent keywords.
63. The system of claim 62, wherein the adjusted intrinsic score is dependent on one or more operators within the search query.
64. The system of claim 62, wherein the one or more operators include at least one Boolean operator.
65. The system of claim 62, wherein the one or more operators include at least one proximity operator.
66. The system of claim 62, wherein the one or more operators include a combination of Boolean and proximity operators.
67. The system of claim 62, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are normalized.
68. The system of claim 62, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are weighted by a convex function.
69. The system of claim 62, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are weighted by a concave function.
70. The system of claim 62, wherein the adjusted intrinsic scores of the first, second and subsequent keywords are weighted by a partly concave and a partly convex function.
71. The system of claim 70, wherein the partly concave and partly convex function is an inverse logit function.
72. The system of claim 62, wherein the aggregation function is either a Maximum or a Minimum over the adjusted intrinsic scores.
73. The system of claim 62, wherein the aggregation function is based on a sum or average over the adjusted intrinsic scores.
74. The system of claim 62, wherein the aggregation function is one of a group of aggregation functions including a convex function, a concave function or a partly convex and partly concave function over the adjusted intrinsic scores.
75. The system of claim 74, wherein the aggregation function is an inverse logit function.
US10/972,248 2002-06-14 2004-10-22 System and method for harmonizing content relevancy across structured and unstructured data Abandoned US20050086215A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/972,248 US20050086215A1 (en) 2002-06-14 2004-10-22 System and method for harmonizing content relevancy across structured and unstructured data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/172,165 US6892198B2 (en) 2002-06-14 2002-06-14 System and method for personalized information retrieval based on user expertise
US10/972,248 US20050086215A1 (en) 2002-06-14 2004-10-22 System and method for harmonizing content relevancy across structured and unstructured data

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/172,165 Continuation-In-Part US6892198B2 (en) 2002-06-14 2002-06-14 System and method for personalized information retrieval based on user expertise

Publications (1)

Publication Number Publication Date
US20050086215A1 true US20050086215A1 (en) 2005-04-21

Family

ID=29732957

Family Applications (4)

Application Number Title Priority Date Filing Date
US10/172,165 Expired - Fee Related US6892198B2 (en) 2002-06-14 2002-06-14 System and method for personalized information retrieval based on user expertise
US10/971,830 Expired - Fee Related US6978267B2 (en) 2002-06-14 2004-10-21 System and method for personalized information retrieval based on user expertise
US10/972,248 Abandoned US20050086215A1 (en) 2002-06-14 2004-10-22 System and method for harmonizing content relevancy across structured and unstructured data
US11/222,726 Abandoned US20060015488A1 (en) 2002-06-14 2005-09-09 System and method for personalized information retrieval based on user expertise

Family Applications Before (2)

Application Number Title Priority Date Filing Date
US10/172,165 Expired - Fee Related US6892198B2 (en) 2002-06-14 2002-06-14 System and method for personalized information retrieval based on user expertise
US10/971,830 Expired - Fee Related US6978267B2 (en) 2002-06-14 2004-10-21 System and method for personalized information retrieval based on user expertise

Family Applications After (1)

Application Number Title Priority Date Filing Date
US11/222,726 Abandoned US20060015488A1 (en) 2002-06-14 2005-09-09 System and method for personalized information retrieval based on user expertise

Country Status (3)

Country Link
US (4) US6892198B2 (en)
AU (1) AU2003251510A1 (en)
WO (1) WO2003107127A2 (en)

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060057560A1 (en) * 2004-03-05 2006-03-16 Hansen Medical, Inc. System and method for denaturing and fixing collagenous tissue
US20070011134A1 (en) * 2005-07-05 2007-01-11 Justin Langseth System and method of making unstructured data available to structured data analysis tools
US20070033228A1 (en) * 2005-08-03 2007-02-08 Ethan Fassett System and method for dynamically ranking items of audio content
WO2007021386A2 (en) * 2005-07-05 2007-02-22 Clarabridge, Inc. Analysis and transformation tools for structured and unstructured data
US20080016071A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Using Connections Between Users, Tags and Documents to Rank Documents in an Enterprise Search System
US20080016098A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Using Tags in an Enterprise Search System
US20080016061A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Using a Core Data Structure to Calculate Document Ranks
US20080016052A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Using Connections Between Users and Documents to Rank Documents in an Enterprise Search System
US20080016072A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Enterprise-Based Tag System
US20080016053A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Administration Console to Select Rank Factors
US20080195602A1 (en) * 2005-05-10 2008-08-14 Netbreeze Gmbh System and Method for Aggregating and Monitoring Decentrally Stored Multimedia Data
WO2008157810A2 (en) * 2007-06-21 2008-12-24 Baggott Christopher C System and method for compending blogs
US7634475B1 (en) * 2007-03-12 2009-12-15 A9.Com, Inc. Relevance scoring based on optimized keyword characterization field combinations
US20100100554A1 (en) * 2008-10-16 2010-04-22 Carter Stephen R Techniques for measuring the relevancy of content contributions
US20100185653A1 (en) * 2009-01-16 2010-07-22 Google Inc. Populating a structured presentation with new values
US20100185934A1 (en) * 2009-01-16 2010-07-22 Google Inc. Adding new attributes to a structured presentation
US20100185666A1 (en) * 2009-01-16 2010-07-22 Google, Inc. Accessing a search interface in a structured presentation
US20100185651A1 (en) * 2009-01-16 2010-07-22 Google Inc. Retrieving and displaying information from an unstructured electronic document collection
US20100185654A1 (en) * 2009-01-16 2010-07-22 Google Inc. Adding new instances to a structured presentation
US20100306223A1 (en) * 2009-06-01 2010-12-02 Google Inc. Rankings in Search Results with User Corrections
US7849049B2 (en) 2005-07-05 2010-12-07 Clarabridge, Inc. Schema and ETL tools for structured and unstructured data
US20110106819A1 (en) * 2009-10-29 2011-05-05 Google Inc. Identifying a group of related instances
US7974681B2 (en) 2004-03-05 2011-07-05 Hansen Medical, Inc. Robotic catheter system
US20120117116A1 (en) * 2010-11-05 2012-05-10 Apple Inc. Extended Database Search
CN102651011A (en) * 2011-02-27 2012-08-29 祁勇 Method and system for determining document characteristic and user characteristic
CN103390008A (en) * 2012-05-08 2013-11-13 祁勇 Method and system for acquiring personalized features of user
CN103514237A (en) * 2012-06-25 2014-01-15 祁勇 Method and system for obtaining personalized features of user and file
CN103544190A (en) * 2012-07-17 2014-01-29 祁勇 Method and system for acquiring personalized features of users and documents
US8768932B1 (en) * 2007-05-14 2014-07-01 Google Inc. Method and apparatus for ranking search results
US8972411B1 (en) * 2009-01-27 2015-03-03 Google Inc. Selection of sponsored content using multiple sets of query terms
CN104636487A (en) * 2015-02-26 2015-05-20 湖北光谷天下传媒股份有限公司 Advertising information management method
US9477749B2 (en) 2012-03-02 2016-10-25 Clarabridge, Inc. Apparatus for identifying root cause using unstructured data
US11200217B2 (en) * 2016-05-26 2021-12-14 Perfect Search Corporation Structured document indexing and searching
US11416532B2 (en) 2018-05-31 2022-08-16 Wipro Limited Method and device for identifying relevant keywords from documents
US11803918B2 (en) 2015-07-07 2023-10-31 Oracle International Corporation System and method for identifying experts on arbitrary topics in an enterprise social network

Families Citing this family (104)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7249018B2 (en) 2001-01-12 2007-07-24 International Business Machines Corporation System and method for relating syntax and semantics for a conversational speech application
US7743045B2 (en) 2005-08-10 2010-06-22 Google Inc. Detecting spam related and biased contexts for programmable search engines
US7693830B2 (en) 2005-08-10 2010-04-06 Google Inc. Programmable search engine
US7716199B2 (en) 2005-08-10 2010-05-11 Google Inc. Aggregating context data for programmable search engines
US6892198B2 (en) * 2002-06-14 2005-05-10 Entopia, Inc. System and method for personalized information retrieval based on user expertise
US7171620B2 (en) * 2002-07-24 2007-01-30 Xerox Corporation System and method for managing document retention of shared documents
US7249312B2 (en) * 2002-09-11 2007-07-24 Intelligent Results Attribute scoring for unstructured content
US7568148B1 (en) 2002-09-20 2009-07-28 Google Inc. Methods and apparatus for clustering news content
US8090717B1 (en) 2002-09-20 2012-01-03 Google Inc. Methods and apparatus for ranking documents
US7958443B2 (en) 2003-02-28 2011-06-07 Dictaphone Corporation System and method for structuring speech recognized text into a pre-selected document format
US20040243552A1 (en) * 2003-05-30 2004-12-02 Dictaphone Corporation Method, system, and apparatus for viewing data
US20040243545A1 (en) * 2003-05-29 2004-12-02 Dictaphone Corporation Systems and methods utilizing natural language medical records
US8290958B2 (en) * 2003-05-30 2012-10-16 Dictaphone Corporation Method, system, and apparatus for data reuse
US7577655B2 (en) 2003-09-16 2009-08-18 Google Inc. Systems and methods for improving the ranking of news articles
US20050120300A1 (en) * 2003-09-25 2005-06-02 Dictaphone Corporation Method, system, and apparatus for assembly, transport and display of clinical data
US7860717B2 (en) * 2003-09-25 2010-12-28 Dictaphone Corporation System and method for customizing speech recognition input and output
US8321278B2 (en) * 2003-09-30 2012-11-27 Google Inc. Targeted advertisements based on user profiles and page profile
US7693827B2 (en) * 2003-09-30 2010-04-06 Google Inc. Personalization of placed content ordering in search results
US20050222989A1 (en) * 2003-09-30 2005-10-06 Taher Haveliwala Results based personalization of advertisements in a search engine
US7542909B2 (en) * 2003-09-30 2009-06-02 Dictaphone Corporation Method, system, and apparatus for repairing audio recordings
US20050071328A1 (en) * 2003-09-30 2005-03-31 Lawrence Stephen R. Personalization of web search
US8024176B2 (en) * 2003-09-30 2011-09-20 Dictaphone Corporation System, method and apparatus for prediction using minimal affix patterns
US7996223B2 (en) * 2003-10-01 2011-08-09 Dictaphone Corporation System and method for post processing speech recognition output
US7774196B2 (en) * 2003-10-01 2010-08-10 Dictaphone Corporation System and method for modifying a language model and post-processor information
US20050144184A1 (en) * 2003-10-01 2005-06-30 Dictaphone Corporation System and method for document section segmentation
US7818308B2 (en) * 2003-10-01 2010-10-19 Nuance Communications, Inc. System and method for document section segmentation
GB0326903D0 (en) * 2003-11-19 2003-12-24 Ibm System and method for software debugging
WO2005050474A2 (en) 2003-11-21 2005-06-02 Philips Intellectual Property & Standards Gmbh Text segmentation and label assignment with user interaction by means of topic specific language models and topic-specific label statistics
US7315811B2 (en) * 2003-12-31 2008-01-01 Dictaphone Corporation System and method for accented modification of a language model
US7783474B2 (en) * 2004-02-27 2010-08-24 Nuance Communications, Inc. System and method for generating a phrase pronunciation
CA2498728A1 (en) * 2004-02-27 2005-08-27 Dictaphone Corporation A system and method for normalization of a string of words
US7716223B2 (en) 2004-03-29 2010-05-11 Google Inc. Variable personalization of search results in a search engine
US7379946B2 (en) 2004-03-31 2008-05-27 Dictaphone Corporation Categorization of information using natural language processing and predefined templates
US7565630B1 (en) 2004-06-15 2009-07-21 Google Inc. Customization of search results for search queries received from third party sites
WO2006007194A1 (en) * 2004-06-25 2006-01-19 Personasearch, Inc. Dynamic search processor
US8620915B1 (en) 2007-03-13 2013-12-31 Google Inc. Systems and methods for promoting personalized search results based on personal information
US8078607B2 (en) * 2006-03-30 2011-12-13 Google Inc. Generating website profiles based on queries from websites and user activities on the search results
US9002783B2 (en) * 2004-09-17 2015-04-07 Go Daddy Operating Company, LLC Web page customization based on expertise level of a user
US9009100B2 (en) * 2004-09-17 2015-04-14 Go Daddy Operating Company, LLC Web page customization based on a search term expertise level of a user
US7680901B2 (en) * 2004-09-17 2010-03-16 Go Daddy Group, Inc. Customize a user interface of a web page using an expertise level rules engine
US7340672B2 (en) * 2004-09-20 2008-03-04 Intel Corporation Providing data integrity for data streams
US8874570B1 (en) 2004-11-30 2014-10-28 Google Inc. Search boost vector based on co-visitation information
US20060122974A1 (en) * 2004-12-03 2006-06-08 Igor Perisic System and method for a dynamic content driven rendering of social networks
US7620631B2 (en) * 2005-03-21 2009-11-17 Microsoft Corporation Pyramid view
US7694212B2 (en) 2005-03-31 2010-04-06 Google Inc. Systems and methods for providing a graphical display of search activity
US20060224583A1 (en) * 2005-03-31 2006-10-05 Google, Inc. Systems and methods for analyzing a user's web history
US8990193B1 (en) 2005-03-31 2015-03-24 Google Inc. Method, system, and graphical user interface for improved search result displays via user-specified annotations
US9256685B2 (en) * 2005-03-31 2016-02-09 Google Inc. Systems and methods for modifying search results based on a user's history
US8166028B1 (en) 2005-03-31 2012-04-24 Google Inc. Method, system, and graphical user interface for improved searching via user-specified annotations
US7747632B2 (en) * 2005-03-31 2010-06-29 Google Inc. Systems and methods for providing subscription-based personalization
US7444350B1 (en) * 2005-03-31 2008-10-28 Emc Corporation Method and apparatus for processing management information
US7783631B2 (en) 2005-03-31 2010-08-24 Google Inc. Systems and methods for managing multiple user accounts
US20060224608A1 (en) * 2005-03-31 2006-10-05 Google, Inc. Systems and methods for combining sets of favorites
US8589391B1 (en) 2005-03-31 2013-11-19 Google Inc. Method and system for generating web site ratings for a user
US8069411B2 (en) * 2005-07-05 2011-11-29 Dictaphone Corporation System and method for auto-reuse of document text
US7565358B2 (en) * 2005-08-08 2009-07-21 Google Inc. Agent rank
WO2007032003A2 (en) * 2005-09-13 2007-03-22 Yedda, Inc. Device, system and method of handling user requests
US8095419B1 (en) * 2005-10-17 2012-01-10 Yahoo! Inc. Search score for the determination of search quality
US7801910B2 (en) * 2005-11-09 2010-09-21 Ramp Holdings, Inc. Method and apparatus for timed tagging of media content
US7664746B2 (en) * 2005-11-15 2010-02-16 Microsoft Corporation Personalized search and headlines
US7925649B2 (en) 2005-12-30 2011-04-12 Google Inc. Method, system, and graphical user interface for alerting a computer user to new results for a prior search
US8250061B2 (en) * 2006-01-30 2012-08-21 Yahoo! Inc. Learning retrieval functions incorporating query differentiation for information retrieval
US7603350B1 (en) 2006-05-09 2009-10-13 Google Inc. Search result ranking based on trust
US9443022B2 (en) 2006-06-05 2016-09-13 Google Inc. Method, system, and graphical user interface for providing personalized recommendations of popular search queries
US7792967B2 (en) * 2006-07-14 2010-09-07 Chacha Search, Inc. Method and system for sharing and accessing resources
US7801879B2 (en) * 2006-08-07 2010-09-21 Chacha Search, Inc. Method, system, and computer readable storage for affiliate group searching
US20120158648A1 (en) * 2006-08-09 2012-06-21 Mark Watkins Method and system for dynamically rating an entity's attribute by receiving a validation value from an attribute of a qualifying entity
US8346555B2 (en) 2006-08-22 2013-01-01 Nuance Communications, Inc. Automatic grammar tuning using statistical language model generation
US8769099B2 (en) * 2006-12-28 2014-07-01 Yahoo! Inc. Methods and systems for pre-caching information on a mobile computing device
US20080244375A1 (en) * 2007-02-09 2008-10-02 Healthline Networks, Inc. Hyperlinking Text in Document Content Using Multiple Concept-Based Indexes Created Over a Structured Taxonomy
US8161040B2 (en) * 2007-04-30 2012-04-17 Piffany, Inc. Criteria-specific authority ranking
US7996400B2 (en) * 2007-06-23 2011-08-09 Microsoft Corporation Identification and use of web searcher expertise
EP2181406A1 (en) * 2007-07-11 2010-05-05 Koninklijke Philips Electronics N.V. Method of operating an information retrieval system
US8577894B2 (en) 2008-01-25 2013-11-05 Chacha Search, Inc Method and system for access to restricted resources
US8706643B1 (en) 2009-01-13 2014-04-22 Amazon Technologies, Inc. Generating and suggesting phrases
US8423349B1 (en) * 2009-01-13 2013-04-16 Amazon Technologies, Inc. Filtering phrases for an identifier
US8768852B2 (en) * 2009-01-13 2014-07-01 Amazon Technologies, Inc. Determining phrases related to other phrases
US9569770B1 (en) 2009-01-13 2017-02-14 Amazon Technologies, Inc. Generating constructed phrases
US8706644B1 (en) 2009-01-13 2014-04-22 Amazon Technologies, Inc. Mining phrases for association with a user
US9600581B2 (en) * 2009-02-19 2017-03-21 Yahoo! Inc. Personalized recommendations on dynamic content
US9298700B1 (en) 2009-07-28 2016-03-29 Amazon Technologies, Inc. Determining similar phrases
US10007712B1 (en) 2009-08-20 2018-06-26 Amazon Technologies, Inc. Enforcing user-specified rules
US8606792B1 (en) 2010-02-08 2013-12-10 Google Inc. Scoring authors of posts
US8799658B1 (en) 2010-03-02 2014-08-05 Amazon Technologies, Inc. Sharing media items with pass phrases
US8768723B2 (en) 2011-02-18 2014-07-01 Nuance Communications, Inc. Methods and apparatus for formatting text for clinical fact extraction
US8694335B2 (en) 2011-02-18 2014-04-08 Nuance Communications, Inc. Methods and apparatus for applying user corrections to medical fact extraction
US9679107B2 (en) 2011-02-18 2017-06-13 Nuance Communications, Inc. Physician and clinical documentation specialist workflow integration
US8738403B2 (en) 2011-02-18 2014-05-27 Nuance Communications, Inc. Methods and apparatus for updating text in clinical documentation
US10032127B2 (en) 2011-02-18 2018-07-24 Nuance Communications, Inc. Methods and apparatus for determining a clinician's intent to order an item
US10460288B2 (en) 2011-02-18 2019-10-29 Nuance Communications, Inc. Methods and apparatus for identifying unspecified diagnoses in clinical documentation
US9916420B2 (en) 2011-02-18 2018-03-13 Nuance Communications, Inc. Physician and clinical documentation specialist workflow integration
US9904768B2 (en) 2011-02-18 2018-02-27 Nuance Communications, Inc. Methods and apparatus for presenting alternative hypotheses for medical facts
US8788289B2 (en) 2011-02-18 2014-07-22 Nuance Communications, Inc. Methods and apparatus for linking extracted clinical facts to text
US8799021B2 (en) 2011-02-18 2014-08-05 Nuance Communications, Inc. Methods and apparatus for analyzing specificity in clinical documentation
US9239872B2 (en) * 2012-10-11 2016-01-19 Nuance Communications, Inc. Data store organizing data using semantic classification
US9836551B2 (en) * 2013-01-08 2017-12-05 International Business Machines Corporation GUI for viewing and manipulating connected tag clouds
US9779182B2 (en) * 2013-06-07 2017-10-03 Microsoft Technology Licensing, Llc Semantic grouping in search
US9779145B2 (en) 2014-01-24 2017-10-03 Nektoon Ag Variable result set size based on user expectation
US10031976B2 (en) * 2014-04-07 2018-07-24 Paypal, Inc. Personalization platform
US9501211B2 (en) 2014-04-17 2016-11-22 GoDaddy Operating Company, LLC User input processing for allocation of hosting server resources
US9660933B2 (en) 2014-04-17 2017-05-23 Go Daddy Operating Company, LLC Allocating and accessing hosting server resources via continuous resource availability updates
US11475048B2 (en) 2019-09-18 2022-10-18 Salesforce.Com, Inc. Classifying different query types
US11727040B2 (en) * 2021-08-06 2023-08-15 On Time Staffing, Inc. Monitoring third-party forum contributions to improve searching through time-to-live data assignments
US11907652B2 (en) 2022-06-02 2024-02-20 On Time Staffing, Inc. User interface and systems for document creation

Citations (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US16786A (en) * 1857-03-10 Melodeon
US34639A (en) * 1862-03-11 Improved clothes-wringer
US5535382A (en) * 1989-07-31 1996-07-09 Ricoh Company, Ltd. Document retrieval system involving ranking of documents in accordance with a degree to which the documents fulfill a retrieval condition corresponding to a user entry
US5544049A (en) * 1992-09-29 1996-08-06 Xerox Corporation Method for performing a search of a plurality of documents for similarity to a plurality of query words
US5576954A (en) * 1993-11-05 1996-11-19 University Of Central Florida Process for determination of text relevancy
US5642502A (en) * 1994-12-06 1997-06-24 University Of Central Florida Method and system for searching for relevant documents from a text database collection, using statistical ranking, relevancy feedback and small pieces of text
US5724571A (en) * 1995-07-07 1998-03-03 Sun Microsystems, Inc. Method and apparatus for generating query responses in a computer-based document retrieval system
US5822747A (en) * 1996-08-23 1998-10-13 Tandem Computers, Inc. System and method for optimizing database queries
US5870740A (en) * 1996-09-30 1999-02-09 Apple Computer, Inc. System and method for improving the ranking of information retrieval results for short queries
US5911140A (en) * 1995-12-14 1999-06-08 Xerox Corporation Method of ordering document clusters given some knowledge of user interests
US5933822A (en) * 1997-07-22 1999-08-03 Microsoft Corporation Apparatus and methods for an information retrieval system that employs natural language processing of search results to improve overall precision
US6009422A (en) * 1997-11-26 1999-12-28 International Business Machines Corporation System and method for query translation/semantic translation using generalized query language
US6012053A (en) * 1997-06-23 2000-01-04 Lycos, Inc. Computer system with user-controlled relevance ranking of search results
US6115709A (en) * 1998-09-18 2000-09-05 Tacit Knowledge Systems, Inc. Method and system for constructing a knowledge profile of a user having unrestricted and restricted access portions according to respective levels of confidence of content of the portions
US6119114A (en) * 1996-09-17 2000-09-12 Smadja; Frank Method and apparatus for dynamic relevance ranking
US6167397A (en) * 1997-09-23 2000-12-26 At&T Corporation Method of clustering electronic documents in response to a search query
US6182068B1 (en) * 1997-08-01 2001-01-30 Ask Jeeves, Inc. Personalized search methods
US6189002B1 (en) * 1998-12-14 2001-02-13 Dolphin Search Process and system for retrieval of documents using context-relevant semantic profiles
US6199067B1 (en) * 1999-01-20 2001-03-06 Mightiest Logicon Unisearch, Inc. System and method for generating personalized user profiles and for utilizing the generated user profiles to perform adaptive internet searches
US6243713B1 (en) * 1998-08-24 2001-06-05 Excalibur Technologies Corp. Multimedia document retrieval by application of multimedia queries to a unified index of multimedia data for a plurality of multimedia data types
US6289353B1 (en) * 1997-09-24 2001-09-11 Webmd Corporation Intelligent query system for automatically indexing in a database and automatically categorizing users
US6304864B1 (en) * 1999-04-20 2001-10-16 Textwise Llc System for retrieving multimedia information from the internet using multiple evolving intelligent agents
US6311194B1 (en) * 2000-03-15 2001-10-30 Taalee, Inc. System and method for creating a semantic web and its applications in browsing, searching, profiling, personalization and advertising
US6327590B1 (en) * 1999-05-05 2001-12-04 Xerox Corporation System and method for collaborative ranking of search results employing user and group profiles derived from document collection content analysis
US6363373B1 (en) * 1998-10-01 2002-03-26 Microsoft Corporation Method and apparatus for concept searching using a Boolean or keyword search engine
US6377983B1 (en) * 1998-08-31 2002-04-23 International Business Machines Corporation Method and system for converting expertise based on document usage
US20020120619A1 (en) * 1999-11-26 2002-08-29 High Regard, Inc. Automated categorization, placement, search and retrieval of user-contributed items
US6460036B1 (en) * 1994-11-29 2002-10-01 Pinpoint Incorporated System and method for providing customized electronic newspapers and target advertisements
US20020152208A1 (en) * 2001-03-07 2002-10-17 The Mitre Corporation Method and system for finding similar records in mixed free-text and structured data
US6687696B2 (en) * 2000-07-26 2004-02-03 Recommind Inc. System and method for personalized search, information filtering, and for generating recommendations utilizing statistical latent class models
US20040039734A1 (en) * 2002-05-14 2004-02-26 Judd Douglass Russell Apparatus and method for region sensitive dynamically configurable document relevance ranking
US6718365B1 (en) * 2000-04-13 2004-04-06 International Business Machines Corporation Method, system, and program for ordering search results using an importance weighting
US6892198B2 (en) * 2002-06-14 2005-05-10 Entopia, Inc. System and method for personalized information retrieval based on user expertise

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US2002A (en) * 1841-03-12 Tor and planter for plowing
US2001A (en) * 1841-03-12 Sawmill
US4689743A (en) * 1986-02-11 1987-08-25 Andrew Chiu Method and an apparatus for validating the electronic encoding of an ideographic character
US6493702B1 (en) 1999-05-05 2002-12-10 Xerox Corporation System and method for searching and recommending documents in a collection using share bookmarks
US20010034639A1 (en) 2000-03-10 2001-10-25 Jacoby Jennifer B. System and method for matching aggregated user experience data to a user profile


Cited By (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7974681B2 (en) 2004-03-05 2011-07-05 Hansen Medical, Inc. Robotic catheter system
US7976539B2 (en) 2004-03-05 2011-07-12 Hansen Medical, Inc. System and method for denaturing and fixing collagenous tissue
US20060057560A1 (en) * 2004-03-05 2006-03-16 Hansen Medical, Inc. System and method for denaturing and fixing collagenous tissue
US20080195602A1 (en) * 2005-05-10 2008-08-14 Netbreeze Gmbh System and Method for Aggregating and Monitoring Decentrally Stored Multimedia Data
US20070011134A1 (en) * 2005-07-05 2007-01-11 Justin Langseth System and method of making unstructured data available to structured data analysis tools
WO2007021386A2 (en) * 2005-07-05 2007-02-22 Clarabridge, Inc. Analysis and transformation tools for structured and unstructured data
WO2007021386A3 (en) * 2005-07-05 2007-09-20 Clarabridge Inc Analysis and transformation tools for structured and unstructured data
US7849048B2 (en) 2005-07-05 2010-12-07 Clarabridge, Inc. System and method of making unstructured data available to structured data analysis tools
US7849049B2 (en) 2005-07-05 2010-12-07 Clarabridge, Inc. Schema and ETL tools for structured and unstructured data
US20070033228A1 (en) * 2005-08-03 2007-02-08 Ethan Fassett System and method for dynamically ranking items of audio content
US7849070B2 (en) * 2005-08-03 2010-12-07 Yahoo! Inc. System and method for dynamically ranking items of audio content
US20080016053A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Administration Console to Select Rank Factors
US20080016072A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Enterprise-Based Tag System
US8204888B2 (en) 2006-07-14 2012-06-19 Oracle International Corporation Using tags in an enterprise search system
US20080016052A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Using Connections Between Users and Documents to Rank Documents in an Enterprise Search System
US20080016061A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Using a Core Data Structure to Calculate Document Ranks
US20110125760A1 (en) * 2006-07-14 2011-05-26 Bea Systems, Inc. Using tags in an enterprise search system
US7873641B2 (en) 2006-07-14 2011-01-18 Bea Systems, Inc. Using tags in an enterprise search system
US20080016098A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Using Tags in an Enterprise Search System
US20080016071A1 (en) * 2006-07-14 2008-01-17 Bea Systems, Inc. Using Connections Between Users, Tags and Documents to Rank Documents in an Enterprise Search System
US7634475B1 (en) * 2007-03-12 2009-12-15 A9.Com, Inc. Relevance scoring based on optimized keyword characterization field combinations
US8768932B1 (en) * 2007-05-14 2014-07-01 Google Inc. Method and apparatus for ranking search results
WO2008157810A2 (en) * 2007-06-21 2008-12-24 Baggott Christopher C System and method for compending blogs
US10360272B2 (en) 2007-06-21 2019-07-23 Oracle International Corporation System and method for compending blogs
US9208245B2 (en) * 2007-06-21 2015-12-08 Oracle International Corporation System and method for compending blogs
US20100185664A1 (en) * 2007-06-21 2010-07-22 Baggott Christopher C System and method for compending blogs
WO2008157810A3 (en) * 2007-06-21 2009-03-05 Christopher C Baggott System and method for compending blogs
US8108402B2 (en) 2008-10-16 2012-01-31 Oracle International Corporation Techniques for measuring the relevancy of content contributions
US20100100554A1 (en) * 2008-10-16 2010-04-22 Carter Stephen R Techniques for measuring the relevancy of content contributions
US8615707B2 (en) 2009-01-16 2013-12-24 Google Inc. Adding new attributes to a structured presentation
US20100185653A1 (en) * 2009-01-16 2010-07-22 Google Inc. Populating a structured presentation with new values
US20100185934A1 (en) * 2009-01-16 2010-07-22 Google Inc. Adding new attributes to a structured presentation
US20100185654A1 (en) * 2009-01-16 2010-07-22 Google Inc. Adding new instances to a structured presentation
US20100185666A1 (en) * 2009-01-16 2010-07-22 Google, Inc. Accessing a search interface in a structured presentation
US20100185651A1 (en) * 2009-01-16 2010-07-22 Google Inc. Retrieving and displaying information from an unstructured electronic document collection
US8412749B2 (en) 2009-01-16 2013-04-02 Google Inc. Populating a structured presentation with new values
US8977645B2 (en) 2009-01-16 2015-03-10 Google Inc. Accessing a search interface in a structured presentation
US8452791B2 (en) 2009-01-16 2013-05-28 Google Inc. Adding new instances to a structured presentation
US8924436B1 (en) 2009-01-16 2014-12-30 Google Inc. Populating a structured presentation with new values
US8972411B1 (en) * 2009-01-27 2015-03-03 Google Inc. Selection of sponsored content using multiple sets of query terms
US20100306223A1 (en) * 2009-06-01 2010-12-02 Google Inc. Rankings in Search Results with User Corrections
US20110106819A1 (en) * 2009-10-29 2011-05-05 Google Inc. Identifying a group of related instances
US8442982B2 (en) * 2010-11-05 2013-05-14 Apple Inc. Extended database search
US20120117116A1 (en) * 2010-11-05 2012-05-10 Apple Inc. Extended Database Search
US9009201B2 (en) * 2010-11-05 2015-04-14 Apple Inc. Extended database search
CN102651011A (en) * 2011-02-27 2012-08-29 祁勇 Method and system for determining document characteristic and user characteristic
US9477749B2 (en) 2012-03-02 2016-10-25 Clarabridge, Inc. Apparatus for identifying root cause using unstructured data
US10372741B2 (en) 2012-03-02 2019-08-06 Clarabridge, Inc. Apparatus for automatic theme detection from unstructured data
CN103390008A (en) * 2012-05-08 2013-11-13 祁勇 Method and system for acquiring personalized features of user
CN103514237A (en) * 2012-06-25 2014-01-15 祁勇 Method and system for obtaining personalized features of user and file
CN103544190A (en) * 2012-07-17 2014-01-29 祁勇 Method and system for acquiring personalized features of users and documents
CN104636487A (en) * 2015-02-26 2015-05-20 湖北光谷天下传媒股份有限公司 Advertising information management method
US11803918B2 (en) 2015-07-07 2023-10-31 Oracle International Corporation System and method for identifying experts on arbitrary topics in an enterprise social network
US11200217B2 (en) * 2016-05-26 2021-12-14 Perfect Search Corporation Structured document indexing and searching
US11416532B2 (en) 2018-05-31 2022-08-16 Wipro Limited Method and device for identifying relevant keywords from documents

Also Published As

Publication number Publication date
US20060015488A1 (en) 2006-01-19
WO2003107127A3 (en) 2004-04-08
US6978267B2 (en) 2005-12-20
AU2003251510A8 (en) 2003-12-31
US20030233345A1 (en) 2003-12-18
US6892198B2 (en) 2005-05-10
WO2003107127A2 (en) 2003-12-24
US20050055346A1 (en) 2005-03-10
AU2003251510A1 (en) 2003-12-31

Similar Documents

Publication Publication Date Title
US6892198B2 (en) System and method for personalized information retrieval based on user expertise
US11036814B2 (en) Search engine that applies feedback from users to improve search results
US7577643B2 (en) Key phrase extraction from query logs
Si et al. A semisupervised learning method to merge search engine results
Huang et al. Relevant term suggestion in interactive web search based on contextual information in query session logs
Si et al. Using sampled data and regression to merge search engine results
US8606781B2 (en) Systems and methods for personalized search
Xing et al. Weighted pagerank algorithm
US7020679B2 (en) Two-level internet search service system
US8676784B2 (en) Relevant individual searching using managed property and ranking features
US20090070325A1 (en) Identifying Information Related to a Particular Entity from Electronic Sources
Amitay et al. Social search and discovery using a unified approach
US20110055185A1 (en) Interactive user-controlled search direction for retrieved information in an information search system
US20060242133A1 (en) Systems and methods for collaborative searching
US8180751B2 (en) Using an encyclopedia to build user profiles
US8977630B1 (en) Personalizing search results
CA2545232A1 (en) Method and system for creating a taxonomy from business-oriented metadata content
KR20060017765A (en) Concept network
Yang Information retrieval on the web.
Chakrabarti et al. Using Memex to archive and mine community Web browsing experience
Delbru et al. Sindice at semsearch 2010
Gauch et al. Ontology-based user profiles for personalized search
Veningston et al. Semantic association ranking schemes for information retrieval applications using term association graph representation
Sunitha et al. A comparative study over search engine optimization on precision and recall ratio
Balabantaray et al. A case study for ranking of relevant search results

Legal Events

Date Code Title Description
AS Assignment

Owner name: ENTOPIA, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:PERISIC, IGOR;REEL/FRAME:015930/0769

Effective date: 20041022

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION