WO1997038376A2 - A system, software and method for locating information in a collection of text-based information sources - Google Patents

A system, software and method for locating information in a collection of text-based information sources

Info

Publication number
WO1997038376A2
Authority
WO
WIPO (PCT)
Prior art keywords
linguistic
terms
text
knowledge base
language
Prior art date
Application number
PCT/IB1997/000748
Other languages
French (fr)
Other versions
WO1997038376A3 (en)
Inventor
Yuval Levi
Haim Margulis
Iris Arad
Original Assignee
Flair Technologies, Ltd.
Priority date
Filing date
Publication date
Application filed by Flair Technologies, Ltd.
Priority to EP97925221A (patent EP0934569A2)
Priority to JP9532080A (patent JP2000507008A)
Publication of WO1997038376A2
Publication of WO1997038376A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries

Definitions

  • the present invention relates generally to the field of information retrieval. More particularly, the present invention relates to information management systems and computational linguistic systems for finding information related to a user-input query, in a collection of text-based information sources.
  • data is stored in a strictly structured environment.
  • Such systems may be based upon tables of records or spreadsheet models, for example.
  • Such systems may be flat or may be relational with respect to how records in the database are associated with each other.
  • conventional database management systems generally require structured records in which one or more fields may be searchable, i.e. are key fields.
  • key fields use terms, e.g. numbering systems, labels, etc., in a consistent manner which facilitates searching with known query values, i.e. combinations of numbers, labels, etc.
  • full-text searching has developed.
  • Full-text searching of a collection of text-based information sources permits a user to write a query containing terms known to be used in relevant documents.
  • the collection of documents is first fully indexed and the words of the documents in the index are compared with the query terms.
  • an exact match between a query term and an index entry must be found in order to identify a relevant document. Spelling errors, word variants, etc. will tend to prevent finding all relevant documents.
  • a technique called wild-carding may be used to partially alleviate this problem, but many irrelevant documents, referred to as "noise," often turn up when wild-carding is used.
  • Pinkas G., Natural Language Full-Text Retrieval System, Master's Thesis, University of Jerusalem, 1985, discloses a system which automatically expands a user's query to include additional relevant terms in a manner more noise-free than simple wild-carding.
  • the Pinkas system (1) receives a user query composed of query-words and boolean operators; (2) expands the query linguistically, i.e. by referring to a pre-processed database of morphological and phonetic information; (3) expands the query associatively, i.e.
  • query expansion depended upon a set of linguistic rules which were developed by an expert in the language to be processed.
  • the set of linguistic rules was both extensive and relatively inflexible, since as many characteristics of the input language as possible had to be accounted for before processing any text-based information sources.
  • Development of the linguistic rules for each language to be processed was a very labor-intensive and time-consuming task.
  • aspects of the present invention solving the problems of the prior art include at least a system, software and a method for processing information contained in a collection of text-based information sources.
  • the system may include a computer or data processor and software structured as one or more software modules, units or functions which when executed in a specified order by the computer or data processor perform the desired information processing task.
  • One or more software modules, units or functions may be made available in conventional manner as either compile-time or run-time library entries which may be referred to by a software program which is written in a manner to be aware of such a library.
  • the present invention further provides a method to process query-concepts and transform them to an expanded/improved query using an expansion matrix providing great flexibility, high accuracy and low noise output.
  • a text-based information processing system comprising an automatic linguistic knowledge base generator having an input receiving a collection of text-based information sources and which produces a linguistic knowledge base; an index generator having inputs receiving a collection of text-based information sources and the linguistic knowledge base and which produces an index of the received text-based information and further which updates the linguistic knowledge base to reflect the inputs to the index generator and maintain correlation between the index and the linguistic knowledge base; a query processor having inputs receiving a query composed by an operator, the linguistic knowledge base, the index and a thesaurus and which produces a list of locations in the collection of text-based information sources relevant to the query.
  • the text-based information processing system may be subject to numerous modifications and variations.
  • the automatic linguistic knowledge base generator, the automatic index generator and the query processor may be embodied in various ways.
  • an automatic linguistic knowledge base generator may comprise a parser, receiving an input stream of terms and producing individual terms; a language recognizer connected to receive the individual terms from the parser and which produces an output indicative of a language to which each individual term belongs; a normalizer connected to receive the individual terms and further to receive linguistic rules for the language indicated by the output of the language recognizer and producing normalized terms; and a linguistic expander connected to receive the legal individual terms and producing entries stored in the linguistic knowledge base.
  • an automatic indexer may comprise a parser, receiving an input stream of terms and producing individual terms; a language recognizer connected to receive the individual terms from the parser and which produces an output indicative of a language to which each individual term belongs; a normalizer connected to receive the individual terms and further to receive linguistic rules for the language indicated by the output of the language recognizer and producing normalized terms; and an index entry generator connected to receive the legal individual terms and producing entries stored in the index when the terms have not previously been indexed and modifying an existing index entry when the terms have previously been indexed.
  • an expansion unit for expanding terms in a language may comprise an associative expander having an input receiving a term and having an output representing the term and at least one associated term found by the associated expander making reference to a thesaurus; and a linguistic expander having an input connected to the output of the associative expander and having an output representing the input of the linguistic expander and at least one term linguistically related to the input of the linguistic expander and found by reference to a linguistic knowledge base for the language.
  • the normalizers recited above may be constructed of two units.
  • the first normalizer unit may be connected to receive the individual terms and the linguistic rules and producing terms from which illegal characters have been removed; and the second normalizer unit may then be connected to receive the terms from which illegal characters have been removed and the linguistic rules and which produces normalized terms including word stems found by applying the linguistic rules to the terms from which illegal characters have been removed.
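  • By way of non-limiting illustration, the two-unit normalizer described above might be sketched in Python as follows. The rule table `ENGLISH_RULES`, its keys, and the function names are assumptions made for this sketch; the source does not disclose the format of its linguistic rules, and simple suffix stripping stands in for whatever stemming rules an embodiment would apply.

```python
import string

# Hypothetical rule table for one language (not taken from the source).
ENGLISH_RULES = {
    "legal_chars": set(string.ascii_lowercase + "'-"),  # assumed alphabet
    "suffixes": ["ing", "ed", "es", "s"],               # tried in this order
    "min_stem_len": 3,                                   # avoid over-stripping
}


def remove_illegal_chars(term: str, rules: dict) -> str:
    """First unit: lower-case the term and drop characters outside the alphabet."""
    return "".join(ch for ch in term.lower() if ch in rules["legal_chars"])


def stem(term: str, rules: dict) -> str:
    """Second unit: strip the first matching suffix, keeping a minimum stem length."""
    for suffix in rules["suffixes"]:
        if term.endswith(suffix) and len(term) - len(suffix) >= rules["min_stem_len"]:
            return term[: -len(suffix)]
    return term


def normalize(term: str, rules: dict) -> tuple[str, str]:
    """Run both units in sequence; return (cleaned term, word stem)."""
    cleaned = remove_illegal_chars(term, rules)
    return cleaned, stem(cleaned, rules)
```

  • Under these assumed rules, `normalize("Index#ing!", ENGLISH_RULES)` yields the cleaned term `indexing` and the stem `index`. Keeping the two units separate lets the same cleaning stage feed either the knowledge base generator or the indexer.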
  • Fig. 1 is a schematic block diagram of a computer or data processing system on which the present invention may be practiced;
  • Fig. 2 is a schematic block diagram of the memory of Fig. 1;
  • Fig. 3 is a flow chart of automatic linguistic knowledge base generation;
  • Fig. 4 is a flow chart of automatic index generation;
  • Fig. 5 is a flow chart of query expansion;
  • Fig. 6 is a flow chart of an information retrieval system including the features illustrated in Figs. 3-5.
  • a "language" is considered to be any organized system of tokens, which have symbolic meaning.
  • the tokens are referred to hereinafter as "words" or "terms," since the most common types of languages dealt with by text-based information systems are natural human languages composed of words or combinations of words, i.e. terms, which are understood to have specific meanings by humans.
  • "words" and "terms" are intended to encompass word phrases in those instances where a word phrase may in fact be a token having a meaning independent of its sub-units, keywords in those instances where a word or word-phrase has a specific context/importance, and artificial words such as acronyms and short-cuts.
  • a "word base" is the base portion of a word which remains after removing all prefixes and suffixes of the word which modify the meaning or part of speech of the word root appropriately for the context in which the word may be used.
  • thesaurus refers to a database of terms, words and/or word bases, in which each term, word or word base is associated in the database with other terms, words and word bases having a defined relationship such as morphological proximity, phonetic similarity, similar meaning (synonyms), nearly opposite meaning (antonyms), broader meaning, narrower meaning, related term in a specific context, etc.
  • the database may be navigated or searched on the basis of the terms, words and/or word bases stored therein.
  • Languages considered here have known linguistic rules for the morphological and phonetic variations which words may undergo.
  • the morphological rules of a language may define how a plural is formed from a singular noun, by changing the shape of the word, i.e. adding a final "s" in English, while the phonetic rules may represent the common variations in spelling resulting from user spelling errors.
  • a table, file or database may be used by a software program to hold a list of such linguistic rules.
  • languages also include words which do not follow the linguistic rules of the language.
  • the English language morphological rule for generating the past tense of a verb does not apply to the verb "to go," which becomes "went" rather than the nonsensical "goed." Therefore, exceptions to the rules may be held by a software program in a table of exceptions, so that words which do not obey the rules may be handled as accurately as words which do obey the rules.
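  • The interplay between a rule table and a table of exceptions can be illustrated with a short Python sketch. The table contents and function names below are assumptions chosen for illustration, and the "regular" rules are deliberately simplified; they are not the rules of any disclosed embodiment.

```python
# Hypothetical tables of exceptions (irregular forms), keyed by base word.
IRREGULAR_PAST = {"go": "went", "be": "was", "have": "had"}
IRREGULAR_PLURAL = {"child": "children", "man": "men"}


def past_tense(verb: str) -> str:
    """Consult the exception table first, then fall back to the regular rule."""
    verb = verb.lower()
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    # Simplified regular rule: add "d" after a final "e", otherwise "ed".
    return verb + ("d" if verb.endswith("e") else "ed")


def plural(noun: str) -> str:
    """Consult the exception table first, then add a final "s"."""
    noun = noun.lower()
    return IRREGULAR_PLURAL.get(noun, noun + "s")
```

  • Checking the exception table before applying the rule is what prevents the nonsensical form: `past_tense("go")` returns `went`, while `past_tense("walk")` returns `walked`.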
  • a "linguistic knowledge base" is developed by applying the linguistic rules and the one or more tables of exceptions or irregular forms to a large body of textual information to produce an efficient, adaptable and flexible representation of the variations of word bases which produce meaning, particularly in natural languages, but also in language generally.
  • a computer system or data processing system generally includes a processor 101, a memory 103, one or more input devices 105, and one or more output devices 107, all interconnected through an interconnection mechanism 109.
  • Many variations of this basic plan are possible. For example, viable systems may lack input devices 105 and output devices 107, communicating entirely through interactions with the memory 103 by external devices (not shown). Also, distributed computer systems and data processing systems are contemplated as falling within this basic plan.
  • the interconnection mechanism 109 may be an internal system bus of a personal computer or may be the Internet, through which a processor 101 interacts with a database stored on a remote memory 103.
  • Memory 103 may be classified into two categories useful to this discussion, long term memory (also called non-volatile memory), and short term memory (also called volatile memory). These two types of memory are often both used in computer systems and data processing systems, as shown in Fig. 2.
  • Volatile memory 201 such as integrated circuit random access memory (RAM) is often used in close physical proximity to the processor 101 because the technologies in which such volatile memory 201 is most readily realized produce fast access times, such as are desirable to support fast processors 101.
  • Non-volatile memory 203 is often used to store massive quantities of data for longer periods of time because it can be more cheaply constructed than volatile memory of a similar capacity.
  • Non-volatile memory 203 is often implemented as magnetic or optical disk or tape storage units, which provide a further advantage of data and software program interchange between different computer or data processing systems.
  • non-volatile memory 203 may be a software product disk on which are recorded signals representing instructions, which when executed by a processor 101 cause the computer or data processing system to perform a special purpose function.
  • Software embodying aspects of the present invention may be recorded on such a non-volatile memory 203 for distribution by a manufacturer, for archival purposes, for access through a volatile memory 201 by a processor 101, etc.
  • a system for searching through and locating information in a collection of text-based information sources may be constructed.
  • a linguistic knowledge base is first generated. Then, the collection of text-based information sources is indexed. A user next inputs a query defining the information sought. The query is expanded according to selected associative and linguistic rules, using a thesaurus and the linguistic knowledge base. Finally, information is identified which matches the various expanded query terms.
  • the thesaurus, linguistic knowledge base and index may be stored in one or more computer files to which the system has access through memory 103.
  • the aspects of the invention connected with automatic generation of the linguistic knowledge base, automatic generation of the index and query expansion are next described in detail.
  • I. Automatic Generation of the Linguistic Knowledge Base
  • software as shown in Fig. 3 is provided which when executed by a suitable data processing system will automatically generate the linguistic knowledge base from an input body of text based information sources.
  • a collection of English language documents may be processed to generate an English language linguistic knowledge base.
  • a small set of linguistic rules 301 including a list of exceptions 302 to the linguistic rules for a language, e.g. English, is first generated by statistical analysis of a large body of text based information.
  • This small set of rules includes:
  • a word normalization table specifying legal characters in the language, i.e. the alphabet of the language, and legal character positions in the language, e.g. special rules concerning characters which can only appear at specific locations within a word;
  • This set of rules 301 is then used to analyze a body of text based information sources 303, to generate a linguistic knowledge base 305 specifically adapted from the body of text based information sources 303.
  • the body of sources 303 may be selected to be sources from a particular field of endeavor in which future queries are expected to be made, for example. This will result in a linguistic knowledge base better able to cope with the specifics of that particular field of endeavor.
  • the body of sources 303 from which the linguistic knowledge base 305 is derived may not be the same body of sources which is ultimately to be searched. However, automatically generating the linguistic knowledge base 305 from the body of sources to be searched has the advantage that the linguistic knowledge base 305 so produced is particularly well adapted to the body of sources to be searched.
  • Automatic generation of the linguistic knowledge base 305 proceeds as follows.
  • the body of text based information sources 303 forms an input stream of text 304 to the system.
  • This input stream 304 is first parsed into words and terms 307 in accordance with either fixed word recognition rules or word recognition rules specific to one or more languages.
  • the language of each of the words parsed from the input stream is then recognized 309. Once the language of a word has been recognized the word may be normalized 311 according to the linguistic rules 301 for the language. Irregular words may also be recognized at this point, since known irregular words are already in the list of irregular words 302 and hence need no further processing.
  • the system may also identify as potential new irregular words, those words meeting some rule-based criteria.
  • Regular words are linguistically expanded 313 before being added to the linguistic knowledge base 305 such that word bases are stored in the linguistic knowledge base 305 along with a list of related words from the body of sources 303. Linguistic expansion 313 is discussed in greater detail below.
  • the step of parsing 307 the input stream 304 into sentences and words takes place according to the following pseudo-code:
  • for each sentence { for each word { if language not explicitly specified, identify word's language end if; normalize word; return word and word's coordinates; return rest of input stream; } next word; } next sentence.
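  • The parsing loop above can be sketched in Python, purely as an illustration. The function name `parse_stream`, the regular expressions used to split sentences and words, the `identify_language` callback, and the use of lower-casing as a stand-in for normalization are all assumptions of this sketch rather than details given in the source.

```python
import re
from typing import Callable, Iterator, Optional

# Assumed word pattern: runs of letters, optionally joined by "-" or "'".
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")
# Assumed sentence boundary: whitespace following ".", "!" or "?".
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def parse_stream(
    text: str,
    language: Optional[str] = None,
    identify_language: Optional[Callable[[str], str]] = None,
) -> Iterator[tuple[str, str, tuple[int, int]]]:
    """Yield (word, language, (sentence number, position)) for each word."""
    sentences = [s for s in SENTENCE_SPLIT_RE.split(text.strip()) if s]
    for sentence_no, sentence in enumerate(sentences):
        for position, word in enumerate(WORD_RE.findall(sentence)):
            if language is not None:
                lang = language                  # language explicitly specified
            elif identify_language is not None:
                lang = identify_language(word)   # caller-supplied recognizer
            else:
                lang = "unknown"
            # Lower-casing stands in for the normalization step.
            yield word.lower(), lang, (sentence_no, position)
```

  • Writing the parser as a generator mirrors the pseudo-code's "return word and word's coordinates; return rest of input stream": each word is handed to the caller together with its coordinates while the remainder of the stream is left unconsumed.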
  • Normalization is performed as follows. Normalization identifies and removes garbage characters from the words of the input stream 304. For each character of an input word
  • new keys are added to the linguistic knowledge base 305 by the following procedure. If language not explicitly specified identify the language of the input word; for each recognition type
  • the morphological analyzer of the described embodiment functions in accordance with the following procedure.
  • the morphological analyzer receives a list of valid prefixes and suffixes in the language identified for the input word. Start at end of word; strip next substring from end; for each substring of word { /* search for prefix*/ if substring is found to be a prefix in the identified language
  • the phonetic analyzer converts each word into a phonological representation of the word on the basis of letter to sound rules. Words having similar or same phonological representations may be considered to be related by their phonetic morphology.
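  • A toy illustration of a phonetic analyzer of this kind is sketched below in Python. The letter-to-sound rules in `LETTER_TO_SOUND` are invented for the example and are far cruder than the rules a real embodiment would need; the function names are likewise hypothetical.

```python
# Hypothetical letter-to-sound rules, applied in order (not from the source).
LETTER_TO_SOUND = [("ph", "f"), ("ck", "k"), ("c", "k"), ("z", "s")]


def phonetic_key(word: str) -> str:
    """Map a word to a crude phonological representation."""
    key = word.lower()
    for letters, sound in LETTER_TO_SOUND:
        key = key.replace(letters, sound)
    # Collapse runs of the same letter, e.g. "dd" -> "d".
    collapsed = []
    for ch in key:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


def group_by_sound(words: list[str]) -> dict[str, list[str]]:
    """Group words that share the same phonological representation."""
    groups: dict[str, list[str]] = {}
    for word in words:
        groups.setdefault(phonetic_key(word), []).append(word)
    return groups
```

  • With these assumed rules, `phonetic` and the misspelling `fonetick` both map to the key `fonetik`, so they would be treated as phonetically related.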
  • a linguistic knowledge base 305 for the languages of the text in the body of sources 303 will have been automatically generated.
  • As new text based information sources are added to the system, they are also processed as above.
  • new sources increase the knowledge and accuracy of the linguistic knowledge base 305 through the addition of new information to the linguistic knowledge base 305, as well as through the linguistic correction mechanism which corrects the contents of individual entries in the linguistic knowledge base 305 according to new information.
  • the learning procedure which embodies the linguistic correction mechanism is as follows. Open a new table entry for a new correct key; get the list of words in the body of a previous key entry; for each word
  • the affected word base and list of related words may be automatically, or at the direction of a human operator, reanalyzed and updated in accordance with the newly presented information and the above procedure.
  • the system constantly learns about each language processed and updates the affected linguistic knowledge bases.
  • In addition to the linguistic knowledge base 305, the retrieval system according to another aspect of the present invention shown in Fig. 4 automatically generates an index 401, whereby text based information may be found by reference to the index 401. Automatic generation of the index 401 is accompanied by updating of the linguistic knowledge base 305, so that the contents of the linguistic knowledge base 305 reflect the relevant terms contained in the body of text based information sources 303 and are thus correlated with the index 401.
  • the index 401 simply relates words actually found in the body of text based information sources 303 to locations within the body of sources 303. It is preferred that the location be defined hierarchically. For example, the location may be represented hierarchically by a document number, section number, sentence number and position number. Other hierarchical location identification schemes may be used, as seen fit by those skilled in this art.
  • the index 401 is assisted by the linguistic knowledge base 305.
  • the index 401 includes only words and terms actually occurring in the text-based information sources 303.
  • the linguistic knowledge base 305 relates word bases derived from the words actually occurring in the text-based information sources 303 to lists of related words. During retrieval, which is explained below, the system retrieves an entry from the linguistic knowledge base 305 which is then used to reference one or more index entries.
  • Automatic generation of the index 401 proceeds as follows.
  • the body of text based information sources 303 forms an input stream of text 304 to the indexing subsystem.
  • This input stream 304 is first parsed into words and terms 307 in accordance with the word recognition rules.
  • the language of each of the words parsed from the input stream is then recognized 309. Once the language of a word has been recognized 309 the word may be normalized 311 according to the linguistic rules 301 for the language.
  • An index entry is then generated 403 for each new normalized word. If the normalized word already has an entry in the index 401, then the location of the current occurrence of the word is added to the previous entry.
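  • The index generation step, including the hierarchical location scheme (document, section, sentence, position) mentioned above, might look like the following Python sketch. The `Location` type, the nested-list input format, and the punctuation-stripping stand-in for normalization are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Location(NamedTuple):
    """Hierarchical location of one word occurrence (all numbers zero-based)."""
    document: int
    section: int
    sentence: int
    position: int


def build_index(documents: list[list[list[str]]]) -> dict[str, list[Location]]:
    """Index a collection given as documents -> sections -> sentence strings."""
    index: dict[str, list[Location]] = defaultdict(list)
    for d, sections in enumerate(documents):
        for s, sentences in enumerate(sections):
            for n, sentence in enumerate(sentences):
                for p, token in enumerate(sentence.split()):
                    # Stand-in normalization: trim punctuation and lower-case.
                    word = token.strip(".,;:!?\"'").lower()
                    if word:
                        # A new word creates an entry; a known word gains a location.
                        index[word].append(Location(d, s, n, p))
    return index
```

  • For a two-document collection such as `[[["The index grows.", "An index entry."]], [["Index terms"]]]`, the entry for `index` holds three locations, one per occurrence, each added to the same entry as described above.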
  • the linguistic knowledge base 305 is continuously kept correlated with new and modified entries produced in index 401.
  • Each normalized word is reduced to its word base 405 in accordance with the linguistic rules of the language of the word.
  • the word base and related words are then added to the linguistic knowledge base file 407, if not already present.
  • the user may also specify that related words include various types of expansions of the word bases. If expansions are included, expansion of the word base is performed before storing the word base and related words in the linguistic knowledge base file 305.
  • the linguistic knowledge base 305 is correlated with the index 401 and reflects the relevant terms contained in the body of text based information sources 303.
  • Query expansion is performed in accordance with a third aspect of the invention, shown in Fig. 5. Since a query may contain more than one word or term, word recognition is first performed as above. The words and terms identified by the word recognition task may further be normalized.
  • Each recognized word in the query may then be expanded using a 2D expansion matrix.
  • the 2D expansion matrix is one way of defining the expansion space in which an input word may be represented.
  • the dimensions of this space are associative and linguistic.
  • the associative dimension is based upon the meaning of words/word-bases in the language to be processed.
  • the associative dimension is defined by one or more thesauri 501 relating words and terms to their synonyms, broader terms, narrower terms and other relations.
  • Each thesaurus 501 includes a database of terms along with conceptually related terms.
  • the thesaurus is searchable by term.
  • each thesaurus entry contains an entry key which is a list of searchable terms.
  • Each entry key has associated therewith one or more terms conceptually related to the entry key, such as synonyms, broader terms, narrower terms, associated terms, antonyms, etc. The inclusion of any one or more categories of association is optional.
  • each entry term may optionally have associated therewith a conventional dictionary definition and usage guide, as well as a query string into which the entry key may be translated when required.
  • the thesaurus is a list of entries, wherein each entry has a structure substantially as follows:
  • KEYWORD (in the form of natural language phrase or term) Used as an entry key.
  • SYNONYMS A list of keywords synonymous with the KEYWORD that comprise a concept or descriptor.
  • the linguistic dimension of this expansion space is based upon the linguistic knowledge base 305 of the language to be processed.
  • the linguistic knowledge base is built automatically from the actual corpora of the text-based information sources, independent of manually crafted linguistic dictionaries, and not being restricted to "legal" or "proper" words.
  • linguistic expansion grammars of morphology and phonetics are supported.
  • the expansion task performs 2D expansion in substantially two main steps.
  • First an associative expansion is performed at step 503, in which each input word of an input query 505 is expanded to a list of words 507 including words having defined relations to the input word.
  • the associated words are found by making reference to the thesaurus 501.
  • This expanded list of words 507 becomes the input on which linguistic expansion 509 is performed in both the morphological and phonetic dimensions, simultaneously.
  • the morphological and phonetic expansion is controlled by making reference to the linguistic knowledge base 305.
  • the linguistic expansion 509 may be controlled by expansion parameters 511 supplied by the user to include varying degrees of morphological expansion and phonetic expansion, ranging for both dimensions from no expansion in that dimension to full expansion in that dimension.
  • expansion strategies for morphology and phonetics may be intelligently related.
  • the relationships between the expansion dimensions are defined in the linguistic knowledge base 305 for the language.
  • a rule for morphological expansion may define a morphological variation which changes depending upon the phonetic properties of the input word or the expanded result.
  • less noise is generated in the expanded output because relating the morphological and phonetic dimensions as a single linguistic plane eliminates morphological variants which are phonetically unacceptable under the totality of the linguistic rules, and vice versa.
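  • The two-step expansion described above, associative first and linguistic second, can be sketched in Python as follows. The sample `THESAURUS` and `KNOWLEDGE_BASE` contents, their dictionary layout, and the on/off expansion flags are assumptions of this sketch. In particular, the sketch simply looks up precomputed variants and does not model the way the source relates morphological and phonetic rules to filter out unacceptable variants.

```python
# Hypothetical thesaurus: entry key -> associated terms.
THESAURUS = {"car": ["automobile"]}

# Hypothetical linguistic knowledge base: word base -> related words per dimension.
KNOWLEDGE_BASE = {
    "car": {"morphological": ["cars"], "phonetic": ["kar"]},
    "automobile": {"morphological": ["automobiles"], "phonetic": []},
}


def associative_expand(term: str, thesaurus: dict) -> list[str]:
    """Step 1: the term itself plus its associated terms from the thesaurus."""
    return [term] + thesaurus.get(term, [])


def linguistic_expand(term: str, kb: dict,
                      morphological: bool = True, phonetic: bool = True) -> list[str]:
    """Step 2: the term plus the linguistic variants enabled by the parameters."""
    entry = kb.get(term, {})
    variants = [term]
    if morphological:
        variants += entry.get("morphological", [])
    if phonetic:
        variants += entry.get("phonetic", [])
    return variants


def expand_query(terms: list[str], thesaurus: dict, kb: dict, **params) -> list[str]:
    """Apply associative expansion, then linguistic expansion, without duplicates."""
    expanded: list[str] = []
    for term in terms:
        for associated in associative_expand(term, thesaurus):
            for variant in linguistic_expand(associated, kb, **params):
                if variant not in expanded:
                    expanded.append(variant)
    return expanded
```

  • With the sample data, `expand_query(["car"], THESAURUS, KNOWLEDGE_BASE)` returns `["car", "cars", "kar", "automobile", "automobiles"]`; passing `phonetic=False` drops `kar`. The boolean flags are a simplification of the expansion parameters 511, which the source describes as ranging from no expansion to full expansion in each dimension.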
  • a retrieval system may be constructed as shown in Fig. 6, which can perform efficient and accurate location of information within a collection of text-based information sources.
  • a system is given access to one or more collections of text-based information sources 303a and 303b.
  • At least one group of text-based information sources 303a is supplied to automatic linguistic knowledge base generating software 601, which generates the linguistic knowledge base 305 as described above.
  • Text-based information sources 303b are provided to an indexing subsystem 603 which creates an index 401 of words in the text-based information sources 303b, in which each entry in the index 401 defines a relationship between a word and the location of the word in the collection, as described above.
  • It is preferred that the index 401 be generated using normalized words in one or more languages for which the system has a thesaurus 501 and a linguistic knowledge base 305.
  • the indexing subsystem 603 may include a module to recognize words having forms which conform to one of the languages supported by the system and may further include an appropriate normalizing module for each language supported by the system. Words are normalized in their language as discussed above, to reduce the number of anomalous entries appearing in the index 401.
  • the system further receives a user query 505 in the form of one or more words expressive of the information sought by the user.
  • the query words are expanded 605 using a 2D expansion matrix, as discussed above.
  • the query is first associatively expanded to include words related to the original query words by reference to the thesaurus appropriate for the language of the query words.
  • the associatively expanded query is then linguistically expanded in both the morphological and phonetic dimensions, simultaneously.
  • the degree of expansion in each dimension is specified by the user, by parameters 511 supplied with the query.
  • the degree of expansion may be specified by the user, for example, by attaching a checklist of expansion parameters 511 to each query term.
  • the terms of the fully expanded query 607 are compared 609 with the entries in the index 401 to find relevant locations 611 within the collection of text-based information sources 303b.
  • Relevant locations 611 in the collection of text-based information sources 303b do not necessarily contain any of the original query terms.
  • the locations found 611 will contain one of the original query terms or a related term produced by the associative and linguistic expansion processes.
  • the locations found 611 will not include many "noise" locations because the linguistic expansion process is performed as described above in a manner in which the morphological and phonetic linguistic rules are applied simultaneously in a synergistic manner that avoids the problem of applying a morphological rule to generate a phonetically nonsensical result or vice versa.
  • the text-based information sources may be text documents stored on a computer system.
  • the indexing system may hierarchically refer to locations by document number, section number, sentence number and position within the sentence.
  • freely formatted text documents may be processed by the above-described system. There is no need to structure the documents a particular way or to manually produce classifications or keywords, as done in some prior art systems, because the present system indexes words and manipulates queries according to the rules of the language in which the words occur. If it is desired that a phrase be treated as a single word or term in a particular language, then that phrase may be so defined as a conceptual entity in a thesaurus. In all other respects, the phrase so defined as a word is treated simply as a word in the language. However, it is unnecessary to declare a long list of accepted keywords because the process of indexing and query expansion generates accurate, relatively noise-free matches for user queries reasonably expressive of the information sought.

Abstract

A system for processing information contained in a collection of text-based information sources employs associative and linguistic expansion of input words in which associative expansion is first performed, followed by simultaneous linguistic expansion in accordance with related morphological and phonetic rules. The system automatically generates and updates a linguistic knowledge base for each language to be processed by analyzing a large body of text in each language. The system also automatically indexes the collection of text-based information sources to be searched. A method is provided to expand a word or term in a supported language using a two-dimensional (2D) expansion matrix providing great flexibility, high accuracy and low noise output. The 2D expansion matrix includes an associative dimension that utilizes thesauri, databases of saved queries and other associated information sources, in which words are related to other words by meaning and relations, and a linguistic dimension which utilizes recognition-grammars, in which words are related to other words by combined rules for morphological and phonetic variation.

Description

A SYSTEM, SOFTWARE AND METHOD FOR LOCATING INFORMATION IN A COLLECTION OF TEXT-BASED INFORMATION SOURCES
BACKGROUND
1. Field of the Invention
The present invention relates generally to the field of information retrieval. More particularly, the present invention relates to information management systems and computational linguistic systems for finding information related to a user-input query, in a collection of text-based information sources.
2. Discussion of Related Art
In the Information Age, the ability to manage enormous volumes of information efficiently and find needed information quickly has become a driving force in all human endeavors. Early in the development of information management systems, the capability to process large volumes of free-form text documents and other text-based information sources was severely limited. Therefore information specialists developed various types of database management systems and searching systems based on strictly controlling how data may be received, stored, and referred to. However, as the volume and nature of the information which must be handled by such systems has expanded, conventional database management systems have been unable to keep pace.
In conventional database management systems, data is stored in a strictly structured environment. Such systems may be based upon tables of records or spreadsheet models, for example. Such systems may be flat or may be relational with respect to how records in the database are associated with each other. However, conventional database management systems generally require structured records in which one or more fields may be searchable, i.e. are key fields. Furthermore, it is desirable that such key fields use terms, e.g. numbering systems, labels, etc., in a consistent manner which facilitates searching with known query values, i.e. combinations of numbers, labels, etc.
In order to locate information within general text-based information sources, so-called full-text searching has developed. Full-text searching of a collection of text-based information sources, such as English-language documents stored in a computer system, permits a user to write a query containing terms known to be used in relevant documents. The collection of documents is first fully indexed and the words of the documents in the index are compared with the query terms. In the simplest form of this type of system, an exact match between a query term and an index entry must be found in order to identify a relevant document. Spelling errors, word variants, etc. will tend to prevent finding all relevant documents. A technique called wild-carding may be used to partially alleviate this problem, but many irrelevant documents, referred to as "noise," often turn up when wild-carding is used. An example of the use of wild-carding is where a user query term includes only what the user has identified to be a word base of a relevant term, such as "comput*" for the concept of "compute," "computer," "computing," "computation," etc., where "*" indicates the portion of the term which has been left out.
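The trade-off between exact matching and wild-carding can be illustrated with a short Python sketch. The vocabulary below is hypothetical and is not taken from any system discussed here; the standard-library `fnmatchcase` function merely stands in for a wild-card matcher.

```python
from fnmatch import fnmatchcase

# Hypothetical index vocabulary, chosen only for illustration.
VOCABULARY = ["compute", "computer", "computing", "computation",
              "compunction", "company"]

def exact_match(term, words):
    """Simplest full-text search: the query term must equal an index entry."""
    return [w for w in words if w == term]

def wildcard_match(pattern, words):
    """Wild-card search: '*' stands for the omitted portion of the term."""
    return [w for w in words if fnmatchcase(w, pattern)]

print(exact_match("compute", VOCABULARY))      # misses the word variants
print(wildcard_match("comput*", VOCABULARY))   # finds the variants
print(wildcard_match("comp*", VOCABULARY))     # a shorter base also admits noise
```

A shorter word base recalls more variants but also admits unrelated words, which is the noise problem described above.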
Modern, conventional, full-text searching systems have been developed which have a much higher level of sophistication. For example, Pinkas, G., Natural Language Full-Text Retrieval System, Master's Thesis, University of Jerusalem, 1985, discloses a system which automatically expands a user's query to include additional relevant terms in a manner more noise-free than simple wild-carding. The Pinkas system: (1) receives a user query composed of query-words and boolean operators; (2) expands the query linguistically, i.e. by referring to a pre-processed database of morphological and phonetic information; (3) expands the query associatively, i.e. by referring to a database of associated sub-queries; and (4) merges the results of steps 2 and 3 above. Morphological expansion draws in the infix variations of the query terms, while phonetic expansion brings in terms that may be generated by misspelled vowels (e.g. recieve → receive). Associative expansion draws into the query related terms as pre-defined by the user in the form of sub-queries being associated with a specific query-word (e.g. to associate the acronym "USA" with its full wording, one creates an association between the word "USA" and a query applying a boolean "and" operation to the following 4 words: "United", "States", "of", "America", restricted to a proximity of 1 word distance). Thus a comprehensive expanded query is generated to cover the many different words that may conceptually be related to the user's original query. Some variation in the level of morphological expansion and the level of phonetic expansion to be performed is available to the user by selection of expansion parameters.
However, this process of morphological and phonetic expansion suffers from many inefficiencies: it fails to recognize the fundamental differences between different "word-bases" such as morphological stems and phonemes, therefore it misses many relevant linguistic permutations affected by both mechanisms, and at the same time it generates a large amount of noise, i.e. false-positives, due to the combinatorial effect of combining both mechanisms. Moreover, this process also is fairly limited to recognizing and expanding single words, and even then the interaction between the associative expansion and the linguistic expansion is fairly limited to a trivial merge of both results, having not shared a conceptual foundation that allows a mutual feedback (e.g. the query-word "airplane" expands to ("airplane" or "airplanes" or "aircraft") but not to "aircrafts").
In conventional systems, query expansion depended upon a set of linguistic rules which were developed by an expert in the language to be processed. The set of linguistic rules was both extensive and relatively inflexible, since as many characteristics of the input language as possible had to be accounted for before processing any text-based information sources. Development of the linguistic rules for each language to be processed was a very labor-intensive and time-consuming task.
Finally, conventional systems are known which require manual indexing of text sources, as well as systems which index text sources automatically. Conventional indexes simply map a word found in the text sources to a location at which the word is found. Manual full-text indexing is extremely time-consuming and error-prone. Keyword indexing is subjective and also somewhat error-prone.
SUMMARY OF THE INVENTION
Therefore, it is a general aim of the present invention to solve the problems noted above with respect to the prior art. Aspects of the present invention solving the problems of the prior art include at least a system, software and a method for processing information contained in a collection of text-based information sources.
The system may include a computer or data processor and software structured as one or more software modules, units or functions which when executed in a specified order by the computer or data processor perform the desired information processing task. One or more software modules, units or functions may be made available in conventional manner as either compile-time or run-time library entries which may be referred to by a software program which is written in a manner to be aware of such a library. The present invention further provides a method to process query-concepts and transform them to an expanded/improved query using an expansion matrix providing great flexibility, high accuracy and low noise output.
According to one aspect of the invention, there may be provided a text-based information processing system, comprising an automatic linguistic knowledge base generator having an input receiving a collection of text-based information sources and which produces a linguistic knowledge base; an index generator having inputs receiving a collection of text-based information sources and the linguistic knowledge base and which produces an index of the received text-based information and further which updates the linguistic knowledge base to reflect the inputs to the index generator and maintain correlation between the index and the linguistic knowledge base; a query processor having inputs receiving a query composed by an operator, the linguistic knowledge base, the index and a thesaurus and which produces a list of locations in the collection of text-based information sources relevant to the query. The text-based information processing system may be subject to numerous modifications and variations. For example, the automatic linguistic knowledge base generator, the automatic index generator and the query processor may be embodied in various ways.
In accordance with another aspect of the invention, in a text-based information processing system, an automatic linguistic knowledge base generator may comprise a parser, receiving an input stream of terms and producing individual terms; a language recognizer connected to receive the individual terms from the parser and which produces an output indicative of a language to which each individual term belongs; a normalizer connected to receive the individual terms and further to receive linguistic rules for the language indicated by the output of the language recognizer and producing normalized terms; and a linguistic expander connected to receive the legal individual terms and producing entries stored in the linguistic knowledge base. In accordance with yet another aspect of the invention, in a text-based information processing system, an automatic indexer may comprise a parser, receiving an input stream of terms and producing individual terms; a language recognizer connected to receive the individual terms from the parser and which produces an output indicative of a language to which each individual term belongs; a normalizer connected to receive the individual terms and further to receive linguistic rules for the language indicated by the output of the language recognizer and producing normalized terms; and an index entry generator connected to receive the legal individual terms and producing entries stored in the index when the terms have not previously been indexed and modifying an existing index entry when the terms have previously been indexed.
Finally, in accordance with yet another aspect of the invention, in a text-based information processing system, an expansion unit for expanding terms in a language may comprise an associative expander having an input receiving a term and having an output representing the term and at least one associated term found by the associative expander making reference to a thesaurus; and a linguistic expander having an input connected to the output of the associative expander and having an output representing the input of the linguistic expander and at least one term linguistically related to the input of the linguistic expander and found by reference to a linguistic knowledge base for the language.
The normalizers recited above may be constructed of two units. The first normalizer unit may be connected to receive the individual terms and the linguistic rules and produces terms from which illegal characters have been removed; and the second normalizer unit may then be connected to receive the terms from which illegal characters have been removed and the linguistic rules and produces normalized terms including word stems found by applying the linguistic rules to the terms from which illegal characters have been removed.
The present invention will be better understood by reading the Detailed Description of at least one illustrative embodiment of the invention, in connection with the attached drawing.
BRIEF DESCRIPTION OF THE DRAWINGS
In the drawings, in which like reference designations indicate like elements,
Fig. 1 is a schematic block diagram of a computer or data processing system on which the present invention may be practiced;
Fig. 2 is a schematic block diagram of the memory of Fig. 1;
Fig. 3 is a flow chart of automatic linguistic knowledge base generation;
Fig. 4 is a flow chart of automatic index generation;
Fig. 5 is a flow chart of query expansion; and
Fig. 6 is a flow chart of an information retrieval system including the features illustrated in Figs. 3-5.
DETAILED DESCRIPTION
In order to better understand the following detailed description, reference should be made to the following definitions. In this discussion a "language" is considered to be any organized system of tokens, which have symbolic meaning. For convenience, the tokens are referred to hereinafter as "words" or "terms," since the most common types of languages dealt with by text-based information systems are natural human languages composed of words or combinations of words, i.e. terms, which are understood to have specific meanings by humans. Thus, the terms "words" and "terms" are intended to encompass word phrases in those instances where a word phrase may in fact be a token having a meaning independent of its sub-units, keywords in those instances where a word or word-phrase has a specific context/importance, and artificial words such as acronyms and short-cuts. A "word base" is the base portion of a word which remains after removing all prefixes and suffixes of the word which modify the meaning or part of speech of the word root appropriately for the context in which the word may be used. The term "thesaurus" as used herein refers to a database of terms, words and/or word bases, in which each term, word or word base is associated in the database with other terms, words and word bases having a defined relationship such as morphological proximity, phonetic similarity, similar meaning (synonyms), nearly opposite meaning (antonyms), broader meaning, narrower meaning, related term in a specific context, etc. The database may be navigated or searched on the basis of the terms, words and/or word bases stored therein.
Languages considered here have known linguistic rules for the morphological and phonetic variations which words may undergo. For example, the morphological rules of a language may define how a plural is formed from a singular noun, by changing the shape of the word, i.e. adding a final "s" in English, while the phonetic rules may represent the common variations in spelling resulting from user spelling errors. A table, file or database may be used by a software program to hold a list of such linguistic rules.
Generally, languages also include words which do not follow the linguistic rules of the language. For example, the English language morphological rule for generating the past tense of a verb does not apply to the verb "to go," which becomes "went" rather than the nonsensical "goed." Therefore, exceptions to the rules may be held by a software program in a table of exceptions, so that words which do not obey the rules may be handled as accurately as words which do obey the rules.
In the context of the present invention, a "linguistic knowledge base" is developed by applying the linguistic rules and the one or more tables of exceptions or irregular forms to a large body of textual information to produce an efficient, adaptable and flexible representation of the variations of word bases which produce meaning, particularly in natural languages, but also in language generally. The "linguistic knowledge base" is a table, list or database of word bases and related words. Related words are those words which when analyzed under the linguistic rules for the language are determined to have the same word base.
The present invention is constructed in the context of computer systems and data processing systems. An overview of such systems is given in connection with the block diagram of Fig. 1. A computer system or data processing system generally includes a processor 101, a memory 103, one or more input devices 105, and one or more output devices 107, all interconnected through an interconnection mechanism 109. Many variations of this basic plan are possible. For example, viable systems may lack input devices 105 and output devices 107, communicating entirely through interactions with the memory 103 by external devices (not shown). Also, distributed computer systems and data processing systems are contemplated as falling within this basic plan.
The interconnection mechanism 109 may be an internal system bus of a personal computer or may be the Internet, through which a processor 101 interacts with a database stored on a remote memory 103. Other variations will be evident to those skilled in this art. Memory 103 may be classified into two categories useful to this discussion, long term memory (also called non-volatile memory), and short term memory (also called volatile memory). These two types of memory are often both used in computer systems and data processing systems, as shown in Fig. 2. Volatile memory 201 such as integrated circuit random access memory (RAM) is often used in close physical proximity to the processor 101 because the technologies in which such volatile memory 201 is most readily realized produce fast access times, such as are desirable to support fast processors 101. Non-volatile memory 203 is often used to store massive quantities of data for longer periods of time because it can be more cheaply constructed than volatile memory of a similar capacity. Non-volatile memory 203 is often implemented as magnetic or optical disk or tape storage units, which provide a further advantage of data and software program interchange between different computer or data processing systems. As such, non-volatile memory 203 may be a software product disk on which are recorded signals representing instructions, which when executed by a processor 101 cause the computer or data processing system to perform a special purpose function. Software embodying aspects of the present invention may be recorded on such a non-volatile memory 203 for distribution by a manufacturer, for archival purposes, for access through a volatile memory 201 by a processor 101, etc.
In accordance with various aspects of the present invention, there may be constructed a system for searching through and locating information in a collection of text-based information sources. In accordance with various aspects of the invention, a linguistic knowledge base is first generated. Then, the collection of text-based information sources is indexed. A user next inputs a query defining the information sought. The query is expanded according to selected associative and linguistic rules, using a thesaurus and the linguistic knowledge base. Finally, information is identified which matches the various expanded query terms.
The thesaurus, linguistic knowledge base and index may be stored in one or more computer files to which the system has access through memory 103. The aspects of the invention connected with automatic generation of the linguistic knowledge base, automatic generation of the index and query expansion are next described in detail.
I. Automatic Generation of the Linguistic Knowledge Base
According to one aspect of the invention, software as shown in Fig. 3 is provided which when executed by a suitable data processing system will automatically generate the linguistic knowledge base from an input body of text based information sources. For example, according to this aspect of the invention, a collection of English language documents may be processed to generate an English language linguistic knowledge base.
A small set of linguistic rules 301, including a list of exceptions 302 to the linguistic rules for a language, e.g. English, is first generated by statistical analysis of a large body of text based information. This small set of rules includes:
• a list of irregular words and word bases in the language, i.e. the list of exceptions noted above;
• a word normalization table specifying legal characters in the language, i.e. the alphabet of the language, and legal character positions in the language, e.g. special rules concerning characters which can only appear at specific locations within a word;
• a prefix and suffix list specifying legal prefixes and legal suffixes in the language; and
• letter-to-sound rules for both ordinary words and proper names in the language.
This set of rules 301, including the list of exceptions 302, is then used to analyze a body of text based information sources 303, to generate a linguistic knowledge base 305 specifically adapted from the body of text based information sources 303. The body of sources 303 may be selected to be sources from a particular field of endeavor in which future queries are expected to be made, for example. This will result in a linguistic knowledge base better able to cope with the specifics of that particular field of endeavor. The body of sources 303 from which the linguistic knowledge base 305 is derived may not be the same body of sources which is ultimately to be searched. However, automatically generating the linguistic knowledge base 305 from the body of sources to be searched has the advantage that the linguistic knowledge base 305 so produced is particularly well adapted to the body of sources to be searched.
Automatic generation of the linguistic knowledge base 305 proceeds as follows. The body of text based information sources 303 forms an input stream of text 304 to the system. This input stream 304 is first parsed into words and terms 307 in accordance with either fixed word recognition rules or word recognition rules specific to one or more languages. The language of each of the words parsed from the input stream is then recognized 309. Once the language of a word has been recognized the word may be normalized 311 according to the linguistic rules 301 for the language. Irregular words may also be recognized at this point, since known irregular words are already in the list of irregular words 302 and hence need no further processing. The system may also identify as potential new irregular words, those words meeting some rule-based criteria. Those previously unknown irregular words may be identified to a human operator for a determination of whether they should be added to the list of irregular words. Regular words are linguistically expanded 313 before being added to the linguistic knowledge base 305 such that word bases are stored in the linguistic knowledge base 305 along with a list of related words from the body of sources 303. Linguistic expansion 313 is discussed in greater detail below.
The step of parsing 307 the input stream 304 into sentences and words takes place according to the following pseudo-code:
load segmentation rules;
segment input stream into sentences and words using segmentation rules;
for each sentence
{
    for each word
    {
        if language not explicitly specified
            identify word's language
        end if;
        normalize word;
        return word and word's coordinates;
        return rest of input stream;
    } next word;
} next sentence.
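The segmentation part of this parsing step can be sketched in Python as follows. This is a minimal illustration only: the two regular expressions are assumed segmentation rules rather than rules given in this description, the function name `parse` is hypothetical, and language identification and normalization are left out.

```python
import re

# Assumed segmentation rules (not specified in the source):
# a sentence ends at '.', '!' or '?' followed by whitespace,
# and a word is a run of word characters.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
WORD = re.compile(r"\w+")

def parse(stream):
    """Yield (word, sentence_number, position) for each word of the stream."""
    sentences = SENTENCE_BOUNDARY.split(stream.strip())
    for sentence_number, sentence in enumerate(sentences, start=1):
        for position, match in enumerate(WORD.finditer(sentence), start=1):
            # The coordinates returned with each word are its sentence
            # number and its position within that sentence.
            yield match.group(), sentence_number, position

for item in parse("Cats sleep. Dogs bark loudly!"):
    print(item)
```

Writing the parser as a generator mirrors the pseudo-code's pattern of returning one word together with the rest of the input stream.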
Normalization is performed as follows. Normalization identifies and removes garbage characters from the words of the input stream 304.
for each character of an input word
{
    if the character is illegal
    {
        set normalization status according to the character;
    }
    else
    {
        translate the character to the internal alphabet;
        add the translated character to the output word;
    }
} next character.
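A minimal Python sketch of this character-level normalization is given below. It assumes, purely for illustration, that the internal alphabet is lowercase ASCII and that the normalization status is simply the list of illegal characters encountered; neither detail is stated in this description, and the names `LEGAL_CHARACTERS` and `normalize` are hypothetical.

```python
# Assumed alphabet of the language (illustrative only).
LEGAL_CHARACTERS = set("abcdefghijklmnopqrstuvwxyz")

def normalize(word, legal=LEGAL_CHARACTERS):
    """Return (normalized_word, status), where status lists the illegal characters."""
    output = []
    status = []
    for character in word:
        translated = character.lower()  # assumed translation to the internal alphabet
        if translated not in legal:
            # Illegal character: record it in the status and drop it.
            status.append(character)
        else:
            output.append(translated)
    return "".join(output), status

print(normalize("Co#mput3er"))
```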
Finally, new keys are added to the linguistic knowledge base 305 by the following procedure.
if language not explicitly specified
    identify the language of the input word;
for each recognition type
{
    analyze word according to recognition type and level;
    search for analysis results in key table;
    if result is found in key table
    {
        next recognition type;
    }
    else
    {
        insert key and word into table;
        if key has a legal sub-key
        {
            activate linguistic correction mechanism;
        }
    }
} next recognition type.
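The key-insertion loop may be sketched in Python as follows. The two analyzers are deliberately crude, hypothetical stand-ins for the recognition types, and the sub-key check that triggers the linguistic correction mechanism is omitted. The sketch also assumes that when a key already exists the word is appended to that key's body, which is one reading of the procedure rather than something stated explicitly.

```python
def toy_morphological(word):
    """Hypothetical analyzer: strip one trailing 's'."""
    return word[:-1] if word.endswith("s") else word

def toy_phonetic(word):
    """Hypothetical analyzer: keep the first letter, drop later vowels."""
    return word[0] + "".join(c for c in word[1:] if c not in "aeiou")

ANALYZERS = {"morphological": toy_morphological, "phonetic": toy_phonetic}

def add_keys(word, key_table, analyzers=ANALYZERS):
    """Insert word under one key per recognition type; return the keys newly created."""
    new_keys = []
    for recognition_type, analyze in analyzers.items():
        key = (recognition_type, analyze(word))
        if key in key_table:
            # Assumption: an existing key gains the word in its body.
            if word not in key_table[key]:
                key_table[key].append(word)
            continue
        key_table[key] = [word]
        new_keys.append(key)
    return new_keys

table = {}
print(add_keys("cats", table))
print(add_keys("cat", table))
print(table)
```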
Two useful recognition types subject to analysis as indicated in the above pseudocode are morphological and phonological. The morphological analyzer of the described embodiment functions in accordance with the following procedure. The morphological analyzer receives a list of valid prefixes and suffixes in the language identified for the input word.
start at end of word;
strip next substring from end;
for each substring of word
{
    /* search for prefix */
    if substring is found to be a prefix in the identified language
    {
        strip prefix - create initial stem;
        /* search for suffix */
        start at beginning of stem;
        strip next substring from beginning of stem;
        for each substring of stem
        {
            if substring is found to be a suffix
            {
                strip suffix - create stem;
                return stem;
            } endif;
        } next substring;
    } endif;
} next substring.
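One possible reading of this procedure is sketched below in Python. It interprets "stripping from the end" as shortening the word until the remaining leading part is a known prefix, and "stripping from the beginning of the stem" as the mirror-image search for a suffix. As an added assumption not found in the procedure, the empty prefix and the empty suffix are always accepted, so that a word without affixes is returned as its own stem. The affix sets in the usage lines are hypothetical.

```python
def morphological_stem(word, prefixes, suffixes):
    """Strip the longest known prefix, then the longest known suffix."""
    # Shorten the word from its end; word[:i] is the candidate prefix.
    for i in range(len(word) - 1, -1, -1):
        if i == 0 or word[:i] in prefixes:      # i == 0: empty prefix (assumption)
            initial_stem = word[i:]
            # Shorten the stem from its beginning; initial_stem[j:] is the
            # candidate suffix, longest candidates first.
            for j in range(1, len(initial_stem) + 1):
                if j == len(initial_stem) or initial_stem[j:] in suffixes:
                    return initial_stem[:j]
    return word  # reached only for an empty input word

PREFIXES = {"un", "re"}          # hypothetical prefix list
SUFFIXES = {"ness", "ing", "s"}  # hypothetical suffix list

print(morphological_stem("unhappiness", PREFIXES, SUFFIXES))
print(morphological_stem("cats", PREFIXES, SUFFIXES))
print(morphological_stem("go", PREFIXES, SUFFIXES))
```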
The phonetic analyzer converts each word into a phonological representation of the word on the basis of letter to sound rules. Words having similar or same phonological representations may be considered to be related by their phonetic morphology.
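As a hedged illustration of how letter to sound rules can relate a misspelling to its intended word, the following Python sketch applies a small, ordered list of substitution rules. The rules themselves are invented for this example and are not the rules of the described embodiment.

```python
# Hypothetical letter-to-sound rules, applied in order.
LETTER_TO_SOUND = [
    ("ph", "f"),
    ("ck", "k"),
    ("ie", "i"),
    ("ei", "i"),
]

def phonological_key(word, rules=LETTER_TO_SOUND):
    """Return a rough phonological representation of the word."""
    key = word.lower()
    for letters, sound in rules:
        key = key.replace(letters, sound)
    return key

# Words sharing a representation are treated as phonetically related.
print(phonological_key("recieve"), phonological_key("receive"))
```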
When the above processes have been completed for the body of text based information sources 303 initially presented, a linguistic knowledge base 305 for the languages of the text in the body of sources 303 will have been automatically generated. When new text based information sources are added to the system, they are also processed as above. Thus, new sources increase the knowledge and accuracy of the linguistic knowledge base 305 through the addition of new information to the linguistic knowledge base 305, as well as through the linguistic correction mechanism which corrects the contents of individual entries in the linguistic knowledge base 305 according to new information. The learning procedure which embodies the linguistic correction mechanism is as follows.
open a new table entry for a new correct key;
get the list of words in the body of a previous key entry;
for each word
{
    re-analyze word;
    if analysis results match the new correct key
    {
        delete word from body of previous key entry;
        add word to body of new correct key entry;
    }
} next word;
if previous key entry is empty
    delete previous key entry.
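The learning procedure can be sketched in Python on a table that maps each key to the list of words in its body. The function name `correct_key` and the `analyze` callback are hypothetical, and the example table is invented for illustration.

```python
def correct_key(table, previous_key, new_key, analyze):
    """Move words whose re-analysis yields new_key out of previous_key's body."""
    new_body = table.setdefault(new_key, [])       # open the new entry
    for word in list(table.get(previous_key, [])):
        if analyze(word) == new_key:               # re-analyze the word
            table[previous_key].remove(word)
            new_body.append(word)
    # Delete the previous entry if nothing is left in its body.
    if previous_key in table and not table[previous_key]:
        del table[previous_key]

# Hypothetical example: "runn" was an incorrect key for these words.
table = {"runn": ["running", "runner"], "walk": ["walking"]}
correct_key(table, "runn", "run", analyze=lambda word: "run")
print(table)
```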
When the system detects inconsistencies between the contents of the linguistic knowledge base 305 and a newly presented text source, the affected word base and list of related words may be automatically, or at the direction of a human operator, reanalyzed and updated in accordance with the newly presented information and the above procedure. Thus, the system constantly learns about each language processed and updates the affected linguistic knowledge bases.
II. Automatic Generation of the Index Correlated with the Linguistic Knowledge Base
In addition to the linguistic knowledge base 305, the retrieval system according to another aspect of the present invention shown in Fig. 4 automatically generates an index 401, whereby text based information may be found by reference to the index 401. Automatic generation of the index 401 is accompanied by updating of the linguistic knowledge base 305, so that the contents of the linguistic knowledge base 305 reflects the relevant terms contained in the body of text based information sources 303 and is thus correlated with the index 401. The index 401 simply relates words actually found in the body of text based information sources 303 to locations within the body of sources 303. It is preferred that the location be defined hierarchically. For example, the location may be represented hierarchically by a document number, section number, sentence number and position number. Other hierarchical location identification schemes may be used, as seen fit by those skilled in this art.
In accordance with a preferred embodiment of the invention, the index 401 is assisted by the linguistic knowledge base 305. The index 401 includes only words and terms actually occurring in the text-based information sources 303. The linguistic knowledge base 305 relates word bases derived from the words actually occurring in the text-based information sources 303 to lists of related words. During retrieval, which is explained below, the system retrieves an entry from the linguistic knowledge base 305 which is then used to reference one or more index entries.
Automatic generation of the index 401 proceeds as follows. The body of text based information sources 303 forms an input stream of text 304 to the indexing subsystem. This input stream 304 is first parsed into words and terms 307 in accordance with the word recognition rules. The language of each of the words parsed from the input stream is then recognized 309. Once the language of a word has been recognized 309 the word may be normalized 311 according to the linguistic rules 301 for the language. An index entry is then generated 403 for each new normalized word. If the normalized word already has an entry in the index 401, then the location of the current occurrence of the word is added to the previous entry.
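The index update rule described above, together with a hierarchical location, can be sketched in Python as follows. The `Location` tuple and the function `add_occurrence` are hypothetical names, and the sample words and coordinates are invented for illustration.

```python
from collections import namedtuple

# Hierarchical location: document, section, sentence, position in sentence.
Location = namedtuple("Location", "document section sentence position")

def add_occurrence(index, normalized_word, location):
    """Create an entry for a new word, or add the location to an existing entry."""
    index.setdefault(normalized_word, []).append(location)

index = {}
add_occurrence(index, "engine", Location(1, 1, 1, 3))
add_occurrence(index, "turbine", Location(1, 2, 5, 2))
add_occurrence(index, "engine", Location(2, 1, 4, 7))
print(index["engine"])
```

Because the location is an ordinary tuple, locations compare in hierarchical order, so entries can be sorted by document, then section, sentence and position.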
At substantially the same time as the above process, the linguistic knowledge base 305 is continuously kept correlated with new and modified entries produced in index 401. Each normalized word is reduced to its word base 405 in accordance with the linguistic rules of the language of the word. The word base and related word are then added to the linguistic knowledge base file 407, if not already present. The user may also specify that related words include various types of expansions of the word bases. If expansions are included, expansion of the word base is performed before storing the word base and related words in the linguistic knowledge base file 305. When indexing of a body of text based information sources 303 is complete, the linguistic knowledge base 305 is correlated with the index 401 and reflects the relevant terms contained in the body of text based information sources 303.
III. Query Expansion
Query expansion is performed in accordance with a third aspect of the invention, shown in Fig. 5. Since a query may contain more than one word or term, word recognition is first performed as above. The words and terms identified by the word recognition task may further be normalized. That is, they may be converted to a base form, if desired. By making reference to a thesaurus and linguistic rules, spelling errors may be removed, different lexical forms of acronyms and short-cuts may be recognized, etc.
Each recognized word in the query may then be expanded using a 2D expansion matrix. The 2D expansion matrix is one way of defining the expansion space in which an input word may be represented. The dimensions of this space are associative and linguistic. The associative dimension is based upon the meaning of words/word-bases in the language to be processed. In the described embodiment of the invention, the associative dimension is defined by one or more thesauri 501 relating words and terms to their synonyms, broader terms, narrower terms and other relations. Each thesaurus 501 includes a database of terms along with conceptually related terms. The thesaurus is searchable by term. Thus, each thesaurus entry contains an entry key which is a list of searchable terms. Each entry key has associated therewith one or more terms conceptually related to the entry key, such as synonyms, broader terms, narrower terms, associated terms, antonyms, etc. The inclusion of any one or more categories of association is optional. Furthermore, each entry term may optionally have associated therewith a conventional dictionary definition and usage guide, as well as a query string into which the entry key may be translated when required. Thus, the thesaurus is a list of entries, wherein each entry has a structure substantially as follows:
• KEYWORD: (in the form of a natural language phrase or term) Used as an entry key.
• DESCRIPTION (optional): A description of keyword meaning and usage (as in encyclopedic dictionaries).
• QUERY: A complete query statement in an underlying full-text query language that the keyword is translated to when required (optional). (E.g., KEYWORD "USA" - QUERY "United AND States AND of AND America".) If a translation of the keyword to a complete query statement is not supplied explicitly, a default translation is applied to the keyword.
• RELATIONS
    SYNONYMS: A list of keywords synonymous with the KEYWORD that comprise a concept or descriptor.
    BROADER TERMS
    NARROWER TERMS
• ASSOCIATIONS
    OTHER
All of these features may be used by an operator to determine whether associative expansion is having a desired effect.
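The entry structure listed above can be pictured as a simple record. The following is a minimal sketch only, not the patented implementation: the class name `ThesaurusEntry`, its field names, and the method `to_query` are hypothetical, and the default translation (joining the words of the keyword with AND) is an assumption, since the text does not say what the default translation is.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThesaurusEntry:
    """One thesaurus entry, mirroring the fields listed in the text."""
    keyword: str                        # KEYWORD: the entry key
    description: Optional[str] = None   # DESCRIPTION (optional)
    query: Optional[str] = None         # QUERY (optional explicit translation)
    synonyms: List[str] = field(default_factory=list)
    broader_terms: List[str] = field(default_factory=list)
    narrower_terms: List[str] = field(default_factory=list)
    associations: List[str] = field(default_factory=list)
    other: List[str] = field(default_factory=list)

    def to_query(self) -> str:
        """Return the explicit QUERY if supplied, else a default translation."""
        if self.query is not None:
            return self.query
        # Assumed default: AND together the words of the keyword.
        return " AND ".join(self.keyword.split())


usa = ThesaurusEntry(
    keyword="USA",
    query="United AND States AND of AND America",
    synonyms=["United States"],
)
phrase = ThesaurusEntry(keyword="text retrieval")
```

Here `usa.to_query()` returns the explicit query string from the example in the text, while `phrase.to_query()` falls back to the assumed default.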
The linguistic dimension of this expansion space is based upon the linguistic knowledge base 305 of the language to be processed. As discussed above, the linguistic knowledge base is built automatically from the actual corpora of the text-based information sources, independent of manually crafted linguistic dictionaries, and not being restricted to "legal" or "proper" words. In this embodiment of the invention, linguistic expansion grammars of morphology and phonetics are supported.
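As a rough illustration of a knowledge base built from the corpus itself rather than from a dictionary, the sketch below groups every observed word under a word base. The suffix list and the functions `word_base` and `build_knowledge_base` are hypothetical stand-ins; the actual linguistic rules of the described embodiment are not given in this section.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Toy English suffix rules; an assumption standing in for real linguistic rules.
SUFFIXES = ("ing", "ed", "es", "s")


def word_base(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_knowledge_base(words: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each word base to the set of related words seen in the corpus."""
    kb: Dict[str, Set[str]] = defaultdict(set)
    for word in words:
        normalized = word.lower()
        kb[word_base(normalized)].add(normalized)
    return dict(kb)


kb = build_knowledge_base(["index", "indexes", "indexed", "Indexing", "xyzzy"])
```

Because entries come only from observed words, a string such as "xyzzy" gets an entry even though no dictionary would list it.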
The expansion task performs 2D expansion in substantially two main steps. First, an associative expansion is performed at step 503, in which each input word of an input query 505 is expanded to a list of words 507 including words having defined relations to the input word. The associated words are found by making reference to the thesaurus 501. This expanded list of words 507 becomes the input on which linguistic expansion 509 is performed in both the morphological and phonetic dimensions, simultaneously. The morphological and phonetic expansion is controlled by making reference to the linguistic knowledge base 305. The linguistic expansion 509 may be controlled by expansion parameters 511 supplied by the user to include varying degrees of morphological expansion and phonetic expansion, ranging for both dimensions from no expansion in that dimension to full expansion in that dimension. By performing the morphological and phonetic expansions as a single, linguistic expansion step 509, expansion strategies for morphology and phonetics may be intelligently related. The relationships between the expansion dimensions are defined in the linguistic knowledge base 305 for the language. Thus, a rule for morphological expansion may define a morphological variation which changes depending upon the phonetic properties of the input word or the expanded result. As a result, less noise is generated in the expanded output because relating the morphological and phonetic dimensions as a single linguistic plane eliminates morphological variants which are phonetically unacceptable under the totality of the linguistic rules, and vice versa.
IV. A Complete Text Retrieval System
It can now be seen that using the software described above a retrieval system may be constructed as shown in Fig. 6, which can perform efficient and accurate location of information within a collection of text-based information sources. Briefly, such a system is given access to one or more collections of text-based information sources 303a and 303b. At least one group of text-based information sources 303a is supplied to automatic linguistic knowledge base generating software 601, which generates the linguistic knowledge base 305 as described above. Text-based information sources 303b are provided to an indexing subsystem 603 which creates an index 401 of words in the text-based information sources 303b, in which each entry in the index 401 defines a relationship between a word and the location of the word in the collection, as described above. It is preferred that the index 401 be generated using normalized words in one or more languages for which the system has a thesaurus 501 and a linguistic knowledge base 305. The indexing subsystem 603 may include a module to recognize words having forms which conform to one of the languages supported by the system and may further include an appropriate normalizing module for each language supported by the system. Words are normalized in their language as discussed above, to reduce the number of anomalous entries appearing in the index 401. The system further receives a user query 505 in the form of one or more words expressive of the information sought by the user. The query words are expanded 605 using a 2D expansion matrix, as discussed above. The query is first associatively expanded to include words related to the original query words by reference to the thesaurus appropriate for the language of the query words. The associatively expanded query is then linguistically expanded in both the morphological and phonetic dimensions, simultaneously.
The degree of expansion in each dimension is specified by the user, by parameters 511 supplied with the query. The degree of expansion may be specified by the user, for example, by attaching a checklist of expansion parameters 511 to each query term. Finally, the terms of the fully expanded query 607 are compared 609 with the entries in the index 401 to find relevant locations 611 within the collection of text-based information sources 303b.
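The query path just described (associative expansion, then linguistic expansion, then comparison with the index) can be sketched end to end as follows. Everything here is illustrative: the toy `THESAURUS`, `KNOWLEDGE_BASE`, and `INDEX` contents and the function names are hypothetical, and the expansion parameters are reduced to two on/off flags, whereas the text allows varying degrees in each dimension. The sketch also applies the two linguistic dimensions independently and does not model the rules relating morphology to phonetics.

```python
from typing import Dict, List, Set, Tuple

# Toy thesaurus: word -> associated words (e.g. synonyms).
THESAURUS: Dict[str, Set[str]] = {"car": {"automobile"}}

# Toy linguistic knowledge base: word -> (morphological, phonetic) variants.
KNOWLEDGE_BASE: Dict[str, Tuple[Set[str], Set[str]]] = {
    "car": ({"cars"}, {"kar"}),
    "automobile": ({"automobiles"}, set()),
}

# Toy index: indexed word -> list of (document, position) locations.
INDEX: Dict[str, List[Tuple[int, int]]] = {
    "cars": [(1, 4)],
    "automobile": [(2, 0)],
    "kar": [(3, 7)],
}


def associative_expand(word: str) -> Set[str]:
    """Step 1: add words related to the input word through the thesaurus."""
    return {word} | THESAURUS.get(word, set())


def linguistic_expand(words: Set[str], morphological: bool, phonetic: bool) -> Set[str]:
    """Step 2: add variants from the knowledge base, per the expansion flags."""
    expanded = set(words)
    for word in words:
        morph, phon = KNOWLEDGE_BASE.get(word, (set(), set()))
        if morphological:
            expanded |= morph
        if phonetic:
            expanded |= phon
    return expanded


def search(query: str, morphological: bool = True,
           phonetic: bool = True) -> Dict[str, List[Tuple[int, int]]]:
    """Expand each query word, then compare the expanded terms with the index."""
    hits: Dict[str, List[Tuple[int, int]]] = {}
    for word in query.lower().split():
        terms = linguistic_expand(associative_expand(word), morphological, phonetic)
        for term in terms:
            if term in INDEX:
                hits[term] = INDEX[term]
    return hits
```

With this toy data, `search("car")` finds locations under "cars", "automobile" and "kar" even though the literal word "car" is not indexed, and switching off a flag narrows the result accordingly.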
Relevant locations 611 in the collection of text-based information sources 303b do not necessarily contain any of the original query terms. By the processing described above, the locations found 611 will contain one of the original query terms or a related term produced by the associative and linguistic expansion processes. The locations found 611 will not include many "noise" locations because the linguistic expansion process is performed as described above in a manner in which the morphological and phonetic linguistic rules are applied simultaneously in a synergistic manner that avoids the problem of applying a morphological rule to generate a phonetically nonsensical result or vice versa. In a system such as described above, the text-based information sources may be text documents stored on a computer system. In this case, it may be convenient for the indexing system to hierarchically refer to locations by document number, section number, sentence number and position within the sentence. Furthermore, freely formatted text documents may be processed by the above-described system. There is no need to structure the documents a particular way or to manually produce classifications or keywords, as done in some prior art systems, because the present system indexes words and manipulates queries according to the rules of the language in which the words occur. If it is desired that a phrase be treated as a single word or term in a particular language, then that phrase may be so defined as a conceptual entity in a thesaurus. In all other respects, the phrase so defined as a word is treated simply as a word in the language. However, it is unnecessary to declare a long list of accepted keywords because the process of indexing and query expansion generates accurate, relatively noise-free matches for user queries reasonably expressive of the information sought.
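The hierarchical location reference mentioned above (document number, section number, sentence number, position within the sentence) might be represented as a four-part tuple. The sketch below is an assumption-laden illustration: the names `Location` and `build_index` are hypothetical, numbering is zero-based, sections are assumed to be separated by blank lines, and sentences are assumed to end at a period.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Location(NamedTuple):
    document: int   # document number in the collection
    section: int    # section number within the document
    sentence: int   # sentence number within the section
    position: int   # word position within the sentence


def build_index(documents: List[str]) -> Dict[str, List[Location]]:
    """Index every word of freely formatted text by hierarchical location."""
    index: Dict[str, List[Location]] = defaultdict(list)
    for d, text in enumerate(documents):
        for s, section in enumerate(text.split("\n\n")):
            # Naive sentence split on periods; empty fragments are dropped.
            sentences = [part for part in section.split(".") if part.strip()]
            for n, sentence in enumerate(sentences):
                for p, word in enumerate(sentence.split()):
                    index[word.lower()].append(Location(d, s, n, p))
    return dict(index)


docs = ["Cats purr. Dogs bark.\n\nCats sleep."]
index = build_index(docs)
```

In this example "cats" is recorded at two locations, once in each section of the single document, and "bark" is recorded as the second word of the second sentence of the first section.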
Having thus described at least one illustrative embodiment of the invention, various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications and improvements are intended to be within the spirit and scope of the invention. Accordingly, the foregoing description is by way of example only and is not intended as limiting. The invention is limited only as defined in the following claims and the equivalents thereto.

Claims

CLAIMS
1. A text-based information processing system, comprising: an automatic linguistic knowledge base generator having an input receiving a collection of text-based information sources and which produces a linguistic knowledge base; an index generator having inputs receiving a collection of text-based information sources and the linguistic knowledge base and which produces an index of the received text-based information and further which updates the linguistic knowledge base to reflect the inputs to the index generator and maintain correlation between the index and the linguistic knowledge base; and a query processor having inputs receiving a query composed by an operator, the linguistic knowledge base, the index and a thesaurus and which produces a list of locations in the collection of text-based information sources relevant to the query.
2. In a text-based information processing system, an automatic linguistic knowledge base generator, comprising: a parser, receiving an input stream of terms and producing individual terms; a language recognizer connected to receive the individual terms from the parser and which produces an output indicative of a language to which each individual term belongs; a normalizer connected to receive the individual terms and further connected to receive linguistic rules for the language indicated by the output of the language recognizer and producing normalized terms; and a linguistic expander connected to receive the normalized terms and producing entries stored in the linguistic knowledge base.
3. The system of claim 2, wherein the normalizer further comprises: a first normalizer unit connected to receive the individual terms and the linguistic rules and producing terms from which illegal characters have been removed; and a second normalizer unit connected to receive the terms from which illegal characters have been removed and the linguistic rules and which produces normalized terms including word stems found by applying the linguistic rules to the terms from which illegal characters have been removed.
4. In a text-based information processing system, an automatic indexer, comprising: a parser, receiving an input stream of terms and producing individual terms; a language recognizer connected to receive the individual terms from the parser and which produces an output indicative of a language to which each individual term belongs; a normalizer connected to receive the individual terms and further connected to receive linguistic rules for the language indicated by the output of the language recognizer and producing normalized terms; and an index generator having inputs receiving a collection of text-based information sources and the linguistic knowledge base and which produces an index of the received text-based information and further which updates the linguistic knowledge base to reflect the inputs to the index generator and maintain correlation between the index and the linguistic knowledge base.
5. The system of claim 4, wherein the normalizer further comprises: a first normalizer unit connected to receive the individual terms and the linguistic rules and producing terms from which illegal characters have been removed; and a second normalizer unit connected to receive the terms from which illegal characters have been removed and the linguistic rules and which produces normalized terms including word stems found by applying the linguistic rules to the terms from which illegal characters have been removed.
6. In a text-based information processing system, an expansion unit for expanding terms in a language, comprising: an associative expander having an input receiving a term and having an output representing the term and at least one associated term found by the associated expander making reference to a thesaurus; and a linguistic expander having an input connected to the output of the associative expander and having an output representing the input of the linguistic expander and at least one term linguistically related to the input of the linguistic
PCT/IB1997/000748 1996-04-04 1997-04-04 A system, software and method for locating information in a collection of text-based information sources WO1997038376A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP97925221A EP0934569A2 (en) 1996-04-04 1997-04-04 A system, software and method for locating information in a collection of text-based information sources
JP9532080A JP2000507008A (en) 1996-04-04 1997-04-04 Systems, software and methods for locating information in a collection of text-based information sources

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US1481596P 1996-04-04 1996-04-04
US66047896A 1996-06-07 1996-06-07
US08/660,478 1996-06-07
US60/014,815 1996-06-07

Publications (2)

Publication Number Publication Date
WO1997038376A2 WO1997038376A2 (en) 1997-10-16
WO1997038376A3 WO1997038376A3 (en) 1997-12-04

Family

ID=26686566

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB1997/000748 WO1997038376A2 (en) 1996-04-04 1997-04-04 A system, software and method for locating information in a collection of text-based information sources

Country Status (4)

Country Link
EP (1) EP0934569A2 (en)
JP (1) JP2000507008A (en)
CA (1) CA2250694A1 (en)
WO (1) WO1997038376A2 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5404506A (en) * 1985-03-27 1995-04-04 Hitachi, Ltd. Knowledge based information retrieval system
EP0657828A1 (en) * 1993-12-06 1995-06-14 Matsushita Electric Industrial Co., Ltd. An apparatus and a method for retrieving image objects

Also Published As

Publication number Publication date
WO1997038376A3 (en) 1997-12-04
EP0934569A2 (en) 1999-08-11
CA2250694A1 (en) 1997-10-16
JP2000507008A (en) 2000-06-06

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE

AK Designated states

Kind code of ref document: A3

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
ENP Entry into the national phase in:

Ref country code: CA

Ref document number: 2250694

Kind code of ref document: A

Format of ref document f/p: F

Ref document number: 2250694

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 1997925221

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1997925221

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1997925221

Country of ref document: EP