WO2001024053A2

WO2001024053A2 - System and method for automatic context creation for electronic documents

Info

Publication number: WO2001024053A2
Application number: PCT/US2000/025755
Authority: WO
Inventors: Rachael Sokolwski; Philip Oxenberg
Original assignee: Xmlexpress, Inc.
Priority date: 1999-09-28
Filing date: 2000-09-20
Publication date: 2001-04-05
Also published as: WO2001024053A3; WO2001024053A9; AU4025301A

Abstract

A system and method for automatically generating a context for information contained in any type of text-based electronic document such as a hypertext markup language (HTML) encoded web page. The contexts generated by the system describe the content or meaning of parts or sections of the electronic document. Additionally, the system automatically generates a hierarchy of how these contexts are organized. The generated contexts do not describe format or appearance such as heading or paragraph; the contexts created are descriptive names that summarize the content. The contexts provided for the electronic document are used to generate descriptive markup of an electronic document, key words, and indices. The system uses a unique combination of sentence and paragraph boundaries, document markup and linguistic information to generate the context and/or keyword. The contexts generated may be used to electronically provide start and end boundaries for the information. The preferred embodiment is the creation of an XML (eXtensible Markup Language) electronic document.

Description

SYSTEM AND METHOD FOR AUTOMATIC CONTEXT CREATION FOR ELECTRONIC DOCUMENTS

Technical Field

This invention relates generally to electronic document context creation and management. More particularly, this invention relates to methods and apparatus for adding markup or additional information to documents in electronic or other non-paper media, and more specifically, for using algorithms to automatically generate contexts for logical components of the document.

Background of the Invention

Access to relevant electronic information is increasingly important as the world becomes less paper-based and relies more on digital information. The majority of documents available on the internet today are stored in Hypertext Markup Language ( HTML ). HTML is a markup language that encodes a document via the use of tags and attributes. Tags appear between < > brackets, and attributes are specified in the form of "name = value". HTML specifies the meaning of each tag and attribute and how text located between tags and/or attributes will appear. An example is a tag <p> which designates the beginning of a new paragraph. A corresponding tag </p> designates the end of the paragraph. HTML documents are typically interpreted by HTML interpreters found in web browsers.

The Extensible Markup Language (XML) was developed to provide greater flexibility for applications utilizing electronic documents. Similar to HTML, XML is a markup language that uses tags and attributes, but unlike HTML, XML uses tags only to delimit pieces of data. The interpretation given to the meaning of the data is left up to the application that reads the data. As noted above, the tag <p> in HTML specifies that a new paragraph is needed, whereas the tag <p> in XML has an application-specific meaning. This flexibility allows applications making use of the data to interpret the data in different ways. The development of XML has created a pressing need to convert legacy HTML and other types of electronic documents into XML documents. There is an additional need to convert current physical documents into XML. Today, conversion of documents into XML documents is typically done manually.

Summary of the Invention

The present invention provides an approach to migrating documents from an original format to the XML format. In one embodiment of the present invention, HTML documents are automatically converted into XML documents without the need for manual intervention. The information contained in the document is analyzed and categorized. The results of the analysis are used to identify a context for the document's information. The context identifies the manner in which the document's information is interrelated. On the basis of its context, the embodiment of the present invention enables applications to detect differences between documents such as a purchase order and a radiology report. In addition, the invention provides contexts for the data within the document so that pieces of the document, such as the symptoms and diagnosis of an illness, or a zip code and a telephone number, can be easily located and distinguished from one another, thus enhancing the ability to locate information within documents and the searchability of the documents by a search engine.

The preferred embodiment of the present invention automatically locates a piece of information's context. The system uses a combination of boundary markers, known contexts, and linguistic information to determine the start, the end, and the name of the context. The boundary markers include, but are not limited to, the end of a sentence, the end of a paragraph, a word processor style or a markup item such as an HTML tag. A known context might be a well-understood and descriptive format such as an address containing a name, a number, a street name, a city, a state, a country, and a zip code. The linguistic information includes, but is not limited to: parts of speech of individual words within the document, noun phrases within the sentences of the document, and/or the subject of a sentence.

Once a piece of information's context in a document is determined, the context may be expressed as markup within the electronic document or as meta-data attached to the document. For instance, information about a number would include whether the number is a zip code, a telephone number, or a total of sales items. With a context, information can be located more efficiently, because a search for a specific context retrieves only results that match what was requested.

The preferred embodiment of the present invention is a methodology and data architecture utilized in a computer, computer program, computer system, television, television system, video display, scanning device, speech recognition system, or any other mechanism providing text-based electronic documents and requiring automatic addition of contexts for display, manipulation, or archiving of such documents.

Brief Description of the Drawings

An illustrative embodiment consistent with the principles of the present invention will be described below relative to the following drawings.

Figure 1 depicts an electronic device suitable for practicing the illustrative embodiment;

Figure 2 depicts a network environment suitable for practicing the illustrative embodiment;

Figure 3 depicts a flow chart of the information flow through the separate modules in the illustrative embodiment;

Figure 4 depicts a Boundary Processor module used in the illustrative embodiment;

Figure 5 depicts a flow chart of the steps performed by the Lexical Tagger module;

Figure 6 depicts a block diagram of a lexically tagged text stream;

Figure 7 depicts the lexically tagged stream of Figure 6 processed by the Phrase Generator module;

Figure 8 depicts a flow chart of the steps performed by the Subject Determiner module;

Figure 9 depicts the lexically tagged text stream of Figure 7 processed by the Markup Tagger module;

Figure 10 depicts a sample document produced by the Document Creator module.

Detailed Description of the Invention

The illustrative embodiment of the present invention provides an approach for converting documents from an original text-based format ( such as HTML ) to an XML format. In making this conversion, the illustrative embodiment identifies a "context" for information contained in the text-based document. The contexts generated by the illustrative embodiment describe the content or meaning of sections, paragraphs, sentences and other significant words and phrases of the document. The illustrative embodiment automatically generates a hierarchy of the contexts for a document. The hierarchy reflects how information is organized in the document. In general, the contexts are descriptive names that summarize content. The contexts may be incorporated in the XML document to provide a descriptive markup of keywords and indices in the original text-based document or may be stored in meta-data attached to the document. A context for a number, for example, may identify whether the number is a zip code, telephone number or a total price for sale items.

Electronic text documents stored without markup are difficult to find, import into databases, search and retrieve. The provision of markup on the electronic documents provides a reference point for other applications to quickly focus on when they are searching for documents of a particular type or documents containing particular data. The present invention processes text data so that the significant elements of the text data, whether words, phrases, sentences, or paragraphs, can be marked, and a designation of those elements included when the text data document is converted to a new format ( XML in the preferred embodiment ). Marked elements or "contexts" in an electronic document function like tabs on a manila folder, allowing applications to see at a glance what is in the documents without having to review the entire document. This makes the storage, retrieval and searching of a document more efficient than the storage, retrieval and searching of documents without context.

The illustrative embodiment allows documents to be stored, retrieved and searched in an efficient manner by allowing storage, retrieval and searching based on context. The identification of context for information in the document allows applications to quickly distinguish between different types of documents and different types of information contained within a single document. For example, contexts may be used to distinguish between a sales report and a radiology report. Moreover, contexts may be used to distinguish between content located within a single document. Thus, contexts enable distinction between a zip code and a telephone number in a single document. Third party applications may use the contexts to organize, store, and retrieve information based on their own particular criteria. For example, search engines may use the contexts to retrieve documents relevant to a query.

The illustrative embodiment uses a number of heuristics and a knowledge base to identify contexts. Linguistic information ( such as parts of speech of a word ), boundary markers ( such as HTML tags, punctuation marks and ends of paragraphs ) and other information are used to identify contexts. The heuristics identify the start of a context, the end of a context and the name of the context. For example, a heuristic may identify that an address begins with a street number and ends with a zip code. The context name is "address" in such an example.
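The address example can be illustrated with a minimal sketch. It assumes a single, deliberately naive heuristic expressed as a regular expression; the pattern and the function name find_contexts are illustrative and are not taken from the specification, and real addresses vary far more widely than this pattern allows.

```python
import re

# Hypothetical heuristic: an address starts at a street number and ends at a
# five-digit zip code (optionally ZIP+4). This is an assumption made for
# illustration only.
ADDRESS = re.compile(r"\b\d+\s+[A-Za-z][^\n]*?\b\d{5}(?:-\d{4})?\b")


def find_contexts(text):
    """Return (name, start, end, matched_text) for each address-like span."""
    return [("address", m.start(), m.end(), m.group(0))
            for m in ADDRESS.finditer(text)]
```

Each returned tuple carries the three things the heuristics are said to identify: the context name, and the start and end offsets of the context within the text.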

Figure 1 depicts a block diagram of electronic device 1 suitable for practicing the illustrated embodiment of the present invention. The electronic device 1 includes a CPU 2, a display 4, a keyboard 6, a mouse 8, a network adapter 10 and a modem 12. The electronic device 1 also includes permanent storage 14, HTML documents 16, XML documents 17 and a conversion facility 18. The conversion facility 18 is responsible for converting electronic documents ( such as the HTML documents 16 ) into XML documents. The conversion facility 18 may be implemented in one or more software modules that run on the CPU 2. The conversion facility 18 may be invoked programmatically or by explicit user command. It should be appreciated that the conversion facility need not be a solitary package but rather may be part of a suite or other software package.

Those skilled in the art will appreciate that the electronic device 1 depicted in Figure 1 may take many forms. For example, the electronic device 1 may be a personal computer, a workstation, a mainframe, a laptop computer, a personal digital assistant ( PDA ), a network computer, an Internet appliance, a phone, an electronic book, an intelligent pager, or other type of intelligent electronic device. The configuration of the electronic device 1 shown in Figure 1 is intended to be merely illustrative and not limiting of the present invention. For example, the electronic device may include multiple processors and may lack some of the components shown in Figure 1.

As shown in Figure 2, the electronic device 1 may be interfaced with a network 22, such as a computer network, a wireless network or a communications network. The network 22 may be, for example, the Internet, an intranet, an extranet or a local area network ( LAN ). The network 22 may have a server 24 ( such as a web server ) connected to it. The server 24 may hold or have access to original content ( such as an HTML document ) that is to be converted by the conversion facility 18 running on the electronic device 1. More generally, the content converted by the conversion facility 18 need not originate locally but rather may originate remotely.

Those skilled in the art will appreciate that additional servers and components may also be attached to the network 22 and that there are a multitude of possible network connections and configurations for practicing the present invention.

Figure 3 depicts data flow and process flow among modules of the conversion facility 18 in processing and converting an input piece of content ( such as an electronic document ) into XML. Initially, the content is received by the Boundary Processor module 28. The Boundary Processor module 28 looks for boundaries between logical portions of a document. The Boundary Processor module 28 produces a list of structural elements, words and sentences by locating markup boundaries, sentence boundaries, word boundaries and context boundaries indicative of known types of layouts such as address layouts. Figure 4 depicts the steps performed by the Boundary Processor module 28.
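A minimal sketch of this kind of boundary detection follows. It is an illustration under stated assumptions, not the module itself: it covers only markup, word and sentence boundaries, it uses a regular expression where the specification describes character-by-character processing, and the names TOKEN and find_boundaries are hypothetical. Abbreviations such as "e.g." would be split incorrectly by this simple pattern.

```python
import re

# One token per match: a markup tag, sentence-ending punctuation, or a word.
TOKEN = re.compile(r"<[^>]+>|[.?!]|[^\s<>.?!]+")


def find_boundaries(text):
    """Split text into ('tag' | 'word' | 'sentence_end', value) pairs."""
    elements = []
    for match in TOKEN.finditer(text):
        value = match.group(0)
        if value.startswith("<"):
            kind = "tag"
        elif value in ".?!":
            kind = "sentence_end"
        else:
            kind = "word"
        elements.append((kind, value))
    return elements
```

White space acts only as a delimiter here and is not stored; a fuller version would also record paragraph marks and known layouts such as addresses.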

Initially, the input content ( e.g. a text stream ) is read into a data structure similar to a text buffer ( step 54 ). The Boundary Processor then performs a search to locate format information and boundaries in the input content ( step 55 ). The format information includes markup, such as HTML tags, and application-specific styles, such as word processor styles. Boundaries include white space, such as spaces, tabs, new lines, paragraph marks, and carriage returns. In addition, boundaries include the end of a sentence, indicated by sentence white space ( carriage returns, tabs, spaces ) and terminating punctuation, such as periods, question marks, and exclamation points. Boundaries also include a new paragraph, or a new tag. The content is processed one character at a time, and the results are stored in a data structure ( step 56 ). The input content is further divided into words and phrases by identifying white space characters that delimit the words and phrases. Words and phrases are stored in objects of respective object classes.

The output from the Boundary Processor module 28 is passed to a Lexical Tagger module 30 ( see Figure 3 ). The Lexical Tagger module 30 compares the text information in the received list against entries in a Knowledge Database 48 to determine if matching phrases and words are found in the Knowledge Database 48. Each delimited sentence is compared to the Knowledge Database 48 to check whether it appears in its entirety as a phrase. If it does not, increasingly smaller pieces of the delimited sentence will be checked for phrase matches. Eventually, any unmatched words from the sentence will be checked to see if the individual word matches an entry in the Knowledge Database 48. The actual process is explained in more detail below. The Lexical Tagger module 30 assigns an initial part of speech ( i.e., noun, verb, etc. ) to each word or phrase for which a match is found in the Knowledge Database 48. If a word is capable of being designated as more than one part of speech according to the Knowledge Database 48, the Lexical Tagger module 30 assigns the part of speech found most often and maintains the other part of speech tags. For example, the word "pain" may be either a noun or a verb, but "pain" is most often used as a noun. The system assigns the most likely part of speech tag, noun, but retains the verb part of speech tag within the data structure for the word. The designation of words capable of more than one part of speech will be double checked by the Part of Speech Determiner module 32 prior to any final determination of the part of speech.

Figure 5 depicts the steps that are performed for each group of words in the text of the input content. Initially, the text of the input content is arranged into groups of words ( step 57 ), each initial group of words corresponding to a sentence in the input content. The group of words is then compared with entries in the Knowledge Database 48 to determine if there is a matching phrase ( step 58 ). If there is a matching entry, the linguistic information contained in the Knowledge Database 48 is assigned to the phrase ( step 60 ). The linguistic information may include part of speech information and an alternate tag. An alternate tag is markup referring to an identified element in the text stream being examined which has already been assigned a tag. For example, if the phrase "Chief Complaint" is marked with the tag <h1>Chief Complaint</h1>, indicating a first header, and the phrase "Chief Complaint" appears in the Knowledge Database, the phrase "Chief Complaint" is compared against the database's store of alternate tags, and the alternate tag "CC" is generated ( see Figure 9 ). The alternate tag "CC" might be listed as <CC>Chief Complaint</CC>, a second header.

If the group of words does not match a phrase contained in the Knowledge Database 48, and the size of the word grouping being checked is greater than 2 ( step 59 ), the number of words in the group is reduced by one and all phrases within the sentence containing the reduced number of adjacent words are checked against the Knowledge Database 48 ( step 58 ). If the size of the word grouping last checked is equal to 2 ( step 59 ), the sentence will be parsed into individual words ( step 61 ). Linguistic information for the individual words is then retrieved from the Knowledge Database 48 and assigned to the respective words ( step 62 ). For example, if the input content contains a sentence of 5 words, the first comparison to the Knowledge Database 48 will be the sentence in its entirety ( step 58 ). Assuming no matching phrase is found, the subsequent comparisons will consist of checking the two possible four-word groupings, the three possible three-word groupings, and the four possible two-word groupings for phrase matches ( step 58 ). If there are no matches, the sentence will be parsed into individual words ( step 61 ), and the individual words compared to the Knowledge Database 48 for matching entries ( step 62 ). If a word has no corresponding entry in the Knowledge Database 48, it is assigned an "unknown" part of speech tag.
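The shrinking word-group comparison of Figure 5 can be sketched as follows. This is a simplified illustration: the dictionary KNOWLEDGE is a hypothetical stand-in for the Knowledge Database 48 with invented entries, and the sketch assumes that words absorbed into a matched phrase are excluded from later, shorter comparisons, a detail the description leaves open.

```python
# Hypothetical stand-in for the Knowledge Database 48: lower-cased entry ->
# parts of speech, most frequent first. Entries are invented for illustration.
KNOWLEDGE = {
    "chief complaint": ["noun"],
    "patient": ["noun", "adjective"],
    "complains": ["verb"],
    "of": ["preposition"],
    "chest": ["noun"],
    "pain": ["noun", "verb"],
}


def tag_sentence(words):
    """Return (text, parts_of_speech) pairs, preferring the longest phrases."""
    tagged = [None] * len(words)
    # Try the whole sentence first, then every window one word shorter,
    # stopping after the two-word windows.
    for size in range(len(words), 1, -1):
        for start in range(len(words) - size + 1):
            span = range(start, start + size)
            if any(tagged[i] is not None for i in span):
                continue  # assumption: skip words already inside a phrase
            phrase = " ".join(words[start:start + size])
            if phrase.lower() in KNOWLEDGE:
                tagged[start] = (phrase, KNOWLEDGE[phrase.lower()])
                for i in span[1:]:
                    tagged[i] = ()  # placeholder: absorbed into the phrase
    result = []
    for word, slot in zip(words, tagged):
        if slot is None:
            # Individual word lookup; unmatched words are tagged "unknown".
            result.append((word, KNOWLEDGE.get(word.lower(), ["unknown"])))
        elif slot:
            result.append(slot)
    return result
```

For a five-word sentence the loops check one five-word, two four-word, three three-word and four two-word groupings before falling back to single words, which matches the counts given above.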

Figure 6 shows an electronic document 64 that is being processed by the conversion facility 18. In this example, the input content is an HTML document 64. Figure 6 also depicts the linguistic information assigned to the text of the HTML document 64 by the Lexical Tagger module 30. For example, the phrase "Chief Complaint" is identified 68 and an alternate tag ( "CC" ) is generated. The phrase "Patient complains of chest pain," 79 marked as a separate sentence by the Boundary Processor module 28 has no matching phrase entries in the Knowledge Database, and thus is parsed into individual words having respective linguistic information 70, 72, 74, 76 and 78.

The information from the Lexical Tagger module is passed on to the Part of Speech Determiner module 32 which compares the parsed text stream to a Database of Statistical Information 50 to resolve words or phrases that have multiple part of speech possibilities. The Database of Statistical Information 50 uses statistical information about pairs of words, their parts of speech, and their location within a sentence. As an example, a word that could be a noun or an adjective is most likely to be a noun after an article such as "the" and followed by a word without a verb part of speech tag. The word "patient" can be an adjective or a noun. It is most likely to be a noun when preceded by a word assigned an "article" part of speech tag and is followed by an adverbial noun in the phrase, "the patient complains of chest pain, yesterday." The sentence "She is a patient" would be processed in a similar way and would determine "patient" to be a noun. In contrast, in the sentence "He was patient," the word "patient" is most likely to be an adjective since the preceding word is a verb.
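The "patient" example can be sketched with a small lookup table. This is a reduced illustration only: PAIR_SCORES is a hypothetical stand-in for the Database of Statistical Information 50 with invented values, and the sketch looks only at the preceding word's tag, whereas the description also mentions the following word and the position within the sentence.

```python
# Hypothetical stand-in for the Database of Statistical Information 50:
# (previous tag, candidate tag) -> relative likelihood. Values are invented.
PAIR_SCORES = {
    ("article", "noun"): 0.9,
    ("article", "adjective"): 0.4,
    ("verb", "adjective"): 0.7,
    ("verb", "noun"): 0.3,
}


def resolve_tags(tagged):
    """Pick one tag per word from (word, candidate_tags) pairs, left to right."""
    resolved = []
    previous = "start"
    for word, candidates in tagged:
        # max() keeps the first candidate on ties, so the Lexical Tagger's
        # most-frequent-first ordering acts as the fallback.
        best = max(candidates,
                   key=lambda tag: PAIR_SCORES.get((previous, tag), 0.0))
        resolved.append((word, best))
        previous = best
    return resolved
```

With these invented scores, "patient" resolves to a noun after the article in "She is a patient" and to an adjective after the verb in "He was patient".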

The next module, the Phrase Generator module 34, determines noun phrases and verb groups by determining the nouns' proximity to other nouns and the verbs' proximity to other verbs. The Phrase Generator module 34 uses the reconciled part of speech generated by the Part of Speech Determiner module 32 and the words generated by the Boundary Processor module 28 to identify the noun and verb phrases within the sentence. Once each word has a single part of speech assignment, the Phrase Generator module 34 collects the noun phrases and verb groups. Noun phrases are determined by proximity to other nouns, pronouns ( "she" ), determiners ( "both" ), adjectives ( "green" ), articles ( "the" ), conjunctions ( "and" ) and prepositions ( "of" ). Verb groups are constructed by collecting verbs, helping verbs ( "have" ), infinitival to ( "need to go" ), and adverbs ( "formally" ) in the same way. Thus in Figure 7, the lexically tagged text stream from an input HTML document 80 contains the sentence "Patient complains of chest pain". By noting that the preposition "of" 90, the noun "chest" 92, and the noun "pain" 94 all follow the verb "complains" 88, the Phrase Generator combines the words into the noun phrase "of chest pain" 96. Similarly, the verb "complains" 88, located between the noun "Patient" and the noun phrase "of chest pain" 96, is treated as a one-word verb phrase 98. The noun "Patient" 86 is separated from the rest of the sentence by the verb "complains" and is marked as a one-word noun phrase 100.
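A minimal sketch of this grouping step, assuming each word already carries a single resolved part of speech. The tag names, the two sets and the function group_phrases are illustrative choices, and the sketch merges purely by adjacency, which is one simple reading of "proximity".

```python
# Tags that may join a noun phrase or a verb group, following the lists in
# the description. The tag spellings themselves are assumptions.
NOUN_LIKE = {"noun", "pronoun", "determiner", "adjective", "article",
             "conjunction", "preposition"}
VERB_LIKE = {"verb", "helping_verb", "infinitival_to", "adverb"}


def group_phrases(tagged):
    """Collapse runs of (word, tag) pairs into (kind, words) groups."""
    groups = []
    for word, tag in tagged:
        if tag in NOUN_LIKE:
            kind = "noun_phrase"
        elif tag in VERB_LIKE:
            kind = "verb_group"
        else:
            kind = "other"
        # Extend the current group when the kind repeats; otherwise start anew.
        if groups and groups[-1][0] == kind and kind != "other":
            groups[-1][1].append(word)
        else:
            groups.append((kind, [word]))
    return groups
```

Applied to "Patient complains of chest pain", the sketch yields the same three groups as Figure 7: the noun phrase "Patient", the verb group "complains" and the noun phrase "of chest pain".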

The information from the Phrase Generator module 34 is passed on to the Subject Determiner module 36 as depicted in Figure 8. The Subject Determiner module 36 begins with a text stream with identified noun and verb phrases 102. The text stream 102 is compared with the Sentence Pattern Database 104, which contains templates of sentences reflecting common placements of subjects within sentences; on the basis of these common placements, the module determines the potential subject of the sentence. For instance, the sentence "Where did he go?" has a different template than the sentence "Patient complains of chest pain." The sentence "Where did he go?" begins with "Where", an adverb of a special type when located at the beginning of the sentence, followed by a verb or verb phrase followed by the subject "he". A matching template for this sentence would be ADVERB NOUNGROUP(subject) VERBGROUP QUESTIONMARK. The second sentence begins with a noun phrase followed by a verb or verb phrase. A matching template for "The patient complains of chest pain" would be NOUNGROUP(subject) VERBGROUP NOUNGROUP(object). By matching the placement of the noun phrase and verb groups with other identifying words such as "where", a specialized adverb, to the template, the Subject Determiner generates a potential subject for the sentence from the template. If a sentence completely matches the template 106, it is given a high confidence score for the selected subject of the sentence 108. If the sentence does not match the template completely 110, it will receive a lower confidence score 112. The actual score is assigned on a percentage basis based on the percentage of data matching the sentence template. As part of this analysis to determine the subject of the sentence, an object of the sentence, either a direct object or an indirect object, may be identified. In the sentence "Patient complains of chest pain.", the noun phrase "of chest pain" may be determined as the object of the sentence.
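The template comparison and percentage score can be sketched as below. The list TEMPLATES is a hypothetical stand-in for the Sentence Pattern Database 104, holding simplified versions of the two templates quoted above ( without the question mark slot ), and the scoring rule, matched slots divided by the longer of the sentence and the template, is an assumption about how "percentage of data matching" might be computed.

```python
# Hypothetical stand-in for the Sentence Pattern Database 104. Each template
# is a list of (group kind, role) slots; role is None for unlabelled slots.
TEMPLATES = [
    [("noun_phrase", "subject"), ("verb_group", None), ("noun_phrase", "object")],
    [("adverb", None), ("noun_phrase", "subject"), ("verb_group", None)],
]


def find_subject(groups):
    """Return (subject_words, object_words, confidence) for the best template."""
    best = (None, None, 0.0)
    for template in TEMPLATES:
        length = max(len(template), len(groups))
        # Count positions where the sentence group kind equals the slot kind.
        matched = sum(1 for (kind, _), (slot, _) in zip(groups, template)
                      if kind == slot)
        confidence = matched / length if length else 0.0
        roles = {role: words for (kind, words), (slot, role)
                 in zip(groups, template) if role and kind == slot}
        if confidence > best[2]:
            best = (roles.get("subject"), roles.get("object"), confidence)
    return best
```

A sentence that fills every slot of a template scores 1.0, while a sentence that fills only two of three slots scores about 0.67 and would be treated as a lower-confidence match.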

The Markup Tagger module 38 depicted in Figure 9 utilizes the information generated in the Lexical Tagger module 30, the Part of Speech Determiner module 32, the Phrase Generator module 34, and/or the Subject Determiner module 36 to construct the correct tag name and to determine where the start and end tags, and the boundaries, should be placed. The Markup Tagger module 38 analyzes the same text string 114 as the Subject Determiner module. If the confidence score of the Subject Determiner module 36 is high 108, the Markup Tagger may use all the information from the Phrase Generator module 34 and Subject Determiner module 36. However, if the confidence score is low 112, it may not use any of the information. The present invention allows an end user to specify a parameter for an acceptable confidence score. For example, the parameter may be set at 80%, in which case any confidence score below 80% would cause the Markup Tagger module 38 to ignore the subject determination of the Subject Determiner module 36. In cases where an alternate tag name has been indicated for the markup ( as in the alternate tag CC for Chief Complaint 118 ), the Markup Tagger uses the alternate name to create the tag <CC> 132. If a subject and an object of a sentence exist, these two terms are combined into one markup tag name. For instance, in the sentence "Patient complains of chest pain", the subject is "Patient" 120 and the object is "of chest pain" 124, 126, 128. Under these circumstances, the Markup Tagger will combine the word "Patient" and the nouns of the object ( minus the preposition "of" ) to create a markup tag with the name "Patient.chest.pain" 134.
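The tag naming rules described here can be sketched as a single function. The function name build_tag_name and its parameters are hypothetical, the default threshold of 0.8 mirrors the 80% example, and dropping the literal word "of" stands in for a real part-of-speech check on each object word.

```python
def build_tag_name(subject, obj, confidence, alternate=None, threshold=0.8):
    """Return a tag name, or None when no usable name can be built."""
    if alternate:
        return alternate  # an alternate tag such as "CC" is used as the name
    if confidence < threshold or not subject:
        return None  # subject determination is ignored below the threshold
    # Keep the object's nouns only; removing "of" here is a simplification
    # of filtering the object words by part of speech.
    object_words = [w for w in (obj or []) if w.lower() != "of"]
    return ".".join(list(subject) + object_words)
```

For example, build_tag_name(["Patient"], ["of", "chest", "pain"], 1.0) gives "Patient.chest.pain", and the same call with a confidence of 0.6 gives None because the score falls below the user-set threshold.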

The Document Creator module 40 uses knowledge of how XML documents are constructed ( the "XML grammar" ), to create a document with the correct markup syntax from the tags generated by the Markup Tagger module 38 and the text found in the original electronic document. The hierarchy of contexts generated by the present invention and original tags are used to create new documents. Even though the preferred embodiment utilizes the markup language XML, the Document Creator module 40 is not limited to XML since the Markup Tagger module 38 and Document Creator module 40, in combination, can generate any type of markup language, such as HTML 46. Figure 10 depicts an example of the type of document created by the Document Creator module 136.
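A minimal sketch of the document creation step, using Python's standard xml.etree.ElementTree module so that the output is syntactically valid XML. The function name create_document, the root element name passed by the caller, and the flat list of sections are assumptions; the sketch does not reproduce the hierarchy of contexts or the layout of Figure 10.

```python
import xml.etree.ElementTree as ET


def create_document(root_name, sections):
    """Build an XML string from (tag_name, text) pairs under one root element."""
    root = ET.Element(root_name)
    for tag_name, text in sections:
        # Each generated context becomes one child element holding its text.
        ET.SubElement(root, tag_name).text = text
    return ET.tostring(root, encoding="unicode")
```

Calling create_document("report", [("CC", "Chief Complaint"), ("Patient.chest.pain", "Patient complains of chest pain.")]) produces a single root element with one child per tag; the library escapes characters such as "<" in the text automatically.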

It will thus be seen that the invention efficiently attains the objects made apparent from the preceding description. Since certain changes may be made without departing from the scope of the present invention, it is intended that all matter contained in the above description or shown in the accompanying drawings be interpreted as illustrative and not in a literal sense. Practitioners of the art will realize that the separate modules illustrated herein may be combined or split up into a lesser or greater number of total modules without departing from the scope of the present invention. Those skilled in the art will further realize that while the majority of the illustrations and descriptions refer to the conversion of HTML documents to XML documents, the present invention is capable of directly converting other types of electronic documents besides HTML documents into XML documents, is capable of converting other types of electronic documents besides HTML documents into HTML documents or other types of markup language documents, and is capable of being part of a process to convert physical documents into electronic documents ( for example by scanning a document ) and then convert the document into a markup language document.

Claims

We Claim:

1. In an electronic device, a method, comprising the steps of: providing a document having content, including text; selecting from the text of said document semantically significant elements, including words and phrases, indicative of the content of the document; and copying said semantically significant elements into a designated portion of said document; and marking said designated portion to indicate that said designated portion contains a high concentration of significant information for determining relevance of the document to a query.

2. The method of claim 1 wherein said method further comprises the steps of: providing a linguistic database of words and phrases; and comparing said semantically significant words and phrases to said database to identify said semantically significant words and phrases which appear in said database, and instantiating said semantically significant words and phrases which appear in said linguistic database as class objects, and parsing said semantically significant phrases which do not appear in said linguistic database into individual words for said comparison and said instantiation as individual words.

3. The method of claim 2 wherein an alternate tag name, which serves as a descriptive label, is generated for said identified phrases or words selected from said document which appear in said linguistic database.

4. The method of claim 2 wherein said method further comprises the steps of: assigning a tentative grammatical part of speech to said semantically significant word which appears in said linguistic database; and providing a database containing statistical information regarding pairs of words, word part of speech designations, and word placement within a sentence; and verifying said tentative grammatical part of speech designation by further comparing said semantically significant word to said database containing statistical information.

5. The method of claim 2, wherein nouns selected from said document are organized into noun phrases based on their proximity to other nouns.

6. The method of claim 5, wherein verbs selected from said document are organized into verb groups based on their proximity to other verbs.

7. The method of claim 6, wherein the probable subject of a sentence selected from said document is determined by comparing the location of noun phrases and verb groups within a sentence structure against a database containing templates of sentence structures.

8. The method of claim 1, wherein a document is created using markup syntax from said marked significant information and the text from the original document.

9. The method of claim 8 wherein the markup language of the created document is XML.

10. The method of claim 8 wherein the markup language of the created document is HTML.

11. The method of claim 1 wherein said document is generated by optically scanning a physical document into a computer system.

12. A method for enhancing the searchability of a text-based document by a search engine, said method comprising the steps of: providing a document having content, including text; and selecting from the text of said document semantically significant elements, including words and phrases, indicative of the content of the document; and copying said semantically significant elements into a designated portion of said document; and marking said designated portion to indicate that said designated portion contains a high concentration of significant information for determining relevance of the document to a query.

13. The method of claim 12 wherein the method further comprises the steps of: providing a linguistic database of words and phrases; and comparing said semantically significant words and phrases to said database to identify said semantically significant words and phrases which appear in said database, and instantiating said semantically significant words and phrases which appear in said linguistic database as class objects, and parsing said semantically significant phrases which do not appear in said linguistic database into individual words for said comparison and said instantiation as individual words.

14. The method of claim 13, wherein an alternate tag name, which serves as a descriptive label, is generated for said identified phrases or words selected from said document which appear in said linguistic database.

15. The method of claim 13 wherein said method further comprises the steps of: assigning a tentative grammatical part of speech to said semantically significant word which appears in said linguistic database; and providing a database containing statistical information regarding pairs of words, word part of speech designations, and word placement within a sentence; and verifying said tentative grammatical part of speech designation by further comparing said semantically significant word to said database containing statistical information.

16. The method of claim 13, wherein nouns selected from said document are organized into noun phrases based on their proximity to other nouns.

17. The method of claim 16, wherein verbs selected from said document are organized into verb groups based on their proximity to other verbs.

18. The method of claim 17, wherein the probable subject of a sentence selected from said document is determined by comparing the location of said noun phrases and verb groups within a sentence structure against a database containing templates of sentence structures.

19. The method of claim 12 wherein said method comprises the further step of: creating from said marked significant information and the text of said document an XML document.

20. The method of claim 12 wherein said text based document is originally an HTML document.

21. The method of claim 12 wherein said method further comprises the step of: creating from said marked significant information and the text of said document an HTML document.

22. A medium for use with an electronic device, said medium holding computer-executable instructions for performing a method, comprising: providing a document having content, including text; and selecting from the text of said document semantically significant elements, including words and phrases, indicative of the content of the document; and copying said semantically significant elements into a designated portion of said document; and marking said designated portion to indicate that said designated portion contains a high concentration of significant information for determining relevance of the document to a query; and creating from said semantically important information and the text of said HTML document an XML document.

23. The method of claim 22 wherein the method further comprises the steps of: providing a linguistic database of words and phrases; and comparing said semantically significant words and phrases to said database to identify said semantically significant words and phrases which appear in said database, and instantiating said semantically significant words and phrases which appear in said linguistic database as class objects, and parsing said semantically significant phrases which do not appear in said linguistic database into individual words for said comparison and said instantiation as individual words.

24. The method of claim 23, wherein an alternate tag name, which serves as a descriptive label, is generated for said identified phrases or words selected from said document which appear in said linguistic database.

25. The method of claim 22 wherein said method further comprises the steps of: assigning a tentative grammatical part of speech to said semantically significant word which appears in said linguistic database; and providing a database containing statistical information regarding pairs of words, word part of speech designations, and word placement within a sentence; and verifying said tentative grammatical part of speech designation by further comparing said semantically significant word to said database containing statistical information.

26. The method of claim 22, wherein nouns selected from said document are organized into noun phrases based on their proximity to each other.

27. The method of claim 26, wherein verbs selected from said document are organized into verb groups based on their proximity to other verbs.

28. The method of claim 27, wherein the probable subject of a sentence selected from said document is determined by comparing the location of said noun phrases and verb groups within a sentence structure against a database containing templates of sentence structures.
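
For illustration only, the selecting, copying, and marking steps recited in claim 12 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the stop-word list, the frequency-based selection, the `<keywords>` element name, and the function names are all hypothetical stand-ins for the linguistic analysis described in the claims.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a stand-in for the linguistic database of the claims.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
              "for", "on", "with", "by"}


def select_significant_words(text, limit=5):
    """Return up to `limit` of the most frequent non-stop words.

    Frequency is an assumed proxy for semantic significance.
    Ties are broken by order of first appearance in the text.
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)
    first_seen = {}
    for index, word in enumerate(words):
        first_seen.setdefault(word, index)
    ranked = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))
    return ranked[:limit]


def mark_document(text, limit=5):
    """Copy the selected words into a designated, marked portion.

    The marked portion is a hypothetical <keywords> element placed
    ahead of the original text, which is left unchanged.
    """
    keywords = select_significant_words(text, limit)
    designated = "<keywords>" + ", ".join(keywords) + "</keywords>"
    return designated + "\n" + text


if __name__ == "__main__":
    sample = ("The patent describes a markup system. "
              "The markup system tags the document.")
    print(mark_document(sample, limit=3))
```

In this sketch the original text is preserved after the marked portion, so a search engine could weigh the `<keywords>` element more heavily without losing the full content.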