WO2000041096A1 - Method for producing summaries of text document - Google Patents


Info

Publication number
WO2000041096A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
words
phrase
computer method
phrases
Prior art date
Application number
PCT/US2000/000268
Other languages
French (fr)
Inventor
Michael J. Witbrock
Vibhu O. Mittal
Original Assignee
Justsystem Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Justsystem Corporation filed Critical Justsystem Corporation
Priority to AU24927/00A priority Critical patent/AU2492700A/en
Publication of WO2000041096A1 publication Critical patent/WO2000041096A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation


Abstract

A computer method for preparing a summary string (19) from a source document of encoded text (17). The method comprises comparing a training set of encoded text documents (10) with manually generated summary strings (11) associated therewith to learn probabilities (13) that a given summary word or phrase will appear in summary strings (19) given that a source word or phrase appears in encoded text documents (17), and constructing from the source document a summary string containing summary words or phrases (19) having the highest probabilities of appearing in a summary string (19), based on the learned probabilities established in the previous step.

Description

METHOD FOR PRODUCING SUMMARIES OF TEXT DOCUMENT
BACKGROUND OF THE INVENTION

Extractive summarization is the process of selecting and extracting text spans -- usually whole sentences -- from a source document. The extracts are then arranged in some order (usually the order in which they are found in the source document) to form a summary. In this method, the quality of the summary is dependent on the scheme used to select the text spans from the source document. Most of the prior art uses a combination of lexical, frequency and syntactic cues to select whole sentences for inclusion in the summary. Consequently, the summaries cannot be shorter than the shortest text span selected and cannot combine concepts from different text spans in a simple phrase or statement. U.S. Patent No. 5,638,543 discloses selecting sentences for an extractive summary by scoring sentences based on lexical items appearing in the sentences. U.S. Patent No. 5,077,668 discloses an alternative sentence scoring scheme based upon markers of relevance such as hint words like "important", "significant" and "crucial". U.S. Patent No. 5,491,760 works on bitmap images of a page to identify key sentences based on the visual appearance of hint words. U.S. Patent Nos. 5,384,703 and 5,778,397 disclose selecting sentences scored on the inclusion of the most frequently used nonstop words in the entire text.
In contrast to the large amount of work that has been undertaken in extractive summarization, there has been much less work on generative methods of summarization. A generative method of summarization selects words or phrases (not whole sentences) and generates a summary based upon the selected words or phrases. Early approaches to generative methods are discussed in the context of the FRUMP system. See DeJong, G.F., "An Overview of the FRUMP System", Strategies for Natural Language Processing, (Lawrence Erlbaum Associates, Hillsdale, NJ, 1982). This system provides a set of templates for extracting information from news stories and presenting it in the form of a summary. Neither the selection of content nor the generation of the summary is learned by the system. The selection templates are handcrafted for a particular application domain. Other generative systems are known. However, none of these systems can: (a) learn rules, procedures, or templates for content selection and/or generation from a training set, or (b) generate summaries that may be as short as a single noun phrase.
The method disclosed herein relates somewhat to the prior art on statistical modeling of natural language applied to language translation. U.S. Patent No. 5,510,981 describes a system that uses a translation model describing correspondences between sets of words in a source language and sets of words in a target language to achieve natural language translation. This system proceeds linearly through a document producing a rendering in the target language of successive document text spans. It is not directed to operate on the entire document to produce a summary for the document.

SUMMARY OF THE INVENTION
As used herein, a "summary string" is a derivative representation of the source document which may, for example, comprise an abstract, key word summary, folder name, headline, file name or the like. Briefly, according to this invention, there is provided a computer method for generating a summary string from a source document of encoded text comprising the steps of: a) comparing a training set of encoded text documents with manually generated summary strings associated therewith to learn probabilities that a given summary word or phrase will appear in summary strings given that a source word or phrase appears in an encoded text document ; and b) from the source document, generating a summary string containing a summary word, words, a phrase or phrases having the highest probabilities of appearing in a summary string based on the learned probabilities established in the previous step. Preferably, the summary string contains the most probable summary word, words, phrase or phrases for a preselected number of words in the summary string. In one embodiment, the training set of encoded manually generated summary strings is compared to learn the probability that a summary word or phrase appearing in a summary string will follow another summary word or phrase. Summary strings are generated containing the most probable sequence of words and/or phrases for a preselected number of words in the summary string .
In a preferred embodiment, the computer method, according to this invention, comprises comparing a training set of encoded text documents with manually generated summary strings associated therewith to learn the probabilities that a given summary word or phrase will appear in summary strings given a source word or phrase appears in the encoded text, considering the context in which the source word or phrase appears in the encoded text documents. For example, the contexts in which the source words or phrases may be considered include titles, headings, standard paragraphs, fonts, bolding, and/or italicizing.
In yet another preferred embodiment, the computer method, according to this invention, further comprises learning multiple probabilities that a summary word or phrase will appear in a summary string given a source word or phrase appears in the encoded text and considering the various usages of the word or phrase in the encoded text, for example, syntactic usages and semantic usages.
In a still further preferred embodiment, according to this invention, the step for comparing a training set of encoded manually generated summary strings takes into consideration external information in the form of queries, user models, past user interaction and other biases to optimize the form of the generated summary strings.

BRIEF DESCRIPTION OF THE DRAWING

Further features and other objects and advantages will become clear from the following detailed description made with reference to the drawing, which is a schematic diagram illustrating the processing of text to produce summaries.
DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring now to the drawing, a collection of representative documents is assembled at 10 and corresponding manually generated summaries are assembled at 11. These comprise a training set. They are encoded for computer processing and stored in computer memory. They may be preprocessed to add syntactic and semantic tags.
The documents and summaries are processed in the translation model generator at 12 to build a translation model 13 which is a file containing the probabilities that a word found in a summary will be found in the document. The translation model generator constructs a statistical model describing the relationship between the text units or the annotated text units in documents and the text units or annotated text units used in the summaries of documents. The translation model is used to identify items in a source document 17 that can be used in summaries. These items may include words, parts of speech ascribed to words, semantic tags applied to words, phrases with syntactic tags, phrases with semantic tags, syntactic or semantic relationships established between words or phrases in the document, structural information obtained from the document, such as positions of words or phrases, mark-up information obtained from the document such as the existence of bold face or italics, or of headings or section numbers and so forth.
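As an illustration of the simplest form of such a translation model, the following Java sketch estimates, for each word, the relative frequency with which it appears in a training summary given that it appears in the corresponding document. The class and method names are hypothetical and the unsmoothed relative-frequency estimate is a simplifying assumption; the model described above may additionally condition on context, tags and mark-up.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: estimate P(word appears in summary | word appears in document)
// from a training set of (document word set, summary word set) pairs.
public class TranslationModelEstimator {
    public static Map<String, Double> estimate(List<Set<String>> docs,
                                                List<Set<String>> summaries) {
        Map<String, Integer> inDoc = new HashMap<>();   // documents containing the word
        Map<String, Integer> inBoth = new HashMap<>();  // pairs where word is in document and summary
        for (int i = 0; i < docs.size(); i++) {
            for (String w : docs.get(i)) {
                inDoc.merge(w, 1, Integer::sum);
                if (summaries.get(i).contains(w)) {
                    inBoth.merge(w, 1, Integer::sum);
                }
            }
        }
        Map<String, Double> prob = new HashMap<>();
        for (Map.Entry<String, Integer> e : inDoc.entrySet()) {
            int both = inBoth.getOrDefault(e.getKey(), 0);
            // Relative-frequency estimate; a fuller model would smooth and
            // condition on position, mark-up and syntactic or semantic tags.
            prob.put(e.getKey(), (double) both / e.getValue());
        }
        return prob;
    }
}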
The summaries are processed by the language model generator 14 to produce a summary language model 15. The language model is a file containing the probabilities of each word or phrase found in the training set summaries following another word or phrase. The language model generator builds a statistical model describing the likely order of appearance of text units or annotated text units in summaries. The headlines or summaries may be preprocessed to identify text items that can be used in determining the typical structure of summaries. These text items may include words, parts of speech ascribed to words, semantic tags applied to words, phrases, phrases with syntactic tags, syntactic or semantic relations established between words or phrases, structure information, such as positions of words or phrases in the summary, and so forth. The translation model 13 and summary language model 15 along with a document 17 to be summarized and summarization control parameters 18 are supplied to the summary search engine 16 to select a sequence of items
(characters or lexemes) that jointly optimize the information content extracted from the source document to be summarized. These are supplied to the summary generation engine 19 which generates the summary.
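A minimal sketch of such a summary language model, assuming a simple relative-frequency bigram estimate with a single constant standing in for the back-off weight discussed later, might look like the following; the class name and the back-off constant are assumptions, not values prescribed by the patent.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the summary language model: bigram probabilities over
// the training-set summaries, returned as log values.
public class SummaryLanguageModel {
    private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
    private final Map<String, Integer> historyCounts = new HashMap<>();   // counts of each word as a bigram history

    public void addSummary(List<String> words) {
        for (int i = 0; i + 1 < words.size(); i++) {
            String w1 = words.get(i), w2 = words.get(i + 1);
            historyCounts.merge(w1, 1, Integer::sum);
            bigramCounts.computeIfAbsent(w1, k -> new HashMap<>())
                        .merge(w2, 1, Integer::sum);
        }
    }

    // Log P(w2 | w1); unseen pairs fall back to a crude assumed constant.
    public double logProbability(String w1, String w2) {
        Integer denom = historyCounts.get(w1);
        Integer num = bigramCounts.getOrDefault(w1, Map.of()).get(w2);
        if (denom == null || num == null) return -100.0;   // assumed back-off value
        return Math.log((double) num / denom);
    }
}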
The following Table is an example document for explaining the practice of this invention:
Table 1
"The U.N. Security Council on Monday was to address a dispute between U.N. chief weapons inspector Richard Butler and Iraq over which disarmament documents Baghdad must hand over. Speaking in an interview with CNN on Sunday evening, Butler said that despite the latest dispute with Iraq, it was too soon to make a judgment that the Iraqis had broken last week's agreement to unconditionally resume cooperation with weapons inspector -- an agreement which narrowly averted air strikes by the United States and Britain. " Some possible headline/summaries for the document produced above are :
"Security Council to address Iraqi document dispute." "Iraqi Weapons Inspections Dispute." These summaries illustrate some of the reasoning required for summarization. The system must decide (1) what information to present in the summary, (2) how much detail to include in the summary or how long the summary can be, and (3) how best to phrase the information so that it seems coherent. The two summaries above illustrate some of the issues of length, content and emphasis.
The statistical models are produced by comparison of a variety of documents and summaries for those documents similar to those set forth above to learn, for a variety of parameter settings, mechanisms for both (1) content selection for the most likely summaries of a particular length and (2) generating coherent English (or any other language) text to express the content. The learning for both content selection and summary generation may take place at a variety of conceptual levels ranging from characters, words, word sequences or n-grams, phrases, text spans and their associated syntactic and semantic tags. In this case, prior to the comparison, the texts in the training sets must be tagged. Set forth in the following table is the text of Table 1 after being tagged with syntactic parts of speech using the LDC standard, e.g., DT: definite article, NNP: proper noun, JJ: adjective.
Table 2

The_DT U.N._NNP Security_NNP Council_NNP on_IN Monday_NNP was_VBD to_TO address_VB a_NN dispute_NN between_IN U.N._NNP chief_JJ weapons_NNS inspector_NN Richard_NNP Butler_NNP and_CC Iraq_NNP over_IN which_WDT disarmament_NN documents_NNS Baghdad_NNP must_NN hand_NN over._CD Speaking_VBG in_IN an_DT interview_NN with_IN CNN_NNP on_IN Sunday_NNP evening,_NNP Butler_NNP said_VBD that_IN despite_IN the_DT latest_JJS dispute_NN with_IN Iraq,_NNP it_PRP was_VBD too_RB soon_RB to_VBP make_VB a_DT judgment_NN that_IN the_DT Iraqis_NNPS had_VBD broken_VBN last_JJ week's_NN agreement_NN to_TO unconditionally_RB resume_VB cooperation_NN with_NN weapons_NNS inspectors:_NNS an_DT agreement_NN which_WDT narrowly_RB averted_VBP airstrikes_NNS by_IN the_DT United_NNP States_NNPS and_CC Britain._NNP
Set forth in the following table is the text of Table 1 after being tagged with semantic tags using the TIPSTER/MUC standards; NE: named entity, TE: temporal entity, LOC: location.
Table 3

The [U.N. Security Council]-NE on [Monday]-TE was to address a dispute between [U.N.]-NE chief weapons inspector [Richard Butler]-NE and [Iraq]-NE over which disarmament documents [Baghdad]-NE must hand over. Speaking in an interview with [CNN]-NE on [Sunday]-TE evening, [Butler]-NE said that despite the latest dispute with [Iraq]-NE, it was too soon to make a judgment that the [Iraqis]-NE had broken last week's agreement to unconditionally resume cooperation with weapons inspectors -- an agreement which narrowly averted airstrikes by the [United States]-NE and [Britain]-NE.
The training set is used to model the relationship between the appearance of some features (text spans, labels, or other syntactic and semantic features of the document) in the document, and the appearance of features in the summary. This can be, in the simplest case, a mapping between the appearance of a word in the document and the likelihood of the same or another word appearing in the summary.
The applicants used a training set of over twenty-five thousand documents that had associated headlines or summaries. These documents were analyzed to ascertain the conditional probability of a word in a document given that the word appears in the headline. In the following table, the probabilities for words appearing in the text of Table 1 are set forth.

Table 4

Word          Conditional Probability
Iraqi         0.4500
Dispute       0.9977
Weapons       1.000
Inspection    0.3223
Butler        0.6641
The system making use of the translation model extracts words or phrases from the source text based upon the probability these or other words will appear in summaries.

The probability that certain subsets of words individually likely to appear in summaries will appear in combination can be calculated using Bayes' theorem. Thus, the probability of the phrase "weapons inspection dispute", or any ordering thereof, may be expressed simply as:

Pr("weapons" | "weapons" in document) * Pr("inspection" | "inspection" in document) * Pr("dispute" | "dispute" in document).

Equivalently, this probability may be expressed as:

Log(Pr("weapons" | "weapons" in document)) + Log(Pr("inspection" | "inspection" in document)) + Log(Pr("dispute" | "dispute" in document)).
More involved models can express the relationship among arbitrary subsets, including subsequences, of the words in the document and subsets of candidate words that may appear in the summary. The more involved models can express relationships among linguistic characterizations of subsets of terms in the document and summaries such as parts-of-speech tags, or parse trees.
The more involved models may express relationships among these sets of terms and meta-information related to the document or the summary, such as length, derived statistics over terms (such as proportion of verbs or nouns in the document, average sentence length, etc.), typographical information, such as typeface, formatting information, such as centering, paragraph breaks and so forth, and meta-information, such as provenance (author, publisher, date of publication, Dewey or other classification), recipient, reader, news group, and media through which the document is presented (web, book, magazine, TV chyron or caption).
One of the advantages in learning a content selection model is that the system can learn relationships between summary terms that are not in the document and terms that are in the document, and apply those relationships to new documents thereby introducing new terms in the summary. Once a content selection model has been trained on the training set, conditional probabilities for the features that have been seen in the summaries can be computed. The summary structure generator makes use of these conditional probabilities to compute the most likely summary candidates for particular parameters, such as length of summary. Since the probability of a word appearing in a summary can be considered to be independent of the structure of the summary, the overall probability of a particular candidate summary can be computed by multiplying the probabilities of the content in the summary with the probability of that content expressed using a particular summary structure (e.g., length and/or word order).
Since there is no limitation on the types of relationships that can be expressed in the content selection model, variations on this invention can use appropriate training sets to produce a cross-lingual or even cross-media summary. For example, a table expressing the conditional probability that an English word should appear in a summary of a Japanese document could be used to simultaneously translate and summarize Japanese documents. An inventory of spoken word forms, together with a concatenative synthesis algorithm and a table of conditional probabilities that speech segments would be used in a spoken summary of a particular document, could be used to generate spoken summaries. Similarly, corresponding video or other media could be chosen to represent the content of documents.
Example
For use in generating summaries, the probability of finding particular words in a summary is learned from the training set. For certain words appearing in the text set forth in Table 1, the learned probabilities are listed in the following table:

Table 5

Word          Log probability of word in Reuters headlines
Iraqi         -3.0852
Dispute       -1.0651
Weapons       -2.7098
Inspection    -2.8417
Butler        -1.0038
Also, for generating summaries, the probability of finding pairs of words in sequence in the training set summaries is learned. For certain words appearing in the text set forth in Table 1, the learned probabilities are listed in the following table:
Table 6

Word pair (word 1, word 2)    Log probability of word 2 given word 1
Iraqi weapons                 -0.7622
Weapons inspection            -0.6543
Inspection dispute            -1.4331

To calculate the desirability of a headline containing the sequence "Iraqi weapons inspection...", the system multiplies the likelihood of seeing the word "Iraqi" in a headline (see Table 5) by the likelihood of its being followed by "weapons" and of that being followed by "inspection" (see Table 6). This may be expressed as follows: Log(P("Iraqi")) + Log(P("weapons" | "Iraqi")) + Log(P("inspection" | "weapons")), which, using the values in the tables, yields a log probability of -2.8496. Alternative sequences using the same words, such as "Iraqi dispute weapons", have probabilities that can be calculated similarly. In this case, the sequence "Iraqi dispute weapons" has not appeared in the training data and is estimated using a back-off weight. A back-off weight is a very small but non-zero weight or assigned probability for word sequences not appearing in the training set.
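A minimal sketch of this kind of sequence scoring, assuming a map of unigram log probabilities (as in Table 5), a map of bigram log probabilities (as in Table 6), and a single assumed back-off constant for unseen pairs, might look like the following; the class name and back-off value are hypothetical.

import java.util.Map;

// Hypothetical sketch: score a candidate word sequence with a unigram log
// probability for the first word and bigram log probabilities for each
// following word, falling back to an assumed constant for unseen pairs.
public class SequenceScore {
    static final double BACKOFF_LOG_PROB = -10.0;   // assumed constant, not taken from the patent

    public static double score(String[] words,
                               Map<String, Double> unigramLogProb,
                               Map<String, Double> bigramLogProb) {
        double total = unigramLogProb.getOrDefault(words[0], BACKOFF_LOG_PROB);
        for (int i = 1; i < words.length; i++) {
            String pair = words[i - 1] + " " + words[i];
            total += bigramLogProb.getOrDefault(pair, BACKOFF_LOG_PROB);
        }
        return total;
    }
}

Plugging the Table 5 value for "Iraqi" and the Table 6 values for "Iraqi weapons" and "weapons inspection" into such a function computes the sum of the corresponding log probabilities, while a pair absent from the training data, such as "dispute weapons", would fall back to the assumed back-off constant.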
These calculations can be extended to take into account the likelihood of semantic and syntactic tags both at the word or phrase level, or can be carried out with respect to textual spans from characters on up. The calculations can also be generalized to use estimates of the desirability of sequences of more than two text spans (for example, tri-gram (three-word sequence) probabilities may be used).
Other measures of the desirability of word sequences can be used. For example, the output of a neural network trained to evaluate the desirability of a sequence containing certain words and tags could be substituted for the log probabilities used in the preceding explanation.
Moreover, other combination functions for these measures could be used rather than multiplication of probabilities or addition of log probabilities.
In general, the summary generator comprises any function for combining any form of estimate of the desirability of the whole summary under consideration such that this overall estimate can be used to make a comparison between a plurality of possible summaries.
Even though the search engine and summary generator have been presented as two separate processes, there is no reason for these to be separate.
In the case of the phrase discussed above, the overall weighting used in ranking can, as one possibility, be obtained as a weighted combination of the content and structure model log probabilities.
Alpha* (Log (Pr ("Iraqi" I "Iraqi" in doc) ) + Log (Pr ( "weapons" | "weapons" in doc))+
Log (Pr ( "inspection" I "inspection" in doc)))+ Beta* (Log (Pr ( "Iraqi" | start_of_sentence) ) +Log (Pr ( (weapons" | "Iraqi" ) ) +Log (Pr ( "inspection" | "weapons" ) ) ) . Using a combination of content selection models, language models of user needs and preferences, and summary parameters, a plurality of possible summaries, together with estimates of their desirability, is generated. These summaries are ranked in order of estimated desirability, and the most highly ranked summary or summaries are produced as the output of the system.
Depending on the nature of the language, translation and other models, heuristic means may be employed to permit the generation and ranking of only a subset of the possible summary candidates in order to render the summarization process computationally tractable. In the first implementation of the system, a Viterbi beam search was used to greatly limit the number of candidates produced. The beam search makes assumptions regarding the best possible word at the front position of a summary and, in considering the next position, will not undo the assumption concerning the first position. Other search techniques, such as A*, IDA* or SMA*, may be employed to comply with particular algorithmic or resource limitations. An example of the results of commanding the search to output the most highly ranked candidate for a variety of values of the summary length control parameter is set forth in the following table; a simplified sketch of such a beam search is given after Table 7.
Table 7

Number of Words    String
1                  Iraq
2                  United States
3                  Iraq on Weapons
4                  United States on Iraq
5                  United States in latest week
6                  United States in latest week on Iraq
7                  United States on security cooperation in latest week
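As noted above, the following is a simplified sketch of a beam search over candidate summary word sequences. The interface, class names and scoring hook are assumptions introduced for illustration; the ViterbiSearch class in the computer code appendix is the implementation actually used in the first version of the system.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: extend partial summaries one word at a time, keeping only
// the highest-scoring partial summaries (the "beam") at each position.
public class BeamSearchSketch {
    interface Scorer { double score(List<String> prefix, String nextWord); }

    public static List<String> search(List<String> vocabulary, Scorer scorer,
                                      int maxLength, int beamWidth) {
        record Hypothesis(List<String> words, double logProb) {}
        List<Hypothesis> beam = List.of(new Hypothesis(List.of(), 0.0));
        for (int pos = 0; pos < maxLength; pos++) {
            List<Hypothesis> expanded = new ArrayList<>();
            for (Hypothesis h : beam) {
                for (String w : vocabulary) {
                    expanded.add(new Hypothesis(
                            append(h.words(), w),
                            h.logProb() + scorer.score(h.words(), w)));
                }
            }
            // Keep only the most probable partial summaries.
            expanded.sort(Comparator.comparingDouble(Hypothesis::logProb).reversed());
            beam = expanded.subList(0, Math.min(beamWidth, expanded.size()));
        }
        return beam.get(0).words();
    }

    private static List<String> append(List<String> words, String w) {
        List<String> out = new ArrayList<>(words);
        out.add(w);
        return out;
    }
}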
The following computer code appendix contains code in the Java language to implement this invention. The UltraSummarise class is the main function that makes a summarizer object, loads a story, creates a search object and uses the Vocabulary class and story to produce a summary. The ViterbiSearch class defines the meat of the operation. It takes the LanguageModel class, the TranslationModel class and the story and searches for strings having the highest probability of being used in a summary for the story. The LanguageModel class reads a file which is a model for summaries containing the probabilities of each word following another. The TranslationModel class reads a file containing the probabilities that a word will appear in a summary given words in the story. The Story class reads the story. The Vocabulary class reads in a file that turns words into numbers. Those skilled in the computer programming arts could implement the invention described herein in a number of computer programming languages. It would not be necessary to use an object oriented programming language such as Java.

COMPUTER CODE APPENDIX
The following code in the Java language was written to implement the invention described above. The UltraSummarise class is the main function that makes a summarizer object, loads a story, creates a search object and uses the Vocabulary class and search object to produce a summary. The ViterbiSearch class defines the meat of the operation. It takes the LanguageModel class, the TranslationModel class and the story and searches for strings having the highest probability of being used in a summary for the story. The LanguageModel class reads in a file which is a model for summaries containing the probabilities of each word following another word in a summary. The TranslationModel class reads in a file containing the probabilities that a word will appear in a summary given words in the story. The Story class reads in the story. The Vocabulary class reads in a file that turns words into numbers.
import java.util.Date;
import LanguageModel;
import TranslationModel;
import Story;
import Vocabulary;
import ViterbiSearch;

final public class UltraSummarise
{
    final static int MAX_N_LEXEMES = 40000;
    final static int MAX_N_BIGRAMS = 400000;

    LanguageModel LM;
    TranslationModel TRM;
    Vocabulary Vcb;
    boolean myboredom = true;
    String sty1, sty2;

    public UltraSummarise (String [] args) throws Exception
    {
        if (args.length > 3) { myboredom = true; }
        Vcb = new Vocabulary(args[0], MAX_N_LEXEMES);                  // name, maxnlexemes
        LM = new LanguageModel(args[0], MAX_N_LEXEMES, MAX_N_BIGRAMS); // name, maxnlexemes, maxnbigrams
        TRM = new TranslationModel(args[0], MAX_N_LEXEMES);            // name, maxnlexemes
        sty1 = args[1];
        sty2 = args[2];
    }

    public void Run() throws Exception
    {
        Story Sty;
        ViterbiSearch Search;
        Sty = new Story(sty1, Vcb, MAX_N_LEXEMES);                     // storyname, maxnlexemes
        Search = new ViterbiSearch(myboredom, Vcb, LM, TRM);
        Search.produceStringSummary(Sty, 15);
    }

    public static void main (String [] args) throws Exception
    {
        System.out.println(new Date());
        if (args.length < 2) {
            System.err.println("Usage Java UltraSummarise corpusname story-file <bored>");
            System.exit(1);
        }
        UltraSummarise Ult = new UltraSummarise(args);
        Ult.Run();
        System.out.println(new Date());
        System.exit(0);
    }
}
import java.util.Hashtable; import java.io.*; import Vocabulary;
final public class LanguageModel { int MAX_N_LEXEMES; int MAX_LEX_BIGRAMS; // this is ugly, but it's not straightforward to avoid, // since the bigram file doesn't start with a count. Later, perhaps force it to.
final static boolean verbose_debug=false; int lastlexicalbigram; int lastlexicalunigram; float [] lexicalunigramprobs; float [] lexicalunigrambackoffs; float [] lexicalbigramprobs; final float NOT_A_LOG_PROB = 5.0F; // n:log(n)=5 is not a probability
String corpusname; Hashtable bigram_hashtable; int bigram_hashtable_last_element = 0; //used by bigramjndex below
static char [] mycharacters = {
    'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T',
    'U','V','W','X','Y','Z','a','b','c','d','e','f','g','h','i','j','k','l','m','n',
    'o','p','q','r','s','t','u','v','w','x','y','z'
};
static int range = mycharacters.length;
static StringBuffer keyspace = new StringBuffer("");
public LanguageModel (String setcorpusname, int MAX_N_LEXEMES, int MAX_LEX_BIGRAMS) throws Exception
{ corpusname = new String (setcorpusname); lexicalunigramprobs = new float [MAX_N_LEXEMES]; lexicalunigrambackoffs = new float [MAX_N_LEXEMES]; lexicalbigramprobs = new float [MAX_LEX_BIGRAMS]; bigram_hashtable = new Hashtable(MAX_LEX_BIGRAMS); System.out.println("Reading LM "+coφusname); readLMO; } public String getCorpusName()
{ return new String(corpusname);
}
// convert two bigram index elements into strings and store them into a hash table,
// look them up later int bigramjndex (int wordl, int word2, boolean create _p) throws Exception {
String mytempstring = null; int index; keyspace.setLength(O); index = wordl; while (index>0) { keyspace.append(mycharacters[index % range]); index /= range; } keyspace.appendC '); index = word2; while (index> ) { keyspace.append(mycharacters[index % range]); index /= range;
} if (keyspace.lengthO >= 1024) throw new Exception("something wrong with indices to bigramjndex");
mytempstring=keyspace.toString();
// System.out.print(mytempstring+ ",");
if (! (bigram Jιashtable.containsKey(myτempstring))) { if (create_p) { bigram_hashtable.put(mytempstring, new Integer(bigram_hashtable_last_element÷+)); if (verbose_debug) System.out.println("Put hash entry for ["-mytempstring+
"] ="+wordl÷","+word2);
} else { if (verbose_debug) System.out.println("no hash entry for ["+mytempstring+
"] ="+wordl÷","+word2); return -1; }
} retταm ((Integer)(bigram_hashtable.get(mytempstring))).intValueO; } void readLMO throws Exception
{
// read bigrams try {
StreamTokenizer BigramFile = new StreamTokenizer(new BufferedReader (new InpiitStreamReader
(new FileInputStream(corpusname+".biprobs")))); // see http://charts.um^ode.org/Unicode.charts/glyphless U0000.html BigramFile.wordChars(0x0021,0x007e); // basically all the characters // that could concievably be in a word in English
int lasi bi= -l; while (BigramFiie.nextTokenO != BigramFile.TT_EOF){ if (BigramFile.ttype != BigramFile.TT_NUMBER){ System.out.println(" Non number ["-BigramFile.sval÷"] in "
÷coφusname÷".biprobs"); Svstem.exit(l);
} ' int wl=(int)BigramFile.nval; if (BigramFile.nextToken() = BigramFile.TT_EOF){
System.out.println(" Number "÷wl÷" without second number in " ÷coφusname÷".biprobs");
System.exit(l);
} if (BigramFile.ttype != BigramFile.TT_NUMBER){
System.out.println(" Non number where second number expected [" ÷BigramFile.sval÷"] in "+coφusname-f-".biprobs");
System.exit(l);
} int w2==(int)BigramFile.nval; if (BigramFile nextTokenO = BigramFile.TT_EOF) {
System.out.println(" Numbers "+wl-","τw2÷" without probability in " ÷coφusname+".biprobs");
System.exit(l);
} if (BigramFile.ttype != BigramFile.TT_NUMBER){
System.out.println(" Non word in "τcoφusname+".biprobs");
System.exit(l);
} float prob = (float)BigramFile.nval; if (verbose_debug)System.out.println((wl)+","+w2-r" => "- prob); int bi = bigram Jndex(wl,w2,true); if(bi < last_bi){ System.err.println("Got duplicate "+wl+","+w2+"both mappeed to "+bi);
} last_bi=bi; lexicalbigramprobs[bi]=prob; lastlexicalbigram++;
} } catch (java.io.FileNotFoundException e)
{
System.out.println(" Couldn't open "+coφusname÷".biprobs"); System.exit(l); catch (java.io.IOException e)
{
System.out.println(" Problem reading "÷coφusname~".biprobs"); Svstem.exit(l);
} ' System.out.println(lastlexicalbigram÷" bigrams read");
// read unigrams try { lastlexicalunigram=0; StreamTokenizer UnigramFile = new StreamTokenizer(new BufferedReader (new InputStreamReader
(new FileInputStream(coφusname÷".uniprobs")))); // see http://charte.unicode.org Unicode.charts/glyphless/UOOOO.html UnigramFile. wordChars(0x0021,0x007e); // basically all the characters // that could concievably be in a word in English
while (UmgramFile.nextToken() != UnigramFile.TT_EOF){ if (UnigramFile.ttype != UnigramFile.TT_NUMBER){ System.out.println(" Non number ["÷UnigramFile.sval÷"] in "
+coφusname÷".uniprobs"); System.exit(l);
} int wl=(int)UnigramFile.nval; if (UnigramFile.nextTokenO = UnigramFile.TT_EOF) {
System.out.println(" Numbers "+wl+" without probability in " τcoφusname+".uniprobs");
System.exit(l);
} if (UnigrarnFile.ttype != UnigramFile.TT_NUMBER){
System.out.println(" Non word in "+coφUsname+".uniprobs");
System.exit(l);
} float prob = (float)UnigramFile.nval; if (verbose_debug) System.out.println((wl)+" => "+prob);
lexicalunigramprobs[wl]=prob; lastlexicalunigram++; } } catch 0'ava.io.FileNotFoundException e)
{ System.out.printin(" Couldn't open "-coφUsname-".uniprobs"); System, exit(l);
} catch (java.io.IOException e)
{
System.out.println(" Problem reading "÷coφusname÷".uniprobs"); System.exit(l);
}
System.out.println(lastlexicalunigram÷" unigrams read");
// read unigrambackoffs try { int lastuniback=0;
StreamTokenizer UnibackFile = new StreamTokenizer(new BufferedReader (new InputStreamReader
(new FileInputStream(coφusname÷".unibackoff')))); // see htφ://charts.umcode.org/Umcode.charts/glyphless/U0000.html UnibackFile. wordChars(0x0021,0x007e); // basically all the characters // that could concievably be in a word in English
while (UnibackFile.nextToken() != UnibackFile.TT_EOF){ if (UnibackFile.ttype != UnibackFile.TT_NUMBER) { System.out.println(" Non number ["+UnibackFile.sval+"] in "
+coφusname+".unibackoff'); System.exit(l);
} int wl=(int)UnibackFile.nval; if (UnibackFile.nextTokenO = UnibackFile.TT_EOF) {
System.out.println(" Number "+wl+" without probability in " +coφusname+".unibackoff);
System.exit(l);
} if (UnibackFile.ttype != UnibackFile.TT_NUMBER) {
System.out.println(" Non word in "+coφUsname+".unibackoff ');
System.exit(l);
} float prob = (float)UnibackFile.nval; if (verbose_debug) System.out.println((wl A" => "+prob); lexicalunigrambackoffs[wl]=prob; lastuniback-H-: Svstem.out.printin(lastuniback-" unisrams backoffs read");
} catch (java.io.FileNotFoundException e)
{
System.out.println(" Couldn't open "-coφusname-".unibackoff ');
System.exit(l);
} catch (java.io.IOException e)
{
System.out.println(" Problem reading "-coφusname-". unibackoff ');
System.exit(l); }
    } // end of readLM()

    float bigram_probability(int word1, int word2) throws Exception
    {
        int bi_index = bigram_index(word1, word2, false);
        if (bi_index == -1) return NOT_A_LOG_PROB;
        else return lexicalbigramprobs[bi_index];
    }

    float backoff_weight(int w) { return lexicalunigrambackoffs[w]; }

    float unigram_probability(int w) { return lexicalunigramprobs[w]; }

    public float probability(int w1, int w2) throws Exception
    {
        // p(wd2|wd1) = if (bigram exists) p_2(wd1,wd2)
        //              else bo_wt_1(wd1) * p_1(wd2)
        float bigram_prob = bigram_probability(w1, w2);
        if (!(NOT_A_LOG_PROB == bigram_prob)) {
            return bigram_prob;
        } else {
            return (backoff_weight(w1) + unigram_probability(w2)) * 3; // make backoff rather undesirable
        }
    }
}
import java.io.*;

final public class TranslationModel {
    float [] lexicaltranslationprobs;
    String corpusname;
    static boolean verbose_debug = false;

    public TranslationModel (String ThisCorpusName, int MaxNLexemes) {
        lexicaltranslationprobs = new float [MaxNLexemes];
        corpusname = ThisCorpusName;
        System.out.println("Reading Translation Model " + corpusname + ".numtrprobs");
        readTrModel();
    }

    public float probability(int w1)
    {
        return lexicaltranslationprobs[w1]; // log probability
    }

    void readTrModel() {
        try {
            int i = 0;
            StreamTokenizer TransFile = new StreamTokenizer(new BufferedReader(new InputStreamReader(
                new FileInputStream(corpusname + ".numtrprobs"))));
            // see http://charts.unicode.org/Unicode.charts/glyphless/U0000.html
            TransFile.wordChars(0x0021, 0x007e); // basically all the characters
                                                 // that could conceivably be in a word in English
            while (TransFile.nextToken() != TransFile.TT_EOF) {
                if (TransFile.ttype != TransFile.TT_NUMBER) {
                    System.out.println(" Non number [" + TransFile.sval + "] in " + corpusname + ".numtrprobs");
                    System.exit(1);
                }
                int index = (int)TransFile.nval;
                if (TransFile.nextToken() == TransFile.TT_EOF) {
                    System.out.println(" Number " + index + " without string in " + corpusname + ".numtrprobs");
                    System.exit(1);
                }
                if (TransFile.ttype != TransFile.TT_NUMBER) {
                    System.out.println(" Non float in " + corpusname + ".numtrprobs");
                    System.exit(1);
                }
                lexicaltranslationprobs[index] = (float)TransFile.nval;
                if (verbose_debug) System.out.println(index + " => " + lexicaltranslationprobs[index]);
                i++;
            }
            System.out.println(i + " translation probs read");
        } catch (java.io.FileNotFoundException e) {
            System.out.println(" Couldn't open " + corpusname + ".numtrprobs");
            System.exit(1);
        } catch (java.io.IOException e) {
            System.out.println(" Problem reading " + corpusname + ".numtrprobs");
            System.exit(1);
        }
    }
}
import Java. util.Date; import java.io.*; import Vocabulary;
public class Story { final static boolean verbose debug=false; static String storyvocabname: private int n_unique_lex ernes: private int [] uniquejexemes: private int [] lexeme_used;
public Story (String storyname, Vocabulary Vocab, int MAXJNALEXEMES) { storyvocabname=storyname; unique_lexemes = new int [ZvLAX_N_LEXEMES]; lexeme_used = new int [MAXJNJ EXEMES]; // use plain initVocab() if you have numeric stories initVocabFromTextFile(Vocab); } int termCountO { return n_uniquejexemes;
} int term(int i) { return unique Jexemes[i-1]; }
void initVocabFromTextFile(Vocabulary Vocab) { // Read the current story vocab try { n_unique exemes=0; StreamTokenizer VCBFile = new StreamTokenizer(new BufferedReader (new InputStreamReader (new FilelnputStream(storyvocabname))));
while (VCBFile.nextToken() != VCBFile.TT_EOF){ String tok; if (VCBFile.ttype = VCBFile.TT_NUMBER){ tok=(String.valueOf(VCBFile.nval)).toLowerCase0; }else if (VCBFile.ttype = VCBFile.TT_WORD){ tok=VCBFile.sval.toLowerCase().replace('.*,' ').replacε("",* ').replace(',',' ').trim0; }else {
// System.out.println("Story.java Found funny thing in "+storyvocabname-r" type
"+VCBFile.ttype); tok="<unknown>"; continue;
}
// System.out.println(" "+tok-" "-Vocab.tolndex tok)); if (Vocab. o Index(tok) >=0 ){ if (0 = lexeme_used[Vocab.toIndex('tok)]) ( uniquejexemes[n_uniquejexemes — ]=Vocab.toIndex tok); lexeme_usedrVocab.toIndex(tok)] — ; } } if (verbose_debug) System.out.println((n_unique Jexemes- 1 )-
" -> "-(unique_lexemesrn_uniquejexemes-l])):
}
// n_uniquejexemes — ; // Undo extra increment
System.out.println("Usin2 vocabularv of "-(n_uniquejexemes)-" words");
} catch (java.io.FileNotFoundException e)
{
System.out.println(" Couldn't open "-rstoryvocabname); Svstem.exit(l);
} catch 'ava.io.IOException e)
{
System.out.println(" Problem reading "÷storyvocabname);
System, exit(l); } } void initVocabO {
// Read the current story vocab token indices try { n_unique exemes=0; StreamTokenizer VCBFile = new StreamTokenizer(new BufferedReader (new InputStreamReader (new FilelnputStream(storyvocabname))));
while (VCBFile.nextTokenO != VCBFile.TTJEOF){ if (VCBFile.ttype != VCBFile.TT_NUMBER) { System.out.println(" Non number in "+storyvocabname); System.exit(l);
} umqueJexemes[n_unique_lexemes-H-]=(int)VCBFile.nval; if (verbose_debug) System.out.println((n_unique Jexemes- 1 )-
" -> "+(unique_lexemes[n_uniquejexemes-l])); }
// n_uniquejexemes -; // Undo extra increment
System.out.printlnO'Usinε vocabulary of "~(n_unique Jexemes A )÷" words");
} catch 'ava.io.FileNotFoundException e)
{
System.out.println(" Couldn't open "-storyvocabname); System, exit(l);
} catch (java.io.IOException e)
{
System.out.println(" Problem reading "÷storyvocabname); System.exit(l); } } public static void main (String [] args) throws Exception { int MAX_N_LEXEMES = 100000; Vocabulary Vocab; Story CurrentStory; System.out.println(new Date()); if (args. length < 1) {
System.err.println 'Usage Story coφusname textualstory");
System, exit(l);
}
Vocab=new Vocabulary(args[0],MAX_N_LEXEMES); //name,maxnlexemes Currents tory=new Story(args[l],Vocab, MAX_N_LEXEMES); System. out.println(new Date()); System.exit (0); }
} import java.util.Hashtable; import java.io.*;
/* The vocabulary used by the search. It provides methods that convert the lexical
   items used by the language model into strings for output. */
public class Vocabulary { final static boolean verbose_debug = false; String [] vocab_words; Hashtabie vocabjndex: int staπ_sent; int end sent;
public Vocabulary (String coφusname,int MAXJNJLEXEMES) { vocab_words = new String [MAX_NJ_EXEMES]; vocab_index = new Hashtable(MAX_N_LEXEMES); // Read the current story vocab names try { StreamTokenizer VCBFile = new StreamTokenizer(new BufferedReader (new InputStreamReader
(new FileInputStrεam(coφusname÷".numvocab ")))); // see http://charts.unicode.org Unicode.charts/glyphless UOOOO.html VCBFile.wordChars(0x0021,0x007e); // basically all the characters // that could concievably be in a word in English
while (VCBFile.nextTokenO != VCBFile.TT_EOF){ if (VCBFile.ttype != VCBFile.TT_NUMBER){ System.out.println(" Non number ["-VCBFile.svaK"] in "+coφusname÷".num vocab "+VCBFile.linenoO);
System.exit(l);
} int index=(int)VCBFile.nval; if (VCBFile.nextToken() = VCBFile.TT_EOF){
System.out.println(" Number "+index÷" without string in " +coφusname+".numvocab "+VCBFile.lineno());
System.exit(l);
}
// read in string that has been quoted by the massage program, // and strip the quotes off (Java kind of rules compared to c) vocab_words[index]=VCBFile.sval.replace('"V
if (vocab Jndex.containsKey(vocab_words[index])) { System.out.println("Repeated Vocab term "+vocab_words[index]+" at index "+ index); } else { vocabJndex.put(vocab_words[index],new Integer(index));
} if (verbose_debug) System.out.println((indεx)+ " => "+( vocab_words [index]));
} } catch (java.io.FileNotFoundExcεption ε)
{ System.out.println(" Couldn't open "-coφusname— ".numvocab"): Svstεm.exit(l);
} catch 0'ava.io.IOException e)
{
System.out.println(" Problem reading "-coφusname-Anumvocab"); System.exit(l);
// read sentεncε start and εnd markεrs try { StrεamTokεnizer StartStopFile = new StreamTokεnizεr(nεw BufferedReader (new InputStreamReader
(new FileInputStrεam(coφusnamε-".startεnd")))); // sεε http://charts.unicodε.org/Unicode.chaπs/glyphless/U0000.html StartStopFile.wordChars(0x0021,0x007e); // basically all the characters // that could conciεvably bε in a word in English
whilε (StartStopFile.nextToken() != StartStopFile.TT_EOF){ if (StartStopFile.ttypε != StartStopFile.TT_NUMBER){ System.out.println(" Non number ["÷StartStopFile.sval÷"] in " +coφusname+".startend"); System.exit(l);
} start_sent=(int)StartStopFile.nval; if (StartStopFile.nextTokenO = StartStopFile.TT_EOF){
System.out.println(" Number "+start_sent÷" without second in " +coφusname+".startend");
Svstem.exit(l);
} if (StartStopFile.ttype != StartStopFile.TT_NUMBER){
System.out.println(" Non word in "+coφusname-r".startεnd");
System.exit(l);
} end_sent = (int)StartStopFile.nval;
} } catch Oava.io.FileNotFoundException e)
{
System.out.println(" Couldn't open "+coφusname+".startend"); System.exit(l); } catch 0'ava.io.IOException e) I
System.out.printlnC Problem reading "-corpusname-". staπend"); Svstem.exit(l); } } public int startsenttoken 0 { return start_sent;
} public int endsenttoken () { return end_sent;
\ ) public String toString(int i) { return vocab_words[i]; } public int toIndex(String s){ if (vocab_index.containsKey(s)) { return ((Integer)(vocab_indεx.gεt(s))).intValuε(); } else return(-l); } }
/* Need to make this general, so that the probabilities for each transition come from an externally supplied method */
import java.util.Date; import java.io.*; import QSortAlgorithm; import Vocabulary;
public class ViterbiSearch { final static int MAX_N_SENSES = 40000; final static int MAX_SENT_LEN = 10; final public static float LOGJBOREDOMJDISCOUNT = -100.0F; final static float beam_width = 3.0F; final static int MIN BEAM SIZE = 20: } void setUpSearchStatεs (Story Sty){ n_current_states=0;
System.out.println ("Viterbi Search using story with "-(Sty.termCountO)-" words"); curr_start_sent=0;
SetSearch_state(curr_start_sent,Vocab.staπsεnttoken ); curr_end_sεnt=Sty.tεrmCount()÷l ;
SetSearch_state(curr_end_sentNocab.endsenttoken()); int i; for (i=l; i<=Sty.termCount();i-H-) { S etS earch_state(i,S tv. term(i)) ;
} } void SεtSearch_state(int statε, int value) { search_states[state]=value;
7/ System.out.println("State:"+state÷" word: "-value); stateJoJexeme[state]=value; n_current_states=java.lang.Math.max(n_current_statεs.state÷l);
}
/* The following code section does the actual Viterbi search */
// change these later to dynamic size, since easy in Java int [][] backpointers = new int [MAX_SEΝT_LEΝ][MAX_Ν_SEΝSES];
// only actually need 2 slices, but this is clearer float [][] scores = new float [MAX_SENT_LEN][MAX_N_SENSES];
int [] currentscoreindicεs = new int [MAX_N_SENSES]; float [] currentscores = new float [MAX_N_SENSES];
void backtrack(int posJn_sent, int from_state)
{ if (pos Jn_sent>=0) backtrack(pos_in_sent-l, backpointers [posJn_sent][from_state]);
System.out.print(Vocab.toString(search_states[from_state])÷" ");//÷"("-from_state÷" "÷sεarch_statεs[from_statεl-") "); }
// This backtracks through the current path to date,
// discouraging the selection of words already in the path
// -- multiple occurrances are multiply discouraging.
// This is not a terribly good idea, since it can't undo decisions earlier on,
// it may pick a non optimal repetition disallowing path float discount >yJ>oredomJevel(fioat bored_probability, int posJn_sent, int from_word,
int word o_match){ if (pos_in_sent>=0){ if (from_word=word_to_match) {
// System.out.print("BORED!"+wordJo_match); bored_probability += LOG_BOREDOM_DISCOUNT;
} bored_probability= discount_by_boredomJevel(bored_probability, pos_in_sent-l, backpointers[posJn_sent][from_word], word ojnatch);
} return bored_probability;
}
void dump()
{ int wordJn_sent; int this_vocab; for (word Jn_sent=0;word_in_sent<MAX_SENT_LEN; wordJn_sent++) { System.out.println (wordJn_sεnt); for (this_vocab = 0; this_vocab < n_currεnt_statεs; this_vocab++) { System.out.printm(tWs_vocab+''<-,'+backpointers[word_in_sent][this_vocab]+'' "+scores[word_in_sent][this_vocab]);
}
System.out.println("\n"); } } public void producεStringSummary (Story Sty, int length) throws Exception { setUpSearchStates(Sty); doSearchO; } void doSearchO throws Exception { // System. out.println("doSearch ncurrentstates is "÷n_current_states÷"\n curr_start_Sent= "+Vocab.toString(curr_staπ_sent)÷" ("+curr_start_sent÷")"÷"curr_end_Sent= "÷Vocab.toString(curr_εnd_sent)÷" ("+curr_end_sεnt-")"); // first state probabilities are transitions out of <s> for (int this_state = 0; this_state < n_current_states; this_state÷÷) { backpointers[0][this_state]=curr_start_sent; scores[0] [this_state]
=LMod.probability(search_states[curr_start_sent], search_states [this_statε] ) +TrMod.probability(sεarch_states[this_state]); // Beam search currentscores[this_state]=scorεs[0] [this_statε] ; currentscoreindices[this_statε]=this_state:
}
// Beam search quicksort.sort(currentscores, currentscoreindices); for (int word_in_sent=l;word_in_sent<MAX_SENT_LEN; word Jn_sent+÷) { int best_state=curr_start_sent; float best_score=-100000.0F;
// Beam search - don't consider statεs with log probs (bεam_width) timεs smallεr
// than best score int current_beam = 0; for (int from_stateJ = 0; from_stateJ < n_current_states; from_stateJ-H-) {
// System.out.println("From state i:"+from_stateJ+" n_current_states:"-ι- n_current_states+" Current Scores[',+((n_current_states-l)-from_stateJ)+',]:"+(currentscores[(n_current_statεs-l)-from_stat ej])+" Currεnt Scores indices[M+((n_current_states-l)-from_stateJ)+"]:',+(currentscorεmdices[(n_current_statεs-l)-fro m_statej])); if ((from_stateJ >=MIN_BEAM_SIZE) && (currentscores[(n_current_states- 1 )-from_state J] < (currentscores[n_current_states-l] * beam_width))) break; current_beam=from_stateJ; } (int this_state = 0: this_state < n_currem_star.es: this_state÷V) { float max_score= -1000000.0F; int current_back=curr_start_sent; for (int from_statε_i = 0; from_star.e_i <= curren t_beam; from_state_i-H-) { float test_scorε; int from_state = currentscoreindicεs[(n_current_states-l)-from_state_i];
// System.out.println("state "÷this_state-:-" from i:"÷from_state_i-" from:"- from_statε);
// Nεver repεat a statε immεdiately, or make a transition out of EOS if ((from_state = this_state) || (from_state == curr_εnd_sent)) continue; if ((word_in_sent > 1) && (from_statε = curr_start_sent)) continue;
test_score=scores [word_in_sent- 1 ] [from_state] +LMod.probability(search_states[from_state],search_states[this_state]); if (test_score >max_score) { current_back=from_state; max_score=test_score; } } float bored_probability=TrMod.probability(search_states[this_state]); if (boredom) bored_probability= discount_by_boredom_level(borεd_probability, word Jn_sεnt- 1 ,currεnt_back,this_statε) ; scores[wordjn_sent][this_state]=max_score+bored_probability;
// Save scores for sorting, so can do beam search currentscores[this_state]=max_score+bored_probability; currεntscoreindices[this_state]=this_state;
backpointers[word_in_sent][this_statε]=currεnt_back;
// we need to check if the best state now is end of sent, and stop if so if (scores[word_in_sent][this_statε] > best_score){ best_score = scores[word_in_sent][this_state]; best_state = this_state; } // Beam Search quicksort.sort(currentscores. currentscoreindicεs);
Systεm.out.print(word_in_sent-":"); if (best_state = curr_end_sent) System. out.print ("* "); backtrack(word_in_sent, curr_end_sent);
System.out.println(" "-scores[word_ιn_sent][curr_end_sεnt]-" Beam "-(curren t_beam-l));
}
// dump();
}
As used in the following claims, a "summary string" is a derivative representation of the source document which may, for example, comprise an abstract, key word summary, folder name, headline, file name or the like.
Having thus defined our invention in the detail and particularity required by the Patent Laws, what is desired to be protected by Letters Patent is set forth in the following claims.

Claims

WHAT IS CLAIMED IS:
1. A computer method for preparing a summary string from a source document of encoded text, the method comprising the steps of: a) comparing a training set of encoded text documents with manually generated summary strings associated therewith to learn probabilities that a given summary word or phrase will appear in summary strings given a source word or phrase appears in an encoded text document; and b) constructing from the source document a summary string containing summary words or phrases having the highest probabilities of appearing in a summary string based on the learned probabilities established in the previous step.
2. The computer method according to claim 1, comprising constructing a summary string containing the most probable summary word, words, phrase or phrases for a preselected number of words or phrases in the summary string.
3. The computer method according to claim 2, comprising comparing the training set of encoded text documents with manually generated summary strings to learn the probability that a summary word or phrase appearing in a summary string will follow another summary word or phrase and constructing a summary string containing the most probable word or sequence of words and/or phrases for a preselected number of words in the summary string.
4. The computer method according to claim 1, comprising comparing a corpus of encoded text documents with manually generated summary strings associated therewith to learn the probabilities that a given summary word or phrase will appear in summary strings given a source word or phrase appears in the encoded text, considering the context in which the source word or phrase appears in the encoded text documents.
5. The computer method according to claim 4, wherein the contexts in which the source words or phrases are considered include titles, headings and standard paragraphs.
6. The computer method according to claim 4, wherein the contexts in which the source words or phrases are considered include fonts, bolding and italicizing.
7. The computer method according to claim 4, further comprising learning multiple probabilities that a summary word or phrase will appear in a summary string given a source word or phrase appears in the encoded text, considering the various usages of the word or phrase in the encoded text.
8. The computer method according to claim 7, wherein the usages in which the source words are considered are syntactic usages.
9. The computer method according to claim 8, wherein the syntactic usages include the word or phrase's part of speech.
10. The computer method according to claim 7, wherein the usages in which the source words or phrases are considered are semantic usages.
11. The computer method according to claim 10, wherein the usages in which source words or phrases are considered include usage categories selected from the TIPSTER/MUC standards.
12. The computer method according to claim 10, wherein the usages in which source words or phrases are considered include usage categories selected from the group AGENT, CIRCUMSTANCE, CIRCUMSTANCE/TEMPORAL, COMMUNICATIVE_ACTION and OBJECT.
13. The computer method according to claim 4, wherein the step for comparing a corpus of encoded text documents with manually generated summary strings takes into consideration external information in the form of queries, user models, past user interaction and other biases to optimize the form of the summary strings constructed in the summary constructing step.
14. The computer method according to claim 1, comprising producing summaries in a different language from the source document by using a training set of an encoded text document in one language with manual summaries in another language.
PCT/US2000/000268 1999-01-07 2000-01-06 Method for producing summaries of text document WO2000041096A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU24927/00A AU2492700A (en) 1999-01-07 2000-01-06 Method for producing summaries of text document

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US11501699P 1999-01-07 1999-01-07
US60/115,016 1999-01-07
US09/351,952 1999-07-12
US09/351,952 US6317708B1 (en) 1999-01-07 1999-07-12 Method for producing summaries of text document

Publications (1)

Publication Number Publication Date
WO2000041096A1 true WO2000041096A1 (en) 2000-07-13

Family

ID=26812763

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/000268 WO2000041096A1 (en) 1999-01-07 2000-01-06 Method for producing summaries of text document

Country Status (3)

Country Link
US (1) US6317708B1 (en)
AU (1) AU2492700A (en)
WO (1) WO2000041096A1 (en)

Families Citing this family (95)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7197451B1 (en) * 1998-07-02 2007-03-27 Novell, Inc. Method and mechanism for the creation, maintenance, and comparison of semantic abstracts
US7152031B1 (en) * 2000-02-25 2006-12-19 Novell, Inc. Construction, manipulation, and comparison of a multi-dimensional semantic space
JP2001014306A (en) * 1999-06-30 2001-01-19 Sony Corp Method and device for electronic document processing, and recording medium where electronic document processing program is recorded
US7509572B1 (en) * 1999-07-16 2009-03-24 Oracle International Corporation Automatic generation of document summaries through use of structured text
JP2001043215A (en) * 1999-08-02 2001-02-16 Sony Corp Device and method for processing document and recording medium
US6564210B1 (en) * 2000-03-27 2003-05-13 Virtual Self Ltd. System and method for searching databases employing user profiles
US6941513B2 (en) * 2000-06-15 2005-09-06 Cognisphere, Inc. System and method for text structuring and text generation
US7672952B2 (en) * 2000-07-13 2010-03-02 Novell, Inc. System and method of semantic correlation of rich content
US7286977B1 (en) * 2000-09-05 2007-10-23 Novell, Inc. Intentional-stance characterization of a general content stream or repository
US7653530B2 (en) * 2000-07-13 2010-01-26 Novell, Inc. Method and mechanism for the creation, maintenance, and comparison of semantic abstracts
US7389225B1 (en) 2000-10-18 2008-06-17 Novell, Inc. Method and mechanism for superpositioning state vectors in a semantic abstract
US20090234718A1 (en) * 2000-09-05 2009-09-17 Novell, Inc. Predictive service systems using emotion detection
US20100122312A1 (en) * 2008-11-07 2010-05-13 Novell, Inc. Predictive service systems
US7177922B1 (en) 2000-09-05 2007-02-13 Novell, Inc. Policy enforcement using the semantic characterization of traffic
JP4299963B2 (en) * 2000-10-02 2009-07-22 ヒューレット・パッカード・カンパニー Apparatus and method for dividing a document based on a semantic group
US20020087985A1 (en) * 2000-12-01 2002-07-04 Yakov Kamen Methods and apparatuses for displaying meaningful abbreviated program titles
JP3768105B2 (en) 2001-01-29 2006-04-19 株式会社東芝 Translation apparatus, translation method, and translation program
GB2377046A (en) * 2001-06-29 2002-12-31 Ibm Metadata generation
WO2003005166A2 (en) 2001-07-03 2003-01-16 University Of Southern California A syntax-based statistical translation model
US9009590B2 (en) * 2001-07-31 2015-04-14 Invention Machines Corporation Semantic processor for recognition of cause-effect relations in natural language documents
US8799776B2 (en) * 2001-07-31 2014-08-05 Invention Machine Corporation Semantic processor for recognition of whole-part relations in natural language documents
WO2003012661A1 (en) * 2001-07-31 2003-02-13 Invention Machine Corporation Computer based summarization of natural language documents
US7712028B2 (en) * 2001-10-19 2010-05-04 Xerox Corporation Using annotations for summarizing a document image and itemizing the summary based on similar annotations
JP2003248676A (en) * 2002-02-22 2003-09-05 Communication Research Laboratory Solution data compiling device and method, and automatic summarizing device and method
US20030170597A1 (en) * 2002-02-22 2003-09-11 Rezek Edward Allen Teaching aids and methods for teaching interviewing
JP3624186B2 (en) * 2002-03-15 2005-03-02 Tdk株式会社 Control circuit for switching power supply device and switching power supply device using the same
US7620538B2 (en) 2002-03-26 2009-11-17 University Of Southern California Constructing a translation lexicon from comparable, non-parallel corpora
GB2390704A (en) * 2002-07-09 2004-01-14 Canon Kk Automatic summary generation and display
US7328147B2 (en) * 2003-04-03 2008-02-05 Microsoft Corporation Automatic resolution of segmentation ambiguities in grammar authoring
US20040243545A1 (en) * 2003-05-29 2004-12-02 Dictaphone Corporation Systems and methods utilizing natural language medical records
GB2401225A (en) * 2003-04-30 2004-11-03 Smart Analytical Solutions Ltd Classifying Message Content
US7603267B2 (en) * 2003-05-01 2009-10-13 Microsoft Corporation Rules-based grammar for slots and statistical model for preterminals in natural language understanding system
US8548794B2 (en) 2003-07-02 2013-10-01 University Of Southern California Statistical noun phrase translation
JP3987934B2 (en) * 2003-11-12 2007-10-10 国立大学法人大阪大学 Document processing apparatus, method and program for summarizing user evaluation comments using social relationships
US8296127B2 (en) 2004-03-23 2012-10-23 University Of Southern California Discovery of parallel text portions in comparable collections of corpora and training using comparable texts
US8666725B2 (en) 2004-04-16 2014-03-04 University Of Southern California Selection and use of nonstatistical translation components in a statistical machine translation framework
WO2006042321A2 (en) 2004-10-12 2006-04-20 University Of Southern California Training for a text-to-text application which uses string to tree conversion for training and decoding
EP1669896A3 (en) * 2004-12-03 2007-03-28 Panscient Pty Ltd. A machine learning system for extracting structured records from web pages and other text sources
US20060161537A1 (en) * 2005-01-19 2006-07-20 International Business Machines Corporation Detecting content-rich text
US7610545B2 (en) * 2005-06-06 2009-10-27 Bea Systems, Inc. Annotations for tracking provenance
US8676563B2 (en) 2009-10-01 2014-03-18 Language Weaver, Inc. Providing human-generated and machine-generated trusted translations
US8886517B2 (en) 2005-06-17 2014-11-11 Language Weaver, Inc. Trust scoring for language translation systems
US7813918B2 (en) * 2005-08-03 2010-10-12 Language Weaver, Inc. Identifying documents which form translated pairs, within a document collection
US7565372B2 (en) * 2005-09-13 2009-07-21 Microsoft Corporation Evaluating and generating summaries using normalized probabilities
US10319252B2 (en) 2005-11-09 2019-06-11 Sdl Inc. Language capability assessment and training apparatus and techniques
US8943080B2 (en) 2006-04-07 2015-01-27 University Of Southern California Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections
US8386232B2 (en) * 2006-06-01 2013-02-26 Yahoo! Inc. Predicting results for input data based on a model generated from clusters
KR100785927B1 (en) * 2006-06-02 2007-12-17 삼성전자주식회사 Method and apparatus for providing data summarization
US8886518B1 (en) 2006-08-07 2014-11-11 Language Weaver, Inc. System and method for capitalizing machine translated text
US7899822B2 (en) * 2006-09-08 2011-03-01 International Business Machines Corporation Automatically linking documents with relevant structured information
US8433556B2 (en) 2006-11-02 2013-04-30 University Of Southern California Semi-supervised training for statistical word alignment
US7734623B2 (en) * 2006-11-07 2010-06-08 Cycorp, Inc. Semantics-based method and apparatus for document analysis
US9122674B1 (en) 2006-12-15 2015-09-01 Language Weaver, Inc. Use of annotations in statistical machine translation
US8468149B1 (en) 2007-01-26 2013-06-18 Language Weaver, Inc. Multi-lingual online community
US7725442B2 (en) * 2007-02-06 2010-05-25 Microsoft Corporation Automatic evaluation of summaries
US8615389B1 (en) 2007-03-16 2013-12-24 Language Weaver, Inc. Generation and exploitation of an approximate language model
US9031947B2 (en) * 2007-03-27 2015-05-12 Invention Machine Corporation System and method for model element identification
US8831928B2 (en) 2007-04-04 2014-09-09 Language Weaver, Inc. Customizable machine translation service
US7925496B1 (en) 2007-04-23 2011-04-12 The United States Of America As Represented By The Secretary Of The Navy Method for summarizing natural language text
US20080270119A1 (en) * 2007-04-30 2008-10-30 Microsoft Corporation Generating sentence variations for automatic summarization
US8825466B1 (en) 2007-06-08 2014-09-02 Language Weaver, Inc. Modification of annotated bilingual segment pairs in syntax-based machine translation
US20090083026A1 (en) * 2007-09-24 2009-03-26 Microsoft Corporation Summarizing document with marked points
WO2010073591A1 (en) * 2008-12-26 2010-07-01 日本電気株式会社 Text processing device, text processing method, and computer readable recording medium
US8301622B2 (en) * 2008-12-30 2012-10-30 Novell, Inc. Identity analysis and correlation
US8296297B2 (en) 2008-12-30 2012-10-23 Novell, Inc. Content analysis and correlation
US8386475B2 (en) * 2008-12-30 2013-02-26 Novell, Inc. Attribution analysis and correlation
JP2012520528A (en) * 2009-03-13 2012-09-06 インベンション マシーン コーポレーション System and method for automatic semantic labeling of natural language text
CN102439594A (en) * 2009-03-13 2012-05-02 发明机器公司 System and method for knowledge research
US20100250479A1 (en) * 2009-03-31 2010-09-30 Novell, Inc. Intellectual property discovery and mapping systems and methods
US20100299140A1 (en) * 2009-05-22 2010-11-25 Cycorp, Inc. Identifying and routing of documents of potential interest to subscribers using interest determination rules
US8990064B2 (en) 2009-07-28 2015-03-24 Language Weaver, Inc. Translating documents based on content
US8380486B2 (en) 2009-10-01 2013-02-19 Language Weaver, Inc. Providing machine-generated translations and corresponding trust levels
US10417646B2 (en) 2010-03-09 2019-09-17 Sdl Inc. Predicting the cost associated with translating textual content
US9268878B2 (en) * 2010-06-22 2016-02-23 Microsoft Technology Licensing, Llc Entity category extraction for an entity that is the subject of pre-labeled data
US9317595B2 (en) * 2010-12-06 2016-04-19 Yahoo! Inc. Fast title/summary extraction from long descriptions
US11003838B2 (en) 2011-04-18 2021-05-11 Sdl Inc. Systems and methods for monitoring post translation editing
US8694303B2 (en) 2011-06-15 2014-04-08 Language Weaver, Inc. Systems and methods for tuning parameters in statistical machine translation
CA2851772C (en) * 2011-10-14 2017-03-28 Yahoo! Inc. Method and apparatus for automatically summarizing the contents of electronic documents
US8886515B2 (en) 2011-10-19 2014-11-11 Language Weaver, Inc. Systems and methods for enhancing machine translation post edit review processes
US8942973B2 (en) 2012-03-09 2015-01-27 Language Weaver, Inc. Content page URL translation
US9336187B2 (en) * 2012-05-14 2016-05-10 The Boeing Company Mediation computing device and associated method for generating semantic tags
US10261994B2 (en) 2012-05-25 2019-04-16 Sdl Inc. Method and system for automatic management of reputation of translators
US8577671B1 (en) 2012-07-20 2013-11-05 Veveo, Inc. Method of and system for using conversation state information in a conversational interaction system
US9465833B2 (en) 2012-07-31 2016-10-11 Veveo, Inc. Disambiguating user intent in conversational interaction system for large corpus information retrieval
US9152622B2 (en) 2012-11-26 2015-10-06 Language Weaver, Inc. Personalized machine translation via online adaptation
US10095692B2 * 2012-11-29 2018-10-09 Thomson Reuters Global Resources Unlimited Company Template bootstrapping for domain-adaptable natural language generation
US9213694B2 (en) 2013-10-10 2015-12-15 Language Weaver, Inc. Efficient online domain adaptation
US9916284B2 (en) * 2013-12-10 2018-03-13 International Business Machines Corporation Analyzing document content and generating an appendix
US9852136B2 (en) 2014-12-23 2017-12-26 Rovi Guides, Inc. Systems and methods for determining whether a negation statement applies to a current or past query
US9854049B2 (en) * 2015-01-30 2017-12-26 Rovi Guides, Inc. Systems and methods for resolving ambiguous terms in social chatter based on a user profile
US10838996B2 (en) * 2018-03-15 2020-11-17 International Business Machines Corporation Document revision change summarization
US11501073B2 (en) * 2019-02-26 2022-11-15 Greyb Research Private Limited Method, system, and device for creating patent document summaries
US11263394B2 (en) * 2019-08-02 2022-03-01 Adobe Inc. Low-resource sentence compression system
US11281854B2 (en) * 2019-08-21 2022-03-22 Primer Technologies, Inc. Limiting a dictionary used by a natural language model to summarize a document
US11630869B2 (en) 2020-03-02 2023-04-18 International Business Machines Corporation Identification of changes between document versions

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5077668A (en) * 1988-09-30 1991-12-31 Kabushiki Kaisha Toshiba Method and apparatus for producing an abstract of a document
US5297027A (en) * 1990-05-11 1994-03-22 Hitachi, Ltd. Method of and apparatus for promoting the understanding of a text by using an abstract of that text
US5384703A (en) * 1993-07-02 1995-01-24 Xerox Corporation Method and apparatus for summarizing documents according to theme
US5638543A (en) * 1993-06-03 1997-06-10 Xerox Corporation Method and apparatus for automatic document summarization
US5708825A (en) * 1995-05-26 1998-01-13 Iconovex Corporation Automatic summary page creation and hyperlink generation
US5778397A (en) * 1995-06-28 1998-07-07 Xerox Corporation Automatic method of generating feature probabilities for automatic extracting summarization
US5848191A (en) * 1995-12-14 1998-12-08 Xerox Corporation Automatic method of generating thematic summaries from a document image without performing character recognition

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2077274C (en) 1991-11-19 1997-07-15 M. Margaret Withgott Method and apparatus for summarizing a document without document image decoding
US5701500A (en) * 1992-06-02 1997-12-23 Fuji Xerox Co., Ltd. Document processor
US5774845A (en) * 1993-09-17 1998-06-30 Nec Corporation Information extraction processor
US5510981A (en) 1993-10-28 1996-04-23 International Business Machines Corporation Language translation apparatus and method using context-based translation models
JP2809341B2 (en) * 1994-11-18 1998-10-08 松下電器産業株式会社 Information summarizing method, information summarizing device, weighting method, and teletext receiving device.
US5689716A (en) 1995-04-14 1997-11-18 Xerox Corporation Automatic method of generating thematic summaries
US5924108A (en) * 1996-03-29 1999-07-13 Microsoft Corporation Document summarizer for word processors
JP3579204B2 (en) * 1997-01-17 2004-10-20 富士通株式会社 Document summarizing apparatus and method
DE19708183A1 (en) * 1997-02-28 1998-09-03 Philips Patentverwaltung Method for speech recognition with language model adaptation
US6178401B1 (en) * 1998-08-28 2001-01-23 International Business Machines Corporation Method for reducing search complexity in a speech recognition system

Also Published As

Publication number Publication date
US6317708B1 (en) 2001-11-13
AU2492700A (en) 2000-07-24

Similar Documents

Publication Publication Date Title
WO2000041096A1 (en) Method for producing summaries of text document
US5423032A (en) Method for extracting multi-word technical terms from text
Bikel et al. An algorithm that learns what's in a name
US7092872B2 (en) Systems and methods for generating analytic summaries
Miller et al. BBN: Description of the SIFT system as used for MUC-7
Palmer Tokenisation and sentence segmentation
Řehůřek et al. Language identification on the web: Extending the dictionary method
Al Shamsi et al. A hidden Markov model-based POS tagger for Arabic
Scannell Statistical unicodification of African languages
JP2002517039A (en) Word segmentation in Chinese text
EP1627325B1 (en) Automatic segmentation of texts comprising chunks without separators
Avner et al. Identifying translationese at the word and sub-word level
De Meulder et al. Memory-based named entity recognition using unannotated data
WO2003079224A1 (en) Text generation method and text generation device
Wijffels et al. Package ‘udpipe’
Tufiş et al. Automatic diacritics insertion in Romanian texts
Zeldes et al. An NLP pipeline for Coptic
Ross et al. Factors affecting pitch accent placement
Van Halteren et al. Towards a syntactic database: The TOSCA analysis system
Zhuang et al. Weakly supervised extractive summarization with attention
Alegría et al. Linguistic and statistical approaches to Basque term extraction
Deksne Chat Language Normalisation using Machine Learning Methods.
L’haire FipsOrtho: A spell checker for learners of French
Sanders Using probabilistic methods to predict phrase boundaries for a text-to-speech system
Abebe et al. Amharic Text Corpus based on Parts of Speech tagging and headwords

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase