Publication number: US20080072134 A1
Publication type: Application
Application number: US 11/532,977
Publication date: Mar. 20, 2008
Filing date: Sep. 19, 2006
Priority date: Sep. 19, 2006
Publication number (alternate formats): 11532977, 532977, US 2008/0072134 A1, US 2008/072134 A1, US 20080072134 A1, US 20080072134A1, US 2008072134 A1, US 2008072134A1, US-A1-20080072134, US-A1-2008072134, US2008/0072134A1, US2008/072134A1, US20080072134 A1, US20080072134A1, US2008072134 A1, US2008072134A1
Inventors: Sreeram Viswanath Balakrishnan, Ganesh Ramakrishnan, Sachindra Joshi
Original assignee: Sreeram Viswanath Balakrishnan, Ganesh Ramakrishnan, Sachindra Joshi
Annotating token sequences within documents
US 20080072134 A1
Abstract
Token sequences within a number of documents are annotated. First, a base inverse index for unique tokens within the documents is received. The base inverse index includes a set of the unique tokens within the documents and a set of location lists for each unique token. Second, indices are created for a set of the token sequences within the documents from the base inverse index, to annotate the token sequences.
Claims (20)
1. A method for annotating token sequences within a plurality of documents comprising:
receiving a base inverse index for unique tokens within the plurality of documents, where the base inverse index comprises a set of the unique tokens within the plurality of documents and a set of location lists for each unique token; and,
creating indices for a set of the token sequences within the plurality of documents from the base inverse index, to annotate the token sequences.
2. The method of claim 1, wherein the base inverse index has an ordered list of the unique tokens, and each location list of the base inverse index is an ordered list of pointers to the plurality of documents.
3. The method of claim 2, wherein each location list comprises an ordered list of pointers configured to locate a document from the plurality of documents and a token offset within the document corresponding to a single occurrence of a token sequence associated with the location list.
4. The method of claim 2, wherein an annotation is defined as a dictionary label associated with all the token sequences annotating dictionary entities of a dictionary, the method further comprising:
creating an index for each token sequence within the dictionary having more than one token, as a multiple-token entry within the dictionary; and,
creating an index to a final dictionary annotation, by merging the indices for the multiple-token entries within the dictionary and single token entries within the dictionary.
5. The method of claim 4, wherein creating an index for each token sequence within the dictionary having more than one token comprises searching indices for a sequence of tokens within the token sequence for a subset of locations in which all tokens sequentially occur in the sequence.
6. The method of claim 1, further comprising defining a regular-expression entity as a token that matches a regular expression, the regular-expression entity employed in annotating the token sequences within the plurality of documents.
7. The method of claim 1, further comprising defining a merge operation operable on a first location list and a second location list that returns a location list of pointers, where each pointer of the location list returned is within the first location list or the second location list.
8. The method of claim 1, further comprising defining a consecutive-intersection operation operable on a first location list and a second location list that returns a location list of pointers.
9. The method of claim 8, wherein each pointer of the location list returned points to a sequence of tokens having a first consecutive subsequence within the first location list and a second consecutive subsequence within the second location list, and
wherein determining the index as the consecutive intersection of all of the plurality of location lists of pointers within the dictionary entity comprises employing the consecutive-intersection operation.
10. A method for annotating each of a plurality of tokens within a plurality of documents comprising:
receiving a base inverse index for the plurality of documents, the base inverse index having an ordered list of unique tokens and a set of location lists for each unique token, each location list being an ordered list of pointers to the plurality of documents;
for each of a plurality of derived entities, each derived entity being a sequence of tokens, determining an index as a consecutive intersection of all of a plurality of location lists of pointers within the derived entity, such that the index contains location lists of pointers to all occurrences of the sequence of tokens of the derived entity within the plurality of documents; and,
merging the location lists of pointers for all the derived entities to result in a final location list, such that the documents are annotated with the tokens of the derived entities.
11. The method of claim 10, further comprising composing each derived entity from a plurality of preexisting simpler entities using a set of rules written in modified context-free grammar (CFG).
12. The method of claim 11, wherein composing each derived entity from the preexisting simpler entities using the set of rules written in modified CFG comprises deriving the derived entity from a first consecutive sequence of tokens and a second consecutive sequence of tokens.
13. The method of claim 12, further comprising modifying the CFG from which each derived entity is composed from preexisting simpler entity rules, comprising:
defining a parallel intersection operation operable on a first location list and a second location list that returns a location list of pointers that is a subset of pointers to sequences of tokens within both the first location list and the second location list.
14. The method of claim 13, wherein modifying the CFG further comprises:
defining a first extension to consecutive-intersection operation operable on a first location list and a second location list that returns a location list of pointers that is a subset of the second location list, where every sequence within the subset is immediately preceded by a sequence within the first location list; and,
defining a second extension to consecutive-intersection operation operable on a first location list and a second location list that returns a location list of pointers that is a subset of the first location list, where every sequence within the subset is immediately preceded by a sequence within the second location list.
15. The method of claim 10, further comprising defining a merge operation operable on a first location list and a second location list that returns a location list of pointers, where each pointer of the location list returned is within the first location list or the second location list,
wherein merging the location lists of pointers for all the derived entities comprises employing the merge operation.
16. The method of claim 10, further comprising defining a consecutive-intersection operation operable on a first location list and a second location list that returns a location list of pointers, where each pointer of the location list returned points to a sequence of tokens having a first consecutive subsequence within the first location list and a second consecutive subsequence within the second location list,
wherein determining the index as the consecutive intersection of all of the plurality of location lists of pointers within the derived entity comprises employing the consecutive-intersection operation.
17. The method of claim 10, further comprising imposing a partial ordering of annotations of the tokens within the plurality of documents, so that lower-order annotations do not overlap with higher-order annotations.
18. The method of claim 17, further comprising defining an apply-order operation operable on a location list having an annotation type and an associated integer for the annotation type that returns a location list of pointers that is a subset of the location list having the annotation type for which all tokens in sequences of the subset returned have values less than or equal to the associated integer,
wherein imposing the partial ordering comprises employing the apply-order operation.
19. An article of manufacture comprising:
a tangible computer-readable medium; and,
means in the medium for annotating each of a plurality of tokens within a plurality of documents based on a base inverse index for the plurality of documents.
20. A computerized system comprising:
a computer-readable medium storing:
a plurality of documents having a plurality of tokens;
a base inverse index previously generated for the documents;
a mechanism to annotate each token within the documents based on the base inverse index, such that annotation of the plurality of documents occurs at a same time.
Description
    FIELD OF THE INVENTION
  • [0001]
    The present invention relates generally to annotating a collection of documents, and more particularly to annotating a collection of documents using a base inverse index of the documents.
  • BACKGROUND OF THE INVENTION
  • [0002]
    Entity annotation entails attaching a label, such as NAME or ORGANIZATION, to a sequence of tokens within a document. Entity annotation is typically useful in improving the accuracy of keyword-based web and document searches, as well as for data mining of text repositories. However, existing approaches to entity annotation are less than desirable.
  • [0003]
    For instance, existing approaches to entity annotation operate at the document level. Using either a rule-based or a machine learning-based annotator, the sequence of tokens within a document is fed to the annotator, and the annotator outputs corresponding labels. This approach does allow powerful natural language processing techniques to be used, such as part-of-speech tagging, phrase grammar parsing, and so on. However, the fundamental disadvantage of this approach is speed: the total time taken to annotate a corpus of documents scales at least linearly with the total number of tokens within the corpus. For document collections exceeding 10^8 or 10^9 documents, it thus can take days to annotate a large corpus of documents, even when using highly parallel server farms.
  • [0004]
    In particular, the prior art for named entity annotation is focused on annotation on a one-document-at-a-time basis. That is, tokens in a document are analyzed, either using handcrafted or machine-learned rules, and a sequence of tokens is determined as being an entity that belongs to one of several predetermined named entity annotation types. There are two broad categories of named entity recognition systems: knowledge engineering-based systems and machine learning system-based systems. The former are typically rules based, developed by experienced language engineers making use of human intuition, and require just a small amount of training data. However, a disadvantage is that development of such systems can be time-consuming, and changes may be difficult to accommodate.
  • [0005]
    By comparison, machine learning system-based systems use large amounts of annotated training data, and changes can be achieved, albeit by re-annotating all of the training data. Machine learning system-based systems are less expensive, but their results may be less than optimal due to poor precision and recall. The present invention improves the efficiency of both rule-based and machine learning-based annotators, as is now described.
  • SUMMARY OF THE INVENTION
  • [0006]
    The present invention relates to annotating token sequences within a collection of documents. A method for such annotation according to one embodiment of the invention receives a base inverse index for unique tokens within the documents. The base inverse index includes a set of the unique tokens within the documents, and a set of location lists for each unique token. Indices are created for a set of the token sequences within the documents from the base inverse index, to annotate the token sequences.
  • [0007]
    An article of manufacture of an embodiment of the invention includes a tangible computer-readable medium and means in the medium. The tangible medium may be a recordable data storage medium, or another type of tangible computer-readable medium. The means is for annotating each token within a number of documents based on a base inverse index for the documents, such as by performing a method of an embodiment of the invention, as has been described.
  • [0008]
    A computerized system of an embodiment of the invention includes a computer-readable medium and an annotation mechanism. The computer-readable medium stores a number of documents having a number of tokens, and a base inverse index previously generated for the documents. The mechanism annotates the token sequences within the documents based on the base inverse index, such as by performing a method of an embodiment of the invention, as has been described, and such that annotation of the documents occurs at the same time.
  • [0009]
    Embodiments of the invention provide for advantages over the prior art. The approach to entity annotation of the invention employs an inverse index typically created for rapid keyword-based searching of a document collection. As such, entity annotation does not occur at the document level, but rather at the document collection-level, such that annotation occurs for all the documents at substantially the same time. Operations on the inverse index are defined that enable the creation of indices to arbitrarily complex annotations from indices to simpler annotations. In one embodiment, the relationship between complex and simpler annotations is specified using a modified form of CFG. In these approaches, entity annotations for an entire collection of documents can be achieved several orders of magnitude faster than the document-based approaches within the prior art.
  • [0010]
    It is noted that the concept of using the inverted index for building complex entity annotations can be interpreted generally. For example, document classification and information extraction may all be considered forms of entity annotation that traditionally have been approached at the document level. Thus, those of ordinary skill within the art can appreciate that simple extensions to the methods described below allow for such document classification and information extraction at the index level, such that entity annotation as this phrase is used herein is inclusive of such classification and extraction.
  • [0011]
    Therefore, embodiments of the invention differ from the prior art at least in the respect that instances of annotation types are effectively found within an entire corpus, or collection, of documents, by working on the corpus-level inverted index, which itself can be determined fairly efficiently. As such, entity annotation occurs much more quickly than in the prior art. Still other advantages, aspects, and embodiments of the invention will become apparent by reading the detailed description that follows, and by referring to the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • [0012]
    The drawings referenced herein form a part of the specification. Features shown in the drawing are meant as illustrative of only some embodiments of the invention, and not of all embodiments of the invention, unless otherwise explicitly indicated, and implications to the contrary are otherwise not to be made.
  • [0013]
    FIG. 1 is a block diagram of a system, according to an embodiment of the invention.
  • [0014]
    FIG. 2 is a flowchart of a method for annotating documents based on an inverse index of the documents, according to an embodiment of the invention.
  • [0015]
    FIG. 3 is a flowchart of a partial method that can be employed in relation to the method of FIG. 2, according to an embodiment of the invention.
  • [0016]
    FIG. 4 is a flowchart of a partial method that can be employed in relation to the method of FIG. 2, according to an embodiment of the invention.
  • DETAILED DESCRIPTION OF THE DRAWINGS
  • [0017]
    In the following detailed description of exemplary embodiments of the invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific exemplary embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized, and logical, mechanical, and other changes may be made without departing from the spirit or scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.
  • [0018]
    FIG. 1 shows a system 100, according to an embodiment of the invention. The system 100 includes an annotation mechanism 102 and a computer-readable medium 104. The annotation mechanism 102 may be implemented in software, hardware, or a combination of software and hardware. The computer-readable medium 104 may be a tangible computer-readable medium, and may be or include a hard disk drive, volatile semiconductor memory, as well as other types of computer-readable media. As can be appreciated by those of ordinary skill within the art, the system 100 can include other components, besides those shown in FIG. 1.
  • [0019]
    The computer-readable medium 104 stores a number of text-based documents 106. The medium 104 also stores a base inverse index 108 that is generated for the documents 106. The generation of the inverse index 108 is beyond the scope of this patent application, and can be generated in a conventional or other manner. The inverse index 108 is typically created for rapid keyword-based search of the documents 106. The inverse index 108 may be considered information regarding the occurrence of terms within the documents 106 sorted by the terms themselves.
  • [0020]
    The annotation mechanism 102 generally annotates the tokens, or terms, within the inverse index 108 to generate the annotated inverse index 108′. As such, the documents 106 are inherently annotated, as the annotated documents 106′, by virtue of the annotated inverse index 108′. Because the documents 106 are annotated by annotating the inverse index 108 of the documents 106, it can be said that the documents 106 are all annotated at the same time. That is, because the inverse index 108 pertains to all the documents 106, annotating the index 108 effectively annotates all the documents 106 at the same time. Various approaches by which the annotation mechanism 102 may annotate the inverse index 108, and thus the documents 106 from which the inverse index 108 was generated, are now described.
  • [0021]
    FIG. 2 shows a method 200 for annotating token sequences within a collection of documents, according to an embodiment of the invention. The method 200 is particularly performed in relation to a dictionary with a unique name and associated set of token sequences that belong to the dictionary. Thereafter, another method is described that can be performed on more general entities, referred to as derived entities, where the dictionary entities of FIG. 2 are simply a special case of such derived entities.
  • [0022]
    A base inverse index for the documents is received (202). The base inverse index is the inverse index prior to annotation thereof, and hence is described as being the base such index. It is presumed that a document collection D contains documents d(1) to d(N). The base inverse index has two ordered sets: a first ordered list of unique tokens T with elements t(1) through t(M) that occur in the document collection D, and a set of location lists #L, where there is one list #l(i) for each unique token t(i). A location list is defined as an ordered list of pointers to the document collection D. Each pointer locates the document and the token offset of a single occurrence of the token t(i). Thus, the location list #l(i) for token t(i) can be used to locate every occurrence of t(i) within the document collection D.
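    As an illustration of the structure just described, the following is a minimal Python sketch of a base inverse index. The application treats index generation as out of scope, so the function name `build_base_index`, the whitespace tokenization, and the `(doc_id, offset)` pointer tuples are all assumptions made here for illustration only.

```python
from collections import defaultdict

def build_base_index(docs):
    """Map each unique token to an ordered list of (doc_id, offset) pointers.

    Assumption: tokens are whitespace-separated; a real index would use a
    proper tokenizer.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for offset, token in enumerate(text.split()):
            # Pointers are appended in (doc_id, offset) order, so every
            # location list is already sorted.
            index[token].append((doc_id, offset))
    # Keep the unique tokens themselves in sorted order as well.
    return dict(sorted(index.items()))

docs = ["Mr. John Smith met Mary", "Mary Smith called John"]
base_index = build_base_index(docs)
print(base_index["John"])  # [(0, 1), (1, 3)]
```

    Here `base_index["John"]` plays the role of the location list #l(i) for the token "John": it locates every occurrence of that token within the collection.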
  • [0023]
    It is further noted that the base inverse index is an index of base entities, where the base entities are unique tokens within the corpus of documents. Two more complex entities can be derived from the base index: regexp, or regular-expression, entities; and dictionary entities. Thus, a regular-expression entity is defined (204). A regexp entity &ERname is defined as a token that matches a regular expression %Rname. For example, if %Rcapword is ([A-Z][a-z]*), then any &ERcapword is a token corresponding to a word with an initial capital letter.
  • [0024]
    A merge operation is also defined (206). The merge operation merge(#la, #lb) returns a location list in which each pointer occurs in location list #la, location list #lb, or both location lists #la and #lb. Therefore, the location list #LRcapword for all entities &ERcapword, for example, can be composed by using merge(#la, #lb) to combine all the lists #l(i) for the tokens t(i), where t(i) satisfies %Rcapword.
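    The regexp entity and the merge operation can be sketched as follows. This is only one possible reading: pointers are represented here as `(doc_id, start, end)` spans with an exclusive end (an assumption, so that later operations can describe multi-token sequences), and the helper `regexp_entity` and the sample index are hypothetical.

```python
import re
from functools import reduce

def merge(la, lb):
    """Return the ordered union of two location lists (duplicates removed)."""
    return sorted(set(la) | set(lb))

# Hypothetical base index: token -> list of (doc_id, start, end) spans.
base_index = {
    "John":  [(0, 1, 2), (1, 3, 4)],
    "Smith": [(0, 2, 3)],
    "met":   [(0, 3, 4)],
}

# %Rcapword from the text: a word with an initial capital letter.
R_CAPWORD = re.compile(r"[A-Z][a-z]*")

def regexp_entity(index, pattern):
    """Merge the location lists of every token that fully matches the pattern."""
    lists = [locs for token, locs in index.items() if pattern.fullmatch(token)]
    return reduce(merge, lists, [])

L_Rcapword = regexp_entity(base_index, R_CAPWORD)
print(L_Rcapword)  # [(0, 1, 2), (0, 2, 3), (1, 3, 4)]
```

    Because both inputs are already ordered, a production implementation would more likely use a linear two-pointer merge; the set-based union is used here only for brevity.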
  • [0025]
    A consecutive-intersection operation is also defined (208). The consecutive-intersection operation consint(#la, #lb) is the consecutive intersection of location lists #la and #lb, and returns a location list. For a pointer to be in the location list returned by consint(#la, #lb), it must point to a token sequence that consists of two consecutive subsequences @sa and @sb. Furthermore, the sequence @sa occurs in #la, and the sequence @sb occurs in #lb.
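    A minimal sketch of consint follows, again assuming that each pointer is a `(doc_id, start, end)` span with an exclusive end; that representation is an assumption of this example and is not specified in the text.

```python
def consint(la, lb):
    """Consecutive intersection of two location lists.

    Returns every span formed by a span from la that is immediately
    followed, in the same document, by a span from lb.
    """
    # Index the spans of lb by (doc_id, start) for constant-time lookup.
    ends_by_start = {}
    for doc, start, end in lb:
        ends_by_start.setdefault((doc, start), []).append(end)

    result = set()
    for doc, start, end in la:
        # A span of lb must begin exactly where the span of la ends.
        for end_b in ends_by_start.get((doc, end), []):
            result.add((doc, start, end_b))
    return sorted(result)

L_John = [(0, 1, 2), (1, 3, 4)]
L_Smith = [(0, 2, 3), (1, 1, 2)]
print(consint(L_John, L_Smith))  # [(0, 1, 3)]
```

    Only document 0 contains "John" immediately followed by "Smith", so the returned span covers token offsets 1 and 2 of that document.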
  • [0026]
    Thereafter, for each dictionary entity of a dictionary, an index is determined as a consecutive intersection of all location lists of pointers within the dictionary entity (210). A dictionary entity &EDname is defined as a sequence of tokens that occur in the dictionary $Dname. This dictionary is simply a list of token sequences, which are typically ordered. For example, if $Dfname is a list of all first names, then any token sequence annotated as &EDfname is a first name. For the simple case in which all first names are one token in length, the location list #LDfname corresponding to all entities of type &EDfname can be composed by using merge(#la, #lb) to combine all the lists #l(i), where t(i) is in the dictionary $Dfname.
  • [0027]
    For the more complex case, in which the sequences in $Dfname are more than one token in length, the following is performed. Particularly, for each token sequence t(i1), t(i2), . . . , t(ix) in $Dfname, where x is the length of the sequence, consint(#la, #lb) is first employed to generate an index that is the consecutive intersection of the lists #l(i1), #l(i2), . . . , #l(ix). This index contains the pointers to all occurrences of the sequence t(i1) through t(ix) in the collection. It can be appreciated by those of ordinary skill within the art that the complex case automatically collapses to the simple case where the token sequence is one token in length, that is, where x is equal to one. As such, the consecutive-intersection operation defined in part 208 may be considered as being used to perform part 210 of the method 200.
  • [0028]
    Thereafter, the location lists of all the token sequences that are members of the dictionary are merged to result in a final location list for the dictionary (212). As such, the documents are annotated via the tokens of the dictionary entities annotating the base inverse index. For instance, the merge operation merge(#la, #lb) is used to combine the lists for each sequence in $Dfname to yield the final location list #LDfname. As such, the merge operation defined in part 206 may be considered as being used to perform part 212 of the method 200.
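    Parts 210 and 212 can be sketched together as below. The helper name `dictionary_entity`, the sample index, and the `(doc_id, start, end)` span representation are assumptions of this example; dictionary entries are assumed to be non-empty token tuples.

```python
from functools import reduce

def merge(la, lb):
    """Ordered union of two location lists."""
    return sorted(set(la) | set(lb))

def consint(la, lb):
    """Spans of la immediately followed by spans of lb, joined into one span."""
    ends_by_start = {}
    for doc, start, end in lb:
        ends_by_start.setdefault((doc, start), []).append(end)
    return sorted({(doc, start, e) for doc, start, end in la
                   for e in ends_by_start.get((doc, end), [])})

def dictionary_entity(index, entries):
    """Build the final location list for a dictionary of token sequences."""
    result = []
    for seq in entries:
        lists = [index.get(token, []) for token in seq]
        # Part 210: consecutive intersection of the per-token lists. For a
        # one-token entry, reduce simply returns that token's own list.
        seq_locs = reduce(consint, lists)
        # Part 212: merge this entry's locations into the final list.
        result = merge(result, seq_locs)
    return result

base_index = {
    "Mary": [(0, 4, 5), (1, 0, 1)],
    "Ann":  [(1, 1, 2)],
    "John": [(0, 1, 2)],
}
D_fname = [("John",), ("Mary", "Ann")]
L_Dfname = dictionary_entity(base_index, D_fname)
print(L_Dfname)  # [(0, 1, 2), (1, 0, 2)]
```

    Note that the bare "Mary" in document 0 is not annotated, because only the two-token sequence "Mary Ann" is an entry in this hypothetical dictionary.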
  • [0029]
    It is noted that dictionary entities as in the method 200 of FIG. 2 are a special case of more complex entities that are referred to as derived entities. FIG. 3 shows a portion of a modified method 200′ for utilizing such derived entities generally, instead of using just the dictionary entities as in the method 200 of FIG. 2, according to an embodiment of the invention. The method 200′ of FIG. 3 includes all the parts that have been described as to the method 200 of FIG. 2, but the entities employed in parts 210 and 212 of the method 200 are modified within the method 200′ as being derived entities generally, and not necessarily dictionary entities. The modified method 200′ of FIG. 3 adds to the method 200 of FIG. 2 parts 302, 304, 306, 308, 310, and 312 being performed between parts 208 and 210 of the method 200.
  • [0030]
    Each derived entity is composed from preexisting simpler entities using a set of rules written in modified context-free grammar (CFG) (302). Consider the example &EXfullname→&EDfname &EDlname. This means that the derived entity &EXfullname is composed of two consecutive sequences @seq1 and @seq2, where @seq1 is an entity of type &EDfname and @seq2 is of type &EDlname, assuming that &EDlname is the dictionary entity of last names. From the definition of consint(#la, #lb), the location list #LXfullname for &EXfullname is obtained as follows: #LXfullname=consint(#LDfname, #LDlname). As such, &EXa→&EXb &EXc is generally interpreted as meaning that #LXa equals consint(#LXb, #LXc). Furthermore, &EXa→&EXb|&EXc is generally interpreted as meaning #LXa equals merge(#LXb, #LXc).
  • [0031]
    Therefore, extending the example further, $Dnameprefix may be a dictionary of common prefixes for names such as Mr., Mrs., Ms., Dr., and so on. A derived entity &EXperson can be composed that annotates sequences as a person so long as they are a first name, full name, last name, or name prefix followed by a sequence of capitalized words of at most two in length. Thus, &EXperson→&EDfname|&EXfullname|&EDlname|&EXnewname; &EXnewname→&EDnameprefix &EXcapword2; and, &EXcapword2→&ERcapword|&ERcapword &ERcapword.
  • [0032]
    The location list for &EXperson is composed from the simpler location lists recursively, by using the operators merge(#l1, #l2) and consint(#l1, #l2). Hence, #LXcapword2=merge(#LRcapword, consint(#LRcapword, #LRcapword)). Further, #LXnewname=consint(#LDnameprefix, #LXcapword2). Therefore, #LXperson=merge(#LDfname, merge(#LXfullname, merge(#LDlname, #LXnewname))).
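    The recursive composition above can be traced on a small example. The sketch assumes a single hypothetical document, "Dr. Alan Turing met John Smith", with hand-written location lists in place of real dictionary and regexp indices, and `(doc_id, start, end)` spans with an exclusive end.

```python
def merge(la, lb):
    """Ordered union of two location lists."""
    return sorted(set(la) | set(lb))

def consint(la, lb):
    """Spans of la immediately followed by spans of lb, joined into one span."""
    ends_by_start = {}
    for doc, start, end in lb:
        ends_by_start.setdefault((doc, start), []).append(end)
    return sorted({(doc, start, e) for doc, start, end in la
                   for e in ends_by_start.get((doc, end), [])})

# Token offsets: Dr.=0 Alan=1 Turing=2 met=3 John=4 Smith=5
L_Dfname = [(0, 4, 5)]        # "John" (hypothetical first-name dictionary)
L_Dlname = [(0, 5, 6)]        # "Smith" (hypothetical last-name dictionary)
L_Dnameprefix = [(0, 0, 1)]   # "Dr."
L_Rcapword = [(0, 1, 2), (0, 2, 3), (0, 4, 5), (0, 5, 6)]

L_Xfullname = consint(L_Dfname, L_Dlname)
L_Xcapword2 = merge(L_Rcapword, consint(L_Rcapword, L_Rcapword))
L_Xnewname = consint(L_Dnameprefix, L_Xcapword2)
L_Xperson = merge(L_Dfname,
                  merge(L_Xfullname, merge(L_Dlname, L_Xnewname)))

print(L_Xnewname)  # [(0, 0, 2), (0, 0, 3)]
print(L_Xperson)   # [(0, 0, 2), (0, 0, 3), (0, 4, 5), (0, 4, 6), (0, 5, 6)]
```

    The spans in `L_Xnewname` start at offset 0, so they include the prefix "Dr." itself, and several overlapping spans are annotated as a person at once.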
  • [0033]
    It is noted that one difficulty with the above approach is that the location list corresponding to &EXnewname can have pointers that span both the name-prefix and the sequence of the capitalized words. Therefore, it may be desirable to restrict the pointers so that they ignore the name-prefix. Another restriction that may be desired is that the capitalized words are also nouns, assuming that there is a noun entity annotator.
  • [0034]
    Therefore, the CFG is modified to include three operations (304). A parallel-intersection operation is defined (306). This operation parallelint(#la, #lb) is the parallel intersection of #la and #lb, returning the subset of pointers to sequences that are present in both #la and #lb. Thus, one modification of the CFG, using this parallel-intersection operation, is that &EXa→&EXb^&EXc is interpreted to mean that the entity &EXa corresponds to a sequence of tokens that have both &EXb and &EXc annotations, and both of which fully span the sequence. That is, given the production rule &EXa→&EXb^&EXc, the location list #LXa for &EXa is determined as #LXa=parallelint(#LXb, #LXc).
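    Under the assumption that a pointer is a `(doc_id, start, end)` span, so that two pointers are equal exactly when both annotations fully span the same sequence, parallelint reduces to an ordinary set intersection. A minimal sketch with hypothetical location lists:

```python
def parallelint(la, lb):
    """Parallel intersection: spans that appear in both location lists."""
    return sorted(set(la) & set(lb))

# Hypothetical lists for a noun annotation and a capitalized-word annotation.
L_Xnoun = [(0, 1, 2), (0, 3, 4)]
L_Rcapword = [(0, 1, 2), (0, 2, 3)]

L_Xncapword = parallelint(L_Xnoun, L_Rcapword)
print(L_Xncapword)  # [(0, 1, 2)]
```

    Only the span that carries both annotations survives, which is the behavior the rule &EXa→&EXb^&EXc asks for.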
  • [0035]
    A first extension to consecutive-intersection operation is also defined (308), as well as a second extension to consecutive-intersection operation (310), where both of these operations are different than the consecutive-intersection operation defined in part 208 of FIG. 2. The first extension to consecutive-intersection operation is consintwp(#la, #lb), and the second extension to consecutive-intersection operation is consintws(#la, #lb). Both return an ordered list of pointers. In the case of consintwp, the returned list is a subset of #lb and has the property that each sequence in this subset is immediately preceded by a sequence from within #la. For consintws, the returned list is a subset of #la, where each sequence within the subset is immediately followed by a sequence in #lb.
  • [0036]
    Thus, another modification of the CFG, using these two consecutive-intersection operations, is that &EXa→{&EXb} &EXc is interpreted to mean that entity &EXa is formed from two consecutive token sequences @seq1 and @seq2, where @seq1 is of type entity &EXb and @seq2 is of type entity &EXc. The curly brackets denote that where the location list for &EXa is computed, the pointers skip @seq1 and just point to @seq2. Thus, the location list #LXa for &EXa→{&EXb} &EXc is determined as #LXa=consintwp(#LXb, #LXc) and the location list #LXa for &EXa→&EXb {&EXc} is determined as #LXa=consintws(#LXb, #LXc).
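    The two extensions can be sketched as filters on one of the input lists. As before, the `(doc_id, start, end)` span representation and the sample lists are assumptions made for illustration.

```python
def consintwp(la, lb):
    """Subset of lb whose spans are immediately preceded by a span in la."""
    ends_a = {(doc, end) for doc, _start, end in la}
    return sorted({(doc, start, end) for doc, start, end in lb
                   if (doc, start) in ends_a})

def consintws(la, lb):
    """Subset of la whose spans are immediately followed by a span in lb."""
    starts_b = {(doc, start) for doc, start, _end in lb}
    return sorted({(doc, start, end) for doc, start, end in la
                   if (doc, end) in starts_b})

L_prefix = [(0, 0, 1)]                           # e.g. "Dr."
L_capword2 = [(0, 1, 2), (0, 1, 3), (0, 4, 5)]   # one or two capitalized words

print(consintwp(L_prefix, L_capword2))  # [(0, 1, 2), (0, 1, 3)]
print(consintws(L_prefix, L_capword2))  # [(0, 0, 1)]
```

    With consintwp the returned pointers skip the prefix and cover only the capitalized words, whereas consintws keeps only the prefix span that is followed by such words.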
  • [0037]
    Using this modified CFG, then, each derived entity may be derived from a first sequence of tokens and a second sequence of tokens (312), an example of which has been described in relation to the initial description of part 302. Thus, an arbitrarily complex annotation may be composed from simpler annotations. For the person-name example, the final set of rules that use the above modification are: &EXperson→&EDfname|&EXfullname|&EDlname|&EXnewname; &EXnewname→{&EDnameprefix} &EXncapword2; &EXncapword2→&EXncapword|&EXncapword &EXncapword; and, &EXncapword→&EXnoun^&ERcapword.
  • [0038]
    It is assumed that &EXnoun is the annotation for all tokens that are nouns. The corresponding location lists are determined as follows. First, #LXncapword=parallelint(#LXnoun, #LRcapword). Second, #LXncapword2=merge(#LXncapword, consint(#LXncapword, #LXncapword)). Third, #LXnewname=consintwp(#LDnameprefix, #LXncapword2). Fourth and finally, #LXperson=merge(#LDfname, merge(#LXfullname, merge(#LDlname, #LXnewname))).
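    The four steps can be followed end to end on a small example. The sketch assumes one hypothetical document, "Dr. Alan Turing wrote papers", hand-written input lists (including the output of an assumed noun annotator), and `(doc_id, start, end)` spans with an exclusive end.

```python
def merge(la, lb):
    """Ordered union of two location lists."""
    return sorted(set(la) | set(lb))

def parallelint(la, lb):
    """Spans present in both location lists."""
    return sorted(set(la) & set(lb))

def consint(la, lb):
    """Spans of la immediately followed by spans of lb, joined into one span."""
    ends_by_start = {}
    for doc, start, end in lb:
        ends_by_start.setdefault((doc, start), []).append(end)
    return sorted({(doc, start, e) for doc, start, end in la
                   for e in ends_by_start.get((doc, end), [])})

def consintwp(la, lb):
    """Subset of lb whose spans are immediately preceded by a span in la."""
    ends_a = {(doc, end) for doc, _start, end in la}
    return sorted({span for span in lb if (span[0], span[1]) in ends_a})

# Token offsets: Dr.=0 Alan=1 Turing=2 wrote=3 papers=4
L_Dnameprefix = [(0, 0, 1)]
L_Dfname = [(0, 1, 2)]                       # "Alan"
L_Dlname = [(0, 2, 3)]                       # "Turing"
L_Rcapword = [(0, 1, 2), (0, 2, 3)]
L_Xnoun = [(0, 1, 2), (0, 2, 3), (0, 4, 5)]  # assumed noun annotator output

L_Xfullname = consint(L_Dfname, L_Dlname)
L_Xncapword = parallelint(L_Xnoun, L_Rcapword)                        # first
L_Xncapword2 = merge(L_Xncapword, consint(L_Xncapword, L_Xncapword))  # second
L_Xnewname = consintwp(L_Dnameprefix, L_Xncapword2)                   # third
L_Xperson = merge(L_Dfname,
                  merge(L_Xfullname, merge(L_Dlname, L_Xnewname)))    # fourth

print(L_Xnewname)  # [(0, 1, 2), (0, 1, 3)]
print(L_Xperson)   # [(0, 1, 2), (0, 1, 3), (0, 2, 3)]
```

    In contrast to the unmodified rules, no span in `L_Xnewname` begins at offset 0, so the prefix "Dr." is no longer part of the person annotation.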
  • [0039]
    It is noted that the method 200 of FIG. 2 that has been described, as can be modified to result in the method 200′ of FIG. 3, assumes that the entity annotations are independent of one another, and that a sequence of tokens within a document collection can have multiple overlapping annotations. However, in some situations, it may be desirable to impose a partial ordering of the annotations such that lower-order annotations do not overlap with higher-order annotations. For example, where a sequence may be either an organization name or a person name, it may be desired to give priority to one over the other.
  • [0040]
    Therefore, FIG. 4 shows a portion of a modified method 200″ for imposing such ordering, according to an embodiment of the invention. The method 200″ of FIG. 4 includes all the parts that have been described as to the method 200 of FIG. 2, and which can be modified as has been described as to the method 200′ of FIG. 3. The method 200″ of FIG. 4 adds to the method 200 or the method 200′ parts 402, 404, and 406 after part 212, which are now described.
  • [0041]
    In general, as has been noted, a partial ordering of annotations of tokens within the documents is imposed (402). In particular, and in one embodiment, an array tokStatus of the integers of size equal to the total number of tokens within the document collection in question is created. This array is initialized with zeros. A positive integer is associated with each annotation type so that the order of these integers reflects the partial ordering of the annotation types that is desired to be imposed. Annotation types that are at the same level and that can overlap have the same integer associated with them.
  • [0042]
    An apply-order operation is defined (404). This operation, tokStatus.applyorder(x, #lp), takes as arguments the location list #lp of an annotation type and the associated integer x for that type. The operation returns the subset of pointers from #lp for which all the tokens in the corresponding sequences have associated values in tokStatus less than or equal to x. In addition, the tokStatus values for the sequences that are returned are updated to the value x. Therefore, if any part of a token sequence has already been annotated as an entity with a higher value of x, this token sequence will be removed from the list of pointers in #lp.
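    A minimal Python sketch of the tokStatus array and the apply-order operation follows. It assumes that each pointer in a location list can be represented as a (start, end) token span with an exclusive end; that representation, the class name TokStatus, and the sample levels are illustrative assumptions.

```python
class TokStatus:
    """Per-token record of the highest annotation level applied so far."""

    def __init__(self, num_tokens):
        # One integer per token in the collection, initialized with zeros.
        self.levels = [0] * num_tokens

    def applyorder(self, x, lp):
        """Return the spans of lp whose tokens all have status <= x,
        and raise the status of the tokens in those spans to x."""
        kept = []
        for start, end in lp:  # assumed span model: end is exclusive
            if all(self.levels[i] <= x for i in range(start, end)):
                kept.append((start, end))
                for i in range(start, end):
                    self.levels[i] = x
        return kept


# Hypothetical levels: organization = 2 (higher priority), person = 1.
tok_status = TokStatus(6)
organizations = tok_status.applyorder(2, [(0, 3)])
persons = tok_status.applyorder(1, [(2, 4), (4, 6)])
```

    Here the person span (2, 4) is dropped because token 2 already carries the higher organization level, while the span (4, 6) is kept.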
  • [0043]
    Thus, the apply-order operation is employed to impose a desired partial ordering (406), as defined in the array tokStatus. To ensure the location lists correctly reflect the partial ordering of the entities, the apply-order operation is applied in descending order of x values. That is, the operation is performed beginning with the highest order annotation types.
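    The effect of the descending order can be illustrated with a short Python sketch. The span representation, the function names, and the sample annotation types are assumptions made for illustration; the descending flag exists only to contrast the two orders.

```python
def applyorder(tok_status, x, lp):
    """Keep spans of lp whose tokens all have status <= x; mark them with x."""
    kept = []
    for start, end in lp:  # assumed span model: end is exclusive
        if all(tok_status[i] <= x for i in range(start, end)):
            kept.append((start, end))
            tok_status[start:end] = [x] * (end - start)
    return kept


def impose_ordering(num_tokens, annotations, descending=True):
    """annotations maps a type name to (level, location list)."""
    tok_status = [0] * num_tokens
    ordered = sorted(annotations.items(),
                     key=lambda item: item[1][0],
                     reverse=descending)
    return {name: applyorder(tok_status, level, lp)
            for name, (level, lp) in ordered}


# Hypothetical input: an organization span overlaps a person span at token 2.
annotations = {
    "person": (1, [(2, 4), (4, 6)]),
    "organization": (2, [(0, 3)]),
}
result = impose_ordering(6, annotations)                   # highest level first
naive = impose_ordering(6, annotations, descending=False)  # lowest level first
```

    Processing the highest level first removes the overlapping person span (2, 4). Processing the lowest level first leaves it in place, because the organization span is checked later and passes the less-than-or-equal test against the lower person level.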
  • [0044]
    It is noted that, as an alternative to determining tokStatus.applyorder(x, #lp) as a post-processing operation on a location list, this operation can be combined with the operation merge(#la, #lb). For instance, the operation tokStatus.merge(#la, #lb, x) can be defined as the operation that returns a location list which is a merge of the lists #la and #lb and which satisfies the constraints that tokStatus.applyorder(x, #lp) imposes on the resulting list. This alternative approach may be more efficient, since the token sequences can be checked against tokStatus while the location lists are being merged.
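    One way such a combined operation might look is sketched below in Python. The sketch assumes that both location lists are sorted lists of (start, end) token spans with an exclusive end, and that merging means a duplicate-free union; the function name merge_with_order and the sample data are hypothetical.

```python
def merge_with_order(tok_status, la, lb, x):
    """Merge two sorted location lists in one pass, keeping only spans whose
    tokens all have status <= x, and raise the status of kept spans to x."""
    merged = []
    i = j = 0
    while i < len(la) or j < len(lb):
        # Standard sorted merge: take the smaller of the two list heads.
        if j >= len(lb) or (i < len(la) and la[i] <= lb[j]):
            span = la[i]
            i += 1
        else:
            span = lb[j]
            j += 1
        if merged and merged[-1] == span:
            continue  # the same span occurs in both lists
        start, end = span
        # The ordering check happens while merging, not as a second pass.
        if all(tok_status[k] <= x for k in range(start, end)):
            merged.append(span)
            tok_status[start:end] = [x] * (end - start)
    return merged


# Hypothetical state: tokens 2 and 3 already carry a level-2 annotation.
tok_status = [0, 0, 2, 2, 0, 0]
merged = merge_with_order(tok_status, [(0, 2), (3, 5)], [(0, 2), (4, 6)], 1)
```

    The span (3, 5) is rejected during the merge because token 3 already has a higher level, so no separate apply-order pass over the merged list is needed.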
  • [0045]
    It is noted that, although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This application is thus intended to cover any adaptations or variations of embodiments of the present invention. Therefore, it is manifestly intended that this invention be limited only by the claims and equivalents thereof.
Classifications
US Classification: 715/230, 715/231, 707/E17.084, 707/999.1
International Classification: G06F17/00, G06F7/00
Cooperative Classification: G06F17/30616, G06F17/278
European Classification: G06F17/27R4E, G06F17/30T1E
Legal Events
Date: Sep 19, 2006; Code: AS; Event: Assignment; Description:
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BALAKRISHNAN, SREERAM VISWANATH;RAMAKRISHNAN, GANESH;JOSHI, SACHINDRA;REEL/FRAME:018271/0639;SIGNING DATES FROM 20060714 TO 20060716