US20090055389A1 - Ranking similar passages - Google Patents


Info

Publication number
US20090055389A1
Authority
US
United States
Prior art keywords
passage
score
calculating
ranking
instances
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/134,145
Inventor
William Noah Schilit
Okan Kolak
Justin John Paul Vincent-Foglesong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Priority to US12/134,145
Assigned to GOOGLE INC. reassignment GOOGLE INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KOLAK, OKAN, SCHILIT, WILLIAM NOAH, VINCENT-FOGLESONG, JUSTIN JOHN PAUL
Publication of US20090055389A1
Assigned to GOOGLE LLC reassignment GOOGLE LLC CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: GOOGLE INC.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Definitions

  • This invention pertains, in general, to scoring similar passages in digital text documents and, in particular, to ranking similar passages based on characteristics of the similar passages occurring in the digital text documents.
  • a digital text corpus such as a digital library that is accessible via the Internet.
  • a digital text corpus is established, for example, by scanning paper copies of documents including books and newspapers, and then applying an optical character recognition (OCR) process to produce computer-readable text from the scans.
  • the corpus can also be established by receiving documents and other texts already in machine-readable form.
  • Embodiments of the method comprise calculating at least one score based at least in part on characteristics of instances of the passage occurring in the digital corpus and generating a ranking score associated with the passage based at least in part on the calculated at least one score.
  • the method further comprises storing the ranking score in association with the passage in a computer-readable medium.
  • Embodiments of the computer program product and computer system comprise computer code for performing similar functions.
  • FIG. 1 shows an environment adapted to support ranking similar passages according to one embodiment.
  • FIG. 2 is a high-level block diagram illustrating a functional view of a typical computer for use as one of the entities illustrated in the environment of FIG. 1 according to one embodiment.
  • FIG. 3 is a high-level block diagram illustrating modules within the scoring engine according to one embodiment.
  • FIG. 4 is a flow chart illustrating steps performed by the scoring engine according to one embodiment.
  • FIG. 5 is a flow chart illustrating the interaction between the client device and the web server, the scoring engine, and the ranking engine according to one embodiment.
  • FIG. 6 is an exemplary web page showing ranked search results according to one embodiment.
  • FIG. 1 shows an environment adapted to support ranking similar passages according to one embodiment.
  • the environment 100 includes a data store 110 for storing a corpus 112 and a similar passage database 114 , a passage mining engine 116 for identifying similar passages in the corpus, a scoring engine 128 for assigning scores to similar passages, and a ranking engine 130 for ranking similar passages.
  • the environment also includes a client 118 for requesting and/or viewing information from the data store 110 , and a web server 120 for interacting with the client and providing interfaces allowing the client to access the information in the data store.
  • a network 122 enables communications between and among the data store 110 , passage mining engine 116 , scoring engine 128 , ranking engine 130 , client 118 , and web server 120 .
  • passage mining engine 116 and/or scoring engine 128 are connected to the network 122 periodically.
  • the engines 116 and 128 only need to communicate with the data store 110 in order to score similar passages in the corpus 112 and store the passage data in the passage database 114 .
  • the engines 116 and 128 do not need to interact with the client 118 or the web server 120 according to one embodiment.
  • the passage mining engine 116 may be off-line, and the web server 120 supports passage navigating by interacting with the client 118 and the data store 110 to retrieve information from the data store that is requested by the client.
  • the scoring engine 128 may be off-line, and the web server 120 supports retrieval of ranking information by interacting with the client 118 and data store 110 to retrieve information from the data store that is requested by the client.
  • the scoring engine 128 is connected to the network 122 periodically. When it is online, the scoring engine 128 communicates with the passage mining engine 116 or data store 110 in order to identify which similar passage instances to rank. The scoring engine 128 does not need to interact with the client 118 or the web server 120 according to one embodiment.
  • different embodiments of the environment 100 include different and/or additional entities than the ones shown in FIG. 1 , and the entities are organized in a different manner.
  • the data store 110 stores the corpus 112 of information and the similar passage database 114 . It also stores data utilized to support the functionalities or generated by the functionalities described herein. The data store 110 can also store other corpora and data. The data store 110 receives requests for information stored in it and provides the information in return. In a typical embodiment, the data store 110 is comprised of multiple computers and/or storage devices configured to collectively store a large amount of information.
  • the corpus 112 stores a set of information. In one embodiment, the corpus 112 stores the contents of a large number of digital documents.
  • the term “document” refers to a written work or composition. This definition includes, for example, conventional books such as published novels, and collections of text such as newspapers, news stories, magazines, journals, pamphlets, letters, articles, web pages and other electronic documents.
  • the document contents stored by the corpus 112 include, for example, the document text represented in a computer-readable format, images from the documents, scanned images of pages from the documents, etc.
  • the term “word” refers to a token containing a block of structured text. The word does not necessarily have meaning in any language, although it will have meaning in most cases.
  • the corpus 112 stores metadata about the documents within it.
  • the metadata are structured data that describe the documents. Examples of metadata include metadata about a book such as the author, publisher, year published, number of pages, edition, and libraries that carry the book.
  • the metadata stored in the corpus is associated with the similar passages stored in the similar passage database 114 .
  • the similar passage database 114 stores data describing similar passages in the corpus 112 .
  • the similar passage database 114 also stores the ranking score of the similar passage once a ranking score is assigned by the scoring engine 128 . More details describing the function of the scoring engine 128 are provided below.
  • the term “similar passage” refers to a passage in a source document that is found in a similar form in one or more different target documents. Occurrences of the same similar passage are referred to as “instances” of that passage. Oftentimes, the similar passage instances are identical. Nevertheless, the passages are referred to as “similar” because there might be slight differences among the passage instances in the different documents. When a source document is said to have multiple “similar passages,” it means that multiple passages in the source document are also found in other documents. This phrase does not necessarily mean that the “similar passages” within the source document are similar to each other. Similar passages are also referred to as “quotations,” “shared passages,” “popular passages,” and “related passages.”
  • the passage database 114 is generated by the passage mining engine 116 to store information obtained from passage mining.
  • the passage mining engine 116 constructs the passage database 114 by copying existing quotation collections such as Bartlett's, and searching and indexing the instances of quotations and their variations that appear in the corpus 112 .
  • the passage mining engine 116 constructs the passage database 114 by copying existing text appearing in a quoted form, such as delimited by quotation marks, from the corpus, and searching and indexing the instances of the text in the corpus 112 .
  • the passage mining engine 116 constructs the passage database 114 by copying each group of words, such as sentences, from the corpus, and searching and indexing the instances of the group of words in the corpus 112 .
  • the database 114 stores similar passages, document identifiers (Doc IDs) identifying the documents in which the passages exist, position identifiers (Pos IDs) identifying the location in the documents at which the passages appear, passage ranking results, etc. Further, in some embodiments, the database 114 also stores the documents or portions of the documents that have the similar passages.
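  • as an illustration only, the records described above could be modeled as follows. This is a minimal sketch, not the schema of the similar passage database 114: the class names `PassageInstance` and `SimilarPassage`, the field names, and the use of an integer offset for the Pos ID are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class PassageInstance:
    """One occurrence of a similar passage (hypothetical record layout)."""
    doc_id: str  # Doc ID: identifies the document containing the instance
    pos_id: int  # Pos ID: location of the instance inside that document


@dataclass
class SimilarPassage:
    """A similar passage together with its instances and ranking score."""
    text: str
    instances: List[PassageInstance] = field(default_factory=list)
    ranking_score: Optional[float] = None  # unset until a score is assigned

    def document_count(self) -> int:
        # Number of distinct documents in which the passage appears.
        return len({instance.doc_id for instance in self.instances})
```

Keeping the ranking score on the passage record mirrors the statement that the score is stored in association with the passage; a real store would likely index the instances by Doc ID as well.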
  • the passage mining engine 116 includes one or more computers adapted to analyze the texts of documents in the corpus 112 in order to identify similar passages. For example, the passage mining engine 116 may find that the passage “I read somewhere that everybody on this planet is separated by only six other people” from the book “Six Degrees of Separation” by John Guare, also appears in 13 other books published between 2000 and 2006.
  • the passage mining engine 116 may store, in the similar passage database 114 , the passage, its location in the “Six Degrees of Separation” book, Doc IDs of the 13 other books, Pos IDs indicating the locations of the passage instances in the 13 other books, and its ranking relative to other similar passages in the “Six Degrees of Separation” book or relative to other similar passages in the corpus 112 . More detail regarding the passage mining engine 116 is described in the related application, U.S. patent application Ser. No. 11/781,213, filed Jul. 20, 2007, and titled “Identifying and Linking Similar Passages in a Digital Text Corpus.” Passage mining may be performed off-line, asynchronously of any queries made by the client 118 against the data store 110 .
  • the passage mining engine 116 runs periodically to process all the text information in the corpus 112 from scratch and generate similar passage data for storing in the similar passage database 114 , disregarding any information obtained from prior passage mining. In another embodiment, the passage mining engine 116 is used periodically to incrementally update the data stored in the similar passage database 114 , for example, as new documents are added to the corpus 112 .
  • the scoring engine 128 includes one or more computers adapted to assign scores to the similar passages identified by the passage mining engine 116 and stored in the similar passage database 114 .
  • the scoring engine 128 analyzes the characteristics of the similar passages and the documents containing the similar passages stored in the similar passage database 114 and assigns ranking scores to the similar passages. Scoring may be performed on-line when the scoring engine is connected to network 122 and may also be performed off-line, asynchronously of any queries made by client 118 against the data store 110 .
  • the scoring engine 128 runs periodically to process all of the content from the data store 110 from scratch and assigns a score associated with a similar passage for storing in the similar passage database 114 .
  • scoring engine 128 is used periodically to incrementally update the ranking information stored in the similar passage database 114 , for example, as new similar passages are found and added to the similar passage database.
  • the ranking engine 130 ranks a set of similar passages to be displayed on the client 118 .
  • the ranking engine 130 ranks the set of similar passages based on the associated ranking scores of the similar passages.
  • the set of similar passages can be displayed on the client 118 in the ranked order.
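  • the ranking step can be sketched as a sort on the stored ranking scores. The function name `rank_passages` and the (text, score) pair representation are assumptions made for this example; the description does not specify how ties are broken, so this sketch simply keeps the input order for equal scores.

```python
from typing import List, Tuple


def rank_passages(scored: List[Tuple[str, float]]) -> List[str]:
    """Return passage texts ordered from highest to lowest ranking score."""
    # sorted() is stable, so passages with equal scores keep their input order.
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [text for text, _score in ordered]
```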
  • FIG. 1 shows the passage mining engine 116 , the scoring engine 128 , and the ranking engine 130 as discrete servers. However, in various embodiments, any or all of these engines can be combined. This allows a single server to perform the functions of one or more of the above-described engines.
  • the client 118 is an electronic device having a web browser for interacting with the web server 120 via the network 122 , and it is used by a human user to access and obtain information from the data store 110 . It can be, for example, a notebook, desktop, or handheld computer, a mobile telephone, personal digital assistant (PDA), mobile email device, portable game player, portable music player, computer integrated into a vehicle, etc.
  • the web server 120 interacts with the client 118 and the ranking engine 130 to provide information from the data store 110 .
  • the web server 120 includes a User Interface (UI) module 124 that communicates with the client's 118 web browser to receive and present information.
  • the web server 120 also includes a searching module 126 that searches for information in the data store 110 .
  • the UI module 124 may receive a query from the web browser issued by a user of the client 118 , and the searching module 126 may execute the query against the corpus 112 and the similar passage database 114 , and retrieve information including similar passages information that satisfies the query.
  • the similar passages are displayed and listed in accordance with a ranking order provided by the ranking engine 130 .
  • the network 122 represents communication pathways between the data store 110 , passage mining engine 116 , client 118 , web server 120 , the scoring engine 128 , and the ranking engine 130 .
  • the network 122 is the Internet.
  • the network 122 can also utilize dedicated or private communications links that are not necessarily part of the Internet.
  • the network 122 uses standard communications technologies, protocols, and/or interprocess communications techniques.
  • the network 122 can include links using technologies such as Ethernet, 802.11, integrated services digital network (ISDN), digital subscriber line (DSL), asynchronous transfer mode (ATM), etc.
  • the networking protocols used on the network 122 can include the transmission control protocol/Internet protocol (TCP/IP), the hypertext transfer protocol (HTTP), the simple mail transfer protocol (SMTP), the file transfer protocol (FTP), the short message service (SMS) protocol, etc.
  • the data exchanged over the network 122 can be represented using technologies and/or formats including the hypertext markup language (HTML), the extensible markup language (XML), etc.
  • all or some of the links can be encrypted using conventional encryption technologies such as the secure sockets layer (SSL), HTTP over SSL (HTTPS), and/or virtual private networks (VPNs).
  • the nodes can use custom and/or dedicated data communications technologies instead of, or in addition to, the ones described above.
  • FIG. 2 is a high-level block diagram illustrating a functional view of a typical computer 200 for use as one or more of the entities illustrated in the environment 100 of FIG. 1 according to one embodiment. Illustrated are at least one processor 202 coupled to a bus 204 . Also coupled to the bus 204 are a memory 206 , a storage device 208 , a keyboard 210 , a graphics adapter 212 , a pointing device 214 , and a network adapter 216 . A display 218 is coupled to the graphics adapter 212 .
  • the processor 202 may be any general-purpose processor such as an INTEL x86-compatible CPU.
  • the storage device 208 is any device capable of holding data, like a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device.
  • the memory 206 holds instructions and data used by the processor 202 and may be, for example, firmware, read-only memory (ROM), non-volatile random access memory (NVRAM), and/or RAM.
  • the pointing device 214 may be a mouse, track ball, or other type of pointing device, and is used in combination with the keyboard 210 to input data into the computer system 200 .
  • the graphics adapter 212 displays images and other information on the display 218 .
  • the network adapter 216 couples the computer system 200 to the network 122 .
  • the computer 200 is adapted to execute computer program modules.
  • the term “module” refers to computer program logic and/or data for providing the specified functionality.
  • a module can be implemented in hardware, firmware, and/or software.
  • the modules are stored on the storage device 208 , loaded into the memory 206 , and executed by the processor 202 as one or more processes.
  • the types of computers used by the entities of FIG. 1 can vary depending upon the embodiment and the processing power utilized by the entity.
  • the client 118 typically requires less processing power than the passage mining engine 116 , scoring engine 128 , ranking engine 130 , and web server 120 .
  • the client 118 system can be a standard personal computer or a mobile telephone.
  • the passage mining engine 116 , scoring engine 128 , ranking engine 130 , and web server 120 may comprise processes executing on more powerful computers, logical processing units, and/or multiple computers working together to provide the functionality described herein.
  • the passage mining engine 116 , scoring engine 128 , ranking engine 130 , and web server 120 might lack devices that are not required to operate them, such as displays 218 , keyboards 210 , and pointing devices 214 .
  • Embodiments of the entities described herein can include other and/or different modules than the ones described here.
  • the functionality attributed to the modules can be performed by other or different modules in other embodiments.
  • this description occasionally omits the term “module” for purposes of clarity and convenience.
  • FIG. 3 is a high-level block diagram illustrating modules within the scoring engine 128 according to one embodiment.
  • the scoring engine 128 includes a characteristics analysis module 302 and a score calculation module 306 .
  • An embodiment of the scoring engine 128 analyzes characteristics of similar passages and calculates scores for the passages based on the analyzed characteristics. The scores are assigned to the associated similar passages and stored in the similar passage database 114 . Some embodiments have different and/or additional modules than those shown in FIG. 3 . Moreover, the functionalities can be distributed among the modules in a different manner than described here.
  • the characteristics analysis module 302 analyzes characteristics associated with a similar passage and its similar passage instances in order to produce a total score. Characteristics that are analyzed include characteristics associated with the passage or passage instance itself and characteristics associated with the usage of the similar passage in the digital corpus 112 . Examples of such characteristics are the number of words in the passage, the author of the document which contains the similar passage instance, the publisher of the document which contains the similar passage instance, the characteristics of the words introducing and following the similar passage, how frequently the similar passage appears in the digital corpus, the length of the similar passage, the words of the similar passage, the usage of punctuation associated with the similar passage, and the diffusion of the similar passage in the digital corpus.
  • the diffusion of the similar passage is determined by analyzing the variation of the authors of the documents in which the instances of the passage appear, the variation of the publishers of the documents in which the similar passage instances appear, the variation of the libraries that carry the documents in which the similar passage instances appear, and/or the variation of the parts of the documents in which the similar passage instances appear.
  • the author associated with the document which contains a similar passage instance is identified and examined by the characteristics analysis module 302 .
  • the characteristics analysis module 302 compares the identified author to a list or database of previously-identified famous or known authors. In one embodiment, each author in the list or database has an associated score. In such embodiments, when the characteristics analysis module 302 compares the identified authors to the list or database, and the identified author is found therein, the module 302 assigns the score associated with that author to the similar passage instance. If the identified author is not found, the module 302 assigns a low score or a score of zero to the similar passage instance. In some embodiments, the authors in the list or database do not have an associated score. In those embodiments, the module 302 assigns a score to the similar passage instance based on whether the identified author was found in the database. The assigned score is represented by A(Q).
  • the list or database of previously-identified famous or known authors may be based on authors found in a printed encyclopedia, an online encyclopedia, such as Wikipedia, or other sources such as Bartlett's.
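  • both variants of the author lookup described above can be sketched in a few lines. The function names, the example author table, and the specific score values are hypothetical; the description only states that a found author yields its associated score (or a score based on membership) and that a missing author yields a low score or zero.

```python
from typing import Dict, Optional, Set


def author_score_from_table(author: Optional[str],
                            known_scores: Dict[str, float],
                            default: float = 0.0) -> float:
    """A(Q) when each known author carries its own score."""
    if author is None:
        return default
    # Unknown authors fall back to the low default score.
    return known_scores.get(author, default)


def author_score_from_list(author: Optional[str],
                           known_authors: Set[str],
                           found: float = 1.0,
                           missing: float = 0.0) -> float:
    """A(Q) when the list holds names only: the score reflects membership."""
    return found if author in known_authors else missing
```

The same lookup shape would apply to the known-document score B(Q), with document identifiers in place of author names.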
  • frequency of appearance of the similar passage is a characteristic that is examined.
  • the characteristics analysis module 302 examines and identifies the frequency of appearance of the similar passage in the digital corpus 112 . If the similar passage appears in fewer documents, the characteristics analysis module 302 assigns a lower score to that similar passage. If the similar passage appears in many documents, the characteristics analysis module 302 assigns a higher score to that similar passage.
  • a cliché or overused slogan may be identified as a similar passage and may be very prevalent throughout the digital corpus 112 . In those instances, the cliché or slogan may be assigned a lower score because the high frequency of occurrence does not necessarily indicate that the passage has great significance.
  • the length of the similar passage may be a factor in determining a score based on the frequency of appearance of the similar passage. For example, a very short similar passage (for example, one that includes fewer than five or six words) may appear frequently. However, since this passage is shorter than the average length of a passage, it is assigned a lower score. Conversely, if the similar passage is long (for example, more than ten words in length), it would still be assigned a high score if the frequency of appearance of the similar passage within the digital corpus 112 is high.
  • the score associated with the frequency of appearance of the similar passage in the digital corpus 112 is represented by F(Q).
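  • one possible shape for F(Q) is sketched below. The description gives only qualitative rules (more documents score higher, clichés and very short passages score lower), so the logarithmic growth, the halving penalties, and both thresholds are assumptions chosen purely for illustration.

```python
import math


def frequency_score(doc_count: int,
                    word_count: int,
                    cliche_doc_threshold: int = 10_000,  # assumed cutoff
                    short_word_limit: int = 5) -> float:
    """Illustrative F(Q): grows with document count, with two penalties."""
    if doc_count <= 0:
        return 0.0
    # Logarithmic growth so that very common passages do not dominate.
    score = math.log1p(doc_count)
    if word_count < short_word_limit:
        score *= 0.5  # very short passages are frequent for trivial reasons
    if doc_count > cliche_doc_threshold:
        score *= 0.5  # extremely prevalent passages are treated as clichés
    return score
```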
  • the length of the similar passage is a characteristic that is separately examined and scored by the characteristics analysis module 302 .
  • the characteristics analysis module 302 assigns a lower score to a very short passage (for example, one that includes fewer than five or six words) and assigns a higher score to a long passage (for example, more than ten words in length).
  • the score associated with the length of the similar passage in the digital corpus 112 is represented by L(Q).
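  • a minimal sketch of L(Q) follows, using the example thresholds from the text (fewer than five words is short, more than ten is long). Whitespace tokenization, the 0-to-1 range, and the linear interpolation between the two thresholds are assumptions.

```python
def length_score(passage: str, short_limit: int = 5, long_limit: int = 10) -> float:
    """Illustrative L(Q): 0.0 for very short passages, 1.0 for long ones."""
    word_count = len(passage.split())
    if word_count < short_limit:
        return 0.0
    if word_count > long_limit:
        return 1.0
    # Linear interpolation for lengths between the two thresholds.
    return (word_count - short_limit + 1) / (long_limit - short_limit + 2)
```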
  • the variation of words and grammar of the similar passage are characteristics that are examined.
  • the characteristics analysis module 302 examines the words of the similar passage and assigns a score to the similar passage in response.
  • the characteristics analysis module 302 assigns a lower score to a similar passage that contains repeating words or numbers and assigns a higher score to a passage that contains few repeating words or numbers.
  • if the similar passage is a chart or another table-like presentation of words (i.e., words with no verbs), the characteristics analysis module 302 assigns a lower score to that similar passage.
  • the characteristics analysis module 302 applies one or more language models to analyze the words of the similar passage. For example, language models may be used to determine whether the words of the similar passage demonstrate usage of proper grammar or whether the words contain too many numbers. In such embodiments, a high score is assigned to a passage that demonstrates use of proper grammar and a low score is assigned to a passage that demonstrates use of improper grammar. Additionally, the score of a passage that contains too many numbers is lowered. In one embodiment, the score associated with the word analysis of the similar passage in the digital corpus is represented by W(Q).
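  • the repetition and number checks can be approximated without a language model, as in the sketch below. It does not attempt the grammar analysis mentioned above; the function name `word_score` and the way the two ratios are multiplied together are assumptions.

```python
def word_score(passage: str) -> float:
    """Illustrative W(Q): penalizes repeated tokens and numeric tokens."""
    tokens = passage.lower().split()
    if not tokens:
        return 0.0
    # Share of distinct tokens: 1.0 when no word repeats.
    distinct_ratio = len(set(tokens)) / len(tokens)
    # Share of tokens that contain at least one digit.
    numeric_count = sum(1 for token in tokens if any(ch.isdigit() for ch in token))
    numeric_ratio = numeric_count / len(tokens)
    return distinct_ratio * (1.0 - numeric_ratio)
```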
  • the usage of punctuation associated with the similar passage is identified and examined by the characteristics analysis module 302 .
  • the use of quotation marks surrounding a similar passage is an indication that the similar passage is a quotation and therefore the passage is assigned a higher score.
  • the score associated with the use of punctuation marks is represented by P(Q).
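  • a simple way to detect surrounding quotation marks is to inspect the characters immediately before and after a passage instance. In this sketch the function name, the character-offset interface, and the binary 1.0/0.0 score are assumptions, and punctuation placed inside the closing quotation mark is not handled.

```python
OPENING_QUOTES = {'"', "\u201c"}  # straight and left curly double quotes
CLOSING_QUOTES = {'"', "\u201d"}  # straight and right curly double quotes


def punctuation_score(text: str, start: int, end: int) -> float:
    """Illustrative P(Q) for the instance text[start:end]."""
    before = text[:start].rstrip()
    after = text[end:].lstrip()
    opened = bool(before) and before[-1] in OPENING_QUOTES
    closed = bool(after) and after[0] in CLOSING_QUOTES
    return 1.0 if opened and closed else 0.0
```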
  • the document that contains a similar passage instance is a characteristic that is identified and examined by the characteristics analysis module 302 . Similar to the analysis of the author of the document, the characteristics analysis module 302 compares the identified document to a list or database of previously-identified famous or known documents. In one embodiment, each document in the list or database has an associated score. In such embodiments, when the characteristics analysis module 302 compares the identified document to the list or database of documents, and the identified document is found therein, the module 302 assigns the score associated with that document to the similar passage instance. If the identified document is not found in the database, the module 302 assigns a low score or a score of zero. In some embodiments, the documents in the list or database do not have associated scores. In those embodiments, the module 302 assigns a score to the similar passage instance based on whether the identified document was found therein. In one embodiment, the assigned score is represented by B(Q).
  • the set of words introducing a similar passage and the set of words following a similar passage are characteristics that are examined.
  • these words are known as speech acts.
  • words such as “Person X says” or “Person X wrote” are indications that a similar passage is to follow.
  • speech acts such as “said Person X” are indications that a similar passage appeared before the exemplary speech act phrase.
  • a higher score is assigned to a similar passage that is introduced by or followed by a speech act. In one embodiment, the assigned score is represented by S(Q).
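  • speech-act detection of this kind can be sketched with two regular expressions, one for the text before the passage and one for the text after it. The verb list, the function name, and the binary score are assumptions; a fuller implementation would need a much larger vocabulary of speech-act verbs.

```python
import re

# A speech-act verb at the very end of the preceding text, e.g. 'Person X says, "'.
INTRODUCING = re.compile(r'\b(says|said|wrote|writes)[\s",:\u201c]*$', re.IGNORECASE)
# A speech-act verb at the very start of the following text, e.g. '," said Person X'.
FOLLOWING = re.compile(r'^[\s",.\u201d]*(said|says|wrote|writes)\b', re.IGNORECASE)


def speech_act_score(before: str, after: str) -> float:
    """Illustrative S(Q): 1.0 if a speech act introduces or follows the passage."""
    if INTRODUCING.search(before) or FOLLOWING.match(after):
        return 1.0
    return 0.0
```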
  • a diffusion of the similar passage in the digital corpus 112 is examined by the characteristics analysis module 302 .
  • the assigned score is represented by D(Q) and is calculated by first calculating entropy scores as explained below.
  • the variation of the authors, or number of different authors, of the documents containing a particular similar passage is a component of the diffusion score.
  • the characteristics analysis module 302 examines the authors of the documents containing the instances of a particular similar passage in order to determine the number of different authors.
  • the characteristics analysis module 302 assigns a higher score to a similar passage that is associated with many different authors, and assigns a lower score to a similar passage that is associated with fewer different authors.
  • the score is calculated using the following entropy equation: $E(A) = -\sum_{x} p(x) \log_2 p(x)$
  • the entropy of the authors is calculated by taking the negative summation of the product of p(x) and the log of p(x), where p(x) is the probability that author x will occur in a given set of examined documents and is expressed as a fraction.
  • the individual probabilities correspond to the probability that a particular author will appear as an author of a document among the set of examined documents containing a particular similar passage.
  • if all of the examined documents are associated with the same author, p(x) would be one, and the entropy of the authors (E(A)) would be zero.
  • if the examined documents are associated with more than one author, the entropy of the authors (E(A)) would be greater than zero. If a large number of documents were examined and all the documents were associated with different authors, the value of the entropy of the authors would be high. For example, if ten documents were examined and ten authors were identified (each document corresponding to a different author), $p(x) \log_2 p(x)$ for each author is −0.3322 and the negative summation is 3.322.
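  • the entropy calculation can be checked with a short sketch. The function name `entropy` and the choice to estimate p(x) from observed counts are assumptions; the formula itself is the base-2 entropy described above, written as the sum of p(x)·log2(1/p(x)), which equals the negative sum of p(x)·log2(p(x)).

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def entropy(labels: Iterable[Hashable]) -> float:
    """Base-2 entropy of the label distribution, e.g. one author per document."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p(x) is estimated as the fraction of documents carrying label x.
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

With three documents by the same author the result is zero, and with ten documents by ten different authors it is log2(10), approximately 3.322, matching the worked example.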
  • the variation of the publishers of the documents associated with the particular similar passage is a component of the diffusion score.
  • the publishers of the documents containing instances of the particular similar passage are examined and identified. Similar to the calculation for authors, the characteristics analysis module 302 calculates an entropy of the publishers (E(P)) by using a formula similar to the one above, but in this case p(x) corresponds to the probability of the occurrence of a particular publisher. Therefore, similar to the analysis of the authors, the characteristics analysis module 302 assigns a higher score to a similar passage that is associated with many different publishers, and assigns a lower score to a similar passage that is associated with fewer different publishers.
  • the variation of the libraries that carry copies of the documents containing instances of the particular passage is a component of the diffusion score that is identified by the characteristics analysis module 302.
  • the characteristics analysis module 302 calculates an entropy of the libraries (E(L)).
  • p(x) corresponds to the probability of the appearance of a particular library that carries a copy of a document containing a particular similar passage. Therefore, similar to the analysis of the authors and publishers, the characteristics analysis module 302 assigns a higher score to a similar passage that appears in a document that is held in a collection of many different libraries, and assigns a lower score to a similar passage that appears in a document that is held in a collection of fewer different libraries.
  • the variation of the parts of documents in which the similar passage instances appear is a component of the diffusion score.
  • the characteristics analysis module 302 examines and identifies parts of the documents in which the similar passage appears.
  • a document is divided into a number of parts. For example, a document may be divided into three parts: a first third (the beginning part of the document), a second third (the middle part of the document), and a last third (the end part of the document).
  • the characteristics analysis module 302 determines in which parts of the documents the similar passage instances appear. Similar to the calculations above, the characteristics analysis module 302 calculates an entropy of the parts of the documents (E(Q)) using a similar formula.
  • the characteristics analysis module 302 assigns a higher score to a similar passage that appears in different parts of documents, and assigns a lower score to a similar passage that appears in the same part, or mostly the same part, of the documents.
  • the characteristics analysis module 302 combines the entropies calculated above (E(A), E(P), E(L), and E(Q)) in order to calculate a total diffusion (D(Q)) of the similar passage throughout the corpus. Depending upon the embodiment, the characteristics analysis module 302 calculates D(Q) as a sum of its components, as a weighted linear combination, as a weighted geometric mean, or using another technique. The characteristics analysis module 302 assigns the total diffusion score D(Q) to the similar passage. In some embodiments, the total diffusion score is stored in association with the similar passage in the similar passage database 114.
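  • One way to picture the diffusion calculation is the following Python sketch. It is an illustration only: the record keys (`author`, `publisher`, `libraries`, `position`, `doc_length`), the function names, and the choice to pool every library holding across instances are assumptions made here, not details given in this description. Documents are divided into three parts, as in the example above, and D(Q) is formed as a plain sum or as a weighted linear combination.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of the empirical distribution of the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def document_part(position, doc_length, parts=3):
    """Index of the part (0 = beginning) that contains a word position."""
    return min(position * parts // doc_length, parts - 1)


def diffusion_components(instances):
    """Entropies of authors, publishers, libraries, and document parts."""
    return {
        "E(A)": entropy(i["author"] for i in instances),
        "E(P)": entropy(i["publisher"] for i in instances),
        # Assumption: every library holding of every instance is pooled.
        "E(L)": entropy(lib for i in instances for lib in i["libraries"]),
        "E(Q)": entropy(
            document_part(i["position"], i["doc_length"]) for i in instances
        ),
    }


def total_diffusion(components, weights=None):
    """D(Q) as a plain sum, or as a weighted linear combination."""
    if weights is None:
        return sum(components.values())
    return sum(weights[name] * value for name, value in components.items())


# Two hypothetical instances of one similar passage.
instances = [
    {"author": "A", "publisher": "P1", "libraries": ["L1", "L2"],
     "position": 10, "doc_length": 300},
    {"author": "B", "publisher": "P1", "libraries": ["L1"],
     "position": 250, "doc_length": 300},
]
components = diffusion_components(instances)
d_q = total_diffusion(components)
```

  • In this sample the two instances share a publisher, so E(P) contributes nothing, while the differing authors and document parts each contribute one bit to the total.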
  • An embodiment of the score calculation module 306 combines the individual scores described above (A(Q), F(Q), L(Q), W(Q), P(Q), B(Q), S(Q), and D(Q)) to determine the total score assigned to a similar passage.
  • the total score is calculated by summing the individual scores.
  • certain individual characteristics are more important or more relevant than others. Therefore, the characteristics analysis module 302 weights scores for certain characteristics more than scores for other characteristics.
  • the total score is determined by a weighted linear combination of the individual scores. In other words, each individual score is assigned a weight and is multiplied by its assigned weight to yield a weighted score. The weighted scores are summed in order to yield the total score.
  • the total score is determined by a weighted geometric mean.
  • each score is assigned a weight.
  • Each score is then raised to the power of the weight to yield a weighted score.
  • the weighted scores are then multiplied together to yield the total score.
  • the sum of the weights equals one. Therefore, if one weight is increased by a certain amount the total of the other weights is decreased by the same amount such that the sum of the weights remains one.
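  • The two weighting techniques described above can be sketched in Python as follows. The function names, the sample scores, and the sample weights are hypothetical. The `rebalance` helper shows one possible way to keep the weights summing to one, by scaling the remaining weights proportionally; this description does not say how the decrease is distributed among the other weights, so that choice is an assumption.

```python
import math


def weighted_linear(scores, weights):
    """Multiply each score by its weight and sum the weighted scores."""
    return sum(weights[name] * score for name, score in scores.items())


def weighted_geometric(scores, weights):
    """Raise each score to the power of its weight and multiply the results."""
    return math.prod(score ** weights[name] for name, score in scores.items())


def rebalance(weights, name, new_weight):
    """Set one weight and scale the others so the weights still sum to one."""
    others = {k: w for k, w in weights.items() if k != name}
    scale = (1.0 - new_weight) / sum(others.values())
    adjusted = {k: w * scale for k, w in others.items()}
    adjusted[name] = new_weight
    return adjusted


# Hypothetical individual scores and weights for two characteristics.
scores = {"A(Q)": 4.0, "F(Q)": 9.0}
weights = {"A(Q)": 0.5, "F(Q)": 0.5}

linear_total = weighted_linear(scores, weights)        # 0.5*4 + 0.5*9
geometric_total = weighted_geometric(scores, weights)  # 4**0.5 * 9**0.5

# Raising one weight from 0.5 to 0.6 lowers the others by 0.1 in total.
new_weights = rebalance({"A(Q)": 0.5, "F(Q)": 0.3, "D(Q)": 0.2}, "A(Q)", 0.6)
```

  • One consequence of the geometric form is that an individual score of zero with a positive weight forces the total to zero, whereas the linear form only loses that term.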
  • the total score serves as the ranking score for the passage.
  • the score calculation module 306 aggregates a subset of the scores described above to produce the ranking score for a similar passage. Information about the similar passage and its associated ranking score are stored in the similar passage database 114.
  • FIG. 4 is a flow chart illustrating steps performed by the scoring engine 128 according to one embodiment. Other embodiments may perform different or additional steps than the ones shown in FIG. 4.
  • the scoring engine 128 receives 402 a set of similar passage instances for a passage in the digital corpus 112 to be analyzed.
  • the scoring engine 128 calculates 404 the individual scores (A(Q), F(Q), L(Q), W(Q), P(Q), B(Q), S(Q), and D(Q)) for the examined characteristics.
  • the scoring engine 128 determines 406 a ranking score for the identified passage.
  • the individual scores are summed in order to produce a total score that serves as the ranking score for the identified passage.
  • the scores can also be combined using one or more of the weighting techniques described above.
  • the ranking score is associated with the passage and stored 408 in the similar passage database 114. This process can be performed for each similar passage in the similar passage database 114.
  • FIG. 5 is a flow chart illustrating the interaction between the client device 118 and web server 120, scoring engine 128 and ranking engine 130 according to one embodiment. Other embodiments may perform different or additional steps than the ones shown in FIG. 5.
  • a client device 118 sends 502 a request to the web server 120.
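  • The four steps of FIG. 4 can be sketched in Python as a single function. This is a simplified illustration: the function `score_passage`, the in-memory dictionary standing in for the similar passage database 114, and the two placeholder scorers are all assumptions introduced here, and the individual scores are combined by a plain sum.

```python
def score_passage(passage_id, instances, scorers, database, combine=sum):
    """Receive instances (402), score them (404), rank (406), and store (408)."""
    # Step 404: calculate one individual score per examined characteristic.
    individual = {name: scorer(instances) for name, scorer in scorers.items()}
    # Step 406: combine the individual scores into a ranking score.
    ranking_score = combine(individual.values())
    # Step 408: store the ranking score in association with the passage.
    database[passage_id] = {"scores": individual, "ranking_score": ranking_score}
    return ranking_score


# Placeholder scorers; real embodiments would examine the characteristics
# described in this specification.
scorers = {
    "F(Q)": lambda instances: float(len(instances)),  # frequency stand-in
    "L(Q)": lambda instances: 1.0,                    # length stand-in
}

database = {}  # stands in for the similar passage database 114
result = score_passage("q1", ["inst-1", "inst-2", "inst-3"], scorers, database)
```

  • Passing a different `combine` callable would allow one of the weighting techniques to replace the plain sum without changing the rest of the flow.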
  • the request from the client device 118 may be a search query entered by a user. In some embodiments, the request from the client device 118 may be created when the user selects a hypertext link presented on the client device.
  • the web server 120 receives 504 the request and determines 506 a set of results from the similar passage database 114.
  • the set of results is a set of similar passages.
  • the ranking engine 130 ranks 508 the similar passages based on the ranking scores associated with the similar passages, thereby determining the order in which to display the similar passages.
  • the search results are received 510 by the client device 118 and displayed 512 in the ranked order.
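  • The server-side portion of FIG. 5 can be sketched in Python as follows. The function name `handle_request`, the record layout, and the case-insensitive substring match used to determine the result set are assumptions made for illustration; this description does not specify how the web server 120 matches a request against the similar passage database 114.

```python
def handle_request(query, database):
    """Determine matching passages (506) and rank them by ranking score (508)."""
    needle = query.lower()
    # Step 506: a simple substring match stands in for the real search.
    matches = [rec for rec in database.values() if needle in rec["text"].lower()]
    # Step 508: a higher ranking score means earlier in the displayed order.
    return sorted(matches, key=lambda rec: rec["ranking_score"], reverse=True)


# Hypothetical stored passages with precomputed ranking scores.
database = {
    "q1": {"text": "The space race began in earnest.", "ranking_score": 2.5},
    "q2": {"text": "Winning the Space Race mattered.", "ranking_score": 7.0},
    "q3": {"text": "An unrelated passage.", "ranking_score": 9.0},
}

ranked = handle_request("space race", database)
ranked_texts = [rec["text"] for rec in ranked]
```

  • Because the ranking scores are computed ahead of time, the only work done per request in this sketch is the match and the sort.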
  • FIG. 6 is an exemplary web page 600 showing ranked search results according to one embodiment.
  • the page 600 displays search results 604 that are displayed when a user enters the search query “space race” in the search field 602 of the web page 600.
  • the search results 604 identify three books that relate to the query “space race.” For each book, the web page 600 displays an image 606, a passage 608, and related terms and other information associated with the book/passage 610.
  • the books in the search results 604 are ranked based at least in part on the ranking score of the passage.
  • the ranking score can be used to influence both the order of the books displayed in the search results and the selection of a particular passage from a book.
  • the first search result 604A displays the passage 608A “That's one small step for a man. One giant leap for mankind.” This passage is highly quoted and thus would have received a very high ranking score relative to other passages.
  • a book that contains this passage is presented first in the ranked order of books, and the passage itself is displayed in association with the book (as opposed to other passages appearing in the book that have lower ranking scores).
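  • The two uses of the ranking score described above, ordering the books and choosing which passage to show for each book, can be sketched in Python as follows. The function name `rank_books`, the book titles, the second and third passages, and all score values are hypothetical.

```python
def rank_books(books):
    """Pick each book's top-scoring passage, then order books by that score.

    books maps a title to a list of (passage_text, ranking_score) pairs.
    """
    results = []
    for title, passages in books.items():
        best_text, best_score = max(passages, key=lambda p: p[1])
        results.append((title, best_text, best_score))
    results.sort(key=lambda r: r[2], reverse=True)
    return results


# Hypothetical books and ranking scores.
books = {
    "Book One": [("A lesser-known remark.", 1.2)],
    "Book Two": [
        ("That's one small step for a man. One giant leap for mankind.", 9.8),
        ("A passage quoted only rarely.", 0.4),
    ],
}

ordered = rank_books(books)
first_title, first_passage, _ = ordered[0]
```

  • Here the book containing the highly quoted passage is listed first, and that passage is the one displayed with it, rather than the lower-scoring passage from the same book.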
  • any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment.
  • the appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
  • the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion.
  • a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
  • “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

Abstract

Passages in a digital corpus are scored and ranked based at least in part on characteristics of instances of the passages occurring in the corpus. Such characteristics include the popularity of the author, the characteristics of the words introducing and following the similar passage, frequency of appearance of the passage in the digital corpus, the length of the similar passage, the words of the similar passage, the usage of punctuation with the similar passage, and the diffusion of the similar passage within the digital corpus. The characteristics are scored and weighted to produce ranking scores for the associated passages. The ranking scores are used for purposes including selecting passages to display in association with a document and ranking passages displayed in response to a search.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit of U.S. Patent Provisional Application No. 60/956,880, filed Aug. 20, 2007, the contents of which are hereby incorporated by reference.
  • This application is related to U.S. patent application Ser. No. 11/781,213, filed Jul. 20, 2007, and titled “Identifying and Linking Similar Passages in a Digital Text Corpus,” the contents of which are hereby incorporated by reference.
  • BACKGROUND
  • 1. Field of Art
  • This invention pertains, in general, to scoring similar passages in digital text documents and, in particular, to ranking similar passages based on characteristics of the similar passages occurring in the digital text documents.
  • 2. Description of the Related Art
  • Advancement in digital technology has changed the way people acquire information. For example, people can now view electronic documents that are stored in a predominantly text corpus such as a digital library that is accessible via the Internet. Such a digital text corpus is established, for example, by scanning paper copies of documents including books and newspapers, and then applying an optical character recognition (OCR) process to produce computer-readable text from the scans. The corpus can also be established by receiving documents and other texts already in machine-readable form.
  • Many of these electronic documents contain similar passages or quotations that appear multiple times within the corpus. Users may search for documents in the digital corpus based on various search queries. Additionally, users may search for the documents based on known or popular quotations or phrases contained in the documents. However, these types of searches may yield thousands of matching results, and the most relevant results may not initially be displayed, making it difficult for users to locate the documents or passages most relevant to their queries.
  • SUMMARY
  • The problems described above are addressed by a computer-implemented method, computer program product, and computer system for calculating a score for a passage having a plurality of instances occurring in a digital corpus. Embodiments of the method comprise calculating at least one score based at least in part on characteristics of instances of the passage occurring in the digital corpus and generating a ranking score associated with the passage based at least in part on the calculated at least one score. The method further comprises storing the ranking score in association with the passage in a computer-readable medium. Embodiments of the computer program product and computer system comprise computer code for performing similar functions.
  • The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter.
  • BRIEF DESCRIPTION OF DRAWINGS
  • The disclosed embodiments have other advantages and features which will be more readily apparent from the detailed description, the appended claims, and the accompanying figures (or drawings). A brief introduction of the figures is below.
  • FIG. 1 shows an environment adapted to support ranking similar passages according to one embodiment.
  • FIG. 2 is a high-level block diagram illustrating a functional view of a typical computer for use as one of the entities illustrated in the environment of FIG. 1 according to one embodiment.
  • FIG. 3 is a high-level block diagram illustrating modules within the scoring engine according to one embodiment.
  • FIG. 4 is a flow chart illustrating steps performed by the scoring engine according to one embodiment.
  • FIG. 5 is a flow chart illustrating the interaction between the client device and the web server, the scoring engine, and the ranking engine according to one embodiment.
  • FIG. 6 is an exemplary web page showing ranked search results according to one embodiment.
  • The figures depict various embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
  • DETAILED DESCRIPTION
  • The Figures (FIGS.) and the following description describe embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of what is claimed.
  • FIG. 1 shows an environment adapted to support ranking similar passages according to one embodiment. The environment 100 includes a data store 110 for storing a corpus 112 and a similar passage database 114, a passage mining engine 116 for identifying similar passages in the corpus, a scoring engine 128 for assigning scores to similar passages, and a ranking engine 130 for ranking similar passages. The environment also includes a client 118 for requesting and/or viewing information from the data store 110, and a web server 120 for interacting with the client and providing interfaces allowing the client to access the information in the data store. A network 122 enables communications between and among the data store 110, passage mining engine 116, scoring engine 128, ranking engine 130, client 118, and web server 120.
  • Not all the entities shown in FIG. 1 are required to be connected to the network 122 at the same time for the functionalities described herein to be realized. In one embodiment, the passage mining engine 116 and/or the scoring engine 128 are connected to the network 122 periodically. When they are online, the engines 116 and 128 only need to communicate with the data store 110 in order to score similar passages in the corpus 112 and store the passage data in the passage database 114. The engines 116 and 128 do not need to interact with the client 118 or the web server 120 according to one embodiment. Once the identification of similar passages is finished, the passage mining engine 116 may be off-line, and the web server 120 supports passage navigation by interacting with the client 118 and the data store 110 to retrieve information from the data store that is requested by the client. Similarly, once the scoring of the similar passages is done, the scoring engine 128 may be off-line, and the web server 120 supports retrieval of ranking information by interacting with the client 118 and the data store 110 to retrieve information from the data store that is requested by the client. In another embodiment, the scoring engine 128 is connected to the network 122 periodically. When it is online, the scoring engine 128 communicates with the passage mining engine 116 or the data store 110 in order to identify which similar passage instances to rank. The scoring engine 128 does not need to interact with the client 118 or the web server 120 according to one embodiment. Moreover, different embodiments of the environment 100 include different and/or additional entities than the ones shown in FIG. 1, and the entities may be organized in a different manner.
  • The data store 110 stores the corpus 112 of information and the similar passage database 114. It also stores data utilized to support the functionalities or generated by the functionalities described herein. The data store 110 can also store other corpora and data. The data store 110 receives requests for information stored in it and provides the information in return. In a typical embodiment, the data store 110 is comprised of multiple computers and/or storage devices configured to collectively store a large amount of information.
  • The corpus 112 stores a set of information. In one embodiment, the corpus 112 stores the contents of a large number of digital documents. As used herein, the term “document” refers to a written work or composition. This definition includes, for example, conventional books such as published novels, and collections of text such as newspapers, news stories, magazines, journals, pamphlets, letters, articles, web pages and other electronic documents. The document contents stored by the corpus 112 include, for example, the document text represented in a computer-readable format, images from the documents, scanned images of pages from the documents, etc. As used herein, the term “word” refers to a token containing a block of structured text. The word does not necessarily have meaning in any language, although it will have meaning in most cases.
  • In addition, the corpus 112 stores metadata about the documents within it. The metadata are structured data that describe the documents. Examples of metadata include metadata about a book such as the author, publisher, year published, number of pages, edition, and libraries that carry the book. The metadata stored in the corpus is associated with the similar passages stored in the similar passage database 114.
  • The similar passage database 114 stores data describing similar passages in the corpus 112. The similar passage database 114 also stores the ranking score of the similar passage once a ranking score is assigned by the scoring engine 128. More details describing the function of the scoring engine 128 are provided below.
  • As used herein, the phrase “similar passage” refers to a passage in a source document that is found in a similar form in one or more different target documents. Occurrences of the same similar passage are referred to as “instances” of that passage. Oftentimes, the similar passage instances are identical. Nevertheless, the passages are referred to as “similar” because there might be slight differences among the passage instances in the different documents. When a source document is said to have multiple “similar passages,” it means that multiple passages in the source document are also found in other documents. This phrase does not necessarily mean that the “similar passages” within the source document are similar to each other. Similar passages are also referred to as “quotations,” “shared passages,” “popular passages,” and “related passages.”
  • In one embodiment, the passage database 114 is generated by the passage mining engine 116 to store information obtained from passage mining. In some embodiments, the passage mining engine 116 constructs the passage database 114 by copying existing quotation collections such as Bartlett's, and searching and indexing the instances of quotations and their variations that appear in the corpus 112. In some embodiments, the passage mining engine 116 constructs the passage database 114 by copying existing text appearing in a quoted form, such as delimited by quotation marks, from the corpus, and searching and indexing the instances of the text in the corpus 112. Further, in some embodiments the passage mining engine 116 constructs the passage database 114 by copying each group of words, such as sentences, from the corpus, and searching and indexing the instances of the group of words in the corpus 112. In one embodiment, the database 114 stores similar passages, document identifiers (Doc IDs) identifying the documents in which the passages exist, position identifiers (Pos IDs) identifying the location in the documents at which the passages appear, passage ranking results, etc. Further, in some embodiments, the database 114 also stores the documents or portions of the documents that have the similar passages.
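  • The kind of record held in the passage database 114, together with the quoted-text construction strategy described above, can be sketched in Python as follows. This is a deliberately reduced illustration: the class and function names are hypothetical, the position identifier is taken to be a character offset, and only exact repeats of text delimited by straight quotation marks are indexed, whereas the passage mining engine 116 also handles variations and unquoted instances.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PassageInstance:
    doc_id: str  # Doc ID of the document containing the instance
    pos_id: int  # Pos ID, here a character offset within the document


@dataclass
class SimilarPassage:
    text: str
    instances: List[PassageInstance] = field(default_factory=list)
    ranking_score: Optional[float] = None  # filled in later by scoring


def mine_quoted_passages(corpus):
    """Index quoted text that recurs in more than one document.

    corpus maps a Doc ID to the document text.
    """
    index = {}
    for doc_id, text in corpus.items():
        for match in re.finditer(r'"([^"]+)"', text):
            quote = match.group(1)
            passage = index.setdefault(quote, SimilarPassage(text=quote))
            passage.instances.append(PassageInstance(doc_id, match.start(1)))
    # Keep only passages whose instances span at least two documents.
    return {
        quote: passage
        for quote, passage in index.items()
        if len({inst.doc_id for inst in passage.instances}) > 1
    }


# A tiny hypothetical corpus.
corpus = {
    "d1": 'He said "to be or not to be" aloud.',
    "d2": 'The line "to be or not to be" recurs.',
    "d3": 'A "unique phrase" here.',
}
passages = mine_quoted_passages(corpus)
```

  • The `ranking_score` field starts empty because, in the arrangement described here, scores are assigned by a separate engine after the passages have been identified.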
  • The passage mining engine 116 includes one or more computers adapted to analyze the texts of documents in the corpus 112 in order to identify similar passages. For example, the passage mining engine 116 may find that the passage “I read somewhere that everybody on this planet is separated by only six other people” from the book “Six Degrees of Separation” by John Guare, also appears in 13 other books published between 2000 and 2006. The passage mining engine 116 may store, in the similar passage database 114, the passage, its location in the “Six Degrees of Separation” book, Doc IDs of the 13 other books, Pos IDs indicating the locations of the passage instances in the 13 other books, and its ranking relative to other similar passages in the “Six Degrees of Separation” book or relative to other similar passages in the corpus 112. More detail regarding the passage mining engine 116 is described in the related application, U.S. patent application Ser. No. 11/781,213, filed Jul. 20, 2007, and titled “Identifying and Linking Similar Passages in a Digital Text Corpus.” Passage mining may be performed off-line, asynchronously of any queries made by the client 118 against the data store 110. In one embodiment, the passage mining engine 116 runs periodically to process all the text information in the corpus 112 from scratch and generate similar passage data for storing in the similar passage database 114, disregarding any information obtained from prior passage mining. In another embodiment, the passage mining engine 116 is used periodically to incrementally update the data stored in the similar passage database 114, for example, as new documents are added to the corpus 112.
  • The scoring engine 128 includes one or more computers adapted to assign scores to the similar passages identified by the passage mining engine 116 and stored in the similar passage database 114. In one embodiment, the scoring engine 128 analyzes the characteristics of the similar passages and the documents containing the similar passages stored in the similar passage database 114 and assigns ranking scores to the similar passages. Scoring may be performed on-line when the scoring engine is connected to the network 122 and may also be performed off-line, asynchronously of any queries made by the client 118 against the data store 110. In one embodiment, the scoring engine 128 runs periodically to process all of the content from the data store 110 from scratch and assigns a score associated with a similar passage for storing in the similar passage database 114. In another embodiment, the scoring engine 128 is used periodically to incrementally update the ranking information stored in the similar passage database 114, for example, as new similar passages are found and added to the similar passage database.
  • The ranking engine 130 ranks a set of similar passages to be displayed on the client 118. The ranking engine 130 ranks the set of similar passages based on the associated ranking scores of the similar passages. The set of similar passages can be displayed on the client 118 in the ranked order.
  • For purposes of illustration, FIG. 1 shows the passage mining engine 116, the scoring engine 128, and the ranking engine 130 as discrete servers. However, in various embodiments, any or all of these engines can be combined. This allows a single server to perform the functions of one or more of the above-described engines.
  • In one embodiment, the client 118 is an electronic device having a web browser for interacting with the web server 120 via the network 122, and it is used by a human user to access and obtain information from the data store 110. It can be, for example, a notebook, desktop, or handheld computer, a mobile telephone, personal digital assistant (PDA), mobile email device, portable game player, portable music player, computer integrated into a vehicle, etc.
  • The web server 120 interacts with the client 118 and the ranking engine 130 to provide information from the data store 110. In one embodiment, the web server 120 includes a User Interface (UI) module 124 that communicates with the client's 118 web browser to receive and present information. The web server 120 also includes a searching module 126 that searches for information in the data store 110. For example, the UI module 124 may receive a query from the web browser issued by a user of the client 118, and the searching module 126 may execute the query against the corpus 112 and the similar passage database 114, and retrieve information including similar passages information that satisfies the query. The similar passages are displayed and listed in accordance with a ranking order provided by the ranking engine 130.
  • The network 122 represents communication pathways between the data store 110, passage mining engine 116, client 118, web server 120, the scoring engine 128, and the ranking engine 130. In one embodiment, the network 122 is the Internet. The network 122 can also utilize dedicated or private communications links that are not necessarily part of the Internet. In one embodiment, the network 122 uses standard communications technologies, protocols, and/or interprocess communications techniques. Thus, the network 122 can include links using technologies such as Ethernet, 802.11, integrated services digital network (ISDN), digital subscriber line (DSL), asynchronous transfer mode (ATM), etc. Similarly, the networking protocols used on the network 122 can include the transmission control protocol/Internet protocol (TCP/IP), the hypertext transfer protocol (HTTP), the simple mail transfer protocol (SMTP), the file transfer protocol (FTP), the short message service (SMS) protocol, etc. The data exchanged over the network 122 can be represented using technologies and/or formats including the hypertext markup language (HTML), the extensible markup language (XML), etc. In addition, all or some of the links can be encrypted using conventional encryption technologies such as the secure sockets layer (SSL), HTTP over SSL (HTTPS), and/or virtual private networks (VPNs). In another embodiment, the nodes can use custom and/or dedicated data communications technologies instead of, or in addition to, the ones described above.
  • FIG. 2 is a high-level block diagram illustrating a functional view of a typical computer 200 for use as one or more of the entities illustrated in the environment 100 of FIG. 1 according to one embodiment. Illustrated are at least one processor 202 coupled to a bus 204. Also coupled to the bus 204 are a memory 206, a storage device 208, a keyboard 210, a graphics adapter 212, a pointing device 214, and a network adapter 216. A display 218 is coupled to the graphics adapter 212.
  • The processor 202 may be any general-purpose processor such as an INTEL x86-compatible CPU. The storage device 208 is any device capable of holding data, like a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 206 holds instructions and data used by the processor 202 and may be, for example, firmware, read-only memory (ROM), non-volatile random access memory (NVRAM), and/or RAM. The pointing device 214 may be a mouse, track ball, or other type of pointing device, and is used in combination with the keyboard 210 to input data into the computer system 200. The graphics adapter 212 displays images and other information on the display 218. The network adapter 216 couples the computer system 200 to the network 122.
  • As is known in the art, the computer 200 is adapted to execute computer program modules. As used herein, the term “module” refers to computer program logic and/or data for providing the specified functionality. A module can be implemented in hardware, firmware, and/or software. In one embodiment, the modules are stored on the storage device 208, loaded into the memory 206, and executed by the processor 202 as one or more processes.
  • The types of computers used by the entities of FIG. 1 can vary depending upon the embodiment and the processing power utilized by the entity. For example, the client 118 typically requires less processing power than the passage mining engine 116, scoring engine 128, ranking engine 130, and web server 120. Thus, the client 118 system can be a standard personal computer or a mobile telephone. The passage mining engine 116, scoring engine 128, ranking engine 130, and web server 120, in contrast, may comprise processes executing on more powerful computers, logical processing units, and/or multiple computers working together to provide the functionality described herein. Further, the passage mining engine 116, scoring engine 128, ranking engine 130, and web server 120 might lack devices that are not required to operate them, such as displays 218, keyboards 210, and pointing devices 214.
  • Embodiments of the entities described herein can include other and/or different modules than the ones described here. In addition, the functionality attributed to the modules can be performed by other or different modules in other embodiments. Moreover, this description occasionally omits the term “module” for purposes of clarity and convenience.
  • FIG. 3 is a high-level block diagram illustrating modules within the scoring engine 128 according to one embodiment. The scoring engine 128 includes a characteristics analysis module 302 and a score calculation module 306. An embodiment of the scoring engine 128 analyzes characteristics of similar passages and calculates scores for the passages based on the analyzed characteristics. The scores are assigned to the associated similar passages and stored in the similar passage database 114. Some embodiments have different and/or additional modules than those shown in FIG. 3. Moreover, the functionalities can be distributed among the modules in a different manner than described here.
  • The characteristics analysis module 302 analyzes characteristics associated with a similar passage and its similar passage instances in order to produce a total score. Characteristics that are analyzed include characteristics associated with the passage or passage instance itself and characteristics associated with the usage of the similar passage in the digital corpus 112. Examples of such characteristics are the number of words in the passage, the author of the document which contains the similar passage instance, the publisher of the document which contains the similar passage instance, the characteristics of the words introducing and following the similar passage, how frequently the similar passage appears in the digital corpus, the length of the similar passage, the words of the similar passage, the usage of punctuation associated with the similar passage, and the diffusion of the similar passage in the digital corpus. The diffusion of the similar passage is determined by analyzing the variation of the authors of the documents in which the instances of the passage appear, the variation of the publishers of the documents in which the similar passage instances appear, the variation of the libraries that carry the documents in which the similar passage instances appear, and/or the variation of the parts of the documents in which the similar passage instances appear.
  • In one embodiment, the author associated with the document which contains a similar passage instance is identified and examined by the characteristics analysis module 302. In some embodiments, the characteristics analysis module 302 compares the identified author to a list or database of previously-identified famous or known authors. In one embodiment, each author in the list or database has an associated score. In such embodiments, when the characteristics analysis module 302 compares the identified authors to the list or database, and the identified author is found therein, the module 302 assigns the score associated with that author to the similar passage instance. If the identified author is not found, the module 302 assigns a low score or a score of zero to the similar passage instance. In some embodiments, the authors in the list or database do not have an associated score. In those embodiments, the module 302 assigns a score to the similar passage instance based on whether the identified author was found in the database. The assigned score is represented by A(Q).
  • In some embodiments, the list or database of previously-identified famous or known authors may be based on authors found in a printed encyclopedia, an online encyclopedia, such as Wikipedia, or other sources such as Bartlett's.
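  • The author lookup described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the table `KNOWN_AUTHOR_SCORES`, the function names, the example scores, and the case normalization are all assumptions introduced here.

```python
from typing import Collection, Mapping, Optional

# Hypothetical table of known authors; names and scores are illustrative only.
KNOWN_AUTHOR_SCORES = {"mark twain": 0.9, "william shakespeare": 1.0}


def author_score(author: Optional[str],
                 known: Mapping[str, float] = KNOWN_AUTHOR_SCORES,
                 not_found: float = 0.0) -> float:
    """Return an A(Q)-style score for one passage instance.

    If the author is in the table, use the stored score; otherwise
    fall back to a low (here zero) score.
    """
    if not author:
        return not_found
    # Normalizing case and whitespace is an assumption, not from the source.
    return known.get(author.strip().lower(), not_found)


def author_found_score(author: Optional[str],
                       known_names: Collection[str],
                       found: float = 1.0,
                       not_found: float = 0.0) -> float:
    """Variant for a list without per-author scores: score by membership only."""
    if author and author.strip().lower() in known_names:
        return found
    return not_found
```

  • The same pattern would apply to the document lookup that yields B(Q), with a table keyed by document instead of author.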
  • In one embodiment, frequency of appearance of the similar passage, or the number of similar passage instances in the digital corpus 112, is a characteristic that is examined. The characteristics analysis module 302 examines and identifies the frequency of appearance of the similar passage in the digital corpus 112. If the similar passage appears in fewer documents, the characteristics analysis module 302 assigns a lower score to that similar passage. If the similar passage appears in many documents, the characteristics analysis module 302 assigns a higher score to that similar passage.
  • In some embodiments, there are certain similar passages that tend to appear very frequently and the characteristics analysis module 302 adjusts the score downward as a result. For example, a cliché or overused slogan may be identified as a similar passage and may be very prevalent throughout the digital corpus 112. In those instances, the cliché or slogan may be assigned a lower score because the high frequency of occurrence does not necessarily indicate that the passage has great significance.
  • In some embodiments, the length of the similar passage may be a factor in determining a score based on the frequency of appearance of the similar passage. For example, a very short similar passage (for example, one that includes fewer than five or six words) may appear frequently. However, since this passage is shorter than the average length of a passage, it is assigned a lower score. Conversely, if the similar passage is long (for example, more than ten words in length), it would still be assigned a high score if the frequency of appearance of the similar passage within the digital corpus 112 is high. In one embodiment, the score associated with the frequency of appearance of the similar passage in the digital corpus 112 is represented by F(Q).
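  • One way to picture the frequency score F(Q), with its downward adjustments for clichés and for very short passages, is the sketch below. The logarithmic scaling, the `cliche_threshold` of 10,000 instances, the five-word cutoff, and the halving penalties are all assumptions chosen for illustration; the description does not specify any of them.

```python
import math


def frequency_score(instance_count: int,
                    word_count: int,
                    cliche_threshold: int = 10_000,
                    min_words: int = 5) -> float:
    """Return an F(Q)-style score from the number of passage instances."""
    if instance_count <= 0:
        return 0.0
    # More instances give a higher score; the log scale is an assumption.
    score = math.log2(1 + instance_count)
    # Extremely frequent passages (clichés, slogans) are adjusted downward.
    if instance_count > cliche_threshold:
        score *= 0.5
    # Very short passages are frequent for trivial reasons, so lower them too.
    if word_count < min_words:
        score *= 0.5
    return score
```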
  • In one embodiment, the length of the similar passage is a characteristic that is separately examined and scored by the characteristics analysis module 302. The characteristics analysis module 302 assigns a lower score to a very short passage (for example, one that includes fewer than five or six words) and assigns a higher score to a long passage (for example, more than ten words in length). In one embodiment, the score associated with the length of the similar passage in the digital corpus 112 is represented by L(Q).
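  • A possible shape for the length score L(Q) is a ramp between a "short" and a "long" word count. The cutoffs of five and ten words echo the examples above, but the linear interpolation between them and the 0-to-1 range are assumptions made for this sketch.

```python
def length_score(passage: str, short: int = 5, long: int = 10) -> float:
    """Return an L(Q)-style score in [0, 1] based on the word count."""
    n = len(passage.split())  # whitespace tokenization is a simplification
    if n <= short:
        return 0.0
    if n >= long:
        return 1.0
    # Between the two cutoffs, interpolate linearly (an assumption).
    return (n - short) / (long - short)
```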
  • In one embodiment, the variation of words and grammar of the similar passage are characteristics that are examined. The characteristics analysis module 302 examines the words of the similar passage and assigns a score to the similar passage in response. The characteristics analysis module 302 assigns a lower score to a similar passage that contains repeating words or numbers and assigns a higher score to a passage that contains few repeating words or numbers. In some embodiments, if the similar passage is a chart, or another table-like presentation of words (i.e. words with no verbs), then the characteristics analysis module 302 assigns a lower score to that similar passage.
  • In some embodiments, the characteristics analysis module 302 applies one or more language models to analyze the words of the similar passage. For example, language models may be used to determine whether the words of the similar passage demonstrate usage of proper grammar or whether the words contain too many numbers. In such embodiments, a high score is assigned to a passage that demonstrates use of proper grammar and a low score is assigned to a passage that demonstrates use of improper grammar. Additionally, the score of a passage that contains too many numbers is lowered. In one embodiment, the score associated with the word analysis of the similar passage in the digital corpus is represented by W(Q).
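  • The word analysis that produces W(Q) can be approximated without a language model by measuring word repetition and the share of numeric tokens. The sketch below is only that approximation: the tokenizer, the distinct-token ratio, the 30% numeric limit, and the halving penalty are assumptions, and no grammar check is attempted.

```python
import re


def word_score(passage: str, max_numeric_fraction: float = 0.3) -> float:
    """Return a W(Q)-style score: high for varied words, low for repetition."""
    # Tokens are runs of letters (with apostrophes) or numbers.
    tokens = re.findall(r"[A-Za-z']+|\d+(?:\.\d+)?", passage.lower())
    if not tokens:
        return 0.0
    # Fewer repeated tokens means a ratio closer to 1.0.
    distinct_ratio = len(set(tokens)) / len(tokens)
    numeric_fraction = sum(t[0].isdigit() for t in tokens) / len(tokens)
    # Passages dominated by numbers (for example, table rows) are lowered.
    if numeric_fraction > max_numeric_fraction:
        distinct_ratio *= 0.5
    return distinct_ratio
```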
  • In one embodiment, the usage of punctuation associated with the similar passage is identified and examined by the characteristics analysis module 302. For example, the use of quotation marks surrounding a similar passage is an indication that the similar passage is a quotation and therefore the passage is assigned a higher score. In one embodiment, the score associated with the use of punctuation marks is represented by P(Q).
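  • The quotation-mark check behind P(Q) might look like the following. The function name, the `start`/`end` character offsets, the set of recognized quotation marks, and the binary 1.0/0.0 score are assumptions for illustration.

```python
# Straight and curly double quotation marks; single quotes are left out
# because apostrophes would cause false positives (a design assumption).
OPENING_QUOTES = {'"', "\u201c"}
CLOSING_QUOTES = {'"', "\u201d"}


def punctuation_score(context: str, start: int, end: int) -> float:
    """Return a P(Q)-style score for the instance at context[start:end].

    The score is 1.0 when the instance is enclosed in quotation marks
    (ignoring whitespace next to the marks), otherwise 0.0.
    """
    before = context[:start].rstrip()
    after = context[end:].lstrip()
    opened = bool(before) and before[-1] in OPENING_QUOTES
    closed = bool(after) and after[0] in CLOSING_QUOTES
    return 1.0 if opened and closed else 0.0
```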
  • In one embodiment, the document that contains a similar passage instance is a characteristic that is identified and examined by the characteristics analysis module 302. Similar to the analysis of the author of the document, the characteristics analysis module 302 compares the identified document to a list or database of previously-identified famous or known documents. In one embodiment, each document in the list or database has an associated score. In such embodiments, when the characteristics analysis module 302 compares the identified document to the list or database of documents, and the identified document is found therein, the module 302 assigns the score associated with that document to the similar passage instance. If the identified document is not found in the database, the module 302 assigns a low score or a score of zero. In some embodiments, the documents in the list or database do not have associated scores. In those embodiments, the module 302 assigns a score to the similar passage instance based on whether the identified document was found therein. In one embodiment, the assigned score is represented by B(Q).
  • In one embodiment, the set of words introducing a similar passage and the set of words following a similar passage is a characteristic that is examined. In some embodiments, these words are known as speech acts. For example, words such as “Person X says” or “Person X wrote” are indications that a similar passage is to follow. As another example, speech acts, such as “said Person X” are indications that a similar passage appeared before the exemplary speech act phrase. A higher score is assigned to a similar passage that is introduced by or followed by a speech act. In one embodiment, the assigned score is represented by S(Q).
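  • Speech-act detection for S(Q) can be sketched with two regular expressions, one applied to the words before the passage and one to the words after it. The verb list, the patterns, and the binary score are assumptions; a production system would likely need a much richer set of cues.

```python
import re

# Illustrative verb list; the description only gives "says", "wrote", "said".
SPEECH_VERBS = r"(?:says|said|wrote|writes|remarked|declared)"

# A speech verb at the very end of the preceding text, e.g. 'Person X says, "'.
INTRO = re.compile(rf'\b{SPEECH_VERBS}\b[\s,:]*["\u201c]?\s*$', re.IGNORECASE)
# A speech verb at the very start of the following text, e.g. '" said Person X'.
FOLLOW = re.compile(rf'^\s*["\u201d]?[\s,]*{SPEECH_VERBS}\b', re.IGNORECASE)


def speech_act_score(before: str, after: str) -> float:
    """Return an S(Q)-style score: 1.0 if a speech act introduces or follows."""
    if INTRO.search(before) or FOLLOW.search(after):
        return 1.0
    return 0.0
```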
  • In one embodiment, a diffusion of the similar passage in the digital corpus 112 is examined by the characteristics analysis module 302. In one embodiment, the assigned score is represented by D(Q) and is calculated by first calculating entropy scores as explained below.
  • In one embodiment, the variation of the authors, or number of different authors, of the documents containing a particular similar passage is a component of the diffusion score. The characteristics analysis module 302 examines the authors of the documents containing the instances of a particular similar passage in order to determine the number of different authors. The characteristics analysis module 302 assigns a higher score to a similar passage that is associated with many different authors, and assigns a lower score to a similar passage that is associated with fewer different authors. In one embodiment, the score is calculated using the following entropy equation:
  • $$E(A) = -\sum_{x \in A} p(x) \cdot \log_2\bigl(p(x)\bigr)$$
  • As shown in the exemplary equation above, the entropy of the authors (E(A)), is calculated by taking the negative summation of the product of p(x) and the log of p(x), where p(x) is the probability that author x will occur in a given set of examined documents and is expressed as a fraction. For example, when calculating E(A), the individual probabilities correspond to the probability that a particular author will appear as an author of a document among the set of examined documents containing a particular similar passage. Using the equation above, if ten documents containing instances of a particular similar passage were examined and all ten documents were associated with the same author, p(x) would be one, and the entropy of the author (E(A)) would be zero. However, if some of the documents were associated with different authors, the entropy of the author (E(A)) would be greater than zero. If a large number of documents were examined and all the documents were associated with different authors, the value of the entropy of the authors would be high. For example, if ten documents were examined and ten authors were identified (each document corresponding to a different author), p(x)*log2(p(x)) for each author is −0.3322 and the negative summation is 3.322.
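  • The entropy calculation and the two worked cases above (one shared author, ten distinct authors) can be checked with a short sketch. The function name `entropy` and the use of observed frequencies as the probabilities p(x) are assumptions made here.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def entropy(labels: Iterable[Hashable]) -> float:
    """Return -sum(p(x) * log2(p(x))) over the distinct labels.

    p(x) is estimated as the fraction of examined documents carrying
    label x (for E(A), the label is the document's author).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return max(0.0, h)  # avoids returning -0.0 when there is a single label


same_author = entropy(["Author 1"] * 10)                      # all ten documents share one author
ten_authors = entropy([f"Author {i}" for i in range(10)])     # every document has a different author
```

  • Here `same_author` evaluates to zero and `ten_authors` to approximately 3.322, matching the figures in the paragraph above. The same function would serve for E(P) and E(L) by passing publishers or libraries as the labels.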
  • In one embodiment, the variation of the publishers of the documents associated with the particular similar passage is a component of the diffusion score. The publishers of the documents containing instances of the particular similar passage are examined and identified. Similar to the calculation for authors, the characteristics analysis module 302 calculates an entropy of the publishers (E(P)) by using a formula similar to the one above, but in this case p(x) corresponds to the probability of the occurrence of a particular publisher. Therefore, similar to the analysis of the authors, the characteristics analysis module 302 assigns a higher score to a similar passage that is associated with many different publishers, and assigns a lower score to a similar passage that is associated with fewer different publishers.
  • In one embodiment, the variation of the libraries that carry copies of the documents containing instances of the particular passage is a component of the diffusion score that is identified by the characteristics analysis module 302. Similar to the calculation for authors and publishers, the characteristics analysis module 302 calculates an entropy of the libraries (E(L)). In this case, p(x) corresponds to the probability of the appearance of a particular library that carries a copy of a document containing a particular similar passage. Therefore, similar to the analysis of the authors and publishers, the characteristics analysis module 302 assigns a higher score to a similar passage that appears in a document that is held in a collection of many different libraries, and assigns a lower score to a similar passage that appears in a document that is held in a collection of fewer different libraries.
  • In one embodiment, the variation of the parts of documents in which the similar passage instances appear is a component of the diffusion score. The characteristics analysis module 302 examines and identifies parts of the documents in which the similar passage appears. In some embodiments, a document is divided into a number of parts. For example, a document may be divided into three parts: a first third (the beginning part of the document), a second third (the middle part of the document), and a last third (the end part of the document). Among the documents containing the similar passage instances, the characteristics analysis module 302 determines in which parts of the documents the similar passage instances appear. Similar to the calculations above, the characteristics analysis module 302 calculates an entropy of the parts of the documents (E(Q)) using a similar formula. In this case, p(x) corresponds to the probability of the appearance of a passage instance in a particular part of a document. Therefore, the characteristics analysis module 302 assigns a higher score to a similar passage that appears in different parts of documents, and assigns a lower score to a similar passage that appears in the same part, or mostly the same part, of the documents.
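  • The document-part entropy E(Q) can be illustrated by mapping each instance's position to a part index and then taking the entropy of those indices. The use of character offsets, the function names, and the default of three parts are assumptions for this sketch.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def document_part(offset: int, doc_length: int, parts: int = 3) -> int:
    """Return the 0-based part of the document in which an offset falls."""
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    # Integer arithmetic avoids rounding problems at part boundaries.
    return min(parts - 1, max(0, offset * parts // doc_length))


def parts_entropy(instances: Iterable[Tuple[int, int]], parts: int = 3) -> float:
    """Return E(Q) for (offset, doc_length) pairs, one per passage instance."""
    counts = Counter(document_part(o, n, parts) for o, n in instances)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return max(0.0, h)  # avoids -0.0 when all instances share one part
```

  • With three parts the value ranges from zero (every instance in the same third) up to log2(3), roughly 1.585 (instances spread evenly over all thirds).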
  • The characteristics analysis module 302 combines the entropies calculated above (E(A), E(P), E(L), and E(Q)) in order to calculate a total diffusion (D(Q)) of the similar passage throughout the corpus. Depending upon the embodiment, the characteristics analysis module 302 calculates D(Q) as a sum of its components, as a weighted linear combination, as a weighted geometric mean or using another technique. The characteristics analysis module 302 assigns the total diffusion score D(Q) to the similar passage. In some embodiments, the total diffusion score is stored in association with the similar passage in the similar passage database 114.
  • An embodiment of the score calculation module 306 combines the individual scores described above (A(Q), F(Q), L(Q), W(Q), P(Q), B(Q), S(Q), and D(Q)) to determine the total score assigned to a similar passage. In one embodiment, the total score is calculated by summing the individual scores. In some embodiments, certain individual characteristics are more important or more relevant than others. Therefore, the characteristics analysis module 302 weights scores for certain characteristics more than scores for other characteristics. In some embodiments, the total score is determined by a weighted linear combination of the individual scores. In other words, each individual score is assigned a weight and is multiplied by its assigned weight to yield a weighted score. The weighted scores are summed in order to yield the total score. In other embodiments, the total is determined by a weighted geometric mean. In other words, each score is assigned a weight. Each score is then raised to the power of the weight to yield a weighted score. The weighted scores are then multiplied together to yield the total score. In some embodiments, the sum of the weights equals one. Therefore, if one weight is increased by a certain amount the total of the other weights is decreased by the same amount such that the sum of the weights remains one.
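  • The two weighted combinations described above can be written out as a short sketch. The function names and the dictionary representation of scores and weights are assumptions; the arithmetic follows the paragraph (multiply and sum for the linear combination, raise to the power of the weight and multiply for the geometric mean).

```python
import math
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale weights so that they sum to one."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}


def weighted_linear(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Multiply each score by its weight and sum the weighted scores."""
    return sum(weights[name] * scores[name] for name in scores)


def weighted_geometric(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Raise each score to the power of its weight and multiply the results."""
    return math.prod(scores[name] ** weights[name] for name in scores)


# Illustrative values only; the source gives no concrete scores or weights.
scores = {"A": 0.5, "F": 2.0}
weights = normalize_weights({"A": 2.0, "F": 2.0})
total_linear = weighted_linear(scores, weights)
total_geometric = weighted_geometric(scores, weights)
```

  • One practical difference between the two: in the geometric form, any individual score of zero with a positive weight drives the total to zero, whereas in the linear form it only removes that term from the sum. The same helpers could be applied to the entropies E(A), E(P), E(L), and E(Q) to form D(Q).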
  • The total score serves as the ranking score for the passage. In some embodiments, the score calculation module 306 aggregates a subset of the scores described above to produce the ranking score for a similar passage. Information about the similar passage and its associated ranking score are stored in the similar passage database 114.
  • FIG. 4 is a flow chart illustrating steps performed by the scoring engine 128 according to one embodiment. Other embodiments may perform different or additional steps than the ones shown in FIG. 4.
  • The scoring engine 128 receives 402 a set of similar passage instances for a passage in the digital corpus 112 to be analyzed. The scoring engine 128 calculates 404 the individual scores (A(Q), F(Q), L(Q), W(Q), P(Q), B(Q), S(Q), and D(Q)) for the examined characteristics. The scoring engine 128 then determines 406 a ranking score for the identified passage. In one embodiment, the individual scores are summed in order to produce a total score that serves as the ranking score for the identified passage. The scores can also be combined using one or more of the weighting techniques described above. The ranking score is associated with the passage and stored 408 in the similar passage database 114. This process can be performed for each similar passage in the similar passage database 114.
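  • The four steps of FIG. 4 can be sketched as a single function. The `Scorer` callable type, the dictionary standing in for the similar passage database 114, and the plain summation in step 406 are assumptions; the individual scorers would be functions such as those for A(Q) through D(Q).

```python
from typing import Callable, Dict, List

Instance = Dict[str, str]                  # hypothetical per-instance metadata
Scorer = Callable[[List[Instance]], float]


def score_passage(passage_id: str,
                  instances: List[Instance],
                  scorers: Dict[str, Scorer],
                  database: Dict[str, float]) -> float:
    """Receive instances (402), score (404), combine (406), and store (408)."""
    # Step 404: calculate one individual score per examined characteristic.
    individual = {name: scorer(instances) for name, scorer in scorers.items()}
    # Step 406: sum the individual scores to obtain the ranking score.
    ranking_score = sum(individual.values())
    # Step 408: store the ranking score in association with the passage.
    database[passage_id] = ranking_score
    return ranking_score
```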
  • FIG. 5 is a flow chart illustrating the interaction between the client device 118 and web server 120, scoring engine 128 and ranking engine 130 according to one embodiment. Other embodiments may perform different or additional steps than the ones shown in FIG. 5.
  • A client device 118 sends 502 a request to the web server 120. The request from the client device 118 may be a search query entered by a user. In some embodiments, the request from the client device 118 may be created when the user selects a hypertext link presented on the client device. The web server 120 receives 504 the request and determines 506 a set of results from the similar passage database 114. The set of results is a set of similar passages. The ranking engine 130 ranks 508 the similar passages based on the ranking scores associated with the similar passages, thereby determining the order in which to display the similar passages. The search results are received 510 by the client device 118 and displayed 512 in the ranked order.
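  • Step 508, in which the ranking engine 130 orders the result set, reduces to a sort on the stored ranking scores. The function name, the use of passage identifiers, and the default score of zero for passages without a stored score are assumptions.

```python
from typing import Dict, List


def rank_passages(results: List[str], ranking_scores: Dict[str, float]) -> List[str]:
    """Return passage identifiers ordered from highest to lowest ranking score."""
    return sorted(results,
                  key=lambda pid: ranking_scores.get(pid, 0.0),
                  reverse=True)
```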
  • FIG. 6 is an exemplary web page 600 showing ranked search results according to one embodiment. In the example shown in FIG. 6, the page 600 displays search results 604 that are displayed when a user enters the search query “space race” in the search field 602 of the web page 600. The search results 604 identify three books that relate to the query “space race.” For each book, the web page 600 displays an image 606, a passage 608, and related terms and other information associated with the book/passage 610.
  • In FIG. 6, the books in the search results 604 are ranked based at least in part on the ranking score of the passage. The ranking score can be used to influence both the order of the books displayed in the search results and the selection of a particular passage from a book. For example, the first search result 604A displays the passage 608A “That's one small step for a man. One giant leap for mankind.” This passage is highly quoted and thus would have received a very high ranking score relative to other passages. As a result, a book that contains this passage is presented first in the ranked order of books, and the passage itself is displayed in association with the book (as opposed to other passages appearing in the book that have lower ranking scores).
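  • The dual use of the ranking score in FIG. 6, choosing which passage to show for each book and ordering the books, might be sketched as below. The description says the ordering is based only "at least in part" on the ranking score, so using the best passage's score as the sole sort key is a simplifying assumption, as are the function name and data shapes.

```python
from typing import Dict, List, Tuple


def rank_books(book_passages: Dict[str, List[str]],
               ranking_scores: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (book, displayed passage) pairs, best-scoring book first."""
    best = []
    for book, passages in book_passages.items():
        if not passages:
            continue
        # Select the passage with the highest ranking score for display.
        top = max(passages, key=lambda p: ranking_scores.get(p, 0.0))
        best.append((book, top, ranking_scores.get(top, 0.0)))
    # Order the books by the score of their selected passage.
    best.sort(key=lambda item: item[2], reverse=True)
    return [(book, passage) for book, passage, _ in best]
```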
  • As used herein any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
  • As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
  • In addition, “a” or “an” is employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the invention. This description should be read to include one or at least one, and the singular also includes the plural unless it is obvious that it is meant otherwise.
  • Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a system and a process for ranking similar passages through the disclosed principles herein. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the disclosed embodiments are not limited to the precise construction and components disclosed herein. Various modifications, changes and variations, which will be apparent to those skilled in the art, may be made in the arrangement, operation and details of the method and apparatus disclosed herein without departing from the spirit and scope defined in the appended claims.

Claims (24)

1. A computer-implemented method for calculating a score for a passage having a plurality of instances occurring in a digital corpus, comprising:
calculating at least one score based at least in part on characteristics of instances of the passage occurring in the digital corpus;
generating a ranking score associated with the passage based at least in part on the calculated at least one score; and
storing the ranking score in association with the passage in a computer-readable medium.
2. The method of claim 1, wherein a plurality of scores are calculated based on a plurality of characteristics of the instances of the passage occurring in the digital corpus, and wherein generating the ranking score comprises combining the plurality of scores to form the ranking score.
3. The method of claim 1, wherein calculating the at least one score comprises:
accessing a database identifying authors and having associated author scores;
determining whether an author of a document in the digital corpus in which a passage instance occurs is found in the database; and
responsive to the author being found in the database, calculating the score based at least in part on the author score associated with the author in the database.
4. The method of claim 1, wherein calculating the at least one score comprises:
accessing a database identifying documents and having associated document scores;
determining whether a document in the digital corpus in which a passage instance occurs is found in the database; and
responsive to the document being found in the database, calculating the score based at least in part on the document score associated with the document in the database.
5. The method of claim 1, wherein calculating the at least one score comprises:
identifying a frequency that the passage instances appear in the digital corpus; and
calculating the score based at least in part on the frequency.
6. The method of claim 1, wherein calculating the at least one score comprises:
determining a length of the passage; and
calculating the score based at least in part on the length.
7. The method of claim 1, wherein calculating the at least one score comprises:
determining an amount of variation of words of the passage; and
calculating the score based at least in part on the amount of variation of words of the passage.
8. The method of claim 1, wherein calculating the at least one score comprises:
applying one or more language models to analyze words within the passage; and
calculating the score based at least in part on the application of the one or more language models.
9. The method of claim 1, wherein calculating the at least one score comprises:
determining a usage of punctuation associated with the passage; and
calculating the score based at least in part on the usage of punctuation associated with the passage.
10. The method of claim 1, wherein calculating the at least one score comprises:
identifying words introducing the passage and/or following the passage in a document in the digital corpus containing an instance of the passage;
ascertaining whether the words introducing and/or following the passage denote a speech act; and
calculating the score based at least in part on whether the words introducing and/or following the similar passage denote a speech act.
11. The method of claim 1, wherein calculating the at least one score comprises:
identifying a characteristic of the plurality of passage instances occurring in the digital corpus;
examining the plurality of passage instances to determine an amount of variation in the identified characteristic over the plurality of passage instances; and
calculating the at least one score based at least in part on the amount of variation in the characteristic.
12. The method of claim 11, wherein an identified characteristic is an author of a document in which a passage instance appears.
13. The method of claim 11, wherein an identified characteristic is a publisher of a document in which a passage instance appears.
14. The method of claim 11, wherein an identified characteristic is a library containing a document in which a passage instance appears.
15. The method of claim 11, wherein an identified characteristic is a part of a document in which a passage instance appears.
16. The method of claim 1, wherein a plurality of ranking scores are calculated for a plurality of different passages occurring in the digital corpus and further comprising:
ranking the plurality of different passages in an order responsive to the ranking scores calculated for the passages.
17. A computer-readable storage medium containing executable program code for calculating a score for a passage having multiple occurrences in a digital corpus, the program code comprising code for:
calculating at least one score based at least in part on characteristics of instances of the passage occurring in the digital corpus;
generating a ranking score associated with the passage based at least in part on the calculated at least one score; and
storing the ranking score in association with the passage in a computer-readable medium.
18. The computer-readable storage medium of claim 17, wherein a plurality of scores are calculated based on a plurality of characteristics of the instances of the passage occurring in the digital corpus, and wherein generating the ranking score comprises combining the plurality of scores to form the ranking score.
19. The computer-readable storage medium of claim 17, wherein calculating the at least one score comprises:
identifying a characteristic of the plurality of passage instances occurring in the digital corpus;
examining the plurality of passage instances to determine an amount of variation in the identified characteristic over the plurality of passage instances; and
calculating the at least one score based at least in part on the amount of variation in the characteristic.
20. The computer-readable storage medium of claim 17, wherein a plurality of ranking scores are calculated for a plurality of different passages occurring in the digital corpus and further comprising:
ranking the plurality of different passages in an order responsive to the ranking scores calculated for the passages.
21. A computer system for calculating a score for a passage having multiple occurrences in a digital corpus, the system comprising:
a computer-readable storage medium containing executable program code for calculating a score for a passage having multiple occurrences in a digital corpus, the program code comprising code for:
calculating at least one score based at least in part on characteristics of instances of the passage occurring in the digital corpus;
generating a ranking score associated with the passage based at least in part on the calculated at least one score; and
storing the ranking score in association with the passage in a computer-readable medium.
22. The computer system of claim 21, wherein a plurality of scores are calculated based on a plurality of characteristics of the instances of the passage occurring in the digital corpus, and wherein generating the ranking score comprises combining the plurality of scores to form the ranking score.
23. The computer system of claim 21, wherein calculating the at least one score comprises:
identifying a characteristic of the plurality of passage instances occurring in the digital corpus;
examining the plurality of passage instances to determine an amount of variation in the identified characteristic over the plurality of passage instances; and
calculating the at least one score based at least in part on the amount of variation in the characteristic.
24. The computer system of claim 21, wherein a plurality of ranking scores are calculated for a plurality of different passages occurring in the digital corpus and further comprising:
ranking the plurality of different passages in an order responsive to the ranking scores calculated for the passages.
US12/134,145 2007-08-20 2008-06-05 Ranking similar passages Abandoned US20090055389A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/134,145 US20090055389A1 (en) 2007-08-20 2008-06-05 Ranking similar passages

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US95688007P 2007-08-20 2007-08-20
US12/134,145 US20090055389A1 (en) 2007-08-20 2008-06-05 Ranking similar passages

Publications (1)

Publication Number Publication Date
US20090055389A1 true US20090055389A1 (en) 2009-02-26

Family

ID=40383114

Family Applications (2)

Application Number Title Priority Date Filing Date
US12/022,842 Active 2031-05-10 US9323827B2 (en) 2007-07-20 2008-01-30 Identifying key terms related to similar passages
US12/134,145 Abandoned US20090055389A1 (en) 2007-08-20 2008-06-05 Ranking similar passages

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US12/022,842 Active 2031-05-10 US9323827B2 (en) 2007-07-20 2008-01-30 Identifying key terms related to similar passages

Country Status (1)

Country Link
US (2) US9323827B2 (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120095993A1 (en) * 2010-10-18 2012-04-19 Jeng-Jye Shau Ranking by similarity level in meaning for written documents
US20120216107A1 (en) * 2009-10-30 2012-08-23 Rakuten, Inc. Characteristic content determination program, characteristic content determination device, characteristic content determination method, recording medium, content generation device, and related content insertion device
CN104285142A (en) * 2012-04-25 2015-01-14 Atonarp株式会社 System which provides content
US20150046468A1 (en) * 2013-08-12 2015-02-12 Alcatel Lucent Ranking linked documents by modeling how links between the documents are used
US20150142571A1 (en) * 2011-05-23 2015-05-21 Google Inc. System and method for increasing the likelihood of users reviewing advertisements
WO2015168344A1 (en) * 2014-05-02 2015-11-05 Microsoft Technology Licensing, Llc Searching locally defined entities
US20160098405A1 (en) * 2014-10-01 2016-04-07 Docurated, Inc. Document Curation System
US9367490B2 (en) 2014-06-13 2016-06-14 Microsoft Technology Licensing, Llc Reversible connector for accessory devices
US9384334B2 (en) 2014-05-12 2016-07-05 Microsoft Technology Licensing, Llc Content discovery in managed wireless distribution networks
US9384335B2 (en) 2014-05-12 2016-07-05 Microsoft Technology Licensing, Llc Content delivery prioritization in managed wireless distribution networks
US9430667B2 (en) 2014-05-12 2016-08-30 Microsoft Technology Licensing, Llc Managed wireless distribution network
WO2016161089A1 (en) * 2015-04-03 2016-10-06 Klangoo, Inc. Techniques for understanding the aboutness of text based on semantic analysis
US9614724B2 (en) 2014-04-21 2017-04-04 Microsoft Technology Licensing, Llc Session-based device configuration
US9874914B2 (en) 2014-05-19 2018-01-23 Microsoft Technology Licensing, Llc Power management contracts for accessory devices
US9875313B1 (en) * 2009-08-12 2018-01-23 Google Llc Ranking authors and their content in the same framework
US10111099B2 (en) 2014-05-12 2018-10-23 Microsoft Technology Licensing, Llc Distributing content in managed wireless distribution networks
CN111159461A (en) * 2019-12-30 2020-05-15 秒针信息技术有限公司 Audio file determination method and device, storage medium and electronic device
US10691445B2 (en) 2014-06-03 2020-06-23 Microsoft Technology Licensing, Llc Isolating a portion of an online computing service for testing
CN112685534A (en) * 2020-12-23 2021-04-20 上海掌门科技有限公司 Method and apparatus for generating context information of authored content during authoring process

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8386239B2 (en) 2010-01-25 2013-02-26 Holovisions LLC Multi-stage text morphing
US8983989B2 (en) 2010-02-05 2015-03-17 Microsoft Technology Licensing, Llc Contextual queries
US8903794B2 (en) * 2010-02-05 2014-12-02 Microsoft Corporation Generating and presenting lateral concepts
US8161073B2 (en) 2010-05-05 2012-04-17 Holovisions, LLC Context-driven search
US20110307819A1 (en) * 2010-06-09 2011-12-15 Microsoft Corporation Navigating dominant concepts extracted from multiple sources
US8775431B2 (en) * 2011-04-25 2014-07-08 Disney Enterprises, Inc. Systems and methods for hot topic identification and metadata
US9639518B1 (en) 2011-09-23 2017-05-02 Amazon Technologies, Inc. Identifying entities in a digital work
US10108706B2 (en) 2011-09-23 2018-10-23 Amazon Technologies, Inc. Visual representation of supplemental information for a digital work
US9613003B1 (en) * 2011-09-23 2017-04-04 Amazon Technologies, Inc. Identifying topics in a digital work
US9449526B1 (en) 2011-09-23 2016-09-20 Amazon Technologies, Inc. Generating a game related to a digital work
US8589404B1 (en) * 2012-06-19 2013-11-19 Northrop Grumman Systems Corporation Semantic data integration
US9341490B1 (en) * 2015-03-13 2016-05-17 Telenav, Inc. Navigation system with spelling error detection mechanism and method of operation thereof
US9760564B2 (en) * 2015-07-09 2017-09-12 International Business Machines Corporation Extracting veiled meaning in natural language content
US10706113B2 (en) 2017-01-06 2020-07-07 Microsoft Technology Licensing, Llc Domain review system for identifying entity relationships and corresponding insights
US11016985B2 (en) * 2018-05-22 2021-05-25 International Business Machines Corporation Providing relevant evidence or mentions for a query
US11354501B2 (en) * 2019-08-02 2022-06-07 Spectacles LLC Definition retrieval and display

Citations (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5909677A (en) * 1996-06-18 1999-06-01 Digital Equipment Corporation Method for determining the resemblance of documents
US5926812A (en) * 1996-06-20 1999-07-20 Mantra Technologies, Inc. Document extraction and comparison method with applications to automatic personalized database searching
US6119124A (en) * 1998-03-26 2000-09-12 Digital Equipment Corporation Method for clustering closely resembling data objects
US20010000356A1 (en) * 1995-07-07 2001-04-19 Woods William A. Method and apparatus for generating query responses in a computer-based document retrieval system
US6256622B1 (en) * 1998-04-21 2001-07-03 Apple Computer, Inc. Logical division of files into multiple articles for search and retrieval
US6285999B1 (en) * 1997-01-10 2001-09-04 The Board Of Trustees Of The Leland Stanford Junior University Method for node ranking in a linked database
US6370551B1 (en) * 1998-04-14 2002-04-09 Fuji Xerox Co., Ltd. Method and apparatus for displaying references to a user's document browsing history within the context of a new document
US20020052730A1 (en) * 2000-09-25 2002-05-02 Yoshio Nakao Apparatus for reading a plurality of documents and a method thereof
US6411953B1 (en) * 1999-01-25 2002-06-25 Lucent Technologies Inc. Retrieval and matching of color patterns based on a predetermined vocabulary and grammar
US20020123994A1 (en) * 2000-04-26 2002-09-05 Yves Schabes System for fulfilling an information need using extended matching techniques
US20020161570A1 (en) * 1998-11-30 2002-10-31 Wayne Loofbourrow Multi-language document search and retrieval system
US6615209B1 (en) * 2000-02-22 2003-09-02 Google, Inc. Detecting query-specific duplicate documents
US6658626B1 (en) * 1998-07-31 2003-12-02 The Regents Of The University Of California User interface for displaying document comparison information
US6665837B1 (en) * 1998-08-10 2003-12-16 Overture Services, Inc. Method for identifying related pages in a hyperlinked database
US20040064438A1 (en) * 2002-09-30 2004-04-01 Kostoff Ronald N. Method for data and text mining and literature-based discovery
US20040117366A1 (en) * 2002-12-12 2004-06-17 Ferrari Adam J. Method and system for interpreting multiple-term queries
US6859800B1 (en) * 2000-04-26 2005-02-22 Global Information Research And Technologies Llc System for fulfilling an information need
US20050165600A1 (en) * 2004-01-27 2005-07-28 Kas Kasravi System and method for comparative analysis of textual documents
US20050198070A1 (en) * 2004-03-08 2005-09-08 Marpex Inc. Method and system for compression indexing and efficient proximity search of text data
US20060020607A1 (en) * 2004-07-26 2006-01-26 Patterson Anna L Phrase-based indexing in an information retrieval system
US20060129538A1 (en) * 2004-12-14 2006-06-15 Andrea Baader Text search quality by exploiting organizational information
US20060143175A1 (en) * 2000-05-25 2006-06-29 Kanisa Inc. System and method for automatically classifying text
US7139752B2 (en) * 2003-05-30 2006-11-21 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
US7146361B2 (en) * 2003-05-30 2006-12-05 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a Weighted AND (WAND)
US20060287971A1 (en) * 2005-06-15 2006-12-21 Geronimo Development Corporation Document quotation indexing system and method
US20070055926A1 (en) * 2005-09-02 2007-03-08 Fourteen40, Inc. Systems and methods for collaboratively annotating electronic documents
US20070136281A1 (en) * 2005-12-13 2007-06-14 Microsoft Corporation Training a ranking component
US7277766B1 (en) * 2000-10-24 2007-10-02 Moodlogic, Inc. Method and system for analyzing digital audio files
US20070294610A1 (en) * 2006-06-02 2007-12-20 Ching Phillip W System and method for identifying similar portions in documents
US20080033982A1 (en) * 2006-08-04 2008-02-07 Yahoo! Inc. System and method for determining concepts in a content item using context
US20080046394A1 (en) * 2006-08-14 2008-02-21 Microsoft Corporation Knowledge extraction from online discussion forums
US7660819B1 (en) * 2000-07-31 2010-02-09 Alion Science And Technology Corporation System for similar document detection
US7673344B1 (en) * 2002-09-18 2010-03-02 Symantec Corporation Mechanism to search information content for preselected data
US7734627B1 (en) * 2003-06-17 2010-06-08 Google Inc. Document similarity detection

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6295542B1 (en) * 1998-10-02 2001-09-25 National Power Plc Method and apparatus for cross-referencing text
US6411962B1 (en) * 1999-11-29 2002-06-25 Xerox Corporation Systems and methods for organizing text
US7062498B2 (en) * 2001-11-02 2006-06-13 Thomson Legal Regulatory Global Ag Systems, methods, and software for classifying text from judicial opinions and other documents
JP2007280113A (en) * 2006-04-07 2007-10-25 Canon Inc Proxy service providing device and network system
US20080228769A1 (en) * 2007-03-15 2008-09-18 Siemens Medical Solutions Usa, Inc. Medical Entity Extraction From Patient Data

Patent Citations (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010000356A1 (en) * 1995-07-07 2001-04-19 Woods William A. Method and apparatus for generating query responses in a computer-based document retrieval system
US6230155B1 (en) * 1996-06-18 2001-05-08 Altavista Company Method for determining the resemblance of documents
US5909677A (en) * 1996-06-18 1999-06-01 Digital Equipment Corporation Method for determining the resemblance of documents
US5926812A (en) * 1996-06-20 1999-07-20 Mantra Technologies, Inc. Document extraction and comparison method with applications to automatic personalized database searching
US6285999B1 (en) * 1997-01-10 2001-09-04 The Board Of Trustees Of The Leland Stanford Junior University Method for node ranking in a linked database
US6349296B1 (en) * 1998-03-26 2002-02-19 Altavista Company Method for clustering closely resembling data objects
US6119124A (en) * 1998-03-26 2000-09-12 Digital Equipment Corporation Method for clustering closely resembling data objects
US6370551B1 (en) * 1998-04-14 2002-04-09 Fuji Xerox Co., Ltd. Method and apparatus for displaying references to a user's document browsing history within the context of a new document
US6256622B1 (en) * 1998-04-21 2001-07-03 Apple Computer, Inc. Logical division of files into multiple articles for search and retrieval
US6658626B1 (en) * 1998-07-31 2003-12-02 The Regents Of The University Of California User interface for displaying document comparison information
US6665837B1 (en) * 1998-08-10 2003-12-16 Overture Services, Inc. Method for identifying related pages in a hyperlinked database
US20020161570A1 (en) * 1998-11-30 2002-10-31 Wayne Loofbourrow Multi-language document search and retrieval system
US6411953B1 (en) * 1999-01-25 2002-06-25 Lucent Technologies Inc. Retrieval and matching of color patterns based on a predetermined vocabulary and grammar
US6615209B1 (en) * 2000-02-22 2003-09-02 Google, Inc. Detecting query-specific duplicate documents
US6859800B1 (en) * 2000-04-26 2005-02-22 Global Information Research And Technologies Llc System for fulfilling an information need
US20020123994A1 (en) * 2000-04-26 2002-09-05 Yves Schabes System for fulfilling an information need using extended matching techniques
US20060143175A1 (en) * 2000-05-25 2006-06-29 Kanisa Inc. System and method for automatically classifying text
US7660819B1 (en) * 2000-07-31 2010-02-09 Alion Science And Technology Corporation System for similar document detection
US20020052730A1 (en) * 2000-09-25 2002-05-02 Yoshio Nakao Apparatus for reading a plurality of documents and a method thereof
US7277766B1 (en) * 2000-10-24 2007-10-02 Moodlogic, Inc. Method and system for analyzing digital audio files
US7673344B1 (en) * 2002-09-18 2010-03-02 Symantec Corporation Mechanism to search information content for preselected data
US20040064438A1 (en) * 2002-09-30 2004-04-01 Kostoff Ronald N. Method for data and text mining and literature-based discovery
US20040117366A1 (en) * 2002-12-12 2004-06-17 Ferrari Adam J. Method and system for interpreting multiple-term queries
US7146361B2 (en) * 2003-05-30 2006-12-05 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a Weighted AND (WAND)
US7139752B2 (en) * 2003-05-30 2006-11-21 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
US7734627B1 (en) * 2003-06-17 2010-06-08 Google Inc. Document similarity detection
US20050165600A1 (en) * 2004-01-27 2005-07-28 Kas Kasravi System and method for comparative analysis of textual documents
US20050198070A1 (en) * 2004-03-08 2005-09-08 Marpex Inc. Method and system for compression indexing and efficient proximity search of text data
US7536408B2 (en) * 2004-07-26 2009-05-19 Google Inc. Phrase-based indexing in an information retrieval system
US20060020607A1 (en) * 2004-07-26 2006-01-26 Patterson Anna L Phrase-based indexing in an information retrieval system
US20060129538A1 (en) * 2004-12-14 2006-06-15 Andrea Baader Text search quality by exploiting organizational information
US20060287971A1 (en) * 2005-06-15 2006-12-21 Geronimo Development Corporation Document quotation indexing system and method
US20070055926A1 (en) * 2005-09-02 2007-03-08 Fourteen40, Inc. Systems and methods for collaboratively annotating electronic documents
US20070136281A1 (en) * 2005-12-13 2007-06-14 Microsoft Corporation Training a ranking component
US20070294610A1 (en) * 2006-06-02 2007-12-20 Ching Phillip W System and method for identifying similar portions in documents
US20080033982A1 (en) * 2006-08-04 2008-02-07 Yahoo! Inc. System and method for determining concepts in a content item using context
US20080046394A1 (en) * 2006-08-14 2008-02-21 Microsoft Corporation Knowledge extraction from online discussion forums

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9875313B1 (en) * 2009-08-12 2018-01-23 Google Llc Ranking authors and their content in the same framework
US20120216107A1 (en) * 2009-10-30 2012-08-23 Rakuten, Inc. Characteristic content determination program, characteristic content determination device, characteristic content determination method, recording medium, content generation device, and related content insertion device
US20120095993A1 (en) * 2010-10-18 2012-04-19 Jeng-Jye Shau Ranking by similarity level in meaning for written documents
US20150142571A1 (en) * 2011-05-23 2015-05-21 Google Inc. System and method for increasing the likelihood of users reviewing advertisements
CN104285142A (en) * 2012-04-25 2015-01-14 Atonarp株式会社 System which provides content
EP2843399A4 (en) * 2012-04-25 2016-01-27 Atonarp Inc System which provides content
US20150046468A1 (en) * 2013-08-12 2015-02-12 Alcatel Lucent Ranking linked documents by modeling how links between the documents are used
US9614724B2 (en) 2014-04-21 2017-04-04 Microsoft Technology Licensing, Llc Session-based device configuration
WO2015168344A1 (en) * 2014-05-02 2015-11-05 Microsoft Technology Licensing, Llc Searching locally defined entities
US20150317313A1 (en) * 2014-05-02 2015-11-05 Microsoft Corporation Searching locally defined entities
US9430667B2 (en) 2014-05-12 2016-08-30 Microsoft Technology Licensing, Llc Managed wireless distribution network
US9384335B2 (en) 2014-05-12 2016-07-05 Microsoft Technology Licensing, Llc Content delivery prioritization in managed wireless distribution networks
US9384334B2 (en) 2014-05-12 2016-07-05 Microsoft Technology Licensing, Llc Content discovery in managed wireless distribution networks
US10111099B2 (en) 2014-05-12 2018-10-23 Microsoft Technology Licensing, Llc Distributing content in managed wireless distribution networks
US9874914B2 (en) 2014-05-19 2018-01-23 Microsoft Technology Licensing, Llc Power management contracts for accessory devices
US10691445B2 (en) 2014-06-03 2020-06-23 Microsoft Technology Licensing, Llc Isolating a portion of an online computing service for testing
US9367490B2 (en) 2014-06-13 2016-06-14 Microsoft Technology Licensing, Llc Reversible connector for accessory devices
US9477625B2 (en) 2014-06-13 2016-10-25 Microsoft Technology Licensing, Llc Reversible connector for accessory devices
US20160098405A1 (en) * 2014-10-01 2016-04-07 Docurated, Inc. Document Curation System
WO2016161089A1 (en) * 2015-04-03 2016-10-06 Klangoo, Inc. Techniques for understanding the aboutness of text based on semantic analysis
US9632999B2 (en) 2015-04-03 2017-04-25 Klangoo, Sal. Techniques for understanding the aboutness of text based on semantic analysis
CN111159461A (en) * 2019-12-30 2020-05-15 秒针信息技术有限公司 Audio file determination method and device, storage medium and electronic device
CN112685534A (en) * 2020-12-23 2021-04-20 上海掌门科技有限公司 Method and apparatus for generating context information of authored content during authoring process

Also Published As

Publication number Publication date
US20090055394A1 (en) 2009-02-26
US9323827B2 (en) 2016-04-26

Similar Documents

Publication Publication Date Title
US20090055389A1 (en) Ranking similar passages
US7958128B2 (en) Query-independent entity importance in books
US8122032B2 (en) Identifying and linking similar passages in a digital text corpus
US8073877B2 (en) Scalable semi-structured named entity detection
US8862591B2 (en) System and method for evaluating sentiment
JP5281405B2 (en) Selecting high-quality reviews for display
US8386240B2 (en) Domain dictionary creation by detection of new topic words using divergence value comparison
US7519588B2 (en) Keyword characterization and application
US20100049709A1 (en) Generating Succinct Titles for Web URLs
US20110131205A1 (en) System and method to identify context-dependent term importance of queries for predicting relevant search advertisements
US20070219986A1 (en) Method and apparatus for extracting terms based on a displayed text
US20120278341A1 (en) Document analysis and association system and method
US9355372B2 (en) Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
TR201816343T4 (en) Systems and methods for searching queries using different language and / or language from different pages.
US20080065621A1 (en) Ambiguous entity disambiguation method
US20130151538A1 (en) Entity summarization and comparison
AU2014285073A1 (en) Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
US20090204910A1 (en) System and method for web directory and search result display
US20100299322A1 (en) System and method for web page identifications
US20040162824A1 (en) Method and apparatus for classifying a document with respect to reference corpus
JP2010282403A (en) Document retrieval method
CN111737607A (en) Data processing method, data processing device, electronic equipment and storage medium
Buntine et al. Topic-specific scoring of documents for relevant retrieval
Mason An n-gram based approach to the automatic classification of web pages by genre
Alsmadi et al. Google n-gram viewer does not include arabic corpus! towards n-gram viewer for arabic corpus.

Legal Events

Date Code Title Description
AS Assignment

Owner name: GOOGLE INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SCHILIT, WILLIAM NOAH;KOLAK, OKAN;VINCENT-FOGLESONG, JUSTIN JOHN PAUL;REEL/FRAME:021061/0104

Effective date: 20080605

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: GOOGLE LLC, CALIFORNIA

Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044142/0357

Effective date: 20170929