WO2008083447A1 - Method and system of obtaining related information - Google Patents

Method and system of obtaining related information

Publication number
WO2008083447A1
Authority
WO
WIPO (PCT)
Prior art keywords
documents
document
graph
frequency map
source document
Application number
PCT/AU2008/000032
Other languages
French (fr)
Inventor
Jason Nicholas Polites
Original Assignee
Synetek Systems Pty Ltd
Priority claimed from AU2007900155A
Application filed by Synetek Systems Pty Ltd
Publication of WO2008083447A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files

Definitions

  • This invention relates to a method and system of obtaining related information and more particularly relates to a method and system of obtaining related information from a large volume of electronic or text based data.
  • Locating and obtaining information from a large corpus or body of unstructured data is usually done through standard search techniques, which in turn are underpinned by indexing systems, or through the introduction of manual and automated meta-data assignments.
  • the only search solution is to use a standard search underpinned by an indexing system.
  • a particular problem of being limited to a search is that it relies on the basis that the user has some knowledge of what they are actually searching for. Thus for example the user may use a combination of keywords and Boolean operators to narrow down the field of the search and therefore extract the most relevant documents.
  • the present invention seeks to overcome one or more of the above disadvantages by providing a method and system that automatically obtains or discovers all information related to a given topic or subject matter.
  • the topic or the subject matter generally refers to a source document or documents or a free text query entered by a user.
  • a method of obtaining related documents from a body of documents comprising the steps of: locating a source document; applying a similarity function to some or all of the documents in order to determine documents related to the source document; forming a graph of all related documents; determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalizing the term frequencies to form a single normalized frequency map; and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.
  • the similarity function performs a first iteration to retrieve a first row of related documents to the source document. A further iteration using the similarity function may be performed on each retrieved related document in the first row to obtain further related documents to each related document in the first row.
  • a threshold of minimum similarity may be applied to each retrieved document so that those documents not meeting the threshold are not used in a current iteration or subsequent iterations.
  • the master frequency map may contain all document term frequencies such that the map represents the total of the frequencies in each of the term frequencies obtained from the graph.
  • the normalizing step may comprise using a statistical method, such as a median operation or a mean operation.
  • the comparing step may be performed using a statistical probability function that measures how closely a document in the graph represents the normalized map. A threshold may be used in the comparing step.
  • the documents obtained as a result of the comparing step may be ranked in order of how closely each document resembles the normalized map.
  • a system for obtaining related documents from a body of documents comprising: memory means for storing the body of documents; processor means for: applying a similarity function to some or all of the documents in order to determine documents related to a source document; forming a graph of all related documents; determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalizing the term frequencies to form a single normalized frequency map; and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.
  • a third aspect of the invention there is provided computer program means for instructing a processor to obtain related documents from a body of documents, wherein after a source document has been determined, the program means instructs the processor to: apply a similarity function to some or all of the documents in order to determine documents related to the source document; form a graph of all related documents; determine the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalize the term frequencies to form a single normalized frequency map; and compare each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.
  • Figure 1 is a block diagram showing how a similarity function enables documents related to a source document to be obtained
  • Figure 1A is a block diagram showing a graph of related documents
  • Figure 2 is a block diagram showing the formation of a master frequency map
  • Figure 3 is a flow diagram showing the operation of a computer program to implement the present invention according to an embodiment.
  • FIG. 1 there is a block diagram showing a sequence of documents related to a source document found via a process, implemented through software, using a similarity function.
  • a source document 10 is located, and this may be for example an email about information on which the user wishes to search for related documents.
  • a similarity function is then implemented through the software to determine a first row 12 of related documents 14, 16 and 18. This is the first stage of the process whereby the similarity function determines a percentile figure on the similarity of a document to the source document 10, based on their content.
  • a further search is performed using the similarity function for other documents that are related to the document 14.
  • it uncovers related documents 20, 22 and 24 that are all regarded as relevant to document 14.
  • All of the documents obtained through this process form a graph of documents which are considered "related" to the source document 10.
  • Shown in Figure 1A is the graph 35 of related documents.
  • Links 36 and 37 (dotted lines) show documents having unknown similarity, whilst all other links (full line) connect documents having a known similarity.
  • the graph does not, however, necessarily provide any accuracy in determining the similarity of the uncovered documents to the source document or to other documents in the graph. This determination is made during the next phase in the process.
  • Documents that are obtained after the execution of the similarity function are subject to elimination from current or subsequent phases of execution, for example between stages 1 and 2, on the basis of a similarity threshold 34.
  • the minimum similarity required by the threshold is set by the user, with the general rule that the lower the threshold, the more results are returned, but with lower accuracy.
  • related document 18 is not considered to contain the minimum similarity, being below threshold 34, and is thus not considered for the iteration that leads to row 32.
  • the similarity function may be any function which produces a probability or similar indication of two documents being related.
  • a method called the cosine rule is used which uses analysis of term vectors within the written content of the information to be extracted and found similar to a source document 10. It is a well understood method of determining document similarity.
  • the next phase involves a determination of the context of the graph. This refers to the general subject matter of the information obtained via the similarity function in the previous phase of the process. In the case of textual data this is represented by term frequencies.
  • Each document consists of a number of terms which are typically words. Each term, excluding the stop terms, in the document is then counted such that several terms usually occur more than once.
  • Natural language analysis can be implemented, which refers to the ability to infer the context from the content of the document by understanding the semantics and vocabulary of the language used. As an example, if a document contains the phrase "house for sale” this will relate to another phrase "home on the market". Even though the words are different, the meaning conveyed is similar and this will be determined among documents using natural language analysis. The analysis therefore provides a greater accuracy in determining the similarity of two documents.
  • the master term frequency map 40 attempts to represent the context of the entire graph that resulted from analysis by the similarity function. It does this not for a single document but for the entire collection of documents.
  • the master frequency map 40 contains the aggregation of all document term frequencies such that it represents the total sum of the frequencies in each of the term frequencies obtained from the graph. It enables an understanding of the topic or context of all documents in the graph. Each document is then compared to the aggregate context in order to determine its relevance to the search as a whole.
  • the document frequency map 42 compiles the term frequencies, for example 44 in related document 46, for each of the related documents 46, 48, 50 and 52. These are subsequently compiled into the single master term frequency map 40 and then each related document is compared to the aggregate context in the master frequency map 40 to determine the relevance to the whole search.
  • the term frequencies are normalized to represent a typical profile of a document in the graph. This is done through the application of a simple statistical method such as a median operation or a mean operation. The end result of this operation is a single frequency map that defines the overall context of information in the graph.
  • the next phase in the process is the condensing of the graph into those documents that are closely related to the context defined for the graph.
  • Each document in the graph is compared to the normalized frequency map using a statistical probability function.
  • This function can be any statistical probability method that measures how closely a document in the graph represents the normalized context.
  • the default implementation uses a Chi-squared "best fit" method in order to determine the final similarity.
  • the condensing of the graph also uses a threshold which is not necessarily the same threshold used earlier to eliminate documents that do not closely resemble the normalized map.
  • the final stage of the process is to rank the documents in order of how closely they resemble a normalized map. This results in the final list of documents that are considered to be "related" to the original source document 10.
  • large organisations can have an archive of many thousands of documents.
  • a server of the business may keep copies of all documents that are transmitted to and from and within an organisation. This can be a disadvantage when a simple or detailed keyword search is conducted as there will still be a lot of documents provided in the search results.
  • the returned results will only match the input keyword or combination of keywords. This is a serious problem when the user conducting the search does not know exactly what he/she is searching for.
  • a particular example is a legal discovery process.
  • the present invention enables an automated document retrieval system. All electronic documents, such as emails, within a business are copied and indexed, including attachments, to provide detailed search capabilities. All mailboxes in the business's email archive are able to be searched. Once an email that appears to relate to the intended search criteria has been located, the system, through the software, locates all related documents to this email using the above described method across the whole business archive.
  • a source document 10 is determined, for example a relevant email, at step 62, Figure 3, then the server or a processor within the server is instructed by the software to use a similarity function in order to retrieve all of the documents, including emails, within the business that are related to the source document. This is done at step 64.
  • a graph is formed of the related documents at step 68.
  • term frequencies are obtained at step 70 from each of the documents and each term within each document is counted. Once this is done, the term frequencies in each document or email are compiled into a single master term frequency map at step 72 in order to determine the context of each document.
  • the software performs a normalization function to each of the term frequencies across all of the related documents that were formed in the graph. This normalization process then represents a typical profile of a document in the graph. Thus a single frequency map is derived at step 76 which defines the overall context of information in the graph.
  • the graph is condensed into those documents that are closely related to the context defined for the graph. Thus each document in the graph is compared to the normalized single frequency map in order to find the most closely related documents to the context defined for the graph.
  • step 80 the retrieved documents that are found as being the most closely related to the context defined for the graph are ranked in order.
  • the order of the documents represents how closely each document resembles the normalized single frequency map and results in a final list of documents that are related to the original source document.
  • An improvement is to include meta-data fields, if available, to better describe the context of a document.
  • the title or subject of a document may contain more contextual information than its content.
  • term frequencies obtained from these meta-data fields can be enhanced or crippled based on how important the meta-data field is to the context of the document.
  • a simple solution is to artificially increase or decrease the frequency of terms contained within the sensitive meta-data elements.
  • the normalized frequency map obtained previously may be used as a source document for a second iteration of the entire process or for further iterations. That is, the normalized frequency map is treated as a new source document and the process is conducted using this map as the source. The results obtained from such a second or subsequent pass would then either replace or be merged with the results obtained during the first iteration to retrieve related documents.
  • an improvement to the invention would include adjusting term frequencies obtained during the initial phases of the search by a factor reflective of the initial similarity of the document from which the terms were obtained. That is, documents discovered during the initial phase would have their term frequencies adjusted either up or down based on how similar they are to the source document or their parent document in the graph.
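
The similarity-based adjustment of term frequencies described in the last item can be sketched as follows. This is a minimal illustration only: the function name, the dictionary layout and the choice of multiplying each count directly by the similarity score are assumptions, not details given in the specification.

```python
from collections import Counter


def similarity_weighted_master_map(doc_maps, similarities):
    """Aggregate term frequencies, scaling each document's counts by its
    initial similarity to the source (or parent) document.

    doc_maps:     {doc_id: {term: count}} for every document in the graph
    similarities: {doc_id: score in [0, 1]} from the initial phase
    Both structures and the linear scaling are hypothetical choices.
    """
    master = Counter()
    for doc_id, freq in doc_maps.items():
        # A document with no recorded score keeps its raw counts.
        factor = similarities.get(doc_id, 1.0)
        for term, count in freq.items():
            master[term] += count * factor
    return master
```

With this weighting, a weakly related document contributes less to the context of the graph than a strongly related one, even if both contain the same terms.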

Abstract

A method and system of obtaining related documents from a body of documents comprising the steps of locating a source document, applying a similarity function to some or all of the documents in order to determine documents related to the source document, forming a graph of all related documents, determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph, normalizing the term frequencies to form a single normalized frequency map, and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.

Description

"Method and system of obtaining related information"
Cross-Reference to Related Applications
The present application claims priority from Australian Provisional Patent Application No 2007900155 filed on 12 January 2007, the content of which is incorporated herein by reference.
Field of the Invention
This invention relates to a method and system of obtaining related information and more particularly relates to a method and system of obtaining related information from a large volume of electronic or text based data.
Background of the Invention
Locating and obtaining information from a large corpus or body of unstructured data is usually done through standard search techniques, which in turn are underpinned by indexing systems, or through the introduction of manual and automated meta-data assignments. In those situations where the application of meta-data is impractical or impossible, the only search solution is to use a standard search underpinned by an indexing system. A particular problem of being limited to a search is that it relies on the user having some knowledge of what they are actually searching for. Thus for example the user may use a combination of keywords and Boolean operators to narrow down the field of the search and therefore extract the most relevant documents. While this is often the case, there are several situations where information relevant to the user's search exists within the body of unstructured data but may not appear in the search results as it may not contain the specific search terms or combination of keywords used. Thus, there is no possibility of actually retrieving documents relevant to a source document where the user generally does not have any knowledge of the particular subject matter.
The present invention seeks to overcome one or more of the above disadvantages by providing a method and system that automatically obtains or discovers all information related to a given topic or subject matter. The topic or the subject matter generally refers to a source document or documents or a free text query entered by a user.
Summary of the Invention
According to a first aspect of the invention there is provided a method of obtaining related documents from a body of documents comprising the steps of: locating a source document; applying a similarity function to some or all of the documents in order to determine documents related to the source document; forming a graph of all related documents; determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalizing the term frequencies to form a single normalized frequency map; and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document. Preferably the similarity function performs a first iteration to retrieve a first row of related documents to the source document. A further iteration using the similarity function may be performed on each retrieved related document in the first row to obtain further related documents to each related document in the first row.
In performing each iteration, a threshold of minimum similarity may be applied to each retrieved document so that those documents not meeting the threshold are not used in a current iteration or subsequent iterations.
The master frequency map may contain all document term frequencies such that the map represents the total of the frequencies in each of the term frequencies obtained from the graph. The normalizing step may comprise using a statistical method, such as a median operation or a mean operation. The comparing step may be performed using a statistical probability function that measures how closely a document in the graph represents the normalized map. A threshold may be used in the comparing step.
The documents obtained as a result of the comparing step may be ranked in order of how closely each document resembles the normalized map.
According to a second aspect of the invention there is provided a system for obtaining related documents from a body of documents comprising: memory means for storing the body of documents; processor means for: applying a similarity function to some or all of the documents in order to determine documents related to a source document; forming a graph of all related documents; determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalizing the term frequencies to form a single normalized frequency map; and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.
According to a third aspect of the invention there is provided computer program means for instructing a processor to obtain related documents from a body of documents, wherein after a source document has been determined, the program means instructs the processor to: apply a similarity function to some or all of the documents in order to determine documents related to the source document; form a graph of all related documents; determine the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalize the term frequencies to form a single normalized frequency map; and compare each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.
Brief Description of the Drawings
A preferred embodiment of the invention will hereinafter be described, by way of example only, with reference to the drawings in which:
Figure 1 is a block diagram showing how a similarity function enables documents related to a source document to be obtained;
Figure 1A is a block diagram showing a graph of related documents;
Figure 2 is a block diagram showing the formation of a master frequency map; and
Figure 3 is a flow diagram showing the operation of a computer program to implement the present invention according to an embodiment.
Detailed Description of the Preferred Embodiment
Referring to Figure 1 there is a block diagram showing a sequence of documents related to a source document found via a process, implemented through software, using a similarity function. Specifically in Figure 1 a source document 10 is located, and this may be for example an email about information on which the user wishes to search for related documents. A similarity function is then implemented through the software to determine a first row 12 of related documents 14, 16 and 18. This is the first stage of the process whereby the similarity function determines a percentile figure on the similarity of a document to the source document 10, based on their content. On finding the related document 14, a further search is performed using the similarity function for other documents that are related to the document 14. Thus on a repetition of the process it uncovers related documents 20, 22 and 24 that are all regarded as relevant to document 14. Similarly, the similarity function is performed to find documents related to document 16 and this iteration uncovers documents 26, 28 and 30 as being related to document 16. Thus a second row 32 has now been formed of documents related to those that were uncovered in the first row 12 of the process. The whole process can be repeated any number of times, but typically it would only extend to one or two levels of indirection beyond the source document 10.
All of the documents obtained through this process form a graph of documents which are considered "related" to the source document 10. Shown in Figure 1A is the graph 35 of related documents. Links 36 and 37 (dotted lines) show documents having unknown similarity, whilst all other links (full line) connect documents having a known similarity. At this stage the graph does not, however, necessarily provide any accuracy in determining the similarity of the uncovered documents to the source document or to other documents in the graph. This determination is made during the next phase in the process.
Documents that are obtained after the execution of the similarity function are subject to elimination from current or subsequent phases of execution, for example between stages 1 and 2, on the basis of a similarity threshold 34. This defines the minimum similarity to the source document 10 required in order to participate in current or future phases of execution. The minimum similarity required by the threshold is set by the user, with the general rule that the lower the threshold, the more results are returned, but with lower accuracy. In Figure 1, related document 18 is not considered to contain the minimum similarity, being below threshold 34, and is thus not considered for the iteration that leads to row 32.
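The row-by-row expansion with a similarity threshold can be sketched as below. This is an illustrative sketch under stated assumptions: the function name, the edge-list representation, the default threshold of 0.5 and the default depth of two rows are all hypothetical, and for simplicity each candidate is scored against the document being expanded rather than against the source document 10.

```python
def build_related_graph(source, corpus, similarity, threshold=0.5, max_depth=2):
    """Expand from the source document one row at a time.

    source:     identifier of the source document
    corpus:     iterable of candidate document identifiers
    similarity: callable(doc_a, doc_b) -> score (any similarity function)
    Returns a list of (parent, child, score) edges forming the graph.
    """
    edges = []
    seen = {source}
    frontier = [source]
    for _ in range(max_depth):
        next_row = []
        for parent in frontier:
            for candidate in corpus:
                if candidate in seen:
                    continue
                score = similarity(parent, candidate)
                # Documents below the threshold are dropped, so they are
                # never expanded in a later iteration.
                if score >= threshold:
                    seen.add(candidate)
                    edges.append((parent, candidate, score))
                    next_row.append(candidate)
        frontier = next_row
    return edges
```

Because a document below the threshold never enters the next row, anything reachable only through it is not discovered, which mirrors the treatment of document 18 in Figure 1.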
The similarity function may be any function which produces a probability or similar indication of two documents being related. In practice a method called the cosine rule is used, which analyses term vectors within the written content of the documents being compared with the source document 10. It is a well understood method of determining document similarity.
On obtaining the graph of related documents as discussed with regard to Figure 1, the next phase involves a determination of the context of the graph. This refers to the general subject matter of the information obtained via the similarity function in the previous phase of the process. In the case of textual data this is represented by term frequencies. Each document consists of a number of terms which are typically words. Each term, excluding the stop terms, in the document is then counted such that several terms usually occur more than once. The more instances of a term appearing, the greater its importance in defining the context of the graph. This context definition can be improved through the use of term correlation and/or natural language analysis which is used to determine similarity.
The present implementation uses the cosine rule, however this makes no distinction between the various components of structured data. There is an assumption made by the similarity function that all data in a document is of equal relevance as to the "meaning" of the document. In some cases it is possible to improve on this. As an example, the subject line of an email may have more information about the content of the email than the actual text in the body of the message. Therefore, two emails with the same subject line could be thought of as more related, or more strongly correlated, than two emails with similar content. In semi-structured data, some data elements convey a greater impact on similarity than other elements.
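The term counting and the cosine rule can be sketched as follows. This is a minimal sketch, not the implementation referred to in the specification: the tokenizer, the tiny stop-term list and the function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Illustrative stop terms only; a real list would be much longer.
STOP_TERMS = {"the", "a", "an", "of", "for", "on", "is", "to", "and", "in"}


def term_frequencies(text):
    """Count each term in the text, excluding stop terms."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in terms if t not in STOP_TERMS)


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * freq_b.get(term, 0) for term, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)
```

Note that this measure only sees shared terms: two texts that express the same idea with different words score zero, and every term counts equally regardless of where in the document it appears.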
Natural language analysis can be implemented, which refers to the ability to infer the context from the content of the document by understanding the semantics and vocabulary of the language used. As an example, if a document contains the phrase "house for sale" this will relate to another phrase "home on the market". Even though the words are different, the meaning conveyed is similar and this will be determined among documents using natural language analysis. The analysis therefore provides a greater accuracy in determining the similarity of two documents.
On obtaining the term frequencies or the number of terms that appear in each document, these are compiled into a single master term frequency map, as shown in Figure 2. This is used to determine the context of each document, for example each email. This master term frequency map 40 attempts to represent the context of the entire graph that resulted from analysis by the similarity function. It does this not for a single document but for the entire collection of documents. Thus the master frequency map 40 contains the aggregation of all document term frequencies such that it represents the total sum of the frequencies in each of the term frequencies obtained from the graph. It enables an understanding of the topic or context of all documents in the graph. Each document is then compared to the aggregate context in order to determine its relevance to the search as a whole. The document frequency map 42 compiles the term frequencies, for example 44 in related document 46, for each of the related documents 46, 48, 50 and 52. These are subsequently compiled into the single master term frequency map 40 and then each related document is compared to the aggregate context in the master frequency map 40 to determine the relevance to the whole search.
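The aggregation into the master term frequency map can be sketched with the standard library's `Counter`. The function name and the list-of-maps input are assumptions made for illustration.

```python
from collections import Counter


def master_frequency_map(document_maps):
    """Sum the per-document term frequencies into one master map.

    document_maps: iterable of {term: count} mappings, one per document
    in the graph (a hypothetical layout).
    """
    master = Counter()
    for freq in document_maps:
        master.update(freq)  # Counter.update adds counts term by term
    return master
```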
Once the master frequency map 40 has been computed, the term frequencies are normalized to represent a typical profile of a document in the graph. This is done through the application of a simple statistical method such as a median operation or a mean operation. The end result of this operation is a single frequency map that defines the overall context of information in the graph.
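The normalization step can be sketched with a per-term mean or median across the documents of the graph. This is a sketch under stated assumptions: the function name is hypothetical, and treating a term that is absent from a document as a count of zero is a choice the text does not specify.

```python
from statistics import mean, median


def normalized_frequency_map(document_maps, method="mean"):
    """Reduce per-document term frequencies to one typical profile.

    document_maps: list of {term: count} mappings, one per document
    method:        "mean" or "median"
    A term missing from a document is counted as 0 (an assumption).
    """
    combine = {"mean": mean, "median": median}[method]
    terms = set().union(*document_maps)
    return {t: combine([m.get(t, 0) for m in document_maps]) for t in terms}
```

The two operations behave differently on sparse terms: with the median, a term that appears in fewer than half of the documents drops to zero, whereas the mean keeps a small non-zero weight for it.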
After frequency normalization has occurred the next phase in the process, again implemented through software, is the condensing of the graph into those documents that are closely related to the context defined for the graph. Each document in the graph is compared to the normalized frequency map using a statistical probability function. This function can be any statistical probability method that measures how closely a document in the graph represents the normalized context. In practice, the default implementation uses a Chi-squared "best fit" method in order to determine the final similarity. The condensing of the graph also uses a threshold, not necessarily the same as the threshold used earlier, to eliminate documents that do not closely resemble the normalized map.
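One way to sketch the comparison is with Pearson's chi-squared statistic, where a lower value means a closer fit. This is an assumption about what the "best fit" method could look like, not the default implementation itself; the function names, the scaling of the profile to the document's length and the `max_statistic` cut-off are all hypothetical.

```python
def chi_squared_statistic(doc_freq, normalized_map):
    """Pearson chi-squared statistic of a document against the profile.

    The profile is rescaled to the document's total term count so that
    long and short documents are comparable. Terms absent from the
    profile are not scored directly (an assumption of this sketch).
    """
    total_expected = sum(normalized_map.values())
    total_observed = sum(doc_freq.values())
    if total_expected == 0 or total_observed == 0:
        return float("inf")
    scale = total_observed / total_expected
    statistic = 0.0
    for term, expected in normalized_map.items():
        if expected <= 0:
            continue  # avoid division by zero for empty profile entries
        e = expected * scale
        o = doc_freq.get(term, 0)
        statistic += (o - e) ** 2 / e
    return statistic


def condense(graph_docs, normalized_map, max_statistic):
    """Keep only documents whose statistic is within the threshold."""
    kept = {}
    for doc_id, freq in graph_docs.items():
        stat = chi_squared_statistic(freq, normalized_map)
        if stat <= max_statistic:
            kept[doc_id] = stat
    return kept
```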
The final stage of the process is to rank the documents in order of how closely they resemble the normalized map. This results in the final list of documents that are considered to be "related" to the original source document 10.
As an example, large organisations can have an archive of many thousands of documents. A server of the business may keep copies of all documents that are transmitted to, from and within an organisation. This can be a disadvantage when a simple or detailed keyword search is conducted, as there will still be a lot of documents provided in the search results. Furthermore, the returned results will only match the input keyword or combination of keywords. This is a serious problem when the user conducting the search does not know exactly what he/she is searching for. A particular example is a legal discovery process. If a business were obliged to produce all evidence pertaining to a given matter, most of this may be confined to electronic documents. Where the documents span a long period of time and are numerous, the process is made even more difficult, expensive and prone to error.
The present invention enables an automated document retrieval system. All electronic documents, such as emails, within a business are copied and indexed, including attachments, to provide detailed search capabilities. All mailboxes in the business's email archive are able to be searched. Once an email that appears to relate to the intended search criteria has been located, the system, through the software, locates all related documents to this email using the above described method across the whole business archive.
Once a source document 10, for example a relevant email, is determined at step 62 of Figure 3, the server, or a processor within the server, is instructed by the software to use a similarity function to retrieve all of the documents, including emails, within the business that are related to the source document. This is done at step 64. At step 66 the retrieved documents are checked against a predefined threshold, and at step 68 a graph is formed of the related documents that are above the threshold.
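Steps 64 to 68 can be sketched as follows. This is a minimal illustration only: the specification does not fix the similarity function, so cosine similarity over raw term counts is assumed, and the names `build_graph`, `threshold` and `max_depth` are hypothetical. The `max_depth` parameter reflects the further iterations recited in the claims, in which each retrieved document is itself used to retrieve further related documents.

```python
import math
from collections import Counter, deque


def term_counts(text):
    """Count whitespace-separated, lower-cased terms (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity of two term-count vectors; an assumed similarity function."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_graph(source_id, corpus, threshold=0.3, max_depth=2):
    """Breadth-first expansion from the source document.

    Returns {doc_id: (parent_id, similarity_to_parent)}. Documents below the
    threshold are neither added nor used in later iterations.
    """
    counts = {doc_id: term_counts(text) for doc_id, text in corpus.items()}
    graph = {source_id: (None, 1.0)}
    frontier = deque([(source_id, 0)])
    while frontier:
        parent, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for doc_id in corpus:
            if doc_id in graph:
                continue
            sim = cosine_similarity(counts[parent], counts[doc_id])
            if sim >= threshold:
                graph[doc_id] = (parent, sim)
                frontier.append((doc_id, depth + 1))
    return graph


corpus = {
    "src": "budget meeting friday",
    "a": "budget meeting monday",
    "b": "meeting monday agenda notes",
    "c": "holiday photos beach",
}
print(build_graph("src", corpus))
```

In this toy corpus, document "b" is too dissimilar to the source to be retrieved directly but is reached through document "a" in the second iteration, while "c" is never added.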
In order to determine the context of the graph, term frequencies are obtained at step 70 from each of the documents, with each term within each document being counted. Once this is done, the term frequencies of each document or email are compiled into a single master term frequency map at step 72 in order to determine the context of the graph. At step 74 the software applies a normalization function to each of the term frequencies across all of the related documents that form the graph. The result of this normalization represents a typical profile of a document in the graph. Thus a single frequency map is derived at step 76 which defines the overall context of information in the graph. At step 78 the graph is condensed into those documents that are closely related to the context defined for the graph. That is, each document in the graph is compared to the normalized single frequency map in order to find the documents most closely related to the context defined for the graph.
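Steps 70 to 76 can be sketched as below. The claims state that normalization uses a mean or a median operation; to support both, this sketch keeps the per-document counts for each term in the master map rather than a single running total, which is an implementation assumption. The function names and the whitespace tokenizer are likewise hypothetical.

```python
from collections import Counter
from statistics import mean, median


def term_frequencies(text):
    """Step 70: count each term in one document (crude whitespace tokenizer)."""
    return Counter(text.lower().split())


def build_master_map(per_doc_counts):
    """Step 72: compile per-document counts into one map, term -> list of counts.

    A document that lacks a term contributes 0 for that term.
    """
    terms = set().union(*per_doc_counts)
    return {term: [counts[term] for counts in per_doc_counts] for term in terms}


def normalize(master_map, method="mean"):
    """Steps 74 to 76: reduce each term's counts to one typical value."""
    reducer = mean if method == "mean" else median
    return {term: reducer(counts) for term, counts in master_map.items()}


docs = ["invoice overdue invoice", "invoice paid", "meeting notes"]
master = build_master_map([term_frequencies(d) for d in docs])
print(normalize(master, "mean"))
print(normalize(master, "median"))
```

The median tends to suppress terms that occur in only a minority of documents, whereas the mean retains them at a reduced weight; which behaviour is preferable depends on the archive and is not prescribed by the source.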
Finally, at step 80, the retrieved documents found to be most closely related to the context defined for the graph are ranked in order. The order of the documents thus represents how closely each document resembles the normalized single frequency map, and the result is a final list of documents that are related to the original source document.
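The condensing and ranking of steps 78 and 80 can be sketched with a pluggable scoring function. In this illustration a simple absolute-difference distance stands in for the statistical probability function, purely to keep the example short; the names `condense_and_rank`, `l1_distance` and `max_distance` are assumptions and do not come from the specification.

```python
def l1_distance(doc_counts, normalized_map):
    """Stand-in scoring function: sum of absolute count differences (lower = closer)."""
    terms = set(doc_counts) | set(normalized_map)
    return sum(abs(doc_counts.get(t, 0) - normalized_map.get(t, 0)) for t in terms)


def condense_and_rank(graph_docs, normalized_map, distance, max_distance):
    """Drop documents beyond the threshold, then order the rest closest first.

    graph_docs maps doc_id -> term counts. Ties are broken by doc_id so that
    the ordering is deterministic.
    """
    scored = ((doc_id, distance(counts, normalized_map)) for doc_id, counts in graph_docs.items())
    kept = [(doc_id, score) for doc_id, score in scored if score <= max_distance]
    return sorted(kept, key=lambda pair: (pair[1], pair[0]))


profile = {"invoice": 2, "paid": 1}
graph_docs = {
    "d1": {"invoice": 2, "paid": 1},
    "d2": {"invoice": 1},
    "d3": {"beach": 4},
}
print(condense_and_rank(graph_docs, profile, l1_distance, max_distance=3))
```

Any function that returns a lower value for a closer match can be passed as `distance`, including a chi-squared statistic.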
An improvement is to include meta-data fields, if available, to better describe the context of a document. For example, the title or subject of a document may contain more contextual information than its content. In such cases the term frequencies obtained from these meta-data fields can be boosted or reduced based on how important the meta-data field is to the context of the document. A simple solution is to artificially increase or decrease the frequency of terms contained within the relevant meta-data elements.
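One possible way to apply such meta-data weighting is sketched below. The field names, the weight values and the function name `weighted_term_frequencies` are illustrative assumptions; the specification only says that frequencies from meta-data fields may be artificially increased or decreased.

```python
from collections import Counter


def weighted_term_frequencies(fields, field_weights, default_weight=1.0):
    """Count terms across all fields, scaling each occurrence by its field's weight.

    fields maps field name -> text; field_weights maps field name -> multiplier.
    A weight above 1 boosts a field's terms and a weight below 1 reduces them.
    """
    totals = Counter()
    for field, text in fields.items():
        weight = field_weights.get(field, default_weight)
        for term in text.lower().split():
            totals[term] += weight
    return dict(totals)


email = {
    "subject": "contract renewal",
    "body": "please review the contract",
    "footer": "confidential contract",
}
# Hypothetical weights: the subject counts three times, the footer half.
print(weighted_term_frequencies(email, {"subject": 3.0, "footer": 0.5}))
```

The resulting weighted counts can be used wherever plain term frequencies are used, for example when compiling the master term frequency map.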
The normalized frequency map obtained previously may be used as a source document for a second iteration of the entire process or for further iterations. That is, the normalized frequency map is treated as a new source document and the process is conducted using this map as the source. The results obtained from such a second or subsequent pass would then either replace or be merged with the results obtained during the first iteration to retrieve related documents.
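Because the normalized frequency map is already a set of term frequencies, it can be passed to a term-based similarity function in place of a document. The sketch below covers only the final step of such a further pass, namely replacing or merging the results. It assumes each pass yields a mapping from document identifier to a distance score where lower means closer, and that merging keeps the closer of two scores; neither assumption comes from the specification, and `combine_passes` is a hypothetical name.

```python
def combine_passes(first_pass, second_pass, mode="merge"):
    """Combine the results of two passes and return them ranked closest first.

    mode="replace" keeps only the second pass; mode="merge" keeps every
    document from either pass with its smaller (closer) distance.
    """
    if mode == "replace":
        combined = dict(second_pass)
    elif mode == "merge":
        combined = dict(first_pass)
        for doc_id, dist in second_pass.items():
            combined[doc_id] = min(dist, combined.get(doc_id, dist))
    else:
        raise ValueError("mode must be 'merge' or 'replace'")
    return sorted(combined.items(), key=lambda pair: (pair[1], pair[0]))


first = {"a": 1.0, "b": 3.0}
second = {"b": 2.0, "c": 0.5}
print(combine_passes(first, second, "merge"))
print(combine_passes(first, second, "replace"))
```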
Furthermore, an improvement to the invention would include adjusting term frequencies obtained during the initial phases of the search by a factor reflective of the initial similarity of the document from which the terms were obtained. That is, documents discovered during the initial phase would have their term frequencies adjusted either up or down based on how similar they are to the source document or their parent document in the graph.
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the scope of the invention as broadly described.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
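By way of illustration of the similarity-based adjustment of term frequencies described above, the following sketch scales a document's term counts by its similarity to the source or parent document. The specification does not give a formula, so the ratio `similarity / pivot`, the default pivot of 0.5 and the name `adjust_by_similarity` are assumptions.

```python
def adjust_by_similarity(term_counts, similarity, pivot=0.5):
    """Scale term counts up or down according to the document's similarity.

    A similarity above the pivot increases the counts and one below the pivot
    decreases them; a similarity equal to the pivot leaves them unchanged.
    """
    if pivot <= 0:
        raise ValueError("pivot must be positive")
    factor = similarity / pivot
    return {term: count * factor for term, count in term_counts.items()}


counts = {"tax": 4, "audit": 2}
print(adjust_by_similarity(counts, similarity=0.75))  # closely related document
print(adjust_by_similarity(counts, similarity=0.25))  # weakly related document
```

The adjusted counts would then contribute to the master term frequency map in place of the raw counts, so that closely related documents have more influence on the context of the graph.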

Claims

1. A method of obtaining related documents from a body of documents comprising the steps of: locating a source document; applying a similarity function to some or all of the documents in order to determine documents related to the source document; forming a graph of all related documents; determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalizing the term frequencies to form a single normalized frequency map; and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.
2. A method according to claim 1 wherein the similarity function performs a first iteration to retrieve a first row of related documents to the source document.
3. A method according to claim 2 wherein the similarity function determines a percentile figure on the similarity of a document to the source document based on the content of the document and the source document.
4. A method according to claim 2 or claim 3 wherein a further iteration using the similarity function is performed on each retrieved related document in the first row to obtain further related documents to each related document in the first row.
5. A method according to any one of claims 2 to 4 further comprising applying a threshold of minimum similarity to each retrieved document in each iteration, so that those documents not meeting the threshold are not used in a current iteration or subsequent iterations.
6. A method according to any one of the previous claims wherein the master frequency map contains all document term frequencies such that the map represents the total of the frequencies in each of the term frequencies obtained from the graph.
7. A method according to any one of the previous claims wherein the normalizing step uses a statistical method including either a median operation or a mean operation.
8. A method according to any one of the previous claims wherein the comparing step is performed using a statistical probability function that measures how closely a document in the graph represents the normalized frequency map.
9. A method according to claim 8 wherein a threshold is used in the comparing step to eliminate any documents that do not closely resemble the normalized frequency map.
10. A method according to any one of the previous claims further comprising ranking the documents obtained as a result of the comparing step in order of how closely each document resembles the normalized frequency map.
11. A method according to any one of the previous claims wherein the determining step further comprises using term correlation and/or natural language analysis.
12. A method according to any one of claims 3 to 11 wherein the normalized frequency map is used as a source document for iterations subsequent to the first iteration.
13. A method according to any one of the previous claims further comprising adjusting one or more determined term frequencies of a document based on the degree of similarity of the document to the source document or a parent document in the graph.
14. A system for obtaining related documents from a body of documents comprising: memory means for storing the body of documents; processor means for: applying a similarity function to some or all of the documents in order to determine documents related to a source document; forming a graph of all related documents; determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalizing the term frequencies to form a single normalized frequency map; and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.
15. A system according to claim 14 wherein the similarity function performs a first iteration to retrieve a first row of related documents to the source document.
16. A system according to claim 15 wherein the similarity function determines a percentile figure on the similarity of a document to the source document based on the content of the document and the source document.
17. A system according to claim 15 or claim 16 wherein a further iteration using the similarity function is performed on each retrieved related document in the first row to obtain further related documents to each related document in the first row.
18. A system according to any one of claims 15 to 17 wherein the processor means applies a threshold of minimum similarity to each retrieved document in each iteration, so that those documents not meeting the threshold are not used in a current iteration or subsequent iterations.
19. A system according to any one of claims 14 to 18 wherein the master frequency map contains all document term frequencies such that the map represents the total of the frequencies in each of the term frequencies obtained from the graph.
20. A system according to any one of claims 14 to 19 wherein the normalizing step undertaken by the processor means uses a statistical method including either a median operation or a mean operation.
21. A system according to any one of claims 14 to 20 wherein the comparing step undertaken by the processor means is performed using a statistical probability function that measures how closely a document in the graph represents the normalized frequency map.
22. A system according to claim 21 wherein a threshold is used in the comparison to eliminate any documents that do not closely resemble the normalized frequency map.
23. A system according to any one of claims 14 to 22 wherein the processor means ranks the documents obtained as a result of the comparing step in order of how closely each document resembles the normalized frequency map.
24. A system according to any one of claims 14 to 23 wherein the determining step undertaken by the processor means further comprises using term correlation and/or natural language analysis.
25. A system according to any one of claims 16 to 24 wherein the normalized frequency map is used as a source document for iterations subsequent to the first iteration.
26. A system according to any one of claims 14 to 25 wherein the processor means adjusts one or more determined term frequencies of a document based on the degree of similarity of the document to the source document or a parent document in the graph.
27. Computer program means for instructing a processor to obtain related documents from a body of documents, wherein after a source document has been determined, the program means instructs the processor to: apply a similarity function to some or all of the documents in order to determine documents related to the source document; form a graph of all related documents; determine the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalize the term frequencies to form a single normalized frequency map; and compare each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.
28. Computer program means according to claim 27 for instructing the processor to undertake any one or more of method claims 2 to 13.
PCT/AU2008/000032 2007-01-12 2008-01-11 Method and system of obtaining related information WO2008083447A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2007900155A AU2007900155A0 (en) 2007-01-12 Method and system of obtaining related information
AU2007900155 2007-01-12

Publications (1)

Publication Number Publication Date
WO2008083447A1 (en) 2008-07-17

Family

ID=39608268

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2008/000032 WO2008083447A1 (en) 2007-01-12 2008-01-11 Method and system of obtaining related information

Country Status (1)

Country Link
WO (1) WO2008083447A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6167398A (en) * 1997-01-30 2000-12-26 British Telecommunications Public Limited Company Information retrieval system and method that generates weighted comparison results to analyze the degree of dissimilarity between a reference corpus and a candidate document
US20040088157A1 (en) * 2002-10-30 2004-05-06 Motorola, Inc. Method for characterizing/classifying a document
US20040167888A1 (en) * 2002-12-12 2004-08-26 Seiko Epson Corporation Document extracting device, document extracting program, and document extracting method
US6941321B2 (en) * 1999-01-26 2005-09-06 Xerox Corporation System and method for identifying similarities among objects in a collection
US6990628B1 (en) * 1999-06-14 2006-01-24 Yahoo! Inc. Method and apparatus for measuring similarity among electronic documents

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8588531B2 (en) 2009-06-30 2013-11-19 International Business Machines Corporation Graph similarity calculation system, method and program
US9122771B2 (en) 2009-06-30 2015-09-01 International Business Machines Corporation Graph similarity calculation system, method and program
CN110019556A (en) * 2017-12-27 2019-07-16 阿里巴巴集团控股有限公司 A kind of topic news acquisition methods, device and its equipment
CN110019556B (en) * 2017-12-27 2023-08-15 阿里巴巴集团控股有限公司 Topic news acquisition method, device and equipment thereof
CN109145085A (en) * 2018-07-18 2019-01-04 北京市农林科学院 The calculation method and system of semantic similarity

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08700330

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08700330

Country of ref document: EP

Kind code of ref document: A1