WO2008083447A1

WO2008083447A1 - Method and system of obtaining related information

Info

Publication number: WO2008083447A1
Application number: PCT/AU2008/000032
Authority: WO
Inventors: Jason Nicholas Polites
Original assignee: Synetek Systems Pty Ltd
Priority date: 2007-01-12
Filing date: 2008-01-11
Publication date: 2008-07-17

Abstract

A method and system of obtaining related documents from a body of documents comprising the steps of locating a source document, applying a similarity function to some or all of the documents in order to determine documents related to the source document, forming a graph of all related documents, determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph, normalizing the term frequencies to form a single normalized frequency map, and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.

Description

"Method and system of obtaining related information"

Cross-Reference to Related Applications

The present application claims priority from Australian Provisional Patent Application No 2007900155 filed on 12 January 2007, the content of which is incorporated herein by reference.

Field of the Invention

This invention relates to a method and system of obtaining related information and more particularly relates to a method and system of obtaining related information from a large volume of electronic or text based data.

Background of the Invention

Locating and obtaining information from a large corpus or body of unstructured data is usually done through standard search techniques, which in turn are underpinned by indexing systems, or through the introduction of manual and automated meta-data assignments. In those situations where the application of meta-data is impractical or impossible, the only search solution is a standard search underpinned by an indexing system. A particular problem of being limited to such a search is that it relies on the user having some knowledge of what they are actually searching for. Thus, for example, the user may use a combination of keywords and Boolean operators to narrow the field of the search and thereby extract the most relevant documents. While this often works, there are several situations where information relevant to the user's search exists within the body of unstructured data but does not appear in the search results because it does not contain the specific search terms or combination of keywords used. Thus, where the user does not have any knowledge of the particular subject matter, there is no possibility of retrieving documents relevant to a source document.

The present invention seeks to overcome one or more of the above disadvantages by providing a method and system that automatically obtains or discovers all information related to a given topic or subject matter. The topic or the subject matter generally refers to a source document or documents or a free text query entered by a user.

Summary of the Invention

According to a first aspect of the invention there is provided a method of obtaining related documents from a body of documents comprising the steps of: locating a source document; applying a similarity function to some or all of the documents in order to determine documents related to the source document; forming a graph of all related documents; determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalizing the term frequencies to form a single normalized frequency map; and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.

Preferably the similarity function performs a first iteration to retrieve a first row of related documents to the source document. A further iteration using the similarity function may be performed on each retrieved related document in the first row to obtain further related documents to each related document in the first row.

In performing each iteration, a threshold of minimum similarity may be applied to each retrieved document so that those documents not meeting the threshold are not used in a current iteration or subsequent iterations.

The master frequency map may contain all document term frequencies such that the map represents the total of the frequencies in each of the term frequencies obtained from the graph. The normalizing step may comprise using a statistical method, such as a median operation or a mean operation. The comparing step may be performed using a statistical probability function that measures how closely a document in the graph represents the normalized map. A threshold may be used in the comparing step.

The documents obtained as a result of the comparing step may be ranked in order of how closely each document resembles the normalized map.

According to a second aspect of the invention there is provided a system for obtaining related documents from a body of documents comprising: memory means for storing the body of documents; processor means for: applying a similarity function to some or all of the documents in order to determine documents related to a source document; forming a graph of all related documents; determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalizing the term frequencies to form a single normalized frequency map; and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.

According to a third aspect of the invention there is provided computer program means for instructing a processor to obtain related documents from a body of documents, wherein after a source document has been determined, the program means instructs the processor to: apply a similarity function to some or all of the documents in order to determine documents related to the source document; form a graph of all related documents; determine the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalize the term frequencies to form a single normalized frequency map; and compare each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.

Brief Description of the Drawings

A preferred embodiment of the invention will hereinafter be described, by way of example only, with reference to the drawings in which:

Figure 1 is a block diagram showing how a similarity function enables documents related to a source document to be obtained;

Figure 1A is a block diagram showing a graph of related documents;

Figure 2 is a block diagram showing the formation of a master frequency map; and

Figure 3 is a flow diagram showing the operation of a computer program to implement the present invention according to an embodiment.

Detailed Description of the Preferred Embodiment

Referring to Figure 1 there is a block diagram showing a sequence of documents related to a source document found via a process, implemented through software, using a similarity function. Specifically in Figure 1 a source document 10 is located, and this may be for example an email about information on which the user wishes to search for related documents. A similarity function is then implemented through the software to determine a first row 12 of related documents 14, 16 and 18. This is the first stage of the process whereby the similarity function determines a percentile figure on the similarity of a document to the source document 10, based on their content. On finding the related document 14, a further search is performed using the similarity function for other documents that are related to the document 14. Thus on a repetition of the process it uncovers related documents 20, 22 and 24 that are all regarded as relevant to document 14. Similarly, the similarity function is performed to find documents related to document 16 and this iteration uncovers documents 26, 28 and 30 as being related to document 16. Thus a second row 32 has now been formed of documents related to those that were uncovered in the first row 12 of the process. The whole process can be repeated any number of times, but typically it would only extend to one or two levels of indirection beyond the source document 10.

All of the documents obtained through this process form a graph of documents which are considered "related" to the source document 10. Shown in Figure 1A is the graph 35 of related documents. Links 36 and 37 (dotted lines) show documents having unknown similarity, whilst all other links (full line) connect documents having a known similarity. At this stage the graph does not, however, necessarily provide any accuracy in determining the similarity of the uncovered documents to the source document or to other documents in the graph. This determination is made during the next phase in the process.

Documents that are obtained after the execution of the similarity function are subject to elimination from current or subsequent phases of execution, for example between stages 1 and 2, on the basis of a similarity threshold 34. This defines the minimum similarity to the source document 10 required in order to participate in current or future phases of execution. The value of the threshold is set by the user, with the general rule that the lower the threshold, the more results are returned, but with lower accuracy. In Figure 1, related document 18 is not considered to have the minimum similarity, being below threshold 34, and is thus not considered for the iteration that leads to row 32.
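By way of illustration only, the iterative expansion and threshold elimination described above might be sketched as follows. The function name `expand_related`, its parameters and the `max_depth` limit are hypothetical and do not appear in the specification. The sketch also assumes that the threshold is applied to the similarity between a candidate document and the document from which it was reached, which is one possible reading of the threshold 34.

```python
from collections import deque
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Graph = Dict[Hashable, List[Tuple[Hashable, float]]]


def expand_related(
    source: Hashable,
    corpus: Sequence[Hashable],
    similarity: Callable[[Hashable, Hashable], float],
    threshold: float,
    max_depth: int = 2,
) -> Graph:
    """Breadth-first expansion from a source document (hypothetical sketch).

    Returns a mapping from each retained document to the list of
    (related document, similarity) links found from it.
    """
    graph: Graph = {source: []}
    frontier = deque([(source, 0)])
    while frontier:
        doc, depth = frontier.popleft()
        if depth >= max_depth:
            continue  # do not search beyond the configured number of rows
        for other in corpus:
            if other == doc or other == source:
                continue
            score = similarity(doc, other)
            if score < threshold:
                continue  # eliminated: takes no part in this or later rows
            graph[doc].append((other, score))
            if other not in graph:
                graph[other] = []
                frontier.append((other, depth + 1))
    return graph
```

With `max_depth=2` the sketch produces the first row 12 and the second row 32 of Figure 1. A document that falls below the threshold, such as document 18, is never placed on the frontier, so nothing is retrieved through it.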

The similarity function may be any function which produces a probability or similar indication of two documents being related. In practice a method called the cosine rule is used, which analyses term vectors within the written content of the documents being compared to the source document 10. It is a well understood method of determining document similarity.

On obtaining the graph of related documents as discussed with regard to Figure 1, the next phase involves a determination of the context of the graph. This refers to the general subject matter of the information obtained via the similarity function in the previous phase of the process. In the case of textual data this is represented by term frequencies. Each document consists of a number of terms which are typically words. Each term in the document, excluding the stop terms, is counted, and several terms usually occur more than once. The more instances of a term appearing, the greater its importance in defining the context of the graph.

This context definition can be improved through the use of term correlation and/or natural language analysis in determining similarity. The present implementation uses the cosine rule; however, this makes no distinction between the various components of structured data. There is an assumption made by the similarity function that all data in a document is of equal relevance as to the "meaning" of the document. In some cases it is possible to improve on this. As an example, the subject line of an email may have more information about the content of the email than the actual text in the body of the message. Therefore, two emails with the same subject line could be thought of as more related, or more strongly correlated, than two emails with similar content. In semi-structured data, some data elements have a greater impact on similarity than other elements.

Natural language analysis can be implemented, which refers to the ability to infer the context from the content of the document by understanding the semantics and vocabulary of the language used. As an example, if a document contains the phrase "house for sale" this will relate to another phrase "home on the market". Even though the words are different, the meaning conveyed is similar and this will be determined among documents using natural language analysis. The analysis therefore provides a greater accuracy in determining the similarity of two documents.
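As a minimal sketch, the counting of terms and the cosine comparison of two term vectors might look as follows. The tokenisation rule, the short `STOP_TERMS` list and the function names are assumptions made for illustration and are not taken from the specification.

```python
import math
import re
from collections import Counter

# Illustrative stop terms only; a practical list would be far longer.
STOP_TERMS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"})


def term_frequencies(text: str) -> Counter:
    """Count each term in the text, excluding stop terms."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in terms if t not in STOP_TERMS)


def cosine_similarity(tf_a: Counter, tf_b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(count * tf_b[term] for term, count in tf_a.items() if term in tf_b)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)
```

Because the comparison is made on exact terms, "house for sale" and "home on the market" share no counted terms and score 0.0 under this sketch, which is the limitation that natural language analysis is intended to address.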

On obtaining the term frequencies or the number of terms that appear in each document, these are compiled into a single master term frequency map, as shown in Figure 2. This is used to determine the context of each document, for example each email. This master term frequency map 40 attempts to represent the context of the entire graph that resulted from analysis by the similarity function. It does this not for a single document but for the entire collection of documents. Thus the master frequency map 40 contains the aggregation of all document term frequencies such that it represents the total sum of the frequencies in each of the term frequencies obtained from the graph. It enables an understanding of the topic or context of all documents in the graph. Each document is then compared to the aggregate context in order to determine its relevance to the search as a whole. The document frequency map 42 compiles the term frequencies, for example 44 in related document 46, for each of the related documents 46, 48, 50 and 52. These are subsequently compiled into the single master term frequency map 40 and then each related document is compared to the aggregate context in the master frequency map 40 to determine the relevance to the whole search.
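One way the aggregation into the master term frequency map 40 might be sketched is shown below. The function name and the use of a dictionary keyed by document identifier are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Hashable


def master_frequency_map(doc_maps: Dict[Hashable, Counter]) -> Counter:
    """Sum the per-document term frequencies into one master map."""
    master: Counter = Counter()
    for tf in doc_maps.values():
        master.update(tf)  # Counter.update adds counts rather than replacing them
    return master
```

Each entry of the result is the total number of times a term occurs across all documents in the graph.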

Once the master frequency map 40 has been computed, the term frequencies are normalized to represent a typical profile of a document in the graph. This is done through the application of a simple statistical method such as a median operation or a mean operation. The end result of this operation is a single frequency map that defines the overall context of information in the graph.
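The normalization might be sketched as below, under the assumption that the mean or median is taken per term across the documents in the graph, with a document that lacks a term contributing a count of zero. The specification does not fix these details, and the function name is hypothetical.

```python
from statistics import mean, median
from typing import Dict, Hashable, Mapping


def normalized_frequency_map(
    doc_maps: Dict[Hashable, Mapping[str, int]], method: str = "mean"
) -> Dict[str, float]:
    """Build a typical per-document term profile from per-document counts."""
    reducer = {"mean": mean, "median": median}[method]
    docs = list(doc_maps.values())
    terms = set().union(*docs)  # every term seen anywhere in the graph
    return {
        term: float(reducer([tf.get(term, 0) for tf in docs]))
        for term in terms
    }
```

Under this reading the mean of a term equals its master map total divided by the number of documents, while the median suppresses terms that appear in fewer than half of the documents.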

After frequency normalization has occurred, the next phase in the process, again implemented through software, is the condensing of the graph into those documents that are closely related to the context defined for the graph. Each document in the graph is compared to the normalized frequency map using a statistical probability function. This function can be any statistical probability method that measures how closely a document in the graph represents the normalized context. In practice, the default implementation uses a Chi-squared "best fit" method in order to determine the final similarity. The condensing of the graph also uses a threshold, which is not necessarily the same as the threshold used earlier, to eliminate documents that do not closely resemble the normalized map.
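The specification does not give the form of the Chi-squared comparison, so the following is only one possible sketch. It assumes a Pearson chi-squared statistic in which the normalized map is rescaled to the length of the document to give expected counts, so that a lower value indicates a closer fit. The names `chi_squared_distance`, `condense_and_rank` and `max_distance` are hypothetical.

```python
from typing import Dict, Hashable, List, Mapping, Tuple


def chi_squared_distance(doc_tf: Mapping[str, float], profile: Mapping[str, float]) -> float:
    """Pearson chi-squared statistic of a document against the profile."""
    doc_total = sum(doc_tf.values())
    profile_total = sum(profile.values())
    if doc_total == 0 or profile_total == 0:
        return float("inf")  # nothing to compare
    scale = doc_total / profile_total  # expected counts for a document this long
    stat = 0.0
    for term, expected_raw in profile.items():
        expected = expected_raw * scale
        if expected > 0:  # terms with zero expected count are skipped
            observed = doc_tf.get(term, 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def condense_and_rank(
    doc_maps: Dict[Hashable, Mapping[str, float]],
    profile: Mapping[str, float],
    max_distance: float,
) -> List[Tuple[Hashable, float]]:
    """Drop documents beyond the threshold and rank the rest, closest first."""
    scored = [(doc_id, chi_squared_distance(tf, profile)) for doc_id, tf in doc_maps.items()]
    kept = [pair for pair in scored if pair[1] <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])
```

Here `max_distance` plays the role of the second threshold, and the sorted output corresponds to a ranking by resemblance to the normalized map.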

The final stage of the process is to rank the documents in order of how closely they resemble the normalized map. This results in the final list of documents that are considered to be "related" to the original source document 10.

As an example, large organisations can have an archive of many thousands of documents. A server of the business may keep copies of all documents that are transmitted to, from and within an organisation. This can be a disadvantage when a simple or detailed keyword search is conducted, as there will still be a lot of documents provided in the search results. Furthermore, the returned results will only match the input keyword or combination of keywords. This is a serious problem when the user conducting the search does not know exactly what he/she is searching for. A particular example is a legal discovery process. If a business was obliged to produce all evidence pertaining to a given matter, most of this may be confined to electronic documents. Where the documents span a long period of time and are numerous, the process is made even more difficult, expensive and prone to error.

The present invention enables an automated document retrieval system. All electronic documents, such as emails, within a business are copied and indexed, including attachments, to provide detailed search capabilities. All mailboxes in the business's email archive are able to be searched. Once an email that appears to relate to the intended search criteria has been located, the system, through the software, locates all documents related to this email using the above described method across the whole business archive.

Once a source document 10, for example a relevant email, is determined at step 62 of Figure 3, the server or a processor within the server is instructed by the software to use a similarity function in order to retrieve all of the documents, including emails, within the business that are related to the source document. This is done at step 64. Once all of the related documents that are above a predefined threshold have been found at step 66, a graph is formed of the related documents at step 68.

In order to determine the context of the graph, term frequencies are obtained at step 70 from each of the documents and each term within each document is counted. Once this is done, the term frequencies in each document or email are compiled into a single master term frequency map at step 72 in order to determine the context of each document. At step 74 the software applies a normalization function to each of the term frequencies across all of the related documents in the graph. The result of this normalization represents a typical profile of a document in the graph. Thus a single frequency map is derived at step 76 which defines the overall context of information in the graph. At step 78 the graph is condensed into those documents that are closely related to the context defined for the graph. Thus each document in the graph is compared to the normalized single frequency map in order to find the documents most closely related to the context defined for the graph.

Finally at step 80 the retrieved documents that are found as being the most closely related to the context defined for the graph are ranked in order. Thus the order of the documents represents how closely each document resembles the normalized single frequency map and results in a final list of documents that are related to the original source document.

An improvement is to include meta-data fields, if available, to better describe the context of a document. For example, the title or subject of a document may contain more contextual information than its content. In such cases term frequencies obtained from these meta-data fields can be boosted or reduced based on how important the meta-data field is to the context of the document. A simple solution is to artificially increase or decrease the frequency of terms contained within the relevant meta-data elements.
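A simple form of this adjustment might be sketched as follows, assuming that each meta-data field has its own term frequency map and a user-chosen multiplier. The function name, the field names and the weight values are hypothetical.

```python
from collections import Counter
from typing import Dict, Mapping


def weighted_term_frequencies(
    field_maps: Dict[str, Mapping[str, float]],
    field_weights: Mapping[str, float],
) -> Counter:
    """Combine per-field term counts, scaling each field by its weight."""
    combined: Counter = Counter()
    for field, tf in field_maps.items():
        weight = field_weights.get(field, 1.0)  # unlisted fields are left unchanged
        for term, count in tf.items():
            combined[term] += count * weight
    return combined
```

For example, a weight of 3.0 on a subject field makes one occurrence of a term in the subject line count as much as three occurrences in the body, while a weight below 1.0 reduces the influence of a field.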

The normalized frequency map obtained previously may be used as a source document for a second iteration of the entire process or for further iterations. That is, the normalized frequency map is treated as a new source document and the process is conducted using this map as the source. The results obtained from such a second or subsequent pass would then either replace or be merged with the results obtained during the first iteration to retrieve related documents.

Furthermore, an improvement to the invention would include adjusting term frequencies obtained during the initial phases of the search by a factor reflective of the initial similarity of the document from which the terms were obtained. That is, documents discovered during the initial phase would have their term frequencies adjusted either up or down based on how similar they are to the source document or their parent document in the graph.

It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the scope of the invention as broadly described.

The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

Claims

CLAIMS:

1. A method of obtaining related documents from a body of documents comprising the steps of: locating a source document; applying a similarity function to some or all of the documents in order to determine documents related to the source document; forming a graph of all related documents; determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalizing the term frequencies to form a single normalized frequency map; and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.

2. A method according to claim 1 wherein the similarity function performs a first iteration to retrieve a first row of related documents to the source document.

3. A method according to claim 2 wherein the similarity function determines a percentile figure on the similarity of a document to the source document based on the content of the document and the source document.

4. A method according to claim 2 or claim 3 wherein a further iteration using the similarity function is performed on each retrieved related document in the first row to obtain further related documents to each related document in the first row.

5. A method according to any one of claims 2 to 4 further comprising applying a threshold of minimum similarity to each retrieved document in each iteration, so that those documents not meeting the threshold are not used in a current iteration or subsequent iterations.

6. A method according to any one of the previous claims wherein the master frequency map contains all document term frequencies such that the map represents the total of the frequencies in each of the term frequencies obtained from the graph.

7. A method according to any one of the previous claims wherein the normalizing step uses a statistical method including either a median operation or a mean operation.

8. A method according to any one of the previous claims wherein the comparing step is performed using a statistical probability function that measures how closely a document in the graph represents the normalized frequency map.

9. A method according to claim 8 wherein a threshold is used in the comparing step to eliminate any documents that do not closely resemble the normalized frequency map.

10. A method according to any one of the previous claims further comprising ranking the documents obtained as a result of the comparing step in order of how closely each document resembles the normalized frequency map.

11. A method according to any one of the previous claims wherein the determining step further comprises using term correlation and/or natural language analysis.

12. A method according to any one of claims 3 to 11 wherein the normalized frequency map is used as a source document for iterations subsequent to the first iteration.

13. A method according to any one of the previous claims further comprising adjusting one or more determined term frequencies of a document based on the degree of similarity of the document to the source document or a parent document in the graph.

14. A system for obtaining related documents from a body of documents comprising: memory means for storing the body of documents; processor means for: applying a similarity function to some or all of the documents in order to determine documents related to a source document; forming a graph of all related documents; determining the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalizing the term frequencies to form a single normalized frequency map; and comparing each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.

15. A system according to claim 14 wherein the similarity function performs a first iteration to retrieve a first row of related documents to the source document.

16. A system according to claim 15 wherein the similarity function determines a percentile figure on the similarity of a document to the source document based on the content of the document and the source document.

17. A system according to claim 15 or claim 16 wherein a further iteration using the similarity function is performed on each retrieved related document in the first row to obtain further related documents to each related document in the first row.

18. A system according to any one of claims 15 to 17 wherein the processor means applies a threshold of minimum similarity to each retrieved document in each iteration, so that those documents not meeting the threshold are not used in a current iteration or subsequent iterations.

19. A system according to any one of claims 14 to 18 wherein the master frequency map contains all document term frequencies such that the map represents the total of the frequencies in each of the term frequencies obtained from the graph.

20. A system according to any one of claims 14 to 19 wherein the normalizing step undertaken by the processor means uses a statistical method including either a median operation or a mean operation.

21. A system according to any one of claims 14 to 20 wherein the comparing step undertaken by the processor means is performed using a statistical probability function that measures how closely a document in the graph represents the normalized frequency map.

22. A system according to claim 21 wherein a threshold is used in the comparison to eliminate any documents that do not closely resemble the normalized frequency map.

23. A system according to any one of claims 14 to 22 wherein the processor means ranks the documents obtained as a result of the comparing step in order of how closely each document resembles the normalized frequency map.

24. A system according to any one of claims 14 to 23 wherein the determining step undertaken by the processor means further comprises using term correlation and/or natural language analysis.

25. A system according to any one of claims 16 to 24 wherein the normalized frequency map is used as a source document for iterations subsequent to the first iteration.

26. A system according to any one of claims 14 to 25 wherein the processor means adjusts one or more determined term frequencies of a document based on the degree of similarity of the document to the source document or a parent document in the graph.

27. Computer program means for instructing a processor to obtain related documents from a body of documents, wherein after a source document has been determined, the program means instructs the processor to: apply a similarity function to some or all of the documents in order to determine documents related to the source document; form a graph of all related documents; determine the context of the graph by forming a master frequency map of term frequencies obtained from the documents in the graph; normalize the term frequencies to form a single normalized frequency map; and compare each document in the graph to the normalized frequency map to obtain documents that are most closely related to the source document.

28. Computer program means according to claim 27 for instructing the processor to undertake any one or more of method claims 2 to 13.