WO2008029154A1 - Processing a database - Google Patents

Info

Publication number
WO2008029154A1
Authority
WO
WIPO (PCT)
Prior art keywords
data objects
categorised
categorisation
verification
engine
Prior art date
Application number
PCT/GB2007/003381
Other languages
French (fr)
Inventor
Eric Zigmund Sandler
Yuriy Byurher
Original Assignee
Xploite Plc
Priority date
Filing date
Publication date
Priority claimed from GB0624663A (GB2442284A)
Application filed by Xploite Plc
Publication of WO2008029154A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes

Definitions

  • the present invention relates to a method and system for processing a database, particularly, but not exclusively, processing a database for the creation and/or refinement of a knowledge base.
  • Knowledge bases are collections of processed data that can be used by tools to process new data.
  • Some knowledge bases contain data classified into categories.
  • a web page knowledge base may contain a number of web pages where each page is associated with a category.
  • Data within knowledge bases can be populated manually or automatically.
  • One automated method for building a knowledge base of web pages involves the creation of queries for search engines. Each query is constructed to provide web pages that are likely to belong to a particular category. For example, a query might be "guns OR knives" to produce web pages that fall within a weapons category.
  • a disadvantage with automatically populated knowledge bases is that known automatic methods provide low quality knowledge bases. For example, the query above would return web pages that are related to "kitchen knives" as well as web pages relevant to the weapons category.
  • Knowledge bases can be used by automated methods, such as learning engines, to categorise web pages that have not been categorised before. To provide this ability the knowledge base must be of high quality.
  • a high quality knowledge base is a knowledge base with a very low incidence of incorrectly categorised web pages.
  • a method for processing a database comprising a plurality of data objects associated with one or more categories, including the steps of: i) training a categorisation engine using the training set; ii) categorising data objects within the database using the categorisation engine; iii) reviewing a subset of the categorised data objects to correctly categorise data objects that are incorrectly categorised by the categorisation engine; iv) adding the data objects correctly categorised during the review to the training set; and v) repeating the above steps until an accuracy of categorisation exceeds a predefined threshold.
  • the training set is constructed by extracting a subset from the database.
  • the data objects may be documents such as web pages.
  • the step of reviewing may be performed with the assistance of a human user.
  • the method may include the steps of: deleting the training set from the database; deleting correctly categorised data objects from the database; and deleting the subset from the database; wherein the predefined threshold is an empty database.
  • the training set is validated using a validation method.
  • the validation method may include the steps of: training a second categorisation engine using the training set; categorising training data objects within the training set using the second categorisation engine; reviewing the categorised training data objects to correctly categorise training data objects that are incorrectly categorised by the second categorisation engine; and deleting the training data objects incorrectly categorised by the second categorisation engine from the training set.
  • the validation method may further include the step of adding training data objects correctly categorised during the validation review to a verification set and/or the training set.
  • the method includes the steps of adding the data objects within the subset that are correctly categorised by the categorisation engine to a verification set and verifying the training set using the verification set in accordance with a verification method.
  • the verification method may include the steps of: training a third categorisation engine using the training set; categorising verification data objects within the verification set using the third categorisation engine; reviewing the categorised verification data objects to correctly categorise verification data objects that are incorrectly categorised by the third categorisation engine; adding the verification data objects correctly categorised during the review that were correctly categorised in the verification set to the training set; and repeating the above steps until an accuracy of categorisation exceeds a second predefined threshold.
  • the verification method may also include the steps of validating the learning set in accordance with a validation method, deleting from the verification set verification data objects that are categorised by the third categorisation engine differently to the verification set, and/or adding to the verification set verification data objects that are correctly categorised by the third categorisation engine and incorrectly categorised in the verification set.
  • the second accuracy level may be based on the number of verification data objects that are categorised by the third categorisation engine identically to the verification set.
  • the database may be generated by human input, an automatic method or by a combination of both.
  • the automatic method may include the steps of querying a search engine using a defined query; and receiving content resulting from the query.
  • the content may be in web page form.
  • the defined query may include keywords associated with the category.
  • the keywords may be associated with the category using human input.
  • the defined query may include one or more of combinations, exclusions, and/or pattern matching.
  • the defined query is optimised based on the results from the search engine.
  • the automatic method may include the clustering of data to extract natural categories.
  • the association between the data object and the category may include a weighting.
  • At least one categorisation engine uses the training set and/or the verification set to categorise an input data object.
  • the categorisation engine used in step (i) is a statistical engine and is based on the frequency of a feature, for a plurality of features, per category compared to the frequency of that feature for all categories.
  • the database may be stored across a plurality of heterogeneous computer platforms.
  • the method may further include the steps of: training a plurality of different categorisation engines using the training set; categorising data objects within the database using a combined score generated by each categorisation engine; reviewing a subset of the categorised data objects to correctly categorise data objects that are incorrectly categorised by the categorisation engines; adding the data objects correctly categorised during the review to the training set; and repeating the above steps until an accuracy of categorisation exceeds a third predefined threshold.
  • a method of processing a knowledge base comprising a training set and a verification set, each set including data sets associated with one of a plurality of categories, including the steps of: training a categorisation engine using the training set; categorising verification data objects within the verification set using the categorisation engine; reviewing the categorised verification data objects to correctly categorise verification data objects that are incorrectly categorised by the categorisation engine; adding the verification data objects correctly categorised during the review that were correctly categorised in the verification set to the training set; and repeating the above steps until an accuracy of categorisation exceeds a predefined threshold.
  • Figure 1 shows a schematic diagram illustrating the creation of a knowledge base in accordance with a method of the invention.
  • Figure 2 shows a schematic diagram illustrating the validation of a learning set used in the creation of a knowledge base in accordance with a method of the invention.
  • Figure 3 shows a schematic diagram illustrating the verification of a knowledge base in accordance with a method of the invention.
  • Figure 4 shows a schematic diagram illustrating the validation of a learning set of a knowledge base during verification of the knowledge base in accordance with a method of the invention.
  • the present invention provides a method and system for processing a database, particularly for refining a knowledge base containing a number of imperfectly categorised documents.
  • a subset of the documents is selected to form a training set which trains a categorisation engine.
  • the categorisation engine re-categorises the remaining documents.
  • a subset of these re-categorised documents is reviewed by a user; documents that are incorrectly categorised are correctly categorised by the user and added to the training set.
  • the categorisation engine is retrained on the expanded training set and the process of re-categorisation and reviewing is repeated until the accuracy of the re-categorised documents reaches acceptable levels.
  • the present invention will be described in relation to the processing of a database containing documents.
  • the database may contain any data object such as a file, web-page, or image.
  • categories can include thematic categories such as pornography, weapons, and drugs; structural categories such as shopping and forums; languages; any other taxonomy or any combination of the categories listed.
  • Figure 1 shows a database 1 of documents associated with categories.
  • the database 1 may be an existing database of documents categorised imperfectly by human or automated means.
  • the database 1 may be constructed from an input source such as a search engine or plurality of search engines. Queries may be constructed for obtaining the document from the search engine.
  • the queries may be constructed from keywords determined for each category. There may be between five and fifteen keywords per category.
  • the query may include combinations such as "knives AND weapon", exclusions such as "NOT kitchen knives", and/or patterns such as "*.mp3".
  • the queries may be optimised based on the web content that is initially retrieved.
  • uncategorised documents can be obtained, such as from the internet, and then the documents can be automatically clustered to generate "natural" categories for the corpus.
  • a learning set 2 is also shown (the learning set will also be referred to as the training set).
  • the learning set 2 is a subset, of documents associated with categories, extracted 3 from the database 1.
  • a human user may review the database 1 and create the learning set 2 using documents that they believe are correctly categorised.
  • the learning set 2 may be between thirty and fifty documents in size.
  • the learning set 2 may be deleted from the database 1.
  • a categorisation engine 4 is trained 5 on the learning set 2.
  • the categorisation engine 4 may be a learning engine such as a Bayesian, support vector machine, or statistical learning engine.
  • the trained categorisation engine 4 categorises 6 the documents within the database 1.
  • Documents categorised by the engine 4 will fall within one of two groups: group 7 - documents that are categorised in the same category as previously categorised in the database 1 and group 8 - documents that are categorised in a different category to the database 1.
  • Documents that fall in group 7 may be added to the verification set 14.
  • a subset 9 of the documents falling in group 8 will be extracted 10.
  • the subset may be extracted via automatic means, such as by a random method.
  • the subset 9 of documents will be deleted from the database 1.
  • the subset 9 of documents may be between five and twenty documents in size.
  • a review 11 to correctly categorise each document of the subset 9 will then occur.
  • a human user may be used within the review 11.
  • each document in the subset 9 will fall within one of two groups: group 12 - documents correctly categorised by the engine 4 and group 13 - documents incorrectly categorised by the engine 4.
  • Documents falling within group 12 may be added 15 to the verification set 14.
  • Documents in group 13 will be correctly categorised 16 by the review process and added 17 to the learning set 2.
  • the correctly categorised documents 16 may also be added to the verification set 14.
  • documents falling within group 12 may be added to the database 1 instead of, or in addition to, adding 15 the documents to the verification set 14.
  • the process is then repeated in that the engine 4 is trained on the modified learning set 2 and the modified database 1 is categorised.
  • documents that are correctly categorised are deleted from the database 1 and the process ends when there are no longer any documents within the database 1. It will be appreciated that the process may be repeated until another level of accuracy is reached.
  • the level of accuracy may be determined by the number of documents remaining in the database. For example, the level of accuracy may be reached if there are only five documents remaining in the database. Alternatively, the level of accuracy may be reached by a separate measure such as testing the learning set using another categorisation engine.
  • the learning set 2 may be validated.
  • a purpose of validation is to improve the quality of the data in the learning set by correcting errors caused by the human factor.
  • the learning set 2 may be validated every ten iterations (repetitions).
  • Figure 2 shows how the learning set 20 may be validated according to a method of the invention.
  • a categorisation engine 21 is trained 22 on the learning set.
  • the engine 21 is used to categorise 23 the learning set 20.
  • Documents categorised by the engine 21 will fall into one of two groups: group 24 - documents that are categorised in the same category as in the learning set 20 and group 25 - documents that are categorised in a different category to the learning set 20.
  • a review 27 of the documents in group 25 will then occur to correctly categorise the documents.
  • the review may be undertaken by a human user.
  • Correctly categorised documents 28 may be added 29 to the verification set 30.
  • the correctly categorised documents 28 may also be added to the learning set 20.
  • the learning and verification sets 20 and 30 forming the knowledge base may be processed using a verification method to increase the quality of the knowledge base and thus increase the accuracy of categorisation.
  • the verification method may be used on any knowledge base, which includes a learning set and verification set or from which a learning set and verification set can be extracted, created by other means to refine the quality of that knowledge base.
  • Figure 3 shows an embodiment of the verification method in accordance with the invention.
  • a categorisation engine 40 is trained 41 on the learning set 42.
  • the categorisation engine 40 may be a learning engine such as a Bayesian engine, a support vector machine (SVM), a statistical engine or a neural network.
  • the verification method will be performed by a number of different categorisation engines.
  • the trained categorisation engine 40 is used to categorise 43 the documents within the verification set 44.
  • Documents categorised by the engine 40 will fall within one of two groups: group 45 - documents that are categorised in the same category as in the verification set 44, and group 46 - documents that are categorised in a different category to the verification set 44.
  • a review 48 of the documents in group 46 will occur to correctly categorise the documents.
  • the review may be undertaken by a human user.
  • the correctly categorised documents may fall within one of two groups: group 49 - documents that were originally incorrectly categorised 50 within the verification set 44 and group 51 - documents that were originally incorrectly categorised 52 by the engine 40.
  • Documents falling within group 49 will be added 53 to the verification set 44 and documents falling within group 51 will be added 54 to the learning set 42.
  • the process is then repeated in that the engine 40 is trained on the modified learning set 42 and the modified verification set 44 is categorised, until the number of documents falling within group 46 shows that a required accuracy level has been met.
  • a required accuracy level may be predefined and may be 95%.
  • the accuracy level may be met if the value of (T-N)/T is greater than the accuracy level, where T is the number of documents in the verification set and N is the number of documents in group 46.
  • the resulting learning set is a high quality knowledge base that can be used by categorisation engines to categorise new documents. After categorisation, new documents can also be integrated into the learning set.
  • the learning set 42 may be validated.
  • the learning set 42 may be validated every ten iterations.
  • Figure 4 shows how the learning set 60 may be validated according to a method of the invention.
  • a categorisation engine 61 is trained 62 on the learning set 60.
  • the engine 61 is then used to categorise 63 the learning set 60.
  • Documents categorised by the engine 61 will fall into one of two groups: group 64 - documents that are categorised in the same category as in the learning set 60, group 65 - documents that are categorised in a different category to the learning set 60.
  • Documents that fall within group 65 will be deleted 66 from the learning set 60.
  • a review 67 of the documents in group 65 will occur to correctly categorise the documents.
  • the review may be undertaken by a human user.
  • Correctly categorised documents 68 may be added 69 to the verification set 70 and may also be added to the learning set 60. If the documents were correctly categorised in the learning set 60 but incorrectly categorised by the categorisation engine 61, then the categorisation engine 61 may need to be tuned.
  • where the engine 61 is a neural network, the engine 61 may be tuned by increasing the number of neurons.
  • Bayesian, SVM, statistical and neural network engines can also be tuned by increasing the number of feature vectors (for example, by adding new words to vocabularies).
  • the quality of the knowledge base may be further increased by using multiple categorisation engines.
  • the multiple categorisation engines are trained using the training set. Each engine then produces a score for each document within the database. The scores are combined to produce a combined score and the document is categorised based upon the combined score.
  • a subset of the categorised documents is extracted and reviewed.
  • the engines are retrained on the modified training sets and the process is repeated until an accuracy threshold is met.
  • the embodiment describes the use of a method of the invention to generate a model.
  • the model is used by a set of categorization algorithms A.
  • the purpose of the algorithms is to recognize categories C in arbitrary web documents/content.
  • the model consists of a set of categories C, associated with keywords and/or structural features KS, with documents DL for learning, and with documents DV for testing/verification.
  • the algorithms A use the DL set in order to build a knowledge base.
  • the algorithms A use the DV set for verification of category recognition.
  • an algorithm AO is selected for initial analysis.
  • the algorithm AO should not require any tuning or setup and be able to use raw input web content in order to effectively categorize content.
  • a method described in patent application CATEGORISATION OF DATA USING A MODEL would be appropriate.
  • the set of keywords and/or structural features KS[C] is defined.
  • the keywords use popular terms and words from the category, and structural signatures are used for structural features.
  • Structural signatures are described in patent application CATEGORISATION OF DATA USING STRUCTURAL ANALYSIS. It is not necessary to obtain full coverage for a category; popular terms and/or key structural features are sufficient. For example, five to fifteen keywords for each category would be suitable.
  • a raw set of documents DR is constructed from a web crawler input I. It is preferred that the input sources I provide documents with a normal distribution across the different categories.
  • the input sources I are queried for categories C using keywords/structural features KS[C].
  • the initial raw document set DR[C] is built.
  • the input sources I are queried for the documents that do not belong to the category C and that do not have keywords and/or structural features KS[C].
  • the raw document set DR[other] is built.
  • the raw document set may be automatically extended by analyzing individual documents in each DR[i], finding references or links to other documents, and adding the other documents to the set with the same category.
  • the depth of reference/link analysis may be three to five.
  • the capacity of the set DR[i] is not limited.
  • the number of raw documents in the set can be more than 100,000.
  • the raw document set DR[i] may be reviewed manually and documents that correctly correspond to their category are removed from DR[i] and stored in DL[i]. The same review is made for the DR[other] document set to create the DL[other] set.
  • the initial learning set for each category may include thirty to fifty documents selected during the manual review process.
  • the raw document set for each category DR[C] is processed using the initial learning set DL[C] to improve the quality of the learning set and to build a verification set DV[C].
  • Algorithm AO is trained on the DL[i]:
  • the knowledge base B[i] is created;
  • the knowledge base B[i] is used for recognizing documents from DR[i]: Algorithm AO and knowledge base B[i] are utilized to partition the documents from DR[i] into categories D[i][i] and D[i][other of i].
  • the difference Delta[i] and Delta[other of i] is calculated.
  • the document sets are updated for every category C including other.
  • the raw data set is reduced and only the documents that have not been correctly categorized are kept in it:
  • DR[i] = Delta[i] - DeltaL[i] - DeltaV[i]
  • the learning set may be validated for consistency when the initial quality of data is low: a. Algorithm AO is trained on the DL[i]: i. The knowledge base B[i] is created; b. The knowledge base B[i] is used to recognize documents from DL[i] (e.g. from the same learning set): i. The documents from DL[i] are divided into categories D[i][i] and D[i][other of i]. c. The difference set Delta[i] and Delta[other of i] is calculated.
  • the next step is a verification stage which involves training all the algorithms A on the learning set for each category and then testing the algorithms against the corresponding verification set until the accuracy of the algorithm meets a target threshold THO.
  • the target threshold may be an accuracy level of 95%.
  • Each learning and verification document set has an additional category other that includes all the documents that do not belong to the list of original categories C. Iteratively, the following steps are performed, until the expected level of accuracy is achieved:
  • Algorithms from set A are trained on the DL.
  • the knowledge bases B[A] are used for recognizing documents from DV.
  • the documents from DV are divided in categories D[A][C].
  • the process is ended when the accuracy level has been met.
  • the Delta is reviewed manually.
  • the following parameters are analyzed:
  • a. the algorithm has recognized the document as category C1 (from DV);
  • b. the document has an actual category C2 (after a manual review).
  • c. The following decision is made:
  • the document sets are updated for every category C:
  • the verification set is changed from the manually reviewed documents:
  • any of the methods and systems described above could be implemented in hardware or in software. Where the method or systems of the invention are implemented in software, any suitable programming language such as C++ or Java may be used. It will further be appreciated that data and/or processing involved within the methods and systems may be distributed across more than one computer system. For example, a suitable system for storing data of the database or knowledge base is described in the patent application A DISTRIBUTED FILE SYSTEM. An embodiment of the invention has been used to produce a high quality knowledge base for use in the categorization of web pages. The results of the categorization are provided below.
  • Search engines: query a search engine using the specified keyword and save the links that are returned;
  • the raw data was then processed in accordance with methods of the invention already described.
  • the accuracy level was set to 90% and the set of algorithms
  • the AO algorithm was the statistical method.
  • a categorization engine using the knowledge base produced the following results:
  • Embodiments of the present invention provide the potential advantage of efficiently creating and maintaining high quality knowledge bases which in turn enables classifiers (such as categorization engines) to achieve accuracy of 95-99% in category recognition of new content.
  • Another potential advantage of embodiments of the present invention is that existing low quality knowledge bases of raw documents can be refined to produce high quality knowledge bases.
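The statistical engine mentioned in the list above compares the frequency of a feature within a category with the frequency of that feature across all categories. The following is a minimal sketch of one way such a score could be computed, assuming that features are lower-cased words and that the per-feature ratios are simply summed; the function names `train_statistical`, `score` and `categorise` are hypothetical and are not taken from the application.

```python
from collections import Counter


def train_statistical(training_set):
    """Count word frequencies per category and across all categories."""
    per_category = {}
    overall = Counter()
    for text, category in training_set:
        words = text.lower().split()
        per_category.setdefault(category, Counter()).update(words)
        overall.update(words)
    return per_category, overall


def score(model, text, category):
    """Sum, over the words of the text, the ratio of the word's relative
    frequency in the category to its relative frequency overall."""
    per_category, overall = model
    counts = per_category[category]
    category_total = sum(counts.values())
    overall_total = sum(overall.values())
    total = 0.0
    for word in text.lower().split():
        if overall[word]:  # words never seen in training are ignored
            in_category = counts[word] / category_total
            in_all = overall[word] / overall_total
            total += in_category / in_all
    return total


def categorise(model, text):
    """Return the category with the highest score (ties: alphabetical)."""
    return max(sorted(model[0]), key=lambda c: score(model, text, c))
```

In this sketch a word such as "knives" that is equally frequent in two categories contributes the same amount to both, so the decision rests on words that are characteristic of one category, such as "kitchen" or "guns".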

Abstract

A method for processing a database comprising a plurality of data objects associated with one or more categories, including the steps of: a) training a categorisation engine using the training set; b) categorising data objects within the database using the categorisation engine; c) reviewing a subset of the categorised data objects to correctly categorise data objects that are incorrectly categorised by the categorisation engine; d) adding the data objects correctly categorised during the review to the training set; and e) repeating steps a) to d) until accuracy of categorisation exceeds a predefined threshold.

Description

PROCESSING A DATABASE
Field of Invention
The present invention relates to a method and system for processing a database, particularly, but not exclusively, processing a database for the creation and/or refinement of a knowledge base.
Background
Knowledge bases are collections of processed data that can be used by tools to process new data.
Some knowledge bases contain data classified into categories. For example, a web page knowledge base may contain a number of web pages where each page is associated with a category.
Data within knowledge bases can be populated manually or automatically.
There are a number of populated knowledge bases that have been constructed over time by the collaborative efforts of many people. One such knowledge base, which contains categorised web pages, is known as dmoz.org.
One difficulty with manually populated knowledge bases is consistency. Due to the large number of people providing categorisation, web pages may be inconsistently categorised. Another difficulty with manually constructed knowledge bases for immense data sources, such as the internet, is that they may be vastly deficient in content or require the unprecedented expenditure of resources to generate.
One automated method for building a knowledge base of web pages involves the creation of queries for search engines. Each query is constructed to provide web pages that are likely to belong to a particular category. For example, a query might be "guns OR knives" to produce web pages that fall within a weapons category. A disadvantage with automatically populated knowledge bases is that known automatic methods provide low quality knowledge bases. For example, the query earlier would return web-pages that are related to "kitchen knives" as well as web-pages relevant to the weapons category.
Knowledge bases can be used by automated methods, such as learning engines, to categorise web pages that have not been categorised before. To provide this ability the knowledge base must be of high quality. A high quality knowledge base is a knowledge base with a very low incidence of incorrectly categorised web pages.
There is a desire for a method of refining existing knowledge bases to improve their quality and a method for creating new high quality knowledge bases.
It is an object of the present invention to provide a method for processing a database to produce a high quality knowledge base, and a method for processing a knowledge base to improve its quality, which overcome the disadvantages of the above methods, or to at least provide a useful alternative.
Summary of the Invention
According to a first aspect of the invention there is provided a method for processing a database comprising a plurality of data objects associated with one or more categories, including the steps of: i) training a categorisation engine using the training set; ii) categorising data objects within the database using the categorisation engine; iii) reviewing a subset of the categorised data objects to correctly categorise data objects that are incorrectly categorised by the categorisation engine; iv) adding the data objects correctly categorised during the review to the training set; and v) repeating the above steps until an accuracy of categorisation exceeds a predefined threshold.
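Steps i) to v) can be illustrated by the following minimal sketch. It assumes that data objects are short text strings, that the categorisation engine is a toy word-count model, and that the review step is represented by a `review` callable returning the correct category; `train`, `categorise`, `refine` and all parameter names are hypothetical and are introduced only for illustration.

```python
import random
from collections import Counter, defaultdict


def train(training_set):
    """Toy stand-in engine: count word occurrences per category."""
    model = defaultdict(Counter)
    for text, category in training_set:
        model[category].update(text.lower().split())
    return model


def categorise(model, text):
    """Pick the category whose training words overlap the text most."""
    words = text.lower().split()
    return max(sorted(model), key=lambda c: sum(model[c][w] for w in words))


def refine(database, training_set, review, threshold=0.95,
           sample_size=5, max_rounds=20, seed=0):
    """Repeat train / categorise / review until the reviewed subset
    is categorised with an accuracy above the threshold."""
    rng = random.Random(seed)
    training_set = list(training_set)
    accuracy = 0.0
    for _ in range(max_rounds):
        if not database:
            break
        model = train(training_set)                       # step i
        for text in database:                             # step ii
            database[text] = categorise(model, text)
        subset = rng.sample(sorted(database),             # step iii
                            min(sample_size, len(database)))
        wrong = [(t, review(t)) for t in subset if review(t) != database[t]]
        training_set.extend(wrong)                        # step iv
        accuracy = 1 - len(wrong) / len(subset)
        if accuracy > threshold:                          # step v
            break
    return training_set, accuracy
```

Here the accuracy is estimated only from the reviewed subset, which is one possible reading of step v); a separate verification set, as described further below, is another.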
Preferably, the training set is constructed by extracting a subset from the database.
The data objects may be documents such as web pages.
The step of reviewing may be performed with the assistance of a human user.
The method may include the steps of: deleting the training set from the database; deleting correctly categorised data objects from the database; and deleting the subset from the database; wherein the predefined threshold is an empty database.
Preferably, the training set is validated using a validation method.
The validation method may include the steps of: training a second categorisation engine using the training set; categorising training data objects within the training set using the second categorisation engine; reviewing the categorised training data objects to correctly categorise training data objects that are incorrectly categorised by the second categorisation engine; and deleting the training data objects incorrectly categorised by the second categorisation engine from the training set. The validation method may further include the step of adding training data objects correctly categorised during the validation review to a verification set and/or the training set.
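A minimal sketch of this validation method follows. It assumes the training set is a list of (data object, category) pairs and that the engine and the reviewer are supplied as callables; `validate_training_set`, `train`, `categorise` and `review` are hypothetical names used only for illustration.

```python
def validate_training_set(training_set, train, categorise, review):
    """Train an engine on the training set, re-categorise the same set,
    and separate the objects on which the engine disagrees.

    Returns (kept, reviewed): `kept` is the training set with the
    disagreeing objects deleted; `reviewed` holds those objects with the
    category assigned during the review, ready to be added to a
    verification set and/or back to the training set.
    """
    model = train(training_set)  # the "second" categorisation engine
    kept, reviewed = [], []
    for obj, category in training_set:
        if categorise(model, obj) == category:
            kept.append((obj, category))
        else:
            reviewed.append((obj, review(obj)))
    return kept, reviewed
```

Returning the two lists separately leaves the caller free to choose between the options the text allows: adding the reviewed objects to the verification set, to the training set, or to both.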
Preferably, the method includes the steps of adding the data objects within the subset that are correctly categorised by the categorisation engine to a verification set and verifying the training set using the verification set in accordance with a verification method. The verification method may include the steps of: training a third categorisation engine using the training set; categorising verification data objects within the verification set using the third categorisation engine; reviewing the categorised verification data objects to correctly categorise verification data objects that are incorrectly categorised by the third categorisation engine; adding the verification data objects correctly categorised during the review that were correctly categorised in the verification set to the training set; and repeating the above steps until an accuracy of categorisation exceeds a second predefined threshold.
The verification method may also include the steps of validating the learning set in accordance with a validation method, deleting from the verification set verification data objects that are categorised by the third categorisation engine differently to the verification set, and/or adding to the verification set verification data objects that are correctly categorised by the third categorisation engine and incorrectly categorised in the verification set.
The second accuracy level may be based on the number of verification data objects that are categorised by the third categorisation engine identically to the verification set.
The database may be generated by human input, an automatic method or by a combination of both.
The automatic method may include the steps of querying a search engine using a defined query; and receiving content resulting from the query. The content may be in web page form. The defined query may include keywords associated with the category. The keywords may be associated with the category using human input. The defined query may include one or more of combinations, exclusions, and/or pattern matching. Preferably, the defined query is optimised based on the results from the search engine.
The automatic method may include the clustering of data to extract natural categories. The association between the data object and the category may include a weighting.
It is preferred that, after the accuracy of categorisation has exceeded the predefined threshold, at least one categorisation engine uses the training set and/or the verification set to categorise an input data object.
Preferably, the categorisation engine used in step (i) is a statistical engine and is based on the frequency of a feature, for a plurality of features, per category compared to the frequency of that feature for all categories.
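As a rough illustration of a frequency-based engine of this kind, the following sketch scores a document by comparing how often each feature occurs within a category with how often it occurs across all categories. The function names, the use of whitespace-separated words as features, the logarithmic ratio and the add-one smoothing are all assumptions made for the example; the text above does not specify the exact formula.

```python
from collections import Counter
import math


def train(training_set):
    """training_set: list of (text, category) pairs.
    Returns per-category feature counts and overall feature counts."""
    per_category = {}
    overall = Counter()
    for text, category in training_set:
        words = text.lower().split()
        per_category.setdefault(category, Counter()).update(words)
        overall.update(words)
    return per_category, overall


def score(text, category, per_category, overall):
    """Sum of log ratios: feature frequency in the category
    versus feature frequency over all categories."""
    counts = per_category[category]
    cat_total = sum(counts.values())
    all_total = sum(overall.values())
    vocab = len(overall)
    total = 0.0
    for word in text.lower().split():
        # Add-one smoothing (an assumption) avoids log(0) for unseen features.
        p_cat = (counts[word] + 1) / (cat_total + vocab)
        p_all = (overall[word] + 1) / (all_total + vocab)
        total += math.log(p_cat / p_all)
    return total


def categorise(text, per_category, overall):
    """Return the category with the highest score for the text."""
    return max(per_category, key=lambda c: score(text, c, per_category, overall))
```

A feature that is relatively more frequent in a category than overall contributes a positive term to that category's score, so the highest-scoring category is the one whose typical features dominate the document.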
The database may be stored across a plurality of heterogeneous computer platforms.
To improve accuracy of the knowledge base, the method may further include the steps of: training a plurality of different categorisation engines using the training set; categorising data objects within the database using a combined score generated by each categorisation engine; reviewing a subset of the categorised data objects to correctly categorise data objects that are incorrectly categorised by the categorisation engines; adding the data objects correctly categorised during the review to the training set; and repeating the above steps until an accuracy of categorisation exceeds a third predefined threshold.
According to another aspect of the invention there is provided a method of processing a knowledge base comprising a training set and a verification set, each set including data sets associated with one of a plurality of categories, including the steps of: training a categorisation engine using the training set; categorising verification data objects within the verification set using the categorisation engine; reviewing the categorised verification data objects to correctly categorise verification data objects that are incorrectly categorised by the categorisation engine; adding the verification data objects correctly categorised during the review that were correctly categorised in the verification set to the training set; and repeating the above steps until an accuracy of categorisation exceeds a predefined threshold.
Brief Description of the Drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which:
Figure 1: shows a schematic diagram illustrating the creation of a knowledge base in accordance with a method of the invention.
Figure 2: shows a schematic diagram illustrating the validation of a learning set used in the creation of a knowledge base in accordance with a method of the invention.
Figure 3: shows a schematic diagram illustrating the verification of a knowledge base in accordance with a method of the invention.
Figure 4: shows a schematic diagram illustrating the validation of a learning set of a knowledge base during verification of the knowledge base in accordance with a method of the invention.
Detailed Description of the Preferred Embodiments
The present invention provides a method and system for processing a database, particularly for refining a knowledge base containing a number of imperfectly categorised documents. A subset of the documents is selected to form a training set which trains a categorisation engine. The categorisation engine re-categorises the remaining documents. A subset of these re-categorised documents is reviewed by a user; documents that are incorrectly categorised are correctly categorised by the user and added to the training set. The categorisation engine is retrained on the expanded training set and the process of re-categorisation and reviewing is repeated until the accuracy of the re-categorised documents reaches acceptable levels.
The present invention will be described in relation to the processing of a database containing documents. However, it will be appreciated that the database may contain any data object such as a file, web-page, or image.
It will be appreciated that categories can include thematic categories such as pornography, weapons, and drugs; structural categories such as shopping and forums; languages; any other taxonomy or any combination of the categories listed.
Figure 1 shows a database 1 of documents associated with categories.
The database 1 may be an existing database of documents categorised imperfectly by human or automated means. Alternatively, the database 1 may be constructed from an input source such as a search engine or plurality of search engines. Queries may be constructed for obtaining the documents from the search engine. The queries may be constructed from keywords determined for each category. There may be between five and fifteen keywords per category. The query may include combinations such as "knives AND weapon", exclusions such as "NOT kitchen knives", and/or patterns such as "*.mp3". The queries may be optimised based on the web content that is initially retrieved.
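The construction of queries from category keywords can be sketched as follows. The helper name `build_queries`, its parameters, and the query syntax it emits (upper-case AND/NOT with quoted phrases) are assumptions for illustration only; real search engines each have their own operator syntax.

```python
def build_queries(keywords, required=None, excluded=()):
    """Build one query string per keyword.

    required: optional term combined with the keyword using AND.
    excluded: phrases excluded from the results using NOT.
    The operator syntax is hypothetical, not that of any specific engine.
    """
    queries = []
    for keyword in keywords:
        parts = [keyword]
        if required:
            parts.append(f"AND {required}")
        for phrase in excluded:
            parts.append(f'NOT "{phrase}"')
        queries.append(" ".join(parts))
    return queries


# Example: a combination plus an exclusion, and a bare pattern.
weapon_queries = build_queries(["knives"], required="weapon",
                               excluded=["kitchen knives"])
pattern_queries = build_queries(["*.mp3"])
```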
In an alternative embodiment, uncategorised documents can be obtained, such as from the internet, and then the documents can be automatically clustered to generate "natural" categories for the corpus. A learning set 2 is also shown (the learning set will also be referred to as the training set). The learning set 2 is a subset, of documents associated with categories, extracted 3 from the database 1. A human user may review the database 1 and create the learning set 2 using documents that they believe are correctly categorised. The learning set 2 may be between thirty and fifty documents in size.
The learning set 2 may be deleted from the database 1.
A categorisation engine 4 is trained 5 on the learning set 2. The categorisation engine 4 may be a learning engine such as a Bayesian, support vector machine, or statistical learning engine.
The use of the statistical learning engine, such as described in patent application CATEGORISATION OF DATA USING A MODEL, provides the advantage of being able to process data from low quality knowledge bases.
The trained categorisation engine 4 categorises 6 the documents within the database 1.
Documents categorised by the engine 4 will fall within one of two groups: group 7 - documents that are categorised in the same category as previously categorised in the database 1 and group 8 - documents that are categorised in a different category to the database 1.
Documents that fall in group 7 will be deleted from the database.
Documents that fall in group 7 may be added to the verification set 14.
A subset 9 of the documents falling in group 8 will be extracted 10. The subset may be extracted via automatic means, such as by a random method.
The subset 9 of documents will be deleted from the database 1. The subset 9 of documents may be between five and twenty documents in size.
A review 11 to correctly categorise each document of the subset 9 will then occur. A human user may be used within the review 11.
During the review, each document in the subset 9 will fall within one of two groups: group 12 - documents correctly categorised by the engine 4 and group 13 - documents incorrectly categorised by the engine 4.
Documents falling within group 12 may be added 15 to the verification set 14. Documents in group 13 will be correctly categorised 16 by the review process and added 17 to the learning set 2. The correctly categorised documents 16 may also be added to the verification set 14.
In an alternative embodiment documents falling within group 12 may be added to the database 1 instead of, or in addition to, adding 15 the documents to the verification set 14.
The process is then repeated in that the engine 4 is trained on the modified learning set 2 and the modified database 1 is categorised. During the process documents that are correctly categorised are deleted from the database 1 and the process ends when there are no longer any documents within the database 1. It will be appreciated that the process may be repeated until another level of accuracy is reached. The level of accuracy may be determined by the number of documents remaining in the database. For example, the level of accuracy may be reached if there are only five documents remaining in the database. Alternatively, the level of accuracy may be reached by a separate measure such as testing the learning set using another categorisation engine.
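The loop of Figure 1 can be sketched as below, under several assumptions: documents are represented by strings, the database and sets are dictionaries mapping a document to its category, and `train` and `review` are hypothetical callables standing in for the categorisation engine 4 and the review 11. Documents in group 7 are moved to the verification set here, which is one of the options described above.

```python
import random


def refine(database, learning_set, train, review, sample_size=20, seed=0):
    """database, learning_set: dict mapping document -> category.
    train(learning_set) returns a function document -> category.
    review(document) returns the correct category (e.g. from a human).
    Returns (learning_set, verification_set)."""
    rng = random.Random(seed)
    database = dict(database)
    learning_set = dict(learning_set)
    verification_set = {}
    # The learning set is deleted from the database.
    for doc in learning_set:
        database.pop(doc, None)
    while database:
        engine = train(learning_set)
        predicted = {doc: engine(doc) for doc in database}
        # Group 7: the engine agrees with the database category.
        agree = [d for d in database if predicted[d] == database[d]]
        for doc in agree:
            verification_set[doc] = database.pop(doc)
        # Group 8: the engine disagrees; a random subset is reviewed.
        disagree = sorted(database)
        subset = rng.sample(disagree, min(sample_size, len(disagree)))
        for doc in subset:
            del database[doc]
            correct = review(doc)
            if predicted[doc] == correct:
                verification_set[doc] = correct  # group 12
            else:
                learning_set[doc] = correct      # group 13
    return learning_set, verification_set
```

Each pass removes at least one document from the database, so the loop terminates when the database is empty, which corresponds to the stopping condition described above.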
During the iterations (repetitions) of the process the learning set 2 may be validated. A purpose of validation is to improve the quality of data in the learning set caused by the human factor. The learning set 2 may be validated every ten iterations (repetitions).
Figure 2 shows how the learning set 20 may be validated according to a method of the invention.
A categorisation engine 21 is trained 22 on the learning set. The engine 21 is used to categorise 23 the learning set 20.
Documents categorised by the engine 21 will fall into one of two groups: group 24 - documents that are categorised in the same category as in the learning set 20 and group 25 - documents that are categorised in a different category to the learning set 20.
Documents that fall within group 25 will be deleted 26 from the learning set 20.
A review 27 of the documents in group 25 will then occur to correctly categorise the documents. The review may be undertaken by a human user. Correctly categorised documents 28 may be added 29 to the verification set 30. The correctly categorised documents 28 may also be added to the learning set 20.
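A minimal sketch of the validation of Figure 2 follows. As before, `train` and `review` are hypothetical callables, the sets are dictionaries mapping a document to its category, and the `keep_in_learning` flag is an assumed way of expressing the optional step of returning reviewed documents to the learning set.

```python
def validate(learning_set, verification_set, train, review,
             keep_in_learning=False):
    """Train on the learning set, re-categorise that same set, and
    remove the documents on which the engine disagrees (group 25)."""
    engine = train(learning_set)
    conflicts = [doc for doc, cat in learning_set.items()
                 if engine(doc) != cat]
    # Conflicting documents are deleted from the learning set.
    learning_set = {d: c for d, c in learning_set.items()
                    if d not in conflicts}
    verification_set = dict(verification_set)
    for doc in conflicts:
        correct = review(doc)
        # Reviewed documents may be added to the verification set ...
        verification_set[doc] = correct
        # ... and optionally back to the learning set.
        if keep_in_learning:
            learning_set[doc] = correct
    return learning_set, verification_set
```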
After the level of accuracy is reached, the learning and verification sets 20 and 30 forming the knowledge base may be processed using a verification method to increase the quality of the knowledge base and thus increase the accuracy of categorisation.
It will be appreciated that the verification method may be used on any knowledge base, which includes a learning set and verification set or from which a learning set and verification set can be extracted, created by other means to refine the quality of that knowledge base. Figure 3 shows an embodiment of the verification method in accordance with the invention.
A categorisation engine 40 is trained 41 on the learning set 42. The categorisation engine 40 may be a learning engine such as a Bayesian engine, a support vector machine (SVM), a statistical engine or a neural network.
In one embodiment the verification method will be performed by a number of different categorisation engines.
The trained categorisation engine 40 is used to categorise 43 the documents within the verification set 44.
Documents categorised by the engine 40 will fall within one of two groups: group 45 - documents that are categorised in the same category as in the verification set 44, and group 46 - documents that are categorised in a different category to the verification set 44.
Documents within group 46 are deleted 47 from the verification set 44.
A review 48 of the documents in group 46 will occur to correctly categorise the documents. The review may be undertaken by a human user.
The correctly categorised documents may fall within one of two groups: group 49 - documents that were originally incorrectly categorised 50 within the verification set 44 and group 51 - documents that were originally incorrectly categorised 52 by the engine 40.
Documents falling within group 49 will be added 53 to the verification set 44 and documents falling within group 51 will be added 54 to the learning set 42.
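One round of the verification method of Figure 3 might be sketched as follows. The callables `train` and `review` are hypothetical stand-ins for the engine 40 and the review 48. Where both the engine and the verification set were wrong, this sketch places the document in the learning set; that choice is an assumption, since the description does not address the case.

```python
def verification_round(learning_set, verification_set, train, review):
    """Returns the updated learning set, the updated verification set,
    and N, the number of documents in group 46."""
    engine = train(learning_set)
    predicted = {doc: engine(doc) for doc in verification_set}
    # Group 46: engine category differs from the verification set.
    conflicts = [d for d, c in verification_set.items()
                 if predicted[d] != c]
    learning_set = dict(learning_set)
    verification_set = {d: c for d, c in verification_set.items()
                        if d not in conflicts}
    for doc in conflicts:
        correct = review(doc)
        if predicted[doc] == correct:
            # Group 49: the verification set held the wrong category;
            # the document returns to it with the corrected category.
            verification_set[doc] = correct
        else:
            # Group 51: the engine was wrong; the document joins the
            # learning set so that retraining can correct the engine.
            learning_set[doc] = correct
    return learning_set, verification_set, len(conflicts)
```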
The process is then repeated in that the engine 40 is trained on the modified learning set 42 and the modified verification set 44 is categorised, until the number of documents falling within group 46 shows that a required accuracy level has been met.
A required accuracy level may be predefined and may be 95%. The accuracy level may be met if the value of (T-N)/T is greater than the accuracy level, where T is the number of documents in the verification set and N is the number of documents in group 46.
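The accuracy test is simple arithmetic, sketched here with a hypothetical helper `accuracy_met`. For example, with T = 200 documents in the verification set and N = 8 documents in group 46, (T-N)/T = 0.96, which exceeds a required level of 95%.

```python
def accuracy_met(total, mismatched, required=0.95):
    """Return True when (T - N) / T is greater than the required level.

    total: T, the number of documents in the verification set.
    mismatched: N, the number of documents in group 46.
    """
    if total == 0:
        raise ValueError("the verification set is empty")
    return (total - mismatched) / total > required
```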
When the required accuracy level is met the resulting learning set is a high quality knowledge base that can be used by categorisation engines to categorise new documents. After categorization new documents can also be integrated into the learning set.
During the iterations of the process the learning set 42 may be validated. The learning set 42 may be validated every ten iterations.
Figure 4 shows how the learning set 60 may be validated according to a method of the invention.
A categorisation engine 61 is trained 62 on the learning set 60. The engine 61 is then used to categorise 63 the learning set 60.
Documents categorised by the engine 61 will fall into one of two groups: group 64 - documents that are categorised in the same category as in the learning set 60 and group 65 - documents that are categorised in a different category to the learning set 60.
Documents that fall within group 65 will be deleted 66 from the learning set 60.
A review 67 of the documents in group 65 will occur to correctly categorise the documents. The review may be undertaken by a human user. Correctly categorised documents 68 may be added 69 to the verification set 70 and may also be added to the learning set 60. If the documents were correctly categorised in the learning set 60 but incorrectly categorised by the categorisation engine 61, then the categorisation engine 61 may need to be tuned. Where the engine 61 is a neural network, the engine 61 may be tuned by increasing the number of neurons. For support vector machines (SVM), the internal coefficients can be tuned. Bayesian, SVM, statistical and neural network engines can also be tuned by increasing the number of feature vectors (for example, by adding new words to vocabularies).
In one embodiment of the invention the quality of the knowledge base may be further increased by using multiple categorisation engines.
The multiple categorisation engines are trained using the training set. Each engine then produces a score for each document within the database. The scores are combined to produce a combined score and the document is categorised based upon the combined score.
A subset of categorised documents is extracted and reviewed.
Documents correctly categorised during the review are added to the training set.
The engines are retrained on the modified training sets and the process is repeated until an accuracy threshold is met.
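The combination of scores from several engines can be sketched as below. The description does not state how the scores are combined, so the averaging used here is an assumption, as are the function name and the representation of each engine as a callable returning a score per category.

```python
def categorise_combined(document, engines):
    """engines: list of functions document -> {category: score}.
    Returns the chosen category and the combined score per category."""
    combined = {}
    for engine in engines:
        for category, value in engine(document).items():
            combined[category] = combined.get(category, 0.0) + value
    # Averaging is an assumed combination rule; a weighted sum or a
    # vote would fit the same description.
    averaged = {c: v / len(engines) for c, v in combined.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged
```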
The use of multiple categorisation engines to categorise documents is further described in patent application CATEGORISATION OF DATA USING MULTIPLE CATEGORISATION ENGINES.
One embodiment of the invention will now be described.
The embodiment describes the use of a method of the invention to generate a model. The model is used by a set of categorization algorithms A. The purpose of the algorithms is to recognize categories C in arbitrary web documents/content.
The model consists of a set of categories C, associated with keywords and/or structural features KS, and with documents DL for learning, and with documents DV for testing/verification.
Model = {C, {KS, DL, DV}}
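One possible in-memory representation of this model is sketched below. The class and field names are hypothetical; only the grouping of KS, DL and DV under each category is taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryData:
    keywords: list = field(default_factory=list)      # KS
    learning: list = field(default_factory=list)      # DL
    verification: list = field(default_factory=list)  # DV


@dataclass
class Model:
    # Maps a category name (an element of C) to its KS, DL and DV.
    categories: dict = field(default_factory=dict)

    def add_category(self, name, keywords):
        """Register a category with its keywords and empty DL/DV sets."""
        self.categories[name] = CategoryData(keywords=list(keywords))
        return self.categories[name]
```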
The algorithms A use the DL set in order to build a knowledge base. The algorithms A use the DV set for verification of category recognition.
Among the algorithms A, an algorithm AO is selected for initial analysis. Preferably, the algorithm AO should not require any tuning or setup and be able to use raw input web content in order to effectively categorize content. For example, a method described in patent application CATEGORISATION OF DATA USING A MODEL would be appropriate.
For every category C[i], the set of keywords and/or structural features KS[C] is defined. The keywords use popular terms and words from the category, and structural signatures are used for structural features. Structural signatures are described in patent application CATEGORISATION OF DATA USING STRUCTURAL ANALYSIS. It is not necessary to obtain full coverage for a category; just popular terms and/or key structural features can be used. For example, five to fifteen keywords for each category would be suitable.
In this embodiment a raw set of documents DR is constructed by a web crawler input I. It is preferred that the input I provides documents with a normal distribution across the different categories.
Automatically, the input sources I are queried for categories C using keywords/structural features KS[C]. The initial raw document set DR[C] is built. Automatically, the input sources I are queried for the documents that do not belong to the category C and that do not have keywords and/or structural features KS[C]. The raw document set DR[other] is built.
This results in a subset of DR, DR[i], associated with a category i with some probability (initially a low quality data set for a given category).
DR[i] = { (C[i], DR) }
The raw document set may be automatically extended by analyzing individual documents in each DR[i], finding references or links to other documents, and adding the other documents to the set with the same category. The depth of reference/link analysis may be three to five.
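The depth-limited link expansion can be sketched as a breadth-first traversal. The callable `get_links`, which returns the documents referenced by a given document, is hypothetical; in practice it would fetch and parse the document.

```python
from collections import deque


def expand_by_links(seed_documents, get_links, max_depth=3):
    """Return the seed documents plus every document reachable from
    them by following at most max_depth references or links."""
    seen = set(seed_documents)
    queue = deque((doc, 0) for doc in seed_documents)
    while queue:
        doc, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for linked in get_links(doc):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, depth + 1))
    return seen
```

Documents added in this way inherit the category of the set being extended, so they remain low-confidence raw data until processed by the later steps.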
The capacity of the set DR[i] is not limited. The number of raw documents in the set can be more than 100,000.
The raw document set DR[i] may be reviewed manually and documents that correctly correspond to their category are removed from the DR[i] and stored in DL[i]. The same review is made for the DR[other] document set to create the DL[other] set.
The initial learning set for each category may include thirty to fifty documents selected during the manual review process.
The raw document set for each category DR[C] is processed using the initial learning set DL[C] to improve the quality of the learning set and to build a verification set DV[C].
For each category i, iteratively, the following steps are performed (where j is the number of the iteration), until the DR[i] document set is empty:
1. Algorithm AO is trained on the DL[i]:
   i. The knowledge base B[i] is created;
2. The knowledge base B[i] is used for recognizing documents from DR[i]:
   i. Algorithm AO and knowledge base B[i] are utilized to partition the documents from DR[i] into categories D[j][i] and D[j][other of i].
3. The differences Delta[i] and Delta[other of i] are calculated. The delta reflects all the differences between the previous iteration document category placement and the current re-classification of the raw documents: Delta[i] = DR[i] - D[j][i] (the same is performed for the other category, where (-) is the logical difference between the two sets).
4. A small part of the Delta is reviewed manually (five to twenty examples). The following parameters are analyzed:
   a. Suppose the algorithm has recognized a particular document as category C[1] (from DR[i]);
   b. Suppose, for example, that the document has an actual category C[2] (after manual review).
   c. The following decision is made:
Figure imgf000017_0001
5. The document sets are updated for every category C including other.
   a. The learning set is increased from manually reviewed documents: DL[i] = DL[i] + DeltaL[i]
   b. The verification set is increased from the documents that have been categorized properly and from manually reviewed documents: DV[i] = DV[i] + DeltaV[i]
   c. The raw data set is reduced and only the documents that have not been correctly categorized are kept in it: DR[i] = Delta[i] - DeltaL[i] - DeltaV[i]
6. For every ten iterations, the learning set may be validated for consistency when the initial quality of data is low:
   a. Algorithm AO is trained on the DL[i]:
      i. The knowledge base B[i] is created;
   b. The knowledge base B[i] is used to recognize documents from DL[i] (i.e. from the same learning set):
      i. The documents from DL[i] are divided into categories D[j][i] and D[j][other of i].
   c. The difference sets Delta[i] and Delta[other of i] are calculated. Each includes the documents that have an initial assumption about some category but that have not been recognized as the required category: Delta[i] = DL[i] - D[j][i] (the same is performed for the other category).
   d. The documents from the Delta set are reviewed and the correct category is defined for them (Delta[i] is transposed to Delta'[i], where the documents from the set have the correct category assigned).
   e. The document sets are updated for every category C including other:
      i. Conflicting documents are deleted from the learning document set: DL[i] = DL[i] - Delta[i]
      ii. Conflicting documents are added to the testing document set: DV[i] = DV[i] + Delta'[i]
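The set arithmetic of steps 3 and 5 above can be sketched with Python sets. The function and parameter names are hypothetical. The sets DeltaL and DeltaV are taken as inputs because they result from the decision made in step 4, which is not reproduced in this text.

```python
def update_sets(raw, recognised, learning, verification, delta_l, delta_v):
    """raw: DR[i]; recognised: D[j][i]; learning: DL[i];
    verification: DV[i]; delta_l: DeltaL[i]; delta_v: DeltaV[i].
    Returns the updated (DR[i], DL[i], DV[i])."""
    # Step 3: Delta[i] = DR[i] - D[j][i] (logical set difference).
    delta = raw - recognised
    # Step 5a: DL[i] = DL[i] + DeltaL[i]
    new_learning = learning | delta_l
    # Step 5b: DV[i] = DV[i] + DeltaV[i]
    new_verification = verification | delta_v
    # Step 5c: DR[i] = Delta[i] - DeltaL[i] - DeltaV[i]
    new_raw = delta - delta_l - delta_v
    return new_raw, new_learning, new_verification
```

Because the new raw set is always a subset of the previous one, repeating the steps shrinks DR[i] towards the empty set that ends the iteration.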
The next step is a verification stage which involves training all the algorithms A on the learning set for each category and then testing the algorithms against the corresponding verification set until the accuracy of the algorithm meets a target threshold THO. The target threshold may be an accuracy level of 95%.
Each learning and verification document set has an additional category other that includes all the documents that do not belong to the list of original categories C. Iteratively, the following steps are performed, until the expected level of accuracy is achieved:
1. Algorithms from set A are trained on the DL. The knowledge bases B[A] are created;
2. The knowledge bases B[A] are used for recognizing documents from DV. The documents from DV are divided in categories D[A][C].
3. The difference set Delta[A][C] is calculated. It includes the documents that are in the verification document set for some category but that have not been recognized as the required category: Delta[A][C] = DV - D[A][C].
4. The accuracy level is checked: (T-N)/T > THO (for every category C and algorithm A), where
   a. N is the number of documents in Delta[A][C];
   b. T is the number of documents in DV[C].
The process is ended when the accuracy level has been met.
5. The Delta is reviewed manually. The following parameters are analyzed:
   a. The algorithm has recognized the document as category C1 (from DV);
   b. The document has an actual category C2 (after a manual review).
   c. The following decision is made:
Figure imgf000019_0001
6. The document sets are updated for every category C: a. The learning set is increased from manually reviewed documents: DL = DL + DeltaL b. The verification set is changed from the manually reviewed documents:
DV = DV - Delta + DeltaV
7. For every ten iterations, the learning set may be validated for consistency:
   a. Algorithms from the set A are trained on the DL. The knowledge bases B[A] are created;
   b. The knowledge bases B[A] are used for recognizing documents from DL (i.e. from the same learning set). The documents from DL are divided into categories D[C].
   c. The difference set Delta[C] is calculated. It includes the documents that have an initial assumption about some category but that have not been recognized as the required category: Delta[C] = DL[C] - D[C].
   d. The documents from the Delta set are reviewed and the correct category is defined for them (Delta[C] is transposed to Delta'[C], where the documents from the set have the correct category).
   e. The document sets are updated for every category C including other:
      i. Conflicting documents are deleted from the learning document set: DL[j] = DL[j] - Delta
      ii. Conflicting documents are added to the testing document set: DV[j] = DV[j] + Delta'
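The stopping condition of step 4 above applies to every pairing of algorithm and category. A minimal sketch of that check follows, assuming the sizes of the difference sets and verification sets have already been counted; the function name and the dictionary layout are hypothetical.

```python
def all_accuracies_met(delta_sizes, verification_sizes, threshold=0.95):
    """delta_sizes: {(algorithm, category): N}, the size of each
    difference set. verification_sizes: {category: T}, the size of
    each verification set. Returns True only when (T - N) / T exceeds
    the threshold for every algorithm and every category."""
    for (algorithm, category), n in delta_sizes.items():
        t = verification_sizes[category]
        if t == 0 or (t - n) / t <= threshold:
            return False
    return True
```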
It will be appreciated that any of the methods and systems described above could be implemented in hardware or in software. Where the method or systems of the invention are implemented in software, any suitable programming language such as C++ or Java may be used. It will further be appreciated that data and/or processing involved within the methods and systems may be distributed across more than one computer system. For example, a suitable system for storing data of the database or knowledge base is described in the patent application A DISTRIBUTED FILE SYSTEM. An embodiment of the invention has been used to produce a high quality knowledge base for use in the categorization of web pages. The results of the categorization are provided below.
The following categories were utilized in construction of the knowledge base:
• Languages: English, Spanish, Russian;
• Content categories in English, Russian, and Spanish: Chat & Instant Messaging, Nudity, Pornography, and Weapons.
For comparison purposes a knowledge base was also created by manually collecting high quality documents in specified categories and languages. This process took 24 man/weeks.
Recognition of categories and languages on the manually collected knowledge base by a categorization engine trained on the knowledge base was 80-90% accurate.
However, recognition of categories and languages on new documents obtained from the internet resulted in accuracy of 70% for recognition of language, and accuracy of 40% for recognition of content categories.
To obtain raw data for processing the following set of search engines was defined:
Figure imgf000021_0001
The following set of keywords KS was also defined for the categories:
Figure imgf000021_0002
Figure imgf000022_0001
Figure imgf000023_0001
An application queried the search engines and the internet directory using the categories and keywords. The approach of the application was as follows:
• Search engines: Query a search engine using the specified keyword and save the links that are returned;
• Internet Directories: Query an internet directory using the specified category, language, and keywords, get the list of sites, and download five to ten links from every site.
1,000 to 10,000 raw documents were collected for every language and category. A learning set of documents was manually defined for each category and language (forty to sixty documents in each category and language learning set).
The raw data was then processed in accordance with methods of the invention already described. The accuracy level was set to 90% and the set of algorithms A included Bayesian, Support Vector Machines, a statistical method as described in patent application CATEGORISATION OF DATA USING A MODEL and a categorization engine combination method as described in patent application CATEGORISATION OF DATA USING MULTIPLE CATEGORISATION ENGINES. The AO algorithm was the statistical method.
A categorization engine using the knowledge base produced the following results:
Figure imgf000024_0001
Figure imgf000025_0001
Embodiments of the present invention provide the potential advantage of efficiently creating and maintaining high quality knowledge bases which in turn enables classifiers (such as categorization engines) to achieve accuracy of 95-99% in category recognition of new content. Another potential advantage of embodiments of the present invention is that existing low quality knowledge bases of raw documents can be refined to produce high quality knowledge bases.
While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in considerable detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departure from the spirit or scope of applicant's general inventive concept.

Claims

1. A method for processing a database comprising a plurality of data objects associated with one or more categories, including the steps of: i) training a categorisation engine using a training set; ii) categorising data objects within the database using the categorisation engine; iii) reviewing a subset of the categorised data objects to correctly categorise data objects that are incorrectly categorised by the categorisation engine; iv) adding the data objects correctly categorised during the review to the training set; and v) repeating the above steps until an accuracy of categorisation exceeds a predefined threshold.
2. A method as claimed in claim 1 wherein the training set is constructed by extracting a subset from the database.
3. A method as claimed in any one of the preceding claims wherein the data objects are documents.
4. A method as claimed in any one of the preceding claims wherein the step of reviewing is performed with the assistance of a human user.
5. A method as claimed in any one of the preceding claims including the steps of: deleting the training set from the database; deleting correctly categorised data objects from the database; and deleting the subset of categorised data objects from the database; wherein the predefined threshold is an empty database.
6. A method as claimed in any one of the preceding claims including the step of validating the training set using a validation method.
7. A method as claimed in claim 6 wherein the validation method includes the steps of: training a second categorisation engine using the training set; categorising training data objects within the training set using the second categorisation engine; reviewing the categorised training data objects to correctly categorise training data objects that are incorrectly categorised by the second categorisation engine; and deleting the training data objects incorrectly categorised by the second categorisation engine from the training set.
8. A method as claimed in claim 7 wherein the validation method further includes the step of adding training data objects correctly categorised during the review to a verification set.
9. A method as claimed in any one of claims 7 to 8 wherein the validation method further includes the step of adding training data objects correctly categorised during the review to the training set.
10. A method as claimed in any one of the preceding claims including the step of adding the categorised data objects within the subset that are correctly categorised by the categorisation engine to a verification set.
11. A method as claimed in claim 10 including the step of verifying the training set using the verification set in accordance with a verification method.
12. A method as claimed in claim 11 wherein the verification method includes the steps of: training a third categorisation engine using the training set; categorising verification data objects within the verification set using the third categorisation engine; reviewing the categorised verification data objects to correctly categorise verification data objects that are incorrectly categorised by the third categorisation engine; adding the verification data objects correctly categorised during the review that were correctly categorised in the verification set to the training set; and repeating the above steps until an accuracy of categorisation exceeds a second predefined threshold.
13. A method as claimed in claim 12 wherein the verification method includes the step of validating the learning set in accordance with a validation method.
14. A method as claimed in any one of claims 12 to 13 wherein the verification method includes the step of deleting from the verification set verification data objects that are categorised by the third categorisation engine differently to the verification set.
15. A method as claimed in any one of claims 12 to 14 wherein the verification method includes the step of adding to the verification set verification data objects that are correctly categorised by the third categorisation engine and incorrectly categorised in the verification set.
16. A method as claimed in any one of claims 12 to 15 wherein the second accuracy level is based on the number of verification data objects that are categorised by the third categorisation engine identically to the verification set.
17. A method as claimed in any one of the preceding claims wherein at least most of the database is generated by human input.
18. A method as claimed in any one of claims 1 to 16 wherein at least most of the database is generated by an automatic method.
19. A method as claimed in claim 18 wherein the automatic method includes the steps of querying a search engine using a defined query; and receiving content resulting from the query.
20. A method as claimed in claim 19 wherein the content is in web page form.
21. A method as claimed in any one of claims 19 to 20 wherein the defined query includes keywords associated with the category.
22. A method as claimed in claim 21 wherein the keywords are associated with the category using human input.
23. A method as claimed in any one of claims 19 to 22 wherein the defined query includes one or more of combinations, exclusions, and pattern matching.
24. A method as claimed in any one of claims 19 to 23 wherein the defined query is optimised based on the results from the search engine.
25. A method as claimed in claim 18 wherein the automatic method includes clustering of data to extract natural categories.
26. A method as claimed in any one of the preceding claims wherein the association between the data object and the category includes a weighting.
27. A method as claimed in any one of the preceding claims wherein, when the accuracy of categorisation exceeds the predefined threshold, at least one categorisation engine uses the training set to categorise an input data object.
28. A method as claimed in any one of the preceding claims wherein the categorisation engine used in step (i) is a statistical engine.
29. A method as claimed in claim 28 wherein the statistical engine is based on the frequency of a feature, for a plurality of features, per category compared to the frequency of that feature for all categories.
30. A method as claimed in any one of the preceding claims wherein the database is stored across a plurality of heterogeneous computer platforms.
31. A method as claimed in any one of the preceding claims including the steps of: a) training a plurality of different categorisation engines using the training set; b) categorising data objects within the database using a combined score calculated from scores generated by each categorisation engine; c) reviewing a subset of the categorised data objects to correctly categorise data objects that are incorrectly categorised in relation to the combined score; d) adding the data objects correctly categorised during the review to the training set; and e) repeating the above steps until an accuracy of categorisation exceeds a third predefined threshold.
32. A method as claimed in any one of the preceding claims wherein the data objects are web pages.
33. A method of processing a knowledge base comprising a training set and a verification set, each set including data sets associated with one of a plurality of categories, including the steps of: training a categorisation engine using the training set; categorising verification data objects within the verification set using the categorisation engine; reviewing the categorised verification data objects to correctly categorise verification data objects that are incorrectly categorised by the categorisation engine; adding the verification data objects correctly categorised during the review that were correctly categorised in the verification set to the training set; and repeating the above steps until an accuracy of categorisation exceeds a predefined threshold.
34. A system for processing a database comprising a plurality of data objects associated with one or more categories, including: a memory arranged for storing the database and a training set comprising a subset of the database; and a processor arranged for training a categorisation engine using the training set, categorising data objects within the database using the categorisation engine, reviewing a subset of the categorised data objects to correctly categorise data objects that are incorrectly categorised by the categorisation engine, adding the data objects correctly categorised during the review to the training set and determining whether an accuracy of categorisation exceeds a predefined threshold.
35. A system arranged for effecting the method of any one of claims 1 to 33.
36. A computer program arranged for effecting the method or system of any one of the preceding claims.
37. Storage media arranged for storing a computer program as claimed in claim 36.
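
The iterative method of claims 10, 28, 29 and 34 can be illustrated with a minimal sketch. It is not the applicant's implementation. It assumes that data objects are plain text strings, that features are whitespace-separated lower-case words, and that the per-feature ratios are simply summed into a category score; the claims do not specify any of these. The names FrequencyEngine, refine, review, batch and max_rounds are hypothetical, and review stands in for the human-assisted review step of claim 4.

```python
from collections import Counter, defaultdict


class FrequencyEngine:
    """Toy statistical engine (assumed form): a category scores higher when
    the input's features are more frequent in that category than overall."""

    def train(self, training_set):
        # training_set: iterable of (text, category) pairs
        self.per_category = defaultdict(Counter)
        self.overall = Counter()
        for text, category in training_set:
            features = text.lower().split()
            if not features:
                continue  # skip empty objects so no category has zero size
            self.per_category[category].update(features)
            self.overall.update(features)

    def categorise(self, text):
        total = sum(self.overall.values())
        best, best_score = None, float("-inf")
        for category, counts in self.per_category.items():
            size = sum(counts.values())
            score = 0.0
            for feature in text.lower().split():
                if feature in self.overall:
                    in_category = counts[feature] / size
                    in_all = self.overall[feature] / total
                    score += in_category / in_all  # assumed combination rule
            if score > best_score:
                best, best_score = category, score
        return best


def refine(database, training_set, review, threshold=0.9, batch=2, max_rounds=10):
    """Train, categorise a subset, review it, extend the training set, and
    repeat until accuracy on the reviewed subset exceeds the threshold."""
    training_set = list(training_set)
    verification_set = []
    remaining = list(database)
    engine = FrequencyEngine()
    for _ in range(max_rounds):
        engine.train(training_set)
        subset, remaining = remaining[:batch], remaining[batch:]
        if not subset:
            break
        correct = 0
        for text in subset:
            predicted = engine.categorise(text)
            actual = review(text, predicted)  # reviewer supplies the right category
            if predicted == actual:
                correct += 1
                verification_set.append((text, actual))  # as in claim 10
            else:
                training_set.append((text, actual))  # corrected object joins training set
        if correct / len(subset) > threshold:
            break
    engine.train(training_set)  # final retrain so late corrections are included
    return engine, training_set, verification_set
```

In this sketch, objects the engine categorised correctly go to a verification set, and objects corrected during review go to the training set. The stopping rule measures accuracy only on the most recently reviewed subset, which is one possible reading of "accuracy of categorisation".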
PCT/GB2007/003381 2006-09-07 2007-09-07 Processing a database WO2008029154A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
UA200609644 2006-09-07
UAA200609644 2006-09-07
GB0624663.1 2006-12-11
GB0624663A GB2442284A (en) 2006-09-07 2006-12-11 Processing a database of web pages

Publications (1)

Publication Number Publication Date
WO2008029154A1 (en) 2008-03-13

Family

ID=38728852

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/003381 WO2008029154A1 (en) 2006-09-07 2007-09-07 Processing a database

Country Status (1)

Country Link
WO (1) WO2008029154A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020022956A1 (en) * 2000-05-25 2002-02-21 Igor Ukrainczyk System and method for automatically classifying text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SEBASTIANI F: "Machine Learning in Automated Text Categorization", ACM COMPUTING SURVEYS, ACM, NEW YORK, NY, US, vol. 34, no. 1, March 2002 (2002-03-01), pages 1 - 47, XP002280034, ISSN: 0360-0300 *
WANAS N M ET AL: "Learning aggregation for combining classifier ensembles", NEURAL INFORMATION PROCESSING, 2002. ICONIP '02. PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON NOV. 18-22, 2002, PISCATAWAY, NJ, USA,IEEE, vol. 4, 18 November 2002 (2002-11-18), pages 1729 - 1733, XP010638909, ISBN: 981-04-7524-1 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021154540A1 (en) * 2020-01-28 2021-08-05 Schlumberger Technology Corporation Oilfield data file classification and information processing systems
EP4097603A4 (en) * 2020-01-28 2024-03-20 Services Petroliers Schlumberger Oilfield data file classification and information processing systems
CN111444340A (en) * 2020-03-10 2020-07-24 腾讯科技(深圳)有限公司 Text classification and recommendation method, device, equipment and storage medium
CN111444340B (en) * 2020-03-10 2023-08-11 腾讯科技(深圳)有限公司 Text classification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN108228541B (en) Method and device for generating document abstract
Milios et al. Automatic term extraction and document similarity in special text corpora
US20050102251A1 (en) Method of document searching
US20030004942A1 (en) Method and apparatus of metadata generation
US20040249808A1 (en) Query expansion using query logs
WO2006008733A2 (en) A method for determining near duplicate data objects
WO2007008263A2 (en) Self-organized concept search and data storage method
WO2001031479A1 (en) Context-driven information retrieval
EP1668541A1 (en) Information retrieval
EP3232336A1 (en) Method and device for recognizing stop word
WO2018169597A1 (en) Systems and methods for verbatim -text mining
Labusch et al. Named Entity Disambiguation and Linking Historic Newspaper OCR with BERT.
CN112836029A (en) Graph-based document retrieval method, system and related components thereof
CN115795061B (en) Knowledge graph construction method and system based on word vector and dependency syntax
Lee et al. Annotating multiple types of biomedical entities: a single word classification approach
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
JP2004334766A (en) Word classifying device, word classifying method and word classifying program
WO2008029154A1 (en) Processing a database
GB2442286A (en) Categorisation of data e.g. web pages using a model
CN111723179B (en) Feedback model information retrieval method, system and medium based on conceptual diagram
CN111339287B (en) Abstract generation method and device
Azmi et al. Relevance feedback using genetic algorithm on information retrieval for indonesian language documents
CN113609247A (en) Big data text duplicate removal technology based on improved Simhash algorithm
GB2442284A (en) Processing a database of web pages

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 07804181; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 07804181; Country of ref document: EP; Kind code of ref document: A1)