WO2013010557A1 - Method and system for data mining a document. - Google Patents

Method and system for data mining a document.

Info

Publication number
WO2013010557A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
contextual
module
hyperlink
data
Prior art date
Application number
PCT/EP2011/003590
Other languages
French (fr)
Inventor
Miguel De Vega Rodrigo
Adolfo Sanchez-Barbudo Herrera
Original Assignee
Miguel De Vega Rodrigo
Adolfo Sanchez-Barbudo Herrera
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Miguel De Vega Rodrigo, Adolfo Sanchez-Barbudo Herrera
Priority to PCT/EP2011/003590
Publication of WO2013010557A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/137 Hierarchical processing, e.g. outlines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/134 Hyperlinking

Definitions

  • the invention belongs to the field of data mining.
  • Data mining (sometimes called data or knowledge discovery) is the automatic analysis and extraction of useful or relevant information from raw data.
  • Data mining solutions can help humans extract relevant information from electronic documents faster and with less cognitive effort. They can automatically process document contents and extract data that may potentially contain relevant information for the user according to a context. Data mining typically involves at least two basic tasks: Clustering and classification.
  • Clustering is the task of discovering parts, groups or structures in the data that are in some way or another "similar", based on a certain division criterion.
  • Classification is the task of assigning a degree of contextual relevance to the groups, structures or parts generated by the clustering task according to a context.
  • US 7451099 discloses various techniques for generating markup information to be displayed on a client computer system. Web page document contents are analysed for selected keywords from a keyword list. Matching words are converted into a link of any designation. This allows an information distributor (e.g., an Advertiser) to add links to a web page that direct users to specific web pages and/or present relevant offers (banners, rich media).
  • the purpose of this invention is not to help humans extract relevant information from electronic documents faster and with less cognitive effort, but it uses data mining techniques that could be adapted for that purpose. In that case, the solution presents several disadvantages.
  • the parts created by the clustering task in this data mining solution are simply the words in the document. This also means that there is not a typology of parts that allows processing their contents differently depending on their type.
  • the classification task uses rather advanced mechanisms, like negative keyword match and fuzzy search. But the context it uses consists of a list of keywords not introduced by the reader of the web page, which necessarily means that it is not related to her/his context, but rather to that of the Advertiser.
  • Some Internet filters prevent the access to web page documents depending on a user concern, such as security or the suitability for children.
  • Two filters related to security are Web Sense Filter (http://www.websense.com/content/WebFilter.aspx), and Microsoft's SmartScreen filter (http://www.microsoft.com/security/filters/smartscreen.aspx).
  • Two filters related to the suitability of web contents for children are parental controls, such as K9 Web Protection (http://www1.k9webprotection.com), and Net Nanny (http://www.netnanny.com).
  • These filters may be understood as a data mining solution that extracts information (i.e., the suitability of the document depending on a user concern) relevant to the user from the web page document.
  • This solution presents several disadvantages. On the one hand, there is only one part created by the clustering task in this data mining solution that comprises the whole web page document. This also means that there is not a typology of parts that allows processing their contents differently depending on their type. On the other hand, the classification task just contemplates two possible degrees of contextual relevance; either the document is suitable for the user or not.
  • The proposed method performs:
  • A dividing step that divides the content of the document to be mined into a plurality of parts to be individually analyzed, based on a division criterion for defining parts.
  • An extracting step that extracts data from at least one part.
  • An analyzing step that analyzes the at least one part by evaluating, for the extracted data, at least one contextual condition associated to the user context.
  • An assigning step that assigns a degree of contextual relevance to the at least one analyzed part according to the fulfilment of the at least one evaluated contextual condition.
  • A modifying step that modifies the appearance on screen of the at least one analyzed part in the document to be mined according to the assigned degree of contextual relevance.
  • The system may comprise the following modules:
  • A dividing module that divides a content of the document to be mined into a plurality of parts to be individually analyzed, based on a division criterion for identifying parts.
  • An extracting module that extracts data from at least one part.
  • An analyzing module that analyzes the at least one part by evaluating, for the extracted data, at least one contextual condition associated to the user context.
  • An assigning module that assigns a degree of contextual relevance for the at least one analyzed part according to the fulfilment of at least one contextual condition associated to the user context.
  • A modifying module that modifies the appearance of the at least one analyzed part in the document to be mined according to the assigned degree of contextual relevance.
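The chain of modules listed above can be pictured with a minimal sketch. It is only an illustration under stated assumptions, not the claimed implementation: the names `Part`, `divide`, `extract`, `analyse`, `assign`, `modify` and `mine` are hypothetical, the division criterion is assumed to be "one text part per paragraph", the degree of relevance is assumed to be the fraction of fulfilled conditions, and highlighting is simulated by wrapping the text in asterisks.

```python
from dataclasses import dataclass
from typing import Callable, List

# A contextual condition is modelled as a predicate over extracted data
# (hypothetical representation: any callable returning True or False).
Condition = Callable[[str], bool]


@dataclass
class Part:
    kind: str               # e.g. "text" or "hyperlink"
    content: str
    relevance: float = 0.0  # degree of contextual relevance


def divide(document: str) -> List[Part]:
    # Assumed division criterion: one text part per non-empty paragraph.
    return [Part("text", p.strip()) for p in document.split("\n\n") if p.strip()]


def extract(part: Part) -> str:
    # For a text part, the extracted data is simply its text.
    return part.content


def analyse(data: str, context: List[Condition]) -> List[bool]:
    # Evaluate every contextual condition of the user context on the data.
    return [condition(data) for condition in context]


def assign(evaluated: List[bool]) -> float:
    # Assumed rule: fraction of fulfilled conditions, so the value lies in [0, 1].
    return sum(evaluated) / len(evaluated) if evaluated else 0.0


def modify(part: Part, threshold: float = 0.5) -> str:
    # Simulated highlighting; the threshold is an arbitrary illustrative value.
    return f"**{part.content}**" if part.relevance >= threshold else part.content


def mine(document: str, context: List[Condition]) -> str:
    rendered = []
    for part in divide(document):
        part.relevance = assign(analyse(extract(part), context))
        rendered.append(modify(part))
    return "\n\n".join(rendered)
```

For example, with a user context holding the single condition `lambda data: "bank" in data.lower()`, only the paragraphs that mention "bank" come back highlighted.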
  • Yet another object of the invention is to provide a computer program product, (in a computer readable storage medium such as DVD, CD or the like, or through a network connection) comprising instructions to cause a computer to carry out steps of the proposed method.
  • FIG. 1 illustrates a flowchart (100) according to the invention. Mined information from a document (110) is presented differently depending on its degree of relevance with respect to a user context (170).
  • FIG. 2 illustrates a flowchart (200) of the steps for defining the contextual conditions (175) that comprise the user context (170).
  • FIG. 3 shows one embodiment in which further steps are taken by the analysing module (160) from FIG. 1 in order to adjust the degree of contextual relevance of a hyperlink part in a document.
  • FIG. 4 shows one embodiment in which a contextual condition (175) is enhanced by an enhancing module (410).
  • the present invention is directed to a method, a system, and a corresponding computer program product for mining user-relevant information from documents.
  • the embodiments of the present invention may include or be performed with a special purpose or general-purpose computer including handheld devices such as, but not limited to, electronic book readers, smartphones and tablets. They can be performed locally in the client computer operated by the user reading documents, in remote servers, or combinations thereof (some modules operating in the client computer while others in remote servers).
  • Embodiments within the scope of the present invention include documents (110) in any computer-readable format.
  • documents (110) can comprise PDF, DOC, ODT, DJVU, HTML, XML, XHTML, TXT, or any other format which can be used to represent textual information.
  • the user context (170) is relative to the user and to her/his present context. For this reason, the user context (170) may change with time for a given user. For instance, a user interested in extracting information from an electronic newspaper related to the domain of financial institutions may be later interested in sports.
  • the user context (170) is defined by a set containing at least one contextual condition (175).
  • the document (110) is processed by a dividing module (120) that uses a division criterion (125) in order to divide the document in several parts (130).
  • the division criterion usually depends on the format of the document.
  • the dividing module should create parts (130), generally: text and hyperlink parts.
  • Text parts are parts (130) that mainly or exclusively contain text.
  • Hyperlink parts are parts (130) that mainly or exclusively contain hyperlinks. Text parts usually contain several words.
  • the division criterion (125) configures the dividing module (120) in a way that it transforms every paragraph of pure text into a text part, every single hyperlink into a hyperlink part, every paragraph of text containing hyperlinks into a mixed part and every image into an image part.
  • The document (110) is an HTML webpage and text parts are built from certain HTML tags, such as <p>, <h1>, <title> and <//>, and hyperlink parts are built from <a> tags.
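A division criterion of this kind for HTML could be sketched with Python's standard `html.parser` module. This is a simplified illustration: the class name `DividingParser`, the function `divide_html` and the `(kind, text, address)` tuple layout are assumptions, only the tags `p`, `h1`, `title` and `a` are handled, and text inside a hyperlink is assigned to the hyperlink part rather than to the enclosing text part.

```python
from html.parser import HTMLParser

TEXT_TAGS = {"p", "h1", "title"}  # text-part tags named in the description


class DividingParser(HTMLParser):
    """Collects (kind, text, address) tuples; the tuple layout is an assumption."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._text = None  # buffer while inside a text tag
        self._link = None  # (address, buffer) while inside an <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link = (dict(attrs).get("href"), [])
        elif tag in TEXT_TAGS:
            self._text = []

    def handle_data(self, data):
        if self._link is not None:
            self._link[1].append(data)
        elif self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._link is not None:
            address, label = self._link
            self.parts.append(("hyperlink", "".join(label).strip(), address))
            self._link = None
        elif tag in TEXT_TAGS and self._text is not None:
            text = "".join(self._text).strip()
            if text:
                self.parts.append(("text", text, None))
            self._text = None


def divide_html(html):
    parser = DividingParser()
    parser.feed(html)
    parser.close()
    return parser.parts
```

Calling `divide_html` on a page yields one tuple per part, which later modules can process differently depending on the `kind` field.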
  • the division criterion (125) could also instruct the dividing module (120) to look for scripting code that dynamically adds text or hyperlinks to the webpage and encapsulate this code in a new type of part that could be called a script part.
  • the scripting language could be here for instance Javascript, and the scripting code could reside in the HTML page, in a file referenced from the HTML page, or it could be dynamically fetched from a remote server.
  • a video and audio part is respectively formed with an embedded video and audio in the document.
  • Modules 140, 160 and 190 in FIG. 1 receive and generate pieces of information 130, 150 and 180.
  • the acts carried out by the said modules may be repeated for each part identified by the dividing module (120).
  • the user context (170), comprising one or several contextual conditions (175), is the same for each one of these parts.
  • the extracting module (140) in FIG. 1 extracts data (150) from each part (130).
  • the data extracted by this module depends on the type of contents that the part contains.
  • For text parts, the data extracted is simply the text that they contain, possibly with the information concerning the style properties applied to this text. If the document is an HTML web page, the extracting module (140) may also extract the HTML tags embracing the text in the text part (130).
  • the extracted data (150) is the text of the label (if any) in the hyperlinks that are found in the hyperlink part.
  • the extracting module (140) extracts as data (150) from a hyperlink part the addresses of the hyperlinked documents (320) pointed by the hyperlinks that are found in the hyperlink part, and the analysing module (155) uses these addresses to access the content of the hyperlinked documents (320).
  • the data (150) extracted from a script part embedded or referenced from an HTML document may comprise the text messages that the scripting code prints on the document (110), as well as the text comments or names of variables, procedures, methods and classes from the scripting code. If the script part dynamically adds hyperlinks to the document, the extracting module (140) could extract data (150) from their labels, from the contents of the documents pointed by the hyperlinks, and combinations thereof.
  • the extracting module (140) could extract data (150) from this information.
  • the data extracted could be the fields and values from the message.
  • the data extracted may correspond to the metadata associated to these elements, such as their title and tags.
  • the analysing module (155) analyses the data (150) extracted from each part (130) by means of evaluating one or more contextual conditions (175) in the user context (170) in order to obtain a set of evaluated contextual conditions (160).
  • the assigning module (165) receives the set of evaluated contextual conditions (160) and assigns a degree of contextual relevance (180) to the part (130).
  • the rules or criteria used by the assigning module (165) to assign a degree of contextual relevance (180) to a part (130) depend on the type of contents that the part contains.
  • The degree of contextual relevance (180) assigned by the assigning module (165) is a number that grows with the number of contextual conditions (175) that are fulfilled in the part (130), the number of times that each contextual condition (175) is fulfilled within the part (130), the inverse of the amount of data extracted from the part, and combinations thereof. For instance, if very little data (150) has been extracted from a part, but this data (150) fulfils many contextual conditions (175) many times, then the assigning module (165) assigns a high degree of contextual relevance (180). The converse is also true.
  • the degree of contextual relevance (180) is bounded by a minimum and a maximum value. It may take an infinite number of values between these two bounds.
  • the assigning module (165) takes into account the degree of contextual relevance (180) assigned to neighbouring parts in the document (110). In this case, the assigning module (165) assigns a higher degree of contextual relevance (180) to a part (130) when it is surrounded by parts with a high degree of contextual relevance than when it is not.
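One way to picture an assignment rule with these properties is the sketch below. It is an assumption-laden illustration, not the formula of the invention: contextual conditions are reduced to keywords, the raw score is (fulfilled conditions + total matches) / (number of extracted words), the bounds are taken as 0 and 1, and the neighbour effect is a simple additive boost with an arbitrary `weight`. The function names `relevance` and `with_neighbours` are hypothetical.

```python
from typing import List


def relevance(words: List[str], keywords: List[str]) -> float:
    """Degree in [0, 1) that grows with the number of fulfilled keyword
    conditions and their match counts, and shrinks with the amount of data."""
    if not words:
        return 0.0
    hits = [sum(1 for w in words if w == k) for k in keywords]
    fulfilled = sum(1 for h in hits if h > 0)
    raw = (fulfilled + sum(hits)) / len(words)
    return raw / (1.0 + raw)  # squash the raw score into the bounded range


def with_neighbours(scores: List[float], weight: float = 0.25) -> List[float]:
    """Raise each degree by a fraction of the mean degree of its adjacent
    parts, keeping the result within the upper bound of 1."""
    adjusted = []
    for i, score in enumerate(scores):
        neighbours = scores[max(0, i - 1):i] + scores[i + 1:i + 2]
        boost = weight * sum(neighbours) / len(neighbours) if neighbours else 0.0
        adjusted.append(min(1.0, score + boost))
    return adjusted
```

Under these assumptions a one-word part that matches a keyword scores higher than a ten-word part with the same single match, and a part flanked by highly relevant parts ends up with a higher degree than the same part flanked by irrelevant ones.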
  • the modifying module (190) takes the degree of contextual relevance (180) and changes the appearance on screen of the corresponding part (130) in the document (110). That is, it changes the way this part is rendered, producing a modified part (195) version of the original part (130).
  • the changes in the appearance of the corresponding part can comprise, but are not limited to, the following actions: highlighting, underlining, changing the text font, changing the colour, the visibility, the size, deleting and combinations thereof.
  • A part containing text with a high degree of contextual relevance can be highlighted, the size of an image in a part with a low degree of contextual relevance can be reduced, the colour of the text in a part with a low degree of contextual relevance (180) can be set close to the background colour of the text, and the hyperlink in a part with a low degree of contextual relevance can be deleted or hidden from the document.
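For an HTML document, these changes of appearance could be expressed as inline CSS. The following sketch assumes a degree of relevance in [0, 1]; the thresholds `high` and `low`, the specific CSS declarations and the function names `style_for` and `render` are illustrative choices, not values taken from the description.

```python
import html


def style_for(kind: str, relevance: float, high: float = 0.7, low: float = 0.3) -> str:
    """Map a part type and its degree of relevance to an inline CSS style."""
    if kind == "text" and relevance >= high:
        return "background-color: yellow;"  # highlight relevant text
    if relevance <= low:
        if kind == "hyperlink":
            return "display: none;"         # hide irrelevant hyperlinks
        if kind == "image":
            return "width: 50%;"            # shrink irrelevant images
        if kind == "text":
            return "color: #dddddd;"        # fade text toward a light background
    return ""                               # leave the part unchanged


def render(kind: str, text: str, relevance: float) -> str:
    """Wrap the escaped text in a styled span when a style applies."""
    style = style_for(kind, relevance)
    escaped = html.escape(text)
    return f'<span style="{style}">{escaped}</span>' if style else escaped
```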
  • the way the document parts are presented to the reader depends on what the reader is interested in (i.e., on the user context).
  • The same document (110) may look different to different readers and even to the same reader if she/he reads the document twice looking for different things (i.e., with a different user context).
  • the degree of contextual relevance (180) for one or more parts (130) in a document (110) can be stored in a local or remote database or file, or they can be coded in the document (110) itself.
  • The degree of contextual relevance (180) for each part (130) is stored together with a value that identifies the part it corresponds to. Examples of such a value are a reference to the location of the part within the document and a hash value computed from the contents of the part.
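Storage keyed by a content hash could look like the sketch below. The choice of SHA-256 and of JSON as the file format are assumptions made for illustration, as are the function names `part_key`, `save_relevance` and `load_relevance`.

```python
import hashlib
import json
from typing import List, Optional


def part_key(content: str) -> str:
    # Identify a part by a hash value computed from its contents.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def save_relevance(contents: List[str], degrees: List[float]) -> str:
    # Serialise {hash: degree}; the string could be written to a file or database.
    return json.dumps({part_key(c): d for c, d in zip(contents, degrees)}, indent=2)


def load_relevance(serialized: str, content: str) -> Optional[float]:
    # Look up the stored degree for a part, or None if the part is unknown.
    return json.loads(serialized).get(part_key(content))
```

A hash key survives changes in the position of a part within the document, whereas a location reference survives edits to the part's contents; which one fits better depends on how the document is expected to change.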
  • FIG. 2 illustrates a flowchart (200) of the steps for defining the contextual conditions (175) that comprise the user context (170) by the defining module.
  • the user context (170) is defined by a set containing at least one contextual condition (175).
  • a contextual condition (175) is any condition that can be evaluated by a computer and that upon evaluation yields either true or false as a result.
  • contextual conditions are evaluated by the analysing module (160) on the data (150) extracted by the extracting module (140).
  • Contextual conditions (175) may be defined by a defining module (210). This module gathers information from a variety of sources and uses this information in order to define the contextual conditions (175) that comprise the user context (170).
  • the defining module (210) can define a contextual condition (175) from the information entered by the user in a user interface (220), or from other sources of information comprising the user history data (260) and the document address (270).
  • the defining module (210) can define a contextual condition (175) from the information entered in an input field (230) by the user who is reading the document (110).
  • Such input field (230) could be, but is not limited to, an input field that appears in the user interface of the program used to view the document (110) for instance when a certain combination of keys is pressed (e.g., a search box), an input field (230) embedded in the document being viewed (e.g., an input form in an HTML document or the search box in an internet search engine), and an input field that belongs to an extension made to the program used to view the document (e.g., a browser extension, add-on or plugin such as a search engine toolbar).
  • A contextual condition (175) defined from the information entered from an input field (230) can consist of the evaluation of the presence or absence of a word in the data. It can also consist of the evaluation of the fulfilment of a Boolean or regular expression on the data.
  • An example of a Boolean expression is "(park OR children) AND -winter", which could mean, in one possible interpretation of the syntax, return true if the data contains the word "park" or the word "children" and does not contain the word "winter".
  • An example of a regular expression is ".at", which means return true if the data contains any three-character string ending with "at", including "hat", "cat", and "bat".
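Both kinds of condition can be evaluated with a few lines of Python. The sketch below follows the interpretation given above for the Boolean syntax (upper-case `AND` and `OR`, a leading `-` for negation, parentheses for grouping); the small recursive-descent evaluator and the names `evaluate` and `matches_regex` are assumptions for illustration, and malformed expressions are not handled.

```python
import re


def evaluate(expression: str, data: str) -> bool:
    """Evaluate a Boolean keyword expression against the words in the data."""
    words = set(re.findall(r"\w+", data.lower()))
    tokens = re.findall(r"\(|\)|-|\w+", expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():
        token = take()
        if token == "-":           # negation: the word must be absent
            return not factor()
        if token == "(":
            value = or_expr()
            take()                 # consume the closing parenthesis
            return value
        return token.lower() in words

    def and_expr():
        value = factor()
        while peek() == "AND":
            take()
            rhs = factor()         # always parse the right side before combining
            value = value and rhs
        return value

    def or_expr():
        value = and_expr()
        while peek() == "OR":
            take()
            rhs = and_expr()
            value = value or rhs
        return value

    return or_expr()


def matches_regex(pattern: str, data: str) -> bool:
    """True if the regular expression matches anywhere in the data."""
    return re.search(pattern, data) is not None
```

With these definitions, `evaluate("(park OR children) AND -winter", "Children play in summer")` is true, while the same expression on "The park in winter" is false.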
  • the defining module (210) can define a contextual condition (175) from the information obtained from a set of user preferences (240). These user preferences can be stored in a local or remote database or file, or they can be coded in the document (110) itself. They can contain information introduced by the user in an input field (230), or information extracted from a variety of information sources, such as the history data (260) or the document address (270).
  • An example of a contextual condition defined from the information in a set of user preferences (240) is to evaluate the absence of a word in a list of offensive words, or in a list of banned subjects. In another example such a contextual condition (175) evaluates the absence of hyperlinks that link to an address that belongs to a list of forbidden addresses.
  • A contextual condition could evaluate the absence of hyperlinks pointing to webpages with forbidden URLs.
  • Another possible contextual condition (175) defined through a set of user preferences is to evaluate the matching of style properties on the data.
  • This can be combined with other contextual conditions.
  • A combined contextual condition could be to evaluate the presence of a certain word in text that has the bold or underline style property applied.
  • The fulfilment of this combined condition could allow the assigning module (165) in FIG. 1 to assign a higher degree of contextual relevance (180) to the part (130).
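When the extracted data keeps its HTML tags, such a combined condition could be checked as in the following sketch. It assumes that bold and underline are expressed with the tags `b`, `strong` and `u` (styles applied through CSS are not detected), and the names `EmphasisText` and `word_in_emphasis` are hypothetical.

```python
import re
from html.parser import HTMLParser

EMPHASIS_TAGS = {"b", "strong", "u"}  # assumed markers of bold or underlined text


class EmphasisText(HTMLParser):
    """Collects only the text that appears inside emphasis tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def word_in_emphasis(word: str, fragment: str) -> bool:
    """True if the word occurs in bold or underlined text of the fragment."""
    parser = EmphasisText()
    parser.feed(fragment)
    parser.close()
    emphasised = re.findall(r"\w+", " ".join(parser.chunks).lower())
    return word.lower() in emphasised
```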
  • The defining module (210) can also use history data (260) collected by a data collecting module (250) from at least one previously consulted document in order to define one or more contextual conditions (175).
  • the history data (260) can be stored in a local or remote database or file, or it can be coded in the document itself. It comprises, but is not limited to, the list of documents recently opened by the user, as well as the list of main user actions performed on those documents, such as word searches and mouse clicks on hyperlinks. This information tells what the user has been recently doing and can be therefore used to define contextual conditions (175) within the user context (170). For instance, the user could have accessed the present document by clicking on a hyperlink located in a previously opened document.
  • the data collecting module (250) can collect this event, together with the words from the label corresponding to the clicked hyperlink as part of the history data (260), and the defining module (210) can define a contextual condition (175) that evaluates the presence of these words.
  • the document address (270) can be also used by the defining module (210) in order to define a contextual condition (175).
  • The document address refers to its URL. In other cases, it refers to the path of the file that contains the document in the corresponding local or remote file system together with the name of this file.
  • the defining module (210) can extract words from the document address and define contextual conditions that evaluate the presence of these words in the document (110).
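Extracting words from a document address could be sketched as follows for the URL case. The decisions to use only the path component, to keep alphabetic runs of at least three letters, and to ignore a small list of file-related tokens (`IGNORED`) are assumptions for illustration, as are the function names.

```python
import re
from urllib.parse import unquote, urlparse

# Tokens assumed to carry no contextual meaning (illustrative list).
IGNORED = {"html", "htm", "php", "pdf", "index"}


def words_from_address(address: str):
    """Return lower-case words found in the path of a document address."""
    path = unquote(urlparse(address).path)
    words = [w.lower() for w in re.findall(r"[A-Za-z]{3,}", path)]
    return [w for w in words if w not in IGNORED]


def conditions_from_address(address: str):
    """One contextual condition per word: true if the word occurs in the data."""
    return [lambda data, w=w: w in data.lower() for w in words_from_address(address)]
```

For the address `http://news.example.com/finance/central-bank-rates.html`, this yields conditions on the words "finance", "central", "bank" and "rates".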
  • the input field (230) and the user preferences (240) both receive information from the user.
  • The input field (230) and the user preferences (240) are components of the user interface (220) through which users influence how the defining module (210) defines the contextual conditions (175) of the user context (170).
  • the user enters information in the input field (230) and then presses the ENTER key or presses a button and a contextual condition (175) is defined.
  • the user then configures her/his set of user preferences (240) by entering information in input fields (230), by selecting options from a check box, radio button, data picker, toggle button, list box or menu bar.
  • the user interface (220) further comprises a reset button.
  • When the user pushes the reset button, all or part of the contextual conditions (175) are eliminated from the user context (170).
  • By means of the reset button, the user has control over the life cycle of the user context (170). More specifically, the user can decide when to change her/his user context (170).
  • the user introduces in the input field (230) the words "quantum physics" and the defining module (210) defines one or more corresponding contextual conditions (175).
  • When the user finds in the document (110) the information about quantum physics she/he is looking for, she/he decides that her/his interest has shifted from quantum physics to thermodynamics.
  • The user then introduces in the input field (230) the word "thermodynamics" and the defining module (210) defines one or more corresponding contextual conditions (175), thus effectively changing the user context (170) at the user's command.
  • When the user selects a new document to be mined by clicking on a hyperlink located in a document that has been mined, the following actions take place.
  • the user context (170) is deleted, that is, all or most of its contextual conditions (175) are eliminated.
  • a new user context (170) is created by defining contextual conditions (175) from various sources of information, comprising the new document address (270), and history data (260), such as the words in the label of the clicked hyperlink. If the clicked hyperlink has a high degree of contextual relevance (180), then the user context (170) is not deleted. Additionally, new contextual conditions (175) may be added to the user context (170) by taking into account information from various sources, comprising the new document address (270), and history data (260), such as the words in the label of the clicked hyperlink.
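The life cycle of the user context on a hyperlink click can be summarised in a short sketch. It assumes that contextual conditions are represented by keywords, that the degree of relevance lies in [0, 1], and that "high" means at or above an arbitrary `threshold`; the function name `next_context` is hypothetical.

```python
from typing import Iterable, Set


def next_context(
    current: Iterable[str],
    clicked_relevance: float,
    label_words: Iterable[str],
    address_words: Iterable[str],
    threshold: float = 0.7,
) -> Set[str]:
    """Return the user context to apply to the newly selected document."""
    new_words = set(label_words) | set(address_words)
    if clicked_relevance >= threshold:
        # Highly relevant hyperlink: keep the context and extend it.
        return set(current) | new_words
    # Otherwise the old context is deleted and a new one is created.
    return new_words
```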
  • the flowchart 300 of FIG. 3 describes further steps (310, 320, 330, 335, 340, 350, 355, 360, 365, 370, 375, 170, 175, 380 and 390) which are taken by the analysing module (160) in FIG. 1 according to another embodiment.
  • the accessing module 310, dividing module 330, extracting module 350, analysing module 365, assigning module 375 and adjusting module 390 receive and/or generate data from the hyperlink part 150, one or more hyperlinked documents 320, an auxiliary division criterion 335, one or more hyperlinked document parts 340, auxiliary data 360, one or more contextual conditions 175, one or more auxiliary evaluated contextual conditions 370, an auxiliary degree of contextual relevance 380 for each hyperlinked document part, and a degree of contextual relevance for the hyperlink part 180.
  • the purpose of the acts carried out by the said modules is to adjust the degree of contextual relevance (180) associated to a hyperlink part depending on the contents of the documents pointed by the hyperlinks contained in the hyperlink part.
  • An accessing module (310) accesses the hyperlinked document (320) pointed by each hyperlink contained in the data (150) extracted from the hyperlink part. For at least one of these hyperlinked documents (320), the document is divided in hyperlinked document parts (340) by an auxiliary dividing module (330) which operates according to an auxiliary division criterion (335).
  • An auxiliary extracting module (350) further extracts auxiliary data (360) from at least one hyperlinked document part (340).
  • An auxiliary analysing module (365) analyses this auxiliary data (360) by means of evaluating one or more contextual conditions (175) in the user context (170) in order to obtain a set of auxiliary evaluated contextual conditions (370).
  • An auxiliary assigning module (375) receives the set of auxiliary evaluated contextual conditions (370) and assigns an auxiliary degree of contextual relevance (380) to the hyperlinked document part (340).
  • An adjusting module (390) evaluates the auxiliary degrees of contextual relevance (380) of each one of the analysed hyperlinked document parts (340) in the document and adjusts the value of the degree of contextual relevance (180) associated to the hyperlink part accordingly.
  • the accessing module (310) not only accesses the hyperlinked documents (320) directly pointed by the hyperlinks contained in the data (150) extracted from the hyperlink part, or level-1 hyperlinked documents. It also recursively accesses a tree of n levels of the documents directly or indirectly pointed by the level-1 hyperlinked documents, where n is any integer finite number greater than one.
  • The accessing module (310) accesses documents that are pointed by hyperlinks located in the level-1 hyperlinked documents (i.e., level-2 hyperlinked documents), documents pointed by hyperlinks located in level-2 hyperlinked documents (i.e., level-3 hyperlinked documents), and in general, level-j hyperlinked documents pointed by hyperlinks located in level-(j-1) hyperlinked documents, for 2 ≤ j ≤ n.
  • the adjusting module (390) gives more importance to the degrees of contextual relevance (180) obtained from level-j hyperlinked documents when j is close to 2 than when j is close to n.
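The level-by-level traversal and the decreasing importance of deeper levels can be sketched as below. Several assumptions are made for illustration: network access is replaced by an in-memory mapping `links` from an address to the addresses it points to, `score` returns an auxiliary degree in [0, 1] for a whole document, each level weighs `decay` times the previous one, and the adjusted degree is the plain average of the original degree and the weighted auxiliary degree. The name `adjust_relevance` is hypothetical.

```python
from typing import Callable, Dict, List


def adjust_relevance(
    base: float,
    start_links: List[str],
    links: Dict[str, List[str]],
    score: Callable[[str], float],
    n: int = 3,
    decay: float = 0.5,
) -> float:
    """Adjust the degree of a hyperlink part from up to n levels of documents."""
    seen = set()
    frontier = list(start_links)  # level-1 hyperlinked documents
    weight = 1.0
    weighted_sum = 0.0
    weight_total = 0.0
    for _level in range(1, n + 1):
        next_frontier = []
        for address in frontier:
            if address in seen:   # avoid revisiting documents and link cycles
                continue
            seen.add(address)
            weighted_sum += weight * score(address)
            weight_total += weight
            next_frontier.extend(links.get(address, []))
        frontier = next_frontier
        weight *= decay           # deeper levels count less
    if weight_total == 0.0:
        return base               # no hyperlinked document could be analysed
    auxiliary = weighted_sum / weight_total
    return (base + auxiliary) / 2.0
```

Because the weight shrinks at every level, a relevant level-1 document raises the adjusted degree more than an equally relevant document found deeper in the tree.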
  • FIG. 4 illustrates one embodiment 400 in which at least one of the contextual conditions (175) from the user context is processed by an enhancing module (410) resulting in an enhanced contextual condition (420).
  • The enhanced contextual condition (420) is a version of the same original contextual condition (175) processed by the enhancing module which has undergone some modifications.
  • Such modifications can comprise: adding to the contextual condition at least one synonym of at least one word in the contextual condition, adding to the contextual condition at least one inflected or derived word from at least one word in the contextual condition (e.g., adding from the word "work" the inflected and derived forms "workaholic", "worked" and "working"), eliminating from the contextual condition a word when it is in a list of irrelevant words (e.g., eliminating from "the house" the article "the"), adding to the contextual condition at least one phonetically equivalent word of at least one word in the contextual condition (e.g., adding from the word "cool" the phonetically equivalent word "kool"), adding to the contextual condition a plural or singular form of at least one word in the contextual condition (e.g., adding from the word "tree" the word "trees"), adding to the contextual condition different capitalized versions of at least one word in the contextual condition (e.g. adding from the word "tree" the

Abstract

A method, a system and a computer program product are described for data mining a document or documents selected by a user. Data mining is performed according to an established user context that defines his preferences. As a result of analyzing the information contained in different parts of a document for a user context, different degrees of relevance are assigned to parts, and the appearance of the parts on screen may be modified accordingly for a better identification of the relevant information in the document. Thus, the part appearance is modified based on its contextual relevance for the user. The document contents may be divided in parts, such that each part is of the same type (e.g., text or hyperlinks). In this case different data extraction, analysis and assignment strategies may be used for each type of part. In addition, the relevant information may be stored and used.

Description

METHOD AND SYSTEM FOR DATA MINING A DOCUMENT
Field of the invention
The invention belongs to the field of data mining. Data mining (sometimes called data or knowledge discovery) is the automatic analysis and extraction of useful or relevant information from raw data.
Background of the invention
In the information era that we live in, the volume of data that we have to process on a daily basis keeps on growing exponentially. Usually, only a small fraction of the data that we process is relevant to our context. That is, we typically have to scan through the contents of each document discarding irrelevant data until we extract the information that we are looking for. This is an annoying, tedious, cognitively demanding, time-consuming and error-prone task. Data mining solutions can help humans extract relevant information from electronic documents faster and with less cognitive effort. They can automatically process document contents and extract data that may potentially contain relevant information for the user according to a context. Data mining typically involves at least two basic tasks: clustering and classification. Clustering is the task of discovering parts, groups or structures in the data that are in some way or another "similar", based on a certain division criterion. Classification is the task of assigning a degree of contextual relevance to the groups, structures or parts generated by the clustering task according to a context.
The state of the art provides several basic data mining solutions to extract relevant information from electronic documents, like the ones described below. Programs used to view documents, like Microsoft Word, Adobe Reader and web browsers, possibly with browser extensions such as https://addons.mozilla.org/de/firefox/addon/googleenhancer/, http://www.google.com/support/toolbar/bin/answer.py?hl=de&answer=9273 and https://addons.mozilla.org/de/firefox/addon/searchbox-companion/, provide built-in search functionality that typically highlights the matching words of a search query entered by the user. This functionality may be understood as a basic data mining solution that extracts information (i.e., the search matches) relevant to the user from the document. This solution presents several disadvantages. On the one hand, the parts created by the clustering task are simply the words in the document. This also means that there is not a typology of parts that allows processing their contents differently depending on their type. On the other hand, the classification task just contemplates two possible degrees of contextual relevance; either the word is a match or it is not.
US 7451099 discloses various techniques for generating markup information to be displayed on a client computer system. Web page document contents are analysed for selected keywords from a keyword list. Matching words are converted into a link of any designation. This allows an information distributor (e.g., an Advertiser) to add links to a web page that direct users to specific web pages and/or present relevant offers (banners, rich media). The purpose of this invention is not to help humans extract relevant information from electronic documents faster and with less cognitive effort, but it uses data mining techniques that could be adapted for that purpose. In that case, the solution presents several disadvantages. On the one hand, the parts created by the clustering task in this data mining solution are simply the words in the document. This also means that there is not a typology of parts that allows processing their contents differently depending on their type. On the other hand, the classification task uses rather advanced mechanisms, like negative keyword match and fuzzy search. But the context it uses consists of a list of keywords not introduced by the reader of the web page, which necessarily means that it is not related to her/his context, but rather to that of the Advertiser.
Some Internet filters prevent the access to web page documents depending on a user concern, such as security or the suitability for children. Several known tools of this type can be cited. Two filters related to security are Web Sense Filter (http://www.websense.com/content/WebFilter.aspx), and Microsoft's SmartScreen filter (http://www.microsoft.com/security/filters/smartscreen.aspx). Two filters related to the suitability of web contents for children are parental controls, such as K9 Web Protection (http://www1.k9webprotection.com), and Net Nanny (http://www.netnanny.com). These filters may be understood as a data mining solution that extracts information (i.e., the suitability of the document depending on a user concern) relevant to the user from the web page document. This solution presents several disadvantages. On the one hand, there is only one part created by the clustering task in this data mining solution that comprises the whole web page document. This also means that there is not a typology of parts that allows processing their contents differently depending on their type. On the other hand, the classification task just contemplates two possible degrees of contextual relevance: either the document is suitable for the user or not.
Accordingly, there currently exists a need in the art for automated and efficient data mining techniques that help human users find relevant information in the data of a document in less time and with less cognitive effort. More specifically, there is a need for clustering tasks capable of creating parts with a granularity meaningful to human readers, i.e., not with the granularity of a word, or of a document, but something in between. Moreover, a typology of parts should be introduced, so that parts containing different types of contents (e.g. text or hyperlinks) can be processed differently. The classification task should be able to use a context that accurately represents the reader's context, possibly combining information from several sources. Furthermore, it should also be able to produce continuous-valued degrees of contextual relevance, not just binary values. The present invention fulfils this need and obtains further advantages that will become more apparent in the following sections.
Summary of the invention
In order to avoid the aforementioned problems, the present invention is set forth and characterized in the independent claims. Preferred embodiments are specified in the dependent claims.
An object of the invention is a method for mining information from a user-selected document according to the relevance of the document data with respect to a user context. The user context is defined from information that may be completely, partially or not at all introduced by the user, so that a degree of contextual relevance can be calculated. To this end, the proposed method performs:
- A dividing step that divides the content of the document to be mined into a plurality of parts to be individually analyzed, based on a division criterion for defining parts.
- An extracting step that extracts data from at least one part.
- An analyzing step that analyzes the at least one part by evaluating, for the extracted data, at least one contextual condition associated to the user context.
- An assigning step that assigns a degree of contextual relevance to the at least one analyzed part according to the fulfilment of the at least one evaluated contextual condition.
- A modifying step that modifies the appearance on screen of the at least one analyzed part in the document to be mined according to the assigned degree of contextual relevance.
It is another major object of the invention to provide a system that carries out the above steps. The system may comprise the following modules:
- A dividing module that divides a content of the document to be mined into a plurality of parts to be individually analyzed, based on a division criterion for identifying parts.
- An extracting module that extracts data from at least one part.
- An analyzing module that analyzes the at least one part by evaluating, for the extracted data, at least one contextual condition associated to the user context.
- An assigning module that assigns a degree of contextual relevance for the at least one analyzed part according to the fulfilment of at least one contextual condition associated to the user context.
- A modifying module that modifies the appearance of the at least one analyzed part in the document to be mined according to the assigned degree of contextual relevance.
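By way of illustration only, and not limitation, the five modules listed above could be chained as in the following minimal Python sketch. All names (`Part`, `divide`, `extract`, `analyse`, `assign`, `modify`, `mine`), the line-based division criterion, the fraction-based degree and the asterisk rendering are assumptions made for the example; they are not prescribed by the invention.

```python
from dataclasses import dataclass
from typing import Callable, List

# A contextual condition is modelled here as a predicate over extracted data.
Condition = Callable[[str], bool]


@dataclass
class Part:
    kind: str            # e.g. "text" or "hyperlink"
    content: str
    relevance: float = 0.0
    rendered: str = ""


def divide(document: str) -> List[Part]:
    # Assumed division criterion: one text part per non-empty line.
    return [Part("text", line) for line in document.splitlines() if line.strip()]


def extract(part: Part) -> str:
    # For text parts the extracted data is simply the (lower-cased) text.
    return part.content.lower()


def analyse(data: str, conditions: List[Condition]) -> List[bool]:
    # Evaluate every contextual condition of the user context on the data.
    return [condition(data) for condition in conditions]


def assign(evaluated: List[bool]) -> float:
    # Assumed rule: fraction of fulfilled conditions, bounded to [0, 1].
    return sum(evaluated) / len(evaluated) if evaluated else 0.0


def modify(part: Part) -> str:
    # Assumed rendering: relevant parts are emphasised with asterisks.
    return f"**{part.content}**" if part.relevance >= 0.5 else part.content


def mine(document: str, conditions: List[Condition]) -> List[Part]:
    parts = divide(document)
    for part in parts:
        part.relevance = assign(analyse(extract(part), conditions))
        part.rendered = modify(part)
    return parts
```

In this sketch, calling `mine` with a two-line document and two word-presence conditions emphasises only the line that fulfils them; any other division criterion or rendering rule could be substituted without changing the overall flow.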
Yet another object of the invention is to provide a computer program product (in a computer readable storage medium such as DVD, CD or the like, or through a network connection) comprising instructions to cause a computer to carry out the steps of the proposed method.
Brief description of the drawings
In order to describe the manner in which the above-recited and other advantages and features of the invention can be obtained, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
FIG. 1 illustrates a flowchart (100) according to the invention. Mined information from a document (110) is presented differently depending on its degree of relevance with respect to a user context (170).
FIG. 2 illustrates a flowchart (200) of the steps for defining the contextual conditions (175) that comprise the user context (170).
FIG. 3 shows one embodiment in which further steps are taken by the analysing module (155) from FIG. 1 in order to adjust the degree of contextual relevance of a hyperlink part in a document.
FIG. 4 shows one embodiment in which a contextual condition (175) is enhanced by an enhancing module (410).
Preferred embodiment of the invention
The present invention is directed to a method, a system, and a corresponding computer program product for mining user-relevant information from documents. The embodiments of the present invention may include or be performed with a special purpose or general-purpose computer including handheld devices such as, but not limited to, electronic book readers, smartphones and tablets. They can be performed locally in the client computer operated by the user reading documents, in remote servers, or combinations thereof (some modules operating in the client computer while others in remote servers).
Embodiments within the scope of the present invention include documents (110) in any computer-readable format. By way of example, and not limitation, such formats can comprise PDF, DOC, ODT, DJVU, HTML, XML, XHTML, TXT, or any other format which can be used to represent textual information.
The user context (170) is relative to the user and to her/his present context. For this reason, the user context (170) may change with time for a given user. For instance, a user interested in extracting information from an electronic newspaper related to the domain of financial institutions may be later interested in sports. In the discussion concerning FIG. 2 we will describe in detail what the user context is, and how it is computed. For the purpose of describing FIG. 1 it suffices to know that the user context (170) is defined by a set containing at least one contextual condition (175).
According to FIG. 1, the document (110) is processed by a dividing module (120) that uses a division criterion (125) in order to divide the document in several parts (130). The division criterion usually depends on the format of the document. It specifies how the dividing module should create parts (130), generally: text and hyperlink parts. Text parts are parts (130) that mainly or exclusively contain text, and hyperlink parts are parts (130) that mainly or exclusively contain hyperlinks. Text parts usually contain several words. In one embodiment, the division criterion (125) configures the dividing module (120) in a way that it transforms every paragraph of pure text into a text part, every single hyperlink into a hyperlink part, every paragraph of text containing hyperlinks into a mixed part and every image into an image part.
In another embodiment of the invention the document (110) is an HTML webpage and text parts are built from certain HTML tags, such as <p>, <h1>, <title> and <li>, and hyperlink parts are built from <a> tags. In this embodiment, the division criterion (125) could also instruct the dividing module (120) to look for scripting code that dynamically adds text or hyperlinks to the webpage and encapsulate this code in a new type of part that could be called a script part. The scripting language could be, for instance, JavaScript, and the scripting code could reside in the HTML page, in a file referenced from the HTML page, or it could be dynamically fetched from a remote server. In another embodiment, video parts and audio parts are formed from video and audio embedded in the document, respectively.
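By way of example, and not limitation, a dividing module for the HTML embodiment could be sketched with the `html.parser` module of the Python standard library. The class name `DividingParser`, the function `divide_html`, the tuple layout of a part and the exact set of text tags are assumptions made for this illustration; script, image, video and audio parts are omitted for brevity.

```python
from html.parser import HTMLParser

# Assumed division criterion: these tags yield text parts, <a> yields hyperlink parts.
TEXT_TAGS = {"p", "h1", "title", "li"}


class DividingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []          # tuples of (part type, text, hyperlink address or None)
        self._text_tag = None    # text tag currently open, if any
        self._text = []
        self._href = None        # address of the <a> currently open, if any
        self._label = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._label = []
        elif tag in TEXT_TAGS and self._text_tag is None:
            self._text_tag = tag
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._label.append(data)
        if self._text_tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.parts.append(("hyperlink", "".join(self._label).strip(), self._href))
            self._href = None
        elif tag == self._text_tag:
            self.parts.append(("text", "".join(self._text).strip(), None))
            self._text_tag = None


def divide_html(html: str):
    parser = DividingParser()
    parser.feed(html)
    parser.close()
    return parser.parts
```

In this sketch a hyperlink nested in a paragraph produces its own hyperlink part, while its label also remains part of the enclosing text part; an embodiment with mixed parts would treat that case differently.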
Modules 140, 160 and 190 in FIG. 1 receive and generate pieces of information 130, 150 and 180. The acts carried out by the said modules may be repeated for each part identified by the dividing module (120). The user context (170), comprising one or several contextual conditions (175), is the same for each one of these parts.
The extracting module (140) in FIG. 1 extracts data (150) from each part (130). The data extracted by this module depends on the type of contents that the part contains. For text parts, the data extracted is simply the text that they contain, possibly with the information concerning the style properties applied to this text. If the document is an HTML web page the extracting module (140) may also extract the HTML tags embracing the text in the text part (130). For hyperlink parts, the extracted data (150) is the text of the label (if any) in the hyperlinks that are found in the hyperlink part. In another embodiment, the extracting module (140) extracts as data (150) from a hyperlink part the addresses of the hyperlinked documents (320) pointed by the hyperlinks that are found in the hyperlink part, and the analysing module (155) uses these addresses to access the content of the hyperlinked documents (320). This embodiment is further detailed in FIG. 3. The data (150) extracted from a script part embedded or referenced from an HTML document may comprise the text messages that the scripting code prints on the document (110), as well as the text comments or names of variables, procedures, methods and classes from the scripting code. If the script part dynamically adds hyperlinks to the document, the extracting module (140) could extract data (150) from their labels, from the contents of the documents pointed by the hyperlinks, and combinations thereof. If the script part dynamically receives information from remote servers or databases, the extracting module (140) could extract data (150) from this information. For instance, if the script part receives a message with JSON format, the data extracted could be the fields and values from the message. For image, video and audio parts, the data extracted may correspond to the metadata associated to these elements, such as their title and tags.
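By way of example, and not limitation, the extraction of fields and values from a message in JSON format received by a script part could be sketched as follows. The function name `extract_json_data` and the dotted naming of nested fields are assumptions made for this illustration.

```python
import json


def extract_json_data(message: str):
    """Flatten a JSON message into a list of (field, value) pairs."""

    def walk(node, prefix=""):
        if isinstance(node, dict):
            for key, value in node.items():
                # Nested fields are named with a dotted path (assumed convention).
                yield from walk(value, f"{prefix}{key}.")
        elif isinstance(node, list):
            for item in node:
                # Every list item is reported under the name of the list field.
                yield from walk(item, prefix)
        else:
            yield prefix.rstrip("."), node

    return list(walk(json.loads(message)))
```

The resulting pairs can then be handed to the analysing module in the same way as the text extracted from a text part.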
The analysing module (155) analyses the data (150) extracted from each part (130) by means of evaluating one or more contextual conditions (175) in the user context (170) in order to obtain a set of evaluated contextual conditions (160). The assigning module (165) receives the set of evaluated contextual conditions (160) and assigns a degree of contextual relevance (180) to the part (130). The rules or criteria used by the assigning module (165) to assign a degree of contextual relevance (180) to a part (130) depend on the type of contents that the part contains. The degree of contextual relevance (180) assigned by the assigning module (165) is a number that grows with the number of contextual conditions (175) that are fulfilled in the part (130), the number of times that each contextual condition (175) is fulfilled within the part (130), the inverse of the amount of data extracted from the part, and combinations thereof. For instance, if very little data (150) has been extracted from a part, but this data (150) fulfils many contextual conditions (175) many times, then the assigning module (165) assigns a high degree of contextual relevance (180). The converse is also true. The degree of contextual relevance (180) is bounded by a minimum and a maximum value. It may take an infinite number of values between these two bounds.
In another embodiment it may take a finite number of values between these bounds, and this number of values is equal to or greater than two. In this case textual labels can be associated to the values, for instance "high", "medium" and "low". According to another embodiment, the assigning module (165) takes into account the degree of contextual relevance (180) assigned to neighbouring parts in the document (110). In this case, the assigning module (165) assigns a higher degree of contextual relevance (180) to a part (130) when it is surrounded by parts with a high degree of contextual relevance than when it is not.
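By way of illustration only, one possible assigning rule with the properties described above could be sketched as follows. The specific formula, the thresholds for the textual labels and the neighbour weight are assumptions chosen for the example; the description only requires that the degree grows with the number of fulfilled conditions and of matches, decreases with the amount of extracted data, and remains bounded.

```python
def degree_of_relevance(words, keywords):
    """Bounded degree in [0, 1) for a part whose extracted data is a list of words."""
    counts = [words.count(keyword) for keyword in keywords]
    fulfilled = sum(1 for count in counts if count > 0)
    # Grows with fulfilled conditions and total matches, shrinks with data size.
    raw = (fulfilled + sum(counts)) / max(len(words), 1)
    # Squash the unbounded raw score into the interval [0, 1).
    return raw / (1.0 + raw)


def label(degree):
    """Finite-valued embodiment: map a degree to a textual label (assumed thresholds)."""
    if degree >= 0.5:
        return "high"
    if degree >= 0.2:
        return "medium"
    return "low"


def smooth(degrees, weight=0.25):
    """Neighbour-aware embodiment: blend each degree with the mean of its neighbours."""
    smoothed = []
    for i, degree in enumerate(degrees):
        neighbours = degrees[max(i - 1, 0):i] + degrees[i + 1:i + 2]
        mean = sum(neighbours) / len(neighbours) if neighbours else degree
        smoothed.append((1 - weight) * degree + weight * mean)
    return smoothed
```

With this rule a short part that matches several conditions scores higher than a long part with the same matches, and a part next to highly relevant parts scores higher than the same part next to irrelevant ones.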
The modifying module (190) takes the degree of contextual relevance (180) and changes the appearance on screen of the corresponding part (130) in the document (110). That is, it changes the way this part is rendered, producing a modified part (195) version of the original part (130). The changes in the appearance of the corresponding part can comprise, but are not limited to, the following actions: highlighting, underlining, changing the text font, changing the colour, the visibility, the size, deleting and combinations thereof. For example, a part containing text with a high degree of contextual relevance can be highlighted, the size of an image in a part with a low degree of contextual relevance can be reduced, the colour of the text in a part with low degree of contextual relevance (180) can be set close to the background colour of the text, and the hyperlink in a part with a low degree of contextual relevance can be deleted or hidden from the document. According to this, the way the document parts are presented to the reader depends on what the reader is interested in (i.e., on the user context). The same document (110) may look different to different readers and even to the same reader if she/he reads the document twice looking for different things (i.e., with a different user context).
In one embodiment, the degree of contextual relevance (180) for one or more parts (130) in a document (110) can be stored in a local or remote database or file, or it can be coded in the document (110) itself. In order to be able to retrieve the degree of contextual relevance (180) for each part (130), each degree is stored together with a value that identifies the part it corresponds to. Examples of such a value are a reference to the location of the part within the document and a hash value computed from the contents of the part.
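By way of example, and not limitation, storing degrees of contextual relevance together with a hash value that identifies each part could be sketched as follows. The class `RelevanceStore`, the in-memory dictionary standing in for a database or file, the choice of SHA-256 and the optional `context_id` key component are assumptions made for this illustration.

```python
import hashlib


class RelevanceStore:
    """In-memory stand-in for a local or remote database of stored degrees."""

    def __init__(self):
        self._degrees = {}

    @staticmethod
    def part_id(content: str) -> str:
        # Hash value computed from the contents of the part.
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def get_or_compute(self, content, compute, context_id=""):
        # context_id is an assumed extra key so that degrees obtained under
        # different user contexts are not confused with one another.
        key = (context_id, self.part_id(content))
        if key not in self._degrees:
            # Only here would the extracting, analysing and assigning modules run.
            self._degrees[key] = compute(content)
        return self._degrees[key]
```

A second request for the same part under the same context is served from the store without invoking `compute` again.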
In this embodiment, it is not always necessary to activate the extracting module (140), the analysing module (155) and the assigning module (165) in order to obtain the degree of contextual relevance. If the degree of contextual relevance (180) has already been obtained and stored or coded for a part, it can be directly retrieved from the database, file or document.
The User Context
FIG. 2 illustrates a flowchart (200) of the steps for defining the contextual conditions (175) that comprise the user context (170) by the defining module. The user context (170) is defined by a set containing at least one contextual condition (175). A contextual condition (175) is any condition that can be evaluated by a computer and that upon evaluation yields either true or false as a result. According to FIG. 1, contextual conditions are evaluated by the analysing module (155) on the data (150) extracted by the extracting module (140).
Contextual conditions (175) may be defined by a defining module (210). This module gathers information from a variety of sources and uses this information in order to define the contextual conditions (175) that comprise the user context (170). In particular, the defining module (210) can define a contextual condition (175) from the information entered by the user in a user interface (220), or from other sources of information comprising the user history data (260) and the document address (270).
The defining module (210) can define a contextual condition (175) from the information entered in an input field (230) by the user who is reading the document (110). Such input field (230) could be, but is not limited to, an input field that appears in the user interface of the program used to view the document (110), for instance when a certain combination of keys is pressed (e.g., a search box), an input field (230) embedded in the document being viewed (e.g., an input form in an HTML document or the search box in an internet search engine), and an input field that belongs to an extension made to the program used to view the document (e.g., a browser extension, add-on or plugin such as a search engine toolbar). A contextual condition (175) defined from the information entered from an input field (230) can consist of the evaluation of the presence or absence of a word in the data. It can also consist of the evaluation of the fulfilment of a Boolean or Regular expression on the data. An example of a Boolean expression is "(park OR children) AND -winter", which could mean, in a possible interpretation of the syntax, return true if the data contains the word "park" or the word "children" and it does not contain the word "winter". An example of a Regular expression is ".at", which means return true if the data contains any three-character string ending with "at", including "hat", "cat", and "bat".
The defining module (210) can define a contextual condition (175) from the information obtained from a set of user preferences (240). These user preferences can be stored in a local or remote database or file, or they can be coded in the document (110) itself. They can contain information introduced by the user in an input field (230), or information extracted from a variety of information sources, such as the history data (260) or the document address (270).
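By way of illustration only, the word-presence, Boolean and Regular expression conditions described above could be modelled as composable predicates, as in the following sketch. The helper names (`contains_word`, `all_of`, `any_of`, `negate`, `matches_regex`) are assumptions made for the example, and no parser for the Boolean query syntax is shown; the expression is built by hand under the interpretation given above.

```python
import re


def contains_word(word):
    # Whole-word, case-insensitive presence of a word in the data.
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return lambda data: pattern.search(data) is not None


def all_of(*conditions):
    return lambda data: all(condition(data) for condition in conditions)


def any_of(*conditions):
    return lambda data: any(condition(data) for condition in conditions)


def negate(condition):
    return lambda data: not condition(data)


def matches_regex(expression):
    pattern = re.compile(expression)
    return lambda data: pattern.search(data) is not None


# "(park OR children) AND -winter"
park_or_children_not_winter = all_of(
    any_of(contains_word("park"), contains_word("children")),
    negate(contains_word("winter")),
)

# ".at": any three-character string ending with "at"
three_chars_ending_in_at = matches_regex(r".at")
```

Each predicate yields either true or false when applied to the extracted data, which is the only property a contextual condition is required to have.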
An example of a contextual condition defined from the information in a set of user preferences (240) is to evaluate the absence of a word in a list of offensive words, or in a list of banned subjects. In another example such a contextual condition (175) evaluates the absence of hyperlinks that link to an address that belongs to a list of forbidden addresses. For example, in an HTML document such a contextual condition could evaluate the absence of hyperlinks pointing to webpages with forbidden URLs. Another possible contextual condition (175) defined through a set of user preferences is to evaluate the matching of style properties on the data. This can be combined with other contextual conditions. For instance, a combined contextual condition could be to evaluate the presence of a certain word in text that has the bold or underline style property applied. The fulfilment of this combined condition could allow the assigning module (165) in FIG. 1 to assign a higher degree of contextual relevance (180) to the part (130). The defining module (210) can also use history data (260) collected by a data collecting module (250) from at least one previously consulted document in order to define one or more contextual conditions (175). The history data (260) can be stored in a local or remote database or file, or it can be coded in the document itself. It comprises, but is not limited to, the list of documents recently opened by the user, as well as the list of main user actions performed on those documents, such as word searches and mouse clicks on hyperlinks. This information tells what the user has been recently doing and can therefore be used to define contextual conditions (175) within the user context (170). For instance, the user could have accessed the present document by clicking on a hyperlink located in a previously opened document.
The data collecting module (250) can collect this event, together with the words from the label corresponding to the clicked hyperlink as part of the history data (260), and the defining module (210) can define a contextual condition (175) that evaluates the presence of these words. Examples of other user actions are selecting text in the document and leaving the mouse pointer for a number of seconds over some part in the document. The document address (270) can also be used by the defining module (210) in order to define a contextual condition (175). In the case of an HTML document, the document address refers to its URL. In other cases, it refers to the path of the file that contains the document in the corresponding local or remote file system together with the name of this file. The defining module (210) can extract words from the document address and define contextual conditions that evaluate the presence of these words in the document (110). The input field (230) and the user preferences (240) both receive information from the user. They are components of the user interface (220) through which users influence how the defining module (210) defines the contextual conditions (175) of the user context (170). In an embodiment, the user enters information in the input field (230) and then presses the ENTER key or presses a button and a contextual condition (175) is defined. In another embodiment, the user clicks on a link, or selects an option from a menu in the user interface and accesses a page or menu of user preferences (240). The user then configures her/his set of user preferences (240) by entering information in input fields (230), by selecting options from a check box, radio button, data picker, toggle button, list box or menu bar. The user then finishes editing her/his user preferences (240), saves and exits the user preferences page or menu, or simply exits it without saving the changes.
In another embodiment the user interface (220) further comprises a reset button. When the user pushes the reset button, all or part of the contextual conditions (175) are eliminated from the user context (170). By means of the reset button, the user has control over the life cycle of the user context (170). More specifically, the user can decide when to change her/his user context (170). In one example, the user introduces in the input field (230) the words "quantum physics" and the defining module (210) defines one or more corresponding contextual conditions (175). When the user finds in the document (110) the information about quantum physics she/he is looking for, she/he decides that her/his interest has shifted from quantum physics to thermodynamics. The user presses the reset button in the user interface (220) and all contextual conditions (175) are deleted from the user context (170). The user then introduces in the input field (230) the word "thermodynamics" and the defining module (210) defines one or more corresponding contextual conditions (175), thus effectively changing the user context (170) at the user's command. In one embodiment, when the user selects a new document to be mined by clicking on a hyperlink located in a document that has been mined, the following actions take place. If the clicked hyperlink has a low degree of contextual relevance (180), then the user context (170) is deleted, that is, all or most of its contextual conditions (175) are eliminated. A new user context (170) is created by defining contextual conditions (175) from various sources of information, comprising the new document address (270), and history data (260), such as the words in the label of the clicked hyperlink. If the clicked hyperlink has a high degree of contextual relevance (180), then the user context (170) is not deleted.
Additionally, new contextual conditions (175) may be added to the user context (170) by taking into account information from various sources, comprising the new document address (270), and history data (260), such as the words in the label of the clicked hyperlink.
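By way of illustration only, the handling of the user context when a hyperlink is clicked could be sketched as follows. Here the user context is simplified to a set of words, each standing for a word-presence condition; the function names, the list of ignored address tokens and the threshold separating "low" from "high" relevance are assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Assumed list of address tokens that carry no contextual meaning.
IGNORED = {"www", "com", "org", "net", "html", "htm"}


def words_from_address(address: str) -> set:
    """Extract candidate words from a URL or from a file path."""
    parsed = urlparse(address)
    tokens = re.split(r"[^A-Za-z]+", parsed.netloc + " " + parsed.path)
    return {t.lower() for t in tokens if len(t) > 2 and t.lower() not in IGNORED}


def update_context(context: set, clicked_relevance: float, label: str,
                   address: str, threshold: float = 0.5) -> set:
    """Return the user context to use for the newly selected document."""
    new_words = words_from_address(address) | {w.lower() for w in label.split()}
    if clicked_relevance < threshold:
        # Low relevance: the old context is deleted and a new one is created.
        return new_words
    # High relevance: the old context is kept and may be extended.
    return context | new_words
```

For instance, clicking a hyperlink labelled "Football results" with a low degree of contextual relevance replaces a context about quantum physics with words taken from the label and from the new address.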
Hyperlink parts
The flowchart 300 of FIG. 3 describes further steps (310, 320, 330, 335, 340, 350, 355, 360, 365, 370, 375, 170, 175, 380 and 390) which are taken by the analysing module (155) in FIG. 1 according to another embodiment. More specifically, the accessing module 310, dividing module 330, extracting module 350, analysing module 365, assigning module 375 and adjusting module 390 receive and/or generate data from the hyperlink part 150, one or more hyperlinked documents 320, an auxiliary division criterion 335, one or more hyperlinked document parts 340, auxiliary data 360, one or more contextual conditions 175, one or more auxiliary evaluated contextual conditions 370, an auxiliary degree of contextual relevance 380 for each hyperlinked document part, and a degree of contextual relevance for the hyperlink part 180. The purpose of the acts carried out by the said modules is to adjust the degree of contextual relevance (180) associated to a hyperlink part depending on the contents of the documents pointed to by the hyperlinks contained in the hyperlink part. An accessing module (310) accesses the hyperlinked document (320) pointed to by each hyperlink contained in the data (150) extracted from the hyperlink part. At least one of these hyperlinked documents (320) is divided into hyperlinked document parts (340) by an auxiliary dividing module (330) which operates according to an auxiliary division criterion (335). An auxiliary extracting module (350) further extracts auxiliary data (360) from at least one hyperlinked document part (340). An auxiliary analysing module (365) analyses this auxiliary data (360) by means of evaluating one or more contextual conditions (175) in the user context (170) in order to obtain a set of auxiliary evaluated contextual conditions (370).
An auxiliary assigning module (375) receives the set of auxiliary evaluated contextual conditions (370) and assigns an auxiliary degree of contextual relevance (380) to the hyperlinked document part (340). An adjusting module (390) evaluates the auxiliary degrees of contextual relevance (380) of each one of the analysed hyperlinked document parts (340) in the document and adjusts the value of the degree of contextual relevance (180) associated to the hyperlink part accordingly.
According to one embodiment, the accessing module (310) not only accesses the hyperlinked documents (320) directly pointed to by the hyperlinks contained in the data (150) extracted from the hyperlink part, or level-1 hyperlinked documents. It also recursively accesses a tree of n levels of the documents directly or indirectly pointed to by the level-1 hyperlinked documents, where n is any finite integer greater than one. That is, the accessing module (310) accesses documents that are pointed to by hyperlinks located in the level-1 hyperlinked documents (i.e., level-2 hyperlinked documents), documents pointed to by hyperlinks located in level-2 hyperlinked documents (i.e., level-3 hyperlinked documents), and in general, level-j hyperlinked documents pointed to by hyperlinks located in level-(j-1) hyperlinked documents, for 2 ≤ j ≤ n. The adjusting module (390) gives more importance to the degrees of contextual relevance (180) obtained from level-j hyperlinked documents when j is close to 2 than when j is close to n.
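By way of example, and not limitation, an adjusting rule that gives less importance to deeper levels of hyperlinked documents could be sketched as follows. The geometric decay of the level weights, the use of the maximum auxiliary degree per level and the equal mixing with the hyperlink part's own degree are assumptions made for the example; accessing the hyperlinked documents themselves is not shown.

```python
def adjust_degree(own_degree, auxiliary_by_level, decay=0.5, mix=0.5):
    """Adjust the degree of a hyperlink part from auxiliary degrees per level.

    auxiliary_by_level maps a level j (1, 2, ...) to the auxiliary degrees of
    the parts analysed in the level-j hyperlinked documents.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for level, degrees in auxiliary_by_level.items():
        if not degrees:
            continue
        # Deeper levels contribute less: weight 1 for level 1, decay for level 2, ...
        weight = decay ** (level - 1)
        weighted_sum += weight * max(degrees)
        weight_total += weight
    if weight_total == 0.0:
        # Nothing could be analysed: keep the degree obtained from the label.
        return own_degree
    linked = weighted_sum / weight_total
    # Both inputs lie within the same bounds, so the result does too.
    return (1 - mix) * own_degree + mix * linked
```

With these assumed weights, a highly relevant level-1 document raises the degree of the hyperlink part more than an equally relevant level-2 document does.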
Enhancing Contextual Conditions
FIG. 4 illustrates one embodiment 400 in which at least one of the contextual conditions (175) from the user context is processed by an enhancing module (410) resulting in an enhanced contextual condition (420). The enhanced contextual condition (420) is a version of the original contextual condition (175) that has undergone some modifications made by the enhancing module. By way of example, and not limitation, such modifications can comprise: adding to the contextual condition at least one synonym of at least one word in the contextual condition, adding to the contextual condition at least one inflected or derived word from at least one word in the contextual condition (e.g., adding from the word "work" the inflected and derived forms "workaholic", "worked" and "working"), eliminating from the contextual condition a word when it is in a list of irrelevant words (e.g., eliminating from "the house" the article "the"), adding to the contextual condition at least one phonetically equivalent word of at least one word in the contextual condition (e.g., adding from the word "cool" the phonetically equivalent word "kool"), adding to the contextual condition a plural or singular form of at least one word in the contextual condition (e.g., adding from the word "tree" the word "trees"), adding to the contextual condition different capitalized versions of at least one word in the contextual condition (e.g. adding from the word "tree" the words "Tree" and "TREE"), adding to the contextual condition different accentuated versions of at least one word in the contextual condition, and combinations thereof.
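By way of illustration only, a few of the modifications listed above (removal of irrelevant words, synonyms, plural or singular forms and capitalized versions) could be sketched as follows. The function name `enhance`, the list of irrelevant words, the caller-supplied synonym table and the deliberately naive plural rule are assumptions made for the example; inflected, phonetic and accentuated variants are not covered.

```python
# Assumed list of irrelevant words.
IRRELEVANT_WORDS = {"the", "a", "an", "of"}


def enhance(words, synonyms=None):
    """Return the enhanced set of words for a word-presence contextual condition."""
    synonyms = synonyms or {}
    enhanced = set()
    for word in words:
        base = word.lower()
        if base in IRRELEVANT_WORDS:
            continue  # e.g. "the" is dropped from "the house"
        variants = {base} | set(synonyms.get(base, ()))
        # Naive plural/singular rule: toggle a trailing "s" (an assumption that
        # is wrong for many English words and is used here only for brevity).
        variants |= {v[:-1] if v.endswith("s") else v + "s" for v in set(variants)}
        for variant in variants:
            # Add different capitalized versions of every variant.
            enhanced.update({variant, variant.capitalize(), variant.upper()})
    return enhanced
```

For the words "the" and "tree" this sketch yields "tree", "Tree", "TREE", "trees", "Trees" and "TREES".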

Claims

1. A computer-implemented method for mining data from a document (110) selected by a user according to the relevance of the data with a user context (170), the method comprising the steps of:
- dividing the content of the document (110) to be mined into a plurality of parts (130) to be individually analyzed, based on a division criterion (125) for defining parts,
- selectively extracting data (150) from at least one part (130) in dependence upon the type of content present therein,
- analyzing the at least one part (130) by evaluating, for the extracted data (150), at least one contextual condition (175) associated to the user context (170),
- assigning a degree of contextual relevance (180) to the at least one analyzed part (130) according to the fulfillment of the at least one evaluated contextual condition (160), and
- modifying the appearance of the at least one analyzed part (130) in the document (110) to be mined according to the assigned degree of contextual relevance (180).
2. The method according to claim 1, wherein the division criterion (125) is associated to the type of content present in the document (110), such type of content being at least text or hyperlink, whereby the step of dividing the document content produces at least one text part or at least one hyperlink part, respectively.
3. The method according to claim 1 or 2, wherein it comprises the step of establishing the user context (170), the user context being defined according to at least one contextual condition (175) to be fulfilled by the extracted data (150).
4. The method according to claim 3, wherein the contextual condition (175) is at least one of the following conditions:
- presence of a word,
- absence of a word,
- fulfillment of a Boolean expression,
- fulfillment of a Regular expression,
- matching with an URL address pattern,
- matching style properties of the document,
- fulfillment of at least one contextual condition by a different part of the document,
- and combinations thereof.
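As a non-limiting illustration, several of the listed conditions could be represented as small predicate functions that are then combined. All function names below are hypothetical, the Boolean expression is reduced to a logical AND, and the URL address pattern is assumed to use shell-style wildcards; conditions on style properties or on other parts of the document are not covered.

```python
import re
from fnmatch import fnmatchcase


def word_present(word):
    """Condition fulfilled when the word occurs in the text (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return lambda text: pattern.search(text) is not None


def word_absent(word):
    """Condition fulfilled when the word does not occur in the text."""
    present = word_present(word)
    return lambda text: not present(text)


def regex_fulfilled(pattern):
    """Condition fulfilled when the regular expression matches somewhere."""
    return lambda text: re.search(pattern, text) is not None


def url_matches(pattern):
    """Assumed convention: shell-style wildcards such as 'https://*.example.org/*'."""
    return lambda url: fnmatchcase(url, pattern)


def all_of(*conditions):
    """A simple Boolean combination (logical AND) of other conditions."""
    return lambda text: all(c(text) for c in conditions)


condition = all_of(word_present("patent"), word_absent("expired"), regex_fulfilled(r"\d{4}"))
print(condition("The patent was filed in 2011."))  # True
print(condition("The patent expired in 2011."))    # False
```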
5. The method according to claim 3 or 4, wherein the contextual condition (175) is established by reading the content of an input field (230) for allowing a contextual condition (175) to be defined by the user.
6. The method according to any of claims 3 to 5, wherein establishing the user context further comprises defining at least one contextual condition (175) by collecting history data (260) from at least one previously consulted document.
7. The method according to any of claims 2 to 6, wherein the step of analyzing the data (150) extracted from one part (130) when the part is a hyperlink part further comprises:
- accessing a hyperlinked document (320) when holding a linking association with the hyperlink part in the document (110),
- dividing the content of the hyperlinked document (320) into a plurality of hyperlinked document parts (340) based on an auxiliary division criterion (125) for defining hyperlinked document parts,
- extracting auxiliary data (360) from at least one hyperlinked document part (340),
- analyzing the at least one hyperlinked document part (340) by evaluating, for the extracted auxiliary data (360), at least one contextual condition (175) associated to the user context (170),
- assigning an auxiliary degree of contextual relevance (380) to the at least one analyzed hyperlinked document part (340) according to the fulfillment of the at least one auxiliary evaluated contextual condition (370), and
- adjusting the degree of contextual relevance (180) for the hyperlink part in the document (110) to be mined according to the assigned auxiliary degree of contextual relevance (380) of the analyzed hyperlinked document part (340).
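A minimal sketch of the hyperlink handling of claim 7 follows, under stated assumptions: hyperlinked documents are looked up in an in-memory dictionary `pages` rather than fetched over a network, the auxiliary division criterion is again "one part per paragraph", and the adjustment rule simply takes the larger of the hyperlink part's own relevance and the best auxiliary relevance. The claim does not prescribe any of these choices, and both function names are hypothetical.

```python
def part_relevance(text, required_words):
    """Fraction of the required words that occur in the text."""
    words = set(text.lower().split())
    if not required_words:
        return 0.0
    return sum(w.lower() in words for w in required_words) / len(required_words)


def adjusted_link_relevance(anchor_text, target_url, pages, required_words):
    """Adjust a hyperlink part's relevance using the document it points to."""
    own = part_relevance(anchor_text, required_words)
    target = pages.get(target_url)  # accessing the hyperlinked document
    if target is None:
        return own  # nothing available to adjust with
    # Auxiliary division: one hyperlinked document part per paragraph.
    target_parts = [p for p in target.split("\n\n") if p.strip()]
    auxiliary = max(
        (part_relevance(p, required_words) for p in target_parts), default=0.0
    )
    return max(own, auxiliary)  # assumed adjustment rule


pages = {"https://example.org/a": "intro text\n\ndata mining of documents"}
print(adjusted_link_relevance("read more", "https://example.org/a", pages, ["mining", "documents"]))  # 1.0
```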
8. The method according to claim 7, wherein the linking association exists:
- if a hyperlink of the hyperlink part of the document (110) is directly pointing to the hyperlinked document (320), or
- if a hyperlink of the hyperlink part of the document (110) is connectable through at least one intermediate document with at least one intermediate hyperlink directly pointing to the hyperlinked document (320).
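For illustration, the two cases of a linking association can be checked with a breadth-first search over a link graph. The sketch assumes a dictionary `links` that maps each document URL to the URLs it points to; the function name and the `max_hops` bound on the number of intermediate hyperlinks are hypothetical additions, since the claim places no limit on the chain length.

```python
from collections import deque


def has_linking_association(link_target, hyperlinked_doc, links, max_hops=3):
    """True if the hyperlink points to hyperlinked_doc directly or via intermediates."""
    if link_target == hyperlinked_doc:
        return True  # the hyperlink points directly to the hyperlinked document
    queue = deque([(link_target, 0)])
    seen = {link_target}
    while queue:
        url, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not follow intermediate hyperlinks beyond the bound
        for nxt in links.get(url, []):
            if nxt == hyperlinked_doc:
                return True  # reached through an intermediate hyperlink
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return False


links = {"a": ["b"], "b": ["target"]}
print(has_linking_association("a", "a", links))                   # True (direct)
print(has_linking_association("a", "target", links))              # True (via "b")
print(has_linking_association("a", "target", links, max_hops=1))  # False
```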
9. The method according to any of claims 1 to 8, wherein the step of analyzing the extracted data (150) further comprises enhancing the contextual condition (175) by performing at least one of the following operations for producing an enhanced contextual condition (420):
- adding at least one synonym of a word comprised in the contextual condition (175),
- applying stemming to a word comprised in the contextual condition (175),
- obviating a word comprised in the contextual condition (175) when it is in a list of irrelevant words,
- adding an equivalent phonetic word of a word comprised in the contextual condition (175).
10. The method according to any of claims 1 to 9, wherein it further comprises storing in a database a reference to at least one part (130) of the document (110) and its corresponding degree of contextual relevance (180).
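One possible way to store such references, shown here only as a sketch, is a single SQLite table keyed by document and part. The table name, the column names, and the use of a part index as the "reference" to a part are assumptions; the claim does not specify a database type or schema.

```python
import sqlite3

# In-memory database for the example; a file path could be used instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE part_relevance ("
    " document_url TEXT, part_index INTEGER, relevance REAL,"
    " PRIMARY KEY (document_url, part_index))"
)


def store_relevance(conn, document_url, part_index, relevance):
    """Store or update the degree of relevance for one part of a document."""
    conn.execute(
        "INSERT OR REPLACE INTO part_relevance VALUES (?, ?, ?)",
        (document_url, part_index, relevance),
    )
    conn.commit()


store_relevance(conn, "https://example.org/a", 0, 0.75)
row = conn.execute(
    "SELECT relevance FROM part_relevance WHERE document_url = ? AND part_index = ?",
    ("https://example.org/a", 0),
).fetchone()
print(row[0])  # 0.75
```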
11. The method according to any of claims 1 to 10, wherein it further comprises, for a plurality of documents (110), defining a ranking of documents according to the degrees of contextual relevance (180) of their parts (130).
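The ranking of claim 11 could, for instance, order documents by an aggregate of the relevance degrees of their parts. The sketch below assumes the mean as the aggregate; the claim leaves the aggregation open, and the function name `rank_documents` is hypothetical.

```python
def rank_documents(part_relevances):
    """Sort document ids by the mean relevance of their parts, highest first."""

    def score(doc_id):
        values = part_relevances[doc_id]
        return sum(values) / len(values) if values else 0.0

    return sorted(part_relevances, key=score, reverse=True)


relevances = {"doc_a": [0.2, 0.4], "doc_b": [1.0, 0.5], "doc_c": []}
print(rank_documents(relevances))  # ['doc_b', 'doc_a', 'doc_c']
```

Other aggregates, such as the maximum relevance of any part, would favor documents with a single highly relevant passage.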
12. The method according to any of claims 1 to 11, wherein modifying the appearance of the part (130) on the document (110) according to the evaluated degree of contextual relevance (180) comprises at least one of the following actions:
- highlighting,
- underlining,
- changing the text font,
- changing the color,
- changing the visibility,
- deleting.
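When the document is a web page, the listed actions could be sketched as inline CSS applied to a `<span>` around the part. The mapping from action to style and the function name `modify_appearance` are assumptions made for the example; the claim does not state how each action is rendered.

```python
from html import escape

# Assumed rendering of each action as an inline CSS declaration.
STYLES = {
    "highlighting": "background-color: yellow",
    "underlining": "text-decoration: underline",
    "changing the text font": "font-family: monospace",
    "changing the color": "color: red",
    "changing the visibility": "visibility: hidden",
}


def modify_appearance(part_text, action):
    """Return HTML for the part after applying one appearance action."""
    if action == "deleting":
        return ""  # the part is removed from the document
    style = STYLES[action]
    return f'<span style="{style}">{escape(part_text)}</span>'


print(modify_appearance("data mining", "highlighting"))
# <span style="background-color: yellow">data mining</span>
```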
13. The method according to any of claims 1 to 12, wherein the document is a web page.
14. A computer-implemented system for mining data from a document (110) selected by a user according to the relevance of the data with a user context (170), the system comprising:
- a dividing module (120) configured to divide the content of the document (110) to be mined into a plurality of parts (130) to be individually analyzed, based on a division criterion (125) for identifying parts,
- an extracting module (140) configured to selectively extract data (150) from at least one part (130) in dependence upon the type of content present therein,
- an analyzing module (155) configured to analyze the at least one part (130) by evaluating, for the extracted data (150), at least one contextual condition (175) associated to the user context (170),
- an assigning module (165) configured to assign a degree of contextual relevance (180) for the at least one analyzed part (130) according to a fulfilment of the at least one evaluated contextual condition (175), and
- a modifying module (190) configured to modify the appearance of the at least one analyzed part (130) in the document (110) to be mined according to the assigned degree of contextual relevance (180).
15. The system according to claim 14, wherein the division criterion (125) applied by the dividing module (120) is associated to the type of content present in the document (130), such type of content being at least text or hyperlink, thereby the step of dividing the document content produces at least one text part or at least one hyperlink part, respectively.
16. The system according to claim 14 or 15, wherein it comprises a defining module (210) configured to establish the user context (170) by defining at least one contextual condition (175) to be fulfilled by the extracted data (150).
17. The system according to claim 16, wherein the contextual condition (175) is at least one of the following conditions:
- presence of a word,
- absence of a word,
- fulfillment of a Boolean expression,
- fulfillment of a Regular expression,
- matching with an URL address pattern,
- matching style properties of the document,
- fulfillment of at least one contextual condition by a different part of the document,
- and combinations thereof.
18. The system according to claim 16 or 17, wherein the defining module (210) is further configured to define a contextual condition (175) from a user interface (220) configured to read the content of an input field (230) for allowing a contextual condition (175) to be defined by the user.
19. The system according to any of claims 16 to 18, wherein the defining module (210) is further configured to establish the user context (170) by defining at least one contextual condition (175) from user history data (260) collected by a data collecting module (250) from at least one previously consulted document.
20. The system according to any of claims 15 to 19, wherein, when the said part (130) is a hyperlink part, the analyzing module (370) further comprises:
- an accessing module (310) configured to access a hyperlinked document (320) when holding a linking association with the hyperlink part in the document (110),
- an auxiliary dividing module (330) configured to divide the content of the hyperlinked document (320) into a plurality of hyperlinked document parts (340) based on an auxiliary division criterion (335) for defining hyperlinked document parts,
- an auxiliary extracting module (350) configured to extract auxiliary data (360) from at least one hyperlinked document part (340),
- an auxiliary analyzing module (370) configured to analyze the at least one hyperlinked document part (340) by evaluating, for the extracted auxiliary data (360), at least one contextual condition (175) associated to the user context (170),
- an auxiliary assigning module (375) configured to assign an auxiliary degree of contextual relevance (380) to the at least one analyzed hyperlinked document part (340) according to the fulfillment of the at least one auxiliary evaluated contextual condition (370), and
- an adjusting module (390) configured to adjust the degree of contextual relevance (180) for the hyperlink part in the document (110) to be mined according to the assigned auxiliary degree of contextual relevance (380) of the analyzed hyperlinked document part (340).
21. The system according to claim 20, wherein the linking association exists:
- if a hyperlink of the hyperlink part of the document (110) is directly pointing to the hyperlinked document (320), or
- if a hyperlink of the hyperlink part of the document (110) is connectable through at least one intermediate document with at least one intermediate hyperlink directly pointing to the hyperlinked document (320).
22. The system according to any of claims 14 to 21, wherein it further comprises an enhancing module (410) configured to perform at least one of the following operations with the at least one contextual condition (175) to produce at least one enhanced contextual condition (420):
- adding at least one synonym of a word comprised in the contextual condition (175),
- applying stemming to a word comprised in the contextual condition (175),
- obviating a word comprised in the contextual condition (175) when it is in a list of irrelevant words,
- adding an equivalent phonetic word of a word comprised in the contextual condition (175).
23. The system according to any of claims 14 to 22, wherein it further comprises a database for storing a reference to at least one part (130) of the document (110) and its corresponding degree of contextual relevance (180).
24. The system according to any of claims 14 to 23, wherein it further comprises a ranking module configured to define a ranking of documents for a plurality of mined documents (110) according to the degrees of contextual relevance (180) of their parts (130).
25. The system according to any of claims 14 to 24, wherein the modifying module (190) is further configured to perform at least one of the following actions to modify the appearance of the at least one part (130) on the document (110), according to the evaluated degree of contextual relevance (180), to produce a modified part (195):
- highlighting,
- underlining,
- changing the text font,
- changing the color,
- changing the visibility,
- deleting.
26. The method according to any of claims 14 to 25, wherein the document is a web page.
27. A computer program product adapted to perform the method of claims 1 to 13.
PCT/EP2011/003590 2011-07-19 2011-07-19 Method and system for data mining a document. WO2013010557A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2011/003590 WO2013010557A1 (en) 2011-07-19 2011-07-19 Method and system for data mining a document.

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2011/003590 WO2013010557A1 (en) 2011-07-19 2011-07-19 Method and system for data mining a document.

Publications (1)

Publication Number Publication Date
WO2013010557A1 (en) 2013-01-24

Family

ID=44629044

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2011/003590 WO2013010557A1 (en) 2011-07-19 2011-07-19 Method and system for data mining a document.

Country Status (1)

Country Link
WO (1) WO2013010557A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100213A (en) * 2020-09-07 2020-12-18 中国人民解放军海军工程大学 Ship equipment technical data searching and sorting method
US11416575B2 (en) 2020-07-06 2022-08-16 Grokit Data, Inc. Automation system and method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040080532A1 (en) * 2002-10-29 2004-04-29 International Business Machines Corporation Apparatus and method for automatically highlighting text in an electronic document
EP1679617A2 (en) * 2005-01-07 2006-07-12 Palo Alto Research Center Incorporated Method for automatically performing conceptual highlighting in electronic text
US20080098300A1 (en) * 2006-10-24 2008-04-24 Brilliant Shopper, Inc. Method and system for extracting information from web pages
WO2008103623A1 (en) * 2007-02-22 2008-08-28 Microsoft Corporation Synonym and similar word page search
US7451099B2 (en) 2000-08-30 2008-11-11 Kontera Technologies, Inc. Dynamic document context mark-up technique implemented over a computer network
US20110119262A1 (en) * 2009-11-13 2011-05-19 Dexter Jeffrey M Method and System for Grouping Chunks Extracted from A Document, Highlighting the Location of A Document Chunk Within A Document, and Ranking Hyperlinks Within A Document

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11416575B2 (en) 2020-07-06 2022-08-16 Grokit Data, Inc. Automation system and method
US11568019B2 (en) 2020-07-06 2023-01-31 Grokit Data, Inc. Automation system and method
US11580190B2 (en) 2020-07-06 2023-02-14 Grokit Data, Inc. Automation system and method
US11640440B2 (en) * 2020-07-06 2023-05-02 Grokit Data, Inc. Automation system and method
US11860967B2 (en) 2020-07-06 2024-01-02 The Iremedy Healthcare Companies, Inc. Automation system and method
CN112100213A (en) * 2020-09-07 2020-12-18 中国人民解放军海军工程大学 Ship equipment technical data searching and sorting method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11736000

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02/06/2014)

122 Ep: pct application non-entry in european phase

Ref document number: 11736000

Country of ref document: EP

Kind code of ref document: A1