WO2009107148A1 - Metadata extraction from naturally hierarchical information sources - Google Patents

Metadata extraction from naturally hierarchical information sources

Info

Publication number
WO2009107148A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data element
meta
source
elements
Prior art date
Application number
PCT/IN2009/000090
Other languages
French (fr)
Inventor
Vikalp Sahni
Muddassir Hasan
Rajesh Warrier
Ruban Phukan
Original Assignee
Ibibo Web Pvt. Ltd.
Ibibo (Mauritius) Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ibibo Web Pvt. Ltd., Ibibo (Mauritius) Ltd. filed Critical Ibibo Web Pvt. Ltd.
Publication of WO2009107148A1 publication Critical patent/WO2009107148A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present disclosure relates to extraction of information.
  • the present disclosure relates to the extraction of information from naturally hierarchical information sources.
  • a conventional crawler is an application capable of crawling information sources, including the internet and enterprise intranets, and essentially copies data pages or relevant sections of the data pages for subsequent indexing by a search engine.
  • the indexing of the information sources by the search engine provides for quicker searches on the information sources.
  • a crawler can take a number of inputs for each crawl, such as initial seed URLs, depth or width of the crawl, URL patterns, etc., resulting in a crawl that may be used for various applications or situations, including as a feed for the search engine.
  • Naturally hierarchical information sources refer to sources that occur naturally in a predetermined pattern. Such sources typically require that the parent data page be taken into consideration for reaching the final data page. Further, such sources include, at the simplest end, books and anthologies, hierarchical file systems, and the World Wide Web.
  • a disk file system is a naturally hierarchical information source wherein the data available in a data file can be reached through a predetermined hierarchy that can, for example, be disk partition, folder, data file. This hierarchy carries information about the data file and other similar objects that reside in the same hierarchy.
  • Web portals may also be considered naturally hierarchical sources where a data page is reached through a main page via other intermediate pages. In such web portals, significant information about a data page is often available in the path followed to reach the required data page from the main page. This path may be considered the hierarchy of the data page.
  • the information available along the hierarchy of a data page is not captured during a crawl as information relevant for the data page. It would be useful to use data available along the hierarchy of a data page to improve the quality of data associated with the data page.
  • the system includes a crawler for traversing and fetching information through a hierarchical information source.
  • the crawled information can be clustered based on configurable parameters such as number of pages to traverse, frequency of traversing, form filling data for each website if the information source is web based and the like.
  • the traversed information after clustering can be converted into a more structured format such as XML or XHTML.
  • an extractor generates a wrapper of the information to be fetched from the hierarchical information source.
  • the extractor provides for extraction of the information present within the tags of web based languages such as HTML or XML.
  • the extracted information is forwarded to an information processor for filtering and classifying the information.
  • a feed file is generated by a feed processor based on a format specified by the user.
  • the invention relates to a method of generating meta data for a data element in a data source comprising extracting meta data for the data element from the data element and extracting meta data for the data element from at least one other data element related to the data element in the data source and adding the meta data extracted from the at least one other data element to the meta data of the data element.
  • the invention relates to a computer implemented method for creating a database for feeding search requests comprising crawling a data source to retrieve specified data elements, the data retrieved including data on the relation of a data element to other data elements in the hierarchy of the data source; generating meta data for a data element by extracting meta data for the data element from the data element and extracting meta data for the data element from at least one other data element related to the data element and adding the meta data extracted from the at least one other data element to the meta data of the data element; and storing in a database the meta data of the data element for feeding search requests.
  • the invention relates to a system for feeding search requests comprising a crawler for crawling a data source to retrieve specified data elements, the data retrieved including data on the relation of a data element to other data elements in the hierarchy of the data source; an extractor for generating meta data for a data element by extracting meta data for the data element from the data element and extracting meta data for the data element from at least one other data element related to the data element and adding the meta data extracted from the at least one other data element to the meta data of the data element; and a database for storing the meta data of the data element for feeding search requests.
  • the invention relates to a crawler for crawling a data source comprising a configuration loader module for defining the parameters of a crawl on a data source; the data source comprising a plurality of data elements, wherein the data source is a naturally hierarchical source with data elements arranged in a hierarchical relation to one another; an HTTP module configured to access the data elements from the data source; and a listings module configured to extract the relation of a data element with other data elements of the data source.
  • Figure 1 illustrates an exemplary system for crawling and extracting information from a hierarchical information source.
  • Figure 2 illustrates an exemplary method of crawling and extracting information along a crawl path from a hierarchical information source.
  • Figure 3 illustrates the crawler components for crawling and extracting information from a hierarchical information source.
  • Figure 4 illustrates an exemplary table listing the result of a crawl on a hierarchical information source.
  • Systems and methods in accordance with an aspect of the invention relate to the extraction of information and in particular, to the extraction of information from naturally hierarchical information sources.
  • the systems and methods include information retrieval and indexing processes.
  • the system and method in accordance with one aspect improve information extraction from a hierarchical information source by associating metadata around linkage patterns.
  • a data element refers to any document, page, listing, list or any part thereof including any collection of data.
  • a data source may be considered to comprise a plurality of data elements.
  • a data element may also be considered a data source if it comprises more than one data element.
  • Systems and methods in accordance with an aspect relate to the extraction and indexing of data related to a document or a data element in a naturally organized corpus, where the corpus could be a bibliography of scientific papers, the entire web or a part of it, a file system on a document server, etc.
  • the extraction of metadata for a data element is based on its location relative to other data elements in the database and the way other data elements refer to it.
  • the system and method extract peripheral information from a referring document and associate it with the referred document.
  • a crawl and extraction system and method is provided that simplifies the task of obtaining information from the vast resources of the World Wide Web and can be used to power various vertical searches.
  • Figure 1 illustrates a system 100 for crawling and extracting information from hierarchical information sources.
  • the system preferably employs pluggable pipeline architecture and includes a crawler 102, an extractor 104, a document processor 106, a controller 108, a feed processor 110 and a database 112.
  • the system depicted is a schematic illustration to explain the working of the claimed subject matter and does not define or limit the scope of the claimed subject matter.
  • the controller 108 controls the operation of the system to crawl and extract information from hierarchical information sources.
  • the controller 108 schedules the crawl jobs, transfers the crawled files to an extractor 104 and keeps record of the crawled files and extracted information. Further, the controller directs the extracted information to the document processor 106 and the final information for storage in the database 112.
  • a label file is manually generated for each data source that maps the labels or headings that are required from the information source to database fields.
  • An extractor module reads the label file and generates a wrapper for the data source, or the specific data pages within the data source, and intelligently extracts data elements corresponding to each label or heading and maps it to the corresponding database fields.
  • the extracted data is preferably passed through a document processing pipeline where noise and incomplete data is eliminated. Feeds may then be generated from this database.
  • the crawler 102 fetches information from various hierarchical information sources including a web based source of information like a web site.
  • the data fetched by the crawler is stored by the system in storage for subsequent access by the extractor module.
  • the data retrieved by the crawler includes linkage patterns, hierarchy of data elements within the data source in addition to the data elements.
  • the information source may be referred to as a site.
  • the crawler 102 may be provided with numerous parameters including number of pages to crawl, frequency of crawling, URL pattern to follow, form filling data for each site etc.
  • the crawler 102 has the capability to handle form submissions and can do a deep crawl of the site.
  • the crawler 102 uses the URL patterns to restrict the crawl to a certain portion of the site.
  • the crawled files may be clustered based on the URL pattern and may be converted into a more structured XHTML or XML format.
  • the crawler 102 may also add attributes that are related to the documents crawled but are not directly available in the documents themselves. Such attributes may include, for example, the query parameters or form submission values that are inputs to generate the listing of pages.
  • the crawler 102 may map data present in the listings page with the data page and thus make it available for subsequent extraction by the extractor. By way of specific example, for a particular site where all the business news items are listed on a single page, clicking a news item yields the news article. However, the article itself may not specifically indicate that it is a business related article. As the crawler traverses the hierarchy of the particular news article to reach it, it identifies that the article was reached through the business section. The meta-data of the article is modified to include this information.
  • a data page or a document higher in the hierarchy may include data related to multiple data items lower in the hierarchy. Such data may be intelligently mapped to each such data item lower in the hierarchy to complete its data tuple. For example, under a particular section of a web site, numerous sub sections may be present. The meta data associated with the section may be relevant for each of the sub sections and is mapped to each of them. It is this recognition of the links between data elements by the crawler, and the retrieval of data elements along with their associated linkage patterns and hierarchy, that enables the system.
  • the extractor 104 generates a wrapper for the information to be fetched from the hierarchical information source.
  • the wrapper serves as a reference file to the extractor for the information to be extracted from the data source.
  • the extractor works on the data crawled and stored by the system and extracts the required information.
  • the extractor 104 is capable of grouping elements of similar format that occur multiple times such as user comments and reviews.
  • the extractor 104 provides for extraction of information present within the tags of web based languages such as HTML or XML.
  • the extractor 104 is run through all the pages of a data source and identifies the information relating to a label in the database and extracts the information.
  • the system includes a document processor 106 having pluggable pipeline architecture where generic, vertical or site based modules such as filters, classifiers and the like can be plugged in.
  • the document processor 106 includes a location mapper that concatenates the location hierarchy to the location fields.
  • the document processor 106 further has a category analyzer that populates category based fields based on ranking and keywords.
  • the document processor also functions as a filter that is used to generate homogenous data for search platforms.
  • the data collected by the system resides in different sites, with each site following its own format or style of displaying information. For example, a first site may display date information in a DD/MM/YYYY format, while another site may display the same information in a YYYY/MM/DD format. Such information, when extracted, is converted by the document processor to a standard format for easy, machine-readable retrieval.
  • the document processor may also be configured to remove junk data or extra data not relevant for the search platform.
  • the feed processor 110 has a pluggable pipeline architecture that makes use of the data stored within the database 112 for creating a feed file.
  • the feeds are generated according to the requirements of a client which further eases the processing of the feed at the client end.
  • the feed file is given as input to a search engine for efficient and quicker search results.
  • the feed processor retrieves and presents data to a client from the database, as per the requirements of the client.
  • the data in the database may be presented in varying formats based on the requirements provided to the feed processor.
  • the system is configured to extract information from structured and semi-structured web pages.
  • the structure of a source assists the system in identifying hierarchies and linkage patterns and to co-relate data elements.
  • the label file created, often in view of known structures of a data source, is used by the extractor to generate the wrapper that reflects the data and its structure to be extracted.
  • Figure 2 illustrates a method 200 for extraction and indexing of information relating to a document in a naturally hierarchical information source.
  • the source can be a bibliography of scientific papers, the entire web or a part of it, a file system on a document server, etc.
  • the extraction of information from a document is based on its location relative to other documents in the database and the way other documents refer to it.
  • the system and method as herein described extract peripheral information from a referring document and associate it with the referred document.
  • a label file is created for each data source that maps the labels or the headings in the data source to the database fields.
  • the label file may be created from a matrimony website having labels such as name, gender, age, location, religion and the like. The above-mentioned labels are mapped to the corresponding database fields.
  • an extractor module 104 generates a wrapper for the data source, or the specific data pages within the data source, that have been crawled and stored in a store.
  • an extractor module generates a wrapper for the information source such as the matrimony website or specific data pages within the matrimony website that have been crawled.
  • the crawler crawls the data source and retrieves the specified data elements including data on the relation with other data elements and stores it in a store.
  • the wrapper intelligently extracts the data elements and the meta data for the data elements from the data source corresponding to each label or heading and maps them to the corresponding database fields.
  • the wrapper intelligently extracts data elements such as age, gender, religion, etc., corresponding to each label of a matrimony website and maps them to the corresponding database fields.
  • the wrapper also extracts information.
  • the extracted data is preferably passed through an information processing pipeline 106 for eliminating noise and incomplete data.
  • the feed processor 110 makes use of the information stored within the database 112 for generating a feed file.
  • the feeds are generated according to the requirements of the client which eases the processing of the feed at the client end.
  • the feed processor makes use of the stored information relating to the matrimony website for generating a feed file.
  • the feed file is provided as input to a search application.
  • the feed file containing data relating to the matrimony website is provided as input to the search application.
  • the document linkage patterns may be initially defined manually at the time of generation of the label file and the corresponding wrapper, with the actual linkages discovered automatically subsequently. These links between index or referring documents and target or referred documents are followed automatically, and the information associated with the link on the referring page is propagated when the link is followed. At the end of the crawl, the entire information extracted over the path is stored along with the target document in a parsed form.
  • the system may be configured to identify the references of data elements to each other in the hierarchy. This is the linkage between individual data elements of the data source. Once trained for a source, the system automatically identifies other possible hierarchies for that source and extracts corresponding data accordingly. For example, for a website, the system has to be manually trained to identify the final data element of the hierarchy (the data page). It then follows the path required to reach that data element and can thus also extract the information that is available while reaching it.
  • Figure 3 illustrates components of a crawler 300 that include a configuration loader module 302 that loads manually specified parameters required while crawling or traversing a website. These parameters may include a category or vertical of a site, such as Jobs or News, or the pattern of the data URLs.
  • the configuration loader module 302 may also load certain optional parameters such as politeness and depth.
  • the crawler 102 further includes an HTTP module 304 that behaves like a browser, accessing pages from a source by handling various HTTP requests such as GET and POST.
  • the HTTP module 304 is also capable of emulating a browser, such as Mozilla, allowing it to store cookies and to fill the forms required to get the data or the listings page.
  • the HTTP module 304 uses configuration parameters to understand the vertical and other required parameters of the site.
  • the crawler 102 includes a GET Links module 306 that acts as an engine to fetch, from the listings page, the links that are required to be crawled and used as data pages.
  • the GET Links module 306 uses the data fetched by the HTTP module 304, along with certain configuration parameters, to understand which links are to be collected.
  • the GET Links module 306 may be enabled to carry out checks and profiling of the links, such as duplicate removal.
  • the GET Links module 306 also preferably has a smart JavaScript (JS) engine attached to it that fetches links from websites where the URLs of data pages are generated dynamically using JS code embedded in the HTML.
  • the crawler 102 further includes a component called a Listings Module 308 that manages attributes related to documents crawled but not available in the documents themselves. These parameters include the query parameters or form submission values that yield the listing of pages.
  • the listings module is configured to extract the relation of a data element with other data elements of the data source.
  • the crawler also includes a Site Traversal module 310 that is capable of reading the parameters loaded by the configuration loader module 302 to understand the depth of the crawl, and then studies the page to fetch the next available listings page from the site.
  • the Site Traversal Module 310 is enabled to take decisions on stopping or continuing the process of crawl for a particular site.
  • Figure 4 illustrates details of experimental data in tabular form from the implementation of the system and method as taught herein. The experiment describes the way a naturally hierarchical information source is used to extract information. The order in which the experimental data is presented should not be construed as a limitation, and any number of the described columns can be combined in any order to present experimental data. Additionally, individual columns may be deleted to describe the experimental data without departing from the spirit and scope of the subject matter described herein. The data is divided into three columns as follows:
  • Column 1 contains the initial page of a matrimony web site, from which a query form is submitted to the site.
  • On the details page of column 3, information such as "Gender" or "unique ref id" is not available as structured data for extraction. This information may, however, be determined from the parent links.
  • The "gender" is identified from column 1, whereas other information, such as the unique reference ID or profile ID, may be obtained from the listings page of column 2.
  • The information from a matrimony website is thus divided into columns, and data that is not directly available as structured information for extraction is nonetheless extracted and stored in a database. Later, this indirectly obtained data is given as input to a search engine for quick and efficient search results.
  • the embodiment described above was presented in order to best explain the principles of the invention and its practical applications, and to enable others of ordinary skill in the art to understand the invention. Various other embodiments, having various modifications, may be suited to a particular use contemplated and are considered to be within the scope of the present invention.
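The experiment above can be restated as a small sketch: fields visible only higher in the hierarchy (the gender submitted in the query form of column 1, the profile ID shown on the listings page of column 2) are merged into the details record of column 3. All field and variable names here are hypothetical, not taken from the patent:

```python
# Merge metadata gathered along the crawl hierarchy into the final record.
# Column 1: the query form (gender was a form input), column 2: the listings
# page (carries the profile ID), column 3: the details page itself.

def merge_hierarchy_metadata(form_values, listing_entry, details_record):
    """Complete a details record with fields only visible higher in the path."""
    record = dict(details_record)
    # "Gender" is never printed on the details page; it was a form input.
    record.setdefault("gender", form_values.get("gender"))
    # The unique reference ID appears only on the listings page.
    record.setdefault("profile_id", listing_entry.get("profile_id"))
    return record

form_values = {"gender": "female", "age_min": "25"}
listing_entry = {"profile_id": "MAT-10234", "url": "/profile/10234"}
details_record = {"religion": "...", "location": "Mumbai"}

complete = merge_hierarchy_metadata(form_values, listing_entry, details_record)
```

The completed record, rather than the bare details page, is what would be stored for feeding search requests.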

Abstract

The invention relates to a method of generating meta data for a data element in a data source comprising extracting meta data for the data element from the data element and extracting meta data for the data element from at least one other data element related to the data element in the data source and adding the meta data extracted from the at least one other data element to the meta data of the data element.

Description

The present disclosure relates to extraction of information. In particular, the present disclosure relates to the extraction of information from naturally hierarchical information sources.
BACKGROUND
A conventional crawler is an application capable of crawling information sources, including the internet and enterprise intranets, and essentially copies data pages or relevant sections of the data pages for subsequent indexing by a search engine. The indexing of the information sources by the search engine provides for quicker searches on the information sources. For each crawl, a crawler can take a number of inputs, such as initial seed URLs, depth or width of the crawl, URL patterns, etc., resulting in a crawl that may be used for various applications or situations, including as a feed for the search engine.
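The crawl inputs just described (seed URLs, depth or width, URL patterns) can be pictured as a small configuration object. This is an illustrative sketch only; the class and field names are assumptions, not the patent's API:

```python
# A crawl configuration: seed URLs, a depth limit, and URL patterns that
# restrict which links the crawl follows.
import re
from dataclasses import dataclass, field

@dataclass
class CrawlConfig:
    seed_urls: list
    max_depth: int = 3
    url_patterns: list = field(default_factory=list)  # regexes to follow

    def should_follow(self, url: str, depth: int) -> bool:
        # A link is followed only within the depth limit and only if it
        # matches at least one configured URL pattern.
        if depth > self.max_depth:
            return False
        return any(re.search(p, url) for p in self.url_patterns)

cfg = CrawlConfig(
    seed_urls=["http://example.com/jobs/"],
    max_depth=2,
    url_patterns=[r"/jobs/"],
)
```

Restricting the crawl by pattern is what lets a crawl target one vertical of a site, as the patent describes later.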
Naturally hierarchical information sources refer to sources that occur naturally in a predetermined pattern. Such sources typically require that the parent data page be taken into consideration for reaching the final data page. Further, such sources include, at the simplest end, books and anthologies, hierarchical file systems, and the World Wide Web. For example, a disk file system is a naturally hierarchical information source wherein the data available in a data file can be reached through a predetermined hierarchy that can, for example, be disk partition, folder, data file. This hierarchy carries information about the data file and other similar objects that reside in the same hierarchy. Web portals may also be considered naturally hierarchical sources where a data page is reached through a main page via other intermediate pages. In such web portals, significant information about a data page is often available in the path followed to reach the required data page from the main page. This path may be considered the hierarchy of the data page.
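The disk-partition/folder/file example can be made concrete: the path used to reach a file already yields metadata for it. A minimal sketch, with hypothetical field names:

```python
# Derive metadata for a file from the hierarchy traversed to reach it.
from pathlib import PurePosixPath

def metadata_from_path(path: str) -> dict:
    parts = PurePosixPath(path).parts
    # Each ancestor folder contributes a level of context to the leaf file.
    return {
        "file": parts[-1],
        "hierarchy": list(parts[:-1]),
        "category": parts[-2] if len(parts) > 1 else None,
    }

meta = metadata_from_path("/data/news/business/article-42.html")
```

Here the article file carries no category of its own; "business" is recoverable only from the path, which is exactly the information a conventional crawl discards.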
The information available along the hierarchy of a data page is not captured during a crawl as information relevant for the data page. It would be useful to use data available along the hierarchy of a data page to improve the quality of data associated with the data page.
SUMMARY
This summary is provided to introduce concepts relating to extraction of information from hierarchical information sources. These concepts are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining the scope of the claimed subject matter.
In one implementation, the system includes a crawler for traversing and fetching information through a hierarchical information source. The crawled information can be clustered based on configurable parameters such as number of pages to traverse, frequency of traversing, form filling data for each website if the information source is web based and the like. The traversed information after clustering can be converted into a more structured format such as XML or XHTML.
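The clustering of crawled files described above might look like the following sketch, where the matching URL pattern itself serves as the cluster key (the grouping key is an assumption; the patent does not fix one):

```python
# Group crawled URLs by the URL pattern they match.
import re
from collections import defaultdict

def cluster_by_pattern(urls, patterns):
    """Group URLs under the first pattern they match; unmatched go to None."""
    clusters = defaultdict(list)
    for url in urls:
        key = next((p for p in patterns if re.search(p, url)), None)
        clusters[key].append(url)
    return dict(clusters)

urls = [
    "http://site.example/news/1",
    "http://site.example/news/2",
    "http://site.example/jobs/77",
]
clusters = cluster_by_pattern(urls, [r"/news/", r"/jobs/"])
```

Pages in the same cluster typically share a template, which is what makes a per-cluster conversion to XHTML/XML and a per-cluster wrapper practical.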
Furthermore, an extractor generates a wrapper of the information to be fetched from the hierarchical information source. The extractor provides for extraction of the information present within the tags of web based languages such as HTML or XML. The extracted information is forwarded to an information processor for filtering and classifying the information. Subsequently, a feed file is generated by a feed processor based on a format specified by the user.
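The extractor's role of pulling information from within markup tags can be illustrated with a toy wrapper: a mapping from tag names to database fields. The tag and field names below are hypothetical:

```python
# Extract tagged values from an XML page according to a wrapper mapping.
import xml.etree.ElementTree as ET

# The "wrapper": which tags to read, and the database field each maps to.
WRAPPER = {"Name": "name", "Age": "age", "Location": "location"}

def extract(xml_text: str, wrapper: dict) -> dict:
    root = ET.fromstring(xml_text)
    record = {}
    for tag, db_field in wrapper.items():
        node = root.find(tag)
        if node is not None and node.text:
            record[db_field] = node.text.strip()
    return record

page = "<profile><Name>A. Kumar</Name><Age>29</Age></profile>"
record = extract(page, WRAPPER)
```

Tags absent from a given page (here, Location) are simply skipped, leaving an incomplete record for the document processor to filter or complete.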
The invention relates to a method of generating meta data for a data element in a data source comprising extracting meta data for the data element from the data element and extracting meta data for the data element from at least one other data element related to the data element in the data source and adding the meta data extracted from the at least one other data element to the meta data of the data element.
The invention relates to a computer implemented method for creating a database for feeding search requests comprising crawling a data source to retrieve specified data elements, the data retrieved including data on the relation of a data element to other data elements in the hierarchy of the data source; generating meta data for a data element by extracting meta data for the data element from the data element and extracting meta data for the data element from at least one other data element related to the data element and adding the meta data extracted from the at least one other data element to the meta data of the data element; and storing in a database the meta data of the data element for feeding search requests.
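The claimed method can be summarized in a few lines: metadata extracted from the data element itself is merged with metadata extracted from a related (here, parent) element, and the result is stored for feeding search requests. The data structures are illustrative assumptions, not the patent's own:

```python
# Metadata of the element itself wins; the related element only fills gaps.
def generate_metadata(element, parent):
    own = dict(element.get("meta", {}))
    inherited = {k: v for k, v in parent.get("meta", {}).items() if k not in own}
    own.update(inherited)
    return own

# A stand-in for the database that feeds search requests.
database = {}

def store(element_id, meta):
    database[element_id] = meta

article = {"id": "a1", "meta": {"title": "Markets rally"}}
section = {"id": "s1", "meta": {"category": "business"}}
store(article["id"], generate_metadata(article, section))
```

The merge direction matters: data found on the element itself is assumed more specific than data inherited from the hierarchy, so inherited keys never overwrite it.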
The invention relates to a system for feeding search requests comprising a crawler for crawling a data source to retrieve specified data elements, the data retrieved including data on the relation of a data element to other data elements in the hierarchy of the data source; an extractor for generating meta data for a data element by extracting meta data for the data element from the data element and extracting meta data for the data element from at least one other data element related to the data element and adding the meta data extracted from the at least one other data element to the meta data of the data element; and a database for storing the meta data of the data element for feeding search requests.
The invention relates to a crawler for crawling a data source comprising a configuration loader module for defining the parameters of a crawl on a data source; the data source comprising a plurality of data elements, wherein the data source is a naturally hierarchical source with data elements arranged in a hierarchical relation to one another; an HTTP module configured to access the data elements from the data source; and a listings module configured to extract the relation of a data element with other data elements of the data source.
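One possible shape for this crawler's modules is sketched below. Module boundaries and method names are assumptions, and the HTTP module is reduced to a stub since network access is incidental to the structure:

```python
# A structural sketch of the claimed crawler's modules.

class ConfigurationLoader:
    """Holds the parameters defining a crawl (vertical, depth, patterns)."""
    def __init__(self, params):
        self.params = params

class ListingsModule:
    """Extracts the relation of a data element to other elements."""
    def relate(self, listing_url, data_urls):
        # Each data page is recorded as a child of the listings page it
        # was discovered on.
        return {url: {"parent": listing_url} for url in data_urls}

class Crawler:
    def __init__(self, loader, listings):
        self.loader = loader
        self.listings = listings

loader = ConfigurationLoader({"vertical": "News", "depth": 2})
crawler = Crawler(loader, ListingsModule())
relations = crawler.listings.relate("/news", ["/news/1", "/news/2"])
```

The relation records produced by the listings module are what later allow metadata found on the listings page to be mapped onto each data page.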
BRIEF DESCRIPTION OF ACCOMPANYING DRAWINGS
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to reference like features and components.
Figure 1 illustrates an exemplary system for crawling and extracting information from a hierarchical information source.
Figure 2 illustrates an exemplary method of crawling and extracting information along a crawl path from a hierarchical information source.
Figure 3 illustrates the crawler components for crawling and extracting information from a hierarchical information source.
Figure 4 illustrates an exemplary table listing the result of a crawl on a hierarchical information source.
DETAILED DESCRIPTION
For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings, and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended; such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as illustrated therein, are contemplated as would normally occur to one skilled in the art to which the invention relates.
It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the invention and are not intended to be restrictive thereof. The following disclosure describes systems and methods for crawling and extracting information from hierarchical information sources. While aspects of described systems and methods for crawling and extracting information from hierarchical information sources can be implemented in any number of different computing systems, environments, and/or configurations, the embodiments are described in the context of the following exemplary system(s) and method(s).
Systems and methods in accordance with an aspect of the invention relate to the extraction of information and, in particular, to the extraction of information from naturally hierarchical information sources. The systems and methods include information retrieval and indexing processes. The system and method in accordance with one aspect improve information extraction from a hierarchical information source by associating metadata around linkage patterns.
As used herein, a data element refers to any document, page, listing, list or any part thereof, including any collection of data. A data source may be considered to comprise a plurality of data elements. A data element may also be considered a data source if it comprises more than one data element.
Systems and methods in accordance with an aspect relate to the extraction and indexing of data related to a document or a data element in a naturally organized corpus, where the corpus could be a bibliography of scientific papers, the entire web or a part of it, a file system on a document server, etc. The extraction of metadata for a data element is based on its location relative to other data elements in the database and on the way other data elements refer to it. The system and method extract peripheral information from a referring document and associate it with the referred document. A crawl and extraction system and method is provided that simplifies the task of obtaining information from the vast resources of the World Wide Web and can be used to power various vertical searches.
Figure 1 illustrates a system 100 for crawling and extracting information from hierarchical information sources. The system preferably employs a pluggable pipeline architecture and includes a crawler 102, an extractor 104, a document processor 106, a controller 108, a feed processor 110 and a database 112. The system depicted is a schematic illustration to explain the working of the claimed subject matter and does not define or limit the scope of the claimed subject matter.
The controller 108 controls the operation of the system to crawl and extract information from hierarchical information sources. The controller 108 schedules the crawl jobs, transfers the crawled files to the extractor 104 and keeps a record of the crawled files and extracted information. Further, the controller directs the extracted information to the document processor 106 and directs the final information to the database 112 for storage.
A label file is manually generated for each data source that maps the labels or headings that are required from the information source to database fields. An extractor module reads the label file, generates a wrapper for the data source, or for the specific data pages within the data source, and intelligently extracts the data elements corresponding to each label or heading, mapping them to the corresponding database fields. The extracted data is preferably passed through a document processing pipeline where noise and incomplete data are eliminated. Feeds may then be generated from this database.
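The label-file mechanism described above can be sketched as follows. This is a minimal illustration, not the patented format: the label names, field names and the dictionary representation of a pre-parsed page are all invented for the example.

```python
# Hypothetical "label file": maps headings found on a data source
# to database fields. In practice this would be hand-written per site.
LABEL_FILE = {
    "Name": "db_name",
    "Age": "db_age",
    "Location": "db_location",
}

def extract_with_labels(page, label_file):
    """Act as a minimal wrapper: for each label present on the
    (pre-parsed) page, copy its value to the mapped database field."""
    record = {}
    for label, field in label_file.items():
        if label in page:
            record[field] = page[label]
    return record

# A pre-parsed page, represented here simply as label -> value pairs.
page = {"Name": "A. Sharma", "Age": "29", "Location": "Pune"}
print(extract_with_labels(page, LABEL_FILE))
```

A real extractor would parse the HTML/XHTML to locate each label before applying such a mapping; the dictionary stands in for that parsing step.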
The crawler 102 fetches information from various hierarchical information sources, including web based sources of information such as web sites. The data fetched by the crawler is stored by the system for subsequent access by the extractor module. The data retrieved by the crawler includes linkage patterns and the hierarchy of data elements within the data source, in addition to the data elements themselves. For the purpose of this disclosure the information source may be referred to as a site. The crawler 102 may be provided with numerous parameters, including the number of pages to crawl, the frequency of crawling, the URL pattern to follow, form filling data for each site, etc. The crawler 102 has the capability to handle form submissions and can do a deep crawl of the site. The crawler 102 uses the URL patterns to restrict the crawl to a certain portion of the site. This is especially useful when only a few pages need to be crawled from every site in a certain time interval, such as jobs or classifieds. The crawled files may be clustered based on the URL pattern and may be converted into a more structured XHTML or XML format.
In one implementation, the crawler 102 may also add attributes that are related to the crawled documents but are not directly available in the documents themselves. Such attributes may include, for example, the query parameters or form submission values that are inputs to generate the listing of pages. In accordance with an aspect, and to increase the related data available, the crawler 102 may map data present in the listings page with the data page and thus make it available for subsequent extraction by the extractor. By way of specific example, for a particular site where all the business news items are listed on a single page, clicking a news item yields the news article. However, the article itself may not specifically indicate that it is a business related article. As the crawler traverses the hierarchy of the particular news article to reach it, it identifies that the article was reached through the business section. The meta-data of the article is modified to include this information.
In another implementation, a data page or a document higher in the hierarchy may include data related to multiple data items lower in the hierarchy. Such data may be intelligently mapped to each such data item lower in the hierarchy to complete its data tuple. For example, under a particular section of a web site, numerous sub sections may be present. The meta data associated with the section may be relevant for each of the sub sections and is mapped to each of the sub sections. This recognition of the links between data elements by the crawler, and the retrieval of data elements along with the associated linkage patterns and hierarchy, enables the system to generate richer meta data for each data element.
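The propagation of meta data from a section down to its sub sections can be sketched as below. The tree representation and the field names (`meta`, `children`, `section`, `title`) are assumptions made for illustration only; the disclosure does not prescribe a concrete data structure.

```python
def propagate(node, inherited=None):
    """Merge meta data inherited from ancestors into each node's own
    meta data, then recurse into the children, so that every data
    element lower in the hierarchy carries its ancestors' meta data."""
    inherited = dict(inherited or {})
    inherited.update(node.get("meta", {}))  # node's own meta wins on conflict
    node["meta"] = dict(inherited)
    for child in node.get("children", []):
        propagate(child, inherited)
    return node

# A toy hierarchy: a "business" section containing one article.
site = {
    "meta": {"section": "business"},
    "children": [
        {"meta": {"title": "Markets rally"}, "children": []},
    ],
}
propagate(site)
# The article now also carries the section-level meta data.
print(site["children"][0]["meta"])
```

After propagation, the article's meta data includes `"section": "business"` even though that fact never appeared on the article page itself, matching the business-news example above.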
The extractor 104 generates a wrapper for the information to be fetched from the hierarchical information source. The wrapper serves as a reference file to the extractor for the information to be extracted from the data source. The extractor works on the data crawled and stored by the system and extracts the required information. The extractor 104 is capable of grouping elements of similar format that occur multiple times, such as user comments and reviews. Also, the extractor 104 provides for extraction of information present within the tags of web based languages such as HTML or XML. The extractor 104 is run over all the pages of a data source, identifies the information relating to each label in the database, and extracts it.
The system includes a document processor 106 having a pluggable pipeline architecture where generic, vertical or site based modules such as filters, classifiers and the like can be plugged in. The document processor 106 includes a location mapper that concatenates the location hierarchy to the location fields. The document processor 106 further has a category analyzer that populates category based fields based on ranking and keywords.
The document processor also functions as a filter that generates homogeneous data for search platforms, since the data collected by the system resides on different sites, with each site following its own format or style of displaying information. For example, a first site may display date information in a DD/MM/YYYY format, while another site may display the same information in a YYYY/MM/DD format. Such information, when extracted, is converted by the document processor to a standard format for easy, machine readable retrieval. The document processor may also be configured to remove junk data or extra data not relevant to the search platform.
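The date-normalization step just described can be sketched with the standard library. The per-site format table and the choice of ISO 8601 as the standard format are assumptions for the example; the disclosure only requires that all sites converge on one format.

```python
from datetime import datetime

# Assumed per-site date formats; each site declares the order it uses.
SITE_DATE_FORMATS = {
    "site_a": "%d/%m/%Y",   # DD/MM/YYYY
    "site_b": "%Y/%m/%d",   # YYYY/MM/DD
}

def normalize_date(raw, site):
    """Parse a site-specific date string and re-emit it in one
    standard, machine-readable format (ISO 8601, YYYY-MM-DD)."""
    fmt = SITE_DATE_FORMATS[site]
    return datetime.strptime(raw, fmt).date().isoformat()

# The same calendar date, displayed differently by two sites,
# normalizes to the same stored value.
print(normalize_date("26/02/2008", "site_a"))  # -> 2008-02-26
print(normalize_date("2008/02/26", "site_b"))  # -> 2008-02-26
```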
The feed processor 110 has a pluggable pipeline architecture that makes use of the data stored within the database 112 for creating a feed file. The feeds are generated according to the requirements of a client, which further eases the processing of the feed at the client end. The feed file is given as input to a search engine for efficient and quicker search results. The feed processor retrieves and presents data to a client from the database, as per the requirements of the client. The data in the database may be presented in varying formats based on the requirements provided to the feed processor.
The system is configured to extract information from structured and semi-structured web pages. The structure of a source assists the system in identifying hierarchies and linkage patterns and in correlating data elements. The label file, often created in view of known structures of a data source, is used by the extractor to generate the wrapper that reflects the data, and the structure of the data, to be extracted.
Figure 2 illustrates a method 200 for extraction and indexing of information relating to a document in a naturally hierarchical information source; the source can be a bibliography of scientific papers, the entire web or a part of it, a file system on a document server, etc. The extraction of information for a document is based on its location relative to other documents in the database and on the way other documents refer to it. The system and method as herein described extract peripheral information from a referring document and associate it with the referred document.
The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or an alternative method. Additionally, individual blocks may be deleted from the method without departing from the spirit and scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof.
At block 201, a label file is created for each data source that maps the labels or the headings in the data source to the database fields. In one implementation the label file may be created from a matrimony website having labels such as name, gender, age, location, religion and the like. The above-mentioned labels are mapped to the corresponding database fields.
At block 202, an extractor module 104 generates a wrapper for the data source, or the specific data pages within the data source, that have been crawled and stored in a store. In one implementation an extractor module generates a wrapper for the information source such as the matrimony website or specific data pages within the matrimony website that have been crawled.
At block 203 the crawler crawls the data source and retrieves the specified data elements including data on the relation with other data elements and stores it in a store.
At block 204, the wrapper intelligently extracts data elements, and the meta data for the data elements, from the data source corresponding to each label or heading and maps them to the corresponding database fields. In one implementation the wrapper intelligently extracts data elements such as age, gender and religion corresponding to each label of a matrimony website and maps them to the corresponding database fields. The wrapper also extracts the meta data for these data elements.
At block 205, the extracted data is preferably passed through an information processing pipeline 106 for eliminating noise and incomplete data. In one implementation, the data extracted from the matrimony website is passed through the information processing pipeline for eliminating noise and incomplete data.
At block 206, the feed processor 110 makes use of the information stored within the database 112 for generating a feed file. The feeds are generated according to the requirements of the client, which eases the processing of the feed at the client end. In one implementation the feed processor makes use of the stored information relating to the matrimony website for generating a feed file.
At block 207, the feed file is provided as input to a search application. In one implementation the feed file containing data relating to the matrimony website is provided as input to the search application.
In another implementation, the document linkage patterns may be initially defined manually at the time of generation of the label file and the corresponding wrapper, with the actual linkages discovered automatically subsequently. These links between index or referring documents and target or referred documents are followed automatically, and the information associated with the link on the referring page is propagated when the link is followed. At the end of the crawl, the entire information extracted over the path is stored along with the target document in a parsed form. In another implementation, the system may be configured to identify the references of data elements to each other in the hierarchy. This is the linkage between individual data elements of the data source. Once trained for a source, the system automatically identifies other possible hierarchies for that source and extracts the corresponding data accordingly. For example, for a website, the system has to be manually trained to identify the final data element of the hierarchy (the data page). It then follows the path required to reach that data element and thus can also extract information that is available while reaching that data element.
While moving ahead in an information source or along the hierarchy, information is extracted from the referring document or from documents or links along the hierarchy of a document. Such information extraction enhances the quality of search or lookup on the information source.
Figure 3 illustrates components of a crawler 300, which include a configuration loader module 302 that loads manually specified parameters that are required while crawling or traversing a website. These parameters may include a category or vertical of a site, such as jobs or news, or the pattern of the data URLs. The configuration loader module 302 may also load certain optional parameters such as politeness and depth.
The crawler 102 further includes an HTTP module 304 that behaves like a browser for accessing pages from a source by handling various HTTP requests such as GET and POST. The HTTP module 304 is also capable of emulating a browser, such as Mozilla, thus allowing it to store cookies and enabling it to fill forms required to get the data or the listings page. The HTTP module 304 uses configuration parameters to understand the vertical and other required parameters of the site.
The crawler 102 includes a GET Links module 306 that acts as an engine to fetch links from the listings page that are required to be crawled and to be used as data pages. The GET Links module 306 uses the data fetched by the HTTP module 304, and also certain configuration parameters, to understand the links required to be collected. In accordance with an aspect, the GET Links module 306 may be enabled to carry out checks and profiling of the links, such as duplicate removal. The GET Links module 306 also preferably has a smart JavaScript (JS) engine attached to it that fetches links from websites where the URLs of data pages are generated dynamically using JS code embedded in HTML.
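The core of the link-collection step can be sketched as below: keep only links that match the configured URL pattern and drop duplicates. The regular expression and example URLs are illustrative assumptions; the disclosure does not fix a pattern syntax.

```python
import re

def get_links(candidate_urls, url_pattern):
    """Collect data-page links from a listings page: filter by the
    configured URL pattern and remove duplicates, preserving order."""
    pattern = re.compile(url_pattern)
    seen, result = set(), []
    for url in candidate_urls:
        if pattern.search(url) and url not in seen:
            seen.add(url)
            result.append(url)
    return result

# Links harvested from a hypothetical listings page.
links = [
    "http://example.com/profile?id=1",
    "http://example.com/about",            # not a data page
    "http://example.com/profile?id=1",     # duplicate
    "http://example.com/profile?id=2",
]
print(get_links(links, r"/profile\?id=\d+"))
```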
The crawler 102 further includes a component called a Listings Module 308 that manages attributes related to crawled documents but not available in the documents themselves. These parameters include the query parameters or form submission values that yield the listing of pages. The listings module is configured to extract the relation of a data element with other data elements of the data source.
The crawler also includes a Site Traversal module 310 that is capable of reading the parameters loaded by the configuration loader module 302 to understand the depth of the crawl, and then studies the page to fetch the next available listings page from the site. The Site Traversal module 310 is enabled to take decisions on stopping or continuing the crawl for a particular site.
Figure 4 illustrates details of experimental data, in tabular form, from an implementation of the system and method as taught herein. The experiment describes the way a naturally hierarchical information source is used to extract information. The order in which the experimental data is presented should not be construed as a limitation, and any number of the described columns can be combined in any order to present the experimental data. Additionally, individual columns may be deleted to describe the experimental data without departing from the spirit and scope of the subject matter described herein. The data is divided into three columns as follows:
1. Column 1 contains the initial page of a matrimony web site from which the query form is submitted to the site
2. Column 2 is the data from listings page that shows up when the query is submitted
3. Column 3 is the data from details page that is at the bottom of the hierarchy
As shown in the example, on the details page (column 3), information such as "Gender" and "unique ref id" is not available as structured data for extraction. This information may, however, be determined from the parent links. The "Gender" is identified from column 1, whereas other information, such as the unique reference ID or profile ID, may be obtained from the listings page of column 2. The information from a matrimony website is thus divided into columns, and data that is not directly available as structured information for extraction is nevertheless extracted and stored in a database. Later, such an indirectly obtained data set is given as input to a search engine for quick and efficient search results. The embodiment described above was chosen in order to best explain the principles of the invention and its practical applications, and to enable others of ordinary skill in the art to understand the invention. Various other embodiments, having various modifications, may be suited to a particular use contemplated, and are considered to be within the scope of the present invention.
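The completion of a details-page record from its parent links, as described in the Figure 4 example, can be sketched as follows. The field names and values are invented for illustration; the actual columns are as shown in the figure.

```python
def complete_record(query_params, listing_entry, details):
    """Build a full data tuple for a details page (column 3) by filling
    fields that are only available higher in the hierarchy: the query
    form (column 1) and the listings page (column 2)."""
    record = dict(details)  # structured data from the details page itself
    # Gender was part of the submitted query form, not the details page.
    record.setdefault("gender", query_params.get("gender"))
    # The profile ID was only visible on the listings page.
    record.setdefault("profile_id", listing_entry.get("profile_id"))
    return record

rec = complete_record(
    {"gender": "female"},                 # column 1: query form inputs
    {"profile_id": "MP1234"},             # column 2: listings page data
    {"age": "29", "religion": "Hindu"},   # column 3: details page data
)
print(rec)
```

The resulting tuple carries fields from all three levels of the hierarchy, which is precisely the extra metadata the crawl path makes available.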

Claims

We Claim:
1. A method of generating meta data for a data element in a data source comprising extracting meta data for the data element from the data element and extracting meta data for the data element from at least one other data element related to the data element in the data source and adding the meta data extracted from the at least one other data element to the meta data of the data element.
2. A method of generating meta data for a data element in a data source as claimed in claim 1 wherein extracting meta data from the other data elements includes extracting meta data of the other data elements.
3. A method of generating meta data for a data element in a data source as claimed in claim 1 wherein meta data extracted from another data element includes data on the location of the other data element with relation to the data element.
4. A method of generating meta data for a data element in a data source as claimed in claim 1 wherein the data source is a naturally hierarchical source with data elements arranged in a hierarchical relation to one another.
5. A method of generating meta data for a data element in a data source as claimed in any of the preceding claims wherein the meta data extracted from another data element is added to the meta data of all data elements lower than the other data element in the hierarchy of the data source.
6. A method as claimed in claim 1 wherein a data element is related to another data element if it refers to, is linked to or its position in the hierarchy of the data source is relative to the other data element.
7. A method as claimed in claim 1 wherein the relation of a data element with at least one other data element of the data source is defined prior to the generation of meta data.
8. A computer implemented method for creating a database for feeding search requests comprising: a. crawling a data source to retrieve specified data elements, the data retrieved including data on the relation of a data element to other data elements in the hierarchy of the data source; b. generating meta data for a data element by extracting meta data for the data element from the data element and extracting meta data for the data element from at least one other data element related to the data element and adding the meta data extracted from the at least one other data element to the meta data of the data element; and c. storing in a database the meta data of the data element for feeding search requests.
9. A method as claimed in claim 8 wherein generating meta data for a data element includes creating a wrapper for the data elements for which meta data is to be generated.
10. A method as claimed in claim 8 wherein the meta data generated for a data element is filtered by a document processor before storing in a database.
11. A method as claimed in claim 8 wherein the relation of a data element with at least one other data element of the data source is defined prior to the generation of meta data.
12. A system for feeding search requests comprising a. a crawler for crawling a data source to retrieve specified data elements, the data retrieved including data on the relation of a data element to other data elements in the hierarchy of the data source; b. an extractor for generating meta data for a data element by extracting meta data for the data element from the data element and extracting meta data for the data element from at least one other data element related to the data element and adding the meta data extracted from the at least one other data element to the meta data of the data element; and c. a database for storing the meta data of the data element for feeding search requests.
13. A system as claimed in claim 12 wherein the extractor includes a wrapper generation module for generating a wrapper for the data elements for which meta data is to be generated.
14. A system as claimed in claim 12 comprising a document processor for filtering the meta data extracted by the extractor for homogeneous data storage in the database.
15. A system as claimed in claim 12 wherein the extractor is configured to group similar data elements together where a type of data element appears more than once in a data source.
16. A system as claimed in claim 12 wherein the relation of a data element with at least one other data element of the data source is defined prior to the generation of meta data.
17. A crawler for crawling a data source comprising:
a. a configuration loader module for defining the parameters of a crawl on a data source; the data source comprising a plurality of data elements, wherein the data source is a naturally hierarchical source with data elements arranged in a hierarchical relation to one another; b. a HTTP module configured to access the data elements from the data source; and c. a listings module configured to extract the relation of a data element with other data elements of the data source.
18. A method of generating meta data substantially as herein described with reference to and as illustrated by the accompanying drawings.
19. A computer implemented method for creating a database for feeding search requests substantially as herein described with reference to and as illustrated by the accompanying drawings.
20. A system for feeding search requests substantially as herein described with reference to and as illustrated by the accompanying drawings.
21. A crawler substantially as herein described with reference to and as illustrated by the accompanying drawings.
PCT/IN2009/000090 2008-02-26 2009-02-09 Metadata extraction from naturally hierarchical information sources WO2009107148A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN462/DEL/2008 2008-02-26
IN462DE2008 2008-02-26

Publications (1)

Publication Number Publication Date
WO2009107148A1 true WO2009107148A1 (en) 2009-09-03

Family

ID=41015577


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001027793A2 (en) * 1999-10-14 2001-04-19 360 Powered Corporation Indexing a network with agents
US6611835B1 (en) * 2000-05-04 2003-08-26 International Business Machines Corporation System and method for maintaining up-to-date link information in the metadata repository of a search engine
US20040267721A1 (en) * 2003-06-27 2004-12-30 Dmitriy Meyerzon Normalizing document metadata using directory services
US6931397B1 (en) * 2000-02-11 2005-08-16 International Business Machines Corporation System and method for automatic generation of dynamic search abstracts contain metadata by crawler
US7165069B1 (en) * 1999-06-28 2007-01-16 Alexa Internet Analysis of search activities of users to identify related network sites


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ASHBY J.V. ET AL: "The CLRC Data Portal", BRITISH NATIONAL CONFERENCE ON DATABASES 2001, 2001, Retrieved from the Internet <URL:http://scholar.google.com.au/scholar?hl=en&lr=&q=related:Yz6CU-qPXmQJ:scholar.google.com/&um=1&ie=UTF-8&ei=aC36SabmJc6AkQWM9IzqBA&sa=X&oi=science_links&resnum=2&ct=sl-rel> *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09715306

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09715306

Country of ref document: EP

Kind code of ref document: A1