Selective Updating
Description
The present invention relates to the field of search engines, particularly but not exclusively to a method of selectively updating an index of web pages for use by a search engine in performing searches.
The World Wide Web, or simply 'web', is based on hypertext, which can be thought of as text that is not constrained to be sequential. The web can handle much more than just text, so the more general term hypermedia is used to cover all types of content, including but not limited to pictures, graphics, sound and video. While the primary language for representing hypermedia content on the web is HTML, other markup languages are constantly developing, including, for example, XML. The term hypermedia as used herein is therefore not intended to be limited to any particular web language, nor indeed to the web, but should be interpreted as a general term that can also refer to content on public or private networks which operate according to HyperText Transfer Protocol (HTTP) or other similar protocols.
As mentioned above, HTML is a document mark-up language that is the primary language for creating documents on the web. It defines the structure and layout of a web document by reference to a number of pre-defined tags with associated attributes. The tags and attributes are interpreted and the web page is accordingly displayed by a client application running on a computer, commonly referred to as a browser.
As a result of the vast amount of information available on the web, search engine technology is well established, with a large number of different search engines being available, including those with well-known names such as Google™, AltaVista™ and Excite™. A search engine is a system that can search for specific words and phrases in a set of electronic documents, particularly HTML documents on the web, although the term is not confined to use on the web.
The majority of search engines work on similar principles. Web content is hosted by a very large number of remote web servers. A computer program known as a 'spider' or 'robot' crawls through the content on the web that is to be indexed and stores information about each page it finds in a searchable index. The index therefore comprises a complete database of information about a predetermined list of web pages.
Each page or document in the index is given a ranking, indicating its relevance with reference to some word or phrase. Each search engine typically uses its own algorithm, although all are based on the same principle, namely that the ranking is determined by a combination of the number of occurrences of each keyword or phrase in the document, the total word count of the document and whether the keyword/phrase occurred in a particularly significant location within the document, such as in the title or in HTML tags known as meta-tags, or was in some other way highlighted in the document.
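Purely by way of illustration, a ranking of the kind described might be computed along the following lines; the weights, class and method names below are hypothetical and are not those of any particular search engine:

    // Illustrative only: a naive relevance score of the kind described above,
    // combining keyword frequency, document length and keyword placement.
    final class NaiveRanker {
        private static final double TITLE_BONUS = 2.0; // hypothetical weights
        private static final double META_BONUS  = 1.5;

        double score(int keywordOccurrences, int totalWords,
                     boolean inTitle, boolean inMetaTags) {
            if (totalWords == 0) return 0.0;
            double base = (double) keywordOccurrences / totalWords;
            if (inTitle)    base *= TITLE_BONUS;  // significant location
            if (inMetaTags) base *= META_BONUS;
            return base;
        }
    }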
When a user performs a search for some keyword or phrase, the search engine consults its index and returns a set of search results ranked by relevance.
Clearly, it is important from a user point of view to be looking at the latest data available. In turn, search engines need to ensure that their index is kept as up-to-date as possible, since it is in the nature of the Internet that the content being indexed is likely to be constantly changing.
However, the creation of the index takes substantial computational and bandwidth resources. Furthermore, the speed at which updates can be performed is limited by the number of pages that the remote web servers can deliver in unit time.
One possible way of saving computational time is to record the creation time of a remote document during a first indexing operation and subsequently only refresh the information about this document if the creation time has changed. However, since many web servers do not deliver accurate information on document dates, this is not a particularly good solution in practice.
The present invention aims to address the above problems.
According to the invention, there is provided a method of selectively updating an index of documents, some of said documents including links to other of said documents, the method comprising the step of updating only those of said documents which include said links.
The method can further comprise incorporating into the index documents not previously indexed and which are linked to by updated documents and some or all of said incorporated documents may not include said links.
A selected set of the documents in the index can comprise a document hierarchy, for example a website, in which each document is associated with a depth, the depth representing the number of said links required to reach the document from an entry document, wherein said links comprise links between documents at different depths.
The documents can comprise hypertext documents such as web pages and the links can be hyperlinks.
According to the invention, there is further provided a method of selectively updating an index of documents, including the steps of preparing the index for selective updating and selectively updating the index, wherein the step of preparing the index for selective updating includes classifying the documents in the index as leaf and non-leaf pages, wherein the leaf pages do not include links to other documents in the index and the non-leaf pages include links to other documents in the index.
The method can further comprise updating the non-leaf pages more frequently than the leaf pages and/or adding new leaf pages to the index more frequently than updating existing leaf pages.
According to the invention, there is also provided a search engine including means for selectively updating an index of documents, some of said documents including links to other of said documents, said means being configured to update only those of said documents which include said links.
Embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a schematic diagram of a system according to the invention;
Figure 2 illustrates the structure of a typical website;
Figure 3 is a flow diagram illustrating search engine indexing operation in accordance with the invention, including full indexing and selective updating;
Figure 4 is a flow diagram illustrating the full indexing procedure referred to in Figure 3;
Figure 5 is a flow diagram illustrating the retrieval of a list of non-leaf pages, which requires the retrieval of a list of leaf pages;
Figure 6 is a flow diagram illustrating the retrieval of a list of leaf pages for use in the flow diagram of Figure 5;
Figure 7 is a flow diagram illustrating the selective updating procedure referred to in Figure 3; and
Figure 8 is a flow diagram illustrating the deletion of orphaned pages following selective updating.
Referring to Figure 1, a system according to the invention comprises a search engine program 1, written, for example, in Java™, running on and executable by the processor 2 of a web server machine 3. The search engine program 1 communicates with a database 4 which stores the raw data that forms the basis of the information to be presented to a user carrying out a search.
The user accesses the web server 3 via a communications network 5, such as the Internet, using browser software 6 running on a personal computer 7. The browser software 6, for example, Microsoft Internet Explorer™ or Netscape Navigator™, interfaces with the search engine program 1 via web server software 8 running on the web server machine 3. The browser software 6 communicates with the web server software 8 using the HTTP protocol, in a way that is well known.
It will be understood that the web server machine 3 and the personal computer 7 are conventional computers equipped with all of the hardware and software necessary to carry out their respective tasks.
The search engine program 1 includes a web spider program 9, the function of which is to trawl the Internet to provide the raw data that will be processed and used to respond to subsequent user queries. The functionality of the search engine program 1 according to the invention will be described and illustrated below with reference to Figure 3.
Referring first to Figure 2, a typical website comprises an entry page or start page P, also referred to as the home page, a number of more specific pages Q, R, S and, finally, specific items of particular interest T, U, V. For example, a news website has a home page listing the most current articles, a number of pages listing articles in specific sections, for example UK news and world news, and finally the news articles themselves. The pages are interlinked by hyperlinks 10, indicated in bold in Figure 2. A website may have more than one entry page.
The depth of a web page in a website is defined for the purpose of this application as the minimum number of hyperlinks a user or computer program must traverse from an entry page in order to arrive at that page. A branch page is then defined as any page, other than an entry page, which contains one or more hyperlinks deeper into the website, while a leaf page is defined as any page that does not link deeper into the website. A non-leaf page is an entry page or a branch page.
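By way of illustration only, the definitions above might be captured in a simple data model such as the following sketch, in which all names are hypothetical:

    // Illustrative data model for the definitions above; names are hypothetical.
    enum PageKind { ENTRY, BRANCH, LEAF }

    final class Page {
        final String url;
        int depth;        // minimum number of hyperlinks from an entry page
        PageKind kind;    // LEAF if no link leads deeper into the website

        Page(String url) { this.url = url; }
    }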
Referring to Figure 3, the indexing operation of the search engine program 1 begins at step s0. The program 1 first determines whether a previous index exists (step s1). If it does not, for example, because this is the first time that the indexing operation is being performed, then a full index is generated (step s2) and the process then terminates (step s3).
The procedure for generating a full index, which is also the way in which a conventional web spider program functions, is described in detail below with reference to Figure 4.
Referring to Figure 4, in a full indexing process, the web spider program 9 periodically downloads documents from the Internet in accordance with predetermined indexing criteria, which, for example, specify the coverage that the search engine is attempting to achieve (step s20). The type of document downloaded is classified as either an HTML document or 'other' document type, i.e. all documents which are not HTML documents (step s21). In both cases, a document parser corresponding to the document type parses the document (steps s22, s23) and stores data about the document in the database 4 (step s24). This data includes, for example, information relating to the positions and frequency of every word, with the option of excluding the most common words, in each document.
In addition, the HTML parser retrieves the URLs of any hyperlinks within the document being parsed (step s22) and checks whether each of these hyperlinks is new and meets the indexing criteria (step s25). The indexing criteria can comprise a pattern to be matched by the hyperlink, for example, that the hyperlink must have the same root as the source page. If this is the case, the hyperlink is added to the download list stack (step s26) and the next document is downloaded from the list stack (step s20). If none of the hyperlinks found are to be added to the list stack, the spider program 9 determines if the list stack is empty (step s27). If it is, the spider program terminates (step s28). If not, control returns to the document downloader to download the next document from the list stack (step s20).
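By way of illustration, the loop of Figure 4 might be sketched as follows. The Indexer and Document collaborators and all names below are hypothetical, introduced only so that the sketch is self-contained; downloading, parsing and storage are assumed to be implemented elsewhere:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // A minimal sketch of the full indexing loop of Figure 4.
    final class FullIndexer {
        private final Deque<String> downloadStack = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();

        void run(Iterable<String> seedUrls, Indexer indexer) {
            seedUrls.forEach(downloadStack::push);
            while (!downloadStack.isEmpty()) {                    // step s27
                String url = downloadStack.pop();                 // step s20
                Document doc = indexer.download(url);
                if (indexer.isHtml(doc)) {                        // step s21
                    indexer.parseHtml(doc);                       // step s22
                    for (String link : indexer.extractLinks(doc)) {
                        // step s25: is the link new and does it meet the criteria?
                        if (indexer.meetsCriteria(link) && seen.add(link)) {
                            downloadStack.push(link);             // step s26
                        }
                    }
                } else {
                    indexer.parseOther(doc);                      // step s23
                }
                indexer.store(doc);                               // step s24
            }
        }

        // Hypothetical collaborators, declared only to make the sketch complete.
        interface Document {}
        interface Indexer {
            Document download(String url);
            boolean isHtml(Document d);
            void parseHtml(Document d);
            void parseOther(Document d);
            Iterable<String> extractLinks(Document d);
            boolean meetsCriteria(String link);
            void store(Document d);
        }
    }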
Returning to Figure 3, even if a previous index exists, the program determines whether it is time to perform a full re-indexing process (step s4). This process is carried out at intervals to ensure that a completely fresh index is periodically generated. For example, a full re-indexing process is carried out weekly, while selective indexing is carried out daily. The full re-indexing process involves carrying out the procedure set out above in relation to step s2.
If there is no need for full re-indexing and a previous index is available, the program examines the existing search index to generate tables, for example in the form of hashtables, of links between the pages (step s5). The tables include a forward set ('links to') and a reverse set ('links from') of link information. For the 'links to' table, a source URL maps to an array of link destination URLs, whereas for the 'links from' table, the destination URL maps to an array of URLs of pages having the destination link. For example, for the structure shown in Figure 2, the forward table includes the information that page P links to pages Q, R and S, and that page Q links to pages T, U and V, while the reverse table holds the information that page R has links from pages P and V, that page S has links from pages P and R and that page V has links from pages Q and S. Corresponding information is held for all the other pages.
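By way of illustration, the two tables might be built together in a single pass over the links known to the index, as in the following sketch, in which the class and method names are hypothetical:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the forward ('links to') and reverse ('links from') tables of
    // step s5, built together from one pass over the known links.
    final class LinkTables {
        final Map<String, List<String>> linksTo = new HashMap<>();   // source -> destinations
        final Map<String, List<String>> linksFrom = new HashMap<>(); // destination -> sources

        void addLink(String sourceUrl, String destUrl) {
            linksTo.computeIfAbsent(sourceUrl, k -> new ArrayList<>()).add(destUrl);
            linksFrom.computeIfAbsent(destUrl, k -> new ArrayList<>()).add(sourceUrl);
        }
    }

For the structure of Figure 2, calling addLink once per hyperlink would leave linksTo mapping P to [Q, R, S] and linksFrom mapping R to [P, V], in accordance with the description above.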
The next step carried out by the program is to obtain a list of all of the non-leaf pages (step s6). The procedure for doing this is set out in detail below and illustrated with reference to Figure 5.
Referring to Figure 5, the subroutine begins at step s60. Essentially, to obtain the set of non-leaf pages, the program first determines the set of leaf pages for a given set of entry pages (step s61). Pages which are not leaf pages are by definition non-leaf pages. The set of leaf pages is determined by the program using the subroutine illustrated in Figure 6.
Referring to Figure 6, the subroutine begins at step s6100. The first step is to initialise an empty set of leaf pages, to which leaf pages will be added as they are found, an empty set of pages called 'this_level', an empty set of pages called 'done_links', a variable called 'depth' which is initially set to 0 and an empty hashtable referred to herein as the depth hashtable, which maps URLs to the depth variable (step s6101). The 'this_level' set will contain all the pages at a particular depth, i.e. starting with all the entry pages, then all the pages at depth 1 and so on. The 'done_links' set will contain all the links that have already been considered.
After initialisation, all of the entry pages are added to the 'this_level' page set (step s6102). The program then determines whether there are pages in the 'this_level' page set (step s6103) and executes a first loop while there are such pages. Initially, therefore, all the entry pages are in the 'this_level' page set. The program then initialises an empty set of pages called 'next_level' (step s6104). It then determines whether there are more entries in the this_level set (step s6105). If there are, then the program retrieves, from the index, the list of links for the current page, sets a boolean variable called allold to TRUE and adds a depth hashtable entry for the page equal to the depth (step s6106).
At the next stage, the program determines whether there are any more links on the page (step s6107). For each link which exists, the program determines whether the link is already present in the this_level set or the depth hashtable (step s6108). If it is present, control returns to step s6107 and the program looks at the next link. If the link does not already exist in the this_level set or the depth hashtable, the boolean variable allold is set to FALSE (step s6109) and the program determines whether the link exists in the done_links set (step s6110). If it does, control again returns to step s6107 without further action. If it does not, an entry for the link URL is added to the done_links set and to the next_level set (step s6111). Control then returns to step s6107 and the program performs the same procedure for the next link.
If the program at step s6107 determines that there are no more links, then the program tests the state of the allold variable (step s6112). If this is set to TRUE, the page is added to the set of leaf pages (step s6113) and control passes back to step s6105. This will only be the case if the page has no links that go deeper into the website. If the allold variable is set to FALSE, control passes back to step s6105 without the page being added to the set of leaf pages. At step s6105, the program determines whether there are further entries in the this_level set. In the absence of further entries, the this_level set is set to hold the contents of the next_level set, and the depth is increased by 1 (step s6114). Control passes back to step s6103, which re-runs the steps described above for the pages at the next depth.
Only when there are no further pages to be processed does the program exit with a completed list of leaf pages (step s6115).
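The subroutine of Figure 6 might be summarised in code as in the following sketch, which assumes that a links_to map of the kind built at step s5 is available; the variable names simply follow the description above:

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch of the leaf-detection subroutine of Figure 6.
    final class LeafFinder {
        static Set<String> findLeafPages(Map<String, List<String>> linksTo,
                                         Collection<String> entryPages) {
            Set<String> leafPages = new HashSet<>();                 // step s6101
            Set<String> doneLinks = new HashSet<>();
            Map<String, Integer> depthTable = new HashMap<>();       // URL -> depth
            int depth = 0;
            Set<String> thisLevel = new HashSet<>(entryPages);       // step s6102
            while (!thisLevel.isEmpty()) {                           // step s6103
                Set<String> nextLevel = new HashSet<>();             // step s6104
                for (String page : thisLevel) {                      // step s6105
                    boolean allold = true;                           // step s6106
                    depthTable.put(page, depth);
                    for (String link : linksTo.getOrDefault(page, List.of())) { // s6107
                        // step s6108: links to this level or shallower are 'old'
                        if (thisLevel.contains(link) || depthTable.containsKey(link)) {
                            continue;
                        }
                        allold = false;                              // step s6109
                        if (doneLinks.add(link)) {                   // steps s6110, s6111
                            nextLevel.add(link);
                        }
                    }
                    if (allold) {                                    // step s6112
                        leafPages.add(page);                         // step s6113
                    }
                }
                thisLevel = nextLevel;                               // step s6114
                depth++;
            }
            return leafPages;                                        // step s6115
        }
    }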
An example of the operation of the algorithm above is now described with reference to Figure 2. On the first pass, page P is the only entry page so this is loaded into the this_level set (step s6102). Since this page is in the this_level set (step s6103), the next_level set is initialised (step s6104) and the list of links for page P is retrieved from the links_to hashtable (step s6106). These links are the URLs for pages Q, R and S. The allold variable is set to TRUE and an entry is added to the depth hashtable for page P specifying a depth of 0 (step s6106).
Pages Q, R and S do not exist in the this_level set or the depth hashtable (steps s6107 and s6108). Therefore, they are new pages, so allold is set to FALSE (step s6109). Pages Q, R and S do not exist in the done_links set either, so the URLs for each of pages Q, R and S are added to the done_links set and to the next_level set (steps s6107 to s6111). It will be understood that, although these pages are described above, for the purpose of clarity and brevity, as if the program treated them together, each page is in fact processed by the program 1 on separate passes, for example on the basis of a 'for' loop over all of the links in the list. On the fourth pass through the loop, the program 1 determines that there are no more links for page P (step s6107). A test of the allold variable determines that it is set to FALSE (step s6112), so control passes to step s6105. Since there are no more entry pages at depth 0, the this_level set is set to the contents of the next_level set, i.e. to contain the URLs for pages Q, R and S, and the depth variable is incremented (step s6114). The process described above then repeats from step s6103 for each of pages Q, R and S.
For example, an empty next_level set is initialised for page Q (step s6104) and the list of links for page Q is obtained. This contains pages T, U and V. allold is set to TRUE and the depth hashtable entry for page Q is set to 1 (step s6106).
The links to pages T, U and V are not in the this_level or done_links sets nor in the depth hashtable (steps s6108 and s6110), so allold is set to FALSE (step s6109) and pages T, U and V are added to the done_links and next_level sets (step s6111). Control returns to step s6105. The next page at this level is page R.
Page R contains only a single link to page S. At step s6106, allold is set to TRUE and a depth hashtable entry is made for page R, depth = 1. Program flow proceeds to step s6108. Since page S is in the this_level set, program flow returns to step s6107. There are no further links, so control passes to step s6112; since allold is still set to TRUE, page R is added to the set of leaf pages (step s6113). Although it links to another page, that page is not one which is deeper within the website. Control now returns to step s6105. The next page at this level is page S.
The list of links for page S contains page V only. allold is set to TRUE and a depth hashtable entry is made for page S, depth = 1 (step s6106). Since the link to page V is not in the this_level set nor in the depth hashtable (step s6108), allold is set to FALSE (step s6109). However, link V is in the set of done_links, so control passes back to step s6107 without adding link V to the next_level set (step s6110). Since there are no further links and allold is set to FALSE, control passes to step s6112 and then to step s6105 without adding page S to the set of leaf pages. Since there are no further pages at this level, the this_level set is set to the contents of next_level, i.e. pages T, U and V, and the depth variable is incremented, so that depth = 2 (step s6114). On the next pass, an empty next_level set is initialised (step s6104) and control passes through step s6105 to step s6106.
At step s6106, for each of pages T and U, allold is set to TRUE and hashtable entries are made for each of pages T and U with depth = 2. Since pages T and U do not include any links, control passes from step s6107 to step s6112 and, since allold has just been set to TRUE, pages T and U are each added to the set of leaf pages on sequential passes through the flowchart.
For the final page in the website, page V, the list of links contains a link to page R. allold is set to TRUE and a hashtable entry is made for page V, depth = 2 (step s6106). Program flow proceeds to step s6108, where the program determines that the link exists in the depth hashtable: an entry was previously made in the depth hashtable for page R, depth = 1. Therefore, program flow moves back to step s6107. There are no further links and the program determines that allold is set to TRUE (step s6112), so the current page V is added to the set of leaf pages (step s6113).
Control returns to step s6105; there are no more pages at level 2, so this_level is set to the contents of next_level and the depth is incremented (step s6114). However, no links were added to the next_level set on the previous pass, so the this_level set is empty. The program determines that this is the case (step s6103) and exits with a complete set of leaf pages, comprising pages R, T, U and V (step s6115).
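By way of illustration, the example just described can be reproduced by feeding the Figure 2 structure to the LeafFinder sketch given above; this hypothetical driver is not part of the described system:

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical driver reproducing the Figure 2 walk-through with the
    // LeafFinder sketch given earlier.
    public class Figure2Example {
        public static void main(String[] args) {
            Map<String, List<String>> linksTo = Map.of(
                    "P", List.of("Q", "R", "S"),
                    "Q", List.of("T", "U", "V"),
                    "R", List.of("S"),
                    "S", List.of("V"),
                    "V", List.of("R"));
            Set<String> leaves = LeafFinder.findLeafPages(linksTo, List.of("P"));
            System.out.println(leaves); // prints pages R, T, U and V in some order
        }
    }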
Referring again to Figure 5, after obtaining the list of leaf pages as described in detail above, the program gets the complete list of pages in the search index and initialises an empty list of non-leaf pages (step s62). For every page in the search index (step s63), the program determines whether the page is in the list of leaf pages (step s64). If so, control returns to step s63 and the next page is looked at. If a page is not a leaf page, it is classified as a non-leaf page and added to the list of non-leaf pages (step s65). When the program determines that there are no further pages to be considered (step s63), it returns the complete list of non-leaf pages (step s66).
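In code, the complement step of Figure 5 might be sketched as follows, the names again being illustrative:

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import java.util.Set;

    // Sketch of the complement step of Figure 5 (steps s62 to s66): every page
    // in the index that is not in the leaf set is a non-leaf page.
    final class NonLeafFinder {
        static List<String> findNonLeafPages(Collection<String> indexedPages,
                                             Set<String> leafPages) {
            List<String> nonLeaf = new ArrayList<>();   // step s62
            for (String page : indexedPages) {          // step s63
                if (!leafPages.contains(page)) {        // step s64
                    nonLeaf.add(page);                  // step s65
                }
            }
            return nonLeaf;                             // step s66
        }
    }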
In the example given above in relation to Figure 2, the non-leaf pages are pages P, Q and S only.
Referring again to Figure 3, the non-leaf pages are then deleted from the search index (step s7). This involves the deletion of all the information about each non-leaf page that exists in the index as a result of previous indexing of the page.
Information outside the index, such as the URL of the page and the information in the links_to and links_from tables, is unaffected at this stage. As the index contains an inverse table, which for each word lists the pages and locations of that word, it is almost as quick to delete many pages at the same time as it is to delete a single page.
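The following sketch illustrates why batch deletion is cheap with such an inverse table: one pass over the word table removes any number of pages at once. The posting layout shown is an assumption and is not taken from the description:

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustration of batch deletion from an inverse (word -> postings) table.
    final class IndexDeleter {
        record Posting(String url, int position) {}

        static void deletePages(Map<String, List<Posting>> inverseTable,
                                Set<String> urlsToDelete) {
            for (List<Posting> postings : inverseTable.values()) {
                postings.removeIf(p -> urlsToDelete.contains(p.url()));
            }
        }
    }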
For each website present in the search index, the URL of its entry page is retrieved, together with the URLs of the existing leaf pages. A selective re-indexing procedure is then performed (step s8), as described in more detail with reference to Figure 7. In a multithreading environment, the indexing of a number of websites can run concurrently.
Referring to Figure 7, the selective re-indexing process according to the invention operates in a very similar way to the conventional indexing program illustrated in Figure 4. For ease of reference, the steps that are the same are indicated by the same reference numerals. The difference between this process and the spidering operation illustrated in Figure 4 lies in the fact that the existing leaf pages are ignored. At step s22, the HTML document parser retrieves the URLs of any hyperlinks within a web page. As in the conventional spider program, the selective indexing spider determines whether the hyperlink is new and meets the criteria for indexing (step s25). If it is, it determines whether the hyperlink is in the list of leaf pages for the given entry page (step s80). If it is in the list, the hyperlink is not added to the download list stack for subsequent downloading and control returns to the document downloader (step s20). If it is not in the list, it is added to the download list stack in the usual way (step s26) and control is returned to the document downloader (step s20).
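The extra test of step s80, which is the only difference from the full spider of Figure 4, might be sketched as follows; existingLeafPages is assumed to hold the leaf URLs retrieved before the non-leaf pages were deleted, and all names are hypothetical:

    import java.util.Set;

    // Sketch of the link filter that distinguishes the selective re-indexing
    // spider of Figure 7 from the full spider of Figure 4.
    final class SelectiveLinkFilter {
        private final Set<String> existingLeafPages;

        SelectiveLinkFilter(Set<String> existingLeafPages) {
            this.existingLeafPages = existingLeafPages;
        }

        boolean shouldDownload(String link, boolean meetsCriteria) {
            if (!meetsCriteria) return false;             // step s25
            return !existingLeafPages.contains(link);     // step s80
        }
    }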
Based on the example of Figure 2, it is evident that of the seven pages which would require updating in a full indexing procedure, four pages R, T, U and V no longer require indexing according to the selective indexing procedure, so achieving a significant saving in computational resources.
Referring again to Figure 3, after performing selective re-indexing, a list of links to and from the pages in the new index is again extracted, as described in relation to step s5 above (step s9). A list of all orphaned pages is then obtained and can be deleted from the new index (step s10). Orphaned pages are pre-existing leaf pages which are no longer linked to by other pages. They may or may not exist on a targeted website but, since they are no longer accessible on the live website, they should not be found by the search engine.
The process of orphaned page deletion is now explained in detail with reference to Figure 8. The process begins at step s100. An empty list of unlinked documents is initialised and the complete list of pages in the search index retrieved (step s101). For every page in the search index (step s102), the program 1 determines whether an entry exists in the links_from table created at step s9 in Figure 3 (step s103). If an entry exists, control passes back to step s102 and the process is repeated for the next page. If there is no entry in the links_from table (step s103), indicating that the page in question is unreachable from any other page, the current page is added to the list of unlinked documents (step s104). Control then passes back to step s102. Once all the pages have been processed, control passes to step s105, at which all the pages listed in the list of unlinked documents are deleted and the subroutine terminates (step s106).
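By way of illustration, the orphan detection of Figure 8 might be sketched as follows, assuming that linksFrom is the reverse table rebuilt at step s9; in practice entry pages, which legitimately have no inbound links, would need to be exempted:

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import java.util.Map;

    // Sketch of the orphan detection of Figure 8.
    final class OrphanFinder {
        static List<String> findOrphanedPages(Collection<String> indexedPages,
                                              Map<String, List<String>> linksFrom) {
            List<String> unlinked = new ArrayList<>();        // step s101
            for (String page : indexedPages) {                // step s102
                if (!linksFrom.containsKey(page)) {           // step s103
                    unlinked.add(page);                       // step s104
                }
            }
            return unlinked;   // these pages are then deleted at step s105
        }
    }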