WO2001057725A2 - System and method for database searching - Google Patents

System and method for database searching

Info

Publication number
WO2001057725A2
Authority
WO
WIPO (PCT)
Prior art keywords
databases
database
information
query
key
Prior art date
Application number
PCT/GB2001/000446
Other languages
French (fr)
Other versions
WO2001057725A3 (en)
Inventor
Richard David Parratt
Original Assignee
Navigateone Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Navigateone Limited
Priority to EP01902550A (EP1254413A2)
Priority to AU2001230402A (AU2001230402A1)
Publication of WO2001057725A2
Publication of WO2001057725A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • The present invention is concerned with systems and methods for retrieving material from computerised information systems. Such systems and methods are sometimes referred to as search engines as they search through databases such as the Internet, Internet websites or collections of Internet websites.
  • The World-Wide-Web consists of a large number of separate databases or websites which can be accessed via a telecommunications link and then viewed or interrogated.
  • The different World-Wide-Web or Internet databases or websites are stored at computers which can be located at any location provided that they are connected to a telecommunications network.
  • Computers and/or databases within many organisations are also interconnected using the same techniques (so-called "Intranets").
  • Each computer or database in the network may contain information of interest to users of the network. This information may be structured in each computer or database in various ways and using various techniques.
  • The providers or compilers of each Internet website will select and implement a user interface for their website so as to allow users to gain access to information from the website or database.
  • Websites typically include a number of different possible information displays known as web pages. These are generated in response to instructions or requests posed by the user via the website user interface.
  • the information may be delivered as a formatted, human-readable page, or in a computer readable format designed for further processing by users' software applications.
  • For a user to retrieve a particular item of information from a collection of distributed websites such as the Internet or an Intranet, he or she needs to locate the website(s) or database(s) that hold the relevant information, learn to use the user interface of the website(s) and then use the interface to retrieve the web page, database section or display containing that information.
  • Web directories rely on a manual process to locate each relevant individual web page and place it in a directory.
  • a team of editors review websites and input page addresses into categories, the results being stored in a database. Users can then browse through the database. Categories are generally organised hierarchically, e.g. Companies/UK/Telecoms/British Telecom.
  • the directory usually also provides a full-text search facility which allows users to look for information using keywords, phrases or boolean logic.
  • the directory compilation process is expensive due to the large number of editors required. The expense and time involved in reviewing possibly relevant websites also limits the amount of information that can be indexed at a reasonable cost in a reasonable time-scale.
  • the directory process can be quite precise, but can miss out valuable information due to sites not having been found and indexed.
  • Examples of directories include Yahoo (www.yahoo.com), Looksmart (www.looksmart.com) and dmoz Open Directory (www.dmoz.org).
  • Text search engines use a "web crawler" to locate content from the World-Wide-Web. This is a computer program that, starting with a number of initial website or webpage addresses with information of relevance to a particular query or type of query, follows the links between web pages to try to locate all linked pages on the World-Wide-Web. Most text search engines also allow web users and website owners to submit pages for indexing.
  • All data located by the web crawler is stored in a searchable database.
  • This database can then be searched or interrogated.
  • the database typically allows full text searching using keywords, phrases or boolean logic.
  • the searched text stored in the database can include the web page title, the full text of the website and "metadata" which is not displayed but provided by the web-site or database creator to aid search engines.
  • Text search engines provide coverage of every web page that is immediately accessible to the "crawler" used by the engine to gather content. This allows a text search to provide a high level of completeness, at the expense of results that may be irrelevant. Text search engines are intrinsically unable to locate information within websites that provide a database of content on multiple objects, such as newspaper archives, stock exchange quote service, airline booking services, etc. The web crawler will typically locate the front page of these sites and be unable to advance beyond the barrier of the user interface used by each particular website or database - preventing such information from being effectively indexed.
  • Natural language processing is used to match the results generated by the search against the initial request inputted by the user and obtain a statistical measure of the relevance of information to the user's query.
  • Weighting is given to words in the database according to frequency of appearance, proximity and nearness to the start of the document.
  • Document comparison techniques are used, e.g. by allowing users to select a link and look for similar documents. Such systems typically use statistical matching techniques to pick up possibly similar documents. All of these techniques are limited when locating objects such as companies or places where the context is not understood by the search algorithm.
  • One system (Ask Jeeves - www.askjeeves.co.uk, or www.ask.com) allows users to enter a query in natural language. This is then statistically matched against a database of potential questions indexed against possibly relevant websites. The results are then returned as a list of those websites indexed in the database together with the statistically relevant potential questions.
  • The database of websites and questions is typically maintained manually. Only questions similar to those held in the database are likely to receive a relevant result.
  • Meta-search engines (such as Dogpile, Copernicus, Sherlock) allow users to enter a search request which will then be presented to multiple search engines. This allows users to get the benefit of multiple technologies in finding information. Meta-search techniques tend to return a large amount of information which needs to be manually reviewed by the user.
  • The present invention in a first aspect provides a system for searching a distributed collection of databases comprising a number of databases connected to each other by a communications network, the system including: query entry means for entering a request for information on a subject, object or matter, or a group of subjects, objects or matters; a first memory storing index entries, each index entry including a portion representing a subject, object or matter (or a group of subjects, objects or matters) on which information might be sought and one or more location entries indicating which of the databases may contain information on the respective subject, object or matter or group of subjects, objects or matters; and a second memory storing database interrogation modules, routines or sub-routines for converting a request for information received by the data entry means into a set of appropriate instructions for each of the databases.
  • The present invention in a second aspect provides a method for obtaining information from a collection of databases, comprising: entering a query for information; comparing the query to a database of descriptions of the content or type of content of the databases constituting the collection of databases; generating a list of potentially relevant databases from said comparison of the query to the database descriptions; and converting or translating the query into an enquiry signal or signals recognised or processable by each of the potentially relevant databases.
  • The present invention in a third aspect provides a computer program comprising program code means for performing the method set out above and in claims 12 to 17.
  • The present invention in a fourth aspect provides a computer program product comprising program code means stored on a computer readable medium for performing the method set out above and in claims 12 to 17.
  • Locating relevant information amongst the large amount of irrelevant content on the Internet which may contain words that match the name of the object. For instance, there are over 50,000 directly accessible web pages containing the words "British Telecom".
  • Locating information which is provided by an interactive website and is thus inaccessible to conventional web crawler based search engines.
  • Classifying results according to the type of content, e.g. news articles, charts, quotes, airline timetables, hotel information.
  • Figure 1 is a block diagram illustrating the processing of a query by a system embodying the invention
  • Figure 2 is a block diagram illustrating the architecture of a system embodying the invention
  • Figure 3 is a block diagram illustrating the database management of a system embodying the invention
  • Figure 4 is a diagrammatic illustration of the structure of a reference database used by the system of figures 1 to 4
  • Figure 5 is a block diagram illustrating the components of a system embodying the invention
  • Figures 6 and 7 are flow charts illustrating the processing of a query by the system of figures 1 to 5.
  • Preferred embodiments of the present invention are concerned with methods and/or systems for locating information relevant to a specific entity or matter held in a number of separate databases or websites or stored on a network of computers connected to a telecommunications system.
  • entity or matter on which information might be sought can be any real or abstract object or class of objects about which information exists in one or more databases.
  • The databases or websites may be on the public World-Wide-Web or on a private Intranet.
  • The term database when used in this document includes all possible stores of information including websites and web pages.
  • A descriptions database, holding descriptions of databases, is held on a computer or a number of computers.
  • This database contains descriptions of the type of information held in a number of databases or websites. These descriptions can be entered manually into the descriptions database by the system operator. Alternatively a website or database provider may publish its own description over the World-Wide-Web or over an Intranet in a manner such that this information will be automatically retrieved by the system and stored in the descriptions database.
  • a user accesses the system through a user interface which is used to present a query to the system.
  • the user interface prompts the user to identify the entity or matter they require information on and passes this information onto the system.
  • the system may allow one or more methods to be used to identify an entity or subject matter and present a query about such an entity or subject matter. Examples of these methods include: (a) the identification of companies by company name, symbols such as Reuters RIC codes, S & P codes or stock exchange ticker codes; (b) the identification of places or locations by postcode or latitude and longitude of a place.
  • the data structure identifying an entity or subject matter is described herein as a "Key".
  • the system will then compare the query presented by a user with the descriptions database and select those websites or databases in its descriptions database which are of possible relevance to the entity or subject matter the user has described in his or her query.
  • These selected possibly relevant databases or websites can then be filtered further and ordered according to a number of criteria: for example, a priority level set by the manager of the system, the national language used on the website, or the user's own preferences.
  • the system also includes a database access instructions database which translates queries entered into the system into instructions which are appropriate for interrogating or viewing each of the databases or websites covered by the system.
  • When a query is made by a user, this is compared to the entries in the descriptions database to determine the potentially relevant websites or databases and produce a list of these.
  • The query is then translated into instructions appropriate for these potentially relevant databases and the user is invited to select those databases he or she wishes to check from the list of potentially relevant sources of information. Any such databases, web pages or websites selected are then interrogated using the appropriate instructions from the access instructions database.
  • Preferred embodiments of the invention produce a consolidated set of instructions which are passed to a user.
  • In a system for searching the Internet, they may be passed to the user's Internet browser application together with a suitable software module or page description that can interpret them.
  • the user can then use a common interface to select from the various sites or databases listed by the system and retrieve the information they require from each one.
  • the user interface may also provide the ability to classify sites into logical categories appropriate for the field of the search being undertaken.
  • A query such as a request for information on a particular company is compared to an indexed directory (1) of company names and company symbols.
  • a symbol representing the company on which information is sought is generated and passed or communicated to a relevance filter or scope matching module (2) comprising a descriptions database (12) of descriptions of the databases covered and indexed by the system.
  • the query represented by the company symbol is then compared to the database descriptions (12) and a list of potentially relevant databases or websites produced and passed to a symbol, key or enquiry mapping module which also receives details of the symbol representing the query posed to the system by a user.
  • the symbol, key or enquiry mapping module then converts the symbol into a signal or signals representing that symbol or company and capable of being recognised and processed by each of the potentially relevant databases.
  • A user selecting from the list of potentially relevant databases produced by the relevance filter or scope matching module can then interrogate the selected database(s) using the signal(s) created by the symbol or enquiry mapping module and the URL(s) (uniform resource locators) stored in the system and identifying the location of the various databases or websites.
  • the databases covered by the system can be either public (e.g. Internet websites) (4) or private (a company Intranet) (5).
  • Figure 2 illustrates a system covering private and public databases or sources of information.
  • the heart of the system is a reference database module (6) including the directory of symbols, the database descriptions and the access instructions.
  • the database module may be implemented using any relational database (such as Microsoft SQL Server, Sybase or Oracle) or similar technology in the prior art.
  • Information on the database derives from a manual analysis of two kinds of information; the set of objects the system is locating information on and the set of private and public databases containing the information.
  • In the financial domain, the directory of objects (1) is a list of companies with a unique stock symbol for each company; in the case of the "travel" domain, it is a list of places with their postcode and geographical position.
  • Figure 3 shows one possible embodiment of the database creation process.
  • the reference database (6) consists of a collection of tables. Each table is in turn a collection of rows and each row a collection of columns. A column may hold any numeric or textual value.
  • the database is capable of efficiently locating any row based on the values in its columns.
  • Information on the set of objects can be manually entered into the database by a data input operator (7) using an input form (8). This will require the operator to enter all the information (name, symbol, alternative codes) for each object.
  • all or part of the directory of objects can be imported from another source such as a commercially available dataset (9) using a data translation module (16).
  • Data translation packages are supplied with most commercially available relational database systems.
  • the information is stored in a relational database table called the object directory (1) in the reference database (6).
  • Each private and public database known to the system is manually analysed to understand its description, purpose and structure. This information forms a set of database definitions (11). In a possible embodiment of the system these are written in the Extensible Markup Language (XML). These can then be translated into the database using a commercially available XML parser (13). The result is a database description table (12) in the reference database (6).
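The translation of an XML database definition into a row of the description table can be sketched as follows. This is only an illustration: the element and attribute names (`site`, `scope`, `priority` and so on) and the example site are assumptions, since the source does not give the XML schema, and Python's standard `xml.etree.ElementTree` stands in for the commercially available parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical database definition; the schema is NOT taken from the patent.
SITE_DEFINITION = """
<site id="example.finance.quotes">
  <name>Example Quotes</name>
  <description>Delayed share prices for UK companies</description>
  <scope table="Example_Scope" wild="Example_Scope_Wild"/>
  <priority>10</priority>
</site>
"""


def parse_site_definition(xml_text):
    """Turn one XML site definition into a dict shaped like a description-table row."""
    root = ET.fromstring(xml_text.strip())
    scope = root.find("scope")
    return {
        "Site_Id": root.get("id"),
        "Name": root.findtext("name"),
        "Description": root.findtext("description"),
        "Site_Scope": scope.get("table"),
        "Site_Scope_Wild": scope.get("wild"),
        "Priority": int(root.findtext("priority")),
    }


row = parse_site_definition(SITE_DEFINITION)
```

In a real deployment the resulting dict would be inserted into the relational table rather than kept in memory.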
  • the Object directory or Object_Table (1) contains descriptions of all the entities the system understands. For example, in the financial domain this is a relational database table of company symbols, while in the travel domain this is a gazetteer of locations. Columns in this table relate directly to fields in the keys understood and used within the system. Rows in this table correspond to entities understood by the system.
  • The descriptions database or Site_Table (12) contains descriptions of each website indexed by the system. Each row in this relational database table corresponds to a website. Each website has a unique identifier (Site_Id) within this table. This identifier takes the form <name>.<name>.<name>... and can be of arbitrary length subject to practical limitations of the underlying technology. This mechanism allows websites to be defined by different organisations without the risk of name clashes.
  • the columns of the site relational database table contain description information. This information is dependent on the domain being indexed. It consists of such fields as a full name and description to identify the site to the user, script modules used by the symbol, key or enquiry Mapping and
  • Scope Tables (15, 16) are defined for each indexed website.
  • the names of the scope tables are held in the fields Site_Scope and Site_Scope_Wild in the site description in relational database table (12).
  • the scope tables define the set of keys which the indexed website provides information on.
  • Site_Scope_Wild points to the wildcard scope relational database table (15).
  • This table holds keys with embedded wildcards, which can match several possible symbols.
  • a wildcard takes the form of a "*" character in the text of the field, which matches any arbitrary string of characters. Keys are held in the columns of the table, with each field of the key stored in a column named to correspond with the name of the field.
  • Site_Scope points to the non- wildcard scope relational database table (16).
  • This table holds discrete keys (without wildcards). Keys are held in the columns of the table, with each field of the key stored in a column named to correspond with the name of the field.
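A minimal sketch of how a key might be tested against a site's two scope tables, assuming each row is held as a mapping from field name to value. The field names `Company` and `Country` and the sample rows are illustrative; only the meaning of "*" (any arbitrary string of characters) comes from the text above.

```python
import re


def wildcard_match(pattern, value):
    """Match value against pattern, where "*" stands for any string of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def key_in_scope(key, discrete_rows, wildcard_rows):
    """Check a key against the non-wildcard and wildcard scope tables of one site."""
    # Non-wildcard table: every column must equal the key's field exactly.
    for row in discrete_rows:
        if all(key.get(field) == value for field, value in row.items()):
            return True
    # Wildcard table: every column is a pattern that may contain "*".
    for row in wildcard_rows:
        if all(wildcard_match(pattern, key.get(field, ""))
               for field, pattern in row.items()):
            return True
    return False


# Hypothetical scope tables for a single site.
discrete = [{"Company": "BT-A", "Country": "GB"}]
wild = [{"Company": "*", "Country": "US"}]

uk_hit = key_in_scope({"Company": "BT-A", "Country": "GB"}, discrete, wild)
us_hit = key_in_scope({"Company": "IBM", "Country": "US"}, discrete, wild)
miss = key_in_scope({"Company": "VOD", "Country": "GB"}, discrete, wild)
```

Here the site covers one named UK company plus every US company, so the first two keys are in scope and the third is not.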
  • Mapping tables (17, 18) are defined for each indexed website. The names of mapping tables are held in fields Site_Symbol_Map and Site_Symbol_Map_Wild in the site description in relational database table (12). These tables define a mapping between the keys defined for the domain (object directory or table 1) and the keys used by the individual websites, which may use a different symbology. For instance, in the financial domain a website may use the alphanumeric ISIN system to identify financial instruments. In this case, the table referenced by Site_Symbol_Map will define a mapping from the symbology used for the financial domain into ISIN codes.
  • Site_Symbol_Map points to the non-wildcard mapping relational database table (18). This holds a list of discrete keys in the form used by the domain and their counterparts in the form used by the individual websites. Keys are held in the columns of the table, with each field of the key stored in a column named to correspond with the name of the field.
  • Site_Symbol_Map_Wild points to the wildcard mapping relational database table (17). This holds keys with embedded wildcards, which can match several possible source keys.
  • The rules used to map wildcard symbols are described in Table 1 (see wildcard mapping rules on page 23). Keys are held in the columns of the relational database table, with each field of the key stored in a column named to correspond with the name of the field.
  • e) The names of the field map tables (19) are defined in the field Site_Field_Map.
  • This relational database table has columns SrcField and DstField. These columns allow field names in the source key (user input or symbol) to be translated to field names in the destination key (enquiry signal in format recognised by selected database or website).
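The field-name translation can be sketched as a simple rename driven by (SrcField, DstField) rows. The destination names `ticker` and `market` are invented for illustration, and leaving unmapped fields under their original name is an assumption, since the source does not say what happens to them.

```python
def remap_fields(key, field_map_rows):
    """Rename key fields using (SrcField, DstField) rows.

    Assumption: fields with no entry in the map keep their original name.
    """
    rename = dict(field_map_rows)
    return {rename.get(name, name): value for name, value in key.items()}


# Hypothetical Site_Field_Map contents for one website.
field_map_rows = [("Company", "ticker"), ("Country", "market")]
site_key = remap_fields({"Company": "BT-A", "Country": "GB"}, field_map_rows)
```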
  • Figure 5 presents a block diagram of the information retrieval system of the present invention, where a user (20) makes a request for information on an entity and receives instructions on how to access that entity that are capable of interpretation by the user's browser software.
  • the user inputs a query (21) to the system.
  • the input of the query may be implemented using an HTML form or other technique.
  • The query consists of one or more text strings that identify an entity or set of entities within the system's coverage. For example, a user wanting information on the company British Telecom plc may input the text string "BRITISH TELECOM" as his or her query.
  • the query (21) is interpreted by the key text input module (22). This converts the query to a denormal key (23).
  • This key contains all the information in the user's request in a standardised form. It may reference the requested entity using any valid method understood by the key normalisation module (24). Continuing the example of British Telecom plc, the user's input text string "BRITISH TELECOM" is stored in Name, a field in the key structure. If the user entered a symbol such as BT-A.GB, this would be split and stored in the fields shown in the following table:
  • the key normalisation module (24) is responsible for converting a denormal key (23) to a normalised key (25).
  • a denormal key may reference zero or more entities; for example the input BRITISH would pick up all entities including the word BRITISH.
  • A query about companies entered as BRITISH would pick up a large number of possibilities (including BRITISH AEROSPACE, BRITISH TELECOM and ASSOCIATED BRITISH FOODS) and would therefore reference more than one entity.
  • a normal key must reference one entity.
  • If the denormal key (23) references either zero or more than one entity, the user is sent a message (26) to indicate that either no information was found or that their query was ambiguous.
  • the input BRITISH might result in the following prompt:
  • The user is presented with all possible keys and asked to select one. Continuing the example above, if the user wishes to access information on British Telecom plc, he or she will select the relevant symbol (BT-A.GB).
  • The unambiguous key or symbol BT-A.GB is re-presented to the key text input module (22) for further processing.
  • The key normalisation module (24) interacts with the reference database (6) to determine the list of potentially relevant databases or sources of information (websites, web pages).
  • a user will enter a query as a text string.
  • the system takes each text string and stores it in the denormal key as a name / value pair.
  • An example of this for enquiries about places might be: User entered “48° 52' N” into string "Latitude” and "5° 7' E” into string
  • Another embodiment of this invention specific to locating financial information defines a key syntax where the key takes the form <company>.<country> where
  • Denormal or unnormalised keys must be normalised so as to be capable of further processing by the system. This is achieved by the key normalisation module. This module normalises a denormal key and produces a normalised key. This process is performed by determining all the possible key entries in the symbol table stored in the reference database that match the specified key.
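The normalisation step can be sketched as a lookup of the denormal key against the symbol table, yielding a normalised key only when exactly one entry matches. Everything here is illustrative: the field names, the word-based matching on `Name`, and every symbol other than BT-A.GB are assumptions made for the example.

```python
# Hypothetical symbol table; only BT-A.GB appears in the source text.
SYMBOL_TABLE = [
    {"Company": "BT-A", "Country": "GB", "Name": "BRITISH TELECOM"},
    {"Company": "BAE-X", "Country": "GB", "Name": "BRITISH AEROSPACE"},
    {"Company": "ABF-X", "Country": "GB", "Name": "ASSOCIATED BRITISH FOODS"},
]


def parse_symbol(text):
    """Split a symbol of the form <company>.<country> into key fields."""
    company, _, country = text.partition(".")
    return {"Company": company, "Country": country}


def candidates(denormal_key, symbol_table=SYMBOL_TABLE):
    """Return every symbol-table row consistent with the denormal key."""
    def field_ok(field, wanted, row):
        actual = row.get(field, "").upper()
        if field == "Name":
            # Assumed rule: every word of the query must appear in the name.
            return set(wanted.upper().split()) <= set(actual.split())
        return wanted.upper() == actual

    return [
        row for row in symbol_table
        if all(field_ok(f, v, row) for f, v in denormal_key.items() if v)
    ]


def normalise(denormal_key):
    """Return (normalised_key, matches); the key is None unless one row matches."""
    matches = candidates(denormal_key)
    if len(matches) == 1:
        row = matches[0]
        return {"Company": row["Company"], "Country": row["Country"]}, matches
    return None, matches


ambiguous_key, ambiguous_matches = normalise({"Name": "BRITISH"})
unique_key, _ = normalise({"Name": "British Telecom"})
symbol_key, _ = normalise(parse_symbol("BT-A.GB"))
```

When `normalise` returns `None`, the list of matches is what a user interface could present as the "no information found" or "ambiguous query" prompt.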
  • The system may store the symbol table as a table in a relational database, with the columns corresponding to key fields (i.e. characteristics of the entity, query or object) and the rows to valid symbols (i.e. company or country symbols for a system handling requests for company information); for example, the symbol table may be of the following format:
  • the reference database includes a listing of symbols indexed together with the possibly relevant sources of information.
  • Information is held in the database in the form of two dimensional indexed tables with an arbitrary number of rows and columns.
  • the index of possibly relevant sources of information might include the following websites:
  • The scope matching module (2) is responsible for determining the range of websites referenced by the normalised key (25) (e.g. BT-A.GB). This module uses the scope definitions (15, 16) from the database (2, 6, 12) to determine which websites (e.g. websites containing information on telecommunications companies, British companies, news items and/or the company's own websites) are potentially relevant to the user's request.
  • the result produced by this module is a list of filtered websites (27) such as that shown above.
  • the reference database (6) holds information defining the field or type of possible enquiry (e.g. company or financial information) and describing all the referenced websites. This information may be supplied by an online or offline process through manual entry or automatic information transfer (28).
  • the scope matching module then takes a normalised key and determines the websites that contain potentially relevant information for the entity described by that key.
  • the reference database includes a descriptions database of each website, database or information source covered by the system. This might contain the following information for each such source of information.
  • a non-wildcard relational database table (e.g. 15, 16) has a column corresponding to each key field and a row corresponding to each entity the website has information on.
  • KBWW Scope might be used to describe stocks handled by the Robertson Stephens website and have entries such as:
  • a wildcard relational database table also has a column corresponding to each key field. Each row also has a description of each entity in the form of a list of name / value pairs. However, the values may contain a textual wildcard indicated by the character "*". In testing a key against the value in the relational database table, the module will allow any character or string of characters to be matched by the character "*".
  • each key in the table is tested sequentially.
  • the row data is matched against each key field. If a match occurs, then the key can be said to be matched.
  • the matching process is performed on each website or database table in turn.
  • A pre-defined priority value is then retrieved from the reference database for each source of information (website or database).
  • The list of returned websites is sorted according to this value. Websites with a priority value of -1 are excluded from the result.
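The ordering step can be sketched as below. The site identifiers and priority values are invented, and two points are assumptions because the source does not settle them: higher values are listed first, and a site with no stored priority is treated as excluded.

```python
def rank_sites(site_ids, priorities):
    """Drop sites with priority -1 and order the rest by priority value.

    Assumptions: larger values rank first; a missing priority counts as -1.
    """
    kept = [site for site in site_ids if priorities.get(site, -1) != -1]
    return sorted(kept, key=lambda site: priorities[site], reverse=True)


# Hypothetical priority values held in the reference database.
priorities = {"example.news": 5, "example.quotes": 10, "example.charts": -1}
ranked = rank_sites(["example.news", "example.quotes", "example.charts"], priorities)
```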
  • the key mapping module (3) is responsible for converting the normalised key (25) into a form understood by each website selected by the scope matching module.
  • the key mapping module takes a normalised key or input and maps it to a form recognised by each website.
  • the module takes as inputs the normalised key and a list of website definitions.
  • Each key may be mapped using a table-driven or algorithmic approach.
  • the type of mapping is defined in the website definition, held in the reference database.
  • the website definition will reference one or more tables in the database, which define mappings between the standard keys and the site specific keys.
  • the following relational database tables are defined for use in a table-driven mapping a) Site_Symbol_Map (18)
  • A flowchart illustrating operation of the key mapping module is shown in Figure 6.
  • Step 31 decides whether a table-driven or algorithmic key mapping is being used. This information is provided in the website definition in the reference database.
  • Step 32 accesses the relational database tables in the description.
  • In step 33, the Site_Symbol_Map relational database table referenced by the description is indexed.
  • For a key with fields named N_0 to N_m and containing values V_0 to V_m, the module will perform an SQL query of the form:
  • When step 35 is executed, each destination column with a name of the form To_xxx (corresponding to a source column From_xxx) is copied to the key.
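A table-driven lookup of this kind can be sketched with an in-memory SQLite database. The exact SQL used by the system is not reproduced in the text, so the query shape below (one `From_<field> = ?` condition per key field) is an assumption, as are the table contents and the site's "BT.A"/"UK" symbology.

```python
import sqlite3


def map_key(conn, table, key):
    """Look up a key in a Site_Symbol_Map-style table and return the To_ columns.

    Table and field names are assumed to come from the trusted website
    definition, never from user input, because they are interpolated into SQL.
    """
    where = " AND ".join(f'"From_{name}" = ?' for name in key)
    cursor = conn.execute(f'SELECT * FROM "{table}" WHERE {where}', list(key.values()))
    row = cursor.fetchone()
    if row is None:
        return None  # no discrete mapping for this key
    columns = [column[0] for column in cursor.description]
    # Copy each To_xxx column into the mapped key under the name xxx.
    return {c[3:]: v for c, v in zip(columns, row) if c.startswith("To_")}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Site_Symbol_Map "
    "(From_Company TEXT, From_Country TEXT, To_Company TEXT, To_Country TEXT)"
)
# Hypothetical mapping row: the site is imagined to use 'BT.A' and 'UK'.
conn.execute("INSERT INTO Site_Symbol_Map VALUES ('BT-A', 'GB', 'BT.A', 'UK')")

site_key = map_key(conn, "Site_Symbol_Map", {"Company": "BT-A", "Country": "GB"})
missing = map_key(conn, "Site_Symbol_Map", {"Company": "VOD", "Country": "GB"})
```

Key values are passed as bound parameters, so a user-entered symbol cannot alter the query text.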
  • Step 36 gets the first entry in the wildcard relational database table.
  • In step 37, a wildcard comparison is performed with each column value using the rules in table 1. If a match occurs, step 38 maps the values to the result.
  • Step 39 moves to the next row in the Site_Symbol_Map_Wild relational database table.
  • Step 40 checks whether the end of the table has been reached. If so, this indicates that no match has occurred. In this case, the terminal step 41 is reached and the website is not processed further, otherwise, step 37 is executed again.
  • Step 42 allows for field names to be remapped. If a relational database table Site_Field_Map is specified, this table is used to change the names of key fields in step 43. The mapped key is available at step 44.
  • a procedural script is defined to translate the key for a website.
  • the text of this script, together with an indication of the programming language is provided in the website description.
  • Script execution is provided by the basic operating platform the system is running (such as Microsoft Windows NT) or by other methods in the prior art.
  • the script execution facility is required to support the calling of methods on objects known to the script.
  • Step 45 retrieves the script from the website definition in the reference database.
  • the SetKey method of this script is called to set the key value.
  • In step 47, the GetDestKey method is called to retrieve the key value. These methods are to be defined and programmed by the person creating the website definition.
  • Step 42 then performs any field mapping in the same way as for a table driven mapping.
  • the key mapping module uses key mapping tables (17, 18) in the reference database (6, 12) to perform key translations.
  • the result produced by this module is a list of site keys (30); a tailored query for each database or website.
  • the key mapping module produces a list of website access instructions (50) which can be interpreted by the database access or interrogation module (51) to interrogate selected websites.
  • the database access or interrogation module (51) is implemented as mobile code which executes in the user's database or Internet browser program. This module is responsible for interpreting a list of website access instructions (50) and providing the user with an interface which allows them to select websites that they wish to display information from. The module is then responsible for sending instructions to the websites in the form of uniform resource locators (URLs) to cause the websites to display data on the entity that the user requested.
  • URLs: uniform resource locators
  • the database access module takes a normalised and mapped key and a site definition as inputs. These are used to create a set of instructions to access information on the entity identified in the query made by the user and in the websites and/or databases covered by the system. Website access instructions are created using one of two methods: template or algorithmic.
  • the template method provides a template into which key fields are substituted.
  • An appropriate access script is a data structure containing at least the fields defined in Table 2.
  • ObjText (array of strings): each element contains the text of a URL if ObjData is TRUE, or displayable data otherwise. This text may contain a field name delimited by escape characters. This field can be substituted for a field value.
  • ObjFramed (array of boolean flags): each element is TRUE if the object referenced by ObjText can be loaded in a frame (sub-window in a web browser), otherwise the object will be loaded in a new browser window.
  • ObjData (array of boolean flags): each element is TRUE if the object is a page of data for display, otherwise the object is a URL.
  • WaitLoad (array of boolean flags): each element is TRUE if the browser should wait for the page to load before progressing, otherwise the browser should proceed after Delay.
  • Delay (array of numbers): each element represents the number of seconds to delay before loading the next frame.
  • the template is retrieved from the website description in the reference database. Any delimited fields in the ObjText array of the template are then substituted for the value the field name defines in the normalised and mapped key.
  • the result is a set of instructions which can be used to access the website.
  • the algorithmic method generates an access script by calling procedures in a scripting language. The text of this script, together with an indication of the programming language is provided in the website description. Script execution is provided by the basic operating platform the system is running (such as Microsoft Windows NT) or by other methods in the prior art. The script execution facility is required to support the calling of methods on objects known to the script.
  • a method, SetKey is called on the script to set the mapped key fields.
  • a further method, GetAccessScript is used to retrieve the complete access script structure.
  • This module executes in the user's browser program in a typical embodiment of this invention.
  • the user will be presented with a user interface which allows them to select which of the websites selected by the system during the scope matching process to display information from.
  • the database access or interrogation module provides the ability to "drill down" into any website or database and thus simulate a user's manual interaction with the site to access a particular page.
  • the input to this module is an access script as defined in Table 2.
  • a flowchart illustrating the operation of this module is shown in Fig 7.
  • the module iterates through the arrays in the access script.
  • the iterator variable N is set to 0. This indexes the access script arrays described in Table 2. These arrays are assumed to be zero based.
  • Steps 56, 57, 58 decide how the text in ObjText is to be displayed.
  • ObjText may hold HTML for immediate display or a URL to be loaded into a browser, according to the state of ObjData.
  • data or URLs may be displayed in a new browser window or in a browser frame.
  • Step 59 displays HTML as immediate data in a browser frame.
  • Step 60 displays HTML as immediate data in a new window.
  • Step 61 loads a URL into a new browser frame.
  • Step 62 loads a URL into a new browser window.
  • Step 63 decides whether to wait for a page to load in the browser. If so, the script will wait in step 64 before loading the next page. Step 65 implements a delay before loading the next page.
  • The iterator N is incremented in step 66.
  • In step 67, N is tested against the size of the script arrays. If this is reached, then terminal step 68 is executed.
  • Otherwise, the process is repeated from step 56 until all the script instructions have been processed.
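The template method and the Fig 7 loop described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `%Field%` delimiter, the function names `fill_template` and `plan_actions`, and the textual "actions" returned in place of real browser calls are all assumptions made for illustration. Only the five array names come from Table 2.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Assumption: field names are delimited by '%' characters, e.g. "%Symbol%".
# The text only says "escape characters" without naming them.
FIELD_PATTERN = re.compile(r"%(\w+)%")


@dataclass
class AccessScript:
    """Parallel, zero-based arrays mirroring the fields of Table 2."""
    ObjText: List[str]
    ObjFramed: List[bool]
    ObjData: List[bool]
    WaitLoad: List[bool]
    Delay: List[float]


def fill_template(template: AccessScript, key: Dict[str, str]) -> AccessScript:
    """Substitute delimited field names in ObjText with values from the mapped key."""
    def substitute(text: str) -> str:
        # Field names missing from the key are left untouched rather than guessed.
        return FIELD_PATTERN.sub(lambda m: key.get(m.group(1), m.group(0)), text)

    return AccessScript(
        ObjText=[substitute(t) for t in template.ObjText],
        ObjFramed=list(template.ObjFramed),
        ObjData=list(template.ObjData),
        WaitLoad=list(template.WaitLoad),
        Delay=list(template.Delay),
    )


def plan_actions(script: AccessScript) -> List[str]:
    """Walk the arrays with an index N and describe what a browser would do."""
    actions: List[str] = []
    n = 0  # iterator variable N, starting at 0
    while n < len(script.ObjText):
        # ObjData selects immediate HTML or a URL; ObjFramed selects frame or window.
        kind = "show HTML" if script.ObjData[n] else "load URL"
        target = "frame" if script.ObjFramed[n] else "new window"
        actions.append(f"{kind} in {target}: {script.ObjText[n]}")
        # WaitLoad selects waiting for the page, otherwise the fixed Delay applies.
        if script.WaitLoad[n]:
            actions.append("wait for page load")
        else:
            actions.append(f"delay {script.Delay[n]} s")
        n += 1
    return actions
```

Returning a list of described actions keeps the sketch runnable outside a browser; a real interrogation module would open frames or windows at the points where the strings are appended.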

Abstract

A system for searching a distributed collection of databases (4, 5) comprising a number of databases connected to each other by a communications network system including: query entry means (7, 8) for entering a request for information on a subject, object or matter, or a group of subjects, objects or matters, a first memory (2, 6, 12) storing index entries, each index entry including a portion representing a subject, or a group of subjects, objects or matter, object or matter on which information might be sought and one or more locations entries indicating which of the databases may contain information in the respective subject, object or matter or group of subjects, objects or matters, and a second memory (3, 6) storing database interrogation modules, routines or sub-routines for converting a request for information received by the data entry means into a set of appropriate instructions for each of the databases.

Description

SYSTEM AND METHOD FOR DATABASE SEARCHING
The present invention is concerned with systems and methods for retrieving material from computerised information systems. Such systems and methods are sometimes referred to as search engines as they search through databases such as the Internet, Internet websites or collections of Internet websites.
A large number of computers and/or databases have been interconnected using common technologies to form a single network ("the World-Wide-Web"). The World-Wide-Web consists of a large number of separate databases or websites which can be accessed via a telecommunications link and then viewed or interrogated. The different World-Wide-Web or Internet databases or websites are stored at computers which can be located at any location provided that they are connected to a telecommunications network. Computers and/or databases within many organisations are also interconnected using the same techniques (so-called "Intranets").
Each computer or database in the network may contain information of interest to users of the network. This information may be structured in each computer or database in various ways and using various techniques. The providers or compilers of each Internet website will select and implement a user interface for their website so as to allow users to gain access to information from the website or database. Websites typically include a number of different possible information displays known as web pages. These are generated in response to instructions or requests posed by the user via the website user interface. The information may be delivered as a formatted, human-readable page, or in a computer readable format designed for further processing by users' software applications. For a user to retrieve a particular item of information from a collection of distributed websites such as the Internet or an Intranet, he or she needs to locate the website(s) or database(s) that holds the relevant information, learn to use the user interface of the website(s) and then use the interface to retrieve the web-page, database section or display containing that information.
People using databases such as the Internet often require information on a specific entity. Examples of such entities include, but are not restricted to, companies, financial instruments, books, geographical locations and chemical substances.
In order to obtain such information from the Internet or Intranets by current techniques the user has to identify suitable websites and proceed to obtain the information separately from within each of those sites. This is a time-consuming task even if the user already has experience of the different websites or databases and knows how to use their different user interfaces. It is a particularly time-consuming and awkward task when the user wishes to interrogate or retrieve information from websites or databases with which he is not familiar.
A number of techniques exist to locate information on the Internet and similar networks. These include web directories, text search engines and meta-search engines.
Web directories rely on a manual process to locate each relevant individual web page and place it in a directory. A team of editors review websites and input page addresses into categories, the results being stored in a database. Users can then browse through the database. Categories are generally organised hierarchically, e.g. Companies/UK/Telecoms/British Telecom. The directory usually also provides a full-text search facility which allows users to look for information using keywords, phrases or boolean logic. The directory compilation process is expensive due to the large number of editors required. The expense and time involved in reviewing possibly relevant websites also limits the amount of information that can be indexed at a reasonable cost in a reasonable time-scale.
For the user, the directory process can be quite precise, but can miss out valuable information due to sites not having been found and indexed.
Examples of directories are Yahoo (www.yahoo.com), Looksmart (www.looksmart.com) and dmoz Open Directory (www.dmoz.org).
Text search engines use a "web crawler" to locate content from the World-Wide-Web. This is a computer program that, starting with a number of initial website or webpage addresses with information of relevance to a particular query or type of query, follows the links between web pages to try and locate all linked pages on the World-Wide-Web. Most text search engines also allow web users and website owners to submit pages for indexing.
All data located by the web crawler is stored in a searchable database. This database can then be searched or interrogated. The database typically allows full text searching using keywords, phrases or boolean logic. The searched text stored in the database can include the web page title, the full text of the website and "metadata" which is not displayed but provided by the web-site or database creator to aid search engines.
Text search engines provide coverage of every web page that is immediately accessible to the "crawler" used by the engine to gather content. This allows a text search to provide a high level of completeness, at the expense of results that may be irrelevant. Text search engines are intrinsically unable to locate information within websites that provide a database of content on multiple objects, such as newspaper archives, stock exchange quote service, airline booking services, etc. The web crawler will typically locate the front page of these sites and be unable to advance beyond the barrier of the user interface used by each particular website or database - preventing such information from being effectively indexed.
Some enhancements to text search engines are as follows:
Natural language processing is used to match the results generated by the search against the initial request inputted by the user and obtain a statistical measure of the relevance of information to the user's query.
Weighting is given to words in the database according to frequency of appearance, proximity and nearness to the start of the document.
Proper names are distinguished by capitalisation.
"Stop words" such as 'a', 'or', 'and', which are irrelevant to searches, are removed, and words are "stemmed" to their root so as to pick up variations when searching (e.g. 'searchable', 'searching', 'searched' are stemmed to 'search'; 'played', 'playing' are stemmed to 'play').
The result links which have been most frequently considered to be of interest by previous users are given priority in ordering results (e.g. Google - www.google.com)
Document comparison techniques are used, e.g. by allowing users to select a link and look for similar documents. Such systems typically use statistical matching techniques to pick up possibly similar documents.
All of these techniques are limited when locating objects such as companies or places where the context is not understood by the search algorithm.
One system (Ask Jeeves - www.askjeeves.co.uk, or www.ask.com) allows users to enter a query in natural language. This is then statistically matched against a database of potential questions indexed against possibly relevant websites. The results are then returned as a list of those websites indexed in the database together with the statistically relevant potential questions. The database of websites and questions is typically maintained manually. Only questions similar to those held in the database are likely to receive a relevant result.
Meta-search engines (such as Dogpile, Copernicus, Sherlock) allow users to enter a search request which will then be presented to multiple search engines. This allows users to get the benefit of multiple technologies in finding information. Meta-search techniques tend to return a large amount of information which needs to be manually reviewed by the user.
The present invention in a first aspect provides a system for searching a distributed collection of databases comprising a number of databases connected to each other by a communications network system including: query entry means for entering a request for information on a subject, object or matter, or a group of subjects, objects or matters; a first memory storing index entries, each index entry including a portion representing a subject, object or matter (or a group of subjects, objects or matters) on which information might be sought and one or more locations entries indicating which of the databases may contain information in the respective subject, object or matter or group of subjects, objects or matters; and a second memory storing database interrogation modules, routines or sub-routines for converting a request for information received by the data entry means into a set of appropriate instructions for each of the databases.
The present invention in a second aspect provides a method for obtaining information from a collection of databases, comprising: entering a query for information; comparing the query to a database of descriptions of the content or type of context of the databases constituting the collection of databases; generating a list of potentially relevant databases from said comparison of the query to the database descriptions; and converting or translating the query into an enquiry signal or signals recognised or processable by each of the potentially relevant databases.
The present invention in a third aspect provides a computer program comprising program code means for performing the method set out above and in claims 12 to 17.
The present invention in a fourth aspect provides a computer program product comprising program code means stored on a computer readable medium for performing the method set out above and in claims 12 to 17.
The objects of one or more preferred embodiments of the invention include providing systems or methods capable of:
Finding accurate information on specific objects such as a company or place.
Locating relevant information amongst the large amount of irrelevant content on the Internet, which may contain words that match the name of the object. For instance, there are over 50,000 directly accessible web pages containing the words "British Telecom".
Locating information which is provided by an interactive website and is thus inaccessible to conventional web crawler based search engines
Classifying results according to the type of content, e.g. news articles, charts, quotes, airline timetables, hotel information.
Allowing a user to specify a precise query and find a clearly defined set of results in return
Allowing a large number of pieces of information (for example, 480 websites potentially storing data on 80,000 companies) to be indexed at acceptable cost
Enabling a user to be always provided with links to information provided by the latest new and updated web services, thus avoiding missing out on valuable knowledge
Avoiding the need for users to learn the intricacies of how to operate a large number of websites in order to obtain the information they require
Permitting unstructured information on the Internet to be linked to structured databases such as software to manage investment portfolios or take airline bookings, thus allowing users to be automatically delivered the content they require.
Preferred embodiments of the invention will now be described with reference to the attached figures in which:
Figure 1 is a block diagram illustrating the processing of a query by a system embodying the invention;
Figure 2 is a block diagram illustrating the architecture of a system embodying the invention;
Figure 3 is a block diagram illustrating the database management of a system embodying the invention;
Figure 4 is a diagrammatic illustration of the structure of a reference database used by the system of figures 1 to 4;
Figure 5 is a block diagram illustrating the components of a system embodying the invention; and
Figures 6 and 7 are flow charts illustrating the processing of a query by the system of figures 1 to 5.
Preferred embodiments of the present invention are concerned with methods and/or systems for locating information relevant to a specific entity or matter held in a number of separate databases or websites or stored on a network of computers connected to a telecommunications system. The entity or matter on which information might be sought can be any real or abstract object or class of objects about which information exists in one or more databases. The databases or websites may be on the public World-Wide-Web or on a private Intranet.
The term database when used in this document includes all possible stores of information including websites and web pages.
The embodiments of the invention described below locate information for a particular class or type of entities or subject matters, for example companies or financial instruments. Such a class or type of entities is referred to as a "domain".
A descriptions database, holding descriptions of databases, is held on a computer or a number of computers. This database contains descriptions of the type of information held in a number of databases or websites. These descriptions can be entered manually into the descriptions database by the system operator. Alternatively a website or database provider may publish its own description over the World-Wide-Web or over an Intranet in a manner such that this information will be automatically retrieved by the system and stored in the descriptions database.
A user accesses the system through a user interface which is used to present a query to the system. The user interface prompts the user to identify the entity or matter they require information on and passes this information onto the system.
The system may allow one or more methods to be used to identify an entity or subject matter and present a query about such an entity or subject matter. Examples of these methods include: (a) the identification of companies by company name, symbols such as Reuters RIC codes, S & P codes or stock exchange ticker codes; (b) the identification of places or locations by postcode or latitude and longitude of a place. The data structure identifying an entity or subject matter is described herein as a "Key".
The system will then compare the query presented by a user with the descriptions database and select those websites or databases in its descriptions database which are of possible relevance to the entity or subject matter the user has described in his or her query. In a possible embodiment of the invention these selected possibly relevant databases or websites can then be selected further and ordered according to a number of criteria: for example, a priority level set by the manager of the system, the national language used on the website, or the user's own preferences.
The system also includes a database access instructions database which translates queries entered into the system into instructions which are appropriate for interrogating or viewing each of the databases or websites covered by the system.
When a query is made by a user, this is compared to the entries in the descriptions database to determine the potentially relevant websites or databases and produce a list of these. The query is then translated into instructions appropriate for these potentially relevant databases and the user is invited to select those databases he or she wishes to check from the list of potentially relevant sources of information. Any such databases, web pages or websites selected are then interrogated using the appropriate instructions from the access instructions database.
Preferred embodiments of the invention produce a consolidated set of instructions which are passed to a user. In a system for searching the Internet they may be passed to the user's Internet browser application together with a suitable software module or page description that can interpret them. The user can then use a common interface to select from the various sites or databases listed by the system and retrieve the information they require from each one. The user interface may also provide the ability to classify sites into logical categories appropriate for the field of the search being undertaken.
Referring to figure 1 and considering an embodiment of the invention for obtaining company and/or financial information, a query such as a request for information on a particular company is compared to an indexed directory (1) of company names and company symbols. A symbol representing the company on which information is sought is generated and passed or communicated to a relevance filter or scope matching module (2) comprising a descriptions database (12) of descriptions of the databases covered and indexed by the system. The query represented by the company symbol is then compared to the database descriptions (12) and a list of potentially relevant databases or websites is produced and passed to a symbol, key or enquiry mapping module which also receives details of the symbol representing the query posed to the system by a user. The symbol, key or enquiry mapping module then converts the symbol into a signal or signals representing that symbol or company and capable of being recognised and processed by each of the potentially relevant databases.
A user selecting from the list of potentially relevant databases produced by the relevance filter or scope matching module can then interrogate the selected database(s) using the signal(s) created by the symbol or enquiry mapping module and the URL(s) (uniform resource locators) stored in the system and identifying the location of the various databases or websites.
The databases covered by the system can be either public (e.g. Internet websites) (4) or private (a company Intranet) (5). Figure 2 illustrates a system covering private and public databases or sources of information.
The heart of the system is a reference database module (6) including the directory of symbols, the database descriptions and the access instructions. The database module may be implemented using any relational database (such as Microsoft SQL Server, Sybase or Oracle) or similar technology in the prior art.
Information on the database derives from a manual analysis of two kinds of information; the set of objects the system is locating information on and the set of private and public databases containing the information.
For example: in the case of the "finance" domain, the directory of objects (1) is a list of companies with a unique stock symbol for each company; in the case of the "travel" domain, it is a list of places with their postcode and geographical position.
Figure 3 shows one possible embodiment of the database creation process.
The reference database (6) consists of a collection of tables. Each table is in turn a collection of rows and each row a collection of columns. A column may hold any numeric or textual value. The database is capable of efficiently locating any row based on the values in its columns.
Information on the set of objects can be manually entered into the database by a data input operator (7) using an input form (8). This will require the operator to enter all the information (name, symbol, alternative codes) for each object.
Alternatively all or part of the directory of objects can be imported from another source such as a commercially available dataset (9) using a data translation module (16). Data translation packages are supplied with most commercially available relational database systems.
In both cases, the information is stored in a relational database table called the object directory (1) in the reference database (6).
Each private and public database known to the system is manually analysed to understand its description, purpose and structure. This information forms a set of database definitions (11). In a possible embodiment of the system these are written in the Extensible Markup Language (XML). These can then be translated into the database using a commercially available XML parser (13). The result is a database description table (12) in the reference database (6).
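As an illustration only, the translation of one XML database definition into a row for the database description table might look like the following Python sketch. The XML element and attribute names (`site`, `name`, `description`, `scope`, `table`, `wild`) and the example site are hypothetical, since the text does not give the definition schema; the column names Site_Id, Site_Scope and Site_Scope_Wild are taken from the Site_Table described with reference to figure 4.

```python
import xml.etree.ElementTree as ET

# Hypothetical database definition; the real schema is not given in the text.
SITE_DEFINITION = """
<site id="example.finance.quotes">
  <name>Example Quotes</name>
  <description>Share prices for listed companies</description>
  <scope table="Example_Scope" wild="Example_Scope_Wild"/>
</site>
"""


def parse_site_definition(xml_text):
    """Turn one XML definition into a dict shaped like a Site_Table row."""
    root = ET.fromstring(xml_text)
    scope = root.find("scope")
    return {
        "Site_Id": root.get("id"),
        "Name": root.findtext("name"),
        "Description": root.findtext("description"),
        # Names of the per-site scope tables (discrete and wildcard).
        "Site_Scope": scope.get("table"),
        "Site_Scope_Wild": scope.get("wild"),
    }


row = parse_site_definition(SITE_DEFINITION)
```

The resulting dictionary could then be written to the reference database with an ordinary SQL INSERT.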
Referring to figure 4:
a) The Object directory or Object_Table (1) contains descriptions of all the entities the system understands. For example, in the financial domain this is a relational database table of company symbols, while in the travel domain this is a gazetteer of locations. Columns in this table relate directly to fields in the keys understood and used within the system. Rows in this table correspond to entities understood by the system.
b) The descriptions database or Site_Table (12) contains descriptions of each website indexed by the system. Each row in this relational database table corresponds to a website. Each website has a unique identifier (Site_Id) within this table. This identifier takes the form <name>.<name>.<name>.... and can be of arbitrary length subject to practical limitations of the underlying technology. This mechanism allows websites to be defined by different organisations without the risk of name clashes.
The columns of the site relational database table contain description information. This information is dependent on the domain being indexed. It consists of such fields as a full name and description to identify the site to the user, script modules used by the symbol, key or enquiry mapping module and the access instructions module, and indexes to the Scope Tables and Mapping Tables for each database or website.
c) Scope Tables (15, 16) are defined for each indexed website. The names of the scope tables are held in the fields Site_Scope and Site_Scope_Wild in the site description in relational database table (12). The scope tables define the set of keys which the indexed website provides information on.
Site_Scope_Wild points to the wildcard scope relational database table (15). This table holds keys with embedded wildcards, which can match several possible symbols. A wildcard takes the form of a "*" character in the text of the field, which matches any arbitrary string of characters. Keys are held in the columns of the table, with each field of the key stored in a column named to correspond with the name of the field.
Site_Scope points to the non-wildcard scope relational database table (16). This table holds discrete keys (without wildcards). Keys are held in the columns of the table, with each field of the key stored in a column named to correspond with the name of the field.
d) Mapping tables (17, 18) are defined for each indexed website. The names of mapping tables are held in fields Site_Symbol_Map and Site_Symbol_Map_Wild in the site description in relational database table (12). These tables define a mapping between the keys defined for the domain (object directory or table 1) and the keys used by the individual websites, which may use a different symbology. For instance, in the financial domain a website may use the alphanumeric ISIN system to identify financial instruments. In this case, the table referenced by Site_Symbol_Map will define a mapping from the symbology used for the financial domain into ISIN codes.
Site_Symbol_Map points to the non-wildcard mapping relational database table (18). This holds a list of discrete keys in the form used by the domain and their counterparts in the form used by the individual websites. Keys are held in the columns of the table, with each field of the key stored in a column named to correspond with the name of the field.
Site_Symbol_Map_Wild points to the wildcard mapping relational database table (17). This holds keys with embedded wildcards, which can match several possible source keys. The rules used to map wildcard symbols are described in Table 1. (See wildcard mapping rules on page 23). Keys are held in the columns of the relational database table, with each field of the key stored in a column named to correspond with the name of the field.
e) The names of the field map tables (19) are defined in the field Site_Field_Map. This relational database table has columns SrcField and DstField. These columns allow field names in the source key (user input or symbol) to be translated to field names in the destination key (enquiry signal in format recognised by selected database or website).
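A minimal Python sketch of how the scope tables and the field map table could be consulted is given below. It assumes in-memory lists of rows in place of the relational tables, case-sensitive comparison, and hypothetical key field names ("Symbol", "Country"); the function names are illustrative. It covers only the "*" wildcard behaviour stated above, not the wildcard mapping rules of Table 1.

```python
import re


def field_matches(pattern, value):
    """Match a field value against a pattern where '*' stands for any string."""
    # Escape every literal part, then join the parts with ".*" for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def row_matches(row, key, wildcard):
    """Compare every column of a scope row with the same-named key field."""
    for field, expected in row.items():
        actual = key.get(field)
        if actual is None:
            return False
        if wildcard:
            if not field_matches(expected, actual):
                return False
        elif expected != actual:
            return False
    return True


def key_in_scope(key, discrete_rows, wildcard_rows):
    """True if the key appears in the Site_Scope or Site_Scope_Wild rows."""
    if any(row_matches(row, key, wildcard=False) for row in discrete_rows):
        return True
    return any(row_matches(row, key, wildcard=True) for row in wildcard_rows)


def remap_fields(key, field_map_rows):
    """Rename key fields using (SrcField, DstField) pairs; others are kept."""
    renames = dict(field_map_rows)
    return {renames.get(name, name): value for name, value in key.items()}
```

For example, a wildcard scope row such as `{"Symbol": "*", "Country": "GB"}` would place every key whose Country field is GB within the scope of the site.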
FIG. 5 presents a block diagram of the information retrieval system of the present invention, where a user (20) makes a request for information on an entity and receives instructions on how to access that entity that are capable of interpretation by the user's browser software.
The user inputs a query (21) to the system. The input of the query may be implemented using an HTML form or other technique. The query consists of one or more text strings that identify an entity or set of entities within the system's coverage. For example, a user wanting information on the company British Telecom plc may input the text string "BRITISH TELECOM" as his or her query.
The query (21) is interpreted by the key text input module (22). This converts the query to a denormal key (23). This key contains all the information in the user's request in a standardised form. It may reference the requested entity using any valid method understood by the key normalisation module (24). Continuing the example of British Telecom plc, the user's input text string "BRITISH TELECOM" would be stored in a field called "Name" in the key structure. If the user entered a symbol such as BT-A.GB, this would be split and stored in the fields shown in the following table:
Figure imgf000016_0001
Figure imgf000017_0001
The key normalisation module (24) is responsible for converting a denormal key (23) to a normalised key (25). A denormal key may reference zero or more entities. For example, a query about companies entered as BRITISH would pick up all entities whose names include the word BRITISH, among them BRITISH AEROSPACE, BRITISH TELECOM and ASSOCIATED BRITISH FOODS, and would therefore reference more than one entity. A normal key must reference exactly one entity.
When the denormal key (23) references either zero or more than one entities, the user is sent a message (26) to indicate that either no information was found or that their query was ambiguous. For example, the input BRITISH might result in the following prompt:
Company Name                                  Company Symbol
Associated British Foods plc                  ABF.GB
Associated British Ports Holdings plc         ABP.GB
BAE Systems plc (was British Aerospace)       BA.GB
British Airways plc                           BAY.GB
British American Tobacco (BAT)                BATS.GB
British and Malayan Trustees Ltd              B08.SG
British Bloodstock Agency plc                 BSK.GB
British Energy plc                            BGY.GB
British Land Co plc                           BLND.GB
British Sky Broadcasting plc (BSKYB)          BSY.GB
British Telecom plc                           BT-A.GB
British Vita plc                              BVIT.GB
Corus plc (was British Steel/Hoogovens)       CS.GB
Malaysia British Assurance                    1163.MY
In other words, the user is presented with all possible keys and asked to select one. Continuing the example above, if the user wishes to access information on British Telecom plc he will select the relevant symbol (BT-A.GB). When the user selects a key in this manner, the unambiguous key or symbol BT-A.GB is re-presented to the key text input module (22) for further processing. The key normalisation module (24) interacts with the reference database (6) to determine the list of potentially relevant sources of information (databases, websites or webpages).
A user will enter a query as a text string. The system takes each text string and stores it in the denormal key as a name / value pair. An example of this for enquiries about places might be: User entered "48° 52' N" into string "Latitude" and "5° 7' E" into string "Longitude". The denormal key would then contain the fields:
Figure imgf000018_0001
Another embodiment of this invention, specific to locating financial information, defines a key syntax where the key takes the form <company>.<country>, where "company" is a symbol defining a company and "country" is the ISO country code for the country. This is parsed into fields "Company" and "Country" by the key text input module. An example of this might be:
User entered "BT-A.GB" (symbol for British Telecom plc) into string "Symbol". The denormal key will contain the fields:
Figure imgf000019_0001
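The parsing performed by the key text input module can be sketched as follows. This is a minimal illustration only: the function names `parse_symbol` and `build_denormal_key` are hypothetical, and splitting at the last "." is an assumption about how the <company>.<country> syntax is read.

```python
def parse_symbol(text: str) -> dict:
    """Split a '<company>.<country>' symbol into the fields Company and Country."""
    # Assumption: the country code follows the last "." in the symbol.
    company, separator, country = text.strip().rpartition(".")
    if not separator or not company or not country:
        raise ValueError(f"not a <company>.<country> symbol: {text!r}")
    return {"Company": company, "Country": country}


def build_denormal_key(inputs: dict) -> dict:
    """Store each entered string as a name / value pair in the denormal key."""
    key = {}
    for name, value in inputs.items():
        if name == "Symbol":
            # A symbol is parsed into its component fields.
            key.update(parse_symbol(value))
        else:
            key[name] = value
    return key


symbol_key = build_denormal_key({"Symbol": "BT-A.GB"})
place_key = build_denormal_key({"Latitude": "48° 52' N", "Longitude": "5° 7' E"})
```

Here `symbol_key` holds the fields Company and Country, while `place_key` simply carries the two entered strings under their own names.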
Denormal or unnormalised keys must be normalised so as to be capable of further processing by the system. This is achieved by the key normalisation module. This module normalises a denormal key and produces a normalised key. This process is performed by determining all the possible key entries in the symbol table stored in the reference database that match the specified key.
The system may store the symbol table as a table in a relational database, with the columns corresponding to key fields (i.e. characteristics of the entity, query or object) and the rows corresponding to valid symbols (i.e. company or country symbols in a system for requests on company information). For example, the symbol table may be of the following format:
Figure imgf000019_0002
Assuming that we have a key with fields named N0 - Nm and containing values V0 - Vm, the module will perform an SQL query of the form:
SELECT * FROM symbol_table WHERE N0 = V0 ... AND Nm = Vm
This will select zero or more rows which match the key. If the number of rows selected is one, then the column values in the selected row will be copied into the resulting normalised key.
If more than one row is selected, all selected rows will be copied into a denormal key which is passed back to the user. This is then displayed in an appropriate form which will allow the user to select the appropriate key and thus choose the entity they require information on. For example, the enquiry BRITISH will select a number of rows and the system will then display a number of possible normalised queries matching the ambiguous one initially made.
If zero rows are selected, an error is passed back to the user to indicate that they have tried to find an entity for which no valid information exists.
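The three outcomes of key normalisation (one row, several rows, no rows) can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions: the column names Name, Company and Country, the sample rows and the function name `normalise_key` are hypothetical, since the symbol table layout is only given as a figure.

```python
import sqlite3


def normalise_key(conn, denormal_key):
    """Run SELECT * FROM symbol_table WHERE N0 = V0 ... AND Nm = Vm."""
    if not denormal_key:
        raise ValueError("the key has no fields")
    # Column names cannot be bound as parameters, so check them against the schema.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(symbol_table)")}
    unknown = set(denormal_key) - columns
    if unknown:
        raise ValueError(f"unknown key fields: {sorted(unknown)}")
    names = list(denormal_key)
    where = " AND ".join(f'"{name}" = ?' for name in names)
    cursor = conn.execute(
        f"SELECT * FROM symbol_table WHERE {where}",
        [denormal_key[name] for name in names],
    )
    headers = [column[0] for column in cursor.description]
    rows = [dict(zip(headers, row)) for row in cursor.fetchall()]
    if len(rows) == 1:
        return "normalised", rows[0]   # exactly one entity: a normalised key
    if rows:
        return "ambiguous", rows       # several candidates for the user to choose from
    return "not_found", None           # no valid information exists


# Hypothetical symbol table contents for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbol_table (Name TEXT, Company TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO symbol_table VALUES (?, ?, ?)",
    [("British Telecom plc", "BT-A", "GB"), ("British Airways plc", "BAY", "GB")],
)
```

With this sample data, the key Company = "BT-A", Country = "GB" normalises to a single row, whereas a key holding only Country = "GB" is ambiguous and would be passed back to the user for selection.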
As discussed above and illustrated in figures 1 to 4, the reference database includes a listing of symbols indexed together with the possibly relevant sources of information. Information is held in the database in the form of two dimensional indexed tables with an arbitrary number of rows and columns. For an enquiry about British Telecom plc the index of possibly relevant sources of information might include the following websites:
Name of website                      Type of information
Teletext Realtime                    Realtime UK stock price
Yahoo                                Stock price information
Money Extra                          UK Quote and 1yr chart
Citywire                             UK quote and mini-chart
MarketEye                            UK quote and chart (similar to TOPIC Terminal)
Hemmington Scott                     UK quotes, trades, etc
Stockpoint                           UK full quote
UKInvest                             UK quote service
Comdirect                            Quotes from Comdirect
Yahoo 5 day                          Simple stock chart for 5 day period
Stockpoint quick chart 1yr           UK 1yr chart
Yahoo intraday                       Simple stock chart for current day
Stockpoint interactive chart 1yr     UK 1yr chart with interactive analysis
Finance Net daily (1 year)           UK stock charts (text in French)
Yahoo intraday                       Simple stock chart for current day
Home Page                            British Telecom's own website
Yahoo                                Yahoo research information
The Motley Fool                      Popular discussion service - UK version
Wright Research Centre               Research page with fundamentals, earnings estimates, etc.
MarketEye Forum (Eye-to-eye)         Discussion on UK companies
Multex                               UK Broker Reports
Hemmington Scott                     Database of financial and background data on UK companies
MarketEye                            Summary financials on UK companies
Stockpoint                           UK company profile
Datastream Insite summary            UK Company summary financials, chart, etc
Yahoo Profile                        Yahoo fundamental data on UK companies
UKWire                               RNS news service
Yahoo News                           News on UK companies from various sources
Moreover.com Search                  Aggregated news headlines from various sources
Quote.com                            US stock quote service
Bigcharts.com                        US stock chart service
Bank of England                      Home page of UK central bank
The scope matching module (2) is responsible for determining the range of websites referenced by the normalised key (25) (e.g. BT-A.GB). This module uses the scope definitions (15, 16) from the database (2, 6, 12) to determine which websites (e.g. websites containing information on telecommunications companies, British companies, news items and/or the companies' own websites) are potentially relevant to the user's request. The result produced by this module is a list of filtered websites (27) such as that shown above.
The reference database (6) holds information defining the field or type of possible enquiry (e.g. company or financial information) and describing all the referenced websites. This information may be supplied by an online or offline process through manual entry or automatic information transfer (28).
The scope matching module then takes a normalised key and determines the websites that contain potentially relevant information for the entity described by that key. The reference database includes a description of each website, database or information source covered by the system. An example of such a description in XML might be:
<BRANCH NAME="Financial.Sites.Test.Datastream.Insite.Summary">
  <TITLE>Datastream Insite summary</TITLE>
  <TYPE>Fundamentals</TYPE>
  <STANDARDSCOPE>
    <TABLE NAME="DSISScope" COLS="1">
      <TH TYPE="WILDKEY">Symbol</TH>
      <TR><TD>*.GB</TD></TR>
    </TABLE>
  </STANDARDSCOPE>
  <SYMBOLOGY>
    <TOSYMBOLMAP>
      <TABLE REF="ICVSymbolMap"></TABLE>
    </TOSYMBOLMAP>
    <TOFIELDMAP>
      <TABLE NAME="DSISFieldMap" COLS="2">
        <TH>SrcField</TH><TH>DstField</TH>
        <TR><TD>Company</TD><TD>code</TD></TR>
      </TABLE>
    </TOFIELDMAP>
  </SYMBOLOGY>
  <URL>http://www.datastreaminsite.com/plc.asp</URL>
  <FORMAT>Webpage</FORMAT>
</BRANCH>
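A description of this kind could be read into a simple structure along the following lines. This is a sketch only: the XML embedded below is a cleaned-up reading of the example description (the "*.GB" wildcard and the URL punctuation are reconstructed assumptions), and the function name `read_description` and the dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# Cleaned-up reading of the example description; details are assumptions.
DESCRIPTION = """
<BRANCH NAME="Financial.Sites.Test.Datastream.Insite.Summary">
  <TITLE>Datastream Insite summary</TITLE>
  <TYPE>Fundamentals</TYPE>
  <STANDARDSCOPE>
    <TABLE NAME="DSISScope" COLS="1">
      <TH TYPE="WILDKEY">Symbol</TH>
      <TR><TD>*.GB</TD></TR>
    </TABLE>
  </STANDARDSCOPE>
  <SYMBOLOGY>
    <TOSYMBOLMAP><TABLE REF="ICVSymbolMap"></TABLE></TOSYMBOLMAP>
    <TOFIELDMAP>
      <TABLE NAME="DSISFieldMap" COLS="2">
        <TH>SrcField</TH><TH>DstField</TH>
        <TR><TD>Company</TD><TD>code</TD></TR>
      </TABLE>
    </TOFIELDMAP>
  </SYMBOLOGY>
  <URL>http://www.datastreaminsite.com/plc.asp</URL>
  <FORMAT>Webpage</FORMAT>
</BRANCH>
"""


def read_description(xml_text):
    """Extract the scope, symbol map reference and field map from a description."""
    root = ET.fromstring(xml_text)
    field_map = {}
    for row in root.findall("./SYMBOLOGY/TOFIELDMAP/TABLE/TR"):
        source, destination = (cell.text for cell in row.findall("TD"))
        field_map[source] = destination
    return {
        "name": root.get("NAME"),
        "title": root.findtext("TITLE"),
        "type": root.findtext("TYPE"),
        "scope": [cell.text for cell in root.findall("./STANDARDSCOPE/TABLE/TR/TD")],
        "symbol_map": root.find("./SYMBOLOGY/TOSYMBOLMAP/TABLE").get("REF"),
        "field_map": field_map,
        "url": root.findtext("URL"),
        "format": root.findtext("FORMAT"),
    }


site = read_description(DESCRIPTION)
```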
Therefore, for each website or database definition known to the system, there exists one or more tables in a relational database that describe the set of keys or queries that website provides information on. These tables may take one of the following forms:
a) A non-wildcard relational database table (e.g. 15, 16) has a column corresponding to each key field and a row corresponding to each entity the website has information on.
For example; a table KBWW Scope might be used to describe stocks handled by the Robertson Stephens website and have entries such as:
Name of website/database - Teletext Realtime
Figure imgf000024_0001
b) A wildcard relational database table also has a column corresponding to each key field. Each row also has a description of each entity in the form of a list of name / value pairs. However, the values may contain a textual wildcard indicated by the character "*". In testing a key against the value in the relational database table, the module will allow any character or string of characters to be matched by the character "*".
Name of website/database
Teletext Realtime
Figure imgf000024_0002
In searching a non-wildcard relational database table for a key with fields named N0 - Nm and containing values V0 - Vm, the module will perform an SQL query of the form:
SELECT * FROM scope_table WHERE N0 = V0 ... AND Nm = Vm
This will select one row if the key is matched, otherwise zero rows.
To search a wildcard relational database table, each key in the table is tested sequentially. The row data is matched against each key field. If a match occurs, then the key can be said to be matched.
The matching process is performed on each website or database table in turn.
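The two kinds of scope test described above can be sketched as follows, with plain lists of dictionaries standing in for the relational database tables. The function names and the sample rows are hypothetical. The "*" wildcard is translated to a regular expression so that no other character is treated specially.

```python
import re


def wildcard_match(pattern: str, value: str) -> bool:
    """Return True if value matches pattern, where "*" matches any string."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def key_in_scope(key, discrete_rows, wildcard_rows):
    """Test a normalised key against a site's non-wildcard and wildcard scope tables."""
    # Non-wildcard table: equivalent to WHERE N0 = V0 ... AND Nm = Vm.
    for row in discrete_rows:
        if all(row.get(name) == value for name, value in key.items()):
            return True
    # Wildcard table: each row is tested sequentially, field by field.
    for row in wildcard_rows:
        if all(name in row and wildcard_match(row[name], value)
               for name, value in key.items()):
            return True
    return False


# Hypothetical scope: one discrete entry plus every company with country GB.
discrete = [{"Company": "IBM", "Country": "US"}]
wild = [{"Company": "*", "Country": "GB"}]
bt_in_scope = key_in_scope({"Company": "BT-A", "Country": "GB"}, discrete, wild)
```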
A pre-defined priority value is then retrieved from the reference database for each source of information (website or database). The list of returned websites is sorted according to this value. Websites with a priority value of -1 are excluded from the result.
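The priority step can be sketched as a filter followed by a sort. The text does not state the sort direction, so this sketch assumes that higher values are listed first; the function name `rank_sites` and the priority values are hypothetical.

```python
def rank_sites(matched_sites, priorities):
    """Drop sites with priority -1 and sort the rest by priority."""
    kept = [site for site in matched_sites if priorities[site] != -1]
    # Assumption: a higher priority value means the site is listed earlier.
    return sorted(kept, key=lambda site: priorities[site], reverse=True)


# Hypothetical priority values for three matched websites.
ranked = rank_sites(
    ["Yahoo", "Citywire", "Quote.com"],
    {"Yahoo": 5, "Citywire": 9, "Quote.com": -1},
)
```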
Alternative systems may extend this module to include additional matching and filtering operations appropriate to the field of enquiry.
The key mapping module (3) is responsible for converting the normalised key (25) into a form understood by each website selected by the scope matching module. The module takes as inputs the normalised key and a list of website definitions, and uses key mapping tables (17, 18) in the reference database (6, 12) to perform key translations. The single query input by the user may thereby be translated into the different instructions appropriate to interrogation of the different selected relevant websites or databases. The result produced by this module is a list of site keys (30): a tailored query for each database or website.
Each key may be mapped using a table-driven or algorithmic approach. The type of mapping is defined in the website definition, held in the reference database.
For a table-driven mapping, the website definition will reference one or more tables in the database, which define mappings between the standard keys and the site-specific keys. The following relational database tables are defined for use in a table-driven mapping:
a) Site_Symbol_Map (18)
This has columns of the form From_name for each field in the key being translated, together with columns To_name for each field in the result. Each row corresponds to a unique key. For example, the following table translates into UK stock exchange symbols:
Figure imgf000026_0001
b) Site_Symbol_Map_Wild (17)
This has columns of the form From_name for each field in the key being translated, together with columns To_name for each field in the result. Each column entry may contain a wildcard. For each column, wildcard mapping will take the following form:
Table 1. Wildcard mapping rules
Figure imgf000027_0001
c) Site_Field_Map (19)
This has two columns, SrcField and DstField. Each row corresponds to a field in the result key. For example:
Figure imgf000027_0002
A flowchart illustrating operation of the key mapping module is shown in Fig 6.
Step 31 decides whether a table-driven or algorithmic key mapping is being used. This information is provided in the website definition in the reference database.
For a table-driven mapping, step 32 accesses the relational database tables in the description.
In step 33, the Site_Symbol_Map relational database table referenced by the description is indexed. For a key with fields named N0 - Nm and containing values V0 - Vm, the module will perform an SQL query of the form:
SELECT * FROM scope_table WHERE N0 = V0 ... AND Nm = Vm
This will select one row if the key is matched, otherwise zero rows. This is tested in step 34. For example, a search for British Telecom, symbol BT-A.GB in the "ICV" symbol map would take the form:
SELECT * FROM ICVSymbolMap WHERE Company="BT-A" AND Country="GB"
If a row is matched, then step 35 is executed: each destination column with a name of the form To_xxx (corresponding to a source column From_xxx) is copied to the key.
If no row was matched, a wildcard based scan of the Site_Symbol_Map_Wild relational database table is performed. Step 36 gets the first entry in the wildcard relational database table.
In step 37, a wildcard comparison is performed with each column value using the rules in table 1. If a match occurs, step 38 maps the values to the result.
Step 39 moves to the next row in the Site_Symbol_Map_Wild relational database table. Step 40 checks whether the end of the table has been reached. If so, no match has occurred: the terminal step 41 is reached and the website is not processed further. Otherwise, step 37 is executed again.
Step 42 allows for field names to be remapped. If a relational database table Site_Field_Map is specified, this table is used to change the names of key fields in step 43. The mapped key is available at step 44.
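The table-driven path through steps 33 to 44 can be sketched as follows, with lists of dictionaries standing in for the mapping tables. The wildcard rules of Table 1 are not reproduced in the text, so this sketch assumes a single rule: the text matched by one "*" in a From_ column replaces "*" in the corresponding To_ column. All function names and sample table rows are hypothetical.

```python
def wild_capture(pattern, value):
    """Return the text matched by a single "*" in pattern, or None on mismatch."""
    if "*" not in pattern:
        return "" if pattern == value else None
    prefix, _, suffix = pattern.partition("*")
    if (len(value) >= len(prefix) + len(suffix)
            and value.startswith(prefix) and value.endswith(suffix)):
        return value[len(prefix):len(value) - len(suffix)]
    return None


def map_wild_row(key, row):
    """Steps 37-38: compare every key field with the row and build the result."""
    result = dict(key)
    for name, value in key.items():
        captured = wild_capture(row.get("From_" + name, ""), value)
        if captured is None:
            return None
        target = row.get("To_" + name)
        if target is not None:
            # Assumed rule: "*" in the To_ column stands for the captured text.
            result[name] = target.replace("*", captured)
    return result


def map_key(key, symbol_map, symbol_map_wild, field_map=None):
    """Map a normalised key to a site key, or return None if the site has no match."""
    mapped = None
    for row in symbol_map:                      # steps 33-34: exact lookup
        if all(row.get("From_" + n) == v for n, v in key.items()):
            mapped = dict(key)
            for column, value in row.items():   # step 35: copy the To_xxx columns
                if column.startswith("To_"):
                    mapped[column[3:]] = value
            break
    if mapped is None:
        for row in symbol_map_wild:             # steps 36-40: sequential wildcard scan
            mapped = map_wild_row(key, row)
            if mapped is not None:
                break
    if mapped is None:
        return None                             # step 41: site not processed further
    if field_map:                               # steps 42-43: rename fields
        mapped = {field_map.get(n, n): v for n, v in mapped.items()}
    return mapped                               # step 44


# Hypothetical mapping tables for one website.
exact = [{"From_Company": "BT-A", "From_Country": "GB",
          "To_Company": "BT.A", "To_Country": "L"}]
wild = [{"From_Company": "*", "From_Country": "GB",
         "To_Company": "*.L", "To_Country": "GB"}]
site_key = map_key({"Company": "BT-A", "Country": "GB"}, exact, wild, {"Company": "code"})
```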
For algorithmic mapping, a procedural script is defined to translate the key for a website. The text of this script, together with an indication of the programming language, is provided in the website description. Script execution is provided by the basic operating platform on which the system is running (such as Microsoft Windows NT) or by other methods in the prior art. The script execution facility is required to support the calling of methods on objects known to the script.
Step 45 retrieves the script from the website definition in the reference database. In step 46, the SetKey method of this script is called to set the key value.
In step 47, the GetDestKey method is called to retrieve the key value. These methods are to be defined and programmed by the person creating the website definition.
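The calling convention of steps 45 to 47 can be illustrated with a script object exposing the SetKey and GetDestKey methods named above. The translation performed here (a lower-case code joined by a dot) is purely hypothetical, as is the class name; a real website definition would supply its own logic.

```python
class ExampleSiteScript:
    """Hypothetical site script written by the author of a website definition."""

    def SetKey(self, key):
        # Step 46: receive the normalised key.
        self._key = dict(key)

    def GetDestKey(self):
        # Step 47: return the key in the form this (imaginary) site expects.
        code = f'{self._key["Company"]}.{self._key["Country"]}'.lower()
        return {"code": code}


def run_algorithmic_mapping(script, key):
    """Call the two methods in the order given by the flowchart."""
    script.SetKey(key)
    return script.GetDestKey()


dest_key = run_algorithmic_mapping(ExampleSiteScript(), {"Company": "BT-A", "Country": "GB"})
```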
Step 42 then performs any field mapping in the same way as for a table-driven mapping.
The key mapping module uses key mapping tables (17, 18) in the reference database (6, 12) to perform key translations. The single query input by the user may thereby be translated into the different instructions appropriate to interrogation of the different selected relevant websites or databases. The result produced by this module is a list of site keys (30): a tailored query for each database or website.
The key mapping module produces a list of website access instructions (50) which can be interpreted by the database access or interrogation module (51) to interrogate selected websites.
The database access or interrogation module (51) is implemented as mobile code which executes in the user's database or Internet browser program. This module is responsible for interpreting a list of website access instructions (50) and providing the user with an interface which allows them to select websites that they wish to display information from. The module is then responsible for sending instructions to the websites in the form of uniform resource locators (URLs) to cause the websites to display data on the entity that the user requested.
The database access module takes a normalised and mapped key and a site definition as inputs. These are used to create a set of instructions to access information on the entity identified in the query made by the user and in the websites and/or databases covered by the system. Website access instructions are created using one of two methods: template or algorithmic.
a) The template method provides a template into which key fields are substituted. An appropriate access script is a data structure containing at least the fields defined in Table 2.
Table 2. Access script structure
Member Name    Member Type               Description
ObjText        Array of strings          Each element contains displayable data if ObjData is TRUE, or the text of a URL otherwise. In a template, this text may contain a field name delimited by escape characters. This field can be substituted for a field value.
ObjFramed      Array of boolean flags    Each element is TRUE if the object referenced by ObjText can be loaded in a frame (sub-window in a web browser), otherwise the object will be loaded in a new browser window.
ObjData        Array of boolean flags    Each element is TRUE if the object is a page of data for display, otherwise the object is a URL.
WaitLoad       Array of boolean flags    Each element is TRUE if the browser should wait for the page to load before progressing, otherwise the browser should proceed after Delay.
Delay          Array of numbers          Each element represents the number of seconds to delay before loading the next frame.
The template is retrieved from the website description in the reference database. Any delimited fields in the ObjText array of the template are then replaced by the value that the field name identifies in the normalised and mapped key. The result is a set of instructions which can be used to access the website.
b) The algorithmic method generates an access script by calling procedures in a scripting language. The text of this script, together with an indication of the programming language, is provided in the website description. Script execution is provided by the basic operating platform on which the system is running (such as Microsoft Windows NT) or by other methods in the prior art. The script execution facility is required to support the calling of methods on objects known to the script.
A method, SetKey is called on the script to set the mapped key fields. A further method, GetAccessScript is used to retrieve the complete access script structure. These methods are to be defined and programmed by the person creating the website definition.
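The template method of (a) can be sketched as a substitution over the ObjText array. The escape characters are not specified in the text, so this sketch assumes that a field name is delimited by braces, as in "{code}". It treats an element as a URL when its ObjData flag is FALSE, following the ObjData row of Table 2. The function name `fill_template` and the example URL are hypothetical.

```python
import re
from urllib.parse import quote

# Assumption: a field name in a template is written as {FieldName}.
FIELD = re.compile(r"\{(\w+)\}")


def fill_template(template, key):
    """Return a copy of the access script with key fields substituted into ObjText."""
    filled = dict(template)
    texts = []
    for text, is_data in zip(template["ObjText"], template["ObjData"]):
        def substitute(match, is_data=is_data):
            value = str(key[match.group(1)])  # KeyError if the key lacks the field
            # Percent-encode values placed in URLs; leave display data untouched.
            return value if is_data else quote(value, safe="")
        texts.append(FIELD.sub(substitute, text))
    filled["ObjText"] = texts
    return filled


# Hypothetical template for a single framed URL.
template = {
    "ObjText": ["http://www.example.com/quote?code={code}"],
    "ObjFramed": [True],
    "ObjData": [False],
    "WaitLoad": [True],
    "Delay": [0],
}
access_script = fill_template(template, {"code": "BT.A"})
```

The template itself is left unmodified, so the same site definition can be reused for every key.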
This module executes in the user's browser program in a typical embodiment of this invention. The user will be presented with a user interface which allows them to choose, from the websites selected by the system during the scope matching process, those from which information is to be displayed. The database access or interrogation module provides the ability to "drill down" into any website or database and thus simulate a user's manual interaction with the site to access a particular page.
The input to this module is an access script as defined in Table 2. A flowchart illustrating the operation of this module is shown in Fig 7. The module iterates through the arrays in the access script. In step 55, the iterator variable N is set to 0. This indexes the access script arrays described in Table 2. These arrays are assumed to be zero based.
Steps 56, 57, 58 decide how the text in ObjText is to be displayed. For each element, ObjText may hold HTML for immediate display or a URL to be loaded into a browser, according to the state of ObjData. In addition, data or URLs may be displayed in a new browser window or in a browser frame. Step 59 displays HTML as immediate data in a browser frame. Step 60 displays HTML as immediate data in a new window. Step 61 loads a URL into a new browser frame. Step 62 loads a URL into a new browser window.
Step 63 decides whether to wait for a page to load in the browser. If so, the script will wait in step 64 before loading the next page. Step 65 implements a delay before loading the next page.
The iterator N is incremented in step 66. In step 67, N is tested against the size of the script arrays. If this size has been reached, then terminal step 68 is executed. Otherwise, the process is repeated from step 56 until all the script instructions have been processed.
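The iteration of Fig 7 can be sketched as a loop that turns an access script into a list of browser actions. No real browser is driven here; the action names are hypothetical labels for steps 59 to 65. Unlike the flowchart, this sketch tests N before the first element so that an empty script produces no actions.

```python
def run_access_script(script):
    """Walk the zero-based access script arrays and list the resulting actions."""
    actions = []
    n = 0                                                # step 55
    size = len(script["ObjText"])
    while n < size:                                      # step 67
        kind = "show_html" if script["ObjData"][n] else "load_url"      # step 56
        target = "frame" if script["ObjFramed"][n] else "new_window"    # steps 57, 58
        actions.append((kind, target, script["ObjText"][n]))            # steps 59-62
        if script["WaitLoad"][n]:                        # step 63
            actions.append(("wait_for_load",))           # step 64
        else:
            actions.append(("delay", script["Delay"][n]))  # step 65
        n += 1                                           # step 66
    return actions                                       # step 68


# Hypothetical two-element access script.
example_script = {
    "ObjText": ["http://www.example.com/a", "<p>hi</p>"],
    "ObjFramed": [True, False],
    "ObjData": [False, True],
    "WaitLoad": [True, False],
    "Delay": [0, 2],
}
example_actions = run_access_script(example_script)
```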

Claims

1) A system for searching a distributed collection of databases (4, 5) comprising a number of databases connected to each other by a communications network system including: query entry means (7, 8) for entering a request for information on a subject, object or matter, or a group of subjects, objects or matters;
a first memory (2, 6, 12) storing index entries, each index entry including a portion representing a subject, object or matter, or a group of subjects, objects or matters, on which information might be sought and one or more location entries indicating which of the databases may contain information on the respective subject, object or matter or group of subjects, objects or matters; and
a second memory (3, 6) storing database interrogation modules, routines or sub-routines for converting a request for information received by the data entry means into a set of appropriate instructions for each of the databases.
2) A system according to claim 1 including query normalisation means (24) for converting a request for information into a format matching that used in the first memory.
3) A system according to claim 2 wherein the query normalisation means (24) converts a text string into one or more of a number of pre-defined symbols.
4) A system according to any preceding claim including a descriptions database (12) of descriptions of the contents of the distributed collection of databases and search scope matching means for comparing a request for information entered into the system to the descriptions database and generating a list of possibly relevant databases.
5) A system according to any preceding claim including an access instructions module (3) with a query mapping database (17, 18) of data processing commands, instructions or routines for translating a request for information entered via the query entry means into one or more enquiry signals recognised and processable by one or more of the databases covered by the system.
6) A system according to claim 5 wherein the access instructions database (6, 17, 18) includes a set of data processing commands, instructions or routines for each database of the collection of databases covered by the system.
7) A system according to claims 5 or 6 wherein the data processing commands, instructions or routines form a template for mapping of a query.
8) A system according to claims 5, 6 or 7 wherein the data processing commands, instructions or routines form an algorithm for converting a query.
9) A system according to any preceding claim including means for displaying a list of databases and/or database locations potentially relevant to a request for information entered by a user.
10) A system according to claim 9 including selection means for a user to select one or more of the potentially relevant databases or database locations displayed by the system.
11) A system according to claim 10 including database interrogation means (51) for interrogating a selected database using the enquiry signal matching that database and mapped by the access instructions module from the request for information entered by the user.
12) A method for obtaining information from a collection of databases, comprising: entering a query (21) for information; comparing the query to a database (12) of descriptions of the content or type of content of the databases constituting the collection of databases; generating a list (30) of potentially relevant databases from said comparison of the query to the database descriptions; and mapping the query into an enquiry signal or signals (50) recognised or processable by each of the potentially relevant databases.
13) A method according to claim 12 further including the steps of: displaying the list of potentially relevant databases; selecting one or more of the displayed databases; and interrogating (51) the selected database or databases using the enquiry signal or signals recognised or processable by the selected database or databases.
14) A method according to claims 12 or 13 including the step of: converting a query entered by a user into one or more of a number of predefined symbols.
15) A method according to any of claims 12 to 14 wherein the query for information is entered manually by a user.
16) A method according to any of claims 12 to 14 wherein the query is a signal generated by data processing or telecommunications apparatus.
17) A method of obtaining information from a collection of databases, comprising: providing a database descriptions database of descriptions of the content or type of information stored or held in each database of the collection of databases; providing means for comparing a query or request for information to the descriptions database and generating a list of potentially relevant databases; providing an access instructions module including a database of data processing commands, instructions or routines for converting or translating a query or request for information into an enquiry signal or signals recognised or processable by one or more of the databases of the collection of databases.
18) A computer program comprising program code means for performing the steps of any one of claims 12 to 15 when said program is run on a computer.
19) A computer program product comprising program code means stored on a computer readable medium for performing the method of any one of claims 12 to 15 when said program product is run on a computer.
PCT/GB2001/000446 2000-02-03 2001-02-02 System and method for database searching WO2001057725A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP01902550A EP1254413A2 (en) 2000-02-03 2001-02-02 System and method for database searching
AU2001230402A AU2001230402A1 (en) 2000-02-03 2001-02-02 System and method for database searching

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US17993400P 2000-02-03 2000-02-03
US60/179,934 2000-02-03

Publications (2)

Publication Number Publication Date
WO2001057725A2 true WO2001057725A2 (en) 2001-08-09
WO2001057725A3 WO2001057725A3 (en) 2002-06-13

Family

ID=22658589

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2001/000446 WO2001057725A2 (en) 2000-02-03 2001-02-02 System and method for database searching

Country Status (3)

Country Link
EP (1) EP1254413A2 (en)
AU (1) AU2001230402A1 (en)
WO (1) WO2001057725A2 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0829811A1 (en) * 1996-09-11 1998-03-18 Nippon Telegraph And Telephone Corporation Method and system for information retrieval
WO1998012881A2 (en) * 1996-09-20 1998-03-26 Netbot, Inc. Method and system for network information access

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1611534A1 (en) * 2003-04-04 2006-01-04 Yahoo! Inc. A system for generating search results including searching by subdomain hints and providing sponsored results by subdomain
EP1611534A4 (en) * 2003-04-04 2010-02-03 Yahoo Inc A system for generating search results including searching by subdomain hints and providing sponsored results by subdomain
US8271480B2 (en) 2003-04-04 2012-09-18 Yahoo! Inc. Search system using search subdomain and hints to subdomains in search query statements and sponsored results on a subdomain-by-subdomain basis
US8849796B2 (en) 2003-04-04 2014-09-30 Yahoo! Inc. Search system using search subdomain and hints to subdomains in search query statements and sponsored results on a subdomain-by-subdomain basis
US9262530B2 (en) 2003-04-04 2016-02-16 Yahoo! Inc. Search system using search subdomain and hints to subdomains in search query statements and sponsored results on a subdomain-by-subdomain basis
US9323848B2 (en) 2003-04-04 2016-04-26 Yahoo! Inc. Search system using search subdomain and hints to subdomains in search query statements and sponsored results on a subdomain-by-subdomain basis
US7752285B2 (en) 2007-09-17 2010-07-06 Yahoo! Inc. Shortcut sets for controlled environments
US8566424B2 (en) 2007-09-17 2013-10-22 Yahoo! Inc. Shortcut sets for controlled environments
US8694614B2 (en) 2007-09-17 2014-04-08 Yahoo! Inc. Shortcut sets for controlled environments
CN113127490A (en) * 2021-04-23 2021-07-16 山东英信计算机技术有限公司 Key name generation method and device and computer readable storage medium
CN113127490B (en) * 2021-04-23 2023-02-24 山东英信计算机技术有限公司 Key name generation method and device and computer readable storage medium
US11941032B2 (en) 2021-04-23 2024-03-26 Shandong Yingxin Computer Technologies Co., Ltd. Key name generation method and apparatus and non-transitory computer-readable storage medium

Also Published As

Publication number Publication date
EP1254413A2 (en) 2002-11-06
AU2001230402A1 (en) 2001-08-14
WO2001057725A3 (en) 2002-06-13

Jakob et al. Dcbot: Finding spatial information on the web
KR20030013814A (en) A system and method for searching a contents included non-text type data

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 EP: The EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 2004-01-01)
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

WWE WIPO information: entry into national phase

Ref document number: 2001902550

Country of ref document: EP

WWP WIPO information: published in national office

Ref document number: 2001902550

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWW WIPO information: withdrawn in national office

Ref document number: 2001902550

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP