WO2003046755A1 - Method and apparatus for information retrieval - Google Patents

Method and apparatus for information retrieval

Publication number
WO2003046755A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
retrieved
url
text
target
Prior art date
Application number
PCT/AU2002/001597
Other languages
French (fr)
Other versions
WO2003046755A9 (en
Inventor
Kathleen Phelan
Original Assignee
Webtrack Media Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Webtrack Media Pty Ltd
Priority to AU2002342413A (AU2002342413A1)
Priority to CA002507279A (CA2507279A1)
Priority to US10/496,811 (US20050010556A1)
Priority to EP02779016A (EP1461725A4)
Priority to NZ533730A (NZ533730A)
Publication of WO2003046755A1
Publication of WO2003046755A9

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results

Definitions

  • This invention relates to information retrieval, and is directed primarily but not solely to automated retrieval and analysis of information available on the Internet or similar sources such as databases, internal networks and intranets.
  • the invention provides a method for automated search and retrieval of information available on a networked database, the method including the steps of
  • the network is the Internet.
  • the retrieved information is analysed.
  • an alert is provided to an entity as a result of the analysis.
  • the invention provides an automated information search and retrieval system in which real time selection and retrieval of the information occurs.
  • the system includes provision for archiving the retrieved information in a readily accessible manner.
  • the information is searched and retrieved from the Internet.
  • the invention provides a method for automated searching and retrieval of information, performing real time selection and retrieval of the information.
  • the information is archived for subsequent analysis.
  • the method preferably includes the step of establishing one or more target resource locations from which information is to be searched and retrieved.
  • the target location preferably includes a URL which is spidered by the system to identify underlying links.
  • the spidering step is performed in a plurality of passes, each pass being targeted toward certain links, and each pass ignoring links that are unlikely to be relevant.
  • the method includes the step of retrieving information from links that appear relevant.
  • the method includes the step of assigning or attaching metadata to each item of information to create a database record.
  • the database records are archived.
  • Preferably retrieved information which is not in a textual format is converted to an editable raw-text data type.
  • Preferably data can be provided from other sources, for example hard copies which may be converted to text using optical character recognition processors, or from an audio format using speech recognition applications.
  • the method includes the step of analysing text retrieved by the method against predetermined rules.
  • the predetermined rules may include literal string (key word) matches, regular expression matches, string patterns or occurrences of text, or other linguistically defined criteria.
  • the predetermined rules may additionally involve other text analysis technology to recognise desired matches.
  • the rules may be used to implement a criterion against which retrieved items of information are compared to determine their relevance to various topics and therefore the manner in which the information should be indexed, or possibly discarded.
  • the method includes the step of discarding or stripping all extraneous information from the information that is retrieved.
  • extraneous information may include HTML tags, images and the like.
  • relevant information which is the subject of a new record created for immediate analysis or for archiving is stored with associated metadata (for example source URL, date retrieved, string length, HTML headers and the like).
  • each record is a distinct and unique item in the database or archive and is assigned a unique identifier.
  • the unique identifier may be a thirty two character UUID (universally unique identifier).
  • the invention also includes apparatus to implement the system or method of one or more of the preceding statements of invention.
  • the invention includes a computing machine operable to implement the system or method of one or more of the preceding statements of invention.
  • Figure 1 is an overview diagram of an information retrieval and archiving system according to the invention.
  • Figure 2 is a diagrammatic time line of internet information search functions according to the invention.
  • Figure 3 is a flow diagram of an internet search and retrieval function according to the invention.
  • Figures 4a & 4b constitute a single flow diagram showing the search and retrieval function of Figure 3 in greater detail.
  • Figure 5 is a diagram showing the action of an agent or bot spidering a target server in accordance with the invention.
  • Raw data is shown at a first level referenced 1. It is this data that the present invention searches, selects and then organises or indexes to arrive at relevant timely information. As can be seen from the diagram, this raw data can include a diverse range of data formats such as hard copy documents 10, Internet data 12, audio data 14 and video data 16.
  • Sources of hard copy documents include sources such as newspapers and magazine articles or other paper records.
  • Internet or other network data can include data contained in or generated by HTML documents, XML documents/feeds, dynamic pages (CGI, ASP, CFM, PHP) and WAP data sources, amongst others.
  • Audio data can include radio broadcasts, tape recordings/interviews and streaming audio (for example provided on the Internet).
  • Video data can include television broadcasts, tape recordings or streaming video (for example provided on the Internet).
  • OCR optical character recognition
  • the application automatically scans each page, converts the document into a raw text form using OCR (optical character recognition), and saves it into the central database.
  • the documents may be newspaper articles, magazine journals, printed PDF files, or other hard-copy material.
  • HTTP and similar or subsequent methods and protocols
  • requests are used to supply the required HTML, or other, documents and these can then be stripped of extraneous information such as HTML tags and the like to arrive at a text document.
  • This processing is generally indicated using reference numeral 20 in Figure 1.
  • Audio data and video data are processed using speech recognition components to transform the audio information into a textual format.
  • This process is generally indicated using reference numeral 22 in Figure 1.
  • a computer or series of computers running an application which processes audio from TV broadcasts, video, and other media (streaming, CDROM, etc).
  • the audio/video data may be stored digitally on a storage device connected to the computer or captured from an analogue source such as a bank of VCRs or similar playback devices.
  • the "audio signal" can be derived from either an audio or video source. Provision is made for additional metadata with video sources that analyses and classifies video & image information.
  • the application running on the computer analyses the broadcast using speech recognition software to convert it to a raw text form where it is saved into the central database.
  • the result of the processing step in level 2 is a text document, referenced 24, which is provided in electronic form.
  • Each text item 24 then has metadata added to it (as will be described further below) so as to create a database record in step 26, and each record is then stored on a database 28.
  • the database can then be accessed to review information of interest that has been gathered using the process.
  • the information on the database can be archived in a number of convenient formats for use to track changes and patterns over time or to review historical data information.
  • a time line having an axis 30 representing time advancing in linear intervals in a direction to the right hand side of the figure shows examples of agents or bots which automatically search target data sources on the Internet.
  • Agents or bots are used in the preferred embodiment to automatically search target data sources on the Internet.
  • the agents are released periodically.
  • a first agent 32 which has the task of extracting information from a specific URL e.g. theage.com may be released.
  • Each agent is attached to a specific site and is profiled with information specific to that site. The information determines the method and depth of spidering (this will be explained further below) and how information is extracted.
  • Each agent is released at predetermined intervals and they begin harvesting information through a process as will be described further below. Once each agent has finished its automated process, it returns to a "wait" state until it is next triggered.
  • another agent 34 may be attached to another URL e.g. SMH.com and be released at 8:00am.
  • the agent 36 may be attached to a URL e.g. news.com.au and be released at 9:00am.
  • the agent 38 may be attached to yet another URL e.g. ordermail.com.au and be released at 10:00am.
  • step 40 the agent makes an HTTP GET request to retrieve the HTML document from its target URL. This is performed in step 42.
  • the agent in step 40 is agent 32, then the URL that the request is sent to would be theage.com.au.
  • the document that the agent receives from the target URL will include a number of links. These links will typically consist of links to other URLs. These links are filtered according to certain criteria and information the agent is loaded with and stored on a system server in a "spider list". Certain types of resource are filtered as well as compared to an "exclusion list" on the server. Any URL which is listed on the exclusion list is ignored by the agent. In this way, from a general known website structure, links which are known to be valueless in terms of their information can be readily excluded by the system.
  • This step of filtering the relevant links is carried out in step 44 and is generally performed by a parsing process whereby the text and the link is analysed by the agent to look for key words, known words or word patterns such as linguistically defined criteria or "themes" which are likely to indicate a relevant link to the information which is sought.
  • the method includes the step of analysing text retrieved by the method against predetermined rules.
  • The predetermined rules may include literal string (key word) matches, regular expression matches, string patterns or occurrences of text, or other linguistically defined criteria.
  • The predetermined rules may additionally involve other text analysis technology to recognise desired matches.
  • the rules may be used to implement a criterion against which retrieved items of information are compared to determine their relevance to various topics and therefore the manner in which the information should be indexed, or possibly discarded.
  • The term "spidering" refers to the process of navigating through a series of on line resources and gathering information. Therefore, the spider list which is established by the agent sets forth a pattern of links at the target site which is subsequently visited by the agent to retrieve information as is described further below.
  • step 46 the agent then proceeds to process each parsed URL from step 44 individually until all further links (of which there may be many) are checked in this manner. This occurs in step 46. Again, links which are on the exclusion list are ignored by the agent.
  • the agent inserts the relevant URL (or link) into a URL stream table. This occurs in step 48.
  • the agent then performs a query in step 50 to retrieve all the URLs from the URL stream table.
  • step 52 the process begins by the agent making an HTTP GET request to retrieve a document from the first URL.
  • the agent retrieves a profile for the base URL. This occurs in step 54 and the purpose is to obtain further information about any known document structure or structures at the website of interest. Therefore, profiles tend to be specific toward each target URL. If the profile is known, then this can make the content of the HTML document much easier to accurately retrieve in a desired form. If the structure of the HTML document retrieved does not match the profile then the agent defaults to retrieving the entire text of the HTML document with the HTML tags stripped out.
  • step 56 the agent executes the profile and in step 58 retrieves the relevant material (for example) in text with extraneous content stripped out.
  • the next step 60 is for an analysis to be performed of the retrieved document.
  • the agent analyses the text retrieved against predetermined rules which may be called "themes" stored on the system server.
  • the themes may consist of actual literal string (i.e. key word) matches, regular expression matches, string patterns or occurrences of text or other linguistically defined criteria as determined.
  • themes are defined by system users in consultation with analysts and may consist of any of the foregoing, and additionally may involve other text analysis technology to recognise desired matches.
  • the word "themes" is broadly used in this document to describe a scheme of criteria against which retrieved items are compared to ascertain or di: documents of relevance to the user.
  • step 60 should the query performed in step 60 result in a match, then the agent inserts the text document that has been retrieved into the system database. This occurs in step 62. If a match is not achieved, then the document is discarded.
  • the agent then returns to the next URL in the URL stream table in step 64 so that the process begins to repeat from step 52 until all URLs have been examined.
  • step 66 the agent "returns" to the system server until the next cycle is due to begin.
  • step 66 As described with reference to Figure 1, as each text item is added to the database, additional metadata is added to the item so that the data is organised or indexed for subsequent retrieval or for further analysis for identification purposes. Therefore, as each new record is created on the system database, the text is stored and any associated metadata (such as source URL, date retrieved, string length, HTML headers etc) is stored with the text.
  • Each record created is thus a distinct and unique item in the database and is assigned a unique identifier. This identii
  • the system envisages storing text documents regardless of whether a theme is matched or not so that recursive searches may be made.
  • step 70 the agent executes in step 70 and an initial query occurs in step 72, which is an HTTP request to get the base URL.
  • step ! check is performed on the document returned as a result of the request. This check is to review the header data from the HTML document that is returned to ascertain the last time that the document was updated or modified.
  • step 76 A comparison occurs in step 76, and if there is no change, then the agent returns to step 70.
  • step 78 if a change has occurred, then the document is received in step 78 and is parsed in step 80 to ascertain relevant links. It is desired (but not absolutely necessary) that only links which relate to text documents are parsed and that the agent ignores links from any exclusion list as described above.
  • step 82 the parsed URL is processed and in step 84 the agent performs a query to check whether the processed URL is present in the URL stream table. If it is not, then in step 8 a further query is performed to check whether the URL is in the URL archive table. If the URL is not present in that table either, then the agent inserts the URL into the URL stream table together with further parameters such as the base URL, the date and time of last modification of the document to which the URL relates and a depth variable.
  • step 84 the agent continues to process the next URL in step 82 and the process continues until all the URLs have been parsed.
  • step 90 the agent retrieves all the URLs that have been passed from the URL stream table.
  • a GET request is then performed in step 92 for the first URL from the URL stream table.
  • a check is then performed in step 94 to see whether the depth variable is greater than 1, i.e. whether there are further links in the document that is retrieved from that URL. If there are, then these links are parsed and the process is performed again beginning at step 80 until all the subsidiary links are parsed, and then the agent returns to step 96 where a query is performed to retrieve the profile for the relevant base URL.
  • step 98 the agent attempts to execute the retrieved profile. If there is a profile match failure, as shown in step 100, then the full text of the HTML document is simply retrieved and all the HTML tags are simply stripped from the document. If there is a profile match success as shown in step 102, then the text from the document is easily retrieved with extraneous content removed from it. The resultant text document is then compared with the themes referred to above to see whether a match occurs in step 104. A query is then performed in step 106 to see whether the URL to which the document relates already exists. If it does, then the URL is discarded and the agent turns to the next URL in the URL stream table at step 108.
  • the agent inserts the full text into the content items table (i.e. into the database) together with further metadata such as the base URL and further information for identification and search purposes. This occurs in step 110. If for some reason an article cannot be extracted, then an email is generated in step 112. The agent then continues to repeat the process for subsequent URLs in the URL stream table at step 114.
  • Step 106 has the purpose of preventing information being retrieved and stored twice.
  • FIG 5 a simplified diagrammatic illustration of the spidering process described above in Figures 3, 4a and 4b is shown.
  • the system server is referenced 150 and a target server on which the target URL, i.e. the base URL referred to above, is located is referenced 152.
  • agent 154 begins by making a first pass of the base URL of the target server 152. That agent then returns data to the server as shown by arrow 156. If the information returned indicates that there are links to further URLs on the target server, then the agent makes a further pass, i.e. a second pass 158. Information from the second pass is returned to the server in step 160.
  • a third pass 162 may be made, which will again return further information to the server.
  • the method provides a logical and straightforward way of spidering a target server for relevant information.
  • information on a target server may be represented in a pie chart form
  • the information in an initial state of the server 170 may show that no information has been spidered.
  • a certain amount of information will have been retrieved, as indicated in diagram 172.
  • a second pass further information will have been retrieved as shown by diagram 174.
  • yet more information has been retrieved as shown by diagram 176.
  • the spidered information from the server is shown in the shaded portions of each diagram. As can be seen, a certain amount of information is ignored and this information relates to links that have been parsed by the agent but which have been ignored because they have been determined to be a) irrelevant, b) on a list of URLs to be ignored, or c) not in the required data form (for example do not comprise a text document).
  • an "alert" After a content item has been stored in the database, an "alert" will be generated.
  • the alert configuration is definable by the client, and may take the form of an email, an SMS message, the remote updating of a web page, or remote communication with another database system or application.
  • the alert may be sent in "real-time" (as soon as the content item is retrieved) or after it has been analysed (after the analyst has processed the content item).
  • the alerts may be received singly or in digest form on a different frequency, for example daily, weekly, or even monthly if desired.
  • the client may view "real-time" reports showing visually the retrieval, processing and analysis of items that match their keyword themes. These reports consist of dynamic graphs, pie graphs, and other types of chart which display information and metadata pertaining to these content items.
  • the client may further manipulate these charts and graphs with different ranges and criteria to produce different results.
  • the analysis may be performed by a human analyst or by a software component on the server.
  • the analysis metadata is compiled from the client perspective and stored on a p user client, so one content item may have many analyses for different clients.
  • the analysis allows the user to select many database cross-sections for different reports showing the analysis metadata which is linked to retrieved content items.
  • the analysis may also be displayed real-time to the client so as items are updated and analysed the on-screen information is updated with no intervention from the client.
  • the analysis enables the user to quickly gain an understanding of the skew of a large volume of content at a glance; instead of perusing each item they are able to view a dissected overview in graphical format and provide a powerful tool in determining real-time trends as they appear.
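
The URL queueing described in the list above (the steps beginning at step 82, where a parsed URL is checked against the URL stream table and the URL archive table before being inserted) can be illustrated with a minimal sketch. It is not the patented implementation. The table names `url_stream` and `url_archive`, the column names and the use of an in-memory SQLite database are assumptions made purely for illustration.

```python
import sqlite3


def open_tables():
    """Create hypothetical stream and archive tables in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE url_stream ("
        "url TEXT PRIMARY KEY, base_url TEXT, modified TEXT, depth INTEGER)"
    )
    db.execute("CREATE TABLE url_archive (url TEXT PRIMARY KEY)")
    return db


def queue_url(db, url, base_url, modified, depth):
    """Insert a URL into the stream table unless it is already queued or archived.

    Returns True when the URL was inserted, False when it was skipped.
    """
    # Table names come from a fixed tuple, never from user input.
    for table in ("url_stream", "url_archive"):
        found = db.execute(
            f"SELECT 1 FROM {table} WHERE url = ?", (url,)
        ).fetchone()
        if found:
            return False
    db.execute(
        "INSERT INTO url_stream VALUES (?, ?, ?, ?)",
        (url, base_url, modified, depth),
    )
    return True
```

Checking both tables before inserting is what keeps an agent from revisiting a link it has already queued in this cycle or processed in an earlier one.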

Abstract

A method for automated search and retrieval of information available on a networked database, the method including the steps of providing search topic information, providing a target information resource location, spidering or dividing the target information resource location for further resource locations that are likely to lead to relevant information, and retrieving information from the target information resource location or from a relevant one of the further resource locations.

Description

METHOD AND APPARATUS FOR INFORMATION RETRIEVAL
FIELD OF THE INVENTION
This invention relates to information retrieval, and is directed primarily but not solely to automated retrieval and analysis of information available on the Internet or similar sources such as databases, internal networks and intranets.
BACKGROUND OF THE INVENTION
Computer databases, internal networks, intranets, networks and, in particular, the network of networks commonly referred to as the Internet have resulted in vast amounts of information being publicly available on those sources. However, there is, for example, no single organised and completely up-to-date repository or index of all information on the Internet.
To be useful, information must be relevant and timely. The Internet makes information easy to access, but it can be a very difficult task to fully canvass the Internet to find all information that is relevant to a particular topic or range of topics. Also, with information being accumulated and changed so rapidly in the Internet environment, even if extensive searching is performed manually, the time taken to search in this manner means that the results are quite likely not to be fully up to date.
There are a number of Internet search engines, such as "Yahoo™" for example, which attempt to provide a user friendly search facility for information on the Internet or similar databases. However, these search engines try to cover a full range of topics from many disparate sources and are therefore not continually up to date. They also index on a frequency of only 4 to 12 weeks.
OBJECT OF THE INVENTION
It is an object of the present invention to provide methods or apparatus for information retrieval and/or analysis and/or user information alerts which will at least go some way toward overcoming disadvantages of known apparatus and methods, or which will at least provide the public with a useful choice.
Throughout this specification, where there is a description with reference to the Internet it should be appreciated that the invention is applicable also to databases, internal networks, intranets and the like.
SUMMARY OF THE INVENTION
In one broad aspect the invention provides a method for automated search and retrieval of information available on a networked database, the method including the steps of
providing search topic information,
providing a target information resource location,
spidering or dividing the target information resource location for further resource locations that are likely to lead to relevant information, and
retrieving information from the target information resource location or from a relevant one of the further resource locations.
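
As an illustration only, the spidering step above might be sketched as follows: links are extracted from a page already fetched from the target location, and only those that mention a search topic word and are not on an exclusion list are kept. The names `LinkCollector`, `build_spider_list`, `topic_words` and `exclusion_list` are hypothetical, and the simple substring test is an assumption standing in for whatever relevance criteria an implementation would actually use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (absolute URL, anchor text) pairs from an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []       # finished (url, text) pairs
        self._current = None  # [url, text] of the link being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = [urljoin(self.base_url, href), ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(tuple(self._current))
            self._current = None


def build_spider_list(html, base_url, topic_words, exclusion_list):
    """Return URLs likely to lead to relevant information.

    A link is kept when its address or anchor text contains a topic word
    and its URL is not on the exclusion list.
    """
    collector = LinkCollector(base_url)
    collector.feed(html)
    spider_list = []
    for url, text in collector.links:
        if url in exclusion_list:
            continue
        haystack = (url + " " + text).lower()
        if any(word.lower() in haystack for word in topic_words):
            if url not in spider_list:
                spider_list.append(url)
    return spider_list
```

The returned list would then be visited in a later pass to retrieve the documents themselves.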
Preferably the network is the Internet.
Preferably the retrieved information is analysed.
Preferably an alert is provided to an entity as a result of the analysis.
In another broad aspect the invention provides an automated information search and retrieval system in which real time selection and retrieval of the information occurs.
Preferably the system includes provision for archiving the retrieved information in a readily accessible manner.
It is preferred that the information is searched and retrieved from the Internet.
In a further aspect the invention provides a method for automated searching and retrieval of information, performing real time selection and retrieval of the information.
Preferably the information is archived for subsequent analysis.
The method preferably includes the step of establishing one or more target resource locations from which information is to be searched and retrieved.
Furthermore, the target location preferably includes a URL which is spidered by the system to identify underlying links.
Preferably the spidering step is performed in a plurality of passes, each pass being targeted toward certain links, and each pass ignoring links that are unlikely to be relevant.
Preferably the method includes the step of retrieving information from links that appear relevant.
Preferably the method includes the step of assigning or attaching metadata to each item of information to create a database record.
Preferably the database records are archived.
Preferably retrieved information which is not in a textual format is converted to an editable raw-text data type.
Preferably data can be provided from other sources, for example hard copies which may be converted to text using optical character recognition processors, or from an audio format using speech recognition applications.
Preferably the method includes the step of analysing text retrieved by the method against predetermined rules. The predetermined rules may include literal string (key word) matches, regular expression matches, string patterns or occurrences of text, or other linguistically defined criteria. The predetermined rules may additionally involve other text analysis technology to recognise desired matches. The rules may be used to implement a criterion against which retrieved items of information are compared to determine their relevance to various topics and therefore the manner in which the information should be indexed, or possibly discarded.
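
By way of a hedged sketch, rules of the first two kinds mentioned above (literal key word matches and regular expression matches) could be applied to retrieved text as shown here. The rule layout, a dictionary with optional `keywords` and `patterns` entries, and the function names are assumptions for illustration; the specification does not prescribe a rule format.

```python
import re


def matches_rule(text, rule):
    """Return True if the text satisfies any keyword or pattern in the rule.

    `rule` is a dict with optional 'keywords' (literal strings) and
    'patterns' (regular expression strings); both are matched
    case-insensitively.
    """
    lowered = text.lower()
    for keyword in rule.get("keywords", []):
        if keyword.lower() in lowered:
            return True
    for pattern in rule.get("patterns", []):
        if re.search(pattern, text, re.IGNORECASE):
            return True
    return False


def matching_topics(text, rules):
    """Return the names of all topics whose rule matches the text."""
    return [name for name, rule in rules.items() if matches_rule(text, rule)]
```

An item whose list of matching topics is empty could be discarded, while a non-empty list indicates the topics under which the item might be indexed.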
Preferably the method includes the step of discarding or stripping all extraneous information from the information that is retrieved. Such extraneous information may include HTML tags, images and the like.
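
One possible way to strip such extraneous information, offered as a sketch rather than as the method of the invention, is to parse the HTML and keep only the visible text. The class and function names are hypothetical, and the decision to also drop script and style content is an assumption not stated in the specification.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps visible text and drops tags, images, scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than zero while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_html(html):
    """Return the text content of an HTML document as a single string."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)
```

Images need no special handling here because an image tag carries no text data of its own.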
Preferably relevant information which is the subject of a new record created for immediate analysis or for archiving is stored with associated metadata (for example source URL, date retrieved, string length, HTML headers and the like). Furthermore, preferably each record is a distinct and unique item in the database or archive and is assigned a unique identifier.
The unique identifier may be a thirty two character UUID (universally unique identifier).
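
A record of this kind might look like the following sketch. The field names are hypothetical, and the use of a random (version 4) UUID is an assumption, since the specification only says the identifier may be a thirty two character UUID. In Python, the `hex` form of a UUID is exactly 32 hexadecimal characters.

```python
import uuid
from datetime import datetime, timezone


def make_record(text, source_url, html_headers=None):
    """Wrap retrieved text with the metadata listed in the specification."""
    return {
        "id": uuid.uuid4().hex,  # 32-character identifier (assumed version 4)
        "source_url": source_url,
        "date_retrieved": datetime.now(timezone.utc).isoformat(),
        "string_length": len(text),
        "html_headers": dict(html_headers or {}),
        "text": text,
    }
```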
The invention also includes apparatus to implement the system or method of one or more of the preceding statements of invention.
The invention includes a computing machine operable to implement the system or method of one or more of the preceding statements of invention.
To those skilled in the art to which the invention relates, many changes in constructions and widely different embodiments and applications of the invention will suggest themselves without departing from the scope of the invention as defined in the appended claims. The disclosures and descriptions herein are purely illustrative and are not intended to be in any sense limiting.
The invention consists of the foregoing and also envisages constructions of which the following gives examples only.
DRAWINGS DESCRIPTION
One presently preferred embodiment of the invention will now be described with reference to the accompanying drawings, wherein:
Figure 1 is an overview diagram of an information retrieval and archiving system according to the invention.
Figure 2 is a diagrammatic time line of internet information search functions according to the invention.
Figure 3 is a flow diagram of an internet search and retrieval function according to the invention.
Figures 4a & 4b constitute a single flow diagram showing the search and retrieval function of Figure 3 in greater detail.
Figure 5 is a diagram showing the action of an agent or bot spidering a target server in accordance with the invention.
DESCRIPTION OF PREFERRED EMBODIMENTS
Referring to Figure 1, an overview of a method or system and associated apparatus accordi to the present invention is shown. Raw data is shown at a first level referenced 1. It is ti data that the present invention searches, selects and then organises or indexes to arrive relevant timely information. As can be seen from the diagram, this raw data can includi diverse range of data formats such as hard copy documents 10, Internet data 12, audio d 14 and video data 16.
Sources of hard copy documents include sources such as newspapers and magazine artic or other paper records. Internet or other network data can include data contained in or generated by HT1N documents, XML documents/feeds, dynamic pages (CGI, ASP, CFM, PHP) and WAP d, sources, amongst others.
Audio data can include radio broadcasts, tape recordings/interviews and streaming audio ( example provided on the Internet).
Video data can include television broadcasts, tape recordings or streaming video ( example provided on the Internet).
At level 2 in Figure 1, a data processing level is shown. For hard copy documents, the preferred processing is performed by an optical character recognition (OCR) application. This is indicated with reference number 18 in Figure 1. OCR uses high definition scanners to capture an image of a hard copy document and convert it to a raw text format. To facilitate OCR, a computer or series of computers is used, to which one or more high-resolution scanning devices (with a bulk feeder mechanism into which many pages of documents can be loaded) are attached.
The application automatically scans each page, converts the document into a raw text form using OCR, and saves it into the central database.
The documents may be newspaper articles, magazine journals, printed PDF files, or other hard-copy material.
To process Internet data, HTTP (and similar or subsequent methods and protocols) requests are used to supply the required HTML, or other, documents and these can then be stripped of extraneous information such as HTML tags and the like to arrive at a text document. This processing is generally indicated using reference numeral 20 in Figure 1.
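By way of illustration only, the stripping of HTML tags to arrive at a text document might be sketched as follows. The sketch uses the `html.parser` module of the Python standard library; the names `TextExtractor` and `strip_html` are hypothetical, and the decision to skip script and style content is an assumption rather than something the description specifies.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text; script and style content is skipped (an assumption)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_html(html: str) -> str:
    """Return the text of an HTML document with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

A document retrieved by an HTTP request would be passed to `strip_html` as a string, and the returned text would then be ready for storage or analysis.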
Audio data and video data are processed using speech recognition components to transform the audio information into a textual format. This process is generally indicated using reference numeral 22 in Figure 1. To facilitate speech recognition/transcription, a computer or series of computers is used, running an application which processes audio from TV broadcasts, video, and other media (streaming, CDROM, etc). The audio/video data may be stored digitally on a storage device connected to the computer or captured from an analogue source such as a bank of VCRs or similar playback devices. The "audio signal" can be derived from either an audio or video source. Provision is made for additional metadata with video sources that analyses and classifies video & image information.
The application running on the computer analyses the broadcast using speech recognition software to convert it to a raw text form, which is then saved into the central database.
The result of the processing step in level 2 is a text document, referenced 24, which is provided in electronic form. Each text item 24 then has metadata added to it (as will be described further below) so as to create a database record in step 26, and each record is then stored on a database 28. The database can then be accessed to review information of interest that has been gathered using the process. Furthermore, the information on the database can be archived in a number of convenient formats for use to track changes and patterns over time or to review historical data.
Although the system may be used with a wide variety of sources of raw data, as described with reference to Figure 1, an immediate application of the invention is to Internet data, and this is indicated in Figure 2 and will be described further by way of example with reference to the remaining figures.
Turning now to Figure 2, a time line having an axis 30 representing time advancing in linear intervals in a direction to the right hand side of the figure shows examples of agents or bots which automatically search target data sources on the Internet.
Agents or bots (or similar kinds of automated agents) are used in the preferred embodiment to automatically search target data sources on the Internet. The agents are released periodically.
By way of example, at 7:00am, a first agent 32 which has the task of extracting information from a specific URL e.g. theage.com may be released. Each agent is attached to a specific site and is profiled with information specific to that site. The information determines the method and depth of spidering (this will be explained further below) and how information is extracted. Each agent is released at predetermined intervals and begins harvesting information through a process as will be described further below. Once each agent has finished its automated process, it returns to a "wait" state until it is next triggered.
Therefore, to continue with the example, another agent 34 may be attached to another URL e.g. SMH.com and be released at 8:00am. The agent 36 may be attached to a URL e.g. news.com.au and be released at 9:00am. The agent 38 may be attached to yet another URL e.g. ordermail.com.au and be released at 10:00am.
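By way of illustration only, the release schedule of Figure 2 might be represented as follows. The class `Agent` and the functions `agents_due` and `run_cycle` are hypothetical names, and the `harvest` callable stands in for the retrieval process described later; none of these is prescribed by the description.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Agent:
    reference: int        # reference numeral used in Figure 2
    target_url: str       # site to which the agent is attached
    release_at: time      # time of day at which the agent is released
    state: str = "wait"   # agents rest in a "wait" state between cycles


# The four agents given in the example.
AGENTS = [
    Agent(32, "theage.com", time(7, 0)),
    Agent(34, "SMH.com", time(8, 0)),
    Agent(36, "news.com.au", time(9, 0)),
    Agent(38, "ordermail.com.au", time(10, 0)),
]


def agents_due(agents, now):
    """Return the waiting agents whose release time matches the current minute."""
    return [
        a for a in agents
        if a.state == "wait"
        and (a.release_at.hour, a.release_at.minute) == (now.hour, now.minute)
    ]


def run_cycle(agent, harvest):
    """Run one harvesting cycle, then return the agent to its wait state."""
    agent.state = "running"
    try:
        return harvest(agent.target_url)
    finally:
        agent.state = "wait"
```

A scheduler could call `agents_due` once a minute and pass each returned agent to `run_cycle`.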
Turning now to Figure 3, a general process flow is described beginning at step 40 when the agent begins operation. Firstly, the agent makes an HTTP GET request to retrieve the HTML document from its target URL. This is performed in step 42. In the example given in Figure 2, if the agent in step 40 is agent 32, then the URL that the request is sent to would be theage.com.au.
Almost invariably, the document that the agent receives from the target URL will include a number of links. These links will typically consist of links to other URLs. These links are filtered according to certain criteria and information the agent is loaded with, and are stored on a system server in a "spider list". Certain types of resource are filtered out, and links are also compared to an "exclusion list" on the server. Any URL which is listed on the exclusion list is ignored by the agent. In this way, from a general known website structure, links which are known to be valueless in terms of their information can be readily excluded by the system.
This step of filtering the relevant links is carried out in step 44 and is generally performed by a parsing process whereby the text and the link are analysed by the agent to look for key words or known words or word patterns, such as linguistically defined criteria or "themes", which are likely to indicate a relevant link to the information which is sought. The method includes the step of analysing text retrieved by the method against predetermined rules. The predetermined rules may include literal string (key word) matches, regular expression matches, string patterns or occurrences of text, or other linguistically defined criteria. The predetermined rules may additionally involve other text analysis technology to recognise desired matches. The rules may be used to implement a criterion against which retrieved items of information are compared to determine their relevance to various topics and therefore the manner in which the information should be indexed, or possibly discarded.
The term "spidering" refers to the process of navigating through a series of on line resources and gathering information. Therefore, the spider list which is established by the agent sets forth a pattern of links at the target site which is subsequently visited by the agent to retrieve information as is described further below.
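By way of illustration only, the filtering of links into a spider list might be sketched as follows. The names `LinkCollector`, `build_spider_list` and `SKIPPED_SUFFIXES` are hypothetical, the list of skipped resource types is an assumption, and simple key word matching stands in for the broader rules described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed examples of resource types that are not text documents.
SKIPPED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from anchor elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_spider_list(base_url, html, exclusion_list, keywords):
    """Return the links that survive the exclusion list, type and key word filters."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    spider_list = []
    for href, text in collector.links:
        url = urljoin(base_url, href)
        if url in exclusion_list:
            continue  # listed on the exclusion list, so ignored
        if urlparse(url).path.lower().endswith(SKIPPED_SUFFIXES):
            continue  # not a text resource
        haystack = (url + " " + text).lower()
        if any(k.lower() in haystack for k in keywords) and url not in spider_list:
            spider_list.append(url)
    return spider_list
```

Both the link itself and its anchor text are examined, reflecting the statement that the text and the link are analysed together.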
In step 46 the agent then proceeds to process each parsed URL from step 44 individually until all further links (of which there may be many) are checked in this manner. Again, links which are on the exclusion list are ignored by the agent.
As each URL is parsed, the agent inserts the relevant URL (or link) into a URL stream table. This occurs in step 48.
Once the spidering process has been completed, the agent then performs a query in step 50 to retrieve all the URLs from the URL stream table.
The next general step is for the agent to loop through a document retrieval process until the URLs or links from the URL stream table have been accessed, i.e. spidered. Therefore, in step 52 the process begins by the agent making an HTTP GET request to retrieve a document from the first URL. The agent then retrieves a profile for the base URL. This occurs in step 54 and the purpose is to obtain further information about any known document structure or structures at the website of interest. Therefore, profiles tend to be specific toward each target URL. If the profile is known, then this can make the content of the HTML document much easier to accurately retrieve in a desired form. If the structure of the HTML document retrieved does not match the profile then the agent defaults to retrieving the entire text of the HTML document with the HTML tags stripped out.
Therefore, in step 56, the agent executes the profile and in step 58 retrieves the relevant material (for example) in text with extraneous content stripped out.
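By way of illustration only, executing a profile with a fallback to the full stripped text might be sketched as follows. The description does not say what a profile contains, so representing it as a pair of start and end markers is purely an assumption, as are the names `strip_tags` and `extract_text`. The regular expression used to strip tags is deliberately naive.

```python
import re
from typing import Optional, Tuple


def strip_tags(html: str) -> str:
    """Naively remove anything that looks like a tag and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", without_tags).strip()


def extract_text(html: str, profile: Optional[dict]) -> Tuple[str, bool]:
    """Return (text, profile_matched).

    The profile is assumed to be {"start": marker, "end": marker}, naming
    the markers that enclose the article body on the target site.
    """
    if profile:
        start = html.find(profile["start"])
        if start != -1:
            body_from = start + len(profile["start"])
            end = html.find(profile["end"], body_from)
            if end != -1:
                # Profile match: keep only the enclosed material.
                return strip_tags(html[body_from:end]), True
    # Profile failure: default to the entire text with tags stripped out.
    return strip_tags(html), False
```

When the markers are found, only the enclosed material is kept; otherwise the whole document is reduced to text, mirroring the default behaviour described above.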
The next step 60 is for an analysis to be performed of the retrieved document. The agent analyses the text retrieved against predetermined rules, which may be called "themes", stored on the system server. The themes may consist of actual literal string (i.e. key word) matches, regular expression matches, string patterns or occurrences of text, or other linguistically defined criteria as determined.
In practice, themes are defined by system users in consultation with analysts and may consist of any of the foregoing, and additionally may involve other text analysis technology to recognise desired matches. The word "themes" is broadly used in this document to describe a scheme of criteria against which retrieved items are compared to ascertain or discern documents of relevance to the user.
Returning to Figure 3, should the query performed in step 60 result in a match, then the agent inserts the text document that has been retrieved into the system database. This occurs in step 62. If a match is not achieved, then the document is discarded.
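By way of illustration only, themes consisting of literal key word matches and regular expression matches might be sketched as follows. The class `Theme`, the function `matching_themes` and the example themes are hypothetical; other linguistically defined criteria are not modelled.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Theme:
    name: str
    keywords: list = field(default_factory=list)   # literal string matches
    patterns: list = field(default_factory=list)   # regular expression matches

    def matches(self, text: str) -> bool:
        lowered = text.lower()
        if any(k.lower() in lowered for k in self.keywords):
            return True
        return any(re.search(p, text, re.IGNORECASE) for p in self.patterns)


def matching_themes(text, themes):
    """Return the names of all themes matched by the text (empty if none)."""
    return [t.name for t in themes if t.matches(text)]


# Example themes, invented for illustration.
THEMES = [
    Theme("banking", keywords=["interest rate"]),
    Theme("mergers", patterns=[r"\bacquir(e|es|ed|ing)\b"]),
]
```

A document for which `matching_themes` returns an empty list would be discarded, and any other document would be inserted into the database.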
Having retrieved one document, the agent then returns to the next URL in the URL stream table in step 64 so that the process begins to repeat from step 52 until all URLs have been examined.
Once the spidering process is complete, the agent "returns" to the system server until the next cycle is due to begin. This is represented as step 66 in Figure 3. As described with reference to Figure 1, as each text item is added to the database, additional metadata is added to the item so that the data is organised or indexed for subsequent retrieval or for further analysis for identification purposes. Therefore, as each new record is created on the system database, the text is stored and any associated metadata (such as source URL, date retrieved, string length, HTML headers etc) is stored with the text. Each record that is created is thus a distinct and unique item in the database and is assigned a unique identifier. This identifier preferably takes the form of a 32 character UUID.
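By way of illustration only, a database record carrying the metadata listed above might be sketched as follows. The class name `ContentRecord` and its field names are hypothetical. The identifier is generated with `uuid.uuid4().hex` from the Python standard library, which yields a 32 character hexadecimal string; the description does not state which UUID variant is intended.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ContentRecord:
    text: str                 # the retrieved text document
    source_url: str           # URL from which the text was retrieved
    headers: dict             # headers returned with the document
    date_retrieved: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # 32 hexadecimal characters, unique to each record.
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def string_length(self) -> int:
        """Length of the stored text, kept as metadata."""
        return len(self.text)
```

Each call to `ContentRecord(...)` produces a record with its own identifier, so two retrievals of similar text remain distinct items.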
The system envisages storing text documents regardless of whether a theme is matched or not, so that recursive searches may be made.
Turning now to Figures 4a and 4b, a further example of spidering a target base URL is provided, using a methodology similar to that described with reference to Figure 2, but incorporating some more detail. Thus in Figure 4a, the agent executes in step 70 and an initial query occurs in step 72, which is an HTTP request to get the base URL. In step 74 a check is performed on the document returned as a result of the request. This check is to review the header data from the HTML document that is returned to ascertain the last time that the document was updated or modified. A comparison occurs in step 76, and if there is no change, then the agent returns to step 70. However, if a change has occurred, then the document is received in step 78 and is parsed in step 80 to ascertain relevant links. It is desired (but not absolutely necessary) that only links which relate to text documents are parsed and that the agent ignores links from any exclusion list as described above.
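By way of illustration only, the comparison of the last modification time might be sketched as follows. The sketch assumes that the header consulted is the HTTP `Last-Modified` header and that a missing or unreadable header is treated as a change; the function name `has_changed` and the constant `LAST_SEEN` are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def has_changed(headers: dict, last_seen: Optional[datetime]) -> bool:
    """Return True if the document appears to be newer than the last visit."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    value = lowered.get("last-modified")
    if value is None or last_seen is None:
        return True  # nothing to compare against, so assume a change
    try:
        modified = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return True  # unreadable date, so assume a change
    return modified > last_seen


# Example value: the time of the agent's previous visit.
LAST_SEEN = datetime(2002, 11, 27, 6, 0, tzinfo=timezone.utc)
```

When `has_changed` returns False the agent would return to its starting step without retrieving or parsing the document.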
In step 82 the parsed URL is processed and in step 84 the agent performs a query to check whether the processed URL is present in the URL stream table. If it is not, then in step 86 a further query is performed to check whether the URL is in the URL archive table. If the URL is not present in that table either, then the agent inserts the URL into the URL stream table together with further parameters such as the base URL, the date and time of last modification of the document to which the URL relates and a depth variable.
If the URL is identified in steps 84 or 86, then the agent continues to process the next URL in step 82 and the process continues until all the URLs have been parsed.
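By way of illustration only, the two checks preceding insertion into the URL stream table might be sketched as follows. The tables are modelled as a dictionary and a set held in memory, which is an assumption; the function name `queue_url` and the field names are hypothetical.

```python
def queue_url(url, base_url, last_modified, depth, stream_table, archive_table):
    """Insert the URL into the stream table unless it is already known.

    stream_table maps URL -> parameters; archive_table is a set of URLs
    that were processed in earlier cycles.
    """
    if url in stream_table:      # first check: already queued
        return False
    if url in archive_table:     # second check: already archived
        return False
    stream_table[url] = {
        "base_url": base_url,
        "last_modified": last_modified,
        "depth": depth,
    }
    return True
```

The return value tells the caller whether the URL was newly queued, so that the agent can simply move on to the next parsed URL when it was not.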
The process continues in step 90 when the agent retrieves from the URL stream table all the URLs that have been parsed. A GET request is then performed in step 92 for the first URL from the URL stream table. A check is then performed in step 94 to see whether the depth variable is greater than 1, i.e. whether there are further links in the document that is retrieved from that URL. If there are, then these links are parsed and the process is performed again beginning at step 80 until all the subsidiary links are parsed, and then the agent returns to step 96 where a query is performed to retrieve the profile for the relevant base URL.
The process flow continues in Figure 4b where in step 98 the agent attempts to execute the retrieved profile. If there is a profile match failure, as shown in step 100, then the full text of the HTML document is simply retrieved and all the HTML tags are simply stripped from the document. If there is a profile match success as shown in step 102, then the text from the document is easily retrieved with extraneous content removed from it. The resultant text document is then compared with the themes referred to above to see whether a match occurs in step 104. A query is then performed in step 106 to see whether the URL to which the document relates already exists. If it does, then the URL is discarded and the agent turns to the next URL in the URL stream table at step 108. However, if the URL does not already exist, then the agent inserts the full text into the content items table (i.e. into the database) together with further metadata such as the base URL and further information for identification and search purposes. This occurs in step 110. If for some reason an article cannot be extracted, then an email is generated in step 112. The agent then continues to repeat the process for subsequent URLs in the URL stream table at step 114.
Step 106 has the purpose of preventing information being retrieved and stored twice.
In Figure 5, a simplified diagrammatic illustration of the spidering process described above in Figures 3, 4a and 4b is shown. The system server is referenced 150 and a target server on which the target URL, i.e. the base URL referred to above, is located is referenced 152. An agent 154 begins by making a first pass of the base URL of the target server 152. That agent then returns data to the server as shown by arrow 156. If the information returned indicates that there are links to further URLs on the target server, then the agent makes a further pass, i.e. a second pass 158. Information from the second pass is returned to the server in step 160. Again, if the second pass shows that further links are present on the server, then a third pass 162 may be made, which will again return further information to the server. Of course a large number of passes may be made if required. The method provides a logical and straightforward way of spidering a target server for relevant information.
As can further be seen from Figure 5, information on a target server may be represented in a pie chart form. The information in an initial state of the server 170 may show that no information has been spidered. After the first pass, a certain amount of information will have been retrieved as indicated in diagram 172. After a second pass further information will have been retrieved as shown by diagram 174. Finally, after the third pass, yet more information has been retrieved as shown by diagram 176. The spidered information from the server is shown in the shaded portions of each diagram. As can be seen, a certain amount of information is ignored, and this information relates to links that have been parsed by the agent but which have been ignored because they have been determined to be a) irrelevant, b) on a list of URLs to be ignored, or c) not in the required data form (for example do not comprise a text document).
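By way of illustration only, spidering in successive passes might be sketched as follows. The function `spider_in_passes` is a hypothetical name; `get_links` and `is_ignored` are assumed callables standing in for document retrieval and for the relevance, exclusion list and data form checks. The example site is invented.

```python
def spider_in_passes(base_url, get_links, is_ignored, max_passes=3):
    """Visit a site pass by pass, following only links that are not ignored.

    Returns (visited, ignored): the URLs retrieved, in order, and the
    links that were parsed but deliberately left unvisited.
    """
    visited, ignored = [], []
    seen = {base_url}
    frontier = [base_url]            # URLs to visit in the current pass
    passes = 0
    while frontier and passes < max_passes:
        passes += 1
        next_frontier = []
        for url in frontier:
            visited.append(url)
            for link in get_links(url):
                if link in seen:
                    continue         # already parsed in this or an earlier pass
                seen.add(link)
                if is_ignored(link):
                    ignored.append(link)
                else:
                    next_frontier.append(link)
        frontier = next_frontier     # links found now are visited in the next pass
    return visited, ignored


# Invented link structure for a small target server.
SITE = {
    "/": ["/news", "/ads"],
    "/news": ["/news/a", "/news/b", "/"],
    "/news/a": ["/news/c"],
}
```

With three passes over `SITE`, the base URL is visited in the first pass, `/news` in the second, and `/news/a` and `/news/b` in the third, while `/ads` is recorded as ignored when the `is_ignored` callable rejects it.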
After a content item has been stored in the database, an "alert" will be generated. The alert configuration is definable by the client, and may take the form of an email, an SMS message, the remote updating of a web page, or remote communication with another database system or application.
The alert may be sent in "real-time" (as soon as the content item is retrieved) or after it has been analysed (after the analyst has processed the content item).
The alerts may be received singly or in digest form on a different frequency, for example daily, weekly, or even monthly if desired.
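By way of illustration only, a client-definable alert configuration with real-time and digest delivery might be sketched as follows. The class `AlertConfig`, the functions `route_alert` and `flush_digest`, and the use of an in-memory outbox in place of actual email or SMS delivery are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AlertConfig:
    client: str
    channel: str     # e.g. "email" or "sms"
    frequency: str   # "real-time", "daily", "weekly" or "monthly"


def route_alert(config, item_id, outbox, digests):
    """Send the alert at once, or hold it for the client's next digest."""
    if config.frequency == "real-time":
        outbox.append((config.client, config.channel, [item_id]))
    else:
        digests[(config.client, config.frequency)].append(item_id)


def flush_digest(config, outbox, digests):
    """Send all held items for this client as a single digest alert."""
    items = digests.pop((config.client, config.frequency), [])
    if items:
        outbox.append((config.client, config.channel, items))


# Shared state for the example: messages sent, and items awaiting a digest.
OUTBOX = []
DIGESTS = defaultdict(list)
```

`route_alert` would be called as each content item is stored, and `flush_digest` would be called by a timer at the client's chosen frequency.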
The client may view "real-time" reports showing visually the retrieval, processing and analysis of items that match their keyword themes. These reports consist of dynamic graphs, pie graphs, and other types of chart which display information and metadata pertaining to these content items. The client may further manipulate these charts and graphs with different ranges and criteria to produce different results.
The analysis may be performed by a human analyst or by a software component on the server. The analysis metadata is compiled from the client perspective and stored on a per user/client basis, so one content item may have many analyses for different clients.
The analysis allows the user to select many database cross-sections for different reports showing the analysis metadata which is linked to retrieved content items. The analysis may also be displayed real-time to the client, so that as items are updated and analysed the on-screen information is updated with no intervention from the client.
The analysis enables the user to quickly gain an understanding of the skew of a large volume of content at a glance; instead of perusing each item they are able to view a dissected overview in graphical format, which provides a powerful tool in determining real-time trends as they appear.
From the foregoing it will be seen that a system for retrieving relevant and timely information and archiving information in a form which is readily searchable and may be analysed, is provided. In particular, a methodical and efficient method of spidering target websites is provided. Also, a method of discarding irrelevant information to arrive at a document in text format is provided, together with a method of indexing or organising or identifying retrieved documents for subsequent analysis. Finally a system of conveniently and timely alerting users to the presence of information relevant to them is provided.

Claims

1. A method for automated search and retrieval of information available on a networked database, the method including the steps of:
providing search topic information,
providing a target information resource location,
spidering or dividing the target information resource location for further resource locations that are likely to lead to relevant information, and
retrieving information from the target information resource location or from a relevant one of the further resource locations.
2. A method according to claim 1 in which the networked database is the Internet.
3. A method according to claim 2 in which the retrieved information is analysed.
4. A method according to claim 3 in which an alert is provided to an entity as a result of the analysis.
5. A method for automated searching and retrieval of information, performing real time selection and retrieval of the information.
6. A method according to claim 5 in which the information is archived for subsequent analysis.
7. A method according to claim 6 including the step of establishing one or more target resource locations from which information is to be searched and retrieved.
8. A method according to claim 7 in which the target location includes a URL which is spidered by the system to identify underlying links.
9. A method according to claim 8 in which the spidering step is performed in a plurality of passes, each pass being targeted toward certain links, and each pass ignoring links that are unlikely to be relevant.
10. A method according to claim 9 including the step of retrieving information from links that appear relevant.
11. A method according to claim 10 including the step of assigning or attaching metadata to each item of information to create a database record.
12. A method according to claim 11 in which the database records are archived.
13. A method according to claim 12 in which retrieved information which is not in a textual format is converted to an editable raw-text data type.
14. A method according to claim 13 including the step of analysing retrieved text against predetermined rules to recognise desired matches.
15. A method according to claim 14 in which the rules are used to implement a criterion against which retrieved items of information are compared to determine their relevance to various topics and therefore the manner in which the information should be indexed, or possibly discarded.
16. A method according to claim 15 in which the rules include one or more of literal string (key word) matches, regular expression matches, string patterns or occurrences of text, or other linguistically defined criteria to recognise desired matches.
17. A method according to claim 16 including the step of discarding or stripping all extraneous information from the information that is retrieved including HTML tags, images and the like.
18. A method according to claim 17 in which relevant information which is the subject of a new record is stored with associated metadata.
19. A method according to claim 18 in which each record is a distinct and unique item in the database or archive and is assigned a unique identifier.
20. An automated information search and retrieval system in which real time selection and retrieval of the information occurs.
21. A system according to claim 20 including provision for archiving the retrieved information in a readily accessible manner.
22. A system according to claim 21 in which the information is searched and retrieved from the Internet.
23. A system according to claim 22 including means for establishing one or more target resource locations from which information is to be searched and retrieved.
24. A system according to claim 23 including means for spidering a target resource location to identify underlying links.
25. A system according to claim 24 including means for retrieving information from links.
26. A system according to claim 25 including means for assigning or attaching metadata to each item of information to create a database record.
27. A system according to claim 26 including means for archiving retrieved information for later analysis.
28. A system according to claim 27 including means for converting retrieved information which is not in a textual format to an editable raw-text data type.
29. A system according to claim 28 including means for providing text data from non-text sources including hard copies by conversion to text using optical character recognition processors and audio format using speech recognition applications.
30. Apparatus to implement the system or method of any one of the preceding claims.
31. A computing machine operable to implement the system or method or apparatus of any one of the preceding claims.
PCT/AU2002/001597 2001-11-27 2002-11-27 Method and apparatus for information retrieval WO2003046755A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
AU2002342413A AU2002342413A1 (en) 2001-11-27 2002-11-27 Method and apparatus for information retrieval
CA002507279A CA2507279A1 (en) 2001-11-27 2002-11-27 Method and apparatus for information retrieval
US10/496,811 US20050010556A1 (en) 2002-11-27 2002-11-27 Method and apparatus for information retrieval
EP02779016A EP1461725A4 (en) 2001-11-27 2002-11-27 Method and apparatus for information retrieval
NZ533730A NZ533730A (en) 2001-11-27 2002-11-27 Method and apparatus for information retrieval

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AUPR9146 2001-11-27
AUPR9146A AUPR914601A0 (en) 2001-11-27 2001-11-27 Method and apparatus for information retrieval

Publications (2)

Publication Number Publication Date
WO2003046755A1 true WO2003046755A1 (en) 2003-06-05
WO2003046755A9 WO2003046755A9 (en) 2003-09-12

Family

ID=3832956

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2002/001597 WO2003046755A1 (en) 2001-11-27 2002-11-27 Method and apparatus for information retrieval

Country Status (5)

Country Link
EP (1) EP1461725A4 (en)
AU (1) AUPR914601A0 (en)
CA (1) CA2507279A1 (en)
NZ (1) NZ533730A (en)
WO (1) WO2003046755A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5835905A (en) * 1997-04-09 1998-11-10 Xerox Corporation System for predicting documents relevant to focus documents by spreading activation through network representations of a linked collection of documents
US6182072B1 (en) * 1997-03-26 2001-01-30 Webtv Networks, Inc. Method and apparatus for generating a tour of world wide web sites
WO2001027793A2 (en) * 1999-10-14 2001-04-19 360 Powered Corporation Indexing a network with agents
US6304864B1 (en) * 1999-04-20 2001-10-16 Textwise Llc System for retrieving multimedia information from the internet using multiple evolving intelligent agents
US20010044800A1 (en) * 2000-02-22 2001-11-22 Sherwin Han Internet organizer

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5855015A (en) * 1995-03-20 1998-12-29 Interval Research Corporation System and method for retrieval of hyperlinked information resources
JPH1125125A (en) * 1997-07-08 1999-01-29 Canon Inc Network information retrieving device, its method and storage medium
GB2335761B (en) * 1998-03-25 2003-05-14 Mitel Corp Agent-based web search engine
US6463455B1 (en) * 1998-12-30 2002-10-08 Microsoft Corporation Method and apparatus for retrieving and analyzing data stored at network sites
KR100359233B1 (en) * 1999-07-15 2002-11-01 학교법인 한국정보통신학원 Method for extracing web information and the apparatus therefor
AU2595801A (en) * 1999-12-30 2001-07-16 Auctionwatch.Com, Inc. Minimal impact crawler
US20020103809A1 (en) * 2000-02-02 2002-08-01 Searchlogic.Com Corporation Combinatorial query generating system and method
US7418440B2 (en) * 2000-04-13 2008-08-26 Ql2 Software, Inc. Method and system for extraction and organizing selected data from sources on a network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1461725A4 *

Also Published As

Publication number Publication date
CA2507279A1 (en) 2003-06-05
WO2003046755A9 (en) 2003-09-12
EP1461725A1 (en) 2004-09-29
AUPR914601A0 (en) 2001-12-20
EP1461725A4 (en) 2005-06-22
NZ533730A (en) 2006-04-28

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SC SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
COP Corrected version of pamphlet

Free format text: PAGES 1-13, DESCRIPTION, REPLACED BY CORRECT PAGES 1-13; PAGES 14-18, CLAIMS, REPLACED BY CORRECT PAGES 14-18

WWE Wipo information: entry into national phase

Ref document number: 10496811

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2002342413

Country of ref document: AU

WWE Wipo information: entry into national phase

Ref document number: 533730

Country of ref document: NZ

WWE Wipo information: entry into national phase

Ref document number: 2002779016

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2002779016

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2507279

Country of ref document: CA

WWP Wipo information: published in national office

Ref document number: 533730

Country of ref document: NZ

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP