WO2003042874A2 - Systems and methods for indexing data in a network environment - Google Patents

Systems and methods for indexing data in a network environment

Info

Publication number
WO2003042874A2
Authority
WO
WIPO (PCT)
Prior art keywords
indexing
data
index server
network
resources
Prior art date
Application number
PCT/US2002/036276
Other languages
French (fr)
Other versions
WO2003042874A9 (en)
WO2003042874A3 (en)
Inventor
Brian Mervyn Morrow
Michael Martin Gorlick
Arthur Hughes Muir, III
Original Assignee
Endeavors Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Endeavors Technology, Inc.
Priority to JP2003544636A priority Critical patent/JP2006502461A/en
Priority to EP02792247A priority patent/EP1444613A2/en
Publication of WO2003042874A2 publication Critical patent/WO2003042874A2/en
Publication of WO2003042874A3 publication Critical patent/WO2003042874A3/en
Publication of WO2003042874A9 publication Critical patent/WO2003042874A9/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the present invention relates generally to systems and methods for indexing resources of electronic devices, and more particularly to systems and methods for indexing, searching, and/or sharing data, files, and/or other resources stored on a plurality of computing devices connected to a network.
  • Search engines are well-known tools for finding information accessible via a wide area network, such as the Internet or World Wide Web. These search engines facilitate indexing and searching for data scattered over a large number of servers or other devices connected to the network. Such large-scale web search engines generally require a crawler, an indexer, and a query engine.
  • a crawler is an application that "crawls" across the "web," following links and fetching web pages for the indexer.
  • the crawler, similar to a browser, may contact web servers or other devices connected to the network to access the web pages available on the servers.
  • the crawler may extract information from the web pages, e.g., words or phrases extracted from content of the web pages, metatags embedded within the web pages, such as HTML markups, inferences made from the link structure of the web pages (outgoing and/or incoming), and the like.
  • the indexer is a compute-intensive and storage-intensive system that receives the web page information from the crawler.
  • the indexer generally constructs a comprehensive inverted index of every web page uncovered by the crawler.
  • the query engine is an application employed by end users to search the index constructed by the indexer, e.g., to return links to candidate web pages in response to query keywords and/or other criteria (such as language, domain of origin, age, and the like) provided by the end users.
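By way of illustration only, the division of labor between the indexer and the query engine described above may be sketched as follows. This is a minimal sketch and not the implementation of any particular search engine: the function names, the example URLs, and the simple word tokenizer are all assumptions introduced here, and the query is assumed to require every keyword to be present.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(pages):
    """Indexer role: map each word to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def query(index, keywords):
    """Query engine role: return sorted URLs of pages containing every keyword."""
    words = tokenize(keywords)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)

# Hypothetical pages, as a crawler might hand them to the indexer.
pages = {
    "http://example.com/a": "Quarterly budget report",
    "http://example.com/b": "Budget meeting notes",
}
index = build_inverted_index(pages)
print(query(index, "budget report"))
```

Additional criteria mentioned above, such as language or age, could be applied as filters on the returned links; they are omitted here for brevity.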
  • a plurality of computing devices may be connected to one or more networks and/or to one another, for example, by a local area network. The number and form of computing devices connected to such networks may vary dramatically between enterprises.
  • the devices connected to a particular network may vary widely in capacity, speed, platform, and method of network connection.
  • the devices may include corporate servers, desktop computers, laptops, personal digital assistants, embedded sensor and control networks, and the like.
  • the domination of networks and the proliferation of such devices tends to push data storage out to the "edges" of an enterprise's network, making sharing of resources difficult.
  • resources such as documents, data files, and the like
  • Substantial amounts of the data and documents residing on these devices may be inaccessible to conventional crawlers and indexing engines.
  • some devices may only be intermittently connected to the network.
  • the protocol used by a crawler (such as HTTP) may not be supported by local network devices and/or the contents of the devices may be in a format unknown to a crawler and/or indexer.
  • shared network volumes may capture only a fraction of the data on a device, and the software required to support access may be unsuitable for small, mobile devices.
  • Data repositories frequently require explicit submission by users of the data stored on their devices, and therefore, the repository contents may not be current or comprehensive with respect to data available on many of the devices.
  • Repository indexing may also rely solely on keywords submitted by the users of respective devices whose data is indexed in the repository, and those keywords may not effectively reflect the data actually stored on the respective devices.
  • Knowledge management systems often rely on proprietary formats for content, restrict content to a small number of formats, and/or are specialized for a narrow domain. Thus, such systems may be ineffective for indexing and searching a broad array of information resources available on a network.
  • the present invention is directed generally to systems and methods for indexing resources available on electronic devices connected to a network. More particularly, the systems and methods of the present invention may facilitate an enterprise, such as a business, educational institution, or other organization, indexing, searching for, and/or sharing resources, such as documents, records, databases, media files, e-mail archives, and the like, that may be available on the enterprise's network.
  • the resources may be stored on any device that may be connected to the network, yet may be quickly found, preferably in a substantially secure environment.
  • a system for generating indexing data stored on a plurality of electronic devices connected to a network is provided.
  • the indexing data may include one or more pieces of information related to a respective device, such as information intake, content, and output; hardware configuration, settings, and status; software configuration, settings, and status; system and control logs; manner, rate, pattern, and frequency of use, and the like.
  • the devices for which indexing data may be generated may include desktop computers, laptops, mobile phones, telephones, printers, fax machines, personal digital assistants, portable digital devices, digital media players and recorders, appliances, heating, ventilation, communication, and electrical systems, sensors and actuators, automotive electronic and mechanical systems, technical, scientific, and medical instruments, machine tools, material handling, manufacturing, assembly, and delivery systems, and the like.
  • each electronic device on the network includes an indexing agent, e.g., one or more embedded digital processors, hardware components, and/or software modules, for indexing resources on the respective device.
  • the indexing agent is a web server resident on the respective device, that may include a translator, an authentication module, a presence module, and/or a thin server.
  • the indexing agent generates indexing data that includes content data describing individual resources stored on the respective device, and location identifiers, such as device-specific URLs and URL links, identifying the location of the individual resources associated with the respective content data, or other Uniform Resource Identifiers ("URIs").
  • the indexing agent extracts content-related information regarding the resources stored on the respective device, and stores the generated indexing data as web pages, for example, in HTML or XML format, or alternatively as text.
  • the indexing data may be stored in memory of the respective device for subsequent use or transfer, as described further below.
  • the indexing agent may include a translator, including one or more modules for translating device-specific information into indexing data that may be interpreted by a crawler or indexer. If the information from the device is already in a format that may be interpreted by the crawler or indexer, e.g., HTML or XML, no translation may be necessary. For resources that are not already crawler or indexer compatible, however, such as word processor documents, media files, and the like, the indexing agent may extract content information regarding the resources, and the translator may translate the information, for example, into HTML or XML, and then the indexing agent may store the translated information as web pages.
  • Each indexing agent is preferably configured as one or more modules that operate silently in the background substantially undetected by the user of the respective device as processor cycles and/or other related bandwidth become available.
  • the indexing agent may automatically and periodically index desired portions of the device's resources such that the user of the device need not schedule or otherwise activate the indexing agent to create or update the indexing data.
  • the periodic indexing by the indexing agent may generate a complete index of the resources on the respective device, or it may generate an updated index, i.e., only reflecting resources that have changed since a previous indexing.
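The distinction between a complete index and an updated index may be illustrated with the following minimal sketch. It assumes, purely for illustration, that the agent remembers a SHA-256 fingerprint of each resource from the previous pass; the source does not say how changes are detected, and the names `fingerprint` and `changed_resources` are hypothetical.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest identifying this version of a resource."""
    return hashlib.sha256(content).hexdigest()

def changed_resources(resources, previous):
    """Compare resources against fingerprints saved by the previous pass.

    resources: resource name -> current bytes
    previous:  resource name -> fingerprint from the last indexing pass
    Returns (changed names, removed names, new fingerprint table).
    """
    current = {name: fingerprint(data) for name, data in resources.items()}
    changed = sorted(n for n, digest in current.items() if previous.get(n) != digest)
    removed = sorted(n for n in previous if n not in current)
    return changed, removed, current

# First pass: nothing remembered, so every resource counts as changed
# (equivalent to a complete index).
resources = {"report.doc": b"draft 1", "notes.txt": b"todo"}
changed, removed, state = changed_resources(resources, {})

# Second pass: only the edited and deleted resources are reported.
resources["report.doc"] = b"draft 2"
del resources["notes.txt"]
changed2, removed2, state2 = changed_resources(resources, state)
print(changed2, removed2)
```

An agent that prefers not to read every resource on each pass could compare file modification times instead, at the cost of missing changes that preserve the timestamp.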
  • Each device also includes a communication interface for making the indexing data, for example, in the form of HTML web pages, available to a crawler and/or indexer.
  • Such communication interfaces may include a modem, a network interface, such as an Ethernet card, a communications port, a PCMCIA slot and card, an infrared interface, and the like.
  • in accordance with another aspect of the present invention, a system includes a network, a plurality of electronic devices that are at least intermittently connected to the network, and one or more index servers.
  • Each of the electronic devices preferably includes an indexing agent, such as that described above.
  • the one or more index servers include a search engine that is connected to the network.
  • the index server may be a centralized computer system that includes a crawler, an indexer, and/or a query engine.
  • the crawler is an application that may periodically contact each of the devices connected to the network, and transfer the respective indexing data generated by the indexing agent on the respective device to the indexer.
  • each indexing agent may simply search the respective indexing data whenever a query is received from an authorized search engine, making a crawler and/or indexer unnecessary.
  • each indexing agent may "push" its indexing data directly to the indexer.
  • the indexing agent may be pre-programmed or instructed to update the indexing data at a desired frequency and transfer the updated indexing data to the indexer, or the indexer may periodically poll the indexing agent of each device.
  • mobile devices i.e., devices whose communication interface may be disconnected from the network for extended periods or whose connection to the network is intermittent, may include an indexing agent that includes a presence module.
  • the presence module may automatically register the presence of the device when it is initially docked in or otherwise connected to the network. Once connected, the indexing agent may automatically push its current set of indexing data to the indexer. Alternatively, the indexing agent may generate a new set of indexing data for the device or generate an updated set of indexing data reflecting any changes since the device's last connection, and transfer the new set to the indexer.
  • the indexing data may be generated in a predetermined order of decreasing priority or criticality.
  • the presence module may also provide a presence service, transmitting a notification to all other users of the network of the appearance or connection of the respective device to the network. Alternatively, only a subset of users may "subscribe" to presence notification for a given set of devices for which the subscribers have sufficient access authority. In this manner, the index server or subscriber may be notified when a specific device of interest connects to the network or when any device connects.
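The subscription model described above may be sketched, under stated assumptions, as a small publish/subscribe registry. The class name `PresenceService`, its methods, and the device identifiers are hypothetical, and the check that a subscriber has sufficient access authority is deliberately left out of this sketch.

```python
from collections import defaultdict

class PresenceService:
    """Notifies subscribers when a device registers its presence."""

    def __init__(self):
        self._by_device = defaultdict(list)  # device id -> callbacks
        self._any_device = []                # callbacks for every device

    def subscribe(self, callback, device_id=None):
        """Subscribe to one device, or to all devices when device_id is None."""
        if device_id is None:
            self._any_device.append(callback)
        else:
            self._by_device[device_id].append(callback)

    def register(self, device_id):
        """Called when a device connects; returns how many subscribers were notified."""
        callbacks = self._any_device + self._by_device.get(device_id, [])
        for callback in callbacks:
            callback(device_id)
        return len(callbacks)

service = PresenceService()
seen = []
# The index server wants to hear about any device; one user follows one laptop.
service.subscribe(lambda d: seen.append(("index-server", d)))
service.subscribe(lambda d: seen.append(("alice", d)), device_id="laptop-20")
service.register("laptop-20")
service.register("pda-30")
print(seen)
```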
  • the device may also include an authentication module that may provide security for the device.
  • the authentication module may authenticate the connection of the device to a given network, for example, by requiring the crawler or indexer to authoritatively identify itself to the device before the indexing agent provides access to the indexing data. This may ensure that the crawler or indexer on the connected network is authorized to access the indexing data on the device.
  • the authentication module may substantially reduce the risk that a mobile device is connected to a foreign network, i.e., connected to a network other than the proper enterprise's network, and provides information to a non-authorized system.
  • the indexing agent may filter access to the indexing data based upon authentication circles, providing increasing levels of access to the indexing data and/or the indexed resources themselves.
  • the indexing agent may protect the device, allowing it to be crawled only by authorized crawlers.
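One way a crawler might identify itself before the indexing agent grants access is a challenge-response exchange over a pre-shared key, sketched below. The source does not specify any protocol, so the HMAC scheme, the class `IndexingAgentAuth`, the helper `sign`, and the numeric access levels standing in for authentication circles are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

def sign(key: bytes, challenge: bytes) -> bytes:
    """Crawler side: prove knowledge of the key by MACing the challenge."""
    return hmac.new(key, challenge, hashlib.sha256).digest()

class IndexingAgentAuth:
    """Agent side: maps known crawlers to a key and an access level."""

    def __init__(self, known_crawlers):
        # crawler id -> (shared key, access level); level 0 means no access
        self._known = known_crawlers

    def new_challenge(self) -> bytes:
        """Issue a fresh random challenge so old responses cannot be replayed."""
        return secrets.token_bytes(16)

    def access_level(self, crawler_id, challenge, response) -> int:
        """Return the crawler's access level, or 0 if it fails to authenticate."""
        entry = self._known.get(crawler_id)
        if entry is None:
            return 0
        key, level = entry
        expected = hmac.new(key, challenge, hashlib.sha256).digest()
        return level if hmac.compare_digest(expected, response) else 0

key = b"hypothetical-shared-secret"
agent = IndexingAgentAuth({"enterprise-crawler": (key, 2)})
challenge = agent.new_challenge()
print(agent.access_level("enterprise-crawler", challenge, sign(key, challenge)))
```

A crawler on a foreign network would not hold the key, so it would receive level 0 and the agent would serve it nothing.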
  • the index server may be notified and the crawler from the index server may immediately begin crawling the device, collecting pages from its resident indexing agent. Crawling may continue for as long as the device is connected to the network or until completion of transfer of the indexing data.
  • the network may provide dynamic DNS service that allows the crawler to obtain an IP address of the device even if it is dynamically assigned and changes from one network connection to another. This IP address may be reflected by appropriately modifying or supplementing the location identifiers included in the indexing data to ensure that the index server may be able to subsequently identify the correct device having a particular resource therein.
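Modifying a location identifier to reflect a device's current address may be sketched as follows. This is only an illustration: the function name `rebind_location` and the example addresses are hypothetical, and the sketch assumes plain host names or IPv4 addresses with no user information in the URL.

```python
from urllib.parse import urlsplit, urlunsplit

def rebind_location(url: str, current_host: str) -> str:
    """Return the URL with its host replaced by the device's current address.

    The scheme, port, path, query, and fragment are preserved.
    """
    parts = urlsplit(url)
    netloc = current_host if parts.port is None else f"{current_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))

# The device was indexed at one address and has since reconnected at another.
print(rebind_location("http://10.0.0.5:8080/docs/a.html?x=1", "10.0.3.17"))
```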
  • a more selective form of crawling may be utilized. For example, the crawler may wait for direct contact from the device itself, e.g., in which the device informs the crawler of exactly where in the page space of the device to begin crawling. In this manner, the indexing agent on a respective device may instruct the crawler to collect just those pages that have changed since the device was last visited by the crawler. In a further alternative, the indexing agent may inform the crawler of a set of pages to visit according to a predetermined or desired priority.
  • the index server may archive the web pages collected by the crawler.
  • a search engine may be used to view the web pages of a device even if the device is disconnected from the network, since the index server itself has a copy (albeit one that may be out-of-date).
  • the system may facilitate indexing of devices whose communication interface is of low bandwidth or unreliable.
  • Low bandwidth connections may present a particular challenge for crawling the contents of "bandwidth-challenged" devices.
  • the indexing agent may adopt tactics that ameliorate the deficiencies of the connection.
  • the indexing agent and the crawler may have a transport encoding in common such that the indexing agent may compress offline the web pages that it wants crawled and indexed.
  • the indexing agent may direct the crawler to crawl just those pages for which it has generated compressed content. In this manner, the device may make optimal use of its limited bandwidth.
  • the indexing agent may break the indexing data into small, individual "mini-pages," no one of which requires a substantial amount of transmission time.
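The two tactics above, compression and mini-pages, may be combined as in the following minimal sketch. It assumes zlib as the transport encoding shared by agent and crawler and a fixed character budget per mini-page; neither choice comes from the source, and the function names are hypothetical.

```python
import zlib

def make_mini_pages(text: str, max_chars: int = 512):
    """Split text into pieces of at most max_chars and compress each piece.

    Each mini-page decompresses on its own, so an interrupted transfer
    loses at most the page in flight.
    """
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    return [zlib.compress(piece.encode("utf-8")) for piece in pieces]

def read_mini_pages(mini_pages):
    """Crawler side: decompress each mini-page and reassemble the text."""
    return "".join(zlib.decompress(page).decode("utf-8") for page in mini_pages)

text = "indexing data " * 200
mini_pages = make_mini_pages(text)
print(len(mini_pages), read_mini_pages(mini_pages) == text)
```

Compressing each piece separately gives a worse ratio than compressing the whole page once, which is the price paid for making every mini-page independently usable.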
  • the indexing agent may control the transfer of indexing data to ensure that personal, sensitive, or proprietary information is substantially securely transferred from the respective device to the indexer.
  • the indexing agent and the crawler may establish a secret session key known to them alone that permits the substantially secure transmission of sensitive information from the device to the crawler.
  • FIG. 1 is a schematic drawing, showing a network architecture, according to the present invention.
  • FIG. 2 is a schematic diagram of a computing device including an indexing agent, in accordance with the present invention.
  • FIG. 3 is a flowchart showing a method for indexing resources on an electronic device, in accordance with the present invention.
  • FIG. 4 is a flowchart showing a method for searching indexing data generated by indexing agents in response to a query from a search engine, in accordance with the present invention.
  • FIG. 5 is a flowchart showing a method for implementing a searchable database of indexing data related to the content of resources on a plurality of devices, in accordance with the present invention.
  • FIG. 6 is a block diagram showing an exemplary computer system in which certain elements and functionality of the present invention may be implemented.
  • FIG. 1 is a top-level block diagram illustrating an example of a network architecture, according to an embodiment of the present invention.
  • Electronic devices 10, 20, 30, n are each at least intermittently connected to a network 40, and an index server 50 is also connected to the network 40.
  • the network 40 may be a local area network ("LAN"), an Intranet, and/or a wireless communications network.
  • the network 40 may include a plurality of several different types of networks (not shown), including, but not limited to, a LAN, an Intranet, or a wireless network.
  • the network 40 incorporates all of the electronic devices within an enterprise that are capable of sharing information and/or being connected to the network 40.
  • the index server 50 includes a search engine 52 and a database 58 of indexing data related to each of the devices 10, 20, 30, n.
  • the search engine 52 includes an indexer 54 and a query engine 56, and optionally may also include a crawler (not shown), as described further below.
  • the indexer 54 receives indexing data from the indexing agents on the devices 10, 20, 30, n and creates a searchable index stored in the database 58.
  • the query engine 56 may be used to search the database 58 to identify, locate, and/or access resources related to a given query, as described further below.
  • the devices 10, 20, 30, n connected to the network 40 may include computing devices, such as desktop computers or other fixed workstations, and/or mobile or portable devices, such as laptops, personal digital assistants ("PDA's"), wireless access protocol ("WAP") telephones, portable digital devices, and the like.
  • Each of the devices is generally capable of supporting and includes an indexing agent 60 resident on the device, as described further below with reference to FIG. 2.
  • other electronic devices may be included in the network, such as telephones, printers, fax machines, digital media players and recorders, appliances, heating, ventilation, communication, and electrical systems, sensors and actuators, automotive electronic and mechanical systems, technical, scientific, and medical instruments, machine tools, material handling, manufacturing, assembly, and delivery systems, and the like.
  • These devices may also include a resident indexing agent, or, alternatively, they may instead include a server, but may be directly coupled to another device including an indexing agent that may use the server to index resources on the device.
  • In FIG. 2, a schematic of an exemplary computing device 10 is shown that includes an indexing agent 60 configured for generating indexing data 70 regarding resources 68 of the device 10.
  • the device 10 may include a number of modules, such as a server 62, a translator 64, an authentication module 66, and/or a presence module 67, that may be controlled and/or accessed by the indexing agent 60.
  • the device 10 may include conventional memory (not shown) for storing the resources 68 and/or the indexing data 70, and one or more processors for performing various functions, as will be appreciated by those skilled in the art.
  • An exemplary hardware architecture for the device is shown in FIG. 6, and described further below.
  • the device 10 includes a communication interface 72 for connecting the device 10 to the network and/or otherwise communicating with other devices (not shown in FIG. 2).
  • the communication interface 72 may include a modem, a network interface, such as an Ethernet card, a communications port, a PCMCIA slot and card, an infrared interface, and the like.
  • the indexing agent 60 is a specialized, resource-conservative, embedded server, e.g., an HTTP web server.
  • the indexing agent 60 may be relatively small compared to conventional crawlers or servers, such that it may be installed on virtually any electronic or computing device, including personal devices such as personal digital assistants (e.g., Palm Pilots), WAP phones, or embedded micro-controllers, yet it may provide all of the services needed to index local resources on the device.
  • the term “indexing agent” is used generally herein to refer to such an embedded web server, or any combination of hardware-based components and/or software-based modules that may perform the indexing features described herein.
  • the term “thin server” may also be used to refer to the indexing agent 60, because of its relatively small size compared to conventional servers.
  • the indexing agent 60 may direct the server 62 to access the device's resources 68, e.g., to serve up the device's file system, configuration data, and/or other resources, to facilitate the indexing agent 60 systematically indexing all of the resources to be indexed therein.
  • the indexing agent may be capable of accessing the device's resources 68 directly, without the intervention of the server 62, and the server 62 may be eliminated.
  • the indexing data 70 generated by the indexing agent 60 preferably includes content data associated with respective resources 68 available on the device 10.
  • the content data generally includes information describing individual resources of the device, preferably based upon the content of the individual resources, and/or metadata associated with the individual resources.
  • the indexing agent 60 may store the compatible information, e.g., text, metatags, and the like, as content data.
  • the indexing agent 60 may use the translator 64 to translate information extracted from a particular resource into a format that is capable of being interpreted by a crawler or indexer.
  • the translator 64 may include one or more device-specific modules for translating particular types of resources, such as application files, word processor files, spreadsheets, media files (such as image, audio and video files), databases, Portable Document Format ("PDF") files, and the like.
  • the translator 64 may also translate device configuration data, such as hardware or software settings, into formats that may be interpreted by a crawler or indexer.
  • the translator 64 may translate device-dependent content into Web-standard formats, such as HTML or XML. Consequently, the indexing data may include information and data for which no extractors previously existed, allowing the indexing data to be crawled and extracted by a crawler of any common web search engine. In effect, the indexing agent 60 and translator 64 may act as an extractor for the benefit of a crawler.
  • the indexing agent 60 also generally assigns location identifiers to each piece of content data to identify the location of the individual resources associated with the respective content data.
  • the location identifiers may identify the location of the respective resources within the device's file space, and/or may identify the specific device itself.
  • the location identifiers are device-specific Uniform Resource Locators ("URLs"), identifying a location where the resource may be found, or other Uniform Resource Identifiers ("URIs"), identifying a process for identifying the location, e.g., to identify a portable device that may be located at one or more locations in the network.
  • the network may assign virtual location identifiers specifically for the benefit of an index server (not shown in FIG. 2).
  • the network e.g., the index server
  • the presence module 67 may provide notice for such mobile devices.
  • the presence module 67 may announce to a network when the device 10 is connected to the network.
  • the authentication module 66 may include a security protocol to ensure that the device 10 is connected to a network that is authorized to access the indexing data 70, as described further below.
  • the indexing agent 60 is preferably configured to operate substantially silently in the background undetected by users of the device 10, e.g., as processor cycles and disk bandwidth become available.
  • the indexing agent 60 preferably automatically and periodically indexes the device's resources 68.
  • the indexing agent 60 may generate a complete index of the resources 68 on the respective device, or it may generate an updated index, i.e., only reflecting resources 68 that have changed since a previous indexing.
  • the indexing agent may access resources on the device in order to extract content-related information from the resources. This may involve the indexing agent directing a server resident on the device to serve up the device's resources, e.g., its file system, memory or other storage devices, peripheral devices, and the like, or may involve the indexing agent accessing the resources directly.
  • the indexing agent may also access other resources, such as software and hardware configuration settings of the device.
  • the indexing agent determines whether the respective resources are already in a web-standard format, such as HTML. If not, at step 114, the indexing agent extracts content-related information from the resources, and then, at step 116, translates the information into a web-standard format, as content data. If the respective resources are already in a web-standard format, content-related information is extracted from the resources at step 118 as content data.
  • the indexing agent assigns location identifiers to the content data, associating the content data with respective resources, e.g., using URLs that identify the location of the respective resources within the device's file space or other URIs.
  • the indexing agent may assign a dynamic URI to the content data that identifies the device independent of its specific connection to the network, e.g., provided by the network, as explained above.
  • the indexing agent stores the content data and associated location identifiers as indexing data in memory of the device.
  • the indexing agent stores the content-related information as web pages, for example, using HTML or XML markups, or as text.
  • the indexing data is stored in a format that may be easily crawled by a conventional web crawler or interpreted by a conventional indexer, as described further below.
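The flow just described (check the format, extract, translate, assign a location identifier, store) may be sketched as below. This is an assumption-laden illustration only: resources are modeled as already-extracted text keyed by file name, the format check uses the file extension, the translation simply wraps escaped text in HTML, and every name and URL is hypothetical.

```python
import html
import os.path

WEB_STANDARD_EXTENSIONS = {".html", ".htm", ".xml"}

def is_web_standard(name: str) -> bool:
    """Decide from the file extension whether a resource needs translation."""
    return os.path.splitext(name)[1].lower() in WEB_STANDARD_EXTENSIONS

def translate_to_html(name: str, text: str) -> str:
    """Stand-in for steps 114 and 116: wrap extracted text in a minimal HTML page."""
    return (
        f"<html><head><title>{html.escape(name)}</title></head>"
        f"<body><pre>{html.escape(text)}</pre></body></html>"
    )

def index_device(resources, base_url: str):
    """Return indexing data: location identifier (URL) -> web-format content data."""
    indexing_data = {}
    for name, text in resources.items():
        # Step 118 analogue: web-standard resources are used as they are.
        content = text if is_web_standard(name) else translate_to_html(name, text)
        indexing_data[f"{base_url}/{name}"] = content
    return indexing_data

data = index_device(
    {"notes.txt": "a < b", "page.html": "<html>ok</html>"},
    "http://device-10.example",
)
for url in sorted(data):
    print(url)
```

A real translator would need a format-specific extractor for each resource type, such as word processor or media files, which this sketch does not attempt.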
  • the indexing agent 60 may then make the indexing data 70 available to external devices, e.g., an index server, crawler, search engine, and the like (not shown).
  • In FIG. 4, a method is shown wherein the indexing agent may retain the indexing data on the device and respond to queries from a search engine connected to a network.
  • the indexing agent having generated indexing data for the device, may receive a query from a search engine, such as a request from a requestor whether any files on the device include particular keywords.
  • the query may be sent by the search engine to all of the devices connected to the network, or only to a specific subset of devices, such as those used by participants in a particular project group.
  • the indexing agent may then search the indexing data for content data related to the query at steps 132, 134. If no match is found, the indexing agent responds to the search engine at step 136 with a negative response, or alternatively no response at all. If one or more matches are found, the indexing agent may provide the search engine with information regarding the resource(s) whose content data matched the query criteria. The extent of information provided may depend upon the authority of the search engine and/or the requestor to access the indexing data and/or resources on the device. For example, the response may merely indicate that matches were found, e.g., identifying the device, without providing any further details.
  • the response may include the URL or URI for the resource(s) that resulted in matches, possibly also including the content data that resulted in the match.
  • the response may include transferring the resource itself, e.g., to provide a copy of a file on the device that matches the query to the requestor.
  • This method of serving up indexing data "on the fly" may be suitable for smaller enterprises that include only a limited number of devices.
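The graduated responses described for this method may be sketched as follows. The numeric authority levels, the function name `answer_query`, and the use of case-insensitive substring matching are assumptions made for illustration; the sketch stops short of transferring the resource itself.

```python
def answer_query(indexing_data, keywords, authority):
    """Search the device's own indexing data and shape the reply by authority.

    indexing_data: location identifier (URL) -> content data text
    authority: 0 = not authorized (no response), 1 = match indication only,
               2 = plus location identifiers, 3 = plus the matching content data
    """
    if authority < 1:
        return None  # no response at all to an unauthorized requestor
    words = [w.lower() for w in keywords]
    matches = {
        url: text
        for url, text in indexing_data.items()
        if all(w in text.lower() for w in words)
    }
    if not matches:
        return {"matched": False}
    response = {"matched": True}
    if authority >= 2:
        response["locations"] = sorted(matches)
    if authority >= 3:
        response["content"] = matches
    return response

# Hypothetical indexing data held by one device.
device_index = {
    "http://device-10.example/budget.html": "Quarterly budget report",
    "http://device-10.example/notes.html": "Meeting notes",
}
print(answer_query(device_index, ["budget"], authority=2))
```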
  • the indexing agent pushes its indexing data to a repository or centralized index.
  • each of the devices 10, 20, 30, n preferably includes an indexing agent (not shown), which may push the indexing data from the respective device to the index server 50.
  • This model brings scalable, comprehensive, and speedy enterprise-wide indexing and/or searching to any electronic or computing device within an enterprise.
  • an index server 50 may receive indexing data from a plurality of devices 10, 20, 30, n connected to the network 40.
  • the indexing agents (not shown) on the devices 10, 20, 30, n have previously generated the indexing data, including content data describing content of resources on the respective devices, as described above.
  • the indexing agents may transfer the respective indexing data to the index server using one of several models described further below.
  • the indexer 54 may compile the indexing data into a database 58 at step 152 using any known method for creating an inverted index or other searchable database.
  • the indexing data may be received from all of the devices at one time and then compiled, or indexing data from devices may be compiled intermittently, for example, as indexing data becomes available from mobile devices.
  • the index server may store web pages including the indexing data or otherwise retain a copy of the indexing data as stored by the indexing agents on the respective devices. This may be useful for archiving mobile devices, which may not be connected to the network when a query is submitted.
  • the database 58 may then be used to search for resources in response to queries by requestors having access to the database 58, such as co-workers, human resources personnel, security personnel, and the like.
  • the query engine 56 may receive a query, e.g., including keywords or other search criteria, submitted by a requestor.
  • the query engine 56 may access the database 58 at step 156 to search for indexing data related to the query, e.g., to identify any content data that matches the keywords or other criteria submitted by the requestor.
  • the query engine 56 may search the entire database 58 or a subset of the database 58, as will be appreciated by those skilled in the art.
  • the query engine may send a response to the requestor, indicating whether or not any matches were found. If any matches are found, the query engine may also provide additional information to the requestor, depending upon their access authority.
  • the response may include a device URL or URI or otherwise identify the device(s) that includes resources corresponding to content data satisfying the query, possibly identifying the user of the device. This level of response may be sufficient to identify the devices or users that satisfy the query without divulging the actual content of the resources, which may be sensitive, personal, or otherwise inappropriate for the requestor to access or review.
  • the response may include the location identifiers of any resources satisfying the query, either with or without explicitly identifying the device itself, thereby providing access to the resource. This level of response may be appropriate for shared files, such as those that should be available to members of a common project.
  • placing an indexing agent on each of the devices connected to a network may permit a search engine to discover, index, and query resources resident on the devices that were previously unknown, uninventoried, and/or largely inaccessible.
  • a search engine may, with no additional effort, discover and access new sources of content enterprise-wide, as explained further below.
  • the indexing agents and index server may transfer the indexing data between them in a variety of different ways.
  • the indexing agents may generate the indexing data only when instructed to do so by the index server or by the device's user.
  • the users of the devices are not involved in the indexing activity of the indexing agents, i.e., the indexing agent acts autonomously in the background such that the users are unaware of and/or not substantially affected by the indexing agents' activities.
  • the devices automatically generate and transfer ("serve up") the indexing data with a predetermined granularity to keep the database substantially current.
  • the indexing agents periodically generate indexing data, e.g., with substantially fixed or predetermined time periods between the generation of each set of indexing data.
  • the indexing data may be a complete set of indexing data reflecting all of the indexed resources on the device.
  • the indexing data may be an updated set including indexing data only for resources whose status has changed since a previous set, e.g. new, edited, or deleted files.
  • the indexing data may be generated "offline," i.e., when the devices are not connected to the network, e.g., at periodic intervals.
  • their indexing agents may automatically initiate the transfer of their respective indexing data to the index server. If the device is disconnected from the network, the indexing agent may discontinue transfer, and store the location within the indexing data where the transfer was discontinued. When the device is again connected to the network, the indexing agent may resume at the location where it left off.
  • mobile devices need not be connected to the network to allow transfer of indexing data in a single session, but transfer may be accomplished incrementally over several successive connections.
  • the index server includes a crawler, such as a conventional web crawler.
  • the crawler is preferably an autonomous robot that systematically and/or periodically contacts each of the devices including resources to be indexed.
  • the crawler may initiate contact successively with the indexing agents on the devices, and exchange "handshakes" to confirm connectability, identify itself, and/or to complete a security protocol confirming that the crawler has sufficient authority to access the indexing data on the respective devices.
  • the indexing agents may serve up their indexing data to the crawler.
  • the indexing agent may offload the task of extracting URLs and content from the crawler by assigning virtual URIs that exist specifically for the benefit of the crawler.
  • the indexing agent may generate a dynamic page that summarizes the content of the original page. This may substantially reduce network transfers and load, and may improve the incisiveness of the indexing.
  • the technique of assigning URIs for the benefit of a crawler may be further extended to create and push device-specific content to a repository or indexer.
  • the indexing agent may have the ability to authenticate and control access.
  • "u" may represent a URL served by an indexing agent "M," where u denotes content in a format for which no crawler extractor exists, for example, a device-specific content configuration.
  • M may create a virtual URL "v" that, when accessed, may trigger the translation of the content of u into standard HTML. In this manner, M may perform extraction on behalf of the crawler, giving it access to formats for which no crawler extractors have previously existed.
  • the indexing agent may be configured as a crawler-aware and/or indexer-aware server that offers device-dependent and/or content-dependent indexes directly to the crawler.
  • the authentication protocols and access controls of the indexing agent may allow the indexing agent to generate crawler-specific content that is optimized for the indexer of the search engine.
  • the indexing agent executing on a personal digital assistant may generate a summary of the Pilot's memo pad that may be suitable for cross-indexing with a departmental project web site.
  • crawler and indexer may be generalized considerably if the device's indexing agent knows of, and cooperates with, the crawler.
  • the architecture outlined here permits the deployment of enterprise-specific and/or domain-specific crawlers and indexers.
  • Crawlers may be deployed within an enterprise to search for a specific form of content, for example, all content relating to a specific project.
  • the crawler may move from device to device throughout the enterprise's network, and, with the cooperation of the indexing agents onboard each of the network's devices, may be served with just the content sought by the crawler, thereby generating relevant and incisive indices.
  • an indexing agent in accordance with the present invention may facilitate access by crawlers when a target device is not connected to the network as the crawler is making its rounds.
  • the indexing agent may announce its presence to the network when the device connects. This event notification may be propagated to all interested subscribers including, for example, a crawler, thereby allowing the crawler to immediately visit the device and push needed content back to the indexer.
  • the indexing agent may pre-index relevant content for the search engine and, when connected, push the indexing data back to the search engine for inclusion in, and integration with, the enterprise index database, as described above.
  • the indexing agent may "bleed" the indexing data incrementally to a search engine over the span of multiple connections to the network.
  • This strategy may be particularly appropriate for a device that is connected for only brief periods or supports only a low bandwidth connection.
  • because the indexing agent on a device may act in a substantially autonomous fashion, it may index the resources on the device and notify the index server when the device is connected and/or has an update for the index server.
  • mobile and intermittently connected devices may be intelligently included in enterprise data searches.
  • the index server may not only be able to find resources on all devices in the network, but it may also instantly know the connection status of the device(s) that contained the resource(s) pointed to by the indexing data in the database. This opens the possibility of contacting the user of a disconnected mobile device in real-time to request that the device with critical data be connected to the network as soon as possible.
  • the indexing agent may also be able to filter responses for sensitive data.
  • For example, financial, human resources, medical, personal, and/or other sensitive data may be contained appropriately, e.g., using the authentication module of the indexing agent.
  • the index server 50 may include a single server, or it may include a plurality of servers, each sharing a database or generating independent databases, e.g., including different types of compiled indexing data.
  • a single search engine, or a plurality of search engines may be provided.
  • a search engine (C, I, Q) includes three separate, but related, components: a crawler C, an indexer I, and a query engine Q. Each component may be characterized by action (what is done), locale (where it is done), and time (when it is done). In this manner, a taxonomy of search engines may be constructed that characterizes the range of variation available to search engine components with respect to action, locale, and time. Additional information on search engines may be found in S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proceedings of the Seventh International World Wide Web Conference, 1998, pp. 107-118, the disclosure of which is expressly incorporated herein by reference.
  • Crawlers are robot applications that substantially autonomously fetch data, preferably in the form of "web pages," for submission to an indexer of a search engine. Additional information on crawlers may be found in A. Rappaport, Robots & Spiders & Crawlers: How Web and intranet search engines follow links to build indexes, Search Tools consulting, available at www.searchtools.com, the disclosure of which is expressly incorporated herein by reference.
  • Given a location identifier identifying a particular web page, e.g., a URL "u," the crawler C first may decide whether or not to visit the web page designated by u. If affirmative, the crawler may reach u if C can connect to the host of u, and C has the authority to access the page designated by u. Connectivity and access authority are two separate, but related, considerations. The first is the ability to establish a connection, e.g. a TCP connection, between the crawler and the web server and may vary with the position of the crawler within the network (for example, relative to a firewall) or the network quality of service (such as congestion or routing anomalies).
  • the crawler may be required to obtain permission to read the page, for example, if it is password protected. Access may also be restricted to a finite set of users or hosts whose identity may be determined, for example, by inspecting a source IP address of the packet stream associated with the host or using cryptographic methods.
  • the crawler may extract whatever links it can for the next round of crawling. Extraction depends on the form and semantics of the content and the extractors available to the crawler. For example, all crawlers may extract links from HTML pages; however, few crawlers have the extractors required to lift links embedded within PDF or Microsoft Word documents.
  • a crawler is characterized by: a) the IP address of the crawler host which limits, with respect to network topology, routing, and firewalls, the remote hosts to which the crawler may connect; b) access authority; c) a loading policy that gleans URLs of value from the set at hand; d) an extraction policy that determines if the contents of a web page will yield URLs; and e) a set of extractors for extracting URLs from various forms of content.
  • An “extraction policy” E(pu) is a decision procedure that returns true if links (URLs) can be extracted from pu and false otherwise.
  • E may inspect the URL u, the MIME type of u (contained within the HTTP response) and the page contents pu since all offer valuable hints as to the format and structure of pu. For example, if the MIME type is "html,” then pu is a page whose structure is well defined (by the HTML specification) and amenable to the extraction of links. If the MIME type is unspecified (the HTTP response omitted the Content-Type header field), then the crawler may examine the syntax of the URL or the content itself to infer the media type.
  • the URL suffix .wav or .au may indicate (by common convention) an audio file that may contain links (rendered as speech) but whose extraction by machine agents is problematic at best. Some audio formats, however, may provide for the inclusion of digital metadata.
  • the crawler if equipped with a suitable extractor, may be able to extract that metadata for the benefit of the indexer.
  • a “loading policy” is a decision procedure L(u) that returns true if URL u is deemed suitable for loading and false otherwise.
  • a "page loading policy” determines whether a crawler ignores robot excluded pages and generated pages (such as those produced by CGI scripts) and honors page loading, resource, and time limits with respect to a site or domain. Other considerations may also play a role in the formulation of L.
  • an "access function" A(" ⁇ ", u, P) returns pu if and only if it is possible to access u from " ⁇ " and P grants sufficient authority.
  • a “crawler” C is a tuple (" ⁇ ", A, P, E, G, L), where " ⁇ " is the location (IP address) of C, A is an access function, P is a set of access permissions, E is an extraction policy, G is a nonempty set of extractors, and L is a loading policy.
  • FIG. 6 is a block diagram illustrating an exemplary computer system 350 in which elements and functionality of the present invention may be implemented according to one embodiment of the present invention.
  • the present invention may be implemented using hardware, software, or a combination thereof and may be implemented in a computer system or other processing system.
  • Various software embodiments are described in terms of exemplary computer system 350. After reading this description, it will become apparent to a person having ordinary skill in the relevant art how to implement the invention using other computer systems, processing systems, or computer architectures.
  • the device 350 includes one or more processors, such as processor 352. Additional processors may be provided, such as an auxiliary processor to manage input/output, an auxiliary processor to perform floating point mathematical operations, a special-purpose microprocessor having an architecture suitable for fast execution of signal processing algorithms ("digital signal processor"), a slave processor subordinate to the main processing system (“back-end processor”), an additional microprocessor or controller for dual or multiple processor systems, or a coprocessor. It is recognized that such auxiliary processors may be discrete processors or may be integrated with the processor 352.
  • the processor 352 is connected to a communication bus 354.
  • the communication bus 354 may include a data channel for facilitating information transfer between storage and other peripheral components of the computer system 350.
  • the communication bus 354 further provides the set of signals required for communication with the processor 352, including a data bus, address bus, and control bus (not shown).
  • the communication bus 354 may include any known bus architecture according to promulgated standards, for example, industry standard architecture (ISA), extended industry standard architecture (EISA), Micro Channel Architecture (MCA), peripheral component interconnect (PCI) local bus, standards promulgated by the Institute of Electrical and Electronics Engineers (IEEE) including IEEE 488 general-purpose interface bus (GPIB), IEEE 696/S-100, and the like.
  • the device 350 also includes a main memory 356 and may also include a secondary memory 358.
  • the main memory 356 provides storage of instructions and data for programs executing on the processor 352.
  • the main memory 356 is typically semiconductor-based memory such as dynamic random access memory (DRAM) and/or static random access memory (SRAM).
  • Other semiconductor-based memory types include, for example, synchronous dynamic random access memory (SDRAM), Rambus dynamic random access memory (RDRAM), ferroelectric random access memory (FRAM), and the like, as well as read only memory (ROM).
  • the secondary memory 358 may include a hard disk drive 360 and/or a removable storage drive 362, for example a floppy disk drive, a magnetic tape drive, an optical disk drive, and the like.
  • the removable storage drive 362 may read from and write to a removable storage unit 364 in a well-known manner.
  • removable storage unit 364 may include a floppy disk, magnetic tape, optical disk, and the like that may be read from and written to by removable storage drive 362.
  • the removable storage unit 364 may include a computer usable storage medium with computer software and computer data stored thereon.
  • secondary memory 358 may include other similar components for allowing computer programs or other instructions to be loaded into the computer system 350.
  • such components may include interface 370 and removable storage unit 372.
  • secondary memory 358 may include semiconductor-based memory such as programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory (block oriented memory similar to EEPROM).
  • any other interfaces 370 and removable storage units 372 that allow software and data to be transferred from the removable storage unit 372 to the computer system 350 through interface 370.
  • the device 350 also includes a communication interface 374.
  • Communication interface 374 allows software and data to be transferred between device 350 and external devices, networks, or information sources. Examples of communication interface 374 include but are not limited to a modem, a network interface (for example an Ethernet card), a communications port, a PCMCIA slot and card, an infrared interface, and the like.
  • Communication interface 374 preferably implements industry promulgated architecture standards, such as Ethernet IEEE 802 standards, Fibre Channel, digital subscriber line (DSL), asymmetric digital subscriber line (ADSL), frame relay, asynchronous transfer mode (ATM), integrated services digital network (ISDN), personal communications services (PCS), transmission control protocol/Internet protocol (TCP/IP), serial line Internet protocol/point to point protocol (SLIP/PPP), and so on.
  • Software and data transferred via communication interface 374 may be in the form of signals 378 which may be electronic, electromagnetic, optical or other signals capable of being received by communication interface 374. These signals 378 are provided to communication interface 374 via channel 376.
  • Channel 376 carries signals 378 and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, a radio frequency (RF) link, or other communications channels.
  • Computer programming instructions also known as computer programs, software, or firmware
  • Computer programs may be stored in the main memory 356 and the secondary memory 358.
  • Computer programs may also be received via communication interface 374.
  • Such computer programs, when executed, enable the device 350 to perform the features of the present invention.
  • execution of the computer programming instructions may enable the processor 352 to perform the features and functions of the present invention. Accordingly, such computer programs represent controllers of the computer system 350.
  • a computer program product is used to refer to any medium used to provide programming instructions to the computer system 350. Examples of certain media include removable storage units 364 and 372, a hard disk installed in hard disk drive 360, and signals 378. Thus, a computer program product may be a means for providing programming instructions to the computer system 350.
  • the software may be stored in a computer program product and loaded into computer system 350 using hard disk drive 360, removable storage drive 362, interface 370, or communication interface 374.
  • the computer programming instructions when executed by the processor 352, may cause the processor 352 to perform the features and functions of the invention as described herein.
  • the invention may be implemented primarily in hardware using hardware components, such as application specific integrated circuits ("ASICs"). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons having ordinary skill in the relevant art. In yet another embodiment, the invention may be implemented using a combination of both hardware and software. It is understood that modification or reconfiguration of the device 350 by one having ordinary skill in the relevant art does not depart from the scope or the spirit of the present invention.
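The crawler tuple defined earlier in this list (location, access function A, permissions P, extraction policy E, extractors G, loading policy L) can be sketched in Python roughly as follows. This is an illustrative sketch only: the class and function names, the in-memory `make_access` stand-in for real network access, and the `/cgi-bin/` rule used as a loading policy are all assumptions, not part of the disclosure.

```python
import mimetypes
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set
from urllib.parse import urlparse


def html_links(page: str) -> List[str]:
    """Hypothetical extractor (a member of G): pull href targets from HTML."""
    return re.findall(r'href="([^"]+)"', page)


def make_access(pages: Dict[str, tuple]) -> Callable:
    """Build an in-memory access function A(location, u, P) for illustration.

    `pages` maps URL -> (required permission, page contents). The location
    argument is accepted but unused; a real A would also test whether the
    host of u is reachable from that location.
    """
    def access(location: str, url: str, permissions: Set[str]) -> Optional[str]:
        entry = pages.get(url)
        if entry is None:
            return None
        required, contents = entry
        return contents if required in permissions else None
    return access


def skip_generated(url: str) -> bool:
    """Assumed loading policy L(u): decline generated pages such as CGI output."""
    return not urlparse(url).path.startswith("/cgi-bin/")


@dataclass
class Crawler:
    location: str                                    # IP address of the crawler host
    access: Callable                                 # A
    permissions: Set[str]                            # P
    extractors: Dict[str, Callable[[str], List[str]]]  # G, keyed by media type
    loading_policy: Callable[[str], bool] = skip_generated  # L

    def media_type(self, url: str, declared: Optional[str]) -> Optional[str]:
        # Fall back to the URL suffix when no Content-Type was declared.
        return declared or mimetypes.guess_type(url)[0]

    def extraction_policy(self, url: str, declared: Optional[str]) -> bool:
        """E: true only if an extractor exists for the inferred media type."""
        return self.media_type(url, declared) in self.extractors

    def visit(self, url: str, declared_type: Optional[str] = None) -> List[str]:
        """Apply L, then A, then E, then an extractor from G; return found links."""
        if not self.loading_policy(url):
            return []
        page = self.access(self.location, url, self.permissions)
        if page is None or not self.extraction_policy(url, declared_type):
            return []
        return self.extractors[self.media_type(url, declared_type)](page)
```

Here E is written as a method so that it can consult G directly; keeping it as a separate callable, as in the tuple notation, would work equally well.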

Abstract

An indexing agent is provided on each of a plurality of devices connected to a network. The indexing agents generate indexing data including information regarding the content and location of resources on the respective devices. An index server is connected to the network, including a query engine for searching the resources on the devices. The indexing agents transfer the indexing data from the respective devices to the index server, either automatically or with a crawler from the index server. The index server compiles the indexing data from the respective devices into a searchable database. The indexing agents may translate the indexing data from device-specific formats into formats that may be interpreted by the index server. The indexing agent may also allow indexing of mobile devices connectable to the network, and may restrict indexing to authenticated search engines.

Description

SYSTEMS AND METHODS FOR INDEXING DATA IN A NETWORK ENVIRONMENT
FIELD OF THE INVENTION
The present invention relates generally to systems and methods for indexing resources of electronic devices, and more particularly to systems and methods for indexing, searching, and/or sharing data, files, and/or other resources stored on a plurality of computing devices connected to a network.
BACKGROUND
Search engines are well known tools for finding information accessible via a wide area network, such as the Internet or World Wide Web. These search engines facilitate indexing and searching for data scattered over a large number of servers or other devices connected to the network. Such large scale web search engines generally require a crawler, an indexer, and a query engine.
A crawler, as the name implies, is an application that "crawls" across the "web," following links and fetching web pages for the indexer. The crawler, similar to a browser, may contact web servers or other devices connected to the network to access the web pages available on the servers. The crawler may extract information from the web pages, e.g., words or phrases extracted from content of the web pages, metatags embedded within the web pages, such as HTML markups, inferences made from the link structure of the web pages (outgoing and/or incoming), and the like.
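As a rough illustration of the kinds of information a crawler may extract from a fetched page (words, metatags, and outgoing links), the following Python sketch uses the standard library's `html.parser`. The class name `PageExtractor` and its attribute names are assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects outgoing links, named meta tags, and words from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []   # outgoing link targets, in document order
        self.meta = {}    # meta name (lowercased) -> content
        self.words = []   # lowercased words from text nodes

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))
```

Feeding a page with `PageExtractor().feed(html_text)` fills the three collections, which could then be handed to an indexer.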
The indexer is a compute-intensive and storage-intensive system that receives the web page information from the crawler. The indexer generally constructs a comprehensive inverted index of every web page uncovered by the crawler. The query engine is an application employed by end users to search the index constructed by the indexer, e.g., to return links to candidate web pages in response to query keywords and/or other criteria (such as language, domain of origin, age, and the like) provided by the end users.
Within an enterprise, such as a business, educational institution, or other organization, a plurality of computing devices may be connected to one or more networks and/or to one another, for example, by a local area network. The number and form of computing devices connected to such networks may vary dramatically between enterprises. In addition, the devices connected to a particular network may vary widely in capacity, speed, platform, and method of network connection. For example, the devices may include corporate servers, desktop computers, laptops, personal digital assistants, embedded sensor and control networks, and the like. The domination of networks and the proliferation of such devices tends to push data storage out to the "edges" of an enterprise's network, making sharing of resources difficult.
For example, it may be desirable to locate and/or share resources, such as documents, data files, and the like, within an enterprise. Substantial amounts of the data and documents residing on these devices, however, may be inaccessible to conventional crawlers and indexing engines. For example, there may be no identifiable links that refer to the devices and/or their contents. In addition, some devices may only be intermittently connected to the network. Further, the protocol used by a crawler (such as HTTP) may not be supported by local network devices and/or the contents of the devices may be in a format unknown to a crawler and/or indexer.
Several solutions have been proposed to capture data stored on networked devices, such as shared file systems, data repositories, knowledge management systems, and centralized data archives. Despite their widespread use, these systems fail to capture many of the resources on desktop, mobile, and/or personal devices connected to a network.
For example, shared network volumes may capture only a fraction of the data on a device, and the software required to support access may be unsuitable for small, mobile devices. Data repositories frequently require explicit submission by users of the data stored on their devices, and therefore, the repository contents may not be current or comprehensive with respect to data available on many of the devices. Repository indexing may also rely solely on keywords submitted by the users of respective devices whose data is indexed in the repository, and those keywords may not effectively reflect the data actually stored on the respective devices. Knowledge management systems often rely on proprietary formats for content, restrict content to a small number of formats, and/or are specialized for a narrow domain. Thus, such systems may be ineffective for indexing and searching a broad array of information resources available on a network.
In addition, because these systems are generally centralized, they may be ill-suited for capturing data from mobile devices that are only intermittently connected to the network. Finally, while centralized data archives may be effective for retaining information of durable, lasting, and/or proven value, they may not effectively capture valuable, but short-lived, data created on desktop, portable, and/or mobile computing devices.
Accordingly, it is believed that systems and methods that more effectively index and/or facilitate searching of resources on devices connected to a network would be considered useful.
SUMMARY OF THE INVENTION
The present invention is directed generally to systems and methods for indexing resources available on electronic devices connected to a network. More particularly, the systems and methods of the present invention may facilitate an enterprise, such as a business, educational institution, or other organization, indexing, searching for, and/or sharing resources, such as documents, records, databases, media files, e-mail archives, and the like, that may be available on the enterprise's network. The resources may be stored on any device that may be connected to the network, yet may be quickly found, preferably in a substantially secure environment.
In accordance with one aspect of the present invention, a system for generating indexing data stored on a plurality of electronic devices connected to a network is provided. The indexing data may include one or more pieces of information related to a respective device, such as information intake, content, and output; hardware configuration, settings, and status; software configuration, settings, and status; system and control logs; manner, rate, pattern, and frequency of use, and the like. The devices for which indexing data may be generated may include desktop computers, laptops, mobile phones, telephones, printers, fax machines, personal digital assistants, portable digital devices, digital media players and recorders, appliances, heating, ventilation, communication, and electrical systems, sensors and actuators, automotive electronic and mechanical systems, technical, scientific, and medical instruments, machine tools, material handling, manufacturing, assembly, and delivery systems, and the like.
In a preferred embodiment, each electronic device on the network includes an "indexing agent," e.g., one or more embedded digital processors, hardware components, and/or software modules, for indexing resources on the respective device. In one embodiment, the indexing agent is a web server resident on the respective device, that may include a translator, an authentication module, a presence module, and/or a thin server. Preferably, the indexing agent generates indexing data that includes content data describing individual resources stored on the respective device, and location identifiers, such as device-specific URLs and URL links, identifying the location of the individual resources associated with the respective content data, or other Uniform Resource Identifiers ("URIs"). More preferably, the indexing agent extracts content-related information regarding the resources stored on the respective device, and stores the generated indexing data as web pages, for example, in HTML or XML format, or alternatively as text. The indexing data may be stored in memory of the respective device for subsequent use or transfer, as described further below.
In one embodiment, the indexing agent may include a translator, including one or more modules for translating device-specific information into indexing data that may be interpreted by a crawler or indexer. If the information from the device is already in a format that may be interpreted by the crawler or indexer, e.g., HTML or XML, no translation may be necessary. For resources that are not already crawler or indexer compatible, however, such as word processor documents, media files, and the like, the indexing agent may extract content information regarding the resources, and the translator may translate the information, for example, into HTML or XML, and then the indexing agent may store the translated information as web pages.
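One way to picture the translator is as a function that passes crawler-compatible formats through unchanged and wraps everything else in a small HTML page that links back to the resource. The Python sketch below is a minimal illustration under that assumption; the `CRAWLER_READY` set, the placeholder `extract_text` extractor, and the page layout are hypothetical and not taken from the disclosure.

```python
from html import escape

# Media types assumed (for this sketch) to be readable by a crawler or indexer.
CRAWLER_READY = {"text/html", "application/xml", "text/xml"}


def extract_text(media_type: str, raw: bytes) -> str:
    """Hypothetical extractor: decode text formats, otherwise describe the resource."""
    if media_type.startswith("text/"):
        return raw.decode("utf-8", errors="replace")
    return f"[{media_type} resource, {len(raw)} bytes]"


def translate(media_type: str, raw: bytes, location: str) -> str:
    """Return crawler-readable markup for one resource on the device."""
    if media_type in CRAWLER_READY:
        # Already interpretable by the crawler: no translation needed.
        return raw.decode("utf-8", errors="replace")
    body = escape(extract_text(media_type, raw))
    return (
        f'<html><body><a href="{escape(location, quote=True)}">source</a>'
        f"<pre>{body}</pre></body></html>"
    )
```

A real agent would replace `extract_text` with format-specific extractors, for example for word processor documents or media metadata.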
Each indexing agent is preferably configured as one or more modules that operate silently in the background substantially undetected by the user of the respective device as processor cycles and/or other related bandwidth become available. The indexing agent may automatically and periodically index desired portions of the device's resources such that the user of the device need not schedule or otherwise activate the indexing agent to create or update the indexing data. The periodic indexing by the indexing agent may generate a complete index of the resources on the respective device, or it may generate an updated index, i.e., only reflecting resources that have changed since a previous indexing. Each device also includes a communication interface for making the indexing data, for example, in the form of HTML web pages, available to a crawler and/or indexer. Such communication interfaces may include a modem, a network interface, such as an Ethernet card, a communications port, a PCMCIA slot and card, an infrared interface, and the like.
In accordance with another aspect of the present invention, a system is provided that includes a network, a plurality of electronic devices that are at least intermittently connected to the network, and one or more index servers. Each of the electronic devices preferably includes an indexing agent, such as that described above. The one or more index servers include a search engine that is connected to the network. In a preferred embodiment, the index server may be a centralized computer system that includes a crawler, an indexer, and/or a query engine.
The crawler is an application that may periodically contact each of the devices connected to the network, and transfer the respective indexing data generated by the indexing agent on the respective device to the indexer. Alternatively, each indexing agent may simply search the respective indexing data whenever a query is received from an authorized search engine, making a crawler and/or indexer unnecessary. In a further alternative, each indexing agent may "push" its indexing data directly to the indexer. The indexing agent may be pre-programmed or instructed to update the indexing data at a desired frequency and transfer the updated indexing data to the indexer, or the indexer may periodically poll the indexing agent of each device.
In accordance with another aspect of the present invention, "mobile" devices, i.e., devices whose communication interface may be disconnected from the network for extended periods or whose connection to the network is intermittent, may include an indexing agent that includes a presence module. The presence module may automatically register the presence of the device when it is initially docked in or otherwise connected to the network. Once connected, the indexing agent may automatically push its current set of indexing data to the indexer. Alternatively, the indexing agent may generate a new set of indexing data for the device or generate an updated set of indexing data reflecting any changes since the device's last connection, and transfer the new set to the indexer. The indexing data may be generated in a predetermined order of decreasing priority or criticality.
The presence module may also provide a presence service, transmitting a notification to all other users of the network of the appearance or connection of the respective device to the network. Alternatively, only a subset of users may "subscribe" to presence notification for a given set of devices for which the subscribers have sufficient access authority. In this manner, the index server or subscriber may be notified when a specific device of interest connects to the network or when any device connects.
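By way of illustration only, the subscription form of the presence service described above might be sketched as follows in Python. The class name `PresenceService`, the `"*"` wildcard for "any device," and the callback signature are assumptions made for this sketch and do not appear in the specification; the check of each subscriber's access authority is omitted for brevity.

```python
from collections import defaultdict


class PresenceService:
    """Hypothetical presence service: subscribers register interest in devices."""

    def __init__(self):
        # device id (or "*" for any device) -> list of subscriber callbacks
        self.subscribers = defaultdict(list)

    def subscribe(self, device_id, callback):
        self.subscribers[device_id].append(callback)

    def announce(self, device_id, address):
        # Notify subscribers of this specific device, then subscribers of any device.
        targets = self.subscribers.get(device_id, []) + self.subscribers.get("*", [])
        for callback in targets:
            callback(device_id, address)
```

In such a sketch, a crawler might subscribe to a specific personal digital assistant, while an index server subscribes with `"*"` to learn of every connection.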
The device may also include an authentication module that may provide security for the device. For example, the authentication module may authenticate the connection of the device to a given network, for example, by requiring the crawler or indexer to authoritatively identify itself to the device before the indexing agent provides access to the indexing data. This may ensure that the crawler or indexer on the connected network is authorized to access the indexing data on the device. Thus, the authentication module may substantially reduce the risk that a mobile device is connected to a foreign network, i.e., connected to a network other than the proper enterprise's network, and provides information to a non-authorized system. In a similar manner, the indexing agent may filter access to the indexing data based upon authentication circles, providing increasing levels of access to the indexing data and/or the indexed resources themselves. Thus, the indexing agent may protect the device, allowing it to be crawled only by authorized crawlers. When a device connects to the network, e.g., when a personal digital assistant is set into its network and recharging cradle, the index server may be notified and the crawler from the index server may immediately begin crawling the device, collecting pages from its resident indexing agent. Crawling may continue for as long as the device is connected to the network or until completion of transfer of the indexing data. The network may provide dynamic DNS service that allows the crawler to obtain an IP address of the device even if it is dynamically assigned and changes from one network connection to another. This IP address may be reflected by appropriately modifying or supplementing the location identifiers included in the indexing data to ensure that the index server may be able to subsequently identify the correct device having a particular resource therein. Alternatively, a more selective form of crawling may be utilized. 
For example, the crawler may wait for direct contact from the device itself, e.g., in which the device informs the crawler of exactly where in the page space of the device to begin crawling. In this manner, the indexing agent on a respective device may instruct the crawler to collect just those pages that have changed since the device was last visited by the crawler. In a further alternative, the indexing agent may inform the crawler of a set of pages to visit according to a predetermined or desired priority.
In a further alternative, the index server may archive the web pages collected by the crawler. Thus, a search engine may be used to view the web pages of a device even if the device is disconnected from the network, since the index server itself has a copy (albeit one that may be out-of-date).
In accordance with yet another aspect of the present invention, the system may facilitate indexing of devices whose communication interface is of low bandwidth or unreliable. Low bandwidth connections may present a particular challenge for crawling the contents of "bandwidth-challenged" devices. The indexing agent may adopt tactics that ameliorate the deficiencies of the connection. For example, the indexing agent and the crawler may have a transport encoding in common such that the indexing agent may compress offline the web pages that it wants crawled and indexed. The indexing agent may direct the crawler to crawl just those pages for which it has generated compressed content. In this manner, the device may make optimal use of its limited bandwidth. Alternatively, the indexing agent may break the indexing data into small, individual "mini-pages," no one of which requires a substantial amount of transmission time. This technique, in combination with compression and directed crawling, may facilitate incisive, directed "spot crawling" that allows a device to collaborate with a crawler even if the device is connected for only a brief period.
In accordance with yet another aspect of the present invention, the indexing agent may control the transfer of indexing data to ensure that personal, sensitive, or proprietary information is substantially securely transferred from the respective device to the indexer. In addition, the indexing agent and the crawler may establish a secret session key known to them alone that permits the substantially secure transmission of sensitive information from the device to the crawler.
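As a non-limiting sketch of the "mini-page" and compression tactics described above for bandwidth-challenged devices, the following Python fragment splits a page into small pieces and compresses each piece offline. The function names, the byte-count splitting rule, and the choice of zlib (DEFLATE) as the shared transport encoding are assumptions made for illustration; the specification does not name a particular encoding.

```python
import zlib


def make_mini_pages(page: bytes, max_size: int) -> list[bytes]:
    """Split a page into chunks of at most max_size bytes and compress each chunk."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return [zlib.compress(page[i:i + max_size])
            for i in range(0, len(page), max_size)]


def reassemble(mini_pages: list[bytes]) -> bytes:
    """Crawler side: decompress the mini-pages and join them in order."""
    return b"".join(zlib.decompress(p) for p in mini_pages)
```

Because each mini-page is compressed independently, any one of them may be transferred during a brief connection without depending on the others, although very small chunks may compress poorly.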
Other objects and features of the present invention will become apparent from consideration of the following description taken in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic drawing, showing a network architecture, according to the present invention.
FIG. 2 is a schematic diagram of a computing device including an indexing agent, in accordance with the present invention.
FIG. 3 is a flowchart showing a method for indexing resources on an electronic device, in accordance with the present invention.
FIG. 4 is a flowchart showing a method for searching indexing data generated by indexing agents in response to a query from a search engine, in accordance with the present invention.
FIG. 5 is a flowchart showing a method for implementing a searchable database of indexing data related to the content of resources on a plurality of devices, in accordance with the present invention.
FIG. 6 is a block diagram showing an exemplary computer system in which certain elements and functionality of the present invention may be implemented.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Turning now to the drawings, FIG. 1 is a top-level block diagram illustrating an example of a network architecture, according to an embodiment of the present invention. Electronic devices 10, 20, 30, n are each at least intermittently connected to a network 40, and an index server 50 is also connected to the network 40.
In one embodiment, the network 40 may be a local area network ("LAN"), an Intranet, and/or a wireless communications network. Alternatively, the network 40 may include a plurality of different types of networks (not shown), including, but not limited to, a LAN, an Intranet, or a wireless network. Preferably, the network 40 incorporates all of the electronic devices within an enterprise that are capable of sharing information and/or being connected to the network 40.
The index server 50 includes a search engine 52 and a database 58 of indexing data related to each of the devices 10, 20, 30, n. The search engine 52 includes an indexer 54 and a query engine 56, and optionally may also include a crawler (not shown), as described further below. Generally, the indexer 54 receives indexing data from the indexing agents on the devices 10, 20, 30, n and creates a searchable index stored in the database 58. The query engine 56 may be used to search the database 58 to identify, locate, and/or access resources related to a given query, as described further below.
The devices 10, 20, 30, n connected to the network 40 may include computing devices, such as desktop computers or other fixed workstations, and/or mobile or portable devices, such as laptops, personal digital assistants ("PDA's"), wireless access protocol ("WAP") telephones, portable digital devices, and the like. Each of the devices is generally capable of supporting and includes an indexing agent 60 resident on the device, as described further below with reference to FIG. 2. In a further alternative, other electronic devices may be included in the network, such as telephones, printers, fax machines, digital media players and recorders, appliances, heating, ventilation, communication, and electrical systems, sensors and actuators, automotive electronic and mechanical systems, technical, scientific, and medical instruments, machine tools, material handling, manufacturing, assembly, and delivery systems, and the like. These devices may also include a resident indexing agent, or, alternatively, they may instead include a server, but may be directly coupled to another device including an indexing agent that may use the server to index resources on the device.
Turning to FIG. 2, a schematic of an exemplary computing device 10 is shown that includes an indexing agent 60 configured for generating indexing data 70 regarding resources 68 of the device 10. The device 10 may include a number of modules, such as a server 62, a translator 64, an authentication module 66, and/or a presence module 67, that may be controlled and/or accessed by the indexing agent 60. The device 10 may include conventional memory (not shown) for storing the resources 68 and/or the indexing data 70, and one or more processors for performing various functions, as will be appreciated by those skilled in the art. An exemplary hardware architecture for the device is shown in FIG. 6, and described further below.
In addition, the device 10 includes a communication interface 72 for connecting the device 10 to the network and/or otherwise communicating with other devices (not shown in FIG. 2). The communication interface 72 may include a modem, a network interface, such as an Ethernet card, a communications port, a PCMCIA slot and card, an infrared interface, and the like.
In a preferred embodiment, the indexing agent 60 is a specialized, resource-conservative, embedded server, e.g., an HTTP web server. The indexing agent 60 may be relatively small compared to conventional crawlers or servers, such that it may be installed on virtually any electronic or computing device, including personal devices such as personal digital assistants (e.g., Palm Pilots), WAP phones, or embedded micro-controllers, yet it may provide all of the services needed to index local resources on the device. The term "indexing agent" is used generally herein to refer to such an embedded web server, or any combination of hardware-based components and/or software-based modules that may perform the indexing features described herein. The term "thin server" may also be used to refer to the indexing agent 60, because of its relatively small size compared to conventional servers.
The indexing agent 60 may direct the server 62 to access the device's resources 68, e.g., to serve up the device's file system, configuration data, and/or other resources, to facilitate the indexing agent 60 systematically indexing all of the resources to be indexed therein. Alternatively, the indexing agent may be capable of accessing the device's resources 68 directly, without the intervention of the server 62, and the server 62 may be eliminated.
The indexing data 70 generated by the indexing agent 60 preferably includes content data associated with respective resources 68 available on the device 10. The content data generally includes information describing individual resources of the device, preferably based upon the content of the individual resources, and/or metadata associated with the individual resources. For resources that are already compatible with a crawler or indexer, such as HTML or ASCII files, the indexing agent 60 may store the compatible information, e.g., text, metatags, and the like, as content data. For resources that are not already compatible, the indexing agent 60 may use the translator 64 to translate information extracted from a particular resource into a format that is capable of being interpreted by a crawler or indexer. For example, the translator 64 may include one or more device-specific modules for translating particular types of resources, such as application files, word processor files, spreadsheets, media files (such as image, audio and video files), databases, Portable Document Format ("PDF") files, and the like. The translator 64 may also translate device configuration data, such as hardware or software settings, into formats that may be interpreted by a crawler or indexer. In one embodiment, the translator 64 may translate device-dependent content into Web-standard formats, such as HTML or XML. Consequently, the indexing data may include information and data for which no extractors previously existed, allowing the indexing data to be crawled and extracted by a crawler of any common web search engine. In effect, the indexing agent 60 and translator 64 may act as an extractor for the benefit of a crawler.
The indexing agent 60 also generally assigns location identifiers to each piece of content data to identify the location of the individual resources associated with the respective content data. For example, the location identifiers may identify the location of the respective resources within the device's file space, and/or may identify the specific device itself. In a preferred embodiment, the location identifiers are device-specific Uniform Resource Locators ("URLs"), identifying a location where the resource may be found, or other Uniform Resource Identifiers ("URI's"), identifying a process for identifying the location, e.g., to identify a portable device that may be located at one or more locations in the network.
For "mobile" devices, i.e., devices that are only intermittently connected to the network and/or may be connected to the network at multiple nodes, the network may assign virtual location identifiers specifically for the benefit of an index server (not shown in FIG. 2). For example, the network, e.g., the index server, may provide dynamic DNS service that allows an index server to obtain an IP address of the device even if it is dynamically assigned and changes from one network connection to another. This IP address may be reflected in the location identifiers included in the indexing data 70 to ensure that an index server may be able to subsequently identify the correct device having a particular resource therein.
In addition, the presence module 67 may provide notice for such mobile devices. For example, the presence module 67 may announce to a network when the device 10 is connected to the network. In addition, the authentication module 66 may include a security protocol to ensure that the device 10 is connected to a network that is authorized to access the indexing data 70, as described further below.
The indexing agent 60 is preferably configured to operate substantially silently in the background undetected by users of the device 10, e.g., as processor cycles and disk bandwidth become available. The indexing agent 60 preferably automatically and periodically indexes the device's resources 68. The indexing agent 60 may generate a complete index of the resources 68 on the respective device, or it may generate an updated index, i.e., only reflecting resources 68 that have changed since a previous indexing.
Turning to FIG. 3, an exemplary indexing method that may be executed by an indexing agent resident on a device connectable to a network is shown, in accordance with the present invention. First, at step 110, the indexing agent may access resources on the device in order to extract content-related information from the resources. This may involve the indexing agent directing a server resident on the device to serve up the device's resources, e.g., its file system, memory or other storage devices, peripheral devices, and the like, or may involve the indexing agent accessing the resources directly. The indexing agent may also access other resources, such as software and hardware configuration settings of the device.
At step 112, the indexing agent determines whether the respective resources are already in a web-standard format, such as HTML. If not, at step 114, the indexing agent extracts content-related information from the resources, and then, at step 116, translates the information into a web-standard format, as content data. If the respective resources are already in a web-standard format, content-related information is extracted from the resources at step 118 as content data.
At step 120, the indexing agent assigns location identifiers to the content data, associating the content data with respective resources, e.g., using URLs that identify the location of the respective resources within the device's file space or other URIs. For mobile devices, the indexing agent may assign a dynamic URI to the content data that identifies the device independent of its specific connection to the network, e.g., provided by the network, as explained above. Finally, at step 122, the indexing agent stores the content data and associated location identifiers as indexing data in memory of the device. In a preferred embodiment, the indexing agent stores the content-related information as web pages, for example, using HTML or XML markups, or as text. Preferably, the indexing data is stored in a format that may be easily crawled by a conventional web crawler or interpreted by a conventional indexer, as described further below.
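For purposes of illustration only, steps 112 through 122 of FIG. 3 might be sketched in Python as shown below. The names `IndexEntry`, `is_web_standard`, `translate_to_html`, and `index_resources`, the use of file extensions to detect web-standard formats, and the `http://<device host>/<name>` form of the location identifier are all assumptions made for this sketch; resources are modeled simply as a mapping from a name to already-extracted text.

```python
import html
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class IndexEntry:
    location: str   # location identifier (here, a device-specific URL)
    page: str       # content data stored as a web page


def is_web_standard(name: str) -> bool:
    # Step 112: assume .html/.htm/.xml resources are already crawler-compatible.
    return PurePosixPath(name).suffix.lower() in {".html", ".htm", ".xml"}


def translate_to_html(name: str, text: str) -> str:
    # Steps 114-116: wrap the extracted text in minimal HTML markup.
    return (f"<html><head><title>{html.escape(name)}</title></head>"
            f"<body><p>{html.escape(text)}</p></body></html>")


def index_resources(device_host: str, resources: dict[str, str]) -> list[IndexEntry]:
    entries = []
    for name, text in sorted(resources.items()):
        # Step 118 for compatible resources, steps 114-116 otherwise.
        page = text if is_web_standard(name) else translate_to_html(name, text)
        location = f"http://{device_host}/{name}"   # step 120
        entries.append(IndexEntry(location, page))  # step 122 (kept in memory)
    return entries
```

An actual translator would include format-specific modules (for example, for word processor or media files); the single text-wrapping function here merely stands in for that stage.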
The indexing agent 60 may then make the indexing data 70 available to external devices, e.g., an index server, crawler, search engine, and the like (not shown). Turning to FIG. 4, a method is shown wherein the indexing agent may retain the indexing data on the device and respond to queries from a search engine connected to a network. For example, at step 130, the indexing agent, having generated indexing data for the device, may receive a query from a search engine, such as a request from a requestor whether any files on the device include particular keywords. The query may be sent by the search engine to all of the devices connected to the network, or only to a specific subset of devices, such as those used by participants in a particular project group. The indexing agent may then search the indexing data for content data related to the query at steps 132, 134. If no match is found, the indexing agent responds to the search engine at step 136 with a negative response, or alternatively no response at all. If one or more matches are found, the indexing agent may provide the search engine with information regarding the resource(s) whose content data matched the query criteria. The extent of information provided may depend upon the authority of the search engine and/or the requestor to access the indexing data and/or resources on the device. For example, the response may merely indicate that matches were found, e.g., identifying the device, without providing any further details. Alternatively, the response may include the URL or URI for the resource(s) that resulted in matches, possibly also including the content data that resulted in the match. In a further alternative, the response may include transferring the resource itself, e.g., to provide a copy of a file on the device that matches the query to the requestor.
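The method of FIG. 4, in which the extent of the response depends upon the authority of the requestor, might be sketched as follows. This is a minimal Python illustration under stated assumptions: the three access levels, the `Access` and `answer_query` names, and the simple "every keyword must appear" matching rule are hypothetical and are not drawn from the specification.

```python
from enum import IntEnum


class Access(IntEnum):
    PRESENCE_ONLY = 1   # may learn only that the device holds a match
    LOCATION = 2        # may also receive the location identifiers
    CONTENT = 3         # may also receive the matching content data


def answer_query(device_id, indexing_data, keywords, access):
    """indexing_data maps a location identifier to its content data (text)."""
    wanted = [k.lower() for k in keywords]
    # Steps 132, 134: search the indexing data for content matching all keywords.
    matches = {loc: text for loc, text in indexing_data.items()
               if all(k in text.lower() for k in wanted)}
    if not matches:
        return None     # step 136: negative response
    response = {"device": device_id, "matched": True}
    if access >= Access.LOCATION:
        response["locations"] = sorted(matches)
    if access >= Access.CONTENT:
        response["content"] = matches
    return response
```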
This method of serving up indexing data "on the fly" may be suitable for smaller enterprises that include only a limited number of devices. For larger enterprises, however, it may be more advantageous to have the indexing agent "push" its indexing data to a repository or centralized index. For example, returning to FIG. 1, each of the devices 10, 20, 30, n preferably includes an indexing agent (not shown), which may push the indexing data from the respective device to the index server 50. This model brings scalable, comprehensive, and speedy enterprise-wide indexing and/or searching to any electronic or computing device within an enterprise.
Turning to FIGS. 1 and 5, a method of implementing a centralized, searchable database of indexing data related to the content of resources on a plurality of devices is shown. First, at step 150, an index server 50 may receive indexing data from a plurality of devices 10, 20, 30, n connected to the network 40. Preferably, the indexing agents (not shown) on the devices 10, 20, 30, n have previously generated the indexing data, including content data describing content of resources on the respective devices, as described above. The indexing agents may transfer the respective indexing data to the index server using one of several models described further below. Once the index server 50 has received indexing data from the devices 10, 20, 30, n, the indexer 54 may compile the indexing data into a database 58 at step 152, using any known method for creating an inverted index or other searchable database. The indexing data may be received from all of the devices at one time and then compiled, or indexing data from devices may be compiled intermittently, for example, as indexing data becomes available from mobile devices. In addition, the index server may store web pages including the indexing data or otherwise retain a copy of the indexing data as stored by the indexing agents on the respective devices. This may be useful for archiving mobile devices, which may not be connected to the network when a query is submitted. The database 58 may then be used to search for resources in response to queries by requestors having access to the database 58, such as co-workers, human resources personnel, security personnel, and the like.
For example, in step 154, the query engine 56 may receive a query, e.g., including keywords or other search criteria, submitted by a requestor. The query engine 56 may access the database 58 at step 156 to search for indexing data related to the query, e.g., to identify any content data that matches the keywords or other criteria submitted by the requestor. The query engine 56 may search the entire database 58 or a subset of the database 58, as will be appreciated by those skilled in the art.
At step 158, the query engine may send a response to the requestor, indicating whether or not any matches were found. If any matches are found, the query engine may also provide additional information to the requestor, depending upon their access authority. For example, the response may include a device URL or URI or otherwise identify the device(s) that includes resources corresponding to content data satisfying the query, possibly identifying the user of the device. This level of response may be sufficient to identify the devices or users that satisfy the query without divulging the actual content of the resources, which may be sensitive, personal, or otherwise inappropriate for the requestor to access or review. Alternatively, the response may include the location identifiers of any resources satisfying the query, either with or without explicitly identifying the device itself, thereby providing access to the resource. This level of response may be appropriate for shared files, such as those that should be available to members of a common project.
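As one hedged example of steps 152 through 156, an inverted index may be compiled from indexing data as it arrives and then searched by keyword. In the Python sketch below, the class name `IndexServer`, the word-splitting rule, and the requirement that every keyword match are assumptions chosen for brevity; any known indexing method could be substituted.

```python
import re
from collections import defaultdict


class IndexServer:
    """Hypothetical indexer and query engine backed by an in-memory inverted index."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of location identifiers

    def compile(self, indexing_data):
        # Step 152: add (location identifier -> content data) pairs to the index.
        for location, text in indexing_data.items():
            for term in re.findall(r"\w+", text.lower()):
                self.postings[term].add(location)

    def query(self, keywords):
        # Steps 154-156: return locations whose content data contains every keyword.
        sets = [self.postings.get(k.lower(), set()) for k in keywords]
        return sorted(set.intersection(*sets)) if sets else []
```

Calling `compile` once per device allows indexing data to be added intermittently, for example as mobile devices connect. Filtering of the response according to the requestor's access authority (step 158) would be applied to the result of `query` and is not shown.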
Thus, placing an indexing agent on each of the devices connected to a network may permit a search engine to discover, index, and query resources resident on the devices that were previously unknown, uninventoried, and/or largely inaccessible. Using an existing search engine, an enterprise may, with no additional effort, discover and access new sources of content enterprise-wide, as explained further below.
The indexing agents and index server may transfer the indexing data between them in a variety of different ways. For example, the indexing agents may generate the indexing data only when instructed to do so by the index server or by the device's user. Preferably, the users of the devices, however, are not involved in the indexing activity of the indexing agents, i.e., the indexing agent acts autonomously in the background such that the users are unaware of and/or not substantially affected by the indexing agents' activities.
In a first preferred method, the devices automatically generate and transfer ("serve up") the indexing data with a predetermined granularity to keep the database substantially current. Thus, the indexing agents periodically generate indexing data, e.g., with substantially fixed or predetermined time periods between the generation of each set of indexing data. In one embodiment, the indexing data may be a complete set of indexing data reflecting all of the indexed resources on the device. In an alternative embodiment, the indexing data may be an updated set including indexing data only for resources whose status has changed since a previous set, e.g. new, edited, or deleted files.
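The "updated set" alternative described above, covering only new, edited, or deleted resources, might be sketched as follows. Detecting changes by comparing SHA-256 digests of resource contents is an assumption made for this Python illustration (the specification does not state how changes are detected), as are the function and parameter names.

```python
import hashlib


def changed_resources(resources, previous_digests):
    """Compare current resources against digests saved at the previous indexing.

    resources maps a resource name to its bytes; previous_digests maps a name
    to the hex digest recorded last time. Returns (changes, digests), where
    changes maps a name to 'new', 'edited', or 'deleted'.
    """
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in resources.items()}
    changes = {}
    for name, digest in digests.items():
        if name not in previous_digests:
            changes[name] = "new"
        elif previous_digests[name] != digest:
            changes[name] = "edited"
    for name in previous_digests.keys() - digests.keys():
        changes[name] = "deleted"
    return changes, digests
```

The returned `digests` would be stored by the indexing agent and passed back in as `previous_digests` at the next periodic indexing.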
Alternatively, for mobile devices, the indexing data may be generated "offline," i.e., when the devices are not connected to the network, e.g., at periodic intervals. When the mobile devices are connected to the network, their indexing agents may automatically initiate the transfer of their respective indexing data to the index server. If the device is disconnected from the network, the indexing agent may discontinue transfer, and store the location within the indexing data where the transfer was discontinued. When the device is again connected to the network, the indexing agent may resume at the location where it left off. Thus, mobile devices need not be connected to the network to allow transfer of indexing data in a single session, but transfer may be accomplished incrementally over several successive connections.
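The incremental transfer described above, in which the indexing agent stores the location where a transfer was discontinued and later resumes from it, might be sketched as follows. The class `ResumableTransfer` and the `push` and `connected` callables are hypothetical stand-ins for the communication interface and are assumptions of this Python sketch.

```python
class ResumableTransfer:
    """Hypothetical helper that pushes indexing data pages across several connections."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.next_page = 0   # stored location where the transfer was discontinued

    def send(self, push, connected):
        """Push pages while connected() is true; return True once all are sent."""
        while self.next_page < len(self.pages):
            if not connected():
                return False             # disconnected: next_page is retained
            push(self.pages[self.next_page])
            self.next_page += 1
        return True
```

Each call to `send` corresponds to one connection to the network; successive calls continue where the previous one left off, so no page is transferred twice.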
In another preferred embodiment, the index server includes a crawler, such as a conventional web crawler. The crawler is preferably an autonomous robot that systematically and/or periodically contacts each of the devices including resources to be indexed. The crawler may initiate contact successively with the indexing agents on the devices, and exchange "handshakes" to confirm connectability, identify itself, and/or to complete a security protocol confirming that the crawler has sufficient authority to access the indexing data on the respective devices. Once connection and authority are confirmed, the indexing agents may serve up their indexing data to the crawler. For mobile devices, the indexing agent may offload the task of extracting URLs and content from the crawler by assigning virtual URIs that exist specifically for the benefit of the crawler. Thus, when a crawler, indexer, or other client visits such a virtual URI (preferably using the standard HTTP protocol such that no changes to the crawler are required), the indexing agent may generate a dynamic page that summarizes the content of the original page. This may substantially reduce network transfers and load, and may improve the incisiveness of the indexing.
The technique of assigning URIs for the benefit of a crawler may be further extended to create and push device-specific content to a repository or indexer. The indexing agent may have the ability to authenticate and control access. For example, "u" may represent an URL served by an indexing agent "M," where u denotes content in a format for which no crawler extractor exists, for example, a device-specific content configuration. M may create a virtual URL "v" that, when accessed, may trigger the translation of the content of u into standard HTML. In this manner, M may perform extraction on behalf of the crawler, giving it access to formats for which no crawler extractors have previously existed.
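The relationship between the URL u and the virtual URL v described above might be sketched as follows. In this Python illustration, the class `VirtualUrlAgent` plays the role of the indexing agent M, the device-specific content behind u is modeled as a dictionary of settings, and the `.index.html` suffix used to form v is purely an assumed naming scheme.

```python
import html


class VirtualUrlAgent:
    """Hypothetical indexing agent M that serves virtual URLs for a crawler."""

    def __init__(self):
        self.raw = {}       # u -> device-specific content (a dict of settings)
        self.virtual = {}   # v -> u

    def publish(self, u, settings):
        self.raw[u] = settings
        v = u + ".index.html"   # assumed naming scheme for the virtual URL
        self.virtual[v] = u
        return v

    def get(self, url):
        # Accessing v triggers translation of the content of u into HTML.
        u = self.virtual.get(url)
        if u is None:
            return None
        rows = "".join(
            f"<li>{html.escape(str(key))}: {html.escape(str(value))}</li>"
            for key, value in sorted(self.raw[u].items()))
        return f"<html><body><ul>{rows}</ul></body></html>"
```

Because the translation runs only when v is requested, the page served to the crawler reflects the content of u at the time of the visit.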
In addition, the indexing agent may be configured as a crawler-aware and/or indexer-aware server that offers device-dependent and/or content-dependent indexes directly to the crawler. In other words, the authentication protocols and access controls of the indexing agent may allow the indexing agent to generate crawler-specific content that is optimized for the indexer of the search engine. For example, the indexing agent executing on a personal digital assistant may generate a summary of the Pilot's memo pad that may be suitable for cross-indexing with a departmental project web site.
The notions of a crawler and indexer may be generalized considerably if the device's indexing agent knows of, and cooperates with, the crawler. The architecture outlined here permits the deployment of enterprise-specific and/or domain-specific crawlers and indexers. Crawlers may be deployed within an enterprise to search for a specific form of content, for example, all content relating to a specific project. The crawler may move from device to device throughout the enterprise's network, and, with the cooperation of the indexing agents onboard each of the network's devices, may be served with just the content sought by the crawler, thereby generating relevant and incisive indices.
Several features of an indexing agent in accordance with the present invention may facilitate access by crawlers when a target device is not connected to the network as the crawler is making its rounds. First, the indexing agent may announce its presence to the network when the device connects. This event notification may be propagated to all interested subscribers including, for example, a crawler, thereby allowing the crawler to immediately visit the device and push needed content back to the indexer. Second, the indexing agent may pre-index relevant content for the search engine and, when connected, push the indexing data back to the search engine for inclusion in, and integration with, the enterprise index database, as described above. In addition, the indexing agent may "bleed" the indexing data incrementally to a search engine over the span of multiple connections to the network. This strategy may be particularly appropriate for a device that is connected for only brief periods or supports only a low bandwidth connection. Because the indexing agent on a device may act in a substantially autonomous fashion, it may index the resources on the device, and notify the index server when the device is connected, and/or has an update for the index server. In this manner, mobile and intermittently connected devices may be intelligently included in enterprise data searches. Thus, the index server may not only be able to find resources on all devices in the network, but it may also instantly know the connection status of the device(s) that contained the resource(s) pointed to by the indexing data in the database. This opens the possibility of contacting the user of a disconnected mobile device in real-time to request that the device with critical data be connected to the network as soon as possible.
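The incremental "bleed" strategy may be sketched, under simplifying assumptions, as a sender that remembers how much indexing data has already been delivered and resumes from that offset on the next connection. The class name, the chunk size, and the use of a list as a stand-in for the index server are assumptions of this sketch.

```python
class IncrementalSender:
    """Sends indexing data in chunks and remembers the offset between connections."""

    def __init__(self, data: bytes, chunk_size: int = 4):
        self.data = data
        self.chunk_size = chunk_size
        self.offset = 0  # number of bytes already delivered to the server

    def done(self) -> bool:
        """Return True once all indexing data has been delivered."""
        return self.offset >= len(self.data)

    def send_during_connection(self, server: list, max_chunks: int) -> None:
        """Send up to max_chunks chunks; max_chunks models a brief connection."""
        for _ in range(max_chunks):
            if self.done():
                break
            chunk = self.data[self.offset:self.offset + self.chunk_size]
            server.append(chunk)       # stand-in for a network write
            self.offset += len(chunk)  # resume point for the next connection
```

For example, a sender created with 15 bytes of data and a chunk size of 4 delivers 8 bytes during a two-chunk connection and the remaining 7 bytes during a later, longer connection.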
The indexing agent may also be able to filter responses for sensitive data. For example, financial, human resources, medical, personal, and/or other sensitive data may be contained appropriately, e.g., using the authentication module of the indexing agent.
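One possible form of such filtering is sketched below. The category labels, the IndexEntry record, and the notion of a requester being "cleared" for certain categories are assumptions introduced for illustration; the disclosure does not specify how the authentication module represents authority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    url: str
    summary: str
    category: str  # hypothetical label, e.g. "general", "financial", "medical"


# Categories treated as sensitive in this sketch.
SENSITIVE = {"financial", "human_resources", "medical", "personal"}


def filter_entries(entries, cleared_categories):
    """Return the entries a requester may see.

    Non-sensitive entries are always returned; sensitive entries are returned
    only if their category is in the requester's cleared_categories.
    """
    return [e for e in entries
            if e.category not in SENSITIVE or e.category in cleared_categories]
```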
Returning to FIG. 1, the index server 50 may include a single server, or it may include a plurality of servers, each sharing a database or generating independent databases, e.g., including different types of compiled indexing data. In addition, a single search engine, or a plurality of search engines may be provided.
Generally, a search engine (C, I, Q) includes three separate, but related, components: a crawler C, an indexer I, and a query engine Q. Each component may be characterized by action (what is done), locale (where it is done), and time (when it is done). In this manner, a taxonomy of search engines may be constructed that characterizes the range of variation available to search engine components with respect to action, locale, and time. Additional information on search engines may be found in S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proceedings of the Seventh International World Wide Web Conference, 1998, pp. 107-118, the disclosure of which is expressly incorporated herein by reference. Crawlers are robot applications that substantially autonomously fetch data, preferably in the form of "web pages," for submission to an indexer of a search engine. Additional information on crawlers may be found in A. Rappaport, Robots & Spiders & Crawlers: How Web and intranet search engines follow links to build indexes, Search Tools Consulting, available at www.searchtools.com, the disclosure of which is expressly incorporated herein by reference.
Given a location identifier identifying a particular web page, e.g., an URL "u," the crawler C first may decide whether or not to visit the web page designated by u. If so, the crawler may reach u if C can connect to the host of u, and C has the authority to access the page designated by u. Connectivity and access authority are two separate, but related, considerations. The first is the ability to establish a connection, e.g., a TCP connection, between the crawler and the web server and may vary with the position of the crawler within the network (for example, relative to a firewall) or the network quality of service (such as congestion or routing anomalies).
Once a connection is established, the crawler may be required to obtain permission to read the page, for example, if it is password protected. Access may also be restricted to a finite set of users or hosts whose identity may be determined, for example, by inspecting a source IP address of the packet stream associated with the host or using cryptographic methods.
Once the crawler has the page content in hand, it may extract whatever links it can for the next round of crawling. Extraction depends on the form and semantics of the content and the extractors available to the crawler. For example, all crawlers may extract links from HTML pages; however, few crawlers have the extractors required to lift links embedded within PDF or Microsoft Word documents.
Thus, a crawler is characterized by: a) the IP address of the crawler host which limits, with respect to network topology, routing, and firewalls, the remote hosts to which the crawler may connect; b) access authority; c) a loading policy that gleans URLs of value from the set at hand; d) an extraction policy that determines if the contents of a web page will yield URLs; and e) a set of extractors for extracting URLs from various forms of content.
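As an illustration of item (e), a set of extractors may be sketched as a table keyed by media type, here containing a single HTML link extractor built on the Python standard library. The table layout and function names are assumptions of this sketch; the absence of a PDF or Word entry mirrors the limitation noted above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_html_links(base_url, content):
    """Extractor for HTML: return the list of absolute URLs linked from content."""
    parser = LinkParser(base_url)
    parser.feed(content)
    return parser.links


# The crawler's set of extractors, keyed by media type (only HTML in this sketch).
EXTRACTORS = {"text/html": extract_html_links}


def extract(media_type, base_url, content):
    """Return extracted links, or None when no extractor handles media_type."""
    extractor = EXTRACTORS.get(media_type)
    return None if extractor is None else extractor(base_url, content)
```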
A formal model of a crawler that reflects the components and process outlined above may be given. The following definitions and notation shall be used, which have been adopted, with minor amendments, from P. Bailey, N. Craswell, D. Hawking, Chart of Darkness: Mapping a Large Intranet, CSIRO CMIS Technical Report, Canberra, Australia, 2000, the disclosure of which is expressly incorporated herein by reference.
Given an URL u, "pu" denotes the HTTP response for u (nominally the page contents of u). If U is a nonempty set of URLs, then $p_U = \bigcup_{u \in U} p_u$. For the sake of conciseness, pu may denote both the URL u and the page contents of u. An "extractor" is a function g(pu) such that either $g(p_u) = \bot$, meaning that g is ill-defined on pu, or $g(p_u) = S$, where S is the set (possibly empty) of all links (URLs) v extracted from pu.
An "extraction policy" E(pu) is a decision procedure that returns true if links (URLs) can be extracted from pu and false otherwise. E may inspect the URL u, the MIME type of u (contained within the HTTP response) and the page contents pu since all offer valuable hints as to the format and structure of pu. For example, if the MIME type is "html," then pu is a page whose structure is well defined (by the HTML specification) and amenable to the extraction of links. If the MIME type is unspecified (the HTTP response omitted the Content-Type header field), then the crawler may examine the syntax of the URL or the content itself to infer the media type. For example, the URL suffix .wav or .au may indicate (by common convention) an audio file that may contain links (rendered as speech) but whose extraction by machine agents is problematic at best. Some audio formats, however, may provide for the inclusion of digital metadata. The crawler, if equipped with a suitable extractor, may be able to extract that metadata for the benefit of the indexer.
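A minimal sketch of such an extraction policy is given below. It consults the MIME type first, then the URL suffix, then the content itself. The specific sets of media types and suffixes, and the 512-character sniffing window, are assumptions chosen for illustration.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

EXTRACTABLE_TYPES = {"text/html"}   # media types assumed to yield links
AUDIO_SUFFIXES = {".wav", ".au"}    # audio by common convention
HTML_SUFFIXES = {".html", ".htm"}


def extraction_policy(url: str, content_type: Optional[str], content: str) -> bool:
    """Return True if links can plausibly be extracted from the response for url."""
    if content_type is not None:
        # Content-Type may carry parameters, e.g. "text/html; charset=utf-8".
        media_type = content_type.split(";")[0].strip().lower()
        return media_type in EXTRACTABLE_TYPES
    # No Content-Type header: fall back to the syntax of the URL.
    suffix = posixpath.splitext(urlparse(url).path)[1].lower()
    if suffix in AUDIO_SUFFIXES:
        return False
    if suffix in HTML_SUFFIXES:
        return True
    # Last resort: inspect the beginning of the content itself.
    return "<html" in content[:512].lower()
```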
A "loading policy" is a decision procedure L(u) that returns true if URL u is deemed suitable for loading and false otherwise. A "page loading policy" determines whether a crawler ignores robot excluded pages and generated pages (such as those produced by CGI scripts) and honors page loading, resource, and time limits with respect to a site or domain. Other considerations may also play a role in the formulation of L.
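By way of example, a loading policy that honors robot exclusion, skips generated pages, and enforces a page limit may be sketched as follows. The user agent string, the page limit, and the heuristic for recognizing generated pages (a query string or a "/cgi-bin/" path) are assumptions of this sketch. Robot exclusion rules are parsed with the standard library's urllib.robotparser.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def make_loading_policy(robots_lines, user_agent="example-crawler", page_limit=100):
    """Build a loading policy L(u) for one site from its robots.txt lines."""
    robots = RobotFileParser()
    robots.parse(robots_lines)
    loaded = 0  # pages approved so far, counted against page_limit

    def L(url):
        nonlocal loaded
        if loaded >= page_limit:
            return False                      # honor the page loading limit
        if not robots.can_fetch(user_agent, url):
            return False                      # honor robot exclusion
        parsed = urlparse(url)
        if parsed.query or "/cgi-bin/" in parsed.path:
            return False                      # skip generated pages
        loaded += 1
        return True

    return L
```

Note that this policy is stateful: each approval counts toward the limit, so the same URL may be approved early in a crawl and refused once the limit is reached.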
Where an IP address "α", a URL u, and a set of access permissions P (including passwords, encryption keys, and access procedures) are given, an "access function" A("α", u, P) returns pu if and only if it is possible to access u from "α" and P grants sufficient authority.
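The access function may be illustrated with a toy model in which connectivity and authority are checked separately. The host names, IP addresses, passwords, and table layout below are hypothetical; None is used here to represent the case in which pu is not returned.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical network: the hosts each crawler address can connect to.
REACHABLE_HOSTS = {
    "10.0.0.5": {"intranet.example", "www.example"},   # inside the firewall
    "203.0.113.9": {"www.example"},                    # outside the firewall
}

# Hypothetical pages with an optional password guarding each one.
PAGES = {
    "http://www.example/": {"password": None, "content": "public page"},
    "http://intranet.example/hr": {"password": "s3cret", "content": "HR page"},
}


def access(alpha: str, u: str, P: set) -> Optional[str]:
    """Return the page contents of u if reachable from alpha and permitted by P."""
    page = PAGES.get(u)
    if page is None:
        return None
    if urlparse(u).netloc not in REACHABLE_HOSTS.get(alpha, set()):
        return None   # connectivity: alpha cannot connect to the host of u
    if page["password"] is not None and page["password"] not in P:
        return None   # authority: P does not grant access
    return page["content"]
```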
A "crawler" C is a tuple ("α", A, P, E, G, L), where "α" is the location (IP address) of C, A is an access function, P is a set of access permissions, E is an extraction policy, G is a nonempty set of extractors, and L is a loading policy. Given a link extraction policy E, an extractor g, and a nonempty seed set $S = \{u_0, \ldots, u_{m-1}\}$ of URLs, an URL v may be extracted from pS with respect to E and g if and only if there exists $u \in S$ such that E(pu) is true and $v \in g(p_u)$.
An URL u is reachable in one step by crawler C = ("α", A, P, E, G, L) with respect to a set of URLs S, written $S \overset{C}{\Rightarrow} u$, if $u \in S$ or:
• u can be extracted from pS with respect to E and some $g \in G$;
• L(u) is true; and
• A("α", u, P) = pu.
An URL u is reachable by crawler C with respect to a set of URLs S if $S \overset{C}{\Rightarrow}{}^{*} u$, that is, (S, u) is contained in the transitive closure of $\overset{C}{\Rightarrow}$. The definition of reachable is easily extended to page sets.
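The reachability definition may be read as a fixed-point computation over a seed set. The following sketch, offered as an illustration rather than as the formal model itself, computes the set of URLs reachable by a crawler ("α", A, P, E, G, L). Here A returns None when access fails, an extractor returns None when it is ill-defined on a page, and the toy web at the end is hypothetical.

```python
def reachable(seeds, alpha, A, P, E, G, L):
    """Return the set of URLs reachable from seeds by crawler (alpha, A, P, E, G, L)."""
    reached = set(seeds)      # every u in S is reachable by definition
    frontier = list(seeds)
    while frontier:
        u = frontier.pop()
        page = A(alpha, u, P)
        if page is None or not E(page):
            continue          # no access, or the policy says nothing can be extracted
        for g in G:
            links = g(page)
            if links is None:
                continue      # extractor g is ill-defined on this page
            for v in links:
                # One-step reachability: extracted, L(v) true, and accessible.
                if v not in reached and L(v) and A(alpha, v, P) is not None:
                    reached.add(v)
                    frontier.append(v)
    return reached


# Hypothetical toy web: each "page" is simply the list of URLs it links to.
WEB = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["a"], "e": ["a"]}
toy_A = lambda alpha, u, P: WEB.get(u)   # access always succeeds for known URLs
toy_E = lambda page: True                # every page is deemed extractable
toy_G = [lambda page: set(page)]         # a single extractor
toy_L = lambda u: u != "c"               # the loading policy excludes "c"
```

Starting from the seed set {"a"}, the toy crawler reaches "a", "b", and "d". The URL "c" is extracted but refused by the loading policy, and "e" is never linked from a reachable page.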
Given page sets pS and pT, pT is reachable from pS by crawler C if and only if $p_S \subseteq p_T$ and $\forall p_t \in p_T \; \exists p_s \in p_S$ such that t is reachable by C from s. The set of pages reachable by a crawler C is just the maximal page set in the relation $\overset{C}{\Rightarrow}{}^{*}$.
Turning to FIG. 6, a block diagram illustrates an exemplary computer system 350 in which elements and functionality of the present invention may be implemented according to one embodiment of the present invention. The present invention may be implemented using hardware, software, or a combination thereof and may be implemented in a computer system or other processing system. Various software embodiments are described in terms of exemplary computer system 350. After reading this description, it will become apparent to a person having ordinary skill in the relevant art how to implement the invention using other computer systems, processing systems, or computer architectures.
The device 350 includes one or more processors, such as processor 352. Additional processors may be provided, such as an auxiliary processor to manage input/output, an auxiliary processor to perform floating point mathematical operations, a special-purpose microprocessor having an architecture suitable for fast execution of signal processing algorithms ("digital signal processor"), a slave processor subordinate to the main processing system ("back-end processor"), an additional microprocessor or controller for dual or multiple processor systems, or a coprocessor. It is recognized that such auxiliary processors may be discrete processors or may be integrated with the processor 352. The processor 352 is connected to a communication bus 354. The communication bus 354 may include a data channel for facilitating information transfer between storage and other peripheral components of the computer system 350. The communication bus 354 further provides the set of signals required for communication with the processor 352, including a data bus, address bus, and control bus (not shown). The communication bus 354 may include any known bus architecture according to promulgated standards, for example, industry standard architecture (ISA), extended industry standard architecture (EISA), Micro Channel Architecture (MCA), peripheral component interconnect (PCI) local bus, standards promulgated by the Institute of Electrical and Electronics Engineers (IEEE) including IEEE 488 general-purpose interface bus (GPIB), IEEE 696/S-100, and the like.
Device 350 also includes a main memory 356 and may also include a secondary memory 358. The main memory 356 provides storage of instructions and data for programs executing on the processor 352. The main memory 356 is typically semiconductor-based memory such as dynamic random access memory (DRAM) and/or static random access memory (SRAM). Other semiconductor-based memory types include, for example, synchronous dynamic random access memory (SDRAM), Rambus dynamic random access memory (RDRAM), ferroelectric random access memory (FRAM), and the like, as well as read only memory (ROM).
The secondary memory 358 may include a hard disk drive 360 and/or a removable storage drive 362, for example a floppy disk drive, a magnetic tape drive, an optical disk drive, and the like. The removable storage drive 362 may read from and write to a removable storage unit 364 in a well-known manner. For example, removable storage unit 364 may include a floppy disk, magnetic tape, optical disk, and the like that may be read from and written to by removable storage drive 362. Additionally, the removable storage unit 364 may include a computer usable storage medium with computer software and computer data stored thereon.
In alternative embodiments, secondary memory 358 may include other similar components for allowing computer programs or other instructions to be loaded into the computer system 350. For example, such components may include interface 370 and removable storage unit 372. Examples of secondary memory 358 may include semiconductor-based memory such as programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory (block oriented memory similar to EEPROM). Also included are any other interfaces 370 and removable storage units 372 that allow software and data to be transferred from the removable storage unit 372 to the computer system 350 through interface 370.
The device 350 also includes a communication interface 374. Communication interface 374 allows software and data to be transferred between device 350 and external devices, networks, or information sources. Examples of communication interface 374 include but are not limited to a modem, a network interface (for example an Ethernet card), a communications port, a PCMCIA slot and card, an infrared interface, and the like.
Communication interface 374 preferably implements industry promulgated architecture standards, such as Ethernet IEEE 802 standards, Fibre Channel, digital subscriber line (DSL), asymmetric digital subscriber line (ADSL), frame relay, asynchronous transfer mode (ATM), integrated services digital network (ISDN), personal communications services (PCS), transmission control protocol/Internet protocol (TCP/IP), serial line Internet protocol/point to point protocol (SLIP/PPP), and so on. Software and data transferred via communication interface 374 may be in the form of signals 378 which may be electronic, electromagnetic, optical or other signals capable of being received by communication interface 374. These signals 378 are provided to communication interface 374 via channel 376. Channel 376 carries signals 378 and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, a radio frequency (RF) link, or other communications channels. Computer programming instructions (also known as computer programs, software, or firmware) may be stored in the main memory 356 and the secondary memory 358. Computer programs may also be received via communication interface 374. Such computer programs, when executed, enable the device 350 to perform the features of the present invention. In particular, execution of the computer programming instructions may enable the processor 352 to perform the features and functions of the present invention. Accordingly, such computer programs represent controllers of the computer system 350.
In this document, the term "computer program product" is used to refer to any medium used to provide programming instructions to the computer system 350. Examples of certain media include removable storage units 364 and 372, a hard disk installed in hard disk drive 360, and signals 378. Thus, a computer program product may be a means for providing programming instructions to the computer system 350.
In an embodiment where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 350 using hard disk drive 360, removable storage drive 362, interface 370, or communication interface 374. The computer programming instructions, when executed by the processor 352, may cause the processor 352 to perform the features and functions of the invention as described herein.
In another embodiment, the invention may be implemented primarily in hardware using hardware components, such as application specific integrated circuits ("ASICs"). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons having ordinary skill in the relevant art. In yet another embodiment, the invention may be implemented using a combination of both hardware and software. It is understood that modification or reconfiguration of the device 350 by one having ordinary skill in the relevant art does not depart from the scope or the spirit of the present invention.
While the invention is susceptible to various modifications and alternative forms, specific examples thereof have been shown in the drawings and are herein described in detail. It should be understood, however, that the invention is not to be limited to the particular forms or methods disclosed, but to the contrary, the invention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the appended claims.

Claims

WHAT IS CLAIMED IS:
1. An electronic device, comprising: memory storing a plurality of resources on the device; an indexing agent resident on the device configured for generating indexing data comprising content data and location identifiers for the plurality of resources, the indexing agent configured for extracting information regarding content of respective resources to generate the content data, the indexing agent further configured for assigning location identifiers linking the content data to respective resources; and a communication interface for connecting to a network including a remote index server, the indexing agent further configured for pushing the indexing data to the index server.
2. The device of claim 1, wherein the indexing agent is further configured for storing the indexing data in a format that may be interpreted by the index server.
3. The device of claim 2, wherein the indexing agent comprises a translator for translating the information regarding content of the resources from a device-specific format to a format that may be interpreted by the index server.
4. The device of claim 2, wherein the indexing data is stored as web pages comprising HTML or XML code.
5. The device of claim 1, wherein the location identifiers comprise URLs identifying the location of the respective resources within the device's resource space, and identifying the device itself with respect to the network.
6. The device of claim 5, wherein the device is intermittently connectable to one or more nodes of the network, and wherein the location identifiers include dynamic URLs identifying the device independent of a particular node to which the device is connected.
7. The device of claim 1, wherein the indexing agent comprises a crawler and a thin server, the thin server configured for providing an interface between the crawler and the device's resources, the crawler configured for extracting the information regarding content from files within the device's resources to generate the content data.
8. The device of claim 1, wherein the communication interface is configured for intermittently connecting the device to the network.
9. The device of claim 8, wherein the indexing agent comprises an authentication module that is configured for exchanging a security protocol with the index server to confirm that the index server is authorized to access the indexing data.
10. The device of claim 8, wherein the indexing agent is configured for automatically pushing its content data to the index server each time that the device is connected to the network.
11. A method for indexing resources on an electronic device using an indexing agent resident on the device, the device being at least intermittently connectable to a network, the network including an index server remote from the device, the method comprising: indexing the resources on the device using the indexing agent to generate indexing data; storing the indexing data on the device; and making the indexing data available to the index server.
12. The method of claim 11, wherein the indexing step comprises extracting information regarding content of the plurality of resources to generate content data, the indexing data comprising the content data.
13. The method of claim 12, wherein the indexing step further comprises assigning location identifiers to the content data linking the content data to respective resources, the indexing data further comprising the location identifiers.
14. The method of claim 12, wherein the extracting step comprises translating device-specific content into content data comprising a format that may be interpreted by the index server.
15. The method of claim 12, wherein the extracting step comprises extracting metadata related to the resources.
16. The method of claim 11, wherein the indexing agent stores the indexing data as web pages comprising HTML or XML markups.
17. The method of claim 11, wherein the device comprises a mobile device that may be intermittently connected to the network, and wherein the method further comprises connecting the device to the network before making the indexing data available to the index server.
18. The method of claim 17, wherein the indexing agent confirms that the index server has authorization to receive the content data before making the indexing data available to the index server.
19. The method of claim 17, wherein the step of making the indexing data available comprises automatically transferring at least a portion of the indexing data to the index server when the device is connected to the network.
20. The method of claim 11, wherein the step of making the indexing data available comprises transferring the indexing data to the index server when the indexing agent is polled by the index server.
21. The method of claim 11, wherein the step of making the indexing data available comprises: initiating transfer of the indexing data to the index server; disconnecting the device from the network, whereupon transfer is discontinued; and resuming transfer of the indexing data to the index server after the device is reconnected to the network.
22. The method of claim 11, wherein the step of making the indexing data available comprises: receiving a query from the index server; searching the indexing data using the indexing agent for resources related to the query; and returning a response to the index server indicating whether resources on the device are related to the query.
23. A system for indexing resources available on a network, comprising: a plurality of electronic devices at least intermittently connected to the network, each electronic device including a plurality of resources; an index server connected to the network, the index server comprising a query engine for searching the plurality of resources on the plurality of devices; and indexing agents resident on respective devices comprising the plurality of devices, the indexing agents configured for indexing resources on the respective devices to generate indexing data, the indexing agents further configured for making the indexing data available to the index server.
24. The system of claim 23, wherein the indexing agents are further configured for pushing the indexing data to the index server, the index server comprising an indexer for compiling the indexing data into a searchable database accessible by the query engine.
25. The system of claim 24, wherein the index server comprises a crawler, the crawler configured for periodically contacting the indexing agents on the respective devices to initiate transfer of the indexing data to the index server.
26. The system of claim 24, wherein the indexing agents are configured for incrementally transferring the indexing data of the respective devices to the index server over a plurality of connections of the respective devices to the network.
27. The system of claim 23, wherein the indexing agents comprise a crawler resident on each of the devices, the crawler being configured for extracting content-related information from the resources on the respective device to generate the indexing data.
28. The system of claim 23, wherein the query engine is configured for communicating a desired query to the indexing agents, the indexing agents configured for searching the indexing data on the respective devices to locate resources on the respective devices related to the desired query.
29. The system of claim 23, wherein at least some of the devices comprise mobile devices, the indexing agents on the mobile devices comprising authentication modules for confirming that the index server has authority to access the indexing data on respective mobile devices.
30. The system of claim 29, wherein the authentication modules are further configured for announcing to the network when the respective mobile devices are connected to the network.
31. The system of claim 23, wherein the indexing agents comprise one or more translators on the respective devices for translating device-dependent content into content data that is compatible with web-standard formats.
32. A method for indexing resources available on a plurality of electronic devices connectable to a network, the method comprising: indexing resources on respective devices using indexing agents resident on the respective devices to generate indexing data; storing the indexing data on the respective devices; and making the indexing data on the respective devices available to an index server comprising a query engine.
33. The method of claim 32, wherein the step of making the indexing data available comprises: transferring the indexing data from the respective devices to the index server; and compiling the indexing data from the respective devices into a searchable database stored on the index server.
34. The method of claim 33, wherein the transferring step comprises using a crawler to contact the indexing agents on the respective devices.
35. The method of claim 34, wherein the crawler resides on the index server, and wherein the crawler periodically contacts the indexing agents on the respective devices to initiate transfer of the indexing data.
36. The method of claim 33, wherein the indexing agents on the respective devices automatically and periodically transfer the indexing data to the index server.
37. The method of claim 33, wherein the devices comprise mobile devices that may be disconnected from the network, and wherein the transferring step comprises incrementally transferring the indexing data from the mobile devices over a series of connections to the network.
38. The method of claim 32, further comprising: receiving a query at each of the devices from the index server; searching the indexing data on each of the devices using the respective indexing agents for resources related to the query; and sending responses from the indexing agents indicating whether the respective devices include resources related to the query.
39. A method for indexing resources on an electronic device having an indexing agent resident thereon, the electronic device connected to an index server via a network, the method comprising: indexing resources on the electronic device using the indexing agent to generate indexing data; and transferring the indexing data to the index server, the indexing agent controlling a rate at which the indexing data is transferred.
40. The method of claim 39, wherein the indexing agent controls the rate at which the indexing data is transferred such that the transfer of indexing data does not substantially interfere with operation of the electronic device to perform other functions.
41. The method of claim 39, wherein the indexing agent monitors one or more properties of the electronic device, the indexing agent controlling the rate at which the indexing data is transferred based upon the one or more properties.
42. The method of claim 41, wherein the one or more properties comprise an energy level of a power source supplying power to the electronic device.
43. The method of claim 42, wherein the power source comprises a battery.
44. The method of claim 41, wherein the one or more properties comprise at least one of available memory capacity and processing rate of the electronic device.
45. The method of claim 41, wherein the one or more properties comprise available bandwidth of a communication interface used to transfer the indexing data to the index server.
46. The method of claim 41, wherein the indexing agent periodically monitors the one or more properties, and adjusts the rate at which the indexing data is transferred based upon the one or more properties.
47. The method of claim 39, wherein the electronic device comprises a mobile device that is only intermittently connected to the network.
48. The method of claim 47, wherein the indexing agent automatically transfers at least a portion of the indexing data to the index server each time the electronic device is connected to the network.
49. The method of claim 47, wherein the indexing agent incrementally transfers the indexing data to the index server each time the electronic device is connected to the network.
50. The method of claim 39, further comprising storing the indexing data on the electronic device, the indexing data being compressed by the indexing agent to facilitate its transfer to the index server.
PCT/US2002/036276 2001-11-14 2002-11-13 Systems and methods for indexing data in a network environment WO2003042874A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2003544636A JP2006502461A (en) 2001-11-14 2002-11-13 Electronic device, method of indexing resources on an electronic device using an indexing agent on the device, system for indexing available resources on a network, available on a number of electronic devices connectable to the network Method for indexing resources, method for indexing resources on electronic devices with resident indexing agents
EP02792247A EP1444613A2 (en) 2001-11-14 2002-11-13 Systems and methods for indexing data in a network environment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
USPCT/US01/43667 2001-11-14
PCT/US2001/043667 WO2003042871A2 (en) 2001-11-14 2001-11-14 Systems and methods for indexing data in a network environment

Publications (3)

Publication Number Publication Date
WO2003042874A2 WO2003042874A2 (en) 2003-05-22
WO2003042874A3 WO2003042874A3 (en) 2004-03-04
WO2003042874A9 WO2003042874A9 (en) 2004-04-01

Family

ID=21743003

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/US2001/043667 WO2003042871A2 (en) 2001-11-14 2001-11-14 Systems and methods for indexing data in a network environment
PCT/US2002/036276 WO2003042874A2 (en) 2001-11-14 2002-11-13 Systems and methods for indexing data in a network environment

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/US2001/043667 WO2003042871A2 (en) 2001-11-14 2001-11-14 Systems and methods for indexing data in a network environment

Country Status (3)

Country Link
EP (1) EP1444613A2 (en)
JP (1) JP2006502461A (en)
WO (2) WO2003042871A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080147617A1 (en) * 2006-12-19 2008-06-19 Yahoo! Inc. Providing system configuration information to a search engine
WO2015061085A1 (en) * 2013-10-23 2015-04-30 Microsoft Corporation Pervasive search architecture
US10713324B2 (en) 2014-06-24 2020-07-14 Google Llc Search results for native applications

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5641175B2 (en) * 2009-05-27 2014-12-17 株式会社リコー Search object management system and search object management method

Citations (3)

Publication number Priority date Publication date Assignee Title
WO2001027805A2 (en) * 1999-10-14 2001-04-19 360 Powered Corporation Index cards on network hosts for searching, rating, and ranking
WO2001046856A1 (en) * 1999-12-20 2001-06-28 Youramigo Pty Ltd An indexing system and method
EP1143349A1 (en) * 2000-04-07 2001-10-10 IconParc GmbH Method and apparatus for generating index data for search engines

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US2490079A (en) * 1944-04-18 1949-12-06 Francis L Melvill Contacting apparatus
US3083952A (en) * 1955-10-07 1963-04-02 Metal Textile Corp Capillary strand material
BE767730R (en) * 1970-11-06 1971-10-18 Fabelta Sa METHOD AND APPARATUS FOR THE CONTACT OF FLUIDS AND THE TRANSFER OF MATTER AND HEAT BETWEEN
JP2001327859A (en) * 2000-05-19 2001-11-27 Tadayoshi Nagaoka Stereoscopic netlike structure such as packing body in device performing mass transfer or the like and method for manufacturing the same
JP2003512144A (en) * 1999-10-18 2003-04-02 Manteufel, Rolf P. C. Method and apparatus for material and / or energy exchange in a washing tower
AU2001265993A1 (en) * 2000-05-18 2001-11-26 Rolf P. C. Manteufel Device for guiding the flow of a liquid used for material and/or energy exchangein a wash column

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
WO2001027805A2 (en) * 1999-10-14 2001-04-19 360 Powered Corporation Index cards on network hosts for searching, rating, and ranking
WO2001046856A1 (en) * 1999-12-20 2001-06-28 Youramigo Pty Ltd An indexing system and method
EP1143349A1 (en) * 2000-04-07 2001-10-10 IconParc GmbH Method and apparatus for generating index data for search engines

Non-Patent Citations (1)

Title
See also references of EP1444613A2 *

Cited By (6)

Publication number Priority date Publication date Assignee Title
US20080147617A1 (en) * 2006-12-19 2008-06-19 Yahoo! Inc. Providing system configuration information to a search engine
US10120936B2 (en) * 2006-12-19 2018-11-06 Excalibur Ip, Llc Providing system configuration information to a search engine
WO2015061085A1 (en) * 2013-10-23 2015-04-30 Microsoft Corporation Pervasive search architecture
US10949408B2 (en) 2013-10-23 2021-03-16 Microsoft Technology Licensing, Llc Pervasive search architecture
US11507552B2 (en) 2013-10-23 2022-11-22 Microsoft Technology Licensing, Llc Pervasive search architecture
US10713324B2 (en) 2014-06-24 2020-07-14 Google Llc Search results for native applications

Also Published As

Publication number Publication date
WO2003042874A9 (en) 2004-04-01
WO2003042874A3 (en) 2004-03-04
JP2006502461A (en) 2006-01-19
WO2003042871A3 (en) 2003-08-14
WO2003042871A2 (en) 2003-05-22
EP1444613A2 (en) 2004-08-11

Similar Documents

Publication Publication Date Title
US8024306B2 (en) Hash-based access to resources in a data processing network
US6078929A (en) Internet file system
JP3967806B2 (en) Computerized method and resource nomination mechanism for nominating a resource location
US6038603A (en) Processing customized uniform resource locators
US6012083A (en) Method and apparatus for document processing using agents to process transactions created based on document content
US20060206460A1 (en) Biasing search results
US20080086540A1 (en) Method and system for executing a normally online application in an offline mode
JP2016533594A (en) WEB PAGE ACCESS METHOD, WEB PAGE ACCESS DEVICE, ROUTER, PROGRAM, AND RECORDING MEDIUM
RU2453916C1 (en) Information resource search method using readdressing
JPH08263417A (en) Method for accessing to independent resource of computer network and network subsystem
US20100306833A1 (en) Autonomous intelligent user identity manager with context recognition capabilities
KR100714504B1 (en) System and method for searching contents in personal terminals using wired and wireless internet
EP1444613A2 (en) Systems and methods for indexing data in a network environment
WO2003073324A1 (en) Systems and methods for indexing data in a network environment
KR20020003674A (en) Data synchronization system and method thereof
JP2002342144A (en) File sharing system, program and file transferring method
WO2005114400A1 (en) Method and apparatus for supporting multiple versions of a web services protocol
JP2002202955A (en) System and method for automatically formulating response to authentication request from secured server
JPH10334002A (en) System and method for controlling remote operation by electronic mail, and storage medium storing remote operation control program
US20080104239A1 (en) Method and system of managing accounts by a network server
KR20010073873A (en) System and method for storing web surfing data on the internet
KR100879880B1 (en) Method and system for providing electronic cabinet service
EP2041660A2 (en) Conditional url for computer devices
KR20050065862A (en) A method for publishing contents stored in personal devices via web and a system thereof
Valavanis et al. MobiShare: Sharing Context-Dependent Data and Services among Mobile Devices

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR

121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 2004-01-01)
COP Corrected version of pamphlet

Free format text: PAGES 1/5-5/5, DRAWINGS, REPLACED BY NEW PAGES 1/4-4/4; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

WWE WIPO information: entry into national phase

Ref document number: 2003544636

Country of ref document: JP

WWE WIPO information: entry into national phase

Ref document number: 2002792247

Country of ref document: EP

WWP WIPO information: published in national office

Ref document number: 2002792247

Country of ref document: EP