ANALYZING INTERNET-BASED INFORMATION
Cross-Reference to Related Applications
This application claims the benefit of United States Provisional Application Serial No. 60/097029 entitled "Collecting, Combining, Analyzing, and Using Internet and Business Information" filed on August 17, 1998, which is incorporated
herein.
Background of the Invention
This application relates to analyzing Internet-based information. World-Wide Web ("Web") protocols call for text-based addressing information, which is highly suitable for human users, to be converted to number-based addressing information, which is highly suitable for computers. Much of the information available on the Web is organized into Web pages that can be retrieved and displayed by Web browser software ("browser") under the direction of a user. Each of the Web pages is identifiable by a respective Uniform Resource
Locator text string ("URL"), such as "http://www.isp321.com/frontpage.html", that the browser can use to select the page. Each URL includes a domain name,
such as "isp321.com", that identifies the Web site where the corresponding Web
page is stored for retrieval by browser software. Each domain name is registered
by an entity that controls the corresponding Web site and Web pages. A domain name registry organization maintains the domain name registration information,
which may include name, address, and other information that allows the organization to bill the entity for payment for the maintenance. (It is to be understood that the term "registry", as used herein, also refers to a domain name registrar or any other entity that may provide assistance in registering a domain name.) When the domain name is registered, the entity identifies a domain name server computer system that stores a numeric address (known as an IP address) that corresponds to the domain name, and the domain name registry stores the identity of the domain name server computer system together with the domain name in a file known as a zone file. The domain name registry also reports the domain name together with the identity of the domain name server computer system to a root zone server computer system, which is a high level computer system that is responsible for helping other computer systems properly derive IP addresses from domain names (e.g., as described below). The root zone server computer system receives such reports from effectively all domain name registries as domain names are registered, and therefore the root zone server computer system has a comprehensive list of domain names that are registered on the Web. When the Web browser is directed to retrieve information from a Web site identified by a URL, the browser must determine the IP address of the Web site to which the URL refers. If the browser submits the domain name part of the URL to the root zone server computer system, the root zone server computer system
determines, based on information previously supplied by the domain name registry, the identity of the domain name server computer system for the domain name and reports the identity to the browser. The browser then refers to the domain name server computer system to find out the IP address, and uses the IP address to attempt to contact the Web site. If the Web site is functioning, the browser receives information including a Web page formatting header ("HTML header") from the Web site. If the Web site is not functioning, the browser receives no response from the Web site, and times out indicating an error.
An Internet service provider ("ISP") is an example of an entity that may have a registered domain name for a Web site. Typically, an ISP has customers such as individuals or businesses for whom the ISP stores Web pages on the Web site for retrieval by Web browser software. For example, the ISP may have a customer Maple Street Plumbing for which the ISP stores a home Web page having a URL that includes a prefix "http://www.isp321.com/~maplestplumb". A home Web page is typically the only or the primary entry point into a Web site or a set of Web pages that are under the control of an entity.
Another example of an entity that may have a registered domain name is a Web portal site such as "Yahoo.com" that maintains, in pages organized by categories, links to Web sites and home pages that are under the control of other entities. Typically, a Web portal site allows another entity to create a link from
the Web portal site to the other entity's Web site or home page by submitting information to the Web portal site.
A Web search engine site ("search engine") maintains and updates a search engine database, i.e., a Web page record database, that includes a Web page record for every Web page that has been turned up by Web sweeping software that sweeps the World Wide Web for any and all Web pages. A typical Web page record includes a URL for the respective Web page, an excerpt or other subset of the information provided by the Web page, and a date indicating the most recent update of the Web page record. When a user directs a Web search engine to execute a search, the Web page record database is searched and then search engine results are displayed to the user in the form of a list of Web page records.
The Web sweeping software discovers information on the Web, including domain names previously unknown to the search engine, by following links among Web pages. Some information about an entity may not be available on a Web site that is under the control of the entity. For example, public financial information about a company may be stored in a database that is not linked to the company's Web site or is not directly accessible by Web browser software, such as a database under the control of a financial services firm.
In general, statistical information regarding Web activity for companies or other entities in a particular sector of human activity, such as an industrial sector, is expressed in broad terms such as the total number of uniquely qualified domain name Web sites ("unique Web sites") that are sponsored by all of the companies in a particular industrial sector. Such statistics may prove misleading for at least some purposes. For example, if ten companies belong to a sector that is known to have ten unique Web sites, the resulting average, i.e., one unique Web site per company, can make the sector appear to be well represented by unique Web sites, even if in actuality all ten unique Web sites belong to only one of the companies and none of the other nine companies has a unique Web site.
Summary of the Invention
Methods and systems are provided for analyzing Internet-based information. An Internet analysis system is provided that gathers domain names and determines whether the domain names are associated with functioning Web sites. A variation of the Internet analysis system that includes an entity information database and a mapping database is able to generate reports regarding Web activity in sectors of human activity such as industrial sectors. Different aspects of the invention allow one or more of the following. A comprehensive list of tested domain names can be produced. Domain names for
Web sites that are difficult or impossible for a search engine to discover can be made available to the search engine to allow the search engine to produce search
results that account for the contents of the previously undiscovered Web sites. Before being provided to the search engine, the domain names may be prioritized or sorted according to one or more attributes (such as industry sector or company size) of the respective entities that are registered as having control over the domain names. Highly useful statistics can be produced concerning the number of entities in an industrial sector that are registered as having control over Web sites. Such statistics can be used for highly effective marketing or sales approaches in which Web oriented products are targeted at potential customers in industrial sectors that are shown by the statistics to have substantial Web activity.
Other features and advantages will become apparent from the following description, including the drawings, and from the claims.
Brief Description of the Drawings
Figs. 1, 6, and 7 are block diagrams of computer-based systems.
Figs. 2, 3, 4, 5, and 10 are flow diagrams of computer-based procedures.
Figs. 8 and 11 are illustrations of output produced by software.
Figs. 9A-9B are illustrations of computer data.
Detailed Description
Fig. 1 illustrates an Internet analysis system 110 in which a domain names analysis application 112 executes a procedure 1000 (Fig. 2) to collect domain names 114 from a domain name source 116 (step 1010), test the domain names to determine which of the domain names are domain names that correspond to functioning Web sites ("live domain names" 115) (step 1020), and deliver live domain names to a search engine 116 for use in searching the Web (step 1030).
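The three steps of procedure 1000 can be pictured as a small pipeline. The following is a minimal sketch only, not the application's actual code: the function name "run_procedure_1000" and its three callable parameters are hypothetical and stand in for whatever collection, testing, and delivery mechanisms a given embodiment uses.

```python
from typing import Callable, Iterable, List


def run_procedure_1000(
    collect: Callable[[], Iterable[str]],
    is_live: Callable[[str], bool],
    deliver: Callable[[List[str]], None],
) -> List[str]:
    """Collect domain names, keep the live ones, and deliver them.

    All three callables are assumptions made for illustration; the
    description does not prescribe these interfaces.
    """
    # Step 1010: collect domain names from the domain name source.
    domain_names = list(collect())
    # Step 1020: keep only names that correspond to functioning Web sites.
    live_domain_names = [name for name in domain_names if is_live(name)]
    # Step 1030: deliver the live domain names to the search engine.
    deliver(live_domain_names)
    return live_domain_names
```

Passing the steps in as callables keeps the sketch independent of any particular registry, root zone server, or search engine interface.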
The domain names analysis application collects domain names as follows. The domain name source may include a domain name registry or a root zone server or both. To collect domain names from the domain name registry, the domain names analysis application executes a procedure 2000 (Fig. 3) to submit a request to the domain name registry for a zone file (step 2010), download the requested zone file (step 2020), and extract domain names from the requested zone file (step 2030). In a specific embodiment, the zone file is downloaded by use of a binary transfer procedure known as an FTP transfer. Fig. 11 illustrates an example of a portion of a zone file and extracted domain names. The example is divided into sections. In a typical section, as shown in the example, the first line includes the domain name (in the first column) and its corresponding domain name server (in the last column), and the next line lists the domain name server (in the first column) and its actual IP address (in the last column). If the domain
name has more than one domain name server, the section may include additional lines, each including name and IP address information for another domain name
server. After the zone file is downloaded, the domain names are extracted and duplicate domain names are removed.
To collect domain names from the root zone server, the domain names analysis application executes a procedure 3000 (Fig. 4) to request domain name information record by record from the root zone server (step 3010) and extract the domain names from the domain name information (step 3020). In a specific embodiment, domain names are collected from a root zone server as follows. First, the root zone server (e.g., F.Root-Servers.net) is selected and data from the root zone server is directed into a file; the following is a sequence in which the F root server is selected and the F root server is directed to unload all data that ends in "com" into a file called "com.txt".
> nslookup
> server f.root-servers.net
> ls com > com.txt
Next, a Perl program is executed to extract the domain names from the file in accordance with the principles described above in connection with the zone file
example.
Finally, the process, with appropriate changes, is repeated as necessary to collect other domain names. Since different root zone servers are responsible for different domain name extensions (e.g., "com", "net", "edu", "ca", "uk"), collecting a comprehensive list of domain names requires gathering domain name information from multiple root zone servers. In a specific embodiment, other root zone servers are identified by use of a "whois" command. For example, to identify a root zone server that is responsible for "ca", which is the domain name extension for Canada, the following command line is used.
> whois ca-dom
The response generated in this case is "relay.cdnnet.ca". Domain names are gathered from the "relay.cdnnet.ca" server by using a variation of the process described above, which variation uses "relay.cdnnet.ca" in place of "f.root-servers.net".
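The extraction step described above (domain name in the first column, duplicates removed) might be sketched as follows. The description uses a Perl program; this Python sketch is a stand-in. It assumes, purely for illustration, that the listing is whitespace-separated and that lines naming a domain name server carry an "NS" record type between the first and last columns; the exact layout of Fig. 11 is not reproduced here, and the function name "extract_domain_names" is hypothetical.

```python
from typing import List


def extract_domain_names(zone_text: str) -> List[str]:
    """Return unique domain names from a zone-file-like listing.

    Assumed layout (not taken from Fig. 11): whitespace-separated
    columns, with an "NS" record type somewhere between the first
    column (domain name) and the last column (domain name server).
    """
    seen = set()
    names = []
    for line in zone_text.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and comment lines.
        if len(fields) < 3 or line.lstrip().startswith(";"):
            continue
        # Keep only lines that name a domain name server for a domain.
        if "NS" not in (f.upper() for f in fields[1:-1]):
            continue
        name = fields[0].rstrip(".").lower()
        # Remove duplicates while preserving first-seen order.
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names
```

Lines that give a server's IP address (for example, an "A" record) are skipped, so only the registered domain names remain.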
To test a domain name, the domain names analysis application executes a procedure 4000 (Fig. 5) to attempt to acquire the IP address associated with the domain name (step 4010) and, if the IP address is acquired, to attempt, by a request known as an HTTP protocol query, to retrieve an HTML header from a server having the IP address (step 4020). In a specific implementation, a prefix "www." is added to the domain name to form a URL, and the URL is handled much as a typical URL is handled by Web browser software. For example, a
protocol known as telnet is used to attempt to connect to a Web server and retrieve an HTML header. The following command lines illustrate an example for
the case of the "uspto.gov" domain name:
> telnet www.uspto.gov 80
> dump
If both attempts 4010, 4020 succeed, the domain name is determined to be a live domain name (step 4030). If either attempt fails, the domain name is determined not to be a live domain name (step 4040). In the case of the attempt to retrieve the HTML header, failure takes the form of a timing out, because the domain names analysis application fails to receive any response. If the Web site returns an error page, it does not qualify as a failure, because the error page includes the HTML header. In such a case, the Web site is functioning, but its
contents may be corrupted or may be blocked by security arrangements.
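The two attempts of procedure 4000 could be approximated as shown below. This is a sketch under stated assumptions, not the described telnet-based implementation: it uses Python's standard "socket" and "http.client" modules, issues an HTTP HEAD request in place of a telnet session, and treats any HTTP response (including an error page) as evidence of a functioning Web site. The names "resolve_ip", "fetch_header", and "is_live" are hypothetical.

```python
import http.client
import socket
from typing import Callable, Optional


def resolve_ip(host: str) -> Optional[str]:
    """Step 4010: try to acquire the IP address for a host name."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def fetch_header(ip: str, host: str, timeout: float = 10.0) -> Optional[int]:
    """Step 4020: try to get a response header from the server at ip.

    Returns the HTTP status code, or None on a timeout or other failure.
    """
    conn = http.client.HTTPConnection(ip, 80, timeout=timeout)
    try:
        conn.request("HEAD", "/", headers={"Host": host})
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()


def is_live(
    domain: str,
    resolve: Callable[[str], Optional[str]] = resolve_ip,
    fetch: Callable[[str, str], Optional[int]] = fetch_header,
) -> bool:
    """Steps 4030/4040: live only if both attempts succeed."""
    host = "www." + domain  # prefix added as in the specific implementation
    ip = resolve(host)
    if ip is None:
        return False
    # Any response counts, even an error status; only silence is a failure.
    return fetch(ip, host) is not None
```

The resolver and fetcher are parameters so that the decision logic can be exercised without network access.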
The live domain names are delivered to the search engine as a list that is added to the search engine's list of domain names to be searched for content to be recorded in the search engine's index. In a variation, all or some of the domain
names are delivered after being sorted in accordance with a prioritization scheme that takes into account information, gathered from mapping and entity
information databases (described below), pertaining to respective entities that are
registered as having control over the domain names. For example, the only
domain names delivered may be domain names registered as being under the control of telecommunications companies or companies in a particular city.
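One way to picture the prioritization and filtering just described is sketched below. The data layout is an assumption made for illustration: "mapping" is a dictionary from domain name to unique ID, "entity_info" is a dictionary from unique ID to attributes, and the "sector" attribute name and the function names are hypothetical.

```python
from typing import Dict, List


def sector_of(domain: str, mapping: Dict[str, str],
              entity_info: Dict[str, dict]) -> str:
    """Look up the sector of the entity registered for a domain name."""
    unique_id = mapping.get(domain)
    return entity_info.get(unique_id, {}).get("sector", "")


def prioritize(live_domains: List[str], mapping: Dict[str, str],
               entity_info: Dict[str, dict], preferred: str) -> List[str]:
    """List domain names in the preferred sector first, then the rest."""
    return sorted(
        live_domains,
        key=lambda d: (sector_of(d, mapping, entity_info) != preferred, d),
    )


def only_sector(live_domains: List[str], mapping: Dict[str, str],
                entity_info: Dict[str, dict], sector: str) -> List[str]:
    """Deliver only domain names registered to entities in one sector."""
    return [d for d in live_domains
            if sector_of(d, mapping, entity_info) == sector]
```

Domain names with no entry in the mapping are kept by "prioritize" (sorted last) and dropped by "only_sector"; either behavior could be changed to suit a particular search engine.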
Fig. 6 illustrates a variation 200 of the Internet analysis system that allows a user to produce research reports regarding Web activity in sectors of human activity. System 200 includes a mapping database 12 (Fig. 7) that maps URLs or domain names 14 to entities 16 such as people, businesses, or government agencies, as described in more detail below. For example, the mapping database may indicate that any URL that begins with "http://www.uspto.gov" is for a Web page controlled by the U.S. Patent and Trademark Office, or that domain names "elmstdogs.com" and "elmstcats.com" are under the control of a company named Elm Street Pets, Inc. System 200 also includes a Web activity analysis application 202 (described below) and an entity information database 28 (described below) that includes information such as geographic information about entities to which URLs or domain names are mapped in the mapping database. The mapping database may use a unique identification number ("unique
ID"), such as a 9-digit American Business Information ("ABI") number or a DUNNS number, to identify an entity so that other information about the entity can be retrieved from the entity information database or elsewhere by searching under the unique ID. (ABI numbers are sponsored by infoUSA.) For example, unique IDs from the mapping database may be used to search the entity
information database to produce a subset of the mapping database that has records only for entities having a particular characteristic, such as a particular geographic location or between 1000 and 5000 employees.
With respect to the mapping database, where an entity constitutes a portion of another entity, each of the entities may be assigned different unique IDs, and the different unique IDs may be linked in the mapping database to note the relationship among the entities. For example, a company that has offices in different locations may be assigned a unique ID for the company itself and a respective different unique ID for each location. In another example, when two previously unrelated companies merge or one is acquired by the other, each may
retain its unique ID and a new, different unique ID may be assigned to the combination of the two companies, or both companies may be assigned the same
unique ID.
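The linking of unique IDs for related entities could be represented as in the sketch below. The record layout, the "parent_id" link, and the function name "domains_under" are assumptions for illustration; the unique IDs shown in any usage are made up and are not real ABI numbers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EntityRecord:
    unique_id: str
    name: str
    parent_id: Optional[str] = None  # link to the entity this one is part of
    domain_names: List[str] = field(default_factory=list)


def domains_under(records: Dict[str, EntityRecord], root_id: str) -> List[str]:
    """Collect domain names for an entity and all entities linked beneath it."""
    found: List[str] = []
    visited = set()
    pending = [root_id]
    while pending:
        current = pending.pop()
        # Skip unknown IDs and guard against accidental cycles in the links.
        if current in visited or current not in records:
            continue
        visited.add(current)
        found.extend(records[current].domain_names)
        pending.extend(
            r.unique_id for r in records.values() if r.parent_id == current
        )
    return sorted(found)
```

With a company record and a separate record for one of its locations, a query on the company's unique ID returns the domain names of both, while a query on the location's unique ID returns only its own.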
Information in the mapping database may be derived from information submitted by or on behalf of the entity when a domain name is registered. For example, when the company Elm Street Pets, Inc. registers the domain names
"elmstdogs.com" and "elmstcats.com" with a domain name registry, the company associates the domain names with at least enough information, such as name,
address, and telephone number information, to allow the domain name registry to
bill the company for maintenance of the registration.
The entity may submit information to the mapping database in other ways such as in an on-line questionnaire that feeds the mapping database.
Information in the mapping database may be derived from information provided by an intermediary such as an ISP or an Internet portal. For example, an
ISP having a domain name "isp321.com" may have a customer Maple Street Plumbing for which the ISP hosts and administers a home page having a home page address "www.isp321.com/~maplestplumb". In such a case, the ISP may have name, address, and telephone number information for the purpose of billing Maple Street Plumbing for such hosting and administration, and may allow such information along with the home page address to be used to link the home page address to Maple Street Plumbing in the mapping database.
In another example, an Internet portal may allow an entity such as Maple
Street Plumbing to create an entry or listing named "Maple Street Plumbing" in a "plumbing" section of an on-line directory maintained by the portal, to allow a user
to view home page "www.isp321.com/~maplestplumb" by selecting the entry. In such a case, the Internet portal may allow information in the entry, and perhaps
any address and telephone number information submitted by the entity during creation of the entry, to be used to link the home page to Maple Street Plumbing
in the mapping database.
The mapping database, the entity information database, and the Web analysis application allow a report such as report 204 (Fig. 8) to be generated that shows, in absolute numbers and as a percentage, how many entities in an industrial sector are registered as controlling one or more Web sites. The entities included in the report may also or instead be limited by geographical area or by any other attribute stored for entities in the entity information database (Figs. 9A-9B illustrate a list of such attributes). To generate the report, the Web analysis application executes a procedure 5000 (Fig. 10) to search the entity information database for entities that match an industrial sector code such as an SIC code or a North American Industry Classification System ("NAICS") code ("sector entities") (step 5010), determine from the mapping database which of the sector entities are registered as controlling one or more Web sites ("Web entities") (step 5020), and account for each of the sector entities and Web entities in the report (step 5030), such as by presenting quantities for sector entities and Web entities and indicating
the number of Web entities as a percentage of the sector entities.
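The counting performed by procedure 5000 can be sketched as follows. The data layout is assumed for illustration only: "entity_info" maps unique IDs to attributes with a hypothetical "sector_code" field, and "mapping" maps domain names to unique IDs. The function name "sector_report" is likewise hypothetical.

```python
from typing import Dict


def sector_report(entity_info: Dict[str, dict], mapping: Dict[str, str],
                  sector_code: str) -> dict:
    """Count sector entities and Web entities and compute the percentage."""
    # Step 5010: entities whose sector code matches.
    sector_entities = {
        uid for uid, attrs in entity_info.items()
        if attrs.get("sector_code") == sector_code
    }
    # Step 5020: sector entities registered as controlling at least one site.
    web_entities = sector_entities & set(mapping.values())
    # Step 5030: quantities and the percentage of sector entities on the Web.
    percent = (100.0 * len(web_entities) / len(sector_entities)
               if sector_entities else 0.0)
    return {
        "sector_entities": len(sector_entities),
        "web_entities": len(web_entities),
        "percent": percent,
    }
```

Because entities rather than Web sites are counted, a sector of ten companies in which one company holds all ten unique Web sites is reported as 1 Web entity out of 10 (10 percent), which avoids the misleading one-site-per-company average discussed in the Background.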
Other reports, such as time based reports, can also be generated by the Web analysis application. For example, the percentage of sector entities that are
Web entities can be tracked over time to demonstrate the growth in the number or
percentage of entities that are registered as controlling one or more Web sites
("online penetration"). By limiting the entities in the report by entity size (e.g.,
number of employees) or other attribute (e.g., obtained from the entity information database), the report can demonstrate other aspects of Web activity,
such as the difference in online penetration among large, medium, and small companies, or which industrial sectors have the most or least online penetration.
The mapping database and applications based on the mapping database may take advantage of a hierarchical organization of Web pages, by treating similarly a mapped page and all pages below the mapped page, such as pages sharing a particular prefix with the mapped page. For example, all pages sharing
the prefix "http://www.isp321.com" may be treated as being under the control of an ISP named Global ISP Co.
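A prefix-based lookup of this kind might be sketched as below. Two choices here are assumptions not stated in the description: when several mapped prefixes match, the longest one wins, and a prefix matches only at a path boundary (so that a prefix ending in ".com" does not match a longer, unrelated host name). The function name "entity_for_url" is hypothetical.

```python
from typing import Dict, Optional


def entity_for_url(url: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Return the entity mapped to the longest prefix covering the URL."""
    best_prefix = None
    for prefix in prefix_map:
        if not url.startswith(prefix):
            continue
        rest = url[len(prefix):]
        # Require the match to end at a boundary in the URL (assumption).
        if rest and rest[0] not in "/?#" and not prefix.endswith("/"):
            continue
        if best_prefix is None or len(prefix) > len(best_prefix):
            best_prefix = prefix
    return prefix_map[best_prefix] if best_prefix is not None else None
```

Under these assumptions a customer's home page prefix on an ISP's site resolves to the customer, while other pages on the same site resolve to the ISP.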
The mapping database may map an entity to Web pages maintained at different Web sites. For example, Maple Street Plumbing may have a first set of
Web pages at the Global ISP Co. site and a second set of Web pages at another ISP's site.
The entity information database may include a database such as EDGAR that includes information about companies.
Information in the mapping database or the entity information database
may allow searches to be limited by relative size of entities, such as size in an
industry.
One or more of the databases referenced above may be or include a relational database and may have records to which fields may be added readily.
Any of many different types of computer equipment may be used. For example, one or more Intel-based personal computers may be used that run an SQL database on Linux and that run programs written in Perl or the C programming language with interfaces to the SQL database.
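As a rough illustration of a relational layout to which fields can be added readily, the sketch below uses Python's standard "sqlite3" module as a stand-in for whatever SQL database an embodiment uses. The table names, column names, unique ID, and employee count are all made up for illustration.

```python
import sqlite3
from typing import List


def build_example_database() -> sqlite3.Connection:
    """Create in-memory mapping and entity tables, then add a field."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entity (unique_id TEXT PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE mapping (domain_name TEXT PRIMARY KEY, "
        "unique_id TEXT NOT NULL REFERENCES entity(unique_id))")
    conn.execute("INSERT INTO entity VALUES (?, ?)",
                 ("100000001", "Elm Street Pets, Inc."))
    conn.executemany("INSERT INTO mapping VALUES (?, ?)",
                     [("elmstdogs.com", "100000001"),
                      ("elmstcats.com", "100000001")])
    # A new field is added to records that already exist.
    conn.execute("ALTER TABLE entity ADD COLUMN employees INTEGER")
    conn.execute("UPDATE entity SET employees = ? WHERE unique_id = ?",
                 (1200, "100000001"))  # illustrative value only
    conn.commit()
    return conn


def domains_by_size(conn: sqlite3.Connection, low: int, high: int) -> List[str]:
    """Domain names of entities whose employee count lies in [low, high]."""
    rows = conn.execute(
        "SELECT m.domain_name FROM mapping AS m "
        "JOIN entity AS e ON e.unique_id = m.unique_id "
        "WHERE e.employees BETWEEN ? AND ? ORDER BY m.domain_name",
        (low, high))
    return [row[0] for row in rows]
```

Joining the two tables on the unique ID corresponds to using unique IDs from the mapping database to search the entity information database, for example for entities having between 1000 and 5000 employees.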
The technique (i.e., the procedures described above) may be implemented in hardware or software, or a combination of both. In at least some cases, it is advantageous if the technique is implemented in computer programs executing on one or more programmable computers, such as a personal computer running or
able to run an operating system such as Unix, Linux, Microsoft Windows 95, 98, or NT, or Macintosh OS, that each include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device such as a keyboard, and at least one output
device. Program code is applied to data entered using the input device to perform the technique described above and to generate output information. The
output information is applied to one or more output devices such as a display screen of the computer.
In at least some cases, it is advantageous if each program is implemented in
a high level procedural or object-oriented programming language such as Perl, C,
C++, or Java to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language.
In at least some cases, it is advantageous if each such computer program is stored on a storage medium or device, such as ROM or optical or magnetic disc, that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described in this document. The system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so
configured causes a computer to operate in a specific and predefined manner.
Other embodiments are within the scope of the following claims. For example, the user may be a human being or a non-human entity such as a computer program or an automated device that may interact with one or more of
the databases or one or more of the applications via an application programming interface ("API") or a network message. An on-line information store or multiple
databases may serve as the entity information database, which may take the form of any mechanism that provides automated access to information, such as a
spreadsheet file or a store of email messages. System 110 may also refer to the
mapping and entity information databases before reporting the live domain names
to the search engine. For example, by referring to the mapping and entity information databases, system 110 can retrieve entity information relating to the live domain names, and can sort the live domain names according to the entity information, such as by listing first the live domain names that pertain to an industry that is indicated as being particularly relevant to the search engine or users of the search engine.