Monitoring of Backup Activity on a Computer System
The present invention is concerned with monitoring of backup activity on a computer system.
Storage manager software (abbreviated below to "SM") will be used to refer to backup products that can be installed on a computer or server (referred to hereafter as a node or nodes) which may be running any one of several operating systems (for demonstration purposes, nodes running Wintel and UNIX operating systems will be further referenced). Once the SM software has been installed on a node, it is referred to as a client. The client is configured to perform a backup of data located on the node using an incremental backup method. The client is registered with a specific SM server (referred to hereafter as the storage manager server). Communication between the client and the storage manager server is performed over a Local Area Network (LAN) or a Wide Area Network (WAN) using a system protocol such as Internet Protocol (IP). The client can be used to manually perform a backup, however in most cases a schedule is used to schedule a backup to start on the node at a particular time. An existing SM of this type is sold by International Business Machines under the name Tivoli Storage Manager.
There are various methods and tools that enable a job to be scheduled; however the most common method is for the storage manager server to send the schedule to the client. Once a scheduled backup has been started, any data to be backed up is backed up to the storage manager server over the network. The storage manager server subsequently saves the data to storage media that can be accessed in the event that the data has to be restored.
Detailed results, detailing the activity of a backup job performed by the client, are entered into log files located on the node (and will be referred to hereafter as client information). Only a subset of this information is reported to the storage manager server, typically in the form of return codes.
Under certain conditions found on a node, or due to the way in which the storage manager software functions, it can be that a backup job reported as "successful" in the client information, and subsequently communicated as such to the storage manager server, is actually not successful. This means it is possible that not all data that should potentially have been backed up, was backed up. In some cases it is only possible to detect, and thus prevent the recurrence of, such a critical event by reviewing the client information.
Due to the nature of computer systems and their security requirements, only a restricted number of administrators are able to access and review the client information for potentially critical errors. Due to the heavy workload and time constraints of these administrators, it can be that this task is not performed. In most cases the information reported to the storage manager server is relied upon to categorize the backup status of a backup. The reason for this is that storage manager server information is far easier to access than client information, as the status of all clients connected to the server can be viewed from one central location.
It is not sufficient to rely on storage manager server information in order to be certain that a backup was successful. It is also not always sufficient to review just the latest information detailed in the client log files that relate to the latest backup, as this is only point in time information. The data and values from the previous backup also need to be considered in order to be certain that the backup was successful.
Furthermore, if a backup job is reported as failed, only the client contains enough detail to ascertain the cause. In order to access this information and discover the root cause of the problem, one must in any event connect to the respective node and review the client information. This is a time consuming activity especially in an environment where hundreds of backups are performed every day.
In accordance with a first aspect of the present invention, there is a computer system comprising a storage manager server and multiple clients, the storage manager server being adapted to receive data objects to be backed up from the clients and to store them in at least one mass storage device, and the clients being adapted to create client information comprising a log of status of backup jobs carried out by the client, the system being characterized in that it further comprises a home server adapted to receive client information from the client, to parse the client information, and so to provide a user with backup information relating to the success, failure or other status of backup operations on the computer system.
The invention can thus provide a means for monitoring the backup activity which is distinct from the monitoring activity of the SM itself and which utilizes the client information, thus providing potentially more detailed and reliable information than, for example, return code-based facilities of the SM. True failure information, normally found only on the client, is made available, thus reducing the time and effort normally required to connect to the client to ascertain failure information.
In accordance with a second aspect of the present invention, there is a method of monitoring backup activity on a computer system comprising a storage manager server which receives data objects to be backed up from multiple clients in the system and which stores the data objects in at least one mass storage device, the clients creating client information comprising respective logs of backup jobs which they carry out, the method of monitoring comprising providing a home server, transferring client information from the clients to the home server, parsing the client information received by the home server, and providing a user with backup information relating to the success, failure or other current backup status of a node.
In accordance with a third aspect of the present invention, there is a computer program product for running on a home server of a computer system in which the home server is in communication with multiple clients, the computer program product comprising instructions which cause the home server to receive from the clients respective client information files comprising logs of backup jobs carried out by the clients, to parse the client information files, and to output for a user backup information relating to the success, failure or other current backup status of a node.
Through the utilization of client information obtained directly and automatically from the node, the present invention makes it possible to present the backup data via a website in such a way that one is able at a glance to gain a complete overview of the current as well as historical backup situation for an individual node or complete environment, be it large or small.
As well as an overview, detailed information can be made available for every backup job, thus reducing the need to manually connect to a node in order to review the information.
Potentially critical scenarios can be automatically flagged with coloured warnings on the website as well as e-mail and SMS (text message) alerts. The time and effort saved in identifying failed backups, as well as the advantage of having all the client information to hand, means one can concentrate on solving backup problems. In addition, being able to proactively react to possibly critical situations that may normally go undetected helps protect against possible data loss.
The backups can be monitored by anyone who has a valid account for the website, thus removing the pressure from node or server administrators. Reports complete with graphs and tables can be generated for individual nodes or the complete environment for several time frames. This is particularly helpful in an audit situation, and for ensuring Service Level Agreements (SLAs) are reached.
Specific embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:-
Figure 1 is a diagrammatic representation of a backup environment;
Figure 2 is a diagrammatic representation of an environment for implementing the present invention;
Figure 3 is a screen display from a website used in an embodiment of the present invention;
Figure 4 is a backup environment summary prepared by a system embodying the present invention;
Figure 5 presents summary data in relation to a chosen item in the backup environment summary;
Figure 6 presents backup information specific to a chosen node;
Figure 7 presents historical backup data for a given node;
Figure 8 is a graph of the frequency of backup operation outcomes - failure, success, etc. - over time; and
Figure 9 shows backup information for a particular node.
The embodiment of the invention to be described below will be referred to as "EBC" and consists of two distinct parts. One part is software that is installed, according to the present exemplary embodiment, on a Wintel server (to be referred to hereafter as a "home server") which is within the environment where the nodes that are to be backed up exist, and can be accessed over a system. In accordance with the invention, log files containing the client information are either collected by the home server or can be delivered. Once the log files are available the software, which comprises a parsing mechanism, applies specific initial parsing logic to the client information. As a result of the parsing, two distinct log files are created. Key amongst them is a file containing summary information pertaining to the latest backup from every node where log files were available. This log file will be referred to hereafter as the "summary log."
The second part of EBC is a second Wintel server (to be referred to hereafter as the "host") running a data warehouse and hosting a website. On the home server, once all the respective log files have been parsed and the summary log has been completed, the summary log is sent, e.g. via FTP or e-mail (with attachment), to the host, where it is further parsed: certain criteria are inspected and specific values are compared and contrasted. The results of the data warehouse parsing are subsequently accessible via the website with a valid and registered e-mail account.
Via the website one is able to ascertain at a glance the overall picture of a complete backup environment, for example how many backups were successful, failed, are still running etc. As well as providing an overview, the system makes it possible to drill down to the latest client information from an individual node. This means one is able to immediately identify (in most cases) the cause of a backup failure without having to spend time connecting to a node and manually checking the respective client information.
The website offers an abundance of information to the end user, highlighting possible critical situations with the use of colours. However this is a passive approach to backup monitoring and relies on someone actively logging on and paying attention to the backups. In order to ensure that no critical warnings are overlooked, messages containing warnings and backup failure information can be sent to those who choose to have this service. These may for example be SMS (text messages) and/or e-mails. Each subscriber is able to individually tailor the service to meet his or her requirements. It is possible to receive just e-mails or SMS messages or both for the complete environment, and to specify individual nodes. This service ensures that the backups are proactively monitored, enabling administrators to swiftly and efficiently react to failed or possibly critical situations that may otherwise go undetected, even if a manual backup control was performed.
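By way of illustration only, the per-subscriber tailoring described above might be sketched as follows. The `Subscription` class, its field names and the `channels_for` function are hypothetical names introduced for this sketch and are not taken from EBC itself.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class Subscription:
    """Hypothetical alert preferences for one subscriber."""
    channels: FrozenSet[str]                # any of "email", "sms"
    nodes: Optional[FrozenSet[str]] = None  # None means the complete environment


def channels_for(sub: Subscription, node: str) -> Set[str]:
    """Return the channels on which `sub` wants alerts about `node`."""
    if sub.nodes is None or node in sub.nodes:
        return set(sub.channels)
    return set()
```

A subscriber with `nodes=None` is alerted for every node; one who lists individual nodes is alerted only for those.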
As well as monitoring the backups, EBC is also able to report on all backup activity at node level, service pack level or complete environment level over several (user defined) time scales.
In summary, EBC is a complete backup monitoring and reporting system that enables one to proactively manage, react and report on the backup environment from anywhere in the world. Additionally, the information available can be used for audits or to ensure Service Level Agreements (SLAs) have been reached or adhered to.
Before looking at EBC itself in greater detail, the operation of a known suite of storage manager software (such as IBM's Tivoli Storage Manager, for example) will be described with reference to Figure 1.
The storage manager software (referred to hereafter as SM) serves to backup data located on nodes running several platforms such as Wintel and UNIX. In order for an SM backup to be performed, firstly a suitable environment must exist. A simple environment consists of a single storage manager server 10 and several clients 12a, 12b connected over a network such as a Local Area Network (LAN) or Wide Area Network (WAN) 14 as seen in Figure 1. The storage manager server has storage resources 16 in the form of multiple mass media devices (tapes, disks) connected to it where backup data from the clients 12a, 12b is stored and managed.
The role of the storage manager server 10 is to schedule backup jobs on each client and store and manage the data backed up from the clients, making it available as and when a restore is required.
Each node to be backed up must have SM client software, called an SM client, installed on it. In order for the client to know what it should back up, a configuration file or files are used. Detailed information pertaining to a backup is to be found in log files produced by the client 12a, 12b. There are slight differences between the number and type of files used for Wintel and UNIX clients. Both will be detailed below.
A Wintel client has three main files:
• a configuration file;
• a main client backup information file; and
• a specific error information file.
The configuration file can be edited to configure the various specific SM settings for the respective backup. This includes specifying files/directories, otherwise known as "objects", that should be included in or excluded from the backup. When using an include statement, it is possible to assign a "management class" which defines how long the objects that are backed up by the client should be retained once they have been transferred to the server. Additionally it defines how many versions are to be retained and for how long. If a management class is specifically defined on the client, this will be used. If however no management class is defined on the client, a default management class defined on the server that the client is connected to will be used.
The main client backup information file contains the backup statistics resulting from the backup activity of the client. This file can contain more or less information pertaining to the backups depending on how the configuration file is configured. For example, it can contain a listing of every single file or directory backed up, or just a summary of the backup statistics depending on the configuration.
Should a backup encounter any problems, detailed error messages are placed in the specific error information file.
A UNIX client may have up to five files:
• a configuration file which can be edited to configure the various specific SM settings for the respective backup. Unlike the Wintel configuration file it may or may not be used to configure objects that should be included or excluded;
• if it is not used, a separate include/exclude file is used to define these items;
• a server identification file which is primarily used to define which server the client should be connected to;
• a main client backup information file, which has the same function as described above with reference to the Wintel main client backup information file; and
• a specific error information file, which has the same function as described above with reference to the Wintel specific error information file.
The method used by the exemplary SM to perform a backup is known as "incremental". An incremental backup works by initially inspecting "objects". Objects are directories and files that reside on the file system of a node. What should be inspected is based on the include and exclude statements defined in the respective configuration files. Following the inspection phase, the inspected objects are compared with those that were previously backed up to see if they already exist or have been modified. If an object is new or has been modified since the last backup, it will be backed up and the data sent to the server. All other objects will not be backed up.
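By way of illustration only, the comparison step of an incremental backup might be sketched as follows. The sketch assumes that modification is detected by comparing modification times, which the description above does not specify; the function name and the dictionary layout are likewise hypothetical.

```python
from typing import Dict, List


def select_for_backup(inspected: Dict[str, float],
                      previously_backed_up: Dict[str, float]) -> List[str]:
    """Return the objects that are new or modified since the last backup.

    Both arguments map an object path to a modification time (an assumed
    change signal); `inspected` holds the objects found in this run.
    """
    return sorted(
        path
        for path, mtime in inspected.items()
        if path not in previously_backed_up or mtime > previously_backed_up[path]
    )
```

Objects that are unchanged since the previous run are simply omitted from the result, which is why an incremental backup normally transfers only a fraction of the data on the node.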
Under certain conditions, some caused by the node and some due to the way in which SMs function, it can be that a backup job reported as "successful" by the client, and subsequently reported as such to the server, is not actually successful. This means it is possible that not all data that should potentially have been backed up, was backed up. In some cases it is only possible to detect, and thus prevent the recurrence of, such a critical event by reviewing the data (client information) located on the node.
To overcome this problem, the system to be described below, through the utilisation of information obtained directly and automatically from the node only, makes it possible via a secure Internet website to gain, at a glance, an overview of a backup environment, large or small, or simply view backup information pertaining to a single backup. The option of receiving an e-mail or SMS or both makes certain that valuable information that may only otherwise be gleaned from browsing the website is not overlooked. The system enables a user to be actively or passively informed, and thus to proactively react against potential critical situations based on key client information (the only realistic data source that should be considered when monitoring SM backups) that, left undetected, could lead to data loss.
EBC utilises only client information found on the respective nodes. Client log files are obtained from every monitored node.
With reference to Figure 2 an explanation of the operation of EBC will now follow.
A UNIX node 12c has an SM client installed on it. The arrow 20 pointing from a home server 22 to the node 12c indicates that the log files are being collected from the node 12c by the home server 22.
A Wintel node 12d has an SM client installed on it. The arrow pointing from the node 12d to the home server indicates that the log files are being sent from the node 12d to the home server 22.
For both Wintel and UNIX nodes, either method of log file collection or delivery can be used. The same methods can be applied to any node running an alternative operating system other than Wintel or UNIX.
Once all the relevant log files have been centrally consolidated on the home server 22, the parser applies initial parsing logic to them. The resulting information pertaining to the latest backup of each respective node, where the required client log files were available, is subsequently entered into a file called the summary log.
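By way of illustration only, the initial parsing of one node's log into a summary record might be sketched as follows. The "key: value" line format and the field names used here are invented for the sketch; they are not the log format of any particular SM product, and a real parser would need a mapping suited to the actual client log files.

```python
from typing import Dict, Iterable, Union

# Hypothetical field names; a real client log would need its own mapping.
NUMERIC_FIELDS = ("objects inspected", "objects failed", "bytes backed up")


def summarise_client_log(node: str,
                         lines: Iterable[str]) -> Dict[str, Union[str, int]]:
    """Reduce one node's log lines to a single summary record."""
    record: Dict[str, Union[str, int]] = {"node": node, "status": "unknown"}
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line, so ignore it
        key, value = key.strip().lower(), value.strip()
        if key == "status":
            record["status"] = value.lower()
        elif key in NUMERIC_FIELDS and value.isdigit():
            record[key.replace(" ", "_")] = int(value)
    return record
```

Because later lines overwrite earlier ones, a log that is appended to on every run yields the values of the latest backup, which is what the summary log is intended to hold.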
The summary log file forms the basis of information used to monitor and report on the backups. Within this file it is already possible to glean the basic status of most jobs, for example (a) failed, (b) still running or (c) successful.
It is important however that the true backup status of a job is never based solely on this file, as only the information that is ultimately presented on the website portrays the real status.
As well as the summary log, a second file is generated called the missing nodes log. If the log files from a node were once available but for whatever reason became unavailable, the name(s) of the missing node(s) appear(s) in this file. This file is produced as part of the initial parsing on the home server.
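By way of illustration only, the content of the missing nodes log can be thought of as a set difference between the nodes whose log files were seen previously and those whose log files are available now. The function and parameter names below are hypothetical.

```python
from typing import Iterable, List


def missing_nodes(previously_seen: Iterable[str],
                  available_now: Iterable[str]) -> List[str]:
    """Return nodes whose logs were once available but are now absent."""
    return sorted(set(previously_seen) - set(available_now))
```

A node that has never delivered log files does not appear in the result, since only nodes that were once available can be reported as missing.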
Once the summary log file and the missing nodes log are complete, they are transferred via e-mail, File Transfer Protocol (FTP) or Secure File Transfer Protocol (SFTP) over the world wide web 24 to a data warehouse and web server 26.
Once automatically loaded into the data warehouse 26, the files are further parsed and the resulting data is uploaded to the website, which is then available to an EBC service provider 28. Additionally, depending on how the user has configured his or her account, he or she will receive e-mails and SMS (text) messages detailing the specific backup failures and warnings.
Some details of the website will be provided by way of example rather than limitation and with reference to Figures 3 to 8. Website access requires a valid account (e-mail address) and password. The initial website view seen in Figure 3 offers several viewing options from a backup environment overview to detailed node backup information. It is also possible to generate reports as well as export data to a commercial spreadsheet package.
The backup environment summary (Figure 4) enables one to view the backup results of the complete environment at a glance, thus gleaning the most important statistics such as the number of backups that are failed, unknown or have possible warnings associated with them.
The user may create a "Personal Backup View", selecting the particular backup information required, including alerts, the personal backup view being subsequently automatically provided to that particular user.
If one clicks on the value associated with a specific item, for example value 3 in Figure 4 associated with "failed backups", a summary of only the corresponding data - in this example backup jobs that failed - is produced as can be seen in Figure 5. This option makes it easy for one to view only what is of interest, preventing having to scroll up and down the page in order to view pertinent information.
In Figure 9, information is given relating to backup jobs on a single node. The status of the backup job can be seen to be "failed". By clicking on the node name, the user obtains the more detailed textual information shown toward the bottom of the figure including the reason why the backup failed - in this case, "Backup using Microsoft volume shadow copy failed" (Microsoft is a registered trade mark). Only the client contains this information, and being able to view this information via the website saves wasted time and effort that would normally have to be invested by connecting to the node.
Either through the selection of a summary view via the backup environment summary, or by scrolling down the page, one is immediately able to view basic backup information and the corresponding status for the backup job. A typical example of the type of information available can be seen in Figure 6, for a backup job that has the status "failed."
If one clicks where indicated on Figure 6, i.e. on "backup status", be it successful, failed, unknown or running, up to a thirty-day (or backup) historical view is available (Figure 7). This view enables the trend of the backups for a specific node to be monitored. This can help identify and rectify certain backup problems that may appear once a week for example, that would normally go undetected.
Taking the example that someone would like a restore of a specific object from a date more than thirty days ago, prior to performing a restore it is important to check that the backup for this particular date was successful. EBC offers a "point in time" option that makes it possible to specify a date and view the respective backup information, and thus to establish whether or not the restore is possible (as long as there was data available at this time). This feature is also very useful in an audit situation where one must demonstrate that a particular backup (chosen at random by an auditor) was successful and most importantly that a record exists.
Graphs and charts can be generated to help to identify trends and identify problem areas. As well as this they can be used for audits and for demonstrating Service Level Agreement (SLA) compliance. EBC offers several graph options specific for an individual node, service pack Wintel, UNIX or covering the complete environment. For example Figure 8 shows a graph of the outcomes of backup operations - successful, failed, etc. - over a consecutive sequence of backups.
In a conventional SM system, because the communication between the client and the server is in the form of return codes, if a backup job fails for any reason the detailed information pertaining to the reason for the failure can only be found on the node where the client is installed. In order to find out exactly what happened with the backup, one must first connect to the node in order to view the information. One can only access this information if one has the relevant access rights. This activity is time consuming. However it is the only realistic way of identifying exactly what the cause of the backup failure was.
EBC enables one to view detailed failure information pertaining to a backup via the website, thus saving time and effort connecting to the node. It also has the advantage that it allows people to view the information who would normally not have the correct rights to view the information on the node. Armed with this detailed information it is then possible to take steps to prevent a further failed backup.
Occasionally the file system of a node may become corrupt. In this event, when the SM backup encounters the point of the corrupt file system during the inspection phase, the inspection can simply stop, and only a backup of the objects inspected thus far is performed. Barring any other backup issues that may arise, the SM can subsequently - and erroneously - proclaim the backup as "successful". Left undetected, this can have devastating consequences: depending on the level at which the corrupt file system was located, it could mean that only a small percentage of the objects to be backed up were actually backed up, thus making the restore of any affected objects impossible.
In order to immediately identify this problem, if there is a decrease of ten percent or more in the value for "objects inspected" between the last backup and the current one, a warning appears on the website highlighting the value. Additionally, an e-mail and/or an SMS can be sent to proactively inform the relevant parties.
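By way of illustration only, the "objects inspected" check described above might be sketched as follows. The function and parameter names are hypothetical; the ten percent figure is taken from the description, and integer arithmetic is used so that a drop of exactly ten percent is reliably flagged.

```python
def inspected_drop_warning(previous: int, current: int,
                           threshold_percent: int = 10) -> bool:
    """Return True if "objects inspected" fell by the threshold or more."""
    if previous <= 0:
        return False  # no earlier value to compare against
    return (previous - current) * 100 >= previous * threshold_percent
```

For example, a fall from 1000 to 900 inspected objects triggers the warning, whereas a fall from 1000 to 901 does not.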
In the SM, the server 10 and the client 12 communicate the status of a backup job via "return codes". Particular return code values may for example indicate
(a) all backup operations on the client completed successfully;
(b) the backup operation completed successfully but some files were not processed (a common outcome in practice, e.g. because the files were in use by another application and so were inaccessible, or were deleted between the inspect operation and the subsequent backup);
(c) the backup operation completed with at least one warning message.
The value of the return code for any respective backup job is based primarily on the successful completion or failure of the scheduled job that starts the backup on the client. Subsequently, additional information pertaining to the respective backup such as the number of objects failed is taken into account.
As an example, if the schedule defined on the client starts and successfully completes and no objects fail, the server classifies the respective backup job as completely successful and attributes the corresponding return code "0".
If the schedule for a backup job on a client starts and ends without an error, but one or more objects are reported as failed, the job is given a return code indicating that the scheduled job was successful, but one or more objects failed. In a worst-case scenario, every single object could fail for whatever reason during the backup, and the SM could still declare the schedule that started the backup as successfully run although one or more objects were not backed up. This is because it is normal for some objects to fail, e.g. for one of the reasons just mentioned. Because outcome (b) above is a common occurrence, such a job (as long as the schedule starts and completes successfully) is normally classed as successful by administrators and no further action is taken. With regard to the worst-case scenario, if no further action were taken and it were left undetected, it would not be possible to restore any of the objects that were affected by the respective backup.
To ensure that this situation, if it does arise, is guarded against by EBC, if more than a threshold number of objects (configurable depending on the environment, e.g. one hundred) are listed as failed, an objects failed threshold warning highlighting the value appears on the website. Once again an e-mail or SMS can be sent to proactively inform the relevant parties.
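By way of illustration only, the objects failed threshold check might be sketched as follows. The function and parameter names are hypothetical; the default of one hundred is the example value given above and would be configured per environment.

```python
def objects_failed_warning(objects_failed: int, threshold: int = 100) -> bool:
    """Return True if more than `threshold` objects are listed as failed."""
    return objects_failed > threshold
```

The comparison is strict, reflecting the wording "more than a threshold number", so exactly one hundred failed objects would not raise the warning under the default setting.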
A management class is an entry that may be included in the server identification file or the include/exclude file located locally on the respective node. This entry defines how "objects" are stored and managed within the SM storage infrastructure. The main purpose of the management class is to define how many versions of an object should be stored and for how long once it has been backed up.
If a management class is incorrectly defined, objects backed up will have the "default" management class rules applied to them. The administrator defines the default management class on the server. If the default management class is not defined correctly or accurately to reflect the specific requirements of the objects backed up, it could have serious consequences. This could lead to a situation where objects that should normally be available for restore are not, as they have not been backed up or stored for the correct amount of time.
Information pertaining to the fact that the management class defined in the server identification or include/exclude files located on the node is invalid can be found in the specific error information file located on the node. This information may or may not be communicated to the respective server. In the event that it is communicated to the server, it may well be overlooked.
If an invalid management class is defined, a warning will appear on the website highlighting the issue. As before, an e-mail and/or an SMS can be sent to proactively inform the relevant parties.
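By way of illustration only, the fallback to the default management class and the accompanying warning might be sketched as follows. The function name, the set of known classes and the returned flag are hypothetical; the sketch assumes that an invalid class is one that is not defined on the server.

```python
from typing import Optional, Set, Tuple


def resolve_management_class(requested: Optional[str],
                             known_classes: Set[str],
                             default: str) -> Tuple[str, bool]:
    """Return (effective class, warning flag) for a backed-up object.

    `requested` is the class named on the client, or None if none is named.
    The flag is True only when a class was named but is not valid.
    """
    if requested is None:
        return default, False  # nothing defined on the client
    if requested in known_classes:
        return requested, False
    return default, True       # invalid class: default rules are applied
```

The warning flag distinguishes a deliberate reliance on the default from an invalid entry that silently falls back to it, which is the case the website warning is intended to expose.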
Occasionally the SM may report a backup as "successful" although nothing at all has been backed up. There are two possible reasons for this. The first is that no new objects were created or have been modified since the last backup. The second can be due to a configuration error in the configuration file. To ensure that this potentially critical situation does not go unnoticed, EBC generates a warning on the website indicating that although the backup was successful, 0 bytes of information were backed up. E-mail and/or SMS alerts are available for this warning.
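By way of illustration only, the zero-byte check might be sketched as follows. The function name, the parameter names and the status string "successful" are hypothetical.

```python
def zero_bytes_warning(status: str, bytes_backed_up: int) -> bool:
    """Return True if a backup reported as successful transferred no data."""
    return status == "successful" and bytes_backed_up == 0
```

The check cannot by itself distinguish the harmless case (nothing changed since the last backup) from a configuration error, so it only flags the job for review.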
It is normal that not all objects (files) can be backed up and there are several reasons why this might be the case, one of which is that the object was in use during the backup. It is however important to review the failed objects and make a special backup if it is deemed necessary. Via the website one is able to view the objects (files) that were not backed up.
If an object fails to be backed up three or more times in succession, a warning appears on the website highlighting the value for "failed objects". If the user clicks on the warning they are able to view the affected objects and thus take the necessary action. This warning is complete with an e-mail and/or SMS alerting option.
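By way of illustration only, identifying objects that failed in three successive backups might be sketched as follows. The function name and the representation of the history as a list of sets (oldest backup first) are hypothetical.

```python
from typing import List, Set


def repeatedly_failed_objects(history: List[Set[str]], runs: int = 3) -> List[str]:
    """Return objects that failed in each of the last `runs` backups.

    `history` holds, per backup and oldest first, the names of the
    objects that were listed as failed.
    """
    if runs <= 0 or len(history) < runs:
        return []  # not enough backups to establish a succession
    return sorted(set.intersection(*history[-runs:]))
```

An object that failed in each of the most recent three runs is also covered if it has failed for longer than that, so the same check serves for "three or more times".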
As the SM performs an incremental backup, it is very rare that a backup job runs for longer than twenty-four hours. Apart from the initial backup, not all data is backed up every time a backup is performed. Only new objects or objects that have been modified since the last backup are indeed backed up, which is why a backup is normally complete in under twenty-four hours. In the event that a backup does run longer than twenty-four hours, it could indicate a problem, e.g. the backup session between the client and the server may be hanging. With this in mind, a warning for this event is available via the website and is available with an e-mail and/or SMS alert.
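By way of illustration only, the long-running backup check might be sketched as follows. The function and parameter names are hypothetical; the twenty-four hour limit is taken from the description.

```python
from datetime import datetime, timedelta
from typing import Optional


def long_running_warning(started: datetime,
                         now: datetime,
                         finished: Optional[datetime] = None,
                         limit: timedelta = timedelta(hours=24)) -> bool:
    """Return True if the backup has run, or ran, for longer than `limit`."""
    end = finished if finished is not None else now  # still running: use now
    return end - started > limit
```

Passing `finished=None` covers a job that is still running, which is the case in which a hanging session between the client and the server would show up.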