US20080301490A1 - Quorum-based power-down of unresponsive servers in a computer cluster - Google Patents

Quorum-based power-down of unresponsive servers in a computer cluster

Info

Publication number
US20080301490A1
US20080301490A1 (application US12/192,273)
Authority
US
United States
Prior art keywords
cluster
servers
server
power
quorum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/192,273
Inventor
Christopher Henry Jones
William T. Newport
Graham Derek Wallis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to US12/192,273
Publication of US20080301490A1
Status: Abandoned

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00 Data switching networks
    • H04L 12/02 Details
    • H04L 12/12 Arrangements for remote connection or disconnection of substations or of equipment thereof
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0681 Configuration of triggering conditions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00 Arrangements for monitoring or testing data switching networks
    • H04L 43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S 707/00 Data processing: database and file management or data structures
    • Y10S 707/953 Organization of data
    • Y10S 707/955 Object-oriented

Abstract

A quorum-based server power-down mechanism allows a manager in a computer cluster to power-down unresponsive servers in a manner that assures that an unresponsive server does not become responsive again. In order for a manager in a cluster to power down servers in the cluster, the cluster must have quorum, meaning that a majority of the computers in the cluster must be responsive. If the cluster has quorum, and if the manager server did not fail, the manager causes the failed server(s) to be powered down. If the manager server did fail, the new manager causes all unresponsive servers in the cluster to be powered down. If the power-down is successful, the resources on the failed server(s) may be failed over to other servers in the cluster that were not powered down. If the power-down is not successful, the cluster is disabled.

Description

    CROSS-REFERENCE TO PARENT APPLICATION
  • This patent application is a continuation of U.S. Ser. No. 10/981,020 filed on Nov. 4, 2004, which is incorporated herein by reference.
  • BACKGROUND OF THE INVENTION
  • 1. Technical Field
  • This invention generally relates to data processing, and more specifically relates to networked computer systems.
  • 2. Background Art
  • Since the dawn of the computer age, computer systems have become indispensable in many fields of human endeavor including engineering design, machine and process control, and information storage and access. In the early days of computers, organizations such as banks, industrial firms, and government agencies would purchase a single computer that satisfied their needs, but by the early 1950s many of them had multiple computers, and the need to move data from one computer to another became apparent. At this time computer networks began being developed to allow computers to work together.
  • Networked computers are capable of performing tasks that no single computer could perform. In addition, networks allow low cost personal computer systems to connect to larger systems to perform tasks that such low cost systems could not perform alone. Most companies in the United States today have one or more computer networks. The topology and size of the networks may vary according to the computer systems being networked and the design of the system administrator. It is very common, in fact, for companies to have multiple computer networks. Many large companies have a sophisticated blend of local area networks (LANs) and wide area networks (WANs) that effectively connect most computers in the company to each other.
  • With multiple computers hooked together on a network, it soon became apparent that networked computers could be used to complete tasks by delegating different portions of the task to different computers on the network, which can then process their respective portions in parallel. In one specific configuration for shared computing on a network, the concept of a computer “cluster” has been used to define groups of computer systems on the network that can work in parallel on different portions of a task.
  • Clusters of computer systems have also been used to provide high-reliability services. The high reliability is provided by allowing services on a server that fails to be moved to a server that is still alive. This type of fault-tolerance is very desirable for many companies, such as those that do a significant amount of e-commerce. In order to provide high-reliability services, there must be some mechanism in place to detect when one of the servers in the cluster becomes inoperative. One known way to determine whether all the servers in a cluster are operative is to have each server periodically issue a message to the other servers indicating that the server that sent the message is still alive and well. These types of messages are commonly referred to in the art as “heartbeats” because as long as the messages continue (i.e., as long as the heart is still beating), we know the server is still alive.
  • In the prior art, when a server becomes invisible due to lack of a heartbeat, a server in the cluster that is designated as a manager assumes the server that no longer has a heartbeat has failed. As a result, the manager must provide the resources that were on the failed server on another server in the cluster. Note, however, that the absence of a heartbeat does not always mean a server is dead. For example, a server may not provide a heartbeat because it may be temporarily unresponsive due to thrashing, swapping, network floods, etc. If the server is not giving heartbeats but is still alive, there exists the possibility that the server may once again become responsive and start providing heartbeats. If the manager has already assumed the server has failed, and has provided the server's services on another server, we now have two servers that try to provide the same services. This creates a problem in administering the cluster. One way to deal with this problem is to monitor data for a service to make sure that two servers don't try to access the same data for the same service. However, this is complex and inefficient. Without a mechanism for assuring that services in a computer cluster are not duplicated when a server failure is detected, the computer industry will continue to suffer from inadequate and inefficient ways of handling a failed server in a computer cluster.
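  • To make the heartbeat idea concrete, the following is a minimal sketch of a heartbeat monitor, written in Python purely for illustration; the class name, interval, and miss threshold are assumptions of this sketch and do not come from the patent.

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeat messages (illustrative value)
MISS_THRESHOLD = 3         # missed intervals before a server is presumed failed

class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each server in the cluster."""

    def __init__(self):
        self.last_seen = {}

    def record_heartbeat(self, server_id):
        # Called whenever a heartbeat message arrives from another server.
        self.last_seen[server_id] = time.monotonic()

    def unresponsive_servers(self):
        # A server silent for MISS_THRESHOLD intervals is presumed failed, although,
        # as noted above, it may only be temporarily unresponsive (thrashing, swapping,
        # a network flood) and could start sending heartbeats again later.
        now = time.monotonic()
        limit = HEARTBEAT_INTERVAL * MISS_THRESHOLD
        return [s for s, t in self.last_seen.items() if now - t > limit]
```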
  • DISCLOSURE OF INVENTION
  • An apparatus and method provide a quorum-based server power-down mechanism that allows a manager in a computer cluster to power-down unresponsive servers in a manner that assures that an unresponsive server does not become responsive again. In order for a manager in a cluster to power down servers in the cluster, the cluster must have quorum, meaning that a majority of the computers in the cluster must be responsive. If the cluster has quorum, and if the manager server did not fail, the manager causes the failed server(s) to be powered down. If the manager server did fail, the new manager causes all unresponsive servers in the cluster to be powered down. If the power-down is successful, the resources on the failed server(s) may be failed over to other servers in the cluster that were not powered down. If the power-down is not successful, the cluster is disabled.
  • The foregoing and other features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings.
  • BRIEF DESCRIPTION OF DRAWINGS
  • The preferred embodiments of the present invention will hereinafter be described in conjunction with the appended drawings, where like designations denote like elements, and:
  • FIG. 1 is a block diagram of a computer apparatus in accordance with the preferred embodiments;
  • FIG. 2 is a block diagram of a cluster of computer systems shown in FIG. 1 in accordance with the preferred embodiments;
  • FIG. 3 is a flow diagram of a method in accordance with the preferred embodiments for powering up servers in a cluster;
  • FIG. 4 is a prior art method for a server to shut itself down based on the loss of lock on a shared disk drive; and
  • FIG. 5 is a flow diagram of a method in accordance with the preferred embodiments for powering down unresponsive servers in a computer cluster before failing over the resources of the failed servers.
  • BEST MODE FOR CARRYING OUT THE INVENTION
  • According to preferred embodiments of the present invention, a quorum-based server power-down mechanism in a computer cluster assures that an unresponsive server in the cluster is powered-down before the resources are failed over to one or more other responsive servers. The power-down mechanism is quorum-based, meaning that only a cluster that includes a majority of the servers in the cluster may perform power-down operations. By powering down failed servers, the preferred embodiments assure that a failed system does not become responsive again.
  • The prior art provides a way for a server in a cluster to determine when it has become unresponsive, and to know it needs to shut down. Method 400 in FIG. 4 shows the steps in one known method in the art that uses a shared disk drive. When different computer systems in a cluster share a disk drive, there is typically a locking mechanism on the disk drive to assure only one server can access the disk drive at any given time. A set of servers that are visible to each other using some membership algorithm will elect a leader and this leader will obtain the lock on the disk drive. If the set of servers split into partitions because of a communication fault, then the majority partition will obtain a lock on the shared disk drive (step 410). A majority partition is determined with a voting system. This will cause the original leader to detect that the lock on the shared disk drive has been stolen (step 420=YES), and the servers in the original partition will panic as a result (step 430). The panic may result in powering down the server or panicking the operating system kernel.
  • The check for a majority partition is necessary because the different partitions will detect that the cluster has partitioned at different times, in an asynchronous manner. If no partition has a majority, each partition will panic any of its servers that have active resources. While method 400 in FIG. 4 is somewhat effective for servers that share a disk drive, the trend in the industry is to move away from sharing resources between servers in a cluster. In addition, some servers in a cluster may not need shared storage, making method 400 inapplicable to such servers. As a result, a method is needed to detect when a server fails, and to take appropriate action to assure the server is dead when it is unresponsive.
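  • The prior-art behavior of FIG. 4 can be pictured with the rough sketch below. The SharedDiskLock interface and the watchdog function are hypothetical stand-ins, not the actual on-disk locking protocol.

```python
class SharedDiskLock:
    """Hypothetical stand-in for a lock record kept on a shared disk drive."""

    def __init__(self):
        self.holder = None

    def acquire(self, partition_id):
        self.holder = partition_id      # the elected leader takes the lock

    def steal(self, partition_id):
        self.holder = partition_id      # the majority partition takes the lock (step 410)

    def held_by(self, partition_id):
        return self.holder == partition_id

def leader_watchdog(lock, my_partition_id, panic):
    # Step 420: has the lock on the shared disk been stolen?
    # Step 430: if so, panic (power down the server or panic the OS kernel).
    if not lock.held_by(my_partition_id):
        panic("shared-disk lock lost to a majority partition")
```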
  • Referring now to FIG. 1, a computer system 100 is one suitable implementation of a computer system that may be a member of a cluster in accordance with the preferred embodiments of the invention. Computer system 100 is an IBM eServer iSeries computer system. However, those skilled in the art will appreciate that the mechanisms and apparatus of the present invention apply equally to any computer system, regardless of whether the computer system is a complicated multi-user computing apparatus, a single user workstation, or an embedded control system. As shown in FIG. 1, computer system 100 comprises one or more processors 110, a main memory 120, a mass storage interface 130, a display interface 140, a network interface 150, and a service processor interface 180. These system components are interconnected through the use of a system bus 160. Mass storage interface 130 is used to connect mass storage devices (such as a direct access storage device 155) to computer system 100. One specific type of direct access storage device 155 is a readable and writable CD RW drive, which may store data to and read data from a CD RW 195.
  • Service processor interface 180 preferably connects the computer system 100 to a separate service processor 182. Service processor 182 preferably includes a server power-down mechanism 184 that allows servers coupled to the service processor to be individually powered-down. Service processor 182 typically provides an interface that allows a computer system (such as 100) to command the service processor to power down another computer system in the cluster. In addition, service processor 182 can terminate a single process on another machine when servers in the cluster are processes rather than physical boxes or logical partitions.
  • Main memory 120 in accordance with the preferred embodiments contains data 121, an operating system 122, and a cluster engine 123. Data 121 represents any data that serves as input to or output from any program in computer system 100. Operating system 122 is a multitasking operating system known in the industry as OS/400; however, those skilled in the art will appreciate that the spirit and scope of the present invention is not limited to any one operating system. Cluster engine 123 provides for communication between computer systems in a cluster. Cluster engine 123 includes many features and mechanisms that are known in the art that support cluster communications but are not shown in FIG. 1. Cluster engine 123 includes a heartbeat mechanism 124 (which may operate over multiple channels), a membership change mechanism 125, and a quorum-based server power-down mechanism 126. The heartbeat mechanism 124 and membership change mechanism 125 are preferably known mechanisms in the art. Heartbeat mechanism 124 sends a periodic heartbeat message to other servers in the cluster, and receives periodic heartbeat messages from other servers in the cluster. These heartbeats can be transmitted over a variety of channels, such as a network, serial cables, or shared-disk heartbeating. Membership change mechanism 125 monitors the membership in the cluster, and generates a membership change message to all servers in the cluster when one of the servers in the cluster becomes unresponsive (i.e., stops sending heartbeat messages). Quorum-based server power-down mechanism 126 allows a manager server to power down unresponsive servers, thereby assuring that the unresponsive servers do not become responsive in the future. The quorum-based server power-down mechanism 126 can only power down a server if the cluster has quorum, as discussed in more detail below with reference to FIG. 5.
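  • The relationship between these components can be pictured with a small structural sketch. It is illustrative only, the method and attribute names are assumptions of this sketch, and the actual power-down decision is deferred to the FIG. 5 logic sketched later.

```python
class ClusterEngine:
    """Structural sketch of cluster engine 123; names are illustrative, not IBM's."""

    def __init__(self, my_id, member_ids, service_processor):
        self.my_id = my_id
        self.members = set(member_ids)
        self.last_heartbeat = {}               # state kept by heartbeat mechanism 124
        self.service_processor = service_processor

    def on_heartbeat(self, sender_id, timestamp):
        # Heartbeat mechanism 124: remember when each server was last heard from.
        self.last_heartbeat[sender_id] = timestamp

    def on_membership_change(self, responsive_ids, failed_ids):
        # Membership change mechanism 125 reports the new view of the cluster; the
        # quorum-based server power-down mechanism 126 acts on it only if the
        # responsive servers form a majority (the full decision logic follows FIG. 5).
        if len(responsive_ids) > len(self.members) / 2:
            for server_id in failed_ids:
                self.service_processor.power_down(server_id)
```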
  • In computer system 100 of FIG. 1, the quorum-based server power-down mechanism 126 is shown to be part of the cluster engine 123. This, however, is shown only as one possible implementation within the scope of the preferred embodiments. The quorum-based server power-down mechanism 126 could also be implemented separate from the cluster engine 123. The preferred embodiments expressly extend to any suitable location and implementation for the quorum-based server power-down mechanism 126.
  • Computer system 100 utilizes well known virtual addressing mechanisms that allow the programs of computer system 100 to behave as if they only have access to a large, single storage entity instead of access to multiple, smaller storage entities such as main memory 120 and DASD device 155. Therefore, while data 121, operating system 122, and cluster engine 123 are shown to reside in main memory 120, those skilled in the art will recognize that these items are not necessarily all completely contained in main memory 120 at the same time. It should also be noted that the term “memory” is used herein to generically refer to the entire virtual memory of computer system 100, and may include the virtual memory of other computer systems coupled to computer system 100.
  • Processor 110 may be constructed from one or more microprocessors and/or integrated circuits. Processor 110 executes program instructions stored in main memory 120. Main memory 120 stores programs and data that processor 110 may access. When computer system 100 starts up, processor 110 initially executes the program instructions that make up operating system 122. Operating system 122 is a sophisticated program that manages the resources of computer system 100. Some of these resources are processor 110, main memory 120, mass storage interface 130, display interface 140, network interface 150, system bus 160, and service processor interface 180.
  • Although computer system 100 is shown to contain only a single system bus, those skilled in the art will appreciate that the present invention may be practiced using a computer system that has multiple buses. In addition, the interfaces that are used in the preferred embodiment each include separate, fully programmed microprocessors that are used to off-load compute-intensive processing from processor 110. However, those skilled in the art will appreciate that the present invention applies equally to computer systems that simply use I/O adapters to perform similar functions.
  • Display interface 140 is used to directly connect one or more displays 165 to computer system 100. These displays 165, which may be non-intelligent (i.e., dumb) terminals or fully programmable workstations, are used to allow system administrators and users to communicate with computer system 100. Note, however, that while display interface 140 is provided to support communication with one or more displays 165, computer system 100 does not necessarily require a display 165, because all needed interaction with users and other processes may occur via network interface 150.
  • Network interface 150 is used to connect other computer systems and/or workstations (e.g., 175 in FIG. 1) to computer system 100 across a network 170. The present invention applies equally no matter how computer system 100 may be connected to other computer systems and/or workstations, regardless of whether the network connection 170 is made using present-day analog and/or digital techniques or via some networking mechanism of the future. In addition, many different network protocols can be used to implement a network. These protocols are specialized computer programs that allow computers to communicate across network 170. TCP/IP (Transmission Control Protocol/Internet Protocol) is an example of a suitable network protocol.
  • At this point, it is important to note that while the present invention has been and will continue to be described in the context of a fully functional computer system, those skilled in the art will appreciate that the present invention is capable of being distributed as a program product in a variety of forms, and that the present invention applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution. Examples of suitable signal bearing media include: recordable type media such as floppy disks and CD RW (e.g., 195 of FIG. 1), and transmission type media such as digital and analog communications links.
  • Referring to FIG. 2, a simple cluster 200 of five computer systems (or “nodes”) is shown. Note that each node 100 in the cluster 200 is preferably a computer system 100 as shown in FIG. 1. However, one skilled in the art will recognize that different types of computer systems could be interconnected in a cluster. The connections between nodes in FIG. 2 represent logical connections, and the physical connections can vary within the scope of the preferred embodiments as long as the nodes in the cluster can logically communicate with each other. Each node 100 is connected to a service processor 182. The service processor 182 preferably includes logic that allows for individually powering down each server on each node. When a node in cluster 200 becomes unresponsive, the quorum-based server power-down mechanism 126 in a manager server gives one or more commands to the service processor 182 to power down one or more of the servers in the cluster 200. The service processor 182, in response to the command(s) from the manager server, powers down the one or more servers in the cluster. Note that the terms “power down” and “powering down” denote removing power from the server, but can also denote simply putting the server in a non-functional state using any suitable mechanism or means. For example, the service processor 182 could simply assert and hold a hard reset signal to a node that needs to be powered down. As long as the reset signal is asserted, the node cannot power up. If a server is located in a logical partition on an apparatus that includes other servers in the cluster in one or more other logical partitions that are still responsive, the apparatus cannot be physically powered down because this would reset the responsive servers as well. However, the service processor can assert a signal or provide a command that causes the server that needs to be powered off to instead shut down. Thus, the terms “power down” and “powering down” as used in this specification and claims mean any way, whether currently known or developed in the future, for putting a server in an unresponsive state until a supervisor determines that the server may be powered back up. In addition, these terms could also refer to simply restarting the server. A service processor may also be more fine-grained, and if the members of the cluster were processes rather than physical boxes or logical partitions, then the powering down of the server may be the simple step of guaranteeing that the server process is terminated.
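  • As a sketch of what “power down” can mean in practice, the following hypothetical service-processor abstraction enumerates the alternatives described above. The enum values and method names are assumptions of this sketch, not an actual service-processor API.

```python
from enum import Enum

class PowerDownMode(Enum):
    CUT_POWER = "cut power"            # physically remove power from the node
    HOLD_RESET = "hold reset"          # assert and hold a hard reset signal
    SHUT_DOWN_PARTITION = "shut lpar"  # shut down only the server's logical partition
    KILL_PROCESS = "kill process"      # guarantee the server process is terminated
    RESTART = "restart"                # simply restart the server

class ServiceProcessor:
    """Hypothetical interface to service processor 182."""

    def __init__(self):
        self.state = {}

    def power_down(self, server_id, mode=PowerDownMode.CUT_POWER):
        # Whatever the mode, the intended effect is the same: the server stays
        # unresponsive until a supervisor decides to bring it back up.  The return
        # value lets the manager decide whether to fail over or disable the cluster.
        self.state[server_id] = mode
        return True
```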
  • Referring to FIG. 3, a method 300 is a method in accordance with the preferred embodiments for initially powering up servers in a cluster. The manager server is powered up first (step 310). This is done because the algorithms for powering down boxes when the manager server moves can reset boxes that are in the process of starting. This makes the initial bring up of the cluster much smoother. The rest of the servers in the cluster may then be powered up (step 320). For the sake of simplicity, in method 300 we assume there is a single manager server for a cluster. However, one skilled in the art will realize that multiple managers could be defined for a cluster, with an arbitration scheme to determine which manager is responsible for performing management duties at any particular point in time. In the case of multiple manager servers, all manager servers are started in step 310, followed by the servers that are not managers in step 320.
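  • A minimal sketch of the FIG. 3 ordering follows, assuming a hypothetical power_up() helper supplied by the caller.

```python
def power_up_cluster(manager_ids, all_server_ids, power_up):
    # Step 310: start the manager server(s) first so a later change of manager does
    # not power down servers that are still in the middle of starting.
    for server_id in manager_ids:
        power_up(server_id)
    # Step 320: start the remaining, non-manager servers.
    for server_id in all_server_ids:
        if server_id not in manager_ids:
            power_up(server_id)
```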
  • FIG. 5 shows one specific method 500 that is preferably performed by the quorum-based server power-down mechanism 126 in FIG. 1 in accordance with the preferred embodiments. Method 500 begins when one or more servers in the cluster fail (step 510). If the cluster does not have quorum (step 520=NO), method 500 is done. The cluster has quorum if the responsive servers constitute a majority of the servers in the cluster. Thus, a cluster with seven servers that has three of the servers fail still has quorum, but if four servers fail, the remaining cluster no longer has quorum. If the number of possible servers is even, one server is given two votes and acts as a tiebreaker. One skilled in the art can determine other techniques for creating tiebreakers. If the cluster has quorum (step 520=YES), method 500 determines whether a manager server failed (step 530). Step 530 does not simply test whether a manager has ever failed; it tests whether a manager server is one of the servers whose failure started method 500 in step 510. If the manager server failed in step 510 (step 530=YES), all non-visible servers in the cluster that have a critical resource are powered down (step 540). A server is non-visible in the cluster (i.e., unresponsive) if it has stopped sending heartbeat messages, or if it has been partitioned from the cluster. If no manager server failed (step 530=NO), method 500 powers down the servers that failed in step 510 that are currently potential owners of any quorum-protected resource (step 550). This check is important because it allows a server process that was shut down cleanly to avoid being powered down. The difference between steps 540 and 550 is simply this: if a manager fails, we don't necessarily know which failed node used to be the manager, so we must power down all unresponsive servers in the cluster (step 540) to avoid the manager coming back alive in the future. If the manager does not fail, only the failed servers that can potentially own a quorum-protected resource need to be powered down (step 550).
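  • The decision logic of steps 520 through 550 can be summarized in the following sketch. The helper predicate owns_quorum_protected_resource and the omission of the even-membership tiebreaker are assumptions made purely for illustration.

```python
def handle_server_failures(members, responsive, failed, manager_ids,
                           owns_quorum_protected_resource, power_down):
    """members, responsive, failed, and manager_ids are sets of server identifiers."""
    # Step 520: quorum requires a majority of the cluster's servers to be responsive.
    # (With an even number of servers, one server would carry two votes as a
    # tiebreaker; that refinement is omitted here.)
    if len(responsive) <= len(members) / 2:
        return []                                   # no quorum: take no action

    if failed & manager_ids:                        # step 530 = YES: a manager failed
        # Step 540: we may not know which non-visible node was the manager, so power
        # down every non-visible server that holds a critical resource.
        targets = {s for s in members - responsive
                   if owns_quorum_protected_resource(s)}
    else:                                           # step 530 = NO
        # Step 550: only the failed servers that could own a quorum-protected
        # resource need to be powered down.
        targets = {s for s in failed if owns_quorum_protected_resource(s)}

    # Each power_down() result is kept so steps 560-580 can check for success.
    return [(s, power_down(s)) for s in targets]
```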
  • If the power-down operation succeeded (step 560=YES), the resources on the failed server(s) may be failed over to servers in the cluster that are still responsive (step 570). Failing over resources is the process of making those same resources available on a different server in the cluster; the concept of failing over resources from a dead server to a live server is well-known in the art and need not be discussed in further detail here. This is the very nature of one specific way to provide highly reliable services, using multiple servers that can take over for each other when one of the servers fails. If the power-down operation did not succeed (step 560=NO), the cluster is disabled (step 580). The preferred embodiments depend on the service processor doing its job of powering down a selected server when the quorum-based server power-down mechanism sends the command to power down the selected server. If the service processor is unable to perform its power-down function, there is a problem with the service processor itself or something else that requires intervention by a system administrator. Thus, once a cluster is disabled in step 580, a system administrator is preferably notified of the problem so that the system administrator can take appropriate action to correct it.
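  • Continuing the same sketch, steps 560 through 580 might look like the following; fail_over, disable_cluster, and notify_admin are hypothetical callbacks standing in for the cluster's actual failover and shutdown facilities.

```python
def complete_failover(power_down_results, fail_over, disable_cluster, notify_admin):
    if all(ok for _, ok in power_down_results):     # step 560 = YES
        for server_id, _ in power_down_results:
            fail_over(server_id)                    # step 570: move its resources
    else:                                           # step 560 = NO
        disable_cluster()                           # step 580
        notify_admin("service processor could not power down a failed server")
```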
  • With an understanding of method 500 in FIG. 5, we now understand why it is necessary to power up the manager server first in method 300 of FIG. 3 before powering up the other servers. Suppose a manager server B is powered up after another server A. In this scenario, when A powers up, it will assume it is the manager. When server B powers up, it will detect a change in manager server, which it will interpret as a failure of the previous manager, and will power down all non-visible servers. These non-visible servers may be in the process of powering up, and each time the manager changes, they are effectively killed off before they can complete the power-up sequence. By requiring the manager server to be powered up first (step 310), followed by the other servers (step 320), this type of undesirable behavior is avoided.
  • One skilled in the art will appreciate that many variations are possible within the scope of the present invention. Thus, while the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that these and other changes in form and details may be made therein without departing from the spirit and scope of the invention. For example, while a known service processor is shown as one possible mechanism for powering down servers, other mechanisms could also be used within the scope of the preferred embodiments. For example, addressable power strips could be used that are capable of receiving commands, and shutting off power to a particular plug in the power strip or to the entire power strip. Any mechanism for putting a server in an unresponsive state until some step of intervention is taken falls within the scope of the term “service processor” as used herein. In addition, the servers recited herein may reside within logical partitions, which means that the power down of a server in a logical partition implies simply shutting down the logical partition.

Claims (4)

1. An apparatus comprising:
(A) at least one processor;
(B) a memory coupled to the at least one processor;
(C) a server process residing in the memory and executed by the at least one processor, wherein the server process resides in a logical partition defined on the apparatus;
(D) a cluster engine residing in the memory and executed by the at least one processor, the cluster engine handling communications between the server process and other servers in a cluster, the cluster engine comprising:
(D1) a heartbeat mechanism that sends a periodic message to the other servers in the cluster to indicate the server process is functioning properly and that receives periodic messages from the other servers in the cluster that indicate the other servers in the cluster are functioning properly;
(D2) a membership change mechanism that generates a membership change message to all servers in the cluster when any of the servers in the cluster become unresponsive;
(E) a quorum-based server power-down mechanism residing in the memory and executed by the at least one processor, the quorum-based server power-down mechanism determining whether the server process is part of a group of servers that includes a majority of servers in the cluster, and if so, the quorum-based server power-down mechanism determining whether a manager of the cluster failed when an indication of a server failure is received, and if a manager of the cluster failed, the quorum-based server power-down mechanism issues at least one command to power down all unresponsive servers in the cluster, wherein an unresponsive server is a server that fails to send a periodic message that indicates the server is functioning properly, and if a manager of the cluster did not fail, the quorum-based server power-down mechanism issues at least one command to power down a server corresponding to the received indication of server failure, wherein the quorum-based server power-down mechanism determines whether the power down of the at least one of the other servers was successful, and if the power down of the at least one of the other servers was successful, the quorum-based server power-down mechanism enables failing over any resources on the at least one of the other servers that was powered down to at least one server that is responsive, and if the power down of the at least one of the other servers was not successful, the quorum-based server power-down mechanism disables the cluster; and
(F) a service processor that receives the command and in response powers down at least one of the other servers.
2-4. (canceled)
5. A computer readable recordable medium bearing a computer program, the computer program comprising:
(A) a cluster engine that handles communications between a plurality of servers in a cluster, wherein at least one server in the cluster resides in a logical partition, the cluster engine comprising:
(A1) a heartbeat mechanism that sends a periodic message to other servers in the cluster to indicate the server process is functioning properly and that receives periodic messages from the other servers in the cluster that indicate the other servers in the cluster are functioning properly;
(A2) a membership change mechanism that generates a membership change message to all servers in the cluster when any of the servers in the cluster become unresponsive; and
(A3) a quorum-based server power-down mechanism that determines whether the server process is part of a group of servers that includes a majority of servers in the cluster, and if so, the quorum-based server power-down mechanism determines whether a manager of the cluster failed when an indication of a server failure is received, and if a manager of the cluster failed, the quorum-based server power-down mechanism issues at least one command to power down all unresponsive servers in the cluster, wherein an unresponsive server is a server that fails to send a periodic message that indicates the server is functioning properly, and if a manager of the cluster did not fail, the quorum-based server power-down mechanism issues at least one command to power down a server corresponding to the received indication of server failure, wherein the quorum-based server power-down mechanism determines whether the power down of the at least one of the other servers was successful, and if the power down of the at least one of the other servers was successful, the quorum-based server power-down mechanism enables failing over any resources on the at least one of the other servers that was powered down to at least one server that is responsive, and if the power down of the at least one of the other servers was not successful, the quorum-based server power-down mechanism disables the cluster.
6-7. (canceled)
US12/192,273 2004-11-04 2008-08-15 Quorum-based power-down of unresponsive servers in a computer cluster Abandoned US20080301490A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/192,273 US20080301490A1 (en) 2004-11-04 2008-08-15 Quorum-based power-down of unresponsive servers in a computer cluster

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/981,020 US20060100981A1 (en) 2004-11-04 2004-11-04 Apparatus and method for quorum-based power-down of unresponsive servers in a computer cluster
US12/192,273 US20080301490A1 (en) 2004-11-04 2008-08-15 Quorum-based power-down of unresponsive servers in a computer cluster

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/981,020 Continuation US20060100981A1 (en) 2004-11-04 2004-11-04 Apparatus and method for quorum-based power-down of unresponsive servers in a computer cluster

Publications (1)

Publication Number Publication Date
US20080301490A1 (en) 2008-12-04

Family

ID=36317533

Family Applications (4)

Application Number Title Priority Date Filing Date
US10/981,020 Abandoned US20060100981A1 (en) 2004-11-04 2004-11-04 Apparatus and method for quorum-based power-down of unresponsive servers in a computer cluster
US12/192,291 Expired - Fee Related US7908251B2 (en) 2004-11-04 2008-08-15 Quorum-based power-down of unresponsive servers in a computer cluster
US12/192,273 Abandoned US20080301490A1 (en) 2004-11-04 2008-08-15 Quorum-based power-down of unresponsive servers in a computer cluster
US12/192,282 Expired - Fee Related US7716222B2 (en) 2004-11-04 2008-08-15 Quorum-based power-down of unresponsive servers in a computer cluster

Family Applications Before (2)

Application Number Title Priority Date Filing Date
US10/981,020 Abandoned US20060100981A1 (en) 2004-11-04 2004-11-04 Apparatus and method for quorum-based power-down of unresponsive servers in a computer cluster
US12/192,291 Expired - Fee Related US7908251B2 (en) 2004-11-04 2008-08-15 Quorum-based power-down of unresponsive servers in a computer cluster

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/192,282 Expired - Fee Related US7716222B2 (en) 2004-11-04 2008-08-15 Quorum-based power-down of unresponsive servers in a computer cluster

Country Status (2)

Country Link
US (4) US20060100981A1 (en)
CN (1) CN1770707B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100318236A1 (en) * 2009-06-11 2010-12-16 Kilborn John C Management of the provisioning of energy for a workstation

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10143142A1 (en) * 2001-09-04 2003-01-30 Bosch Gmbh Robert Microprocessor-controlled operation of vehicular EEPROM memory, employs two memory areas with data pointers and cyclic validation strategy
WO2006121990A2 (en) * 2005-05-06 2006-11-16 Marathon Technologies Corporation Fault tolerant computer system
JP4662548B2 (en) * 2005-09-27 2011-03-30 株式会社日立製作所 Snapshot management apparatus and method, and storage system
US20080034053A1 (en) * 2006-08-04 2008-02-07 Apple Computer, Inc. Mail Server Clustering
US7673169B1 (en) * 2007-05-09 2010-03-02 Symantec Corporation Techniques for implementing an adaptive data access error handling policy
US8201016B2 (en) * 2007-06-28 2012-06-12 Alcatel Lucent Heartbeat distribution that facilitates recovery in the event of a server failure during a user dialog
JP5377898B2 (en) * 2008-07-10 2013-12-25 株式会社日立製作所 System switchover method for a computer system configured as a cluster
US8671218B2 (en) * 2009-06-16 2014-03-11 Oracle America, Inc. Method and system for a weak membership tie-break
US8108733B2 (en) * 2010-05-12 2012-01-31 International Business Machines Corporation Monitoring distributed software health and membership in a compute cluster
US9069571B2 (en) 2010-12-01 2015-06-30 International Business Machines Corporation Propagation of unique device names in a cluster system
US8943082B2 (en) 2010-12-01 2015-01-27 International Business Machines Corporation Self-assignment of node identifier in a cluster system
US8788465B2 (en) 2010-12-01 2014-07-22 International Business Machines Corporation Notification of configuration updates in a cluster system
US9203900B2 (en) * 2011-09-23 2015-12-01 Netapp, Inc. Storage area network attached clustered storage system
US8683170B1 (en) 2011-09-23 2014-03-25 Netapp, Inc. Consistent distributed storage communication protocol semantics in a clustered storage system
CN102546233A (en) * 2011-11-28 2012-07-04 中标软件有限公司 Method for realizing serial heartbeat in high-availability cluster
US9183148B2 (en) 2013-12-12 2015-11-10 International Business Machines Corporation Efficient distributed cache consistency
CN104506392B (en) * 2015-01-04 2018-10-30 华为技术有限公司 Server downtime (crash) detection method and device
US10049011B2 (en) 2016-05-03 2018-08-14 International Business Machines Corporation Continuing operation of a quorum based system after failures

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6081812A (en) * 1998-02-06 2000-06-27 Ncr Corporation Identifying at-risk components in systems with redundant components
US6363495B1 (en) * 1999-01-19 2002-03-26 International Business Machines Corporation Method and apparatus for partition resolution in clustered computer systems
US20020095470A1 (en) * 2001-01-12 2002-07-18 Cochran Robert A. Distributed and geographically dispersed quorum resource disks
US20020145983A1 (en) * 2001-04-06 2002-10-10 International Business Machines Corporation Node shutdown in clustered computer system
US6587950B1 (en) * 1999-12-16 2003-07-01 Intel Corporation Cluster power management technique
US6938084B2 (en) * 1999-03-26 2005-08-30 Microsoft Corporation Method and system for consistent cluster operational data in a server cluster using a quorum of replicas
US6944662B2 (en) * 2000-08-04 2005-09-13 Vinestone Corporation System and methods providing automatic distributed data retrieval, analysis and reporting services
US6950833B2 (en) * 2001-06-05 2005-09-27 Silicon Graphics, Inc. Clustered filesystem
US20050216442A1 (en) * 2002-01-31 2005-09-29 Barbara Liskov Methods and apparatus for configuring a content distribution network
US7016946B2 (en) * 2001-07-05 2006-03-21 Sun Microsystems, Inc. Method and system for establishing a quorum for a geographically distributed cluster of computers
US7024580B2 (en) * 2002-11-15 2006-04-04 Microsoft Corporation Markov model of availability for clustered systems
US20060080569A1 (en) * 2004-09-21 2006-04-13 Vincenzo Sciacca Fail-over cluster with load-balancing capability
US20060090095A1 (en) * 1999-03-26 2006-04-27 Microsoft Corporation Consistent cluster operational data in a server cluster using a quorum of replicas
US20060235952A1 (en) * 2001-09-06 2006-10-19 Bea Systems, Inc. Exactly once JMS communication
US20070016822A1 (en) * 2005-07-15 2007-01-18 Rao Sudhir G Policy-based, cluster-application-defined quorum with generic support interface for cluster managers in a shared storage environment

Also Published As

Publication number Publication date
US20080301272A1 (en) 2008-12-04
CN1770707B (en) 2010-08-11
US7908251B2 (en) 2011-03-15
US20060100981A1 (en) 2006-05-11
US7716222B2 (en) 2010-05-11
CN1770707A (en) 2006-05-10
US20080301491A1 (en) 2008-12-04

Similar Documents

Publication Publication Date Title
US7716222B2 (en) Quorum-based power-down of unresponsive servers in a computer cluster
US7251736B2 (en) Remote power control in a multi-node, partitioned data processing system via network interface cards
US6918051B2 (en) Node shutdown in clustered computer system
US7099996B2 (en) Disk array system
US6757836B1 (en) Method and apparatus for resolving partial connectivity in a clustered computing system
US6691225B1 (en) Method and apparatus for deterministically booting a computer system having redundant components
US7219260B1 (en) Fault tolerant system shared system resource with state machine logging
US5220668A (en) Digital data processor with maintenance and diagnostic system
US20030158933A1 (en) Failover clustering based on input/output processors
US8863278B2 (en) Grid security intrusion detection configuration mechanism
US20060242453A1 (en) System and method for managing hung cluster nodes
JPH09237226A (en) Method for highly reliable disk fencing in multi-computer system and device therefor
JP2003515813A5 (en)
US9116861B2 (en) Cascading failover of blade servers in a data center
US6823397B2 (en) Simple liveness protocol using programmable network interface cards
US20040109406A1 (en) Facilitating communications with clustered servers
KR20050058241A (en) Method and apparatus for enumeration of a multi-node computer system
US7120821B1 (en) Method to revive and reconstitute majority node set clusters
EP4250119A1 (en) Data placement and recovery in the event of partition failures
KR100305491B1 (en) Scheme to perform event rollup
US20040047299A1 (en) Diskless operating system management
Corsava et al. Intelligent architecture for automatic resource allocation in computer clusters
US20210406064A1 (en) Systems and methods for asynchronous job scheduling among a plurality of managed information handling systems
US20230216607A1 (en) Systems and methods to initiate device recovery
Lee et al. NCU-HA: A lightweight HA system for kernel-based virtual machine

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION