3 or More Nodes - Failover Process
An item of confusion with clustering often pops up in class. The question is: how does the cluster service decide which node to fail over to in the event of a cluster group failure? The answer is not so straightforward. First, let’s look at the different attributes that contribute to the decision.
- Order of Installation - Simply, in what order were the nodes installed? Was NodeA installed before NodeB, NodeB before NodeC, and NodeC before NodeD?
- Possible Owners - This attribute is a property of an individual resource. It strictly defines which nodes can run the resource. Because a resource only runs as part of its cluster group, it also restricts where that group can run. For example, if the resource is part of a cluster group running on NodeA and it fails, the group will not fail over to NodeB if NodeB is not listed as a possible owner of the resource.
- Preferred Owner - This attribute is a property of the cluster group. It is used to define a priority list for where a cluster group should run. In the event of failover, the group will fail over to the available node highest on the preferred owners list.
- AntiAffinityClassNames - This is a new property for a cluster group in Windows Server 2003. Basically, each cluster group can be configured with a list of “names” or terms. For example, a cluster group on NodeA can be configured with “SQL” and a cluster group on NodeB can be configured with “SQL” as well. In the event that NodeA fails, the cluster service will avoid failing the group configured with the AntiAffinityClassName of “SQL” over to NodeB, because NodeB already hosts a group with that name. The cluster service tries to keep the two cluster groups off the same node.
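To make the attributes above concrete, here is a minimal Python sketch of how they might be modeled. It is not the cluster service's own data model: the class names `Resource` and `Group` and the helpers `can_host` and `shares_anti_affinity` are hypothetical and exist only for illustration. Order of installation is left out because it belongs to the cluster as a whole, not to a resource or group.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Resource:
    """Hypothetical model of a cluster resource."""
    name: str
    possible_owners: Set[str]  # nodes that are allowed to run this resource


@dataclass
class Group:
    """Hypothetical model of a cluster group."""
    name: str
    resources: List[Resource] = field(default_factory=list)
    preferred_owners: List[str] = field(default_factory=list)  # priority order
    anti_affinity_class_names: Set[str] = field(default_factory=set)


def can_host(group: Group, node: str) -> bool:
    """A node can host the group only if every resource lists it as a possible owner."""
    return all(node in r.possible_owners for r in group.resources)


def shares_anti_affinity(group: Group, other: Group) -> bool:
    """True when the two groups share at least one AntiAffinityClassNames entry."""
    return bool(group.anti_affinity_class_names & other.anti_affinity_class_names)


# Illustrative values only.
sql1 = Group("SQL1", [Resource("Disk1", {"NodeA", "NodeC"})], ["NodeC"], {"SQL"})
sql2 = Group("SQL2", [Resource("Disk2", {"NodeB", "NodeC"})], [], {"SQL"})

print(can_host(sql1, "NodeB"))           # NodeB is not a possible owner of Disk1
print(shares_anti_affinity(sql1, sql2))  # both groups carry the "SQL" name
```

Note where each attribute lives in the sketch: possible owners sit on the resource, while preferred owners and anti-affinity names sit on the group.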
Now that we understand the attributes and properties, let’s look at the failover process:
- A cluster group on NodeA reaches the threshold for failover and fails. Cluster group failure depends on resource failures. For example, if a single resource fails more than 3 times in 900 seconds (default settings), then it will cause the entire cluster group to fail over if the resource’s “Affect the group” check box is enabled. If multiple failures among multiple resources in a single group exceed 10 in 6 hours (default settings), that will also cause the cluster group to fail over (again, assuming the “Affect the group” check box is enabled).
- Nodes are checked to see if they are available. Available means that they are online and running and that they don’t have any restrictions against running, like not being included on the possible owners list. If nodes are not able to run the application, then they are not considered available.
- The cluster group looks to its preferred owners list and selects the available node highest on the list. If no nodes are listed, then this step is skipped.
- If the preferred owners list does not produce a target, the cluster group will fail over to an available node based on installation order. So, after NodeA comes NodeB, and the cluster group will attempt to fail over to NodeB. If NodeB is not available, then it will fail over to NodeC (3rd on the list in the installation order).
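The node selection steps above can be sketched in Python roughly as follows. This is an illustration of the logic as described here, not the cluster service's actual algorithm. The function `pick_failover_target` and its parameters are hypothetical, anti-affinity is left out to keep it short, and wrapping around to the start of the installation order is an assumption the steps do not spell out.

```python
from typing import List, Optional, Set


def pick_failover_target(
    failed_node: str,
    install_order: List[str],
    online_nodes: Set[str],
    possible_owners: Set[str],
    preferred_owners: List[str],
) -> Optional[str]:
    """Return the node the group should move to, or None if no node qualifies."""
    # Availability check: online, allowed to run the resources, and not the failed node.
    available = {
        n for n in install_order
        if n != failed_node and n in online_nodes and n in possible_owners
    }
    if not available:
        return None

    # Preferred owners: the highest-ranked available node wins.
    for node in preferred_owners:
        if node in available:
            return node

    # Installation order: start with the node installed right after the failed one.
    # Wrapping around to the start of the list is an assumption.
    start = install_order.index(failed_node) + 1
    for node in install_order[start:] + install_order[:start]:
        if node in available:
            return node
    return None


nodes = ["NodeA", "NodeB", "NodeC", "NodeD"]
print(pick_failover_target("NodeA", nodes, {"NodeC", "NodeD"}, set(nodes), []))
```

In the usage line, NodeB is offline and no preferred owners are set, so the sketch falls through to installation order and picks the next available node after NodeA.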
In the event there are no available nodes, the cluster group will just fail and remain offline in a failed state until manual intervention takes place.
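The failure thresholds mentioned in the first step of the process (more than 3 failures in 900 seconds for a resource, more than 10 in 6 hours for a group) behave like a sliding time window. The Python sketch below shows that idea only. The class `FailureWindow` is hypothetical, and the real cluster service counts restart attempts in its own way, so treat this as an approximation of the wording above and not of the actual implementation.

```python
from collections import deque


class FailureWindow:
    """Counts failures inside a sliding time window (hypothetical helper)."""

    def __init__(self, threshold: int, period_seconds: float) -> None:
        self.threshold = threshold
        self.period = period_seconds
        self.failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, now: float) -> bool:
        """Record a failure at time `now` (in seconds).

        Returns True when the number of failures within the period
        exceeds the threshold.
        """
        self.failures.append(now)
        # Drop failures that have aged out of the window.
        while self.failures and now - self.failures[0] > self.period:
            self.failures.popleft()
        return len(self.failures) > self.threshold


# Defaults quoted above: 3 failures in 900 seconds for a single resource.
resource_window = FailureWindow(threshold=3, period_seconds=900)
for t in (0, 100, 200, 300):
    exceeded = resource_window.record_failure(t)
print(exceeded)  # the fourth failure inside 900 seconds exceeds the threshold
```

A group-level window would use the same class with a threshold of 10 and a period of 6 * 3600 seconds.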