When you want to design a highly available infrastructure, Microsoft Failover Clustering comes into play. A failover cluster is a group of servers that work together to increase the availability and scalability of clustered roles. As a consultant in the EUC space, the main clustered roles I use are Microsoft SQL Always On (for the Citrix databases, for example) and File Servers (user data).
When designing such a solution, you need to be aware of split-brain scenarios. Split-brain happens when cluster nodes cannot communicate with each other. Each side can then try to own the clustered roles, which can lead to serious problems such as data corruption or data loss.
To avoid this kind of situation, the concept of quorum has been implemented within the Failover Clustering solution.
Quorum determines the number of failures that the cluster can sustain while still remaining online. When the nodes are split into subsets that cannot communicate with each other, the cluster forces the cluster service to stop in all but one subset, so that there is only one true owner of a particular resource group. Once the stopped nodes can communicate with the main group of nodes again, they automatically rejoin the cluster and start their cluster service.
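To make the majority idea concrete, here is a minimal sketch of the vote arithmetic. It assumes a plain static majority where every node (and the witness, if any) holds one vote. The function names are mine, for illustration only, and the sketch ignores the dynamic vote adjustment that Failover Clustering performs at runtime.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """Return True when a partition holds a strict majority of all votes."""
    return votes_reachable > total_votes // 2


def surviving_partitions(partition_votes: list[int]) -> list[bool]:
    """For each isolated partition, tell whether it keeps its cluster service running.

    partition_votes holds the number of votes in each group of nodes that can
    still talk to each other (hypothetical input, not a cluster API).
    """
    total = sum(partition_votes)
    return [has_quorum(votes, total) for votes in partition_votes]


# Five votes split 3/2: only the larger side keeps running.
print(surviving_partitions([3, 2]))  # [True, False]
# Four votes split 2/2: nobody has a majority, so both sides stop.
print(surviving_partitions([2, 2]))  # [False, False]
```

The 2/2 case is the reason a witness is added to clusters with an even number of nodes: it provides the odd vote that breaks the tie.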
The following table gives an overview of the Cluster Quorum outcomes per scenario:
| Number of nodes | Can survive one server node failure | Can survive one server node failure, then another | Can survive two simultaneous server node failures |
| --- | --- | --- | --- |
| 2 | 50/50 | No | No |
| 2 + Witness | Yes | No | No |
| 3 | Yes | 50/50 | No |
| 3 + Witness | Yes | Yes | No |
| 4 | Yes | Yes | 50/50 |
| 4 + Witness | Yes | Yes | Yes |
| 5 and above | Yes | Yes | Yes |
As you can see, a two-node cluster requires a witness to be 100% sure of surviving one server node failure.
One important thing to add is the location of the witness. Most of my customers run their infrastructure in two different datacenters to have a Disaster Recovery plan. If you install the witness in the same datacenter as one of the nodes, you only cover a node failure. If that entire datacenter goes down, the remaining node will be alone, will lose quorum, and will stop its cluster service.
To be sure that the cluster survives a datacenter outage, the witness should be installed in a third datacenter.
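The effect of the witness location can be checked with the same kind of vote counting. This is a small sketch under the same assumptions as before (one vote per node, one for the witness, static majority). The site names `dc1`, `dc2`, and `dc3` and the function name are made up for the example.

```python
def survives_site_outage(node_sites: list[str], witness_site: str, failed_site: str) -> bool:
    """Return True if the votes outside the failed site still form a majority.

    node_sites lists the datacenter of each node; witness_site is where the
    witness lives. All names are hypothetical labels, not cluster objects.
    """
    votes = list(node_sites) + [witness_site]
    remaining = sum(1 for site in votes if site != failed_site)
    return remaining > len(votes) // 2


nodes = ["dc1", "dc2"]

# Witness next to the first node: losing dc1 leaves 1 vote out of 3.
print(survives_site_outage(nodes, "dc1", "dc1"))  # False
# Same layout, but dc2 fails instead: 2 votes out of 3 remain.
print(survives_site_outage(nodes, "dc1", "dc2"))  # True
# Witness in a third site: any single datacenter outage leaves 2 votes out of 3.
print(all(survives_site_outage(nodes, "dc3", site) for site in ["dc1", "dc2", "dc3"]))  # True
```

With the witness co-located, the cluster only survives when the "right" datacenter fails. With the witness in a third site, it no longer matters which one goes down.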
Fortunately, Microsoft is pushing its own datacenter solution, Microsoft Azure, and gives you the ability to use Azure as your third datacenter. Cloud Witness is a new type of Failover Cluster quorum witness that leverages Microsoft Azure as the arbitration point. It uses Azure Blob Storage to read/write a blob file, which is then used as an arbitration point in case of split-brain resolution.
Microsoft lists the following significant benefits of this approach:
- Leverages Microsoft Azure (no need for a third separate datacenter).
- Uses standard available Azure Blob Storage (no extra maintenance overhead of virtual machines hosted in public cloud).
- Same Azure Storage Account can be used for multiple clusters (one blob file per cluster; cluster unique id used as blob file name).
- Very low ongoing cost to the Storage Account (very little data is written per blob file, and the blob file is updated only when the cluster nodes’ state changes).
- Built-in Cloud Witness resource type.
It costs only a few euros or dollars a month to host your witness on Azure. Compared to the cost of a third datacenter, it’s ridiculously cheap!
In my next post, you will see how to configure a SQL Always On witness on Azure, stay tuned!