AWS Migration part 3: The technique behind Windows Failover Cluster on AWS


In the previous two blogs [1], I showed that it is possible to implement a Windows Failover Cluster in AWS. In this blog, I will explain the differences between a Failover Cluster on-premise and a Failover Cluster in AWS.

1. How does a Windows Failover Cluster work on-premise?

Though you can install a cluster with one or two nodes, you need at least three nodes to be able to fail over to another node when one of the nodes fails, unless you add a witness to a two-node cluster. The nodes generally use three networks: one public network (for access from the website, and also where the Domain Controller can be reached), one private network that carries traffic between the cluster nodes only, and one storage network, for example to connect to a SAN. In this example, I don’t use shared storage, so I use two networks.

[Figure: Windows Failover Cluster network]

There are different ways to make an application highly available; I used the cluster script role to do that. This script has to be written in Visual Basic Script (VBS). The only functionality I wrote in VBS is to call PowerShell to do all the work. There are seven hooks that the cluster can call: Online, Offline, IsAlive, LooksAlive, Open, Close and Terminate. When you look at the code, you will see that I implemented the first four of them.

Online is called when the cluster role comes online. It starts the W3SVC service, the application pool and the website. Offline is called when the cluster role is stopped: it stops the website, the application pool and the service.
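The start and stop order described above can be sketched as follows. This is an illustration in Python only: the real cluster script is VBS that calls PowerShell, and the names COMPONENTS, online and offline, as well as the start and stop callbacks, are hypothetical.

```python
# Components in the order in which Online starts them (as described in the text).
COMPONENTS = ["W3SVC service", "application pool", "website"]


def online(start):
    """Start every component, dependencies first."""
    for component in COMPONENTS:
        start(component)


def offline(stop):
    """Stop every component in the reverse order of Online."""
    for component in reversed(COMPONENTS):
        stop(component)
```

Passing the start and stop actions in as callbacks keeps the ordering logic separate from the commands that actually control IIS.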

LooksAlive is called quite often to check whether the role is working properly. IsAlive does the same, but it is called less often and can perform more checks. In the example application, LooksAlive checks whether the website, application pool and service are still running. The IsAlive script also looks at the content of the website: I added a task that runs forever and changes the website every second. The IsAlive script checks that the file that should be changing was indeed changed recently.
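The difference between the two checks can be sketched like this. It is a Python illustration under assumptions, not the PowerShell used in the example application: the state names, the function signatures and the 10-second threshold are all hypothetical.

```python
import os
import time


def looks_alive(states):
    """Cheap check: every component (service, app pool, website) is started."""
    return all(state == "Started" for state in states.values())


def is_alive(states, heartbeat_file, max_age_seconds=10.0, now=None):
    """Thorough check: the cheap check plus freshness of the changing file."""
    if not looks_alive(states):
        return False
    try:
        modified = os.path.getmtime(heartbeat_file)
    except OSError:
        return False  # a missing or unreadable file counts as a failure
    current = time.time() if now is None else now
    # Assumed threshold: the file must have changed within max_age_seconds.
    return (current - modified) <= max_age_seconds
```

The threshold has to be clearly larger than the one-second update interval, otherwise a single slow update would already trigger a failover.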

When the VBS script is added to the cluster, the cluster also registers the name of the cluster role in DNS. This is why you can use http://myclusteriis as an address to check whether everything works well.

Let’s assume that the cluster is running on Cluster Node 1. Node 1 then has several IP addresses attached to it: it has its own IP addresses for the different networks, and it also has an IP address for the cluster. When the cluster moves to another node, the IP address of the cluster moves with it. The IP address is disconnected from the network adapter on Node 1 and reconnected to, for example, Node 2. In an on-premise situation with a “flat network”, you don’t have to do anything yourself to make this work: Windows Server takes care of it.

2. Implementing Windows Failover Cluster in AWS

In AWS, things work differently. In my example I use two DNS addresses for the public network: one for the Domain Controller and another one for the AWS DNS. This is necessary because the cluster configuration needs to call AWS services. The Failover Cluster assumes that the DNS on the Domain Controller is connected to the other DNS servers, and in our situation this isn’t the case. This leads to errors in the Windows Failover Cluster: the cluster wants to add both the name of the cluster and the name of the cluster role to all the DNS servers configured on the network adapter. Even though it is allowed to do so on the Domain Controller, it will not do so, because adding the name to the AWS DNS fails. To solve this, I added the name of the cluster role to the DNS via the main CloudFormation template. You can safely ignore the error messages about the DNS that you see in the Failover Cluster Manager.

In the on-premise situation, the IP address is moved by the Windows Failover Cluster itself. In AWS, this is not enough: the IP address of the cluster has to be known to AWS as well. When the IP address is not known to AWS, the web servers (or, in our demo application, the Demo node) will not be able to connect to the cluster. There are two CLI commands that solve this. In the Online script, it is the command:

aws ec2 assign-private-ip-addresses --allow-reassignment […] --private-ip-addresses

In the Offline script, the following command reverts this change:

aws ec2 unassign-private-ip-addresses […]  --private-ip-addresses

The --allow-reassignment flag means that you can use the same command to assign these IP addresses to (the same or) another node without calling unassign-private-ip-addresses first. This is needed when a node drops out: in general, that node doesn’t have time to unassign its addresses. Though it is possible to skip unassigning the private IP addresses in the Offline script (this might save some time), I chose not to do so: I wanted the AWS environment to work the same way as a “flat” network on-premise.
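As a sketch, the two command lines could be assembled as shown below. The arguments that are elided in the commands above stay elided there. This example assumes that the interface is selected with --network-interface-id, which is the parameter the AWS CLI uses to identify the network interface, and the helper names are hypothetical. It is written in Python for illustration only, since the actual scripts call the CLI from PowerShell.

```python
import subprocess


def assign_command(eni_id, ip_addresses):
    """Command for the Online hook: claim the cluster IP addresses."""
    return ["aws", "ec2", "assign-private-ip-addresses",
            "--network-interface-id", eni_id,   # assumed way to pick the adapter
            "--allow-reassignment",             # take over from a failed node
            "--private-ip-addresses", *ip_addresses]


def unassign_command(eni_id, ip_addresses):
    """Command for the Offline hook: release the cluster IP addresses."""
    return ["aws", "ec2", "unassign-private-ip-addresses",
            "--network-interface-id", eni_id,
            "--private-ip-addresses", *ip_addresses]


def run(command):
    """Run a command and raise CalledProcessError if the AWS CLI fails."""
    subprocess.run(command, check=True)
```

For example, run(assign_command(eni_id, [cluster_ip])) would be called from Online on the node that takes over, with eni_id and cluster_ip supplied by your own configuration.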

The AWS commands work asynchronously. Though it is possible to use the aws ec2 describe-instances command to see whether the IP addresses are attached correctly, I chose not to do so: this command takes much more time than doing a curl to the metadata on the virtual machine [2]. One of the curious things I found out when implementing this is that the standard way of doing a curl in PowerShell, Invoke-WebRequest, doesn’t give any output when it is called from the VBS cluster script. Using the curl command that is available in Windows cmd.exe solved this issue.
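Because the assignment is asynchronous, the Online hook has to wait until the address actually shows up. A minimal polling sketch in Python is given below. It assumes the instance metadata path network/interfaces/macs/<mac>/local-ipv4s and plain (IMDSv1) access; if IMDSv2 is enforced, a session token header would be needed as well. The function names, the number of attempts and the delay are hypothetical.

```python
import time
import urllib.request

METADATA_BASE = "http://169.254.169.254/latest/meta-data"


def fetch_local_ipv4s(mac, timeout=2.0):
    """Return the private IPv4 addresses of one interface as raw text."""
    url = f"{METADATA_BASE}/network/interfaces/macs/{mac}/local-ipv4s"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("ascii")


def wait_for_ip(ip, fetch, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll fetch() until ip is listed; return False when attempts run out."""
    for attempt in range(attempts):
        # split() gives exact addresses, so 10.0.0.5 does not match 10.0.0.50
        if ip in fetch().split():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

On a node this could be called as wait_for_ip(cluster_ip, lambda: fetch_local_ipv4s(mac)), where cluster_ip and mac come from your own configuration. Injecting fetch and sleep also makes the loop easy to test without a metadata service.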

One last remark: you might ask yourself why I am using VMs with 4 vCPUs and 16 GB of memory (m5a.xlarge), where my on-premise VMs had 2 CPUs and 2 GB of memory. Why not use the cheaper m5a.large instance type, which has 2 vCPUs and 8 GB of RAM? The simple answer is: because the m5a.large didn’t work. The task that changes the website wasn’t running every second. Not even every two seconds. There were times when it ran only once in 5 or 6 seconds. That made it unsuitable for my goal. Changing m5a.large to m5.large solved part of this problem, but the time on the website still failed to change too often. The failover part of the functionality, however, worked fine. You might consider using the m5a.large or m5.large instance type for your own applications.

Next blog…

In the next blog, I will show how the CloudFormation scripts work together and how the cluster nodes are built up.


[1] Previous blogs:

– First blog, overview:

– Second blog, installation:

[2] This link is mentioned in the boto3 documentation for assign-private-ip-addresses: