In this series, I will look at the migration from on-premise Windows Failover Clusters to AWS. What is the difference in recovery times between the application on-premise, the 1:1 migration of a Failover Cluster to AWS and the commonly used pattern of an Auto Scaling Group with one node? What is the difference in costs?
Windows Failover Clusters are typically used for the second tier in a three-tier application: web services send HTTP requests to programs running on these servers and then present the answer via their web pages. These programs commonly use the database to store the results.
There are multiple ways to use a Failover Cluster, but in general these systems are designed to run a single instance of a program. Think, for example, of a program that assists in selling concert tickets. If you had multiple instances of the application, you could end up selling the same concert ticket twice. In more modern AWS applications this problem could also be solved with serverless Lambda functions or serverless containers.
But in cloud migrations, companies generally want to move to the cloud as fast as possible, so there is usually not much time to rewrite code. We might redesign the application later, or replace it with totally new software. The project to migrate the on-premise systems to the cloud has to deliver fast. So, with this in the back of our minds, what are our options? How fast are failovers in the different solutions? And what does it cost?
In the next paragraphs I will explain the different solutions and their (infrastructural) costs. The costs of the web servers and the database are not included in this overview. The costs of the data going to or from AWS are not calculated either: both types of costs are the same for every solution, as far as the data going to and from the virtual machines is concerned. In the Auto Scaling Group with one node solution, you have to add the costs of the data that flows through the load balancer. You can use the AWS Pricing Calculator to determine these costs.
Current situation: on-premise (Hyper-V demo) environment
Before a migration starts, we should know what the current situation is. How fast is a failover in the current situation? When we look at the migrated application in AWS, the recovery time will always be longer than in the current situation. I think that we should test the current on-premise situation in every migration and discuss the results, and the expected results after the migration, with the application owner. A longer recovery time in the migrated situation might be an acceptable change, but it must be clear to the application owner that they accept a bigger risk of a long recovery time than they did when the system ran on-premise.
Windows Failover Clusters are designed to run in an environment with Active Directory. The cluster name and the cluster services get their own IP address and that IP address is stored in the DNS, which is part of the Domain Controller (DC). In the Failover Cluster, there are in general three (or more) nodes: when one node falls out, one of the other nodes will take over. Windows checks regularly to see if all the nodes are well. When one of the nodes doesn’t respond, the cluster is switched to another node. The advantage is that this node is already running, so this goes pretty fast.
I made a demo “on premise” application in my Hyper-V environment. It consists of a very simple script that creates a webpage that shows the node name and the current time. A cluster script will start the IIS service, application pool and website on the node where the cluster is running. When the cluster is moved to another node, the service, application pool and web site are stopped and started by this script. There is no storage in this cluster.
You can follow and play along by using the scripts in my GitHub repository. More about this in the next blog in this series. I did two types of tests. The first tests move the cluster to another node via the GUI of the Windows Failover Cluster Manager, and I looked at the duration of these moves. I did that 6 times within the same cluster, without changing the configuration. Moving a cluster goes pretty fast: in 10 to 14 seconds the cluster is running on another node. You will see the last timestamp at which the old node responded after the command to move the cluster was given, and then the time of the first response after the cluster is running on the other node. After that, the difference between these times is calculated:
In the second test, I played less nice: what would happen when one of the nodes is stopped? I tested stopping the nodes in Hyper-V, and then looked how fast other nodes take over the tasks of the failed node. These results are fast as well: between 28 and 42 seconds.
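The durations in both tests boil down to one subtraction: the last moment the old node answered versus the first moment the new node answers. A minimal Python sketch of that calculation follows. The function name, the sample format and the sample values are assumptions for illustration only; they are not the scripts or the measurements from the repository.

```python
from datetime import datetime


def failover_gap(samples, old_node, new_node):
    """Seconds between the last answer of old_node and the first later answer of new_node.

    samples is a list of (ISO timestamp, node name) pairs, one per polled response.
    """
    parsed = sorted((datetime.fromisoformat(ts), node) for ts, node in samples)
    last_old = max(ts for ts, node in parsed if node == old_node)
    first_new = min(ts for ts, node in parsed if node == new_node and ts > last_old)
    return (first_new - last_old).total_seconds()


# Hypothetical samples, NOT measurements from the tests in this blog.
samples = [
    ("2020-05-01T10:00:00", "node1"),
    ("2020-05-01T10:00:01", "node1"),
    ("2020-05-01T10:00:13", "node2"),
    ("2020-05-01T10:00:14", "node2"),
]

print(failover_gap(samples, "node1", "node2"))  # 12.0
```

Polling once per second, as assumed here, limits the precision of the result to about one second.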
Please keep in mind that this is the performance that both ICT staff and end users will expect from the migrated situation in AWS if you don’t tell them otherwise. When the servers were created a long time ago, people might not even know the failover time of a stopped on-premise cluster node and only know the cluster move times, because they simply never dared to stop an active node. They were (and are) too afraid that the node, or the cluster as a whole, would be damaged by doing this. When they expect that the failover time in AWS will be (about) the same as the move times of the on-premise cluster, they will be very disappointed by the failover times when a node fails in AWS.
The costs of the current implementation will depend on the on-premise situation. You might have one or more data centers and have more or fewer servers in those data centers. You might have a lot of redundancy implemented, or not much. The costs will vary with these choices. It is therefore impossible to give an estimate of the costs of the current situation in this blog.
Solution 1: Windows Failover Cluster in AWS
When the deployment of a Windows Failover Cluster is scripted, you would expect that it can be migrated to AWS pretty fast. This is, however, not the case. In a future blog I will write more about the implementation of a Windows Failover Cluster in AWS. You will see many things that work differently in AWS (or differently than expected) compared to on-premise.
One of these things is that all nodes of the Windows Failover Cluster have to be within the same AWS Availability Zone. Every cluster has an IP address. When the cluster is switched, the node that takes over has to be able to assign itself the IP address of the cluster. The disadvantage is that the Failover Cluster solution in AWS is not as redundant as other AWS solutions. When you choose this solution, you have to think about the risks. The main question is: which risk is bigger, the risk of a specific virtual machine that isn’t working anymore or the risk of inaccessible nodes in an Availability Zone?
I did the same tests with the Windows Failover Cluster in AWS as I did with the Windows Failover Cluster in my Hyper-V environment. Moving a cluster with the Failover Cluster Manager GUI in AWS gives very stable results: it takes between 24 and 26 seconds:
When you stop one of the nodes and wait until another node has taken over the tasks of the stopped node, the results are dependent on the speed with which AWS will move the IP address to the other node. The total failover time is between 25 and 55 seconds, with an average of 39 seconds. The results are:
I started the whole environment from scratch between the different tests in which I stopped a node, both in the cloud and “on-premise”, to be sure that previous stops do not affect the response time of stopping a node.
The costs for this solution are the costs of the virtual machines: $880 per month. I calculated the costs of one domain controller and three Windows Failover Cluster nodes. The costs are based on the assumption that reserved instances are used, for one year, paid in advance, in AWS region Ireland (eu-west-1). They consist of the costs of both the instances and the disks.
Solution 2: Auto Scaling Group with one node, default settings
One solution to migrate this application as fast as possible is to use a load balancer in combination with an Auto Scaling Group. The size of the Auto Scaling Group is always one. Both the load balancer and the virtual machine (EC2) infrastructure will check the health of the node. When the node is not healthy, it will be replaced by a new node. Let’s look in more detail at how this works.
When you configure the health checks of a load balancer, you have four variables that you can use:
- Healthy threshold (defaults to: 5)
- Unhealthy threshold (defaults to: 2)
- Timeout (defaults to: 5 seconds)
- Interval (defaults to: 30 seconds)
You can also configure a draining period (which defaults to 300 seconds), but this doesn’t affect the failover time:
You can see the defaults in the image: you can see that the load balancer does a health check every 30 seconds, and at the fifth checkpoint that doesn’t fail it will consider the node healthy. When the node is not considered healthy (yet), no traffic will be directed to this node. You might think that this means that it will take more than 2 minutes between the moment that the node is started and configured and the moment the traffic is redirected to a new node, but in reality this is much faster: it is about 8 seconds.
When something goes wrong within the 30-second interval, the next time the load balancer checks the health it will not get an answer from the node. The timeout will be reached and, one interval later, the load balancer will check again. When this check also times out, the node will be considered unhealthy because the unhealthy threshold is two. In this configuration, we expect the health check to fail between 5 + 30 + 5 = 40 seconds and 29 + 5 + 30 + 5 = 69 seconds. The extra 29 seconds in the upper bound is the wait for the next check when the failure happens just after a successful check.
When the node is unhealthy, the Auto Scaling Group will be informed and will then start a new node. When a draining period is configured, the connections will be drained during this period, and when the draining period is over the node will be terminated. Starting a new node will take time: time to start the new virtual machine, time to configure the node and time to install our software.
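The arithmetic above can be written down as a small Python sketch. It only mirrors the reasoning in this paragraph (checks spaced exactly one interval apart, whole seconds, every failed check running into the full timeout); it is a simplification and not a description of how the load balancer is implemented internally. The function name is made up for this example.

```python
def detection_window(interval, timeout, unhealthy_threshold=2):
    """Return (fastest, slowest) expected time in seconds until a node is marked unhealthy.

    Fastest: the failure happens right before a health check.
    Slowest: the failure happens right after a successful check, so we first
    wait (interval - 1) seconds for the next check to start.
    """
    # Every failed check costs one timeout; between failed checks we wait one interval.
    failing_checks = unhealthy_threshold * timeout + (unhealthy_threshold - 1) * interval
    return failing_checks, (interval - 1) + failing_checks


# Default settings: interval 30 seconds, timeout 5 seconds, unhealthy threshold 2.
print(detection_window(interval=30, timeout=5))  # (40, 69)
```

Other health check settings can be plugged in the same way to estimate their detection window before testing them.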
I tested this situation with default values in two ways: first by stopping the website and then by stopping the server itself. When the website is stopped, it takes on average 9 minutes to get a new node up and running. In the table, you can also see when the health checks of the load balancer reported unhealthy and when the instance is going out of service in the Auto Scaling Group. I also looked when a new instance is started and when the configuration of the node is started. All the relative times are relative to the moment when the website is stopped.
An example: in the first line, you can see that it took 58 seconds for the health check in the Target Group (which is part of the load balancer) to report that the node is unhealthy. It then took more than a minute (!) to pass this information to the Auto Scaling Group, which took the instance out of service. It then took another 23 seconds to start a new node. 4 minutes and 51 seconds after stopping the website, the configuration of the node is started (= Start part1.ps1). In total, it took 8 minutes and 56 seconds to have our application ready to be used by our customers again.
When the node is stopped, it also takes about 9 minutes to get a new node up and running. While the load balancer sees directly that something is wrong, the EC2 health check takes between 32 seconds and more than 2 minutes to get the instance out of service.
Configuring this solution is much easier than configuring the Windows Failover Cluster solution, and the operational costs are also much lower: about $238 per month.
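To see where the time goes in this example, the three moments that are given exactly (0:58, 4:51 and 8:56) can be turned into phase durations. The sketch below is only arithmetic on those numbers; the phase names and the helper function are chosen for this illustration.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Moments from the example, all relative to stopping the website.
unhealthy_reported = to_seconds("0:58")     # Target Group reports unhealthy
configuration_started = to_seconds("4:51")  # Start part1.ps1
application_ready = to_seconds("8:56")      # application usable again

detection = unhealthy_reported
replace_and_boot = configuration_started - unhealthy_reported
install_and_configure = application_ready - configuration_started

print(detection, replace_and_boot, install_and_configure)  # 58 233 245
```

In this single example, installing and configuring the node (245 seconds) is the largest of the three phases, and detection is the smallest.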
Solution 3: Auto Scaling Group with one node, fastest settings
This solution is technically the same as the previous solution. I used the fastest settings:
- Healthy threshold (2 instead of 5 times)
- Unhealthy threshold (2 is both the minimum and the default)
- Timeout (2 seconds instead of 5)
- Interval (5 seconds instead of 30)
- Draining period (0 seconds is the minimum)
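As an illustration of how these values map onto the load balancer API, here is a sketch in Python. It assumes an Application Load Balancer target group that is managed with boto3; the templates in this series use CloudFormation instead, so this is not how the test environment was built. The function only builds the arguments and does not call AWS. The function name and the ARN are placeholders.

```python
def fastest_health_check_settings(target_group_arn):
    """Build the arguments for the fastest settings listed above (no AWS call is made)."""
    health_check = {
        "TargetGroupArn": target_group_arn,
        "HealthyThresholdCount": 2,
        "UnhealthyThresholdCount": 2,
        "HealthCheckTimeoutSeconds": 2,
        "HealthCheckIntervalSeconds": 5,
    }
    # The draining period is a target group attribute, not a health check setting.
    attributes = {
        "TargetGroupArn": target_group_arn,
        "Attributes": [
            {"Key": "deregistration_delay.timeout_seconds", "Value": "0"},
        ],
    }
    return health_check, attributes


# Placeholder ARN, replace it with the ARN of your own target group.
health_check, attributes = fastest_health_check_settings(
    "arn:aws:elasticloadbalancing:region:account-id:targetgroup/example/0000000000000000"
)

# With boto3 installed and credentials configured, these could be applied with:
#   elbv2 = boto3.client("elbv2")
#   elbv2.modify_target_group(**health_check)
#   elbv2.modify_target_group_attributes(**attributes)
print(health_check["HealthCheckIntervalSeconds"], health_check["HealthCheckTimeoutSeconds"])  # 5 2
```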
The image is (about) the same: there is no draining period, the timeouts and intervals are smaller and the number of checks before the node is in service is lower. No draining period doesn’t mean that the node is terminated faster, but this doesn’t matter because the termination doesn’t have any effect on the failover time:
Based on this image, we would expect the health checks of the load balancer to fail between 2 + 5 + 2 = 9 seconds and 4 + 2 + 5 + 2 = 13 seconds. In reality, it takes slightly longer than these theoretical minimums.
When you look at the logging of the website traffic in the IIS log (which can be found in the directory C:\inetpub\logs\LogFiles\W3SVC1), you can see the difference between ELB checks and normal traffic:
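If you want to separate the two kinds of requests in a script, one option is to look at the user agent column of the IIS log. The Python sketch below assumes the W3C log format with a #Fields header and assumes that the health checks identify themselves with a user agent starting with ELB-HealthChecker. The function name and the sample lines are made up for illustration and are not taken from the test environment.

```python
def split_elb_checks(log_lines):
    """Split IIS W3C log lines into (health check requests, other requests)."""
    fields = []
    health_checks, other = [], []
    for line in log_lines:
        if line.startswith("#Fields:"):
            # The header names the columns of the rows that follow.
            fields = line.split()[1:]
            continue
        if line.startswith("#") or not line.strip():
            continue  # other comment lines and blank lines
        row = dict(zip(fields, line.split()))
        if row.get("cs(User-Agent)", "").startswith("ELB-HealthChecker"):
            health_checks.append(row)
        else:
            other.append(row)
    return health_checks, other


# Hypothetical log lines, NOT copied from the logs of the tests in this blog.
sample_log = [
    "#Fields: date time cs-method cs-uri-stem c-ip cs(User-Agent) sc-status",
    "2020-05-01 10:00:00 GET / 10.0.1.10 ELB-HealthChecker/2.0 200",
    "2020-05-01 10:00:03 GET / 10.0.1.10 Mozilla/5.0+(Windows+NT+10.0) 200",
    "2020-05-01 10:00:05 GET / 10.0.2.20 ELB-HealthChecker/2.0 200",
]

checks, normal = split_elb_checks(sample_log)
print(len(checks), len(normal))  # 2 1
```

The time between two consecutive health check rows from the same address should then be close to the configured interval.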
It is no surprise that these changes reduce the failover time: when the website is stopped, it is between 7:21 and 8:34 (on average 7:37); when the node is stopped, it is between 7:21 and 8:34 (on average 7:56).
Solution 4: Auto Scaling Group with custom image
When you look at a failover time of (on average) between 7.5 and 8 minutes, that’s not impressive. There is, however, a way to speed this up: we can first create a custom image, and then use that custom image in the Auto Scaling Group.
In the creation of the image, we follow about the same steps as in Solution 3. The process ends with SysPrepping the image. This prepares the image to be rolled out the next time with different MAC addresses for the network card, different IP addresses etc. This makes it possible to use the same image in different networks in different Availability Zones, within the same Auto Scaling Group. I used the Amazon EC2Launch software to do this; it is additionally installed during the installation phase of the image:
When the image is ready, the Auto Scaling Group is started. During the startup, no software has to be installed or configured. This saves two very time consuming steps:
The result is that stopping the website leads to failover times of between 3.5 minutes and about 6 minutes (with an average of 4 minutes and 16 seconds). When the VM is stopped, the failover times are between 3.5 minutes and 4.5 minutes (with an average of 3 minutes and 52 seconds).
Even though the failover time dropped from about 9 minutes to on average 4 minutes, this is still more than five times as slow as the migrated Windows Failover Cluster solution. And this Auto Scaling Group with Custom Image solution is more than 6.5 times as slow as when an on-premise node stops working.
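The comparison with the migrated Windows Failover Cluster can be checked with the averages mentioned earlier: 39 seconds for a stopped node in the AWS cluster, and 4:16 and 3:52 for the custom image solution. A small Python sketch of that division (the function name is chosen for this example):

```python
def slowdown(seconds, baseline_seconds):
    """How many times slower 'seconds' is than 'baseline_seconds', rounded to one decimal."""
    return round(seconds / baseline_seconds, 1)


aws_cluster_average = 39        # stopped node, Windows Failover Cluster in AWS
website_stopped = 4 * 60 + 16   # custom image, website stopped (4:16)
vm_stopped = 3 * 60 + 52        # custom image, VM stopped (3:52)

print(slowdown(website_stopped, aws_cluster_average))  # 6.6
print(slowdown(vm_stopped, aws_cluster_average))       # 5.9
```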
Many employees will not know how long it takes when an on-premise node fails over. They will assume that this is about the same as the time it takes to move a cluster to another node. When you then compare a move time of 10 to 14 seconds with any solution in AWS, this will always be a disappointment.
So what to choose? It’s not as easy as it seems. When you choose the 1:1 migration of a Windows Failover Cluster from on-premise to AWS, you will pay much more in operational expenses than you would for one of the Auto Scaling Group solutions. When you choose this solution, you also accept some risk of a failure of an entire Availability Zone: the implicit assumption is that the risk of a failure of one of the nodes is bigger than the risk of a failure of an Availability Zone.
When you choose one of the Auto Scaling Group solutions, you accept longer failover times. In general, businesses use a Windows Failover Cluster when the software should be reliable: the costs of not having this software running might be considerable. In some cases, the costs of not having the software running for a few minutes might be hundreds of euros or more. The advantage of the Auto Scaling Group solutions is that they are much easier to build and maintain than the Windows Failover Cluster in AWS.
We, as specialists, cannot make this decision. There is no single solution that is best for all cases. We should talk to the stakeholders and explain to them what the different solutions are and what the pros and cons of each solution are.
In the next blog, I will show how to start the different solutions in your own environment so you can play with this yourself. In blog 3 the differences between Failover Cluster on-premise and within AWS are explained.
The construction of the CloudFormation templates may look difficult. In the fourth blog I will explain how the different templates are built up. I will also show how the interaction between CloudFormation and the ClusterNode takes place.
Sometimes, the Task Scheduler in Windows doesn’t start a task after a reboot. I used a little trick to ensure that the next script is always started. When the Task Scheduler starts the next script, that’s fine. When the Task Scheduler doesn’t do this, AWS will start the task.
 A spreadsheet with details of the outcome of the tests can be found here: https://github.com/FrederiqueRetsema/AWSMigrationWindowsFailoverCluster/raw/master/Results.ods
 A spreadsheet with the costs of the different solutions can also be found on GitHub: https://github.com/FrederiqueRetsema/AWSMigrationWindowsFailoverCluster/raw/master/Costs.ods
 AWS Pricing Calculator: https://calculator.aws/#/