It took me a while, but then this blog series was ready: five blog articles about Windows Failover Clustering and Auto Scaling Groups with one node. I was happy, but then a thought kept nagging me. Let me explain the problem by showing you the results for the Auto Scaling Group with Custom Image:
The times are relative to the moment I stopped the website or the server. The different stages are:
- Stop website: the website is stopped via IIS or via the PowerShell command Get-IISSite | Stop-IISSite -Confirm:$False. The application pool and the W3SVC service keep running. The first time under this heading is the last moment the website responds to a curl from the curl1sec.ps1 script; the second time is the first moment the website responds via curl1sec.ps1 on the new node. The difference between the two is the failover time.
- Target Group unhealthy: go to the target group and look at the status of the node. Directly after stopping the website, I kept refreshing the status. The moment the health check reported “Unhealthy”, I wrote down the time (taken from https://time.is).
- Instance out of service ASG: when you open the Auto Scaling Group and look at the Activity tab, you will see the message “An instance was taken out of service”. The timestamp of this event is used in the spreadsheet.
- Type: the “An instance was taken out of service” message also gives the cause for pulling the instance out of service. Up to now we saw EC2 for the EC2 health check and ELB for the health check in the target group.
- New instance started: this is the moment a new instance is started by the Auto Scaling Group. The timestamp can be found in the Auto Scaling Group activity tab as well.
- Start CreateWebpage.ps1: this is the moment that you can see in the CloudWatch install_log.txt log group that the PowerShell script has started. This happens after the image has started and has been configured (via EC2Launch sysprep).
- Stop server: I used the AWS EC2 console to stop (not: terminate) the instance. The timestamps from curl1sec.ps1 are used in the same way as when I stopped the website.
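The failover time in the first stage is derived from the curl1sec.ps1 output: the last second the old node answered and the first second the new node answered. The calculation can be sketched in Python as follows (the function name and sample data are mine, not from curl1sec.ps1, which is a PowerShell script that simply probes the site once per second):

```python
from datetime import datetime, timedelta

def failover_time(samples):
    """Return the time between the last successful probe before the outage
    and the first successful probe after it, or None if there is no outage."""
    last_ok_before = first_ok_after = None
    seen_gap = False
    for timestamp, responded in samples:
        if responded and not seen_gap:
            last_ok_before = timestamp
        elif not responded:
            seen_gap = True
        else:  # the site responds again, after the gap: the new node is up
            first_ok_after = timestamp
            break
    if last_ok_before is None or first_ok_after is None:
        return None
    return first_ok_after - last_ok_before

# One probe per second: up to second 2 the old node answers,
# from second 11 onwards the new node answers.
base = datetime(2020, 11, 22, 12, 0, 0)
samples = [(base + timedelta(seconds=i), i < 3 or i > 10) for i in range(15)]
print(failover_time(samples))  # 0:00:09
```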
Let’s take the first line of stopping the website in the Auto Scaling Group with Custom Image:
- It took 15 seconds for the Target Group to become unhealthy. That is pretty quick.
- 2 minutes and 24 seconds after stopping the website, the instance is put out-of-service in the Auto Scaling Group.
What?! 2 minutes and 24 seconds? That is two minutes and 9 seconds after the load balancer’s health check reported the instance unhealthy. Okay, the next tries are much faster, but it still takes a considerable amount of time to put the instance out of service. When I looked into this, I saw that the health of the node in the Auto Scaling Group remains “healthy” for a long time, even when the health in the Target Group is already unhealthy.
How to solve this? First, I looked at events. AWS generates lots of events and it is quite easy to attach a Lambda function to them. That is easy and cheap, because you only pay when the events occur.
Unfortunately, the target group of the load balancer doesn’t deliver events on the event bus when the status of an instance changes. This means that the only way to speed up the process is to poll the status of the instances in the target groups regularly, and then act on unhealthy nodes.
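Polling boils down to two ELBv2 API calls: describe_target_groups for the ARNs and describe_target_health per target group. A sketch of those two helpers (the names follow the ones used in elb_health_status_to_asg.py, but the bodies are my reconstruction; the boto3 client is passed in so the filtering logic can be shown and tested on its own):

```python
def get_target_group_arns(elbv2):
    """Collect the ARNs of all target groups in the current region."""
    arns = []
    for page in elbv2.get_paginator("describe_target_groups").paginate():
        arns += [tg["TargetGroupArn"] for tg in page["TargetGroups"]]
    return arns

def unhealthy_instance_ids(target_health_descriptions):
    """Filter a describe_target_health response down to unhealthy instance ids."""
    return [d["Target"]["Id"]
            for d in target_health_descriptions
            if d["TargetHealth"]["State"] == "unhealthy"]

def get_instance_ids_of_unhealthy_nodes(elbv2, target_group_arn):
    response = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    return unhealthy_instance_ids(response["TargetHealthDescriptions"])
```

In the real container, `elbv2` would be created once with `boto3.client("elbv2")` and reused across loop iterations.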
When you look at the main part of the Python code elb_health_status_to_asg.py, you will see the following lines of code:
```python
while (True):
    target_group_arns = get_target_group_arns()
    for target_group_arn in target_group_arns:
        instance_ids = get_instance_ids_of_unhealthy_nodes(target_group_arn)
        for instance_id in instance_ids:
            asg_name = get_asg_name_from_instance_tags(instance_id)
            asg_health_check_type = get_asg_health_check_type(asg_name)
            if (asg_health_check_type == "ELB"):
                change_instance_health(instance_id)
    time.sleep(SLEEP_IN_SECONDS)
```
Each iteration of the loop collects the Amazon Resource Names (ARNs) of all target groups in the current region. Per target group, a list of instances with unhealthy status is retrieved. Per instance, the name of the Auto Scaling Group is determined from the tags of the instance: when an Auto Scaling Group starts an instance, it adds its own name to the tags of the instance. The name of the tag we are looking for is aws:autoscaling:groupName:
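The tag lookup itself is one describe_instances call. A sketch, with the tag filtering split out so it can be tested without AWS (the helper name mirrors the script, the body is my reconstruction):

```python
def asg_name_from_tags(tags):
    """Find the Auto Scaling Group name in an EC2 tag list, or None."""
    for tag in tags:
        if tag["Key"] == "aws:autoscaling:groupName":
            return tag["Value"]
    return None

def get_asg_name_from_instance_tags(ec2, instance_id):
    response = ec2.describe_instances(InstanceIds=[instance_id])
    tags = response["Reservations"][0]["Instances"][0].get("Tags", [])
    return asg_name_from_tags(tags)
```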
Now that we know the name of the Auto Scaling Group, we have to check whether the health status is based on EC2 checks only (the HealthCheckType is ‘EC2’), or on both EC2 and ELB checks (the HealthCheckType is ‘ELB’). When the HealthCheckType is ELB, we send the status Unhealthy for this instance to the ASG via a custom health check. This is called “user health-check” in the GUI:
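The ASG side uses two Auto Scaling API calls: describe_auto_scaling_groups to read the HealthCheckType, and set_instance_health to send the custom (“user”) health signal. A sketch, again with the client passed in; the real script may differ in details:

```python
def get_asg_health_check_type(autoscaling, asg_name):
    """Return the HealthCheckType ("EC2" or "ELB") of the given ASG."""
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])
    return response["AutoScalingGroups"][0]["HealthCheckType"]

def change_instance_health(autoscaling, instance_id):
    # A custom health check: this shows up as "user health-check"
    # in the activity history of the Auto Scaling Group.
    autoscaling.set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Unhealthy",
        ShouldRespectGracePeriod=True)
```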
For this solution I used a (serverless) Fargate container cluster, with the container running as a service. The logs can be found in CloudWatch.
Both the sleep time and the logging level are configurable. The sleep time defaults to 5 seconds; the logging level defaults to INFO, but you can also use DEBUG or WARNING. The container works independently of the software and the target groups that are deployed: it will work for any target group in the region where the container is deployed. You can use my Docker Hub repository: frederiquer/elb_health_status_to_asg:latest is the default image in the CloudFormation stack. You can change this to your own repository/image if you want.
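A common way to make such settings configurable in a Fargate service is via environment variables in the task definition. A minimal sketch of that pattern (the variable names SLEEP_IN_SECONDS and LOG_LEVEL are my assumption here; check the repository for the actual names):

```python
import logging
import os

def read_config(environ=os.environ):
    """Read sleep time and log level, falling back to the documented defaults."""
    sleep_in_seconds = int(environ.get("SLEEP_IN_SECONDS", "5"))
    log_level = getattr(logging, environ.get("LOG_LEVEL", "INFO"))
    return sleep_in_seconds, log_level
```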
I added a column for the moment the container sends the message to the Auto Scaling Group. This is one to four seconds after the target group health check changes the status from healthy to unhealthy. That is pretty fast for a sleep interval of five seconds.
The results for stopping the website are on average 40 seconds faster than the results without the container. There are about 90 seconds between the fastest and the slowest run, which is a lot on an average of 3 minutes and 36 seconds. I think we may draw the conclusion that AWS is not designed for using an Auto Scaling Group with one node when recovery time is an issue. When the node is stopped, using the container is on average 31 seconds faster.
The costs for running this container are $8.89 per month. This isn’t too much when you compare it to the other costs: the costs for the total solution are $250 per month based on reserved instances. When I add this alternative to the list of alternatives for an on-premises Windows Failover Cluster, the list is:
When you don’t have reserved instances, the price for playing with this environment is about $2.21 per hour, based on m5a.xlarge instances.
Even with the container running, the average failover time of an Auto Scaling Group with one node is more than four times that of a 1:1 migration from an on-premises Windows Failover Cluster to a Windows Failover Cluster in AWS. The main question remains: how bad is this? There are business cases where it is no problem when the application is not available for multiple minutes. In that case, it also doesn’t make sense to start the container with the optimization for delivering the ELB health state to the Auto Scaling Group faster. When time is an issue and you decide that a 1:1 conversion is not for you (for example because you are afraid of using just one availability zone), then running the container might save you 30-40 seconds of failover time.
Previous blogs:
- Second blog, installation: https://technology.amis.nl/2020/11/08/windows-failover-cluster-migration-to-aws-part-2-installation/
- Third blog, technique behind Windows Failover Clusters in AWS: https://technology.amis.nl/2020/11/11/aws-migration-part-3-the-technique-behind-windows-failover-cluster-on-aws/
- Fourth blog, about the construction of the CloudFormation templates: https://technology.amis.nl/2020/11/14/windows-failover-cluster-on-aws-part-4-construction-of-the-cloudformation-scripts/
- Fifth blog, about starting a PowerShell script after a reboot when Windows fails to do so: https://technology.amis.nl/2020/11/15/aws-blog-series-part-5-start-powershell-script-after-a-reboot-when-windows-fails-to-do-so/
A spreadsheet with details of the outcome of the tests can be found here: https://github.com/FrederiqueRetsema/AWSMigrationWindowsFailoverCluster/raw/master/Results.ods
A spreadsheet with the costs of the different solutions can also be found on GitHub: https://github.com/FrederiqueRetsema/AWSMigrationWindowsFailoverCluster/raw/master/Costs.ods