Windows Failover Cluster on AWS part 4: Construction of the CloudFormation scripts

Frederique Retsema

Introduction

If you have followed this blog series [1] and deployed the CloudFormation scripts, you might be wondering how it all works. In this blog, I try to answer that question.

1. Windows Failover Clustering

Nested stacks

When you start the CloudFormationFailover.yml template, you will see that it immediately starts two other CloudFormation templates: the network stack and the common stack. The network stack is used to deploy a (very basic) network. The common stack is used to create the CloudWatch log groups and the Lambda functions.

After the common stack is deployed, one of the Lambda functions is used to put a secure parameter in the parameter store. In our case, it is the parameter with the password that we use in all the other stacks. The reason for using a Lambda function to do this is that, although it is possible to add String and StringList parameters to the parameter store via normal CloudFormation resources, it is not possible to do the same with a SecureString.
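As an illustration of what such a Lambda function has to do, here is a minimal Python sketch. It is not the code from the scripts in this series: the function name, the parameter name /demo/password and the FakeSsmClient class are assumptions made for this example. In a real Lambda function you would pass in boto3.client("ssm") instead of the fake client.

```python
def put_secure_parameter(ssm_client, name, value):
    """Store `value` as a SecureString parameter and return the API response."""
    return ssm_client.put_parameter(
        Name=name,
        Value=value,
        Type="SecureString",  # the type that plain CloudFormation resources cannot create
        Overwrite=True,
    )


class FakeSsmClient:
    """Stand-in for boto3.client("ssm") so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def put_parameter(self, **kwargs):
        # Record the call instead of talking to AWS.
        self.calls.append(kwargs)
        return {"Version": len(self.calls)}


# Hypothetical parameter name and value, for illustration only.
fake_ssm = FakeSsmClient()
response = put_secure_parameter(fake_ssm, "/demo/password", "example-only")
print(response["Version"], fake_ssm.calls[0]["Type"])
```

Passing the client in as an argument keeps the function easy to test, because no AWS credentials are needed to check which call is made.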

When the password is added to the parameter store and the network is deployed, the CloudFormation template for the domain controller is deployed. The domain controller is needed for the last two CloudFormation templates, which are used to create the three cluster nodes and the demo node.

When the three cluster nodes are deployed, the cluster configuration can be started on ClusterNode1. When this is successful, a DNS entry is added to the DNS on the Domain Controller.

ClusterNode stack

You might get confused about all the steps that are taken on the cluster node before the cluster is configured. To help you, I created a simplified flow diagram for the Cluster Node:

When the template is started, it creates Systems Manager parameters in the parameter store for the IP addresses that it received as parameters from the main CloudFormation template. It also creates log streams for this node, under the CloudWatch log groups that were created in the common template. A few IAM objects are created as well. This is pretty standard stuff.

Then the creation of the ClusterNode itself is started. After the image has started, the configuration begins with the user data. In the first line of the user data, cfn-init is called. cfn-init is a program that configures a node with help from CloudFormation. You will find the details in the metadata of the node deployment. The cfn-hup service is started, files are created (for example, the configuration file for the CloudWatch agent) and a series of commands is run.

These commands are run in the alphabetical order of the names you give them, which is why the names begin with a number. The first command creates the directories C:\Install and C:\ClusterScripts. The second command downloads and installs the AWS Command Line Interface (CLI) and the CloudWatch Agent. The third command echoes the computer name to one of the PowerShell scripts. The fourth command downloads all configuration files. The fifth command starts the part1.ps1 script. One of the steps in part1.ps1 is to schedule part2.ps1 to start after a reboot. Part1.ps1 will not reboot the VM itself: it returns nicely to cfn-init, and when all five commands have executed correctly, cfn-init is happy and stops executing.

After cfn-init has finished, the user data echoes the region to the network settings file (this has to be done by PowerShell, because of the encoding of the file) and adds the cfn-signal command to part5.ps1. Part5.ps1 is the last PowerShell configuration script that is (indirectly) executed from the user data. The last step of the user data is to reboot the virtual machine.
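The effect of alphabetical ordering on command names can be shown with a small Python sketch. The command names below are hypothetical (they are modelled on the five steps, not copied from the template), and the sketch assumes that plain string sorting gives the same order as cfn-init for names like these.

```python
def execution_order(command_names):
    """Return the names in plain alphabetical (string) order."""
    return sorted(command_names)


# Hypothetical command names, listed in a deliberately scrambled order.
names = [
    "5-run-part1",
    "2-install-cli-and-agent",
    "1-create-directories",
    "4-download-config-files",
    "3-echo-computer-name",
]

order = execution_order(names)
for name in order:
    print(name)

# Pitfall: strings are compared character by character, so "10" sorts before "2".
unpadded = execution_order(["2-second", "10-tenth"])
# Zero-padding the numbers keeps the intended order with ten or more commands.
padded = execution_order(["02-second", "10-tenth"])
print(unpadded, padded)
```

With five commands a single digit is enough. With ten or more commands, zero-padded prefixes (01, 02, ...) would be the safer choice.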

Part2.ps1 is started after the reboot. Apart from doing some of the configuration, part2.ps1 schedules part3.ps1 to start after the next reboot, and so on, until part5.ps1 is finished. The last commands of part5.ps1 are the commands that were added by the user data. The signal to CloudFormation indicates that the configuration of the virtual machine is done.

When I tried to pass two network interfaces in CloudFormation at the same time, Windows couldn’t deal with this. I had to connect the extra network interface after the deployment of the node. The configuration of the network is done via SSM Run Command, which is called from a Custom Resource (Lambda function) in CloudFormation.

This SSM Run Command “trick” is also used for configuring the cluster itself, after all three cluster nodes are deployed. This command is “fired” by the main CloudFormation template. It uses the Amazon SSM Agent service on the virtual machine to run the command.
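To give an idea of what a Lambda function behind such a Custom Resource might do, here is a minimal Python sketch. It is not the code from the scripts in this series: the function name, the instance id, the script path and the FakeSsmClient class are assumptions for this example. In a real Lambda function you would pass in boto3.client("ssm") instead of the fake client.

```python
def run_powershell(ssm_client, instance_id, commands):
    """Send PowerShell commands to one instance via SSM Run Command.

    Returns the command id, which can be used to look up the result later.
    """
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunPowerShellScript",
        Parameters={"commands": list(commands)},
    )
    return response["Command"]["CommandId"]


class FakeSsmClient:
    """Stand-in for boto3.client("ssm") so the sketch runs without AWS access."""

    def __init__(self):
        self.sent = []

    def send_command(self, **kwargs):
        # Record the request instead of sending it to AWS.
        self.sent.append(kwargs)
        return {"Command": {"CommandId": "fake-command-id"}}


# Hypothetical instance id and script path, for illustration only.
fake_ssm = FakeSsmClient()
command_id = run_powershell(
    fake_ssm,
    "i-0123456789abcdef0",
    [r"C:\ClusterScripts\ExampleConfigureCluster.ps1"],
)
print(command_id)
```

The call only hands the command to the SSM Agent on the node. Whether the command succeeded has to be checked separately, which this sketch leaves out.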

By default, the Amazon SSM Agent runs with an account that has no network access, and it wasn’t possible to start a new session with other credentials from within Run Command. I therefore had to change the user of the SSM Agent to a user that has permissions to create a cluster [2]. I used the Domain Admin account for that, which is fine for this test, but you might want to change this for your production environment.

In the previous blog, I already explained that the address of the cluster is added to the DNS by the (main) CloudFormation script, instead of being added automatically by the cluster configuration.

2. ASG Template

Nested stacks

When you compare the structure of the ASG template to the previous template, you will see both its similarity to the Windows Failover Cluster template and its simplicity: when the ASG Node is deployed in the Auto Scaling Group, the work is done.

ASG node stack

The flow of the ASG Node stack is rather simple. In the cfn-init part of the template, the commands are all in one “command block” instead of five separate “commands”. This is done because cfn-init on Windows by default waits 60 seconds after each command (the waitAfterCompletion setting). When you deploy a Windows Failover Cluster, this doesn’t matter that much: readability of the code is then more important than the few extra minutes you have to wait. In the Auto Scaling Group solution, time does matter, so I combined all the commands into one block. This isn’t too bad for the readability, because the number of commands in the command block is very small compared to the number of commands in the Failover Cluster template.

The reason that I configure the Auto Scaling Group in two steps (first with EC2 health checks only, then changed to use both ELB and EC2 health checks) is that the Auto Scaling Group starts a node as soon as it is created. At that point the Life Cycle Hook is not present yet, so the Load Balancer would start its health checks as soon as the node is started. At that moment, IIS is not installed yet. ELB would decide that the node is unhealthy (and the Auto Scaling Group would terminate it) before the node had the chance to install and configure IIS. To prevent this from happening, the ASG is only configured to use the ELB health checks after the Life Cycle Hook has been added.
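The order of these two steps can be sketched in Python as follows. This is an illustration, not the code from the templates: the function name, the group and hook names, the timeout values and the FakeAutoScalingClient class are assumptions. In real code you would pass in boto3.client("autoscaling") instead of the fake client.

```python
def enable_elb_health_checks(autoscaling_client, group_name, hook_name,
                             grace_period_seconds=300):
    """Add the launch lifecycle hook first, then switch on ELB health checks."""
    # Step 1: new nodes wait in the hook until their configuration is signalled as done.
    autoscaling_client.put_lifecycle_hook(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        LifecycleTransition="autoscaling:EC2_INSTANCE_LAUNCHING",
        HeartbeatTimeout=900,      # assumed value, in seconds
        DefaultResult="ABANDON",   # assumed choice for nodes that never signal
    )
    # Step 2: only now let the group use the ELB health checks (in addition to EC2 checks).
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        HealthCheckType="ELB",
        HealthCheckGracePeriod=grace_period_seconds,
    )


class FakeAutoScalingClient:
    """Stand-in for boto3.client("autoscaling"); records calls in order."""

    def __init__(self):
        self.calls = []

    def put_lifecycle_hook(self, **kwargs):
        self.calls.append(("put_lifecycle_hook", kwargs))
        return {}

    def update_auto_scaling_group(self, **kwargs):
        self.calls.append(("update_auto_scaling_group", kwargs))
        return {}


# Hypothetical group and hook names, for illustration only.
fake_asg = FakeAutoScalingClient()
enable_elb_health_checks(fake_asg, "demo-asg", "demo-launch-hook")
print([name for name, _ in fake_asg.calls])
```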

3. ASG with Custom Image Template

Nested stacks

In the previous solution the configuration of the node is done in the Launch Template. It then makes sense to put the Auto Scaling Group in the same template.

In the ASG with Custom Image solution, the creation of the image is done in a different template than the Auto Scaling Group and the Launch Template.

Signalling CloudFormation and the Auto Scaling Group

When you look at the CloudFormationCreateImage.yml template, you will see that in the UserData signal commands are added to both part2.ps1 and a new file, SignalASG.ps1. The signal commands in part2.ps1 are used to signal the CloudFormationCreateImage.yml template. After the signal is given, the EC2Launch sysprep command is started, which will shut down the instance. The Lambda function that creates the image waits until the instance is stopped before it creates the image.
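The “wait until stopped, then create the image” step could look like the following Python sketch. It is not the Lambda function from the scripts: the function name, the instance id, the image name and the fake classes are assumptions. In a real Lambda function you would pass in boto3.client("ec2"), and you would have to make sure that the wait fits within the Lambda timeout.

```python
def create_image_when_stopped(ec2_client, instance_id, image_name):
    """Wait until the instance is stopped, then create an image from it."""
    waiter = ec2_client.get_waiter("instance_stopped")
    waiter.wait(InstanceIds=[instance_id])
    response = ec2_client.create_image(InstanceId=instance_id, Name=image_name)
    return response["ImageId"]


class FakeWaiter:
    """Stand-in for a boto3 waiter; returns immediately and records the call."""

    def __init__(self, events):
        self.events = events

    def wait(self, **kwargs):
        self.events.append("wait")


class FakeEc2Client:
    """Stand-in for boto3.client("ec2") so the sketch runs without AWS access."""

    def __init__(self):
        self.events = []

    def get_waiter(self, waiter_name):
        return FakeWaiter(self.events)

    def create_image(self, **kwargs):
        self.events.append("create_image")
        return {"ImageId": "ami-fake"}


# Hypothetical instance id and image name, for illustration only.
fake_ec2 = FakeEc2Client()
image_id = create_image_when_stopped(fake_ec2, "i-0123456789abcdef0", "demo-image")
print(image_id, fake_ec2.events)
```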

The signal commands in SignalASG.ps1 are used to signal the CloudFormationASGAndLaunchTemplate.yml template and also to signal the hook of the Auto Scaling Group. This PowerShell script is started from the CreateWebpage.ps1 PowerShell script every time the custom image is started.

4. Test environment

Please keep in mind that these scripts are meant to demonstrate the differences between the solutions for the migration of an on-premises Windows Failover Cluster. The scripts are not meant to run directly in production: I left some security issues in these scripts, mainly to make it easier for the users of these scripts to access the different nodes and to see how the different solutions work. Please change the scripts before you use them in a production environment.

5. Next blog…

In the next blog, I will tell more about the “trick” to make sure that a PowerShell script that is scheduled to start after a reboot is always started: by Windows if possible, but by AWS if Windows fails to do so.

Links

[1] Previous blogs:

– First blog, overview: https://technology.amis.nl/2020/11/07/aws-migration-part-1-how-to-migrate-windows-failover-clustering-servers-to-aws/

– Second blog, installation: https://technology.amis.nl/2020/11/08/windows-failover-cluster-migration-to-aws-part-2-installation/

– Third blog, technique behind Windows Failover Clusters in AWS: https://technology.amis.nl/2020/11/11/aws-migration-part-3-the-technique-behind-windows-failover-cluster-on-aws/

[2] With a very big “thank you” to both Santhosh Sivarajan and Luke, for publishing the PowerShell code to change the owner of a Windows service and to grant the “log on as a service” right to an account, respectively ( https://gallery.technet.microsoft.com/scriptcenter/79644be9-b5e1-4d9e-9cb5-eab1ad866eaf and https://stackoverflow.com/questions/313831/using-powershell-how-do-i-grant-log-on-as-service-to-an-account )
