You might have wondered, in the second blog of this series, why I mentioned the possibility that the Task Scheduler doesn’t start a script after a reboot in the on-premises (Hyper-V) environment, but didn’t do so for the AWS environment. Well, that’s because I used a “trick”: AWS will start the next job when the Windows Task Scheduler fails to do so. In this blog, I explain how this works. You can find the code in my GitHub repository.
All virtual machines send the content of their C:\Install\install_log.txt file to CloudWatch. This is done via the AWS CloudWatch agent: the agent sends the content to the log group install_log.txt, using the name of the node as the log stream name.
Every line that is sent by the virtual machine ends up in the CloudWatch log. When you look at the events in the install_log.txt log group, you will see four types of lines, starting with:
- START (the beginning of a script),
- END (the last log line of a script),
- TRACE (information about what the script is doing), and
- CHECK (information about which script is about to start after the reboot, and how long to wait before retrying if that script hasn’t started)
Log events may look like:
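A hypothetical sequence of log events for node ClusterNode1 (the exact line format is an assumption; the repository defines the real format) could be:

```
START part1.ps1
TRACE Installing Windows features
TRACE Preparing the cluster node
CHECK part2.ps1 3
END part1.ps1
```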
In this example, you see that part1.ps1 is started on node ClusterNode1, followed by some TRACE lines and, just before the END of part1.ps1, the CHECK line.
In the CloudWatch subscription filter, the START lines and the CHECK lines are selected by the filter pattern “?CHECK ?START” (which means: either CHECK or START). When a CHECK line comes in, the relevant parts (in the example: script part2.ps1, in 3 minutes) are used to create a CloudWatch event. The name of the node (which is retrieved from the log stream name) and the name of the script are combined to form the name of the event. The event is scheduled to fire, in this example, 3 minutes after the moment the Lambda function receives this line.
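To make the flow concrete, here is a minimal sketch of what such a Lambda function could look like. This is not the repository code: the function names, the CHECK line format (“CHECK &lt;script&gt; &lt;minutes&gt;”) and the returned structure are my assumptions, based on the description above. CloudWatch Logs delivers subscription data base64-encoded and gzipped.

```python
import base64
import gzip
import json

def parse_check_line(node, message):
    """Parse a CHECK line. The line format ("CHECK <script> <minutes>")
    is an assumption based on the description in this blog."""
    _, script, minutes = message.split()
    return {
        "event_name": f"{node}-{script}",   # node name + script name
        "script": script,
        "minutes": int(minutes),
    }

def lambda_handler(event, context):
    # CloudWatch Logs sends subscription data base64-encoded and gzipped
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    node = payload["logStream"]             # log stream name = node name
    checks = []
    for log_event in payload["logEvents"]:
        message = log_event["message"]
        if message.startswith("CHECK"):
            check = parse_check_line(node, message)
            # In the real function, you would now create a CloudWatch event
            # (for example with events.put_rule and a cron expression for
            # "now + minutes") that triggers SSM Run Command.
            checks.append(check)
        elif message.startswith("START"):
            # A START line means the script did run: delete the event again.
            pass
    return checks
```

For the example above, a CHECK line “CHECK part2.ps1 3” in log stream ClusterNode1 yields the event name ClusterNode1-part2.ps1 with a delay of 3 minutes.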
When a START line is received, the Lambda function is started again to delete the event. When no START line arrives within the specified time, the event fires. The event uses SSM Run Command to start the PowerShell script that is in the second part of the event name. The assumption here is that the script is always in the C:\Install directory.
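The Run Command request could be built along these lines. The helper name and the tag-based targeting are illustrative assumptions; AWS-RunPowerShellScript is the built-in SSM document for running PowerShell commands, and the C:\Install location follows the assumption in the text.

```python
def build_run_command_request(event_name):
    """Build the parameters for ssm.send_command(**request).
    The event name format "<node>-<script>" follows the blog text; this
    helper and its targeting are a sketch, not the repository code."""
    node, script = event_name.split("-", 1)   # second part = script name
    return {
        # Assumption: the nodes can be targeted by their Name tag
        "Targets": [{"Key": "tag:Name", "Values": [node]}],
        "DocumentName": "AWS-RunPowerShellScript",
        "Parameters": {
            # Assumption from the blog: the script is always in C:\Install
            "commands": [f"C:\\Install\\{script}"],
        },
    }
```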
To show how this works, I “forgot” to uncomment the line in part2.ps1 that starts part3.ps1 on the Demo virtual machine after the reboot. In the log of the Lambda function StartPowershellEventFunction, you will see that (at least) the event Demo-part3.ps1 is started.
When you look at the configuration file of the CloudWatch agent (in the yml files for, for example, the CloudFormation ClusterNode template), you will see that the encoding of this file is “utf-16”. The default encoding for the CloudWatch agent is utf-8; the default encoding for Windows is Windows-1252 (you can see this by typing [System.Text.Encoding]::Default in a PowerShell window). You can find out what the encoding of a specific file is by opening the file in Notepad and looking at the status bar, or by using “Save As” and looking at the encoding.
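The relevant part of the agent configuration could look like the fragment below. This is a sketch of the documented collect_list format; the log stream name shown here is an example, the actual templates may use a placeholder for the node name.

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "C:\\Install\\install_log.txt",
            "log_group_name": "install_log.txt",
            "log_stream_name": "ClusterNode1",
            "encoding": "utf-16"
          }
        ]
      }
    }
  }
}
```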
When you don’t add an encoding line to the configuration file of the CloudWatch agent, the log events in CloudWatch will look as if every character has an extra space before or after it. The subscription filter “?CHECK ?START” will then also fail, because the encoding in the filter is different from the encoding of the incoming text.
The default user for the SSM Agent (which starts the SSM Run Commands on the virtual machine) is the Local System user. This user doesn’t have network access. I changed the user of this service for the ClusterNodes, to make it possible to use SSM Run Command to configure the cluster. For a test environment that is only online for a short time, this is fine. For a production environment, you might add commands to both change the default user for the SSM Run Command back to Local System and delete the CloudWatch subscription filter, to make it harder to abuse this mechanism.
In your own test environment, feel free to create a simple PowerShell script in the C:\Install directory and add a “CHECK” line to the install_log.txt file to use this mechanism to start your own PowerShell script.
This “trick” is used in the other templates as well: every time a new node is created, the scripts are used to configure the node. In both the Windows Failover Cluster template and the CloudFormation template for the ASG with Custom Image solution, you might want to delete the CloudWatch subscription at the end of the CloudFormation deployment.
In the ASGWithCustomImage template, you might have seen the option to start a container. When you do, this container will improve the speed of the failover by another 30-40 seconds. The software in the container checks the status of the instance in the target group every 5 seconds and sends an unhealthy status directly to the Auto Scaling group.
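The idea behind the container can be sketched as a small polling loop: check the instance’s state in the target group and, as soon as it is unhealthy there, tell the Auto Scaling group directly instead of waiting for the ASG’s own (slower) health check. The function names and loop structure below are my assumptions, not the container’s actual code; describe_target_health and set_instance_health are the real API calls involved.

```python
import time

def should_report_unhealthy(target_state):
    """Decide whether to short-circuit the ASG health check.
    State names come from the ELBv2 API; only "unhealthy" triggers a report
    in this sketch."""
    return target_state == "unhealthy"

def watch(instance_id, target_group_arn, interval=5):
    # Sketch of the container's loop: poll the target group every 5 seconds
    # and report an unhealthy instance directly to the Auto Scaling group.
    import boto3  # imported lazily so the helper above stays testable offline
    elbv2 = boto3.client("elbv2")
    autoscaling = boto3.client("autoscaling")
    while True:
        health = elbv2.describe_target_health(
            TargetGroupArn=target_group_arn,
            Targets=[{"Id": instance_id}],
        )
        state = health["TargetHealthDescriptions"][0]["TargetHealth"]["State"]
        if should_report_unhealthy(state):
            autoscaling.set_instance_health(InstanceId=instance_id,
                                            HealthStatus="Unhealthy")
            break
        time.sleep(interval)
```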
 Previous blogs:
– First blog, overview: https://technology.amis.nl/2020/11/07/aws-migration-part-1-how-to-migrate-windows-failover-clustering-servers-to-aws/
– Second blog, installation: https://technology.amis.nl/2020/11/08/windows-failover-cluster-migration-to-aws-part-2-installation/
– Third blog, technique behind Windows Failover Clusters in AWS: https://technology.amis.nl/2020/11/11/aws-migration-part-3-the-technique-behind-windows-failover-cluster-on-aws/
– Fourth blog, about the construction of the CloudFormation templates: https://technology.amis.nl/2020/11/14/windows-failover-cluster-on-aws-part-4-construction-of-the-cloudformation-scripts/
GitHub repository: https://github.com/FrederiqueRetsema/AWSMigrationWindowsFailoverCluster
 More information about the format of the filters can be found here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html