Differences between CloudFormation, Terraform and Ansible in deployment of objects in AWS

In this article, I will deploy a simple solution in AWS in three ways: with a CloudFormation template, with a Terraform script and with an Ansible playbook. By doing so, I will show the differences between these tools. I will highlight a few examples to give an idea of what each tool can do. The full scripts can be downloaded; see the links at the end of the article. All durations are averages of 5 runs. All the scripts are run in the Ireland region (eu-west-1).

Which components are deployed?

The solution that is deployed is not that complicated: there is a VPC with an internet gateway, three public subnets and three private subnets. There is one public and one private subnet per availability zone (AZ). The private subnets are connected to the public subnets with one NAT gateway per public subnet, to ensure high availability in each of the AZs. Route tables and security groups are created to allow traffic to flow to and from the EC2 instances.

The EC2 instances are t2.micro Amazon Linux machines. Each of them gets a web server (httpd) with a very simple message in the index.html. They are deployed by an auto scaling group with three desired nodes. There are two load balancers: one for the nodes in the private subnets and one for the nodes in the public subnets. The CloudFormation, Terraform and Ansible scripts output the DNS name of the public load balancer when they are ready.

To make the solution a little bit easier, the EC2 instances in the private subnets are deployed first. The instances in the public subnets first return their own message (“Foreground website”) and then use curl to call the internal load balancer to get the text of the instances in the private subnets (“Background info”). Because the index.html is a static file that is created when the instances are deployed, the scripts must deploy the instances in the private subnets first.

If you want to deploy these scripts yourself, you first need to create a key pair within EC2 to be able to ssh to the EC2 instances. The security group that is used in the public subnets allows all connections from a test PC (the IP address of this PC is passed as a parameter), and web traffic from all IP addresses.

CloudFormation

CloudFormation, Terraform and Ansible all determine which objects already exist and which objects differ from what they expect. CloudFormation can work with both YAML and JSON files. In this example, I will use JSON.

The order in which resources are deployed is determined by CloudFormation. This can save time when multiple objects can be created at the same time, but it might lead to errors when the dependencies are not configured correctly in the CloudFormation template. The order in which objects are deployed can differ when the template is deployed multiple times. Sometimes the order is not logical: when one deploys network components like route tables and NAT gateways, it is not wise to deploy auto scaling groups at the same time (the virtual machines depend on the network for their initialization scripts).

The dependency can be set in the template, by using the DependsOn keyword. This keyword uses the names as they are provided by the developer of the template:

       "AutoScalingGroupPrivate": {
         "Type": "AWS::AutoScaling::AutoScalingGroup",
         "DependsOn": ["SecurityGroupPublic", "PublicSubnetRouteTable",
                        "SecurityGroupPrivate", "PrivateGatewayRouteAZ1",
                        "PrivateGatewayRouteAZ2", "PrivateGatewayRouteAZ3"],
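To illustrate how a tool can derive a deployment order from such dependency declarations, here is a minimal Python sketch. It is not how CloudFormation is implemented; it only shows the idea of grouping resources into "waves" that can be created in parallel. The function name deployment_waves is made up for this example, and only the DependsOn list from the snippet is used as input.

```python
from graphlib import TopologicalSorter

# Resource -> resources it depends on, copied from the DependsOn list above.
depends_on = {
    "AutoScalingGroupPrivate": {
        "SecurityGroupPublic", "PublicSubnetRouteTable",
        "SecurityGroupPrivate", "PrivateGatewayRouteAZ1",
        "PrivateGatewayRouteAZ2", "PrivateGatewayRouteAZ3",
    },
}

def deployment_waves(graph):
    """Group resources into waves; all resources in one wave have no
    unfinished dependencies and could be created at the same time."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as created
    return waves

waves = deployment_waves(depends_on)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

With this input, the six network and security objects end up in the first wave and the auto scaling group in the second, which is the ordering the DependsOn keyword is meant to enforce.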

In the plain CloudFormation template used here, it is not possible to create objects in a loop: though the configuration for all private subnets is very much alike, there are three different objects in the CloudFormation file:

       "PrivateSubnetRouteTableAZ2": {
         "Type": "AWS::EC2::RouteTable",
         "Properties": {
              "Tags" : [ {
                  "Key": "Name",
                  "Value": {"Fn::Join": ["",
                                       [{"Ref": "Name"}, "-private-subnet-routetable-AZ2"]]}
              }],
              "VpcId": {"Ref": "VPC"}
         }
       },

In this snippet, the Join-function of CloudFormation is used to connect the contents of the variable Name with the static text “-private-subnet-routetable-AZ2”.
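One possible way around the missing loop is to generate the repetitive parts of the template with a small script before deploying it. The Python sketch below is not part of the scripts used in this article; it only illustrates the idea. The function name private_route_table is made up, and the names for AZ1 and AZ3 are assumed to follow the same pattern as the AZ2 snippet.

```python
import json

def private_route_table(az_number):
    """Build one route table resource with the same shape as the snippet above."""
    return {
        "Type": "AWS::EC2::RouteTable",
        "Properties": {
            "Tags": [{
                "Key": "Name",
                "Value": {"Fn::Join": ["", [
                    {"Ref": "Name"},
                    f"-private-subnet-routetable-AZ{az_number}",
                ]]},
            }],
            "VpcId": {"Ref": "VPC"},
        },
    }

# One resource per availability zone, generated in a loop.
resources = {
    f"PrivateSubnetRouteTableAZ{n}": private_route_table(n) for n in (1, 2, 3)
}
template_fragment = json.dumps({"Resources": resources}, indent=2)
print(template_fragment)
```

The output is ordinary JSON that could be merged into the Resources section of a template; the Fn::Join call itself is still evaluated by CloudFormation at deployment time.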

There is no standard way to determine AMI IDs dynamically. There is a workaround; see the links section of this article.

CloudFormation is an integral part of AWS, whereas Terraform and Ansible use the AWS API instead. The disadvantage is that it is impossible for users to extend CloudFormation or to improve the way it works. The advantage is that it is always possible to delete the objects that were created by CloudFormation: AWS stores this information itself, in a place in the cloud that the user cannot access.

The strangest disadvantage of CloudFormation is that it is much slower than Terraform: deploying this template via the CloudFormation web console takes on average 5 minutes and 25 seconds.

Terraform

Terraform uses its own language (HCL) for its templates, which looks a lot like JSON. It is possible to use different files for the different parts of the configuration: Terraform joins all the configuration files in the current directory together into one deployment. It is recommended to put sensitive data (like keys) in separate configuration files in another directory; in this demo I put them in the terraform.tfvars file in my home directory.

The deployment of a solution consists of three stages: the init stage, the plan stage and the apply stage. In the init stage, the modules that this solution depends on are downloaded to the hidden directory .terraform. In the plan stage, Terraform determines the differences between the configuration file(s) and the environment that the solution is deployed to. This leads to an output file; in this solution I use terraform.tfplans for that. Plan also reports which objects will be added, changed or deleted. The deployment itself is done in a separate step, which is called terraform apply. The deployment can be removed using terraform destroy. The full commands that are used in this article are shown here:

terraform init -var-file=/home/user/terraform.tfvars
terraform plan -var-file=/home/user/terraform.tfvars -out terraform.tfplans
terraform apply terraform.tfplans
terraform destroy -var-file=/home/user/terraform.tfvars

It is possible to set variables on the command line; these take precedence over the variables in the configuration files.

Terraform will, like CloudFormation, determine the order in which objects are deployed. Terraform also has a “depends on” option (depends_on). In Terraform, the object name is combined with the name of its resource type to refer to an object that has been created earlier:

resource "aws_autoscaling_group" "autoscaling_group_private" {
  name = "${var.name}-asg-group-private"
  […]
  depends_on = ["aws_security_group.security_group_public",
                "aws_route_table.public_subnet_route_table",
                "aws_security_group.security_group_private",
                "aws_route_table_association.subnet_private_route_table_association",
                "aws_internet_gateway.igw"]
}

In Terraform it is possible to loop through parts of the configuration. Let’s look at the route table associations: in Terraform each of the three public subnets is connected to one route table:

resource "aws_route_table_association" "subnet_public_route_table_association" {
  count          = "${length(data.aws_availability_zones.available.names)}"
  subnet_id      = "${element(aws_subnet.publicsubnet.*.id,count.index)}"
  route_table_id = "${aws_route_table.public_subnet_route_table.id}"
}

The count indicates the number of times this part of the configuration will be used. In this case, it will be used three times: once per availability zone. The count.index runs from 0 to 2 for three availability zones. The element function returns the ID of the public subnet at that index, so each route table association connects one public subnet to the (one) public route table.
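The effect of count and element can be mimicked in a few lines of Python. This is only a sketch of the semantics, not Terraform code: the subnet and route table IDs are made-up placeholders for values that the AWS provider would return, and the element function below imitates Terraform's behaviour of wrapping the index around the length of the list.

```python
def element(items, index):
    """Mimic Terraform's element(): the index wraps around the list length."""
    return items[index % len(items)]

# Hypothetical values standing in for data returned by the AWS provider.
availability_zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
public_subnet_ids = ["subnet-aaa", "subnet-bbb", "subnet-ccc"]
public_route_table_id = "rtb-public"

# count = length of the list of availability zones
count = len(availability_zones)

# One association per value of count.index (0, 1, 2).
associations = [
    {
        "subnet_id": element(public_subnet_ids, index),
        "route_table_id": public_route_table_id,
    }
    for index in range(count)
]

for association in associations:
    print(association)
```

Every association gets a different subnet but the same route table, which matches the configuration above.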

One of the nice things about Terraform is the ability to use external data to create resources. An example is looking up the most recent AMI ID for Amazon Linux. I didn’t put that in the original script because I wanted to compare the durations of CloudFormation and Terraform in a fair way.

data "aws_ami" "aws_linux" {
  most_recent = true
  owners = ["amazon"]
  filter {
    name   = "name"
    values = ["amzn-ami-hvm-20*"]
  }
  filter {
    name   = "architecture"
    values = ["x86_64"]
  }
  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
  filter {
    name   = "root-device-type"
    values = ["ebs"]
  }
}

Another nice feature is the state lock: every time you use terraform plan, apply or destroy, a hidden file .terraform.tfstate.lock.info is created, which contains the name of the user that started the command and the date and time of the current run. When another operator tries to use one of the change commands at the same time, Terraform gives an error. This works at file level when people always start deployments from the same directory, but it is also possible to store this information in the (AWS) cloud via a backend configuration. This can be very useful for operators who work in different physical locations but configure the same resources. It is an example of a feature that you might never use, or one that you heavily depend upon.
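The idea behind such a file-based lock can be sketched in Python. This is not Terraform's implementation and the content of the real lock file may differ; the function names, the user names and the fields written to the file are assumptions for this example. The essential point is that the lock file is created atomically, so a second run fails while the first one is still busy.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def acquire_lock(path, who, operation):
    """Atomically create the lock file; return False if it already exists."""
    try:
        # O_EXCL makes the creation fail when another run holds the lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    info = {
        "who": who,
        "operation": operation,
        "created": datetime.now(timezone.utc).isoformat(),
    }
    with os.fdopen(fd, "w") as handle:
        json.dump(info, handle)
    return True

def release_lock(path):
    """Remove the lock file at the end of a run."""
    os.remove(path)

# Demo in a temporary directory with two hypothetical operators.
workdir = tempfile.mkdtemp()
lock_path = os.path.join(workdir, ".terraform.tfstate.lock.info")
first = acquire_lock(lock_path, "alice", "apply")   # lock is free
second = acquire_lock(lock_path, "bob", "plan")     # lock is held by alice
release_lock(lock_path)
third = acquire_lock(lock_path, "bob", "plan")      # lock is free again
print(first, second, third)
```

A lock like this only protects operators who share the same directory, which is why a remote backend is needed when people work from different machines.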

Terraform can destroy the objects that have been created by earlier terraform plan and terraform apply runs, using terraform destroy. This command uses the terraform.tfstate (JSON) file, which contains data such as IDs and names of availability zones. If you delete the whole directory without running terraform destroy first, you will have to delete everything that was created earlier by hand.

One of the advantages of Terraform is that it is possible to use modules that are written by the community. One example is the vpc module, which creates routes, gateways, subnets etc. from just a few relevant parameters. There is one drawback, though: it is impossible to use dynamic names for availability zones.

variable "azs" {
  default = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
}

[…]

# NETWORKING #

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  name   = "${var.name}-VPC"
  cidr            = "${var.cidr_block}"
  azs             = "${var.azs}"
  public_subnets = ["${cidrsubnet(var.cidr_block, 8, 1)}",
                    "${cidrsubnet(var.cidr_block, 8, 2)}",
                    "${cidrsubnet(var.cidr_block, 8, 3)}"]
  private_subnets = ["${cidrsubnet(var.cidr_block, 8, 4)}",
                     "${cidrsubnet(var.cidr_block, 8, 5)}",
                     "${cidrsubnet(var.cidr_block, 8, 6)}"]
  enable_nat_gateway = true
  single_nat_gateway = false
  create_database_subnet_group = false
  enable_dns_hostnames = true
  enable_dns_support = true
}
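To see which subnets the cidrsubnet calls in this configuration produce, the calculation can be reproduced with Python's standard ipaddress module. This is a sketch that imitates the function for illustration, assuming the cidr_block of 10.0.0.0/16 that is used elsewhere in this article; it is not Terraform's own code.

```python
import ipaddress
from itertools import islice

def cidrsubnet(prefix, newbits, netnum):
    """Imitate Terraform's cidrsubnet(): extend the prefix by `newbits`
    bits and return subnet number `netnum` within the original network."""
    network = ipaddress.ip_network(prefix)
    subnets = network.subnets(prefixlen_diff=newbits)
    return str(next(islice(subnets, netnum, None)))

cidr_block = "10.0.0.0/16"  # assumed value of var.cidr_block
public_subnets = [cidrsubnet(cidr_block, 8, n) for n in (1, 2, 3)]
private_subnets = [cidrsubnet(cidr_block, 8, n) for n in (4, 5, 6)]

print("public: ", public_subnets)
print("private:", private_subnets)
```

Adding 8 bits to a /16 gives /24 subnets, so the public subnets become 10.0.1.0/24 to 10.0.3.0/24 and the private subnets 10.0.4.0/24 to 10.0.6.0/24.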

Ansible

To run the Ansible script, you need some extra Python libraries, installed with pip: for the VPC modules, we need boto, boto3 and botocore. For the ipaddr filter, we need the netaddr Python package. The following command will do the trick:

sudo pip install boto boto3 botocore netaddr

Ansible doesn’t keep a local file with the IDs that were used in previous deployments. Compared to Terraform, this is a major disadvantage: Ansible cannot always tell whether existing objects were created by Ansible or outside Ansible. This is for example the case with Elastic IP addresses (EIPs): when I reran the playbook, Ansible created extra Elastic IP addresses instead of reusing the previously assigned ones. In my solution, the EIPs are used for NAT gateways, and in Ansible it is possible to create the EIP as part of the creation of the NAT gateway. In this way, Ansible only creates EIPs when new NAT gateways are created. It does recognize existing NAT gateways when you specify the “if_exist_do_not_create” flag. This is a slightly strange feature with an even stranger default of “no”: objects of other classes are created only if they don’t exist, and with some classes it isn’t even possible to create extra objects with the same name…

    - name: Create NAT Gateways
      ec2_vpc_nat_gateway:
        subnet_id: "{{item.subnet_id}}"
        if_exist_do_not_create: yes
        wait: yes
        wait_timeout: 600
      register: natgateway
      with_items:
      - { id: 0, subnet_id: "{{pub_subnet.results.0.subnet.id}}" }
      - { id: 1, subnet_id: "{{pub_subnet.results.1.subnet.id}}" }
      - { id: 2, subnet_id: "{{pub_subnet.results.2.subnet.id}}" }

When configuring the NAT gateway, Ansible doesn’t wait by default for the creation to finish. This saves time, but it might also cause problems when other objects are created that rely on the NAT gateway. This is the case with the EC2 instances in the private subnets: when they start up, the NAT gateway should be present, because the user_data (“start up script”) of these VMs downloads Apache from the internet. When I ran my deployment script with waiting enabled, there were timeout errors when I used the default timeout of 300 seconds (5 minutes!). When I changed this setting to 600 seconds (10 minutes), the timeout issues were gone.
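What a wait with a timeout does can be sketched as a simple polling loop. The Python code below is not the Ansible module's implementation; the function names are made up, and the 400 seconds that the simulated gateway needs to become available is a hypothetical number chosen to fall between the two timeouts mentioned above. A fake clock is used so that the example finishes immediately.

```python
import time

def wait_until(is_ready, timeout, interval=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False  # the caller would report a timeout error
        sleep(interval)

class FakeClock:
    """Simulated time, so the demo doesn't really wait for minutes."""
    def __init__(self):
        self.now = 0.0
    def time(self):
        return self.now
    def sleep(self, seconds):
        self.now += seconds

def simulate(timeout, ready_after=400.0):
    """Simulate a gateway that becomes available after `ready_after` seconds."""
    fake = FakeClock()
    return wait_until(lambda: fake.now >= ready_after, timeout,
                      interval=10.0, clock=fake.time, sleep=fake.sleep)

print("timeout 300:", simulate(300))  # gives up before the gateway is ready
print("timeout 600:", simulate(600))  # the gateway is ready in time
```

If the resource needs longer than the timeout, the only options are to raise the timeout or to skip waiting and accept that dependent objects may start too early.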

Ansible has other improvements over the default API: in Ansible there is just one object for route tables, which includes the information that in Terraform and CloudFormation is passed in separate route table association objects.

The tag option is missing in some objects in Terraform and Ansible. This can be a big disadvantage when you have multiple deployments in the same VPC.

The way information is passed back from an object differs from object to object. I think that the AWS modules could be improved if Ansible used a more standardised approach: Terraform has a more consistent approach and better documentation. In Ansible, you will quite often need the debug feature to display the results of previous commands.

There are some minor bugs when you use Ansible: when the user_data of a launch configuration is changed, one would expect the launch configuration to be updated or to be deleted and recreated. Ansible doesn’t do this: you have to delete the launch configuration by hand (or with another playbook) to let Ansible recreate it.

In Ansible, there is no default way to “destroy” or “rollback”: you have to create a new playbook to implement the deletion of objects. My first thought was that this would be quite simple: just reverse the order in which objects are created and use “state: absent”. This didn’t work out of the box, because some information is missing. For example, information about the VPC is sometimes needed to be able to destroy an object. To make this possible, there are special _facts modules within the AWS modules to get information about objects that are present.

    - name: Get VPC facts
      ec2_vpc_net_facts:
        region: "{{region}}"
        filters:
          "tag:Name": "{{nameprefix}}-VPC"
      register: vpc

The Elastic IP addresses (EIPs) are also a problem in the deletion: Ansible doesn’t store whether EIPs were created automatically. At deletion, the EIPs are not deleted. This causes problems because each time you roll out new NAT gateways, new EIPs are used. The old EIPs are never reused. Though this can be solved by deleting ALL existing EIPs in the VPC, that can cause problems when you have multiple playbooks that use the same VPC.

Ansible is very slow: the create script takes more than 10 minutes, about twice as long as the creation with CloudFormation.

In Ansible it is also possible to use modules (roles): one can search Ansible Galaxy to find the right one. When the repository is stored on GitHub, one can clone the module into one’s own environment. In Terraform, I found a usable module quite quickly. In Ansible, this took me much more time. The module I found had (about) the same parameters as the one in Terraform. When I tried to create an example using this one, I found out that:

– the module wasn’t thoroughly tested (there was a typo in the max port number in the second subnet, change the number 66535 in the file vars/main.yml into 65535 to fix this)

– the VPC was created using a cidr_block of 10.0.0.0/21 instead of 10.0.0.0/16 (which I asked for)

– there are just two subnets, not three

– there was no option to destroy the objects that were created

– there were no output variables (you will need to call a _facts object after using this module)

– the last commit was in December 2017, so there might or might not be any development in this module

The advantage of modules is that the code can be changed, so if we really would like to use this module, we could create a fork and change it (it is released under GPL-3.0).

    - name: Create VPC
      include_role:
        name: level27tech.aws_vpc
      vars:
        aws_vpc_aws_access_key: "{{aws_access_key}}"
        aws_vpc_aws_secret_key: "{{aws_secret_key}}"
        aws_vpc_region: "{{region}}"
        aws_vpc_cidr_for_access: "{{ip_address_test_pc}}/32"
        aws_vpc_multi_az: true
        aws_vpc_include_private: true
        aws_vpc_vpc_name: "{{nameprefix}}-vpc"
        aws_vpc_public_subnet_1_name: "{{nameprefix}}-public-1"
        aws_vpc_public_subnet_1_cidr: "{{cidr_block | ipsubnet(24,1)}}"
        aws_vpc_public_subnet_2_name: "{{nameprefix}}-public-2"
        aws_vpc_public_subnet_2_cidr: "{{cidr_block | ipsubnet(24,2)}}"
        aws_vpc_private_subnet_1_name: "{{nameprefix}}-private-1"
        aws_vpc_private_subnet_1_cidr: "{{cidr_block | ipsubnet(24,4)}}"
        aws_vpc_private_subnet_2_name: "{{nameprefix}}-private-2"
        aws_vpc_private_subnet_2_cidr: "{{cidr_block | ipsubnet(24,5)}}"
      register: vpc

Changing this module to fit our needs, or spending more time searching for another module that might fit better, was beyond the scope of this blog.

Conclusion

CloudFormation has fewer options and is significantly slower than Terraform. On average for 5 deployments:

Deployment method                    Average duration (min:sec)
CF via web console                   5:25
CF via the AWS CLI                   5:04
TF, not using extra modules          4:11
TF, using extra modules              4:05
Ansible, not using extra modules     10:19
Ansible, using extra modules         n/a

CF = CloudFormation

TF = Terraform

CLI = Command Line Interface
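To put these averages in perspective, the durations can be converted to seconds and compared to the fastest variant without extra modules. The following Python sketch uses only the numbers from the table above; the choice of Terraform without extra modules as the baseline is mine.

```python
def to_seconds(duration):
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = duration.split(":")
    return int(minutes) * 60 + int(seconds)

# Average durations from the table above (the n/a entry is left out).
averages = {
    "CF via web console": "5:25",
    "CF via the AWS CLI": "5:04",
    "TF, not using extra modules": "4:11",
    "TF, using extra modules": "4:05",
    "Ansible, not using extra modules": "10:19",
}

seconds = {name: to_seconds(value) for name, value in averages.items()}
baseline = seconds["TF, not using extra modules"]
relative = {name: round(value / baseline, 2) for name, value in seconds.items()}

for name, factor in relative.items():
    print(f"{name}: {seconds[name]} s, {factor} x Terraform")
```

CloudFormation via the web console takes about 1.3 times as long as Terraform, and Ansible about 2.5 times as long as Terraform (and roughly 1.9 times as long as CloudFormation via the web console).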

The difference in duration is strange: one would expect AWS to be able to translate its own template format into API calls faster than external software that uses the default AWS API. The advantage of CloudFormation over Terraform is that there is no local file with object IDs that can be deleted by accident. It is always possible to delete the whole CloudFormation stack, no matter what happens on the configuration node.

Terraform is much faster than Ansible: this is because Terraform deploys multiple different objects at once, while Ansible deploys them in a predefined order. This can be an advantage when there are many dependencies, but it comes with a (heavy) performance penalty. Ansible doesn’t store IDs locally. That is a disadvantage when deploying infrastructure that doesn’t have labels (like Elastic IP addresses). Ansible doesn’t see (all) differences between the configuration in the playbook and the current deployment in AWS. This makes Terraform a better option than Ansible for the deployment of infrastructure.

The advantage of Terraform and Ansible over CloudFormation is that there are modules that perform more complex tasks with just a few parameters. The code can be downloaded and changed if one wants to do so. The disadvantage is that searching for the right one can be time-consuming and some modules don’t do what you might expect them to do.

Links:

Free tool to draw AWS pictures: https://cloudcraft.co/

Determine your own IP-address: https://whatismyipaddress.com/

Dynamically determine AMI IDs in CloudFormation scripts: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/walkthrough-custom-resources-lambda-lookup-amiids.html

Ansible Galaxy: https://galaxy.ansible.com

Full scripts that were used in this article:

CloudFormation: https://s3-eu-west-1.amazonaws.com/frpublic/AMIS/CF_template_AMIS.json

Terraform:

– tfvars-file: https://s3-eu-west-1.amazonaws.com/frpublic/AMIS/terraform.tfvars

– terraform-file: https://s3-eu-west-1.amazonaws.com/frpublic/AMIS/tf_configfile_AMIS.tf

– terraform-file using a module: https://s3-eu-west-1.amazonaws.com/frpublic/AMIS/tf_configfile_AMIS_module.tf

Ansible:

– creation: https://s3-eu-west-1.amazonaws.com/frpublic/AMIS/ansible_configfile_AMIS.yml

– destroy (*): https://s3-eu-west-1.amazonaws.com/frpublic/AMIS/ansible_configfile_AMIS_destroy.yml

– creation using a module (first clone the module from https://github.com/level27tech/ansible-role-aws_vpc ): https://s3-eu-west-1.amazonaws.com/frpublic/AMIS/ansible_configfile_AMIS_module.yml

– destroy all objects (*) that were created using a module: https://s3-eu-west-1.amazonaws.com/frpublic/AMIS/ansible_configfile_AMIS_module_destroy.yml

(*) Important: the Ansible deletion scripts will not destroy Elastic IP addresses; you will need to do that by hand. Go to Services > EC2 > Elastic IPs, select the IP addresses and choose Actions > Release addresses.

All durations of individual runs:

Download this Excel file, which also contains the commands that were used to create or destroy a solution: https://s3-eu-west-1.amazonaws.com/frpublic/AMIS/Performance.xlsx.