First steps with Docker Checkpoint - to create and restore snapshots of running containers

Docker containers can be stopped and started again. Changes made to the file system in a running container survive this deliberate stop-and-start cycle; data in memory and running processes do not. A container that crashes cannot simply be restarted, and if it can be restarted at all, its file system will be in an undetermined state. When you start a container after it was stopped, it goes through its full startup routine. If heavy-duty processes need to be started, such as a database server process, this startup time can be substantial: many seconds or even dozens of seconds.

Linux has a mechanism called CRIU, or Checkpoint/Restore In Userspace. Using this tool, you can freeze a running application (or part of it) and checkpoint it as a collection of files on disk. You can then use those files to restore the application and run it exactly as it was at the time of the freeze. See https://criu.org/Main_Page for details. Docker CE has (experimental) support for CRIU. This means that with straightforward docker commands we can take a snapshot of a running container (docker checkpoint create <container name> <checkpointname>). At a later moment, we can start this snapshot as the same container (docker start --checkpoint <checkpointname> <container name>) or as a different container.

The container that is started from a checkpoint is in the same state (memory and processes) as the container was in when the checkpoint was created. Additionally, starting the container from the snapshot takes very little time (less than a second); for containers with fairly long startup times, this rapid startup can be a huge boon.

In this article, I describe my initial steps with CRIU and Docker. I got it to work. I did run into an issue with recent versions of Docker CE (17.12 and 18.x), so I reverted to Docker CE 17.04. I also ran into an issue with an older version of CRIU, so I built the latest version of CRIU at the time of writing (3.8.1) instead of using the one shipped with the Ubuntu Xenial 64 distribution (2.6).

I will demonstrate how I start a container that clones a GitHub repository and starts a simple REST API as a Node application; this takes 10 or more seconds. This application counts the number of GET requests it handles (by keeping some state in memory). After handling a number of requests, I create a checkpoint for this container. Next, I make a few more requests, all the while watching the counter increase. Then I stop the container and start it again from the checkpoint. The container is up and running very quickly, within 700 ms, so it clearly leverages the container state captured when the snapshot was created. It continues counting requests from the point where the snapshot was created, apparently inheriting its memory state. Just as expected and desired.

Note: a checkpoint does not capture changes in the file system made in a container. Only the memory state is part of the snapshot.

Note 2: Kubernetes does not yet provide support for checkpoints. That means that a pod cannot start a container from a checkpoint.

In a future article I will describe a use case for these snapshots – in automated test scenarios and complex data sets.

The steps I went through (on my Windows 10 laptop using Vagrant 2.0.3 and VirtualBox 5.2.8):

  • use Vagrant to create an Ubuntu 16.04 LTS (Xenial) VirtualBox VM with Docker CE 18.x
  • downgrade Docker from 18.x to 17.04
  • configure Docker for experimental options
  • install CRIU package
  • try out simple scenario with Docker checkpoint
  • build CRIU latest version
  • try out somewhat more complex scenario with Docker checkpoint (that failed with the older CRIU version)

 

Create Ubuntu 16.04 LTS (Xenial) VirtualBox VM with Docker CE 18.x

My Windows 10 laptop already has Vagrant 2.0.3 and VirtualBox 5.2.8. Using a Vagrantfile, I create the VM that is my Docker host for this experiment.
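
The author's Vagrantfile is not reproduced in this text. As a rough sketch only, a Vagrantfile along the following lines would fit the setup described in this article: the ubuntu/xenial64 box, Vagrant's Docker provisioner, and the address that is used later on to reach the container from the laptop (192.168.188.106). The directory name, the memory size and the choice of a private network are assumptions for illustration and are not taken from the original file.

```shell
# Sketch: generate a Vagrantfile for an Ubuntu 16.04 Docker host.
# VM_DIR, the memory size and the network type are assumptions.
VM_DIR="${VM_DIR:-docker-criu-vm}"
mkdir -p "$VM_DIR"

cat > "$VM_DIR/Vagrantfile" <<'EOF'
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/xenial64"
  # Address used later in the article to reach the container from the laptop
  config.vm.network "private_network", ip: "192.168.188.106"
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 2048   # assumed value
  end
  # Let Vagrant install Docker in the VM
  config.vm.provision "docker"
end
EOF

echo "Wrote $VM_DIR/Vagrantfile"
```

Running vagrant up from inside that directory would then create and provision the VM.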

 

After creating (and starting) the VM with

vagrant up

I connect into the VM with

vagrant ssh

ending up at the command prompt, ready for action.

And, just to make sure we are pretty much up to date, I run

sudo apt-get upgrade

Downgrade Docker CE to Release 17.04

At the time of writing there is an issue with recent Docker versions (at least 17.09 and higher; see https://github.com/moby/moby/issues/35691). For that reason I downgrade to version 17.04, as described here: https://forums.docker.com/t/how-to-downgrade-docker-to-a-specific-version/29523/4.

First remove the version of Docker installed by the Vagrant provisioner:

sudo apt-get autoremove -y docker-ce \
&& sudo apt-get purge docker-ce -y \
&& sudo rm -rf /etc/docker/ \
&& sudo rm -f /etc/systemd/system/multi-user.target.wants/docker.service \
&& sudo rm -rf /var/lib/docker \
&&  sudo systemctl daemon-reload

Then list the available versions and install the desired one:

sudo apt-cache policy docker-ce

sudo apt-get install -y docker-ce=17.04.0~ce-0~ubuntu-xenial
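
Instead of reading the version string off the screen, the matching package version can be picked out of the listing that apt-cache madison docker-ce prints. The following is a minimal sketch: the helper name pick_docker_version is made up for this example, and it assumes the usual three-column "package | version | repository" layout of the madison output.

```shell
# Print the first docker-ce package version that starts with the given prefix.
# Reads `apt-cache madison docker-ce` output on stdin.
pick_docker_version() {
  awk -F'|' -v prefix="$1" '
    { gsub(/[ \t]/, "", $2) }              # strip padding around the version column
    index($2, prefix) == 1 { print $2; exit }
  '
}

# Usage on the Docker host (not run here):
#   version=$(apt-cache madison docker-ce | pick_docker_version 17.04)
#   sudo apt-get install -y "docker-ce=$version"
#   sudo apt-mark hold docker-ce   # keep a later apt-get upgrade from undoing the downgrade
```

The apt-mark hold step is optional; it is suggested here because a later apt-get upgrade would otherwise pull in a newer Docker release again.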

 

    Configure Docker for experimental options

    Support for checkpoints leveraging CRIU is an experimental feature in Docker. In order to make use of it, the experimental options have to be enabled, as described in https://stackoverflow.com/questions/44346322/how-to-run-docker-with-experimental-functions-on-ubuntu-16-04. Open (or create) the Docker daemon configuration file:

     

    sudo nano /etc/docker/daemon.json
    

    Add the following content:

    {
    "experimental": true
    }
    

    Press CTRL+X, select Y and press Enter to save the new file.

    restart the docker service:

    sudo service docker restart
    

    Check whether experimental features are indeed enabled:

    docker version
    

    The Server section of the output should list Experimental: true.
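
The same configuration step can also be done without an interactive editor. What follows is a minimal sketch, not the procedure used in this article: the helper names write_experimental_config and is_experimental are invented for the example, the config helper overwrites its target file, and the check simply looks for any "Experimental: true" line in the docker version output without telling client and server apart.

```shell
# Write a daemon.json that enables experimental features.
# Note: this overwrites the target file, so merge by hand if one already exists.
write_experimental_config() {
  printf '{\n  "experimental": true\n}\n' > "$1"
}

# Succeed if `docker version` output (on stdin) reports experimental mode.
is_experimental() {
  grep -Eq 'Experimental:[[:space:]]*true'
}

# Usage on the Docker host (not run here):
#   write_experimental_config /tmp/daemon.json
#   sudo install -m 0644 /tmp/daemon.json /etc/docker/daemon.json
#   sudo service docker restart
#   docker version | is_experimental && echo "experimental features enabled"
```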

     

    Install CRIU package

    The simple approach, and how it should work, is to install the CRIU package from the distribution:

    sudo apt-get install criu
    

    (see for example https://yipee.io/2017/06/saving-and-restoring-container-state-with-criu/)

    For me, this installs version 2.6 of CRIU. That proves sufficient for some actions, but not for others.

    Try out simple scenario with Docker checkpoint on CRIU

    At this point we have Docker 17.04 on Ubuntu 16.04 with CRIU 2.6. That combination can give us a first feel for what the Docker Checkpoint mechanism entails.

    Run a simple container that writes a counter value to the console once every second (and then increases the counter):

    docker run --security-opt=seccomp:unconfined --name cr -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
    

    check on the values:

    docker logs cr
    

    create a checkpoint for the container:

    docker checkpoint create  --leave-running=true cr checkpoint0
    

    leave the container running for a while and check the logs again

    docker logs cr
    

    now stop the container:

    docker stop cr
    

    and restart/recreate the container from the checkpoint:

    docker start --checkpoint checkpoint0 cr
    

    Check the logs:

    docker logs cr
    

    You will find that the log resumes at the value (19) at which the checkpoint was created.
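
The checkpoint, stop and restore steps above can be bundled into one small helper. This is a sketch under stated assumptions: the function name checkpoint_and_restore and the DOCKER variable are invented for this example; only the three docker commands themselves come from the steps above. Making the CLI overridable allows a dry run with a stub that just records the calls.

```shell
# Docker CLI to call; overridable so the flow can be dry-run with a stub.
DOCKER="${DOCKER:-docker}"

# Checkpoint a running container, stop it, and restore it from the checkpoint.
# Stops at the first step that fails.
checkpoint_and_restore() {
  container="$1"
  checkpoint="$2"
  "$DOCKER" checkpoint create --leave-running=true "$container" "$checkpoint" &&
  "$DOCKER" stop "$container" &&
  "$DOCKER" start --checkpoint "$checkpoint" "$container"
}

# Usage on the Docker host (not run here):
#   checkpoint_and_restore cr checkpoint0
#   docker logs cr | tail -n 5
```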

     

    Build CRIU latest version

    When I tried a more complex scenario (see next section), I ran into an issue with CRIU 2.6. I could work around it by building the latest version of CRIU on my Ubuntu Docker host. Here are the steps I went through to accomplish that, following these instructions: https://criu.org/Installation.

    First, remove the currently installed CRIU package:

    sudo apt-get autoremove -y criu \
    && sudo apt-get purge criu -y
    

    Then, prepare the build environment:

    sudo apt-get install build-essential \
    && sudo apt-get install gcc   \
    && sudo apt-get install libprotobuf-dev libprotobuf-c0-dev protobuf-c-compiler protobuf-compiler python-protobuf \
    && sudo apt-get install pkg-config python-ipaddr iproute2 libcap-dev  libnl-3-dev libnet-dev --no-install-recommends
    

    Next, clone the GitHub repository for CRIU:

    git clone https://github.com/checkpoint-restore/criu
    

    Navigate into the criu directory that contains the code base:

    cd criu
    

    and build the criu package:

    make
    

    When make is done, I can run the CRIU check:

    sudo ./criu/criu check
    

    to see if the installation is successful. The final message printed should be: Looks Good (despite perhaps one or more warnings).

    Use

    sudo ./criu/criu -V
    

    to learn about the version of CRIU that is currently installed.
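
Because the packaged CRIU 2.6 was not enough for the more complex scenario, it can be handy to test the version in a script. The following is a minimal sketch: the helper names are invented, the parsing assumes the version is printed as a line of the form "Version: 3.8.1", and the comparison relies on the -V (version sort) option of GNU sort. The threshold in the usage comment is an arbitrary example; this article only establishes that 2.6 failed and 3.8.1 worked.

```shell
# Succeed if version $1 is greater than or equal to version $2 (uses GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extract the version number from `criu -V` output such as "Version: 3.8.1".
criu_version() {
  awk '/^Version:/ { print $2; exit }'
}

# Usage on the Docker host (not run here); 3.0 is an arbitrary example threshold:
#   installed=$(sudo ./criu/criu -V | criu_version)
#   version_ge "$installed" 3.0 || echo "CRIU $installed may be too old" >&2
```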

    Note: the CRIU instructions describe the following steps to install criu system-wide. This does not seem to be needed in order for Docker to leverage CRIU from the docker checkpoint commands.

    sudo apt-get install asciidoc  xmlto
    sudo make install
    criu check
    

    Now we are ready to take on the more complex scenario that failed before with an issue in the older CRIU version.

    A more complex scenario with Docker Checkpoint

    This scenario failed with the older CRIU version (2.6). Building the latest version of CRIU on my Ubuntu Docker host, as described in the previous section, resolved that.

      In this case, I run a container based on a Docker container image that can run any Node application downloaded from a GitHub repository. The Node application that the container downloads and runs handles simple HTTP GET requests: it counts requests and returns the value of the counter as the response. This container image and this application were introduced in an earlier article: https://technology.amis.nl/2017/05/21/running-node-js-applications-from-github-in-generic-docker-container/

      Here you see the command to run the container, which is called reqctr2:

      docker run --name reqctr2 -e "GIT_URL=https://github.com/lucasjellema/microservices-choreography-kubernetes-workshop-june2017" -e "APP_PORT=8080" -p 8005:8080 -e "APP_HOME=part1"  -e "APP_STARTUP=requestCounter.js"   lucasjellema/node-app-runner
      

      It takes about 15 seconds for the application to start up and handle requests.

      Once the container is running, requests can be sent from outside the VM – from a browser running on my laptop for example – to be handled  by the container, at http://192.168.188.106:8005/.

      After a number of requests, the counter is at 21.

      At this point, I create a checkpoint for the container:

      docker checkpoint create  --leave-running=true reqctr2 checkpoint1
      

      I now make a few additional requests in the browser, bringing the counter to a higher value.

      At this point, I stop the container and subsequently start it again from the checkpoint:

      docker stop reqctr2
      docker start --checkpoint checkpoint1 reqctr2
      

      It takes less than a second for the container to continue running.
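
To put a number on the restore time, the start command can be wrapped in a small timer. This is a sketch only: the helper name elapsed_ms is invented, it depends on the %N (nanoseconds) format of GNU date, and it measures how long the docker command takes to return, which is not necessarily the moment the application answers its first request.

```shell
# Run a command and print how long it took in milliseconds.
# The command's own stdout is discarded. Relies on GNU date's %N (nanoseconds).
elapsed_ms() {
  start=$(date +%s%N)
  "$@" > /dev/null
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# Usage on the Docker host (not run here):
#   docker stop reqctr2
#   elapsed_ms docker start --checkpoint checkpoint1 reqctr2
```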

      When I make a new request, I do not get 1 as the value (which would be the result from a fresh container), nor do I get 43 (the result I would get if the previous container were still running). Instead, I get the next value counting from the state of the container that was captured in the snapshot.

      Note: because I make the GET request from the browser, and the browser also tries to retrieve the favicon, the counter increases by two every time I press refresh in the browser.
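
A command-line client avoids the favicon side effect, because it only requests the URL it is given. The following is a minimal sketch: the helper name send_requests and the CURL variable are invented for this example, and it assumes the application returns the counter in the response body (the exact response format is not shown in this article).

```shell
# HTTP client to call; overridable so the loop can be tried with a stub.
CURL="${CURL:-curl}"

# Send N GET requests to URL and print each response body on its own line.
# Unlike a browser, curl does not also fetch /favicon.ico.
send_requests() {
  url="$1"
  count="$2"
  i=0
  while [ "$i" -lt "$count" ]; do
    "$CURL" -s "$url"
    echo
    i=$((i + 1))
  done
}

# Usage from the laptop (not run here):
#   send_requests http://192.168.188.106:8005/ 3
```

With this loop, each request should raise the counter by one instead of two.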

      Note: I can get a list of all checkpoints that have been created for a container. Clearly, I should put some more effort into a naming convention for those checkpoints:

      docker checkpoint ls reqctr2
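
One possible naming convention, offered purely as an illustration, is to combine the container name with a UTC timestamp, so that the checkpoint list sorts chronologically and shows which container each checkpoint belongs to. The helper name checkpoint_name is invented for this sketch.

```shell
# Build a checkpoint name from the container name and a UTC timestamp,
# in the form <container>-<YYYYMMDD>T<HHMMSS>Z.
checkpoint_name() {
  printf '%s-%s\n' "$1" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Usage on the Docker host (not run here):
#   docker checkpoint create --leave-running=true reqctr2 "$(checkpoint_name reqctr2)"
#   docker checkpoint ls reqctr2
```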
      

      The flow I went through in this scenario is summarized below.

      The starting point: a Windows laptop with Vagrant and VirtualBox. A VM has been created by Vagrant with Docker inside. The correct versions of Docker and of the CRIU package have been set up.

      Then these steps are run through:

      1. Start Docker container based on an image with Node JS runtime
      2. Clone GitHub Repository containing a Node JS application
      3. Run the Node JS application – ready for HTTP Requests
      4. Handle HTTP Requests from a browser on the Windows Host machine
      5. Create a Docker Checkpoint for the container – a snapshot of the container state
      6. The checkpoint is saved on the Docker Host – ready for later use
      7. Start a container from the checkpoint. This container starts instantaneously, no GitHub clone and application startup are required; it resumes from the state at the time of creating the checkpoint
      8. The container handles HTTP requests – just like its checkpointed predecessor

       

      Resources

      Sources are in this GitHub repo: https://github.com/lucasjellema/docker-checkpoint-first-steps

      Article on CRIU: http://www.admin-magazine.com/Archive/2014/22/Save-and-Restore-Linux-Processes-with-CRIU

      Also: on CRIU and Docker: https://yipee.io/2017/06/saving-and-restoring-container-state-with-criu/.

      Docs on Checkpoint and Restore in Docker: https://github.com/docker/cli/blob/master/experimental/checkpoint-restore.md

       

      Home of CRIU: https://criu.org/Main_Page; page on Docker support: https://criu.org/Docker; install CRIU package on Ubuntu: https://criu.org/Packages#Ubuntu

      Install and Build CRIU Sources: https://criu.org/Installation

       

      Docs on Vagrant's Docker provisioning: https://www.vagrantup.com/docs/provisioning/docker.html

      Article on downgrading Docker : https://forums.docker.com/t/how-to-downgrade-docker-to-a-specific-version/29523/4

      Configure Docker for experimental options: https://stackoverflow.com/questions/44346322/how-to-run-docker-with-experimental-functions-on-ubuntu-16-04

      Issue with Docker and Checkpoints (at least in 17.09-18.03): https://github.com/moby/moby/issues/35691
