<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[DevOps Diaries]]></title><description><![CDATA[Some dev, some ops, some randomness]]></description><link>https://abakonski.com/</link><image><url>http://abakonski.com/favicon.png</url><title>DevOps Diaries</title><link>https://abakonski.com/</link></image><generator>Ghost 2.1</generator><lastBuildDate>Sat, 18 Apr 2026 19:20:38 GMT</lastBuildDate><atom:link href="https://abakonski.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Resolving 502 and 504 errors in ECS apps]]></title><description><![CDATA[A while ago, I noticed a significant number of 502 and 504 errors in our production APIs running on AWS. This is a story of resolving those.]]></description><link>https://abakonski.com/resolving-502-and-504-errors-in-elastic/</link><guid isPermaLink="false">5fe2da16b29bb943a9e9a264</guid><category><![CDATA[ECS]]></category><category><![CDATA[502]]></category><category><![CDATA[504]]></category><category><![CDATA[timeout]]></category><dc:creator><![CDATA[Andrzej Bakonski]]></dc:creator><pubDate>Wed, 23 Dec 2020 06:31:27 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1555861496-0666c8981751?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MXwxMTc3M3wwfDF8c2VhcmNofDF8fGVycm9yfGVufDB8fHw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1555861496-0666c8981751?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MXwxMTc3M3wwfDF8c2VhcmNofDF8fGVycm9yfGVufDB8fHw&ixlib=rb-1.2.1&q=80&w=1080" alt="Resolving 502 and 504 errors in ECS apps"><p><strong>A while ago, I noticed a significant number of 502 and 504 errors in our production APIs running on AWS. 
This is a story of resolving those.</strong></p><p>Our setup:</p><ul><li>RESTful API running behind an Apache server</li><li>Elastic Container Services (Fargate)</li><li>Application Load Balancer directing traffic to the ECS containers</li></ul><p>What I learned after hours (days?) of investigation, is that everything came down to various timeout settings throughout the environment. It's important to align timeout values in a number of areas within an ECS cluster to avoid issues such as gateway timeouts and inaccessible backend server errors.</p><h2 id="tldr">TLDR</h2><p>Possible causes of 502 errors:</p><ul><li>backend server keep-alive timeout is shorter than the load balancer's idle timeout (load balancer tries to reuse an already closed connection to backend server)</li></ul><p>Possible causes of 504 errors:</p><ul><li>load balancer's idle timeout is too short (long-running task is timed out)</li><li>target group deregistration delay is too short (long-running task is timed out)</li></ul><h2 id="application-load-balancer-idle-timeout">Application Load Balancer Idle Timeout</h2><p>The <em>Idle Timeout</em> on the ALB will affect how long a connection to a client will remain open while no data is being sent. In our case, this would happen when a long-running API call is being processed and the client is waiting for a response. On the API, we needed to set this to <strong>120 secs</strong> to allow plenty of time for typical long-running tasks to complete. </p><p>While the application is processing a request, there is no data being sent between the load balancer and the client and so the connection is considered idle. 
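</p><p>On an existing ALB, this setting can be changed with the AWS CLI; here's a minimal sketch, with a placeholder load balancer ARN:</p>

```shell
# Raise the ALB idle timeout to 120 seconds (the ARN is a placeholder)
aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn arn:aws:elasticloadbalancing:us-west-1:xxxxxxxxxxxx:loadbalancer/app/my-alb/1234567890abcdef \
    --attributes Key=idle_timeout.timeout_seconds,Value=120
```

<p>The same attribute is also editable in the AWS console on the load balancer's description tab.</p><p>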
If no data is sent in either direction before the idle timeout is hit, while the request is still being processed and client waiting for a response, <strong>the ALB will throw a 504 error</strong>.</p><h2 id="apache-nginx-keep-alive-timeout">Apache/NginX Keep-Alive Timeout</h2><p>Keep-alives on backend servers like Apache and NginX allow a client to reuse an open connection to the server to make multiple requests. By default, Apache has keep-alives turned on and the keep-alive timeout is set to 5 seconds. These are typically the recommended settings because:</p><ul><li>Having keep-alives enabled improves performance by not requiring new connections to be open for each request that a client makes (as long as these requests come frequently enough)</li><li>5 seconds ensures that connections aren't left open too long and unnecessarily block new connections from being made once the limit of concurrent connections has been reached. By default there is a limit of 100 concurrent keep-alive connections (additional non keep-alive connections can be made beyond this)</li><li>Each connection takes up memory on the server, so having the keep-alive timeout set low (5 secs) ensures that these are cycled in and out frequently and memory usage remains under control</li></ul><p>With the load balanced scenario, things are a little different. Instead of each client connecting directly to Apache, the connection from the client is made to the load balancer and the load balancer makes a connection to the backend server. In the case of AWS, connections from the ALB to the backend server are reused for multiple clients/requests. This means that one single keep-alive connection made from the ALB to Apache can serve multiple client connections to the ALB. Also the same client's multiple connections can be served in multiple backend connections, even to different backend servers. 
This has the following implications:</p><ul><li>Connections to the backend server (Apache) will be limited to the number of connections that ALB will open to the server, not by the number of clients hitting the ALB. The load balancer is good at spreading traffic across multiple backend servers, so each one will get a more limited number of keep-alive connections</li><li>Because connections between the ALB and the backend server can be reused for multiple client connections, keep-alive connections won't be wastefully staying open if a large keep-alive timeout is set on the backend server, i.e. Client B's request can be served by the same keep-alive connection that was created to serve Client A's request, etc</li><li>If a keep-alive timeout is set to a lower value than the ALB's idle timeout, there will inevitably be times when no data is sent on an open keep-alive connection in the backend server until the timeout is hit and the backend server will close that connection. The problem with this is that until the idle timeout has been hit on the ALB, it will try to reuse the same keep-alive connection that it opened with the backend server. When it learns that the connection has already been closed by Apache/NginX, <strong>the ALB will throw a 502 error</strong></li></ul><p>With the above in mind, it's best to set the <strong>keep-alive timeout on the backend</strong> servers to be <strong>larger than the idle timeout on the ALB</strong>. This will prevent the backend server from closing connections that the ALB is expecting to still be open. Since it is the ALB and not each client that opens these connections to the backend server, it is safe to increase the keep-alive timeout. 
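</p><p>As a concrete sketch (assuming a stock Apache or NginX configuration, and the 120-sec ALB idle timeout used in this example), the relevant directives are:</p>

```
# Apache (httpd.conf) -- keep-alive timeout must exceed the ALB idle timeout
KeepAlive On
KeepAliveTimeout 150

# NginX equivalent (http block in nginx.conf)
keepalive_timeout 150s;
```

<p>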
So if you set the idle timeout on the load balancer to 120 secs, you could set it on Apache/NginX to <strong>150 secs</strong> (longer to provide a buffer).</p><h2 id="apache-nginx-connection-timeout">Apache/NginX Connection Timeout</h2><p>There is another directive in backend servers that's important here - connection timeout. This should also be set long enough to handle all appropriate calls to the server. Unlike the other 2 types of timeouts above, this one doesn't care whether data is being sent to the server or not. Let's say there's an endpoint that serves data for 45 secs. If the connection timeout is set to 30 secs, the stream will be dropped at 30 secs. In the API, Apache is set to the <strong>default 5 min connection timeout</strong>, which is more than sufficient for the types of requests that are served in our example scenario. Namely, it easily covers the 120 secs idle timeout and any additional non-idle time on typical requests.</p><h2 id="target-group-deregistration-delay">Target Group Deregistration Delay</h2><p>One last type of timeout that needs to be catered for is the deregistration delay on the Target Group in AWS. This is important when dealing with ECS, which will deregister old targets and register new ones when a code deployment is made (in a "rolling update" deployment type). When a deployment is initiated, ECS will start up and perform health checks on new containers. If all is well, it will add the new containers as targets to the target group and mark the old targets as "draining". While in the draining state, no NEW connections will be made to those targets, but existing connections will be allowed to remain open until their requests complete OR the deregistration delay time has been hit. If the deregistration delay was set to 0 secs, these targets would be deregistered and connections to them would be closed immediately once they entered the draining state. 
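</p><p>The deregistration delay is an attribute of the Target Group; as a sketch (with a placeholder target group ARN), it can be adjusted with the AWS CLI:</p>

```shell
# Allow draining targets up to 150 seconds to finish in-flight requests
# (the target group ARN is a placeholder)
aws elbv2 modify-target-group-attributes \
    --target-group-arn arn:aws:elasticloadbalancing:us-west-1:xxxxxxxxxxxx:targetgroup/flask-app-tg/1234567890abcdef \
    --attributes Key=deregistration_delay.timeout_seconds,Value=150
```

<p>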
If connections are closed in this fashion, <strong>the ALB will throw a 504 error</strong>.</p><p>As such, it's important to set the deregistration delay to a high enough value that all existing, appropriate requests should have enough time to complete. It's also a good idea to keep this value low enough that these old targets won't hang around unnecessarily. In our example scenario the deregistration delay could be set to <strong>150 secs</strong>, which is more than the 120 secs idle timeout on the ALB. During this deregistration period, all processes will either complete or be cancelled by the ALB by the 150 secs mark.</p>]]></content:encoded></item><item><title><![CDATA[Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 2 of 2)]]></title><description><![CDATA[<p><strong>In the first part of this article series I walked through setting up a cluster in Amazon ECS. This also includes leveraging Amazon ECR as the image repository that will be used for deploying new versions of our sample app to the cluster. 
In this post, I'm going to walk</strong></p>]]></description><link>https://abakonski.com/creating-a-continuous-deployment-pipeline-using-bitbucket-jenkins-and-amazon-ecs-part-2-of-2/</link><guid isPermaLink="false">5c1c2e8ab29bb943a9e99fbe</guid><category><![CDATA[Continuous Deployment]]></category><category><![CDATA[docker]]></category><category><![CDATA[ci/cd]]></category><category><![CDATA[Amazon]]></category><category><![CDATA[ECS]]></category><category><![CDATA[Jenkins]]></category><category><![CDATA[BitBucket]]></category><category><![CDATA[Fargate]]></category><category><![CDATA[EC2]]></category><dc:creator><![CDATA[Andrzej Bakonski]]></dc:creator><pubDate>Thu, 12 Mar 2020 10:13:55 GMT</pubDate><media:content url="https://abakonski.com/content/images/2020/03/docker_ecs.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://abakonski.com/content/images/2020/03/docker_ecs.jpg" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 2 of 2)"><p><strong>In the first part of this article series I walked through setting up a cluster in Amazon ECS. This also includes leveraging Amazon ECR as the image repository that will be used for deploying new versions of our sample app to the cluster. In this post, I'm going to walk through the Jenkins configuration required so that a merge in BitBucket automatically triggers a Jenkins job that in turn deploys a new build in Amazon ECS.</strong></p><p><strong>I've already covered the process of setting up BitBucket and Jenkins in a <a href="https://www.abakonski.com/creating-a-continuous-deployment-pipeline-with-bitbucket-jenkins-and-azure-part-3-of-3/">previous article</a>, so this article will only fill in the gaps that are specific to this ECS setup. 
If you haven't already set up a Jenkins server and configured the BitBucket hook, first do so by following the instructions in that article, and then come back to complete the Jenkins setup outlined below.</strong></p><h1 id="configuring-the-required-credentials-in-jenkins">Configuring the required credentials in Jenkins</h1><p>By now you should have a Jenkins server running with all the appropriate plugins installed. You should also have Docker installed on the same machine.</p><p>In the first article in this series, we discussed the IAM credentials that will be used to push new Docker images to Amazon ECR. I used my admin IAM credentials in the initial setup, but for this part of the process I very strongly recommend creating an IAM user with ECR write permissions for just the repository that Jenkins needs to write to. As a rule, I create a new IAM user for each isolated process and I give it only the permissions that it needs to do its job. It's a really bad idea to create open IAM users and reuse them across multiple processes, multiple code bases, etc. 
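</p><p>As a sketch, such a scoped user might be created with the AWS CLI. The user and policy names here are hypothetical, and the policy document referenced is the one shown in part 1 of this series:</p>

```shell
# Create a dedicated IAM user for Jenkins pushes (names are hypothetical)
aws iam create-user --user-name jenkins-ecr-push

# Attach an inline policy scoped to the one ECR repository Jenkins writes to
aws iam put-user-policy --user-name jenkins-ecr-push \
    --policy-name ecr-flask-app-push \
    --policy-document file://jenkins-ecr-policy.json

# Generate the access key pair that will be stored in the Jenkins credentials below
aws iam create-access-key --user-name jenkins-ecr-push
```

<p>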
This goes for team members as well - each gets their own login with only the permissions that they need to do their job.</p><p><strong>1.</strong> Log into the Jenkins admin</p><p><strong>2.</strong> On the home screen, select Credentials, then System, then click the Global credentials domain</p><p><strong>3.</strong> Click Add credentials to create the ECR credentials:</p><ul><li><strong>Kind:</strong> "Username with password"</li><li><strong>Scope:</strong> "Global"</li><li><strong>Username:</strong> <em>Access key ID</em> of the IAM user that has ECR write permissions on the "flask-app" repository</li><li><strong>Password:</strong> <em>Secret access key</em> of the IAM user that has ECR write permissions on the "flask-app" repository</li><li><strong>ID:</strong> "ecr-flask-app"</li></ul><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/12/jenkins_01-1.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 2 of 2)"></figure><h1 id="configuring-the-build-jobs-in-jenkins">Configuring the build jobs in Jenkins</h1><p>As I mentioned in <a href="https://www.abakonski.com/creating-a-continuous-deployment-pipeline-with-bitbucket-jenkins-and-azure-part-3-of-3/">this article</a>, there is currently a limitation in Jenkins: there is no way to create a Pipeline job that is triggered by a BitBucket push on any branch other than <em>master</em>. Because we tend to do multiple stages of deployments (any combination of alpha, beta, QA, staging, production, etc.), this limitation isn't acceptable. The workaround is to use a two-job process. The first job is a <em>Freestyle Job</em>, which can be configured to run when a merge happens on any branch. 
This job will simply act as a trigger for the <em>Pipeline Job</em>, which will do the actual building and deploying to ECS.</p><p>I've outlined the process of creating the Freestyle and Pipeline jobs <a href="https://www.abakonski.com/creating-a-continuous-deployment-pipeline-with-bitbucket-jenkins-and-azure-part-3-of-3/">here</a>. Follow the same steps to build your Freestyle job, using your appropriate BitBucket repo address and branch.</p><p>The next step is to configure the Pipeline job. Again, you can follow the instructions in the same article. The obvious difference here will be the Pipeline script. You can use my <a href="https://github.com/abakonski/bitbucket-jenkins-ecs-pipeline/blob/master/Jenkinsfile">example Groovy script</a>. Just make sure to modify it to suit your environment first. The main things you will definitely need to change are the <code>imageName</code>, <code>clusterName</code>, <code>serviceName</code>, <code>taskFamily</code> and <code>desiredCount</code> variables at the top of the script.</p><p>If you read through the script, you'll notice that there are two deployment files that you will need to include in your repo for your Pipeline job to succeed:</p><ul><li><a href="https://github.com/abakonski/bitbucket-jenkins-ecs-pipeline/blob/master/ecs-task.json">ecs-task.json</a> - this is the Task Definition file. You will need to update at a minimum the containerDefinitions.image value to represent the image in your own ECR, but keep the <code>:%BUILD_NUMBER%</code> suffix. We use this to choose the correct image tag/version in the Jenkins build process.</li><li><a href="https://github.com/abakonski/bitbucket-jenkins-ecs-pipeline/blob/master/ecs-wait.sh">ecs-wait.sh</a> - this is a script I wrote to monitor when ECS has fully cut over to your new deployment. It will wait a maximum of 10 minutes for ECS to start your new task version(s) and enter a stable state. 
If those 10 minutes elapse without your new version reaching a stable state, an error message will be sent to your chosen Slack channel. If it succeeds, you'll get a success message instead.<br><br>This step is optional, but it helps me a lot in terms of knowing when deployments complete and whether they've succeeded or failed.<br><br>One thing you should be aware of with the 10-minute cap on this process is that there are a number of factors that could make your cutovers take long enough that you'll never get a stable release in that time period. The place to start troubleshooting this is the "Deregistration delay" setting on your Target Group (go to EC2 &gt; Load Balancing &gt; Target Groups in the AWS console).</li></ul><h2 id="testing-the-pipeline">Testing the pipeline</h2><p>The final step is naturally going to be to test your new pipeline. I've outlined testing in my <a href="https://abakonski.com/creating-a-continuous-deployment-pipeline-with-bitbucket-jenkins-and-azure-part-3-of-3/">previous article</a> as well, and it covers a common issue with a build error that you may encounter at this point.</p><p>Once you've successfully tested the pipeline by building your Freestyle job directly from Jenkins, your BitBucket merges into the branch you configured above will trigger this job automatically.</p><h2 id="final-words">Final words</h2><p>I really hope you've found this article series helpful. Since I wrote it, Amazon has released the Fargate launch type in ECS. I migrated all my ECS clusters to Fargate a while ago and so far I'm loving it. I found the EC2 launch type too difficult to control efficiently in terms of resource management and task placement strategies. I was also noticing a lot of intermittent ECS Agent failures on the EC2 instances and would end up needing to regularly cycle EC2 instances out. 
</p><p>I'm finding Fargate to be much more efficient with resources, much easier to scale and I'm experiencing virtually no failures as far as ECS goes. Of course there are code and server setup issues that creep in here and there, but I can hardly blame Amazon for that!</p><p>If you're interested in trying Fargate out instead of EC2 launch type, these 2 posts are surprisingly close as far as a guide to setting that pipeline up. Because they're so similar, I'm not planning to write up a guide on a similar pipeline with Fargate. For now I'll let you discover the few nuances that differ between the two launch types.</p>]]></content:encoded></item><item><title><![CDATA[Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)]]></title><description><![CDATA[<p><strong>This is the first part of a two part series of articles. This first article outlines how to configure an ECS cluster that runs a very simple python-based web application. The second article in this series will outline the process of configuring a full Continuous Deployment pipeline that starts with</strong></p>]]></description><link>https://abakonski.com/creating-a-continuous-deployment-pipeline-using-bitbucket-jenkins-and-amazon-ecs-part-1-of-2/</link><guid isPermaLink="false">5bb19a6cb29bb943a9e99ef5</guid><category><![CDATA[Continuous Deployment]]></category><category><![CDATA[docker]]></category><category><![CDATA[python]]></category><category><![CDATA[ci/cd]]></category><category><![CDATA[Amazon]]></category><category><![CDATA[ECS]]></category><category><![CDATA[ECR]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Andrzej Bakonski]]></dc:creator><pubDate>Sat, 10 Nov 2018 20:24:00 GMT</pubDate><media:content url="https://abakonski.com/content/images/2018/11/docker_ecs.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://abakonski.com/content/images/2018/11/docker_ecs.jpg" alt="Creating a Continuous Deployment Pipeline using 
BitBucket, Jenkins and Amazon ECS (part 1 of 2)"><p><strong>This is the first part of a two part series of articles. This first article outlines how to configure an ECS cluster that runs a very simple python-based web application. The second article in this series will outline the process of configuring a full Continuous Deployment pipeline that starts with a BitBucket merge and uses a Jenkins job to deploy changes to the ECS cluster.</strong></p><p>Note that several steps in this article are Mac OS X specific and will need to be adapted for Windows and Linux operating systems.</p><h1 id="prerequisites">Prerequisites</h1><h2 id="install-aws-cli-tools">Install AWS CLI Tools</h2><p>Follow the instructions here: <strong><a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-install-macos.html#install-bundle-macos">https://docs.aws.amazon.com/cli/latest/userguide/cli-install-macos.html#install-bundle-macos</a></strong></p><p>The AWS CLI tools provide a way to control virtually everything in AWS in lieu of the AWS Console. They're really quite powerful and useful in automating tasks. The AWS CLI tools run on Python 2 and 3. Make sure you have Python installed on your machine, as well as <em>pip</em>. Once you've got that going, all you need to do is run the following command:</p><pre><code>$ pip install awscli --upgrade --user</code></pre><p>Once the installer has completed, run the following command to make sure the process worked. It should print out the version information for the AWS CLI tools on your machine.</p><pre><code>$ aws --version</code></pre><h2 id="configure-aws-cli-tools">Configure AWS CLI Tools</h2><p>In this step we're going to configure the AWS CLI tools with the appropriate IAM permissions for pushing Docker images to ECR. </p><p>For the following steps in the process, I'm using these tools on my personal laptop so I'm going to use my admin credentials for my test AWS account. 
This is usually not a good idea when working with production systems, especially once we start working with a Jenkins server in the second post in this series. </p><p>For the sake of security, I recommend creating a new IAM user specifically for the purpose of pushing Docker images to ECR. At minimum, the IAM user will need certain ECS permissions and ECR write permissions, ideally restricted to the appropriate ECR repository. Here's a policy that will have all the required permissions for the initial push to ECR as well as the automated Jenkins processes contained in part 2 of this series:</p><pre><code>{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ECR_Perms",
            "Effect": "Allow",
            "Action": [
                "ecr:CompleteLayerUpload",
                "ecr:DescribeImages",
                "ecr:UploadLayerPart",
                "ecr:ListImages",
                "ecr:InitiateLayerUpload",
                "ecr:BatchCheckLayerAvailability",
                "ecr:PutImage"
            ],
            "Resource": "arn:aws:ecr:us-west-1:xxxxxxxxxxxx:repository/flask-app"
        },
        {
            "Sid": "ECS_Perms",
            "Effect": "Allow",
            "Action": [
                "ecs:UpdateService",
                "ecs:RegisterTaskDefinition",
                "ecr:GetAuthorizationToken",
                "ecs:DescribeServices",
                "ecs:DescribeTaskDefinition"
            ],
            "Resource": "*"
        }
    ]
}</code></pre><p>Once you've got an IAM user set up with the appropriate permissions, run the following command and follow the prompts to configure the AWS CLI tools:</p><pre><code>$ aws configure</code></pre><h1 id="set-up-elastic-container-registry-ecr-repository">Set up Elastic Container Registry (ECR) Repository</h1><p><strong>1.</strong> Open the <em>Elastic Container Service</em> dashboard in the AWS console and select <em>Repositories</em> from the left hand menu</p><p><strong>2.</strong> Enter "flask-app" for the repository name and click the <em>Next step</em> button</p><p><strong>3.</strong> You will be presented with instructions on how to push an image to your new repository:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/ecr_01-2.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p>Next, we'll push an initial Docker image to our new ECR to allow creation of the task definition and service through the AWS console.</p><p>In this example I'm going to use a sample project that I created for one of my earlier articles. The project can be found here: <a href="https://github.com/abakonski/docker-flask">https://github.com/abakonski/docker-flask</a></p><p>The process of creating and pushing the Docker image to ECR may look something like this:</p><pre><code>cd /path/to/project/

# Log into AWS, using the credentials created with "aws configure".
# NOTE: "aws ecr get-login" is AWS CLI v1 syntax. On AWS CLI v2, use instead:
#   aws ecr get-login-password --region us-west-1 | docker login --username AWS --password-stdin xxxxxxxxxxxx.dkr.ecr.us-west-1.amazonaws.com
$ $(aws ecr get-login --no-include-email --region us-west-1)

# Build the Docker image
$ docker build -t flask-app:0001 .

# Tag the image, prefixing it with the ECR registry address
$ docker tag flask-app:0001 xxxxxxxxxxxx.dkr.ecr.us-west-1.amazonaws.com/flask-app:0001

# Push the image to the ECR registry
$ docker push xxxxxxxxxxxx.dkr.ecr.us-west-1.amazonaws.com/flask-app:0001</code></pre><p>You should now be able to see the image in ECR:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/ecr_02-1.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>NOTE:</strong> we'll be using the <em>Repository URI</em> from the above screen later in the process.</p><h1 id="set-up-an-ecs-cluster">Set up an ECS Cluster</h1><p>Now that you've got ECR configured with an initial image sitting inside it, let's set up our first cluster.</p><p>A very simplistic overview of ECS Clusters is that a Cluster is a logical grouping of one or more Services, each of which comprises one or more instances of a Task. A Task is made up of one or more Docker images.</p><p>As an example, you could define <em>Task A</em> to be a web app image. <em>Task B</em> could be a Varnish server. <em>Service A</em> would be configured to run 3 instances of <em>Task A</em>, while <em>Service B</em> will run a single instance of <em>Task B</em>. These two services will then be grouped into a single cluster, and the result would be a Varnish server that serves a web app that runs on 3 load balanced instances.</p><h2 id="create-a-task-definition">Create a Task Definition</h2><p>ECS Task Definitions do what their name suggests: they define a task. A task is typically a microservice, though it could comprise multiple microservices.</p><p>In this section we'll create a Task Definition for our <em>flask-app</em> example.</p><p><strong>1.</strong> In the <em>Elastic Container Service</em> section, open the <em>Task Definitions</em> screen and click the <em>Create new Task Definition</em> button.</p><p><strong>2.</strong> Name the task definition "flask-app-task" and move on to adding a container. 
There are some advanced settings here, but we'll leave these alone in this example.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/task_01-1.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>3. </strong>Click <em>Add Container</em></p><p><strong>4.</strong> Use "flask-app" for the container name</p><p><strong>5. </strong>There are 2 things we need to enter into the <em>Image</em> field. First, paste the <em>Repository URI</em> from earlier in the process. Next, add a colon (":") and the image tag (in this example, we used "0001" as the tag for our initial image).</p><p><strong>6.</strong> We're going to use a somewhat arbitrary value of 500MB for the hard memory limit. The app we're using should theoretically never reach anywhere near this limit.</p><p><strong>7.</strong> The rest of the form can be left with mostly the defaults</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/task_02-3.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>8.</strong> Once you've added the container, use the default values for the rest of the form and click <em>Create</em>.</p><h2 id="create-the-cluster">Create the Cluster</h2><p><strong>1.</strong> In the <em>Elastic Container Service</em> section, open the <em>Clusters</em> screen and click the <em>Create Cluster</em> button.</p><p><strong>2. </strong>Choose <em>EC2 Linux + Networking</em> and click <em>Next step</em></p><p><strong>3. </strong>Use "flask-app-cluster" for the cluster name</p><p><strong>4. </strong>Choose the "t2.micro" instance type and "2" for number of instances. 
We're keeping things very cheap and basic for this example.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/cluster_01-1.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>5.</strong> Under the <em>Key pair</em> field, follow the link to create a new key pair and follow the prompts. Once created, click the refresh button next to the <em>Key pair</em> field and select your new key.</p><p><strong>NOTE:</strong> Save your new key somewhere safe for future use. This is important as this key is the only way you will be able to connect (via SSH) to the EC2 instances in this cluster.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/cluster_02-1.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>6. </strong>Choose your default (or another appropriate) VPC and at least 2 subnets within it</p><p><strong>7. </strong>Select the option to create a new security group and use the port range "0-65535".</p><p><strong>NOTE:</strong> this isn't the ideal way of setting up the security rules. We're opening every port for all inbound traffic because in the next section of this process we're going to use dynamic port mapping between ECS and the Docker containers. 
In a production system, we would create a custom VPC with much stricter traffic rules, NAT gateways and bastion instances (for SSH connections), etc, but we'll leave all of these more advanced topics for a future post.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/cluster_03-3.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>8.</strong> Choose the "ecsInstanceRole" for <em>Container Instance IAM Role</em></p><p><strong>9. </strong>Click <em>Create</em></p><h2 id="create-an-application-load-balancer">Create an Application Load Balancer</h2><p>Since we're starting multiple (2) instances of our app, we'll need to set up an Application Load Balancer to allow HTTP access to both of these containers under a single endpoint.</p><p><strong>1.</strong> Open the <em>EC2 Dashboard</em>, select <em>Load Balancers</em> from the left hand menu and click on <em>Create Load Balancer</em></p><p><strong>2. </strong>Click the <em>Create</em> button under the <em>Application Load Balancer</em> option</p><p><strong>3.</strong> Enter "flask-app-lb" as the name</p><p><strong>4. </strong>Leave the scheme as "internet-facing"</p><p><strong>5. </strong>Select the "HTTP" protocol. 
</p><p><strong>NOTE:</strong> In a production scenario, you'd most likely use HTTPS, but for the sake of this guide, we'll keep it simple.</p><p><strong>6.</strong> Select your default VPC and choose at least 2 subnets/availability zones</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/load_balancer_01-2.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>7.</strong> Click the <em>Next: Configure Security Settings</em> button to continue through the wizard</p><p><strong>8.</strong> Because we only selected the HTTP protocol in step 1, step 2 will contain a suggestion to improve security through the HTTPS protocol. We'll ignore this suggestion for this example and move on to step 3</p><p><strong>9.</strong> In step 3, choose the security group that was automatically created earlier while setting up the cluster, then continue to the next step in the wizard</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/load_balancer_02-1.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>10.</strong> In step 4, create a new target group with the name "flask-app-tg" and use the "HTTP" protocol. Use the default values for the rest of the form and continue to the next step</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/load_balancer_03-1.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>11.</strong> In step 5, you should see 2 EC2 instances that were automatically started as part of the cluster creation process. 
Select them both and continue to the final step in the wizard</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/load_balancer_04-1.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>12.</strong> Review the configuration summary in the final step and click <em>Create</em></p><h2 id="create-a-service">Create a Service</h2><p><strong>1.</strong> Once the cluster setup completes, click the <em>View Cluster</em> button. Alternatively, go to the <em>Clusters</em> screen and click on the "flask-app-cluster" cluster that you just created</p><p><strong>2. </strong>On the <em>Services</em> tab, click <em>Create</em></p><p><strong>3. </strong>Choose the "flask-app-task" task definition and "flask-app-cluster" cluster. Use the service name "flask-app-service"</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/service_01-2.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>4.</strong> Use the default "AZ Balanced Spread" placement template.</p><p><strong>5. </strong>Click <em>Next step</em></p><p><strong>6. </strong>Select the <em>Application Load Balancer</em> type, leave the IAM Role set to "Create new role" and choose the "flask-app-lb" load balancer that we created earlier</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/service_02-1.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>7.</strong> Click the <em>Add to load balancer</em> button</p><p><strong>8. 
</strong>Select the "flask-app-tg" target group and leave the other fields in this section with their default values</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/service_03-2.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>9.</strong> In the <em>Service discovery</em> section, enter "flask-app" for the namespace name, select your default VPC for the cluster and leave the other fields with their default values. Click <em>Next step</em></p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/service_04-1.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p><strong>10.</strong> For this example, we'll leave auto scaling turned off, so select "Do not adjust the service's desired count"</p><p><strong>11. </strong>Continue through the rest of the wizard, accepting the review section in the next step</p><p><strong>12. </strong>After a few moments, you should see 2 instances of the "flask-app-task" in the "RUNNING" state</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/service_05-1.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><h1 id="validate-the-deployment">Validate the Deployment</h1><p>Now that the deployment is set up, let's make sure that it works. 
We haven't set up a custom domain (via Route 53 or otherwise) for the web app, so we'll need to use the DNS for the "flask-app-lb" load balancer:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/load_balancer_05.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><p>Enter this URL into your browser and if everything has gone well, you should see something like this:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://abakonski.com/content/images/2018/10/website.png" class="kg-image" alt="Creating a Continuous Deployment Pipeline using BitBucket, Jenkins and Amazon ECS (part 1 of 2)"></figure><h1 id="final-words">Final Words</h1><p>Hopefully everything has gone well and your initial ECS deployment is working as expected. This is only the first step of the process. The next step is to create the actual continuous deployment pipeline using Bitbucket and Jenkins. This is exactly what we'll cover in the next part of this post series.</p><p>On a side note, we've been very relaxed with security when putting together this ECS deployment. A few things that should be implemented to boost security of this app (and which will be covered in future posts) include:</p><ul><li>Using a custom VPC whereby the EC2 instances can't be directly accessed via a public IP. This is achieved through the use of a combination of public and private subnets and appropriate routing tables</li><li>The use of bastion instances and custom instance ports for SSH connections. These bastion instances should use very strict traffic rules so that only trusted machines can gain access to the servers in the VPC</li><li>Running the web app solely on the HTTPS protocol using SSL certificates generated by AWS, Letsencrypt or similar. 
Use of HSTS headers and opting into the HSTS preload list is also recommended</li><li>Using tight IAM permission rules with MFA where available. AWS provides very powerful security measures for all of this</li></ul>]]></content:encoded></item><item><title><![CDATA[Resolving package install failures in Docker]]></title><description><![CDATA[<p>If you're finding your <code>apt-get install</code> (or equivalent) commands fail when building Docker images, but you've successfully tested the flow you have in your Dockerfile elsewhere, it could be a Docker caching issue.</p>
<p>When executing <code>apt-get update</code> (or equivalent) in one RUN command (Docker layer) and then <code>apt-get install</code> is</p>]]></description><link>https://abakonski.com/avoiding-package-install-update-failures-in-docker/</link><guid isPermaLink="false">5b79c6622cbf6f5f8f34b43a</guid><dc:creator><![CDATA[Andrzej Bakonski]]></dc:creator><pubDate>Mon, 20 Aug 2018 17:30:00 GMT</pubDate><media:content url="https://abakonski.com/content/images/2018/08/docker-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://abakonski.com/content/images/2018/08/docker-1.jpg" alt="Resolving package install failures in Docker"><p>If you're finding your <code>apt-get install</code> (or equivalent) commands fail when building Docker images, but you've successfully tested the flow you have in your Dockerfile elsewhere, it could be a Docker caching issue.</p>
<p>When <code>apt-get update</code> (or equivalent) is executed in one RUN command (Docker layer) and <code>apt-get install</code> in another, you can run into these package install failures. Docker reuses cached layers as much as possible, so it may use an older, cached layer where <code>apt-get update</code> was previously run, leaving your <code>apt-get install</code> command to execute without the latest package list available. The result is that the package manager can't find the package you're trying to install and this step will fail (or potentially install the wrong version).</p>
<p>To get around this issue, it's best to bundle your package manager commands into a single RUN command.</p>
<p>Instead of doing something like this:</p>
<pre><code>RUN apt-get -y update
RUN apt-get -y install python-pil
</code></pre>
<p>Do this:</p>
<pre><code>RUN apt-get -y update &amp;&amp; \
    apt-get -y install python-pil
</code></pre>
<p>This puts both steps into a single layer, so your 'install' command is always executed against a fresh package list.</p>
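<p>As a bonus, you can clean up the package lists in the same RUN command, which also keeps the resulting layer (and final image) smaller:</p>
<pre><code>RUN apt-get -y update &amp;&amp; \
    apt-get -y install python-pil &amp;&amp; \
    rm -rf /var/lib/apt/lists/*
</code></pre>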
]]></content:encoded></item><item><title><![CDATA[Optimizing Docker builds for speed]]></title><description><![CDATA[<p>If you've ever found yourself spending way too much time staring at the screen as Docker builds your images, this one's for you. Here's a quick tip to keep in mind when writing Dockerfiles.</p>
<h3 id="commandsequenceisimportant">Command sequence is important</h3>
<p>Each RUN, COPY and ADD command in a Dockerfile creates a separate</p>]]></description><link>https://abakonski.com/optimising-docker-builds/</link><guid isPermaLink="false">5b79b5632cbf6f5f8f34b439</guid><dc:creator><![CDATA[Andrzej Bakonski]]></dc:creator><pubDate>Sun, 19 Aug 2018 19:33:59 GMT</pubDate><media:content url="https://abakonski.com/content/images/2018/08/docker.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://abakonski.com/content/images/2018/08/docker.jpg" alt="Optimizing Docker builds for speed"><p>If you've ever found yourself spending way too much time staring at the screen as Docker builds your images, this one's for you. Here's a quick tip to keep in mind when writing Dockerfiles.</p>
<h3 id="commandsequenceisimportant">Command sequence is important</h3>
<p>Each RUN, COPY and ADD command in a Dockerfile creates a separate layer (a.k.a. intermediate image). Each layer is built and cached separately. Docker will reuse cached layers as much as possible, but because each subsequent layer depends on those that came before it, it's important to get the sequence of commands right, to prevent package installs, for example, from bogging down simple code changes.</p>
<p>The sequence of commands should generally start with the commands that are the least likely to generate a change in a layer, to those that are most likely to do so. Typically, this would put package installs at the top of the Dockerfile and then anything like code or frequently made environment changes at the bottom.</p>
<p>Let's take a look at this example Dockerfile:</p>
<pre><code>FROM python:3.6

# Copy app code to image
COPY /app /app

# Copy the modified Nginx conf
COPY /conf/nginx.conf /etc/nginx/conf.d/

# Custom Supervisord config
COPY /conf/supervisord.conf /etc/supervisor/conf.d/supervisord.conf

# Install python dependencies
COPY /app/requirements.txt /home/docker/code/app/
RUN pip3 install -r /home/docker/code/app/requirements.txt

# Install system dependencies
RUN echo &quot;deb http://httpredir.debian.org/debian/ stretch main contrib non-free&quot; &gt;&gt; /etc/apt/sources.list \
    &amp;&amp; echo &quot;deb-src http://httpredir.debian.org/debian/ stretch main contrib non-free&quot; &gt;&gt; /etc/apt/sources.list \
    &amp;&amp; apt-get update -y \
    &amp;&amp; apt-get install -y -t stretch openssl nginx-extras=1.10.3-1+deb9u1 \
    &amp;&amp; apt-get install -y nano supervisor \
    &amp;&amp; rm -rf /var/lib/apt/lists/*

EXPOSE 80

CMD [&quot;/usr/bin/supervisord&quot;]
</code></pre>
<p>In the above example:</p>
<ul>
<li>Code changes are applied first. Anything that changes in the app code will force everything below this command to be re-run when the image is built. This includes installing the various dependencies, which is typically a heavy action</li>
<li>Configuration files are copied next. These are likely to change much less frequently than the code, but will be re-cached every time code changes are made</li>
<li>Python dependencies follow. These are likely to change more often than the config files, but not as often as the code</li>
<li>The heaviest call is performed next - installing various system dependencies. This takes up a significant percentage of the overall build. It is also the action least likely to result in any changes to the image in subsequent builds, and therefore the one that should be cached the most aggressively</li>
<li>Next we expose port 80 on the resulting container (also not likely to change after the initial setup)</li>
<li>Finally we launch supervisor</li>
</ul>
<p>We can see a few problems here. Every time we make a change to code (the most frequently made change), a build of this Docker image will ignore the cache of all layers that follow it, including installing all those dependencies. It's an unnecessary hit and will burn minutes each time this image is built.</p>
<p>Ideally, the above Dockerfile would be restructured to something like this:</p>
<pre><code>FROM python:3.6

EXPOSE 80

# Install system dependencies
RUN echo &quot;deb http://httpredir.debian.org/debian/ stretch main contrib non-free&quot; &gt;&gt; /etc/apt/sources.list \
    &amp;&amp; echo &quot;deb-src http://httpredir.debian.org/debian/ stretch main contrib non-free&quot; &gt;&gt; /etc/apt/sources.list \
    &amp;&amp; apt-get update -y \
    &amp;&amp; apt-get install -y -t stretch openssl nginx-extras=1.10.3-1+deb9u1 \
    &amp;&amp; apt-get install -y nano supervisor \
    &amp;&amp; rm -rf /var/lib/apt/lists/*

# Install python dependencies
COPY /app/requirements.txt /home/docker/code/app/
RUN pip3 install -r /home/docker/code/app/requirements.txt

# Copy the modified Nginx conf
COPY /conf/nginx.conf /etc/nginx/conf.d/

# Custom Supervisord config
COPY /conf/supervisord.conf /etc/supervisor/conf.d/supervisord.conf

# Copy app code to image
COPY /app /app

CMD [&quot;/usr/bin/supervisord&quot;]
</code></pre>
<p>Here's what's happening with this new version of the Dockerfile:</p>
<ul>
<li>We expose port 80 on the container first. After initial config, this is unlikely to ever change and can always be cached</li>
<li>Next we install all the system dependencies. This is a heavy call, but the packages installed typically won't change very often, so we're making sure that for the majority of image builds, this layer will come from Docker's cache and this build step is essentially skipped</li>
<li>Python dependencies come next. This layer will come from the cache as long as the previous (system dependencies) layer comes from cache and also 'requirements.txt' is unchanged. These dependencies are probably going to be updated more frequently than system ones. They can also take a long time to install, so caching this is a good idea as well</li>
<li>The next 2 commands copy config files. These probably won't change as often as the Python dependencies, but they are light calls, so having them rebuilt from scratch when a dependency changes isn't a big deal at all. I like to put all COPY and ADD commands towards the bottom of the Dockerfile for this reason. The exception would be where a change to one of these underlying files affects subsequent RUN steps, in which case the commands need to be ordered accordingly</li>
<li>Finally the code files are copied. This layer will need to be rebuilt from scratch on most image builds. Putting it at the end ensures that the rest of the build is pulled from cache as much as possible</li>
</ul>
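<p>You can confirm the caching behaviour by watching the build output - layers pulled from cache are marked as such. The step numbers below are illustrative:</p>
<pre><code>$ docker build -t my-app .
...
Step 3/9 : COPY /app/requirements.txt /home/docker/code/app/
 ---&gt; Using cache
...
</code></pre>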
]]></content:encoded></item><item><title><![CDATA[Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)]]></title><description><![CDATA[<p>Last year we decided to move Veryfi’s Python-based web app onto Microsoft Azure. The process was complicated and involved several stages. First I had to Dockerize the app, then move it into a Docker Swarm setup, and finally set up a CI/CD pipeline using Jenkins and BitBucket. Most</p>]]></description><link>https://abakonski.com/creating-a-continuous-deployment-pipeline-with-bitbucket-jenkins-and-azure-part-3-of-3/</link><guid isPermaLink="false">5ad3e8e02cbf6f5f8f34b34a</guid><category><![CDATA[Continuous Deployment]]></category><category><![CDATA[Jenkins]]></category><category><![CDATA[BitBucket]]></category><category><![CDATA[Docker Swarm]]></category><category><![CDATA[Azure]]></category><category><![CDATA[ci/cd]]></category><dc:creator><![CDATA[Andrzej Bakonski]]></dc:creator><pubDate>Mon, 16 Apr 2018 02:56:00 GMT</pubDate><media:content url="https://abakonski.com/content/images/2018/04/swarm_pipeline_ghost-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://abakonski.com/content/images/2018/04/swarm_pipeline_ghost-1.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"><p>Last year we decided to move Veryfi’s Python-based web app onto Microsoft Azure. The process was complicated and involved several stages. First I had to Dockerize the app, then move it into a Docker Swarm setup, and finally set up a CI/CD pipeline using Jenkins and BitBucket. Most of this was new to me, so the learning curve was steep. I had limited experience with Python and knew of Docker and Jenkins, but had yet to dive into the deep end. After completing the task, I thought I could share my research and process.</p>
<p><strong>I’ve compiled a three-part series that will cover these topics:</strong></p>
<ol>
<li><a href="http://abakonski.com/dockerizing-a-web-app-using-docker-compose-part-1-of-3/">Dockerizing a web app, using Docker Compose for orchestrating multi-container infrastructure</a></li>
<li><a href="http://abakonski.com/deploying-to-docker-swarm-on-microsoft-azure-part-2-of-3/">Deploying to Docker Swarm on Microsoft Azure</a></li>
<li>Creating a Continuous Deployment pipeline with BitBucket, Jenkins and Azure</li>
</ol>
<p>This is the third and last post in the series, where I'll discuss the process of setting up a Jenkins build server, configuring BitBucket and then creating a fully functional pipeline that builds the appropriate Docker images after a BitBucket push and deploys the build to a Docker Swarm running on Microsoft Azure.</p>
<p>Make sure to read my first two posts (links above) so we’re on the same page because I’ll be building off those. The first two posts focus on the steps required to migrate an app to a Docker environment, setting up a Docker Swarm cluster on Azure then deploying to that cluster manually. This post will cover the automation side of deployments.</p>
<p><strong>Note:</strong> I've included resource files related to this post <a href="https://github.com/abakonski/bitbucket-jenkins-acs-pipeline">here</a>.</p>
<p>The example app that I'm deploying here is the same, minimal &quot;Hello World&quot; app that I used in the first two posts. See <a href="https://github.com/abakonski/docker-swarm-flask">this repo</a> for reference.</p>
<p><strong>Why we implemented Continuous Deployment at Veryfi</strong></p>
<p>Our manual deployment process was tedious and time-consuming. As a result, code that was ready to be shipped would stay on the back burner for hours or even days, because it simply took up too many resources to deploy. We actually found ourselves in a similar situation as many of Veryfi’s own customers, only instead of an expense management solution we needed a better deployment process. We now deploy multiple times a day with minimal impact on other essential daily activities. What used to take up to a half hour now happens with a couple of clicks and typically less than a minute of &quot;mental distraction.&quot;</p>
<h2 id="settingupazurecontainerregistry">Setting up Azure Container Registry</h2>
<p>Before we get to Jenkins, we need a few prerequisites in place. In the <a href="http://abakonski.com/deploying-to-docker-swarm-on-microsoft-azure-part-2-of-3/">second post of this series</a>, we set up a Docker Swarm cluster on Azure. Now we'll need to set up a Docker image registry. You could use the official Docker Hub, but in our case, we chose to use ACR (Azure Container Registry).</p>
<ol>
<li>In Azure portal, click &quot;Create a resource&quot; and search for &quot;container registry&quot;</li>
<li>Select <em>Azure Container Registry</em> by Microsoft:</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/azure-jenkins-8.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="3">
<li>Follow the prompts to finish creating the ACR</li>
</ol>
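<p>If you prefer the command line, the same registry can be created with Azure CLI. The resource group and registry names below are placeholders - substitute your own:</p>
<pre><code># Create the registry in an existing Resource Group
$ az acr create --resource-group &lt;ACR_RESOURCE_GROUP&gt; --name &lt;ACR_NAME&gt; --sku Basic
</code></pre>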
<h2 id="setupaserviceprincipalnameforazure">Set up a Service Principal Name for Azure</h2>
<p>Next we'll need to set up a Service Principal Name (SPN) that will be used by Jenkins to connect to ACR. Because I use a Mac for all my work, all instructions are Mac-specific.</p>
<ol>
<li>Install Azure CLI 2.0:</li>
</ol>
<pre><code># Ref: https://azure.github.io/projects/clis/
$ curl -L https://aka.ms/InstallAzureCli | bash
</code></pre>
<ol start="2">
<li>Install dependencies required for step 5:</li>
</ol>
<pre><code># Ref: http://macappstore.org/jq/
 
# Install Homebrew if you don't have it already
$ ruby -e &quot;$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)&quot; &lt; /dev/null 2&gt; /dev/null
 
# Install JQ (command-line JSON processor)
$ brew install jq
</code></pre>
<ol start="3">
<li>Download <a href="https://github.com/abakonski/bitbucket-jenkins-acs-pipeline/blob/master/scripts/spn.sh">https://github.com/abakonski/bitbucket-jenkins-acs-pipeline/blob/master/scripts/spn.sh</a></li>
<li>Create SPN using script from step 3:</li>
</ol>
<pre><code># Navigate to the path of the above downloaded script
$ cd &lt;PATH_TO_spn.sh&gt;
 
# Run it
$ ./spn.sh
 
# Follow the on-screen instructions and you should see the following 2 blocks within the response
# Block 1:
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXXXXXXXX to authenticate.
[
  {
    &quot;cloudName&quot;: &quot;AzureCloud&quot;,
    &quot;id&quot;: &quot;xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&quot;,
    &quot;isDefault&quot;: true,
    &quot;name&quot;: &quot;Microsoft Azure Sponsorship&quot;,
    &quot;state&quot;: &quot;Enabled&quot;,
    &quot;tenantId&quot;: &quot;xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&quot;,
    &quot;user&quot;: {
      &quot;name&quot;: &quot;xxxxx@xxxxxxx.com&quot;,
      &quot;type&quot;: &quot;user&quot;
    }
  }
]
Checking Azure subscription count...
You only have one subscription. Your SPN will be created in Microsoft Azure Sponsorship
 
# Block 2:
{
  &quot;appId&quot;: &quot;xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&quot;,
  &quot;displayName&quot;: &quot;myjenkins-acr-spn&quot;,
  &quot;name&quot;: &quot;http://myjenkins-acr-spn&quot;,
  &quot;password&quot;: &quot;xxxxxxxxxxxx&quot;,
  &quot;tenant&quot;: &quot;xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&quot;
}
Successfully created Service Principal.
</code></pre>
<ol start="5">
<li>Confirm that a Contributor SPN has been created:</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/azure-registry-0.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="6">
<li>Assign SPN to your ACR:</li>
</ol>
<pre><code># SUBSCRIPTION_ID = &quot;id&quot; from Block 1 in step 4 above
# ACR_RESOURCE_GROUP = the Resource Group you created your ACR inside of
# ACR_NAME = the name of your ACR
# APP_ID = &quot;appId&quot; from Block 2 in step 4 above
$ az role assignment create --scope /subscriptions/&lt;SUBSCRIPTION_ID&gt;/resourcegroups/&lt;ACR_RESOURCE_GROUP&gt;/providers/Microsoft.ContainerRegistry/registries/&lt;ACR_NAME&gt; --role Owner --assignee &lt;APP_ID&gt;
</code></pre>
<ol start="7">
<li>Note your ACR login details for later reference:
<ul>
<li><strong>Username:</strong> &quot;appId&quot; from Block 2 in step 4 above</li>
<li><strong>Password:</strong> &quot;password&quot; from Block 2 in step 4 above</li>
<li><strong>ACR login server URL:</strong> refer to the following screenshot</li>
</ul>
</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/azure-registry-1.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
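<p>Alternatively, the login server URL can be fetched with Azure CLI:</p>
<pre><code># ACR_NAME = the name of your ACR
$ az acr show --name &lt;ACR_NAME&gt; --query loginServer --output tsv
</code></pre>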
<h2 id="launchingandconfiguringajenkinsbuildserverinazure">Launching and configuring a Jenkins build server in Azure</h2>
<p>We now have ACS and ACR set up and are ready to create a Jenkins build server. For our internal use case, we chose to do this on Azure, so that's what I'll cover here:</p>
<ol>
<li>Create a new RSA SSH key for logging into the Jenkins server. This will be used below. Run the following from the terminal:</li>
</ol>
<pre><code># Generate new key:
$ ssh-keygen -t rsa

# When prompted, enter a path to store the key, e.g. /Users/username/.ssh/myjenkinsserver_rsa
# For the sake of demonstration in this article, I left the passphrase empty

# Print the contents of the public key (for the above example, the path would be /Users/username/.ssh/myjenkinsserver_rsa.pub - note the .pub extension):
$ cat &lt;PATH_TO_PUBLIC_KEY&gt;
</code></pre>
<ol start="2">
<li>In the left-side menu in Azure portal, click <em>Create a resource</em> and search for &quot;Jenkins&quot;</li>
<li>Select the <em>Jenkins</em> resource with Microsoft as the publisher:</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/azure-jenkins-1.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="4">
<li>For the sake of this article, I used the following settings on the first step of the wizard. Note that I selected the <em>SSH public key</em> option for authentication type and pasted the contents of the public key that I got in step 1 above:</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/azure-jenkins-2.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="5">
<li>Complete the rest of the wizard, with default values and/or whatever is appropriate.</li>
<li>Once the deployment has completed, copy down the new Jenkins server's DNS by clicking on the corresponding, freshly created Virtual Machine:</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/azure-jenkins-6.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="7">
<li>Enter the DNS into your browser's address bar and you'll be greeted with something like the following security screen:</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/azure-jenkins-7.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<p><em>Note: I won't go into how to set up HTTPS on this server in this post, but I'll run through how to use SSH tunneling to gain access to Jenkins on this VM.</em></p>
<ol start="8">
<li>In the terminal, run the following command to open an SSH session to the Jenkins VM and also forward port 8080 on the VM to the local port 8080. This will make Jenkins accessible via <a href="http://127.0.0.1:8080/">http://127.0.0.1:8080/</a></li>
</ol>
<pre><code>
# Open SSH session with Jenkins VM and forward VM's localhost:8080 to local machine's port 8080
# JENKINS_USERNAME = refer to step 4 above
# JENKINS_DNS = refer to step 6 above
# PATH_TO_PRIVATE_KEY = refer to step 1 above
$ ssh -L 8080:localhost:8080 &lt;JENKINS_USERNAME&gt;@&lt;JENKINS_DNS&gt; -i &lt;PATH_TO_PRIVATE_KEY&gt;
</code></pre>
<ol start="9">
<li>Open <a href="http://127.0.0.1:8080/">http://127.0.0.1:8080/</a> to access and configure Jenkins. Follow the instructions on the screen to log in. You'll be asked to get the initial admin password from a file on the VM. You can do this with the following command on the Jenkins VM:</li>
</ol>
<pre><code># Get initial admin password
$ sudo cat /var/lib/jenkins/secrets/initialAdminPassword
</code></pre>
<ol start="10">
<li>On the next screen, click <em>Install suggested plugins</em> to get started quickly. We'll add some extra plugins later because it seems that not all are available when following the <em>Select plugins to install</em> option on this screen.</li>
<li>Follow the prompts to create your first Jenkins admin user.</li>
<li>Once the setup wizard completes, go to <em>Manage Jenkins</em> &gt; <em>Manage Plugins</em> and then open the <em>Available</em> tab. The Azure Jenkins VM appears to be pre-configured to install a lot of the Azure (and other) plugins by default, which may or may not be the case for a completely fresh Jenkins install on a different architecture. Please feel free to confirm or deny in the comments. If your setup doesn't work due to missing plugins, let me know and I'll provide the full plugin list that is running on our Azure installation. Below is the list of extra plugins that I selected to install beyond the defaults. You may not need them all depending on your final requirements.
<ul>
<li>BitBucket</li>
<li>Build Pipeline</li>
<li>External Monitor Job Type</li>
<li>Global Slack Notifier</li>
<li>Icon Shim</li>
<li>JQuery</li>
<li>Maven Integration</li>
<li>Mercurial</li>
<li>Parameterized Trigger</li>
<li>Run Condition</li>
<li>Slack Notification</li>
<li>SSH Agent</li>
<li>SSH</li>
</ul>
</li>
<li>Now it's time to install Docker on the Jenkins server so that images can be built and remote Docker commands can be executed from this build server. Run the following commands in the SSH terminal session on the Jenkins server:</li>
</ol>
<pre><code># Install Docker and dependencies
# Ref: https://docs.docker.com/v17.09/engine/installation/linux/docker-ce/ubuntu/#install-using-the-repository
$ sudo apt-get -y update
$ sudo apt-get -y upgrade
$ sudo apt-get -y install \
     apt-transport-https \
     ca-certificates \
     curl \
     software-properties-common
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo apt-key fingerprint 0EBFCD88
$ sudo add-apt-repository \
    &quot;deb [arch=amd64] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) \
    stable&quot;
$ sudo apt-get -y update
$ sudo apt-get -y install docker-ce

# Give Jenkins permission to run Docker commands
# Full credit to Jessica Dean for helping out with this step
# Ref: http://jessicadeen.com/tech/pro-tip-jenkins-and-docker-build-server/
$ sudo usermod -aG docker $USER
$ sudo usermod -aG docker jenkins

# Configure NginX to accept webhook calls from BitBucket on &lt;JENKINS_URL&gt;/bitbucket-hook/
$ sudo nano /etc/nginx/sites-enabled/default
# Replace contents with this file: https://github.com/abakonski/bitbucket-jenkins-acs-pipeline/blob/master/etc/nginx/sites-enabled/default
# This will disable access to Jenkins from the public internet (without SSH tunneling), leaving only the webhook enabled.
# Customize the URL for the webhook by changing the line: &quot;location /bitbucket-hook/ {&quot;

# Reboot the server
$ sudo reboot
</code></pre>
<p><em>NOTE: You could enable access to the Jenkins server without SSH tunneling, but it wouldn't be a good idea to leave this running without SSL. I won't be running through how to configure SSL on this Jenkins server in this post.</em></p>
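<p>For reference, the webhook-only NginX config linked in the comments above boils down to something like this. This is a simplified sketch, not the exact contents of the repo file, and it assumes Jenkins is listening on its default port of 8080 on localhost:</p>
<pre><code>server {
    listen 80 default_server;

    # Only the BitBucket webhook path is reachable from the public internet
    location /bitbucket-hook/ {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    # Everything else is blocked; reach the Jenkins UI via an SSH tunnel
    location / {
        deny all;
    }
}
</code></pre>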
<ol start="14">
<li>Test your ACR login by opening an SSH connection to the Jenkins server again (once it has had enough time to reboot after the above steps) and running the following command:</li>
</ol>
<pre><code># ACR login details are in step 8 of the &quot;Set up a Service Principal Name for Azure&quot; section of this article
# You will be prompted for username and password
$ docker login &lt;ACR_SERVER_URL&gt;
</code></pre>
<h2 id="configuringbitbucket">Configuring BitBucket</h2>
<p>Now it's time to configure BitBucket to notify Jenkins whenever code is pushed to the repository. This is done via the Jenkins webhook:</p>
<ol>
<li>Open the project settings for your BitBucket repository and click <em>Add webhook</em>:</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/bitbucket-1.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="2">
<li>Configure a webhook with the URL &quot;&lt;JENKINS_URL&gt;/bitbucket-hook/&quot;. For triggers, select <em>Repository: Push</em> and <em>Pull Request: Merged</em>:</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/bitbucket-2.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<h2 id="configuringslackintegration">Configuring Slack Integration</h2>
<p>It's always a good idea to communicate build processes (attempts, successes and failures) to the team. We use Slack internally for a lot of our communications, so I decided to have Jenkins report on all deployments. This is an optional step, but I highly recommend it. First, we'll need to configure Slack to accept messages from Jenkins, and then we'll configure Jenkins:</p>
<ol>
<li>Create a channel in Slack. In this walkthrough, I'm going to use &quot;#jenkins-builds&quot;</li>
<li>Go to <a href="https://slack.com/apps">https://slack.com/apps/</a></li>
<li>Sign in</li>
<li>Search for &quot;Jenkins&quot;</li>
<li>Click <em>Add Configuration</em></li>
<li>Choose the &quot;#jenkins-builds&quot; channel and click <em>Add Jenkins CI integration</em></li>
<li>The next page will provide instructions on how to configure the Slack plugin within Jenkins. You should only need to follow the first 3 steps at this point</li>
</ol>
<h2 id="creatingthejenkinsjobs">Creating the Jenkins jobs</h2>
<p>Finally, it's time to create the jobs in Jenkins. These will use BitBucket pushes as a trigger, perform the Docker image build, and deploy the code changes to ACS. To do this, Jenkins will need to be able to connect to BitBucket, Azure Container Registry, and Azure Container Service. We'll configure the required credentials first.</p>
<h3 id="configuringrequiredcredentialsinjenkins">Configuring required credentials in Jenkins</h3>
<ol>
<li>Log into the Jenkins admin</li>
<li>On the home screen, select <em>Credentials</em>, then <em>System</em>, then click the <em>Global credentials</em> domain</li>
<li>Click <em>Add credentials</em> to create the ACR credentials:
<ul>
<li><strong>Kind:</strong> &quot;Username and password&quot;</li>
<li><strong>Scope:</strong> &quot;Global&quot;</li>
<li><strong>Username:</strong> Application ID of the SPN created in the <em>Set up a Service Principal Name for Azure</em> section of this post</li>
<li><strong>Password:</strong> Password of the SPN created in the <em>Set up a Service Principal Name for Azure</em> section of this post</li>
<li><strong>ID:</strong> &quot;acr-login&quot;</li>
</ul>
</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/jenkins-1.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="4">
<li>Click <em>Add credentials</em> to create the ACS credentials:
<ul>
<li><strong>Kind:</strong> &quot;SSH Username with private key&quot;</li>
<li><strong>Scope:</strong> &quot;Global&quot;</li>
<li><strong>Username:</strong> SSH username for your ACS (refer to the <a href="http://abakonski.com/deploying-to-docker-swarm-on-microsoft-azure-part-2-of-3/">second post in this series</a>)</li>
<li><strong>Private key:</strong> Choose &quot;Enter directly&quot; and paste in the contents of your ACS SSH private key (refer to the <a href="http://abakonski.com/deploying-to-docker-swarm-on-microsoft-azure-part-2-of-3/">second post in this series</a>)</li>
<li><strong>Passphrase:</strong> SSH passphrase (if applicable) for your ACS (refer to the <a href="http://abakonski.com/deploying-to-docker-swarm-on-microsoft-azure-part-2-of-3/">second post in this series</a>)</li>
<li><strong>ID:</strong> &quot;acs-login&quot;</li>
</ul>
</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/jenkins-2.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="5">
<li>Click <em>Add credentials</em> to create the BitBucket credentials. There are various ways of logging into BitBucket. For pipelines in Jenkins, I recommend using SSH access keys:
<ul>
<li><strong>Kind:</strong> &quot;SSH Username with private key&quot;</li>
<li><strong>Scope:</strong> &quot;Global&quot;</li>
<li><strong>Username:</strong> &quot;git&quot;</li>
<li><strong>Private key:</strong> Choose &quot;Enter directly&quot; and paste in the contents of your BitBucket SSH private key</li>
<li><strong>Passphrase:</strong> SSH passphrase (if applicable) for your BitBucket SSH key</li>
<li><strong>ID:</strong> &quot;bitbucket-login&quot;</li>
</ul>
</li>
</ol>
<h3 id="settingupjobsinjenkins">Setting up jobs in Jenkins</h3>
<p>Now comes the juicy part: setting up the actual jobs that will automatically deploy your BitBucket pushes.</p>
<p>At the time of writing, there are some limitations with BitBucket integration in Jenkins. To the best of my knowledge, there's currently no way to create a Pipeline job in Jenkins that is triggered by a BitBucket push on any branch other than master. This is a pretty big limitation if you use different branches for different deployments. For example, we use a <em>develop</em> branch for all code that is ready to go out to a staging environment. In our case, the <em>master</em> branch is only used once we're happy with a production release. The workaround for this limitation is to create 2 different jobs — a <em>Freestyle project</em>, which is capable of being triggered by a push to any branch in BitBucket, and a <em>Pipeline</em> job to do the actual building and deployment. The Pipeline job can be triggered by the completion of the Freestyle project.</p>
<h4 id="creatingafreestyleproject">Creating a Freestyle project</h4>
<p>We'll start by creating the Freestyle project that will be triggered by a push in BitBucket. In this example the job will only be triggered by a push on the <em>develop</em> branch:</p>
<ol>
<li>Navigate to <em>Jenkins</em> &gt; <em>New Item</em></li>
<li>Enter something like &quot;MyApp-BBHook-Staging&quot; in the Name field and choose <em>Freestyle project</em>. Click <em>Next</em>.</li>
<li>Scroll down to <strong>Source Code Management</strong>
<ul>
<li>Enter your BitBucket repository URL</li>
<li>Select your BitBucket SSH credential</li>
<li>Enter &quot;*/develop&quot; in the <em>Branches to build</em> section</li>
</ul>
</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/jenkins-3.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="4">
<li>Scroll down to <strong>Build Triggers</strong> and select <em>Build when a change is pushed to BitBucket</em></li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/jenkins-4.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="5">
<li>Scroll down to <strong>Post-build Actions</strong>, select <em>Slack Notifications</em> and finally select <em>Notify Build Start</em> and anything else you'd like to be notified about</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/jenkins-5.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="6">
<li>Click <em>Save</em></li>
</ol>
<h4 id="creatingapipelinejobinjenkins">Creating a Pipeline job in Jenkins</h4>
<p>This job will do all the heavy lifting, which includes:</p>
<ol>
<li>Pulling code from BitBucket</li>
<li>Building the Docker image</li>
<li>Pushing the image to ACR</li>
<li>Connecting to ACS</li>
<li>Pulling the Docker image to ACS</li>
<li>Deploying the stack — as defined by docker-compose.yml — to the ACS cluster</li>
</ol>
<p>At each step, Slack notifications will be sent out.</p>
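<p>The linked script is the one to actually use, but to give a rough idea of its shape, a heavily simplified Pipeline performing those steps might look like this. The repository URL, image name and tunnel details below are placeholders, and the credential IDs match the ones created earlier in this post:</p>
<pre><code>node {
    stage('Checkout') {
        // 1. Pull code from BitBucket
        git branch: 'develop', credentialsId: 'bitbucket-login', url: 'git@bitbucket.org:&lt;TEAM&gt;/&lt;REPO&gt;.git'
    }
    stage('Build and push image') {
        // 2-3. Build the Docker image and push it to ACR
        withCredentials([usernamePassword(credentialsId: 'acr-login', usernameVariable: 'ACR_USER', passwordVariable: 'ACR_PASS')]) {
            sh 'docker build -t &lt;ACR_SERVER_URL&gt;/myapp:latest .'
            sh 'docker login -u $ACR_USER -p $ACR_PASS &lt;ACR_SERVER_URL&gt;'
            sh 'docker push &lt;ACR_SERVER_URL&gt;/myapp:latest'
        }
    }
    stage('Deploy') {
        // 4-6. Tunnel to the ACS Swarm manager, then pull the image and deploy the stack
        sshagent(['acs-login']) {
            sh 'ssh -fNL 2375:localhost:2375 &lt;LINUX_ADMIN_USERNAME&gt;@&lt;MASTER_FQDN&gt;'
            sh 'DOCKER_HOST=:2375 docker stack deploy --compose-file docker-compose.yml myapp'
        }
    }
    slackSend channel: '#jenkins-builds', message: "Deployed ${env.JOB_NAME} #${env.BUILD_NUMBER}"
}
</code></pre>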
<p>To set up this Pipeline job:</p>
<ol>
<li>Navigate to <em>Jenkins</em> &gt; <em>New Item</em></li>
<li>Enter something like &quot;MyApp-Pipeline-Staging&quot; in the Name field and choose <em>Pipeline</em>. Click <em>Next</em>.</li>
<li>Scroll down to <strong>Build Triggers</strong>
<ul>
<li>Select <em>Build after other projects are built</em></li>
<li>Enter &quot;MyApp-BBHook-Staging&quot;</li>
</ul>
</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/jenkins-6.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="4">
<li>Scroll down to <strong>Pipeline</strong>
<ul>
<li>Leave <em>Pipeline script</em> selected</li>
<li>Modify the contents in <a href="https://github.com/abakonski/bitbucket-jenkins-acs-pipeline/blob/master/Jenkinsfile">this Groovy script</a> to suit your needs and paste it into the <em>Script</em> text box</li>
</ul>
</li>
<li>Click <em>Save</em></li>
</ol>
<h3 id="testingthepipeline">Testing the pipeline</h3>
<p>There's just one step left at this point: testing.</p>
<ol>
<li>Go to <em>Jenkins</em> &gt; <em>MyApp-BBHook-Staging</em></li>
<li>Click <em>Build Now</em></li>
<li>The first time you attempt to build this flow, it's very likely that the Pipeline job will throw an error. This is because Jenkins lacks permission to execute certain methods in Pipeline scripts</li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/jenkins-7.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="4">
<li>To fix this error, go to <em>Jenkins</em> &gt; <em>Manage Jenkins</em> &gt; <em>In-process Script Approval</em></li>
<li>You should see a signature awaiting approval. Click <em>Approve</em></li>
</ol>
<p><img src="https://abakonski.com/content/images/2018/04/jenkins-8.png" alt="Creating a Continuous Deployment Pipeline with BitBucket, Jenkins and Azure (part 3 of 3)"></p>
<ol start="6">
<li>Build the <em>MyApp-BBHook-Staging</em> job again</li>
<li>If everything builds correctly this time, test the entire process end-to-end by pushing a change to your project's <em>develop</em> branch on BitBucket to kick off an automated deployment</li>
</ol>
<h2 id="finalwords">Final words</h2>
<p>That's it! If you've made it this far, congratulations — you should be able to create a Dockerized version of your app and create a full Continuous Deployment pipeline into Azure Container Services, triggered by a BitBucket push.</p>
<p>If you've enjoyed this series or if you have any comments, feedback, questions or thoughts on how these processes could be improved, please let me know in the comments.</p>
]]></content:encoded></item><item><title><![CDATA[Deploying to Docker Swarm on Microsoft Azure (part 2 of 3)]]></title><description><![CDATA[<p>A couple months ago we decided to move Veryfi’s Python-based web app onto Microsoft Azure. The process was complicated and involved several stages. First I had to Dockerize the app, then move it into a Docker Swarm setup, and finally set up a CI/CD pipeline using Jenkins and</p>]]></description><link>https://abakonski.com/deploying-to-docker-swarm-on-microsoft-azure-part-2-of-3/</link><guid isPermaLink="false">5a97786b0b70fe0aa84f9742</guid><category><![CDATA[Continuous Deployment]]></category><category><![CDATA[ci/cd]]></category><category><![CDATA[docker-compose]]></category><category><![CDATA[Docker Swarm]]></category><category><![CDATA[Azure]]></category><dc:creator><![CDATA[Andrzej Bakonski]]></dc:creator><pubDate>Thu, 01 Mar 2018 04:27:00 GMT</pubDate><media:content url="https://abakonski.com/content/images/2018/03/Docker_swarm.png" medium="image"/><content:encoded><![CDATA[<img src="https://abakonski.com/content/images/2018/03/Docker_swarm.png" alt="Deploying to Docker Swarm on Microsoft Azure (part 2 of 3)"><p>A couple months ago we decided to move Veryfi’s Python-based web app onto Microsoft Azure. The process was complicated and involved several stages. First I had to Dockerize the app, then move it into a Docker Swarm setup, and finally set up a CI/CD pipeline using Jenkins and BitBucket. Most of this was new to me, so the learning curve was steep. I had limited experience with Python and knew of Docker and Jenkins, but had yet to dive into the deep end. After completing the task, I thought I could share my research and process with the Veryfi community.</p>
<p><strong>I’ve compiled a three-part series that will cover these topics:</strong></p>
<ol>
<li>Dockerizing a web app, using Docker Compose for orchestrating multi-container infrastructure</li>
<li>Deploying to Docker Swarm on Microsoft Azure</li>
<li>CI/CD using BitBucket, Jenkins, Azure Container Registry</li>
</ol>
<p>This is the second post in the series, where I'll discuss the process of setting up a Docker Swarm cluster on Microsoft Azure and deploying a Dockerized app to it.</p>
<p>This post assumes some basic knowledge of Docker Swarm.</p>
<p>In simple terms, Docker Swarm is an extension of Docker Compose. They both orchestrate a group of containers, with the main difference being that Swarm does this across multiple nodes and allows for replication of containers across multiple nodes for redundancy. Swarm uses two types of nodes: Master (aka manager) and Agent (aka worker). Master nodes perform all the orchestration and scheduling, while Agent nodes run the containers. Swarm handles all the load balancing between replicas. This is a very oversimplified and limited description of Swarm, so if you're new to the platform, you can dive a little deeper with Docker's official documentation: <a href="https://docs.docker.com/engine/swarm/">https://docs.docker.com/engine/swarm/</a></p>
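<p>To make the Master/Agent split a little more concrete: outside of a managed service like Azure, a minimal two-node Swarm is created by hand with just two commands (the worker join token and address below are printed by the init command):</p>
<pre><code># On the machine that will become the Master (manager) node
$ docker swarm init --advertise-addr &lt;MANAGER_IP&gt;

# On each machine that will become an Agent (worker) node,
# using the token and address printed by the command above
$ docker swarm join --token &lt;WORKER_TOKEN&gt; &lt;MANAGER_IP&gt;:2377
</code></pre>
<p>On Azure, the ACS-Engine deployment described below takes care of all of this for you.</p>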
<p><strong>So why did we decide to implement Docker at Veryfi?</strong></p>
<p>The main benefits to us have been portability and the ease of scale and orchestration. We use Macs internally and deploy to Linux servers on Azure and AWS and wanted to ensure we don’t run into the “well it worked on my machine…” scenario. It had to be easy to port to whatever infrastructure we wanted to use. We wanted to be able to scale easily — both Azure and AWS have container services that make this easy with Docker. Simply put, we wanted to make it all easy. And as I’ll explain in the third post in this series, we made things super easy by implementing CI/CD.</p>
<p><strong>Note:</strong> the code for the example included in this article can be found in this GitHub repo: <a href="https://github.com/abakonski/docker-swarm-flask">https://github.com/abakonski/docker-swarm-flask</a><br>
The example here is the same minimal, &quot;Hello World&quot; app that I used in the first post, extended to work on Docker Swarm.</p>
<h2 id="creatingadockerswarmclusteronazure">Creating a Docker Swarm cluster on Azure</h2>
<p>The first thing I tried when setting up a Swarm cluster on Azure was to launch an Azure Container Service using the Azure Portal. After running through the wizard, the resulting resources were confusing and I was having a hard time deploying our web app to it, so I contacted the Azure support team who kindly pointed me to this very useful <a href="https://github.com/Azure/acs-engine/blob/master/docs/swarmmode.md">Azure Container Service Engine (ACS-Engine)</a> walkthrough. This walkthrough outlines the entire process of setting the right tools and deploying a Docker Swarm cluster on Azure, so I won't rehash it in this post.</p>
<p><strong>In a nutshell, you'll need to:</strong></p>
<ol>
<li>Install ACS-Engine</li>
<li>Generate an SSH key to connect to the Docker Swarm via SSH</li>
<li>Create a cluster definition template file</li>
<li>Generate Azure Resource Manager templates using ACS-Engine</li>
<li>Install Azure CLI 2.0</li>
<li>Create Azure Resource Group for the Docker Swarm</li>
<li>Deploy Docker Swarm using Azure CLI 2.0</li>
</ol>
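<p>To give a feel for what those steps look like in practice, here's a rough sketch of steps 4, 6 and 7. The resource group, location and cluster names are placeholders, and exact paths and flags may vary between ACS-Engine and Azure CLI versions:</p>
<pre><code># Generate Azure Resource Manager templates from the cluster definition (step 4)
$ acs-engine generate swarmmode.json

# Log in and create a Resource Group for the Swarm (step 6)
$ az login
$ az group create --name &lt;RESOURCE_GROUP&gt; --location &lt;LOCATION&gt;

# Deploy the generated templates (step 7)
$ az group deployment create \
    --resource-group &lt;RESOURCE_GROUP&gt; \
    --template-file _output/&lt;CLUSTER_NAME&gt;/azuredeploy.json \
    --parameters _output/&lt;CLUSTER_NAME&gt;/azuredeploy.parameters.json
</code></pre>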
<h2 id="connectingtodockerswarmcluster">Connecting to Docker Swarm cluster</h2>
<p>Now that Docker Swarm is running on Azure, it's time to explore.</p>
<p><strong>Here are some things we'll need before we get started:</strong></p>
<ul>
<li>SSH Key</li>
<li>Linux admin username</li>
<li>Master FQDN</li>
</ul>
<p>The SSH Key required to connect to the Swarm cluster is the same one used to create the cluster with ACS-Engine. The admin username and Master FQDN can both be found in Azure Portal by navigating to the cluster's Deployment details screen as per these screenshots:</p>
<p><img src="https://abakonski.com/content/images/2018/04/azure-acs-2.png" alt="Deploying to Docker Swarm on Microsoft Azure (part 2 of 3)"><br>
<img src="https://abakonski.com/content/images/2018/04/azure-acs-1.png" alt="Deploying to Docker Swarm on Microsoft Azure (part 2 of 3)"></p>
<p>Now SSH into the default Swarm Master node:</p>
<pre><code># Add SSH key to keychain to allow SSH from Swarm Master to Swarm Agent nodes
$ ssh-add -K &lt;PATH_TO_PRIVATE_KEY&gt;

# SSH to Swarm cluster
# Use agent forwarding (-A flag) to allow SSH connections to Agent nodes
$ ssh -p 22 -A -i &lt;PATH_TO_PRIVATE_KEY&gt; &lt;LINUX_ADMIN_USERNAME&gt;@&lt;MASTER_FQDN&gt;
</code></pre>
<p><strong>Here are a few commands to start exploring your new Docker Swarm cluster:</strong></p>
<pre><code># Docker version information
$ docker version

# Overview details about the current Docker machine and Docker Swarm 
$ docker info

# List of the Master and Agent nodes running in the Swarm
$ docker node ls

# List of deployments running on the Swarm
$ docker stack ls

# List of services running on the Swarm
$ docker service ls
</code></pre>
<p>The command list above introduces the &quot;docker stack&quot; family of commands. A Docker Stack is essentially a group of services that run in an environment, in this case the Docker Swarm. The &quot;docker stack&quot; commands relate to operations that are performed across all the nodes within the Swarm, much like &quot;docker-compose&quot; does on a single Docker Machine.</p>
<h2 id="gettingswarmreadywithdockercompose">Getting Swarm-ready with Docker Compose</h2>
<p>In the first post in this series I introduced Docker Compose as a way of orchestrating containers and any additional required resources. Thankfully, the move from Docker Compose to Docker Swarm is trivial: Docker Swarm uses docker-compose.yml files that are virtually identical to those used by Docker Compose, albeit with some additional optional settings.</p>
<p>Here's the same docker-compose.yml file that I used in the first post, adapted for Docker Swarm:</p>
<pre><code>version: '3'

services:
  redis:
      image: redis:alpine
      deploy:
          mode: replicated
          replicas: 1
          restart_policy:
              condition: on-failure
      ports:
          - &quot;6379:6379&quot;
      networks:
          - mynet

  web:
      build: .
      image: 127.0.0.1:5000/myapp:latest
      deploy:
          mode: global
          restart_policy:
              condition: on-failure
      depends_on:
          - redis
      ports:
          - &quot;80:80&quot;
      networks:
          - mynet

networks:
  mynet:
</code></pre>
<p>There are 3 differences in this new file:</p>
<ul>
<li>&quot;deploy&quot; setting in redis block - this tells Swarm to only create one instance of the redis container and restart the container if it enters an error state</li>
<li>&quot;deploy&quot; setting in web block - this tells Swarm to put an instance of the web image on each agent node and restart any container if it enters an error state</li>
<li>The image tag on the web service has been prefixed with &quot;127.0.0.1:5000&quot;</li>
</ul>
<p>We will be using a private Docker registry inside our Swarm cluster so that each Agent node can pull our custom images during deployment. That address and port (127.0.0.1:5000) are where the private registry will be running.</p>
<h2 id="deployingawebapptodockerswarmthewrongway">Deploying a web app to Docker Swarm (the wrong way)</h2>
<p>There are two ways of manually deploying containers to Docker Swarm. One way is to SSH into a Master node's host VM, pull the appropriate repository, build the Docker images and deploy the stack. This is NOT the recommended way of doing things, but I'll introduce it here for illustration purposes, and also because this intermediate step builds some familiarity with Docker Swarm. This approach also doesn't require an external Docker registry (like Docker Hub, Azure Container Registry or Amazon Elastic Container Registry). Instead, a private registry will be created inside the Swarm.</p>
<p>This example uses my sample project from the GitHub repo mentioned above (<a href="https://github.com/abakonski/docker-swarm-flask">https://github.com/abakonski/docker-swarm-flask</a>). These commands are executed on a Manager node's host VM, in the SSH session you opened earlier:</p>
<pre><code># Pull web app code from GitHub
$ git clone https://github.com/abakonski/docker-swarm-flask.git
 
# Launch private Docker registry service on the Swarm
$ docker service create --name registry --publish 5000:5000 registry:latest
 
# Build custom Docker images
$ cd docker-swarm-flask
$ docker-compose build
 
# Push web image to private registry
$ docker push 127.0.0.1:5000/myapp:latest
 
# Deploy images to the Swarm
$ docker stack deploy --compose-file docker-compose.yml myapp
 
# List stack deployments
$ docker stack ls
 
# Show stack deployment details
$ docker stack ps myapp
</code></pre>
<p>The final command above (i.e. &quot;docker stack ps &lt;STACK_NAME&gt;&quot;) prints information about the containers that are running and the nodes they're running on, along with a few other pieces of information that are useful for monitoring and diagnosing problems with the stack. Since we're using agent forwarding to pass the SSH key into our connection to the Swarm Master node, connecting to any other node in the Swarm is as simple as running:</p>
<pre><code># Connect to another node inside the Swarm cluster
$ ssh -p 22 &lt;NODE_NAME&gt;

# This will typically look something like this
$ ssh -p 22 swarmm-agentpublic-12345678000001
</code></pre>
<p><strong>Once connected to a specific (Agent) node, we can inspect containers directly:</strong></p>
<pre><code># List containers running on the node
$ docker ps

# View all sorts of resource and configuration info about a container
$ docker inspect &lt;CONTAINER_ID&gt;

# View the logs of a container
$ docker logs &lt;CONTAINER_ID&gt;

# Connect to a container's interactive terminal (if bash is running in the container)
$ docker exec -it &lt;CONTAINER_ID&gt; bash

# If bash isn't available in a specific container, sh usually will be
$ docker exec -it &lt;CONTAINER_ID&gt; sh

# Get a list of other useful commands
$ docker --help
</code></pre>
<h2 id="manuallydeployingawebapptodockerswarmtherightway">Manually deploying a web app to Docker Swarm (the right way)</h2>
<p>The second — and strongly recommended — manual approach is to tunnel Docker commands on your local machine to the Docker Swarm manager node. In this example, we'll be doing just that and you'll notice that the majority of the process will be identical to the above technique. The only difference is that you're now running all your commands locally. All your file system commands are run on your local machine and only Docker commands are forwarded to the remote cluster.</p>
<p>As you'll find out in more depth in the third article in this series, there is another step that we can take to add some more robustness to the whole process. In the Continuous Deployment flow, we'll be building all the required images outside of the Docker Swarm (i.e. on the Jenkins build server), pushing them to a Docker Registry and finally we'll be pulling the images from the Registry and deploying them to the Swarm. That last step will be done by tunneling from the build server to the remote Docker Swarm manager, much in the same way as we'll do here.</p>
<p>These commands are all executed on your local machine:</p>
<pre><code># Pull web app code from GitHub
$ git clone https://github.com/abakonski/docker-swarm-flask.git
 
# Open SSH tunnel to Docker Swarm manager node
# Traffic on local port 2380 will be directed to manager node VM's port 2375 (Docker service)
$ ssh -fNL 2380:localhost:2375 -i &lt;PATH_TO_PRIVATE_KEY&gt; &lt;LINUX_ADMIN_USERNAME&gt;@&lt;MASTER_FQDN&gt;
 
# Tell Docker service on local machine to send all commands to port 2380 (i.e. run on Swarm manager node)
$ export DOCKER_HOST=':2380'
 
# Confirm that environment variable DOCKER_HOST is correctly set to &quot;:2380&quot;
$ echo $DOCKER_HOST
 
# Confirm that Docker commands are running on remote Swarm manager node - review the response here
$ docker info
 
# Launch private Docker registry service on the Swarm
$ docker service create --name registry --publish 5000:5000 registry:latest
 
# Build custom Docker images
$ cd docker-swarm-flask
$ docker-compose build
 
# Push web image to private registry
$ docker push 127.0.0.1:5000/myapp:latest
 
# Deploy images to the Swarm
$ docker stack deploy --compose-file docker-compose.yml myapp
 
# List stack deployments
$ docker stack ls
 
# Show stack deployment details
$ docker stack ps myapp
 
# Unset DOCKER_HOST environment variable
$ unset DOCKER_HOST
 
# Find the open SSH tunnel in your process list - the process ID is in the 1st column
$ ps ax | grep ssh
 
# End the SSH tunnel session using the above process ID
$ kill &lt;SSH_PROCESS_ID&gt;
</code></pre>
<p>This method allows for running all the same Docker commands as the first method, so feel free to do some testing and exploration. The difference here is that you can no longer SSH directly into the agent nodes, as you're working in the context of your local machine instead of being connected to the host VM of the Swarm manager node.</p>
<p><strong>TIP:</strong> If you find the above process of ending the SSH tunnel session dirty, another way to do this is to open the tunnel without the “f” argument. This will keep the session in the foreground. When you’re finished with the session, simply hit CONTROL+C on your keyboard to end the session. Running the tunnel in the foreground will obviously require all other commands to be run in a separate terminal window/tab.</p>
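<p>Another option again is to background the tunnel yourself with &quot;&amp;&quot; instead of using the &quot;f&quot; argument, and capture its process ID via &quot;$!&quot;, which avoids the ps/grep hunt entirely. Here's a runnable sketch of that pattern, using a long &quot;sleep&quot; as a stand-in for the actual ssh command:</p>
<pre><code># Stand-in for: ssh -NL 2380:localhost:2375 -i KEY USER@HOST
sleep 300 &
TUNNEL_PID=$!
echo "tunnel pid: $TUNNEL_PID"

# ... run your docker commands here ...

# Clean shutdown, no process hunting required
kill "$TUNNEL_PID"
wait "$TUNNEL_PID" 2>/dev/null || true
echo "tunnel closed"
</code></pre>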
<h2 id="testingthedeployment">Testing the deployment</h2>
<p>The example in this post runs a very simple Python app, with NginX as a web server and Redis as a session store. Now that the cluster is up and running, we can test it by browsing to the Swarm's Agent Public FQDN. This can be found in the Azure Portal by following the same steps we used to get the Master FQDN above; it appears right underneath the Master FQDN. If everything is working correctly, entering it into your browser's address bar should present you with a very simple page containing the message &quot;Hello There! This is your visit #&lt;VISIT_NUMBER&gt;&quot;.</p>
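<p>For a quick command-line smoke test of the same thing, substitute the Agent Public FQDN into a curl call (the visit number will vary):</p>
<pre><code>$ curl -s http://&lt;AGENT_PUBLIC_FQDN&gt;/
Hello There! This is your visit #&lt;VISIT_NUMBER&gt;
</code></pre>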
<h2 id="cleaningupthedeployment">Cleaning up the deployment</h2>
<p>To clean up the deployment entirely, follow these steps on the Swarm Master node:</p>
<pre><code># Bring down the stack
$ docker stack rm myapp

# Completely wipe all the images from the node
$ docker image rm $(docker image ls -a -q) -f

# Stop the private registry service
$ docker service rm registry
</code></pre>
<p>To remove the ACS-Engine deployment entirely and free up all the resources taken up by it, open the Azure Portal, navigate to Resource Groups, open the appropriate Resource Group that you created for this exercise and finally click &quot;Delete resource group&quot; at the top of the Overview tab.</p>
<h2 id="finalwords">Final words</h2>
<p>I mentioned earlier in the post that there are two ways of deploying to Docker Swarm. I covered the &quot;wrong&quot; (not recommended) approach, because I think it's helpful to get &quot;under the hood&quot; of Swarm to make troubleshooting easier if and when the need arises. The next post will walk through setting up a complete CI/CD pipeline with the help of Jenkins and BitBucket and this will include the &quot;correct&quot; way of connecting and deploying to Swarm.</p>
<p>Stay tuned for the next post and feel free to reach out in the comments with any feedback, questions or insights.</p>
]]></content:encoded></item><item><title><![CDATA[Dockerizing a web app + using Docker Compose (part 1 of 3)]]></title><description><![CDATA[<p>A couple months ago we decided to move Veryfi’s Python-based web app onto Microsoft Azure. The process was complicated and involved several stages. First I had to Dockerize the app, then move it into a Docker Swarm setup, and finally set up a CI/CD pipeline using Jenkins and</p>]]></description><link>https://abakonski.com/dockerizing-a-web-app-using-docker-compose-part-1-of-3/</link><guid isPermaLink="false">5a9356110b70fe0aa84f9735</guid><category><![CDATA[Continuous Deployment]]></category><category><![CDATA[docker]]></category><category><![CDATA[docker-compose]]></category><category><![CDATA[python]]></category><category><![CDATA[ci/cd]]></category><dc:creator><![CDATA[Andrzej Bakonski]]></dc:creator><pubDate>Mon, 26 Feb 2018 00:50:22 GMT</pubDate><media:content url="https://abakonski.com/content/images/2018/02/1_tMcfvpY0cJs-M6cuq8pZrw.png" medium="image"/><content:encoded><![CDATA[<img src="https://abakonski.com/content/images/2018/02/1_tMcfvpY0cJs-M6cuq8pZrw.png" alt="Dockerizing a web app + using Docker Compose (part 1 of 3)"><p>A couple months ago we decided to move Veryfi’s Python-based web app onto Microsoft Azure. The process was complicated and involved several stages. First I had to Dockerize the app, then move it into a Docker Swarm setup, and finally set up a CI/CD pipeline using Jenkins and BitBucket. Most of this was new to me, so the learning curve was steep. I had limited experience with Python and knew of Docker and Jenkins, but had yet to dive into the deep end. After completing the task, I thought I could share my research and process with the Veryfi community.</p>
<p><strong>I’ve compiled a three-part series that will cover these topics:</strong></p>
<ol>
<li>Dockerizing a web app, using Docker Compose for orchestrating multi-container infrastructure</li>
<li>Deploying to Docker Swarm on Microsoft Azure</li>
<li>CI/CD using BitBucket, Jenkins, Azure Container Registry</li>
</ol>
<p><strong>This is the first post in the series.</strong></p>
<p>I won’t go into a full-blown explanation of Docker — there are plenty of articles online that answer that question, and a good place to start is here. One brief (and incomplete) description is that Docker creates something similar to Virtual Machines, except that Docker containers run on the host machine’s OS, rather than on a VM. Each Docker container should ideally contain one service, and an application can comprise multiple containers. With this approach, individual containers (services) can be easily swapped out or scaled out, independently of others. For example, our main web app currently runs on 3 instances of the main Python app container, and they all speak to one single Redis container.</p>
<p><strong>So why did we decide to implement Docker at Veryfi?</strong></p>
<p>The main benefits to us have been portability and the ease of scaling and orchestration. We use Macs internally and deploy to Linux servers on Azure and AWS, so we wanted to avoid the “well, it worked on my machine…” scenario; the app had to be easy to port to whatever infrastructure we chose. We also wanted to be able to scale easily, and both Azure and AWS have container services that make this straightforward with Docker. Simply put, we wanted to make it all easy. And as I’ll explain in the third post in this series, we made things even easier by implementing CI/CD.</p>
<h2 id="dockerizinganapp">Dockerizing an app</h2>
<p><strong>Note:</strong> the example included in this section can be found in this GitHub repo: <a href="https://github.com/abakonski/docker-flask">https://github.com/abakonski/docker-flask</a><br>
The example here is a minimal, “Hello World” app.</p>
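<p>For a concrete (and hypothetical) picture of what’s being Dockerized, the app can be thought of as a single WSGI callable along the following lines. The real repo uses Flask, but a bare WSGI app keeps this sketch dependency-free (uWSGI looks for a callable named “application” by default):</p>
<pre><code># app/main.py - hypothetical sketch of a minimal "Hello World" WSGI app
# (the actual app in the repo uses Flask; this bare version has no dependencies)

def application(environ, start_response):
    # uWSGI loads a callable named "application" by default
    body = b"Hello, World!"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
</code></pre>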
<p>Docker containers are defined by Docker images, which are essentially templates for the environment that a container will run in, as well as the service(s) that will be running within them. A Docker image is defined by a Dockerfile, which outlines what gets installed, how it’s configured etc. This file always first defines the base image that will be used.</p>
<p>Docker images comprise multiple layers. For example, our web app image is based on the “python:3.6” image (<a href="https://github.com/docker-library/python/blob/d3c5f47b788adb96e69477dadfb0baca1d97f764/3.6/jessie/Dockerfile">https://github.com/docker-library/python/blob/d3c5f47b788adb96e69477dadfb0baca1d97f764/3.6/jessie/Dockerfile</a>). This Python image is based on several layers of images containing various Debian Jessie build dependencies, which are ultimately based on a standard Debian Jessie image. It’s also possible to base a Docker image on “scratch” — an empty image that is the very top-level base image of all other Docker images, which allows for a completely customizable image, from OS to the services and any other software.</p>
<p><strong>In addition to defining the base image, the Dockerfile also defines things like:</strong></p>
<ul>
<li>Environment variables</li>
<li>Package/dependency install steps</li>
<li>Port configuration</li>
<li>Environment set up, including copying application code to the image and any required file system changes</li>
<li>A command to start the service that will run for the duration of the Docker container’s life</li>
</ul>
<p><strong>This is an example Dockerfile:</strong></p>
<pre><code>FROM python:3.6

# Set up environment variables
ENV NGINX_VERSION '1.10.3-1+deb9u1'

# Install dependencies
RUN apt-key adv --keyserver hkp://pgp.mit.edu:80 --recv-keys 573BFD6B3D8FBC641079A6ABABF5BD827BD9BF62 \
    &amp;&amp; echo &quot;deb http://httpredir.debian.org/debian/ stretch main contrib non-free&quot; &gt;&gt; /etc/apt/sources.list \
    &amp;&amp; echo &quot;deb-src http://httpredir.debian.org/debian/ stretch main contrib non-free&quot; &gt;&gt; /etc/apt/sources.list \
    &amp;&amp; apt-get update -y \
    &amp;&amp; apt-get install -y -t stretch openssl nginx-extras=${NGINX_VERSION} \
    &amp;&amp; apt-get install -y nano supervisor \
    &amp;&amp; rm -rf /var/lib/apt/lists/*


# Expose ports
EXPOSE 80

# Forward request and error logs to Docker log collector
RUN ln -sf /dev/stdout /var/log/nginx/access.log \
    &amp;&amp; ln -sf /dev/stderr /var/log/nginx/error.log

# Make NGINX run on the foreground
RUN if ! grep --quiet &quot;daemon off;&quot; /etc/nginx/nginx.conf ; then echo &quot;daemon off;&quot; &gt;&gt; /etc/nginx/nginx.conf; fi;

# Remove default configuration from Nginx
RUN rm -f /etc/nginx/conf.d/default.conf \
    &amp;&amp; rm -rf /etc/nginx/sites-available/* \
    &amp;&amp; rm -rf /etc/nginx/sites-enabled/*

# Copy the modified Nginx conf
COPY /conf/nginx.conf /etc/nginx/conf.d/

# Custom Supervisord config
COPY /conf/supervisord.conf /etc/supervisor/conf.d/supervisord.conf

# COPY requirements.txt and RUN pip install BEFORE copying the rest of the app code. Docker caches each layer,
# so dependencies are only re-installed when requirements.txt changes, not every time the app code changes
COPY /app/requirements.txt /home/docker/code/app/
RUN pip3 install -r /home/docker/code/app/requirements.txt

# Copy app code to image
COPY /app /app
WORKDIR /app

# Copy the base uWSGI ini file to enable default dynamic uwsgi process number
COPY /app/uwsgi.ini /etc/uwsgi/
RUN mkdir -p /var/log/uwsgi


CMD [&quot;/usr/bin/supervisord&quot;]
</code></pre>
<p><strong>Here’s a cheat sheet of the commands used in the above example:</strong></p>
<ul>
<li>FROM — this appears at the top of all Dockerfiles and defines the image that this new Docker image will be based on. This could be a public image (see <a href="https://hub.docker.com/">https://hub.docker.com/</a>) or a local, custom image</li>
<li>ENV — this command sets environment variables that are available within the context of the Docker container</li>
<li>EXPOSE — this documents the ports that the container listens on, so traffic can be sent into them. The ports still need to be listened on from within the container (i.e. NginX could be configured to listen to port 80), and they still need to be published on the host (e.g. with docker run -p) for traffic from outside the container to get through</li>
<li>RUN — this command will run shell commands inside the container (when the image is being built)</li>
<li>COPY — this copies files from the host machine to the container</li>
<li>CMD — this is the command that will execute on container launch and will dictate the life of the container. If it’s a service, such as NginX, the container will continue to run for as long as NginX is up. If it’s a quick command (e.g. “echo ‘Hello world’”), then the container will stop running as soon as the command has executed and exited</li>
</ul>
<p>The Docker image resulting from the above Dockerfile will be based on the Python 3.6 image and contain NginX and a copy of the app code. The Python dependencies are all listed in requirements.txt and are installed as part of the process. NginX, uWSGI and supervisord are all configured as part of this process as well.</p>
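<p>For illustration, the requirements.txt for an app like this might contain entries along these lines (hypothetical contents; the actual file in the repo may differ, and in practice versions should be pinned):</p>
<pre><code># app/requirements.txt (hypothetical example)
flask
uwsgi
redis
</code></pre>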
<p>This setup breaks the rule of thumb for the “ideal” way of using Docker, in that one container runs more than one service (i.e. NginX and uWSGI). It was a case-specific decision to keep things simple. Of course, there could be a separate container running just NginX and one running uWSGI, but for the time being, I’ve left the two in one container.</p>
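<p>For reference, the NginX half of that pairing could be configured with a minimal conf file along these lines (a hypothetical sketch; the socket path and the repo’s actual conf/nginx.conf will differ):</p>
<pre><code># conf/nginx.conf - hypothetical sketch
# NginX listens on port 80 and hands requests to uWSGI over a unix socket
server {
    listen 80;

    location / {
        include uwsgi_params;
        # this path must match the socket configured in uwsgi.ini
        uwsgi_pass unix:///tmp/uwsgi.sock;
    }
}
</code></pre>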
<p>These services are both run and managed with the help of supervisord. Here’s the supervisord config file that ensures NginX and uWSGI are both running:</p>
<pre><code>[supervisord]
nodaemon=true

[program:uwsgi]
# Run uWSGI with custom ini file
command=/usr/local/bin/uwsgi --ini /etc/uwsgi/uwsgi.ini
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0

[program:nginx]
# NginX will use a custom conf file (ref: Dockerfile)
command=/usr/sbin/nginx
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0
</code></pre>
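<p>The uwsgi.ini that the supervisord config points at isn’t shown in this post; a minimal version might look like the following (a hypothetical sketch; the repo’s actual file differs):</p>
<pre><code>; /etc/uwsgi/uwsgi.ini - hypothetical sketch
[uwsgi]
; load the "application" callable from the app module
module = main
callable = application
; share a unix socket with NginX (must match the path in the NginX conf)
socket = /tmp/uwsgi.sock
chmod-socket = 666
; %k expands to the number of CPU cores, giving a dynamic process count
processes = %k
</code></pre>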
<h2 id="launchingadockercontainer">Launching a Docker container</h2>
<p>I’m not including instructions on installing Docker in this post (the official Docker installation docs are a good place to get started).</p>
<p>With the above project set up and Docker installed, the next step is to actually launch a Docker container based on the above image definition.</p>
<p>First, the Docker image must be built. In this example, I’ll tag (name) the image as “myapp”. In whatever terminal/shell is available on the machine you’re using (I’m running the Mac terminal), run the following command:</p>
<pre><code>$ docker build -t myapp .
</code></pre>
<p>Next, run a container based on the above image using one of the following commands:</p>
<pre><code># run Docker container in interactive terminal mode - this will print logs to the terminal stdout, hitting command+C (or Ctrl+C etc) will kill the container
$ docker run -ti -p 80:80 myapp

# run Docker container quietly in detached/background mode - the container will need to be killed with the &quot;docker kill&quot; command (see next code block below)
$ docker run -d -p 80:80 myapp
</code></pre>
<p>The above commands map port 80 on the host machine to the Docker container’s port 80. The Python app should now be accessible on port 80 on localhost (i.e. open <a href="http://localhost/">http://localhost/</a> in a browser on the host machine).</p>
<p>Here are some helpful commands to see what’s going on with the Docker container and perform any required troubleshooting:</p>
<pre><code># list running Docker containers
$ docker ps

# show logs for a specific container
$ docker logs [container ID]

# connect to a Docker container's bash terminal
$ docker exec -it [container ID] bash

# stop a running container
$ docker kill [container ID]

# remove a container
$ docker rm [container ID]

# get a list of available Docker commands
$ docker --help
</code></pre>
<h2 id="dockercompose">Docker Compose</h2>
<p><strong>Note:</strong> the example included in this section is contained in this GitHub repo: <a href="https://github.com/abakonski/docker-compose-flask">https://github.com/abakonski/docker-compose-flask</a><br>
As above, the example here is minimal.</p>
<p>The above project is a good start, but it’s a very limited example of what Docker can do. The next step in setting up a microservice infrastructure is Docker Compose. Most apps comprise multiple services that interact with each other, and Docker Compose is a pretty simple way of orchestrating exactly that. The concept is that you describe the environment in a YAML file (usually named docker-compose.yml) and launch the entire environment with just one or two commands.</p>
<p><strong>This YAML file describes things like:</strong></p>
<ul>
<li>The containers that need to run (i.e. the various services)</li>
<li>The various storage mounts and the containers that have access to them — this makes it possible for various services to have shared access to files and folders</li>
<li>The various network connections over which containers can communicate with each other</li>
<li>Other configuration parameters that will allow containers to work together</li>
</ul>
<pre><code>version: '3'

services:
  redis:
    image: &quot;redis:alpine&quot;
    ports:
      - &quot;6379:6379&quot;
    networks:
      - mynet

  web:
    build: .
    image: myapp:latest
    ports:
      - &quot;80:80&quot;
    networks:
      - mynet

networks:
  mynet:
</code></pre>
<p>The above YAML file defines two services, the Docker images their containers will be based on, and one network that both containers will be connected to so that they can “talk” to each other.</p>
<p>In this example, the first container will be created based on the public “redis:alpine” image. This is a generic image that runs a Redis server. The “ports” setting is used to open a port on the container and map it to a host port; the syntax is “HOST:CONTAINER”. In this example we forward host port 6379 to the same port in the container. Lastly, we tell Docker Compose to put the Redis container on the “mynet” network, which is defined at the bottom of the file.</p>
<p>The second container defined will be based on a custom local image, namely the one that’s outlined in the first section of this article. The “build” setting here simply tells Docker Compose to build the Dockerfile that is sitting in the same directory as the YAML file (./Dockerfile) and tag that image with the value of “image” — in this case “myapp:latest”. The “web” container is also going to run on the “mynet” network, so it will be able to communicate with the Redis container and the Redis service running within it.</p>
<p>Finally, there is a definition for the “mynet” network at the bottom of the YAML file. This is set up with the default configuration.</p>
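<p>To make the service discovery concrete: inside the “web” container, Docker’s embedded DNS resolves the service name “redis” to the Redis container’s address on the “mynet” network. A connection URL could therefore be built along these lines (a hypothetical helper, not code from the actual app; the REDIS_HOST and REDIS_PORT variables are made-up overrides for running outside Docker):</p>
<pre><code>import os

def redis_url():
    # On the Compose network, the service name "redis" works as a hostname,
    # so it makes a sensible default. REDIS_HOST/REDIS_PORT are hypothetical
    # env vars that allow overriding the target outside of Docker.
    host = os.environ.get("REDIS_HOST", "redis")
    port = int(os.environ.get("REDIS_PORT", "6379"))
    return "redis://{}:{}/0".format(host, port)
</code></pre>
<p>A Redis client (e.g. the redis-py library) could then connect using this URL.</p>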
<p>This is a bare-bones setup, just enough to get an example up and running. There is a ton of info on Docker Compose YAML files in the official Docker Compose documentation.</p>
<p>Once the docker-compose.yml file is ready, build it (in this case only the “web” project will actually be built, as the “redis” image will just be pulled from the public Docker hub repo). Then bring up the containers and network:</p>
<pre><code># build all respective images
$ docker-compose build

# create containers, network, etc
$ docker-compose up

# as above, but in detached mode
$ docker-compose up -d
</code></pre>
<p>Refer to the Docker commands earlier in this article for managing the containers created by Docker Compose. When in doubt, use the “--help” argument, as in:</p>
<pre><code># general Docker command listing and help
$ docker --help

# Docker network help
$ docker network --help

# Help with specific Docker commands
$ docker &lt;command&gt; --help

# Docker Compose help
$ docker-compose --help
</code></pre>
<p><strong>So there you have it — a “Hello World” example of Docker and Docker Compose.</strong></p>
<p>Just remember that this is a starting point. Anyone diving into Docker for the first time will find themselves sifting through the official Docker docs and StackOverflow forums etc, but hopefully this post is a useful intro. Stay tuned for my follow-up posts that will cover deploying containers into Docker Swarm on Azure and then setting up a full pipeline into Docker Swarm using Jenkins and BitBucket.</p>
<p>If you have any feedback, questions or insights, feel free to reach out in the comments.</p>
]]></content:encoded></item></channel></rss>