If you've ever found yourself spending way too much time staring at the screen as Docker builds your images, this one's for you. Here's a quick tip to keep in mind when writing Dockerfiles.

Command sequence is important

Each RUN, COPY and ADD command in a Dockerfile creates a separate layer (a.k.a. an intermediate image), and each layer is built and cached separately. Docker reuses cached layers wherever it can, but because every layer depends on the layers that came before it, a change in one layer invalidates the cache for every layer that follows. Getting the sequence of commands right is what keeps package installs, for example, from bogging down simple code changes.
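
To make the caching behaviour concrete, here's a tiny sketch (the file names are hypothetical). Each COPY below produces its own layer, and the cache is only invalidated from the point of change downwards:

FROM python:3.6

# Reused from cache as long as base.txt is unchanged
COPY base.txt /data/base.txt

# Rebuilt whenever base.txt or extra.txt changes, because this layer
# depends on every layer above it
COPY extra.txt /data/extra.txt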

The sequence of commands should generally run from those least likely to change to those most likely to change. Typically, that puts package installs at the top of the Dockerfile and anything like application code or frequently tweaked environment settings at the bottom.

Let's take a look at this example Dockerfile:

FROM python:3.6

# Copy app code to image
COPY /app /app

# Copy the modified Nginx conf
COPY /conf/nginx.conf /etc/nginx/conf.d/

# Custom Supervisord config
COPY /conf/supervisord.conf /etc/supervisor/conf.d/supervisord.conf

# Install python dependencies
COPY /app/requirements.txt /home/docker/code/app/
RUN pip3 install -r /home/docker/code/app/requirements.txt

# Install system dependencies
RUN echo "deb http://httpredir.debian.org/debian/ stretch main contrib non-free" >> /etc/apt/sources.list \
    && echo "deb-src http://httpredir.debian.org/debian/ stretch main contrib non-free" >> /etc/apt/sources.list \
    && apt-get update -y \
    && apt-get install -y -t stretch openssl nginx-extras=1.10.3-1+deb9u1 \
    && apt-get install -y nano supervisor \
    && rm -rf /var/lib/apt/lists/*

EXPOSE 80

CMD ["/usr/bin/supervisord"]

In the above example:

  • Code changes are applied first. Any change to the app code forces every command below this one to be re-run when the image is built, including the various dependency installs, which are typically the heaviest steps
  • Configuration files are copied next. These change far less often than the code, but because they sit below the code COPY, their layers are rebuilt every time the code changes
  • Python dependencies follow. These change more often than the config files, but not as often as the code
  • The heaviest call comes next: installing the various system dependencies. This takes up a significant share of the overall build time, yet it is also the step least likely to change between builds, and therefore the one we most want to keep cached
  • Next we expose port 80 on the resulting container (also not likely to change after the initial setup)
  • Finally we launch supervisor

We can see a few problems here. Every time we change the code (the most frequent kind of change), the build invalidates the cache for that layer and every layer after it, including all the dependency installs. It's an unnecessary hit that burns minutes every time this image is built.
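
To make that cost visible, here is the same Dockerfile in condensed form (the system-dependency RUN is abbreviated for the sake of illustration), annotated with what happens when only a file under /app has changed:

FROM python:3.6

# Cache miss: a file under /app changed
COPY /app /app

# Rebuilt, even though nginx.conf itself is unchanged
COPY /conf/nginx.conf /etc/nginx/conf.d/

# Rebuilt
COPY /conf/supervisord.conf /etc/supervisor/conf.d/supervisord.conf

# Rebuilt
COPY /app/requirements.txt /home/docker/code/app/

# Re-run: every Python dependency is reinstalled
RUN pip3 install -r /home/docker/code/app/requirements.txt

# Re-run: the heaviest step in the whole build
RUN apt-get update -y \
    && apt-get install -y openssl nginx-extras nano supervisor \
    && rm -rf /var/lib/apt/lists/*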

Ideally, the above Dockerfile would be restructured to something like this:

FROM python:3.6

EXPOSE 80

# Install system dependencies
RUN echo "deb http://httpredir.debian.org/debian/ stretch main contrib non-free" >> /etc/apt/sources.list \
    && echo "deb-src http://httpredir.debian.org/debian/ stretch main contrib non-free" >> /etc/apt/sources.list \
    && apt-get update -y \
    && apt-get install -y -t stretch openssl nginx-extras=1.10.3-1+deb9u1 \
    && apt-get install -y nano supervisor \
    && rm -rf /var/lib/apt/lists/*

# Install python dependencies
COPY /app/requirements.txt /home/docker/code/app/
RUN pip3 install -r /home/docker/code/app/requirements.txt

# Copy the modified Nginx conf
COPY /conf/nginx.conf /etc/nginx/conf.d/

# Custom Supervisord config
COPY /conf/supervisord.conf /etc/supervisor/conf.d/supervisord.conf

# Copy app code to image
COPY /app /app

CMD ["/usr/bin/supervisord"]

Here's what's happening with this new version of the Dockerfile:

  • We expose port 80 on the container first. After the initial setup this is unlikely to ever change, so it can always come from cache
  • Next we install all the system dependencies. This is a heavy call, but the packages it installs typically won't change very often, so for the majority of image builds this layer comes from Docker's cache and the step is essentially skipped
  • Python dependencies come next. This layer comes from the cache as long as the previous (system dependencies) layer did and 'requirements.txt' is unchanged. These dependencies are likely to be updated more often than the system ones, and they can take a long time to install, so caching them is a good idea as well
  • The next 2 commands copy config files. These probably won't change as often as the Python dependencies, but they are light calls, so having them rebuilt from scratch when a dependency changes isn't a big deal at all. I like to put all COPY and ADD commands towards the bottom of the Dockerfile, as these are usually cheap. The exception is when a change to one of these files affects a subsequent RUN step, in which case they need to be ordered to match that dependency
  • Finally the code files are copied. This layer will need to be rebuilt from scratch on most image builds, but putting it at the end ensures that everything above it is pulled from cache as much as possible, as the annotated sketch below shows
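
Annotating the restructured Dockerfile the same way shows the payoff on a typical code-only change (again with the system-dependency RUN abbreviated for illustration):

FROM python:3.6

EXPOSE 80

# Cached: nothing above this point has changed
RUN apt-get update -y \
    && apt-get install -y openssl nginx-extras nano supervisor \
    && rm -rf /var/lib/apt/lists/*

# Cached: requirements.txt is unchanged
COPY /app/requirements.txt /home/docker/code/app/
RUN pip3 install -r /home/docker/code/app/requirements.txt

# Cached: the config files are unchanged
COPY /conf/nginx.conf /etc/nginx/conf.d/
COPY /conf/supervisord.conf /etc/supervisor/conf.d/supervisord.conf

# Cache miss: only this cheap layer is rebuilt
COPY /app /app

CMD ["/usr/bin/supervisord"]

And if requirements.txt changes instead, the cache is only invalidated from the requirements COPY downwards: the pip install and every layer below it are re-run, but the expensive system-dependency layer still comes straight from cache.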