✏️ Week 1 lesson 1 of DataTalksClub 2022 data engineering zoomcamp, 🐳 Docker install & run, 🐍 Python scripts in containers, 🎛 passing params
Today, we will follow DataTalksClub's video: DE Zoomcamp 1.2.1 - Introduction to Docker, which is part of the DataTalksClub 2022 Data Engineering Zoomcamp week 1 repo.
🎬 In this lesson, we will:
- Install Docker.
- Run Docker containers.
- Create a Docker container with our dependencies using a Dockerfile.
- Execute a Python script automatically when running a Docker container.
- Pass parameters to a Python script from the Docker run command.
This post is part of a series. Find all the other related posts here.
🏗️ Introduction
Your computer has an Operating System (OS) and a collection of software installed on top of it that enables the creation of more software. When you build a new piece of code that runs on your machine, the OS, along with this software collection, becomes an implicit dependency of your code, and can give rise to "but it works on my computer" kinds of bugs. The only way to make sure your application runs on a different computer is to replicate this system's configuration, run your code in it, and see if it still works.
There are good reasons not to do this testing with physical hardware: it is impractical and expensive, and virtualization achieves the same goal. Docker enables OS-level virtualization. With Docker, we create containers that run explicit configurations of OS and software, isolated from the host system (your computer).
If you have ever typed import numpy as np and then used np somewhere in your code, you have created a dependency between your code and NumPy. In Python, we typically manage these explicit dependencies with virtual environments, normally handled by venv, conda, or poetry. In a way, Docker takes this one level of generalization further, letting us jointly define our code's implicit and explicit dependencies.
Explicit is better than implicit. - The Zen of Python
Below is a list of a few reasons we should use Docker. During the first week, we will focus on running local experiments.
💡 Reasons to use Docker
1. To run local experiments.
2. Integration tests (CI/CD).
3. Reproducibility.
4. For running pipelines in the cloud.
5. In combination with Spark.
6. For running serverless processes (AWS Lambda, Google Cloud Functions).
💻 Install Docker
Docker comes packaged for Windows and Mac as Docker Desktop, but since I'm running Ubuntu Linux, I followed the Docker Engine instructions for installing using the repository. If all went well, at the end of the instructions you will see the message below in your terminal.
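For reference, Docker's install instructions end with a verification step that runs the hello-world image; if everything is in place, it prints a "Hello from Docker!" greeting confirming the installation works:
sudo docker run hello-world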
📦 Running Docker containers
Let's learn about the structure of Docker's commands by following the suggestions in the message above:
docker run -it ubuntu bash
This command says: docker, run the ubuntu image in interactive mode (-it), and execute the bash command so we get a bash prompt. Everything that comes before the image name (ubuntu) is a parameter to run, and everything that comes after is a parameter to that container.
docker run -it ubuntu bash
root@3295092584eb:/# pwd
/
root@3295092584eb:/# ls
bin dev home lib32 libx32 mnt proc run srv tmp var
boot etc lib lib64 media opt root sbin sys usr
root@3295092584eb:/# echo "Look Mom I'm running bash in a Docker container!"
Look Mom I'm running bash in a Docker container!
root@3295092584eb:/#
To exit the container, type exit and hit enter.
Let's try running Python, specifying the version with a "tag":
docker run -it python:3.9
Python 3.9.10 (main, Jan 18 2022, 21:15:42)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from platform import python_version
>>> print(f"Look Mom now I'm running Python {python_version()} in a Docker container!")
Look Mom now I'm running Python 3.9.10 in a Docker container!
>>>
Now we are accessing a different container (python:3.9) than the one before (ubuntu). The former starts a Python interpreter session running in the container. Suppose we want to install Pandas in this container. For this, we need to specify the entry point, i.e., what exactly is executed when we run the container, so that we get access to the prompt.
docker run -it --entrypoint=bash python:3.9
root@2d64a4403ce8:/# pip install pandas
...
root@2d64a4403ce8:/# python
Python 3.9.10 (main, Jan 18 2022, 21:15:42)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> print(pd.__version__)
1.4.1
>>>
📝 Create a Docker container from a Dockerfile
With prompt access, we can install Pandas (pip install pandas). The challenge with this approach is that when we exit the container, it goes back to its original state (sans Pandas). We need a way to ensure that the Pandas library is in the container when we run our data ingestion script. For this, we use a Dockerfile. Let's create a working directory (e.g., 2_docker_sql) and add a file named Dockerfile (no extension) to it with the content below.
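A minimal sketch of what this Dockerfile could look like, based on the description that follows (base image, Pandas install, and bash as the entry point):
# start from the official python:3.9 image
FROM python:3.9

# install the Pandas library in the image
RUN pip install pandas

# drop into a bash prompt when the container starts
ENTRYPOINT [ "bash" ]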
Note that in the Dockerfile we start from the python:3.9 image, install Pandas on it, and then specify the entry point to be bash. We now want to build a new image using this recipe. For this, we use Docker's build command in the directory containing the Dockerfile, like
docker build -t test:pandas .
This means: docker, build an image using the Dockerfile in the current directory (.), and tag (name) it test:pandas. After the image is created, we can use it with
docker run -it test:pandas
🐍 Data pipeline file
Let's create a Python file in the same location as the Dockerfile, and call it ingest_data.py. This file will contain the data ingestion steps that we want to automate.
At this point, we are only interested in checking that Pandas can be imported into the container we created above. We also need to add this file to the Docker image by adding the COPY instruction to the Dockerfile; both are sketched below.
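A minimal sketch of ingest_data.py at this stage, assuming all we need is to import Pandas and print the success message quoted later in this post:
# ingest_data.py: check that Pandas can be imported in the container
import pandas as pd

print("job finished successfully")
And the updated Dockerfile sketch, now with the WORKDIR and COPY instructions:
FROM python:3.9

RUN pip install pandas

# set the working directory inside the container
WORKDIR /app

# copy the script from the host (source) into WORKDIR (destination name)
COPY ingest_data.py ingest_data.py

ENTRYPOINT [ "bash" ]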
Which reads: copy the ingest_data.py file in the current directory (source) to the container destination specified in WORKDIR, and give it the destination name (ingest_data.py).
We now have to rebuild the image with
docker build -t test:pandas .
So when we access the updated image,
docker run -it test:pandas
root@2d64a4403ce8:/app# pwd
/app
root@2d64a4403ce8:/app# ls
ingest_data.py
root@2d64a4403ce8:/app#
we note that the /app directory was created and that it contains the ingest_data.py file. Also, the current working directory is /app.
In the container, we can execute the pipeline script with
root@2d64a4403ce8:/app# python ingest_data.py
which should print the "job finished successfully" message.
🎛 Passing parameters to a script in a container
In the section above, we loaded a script from our host system into a Docker container, accessed the container, and ran the script by hand. However, we might want to run this script automatically (a self-sufficient container), perhaps with a set of parameters for the run, e.g., pull data for a specific date, apply some transformation, and save the results. Let's modify the ingest_data.py file to accept parameters.
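Here is a sketch of the parameterized script, inferred from the output shown at the end of this section (it prints the argument list, then a success message for the given day):
# ingest_data.py: accept a day parameter from the command line
import sys

# sys.argv holds the script name followed by any arguments passed to it
print(sys.argv)

day = sys.argv[1]

print(f"job finished successfully for day = {day}")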
Next, we must change the entry point in the Dockerfile to specify that we want to run the ingest_data.py file.
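The entry point change likely looks like the sketch below; any arguments placed after the image name in docker run are appended to the entry point command, which is how the date reaches the script:
FROM python:3.9

RUN pip install pandas

WORKDIR /app
COPY ingest_data.py ingest_data.py

# run the script on container start instead of dropping into bash
ENTRYPOINT [ "python", "ingest_data.py" ]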
After this change, we rebuild the image with
docker build -t test:pandas .
and run the container, passing a date after the tag:
docker run -it test:pandas 2022-01-26
which should print the success message:
docker run -it test:pandas 2022-01-26
['ingest_data.py', '2022-01-26']
job finished successfully for day = 2022-01-26
📝 Summary
In this lesson, we saw how to
- Install Docker
- Run containers from the terminal
- Build a container using a Docker file
- Build a container that executes a Python script when launched
- Pass parameters to a Python script in the container
In our next lesson, we will set up Postgres on Docker, load one month's worth of NYC taxi trip data, and access it with pgcli.