✏️ Week 1 lesson 1 of DataTalksClub 2022 data engineering zoomcamp, 🐳 Docker install & run, 🐍 Python scripts in containers, 🎛 passing params
Today, we will follow DataTalksClub's video: DE Zoomcamp 1.2.1 - Introduction to Docker, which is part of the DataTalksClub 2022 Data Engineering Zoomcamp week 1 repo.
🎬 In this lesson, we will:
- Install Docker.
- Run Docker containers.
- Create a Docker container with our dependencies using a Dockerfile.
- Execute a Python script automatically when running a Docker container.
- Pass parameters to a Python script from the Docker run command.
This post is part of a series. Find all the other related posts here.
🏗️ Introduction
Your computer has an Operating System (OS) and a collection of software installed on top of it that enables the creation of more software. When you build a new piece of code that runs on your machine, the OS, along with this software collection, becomes an implicit dependency of your code, and can give rise to "but it works on my computer" kinds of bugs. The only way to make sure your application runs on a different computer is to replicate this system's configuration, run your code in it, and see if it still works.
There are good reasons not to do this testing with physical hardware: it is impractical and expensive, and virtualization achieves the same goal. Docker enables OS-level virtualization. With Docker, we create containers that run explicit configurations of OS and software, isolated from the host system (your computer).
If you have ever typed import numpy as np and then used np somewhere in your code, you have created a dependency between your code and NumPy. In Python, we typically manage these explicit dependencies with virtual environments, normally handled by venv, conda, or poetry. In a way, Docker takes this one level of generalization further, letting us jointly define our code's implicit and explicit dependencies.
Explicit is better than implicit. - The Zen of Python
Below is a list of a few reasons we should use Docker. During the first week, we will focus on running local experiments.
💡 Reasons to use Docker
1. To run local experiments.
2. Integration tests (CI/CD).
3. Reproducibility.
4. For running pipelines in the cloud.
5. In combination with Spark.
6. For running serverless processes (AWS Lambda, Google Cloud Functions).
💻 Install Docker
Docker comes packaged for Windows and Mac as Docker Desktop, but since I'm running Ubuntu Linux, I followed the Docker Engine instructions for installing using the repository. If all went well, at the end of the instructions you will see the message below in your terminal.
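For reference, Docker's install instructions end with a verification step that runs the hello-world image; if everything is in place, it prints a "Hello from Docker!" greeting confirming the installation works:
sudo docker run hello-world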
📦 Running Docker containers
Let's learn about the structure of Docker's commands by following the suggestions in the message above:
docker run -it ubuntu bash
This command says: docker, run the ubuntu image in interactive mode (-it), and execute the bash command so we get a bash prompt. Everything that comes before the image name (ubuntu) is a parameter to run, and everything that comes after is a parameter to that container.
docker run -it ubuntu bash
root@3295092584eb:/# pwd
/
root@3295092584eb:/# ls
bin dev home lib32 libx32 mnt proc run srv tmp var
boot etc lib lib64 media opt root sbin sys usr
root@3295092584eb:/# echo "Look Mom I'm running bash in a Docker container!"
Look Mom I'm running bash in a Docker container!
root@3295092584eb:/#
To exit the container, type exit and hit enter.
Let's try running Python, specifying the version with a "tag":
docker run -it python:3.9
Python 3.9.10 (main, Jan 18 2022, 21:15:42)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from platform import python_version
>>> print(f"Look Mom now I'm running Python {python_version()} in a Docker container!")
Look Mom now I'm running Python 3.9.10 in a Docker container!
>>>
Now we are accessing a different container (python:3.9) than the one before (ubuntu). The former starts a Python interpreter session running in the container. Suppose we want to install Pandas in this container. For this, we need to specify the entry point, i.e., what exactly is executed when we run the container, so that we get access to the prompt.
docker run -it --entrypoint=bash python:3.9
root@2d64a4403ce8:/# pip install pandas
...
root@2d64a4403ce8:/# python
Python 3.9.10 (main, Jan 18 2022, 21:15:42)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> print(pd.__version__)
1.4.1
>>>
📝 Create a Docker container from a Dockerfile
With prompt access, we can install Pandas (pip install pandas). The challenge with this approach is that when we exit the container, it goes back to its original state (sans Pandas). We need a way to ensure that the Pandas library is in the container when we run our data ingestion script. For this, we use a Dockerfile. Let's create a working directory (e.g., 2_docker_sql) and add a file named Dockerfile (no extension) to it with the content below.
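A minimal sketch of what this Dockerfile could look like, based on the description that follows (base image, Pandas install, and bash as the entry point):
# start from the official python:3.9 image
FROM python:3.9

# install the Pandas library in the image
RUN pip install pandas

# drop into a bash prompt when the container starts
ENTRYPOINT [ "bash" ]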
Note that in the Dockerfile we start from the python:3.9 image, install Pandas on it, and then specify the entry point to be bash. We now want to build a new image using this recipe. For this, we use Docker's build command in the directory containing the Dockerfile, like
docker build -t test:pandas .
This means: docker, build an image using the Dockerfile in the current directory (.), and tag (name) it test:pandas. After the image is created, we can use it with
docker run -it test:pandas
🐍 Data pipeline file
Let's create a Python file in the same location as the Dockerfile, and call it ingest_data.py. This file will contain the data ingestion steps that we want to automate.
At this point, we are only interested in checking that Pandas can be imported into the container we created above. We also need to add this file to the Docker image by adding the COPY instruction to the Dockerfile; both are sketched below.
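A minimal sketch of ingest_data.py at this stage, assuming all we need is to import Pandas and print the success message quoted later in this post:
# ingest_data.py: check that Pandas can be imported in the container
import pandas as pd

print("job finished successfully")
And the updated Dockerfile sketch, now with the WORKDIR and COPY instructions:
FROM python:3.9

RUN pip install pandas

# set the working directory inside the container
WORKDIR /app

# copy the script from the host (source) into WORKDIR (destination name)
COPY ingest_data.py ingest_data.py

ENTRYPOINT [ "bash" ]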
Which reads: copy the ingest_data.py file in the current directory (source) to the container destination specified in WORKDIR, and give it the destination name (ingest_data.py).
We now have to rebuild the image with
docker build -t test:pandas .
So when we access the updated image,
docker run -it test:pandas
root@2d64a4403ce8:/app# pwd
/app
root@2d64a4403ce8:/app# ls
ingest_data.py
root@2d64a4403ce8:/app#
we note that the /app directory was created and that it contains the ingest_data.py file. Also, the current working directory is /app.
In the container, we can execute the pipeline script with
root@2d64a4403ce8:/app# python ingest_data.py
which should print the "job finished successfully" message.
🎛 Passing parameters to a script in a container
In the section above, we loaded a script from our host system into a Docker container, accessed the container, and ran the script by hand. However, we might want to run this script automatically (a self-sufficient container), perhaps with a set of parameters for the run, e.g., pull data for a specific date, apply some transformation, and save the results. Let's modify the ingest_data.py file to accept parameters.
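Here is a sketch of the parameterized script, inferred from the output shown at the end of this section (it prints the argument list, then a success message for the given day):
# ingest_data.py: accept a day parameter from the command line
import sys

# sys.argv holds the script name followed by any arguments passed to it
print(sys.argv)

day = sys.argv[1]

print(f"job finished successfully for day = {day}")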
Next, we must change the entry point in the Dockerfile to specify that we want to run the ingest_data.py file.
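The entry point change likely looks like the sketch below; any arguments placed after the image name in docker run are appended to the entry point command, which is how the date reaches the script:
FROM python:3.9

RUN pip install pandas

WORKDIR /app
COPY ingest_data.py ingest_data.py

# run the script on container start instead of dropping into bash
ENTRYPOINT [ "python", "ingest_data.py" ]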
After this change, we rebuild the image with
docker build -t test:pandas .
and run the container, passing a date after the tag:
docker run -it test:pandas 2022-01-26
which should print the success message:
docker run -it test:pandas 2022-01-26
['ingest_data.py', '2022-01-26']
job finished successfully for day = 2022-01-26
📝 Summary
In this lesson, we saw how to
- Install Docker
- Run containers from the terminal
- Build a container using a Docker file
- Build a container that executes a Python script when launched
- Pass parameters to a Python script in the container
In our next lesson, we will set up Postgres on Docker, load one month's worth of NYC taxi trip data, and access it with pgcli.