Seamlessly Integrated Deep Learning Environment with Terraform, Google Cloud, Gitlab and Docker

By Alexander Mueller

When you start serious deep learning projects, you usually run into the problem that you need a proper GPU. Buying a workstation suitable for deep learning workloads can easily become very expensive. Luckily, there are some options in the cloud. The one I tried is the wonderful Google Compute Engine, where GPUs are available as accelerators attached to an instance. Currently, the following GPUs are available (prices for us-central1):

  • NVIDIA® Tesla® P4: $1267.28 USD per GPU per Month
  • NVIDIA® Tesla® V100: $306.60 USD per GPU per Month
  • NVIDIA® Tesla® P100: $746.06 USD per GPU per Month
  • NVIDIA® Tesla® K80: $229.95 USD per GPU per Month

Manual configuration is not something you can scale up easily, so I investigated whether there are methods to ramp up my environment as seamlessly as possible and destroy it just as easily. I ended up with a solution that uses Terraform to set up the infrastructure on the Google Cloud Platform. The source code is deployed from Git, and a Docker container is started automatically with all necessary dependencies such as TensorFlow, Keras and Jupyter installed. In this blog post I will guide you through the individual steps of setting up this environment. Some of the work is based on this git repository: https://github.com/Cheukting/GCP-GPU-Jupyter

  • Setting up a GCE Instance with GPUs in an automated way
  • How to use terraform with GCP
  • How to deploy the code of a Gitlab repository to a GCE Instance
What we will build in this blog post.

As shown in the chart above, I will show you how to write a Terraform script that automatically spins up a Google Compute Engine virtual machine, installs CUDA, Docker, etc. on it, and finally starts a Docker container with the code from an external Git repository (in our case from Gitlab). This Docker container runs the Jupyter notebook server, which can be reached from the outside via your browser. In addition, I will show you how to run longer tasks outside of a notebook inside your VM using Docker.

If you just want to try it out, check out the TL;DR section at the end, where all the necessary commands are listed.

If you already have a repository that you want to use, feel free to do so. Otherwise, it is now time to create a new repository on https://gitlab.com/. All your ML and data exploration code will go into this repository; we will keep it separate from the infrastructure code.

In this Gitlab repository you could now add all your code and, for instance, create a simple train.py file that trains a neural network and saves the trained weights at the end. A very lean example of what is possible can be found here: https://gitlab.com/dice89/deep-learning-experiments

This repo simply contains a Jupyter Notebook for some data exploration and a train.py to train an RNN LSTM text generation model.

You can either take an existing Docker image or create your own. For the sake of simplicity, we will use a pre-built Docker image with a Python 3.5 environment and all the libraries needed for a reasonable deep learning use case. It contains everything you need for a first try: Python 3.x, tensorflow-gpu, numpy, pandas, sklearn and keras.

If you're interested, you can check out this image here: https://hub.docker.com/r/dice89/ubuntu-gpu-python-dl/
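If you have Docker installed locally, you can also pull the image and poke around in it before using it on the VM:

# pull the pre-built deep learning image from Docker Hub
docker pull dice89/ubuntu-gpu-python-dl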

Terraform is an infrastructure-as-code toolkit that allows you to define, create and destroy infrastructure by writing code. This comes with a lot of advantages: for instance, you don't need to configure anything through a UI like the GCP console. Furthermore, your entire infrastructure configuration is documented by default, since it is readable code that is ideally versioned in a git repository. I only discovered Terraform a year ago and already cannot imagine setting up infrastructure without it. HashiCorp found a wonderful combination of codable and versionable infrastructure that can be understood with a coder's mindset. You only need some Terraform files and the Terraform CLI to create your infrastructure.

A new Google Compute Engine instance, for example, can be created like this:

Example of a gcloud compute instance creation with terraform.
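A minimal sketch of such a Terraform configuration could look as follows; the machine type, image and resource names are illustrative, while credentials.json and the project_id/region variables match the commands used later in this post (note that the variable is called region but we will pass a zone into it):

variable "project_id" {}
variable "region" {}

# the provider is configured with the service account key we download below
provider "google" {
  credentials = file("credentials.json")
  project     = var.project_id
}

# a plain Compute Engine instance -- the GPU-specific parts come further below
resource "google_compute_instance" "gpu_vm" {
  name         = "gpu-vm"
  machine_type = "n1-standard-2"
  zone         = var.region

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-1604-lts"
    }
  }

  network_interface {
    network = "default"
    access_config {}   # assigns an external IP
  }
}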

Before you can create an instance you need to perform some steps upfront:

In the following snippet, you will create a gcloud project, set the execution context to your current account, create a service account that is able to create new instances, and finally download the private key so that Terraform can use this service account.

Execute these commands in your CLI and replace <your> with the project name you desire. This can take a bit of time, mainly due to the activation of the Compute Engine API.
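The commands boil down to the following (they appear again in the TL;DR section at the end):

# create a new project and make it the active one for the gcloud CLI
gcloud projects create <your>-dl --enable-cloud-apis
gcloud config set project <your>-dl

# enable the Compute Engine API (this is the part that takes a while)
gcloud services enable compute.googleapis.com

# service account that Terraform will use to create instances
gcloud iam service-accounts create gcp-terraform-dl --display-name gcp-terraform-dl
gcloud projects add-iam-policy-binding <your>-dl \
  --member='serviceAccount:gcp-terraform-dl@<your>-dl.iam.gserviceaccount.com' \
  --role='roles/owner'

# download the service account's private key for Terraform
gcloud iam service-accounts keys create 'credentials.json' \
  --iam-account='gcp-terraform-dl@<your>-dl.iam.gserviceaccount.com'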

Before we start the instance, let us have a look at the details of how to configure a Google Compute instance with a GPU enabled. Unfortunately, you now have to request a quota for the GPU (it worked for me without one until Dec. 2018). Please follow the advice from this Stackoverflow article. The handling of this request can take up to 2 business days.

Google Compute Instance with a GPU
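A sketch of the GPU-enabled instance follows; again the names and sizes are illustrative, and the exact file lives in the google-cloud-gpu-vm repository we clone below:

resource "google_compute_instance" "gpu_vm" {
  name         = "gpu-vm"
  machine_type = "n1-standard-2"
  zone         = var.region

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-1604-lts"
      size  = 50   # GB; CUDA, Docker images and data need some room
    }
  }

  # attach one NVIDIA Tesla K80 to the instance
  guest_accelerator {
    type  = "nvidia-tesla-k80"
    count = 1
  }

  # GPU instances cannot be live-migrated, so they must terminate on host maintenance
  scheduling {
    on_host_maintenance = "TERMINATE"
  }

  network_interface {
    network = "default"
    access_config {}
  }

  # executed as root on first boot: installs CUDA, Docker and starts the container
  metadata_startup_script = file("start-up-script.sh")
}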

As you can see in the guest_accelerator block, we attach a Tesla K80 GPU to this instance, and on start-up we perform some actions in a script (start-up-script.sh). This is shown below:

Setting up Ubuntu VM to run with CUDA
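The real script is part of the google-cloud-gpu-vm repository; an abbreviated sketch of what it does could look like this (package names and the Jupyter invocation are illustrative, and the steps that add the NVIDIA/CUDA and Docker apt repositories are omitted):

#!/bin/bash
# start-up-script.sh (abbreviated sketch) -- runs as root on first boot of the VM

# GPU driver, Docker and the NVIDIA container runtime
# (the apt repositories for CUDA and nvidia-docker must be added first; omitted here)
apt-get update
apt-get install -y cuda-drivers docker.io nvidia-docker2
systemctl restart docker

# deploy key so the VM can clone the Gitlab repository;
# this is where the "ADD YOUR SSH KEY HERE" placeholder ends up
mkdir -p /root/.ssh
cat <<'EOF' > /root/.ssh/id_rsa
ADD YOUR SSH KEY HERE
EOF
chmod 600 /root/.ssh/id_rsa
ssh-keyscan gitlab.com >> /root/.ssh/known_hosts

# fetch the project code; this folder is later mounted into the container
mkdir -p /root/datascience
cd /root/datascience
git clone git@gitlab.com:<your_user>/deep-learning-experiments.git

# start the Jupyter container and expose it on port 80
# (no token -- anyone who knows the IP can access it, hence the warning below)
docker run --runtime=nvidia -d -p 80:8888 -v /root/datascience:/root/project \
  dice89/ubuntu-gpu-python-dl \
  jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root \
  --notebook-dir=/root/project --NotebookApp.token=''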

In this script we install all the required libraries, add an ssh key for the user and run our Docker container, which exposes port 80 to the outside world so that we can reach the Jupyter notebook server. Please note that anyone who knows the IP can access your notebooks once this instance is created; this should be changed even for a short-lived environment like this one. Now create a new ssh key so that we can deploy our code from Gitlab to the instance:

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

To make it work, you have to replace the placeholder "ADD YOUR SSH KEY HERE" in start-up-script.sh with your generated private (!!!) ssh key. Please note: do not share your key with anyone, and do not commit your private key to any git repository! Clone the full configuration from this repository and then change the ssh key: https://gitlab.com/dice89/google-cloud-gpu-vm. Also make sure that your credentials.json is in the root of this folder (and don't commit the credentials.json either). Finally, you have to add the corresponding public ssh key to your Gitlab account so that the code from your Gitlab repository can be deployed.

Now we’re ready to create the machine. And this is possible with only 3 bash commands! 🚀

Fill in your GCP project id, type yes, and your instance will be created (also consider the costs this will incur; GPUs are not covered by the free tier). After the instance is created, you will see an IP address printed to your command line. This is the public IP under which your Jupyter instance will be available. It might take a couple of minutes until the start-up-script.sh has finished and everything is installed.

Let us use the time until the script is done to explore the instance a bit. To do so, you have to ssh into it. Luckily, Google provides a command for this:

gcloud compute --project "<your>-dl" ssh --zone "europe-west1-d" "gpu-vm"

The start-up-script.sh runs as the root user, so you have to switch to root to see what is happening:

sudo su
cd /var/log
tail -f syslog | grep startup-script

Now we’re on the instance and can check, e.g., if a GPU is installed and already used.

nvidia-smi -l 1

We can also install htop, since it comes in handy for monitoring the memory consumption of your running processes:

sudo apt-get install htop

After a while you can check if there already is a Docker container running:

docker ps

If you see your Docker container in this overview, you are ready to log into your Jupyter notebook under the IP shown.

Also, if you go to the path ~/datascience/deep-learning-experiments, you will see that it is automatically mounted into your Docker container under /root/project and contains the contents of your Gitlab repository, such as train.py.
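You can verify the mount from the host as well, for example by listing the folder inside the running container (using the container id from docker ps):

docker exec <container_id> ls /root/project/deep-learning-experiments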

Jupyter is nice for data exploration or experimental code. However, training deep learning models takes a lot of time, and you cannot afford your Jupyter session to crash and lose all the training progress. Luckily, there is a remedy: you can easily train a new model by running a Docker container as a daemon that executes a Python script to train your model. In our example, all you need to do is type:

docker run --runtime=nvidia -d -v ~/datascience:/root/project dice89/ubuntu-gpu-python-dl python3 /root/project/deep-learning-experiments/train.py

If you now check docker ps, you will see the training container listed as well.

To see the logs of your training task, simply type (the container id comes from docker ps):

docker logs <your_container_id>

Finally, you’re training your model with code from a Git repository in a reproducible fashion. When you’re done and have saved and stored your weights, you can destroy the environment by simply typing:

terraform destroy \
-var 'project_id=<your>-dl' \
-var 'region=europe-west1-d'

If you need the same environment again, simply type:

terraform apply \
-var 'project_id=<your>-dl' \
-var 'region=europe-west1-d'

So, this is it for this little walkthrough on how to create an environment for deep learning using cloud resources. Have fun trying it out, and send me some feedback if you have any suggestions for improving the environment.

TL;DR

Here are the instructions to create the environment in a nutshell, with a predefined Docker container. Exchange <your> with a prefix you like.

1. Create a gcloud account

2. Install gcloud CLI: https://cloud.google.com/sdk/docs/downloads-interactive

curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init

3. Create a gcloud project and enable the Compute API (replace <your>)

gcloud projects create <your>-dl --enable-cloud-apis
gcloud config set project <your>-dl
gcloud services enable compute.googleapis.com

4. Install Terraform: https://www.terraform.io/intro/getting-started/install.html

brew install terraform

5. (Optional, if you want to deploy some code to it) Fork and git clone the deep learning experiments repository

https://gitlab.com/dice89/deep-learning-experiments/forks/new

git clone git@gitlab.com:<your_user>/deep-learning-experiments.git

6. Git clone the code that defines the Google Compute Engine VM with a GPU

git clone git@gitlab.com:dice89/google-cloud-gpu-vm.git
cd google-cloud-gpu-vm

7. Create an ssh key

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

8. Add the private ssh key to the `start_up_script.sh` in the infrastructure repository

9. Add the public ssh key to your Gitlab account

10. Create a GCP service account and get credentials.json

gcloud iam service-accounts create gcp-terraform-dl --display-name gcp-terraform-dl
gcloud projects add-iam-policy-binding <your>-dl \
  --member='serviceAccount:gcp-terraform-dl@<your>-dl.iam.gserviceaccount.com' \
  --role='roles/owner'
gcloud iam service-accounts keys create 'credentials.json' \
  --iam-account='gcp-terraform-dl@<your>-dl.iam.gserviceaccount.com'

11. Init your Terraform environment

terraform init

12. Start the environment

terraform apply \
-var 'project_id=<your>-dl' \
-var 'region=europe-west1-d'

Wait a bit (roughly 5–10 minutes) to see the IP address of your jupyter notebook server:

terraform show | grep assigned_nat_ip

To ssh into your compute instance:

gcloud compute --project "<your>-dl" ssh --zone "europe-west1-d" "gpu-vm"

13. Destroy the environment

terraform destroy \
-var 'project_id=<your>-dl' \
-var 'region=europe-west1-d'

If you find any problems with the tutorial, please report them to me! I’m very keen to keep it up-to-date.