Snark Hyper: Serverless Deep Learning and Hyper-Parameter Search

By Snark AI

With the exponential growth of training data and the computational complexity of machine learning models, deep learning on the cloud has become very engineering-heavy. A single training experiment for a production-ready model may take up to two weeks. If one wants to explore more variations or fine-tune hyper-parameters, the production lifecycle becomes very slow. Picking the right instance, managing cloud instances for optimal utilization, handling spot/preemptible instances, and running multiple experiments at the same time all require a lot of DevOps work from deep learning engineers, whose time is better spent developing models.

Snark Hyper abstracts away ML infrastructure so you can focus on the essential: building and improving models at scale.

How does it work?
You can easily install the Snark CLI and register an account at Snark Lab:

sudo pip3 install snark
snark login

Define the training process in a mnist.yaml file:

version: 1
experiments:
  mnist:
    image: pytorch/pytorch:latest
    hardware:
      gpu: k80
    command:
      - git clone https://github.com/pytorch/examples
      - cd examples/mnist
      - python main.py

Boom, you are done…

snark up -f mnist.yaml

You have just started an instance, loaded a container equipped with PyTorch, downloaded the source code, and started training. Well, this will take some time, since we are bound by the speed of light; however, for long training runs, a few minutes should not matter.

After scheduling the task, we can check the status of the experiment by running snark ps. Once we are happy with the training process, we simply take it down with snark down {experiment_id} to avoid additional machine-time charges.

Getting Results: You can additionally upload the model to an S3 bucket or another repository by adding a command after the training script (python main.py), or run snark logs {experiment_id} to retrieve the training logs.

Things get interesting when we want to find the best learning rate or the optimal batch size by parallelizing experiments over many instances in the cloud.

The following example performs a hyper-parameter search across different combinations of batch size (batch_size) and learning rate (lr). You can pass the parameters in as a list or specify a search range for each parameter. Snark Hyper automatically starts a cloud instance for each sampled parameter combination and runs the hyper-parameter search in parallel.

version: 1
experiments:
  mnist_search:
    image: pytorch/pytorch:latest
    parameters:
      github: https://github.com/pytorch/examples
      batch_size: [32, 64, 128, 256]
      lr: "0.01-0.09"
    hardware:
      gpu: k80
    sampling: 'random'
    samples: 8
    workers: 4
    command:
      - git clone {{github}} && cd examples/mnist
      - python main.py --batch-size {{batch_size}} --lr {{lr}}

The {{parameters}} in the command line are replaced with a sampled combination of parameters during execution of the program. Each parameter can be a single number, a string, a categorical list of values (e.g. ["cnn", "rnn"]), or a uniform continuous range (e.g. "0.001-0.1"). We can then define the sampling method, the number of samples, and the number of concurrent workers.
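To make the sampling and substitution semantics concrete, here is a minimal sketch in plain Python of how random sampling over such a parameter space could work. The parse_spec and sample_command helpers are purely illustrative assumptions, not part of the Snark CLI:

```python
import random

def parse_spec(spec):
    # Lists are categorical choices; "lo-hi" strings are uniform
    # continuous ranges; everything else (e.g. a URL) is a fixed value.
    if isinstance(spec, list):
        return lambda: random.choice(spec)
    if isinstance(spec, str):
        parts = spec.split("-")
        if len(parts) == 2:
            try:
                lo, hi = float(parts[0]), float(parts[1])
                return lambda: random.uniform(lo, hi)
            except ValueError:
                pass
    return lambda: spec

def sample_command(template, parameters):
    # Draw one value per parameter and substitute it for {{name}}.
    combo = {name: parse_spec(spec)() for name, spec in parameters.items()}
    cmd = template
    for name, value in combo.items():
        cmd = cmd.replace("{{%s}}" % name, str(value))
    return cmd, combo

parameters = {"batch_size": [32, 64, 128, 256], "lr": "0.01-0.09"}
cmd, combo = sample_command(
    "python main.py --batch-size {{batch_size}} --lr {{lr}}", parameters
)
print(cmd)  # e.g. python main.py --batch-size 64 --lr 0.0371...
```

Running this eight times with four concurrent workers would mirror the samples: 8, workers: 4 settings above.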

What if we want to change the GPU type or use more than a single GPU? Currently we support distributed training using multiple K80s and V100s: you can specify 1/8/16 K80s or 1/4/8 V100s. The mnist example below does not perform distributed training itself, but you can swap in your own Docker image that supports single-instance multi-GPU training.

version: 1
experiments:
  mnist:
    image: pytorch/pytorch:latest
    hardware:
      gpu: V100
      gpu_count: 4
    command:
      - git clone https://github.com/pytorch/examples
      - cd examples/mnist
      - python main.py
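If your image does support it, single-instance multi-GPU data parallelism in PyTorch can be as simple as wrapping the model. The snippet below is a generic sketch with a toy stand-in model, not part of the mnist example above; it falls back to plain CPU execution when fewer than two GPUs are visible:

```python
import torch
import torch.nn as nn

# Toy classifier standing in for the real model.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

# nn.DataParallel splits each input batch across all visible GPUs and
# gathers the outputs; with zero or one GPU we skip the wrapper entirely.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(64, 784, device=device)  # one batch of flattened images
out = model(x)
print(out.shape)  # torch.Size([64, 10])
```

On a 4x V100 instance like the one configured above, each forward pass would process a 16-image slice of the batch per GPU.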

Please be aware of the costs before starting the experiment above. How to ensure the most cost-efficient use of hardware resources on the cloud will be the topic of another post.

We have presented a foundation for training deep learning models on the cloud, with distributed training and hyper-parameter search, without worrying much about infrastructure. Every abstraction comes with a flexibility tradeoff; our aim is to keep users in the loop by automating the non-essential. We want to provide the experience of having thousands of GPUs under your laptop.

In upcoming posts, we will further dive into hyper-parameter search methods and distributed training using various setups and frameworks.

Bells and whistles are coming soon…