How to effortlessly train sklearn 📊, pytorch 🔥, and transformers 🤗 models in the cloud

SageMaker is a Machine Learning Operations (MLOps) platform, offered by AWS, that provides tools for developing machine learning models, ranging from no-code solutions to completely custom ones. With SageMaker, you can label data, train your own models in the cloud with hyperparameter optimization, and then deploy those models easily behind a cloud-hosted API. In this series of posts we will explore SageMaker’s services and provide guides on how to use them, along with code examples. In this first post we will cover training models with popular frameworks such as sklearn, pytorch and transformers, for which SageMaker provides pre-configured containers.

We will be working with a minimal project structure:
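Something along these lines; the folder names beyond src and scripts, and the file names inside them, are my own illustrative choices:

```
.
├── data/                     # local copy of the training data
├── models/                   # trained model artifacts
├── scripts/
│   └── train_sagemaker.py    # launches the SageMaker job
├── src/
│   └── train.py              # the actual training code
└── tests/
```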

Most of the code for training models, preprocessing data, or making predictions should live in src and be well tested. The scripts folder is reserved for automating tasks that might otherwise be done manually with a few commands or lines of code. We will also use the scripts folder to store the scripts necessary to run our code on SageMaker.

The main library we need is sagemaker (which can be installed from pip as usual), but I am also going to be using typer to add command line interface (CLI) functionality to my scripts. Other than those, I will only use tqdm for progress bars, black for formatting, and pytest for tests. All this is on top of the framework you want to experiment with, which in my case includes pytorch, transformers, and sklearn.
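If you want to follow along, installing everything in one go might look like this (versions omitted; pin them as you see fit):

```bash
pip install sagemaker typer tqdm black pytest
```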

The data I will be using comes from a sentiment classification task on Kaggle. The data comes in CSV format and contains two columns, text and label. The label is either 0 or 1.
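To make the format concrete, a couple of made-up rows might look like this:

```
text,label
"I absolutely loved this!",1
"What a waste of time.",0
```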

Below is a simple training script in sklearn that trains a TF-IDF + SVM pipeline on that data.
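Here is a minimal sketch of what such a script can look like. The column names (text, label) match the dataset; the file name train.csv, the pipeline settings, and the fallback paths are assumptions on my part:

```python
# src/train.py: a minimal sketch of the training script.
import os
from pathlib import Path

import joblib
import pandas as pd
import typer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def main(
    # Inside a SageMaker container these environment variables point at the
    # mounted training data and at the directory whose contents get uploaded
    # as the model artifact; outside SageMaker we fall back to local folders.
    data_path: Path = Path(os.environ.get("SM_CHANNEL_TRAIN", "data")),
    model_path: Path = Path(os.environ.get("SM_MODEL_DIR", "models")),
):
    df = pd.read_csv(data_path / "train.csv")  # assumed file name

    # TF-IDF features fed into a linear SVM classifier.
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
    pipeline.fit(df["text"], df["label"])

    model_path.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipeline, model_path / "model.joblib")


if __name__ == "__main__":
    typer.run(main)
```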

The script is vanilla sklearn. The only alterations for SageMaker are the default values of data_path and model_path. We can use this script outside SageMaker without any problem by passing paths to our data and models folders. Let’s run it:
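With the defaults falling back to local folders, a local run is just (assuming the layout sketched above):

```bash
python src/train.py --data-path data --model-path models
```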

Before we explain those default values for the data and model paths, let’s also write the script that will trigger the SageMaker job.
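Here is a sketch of such a launcher, using the SKLearn estimator from the sagemaker SDK; the framework version, the SAGEMAKER_ROLE variable name, and the file layout are assumptions:

```python
# scripts/train_sagemaker.py: a minimal sketch of the launcher.
import typer
from sagemaker.sklearn.estimator import SKLearn


def main(
    data_path: str = "file://data",
    model_path: str = "file://models",
    instance_type: str = "local",
    # Passed on the command line, or read from an environment variable.
    role: str = typer.Option(..., envvar="SAGEMAKER_ROLE"),
):
    # Uses AWS's pre-built sklearn container; entry_point runs inside it.
    estimator = SKLearn(
        entry_point="train.py",
        source_dir="src",
        role=role,
        instance_count=1,
        instance_type=instance_type,
        framework_version="1.2-1",  # illustrative container version
        output_path=model_path,
    )
    # The "train" channel is mounted at SM_CHANNEL_TRAIN in the container.
    estimator.fit({"train": data_path})


if __name__ == "__main__":
    typer.run(main)
```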

That’s it. We can now run this script by passing paths to our local data and models folders, and it will download the container locally and run the training inside it.
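For example, assuming the launcher lives at scripts/train_sagemaker.py:

```bash
python scripts/train_sagemaker.py \
    --data-path file://data \
    --model-path file://models \
    --instance-type local
```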

If we now change the instance type to one of the instances SageMaker supports (complete list here) and pass s3 paths for our data and models folders, the training will happen in the cloud.
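With a hypothetical bucket, that could look like:

```bash
python scripts/train_sagemaker.py \
    --data-path s3://my-bucket/sentiment/data \
    --model-path s3://my-bucket/sentiment/models \
    --instance-type ml.m5.large
```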

One thing to note is that the data and model paths need to be prefixed with either file:// or s3://, depending on whether you want to read from a local directory or from s3.

In order for SageMaker to work, you need a role with the appropriate permissions. You can read more on how to create one [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). There is also a terraform manifest that my co-founder Matt has written, which might help with the setup. It is important to ensure the role has permission to read and write in s3, in order to read the data and write the model. In our script we pass the role via a command line argument which, by default, reads it from an environment variable.
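Exporting the role before running the launcher might look like this; the variable name and the ARN are placeholders:

```bash
export SAGEMAKER_ROLE=arn:aws:iam::123456789012:role/my-sagemaker-role
python scripts/train_sagemaker.py --instance-type local
```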