# SageMaker training jobs using Snowpark Python API (SageMaker Studio)

In this notebook, we'll show how to package a simple Python example that trains a scikit-learn classification model for machine predictive maintenance.

Getting Started with Snowpark for Machine Learning on SageMaker:
 - https://quickstarts.snowflake.com/guide/getting_started_with_snowpark_for_machine_learning_on_sagemaker/index.html
 - https://github.com/Snowflake-Labs/sfguide-getting-started-snowpark-python-sagemaker
 
To store the database access credentials securely, we strongly recommend using AWS Secrets Manager for Snowflake connections:
 - https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html 
 - https://aws.amazon.com/blogs/big-data/simplify-snowflake-data-loading-and-processing-with-aws-glue-databrew/

## Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker

### An overview of Docker

If you're familiar with Docker already, you can skip ahead to the next section.

For many data scientists, Docker containers are a new concept, but they are not difficult, as you'll see here. 

Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine, except that the container creates a fully self-contained environment for the program to run in. Containers are isolated from each other and from the host environment, so the way you set up your program is the way it runs, no matter where you run it.

Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands, environment variables, etc.

In some ways, a Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run on the same physical machine or virtual machine instance.

Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. We'll see an example of that below. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.

Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run. It is the underpinning of many services built in the past few years, such as [Amazon ECS].

Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.

In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting. The following sections outline how to build containers for the SageMaker environment.

Some helpful links:

* [Docker home page](http://www.docker.com)
* [Getting started with Docker](https://docs.docker.com/get-started/)
* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
* [`docker run` reference](https://docs.docker.com/engine/reference/run/)

[Amazon ECS]: https://aws.amazon.com/ecs/

### How Amazon SageMaker runs your Docker container

Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container:

* In the example here, we don't define an `ENTRYPOINT` in the Dockerfile so Docker will run the command `train` at training time and `serve` at serving time. In this example, we define these as executable Python scripts, but they could be any program that we want to start in that environment.
* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program will be run at startup and its first argument will be `train` or `serve`. The program can then look at that argument and decide what to do.
* If you are building separate containers for training and hosting (or building only for one or the other), you can define a program as an `ENTRYPOINT` in the Dockerfile and ignore (or verify) the first argument passed in. 
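
The second option above can be sketched as follows. This is a minimal, hypothetical entry-point program, not the script shipped with this example; `run_training` and `run_serving` are placeholder names that stand in for the real work.

```python
def run_training():
    # Placeholder: a real container would fit and save a model here.
    return "training"


def run_serving():
    # Placeholder: a real container would start an inference server here.
    return "serving"


def main(argv):
    """Choose the action from the first argument passed to the container."""
    handlers = {"train": run_training, "serve": run_serving}
    if len(argv) < 2 or argv[1] not in handlers:
        raise SystemExit("expected 'train' or 'serve' as the first argument")
    return handlers[argv[1]]()
```

In a real `ENTRYPOINT` program this would be called as `main(sys.argv)`, so the same image can serve both purposes.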

#### Running your container during training

When Amazon SageMaker runs training, your training script runs just like a regular Python program. A number of files are laid out for your use, under the `/opt/ml` directory:

    /opt/ml
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure

##### The input

* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since scikit-learn doesn't support distributed training, we'll ignore it here.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to `CreateTrainingJob`, and they should match what the algorithm expects. The files for each channel are copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure.
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
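
As a rough illustration of the input layout, the sketch below reads `hyperparameters.json` and lists the files of one File-mode channel. The helper names and the `prefix` parameter are assumptions made for this example. In this notebook the training data comes from Snowflake rather than from S3 channels, so the channel helper is shown only for completeness.

```python
import json
from pathlib import Path


def load_hyperparameters(prefix="/opt/ml"):
    """Read hyperparameters.json; every value arrives as a string."""
    path = Path(prefix) / "input" / "config" / "hyperparameters.json"
    with path.open() as f:
        return json.load(f)


def as_int(params, name, default):
    """Convert one string-valued hyperparameter to int, with a fallback."""
    return int(params.get(name, default))


def list_channel_files(channel, prefix="/opt/ml"):
    """Return the files found under one File-mode channel, sorted by path."""
    channel_dir = Path(prefix) / "input" / "data" / channel
    return sorted(p for p in channel_dir.rglob("*") if p.is_file())
```

Passing a different `prefix` makes it possible to try the helpers against a local copy of the directory tree before building the container.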

##### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it will be ignored.
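
The output conventions can be sketched like this. The function name `run_and_report`, the file name `model.pkl`, and the use of `pickle` are assumptions for illustration; the training script in this example may save its model differently.

```python
import pickle
import traceback
from pathlib import Path


def run_and_report(train_fn, prefix="/opt/ml"):
    """Run train_fn and save its result under model/, or write output/failure."""
    model_dir = Path(prefix) / "model"
    output_dir = Path(prefix) / "output"
    try:
        model = train_fn()
        model_dir.mkdir(parents=True, exist_ok=True)
        with (model_dir / "model.pkl").open("wb") as f:
            pickle.dump(model, f)
        return 0
    except Exception as exc:
        # The contents of this file surface in the FailureReason field.
        output_dir.mkdir(parents=True, exist_ok=True)
        (output_dir / "failure").write_text(
            f"Exception during training: {exc}\n{traceback.format_exc()}"
        )
        # The caller should exit with this non-zero code to signal failure.
        return 255
```

A training script would typically finish with `sys.exit(run_and_report(train))`, where `train` is whatever function builds the model.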


### The Dockerfile

The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter than a full operating system, however, because it takes advantage of Linux on the host machine for the basic operations.

For the Python science stack, we will start from a standard Ubuntu installation and run the normal tools to install the things needed by scikit-learn. Finally, we add the code that implements our specific algorithm to the container and set up the right environment to run under.

Along the way, we clean up extra space. This makes the container smaller and faster to start.

Let's look at the Dockerfile for the example:

In [None]:
!cat container/Dockerfile

In [None]:
!cat container/requirements.txt

### Building and registering the container

Let's install [SageMaker Docker Build](https://github.com/aws-samples/sagemaker-studio-image-build-cli), a CLI for building Docker images in SageMaker Studio using AWS CodeBuild.

In [None]:
!pip install sagemaker-studio-image-build -q

The following shell code shows how to build the container image and push the container image to ECR using `sm-docker build`.

This code looks for an ECR repository in the account you're using and in the region given by the `region` environment variable, falling back to `us-east-1` when that variable is unset. If the repository doesn't exist, the script will create it.

This builds a Docker image of 147 MB.
When the commands below have finished executing, you should see something like this:

`Image URI: <account>.dkr.ecr.<region>.amazonaws.com/sagemaker-scikit-learn-snowpark:latest`

In [None]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-scikit-learn-snowpark

cd container

region=${region:-us-east-1}

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --region "${region}" --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
 aws ecr create-repository --region "${region}" --repository-name "${algorithm_name}" > /dev/null
fi

sm-docker build . --repository "${algorithm_name}:latest"

## Part 2: Using your Algorithm in Amazon SageMaker

Once you have your container packaged, you can use it to train models and use the model for hosting or batch transforms. Let's do that with the algorithm we made above.

## Define IAM role



In [None]:
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator

role = get_execution_role()

## Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [None]:
import sagemaker

session = sagemaker.Session()

## Create an estimator and fit the model

In order to use SageMaker to fit our algorithm, we'll create an `Estimator` that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training:

* The __container name__. This is constructed as in the shell commands above.
* The __role__. As defined above.
* The __instance count__ which is the number of machines to use for training.
* The __instance type__ which is the type of machine to use for training.
* The __session__ is the SageMaker session object that we defined above.

Then we call `fit()` on the estimator to train against the data that will be fetched from Snowflake using the Snowpark Python API.

In [None]:
account = session.boto_session.client("sts").get_caller_identity()["Account"]
region = session.boto_session.region_name
image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-scikit-learn-snowpark:latest".format(account, region)

print(image)

### Set `secret-name` and `region-name`

We need to pass the secret name and region in order to fetch the Snowflake credentials from AWS Secrets Manager.

In [None]:
hyperparameters={
 "secret-name": "dev/ml/snowflake",
 "region-name": "us-east-1"
}
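
To show how these two values might be used inside a training script, here is a minimal sketch that fetches the secret and parses it. It assumes the secret was stored as a JSON string of key/value pairs; the function name and the optional `client` parameter are hypothetical and exist only to make the sketch easy to try without AWS access.

```python
import json


def load_snowflake_credentials(secret_name, region_name, client=None):
    """Fetch a secret from AWS Secrets Manager and parse it as JSON."""
    if client is None:
        # Imported lazily so the parsing logic can be exercised offline.
        import boto3

        client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    # Assumption: the secret value is a JSON object of connection settings.
    return json.loads(response["SecretString"])
```

If the keys stored in the secret match Snowpark's connection parameter names, which depends on how the secret was created, the resulting dictionary could be passed to `Session.builder.configs(...)` to open a Snowpark session.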

In [None]:
estimator = Estimator(
 image_uri=image,
 entry_point='predictive_maintenance_classification.py',
 source_dir='code',
 role=role,
 instance_count=1,
 instance_type='ml.m5.large',
 hyperparameters=hyperparameters
)

In [None]:
estimator.fit()

## Downloading the model

SageMaker models are packaged as compressed tar files (`*.tar.gz`).

We will now download the `model.tar.gz` file from S3 and inspect its contents.

In [None]:
model_uri = estimator.model_data
model_uri

In [None]:
!aws s3 cp {model_uri} .

In [None]:
import tarfile

# List the files in the archive; the context manager closes it afterwards.
with tarfile.open('model.tar.gz', 'r:gz') as tarf:
    print(tarf.getnames())
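
To unpack the archive rather than only list it, something like the following sketch could be used. The function name and the `model` target directory are assumptions; the names of the extracted files depend on what the training script wrote to `/opt/ml/model`.

```python
import tarfile
from pathlib import Path


def extract_model(archive="model.tar.gz", target_dir="model"):
    """Unpack the archive into target_dir and return the extracted file paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Only extract archives you trust, such as the one produced by your own job.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=target)
    return sorted(p for p in target.rglob("*") if p.is_file())
```

Calling `extract_model()` in the notebook's working directory would place the model files under `./model`.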