# Distributed Scikit Learn in SageMaker with Magic

In [1]:
%%sklearn?

Docstring:
::

 %sklearn [--estimator_name ESTIMATOR_NAME] [--entry_point ENTRY_POINT]
 [--source_dir SOURCE_DIR] [--role ROLE]
 [--framework_version FRAMEWORK_VERSION]
 [--py_version PY_VERSION] [--instance_type INSTANCE_TYPE]
 [--instance_count INSTANCE_COUNT] [--output_path OUTPUT_PATH]
 [--hyperparameters FOO:1,BAR:0.555,BAZ:ABC | 'FOO : 1, BAR : 0.555, BAZ : ABC']
 [--channel_training CHANNEL_TRAINING]
 [--channel_testing CHANNEL_TESTING]
 [--use_spot_instances [USE_SPOT_INSTANCES]]
 [--max_wait MAX_WAIT]
 [--enable_sagemaker_metrics [ENABLE_SAGEMAKER_METRICS]]
 [--metric_definitions ['Name: loss, Regex: Loss = .*?);' ['Name: loss, Regex: Loss = (.*?;' ...]]]
 [--name_contains NAME_CONTAINS] [--max_result MAX_RESULT]
 {submit,list,status,logs,delete,show_defaults}

SKLearn magic command.

methods:
 {submit,list,status,logs,delete,show_defaults}

submit:
 --estimator_name ESTIMATOR_NAME
 estimator shell variable name
 --entry_point ENTRY_POINT
 notebook local code file
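
The usage text lists two accepted spellings for `--hyperparameters`: a compact `FOO:1,BAR:0.555,BAZ:ABC` form and a quoted form with spaces. The magic's own parsing code is not shown in this notebook, so the following is only a minimal sketch of how such a string could be turned into a dictionary. `parse_hyperparameters` is a hypothetical helper, and keeping every value as a string is an assumption chosen to match the `"max_leaf_nodes": "30"` entry in the submit output.

```python
def parse_hyperparameters(text):
    """Parse 'FOO:1,BAR:0.555' or 'FOO : 1, BAR : 0.555' into a dict of strings.

    Hypothetical helper for illustration; not the magic's actual implementation.
    """
    result = {}
    for pair in text.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError("expected KEY:VALUE, got {!r}".format(pair))
        result[key.strip()] = value.strip()
    return result


compact = parse_hyperparameters("FOO:1,BAR:0.555,BAZ:ABC")
spaced = parse_hyperparameters("FOO : 1, BAR : 0.555, BAZ : ABC")
print(compact)
print(compact == spaced)
```

Both spellings yield the same dictionary under this sketch, so the choice between them is only a matter of shell quoting.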
 

In [2]:
%sklearn show_defaults

defaults:
 {
 "framework_version": "0.23-1",
 "instance_count": 1,
 "instance_type": "ml.c4.xlarge",
 "max_wait": 86400,
 "py_version": "py3",
 "use_spot_instances": true
}
null
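
The submit output later in this notebook contains these defaults alongside the options given on the command line, which suggests that explicit options are layered on top of the defaults. The sketch below illustrates that layering with plain dictionaries. `effective_settings` is a hypothetical helper, the override values are made up for the example, and none of this is taken from the magic's source.

```python
# Defaults as printed by `%sklearn show_defaults` in this notebook.
DEFAULTS = {
    "framework_version": "0.23-1",
    "instance_count": 1,
    "instance_type": "ml.c4.xlarge",
    "max_wait": 86400,
    "py_version": "py3",
    "use_spot_instances": True,
}


def effective_settings(overrides):
    """Return a copy of DEFAULTS updated with the options that were actually supplied.

    Hypothetical helper: options left as None are treated as "not given".
    """
    settings = dict(DEFAULTS)
    settings.update({k: v for k, v in overrides.items() if v is not None})
    return settings


# Illustrative overrides: one explicit option and one that was not supplied.
settings = effective_settings({"instance_type": "ml.m5.large", "instance_count": None})
print(settings["instance_type"], settings["instance_count"])
```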


## Upload the data for training 

When training large models on large volumes of data, you'll typically use big data tools such as Amazon Athena, AWS Glue, or Amazon EMR to prepare your data in S3. For this example, we use the classic [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which is included with Scikit-learn. We load the dataset, write it to a local CSV file, and then upload that file to S3 for training.


In [2]:
import numpy as np
import os
from sklearn import datasets

# Load Iris dataset, then join labels and features
iris = datasets.load_iris()
joined_iris = np.insert(iris.data, 0, iris.target, axis=1)

# Create directory and write csv
os.makedirs('./data', exist_ok=True)
np.savetxt('./data/iris.csv', joined_iris, delimiter=',', fmt='%1.1f, %1.3f, %1.3f, %1.3f, %1.3f')
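
Because `np.insert` places `iris.target` at column 0, each CSV row starts with the class label followed by the four features, and the `fmt` string separates fields with a comma and a space. The sketch below parses two illustrative rows written in that layout, using only the standard library. The sample rows are typed in by hand for the example rather than read from `./data/iris.csv`.

```python
import csv
import io

# Two illustrative rows in the layout produced by the fmt string above:
# label first, then the four feature values.
sample = (
    "0.0, 5.100, 3.500, 1.400, 0.200\n"
    "1.0, 7.000, 3.200, 4.700, 1.400\n"
)

labels = []
features = []
# skipinitialspace=True drops the space that follows each comma.
for row in csv.reader(io.StringIO(sample), skipinitialspace=True):
    values = [float(field) for field in row]
    labels.append(int(values[0]))   # column 0 holds the class label
    features.append(values[1:])     # remaining columns hold the features

print(labels)
print(features[0])
```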

Let's create our SageMaker session and an S3 prefix to use for this notebook example.
Once we have the data locally, we can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket.

In [3]:
# S3 prefix
prefix = 'Scikit-iris'

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

WORK_DIRECTORY = 'data'

train_input = sagemaker_session.upload_data(WORK_DIRECTORY, key_prefix="{}/{}".format(prefix, WORK_DIRECTORY) )
print(train_input)

s3://sagemaker-eu-west-1-245582572290/Scikit-iris/data
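
The printed URI combines the session's default bucket with the key prefix built from `prefix` and `WORK_DIRECTORY`. If you need the two parts separately, for example to inspect the uploaded objects, a small helper can split the URI. `split_s3_uri` below is a hypothetical name used for this sketch and is not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# URI copied from the output above.
bucket, key_prefix = split_s3_uri("s3://sagemaker-eu-west-1-245582572290/Scikit-iris/data")
print(bucket)
print(key_prefix)

# The key prefix matches the value passed as key_prefix to upload_data.
print(key_prefix == "{}/{}".format("Scikit-iris", "data"))
```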


### Write the Scikit Learn script

The source for a training script is in the cell below. The cell uses the `%%sklearn submit` directive to submit the Python application in the cell to a Scikit-learn Estimator.

In [8]:
%%sklearn submit --channel_training s3://sagemaker-eu-west-1-245582572290/Scikit-iris/data --channel_testing s3://sagemaker-eu-west-1-245582572290/Scikit-iris/data --hyperparameters 'max_leaf_nodes: 30' 

# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
# 
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
# 
# http://www.apache.org/licenses/LICENSE-2.0
# 
# or in the "license" file accompanying this file. This file is distributed 
# on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either 
# express or implied. See the License for the specific language governing 
# permissions and limitations under the License.

from __future__ import print_function

import argparse
import joblib
import os
import pandas as pd

from sklearn import tree


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here. This simple example includes just one.
    # The default is None (no limit), because scikit-learn rejects -1 for max_leaf_nodes.
    parser.add_argument('--max_leaf_nodes', type=int, default=None)

    # SageMaker-specific arguments. Defaults are read from environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
    parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TESTING'])

    args = parser.parse_args()

    # Take the set of files and read them all into a single pandas DataFrame
    input_files = [os.path.join(args.train, file) for file in os.listdir(args.train)]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [pd.read_csv(file, header=None, engine="python") for file in input_files]
    train_data = pd.concat(raw_data)

    # Labels are in the first column
    train_y = train_data.iloc[:, 0]
    train_X = train_data.iloc[:, 1:]

    # Here we support a single hyperparameter, 'max_leaf_nodes'. Note that you can add as many
    # as your training may require in the ArgumentParser above.
    max_leaf_nodes = args.max_leaf_nodes

    # Now use scikit-learn's decision tree classifier to train the model.
    clf = tree.DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
    clf = clf.fit(train_X, train_y)

    # Serialize the fitted classifier into the model directory
    joblib.dump(clf, os.path.join(args.model_dir, "model.joblib"))


def model_fn(model_dir):
    """Deserialize and return the fitted model.

    The file name must match the one used when serializing the model in the main block.
    """
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


Couldn't call 'get_role' to get Role ARN from role name workshop-sagemaker to get Role path.


submit:
 {
 "channel_testing": "s3://sagemaker-eu-west-1-245582572290/Scikit-iris/data",
 "channel_training": "s3://sagemaker-eu-west-1-245582572290/Scikit-iris/data",
 "entry_point": "/tmp/tmp-ab309e54-b484-48ab-89eb-2850dcc5b4fc.py",
 "estimator_name": "___SKLearn_estimator",
 "framework_version": "0.23-1",
 "hyperparameters": {
 "max_leaf_nodes": "30"
 },
 "instance_count": 1,
 "instance_type": "ml.c4.xlarge",
 "max_result": 10,
 "max_wait": 86400,
 "name_contains": "sklearn",
 "py_version": "py3",
 "role": "arn:aws:iam::245582572290:role/workshop-sagemaker",
 "use_spot_instances": true
}
{
 "___SKLearn_latest_job_name": "sagemaker-scikit-learn-2021-03-02-15-04-35-077",
 "estimator_variable": "___SKLearn_estimator"
}
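
The training script receives its hyperparameters as command-line arguments and takes its directories from `SM_*` environment variables, with each channel name mapped to `SM_CHANNEL_<NAME>`. A minimal sketch of checking that argument handling outside SageMaker follows. `build_parser` and `fake_env` are hypothetical names, and the `/opt/ml/...` paths are the locations conventionally used inside SageMaker training containers, assumed here purely for illustration.

```python
import argparse


def build_parser(environ):
    """Build a parser like the script's, reading defaults from a mapping.

    Hypothetical helper: accepting the environment as a parameter makes the
    argument handling testable without touching os.environ.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_leaf_nodes", type=int, default=None)
    parser.add_argument("--model-dir", type=str, default=environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=environ.get("SM_CHANNEL_TRAINING"))
    return parser


# Assumed container paths, used only for this illustration.
fake_env = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_CHANNEL_TRAINING": "/opt/ml/input/data/training",
}

# The hyperparameter arrives as a string argument and argparse converts it to int.
args = build_parser(fake_env).parse_args(["--max_leaf_nodes", "30"])
print(args.max_leaf_nodes, args.model_dir, args.train)
```

Passing the environment in as a dictionary is a design choice for the sketch only; the script in the cell reads `os.environ` directly.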


## Stop the latest training job

In [5]:
%sklearn delete

{
 "AlgorithmSpecification": {
 "EnableSageMakerMetricsTimeSeries": false,
 "TrainingImage": "141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3",
 "TrainingInputMode": "File"
 },
 "CreationTime": "2021-03-02 14:58:22.381000+00:00",
 "DebugHookConfig": {
 "CollectionConfigurations": [],
 "S3OutputPath": "s3://sagemaker-eu-west-1-245582572290/"
 },
 "EnableInterContainerTrafficEncryption": false,
 "EnableManagedSpotTraining": true,
 "EnableNetworkIsolation": false,
 "HyperParameters": {
 "max_leaf_nodes": "\"30\"",
 "sagemaker_container_log_level": "20",
 "sagemaker_job_name": "\"sagemaker-scikit-learn-2021-03-02-14-58-21-806\"",
 "sagemaker_program": "\"tmp-d1536a9b-a65a-4c82-a0a5-f8b43bdf7775.py\"",
 "sagemaker_region": "\"eu-west-1\"",
 "sagemaker_submit_directory": "\"s3://sagemaker-eu-west-1-245582572290/sagemaker-scikit-learn-2021-03-02-14-58-21-806/source/sourcedir.tar.gz\""
 },
 "InputDataConfig": [
 {
 "ChannelName": "training",
 "CompressionType

## Describe the latest training job

In [9]:
%sklearn status

{
 "AlgorithmSpecification": {
 "EnableSageMakerMetricsTimeSeries": false,
 "TrainingImage": "141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3",
 "TrainingInputMode": "File"
 },
 "CreationTime": "2021-03-02 14:43:54.283000+00:00",
 "DebugHookConfig": {
 "CollectionConfigurations": [],
 "S3OutputPath": "s3://sagemaker-eu-west-1-245582572290/"
 },
 "EnableInterContainerTrafficEncryption": false,
 "EnableManagedSpotTraining": true,
 "EnableNetworkIsolation": false,
 "HyperParameters": {
 "max_leaf_nodes": "\"30\"",
 "sagemaker_container_log_level": "20",
 "sagemaker_job_name": "\"sagemaker-scikit-learn-2021-03-02-14-43-53-685\"",
 "sagemaker_program": "\"tmp-9d940cb7-23ce-418d-b401-4783563fd641.py\"",
 "sagemaker_region": "\"eu-west-1\"",
 "sagemaker_submit_directory": "\"s3://sagemaker-eu-west-1-245582572290/sagemaker-scikit-learn-2021-03-02-14-43-53-685/source/sourcedir.tar.gz\""
 },
 "InputDataConfig": [
 {
 "ChannelName": "training",
 "CompressionType

## Show logs for the latest training job

In [16]:
%sklearn logs

2021-03-02 14:47:11 Starting - Preparing the instances for training
2021-03-02 14:47:11 Downloading - Downloading input data
2021-03-02 14:47:11 Training - Training image download completed. Training in progress.
2021-03-02 14:47:11 Uploading - Uploading generated training model
2021-03-02 14:47:11 Completed - Training job completed
2021-03-02 14:46:58,954 sagemaker-containers INFO Imported framework sagemaker_sklearn_container.training
2021-03-02 14:46:58,957 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2021-03-02 14:46:58,966 sagemaker_sklearn_container.training INFO Invoking user training script.
2021-03-02 14:46:59,380 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2021-03-02 14:46:59,392 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2021-03-02 14:46:59,403 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
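
The first lines of the log output follow a `timestamp Phase - message` pattern that traces the job through its lifecycle. A minimal sketch of extracting those phases with a regular expression is shown below. The pattern is inferred from the lines in this notebook, not from a documented log format, so treat it as an assumption.

```python
import re

# Pattern inferred from the status lines above: "<date> <time> <Phase> - <message>".
STATUS_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) - (.*)$")

log_header = """\
2021-03-02 14:47:11 Starting - Preparing the instances for training
2021-03-02 14:47:11 Downloading - Downloading input data
2021-03-02 14:47:11 Training - Training image download completed. Training in progress.
2021-03-02 14:47:11 Uploading - Uploading generated training model
2021-03-02 14:47:11 Completed - Training job completed
"""

phases = []
for line in log_header.splitlines():
    match = STATUS_LINE.match(line)
    if match:
        phases.append(match.group(2))  # group 2 is the phase name

print(phases)
```

Container log lines use a comma-separated millisecond timestamp and no ` - ` separator, so this pattern skips them.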

## List training jobs

In [13]:
%sklearn list --name_contains scikit-learn

{
 "NextToken": "cIws2QhTXUIa8bi8X9aU7gCAR0Xdc3x9L/Ofg4vsVMTtcNqRqLcpBqE42+cDc29SYTx/csxWdtykePlH94xR4wYEkuOg2mnj9cwdVGP+NpZB72ngF4Jrl7LpzDZ0PJsqFrzBt+u31JpjCNRKIu0shJdmwQFN4q1dZVv7nlEQknn0Lq3OrzlpFDQiE18Oj1AJHnCfuBJZ7zOQ3sfYAfAdqYagLXLCDWuTdoa/hB7wYjqq3FrkadY0KvMW7eH0wtpZJayqKkHWdIyHy67DU4AmQJv8Gwc6g8JNdVWRblepG2ilsMjH/0avxXpG9AcUNA+GmZhYy57avBvkY5SgrCn6XoAyNX93/BeuXP7xKqEquvzZTiKq7vUo0p5kmdw1hhty1YLrJq3yMRVfoRqUsEh5wgMthHc9Nu5KmKdfgkL5XQFJUV3AwTw9ZTTZJbHJHSlmN2GESIAgBwwSmddHazfmhGUhL2ETbeAZ2FWN8vgWg94TN8cX++UfNBqWE7pZu+4LcXqHkd9SID/Jk95ZEzHBhapfc8O3+JZSZKOkE21Dm+Gs2YVOLrg6689YxbUCLnZtF4I8PwAdjp5pjUbZJQ==",
 "ResponseMetadata": {
 "HTTPHeaders": {
 "content-length": "1606",
 "content-type": "application/x-amz-json-1.1",
 "date": "Tue, 02 Mar 2021 15:18:50 GMT",
 "x-amzn-requestid": "2dd07e22-9075-4295-8e0a-a55f6ee249b0"
 },
 "HTTPStatusCode": 200,
 "RequestId": "2dd07e22-9075-4295-8e0a-a55f6ee249b0",
 "RetryAttempts": 0
 },
 "TrainingJobSummaries": [
 {
 "CreationTime": "2021-03-02 15

## Use estimator variable
### Deploy the model 

In [11]:
# Deploy my estimator to a SageMaker Endpoint and get a Predictor
predictor = ___SKLearn_estimator.deploy(instance_type='ml.m4.xlarge',
 initial_instance_count=1)


---------------!

### Choose some data and use it for a prediction 

In order to do some predictions, we'll extract some of the data we used for training and do predictions against it. This is, of course, bad statistical practice, but a good way to see how the mechanism works.

In [14]:
import itertools
import pandas as pd

shape = pd.read_csv("data/iris.csv", header=None)

a = [50*i for i in range(3)]
b = [40+i for i in range(10)]
indices = [i+j for i,j in itertools.product(a,b)]

test_data = shape.iloc[indices[:-1]]
test_X = test_data.iloc[:,1:]
test_y = test_data.iloc[:,0]
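
The index arithmetic in the cell above is easier to follow when worked through on its own. The Iris rows are ordered by class, 50 per class, so offsets 40 to 49 within each block of 50 pick ten rows per class, and `indices[:-1]` then drops the final row. The sketch below repeats only that arithmetic, without loading any data.

```python
import itertools

a = [50 * i for i in range(3)]   # start of each class block: 0, 50, 100
b = [40 + i for i in range(10)]  # offsets 40..49 inside a block
indices = [i + j for i, j in itertools.product(a, b)]
selected = indices[:-1]          # drop the last index, as the cell above does

# Count how many selected rows fall into each block of 50.
per_class = [sum(1 for idx in selected if idx // 50 == c) for c in range(3)]

print(len(indices), len(selected))
print(selected[0], selected[-1])
print(per_class)
```

This is why the prediction output contains 29 values rather than 30.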

Prediction is as easy as calling `predict` on the predictor we got back from `deploy`, passing the data we want predictions for. The endpoint returns a numerical representation of the predicted class; in the original dataset the classes are flower names, but in this example the labels are numerical. We can compare the predictions against the original labels that we parsed.

In [15]:
print(predictor.predict(test_X.values))
print(test_y.values)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2.
 2. 2. 2. 2. 2.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2.
 2. 2. 2. 2. 2.]
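
Rather than comparing the two arrays by eye, you can count the matches. The sketch below does this with plain lists that reproduce the printed values above; in the notebook itself you would use the arrays returned by `predictor.predict` and `test_y.values` instead of these hand-copied lists.

```python
# Values copied from the two printed arrays above.
predicted = [0.0] * 10 + [1.0] * 10 + [2.0] * 9
expected = [0.0] * 10 + [1.0] * 10 + [2.0] * 9

matches = sum(1 for p, e in zip(predicted, expected) if p == e)
accuracy = matches / len(expected)

print("{} of {} predictions match".format(matches, len(expected)))
print("accuracy: {:.3f}".format(accuracy))
```

Because these rows were also part of the training data, a perfect score here says little about how the model generalizes.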


### Endpoint cleanup 

When you're done with the endpoint, you'll want to clean it up.

In [16]:
predictor.delete_endpoint()