## Amazon SageMaker Processing jobs

With Amazon SageMaker Processing jobs, you can leverage a simplified, managed experience to run data pre- or post-processing and model evaluation workloads on the Amazon SageMaker platform.

A processing job downloads input from Amazon Simple Storage Service (Amazon S3), then uploads outputs to Amazon S3 during or after the processing job.

This notebook shows how you can:

1. Run a processing job to run a scikit-learn script that cleans, pre-processes, performs feature engineering, and splits the input data into train and test sets.
2. Run a training job on the pre-processed training data to train a model
3. Run a processing job on the pre-processed test data to evaluate the trained model's performance
4. Use your own custom container to run processing jobs with your own Python libraries and dependencies.

The dataset used here is ...................

# Rapids Toy example on SM Processing

In [81]:
!mkdir docker

mkdir: cannot create directory ‘docker’: File exists


This is the Dockerfile to create the processing container. Install `pandas` and `scikit-learn` into it. You can install your own dependencies.

In [156]:
%%writefile docker/Dockerfile

FROM rapidsai/rapidsai

RUN conda install -c anaconda cudatoolkit
RUN conda install -c rapidsai -c nvidia -c conda-forge -c defaults rapids=0.13 python=3.6
RUN conda install -c conda-forge xgboost
RUN conda install -c anaconda pytest
RUN conda install scikit-learn xgboost
RUN conda install dask
RUN conda install -c conda-forge dask-xgboost
RUN conda install -c rapidsai dask-cudf

ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

Overwriting docker/Dockerfile


This block of code builds the container using the `docker` command, creates an Amazon Elastic Container Registry (Amazon ECR) repository, and pushes the image to Amazon ECR.

In [157]:
import boto3
import sagemaker
from sagemaker import get_execution_role

region = boto3.session.Session().region_name

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'miniconda-rapids-container-v2'
tag = ':latest'

uri_suffix = 'amazonaws.com'
processing_repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)

# Create ECR repository and push docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

Sending build context to Docker daemon  2.048kB
Step 1/11 : FROM rapidsai/rapidsai
 ---> 8429180a83ec
Step 2/11 : RUN conda install -c anaconda cudatoolkit
 ---> Using cache
 ---> 062f50c892c3
Step 3/11 : RUN conda install -c rapidsai -c nvidia -c conda-forge -c defaults rapids=0.13 python=3.6
 ---> Using cache
 ---> 7b5c698cd10c
Step 4/11 : RUN conda install -c conda-forge xgboost
 ---> Using cache
 ---> 55afcd0a0739
Step 5/11 : RUN conda install -c anaconda pytest
 ---> Using cache
 ---> 9eb953647c09
Step 6/11 : RUN conda install scikit-learn xgboost
 ---> Using cache
 ---> f3a7444c5467
Step 7/11 : RUN conda install dask
 ---> Running in 0b11b860ef61
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Removing intermediate container 0b11b860ef61
 ---> ada5888f5e5c
Step 8/11 : RUN conda install -c conda-forge dask-xgboost
 ---> Running in 48e835a377b2
Collecting package metadata (

# Testing if the container works with SM processing

The `ScriptProcessor` class lets you run a command inside this container, which you can use to run your own script.

In [94]:
import pandas as pd

input_data = 'https://github.com/plotly/datasets/raw/master/tips.csv'
df = pd.read_csv(input_data)
df.to_csv('tips.csv')

In [95]:
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


### Write toy example file

In [86]:
%%writefile preprocessing.py

import argparse
import os
import pandas as pd
import numpy as np
import cudf, io, requests
from io import StringIO



if __name__=='__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-test-split-ratio', type=float, default=0.3)
    args, _ = parser.parse_known_args()
    
    print('Received arguments {}'.format(args))

    input_data_path = os.path.join('/opt/ml/processing/input', 'tips.csv')
    
    print('Reading input data from {}'.format(input_data_path))
    
    tips_df = cudf.read_csv(input_data_path)
    tips_df['tip_percentage'] = tips_df['tip'] / tips_df['total_bill'] * 100

    # display average tip by dining party size
    print(tips_df.groupby('size').tip_percentage.mean())
    
    train_features_output_path = os.path.join('/opt/ml/processing/train', 'train_features.csv')
    
    print('Saving training features to {}'.format(train_features_output_path))
    tips_df.to_csv(train_features_output_path, header=False, index=False)

Overwriting preprocessing.py


In [226]:
from sagemaker.processing import ScriptProcessor
from sagemaker import get_execution_role
role = get_execution_role()
script_processor = ScriptProcessor(command=['python'],
                image_uri=processing_repository_uri,
                role=role,
                instance_count=1,
                instance_type='ml.p3.16xlarge')

Run the same `preprocessing.py` script you ran above, but now, this code is running inside of the Docker container you built in this notebook, not the scikit-learn image maintained by Amazon SageMaker. You can add the dependencies to the Docker image, and run your own pre-processing, feature-engineering, and model evaluation scripts inside of this container.

In [88]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

script_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                        source='tips.csv',
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train')],
                      arguments=['--train-test-split-ratio', '0.2']
                     )
script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)


Job Name:  miniconda-rapids-container-v2-2020-05-19-01-00-37-214
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-497456752804/miniconda-rapids-container-v2-2020-05-19-01-00-37-214/input/input-1/tips.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-497456752804/miniconda-rapids-container-v2-2020-05-19-01-00-37-214/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-497456752804/miniconda-rapids-container-v2-2020-05-19-01-00-37-214/output/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}]
...........

# HIGGS dataset

In [78]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz ./

--2020-05-19 00:10:07--  https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2816407858 (2.6G) [application/x-httpd-php]
Saving to: ‘HIGGS.csv.gz’


2020-05-19 00:13:35 (13.0 MB/s) - ‘HIGGS.csv.gz’ saved [2816407858/2816407858]

--2020-05-19 00:13:35--  http://./
Resolving . (.)... failed: No address associated with hostname.
wget: unable to resolve host address ‘.’
FINISHED --2020-05-19 00:13:35--
Total wall clock time: 3m 28s
Downloaded: 1 files, 2.6G in 3m 27s (13.0 MB/s)


In [100]:
!find . -name '*.csv.gz' -exec gzip -d {} \;

In [102]:
input = sagemaker.session.Session().upload_data('HIGGS.csv',key_prefix='rapids/higgsdata')

In [103]:
input

's3://sagemaker-us-east-1-497456752804/rapids/higgsdata/HIGGS.csv'

In [191]:
higgs_df = pd.read_csv('HIGGS.csv',names=['col'+str(s) for s in range(29)])

In [197]:
higgs_df.head()

Unnamed: 0,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,...,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28
0,1,0.869293,-0.635082,0.22569,0.32747,-0.689993,0.754202,-0.248573,-1.092064,0.0,...,-0.010455,-0.045767,3.101961,1.35376,0.979563,0.978076,0.920005,0.721657,0.988751,0.876678
1,1,0.907542,0.329147,0.359412,1.49797,-0.31301,1.095531,-0.557525,-1.58823,2.173076,...,-1.13893,-0.000819,0.0,0.30222,0.833048,0.9857,0.978098,0.779732,0.992356,0.798343
2,1,0.798835,1.470639,-1.635975,0.453773,0.425629,1.104875,1.282322,1.381664,0.0,...,1.128848,0.900461,0.0,0.909753,1.10833,0.985692,0.951331,0.803252,0.865924,0.780118
3,0,1.344385,-0.876626,0.935913,1.99205,0.882454,1.786066,-1.646778,-0.942383,0.0,...,-0.678379,-1.360356,0.0,0.946652,1.028704,0.998656,0.728281,0.8692,1.026736,0.957904
4,1,1.105009,0.321356,1.522401,0.882808,-1.205349,0.681466,-1.070464,-0.921871,0.0,...,-0.373566,0.113041,0.0,0.755856,1.361057,0.98661,0.838085,1.133295,0.872245,0.808487


In [196]:
higgs_df['col0'] = higgs_df['col0'].astype('int32')

In [209]:
higgs_df.shape

(11000000, 29)

### Write cudf xgboost script

In [227]:
%%writefile cudf_xgboost_example.py

import numpy as np
import os

from cuml.test.utils import array_equal
from cuml.utils.import_utils import has_xgboost

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

from sklearn import metrics

#from sklearn.model_selection import train_test_split
    
from cuml import ForestInference

from cuml.preprocessing.model_selection import train_test_split

import cudf
from cuml.preprocessing.model_selection import train_test_split

import argparse
import os
import pandas as pd
import numpy as np
import io, requests
from io import StringIO
import datetime

import time
globalstart = time.process_time()
print(datetime.datetime.now())
# your code here    



if has_xgboost():
    import xgboost as xgb
else:
    raise ImportError("Please install xgboost using the conda package,"
                      " Use conda install -c conda-forge xgboost "
                      "command to install xgboost")
    
def train_xgboost_model(X_train, y_train,
                        num_rounds, model_path):
    # set the xgboost model parameters
#     params = {'silent': 1, 'eval_metric':'error',
#               'objective':'binary:logistic',
#               'max_depth': 25}
    params = {'silent': 0, 'tree_method': 'gpu_hist',
              'eval_metric': 'auc',
              'objective': 'binary:logistic'}
    
    print("in xgboost training")
    dtrain = xgb.DMatrix(X_train, label=y_train)
    print("loaded dtrain")
    # train the xgboost model
    bst = xgb.train(params, dtrain, num_rounds)

    # save the trained xgboost model
    bst.save_model(model_path)

    return bst

def predict_xgboost_model(X_validation, y_validation, xgb_model):

    # predict using the xgboost model
    dvalidation = xgb.DMatrix(X_validation, label=y_validation)
    xgb_preds = xgb_model.predict(dvalidation)

    # convert the predicted values from xgboost into class labels
    xgb_preds = np.around(xgb_preds)
    
    return xgb_preds

if __name__=='__main__':

    print("Loading data",time.process_time() - globalstart,datetime.datetime.now())
    start = time.process_time()
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-test-split-ratio', type=float, default=0.3)
    parser.add_argument('--num-round', type=int, default=15)
    
    args, _ = parser.parse_known_args()
    
    input_data_path = os.path.join('/opt/ml/processing/input', 'HIGGS.csv')
    higgs_df = cudf.read_csv(input_data_path,names=['col'+str(s) for s in range(29)])
    
    print("Loading time=",time.process_time() - start,datetime.datetime.now())
    
    
    Y = higgs_df['col0']
    X = higgs_df.pop('col0')
    
    start = time.process_time()
    
    # Generate some sample data

    X_train, X_validation, y_train, y_validation = train_test_split(X,Y, test_size=args.train_test_split_ratio)
    print("Split time=",time.process_time() - start,datetime.datetime.now())

    
    start = time.process_time()
    # enter path to the directory where the trained model will be saved
    model_path = os.path.join('/opt/ml/processing/train', 'xgb.model')
    # This trains and saves the model
    num_rounds = args.num_round
    xgboost_model = train_xgboost_model(X_train, y_train,
                                    num_rounds, model_path)
    
    print("Training time=",time.process_time() - start,datetime.datetime.now())
    
    start = time.process_time()
    trained_model_preds = predict_xgboost_model(X_validation,
                                            y_validation,
                                            xgboost_model)
    print("Validation time=",time.process_time() - start,datetime.datetime.now())

    
    start = time.process_time()
    fm = ForestInference.load(filename=model_path,
                          algo='BATCH_TREE_REORG',
                          output_class=True,
                          threshold=0.50,
                          model_type='xgboost')
    print("fm init time=",time.process_time() - start,datetime.datetime.now())
    
    start = time.process_time()
    fil_preds = fm.predict(X_validation)
    print("fm Validation time=",time.process_time() - start,datetime.datetime.now())

#     fpr, tpr, thresholds = metrics.roc_curve(y_validation.tolist(), fil_preds, pos_label=2)
#     print('AUC = ',metrics.auc(fpr, tpr))
    print("-------------------------------")
    
    
    
    print("End",time.process_time() - globalstart,datetime.datetime.now())
    

Overwriting cudf_xgboost_example.py


In [248]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

script_processor.run(code='cudf_xgboost_example.py',
                      inputs=[ProcessingInput(
                        source='s3://sagemaker-us-east-1-497456752804/rapids/higgsdata/HIGGS.csv',
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train')],
                      arguments=['--train-test-split-ratio', '0.2','--num-round','100'], wait=False
                     )
script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)


Job Name:  miniconda-rapids-container-v2-2020-05-20-21-48-07-383
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-497456752804/rapids/higgsdata/HIGGS.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-497456752804/miniconda-rapids-container-v2-2020-05-20-21-48-07-383/input/code/cudf_xgboost_example.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-497456752804/miniconda-rapids-container-v2-2020-05-20-21-48-07-383/output/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}]
{'ProcessingInputs': [{'InputName': 'input-1', 'S3Inpu

# Dask xgboost in rapids

In [251]:
%%writefile cudf_dask_xgboost_example.py

import numpy as np
import os

from cuml.test.utils import array_equal
from cuml.utils.import_utils import has_xgboost

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
#from sklearn.model_selection import train_test_split
    
from cuml import ForestInference

from cuml.preprocessing.model_selection import train_test_split

import cudf

import dask
import dask_cudf
import dask_xgboost
import pandas as pd
import numpy as np

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

import subprocess

cmd = "hostname --all-ip-addresses"
process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
output, error = process.communicate()
IPADDR = str(output.decode()).split()[0]

cluster = LocalCUDACluster(ip=IPADDR)
client = Client(cluster)

import argparse
import os
import pandas as pd
import numpy as np
import io, requests
from io import StringIO
import datetime

import time
globalstart = time.process_time()
print(datetime.datetime.now())
# your code here    



if has_xgboost():
    import xgboost as xgb
else:
    raise ImportError("Please install xgboost using the conda package,"
                      " Use conda install -c conda-forge xgboost "
                      "command to install xgboost")
    
def train_xgboost_model(X_train, y_train,
                        num_rounds, model_path):

    params = {'silent': 0, 'tree_method': 'gpu_hist',
              'eval_metric': 'auc',
              'objective': 'binary:logistic'}
    
    print("in xgboost training")
    dtrain = xgb.DMatrix(X_train, label=y_train)
    print("loaded dtrain")
    # train the xgboost model
    bst = xgb.train(params, dtrain, num_rounds)
    

    ## Optional: persist training data into memory
    x_train = X_train.persist()
    y_train = y_train.persist()

    bst = dask_xgboost.train(client, params, x_train, y_train, num_boost_round=params['num_rounds'])

    # save the trained xgboost model
    bst.save_model(model_path)

    return bst

def predict_xgboost_model(X_validation, y_validation, xgb_model):

    # predict using the xgboost model
    dvalidation = xgb.DMatrix(X_validation, label=y_validation)
    xgb_preds = xgb_model.predict(dvalidation)

    # convert the predicted values from xgboost into class labels
    xgb_preds = np.around(xgb_preds)
    return xgb_preds

if __name__=='__main__':

    print("Loading data",time.process_time() - globalstart,datetime.datetime.now())
    start = time.process_time()
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-test-split-ratio', type=float, default=0.3)
    parser.add_argument('--n-partitions', type=int, default=20)
    parser.add_argument('--num-round', type=int, default=15)
    args, _ = parser.parse_known_args()
    
    input_data_path = os.path.join('/opt/ml/processing/input', 'HIGGS.csv')
    higgs_dask_df = cudf.read_csv(input_data_path,names=['col'+str(s) for s in range(29)])
    
    dask.dataframe.from_pandas(higgs_df, npartitions=args.n_partitions)
    
    print("Loading time=",time.process_time() - start,datetime.datetime.now())
    
    
    Y = higgs_dask_df['col0']
    X = higgs_dask_df.pop('col0')
    
    start = time.process_time()
    
    # Generate some sample data

    X_train, X_validation, y_train, y_validation = train_test_split(X,Y, test_size=args.train_test_split_ratio)
    print("Split time=",time.process_time() - start,datetime.datetime.now())

    
    start = time.process_time()
    # enter path to the directory where the trained model will be saved
    model_path = os.path.join('/opt/ml/processing/train', 'xgb.model')
    # This trains and saves the model
    num_rounds = args.num_round
    xgboost_model = train_xgboost_model(X_train, y_train,
                                    num_rounds, model_path)
    
    print("Training time=",time.process_time() - start,datetime.datetime.now())
    
    start = time.process_time()
    trained_model_preds = predict_xgboost_model(X_validation,
                                            y_validation,
                                            xgboost_model)
    print("Validation time=",time.process_time() - start,datetime.datetime.now())

    
    start = time.process_time()
    fm = ForestInference.load(filename=model_path,
                          algo='BATCH_TREE_REORG',
                          output_class=True,
                          threshold=0.50,
                          model_type='xgboost')
    print("fm init time=",time.process_time() - start,datetime.datetime.now())
    
    start = time.process_time()
    fil_preds = fm.predict(X_validation)
    print("fm Validation time=",time.process_time() - start,datetime.datetime.now())
    print(type(y_validation))
    print(type(fil_preds))
    print(type(np.asarray(fil_preds)))
    
    fpr, tpr, thresholds = metrics.roc_curve(y_validation, np.asarray(fil_preds), pos_label=2)
    print('AUC = ',metrics.auc(fpr, tpr))
    print("-------------------------------")
    
    
    
    
    print("End",time.process_time() - globalstart,datetime.datetime.now())

Overwriting cudf_dask_xgboost_example.py


In [252]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

script_processor.run(code='cudf_xgboost_example.py',
                      inputs=[ProcessingInput(
                        source='s3://sagemaker-us-east-1-497456752804/rapids/higgsdata/HIGGS.csv',
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train')],
                      arguments=['--train-test-split-ratio', '0.2','--n-partitions','320','--num-round','100']
                     )
script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)


Job Name:  miniconda-rapids-container-v2-2020-05-20-22-27-49-922
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-497456752804/rapids/higgsdata/HIGGS.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-497456752804/miniconda-rapids-container-v2-2020-05-20-22-27-49-922/input/code/cudf_xgboost_example.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-497456752804/miniconda-rapids-container-v2-2020-05-20-22-27-49-922/output/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}]
......................................................

# SageMaker Xgboost

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')

### Write data

In [198]:
import os
import boto3

### Run only once if doing repeated testing....

In [199]:
bucket = 'sagemaker-us-east-1-497456752804'
prefix = 'sagemaker/DEMO-xgboost-churn'

train_data, validation_data, test_data = np.split(higgs_df.sample(frac=1), [int(0.8 * len(higgs_df)), int(0.99 * len(higgs_df))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)

In [200]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

In [234]:
validation_data.drop('col0', axis=1).to_csv('test.csv', header=False, index=False) # Testing for performance benchmark here. For Actual use cases, please use a hold out set (see test_data)

In [235]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')

### Train using SageMaker's Xgboost

In [201]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.p3.8xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

# params = {'silent': 0, 'tree_method': 'gpu_hist',
#           'eval_metric': 'auc',
#           'objective': 'binary:logistic',
#               'num_round':100}

xgb.set_hyperparameters(silent=0,
                        eval_metric='auc',
                        num_round=100,
                        objective='binary:logistic')


xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

2020-05-20 14:37:48 Starting - Starting the training job...
2020-05-20 14:37:49 Starting - Launching requested ML instances.........
2020-05-20 14:39:33 Starting - Preparing the instances for training............
2020-05-20 14:41:30 Downloading - Downloading input data.........
2020-05-20 14:43:11 Training - Downloading the training image..[34mINFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training[0m
[34mINFO:sagemaker-containers:Failed to parse hyperparameter eval_metric value auc to Json.[0m
[34mReturning the value itself[0m
[34mINFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.[0m
[34mReturning the value itself[0m
[34mINFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode[0m
[34mINFO:root:Determined delimiter of CSV input is ','[0m
[34mINFO:root:Determined delimiter of CSV input is ','[0m
[34mINFO:root:Determined delimiter of CSV input is ','[0m

2020-05-20 14:4

### SM Batch transform

In [247]:
sm_transformer = xgb.transformer(instance_count =1,
                                 instance_type ='ml.p3.16xlarge',
                                )

# start a transform job
input_location = 's3://{}/{}/test/'.format(bucket, prefix)
sm_transformer.transform(input_location, split_type='Line',content_type='text/csv')



In [None]:
instance_count, instance_type, strategy=None, assemble_with=None, output_path=None, output_kms_key=None, accept=None, env=None, max_concurrent_transforms=None, max_payload=None, tags=None, role=None, volume_kms_key=None, vpc_config_override='VPC_CONFIG_DEFAULT', enable_network_isolation=None, model_name=Non

In [253]:
x = np.arange(10)


In [254]:
!ls

cudf_dask_xgboost_example.py
cudf_xgboost_example.py
docker
forest_inference_demo.ipynb
HIGGS.csv
preprocessing.py
RapidsAI based Data Science using SageMaker processing.ipynb
test.csv
tips.csv
train.csv
validation.csv
x.npy


In [255]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.1.0-py3-none-manylinux2010_x86_64.whl (127.6 MB)
[K     |████████████████████████████████| 127.6 MB 19 kB/s s eta 0:00:01   |▌                               | 1.9 MB 15.3 MB/s eta 0:00:09
Installing collected packages: xgboost
Successfully installed xgboost-1.1.0


In [256]:
import xgboost as xgb

In [257]:
bst = xgb.Booster({'nthread': 4})  # init model
bst.load_model('xgb.model')

In [262]:
bst

<xgboost.core.Booster at 0x7ff9ba2e0860>

In [266]:
bst.trees_to_dataframe()

Unnamed: 0,Tree,Node,ID,Feature,Split,Yes,No,Missing,Gain,Cover
0,0,0,0-0,f0,0.5,0-1,0-2,0-1,8.768354e+06,2.200000e+06
1,0,1,0-1,Leaf,,,,,-5.999994e-01,0.000000e+00
2,0,2,0-2,Leaf,,,,,5.999995e-01,0.000000e+00
3,1,0,1-0,f0,0.5,1-1,1-2,1-1,4.812176e+06,2.013302e+06
4,1,1,1-1,Leaf,,,,,-4.646429e-01,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...
199,95,0,95-0,Leaf,,,,,-4.984138e-02,1.532724e+00
200,96,0,96-0,Leaf,,,,,1.802034e-02,2.041205e+00
201,97,0,97-0,Leaf,,,,,1.626312e-02,2.058103e+00
202,98,0,98-0,Leaf,,,,,1.466685e-02,2.073617e+00


In [271]:
test_data

Unnamed: 0,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,...,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28
8300971,1,0.484791,0.287267,0.220142,1.104423,-1.204572,0.742843,0.650556,1.483669,2.173076,...,0.692448,0.109711,0.000000,0.948819,0.943823,0.987724,0.905025,0.914164,0.750882,0.701774
10924625,0,0.389077,-0.048753,-1.349666,1.790137,1.715823,1.294318,-0.801121,-0.210050,0.000000,...,-0.839114,-0.514668,0.000000,1.075798,1.022699,0.975776,0.896829,0.486526,0.769174,0.742101
8842886,1,1.816182,-0.128618,-0.182235,2.456141,0.104298,1.666152,-2.294388,-1.512280,0.000000,...,1.577739,1.487558,0.000000,0.840247,1.267690,0.983190,1.229591,0.881115,1.243128,1.293838
5191959,0,0.851358,0.785939,-0.030758,0.909259,-1.434206,1.181642,-1.972563,-0.522163,0.000000,...,0.614163,0.510358,3.101961,0.879423,0.810431,1.161848,0.692461,0.736917,0.592077,0.647699
8839804,0,0.935726,0.956383,1.527950,0.219544,-0.133253,0.625036,-0.605056,-0.739481,1.086538,...,-0.456848,-0.266622,0.000000,0.819641,0.765392,0.979729,0.717036,1.558316,1.055638,0.910037
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8401292,1,1.052485,0.711917,-1.042828,2.480450,-0.657692,1.839473,0.156431,0.968655,2.173076,...,-0.009622,1.398772,0.000000,0.341178,0.708616,0.999462,0.469504,1.399658,1.469215,1.052727
5014379,0,1.062002,1.161891,-1.648737,0.752124,1.300429,0.688703,-1.218008,0.246304,2.173076,...,-1.872647,-1.365351,0.000000,0.861246,0.992491,0.980291,1.097236,0.512228,1.440075,1.865075
8247014,0,1.416490,0.988524,-1.103863,0.703794,1.465887,1.082981,0.574308,0.962556,2.173076,...,0.320176,0.427676,0.000000,0.803851,1.061422,1.016638,1.107646,1.935393,1.084115,0.890553
10882761,1,0.611982,-0.580540,-0.013003,1.412994,-1.164767,0.751637,-0.890242,-0.776624,2.173076,...,-2.124993,0.644091,0.000000,0.976866,0.565791,1.108988,0.659875,0.723932,0.951845,0.902022


In [303]:
dvalidation = xgb.DMatrix(test_data)

In [304]:
xgb_preds = bst.predict(dvalidation)

XGBoostError: [23:27:00] /workspace/src/learner.cc:1062: Check failed: learner_model_param_.num_feature == p_fmat->Info().num_col_ (1 vs. 29) : Number of columns does not match number of features in booster.
Stack trace:
  [bt] (0) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0xa0c64) [0x7ff40e9e5c64]
  [bt] (1) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x195440) [0x7ff40eada440]
  [bt] (2) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x198a9d) [0x7ff40eadda9d]
  [bt] (3) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterPredict+0xd9) [0x7ff40e9d6749]
  [bt] (4) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7ff9e41d5ec0]
  [bt] (5) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7ff9e41d587d]
  [bt] (6) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7ff9e43eae2e]
  [bt] (7) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x12865) [0x7ff9e43eb865]
  [bt] (8) /home/ec2-user/anaconda3/envs/python3/bin/python(_PyObject_FastCallDict+0x8b) [0x5654e9e7fd7b]



In [300]:
bst.feature_names

['col0',
 'col1',
 'col2',
 'col3',
 'col4',
 'col5',
 'col6',
 'col7',
 'col8',
 'col9',
 'col10',
 'col11',
 'col12',
 'col13',
 'col14',
 'col15',
 'col16',
 'col17',
 'col18',
 'col19',
 'col20',
 'col21',
 'col22',
 'col23',
 'col24',
 'col25',
 'col26',
 'col27',
 'col28']

In [307]:
!conda install -c rapidsai -c nvidia -c conda-forge \
    -c defaults rapids=0.13

Solving environment: \ ^C
failed

CondaError: KeyboardInterrupt



In [311]:
!pip install xgboost==0.90

Collecting xgboost==0.90
  Downloading xgboost-0.90-py2.py3-none-manylinux1_x86_64.whl (142.8 MB)
[K     |████████████████████████████████| 142.8 MB 63.9 MB/s eta 0:00:01
Installing collected packages: xgboost
Successfully installed xgboost-0.90


In [313]:
import pickle as pkl 
model = pkl.load(open('xgboost-model', 'rb'))

XGBoostError: [00:03:12] /workspace/src/learner.cc:846: Check failed: header == serialisation_header_: 

  If you are loading a serialized model (like pickle in Python) generated by older
  XGBoost, please export the model by calling `Booster.save_model` from that version
  first, then load it back in current version.  There's a simple script for helping
  the process. See:

    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html

  for reference to the script, and more details about differences between saving model and
  serializing.


Stack trace:
  [bt] (0) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0xa0c64) [0x7ff40e9e5c64]
  [bt] (1) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x1b2d11) [0x7ff40eaf7d11]
  [bt] (2) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUnserializeFromBuffer+0x4a) [0x7ff40e9d460a]
  [bt] (3) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7ff9e41d5ec0]
  [bt] (4) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7ff9e41d587d]
  [bt] (5) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7ff9e43eae2e]
  [bt] (6) /home/ec2-user/anaconda3/envs/python3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x12865) [0x7ff9e43eb865]
  [bt] (7) /home/ec2-user/anaconda3/envs/python3/bin/python(_PyObject_FastCallDict+0x8b) [0x5654e9e7fd7b]
  [bt] (8) /home/ec2-user/anaconda3/envs/python3/bin/python(+0x19e7ce) [0x5654e9f0f7ce]

