# Multiclass classification on Spark with Amazon SageMaker XGBoost algorithm
_**Single machine and distributed training on Spark for multiclass classification with Amazon SageMaker XGBoost algorithm**_

---

## Introduction


This notebook demonstrates the use of Amazon SageMaker’s implementation of the XGBoost algorithm to train and host a multiclass classification model using the sagemaker-spark SDK.

---

## Download Dataset

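This example uses the MNIST handwritten-digit dataset, which has already been converted to LibSVM format and stored in a SageMaker sample-data bucket on S3. Each image is 28x28 pixels, which is why `numFeatures` is set to 784. The bucket name includes the AWS Region, so `region` should match the Region you are working in.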

In [2]:
region = "us-east-1"

training_data = spark.read.format("libsvm").option("numFeatures", "784").load("s3a://sagemaker-sample-data-{}/spark/mnist/train/".format(region))

test_data = spark.read.format("libsvm").option("numFeatures", "784").load("s3a://sagemaker-sample-data-{}/spark/mnist/test/".format(region))
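
For readers unfamiliar with LibSVM format, here is a minimal pure-Python sketch of how one line maps to a label and a set of sparse feature values. The helper name `parse_libsvm_line` and the sample line are made up for illustration; Spark's `libsvm` reader does this work for you, and it likewise shifts the one-based indices in the file to zero-based positions in the resulting vector.

```python
def parse_libsvm_line(line, num_features):
    """Parse one LibSVM line ("<label> <index>:<value> ...") into (label, dense list).

    Hypothetical helper for illustration; not part of Spark or sagemaker-spark.
    """
    parts = line.split()
    label = float(parts[0])
    features = [0.0] * num_features
    for item in parts[1:]:
        index, value = item.split(":")
        # LibSVM indices are one-based; shift them to zero-based positions.
        features[int(index) - 1] = float(value)
    return label, features


# Made-up sample line: label 7 with three non-zero pixels out of 784.
label, features = parse_libsvm_line("7 203:84 204:185 205:159", num_features=784)
print(label, len(features), features[202:205])
```

Only the non-zero pixels are stored in the file, which keeps mostly blank images such as MNIST digits compact.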

## Create And Invoke Model

The IAM role specified in `iam_role` is passed to the containers that SageMaker uses for model hosting, allowing them to do things like publish CloudWatch metrics and download data from S3. If you are unsure which policies to attach to this role, start with the managed `AmazonSageMakerFullAccess` policy and scope permissions down from there as needed.

This cell takes roughly 10-20 minutes to run.

In [None]:
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

iam_role = "name_of_your_iam_role"

xgboost_estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole(iam_role),
)

xgboost_estimator.setNumRound(25) # Set number of boosting rounds
xgboost_estimator.setNumClasses(10) # MNIST contains digits 0-9
xgboost_estimator.setObjective('multi:softmax') # Set XGBoost objective to multi-class classification w/ SoftMax

xgboost_model = xgboost_estimator.fit(training_data)

transformed_data = xgboost_model.transform(test_data.limit(5)) # Score first 5 rows of test data
transformed_data.show()
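
To make the `multi:softmax` objective more concrete, here is a rough pure-Python sketch of the final prediction step: per-class scores are converted to probabilities with softmax, and the class with the highest probability is returned. The score values and the helper names `softmax` and `predict_class` are illustrative assumptions, not XGBoost internals.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    peak = max(scores)
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Return the index of the most probable class as a float label."""
    probs = softmax(scores)
    return float(max(range(len(probs)), key=probs.__getitem__))


# Made-up scores for the ten digit classes; class 7 has the largest score.
scores = [0.1, -1.2, 0.3, 0.0, -0.5, 0.2, -2.0, 3.1, 0.4, 0.6]
probs = softmax(scores)
print(round(sum(probs), 6), predict_class(scores))
```

With `multi:softmax`, only the winning class is returned for each row. XGBoost's `multi:softprob` objective returns the per-class probabilities instead.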

## Create And Invoke Model From An Existing Endpoint

In the previous step we created and trained a model, then invoked it through the returned model object. Here we create the model object from an existing SageMaker endpoint and use it to score the same test data.

In [8]:
from sagemaker_pyspark import SageMakerModel, EndpointCreationPolicy
from sagemaker_pyspark.transformation.serializers import LibSVMRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import XGBoostCSVRowDeserializer

my_endpoint = xgboost_model.endpointName # Get endpoint name of model created in previous step

xgboost_model = SageMakerModel(
    endpointInstanceType=None,
    endpointInitialInstanceCount=None,
    requestRowSerializer=LibSVMRequestRowSerializer(),
    responseRowDeserializer=XGBoostCSVRowDeserializer(),
    existingEndpointName=my_endpoint,
    endpointCreationPolicy=EndpointCreationPolicy.DO_NOT_CREATE,
)

transformed_data = xgboost_model.transform(test_data.limit(5))
transformed_data.show()

+-----+--------------------+----------+
|label|            features|prediction|
+-----+--------------------+----------+
|  7.0|(784,[202,203,204...|       7.0|
|  2.0|(784,[94,95,96,97...|       2.0|
|  1.0|(784,[128,129,130...|       1.0|
|  0.0|(784,[124,125,126...|       0.0|
|  4.0|(784,[150,151,159...|       4.0|
+-----+--------------------+----------+
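
As a quick sanity check on the output above, accuracy is the fraction of rows whose `prediction` equals `label`. The sketch below hard-codes the five (label, prediction) pairs from the table, and the `accuracy` helper is hypothetical. Five rows are far too few to say anything about overall model quality.

```python
def accuracy(pairs):
    """Fraction of (label, prediction) pairs where the two values match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("accuracy is undefined for an empty input")
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)


# (label, prediction) pairs copied from the five rows shown above.
rows = [(7.0, 7.0), (2.0, 2.0), (1.0, 1.0), (0.0, 0.0), (4.0, 4.0)]
print(accuracy(rows))
```

On the full test set inside a Spark session, `pyspark.ml.evaluation.MulticlassClassificationEvaluator` with `metricName="accuracy"` should compute the same metric directly on `transformed_data`, assuming the default `label` and `prediction` column names.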