Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: CC-BY-SA-4.0
This section provides information for developers who want to use Apache Spark for preprocessing data and Amazon SageMaker for model training and hosting. For information about supported versions of Apache Spark, see https://github.com/aws/sagemaker-spark#getting-sagemaker-spark.
Amazon SageMaker provides an Apache Spark library, in both Python and Scala, that you can use to easily train models in Amazon SageMaker using `org.apache.spark.sql.DataFrame` data frames in your Spark clusters. After model training, you can also host the model using Amazon SageMaker hosting services.
The Amazon SageMaker Spark library, `com.amazonaws.services.sagemaker.sparksdk`, provides the following classes, among others:

+ `SageMakerEstimator`: Extends the `org.apache.spark.ml.Estimator` class. You can use this estimator for model training in Amazon SageMaker.
+ `KMeansSageMakerEstimator`, `PCASageMakerEstimator`, and `XGBoostSageMakerEstimator`: Extend the `SageMakerEstimator` class.
+ `SageMakerModel`: Extends the `org.apache.spark.ml.Model` class. You can use a `SageMakerModel` for model hosting and obtaining inferences in Amazon SageMaker.
You have the following options for downloading the Spark library provided by Amazon SageMaker:

+ You can download the source code for both the PySpark and Scala libraries from GitHub at https://github.com/aws/sagemaker-spark.
+ For the Python Spark library, you have the following additional options:
  + Use pip install:

    ```
    $ pip install sagemaker_pyspark
    ```
  + In a notebook instance, create a new notebook that uses either the `Sparkmagic (PySpark)` or the `Sparkmagic (PySpark3)` kernel and connect to a remote Amazon EMR cluster. For more information, see Build Amazon SageMaker Notebooks Backed by Spark in Amazon EMR.

    Note: The EMR cluster must be configured with an IAM role that has the `AmazonSageMakerFullAccess` policy attached. For information about configuring roles for an EMR cluster, see Configure IAM Roles for Amazon EMR Permissions to AWS Services in the *Amazon EMR Management Guide*.
+ You can get the Scala library from Maven. Add the Spark library to your project by adding the following dependency to your `pom.xml` file:

  ```
  <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>sagemaker-spark_2.11</artifactId>
      <version>spark_2.2.0-1.0</version>
  </dependency>
  ```
The following is a high-level summary of the steps for integrating your Apache Spark application with Amazon SageMaker.
Continue data preprocessing using the Apache Spark library that you are familiar with. Your dataset remains a `DataFrame` in your Spark cluster.

Note: Load your data into a `DataFrame` and preprocess it so that you have a `features` column containing an `org.apache.spark.ml.linalg.Vector` of `Double`s, and an optional `label` column with values of `Double` type.
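As a minimal sketch of this preprocessing step, assuming a raw `DataFrame` with hypothetical numeric columns `x1`, `x2`, `x3` and a target column `y`, Spark's `VectorAssembler` can produce the expected `features`/`label` shape:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// raw: a DataFrame with hypothetical numeric columns x1, x2, x3 and target y
def prepare(raw: DataFrame): DataFrame = {
  // Combine the raw numeric columns into a single Vector column.
  val assembled = new VectorAssembler()
    .setInputCols(Array("x1", "x2", "x3"))
    .setOutputCol("features")                      // Vector of Doubles
    .transform(raw)

  assembled
    .withColumn("label", col("y").cast("double"))  // optional label column of Double
    .select("features", "label")
}
```

The column names `x1`, `x2`, `x3`, and `y` are placeholders; only the output column names `features` and `label` matter to the SageMaker estimators.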
Use the estimator in the Amazon SageMaker Spark library to train your model. For example, if you choose the k-means algorithm provided by Amazon SageMaker for model training, you call the `KMeansSageMakerEstimator.fit` method. Provide your `DataFrame` as input. The estimator returns a `SageMakerModel` object.

Note: `SageMakerModel` extends the `org.apache.spark.ml.Model` class.
The `fit` method does the following:

+ Converts the input `DataFrame` to the protobuf format by selecting the `features` and `label` columns from the input `DataFrame` and uploading the protobuf data to an Amazon S3 bucket. The protobuf format is efficient for model training in Amazon SageMaker.
+ Starts model training in Amazon SageMaker by sending an Amazon SageMaker `CreateTrainingJob` request. After model training has completed, Amazon SageMaker saves the model artifacts to an S3 bucket.

  Amazon SageMaker assumes the IAM role that you specified for model training to perform tasks on your behalf. For example, it uses the role to read training data from an S3 bucket and to write model artifacts to a bucket.
+ Creates and returns a `SageMakerModel` object. The constructor does the following tasks, which are related to deploying your model to Amazon SageMaker:
  + Sends a `CreateModel` request to Amazon SageMaker.
  + Sends a `CreateEndpointConfig` request to Amazon SageMaker.
  + Sends a `CreateEndpoint` request to Amazon SageMaker, which then launches the specified resources and hosts the model on them.
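As a sketch of the training step, a k-means estimator might be configured and fit as follows. The role ARN, instance types, and hyperparameter values below are placeholders, and `trainingData` is assumed to be a `DataFrame` prepared as described above:

```scala
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.algorithms.KMeansSageMakerEstimator

// Placeholder ARN; use a role that SageMaker can assume for training.
val roleArn = "arn:aws:iam::123456789012:role/my-sagemaker-role"

val estimator = new KMeansSageMakerEstimator(
    sagemakerRole = IAMRole(roleArn),
    trainingInstanceType = "ml.m4.xlarge",   // example instance types
    trainingInstanceCount = 1,
    endpointInstanceType = "ml.m4.xlarge",
    endpointInitialInstanceCount = 1)
  .setK(10)            // example: number of clusters
  .setFeatureDim(784)  // example: dimension of the features vector

// fit uploads the data to S3 in protobuf format, runs a training job,
// and returns a SageMakerModel backed by a newly created endpoint.
val model = estimator.fit(trainingData)
```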
You can get inferences from your model hosted in Amazon SageMaker with the `SageMakerModel.transform` method. Provide an input `DataFrame` with features as input. The `transform` method transforms it to a `DataFrame` containing inferences. Internally, the `transform` method sends a request to the `InvokeEndpoint` Amazon SageMaker API to get inferences. The `transform` method appends the inferences to the input `DataFrame`.
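Assuming a `model` returned by an estimator's `fit` call and a `testData` DataFrame that has a `features` column, the inference step reduces to a single call:

```scala
// transform sends InvokeEndpoint requests under the hood and returns
// a DataFrame with the inference columns appended to the input columns.
val results = model.transform(testData)
results.show()
```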