Amazon SageMaker PySpark Documentation
======================================

The SageMaker PySpark SDK provides a PySpark interface to Amazon SageMaker, allowing customers
to train using the Spark Estimator API, host their model on Amazon SageMaker, and make
predictions with their model using the Spark Transformer API. This page is a quick guide on the
basics of SageMaker PySpark. You can also check the :ref:`api` docs.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

Quick Start
-----------

First, install the library:

.. code-block:: sh

    $ pip install sagemaker_pyspark

Next, set up credentials (in e.g. ``~/.aws/credentials``):

.. code-block:: ini

    [default]
    aws_access_key_id = YOUR_KEY
    aws_secret_access_key = YOUR_KEY

Then, set up a default region (in e.g. ``~/.aws/config``):

.. code-block:: ini

    [default]
    region=us-west-2

Then, to load the SageMaker jars programmatically:

.. code-block:: python

    from pyspark import SparkContext, SparkConf
    import sagemaker_pyspark

    conf = (SparkConf()
            .set("spark.driver.extraClassPath",
                 ":".join(sagemaker_pyspark.classpath_jars())))
    SparkContext(conf=conf)

Alternatively, pass the jars to your pyspark job via the --jars flag:

.. code-block:: sh

    $ spark-submit --jars `sagemakerpyspark-jars`

If you want to play around in interactive mode, the pyspark shell can be used too:

.. code-block:: sh

    $ pyspark --jars `sagemakerpyspark-jars`

You can also use the --packages flag and pass in the Maven coordinates for SageMaker Spark:

.. code-block:: sh

    $ pyspark --packages com.amazonaws:sagemaker-spark_2.11:spark_2.1.1-1.0

Training and Hosting a K-Means Clustering model using SageMaker PySpark
-----------------------------------------------------------------------

A KMeansSageMakerEstimator runs a training job using the Amazon SageMaker KMeans algorithm upon
invocation of fit(), returning a SageMakerModel.

.. code-block:: python

    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

    iam_role = "arn:aws:iam::0123456789012:role/MySageMakerRole"

    region = "us-east-1"
    training_data = spark.read.format("libsvm") \
        .option("numFeatures", "784") \
        .load("s3a://sagemaker-sample-data-{}/spark/mnist/train/".format(region))

    test_data = spark.read.format("libsvm") \
        .option("numFeatures", "784") \
        .load("s3a://sagemaker-sample-data-{}/spark/mnist/test/".format(region))

    kmeans_estimator = KMeansSageMakerEstimator(
        trainingInstanceType="ml.m4.xlarge",
        trainingInstanceCount=1,
        endpointInstanceType="ml.m4.xlarge",
        endpointInitialInstanceCount=1,
        sagemakerRole=IAMRole(iam_role))

    kmeans_estimator.setK(10)
    kmeans_estimator.setFeatureDim(784)

    kmeans_model = kmeans_estimator.fit(training_data)

    transformed_data = kmeans_model.transform(test_data)
    transformed_data.show()

The SageMakerEstimator expects an input DataFrame with a column named "features" that holds a
Spark ML Vector. The estimator also serializes a "label" column of Doubles, if present. Other
columns are ignored. The dimension of this input vector should be equal to the feature dimension
given as a hyperparameter.

The Amazon SageMaker KMeans algorithm accepts many parameters, but K (the number of clusters) and
FeatureDim (the number of features per Row) are required. You can set other hyperparameters; see
the docs (link), or run:

.. code-block:: python

    kmeans_estimator.explainParams()

After training is complete, an Amazon SageMaker Endpoint is created to host the model and serve
predictions. Upon invocation of transform(), the SageMakerModel predicts against the hosted model.
Like the SageMakerEstimator, the SageMakerModel expects an input DataFrame with a column named
"features" that holds a Spark ML Vector equal in dimension to the value of the FeatureDim
parameter.
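If your data does not already contain such a column, Spark ML's VectorAssembler can build one.
The sketch below is illustrative only: the DataFrame ``df`` and its raw columns ``x1``, ``x2``,
and ``x3`` are hypothetical placeholders, not part of the MNIST example above.

.. code-block:: python

    from pyspark.ml.feature import VectorAssembler

    # Hypothetical DataFrame with three numeric feature columns.
    df = spark.createDataFrame(
        [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
        ["x1", "x2", "x3"])

    # Pack the raw columns into the single "features" vector column
    # that the SageMakerEstimator and SageMakerModel expect.
    assembler = VectorAssembler(
        inputCols=["x1", "x2", "x3"],
        outputCol="features")

    assembled_df = assembler.transform(df)
    assembled_df.show()

For data assembled this way, FeatureDim would be set to 3 to match the length of the assembled
vector.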
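Keep in mind that the endpoint created by fit() stays active (and billed) until you delete it.
A minimal cleanup sketch, assuming the ``kmeans_model`` from the example above and the
SageMakerResourceCleanup helper shipped with sagemaker_pyspark:

.. code-block:: python

    from sagemaker_pyspark import SageMakerResourceCleanup

    # Delete the endpoint, endpoint config, and model created by fit().
    # Assumes kmeans_model from the example above; sagemakerClient and
    # getCreatedResources() come from the library's cleanup API.
    resource_cleanup = SageMakerResourceCleanup(kmeans_model.sagemakerClient)
    resource_cleanup.deleteResources(kmeans_model.getCreatedResources())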
API Reference
-------------

If you are looking for information on a specific class or method, this is where to find it.

.. toctree::
   :maxdepth: 2

   api

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`