{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Train an ML Model using Apache Spark in EMR and deploy in SageMaker\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will see how you can train your Machine Learning (ML) model using Apache Spark and then take the trained model artifacts to create an endpoint in SageMaker for online inference. Apache Spark is one of the most popular big-data analytics platforms & it also comes with an ML library with a wide variety of feature transformers and algorithms that one can use to build an ML model. \n", "\n", "Apache Spark is designed for offline batch processing workload and is not best suited for low latency online prediction. In order to mitigate that, we will use [MLeap](https://github.com/combust/mleap) library. MLeap provides an easy-to-use Spark ML Pipeline serialization format & execution engine for low latency prediction use-cases. Once the ML model is trained using Apache Spark in EMR, we will serialize it with `MLeap` and upload to S3 as part of the Spark job so that it can be used in SageMaker in inference.\n", "\n", "After the model training is completed, we will use SageMaker **Inference** to perform predictions against this model. The underlying Docker image that we will use in inference is provided by [sagemaker-sparkml-serving](https://github.com/aws/sagemaker-sparkml-serving-container). It is a Spring based HTTP web server written following SageMaker container specifications and its operations are powered by `MLeap` execution engine. \n", "\n", "We'll use the SageMaker Studio `Sparkmagic (PySpark)` kernel for this notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up an EMR cluster and connect a SageMaker notebook to the cluster\n", "In order to perform the steps mentioned in this notebook, you will need to have an EMR cluster running and make sure that the notebook can connect to the master node of the cluster. \n", "\n", "**This solution has been tested with Mleap 0.20, EMR 6.9.0 and Spark 3.3.0**\n", "\n", "Please follow the guide here on how to set up an EMR cluster and connect it to a notebook.\n", "[https://aws.amazon.com/blogs/machine-learning/part-1-create-and-manage-amazon-emr-clusters-from-sagemaker-studio-to-run-interactive-spark-and-ml-workloads/](https://aws.amazon.com/blogs/machine-learning/part-1-create-and-manage-amazon-emr-clusters-from-sagemaker-studio-to-run-interactive-spark-and-ml-workloads/)\n", "\n", "\n", "It can also be run as part of our workshop:\n", "[https://catalog.workshops.aws/sagemaker-studio-emr/en-US](https://catalog.workshops.aws/sagemaker-studio-emr/en-US)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connect to your EMR cluster\n", "\n", "To begin, you'll want to connect to your EMR Cluster from your SparkMagic kernel. 
"If you want more information on doing this, please check the documentation:\n", "[https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-emr-cluster-connect.html](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-emr-cluster-connect.html)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install the MLeap JARs & Python Libraries on the cluster\n", "You need to have the MLeap JARs on the classpath to be able to use MLeap during model serialization. This can be done seamlessly by adding the package names to `spark.jars.packages` and creating a virtual environment on our driver and executor nodes using notebook-scoped dependencies.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "%%configure -f\n", "{\n", " \"conf\": {\n", " \"spark.jars.packages\": \"ml.combust.mleap:mleap-spark_2.12:0.20.0,ml.combust.mleap:mleap-spark-base_2.12:0.20.0\",\n", " \"spark.pyspark.python\": \"python3\",\n", " \"spark.pyspark.virtualenv.enabled\": \"true\",\n", " \"spark.pyspark.virtualenv.type\": \"native\",\n", " \"spark.pyspark.virtualenv.bin.path\": \"/usr/bin/virtualenv\"\n", " }\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "sc.install_pypi_package(\"mleap==0.20.0\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing PySpark dependencies\n", "Next we will import the dependencies needed to execute the following cells on our Spark cluster. Please note that we are also importing the `mleap` module here; `boto3` is imported later, when we upload the model artifacts to S3. \n", "\n", "Ensure that the import cell runs without any error to verify that you have installed the dependencies from PyPI properly. Also, this cell will provide you with a valid `SparkSession` named `spark`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from mleap.pyspark.spark_support import SimpleSparkSerializer\n", "from pyspark.ml.regression import RandomForestRegressor\n", "from pyspark.sql.types import StructField, StructType, StringType, DoubleType\n", "\n", "schema = StructType(\n", " [\n", " StructField(\"sex\", StringType(), True),\n", " StructField(\"length\", DoubleType(), True),\n", " StructField(\"diameter\", DoubleType(), True),\n", " StructField(\"height\", DoubleType(), True),\n", " StructField(\"whole_weight\", DoubleType(), True),\n", " StructField(\"shucked_weight\", DoubleType(), True),\n", " StructField(\"viscera_weight\", DoubleType(), True),\n", " StructField(\"shell_weight\", DoubleType(), True),\n", " StructField(\"rings\", DoubleType(), True),\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Machine Learning task: Predict the age of an Abalone from its physical measurements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is available from [UCI Machine Learning](https://archive.ics.uci.edu/ml/datasets/abalone). The aim of this task is to determine the age of an abalone (a kind of shellfish) from its physical measurements. At its core, it's a regression problem. 
The dataset contains several features - `sex` (categorical), `length` (continuous), `diameter` (continuous), `height` (continuous), `whole_weight` (continuous), `shucked_weight` (continuous), `viscera_weight` (continuous), `shell_weight` (continuous) and `rings` (integer). Our goal is to predict the variable `rings`, which is a good approximation for age (age is `rings` + 1.5).\n", "\n", "We'll use SparkML to pre-process the dataset (apply one or more feature transformers) and then train a model with the [Random Forest](https://en.wikipedia.org/wiki/Random_forest) algorithm from SparkML." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pass bucket information to your EMR Cluster\n", "We'll use our local notebook kernel to pass variables to our EMR cluster." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "%%local\n", "import sagemaker\n", "\n", "sess = sagemaker.Session()\n", "bucket = sess.default_bucket()\n", "region = sess.boto_region_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "%%send_to_spark -i bucket -t str -n bucket" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "%%send_to_spark -i region -t str -n region" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define the schema of the dataset\n", "In the next cell, we will define the schema of the `Abalone` dataset and provide it to Spark so that it can parse the CSV file properly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "schema = StructType(\n", " [\n", " StructField(\"sex\", StringType(), True),\n", " StructField(\"length\", DoubleType(), True),\n", " StructField(\"diameter\", DoubleType(), True),\n", " StructField(\"height\", DoubleType(), True),\n", " StructField(\"whole_weight\", DoubleType(), True),\n", " StructField(\"shucked_weight\", DoubleType(), True),\n", " StructField(\"viscera_weight\", DoubleType(), True),\n", " StructField(\"shell_weight\", DoubleType(), True),\n", " StructField(\"rings\", DoubleType(), True),\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read data directly from S3\n", "Next we will use the built-in CSV reader from Spark to read the data directly from S3 into a `DataFrame` and inspect its first five rows.\n", "\n", "After that, we will split the `DataFrame` **80-20** into train and validation parts so that we can train the model on the train part and measure its performance on the validation part." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "total_df = spark.read.csv(\n", " f\"s3://sagemaker-example-files-prod-{region}/datasets/tabular/uci_abalone/abalone.csv\",\n", " header=False,\n", " schema=schema,\n", ")\n", "total_df.show(5)\n", "(train_df, validation_df) = total_df.randomSplit([0.8, 0.2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define the feature transformers\n", "The Abalone dataset has one categorical column, `sex`, which needs to be converted to numeric format before it can be passed to the Random Forest algorithm. 
\n", "\n", "For that, we are using `StringIndexer` and `OneHotEncoderEstimator` from Spark to transform the categorical column and then use a `VectorAssembler` to produce a flat one dimensional vector for each data-point so that it can be used with the Random Forest algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from pyspark.ml.feature import (\n", " StringIndexer,\n", " VectorIndexer,\n", " OneHotEncoder,\n", " VectorAssembler,\n", " IndexToString,\n", ")\n", "\n", "\n", "sex_indexer = StringIndexer(inputCol=\"sex\", outputCol=\"indexed_sex\")\n", "\n", "sex_encoder = OneHotEncoder(inputCols=[\"indexed_sex\"], outputCols=[\"sex_vec\"])\n", "\n", "assembler = VectorAssembler(\n", " inputCols=[\n", " \"sex_vec\",\n", " \"length\",\n", " \"diameter\",\n", " \"height\",\n", " \"whole_weight\",\n", " \"shucked_weight\",\n", " \"viscera_weight\",\n", " \"shell_weight\",\n", " ],\n", " outputCol=\"features\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define the Random Forest model and perform training\n", "After the data is preprocessed, we define a `RandomForestClassifier`, define our `Pipeline` comprising of both feature transformation and training stages and train the Pipeline calling `.fit()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from pyspark.ml import Pipeline\n", "from pyspark.ml.regression import RandomForestRegressor\n", "\n", "rf = RandomForestRegressor(labelCol=\"rings\", featuresCol=\"features\", maxDepth=6, numTrees=18)\n", "pipeline = Pipeline(stages=[sex_indexer, sex_encoder, assembler, rf])\n", "model = pipeline.fit(train_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use the trained `Model` to transform train and validation dataset\n", "Next we will use this trained `Model` to convert our training and validation dataset to see some sample output and also measure the performance scores.The `Model` will apply the feature transformers on the data before passing it to the Random Forest." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "transformed_train_df = model.transform(train_df)\n", "transformed_validation_df = model.transform(validation_df)\n", "transformed_validation_df.select(\"prediction\").show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating the model on train and validation dataset\n", "Using Spark's `RegressionEvaluator`, we can calculate the `rmse` (Root-Mean-Squared-Error) on our train and validation dataset to evaluate its performance. If the performance numbers are not satisfactory, we can train the model again and again by changing parameters of Random Forest or add/remove feature transformers." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from pyspark.ml.evaluation import RegressionEvaluator\n", "\n", "evaluator = RegressionEvaluator(labelCol=\"rings\", predictionCol=\"prediction\", metricName=\"rmse\")\n", "train_rmse = evaluator.evaluate(transformed_train_df)\n", "validation_rmse = evaluator.evaluate(transformed_validation_df)\n", "print(\"Train RMSE = %g\" % train_rmse)\n", "print(\"Validation RMSE = %g\" % validation_rmse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using `MLeap` to serialize the model\n", "By calling the `serializeToBundle` method from the `MLeap` library, we can store the `Model` in a specific serialization format that can be later used for inference by `sagemaker-sparkml-serving`. \n", "\n", "**If this step fails with an error - `JavaPackage is not callable`, it means you have not setup the MLeap JAR in the classpath properly.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model.serializeToBundle(\"jar:file:/tmp/model.zip\", transformed_validation_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert the model to `tar.gz` format\n", "SageMaker expects any model format to be present in `tar.gz` format, but MLeap produces the model `zip` format. In the next cell, we unzip the model artifacts and store it in `tar.gz` format. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import zipfile\n", "\n", "with zipfile.ZipFile(\"/tmp/model.zip\") as zf:\n", " zf.extractall(\"/tmp/model\")\n", "\n", "import tarfile\n", "\n", "with tarfile.open(\"/tmp/model.tar.gz\", \"w:gz\") as tar:\n", " tar.add(\"/tmp/model/bundle.json\", arcname=\"bundle.json\")\n", " tar.add(\"/tmp/model/root\", arcname=\"root\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload the trained model artifacts to S3\n", "At the end, we need to upload the trained and serialized model artifacts to S3 so that it can be used for inference in SageMaker. \n", "\n", "Please note down the S3 bucket location which we passed to the cluster previously." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import os\n", "import boto3\n", "\n", "s3 = boto3.resource(\"s3\")\n", "file_name = os.path.join(\"emr/abalone/mleap\", \"model.tar.gz\")\n", "s3.Bucket(bucket).upload_file(\"/tmp/model.tar.gz\", file_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Delete model artifacts from local disk (optional)\n", "If you are training multiple ML models on the same host and using the same location to save the `MLeap` serialized model, then you need to delete the model on the local disk to prevent `MLeap` library failing with an error - `file already exists`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import shutil\n", "\n", "os.remove(\"/tmp/model.zip\")\n", "os.remove(\"/tmp/model.tar.gz\")\n", "shutil.rmtree(\"/tmp/model\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hosting the model in SageMaker\n", "Now the second phase of this Notebook begins, where we will host this model in SageMaker and perform predictions against it. 
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hosting a model in SageMaker requires two components\n", "\n", "* A Docker image residing in ECR.\n", "* a trained Model residing in S3.\n", "\n", "For SparkML, Docker image for MLeap based SparkML serving has already been prepared and uploaded to ECR by SageMaker team which anyone can use for hosting. For more information on this, please see [SageMaker SparkML Serving](https://github.com/aws/sagemaker-sparkml-serving-container/). \n", "\n", "MLeap serialized model was uploaded to S3 as part of the Spark job we executed in EMR in the previous steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the endpoint for prediction\n", "Next we'll create the SageMaker endpoint which will be used for performing online prediction. \n", "\n", "For this, we have to create an instance of `SparkMLModel` from `sagemaker-python-sdk` which will take the location of the model artifacts that we uploaded to S3 as part of the EMR job." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Passing the schema of the payload via environment variable\n", "SparkML server also needs to know the payload of the request that'll be passed to it while calling the `predict` method. In order to alleviate the pain of not having to pass the schema with every request, `sagemaker-sparkml-serving` lets you to pass it via an environment variable while creating the model definitions. \n", "\n", "We'd see later that you can overwrite this schema on a per request basis by passing it as part of the individual request payload as well.\n", "\n", "This schema definition should also be passed while creating the instance of `SparkMLModel`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%local\n", "import json\n", "\n", "schema = {\n", " \"input\": [\n", " {\"name\": \"sex\", \"type\": \"string\"},\n", " {\"name\": \"length\", \"type\": \"double\"},\n", " {\"name\": \"diameter\", \"type\": \"double\"},\n", " {\"name\": \"height\", \"type\": \"double\"},\n", " {\"name\": \"whole_weight\", \"type\": \"double\"},\n", " {\"name\": \"shucked_weight\", \"type\": \"double\"},\n", " {\"name\": \"viscera_weight\", \"type\": \"double\"},\n", " {\"name\": \"shell_weight\", \"type\": \"double\"},\n", " ],\n", " \"output\": {\"name\": \"prediction\", \"type\": \"double\"},\n", "}\n", "schema_json = json.dumps(schema, indent=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%local\n", "from time import gmtime, strftime\n", "import time\n", "\n", "timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "\n", "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.sparkml.model import SparkMLModel\n", "\n", "boto3_session = boto3.session.Session()\n", "sagemaker_client = boto3.client(\"sagemaker\")\n", "sagemaker_runtime_client = boto3.client(\"sagemaker-runtime\")\n", "\n", "# Initialize sagemaker session\n", "session = sagemaker.Session(\n", " boto_session=boto3_session,\n", " sagemaker_client=sagemaker_client,\n", " sagemaker_runtime_client=sagemaker_runtime_client,\n", ")\n", "\n", "role = get_execution_role()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "%%local\n", "# S3 location of where you uploaded your 
trained and serialized SparkML model\n", "sparkml_data = \"s3://{}/{}/{}\".format(bucket, \"emr/abalone/mleap\", \"model.tar.gz\")\n", "model_name = \"sparkml-abalone-\" + timestamp_prefix\n", "sparkml_model = SparkMLModel(\n", " model_data=sparkml_data,\n", " role=role,\n", " spark_version=\"3.3\",\n", " sagemaker_session=session,\n", " name=model_name,\n", " # passing the schema defined above by using an environment\n", " # variable that sagemaker-sparkml-serving understands\n", " env={\"SAGEMAKER_SPARKML_SCHEMA\": schema_json},\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "%%local\n", "endpoint_name = \"sparkml-abalone-ep-\" + timestamp_prefix\n", "sparkml_model.deploy(\n", " initial_instance_count=1, instance_type=\"ml.c4.xlarge\", endpoint_name=endpoint_name\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Invoking the newly created inference endpoint with a payload to transform the data\n", "Now we will invoke the endpoint with a valid payload that `sagemaker-sparkml-serving` can recognize. There are three ways in which the input payload can be passed in the request:\n", "\n", "* Pass it as a valid CSV string. In this case, the schema passed via the environment variable will be used to parse the payload. For the CSV format, every column in the input has to be a basic datatype (e.g. int, double, string) and it cannot be a Spark `Array` or `Vector`.\n", "\n", "* Pass it as a valid JSON string. In this case as well, the schema passed via the environment variable will be used to parse the payload. With the JSON format, every column in the input can be a basic datatype or a Spark `Vector` or `Array`, provided that the corresponding entry in the schema specifies the correct type.\n", "\n", "* Pass the request in JSON format along with the schema and the data. In this case, the schema passed in the payload will take precedence over the one passed via the environment variable (if any)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Passing the payload in CSV format\n", "We will first see how the payload can be passed to the endpoint in CSV format." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%local\n", "from sagemaker.predictor import Predictor\n", "from sagemaker.serializers import CSVSerializer, JSONSerializer\n", "from sagemaker.deserializers import JSONDeserializer\n", "\n", "\n", "payload = \"F,0.515,0.425,0.14,0.766,0.304,0.1725,0.255\"\n", "\n", "predictor = Predictor(\n", " endpoint_name=endpoint_name, sagemaker_session=session, serializer=CSVSerializer()\n", ")\n", "print(predictor.predict(payload))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Passing the payload in JSON format\n", "We will now pass a different payload in JSON format." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%local\n", "payload = {\"data\": [\"F\", 0.515, 0.425, 0.14, 0.766, 0.304, 0.1725, 0.255]}\n", "\n", "predictor = Predictor(\n", " endpoint_name=endpoint_name, sagemaker_session=session, serializer=JSONSerializer()\n", ")\n", "print(predictor.predict(payload))" ] },
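{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Converting the predicted rings to an age estimate (optional)\n", "As noted in the dataset description above, age is approximately `rings` + 1.5. The next cell is a minimal sketch of turning the endpoint response into an age estimate. It reuses the `predictor` and `payload` from the previous cell, and it assumes the endpoint returns the single `prediction` value as a plain numeric string, which is what the output schema above requests." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%local\n", "# The endpoint predicts `rings`; the dataset description puts age at roughly rings + 1.5.\n", "response = predictor.predict(payload)\n", "# The raw response may be bytes or str depending on the deserializer in use.\n", "predicted_rings = float(response.decode(\"utf-8\") if isinstance(response, bytes) else response)\n", "estimated_age_years = predicted_rings + 1.5\n", "print(estimated_age_years)" ] },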
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Passing the payload with both schema and the data\n", "Next we will pass an input payload comprising both the schema and the data. Note that this schema is slightly different from the one we passed via the environment variable: the positions of the `length` and `sex` columns have been swapped, and so has the corresponding data. The server parses the payload with this per-request schema and returns the prediction as expected." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%local\n", "payload = {\n", " \"schema\": {\n", " \"input\": [\n", " {\"name\": \"length\", \"type\": \"double\"},\n", " {\"name\": \"sex\", \"type\": \"string\"},\n", " {\"name\": \"diameter\", \"type\": \"double\"},\n", " {\"name\": \"height\", \"type\": \"double\"},\n", " {\"name\": \"whole_weight\", \"type\": \"double\"},\n", " {\"name\": \"shucked_weight\", \"type\": \"double\"},\n", " {\"name\": \"viscera_weight\", \"type\": \"double\"},\n", " {\"name\": \"shell_weight\", \"type\": \"double\"},\n", " ],\n", " \"output\": {\"name\": \"prediction\", \"type\": \"double\"},\n", " },\n", " \"data\": [0.515, \"F\", 0.425, 0.14, 0.766, 0.304, 0.1725, 0.255],\n", "}\n", "\n", "predictor = Predictor(\n", " endpoint_name=endpoint_name, sagemaker_session=session, serializer=JSONSerializer()\n", ")\n", "print(predictor.predict(payload))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting the Endpoint (Optional)\n", "Next we will delete the endpoint so that you do not incur the cost of keeping it running." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%cleanup -f" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "%%local\n", "predictor.delete_endpoint()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-python-sdk|sparkml_serving_emr_mleap_abalone|sparkml_serving_emr_mleap_abalone.ipynb)\n" ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, 
"category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated 
computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 }, { "_defaultOrder": 55, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 56, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": 
"ml.p4de.24xlarge", "vcpuNum": 96 } ], "instance_type": "ml.m5.4xlarge", "kernelspec": { "display_name": "PySpark (SparkMagic)", "language": "python", "name": "pysparkkernel__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/sagemaker-sparkmagic" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "pyspark", "pygments_lexer": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }