{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Distributed Data Processing using Apache Spark and SageMaker Processing\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Apache Spark is a unified analytics engine for large-scale data processing. The Spark framework is often used within the context of machine learning workflows to run data transformation or feature engineering workloads at scale. Amazon SageMaker provides a set of prebuilt Docker images that include Apache Spark and other dependencies needed to run distributed data processing jobs on Amazon SageMaker. This example notebook demonstrates how to use the prebuilt Spark images on SageMaker Processing using the SageMaker Python SDK.\n", "\n", "This notebook walks through the following scenarios to illustrate the functionality of the SageMaker Spark Container:\n", "\n", "* Running a basic PySpark application using the SageMaker Python SDK's `PySparkProcessor` class\n", "* Viewing the Spark UI via the `start_history_server()` function of a `PySparkProcessor` object\n", "* Adding additional Python and jar file dependencies to jobs\n", "* Running a basic Java/Scala-based Spark job using the SageMaker Python SDK's `SparkJarProcessor` class\n", "* Specifying additional Spark configuration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Runtime\n", "\n", "This notebook takes approximately 22 minutes to run." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Setup](#Setup)\n", "1. [Example 1: Running a basic PySpark application](#Example-1:-Running-a-basic-PySpark-application)\n", "1. [Example 2: Specify additional Python and jar file dependencies](#Example-2:-Specify-additional-Python-and-jar-file-dependencies)\n", "1. [Example 3: Run a Java/Scala Spark application](#Example-3:-Run-a-Java/Scala-Spark-application)\n", "1. [Example 4: Specifying additional Spark configuration](#Example-4:-Specifying-additional-Spark-configuration)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install the latest SageMaker Python SDK\n", "\n", "This notebook requires the latest v2.x version of the SageMaker Python SDK. First, ensure that the latest version is installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U \"sagemaker>2.0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Restart your notebook kernel after upgrading the SDK*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1: Running a basic PySpark application\n", "\n", "The first example is a basic Spark MLlib data processing script. 
This script takes a raw dataset and applies transformations such as string indexing and one-hot encoding.\n", "\n", "### Setup S3 bucket locations and roles\n", "\n", "First, set up some locations in the default SageMaker bucket to store the raw input datasets and the Spark job output. Here, you'll also define the role that will be used to run all SageMaker Processing jobs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "import sagemaker\n", "from time import gmtime, strftime\n", "\n", "sagemaker_logger = logging.getLogger(\"sagemaker\")\n", "sagemaker_logger.setLevel(logging.INFO)\n", "sagemaker_logger.addHandler(logging.StreamHandler())\n", "\n", "sagemaker_session = sagemaker.Session()\n", "bucket = sagemaker_session.default_bucket()\n", "role = sagemaker.get_execution_role()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, you'll download the example dataset from a public SageMaker example files bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fetch the Abalone dataset from the SageMaker example files bucket\n", "import boto3\n", "\n", "s3 = boto3.client(\"s3\")\n", "s3.download_file(\n", " f\"sagemaker-example-files-prod-{sagemaker_session.boto_region_name}\",\n", " \"datasets/tabular/uci_abalone/abalone.csv\",\n", " \"./data/abalone.csv\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write the PySpark script\n", "\n", "The source for the preprocessing script is in the cell below. The cell uses the `%%writefile` cell magic to save this file locally. The script does some basic feature engineering on a raw input dataset. In this example, the dataset is the [Abalone Data Set](https://archive.ics.uci.edu/ml/datasets/abalone). The code below performs string indexing, one-hot encoding, and vector assembly, and combines these steps into a pipeline so that the transformations run in order. The script then does an 80-20 split to produce training and validation datasets as output."
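, "\n", "\n", "If you'd like a quick, local feel for what the indexing and encoding steps produce before running the full job, the short sketch below is an optional illustration (it is not part of the processing job and assumes `pyspark` is installed in the local notebook environment). It applies the same `StringIndexer` and `OneHotEncoder` stages to a toy DataFrame:\n", "\n", "```python\n", "# Optional local sketch: illustrates the StringIndexer -> OneHotEncoder steps\n", "# used by the preprocessing script on a tiny in-memory DataFrame.\n", "from pyspark.sql import SparkSession\n", "from pyspark.ml import Pipeline\n", "from pyspark.ml.feature import OneHotEncoder, StringIndexer\n", "\n", "spark = SparkSession.builder.master(\"local[1]\").appName(\"EncodingSketch\").getOrCreate()\n", "toy_df = spark.createDataFrame([(\"M\",), (\"F\",), (\"I\",)], [\"sex\"])\n", "\n", "indexer = StringIndexer(inputCol=\"sex\", outputCol=\"indexed_sex\")\n", "encoder = OneHotEncoder(inputCol=\"indexed_sex\", outputCol=\"sex_vec\")\n", "pipeline = Pipeline(stages=[indexer, encoder])\n", "pipeline.fit(toy_df).transform(toy_df).show()\n", "\n", "spark.stop()\n", "```"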
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile ./code/preprocess.py\n", "from __future__ import print_function\n", "from __future__ import unicode_literals\n", "\n", "import argparse\n", "import csv\n", "import os\n", "import shutil\n", "import sys\n", "import time\n", "\n", "import pyspark\n", "from pyspark.sql import SparkSession\n", "from pyspark.ml import Pipeline\n", "from pyspark.ml.feature import (\n", " OneHotEncoder,\n", " StringIndexer,\n", " VectorAssembler,\n", " VectorIndexer,\n", ")\n", "from pyspark.sql.functions import *\n", "from pyspark.sql.types import (\n", " DoubleType,\n", " StringType,\n", " StructField,\n", " StructType,\n", ")\n", "\n", "\n", "def csv_line(data):\n", " r = \",\".join(str(d) for d in data[1])\n", " return str(data[0]) + \",\" + r\n", "\n", "\n", "def main():\n", " parser = argparse.ArgumentParser(description=\"app inputs and outputs\")\n", " parser.add_argument(\"--s3_input_bucket\", type=str, help=\"s3 input bucket\")\n", " parser.add_argument(\"--s3_input_key_prefix\", type=str, help=\"s3 input key prefix\")\n", " parser.add_argument(\"--s3_output_bucket\", type=str, help=\"s3 output bucket\")\n", " parser.add_argument(\"--s3_output_key_prefix\", type=str, help=\"s3 output key prefix\")\n", " args = parser.parse_args()\n", "\n", " spark = SparkSession.builder.appName(\"PySparkApp\").getOrCreate()\n", "\n", " # This is needed to save RDDs which is the only way to write nested Dataframes into CSV format\n", " spark.sparkContext._jsc.hadoopConfiguration().set(\n", " \"mapred.output.committer.class\", \"org.apache.hadoop.mapred.FileOutputCommitter\"\n", " )\n", "\n", " # Defining the schema corresponding to the input data. The input data does not contain the headers\n", " schema = StructType(\n", " [\n", " StructField(\"sex\", StringType(), True),\n", " StructField(\"length\", DoubleType(), True),\n", " StructField(\"diameter\", DoubleType(), True),\n", " StructField(\"height\", DoubleType(), True),\n", " StructField(\"whole_weight\", DoubleType(), True),\n", " StructField(\"shucked_weight\", DoubleType(), True),\n", " StructField(\"viscera_weight\", DoubleType(), True),\n", " StructField(\"shell_weight\", DoubleType(), True),\n", " StructField(\"rings\", DoubleType(), True),\n", " ]\n", " )\n", "\n", " # Downloading the data from S3 into a Dataframe\n", " total_df = spark.read.csv(\n", " (\"s3://\" + os.path.join(args.s3_input_bucket, args.s3_input_key_prefix, \"abalone.csv\")),\n", " header=False,\n", " schema=schema,\n", " )\n", "\n", " # StringIndexer on the sex column which has categorical value\n", " sex_indexer = StringIndexer(inputCol=\"sex\", outputCol=\"indexed_sex\")\n", "\n", " # one-hot-encoding is being performed on the string-indexed sex column (indexed_sex)\n", " sex_encoder = OneHotEncoder(inputCol=\"indexed_sex\", outputCol=\"sex_vec\")\n", "\n", " # vector-assembler will bring all the features to a 1D vector for us to save easily into CSV format\n", " assembler = VectorAssembler(\n", " inputCols=[\n", " \"sex_vec\",\n", " \"length\",\n", " \"diameter\",\n", " \"height\",\n", " \"whole_weight\",\n", " \"shucked_weight\",\n", " \"viscera_weight\",\n", " \"shell_weight\",\n", " ],\n", " outputCol=\"features\",\n", " )\n", "\n", " # The pipeline is comprised of the steps added above\n", " pipeline = Pipeline(stages=[sex_indexer, sex_encoder, assembler])\n", "\n", " # This step trains the feature transformers\n", " model = pipeline.fit(total_df)\n", "\n", " # This step 
transforms the dataset with information obtained from the previous fit\n", " transformed_total_df = model.transform(total_df)\n", "\n", " # Split the overall dataset into 80-20 training and validation\n", " (train_df, validation_df) = transformed_total_df.randomSplit([0.8, 0.2])\n", "\n", " # Convert the train dataframe to RDD to save in CSV format and upload to S3\n", " train_rdd = train_df.rdd.map(lambda x: (x.rings, x.features))\n", " train_lines = train_rdd.map(csv_line)\n", " train_lines.saveAsTextFile(\n", " \"s3://\" + os.path.join(args.s3_output_bucket, args.s3_output_key_prefix, \"train\")\n", " )\n", "\n", " # Convert the validation dataframe to RDD to save in CSV format and upload to S3\n", " validation_rdd = validation_df.rdd.map(lambda x: (x.rings, x.features))\n", " validation_lines = validation_rdd.map(csv_line)\n", " validation_lines.saveAsTextFile(\n", " \"s3://\" + os.path.join(args.s3_output_bucket, args.s3_output_key_prefix, \"validation\")\n", " )\n", "\n", "\n", "if __name__ == \"__main__\":\n", " main()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run the SageMaker Processing Job\n", "\n", "Next, you'll use the `PySparkProcessor` class to define a Spark job and run it using SageMaker Processing. A few things to note in the definition of the `PySparkProcessor`:\n", "\n", "* This is a multi-node job with two m5.xlarge instances (specified via the `instance_count` and `instance_type` parameters)\n", "* Spark framework version 3.1 is specified via the `framework_version` parameter\n", "* The PySpark script defined above is passed via the `submit_app` parameter\n", "* Command-line arguments to the PySpark script (such as the S3 input and output locations) are passed via the `arguments` parameter\n", "* Spark event logs will be offloaded to the S3 location specified in `spark_event_logs_s3_uri` and can be used to view the Spark UI while the job is in progress or after it completes\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.spark.processing import PySparkProcessor\n", "\n", "# Upload the raw input dataset to a unique S3 location\n", "timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "prefix = \"sagemaker/spark-preprocess-demo/{}\".format(timestamp_prefix)\n", "input_prefix_abalone = \"{}/input/raw/abalone\".format(prefix)\n", "input_preprocessed_prefix_abalone = \"{}/input/preprocessed/abalone\".format(prefix)\n", "\n", "sagemaker_session.upload_data(\n", " path=\"./data/abalone.csv\", bucket=bucket, key_prefix=input_prefix_abalone\n", ")\n", "\n", "# Run the processing job\n", "spark_processor = PySparkProcessor(\n", " base_job_name=\"sm-spark\",\n", " framework_version=\"3.1\",\n", " role=role,\n", " instance_count=2,\n", " instance_type=\"ml.m5.xlarge\",\n", " max_runtime_in_seconds=1200,\n", ")\n", "\n", "spark_processor.run(\n", " submit_app=\"./code/preprocess.py\",\n", " arguments=[\n", " \"--s3_input_bucket\",\n", " bucket,\n", " \"--s3_input_key_prefix\",\n", " input_prefix_abalone,\n", " \"--s3_output_bucket\",\n", " bucket,\n", " \"--s3_output_key_prefix\",\n", " input_preprocessed_prefix_abalone,\n", " ],\n", " spark_event_logs_s3_uri=\"s3://{}/{}/spark_event_logs\".format(bucket, prefix),\n", " logs=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Validate Data Processing Results\n", "\n", "Next, validate the output of the data preprocessing job by looking at the first 5 rows of the output dataset."
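, "\n", "\n", "The next cell peeks at the processed training data with the AWS CLI. If you prefer to stay in Python, a rough equivalent is sketched below; it reuses the `s3` client and `bucket` variable defined earlier, and assumes the job wrote a `part-00000` object under the `train/` prefix:\n", "\n", "```python\n", "# Optional sketch: read the first five lines of the processed training data with boto3.\n", "obj = s3.get_object(\n", "    Bucket=bucket, Key=\"{}/train/part-00000\".format(input_preprocessed_prefix_abalone)\n", ")\n", "head = obj[\"Body\"].read().decode(\"utf-8\").splitlines()[:5]\n", "print(\"\\n\".join(head))\n", "```"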
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Top 5 rows from s3://{}/{}/train/\".format(bucket, input_preprocessed_prefix_abalone))\n", "!aws s3 cp --quiet s3://$bucket/$input_preprocessed_prefix_abalone/train/part-00000 - | head -n5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View the Spark UI\n", "\n", "Next, you can view the Spark UI by running the history server locally in this notebook. (**Note:** this feature will only work in a local development environment with docker installed or on a Sagemaker Notebook Instance. This feature does not currently work in SageMaker Studio.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# uses docker\n", "spark_processor.start_history_server()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After viewing the Spark UI, you can terminate the history server before proceeding." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "spark_processor.terminate_history_server()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2: Specify additional Python and jar file dependencies\n", "\n", "The next example demonstrates a scenario where additional Python file dependencies are required by the PySpark script. You'll use a sample PySpark script that requires additional user-defined functions (UDFs) defined in a local module." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile ./code/hello_py_spark_app.py\n", "import argparse\n", "import time\n", "\n", "# Import local module to test spark-submit--py-files dependencies\n", "import hello_py_spark_udfs as udfs\n", "from pyspark.sql import SparkSession, SQLContext\n", "from pyspark.sql.functions import udf\n", "from pyspark.sql.types import IntegerType\n", "import time\n", "\n", "if __name__ == \"__main__\":\n", " print(\"Hello World, this is PySpark!\")\n", "\n", " parser = argparse.ArgumentParser(description=\"inputs and outputs\")\n", " parser.add_argument(\"--input\", type=str, help=\"path to input data\")\n", " parser.add_argument(\"--output\", required=False, type=str, help=\"path to output data\")\n", " args = parser.parse_args()\n", " spark = SparkSession.builder.appName(\"SparkTestApp\").getOrCreate()\n", " sqlContext = SQLContext(spark.sparkContext)\n", "\n", " # Load test data set\n", " inputPath = args.input\n", " outputPath = args.output\n", " salesDF = spark.read.json(inputPath)\n", " salesDF.printSchema()\n", "\n", " salesDF.createOrReplaceTempView(\"sales\")\n", "\n", " # Define a UDF that doubles an integer column\n", " # The UDF function is imported from local module to test spark-submit--py-files dependencies\n", " double_udf_int = udf(udfs.double_x, IntegerType())\n", "\n", " # Save transformed data set to disk\n", " salesDF.select(\"date\", \"sale\", double_udf_int(\"sale\").alias(\"sale_double\")).write.json(\n", " outputPath\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile ./code/hello_py_spark_udfs.py\n", "def double_x(x):\n", " return x + x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a processing job with Python file dependencies\n", "\n", "Then, you'll create a processing job where the additional Python file dependencies are specified via the `submit_py_files` argument in the `run()` function. 
If your Spark application requires additional jar file dependencies, these can be specified via the `submit_jars` argument of the `run()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define job input/output URIs\n", "timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "prefix = \"sagemaker/spark-preprocess-demo/{}\".format(timestamp_prefix)\n", "input_prefix_sales = \"{}/input/sales\".format(prefix)\n", "output_prefix_sales = \"{}/output/sales\".format(prefix)\n", "input_s3_uri = \"s3://{}/{}\".format(bucket, input_prefix_sales)\n", "output_s3_uri = \"s3://{}/{}\".format(bucket, output_prefix_sales)\n", "\n", "sagemaker_session.upload_data(\n", " path=\"./data/data.jsonl\", bucket=bucket, key_prefix=input_prefix_sales\n", ")\n", "\n", "spark_processor = PySparkProcessor(\n", " base_job_name=\"sm-spark-udfs\",\n", " framework_version=\"3.1\",\n", " role=role,\n", " instance_count=2,\n", " instance_type=\"ml.m5.xlarge\",\n", " max_runtime_in_seconds=1200,\n", ")\n", "\n", "spark_processor.run(\n", " submit_app=\"./code/hello_py_spark_app.py\",\n", " submit_py_files=[\"./code/hello_py_spark_udfs.py\"],\n", " arguments=[\"--input\", input_s3_uri, \"--output\", output_s3_uri],\n", " logs=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Validate Data Processing Results\n", "\n", "Next, validate the output of the Spark job by ensuring that the output URI contains the Spark `_SUCCESS` file along with the output json lines file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Output files in {}\".format(output_s3_uri))\n", "!aws s3 ls $output_s3_uri/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 3: Run a Java/Scala Spark application\n", "\n", "In the next example, you'll take a Spark application jar (located in `./code/spark-test-app.jar`) that is already built and run it using SageMaker Processing. Here, you'll use the `SparkJarProcessor` class to define the job parameters. 
\n", "\n", "In the `run()` function you'll specify: \n", "\n", "* The location of the Spark application jar file in the `submit_app` argument\n", "* The main class for the Spark application in the `submit_class` argument\n", "* Input/output arguments for the Spark application" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.spark.processing import SparkJarProcessor\n", "\n", "# Upload the raw input dataset to S3\n", "timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "prefix = \"sagemaker/spark-preprocess-demo/{}\".format(timestamp_prefix)\n", "input_prefix_sales = \"{}/input/sales\".format(prefix)\n", "output_prefix_sales = \"{}/output/sales\".format(prefix)\n", "input_s3_uri = \"s3://{}/{}\".format(bucket, input_prefix_sales)\n", "output_s3_uri = \"s3://{}/{}\".format(bucket, output_prefix_sales)\n", "\n", "sagemaker_session.upload_data(\n", " path=\"./data/data.jsonl\", bucket=bucket, key_prefix=input_prefix_sales\n", ")\n", "\n", "spark_processor = SparkJarProcessor(\n", " base_job_name=\"sm-spark-java\",\n", " framework_version=\"3.1\",\n", " role=role,\n", " instance_count=2,\n", " instance_type=\"ml.m5.xlarge\",\n", " max_runtime_in_seconds=1200,\n", ")\n", "\n", "spark_processor.run(\n", " submit_app=\"./code/spark-test-app.jar\",\n", " submit_class=\"com.amazonaws.sagemaker.spark.test.HelloJavaSparkApp\",\n", " arguments=[\"--input\", input_s3_uri, \"--output\", output_s3_uri],\n", " logs=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 4: Specifying additional Spark configuration\n", "\n", "Overriding Spark configuration is crucial for a number of tasks such as tuning your Spark application or configuring the Hive metastore. 
Using the SageMaker Python SDK, you can easily override Spark/Hive/Hadoop configuration.\n", "\n", "The next example demonstrates this by overriding Spark executor memory/cores.\n", "\n", "For more information on configuring your Spark application, see the EMR documentation on [Configuring Applications](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Upload the raw input dataset to a unique S3 location\n", "timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "prefix = \"sagemaker/spark-preprocess-demo/{}\".format(timestamp_prefix)\n", "input_prefix_abalone = \"{}/input/raw/abalone\".format(prefix)\n", "input_preprocessed_prefix_abalone = \"{}/input/preprocessed/abalone\".format(prefix)\n", "\n", "sagemaker_session.upload_data(\n", " path=\"./data/abalone.csv\", bucket=bucket, key_prefix=input_prefix_abalone\n", ")\n", "\n", "spark_processor = PySparkProcessor(\n", " base_job_name=\"sm-spark\",\n", " framework_version=\"3.1\",\n", " role=role,\n", " instance_count=2,\n", " instance_type=\"ml.m5.xlarge\",\n", " max_runtime_in_seconds=1200,\n", ")\n", "\n", "configuration = [\n", " {\n", " \"Classification\": \"spark-defaults\",\n", " \"Properties\": {\"spark.executor.memory\": \"2g\", \"spark.executor.cores\": \"1\"},\n", " }\n", "]\n", "\n", "spark_processor.run(\n", " submit_app=\"./code/preprocess.py\",\n", " arguments=[\n", " \"--s3_input_bucket\",\n", " bucket,\n", " \"--s3_input_key_prefix\",\n", " input_prefix_abalone,\n", " \"--s3_output_bucket\",\n", " bucket,\n", " \"--s3_output_key_prefix\",\n", " input_preprocessed_prefix_abalone,\n", " ],\n", " configuration=configuration,\n", " logs=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 4 }