{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Training and Deploying a Text Classification model using Amazon SageMaker Pipelines" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Background\n", "\n", "[Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/) makes it easy for data scientists and engineers to build, automate, and scale end-to-end machine learning workflows. Machine learning workflows are complex, requiring iteration and experimentation across each step of the machine learning process, such as exploring and preparing data, experimenting with different algorithms, training and turning models, and deploying models to production. Developing and managing these workflows can take weeks or months of coding and manually managing workflow dependencies can become complex. With Amazon SageMaker Pipelines, data science teams have an easy-to-use continuous integration and continuous delivery (CI/CD) service that simplifies the development and management of machine learning workflows at scale.\n", "\n", "In this notebook, we use SageMaker Pipelines to train and deploy a text classification model to predict e-commerce product ratings based on customers’ product reviews. We’ll use BlazingText, one of the SageMaker [built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algorithms-choose.html#built-in-algorithms-benefits), to minimize the amount of effort required to train and deploy the model. [BlazingText](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html) provides highly optimized implementations of Word2vec and text classification algorithms.\n", "\n", "To train the model, we’ll use a sample of data containing e-commerce reviews and associated product ratings. This data has been generated by customers that have rated products on a scale between 1 (worst) and 5 (best), and have left a short review. Our pipeline will start with processing the data for model training and will proceed with model training, evaluation, registry and deployment. The [Women’s E-Commerce Clothing Reviews](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/) dataset has been made available under a [Creative Commons license](https://creativecommons.org/publicdomain/zero/1.0/). A copy of the dataset has been saved in a sample data Amazon S3 bucket. 
In the first section of the notebook, we’ll walk through how to download the data and get started with building the ML workflow as a SageMaker pipeline.\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "- [Project setup](#setup)\n", " - [Set up SageMaker environment](#environment)\n", " - [Obtain training data](#obtain-data)\n", "- [Step 0: Prepare pipeline parameters](#pipeline-parameters)\n", "- [Step 1: Create dataset and train/test split](#dataset-train-test)\n", "- [Step 2: Train text classification model using BlazingText](#train-model)\n", "- [Step 3: Model creation](#create-model)\n", "- [Step 4: Deploy model](#deploy-model)\n", "- [Step 5: Register model](#register-model)\n", "- [Step 6: Create pipeline](#create-pipeline)\n", "- [Use deployed model to return predictions](#return-predictions)\n", "\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Project setup\n", "\n", "In this section, we'll import the necessary packages, set parameters that we can use in the pipeline, and download the training data." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Set up SageMaker environment\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import necessary packages\n", "import boto3\n", "import pandas as pd\n", "import numpy as np\n", "import sagemaker\n", "\n", "print(f\"SageMaker version: {sagemaker.__version__}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the execution role so that SageMaker jobs can read from S3 buckets\n", "role = sagemaker.get_execution_role()\n", "\n", "# source default session parameters (region, default S3 bucket, etc.)\n", "region = boto3.Session().region_name\n", "sagemaker_session = sagemaker.Session()\n", "s3_client = boto3.client(\"s3\", region_name=region)\n", "sagemaker_client = boto3.client(\"sagemaker-runtime\")\n", "default_bucket = sagemaker_session.default_bucket()\n", "prefix = \"sagemaker-pipelines-nlp-demo\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Obtain training data\n", "\n", "To get started, we'll download the training data from a public S3 bucket and then upload the data to our own S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p data\n", "s3_client.download_file(\n", " f\"sagemaker-example-files-prod-{region}\",\n", " \"datasets/tabular/womens_clothing_ecommerce/Womens_Clothing_E-Commerce_Reviews.csv\",\n", " \"data/Womens Clothing E-Commerce Reviews.csv\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/Womens Clothing E-Commerce Reviews.csv\", index_col=0)\n", "df.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "A quick look at the distribution of product ratings shows us that the dataset is imbalanced; there are more observations with good product ratings than poor product ratings. This imbalance in the training dataset may lead the model to favor the majority classes (high product ratings), resulting in poor overall model accuracy. 
During the data processing step, we'll mitigate the impact of the imbalanced dataset by:\n", "\n", "* Grouping ratings of 1 & 2 into a larger 'negative review' category\n", "* Oversampling the 'negative review' minority class\n", "* Ignoring neutral reviews (ratings of 3 & 4) to remove reviews that may be difficult to distinguish from positive and negative reviews. While this means that we will be unable to classify reviews as neutral, this may enable the model to better flag positive and negative reviews and therefore identify customers who require additional customer service attention or who are particularly likely to purchase additional products.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot = df.groupby(\"Rating\")[\"Rating\"].count().plot(kind=\"bar\", title=\"Count of Product Rating\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# upload the data to your default S3 bucket or another S3 bucket of your choosing\n", "local_path = \"data/Womens Clothing E-Commerce Reviews.csv\"\n", "\n", "base_uri = f\"s3://{default_bucket}/{prefix}/data\"\n", "input_data_uri = sagemaker.s3.S3Uploader.upload(\n", " local_path=local_path,\n", " desired_s3_uri=base_uri,\n", ")\n", "print(input_data_uri)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Step 0: Define parameters to parametrize pipeline execution\n", "\n", "Using SageMaker Pipelines, we can define the steps to be included in a pipeline and then use parameters to customize a given execution without having to modify the pipeline definition. We'll provide some default parameter values that can be overridden at pipeline execution.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.workflow.parameters import ParameterInteger, ParameterString\n", "\n", "# specify location of input data\n", "input_data = ParameterString(\n", " name=\"InputData\",\n", " default_value=input_data_uri,\n", ")\n", "\n", "# specify default number of instances for processing step\n", "processing_instance_count = ParameterInteger(name=\"ProcessingInstanceCount\", default_value=1)\n", "\n", "# specify default instance type for processing step\n", "processing_instance_type = ParameterString(\n", " name=\"ProcessingInstanceType\", default_value=\"ml.m4.xlarge\"\n", ")\n", "\n", "# specify default instance type for training step\n", "train_instance_type = ParameterString(\n", " name=\"TrainingInstanceType\",\n", " default_value=\"ml.m4.xlarge\",\n", ")\n", "\n", "# specify default model approval mode\n", "model_approval_status = ParameterString(name=\"ModelApprovalStatus\", default_value=\"Approved\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Step 1: Create Dataset and Train/Test Split\n", "\n", "In this step, an SKLearn processor is used to prepare the data for model training. This data processing includes converting the review text to lowercase and removing carriage returns, line breaks, and punctuation. We also group rating categories and oversample the minority class to reduce training dataset imbalance. 
Additionally, we process the reviews so that the training input data is in the format expected by the BlazingText algorithm (see the [BlazingText input/output documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html#bt-inputoutput) for more information) and we split the data into training, validation, and testing datasets.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p code" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile code/preprocessing.py\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import string\n", "from sklearn.utils import resample\n", "\n", "base_dir = \"/opt/ml/processing\"\n", "\n", "df = pd.read_csv(f\"{base_dir}/input/Womens Clothing E-Commerce Reviews.csv\")\n", "df = df[df[\"Review Text\"].notna()]  # drop rows where review text is missing\n", "\n", "\n", "def process_review(text):\n", " punctuation = string.punctuation\n", " review = text.lower()\n", " review = review.replace(\"\\r\\n\", \" \").replace(\"\\n\\n\", \" \")\n", " translator = str.maketrans(\"\", \"\", punctuation)\n", " review = review.translate(translator)\n", " return review\n", "\n", "\n", "# create columns for concatenated reviews and new labels\n", "df[\"Complete_Review\"] = df[\"Title\"] + \" \" + df[\"Review Text\"]\n", "df = df[df[\"Complete_Review\"].notna()]  # drop rows where the combined review is missing\n", "df[\"Label\"] = df[\"Rating\"].map({1: \"negative\", 2: \"negative\", 3: \"none\", 4: \"none\", 5: \"positive\"})\n", "df = df.loc[df[\"Label\"].isin([\"negative\", \"positive\"])]  # only use positive and negative reviews\n", "df[\"Review\"] = df[\"Complete_Review\"].astype(str).apply(process_review)\n", "df[\"Processed\"] = \"__label__\" + df[\"Label\"].astype(str) + \" \" + df[\"Review\"]\n", "\n", "# create 70/15/15 train/validation/test split\n", "train, validation, test = np.split(df, [int(0.7 * len(df)), int(0.85 * len(df))])\n", "\n", "# deal with imbalanced classes\n", "# only resample the training set so there is no data leakage into the validation and test sets\n", "positive = train.loc[train[\"Label\"] == \"positive\"]\n", "negative = train.loc[train[\"Label\"] == \"negative\"]\n", "\n", "# oversample the minority class\n", "negative_oversample = resample(negative, replace=True, n_samples=len(positive))\n", "\n", "# remake training set using balanced class samples\n", "train = pd.concat([positive, negative_oversample])\n", "\n", "# create Series datasets for BlazingText format\n", "train = train[\"Processed\"]\n", "validation = validation[\"Processed\"]\n", "test = test[\"Processed\"]\n", "\n", "# save datasets\n", "pd.DataFrame(train).to_csv(f\"{base_dir}/train/train.csv\", header=False, index=False)\n", "pd.DataFrame(validation).to_csv(f\"{base_dir}/validation/validation.csv\", header=False, index=False)\n", "pd.DataFrame(test).to_csv(f\"{base_dir}/test/test.csv\", header=False, index=False)\n", "\n", "print(f\"Number of reviews in the training dataset: {train.shape[0]}\")\n", "print(f\"Number of reviews in the validation set: {validation.shape[0]}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.sklearn.processing import SKLearnProcessor\n", "\n", "framework_version = \"0.23-1\"\n", "\n", "sklearn_processor = SKLearnProcessor(\n", " framework_version=framework_version,\n", " instance_type=processing_instance_type,\n", " instance_count=processing_instance_count,\n", " base_job_name=\"sklearn-nlp-process\",\n", " role=role,\n", ")" ] }, {
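 "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration (this cell is optional and not part of the pipeline), the next cell reproduces the `__label__<class> <text>` line format that BlazingText's supervised mode expects; `preprocessing.py` above emits one such line per review. The sample review string below is hypothetical." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# optional local preview of the BlazingText supervised input format;\n", "# this mirrors the process_review transformation in preprocessing.py\n", "import string\n", "\n", "sample_review = \"Horrible! This top was scratchy, and too small.\"  # hypothetical example\n", "processed = sample_review.lower().translate(str.maketrans(\"\", \"\", string.punctuation))\n", "print(f\"__label__negative {processed}\")" ] }, {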
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "from sagemaker.workflow.steps import ProcessingStep\n", "\n", "s3_client.upload_file(\n", " Filename=\"./code/preprocessing.py\", Bucket=default_bucket, Key=f\"{prefix}/code/preprocessing.py\"\n", ")\n", "preprocess_script_uri = f\"s3://{default_bucket}/{prefix}/code/preprocessing.py\"\n", "\n", "process_step = ProcessingStep(\n", " name=\"BTDemoProcessStep\",\n", " processor=sklearn_processor,\n", " inputs=[\n", " ProcessingInput(source=input_data, destination=\"/opt/ml/processing/input\"),\n", " ],\n", " outputs=[\n", " ProcessingOutput(output_name=\"train\", source=\"/opt/ml/processing/train\"),\n", " ProcessingOutput(output_name=\"validation\", source=\"/opt/ml/processing/validation\"),\n", " ProcessingOutput(output_name=\"test\", source=\"/opt/ml/processing/test\"),\n", " ],\n", " code=preprocess_script_uri,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Step 2: Train text classification model using BlazingText" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set up estimator:\n", "\n", "from sagemaker.estimator import Estimator\n", "\n", "bt_estimator = Estimator(\n", " role=role,\n", " instance_type=train_instance_type,\n", " instance_count=1,\n", " image_uri=sagemaker.image_uris.retrieve(\"blazingtext\", region),\n", " output_path=f\"s3://{default_bucket}/{prefix}/training_jobs\",\n", " base_job_name=\"bt-model-estimator\",\n", " input_mode=\"File\",\n", ")\n", "\n", "# for more info on hyperparameters, see: https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html\n", "bt_estimator.set_hyperparameters(\n", " mode=\"supervised\",\n", " epochs=25,\n", " learning_rate=0.02,\n", " min_count=2,\n", " early_stopping=True,\n", " patience=4,\n", " min_epochs=10,\n", " word_ngrams=3,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set up model training step\n", "from sagemaker.inputs import TrainingInput\n", "from sagemaker.workflow.steps import TrainingStep\n", "\n", "train_step = TrainingStep(\n", " name=\"BTDemoTrainStep\",\n", " estimator=bt_estimator,\n", " inputs={\n", " \"train\": sagemaker.inputs.TrainingInput(\n", " s3_data=process_step.properties.ProcessingOutputConfig.Outputs[\"train\"].S3Output.S3Uri,\n", " content_type=\"text/csv\",\n", " ),\n", " \"validation\": sagemaker.inputs.TrainingInput(\n", " s3_data=process_step.properties.ProcessingOutputConfig.Outputs[\"test\"].S3Output.S3Uri,\n", " content_type=\"text/csv\",\n", " ),\n", " },\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Step 3: Model creation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.workflow.steps import CreateModelStep\n", "\n", "model = sagemaker.model.Model(\n", " name=\"nlp-blaztext-model\",\n", " image_uri=sagemaker.image_uris.retrieve(\"blazingtext\", region),\n", " model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,\n", " sagemaker_session=sagemaker_session,\n", " role=role,\n", ")\n", "\n", "inputs = sagemaker.inputs.CreateModelInput(instance_type=\"ml.m4.xlarge\")\n", "\n", "create_model_step = CreateModelStep(name=\"BTDemoCreatemodelStep\", model=model, inputs=inputs)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": 
{}, "source": [ "\n", "\n", "### Step 4: Deploy model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile code/deploy_model.py\n", "\n", "import time\n", "from datetime import datetime\n", "import boto3\n", "import argparse\n", "\n", "\n", "# Parse argument variables passed via the DeployModel processing step\n", "parser = argparse.ArgumentParser()\n", "parser.add_argument(\"--model-name\", type=str)\n", "parser.add_argument(\"--region\", type=str)\n", "parser.add_argument(\"--endpoint-instance-type\", type=str)\n", "parser.add_argument(\"--endpoint-name\", type=str)\n", "args = parser.parse_args()\n", "\n", "region = args.region\n", "boto3.setup_default_session(region_name=region)\n", "sagemaker_boto_client = boto3.client(\"sagemaker\")\n", "\n", "# truncate name per sagameker length requirememnts (63 char max) if necessary\n", "endpoint_config_name = f'{args.endpoint_name}-config-{datetime.now().strftime(\"%Y%m%d-%H%M%S\")}'\n", "\n", "# create new endpoint config file\n", "create_ep_config_response = sagemaker_boto_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"InstanceType\": args.endpoint_instance_type,\n", " \"InitialVariantWeight\": 1,\n", " \"InitialInstanceCount\": 1,\n", " \"ModelName\": args.model_name,\n", " \"VariantName\": \"AllTraffic\",\n", " }\n", " ],\n", ")\n", "\n", "print(\"ModelName: {}\".format(args.model_name))\n", "\n", "# create endpoint if model endpoint does not already exist, otherwise update the endpoint\n", "try:\n", " create_endpoint_response = sagemaker_boto_client.create_endpoint(\n", " EndpointName=args.endpoint_name, EndpointConfigName=endpoint_config_name\n", " )\n", "except:\n", " create_endpoint_response = sagemaker_boto_client.update_endpoint(\n", " EndpointName=args.endpoint_name, EndpointConfigName=endpoint_config_name\n", " )\n", "\n", "endpoint_info = sagemaker_boto_client.describe_endpoint(EndpointName=args.endpoint_name)\n", "endpoint_status = endpoint_info[\"EndpointStatus\"]\n", "\n", "while endpoint_status != \"InService\":\n", " endpoint_info = sagemaker_boto_client.describe_endpoint(EndpointName=args.endpoint_name)\n", " endpoint_status = endpoint_info[\"EndpointStatus\"]\n", " print(\"Endpoint status:\", endpoint_status)\n", " if endpoint_status != \"InService\":\n", " time.sleep(30)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file(\n", " Filename=\"./code/deploy_model.py\", Bucket=default_bucket, Key=f\"{prefix}/code/deploy_model.py\"\n", ")\n", "deploy_model_script_uri = f\"s3://{default_bucket}/{prefix}/code/deploy_model.py\"\n", "pipeline_endpoint_name = \"nlp-blaztext-model-endpoint\"\n", "\n", "deployment_instance_type = \"ml.m4.xlarge\"\n", "\n", "deploy_model_processor = SKLearnProcessor(\n", " framework_version=\"0.23-1\",\n", " role=role,\n", " instance_type=\"ml.m5.xlarge\",\n", " instance_count=1,\n", " volume_size_in_gb=60,\n", " base_job_name=\"nlp-blaztext-deploy-model\",\n", " sagemaker_session=sagemaker_session,\n", ")\n", "\n", "deploy_step = ProcessingStep(\n", " name=\"BTDemoDeployStep\",\n", " processor=deploy_model_processor,\n", " job_arguments=[\n", " \"--model-name\",\n", " create_model_step.properties.ModelName,\n", " \"--region\",\n", " region,\n", " \"--endpoint-instance-type\",\n", " deployment_instance_type,\n", " \"--endpoint-name\",\n", " pipeline_endpoint_name,\n", " ],\n", " 
code=deploy_model_script_uri,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Step 5: Register model\n", "\n", "In this step, we register the trained model, using the `model_approval_status` ParameterString defined at the outset of the pipeline code." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.workflow.step_collections import RegisterModel\n", "\n", "register_step = RegisterModel(\n", " name=\"BTDemoRegisterModelStep\",\n", " estimator=bt_estimator,\n", " model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,\n", " content_types=[\"text/csv\"],\n", " response_types=[\"text/csv\"],\n", " inference_instances=[\"ml.t2.medium\", \"ml.m5.xlarge\"],\n", " transform_instances=[\"ml.m5.xlarge\"],\n", " model_package_group_name=prefix,\n", " approval_status=model_approval_status,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Step 6: Create pipeline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.workflow.pipeline import Pipeline\n", "\n", "# steps for the full pipeline\n", "steps_full = [process_step, train_step, create_model_step, deploy_step, register_step]\n", "\n", "# steps for running only the data processing step\n", "steps_preprocessing = [process_step]\n", "\n", "pipeline_name = \"BlazingTextPipeline\"\n", "\n", "pipeline = Pipeline(\n", " name=pipeline_name,\n", " parameters=[\n", " processing_instance_type,\n", " processing_instance_count,\n", " train_instance_type,\n", " model_approval_status,\n", " input_data,\n", " ],\n", " steps=steps_full,  # switch to steps_preprocessing if you would like to run only the data processing step\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Examine the JSON pipeline definition:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import json\n", "\n", "definition = json.loads(pipeline.definition())\n", "definition" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Submit the pipeline definition to the SageMaker Pipelines service to create a pipeline if it doesn't exist, or update the pipeline if it does. The role passed in is used by SageMaker Pipelines to create all the jobs defined in the steps." 
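] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As an optional sanity check before submitting (a minimal sketch, assuming the standard pipeline definition schema with a top-level `Parameters` list as shown in the output above), you can list the parameter names in the parsed definition:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# optional sanity check: list the parameter names in the parsed definition\n", "print([p[\"Name\"] for p in definition[\"Parameters\"]])"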
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline.upsert(role_arn=role)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Start a pipeline execution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "execution = pipeline.start()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Describe the pipeline execution status to ensure that it has been created and started successfully:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "execution.describe()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Wait for the execution to finish:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "execution.wait()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "List the execution steps and their status:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "execution.list_steps()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Use deployed model to return predictions\n", "\n", "Now that our model is deployed as an endpoint, we can submit sample product reviews and return predicted product ratings. For more information on BlazingText inference, see the [BlazingText algorithm documentation](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/blazingtext_text_classification_dbpedia/blazingtext_text_classification_dbpedia.html#Hosting-/-Inference)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import string\n", "\n", "\n", "def process_review(text):\n", " punctuation = string.punctuation\n", " review = text.lower()\n", " review = review.replace(\"\\r\\n\", \" \").replace(\"\\n\\n\", \" \")\n", " translator = str.maketrans(\"\", \"\", punctuation)\n", " review = review.translate(translator)\n", " return review" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentences = [\n", " \"i loved this blouse when i saw it on-line, and the fabric is so soft!\",\n", " \"love the top, but very small to the size. ordered a medium and had to send back in exchange for x-large\",\n", " \"horrible! 
this top was scratchy and too small.\",\n", "]\n", "# process the reviews in the same way as the training data\n", "processed_sentences = [process_review(sent) for sent in sentences]\n", "\n", "payload = {\"instances\": processed_sentences}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_predictions(payload, endpoint_name, client):\n", " response = client.invoke_endpoint(\n", " EndpointName=endpoint_name, Body=json.dumps(payload), ContentType=\"application/json\"\n", " )\n", " predictions = json.loads(response[\"Body\"].read().decode(\"utf-8\"))\n", " return list(zip(payload[\"instances\"], predictions))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# return predictions\n", "get_predictions(payload, pipeline_endpoint_name, sagemaker_client)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Clean up\n", "\n", "To avoid incurring unnecessary charges, delete the endpoints and resources that you created while running the notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker_session.sagemaker_client.delete_endpoint(EndpointName=pipeline_endpoint_name)\n", "print(f\"Deleting endpoint: {pipeline_endpoint_name}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/use-cases|product_ratings_with_pipelines|pipelines_product_ratings.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.m5.large", "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }