{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Recommendation Engine for E-Commerce Sales - Pipeline Mode\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This notebook gives an overview of the techniques and services offered by SageMaker to build and deploy a personalized recommendation engine.\n", "\n", "## Dataset\n", "\n", "The dataset for this demo comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Online+Retail). It contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts. The following attributes are included in our dataset:\n", "+ InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.\n", "+ StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.\n", "+ Description: Product (item) name. Nominal.\n", "+ Quantity: The quantity of each product (item) per transaction. Numeric.\n", "+ InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.\n", "+ UnitPrice: Unit price. Numeric, product price per unit in sterling.\n", "+ CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.\n", "+ Country: Country name. Nominal, the name of the country where each customer resides. 
\n", "\n", "Citation: Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution Architecture\n", "----\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!pip install -U sagemaker==2.139.0 --quiet" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker.workflow.pipeline import Pipeline\n", "from sagemaker.workflow.steps import CreateModelStep, ProcessingStep, TrainingStep\n", "from sagemaker.workflow.step_collections import RegisterModel\n", "from sagemaker.sklearn.processing import SKLearnProcessor\n", "from sagemaker.workflow.parameters import ParameterInteger, ParameterFloat, ParameterString\n", "import datetime\n", "import boto3\n", "import time\n", "import pandas as pd\n", "from preprocessing import loadDataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from packaging.version import Version\n", "\n", "# Parse the versions before comparing; as plain strings, \"2.99.0\" would sort above \"2.139.0\".\n", "assert Version(sagemaker.__version__) >= Version(\"2.139.0\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "region = boto3.Session().region_name\n", "boto3.setup_default_session(region_name=region)\n", "boto_session = boto3.Session(region_name=region)\n", "\n", "s3_client = boto3.client(\"s3\", region_name=region)\n", "\n", "sagemaker_boto_client = boto_session.client(\"sagemaker\")\n", "sagemaker_session = sagemaker.session.Session(\n", " boto_session=boto_session, sagemaker_client=sagemaker_boto_client\n", ")\n", "sagemaker_role = sagemaker.get_execution_role()\n", "\n", "bucket = 
sagemaker_session.default_bucket()\n", "\n", "prefix = \"personalization\"\n", "\n", "output_prefix = f\"s3://{bucket}/{prefix}/output\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Estimator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, calculate the number of feature dimensions, because it is a hyperparameter of the estimator. To do so, load the dataset, clean and preprocess it as defined in the first part of [Recommendation Engine for E-Commerce Sales](retail_recommend.ipynb), and then count the feature dimensions in the processed dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/Online Retail.csv\")\n", "df.dropna(subset=[\"CustomerID\"], inplace=True)\n", "df[\"Description\"] = df[\"Description\"].apply(lambda x: x.strip())\n", "df = df.groupby([\"StockCode\", \"Description\", \"CustomerID\", \"Country\", \"UnitPrice\"])[\n", " \"Quantity\"\n", "].sum()\n", "df = df.loc[df > 0].reset_index()\n", "X, y = loadDataset(df)\n", "input_dims = X.shape[1]\n", "input_dims" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With all the required hyperparameters calculated, create the estimator." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "container = sagemaker.image_uris.retrieve(\"factorization-machines\", region=boto_session.region_name)\n", "\n", "fm = sagemaker.estimator.Estimator(\n", " container,\n", " sagemaker_role,\n", " instance_count=1,\n", " instance_type=\"ml.c5.xlarge\",\n", " output_path=output_prefix,\n", " sagemaker_session=sagemaker_session,\n", ")\n", "\n", "fm.set_hyperparameters(\n", " feature_dim=input_dims,\n", " predictor_type=\"regressor\",\n", " mini_batch_size=1000,\n", " num_factors=64,\n", " epochs=20,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build Pipeline\n", "\n", "Now that we are comfortable with the model we built, we can assemble a pipeline that preprocesses the data, trains the model, creates and registers it, and deploys it to an endpoint." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base_uri = f\"s3://{bucket}/data\"\n", "input_data_uri = sagemaker.s3.S3Uploader.upload(\n", " local_path=\"data/Online Retail.csv\", desired_s3_uri=base_uri\n", ")\n", "\n", "input_data = ParameterString(name=\"InputData\", default_value=input_data_uri)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_approval_status = ParameterString(\n", " name=\"ModelApprovalStatus\", default_value=\"PendingManualApproval\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "create_dataset_script_uri = f\"s3://{bucket}/{prefix}/code/preprocessing.py\"\n", "s3_client.upload_file(\n", " Filename=\"preprocessing.py\", Bucket=bucket, Key=f\"{prefix}/code/preprocessing.py\"\n", ")\n", "\n", "sklearn_processor = SKLearnProcessor(\n", " framework_version=\"1.2-1\",\n", " instance_type=\"ml.m5.xlarge\",\n", " instance_count=1,\n", " base_job_name=\"sklearn-retail-sales-process\",\n", " role=sagemaker_role,\n", ")\n", "\n", "create_dataset_step = ProcessingStep(\n", " name=\"PreprocessData\",\n", " 
processor=sklearn_processor,\n", " inputs=[\n", " sagemaker.processing.ProcessingInput(\n", " source=input_data, destination=\"/opt/ml/processing/input\"\n", " ),\n", " ],\n", " outputs=[\n", " sagemaker.processing.ProcessingOutput(\n", " output_name=\"train_data\", source=\"/opt/ml/processing/output/train\"\n", " ),\n", " sagemaker.processing.ProcessingOutput(\n", " output_name=\"test_data\", source=\"/opt/ml/processing/output/test\"\n", " ),\n", " ],\n", " code=create_dataset_script_uri,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_step = TrainingStep(\n", " name=\"TrainingStep\",\n", " estimator=fm,\n", " inputs={\n", " \"train\": sagemaker.inputs.TrainingInput(\n", " s3_data=create_dataset_step.properties.ProcessingOutputConfig.Outputs[\n", " \"train_data\"\n", " ].S3Output.S3Uri\n", " ),\n", " \"test\": sagemaker.inputs.TrainingInput(\n", " s3_data=create_dataset_step.properties.ProcessingOutputConfig.Outputs[\n", " \"test_data\"\n", " ].S3Output.S3Uri\n", " ),\n", " },\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = sagemaker.model.Model(\n", " name=\"retail-personalization-factorization-machine\",\n", " image_uri=container,\n", " model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,\n", " sagemaker_session=sagemaker_session,\n", " role=sagemaker_role,\n", ")\n", "\n", "inputs = sagemaker.inputs.CreateModelInput(instance_type=\"ml.m4.xlarge\")\n", "\n", "create_model_step = CreateModelStep(name=\"CreateModel\", model=model, inputs=inputs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "timestamp = datetime.datetime.now().strftime(\"%Y-%m-%d-%H-%M\")\n", "mpg_name = f\"retail-recommendation-{timestamp}\"\n", "\n", "register_step = RegisterModel(\n", " name=\"RegisterModel\",\n", " estimator=fm,\n", " model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,\n", " 
content_types=[\"application/x-recordio-protobuf\", \"application/json\"],\n", " response_types=[\"text/csv\"],\n", " inference_instances=[\"ml.t2.medium\", \"ml.m5.xlarge\"],\n", " transform_instances=[\"ml.m5.xlarge\"],\n", " model_package_group_name=mpg_name,\n", " approval_status=model_approval_status,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "s3_client.upload_file(Filename=\"deploy.py\", Bucket=bucket, Key=f\"{prefix}/code/deploy.py\")\n", "deploy_script_uri = f\"s3://{bucket}/{prefix}/code/deploy.py\"\n", "\n", "deployment_processor = SKLearnProcessor(\n", " framework_version=\"1.2-1\",\n", " role=sagemaker_role,\n", " instance_type=\"ml.t3.medium\",\n", " instance_count=1,\n", " base_job_name=f\"{prefix}-deploy\",\n", " sagemaker_session=sagemaker_session,\n", ")\n", "\n", "deploy_step = ProcessingStep(\n", " name=\"DeployModel\",\n", " processor=deployment_processor,\n", " job_arguments=[\n", " \"--model-name\",\n", " create_model_step.properties.ModelName,\n", " \"--region\",\n", " region,\n", " \"--endpoint-instance-type\",\n", " \"ml.m4.xlarge\",\n", " \"--endpoint-name\",\n", " \"retail-recommendation-endpoint\",\n", " ],\n", " code=deploy_script_uri,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_name = f\"PersonalizationDemo\"\n", "\n", "pipeline = Pipeline(\n", " name=pipeline_name,\n", " parameters=[input_data, model_approval_status],\n", " steps=[create_dataset_step, train_step, create_model_step, register_step, deploy_step],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline.upsert(role_arn=sagemaker_role)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_response = pipeline.start()\n", "start_response.wait()\n", "start_response.describe()" ] }, { "attachments": { 
"image.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.lineage.visualizer import LineageTableVisualizer\n", "from pprint import pprint\n", "\n", "\n", "viz = LineageTableVisualizer(sagemaker_session)\n", "for execution_step in reversed(start_response.list_steps()):\n", " pprint(execution_step)\n", " display(viz.show(pipeline_execution_step=execution_step))\n", " time.sleep(5)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (Data Science 2.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/sagemaker-data-science-38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }