{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "This is a sample binary classification algorithm used to figure out if a patient has heart disease. In this example, we will upload sample data from Cleveland Heart Disease dataset taken from the UCI repository. The dataset consists of 303 individuals data. Please see data repository for column description and sample data.\n", "https://archive.ics.uci.edu/ml/datasets/heart+Disease. You can download the sample data, place the data in an S3 bucket, and execute the cells in this notebook to build and deploy you own model. \n", "\n", "The rest of this tutorial walks you through using binary classification algorithm to predict heart disease.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prequisites and Preprocessing\n", "\n", "### Permissions and environment variables\n", "\n", "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n", "\n", "Let's start by specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. Once you have created your S3 bucket, specify the bucket name and prefix.\n", "- The IAM role arn used to give training and hosting access to your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "isConfigCell": true, "tags": [ "parameters" ] }, "outputs": [], "source": [ "#Enter bucket name\n", "bucket = '{ENTER_BUCKET_NAME}'\n", "prefix = 'sagemaker/heart'\n", "\n", "#Enter data file name (e.g. heart.csv)\n", "data_key = 'heart.csv'\n", "data_location = 's3://{}/{}'.format(bucket, data_key)\n", " \n", "# Define IAM role\n", "import boto3\n", "import re\n", "from sagemaker import get_execution_role\n", "\n", "role = get_execution_role()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data ingestion\n", "\n", "Prior to data ingestion, please make sure your data set (e.g. heart_data.csv) from the UCI repository is uploaded to an S3 bucket. By default, SageMaker role has access to buckets that start with 'sageMaker*'. The code below reads the data from the specified S3 bucket and prints out a sample of it.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import json\n", "\n", "# read the data from S3\n", "heart_data = pd.read_csv(data_location)\n", "\n", "#print out a sample of data.\n", "heart_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data conversion\n", "\n", "Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this particular case, the Amazon SageMaker implementation of Linear Learner takes recordIO-wrapped protobuf.\n", "\n", "The code below performs the following:\n", "\n", "Take the data and convert to a numpy array. It has to be of type float32, as that is what the SageMaker Linear Learner algorithm expects.\n", "\n", "The linear learner algorithm requires a data matrix, with rows representing the observations, and columns representing the dimensions of the features. It also requires an additional column that contains the labels that match the data points.\n", "\n", "For input, you give the model labeled examples (x, y). x is a high-dimensional vector and y is a numeric label. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# convert the DataFrame to a float32 numpy array\n", "vectors = np.array(heart_data).astype('float32')\n", "\n", "# target column - value must be either 0 or 1\n", "labels = vectors[:, 13]\n", "print(\"label data is\")\n", "print(labels)\n", "\n", "# drop the target column and use the remaining columns as features\n", "training_data = vectors[:, :13]\n", "print(\"Training data is\")\n", "print(training_data)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's go ahead and upload the training data to S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "import sagemaker.amazon.common as smac\n", "\n", "# convert the training data to RecordIO-wrapped protobuf and upload it to S3\n", "buf = io.BytesIO()\n", "smac.write_numpy_to_dense_tensor(buf, training_data, labels)\n", "buf.seek(0)\n", "\n", "key = 'recordio-pb-data'\n", "boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)\n", "s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)\n", "print('uploaded training data location: {}'.format(s3_train_data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training Artifacts\n", "Once the model is trained, its artifacts will be uploaded to the following location." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n", "print('training artifacts will be uploaded to: {}'.format(output_location))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training the linear model\n", "\n", "Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. Since this dataset is relatively small, it isn't meant to show off the performance of the Linear Learner training algorithm.\n", "\n", "Again, we'll use the Amazon SageMaker Python SDK to kick off training and monitor status until it is completed. In this example, training takes between 7 and 11 minutes. Despite the dataset being small, provisioning hardware and loading the algorithm container take time upfront." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will do a binary classification (the patient either has heart disease or not), train the model on the specified compute instance (e.g. ml.c4.xlarge), and specify the number of features (dimensions) in our training set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.amazon.amazon_estimator import get_image_uri\n", "import sagemaker\n", "\n", "container = get_image_uri(boto3.Session().region_name, 'linear-learner', \"latest\")\n", "\n", "sess = sagemaker.Session()\n", "linear = sagemaker.estimator.Estimator(container,\n", "                                       role,\n", "                                       train_instance_count=1,\n", "                                       train_instance_type='ml.c4.xlarge',\n", "                                       output_path=output_location,\n", "                                       sagemaker_session=sess)\n", "linear.set_hyperparameters(feature_dim=13,\n", "                           predictor_type='binary_classifier',\n", "                           mini_batch_size=100)\n", "\n", "linear.fit({'train': s3_train_data})" ] },
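{ "cell_type": "markdown", "metadata": {}, "source": [ "Once the training job finishes, the serialized model artifact is written under the output location we chose earlier. As a quick check (a minimal sketch; it assumes the SageMaker Python SDK estimator's `model_data` attribute, which holds the artifact's S3 path), we can print exactly where it landed.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the S3 path of the trained model artifact (model.tar.gz)\n", "print('model artifact: {}'.format(linear.model_data))" ] },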
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Set up hosting for the model\n", "Now that we've trained our model, we can deploy it behind an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model dynamically.\n", "\n", "_Note: Amazon SageMaker gives you the flexibility to import models trained elsewhere, as well as the choice of not importing models if the target of model creation is AWS Lambda, AWS Greengrass, Amazon Redshift, Amazon Athena, or another deployment target._" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "heartdisease_predictor = linear.deploy(initial_instance_count=1,\n", "                                       instance_type='ml.m4.xlarge')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validate the model for use\n", "Finally, we can now validate the model for use. We can pass HTTP POST requests to the endpoint to get back predictions. To make this easier, we'll again use the Amazon SageMaker Python SDK and specify how to serialize requests and deserialize responses in a way that is specific to the algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.predictor import csv_serializer, json_deserializer\n", "\n", "heartdisease_predictor.content_type = 'text/csv'\n", "heartdisease_predictor.serializer = csv_serializer\n", "heartdisease_predictor.deserializer = json_deserializer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's print out the endpoint that we will be invoking." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Endpoint name: {}'.format(heartdisease_predictor.endpoint))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try passing the following sample data for testing. This is a single record from the file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the 13 features of a single record, excluding the label\n", "vectors[5][0:13]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's prediction time! Let's see what our model predicts with the given data.\n", "You should see a prediction score of approximately 0.82 and a predicted label of 1 (indicating heart disease)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = heartdisease_predictor.predict(vectors[5][0:13])\n", "print(result)" ] },
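{ "cell_type": "markdown", "metadata": {}, "source": [ "A single record only tells us so much. As a rough follow-up, the sketch below scores every record in the training set against the endpoint and computes the fraction of correct labels. It assumes the response format shown above (a `predictions` list with a `predicted_label` per record); note that accuracy measured on the data the model was trained on is an optimistic estimate.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# score every training record against the endpoint (a rough, optimistic check)\n", "predicted = []\n", "for record in training_data:\n", "    response = heartdisease_predictor.predict(record)\n", "    predicted.append(response['predictions'][0]['predicted_label'])\n", "\n", "accuracy = (np.array(predicted) == labels).mean()\n", "print('training-set accuracy: {:.2%}'.format(accuracy))" ] },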
{ "cell_type": "markdown", "metadata": {}, "source": [ "### (Optional) Delete the Endpoint\n", "\n", "If you're ready to be done with this notebook, please run the delete_endpoint line in the cell below. This will remove the hosted endpoint you created and avoid any charges from a stray instance being left running." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "\n", "sagemaker.Session().delete_endpoint(heartdisease_predictor.endpoint)" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 2 }