{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mammography classification\n", "1. [Introduction](#Introduction)\n", "2. [Prerequisites and Preprocessing](#Prerequisites-and-Preprocessing)\n", "    1. [Permissions and environment variables](#Permissions-and-environment-variables)\n", "    2. [Prepare the data](#Prepare-the-data)\n", "3. [Training the model](#Training-the-model)\n", "    1. [Training parameters](#Training-parameters)\n", "    2. [Start the training](#Start-the-training)\n", "4. [Inference](#Inference)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "Welcome to our mammography classification demo, based on the Amazon SageMaker built-in image classification algorithm. In this demo, we will use the algorithm to train a model on real patients' mammograms.\n", "\n", "The images will be classified into 5 categories:\n", "- Mediolateral-Oblique Left (MLO Left)\n", "- Mediolateral-Oblique Right (MLO Right)\n", "- Cranial-Caudal Left (CC Left)\n", "- Cranial-Caudal Right (CC Right)\n", "- Not a mammography\n", "\n", "To get started, we need to set up the environment with a few prerequisite steps: permissions, configuration, and so on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites and Preprocessing\n", "\n", "### Permissions and environment variables\n", "\n", "Here we set up the linkage and authentication to AWS services. There are three parts to this:\n", "\n", "* The role used to give training and hosting access to your data. 
This is obtained automatically from the role used to start the notebook\n", "* The S3 bucket that you want to use for training and model data\n", "* The Amazon SageMaker image classification Docker image, which does not need to be changed" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%time\n", "import sagemaker\n", "import boto3\n", "from datetime import datetime\n", "from sagemaker import get_execution_role\n", "from botocore.exceptions import ClientError\n", "\n", "role = get_execution_role()\n", "print(role)\n", "\n", "sess = sagemaker.Session()\n", "\n", "# Replace this with the name of the bucket created in the first steps of the workshop\n", "# bucket=<>\n", "\n", "bucket='mammography-workshop-files-...'\n", "\n", "# Validate that the variable 'bucket' was configured correctly\n", "s3 = boto3.resource('s3')\n", "try:\n", "    s3.meta.client.head_bucket(Bucket=bucket)\n", "except ClientError:\n", "    print(\"\\x1b[31m ERROR: Please configure the variable 'bucket' with a valid S3 bucket name \\x1b[0m\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker import image_uris\n", "training_image = image_uris.retrieve(\n", "    region=boto3.Session().region_name, framework=\"image-classification\"\n", ")\n", "print(training_image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training the model\n", "\n", "Now that we are done with all the setup that is needed, we are ready to train our image classifier. To begin, let us create a ``sagemaker.estimator.Estimator`` object. This estimator will launch the training job.\n", "\n", "### Training parameters\n", "There are two kinds of parameters that need to be set for training. The first kind is the parameters for the training job. These include:\n", "\n", "* **Training instance count**: This is the number of instances on which to run the training. 
When the number of instances is greater than one, the image classification algorithm runs in a distributed setting.\n", "* **Training instance type**: This indicates the type of machine on which to run the training. The SageMaker built-in image classification algorithm requires GPU instances for training. For the recommended instances, see: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html#IC-instances\n", "* **Output bucket**: This is the S3 bucket in which the training output will be stored.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "prefix_output='output'\n", "\n", "s3_output_location = 's3://{}/{}'.format(bucket, prefix_output)\n", "\n", "ic = sagemaker.estimator.Estimator(training_image,\n", "    role,\n", "    instance_count=1,\n", "    instance_type='ml.p2.xlarge',\n", "    volume_size=50,\n", "    max_run=360000,\n", "    input_mode='File',\n", "    output_path=s3_output_location,\n", "    sagemaker_session=sess)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apart from the above set of parameters, there are hyperparameters that are specific to the algorithm. These are:\n", "\n", "* **num_layers**: The number of layers (depth) for the network. We use 18 in this sample, but other values such as 50 or 152 can be used.\n", "* **image_shape**: The input image dimensions, 'num_channels, height, width', for the network. It should be no larger than the actual image size. The number of channels should be the same as in the actual image.\n", "* **num_classes**: This is the number of output classes for the new dataset. As mentioned in the beginning, we have 5 different output classes: MLO Right, MLO Left, CC Right, CC Left, Not a mammography.\n", "* **num_training_samples**: This is the total number of training samples. It must match the number of images available for training; the code below sets it to 1752. 
The more training samples you have, the better your chances of getting a good classification model.\n", "* **mini_batch_size**: The number of training samples used for each mini batch. In distributed training, the number of training samples used per batch will be N * mini_batch_size, where N is the number of hosts on which training is run. This is the number of images that will be loaded into memory for training. If this number is too high, your instance might run out of memory. If this number is too low, your accuracy might suffer.\n", "* **epochs**: Number of training epochs, i.e., the number of full passes over the training dataset. The right number of epochs has to be found by experiment, because you don't want to underfit or overfit your model. Just remember that more epochs means taking longer and costing more.\n", "* **learning_rate**: Learning rate for training.\n", "* **top_k**: Report the top-k accuracy during training.\n", "* **precision_dtype**: Training datatype precision (default: float32). 
If set to 'float16', training runs in mixed-precision mode and will be faster than in float32 mode.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "ic.set_hyperparameters(num_layers=18,\n", "    image_shape=\"3,300,150\",\n", "    num_classes=5,\n", "    num_training_samples=1752,\n", "    mini_batch_size=120,\n", "    epochs=20,\n", "    learning_rate=0.01,\n", "    optimizer='sgd',\n", "    top_k=2,\n", "    precision_dtype='float32'\n", "    )\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input data specification\n", "Set the data type and channels used for training the model.\n", "\n", "The next block of code defines where the images are and how they are classified.\n", "\n", "There are two separate datasets: one for **training** and one for **validation**.\n", "\n", "This matters because the model must be evaluated both on data it has already seen and on data it has never seen, so that we can find out whether it generalizes well to new data.\n", "\n", "To give the model the labeled data, we use \".lst\" files, which list the location of every image together with its class.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.inputs import TrainingInput\n", "\n", "# raw-jpg stores all the original images\n", "prefix = 'raw-jpg'\n", "\n", "\n", "# Four channels: train, validation, train_lst, and validation_lst\n", "s3train = 's3://{}/{}/train/'.format(bucket, prefix)\n", "s3validation = 's3://{}/{}/test/'.format(bucket, prefix)\n", "s3train_lst = 's3://{}/{}/train-data.lst'.format(bucket, prefix)\n", "s3validation_lst = 's3://{}/{}/test-data.lst'.format(bucket, prefix)\n", "\n", "train_data = TrainingInput(s3train, distribution='FullyReplicated',\n", "    content_type='application/x-image', s3_data_type='S3Prefix')\n", "validation_data = 
TrainingInput(s3validation, distribution='FullyReplicated',\n", "    content_type='application/x-image', s3_data_type='S3Prefix')\n", "train_lst = TrainingInput(s3train_lst, distribution='FullyReplicated',\n", "    content_type='application/x-image', s3_data_type='S3Prefix')\n", "validation_lst = TrainingInput(s3validation_lst, distribution='FullyReplicated',\n", "    content_type='application/x-image', s3_data_type='S3Prefix')\n", "\n", "data_channels = {'train': train_data, 'validation': validation_data, 'train_lst': train_lst, 'validation_lst': validation_lst}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Note: Ground Truth\n", "\n", "If we were to use the Ground Truth file generated in the previous module, this is where we would declare the output file generated by Ground Truth.\n", "\n", "It would look something like this:\n", "```\n", "train_data = TrainingInput(s3_train_data,\n", "    distribution='FullyReplicated',\n", "    content_type='image/jpeg',\n", "    s3_data_type='AugmentedManifestFile',\n", "    attribute_names=['source-ref', 'sthakur-groundtruth-demo'])\n", "```\n", "\n", "There is a great blog post that explains in more detail how to use the Ground Truth output file in SageMaker. You can check it out later.\n", "https://aws.amazon.com/pt/blogs/machine-learning/easily-train-models-using-datasets-labeled-by-amazon-sagemaker-ground-truth/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start the training\n", "Start training by calling the fit method of the estimator.\n", "\n", "It will take about **10 minutes** to finish the execution. Wait until it ends before executing the next block of code." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "\n", "job_name = 'mammography-classification-' + datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S\")\n", "\n", "ic.fit(inputs=data_channels, logs=True, job_name=job_name)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preparation\n", "\n", "You may have noticed that **the best validation accuracy found was around 75%, which is not ideal yet**.\n", "\n", "A large gap between your **training accuracy** and your **validation accuracy** is also a bad sign.\n", "\n", "When navigating through the mammography images, you may notice that every image has a different size, and this is the main reason why our validation accuracy is so low.\n", "\n", "To prove that, we will train the model again, but now we will use **images that were all resized to the same dimensions**. In our case: 150 width, 300 height.\n", "\n", "We used the OpenCV library (https://opencv.org/) to perform the resize using Python. 
For reference only, the main code executed can be found below:\n", "\n", "    import cv2\n", "\n", "    # filepath is the full path of the mammography image stored locally\n", "    img = cv2.imread(filepath)\n", "\n", "    # dsize is (width, height)\n", "    res = cv2.resize(img, (150, 300), interpolation=cv2.INTER_AREA)\n", "\n", "**For the sake of time, we have already provided you with all the resized images needed for this lab.**\n", "\n", "So, let's run it all over again, but now training on the resized mammography images, which are stored in the same bucket inside the **resize** folder.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "\n", "# The 'resize' folder stores all the resized images\n", "prefix = 'resize'\n", "\n", "# Four channels: train, validation, train_lst, and validation_lst\n", "s3train = 's3://{}/{}/train/'.format(bucket, prefix)\n", "s3validation = 's3://{}/{}/test/'.format(bucket, prefix)\n", "s3train_lst = 's3://{}/{}/train-data.lst'.format(bucket, prefix)\n", "s3validation_lst = 's3://{}/{}/test-data.lst'.format(bucket, prefix)\n", "\n", "train_data = TrainingInput(s3train, distribution='FullyReplicated',\n", "    content_type='application/x-image', s3_data_type='S3Prefix')\n", "validation_data = TrainingInput(s3validation, distribution='FullyReplicated',\n", "    content_type='application/x-image', s3_data_type='S3Prefix')\n", "train_lst = TrainingInput(s3train_lst, distribution='FullyReplicated',\n", "    content_type='application/x-image', s3_data_type='S3Prefix')\n", "validation_lst = TrainingInput(s3validation_lst, distribution='FullyReplicated',\n", "    content_type='application/x-image', s3_data_type='S3Prefix')\n", "\n", "data_channels = {'train': train_data, 'validation': validation_data, 'train_lst': train_lst, 'validation_lst': validation_lst}\n", "\n", "# Note that:\n", "# image_shape=\"num_channels, height, width\"\n", "\n", "ic.set_hyperparameters(num_layers=18,\n", "    image_shape=\"3,300,150\",\n", "    num_classes=5,\n", "    
num_training_samples=1752,\n", "    mini_batch_size=250,\n", "    epochs=20,\n", "    optimizer='sgd',\n", "    learning_rate=0.01,\n", "    top_k=2,\n", "    precision_dtype='float32'\n", "    )\n", "\n", "job_name = 'mammography-classification-' + datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S\")\n", "\n", "ic.fit(inputs=data_channels, logs=True, job_name=job_name)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Training job analytics\n", "\n", "We will now display a table with the metrics collected during training.\n", "These metrics come from Amazon CloudWatch, so the table might not show every checkpoint collected, only the average of those collected within each 1-minute range. As a result, the number of rows shown depends on the duration of the execution.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "job_name = ic.latest_training_job.name\n", "metric_name = 'validation:accuracy'\n", "parent = sagemaker.analytics.TrainingJobAnalytics(training_job_name=job_name, metric_names=[metric_name], period=1)\n", "parent.dataframe().sort_values(['value'], ascending=False)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Inference\n", "\n", "***\n", "\n", "To use the model we just trained and tested, we need to deploy it.\n", "To do that, you invoke the *deploy* method of the Estimator.\n", "When you do, AWS provisions a SageMaker instance (you **won't** be able to see or manage it through the EC2 console) whose only purpose is to respond to invocations.\n", "\n", "We are using an ml.m5.large instance here; in production, plan the instance type and count according to your expected load.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "ic_classifier = ic.deploy(initial_instance_count=1,\n", "    instance_type='ml.m5.large')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confusion 
Matrix explained" ] }, { "attachments": { "image.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "A confusion matrix is one of several ways to assess the performance of a classification algorithm.\n", "It lays out the expected values against the values predicted by the model in a matrix, so we can easily see how often the model is right and where it fails.\n", "\n", "**If my model is ~98% accurate, why would I need a confusion matrix?**\n", "\n", "Because it is still wrong 2% of the time, and it is important to understand where: are the errors spread across all your classes, or is one class bringing your accuracy down?\n", "\n", "The overall accuracy itself might also be misleading. If most of your data belongs to a class that is \"easy\" to predict, that class pushes the accuracy up, while another class with less data may be predicted poorly. Of course, this is a problem for algorithms classifying 3 or more classes of data, which is our case.\n", "\n", "Before we jump into our own model, let's take a look at the example below.\n", "Reading the matrix, we see that there were 5 cats and 5 dogs in an imaginary image classification dataset, but the model predicted 4 cats and 6 dogs. **The correct predictions are on the diagonal of the matrix, in bold**.\n", "\n", "\n", "![image.png](attachment:image.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confusion Matrix for Mammography Classification\n", "\n", "Now that we understand what a confusion matrix is, let's execute the code below to generate one for our model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import json\n", "import logging\n", "\n", "import boto3\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sn\n", "from botocore.exceptions import ClientError\n", "from sklearn.metrics import confusion_matrix\n", "\n", "debug = False\n", "# debug = True\n", "\n", "prefix = \"resize/validate\"\n", "\n", "endpoint_name = ic_classifier.endpoint_name\n", "\n", "expected = []\n", "predicted = []\n", "\n", "\n", "def list_bucket_objects(bucket_name, prefix):\n", "    \"\"\"List the objects in an Amazon S3 bucket\n", "\n", "    :param bucket_name: string\n", "    :param prefix: string\n", "    :return: List of bucket objects. If error, return None.\n", "    \"\"\"\n", "\n", "    # Retrieve the list of bucket objects (a single call returns at most 1,000 keys)\n", "    s_3 = boto3.client('s3')\n", "\n", "    try:\n", "        response = s_3.list_objects_v2(Bucket=bucket_name, Prefix=prefix)\n", "    except ClientError as e:\n", "        logging.error(e)\n", "        return None\n", "\n", "    # Only return the contents if we found some keys\n", "    if response['KeyCount'] > 0:\n", "        return response['Contents']\n", "\n", "    return None\n", "\n", "\n", "def get_object(bucket_name, object_name):\n", "    \"\"\"Retrieve an object from an Amazon S3 bucket\n", "\n", "    :param bucket_name: string\n", "    :param object_name: string\n", "    :return: botocore.response.StreamingBody object. If error, return None.\n", "    \"\"\"\n", "\n", "    # Retrieve the object\n", "    s3 = boto3.client('s3')\n", "    try:\n", "        response = s3.get_object(Bucket=bucket_name, Key=object_name)\n", "    except ClientError as e:\n", "        logging.error(e)\n", "        return None\n", "    # Return an open StreamingBody object\n", "    return response['Body']\n", "\n", "\n", "def get_expected_value(object_key):\n", "    \"\"\"Map the class code embedded in the object key to its label\"\"\"\n", "    if \"CCD\" in object_key:\n", "        return \"CC-Right\"\n", "    elif \"CCE\" in object_key:\n", "        return \"CC-Left\"\n", "    elif \"MLOD\" in object_key:\n", "        return \"MLO-Right\"\n", "    elif \"MLOE\" in object_key:\n", "        return \"MLO-Left\"\n", "    elif \"NAO\" in object_key:\n", "        return \"Not-a-mammography\"\n", "    else:\n", "        logging.warning(object_key)\n", "        raise ValueError(\"Unsupported mammography type on expected array\")\n", "\n", "\n", "def get_best_prediction_position(prediction):\n", "    \"\"\"Return the label of the class with the highest probability\"\"\"\n", "    logging.info(prediction)\n", "\n", "    index = np.argmax(prediction)\n", "    object_categories = ['Not-a-mammography', 'CC-Right', 'CC-Left', 'MLO-Right', 'MLO-Left']\n", "    return object_categories[index]\n", "\n", "\n", "# Set up logging\n", "logger = logging.getLogger()\n", "logger.setLevel(logging.INFO if debug else logging.WARNING)\n", "\n", "# Runtime client used to invoke the endpoint; named so that it does not shadow the sagemaker module\n", "sagemaker_runtime = boto3.client('sagemaker-runtime')\n", "\n", "# Retrieve the bucket's objects\n", "objects = list_bucket_objects(bucket, prefix)\n", "\n", "if objects is not None:\n", "    # List the object names\n", "    logging.info('Objects in ' + bucket)\n", "    for obj in objects:\n", "        object_key = obj[\"Key\"]\n", "        if object_key.endswith((\".jpg\", \".png\", \".jpeg\")):\n", "\n", "            logging.info(object_key)\n", "\n", "            expected.append(get_expected_value(object_key))\n", "\n", "            s3_object = get_object(bucket, object_key)\n", "            s3_object_byte_array = s3_object.read()\n", "\n", "            # Invoke the endpoint and append the result to the predicted array\n", "            response = sagemaker_runtime.invoke_endpoint(\n", "                EndpointName=endpoint_name,\n", "                ContentType='application/x-image',\n", "                Body=s3_object_byte_array)\n", "\n", "            prediction = json.loads(response['Body'].read().decode())\n", "\n", "            predicted.append(get_best_prediction_position(prediction))\n", "else:\n", "    # Didn't get any keys\n", "    logging.warning('No objects in ' + bucket)\n", "\n", "\n", "# Confusion matrix using scikit-learn\n", "results = confusion_matrix(expected, predicted)\n", "print(results)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Confusion matrix using pandas and seaborn\n", "data = {'y_Actual': expected,\n", "        'y_Predicted': predicted}\n", "\n", "df = pd.DataFrame(data, columns=['y_Actual', 'y_Predicted'])\n", "# Named confusion_df so that it does not shadow sklearn's confusion_matrix function\n", "confusion_df = pd.crosstab(df['y_Actual'], df['y_Predicted'], rownames=['Expected'], colnames=['Predicted'])\n", "\n", "sn.heatmap(confusion_df, annot=True)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download test image" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "!wget -O /tmp/test.jpg https://mammography-workshop.s3.amazonaws.com/sample/resize_RIGHT_CC.jpg\n", "file_name = '/tmp/test.jpg'\n", "# test image\n", "from IPython.display import Image\n", "Image(file_name) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation\n", "\n", "Evaluate the image through the network for inference. 
The network outputs class probabilities and typically, one selects the class with the maximum probability as the final class output.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import json\n", "import numpy as np\n", "\n", "with open(file_name, 'rb') as f:\n", " payload = f.read()\n", " payload = bytearray(payload)\n", " \n", "result = json.loads(ic_classifier.predict(payload, initial_args={'ContentType': 'application/x-image'}))\n", "# the result will output the probabilities for all classes\n", "# find the class with maximum probability and print the class index\n", "index = np.argmax(result)\n", "object_categories = ['Not a Mammography', 'CC-Right', 'CC-Left', 'MLO-Right', 'MLO-Left']\n", "print(\"Result: \" + object_categories[index] + \", probability - \" + str(result[index]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean up\n", "\n", "When we're done with the endpoint, we can just delete it and the backing instances will be released. \n", "\n", "After the workshop, you may uncomment the code below and run it to delete the endpoint.\n", "\n", "**DON'T DO IT NOW!** \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "#ic_classifier.delete_endpoint()" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). 
You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 4 }