# Classify Skin Lesion Images

Note: This notebook was developed using the `Data Science 3.0` image.

## 1. Background

According to [The Skin Cancer Foundation](https://www.skincancer.org/skin-cancer-information/skin-cancer-facts/), 1 in 5 Americans will develop skin cancer by the age of 70, and more than 2 people die of skin cancer in the U.S. every hour. Worldwide, [melanoma of the skin is the 17th most common form of cancer](https://www.wcrf.org/cancer-trends/skin-cancer-statistics/). Non-melanoma skin cancer, however, is often excluded from official statistics and is therefore significantly under-reported. In Brazil, [according to the Brazilian Cancer Institute (INCA)](https://www.sciencedirect.com/science/article/pii/S0010482519304019?via%3Dihub#b6), skin cancer accounts for 33% of all cancer diagnoses in the country. An algorithm that distinguishes skin cancer from other lesions could help communities make the best use of limited healthcare resources.

In this lab, you will use Amazon SageMaker to build, train, and deploy a classification model based on the [PAD-UFES-20 skin lesion dataset](https://data.mendeley.com/datasets/zr7vgbcyr2/1).

```
Pacheco, Andre G. C.; Lima, Gustavo R.; Salomão, Amanda S.; Krohling, Breno; Biral, Igor P.; de Angelo, Gabriel G. ; Alves Jr, Fábio C. R. ; Esgario, José G. M.; Simora, Alana C. ; Castro, Pedro B. C. ; Rodrigues, Felipe B.; Frasson, Patricia H. L. ; Krohling, Renato A.; Knidel, Helder ; Santos, Maria C. S. ; Espírito Santo, Rachel B.; Macedo, Telma L. S. G.; Canuto, Tania R. P. ; de Barros, Luíz F. S. (2020), “PAD-UFES-20: a skin lesion dataset composed of patient data and clinical images collected from smartphones”, Mendeley Data, V1, doi: 10.17632/zr7vgbcyr2.1
```

The model will distinguish 6 skin lesion classes from cell phone images.

- Skin Cancers
  - BCC: Basal Cell Carcinoma
  - MEL: Melanoma
  - SCC: Squamous Cell Carcinoma and Bowen’s disease
- Skin Diseases
  - ACK: Actinic Keratosis
  - NEV: Nevus
  - SEK: Seborrheic Keratosis
 
 ![Skin Lesions](img/lesions.png)
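
The grouping above can be written down as a small lookup, which is handy when collapsing the six classes into a binary cancer / non-cancer view. This is a minimal sketch: the names `LESION_GROUPS` and `is_cancer` are illustrative and are not part of the lab code.

```python
# Diagnostic codes grouped exactly as in the list above.
# LESION_GROUPS and is_cancer are hypothetical helper names for illustration.
LESION_GROUPS = {
    "BCC": "cancer",   # Basal Cell Carcinoma
    "MEL": "cancer",   # Melanoma
    "SCC": "cancer",   # Squamous Cell Carcinoma and Bowen's disease
    "ACK": "disease",  # Actinic Keratosis
    "NEV": "disease",  # Nevus
    "SEK": "disease",  # Seborrheic Keratosis
}


def is_cancer(code: str) -> bool:
    """Return True if the diagnostic code belongs to the skin cancer group."""
    return LESION_GROUPS[code.upper()] == "cancer"


cancer_codes = sorted(code for code in LESION_GROUPS if is_cancer(code))
print(cancer_codes)
```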

## 2. Build

### 2.1. Import Python Packages and Create Clients

In [None]:
%pip install -U pip -q -q
%pip install -U sagemaker sagemaker-experiments -q -q

In [None]:
import boto3
import json
from matplotlib import pyplot as plt
import matplotlib.image as mpimg
import numpy as np
import os
import pandas as pd
import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.sklearn.estimator import SKLearn
from sklearn.metrics import precision_recall_fscore_support
import smexperiments.experiment
import smexperiments.trial  # needed for smexperiments.trial.Trial in section 3.1
from time import strftime, sleep

boto_session = boto3.session.Session()
sagemaker_session = sagemaker.session.Session(boto_session)
sagemaker_boto_client = boto_session.client("sagemaker")

REGION = boto_session.region_name
print(f"AWS Region is {REGION}")

ACCOUNT_ID = boto_session.client("sts").get_caller_identity().get("Account")
print(f"AWS Account ID is {ACCOUNT_ID}")

SAGEMAKER_EXECUTION_ROLE = sagemaker.session.get_execution_role(sagemaker_session)
print(f"Assumed SageMaker role is {SAGEMAKER_EXECUTION_ROLE}")

S3_BUCKET = sagemaker_session.default_bucket()
S3_PREFIX = 'skin-lesion-classification-lab'
print(f"Default S3 location is s3://{S3_BUCKET}/{S3_PREFIX}")


In [None]:
# Create a new experiment
skin_lesion_experiment = smexperiments.experiment.Experiment.create(
 description="Classify skin lesions",
 experiment_name=f"Classify-skin-lesions-{strftime('%Y-%m-%d-%H-%M-%S')}",
 sagemaker_boto_client=sagemaker_boto_client,
 tags=[{"Key": "Creator", "Value": "arosalez"}],
)

### 2.2. Create Processing Script

In [None]:
%%writefile scripts/processing/preprocess.py
import boto3
import logging
import os
import pandas as pd
import shutil
from sklearn.model_selection import train_test_split

logging.getLogger().setLevel(logging.INFO)

# Define data source and other parameters.
SRC_BUCKET = 'prod-dcd-datasets-cache-zipfiles'
SRC_KEY = 'zr7vgbcyr2-1.zip'
DATA_DIR = '/opt/ml/processing'

# Download raw data zip from https://data.mendeley.com/datasets/zr7vgbcyr2/1
logging.info(f'Downloading {SRC_KEY}')
s3_boto_client = boto3.client("s3")
os.makedirs(f'{DATA_DIR}/input', exist_ok=True)
s3_boto_client.download_file(SRC_BUCKET, SRC_KEY, f'{DATA_DIR}/input/raw.zip')

# Unzip data
logging.info(f'Unpacking {SRC_KEY}')
shutil.unpack_archive(f'{DATA_DIR}/input/raw.zip', f'{DATA_DIR}/input')
for i in range(1,4): 
 logging.info(f'Unpacking imgs_part_{i}.zip')
 shutil.unpack_archive(f'{DATA_DIR}/input/images/imgs_part_{i}.zip', f'{DATA_DIR}/input/images')
 logging.info(f'Copying {DATA_DIR}/input/images/imgs_part_{i} to {DATA_DIR}/input/images/all_imgs')
 shutil.copytree(f'{DATA_DIR}/input/images/imgs_part_{i}', f'{DATA_DIR}/input/images/all_imgs', dirs_exist_ok=True)

# Split data into training, validation, and test sets
logging.info(f'Creating training-validation data split')
metadata = pd.read_csv(f'{DATA_DIR}/input/metadata.csv')
train_df, test_df = train_test_split(metadata, test_size=0.2, stratify=metadata['diagnostic'])
train_df, val_df = train_test_split(train_df, test_size=0.05, stratify=train_df['diagnostic'])

# Copy training data into folders for training
logging.info(f'Copying training data to {DATA_DIR}/output/train')
os.makedirs(f"{DATA_DIR}/output/train", exist_ok=True)
train_df.to_csv(f'{DATA_DIR}/output/train/metadata.csv', index=False)
for _,row in train_df.iterrows():
 src = f"{DATA_DIR}/input/images/all_imgs/{row['img_id']}"
 os.makedirs(f"{DATA_DIR}/output/train/{row['diagnostic']}", exist_ok=True)
 dest = f"{DATA_DIR}/output/train/{row['diagnostic']}/{row['img_id']}"
 shutil.copy2(src, dest) 
 
# Copy validation data into folders for training
logging.info(f'Copying validation data to {DATA_DIR}/output/val')
os.makedirs(f"{DATA_DIR}/output/val", exist_ok=True)
val_df.to_csv(f'{DATA_DIR}/output/val/metadata.csv', index=False)
for _,row in val_df.iterrows():
 src = f"{DATA_DIR}/input/images/all_imgs/{row['img_id']}"
 os.makedirs(f"{DATA_DIR}/output/val/{row['diagnostic']}", exist_ok=True)
 dest = f"{DATA_DIR}/output/val/{row['diagnostic']}/{row['img_id']}"
 shutil.copy2(src, dest)
 
# Copy test data into folders for evaluation
logging.info(f'Copying test data to {DATA_DIR}/output/test')
os.makedirs(f"{DATA_DIR}/output/test", exist_ok=True)
test_df.to_csv(f'{DATA_DIR}/output/test/metadata.csv', index=False)
for _,row in test_df.iterrows():
 src = f"{DATA_DIR}/input/images/all_imgs/{row['img_id']}"
 os.makedirs(f"{DATA_DIR}/output/test/{row['diagnostic']}", exist_ok=True)
 dest = f"{DATA_DIR}/output/test/{row['diagnostic']}/{row['img_id']}"
 shutil.copy2(src, dest)
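
The two chained splits in the script hold out 20% of the data for testing and then 5% of the remainder for validation, which works out to roughly 76% / 4% / 20% overall. The sketch below shows that arithmetic with the standard library only. It assumes held-out counts are rounded up, and both the `split_sizes` helper and the dataset size of 1000 are hypothetical values chosen for illustration.

```python
import math


def split_sizes(n_samples, test_size=0.2, val_size=0.05):
    """Approximate row counts produced by the two chained splits.

    Assumption: each held-out fraction is rounded up to a whole row.
    """
    n_test = math.ceil(test_size * n_samples)
    n_remaining = n_samples - n_test
    n_val = math.ceil(val_size * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


# 1000 is a hypothetical dataset size, not the size of PAD-UFES-20.
n_train, n_val, n_test = split_sizes(1000)
print(f"train={n_train}, val={n_val}, test={n_test}")
```

Because the validation fraction applies to what is left after the test split, the validation set is 0.8 × 0.05 = 4% of the full dataset, not 5%.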


### 2.4. Define a Processor

In [None]:
pre_processor = sagemaker.processing.FrameworkProcessor(
 estimator_cls=SKLearn,
 framework_version='1.0-1',
 command=['python'],
 role=SAGEMAKER_EXECUTION_ROLE,
 instance_count=1,
 instance_type='ml.m5.xlarge'
)

### 2.5. Submit a Processing Run

In [None]:
pre_processor.run(
 job_name=f"skin-lesion-image-processing-job-{strftime('%Y-%m-%d-%H-%M-%S')}",
 code="scripts/processing/preprocess.py",
 outputs=[
 sagemaker.processing.ProcessingOutput(
 output_name="train",
 source="/opt/ml/processing/output/train",
 destination=f"s3://{S3_BUCKET}/{S3_PREFIX}/data/train/",
 ),
 sagemaker.processing.ProcessingOutput(
 output_name="validation",
 source="/opt/ml/processing/output/val",
 destination=f"s3://{S3_BUCKET}/{S3_PREFIX}/data/val/",
 ),
 sagemaker.processing.ProcessingOutput(
 output_name="test",
 source="/opt/ml/processing/output/test",
 destination=f"s3://{S3_BUCKET}/{S3_PREFIX}/data/test/",
 ),
 ],
 experiment_config={
 "ExperimentName": skin_lesion_experiment.experiment_name,
 }
)

### 2.6. Download and Examine Metadata

In [None]:
if pre_processor.latest_job.describe().get('ProcessingJobStatus') == 'Completed':
 sagemaker_session.download_data(
 'data',
 bucket=S3_BUCKET,
 key_prefix=f"{S3_PREFIX}/data/train/metadata.csv"
 )
 metadata = pd.read_csv('data/metadata.csv')
 
 display(metadata.head())
 display(metadata.describe(include=['object']))
 display(metadata.describe(include=[np.number]))

 plt.style.use('default')
 metadata['diagnostic'].value_counts().plot(kind='bar', color="#003181")
 plt.title("Diagnosis Value Counts")
 plt.show()

## 3. Train

### 3.1. Create a New Experiment Trial

In [None]:
mobilenet_trial = smexperiments.trial.Trial.create(
 experiment_name=skin_lesion_experiment.experiment_name,
 sagemaker_boto_client=sagemaker_boto_client,
 trial_name=f"mobilenet-Trial-{strftime('%Y-%m-%d-%H-%M-%S')}",
)

### 3.2. Define an Estimator

In [None]:
model_id, model_version = "tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4", "*"
training_instance_type = "ml.p3.2xlarge"
mobilenet_job_name = "mobilenet-Training-Job"

# Retrieve the Docker image uri
train_image_uri = image_uris.retrieve(
 model_id=model_id,
 model_version=model_version,
 image_scope="training",
 instance_type=training_instance_type,
 region=None,
 framework=None)

# Retrieve the training script uri
train_source_uri = script_uris.retrieve(
 model_id=model_id, 
 model_version=model_version, 
 script_scope="training")

# Retrieve the pretrained model artifact uri for transfer learning
train_model_uri = model_uris.retrieve(
 model_id=model_id, 
 model_version=model_version, 
 model_scope="training")

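# Retrieve the default hyperparameter values for fine-tuning the model.
# Note: this assignment rebinds the name of the imported `hyperparameters` module,
# so re-running this cell requires re-running the import cell first.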
hyperparameters = hyperparameters.retrieve_default(
 model_id=model_id, 
 model_version=model_version
)

# Override default hyperparameters with custom values
hyperparameters["epochs"] = "3"
hyperparameters["batch_size"] = "70"
hyperparameters["learning_rate"] = 0.00010804583232953079
hyperparameters["optimizer"] = 'rmsprop'

# Specify S3 urls for input data and output artifact
training_dataset_s3_path = f"s3://{S3_BUCKET}/{S3_PREFIX}/data/train"
validation_dataset_s3_path = f"s3://{S3_BUCKET}/{S3_PREFIX}/data/val"
s3_output_location = f"s3://{S3_BUCKET}/{S3_PREFIX}/output"

# Specify what metrics to look for in the logs
training_metric_definitions = [
 {"Name": "val_accuracy", "Regex": "- val_accuracy: ([0-9\\.]+)"},
 {"Name": "val_loss", "Regex": "- val_loss: ([0-9\\.]+)"},
 {"Name": "train_accuracy", "Regex": "- accuracy: ([0-9\\.]+)"},
 {"Name": "train_loss", "Regex": "- loss: ([0-9\\.]+)"},
]

# Create estimator
tf_ic_estimator = sagemaker.estimator.Estimator(
 base_job_name = mobilenet_job_name,
 role=SAGEMAKER_EXECUTION_ROLE,
 image_uri=train_image_uri,
 source_dir=train_source_uri,
 model_uri=train_model_uri,
 entry_point="transfer_learning.py",
 instance_count=1,
 instance_type=training_instance_type,
 hyperparameters=hyperparameters,
 output_path=s3_output_location,
 enable_sagemaker_metrics=True,
 metric_definitions=training_metric_definitions,
 rules=[sagemaker.debugger.ProfilerRule.sagemaker(sagemaker.debugger.rule_configs.ProfilerReport())]
)
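
The metric definitions above are regular expressions that SageMaker applies to the training job's log output, with the first capture group taken as the metric value. As a rough local check, the same patterns can be run against a sample line. The log line below is a hypothetical example in the Keras progress format, and `extract_metrics` is an illustrative helper. Real job logs may be formatted differently.

```python
import re

# Same names and patterns as training_metric_definitions above.
metric_definitions = [
    {"Name": "val_accuracy", "Regex": "- val_accuracy: ([0-9\\.]+)"},
    {"Name": "val_loss", "Regex": "- val_loss: ([0-9\\.]+)"},
    {"Name": "train_accuracy", "Regex": "- accuracy: ([0-9\\.]+)"},
    {"Name": "train_loss", "Regex": "- loss: ([0-9\\.]+)"},
]

# Hypothetical log line for illustration only.
log_line = "12/12 - 30s - loss: 1.2345 - accuracy: 0.5432 - val_loss: 1.1000 - val_accuracy: 0.6000"


def extract_metrics(line, definitions):
    """Apply each regex to the line and collect the first captured group as a float."""
    metrics = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            metrics[definition["Name"]] = float(match.group(1))
    return metrics


parsed = extract_metrics(log_line, metric_definitions)
print(parsed)
```

The leading `- ` in each pattern is what keeps `- loss:` from also matching inside `- val_loss:`.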

### 3.3. Submit a Training Job

In [None]:
tf_ic_estimator.fit(
 inputs = {
 "training": training_dataset_s3_path,
 "validation": validation_dataset_s3_path
 }, 
 experiment_config={
 "TrialName": mobilenet_trial.trial_name,
 }
)

### 3.4. View Trial Details

In [None]:
from sagemaker.analytics import ExperimentAnalytics

trial_component_analytics = ExperimentAnalytics(
 sagemaker_session=sagemaker_session,
 experiment_name=skin_lesion_experiment.experiment_name,
 parameter_names=["SageMaker.InstanceType"],
)

trial_component_analytics.dataframe()

## 4. Deploy

![SageMaker Model Deployment Options](img/deployment_options.png "SageMaker Model Deployment Options")

### 4.1. Define a Model Using the Training Artifact

In [None]:
endpoint_name = sagemaker.utils.name_from_base(f"lesion-classifier-{model_id}")
inference_instance_type = "ml.g4dn.xlarge"

# Get the inference docker container uri.
deploy_image_uri = image_uris.retrieve(
 region=None,
 framework=None,
 image_scope="inference",
 model_id=model_id,
 model_version=model_version,
 instance_type=inference_instance_type,
)

# Get the inference script uri
deploy_source_uri = script_uris.retrieve(
 model_id=model_id, model_version=model_version, script_scope="inference"
)

# Get the model artifact created by the training job
model_data_uri = tf_ic_estimator.model_data

# Define a SageMaker model using the training artifact
model = sagemaker.model.Model(
 image_uri=deploy_image_uri,
 source_dir=deploy_source_uri,
 model_data=model_data_uri,
 entry_point="inference.py",
 role=SAGEMAKER_EXECUTION_ROLE,
 predictor_cls=sagemaker.predictor.Predictor,
 name=endpoint_name,
)


### 4.2. Submit a Deployment Job

In [None]:
model_predictor = model.deploy(
 initial_instance_count=1,
 instance_type=inference_instance_type,
 endpoint_name=endpoint_name,
)

### 4.3. Download Test Images

In [None]:
sagemaker_session.download_data(
 f"data/test",
 bucket=S3_BUCKET,
 key_prefix=f"{S3_PREFIX}/data/test"
)

### 4.4. Generate Test Predictions

In [None]:
truth = []
pred = []
images = []

diagnostic_codes = { 
 'BCC': 'Cancer: Basal Cell Carcinoma',
 'MEL': 'Cancer: Melanoma',
 'SCC': 'Cancer: Squamous Cell Carcinoma and Bowen’s disease',
 'ACK': 'Disease: Actinic Keratosis',
 'NEV': 'Disease: Nevus',
 'SEK': 'Disease: Seborrheic Keratosis'
}

for true_diagnostic in ['ACK', 'BCC', 'MEL', 'NEV', 'SCC', 'SEK']:
    print(diagnostic_codes[true_diagnostic])
    # Use at most 25 test images per class
    filenames = [name for name in os.listdir(f'data/test/{true_diagnostic}') if name.endswith('.png')][:25]
    for filename in filenames:
        filename = f'data/test/{true_diagnostic}/{filename}'
        print(f'Predicting {filename}')
        with open(filename, "rb") as file:
            img = file.read()
        # Send the raw image bytes to the endpoint and request the verbose JSON response
        query_response = model_predictor.predict(
            img, {"ContentType": "application/x-image", "Accept": "application/json;verbose"}
        )
        model_predictions = json.loads(query_response)
        predicted_label = model_predictions["predicted_label"]
        truth.append(true_diagnostic)
        pred.append(predicted_label)
        images.append(mpimg.imread(filename))
        sleep(0.1)

precision, recall, f1, _ = precision_recall_fscore_support(truth, pred, average='weighted', zero_division=0)

print(f'Inference Precision: {precision}')
print(f'Inference Recall: {recall}')
print(f'Inference F1-Score: {f1}')
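
With `average='weighted'`, each class's precision, recall, and F1 score is weighted by that class's share of the true labels before summing. The sketch below reproduces that calculation with the standard library to make the weighting explicit. `weighted_scores` is an illustrative helper, the labels are made-up examples, and undefined ratios are treated as 0 to mirror `zero_division=0`.

```python
from collections import Counter


def weighted_scores(truth, pred):
    """Support-weighted precision, recall, and F1 over the classes present in truth."""
    support = Counter(truth)
    total = len(truth)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        true_pos = sum(1 for t, p in zip(truth, pred) if t == label and p == label)
        predicted = sum(1 for p in pred if p == label)
        class_precision = true_pos / predicted if predicted else 0.0
        class_recall = true_pos / count
        denom = class_precision + class_recall
        class_f1 = 2 * class_precision * class_recall / denom if denom else 0.0
        weight = count / total  # share of true labels belonging to this class
        precision += weight * class_precision
        recall += weight * class_recall
        f1 += weight * class_f1
    return precision, recall, f1


# Made-up labels for illustration only.
example_truth = ["BCC", "BCC", "MEL", "NEV"]
example_pred = ["BCC", "MEL", "MEL", "NEV"]
example_scores = weighted_scores(example_truth, example_pred)
print(example_scores)
```

Because the weights follow the true class frequencies, a model that does well only on the most common lesion type can still post a high weighted score, so the per-class values are worth inspecting on an imbalanced dataset.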

### 4.5. View Sample Results

In [None]:
import random

# Number of random samples to display (must fit the 3x4 grid below)
k = 12
samples = random.sample(range(len(images)), k)
plt.figure()
for i, sample in enumerate(samples):
 plt.subplot(3,4,i+1)
 plt.imshow(images[sample])
 plt.title(f'{truth[sample]}:{pred[sample]}')
 plt.axis('off')
 
_ = plt.suptitle("Actual:Prediction")
plt.show()

### 4.6. Delete Endpoint

In [None]:
model_predictor.delete_endpoint()

## 5. Workflow Automation with SageMaker Pipelines

In [None]:
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.inputs import TrainingInput

### 5.1. Define a Processing Step

In [None]:
# Update the pre_processor from above to use PipelineSession()
pre_processor.sagemaker_session = PipelineSession()

# With a PipelineSession, run() does not start a job; it returns step arguments that execute when the pipeline runs
processor_args = pre_processor.run(
 job_name=f"skin-lesion-image-processing-job-{strftime('%Y-%m-%d-%H-%M-%S')}",
 code="scripts/processing/preprocess.py",
 outputs=[
 sagemaker.processing.ProcessingOutput(
 output_name="train",
 source="/opt/ml/processing/output/train",
 destination=f"s3://{S3_BUCKET}/{S3_PREFIX}/data/train/",
 ),
 sagemaker.processing.ProcessingOutput(
 output_name="validation",
 source="/opt/ml/processing/output/val",
 destination=f"s3://{S3_BUCKET}/{S3_PREFIX}/data/val/",
 ),
 sagemaker.processing.ProcessingOutput(
 output_name="test",
 source="/opt/ml/processing/output/test",
 destination=f"s3://{S3_BUCKET}/{S3_PREFIX}/data/test/",
 ),
 ]
)

# Use the lazy processing run to define a ProcessingStep
processing_step = ProcessingStep(
 name="LesionImageProcessingStep",
 step_args=processor_args,
)

### 5.2. Define a Training Step

In [None]:
# Update the tf_ic_estimator from above to use PipelineSession()
tf_ic_estimator.sagemaker_session = PipelineSession()

# With a PipelineSession, fit() does not start a job; it returns step arguments that execute when the pipeline runs
training_args = tf_ic_estimator.fit(
 inputs = {
 "training": TrainingInput(
 s3_data=processing_step.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri, content_type="text/csv"
 ),
 "validation": TrainingInput(
 s3_data=processing_step.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri,content_type="text/csv"
 )
 }
)

# Use the lazy training run to define a TrainingStep
training_step = TrainingStep(
 name="LesionClassifierTrainingStep",
 step_args=training_args
)

### 5.3. Create and Execute a Pipeline

In [None]:
# Create a pipeline with the processing and training steps
pipeline = Pipeline(
 name=f"lesion-classifier-pipeline-{strftime('%Y-%m-%d-%H-%M-%S')}",
 steps=[
 processing_step, 
 training_step
 ],
 sagemaker_session=sagemaker_session,
)

In [None]:
# Create or update the pipeline definition, then start an execution
pipeline.upsert(role_arn=SAGEMAKER_EXECUTION_ROLE)
execution = pipeline.start()