{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab1: Finetuning HuggingFace models with Amazon SageMaker\n",
    "### Multi-class Classification with `Trainer` and `amazon_us_reviews` dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "Welcome to the first Lab and our end-to-end multi-class Text-Classification example. In this Lab, we will use the Hugging Faces `transformers` and `datasets` library together with Amazon SageMaker to fine-tune a pre-trained transformer on text classification. In particular, the pre-trained model will be fine-tuned using the `amazon_us_reviews` dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Development Environment and Permissions "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation\n",
    "\n",
    "_*Note:* we install the required libraries from Hugging Face and AWS. Additionally, we make sure we have a compatible PyTorch version installed_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip install \"sagemaker>=2.48.0\" --upgrade"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip install datasets=='1.8.0'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Note: Restart the kernel after installing the above packages.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import display_html\n",
    "def restartkernel() :\n",
    "    display_html(\"<script>Jupyter.notebook.kernel.restart()</script>\",raw=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "restartkernel()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Permissions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) more about it._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "# sagemaker session bucket -> used for uploading data, models and logs\n",
    "# sagemaker will automatically create this bucket if it not exists\n",
    "sagemaker_session_bucket=None\n",
    "if sagemaker_session_bucket is None and sess is not None:\n",
    "    # set to default bucket if a bucket name is not given\n",
    "    sagemaker_session_bucket = sess.default_bucket()\n",
    "\n",
    "role = sagemaker.get_execution_role()\n",
    "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n",
    "\n",
    "print(f\"sagemaker role arn: {role}\")\n",
    "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n",
    "print(f\"sagemaker session region: {sess.boto_region_name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preparing the dataset\n",
    "\n",
    "As dataset are we going to use the [amazon_us_reviews](https://huggingface.co/datasets/amazon_us_reviews).\n",
    "\n",
    "the dataset contains the customer review text with accompanying metadata, consisting of three major components:\n",
    "\n",
    "1. A collection of reviews written in the Amazon.com marketplace and associated metadata from 1995 until 2015. This is intended to facilitate study into the properties (and the evolution) of customer reviews potentially including how people evaluate and express their experiences with respect to products at scale. (130M+ customer reviews)\n",
    "2. A collection of reviews about products in multiple languages from different Amazon marketplaces, intended to facilitate analysis of customers’ perception of the same products and wider consumer preferences across languages and countries. (200K+ customer reviews in 5 countries)\n",
    "3. A collection of reviews that have been identified as non-compliant with respect to Amazon policies. This is intended to provide a reference dataset for research on detecting promotional or biased reviews. (several thousand customer reviews). This part of the dataset is distributed separately and is available upon request – please contact the email address below if you are interested in obtaining this dataset.  \n",
    "\n",
    "_https://s3.amazonaws.com/amazon-reviews-pds/readme.html_\n",
    "\n",
    "We will use `create_dataset.py` script to downsample the dataset and split it into train  `data/amazon_us_reviews_apparel_v1_00_train.json` and test dataset `data/amazon_us_reviews_apparel_v1_00_test.json`. The train dataset contains 29750 rows and the test dataset 5250. \n",
    "For the dataset files the `Apparel_v1_00` split was used. \n",
    "\n",
    "You can find the script at `data/create_dataset.py`. You can use this to change for example the size or category split of the reviews."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pygmentize ../data/create_dataset.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download data and process it to have only reviews and label, then split it into train and test. \n",
    "* This step will take approximately `7mins - 10mins`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "!python ../data/create_dataset.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Uploading data to `sagemaker_session_bucket`\n",
    "\n",
    "Upload the `dataset` files to the default bucket in Amazon S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from sagemaker.s3 import S3Uploader\n",
    "\n",
    "local_train_dataset = \"amazon_us_reviews_apparel_v1_00_train.json\"\n",
    "local_test_dataset = \"amazon_us_reviews_apparel_v1_00_test.json\"\n",
    "\n",
    "# s3 uris for datasets\n",
    "remote_train_dataset = f\"s3://{sess.default_bucket()}/lab1/data\"\n",
    "remote_test_dataset = f\"s3://{sess.default_bucket()}/lab1/data\"\n",
    "\n",
    "# upload datasets\n",
    "S3Uploader.upload(os.path.join('../data', local_train_dataset),remote_train_dataset)\n",
    "S3Uploader.upload(os.path.join('../data',local_test_dataset),remote_test_dataset)\n",
    "\n",
    "print(f\"train dataset uploaded to: {remote_train_dataset}/{local_train_dataset}\")\n",
    "print(f\"test dataset uploaded to: {remote_test_dataset}/{local_test_dataset}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fine-tuning & starting Sagemaker Training Job\n",
    "\n",
    "In order to create our sagemaker training job we need an `HuggingFace` Estimator. The Estimator handles all end-to-end Amazon SageMaker training and deployment tasks. In the Estimator we define, which fine-tuning script (`entry_point`) should be used, which `instance_type` should be used, which `hyperparameters` are passed in .....\n",
    "\n",
    "\n",
    "\n",
    "```python\n",
    "huggingface_estimator = HuggingFace(entry_point='train.py',\n",
    "                            source_dir='./scripts',\n",
    "                            base_job_name='huggingface-sdk-extension',\n",
    "                            instance_type='ml.p3.2xlarge',\n",
    "                            instance_count=1,\n",
    "                            transformers_version='4.4',\n",
    "                            pytorch_version='1.6',\n",
    "                            py_version='py36',\n",
    "                            role=role,\n",
    "                            hyperparameters = {'epochs': 1,\n",
    "                                               'train_batch_size': 32,\n",
    "                                               'model_name':'distilbert-base-uncased'\n",
    "                                                })\n",
    "```\n",
    "\n",
    "When we create a SageMaker training job, SageMaker takes care of starting and managing the required ec2 instances for us, providing the fine-tuning script `train.py` and downloading the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. When starting the training SageMaer executes the following command:\n",
    "\n",
    "```python\n",
    "/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32\n",
    "```\n",
    "\n",
    "The CLI arguments you see are passed in as `hyperparameters`, when creating the `HuggingFace` estimator.\n",
    "\n",
    "Sagemaker is also providing useful properties about the training environment through various environment variables, including the following:\n",
    "\n",
    "* `SM_MODEL_DIR`: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.\n",
    "\n",
    "* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.\n",
    "\n",
    "* `SM_CHANNEL_XXXX:` A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator’s fit call, named `train` and `test`, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.\n",
    "\n",
    "\n",
    "To run your training job locally you can define `instance_type='local'` or `instance_type='local_gpu'` for gpu usage. _Note: this does not working within SageMaker Studio_\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating an Estimator and start a training job"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training script that performs fine tuning is located here: `training/scripts/train.py`. Navigate to the source code location and open the `train.py` file. You can also go through it's contents by executing the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize ../scripts/train.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.huggingface import HuggingFace\n",
    "import time\n",
    "\n",
    "# hyperparameters, which are passed into the training job\n",
    "hyperparameters={'epochs': 1,                          # number of training epochs\n",
    "                 'train_batch_size': 32,               # batch size for training\n",
    "                 'eval_batch_size': 64,                # batch size for evaluation\n",
    "                 'learning_rate': 3e-5,                # learning rate used during training\n",
    "                 'model_id':'distilbert-base-uncased', # pre-trained model\n",
    "                 'fp16': True,                         # Whether to use 16-bit (mixed) precision training\n",
    "                 'train_file': local_train_dataset,    # training dataset\n",
    "                 'test_file': local_test_dataset,      # test dataset\n",
    "                 }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "List of supported models: https://huggingface.co/models?library=pytorch,transformers&sort=downloads"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a `metric_definition` dictionary that contains regex-based definitions that will be used to parse the job logs and extract metrics. You can read more about parsing the cloudwatch logs [here](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metric_definitions=[\n",
    "    {'Name': 'eval_loss',               'Regex': \"'eval_loss': ([0-9]+(.|e\\-)[0-9]+),?\"},\n",
    "    {'Name': 'eval_accuracy',           'Regex': \"'eval_accuracy': ([0-9]+(.|e\\-)[0-9]+),?\"},\n",
    "    {'Name': 'eval_f1',                 'Regex': \"'eval_f1': ([0-9]+(.|e\\-)[0-9]+),?\"},\n",
    "    {'Name': 'eval_precision',          'Regex': \"'eval_precision': ([0-9]+(.|e\\-)[0-9]+),?\"}]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define Training Job Name \n",
    "job_name = f'huggingface-workshop-{time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.localtime())}'\n",
    "\n",
    "# create the Estimator\n",
    "huggingface_estimator = HuggingFace(\n",
    "    entry_point          = 'train.py',        # fine-tuning script used in training jon\n",
    "    source_dir           = '../scripts',      # directory where fine-tuning script is stored\n",
    "    instance_type        = 'ml.p3.2xlarge',   # instances type used for the training job\n",
    "    instance_count       = 1,                 # the number of instances used for training\n",
    "    base_job_name        = job_name,          # the name of the training job\n",
    "    role                 = role,              # Iam role used in training job to access AWS ressources, e.g. S3\n",
    "    transformers_version = '4.6',             # the transformers version used in the training job\n",
    "    pytorch_version      = '1.7',             # the pytorch_version version used in the training job\n",
    "    py_version           = 'py36',            # the python version used in the training job\n",
    "    hyperparameters      = hyperparameters,   # the hyperparameter used for running the training job\n",
    "    metric_definitions   = metric_definitions # the metrics regex definitions to extract logs\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# define a data input dictonary with our uploaded s3 uris\n",
    "training_data = {\n",
    "    'train': remote_train_dataset,\n",
    "    'test': remote_test_dataset\n",
    "}\n",
    "\n",
    "# starting the train job with our uploaded datasets as input\n",
    "huggingface_estimator.fit(training_data, wait=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Accessing Logs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# access the logs of the training job\n",
    "huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Accessing Training Metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker import TrainingJobAnalytics\n",
    "\n",
    "# Captured metrics can be accessed as a Pandas dataframe\n",
    "training_job_name = huggingface_estimator.latest_training_job.name\n",
    "print(f\"Training jobname: {training_job_name}\")\n",
    "\n",
    "df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bonus: Deploying the endpoint\n",
    "\n",
    "To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor = huggingface_estimator.deploy(1,\"ml.g4dn.xlarge\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we use the returned predictor object to call the endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentiment_input= {\"inputs\":\"I love using the new Inference DLC.\"}\n",
    "\n",
    "predictor.predict(sentiment_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we delete the endpoint again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor.delete_endpoint()"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "interpreter": {
   "hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9"
  },
  "kernelspec": {
   "display_name": "Python 3 (Data Science)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/datascience-1.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}