{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Named Entity Recognition by fine-tuning Keras BERT on SageMaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup \n", "\n", "We'll begin with some necessary imports, and get an Amazon SageMaker session to help perform certain tasks, as well as an IAM role with the necessary permissions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import json\n", "import time\n", "from datetime import datetime\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.tensorflow import TensorFlow\n", "from sagemaker.tensorflow.serving import TensorFlowModel\n", "import logging\n", "\n", "role = get_execution_role()\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SageMaker variables and S3 bucket" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Creating a sagemaker session\n", "sagemaker_session = sagemaker.Session()\n", "\n", "#We'll be using the sagemaker default bucket\n", "BUCKET = sagemaker_session.default_bucket()\n", "PREFIX = 'graph-nerc-blog' #Feel free to change this\n", "DATA_FOLDER = 'tagged-data'\n", "\n", "#Using default region, same as where this notebook is, change if needed\n", "REGION = sagemaker_session.boto_region_name\n", "\n", "INPUTS = 's3://{}/{}/{}/'.format(BUCKET,PREFIX,DATA_FOLDER)\n", "\n", "print(\"Using region: {}\".format(REGION))\n", "print('Bucket: {}'.format(BUCKET))\n", "print(\"Using prefix : {}\".format(INPUTS))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Downloading dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be using the Kaggle entity-annotated-corpus that can be found at https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus\n", "To be able to download it, you will be required to create a Kaggle account.\n", "Once the zip folder is downloaded, unzip it locally and upload the file ner_dataset.csv in the folder of this notebook. (notebooks/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Data exploration and preparation\n", "The dataset consists of 47959 news article sentences (1048575 words) with tagged entities representing:\n", "- geo = Geographical Entity\n", "- org = Organization\n", "- per = Person\n", "- gpe = Geopolitical Entity\n", "- tim = Time indicator\n", "- art = Artifact\n", "- eve = Event\n", "- nat = Natural Phenomenon" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ner_dataset = pd.read_csv('ner_dataset.csv', encoding = 'latin')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Here is an example sentence. 
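Each row of the CSV is a single word, and the Sentence # column is only filled for the first word of each sentence.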
We will only be using the Sentence #, Word and Tag columns.\n", "ner_dataset.head(24)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# These are the entity tags present in the data\n", "ner_dataset.Tag.unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ner_dataset.Tag = ner_dataset.Tag.fillna('O')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split data into train, validation and test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We split the data into train, validation and test sets, taking the first 45000 sentences for training, the next 2000 sentences for validation and the last 959 sentences for testing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "index = ner_dataset['Sentence #'].index[~ner_dataset['Sentence #'].isna()].values.tolist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_index = index[45000]\n", "val_index = index[47000]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df = ner_dataset[:train_index]\n", "val_df = ner_dataset[train_index:val_index]\n", "test_df = ner_dataset[val_index:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save data to S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.to_csv(INPUTS + 'train.csv')\n", "val_df.to_csv(INPUTS + 'val.csv')\n", "test_df.to_csv(INPUTS + 'test.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Training the BERT model using SageMaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All the code for fine-tuning Keras BERT for Named Entity Recognition is in the folder code/.\n", "The folder contains the train.py script that will be executed within a SageMaker training job to launch the training. 
The train.py script imports modules found in code/source/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize ../code/train.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data we train on are the outputs of Part 1: Data Exploration and Preparation.\n", "\n", "**NOTE: If you change where you save the train, validation and test CSV files, please reflect those changes in the INPUTS variable.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Single training job" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Job name and instance type" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "JOB_NAME = 'ner-bert-keras'\n", "INSTANCE_TYPE = 'ml.p3.2xlarge'\n", "# INSTANCE_TYPE = \"local_gpu\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hyperparameters:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "EPOCHS = 20\n", "BATCH_SIZE = 16\n", "MAX_SEQUENCE_LENGTH = 64 # This corresponds to the BERT input sequence length we want (training time grows quadratically with the input size)\n", "DROP_OUT = 0.1\n", "LEARNING_RATE = 4.0e-05\n", "BERT_PATH = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'\n", "OUTPUT_PATH = 's3://{}/{}/training-output/'.format(BUCKET,PREFIX)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Defining training job" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By providing *framework_version* \"2.3\" and *py_version* \"py37\" in the TensorFlow object, we'll be calling a managed ECR image provided by AWS.\n", "When training on a GPU instance in the eu-west-1 region, this is the same as providing the explicit image_uri: *training_image_uri = \"763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04\"*.\n", "\n", "Using *framework_version* and *py_version*, the TensorFlow estimator manages this for you by ensuring that the right image_uri is used for the region of your SageMaker session and the type of instance used for training.\n", "Refer to https://github.com/aws/deep-learning-containers/blob/master/available_images.md for more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {'epochs': EPOCHS,\n", " 'batch_size' : BATCH_SIZE,\n", " 'max_sequence_length': MAX_SEQUENCE_LENGTH,\n", " 'drop_out': DROP_OUT,\n", " 'learning_rate': LEARNING_RATE,\n", " 'bert_path':BERT_PATH\n", " }\n", "\n", "# Use either framework_version and py_version or an explicit image_uri\n", "estimator = TensorFlow(base_job_name=JOB_NAME,\n", " source_dir='../code',\n", " entry_point='train.py', \n", " role=role,\n", " framework_version='2.3',\n", " py_version='py37',\n", "# image_uri=training_image_uri,\n", " hyperparameters=hyperparameters,\n", " instance_count=1,\n", " script_mode=True,\n", " metric_definitions=[\n", " {'Name': 'train loss', 'Regex': 'loss: (.*?) -'},\n", " {'Name': 'train accuracy', 'Regex': ' accuracy: (.*?) -'},\n", " {'Name': 'val loss', 'Regex': 'val_loss: (.*?) 
-'},\n", " {'Name': 'val accuracy', 'Regex': 'val_accuracy: (.*?)$'}\n", " ],\n", " output_path=OUTPUT_PATH,\n", " instance_type=INSTANCE_TYPE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "REMOTE_INPUTS = {'train' : INPUTS,\n", " 'validation' : INPUTS,\n", " 'eval' : INPUTS}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "dt = datetime.now()\n", "estimator.fit(REMOTE_INPUTS, wait = False) # Set to True if you want to see the logs here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training can take between 40min and 1h. The following cells can be run to check the status. Once the status is 'Completed' you can go ahead and Deploy an Inference Endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(estimator.model_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm_client = boto3.client('sagemaker')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = sm_client.describe_training_job(\n", " TrainingJobName=estimator._current_job_name\n", ")\n", "response.get('TrainingJobStatus')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run the next cells only once TrainingJobStatus response is 'Completed'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Deploy an Inference Endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize ../code/inference.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MODEL_ARTEFACTS_S3_LOCATION = response.get('ModelArtifacts').get('S3ModelArtifacts')\n", "\n", "INSTANCE_TYPE = \"ml.g4dn.xlarge\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(MODEL_ARTEFACTS_S3_LOCATION)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use either framework_version or an explicit image_uri. 
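With framework_version='2.3', the SDK resolves the matching tensorflow-inference container image for your region and instance type, just as it did for training.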
Refer to https://github.com/aws/deep-learning-containers/blob/master/available_images.md for more details.\n", "model = TensorFlowModel(entry_point='inference.py',\n", " source_dir='../code',\n", " framework_version='2.3',\n", "# image_uri = inference_image_uri,\n", " role=role,\n", " model_data=MODEL_ARTEFACTS_S3_LOCATION,\n", " sagemaker_session=sagemaker_session,\n", " env = {'SAGEMAKER_MODEL_SERVER_TIMEOUT' : '300' }\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor = model.deploy(initial_instance_count=1, instance_type=INSTANCE_TYPE, wait=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Testing the endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_set = pd.read_csv(INPUTS + 'test.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = test_set.copy()\n", "df = df.fillna(method='ffill')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d = (df.groupby('Sentence #')\n", " .apply(lambda x: list(x['Word']))\n", " .to_dict())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_list = []\n", "for (k, v) in d.items():\n", " article = {'id': k, 'sentence':' '.join(v)}\n", " test_list.append(article)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_list[:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_time = time.time()\n", "test_endpoint = predictor.predict(test_list[:1000])\n", "print(\"--- %s seconds ---\" % (time.time() - start_time))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_endpoint[-10:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running predictions for the whole dataset (example)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Endpoints are built for real-time inference and are by design meant to serve each request within a maximum of 60 seconds.\n", "To run the endpoint over a big dataset, we send the data in requests of 1000 sentences each to avoid long inferences that could make the endpoint time out."
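] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The commented cells below show one way to batch the full dataset through the endpoint. If you also need to invoke the endpoint from outside the SageMaker Python SDK (for example, from another application), the next cell is a minimal sketch using the low-level sagemaker-runtime boto3 client. It assumes the serving container accepts the same application/json payload of {'id', 'sentence'} records that predictor.predict sends above, and takes the endpoint name from predictor.endpoint_name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import boto3\n", "\n", "# Sketch: call the live endpoint with the low-level runtime client (assumes the\n", "# container accepts the same JSON payload used with predictor.predict above)\n", "runtime_client = boto3.client('sagemaker-runtime')\n", "\n", "response = runtime_client.invoke_endpoint(\n", "    EndpointName=predictor.endpoint_name,\n", "    ContentType='application/json',\n", "    Body=json.dumps(test_list[:5])\n", ")\n", "\n", "# The response body is a streaming object; read and decode it back into Python objects\n", "print(json.loads(response['Body'].read().decode('utf-8')))"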
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# train_set = pd.read_csv(INPUTS + 'train.csv')\n", "# val_set = pd.read_csv(INPUTS + 'val.csv')\n", "# df = pd.concat([train_set,val_set,test_set])\n", "# df = df.fillna(method='ffill')\n", "# d = (df.groupby('Sentence #')\n", "# .apply(lambda x: list(x['Word']))\n", "# .to_dict())\n", "# test_list = []\n", "# for (k, v) in d.items():\n", "# article = {'id': k, 'sentence':' '.join(v)}\n", "# test_list.append(article)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# start_time = time.time()\n", "# preds = []\n", "# # Step through the sentences 1000 at a time so no single request hits the endpoint timeout\n", "# for k in range(0, len(test_list), 1000):\n", "# preds.append(predictor.predict(test_list[k:k + 1000]))\n", "# print(\"--- %s seconds ---\" % (time.time() - start_time))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# preds_flat = [item for sublist in preds for item in sublist]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# preds_flat[-10:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# with open('data_with_entities.json', 'w', encoding='utf-8') as f:\n", "# json.dump(preds_flat, f, ensure_ascii=False, indent=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing output to S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import boto3 \n", "s3 = boto3.resource('s3')\n", "s3object = s3.Object(BUCKET, PREFIX + '/data_with_entities.json')\n", "\n", "s3object.put(\n", " Body=json.dumps(test_endpoint).encode('UTF-8')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Delete the endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# predictor.delete_endpoint()" ] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow2_p36", "language": "python", "name": "conda_tensorflow2_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }