# Named Entity Recognition by fine-tuning Keras BERT on SageMaker

## Setup 

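We begin with the necessary imports and retrieve the IAM execution role, which provides the permissions needed by the training job and the endpoint. The Amazon SageMaker session, which helps perform tasks such as locating the default bucket, is created in the next cell.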

In [None]:
import os
import json
import time
from datetime import datetime

import numpy as np
import pandas as pd

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow
from sagemaker.tensorflow.serving import TensorFlowModel
import logging

role = get_execution_role()

%matplotlib inline

### SageMaker variables and S3 bucket

In [None]:
# Create a SageMaker session
sagemaker_session = sagemaker.Session()

# Use the SageMaker default bucket
BUCKET = sagemaker_session.default_bucket()
PREFIX = 'graph-nerc-blog'  # Feel free to change this
DATA_FOLDER = 'tagged-data'

# Use the session's default region (the region this notebook runs in); change if needed
REGION = sagemaker_session.boto_region_name

# S3 location where the train, validation and test CSV files are stored
INPUTS = 's3://{}/{}/{}/'.format(BUCKET, PREFIX, DATA_FOLDER)

print("Using region: {}".format(REGION))
print("Bucket: {}".format(BUCKET))
print("Using input path: {}".format(INPUTS))

# Downloading dataset

We will use the Kaggle entity-annotated corpus, available at https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus

Downloading it requires a Kaggle account.
Once the zip archive is downloaded, unzip it locally and upload the file ner_dataset.csv to the folder containing this notebook (notebooks/).

# 1. Data exploration and preparation
The dataset consists of 47959 news article sentences (1048575 words) with tagged entities representing:
- geo = Geographical Entity
- org = Organization
- per = Person
- gpe = Geopolitical Entity
- tim = Time indicator
- art = Artifact
- eve = Event
- nat = Natural Phenomenon
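
To make the tag scheme concrete, here is a minimal sketch of how per-word tags can be grouped back into entities. It assumes the Tag column follows the IOB convention, where `B-geo` opens a geographical entity, `I-geo` continues it and `O` marks a word outside any entity. The function name `extract_entities` and the example sentence are made up for illustration; they are not taken from the dataset or from the code/ folder.

```python
def extract_entities(words, tags):
    """Group IOB-tagged words into (entity_type, text) pairs."""
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_words.append(word)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


# Made-up example sentence, not a row from ner_dataset.csv.
words = ["Ada", "Lovelace", "visited", "Paris", "in", "June"]
tags = ["B-per", "I-per", "O", "B-geo", "O", "B-tim"]
entities = extract_entities(words, tags)
print(entities)
```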

In [None]:
ner_dataset = pd.read_csv('ner_dataset.csv', encoding = 'latin')

In [None]:
# Here is an example sentence. We will only be using the Sentence #, Word and Tag columns
ner_dataset.head(24)

In [None]:
# The distinct tags present in the data
ner_dataset.Tag.unique()

In [None]:
# Label any missing tag as 'O', i.e. outside any entity
ner_dataset.Tag = ner_dataset.Tag.fillna('O')

### Split data into train, validation and test sets

We split the data into train, validation and test sets: the first 45000 sentences for training, the next 2000 sentences for validation and the last 959 sentences for testing.

In [None]:
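# 'Sentence #' is only filled in on the first word of each sentence,
# so the non-null rows give the starting row of every sentence
index = ner_dataset['Sentence #'].index[~ner_dataset['Sentence #'].isna()].values.tolist()

In [None]:
train_index = index[45000]  # first row of the 45001st sentence: end of the training set
val_index = index[47000]    # first row of the 47001st sentence: end of the validation set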

In [None]:
train_df = ner_dataset[:train_index]
val_df = ner_dataset[train_index:val_index]
test_df = ner_dataset[val_index:]
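
The split above cuts on sentence boundaries rather than on row counts. The following sketch reproduces the idea on a tiny made-up column using plain Python lists, so the boundaries are easy to check by hand. The helper name `split_by_sentence` and the toy data are assumptions for illustration only.

```python
def split_by_sentence(sentence_col, n_train, n_val):
    """Return (start, end) row ranges for train, validation and test.

    sentence_col holds a label on the first word of each sentence and
    None on every other word. At least one sentence must be left over
    for the test set.
    """
    # Row positions where a new sentence starts.
    starts = [i for i, label in enumerate(sentence_col) if label is not None]
    train_end = starts[n_train]          # first row of sentence n_train + 1
    val_end = starts[n_train + n_val]    # first row of the first test sentence
    return (0, train_end), (train_end, val_end), (val_end, len(sentence_col))


# Toy column: four sentences of 3, 2, 4 and 2 words.
col = ["Sentence: 1", None, None,
       "Sentence: 2", None,
       "Sentence: 3", None, None, None,
       "Sentence: 4", None]
train_rows, val_rows, test_rows = split_by_sentence(col, n_train=2, n_val=1)
print(train_rows, val_rows, test_rows)
```

Here the first two sentences (rows 0 to 4) go to training, the third (rows 5 to 8) to validation and the last (rows 9 and 10) to testing, so no sentence is cut in half.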

### Save data to S3

In [None]:
# Writing directly to an s3:// path with pandas requires the s3fs package
train_df.to_csv(INPUTS + 'train.csv')
val_df.to_csv(INPUTS + 'val.csv')
test_df.to_csv(INPUTS + 'test.csv')

# 2. Training BERT model using SageMaker

All the code for fine-tuning Keras BERT for Named Entity Recognition is in the folder code/
The folder contains the train.py script, which is executed within a SageMaker training job to launch the training. train.py imports modules found in code/source/

In [None]:
!pygmentize ../code/train.py

The training data are the outputs of part 1: Data exploration and preparation

**NOTE: If you change where you save the train, validation and test CSV files, please reflect those changes in the INPUTS variable.**

## Single Training job

### Job name and instance type

In [None]:
JOB_NAME = 'ner-bert-keras'
INSTANCE_TYPE = 'ml.p3.2xlarge'
# INSTANCE_TYPE = "local_gpu"

### Hyperparameters:

In [None]:
EPOCHS = 20
BATCH_SIZE = 16
MAX_SEQUENCE_LENGTH = 64  # Input length for BERT, in tokens (the self-attention part of the training cost grows quadratically with this length)
DROP_OUT = 0.1
LEARNING_RATE = 4.0e-05
BERT_PATH = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
OUTPUT_PATH = 's3://{}/{}/training-output/'.format(BUCKET,PREFIX)
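
To give a feel for why the sequence length matters, the sketch below counts the attention scores computed per example in one forward pass. It assumes 12 layers and 12 attention heads, as suggested by the `L-12` and `A-12` parts of the model name in BERT_PATH, and it only covers the self-attention score matrices. The rest of the network scales linearly with the sequence length, so actual training time grows less steeply than this count. The function name `attention_score_entries` is hypothetical.

```python
def attention_score_entries(seq_len, num_heads=12, num_layers=12):
    """Attention scores computed per example in one forward pass.

    Each head in each layer compares every token with every other token,
    which gives a seq_len x seq_len score matrix.
    """
    return num_layers * num_heads * seq_len * seq_len


baseline = attention_score_entries(64)
for seq_len in (64, 128, 256, 512):
    ratio = attention_score_entries(seq_len) // baseline
    print("seq_len={:>3}: {}x the attention scores of seq_len=64".format(seq_len, ratio))
```

Doubling the sequence length multiplies this count by four, which is why a short MAX_SEQUENCE_LENGTH is attractive when the sentences allow it.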

### Defining training job

By providing *framework_version* "2.3" and *py_version* "py37" to the TensorFlow object, we use a managed container image provided by AWS and hosted in Amazon ECR.
When training on a GPU instance in the region eu-west-1, this is equivalent to providing the explicit image_uri: *training_image_uri = "763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04".*

With *framework_version* and *py_version*, the TensorFlow (Estimator) object selects the right image_uri for you, depending on the region of your SageMaker session and the type of instance used for training.
Refer to https://github.com/aws/deep-learning-containers/blob/master/available_images.md for more details.
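
If you prefer the explicit image_uri, the sketch below shows how the example URI is put together from its parts. The helper `build_training_image_uri` is hypothetical and only formats a string; it does not check that the image exists. The account ID and tag are the ones from the example above, and some regions use a different account ID, so check the list linked above before relying on the result.

```python
def build_training_image_uri(region, tag,
                             account="763104351884",
                             repository="tensorflow-training"):
    """Format an ECR image URI from its parts (no validation is done)."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, repository, tag)


# Reproduces the eu-west-1 GPU example given in the text.
training_image_uri = build_training_image_uri(
    region="eu-west-1",
    tag="2.3.1-gpu-py37-cu110-ubuntu18.04",
)
print(training_image_uri)
```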

In [None]:
hyperparameters = {
    'epochs': EPOCHS,
    'batch_size': BATCH_SIZE,
    'max_sequence_length': MAX_SEQUENCE_LENGTH,
    'drop_out': DROP_OUT,
    'learning_rate': LEARNING_RATE,
    'bert_path': BERT_PATH,
}

# Use either framework_version and py_version or an explicit image_uri
estimator = TensorFlow(
    base_job_name=JOB_NAME,
    source_dir='../code',
    entry_point='train.py',
    role=role,
    framework_version='2.3',
    py_version='py37',
    # image_uri=training_image_uri,
    hyperparameters=hyperparameters,
    instance_count=1,
    script_mode=True,
    metric_definitions=[
        {'Name': 'train loss', 'Regex': 'loss: (.*?) -'},
        {'Name': 'train accuracy', 'Regex': ' accuracy: (.*?) -'},
        {'Name': 'val loss', 'Regex': 'val_loss: (.*?) -'},
        {'Name': 'val accuracy', 'Regex': 'val_accuracy: (.*?)$'},
    ],
    output_path=OUTPUT_PATH,
    instance_type=INSTANCE_TYPE,
)
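
The regular expressions in metric_definitions are applied to the training logs, so it can help to try them on a sample line before launching a job. The sketch below does this with Python's `re` module. The log line is a hypothetical end-of-epoch line written in the style of Keras progress output, not a line copied from a real run, and the exact format printed by train.py may differ.

```python
import re

metric_regexes = {
    "train loss": r"loss: (.*?) -",
    "train accuracy": r" accuracy: (.*?) -",
    "val loss": r"val_loss: (.*?) -",
    "val accuracy": r"val_accuracy: (.*?)$",
}

# Hypothetical end-of-epoch log line (values are made up).
log_line = ("2813/2813 [==============================] - 500s 178ms/step"
            " - loss: 0.0412 - accuracy: 0.9871"
            " - val_loss: 0.0350 - val_accuracy: 0.9890")

# First match of each pattern on the sample line.
captured = {name: re.search(pattern, log_line).group(1)
            for name, pattern in metric_regexes.items()}
print(captured)

# The 'train loss' pattern has no anchor, so it also matches inside 'val_loss'.
all_loss_matches = re.findall(metric_regexes["train loss"], log_line)
print(all_loss_matches)
```

On this sample line the first match of each pattern is the intended value, but the unanchored `loss:` pattern also matches the validation loss. The leading space in the `accuracy` pattern is what keeps it from matching `val_accuracy`.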

In [None]:
# All three channels point to the same S3 folder, which holds train.csv, val.csv and test.csv
REMOTE_INPUTS = {
    'train': INPUTS,
    'validation': INPUTS,
    'eval': INPUTS,
}

In [None]:
dt = datetime.now()
estimator.fit(REMOTE_INPUTS, wait=False)  # Set wait=True if you want to see the logs here

Training can take between 40 minutes and 1 hour. Run the following cells to check the status of the job. Once the status is 'Completed', you can go ahead and deploy an inference endpoint.

In [None]:
print(estimator.model_data)

In [None]:
sm_client = boto3.client('sagemaker')

In [None]:
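# Look up the job through the estimator's public attribute rather than the private _current_job_name
response = sm_client.describe_training_job(
    TrainingJobName=estimator.latest_training_job.name
)
response.get('TrainingJobStatus')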
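
Instead of re-running the status cell by hand, the check can be wrapped in a small polling loop. The sketch below is one possible way to do it. The helper `wait_for_terminal_status` is hypothetical, and it takes the status lookup as a function so that it can be demonstrated here with a fake sequence of statuses instead of a real training job.

```python
import time

# Statuses after which a training job will not change state any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_status(get_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal status.

    Raises TimeoutError if no terminal status is seen after max_polls calls.
    """
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("Job still in status {!r} after {} polls".format(status, max_polls))


# Demonstration with a fake status sequence instead of a real job.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

In the notebook, `get_status` could be `lambda: sm_client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']`, where `job_name` is the name of the training job started above. Only continue to the deployment section if the returned status is 'Completed'.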

### Run the next cells only once TrainingJobStatus response is 'Completed'

# 3. Deploy an Inference Endpoint

In [None]:
!pygmentize ../code/inference.py

In [None]:
MODEL_ARTEFACTS_S3_LOCATION = response.get('ModelArtifacts').get('S3ModelArtifacts')

INSTANCE_TYPE = "ml.g4dn.xlarge"

In [None]:
print(MODEL_ARTEFACTS_S3_LOCATION)

In [None]:
# Use either framework_version or an explicit image_uri.
# Refer to https://github.com/aws/deep-learning-containers/blob/master/available_images.md for more details
model = TensorFlowModel(
    entry_point='inference.py',
    source_dir='../code',
    framework_version='2.3',
    # image_uri=inference_image_uri,
    role=role,
    model_data=MODEL_ARTEFACTS_S3_LOCATION,
    sagemaker_session=sagemaker_session,
    env={'SAGEMAKER_MODEL_SERVER_TIMEOUT': '300'},
)

In [None]:
predictor = model.deploy(initial_instance_count=1, instance_type=INSTANCE_TYPE, wait=True)

### Testing the endpoint

In [None]:
test_set = pd.read_csv(INPUTS + 'test.csv')

In [None]:
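df = test_set.copy()
# Forward-fill so that every word carries its sentence label
# (ffill() replaces the deprecated fillna(method='ffill'))
df = df.ffill()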

In [None]:
d = (df.groupby('Sentence #')
       .apply(lambda x: list(x['Word']))
       .to_dict())

In [None]:
test_list = []
for (k, v) in d.items():
    article = {'id': k, 'sentence': ' '.join(v)}
    test_list.append(article)
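
The cells above turn one row per word into one record per sentence. The sketch below shows the same transformation in plain Python on a few made-up rows, which makes the forward-fill step explicit. The helper name `rows_to_sentences` is hypothetical. Unlike the pandas version, which returns groups sorted by their key, this sketch keeps the sentences in file order.

```python
def rows_to_sentences(rows):
    """Turn (sentence_label_or_None, word) rows into {'id', 'sentence'} records."""
    grouped = {}
    current = None
    for label, word in rows:
        if label is not None:
            # Only the first word of a sentence carries a label;
            # remember it for the following words (forward-fill).
            current = label
        grouped.setdefault(current, []).append(word)
    return [{"id": key, "sentence": " ".join(words)} for key, words in grouped.items()]


# Made-up rows in the layout of the dataset: label on the first word only.
rows = [
    ("Sentence: 1", "The"), (None, "meeting"), (None, "ended"),
    ("Sentence: 2", "Rain"), (None, "followed"),
]
records = rows_to_sentences(rows)
print(records)
```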

In [None]:
test_list[:10]

In [None]:
start_time = time.time()
test_endpoint = predictor.predict(test_list[:1000])
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
test_endpoint[-10:]

### Running predictions for the whole dataset (example)

Endpoints are built for real-time inference and are by design limited to a maximum of 60 seconds per request.
To test the endpoint on a large dataset, we send the data in requests of 1000 sentences each, which avoids long inferences that could make the endpoint time out.
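
A minimal sketch of the batching step, assuming a list of sentence records like test_list. The helper name `chunked` is hypothetical. Stepping through the list with `range(0, len(items), size)` keeps the final, partial batch, which a rounded batch count can miss.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items, including the last partial one."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder records standing in for test_list.
sentences = [{"id": i, "sentence": "sentence {}".format(i)} for i in range(2500)]
batch_sizes = [len(batch) for batch in chunked(sentences, 1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

Each batch can then be passed to `predictor.predict` in turn and the results concatenated.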

In [None]:
# train_set = pd.read_csv(INPUTS + 'train.csv')
# val_set = pd.read_csv(INPUTS + 'val.csv')
# df = pd.concat([train_set, val_set, test_set])
# df = df.ffill()
# d = (df.groupby('Sentence #')
#        .apply(lambda x: list(x['Word']))
#        .to_dict())
# test_list = []
# for (k, v) in d.items():
#     article = {'id': k, 'sentence': ' '.join(v)}
#     test_list.append(article)

In [None]:
# start_time = time.time()
# preds = []
# # Step through the list 1000 sentences at a time, including the last partial batch
# for start in range(0, len(test_list), 1000):
#     preds.append(predictor.predict(test_list[start:start + 1000]))
# print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
# preds_flat = [item for sublist in preds for item in sublist]

In [None]:
# preds_flat[-10:]

In [None]:
# with open('data_with_entities.json', 'w', encoding='utf-8') as f:
#     json.dump(preds_flat, f, ensure_ascii=False, indent=4)

### Writing output to S3

In [None]:
# json and boto3 are already imported in the Setup section
s3 = boto3.resource('s3')
s3object = s3.Object(BUCKET, PREFIX + '/data_with_entities.json')

# Serialize the predictions to JSON and upload them as UTF-8 bytes
s3object.put(
    Body=json.dumps(test_endpoint).encode('utf-8')
)

### Delete the endpoint

In [None]:
# Uncomment to delete the endpoint once you are done, so that it stops incurring charges
# predictor.delete_endpoint()