# Find Similar Industrial Accidents using Sentence Transformers in PyTorch

In this notebook, we will fine-tune the pretrained `bert-base-uncased` model from the `HuggingFace Library` in an unsupervised fashion on `Industrial labor accident data`. The objective is to find similar accident reports based on the description of the incident, using the fine-tuned `bert-base-uncased` model.

Note: This notebook works best with an `ml.m5.xlarge (4 vCPU + 16 GiB)` instance and the `Python3 (PyTorch 1.6 Python 3.6 CPU Optimized)` kernel.

## Setup
Update the `sagemaker` package and restart the kernel.

In [None]:
!pip install -U sagemaker -q
# !pip install sentence_transformers -q
# !pip install ipywidgets

In [None]:
import sagemaker
sagemaker.__version__

In [None]:
import boto3, os, sagemaker
import json

sess = sagemaker.Session()
bucket = sess.default_bucket() 
prefix = 'sentencetransformer/input'
role = sagemaker.get_execution_role()

## Dataset

Download the dataset from: https://www.kaggle.com/ihmstefanini/industrial-safety-and-health-analytics-database and upload the downloaded csv file to the notebook. 

The database contains records of accidents from 12 different plants in 3 different countries. Each row in the data is one occurrence of an accident.

**Columns description**
- Data: timestamp or time/date information
- Countries: the country in which the accident occurred (anonymized)
- Local: the city where the manufacturing plant is located (anonymized)
- Industry sector: which sector the plant belongs to
- Accident level: from I to VI, it registers how severe the accident was (I means not severe, VI means very severe)
- Potential Accident Level: depending on the Accident Level, the database also registers how severe the accident could have been (due to other factors involved in the accident)
- Genre: whether the person is male or female
- Employee or Third Party: if the injured person is an employee or a third party
- Critical Risk: some description of the risk involved in the accident
- Description: Detailed description of how the accident happened.

In [None]:
import pandas as pd
df_data = pd.read_csv('IHMStefanini_industrial_safety_and_health_database_with_accidents_description.csv', index_col=0)

In [None]:
df_data.head()

### Upload data to s3

In [None]:
# Upload the local CSV to s3://<bucket>/<prefix>/train.csv
local_csv = 'IHMStefanini_industrial_safety_and_health_database_with_accidents_description.csv'
s3_key = '{}/train.csv'.format(prefix)  # S3 keys always use forward slashes
boto3.Session().resource('s3').Bucket(bucket).Object(s3_key).upload_file(local_csv)
training_input_path = "s3://{}/{}".format(bucket, s3_key)
training_input_path

## Fine-Tuning an Unsupervised Sentence Transformer on Your Dataset

### Setting hyper-parameters

In [None]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,
                 'train_batch_size': 8,
                 'model_name':'bert-base-uncased'
                 }

The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, including the following:

- **SM_MODEL_DIR**: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting. SM_MODEL_DIR is always set to /opt/ml/model.

- **SM_NUM_GPUS**: An integer representing the number of GPUs available to the host.

- **SM_CHANNEL_XXXX**: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator's fit call, named train and test, the environment variables SM_CHANNEL_TRAIN and SM_CHANNEL_TEST are set.

You can find a full list of the exposed environment variables [here](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md).

The hyperparameters defined above are given to the HuggingFace Estimator, which passes them to the training script as named arguments that can be processed with the [ArgumentParser()](https://huggingface.co/docs/sagemaker/train#create-an-huggingface-estimator).
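As a rough illustration of how a training script can pick up these values, the sketch below parses the three hyperparameters from this notebook and falls back to the SageMaker environment variables for paths and GPU count. It is not the actual content of `unsupervised.py`; the function name `parse_args` and the argument names `--model_dir`, `--train_dir`, and `--n_gpus` are assumptions made for this example.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker environment settings (hypothetical helper)."""
    parser = argparse.ArgumentParser()

    # Hyperparameters sent by the estimator as --epochs, --train_batch_size, --model_name
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=8)
    parser.add_argument("--model_name", type=str, default="bert-base-uncased")

    # Values exposed by the training environment; the fallbacks only matter
    # when the script runs outside of SageMaker (for example, during local tests).
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--n_gpus", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", "0")))

    # parse_known_args ignores any extra arguments the platform may append
    args, _ = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args()
    print(args.model_name, args.epochs, args.train_batch_size, args.train_dir)
```

With the `{'train': ...}` channel used in this notebook, `SM_CHANNEL_TRAIN` would point to the directory containing `train.csv`.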

In [None]:
!pygmentize ./code/unsupervised.py

In [None]:
huggingface_estimator = HuggingFace(entry_point='unsupervised.py',
                            source_dir='./code',
                            instance_type='ml.p3.2xlarge', # GPU supported by Hugging Face
                            instance_count=1,
                            role=role,
                            transformers_version='4.6',
                            pytorch_version='1.7',
                            py_version='py36',
                            hyperparameters = hyperparameters)

In [None]:
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path})

In [None]:
huggingface_estimator.model_data

## Deploy the Fine-Tuned Sentence Transformer

We will deploy the `Sentence Transformer` model using a `SageMaker HuggingFaceModel` object with the `inference.py` script as the entry point.

Take a look at the `inference.py` script in the `code` directory and set the bucket name where your training data is stored. Also, don't forget to update the `s3key`. This data acts as the source data against which the target sentence is compared. In this case, given the description of an incident, the model finds similar accident reports.
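To make the idea of comparing a target sentence against source data concrete, here is a minimal, dependency-free sketch that ranks source embeddings by cosine similarity to a target embedding. It is an illustration only and is not taken from `inference.py`: the function names `cosine_similarity` and `top_k_similar` and the toy vectors are hypothetical, and a real script would obtain the embeddings from the fine-tuned model and would likely use a vectorized library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus vectors closest to query_vec."""
    scored = [(idx, cosine_similarity(query_vec, vec))
              for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" standing in for accident descriptions
    corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k_similar([1.0, 0.0], corpus, k=2))
```

The returned indices can then be used to look up the matching rows (for example, the `Description` column) in the source data.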

In [None]:
!pygmentize ./code/inference.py

In [None]:
from sagemaker.huggingface.model import HuggingFaceModel
sentence_transformer = HuggingFaceModel(model_data = huggingface_estimator.model_data, 
                                    role = role, 
                                    source_dir = 'code',
                                    entry_point = 'inference.py', 
                                    transformers_version='4.6',
                                    pytorch_version='1.7',
                                    py_version='py36',)

In [None]:
predictor = sentence_transformer.deploy(initial_instance_count = 1, instance_type = 'ml.g4dn.2xlarge')

In [None]:
prediction = predictor.predict("they saw the bee carton, the reaction was to move away from the box as quickly as possible to avoid the stings, they ran about 50 meters, looking for a safe area, to exit the radius of attack of the bees, but the S.S. and Breno), were attacked and consequently they suffered 02 stings, in the belly and Jehovah in the hand, verified that there was no type of allergic reaction, returned with the normal activities.")

In [None]:
result = json.loads(prediction)
result = result['result']
result

## References
- https://www.kaggle.com/ihmstefanini/industrial-safety-and-health-analytics-database
- https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html
- https://huggingface.co/bert-base-uncased
- https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html