{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bert text classification on SageMaker using PyTorch\n", "\n", "This uses the [dbpedia dataset](https://wiki.dbpedia.org/services-resources/dbpedia-data-set-2014#2) and BERT for text classification. The dataset csv looks like\n", "\n", "```text\n", "1,\"E. D. Abbott Ltd\",\" Abbott of Farnham E D Abbott Limited was a British coachbuilding business based in Farnham Surrey trading under that name from 1929. A major part of their output was under sub-contract to motor vehicle manufacturers. Their business closed in 1972.\"\n", "1,\"Schwan-Stabilo\",\" Schwan-STABILO is a German maker of pens for writing colouring and cosmetics as well as markers and highlighters for office use. It is the world's largest manufacturer of highlighter pens Stabilo Boss.\"\n", "1,\"Q-workshop\",\" Q-workshop is a Polish company located in Poznań that specializes in designand production of polyhedral dice and dice accessories for use in various games (role-playing gamesboard games and tabletop wargames). They also run an online retail store and maintainan active forum community.Q-workshop was established in 2001 by Patryk Strzelewicz – a student from Poznań. Initiallythe company sold its products via online auction services but in 2005 a website and online store wereestablished.\"\n", "```\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -r requirements_notebook.txt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys, os\n", "import logging\n", "\n", "sys.path.append(\"src\")\n", "\n", "logging.basicConfig(level=\"INFO\", handlers=[logging.StreamHandler(sys.stdout)],\n", " format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bucket and role set up" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker import get_execution_role\n", "sm_session = sagemaker.session.Session()\n", "role = get_execution_role()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_bucket = sm_session.default_bucket()\n", "\n", "data_bucket_prefix = \"bert-demo\"\n", "\n", "s3_uri_data = \"s3://{}/{}/data\".format(data_bucket, data_bucket_prefix)\n", "s3_uri_train = \"{}/{}\".format(s3_uri_data, \"train.csv\")\n", "s3_uri_val = \"{}/{}\".format(s3_uri_data, \"val.csv\")\n", "\n", "# Use a small data set if required for local testing..\n", "s3_uri_mini_data = \"s3://{}/{}/minidata\".format(data_bucket, data_bucket_prefix)\n", "s3_uri_mini_train = \"{}/{}\".format(s3_uri_mini_data, \"train.csv\")\n", "s3_uri_mini_val = \"{}/{}\".format(s3_uri_mini_data, \"val.csv\")\n", "\n", "s3_uri_classes = \"{}/{}\".format(s3_uri_data, \"classes.txt\")\n", "\n", "s3_uri_test = \"{}/{}\".format(s3_uri_data, \"test.csv\")\n", "\n", "s3_output_path = \"s3://{}/{}/output\".format(data_bucket, data_bucket_prefix)\n", "s3_code_path = \"s3://{}/{}/code\".format(data_bucket, data_bucket_prefix)\n", "s3_checkpoint = \"s3://{}/{}/checkpoint\".format(data_bucket, data_bucket_prefix)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prepare_dataset = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], 
"source": [ "tmp =\"tmp\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash -s \"$prepare_dataset\" \"$s3_uri_test\" \"$s3_uri_classes\" \"$tmp\"\n", " \n", "prepare_dataset=$1\n", "s3_test=$2\n", "s3_classes=$3\n", "tmp=$4\n", "\n", "if [ \"$prepare_dataset\" == \"True\" ]\n", "then \n", " echo \"Downloading data..\"\n", " wget https://github.com/saurabh3949/Text-Classification-Datasets/raw/master/dbpedia_csv.tar.gz -P ${tmp}\n", " tar -xzvf ${tmp}/dbpedia_csv.tar.gz\n", " mv dbpedia_csv ${tmp}\n", " \n", " ls -l ${tmp}/dbpedia_csv/\n", " cat ${tmp}/dbpedia_csv/classes.txt\n", " head -3 ${tmp}/dbpedia_csv/train.csv \n", " \n", " echo aws s3 cp ${tmp}/dbpedia_csv/test.csv ${s3_test}\n", " aws s3 cp ${tmp}/dbpedia_csv/test.csv ${s3_test}\n", " \n", " aws s3 cp ${tmp}/dbpedia_csv/classes.txt ${s3_classes}\n", " \n", "fi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Train val split" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "def train_val_split(data_file, train_file_name = None, val_file_name = None, val_ratio =.30, train_ratio = .70):\n", " with open(data_file, \"r\") as f:\n", " lines = f.readlines()\n", " \n", " train, val = train_test_split( lines, test_size=val_ratio, train_size = train_ratio ,random_state=42)\n", " \n", " train_file_name = train_file_name or os.path.join(os.path.dirname(data_file), \"train.csv\")\n", " val_file_name = val_file_name or os.path.join(os.path.dirname(data_file), \"val.csv\")\n", "\n", " \n", " with open(train_file_name, \"w\") as f:\n", " f.writelines(train)\n", " print(\"Wrote {} records to train\".format(len(train)))\n", " \n", " with open(val_file_name, \"w\") as f:\n", " f.writelines(val)\n", " print(\"Wrote {} records to validation\".format(len(val)))\n", " \n", " return train_file_name, val_file_name\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if prepare_dataset:\n", " from s3_util import S3Util\n", " \n", " s3util = S3Util()\n", " l_data_file = os.path.join(tmp, \"dbpedia_csv\", \"train.csv\")\n", " l_train, l_val = train_val_split(l_data_file)\n", " s3util.upload_file(l_train, s3_uri_train)\n", " s3util.upload_file(l_val, s3_uri_val)\n", " \n", " l_mini_train = os.path.join(os.path.dirname(l_data_file), \"mini_train.csv\")\n", " l_mini_val = os.path.join(os.path.dirname(l_data_file), \"mini_val.csv\") \n", " l_train, l_val = train_val_split(l_data_file, l_mini_train, l_mini_val, val_ratio = 0.001, train_ratio=0.01)\n", "\n", " s3util.upload_file(l_mini_train, s3_uri_mini_train)\n", " s3util.upload_file(l_mini_val, s3_uri_mini_val)\n", " \n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Clean up temp file..\n", "!rm -rf $tmp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train\n", "\n", "This shows you how to train BERT on SageMaker using SPOT instances" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs_full = {\n", " \"train\" : s3_uri_train,\n", " \"val\" : s3_uri_val,\n", " \"class\" : s3_uri_classes\n", "}\n", "\n", "inputs_sample = {\n", " \"train\" : s3_uri_mini_train,\n", " \"val\" : s3_uri_mini_val,\n", " \"class\" : s3_uri_classes\n", "}\n", "\n", "# Using the full dataset can take a while 4-5 hours. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs_full = {\n", "    \"train\": s3_uri_train,\n", "    \"val\": s3_uri_val,\n", "    \"class\": s3_uri_classes\n", "}\n", "\n", "inputs_sample = {\n", "    \"train\": s3_uri_mini_train,\n", "    \"val\": s3_uri_mini_val,\n", "    \"class\": s3_uri_classes\n", "}\n", "\n", "# Training on the full dataset can take 4-5 hours; use inputs_sample for a quick test\n", "inputs = inputs_sample" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm_localcheckpoint_dir = \"/opt/ml/checkpoints/\"" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "instance_type = \"ml.p3.8xlarge\"\n", "instance_type_gpu_map = {\"ml.p3.8xlarge\": 4, \"ml.p3.2xlarge\": 1, \"ml.p3.16xlarge\": 8}" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hp = {\n", "    \"epochs\": 10,\n", "    \"earlystoppingpatience\": 3,\n", "    # Increasing the batch size can cause CUDA OOM errors; increase gradient accumulation instead\n", "    \"batch\": 8 * instance_type_gpu_map[instance_type],\n", "    \"trainfile\": s3_uri_train.split(\"/\")[-1],\n", "    \"valfile\": s3_uri_val.split(\"/\")[-1],\n", "    \"classfile\": s3_uri_classes.split(\"/\")[-1],\n", "    # The number of steps to accumulate gradients over\n", "    \"gradaccumulation\": 4,\n", "    \"log-level\": \"INFO\",\n", "    # Limited by the model's maximum position embedding size; large values can also cause CUDA OOM errors\n", "    \"maxseqlen\": 512,\n", "    # Keep the learning rate small, as this is a pretrained model\n", "    \"lr\": 0.00001,\n", "    # Set finetune to 1 to update only the weights of the final classification layer\n", "    \"finetune\": 0,\n", "    \"checkpointdir\": sm_localcheckpoint_dir,\n", "    # Write a checkpoint once every n epochs\n", "    \"checkpointfreq\": 2\n", "}" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hp" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_definitions = [\n", "    {\"Name\": \"TrainLoss\", \"Regex\": \"###score: train_loss### (\\\d*[.]?\\\d*)\"},\n", "    {\"Name\": \"ValidationLoss\", \"Regex\": \"###score: val_loss### (\\\d*[.]?\\\d*)\"},\n", "    {\"Name\": \"TrainScore\", \"Regex\": \"###score: train_score### (\\\d*[.]?\\\d*)\"},\n", "    {\"Name\": \"ValidationScore\", \"Regex\": \"###score: val_score### (\\\d*[.]?\\\d*)\"}\n", "]" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set to True to train on managed Spot Instances\n", "use_spot = True\n", "train_max_run_secs = 2 * 24 * 60 * 60\n", "spot_wait_sec = 5 * 60\n", "max_wait_time_secs = train_max_run_secs + spot_wait_sec\n", "\n", "if not use_spot:\n", "    max_wait_time_secs = None\n", "\n", "# Local mode does not support spot; use the smaller dataset\n", "if instance_type == 'local':\n", "    use_spot = False\n", "    max_wait_time_secs = 0\n", "    wait = True\n", "    # Use the smaller dataset to run locally\n", "    inputs = inputs_sample\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_type = \"bert-classification\"\n", "base_name = \"{}\".format(job_type)" ] },
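{ "cell_type": "markdown", "metadata": {}, "source": [ "The `gradaccumulation` hyperparameter above keeps the per-step batch small while the gradients of several batches are accumulated before each optimizer update, which is the suggested way to avoid CUDA out-of-memory errors instead of raising `batch`. Conceptually the training loop does something like the sketch below; this is only an illustration of the pattern, not the actual code in `src/`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Conceptual sketch of gradient accumulation - illustrative only, not the training code in src/\n", "def train_one_epoch(model, loader, optimizer, loss_fn, grad_accumulation_steps):\n", "    model.train()\n", "    optimizer.zero_grad()\n", "    for step, (batch_inputs, batch_labels) in enumerate(loader):\n", "        loss = loss_fn(model(batch_inputs), batch_labels)\n", "        # Scale the loss so the accumulated gradient matches one large effective batch\n", "        (loss / grad_accumulation_steps).backward()\n", "        if (step + 1) % grad_accumulation_steps == 0:\n", "            optimizer.step()\n", "            optimizer.zero_grad()" ] },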
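{ "cell_type": "markdown", "metadata": {}, "source": [ "SageMaker extracts the training metrics from the job's log output using the `metric_definitions` regexes above, so the training script has to write matching lines to stdout. The cell below is purely illustrative (the values are hypothetical and the real logging happens inside the training script in `src/`); it only shows the line format the regexes expect." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only - shows the stdout line format the metric_definitions regexes match;\n", "# the real logging happens in the training script in src/\n", "train_loss, val_loss = 0.2513, 0.3021  # hypothetical values\n", "\n", "print(\"###score: train_loss### {}\".format(train_loss))\n", "print(\"###score: val_loss### {}\".format(val_loss))" ] },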
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch\n", "\n", "estimator = PyTorch(\n", "    # entry_point='main_train_k_fold.py',\n", "    entry_point='main.py',\n", "    source_dir='src',\n", "    role=role,\n", "    framework_version=\"1.4.0\",\n", "    py_version='py3',\n", "    instance_count=1,\n", "    instance_type=instance_type,\n", "    hyperparameters=hp,\n", "    output_path=s3_output_path,\n", "    metric_definitions=metric_definitions,\n", "    volume_size=30,\n", "    code_location=s3_code_path,\n", "    debugger_hook_config=False,\n", "    base_job_name=base_name,\n", "    use_spot_instances=use_spot,\n", "    max_run=train_max_run_secs,\n", "    max_wait=max_wait_time_secs,\n", "    checkpoint_s3_uri=s3_checkpoint,\n", "    checkpoint_local_path=sm_localcheckpoint_dir)\n", "\n", "estimator.fit(inputs, wait=True)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Deploy BERT model" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Inference container\n", "Ideally the serving container should already have all the required dependencies installed, to reduce start-up time and ensure that the runtime environment is consistent. This can be done with a custom Docker image.\n", "\n", "To keep this demo simple, we instead let the PyTorch serving container install the dependencies at start-up. As a result, some of the initial ping requests will fail until all dependencies are installed.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorchModel\n", "from sagemaker import get_execution_role\n", "\n", "role = get_execution_role()\n", "\n", "model_uri = estimator.model_data\n", "\n", "model = PyTorchModel(model_data=model_uri,\n", "                     role=role,\n", "                     py_version=\"py3\",\n", "                     framework_version='1.4.0',\n", "                     entry_point='serve.py',\n", "                     source_dir='src')\n", "\n", "predictor = model.deploy(initial_instance_count=1, instance_type='ml.p3.2xlarge')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Invoke the endpoint" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = [\"Q-workshop is a Polish company located in Poznań that specializes in design and production of polyhedral dice\",\n", "        \"E.T. is a sci-fi film directed by Steven Spielberg\"]" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "\n", "class TextSerDes:\n", "    # Sends the inputs as newline-separated UTF-8 text and parses the JSON response\n", "\n", "    def serialize(self, x):\n", "        # One input text per line\n", "        return \"\\n\".join(x).encode(\"utf-8\")\n", "\n", "    def deserialize(self, x, content_type):\n", "        # The endpoint returns a JSON body\n", "        return json.loads(x.read().decode(\"utf-8\"))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor.serializer = TextSerDes()\n", "predictor.deserializer = TextSerDes()\n", "\n", "response = predictor.predict(data,\n", "                             initial_args={\"Accept\": \"text/json\", \"ContentType\": \"text/csv\"})\n", "\n", "response" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Delete endpoint" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint()" ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p36", "language": "python", "name": "conda_pytorch_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }