{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Classification using SageMaker's BlazingText\n", "\n", "Text classification can be used to solve various use cases like sentiment analysis, spam detection, hashtag prediction, etc. This notebook demonstrates the use of the SageMaker built-in algorithm BlazingText to perform supervised multi-class text classification with single or multiple labels per sentence. BlazingText can train a model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with state-of-the-art deep learning text classification algorithms. BlazingText extends the fastText text classifier to leverage GPU acceleration using custom CUDA kernels.\n", "\n", "For this use case, the dataset consists of sample utterances for the intents of an Amazon Lex chatbot that manages support cases. The model built from these utterances will predict which intent a new utterance belongs to when the chatbot misses it. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "Let's start by specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training data and model artifacts. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region. \n", "- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker import get_execution_role\n", "import json\n", "import boto3\n", "\n", "sess = sagemaker.Session()\n", "\n", "role = get_execution_role()\n", "print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf\n", "\n", "bucket = sess.default_bucket() # Replace with your own bucket name if needed\n", "print(bucket)\n", "prefix = 'sagemaker/blazingtext/lex_text_classification' \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing\n", "\n", "Now we'll download the dataset from its GitHub repo and prepare it for training the text classification model. BlazingText expects a single preprocessed text file with space-separated tokens, where each line of the file contains a single sentence and the corresponding label(s) prefixed by \"\\__label\\__\".\n", "\n", "We need to preprocess the training data into **space-separated tokenized text** format that the `BlazingText` algorithm can consume. Also, as mentioned previously, the class label(s) should be prefixed with `__label__` and placed on the same line as the original sentence. The utterances in this dataset are already simple space-separated text, so the preprocessing below only prepends the intent label to each sentence. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import pandas as pd\n", "import boto3\n", "import json\n", "import sagemaker.amazon.common as smac\n", "from sagemaker.predictor import json_deserializer\n", "from random import shuffle" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the sample utterances dataset."
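 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before running the download, the cell below is a minimal sketch of the line format that BlazingText expects for supervised training. The intent label and utterance used here are hypothetical examples, not rows from the actual dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustration only: a hypothetical intent label and utterance, formatted the way\n", "# BlazingText expects each training line: __label__<label> <space-separated sentence>\n", "example_label = 'CheckCaseStatus'\n", "example_utterance = 'can you check the status of my case'\n", "\n", "formatted_line = '__label__' + example_label + ' ' + example_utterance\n", "print(formatted_line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now fetch the real training data and apply this same formatting to every row."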
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/rumiio/amazon-lex-support-bot/master/notebook/Training_Data.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have the raw data, we will load it into a pandas DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('Training_Data.txt', delimiter=',', skiprows=0, header=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Prefix each utterance with its intent label in the __label__<label> <sentence> format\n", "data_list = []\n", "for row in data.iterrows():\n", "    data_list.append('__label__' + str(row[1][0]) + ' ' + str(row[1][1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will shuffle the data once it is in the right format for the algorithm. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shuffle(data_list)\n", "data_list[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the prepared dataset to a file for training. Use half of the same dataset for validation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('utterances.train', 'w') as file:\n", "    for line in data_list:\n", "        file.write(line)\n", "        file.write('\\n')\n", "\n", "keep = .5  # use half of the training dataset for validation\n", "with open('utterances.validation', 'w') as file:\n", "    for line in data_list[:int(keep*len(data_list))]:\n", "        file.write(line)\n", "        file.write('\\n')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head -n 10 utterances.train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the data preprocessing is complete, we need to upload it to S3 so that it can be consumed by SageMaker to execute training jobs. We'll use the SageMaker Python SDK to upload these two files to the bucket and prefix location that we set above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "train_channel = prefix + '/train'\n", "validation_channel = prefix + '/validation'\n", "\n", "sess.upload_data(path='utterances.train', bucket=bucket, key_prefix=train_channel)\n", "sess.upload_data(path='utterances.validation', bucket=bucket, key_prefix=validation_channel)\n", "\n", "s3_train_data = 's3://{}/{}'.format(bucket, train_channel)\n", "s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, set up an output location in S3 where the model artifacts will be saved. These artifacts are the output of the algorithm's training job.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)\n", "s3_output_location" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training\n", "\n", "Now that we are done with all the setup that is needed, we are ready to train our text classification model. To begin, let us specify the container image of the built-in algorithm. You can use `get_image_uri()` to retrieve it. The container holds the BlazingText algorithm for the chosen region (in this case, the region in which you are running the notebook). 
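" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: this notebook uses the SageMaker Python SDK 1.x APIs (`get_image_uri`, `train_instance_count`, `sagemaker.session.s3_input`). If you are running SDK v2 or later, the container lookup is done with `sagemaker.image_uris.retrieve` instead; the guarded sketch below shows the equivalent call and is skipped automatically on SDK 1.x." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: equivalent container lookup on SageMaker Python SDK v2+ (skipped on SDK 1.x)\n", "import boto3\n", "\n", "try:\n", "    from sagemaker import image_uris\n", "    v2_container = image_uris.retrieve('blazingtext', boto3.Session().region_name, version='1')\n", "    print('SDK v2 BlazingText container:', v2_container)\n", "except ImportError:\n", "    print('sagemaker.image_uris not available; using the SDK 1.x get_image_uri call below')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With SDK 1.x, retrieve the BlazingText container for the current region as follows.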
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "region_name = boto3.Session().region_name\n", "\n", "container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, \"blazingtext\", \"latest\")\n", "print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's define the SageMaker Estimator with resource configurations and hyperparameters to train Text Classification on the dataset, using \"supervised\" mode on a c4.4xlarge instance.\n", "\n", "Please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bt_model = sagemaker.estimator.Estimator(container,\n", " role, \n", " train_instance_count=1, \n", " train_instance_type='ml.c4.4xlarge',\n", " train_volume_size = 1,\n", " train_max_run = 3600,\n", " input_mode= 'File',\n", " output_path=s3_output_location,\n", " sagemaker_session=sess)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bt_model.set_hyperparameters(mode=\"supervised\",\n", " epochs=15,\n", " min_count=2,\n", " learning_rate=0.001,\n", " vector_dim=8,\n", " early_stopping=True,\n", " evaluation=True,\n", " patience=4,\n", " min_epochs=2,\n", " word_ngrams=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the hyper-parameters are setup, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create the `sagemaker.session.s3_input` objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', \n", " content_type='text/plain', s3_data_type='S3Prefix')\n", "validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', \n", " content_type='text/plain', s3_data_type='S3Prefix')\n", "data_channels = {'train': train_data, 'validation': validation_data}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have our `Estimator` object, we have set the hyper-parameters for this object and we have our data channels linked with the algorithm. The only remaining thing to do is to train the algorithm. The following command will start the training. Training the algorithm involves a few steps. First, the instance that we requested while creating the `Estimator` classes is provisioned and is setup with the appropriate libraries. Then, the data from our channels are downloaded into the instance. Once this is done, the training job begins. The provisioning and data downloading will take some time, depending on the size of the data. \n", "\n", "The data logs will print out accuracy on the validation data. This metric is a proxy for the quality of the algorithm. \n", "\n", "Once the job has finished a \"Job complete\" message will be printed. The trained model can be found in the S3 bucket that was setup as `output_path` in the estimator." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bt_model.fit(inputs=data_channels, logs=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hosting and Inference\n", "\n", "Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same type of instance that we used to train. Because endpoints stay up and running for long periods, it's advisable to choose a cheaper instance type for inference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text_classifier = bt_model.deploy(initial_instance_count=1,\n", "                                  instance_type='ml.t2.large')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use JSON format for inference\n", "BlazingText supports `application/json` as the content-type for inference. The payload passed to the endpoint should contain the list of sentences under the key \"**instances**\".\n", "\n", "By default, the model returns only one prediction, the one with the highest probability. To retrieve the top k predictions, you can set `k` in the configuration as shown below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_utterance = \"can you check the status?\"  # or try \"check something\"\n", "\n", "payload = {\"instances\": [test_utterance],\n", "           \"configuration\": {\"k\": 2}}\n", "\n", "response = text_classifier.predict(json.dumps(payload))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "predictions = json.loads(response)\n", "print(json.dumps(predictions, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stop / Close the Endpoint (Optional)\n", "Finally, we should delete the endpoint before we close the notebook if we don't need to keep it running for serving real-time predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sess.delete_endpoint(text_classifier.endpoint)" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }