{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Deploying GPT-2 and GPT-J" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will be using Hugging Face models and SageMaker Hugging Face-specific API's to deploy both GPT-2 and GPT-J. We will also showcase how to deploy what would could be GPT2 models fine-tuned on different datasets to the same SageMaker instance as a Multi Model Endpoint. This will allow you to get real-time predictions from several models, while only paying for one running endpoint instance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*****\n", "## Deploying GTP-2 to SageMaker Multi-Model Endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U transformers\n", "!pip install -U sagemaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get sagemaker session, role and default bucket\n", "If you are going to use Sagemaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for Sagemaker. You can find more about this [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "\n", "sess = sagemaker.Session()\n", "# sagemaker session bucket -> used for uploading data, models and logs\n", "# sagemaker will automatically create this bucket if it not exists\n", "sagemaker_session_bucket=None\n", "if sagemaker_session_bucket is None and sess is not None:\n", " # set to default bucket if a bucket name is not given\n", " sagemaker_session_bucket = sess.default_bucket()\n", "\n", "try:\n", " role = sagemaker.get_execution_role()\n", "except ValueError:\n", " iam = boto3.client('iam')\n", " role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n", "\n", "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n", "region = sess.boto_region_name\n", "sm_client = boto3.client('sagemaker')\n", "\n", "print(f\"sagemaker role arn: {role}\")\n", "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", "print(f\"sagemaker session region: {region}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load GPT-2 model and tokenizer, save them to the same folder with Transformers `save_pretrained` utility " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import transformers \n", "from transformers import AutoTokenizer, AutoModelForCausalLM\n", "\n", "tokenizer = AutoTokenizer.from_pretrained('gpt2')\n", "model = AutoModelForCausalLM.from_pretrained('gpt2')\n", "\n", "model.save_pretrained('gpt2-model/')\n", "tokenizer.save_pretrained('gpt2-model/')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This cell is meant to test that the model can be loaded from the local artifact\n", "# model = AutoModelForCausalLM.from_pretrained('gpt2-model/')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test model generation, by generating 5 different sequences for the same prompt." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.eval()\n", "\n", "text = \"A rose by any other name would smell as sweet, by William Shakespeare.\"\n", "input_ids = tokenizer.encode(text, return_tensors = 'pt')\n", "\n", "sample_outputs = model.generate(input_ids,\n", " do_sample = True, \n", " max_length = 70,\n", " num_return_sequences = 5) #to test how long we can generate and it be coh\n", "\n", "print(\"Output:\\n\" + 100 * '-')\n", "for i, sample_output in enumerate(sample_outputs):\n", " print(\"{}: {}...\".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))\n", " print('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tar model and tokenizer artifacts, upload to S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tarfile \n", "\n", "with tarfile.open('gpt2-model.tar.gz', 'w:gz') as f:\n", " f.add('gpt2-model/',arcname='.')\n", "f.close()\n", "\n", "prefix = 'gpt2-hf-workshop/gpt2-test'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check out the file contents and structure of the model.tar.gz artifact." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! tar -ztvf gpt2-model.tar.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will upload the same model package twice with different names, to simulate deploying 2 models to the same endpoint." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! aws s3 cp gpt2-model.tar.gz s3://\"$sagemaker_session_bucket\"/\"$prefix\"/gpt2-model1.tar.gz\n", "! aws s3 cp gpt2-model.tar.gz s3://\"$sagemaker_session_bucket\"/\"$prefix\"/gpt2-model2.tar.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get image URI for Hugging Face inference Deep Learning Container" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker import image_uris\n", "\n", "hf_inference_dlc = image_uris.retrieve(framework='huggingface', \n", " region=region, \n", " version='4.12.3', \n", " image_scope='inference', \n", " base_framework_version='pytorch1.9.1', \n", " py_version='py38', \n", " container_version='ubuntu20.04', \n", " instance_type='ml.c5.xlarge')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use `MultiDataModel`to setup a multi-model endpoint definition\n", "By setting the `HF_TASK` environment variable, we avoid having to write and test our own inference code. Depending on the task and model you choose, the Hugging Face inference Container will run the appropriate code by default. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.multidatamodel import MultiDataModel\n", "from sagemaker.predictor import Predictor\n", "\n", "hub = {\n", " 'HF_TASK':'text-generation'\n", "}\n", "\n", "mme = MultiDataModel(\n", " name='gpt2-models',\n", " model_data_prefix=f's3://{sagemaker_session_bucket}/{prefix}/',\n", " image_uri=hf_inference_dlc,\n", " env=hub,\n", " predictor_cls=Predictor,\n", " role=role,\n", " sagemaker_session=sess,\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that our model object has already \"registered\" the model artifacts we uploaded to S3 under the `model_data_prefix`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for model in mme.list_models():\n", " print(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deploy Multi-Model Endpoint and send inference requests to both models" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "from sagemaker.serializers import JSONSerializer\n", "from sagemaker.deserializers import JSONDeserializer\n", "\n", "endpoint_name_gpt2 = 'mme-gpt2-'+datetime.datetime.now().strftime(\n", " \"%Y-%m-%d-%H-%M-%S\"\n", ")\n", "\n", "predictor_gpt2 = mme.deploy(\n", " initial_instance_count=1,\n", " instance_type='ml.c5.xlarge',\n", " serializer=JSONSerializer(),\n", " deserializer=JSONDeserializer(),\n", " endpoint_name='mme-gpt2'\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now get predictions from both models; the first request made to each model will take longer than the subsequent, as the model will be loaded from S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor_gpt2.predict({'inputs':'A rose by any other name.'},\n", " target_model='gpt2-model1.tar.gz')[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor_gpt2.predict({'inputs':'A rose by any other name.'},\n", " target_model='gpt2-model2.tar.gz')[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add new model to endpoint\n", "To add a new model to our multi-model endpoint, we only have to upload a new model artifact to the same prefix where we uploaded the other models to. You will be able to load and get inferences from this new model as soon as it is uploded to S3. We will again load the same artifact we previously packaged, for demonstration purposes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! aws s3 cp gpt2-model.tar.gz s3://\"$sagemaker_session_bucket\"/\"$prefix\"/gpt2-model3.tar.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor_gpt2.predict({'inputs':'A rose by any other name.'},\n", " target_model='gpt2-model3.tar.gz')[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "********************************************************************************************************************************************\n", "********************************************************************************************************************************************\n", "\n", "\n", "# Deploying GPT-J to SageMaker Endpoint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clone sample repo and run model preparation script\n", "Hugging Face has solidified best practices for deploying GPT-J on Sagemaker in this [repository](https://github.com/philschmid/amazon-sagemaker-gpt-j-sample). Namely, PyTorch utilities are directly used to save the model to disk, instead of `.save_pretrained()`. On deployment, this helps in reducing model loading time by 10x. Check out this [blog post](https://huggingface.co/blog/gptj-sagemaker) to learn more." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "git clone https://github.com/philschmid/amazon-sagemaker-gpt-j-sample.git\n", "\n", "mv amazon-sagemaker-gpt-j-sample/convert_gptj.py \\\n", " amazon-sagemaker-gpt-j-sample/requirements.txt \\\n", " amazon-sagemaker-gpt-j-sample/code/ .\n", "\n", "rm -r amazon-sagemaker-gpt-j-sample/\n", "pip install -r requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `convert_gpj.py` script will save the model, tokenizer and inference script to disk, create a tar file with those artifacts, and upload them to an S3 bucket of our choice, under the `gpt-j` prefix. We also directly get the S3 URI for our model artifact from the script's execution. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "output = !python3 convert_gptj.py --bucket_name \"$sagemaker_session_bucket\"\n", "model_uri = output[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now use the `HuggingFaceModel` API to deploy GPT-J to a SageMaker endpoint, from which we can get real time predictions. Due to the way we saved our model, it is important that the Transformers and PyTorch version you use match the ones installed in the environment where you are running this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.huggingface import HuggingFaceModel\n", "\n", "huggingface_model = HuggingFaceModel(\n", " model_data=model_uri,\n", " transformers_version='4.12.3',\n", " pytorch_version='1.9.1',\n", " py_version='py38',\n", " role=role, \n", " )\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "endpoint_name_gptj = 'gptj-'+datetime.datetime.now().strftime(\n", " \"%Y-%m-%d-%H-%M-%S\"\n", ")\n", "\n", "predictor_gptj = huggingface_model.deploy(\n", " initial_instance_count=1,\n", " instance_type='ml.g4dn.xlarge',\n", " endpoint_name=endpoint_name_gptj\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can get real-time predictions from our model!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor_gptj.predict({\n", " \"inputs\": \"Can you please let us know more details about your \",\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleanup " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor_gpt2.delete_model()\n", "predictor_gpt2.delete_endpoint()\n", "predictor_gptj.delete_model()\n", "predictor_gptj.delete_endpoint()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Utility: To load endpoint from name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# from sagemaker.predictor import Predictor\n", "\n", "# predictor = Predictor(\n", "# endpoint_name='mme-test-gpt2',\n", "# serializer=JSONSerializer(),\n", "# deserializer=JSONDeserializer())" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (PyTorch 1.8 Python 3.6 CPU Optimized)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/1.8.1-cpu-py36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }