{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Find the right use case, model, and feature for Hugging Face on SageMaker \n", "If you are looking for helping finding the right use case and model for your data science project with Hugging Face on SageMaker and AWS, you've come to the right place! \n", "\n", "### A few questions to ask yourself:\n", "1. Where can I apply my NLP project to provide the highest value to my business, while doing this in a low risk, secure way?\n", "2. Do I already have a relevant dataset?\n", "3. If I do, is this primarily in English, or do I need multilingual?\n", "4. Does my domain include specialized terminology, words that wouldn't be common on Wikipedia?\n", "5. For whichever model seems appropriate, what dataset does it primarily use? What objective metric does this model report on this dataset? Am I comfortable with that level of performance in the use case I am consdiering targeting.\n", "\n", "Use your own anwers to those questions to determine which use case, and which model(s), will be most impactful to your teams. Data science projects commonly start with some amount of early exploration, anywhere from 1 day to a few weeks, to determine if a given approach is viable. \n", "\n", "\n", "**Pro tip** - timebox your experiments, be frugal with your teams' time, and move quickly to find a viable solution. Only after you have some quantitative evidence that a given model is working well on your dataset sample is it time to scale up to larger datasets, larger models, larger training, and larger hosting clusters.\n", "\n", "---\n", "\n", "*Please note: When you pick any pretrained model, make sure that model has been tested on a dataset. Take some time to research what objective metric that model is reporting, usually Hugging Face will report this directly in the model card. Ask yourself - how similar is this research dataset to my own dataset? Am I ok with this level of performance in my use case? How can I design my overall ML solution to accomodate this?*\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Use Cases and Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text Classification\n", "\n", "Top Use Cases\n", "- Spam filter\n", "- Support ticket routing\n", "- Chat bot analysis\n", "- Tagging customer feedback and product reviews\n", "- Social media analysis\n", "\n", "\n", " Top Models \n", "- [Search engine result optimization](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2)\n", "- [Sentiment detection](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)\n", "- [Multi-lingual](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Named Entity Recgonition\n", "Top Models\n", "\n", "- [Extract named entities (nouns) with XLM](https://huggingface.co/xlm-roberta-large-finetuned-conll03-english)\n", "- Some non-English languages:\n", " - [German](https://huggingface.co/flair/ner-german)\n", " - [French](https://huggingface.co/flair/ner-french)\n", " - [Dutch](https://huggingface.co/flair/ner-dutch )\n", "- [Finetune NER for disease related languages, with PubMed](https://huggingface.co/beatrice-portelli/DiLBERT )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question Answering\n", "Top Models\n", "- [RoBERTa base Squad 2](https://huggingface.co/deepset/roberta-base-squad2 )\n", "- [BERT large uncased whole word masking](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad) \n", "- [BioBERT](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1-squad )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text Summarization\n", "\n", "Top Models\n", "- [DistiliBart](https://huggingface.co/sshleifer/distilbart-cnn-6-6)\n", "- [MT5 Multilingual](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum)\n", "- [Financial Summarization](https://huggingface.co/human-centered-summarization/financial-summarization-pegasus )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text Generation\n", "Top Models\n", "- [Distiligpt2](https://huggingface.co/distilgpt2)\n", "- [GPT2](https://huggingface.co/gpt2 )\n", "- [EleutherAI GPT J 6B](https://huggingface.co/EleutherAI/gpt-j-6B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Translation\n", "Top Models\n", "- [T5 Small](https://huggingface.co/t5-small)\n", "- [Helsinki NLP's Opus MT / ZH / EN](https://huggingface.co/Helsinki-NLP/opus-mt-zh-en)\n", "- [T5 3b](https://huggingface.co/t5-3b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computer vision\n", "Top Models\n", "- [Image Classification](https://huggingface.co/google/vit-base-patch16-224)\n", "- [Image Segmentation](https://huggingface.co/facebook/detr-resnet-50-panoptic)\n", "- [Object detection](https://huggingface.co/facebook/detr-resnet-50)\n", "- [Text to image](https://huggingface.co/flax-community/dalle-mini)\n", "- [Image to text](https://huggingface.co/kha-white/manga-ocr-base)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Frequently Asked Questions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Do I need a big model or a small model? Does size matter in NLP models?\n", "Generally, yes, size is very significant in NLP models. Recently researchers have developed techniques to deliver accuracy gains for even smaller model sizes, but the general rule stands. A model with more paramters (ie, larger) will on average be more accurate than a model with smaller parameters. So to answer this question, first ask yourself, where am I in my experimentation process? If you are very early, then please start with a small model. If you are later, then consider using a larger one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Can I use the pretrained model as it is, or do I need to fine tune it?\n", "The answer to this question strictly depends on how similar your dataset is to the dataset the model was trained against. For example, if the domain you are evaluating is primarily in English, with non domain-specific terminology, using casual English as found in Wikipedia, then you may be fine with hosting the model as it for your use case. However, if your dataset is very different from the one the model was trained on, then you may want to consider fine tuning. Generally speaking, fine-tuned models are expected to be more accurate than generic models. This means the only reason you'd deploy a pretrained model directly is if (a) your dataset is almost exactly identical to what the model was trained on, and/or (b) the value of deploying it quickly significantly outweights the time it would take to fine tune it. Pro tip - fine tuning really is not very difficult at all, unless you are dealling with larger scale datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. What if I need to train a model from scratch? Can I do that on SageMaker too?\n", "Typically you'd only train a model from scratch if the pretrained models are so completely different from your dataset that even large-scale fine tuning won't deliver the performance you need. This may be the case for new languages, new domains, or new syntaxes that vary dramatically from Wikipedia or other Common Crawl samples. If you do need to train a model from scratch, yes absolutely you can do this on SageMaker. Typically customers look at our managed distributed training libraries, like model and data parallel, for the absolute best performance, but you can also bring 3rd party distributed frameworks to scale. Anticipate jobs that run for hours to days to weeks, and cluster sizes ranging from 10s to 100s to 1000s of GPUS or custom accelerators." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4. Aren't NLP models biased? How do I navigate that?\n", "Yes! Please assume that any biases inherent in the dataset the model was trained on will be carried into the model itself. This is commonly visibile when comparing across social traits, such as gender and employment, or countless other examples. For this reason some teams layer on additional \"sensitivity\" models that look for red flags in text, and route it to manual review prior to exposing to customers. You can [detect and monitor for bias in text classifiers using SageMaker Clarify.]( https://github.com/aws/amazon-sagemaker-examples/blob/4b9615508a08b474d7a89827cb21d67c8f12ad1c/sagemaker_processing/fairness_and_explainability/text_explainability/text_explainability.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5. Should I use PyTorch or TensorFlow? Do I need to write my own model? What if I want to write my own model?\n", "Actually the Hugging Face SDK does an excellent job making it extremely easy to switch back and forth between PyTorch and TensorFlow. You'll need to have at least one of those backends installed wherever you are running the Transformers SDK, but that choice is entirely on you. Additionally, Hugging Face absolutely supports defining your own models. You are welcome to bring your own networks in both TensorFlow (Keras) and PyTorch, and use the rest of their functionality. On top of that, HF also lets you convert models from one framework to another. So if a particular model you are evaluating is available only in 1 framework, don't fear, you can just use their SDK to convert it into the one of your choice.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Hugging Face on SageMaker Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example notebooks x 46 built by Hugging Face and AWS\n", "\n", "1. Getting started\n", " - [Train and host with PyTorch](https://github.com/huggingface/notebooks/blob/master/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb)\n", " - [Train and host with TensorFlow](https://github.com/huggingface/notebooks/blob/master/sagemaker/02_getting_started_tensorflow/sagemaker-notebook.ipynb)\n", "\n", "2. SageMaker Training\n", " - [Training Compiler](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-training-compiler/huggingface)\n", " - [Data parallel](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/bert/pytorch_smdataparallel_bert_demo.ipynb)\n", " - [Model parallel](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt-j/train_gptj_smp_notebook.ipynb)\n", " - [Hyperparameter tuning](https://github.com/aws/amazon-sagemaker-examples/blob/4b9615508a08b474d7a89827cb21d67c8f12ad1c/hyperparameter_tuning/huggingface_multiclass_text_classification_20_newsgroups/hpo_huggingface_text_classification_20_newsgroups.ipynb)\n", " - [Spot instances](https://github.com/huggingface/notebooks/blob/master/sagemaker/05_spot_instances/sagemaker-notebook.ipynb)\n", "\n", "3. SageMaker Hosting\n", " - [Deploy a pretrained model from the HF hub](https://github.com/huggingface/notebooks/tree/master/sagemaker/11_deploy_model_from_hf_hub)\n", " - [Deploy a finetuned model from s3](https://github.com/huggingface/notebooks/blob/master/sagemaker/10_deploy_model_from_s3/deploy_transformer_model_from_s3.ipynb)\n", " - [Deploy and autoscale](https://github.com/huggingface/notebooks/blob/master/sagemaker/13_deploy_and_autoscaling_transformers/sagemaker-notebook.ipynb) \n", " - [Serverless Inference](https://github.com/aws/amazon-sagemaker-examples/blob/35723de714e95d6152bfb1ff4c141fc069e37b8a/serverless-inference/huggingface-serverless-inference/huggingface-text-classification-serverless-inference.ipynb)\n", " - [Asynchronous Inference](https://github.com/huggingface/notebooks/blob/master/sagemaker/16_async_inference_hf_hub/sagemaker-notebook.ipynb)\n", "\n", "4. Operationalize and optimize\n", " - [Pipelines](https://github.com/aws/amazon-sagemaker-examples/blob/4b9615508a08b474d7a89827cb21d67c8f12ad1c/end_to_end/nlp_mlops_company_sentiment/02_nlp_company_earnings_analysis_pipeline.ipynb)\n", " - [Inference recommender](https://github.com/aws/amazon-sagemaker-examples/blob/4b9615508a08b474d7a89827cb21d67c8f12ad1c/sagemaker-inference-recommender/huggingface-inference-recommender/huggingface-inference-recommender.ipynb)\n", " - [SageMaker Neo and Inferentia](https://github.com/aws/amazon-sagemaker-examples/blob/a26b5c4e6a81c714707244d2407305984c584f51/sagemaker_neo_compilation_jobs/deploy_huggingface_model_on_Inf1_instance/inf1_bert_compile_and_deploy.ipynb)\n", " - [SageMaker Clarify](https://github.com/aws/amazon-sagemaker-examples/blob/4b9615508a08b474d7a89827cb21d67c8f12ad1c/sagemaker_processing/fairness_and_explainability/text_explainability/text_explainability.ipynb)\n", " - [NVIDIA Triton Server on SageMaker](https://github.com/aws/amazon-sagemaker-examples/blob/4b9615508a08b474d7a89827cb21d67c8f12ad1c/sagemaker-triton/nlp_bert/triton_nlp_bert.ipynb)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " " ] } ], "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }