# Introduction to SageMaker JumpStart - Text Generation with Falcon models

---
In this demo notebook, we demonstrate how to use the SageMaker Python SDK to deploy Falcon models for text generation. Falcon is a permissively licensed ([Apache-2.0](https://jumpstart-cache-prod-us-east-2.s3.us-east-2.amazonaws.com/licenses/Apache-License/LICENSE-2.0.txt)) open-source model trained on the [RefinedWeb dataset](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). We show several example use cases, including code generation, question answering, and translation.

---

In [None]:
!pip install sagemaker --quiet --upgrade --force-reinstall
!pip install ipywidgets==7.0.0 --quiet

In [None]:
model_id, model_version = (
    "huggingface-llm-falcon-7b-instruct-bf16",
    "*",
)

In [None]:
from ipywidgets import Dropdown

model_ids = [
    "huggingface-llm-falcon-40b-bf16",
    "huggingface-llm-falcon-40b-instruct-bf16",
    "huggingface-llm-falcon-7b-bf16",
    "huggingface-llm-falcon-7b-instruct-bf16",
]

# Display the model IDs in a dropdown to select a model for inference.
model_dropdown = Dropdown(
    options=model_ids,
    value=model_id,
    description="Select a model",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(model_dropdown)

In [None]:
model_id = model_dropdown.value

In [None]:
%%time
from sagemaker.jumpstart.model import JumpStartModel

my_model = JumpStartModel(model_id=model_id, model_version=model_version)
predictor = my_model.deploy()

### Changing instance type
---


Models have been tested on the following instance types:

 - Falcon 7B and 7B instruct: `ml.g5.2xlarge`, `ml.g5.4xlarge`, `ml.g5.8xlarge`, `ml.g5.12xlarge`, `ml.g5.16xlarge`, `ml.g5.24xlarge`, `ml.g5.48xlarge`, `ml.p4d.24xlarge`
 - Falcon 40B and 40B instruct: `ml.g5.12xlarge`, `ml.g5.48xlarge`, `ml.p4d.24xlarge`

If an instance type is not available in your region, please try a different one. You can do so by specifying the instance type in the `JumpStartModel` class.

`my_model = JumpStartModel(model_id="huggingface-llm-falcon-40b-instruct-bf16", instance_type="ml.g5.12xlarge")`
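
A minimal sketch of checking a requested instance type against the tested list above before deploying. The `TESTED_INSTANCE_TYPES` table and the `model_size` and `check_instance_type` helpers are illustrative names, not part of the SageMaker SDK. The lookup assumes the model IDs contain `-7b-` or `-40b-`, as the IDs used in this notebook do.

```python
# Instance types listed in this notebook as tested, keyed by model size.
TESTED_INSTANCE_TYPES = {
    "7b": [
        "ml.g5.2xlarge", "ml.g5.4xlarge", "ml.g5.8xlarge", "ml.g5.12xlarge",
        "ml.g5.16xlarge", "ml.g5.24xlarge", "ml.g5.48xlarge", "ml.p4d.24xlarge",
    ],
    "40b": ["ml.g5.12xlarge", "ml.g5.48xlarge", "ml.p4d.24xlarge"],
}


def model_size(model_id):
    """Return '7b' or '40b' for the Falcon model IDs used in this notebook."""
    for size in TESTED_INSTANCE_TYPES:
        if f"-{size}-" in model_id:
            return size
    raise ValueError(f"Unrecognized Falcon model ID: {model_id}")


def check_instance_type(model_id, instance_type):
    """Return True if the instance type was tested for this model size."""
    return instance_type in TESTED_INSTANCE_TYPES[model_size(model_id)]


print(check_instance_type("huggingface-llm-falcon-40b-instruct-bf16", "ml.g5.2xlarge"))
```

A check like this only reflects the list in this notebook; an untested instance type may still work, and a tested one may be unavailable in a given region.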

---

### Changing number of GPUs
---
Falcon models are served with the Hugging Face (HF) LLM Deep Learning Container (DLC), which requires specifying the number of GPUs at deployment time.

**Falcon 7B and 7B instruct:** The HF LLM DLC currently does not support sharding for the 7B models. Even if the instance has more than one GPU, please do not increase the number of GPUs.

**Falcon 40B and 40B instruct:** By default, the number of GPUs is set to 4. However, if you are using `ml.g5.48xlarge` or `ml.p4d.24xlarge`, you can increase the number of GPUs to 8 as follows:

`my_model = JumpStartModel(model_id="huggingface-llm-falcon-40b-instruct-bf16", instance_type="ml.g5.48xlarge")`

`my_model.env['SM_NUM_GPUS'] = '8'`

`predictor = my_model.deploy()`
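
The rules above can be sketched as a small helper that decides whether to override `SM_NUM_GPUS`. The helper name `num_gpus_override` and the `EIGHT_GPU_INSTANCES` set are hypothetical, and the check assumes 40B model IDs contain `-40b-`, as in this notebook.

```python
# Instance types on which this notebook suggests using 8 GPUs for the 40B models.
EIGHT_GPU_INSTANCES = {"ml.g5.48xlarge", "ml.p4d.24xlarge"}


def num_gpus_override(model_id, instance_type):
    """Return a value for SM_NUM_GPUS, or None to keep the container default.

    7B models are not sharded, so their setting is left untouched. 40B models
    use 8 GPUs on the instance types above and the default of 4 otherwise.
    """
    if "-40b-" not in model_id:
        return None
    return "8" if instance_type in EIGHT_GPU_INSTANCES else "4"


override = num_gpus_override("huggingface-llm-falcon-40b-instruct-bf16", "ml.g5.48xlarge")
env = {} if override is None else {"SM_NUM_GPUS": override}
print(env)
```

The resulting dictionary could be merged into `my_model.env` before calling `my_model.deploy()`.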


---

In [None]:
%%time


prompt = "Tell me about Amazon SageMaker."

payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
        # Stop at Falcon's end-of-text token.
        "stop": ["<|endoftext|>"],
    },
}

response = predictor.predict(payload)
print(response[0]["generated_text"])

### About the model

---
Falcon is a causal decoder-only model built by [Technology Innovation Institute](https://www.tii.ae/) (TII) and trained on more than 1 trillion tokens of RefinedWeb enhanced with curated corpora. Its data pre-processing and model training used custom tooling running on Amazon SageMaker. As of June 6, 2023, Falcon-40B ranked first on the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), ahead of models such as LLaMA, StableLM, RedPajama, and MPT. Its architecture is optimized for inference, with FlashAttention and multiquery attention.


[RefinedWeb Dataset](https://huggingface.co/datasets/tiiuae/falcon-refinedweb): Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. It is a heavily filtered, large-scale de-duplicated version of CommonCrawl. Models trained on RefinedWeb have been observed to match or exceed the performance of models trained on curated datasets, while relying only on web data.

**Model Sizes:**
- **Falcon-7B**: It is a 7 billion parameter model trained on 1.5 trillion tokens. It outperforms comparable open-source models (e.g., MPT-7B, StableLM, and RedPajama). For a comparison, see the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). To use this model, please select `model_id` in the cell above to be "huggingface-llm-falcon-7b-bf16".
- **Falcon-40B**: It is a 40 billion parameter model trained on 1 trillion tokens. It has surpassed renowned models like LLaMA-65B, StableLM, RedPajama and MPT on the public leaderboard maintained by Hugging Face, demonstrating its exceptional performance without specialized fine-tuning. To see comparison, see [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). 

**Instruct models (Falcon-7B-instruct/Falcon-40B-instruct):** Instruct models are base Falcon models fine-tuned on a mixture of chat and instruction datasets. They are ready-to-use chat/instruct models. To use these models, please select `model_id` in the cell above to be "huggingface-llm-falcon-7b-instruct-bf16" or "huggingface-llm-falcon-40b-instruct-bf16".

It is [recommended](https://huggingface.co/tiiuae/falcon-7b) that instruct models be used without further fine-tuning and that base models be fine-tuned on the specific task.

**Limitations:**

- Falcon models are mostly trained on English data and may not generalize to other languages. 
- Falcon carries the stereotypes and biases commonly encountered online and in its training data. We therefore recommend developing guardrails and taking appropriate precautions for any production use. The base models are raw, pretrained models that should be fine-tuned further for most use cases.


---

In [None]:
def query_endpoint(payload):
    """Query the endpoint and print the input alongside the generated text."""
    response = predictor.predict(payload)
    print(f"\033[1m Input:\033[0m {payload['inputs']}")
    print(f"\033[1m Output:\033[0m {response[0]['generated_text']}")

In [None]:
# Code generation
payload = {
    "inputs": "Write a program to compute factorial in python:",
    "parameters": {"max_new_tokens": 200},
}
query_endpoint(payload)

In [None]:
payload = {
    "inputs": "Building a website can be done in 10 simple steps:",
    "parameters": {"max_new_tokens": 110, "no_repeat_ngram_size": 3},
}
query_endpoint(payload)

In [None]:
# Translation
payload = {
    "inputs": """Translate English to French:

 sea otter => loutre de mer

 peppermint => menthe poivrée

 plush giraffe => girafe peluche

 cheese =>""",
    "parameters": {"max_new_tokens": 3},
}

query_endpoint(payload)

In [None]:
# Sentiment-analysis
payload = {
    "inputs": """Tweet: "I hate it when my phone battery dies."
 Sentiment: Negative
 ###
 Tweet: "My day has been :+1:"
 Sentiment: Positive
 ###
 Tweet: "This is the link to the article"
 Sentiment: Neutral
 ###
 Tweet: "This new music video was incredible"
 Sentiment:""",
    "parameters": {"max_new_tokens": 2},
}
query_endpoint(payload)

In [None]:
# Question answering
payload = {
    "inputs": "Could you remind me when was the C programming language invented?",
    "parameters": {"max_new_tokens": 50},
}
query_endpoint(payload)

In [None]:
# Recipe generation
payload = {
    "inputs": "What is the recipe for a delicious lemon cheesecake?",
    "parameters": {"max_new_tokens": 400},
}
query_endpoint(payload)

In [None]:
# Summarization

payload = {
    "inputs": """Starting today, the state-of-the-art Falcon 40B foundation model from Technology
 Innovation Institute (TII) is available on Amazon SageMaker JumpStart, SageMaker's machine learning (ML) hub
 that offers pre-trained models, built-in algorithms, and pre-built solution templates to help you quickly get
 started with ML. You can deploy and use this Falcon LLM with a few clicks in SageMaker Studio or
 programmatically through the SageMaker Python SDK.
 Falcon 40B is a 40-billion-parameter large language model (LLM) available under the Apache 2.0 license that
 ranked #1 in Hugging Face Open LLM leaderboard, which tracks, ranks, and evaluates LLMs across multiple
 benchmarks to identify top performing models. Since its release in May 2023, Falcon 40B has demonstrated
 exceptional performance without specialized fine-tuning. To make it easier for customers to access this
 state-of-the-art model, AWS has made Falcon 40B available to customers via Amazon SageMaker JumpStart.
 Now customers can quickly and easily deploy their own Falcon 40B model and customize it to fit their specific
 needs for applications such as translation, question answering, and summarizing information.
 Falcon 40B is generally available today through Amazon SageMaker JumpStart in US East (Ohio),
 US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Mumbai),
 Europe (London), Europe (Frankfurt), Europe (Ireland), and Canada (Central),
 with availability in additional AWS Regions coming soon. To learn how to use this new feature,
 please see SageMaker JumpStart documentation, the Introduction to SageMaker JumpStart –
 Text Generation with Falcon LLMs example notebook, and the blog Technology Innovation Institute trains the
 state-of-the-art Falcon LLM 40B foundation model on Amazon SageMaker. Summarize the article above:""",
    "parameters": {"max_new_tokens": 200},
}
query_endpoint(payload)

### Supported parameters

***
Some of the supported parameters while performing inference are the following:

* **max_length:** Model generates text until the output length (which includes the input context length) reaches `max_length`. If specified, it must be a positive integer.
* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches `max_new_tokens`. If specified, it must be a positive integer.
* **num_beams:** Number of beams used in beam search. If specified, it must be an integer greater than or equal to `num_return_sequences`.
* **no_repeat_ngram_size:** Model ensures that a sequence of words of `no_repeat_ngram_size` is not repeated in the output sequence. If specified, it must be a positive integer greater than 1.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **early_stopping:** If True, text generation is finished when all beam hypotheses reach the end of sentence token. If specified, it must be boolean.
* **do_sample:** If True, sample the next word as per the likelihood. If specified, it must be boolean.
* **top_k:** In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.
* **stop:** If specified, it must be a list of strings. Text generation stops if any one of the specified strings is generated.

We may specify any subset of the parameters mentioned above while invoking an endpoint. 
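
As a rough illustration, the constraints listed above can be checked on the client side before a request is sent. The `validate_parameters` helper below is a hypothetical sketch, not part of the SageMaker SDK or the container. It assumes `num_return_sequences` defaults to 1 when absent and treats the `top_p` bounds as inclusive, neither of which is stated above.

```python
def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_parameters(params):
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    for name in ("max_length", "max_new_tokens", "top_k"):
        if name in params and not (_is_int(params[name]) and params[name] > 0):
            problems.append(f"{name} must be a positive integer")

    if "no_repeat_ngram_size" in params:
        value = params["no_repeat_ngram_size"]
        if not (_is_int(value) and value > 1):
            problems.append("no_repeat_ngram_size must be an integer greater than 1")

    if "num_beams" in params:
        # Assumption: num_return_sequences defaults to 1 when not given.
        minimum = params.get("num_return_sequences", 1)
        if not (_is_int(params["num_beams"]) and params["num_beams"] >= minimum):
            problems.append("num_beams must be an integer >= num_return_sequences")

    if "temperature" in params:
        value = params["temperature"]
        if not (_is_number(value) and value > 0):
            problems.append("temperature must be a positive float")

    if "top_p" in params:
        value = params["top_p"]
        # Assumption: the bounds 0 and 1 are treated as inclusive here.
        if not (_is_number(value) and 0 <= value <= 1):
            problems.append("top_p must be a float between 0 and 1")

    for name in ("early_stopping", "do_sample", "return_full_text"):
        if name in params and not isinstance(params[name], bool):
            problems.append(f"{name} must be a boolean")

    if "stop" in params:
        stop = params["stop"]
        if not (isinstance(stop, list) and all(isinstance(s, str) for s in stop)):
            problems.append("stop must be a list of strings")

    return problems


print(validate_parameters({"max_new_tokens": 0, "top_p": 1.5, "do_sample": "yes"}))
```

The endpoint remains the authority on what it accepts; a check like this only catches obvious mistakes early.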

For more parameters and information on HF LLM DLC, please see [this article](https://huggingface.co/blog/sagemaker-huggingface-llm#4-run-inference-and-chat-with-our-model).
***

### Limits on the number of input and output tokens

---

Large models such as Falcon have a very high accelerator memory footprint. A very large input payload or a large generated output can therefore cause out-of-memory errors. Furthermore, generating large outputs can take seconds or even minutes, whereas SageMaker has a response time limit of 60 seconds, so a large input or output payload can also cause timeouts. Based on these two constraints, we recommend the following limits on the number of input tokens and new tokens:


| Model | Small Input (#input_tokens, #max_new_tokens) | Medium Input (#input_tokens, #max_new_tokens) | Large Input (#input_tokens, #max_new_tokens) |
| --- | --- | --- | --- |
| Falcon 7B/Instruct | (100, 1900) | (1500, 1500) | (20000, 1000) |
| Falcon 40B/Instruct on ml.g5.12xlarge | (100, 1150) | (950, 900) | (4000, 100) |
| Falcon 40B/Instruct on ml.g5.48xlarge | (100, 1850) | (950, 1800) | (20000, 600) |

Note that the limits don't apply equally to input tokens and new tokens. Models typically support a much larger number of input tokens if you decrease `max_new_tokens` slightly. Also note that non-default values of inference parameters affect the input and output payload sizes supported. For instance, a higher value of `num_beams` reduces the number of new tokens you can generate.


**Words-Token ratio:** There are roughly 1.5 tokens per word, so 900 input tokens correspond to about 600 input words. Note that this is not always the case: for some texts the ratio can be significantly different.
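
The arithmetic above (900 tokens for about 600 words) can be turned into a quick estimate. This is only a back-of-the-envelope sketch: the helper names are illustrative, words are counted by splitting on whitespace, and the actual count depends on the model's tokenizer.

```python
import math

TOKENS_PER_WORD = 1.5  # rough ratio quoted in this notebook


def estimate_tokens(text, tokens_per_word=TOKENS_PER_WORD):
    """Estimate the token count of `text` from its whitespace-separated word count."""
    return math.ceil(len(text.split()) * tokens_per_word)


def estimate_words(num_tokens, tokens_per_word=TOKENS_PER_WORD):
    """Estimate how many words correspond to `num_tokens` tokens."""
    return int(num_tokens / tokens_per_word)


print(estimate_words(900))
print(estimate_tokens("Tell me about Amazon SageMaker."))
```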

**Setting non-default environment variables:** If the number of input tokens exceeds 1024 or the total number of tokens is 2048 or more, you need to change the following environment variables before deploying the model:

`my_model.env['MAX_INPUT_LENGTH'] = '2048'` (default `'1024'`)

`my_model.env['MAX_TOTAL_TOKENS'] = '4096'` (default `'2048'`)

Note that the endpoint enforces a maximum on the total number of tokens (input tokens plus new tokens), whereas the limits above are reported separately for input tokens and new tokens.
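
A minimal sketch of deriving these overrides from a planned request size. The function name `token_limit_overrides` is hypothetical. Setting `MAX_TOTAL_TOKENS` to one more than the planned total is an assumption, based on the condition above that a total of 2048 or more already requires a change.

```python
DEFAULT_MAX_INPUT_LENGTH = 1024
DEFAULT_MAX_TOTAL_TOKENS = 2048


def token_limit_overrides(num_input_tokens, max_new_tokens):
    """Return the environment variables to set for the given request size."""
    total = num_input_tokens + max_new_tokens
    overrides = {}
    if num_input_tokens > DEFAULT_MAX_INPUT_LENGTH:
        overrides["MAX_INPUT_LENGTH"] = str(num_input_tokens)
    if total >= DEFAULT_MAX_TOTAL_TOKENS:
        # Assumption: the total must stay strictly below MAX_TOTAL_TOKENS.
        overrides["MAX_TOTAL_TOKENS"] = str(total + 1)
    return overrides


print(token_limit_overrides(100, 500))
print(token_limit_overrides(1500, 1500))
```

The returned dictionary could be merged into `my_model.env` before calling `my_model.deploy()`. Values are strings because environment variables are passed to the container as text.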

If using `ml.g5.48xlarge` for Falcon 40B, you would need to use 8 GPUs (`my_model.env['SM_NUM_GPUS'] = '8'`).

**Concurrency/invoking at short intervals:** When you send multiple requests at once or at very short intervals, the endpoint cannot handle the same thresholds as above. This is because the TGI container batches requests: batching improves throughput significantly, but it can also cause CUDA OOM or timeout errors. We therefore recommend using fewer input and output tokens than listed above. To reduce the number of batched requests, you can set the `max_concurrent_requests` parameter (`my_model.env['MAX_CONCURRENT_REQUESTS'] = '1'`).

**Corrupted endpoint state:** Once an endpoint suffers a CUDA OOM error, it may enter a corrupted state in which it fails even for very small inputs and a small `max_new_tokens` value. Letting the endpoint sit idle has been observed to help reset the state. If that does not work, please restart the endpoint (delete it and launch a new one).
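
One way to tell whether an endpoint has recovered is to send it a tiny probe request. The sketch below is an illustration, not an official health check: `endpoint_responds` is a hypothetical helper, and it assumes the predictor returns a list of dictionaries with a `generated_text` key, as elsewhere in this notebook.

```python
def endpoint_responds(predictor, prompt="Hello", max_new_tokens=5):
    """Send a tiny request and report whether the endpoint answers it."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    try:
        response = predictor.predict(payload)
    except Exception as error:  # any failure counts as "not responding"
        print(f"Probe failed: {error}")
        return False
    # A usable response is a non-empty list whose first item has generated text.
    return bool(response) and "generated_text" in response[0]
```

If the probe keeps failing after the endpoint has been idle for a while, deleting and redeploying the endpoint is the fallback described above.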

**Model quality:** Note that the limits above reflect only the memory limits of the available instance types and the SageMaker endpoint response timeout. The model itself could in theory handle arbitrarily large input and output payloads given unlimited CUDA memory and no limit on endpoint response time. However, the quality of the model's output has been observed to decrease substantially when it is given a very large input payload (e.g., summarizing a document of 20,000 tokens) or asked to generate a large output (e.g., writing a story of 2,000 tokens). Thus, you may want to stay below the recommended limits to generate high-quality outputs.

---

### Generating few tokens at a time - Supporting large outputs on smaller instances

---
As observed above, the model can support many more input tokens than output tokens. For instance, the endpoint has higher latency and needs more memory when given a 100-token input and generating 100 new tokens than when given a 190-token input and generating 10 new tokens. Based on this observation, we can avoid CUDA OOM errors and endpoint response timeouts by invoking the endpoint repeatedly, generating a few tokens at a time, to produce a very large output sequence that would otherwise be infeasible. For the Falcon 40B instruct model, we observed that you can generate more than 5000 new tokens on `ml.g5.12xlarge` if you generate 100 tokens at a time.



We also measured the computational overhead of repeatedly querying the endpoint, which recomputes the activations/states for the input text multiple times. We observed that even when generating 10 tokens at a time, this overhead contributed less than 5% of the overall time needed to generate the desired number of output tokens. Thus, even when generating the entire sequence in a single query is feasible, you can simply generate in batches.
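
The repeated-invocation approach can be wrapped in a reusable function. This is a sketch under stated assumptions: `generate_in_chunks` is a hypothetical helper, and it assumes `predictor.predict` returns `[{"generated_text": ...}]` holding only the continuation (that is, `return_full_text` is left at its default of False). Stopping early on an empty continuation is an addition of this sketch.

```python
def generate_in_chunks(predictor, prompt, max_new_tokens, chunk_size):
    """Generate up to `max_new_tokens` tokens, at most `chunk_size` per request."""
    text = prompt
    chunks = []
    remaining = max_new_tokens
    while remaining > 0:
        step = min(chunk_size, remaining)
        payload = {"inputs": text, "parameters": {"max_new_tokens": step}}
        continuation = predictor.predict(payload)[0]["generated_text"]
        if not continuation:
            break  # nothing more was generated, so further requests are pointless
        chunks.append(continuation)
        text += continuation  # feed everything generated so far back in
        remaining -= step
    return "".join(chunks)
```

For example, `generate_in_chunks(predictor, prompt, 1000, 100)` would issue up to ten requests of 100 new tokens each and return the concatenated continuation without the prompt.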


---

In [None]:
max_new_tokens = 1000
max_new_tokens_single_iteration = 100

payload = {
    "inputs": "List down all the services by Amazon and a detailed description of each of the service. Tell me how to use Kendra. Tell me how to use AWS. Recite the guide to get started with SageMaker?",
    "parameters": {"max_new_tokens": max_new_tokens_single_iteration},
}

print(f"Input Text: {payload['inputs']}")

# Ceiling division: the number of requests needed to reach max_new_tokens.
num_iterations = -(-max_new_tokens // max_new_tokens_single_iteration)

for i in range(num_iterations):
    response = predictor.predict(payload)
    generated_text = response[0]["generated_text"]
    print(f"\033[1mIteration {i+1}:\033[0m\n {generated_text}\n")
    # Feed the prompt plus everything generated so far back in as the next input.
    payload["inputs"] += generated_text

### Clean up the endpoint

In [None]:
# Delete the SageMaker endpoint
predictor.delete_model()
predictor.delete_endpoint()