## Text Generation using Different Decoding Strategies with Amazon SageMaker JumpStart SDK and Falcon 40B Instruct Language Model

---
This Amazon SageMaker Studio notebook demonstrates how to use the SageMaker Python SDK to deploy the Falcon-40B-Instruct large language model with minimal effort and then generate text using different decoding methods.

This notebook has the following prerequisites:
- Select an AWS region where [Amazon SageMaker JumpStart](https://aws.amazon.com/sagemaker/jumpstart) is available. 
- [Set up an Amazon SageMaker Domain](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html).
- [Available service quota for "ml.g5.12xlarge for endpoint usage"](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).
- A budget of less than $10 per hour for the Amazon SageMaker JumpStart model deployment and Amazon SageMaker Studio notebook usage.

This notebook is based on the following references:
- [Amazon SageMaker JumpStart SDK](https://sagemaker.readthedocs.io/en/v2.82.0/overview.html#use-prebuilt-models-with-sagemaker-jumpstart), providing pretrained models for a wide range of problem types to help you get started with machine learning.
- [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct), an open-source, decoder-only model with 40B parameters built by TII.
- Public articles ([Link 1](https://huggingface.co/blog/how-to-generate), [Link 2](https://huggingface.co/docs/transformers/generation_strategies), [Link 3](https://huggingface.co/blog/sagemaker-huggingface-llm#4-run-inference-and-chat-with-our-model)) published on Hugging Face, an open source community and data science platform for machine learning models and datasets. 
---

In [None]:
!pip install --upgrade pip --quiet --disable-pip-version-check --root-user-action=ignore
!pip install --upgrade sagemaker --quiet --root-user-action=ignore

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

# Define SageMaker JumpStart Model using model id, instance type, and endpoint timeout
my_model = JumpStartModel(model_id="huggingface-llm-falcon-40b-instruct-bf16",
                          instance_type="ml.g5.12xlarge",
                          env={'ENDPOINT_SERVER_TIMEOUT':'300'})

# Take a look at the JumpStart Model parameters printed by this cell
print("Model id =", my_model.model_id)
print("Model name =", my_model.name)
print("Model version =", my_model.model_version)
print("Instance type =", my_model.instance_type)
print("Instance number of GPUs =", my_model.env["SM_NUM_GPUS"])
print("Model maximum input length =", my_model.env["MAX_INPUT_LENGTH"])
print("Model maximum total tokens =", my_model.env["MAX_TOTAL_TOKENS"])
print("Server endpoint timeout =", my_model.env["ENDPOINT_SERVER_TIMEOUT"], "seconds")

In [None]:
%%time

# Host the model on the instance and deploy an inference endpoint
# Because the model size is >80GB, expect deploy() to take about 15 minutes.
predictor = my_model.deploy()

---
**Decoding Strategies**

Large language models generate text one token at a time: having read all the previous tokens, the model assigns a probability to every possible next token, and a decoding strategy decides which one to pick. There are several decoding strategies we can configure, such as Greedy Search, Beam Search, and Contrastive Search.

We start by defining the prompt, which lets us test different decoding strategies on the same input text. We also fix the values of some hyperparameters across all decoding strategies to make the generated text easier to compare:
- *temperature* controls the randomness of predictions by scaling the logits before applying softmax. The softmax layer in the transformer architecture turns the logits into probabilities (between 0 and 1). A low temperature (below 1) sharpens the probability distribution, resulting in more conservative and predictable text. A high temperature (above 1) flattens it, so the model generates more creative and diverse text that may include unusual or unexpected words.
- *stop* provides a list of stop sequences to the model. Generation stops as soon as one of them is generated.
- *max_new_tokens* defines the maximum number of tokens to be generated by the model.
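
To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The function name `softmax_with_temperature` and the toy logits are assumptions chosen for illustration; they are not part of the SageMaker SDK or of the model's actual implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
toy_logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

With the lowest temperature the top token takes most of the probability mass, while higher temperatures spread the mass more evenly across the candidates.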

In [None]:
prompt = "Did the chicken cross the road while being afraid of cars?"
stop_keywords = ["<|endoftext|>", "</s>"]
max_new_tokens = 120
temperature = 1.05

**Strategy 1: Greedy Search**

Greedy search is a deterministic method that simply selects the word with the highest probability as its next word. To configure greedy decoding, we set the *do_sample* hyperparameter to false and make sure the *num_beams* hyperparameter is set to 1. 
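
As an illustration of the idea only, the sketch below runs greedy decoding over a tiny hand-written next-token table. `TOY_MODEL` and `greedy_decode` are hypothetical names and the probabilities are made up; the deployed endpoint does this over the full vocabulary of the real model.

```python
# Hypothetical next-token table: maps the last token to candidate next tokens
# and their probabilities. A real model computes these from the full context.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"chicken": 0.5, "road": 0.3, "car": 0.2},
    "a": {"car": 0.9, "chicken": 0.1},
    "chicken": {"crossed": 0.7, "</s>": 0.3},
    "road": {"</s>": 1.0},
    "car": {"</s>": 1.0},
    "crossed": {"</s>": 1.0},
}


def greedy_decode(model, start="<s>", stop="</s>", max_new_tokens=10):
    """Pick the single most probable next token at every step."""
    tokens = []
    last = start
    for _ in range(max_new_tokens):
        next_probs = model[last]
        last = max(next_probs, key=next_probs.get)
        if last == stop:
            break
        tokens.append(last)
    return tokens


print(greedy_decode(TOY_MODEL))  # ['the', 'chicken', 'crossed']
```

In this toy table the greedy path has an overall probability of 0.6 × 0.5 × 0.7 = 0.21, even though the sequence "a car" has probability 0.4 × 0.9 = 0.36. Greedy search misses it because "a" is not the most probable first token.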

In [None]:
greedy_search_payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": False,
        "num_beams": 1,
    }
}
greedy_search_response = predictor.predict(greedy_search_payload)

print(prompt)
print("Greedy Search Response:", ">"*40, "\033[95m")
print(greedy_search_response[0]["generated_text"])

**Strategy 2: Beam Search**

Beam search is another deterministic method that reduces the risk of missing hidden high-probability word sequences by keeping a fixed number of active candidates (beams) at each time step and eventually choosing the hypothesis with the highest overall probability. To configure beam decoding, we set the *do_sample* hyperparameter to false and set the *num_beams* hyperparameter to a value above 1.
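
The following is a minimal sketch of beam search over a tiny hand-written next-token table, assuming no length normalization or other refinements that production implementations typically add. `TOY_MODEL` and `beam_search` are hypothetical names and the probabilities are made up for illustration.

```python
import math

# Hypothetical next-token table: last token -> {next token: probability}.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"chicken": 0.5, "road": 0.3, "car": 0.2},
    "a": {"car": 0.9, "chicken": 0.1},
    "chicken": {"crossed": 0.7, "</s>": 0.3},
    "road": {"</s>": 1.0},
    "car": {"</s>": 1.0},
    "crossed": {"</s>": 1.0},
}


def beam_search(model, num_beams=2, start="<s>", stop="</s>", max_new_tokens=10):
    """Keep the num_beams most probable partial sequences at every step."""
    # Each beam is (log probability, tokens so far, finished flag).
    beams = [(0.0, [start], False)]
    for _ in range(max_new_tokens):
        candidates = []
        for logp, tokens, finished in beams:
            if finished:
                # Finished hypotheses are carried over unchanged.
                candidates.append((logp, tokens, True))
                continue
            for tok, p in model[tokens[-1]].items():
                candidates.append((logp + math.log(p), tokens + [tok], tok == stop))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:num_beams]
        if all(finished for _, _, finished in beams):
            break
    best_logp, best_tokens, _ = beams[0]
    text = [t for t in best_tokens if t not in (start, stop)]
    return text, math.exp(best_logp)


for width in (1, 2):
    tokens, prob = beam_search(TOY_MODEL, num_beams=width)
    print(f"num_beams={width}:", tokens, round(prob, 2))
```

With `num_beams=1` the procedure reduces to greedy search and returns "the chicken crossed" (probability 0.21). With `num_beams=2` it keeps "a" alive after the first step and finds the higher-probability sequence "a car" (probability 0.36).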

In [None]:
beam_search_payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": False,
        "num_beams": 4,
    }
}
beam_search_response = predictor.predict(beam_search_payload)

print(prompt)
print("Beam Search Response:", ">"*40, "\033[95m")
print(beam_search_response[0]["generated_text"])

**Strategy 3: Multinomial Sampling**

Solely maximizing the output probability, as deterministic methods do, can lead to dull and repetitive text. Stochastic methods try to solve this problem by introducing randomness into the token selection process.

Multinomial sampling randomly selects the next token based on the probability distribution over the entire vocabulary given by the model. Every token with a non-zero probability has a chance of being selected, thus reducing the risk of repetition. To configure multinomial sampling, we set the *do_sample* hyperparameter to true and set the *num_beams* hyperparameter to 1. 
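
A minimal sketch of a single multinomial sampling step, using only the Python standard library. The distribution `toy_probs` and the helper `sample_next_token` are assumptions for illustration; a real model samples from a distribution over its entire vocabulary.

```python
import random
from collections import Counter


def sample_next_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution.
toy_probs = {"chicken": 0.5, "road": 0.3, "car": 0.2}

# A seeded generator keeps the illustration reproducible.
rng = random.Random(0)
counts = Counter(sample_next_token(toy_probs, rng) for _ in range(1000))
print(counts)
```

Over many draws the counts approach the underlying probabilities, but any single draw can return a less likely token, which is what makes the generated text vary between runs.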

In [None]:
multinomial_sampling_payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "num_beams": 1
    }
}
multinomial_sampling_response = predictor.predict(multinomial_sampling_payload)

print(prompt)
print("Multinomial Sampling Response:", ">"*40, "\033[95m")
print(multinomial_sampling_response[0]["generated_text"])

**Strategy 4: Top-k Sampling**

Top-k sampling sorts the tokens by probability and zeroes out the probability of every token ranked below the k-th one; the next token is then sampled from the remaining k candidates. To configure top-k sampling, we set the *do_sample* hyperparameter to true and set the *top_k* hyperparameter to a value above 1.
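
The filtering step can be sketched as follows. `top_k_filter` and the toy distribution are hypothetical and only illustrate the idea; the surviving probabilities are renormalized here so that they sum to 1 before sampling.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


# Hypothetical next-token distribution.
toy_probs = {"chicken": 0.5, "road": 0.3, "car": 0.15, "bus": 0.05}
print(top_k_filter(toy_probs, k=2))
```

With `k=2` only "chicken" and "road" remain as candidates, so the low-probability tail ("car", "bus") can never be sampled.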

In [None]:
top_k_sampling_payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "top_k": 7
    }
}
top_k_sampling_response = predictor.predict(top_k_sampling_payload)

print(prompt)
print("Top-k Sampling Response:", ">"*40, "\033[95m")
print(top_k_sampling_response[0]["generated_text"])

**Strategy 5: Top-p Sampling**

Top-p sampling (or nucleus sampling) chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. To configure top-p sampling, we set the *do_sample* hyperparameter to true and set the *top_p* hyperparameter to a value less than 1.
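
A minimal sketch of the nucleus filtering step, assuming the set is closed as soon as the cumulative probability reaches or exceeds p. `top_p_filter` and the toy distribution are hypothetical names chosen for illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Hypothetical next-token distribution.
toy_probs = {"chicken": 0.5, "road": 0.3, "car": 0.15, "bus": 0.05}
print(top_p_filter(toy_probs, p=0.9))
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution: here three tokens are needed to cover 90% of the mass, whereas a sharply peaked distribution might keep only one.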

In [None]:
nucleus_sampling_payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "top_p": 0.8
    }
}
nucleus_sampling_response = predictor.predict(nucleus_sampling_payload)

print(prompt)
print("Nucleus Sampling Response:", ">"*40, "\033[95m")
print(nucleus_sampling_response[0]["generated_text"])

**Strategy 6: Contrastive Search**

Contrastive search selects from the most probable candidates predicted by the model while taking into account the degeneration penalty computed from the previous context. This decoding strategy tries to maintain the semantic coherence in the generated text while reducing repetitions. To configure Contrastive search, we set the *top_k* hyperparameter to a value above 1 and set the *repetition_penalty* hyperparameter to a value between 0 and 1. When *repetition_penalty* is close to zero, contrastive search degenerates to the greedy search method.
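
The selection rule can be sketched on toy data as follows. This assumes one common formulation of contrastive search, in which each candidate is scored as (1 - alpha) × probability minus alpha × its maximum cosine similarity to the representations of the previous tokens. The weight name `alpha`, the helper names, and the two-dimensional vectors are all assumptions for illustration; how the deployed endpoint maps its hyperparameters onto this rule is not shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_pick(candidates, context_vectors, alpha):
    """Pick the candidate with the best trade-off between model confidence
    and dissimilarity to the context (the degeneration penalty)."""

    def score(item):
        prob, vector = item[1]
        penalty = max(cosine_similarity(vector, c) for c in context_vectors)
        return (1 - alpha) * prob - alpha * penalty

    return max(candidates.items(), key=score)[0]


# Hypothetical representation of the tokens generated so far.
context_vectors = [(1.0, 0.0)]
# Hypothetical top-k candidates: token -> (probability, representation).
candidates = {
    "road": (0.6, (1.0, 0.1)),  # most probable, but very similar to the context
    "farm": (0.4, (0.0, 1.0)),  # less probable, but dissimilar to the context
}

print(contrastive_pick(candidates, context_vectors, alpha=0.0))  # road
print(contrastive_pick(candidates, context_vectors, alpha=0.6))  # farm
```

With a weight of zero the penalty vanishes and the rule picks the most probable candidate, which is the greedy behavior described above. A larger weight favors the candidate that repeats the context less.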

In [None]:
contrastive_search_payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "repetition_penalty": 0.9,
        "top_k": 7
    }
}
contrastive_search_response = predictor.predict(contrastive_search_payload)

print(prompt)
print("Contrastive Search Response:", ">"*40, "\033[95m")
print(contrastive_search_response[0]["generated_text"])

**Strategy 7: Combining Multiple Methods**

Combining multiple methods (such as top-k and top-p sampling) can sometimes improve the diversity and fluency of the generated text. Try configuring your own decoding strategy, including changing the temperature hyperparameter, and compare the results.

In [None]:
multiple_strategies_payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "top_p": 0.8,
        "top_k": 7
    }
}
multiple_strategies_response = predictor.predict(multiple_strategies_payload)

print(prompt)
print("Multiple Strategies Response 1:", ">"*40, "\033[95m")
print(multiple_strategies_response[0]["generated_text"])

Beam search can be used in conjunction with sampling by choosing the next token at random from the set of possible tokens while building up the beams. Let us try it out.

In [None]:
multiple_strategies_payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "top_p": 0.8,
        "top_k": 7,
        "num_beams": 3
    }
}
multiple_strategies_response = predictor.predict(multiple_strategies_payload)

print(prompt)
print("Multiple Strategies Response 2:", ">"*40, "\033[95m")
print(multiple_strategies_response[0]["generated_text"])
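
Since every payload in this notebook shares the same fixed hyperparameters, a small helper can reduce the repetition when you experiment with your own combinations. The sketch below is one possible way to do it; `build_payload` and the `strategies` dictionary are hypothetical names, not part of the SageMaker SDK.

```python
def build_payload(prompt, stop, max_new_tokens, temperature, **strategy_params):
    """Merge the shared hyperparameters with strategy-specific ones."""
    parameters = {
        "stop": stop,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    }
    parameters.update(strategy_params)
    return {"inputs": prompt, "parameters": parameters}


# Strategy-specific settings taken from the cells in this notebook.
strategies = {
    "greedy": {"do_sample": False, "num_beams": 1},
    "top_k": {"do_sample": True, "top_k": 7},
    "top_p": {"do_sample": True, "top_p": 0.8},
}

payloads = {
    name: build_payload(
        "Did the chicken cross the road while being afraid of cars?",
        ["<|endoftext|>", "</s>"],
        120,
        1.05,
        **params,
    )
    for name, params in strategies.items()
}

for name, payload in payloads.items():
    print(name, payload["parameters"])
```

Each resulting payload can be passed to `predictor.predict(...)` in the same way as the payloads defined in the cells above.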

### SageMaker Cleanup

In [None]:
# Delete the SageMaker model and endpoint to stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()