## Text Generation using Different Prompting Techniques with Amazon SageMaker JumpStart SDK and Falcon 40B Instruct Language Model

---
This Amazon SageMaker Studio notebook demonstrates how to use the SageMaker Python SDK to deploy the Falcon-40B-Instruct large language model and then generate text using advanced prompt engineering techniques.

This notebook has the following prerequisites:
- Select an AWS region where [Amazon SageMaker JumpStart](https://aws.amazon.com/sagemaker/jumpstart) is available. 
- [Set up an Amazon SageMaker Domain](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html).
- [Available service quota](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) for "ml.g5.12xlarge for endpoint usage".
- A budget of less than $10 per hour for Amazon SageMaker JumpStart model deployments and usage of other AWS services.

This notebook is based on the following references:
- [Amazon SageMaker JumpStart SDK](https://sagemaker.readthedocs.io/en/v2.82.0/overview.html#use-prebuilt-models-with-sagemaker-jumpstart), providing pretrained models for a wide range of problem types to help you get started with machine learning.
- [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b), an open-source, decoder-only language model with 40B parameters built by TII.
- [Prompt Engineering](https://www.promptingguide.ai/), an online project aiming to educate Gen AI practitioners about prompt engineering. 
---

In [None]:
# Install Python libraries
!pip install --upgrade pip --quiet
!pip install --upgrade sagemaker --quiet

In [None]:
# Import required libraries
from sagemaker.jumpstart.model import JumpStartModel

In [None]:
# Define the SageMaker JumpStart model using the model ID, instance type, and endpoint timeout
# Model ID: Falcon-40B-Instruct (bf16) from SageMaker JumpStart
my_model = JumpStartModel(model_id="huggingface-llm-falcon-40b-instruct-bf16",
                          instance_type="ml.g5.12xlarge",
                          env={'ENDPOINT_SERVER_TIMEOUT':'300'})

In [None]:
%%time

# Host the model on the instance and deploy an inference endpoint
# Because the model size is >80GB, expect deploy() to take about 15 minutes
predictor = my_model.deploy()

---
**Prompting Techniques**

Large language model (LLM) prompting is a technique for getting an LLM to perform a task that it has not been explicitly trained on. This is done by providing the LLM with a prompt, a short piece of text that instructs the model on what to do. Prompting can be used for a variety of tasks, including text summarization, question answering, and code generation.

In [None]:
stop_keywords = ["<|endoftext|>", "</s>"]
max_new_tokens = 100
do_sample = False
temperature = 1
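
Every example in this notebook sends the same request structure and varies only the prompt. A minimal sketch of a helper that assembles that request is shown here. `build_payload` is a hypothetical name, not part of the SageMaker SDK, and its defaults simply mirror the values defined in the cell above.

```python
# Hypothetical helper (not part of the SageMaker SDK): builds the request
# dictionary that the examples in this notebook pass to predictor.predict().
def build_payload(prompt, stop=("<|endoftext|>", "</s>"), max_new_tokens=100,
                  do_sample=False, temperature=1):
    """Return a payload dict holding the prompt and the generation parameters."""
    return {
        "inputs": prompt,
        "parameters": {
            "stop": list(stop),                # strings that end the generation
            "temperature": temperature,        # sampling temperature
            "max_new_tokens": max_new_tokens,  # cap on generated tokens
            "do_sample": do_sample,            # False keeps the output repeatable
        },
    }


example_payload = build_payload("Sentiment:")
print(example_payload["parameters"])
```

With a deployed endpoint, `predictor.predict(build_payload(prompt))` would send the same request as the hand-written payloads in the cells that follow.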

**Technique 1: Zero-Shot Prompting**

Zero-shot prompting allows a model to perform a task without any additional training data or examples. The prompt simply describes the task, and the model uses its knowledge of the world to generate the desired output.

In [None]:
# Zero-Shot Prompting Example 1
prompt = "Classify the following text into neutral, negative or positive.\n" \
         "Text: I think the vacation was not bad.\n" \
         "Sentiment:"

payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": do_sample
    }
}
response = predictor.predict(payload)

print("Zero-Shot Prompting Example 1:")
print(prompt)
print("\033[95m")
print(response[0]["generated_text"].strip())
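
The model returns free text, so the predicted label usually needs light normalization before it can be used downstream. The sketch below assumes the generation contains one of the three labels from the prompt. `extract_label` and the sample strings are hypothetical and only illustrate the idea; they are not real model output.

```python
# Labels taken from the zero-shot prompt above
LABELS = ("neutral", "negative", "positive")


def extract_label(generated_text, labels=LABELS):
    """Return the label that appears earliest in the text, or None if absent."""
    text = generated_text.lower()
    # Record where each label first occurs (-1 means it does not occur)
    positions = {label: text.find(label) for label in labels}
    found = {label: pos for label, pos in positions.items() if pos != -1}
    if not found:
        return None
    return min(found, key=found.get)


# Made-up generations standing in for response[0]["generated_text"]
print(extract_label(" Neutral."))
print(extract_label("The sentiment is positive, not negative."))
print(extract_label("I am not sure."))
```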

**Technique 2: Few-Shot Prompting**

Few-shot prompting places a handful of worked examples in the prompt before the actual task, steering the model toward better performance.

In [None]:
## Few-Shot Prompting Example 1
prompt = """A "whatpu" is a small, furry animal native to Tanzania. """ \
         """An example of a sentence that uses the word whatpu is: """ \
         """We were traveling in Africa and we saw these very cute whatpus.\n""" \
         """To do a "farduddle" means to jump up and down really fast. """ \
         """An example of a sentence that uses the word farduddle is:"""

payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": do_sample
    }
}
response = predictor.predict(payload)

print("Few-Shot Prompting Example 1:")
print(prompt)
print("\033[95m")
print(response[0]["generated_text"].strip())

In [None]:
## Few-Shot Prompting Example 2
prompt = "This is awesome! // Negative\n" \
         "This is bad! // Positive\n" \
         "Wow that movie was rad! // Positive\n" \
         "What a horrible show! //"

payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": do_sample
    }
}
response = predictor.predict(payload)

print("Few-Shot Prompting Example 2:")
print(prompt)
print("\033[95m")
print(response[0]["generated_text"].strip())

In [None]:
## Few-Shot Prompting Example 3
prompt = "Add 2+3: 5\n" \
         "Add 3+6: 9\n" \
         "Add 7+1:"

payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": do_sample
    }
}
response = predictor.predict(payload)

print("Few-Shot Prompting Example 3:")
print(prompt)
print("\033[95m")
print(response[0]["generated_text"].strip())
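
The few-shot prompts above share one pattern: a list of input/output demonstrations followed by a final input whose output is left blank. A minimal sketch of a builder for that pattern follows. `build_few_shot_prompt` is a hypothetical helper, and the demonstration labels are illustrative.

```python
def build_few_shot_prompt(examples, query, separator=" // "):
    """Join (input, output) demonstrations and end with the unanswered query."""
    lines = [f"{text}{separator}{label}" for text, label in examples]
    # The last line repeats the pattern but leaves the output for the model
    lines.append(f"{query}{separator}".rstrip())
    return "\n".join(lines)


examples = [
    ("This is awesome!", "Positive"),
    ("This is bad!", "Negative"),
    ("Wow that movie was rad!", "Positive"),
]
prompt = build_few_shot_prompt(examples, "What a horrible show!")
print(prompt)
```

Passing `separator=": "` produces the layout of the arithmetic example, where each line reads `Add 2+3: 5`.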

**Technique 3: Chain-of-Thought Prompting**

Chain-of-Thought prompting encourages the model to explain its reasoning. This method enables complex reasoning capabilities through intermediate reasoning steps.

In [None]:
## Chain-of-Thought Prompting Example 1
prompt = "I bought 10 apples from the market. " \
         "Then, I gave 2 apples to the neighbor and 2 to the repairman. " \
         "Then I went and bought 5 more apples but I ate 1. " \
         "How many apples did I remain with? Let us think step by step."

payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": do_sample
    }
}
response = predictor.predict(payload)

print("Chain-of-Thought Prompting Example 1:")
print(prompt)
print("\033[95m")
print(response[0]["generated_text"].strip())
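
To judge the model's step-by-step answer, it helps to have the reference result at hand. The sketch below works through the arithmetic from the prompt and pulls the last integer out of a generation. `last_integer` and the sample text are hypothetical, and treating the last number as the final answer is a simplifying assumption.

```python
import re

# Reference arithmetic for the apple problem in the prompt
apples = 10   # bought at the market
apples -= 2   # given to the neighbor
apples -= 2   # given to the repairman
apples += 5   # bought on the second trip
apples -= 1   # eaten
print(apples)


def last_integer(text):
    """Return the last integer found in the text, or None if there is none."""
    matches = re.findall(r"\d+", text)
    return int(matches[-1]) if matches else None


# Made-up generation standing in for response[0]["generated_text"]
sample = "You had 10 apples, gave away 4, bought 5 and ate 1, so you have 10 apples."
print(last_integer(sample) == apples)
```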

**Technique 4: Self-Consistency**

This technique samples multiple, diverse generations, typically through few-shot chain-of-thought prompts, and then selects the most consistent answer, for example by majority vote.

In [None]:
## Self-Consistency Prompting Example 1
prompt = "When I was 6, my sister was half my age. " \
         "Now I’m 58. How old is my sister? "

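# Sampling is enabled for this example: with greedy decoding (do_sample=False)
# all three generations would be identical and there would be nothing to compare
payload = {
    "inputs": prompt,
    "parameters": {
        "stop": stop_keywords,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True
    }
}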
response1 = predictor.predict(payload)
response2 = predictor.predict(payload)
response3 = predictor.predict(payload)

print("Self-Consistency Prompting Example 1:")
print(prompt)
print("\033[95m")
print(response1[0]["generated_text"].strip())
print(response2[0]["generated_text"].strip())
print(response3[0]["generated_text"].strip())
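
The cell above prints the generations but leaves the selection step to the reader. A minimal sketch of that step is a majority vote, assuming the final answer is the last integer in each generation. `final_number`, `majority_answer`, and the sample generations are hypothetical stand-ins, not real model output.

```python
import re
from collections import Counter


def final_number(text):
    """Return the last integer in the text, taken here as the final answer."""
    matches = re.findall(r"\d+", text)
    return int(matches[-1]) if matches else None


def majority_answer(generations):
    """Return the most frequent final answer across generations, or None."""
    answers = [final_number(g) for g in generations]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Made-up generations standing in for response1..response3
samples = [
    "When you were 6 your sister was 3, so she is 3 years younger. She is 55.",
    "Half of 58 is 29, so your sister is 29.",
    "The age gap is 6 - 3 = 3 years. 58 - 3 = 55.",
]
print(majority_answer(samples))
```

For the prompt above the reference answer is 55: the sister was 3 when the speaker was 6, so the age gap is 3 years, and 58 - 3 = 55.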

### SageMaker Clean up 

In [None]:
# Delete the SageMaker model and endpoint to stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()