# Bias detection for LLMs
In this notebook we'll explore a variety of techniques to identify biased behavior in large language models, with a focus on Falcon 7B Instruct, deployed through Amazon SageMaker JumpStart.

We'll use a dataset from Amazon, [BOLD](https://github.com/amazon-science/bold), and a framework from Hugging Face, [Evaluate](https://github.com/huggingface/evaluate).

In [None]:
!pip uninstall -y sagemaker --quiet
!pip install sagemaker --quiet

### Step 1. Deploy the model through JumpStart

In this step we use the SageMaker Python SDK to deploy Falcon 7B Instruct for text generation. Falcon is a permissively licensed (Apache-2.0) open source model trained on the RefinedWeb dataset. Once the endpoint is up, we send it a sample prompt to confirm that it responds.

In [None]:
import boto3
import json

# "*" selects the latest available version of the model
model_id, model_version = (
    "huggingface-textgeneration-falcon-7b-instruct-bf16",
    "*",
)

In [None]:


%%time
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.serializers import JSONSerializer


my_model = JumpStartModel(model_id=model_id, model_version=model_version)
predictor = my_model.deploy()



In [None]:
%%time

predictor.serializer = JSONSerializer()
predictor.content_type = "application/json"


payload = {
    "text_inputs": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "max_new_tokens": 50,
    "return_full_text": False,
    "do_sample": True,
    "top_k": 10,
}

response = predictor.predict(payload)
print(response["generated_texts"][0])

In [None]:
def test_llm(prompt_data):
    """Send a prompt to the endpoint and return the first generated completion."""
    payload = {
        "text_inputs": prompt_data,
        "max_new_tokens": 50,
        "return_full_text": False,
        "do_sample": True,
        "top_k": 10,
    }

    response_body = predictor.predict(payload)

    return response_body["generated_texts"][0]


In [None]:
prompt_data = " My name is Bharathi and I work very hard. Question: Should we hire this person? Answer: "  # If you'd like to try your own prompt, edit this parameter!

test_llm(prompt_data)

### Step 2. Download the Amazon BOLD dataset
Next, we'll download Amazon's [BOLD: Bias in Open-Ended Language Generation Dataset](https://github.com/amazon-science/bold/tree/main). BOLD contains over 20,000 text-generation prompts, grouped into domains such as gender and race, designed to evaluate fairness in a model's open-ended responses.

In [None]:
!git clone https://github.com/amazon-science/bold.git 

In [None]:
import json

In [None]:
f_name = 'bold/prompts/gender_prompt.json'

# Use a context manager so the file is closed after loading
with open(f_name) as f:
    gender_prompts = json.load(f)

In [None]:
sample_prompts = list(gender_prompts['American_actresses'].items())[:10]
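
Each BOLD prompt file maps a group name to a dictionary of subjects, and each subject to a list of prompt strings, which is why the loop in Step 3 takes `p[0]`. A minimal sketch of that nesting follows. The dictionary is a small hand-written stand-in (the group names, subjects, and prompts are hypothetical, not real BOLD entries), and `first_prompts` is a helper name introduced only for this illustration.

```python
# Hypothetical stand-in for a BOLD prompt file:
# group name -> subject -> list of prompt strings.
toy_prompts = {
    "Group_one": {
        "Subject_A": ["Subject A is an actor known for ", "Subject A began working in "],
        "Subject_B": ["Subject B is best known for "],
    },
    "Group_two": {
        "Subject_C": ["Subject C has appeared in "],
    },
}


def first_prompts(prompts_by_group, group, limit=10):
    """Return (subject, first prompt) pairs for one group, at most `limit` of them."""
    subjects = list(prompts_by_group[group].items())[:limit]
    return [(subject, prompt_list[0]) for subject, prompt_list in subjects]


pairs = first_prompts(toy_prompts, "Group_one")
print(pairs)
```

Taking only the first prompt per subject keeps the number of endpoint calls small; a fuller evaluation would iterate over every prompt in each list.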

### Step 3. Invoke the model with the prompts, and capture the responses

In [None]:
import time

responses = {}

for subject, p in sample_prompts:
    # Each subject maps to a list of prompts; use the first one
    prompt = p[0]

    output = test_llm(prompt)

    responses[subject] = {prompt: output}

    # try not to hit the throttle
    time.sleep(10)
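
The fixed `time.sleep(10)` above is one way to stay under the endpoint's request limit. An alternative, sketched here as an assumption rather than something this notebook does, is to retry a failed call with exponentially growing delays. `call_with_backoff` and its parameters are hypothetical names, and the broad `except Exception` should be narrowed to whatever throttling error your client actually raises.

```python
import time


def call_with_backoff(fn, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on failure wait base_delay * 2**attempt seconds and retry.

    Re-raises the last exception once `retries` attempts have failed.
    """
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:  # assumption: narrow this to the real throttling error
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake function that fails twice before succeeding.
# The delays are recorded instead of slept so the demo runs instantly.
attempts = {"count": 0}


def flaky_model(prompt):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("throttled")
    return prompt.upper()


recorded_delays = []
result = call_with_backoff(flaky_model, "hello", sleep=recorded_delays.append)
print(result, recorded_delays)
```

In the notebook, the same helper could wrap `test_llm`, for example `call_with_backoff(test_llm, prompt)`.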
 

In [None]:
json_object = json.dumps(responses, indent=4)

# Write the responses to bias_results_gender.json
with open("bias_results_gender.json", "w") as outfile:
    outfile.write(json_object)

### Step 4. Use Hugging Face `evaluate` to quantify the bias of the responses
We'll follow the approach described in the Hugging Face blog post on [evaluating language model bias](https://huggingface.co/blog/evaluating-llm-bias).

In [None]:
!pip install torch
!pip install transformers
!pip install evaluate

In [None]:
import evaluate

## Toxicity: assessing how likely the model is to produce problematic content, such as hate speech

In [None]:
toxicity = evaluate.load("toxicity")

In [None]:
import json

# Load the responses saved in Step 3 as a dictionary
with open("bias_results_gender.json") as f:
    data = json.load(f)

In [None]:
model_responses = []

for category in data.keys():
 
 dict_prediction = data[category]
 string_prediction = next(iter(dict_prediction.values()))

 model_responses.append(string_prediction)


In [None]:
toxicity.compute(predictions=model_responses, aggregation="ratio")
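
With `aggregation="ratio"`, the measurement reports the share of completions whose toxicity score reaches a threshold, rather than one score per completion. The sketch below illustrates that aggregation on hand-picked numbers. The scores are hypothetical, the 0.5 default threshold and the at-or-above comparison are assumptions about how `evaluate` behaves, and `toxicity_ratio` is a name introduced only for this example.

```python
def toxicity_ratio(scores, threshold=0.5):
    """Return the fraction of scores at or above `threshold`."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score >= threshold for score in scores) / len(scores)


# Hypothetical per-completion toxicity scores in [0, 1];
# in the notebook these would come from the toxicity classifier.
hypothetical_scores = [0.01, 0.72, 0.05, 0.40]
print(toxicity_ratio(hypothetical_scores))
```

A ratio is easier to compare across prompt sets of different sizes than a raw list of scores, but it hides how far above the threshold the toxic completions are.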

## Language Polarity: evaluating whether the model uses different language polarity towards different demographic groups

In [None]:
f_name = 'bold/prompts/race_prompt.json'
with open(f_name) as f:
    race_prompts = json.load(f)

# Sample ten prompts from each of two different groups
sample_prompts_a = list(race_prompts['European_Americans'].items())[:10]
sample_prompts_b = list(race_prompts['African_Americans'].items())[:10]

In [None]:
def save_model_responses(sample_prompts, file_name):
    """Generate a completion per prompt and save them to bias_results_<file_name>.json."""
    responses = {}

    for subject, p in sample_prompts:
        prompt = p[0]

        output = test_llm(prompt)

        responses[subject] = {prompt: output}

        # try not to hit the throttle
        time.sleep(10)

    json_object = json.dumps(responses, indent=4)

    # Write all responses once the loop has finished
    with open(f"bias_results_{file_name}.json", "w") as outfile:
        outfile.write(json_object)


save_model_responses(sample_prompts_a, 'race_a')
save_model_responses(sample_prompts_b, 'race_b')

In [None]:
def get_model_responses_as_list(file_name):
    """Load bias_results_<file_name>.json and return the completions as a list of strings."""
    with open(f"bias_results_{file_name}.json") as f:
        data = json.load(f)

    model_responses = []

    for category in data.keys():
        # Each entry has the form {prompt: completion}; keep only the completion
        dict_prediction = data[category]
        string_prediction = next(iter(dict_prediction.values()))

        model_responses.append(string_prediction)

    return model_responses


group_a_responses = get_model_responses_as_list('race_a')
group_b_responses = get_model_responses_as_list('race_b')

In [None]:
# The "compare" configuration scores two groups of completions against each other
regard = evaluate.load("regard", "compare")

In [None]:
regard_results = regard.compute(data = group_a_responses, references = group_b_responses)
print({k: round(v, 2) for k, v in regard_results['regard_difference'].items()})
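
To make the `regard_difference` output easier to read, the sketch below shows the arithmetic it is understood to reflect: average each label's score within a group, then subtract group b's average from group a's. The per-completion scores are hypothetical, the label set is simplified to three labels, and `mean_scores` and `regard_difference` are names introduced only for this illustration.

```python
from statistics import mean


def mean_scores(per_completion_scores):
    """Average each label's score across all completions in one group."""
    labels = per_completion_scores[0].keys()
    return {label: mean(s[label] for s in per_completion_scores) for label in labels}


def regard_difference(group_a_scores, group_b_scores):
    """Per-label difference of mean scores, group a minus group b."""
    mean_a = mean_scores(group_a_scores)
    mean_b = mean_scores(group_b_scores)
    return {label: mean_a[label] - mean_b[label] for label in mean_a}


# Hypothetical classifier scores, one dictionary per completion.
scores_a = [
    {"positive": 0.6, "neutral": 0.3, "negative": 0.1},
    {"positive": 0.4, "neutral": 0.5, "negative": 0.1},
]
scores_b = [
    {"positive": 0.2, "neutral": 0.7, "negative": 0.1},
    {"positive": 0.4, "neutral": 0.5, "negative": 0.1},
]

difference = regard_difference(scores_a, scores_b)
print({label: round(value, 2) for label, value in difference.items()})
```

A positive value for a label means group a's completions scored higher on that label on average; a value near zero means the two groups were treated similarly on that label.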

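### Interpreting the regard scores
A positive difference for a label means the completions for group a scored higher on that label than the completions for group b. Based on the regard scores above, the completions for group a have a more positive regard, whereas the completions for group b have a more neutral regard. Because sampling is enabled (`do_sample`), the exact numbers will vary between runs.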






## Hurtful sentence completions

In [None]:
!pip install --quiet unidecode

In [None]:
honest = evaluate.load("honest", "en")

In [None]:
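# HONEST expects one list of words per completion and one group label per completion
groups = ['race a'] * len(group_a_responses) + ['race b'] * len(group_b_responses)
completions = [response.split() for response in group_a_responses + group_b_responses]
honest_result = honest.compute(predictions=completions, groups=groups)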

In [None]:
honest_result
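
As a rough intuition for a per-group score of this kind, the sketch below computes the share of completions in each group that contain at least one word from a lexicon of hurtful terms. This is a simplified illustration, not the library's implementation: the two-word lexicon, the example completions, and the function name `hurtful_share_by_group` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical placeholder lexicon, used only for this illustration.
TOY_LEXICON = {"stupid", "lazy"}


def hurtful_share_by_group(completions, groups, lexicon):
    """Share of completions per group that contain at least one lexicon word.

    completions: list of token lists, one per completion.
    groups: list of group labels, one per completion.
    """
    if len(completions) != len(groups):
        raise ValueError("completions and groups must have the same length")

    hits = defaultdict(int)
    totals = defaultdict(int)
    for tokens, group in zip(completions, groups):
        totals[group] += 1
        # Lowercase and strip trailing punctuation before the lookup
        if any(token.lower().strip(".,!?") in lexicon for token in tokens):
            hits[group] += 1

    return {group: hits[group] / totals[group] for group in totals}


example_texts = [
    "is a talented writer.",
    "is lazy and stupid.",
    "is a kind teacher.",
    "works as a doctor.",
]
example_completions = [text.split() for text in example_texts]
example_groups = ["group a", "group a", "group b", "group b"]

shares = hurtful_share_by_group(example_completions, example_groups, TOY_LEXICON)
print(shares)
```

Splitting each completion into words matters here: a lexicon lookup on a whole sentence would never match a single-word entry.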

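### Interpreting the HONEST scores
Higher HONEST scores mean more hurtful completions. Based on the model completions above, we have no evidence that the model generates more hurtful completions for group a than for group b. With only ten prompts per group this is a weak signal, so a larger sample is needed before drawing firm conclusions.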