
# re:Invent 2022 workshop AIM342: Advancing Responsible AI


## 0. Setup

Make sure you select the `conda_rai_kernel` in your notebook instance.

This kernel includes PyTorch (`torch`), torchvision, NumPy, Pillow, Matplotlib, OpenCV (`cv2`), seaborn, and pandas, which you need
for this notebook.

You may run this notebook by selecting Run->Run All Cells from the menu above. 

In [None]:
# Imports
from utils.dataset_utils import *
from utils.eval_utils import *
from utils.plotting_utils import *
from utils.train_utils import *

%matplotlib inline


## 1. Introduction

### 1.1 Assess bias in the world of handwritten numbers

Building and operating machine learning applications responsibly requires an active, consistent approach to prevent, assess, and mitigate bias. This workshop takes attendees through a computer vision case study in assessing unwanted bias, covering demographic group selection, evaluation dataset selection, bias metric selection, bias reporting, and root cause analysis – and concludes by discussing prevention, mitigation and communication strategies. Attendees will follow along with a Jupyter notebook. The workshop will be run by an AWS team of experts on Responsible AI that develops science, tools, and best practices to advance Responsible AI within AWS.

You will conduct experiments on a synthetic task that has parallels to real-world bias in AI: the handwritten numbers dataset. This dataset contains images of blue and orange handwritten numbers in [0,9]. In AI systems that represent people, such as face, voice, or speech recognition, we can use demographic attributes like gender, age, or race to identify groups of people. In the world of handwritten numbers, similar attributes exist. In addition to its digit identity 0-9, an instance from this dataset can be linear or curvy, orange or blue. We call the instances *handwritten numbers*. 

Attributes of handwritten numbers:
* digit = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}
* curviness = {curvy, neutral, linear}
* color = {orange, blue}

We define our groups with these attributes. For example, "all numbers where color==orange" is a group. Using these groups, we can practice the bias measurement techniques that apply to real-world AI systems. We will not draw a 1:1 correspondence between numbers and people. Although this is a simplification, the handwritten numbers dataset gives us a practical and objective sandbox. 
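As a minimal sketch of how such groups can be selected, the example below builds a small hypothetical table with the three attribute columns and filters it with pandas boolean masks. The `toy` DataFrame and its rows are made up for illustration; the real data comes from the workshop utilities.

```python
import pandas as pd

# Hypothetical toy annotations, using the same attribute names as the dataset.
toy = pd.DataFrame({
    "digit": ["3", "8", "8", "1"],
    "curviness": ["curvy", "curvy", "neutral", "linear"],
    "color": ["orange", "blue", "orange", "blue"],
})

# A group is the set of rows that match a condition on one attribute.
orange = toy[toy["color"] == "orange"]

# An intersectional group combines conditions on several attributes.
orange_eights = toy[(toy["color"] == "orange") & (toy["digit"] == "8")]

print(len(orange), len(orange_eights))
```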

The handwritten numbers dataset is derived from MNIST [1] and inspired by other derivatives, such as biasedMNIST [2].

1. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. (1998) "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998. Online (http://yann.lecun.com/exdb/publis/index.html#lecun-98)
2. Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. (2020) "Learning de-biased representations with biased representations." In Proceedings of the 37th International Conference on Machine Learning (ICML'20). JMLR.org, Article 50, 528–539. Online (http://proceedings.mlr.press/v119/bahng20a/bahng20a.pdf)

### 1.2 Learning Objectives
In this workshop you will learn to:
* Measure and understand bias
* Apply bias metrics
* Explore bias between intersectional groups
* Compare results across evaluation datasets
* Compare results across training datasets


### 1.3 Runtime
This module takes about 60 minutes to run.

## 2. Define the problem

Your team is ready to launch an image recognition service for its customers in the world of handwritten numbers. The service will take number selfies as input, and produce the identity of the number, 0-9, as output. The model uses computer vision technology to perform this recognition. You will review the model for unwanted bias, using an evaluation dataset gathered internally. 

**Your task is to run the model on the evaluation set and measure bias in the output, in order to make a business decision: does the model treat all numbers fairly, at least in the groups you have identified? Is this service fair enough?**

### 2.1 Download the model and evaluation data 

In [None]:
# This takes about 15s
train_data, eval_data = create_starting_data()


In [None]:
# This takes about 10s
base_model = train_baseline_model(df=train_data)

## 3. Understand your evaluation dataset

#### 3.1 Learn what the model should do, by looking at the evaluation data.

First, look at the structure. See how your groups are captured as annotations.

In [None]:
print(eval_data[['digit','curviness','color']][:5])

#### 3.2 Look at sample images to understand the input.

In [None]:
# View an example from each digit group in the evaluation set. 
# Run this cell multiple times to view different examples.
plot_one_each(eval_data)

#### 3.3 Look at the groups to understand what analyses you can run

* Our data is annotated with digits, so we can measure whether the model performs better for 9's or for 3's. 
* Our data is annotated with color, so we can measure whether the model performs better for blues or for oranges.
* Our data is annotated with curviness, so we can measure whether the model performs better for curvy or for linear instances. 

#### 3.4 Find out whether groups you care about are well represented in the evaluation data.

Use a histogram to view representation. In this exercise, your customers are numbers and their attributes are digit, color, and curviness.
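Raw counts can be easier to compare across attributes when expressed as shares. The sketch below assumes a DataFrame with a `color` column like the evaluation data; the `toy` values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: three blue instances and one orange instance.
toy = pd.DataFrame({"color": ["blue", "blue", "blue", "orange"]})

# normalize=True returns the fraction of rows in each group instead of the count.
shares = toy["color"].value_counts(normalize=True).sort_index()
print(shares)
```

A group with a very small share contributes few instances, so accuracy measured on it will be less certain.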

In [None]:
eval_data.digit.value_counts().sort_index().plot(kind='bar', xlabel='Digit', ylabel="Count", rot=0)

In [None]:
eval_data.color.value_counts().sort_index().plot(kind='bar', xlabel='Color', ylabel="Count", rot=0)

In [None]:
eval_data.curviness.value_counts().sort_index().plot(kind='bar', xlabel='curviness', ylabel="Count", rot=0)

## 4. Run the model on evaluation data


Now that you understand the evaluation data, run the model on this data to see what the outputs are.

For some outputs, the prediction matches the true digit. When the model makes an error, the prediction doesn't match.

In [None]:
results_df = run_baseline_model(base_model, eval_data)

#### 4.1 View some examples where the model was correct

In [None]:
df = results_df.loc[(results_df['label'] == results_df['pred1'])][:5]
plot_five(df)

#### 4.2 View examples where the model made an error

In [None]:
df = results_df.loc[(results_df['label'] != results_df['pred1'])][:5]
plot_five(df)

## 5. Measure overall accuracy

#### 5.1 Measure the accuracy of these outputs: how often did the model correctly recognize the number?

Accuracy is the percentage of instances for which the model prediction matches ground truth.

Once we know how to measure accuracy, we will measure bias as differences in accuracy across groups.
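The implementation of `compute_accuracy` is not shown in this notebook. As a minimal sketch of the definition above, assuming a DataFrame with a ground-truth column and a prediction column, accuracy can be computed as follows (the `accuracy` function and `toy_results` data are hypothetical, and the result is a fraction rather than a percentage):

```python
import pandas as pd

def accuracy(df, label_col, pred_col):
    """Fraction of rows where the prediction equals the ground-truth label."""
    return float((df[label_col] == df[pred_col]).mean())

# Hypothetical toy results: three of the five predictions match the label.
toy_results = pd.DataFrame({
    "label": [3, 8, 8, 1, 4],
    "pred1": [3, 8, 3, 1, 9],
})

toy_acc = accuracy(toy_results, "label", "pred1")
print(toy_acc)
```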

In [None]:
# Compute accuracy on the full evaluation set
acc = compute_accuracy(results_df, 'label', 'pred1')
print('Accuracy: ', acc)

#### 5.2 Calculate a confidence interval (CI) on the accuracy, to understand how strong the finding is.

A one-time calculation of accuracy on a sample of data is really an estimate for the accuracy of the model on any data drawn from the same distribution as your sample. 

A confidence interval is a standard statistical technique based on variability in the data, to compute upper and lower bounds on accuracy for any such in-distribution data. We calculate 95% intervals: if we repeated the evaluation on many samples from the same distribution, about 95% of the intervals computed this way would contain the true accuracy.

If the confidence interval is narrow, the measured accuracy is a precise estimate of model performance on data drawn from the same distribution as this evaluation set. While the point estimate of accuracy informs us about the model, the width of the CI tells us more about the evaluation data: how many instances it contains and how variable the results are.

Later we will use CIs to understand whether differences in accuracy (i.e., unwanted bias) are meaningful.
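The method used by `compute_confidence_bounds` is not shown in this notebook. As one common option, the sketch below uses the normal approximation for a proportion, treating each prediction as an independent correct/incorrect trial. The function name `normal_approx_ci` and the counts are assumptions made for illustration.

```python
import math

def normal_approx_ci(num_correct, num_total, z=1.96):
    """Approximate 95% confidence interval for accuracy (z=1.96).

    Uses the normal approximation to the binomial distribution, which is
    reasonable for large samples and accuracies not too close to 0 or 1.
    """
    p = num_correct / num_total
    half_width = z * math.sqrt(p * (1 - p) / num_total)
    # Clip to [0, 1] because accuracy cannot fall outside that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Same accuracy (90%), different sample sizes.
lb_large, ub_large = normal_approx_ci(900, 1000)
lb_small, ub_small = normal_approx_ci(90, 100)

print(f"n=1000: [{lb_large:.4f}, {ub_large:.4f}]")
print(f"n=100:  [{lb_small:.4f}, {ub_small:.4f}]")
```

With the same accuracy, the smaller sample gives a wider interval. This matters for group-level analysis, where each group has fewer instances than the full evaluation set.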

In [None]:
lb, ub = compute_confidence_bounds(results_df, 'label', 'pred1')
print(f'95% confidence interval for accuracy is [{round(lb, 4)}, {round(ub, 4)}]')

## 6. Measure group-level accuracy

Your goal is to see whether the model treats all groups fairly. Next, see how accuracy breaks down into groups.

#### 6.1 Start with groups defined by their digit. Does the model have good accuracy for all 9’s? For all 3’s?

In [None]:
# Accuracy broken down by digit
print_accuracy_by_group(results_df, group_col='digit', prediction_col='pred1', ground_truth_col='label')

In [None]:
# Plot accuracy broken down by digit identity
plot_confidence_intervals_from_df(df_with_preds=results_df,
 label_col='label',
 pred_col='pred1',
 dataset_name='eval_data',
 group_name='digit',
 show_error=True,
 marker='o',
 use_legend=False)

#### 6.2 Next, look at color. Does the model have good accuracy for orange and blue?

In [None]:
# Print out accuracy broken down by color
print_accuracy_by_group(results_df, group_col='color', prediction_col='pred1', ground_truth_col='label')

In [None]:
# Plot accuracy broken down by color
plot_confidence_intervals_from_df(df_with_preds=results_df,
 label_col='label',
 pred_col='pred1',
 dataset_name='eval_data',
 group_name='color',
 show_error=True,
 use_legend=False)

#### 6.3 Finally, look at curviness. Does the model have good accuracy for all groups?

In [None]:
# Print accuracy broken down by curviness
print_accuracy_by_group(results_df, group_col='curviness', prediction_col='pred1', ground_truth_col='label')

In [None]:
# Plot accuracy broken down by curviness
plot_confidence_intervals_from_df(df_with_preds=results_df,
 label_col='label',
 pred_col='pred1',
 dataset_name='eval_data',
 group_name='curviness',
 use_legend=False,
 show_error=True)

#### 6.4 Review initial findings. You found disparities in accuracy. Is it bias?

✅ **Check-in:** *Your task is to measure bias in the model by comparing accuracy across number groups that you care about. So far, you have explored a dataset that is annotated for these groups, and run your model on the data. You calculated overall accuracy to get an idea of how well the model performs on average, and you measured accuracy at the group level to compare performance for customer groups. You found disparities in accuracy. Is it bias?*

## 7. Compare bias metrics using digit groups

Start with your fairness goal: all of your customers should experience comparable performance from your service. 


#### 7.1 Calculate disparity in accuracy between best- and worst-served numbers, to get a feel for the bias in your evaluation.

In [None]:
# Difference in accuracy between best-served and worst-served groups
digit_accuracy = get_group_accuracy_by_value(results_df, group_col='digit', prediction_col='pred1', ground_truth_col='label')
print_absolute_vs_relative_accuracy(digit_accuracy)

You found a disparity. The absolute disparity is the same whether you measure it in accuracy or in error rate, because error rate = 1 - accuracy.

#### 7.2 Calculate the ratio of errors.

The ratio of errors tells you how often a disadvantaged group experiences negative outcomes, compared to another group.
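The helpers `print_absolute_vs_relative_accuracy` and `print_absolute_vs_ratio_error` are not shown in this notebook, so the definitions below are assumptions: absolute disparity as best minus worst accuracy, relative disparity as that gap divided by the best accuracy, and error ratio as worst-group error over best-group error. The function `disparity_metrics` and the toy accuracies are hypothetical.

```python
def disparity_metrics(group_accuracy):
    """Compare the best- and worst-served groups.

    group_accuracy maps a group name to its accuracy as a fraction in [0, 1].
    """
    best = max(group_accuracy, key=group_accuracy.get)
    worst = min(group_accuracy, key=group_accuracy.get)
    acc_best, acc_worst = group_accuracy[best], group_accuracy[worst]
    err_best, err_worst = 1 - acc_best, 1 - acc_worst
    return {
        "best_group": best,
        "worst_group": worst,
        # Same value whether computed from accuracy or from error rate.
        "absolute_disparity": acc_best - acc_worst,
        # Assumed definition: gap as a fraction of the best accuracy.
        "relative_disparity": (acc_best - acc_worst) / acc_best,
        # How many times more often the worst group sees an error.
        "error_ratio": err_worst / err_best if err_best > 0 else float("inf"),
    }

# Hypothetical accuracies for two digit groups.
metrics = disparity_metrics({"3": 0.90, "8": 0.98})
for name, value in metrics.items():
    print(name, value)
```

In this toy case the absolute gap is 8 percentage points, yet the worst-served group experiences about 5 times as many errors as the best-served group. The two views of the same disparity can lead to different conclusions about severity.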

In [None]:
print_absolute_vs_ratio_error(digit_accuracy, 'digit')

If the handwritten numbers have to be classified before passing through airport security, the error ratio tells you how much more often numbers from the worst-performing group get stopped for a pat-down, compared to the best-performing group.

## 8. Apply metrics for bias to color and curviness groups

#### 8.1 Calculate absolute disparity in accuracy for groups by color and groups by curviness.

In [None]:
color_accuracy = get_group_accuracy_by_value(results_df, group_col='color', prediction_col='pred1', ground_truth_col='label')
print_absolute_vs_ratio_error(color_accuracy)
print_absolute_vs_relative_accuracy(color_accuracy)

In [None]:
curve_accuracy = get_group_accuracy_by_value(results_df, group_col='curviness', prediction_col='pred1', ground_truth_col='label')
print_absolute_vs_ratio_error(curve_accuracy)
print_absolute_vs_relative_accuracy(curve_accuracy)

#### 8.2 Interpret your results: is this model biased?

✅ **Check-in:** *Your task is to measure bias across several groups. You measured bias using absolute and relative differences in both accuracy and error. You found that there is disparity across digit and curviness groups. You may drive business decisions based on your tolerance for this level of perceived bias.*

## 9. Apply metrics to group intersections

You have measured bias based on one attribute at a time. Next, apply bias metrics to the intersections of groups, like "blue 8’s", to understand how group disparities interact.
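As a rough sketch of what an intersectional breakdown involves, the example below groups a hypothetical results table by two attributes at once with pandas. The workshop helpers may compute this differently; the `toy_results` data is invented so that each single attribute looks balanced while the intersections do not.

```python
import pandas as pd

# Hypothetical results: 8's and 3's, blue and orange, two instances each.
toy_results = pd.DataFrame({
    "digit": ["8", "8", "8", "8", "3", "3", "3", "3"],
    "color": ["blue", "blue", "orange", "orange",
              "blue", "blue", "orange", "orange"],
    "label": [8, 8, 8, 8, 3, 3, 3, 3],
    "pred1": [8, 8, 8, 3, 8, 3, 3, 3],
})

# Mark each row as correct or not, then average within each intersection.
toy_results["correct"] = toy_results["label"] == toy_results["pred1"]
by_intersection = (
    toy_results.groupby(["digit", "color"])["correct"].agg(["mean", "count"])
)
by_color = toy_results.groupby("color")["correct"].mean()

print(by_color)
print(by_intersection)
```

Here both colors have the same accuracy overall, but orange 8's and blue 3's are served worse than their counterparts. The `count` column is a reminder that intersections contain fewer instances, so their accuracy estimates are less certain.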

#### 9.1 View accuracy by digit-color intersections

In [None]:
# Print out the accuracy on images by digit-color intersections
print_accuracy_by_intersection(results_df, group1_col='digit', group2_col='color', prediction_col='pred1', ground_truth_col='label')

#### 9.2 Plot accuracy on digit-color intersections

In [None]:
# Plot accuracy on images by digit-color intersections
plot_intersectional_confidence_intervals_from_df(results_df, label_col='label', pred_col='pred1', 
 dataset_name='validation data', group_1='digit', group_2='color',
 use_legend=False)

#### 9.3 Calculate disparity on color-digit intersections

In [None]:
color_digit_accuracy = get_accuracy_by_intersection(results_df, group1_col='digit', 
 group2_col='color', prediction_col='pred1', 
 ground_truth_col='label')
print_absolute_vs_ratio_error(color_digit_accuracy)
print_absolute_vs_relative_accuracy(color_digit_accuracy)

#### 9.4 Interpret intersectional results

Oranges get similar accuracy to blues.

8's get above-average accuracy.

However, orange 8's get lower-than-average accuracy, and much lower accuracy than blue 8's.

Orange 3's get reasonable accuracy, but blue 3's drive the last-place performance of 3's overall.

## 10. Measure bias on a new evaluation set

Corporate customers of your service may want to use it in their own applications. First, they need to verify that the service is not biased on their own end-user data. Use what you learned to explore bias on the new evaluation set.

#### 10.1 Pull up the customer data and explore it: look at sample images

The file structure and groups are the same as in the original. Look at samples to understand whether the images are similar.

In [None]:
customer_data = get_val2(eval_data)
plot_one_each(customer_data)

You find that all the 4’s used terrible cameras, and their selfies are corrupted. Next, check whether this affects your determination of bias.

#### 10.2 Run the baseline model on your customer data

In [None]:
customer_results_df = run_baseline_model(base_model, customer_data)

#### 10.3 Calculate accuracy on customer data: overall and broken down by digit

In [None]:
acc = compute_accuracy(customer_results_df, 'label', 'pred1')
print(f'overall accuracy on customer data = {acc}')
print_accuracy_by_group(customer_results_df, group_col='digit', prediction_col='pred1', ground_truth_col='label')
plot_confidence_intervals_from_df(customer_results_df, 'label', 'pred1', 'customer_data', 'digit', use_legend=False)

#### 10.4 Compute disparity metrics

In [None]:
digit_accuracy = get_group_accuracy_by_value(customer_results_df, group_col='digit', prediction_col='pred1', ground_truth_col='label')
print_absolute_vs_ratio_error(digit_accuracy)
print_absolute_vs_relative_accuracy(digit_accuracy)

The corrupted images you found in 10.1 seem to affect the accuracy and disparity of the model. 

The worst-performing group has changed, and the overall disparity of best-vs-worst has increased.

✅ **Check-in:** *Your task is to measure bias on multiple evaluation datasets. You evaluated the same model on an internal evaluation set, and on a dataset provided by your customer. While overall accuracy was stable, group-level differences and overall bias changed.*

## 11. Attempt to mitigate bias against 3's

We found disparities on 3's in the original dataset, and on 4's in the customer data. 

Mitigation strategies for these two situations may be different.

Next, we will try collecting more training data for 3's and re-test on the original dataset.

#### 11.1 Collect more training data and re-train

In [None]:
# This takes about 12s
augmented_train_data = create_augmented_data_2()
# This takes about 10s
augmented_model = train_baseline_model(df=augmented_train_data)

#### 11.2 Run the new model on your original evaluation set

In [None]:
# Run the augmented model on the original eval_data
augmented_results_df = run_baseline_model(augmented_model, eval_data)

#### 11.3 Calculate accuracy of the new model

Does accuracy improve?

In [None]:
# Original training data
acc = compute_accuracy(results_df, 'label', 'pred1')
print(f'Overall accuracy on original training + eval = {acc}')

acc2 = compute_accuracy(augmented_results_df, 'label', 'pred1')
print(f'Overall accuracy on augmented training + eval = {acc2}')

#### 11.4 Plot accuracy by digit

Does performance on 3's improve? 

In [None]:
plot_confidence_intervals_from_two_dfs(results_df, augmented_results_df,
 label_col='label', pred_col='pred1',
 dataset_name_1='original data', dataset_name_2='augmented data',
 group_name='digit',
 use_legend=False, figsize=(12, 6), fontsize=16, tick_spacing=1,
 show_error=True, marker_1='o', marker_2='^')

#### 11.5 Calculate disparity

Does disparity improve?

In [None]:
digit_accuracy = get_group_accuracy_by_value(augmented_results_df, group_col='digit', prediction_col='pred1', ground_truth_col='label')
print_absolute_vs_ratio_error(digit_accuracy)
print_absolute_vs_relative_accuracy(digit_accuracy)

✅ **Check-in:** *You have assessed bias in the digit recognition service, using multiple evaluation datasets and two models. You have discovered disparities that vary based on both model and dataset, and you have interpreted the results. Great work!*