# Gender Prediction from name, using Deep Learning

Deep Neural Networks can be used to extract features in the input and derive higher level abstractions. This technique is used regularly in vision, speech and text analysis. In this exercise, we build a deep learning model that would identify low level features in texts containing people's names, and would be able to classify them in one of two categories - Male or Female.

## Recurrent Neural Networks and Long Short Term Memory
Since we have to process sequences of characters, Recurrent Neural Networks (RNNs) are a good fit for this problem. Traditional feed-forward neural networks fail whenever we need to persist what was learnt from previously seen data. Recurrent Neural Networks contain loops in the graph, which allow them to persist information in memory. Effectively, the loops pass information from one step of the network on to the next.
<details>
<summary><strong>Recurrent Neural Network - Loops (expand to view diagram)</strong></summary><p>
    ![Recurrent Neural Network - Loops](images/RNN-unrolled.png "Recurrent Neural Network - Loops")
</p></details>
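
The loop described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the code used later in this notebook: the layer sizes, the random weights and the name `rnn_forward` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 26, 8          # illustrative sizes, not the notebook's

# Randomly initialised parameters, shared across all time steps
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_forward(inputs):
    """Run a vanilla RNN over a sequence and return the final hidden state."""
    h = np.zeros(hidden_size)
    for x_t in inputs:
        # The "loop": the previous hidden state feeds into the current step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

sequence = np.eye(input_size)[[12, 0, 17, 24]]   # one-hot rows for "m", "a", "r", "y"
final_state = rnn_forward(sequence)
print(final_state.shape)
```

The same weights are reused at every step; only the hidden state `h` changes as the characters are read one after another.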


In practice however, when we need to selectively memorize or forget patterns seen in the past, based on the context, plain vanilla RNNs do not perform so well. Instead we can use a special type of RNN that can retain information over the long term, and thus works better at understanding the contextual relation between the patterns observed. These are known as Long Short Term Memory (LSTM) networks.

The nodes in an LSTM network consist of remember/forget gates that retain or pass on patterns learnt in the sequence that are useful for predicting the target variable. These gates are a way to optionally let information through, and they give LSTM networks the ability to remove or add information to the cell state in a regulated manner.
<details>
<summary><strong>LSTM - Chains (expand to view diagram)</strong></summary><p>
    ![LSTM - Chains](images/LSTM3-chain.png "LSTM - Chains")
</p></details>
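
To make the role of the gates concrete, here is a minimal NumPy sketch of a single LSTM time step. It assumes the standard LSTM formulation; the gate ordering inside the stacked weight matrices, the layer sizes and the function name `lstm_step` are choices made for this example, not details taken from Keras or from this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters of the four gates."""
    z = W @ x_t + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates take values in (0, 1)
    g = np.tanh(g)                                  # candidate cell update
    c_t = f * c_prev + i * g                        # forget old state, add new information
    h_t = o * np.tanh(c_t)                          # expose a filtered view of the cell state
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 26, 8                     # illustrative sizes
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size))
U = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in np.eye(input_size)[[12, 0, 17, 24]]:     # one-hot rows for "m", "a", "r", "y"
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The forget gate `f` scales down the old cell state, while the input gate `i` controls how much of the new candidate is written into it.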


## Network Architecture
The problem we are trying to solve is to predict whether a given name belongs to a male or a female. We will use supervised learning, where the character sequence making up the names would be the `X` variable, and the flag indicating **Male(M)** or **Female(F)** would be the `Y` variable.

We use a stacked 2-layer LSTM model and a final dense layer with softmax activation as our network architecture. We use categorical cross-entropy as the loss function, with an Adam optimizer. We also add 20% dropout layers for regularization, to avoid over-fitting.

### Dependencies
* We will use the Keras deep learning library to build the network. Therefore we import the symbolic interfaces needed.
* We also use pandas data frames to load and slice-and-dice the data.
* Finally we need numpy for matrix manipulation.
* While running on a SageMaker Notebook Instance, we choose the conda_tensorflow kernel, so that Keras code is compiled to use TensorFlow in the backend.
* If you choose a P2 or P3 class of instance for your Notebook, using TensorFlow ensures that the low level code takes advantage of all available GPUs, so no further dependencies need to be installed.


In [None]:
import numpy as np
import pandas as pd
from numpy import genfromtxt
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM
from keras.models import load_model
from sklearn.utils import shuffle
import boto3
import botocore  # needed later to catch botocore.exceptions.ClientError

## Data download
* The training data that we will use to train the LSTM model is derived from the US Government's SSA records of registered baby names.
* The original dataset is split into separate text files for the names registered every year, starting from 1880.
* Each record in each year's file contains the name, the gender identifier, and a count showing how many times that name has been registered.
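
As a quick illustration of the record layout described above, the following sketch parses a few lines in the same `name,gender,count` format. The sample values are made up for the example and are not quoted from the dataset.

```python
import io
import pandas as pd

# Illustrative lines in the "name,gender,count" layout; the values are made up
sample = io.StringIO("Mary,F,7065\nAnna,F,2604\nJohn,M,9655\n")
records = pd.read_csv(sample, names=["Name", "Gender", "Count"])
print(records.dtypes)
print(records)
```

Because the files have no header row, the column names are passed explicitly through `names`.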

In [None]:
! rm -rf data
! mkdir data
! wget https://www.ssa.gov/oact/babynames/names.zip -P data
! unzip -oq data/names.zip -d data
! rm data/names.zip
! rm data/NationalReadMe.pdf

As a first step we concatenate data in all year specific files into a single file, and remove the individual years' files.

In [None]:
! cat data/yob* > data/allnames.txt
! rm data/yob*

## Data analysis
As a first step, to facilitate convenient operation, we load the concatenated data as-is into a dataframe

In [None]:
filename = 'data/allnames.txt'
df=pd.read_csv(filename, sep=',', names = ["Name", "Gender", "Count"])

Naturally, when all files are concatenated, there will be many duplicate entries, because the same names get registered year after year.<p>
We test this assumption by taking a name, e.g. Mary, and filtering all records containing that name.

In [None]:
df.loc[df['Name'] == 'Mary'].head(5)

Notice that the same name, `Mary`, has been used both as a Male and as a Female name. This might throw the model off and affect its accuracy.

To remediate this, notice that some names are more popular as Female names, and some are more popular as Male names.

Run the same experiment with another name, such as `John`, and notice that this name occurs more often in the male population.

In [None]:
df.loc[df['Name'] == 'John'].head(5)

We also observe that even though some names are used both as Male and Female names, they are more commonly used for one gender than the other. For example, `Mary` is more common as a female name, whereas `John` is more common as a male name, as we saw above.<p>
Since the model we'll be building needs to map each name to exactly one gender, without loss of generality, we can prepare our training data set to have a fixed marker, `M` or `F`, for any particular name.

## Data cleanup
We'll remediate this problem using the following approach:
* Order the names by Name and Gender
* Add the count for each group of unique Name-Gender combination
* Iterate through the unique groups, and where a name is used for both Male and Female, choose to retain the entry with higher count
* Create a new clean data frame containing only unique records mapping each name to a single gender

In [None]:
# Total count per unique (Name, Gender) pair; the result is indexed by Name and Gender
grouped_df = df.groupby(["Name", "Gender"])["Count"].sum().to_frame()

After the data is grouped by Name and Gender into a new frame, notice that the new frame contains the Name and Gender as index, and the total count of occurrences as values.<p>
We therefore create a dictionary that will have the names as keys and the gender with the higher total count as values.<p>
We loop through the index of the grouped data frame and populate the entries into this dictionary following the logic described above.

In [None]:
names={}
for i in range(len(grouped_df.index.values)):
    #print(grouped_df.index[i][0] + ", " + grouped_df.index[i][1] + ", " + str(grouped_df.values[i][0]))
    if i > 0 and grouped_df.index[i][0] == grouped_df.index[i-1][0]:
        if grouped_df.values[i][0] > grouped_df.values[i-1][0]:
            names[grouped_df.index[i][0]] = grouped_df.index[i][1]
        else:
            names[grouped_df.index[i][0]] = grouped_df.index[i-1][1]
    else:
        names[grouped_df.index[i][0]] = grouped_df.index[i][1]
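
The same selection can also be expressed without an explicit loop. The sketch below is an alternative formulation, not the approach used in this notebook; it runs on a tiny made-up frame, and the variable names `totals`, `winners` and `names_by_gender` are chosen for the example.

```python
import pandas as pd

# Illustrative rows only; the counts are made up for the example
df_sample = pd.DataFrame({
    "Name":   ["Mary", "Mary", "John", "John", "Mary"],
    "Gender": ["F", "M", "M", "F", "F"],
    "Count":  [7000, 30, 9000, 40, 6500],
})

# Total count per (Name, Gender) pair
totals = df_sample.groupby(["Name", "Gender"], as_index=False)["Count"].sum()

# For each name, keep the row with the largest total
winners = totals.loc[totals.groupby("Name")["Count"].idxmax(), ["Name", "Gender"]]
names_by_gender = dict(zip(winners["Name"], winners["Gender"]))
print(names_by_gender)
```

On a tie, `idxmax` keeps the first row of the group, which is the `F` entry because the groups are sorted by gender.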

After the dictionary is populated, we create a clean data frame using the keys and values as columns, shuffling the rows as we go.

In [None]:
# sample(frac=1) shuffles the rows; reset_index discards the old row order
clean_df = pd.DataFrame(list(names.items()), columns=['Name', 'Gender']).sample(frac=1).reset_index(drop=True)

Notice that the cleaned up data only has unique records, and that it has single entries for the names - `Mary` and `John`, uniquely mapped to one gender.

In [None]:
print(clean_df.shape)
print(clean_df.loc[clean_df['Name'] == 'Mary'])
print(clean_df.loc[clean_df['Name'] == 'John'])

Finally we save the shuffled, clean data into a file, which we'll also use in subsequent phases of model training.

In [None]:
!mkdir -p data
clean_df.to_csv('data/name-gender.txt',index=False,header=False)

## Data preparation
As you'll see in the notebook where we orchestrate a pipeline to train, deploy and host the model, the container you create will need access to data on an S3 bucket.<p>
In order to prepare for the next step therefore, we'll do some pre-work here and upload the cleaned data to the S3 bucket that you created in module-1 of the workshop.


We can obtain the name of the S3 bucket from the execution role we attached to this Notebook instance. This should work if the role was granted permission to read IAM policies, as per the documentation.

If for some reason, it fails to fetch the associated bucket name, it asks the user to enter the name of the bucket. If asked, use the bucket that you created in Module-3, such as 'smworkshop-firstname-lastname'.<p>
    
It is important to ensure that this is the same S3 bucket, to which you provided access in the Execution role used while creating this Notebook instance.

In [None]:
sts = boto3.client('sts')
iam = boto3.client('iam')


caller = sts.get_caller_identity()
account = caller['Account']
arn = caller['Arn']
role = arn[arn.find("/AmazonSageMaker")+1:arn.find("/SageMaker")]
timestamp = role[role.find("Role-")+5:]
policyarn = "arn:aws:iam::{}:policy/service-role/AmazonSageMaker-ExecutionPolicy-{}".format(account, timestamp)

s3bucketname = ""
policystatements = []

try:
    policy = iam.get_policy(
        PolicyArn=policyarn
    )['Policy']
    policyversion = policy['DefaultVersionId']
    policystatements = iam.get_policy_version(
        PolicyArn = policyarn, 
        VersionId = policyversion
    )['PolicyVersion']['Document']['Statement']
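except Exception as e:
    # Could not read the policy; fall through to the manual prompt below
    print("Could not read the execution policy: {}".format(e))

for stmt in policystatements:
    actions = stmt['Action']
    # 'Action' and 'Resource' may each be a single string or a list
    if isinstance(actions, str):
        actions = [actions]
    if "s3:ListBucket" in actions:
        resources = stmt['Resource']
        resource = resources if isinstance(resources, str) else resources[0]
        s3bucketname = resource[resource.find(":::")+3:]

# Ask for the bucket name if it could not be derived from the policy
if not s3bucketname:
    s3bucketname = input("Which S3 bucket do you want to use to host training data and model? ")

print(s3bucketname)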
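
The bucket lookup can be tried out without calling AWS. The sketch below applies the same idea to a hand-written list of policy statements; the helper name `find_bucket_name` and the sample statements are hypothetical and only mimic the shape of an IAM policy document.

```python
def find_bucket_name(statements):
    """Return the bucket named in the first statement that allows s3:ListBucket, else ''."""
    for stmt in statements:
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # IAM allows either a single string or a list for both fields
        if isinstance(actions, str):
            actions = [actions]
        if isinstance(resources, str):
            resources = [resources]
        if "s3:ListBucket" in actions and resources:
            # Bucket ARNs look like arn:aws:s3:::bucket-name
            return resources[0].split(":::", 1)[-1]
    return ""

# Hypothetical policy statements shaped like an IAM policy document
sample_statements = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
     "Resource": ["arn:aws:s3:::smworkshop-firstname-lastname/*"]},
    {"Effect": "Allow", "Action": ["s3:ListBucket"],
     "Resource": ["arn:aws:s3:::smworkshop-firstname-lastname"]},
]
print(find_bucket_name(sample_statements))
```

An empty return value signals that no matching statement was found, in which case the bucket name has to be supplied by hand.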

Once we have our bucket name, we upload the data file under the `data/` prefix. This is the location we'll use during the final step, when we containerize and run the training.

In [None]:
s3 = boto3.resource('s3')
s3.meta.client.upload_file('data/name-gender.txt', s3bucketname, 'data/name-gender.txt')

At this point, we can clean up some space by deleting the raw data file.<p>

In [None]:
!rm -rf data/allnames.txt

## Feature representation
Before we start building the model, we need to represent the data in a format that we can feed into the LSTM model that we'll be creating.<p>
Although we already have the cleaned data loaded as a data frame, let's load the data fresh from the S3 location. That way we'll know for sure that our cleaned data is of good quality.

In [None]:
localfilename = "data/data.csv"
try:
    s3.Bucket(s3bucketname).download_file('data/name-gender.txt', localfilename)
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise
data=pd.read_csv(localfilename, sep=',', names = ["Name", "Gender"])
data = shuffle(data)

Let's do a quick check on the records, and validate that we have the same number of records as we saved into the file after cleaning.<p>

In [None]:
#number of names
num_names = data.shape[0]
print(num_names)

We need to convert the names into numeric arrays, using a one-hot encoding scheme.
The arrays representing the names need to be as long as the longest name record we have.
Therefore we find the length of the longest name and store it in a variable.

In [None]:
# length of longest name
max_name_length = (data['Name'].map(len).max())
print(max_name_length)

As a first step of feature engineering we extract all names as an array, and derive the set of characters used in the names.<p>
The way we choose to do so is to concatenate all characters into one string, and then derive a `set`. By definition, a `set` in Python contains only unique characters.

In [None]:
names = data['Name'].values
txt = ""
for n in names:
    txt += n.lower()
print(len(txt))

When we apply the `set` operation, we get 26 characters, one for each letter of the English alphabet, as expected.

In [None]:
chars = sorted(set(txt))
alphabet_size = len(chars)
print('Alphabet size:', len(chars))
print(chars)

In order for one-hot encoding to work, we need to assign an index value to each of these characters.<p>
Since we have all letters `a` to `z`, the most natural index would be to just assign sequential values.<p>
We create a Python `dictionary` with the character indices.

In [None]:
char_indices = dict((str(chr(c)), i) for i, c in enumerate(range(97,123)))
alphabet_size = 123-97
for key in sorted(char_indices.keys()):
    print("%s: %s" % (key, char_indices[key]))

Since we also need to somehow store the maximum length of a name record to be used later when we containerize our training and inference, as a good practice, let's also store that value as another entry into the same `dictionary`

In [None]:
char_indices['max_name_length'] = max_name_length

The one-hot encoded array would be of dimension `n * m * a`, where:
* `n` = Number of name records, 
* `m` = Maximum length of a record, and 
* `a` = Size of alphabet

Each of the `n` name records would be represented by 2-dimensional matrix of fixed size.<p>
This matrix would have number of rows equal to the maximum length of a name record.<p>
Each row would be of size equal to the alphabet size.<p>
For each position of a character in a given name, a row of this 2-dimensional matrix would be either all zeroes (if no letter is present in the corresponding position), or a row vector with a `1` in the position of the letter indicated in the index (and zeroes in other positions). 

So, the name `Mary` would look like this (note that we ignore case by converting names to lower case)<p>
m => [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]<br>
a => [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]<br>
r => [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]<br>
y => [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
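
The encoding of a single name can be sketched as follows. The helper name `one_hot_name` and the maximum length of 15 are assumptions made for this example; the notebook itself derives the maximum length from the data.

```python
import numpy as np

def one_hot_name(name, max_len, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Encode a name as a (max_len, len(alphabet)) matrix of zeros and ones."""
    encoded = np.zeros((max_len, len(alphabet)))
    for position, char in enumerate(name.lower()):
        encoded[position, alphabet.index(char)] = 1
    return encoded

mary = one_hot_name("Mary", max_len=15)   # 15 is an illustrative maximum length
print(mary.shape)
print(mary[:4].argmax(axis=1))            # column index of each letter
```

Rows beyond the fourth stay all zeroes, which is how shorter names are padded to the common length.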

We begin the encoding by creating a tensor containing all zeroes. Observe that the dimensions match the above description.

In [None]:
X = np.zeros((num_names, max_name_length, alphabet_size))
print(X.shape)

Then we iterate through each character in each name record and selectively turn the matching elements (as given by the character index) to ones.

In [None]:
for i,name in enumerate(names):
    name = name.lower()
    for t, char in enumerate(name):
        X[i, t,char_indices[char]] = 1
X[0,:,:]

Machine learning algorithms do not work well when the data is too skewed.<p>
So, let us validate that both genders are somewhat equally represented in the training data.

In [None]:
data['Gender'].value_counts()

With the `X` variables of the training data one-hot encoded, it is time to encode the target `Y` variable.<p>
To do so, we create a two-column matrix in which each row is one-hot encoded: `[1, 0]` represents Male and `[0, 1]` represents Female.

In [None]:
Y = np.ones((num_names,2))
Y[data['Gender'] == 'F',0] = 0
Y[data['Gender'] == 'M',1] = 0
Y
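
The column convention used for the labels is easy to get backwards, so here is a small sketch that builds the same kind of matrix from four made-up labels. It starts from zeroes and sets the matching column to one, which yields the same result as the cell above.

```python
import numpy as np

genders = np.array(["F", "M", "M", "F"])      # illustrative labels
labels = np.zeros((len(genders), 2))
labels[genders == "M", 0] = 1                 # column 0 flags Male
labels[genders == "F", 1] = 1                 # column 1 flags Female
print(labels)
```

Each row sums to one, which is what categorical cross-entropy expects from its targets.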

One last check to ensure that dimensions of `X` and `Y` are compatible.

In [None]:
print(X.shape)
print(Y.shape)

In [None]:
data_dim = alphabet_size
timesteps = max_name_length
num_classes = 2

## Model building
We build a stacked LSTM network with a final dense layer with softmax activation (many-to-one setup).<p>
Categorical cross-entropy loss is used with adam optimizer.<p>
A 20% dropout layer is added for regularization to avoid over-fitting. 

In [None]:
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(timesteps, data_dim)))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(num_classes))
model.add(Activation('softmax'))  # softmax pairs with categorical cross-entropy

model.compile(loss='categorical_crossentropy', 
              optimizer='adam',
              metrics=['accuracy'])
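
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM formulation with bias terms and an alphabet of 26 characters; the helper names are made up for the example, and `model.summary()` can be used to check the figures against the actual model.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

alphabet_size, hidden_units, num_classes = 26, 512, 2
first_lstm = lstm_params(alphabet_size, hidden_units)
second_lstm = lstm_params(hidden_units, hidden_units)
output_layer = dense_params(hidden_units, num_classes)
print(first_lstm, second_lstm, output_layer, first_lstm + second_lstm + output_layer)
```

Dropout layers add no parameters, so the two LSTM layers account for almost all of the roughly 3.2 million weights.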

## Model training
We train this model for just 1 epoch, as a trial, with a batch size of 128. Too large a batch size may result in an out of memory error.<p>
During training we designate 20% of the training data to be used as validation data. Keras takes the last 20% of the samples for this; since we shuffled the data earlier, this amounts to a random selection. Validation data is never presented to the model during training; instead it is used to ensure that the model works well with data that it has never seen.<p>
This confirms that we are not over-fitting, that is, the model is not simply memorizing the data it sees, and that it can generalize its learning.

In [None]:
model.fit(X, Y, validation_split=0.20, epochs=1, batch_size=128)
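
The effect of `validation_split` can be pictured with a small NumPy sketch. It assumes, as the Keras documentation describes, that the validation samples are taken from the end of the arrays before any per-epoch shuffling; the helper name `split_last_fraction` and the stand-in data are made up for the example.

```python
import numpy as np

def split_last_fraction(features, targets, validation_split=0.2):
    """Hold out the trailing fraction of the samples for validation."""
    split_at = int(len(features) * (1.0 - validation_split))
    return (features[:split_at], targets[:split_at]), (features[split_at:], targets[split_at:])

features = np.arange(10).reshape(10, 1)    # stand-in data, not name encodings
targets = np.arange(10)
(train_x, train_y), (val_x, val_y) = split_last_fraction(features, targets)
print(len(train_x), len(val_x))
```

This is why shuffling the data before training matters: an unshuffled file would hand the model a validation set made up of whatever happens to sit at the end.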

After training for only 1 epoch, if everything goes well, you should see about 79% accuracy on both the training and validation data, which is a pretty good result in itself.

During the orchestration phase, we'll attempt to increase the accuracy by training for more epochs. The advantage in doing so is that the potentially costly training operation will be offloaded to SageMaker managed infrastructure. We can choose a larger instance type for hosted training, and not worry about cost overrun, because SageMaker automatically provisions the training infrastructure and tears it down right after training finishes.

This allows us to choose cheaper instance type, as we did, for the Notebook instance itself.

## Model testing
To test the accuracy of the model, we now invoke the model locally, and pass it a comma separated list of names.<p>
Same data formatting, as we did previously on training data (one-hot encoding using the same character indices) would be needed here as well.<p>

In [None]:
names_test = ["Tom","Allie","Jim","Sophie","John","Kayla","Mike","Amanda","Andrew"]
num_test = len(names_test)

X_test = np.zeros((num_test, max_name_length, alphabet_size))

for i,name in enumerate(names_test):
    name = name.lower()
    for t, char in enumerate(name):
        X_test[i, t,char_indices[char]] = 1

We feed this one-hot encoded test data to the model, and `predict` generates a matrix similar to the training labels we used before. Except in this case, it contains what the model thinks is the gender represented by each of the test records.<p>
To present the data intuitively, we simply map each prediction back to `M` / `F`, depending on which of the two output columns has the higher score.

In [None]:
predictions = model.predict(X_test)

for i,name in enumerate(names_test):
    print("{} ({})".format(names_test[i],"M" if predictions[i][0]>predictions[i][1] else "F"))
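
The same mapping can be written with `argmax`, which also makes it easy to report how confident the model is. The probabilities below are hypothetical values standing in for real model output.

```python
import numpy as np

class_labels = np.array(["M", "F"])               # column 0 is Male, column 1 is Female
# Hypothetical model outputs, one row of class probabilities per name
sample_predictions = np.array([[0.91, 0.09],
                               [0.20, 0.80],
                               [0.55, 0.45]])
decoded = class_labels[sample_predictions.argmax(axis=1)]
confidence = sample_predictions.max(axis=1)
for label, score in zip(decoded, confidence):
    print(label, round(float(score), 2))
```

A score close to 0.5, as in the last row, suggests the name is ambiguous as far as the model is concerned.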

A quick glance at the result indicates that our model did a pretty good job in (almost) correctly identifying the gender of the test subjects, based on the provided names.

## Model saving
Our job is done: we have satisfied ourselves that the scheme works, and that we have a somewhat useful model that we can use to predict the gender of people from their names.<p>
In order to orchestrate the ML pipeline however, we need to confirm that the model can be saved and loaded from disk, and still be able to generate same predictions.

We have to save the model file (containing the weights), and the character indices (including the length of maximum name).<p>
This is why we saved the maximum name length as another entry into the dictionary of characters, so that we can load both at the same time.<p>
Note however that, using this scheme, our ability to generate predictions is limited to names no longer than the longest name in the training set.

In [None]:
model.save('GenderLSTM.h5')
char_indices['max_name_length'] = max_name_length
np.save('GenderLSTM.npy', char_indices) 

Subsequently we load the saved model from the files on the disk, and check to see the indices are loaded, as saved.

In [None]:
loaded_model = load_model('GenderLSTM.h5')
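# The dictionary is stored as a pickled object, so recent NumPy versions need allow_pickle=True
loaded_char_indices = np.load('GenderLSTM.npy', allow_pickle=True).item()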
max_name_length = loaded_char_indices['max_name_length']
loaded_char_indices.pop('max_name_length', None)
alphabet_size = len(loaded_char_indices)
print(loaded_char_indices)
print(max_name_length)
print(alphabet_size)
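
Since the character indices are a plain dictionary, they could also be persisted as JSON, which avoids pickling altogether. This is an alternative sketch, not what this notebook or the later container code does; the file name `GenderLSTM.json` and the maximum length of 15 are made up for the example.

```python
import json
import os
import tempfile

char_to_index = {chr(code): i for i, code in enumerate(range(ord("a"), ord("z") + 1))}
metadata = {"char_indices": char_to_index, "max_name_length": 15}   # 15 is illustrative

path = os.path.join(tempfile.mkdtemp(), "GenderLSTM.json")          # hypothetical file name
with open(path, "w") as handle:
    json.dump(metadata, handle)

with open(path) as handle:
    restored = json.load(handle)

print(restored["max_name_length"], len(restored["char_indices"]))
```

Keeping the maximum length under its own key also removes the need to pop it out of the character dictionary after loading.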

Finally we run a similar test as we did with the freshly created model.<p>
It should exhibit the same level of accuracy when presented with any previously unseen names.

In [None]:
names_test = ["Tom","Allie","Jim","Sophie","John","Kayla","Mike","Amanda","Andrew"]
num_test = len(names_test)

X_test = np.zeros((num_test, max_name_length, alphabet_size))

for i,name in enumerate(names_test):
    name = name.lower()
    for t, char in enumerate(name):
        X_test[i, t,loaded_char_indices[char]] = 1

predictions = loaded_model.predict(X_test)

for i,name in enumerate(names_test):
    print("{} ({})".format(names_test[i],"M" if predictions[i][0]>predictions[i][1] else "F"))
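
As noted earlier, predictions are limited to names no longer than the longest training name, and the encoding loop fails on characters outside `a` to `z`. One possible way to guard against both is sketched below: longer names are truncated and unknown characters are skipped. The helper name `encode_for_inference` is hypothetical, and truncation is a design choice that may make predictions for very long names less reliable.

```python
import numpy as np

def encode_for_inference(names, char_to_index, max_len):
    """One-hot encode names, truncating long names and skipping unknown characters."""
    encoded = np.zeros((len(names), max_len, len(char_to_index)))
    for i, name in enumerate(names):
        for t, char in enumerate(name.lower()[:max_len]):
            index = char_to_index.get(char)
            if index is not None:          # e.g. hyphens or accented letters
                encoded[i, t, index] = 1
    return encoded

char_to_index = {chr(code): i for i, code in enumerate(range(ord("a"), ord("z") + 1))}
batch = encode_for_inference(["Anne-Marie", "Bartholomew"], char_to_index, max_len=8)
print(batch.shape, batch.sum(axis=(1, 2)))
```

A skipped character leaves an all-zero row at its position, so the remaining letters keep their original positions in the sequence.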

In the next step, we'll use a separate notebook to containerize the training and prediction code, execute the training on SageMaker using appropriate container, and host the model behind an API endpoint.<p>
This would allow us to use the model from web-application, and put it into real use from our VoC application.

Head back to Module-3 of the workshop now, to the section titled - `Containerization`, and follow the steps described.