# Notebook to build a deep learning model to predict Gender from name

This notebook follows these steps:
1. Download the dataset
2. Explore and pre-process the dataset
3. Showcase the encoding (names, character-to-integer encoding, character one-hot encoding)
4. Submit a SageMaker training job


### Step 1: Download the data from https://www.ssa.gov/oact/babynames/names.zip
When you unzip the download, you will find several files with names such as 'yob1880.txt'. 
In this naming convention, 'yob' stands for 'Year of Birth' and is followed by the year, 
so each file contains the popular names of babies born in that year.

We will first create a folder called data, then download and unzip the archive into it. We set the 2016 file 
aside as test data and concatenate the remaining files into a single file named 'allnames.txt'.

In [None]:
! rm -rf data
! mkdir data
! wget https://www.ssa.gov/oact/babynames/names.zip -P data
! unzip -oq data/names.zip -d data
! rm data/names.zip
! rm data/NationalReadMe.pdf
! mv data/yob2016.txt data/test_data.txt
! cat data/yob* > data/allnames.txt
! rm data/yob*
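
The shell commands above discard the year encoded in each file name. If the year is needed later, the files could be combined in Python instead. The following is a minimal sketch under the assumptions stated in Step 1 (files named `yobYYYY.txt`, each line of the form `name,gender,count`); `combine_yob_files` is a hypothetical helper, not part of the original notebook.

```python
import csv
import re
from pathlib import Path


def combine_yob_files(directory):
    """Read every yobYYYY.txt file in `directory` into one list of rows.

    Each returned row is a tuple (year, name, gender, count).
    Assumption: every data line has the form name,gender,count.
    """
    rows = []
    for path in sorted(Path(directory).glob("yob*.txt")):
        match = re.fullmatch(r"yob(\d{4})\.txt", path.name)
        if match is None:
            continue  # skip files that do not follow the naming convention
        year = int(match.group(1))
        with path.open(newline="") as handle:
            for record in csv.reader(handle):
                if len(record) != 3:
                    continue  # skip blank or malformed lines
                name, gender, count = record
                rows.append((year, name, gender, int(count)))
    return rows
```

Calling `combine_yob_files('data')` before the `yob*` files are removed would return one row per name, gender, and year.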

### Step 2: Explore and pre-process the data

In [None]:
import numpy as np
import pandas as pd
from numpy import genfromtxt

filename = 'data/allnames.txt'
df=pd.read_csv(filename, sep=',', names = ["Name", "Gender", "Count"])

Let's look at the size of the data.

In [None]:
df.shape

There are 1.89M rows and 3 columns. Now let's see what the data looks like.

In [None]:
df.head(10)

The dataset has 3 columns: Name, Gender, and Count. Count is the number of times the name was registered with the 
United States Social Security Administration in a given year. The names sound familiar for the United States. Since we 
concatenated one file per year of birth, the same name can occur multiple times. Let us check how many times Mary occurs.

In [None]:
df.loc[df['Name'] == 'Mary'].head(10)

## Looking at sample data
The name 'Mary' occurs multiple times, and in some rows Mary is 
listed as male. In the early 20th century Mary used to be a
common name for boys, and it is somewhat related to Mario.
But, looking at the counts, Mary is much more popular 
as a female name than as a male name. So it is not always possible to 
determine the gender of a person from the name alone. 

The second problem is that the name Mary appears multiple times 
in the dataset. We will remove the redundant entries, 
but first we will drop the counts, since 
we will not use them for training.
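
Dropping duplicates, as done in the next cells, keeps both (Mary, F) and (Mary, M) as separate rows. One alternative, which is not what this notebook does, would be to sum the counts per name and gender and keep only the more frequent gender for each name. A minimal sketch, assuming a dataframe with the same `Name`, `Gender`, and `Count` columns; `majority_gender` is a hypothetical helper, and the counts in the sample are toy values for illustration only.

```python
import pandas as pd


def majority_gender(df):
    """Return one row per name, labelled with its most frequently registered gender."""
    totals = df.groupby(["Name", "Gender"], as_index=False)["Count"].sum()
    # Sort so that the largest total per name comes first, then keep that row.
    totals = totals.sort_values(["Name", "Count"], ascending=[True, False])
    return totals.drop_duplicates(subset="Name", keep="first").reset_index(drop=True)


# Toy data: 'Mary' appears in several rows and under both genders.
sample = pd.DataFrame(
    {
        "Name": ["Mary", "Mary", "Mary", "John"],
        "Gender": ["F", "M", "F", "M"],
        "Count": [100, 5, 90, 80],
    }
)
result = majority_gender(sample)
print(result)
```

With this approach the counts must be kept until after the aggregation, so the `Count` column would be dropped later than in the cells below.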

In [None]:
# Since we do not need the 'Count' column, let's drop it from the dataframe
df = df.drop(['Count'], axis=1)

In [None]:
# Let's remove duplicates
df = df.drop_duplicates()

# Check the presence of Mary again (print, since only the last expression of a cell is displayed)
print(df.loc[df['Name'] == 'Mary'])

# Let's shuffle the dataset
df = df.sample(frac=1).reset_index(drop=True)

In [None]:
# Let's find the number of rows we have now. We want a
# reasonable number of rows to train our deep learning model
num_names = df.shape[0]
print('Number of names in the training dataset', num_names)

In [None]:
# Find the longest name
max_name_length = (df['Name'].map(len).max())
print("Longest name:", max_name_length)

In [None]:
!rm -rf namesdata
!mkdir namesdata

In [None]:
df.to_csv('namesdata/train_names.csv',index=False)

In [None]:
test_file = 'data/test_data.txt'
df_test=pd.read_csv(test_file, sep=',', names = ["Name", "Gender", "Count"])
df_test = df_test.drop(['Count'], axis=1)
df_test.to_csv('namesdata/test_names.csv',index=False,header=False)

In [None]:
df_test.shape

### Assumption
Beyond this point, the model assumes that names contain only
the 26 letters of the English alphabet. The algorithm has to be modified slightly 
to use the same model for other languages.
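
Before encoding, it may be worth checking that this assumption holds for the data at hand. A minimal sketch, using a hypothetical helper `is_plain_ascii_name`, that flags names containing anything other than the letters a to z after lowercasing; the sample names are made up for illustration.

```python
import re

# Matches strings made up only of the lowercase letters a-z.
_LETTERS_ONLY = re.compile(r"[a-z]+")


def is_plain_ascii_name(name):
    """Return True if the lowercased name consists only of the letters a-z."""
    return _LETTERS_ONLY.fullmatch(name.lower()) is not None


names_to_check = ["Mary", "Anne-Marie", "José", "O'Neil"]
unsupported = [n for n in names_to_check if not is_plain_ascii_name(n)]
print(unsupported)
```

Names flagged this way would need either a larger alphabet in the encoding or a normalization step before they can be encoded with the 26-column scheme.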

### One-hot encoding of characters
We cannot feed character symbols directly to the neural network,
so we will convert each name into a one-hot encoded sequence, based on a character mapping.

First, let's encode each character as an integer, and then encode the integers as one-hot vectors. 
In one-hot encoding, 'a' is represented as an array with the first column set to 1, 'b' with the second column set to 1, and so on.

a => [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

e => [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
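
The two-step mapping can be sketched for a single name as follows. This is a minimal illustration rather than the notebook's exact code: `one_hot_name` and `decode_one_hot` are hypothetical helpers, and `max_len=6` is an arbitrary value chosen to keep the output small.

```python
import string

import numpy as np

# Map 'a'..'z' to the integers 0..25.
CHAR_TO_INT = {ch: i for i, ch in enumerate(string.ascii_lowercase)}


def one_hot_name(name, max_len):
    """Encode a name as a (max_len, 26) matrix with one row per character."""
    encoded = np.zeros((max_len, len(CHAR_TO_INT)))
    for position, char in enumerate(name.lower()[:max_len]):
        encoded[position, CHAR_TO_INT[char]] = 1
    return encoded


def decode_one_hot(encoded):
    """Recover the lowercase name; all-zero rows are treated as padding."""
    letters = [string.ascii_lowercase[row.argmax()] for row in encoded if row.any()]
    return "".join(letters)


matrix = one_hot_name("Mary", max_len=6)
print(matrix.shape)            # (6, 26)
print(decode_one_hot(matrix))  # mary
```

Rows beyond the length of the name stay all zeros, which is how shorter names are padded to a common size.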


In [None]:
# Let's define a dictionary to help us with char-to-integer encoding
char_to_int = {'a':0,'b':1,'c':2,'d':3,'e':4,'f':5,'g':6,'h':7,'i':8,'j':9,'k':10,'l':11,'m':12,'n':13,'o':14,'p':15,'q':16,'r':17,'s':18,'t':19,'u':20,'v':21,'w':22,'x':23,'y':24,'z':25}

In [None]:
# X will be the input to the neural network. It is a 3D NumPy array
# of shape (num_names, max_name_length, alphabet_size), initialized with zeros.
alphabet_size = 26
names = df['Name'].values
genders = df['Gender']
X = np.zeros((num_names, max_name_length, alphabet_size))

# For each character position in each name, set a 1 in the column that represents the character
for i, name in enumerate(names):
    name = name.lower()
    for t, char in enumerate(name):
        X[i, t, char_to_int[char]] = 1


In [None]:
# Let's look at the first name.
# Every name is encoded as a matrix of the same size, max_name_length x 26 (15 x 26 here).
# For a 4-letter name such as 'Mary', only the first 4 rows contain a 1;
# the rest of the rows are all zeros.

print('first name is: ', names[0])
X[0,:,:]

In [None]:
# Now let's encode the gender in a NumPy array Y: 0 for 'F', 1 otherwise
Y = np.ones((num_names,1))
Y[df['Gender'] == 'F',0] = 0

### Training job setup
The exercise above only showed how to create the input and target 
for the model. We will not train in this notebook instance; instead, we will 
submit a training job to SageMaker.

In [None]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

In [None]:
inputs = sagemaker_session.upload_data(path='namesdata', key_prefix='namesdata')

# todo draw a picture of the neural network

In [None]:
from sagemaker.tensorflow import TensorFlow

gender_estimator = TensorFlow(entry_point='highlevel-tensorflow-helper.py',
                              role=role,
                              training_steps=4000,
                              evaluation_steps=10,
                              hyperparameters={'learning_rate': 0.01},
                              train_instance_count=1,
                              train_instance_type='ml.p2.xlarge',
                              base_job_name='tf-names')

gender_estimator.fit(inputs, run_tensorboard_locally=True)

In [None]:
gender_predictor = gender_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

In [None]:
sagemaker.Session().delete_endpoint(gender_predictor.endpoint)

In [None]:
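import json

# Build a sample request payload for a single name
data = {}
data['name'] = 'pratap'
json_data = json.dumps(data)

# Parse a sample JSON document containing several names
json_obj = json.loads('{"names": {"name1":"pratap","name2":"swetha"}}')
print(json_obj['names'])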

In [None]:
!rm -f output.json
!aws sagemaker-runtime invoke-endpoint --endpoint-name tensorboard-names-2018-03-20-22-40-47-154 --body '{"name":"swetha"}' --content-type "application/json" output.json
! cat output.json
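
The same request can also be sent from Python. The sketch below assumes a client object that exposes `invoke_endpoint` with the keyword arguments used by the boto3 `sagemaker-runtime` client, and that the endpoint accepts the `{"name": ...}` JSON body shown in the CLI call. `predict_gender` is a hypothetical helper; the response format depends on the deployed model, so the raw body is returned without further parsing.

```python
import json


def predict_gender(client, endpoint_name, name):
    """Send one name to the endpoint and return the raw response body as text."""
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"name": name}),
    )
    body = response["Body"].read()
    # The body is normally bytes; decode it so it can be printed or parsed.
    return body.decode("utf-8") if isinstance(body, bytes) else body
```

With AWS credentials configured, `client` would typically be `boto3.client("sagemaker-runtime")` and `endpoint_name` the name of the deployed endpoint.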

In [None]:
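from sagemaker.tensorflow import TensorFlowPredictor

# Attach to an existing endpoint by name, then delete it.
# delete_endpoint expects the endpoint name, not the predictor object.
predictor = TensorFlowPredictor('tensorflowgendermodel571')
sagemaker.Session().delete_endpoint(predictor.endpoint)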
