{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook to build a deep learning model to predict Gender from name\n", "\n", "We will follow the following steps in this notebook.\n", "1. Download the data set\n", "2. Explore and pre-process the dataset\n", "3. Showcase the encoding (names, character-integer encoding, character-one-hot encoding)\n", "4. Submit Sagemaker training job\n", "\n", "\n", "#### Step 1: Download the data from https://www.ssa.gov/oact/babynames/names.zip\n", "When you unzip the download, you will find several files with names 'yob1880.txt'. \n", "The naming convention of this file is 'yob' stands for 'Year of Birth' and the year. \n", "Which means, each file contains the popular names of babies born in that year.\n", "\n", "We will first create a folder called data. Download and unzip the file. We will then proceed to \n", "extract the content of all those files into a single file named 'allnames.txt'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! rm -rf data\n", "! mkdir data\n", "! wget https://www.ssa.gov/oact/babynames/names.zip -P data\n", "! unzip -oq data/names.zip -d data\n", "! rm data/names.zip\n", "! rm data/NationalReadMe.pdf\n", "! mv data/yob2016.txt data/test_data.txt\n", "! cat data/yob* > data/allnames.txt\n", "! rm data/yob*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Explore and pre-process the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from numpy import genfromtxt\n", "\n", "filename = 'data/allnames.txt'\n", "df=pd.read_csv(filename, sep=',', names = [\"Name\", \"Gender\", \"Count\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets look at the data size. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 1.89M rows and 3 columns. Now lets see how the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data set has 3 columns, Name, Gender, and count. Here Count is the number of times this name was registered with the \n", "United States social security department. The names sound familiar for United states. Since we collected data\n", "from all 50 states, there might be some names that occur multiple times. Lets us check how many time Mary occurs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[df['Name'] == 'Mary'].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at sample data\n", "The name 'Mary' occurs multple times, and at the same time Mary is also \n", "listed as a Male. In the early 20th century Mary used to be a\n", "common name for boys, and it somewhat related to Mario.\n", "But, looking at the counts, Mary is much more popular \n", "as a female name than a male name. So, it is not possible to \n", "guess the gender of a person by just looking at it. \n", "\n", "The second problem is that, the name Mary appears multple times \n", "in the dataset. We will remove redundant entries. \n", "But before we remove redundant entries, we will drop the counts as \n", "we will not be using it for training." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Since we do not need the 'count' lets drop it from the dataframe\n", "df = df.drop(['Count'], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let remove duplicates\n", "df = df.drop_duplicates()\n", "\n", "#checking the presence of Mary again\n", "df.loc[df['Name'] == 'Mary']\n", "\n", "#lets shuffle the data set\n", "df = df.sample(frac=1).reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# lets find the number of rows we have now. We want to \n", "# have a reasonable number to rows to train our deep learning model\n", "num_names = df.shape[0]\n", "print ('Number of names in the training dataset', num_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Find the longest name\n", "max_name_length = (df['Name'].map(len).max())\n", "print(\"Longest name:\", max_name_length)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm -rf namesdata\n", "!mkdir namesdata" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.to_csv('namesdata/train_names.csv',index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_file = 'data/test_data.txt'\n", "df_test=pd.read_csv(test_file, sep=',', names = [\"Name\", \"Gender\", \"Count\"])\n", "df_test = df_test.drop(['Count'], axis=1)\n", "df_test.to_csv('namesdata/test_names.csv',index=False,header=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assumption\n", "Beyond this point, this model will assume that the names only contain\n", "english alphabets (26). 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assumption\n", "Beyond this point, this model assumes that names contain only the \n", "26 letters of the English alphabet. The algorithm has to be modified slightly if you \n", "use the same model for other languages.\n", "\n", "# One-hot encoding of characters\n", "We cannot feed the character symbols to the neural network as they are,\n", "so we will convert each name into a one-hot encoded sequence based on a character-to-integer mapping.\n", "\n", "First let's encode each character as an integer, and then encode the integers as one-hot vectors. \n", "In one-hot encoding, 'a' is represented as an array with the first column set to 1, 'e' with the fifth column set to 1, and so on: \n", "\n", "a => [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", "\n", "e => [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's define a dictionary to help us with character-to-integer encoding\n", "char_to_int = {'a':0,'b':1,'c':2,'d':3,'e':4,'f':5,'g':6,'h':7,'i':8,'j':9,'k':10,'l':11,'m':12,'n':13,'o':14,'p':15,'q':16,'r':17,'s':18,'t':19,'u':20,'v':21,'w':22,'x':23,'y':24,'z':25}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# X, the input to the neural network, is a 3D numpy array\n", "# initialized with zeros\n", "alphabet_size = 26\n", "names = df['Name'].values\n", "genders = df['Gender']\n", "X = np.zeros((num_names, max_name_length, alphabet_size))\n", "\n", "# for each character of each name, set a 1 in the column that represents that character\n", "for i, name in enumerate(names):\n", "    name = name.lower()\n", "    for t, char in enumerate(name):\n", "        X[i, t, char_to_int[char]] = 1\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's look at the first name\n", "# every name is encoded with the same shape, max_name_length x 26.\n", "# Only the rows for the characters that are present are set;\n", "# the remaining rows are all zeros\n", "\n", "print('first name is:', names[0])\n", "X[0,:,:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now let's encode the gender in a numpy array Y (1 for male, 0 for female)\n", "Y = np.ones((num_names,1))\n", "Y[df['Gender'] == 'F',0] = 0"
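] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick check that the encoding is reversible, here is a small sketch that decodes a one-hot encoded name back into text using the reverse of the char_to_int mapping (decode_name is a hypothetical helper defined only for this check)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: decode a one-hot encoded name back into text as a sanity check\n", "int_to_char = {v: k for k, v in char_to_int.items()}\n", "\n", "def decode_name(encoded):\n", "    # rows that are all zeros are padding; every other row is one-hot for a single character\n", "    return ''.join(int_to_char[int(np.argmax(row))] for row in encoded if row.any())\n", "\n", "# should print names[0] in lowercase\n", "print(decode_name(X[0, :, :]))"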
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training job setup\n", "The above exercise was only to show you how to create the input and target \n", "for the model. We will not be training in this notebook instance; instead we will \n", "submit a training job to SageMaker." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "sagemaker_session = sagemaker.Session()\n", "\n", "role = get_execution_role()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs = sagemaker_session.upload_data(path='namesdata', key_prefix='namesdata')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# todo draw a picture of the neural network" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tensorflow import TensorFlow\n", "\n", "gender_estimator = TensorFlow(entry_point='highlevel-tensorflow-helper.py',\n", "                              role=role,\n", "                              training_steps=4000,\n", "                              evaluation_steps=10,\n", "                              hyperparameters={'learning_rate': 0.01},\n", "                              train_instance_count=1,\n", "                              train_instance_type='ml.p2.xlarge',\n", "                              base_job_name='tf-names')\n", "\n", "gender_estimator.fit(inputs, run_tensorboard_locally=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gender_predictor = gender_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# clean up: delete the endpoint when you are done with it\n", "sagemaker.Session().delete_endpoint(gender_predictor.endpoint)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# quick check of the JSON request format\n", "data = {}\n", "data['name'] = 'pratap'\n", "json_obj = json.loads('{\"names\": {\"name1\":\"pratap\",\"name2\":\"swetha\"}}')\n", "json_data = json.dumps(data)\n", "print(json_obj['names'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm -f output.json\n", "!aws sagemaker-runtime invoke-endpoint --endpoint-name tensorboard-names-2018-03-20-22-40-47-154 --body '{\"name\":\"swetha\"}' --content-type \"application/json\" output.json\n", "! cat output.json" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tensorflow import TensorFlowPredictor\n", "predictor = TensorFlowPredictor('tensorflowgendermodel571')\n", "sagemaker.Session().delete_endpoint(predictor.endpoint)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow_p36", "language": "python", "name": "conda_tensorflow_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }