{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gender Prediction from name, using Deep Learning\n",
"\n",
"Deep Neural Networks can be used to extract features in the input and derive higher level abstractions. This technique is used regularly in vision, speech and text analysis. In this exercise, we build a deep learning model that would identify low level features in texts containing people's names, and would be able to classify them in one of two categories - Male or Female.\n",
"\n",
"## Recurrent Neural Networks and Long Short Term Memory\n",
"Since we have to process sequence of characters, Recurrent Neural Netwrosk are a good fit for this problem. Whenever we have to persist learning from data previously seen, traditional Neural Networks fail. Recurrent Neural Networks contains loops in the graph, that allows them to persist data in memory. Effective the loops facilitate passing multiple copies of information to be passed on to next step.\n",
" \n",
" \n",
" \n",
" \n",
"Recurrent Neural Network - Loops (expand to view diagram)
LSTM - Chains (expand to view diagram)
\n", "We test our assumption on duplicate entries, by taking any name, e.g. Mary, as example, and filtering all records containing that name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[df['Name'] == 'Mary'].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice here, that same name, `Mary` has been used both as Male and Female name. This might actually throw the model off, and affect it's accuracy.\n", "\n", "To remediate this scenario, notice that some name are more popular as Female names, and some are more opular as Male names.\n", "\n", "Run the same experiment as above with another name, such as `John`, and notice that occurence of this name in male population is more." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[df['Name'] == 'John'].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also observe that even though some names are used both as Male and Female names, they are more commonly used for one gender than the other. For example, `Mary` is more common as male name, whereas `John` is more common as male name, as we saw above.
\n", "Since the model we'll be building needs to map each name to specifically one gender, without loss of generality, we can prepare our training data set to have a fixed marker - `M` or `F` on any particular name." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data cleanup\n", "We'll remediate the solution using following approach:\n", "* Order the names by Name and Gender\n", "* Add the count for each group of unique Name-Gender combination\n", "* Iterate through the unique groups, and where a name is used for both Male and Female, choose to retain the entry with higher count\n", "* Create a new clean data frame containing only unique records mapping each name to a single gender" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped_df = df.groupby( [ \"Name\", \"Gender\"] ).apply(lambda x: x.Count.sum()).to_frame()\n", "grouped_df.columns = ['Count']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the data is ordered by Name and Gender into a new frame, notice that the new frame contain the Name and Gender as index, and the total count of occurences as values.
\n", "We therefore create a dictionary that will have the Name as keys and gender (with higher sum count) as values.
\n", "We loop through the indexes of the grouped data frame and populate the entries into this dictionary following the logic as described above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "names={}\n", "for i in range(len(grouped_df.index.values)):\n", " #print(grouped_df.index[i][0] + \", \" + grouped_df.index[i][1] + \", \" + str(grouped_df.values[i][0]))\n", " if i > 0 and grouped_df.index[i][0] == grouped_df.index[i-1][0]:\n", " if grouped_df.values[i][0] > grouped_df.values[i-1][0]:\n", " names[grouped_df.index[i][0]] = grouped_df.index[i][1]\n", " else:\n", " names[grouped_df.index[i][0]] = grouped_df.index[i-1][1]\n", " else:\n", " names[grouped_df.index[i][0]] = grouped_df.index[i][1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the dictionary is populated, we create a clean data frame using the keys and values as coulmns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clean_df = df = pd.DataFrame(list(names.items()), columns=['Name', 'Gender']).sample(frac=1).reset_index(drop=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the cleaned up data only has unique records, and that it has single entries for the names - `Mary` and `John`, uniquely mapped to one gender." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(clean_df.shape)\n", "print(clean_df.loc[clean_df['Name'] == 'Mary'])\n", "print(clean_df.loc[clean_df['Name'] == 'John'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we shuffle the data and save the clean data into a file, which we'll also use in subsequent phases of model training" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p data\n", "clean_df.to_csv('data/name-gender.txt',index=False,header=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data preparation\n", "As you'll see in the notebook where we orchestrate a pipeline to train, deploy and host the model, the container you create will need access to data on an S3 bucket.
\n", "In order to prepare for the next step therefore, we'll do some pre-work here and upload the cleaned data to the S3 bucket that you created in module-1 of the workshop.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can obtain the name of the S3 bucket from the execution role we attached to this Notebook instance. This should work if the policies granting read permission to IAM policies was granted, as per the documentation.\n", "\n", "If for some reason, it fails to fetch the associated bucket name, it asks the user to enter the name of the bucket. If asked, use the bucket that you created in Module-3, such as 'smworkshop-firstname-lastname'.
\n", " \n", "It is important to ensure that this is the same S3 bucket, to which you provided access in the Execution role used while creating this Notebook instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sts = boto3.client('sts')\n", "iam = boto3.client('iam')\n", "\n", "\n", "caller = sts.get_caller_identity()\n", "account = caller['Account']\n", "arn = caller['Arn']\n", "role = arn[arn.find(\"/AmazonSageMaker\")+1:arn.find(\"/SageMaker\")]\n", "timestamp = role[role.find(\"Role-\")+5:]\n", "policyarn = \"arn:aws:iam::{}:policy/service-role/AmazonSageMaker-ExecutionPolicy-{}\".format(account, timestamp)\n", "\n", "s3bucketname = \"\"\n", "policystatements = []\n", "\n", "try:\n", " policy = iam.get_policy(\n", " PolicyArn=policyarn\n", " )['Policy']\n", " policyversion = policy['DefaultVersionId']\n", " policystatements = iam.get_policy_version(\n", " PolicyArn = policyarn, \n", " VersionId = policyversion\n", " )['PolicyVersion']['Document']['Statement']\n", "except Exception as e:\n", " s3bucketname=input(\"Which S3 bucket do you want to use to host training data and model? \")\n", " \n", "for stmt in policystatements:\n", " action = \"\"\n", " actions = stmt['Action']\n", " for act in actions:\n", " if act == \"s3:ListBucket\":\n", " action = act\n", " break\n", " if action == \"s3:ListBucket\":\n", " resource = stmt['Resource'][0]\n", " s3bucketname = resource[resource.find(\":::\")+3:]\n", "\n", "print(s3bucketname)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have our bucket name, we upload the data file under `/data/` prefix. This is the location we'll use during the final step, when we containerize and run the training. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.resource('s3')\n", "s3.meta.client.upload_file('data/name-gender.txt', s3bucketname, 'data/name-gender.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, we can clean up some space by deleting the raw data file.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm -rf data/allnames.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature representation\n", "Before we start buiding the model, we need to represent the data in a format that we can feed into the LSTM model that we'll be creating.
\n", "Although we already have the cleaned data loaded as a data frame, let's load the data fresh from the S3 location. That way we'll know for sure that our cleaned data is of good quality." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "localfilename = \"data/data.csv\"\n", "try:\n", " s3.Bucket(s3bucketname).download_file('data/name-gender.txt', localfilename)\n", "except botocore.exceptions.ClientError as e:\n", " if e.response['Error']['Code'] == \"404\":\n", " print(\"The object does not exist.\")\n", " else:\n", " raise\n", "data=pd.read_csv(localfilename, sep=',', names = [\"Name\", \"Gender\"])\n", "data = shuffle(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do a quick check on the record, and vaildate that we have the same number of records as we saved into the file after cleaning.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "7f8ad84a-6f22-4c8f-8d1d-f4a41192892e" } }, "outputs": [], "source": [ "#number of names\n", "num_names = data.shape[0]\n", "print(num_names)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "We need to convert the names into numeric arrays, usingone-hot encoding scheme. \n", "The length of the arrays representing the names need to be as long as the longest name record we have.\n", "Therefore we check for the longest name length and have it in a variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# length of longest name\n", "max_name_length = (data['Name'].map(len).max())\n", "print(max_name_length)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a first step of feature engineering we extract all names as an array, and derive the set of alphabets used in the names.
\n", "The way we choose to do so, is to concatenate all characters into one string, and then serive a `set`. By definition, a `set` in Python would contain only unique charatcers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "14cb65ea-da38-4469-908a-a21c49bbba16" } }, "outputs": [], "source": [ "names = data['Name'].values\n", "txt = \"\"\n", "for n in names:\n", " txt += n.lower()\n", "print(len(txt))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we apply a `set` operation, we derive as many characters as there are alphabets in English language, as expected." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "efa1c69b-43a7-489e-90d0-c9b4b98e5888" } }, "outputs": [], "source": [ "chars = sorted(set(txt))\n", "alphabet_size = len(chars)\n", "print('Alphabet size:', len(chars))\n", "print(chars)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order for one-hot encoding to work, we nned to assign index values to each of these characters.
\n", "Since we have all alphabets `a` to `z`, the most natural index would be to just assign sequential values.
\n", "We create a Python `dictionary` with the character indices" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "char_indices = dict((str(chr(c)), i) for i, c in enumerate(range(97,123)))\n", "alphabet_size = 123-97\n", "for key in sorted(char_indices.keys()):\n", " print(\"%s: %s\" % (key, char_indices[key]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we also need to somehow store the maximum length of a name record to be used later when we containerize our training and inference, as a good practice, let's also store that value as another entry into the same `dictionary`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "char_indices['max_name_length'] = max_name_length" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One hot encoded array would be of dimension `n` `*` `m` `*` `a`, where :\n", "* `n` = Number of name records, \n", "* `m` = Maximum length of a record, and \n", "* `a` = Size of alphabet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each of the `n` name records would be represented by 2-dimensional matrix of fixed size.
\n", "This matrix would have number of rows equal to the maximum length of a name record.
\n", "Each row would be of size equal to the alphabet size.
\n", "For each position of a character in a given name, a row of this 2-dimensinal matrix would be either all zeroes (if no alphabets present in the corresponding position), or a row vector with a `1` in the position of the alphabet indicated in the index (and zeroes in other positions). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, the name `Mary` would look like (note we ignore case by convertin names to lower case)
\n",
"m => [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
\n",
"a => [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
\n",
"r => [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
\n",
"y => [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]"
]
},
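{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, purely illustrative sanity check (assuming `numpy` is already imported as `np`, as it is used elsewhere in this notebook), we can one-hot encode the single name `mary` by hand, using the `char_indices` dictionary defined above, and confirm that its non-zero rows match the vectors shown:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check (not part of the training pipeline):\n",
"# encode the single name 'mary' and print only its populated rows\n",
"example = 'mary'\n",
"example_matrix = np.zeros((max_name_length, alphabet_size))\n",
"for t, char in enumerate(example):\n",
"    example_matrix[t, char_indices[char]] = 1\n",
"print(example_matrix[:len(example), :].astype(int))"
]
},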
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We begin the encoding by taking a tensor containing all zeroes. Observe the dimensions matches the above description."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = np.zeros((num_names, max_name_length, alphabet_size))\n",
"print(X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we iterate through each character in each name records and selective turn the matching elements (as in the character index) to ones."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for i,name in enumerate(names):\n",
" name = name.lower()\n",
" for t, char in enumerate(name):\n",
" X[i, t,char_indices[char]] = 1\n",
"X[0,:,:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Machine learning algorithms do not work well when data has too much skewness.
\n", "So, let us validate tjhat both genders are somewhat equally represented in the training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "2b85f255-be7a-4caa-b4f5-cf0eca3072cc" } }, "outputs": [], "source": [ "data['Gender'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the `X` variables of training data one-hot encoded, it is time to encode the traget `Y` variable.
\n", "To do so, we simply create a column vector with zeroes representing Female and ones represnting Male." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "7a7851c4-dd03-42a2-90e7-edfaea3c87cd" } }, "outputs": [], "source": [ "Y = np.ones((num_names,2))\n", "Y[data['Gender'] == 'F',0] = 0\n", "Y[data['Gender'] == 'M',1] = 0\n", "Y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One last check to ensure that dimensions of `X` and `Y` are compatible." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "6ba83249-f57a-458d-bd22-04891354cec5" } }, "outputs": [], "source": [ "print(X.shape)\n", "print(Y.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "1add41ec-20c9-4e24-8b60-87b1034e6f4c" } }, "outputs": [], "source": [ "data_dim = alphabet_size\n", "timesteps = max_name_length\n", "num_classes = 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model building\n", "We build a stacked LSTM network with a final dense layer with softmax activation (many-to-one setup).
\n", "Categorical cross-entropy loss is used with adam optimizer.
\n", "A 20% dropout layer is added for regularization to avoid over-fitting. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "34f06cf2-7beb-4fb8-9627-e9e029a17256" } }, "outputs": [], "source": [ "model = Sequential()\n", "model.add(LSTM(512, return_sequences=True, input_shape=(timesteps, data_dim)))\n", "model.add(Dropout(0.2))\n", "model.add(LSTM(512, return_sequences=False))\n", "model.add(Dropout(0.2))\n", "model.add(Dense(num_classes))\n", "model.add(Activation('sigmoid'))\n", "\n", "model.compile(loss='categorical_crossentropy', \n", " optimizer='adam',\n", " metrics=['accuracy'])" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Model training\n", "We train this model for for just 1 epoch, as a trial, with a batch size of 128. Too large a batch size may result in out of memory error.
\n", "During training we designate 20% of training data (randomly chosen) to be used as validation data. Validation is never presented to the model during training, instead used to ensure that the model works well with data that it has never seen.
\n", "This confirms we are not over-fitting, that is the model is not simply memoriziing the dat it sees, and that it can generalize it's learning." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "28301990-bd58-40da-8600-99511131e2b1" } }, "outputs": [], "source": [ "model.fit(X, Y, validation_split=0.20, epochs=1, batch_size=128)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After training for only 1 epoch, if everything goes well, you should see about 79% of accuracy, both over training and validation data, which is a pretty good result in itself.\n", "\n", "During the orchestration phase, we'll attempt to increase the accuracy by training for more epochs. The advantage in doing so is that, potentially costly training operation will be offloaded to SageMaker managed infrastructure. We can choose higher instance type for hosted training, and not worry about cost overrun, because SageMaker automatically provisions the training infrastructure and tears just right after training finishes.\n", "\n", "This allows us to choose cheaper instance type, as we did, for the Notebook instance itself." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model testing\n", "To test the accuracy of the model, we now invoke the model locally, and pass it a comma separated list of names.
\n", "Same data formatting, as we did previously on training data (one-hot encoding using the same character indices) would be needed here as well.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "23fb2c07-a8ac-4675-95b2-8be5924ad8af" } }, "outputs": [], "source": [ "names_test = [\"Tom\",\"Allie\",\"Jim\",\"Sophie\",\"John\",\"Kayla\",\"Mike\",\"Amanda\",\"Andrew\"]\n", "num_test = len(names_test)\n", "\n", "X_test = np.zeros((num_test, max_name_length, alphabet_size))\n", "\n", "for i,name in enumerate(names_test):\n", " name = name.lower()\n", " for t, char in enumerate(name):\n", " X_test[i, t,char_indices[char]] = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We feed this one-hot encoded test data to the model, and the `predict` generates a vector, similar to the training labels vector we used before. Except in this case, it contains what model thinks the gender represnted by each of the test records.
\n", "To present data intutitively, we simply map it back to `Male` / `Female`, from the `0` / `1` flag." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions = model.predict(X_test)\n", "\n", "for i,name in enumerate(names_test):\n", " print(\"{} ({})\".format(names_test[i],\"M\" if predictions[i][0]>predictions[i][1] else \"F\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick glance at the result indicates that our model did a pretty good job in (almost) correctly identifying the gender of the test subjects, based on the provided names." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model saving\n", "Our job is done, we satisfied ourselves that the scheme works, and that we have a somewhat useful model that we can use to predict the gender of people from their names.
\n", "In order to orchestrate the ML pipeline however, we need to confirm that the model can be saved and loaded from disk, and still be able to generate same predictions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have to save the model file (containing the weights), and the character indices (including the length of maximum name).
\n", "This is why we saved the maximum name length as another entry into the dictionary of characters, so that we can load both at the same time.
\n", "Note however that, using this scheme, our ability to generate prediction is limited to the name of length upto the maximum length of names among the training set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.save('GenderLSTM.h5')\n", "char_indices['max_name_length'] = max_name_length\n", "np.save('GenderLSTM.npy', char_indices) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Subsequently we load the saved model from the files on the disk, and check to see the indices are loaded, as saved." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loaded_model = load_model('GenderLSTM.h5')\n", "loaded_char_indices = np.load('GenderLSTM.npy').item()\n", "max_name_length = loaded_char_indices['max_name_length']\n", "loaded_char_indices.pop('max_name_length', None)\n", "alphabet_size = len(loaded_char_indices)\n", "print(loaded_char_indices)\n", "print(max_name_length)\n", "print(alphabet_size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we run a similar test as we did with the freshly created model.
\n", "It should exhibit the same level of accuracy when presented with any previously unseen names." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "names_test = [\"Tom\",\"Allie\",\"Jim\",\"Sophie\",\"John\",\"Kayla\",\"Mike\",\"Amanda\",\"Andrew\"]\n", "num_test = len(names_test)\n", "\n", "X_test = np.zeros((num_test, max_name_length, alphabet_size))\n", "\n", "for i,name in enumerate(names_test):\n", " name = name.lower()\n", " for t, char in enumerate(name):\n", " X_test[i, t,loaded_char_indices[char]] = 1\n", "\n", "predictions = loaded_model.predict(X_test)\n", "\n", "for i,name in enumerate(names_test):\n", " print(\"{} ({})\".format(names_test[i],\"M\" if predictions[i][0]>predictions[i][1] else \"F\"))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "In the next step, we'll use a separate notebook to containerize the training and prediction code, execute the training on SageMaker using appropriate container, and host the model behind an API endpoint.
\n", "This would allow us to use the model from web-application, and put it into real use from our VoC application." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Head back to Module-3 of the workshop now, to the section titled - `Containerization`, and follow the steps described." ] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow_p36", "language": "python", "name": "conda_tensorflow_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }