{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Binary Classification with Amazon Machine Learning (Learning from Disaster)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "Copyright [2017]-[2017] Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at\n", "\n", "http://aws.amazon.com/apache2.0/\n", "\n", "or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n", "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Description: \n", "This demo is based on the popular [\"Titanic: Machine Learning from Disaster\"](https://www.kaggle.com/c/titanic) Kaggle competition. I highly recommend creating an account and taking the full challenge as described on the competition page. \n", "\n", "From the Kaggle website: \n", "The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.\n", "\n", "One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. 
Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.\n", "\n", "We use the provided sample dataset of passengers on board the ship to train a model with Amazon Machine Learning and predict survival chances with the real-time predictions endpoint.\n", "\n", "Helpful resources:\n", "[Amazon Machine Learning – Make Data-Driven Decisions at Scale](https://aws.amazon.com/blogs/aws/amazon-machine-learning-make-data-driven-decisions-at-scale/) \n", "[Building a Binary Classification Model with Amazon Machine Learning and Amazon Redshift](https://blogs.aws.amazon.com/bigdata/post/TxGVITXN9DT5V6/Building-a-Binary-Classification-Model-with-Amazon-Machine-Learning-and-Amazon-R)\n", "\n", "Comments:\n", "This is meant to be an asynchronous demo: quickly walk through the deployment sections, run all cells above the real-time prediction section, and come back to the results after about 20 minutes.\n", "\n", "Estimated cost: The maximum cost for a single run would be $0.42. The Data Analysis and Model Building cost depends on the size of the input data, the number of attributes within it, and the number and types of transformations applied. As this is a very small-scale demo, the cost should be even lower.\n", "The cost of real-time predictions is negligible.\n", "\n", "**Prerequisites:**\n", "\n", "The user or role that executes the commands must have permissions in AWS Identity and Access Management (IAM) to perform those actions. AWS provides a set of managed policies that help you get started quickly. 
For our example, you need to apply the following minimum managed policies to your user or role:\n", "\n", "* AmazonMachineLearningFullAccess \n", "* AmazonS3FullAccess \n", "\n", "Be aware that we recommend you follow AWS IAM best practices for production implementations, which are out of scope for this workshop.\n", "\n", "\n", "Demo Author: Stas Vonholsky" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random, os, string, urllib, json, time, sys\n", "from boto3.session import Session\n", "\n", "# Generate session prefix\n", "demo_pre = \"ipydemo-\"\n", "N = 4\n", "random_pre = ''.join(random.SystemRandom().choice(string.ascii_lowercase) for _ in range(N)) + \"-\"\n", "resource_pre = demo_pre + random_pre\n", "print (\"\\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\")\n", "print (\"Demo session ID: \" + resource_pre[:-1])\n", "print (\"@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\")\n", "\n", "# Fetch relevant metadata\n", "work_dir = \"resources/\"\n", "meta_data_pre = \"http://169.254.169.254/latest/meta-data/\"\n", "\n", "region = 'eu-west-1'\n", " \n", "session = Session(region_name=region)\n", "s3 = session.client('s3')\n", "ml = session.client('machinelearning')\n", "\n", "#File paths\n", "train_data = \"train.csv\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Create bucket and upload training data\n", "s3_bucket = resource_pre + 'aml-bucket'\n", "s3.create_bucket(Bucket=s3_bucket,CreateBucketConfiguration={'LocationConstraint': region},)\n", "print (\"Created S3 bucket: \" + s3_bucket)\n", "\n", "path_to_train_data = work_dir + train_data\n", "!aws s3 cp $path_to_train_data s3://$s3_bucket\n", "\n", "#Grant read permissions to AML on Bucket\n", "s3_bucket_policy = \"\"\"{\n", " \"Version\": \"2012-10-17\",\n", " \n", " \"Statement\": [\n", " {\n", " \"Sid\": \"AmazonML_s3:ListBucket\",\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": 
\"machinelearning.amazonaws.com\"\n", " },\n", " \"Action\": \"s3:ListBucket\",\n", " \"Resource\": \"arn:aws:s3:::\"\"\"+ s3_bucket +\"\"\"\"\n", " },\n", " {\n", " \"Sid\": \"AmazonML_s3:GetObject:PutObject\",\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": \"machinelearning.amazonaws.com\"\n", " },\n", " \"Action\": [\"s3:GetObject\",\"s3:PutObject\"],\n", " \"Resource\": \"arn:aws:s3:::\"\"\"+ s3_bucket +\"\"\"/*\"\n", " }\n", " ]\n", "}\"\"\"\n", "trust_policy = json.loads(s3_bucket_policy)\n", "trust_policy = json.dumps(s3_bucket_policy) \n", " \n", "response = s3.put_bucket_policy(\n", " Bucket = s3_bucket,\n", " Policy = s3_bucket_policy\n", ")\n", "\n", "print (\"Bucket policy set\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Sample of the data:\n", "sample = \"\"\"\n", "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n", "1,0,3,\"Braund, Mr. Owen Harris\",male,22,1,0,A/5 21171,7.25,,S\n", "2,1,1,\"Cumings, Mrs. John Bradley (Florence Briggs Thayer)\",female,38,1,0,PC 17599,71.2833,C85,C\n", "3,1,3,\"Heikkinen, Miss. Laina\",female,26,0,0,STON/O2. 3101282,7.925,,S\n", "4,1,1,\"Futrelle, Mrs. 
Jacques Heath (Lily May Peel)\",female,35,1,0,113803,53.1,C123,S\"\"\"\n", " \n", "data_schema = \"\"\"{\n", " \"version\": \"1.0\",\n", " \"recordAnnotationFieldName\": \"PassengerId\",\n", " \"targetFieldName\": \"Survived\",\n", " \"dataFormat\": \"CSV\",\n", " \"dataFileContainsHeader\": true,\n", " \"attributes\": [ \n", " { \"fieldName\": \"PassengerId\", \"fieldType\": \"CATEGORICAL\" },\n", " { \"fieldName\": \"Survived\", \"fieldType\": \"BINARY\"},\n", " { \"fieldName\": \"Pclass\", \"fieldType\": \"CATEGORICAL\"},\n", " { \"fieldName\": \"Name\", \"fieldType\": \"TEXT\"},\n", " { \"fieldName\": \"Sex\", \"fieldType\": \"CATEGORICAL\" },\n", " { \"fieldName\": \"Age\", \"fieldType\": \"NUMERIC\"},\n", " { \"fieldName\": \"SibSp\", \"fieldType\": \"NUMERIC\" },\n", " { \"fieldName\": \"Parch\", \"fieldType\": \"NUMERIC\" }, \n", " { \"fieldName\": \"Ticket\", \"fieldType\": \"TEXT\"},\n", " { \"fieldName\": \"Fare\", \"fieldType\": \"NUMERIC\"},\n", " { \"fieldName\": \"Cabin\", \"fieldType\": \"CATEGORICAL\"},\n", " { \"fieldName\": \"Embarked\", \"fieldType\": \"CATEGORICAL\" }\n", " ]\n", "}\"\"\"\n", "\n", "#We are pre-splitting the data into 2 parts: one is used for training the model \n", "#and the other for the evaluation (test). 
For more information on splitting data sets, see:\n", "#http://docs.aws.amazon.com/machine-learning/latest/dg/splitting-types.html\n", "\n", "#Training data\n", "train_data_split = \"\"\"{\"splitting\": {\"percentBegin\": 0, \"percentEnd\": 70}}\"\"\"\n", "train_data_source_id = resource_pre + 'train-data-titanic-survival'\n", "response = ml.create_data_source_from_s3(\n", " DataSourceId = train_data_source_id,\n", " DataSourceName = train_data_source_id,\n", " DataSpec={\n", " 'DataLocationS3': 's3://' + s3_bucket + \"/\" + train_data,\n", " 'DataSchema': data_schema,\n", " 'DataRearrangement' : train_data_split\n", " },\n", " ComputeStatistics = True\n", ")\n", "#Evaluation data (starts at 70 so that no records are dropped between the two splits)\n", "evaluation_data_split = \"\"\"{\"splitting\": {\"percentBegin\": 70, \"percentEnd\": 100}}\"\"\"\n", "evaluation_data_source_id = resource_pre + 'evaluate-data-titanic-survival'\n", "response = ml.create_data_source_from_s3(\n", " DataSourceId = evaluation_data_source_id,\n", " DataSourceName = evaluation_data_source_id,\n", " DataSpec={\n", " 'DataLocationS3': 's3://' + s3_bucket + \"/\" + train_data,\n", " 'DataSchema': data_schema,\n", " 'DataRearrangement' : evaluation_data_split\n", " },\n", " ComputeStatistics = True\n", ")\n", "\n", "print (\"Creating datasource and computing stats.. 
This will take a couple of minutes\")\n", "sys.stdout.flush()\n", "waiter = ml.get_waiter('data_source_available')\n", "waiter.wait(FilterVariable='Name', EQ=train_data_source_id, Limit=1)\n", "waiter = ml.get_waiter('data_source_available')\n", "waiter.wait(FilterVariable='Name', EQ=evaluation_data_source_id, Limit=1)\n", "print (\"Datasource created!\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_id = resource_pre + 'ml-model-titanic-survival'\n", "\n", "# Creating the model\n", "response = ml.create_ml_model(\n", " MLModelId=model_id,\n", " MLModelName=model_id,\n", " MLModelType='BINARY',\n", " TrainingDataSourceId=train_data_source_id\n", ")\n", "print (\"Generating model.. This will take a couple of minutes\")\n", "sys.stdout.flush()\n", "waiter = ml.get_waiter('ml_model_available')\n", "waiter.wait(FilterVariable='Name',EQ=model_id,Limit=1)\n", "print (\"Model created!\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Creating an evaluation of the ML model\n", "evaluation_id = resource_pre + 'evaluation-titanic-survival'\n", "response = ml.create_evaluation(\n", " EvaluationId = evaluation_id,\n", " EvaluationName = evaluation_id,\n", " MLModelId = model_id,\n", " EvaluationDataSourceId = evaluation_data_source_id\n", ")\n", "\n", "print (\"Evaluating model.. 
This will take a couple of minutes\")\n", "sys.stdout.flush()\n", "waiter = ml.get_waiter('evaluation_available')\n", "waiter.wait(FilterVariable='Name',EQ=evaluation_id,Limit=1)\n", "print (\"Evaluation complete!\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = ml.describe_evaluations(\n", " FilterVariable='Name',\n", " EQ=evaluation_id\n", ")\n", "print (\"Out-of-the-box performance (Binary AUC):\")\n", "print (response['Results'][0]['PerformanceMetrics']['Properties']['BinaryAUC'])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are building a web service that predicts your chances of survival on the Titanic, what would you keep a very close eye on? \n", "(a) true positives \n", "(b) true negatives \n", "(c) false positives \n", "(d) false negatives \n", "\n", "Explore the performance of the model and optionally edit the score threshold:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Enabling real time predictions\n", "for i in range(6):\n", "    response = ml.create_realtime_endpoint(MLModelId=model_id)\n", "    print (response['RealtimeEndpointInfo']['EndpointStatus'])\n", "    if response['RealtimeEndpointInfo']['EndpointStatus'] == 'READY':\n", "        break\n", "    time.sleep(20)\n", "predict_endpoint = response['RealtimeEndpointInfo']['EndpointUrl']\n", "time.sleep(20) #Waiting for endpoint to become active\n", "print (\"Realtime prediction endpoint: \" + predict_endpoint)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Simple predict function\n", "def predict(passenger):\n", " response = ml.predict(\n", " MLModelId=model_id,\n", " Record=passenger,\n", " PredictEndpoint=predict_endpoint\n", " )\n", " print (\"Passenger ID: \" + passenger['PassengerId'])\n", " print (\"Predicted Label: \" + 
str(response['Prediction']['predictedLabel']))\n", " print (\"Predicted Score: \" + str(response['Prediction']['predictedScores']))\n", " print (\"Predictive Model Type: \" + str(response['Prediction']['details']['PredictiveModelType']))\n", " print (\"Algorithm: \" + str(response['Prediction']['details']['Algorithm']))\n", " print (\"\")\n", "\n", "passenger1 = {\"PassengerId\":\"1\", \"Pclass\":\"3\", \"Name\":\"Jack\", \"Sex\":\"male\",\"Age\":\"20\",\"Fare\":\"10\"}\n", "predict(passenger1)\n", "\n", "# Let's keep the sex as male, but change the class, name, age and fare, and add the number of siblings onboard.\n", "passenger2 = {\"PassengerId\":\"3\", \"Pclass\":\"1\", \"Name\":\"Dr. Jack\", \"Sex\":\"male\",\"Age\":\"60\",\"Fare\":\"60\",\"SibSp\":\"3\"}\n", "predict(passenger2)\n", "#The predicted score should be much higher.\n", "\n", "# Let's change the sex to female and class to first class\n", "passenger3 = {\"PassengerId\":\"2\", \"Pclass\":\"1\", \"Name\":\"Jacklin\", \"Sex\":\"female\",\"Age\":\"30\",\"Fare\":\"10\"}\n", "predict(passenger3)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of manipulating the values above, try out the real-time predictions in the UI:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################################################################################################\n", "### Clean up environment\n", "######################################################################################################\n", "\n", "response = s3.delete_object(Bucket=s3_bucket,Key=train_data)\n", "response = s3.delete_bucket(Bucket=s3_bucket)\n", "print (\"Deleted S3 bucket: \" + s3_bucket)\n", "\n", "response = ml.delete_realtime_endpoint(MLModelId=model_id)\n", "response = ml.delete_evaluation(EvaluationId=evaluation_id)\n", "response = ml.delete_ml_model(MLModelId=model_id)\n", "print (\"Deleted Realtime prediction endpoint, evaluation and ML 
model\")\n", "\n", "response = ml.delete_data_source(DataSourceId=train_data_source_id)\n", "response = ml.delete_data_source(DataSourceId=evaluation_data_source_id)\n", "print (\"Deleted Data sources\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 1 }