{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Immersion Day\n", "\n", "This notebook will serve as a template for the overall process of taking a non ideal time series dataset and integrating it into [Amazon Forecast](https://aws.amazon.com/forecast/).\n", "\n", "## Overview\n", "\n", "1. Introduction to Amazon Forecast\n", "1. Obtaining Your Data\n", "1. Fitting the Data to Forecast\n", "1. Determining Your Forecast Horizon (1st pass)\n", "1. Building Your Predictors\n", "1. Visualizing Predictors\n", "1. Making Decisions\n", "1. Next Steps\n", "\n", "\n", "## Introduction to Amazon Forecast\n", "\n", "If you are not familiar with Amazon Forecast you can learn more about this tool on these pages:\n", "\n", "* [Product Page](https://aws.amazon.com/forecast/)\n", "* [GitHub Sample Notebooks](https://github.com/aws-samples/amazon-forecast-samples)\n", "* [Product Docs](https://docs.aws.amazon.com/forecast/latest/dg/what-is-forecast.html)\n", "\n", "\n", "## Obtaining Your Data\n", "\n", "A critical requirement to use Amazon Forecast is to have access to time-series data for your selected use case. To learn more about time series data:\n", "\n", "1. [Wikipedia](https://en.wikipedia.org/wiki/Time_series)\n", "1. [Toward's Data Science Primer](https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775)\n", "1. [O'Reilly Book](https://www.amazon.com/gp/product/1492041653/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&psc=1)\n", "\n", "For this exercise, we use the individual household electric power consumption dataset. (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.) We aggregate the usage data hourly.\n", "\n", "To begin, use Pandas to read the CSV and to show a sample of the data.\n", "\n", "To begin the cell below will complete the following:\n", "\n", "1. Create a directory for the data files.\n", "1. Download the sample data into the directory.\n", "1. Extract the archive file into the directory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the data downloaded, now we will import the Pandas library as well as a few other data science tools in order to inspect the information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "from time import sleep\n", "import subprocess\n", "import pandas as pd\n", "import json\n", "import time\n", "import pprint\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from matplotlib.dates import DateFormatter\n", "import matplotlib.dates as mdates\n", "import dateutil.parser" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"../data/item-demand-time.csv\", dtype = object, names=['timestamp','value','item'])\n", "df.drop(df.loc[df['item']!='client_12'].index, inplace=True)\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice in the output above there are 3 columns of data:\n", "\n", "1. The Timestamp\n", "1. A Value\n", "1. An Item\n", "\n", "These are the 3 key required pieces of information to generate a forecast with Amazon Forecast. 
"\n", "The dataset happens to span January 01, 2014 to December 31, 2014. For our testing we would like to hold out roughly the last month of information in its own CSV for validation, and save January through October to a separate training CSV.\n", "\n", "You may notice a variable named `df`; this is a popular convention when using the Pandas library's dataframe object, which is similar to a table in a database. You can learn more here: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html\n", "\n", "The raw dataset contains information about 3 clients; let's focus on client_12 for this exercise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "jan_to_oct = df[(df['timestamp'] >= '2014-01-01') & (df['timestamp'] <= '2014-10-31')]\n", "remaining_df = df[(df['timestamp'] >= '2014-10-31') & (df['timestamp'] <= '2014-12-01')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now export them to CSV files and place them into your `data` folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "jan_to_oct.to_csv(\"../data/item-demand-time-train.csv\", header=False, index=False)\n", "remaining_df.to_csv(\"../data/item-demand-time-validation.csv\", header=False, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Uploading your training data to S3\n", "\n", "The data is now ready to be sent to S3, where Forecast will read it later. The following cell uploads the training data to S3.\n", "\n", "Please paste the bucket name and the Forecast role ARN from the Outputs section of your CloudFormation stack.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Replace these with the bucket name and role ARN from your CloudFormation outputs\n", "\n", "bucket_name = \"Replace with Your Bucket Name\"\n", "role_arn = \"Replace with your Role ARN\"\n", "role_name = role_arn.split(\"/\")[1]\n", "\n", "target_time_series_filename = \"elec_data/item-demand-time-train.csv\"\n", "\n", "boto3.Session().resource('s3').Bucket(bucket_name).Object(target_time_series_filename).upload_file(\"../data/item-demand-time-train.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Started With Forecast\n", "\n", "Now that all of the required data exists, our next step is to build the dataset groups and datasets required for our problem. Inside Amazon Forecast a DatasetGroup is an abstraction that contains all the Datasets for a particular collection of Forecasts. There is no information sharing between DatasetGroups, so if you'd like to try out various alternatives to the schemas we create below, you could create a new DatasetGroup and make your changes inside its corresponding Datasets.\n", "\n", "The order of the process below will be as follows:\n", "\n", "1. Create a DatasetGroup for our POC.\n", "1. Create a `Target-Time-Series` Dataset.\n", "1. Attach the Dataset to the DatasetGroup.\n", "1. Import the data into the Dataset.\n", "1. Train Predictors and generate Forecasts with ARIMA, Prophet, and DeepAR+.\n", "1. Query their Forecasts.\n", "1. Plot the Forecasts and metrics. \n", "\n", "\n", "At that point we can see which model is best and discuss how to add related data to our POC.\n", "\n", "The cell below defines a few global settings for our POC with the service."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DATASET_FREQUENCY = \"H\" \n", "TIMESTAMP_FORMAT = \"yyyy-MM-dd hh:mm:ss\"\n", "\n", "project = 'forecast_immersion_day'\n", "datasetName= project+'_ds'\n", "datasetGroupName= project +'_dsg'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now using the metada stored on this instance of a SageMaker Notebook determine the region we are operating in. If you are using a Jupyter Notebook outside of SageMaker simply define `region` as the string that indicates the region you would like to use for Forecast and S3.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:\n", " data = json.load(notebook_info)\n", " resource_arn = data['ResourceArn']\n", " region = resource_arn.split(':')[3]\n", "print(region)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configure your AWS APIs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = boto3.Session(region_name=region) \n", "forecast = session.client(service_name='forecast') \n", "forecast_query = session.client(service_name='forecastquery')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the DatasetGroup\n", "create_dataset_group_response = forecast.create_dataset_group(DatasetGroupName=datasetGroupName,\n", " Domain=\"CUSTOM\",\n", " )\n", "datasetGroupArn = create_dataset_group_response['DatasetGroupArn']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecast.describe_dataset_group(DatasetGroupArn=datasetGroupArn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Specify the schema of your dataset here. 
"schema = {\n", " \"Attributes\":[\n", " {\n", " \"AttributeName\":\"timestamp\",\n", " \"AttributeType\":\"timestamp\"\n", " },\n", " {\n", " \"AttributeName\":\"target_value\",\n", " \"AttributeType\":\"float\"\n", " },\n", " {\n", " \"AttributeName\":\"item_id\",\n", " \"AttributeType\":\"string\"\n", " }\n", " ]\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response=forecast.create_dataset(\n", " Domain=\"CUSTOM\",\n", " DatasetType='TARGET_TIME_SERIES',\n", " DatasetName=datasetName,\n", " DataFrequency=DATASET_FREQUENCY, \n", " Schema = schema\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_datasetArn = response['DatasetArn']\n", "forecast.describe_dataset(DatasetArn=target_datasetArn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Attach the Dataset to the Dataset Group:\n", "forecast.update_dataset_group(DatasetGroupArn=datasetGroupArn, DatasetArns=[target_datasetArn])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Finally we can import the data into the Dataset\n", "target_s3DataPath = \"s3://\"+bucket_name+\"/\"+target_time_series_filename\n", "datasetImportJobName = 'DSIMPORT_JOB_TARGET'\n", "ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,\n", " DatasetArn=target_datasetArn,\n", " DataSource= {\n", " \"S3Config\" : {\n", " \"Path\":target_s3DataPath,\n", " \"RoleArn\": role_arn\n", " } \n", " },\n", " TimestampFormat=TIMESTAMP_FORMAT\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']\n", "print(ds_import_job_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below will run and poll every 30 seconds until the import process has completed. From there we will be able to create a few models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "while True:\n", " dataImportStatus = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)['Status']\n", " print(dataImportStatus)\n", " if dataImportStatus != 'ACTIVE' and dataImportStatus != 'CREATE_FAILED':\n", " sleep(30)\n", " else:\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model building bits\n", "\n", "Given that our data is hourly and we want to generate a forecast at the same hourly granularity, Forecast limits us to a horizon of 500 time steps of whatever frequency we select. This means we could predict at most about 20 days into the future. In our case we are going to predict 3 days, or 72 hours.\n",
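"\n", "As a quick check of that arithmetic (a sketch, not part of the original notebook):\n", "\n", "```python\n", "max_horizon_steps = 500        # service limit on the forecast horizon noted above\n", "print(max_horizon_steps / 24)  # ~20.8 days at an hourly frequency\n", "print(3 * 24)                  # the 72-hour horizon we actually use below\n", "```\n",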
"\n", "The cells below will define a few variables to be used with all of our models. Then there will be an API call to create each `Predictor`, based on ARIMA, Prophet, and DeepAR+ respectively.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecastHorizon = 72\n", "NumberOfBacktestWindows = 1\n", "BackTestWindowOffset = 72\n", "ForecastFrequency = \"H\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "arima_algorithmArn = 'arn:aws:forecast:::algorithm/ARIMA'\n", "prophet_algorithmArn = 'arn:aws:forecast:::algorithm/Prophet'\n", "deepAR_Plus_algorithmArn = 'arn:aws:forecast:::algorithm/Deep_AR_Plus'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ARIMA Specifics\n", "arima_predictorName= project+'_arima_algo_1'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Build ARIMA:\n", "arima_create_predictor_response=forecast.create_predictor(PredictorName=arima_predictorName, \n", " AlgorithmArn=arima_algorithmArn,\n", " ForecastHorizon=forecastHorizon,\n", " PerformAutoML= False,\n", " PerformHPO=False,\n", " EvaluationParameters= {\"NumberOfBacktestWindows\": NumberOfBacktestWindows, \n", " \"BackTestWindowOffset\": BackTestWindowOffset}, \n", " InputDataConfig= {\"DatasetGroupArn\": datasetGroupArn},\n", " FeaturizationConfig= {\"ForecastFrequency\": ForecastFrequency, \n", " \"Featurizations\": \n", " [\n", " {\"AttributeName\": \"target_value\", \n", " \"FeaturizationPipeline\": \n", " [\n", " {\"FeaturizationMethodName\": \"filling\", \n", " \"FeaturizationMethodParameters\": \n", " {\"frontfill\": \"none\", \n", " \"middlefill\": \"zero\", \n", " \"backfill\": \"zero\"}\n", " }\n", " ]\n", " }\n", " ]\n", " }\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Prophet Specifics\n", "prophet_predictorName= project+'_prophet_algo_1'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Build Prophet:\n", "prophet_create_predictor_response=forecast.create_predictor(PredictorName=prophet_predictorName, \n", " AlgorithmArn=prophet_algorithmArn,\n", " ForecastHorizon=forecastHorizon,\n", " PerformAutoML= False,\n", " PerformHPO=False,\n", " EvaluationParameters= {\"NumberOfBacktestWindows\": NumberOfBacktestWindows, \n", " \"BackTestWindowOffset\": BackTestWindowOffset}, \n", " InputDataConfig= {\"DatasetGroupArn\": datasetGroupArn},\n", " FeaturizationConfig= {\"ForecastFrequency\": ForecastFrequency, \n", " \"Featurizations\": \n", " [\n", " {\"AttributeName\": \"target_value\", \n", " \"FeaturizationPipeline\": \n", " [\n", " {\"FeaturizationMethodName\": \"filling\", \n", " \"FeaturizationMethodParameters\": \n", " {\"frontfill\": \"none\", \n", " \"middlefill\": \"zero\", \n", " \"backfill\": \"zero\"}\n", " }\n", " ]\n", " }\n", " ]\n", " }\n", " )\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DeepAR+ Specifics\n", "deeparp_predictorName= project+'_deeparp_algo_1'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Build DeepAR+:\n", "deeparp_create_predictor_response=forecast.create_predictor(PredictorName=deeparp_predictorName, \n", " AlgorithmArn=deepAR_Plus_algorithmArn,\n", " ForecastHorizon=forecastHorizon,\n", " PerformAutoML= False,\n", " PerformHPO=False,\n", " EvaluationParameters= {\"NumberOfBacktestWindows\": NumberOfBacktestWindows, \n",
" \"BackTestWindowOffset\": BackTestWindowOffset}, \n", " InputDataConfig= {\"DatasetGroupArn\": datasetGroupArn},\n", " FeaturizationConfig= {\"ForecastFrequency\": ForecastFrequency, \n", " \"Featurizations\": \n", " [\n", " {\"AttributeName\": \"target_value\", \n", " \"FeaturizationPipeline\": \n", " [\n", " {\"FeaturizationMethodName\": \"filling\", \n", " \"FeaturizationMethodParameters\": \n", " {\"frontfill\": \"none\", \n", " \"middlefill\": \"zero\", \n", " \"backfill\": \"zero\"}\n", " }\n", " ]\n", " }\n", " ]\n", " }\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These calls will take an hour or so to complete in full, so feel free to take lunch here, go grab a pint, or do anything else that passes a decent amount of time.\n", "\n", "The following while loop keeps track of the DeepAR+ predictor's progress; the ARIMA and Prophet predictors were kicked off earlier and build in parallel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "while True:\n", " status = forecast.describe_predictor(PredictorArn=deeparp_create_predictor_response['PredictorArn'])['Status']\n", " print(status)\n", " if status != 'ACTIVE' and status != 'CREATE_FAILED':\n", " sleep(30)\n", " else:\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examine the Models\n", "\n", "First we are going to get the metrics for each model and see how they stack up:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ARIMA Metrics\n", "arima_arn = arima_create_predictor_response['PredictorArn']\n", "arima_metrics = forecast.get_accuracy_metrics(PredictorArn=arima_arn)\n", "pp = pprint.PrettyPrinter()\n", "pp.pprint(arima_metrics)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Prophet Metrics\n", "prophet_arn = prophet_create_predictor_response['PredictorArn']\n", "prophet_metrics = forecast.get_accuracy_metrics(PredictorArn=prophet_arn)\n", "pp = pprint.PrettyPrinter()\n", "pp.pprint(prophet_metrics)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DeepAR+ Metrics\n", "deeparp_arn = deeparp_create_predictor_response['PredictorArn']\n", "deeparp_metrics = forecast.get_accuracy_metrics(PredictorArn=deeparp_arn)\n", "pp = pprint.PrettyPrinter()\n", "pp.pprint(deeparp_metrics)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate a Forecast For Each\n", "\n", "The next phase is to generate a Forecast from each Predictor so we can see the results and understand visually which model is performing better."
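, "\n", "The forecast and export steps below wait on the service in the same way the predictor loop above did. As an optional refactor (a sketch, not part of the original notebook), a helper like the following could replace those repeated while loops:\n", "\n", "```python\n", "def wait_until_done(describe_call, **kwargs):\n", "    # Poll a Forecast describe_* call every 30 seconds until the resource\n", "    # reaches ACTIVE or CREATE_FAILED, mirroring the explicit loops in this notebook.\n", "    while True:\n", "        status = describe_call(**kwargs)['Status']\n", "        print(status)\n", "        if status in ('ACTIVE', 'CREATE_FAILED'):\n", "            return status\n", "        sleep(30)\n", "\n", "# Example: wait_until_done(forecast.describe_forecast, ForecastArn=deeparp_forecast_arn)\n", "```"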
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ARIMA\n", "arima_forecastName = project+'_arima_algo_forecast'\n", "arima_create_forecast_response=forecast.create_forecast(ForecastName=arima_forecastName,\n", " PredictorArn=arima_arn)\n", "arima_forecast_arn = arima_create_forecast_response['ForecastArn']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Prophet\n", "prophet_forecastName = project+'_prophet_algo_forecast'\n", "prophet_create_forecast_response=forecast.create_forecast(ForecastName=prophet_forecastName,\n", " PredictorArn=prophet_arn)\n", "prophet_forecast_arn = prophet_create_forecast_response['ForecastArn']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DeepAR+\n", "deeparp_forecastName = project+'_deeparp_algo_forecast'\n", "deeparp_create_forecast_response=forecast.create_forecast(ForecastName=deeparp_forecastName,\n", " PredictorArn=deeparp_arn)\n", "deeparp_forecast_arn = deeparp_create_forecast_response['ForecastArn']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "while True:\n", " status = forecast.describe_forecast(ForecastArn=deeparp_forecast_arn)['Status']\n", " print(status)\n", " if status != 'ACTIVE' and status != 'CREATE_FAILED':\n", " sleep(30)\n", " else:\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting your Forecasts to S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DeepAR + Forecast\n", "\n", "deep_ar_path = \"s3://\" + bucket_name + \"/DeepAR\"\n", "deep_ar_job_name = \"mlimday_deep_ar_algo_forecast\"\n", "deeparp_export_forecast_response = forecast.create_forecast_export_job(ForecastExportJobName=deep_ar_job_name,\n", " ForecastArn=deeparp_forecast_arn,\n", " Destination={\n", " \"S3Config\": {\n", " \"Path\": deep_ar_path,\n", " \"RoleArn\": role_arn\n", " }\n", " })\n", "deeparp_export_forecast_arn = deeparp_export_forecast_response['ForecastExportJobArn']\n", "print(deeparp_export_forecast_arn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Arima Forecast\n", "\n", "arima_path = \"s3://\" + bucket_name + \"/Arima\"\n", "arima_job_name = \"mlimday_arima_algo_forecast\"\n", "arima_export_forecast_response = forecast.create_forecast_export_job(ForecastExportJobName=arima_job_name,\n", " ForecastArn=arima_forecast_arn,\n", " Destination={\n", " \"S3Config\": {\n", " \"Path\": arima_path,\n", " \"RoleArn\": role_arn\n", " }\n", " })\n", "arima_export_forecast_arn = arima_export_forecast_response['ForecastExportJobArn']\n", "print(arima_export_forecast_arn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Prophet Forecast\n", "\n", "prophet_path = \"s3://\" + bucket_name + \"/Prophet\"\n", "prophet_job_name = \"mlimday_prophet_algo_forecast\"\n", "prophet_export_forecast_response = forecast.create_forecast_export_job(ForecastExportJobName=prophet_job_name,\n", " ForecastArn=prophet_forecast_arn,\n", " Destination={\n", " \"S3Config\": {\n", " \"Path\": prophet_path,\n", " \"RoleArn\": role_arn\n", " }\n", " })\n", "prophet_export_forecast_arn = prophet_export_forecast_response['ForecastExportJobArn']\n", "print(prophet_export_forecast_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This exporting process is another one of those items that will 
take around **5 minutes** per export to complete. In the Forecast console, open the Forecast whose status you saw turn `Active` earlier and you can follow the progress of its export there." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "while True:\n", " status = forecast.describe_forecast_export_job(ForecastExportJobArn=arima_export_forecast_arn)['Status']\n", " print(status)\n", " if status != 'ACTIVE' and status != 'CREATE_FAILED':\n", " sleep(30)\n", " else:\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Obtaining the Forecasts\n", "\n", "At this point the forecasts have all been exported to S3, but we need to download the results locally so we can explore them; the cell below does that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DeepAR File\n", "s3 = boto3.resource('s3')\n", "s3_bucket = s3.Bucket(bucket_name)\n", "deep_ar_filename = \"\"\n", "deep_ar_files = list(s3_bucket.objects.filter(Prefix=\"DeepAR\"))\n", "for file in deep_ar_files:\n", " # There will be a collection of CSVs if the forecast is large, modify this to go get them all\n", " if \"csv\" in file.key:\n", " deep_ar_filename = file.key.split('/')[1]\n", " s3.Bucket(bucket_name).download_file(file.key, \"../data/\"+deep_ar_filename)\n", "print(deep_ar_filename)\n", "\n", "# ARIMA File\n", "arima_filename = \"\"\n", "arima_files = list(s3_bucket.objects.filter(Prefix=\"Arima\"))\n", "for file in arima_files:\n", " # There will be a collection of CSVs if the forecast is large, modify this to go get them all\n", " if \"csv\" in file.key:\n", " arima_filename = file.key.split('/')[1]\n", " s3.Bucket(bucket_name).download_file(file.key, \"../data/\"+arima_filename)\n", "print(arima_filename)\n", "\n", "# Prophet File\n", "prophet_filename = \"\"\n", "prophet_files = list(s3_bucket.objects.filter(Prefix=\"Prophet\"))\n", "for file in prophet_files:\n", " # There will be a collection of CSVs if the forecast is large, modify this to go get them all\n", " if \"csv\" in file.key:\n", " prophet_filename = file.key.split('/')[1]\n", " s3.Bucket(bucket_name).download_file(file.key, \"../data/\"+prophet_filename)\n", "print(prophet_filename)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating the Forecast\n", "\n", "Even before opening the exported forecasts themselves we can see a few things in the metric output above.\n", "\n", "First, the RMSE for each model:\n", "\n", "1. ARIMA - RMSE: 20.900040194709177\n", "1. Prophet - RMSE: 20.298012475873875\n", "1. DeepAR+ - RMSE: 8.336288046896607\n", "\n", "Second, the weighted quantile loss at the median for each model:\n", "\n", "1. ARIMA wQL[0.5]: 0.15327057905199545\n", "1. Prophet wQL[0.5]: 0.16823164160633303\n", "1. DeepAR+ wQL[0.5]: 0.05955031556311987\n", "\n", "\n", "This tells us that our DeepAR+ model is doing the best when evaluating the p50 result.\n", "\n", "The next stage is to plot the forecasts against the actual values over a particular window.\n", "\n", "To make this process easier we exported them all as CSVs above and will read them back in here. An improvement would be to query the forecasts through the JSON API (the `forecastquery` client) and build the dataframes that way.\n", "\n", "Note the files were downloaded and placed into the `../data/` folder for exploration."
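, "\n", "If you would rather pull these numbers out programmatically than read the pretty-printed output, a sketch like the following works against the `get_accuracy_metrics` responses gathered above (the nesting is taken from that printed output; adjust it if your response differs):\n", "\n", "```python\n", "# Sketch: summarize the headline metrics from the responses collected earlier.\n", "for name, metrics in [('ARIMA', arima_metrics), ('Prophet', prophet_metrics), ('DeepAR+', deeparp_metrics)]:\n", "    window = metrics['PredictorEvaluationResults'][0]['TestWindows'][0]['Metrics']\n", "    p50_loss = [q['LossValue'] for q in window['WeightedQuantileLosses'] if q['Quantile'] == 0.5]\n", "    print(name, 'RMSE:', window['RMSE'], 'wQL[0.5]:', p50_loss)\n", "```"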
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ARIMA Eval\n", "arima_predicts = pd.read_csv(\"../data/\" + arima_filename)\n", "arima_predicts.sample()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "arima_predicts.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Remove the timezone\n", "arima_predicts['date'] = pd.to_datetime(arima_predicts['date'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "arima_predicts.sample()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "arima_predicts['date'] = arima_predicts['date'].dt.tz_convert(None)\n", "arima_predicts.set_index('date', inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "arima_predicts.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (arima_predicts.index.min())\n", "print (arima_predicts.index.max())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see our prediction goes from October 31st to November 2nd as expectged given our 72 hour interval forecast horizon. Also we can see the cyclical nature of the predictions over the entire timeframe. \n", "\n", "Now we are going to create a dataframe of the prediction values from this Forecast and the actual values.\n", "\n", "First let us remove the column ID of item before continuing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "arima_predicts = arima_predicts[['p10', 'p50', 'p90']]\n", "arima_predicts.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now strip the timezone information\n", "arima_predicts.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "actual_df = pd.read_csv(\"../data/item-demand-time-validation.csv\", names=['timestamp','value','item'])\n", "actual_df.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "actual_df = actual_df[(actual_df['timestamp'] >= '2014-10-31') & (actual_df['timestamp'] < '2014-11-03')]\n", "\n", "results_df = pd.DataFrame(columns=['timestamp', 'value', 'source'])\n", "for index, row in actual_df.iterrows():\n", " clean_timestamp = dateutil.parser.parse(row['timestamp'])\n", " results_df = results_df.append({'timestamp' : clean_timestamp , 'value' : row['value'], 'source': 'actual'} , ignore_index=True)\n", " \n", "validation_df = results_df.pivot(columns='source', values='value', index=\"timestamp\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "validation_df.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Finally let us join the dataframes together\n", "arima_val_df = arima_predicts.join(validation_df, how='outer')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot\n", "arima_val_df.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prophet Eval" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Prophet Eval\n", "prophet_predicts = pd.read_csv(\"../data/\"+prophet_filename)\n", 
"prophet_predicts.sample()\n", "# Remove the timezone\n", "prophet_predicts['date'] = pd.to_datetime(prophet_predicts['date'])\n", "prophet_predicts['date'] = prophet_predicts['date'].dt.tz_convert(None)\n", "prophet_predicts.set_index('date', inplace=True)\n", "prophet_predicts = prophet_predicts[['p10', 'p50', 'p90']]\n", "# Finally let us join the dataframes together\n", "prophet_val_df = prophet_predicts.join(validation_df, how='outer')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot\n", "prophet_val_df.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DeepAR+ Eval" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DeepAR+ Eval\n", "deeparp_predicts = pd.read_csv(\"../data/\"+deep_ar_filename)\n", "deeparp_predicts.sample()\n", "# Remove the timezone\n", "deeparp_predicts['date'] = pd.to_datetime(deeparp_predicts['date'])\n", "deeparp_predicts['date'] = deeparp_predicts['date'].dt.tz_convert(None)\n", "deeparp_predicts.set_index('date', inplace=True)\n", "deeparp_predicts = deeparp_predicts[['p10', 'p50', 'p90']]\n", "# Finally let us join the dataframes together\n", "deeparp_val_df = deeparp_predicts.join(validation_df, how='outer')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot\n", "deeparp_val_df.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is particularly interesting here is that we were below the actual numbers for a good portion of the day even with p90. We did see great performance from Prophet and the metrics indicate that DeepAR+ is objectively better here so now we will add related time series data to our project and see how the models behave then." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting it all together\n", "\n", "Lastly we will take a look at our p50 prediction from ARIMA, Prophet, and DeepAR+ where the latter leverage related time series data to see how close they are over our validation period.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creating a final df\n", "validation_df_final = validation_df.copy()\n", "validation_df_final = validation_df_final[['actual']]\n", "# ARIMA P50\n", "arima_p50 = arima_predicts.copy()\n", "arima_p50 = arima_p50[['p50']]\n", "arima_p50.rename(columns = {'p50':'ARIMA'}, inplace = True)\n", "# Prophet P50:\n", "prophet_val_p50 = prophet_predicts.copy()\n", "prophet_val_p50 = prophet_val_p50[['p50']]\n", "prophet_val_p50.rename(columns = {'p50':'PROPHET'}, inplace = True)\n", "# DeepAR+ P50:\n", "deeparp_val_p50 = deeparp_predicts.copy()\n", "deeparp_val_p50 = deeparp_val_p50[['p50']]\n", "deeparp_val_p50.rename(columns = {'p50':'DEEPARP'}, inplace = True)\n", "# Join DFs\n", "validation_df_final = validation_df_final.join(arima_p50, how='outer')\n", "validation_df_final = validation_df_final.join(prophet_val_p50, how='outer')\n", "validation_df_final = validation_df_final.join(deeparp_val_p50, how='outer')\n", "validation_df_final.head()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "validation_df_final.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boto3.Session().resource('s3').Bucket(bucket_name).Object(\"DeepAR/mlimday_deep_ar_algo_forecast_2020-04-16T18-12-03Z_part0.csv\").delete()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store datasetGroupArn\n", "%store target_datasetArn\n", "%store role_name\n", "%store bucket_name\n", "%store region\n", "%store ds_import_job_arn\n", "%store deeparp_export_forecast_arn\n", "%store arima_export_forecast_arn\n", "%store prophet_export_forecast_arn\n", "%store prophet_forecast_arn\n", "%store arima_forecast_arn\n", "%store deeparp_forecast_arn\n", "%store arima_arn\n", "%store prophet_arn\n", "%store deeparp_arn\n", "%store deep_ar_filename\n", "%store arima_filename\n", "%store prophet_filename" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }