{ "cells": [ { "cell_type": "markdown", "id": "26b630f5-51a0-4f6b-8c09-07fdb651b7f0", "metadata": { "tags": [] }, "source": [ "# Welcome to Amazon Forecast Quick Start Guide\n", "\n", "Using Amazon Forecast involves the following 3 steps.\n", "\n", "![Amazon Forecast Workflow](https://github.com/aws-samples/amazon-forecast-samples/raw/main/notebooks/basic/Getting_Started/images/workflow.png)\n", "\n", "Imagine we are trying to solve the forecasting problem for a ride-hailing service and we want to predict how many pick-ups are expected in specific areas of New York. For this exercise, we will use the yellow taxi trip records from [NYC Taxi and Limousine Commission (TLC)](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). \n", "\n", "We will start by **importing the historical data from December 2017 to January 2019**. Next, we will **train a Predictor** using this data. Finally, we will **generate a forecast for February 2019** and **compare it with the actual data from February 2019**.\n", "\n", "## Table of Contents\n", "* [Pre-requisites](#prerequisites)\n", "* Step 1: [Import your data](#import)\n", "* Step 2: [Train a predictor](#predictor)\n", "* Step 3: [Generate forecasts](#forecast)\n", "* BONUS! [Explaining the predictor](#explaining)\n", "* [Clean-up](#cleanup)" ] }, { "cell_type": "markdown", "id": "308d9bcf-7d72-4837-b01d-5b926908cba4", "metadata": { "tags": [] }, "source": [ "## Pre-requisites \n", "Before we get started, lets set up the notebook environment, the AWS SDK client for Amazon Forecast and IAM Role used by Amazon Forecast to access your data." ] }, { "cell_type": "markdown", "id": "1b061c30-3367-4518-a1ba-1d8801a859f1", "metadata": {}, "source": [ "#### Setup Notebook Environment" ] }, { "cell_type": "code", "execution_count": null, "id": "b8fb452c-a803-4fe3-814f-0e6da07128f0", "metadata": {}, "outputs": [], "source": [ "%%capture --no-stderr setup\n", "\n", "!pip install pandas s3fs matplotlib ipywidgets\n", "!pip install boto3 --upgrade\n", "\n", "%reload_ext autoreload" ] }, { "cell_type": "markdown", "id": "acb8fb2b-6392-4baf-a847-7c969cbe0dfa", "metadata": {}, "source": [ "#### Setup Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "479d5a31-7f83-448c-b7f8-b82477830d58", "metadata": {}, "outputs": [], "source": [ "import sys\n", "import os\n", "\n", "sys.path.insert( 0, os.path.abspath(\"../../common\") )\n", "\n", "import json\n", "import util\n", "import boto3\n", "import s3fs\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "a0a583cd-35f3-4f2a-b7a2-a51363dacf2c", "metadata": {}, "source": [ "#### Create an instance of AWS SDK client for Amazon Forecast" ] }, { "cell_type": "code", "execution_count": null, "id": "3d459bcc-526d-4f91-b165-6e747da5cbee", "metadata": {}, "outputs": [], "source": [ "region = 'us-east-1'\n", "session = boto3.Session(region_name=region) \n", "forecast = session.client(service_name='forecast')\n", "forecastquery = session.client(service_name='forecastquery')\n", "\n", "# Checking to make sure we can communicate with Amazon Forecast\n", "assert forecast.list_predictors()" ] }, { "cell_type": "markdown", "id": "bbf703b7-7e0e-46f8-baf3-c5419d033446", "metadata": {}, "source": [ "#### Setup IAM Role used by Amazon Forecast to access your data" ] }, { "cell_type": "code", "execution_count": null, "id": "cfff3e73-135b-4a0d-9576-6f5326ef0771", "metadata": {}, "outputs": [], "source": [ "role_name = \"ForecastNotebookRole-Basic\"\n", "print(f\"Creating Role {role_name}...\")\n", "role_arn = 
"role_arn = util.get_or_create_iam_role( role_name = role_name )\n", "\n", "# Echo the created role (without the AWS account id)\n", "print(f\"Success! Created role = {role_arn.split('/')[1]}\")" ] }, { "cell_type": "markdown", "id": "561359c2-2c51-4e1d-a9af-25110b302638", "metadata": {}, "source": [ "## Step 1: Import your data \n", "\n", "In this step, we will create a **Dataset** and **Import** the December 2017 to January 2019 dataset from S3 into Amazon Forecast. To train a Predictor, we will need a **DatasetGroup** that groups the input **Datasets**. So, we will end this step by creating a **DatasetGroup** with the imported **Dataset**." ] }, { "cell_type": "markdown", "id": "dc9043b1-7ab0-4896-b6e5-012e351e0f00", "metadata": {}, "source": [ "#### Peek at the data and upload it to S3.\n", "\n", "The taxi dataset has the following 3 columns:\n", "1. **timestamp:** Timestamp at which pick-ups are requested.\n", "2. **item_id:** Pick-up location ID.\n", "3. **target_value:** Number of pick-ups requested around the timestamp at the pick-up location." ] }, { "cell_type": "code", "execution_count": null, "id": "89db3d04-f606-4cd9-aae3-8145bad0e967", "metadata": {}, "outputs": [], "source": [ "key=\"data/taxi-dec2017-jan2019.csv\"\n", "\n", "taxi_df = pd.read_csv(key, dtype = object, names=['timestamp','item_id','target_value'])\n", "\n", "display(taxi_df.head(3))\n", "\n", "bucket_name = input(\"\\nEnter the S3 bucket name for uploading the data and press ENTER:\")\n", "print(f\"\\nAttempting to upload the data to the S3 bucket '{bucket_name}' at key '{key}' ...\")\n", "\n", "s3 = boto3.Session().resource('s3')\n", "bucket = s3.Bucket(bucket_name)\n", "if not bucket.creation_date:\n", "    if region != \"us-east-1\":\n", "        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})\n", "    else:\n", "        s3.create_bucket(Bucket=bucket_name)\n", "\n", "s3.Bucket(bucket_name).Object(key).upload_file(key)\n", "ts_s3_path = f\"s3://{bucket_name}/{key}\"\n", "\n", "print(f\"\\nDone, the dataset is uploaded to S3 at {ts_s3_path}.\")" ] }, { "cell_type": "markdown", "id": "be7169ba-c8ee-473b-aa19-52197415df3d", "metadata": {}, "source": [ "#### Creating the Dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "61d61e18-9ad6-4d58-98c9-5a1385a379f9", "metadata": {}, "outputs": [], "source": [ "DATASET_FREQUENCY = \"H\" # H for hourly.\n", "TS_DATASET_NAME = \"TAXI_TS\"\n", "TS_SCHEMA = {\n", " \"Attributes\":[\n", " {\n", " \"AttributeName\":\"timestamp\",\n", " \"AttributeType\":\"timestamp\"\n", " },\n", " {\n", " \"AttributeName\":\"item_id\",\n", " \"AttributeType\":\"string\"\n", " },\n", " {\n", " \"AttributeName\":\"target_value\",\n", " \"AttributeType\":\"integer\"\n", " }\n", " ]\n", "}\n", "\n", "create_dataset_response = forecast.create_dataset(Domain=\"CUSTOM\",\n", " DatasetType='TARGET_TIME_SERIES',\n", " DatasetName=TS_DATASET_NAME,\n", " DataFrequency=DATASET_FREQUENCY,\n", " Schema=TS_SCHEMA)\n", "\n", "ts_dataset_arn = create_dataset_response['DatasetArn']\n", "describe_dataset_response = forecast.describe_dataset(DatasetArn=ts_dataset_arn)\n", "\n", "print(f\"The Dataset with ARN {ts_dataset_arn} is now {describe_dataset_response['Status']}.\")" ] }, { "cell_type": "markdown", "id": "d4bf392b-beee-4d52-bf42-11e5fa33dd2e", "metadata": {}, "source": [ "#### Importing the Dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "2abcb47d-9449-4a59-8c68-8404a9817177", "metadata": {}, "outputs": [], "source": [ "TIMESTAMP_FORMAT = \"yyyy-MM-dd HH:mm:ss\"\n",
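 "# TimestampFormat tells Amazon Forecast how to parse the timestamp column ('yyyy-MM-dd HH:mm:ss' for hourly data);\n", "# TimeZone applies a single time zone to every item in the dataset.\n",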
"TS_IMPORT_JOB_NAME = \"TAXI_TTS_IMPORT\"\n", "TIMEZONE = \"EST\"\n", "\n", "ts_dataset_import_job_response = \\\n", " forecast.create_dataset_import_job(DatasetImportJobName=TS_IMPORT_JOB_NAME,\n", " DatasetArn=ts_dataset_arn,\n", " DataSource= {\n", " \"S3Config\" : {\n", " \"Path\": ts_s3_path,\n", " \"RoleArn\": role_arn\n", " } \n", " },\n", " TimestampFormat=TIMESTAMP_FORMAT,\n", " TimeZone = TIMEZONE)\n", "\n", "ts_dataset_import_job_arn = ts_dataset_import_job_response['DatasetImportJobArn']\n", "describe_dataset_import_job_response = forecast.describe_dataset_import_job(DatasetImportJobArn=ts_dataset_import_job_arn)\n", "\n", "print(f\"Waiting for Dataset Import Job with ARN {ts_dataset_import_job_arn} to become ACTIVE. This process could take 5-10 minutes.\\n\\nCurrent Status:\")\n", "\n", "status = util.wait(lambda: forecast.describe_dataset_import_job(DatasetImportJobArn=ts_dataset_import_job_arn))\n", "\n", "describe_dataset_import_job_response = forecast.describe_dataset_import_job(DatasetImportJobArn=ts_dataset_import_job_arn)\n", "print(f\"\\n\\nThe Dataset Import Job with ARN {ts_dataset_import_job_arn} is now {describe_dataset_import_job_response['Status']}.\")" ] }, { "cell_type": "markdown", "id": "519eb5ea-1929-4bd4-928f-ee730fee0974", "metadata": {}, "source": [ "#### Creating a DatasetGroup" ] }, { "cell_type": "code", "execution_count": null, "id": "4e610dd4-0782-41d4-99f2-cc7e4c77f18d", "metadata": {}, "outputs": [], "source": [ "DATASET_GROUP_NAME = \"TAXI_DEMO\"\n", "DATASET_ARNS = [ts_dataset_arn]\n", "\n", "create_dataset_group_response = \\\n", " forecast.create_dataset_group(Domain=\"CUSTOM\",\n", " DatasetGroupName=DATASET_GROUP_NAME,\n", " DatasetArns=DATASET_ARNS)\n", "\n", "dataset_group_arn = create_dataset_group_response['DatasetGroupArn']\n", "describe_dataset_group_response = forecast.describe_dataset_group(DatasetGroupArn=dataset_group_arn)\n", "\n", "print(f\"The DatasetGroup with ARN {dataset_group_arn} is now {describe_dataset_group_response['Status']}.\")" ] }, { "cell_type": "markdown", "id": "11f20bed-c206-428b-8c26-bee46b4dd702", "metadata": {}, "source": [ "## Step 2: Train a predictor \n", "\n", "In this step, we will create a **Predictor** using the **DatasetGroup** that was created above. After creating the predictor, we will review the accuracy obtained through the backtesting process to get a quantitative understanding of the performance of the predictor." ] }, { "cell_type": "markdown", "id": "ffe656fd-e0ff-4ab3-83d9-1bc5aed16059", "metadata": {}, "source": [ "#### Train a predictor" ] }, { "cell_type": "code", "execution_count": null, "id": "1b7da60b-1cbc-4ea3-81f8-3412ab1338eb", "metadata": {}, "outputs": [], "source": [ "PREDICTOR_NAME = \"TAXI_PREDICTOR\"\n", "FORECAST_HORIZON = 24\n", "FORECAST_FREQUENCY = \"H\"\n", "HOLIDAY_DATASET = [{\n", " 'Name': 'holiday',\n", " 'Configuration': {\n", " 'CountryCode': ['US']\n", " }\n", "}]\n", "\n", "create_auto_predictor_response = \\\n", " forecast.create_auto_predictor(PredictorName = PREDICTOR_NAME,\n", " ForecastHorizon = FORECAST_HORIZON,\n", " ForecastFrequency = FORECAST_FREQUENCY,\n", " DataConfig = {\n", " 'DatasetGroupArn': dataset_group_arn, \n", " 'AdditionalDatasets': HOLIDAY_DATASET\n", " },\n", " ExplainPredictor = True)\n", "\n", "predictor_arn = create_auto_predictor_response['PredictorArn']\n", "print(f\"Waiting for Predictor with ARN {predictor_arn} to become ACTIVE. 
Depending on the data size and predictor settings, it can take several hours to become ACTIVE.\\n\\nCurrent Status:\")\n", "\n", "status = util.wait(lambda: forecast.describe_auto_predictor(PredictorArn=predictor_arn))\n", "\n", "describe_auto_predictor_response = forecast.describe_auto_predictor(PredictorArn=predictor_arn)\n", "print(f\"\\n\\nThe Predictor with ARN {predictor_arn} is now {describe_auto_predictor_response['Status']}.\")" ] }, { "cell_type": "markdown", "id": "5e519266-8533-4a22-aac5-0ea46593a146", "metadata": {}, "source": [ "#### Review accuracy metrics\n", "\n", "* The **Weighted Quantile Loss (wQL)** metric measures the accuracy of a model at a specified quantile. It is particularly useful when there are different costs for underpredicting and overpredicting.\n", "\n", "* **Root Mean Square Error (RMSE)** uses the squared value of the residuals, which amplifies the impact of outliers. In use cases where only a few large mispredictions can be very costly, RMSE is the more relevant metric.\n", "\n", "* **Weighted Absolute Percentage Error (WAPE)** is more robust to outliers than RMSE because it uses the absolute error instead of the squared error.\n", "\n", "* **Mean Absolute Percentage Error (MAPE)** is useful for cases where values differ significantly between time points and outliers have a significant impact.\n", "\n", "* **Mean Absolute Scaled Error (MASE)** is ideal for datasets that are cyclical in nature or have seasonal properties." ] }, { "cell_type": "code", "execution_count": null, "id": "05190a1e-f7a8-449f-a43e-2bc1456c65c3", "metadata": {}, "outputs": [], "source": [ "get_accuracy_metrics_response = forecast.get_accuracy_metrics(PredictorArn=predictor_arn)\n", "wql = get_accuracy_metrics_response['PredictorEvaluationResults'][0]['TestWindows'][0]['Metrics']['WeightedQuantileLosses']\n", "accuracy_scores = get_accuracy_metrics_response['PredictorEvaluationResults'][0]['TestWindows'][0]['Metrics']['ErrorMetrics'][0]\n", "\n", "print(f\"Weighted Quantile Loss (wQL): {json.dumps(wql, indent=2)}\\n\\n\")\n", "\n", "print(f\"Root Mean Square Error (RMSE): {accuracy_scores['RMSE']}\\n\\n\")\n", "\n", "print(f\"Weighted Absolute Percentage Error (WAPE): {accuracy_scores['WAPE']}\\n\\n\")\n", "\n", "print(f\"Mean Absolute Percentage Error (MAPE): {accuracy_scores['MAPE']}\\n\\n\")\n", "\n", "print(f\"Mean Absolute Scaled Error (MASE): {accuracy_scores['MASE']}\\n\")" ] }, { "cell_type": "markdown", "id": "dc8a326d-ce32-425a-bc9f-d5bdb0c46c66", "metadata": {}, "source": [ "## Step 3: Generate forecasts \n", "Finally, we will generate forecasts using the predictor trained above. Later in this step, we will also compare the forecast with the ground truth for February 1, 2019 to demonstrate the actual performance of Amazon Forecast on this dataset." ] }, { "cell_type": "markdown", "id": "e864147e-c0d7-4e83-9b58-c402743a7ba0", "metadata": {}, "source": [ "#### Generate forecasts" ] }, { "cell_type": "code", "execution_count": null, "id": "42e44edb-6468-4b40-8133-4f74e409c966", "metadata": {}, "outputs": [], "source": [ "FORECAST_NAME = \"TAXI_FORECAST\"\n", "\n", "create_forecast_response = \\\n", " forecast.create_forecast(ForecastName=FORECAST_NAME,\n", " PredictorArn=predictor_arn)\n", "\n", "forecast_arn = create_forecast_response['ForecastArn']\n",
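 "# By default, CreateForecast generates the 0.1, 0.5 and 0.9 quantiles (p10/p50/p90), which we query below;\n", "# pass the ForecastTypes parameter to request different quantiles.\n",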
"print(f\"Waiting for Forecast with ARN {forecast_arn} to become ACTIVE. Depending on the data size and predictor settings, it can take several hours to become ACTIVE.\\n\\nCurrent Status:\")\n", "\n", "status = util.wait(lambda: forecast.describe_forecast(ForecastArn=forecast_arn))\n", "\n", "describe_forecast_response = forecast.describe_forecast(ForecastArn=forecast_arn)\n", "print(f\"\\n\\nThe Forecast with ARN {forecast_arn} is now {describe_forecast_response['Status']}.\")" ] }, { "cell_type": "markdown", "id": "f875ae00-67cf-4095-b486-7b18a4ad8fb2", "metadata": {}, "source": [ "#### Load ground truth for pick-up location 48 on February 1, 2019." ] }, { "cell_type": "code", "execution_count": null, "id": "2b7e140e-cf41-4593-8c31-345e2c422d83", "metadata": {}, "outputs": [], "source": [ "ITEM_ID = \"48\"\n", "\n", "taxi_feb_df = pd.read_csv(\"data/taxi-feb2019.csv\", dtype = object, names=['timestamp','item_id','target_value'])\n", "taxi_feb_df.target_value = taxi_feb_df.target_value.astype(float)\n", "\n", "actuals = taxi_feb_df[(taxi_feb_df['item_id'] == ITEM_ID)]" ] }, { "cell_type": "markdown", "id": "9b0937f3-02d3-441d-b82d-7a91abf20bdd", "metadata": {}, "source": [ "#### Query forecasts for pick-up location 48 on February 1, 2019." ] }, { "cell_type": "code", "execution_count": null, "id": "b66d38e6-476a-4c68-815a-ab44c8c111ff", "metadata": {}, "outputs": [], "source": [ "forecast_response = forecastquery.query_forecast(\n", " ForecastArn=forecast_arn,\n", " Filters={\"item_id\": ITEM_ID}\n", ")\n", "\n", "forecasts_p10_df = pd.DataFrame.from_dict(forecast_response['Forecast']['Predictions']['p10'])\n", "forecasts_p50_df = pd.DataFrame.from_dict(forecast_response['Forecast']['Predictions']['p50'])\n", "forecasts_p90_df = pd.DataFrame.from_dict(forecast_response['Forecast']['Predictions']['p90'])" ] }, { "cell_type": "markdown", "id": "3c828e58-a9f0-4506-9710-590cca595568", "metadata": {}, "source": [ "#### Compare the forecasts with ground truth" ] }, { "cell_type": "code", "execution_count": null, "id": "52ff4b5a-a7ed-45bd-baf6-67d4c4e64af2", "metadata": {}, "outputs": [], "source": [ "# Combine the actuals and the forecast quantiles into one long-format frame for plotting.\n", "results = []\n", "\n", "for index, row in actuals.iterrows():\n", "    results.append({'timestamp': dateutil.parser.parse(row['timestamp']), 'value': row['target_value'], 'source': 'actual'})\n", "\n", "for source, quantile_df in [('p10', forecasts_p10_df), ('p50', forecasts_p50_df), ('p90', forecasts_p90_df)]:\n", "    for index, row in quantile_df.iterrows():\n", "        results.append({'timestamp': dateutil.parser.parse(row['Timestamp']), 'value': row['Value'], 'source': source})\n", "\n", "results_df = pd.DataFrame(results, columns=['timestamp', 'value', 'source'])\n", "pivot_df = results_df.pivot(columns='source', values='value', index='timestamp')\n", "\n",
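 "# The p50 forecast is the median prediction; the p10 and p90 quantiles bound its uncertainty, so for a\n", "# well-calibrated model roughly 80% of the actual values should fall between them.\n",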
"pivot_df.plot(figsize=(15, 7))" ] }, { "cell_type": "markdown", "id": "703ac462-5d10-4caa-9122-390e005494b9", "metadata": {}, "source": [ "## BONUS! Explaining the predictor \n", "In Step 2, we added an additional dataset - US Holidays - when creating the predictor. Let us now see how much impact that additional dataset had on the forecast. You can do the same for any additional datasets that you bring in." ] }, { "cell_type": "code", "execution_count": null, "id": "01f49f2d-1d65-42e5-bc55-892927654451", "metadata": {}, "outputs": [], "source": [ "# ExplainPredictor=True in Step 2 created an Explainability resource alongside the predictor.\n", "explainability_arn = describe_auto_predictor_response['ExplainabilityInfo']['ExplainabilityArn']\n", "status = util.wait(lambda: forecast.describe_explainability(ExplainabilityArn=explainability_arn))\n", "\n", "EXPLAINABILITY_EXPORT_NAME = \"TAXI_PREDICTOR_EXPLANATION_EXPORT\"\n", "EXPLAINABILITY_EXPORT_DESTINATION = f\"s3://{bucket_name}/explanation/{EXPLAINABILITY_EXPORT_NAME}\"\n", "\n", "explainability_export_response = forecast.create_explainability_export(ExplainabilityExportName=EXPLAINABILITY_EXPORT_NAME, \n", " ExplainabilityArn=explainability_arn, \n", " Destination={\n", " \"S3Config\": {\n", " \"Path\": EXPLAINABILITY_EXPORT_DESTINATION,\n", " \"RoleArn\": role_arn}\n", " }\n", " )\n", "\n", "explainability_export_arn = explainability_export_response['ExplainabilityExportArn']\n", "\n", "status = util.wait(lambda: forecast.describe_explainability_export(ExplainabilityExportArn=explainability_export_arn))\n", "\n", "export_data = util.read_explainability_export(bucket_name, \"explanation/\" + EXPLAINABILITY_EXPORT_NAME)\n", "\n", "export_data.style.hide(axis='index')" ] }, { "cell_type": "markdown", "id": "5f5a1227-dfa7-495b-9613-bba550d3192c", "metadata": {}, "source": [ "* **Impact scores** measure the relative impact attributes have on forecast values. For example, if the holiday attribute has an impact score twice as large as that of another attribute, say weather, you can conclude that holidays have twice the impact on forecast values that weather does. \n", "* **Impact scores** also indicate whether an attribute increases or decreases the forecasted value. A negative impact score reflects that the attribute tends to decrease the value of the forecast." ] }, { "cell_type": "markdown", "id": "9c91ba88-7877-4a8d-bca0-f8800dcaefbe", "metadata": {}, "source": [ "## Clean-up \n", "Uncomment and run the following code to delete all resources that were created in this notebook." ] }, { "cell_type": "code", "execution_count": null, "id": "172501bb-3cee-412f-81e0-f59e897230db", "metadata": {}, "outputs": [], "source": [ "# forecast.delete_resource_tree(ResourceArn = dataset_group_arn)\n", "# forecast.delete_resource_tree(ResourceArn = ts_dataset_arn)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }