{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# How to use Amazon Forecast\n", "\n", "This notebook helps advanced users get started with Amazon Forecast quickly. The demo runs through a typical end-to-end use case for a simple time series forecasting scenario. \n", "\n", "Prerequisites: \n", "[AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/installing.html). \n", "\n", "For more information about the APIs, please check the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/what-is-forecast.html)\n", "\n", "## Table Of Contents\n", "* [Setting up](#setup)\n", "* [Test Setup - Running first API](#hello)\n", "* [Forecasting Example with Amazon Forecast](#forecastingExample)\n", "\n", "**Read Every Cell FULLY before executing it**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup <a class=\"anchor\" id=\"setup\"></a>" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "import os\n", "import time\n", "\n", "import boto3\n", "\n", "# Import the forecast notebook utilities from the notebooks/common directory\n", "sys.path.insert( 0, os.path.abspath(\"../../common\") )\n", "import util" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configure the S3 bucket name and region name for this lesson.\n", "\n", "- If you don't have an S3 bucket, create one first on S3.\n", "- Although we have set the region to us-west-2 as a default value below, you can choose any of the regions that the service is available in." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text_widget_bucket = util.create_text_widget( \"bucketName\", \"input your S3 bucket name\" )\n", "text_widget_region = util.create_text_widget( \"region\", \"input region name.\", default_value=\"us-west-2\" )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucketName = text_widget_bucket.value\n", "assert bucketName, \"bucket_name not set.\"\n", "\n", "region = text_widget_region.value\n", "assert region, \"region not set.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = boto3.Session(region_name=region) \n", "\n", "forecast = session.client(service_name='forecast') \n", "forecastquery = session.client(service_name='forecastquery')" ] },
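{ "cell_type": "markdown", "metadata": {}, "source": [ "## Test Setup - Running first API <a class=\"anchor\" id=\"hello\"></a>\n", "\n", "As a quick sanity check that the clients above are configured correctly, run one simple read-only API call. This is a minimal sketch; `list_dataset_groups` just lists any existing dataset groups in the region, so an empty list is expected if you have not used Amazon Forecast here before." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A read-only call that verifies credentials, region and service access.\n", "forecast.list_dataset_groups()" ] },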
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Forecasting with Amazon Forecast<a class=\"anchor\" id=\"forecastingExample\"></a>\n", "### Preparing your Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Amazon Forecast, a dataset is a collection of file(s) which contain data that is relevant for a forecasting task. A dataset must conform to a schema provided by Amazon Forecast. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this exercise, we use the individual household electric power consumption dataset. (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.) We aggregate the usage data hourly. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amazon Forecast can import data from Amazon S3. We first explore the data locally to see the fields." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"../../common/data/item-demand-time.csv\", dtype = object)\n", "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now upload the data to S3. Before doing that, go into your AWS Console, select S3 for the service and create a new bucket inside the `Oregon` or `us-west-2` region. Use a bucket name following the convention `amazon-forecast-unique-value-data`. The name must be globally unique; if you get an error, just adjust it until the name works, then update the `bucketName` value you set earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = session.client('s3')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "key=\"elec_data/item-demand-time.csv\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3.upload_file(Filename=\"../../common/data/item-demand-time.csv\", Bucket=bucketName, Key=key)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the IAM role to provide to Amazon Forecast.\n", "role_name = \"ForecastNotebookRole-AutoML\"\n", "role_arn = util.get_or_create_iam_role( role_name = role_name )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CreateDataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More details about `Domain` and dataset types can be found in the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html). For this example, we are using the [CUSTOM](https://docs.aws.amazon.com/forecast/latest/dg/custom-domain.html) domain with its 3 required attributes: `timestamp`, `target_value` and `item_id`. Also, update the project name below to a unique value in lowercase format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DATASET_FREQUENCY = \"H\" \n", "TIMESTAMP_FORMAT = \"yyyy-MM-dd hh:mm:ss\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "project = 'workshop_forecastdemo_1' # Replace this with a unique name; make sure the entire name is < 30 characters.\n", "datasetName= project+'_ds'\n", "datasetGroupName= project +'_gp'\n", "s3DataPath = \"s3://\"+bucketName+\"/\"+key" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datasetName" ] },
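{ "cell_type": "markdown", "metadata": {}, "source": [ "Before declaring the schema, it can be worth confirming that the raw file actually matches what we are about to declare. The next cell is a minimal, optional sanity check; it assumes the sample CSV has no header row and three columns in the order timestamp, target value, item id, with timestamps matching `TIMESTAMP_FORMAT` above (`%Y-%m-%d %H:%M:%S` in pandas terms)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check on the raw file (assumes no header row).\n", "check_df = pd.read_csv(\"../../common/data/item-demand-time.csv\", header=None)\n", "assert check_df.shape[1] == 3, \"expected 3 columns: timestamp, target_value, item_id\"\n", "# Raises if any timestamp does not match the declared format.\n", "pd.to_datetime(check_df[0], format=\"%Y-%m-%d %H:%M:%S\")\n", "check_df.head(3)" ] },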
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Schema Definition\n", "Here we define the attributes for the model. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.\n", "schema ={\n", " \"Attributes\":[\n", " {\n", " \"AttributeName\":\"timestamp\",\n", " \"AttributeType\":\"timestamp\"\n", " },\n", " {\n", " \"AttributeName\":\"target_value\",\n", " \"AttributeType\":\"float\"\n", " },\n", " {\n", " \"AttributeName\":\"item_id\",\n", " \"AttributeType\":\"string\"\n", " }\n", " ]\n", "}\n", "\n", "response=forecast.create_dataset(\n", " Domain=\"CUSTOM\",\n", " DatasetType='TARGET_TIME_SERIES',\n", " DatasetName=datasetName,\n", " DataFrequency=DATASET_FREQUENCY, \n", " Schema = schema\n", " )\n", "datasetArn = response['DatasetArn']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_dataset_group_response = forecast.create_dataset_group(DatasetGroupName=datasetGroupName,\n", " Domain=\"CUSTOM\",\n", " DatasetArns= [datasetArn]\n", " )\n", "datasetGroupArn = create_dataset_group_response['DatasetGroupArn']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have an existing dataset group, you can update it using **update_dataset_group**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecast.describe_dataset_group(DatasetGroupArn=datasetGroupArn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Data Import Job\n", "The import job brings the raw data into the Amazon Forecast system, ready for forecasting. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datasetImportJobName = 'EP_AML_DSIMPORT_JOB_TARGET'\n", "ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,\n", " DatasetArn=datasetArn,\n", " DataSource= {\n", " \"S3Config\" : {\n", " \"Path\":s3DataPath,\n", " \"RoleArn\": role_arn\n", " } \n", " },\n", " TimestampFormat=TIMESTAMP_FORMAT\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']\n", "print(ds_import_job_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the status of the dataset import job. When the status changes from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to the next steps. Depending on the data size, this process typically takes 5 to 10 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status_indicator = util.StatusIndicator()\n", "\n", "while True:\n", " status = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)['Status']\n", " status_indicator.update(status)\n", " if status in ('ACTIVE', 'CREATE_FAILED'): break\n", " time.sleep(10)\n", "\n", "status_indicator.end()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Predictor with custom forecast horizon" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The forecast horizon is the number of time points to be predicted in the future. For weekly data, a value of 12 means 12 weeks. Our example uses hourly data and we want to forecast the next day, so we set the horizon to 24." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we are not sure which recipe will perform best, we can utilise the Auto ML option that the SDK offers." ] },
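{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, the manual alternative to Auto ML is to set `PerformAutoML=False` and pass an explicit `AlgorithmArn` to `create_predictor`. The cell below is a commented-out, minimal sketch of that variant; the predictor name and the trimmed-down `FeaturizationConfig` are illustrative only. This notebook itself uses the Auto ML call in the cells that follow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch only (not executed): pick the ETS algorithm manually instead of Auto ML.\n", "# manual_predictor_response = forecast.create_predictor(\n", "#     PredictorName=project + '_ets',\n", "#     AlgorithmArn='arn:aws:forecast:::algorithm/ETS',\n", "#     PerformAutoML=False,\n", "#     PerformHPO=False,\n", "#     ForecastHorizon=24,\n", "#     InputDataConfig={\"DatasetGroupArn\": datasetGroupArn},\n", "#     FeaturizationConfig={\"ForecastFrequency\": \"H\"})" ] },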
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictorName = project+'_autoML'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecastHorizon = 24" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The ETS algorithm ARN; only needed when selecting an algorithm manually, as Auto ML ignores it.\n", "algorithmArn = 'arn:aws:forecast:::algorithm/ETS'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_predictor_response=forecast.create_predictor(PredictorName=predictorName, \n", " ForecastHorizon=forecastHorizon,\n", " PerformAutoML=True,\n", " PerformHPO=False,\n", " EvaluationParameters= {\"NumberOfBacktestWindows\": 1, \n", " \"BackTestWindowOffset\": 24}, \n", " InputDataConfig= {\"DatasetGroupArn\": datasetGroupArn},\n", " FeaturizationConfig= {\"ForecastFrequency\": \"H\", \n", " \"Featurizations\": \n", " [\n", " {\"AttributeName\": \"target_value\", \n", " \"FeaturizationPipeline\": \n", " [\n", " {\"FeaturizationMethodName\": \"filling\", \n", " \"FeaturizationMethodParameters\": \n", " {\"frontfill\": \"none\", \n", " \"middlefill\": \"zero\", \n", " \"backfill\": \"zero\"}\n", " }\n", " ]\n", " }\n", " ]\n", " }\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictorArn=create_predictor_response['PredictorArn']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the status of the predictor. When the status changes from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to the next steps. Depending on data size, model selection and hyperparameters, it can take from 10 minutes to more than one hour to become **ACTIVE**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status_indicator = util.StatusIndicator()\n", "\n", "while True:\n", " status = forecast.describe_predictor(PredictorArn=predictorArn)['Status']\n", " status_indicator.update(status)\n", " if status in ('ACTIVE', 'CREATE_FAILED'): break\n", " time.sleep(10)\n", "\n", "status_indicator.end()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get Error Metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get the accuracy metrics of the predictor we just created using Auto ML. The response will be a dictionary with all available recipes. Auto ML works out the best one for our predictor." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecast.get_accuracy_metrics(PredictorArn=predictorArn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Forecast" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now create a forecast using the model that was trained." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecastName= project+'_aml_forecast'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_forecast_response=forecast.create_forecast(ForecastName=forecastName,\n", " PredictorArn=predictorArn)\n", "forecastArn = create_forecast_response['ForecastArn']" ] },
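{ "cell_type": "markdown", "metadata": {}, "source": [ "By default, a forecast is generated at the 0.1, 0.5 and 0.9 quantiles. If you need different quantiles, `create_forecast` also accepts a `ForecastTypes` parameter; the cell below is a commented-out sketch of that option, with an illustrative forecast name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch only (not executed): request specific quantiles for the forecast.\n", "# custom_forecast_response = forecast.create_forecast(\n", "#     ForecastName=project + '_aml_forecast_custom',\n", "#     PredictorArn=predictorArn,\n", "#     ForecastTypes=[\"0.5\", \"0.9\"])" ] },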
{ "cell_type": "markdown", "metadata": {}, "source": [ "Check the status of the forecast process. When the status changes from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to the next steps. Depending on data size, model selection and hyperparameters, it can take from 10 minutes to more than one hour to become **ACTIVE**. There is no output here while the next cell runs, but that is fine as long as the `*` (the running indicator) is shown." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status_indicator = util.StatusIndicator()\n", "\n", "while True:\n", " status = forecast.describe_forecast(ForecastArn=forecastArn)['Status']\n", " status_indicator.update(status)\n", " if status in ('ACTIVE', 'CREATE_FAILED'): break\n", " time.sleep(10)\n", "\n", "status_indicator.end()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get Forecast" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once created, the forecast results are ready and you can view them. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecastResponse = forecastquery.query_forecast(\n", " ForecastArn=forecastArn,\n", " Filters={\"item_id\":\"client_12\"}\n", ")\n", "print(forecastResponse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Export Forecast" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can export the forecast to your S3 bucket. To do so, a role with S3 put access is needed, but this has already been created earlier in the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecastExportName= project+'_aml_forecast_export'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "outputPath=\"s3://\"+bucketName+\"/output\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecast_export_response = forecast.create_forecast_export_job(\n", " ForecastExportJobName = forecastExportName,\n", " ForecastArn=forecastArn, \n", " Destination = {\n", " \"S3Config\" : {\n", " \"Path\":outputPath,\n", " \"RoleArn\": role_arn\n", " } \n", " }\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecastExportJobArn = forecast_export_response['ForecastExportJobArn']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status_indicator = util.StatusIndicator()\n", "\n", "while True:\n", " status = forecast.describe_forecast_export_job(ForecastExportJobArn=forecastExportJobArn)['Status']\n", " status_indicator.update(status)\n", " if status in ('ACTIVE', 'CREATE_FAILED'): break\n", " time.sleep(10)\n", "\n", "status_indicator.end()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the S3 bucket for the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3.list_objects(Bucket=bucketName,Prefix=\"output\")" ] },
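{ "cell_type": "markdown", "metadata": {}, "source": [ "If you want to inspect the exported forecast locally, the cell below is a minimal sketch: it downloads the first exported CSV part and previews it with pandas. It assumes the export job above produced at least one `.csv` object under the `output` prefix; the local file name is illustrative only." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: download the first exported CSV part and preview it.\n", "objs = s3.list_objects(Bucket=bucketName, Prefix=\"output\").get(\"Contents\", [])\n", "csv_keys = [o[\"Key\"] for o in objs if o[\"Key\"].endswith(\".csv\")]\n", "s3.download_file(Bucket=bucketName, Key=csv_keys[0], Filename=\"forecast_export_part.csv\")\n", "pd.read_csv(\"forecast_export_part.csv\").head()" ] },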
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Cleanup\n", "\n", "Once we have completed the above steps, we can clean up the resources we created. All the created resources can be deleted with `delete_resource_tree`; since it is an asynchronous operation, we have added the helpful `wait_till_delete` function. To learn more about deleting a parent resource and all its child resources, see the [DeleteResourceTree](https://docs.aws.amazon.com/forecast/latest/dg/API_DeleteResourceTree.html) API. \n", "Resource limits are documented <a href=\"https://docs.aws.amazon.com/forecast/latest/dg/limits.html\">here</a>." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the DatasetGroup and all its child resources such as Predictor, PredictorBacktestExportJob, Forecast and ForecastExportJob\n", "util.wait_till_delete(lambda: forecast.delete_resource_tree(ResourceArn=datasetGroupArn))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the Dataset and its child DatasetImportJob resources:\n", "util.wait_till_delete(lambda: forecast.delete_resource_tree(ResourceArn=datasetArn))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the IAM role\n", "util.delete_iam_role( role_name )" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "toc": { "collapse_to_match_collapsible_headings": false, "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, "nav_menu": { "height": "253px", "width": "254px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": false, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 4 }