**Measuring Demand Forecasting benefits series**

# Generating forecasts with Amazon Forecast

> *This notebook should work with the **`Data Science 3.0`** kernel in SageMaker Studio (older versions may see errors), and the default `ml.t3.medium` instance type (2 vCPU + 4 GiB RAM)*

In this notebook we'll walk through the process of importing data, training models, extracting metrics, and (optionally) producing forward-looking forecasts in [Amazon Forecast](https://aws.amazon.com/forecast/) - using the synthetic sample dataset and Python code.

You could instead work with Amazon Forecast manually through [the AWS Console UI](https://docs.aws.amazon.com/forecast/latest/dg/gs-console.html), other APIs and SDKs, or with more advanced pipeline automations like [this one using AWS CDK](https://github.com/aws-samples/amazon-forecast-mlops-pipeline-cdk) or the [Improving Forecast Accuracy with Machine Learning Solution](https://aws.amazon.com/solutions/implementations/improving-forecast-accuracy-with-machine-learning/) from AWS Solutions.

This notebook is provided to give users a relatively automated way to build Amazon Forecast models, in preparation for evaluating and comparing forecast business benefits. For more in-depth and step-by-step introductions to the mechanics of Forecast itself, check out the official [aws-samples/amazon-forecast-samples](https://github.com/aws-samples/amazon-forecast-samples) repository.

## Contents

1. [Dependencies and setup](#Dependencies-and-setup)
1. [Prepare data](#Prepare-data)
1. [Define and import datasets in Amazon Forecast](#Define-and-import-datasets-in-Amazon-Forecast)
1. [Train predictor model](#Train-predictor-model)
1. [Export predictor backtest results](#Export-predictor-backtest-results)
1. [(Optional) Create and export a forecast](#(Optional)-Create-and-export-a-forecast)
1. [Next steps](#Next-steps)

## Dependencies and setup

Before getting started, we'll first import the libraries this notebook needs (all of which should be pre-installed on the supported SageMaker notebook kernel listed above), and configure where in Amazon S3 the input and output datasets should be stored:

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import json
import logging
import os
from time import sleep # For polling waits

# External Dependencies:
import boto3 # General-purpose AWS SDK for Python
import numpy as np # Numerical/math processing tools
import pandas as pd # Tabular/dataframe processing tools
import sagemaker # SageMaker SDK used just to look up default S3 bucket

# Local Dependencies:
import util

# Configuration:
BUCKET_NAME = sagemaker.Session().default_bucket()
BUCKET_PREFIX = "measuring-forecast-benefits/"

os.makedirs("dataset", exist_ok=True)

### IAM access permissions

- To access data on your behalf, Amazon Forecast needs an [AWS IAM Execution Role](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-iam-roles.html) with appropriate S3 permissions.
- To create and manage Amazon Forecast resources, **this notebook** also needs an Execution Role with Amazon Forecast permissions. If you're running this notebook in Amazon SageMaker, it will have an [associated role already](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). If you're running this notebook locally instead, you'll need to [set up your CLI credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

These are normally **two separate roles**, but could be combined into one if it's appropriate for your security strategy.

First, we'll check below that **this notebook** has basic Amazon Forecast access:

> ⚠️ **If this check fails:** Find your notebook's identity/role in the [IAM Console](https://console.aws.amazon.com/iamv2/home?#/roles) and consider attaching the `AmazonForecastFullAccess` permission.

In [None]:
forecast = boto3.client("forecast")

try:
    forecast.list_dataset_groups()
    print("SUCCESS: Notebook can call (at least basic) Amazon Forecast APIs")
except Exception as err:
    try:  # Try to look up the NB role to help users find it for fixing permissions:
        nb_role_arn = sagemaker.get_execution_role()
    except Exception:
        nb_role_arn = None
    print(
        "ERROR: Notebook does not have access to Amazon Forecast APIs. Try attaching the "
        "'AmazonForecastFullAccess' permission to your execution role.\n\nDetected Role: %s"
        % nb_role_arn
    )
    raise err

For your **Amazon Forecast role**, you'll need to either [set this up by hand](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-iam-roles.html) or grant your notebook additional permissions to create it for you.

> ℹ️ **Tip:** If you have [Amazon SageMaker Canvas](https://aws.amazon.com/sagemaker/canvas/) set up [with forecasting enabled](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-set-up-forecast.html), you may already be able to use your SageMaker Execution Role as a Forecast role. Try setting `forecast_role_arn = nb_role_arn` below.

Edit the cell below to insert your own Amazon Forecast Role ARN - or you can *try* to run it as-is to set up the role via the notebook:

> ⚠️ **Note:** If you *do* temporarily attach administrative permissions like `IAMFullAccess` to your notebook execution role to allow it to create the Amazon Forecast role on your behalf, remember to remove these permissions when no longer needed - following the [principle of least privilege](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html). 

In [None]:
# TODO: Replace below with your own ARN if you create one manually:
forecast_role_arn = util.iam.ensure_default_forecast_role()

There's **one final requirement** that we can't really test for here: Your notebook role needs permission to "pass" (use) your Amazon Forecast role.

> ⚠️ [**Check**](https://console.aws.amazon.com/iamv2/home?#/roles) your notebook's execution Role/identity has an attached policy granting the `iam:PassRole` permission on your Amazon Forecast role.
>
> If needed, you can **Create an inline policy on your notebook role** in the AWS Console to grant this access. The JSON for this policy could be similar to:
>
> ```json
> {
> "Version": "2012-10-17",
> "Statement": [
> {
> "Sid": "PassRoleForForecast",
> "Effect": "Allow",
> "Action": "iam:PassRole",
> "Resource": ""
> }
> ]
> }
> ```
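
If you prefer to generate the policy document programmatically, the sketch below builds the same structure in Python. The function name `build_pass_role_policy` and the example ARN are illustrative placeholders, not part of this sample's `util` package: substitute the ARN of your own Amazon Forecast role.

```python
import json


def build_pass_role_policy(forecast_role_arn: str) -> dict:
    """Return an inline IAM policy allowing iam:PassRole on a single role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PassRoleForForecast",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                # Scope the permission to just the one role being passed:
                "Resource": forecast_role_arn,
            }
        ],
    }


# Placeholder ARN for illustration only - replace with your own Forecast role:
example_arn = "arn:aws:iam::123456789012:role/ExampleForecastRole"
policy_json = json.dumps(build_pass_role_policy(example_arn), indent=2)
print(policy_json)
```

Scoping `Resource` to one role ARN, rather than a wildcard, keeps the grant in line with the least-privilege guidance mentioned earlier.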

## Prepare data

With service permissions set up, we're ready to prepare our datasets and start using them in Amazon Forecast.

We'll choose the [RETAIL domain](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html) for this project, which will influence the [minimum required schema](https://docs.aws.amazon.com/forecast/latest/dg/retail-domain.html) for the prepared datasets. You can find more information about preparing and importing data for Amazon Forecast in the [Importing Datasets](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-datasets-groups.html), [Dataset Guidelines](https://docs.aws.amazon.com/forecast/latest/dg/dataset-import-guidelines-troubleshooting.html), and [Guidelines and Quotas](https://docs.aws.amazon.com/forecast/latest/dg/limits.html) pages of the Amazon Forecast Developer Guide.

### Target Time-Series (TTS)

The mandatory TTS dataset records the historical values of the quantity you actually want to predict: In this case, sales of products.

For our sample, we'll first load the synthetic sales dataset before adjusting it for Amazon Forecast:

In [None]:
sales_raw_df = pd.read_parquet("s3://measuring-forecast-benefits-assets/dataset/v1/sales.parquet")
sales_raw_df

This dataset is already close to our target format: We'll use `sku` as the `item_id` field and treat `location` as a [dimension](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-predictor.html#creating-predictors).

The only preparation needed is to:

- Rename some columns to match the [required field names](https://docs.aws.amazon.com/forecast/latest/dg/retail-domain.html#target-time-series-type-retail-domain)
- Explicitly store timestamps in a [supported timestamp format](https://docs.aws.amazon.com/forecast/latest/dg/dataset-import-guidelines-troubleshooting.html) - we'll use daily `yyyy-MM-dd` as this data has no sub-daily variations

In [None]:
tts_df = sales_raw_df.rename(
 columns={"date": "timestamp", "sku": "item_id", "sales": "demand"},
)
tts_df["timestamp"] = tts_df["timestamp"].dt.strftime("%Y-%m-%d")
tts_df
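
As an optional sanity check, the sketch below shows one way to confirm that a timestamp string strictly follows the daily `yyyy-MM-dd` layout before uploading. The helper name `is_daily_timestamp` is hypothetical and uses only the Python standard library; it is not part of this sample's utilities.

```python
from datetime import datetime


def is_daily_timestamp(value: str) -> bool:
    """Return True only for valid, zero-padded yyyy-MM-dd date strings."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime tolerates non-padded input such as "2024-1-5", so round-trip
    # the parsed value and require an exact match with the original string:
    return parsed.strftime("%Y-%m-%d") == value


for candidate in ["2024-01-05", "2024-1-5", "05/01/2024", "2024-02-30"]:
    print(candidate, is_daily_timestamp(candidate))
```

With a DataFrame, a check like this could be applied to a sample of the `timestamp` column rather than every row.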

Once the data is prepared, we're ready to upload it to Amazon S3 to use with the Forecast service:

In [None]:
tts_s3_uri = f"s3://{BUCKET_NAME}/{BUCKET_PREFIX}training-data/tts/tts.parquet"
tts_df.to_parquet(tts_s3_uri, index=False)
print(f"Uploaded TTS to: {tts_s3_uri}")

We also need to compile the [Amazon Forecast Schema](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-datasets-groups.html#howitworks-dataset) for each dataset to be imported, so we may as well detect it automatically from the dataframe columns here:

In [None]:
FORECAST_DIMENSIONS = [col for col in tts_df if col not in ("timestamp", "demand", "item_id")]
print("Forecast Dimensions:\n ", FORECAST_DIMENSIONS)

N_DIMENSION_COMBOS = len(tts_df[["item_id"] + FORECAST_DIMENSIONS].drop_duplicates())
print(f"{N_DIMENSION_COMBOS} unique item/dimension combinations", "\n")

tts_schema = util.amzforecast.autodiscover_dataframe_schema(
 tts_df,
 overrides={"demand": "float"},
)
print("TTS Dataset Schema:\n" + json.dumps(tts_schema, indent=2))
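
To give a feel for what schema discovery involves, here is a simplified, standard-library sketch that guesses attribute types from a sample record. It is *not* the implementation of `util.amzforecast.autodiscover_dataframe_schema` (see the `util` folder for that): the function name `infer_forecast_schema` and its type rules are assumptions for illustration. The output follows the `AttributeName`/`AttributeType` structure that Amazon Forecast schemas use.

```python
from datetime import date, datetime


def infer_forecast_schema(records, overrides=None):
    """Guess Forecast attribute types from the first record's Python types."""
    overrides = overrides or {}
    attributes = []
    for name, value in records[0].items():
        if name in overrides:
            attr_type = overrides[name]  # Explicit override always wins
        elif name == "timestamp" or isinstance(value, (date, datetime)):
            attr_type = "timestamp"
        elif isinstance(value, int):
            attr_type = "integer"
        elif isinstance(value, float):
            attr_type = "float"
        else:
            attr_type = "string"
        attributes.append({"AttributeName": name, "AttributeType": attr_type})
    return attributes


# Made-up sample record, shaped like the prepared TTS columns:
records = [{"timestamp": "2024-01-01", "item_id": "sku_1", "location": "loc_1", "demand": 3}]
schema = infer_forecast_schema(records, overrides={"demand": "float"})
print(schema)
```

The override matters here because whole-number sales would otherwise be inferred as `integer`, which is why the cell above forces `demand` to `float`.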

One other thing we need to configure **before we prepare RTS data** is the **forecast horizon**.

When preparing other time-varying inputs later we'll need a solid understanding of what future period the forecast itself covers, for aligning our RTS inputs to cover that period. For more information on what frequencies and time granularities Amazon Forecast supports, see [this page](https://docs.aws.amazon.com/forecast/latest/dg/data-aggregation.html) in the Developer Guide.

In [None]:
# Configure forecast horizon and frequency:
FORECAST_HORIZON = pd.offsets.Day() * 31 # or e.g. Hour(), Week(), MonthEnd()
print(f"Configured forecast horizon: {FORECAST_HORIZON}")
FORECAST_FREQ = FORECAST_HORIZON.base.freqstr # This should be Amazon Forecast compatible
print(f"({FORECAST_HORIZON.n} units of '{FORECAST_FREQ}')\n")

# Check TTS history end date (forecast start minus one) is as you expect:
TTS_END_DATE = pd.to_datetime(max(tts_df["timestamp"]))
print(f"Historical data end date: {TTS_END_DATE}")
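
To make the relationship between the history end date and the forecast period concrete, the sketch below computes the first and last forecast dates for a daily horizon. The helper `forecast_period` and the example end date are hypothetical, chosen only to illustrate the arithmetic.

```python
from datetime import date, timedelta


def forecast_period(history_end: date, horizon_days: int):
    """Return (first, last) forecast dates for a daily-frequency horizon."""
    first = history_end + timedelta(days=1)  # Forecast starts the day after history ends
    last = history_end + timedelta(days=horizon_days)
    return first, last


# Illustrative end date only - your TTS_END_DATE will differ:
first_day, last_day = forecast_period(date(2024, 1, 31), 31)
print(f"Forecast covers {first_day} to {last_day}")
print(f"{(last_day - first_day).days + 1} daily data points per series")
```

Any forward-looking Related Time-Series inputs prepared later should extend through the last date of this period.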

Once the data is prepared and uploaded to Amazon S3, and we have the schema extracted, we can delete unnecessary variables to save notebook memory. The only extra information we'll need to keep for later is which combinations of location and item_id are present in the dataset:

In [None]:
item_location_combos = tts_df[["location", "item_id"]].drop_duplicates()

In [None]:
del sales_raw_df
del tts_df

### Static Item Metadata

The optional [Item Metadata dataset](https://docs.aws.amazon.com/forecast/latest/dg/item-metadata-datasets.html) records metadata about forecast items (i.e. products/SKUs) that **does not change over time**: I.e. a table of attributes keyed by unique `item_id`.

Note that any other **dimensions** in the TTS dataset are not included in this lookup: Item Metadata has one key attribute only, so in cases like this sample data you may want to choose between representing certain fields (like store/location) as either **dimensions**, or **incorporating them into the item ID** so that `skuXYZ-storeABC` becomes one "item ID".

In this example we'll keep `location` as a dimension, so the Item Metadata dataset cannot include it or any product attributes that are location-specific.

In [None]:
metadata_raw_df = pd.read_csv("s3://measuring-forecast-benefits-assets/dataset/v1/metadata.csv")
metadata_raw_df

Again there's very little preparation required for this dataset: We'll just rename the `sku` field to required name `item_id`:

In [None]:
metadata_df = metadata_raw_df.rename(columns={"sku": "item_id"})
metadata_df

We're then ready to upload the item metadata to Amazon S3:

In [None]:
metadata_s3_uri = f"s3://{BUCKET_NAME}/{BUCKET_PREFIX}training-data/metadata/metadata.csv"
metadata_df.to_csv(metadata_s3_uri, index=False)
print(f"Uploaded Item Metadata to: {metadata_s3_uri}")

...And extract the schema:

In [None]:
metadata_schema = util.amzforecast.autodiscover_dataframe_schema(metadata_df)
print(json.dumps(metadata_schema, indent=2))

### Related Time-Series (RTS)

The [optional Related Time-Series dataset](https://docs.aws.amazon.com/forecast/latest/dg/related-time-series-datasets.html) provides other input variables to your forecast that **vary over time**.

Popular time-varying features to help predict future demand include pricing and promotions, public holidays and events, and even weather information. We need to prepare one consolidated dataset of all the RTS features we wish to include, in this sample building from two base datasets: Public holidays by country, and product prices/promotions - as loaded below.

#### Weekends and holidays

The source weekend and holiday data has already been prepared in a flat file format. However, it extends for a full year beyond our TTS end date - so we need to trim it for only the forecasting period of interest:

In [None]:
holiday_raw_df = pd.read_csv(
 "s3://measuring-forecast-benefits-assets/dataset/v1/weekend_holiday_flag.csv",
)
holiday_raw_df["date"] = pd.to_datetime(holiday_raw_df["date"]) # (As CSV)

# Filter out any data beyond the end of the forecasting horizon:
holiday_raw_df = holiday_raw_df[
 holiday_raw_df["date"] <= pd.to_datetime(TTS_END_DATE) + FORECAST_HORIZON
]

holiday_raw_df

#### Prices and promotions

The price and promotion data is likewise available in a flat file format already:

In [None]:
prices_raw_df = pd.read_parquet(
 "s3://measuring-forecast-benefits-assets/dataset/v1/prices_promos.parquet",
)
# (No need to parse datetimes from parquet)
prices_raw_df

...But unlike the holidays reference data, it **doesn't extend beyond the TTS end date** at all.

This is a problem because, as discussed [in the Amazon Forecast Developer Guide](https://docs.aws.amazon.com/forecast/latest/dg/related-time-series-datasets.html#related-time-series-historical-futurelooking), "forward-looking" inputs (where we know or hypothesize the values during the forecast period) are much more valuable to model accuracy than "historical-only" data (where the model doesn't know what to expect during the forecast period).

Ideally, you would already have some plan for pricing actions in the near future. You could, if needed, build models with multiple different pricing scenarios and explore how forecasted demand changes (the *price elasticity of demand*).

In this example, we'll just project forward the current price throughout the forecast horizon.

First, create a dataframe of empty `NaN` placeholders for all the future dates and items:

In [None]:
prices_dimensions = [
 c for c in prices_raw_df.columns if c not in ("date", "promo", "unit_price")
]

prices_future = pd.merge(
 # Range of dates in the forecast period:
 pd.date_range(
 TTS_END_DATE,
 TTS_END_DATE + FORECAST_HORIZON,
 freq=FORECAST_FREQ,
 inclusive="right",
 name="date",
 ).to_series(),
 # Unique combinations of country+product:
 prices_raw_df[prices_dimensions].drop_duplicates(),
 # Cross join (all combinations):
 how="cross",
)
prices_future["promo"] = float("nan")
prices_future["unit_price"] = float("nan")
prices_future

Then, concatenate the future placeholders with the historical prices, and index and sort the data by the breakdown dimensions *before* date:

In [None]:
tmp = pd.concat([prices_raw_df, prices_future]).set_index(prices_dimensions + ["date"]).sort_index()
tmp

We're now ready to forward-fill, and reset the index back to regular columns:

In [None]:
prices_projected_df = tmp.groupby(level=prices_dimensions).ffill().reset_index()

# Delete temp variables to save space and make sure we don't accidentally use the wrong ones:
del prices_future
del prices_raw_df
del tmp

prices_projected_df

If you like, you can inspect this DataFrame to validate the continuity (e.g. `Brazil` `Gloves` will keep using the same `promo` and `unit_price` for records after `TTS_END_DATE` - and likewise for each other combination of country and product type).
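
The same cross-join and grouped forward-fill pattern is shown below on a tiny toy table, together with a programmatic continuity check. All names and values in this sketch are made up for illustration, and it assumes a pandas version that supports `how="cross"` merges (1.2 or later).

```python
import pandas as pd

# Toy history: two countries, two days each (all values are made up).
history = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "date": pd.to_datetime(["2024-01-01", "2024-01-02"] * 2),
        "unit_price": [1.0, 1.5, 2.0, 2.5],
    }
)

# NaN placeholders for two future dates, cross-joined with every country:
future = pd.merge(
    pd.DataFrame({"date": pd.to_datetime(["2024-01-03", "2024-01-04"])}),
    history[["country"]].drop_duplicates(),
    how="cross",
)
future["unit_price"] = float("nan")

combined = pd.concat([history, future]).set_index(["country", "date"]).sort_index()
# Forward-fill within each country so values never leak between groups:
projected = combined.groupby(level="country")["unit_price"].ffill().reset_index()

# Continuity check: every future row should equal its group's last known price.
last_known = history.sort_values("date").groupby("country")["unit_price"].last()
future_rows = projected[projected["date"] > history["date"].max()]
is_continuous = (future_rows["unit_price"] == future_rows["country"].map(last_known)).all()
print(projected)
print("Future prices match last known price:", bool(is_continuous))
```

Grouping before the forward-fill is the important design choice: a plain `ffill()` on the sorted table could carry one group's last price into the first rows of the next group if that group started with missing values.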

#### Pulling the RTS together

With our end dates aligned and set up to fully cover the forecast period, we're ready to combine the two datasets and normalize the dimensions to match the Target Time-Series (i.e. map `country` to `location`s and `product` to `item_id`s).

First, we'll join them together:

In [None]:
rts_df = pd.merge(holiday_raw_df, prices_projected_df, on=["date", "country"], how="outer")
rts_df

This dataset is almost ready, but we need to expand from `country` to cover all separate `location` IDs and from `product` to cover all separate `item_id`s in the sales dataset. We can refer to the unique location/item_id list saved from earlier to do this:

In [None]:
# Construct the reference table:
item_location_combos["country"] = item_location_combos["location"].str.split("_").str[0]
item_location_combos["product"] = item_location_combos["item_id"].str.split("_").str[0]
item_location_combos

In [None]:
# Join to map country/product to locations/item_ids:
rts_df = (
 pd.merge(
 item_location_combos,
 rts_df,
 on=["country", "product"],
 how="outer",
 )
 .drop(columns=["country", "product"])
 .rename(columns={"date": "timestamp"})
)

# Standardize timestamp representation, as with TTS:
rts_df["timestamp"] = rts_df["timestamp"].dt.strftime("%Y-%m-%d")

rts_df
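
Because the RTS is assembled from several joins, it can be worth checking that every item/location series has a row for every date in the forecast period. The standard-library sketch below illustrates one way to list gaps. The helper `missing_rts_rows` and the IDs and dates are hypothetical; with real data you would build `rts_keys` from the prepared DataFrame.

```python
from datetime import date, timedelta


def missing_rts_rows(rts_keys, series_ids, start, end):
    """List (series_id, day) pairs in [start, end] that have no RTS row."""
    missing = []
    day = start
    while day <= end:
        for series_id in series_ids:
            if (series_id, day) not in rts_keys:
                missing.append((series_id, day))
        day += timedelta(days=1)
    return missing


# Hypothetical (item_id, location) series and dates, for illustration only:
series_ids = [("sku_1", "loc_1"), ("sku_2", "loc_1")]
start, end = date(2024, 2, 1), date(2024, 2, 3)
rts_keys = {(sid, start + timedelta(days=n)) for sid in series_ids for n in range(3)}
rts_keys.discard((("sku_2", "loc_1"), date(2024, 2, 3)))  # Simulate one gap

gaps = missing_rts_rows(rts_keys, series_ids, start, end)
print(gaps)
```

Fill settings configured at training time can paper over some gaps, but knowing where they are helps you decide whether the filled values are reasonable.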

As previously, once dataset preparation is complete we'll upload the data to Amazon S3:

In [None]:
rts_s3_uri = f"s3://{BUCKET_NAME}/{BUCKET_PREFIX}training-data/rts/rts.parquet"
rts_df.to_parquet(rts_s3_uri, index=False)
print(f"Uploaded Related Time-Series to: {rts_s3_uri}")

...And extract the schema:

In [None]:
rts_schema = util.amzforecast.autodiscover_dataframe_schema(rts_df)
print(json.dumps(rts_schema, indent=2))

We can also clear the tables to save memory in the notebook:

In [None]:
del holiday_raw_df
del prices_projected_df
del rts_df

## Define and import datasets in Amazon Forecast

Once the datasets are available on Amazon S3, with known schemas conforming to Amazon Forecast's requirements, we're ready to define the Dataset Group in Forecast and import the datasets themselves.

### Define the Dataset Group

First, run the below cells to configure and set up the **schema and structure** of your datasets:

In [None]:
# Configurations:
DATASET_GROUP_NAME = "benefits_demo"
DOMAIN = "RETAIL"

> ℹ️ **Tip:** The `util.amzforecast.create_or_reuse_...` functions we use throughout this notebook are just thin wrappers over the corresponding [Forecast boto3 client create_... methods](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/forecast.html), to transparently re-use resources (instead of raising errors) if they already exist.
>
> This helps make it quick to re-run the notebook if the kernel restarts, but of course may not always be the behaviour you want (for example if changing settings). You can check out the implementation in [util/amzforecast.py](util/amzforecast.py), and swap out calls for e.g. `forecast.create_dataset_group(...)` if you like.

In [None]:
dsg_arn = util.amzforecast.create_or_reuse_dataset_group(
 Domain=DOMAIN,
 DatasetGroupName=DATASET_GROUP_NAME,
)

In [None]:
TTS_DATASET_NAME = f"{DATASET_GROUP_NAME}_tts"
tts_arn = util.amzforecast.create_or_reuse_dataset(
 DatasetName=TTS_DATASET_NAME,
 Domain=DOMAIN,
 DatasetType="TARGET_TIME_SERIES",
 DataFrequency=FORECAST_FREQ,
 Schema={"Attributes": tts_schema},
)

In [None]:
METADATA_DATASET_NAME = f"{DATASET_GROUP_NAME}_meta"
metadata_arn = util.amzforecast.create_or_reuse_dataset(
 DatasetName=METADATA_DATASET_NAME,
 Domain=DOMAIN,
 DatasetType="ITEM_METADATA",
 Schema={"Attributes": metadata_schema},
)

In [None]:
RTS_DATASET_NAME = f"{DATASET_GROUP_NAME}_rts"
rts_arn = util.amzforecast.create_or_reuse_dataset(
 DatasetName=RTS_DATASET_NAME,
 Domain=DOMAIN,
 DatasetType="RELATED_TIME_SERIES",
 DataFrequency=FORECAST_FREQ,
 Schema={"Attributes": rts_schema},
)

This final cell links your Datasets to the Dataset Group (note that you can also change which datasets are included in a DSG later, but only datasets linked to a DSG will appear in the AWS Console for Amazon Forecast):

In [None]:
forecast.update_dataset_group(
 DatasetGroupArn=dsg_arn,
 DatasetArns=[tts_arn, rts_arn, metadata_arn],
)

### Import data

With the schemas defined, we can **import the actual data** from Amazon S3 into the Amazon Forecast service, by creating a batch import job for each dataset.

The import process involves validating your data, so it is asynchronous and can take some time to complete. In the cells below, we kick off all 3 jobs and then wait for them to complete in parallel:

In [None]:
tts_import_arn = util.amzforecast.create_dataset_import_job_by_hash(
 DatasetArn=tts_arn,
 DataSource={
 "S3Config": {
 "Path": tts_s3_uri,
 "RoleArn": forecast_role_arn,
 },
 },
 Format="PARQUET",
 TimestampFormat="yyyy-MM-dd",
)

In [None]:
metadata_import_arn = util.amzforecast.create_dataset_import_job_by_hash(
 DatasetArn=metadata_arn,
 DataSource={
 "S3Config": {
 "Path": metadata_s3_uri,
 "RoleArn": forecast_role_arn,
 },
 },
 Format="CSV",
)

In [None]:
rts_import_arn = util.amzforecast.create_dataset_import_job_by_hash(
 DatasetArn=rts_arn,
 DataSource={
 "S3Config": {
 "Path": rts_s3_uri,
 "RoleArn": forecast_role_arn,
 },
 },
 Format="PARQUET",
 TimestampFormat="yyyy-MM-dd",
)

In [None]:
pending_jobs = [tts_import_arn, metadata_import_arn, rts_import_arn]


def are_imports_finished(job_descs):
    global pending_jobs
    for desc in job_descs:
        status = desc["Status"]
        if status == "ACTIVE":
            # Drop the finished job so it is not polled again:
            pending_jobs = [
                job_arn for job_arn in pending_jobs if job_arn != desc["DatasetImportJobArn"]
            ]
            if len(pending_jobs) == 0:
                return True
        elif "FAILED" in status:
            raise ValueError(f"Data import failed!\n{desc}")
    return False


def max_all_etas(job_descs):
 eta_mins_by_job = list(
 filter(
 lambda t: t is not None,
 (d.get("EstimatedTimeRemainingInMinutes") for d in job_descs),
 )
 )
 return f"{max(eta_mins_by_job)} mins" if len(eta_mins_by_job) > 0 else None


util.progress.polling_spinner(
 # Call DescribeDatasetImportJob on all jobs:
 fn_poll_result=lambda: [
 forecast.describe_dataset_import_job(DatasetImportJobArn=job_arn)
 for job_arn in pending_jobs
 ],
 # Check if *all* jobs finished and cut finished jobs from list:
 fn_is_finished=are_imports_finished,
 # Stringify status as number of jobs remaining:
 fn_stringify_result=lambda descs: f"{len(descs)} jobs pending",
 # Get max of ETA from all outstanding jobs:
 fn_eta=max_all_etas,
 poll_secs=30,
 timeout_secs=60 * 60, # Max 1 hour
)
print("Data imported")
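
For readers curious about the waiting logic, the sketch below shows a minimal poll-until-finished loop with a timeout. It is an illustration only, not the implementation of `util.progress.polling_spinner`: the function `poll_until` and its `sleep_fn` parameter are assumptions, with the sleep made injectable so the demo runs instantly.

```python
import time


def poll_until(poll_fn, is_finished, poll_secs=30.0, timeout_secs=3600.0, sleep_fn=time.sleep):
    """Call poll_fn repeatedly until is_finished(result) is true, or time out."""
    deadline = time.monotonic() + timeout_secs
    while True:
        result = poll_fn()
        if is_finished(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Not finished after {timeout_secs} seconds")
        sleep_fn(poll_secs)


# Demo with a fake job that becomes ACTIVE on the third poll (no real waiting):
_statuses = iter(["CREATE_PENDING", "CREATE_IN_PROGRESS", "ACTIVE"])
final = poll_until(
    poll_fn=lambda: {"Status": next(_statuses)},
    is_finished=lambda desc: desc["Status"] == "ACTIVE",
    poll_secs=0,
    sleep_fn=lambda secs: None,
)
print(final)
```

Using a monotonic clock for the deadline avoids surprises if the system clock is adjusted while a long-running job is being polled.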

> ⏰ These dataset imports can take several minutes to complete: We saw a wait of around 10 minutes with the sample dataset.
>
> This period includes behind-the-scenes overhead for the service to spin up managed infrastructure to analyze your data, so don't worry: It scales much better than linearly as data size increases.

## Train predictor model

In Amazon Forecast, a trained model is called a "Predictor". After setting up the base datasets and importing data, you're ready to train (one or more) forecast models.

In this section we'll kick off training of a new AutoPredictor, and then wait for that process to complete (which may take multiple hours). You can find more information about the parameters and process in the [Training a predictor section](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-predictor.html) of the Amazon Forecast Developer Guide.

In [None]:
TRAIN_FORECAST_TYPES = ["mean", "0.10", "0.50", "0.90"]
METRIC = "RMSE"
PREDICTOR_NAME = f"{DATASET_GROUP_NAME}_auto_1"

In [None]:
predictor_arn = util.amzforecast.create_or_reuse_auto_predictor(
 PredictorName=PREDICTOR_NAME,
 DataConfig={
 "DatasetGroupArn": dsg_arn,
 "AttributeConfigs": [
 # Multi-record aggregation and missing value filling logic:
 {
 "AttributeName": "demand",
 # (Note only TTS accept aggregation parameter)
 "Transformations": {"aggregation": "sum", "middlefill": "zero", "backfill": "zero"},
 },
 {
 "AttributeName": "weekend_hol_flag",
 "Transformations": {
 "middlefill": "zero",
 "backfill": "zero",
 "futurefill": "zero",
 },
 },
 {
 "AttributeName": "promo",
 "Transformations": {
 "middlefill": "value",
 "middlefill_value": "1",
 "backfill": "value",
 "backfill_value": "1",
 "futurefill": "value",
 "futurefill_value": "1",
 },
 },
 {
 "AttributeName": "unit_price",
 "Transformations": {
 "middlefill": "mean",
 "backfill": "mean",
 "futurefill": "mean",
 },
 },
 ],
 },
 ExplainPredictor=False, # Enable for explainability report, but increased training time
 ForecastDimensions=FORECAST_DIMENSIONS,
 ForecastFrequency=FORECAST_FREQ,
 ForecastHorizon=FORECAST_HORIZON.n,
 ForecastTypes=TRAIN_FORECAST_TYPES,
 OptimizationMetric=METRIC, # Target metric for optimization between model candidates
)

In [None]:
util.progress.polling_spinner(
 fn_poll_result=lambda: forecast.describe_auto_predictor(PredictorArn=predictor_arn),
 fn_is_finished=util.amzforecast.is_forecast_resource_ready,
 fn_stringify_result=lambda desc: desc["Status"],
 fn_eta=lambda desc: (
 f"{desc['EstimatedTimeRemainingInMinutes']} mins"
 if "EstimatedTimeRemainingInMinutes" in desc else None
 ),
 poll_secs=60,
 timeout_secs=5 * 60 * 60, # Max 5 hours
)
print("Predictor model trained")

## Export predictor backtest results

Because [Amazon Forecast's pricing](https://aws.amazon.com/forecast/pricing/) charges by **generated forecasts**, here's an important cost optimization tip for PoCs and experimentation: **[Use backtest exports](https://docs.aws.amazon.com/forecast/latest/dg/metrics.html)** where appropriate, rather than generating forward-looking forecasts, to evaluate your candidate models.

When training your predictors, Amazon Forecast produces validation metrics by holding out the final portion of the data to calculate expected performance. By [creating a predictor backtest export job](https://docs.aws.amazon.com/forecast/latest/dg/API_CreatePredictorBacktestExportJob.html), you can export not only detailed accuracy metrics per item over this final period, but also the raw forecasts themselves, mapped to actuals, to support any custom analyses or metrics you might wish to calculate.

So if you'd like to compare your Forecast models to historical actuals ("offline"), you can save costs by including the full period in your Target Time-Series and running a backtest export job - versus excluding this final period and performing the reconciliation and analysis yourself.

That's exactly what we'll do for the purpose of this sample:

In [None]:
backtest_s3_uri = "s3://{}/{}forecast-backtests/{}".format(
 BUCKET_NAME,
 BUCKET_PREFIX,
 PREDICTOR_NAME,
)
print(f"Exporting backtest results to: {backtest_s3_uri}")

backtest_export_arn = util.amzforecast.create_or_reuse_predictor_backtest_export_job(
 PredictorBacktestExportJobName=PREDICTOR_NAME,
 PredictorArn=predictor_arn,
 Destination={
 "S3Config": {
 "Path": backtest_s3_uri,
 "RoleArn": forecast_role_arn,
 },
 },
 Format="PARQUET",
)

In [None]:
util.progress.polling_spinner(
 fn_poll_result=lambda: forecast.describe_predictor_backtest_export_job(
 PredictorBacktestExportJobArn=backtest_export_arn,
 ),
 fn_is_finished=util.amzforecast.is_forecast_resource_ready,
 fn_stringify_result=lambda desc: desc["Status"],
 fn_eta=lambda desc: (
 f"{desc['EstimatedTimeRemainingInMinutes']} mins"
 if "EstimatedTimeRemainingInMinutes" in desc else None
 ),
 poll_secs=30,
 timeout_secs=30 * 60, # Max 30 mins
)
print("Backtest export done")
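
Once the forecasts and actuals are exported, custom metrics are straightforward to compute. The sketch below implements RMSE, WAPE, and weighted quantile loss on plain Python lists, following their standard definitions. The function names and sample numbers are made up for illustration, and mapping the exported columns into these lists is left to your analysis code.

```python
import math


def rmse(actuals, forecasts):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / len(actuals))


def wape(actuals, forecasts):
    """Weighted absolute percentage error: total absolute error over total actuals."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / sum(abs(a) for a in actuals)


def weighted_quantile_loss(actuals, quantile_forecasts, tau):
    """wQL at quantile tau: penalizes under-forecasts by tau, over-forecasts by 1 - tau."""
    loss = sum(
        tau * max(a - q, 0.0) + (1 - tau) * max(q - a, 0.0)
        for a, q in zip(actuals, quantile_forecasts)
    )
    return 2 * loss / sum(abs(a) for a in actuals)


# Made-up values for illustration:
actuals = [10.0, 0.0, 5.0, 5.0]
mean_forecast = [8.0, 1.0, 5.0, 9.0]
p90_forecast = [12.0, 2.0, 6.0, 10.0]

print("RMSE:", rmse(actuals, mean_forecast))
print("WAPE:", wape(actuals, mean_forecast))
print("wQL[0.9]:", weighted_quantile_loss(actuals, p90_forecast, 0.9))
```

WAPE and wQL divide by total actual demand, so they are undefined for a series whose actuals are all zero. Aggregate across items first, or handle that case explicitly.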

## (Optional) Create and export a forecast

For **online** evaluation (i.e. actually testing Forecast in production), you'll need to create an actual forward-looking forecast from your model.

This is a separate process from model training because it is technically possible to repeatedly update your datasets and create new forecasts from the same predictor model. However, in practice many customers find the accuracy benefits of re-training each time outweigh the resource costs because training costs are often much lower than forecasting/inference: So it's usually best to optimize for accuracy first (re-train a predictor every month/cycle) and explore the impacts of relaxing this later.

> ⚠️ **COST WARNING**
>
> This section is included for completeness but forecast generation is *not necessary* for the final (forecast comparison) notebook, which only uses the backtest result. The sample dataset includes a high number of SKUs/locations, and generating a full forecast (31 data points for each SKU/location, with a single quantile) **could cost approx $200-250 or more** at standard pricing, ignoring any free tier allowances.
>
> Refer to the [Amazon Forecast pricing page](https://aws.amazon.com/forecast/pricing/) and check you understand the profile of your dataset and horizon before generating forward-looking forecasts. We include logic below to estimate the number of forecast time-series and data points for your configuration.

If you're sure you want to generate forecasts, set `generate_forecast = True` below to enable this section.

If you want to run a limited-scope test, you could also [limit the forecast to particular time-series](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-forecast.html#forecast-time-series) to control costs.

In [None]:
INF_FORECAST_TYPES = ["mean"] # (Could also add e.g. "0.10", "0.50", "0.90")
generate_forecast = False  # Set to True only after reviewing the cost warning above

print(f"Configured for approx {N_DIMENSION_COMBOS * len(INF_FORECAST_TYPES):,d} time-series")
print(f"({N_DIMENSION_COMBOS * len(INF_FORECAST_TYPES) * FORECAST_HORIZON.n:,d} forecast data points)")
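
To turn these counts into a rough budget figure, you could multiply by a unit price taken from the pricing page. The sketch below shows the arithmetic only. The helper names, the series count, and the price are placeholders, *not* real Amazon Forecast prices, and actual pricing may be tiered or include other charges.

```python
def estimate_forecast_volume(n_series, n_forecast_types, horizon_steps):
    """Return the number of forecast time-series and data points."""
    time_series = n_series * n_forecast_types
    return {"time_series": time_series, "data_points": time_series * horizon_steps}


def estimate_cost(data_points, price_per_1000_points):
    """Simple flat-rate estimate; look up the current rate on the pricing page."""
    return data_points / 1000 * price_per_1000_points


# Placeholder inputs for illustration only:
volume = estimate_forecast_volume(n_series=1000, n_forecast_types=1, horizon_steps=31)
PLACEHOLDER_PRICE = 0.5  # NOT a real price - replace with the published rate
print(volume)
print("Estimated cost:", estimate_cost(volume["data_points"], PLACEHOLDER_PRICE))
```

Each extra quantile in `INF_FORECAST_TYPES` multiplies the volume, so keeping to a single forecast type is the simplest way to limit cost in a trial.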

In [None]:
if generate_forecast:
 forecast_arn = util.amzforecast.create_or_reuse_forecast(
 ForecastName=PREDICTOR_NAME,
 PredictorArn=predictor_arn,
 ForecastTypes=INF_FORECAST_TYPES,
 )
else:
 print("Forecasting skipped")

In [None]:
if generate_forecast:
 util.progress.polling_spinner(
 fn_poll_result=lambda: forecast.describe_forecast(ForecastArn=forecast_arn),
 fn_is_finished=util.amzforecast.is_forecast_resource_ready,
 fn_stringify_result=lambda desc: desc["Status"],
 fn_eta=lambda desc: (
 f"{desc['EstimatedTimeRemainingInMinutes']} mins"
 if "EstimatedTimeRemainingInMinutes" in desc else None
 ),
 poll_secs=60,
 timeout_secs=2 * 60 * 60, # Max 2 hours
 )
 print("Forecast ready")

Created forecasts are [queryable](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-forecast.html#query-forecast) via a real-time [QueryForecast API](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-forecast.html#query-forecast) (and so there are also [quotas](https://docs.aws.amazon.com/forecast/latest/dg/limits.html) on the number of forecasts you can keep active concurrently).
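
As a rough sketch of what a real-time query might look like, the helper below wraps the `query_forecast` call of the boto3 `forecastquery` client and flattens the mean predictions. The function name `query_mean_forecast`, the fake client, and the filter values are assumptions for illustration. In a real notebook you would pass `boto3.client("forecastquery")` and your own forecast ARN, and the filters needed may depend on your forecast dimensions.

```python
def query_mean_forecast(query_client, forecast_arn, filters):
    """Return (timestamp, value) pairs for the 'mean' forecast of one series."""
    response = query_client.query_forecast(ForecastArn=forecast_arn, Filters=filters)
    predictions = response["Forecast"]["Predictions"]
    return [(p["Timestamp"], p["Value"]) for p in predictions.get("mean", [])]


class FakeQueryClient:
    """Offline stand-in that mimics the shape of a QueryForecast response."""

    def query_forecast(self, ForecastArn, Filters):
        return {
            "Forecast": {
                "Predictions": {
                    "mean": [{"Timestamp": "2024-02-01T00:00:00", "Value": 12.5}],
                }
            }
        }


# Hypothetical ARN and filter values, for illustration only:
points = query_mean_forecast(
    FakeQueryClient(),
    forecast_arn="arn:aws:forecast:example",
    filters={"item_id": "sku_1", "location": "loc_1"},
)
print(points)
```

Passing the client in as an argument keeps the helper easy to test without AWS credentials.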

For our analysis use case though (and for many real-world use-cases where businesses want to use some dashboarding tool to slice and explore the results), we're more interested in exporting the forecasts as a bulk dataset for all items.

The cells below will initiate this export to Amazon S3, and wait for it to complete:

In [None]:
if generate_forecast:
 export_s3_uri = "s3://{}/{}forecast-exports/{}".format(
 BUCKET_NAME,
 BUCKET_PREFIX,
 PREDICTOR_NAME,
 )
 print(f"Exporting forecast to: {export_s3_uri}")

 create_export_resp = forecast.create_forecast_export_job(
 ForecastExportJobName=PREDICTOR_NAME,
 ForecastArn=forecast_arn,
 Destination={
 "S3Config": {
 "Path": export_s3_uri,
 "RoleArn": forecast_role_arn,
 }
 },
 Format="PARQUET",
 )

 export_arn = create_export_resp["ForecastExportJobArn"]
else:
 print("Forecasting skipped")

In [None]:
if generate_forecast:
 util.progress.polling_spinner(
 fn_poll_result=lambda: forecast.describe_forecast_export_job(
 ForecastExportJobArn=export_arn,
 ),
 fn_is_finished=util.amzforecast.is_forecast_resource_ready,
 fn_stringify_result=lambda desc: desc["Status"],
 fn_eta=lambda desc: (
 f"{desc['EstimatedTimeRemainingInMinutes']} mins"
 if "EstimatedTimeRemainingInMinutes" in desc else None
 ),
 poll_secs=30,
 timeout_secs=30 * 60, # Max 30 minutes
 )
 print("\nForecast export ready:")
 print(export_s3_uri)

## Next steps

Congratulations! If you ran through this notebook successfully, you managed to import data to Amazon Forecast, build a model, and export backtest results (and optionally, batch forecast results too) to Amazon S3. You should also be able to view your Dataset Group, Predictor, and (if you created one) Forecast through the [AWS Console for Amazon Forecast](https://console.aws.amazon.com/forecast/home?#datasetGroups).

Now we're ready to dive in to analyzing the accuracy of the forecast as compared to the moving average baseline, and how that translates to actual business results. Head on over to [2. Measuring Forecast Benefits.ipynb](2.%20Measuring%20Forecast%20Benefits.ipynb) to follow along!