{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Retraining an AutoPredictor\n", "\n", "Helps advanced users start with Amazon Forecast Retraining feature for an AutoPredictor\n", "\n", "Predictors can be retained with updated datasets to keep your predictors up to date. When retraining a predictor, Amazon Forecast maintains the same predictor configuration settings. After retraining, the original predictor will remain active and the retrained predictor will have a separate Predictor ARN.\n", "\n", "Retraining a predictor can improve forecasting accuracy in two ways:\n", "\n", "1. *More current data*: Your retrained predictor will incorporate more up-to-date data when training a model.\n", "2. *Predictor improvements*: Your retrained predictor will incorporate any updates and improvements in the Amazon Forecast algorithms and additional datasets.\n", "\n", "Retraining a predictor can be up to 50% faster than creating a new predictor from scratch. Predictor training times are faster and AWS Forecast automatically uses your existing Predictor configuration during retraining to maintain model consistency.\n", "\n", "Retraining a Predictor is different from creating a predictor from scratch and is expected to carry over your existing Predictor settings and train the underlying models on top of the latest data.\n", "\n", "Note that you do not need to retrain everytime you are bringing in the new data and you can still generate forecast using the existing models. Retraining is required to periodically update the underlying models learning the latest data charecteristics and trends.\n", "\n", "For more informations about APIs, refer API [documentation](https://docs.aws.amazon.com/forecast/latest/dg/API_Operations_Amazon_Forecast_Service.html)\n", "\n", "For details about retraining a predictor, refer https://docs.aws.amazon.com/forecast/latest/dg/retrain-predictors.html\n", "\n", "As part of this notebook, we will first create an auto predictor, import new data and then retrain the predictor so that we can get a new predictor trained with the latest available data.\n", "\n", "## Table Of Contents\n", "* [Pre-requisites](#prerequisites)\n", "* [Test Setup - Creating a new AutoPredictor](#createAuto)\n", " * [Importing Data in Forecast System](#importingData)\n", " * [Creating the first Predictor](#firstPredictor)\n", "* [Retraining an AutoPredictor](#retrainAuto)\n", "* [Cleanup](#cleanup)\n", "\n", "**Read Every Cell FULLY before executing it**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pre-requisites \n", "Before we get started, lets set up the notebook environment, the AWS SDK client for Amazon Forecast and IAM Role used by Amazon Forecast to access your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "import os\n", "import time\n", "import boto3\n", "\n", "# importing forecast notebook utility from notebooks/common directory\n", "sys.path.insert( 0, os.path.abspath(\"../../common\") )\n", "import util" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configure the S3 bucket name and region name for this lesson.\n", "\n", "- If you don't have an S3 bucket, create it first on S3.\n", "- Although we have set the region to us-west-2 as a default value below, you can choose any of the regions that the service is available in." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text_widget_bucket = util.create_text_widget( \"bucketName\", \"input your S3 bucket name\" )\n", "text_widget_region = util.create_text_widget( \"region\", \"input region name.\", default_value=\"us-west-2\" )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucketName = text_widget_bucket.value\n", "assert bucketName, \"bucket_name not set. Please set the bucket name and retry\"\n", "\n", "region = text_widget_region.value\n", "assert region, \"region not set.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = boto3.Session(region_name=region)\n", "forecast = session.client(service_name='forecast')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test Setup - Creating a new AutoPredictor\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Amazon Forecast , a dataset is a collection of file(s) which contain data that is relevant for a forecasting task. A dataset must conform to a schema provided by Amazon Forecast. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this exercise, we use the individual household electric power consumption dataset. (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.) We aggregate the usage data hourly. \n", "\n", "As part of this notebook, we will be first importing subset of this data to create the first predictor and then use another subset(With additional data) to retrain the predictor " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Importing Data in Forecast System\n", "\n", "This notebooks has two csv files in the \"data\" folder where the first file \"elec_data_small_first.csv\" will be used to create the first AutoPredictor and subsequent file will be used to bring in new data for retraining." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amazon forecast can import data from Amazon S3. We first explore the data locally to see the fields" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"data/elec_data_small_first.csv\", dtype = object, header =None)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now upload the data to S3. But before doing that, go into your AWS Console, select S3 for the service and create a new bucket inside the `Oregon` or `us-west-2` region. Use that bucket name convention of `amazon-forecast-unique-value-data`. The name must be unique, if you get an error, just adjust until your name works, then update the `bucketName` cell below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = session.client('s3')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "key_first_predictor=\"elec_data/item-demand-time_first.csv\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3.upload_file(Filename=\"data/elec_data_small_first.csv\", Bucket=bucketName, Key=key_first_predictor)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the role to provide to Amazon Forecast.\n", "role_name = \"ForecastNotebookRole-Retrain\"\n", "role_arn = util.get_or_create_iam_role( role_name = role_name )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More details about `Domain` and dataset type can be found on the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html) . For this example, we are using [CUSTOM](https://docs.aws.amazon.com/forecast/latest/dg/custom-domain.html) domain with 3 required attributes `timestamp`, `target_value` and `item_id`. Also for your project name, update it to reflect your name in a lowercase format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DATASET_FREQUENCY = \"H\" \n", "TIMESTAMP_FORMAT = \"yyyy-MM-dd hh:mm:ss\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "project = 'ws_retraining' # Replace this with a unique name here, make sure the entire name is < 30 characters.\n", "datasetName= project+'_ds'\n", "datasetGroupName= project +'_gp'\n", "s3DataPath = \"s3://\"+bucketName+\"/\"+key_first_predictor" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"DatasetName: \", datasetName)\n", "print(\"DatasetGroupName: \", datasetGroupName)\n", "print(\"S3 bucket: \", s3DataPath)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.\n", "schema ={\n", " \"Attributes\":[\n", " {\n", " \"AttributeName\":\"timestamp\",\n", " \"AttributeType\":\"timestamp\"\n", " },\n", " {\n", " \"AttributeName\":\"target_value\",\n", " \"AttributeType\":\"float\"\n", " },\n", " {\n", " \"AttributeName\":\"item_id\",\n", " \"AttributeType\":\"string\"\n", " }\n", " ]\n", "}\n", "\n", "response=forecast.create_dataset(\n", " Domain=\"CUSTOM\",\n", " DatasetType='TARGET_TIME_SERIES',\n", " DatasetName=datasetName,\n", " DataFrequency=DATASET_FREQUENCY, \n", " Schema = schema\n", " )\n", "datasetArn = response['DatasetArn']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"DatasetArn:\", datasetArn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_dataset_group_response = forecast.create_dataset_group(DatasetGroupName=datasetGroupName,\n", " Domain=\"CUSTOM\",\n", " DatasetArns= [datasetArn]\n", " )\n", "datasetGroupArn = create_dataset_group_response['DatasetGroupArn']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Brings the data into Amazon Forecast system ready to forecast from raw data. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datasetImportJobName = 'Electricity_First_Import_TTS'\n", "ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,\n", " DatasetArn=datasetArn,\n", " DataSource= {\n", " \"S3Config\" : {\n", " \"Path\":s3DataPath,\n", " \"RoleArn\": role_arn\n", " } \n", " },\n", " TimestampFormat=TIMESTAMP_FORMAT\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']\n", "print(\"DatasetImportJob\", ds_import_job_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the status of dataset, when the status change from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to next steps. Depending on the data size. It can take 10 mins to be **ACTIVE**. This process will take 5 to 10 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Waiting for Dataset Import Job with ARN {ds_import_job_arn} to become **ACTIVE**. This process could take 15-20 minutes.\\n\\nCurrent Status:\")\n", "status = util.wait(lambda: forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn))\n", "\n", "describe_dataset_import_job_response = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)\n", "print(f\"\\n\\nThe Dataset Import Job with ARN ds_import_job_arn is now {describe_dataset_import_job_response['Status']}.\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating the first Predictor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Forecast horizon is the number of number of time points to predicted in the future. Our example is hourly data, we try forecast the next 24 hours, so we can set to 24." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "firstPredictorName = project+'_firstPred'\n", "forecastHorizon = 24" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_predictor_response=forecast.create_auto_predictor(PredictorName=firstPredictorName, \n", " ForecastHorizon=forecastHorizon,\n", " ForecastFrequency=\"H\", \n", " DataConfig= {\"DatasetGroupArn\": datasetGroupArn})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "firstPredictorArn=first_predictor_response['PredictorArn']\n", "print(\"Predictor created from CreateAutoPredictor API\", firstPredictorArn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the status of the predictor. When the status change from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to next steps. Depending on data size, model selection and hyper parameters,it can take 40 mins to more than one hour to be **ACTIVE**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_time = time.time()\n", "print(f\"Waiting for Predictor with ARN **{firstPredictorArn}** to become **ACTIVE**. This process could take minutes. 
"\n", "status = util.wait(lambda: forecast.describe_auto_predictor(PredictorArn=firstPredictorArn))\n", "\n", "describe_first_auto_predictor_response = forecast.describe_auto_predictor(PredictorArn=firstPredictorArn)\n", "print(f\"\\n\\nThe Predictor with ARN **{firstPredictorArn}** is now **{describe_first_auto_predictor_response['Status']}**.\")\n", "\n", "time_to_create_first_predictor = (time.time() - start_time) / 60" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Predictor details\\n\")\n", "forecast.describe_auto_predictor(PredictorArn=firstPredictorArn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get Error Metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecast.get_accuracy_metrics(PredictorArn=firstPredictorArn)['PredictorEvaluationResults'][0]['TestWindows']"
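] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The raw response above is verbose. The optional cell below is a sketch that pulls out the backtest error metrics (such as WAPE and RMSE) per forecast type, assuming the response shape documented for the `GetAccuracyMetrics` API." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: summarize backtest error metrics for the first predictor,\n", "# assuming the documented GetAccuracyMetrics response shape.\n", "first_metrics = forecast.get_accuracy_metrics(PredictorArn=firstPredictorArn)\n", "for result in first_metrics['PredictorEvaluationResults']:\n", "    for window in result['TestWindows']:\n", "        for err in window['Metrics']['ErrorMetrics']:\n", "            print(f\"ForecastType={err['ForecastType']} WAPE={err['WAPE']:.4f} RMSE={err['RMSE']:.4f}\")"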
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retraining an AutoPredictor\n", "\n", "Let's assume that 1 day has passed and we have new data. Since we want to ensure that our predictor is trained on the latest data, we can now choose to retrain the predictor we just created above.\n", "\n", "We retrain because we have seen new data and want to keep our predictor up to date with the latest data, so that it learns new trends." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As part of the step below, we re-import the data. Note that `key_retrain_predictor` points to a file with one more day of data, so we will train the predictor with the latest data available." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "key_retrain_predictor=\"elec_data/item-demand-time_new.csv\"\n", "s3.upload_file(Filename=\"data/elec_data_small_new.csv\", Bucket=bucketName, Key=key_retrain_predictor)\n", "s3DataPath = \"s3://\"+bucketName+\"/\"+key_retrain_predictor\n", "\n", "datasetImportJobName = 'Electricity_New_Import_TTS'\n", "ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,\n", " DatasetArn=datasetArn,\n", " DataSource= {\n", " \"S3Config\" : {\n", " \"Path\":s3DataPath,\n", " \"RoleArn\": role_arn\n", " } \n", " },\n", " TimestampFormat=TIMESTAMP_FORMAT\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']\n", "print(\"DatasetImportJob\", ds_import_job_arn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Checking status for DatasetImport Job: \", ds_import_job_arn)\n", "\n", "status = util.wait(lambda: forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn))\n", "\n", "describe_dataset_import_job_response = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)\n", "print(f\"\\n\\nThe Dataset Import Job with ARN {ds_import_job_arn} is now {describe_dataset_import_job_response['Status']}.\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "retrainedPredictorName=project+'_retrainedPred'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To retrain the predictor, we pass the reference predictor's ARN as an input to the CreateAutoPredictor API. Here, `ReferencePredictorArn` will be the predictor ARN created above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "referencePredictorArn=firstPredictorArn\n", "\n", "print(\"ReferencePredictorArn: \", referencePredictorArn)\n", "\n", "create_auto_predictor_response=forecast.create_auto_predictor(PredictorName=retrainedPredictorName, \n", " ReferencePredictorArn=referencePredictorArn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "retrainingPredictorArn=create_auto_predictor_response['PredictorArn']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"New Retrained Predictor Arn: \", retrainingPredictorArn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we wait for the new retrained AutoPredictor to become **ACTIVE**; this can take about an hour." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_time = time.time()\n", "print(f\"Waiting for Predictor with ARN **{retrainingPredictorArn}** to become **ACTIVE**. This process could take 60 minutes. Depending on data size and predictor settings, it can take several hours to become **ACTIVE**.\\n\\nCurrent Status:\")\n", "\n", "status = util.wait(lambda: forecast.describe_auto_predictor(PredictorArn=retrainingPredictorArn))\n", "\n", "describe_auto_predictor_response = forecast.describe_auto_predictor(PredictorArn=retrainingPredictorArn)\n", "print(f\"\\n\\nThe Predictor with ARN **{retrainingPredictorArn}** is now **{describe_auto_predictor_response['Status']}**.\")\n", "time_to_retrain = (time.time() - start_time) / 60" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we describe the newly created retrained predictor so we can compare it against the existing predictor." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecast.describe_auto_predictor(PredictorArn=retrainingPredictorArn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forecast.get_accuracy_metrics(PredictorArn=retrainingPredictorArn)"
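] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see whether retraining changed backtest accuracy, the optional cell below is a sketch that compares the average WAPE of the first and retrained predictors, using the same assumed `GetAccuracyMetrics` response shape as above (the `average_wape` helper is introduced here for illustration)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: compare the average backtest WAPE of the first and\n", "# retrained predictors, assuming the documented GetAccuracyMetrics shape.\n", "def average_wape(predictor_arn):\n", "    response = forecast.get_accuracy_metrics(PredictorArn=predictor_arn)\n", "    wapes = [err['WAPE']\n", "             for result in response['PredictorEvaluationResults']\n", "             for window in result['TestWindows']\n", "             for err in window['Metrics']['ErrorMetrics']]\n", "    return sum(wapes) / len(wapes)\n", "\n", "print(f\"First predictor average WAPE: {average_wape(firstPredictorArn):.4f}\")\n", "print(f\"Retrained predictor average WAPE: {average_wape(retrainingPredictorArn):.4f}\")"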
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the DatasetGroup, and all its child resources such as Predictor, PredictorBacktestExportJob, Forecast and ForecastExportJob\n", "#util.wait_till_delete(lambda: forecast.delete_resource_tree(ResourceArn=datasetGroupArn))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the Dataset and its child DatasetImportJob resources:\n", "#util.wait_till_delete(lambda: forecast.delete_resource_tree(ResourceArn=datasetArn))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete IAM role\n", "#util.delete_iam_role( role_name )" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "collapse_to_match_collapsible_headings": false, "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, "nav_menu": { "height": "253px", "width": "254px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": false, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 4 }