# Getting Data Ready

Forecasting is used in a variety of applications and business use cases. For example, retailers need to forecast product sales to decide how much stock they need at each location; manufacturers need to estimate the number of parts required at their factories to optimize their supply chain; businesses need to estimate their flexible workforce needs; utilities need to forecast electricity consumption to run an efficient energy network; and enterprises need to estimate their cloud infrastructure needs.

<img src="https://amazon-forecast-samples.s3-us-west-2.amazonaws.com/common/images/forecast_overview_steps.png" width="98%">

In this notebook we walk through the first steps outlined in the left box above.


## Table Of Contents
* Step 1: [Setup Amazon Forecast](#setup)
* Step 2: [Prepare the Datasets](#DataPrep)
* Step 3: [Create the Dataset Group and Dataset](#DataSet)
* Step 4: [Create the Target Time Series Data Import Job](#DataImport)
* [Next Steps](#nextSteps)

For more information about the APIs, see the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/what-is-forecast.html).

## Step 1: Setup Amazon Forecast<a class="anchor" id="setup"></a>

This section sets up the permissions and relevant endpoints.

In [51]:
!pip install pandas s3fs matplotlib ipywidgets
!pip install boto3 --upgrade


In [1]:
import sys
import os
import pandas as pd

# importing forecast notebook utility from notebooks/common directory
sys.path.insert( 0, os.path.abspath("../../common") )
import util

%reload_ext autoreload
import boto3
import s3fs

Configure the S3 bucket name and region name for this lesson.

- If you don't have an S3 bucket, create it first on S3. 
- Although we have set the region to us-west-2 as a default value below, you can choose any of the regions that the service is available in.

In [2]:
region = 'us-west-2'
bucket_name = 'forecast-demo-uci-electricity-jeetub'

In [3]:
# Build Session and Clients for Amazon Forecast
session = boto3.Session(region_name=region) 
forecast = session.client(service_name='forecast') 

<b>Create IAM Role for Forecast</b> <br>
Like many AWS services, Forecast will need to assume an IAM role in order to interact with your S3 resources securely. In the sample notebooks, we use the get_or_create_iam_role() utility function to create an IAM role. Please refer to "notebooks/common/util/fcst_utils.py" for implementation.
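
For orientation, the trust relationship such a role needs can be sketched as follows. This is a minimal illustration of a typical trust policy for the Forecast service principal, not a copy of what `get_or_create_iam_role()` does. The helper name `forecast_trust_policy` is hypothetical, and a working role additionally needs permission to read your S3 bucket.

```python
import json


def forecast_trust_policy():
    """Return a trust policy that lets the Amazon Forecast service assume a role.

    Hypothetical helper for illustration; the notebook's utility may differ.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "forecast.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Serialize the policy to the JSON string form that IAM expects.
policy_json = json.dumps(forecast_trust_policy(), indent=2)
print(policy_json)
```

A document like this would typically be passed as `AssumeRolePolicyDocument` when creating the role through the IAM `create_role` API.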

In [82]:
# Create the role to provide to Amazon Forecast.
role_name = "ForecastNotebookRole-Basic"
print(f"Creating Role {role_name} ...")
role_arn = util.get_or_create_iam_role( role_name = role_name )

# echo user inputs without account
print(f"Success! Created role arn = {role_arn.split('/')[1]}")

Creating Role ForecastNotebookRole-Basic ...
The role ForecastNotebookRole-Basic exists, ignore to create it
Done.
Success! Created role arn = ForecastNotebookRole-Basic


The last part of the setup process is to validate that your account can communicate with Amazon Forecast. The cell below does just that.

In [83]:
# Check that you can communicate with Amazon Forecast
forecast.list_predictors()

{'Predictors': [],
 'ResponseMetadata': {'RequestId': 'f0c87c34-3559-4c84-94cf-cbfcb41a7c49',
 'HTTPStatusCode': 200,
 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
 'date': 'Thu, 21 Oct 2021 22:10:26 GMT',
 'x-amzn-requestid': 'f0c87c34-3559-4c84-94cf-cbfcb41a7c49',
 'content-length': '17',
 'connection': 'keep-alive'},
 'RetryAttempts': 0}}

## Step 2: Prepare the Datasets<a class="anchor" id="DataPrep"></a>

For this exercise, we use the individual household electric power consumption dataset. (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.) We aggregate the usage data hourly. 

To begin, use Pandas to read the CSV and to show a sample of the data.

In [9]:
df = pd.read_csv("../../common/data/item-demand-time.csv", dtype = object, names=['timestamp','value','item'])
df.head(3)

| | timestamp | value | item |
|---|---|---|---|
| 0 | 2014-01-01 01:00:00 | 38.34991708126038 | client_12 |
| 1 | 2014-01-01 02:00:00 | 33.5820895522388 | client_12 |
| 2 | 2014-01-01 03:00:00 | 34.41127694859037 | client_12 |


Notice in the output above there are 3 columns of data:

1. The Timestamp
1. A Value
1. An Item ID

These are the 3 key required pieces of information to generate a forecast with Amazon Forecast. More can be added but these 3 must always remain present.
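
As a rough illustration of what "must always remain present" means in practice, the sketch below scans headerless CSV text and reports rows where one of the three fields is missing or malformed. The function name `check_rows` and the sample rows are assumptions made for this example; it uses only the standard library rather than Pandas so it can run on its own.

```python
import csv
import io
from datetime import datetime


def check_rows(csv_text):
    """Return (line_number, problem) pairs for rows that lack a usable
    timestamp, value, or item id. Columns are assumed to be in that order."""
    problems = []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) != 3:
            problems.append((line_no, "expected 3 columns"))
            continue
        timestamp, value, item = row
        try:
            datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            problems.append((line_no, "bad timestamp"))
        try:
            float(value)
        except ValueError:
            problems.append((line_no, "non-numeric value"))
        if not item:
            problems.append((line_no, "empty item id"))
    return problems


# Illustrative rows: the first is valid, the second lacks a value,
# and the third has an unparseable timestamp.
sample = (
    "2014-01-01 01:00:00,38.34991708126038,client_12\n"
    "2014-01-01 02:00:00,,client_12\n"
    "not-a-date,33.58,client_12\n"
)
problems = check_rows(sample)
print(problems)
```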

The dataset spans January 1, 2014 to December 31, 2014. We are only going to use January to October to train Amazon Forecast.

You may notice a variable named `df`. This is a popular convention when working with Pandas DataFrame objects, which are similar to tables in a database. You can learn more here: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html


In [10]:
# Select January to October for one dataframe.
jan_to_oct = df[(df['timestamp'] >= '2014-01-01') & (df['timestamp'] < '2014-11-01')]

print(f"min timestamp = {jan_to_oct.timestamp.min()}")
print(f"max timestamp = {jan_to_oct.timestamp.max()}")

min timestamp = 2014-01-01 01:00:00
max timestamp = 2014-10-31 23:00:00
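
The filter above compares timestamps as plain strings, which works because the `YYYY-MM-DD hh:mm:ss` layout sorts lexicographically in chronological order. The sketch below reproduces the same half-open window on synthetic hourly timestamps for a single hypothetical item, assuming no gaps in the series, and counts the rows that fall inside it.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def hourly_stamps(start, end):
    """Yield hourly timestamp strings from start (inclusive) to end (exclusive)."""
    current = start
    while current < end:
        yield current.strftime(FMT)
        current += timedelta(hours=1)


# One synthetic item's hourly readings for 2014, starting at 01:00 on January 1.
year = list(hourly_stamps(datetime(2014, 1, 1, 1), datetime(2015, 1, 1)))

# Same half-open window as the notebook cell: [2014-01-01, 2014-11-01).
train = [ts for ts in year if "2014-01-01" <= ts < "2014-11-01"]

print(train[0], "|", train[-1], "|", len(train))
```

Under these assumptions the window holds 7295 hourly rows per item. With three item ids that gives 3 × 7295 = 21885 rows, which agrees with the `Count` that the import job reports in its `FieldStatistics` later in this notebook.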


In [11]:
# save an item_id for querying later
item_id = "client_12"

Now export the training data to a CSV file in your `data` folder.

In [14]:
os.makedirs("data", exist_ok=True)  # create the output folder if it does not exist
jan_to_oct[["timestamp", "value", "item"]].to_csv("data/item-demand-time-train.csv", header=False, index=False)  # same column order as the schema in Step 3

We will now export a second dataset to CSV, this time including November 1st. This extra day will be used to validate our forecast.

In [15]:
validation = df[(df['timestamp'] >= '2014-01-01') & (df['timestamp'] < '2014-11-02')]
validation[["timestamp", "value", "item"]].to_csv("data/item-demand-time-validation.csv", header=False, index=False)

At this time the data is ready to be sent to S3 where Forecast will use it later. The following cells will upload the data to S3.

In [89]:
key="elec_data/item-demand-time-train.csv"

boto3.Session().resource('s3').Bucket(bucket_name).Object(key).upload_file("data/item-demand-time-train.csv")
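
The upload step and the S3 URI that Forecast reads later can be kept together in one small helper. The sketch below is one possible shape, not part of the notebook's utilities: `upload_and_get_uri` and `RecordingClient` are hypothetical names, and the stub client exists only so the example runs without AWS access. With a real `boto3.client("s3")`, the same `upload_file` call would perform the transfer.

```python
def upload_and_get_uri(s3_client, local_path, bucket, key):
    """Upload local_path to the bucket and return the s3:// URI for the object."""
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return f"s3://{bucket}/{key}"


class RecordingClient:
    """Offline stand-in for an S3 client; it only records the requested upload."""

    def __init__(self):
        self.calls = []

    def upload_file(self, Filename, Bucket, Key):
        self.calls.append((Filename, Bucket, Key))


# The bucket name here is a placeholder, not the one used in this notebook.
client = RecordingClient()
uri = upload_and_get_uri(
    client,
    "data/item-demand-time-train.csv",
    "my-forecast-bucket",
    "elec_data/item-demand-time-train.csv",
)
print(uri)
```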

## Step 3: Create the Dataset Group and Dataset <a class="anchor" id="DataSet"></a>

In Amazon Forecast, a dataset is a collection of one or more files that contain data relevant to a forecasting task. A dataset must conform to a schema provided by Amazon Forecast. Because data files are imported without headers, it is important to define a schema for your data.

More details about `Domain` and dataset type can be found in the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html). For this example, we use the [CUSTOM](https://docs.aws.amazon.com/forecast/latest/dg/custom-domain.html) domain with 3 required attributes: `timestamp`, `target_value`, and `item_id`.


Next, you need to make some choices. 
<ol>
 <li><b>How many time units do you want to forecast?</b> For example, if your time unit is Hour and you want to forecast out 1 week, that is 24*7 = 168 hours, so answer = 168. </li>
 <li><b>What is the time granularity for your data?</b> For example, if your time unit is Hour, answer = "H". </li>
 <li><b>What name do you want to give this project (Dataset Group name)?</b> All files will then share the same name. You should also use this same name for your Forecast DatasetGroup name, to set yourself up for reproducibility. </li>
 </ol>
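
The arithmetic in the first choice can be sketched as a small helper that converts a time span into a number of time units. The name `horizon_in_units` and the `UNIT_LENGTH` table are assumptions for this example; yearly and monthly frequencies are left out because they do not have a fixed duration.

```python
from datetime import timedelta

# Duration of one time unit for the frequency codes that have a fixed length.
UNIT_LENGTH = {
    "W": timedelta(weeks=1),
    "D": timedelta(days=1),
    "H": timedelta(hours=1),
    "30min": timedelta(minutes=30),
    "15min": timedelta(minutes=15),
    "10min": timedelta(minutes=10),
    "5min": timedelta(minutes=5),
    "1min": timedelta(minutes=1),
}


def horizon_in_units(span, frequency):
    """Return how many whole units of `frequency` are needed to cover `span`."""
    units, remainder = divmod(span, UNIT_LENGTH[frequency])
    if remainder:
        raise ValueError(f"{span} is not a whole number of {frequency} units")
    return units


one_week_hourly = horizon_in_units(timedelta(weeks=1), "H")
one_day_hourly = horizon_in_units(timedelta(days=1), "H")
print(one_week_hourly, one_day_hourly)
```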

In [93]:
# What is your forecast horizon, in number of the time units you've selected?
# e.g. if you're forecasting in months, how many months out do you want a forecast?
FORECAST_LENGTH = 24

# What is your forecast time unit granularity?
# Choices are: ^Y|M|W|D|H|30min|15min|10min|5min|1min$ 
DATASET_FREQUENCY = "H"
TIMESTAMP_FORMAT = "yyyy-MM-dd hh:mm:ss"

# What name do you want to give this project? 
# We will use this same name for your Forecast Dataset Group name.
PROJECT = 'util_power_demo'
DATA_VERSION = 1

### Create the Dataset Group

In this task, we define a container name or Dataset Group name, which will be used to keep track of Dataset import files, schema, and all Forecast results which go together.


In [96]:
dataset_group = f"{PROJECT}_{DATA_VERSION}"
print(f"Dataset Group Name = {dataset_group}")

Dataset Group Name = util_power_demo_1
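
Because the same name is reused for several Forecast resources, it can help to validate it before calling the API. The sketch below assumes the naming rule is "starts with a letter, then letters, digits, or underscores, at most 63 characters", which is my reading of the API reference and should be checked against the current documentation. The helper name `build_resource_name` is hypothetical.

```python
import re

# Assumed naming rule; verify against the Amazon Forecast API reference.
NAME_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_]{0,62}")


def build_resource_name(project, version):
    """Combine project and data version into a name and validate it."""
    name = f"{project}_{version}"
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid resource name: {name!r}")
    return name


name = build_resource_name("util_power_demo", 1)
print(name)
```

Under the assumed rule, a hyphenated project name such as `util-power-demo` would be rejected locally instead of failing in the API call.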


In [97]:
dataset_arns = []
create_dataset_group_response = \
 forecast.create_dataset_group(Domain="CUSTOM",
 DatasetGroupName=dataset_group,
 DatasetArns=dataset_arns)

In [98]:
dataset_group_arn = create_dataset_group_response['DatasetGroupArn']

In [99]:
forecast.describe_dataset_group(DatasetGroupArn=dataset_group_arn)

{'DatasetGroupName': 'util_power_demo_1',
 'DatasetGroupArn': 'arn:aws:forecast:us-west-2:730750055343:dataset-group/util_power_demo_1',
 'DatasetArns': [],
 'Domain': 'CUSTOM',
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2021, 10, 21, 15, 11, 39, 8000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2021, 10, 21, 15, 11, 39, 8000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': 'c476e9e7-8761-4118-9d5e-5874ec63e000',
 'HTTPStatusCode': 200,
 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
 'date': 'Thu, 21 Oct 2021 22:11:43 GMT',
 'x-amzn-requestid': 'c476e9e7-8761-4118-9d5e-5874ec63e000',
 'content-length': '257',
 'connection': 'keep-alive'},
 'RetryAttempts': 0}}

### Create the Schema

In [100]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
ts_schema ={
 "Attributes":[
 {
 "AttributeName":"timestamp",
 "AttributeType":"timestamp"
 },
 {
 "AttributeName":"target_value",
 "AttributeType":"float"
 },
 {
 "AttributeName":"item_id",
 "AttributeType":"string"
 }
 ]
}
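
Since the files are imported without headers, a mismatch between column order and schema order only shows up at import time. A quick local check along the following lines can catch it earlier. This is a sketch under assumptions: `row_matches_schema` and `PARSERS` are hypothetical names, and the type checks are a simplification of whatever validation the service performs.

```python
from datetime import datetime

# Local copy of the schema from the cell above, so the example is self-contained.
schema = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "target_value", "AttributeType": "float"},
        {"AttributeName": "item_id", "AttributeType": "string"},
    ]
}

# One parser per attribute type; each raises ValueError on unparseable input.
PARSERS = {
    "timestamp": lambda text: datetime.strptime(text, "%Y-%m-%d %H:%M:%S"),
    "float": float,
    "string": str,
}


def row_matches_schema(row, schema):
    """Return True if every cell parses as the type declared at its position."""
    attributes = schema["Attributes"]
    if len(row) != len(attributes):
        return False
    for cell, attribute in zip(row, attributes):
        try:
            PARSERS[attribute["AttributeType"]](cell)
        except ValueError:
            return False
    return True


in_order = ["2014-01-01 01:00:00", "38.34991708126038", "client_12"]
swapped = ["2014-01-01 01:00:00", "client_12", "38.34991708126038"]
print(row_matches_schema(in_order, schema), row_matches_schema(swapped, schema))
```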

### Create the Dataset

In [101]:
ts_dataset_name = f"{PROJECT}_{DATA_VERSION}"
print(ts_dataset_name)

util_power_demo_1


In [102]:
response = \
forecast.create_dataset(Domain="CUSTOM",
 DatasetType='TARGET_TIME_SERIES',
 DatasetName=ts_dataset_name,
 DataFrequency=DATASET_FREQUENCY,
 Schema=ts_schema
 )

In [103]:
ts_dataset_arn = response['DatasetArn']

In [104]:
forecast.describe_dataset(DatasetArn=ts_dataset_arn)

{'DatasetArn': 'arn:aws:forecast:us-west-2:730750055343:dataset/util_power_demo_1',
 'DatasetName': 'util_power_demo_1',
 'Domain': 'CUSTOM',
 'DatasetType': 'TARGET_TIME_SERIES',
 'DataFrequency': 'H',
 'Schema': {'Attributes': [{'AttributeName': 'timestamp',
 'AttributeType': 'timestamp'},
 {'AttributeName': 'target_value', 'AttributeType': 'float'},
 {'AttributeName': 'item_id', 'AttributeType': 'string'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2021, 10, 21, 15, 11, 46, 887000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2021, 10, 21, 15, 11, 46, 887000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '7baf7876-ec88-4c63-bce3-44c01ecce913',
 'HTTPStatusCode': 200,
 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
 'date': 'Thu, 21 Oct 2021 22:11:54 GMT',
 'x-amzn-requestid': '7baf7876-ec88-4c63-bce3-44c01ecce913',
 'content-length': '495',
 'connection': 'keep-alive'},
 'RetryAttempts': 0}}

### Update the dataset group with the datasets we created
You can have multiple datasets under the same dataset group. Update it with the datasets we created before.

In [105]:
dataset_arns = []
dataset_arns.append(ts_dataset_arn)
forecast.update_dataset_group(DatasetGroupArn=dataset_group_arn, DatasetArns=dataset_arns)

{'ResponseMetadata': {'RequestId': 'e7e74418-c60c-424f-95e2-61df025f9251',
 'HTTPStatusCode': 200,
 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
 'date': 'Thu, 21 Oct 2021 22:12:02 GMT',
 'x-amzn-requestid': 'e7e74418-c60c-424f-95e2-61df025f9251',
 'content-length': '2',
 'connection': 'keep-alive'},
 'RetryAttempts': 0}}

## Step 4: Create a Target Time Series Dataset Import Job <a class="anchor" id="DataImport"></a>


Now that Forecast knows how to interpret the CSV we are providing, the next step is to import the data from S3 into Amazon Forecast.

In [106]:
# Recall path to your data
ts_s3_data_path = "s3://"+bucket_name+"/"+key
print(f"S3 URI for your data file = {ts_s3_data_path}")

S3 URI for your data file = s3://forecast-demo-uci-electricity-jeetub/elec_data/item-demand-time-train.csv


In [107]:
ts_dataset_import_job_response = \
 forecast.create_dataset_import_job(DatasetImportJobName=dataset_group,
 DatasetArn=ts_dataset_arn,
 DataSource= {
 "S3Config" : {
 "Path": ts_s3_data_path,
 "RoleArn": role_arn
 } 
 },
 TimestampFormat=TIMESTAMP_FORMAT)

In [108]:
ts_dataset_import_job_arn=ts_dataset_import_job_response['DatasetImportJobArn']
ts_dataset_import_job_arn

'arn:aws:forecast:us-west-2:730750055343:dataset-import-job/util_power_demo_1/util_power_demo_1'

Check the status of the dataset import job. When the status changes from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to the next steps. Depending on the data size, this process typically takes 5 to 10 minutes.

In [109]:
status = util.wait(lambda: forecast.describe_dataset_import_job(DatasetImportJobArn=ts_dataset_import_job_arn))
assert status

CREATE_PENDING ..
CREATE_IN_PROGRESS .............
ACTIVE 
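
The waiting pattern used above can be sketched as a simple polling loop. This is not the implementation of `util.wait`; the function name `wait_until_done`, its parameters, and the treatment of `FAILED` and `STOPPED` statuses as terminal are assumptions for illustration. The `sleep` argument is injectable so the example runs instantly with a fake status sequence.

```python
import time


def wait_until_done(describe, poll_seconds=10, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe() until its Status is ACTIVE (True) or terminal (False)."""
    waited = 0
    while True:
        status = describe()["Status"]
        if status == "ACTIVE":
            return True
        if status.endswith("FAILED") or status.endswith("STOPPED"):
            return False
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence in place of a real describe_dataset_import_job call.
statuses = iter(["CREATE_PENDING", "CREATE_IN_PROGRESS", "ACTIVE"])
result = wait_until_done(lambda: {"Status": next(statuses)}, sleep=lambda seconds: None)
print(result)
```

In the notebook, the `describe` argument would be a lambda wrapping `forecast.describe_dataset_import_job(...)`, as in the cell above.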


In [110]:
forecast.describe_dataset_import_job(DatasetImportJobArn=ts_dataset_import_job_arn)

{'DatasetImportJobName': 'util_power_demo_1',
 'DatasetImportJobArn': 'arn:aws:forecast:us-west-2:730750055343:dataset-import-job/util_power_demo_1/util_power_demo_1',
 'DatasetArn': 'arn:aws:forecast:us-west-2:730750055343:dataset/util_power_demo_1',
 'TimestampFormat': 'yyyy-MM-dd hh:mm:ss',
 'UseGeolocationForTimeZone': False,
 'DataSource': {'S3Config': {'Path': 's3://forecast-demo-uci-electricity-jeetub/elec_data/item-demand-time-train.csv',
 'RoleArn': 'arn:aws:iam::730750055343:role/ForecastNotebookRole-Basic'}},
 'FieldStatistics': {'item_id': {'Count': 21885,
 'CountDistinct': 3,
 'CountNull': 0,
 'CountLong': 21885,
 'CountDistinctLong': 3,
 'CountNullLong': 0},
 'target_value': {'Count': 21885,
 'CountDistinct': 4635,
 'CountNull': 0,
 'CountNan': 0,
 'Min': '0.0',
 'Max': '209.99170812603649',
 'Avg': 50.0947432986864,
 'Stddev': 38.47197571594975,
 'CountLong': 21885,
 'CountDistinctLong': 4635,
 'CountNullLong': 0,
 'CountNanLong': 0},
 'timestamp': {'Count': 21885,
 'Cou

## Next Steps<a class="anchor" id="nextSteps"></a>

At this point you have successfully imported your data into Amazon Forecast, and it is time to move on to the next notebook to build your first model. To continue, execute the cell below to store important variables where they can be used in the next notebook, then open `2.Building_Your_Predictor.ipynb`.

In [111]:
# Now save your choices for the next notebook 
%store item_id
%store PROJECT
%store DATA_VERSION
%store FORECAST_LENGTH
%store DATASET_FREQUENCY
%store TIMESTAMP_FORMAT
%store ts_dataset_import_job_arn
%store ts_dataset_arn
%store dataset_group_arn
%store role_arn
%store bucket_name
%store region
%store key

Stored 'item_id' (str)
Stored 'PROJECT' (str)
Stored 'DATA_VERSION' (int)
Stored 'FORECAST_LENGTH' (int)
Stored 'DATASET_FREQUENCY' (str)
Stored 'TIMESTAMP_FORMAT' (str)
Stored 'ts_dataset_import_job_arn' (str)
Stored 'ts_dataset_arn' (str)
Stored 'dataset_group_arn' (str)
Stored 'role_arn' (str)
Stored 'bucket_name' (str)
Stored 'region' (str)
Stored 'key' (str)


## Additional Topics<a class="anchor" id="additionalTopics"></a>

### Stop the data import

During development, you may accidentally start a data import before you're ready. If you don't want to wait for the upload and processing to finish, there is a handy "Stop API" call.


In [None]:
# StopResource
stop_ts_dataset_import_job_arn = forecast.stop_resource(ResourceArn=ts_dataset_import_job_arn)
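
A slightly more defensive variant checks the job status first and only issues the stop request while the import is still being created. The sketch below is an assumption-laden illustration: `stop_import_if_running`, `FakeForecastClient`, the placeholder ARN, and the choice of which statuses count as stoppable are all hypothetical. Only `describe_dataset_import_job` and `stop_resource` are calls already used in this notebook.

```python
# Statuses assumed to mean the import is still running; verify against the docs.
STOPPABLE = {"CREATE_PENDING", "CREATE_IN_PROGRESS"}


def stop_import_if_running(forecast_client, import_job_arn):
    """Stop the import job if it is still in progress; return True if stopped."""
    response = forecast_client.describe_dataset_import_job(
        DatasetImportJobArn=import_job_arn
    )
    if response["Status"] in STOPPABLE:
        forecast_client.stop_resource(ResourceArn=import_job_arn)
        return True
    return False


class FakeForecastClient:
    """Offline stand-in for the Forecast client, reporting a fixed status."""

    def __init__(self, status):
        self.status = status
        self.stopped = []

    def describe_dataset_import_job(self, DatasetImportJobArn):
        return {"Status": self.status}

    def stop_resource(self, ResourceArn):
        self.stopped.append(ResourceArn)


example_arn = "arn:aws:forecast:region:account:dataset-import-job/example/example"
running = FakeForecastClient("CREATE_IN_PROGRESS")
finished = FakeForecastClient("ACTIVE")
print(stop_import_if_running(running, example_arn), stop_import_if_running(finished, example_arn))
```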

In [78]:
# Delete the dataset and the resources created from it, including the target time series dataset import job
# util.wait_till_delete(lambda: forecast.delete_dataset_import_job(DatasetImportJobArn=ts_dataset_import_job_arn))

forecast.delete_resource_tree(ResourceArn=ts_dataset_arn)

{'ResponseMetadata': {'RequestId': '0b91b92b-e435-49aa-9d11-267f3d91ae8f',
 'HTTPStatusCode': 200,
 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
 'date': 'Thu, 21 Oct 2021 22:05:36 GMT',
 'x-amzn-requestid': '0b91b92b-e435-49aa-9d11-267f3d91ae8f',
 'content-length': '0',
 'connection': 'keep-alive'},
 'RetryAttempts': 0}}