# Preprocess the dataset and upload it to S3
This notebook downloads the retail dataset and formats it for use with Amazon Forecast. It then uploads the formatted data to S3, after which you can check in Step Functions that the Forecast pipeline is running.

## 1.Download dataset
We use the Online Retail II dataset from the UCI Machine Learning Repository, which records sales on an e-commerce site.  
https://archive.ics.uci.edu/ml/datasets/Online+Retail+II

In [None]:
! wget https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx -P ./input

## 2.Load dataset
Load the downloaded workbook and add a `sales` column, computed as price times quantity.

In [None]:
import pandas as pd

In [None]:
# Reading .xlsx files with pandas requires the openpyxl package
df = pd.read_excel('./input/online_retail_II.xlsx', sheet_name='Year 2009-2010')

In [None]:
df['sales'] = df['Price'] * df['Quantity']
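
The column arithmetic above can be illustrated on a tiny hand-made frame. This is only a sketch: the `toy` frame and its values are invented for illustration, and the negative quantity is an assumption about how a return or cancellation might appear in the data.

```python
import pandas as pd

# Hypothetical rows, not taken from the real dataset.
toy = pd.DataFrame({
    'Price': [2.5, 1.25],
    'Quantity': [4, -2],  # assumed: a negative quantity represents a return
})

# Element-wise product, same expression as in the notebook cell.
toy['sales'] = toy['Price'] * toy['Quantity']
print(toy)
```

Rows with a negative quantity produce negative sales, which is worth keeping in mind when reading the totals later.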

## 3.Build dataset
From the dataset, create two sets: one for the initial training and one, covering an extra week, for automatic training using the pipeline.

train: 2009/12/01 - 2010/12/02  
train_added: 2009/12/01 - 2010/12/09

In [None]:
df2 = df[['Country', 'InvoiceDate', 'sales']]

In [None]:
df2 = df2.query('Country == "United Kingdom"')

In [None]:
df2.head()

In [None]:
!mkdir -p output

In [None]:
df2.to_csv('./output/tr_target_add_20091201_20101209.csv', header=False, index=False)
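
With `header=False` and `index=False`, the file contains only data rows, so anything that reads it has to know the column order (`Country`, `InvoiceDate`, `sales`). A minimal sketch of what gets written, using an in-memory buffer and made-up rows rather than the real data:

```python
import io
import pandas as pd

# Hypothetical rows standing in for df2.
toy = pd.DataFrame({
    'Country': ['United Kingdom', 'United Kingdom'],
    'InvoiceDate': pd.to_datetime(['2010-12-01 08:26', '2010-12-01 08:28']),
    'sales': [15.5, 20.0],
})

# Write to a string buffer instead of a file so the result can be inspected.
buffer = io.StringIO()
toy.to_csv(buffer, header=False, index=False)
lines = buffer.getvalue().splitlines()

for line in lines:
    print(line)
```

Each printed line has exactly three comma-separated fields and there is no header line.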

In [None]:
# Keep rows through 2010/12/02, i.e. strictly before midnight on 2010/12/03
tr1 = df2.query('InvoiceDate < "2010-12-03"')

In [None]:
tr1.tail()

In [None]:
tr1.to_csv('./output/tr_target_20091201_20101202.csv', header=False, index=False)
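
One way to sanity-check a date cutoff like the one above is to try it on a few rows with known timestamps. The sketch below is an illustration only: `toy`, `cutoff`, `initial`, and `added_only` are hypothetical names, and the timestamps are made up.

```python
import pandas as pd

# Hypothetical stand-in for df2 with timestamps on both sides of the cutoff.
toy = pd.DataFrame({
    'Country': ['United Kingdom'] * 4,
    'InvoiceDate': pd.to_datetime([
        '2010-12-01 09:00', '2010-12-02 17:30',
        '2010-12-03 08:15', '2010-12-09 12:00',
    ]),
    'sales': [10.0, 20.0, 30.0, 40.0],
})

# First instant after the initial training window (midnight on 2010/12/03).
cutoff = pd.Timestamp('2010-12-03')

initial = toy[toy['InvoiceDate'] < cutoff]      # rows through 2010/12/02
added_only = toy[toy['InvoiceDate'] >= cutoff]  # rows only in the larger set

print(initial['InvoiceDate'].max())
print(len(initial), len(added_only))
```

Checking the latest timestamp in the initial set is a quick way to confirm that its last day matches the date in the output file name.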

## 4.Upload dataset to S3

In [None]:
import boto3

In [None]:
boto3.__version__

In [None]:
sts = boto3.client('sts')
id_info = sts.get_caller_identity()
print(id_info['Account'])

In [None]:
bucket_name = 'workshop-timeseries-retail-' + id_info['Account'] + '-source'

In [None]:
bucket_name

In [None]:
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)

bucket.upload_file('./output/tr_target_add_20091201_20101209.csv', 'input/tr_target_add_20091201_20101209.csv')
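
The bucket name and object key used above follow simple patterns, which can be wrapped in small helpers if more files need uploading. This is a sketch, not part of the workshop code: the function names are hypothetical, and the upload itself requires AWS credentials and an existing bucket.

```python
import os

def source_bucket_name(account_id):
    """Build the bucket name using the pattern from this notebook."""
    return 'workshop-timeseries-retail-' + account_id + '-source'

def input_key(local_path, prefix='input'):
    """Map a local file path to an S3 key under the given prefix."""
    return prefix + '/' + os.path.basename(local_path)

def upload_to_source_bucket(local_path, account_id):
    """Upload a local file to the source bucket (needs AWS credentials)."""
    # boto3 is imported here so the two helpers above work without it.
    import boto3
    bucket = boto3.resource('s3').Bucket(source_bucket_name(account_id))
    bucket.upload_file(local_path, input_key(local_path))

print(input_key('./output/tr_target_add_20091201_20101209.csv'))
```

Calling `upload_to_source_bucket('./output/tr_target_add_20091201_20101209.csv', id_info['Account'])` would perform the same upload as the cell above.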

## 5.NEXT
In the Step Functions console, you should see the pipeline running. It takes some time to finish. Once everything is complete, confirm that the Forecast result has been stored in S3, then proceed to 3_visualization.ipynb to visualize the forecast.
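
As an alternative to the console, the running executions can also be listed from code. The following is a minimal sketch that assumes you know the ARN of the workshop's state machine; the function name is hypothetical and the ARN shown in the usage comment is a placeholder, not a real value.

```python
def running_execution_names(sfn_client, state_machine_arn):
    """Return the names of executions currently in the RUNNING state.

    Only the first page of results is read, which is enough for a quick check.
    """
    response = sfn_client.list_executions(
        stateMachineArn=state_machine_arn,
        statusFilter='RUNNING',
    )
    return [execution['name'] for execution in response['executions']]

# Usage (requires AWS credentials; replace the placeholder ARN with your own):
# import boto3
# sfn = boto3.client('stepfunctions')
# print(running_execution_names(
#     sfn, 'arn:aws:states:REGION:ACCOUNT_ID:stateMachine:NAME'))
```

An empty list means no execution is currently running, either because the pipeline has finished or because it was never started.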