{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step is to run this notebook and prepare a dataset for input into Amazon Forecast."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1.Download dataset\n",
"We use data from the following sites to track sales on e-commerce sites. \n",
"https://archive.ics.uci.edu/ml/datasets/Online+Retail+II"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2020-08-01 05:51:39-- https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx\n",
"Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252\n",
"Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 45622278 (44M) [application/x-httpd-php]\n",
"Saving to: ‘./input/online_retail_II.xlsx’\n",
"\n",
"online_retail_II.xl 100%[===================>] 43.51M 19.0MB/s in 2.3s \n",
"\n",
"2020-08-01 05:51:41 (19.0 MB/s) - ‘./input/online_retail_II.xlsx’ saved [45622278/45622278]\n",
"\n"
]
}
],
"source": [
"! wget https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx -P ./input"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2.Load dataset\n",
"Load the downloaded data and add a sales column."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_excel('./input/online_retail_II.xlsx', sheet_name='Year 2009-2010')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"df['sales'] = df['Price'] * df['Quantity']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.Build dataset\n",
"From the dataset, create two sets, one for initial training and one for automatic training using the pipeline.\n",
"\n",
"train:2009/12/01 - 2010/12/02 \n",
"train_added:2009/12/01 - 2010/12/09"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df2 = df[['Country', 'InvoiceDate', 'sales']]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"df2 = df2.query('Country == \"United Kingdom\"')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Country | \n",
" InvoiceDate | \n",
" sales | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" United Kingdom | \n",
" 2009-12-01 07:45:00 | \n",
" 83.4 | \n",
"
\n",
" \n",
" 1 | \n",
" United Kingdom | \n",
" 2009-12-01 07:45:00 | \n",
" 81.0 | \n",
"
\n",
" \n",
" 2 | \n",
" United Kingdom | \n",
" 2009-12-01 07:45:00 | \n",
" 81.0 | \n",
"
\n",
" \n",
" 3 | \n",
" United Kingdom | \n",
" 2009-12-01 07:45:00 | \n",
" 100.8 | \n",
"
\n",
" \n",
" 4 | \n",
" United Kingdom | \n",
" 2009-12-01 07:45:00 | \n",
" 30.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Country InvoiceDate sales\n",
"0 United Kingdom 2009-12-01 07:45:00 83.4\n",
"1 United Kingdom 2009-12-01 07:45:00 81.0\n",
"2 United Kingdom 2009-12-01 07:45:00 81.0\n",
"3 United Kingdom 2009-12-01 07:45:00 100.8\n",
"4 United Kingdom 2009-12-01 07:45:00 30.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p output"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"df2.to_csv('./output/tr_target_add_20091201_20101209.csv', header=False, index=False)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"tr1 = df2.query('InvoiceDate <= \"20101203\"')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Country | \n",
" InvoiceDate | \n",
" sales | \n",
"
\n",
" \n",
" \n",
" \n",
" 508150 | \n",
" United Kingdom | \n",
" 2010-12-02 19:59:00 | \n",
" 3.40 | \n",
"
\n",
" \n",
" 508151 | \n",
" United Kingdom | \n",
" 2010-12-02 19:59:00 | \n",
" 0.65 | \n",
"
\n",
" \n",
" 508152 | \n",
" United Kingdom | \n",
" 2010-12-02 19:59:00 | \n",
" 5.95 | \n",
"
\n",
" \n",
" 508153 | \n",
" United Kingdom | \n",
" 2010-12-02 19:59:00 | \n",
" 5.90 | \n",
"
\n",
" \n",
" 508154 | \n",
" United Kingdom | \n",
" 2010-12-02 19:59:00 | \n",
" 9.90 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Country InvoiceDate sales\n",
"508150 United Kingdom 2010-12-02 19:59:00 3.40\n",
"508151 United Kingdom 2010-12-02 19:59:00 0.65\n",
"508152 United Kingdom 2010-12-02 19:59:00 5.95\n",
"508153 United Kingdom 2010-12-02 19:59:00 5.90\n",
"508154 United Kingdom 2010-12-02 19:59:00 9.90"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tr1.tail()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"tr1.to_csv('./output/tr_target_20091201_20101202.csv', header=False, index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4.Upload dataset to S3\n",
"Create a bucket in S3 and upload the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## make bucket"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import boto3"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'1.14.16'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"boto3.__version__"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"805433377179\n"
]
}
],
"source": [
"sts = boto3.client('sts')\n",
"id_info = sts.get_caller_identity()\n",
"print(id_info['Account'])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"s3 = boto3.client('s3')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"bucket_name = 'demo-forecast-' + id_info['Account']"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'demo-forecast-805433377179'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bucket_name"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'ResponseMetadata': {'RequestId': '3E78EF02EB2C9A75',\n",
" 'HostId': 'WEvfjkpLyQGz0gcZavGvEqZ5qLjyKrOIa2dtKAlum3ztL9iaQ8Xj2JoVLjMCabAuYU3Nz4H6ivk=',\n",
" 'HTTPStatusCode': 200,\n",
" 'HTTPHeaders': {'x-amz-id-2': 'WEvfjkpLyQGz0gcZavGvEqZ5qLjyKrOIa2dtKAlum3ztL9iaQ8Xj2JoVLjMCabAuYU3Nz4H6ivk=',\n",
" 'x-amz-request-id': '3E78EF02EB2C9A75',\n",
" 'date': 'Sat, 01 Aug 2020 05:53:27 GMT',\n",
" 'location': '/demo-forecast-805433377179',\n",
" 'content-length': '0',\n",
" 'server': 'AmazonS3'},\n",
" 'RetryAttempts': 0},\n",
" 'Location': '/demo-forecast-805433377179'}"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s3.create_bucket(Bucket=bucket_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## upload dataset"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"s3 = boto3.resource('s3')\n",
"bucket = s3.Bucket(bucket_name)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"bucket.upload_file('./output/tr_target_20091201_20101202.csv',\n",
" 'input/tr_target_20091201_20101202.csv')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"bucket.upload_file('./output/tr_target_add_20091201_20101209.csv',\n",
" 'input/tr_target_add_20091201_20101209.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## upload manifest file\n",
"Create a manifest file for use in Quick Sight and upload it to S3."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"import json"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"manifest_for_qs={\n",
" \"fileLocations\": [\n",
" {\n",
" \"URIs\": []\n",
" },\n",
" {\n",
" \"URIPrefixes\": [\n",
" \"s3://\" + bucket_name + \"/output/\"\n",
" ]\n",
" }\n",
" ],\n",
" \"globalUploadSettings\": {\n",
" \"format\": \"CSV\",\n",
" \"delimiter\": \",\",\n",
" \"textqualifier\": \"'\",\n",
" \"containsHeader\": \"true\"\n",
" }\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p manifest_for_quicksight"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"with open('./manifest_for_quicksight/manifest_uk_sales_pred.json', 'w') as f:\n",
" json.dump(manifest_for_qs, f, indent=2, ensure_ascii=False)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"bucket.upload_file('./manifest_for_quicksight/manifest_uk_sales_pred.json',\n",
" 'manifest_for_quicksight/manifest_uk_sales_pred.json')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5.NEXT\n",
"Manually run the forecast with Amazon Forecast. export the forecast results to S3 and visualize them in Amazon QuickSight. \n",
"When the visualization is complete, run 2_build_forecast_pipeline.ipynb to build an automatic forecast pipeline. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}