{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is to run this notebook and prepare a dataset for input into Amazon Forecast." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.Download dataset\n", "We use data from the following sites to track sales on e-commerce sites. \n", "https://archive.ics.uci.edu/ml/datasets/Online+Retail+II" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2020-08-01 05:51:39-- https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx\n", "Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252\n", "Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 45622278 (44M) [application/x-httpd-php]\n", "Saving to: ‘./input/online_retail_II.xlsx’\n", "\n", "online_retail_II.xl 100%[===================>] 43.51M 19.0MB/s in 2.3s \n", "\n", "2020-08-01 05:51:41 (19.0 MB/s) - ‘./input/online_retail_II.xlsx’ saved [45622278/45622278]\n", "\n" ] } ], "source": [ "! wget https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx -P ./input" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.Load dataset\n", "Load the downloaded data and add a sales column." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_excel('./input/online_retail_II.xlsx', sheet_name='Year 2009-2010')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df['sales'] = df['Price'] * df['Quantity']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3.Build dataset\n", "From the dataset, create two sets, one for initial training and one for automatic training using the pipeline.\n", "\n", "train:2009/12/01 - 2010/12/02 \n", "train_added:2009/12/01 - 2010/12/09" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "df2 = df[['Country', 'InvoiceDate', 'sales']]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df2 = df2.query('Country == \"United Kingdom\"')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryInvoiceDatesales
0United Kingdom2009-12-01 07:45:0083.4
1United Kingdom2009-12-01 07:45:0081.0
2United Kingdom2009-12-01 07:45:0081.0
3United Kingdom2009-12-01 07:45:00100.8
4United Kingdom2009-12-01 07:45:0030.0
\n", "
" ], "text/plain": [ " Country InvoiceDate sales\n", "0 United Kingdom 2009-12-01 07:45:00 83.4\n", "1 United Kingdom 2009-12-01 07:45:00 81.0\n", "2 United Kingdom 2009-12-01 07:45:00 81.0\n", "3 United Kingdom 2009-12-01 07:45:00 100.8\n", "4 United Kingdom 2009-12-01 07:45:00 30.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "!mkdir -p output" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df2.to_csv('./output/tr_target_add_20091201_20101209.csv', header=False, index=False)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "tr1 = df2.query('InvoiceDate <= \"20101203\"')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryInvoiceDatesales
508150United Kingdom2010-12-02 19:59:003.40
508151United Kingdom2010-12-02 19:59:000.65
508152United Kingdom2010-12-02 19:59:005.95
508153United Kingdom2010-12-02 19:59:005.90
508154United Kingdom2010-12-02 19:59:009.90
\n", "
" ], "text/plain": [ " Country InvoiceDate sales\n", "508150 United Kingdom 2010-12-02 19:59:00 3.40\n", "508151 United Kingdom 2010-12-02 19:59:00 0.65\n", "508152 United Kingdom 2010-12-02 19:59:00 5.95\n", "508153 United Kingdom 2010-12-02 19:59:00 5.90\n", "508154 United Kingdom 2010-12-02 19:59:00 9.90" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tr1.tail()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "tr1.to_csv('./output/tr_target_20091201_20101202.csv', header=False, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4.Upload dataset to S3\n", "Create a bucket in S3 and upload the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## make bucket" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import boto3" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1.14.16'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boto3.__version__" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "805433377179\n" ] } ], "source": [ "sts = boto3.client('sts')\n", "id_info = sts.get_caller_identity()\n", "print(id_info['Account'])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client('s3')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "bucket_name = 'demo-forecast-' + id_info['Account']" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'demo-forecast-805433377179'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bucket_name" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'ResponseMetadata': {'RequestId': '3E78EF02EB2C9A75',\n", " 'HostId': 'WEvfjkpLyQGz0gcZavGvEqZ5qLjyKrOIa2dtKAlum3ztL9iaQ8Xj2JoVLjMCabAuYU3Nz4H6ivk=',\n", " 'HTTPStatusCode': 200,\n", " 'HTTPHeaders': {'x-amz-id-2': 'WEvfjkpLyQGz0gcZavGvEqZ5qLjyKrOIa2dtKAlum3ztL9iaQ8Xj2JoVLjMCabAuYU3Nz4H6ivk=',\n", " 'x-amz-request-id': '3E78EF02EB2C9A75',\n", " 'date': 'Sat, 01 Aug 2020 05:53:27 GMT',\n", " 'location': '/demo-forecast-805433377179',\n", " 'content-length': '0',\n", " 'server': 'AmazonS3'},\n", " 'RetryAttempts': 0},\n", " 'Location': '/demo-forecast-805433377179'}" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s3.create_bucket(Bucket=bucket_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## upload dataset" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.resource('s3')\n", "bucket = s3.Bucket(bucket_name)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "bucket.upload_file('./output/tr_target_20091201_20101202.csv',\n", " 'input/tr_target_20091201_20101202.csv')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "bucket.upload_file('./output/tr_target_add_20091201_20101209.csv',\n", " 'input/tr_target_add_20091201_20101209.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## upload manifest file\n", "Create a manifest file for use in Quick Sight and upload it to S3." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "import json" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "manifest_for_qs={\n", " \"fileLocations\": [\n", " {\n", " \"URIs\": []\n", " },\n", " {\n", " \"URIPrefixes\": [\n", " \"s3://\" + bucket_name + \"/output/\"\n", " ]\n", " }\n", " ],\n", " \"globalUploadSettings\": {\n", " \"format\": \"CSV\",\n", " \"delimiter\": \",\",\n", " \"textqualifier\": \"'\",\n", " \"containsHeader\": \"true\"\n", " }\n", "}" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "!mkdir -p manifest_for_quicksight" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "with open('./manifest_for_quicksight/manifest_uk_sales_pred.json', 'w') as f:\n", " json.dump(manifest_for_qs, f, indent=2, ensure_ascii=False)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "bucket.upload_file('./manifest_for_quicksight/manifest_uk_sales_pred.json',\n", " 'manifest_for_quicksight/manifest_uk_sales_pred.json')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.NEXT\n", "Manually run the forecast with Amazon Forecast. export the forecast results to S3 and visualize them in Amazon QuickSight. \n", "When the visualization is complete, run 2_build_forecast_pipeline.ipynb to build an automatic forecast pipeline. " ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }