{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocess dataset and upload to S3\n", "This notebook downloads the retail data and formats it for Amazon Forecast. It then uploads the formatted data to S3; afterwards, check in Step Functions that the Forecast pipeline is running." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.Download dataset\n", "We use the Online Retail II dataset from the UCI Machine Learning Repository, which records sales on an e-commerce site. \n", "https://archive.ics.uci.edu/ml/datasets/Online+Retail+II" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! wget https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx -P ./input" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.Load dataset\n", "Load the downloaded data and add a `sales` column computed as `Price * Quantity`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_excel('./input/online_retail_II.xlsx', sheet_name='Year 2009-2010')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['sales'] = df['Price'] * df['Quantity']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.Build dataset\n", "From the dataset, create two sets: one for initial training and one for automatic retraining through the pipeline.\n", "\n", "train: 2009/12/01 - 2010/12/02 \n", "train_added: 2009/12/01 - 2010/12/09" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2 = df[['Country', 'InvoiceDate', 'sales']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2 = df2.query('Country == \"United Kingdom\"')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.head()" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p output" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.to_csv('./output/tr_target_add_20091201_20101209.csv', header=False, index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tr1 = df2.query('InvoiceDate <= \"20101203\"')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tr1.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tr1.to_csv('./output/tr_target_20091201_20101202.csv', header=False, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.Upload dataset to S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boto3.__version__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sts = boto3.client('sts')\n", "id_info = sts.get_caller_identity()\n", "print(id_info['Account'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucket_name = 'workshop-timeseries-retail-' + id_info['Account'] + '-source'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucket_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.resource('s3')\n", "bucket = s3.Bucket(bucket_name)\n", "\n", "bucket.upload_file('./output/tr_target_add_20091201_20101209.csv', 'input/tr_target_add_20091201_20101209.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.NEXT\n", "From the console screen of Step Functions, you should see the pipeline running. This will take a bit of time. 
Once everything is complete, confirm that the Forecast results are stored in S3, then proceed to 3_visualization.ipynb to visualize the forecast." ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }