{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first step is to run this notebook and prepare a dataset for input into Amazon Forecast."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1.Download dataset\n",
    "We use data from the following sites to track sales on e-commerce sites.   \n",
    "https://archive.ics.uci.edu/ml/datasets/Online+Retail+II"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2020-08-01 05:51:39--  https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx\n",
      "Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252\n",
      "Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 45622278 (44M) [application/x-httpd-php]\n",
      "Saving to: ‘./input/online_retail_II.xlsx’\n",
      "\n",
      "online_retail_II.xl 100%[===================>]  43.51M  19.0MB/s    in 2.3s    \n",
      "\n",
      "2020-08-01 05:51:41 (19.0 MB/s) - ‘./input/online_retail_II.xlsx’ saved [45622278/45622278]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "! wget https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx -P ./input"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2.Load dataset\n",
    "Load the downloaded data and add a sales column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_excel('./input/online_retail_II.xlsx', sheet_name='Year 2009-2010')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['sales'] = df['Price'] * df['Quantity']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3.Build dataset\n",
    "From the dataset, create two sets, one for initial training and one for automatic training using the pipeline.\n",
    "\n",
    "train:2009/12/01 - 2010/12/02   \n",
    "train_added:2009/12/01 - 2010/12/09"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = df[['Country', 'InvoiceDate', 'sales']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = df2.query('Country == \"United Kingdom\"')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Country</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>sales</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>83.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>81.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>81.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>100.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          Country         InvoiceDate  sales\n",
       "0  United Kingdom 2009-12-01 07:45:00   83.4\n",
       "1  United Kingdom 2009-12-01 07:45:00   81.0\n",
       "2  United Kingdom 2009-12-01 07:45:00   81.0\n",
       "3  United Kingdom 2009-12-01 07:45:00  100.8\n",
       "4  United Kingdom 2009-12-01 07:45:00   30.0"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "df2.to_csv('./output/tr_target_add_20091201_20101209.csv', header=False, index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "tr1 = df2.query('InvoiceDate <= \"20101203\"')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Country</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>sales</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>508150</th>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>2010-12-02 19:59:00</td>\n",
       "      <td>3.40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>508151</th>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>2010-12-02 19:59:00</td>\n",
       "      <td>0.65</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>508152</th>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>2010-12-02 19:59:00</td>\n",
       "      <td>5.95</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>508153</th>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>2010-12-02 19:59:00</td>\n",
       "      <td>5.90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>508154</th>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>2010-12-02 19:59:00</td>\n",
       "      <td>9.90</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               Country         InvoiceDate  sales\n",
       "508150  United Kingdom 2010-12-02 19:59:00   3.40\n",
       "508151  United Kingdom 2010-12-02 19:59:00   0.65\n",
       "508152  United Kingdom 2010-12-02 19:59:00   5.95\n",
       "508153  United Kingdom 2010-12-02 19:59:00   5.90\n",
       "508154  United Kingdom 2010-12-02 19:59:00   9.90"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tr1.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "tr1.to_csv('./output/tr_target_20091201_20101202.csv', header=False, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.Upload dataset to S3\n",
    "Create a bucket in S3 and upload the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## make bucket"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'1.14.16'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "boto3.__version__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "805433377179\n"
     ]
    }
   ],
   "source": [
    "sts = boto3.client('sts')\n",
    "id_info = sts.get_caller_identity()\n",
    "print(id_info['Account'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3 = boto3.client('s3')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "bucket_name = 'demo-forecast-' + id_info['Account']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'demo-forecast-805433377179'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bucket_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'ResponseMetadata': {'RequestId': '3E78EF02EB2C9A75',\n",
       "  'HostId': 'WEvfjkpLyQGz0gcZavGvEqZ5qLjyKrOIa2dtKAlum3ztL9iaQ8Xj2JoVLjMCabAuYU3Nz4H6ivk=',\n",
       "  'HTTPStatusCode': 200,\n",
       "  'HTTPHeaders': {'x-amz-id-2': 'WEvfjkpLyQGz0gcZavGvEqZ5qLjyKrOIa2dtKAlum3ztL9iaQ8Xj2JoVLjMCabAuYU3Nz4H6ivk=',\n",
       "   'x-amz-request-id': '3E78EF02EB2C9A75',\n",
       "   'date': 'Sat, 01 Aug 2020 05:53:27 GMT',\n",
       "   'location': '/demo-forecast-805433377179',\n",
       "   'content-length': '0',\n",
       "   'server': 'AmazonS3'},\n",
       "  'RetryAttempts': 0},\n",
       " 'Location': '/demo-forecast-805433377179'}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s3.create_bucket(Bucket=bucket_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## upload dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3 = boto3.resource('s3')\n",
    "bucket = s3.Bucket(bucket_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "bucket.upload_file('./output/tr_target_20091201_20101202.csv',\n",
    "                   'input/tr_target_20091201_20101202.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "bucket.upload_file('./output/tr_target_add_20091201_20101209.csv',\n",
    "                   'input/tr_target_add_20091201_20101209.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## upload manifest file\n",
    "Create a manifest file for use in Quick Sight and upload it to S3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "manifest_for_qs={\n",
    "  \"fileLocations\": [\n",
    "    {\n",
    "      \"URIs\": []\n",
    "    },\n",
    "    {\n",
    "      \"URIPrefixes\": [\n",
    "        \"s3://\" + bucket_name + \"/output/\"\n",
    "      ]\n",
    "    }\n",
    "  ],\n",
    "  \"globalUploadSettings\": {\n",
    "    \"format\": \"CSV\",\n",
    "    \"delimiter\": \",\",\n",
    "    \"textqualifier\": \"'\",\n",
    "    \"containsHeader\": \"true\"\n",
    "  }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p manifest_for_quicksight"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('./manifest_for_quicksight/manifest_uk_sales_pred.json', 'w') as f:\n",
    "    json.dump(manifest_for_qs, f, indent=2, ensure_ascii=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "bucket.upload_file('./manifest_for_quicksight/manifest_uk_sales_pred.json',\n",
    "                   'manifest_for_quicksight/manifest_uk_sales_pred.json')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.NEXT\n",
    "Manually run the forecast with Amazon Forecast. export the forecast results to S3 and visualize them in Amazon QuickSight.    \n",
    "When the visualization is complete, run 2_build_forecast_pipeline.ipynb to build an automatic forecast pipeline.  "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}