{ "cells": [ { "cell_type": "markdown", "id": "632e4b69", "metadata": {}, "source": [ "## (Optional) Data Cleaning and Preparation\n", "\n", "This notebook cleans and prepares the [UCI Online Retail II Data Set](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II) for use with clustering algorithms. \n", "\n", "Please download the latest copy of the dataset from the following UCI link and place it in a directory called `data` before executing the notebook.\n", "\n", "File Location: https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx\n", "\n", "*Reference*:\n", " * Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science." ] }, { "cell_type": "code", "execution_count": 1, "id": "a7385433", "metadata": {}, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "6fac6da3", "metadata": {}, "outputs": [], "source": [ "# create data file directory\n", "dir_path = './data'\n", "\n", "if not os.path.exists(dir_path):\n", " os.makedirs(dir_path)" ] }, { "cell_type": "markdown", "id": "70e93b36", "metadata": {}, "source": [ "Please download the latest copy of the source dataset from the UCI repo and place it in the `data` directory.\n", "\n", "File Location: https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx" ] }, { "cell_type": "code", "execution_count": 3, "id": "4f2fb516", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com\n", "Requirement already satisfied: openpyxl in /home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages (3.0.9)\n", "Requirement already satisfied: et-xmlfile in /home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages (from openpyxl) (1.0.1)\n", "\u001b[33mWARNING: You are using pip version 22.0.4; however, version 22.3.1 is available.\n", "You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.\u001b[0m\u001b[33m\n", "\u001b[0mNote: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install openpyxl" ] }, { "cell_type": "code", "execution_count": 4, "id": "3042bb4e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(525461, 8) Index(['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',\n", " 'Price', 'Customer ID', 'Country'],\n", " dtype='object')\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceStockCodeDescriptionQuantityInvoiceDatePriceCustomer IDCountry
04894348504815CM CHRISTMAS GLASS BALL 20 LIGHTS122009-12-01 07:45:006.9513085.0United Kingdom
148943479323PPINK CHERRY LIGHTS122009-12-01 07:45:006.7513085.0United Kingdom
248943479323WWHITE CHERRY LIGHTS122009-12-01 07:45:006.7513085.0United Kingdom
348943422041RECORD FRAME 7\" SINGLE SIZE482009-12-01 07:45:002.1013085.0United Kingdom
448943421232STRAWBERRY CERAMIC TRINKET BOX242009-12-01 07:45:001.2513085.0United Kingdom
\n", "
" ], "text/plain": [ " Invoice StockCode Description Quantity \\\n", "0 489434 85048 15CM CHRISTMAS GLASS BALL 20 LIGHTS 12 \n", "1 489434 79323P PINK CHERRY LIGHTS 12 \n", "2 489434 79323W WHITE CHERRY LIGHTS 12 \n", "3 489434 22041 RECORD FRAME 7\" SINGLE SIZE 48 \n", "4 489434 21232 STRAWBERRY CERAMIC TRINKET BOX 24 \n", "\n", " InvoiceDate Price Customer ID Country \n", "0 2009-12-01 07:45:00 6.95 13085.0 United Kingdom \n", "1 2009-12-01 07:45:00 6.75 13085.0 United Kingdom \n", "2 2009-12-01 07:45:00 6.75 13085.0 United Kingdom \n", "3 2009-12-01 07:45:00 2.10 13085.0 United Kingdom \n", "4 2009-12-01 07:45:00 1.25 13085.0 United Kingdom " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read the raw xlsx data\n", "df_raw = pd.read_excel('./data/online_retail_II.xlsx',\n", " engine='openpyxl')\n", "\n", "print(df_raw.shape, df_raw.columns)\n", "\n", "df_raw.head()" ] }, { "cell_type": "markdown", "id": "3e496fd6", "metadata": {}, "source": [ "### Data Preparation\n", "\n", "Clean, fill missing data (zero since it represents no sales), and restructure data to match clustering algorithm requirements." ] }, { "cell_type": "code", "execution_count": 5, "id": "8fb68578", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(272031, 3) Index(['StockCode', 'timestamp', 'Quantity'], dtype='object') StockCode object\n", "timestamp object\n", "Quantity int64\n", "dtype: object\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipykernel_7678/715102072.py:4: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df_clean['timestamp'] = df_clean['InvoiceDate'].dt.date\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StockCodetimestampQuantity
0100022009-12-0112
1100022009-12-037
2100022009-12-0473
3100022009-12-0649
4100022009-12-072
\n", "
" ], "text/plain": [ " StockCode timestamp Quantity\n", "0 10002 2009-12-01 12\n", "1 10002 2009-12-03 7\n", "2 10002 2009-12-04 73\n", "3 10002 2009-12-06 49\n", "4 10002 2009-12-07 2" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# subset data needed for analysis and roll up to daily frequency\n", "df_clean = df_raw[['StockCode', 'InvoiceDate', 'Quantity']]\n", "\n", "df_clean['timestamp'] = df_clean['InvoiceDate'].dt.date\n", "\n", "df_clean = (df_clean\n", " .groupby(['StockCode', 'timestamp'])['Quantity']\n", " .agg('sum')\n", " .reset_index())\n", "\n", "print(df_clean.shape, df_clean.columns, df_clean.dtypes)\n", "\n", "df_clean.head()" ] }, { "cell_type": "code", "execution_count": 6, "id": "9abea5f2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(307, 4632) Index([ 10002, 10080, 10109, 10120,\n", " 10125, 10133, 10134, 10135,\n", " 10138, 11001,\n", " ...\n", " 'gift_0001_10', 'gift_0001_20', 'gift_0001_30', 'gift_0001_40',\n", " 'gift_0001_50', 'gift_0001_60', 'gift_0001_70', 'gift_0001_80',\n", " 'gift_0001_90', 'm'],\n", " dtype='object', name='StockCode', length=4632)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StockCode10002100801010910120101251013310134101351013811001...gift_0001_10gift_0001_20gift_0001_30gift_0001_40gift_0001_50gift_0001_60gift_0001_70gift_0001_80gift_0001_90m
timestamp
2009-12-0112.0NaNNaN60.05.06.0NaN17.0NaN2.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2009-12-02NaN1.0NaN10.0NaNNaNNaN20.096.069.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2009-12-037.0NaN4.0NaNNaN40.013.014.035.02.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2009-12-0473.03.0NaN30.046.025.0NaN10.012.0NaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2009-12-05NaNNaNNaNNaNNaN10.010.0NaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 4632 columns

\n", "
" ], "text/plain": [ "StockCode 10002 10080 10109 10120 10125 10133 10134 10135 10138 \\\n", "timestamp \n", "2009-12-01 12.0 NaN NaN 60.0 5.0 6.0 NaN 17.0 NaN \n", "2009-12-02 NaN 1.0 NaN 10.0 NaN NaN NaN 20.0 96.0 \n", "2009-12-03 7.0 NaN 4.0 NaN NaN 40.0 13.0 14.0 35.0 \n", "2009-12-04 73.0 3.0 NaN 30.0 46.0 25.0 NaN 10.0 12.0 \n", "2009-12-05 NaN NaN NaN NaN NaN 10.0 10.0 NaN NaN \n", "\n", "StockCode 11001 ... gift_0001_10 gift_0001_20 gift_0001_30 \\\n", "timestamp ... \n", "2009-12-01 2.0 ... NaN NaN NaN \n", "2009-12-02 69.0 ... NaN NaN NaN \n", "2009-12-03 2.0 ... NaN NaN NaN \n", "2009-12-04 NaN ... NaN NaN NaN \n", "2009-12-05 NaN ... NaN NaN NaN \n", "\n", "StockCode gift_0001_40 gift_0001_50 gift_0001_60 gift_0001_70 \\\n", "timestamp \n", "2009-12-01 NaN NaN NaN NaN \n", "2009-12-02 NaN NaN NaN NaN \n", "2009-12-03 NaN NaN NaN NaN \n", "2009-12-04 NaN NaN NaN NaN \n", "2009-12-05 NaN NaN NaN NaN \n", "\n", "StockCode gift_0001_80 gift_0001_90 m \n", "timestamp \n", "2009-12-01 NaN NaN NaN \n", "2009-12-02 NaN NaN NaN \n", "2009-12-03 NaN NaN NaN \n", "2009-12-04 NaN NaN NaN \n", "2009-12-05 NaN NaN NaN \n", "\n", "[5 rows x 4632 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pivot = df_clean.pivot(index='timestamp',\n", " columns='StockCode',\n", " values='Quantity')\n", "\n", "print(df_pivot.shape, df_pivot.columns)\n", "\n", "df_pivot.head()" ] }, { "cell_type": "code", "execution_count": 7, "id": "7a4ed896", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['90214D', '90214E', '90214F', '90214G', '90214H', '90214I', '90214J', '90214K', '90214L', '90214M', '90214N', '90214O', '90214P', '90214R', '90214S', '90214T', '90214U', '90214V', '90214W', '90214Y', '90214Z', 'ADJUST', 'ADJUST2', 'AMAZONFEE', 'B', 'BANK CHARGES', 'C2', 'C3', 'D', 'DCGS0003', 'DCGS0004', 'DCGS0006', 'DCGS0016', 'DCGS0027', 'DCGS0036', 'DCGS0037', 'DCGS0039', 'DCGS0041', 'DCGS0044', 'DCGS0053', 'DCGS0056', 'DCGS0058', 'DCGS0059', 'DCGS0060', 'DCGS0062', 'DCGS0066N', 'DCGS0068', 'DCGS0069', 'DCGS0070', 'DCGS0072', 'DCGS0075', 'DCGS0076', 'DCGSLBOY', 'DCGSLGIRL', 'DCGSSBOY', 'DCGSSGIRL', 'DOT', 'GIFT', 'M', 'PADS', 'POST', 'S', 'SP1002', 'TEST001', 'TEST002', 'gift_0001_10', 'gift_0001_20', 'gift_0001_30', 'gift_0001_40', 'gift_0001_50', 'gift_0001_60', 'gift_0001_70', 'gift_0001_80', 'gift_0001_90', 'm']\n" ] } ], "source": [ "# columns with unusual stock code data\n", "print(list(df_pivot.columns)[-75:])" ] }, { "cell_type": "code", "execution_count": 8, "id": "118e5f77", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(307, 4605) Index([ 10002, 10080, 10109, 10120, 10125,\n", " 10133, 10134, 10135, 10138, 11001,\n", " ...\n", " 'DCGS0068', 'DCGS0069', 'DCGS0070', 'DCGS0072', 'DCGS0075',\n", " 'DCGS0076', 'DCGSLBOY', 'DCGSLGIRL', 'DCGSSBOY', 'DCGSSGIRL'],\n", " dtype='object', name='StockCode', length=4605)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StockCode10002100801010910120101251013310134101351013811001...DCGS0068DCGS0069DCGS0070DCGS0072DCGS0075DCGS0076DCGSLBOYDCGSLGIRLDCGSSBOYDCGSSGIRL
timestamp
2009-12-0112.0NaNNaN60.05.06.0NaN17.0NaN2.0...1.0NaNNaNNaNNaNNaNNaNNaNNaNNaN
2009-12-02NaN1.0NaN10.0NaNNaNNaN20.096.069.0...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2009-12-037.0NaN4.0NaNNaN40.013.014.035.02.0...NaNNaNNaNNaNNaN1.0NaNNaNNaNNaN
2009-12-0473.03.0NaN30.046.025.0NaN10.012.0NaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2009-12-05NaNNaNNaNNaNNaN10.010.0NaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 4605 columns

\n", "
" ], "text/plain": [ "StockCode 10002 10080 10109 10120 10125 10133 10134 10135 10138 \\\n", "timestamp \n", "2009-12-01 12.0 NaN NaN 60.0 5.0 6.0 NaN 17.0 NaN \n", "2009-12-02 NaN 1.0 NaN 10.0 NaN NaN NaN 20.0 96.0 \n", "2009-12-03 7.0 NaN 4.0 NaN NaN 40.0 13.0 14.0 35.0 \n", "2009-12-04 73.0 3.0 NaN 30.0 46.0 25.0 NaN 10.0 12.0 \n", "2009-12-05 NaN NaN NaN NaN NaN 10.0 10.0 NaN NaN \n", "\n", "StockCode 11001 ... DCGS0068 DCGS0069 DCGS0070 DCGS0072 DCGS0075 \\\n", "timestamp ... \n", "2009-12-01 2.0 ... 1.0 NaN NaN NaN NaN \n", "2009-12-02 69.0 ... NaN NaN NaN NaN NaN \n", "2009-12-03 2.0 ... NaN NaN NaN NaN NaN \n", "2009-12-04 NaN ... NaN NaN NaN NaN NaN \n", "2009-12-05 NaN ... NaN NaN NaN NaN NaN \n", "\n", "StockCode DCGS0076 DCGSLBOY DCGSLGIRL DCGSSBOY DCGSSGIRL \n", "timestamp \n", "2009-12-01 NaN NaN NaN NaN NaN \n", "2009-12-02 NaN NaN NaN NaN NaN \n", "2009-12-03 1.0 NaN NaN NaN NaN \n", "2009-12-04 NaN NaN NaN NaN NaN \n", "2009-12-05 NaN NaN NaN NaN NaN \n", "\n", "[5 rows x 4605 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop columns with unusual stock code data\n", "drop_cols = ['ADJUST', 'ADJUST2', 'AMAZONFEE', 'B', 'BANK CHARGES', 'C2', 'C3', 'D', \n", " 'DOT', 'GIFT', 'M', 'PADS', 'POST', 'S', 'SP1002', 'TEST001', 'TEST002',\n", " 'gift_0001_10', 'gift_0001_20', 'gift_0001_30', 'gift_0001_40', 'gift_0001_50',\n", " 'gift_0001_60', 'gift_0001_70', 'gift_0001_80', 'gift_0001_90', 'm']\n", "\n", "df_pivot.drop(columns=drop_cols, inplace=True)\n", "\n", "print(df_pivot.shape, df_pivot.columns)\n", "\n", "df_pivot.head()" ] }, { "cell_type": "code", "execution_count": 9, "id": "7432d181", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(374, 4605) Index([ 10002, 10080, 10109, 10120, 10125,\n", " 10133, 10134, 10135, 10138, 11001,\n", " ...\n", " 'DCGS0068', 'DCGS0069', 'DCGS0070', 'DCGS0072', 'DCGS0075',\n", " 'DCGS0076', 'DCGSLBOY', 'DCGSLGIRL', 'DCGSSBOY', 'DCGSSGIRL'],\n", " dtype='object', name='StockCode', length=4605)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StockCode10002100801010910120101251013310134101351013811001...DCGS0068DCGS0069DCGS0070DCGS0072DCGS0075DCGS0076DCGSLBOYDCGSLGIRLDCGSSBOYDCGSSGIRL
timestamp
2009-12-0112.00.00.060.05.06.00.017.00.02.0...1.00.00.00.00.00.00.00.00.00.0
2009-12-020.01.00.010.00.00.00.020.096.069.0...0.00.00.00.00.00.00.00.00.00.0
2009-12-037.00.04.00.00.040.013.014.035.02.0...0.00.00.00.00.01.00.00.00.00.0
2009-12-0473.03.00.030.046.025.00.010.012.00.0...0.00.00.00.00.00.00.00.00.00.0
2009-12-050.00.00.00.00.010.010.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", "

5 rows × 4605 columns

\n", "
" ], "text/plain": [ "StockCode 10002 10080 10109 10120 10125 10133 10134 10135 10138 \\\n", "timestamp \n", "2009-12-01 12.0 0.0 0.0 60.0 5.0 6.0 0.0 17.0 0.0 \n", "2009-12-02 0.0 1.0 0.0 10.0 0.0 0.0 0.0 20.0 96.0 \n", "2009-12-03 7.0 0.0 4.0 0.0 0.0 40.0 13.0 14.0 35.0 \n", "2009-12-04 73.0 3.0 0.0 30.0 46.0 25.0 0.0 10.0 12.0 \n", "2009-12-05 0.0 0.0 0.0 0.0 0.0 10.0 10.0 0.0 0.0 \n", "\n", "StockCode 11001 ... DCGS0068 DCGS0069 DCGS0070 DCGS0072 DCGS0075 \\\n", "timestamp ... \n", "2009-12-01 2.0 ... 1.0 0.0 0.0 0.0 0.0 \n", "2009-12-02 69.0 ... 0.0 0.0 0.0 0.0 0.0 \n", "2009-12-03 2.0 ... 0.0 0.0 0.0 0.0 0.0 \n", "2009-12-04 0.0 ... 0.0 0.0 0.0 0.0 0.0 \n", "2009-12-05 0.0 ... 0.0 0.0 0.0 0.0 0.0 \n", "\n", "StockCode DCGS0076 DCGSLBOY DCGSLGIRL DCGSSBOY DCGSSGIRL \n", "timestamp \n", "2009-12-01 0.0 0.0 0.0 0.0 0.0 \n", "2009-12-02 0.0 0.0 0.0 0.0 0.0 \n", "2009-12-03 1.0 0.0 0.0 0.0 0.0 \n", "2009-12-04 0.0 0.0 0.0 0.0 0.0 \n", "2009-12-05 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 4605 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# resample time series data and fill missing values with 0s\n", "df_pivot.index = pd.DatetimeIndex(df_pivot.index)\n", "\n", "df_pivot = df_pivot.resample('D').sum().fillna(0)\n", "\n", "print(df_pivot.shape, df_pivot.columns)\n", "\n", "df_pivot.head()" ] }, { "cell_type": "code", "execution_count": 10, "id": "216fb961", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(4605, 375) Index([ 'StockCode', 2009-12-01 00:00:00, 2009-12-02 00:00:00,\n", " 2009-12-03 00:00:00, 2009-12-04 00:00:00, 2009-12-05 00:00:00,\n", " 2009-12-06 00:00:00, 2009-12-07 00:00:00, 2009-12-08 00:00:00,\n", " 2009-12-09 00:00:00,\n", " ...\n", " 2010-11-30 00:00:00, 2010-12-01 00:00:00, 2010-12-02 00:00:00,\n", " 2010-12-03 00:00:00, 2010-12-04 00:00:00, 2010-12-05 00:00:00,\n", " 2010-12-06 00:00:00, 2010-12-07 00:00:00, 2010-12-08 00:00:00,\n", " 2010-12-09 00:00:00],\n", " dtype='object', name='timestamp', length=375)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
timestampStockCode2009-12-01 00:00:002009-12-02 00:00:002009-12-03 00:00:002009-12-04 00:00:002009-12-05 00:00:002009-12-06 00:00:002009-12-07 00:00:002009-12-08 00:00:002009-12-09 00:00:00...2010-11-30 00:00:002010-12-01 00:00:002010-12-02 00:00:002010-12-03 00:00:002010-12-04 00:00:002010-12-05 00:00:002010-12-06 00:00:002010-12-07 00:00:002010-12-08 00:00:002010-12-09 00:00:00
01000212.00.07.073.00.049.02.012.00.0...12.060.01.08.00.01.025.08.013.044.0
1100800.01.00.03.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
2101090.00.04.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
31012060.010.00.030.00.00.00.01.01.0...10.00.00.03.00.00.00.00.012.00.0
4101255.00.00.046.00.08.020.01.022.0...0.02.00.00.00.00.03.00.040.00.0
\n", "

5 rows × 375 columns

\n", "
" ], "text/plain": [ "timestamp StockCode 2009-12-01 00:00:00 2009-12-02 00:00:00 \\\n", "0 10002 12.0 0.0 \n", "1 10080 0.0 1.0 \n", "2 10109 0.0 0.0 \n", "3 10120 60.0 10.0 \n", "4 10125 5.0 0.0 \n", "\n", "timestamp 2009-12-03 00:00:00 2009-12-04 00:00:00 2009-12-05 00:00:00 \\\n", "0 7.0 73.0 0.0 \n", "1 0.0 3.0 0.0 \n", "2 4.0 0.0 0.0 \n", "3 0.0 30.0 0.0 \n", "4 0.0 46.0 0.0 \n", "\n", "timestamp 2009-12-06 00:00:00 2009-12-07 00:00:00 2009-12-08 00:00:00 \\\n", "0 49.0 2.0 12.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 1.0 \n", "4 8.0 20.0 1.0 \n", "\n", "timestamp 2009-12-09 00:00:00 ... 2010-11-30 00:00:00 2010-12-01 00:00:00 \\\n", "0 0.0 ... 12.0 60.0 \n", "1 0.0 ... 0.0 0.0 \n", "2 0.0 ... 0.0 0.0 \n", "3 1.0 ... 10.0 0.0 \n", "4 22.0 ... 0.0 2.0 \n", "\n", "timestamp 2010-12-02 00:00:00 2010-12-03 00:00:00 2010-12-04 00:00:00 \\\n", "0 1.0 8.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 3.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", "timestamp 2010-12-05 00:00:00 2010-12-06 00:00:00 2010-12-07 00:00:00 \\\n", "0 1.0 25.0 8.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 3.0 0.0 \n", "\n", "timestamp 2010-12-08 00:00:00 2010-12-09 00:00:00 \n", "0 13.0 44.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 12.0 0.0 \n", "4 40.0 0.0 \n", "\n", "[5 rows x 375 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# transpose data to match format neeed for further processing\n", "df_final = df_pivot.T\n", "df_final = df_final.reset_index()\n", "\n", "print(df_final.shape, df_final.columns)\n", "\n", "df_final.head()" ] }, { "cell_type": "code", "execution_count": 11, "id": "d66fc455", "metadata": {}, "outputs": [], "source": [ "# back up data -> used for clustering and Forecast training in later notebooks\n", "df_final.to_csv('./data/df_pivoted.csv.zip', index=None)" ] }, { "cell_type": "markdown", "id": "e2c0763c", "metadata": {}, "source": [ "### End of processing" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }