# Partition data 
**This notebook partitions the `orders` data into monthly batches, which the other notebooks in this module use for batch ingestion into the feature store.**

**Note:** Set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`.

---

## Contents

1. [Setup](#Setup)
1. [Load data](#Load-data)
1. [Groupby and partition](#Groupby-and-partition)
1. [Copy partitions from local to S3](#Copy-partitions-from-local-to-S3)

# Setup

#### Imports 

In [None]:
import pandas as pd
import sagemaker
import shutil
import os

#### Essentials

In [None]:
sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker-feature-store'

# Load data

#### Read `orders` data

In [None]:
df = pd.read_csv('../data/raw/orders.csv')

In [None]:
# Derive a 'year-month' partition key from `purchased_on` (the month is not zero-padded)
df['year_month'] = df['purchased_on'].apply(
    lambda x: '-'.join([str(pd.to_datetime(x).year), str(pd.to_datetime(x).month)])
)

In [None]:
df.head(5)
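
The `year_month` key is built row by row, parsing each date twice. As a minimal sketch, the same key can be derived by parsing the column once and using the `.dt` accessor. The sample rows below are hypothetical; only the `purchased_on` column name comes from this notebook. The zero-padded variant is an optional alternative, not what the notebook uses, and it would change the partition directory names.

```python
import pandas as pd

# Hypothetical sample rows; only the `purchased_on` column name comes from the notebook.
sample = pd.DataFrame({'purchased_on': ['2020-01-15', '2020-11-03', '2021-02-28']})

# Parse the whole column once instead of parsing each value twice per row.
purchased_on = pd.to_datetime(sample['purchased_on'])

# Same key format as the notebook: the month is not zero-padded.
sample['year_month'] = (
    purchased_on.dt.year.astype(str) + '-' + purchased_on.dt.month.astype(str)
)

# Zero-padded alternative, which sorts chronologically when compared as a string.
sample['year_month_padded'] = purchased_on.dt.strftime('%Y-%m')

print(sample)
```

If the padded form were adopted, the partition folders would be named like `2020-01` rather than `2020-1`, so any downstream notebook that reads those paths would need to agree on the format.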

# Groupby and partition 
Group the `orders` dataframe by year and month, then split it into one dataframe per group.

In [None]:
grouped = df.groupby('year_month')

In [None]:
for name, group in grouped:
    partition_dir = f'../data/partitions/{name}'
    if os.path.exists(partition_dir):
        shutil.rmtree(partition_dir)
    os.makedirs(partition_dir)
    partition_df = group.drop('year_month', axis=1)
    partition_df.to_csv(f'{partition_dir}/partition.csv', index=False)
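
To illustrate the layout this loop produces, the following sketch runs the same groupby-and-write pattern on hypothetical toy data and reads the partitions back. The `order_id` column and the temporary directory are assumptions made for the example; the notebook itself writes under `../data/partitions`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data; `order_id` is an illustrative column, not taken from the dataset.
toy = pd.DataFrame({
    'order_id': [1, 2, 3],
    'year_month': ['2020-1', '2020-1', '2020-2'],
})

# Write into a throwaway directory rather than ../data/partitions.
root = tempfile.mkdtemp()

for name, group in toy.groupby('year_month'):
    partition_dir = os.path.join(root, name)
    # exist_ok avoids an error if the directory is already present.
    os.makedirs(partition_dir, exist_ok=True)
    # Drop the helper key so each partition keeps only the original columns.
    group.drop('year_month', axis=1).to_csv(
        os.path.join(partition_dir, 'partition.csv'), index=False
    )

# Read each partition back and count its rows, keyed by partition name.
row_counts = {
    name: len(pd.read_csv(os.path.join(root, name, 'partition.csv')))
    for name in sorted(os.listdir(root))
}
print(row_counts)
```

Each distinct `year_month` value becomes one directory containing a single `partition.csv`.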

# Copy partitions from local to S3 

In [None]:
!aws s3 cp ../data/partitions/ s3://{default_bucket}/{prefix}/partitions/ --recursive
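
If the AWS CLI is not available in the kernel, the same copy could be done from Python. The sketch below is one possible approach, not part of the original notebook: `plan_uploads` and `upload_partitions` are hypothetical helper names, and the upload step assumes `boto3` is installed and credentials are configured. The demo at the end only plans the S3 keys for a temporary directory and uploads nothing.

```python
import os
import tempfile


def plan_uploads(local_root, key_prefix):
    """Return (local_path, s3_key) pairs that mirror local_root under key_prefix."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for filename in filenames:
            local_path = os.path.join(dirpath, filename)
            relative = os.path.relpath(local_path, local_root)
            # S3 keys use forward slashes regardless of the local path separator.
            key = '/'.join([key_prefix] + relative.split(os.sep))
            pairs.append((local_path, key))
    return sorted(pairs, key=lambda pair: pair[1])


def upload_partitions(local_root, bucket, key_prefix):
    """Upload every planned file; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so planning works without boto3 installed

    s3 = boto3.client('s3')
    for local_path, key in plan_uploads(local_root, key_prefix):
        s3.upload_file(local_path, bucket, key)


# Demo: plan keys for a throwaway directory laid out like ../data/partitions.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, '2020-1'))
with open(os.path.join(demo_root, '2020-1', 'partition.csv'), 'w') as f:
    f.write('order_id\n1\n')

planned_keys = [key for _path, key in plan_uploads(demo_root, 'sagemaker-feature-store/partitions')]
print(planned_keys)
```

In this notebook the call would look like `upload_partitions('../data/partitions', default_bucket, f'{prefix}/partitions')`, which targets the same keys as the CLI command.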