# Module 1: Prepare datasets 
**This notebook generates datasets and transforms them into an ML-ready state for ingestion into SageMaker Feature Store.**

**Note:** Set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`.


---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Generate Online Grocery shopping dataset](#Generate-Online-Grocery-shopping-dataset)
1. [Transform raw features into Machine Learning ready features](#Transform-raw-features-into-Machine-Learning-ready-features)

# Background

This notebook generates a relational set of files describing customers’ orders over time.

The notebook is preset to generate 100,000 synthetic grocery orders for a list of 10,000 synthetic customers. Each order is assigned to a randomly chosen customer (about 10 orders per customer on average) and a randomly chosen product, and it carries a timestamp of when the order was placed. The generated dataset and the example notebooks in this repository illustrate how to use **SageMaker Feature Store** to predict which products will be in a user’s next order, so that a grocery retailer can improve how consumers discover and purchase groceries online.

# Setup

##### Prerequisites

In [None]:
%%capture
!pip install faker
!pip install python-dateutil

### Create sub-directory structure for data files

Note: If you re-run this cell, the `mkdir` command prints 'File exists' warnings; these can be ignored.

In [None]:
# create sub-directories
!mkdir ../data/test
!mkdir ../data/train
!mkdir ../data/transformed
!mkdir ../data/partitions
!mkdir ../data/validation

#### Imports 

In [None]:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from datetime import datetime, timezone, date
from faker import Faker
import pandas as pd
import numpy as np
import hashlib
import logging
import random

Set the locale and seeds for reproducibility

In [None]:
faker = Faker()
faker.seed_locale('en_US', 0)

In [None]:
SEED = 123
random.seed(SEED)
np.random.seed(SEED)
faker.seed_instance(SEED)
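
To illustrate why the seeds are fixed, here is a minimal sketch showing that a generator seeded with the same value returns the same sequence. It uses only the standard library; the helper name `draw_sample` is hypothetical and not part of this notebook, and it uses a local `random.Random` instance so the global state seeded above is left untouched.

```python
import random


def draw_sample(seed: int, k: int = 5) -> list:
    """Draw k integers in [1, 100] from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator; does not touch the global state
    return [rng.randint(1, 100) for _ in range(k)]


first = draw_sample(123)
second = draw_sample(123)
print(first == second)  # the two runs produce the same sequence
```

The same idea applies to `np.random.seed` and `faker.seed_instance`: re-running the notebook from the top with the same seeds should yield the same synthetic records, apart from values derived from the current time.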

In [None]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

In [None]:
logger.info(f'Using Pandas version: {pd.__version__}')

#### Helper functions 

In [None]:
def generate_timestamp(start, end) -> str:
    # random timestamp between two 'YYYY-MM-DD HH:MM:SS' strings
    start = datetime.strptime(start, '%Y-%m-%d %H:%M:%S')
    end = datetime.strptime(end, '%Y-%m-%d %H:%M:%S')
    timestamp = faker.date_time_between(start_date=start, end_date=end, tzinfo=None).strftime('%Y-%m-%d %H:%M:%S')
    return timestamp

In [None]:
def generate_date(start, end) -> str:
    # random date between two 'YYYY-MM-DD' strings
    start = datetime.strptime(start, '%Y-%m-%d')
    end = datetime.strptime(end, '%Y-%m-%d')
    generated_date = faker.date_between_dates(date_start=start, date_end=end).strftime('%Y-%m-%d')
    return generated_date

In [None]:
def get_md5_hash(string: str) -> str:
    hash_object = hashlib.md5(string.encode())
    return hash_object.hexdigest()

In [None]:
def generate_event_timestamp():
    # naive datetime representing local time
    naive_dt = datetime.now()
    # take timezone into account
    aware_dt = naive_dt.astimezone()
    # time in UTC
    utc_dt = aware_dt.astimezone(timezone.utc)
    # transform to ISO-8601 format
    event_time = utc_dt.isoformat(timespec='milliseconds')
    event_time = event_time.replace('+00:00', 'Z')
    return event_time
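
As a quick check of the string format that `generate_event_timestamp` produces, the sketch below applies the same formatting steps to a fixed UTC datetime instead of the current time. The helper `to_event_time` and the fixed datetime are illustrative assumptions, not part of this notebook.

```python
from datetime import datetime, timezone


def to_event_time(dt: datetime) -> str:
    """Format an aware datetime as ISO-8601 in UTC with millisecond precision."""
    utc_dt = dt.astimezone(timezone.utc)
    # isoformat() ends with '+00:00' for UTC; replace it with the 'Z' suffix
    return utc_dt.isoformat(timespec='milliseconds').replace('+00:00', 'Z')


fixed = datetime(2021, 6, 1, 12, 30, 45, 123456, tzinfo=timezone.utc)
event_time = to_event_time(fixed)
print(event_time)  # 2021-06-01T12:30:45.123Z
```

Note that `timespec='milliseconds'` truncates the microseconds rather than rounding them.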

# Generate Online Grocery shopping dataset

### Generate random synthetic Customer profiles 

In [None]:
class Customer:
    def __init__(self):
        self.customer_id = None
        self.name = None
        self.sex = None
        self.state = None
        self.age = None
        self.is_married = None
        self.active_since = None
        self.event_time = None

    def as_dict(self):
        return {'customer_id': self.customer_id,
                'name': self.name,
                'sex': self.sex,
                'state': self.state,
                'age': self.age,
                'is_married': self.is_married,
                'active_since': self.active_since,
                'event_time': self.event_time
                }

In [None]:
def generate_customer(i) -> Customer:
    customer = Customer()
    profile = faker.profile()
    customer.customer_id = f'C{i}'
    customer.name = profile['name'].lower()
    customer.sex = profile['sex']
    customer.state = faker.state().lower()
    customer.age = random.randint(18, 91)  # both bounds are inclusive
    customer.is_married = faker.boolean()
    customer.active_since = generate_timestamp('2016-01-01 00:00:00', '2020-01-01 00:00:01')
    customer.event_time = generate_event_timestamp()
    return customer

In [None]:
customer = generate_customer(1)
customer.__dict__

In [None]:
customers = []
customer_ids = []
n = 10000  # number of synthetic customers to generate
for i in range(n):
    customer = generate_customer(i + 1)
    customers.append(customer)
    customer_ids.append(customer.customer_id)

In [None]:
customers_df = pd.DataFrame([customer.as_dict() for customer in customers])

In [None]:
customers_df.head(5)

### Generate random synthetic purchase orders 

Load the raw `products` data

In [None]:
products_df = pd.read_csv('../data/raw/product_category_mapping.csv')
products_df['product_name'] = products_df['product_name'].str.lower()
products_df['product_category'] = products_df['product_category'].str.lower()
products_df.head(5)

Add an event timestamp to each feature record

In [None]:
event_timestamps = [generate_event_timestamp() for _ in range(len(products_df))]
products_df['event_time'] = event_timestamps
products_df.head(5)

In [None]:
product_ids = products_df['product_id'].tolist()

##### Generate purchase orders specific to customers 

In [None]:
class Order:
    def __init__(self):
        self.order_id = None
        self.customer_id = None
        self.product_id = None
        self.purchase_amount = None
        self.is_reordered = None
        self.purchased_on = None
        self.event_time = None

    def as_dict(self):
        return {'order_id': self.order_id,
                'customer_id': self.customer_id,
                'product_id': self.product_id,
                'purchase_amount': self.purchase_amount,
                'is_reordered': self.is_reordered,
                'purchased_on': self.purchased_on,
                'event_time': self.event_time}

In [None]:
def generate_order(i) -> Order:
    order = Order()
    order.order_id = f'O{i}'
    order.customer_id = random.choice(customer_ids)
    order.product_id = random.choice(product_ids)
    # round the sum so the amount keeps at most two decimal places
    order.purchase_amount = round(random.randint(1, 101) + random.random(), 2)
    order.is_reordered = random.choice([1, 1, 0])  # assume a reorder is twice as likely as a first-time order
    order.purchased_on = generate_timestamp('2020-01-01 00:01:01', '2021-06-01 00:00:01')
    order.event_time = generate_event_timestamp()
    return order
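
The `is_reordered` flag is drawn from a three-element list in which `1` appears twice, so a reorder should come up about two times out of three. The following sketch checks that both exactly and empirically; it uses only the standard library, and the variable names and the sample size of 10,000 are illustrative choices.

```python
from collections import Counter
from fractions import Fraction
import random

choices = [1, 1, 0]
counts = Counter(choices)
# exact probability of drawing 1 from the list
p_reordered = Fraction(counts[1], len(choices))

# empirical frequency from a local, seeded generator
rng = random.Random(0)
draws = [rng.choice(choices) for _ in range(10_000)]
observed = sum(draws) / len(draws)
print(p_reordered, round(observed, 3))
```

Changing the ratio only requires changing the list, for example `[1, 0]` for an even split.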

In [None]:
order = generate_order(1)
order.__dict__

In [None]:
orders = []
n = 100000  # number of synthetic orders to generate
for i in range(n):
    order = generate_order(i + 1)
    orders.append(order)

In [None]:
orders_df = pd.DataFrame([order.as_dict() for order in orders])
orders_df.head(5)

In [None]:
orders_df.shape

#### Write generated customers, products and orders data to local directory

In [None]:
customers_df.to_csv('../data/raw/customers.csv', index=False)

In [None]:
products_df.to_csv('../data/raw/products.csv', index=False)

In [None]:
orders_df.to_csv('../data/raw/orders.csv', index=False)

# Transform raw features into Machine Learning ready features

#### I) Transform raw `customers` data

In [None]:
customers_df = pd.read_csv('../data/raw/customers.csv')
customers_df.head(5)

In [None]:
label_encoder = LabelEncoder()
min_max_scaler = MinMaxScaler()

In [None]:
customers_df.drop('name', axis=1, inplace=True)
customers_df.drop('state', axis=1, inplace=True)

In [None]:
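# left-closed bins so that e.g. age 30 lands in '30-39'; the open-ended last
# bin also covers the oldest generated age (91)
bins = [18, 30, 40, 50, 60, 70, np.inf]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70-plus']
customers_df['age_range'] = pd.cut(customers_df.age, bins, labels=labels, right=False)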
customers_df = pd.concat([customers_df, pd.get_dummies(customers_df['age_range'], prefix='age')], axis=1)
customers_df.drop('age', axis=1, inplace=True)
customers_df.drop('age_range', axis=1, inplace=True)
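
The bucket labels suggest half-open ranges, where age 30 belongs to `30-39` and anything from 70 upwards belongs to `70-plus`. A minimal plain-Python sketch of that mapping, using `bisect` from the standard library, is shown below. The helper `age_bucket` is hypothetical and is only meant to make the intended boundaries explicit; it is not the code this notebook runs.

```python
from bisect import bisect_right

# lower bounds of every bucket after the first one
edges = [30, 40, 50, 60, 70]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70-plus']


def age_bucket(age: int) -> str:
    """Map an age (18 or older) to its label using left-closed buckets."""
    if age < 18:
        raise ValueError('age must be at least 18')
    # number of edges <= age is the index of the bucket
    return labels[bisect_right(edges, age)]


print([age_bucket(a) for a in (18, 29, 30, 69, 70, 91)])
# ['18-29', '18-29', '30-39', '60-69', '70-plus', '70-plus']
```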

In [None]:
customers_df['sex'] = label_encoder.fit_transform(customers_df['sex'])
customers_df['is_married'] = label_encoder.fit_transform(customers_df['is_married'])

In [None]:
customers_df.head()

In [None]:
customers_df['active_since'] = pd.to_datetime(customers_df['active_since'], format='%Y-%m-%d %H:%M:%S')

In [None]:
def get_delta_in_days(date_time) -> int:
    # number of days between the given datetime and today
    today = date.today()
    delta = today - date_time.date()
    return delta.days

In [None]:
customers_df['n_days_active'] = customers_df['active_since'].apply(lambda x: get_delta_in_days(x))
customers_df['n_days_active'] = min_max_scaler.fit_transform(customers_df[['n_days_active']])
customers_df.drop('active_since', axis=1, inplace=True)
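
Min-max scaling maps each value to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below spells out that formula in plain Python. The helper `min_max_scale` and the sample values are illustrative assumptions; the notebook itself relies on scikit-learn's `MinMaxScaler`.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # assumption: a constant column is mapped to 0.0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([100, 550, 1000])
print(scaled)  # [0.0, 0.5, 1.0]
```

Because the day counts are computed relative to today's date, the raw values shift from day to day, but the scaled values stay the same as long as the underlying dates do not change.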

In [None]:
customers_df.head()

In [None]:
customers_df.to_csv('../data/transformed/customers.csv', index=False)

#### II) Transform raw `products` data

In [None]:
products_df = pd.read_csv('../data/raw/products.csv')
products_df.head(5)

In [None]:
products_df.drop('product_name', axis=1, inplace=True)

In [None]:
products_df = pd.concat([products_df, pd.get_dummies(products_df['product_category'], prefix='category')], axis=1)

In [None]:
products_df.drop('product_category', axis=1, inplace=True)
products_df.columns = products_df.columns.str.replace(' ', '_')
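
One-hot encoding turns a single categorical column into one indicator column per category. The plain-Python sketch below mirrors the naming used here (a `category_` prefix, with spaces replaced by underscores). The helper `one_hot` and the category names are hypothetical examples, not values taken from the product file.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    columns = [f'{prefix}_{c}'.replace(' ', '_') for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


# hypothetical category values, for illustration only
columns, rows = one_hot(['baby food', 'bakery', 'baby food'], 'category')
print(columns)  # ['category_baby_food', 'category_bakery']
print(rows)     # [[1, 0], [0, 1], [1, 0]]
```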

In [None]:
products_df.head(5)

In [None]:
products_df.to_csv('../data/transformed/products.csv', index=False)

#### III) Transform raw `orders` data

In [None]:
orders_df = pd.read_csv('../data/raw/orders.csv')
orders_df.head(5)

In [None]:
orders_df['purchased_on'] = pd.to_datetime(orders_df['purchased_on'], format='%Y-%m-%d %H:%M:%S')
orders_df['n_days_since_last_purchase'] = orders_df['purchased_on'].apply(lambda x: get_delta_in_days(x))
orders_df['n_days_since_last_purchase'] = min_max_scaler.fit_transform(orders_df[['n_days_since_last_purchase']])


In [None]:
orders_df['purchase_amount'] = min_max_scaler.fit_transform(orders_df[['purchase_amount']])
orders_df['is_reordered'] = label_encoder.fit_transform(orders_df['is_reordered'])

In [None]:
orders_df.head(5)

### Generate partitioned orders data by month

In [None]:
import os
from datetime import datetime
from dateutil.relativedelta import relativedelta

print(f'Total Orders Count = {orders_df.shape[0]}')
partitions_path = '../data/partitions'
date_format = '%Y-%m-%d %H:%M:%S'
start_date = datetime.strptime('2020-01-01 00:00:00', date_format)
print(f'start_date = {start_date}')
end_date = datetime.strptime('2021-06-01 00:00:01', date_format)
print(f'end_date = {end_date}')
a_month = relativedelta(months=1)
current_start_date = start_date
current_end_date = start_date + a_month
print('----')
os.makedirs(partitions_path, exist_ok=True)

while current_end_date <= end_date:
    print(f'Dates from {current_start_date} up to (not including) {current_end_date}')
    # half-open window [start, end): an order placed exactly on a month
    # boundary is written to one partition only
    in_window = (orders_df['purchased_on'] >= current_start_date) & (orders_df['purchased_on'] < current_end_date)
    partitions_df = orders_df[in_window].drop('purchased_on', axis=1)
    # partition folders are named '<year>-<month>' without zero padding, e.g. '2020-1'
    partition = f'{current_start_date.year}-{current_start_date.month}'
    current_partitions_path = f'{partitions_path}/{partition}'
    print(current_partitions_path)
    os.makedirs(current_partitions_path, exist_ok=True)
    print(f'Partitions Orders Count = {partitions_df.shape[0]}')
    partitions_df.to_csv(f'{current_partitions_path}/partition.csv', index=False)
    current_start_date = current_end_date
    current_end_date = current_start_date + a_month
    print('----')
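
To see which monthly partitions this date range yields, the sketch below steps through the same range with the standard library only. The generator `month_windows` is a hypothetical helper that assumes the start date falls on the first day of a month; it treats each window as half-open, so a timestamp exactly on a month boundary belongs to a single window.

```python
from datetime import datetime


def month_windows(start: datetime, end: datetime):
    """Yield (partition_key, window_start, window_end) for each month window ending on or before `end`."""
    current = start
    while True:
        # first day of the following month (assumes `start` is day 1)
        if current.month == 12:
            nxt = current.replace(year=current.year + 1, month=1)
        else:
            nxt = current.replace(month=current.month + 1)
        if nxt > end:
            break
        yield f'{current.year}-{current.month}', current, nxt
        current = nxt


windows = list(month_windows(datetime(2020, 1, 1), datetime(2021, 6, 1, 0, 0, 1)))
keys = [key for key, _, _ in windows]
print(len(keys), keys[0], keys[-1])  # 17 2020-1 2021-5

# a timestamp on a month boundary falls into exactly one half-open window
boundary = datetime(2020, 2, 1)
owners = [key for key, lo, hi in windows if lo <= boundary < hi]
print(owners)  # ['2020-2']
```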

In [None]:
orders_df.drop('purchased_on', axis=1, inplace=True)

In [None]:
orders_df.head(5)

In [None]:
orders_df.to_csv('../data/transformed/orders.csv', index=False)