# MLOps workshop with Amazon SageMaker

## Module 01: Transform the data and train a model inside a Jupyter notebook.

This workshop demonstrates a journey to cloud-native machine learning. We start with a traditional approach, developing and training a model directly in a Jupyter notebook, then move to remote managed data transformation and training with Amazon SageMaker, and finish with fully automated pipelines built with SageMaker Pipelines.

In this first notebook we will predict house prices from the well-known California housing dataset using a simple regression model in TensorFlow 2.

To begin, we'll import some necessary packages and set up directories for training and test data. In this notebook, the only usage of SageMaker is to manage the compute of the notebook. There is no usage of SageMaker APIs.

In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import glob
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import sklearn.model_selection
from sklearn.preprocessing import StandardScaler

In [None]:
import os

data_dir = os.path.join(os.getcwd(), 'data')
os.makedirs(data_dir, exist_ok=True)

train_dir = os.path.join(os.getcwd(), 'data/train')
os.makedirs(train_dir, exist_ok=True)

test_dir = os.path.join(os.getcwd(), 'data/test')
os.makedirs(test_dir, exist_ok=True)

raw_dir = os.path.join(os.getcwd(), 'data/raw')
os.makedirs(raw_dir, exist_ok=True)

batch_dir = os.path.join(os.getcwd(), 'data/batch')
os.makedirs(batch_dir, exist_ok=True)

## Exploratory Data Analysis (EDA)

According to The State of Data Science 2020 survey, data management, exploratory data analysis (EDA), feature selection, and feature engineering account for more than 66% of a data scientist’s time.

Exploratory Data Analysis is an approach to analyzing data sets that summarizes their main characteristics, often using statistical graphics and other data visualization methods. EDA assists data science professionals in various ways:

- Getting a better understanding of the data.
- Identifying various data patterns.
- Getting a better understanding of the problem statement.

Numerical EDA gives you some very important information, such as the names and data types of the columns and the dimensions of the DataFrame. Visual EDA, on the other hand, gives you insight into the distributions of the features and their relationships with the target.

First we'll load the California Housing dataset and explore the data.

## Download California Housing dataset

We use the California housing dataset.

More info on the dataset:

This dataset was obtained from the StatLib repository. http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

We will use the AWS CLI to download the dataset from S3. You don't need to specify AWS credentials; they are taken from the notebook's IAM role. If you get an error in this step, check that the notebook was created with a proper IAM role.

In [None]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/california_housing/cal_housing.tgz .

In [None]:
!tar -zxf cal_housing.tgz 2>/dev/null

In [None]:
columns = [
    "longitude",
    "latitude",
    "housingMedianAge",
    "totalRooms",
    "totalBedrooms",
    "population",
    "households",
    "medianIncome",
    "medianHouseValue",
]
df = pd.read_csv("CaliforniaHousing/cal_housing.data", names=columns, header=None)

In [None]:
df.head()

## Numerical EDA
Check how big the dataset is, how many features it has and of what type, and what the target is.

In [None]:
df.info()

Each record in the dataset has 9 attributes:

- longitude - block group longitude
- latitude - block group latitude
- housingMedianAge - median house age in the block group
- totalRooms - total number of rooms in the block group
- totalBedrooms - total number of bedrooms in the block group
- population - block group population
- households - number of households in the block group
- medianIncome - median income in the block group
- medianHouseValue - median value of owner-occupied homes in the block group

It is important to notice that all data is numeric and there are no NULL values.
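The "all numeric, no NULL values" observation can also be checked programmatically instead of being read off the `df.info()` output. The following is a minimal sketch on a made-up toy DataFrame (`toy_df` and its values are assumptions for illustration, including one deliberately missing entry); the same two checks could be run against the real `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing DataFrame (values are made up for illustration).
toy_df = pd.DataFrame(
    {
        "totalRooms": [880.0, 7099.0, 1467.0],
        "totalBedrooms": [129.0, np.nan, 190.0],  # one missing value, added on purpose
        "medianHouseValue": [452600.0, 358500.0, 352100.0],
    }
)

# Count missing values per column; all zeros would mean there are no NULLs.
null_counts = toy_df.isnull().sum()
print(null_counts)

# Check that every column has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in toy_df.dtypes)
print("All columns numeric:", all_numeric)
```

On the toy data the check flags the one missing `totalBedrooms` entry; on a dataset without NULLs every count would be zero.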
Now, let's summarize the data to see its distribution.

In [None]:
df.describe()

In [None]:
df.value_counts("housingMedianAge", sort=True)

Looking at the mean, we can see that the houses are rather old: the median house age averages around 28 years across block groups.

## Visual EDA

Let's begin exploring the data through visualization. We will plot a histogram of each feature.

In [None]:
import matplotlib.pyplot as plt
df.hist(bins=50, figsize=(20, 15))
plt.show()

We see that the data is skewed and not normalized for most of the columns. We will not touch the latitude and longitude for now. Let's apply a logarithmic transform (`np.log1p`, which computes log(1 + x)) to the rest of the columns, including the target, and check the result.

In [None]:
columns_to_normalize = [
    'medianIncome', 'housingMedianAge', 'totalRooms',
    'totalBedrooms', 'population', 'households', 'medianHouseValue'
]

for column in columns_to_normalize:
    df[column] = np.log1p(df[column])
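To illustrate what this transform does, here is a small sketch on made-up values (the `values` array is an assumption, not data from the dataset). It shows that `np.log1p` compresses a long right tail and that `np.expm1` reverses it.

```python
import numpy as np

# Made-up, right-skewed values standing in for a column such as "population".
values = np.array([0.0, 10.0, 100.0, 1_000.0, 100_000.0])

# log1p(x) = log(1 + x); unlike log(x), it is defined at x = 0.
transformed = np.log1p(values)

# expm1(y) = exp(y) - 1 is the inverse and restores the original scale.
restored = np.expm1(transformed)

# Compare how far the largest value sits from the median before and after.
ratio_before = values.max() / np.median(values)
ratio_after = transformed.max() / np.median(transformed)
print("max / median before:", ratio_before)
print("max / median after: ", ratio_after)
print("round trip ok:", np.allclose(values, restored))
```

Because `medianHouseValue` is among the transformed columns, anything a model learns from this data is on the log scale; `np.expm1` would map such values back to dollars.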

In [None]:
df.hist(figsize=(12, 10), bins=50, edgecolor="black", grid=False)
plt.subplots_adjust(hspace=0.7, wspace=0.4)

The data looks much better. Now we will check the coordinates. First of all, we will plot the coordinates and use the "medianHouseValue" column for coloring.

In [None]:
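from matplotlib.colors import LinearSegmentedColormap

# Green = low, yellow = medium, red = high median house value
cmap = LinearSegmentedColormap.from_list(name='Pacific Ocean shore', colors=['green','yellow','red'])

# Create a single figure with the desired size (a separate plt.figure() call would leave an empty figure behind)
f, ax = plt.subplots(figsize=(10, 10))
points = ax.scatter(df['longitude'], df['latitude'], c=df['medianHouseValue'], s=10, cmap=cmap)
f.colorbar(points)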

Our dataset covers California, and the plot traces the Pacific Ocean shoreline. The color scale makes it clear that houses located near the ocean are more expensive. Using domain knowledge, we also notice that the most expensive houses are located near San Francisco (37.7749° N, 122.4194° W) and Los Angeles (34.0522° N, 118.2437° W), and that house prices depend on the distance to those cities. We will engineer the data to produce linear dependencies between the house price and the location, which is a good fit for linear regression problems. We remove the "longitude" and "latitude" columns and replace them with Euclidean distances, measured in degrees, to San Francisco and Los Angeles.

In [None]:
sf_coord=[-122.4194, 37.7749]
la_coord=[-118.2437, 34.0522]

df['DistanceToSF']=np.sqrt((df['longitude']-sf_coord[0])**2+(df['latitude']-sf_coord[1])**2)
df['DistanceToLA']=np.sqrt((df['longitude']-la_coord[0])**2+(df['latitude']-la_coord[1])**2)
df.drop(columns=['longitude', 'latitude'],inplace=True)

Next, we split the data to create training and test datasets.

In [None]:
X = df.drop("medianHouseValue", axis=1)
Y = df["medianHouseValue"].copy()

In [None]:
print("Features:", list(X.columns))
print("Dataset shape:", X.shape)
print("Dataset Type:", type(X))
print("Label set shape:", Y.shape)
print("Label set Type:", type(Y))

# We partition the dataset into 2/3 training and 1/3 test set.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, Y, test_size=0.33)

np.save(os.path.join(raw_dir, 'x_train.npy'), x_train)
np.save(os.path.join(raw_dir, 'x_test.npy'), x_test)
np.save(os.path.join(raw_dir, 'y_train.npy'), y_train)
np.save(os.path.join(raw_dir, 'y_test.npy'), y_test)

In [None]:
scaler = StandardScaler()
x_train = np.load(os.path.join(raw_dir, 'x_train.npy'))
scaler.fit(x_train)
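The scaler is fitted on the training features only, so that statistics from the test set do not leak into preprocessing. As a rough illustration of what fitting and transforming amount to, here is an equivalent computation in plain NumPy on made-up arrays (`toy_train` and `toy_test` are assumptions for illustration); it is a sketch of the idea, not a replacement for `StandardScaler`.

```python
import numpy as np

# Made-up feature matrices (rows = samples, columns = features).
toy_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
toy_test = np.array([[2.0, 500.0]])

# "fit": learn the per-column mean and standard deviation from the training split only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# "transform": apply the training statistics to both splits.
train_scaled = (toy_train - mean) / std
test_scaled = (toy_test - mean) / std

print("train column means:", train_scaled.mean(axis=0))
print("train column stds: ", train_scaled.std(axis=0))
print("scaled test row:   ", test_scaled)
```

After scaling, each training column has zero mean and unit standard deviation, while the test row is expressed relative to the training statistics.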

We apply the fitted scaler to the feature arrays and save the resulting training and test data, together with the untouched labels, on the file system.

In [None]:
input_files = glob.glob('{}/raw/*.npy'.format(data_dir))
print('\nINPUT FILE LIST: \n{}\n'.format(input_files))
for file in input_files:
    raw = np.load(file)
    # Inspect only the file name so that directory names cannot affect the checks
    name = os.path.basename(file)
    is_label = name.startswith('y_')
    # only transform feature columns
    if not is_label:
        transformed = scaler.transform(raw)
    if 'train' in name:
        if is_label:
            output_path = os.path.join(train_dir, 'y_train.npy')
            np.save(output_path, raw)
            print('SAVED LABEL TRAINING DATA FILE\n')
        else:
            output_path = os.path.join(train_dir, 'x_train.npy')
            np.save(output_path, transformed)
            print('SAVED TRANSFORMED TRAINING DATA FILE\n')
    else:
        if is_label:
            output_path = os.path.join(test_dir, 'y_test.npy')
            np.save(output_path, raw)
            print('SAVED LABEL TEST DATA FILE\n')
        else:
            output_path = os.path.join(test_dir, 'x_test.npy')
            np.save(output_path, transformed)
            print('SAVED TRANSFORMED TEST DATA FILE\n')

In [None]:
import numpy as np
import os
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

def get_train_data(train_dir):
    x_train = np.load(os.path.join(train_dir, 'x_train.npy'))
    y_train = np.load(os.path.join(train_dir, 'y_train.npy'))
    print('x train', x_train.shape,'y train', y_train.shape)

    return x_train, y_train


def get_test_data(test_dir):
    x_test = np.load(os.path.join(test_dir, 'x_test.npy'))
    y_test = np.load(os.path.join(test_dir, 'y_test.npy'))
    print('x test', x_test.shape,'y test', y_test.shape)

    return x_test, y_test

def get_model():
    # 8 input features: 6 transformed columns plus the two distance features
    inputs = tf.keras.Input(shape=(8,))
    hidden_1 = tf.keras.layers.Dense(8, activation='tanh')(inputs)
    hidden_2 = tf.keras.layers.Dense(4, activation='sigmoid')(hidden_1)
    outputs = tf.keras.layers.Dense(1)(hidden_2)
    return tf.keras.Model(inputs=inputs, outputs=outputs)

Now we will do the actual training. Feel free to change the hyperparameter values (epochs, batch_size, etc.) to see how they affect the training metric.

In [None]:
x_train, y_train = get_train_data(train_dir)
x_test, y_test = get_test_data(test_dir)

device = '/cpu:0'
print(device)
batch_size = 128
epochs = 25
learning_rate = 0.01
print('batch_size = {}, epochs = {}, learning rate = {}'.format(batch_size, epochs, learning_rate))

with tf.device(device):
    model = get_model()
    optimizer = tf.keras.optimizers.SGD(learning_rate)
    model.compile(optimizer=optimizer, loss='mse')
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
              validation_data=(x_test, y_test))

    # evaluate on test set
    scores = model.evaluate(x_test, y_test, batch_size, verbose=2)
    print("\nTest MSE :", scores)

Mean Squared Error (MSE) is a commonly used metric in machine learning for evaluating the performance of regression models. It measures the average squared difference between the predicted and actual values. MSE penalizes larger errors more heavily due to the squaring operation. By calculating the mean of these squared differences, MSE provides a single numerical value to assess the model's accuracy. A lower MSE indicates better model performance, with zero being the ideal value.
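As a worked example of this definition, the sketch below computes MSE by hand with NumPy on four made-up values (`y_true` and `y_hat` are assumptions for illustration, not outputs of the model above).

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# Per-sample errors: [-0.5, 0.0, 2.0, 1.0]
errors = y_hat - y_true

# Squared errors: [0.25, 0.0, 4.0, 1.0]; their mean is 5.25 / 4 = 1.3125
squared_errors = errors ** 2
mse = squared_errors.mean()

# Share of the total squared error contributed by the single largest error.
largest_share = squared_errors.max() / squared_errors.sum()

print("MSE:", mse)
print("Share of largest error:", largest_share)
```

The single error of 2.0 accounts for roughly three quarters of the total squared error, which is the "penalizes larger errors more heavily" effect in practice.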

In [None]:
model.save('model' + '/1')

In [None]:
!ls -R model

Our model is now trained, and the test MSE is low. Next, we will run predictions on the test dataset to see how close they are to the actual values.

In [None]:
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('model/1')

x_test = np.load(os.path.join(test_dir, 'x_test.npy'))
y_test = np.load(os.path.join(test_dir, 'y_test.npy'))
scores = model.evaluate(x_test, y_test, verbose=2)
print("\nTest MSE :", scores)

In [None]:
y_pred = model.predict(x_test)
# Round predictions and labels to one decimal place (both are on the log scale)
flat_list_pred = [float('%.1f' % item) for sublist in y_pred for item in sublist]
flat_list_test = [float('%.1f' % item) for item in y_test]
test_result = pd.DataFrame({'Predicted': flat_list_pred, 'Actual': flat_list_test})
fig = plt.figure(figsize=(16, 8))
# Plot the first 50 test samples; the legend order must match the DataFrame column order
plt.plot(test_result[:50])
plt.legend(['Predicted', 'Actual'])

The MSE metric suggested that our model would perform well, and indeed, we see in the visualization above a good correlation between actual and predicted values.
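The visual impression of "good correlation" can be backed by a number. The sketch below uses made-up log-scale values (`toy_actual` and `toy_predicted` are assumptions standing in for `y_test` and the model's predictions) to compute the Pearson correlation, and converts the errors back to dollars with `np.expm1`, since the target was transformed with `np.log1p` earlier in the notebook.

```python
import numpy as np

# Made-up log-scale values standing in for y_test and the model's predictions.
toy_actual = np.array([11.5, 12.1, 12.8, 13.0, 11.9])
toy_predicted = np.array([11.7, 12.0, 12.6, 13.1, 12.1])

# Pearson correlation between actual and predicted values (1.0 = perfect linear agreement).
r = np.corrcoef(toy_actual, toy_predicted)[0, 1]

# Undo the log1p transform to express the average error in dollars.
actual_dollars = np.expm1(toy_actual)
predicted_dollars = np.expm1(toy_predicted)
mae_dollars = np.mean(np.abs(predicted_dollars - actual_dollars))

print("Pearson r:", r)
print("Mean absolute error in dollars:", mae_dollars)
```

A small MSE on the log scale can still correspond to a sizeable error in dollars, so it may be worth reporting both.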