

# Architect and Build a Music Recommender System across the Entire ML-Lifecycle with Amazon SageMaker

## Overview

----

Welcome to the Music Recommender use case with Amazon SageMaker. In this series of notebooks, we walk through the ML lifecycle and show how to build a music recommender system using a combination of SageMaker services and features. Each phase has its own notebooks that show how to implement that phase of the lifecycle.


----

### Contents

- [Overview](00_overview_arch_data.ipynb)
 - [Architecture](#arch-overview)
 - [Get the Data](#get-the-data)
 - [Update the data sources](#update-data-sources)
 - [Explore the Data](#explore-data)
- [Part 1: Data Prep using SageMaker Processing Job](01_music_rec_data_prep.ipynb)
- [Part 2: Model Training and Hyperparameter Tuning](02_music_rec_model_training.ipynb)
- [Part 3: SageMaker Pipelines](03_music_rec_pipelines.ipynb)




## Architecture

Let's look at the overall solution architecture for this use case. We start by doing each of these tasks in the exploratory phase of the ML lifecycle. Once experimentation and trials are done, we can develop an automated pipeline, such as the one depicted here, that prepares the data, trains and tunes the model, registers it in the model registry, deploys it to a SageMaker hosted endpoint, and runs batch transform with the trained model.

##### [back to top](#00-nb)

----



In [None]:
import sys
import pprint
sys.path.insert(1, './code')


In [None]:
# update pandas to avoid data type issues in older 1.0 version
!pip install pandas --upgrade --quiet
import pandas as pd
print(pd.__version__)

In [None]:
# create data folder (-p avoids an error if the folder already exists on a rerun)
!mkdir -p data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import json
import sagemaker 
import boto3
import os
from awscli.customizations.s3.utils import split_s3_bucket_key

# SageMaker session
sess = sagemaker.Session()
# get session bucket name
bucket = sess.default_bucket()
# bucket prefix or the subfolder for everything we produce
prefix='music-recommendation-workshop'
# s3 client
s3_client = boto3.client("s3")

print(f"this is your default SageMaker Studio bucket name: {bucket}") 

In [None]:
def get_data(public_s3_data, to_bucket, sample_data=1):
    """Copy public S3 files into `to_bucket`, optionally subsampling rows first."""
    new_paths = []
    for f in public_s3_data:
        bucket_name, key_name = split_s3_bucket_key(f)
        filename = f.split('/')[-1]
        new_path = "s3://{}/{}/input/{}".format(to_bucket, prefix, filename)
        new_paths.append(new_path)

        # only download if not already downloaded
        if not os.path.exists('./data/{}'.format(filename)):
            # download s3 data
            print("Downloading file from {}".format(f))
            s3_client.download_file(bucket_name, key_name, './data/{}'.format(filename))

        # subsample the data to create a smaller dataset for this demo
        new_df = pd.read_csv('./data/{}'.format(filename))
        new_df = new_df.sample(frac=sample_data)
        new_df.to_csv('./data/{}'.format(filename), index=False)

        # upload s3 data to our default s3 bucket for SageMaker Studio
        # (S3 keys always use forward slashes, so build the key explicitly)
        print("Uploading {} to {}\n".format(filename, new_path))
        s3_client.upload_file('./data/{}'.format(filename), to_bucket, '{}/input/{}'.format(prefix, filename))

    return new_paths


 

def update_data_sources(flow_path, tracks_data_source, ratings_data_source):
    """Point the tracks and ratings nodes of a flow file at new S3 locations."""
    with open(flow_path) as flowf:
        flow = json.load(flowf)

    for node in flow['nodes']:
        # skip nodes that have no S3 dataset definition
        try:
            if node['parameters']['dataset_definition']['name'] == 'tracks.csv':
                # reset the s3 data source for tracks data
                old_source = node['parameters']['dataset_definition']['s3ExecutionContext']['s3Uri']
                print("Changed {} to {}".format(old_source, tracks_data_source))
                node['parameters']['dataset_definition']['s3ExecutionContext']['s3Uri'] = tracks_data_source
            elif node['parameters']['dataset_definition']['name'] == 'ratings.csv':
                # reset the s3 data source for ratings data
                old_source = node['parameters']['dataset_definition']['s3ExecutionContext']['s3Uri']
                print("Changed {} to {}".format(old_source, ratings_data_source))
                node['parameters']['dataset_definition']['s3ExecutionContext']['s3Uri'] = ratings_data_source
        except (KeyError, TypeError):
            # the node lacks one of the expected keys
            continue
    # write out the updated json flow file
    with open(flow_path, 'w') as outfile:
        json.dump(flow, outfile)

    return flow
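
To make the shape that `update_data_sources` expects more concrete, here is a minimal sketch that applies the same load, patch, and write-back steps to a toy flow file. The flow below is hypothetical and heavily trimmed: it contains only the keys the function reads, and the bucket names and file name are placeholders. A real flow file will contain many more fields.

```python
import json
import os
import tempfile

# Hypothetical, trimmed-down flow: only the keys read by update_data_sources
# are present, and the S3 URIs are placeholders.
flow = {
    "nodes": [
        {
            "parameters": {
                "dataset_definition": {
                    "name": "tracks.csv",
                    "s3ExecutionContext": {"s3Uri": "s3://old-bucket/tracks.csv"},
                }
            }
        },
        # a node without a dataset definition is left untouched
        {"parameters": {}},
    ]
}

# placeholder target location, keyed by dataset name
new_sources = {
    "tracks.csv": "s3://my-bucket/music-recommendation-workshop/input/tracks.csv"
}

with tempfile.TemporaryDirectory() as tmp_dir:
    flow_path = os.path.join(tmp_dir, "example.flow")
    with open(flow_path, "w") as f:
        json.dump(flow, f)

    # same idea as update_data_sources: load, patch matching nodes, write back
    with open(flow_path) as f:
        loaded = json.load(f)
    for node in loaded["nodes"]:
        definition = node.get("parameters", {}).get("dataset_definition")
        if definition and definition.get("name") in new_sources:
            definition["s3ExecutionContext"]["s3Uri"] = new_sources[definition["name"]]
    with open(flow_path, "w") as f:
        json.dump(loaded, f)

    # read the file again to confirm the change was persisted
    with open(flow_path) as f:
        updated = json.load(f)

print(updated["nodes"][0]["parameters"]["dataset_definition"]["s3ExecutionContext"]["s3Uri"])
```

The sketch uses `dict.get` instead of a `try`/`except` block to skip nodes without a dataset definition. Either approach works; which one reads better is a matter of taste.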



## Prereqs: Get Data 

##### [back to top](#00-nb)

----

Here we download the music data used in this demo from a public S3 bucket and upload it to your default S3 bucket, which was created for you when you first set up your SageMaker Studio workspace.
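
The `get_data` helper relies on `split_s3_bucket_key` from the AWS CLI package to separate the bucket name from the object key. As a rough illustration of what that split involves, the sketch below does the same thing with only the standard library. The function name `split_s3_uri` is hypothetical and is not part of boto3 or the AWS CLI.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split an s3://bucket/key URI into (bucket, key).

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    # the path starts with "/", which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key_name = split_s3_uri(
    "s3://sagemaker-sample-files/datasets/tabular/synthetic-music/tracks.csv"
)
print(bucket_name)
print(key_name)
```

The bucket and key returned here are the two arguments that `s3_client.download_file` needs, alongside the local destination path.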

In [None]:
# public S3 bucket that contains our music data
s3_bucket_music_data = "s3://sagemaker-sample-files/datasets/tabular/synthetic-music"

In [None]:
new_data_paths = get_data([f"{s3_bucket_music_data}/tracks.csv", f"{s3_bucket_music_data}/ratings.csv"], bucket, sample_data=0.70)
print(new_data_paths)

In [None]:
# these are the new file paths located on your SageMaker Studio default s3 storage bucket
tracks_data_source = f's3://{bucket}/{prefix}/input/tracks.csv'
ratings_data_source = f's3://{bucket}/{prefix}/input/ratings.csv'




## Explore the Data


##### [back to top](#00-nb)


----

In [None]:
tracks = pd.read_csv('./data/tracks.csv')
ratings = pd.read_csv('./data/ratings.csv')

In [None]:
tracks.head()

In [None]:
ratings.head()

In [None]:
print("{:,} different songs/tracks".format(tracks['trackId'].nunique()))
print("{:,} users".format(ratings['userId'].nunique()))
print("{:,} user rating events".format(ratings['ratingEventId'].nunique()))

In [None]:
tracks.groupby('genre')['genre'].count().plot.bar(title="Tracks by Genre");

In [None]:
# count rating events per user, then plot how those counts are distributed
ratings.groupby('userId')['ratingEventId'].count().plot.hist(bins=50, title="Distribution of # of Ratings by User");

----

# Music Recommender Lab 1: Data Prep using SageMaker Processing

After you have finished running this notebook, you can open the next notebook.