# Amazon SageMaker Data Wrangler demo
## Data source
This demo of Amazon SageMaker Data Wrangler uses the [UCI diabetic patient readmission dataset](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008). The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes.

A detailed description of the dataset, including all of its attributes, is provided in Table 1 of:

    Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.


In [None]:
import boto3
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'sagemaker/demo-diabetic-datawrangler'

s3_client = boto3.client("s3")

In [None]:
%%sh
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip
unzip dataset_diabetes.zip

## Split the data for demo purposes


In [None]:
import pandas as pd

# Use encounter_id as the index so that each split file keeps the encounter identifier.
df = pd.read_csv('dataset_diabetes/diabetic_data.csv', index_col='encounter_id')

In [None]:
df.head()

In [None]:
demographic_feature_columns = 'patient_nbr,race,gender,age,weight'.split(',')
hospital_visits_feature_columns = 'patient_nbr,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,readmitted'.split(',')
labs_feature_columns = 'patient_nbr,A1Cresult,max_glu_serum'.split(',')
medication_feature_columns = 'patient_nbr,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed'.split(',')
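
Before splitting, it can help to confirm that the column groups cover the columns you expect and that each group carries the `patient_nbr` key. The following is a minimal sketch, not part of the original notebook: `check_column_groups` is a hypothetical helper, and the column list in the example is a small toy subset rather than the full dataset header.

```python
def check_column_groups(all_columns, groups, key="patient_nbr"):
    """Compare column groups against the available columns.

    Returns three lists:
      unassigned  - columns in all_columns that no group includes
      unknown     - grouped column names that are not in all_columns
      missing_key - names of groups that do not contain the key column
    """
    assigned = set()
    for cols in groups.values():
        assigned.update(cols)
    unassigned = [c for c in all_columns if c not in assigned]
    unknown = sorted(assigned - set(all_columns))
    missing_key = [name for name, cols in groups.items() if key not in cols]
    return unassigned, unknown, missing_key


# Toy example (hypothetical subset of columns, for illustration only).
all_columns = ["patient_nbr", "race", "gender", "A1Cresult", "number_diagnoses"]
groups = {
    "demographic": ["patient_nbr", "race", "gender"],
    "labs": ["patient_nbr", "A1Cresult"],
}
unassigned, unknown, missing_key = check_column_groups(all_columns, groups)
print("Not in any group:", unassigned)
print("Not in the data:", unknown)
print("Groups without the key:", missing_key)
```

In the notebook, the same helper could be called with `list(df.columns)` and a dict built from the four `*_feature_columns` lists. A non-empty `unknown` list would point to a misspelled column name, which would otherwise raise a `KeyError` during the split.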

Split the CSV into multiple CSVs and upload them to an S3 bucket.

In [None]:
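dfs = []
suffixes = ['demographic', 'hospital_visits', 'labs', 'medication']
column_groups = [demographic_feature_columns, hospital_visits_feature_columns,
                 labs_feature_columns, medication_feature_columns]
for suffix, columns in zip(suffixes, column_groups):
    df_tmp = df[columns]
    print(columns)
    # Expressions are not auto-displayed inside a loop, so print the preview.
    print(df_tmp.head())
    dfs.append(df_tmp)
    # The index (encounter_id) is written as the first column of each file.
    fname = 'dataset_diabetes/diabetic_data_%s.csv' % suffix
    df_tmp.to_csv(fname)
    # Object key: <prefix>/dataset_diabetes/diabetic_data_<suffix>.csv
    s3_client.upload_file(fname, bucket, '%s/%s' % (prefix, fname))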

In [None]:
print('Your data is uploaded to s3://%s/%s/' % (bucket, prefix))
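
The upload loop builds each object key as the prefix followed by the local file path, so the files land in a `dataset_diabetes/` folder below the prefix printed above. The sketch below reproduces that key construction without calling AWS. `s3_uri_for` is a hypothetical helper, and `example-bucket` is a placeholder for whatever `sess.default_bucket()` returns in your account.

```python
def s3_uri_for(bucket, prefix, fname):
    """Build the S3 URI for a local file, using the same key format as the upload loop."""
    key = "%s/%s" % (prefix, fname)
    return "s3://%s/%s" % (bucket, key)


bucket = "example-bucket"  # placeholder, not a real bucket name
prefix = "sagemaker/demo-diabetic-datawrangler"
suffixes = ["demographic", "hospital_visits", "labs", "medication"]

uris = [
    s3_uri_for(bucket, prefix, "dataset_diabetes/diabetic_data_%s.csv" % s)
    for s in suffixes
]
for uri in uris:
    print(uri)
```

These are the locations to look for when importing the four files from Amazon S3.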