---
title: "SageMaker Kmeans preprocessing"
date: 2020-03-02T17:46:34-05:00
draft: false
algo: [kmeans]
---

Per the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/k-means.html#km-inputoutput), "For training, the k-means algorithm expects data to be provided in the train channel (recommended S3DataDistributionType=ShardedByS3Key), with an optional test channel (recommended S3DataDistributionType=FullyReplicated) to score the data on. Both recordIO-wrapped-protobuf and CSV formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as recordIO-wrapped-protobuf or as CSV."

Using the Python SDK for SageMaker, there is a far simpler way to use KMeans (yes, even simpler than a CSV).

Assume you start with a CSV that looks like the following (this is sample data from s3://aws-ml-blog-sagemaker-census-segmentation):

```text
   CensusId    State   County  TotalPop    Men  Women  Hispanic  White  Black  Native ...
0      1001  Alabama  Autauga     55221  26745  28476       2.6   75.8   18.5     0.4 ...
1      1003  Alabama  Baldwin    195121  95314  99807       4.5   83.1    9.5     0.6 ...
2      1005  Alabama  Barbour     26932  14497  12435       4.6   46.2   46.7     0.2 ...
3      1007  Alabama     Bibb     22604  12073  10531       2.2   74.5   21.4     0.4 ...
4      1009  Alabama   Blount     57710  28512  29198       8.6   87.9    1.5     0.3 ...
```

Read the CSV file:

```python
import pandas as pd

data = pd.read_csv('data.csv', header=0, delimiter=",", low_memory=False)
```

Drop any rows that contain NaN (Not a Number) values; note that `dropna` operates on rows by default:

```python
data.dropna(inplace=True)
```

Consider doing one-hot encoding or any other data preprocessing, but use this as a template for the minimal data preprocessing that you will need to do.

(Optional) For example, you can use scikit-learn to do some standard preprocessing, like scaling. Scaling only works on numeric data, so drop the string columns (`State` and `County` in this sample) first:

```python
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler requires numeric columns only
data = data.drop(columns=['State', 'County'])

scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data))
data_scaled.columns = data.columns
data_scaled.index = data.index
```

Make sure the numerical values are float values; note that `.values` also converts the dataframe into a NumPy array:

```python
train_data = data_scaled.values.astype('float32')
```

**Note** - Consider using the array as is, since the next [training](../../training/kmeans) step will become easier. If you really need a CSV, write out the scaled dataframe instead:

```python
data_scaled.to_csv('train_data.csv', index=False)  # example local file name
```

### Optional

It is possible that you have a large number of columns in your training dataset; it is common practice to then use Principal Component Analysis (PCA) to reduce the number of columns while retaining most of the information. We can use the SageMaker built-in PCA algorithm to achieve this. Using the `train_data` array, you can do:

```python
from sagemaker import PCA
from sagemaker import get_execution_role

role = get_execution_role()
bucket = '<your-bucket-name>'  # replace with your own S3 bucket
num_components = 20

pca_SM = PCA(role=role,
             train_instance_count=1,
             train_instance_type='ml.c4.xlarge',
             output_path='s3://' + bucket + '/counties/',
             num_components=num_components)
```

... and then train a PCA model:

```python
# train_data is already a float32 NumPy array (see above)
pca_SM.fit(pca_SM.record_set(train_data))
```

This reduces the number of columns you have in your original dataset to ```num_components=20```.

Once you are done training, store the new "components" in a new dataset, which will be your training dataset for KMeans.
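(Optional) Before moving on, you may want to check how much of the original variance the components retain. The following is a rough sketch, not part of the original walkthrough: it assumes `mxnet` and `boto3` are available and that `bucket` is the same bucket used in `output_path` above. SageMaker's built-in PCA stores its parameters, including the singular values `s`, as MXNet ndarrays inside the `model.tar.gz` artifact:

```python
import os
import boto3
import numpy as np
import mxnet as mx

# download and unpack the model artifact produced by the PCA training job
job_name = pca_SM.latest_training_job.name
model_key = 'counties/' + job_name + '/output/model.tar.gz'
boto3.resource('s3').Bucket(bucket).download_file(model_key, 'model.tar.gz')
os.system('tar -zxvf model.tar.gz')

# 's' holds the singular values of the components
pca_model_params = mx.ndarray.load('model_algo-1')
s = pca_model_params['s'].asnumpy().flatten()

# approximate share of variance explained by the top 5 components
print(np.square(s[-5:]).sum() / np.square(s).sum())
```

SageMaker PCA returns components in increasing order of significance, so the *last* entries of `s` (and, below, the last columns of the prediction output) correspond to the top components.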
To extract these components, first deploy your model to SageMaker, and then run a prediction:

```python
pca_predictor = pca_SM.deploy(initial_instance_count=1,
                              instance_type='ml.t2.medium')
result = pca_predictor.predict(train_data)
```

Then recover your train dataset, to be used with KMeans:

```python
data_transformed = pd.DataFrame()
last_n = 5

# each record in the result carries a 'projection' tensor with num_components values
for a in result:
    b = a.label['projection'].float32_tensor.values
    data_transformed = data_transformed.append([list(b)])

# note that this reuses the index from a dataframe we used in previous steps
data_transformed.index = data_scaled.index

# keep only the last last_n columns and give them readable names
data_transformed = data_transformed.iloc[:, -last_n:]
PCA_list = ['c_' + str(i) for i in range(1, last_n + 1)]
data_transformed.columns = PCA_list
```

The variable ```last_n``` controls the number of columns you want to use (here, we are using the last 5 columns of data). Since the components come back in increasing order of significance, these are the 5 most important components. You can also use all columns of ```data_transformed```, as it is already a dataset with only 20 columns, whereas your original dataset for KMeans may have had hundreds or thousands of columns.
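From here, ```data_transformed``` is ready for the next [training](../../training/kmeans) step. As a rough sketch of the hand-off (assuming the same `role` and `bucket` as above, and an arbitrary example value of `k=7` clusters), you can feed it to the built-in KMeans estimator; also remember to delete the PCA endpoint once you are done with it, since it is billed while it runs:

```python
from sagemaker import KMeans

# clean up the PCA endpoint created by deploy() above
pca_predictor.delete_endpoint()

# hand the reduced dataset to the built-in KMeans estimator
kmeans = KMeans(role=role,
                train_instance_count=1,
                train_instance_type='ml.c4.xlarge',
                output_path='s3://' + bucket + '/counties/',
                k=7)  # k=7 is an arbitrary example value
kmeans.fit(kmeans.record_set(data_transformed.values.astype('float32')))
```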