![MLU Logo](../../data/MLU_Logo.png)

## Amazon Access Samples Data Set
 
 Let's apply our boosting algorithm to a real dataset! We are going to use the __Amazon Access Samples dataset__. 
 
 We download this dataset from the UCI Machine Learning Repository via this [link](https://archive.ics.uci.edu/ml/datasets/Amazon+Access+Samples). Citation: Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

 
__Dataset description:__

Employees need to request certain resources to fulfill their daily duties. This dataset consists of anonymized historical records of employee IT access requests. The data fields are described below.

#### Column Descriptions

* __ACTION__: 1 if the resource was approved, 0 if not.
* __RESOURCE__: An ID for each resource
* __PERSON_MGR_ID__: ID of the user's manager
* __PERSON_ROLLUP_1__: User grouping ID
* __PERSON_ROLLUP_2__: User grouping ID
* __PERSON_DEPTNAME__: Department ID
* __PERSON_BUSINESS_TITLE__: Title ID 
* __PERSON_JOB_FAMILY__: Job family ID 
* __PERSON_JOB_CODE__: Job code ID 

Our task is to build a machine learning model that can automatically provision an employee's access to company resources given employee profile information and the resource requested.

In [1]:
%pip install -q -r ../../requirements.txt

### 1. Download and process the dataset

In this section, we download the dataset and process it. The archive contains two files; running the following code cells combines them into a single file. One of the files is large (4.8GB), so make sure you have enough storage.

In [2]:
! wget https://archive.ics.uci.edu/ml/machine-learning-databases/00216/amzn-anon-access-samples.tgz

--2021-11-03 18:04:45--  https://archive.ics.uci.edu/ml/machine-learning-databases/00216/amzn-anon-access-samples.tgz
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12268509 (12M) [application/x-httpd-php]
Saving to: ‘amzn-anon-access-samples.tgz’


2021-11-03 18:04:46 (16.2 MB/s) - ‘amzn-anon-access-samples.tgz’ saved [12268509/12268509]



In [3]:
! tar -zxvf amzn-anon-access-samples.tgz

amzn-anon-access-samples-2.0.csv
amzn-anon-access-samples-history-2.0.csv


We have the following files:
* __amzn-anon-access-samples-2.0.csv__: Employee profile data.
* __amzn-anon-access-samples-history-2.0.csv__: Resource provision history

Below, we first read the amzn-anon-access-samples-2.0.csv file. Because it is large, we read it in chunks and keep only the employee fields we need.

In [4]:
import pandas as pd
import random 

person_fields = ["PERSON_ID", "PERSON_MGR_ID",
                 "PERSON_ROLLUP_1", "PERSON_ROLLUP_2",
                 "PERSON_DEPTNAME", "PERSON_BUSINESS_TITLE",
                 "PERSON_JOB_FAMILY", "PERSON_JOB_CODE"]

people = {}
for chunk in pd.read_csv('amzn-anon-access-samples-2.0.csv', usecols = person_fields, chunksize=5000): 
    for index, row in chunk.iterrows():
        people[row["PERSON_ID"]] = [row["PERSON_MGR_ID"], row["PERSON_ROLLUP_1"],
                                    row["PERSON_ROLLUP_2"], row["PERSON_DEPTNAME"],
                                    row["PERSON_BUSINESS_TITLE"], row["PERSON_JOB_FAMILY"],
                                    row["PERSON_JOB_CODE"]]
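The row-by-row loop above is easy to follow but can be slow on a multi-gigabyte file. As a rough sketch of a faster variant, the same dictionary can be filled one chunk at a time without `iterrows`. The tiny in-memory CSV, the reduced column list, and the name `people_demo` are made up for illustration; only `pd.read_csv` with `usecols` and `chunksize` comes from the cell above.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real profile file (values are made up)
csv_text = """PERSON_ID,PERSON_MGR_ID,PERSON_ROLLUP_1,EXTRA
101,9001,2,x
102,9002,3,y
103,9001,2,z
"""
value_cols = ["PERSON_MGR_ID", "PERSON_ROLLUP_1"]  # shortened list for the demo

people_demo = {}
for chunk in pd.read_csv(io.StringIO(csv_text),
                         usecols=["PERSON_ID"] + value_cols,
                         chunksize=2):
    # Pair each PERSON_ID with its list of field values, one whole chunk at a time
    ids = chunk["PERSON_ID"].tolist()
    rows = chunk[value_cols].values.tolist()
    people_demo.update(zip(ids, rows))

print(people_demo)
# {101: [9001, 2], 102: [9002, 3], 103: [9001, 2]}
```

Selecting `chunk[value_cols]` explicitly fixes the order of the values in each list, which matters because later code concatenates these lists into dataset rows.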

Now, let's read the resource provision history file and create our dataset. For each employee and resource pair, we keep the pair only if its history contains a single kind of action: add access (label 1) or remove access (label 0).

In [5]:
add_access_data = []
remove_access_data = []

df = pd.read_csv('amzn-anon-access-samples-history-2.0.csv')

# Loop through unique logins (employee ids)
for login in df["LOGIN"].unique():
    login_df = df[df["LOGIN"]==login].copy()
    # Save actions
    for target in login_df["TARGET_NAME"].unique():
        login_target_df = login_df[login_df["TARGET_NAME"]==target]
        unique_actions = login_target_df["ACTION"].unique()
        if((len(unique_actions)==1) and (unique_actions[0]=="remove_access")):
            remove_access_data.append([0, target] + people[login])
        elif((len(unique_actions)==1) and (unique_actions[0]=="add_access")):
            add_access_data.append([1, target] + people[login])

# Fix the random seed so the sampling and shuffle are reproducible
random.seed(30)

# Keep only 8000 randomly sampled add_access records
add_access_data = random.sample(add_access_data, 8000)

# Add them together
data = add_access_data + remove_access_data

# Let's shuffle it
random.shuffle(data)
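The labelling rule in the loop above can also be expressed with a single `groupby`, which avoids filtering the full history once per login. The following is a minimal sketch on a made-up toy history; the names `history`, `label_map`, and `labelled` are hypothetical, and the employee profile fields are left out to keep it short.

```python
import pandas as pd

# Toy history with the same three columns used above (values are made up)
history = pd.DataFrame({
    "LOGIN":       [1, 1, 1, 2, 2, 3],
    "TARGET_NAME": [10, 10, 11, 10, 10, 12],
    "ACTION":      ["add_access", "add_access", "remove_access",
                    "add_access", "remove_access", "remove_access"],
})

label_map = {"add_access": 1, "remove_access": 0}
labelled = []
for (login, target), actions in history.groupby(["LOGIN", "TARGET_NAME"])["ACTION"]:
    unique_actions = actions.unique()
    # Keep the pair only if every recorded action is of the same kind
    if len(unique_actions) == 1 and unique_actions[0] in label_map:
        labelled.append((label_map[unique_actions[0]], target, login))

print(labelled)
# [(1, 10, 1), (0, 11, 1), (0, 12, 3)]
```

Login 2 both gained and lost access to resource 10, so that pair is dropped, just as in the original loop. `groupby` visits the pairs in sorted key order rather than order of appearance, which does not matter here because the data is shuffled afterwards.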

Let's save this data so that we can use it later.

In [6]:
df = pd.DataFrame(data, columns=["ACTION", "RESOURCE",
                                 "MGR_ID", "ROLLUP_1",
                                 "ROLLUP_2", "DEPTNAME",
                                 "BUSINESS_TITLE", "JOB_FAMILY",
                                 "JOB_CODE"])

df.to_csv("data.csv", index=False)

Here is what our data looks like:

In [7]:
df.head()

| | ACTION | RESOURCE | MGR_ID | ROLLUP_1 | ROLLUP_2 | DEPTNAME | BUSINESS_TITLE | JOB_FAMILY | JOB_CODE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9802 | 43122 | 2 | 3 | 33467 | 45383 | 11 | 33326 |
| 1 | 1 | 10617 | 36504 | 33416 | 33689 | 36505 | 41299 | 33430 | 33326 |
| 2 | 1 | 9446 | 35624 | 33316 | 34256 | 35625 | 41014 | 33461 | 33326 |
| 3 | 1 | 11065 | 34326 | 33299 | 34397 | 38458 | 38459 | 33678 | 33289 |
| 4 | 1 | 11149 | 40640 | 33283 | 40641 | 40642 | 40643 | 33291 | 33431 |


In [8]:
# Delete the previously downloaded files
! rm amzn-anon-access-samples-2.0.csv amzn-anon-access-samples-history-2.0.csv amzn-anon-access-samples.tgz

### 2. CatBoost

Let's use CatBoost on this dataset. 

In [9]:
import numpy as np
from catboost import CatBoostClassifier

data = pd.read_csv("data.csv")

In [10]:
data.head(10)

| | ACTION | RESOURCE | MGR_ID | ROLLUP_1 | ROLLUP_2 | DEPTNAME | BUSINESS_TITLE | JOB_FAMILY | JOB_CODE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9802 | 43122 | 2 | 3 | 33467 | 45383 | 11 | 33326 |
| 1 | 1 | 10617 | 36504 | 33416 | 33689 | 36505 | 41299 | 33430 | 33326 |
| 2 | 1 | 9446 | 35624 | 33316 | 34256 | 35625 | 41014 | 33461 | 33326 |
| 3 | 1 | 11065 | 34326 | 33299 | 34397 | 38458 | 38459 | 33678 | 33289 |
| 4 | 1 | 11149 | 40640 | 33283 | 40641 | 40642 | 40643 | 33291 | 33431 |
| 5 | 0 | 9799 | 35395 | 2 | 33730 | 33731 | 63077 | 33657 | 9 |
| 6 | 1 | 9674 | 45034 | 33521 | 34979 | 34979 | 51381 | 33526 | 33326 |
| 7 | 1 | 10561 | 34023 | 33316 | 34024 | 34025 | 37456 | 33430 | 9 |
| 8 | 1 | 9667 | 43085 | 2 | 33365 | 37239 | 43086 | 33370 | 33326 |
| 9 | 1 | 3164 | 36605 | 2 | 33730 | 33902 | 36850 | 33461 | 33431 |


The dataset looks imbalanced. Let's check the class counts:

In [11]:
data["ACTION"].value_counts()

1    8000
0     152
Name: ACTION, dtype: int64

Let's get input and target.

In [12]:
y = data['ACTION']
X = data.drop(columns='ACTION')

We will use 15% of the data for validation, stratified by the target so that both splits keep the same class ratio.

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X,
                                                      y,
                                                      test_size=0.15,
                                                      random_state=136,
                                                      stratify=y
                                                     )
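To see what `stratify` buys us with only 152 negative examples, here is a small sketch that repeats the split on synthetic labels with the same class counts. The names `y_demo` and `X_demo` are placeholders and carry no real features; only the `train_test_split` arguments are taken from the cell above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported earlier: 8000 ones, 152 zeros
y_demo = [1] * 8000 + [0] * 152
X_demo = list(range(len(y_demo)))  # placeholder feature values

_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=136, stratify=y_demo
)

train_counts = Counter(y_tr)
valid_counts = Counter(y_va)
print("train:", dict(train_counts))
print("valid:", dict(valid_counts))
```

With stratification, each split receives its proportional share of the rare class (roughly 15% of the 152 zeros go to validation). Without it, the number of zeros in the validation set would vary from seed to seed, which makes metrics on the minority class noisy.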

As we have an imbalanced dataset, we calculate class weights that are inversely proportional to the class frequencies and then fit the CatBoost classifier.
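As a quick worked example of this weighting, the sketch below plugs in training counts of 6800 positives and 129 negatives. These counts are an assumption derived from the numbers in this notebook (8000 and 152 in total, minus the 1200 and 23 validation rows listed as support in the classification report), and the variable names are illustrative.

```python
import math

# Assumed training-set class counts (see the note above)
n_neg, n_pos = 129, 6800
n_total = n_neg + n_pos

# Weight = total count / class count, so the rarer class gets the larger weight
w_neg = n_total / n_neg
w_pos = n_total / n_pos
print(round(w_neg, 2), round(w_pos, 3))
# 53.71 1.019

# Each class ends up contributing the same total weight to the loss
print(math.isclose(w_neg * n_neg, w_pos * n_pos))
# True
```

Under this scheme, misclassifying one denied request costs roughly as much as misclassifying 53 approved ones, which pushes the model to pay attention to the minority class.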

In [14]:
class_weight_0 = (sum(y_train==0) + sum(y_train==1))/sum(y_train==0)
class_weight_1 = (sum(y_train==0) + sum(y_train==1))/sum(y_train==1)

params = {'loss_function':'Logloss', # Another option: CrossEntropy
          'eval_metric':'F1', # Some others: Accuracy, Precision, Recall, AUC
          'verbose': 200, # print training progress every 200 iterations
          'random_seed': 13,
          'iterations': 200,
          'class_weights': [class_weight_0, class_weight_1]
         }

# All input features are categorical
cat_features = [0, 1, 2, 3, 4, 5, 6, 7]
cb_classifier = CatBoostClassifier(**params)
cb_classifier.fit(X_train, y_train,
          eval_set=(X_valid, y_valid), # data to validate on
          use_best_model=True, 
          plot=True, # It plots a nice visual for the training process
          cat_features=cat_features)


Learning rate set to 0.102945
0:	learn: 0.5361259	test: 0.5393786	best: 0.5393786 (0)	total: 58.9ms	remaining: 11.7s
199:	learn: 0.9924433	test: 0.7973391	best: 0.8949858 (62)	total: 2.32s	remaining: 0us

bestTest = 0.8949858117
bestIteration = 62

Shrink model to first 63 iterations.


<catboost.core.CatBoostClassifier at 0x7f0b94752630>

Let's see the overall performance on the validation set.

In [15]:
from sklearn.metrics import classification_report

y_pred = cb_classifier.predict(X_valid)

print(classification_report(y_valid, np.round(y_pred)))

              precision    recall  f1-score   support

           0       0.17      0.87      0.28        23
           1       1.00      0.92      0.96      1200

    accuracy                           0.92      1223
   macro avg       0.58      0.89      0.62      1223
weighted avg       0.98      0.92      0.94      1223
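The low precision for class 0 is easier to interpret from raw counts. The sketch below recomputes the class 0 row of the report from a confusion matrix. The 20 correctly flagged rows follow from a recall of 0.87 on 23 examples. The 98 false alarms are an assumed value chosen to be consistent with the rounded report, not the notebook's actual confusion matrix.

```python
# Class 0 (access denied) counts on the validation set
tp0 = 20   # class-0 rows predicted as 0 (20 / 23 rounds to the reported 0.87)
fn0 = 3    # class-0 rows predicted as 1
fp0 = 98   # ASSUMED: class-1 rows predicted as 0, consistent with the rounded report

precision_0 = tp0 / (tp0 + fp0)
recall_0 = tp0 / (tp0 + fn0)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
# precision=0.17 recall=0.87 f1=0.28
```

The class weights make the model catch most denied requests, at the cost of flagging a noticeable number of approved ones. Because approved requests outnumber denied ones by about 50 to 1, even a small error rate on class 1 swamps the handful of true class 0 hits.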

