{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Amazon Access Samples Data Set\n",
" \n",
" Let's apply our boosting algorithm to a real dataset! We are going to use the __Amazon Access Samples dataset__. \n",
" \n",
" We download this dataset from UCI ML repository from this [link](https://archive.ics.uci.edu/ml/datasets/Amazon+Access+Samples). Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.\n",
"\n",
" \n",
"__Dataset description:__\n",
"\n",
"Employees need to request certain resources to fulfill their daily duties. This data consists of anonymized historical data of employee IT access requests. Data fields look like this:\n",
" #### Column Descriptions\n",
"\n",
"* __ACTION__: 1 if the resource was approved, 0 if not.\n",
"* __RESOURCE__: An ID for each resource\n",
"* __PERSON_MGR_ID__: ID of the user's manager\n",
"* __PERSON_ROLLUP_1__: User grouping ID\n",
"* __PERSON_ROLLUP_2__: User grouping ID\n",
"* __PERSON_BUSINESS_TITLE__: Title ID \n",
"* __PERSON_JOB_FAMILY__: Job family ID \n",
"* __PERSON_JOB_CODE__: Job code ID \n",
"\n",
"Our task is to build a machine learning model that can automatically provision an employee's access to company resources given employee profile information and the resource requested."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%pip install -q -r ../../requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Download and process the dataset\n",
"\n",
"In this section, we will download our dataset and process it. It consists of two files, we will run the following code cells to get our dataset as a single file at the end. One of the files is large (4.8GB), so make sure you have enough storage."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2021-11-03 18:08:48-- https://archive.ics.uci.edu/ml/machine-learning-databases/00216/amzn-anon-access-samples.tgz\n",
"Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252\n",
"Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 12268509 (12M) [application/x-httpd-php]\n",
"Saving to: ‘amzn-anon-access-samples.tgz’\n",
"\n",
"amzn-anon-access-sa 100%[===================>] 11.70M 16.1MB/s in 0.7s \n",
"\n",
"2021-11-03 18:08:49 (16.1 MB/s) - ‘amzn-anon-access-samples.tgz’ saved [12268509/12268509]\n",
"\n"
]
}
],
"source": [
"! wget https://archive.ics.uci.edu/ml/machine-learning-databases/00216/amzn-anon-access-samples.tgz"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"amzn-anon-access-samples-2.0.csv\n",
"amzn-anon-access-samples-history-2.0.csv\n"
]
}
],
"source": [
"! tar -zxvf amzn-anon-access-samples.tgz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have the following files:\n",
"* __amzn-anon-access-samples-2.0.csv__: Employee profile data.\n",
"* __amzn-anon-access-samples-history-2.0.csv__: Resource provision history\n",
"\n",
"Below, we first read the amzn-anon-access-samples-2.0.csv file (it is a large file) and use some employee fields."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import random \n",
"\n",
"person_fields = [\"PERSON_ID\", \"PERSON_MGR_ID\",\n",
" \"PERSON_ROLLUP_1\", \"PERSON_ROLLUP_2\",\n",
" \"PERSON_DEPTNAME\", \"PERSON_BUSINESS_TITLE\",\n",
" \"PERSON_JOB_FAMILY\", \"PERSON_JOB_CODE\"]\n",
"\n",
"people = {}\n",
"for chunk in pd.read_csv('amzn-anon-access-samples-2.0.csv', usecols = person_fields, chunksize=5000): \n",
" for index, row in chunk.iterrows():\n",
" people[row[\"PERSON_ID\"]] = [row[\"PERSON_MGR_ID\"], row[\"PERSON_ROLLUP_1\"],\n",
" row[\"PERSON_ROLLUP_2\"], row[\"PERSON_DEPTNAME\"],\n",
" row[\"PERSON_BUSINESS_TITLE\"], row[\"PERSON_JOB_FAMILY\"],\n",
" row[\"PERSON_JOB_CODE\"]]"
]
},
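{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note: the row-by-row `iterrows()` loop above is easy to read but slow on a file this size. A minimal sketch of an equivalent, faster pass using `itertuples()` (selecting `chunk[person_fields]` first to pin the column order, since `usecols` does not reorder columns):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: itertuples() avoids constructing a Series per row, which makes\n",
"# this pass noticeably faster than iterrows() on large files.\n",
"people = {}\n",
"for chunk in pd.read_csv('amzn-anon-access-samples-2.0.csv',\n",
"                         usecols=person_fields, chunksize=5000):\n",
"    for person_id, *fields in chunk[person_fields].itertuples(index=False):\n",
"        people[person_id] = fields"
]
},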
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's read the resource provision history file. Here, we will create our dataset. We will read the add access and remove access actions and save them."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"add_access_data = []\n",
"remove_access_data = []\n",
"\n",
"df = pd.read_csv('amzn-anon-access-samples-history-2.0.csv')\n",
"\n",
"# Loop through unique logins (employee ids)\n",
"for login in df[\"LOGIN\"].unique():\n",
" login_df = df[df[\"LOGIN\"]==login].copy()\n",
" # Save actions\n",
" for target in login_df[\"TARGET_NAME\"].unique():\n",
" login_target_df = login_df[login_df[\"TARGET_NAME\"]==target]\n",
" unique_actions = login_target_df[\"ACTION\"].unique()\n",
" if((len(unique_actions)==1) and (unique_actions[0]==\"remove_access\")):\n",
" remove_access_data.append([0, target] + people[login])\n",
" elif((len(unique_actions)==1) and (unique_actions[0]==\"add_access\")):\n",
" add_access_data.append([1, target] + people[login])\n",
"\n",
"# Create random seed\n",
"random.seed(30)\n",
"\n",
"# We will use only 8000 random add_access data\n",
"add_access_data = random.sample(add_access_data, 8000)\n",
"\n",
"# Add them together\n",
"data = add_access_data + remove_access_data\n",
"\n",
"# Let's shuffle it\n",
"random.shuffle(data)"
]
},
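{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note: the nested loops above re-filter the DataFrame for every login. A minimal sketch of an equivalent single-pass version using `groupby` (same logic, assuming the two lists are initialized as above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: compute the unique actions for each (LOGIN, TARGET_NAME) pair\n",
"# once with groupby, instead of repeatedly filtering df.\n",
"unique_actions = df.groupby([\"LOGIN\", \"TARGET_NAME\"])[\"ACTION\"].unique()\n",
"for (login, target), actions in unique_actions.items():\n",
"    if len(actions) == 1 and actions[0] == \"remove_access\":\n",
"        remove_access_data.append([0, target] + people[login])\n",
"    elif len(actions) == 1 and actions[0] == \"add_access\":\n",
"        add_access_data.append([1, target] + people[login])"
]
},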
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's save this data so that we can use it later"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(data, columns=[\"ACTION\", \"RESOURCE\",\n",
" \"MGR_ID\", \"ROLLUP_1\",\n",
" \"ROLLUP_2\", \"DEPTNAME\",\n",
" \"BUSINESS_TITLE\", \"JOB_FAMILY\",\n",
" \"JOB_CODE\"])\n",
"\n",
"df.to_csv(\"data.csv\", index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is how our data look like:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ACTION | \n",
" RESOURCE | \n",
" MGR_ID | \n",
" ROLLUP_1 | \n",
" ROLLUP_2 | \n",
" DEPTNAME | \n",
" BUSINESS_TITLE | \n",
" JOB_FAMILY | \n",
" JOB_CODE | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 9802 | \n",
" 43122 | \n",
" 2 | \n",
" 3 | \n",
" 33467 | \n",
" 45383 | \n",
" 11 | \n",
" 33326 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 10617 | \n",
" 36504 | \n",
" 33416 | \n",
" 33689 | \n",
" 36505 | \n",
" 41299 | \n",
" 33430 | \n",
" 33326 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 9446 | \n",
" 35624 | \n",
" 33316 | \n",
" 34256 | \n",
" 35625 | \n",
" 41014 | \n",
" 33461 | \n",
" 33326 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 11065 | \n",
" 34326 | \n",
" 33299 | \n",
" 34397 | \n",
" 38458 | \n",
" 38459 | \n",
" 33678 | \n",
" 33289 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 11149 | \n",
" 40640 | \n",
" 33283 | \n",
" 40641 | \n",
" 40642 | \n",
" 40643 | \n",
" 33291 | \n",
" 33431 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ACTION RESOURCE MGR_ID ROLLUP_1 ROLLUP_2 DEPTNAME BUSINESS_TITLE \\\n",
"0 1 9802 43122 2 3 33467 45383 \n",
"1 1 10617 36504 33416 33689 36505 41299 \n",
"2 1 9446 35624 33316 34256 35625 41014 \n",
"3 1 11065 34326 33299 34397 38458 38459 \n",
"4 1 11149 40640 33283 40641 40642 40643 \n",
"\n",
" JOB_FAMILY JOB_CODE \n",
"0 11 33326 \n",
"1 33430 33326 \n",
"2 33461 33326 \n",
"3 33678 33289 \n",
"4 33291 33431 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Delete the downloaded files\n",
"! rm amzn-anon-access-samples-2.0.csv amzn-anon-access-samples-history-2.0.csv amzn-anon-access-samples.tgz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. LightGBM\n",
"\n",
"Let's use LightGBM on this dataset. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"data = pd.read_csv(\"data.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ACTION | \n",
" RESOURCE | \n",
" MGR_ID | \n",
" ROLLUP_1 | \n",
" ROLLUP_2 | \n",
" DEPTNAME | \n",
" BUSINESS_TITLE | \n",
" JOB_FAMILY | \n",
" JOB_CODE | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 9802 | \n",
" 43122 | \n",
" 2 | \n",
" 3 | \n",
" 33467 | \n",
" 45383 | \n",
" 11 | \n",
" 33326 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 10617 | \n",
" 36504 | \n",
" 33416 | \n",
" 33689 | \n",
" 36505 | \n",
" 41299 | \n",
" 33430 | \n",
" 33326 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 9446 | \n",
" 35624 | \n",
" 33316 | \n",
" 34256 | \n",
" 35625 | \n",
" 41014 | \n",
" 33461 | \n",
" 33326 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 11065 | \n",
" 34326 | \n",
" 33299 | \n",
" 34397 | \n",
" 38458 | \n",
" 38459 | \n",
" 33678 | \n",
" 33289 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 11149 | \n",
" 40640 | \n",
" 33283 | \n",
" 40641 | \n",
" 40642 | \n",
" 40643 | \n",
" 33291 | \n",
" 33431 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ACTION RESOURCE MGR_ID ROLLUP_1 ROLLUP_2 DEPTNAME BUSINESS_TITLE \\\n",
"0 1 9802 43122 2 3 33467 45383 \n",
"1 1 10617 36504 33416 33689 36505 41299 \n",
"2 1 9446 35624 33316 34256 35625 41014 \n",
"3 1 11065 34326 33299 34397 38458 38459 \n",
"4 1 11149 40640 33283 40641 40642 40643 \n",
"\n",
" JOB_FAMILY JOB_CODE \n",
"0 11 33326 \n",
"1 33430 33326 \n",
"2 33461 33326 \n",
"3 33678 33289 \n",
"4 33291 33431 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 8152 entries, 0 to 8151\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype\n",
"--- ------ -------------- -----\n",
" 0 ACTION 8152 non-null int64\n",
" 1 RESOURCE 8152 non-null int64\n",
" 2 MGR_ID 8152 non-null int64\n",
" 3 ROLLUP_1 8152 non-null int64\n",
" 4 ROLLUP_2 8152 non-null int64\n",
" 5 DEPTNAME 8152 non-null int64\n",
" 6 BUSINESS_TITLE 8152 non-null int64\n",
" 7 JOB_FAMILY 8152 non-null int64\n",
" 8 JOB_CODE 8152 non-null int64\n",
"dtypes: int64(9)\n",
"memory usage: 573.3 KB\n"
]
}
],
"source": [
"data.info()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 8000\n",
"0 152\n",
"Name: ACTION, dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[\"ACTION\"].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will fix the column types below to make sure they are handled as categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"y = data[\"ACTION\"].values\n",
"X = data.drop(columns='ACTION')\n",
"\n",
"for c in X.columns:\n",
" X[c] = X[c].astype('category')\n",
" \n",
"X_train, X_valid, y_train, y_valid = train_test_split(X,\n",
" y,\n",
" test_size=0.15,\n",
" random_state=136,\n",
" stratify=y\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's fit the lightGBM model below."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[LightGBM] [Info] Number of positive: 6800, number of negative: 129\n",
"[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001523 seconds.\n",
"You can set `force_col_wise=true` to remove the overhead.\n",
"[LightGBM] [Info] Total Bins 2538\n",
"[LightGBM] [Info] Number of data points in the train set: 6929, number of used features: 8\n",
"[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.981383 -> initscore=3.964865\n",
"[LightGBM] [Info] Start training from score 3.964865\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[1]\tvalid_0's auc: 0.766304\n",
"Training until validation scores don't improve for 20 rounds\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[2]\tvalid_0's auc: 0.850851\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[3]\tvalid_0's auc: 0.849565\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[4]\tvalid_0's auc: 0.856069\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[5]\tvalid_0's auc: 0.855181\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[6]\tvalid_0's auc: 0.853859\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[7]\tvalid_0's auc: 0.854366\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[8]\tvalid_0's auc: 0.853714\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[9]\tvalid_0's auc: 0.851775\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[10]\tvalid_0's auc: 0.850362\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[11]\tvalid_0's auc: 0.847409\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[12]\tvalid_0's auc: 0.846739\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[13]\tvalid_0's auc: 0.846105\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[14]\tvalid_0's auc: 0.845399\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[15]\tvalid_0's auc: 0.846214\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[16]\tvalid_0's auc: 0.845489\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[17]\tvalid_0's auc: 0.844547\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[18]\tvalid_0's auc: 0.842808\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[19]\tvalid_0's auc: 0.843678\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[20]\tvalid_0's auc: 0.842609\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[21]\tvalid_0's auc: 0.842627\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[22]\tvalid_0's auc: 0.840181\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[23]\tvalid_0's auc: 0.840743\n",
"[LightGBM] [Warning] No further splits with positive gain, best gain: -inf\n",
"[24]\tvalid_0's auc: 0.840308\n",
"Early stopping, best iteration is:\n",
"[4]\tvalid_0's auc: 0.856069\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/lightgbm/engine.py:148: UserWarning: Found `n_estimators` in params. Will use it instead of argument\n",
" _log_warning(\"Found `{}` in params. Will use it instead of argument\".format(alias))\n",
"/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/lightgbm/engine.py:148: UserWarning: Found `num_iterations` in params. Will use it instead of argument\n",
" _log_warning(\"Found `{}` in params. Will use it instead of argument\".format(alias))\n",
"/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/lightgbm/basic.py:1433: UserWarning: Overriding the parameters from Reference Dataset.\n",
" _log_warning('Overriding the parameters from Reference Dataset.')\n",
"/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/lightgbm/basic.py:1245: UserWarning: categorical_column in param dict is overridden.\n",
" _log_warning('{} in param dict is overridden.'.format(cat_alias))\n"
]
}
],
"source": [
"import lightgbm as lgb\n",
"\n",
"# Create dataset for lightgbm\n",
"lgb_train = lgb.Dataset(X_train, y_train)\n",
"lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)\n",
"\n",
"# Let's see our parameters\n",
"\n",
"# boosting_type (string, optional (default='gbdt'))\n",
"# ‘gbdt’, traditional Gradient Boosting Decision Tree.\n",
"# ‘dart’, Dropouts meet Multiple Additive Regression Trees.\n",
"# ‘goss’, Gradient-based One-Side Sampling.\n",
"# ‘rf’, Random Forest.\n",
"\n",
"params = {\n",
" 'boosting_type': 'gbdt',\n",
" 'objective': 'binary', # ‘regression’ for LGBMRegressor, ‘binary’ or ‘multiclass’ for LGBMClassifier\n",
" 'metric': ['auc'],\n",
" 'n_estimators': 50, # We can change it, by default 100\n",
" 'learning_rate': 0.1, # Default 0.1\n",
" 'num_iterations': 1000, # Default 100\n",
" 'is_unbalance': True, # Used to fix the class imbalance in the dataset\n",
" 'verbose': 1\n",
"}\n",
"\n",
"#Train\n",
"gbm = lgb.train(params,\n",
" lgb_train,\n",
" valid_sets=lgb_eval,\n",
" early_stopping_rounds=20\n",
" )"
]
},
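{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note: the same model can be trained through LightGBM's scikit-learn wrapper, which also picks up the pandas `category` dtype automatically. A minimal sketch with equivalent settings (assuming the LightGBM version used in this notebook, which still accepts `early_stopping_rounds` in `fit()`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the same setup via the scikit-learn API. LightGBM detects the\n",
"# pandas 'category' dtype, so no explicit categorical_feature is needed.\n",
"clf = lgb.LGBMClassifier(boosting_type='gbdt',\n",
"                         objective='binary',\n",
"                         n_estimators=1000,\n",
"                         learning_rate=0.1,\n",
"                         is_unbalance=True)\n",
"clf.fit(X_train, y_train,\n",
"        eval_set=[(X_valid, y_valid)],\n",
"        eval_metric='auc',\n",
"        early_stopping_rounds=20)"
]
},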
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see the overall performance on validation set."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.26 0.61 0.37 23\n",
" 1 0.99 0.97 0.98 1200\n",
"\n",
" accuracy 0.96 1223\n",
" macro avg 0.63 0.79 0.67 1223\n",
"weighted avg 0.98 0.96 0.97 1223\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"y_pred = gbm.predict(X_valid, num_iteration=gbm.best_iteration)\n",
"\n",
"print(classification_report(y_valid, np.round(y_pred)))"
]
}
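,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rounding at 0.5 is an arbitrary cut-off, and with so few negative examples the minority class is sensitive to it. A minimal sketch of a threshold sweep on the validation probabilities (reporting macro-averaged F1 at each cut-off):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import f1_score\n",
"\n",
"# Sketch: try several decision thresholds instead of the default 0.5\n",
"# and compare the macro-averaged F1 on the validation set.\n",
"for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:\n",
"    preds = (y_pred >= threshold).astype(int)\n",
"    print(threshold, f1_score(y_valid, preds, average='macro'))"
]
}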
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}