{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## In this notebook, we will provide instructions for launching a data wrangler job for preparing the data for Machine Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The datasets required for completing the analysis of the entire source datasets are large. In order to demonstrate the use of Data Wrangler to prepare the data set in an expedited manner, we download and process sample data sets from the sources in CMS. You can use the same flow to process the entire datasets that you manually download from CMS and upload into S3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np \n",
    "import pandas as pd\n",
    "import boto3\n",
    "import os\n",
    "import sagemaker\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "import io\n",
    "import sklearn\n",
    "import string\n",
    "import json\n",
    "from sagemaker import get_execution_role"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Setup environment variables - store variables for use in the fraud identification step"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_client = boto3.client(\"s3\")\n",
    "session = sagemaker.Session()\n",
    "role = get_execution_role()\n",
    "bucket = session.default_bucket()\n",
    "prefix = 'fraud-detect-demo'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Ensure you upgrade your data wrangler application by following the instructions below\n",
    "https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-update.html,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. Download data files needed from CMS. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download sample medicare part B files using the code below - It provides information on services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals.\n",
    "\n",
    "For downloading the entire dataset, go to https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service, manually download the data set and upload into S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict={\n",
    "\"2019\": \"5fccd951-9538-48a7-9075-6f02b9867868\",\n",
    "\"2018\":\t\"02c0692d-e2d9-4714-80c7-a1d16d72ec66\",\n",
    "\"2017\": \"7ebc578d-c2c7-46fd-8cc8-1b035eba7218\",\n",
    "\"2016\": \"5055d307-4fb3-4474-adbb-a11f4182ee35\",\n",
    "\"2015\": \"0ccba18d-b821-47c6-bb55-269b78921637\",\n",
    "\"2014\":\t\"e6aacd22-1b89-4914-855c-f8dacbd2ec60\",\n",
    "\"2013\":\t\"ebaf67d7-1572-4419-a053-c8631cc1cc9b\"\n",
    "}\n",
    "df = pd.DataFrame()\n",
    "for key, value in dict.items():\n",
    "     command = '-cO - https://data.cms.gov/data-api/v1/dataset/' + value + '/data' + '>' +  'medicareB_' + key + '.json'\n",
    "     !wget $command\n",
    "     file = 'medicareB_' + key + '.json'\n",
    "     data = pd.read_json(f'{file}')\n",
    "     df = df.append(data,  ignore_index=True)\n",
    "df.to_csv('medicareB.csv',index = False, header = True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download medicare part D files using the code below - It provides information on prescription drugs prescribed by individual physicians and other health care providers and paid for under the Medicare Part D Prescription Drug Program\n",
    "\n",
    "For downloading the entire dataset, go to https://data.cms.gov/provider-summary-by-type-of-service/medicare-part-d-prescribers/medicare-part-d-prescribers-by-provider-and-drug, manually download the data set and upload into S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict={\n",
    "\"2019\": \"5a27f7a8-c7af-434f-a26c-54db03e22cd1\",\n",
    "\"2018\":\t\"4861ecfc-a656-4dcd-accb-b9c3b840dfcb\",\n",
    "\"2017\": \"04b93a42-c533-4e5c-8df9-a8f254886cde\",\n",
    "\"2016\": \"0015c60c-af38-4d06-98bd-f058c0abb778\",\n",
    "\"2015\": \"5da1b683-99ea-4734-8216-66ffdcd5e443\",\n",
    "\"2014\":\t\"2af61f9c-327c-4a23-8b7f-15e38b56e25a\",\n",
    "\"2013\":\t\"92d814bd-e2fb-48c2-95e7-a4b388a2c4be\"\n",
    "}\n",
    "df = pd.DataFrame()\n",
    "for key, value in dict.items():\n",
    "    file = 'medicareD_' + key + '.json'\n",
    "    command = '-cO - https://data.cms.gov/data-api/v1/dataset/' + value + '/data' + '>' +  file \n",
    "    !wget $command\n",
    "    data = pd.read_json(f'{file}')\n",
    "    df = df.append(data,  ignore_index=True)\n",
    "df.to_csv('medicareD.csv',index = False, header = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download sample DMEPOS files using the code below - It provides information on DMEPOS products and services provided to Medicare beneficiaries ordered by physicians and other healthcare professionals.  \n",
    "\n",
    "For downloading the entire dataset, go to https://data.cms.gov/provider-summary-by-type-of-service/medicare-durable-medical-equipment-devices-supplies/medicare-durable-medical-equipment-devices-supplies-by-referring-provider-and-service, manually download the data set and upload into S3\n",
    "Note: DMEPOS files have columns that have been added over the years - When you download the main data set, make sure to append your data sets to the latest data set in order to maintain consistency. For this Data Wrangler flow, the input for DMEPOS is based on the 2019 file format, which is the latest available at this time. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict={\n",
    "\"2019\": \"d6d3de93-0579-408a-bcfe-c319f04069e7\",\n",
    "\"2018\":\t\"1c41ada1-2cb5-4808-95ef-757f5c237f4c\",\n",
    "\"2017\": \"4ca364a5-89a3-4d9b-9a62-a4817f6c1464\",\n",
    "\"2016\": \"3795f3a1-fa7a-4e7f-b4e4-0a3ba221311d\",\n",
    "\"2015\": \"8f8fcdce-0924-4f5c-ab49-88e34d581f4c\",\n",
    "\"2014\":\t\"3f53a9ab-cc41-46d0-b37d-8799799425a7\",\n",
    "\"2013\":\t\"631d1c71-8af2-4d6e-912a-73f7954178ca\"\n",
    "}\n",
    "df = pd.DataFrame()\n",
    "for key, value in dict.items():\n",
    "    file = 'dmepos_' + key + '.json'\n",
    "    command = '-cO - https://data.cms.gov/data-api/v1/dataset/' + value + '/data' + '>' +  file \n",
    "    !wget $command\n",
    "    data = pd.read_json(f'{file}')\n",
    "    df = df.append(data,  ignore_index=True)\n",
    "df.drop(columns = 'HCPCS_Cd', axis=1, inplace=True)\n",
    "df.to_csv('dmepos.csv',index = False, header = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download the LEIE file using the code below. It provides information about all medical professionals or entities that are excluded from receiving Medicare reimbursement due to violations of sections 1128 or 1156 of the Social Security Act.  We use this dataset to identify fraudulent providers that have submitted for repayment in the above 3 datasets. You can also go to https://oig.hhs.gov/exclusions/exclusions_list.asp to manually download the latest file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget -cO - 'https://oig.hhs.gov/exclusions/downloadables/UPDATED.csv' > leie.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Upload files to S3 and note S3Uri for each file set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_client.upload_file(\n",
    "    Filename=\"medicareB.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/medicareB.csv\"\n",
    ")\n",
    "\n",
    "print(f's3://{bucket}/{prefix}/data/raw/medicareB.csv')\n",
    "\n",
    "s3_client.upload_file(\n",
    "    Filename=\"medicareD.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/medicareD.csv\"\n",
    ")\n",
    "\n",
    "print(f's3://{bucket}/{prefix}/data/raw/medicareD.csv')\n",
    "\n",
    "s3_client.upload_file(\n",
    "    Filename=\"dmepos.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/dmepos.csv\"\n",
    ")\n",
    "\n",
    "print(f's3://{bucket}/{prefix}/data/raw/dmepos.csv')\n",
    "\n",
    "\n",
    "s3_client.upload_file(\n",
    "    Filename=\"leie.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/leie.csv\"\n",
    ")\n",
    "\n",
    "print(f's3://{bucket}/{prefix}/data/raw/leie.csv')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5. Remove intermediate files "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm medicare*.json\n",
    "!rm dmepos*.json\n",
    "!rm medicare*.csv\n",
    "!rm dmepos.csv\n",
    "!rm leie.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6. Once you have the paths to the source files, open the data wrangler flow file \"data_processing.flow\" in \"editor\" mode by right clicking on the file and choosing Open With --> editor. Replace all S3Uri in the JSON with the S3Uri from above. Modify the type\": \"DESTINATION\" output_path to the prefix in S3 where you want to store your output data. \n",
    "\n",
    "7. Once that is done, save the flow by File --> Save and then open the file using Open With --> Flow. Data Wrangler will validate that all the changes you made are correct and you should have no errors shown. \n",
    "\n",
    "8. Click the \"Create Job\" button on the top right and configure the job with the number and type of instances to execute the processing. Once you start the job, you can track the progress in the Sagemaker console\n",
    "\n",
    "9. After the job is complete, you can download the output data from S3 into this notebook and load the data into a dataframe for preview"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Replace the fiile name with what you provided in the S3 DESTINATION in the flow\n",
    "output_file = \"fraud-detect-demo/data/data-processing-2022-02-10T10-07-34/part-00000-1653e8d0-26bf-46bf-b75d-9a16306ff5d2-c000.csv\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_client.download_file(Bucket=bucket, Key=output_file, Filename=\"sample_data.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fraudulent_provider</th>\n",
       "      <th>female</th>\n",
       "      <th>average_medicare_payment_amt_standard_deviation</th>\n",
       "      <th>average_medicare_payment_amt_sum</th>\n",
       "      <th>average_medicare_payment_amt_mean</th>\n",
       "      <th>average_submitted_chrg_amt_standard_deviation</th>\n",
       "      <th>average_submitted_chrg_amt_sum</th>\n",
       "      <th>average_submitted_chrg_amt_mean</th>\n",
       "      <th>bene_day_srvc_cnt_standard_deviation</th>\n",
       "      <th>bene_day_srvc_cnt_sum</th>\n",
       "      <th>...</th>\n",
       "      <th>Tot_Suplr_Srvcs_mean</th>\n",
       "      <th>Tot_Suplr_Clms_standard_deviation</th>\n",
       "      <th>Tot_Suplr_Clms_sum</th>\n",
       "      <th>Tot_Suplr_Clms_mean</th>\n",
       "      <th>Tot_Suplr_Benes_standard_deviation</th>\n",
       "      <th>Tot_Suplr_Benes_sum</th>\n",
       "      <th>Tot_Suplr_Benes_mean</th>\n",
       "      <th>Tot_Suplrs_standard_deviation</th>\n",
       "      <th>Tot_Suplrs_sum</th>\n",
       "      <th>Tot_Suplrs_mean</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>73.478724</td>\n",
       "      <td>4254.762883</td>\n",
       "      <td>92.494845</td>\n",
       "      <td>298.623213</td>\n",
       "      <td>12646.047619</td>\n",
       "      <td>274.914079</td>\n",
       "      <td>79.759654</td>\n",
       "      <td>3021</td>\n",
       "      <td>...</td>\n",
       "      <td>1200.000000</td>\n",
       "      <td>1.414214</td>\n",
       "      <td>24</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>10.0</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>0.707107</td>\n",
       "      <td>3</td>\n",
       "      <td>1.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>312.477500</td>\n",
       "      <td>10102.444037</td>\n",
       "      <td>160.356255</td>\n",
       "      <td>764.145596</td>\n",
       "      <td>25057.484789</td>\n",
       "      <td>397.737854</td>\n",
       "      <td>73.897387</td>\n",
       "      <td>4903</td>\n",
       "      <td>...</td>\n",
       "      <td>158.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>39</td>\n",
       "      <td>13.000000</td>\n",
       "      <td>5.131601</td>\n",
       "      <td>32.0</td>\n",
       "      <td>10.666667</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>104.623237</td>\n",
       "      <td>10032.337465</td>\n",
       "      <td>94.644693</td>\n",
       "      <td>659.156448</td>\n",
       "      <td>41697.115587</td>\n",
       "      <td>393.369015</td>\n",
       "      <td>104.241125</td>\n",
       "      <td>8538</td>\n",
       "      <td>...</td>\n",
       "      <td>1592.000000</td>\n",
       "      <td>2.309401</td>\n",
       "      <td>46</td>\n",
       "      <td>15.333333</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>15.0</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>1.154701</td>\n",
       "      <td>5</td>\n",
       "      <td>1.666667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>30.977305</td>\n",
       "      <td>2871.812003</td>\n",
       "      <td>47.078885</td>\n",
       "      <td>77.016030</td>\n",
       "      <td>6879.508631</td>\n",
       "      <td>112.778830</td>\n",
       "      <td>292.333285</td>\n",
       "      <td>8541</td>\n",
       "      <td>...</td>\n",
       "      <td>265.051724</td>\n",
       "      <td>14.740516</td>\n",
       "      <td>1377</td>\n",
       "      <td>23.741379</td>\n",
       "      <td>5.908764</td>\n",
       "      <td>440.0</td>\n",
       "      <td>7.586207</td>\n",
       "      <td>5.761230</td>\n",
       "      <td>350</td>\n",
       "      <td>6.034483</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>35.425319</td>\n",
       "      <td>1917.425206</td>\n",
       "      <td>53.261811</td>\n",
       "      <td>198.111337</td>\n",
       "      <td>10689.354033</td>\n",
       "      <td>296.926501</td>\n",
       "      <td>7.477373</td>\n",
       "      <td>700</td>\n",
       "      <td>...</td>\n",
       "      <td>14.750000</td>\n",
       "      <td>6.184658</td>\n",
       "      <td>59</td>\n",
       "      <td>14.750000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>20.0</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>1.707825</td>\n",
       "      <td>11</td>\n",
       "      <td>2.750000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 51 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   fraudulent_provider  female  \\\n",
       "0                    0     1.0   \n",
       "1                    0     0.0   \n",
       "2                    0     0.0   \n",
       "3                    0     1.0   \n",
       "4                    0     0.0   \n",
       "\n",
       "   average_medicare_payment_amt_standard_deviation  \\\n",
       "0                                        73.478724   \n",
       "1                                       312.477500   \n",
       "2                                       104.623237   \n",
       "3                                        30.977305   \n",
       "4                                        35.425319   \n",
       "\n",
       "   average_medicare_payment_amt_sum  average_medicare_payment_amt_mean  \\\n",
       "0                       4254.762883                          92.494845   \n",
       "1                      10102.444037                         160.356255   \n",
       "2                      10032.337465                          94.644693   \n",
       "3                       2871.812003                          47.078885   \n",
       "4                       1917.425206                          53.261811   \n",
       "\n",
       "   average_submitted_chrg_amt_standard_deviation  \\\n",
       "0                                     298.623213   \n",
       "1                                     764.145596   \n",
       "2                                     659.156448   \n",
       "3                                      77.016030   \n",
       "4                                     198.111337   \n",
       "\n",
       "   average_submitted_chrg_amt_sum  average_submitted_chrg_amt_mean  \\\n",
       "0                    12646.047619                       274.914079   \n",
       "1                    25057.484789                       397.737854   \n",
       "2                    41697.115587                       393.369015   \n",
       "3                     6879.508631                       112.778830   \n",
       "4                    10689.354033                       296.926501   \n",
       "\n",
       "   bene_day_srvc_cnt_standard_deviation  bene_day_srvc_cnt_sum  ...  \\\n",
       "0                             79.759654                   3021  ...   \n",
       "1                             73.897387                   4903  ...   \n",
       "2                            104.241125                   8538  ...   \n",
       "3                            292.333285                   8541  ...   \n",
       "4                              7.477373                    700  ...   \n",
       "\n",
       "   Tot_Suplr_Srvcs_mean  Tot_Suplr_Clms_standard_deviation  \\\n",
       "0           1200.000000                           1.414214   \n",
       "1            158.000000                           2.000000   \n",
       "2           1592.000000                           2.309401   \n",
       "3            265.051724                          14.740516   \n",
       "4             14.750000                           6.184658   \n",
       "\n",
       "   Tot_Suplr_Clms_sum  Tot_Suplr_Clms_mean  \\\n",
       "0                  24            12.000000   \n",
       "1                  39            13.000000   \n",
       "2                  46            15.333333   \n",
       "3                1377            23.741379   \n",
       "4                  59            14.750000   \n",
       "\n",
       "   Tot_Suplr_Benes_standard_deviation  Tot_Suplr_Benes_sum  \\\n",
       "0                            0.000000                 10.0   \n",
       "1                            5.131601                 32.0   \n",
       "2                            0.000000                 15.0   \n",
       "3                            5.908764                440.0   \n",
       "4                            0.000000                 20.0   \n",
       "\n",
       "   Tot_Suplr_Benes_mean  Tot_Suplrs_standard_deviation  Tot_Suplrs_sum  \\\n",
       "0              5.000000                       0.707107               3   \n",
       "1             10.666667                       0.000000               3   \n",
       "2              5.000000                       1.154701               5   \n",
       "3              7.586207                       5.761230             350   \n",
       "4              5.000000                       1.707825              11   \n",
       "\n",
       "   Tot_Suplrs_mean  \n",
       "0         1.500000  \n",
       "1         1.000000  \n",
       "2         1.666667  \n",
       "3         6.034483  \n",
       "4         2.750000  \n",
       "\n",
       "[5 rows x 51 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_sample = pd.read_csv('sample_data.csv')\n",
    "df_sample.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "10. Next, we move on to Machine Learning to develop a model that can identify fraudulent providers on the entire dataset using two methods: 1) Anomaly Detection (anomaly_detection.ipynb)  2) Classification (classification.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}