{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## In this notebook, we will provide instructions for launching a data wrangler job for preparing the data for Machine Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The datasets required for completing the analysis of the entire source datasets are large. In order to demonstrate the use of Data Wrangler to prepare the data set in an expedited manner, we download and process sample data sets from the sources in CMS. You can use the same flow to process the entire datasets that you manually download from CMS and upload into S3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Import Libraries" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import numpy as np \n", "import pandas as pd\n", "import boto3\n", "import os\n", "import sagemaker\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import io\n", "import sklearn\n", "import string\n", "import json\n", "from sagemaker import get_execution_role" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Setup environment variables - store variables for use in the fraud identification step" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "s3_client = boto3.client(\"s3\")\n", "session = sagemaker.Session()\n", "role = get_execution_role()\n", "bucket = session.default_bucket()\n", "prefix = 'fraud-detect-demo'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Ensure you upgrade your data wrangler application by following the instructions below\n", "https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-update.html," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Download data files needed from CMS. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download sample medicare part B files using the code below - It provides information on services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals.\n", "\n", "For downloading the entire dataset, go to https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service, manually download the data set and upload into S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dict={\n", "\"2019\": \"5fccd951-9538-48a7-9075-6f02b9867868\",\n", "\"2018\":\t\"02c0692d-e2d9-4714-80c7-a1d16d72ec66\",\n", "\"2017\": \"7ebc578d-c2c7-46fd-8cc8-1b035eba7218\",\n", "\"2016\": \"5055d307-4fb3-4474-adbb-a11f4182ee35\",\n", "\"2015\": \"0ccba18d-b821-47c6-bb55-269b78921637\",\n", "\"2014\":\t\"e6aacd22-1b89-4914-855c-f8dacbd2ec60\",\n", "\"2013\":\t\"ebaf67d7-1572-4419-a053-c8631cc1cc9b\"\n", "}\n", "df = pd.DataFrame()\n", "for key, value in dict.items():\n", " command = '-cO - https://data.cms.gov/data-api/v1/dataset/' + value + '/data' + '>' + 'medicareB_' + key + '.json'\n", " !wget $command\n", " file = 'medicareB_' + key + '.json'\n", " data = pd.read_json(f'{file}')\n", " df = df.append(data, ignore_index=True)\n", "df.to_csv('medicareB.csv',index = False, header = True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download medicare part D files using the code below - It provides information on prescription drugs prescribed by individual physicians and other health care providers and paid for under the Medicare Part D Prescription Drug Program\n", "\n", "For downloading the entire dataset, go to https://data.cms.gov/provider-summary-by-type-of-service/medicare-part-d-prescribers/medicare-part-d-prescribers-by-provider-and-drug, manually download the data set and upload into S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dict={\n", "\"2019\": \"5a27f7a8-c7af-434f-a26c-54db03e22cd1\",\n", "\"2018\":\t\"4861ecfc-a656-4dcd-accb-b9c3b840dfcb\",\n", "\"2017\": \"04b93a42-c533-4e5c-8df9-a8f254886cde\",\n", "\"2016\": \"0015c60c-af38-4d06-98bd-f058c0abb778\",\n", "\"2015\": \"5da1b683-99ea-4734-8216-66ffdcd5e443\",\n", "\"2014\":\t\"2af61f9c-327c-4a23-8b7f-15e38b56e25a\",\n", "\"2013\":\t\"92d814bd-e2fb-48c2-95e7-a4b388a2c4be\"\n", "}\n", "df = pd.DataFrame()\n", "for key, value in dict.items():\n", " file = 'medicareD_' + key + '.json'\n", " command = '-cO - https://data.cms.gov/data-api/v1/dataset/' + value + '/data' + '>' + file \n", " !wget $command\n", " data = pd.read_json(f'{file}')\n", " df = df.append(data, ignore_index=True)\n", "df.to_csv('medicareD.csv',index = False, header = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download sample DMEPOS files using the code below - It provides information on DMEPOS products and services provided to Medicare beneficiaries ordered by physicians and other healthcare professionals. \n", "\n", "For downloading the entire dataset, go to https://data.cms.gov/provider-summary-by-type-of-service/medicare-durable-medical-equipment-devices-supplies/medicare-durable-medical-equipment-devices-supplies-by-referring-provider-and-service, manually download the data set and upload into S3\n", "Note: DMEPOS files have columns that have been added over the years - When you download the main data set, make sure to append your data sets to the latest data set in order to maintain consistency. For this Data Wrangler flow, the input for DMEPOS is based on the 2019 file format, which is the latest available at this time. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dict={\n", "\"2019\": \"d6d3de93-0579-408a-bcfe-c319f04069e7\",\n", "\"2018\":\t\"1c41ada1-2cb5-4808-95ef-757f5c237f4c\",\n", "\"2017\": \"4ca364a5-89a3-4d9b-9a62-a4817f6c1464\",\n", "\"2016\": \"3795f3a1-fa7a-4e7f-b4e4-0a3ba221311d\",\n", "\"2015\": \"8f8fcdce-0924-4f5c-ab49-88e34d581f4c\",\n", "\"2014\":\t\"3f53a9ab-cc41-46d0-b37d-8799799425a7\",\n", "\"2013\":\t\"631d1c71-8af2-4d6e-912a-73f7954178ca\"\n", "}\n", "df = pd.DataFrame()\n", "for key, value in dict.items():\n", " file = 'dmepos_' + key + '.json'\n", " command = '-cO - https://data.cms.gov/data-api/v1/dataset/' + value + '/data' + '>' + file \n", " !wget $command\n", " data = pd.read_json(f'{file}')\n", " df = df.append(data, ignore_index=True)\n", "df.drop(columns = 'HCPCS_Cd', axis=1, inplace=True)\n", "df.to_csv('dmepos.csv',index = False, header = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the LEIE file using the code below. It provides information about all medical professionals or entities that are excluded from receiving Medicare reimbursement due to violations of sections 1128 or 1156 of the Social Security Act. We use this dataset to identify fraudulent providers that have submitted for repayment in the above 3 datasets. You can also go to https://oig.hhs.gov/exclusions/exclusions_list.asp to manually download the latest file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget -cO - 'https://oig.hhs.gov/exclusions/downloadables/UPDATED.csv' > leie.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upload files to S3 and note S3Uri for each file set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file(\n", " Filename=\"medicareB.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/medicareB.csv\"\n", ")\n", "\n", "print(f's3://{bucket}/{prefix}/data/raw/medicareB.csv')\n", "\n", "s3_client.upload_file(\n", " Filename=\"medicareD.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/medicareD.csv\"\n", ")\n", "\n", "print(f's3://{bucket}/{prefix}/data/raw/medicareD.csv')\n", "\n", "s3_client.upload_file(\n", " Filename=\"dmepos.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/dmepos.csv\"\n", ")\n", "\n", "print(f's3://{bucket}/{prefix}/data/raw/dmepos.csv')\n", "\n", "\n", "s3_client.upload_file(\n", " Filename=\"leie.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/leie.csv\"\n", ")\n", "\n", "print(f's3://{bucket}/{prefix}/data/raw/leie.csv')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Remove intermediate files " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm medicare*.json\n", "!rm dmepos*.json\n", "!rm medicare*.csv\n", "!rm dmepos.csv\n", "!rm leie.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Once you have the paths to the source files, open the data wrangler flow file \"data_processing.flow\" in \"editor\" mode by right clicking on the file and choosing Open With --> editor. Replace all S3Uri in the JSON with the S3Uri from above. Modify the type\": \"DESTINATION\" output_path to the prefix in S3 where you want to store your output data. \n", "\n", "7. Once that is done, save the flow by File --> Save and then open the file using Open With --> Flow. Data Wrangler will validate that all the changes you made are correct and you should have no errors shown. \n", "\n", "8. Click the \"Create Job\" button on the top right and configure the job with the number and type of instances to execute the processing. Once you start the job, you can track the progress in the Sagemaker console\n", "\n", "9. After the job is complete, you can download the output data from S3 into this notebook and load the data into a dataframe for preview" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "## Replace the fiile name with what you provided in the S3 DESTINATION in the flow\n", "output_file = \"fraud-detect-demo/data/data-processing-2022-02-10T10-07-34/part-00000-1653e8d0-26bf-46bf-b75d-9a16306ff5d2-c000.csv\"" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "s3_client.download_file(Bucket=bucket, Key=output_file, Filename=\"sample_data.csv\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fraudulent_providerfemaleaverage_medicare_payment_amt_standard_deviationaverage_medicare_payment_amt_sumaverage_medicare_payment_amt_meanaverage_submitted_chrg_amt_standard_deviationaverage_submitted_chrg_amt_sumaverage_submitted_chrg_amt_meanbene_day_srvc_cnt_standard_deviationbene_day_srvc_cnt_sum...Tot_Suplr_Srvcs_meanTot_Suplr_Clms_standard_deviationTot_Suplr_Clms_sumTot_Suplr_Clms_meanTot_Suplr_Benes_standard_deviationTot_Suplr_Benes_sumTot_Suplr_Benes_meanTot_Suplrs_standard_deviationTot_Suplrs_sumTot_Suplrs_mean
001.073.4787244254.76288392.494845298.62321312646.047619274.91407979.7596543021...1200.0000001.4142142412.0000000.00000010.05.0000000.70710731.500000
100.0312.47750010102.444037160.356255764.14559625057.484789397.73785473.8973874903...158.0000002.0000003913.0000005.13160132.010.6666670.00000031.000000
200.0104.62323710032.33746594.644693659.15644841697.115587393.369015104.2411258538...1592.0000002.3094014615.3333330.00000015.05.0000001.15470151.666667
301.030.9773052871.81200347.07888577.0160306879.508631112.778830292.3332858541...265.05172414.740516137723.7413795.908764440.07.5862075.7612303506.034483
400.035.4253191917.42520653.261811198.11133710689.354033296.9265017.477373700...14.7500006.1846585914.7500000.00000020.05.0000001.707825112.750000
\n", "

5 rows × 51 columns

\n", "
" ], "text/plain": [ " fraudulent_provider female \\\n", "0 0 1.0 \n", "1 0 0.0 \n", "2 0 0.0 \n", "3 0 1.0 \n", "4 0 0.0 \n", "\n", " average_medicare_payment_amt_standard_deviation \\\n", "0 73.478724 \n", "1 312.477500 \n", "2 104.623237 \n", "3 30.977305 \n", "4 35.425319 \n", "\n", " average_medicare_payment_amt_sum average_medicare_payment_amt_mean \\\n", "0 4254.762883 92.494845 \n", "1 10102.444037 160.356255 \n", "2 10032.337465 94.644693 \n", "3 2871.812003 47.078885 \n", "4 1917.425206 53.261811 \n", "\n", " average_submitted_chrg_amt_standard_deviation \\\n", "0 298.623213 \n", "1 764.145596 \n", "2 659.156448 \n", "3 77.016030 \n", "4 198.111337 \n", "\n", " average_submitted_chrg_amt_sum average_submitted_chrg_amt_mean \\\n", "0 12646.047619 274.914079 \n", "1 25057.484789 397.737854 \n", "2 41697.115587 393.369015 \n", "3 6879.508631 112.778830 \n", "4 10689.354033 296.926501 \n", "\n", " bene_day_srvc_cnt_standard_deviation bene_day_srvc_cnt_sum ... \\\n", "0 79.759654 3021 ... \n", "1 73.897387 4903 ... \n", "2 104.241125 8538 ... \n", "3 292.333285 8541 ... \n", "4 7.477373 700 ... \n", "\n", " Tot_Suplr_Srvcs_mean Tot_Suplr_Clms_standard_deviation \\\n", "0 1200.000000 1.414214 \n", "1 158.000000 2.000000 \n", "2 1592.000000 2.309401 \n", "3 265.051724 14.740516 \n", "4 14.750000 6.184658 \n", "\n", " Tot_Suplr_Clms_sum Tot_Suplr_Clms_mean \\\n", "0 24 12.000000 \n", "1 39 13.000000 \n", "2 46 15.333333 \n", "3 1377 23.741379 \n", "4 59 14.750000 \n", "\n", " Tot_Suplr_Benes_standard_deviation Tot_Suplr_Benes_sum \\\n", "0 0.000000 10.0 \n", "1 5.131601 32.0 \n", "2 0.000000 15.0 \n", "3 5.908764 440.0 \n", "4 0.000000 20.0 \n", "\n", " Tot_Suplr_Benes_mean Tot_Suplrs_standard_deviation Tot_Suplrs_sum \\\n", "0 5.000000 0.707107 3 \n", "1 10.666667 0.000000 3 \n", "2 5.000000 1.154701 5 \n", "3 7.586207 5.761230 350 \n", "4 5.000000 1.707825 11 \n", "\n", " Tot_Suplrs_mean \n", "0 1.500000 \n", "1 1.000000 \n", "2 1.666667 \n", "3 6.034483 \n", "4 2.750000 \n", "\n", "[5 rows x 51 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sample = pd.read_csv('sample_data.csv')\n", "df_sample.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "10. Next, we move on to Machine Learning to develop a model that can identify fraudulent providers on the entire dataset using two methods: 1) Anomaly Detection (anomaly_detection.ipynb) 2) Classification (classification.ipynb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }