{"metadata":{"title":"mlp-blog-notebook","kernelspec":{"display_name":"Redshift","language":"postgresql","name":"Redshift"},"language_info":{"file_extension":".sql","name":"Redshift"},"version":1},"nbformat":4,"nbformat_minor":0,"cells":[{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["Here we use the Credit Card Fraud detection data available at https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud  to create, train and deploy MLP model which\n","can be used further to identify fraudulent transactions from the newly captured transaction records.\n","\n","For that, we have downloaded the dataset from the mentioned URL and identified the schema of the CSV file that comes with the downloaded content.\n","\n","We first create a table in Amazon Redshift which should hold the data.\n","One can even keep this CSV file in S3, crawl it using AWS Glue and/or catalog it using Amazon Athena to prepare an external table which can be queried for training the MLP model.\n","\n","Here we choose the option to create a table inside the Amazon Redshift cluster (or Amazon Redshift serverless endpoint)."],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["DROP TABLE IF EXISTS creditcardsfrauds;\n","CREATE TABLE creditcardsfrauds (\n","\ttxtime integer,\n","    v1 float8,\n","    v2 float8,\n","    v3 float8,\n","    v4 float8,\n","    v5 float8,\n","    v6 float8,\n","    v7 float8,\n","    v8 float8,\n","    v9 float8,\n","    v10 float8,\n","    v11 float8,\n","    v12 float8,\n","    v13 float8,\n","    v14 float8,\n","    v15 float8,\n","    v16 float8,\n","    v17 float8,\n","    v18 float8,\n","    v19 float8,\n","    v20 float8,\n","    v21 float8,\n","    v22 float8,\n","    v23 float8,\n","    v24 float8,\n","    v25 float8,\n","    v26 float8,\n","    v27 float8,\n","    v28 float8,\n","    amount float8,\n","    class integer\n",");"],"cell_type":"code","execution_count":0,"outputs":[]},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["Now we load the data into this table"],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["COPY creditcardsfrauds\n","FROM 's3://redshift-ml-blog-mlp/creditcard.csv' \n","IAM_ROLE default\n","CSV QUOTE as '\\\"' delimiter ',' IGNOREHEADER 1 maxerror 100\n","REGION 'us-east-1';"],"cell_type":"code","execution_count":0,"outputs":[]},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["Now we do some quick checks on the numbers…"],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["select count(*) from creditcardsfrauds;"],"cell_type":"code","execution_count":0,"outputs":[]},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["...and decide here the cutoff that will give you 70% or 80% of the total data to be considered for training. Remaining 30% or 20% should be used for validation"],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["CREATE model creditcardsfrauds_mlp\n","FROM (select * from creditcardsfrauds where txtime < 120954)\n","TARGET class \n","FUNCTION creditcardsfrauds_mlp_fn\n","IAM_ROLE DEFAULT\n","MODEL_TYPE MLP\n","SETTINGS (\n","      S3_BUCKET '<<your-amazon-s3-bucket>>'',\n","      MAX_RUNTIME 54000\n",");"],"cell_type":"code","execution_count":0,"outputs":[]},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["Keep checking the status of the model creation using the following command"],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["show model creditcardsfrauds_mlp;"],"cell_type":"code","execution_count":0,"outputs":[]},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["Check the model explainability to understand which columns contributed effectively to derive the prediction"],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["SELECT json_table.report.explanations.kernel_shap.label0.global_shap_values \n","FROM (select explain_model('creditcardsfrauds_mlp') as report) as json_table;"],"cell_type":"code","execution_count":0,"outputs":[]},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["Optional section to check the counts. Can remove from the final post"],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["select min(txtime), max(txtime) from creditcardsfrauds limit 1;\n","\n","select * from creditcardsfrauds limit 10;\n","\n","select min(txtime), max(txtime), count(*) from creditcardsfrauds_mlp where time1 < 120954\n","\n","select class, creditcards_mlp_fn(time1,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,amount) from creditcardsfrauds where time1 >= 120954 limit 50;"],"cell_type":"code","execution_count":0,"outputs":[]},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["Validate the held off 20% to 30% data and check whether the predicted data is closer to the actual data and how much performance accuracy do we achieve."],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["select actualvspredicted, count(actualvspredicted)\n","FROM (\n","\tSELECT \n","      CASE WHEN class =  \n","      creditcardsfrauds_mlp_fn(txtime,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,amount) \n","      THEN 'PredictedMatchesActual' \n","      else 'NoMatch' \n","      END as actualvspredicted\n","    FROM creditcardsfrauds \n","    WHERE txtime >= 120954\n",") \n","group by actualvspredicted;"],"cell_type":"code","execution_count":0,"outputs":[]},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["Now you can start predicting on the newly arriving data."],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["CREATE A STAGING TABLE TO HOLD NEWLY ARRIVING DATA FROM THE SOURCE WHICH WILL NOT CONAIN THE CLASS COLUMN - AS IT IS TO BE PREDICTED"],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["DROP TABLE if exists creditcardsfrauds_staging;\n","CREATE TABLE creditcardsfrauds_staging as (select * from creditcardsfrauds limit 0);\n","Alter table creditcardsfrauds_staging drop column class;"],"cell_type":"code","execution_count":0,"outputs":[]},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["LETS CONSIDER ONLY ONE RECORD HERE WHICH HAS NEWLY ARRIVED"],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["insert into creditcardsfrauds_staging values(174965,-39999.11383160738512,0.58586417180689,-5.39973021073242,1.81709247345531,-0.840618465991056,-2.94354779071974,-2.20800192003372,1.05873267723056,-1.63233334974982,-5000.24598383776964,11.93351953683592,-53046479695456,-1.12745457501155,-666666.41662797597451,0.141237234328704,-2.54949823633632,-4.61471706851594,-10.47813794126038,-0.0354803664667244,0.306270740368093,0.583275998701341,-0.269208637986581,-0.456107772584008,-0.183659129549716,-0.328167759255761,0.606115810329683,0.884875539542905,-0.253700318894381,-2450000000);"],"cell_type":"code","execution_count":0,"outputs":[]},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["USE THE FUNCTION TO PREDICT THE VALUE OF CLASS ALONG WITH THE PROBABILITY"],"cell_type":"markdown"},{"metadata":{"displayMode":"maximized","width":12,"isLimitOn":true},"source":["select predictedActive.labels[0], predictedActive.probabilities[0] \n","from (\n","SELECT creditcardsfrauds_mlp_fn_prob(txtime,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,amount)as predictedACtive\n","FROM creditcardsfrauds_staging ) temp"],"cell_type":"code","execution_count":0,"outputs":[]}]}