{ "cells": [ { "cell_type": "markdown", "id": "ae929fb3-0316-494b-accb-1951756d4d32", "metadata": { "tags": [] }, "source": [ "# ML Demo: Predicting Air Quality w/ ASDI NOAA + OpenAQ Datasets\n", "This demo consists of a Jupyter Notebook that connects to Amazon Sustainability Data Initiative (ASDI) datasets from NOAA and OpenAQ to build a Machine Learning (ML) model to predict air quality levels using weather data via a Binary Classification [AutoGluon](https://auto.gluon.ai/stable/index.html) model. The project's purpose is to demonstrate using two different types of ASDI datasets (files in Amazon S3 and HTTPS APIs) within a Jupyter Notebook, such as provided by Amazon SageMaker Studio Lab. This demo is NOT for scientific or health purposes.\n", "\n", "This Jupyter Notebook can be run for free using Amazon SageMaker Studio Lab and open-source Amazon Sustainability Data Initiative (ASDI) datasets without needing an AWS account.\n", "\n", "[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-smsl-predict-airquality-via-weather/blob/main/aq_by_weather.ipynb)\n", "\n", "## BACKGROUND: 1 out of 8 deaths in the world is due to poor air quality*\n", "This demo explores correlations between weather and air quality since we know factors like temperatures, wind speeds, etc, affect certain air quality parameters. 
Air quality prediction can involve highly sophisticated ML techniques, but this demo shows that merging NOAA GSOD weather data with OpenAQ air quality data and training an AutoGluon (AutoML from AWS) Binary Classification model can reach prediction accuracy of ~75-90% for various high-pollution areas and air quality parameters (tested mostly on 2.5-micron Particulate Matter, with some Ground Level Ozone).\\\n", "*Source: [OpenAQ.org](https://OpenAQ.org)\n", "\n", "## INSTRUCTIONS\n", "- Configure your environment using the repo's environment.yml file or by running the *pip install* commands below.\n", "- If you want to run this Notebook yourself, first use the _Clear All Outputs_ option.\n", "- CELL #3: Review the Classes and Variables defined in this cell to see defaults and how ASDI access occurs.\n", "- CELL #4: Review the pre-defined AQParams and AQScenarios in this cell. You can edit these and/or add your own.\n", "- CELL #5: Select a scenario via the drop-down to use throughout the Notebook. This will drive the ML process.\n", "- Step through the remaining cells in the Notebook to access data, merge data, and build a Binary Classification AutoGluon model.\n", "- NOTE: NOAA and OpenAQ input data are saved locally as CSV files and reused to avoid unnecessary data access and compute. To re-fetch the data, delete the local CSV files first."
] }, { "cell_type": "code", "execution_count": 1, "id": "5a41f10b-f9ed-433a-8d32-bb718f23d811", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n", "Note: you may need to restart the kernel to use updated packages.\n", "Note: you may need to restart the kernel to use updated packages.\n", "Note: you may need to restart the kernel to use updated packages.\n", "Note: you may need to restart the kernel to use updated packages.\n", "Note: you may need to restart the kernel to use updated packages.\n", "Note: you may need to restart the kernel to use updated packages.\n", "Note: you may need to restart the kernel to use updated packages.\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "# Set up your environment according to the repo's environment.yml file or run the following...\n", "# Comment these out, once installed or otherwise not needed.\n", "# This creates an empty pip_requirements.txt file used to suppress 'already satisfied' output.\n", "import os\n", "with open('pip_requirements.txt', mode='a'): pass\n", "%pip install boto3 -r pip_requirements.txt | grep -v 'already satisfied'\n", "%pip install pandas -r pip_requirements.txt | grep -v 'already satisfied'\n", "%pip install numpy -r pip_requirements.txt | grep -v 'already satisfied'\n", "%pip install requests -r pip_requirements.txt | grep -v 'already satisfied'\n", "%pip install ipywidgets -r pip_requirements.txt | grep -v 'already satisfied'\n", "%pip install scikit-learn -r pip_requirements.txt | grep -v 'already satisfied'\n", "%pip install autogluon -r pip_requirements.txt | grep -v 'already satisfied'\n", "%pip install matplotlib -r pip_requirements.txt | grep -v 'already satisfied'\n", "%pip install nbconvert -r pip_requirements.txt | grep -v 'already satisfied'" ] }, { "cell_type": "code", "execution_count": 2, "id": "e844d780-d995-4bdc-98ae-570bc8253265", 
"metadata": {}, "outputs": [], "source": [ "# Import statements for packages used...\n", "import os, glob, shutil, sys, requests, json\n", "import boto3\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import ipywidgets as widgets\n", "\n", "from botocore import UNSIGNED\n", "from botocore.config import Config\n", "from io import StringIO\n", "from datetime import datetime\n", "from types import SimpleNamespace\n", "from IPython.display import clear_output\n", "\n", "# The following is required for matplotlib plots to display in some envs...\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 3, "id": "9ea427ea-8ce7-4dc5-80e1-c674f7f569ee", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Classes and Variables are ready.\n" ] } ], "source": [ "# CELL #3: Review the Variables and Classes defined in this cell to see defaults and how ASDI access occurs...\n", "\n", "# class AQParam => Used to define attributes for the (6) main OpenAQ parameters.\n", "class AQParam:\n", " def __init__(self, id, name, unit, unhealthyThresholdDefault, desc):\n", " self.id = id\n", " self.name = name\n", " self.unit = unit\n", " self.unhealthyThresholdDefault = unhealthyThresholdDefault\n", " self.desc = desc\n", " \n", " def isValid(self):\n", " if(self is not None and self.id > 0 and self.unhealthyThresholdDefault > 0.0 and \n", " len(self.name) > 0 and len(self.unit) > 0 and len(self.desc) > 0):\n", " return True\n", " else:\n", " return False\n", " \n", " def toJSON(self):\n", " return json.dumps(self, default=lambda o: o.__dict__, sort_keys=True, indent=2)\n", "\n", "# class AQScenario => Defines an ML scenario including a Location w/ NOAA Weather Station ID \n", "# and the target OpenAQ Param.\n", "# Note: OpenAQ data mostly begins sometime in 2016, so using that as a default yearStart value.\n", "class AQScenario:\n", " def __init__(self, location=None, noaaStationID=None, 
aqParamTarget=None, unhealthyThreshold=None, \n", " yearStart=2016, yearEnd=2022, aqRadiusMiles=10, featureColumnsToDrop=None):\n", " self.location = location\n", " self.name = location + \"_\" + aqParamTarget.name\n", " self.noaaStationID = noaaStationID\n", " self.noaaStationLat = 0.0\n", " self.noaaStationLng = 0.0\n", " self.openAqLocIDs = []\n", " \n", " self.aqParamTarget = aqParamTarget\n", " \n", " if unhealthyThreshold and unhealthyThreshold > 0.0:\n", " self.unhealthyThreshold = unhealthyThreshold\n", " else:\n", " self.unhealthyThreshold = self.aqParamTarget.unhealthyThresholdDefault\n", " \n", " self.yearStart = yearStart\n", " self.yearEnd = yearEnd\n", " self.aqRadiusMiles = aqRadiusMiles\n", " self.aqRadiusMeters = aqRadiusMiles * 1610 # Rough integer approximation is fine here.\n", " \n", " self.modelFolder = \"AutogluonModels\"\n", " \n", " def getSummary(self):\n", " return f\"Scenario: {self.name} => {self.aqParamTarget.desc} ({self.aqParamTarget.name}) with UnhealthyThreshold > {self.unhealthyThreshold} {self.aqParamTarget.unit}\"\n", " \n", " def getModelPath(self):\n", " return f\"{self.modelFolder}/aq_{self.name}_{self.yearStart}-{self.yearEnd}/\"\n", " \n", " def updateNoaaStationLatLng(self, noaagsod_df_row):\n", " # Use a NOAA row to set Lat+Lng values used for the OpenAQ API requests...\n", " if(noaagsod_df_row is not None and noaagsod_df_row['LATITUDE'] and noaagsod_df_row['LONGITUDE']):\n", " self.noaaStationLat = noaagsod_df_row['LATITUDE']\n", " self.noaaStationLng = noaagsod_df_row['LONGITUDE']\n", " print(f\"NOAA Station Lat,Lng Updated for Scenario: {self.name} => {self.noaaStationLat},{self.noaaStationLng}\")\n", " else:\n", " print(\"NOAA Station Lat,Lng COULD NOT BE UPDATED.\")\n", " \n", " def isValid(self):\n", " if(self is not None and self.aqParamTarget is not None and\n", " self.yearStart > 0 and self.yearEnd > 0 and self.yearEnd >= self.yearStart and \n", " self.aqRadiusMiles > 0 and self.aqRadiusMeters > 0 and 
self.unhealthyThreshold > 0.0 and \n", " len(self.name) > 0 and len(self.noaaStationID) > 0):\n", " return True\n", " else:\n", " return False\n", " \n", " def toJSON(self):\n", " return json.dumps(self, default=lambda o: o.__dict__, sort_keys=True, indent=2)\n", "\n", "# class AQbyWeatherApp => Main app class with settings, AQParams, AQScenarios, and data access methods...\n", "class AQbyWeatherApp:\n", " def __init__(self, mlTargetLabel='isUnhealthy', mlEvalMetric='accuracy', mlTimeLimitSecs=None):\n", " self.mlTargetLabel = mlTargetLabel\n", " self.mlEvalMetric = mlEvalMetric\n", " self.mlTimeLimitSecs = mlTimeLimitSecs\n", " self.mlIgnoreColumns = ['DATE','NAME','LATITUDE','LONGITUDE','day','unit','average','parameter']\n", " \n", " self.defaultColumnsNOAA = ['DATE','NAME','LATITUDE','LONGITUDE',\n", " 'DEWP','WDSP','MAX','MIN','PRCP','MONTH'] # Default relevant NOAA columns\n", " self.defaultColumnsOpenAQ = ['day','parameter','unit','average'] # Default relevant OpenAQ columns\n", " \n", " self.aqParams = {} # A dict of AQParam objects, keyed by AQParam.name\n", " self.aqScenarios = {} # A dict of AQScenario objects, keyed by AQScenario.name\n", " \n", " self.selectedScenario = None\n", " \n", " def addAQParam(self, aqParam):\n", " if aqParam and aqParam.isValid():\n", " self.aqParams[aqParam.name] = aqParam\n", " return True\n", " else:\n", " return False\n", " \n", " def addAQScenario(self, aqScenario):\n", " if aqScenario and aqScenario.isValid():\n", " self.aqScenarios[aqScenario.name] = aqScenario\n", " if(self.selectedScenario is None):\n", " self.selectedScenario = self.aqScenarios[next(iter(self.aqScenarios))] # Default selectedScenario to 1st item.\n", " return True\n", " else:\n", " return False\n", " \n", " def getFilenameNOAA(self):\n", " if self and self.selectedScenario and self.selectedScenario.isValid():\n", " return f\"dataNOAA_{self.selectedScenario.name}_{self.selectedScenario.yearStart}-{self.selectedScenario.yearEnd}_{self.selectedScenario.noaaStationID}.csv\"\n", " else:\n", " 
return \"\"\n", " \n", " def getFilenameOpenAQ(self):\n", " if self and self.selectedScenario and self.selectedScenario.isValid() and len(self.selectedScenario.openAqLocIDs) > 0:\n", " idString = \"\"\n", " for i in range(0, len(self.selectedScenario.openAqLocIDs)):\n", " idString = idString + str(self.selectedScenario.openAqLocIDs[i]) + \"-\"\n", " idString = idString[:-1]\n", " return f\"dataOpenAQ_{self.selectedScenario.name}_{self.selectedScenario.yearStart}-{self.selectedScenario.yearEnd}_{idString}.csv\"\n", " else:\n", " return \"\"\n", " \n", " def getFilenameOther(self, prefix):\n", " if self and self.selectedScenario and self.selectedScenario.isValid():\n", " return f\"{prefix}_{self.selectedScenario.name}_{self.selectedScenario.yearStart}-{self.selectedScenario.yearEnd}.csv\"\n", " \n", " def getNoaaDataFrame(self):\n", " # ASDI Dataset Name: NOAA GSOD\n", " # ASDI Dataset URL : https://registry.opendata.aws/noaa-gsod/\n", " # NOAA GSOD README : https://www.ncei.noaa.gov/data/global-summary-of-the-day/doc/readme.txt\n", " # NOAA GSOD data in S3 is organized by year and Station ID values, so this is straight-forward\n", " # Example S3 path format => s3://noaa-gsod-pds/{yyyy}/{stationid}.csv\n", " # Let's start with a new DataFrame and load it from a local CSV or the NOAA data source...\n", " noaagsod_df = pd.DataFrame()\n", " filenameNOAA = self.getFilenameNOAA()\n", "\n", " if os.path.exists(filenameNOAA):\n", " # Use local data file already accessed + prepared...\n", " print('Loading NOAA GSOD data from local file: ', filenameNOAA)\n", " noaagsod_df = pd.read_csv(filenameNOAA)\n", " else:\n", " # Access + prepare data and save to a local data file...\n", " noaagsod_bucket = 'noaa-gsod-pds'\n", " print(f'Accessing and preparing data from ASDI-hosted NOAA GSOD dataset in Amazon S3 (bucket: {noaagsod_bucket})...')\n", " s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))\n", "\n", " for year in range(self.selectedScenario.yearStart, 
self.selectedScenario.yearEnd + 1):\n", " key = f'{year}/{self.selectedScenario.noaaStationID}.csv' # Compute the key to get\n", " csv_obj = s3.get_object(Bucket=noaagsod_bucket, Key=key) # Get the S3 object\n", " csv_string = csv_obj['Body'].read().decode('utf-8') # Read object contents to a string\n", " noaagsod_df = pd.concat([noaagsod_df, pd.read_csv(StringIO(csv_string))], ignore_index=True) # Use the string to build the DataFrame\n", "\n", " # Perform some Feature Engineering to append potentially useful columns to our dataset... (TODO: Add more to optimize...)\n", " # Month may affect air quality (e.g., seasonal patterns; it tends to correlate in certain areas)\n", " noaagsod_df['MONTH'] = pd.to_datetime(noaagsod_df['DATE']).dt.month\n", "\n", " # Trim down to the desired key columns... (do this last in case engineered columns are to be removed)\n", " noaagsod_df = noaagsod_df[self.defaultColumnsNOAA]\n", " \n", " return noaagsod_df\n", " \n", " def getOpenAqDataFrame(self):\n", " # ASDI Dataset Name: OpenAQ\n", " # ASDI Dataset URL : https://registry.opendata.aws/openaq/\n", " # OpenAQ API Docs : https://docs.openaq.org/#/v2/\n", " # OpenAQ S3 data is only organized by date folders, so each folder is large and contains all stations.\n", " # Because of this, it's better to query ASDI OpenAQ data using the CloudFront-hosted API.\n", " # Note that some days may not have values and will get filtered out via an INNER JOIN later.\n", " # Let's start with a new DataFrame and load it from a local CSV or the OpenAQ data source...\n", " aq_df = pd.DataFrame()\n", " aq_reqUrlBase = \"https://api.openaq.org/v2\" # OpenAQ ASDI API Endpoint URL Base (i.e., add /locations OR /averages)\n", " \n", " if self.selectedScenario.noaaStationLat == 0.0 or self.selectedScenario.noaaStationLng == 0.0:\n", " print(\"NOAA Station Lat/Lng NOT DEFINED. 
CANNOT PROCEED\")\n", " return aq_df\n", " \n", " if len(self.selectedScenario.openAqLocIDs) == 0:\n", " # We must start by querying nearby OpenAQ Locations for their IDs...\n", " print('Accessing ASDI-hosted OpenAQ Locations (HTTPS API)...')\n", " aq_reqParams = {\n", " 'limit': 10,\n", " 'page': 1,\n", " 'offset': 0,\n", " 'sort': 'desc',\n", " 'order_by': 'location',\n", " 'parameter': self.selectedScenario.aqParamTarget.name,\n", " 'coordinates': f'{self.selectedScenario.noaaStationLat},{self.selectedScenario.noaaStationLng}',\n", " 'radius': self.selectedScenario.aqRadiusMeters,\n", " 'isMobile': 'false',\n", " 'sensorType': 'reference grade',\n", " 'dumpRaw': 'false'\n", " }\n", " aq_resp = requests.get(aq_reqUrlBase + \"/locations\", aq_reqParams)\n", " aq_data = aq_resp.json()\n", " if aq_data['results'] and len(aq_data['results']) >= 1:\n", " for i in range(0, len(aq_data['results'])):\n", " self.selectedScenario.openAqLocIDs.append(aq_data['results'][i]['id'])\n", " print(f\"OpenAQ Location IDs within {self.selectedScenario.aqRadiusMiles} miles ({self.selectedScenario.aqRadiusMeters}m) \" +\n", " f\"of NOAA Station {self.selectedScenario.noaaStationID} at \" +\n", " f\"{self.selectedScenario.noaaStationLat},{self.selectedScenario.noaaStationLng}: {self.selectedScenario.openAqLocIDs}\")\n", " else:\n", " print(f\"NO OpenAQ Location IDs found within {self.selectedScenario.aqRadiusMiles} miles ({self.selectedScenario.aqRadiusMeters}m) \" +\n", " f\"of NOAA Station {self.selectedScenario.noaaStationID}. 
CANNOT PROCEED.\")\n", " \n", " if len(self.selectedScenario.openAqLocIDs) >= 1:\n", " filenameOpenAQ = self.getFilenameOpenAQ()\n", "\n", " if os.path.exists(filenameOpenAQ):\n", " # Use local data file already accessed + prepared...\n", " print('Loading OpenAQ data from local file: ', filenameOpenAQ)\n", " aq_df = pd.read_csv(filenameOpenAQ)\n", " else:\n", " # Access + prepare data (NOTE: calling OpenAQ API one year at a time to avoid timeouts)\n", " print('Accessing ASDI-hosted OpenAQ Averages (HTTPS API)...')\n", "\n", " for year in range(self.selectedScenario.yearStart, self.selectedScenario.yearEnd + 1):\n", " aq_reqParams = {\n", " 'date_from': f'{year}-01-01',\n", " 'date_to': f'{year}-12-31',\n", " 'parameter': [self.selectedScenario.aqParamTarget.name],\n", " 'limit': 366,\n", " 'page': 1,\n", " 'offset': 0,\n", " 'sort': 'asc',\n", " 'spatial': 'location',\n", " 'temporal': 'day',\n", " 'location': self.selectedScenario.openAqLocIDs,\n", " 'group': 'true'\n", " }\n", " aq_resp = requests.get(aq_reqUrlBase + \"/averages\", aq_reqParams)\n", " aq_data = aq_resp.json()\n", " if aq_data['results'] and len(aq_data['results']) >= 1:\n", " aq_df = pd.concat([aq_df, pd.json_normalize(aq_data['results'])], ignore_index=True)\n", "\n", " # Using the joined dataframe, trim down to the desired key columns...\n", " aq_df = aq_df[self.defaultColumnsOpenAQ]\n", "\n", " # Perform some Label Engineering to add our binary classification label => {0=OKAY, 1=UNHEALTHY}\n", " aq_df[self.mlTargetLabel] = np.where(aq_df['average'] <= self.selectedScenario.unhealthyThreshold, 0, 1)\n", " \n", " return aq_df\n", " \n", " def getMergedDataFrame(self, noaagsod_df, aq_df):\n", " if len(noaagsod_df) > 0 and len(aq_df) > 0:\n", " merged_df = pd.merge(noaagsod_df, aq_df, how=\"inner\", left_on=\"DATE\", right_on=\"day\")\n", " merged_df = merged_df.drop(columns=self.mlIgnoreColumns)\n", " return merged_df\n", " else:\n", " return pd.DataFrame()\n", " \n", " def 
getConfusionMatrixData(self, cm):\n", " cmData = SimpleNamespace()\n", " cmData.TN = cm[0][0]\n", " cmData.TP = cm[1][1]\n", " cmData.FN = cm[1][0]\n", " cmData.FP = cm[0][1]\n", " \n", " cmData.TN_Rate = cmData.TN/(cmData.TN+cmData.FP)\n", " cmData.TP_Rate = cmData.TP/(cmData.TP+cmData.FN)\n", " cmData.FN_Rate = cmData.FN/(cmData.FN+cmData.TP)\n", " cmData.FP_Rate = cmData.FP/(cmData.FP+cmData.TN)\n", " \n", " cmData.TN_Output = f\"True Negatives (TN): {cmData.TN} of {cmData.TN+cmData.FP} => {round(cmData.TN_Rate * 100, 2)}%\"\n", " cmData.TP_Output = f\"True Positives (TP): {cmData.TP} of {cmData.TP+cmData.FN} => {round(cmData.TP_Rate * 100, 2)}%\"\n", " cmData.FN_Output = f\"False Negatives (FN): {cmData.FN} of {cmData.FN+cmData.TP} => {round(cmData.FN_Rate * 100, 2)}%\"\n", " cmData.FP_Output = f\"False Positives (FP): {cmData.FP} of {cmData.FP+cmData.TN} => {round(cmData.FP_Rate * 100, 2)}%\"\n", " \n", " return cmData\n", " \n", "print(\"Classes and Variables are ready.\")" ] }, { "cell_type": "code", "execution_count": 4, "id": "599478b3-5abd-4bc3-9888-d3dee71b1c09", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AQbyWeather.aqParams: 6\n", "AQbyWeather.aqScenarios: 15 (Default Selected: bakersfield_pm25)\n" ] } ], "source": [ "# CELL #4: Review the pre-defined AQParams and AQScenarios in this cell. You can edit these and/or use your own...\n", "# AQParams are added with default thresholds, which can be overridden on a per-AQScenario basis.\n", "# These AQParams are based on the OpenAQ /parameters API call where isCore=true (https://api.openaq.org/v2/parameters).\n", "# Default thresholds are based on data from EPA.gov (https://www.epa.gov/criteria-air-pollutants/naaqs-table).\n", "# Confirm and adjust params or thresholds as needed... 
Not for scientific or health purposes.\n", "\n", "# Instantiate main App class with explicit mlTargetLabel and mlEvalMetric provided...\n", "AQbyWeather = AQbyWeatherApp(mlTargetLabel='isUnhealthy', mlEvalMetric='accuracy')\n", "\n", "# Define and add new AQParams...\n", "# NOTE: The EPA NAAQS table lists the NO2 (100) and SO2 (75) standards in ppb; they are converted to ppm here to match the unit.\n", "AQbyWeather.addAQParam(AQParam( 1, \"pm10\", \"µg/m³\", 150.0, \"Particulate Matter < 10 micrometers\"))\n", "AQbyWeather.addAQParam(AQParam( 2, \"pm25\", \"µg/m³\", 12.0, \"Particulate Matter < 2.5 micrometers\"))\n", "AQbyWeather.addAQParam(AQParam( 7, \"no2\", \"ppm\", 0.100, \"Nitrogen Dioxide\"))\n", "AQbyWeather.addAQParam(AQParam( 8, \"co\", \"ppm\", 9.0, \"Carbon Monoxide\"))\n", "AQbyWeather.addAQParam(AQParam( 9, \"so2\", \"ppm\", 0.075, \"Sulfur Dioxide\"))\n", "AQbyWeather.addAQParam(AQParam(10, \"o3\", \"ppm\", 0.070, \"Ground Level Ozone\"))\n", "\n", "# Define available AQ Scenarios for certain locations with their associated NOAA GSOD StationID values...\n", "# NOAA GSOD Station Search: https://www.ncei.noaa.gov/access/search/data-search/global-summary-of-the-day\n", "# TODO: Someday consider how to OPTIONALLY append more scenarios via an optional JSON file\n", "# (i.e., without adding a dependency outside the .ipynb file)\n", "# NOTE: For Ozone Scenarios, we're generally using 0.035 ppm to override the default threshold.\n", "AQbyWeather.addAQScenario(AQScenario(\"bakersfield\", \"72384023155\", AQbyWeather.aqParams[\"pm25\"], None))\n", "AQbyWeather.addAQScenario(AQScenario(\"bakersfield\", \"72384023155\", AQbyWeather.aqParams[\"pm10\"], None)) # Attempt at pm10 prediction.\n", "AQbyWeather.addAQScenario(AQScenario(\"bakersfield\", \"72384023155\", AQbyWeather.aqParams[\"o3\"], 0.035))\n", "AQbyWeather.addAQScenario(AQScenario(\"fresno\", \"72389093193\", AQbyWeather.aqParams[\"pm25\"], None))\n", "AQbyWeather.addAQScenario(AQScenario(\"fresno\", \"72389093193\", AQbyWeather.aqParams[\"o3\"], 0.035))\n", "AQbyWeather.addAQScenario(AQScenario(\"visalia\", \"72389693144\", 
AQbyWeather.aqParams[\"pm25\"], None))\n", "AQbyWeather.addAQScenario(AQScenario(\"visalia\", \"72389693144\", AQbyWeather.aqParams[\"o3\"], 0.035))\n", "AQbyWeather.addAQScenario(AQScenario(\"san-jose\", \"72494693232\", AQbyWeather.aqParams[\"pm25\"], None))\n", "AQbyWeather.addAQScenario(AQScenario(\"san-jose\", \"72494693232\", AQbyWeather.aqParams[\"o3\"], 0.035))\n", "AQbyWeather.addAQScenario(AQScenario(\"los-angeles\", \"72287493134\", AQbyWeather.aqParams[\"pm25\"], None))\n", "AQbyWeather.addAQScenario(AQScenario(\"los-angeles\", \"72287493134\", AQbyWeather.aqParams[\"o3\"], 0.035))\n", "AQbyWeather.addAQScenario(AQScenario(\"phoenix\", \"72278023183\", AQbyWeather.aqParams[\"pm25\"], None))\n", "AQbyWeather.addAQScenario(AQScenario(\"phoenix\", \"72278023183\", AQbyWeather.aqParams[\"o3\"], 0.035))\n", "AQbyWeather.addAQScenario(AQScenario(\"fairbanks\", \"70261026411\", AQbyWeather.aqParams[\"pm25\"], None))\n", "AQbyWeather.addAQScenario(AQScenario(\"lahore-pk\", \"41640099999\", AQbyWeather.aqParams[\"pm25\"], None))\n", "\n", "print(f\"AQbyWeather.aqParams: {str(len(AQbyWeather.aqParams))}\")\n", "print(f\"AQbyWeather.aqScenarios: {str(len(AQbyWeather.aqScenarios))} (Default Selected: {AQbyWeather.selectedScenario.name})\")" ] }, { "cell_type": "code", "execution_count": 5, "id": "3395018f-87bd-4a7f-8ee5-7e90d38bfb04", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "*** CHOOSE YOUR OWN ADVENTURE HERE ***\n", "Please select a Scenario via the following drop-down-list...\n", "(NOTE: If you change Scenario, you must re-run remaining cells to see changes.)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "54d02042d8c74215baadef6fc83953d1", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(options=('bakersfield_pm25', 'bakersfield_pm10', 'bakersfield_o3', 'fresno_pm25', 'fresno_o3', 'visal…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# CELL #5: 
Select a Scenario via DROP DOWN LIST to use throughout the Notebook. This will drive the ML process...\n", "# A default \"value\" is set to avoid issues. Change this default to run the Notebook from start-to-finish for that Scenario.\n", "print(\"*** CHOOSE YOUR OWN ADVENTURE HERE ***\")\n", "print(\"Please select a Scenario via the following drop-down-list...\")\n", "print(\"(NOTE: If you change Scenario, you must re-run remaining cells to see changes.)\")\n", "ddl = widgets.Dropdown(options=AQbyWeather.aqScenarios.keys(), \n", " value=AQbyWeather.aqScenarios[\"bakersfield_pm25\"].name) # <-- DEFAULT / FULL-RUN VALUE\n", "ddl" ] }, { "cell_type": "code", "execution_count": 6, "id": "ffed30d9-1725-405b-a87f-e2aac0be5634", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n", "{\n", " \"aqParamTarget\": {\n", " \"desc\": \"Particulate Matter < 2.5 micrometers\",\n", " \"id\": 2,\n", " \"name\": \"pm25\",\n", " \"unhealthyThresholdDefault\": 12.0,\n", " \"unit\": \"\\u00b5g/m\\u00b3\"\n", " },\n", " \"aqRadiusMeters\": 16100,\n", " \"aqRadiusMiles\": 10,\n", " \"location\": \"bakersfield\",\n", " \"modelFolder\": \"AutogluonModels\",\n", " \"name\": \"bakersfield_pm25\",\n", " \"noaaStationID\": \"72384023155\",\n", " \"noaaStationLat\": 0.0,\n", " \"noaaStationLng\": 0.0,\n", " \"openAqLocIDs\": [],\n", " \"unhealthyThreshold\": 12.0,\n", " \"yearEnd\": 2022,\n", " \"yearStart\": 2016\n", "}\n" ] } ], "source": [ "if ddl.value:\n", " AQbyWeather.selectedScenario = AQbyWeather.aqScenarios[ddl.value]\n", " print(AQbyWeather.selectedScenario.getSummary())\n", " print(AQbyWeather.selectedScenario.toJSON())\n", "else:\n", " print(\"Please select a Scenario via the above drop-down-list.\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "9c3a647c-0fee-4b49-8576-0ee355629f0e", "metadata": {}, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n", "Accessing and preparing data from ASDI-hosted NOAA GSOD dataset in Amazon S3 (bucket: noaa-gsod-pds)...\n", "NOAA Station Lat,Lng Updated for Scenario: bakersfield_pm25 => 35.4344,-119.0542\n", "noaagsod_df.shape = (2346, 10)\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>DATE</th><th>NAME</th><th>LATITUDE</th><th>LONGITUDE</th><th>DEWP</th><th>WDSP</th><th>MAX</th><th>MIN</th><th>PRCP</th><th>MONTH</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>2016-01-01</td><td>BAKERSFIELD AIRPORT, CA US</td><td>35.43440</td><td>-119.05420</td><td>29.2</td><td>3.0</td><td>55.0</td><td>27.0</td><td>0.0</td><td>1</td></tr>\n",
" <tr><th>1</th><td>2016-01-02</td><td>BAKERSFIELD AIRPORT, CA US</td><td>35.43440</td><td>-119.05420</td><td>32.8</td><td>2.4</td><td>57.0</td><td>27.0</td><td>0.0</td><td>1</td></tr>\n",
" <tr><th>2</th><td>2016-01-03</td><td>BAKERSFIELD AIRPORT, CA US</td><td>35.43440</td><td>-119.05420</td><td>32.6</td><td>4.3</td><td>68.0</td><td>34.0</td><td>0.0</td><td>1</td></tr>\n",
" <tr><th>3</th><td>2016-01-04</td><td>BAKERSFIELD AIRPORT, CA US</td><td>35.43440</td><td>-119.05420</td><td>32.0</td><td>4.9</td><td>68.0</td><td>39.0</td><td>0.0</td><td>1</td></tr>\n",
" <tr><th>4</th><td>2016-01-05</td><td>BAKERSFIELD AIRPORT, CA US</td><td>35.43440</td><td>-119.05420</td><td>42.8</td><td>4.5</td><td>64.9</td><td>42.1</td><td>0.0</td><td>1</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>2341</th><td>2022-06-04</td><td>BAKERSFIELD AIRPORT, CA US</td><td>35.43424</td><td>-119.05524</td><td>50.5</td><td>7.3</td><td>91.0</td><td>64.0</td><td>0.0</td><td>6</td></tr>\n",
" <tr><th>2342</th><td>2022-06-05</td><td>BAKERSFIELD AIRPORT, CA US</td><td>35.43424</td><td>-119.05524</td><td>45.1</td><td>6.4</td><td>90.0</td><td>63.0</td><td>0.0</td><td>6</td></tr>\n",
" <tr><th>2343</th><td>2022-06-06</td><td>BAKERSFIELD AIRPORT, CA US</td><td>35.43424</td><td>-119.05524</td><td>58.7</td><td>7.5</td><td>90.0</td><td>63.0</td><td>0.0</td><td>6</td></tr>\n",
" <tr><th>2344</th><td>2022-06-07</td><td>BAKERSFIELD AIRPORT, CA US</td><td>35.43424</td><td>-119.05524</td><td>47.8</td><td>5.4</td><td>91.0</td><td>66.0</td><td>0.0</td><td>6</td></tr>\n",
" <tr><th>2345</th><td>2022-06-08</td><td>BAKERSFIELD AIRPORT, CA US</td><td>35.43424</td><td>-119.05524</td><td>43.5</td><td>4.3</td><td>91.9</td><td>75.9</td><td>0.0</td><td>6</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2346 rows × 10 columns</p>\n",
"</div>
" ], "text/plain": [ " DATE NAME LATITUDE LONGITUDE DEWP WDSP \\\n", "0 2016-01-01 BAKERSFIELD AIRPORT, CA US 35.43440 -119.05420 29.2 3.0 \n", "1 2016-01-02 BAKERSFIELD AIRPORT, CA US 35.43440 -119.05420 32.8 2.4 \n", "2 2016-01-03 BAKERSFIELD AIRPORT, CA US 35.43440 -119.05420 32.6 4.3 \n", "3 2016-01-04 BAKERSFIELD AIRPORT, CA US 35.43440 -119.05420 32.0 4.9 \n", "4 2016-01-05 BAKERSFIELD AIRPORT, CA US 35.43440 -119.05420 42.8 4.5 \n", "... ... ... ... ... ... ... \n", "2341 2022-06-04 BAKERSFIELD AIRPORT, CA US 35.43424 -119.05524 50.5 7.3 \n", "2342 2022-06-05 BAKERSFIELD AIRPORT, CA US 35.43424 -119.05524 45.1 6.4 \n", "2343 2022-06-06 BAKERSFIELD AIRPORT, CA US 35.43424 -119.05524 58.7 7.5 \n", "2344 2022-06-07 BAKERSFIELD AIRPORT, CA US 35.43424 -119.05524 47.8 5.4 \n", "2345 2022-06-08 BAKERSFIELD AIRPORT, CA US 35.43424 -119.05524 43.5 4.3 \n", "\n", " MAX MIN PRCP MONTH \n", "0 55.0 27.0 0.0 1 \n", "1 57.0 27.0 0.0 1 \n", "2 68.0 34.0 0.0 1 \n", "3 68.0 39.0 0.0 1 \n", "4 64.9 42.1 0.0 1 \n", "... ... ... ... ... 
\n", "2341 91.0 64.0 0.0 6 \n", "2342 90.0 63.0 0.0 6 \n", "2343 90.0 63.0 0.0 6 \n", "2344 91.0 66.0 0.0 6 \n", "2345 91.9 75.9 0.0 6 \n", "\n", "[2346 rows x 10 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# GET NOAA GSOD WEATHER DATA...\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "noaagsod_df = AQbyWeather.getNoaaDataFrame()\n", "\n", "if(len(noaagsod_df) >= 1):\n", " # Update NOAA Station Lat/Lng...\n", " AQbyWeather.selectedScenario.updateNoaaStationLatLng(noaagsod_df.iloc[0])\n", " \n", " # Save DataFrame to CSV...\n", " noaagsod_df.to_csv(AQbyWeather.getFilenameNOAA(), index=False)\n", "\n", " # Output DataFrame properties...\n", " print('noaagsod_df.shape =', noaagsod_df.shape)\n", " display(noaagsod_df)" ] }, { "cell_type": "code", "execution_count": 8, "id": "2893d81f-a488-464d-b374-dafbfc0ae3fb", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n", "Accessing ASDI-hosted OpenAQ Locations (HTTPS API)...\n", "OpenAQ Location IDs within 10 miles (16100m) of NOAA Station 72384023155 at 35.4344,-119.0542: [6885, 28777, 6883, 230814, 887]\n", "Accessing ASDI-hosted OpenAQ Averages (HTTPS API)...\n", "aq_df.shape = (2195, 5)\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>day</th><th>parameter</th><th>unit</th><th>average</th><th>isUnhealthy</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>2016-03-06</td><td>pm25</td><td>µg/m³</td><td>2.0000</td><td>0</td></tr>\n",
" <tr><th>1</th><td>2016-03-10</td><td>pm25</td><td>µg/m³</td><td>10.5000</td><td>0</td></tr>\n",
" <tr><th>2</th><td>2016-03-11</td><td>pm25</td><td>µg/m³</td><td>5.4286</td><td>0</td></tr>\n",
" <tr><th>3</th><td>2016-03-12</td><td>pm25</td><td>µg/m³</td><td>4.1667</td><td>0</td></tr>\n",
" <tr><th>4</th><td>2016-03-13</td><td>pm25</td><td>µg/m³</td><td>6.0000</td><td>0</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>2190</th><td>2022-06-10</td><td>pm25</td><td>µg/m³</td><td>9.2045</td><td>0</td></tr>\n",
" <tr><th>2191</th><td>2022-06-11</td><td>pm25</td><td>µg/m³</td><td>9.4583</td><td>0</td></tr>\n",
" <tr><th>2192</th><td>2022-06-12</td><td>pm25</td><td>µg/m³</td><td>6.1667</td><td>0</td></tr>\n",
" <tr><th>2193</th><td>2022-06-13</td><td>pm25</td><td>µg/m³</td><td>3.5435</td><td>0</td></tr>\n",
" <tr><th>2194</th><td>2022-06-14</td><td>pm25</td><td>µg/m³</td><td>7.3000</td><td>0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2195 rows × 5 columns</p>\n",
"</div>
" ], "text/plain": [ " day parameter unit average isUnhealthy\n", "0 2016-03-06 pm25 µg/m³ 2.0000 0\n", "1 2016-03-10 pm25 µg/m³ 10.5000 0\n", "2 2016-03-11 pm25 µg/m³ 5.4286 0\n", "3 2016-03-12 pm25 µg/m³ 4.1667 0\n", "4 2016-03-13 pm25 µg/m³ 6.0000 0\n", "... ... ... ... ... ...\n", "2190 2022-06-10 pm25 µg/m³ 9.2045 0\n", "2191 2022-06-11 pm25 µg/m³ 9.4583 0\n", "2192 2022-06-12 pm25 µg/m³ 6.1667 0\n", "2193 2022-06-13 pm25 µg/m³ 3.5435 0\n", "2194 2022-06-14 pm25 µg/m³ 7.3000 0\n", "\n", "[2195 rows x 5 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# GET OPENAQ AIR QUALITY DAILY AVERAGES DATA...\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "aq_df = AQbyWeather.getOpenAqDataFrame() # Gets nearby Location IDs THEN gets associated daily averages.\n", " \n", "if len(aq_df) > 0:\n", " # Output DataFrame properties...\n", " print('aq_df.shape =', aq_df.shape)\n", " display(aq_df)\n", " aq_df.to_csv(AQbyWeather.getFilenameOpenAQ(), index=False)" ] }, { "cell_type": "markdown", "id": "0a676054-ca63-47a2-832c-c136d1f0ecb4", "metadata": {}, "source": [ "## Merging the NOAA and OpenAQ data into one dataframe...\n", "We'll be using Autogluon's [_TabularPredictor_](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) on this merged dataframe to predict our target label column. For best results, we need to ensure we have enough data and that we have enough of each target label classification {0,1}.\n", "\n", "Review the shape of the merged_df table and the target label bar chart to determine if you have enough data and enough variation in the target label. Ideally, we should have thousands of total rows and at least hundreds of each class value.\n", "\n", "In the following cell, we can also visualize correlations between different columns in our merged_df table." 
] }, { "cell_type": "code", "execution_count": 9, "id": "6a873736-65b4-41a5-82e0-794ef33188b3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n", "merged_df.shape = (2184, 7)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>DEWP</th><th>WDSP</th><th>MAX</th><th>MIN</th><th>PRCP</th><th>MONTH</th><th>isUnhealthy</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>47.3</td><td>6.8</td><td>72.0</td><td>51.1</td><td>0.02</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>44.8</td><td>3.7</td><td>82.9</td><td>44.1</td><td>0.00</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>42.8</td><td>6.9</td><td>82.9</td><td>48.0</td><td>0.00</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>43.2</td><td>2.5</td><td>73.0</td><td>45.0</td><td>0.23</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>43.5</td><td>3.8</td><td>66.9</td><td>45.0</td><td>0.00</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>2179</th><td>50.5</td><td>7.3</td><td>91.0</td><td>64.0</td><td>0.00</td><td>6</td><td>0</td></tr>\n",
"    <tr><th>2180</th><td>45.1</td><td>6.4</td><td>90.0</td><td>63.0</td><td>0.00</td><td>6</td><td>0</td></tr>\n",
"    <tr><th>2181</th><td>58.7</td><td>7.5</td><td>90.0</td><td>63.0</td><td>0.00</td><td>6</td><td>0</td></tr>\n",
"    <tr><th>2182</th><td>47.8</td><td>5.4</td><td>91.0</td><td>66.0</td><td>0.00</td><td>6</td><td>0</td></tr>\n",
"    <tr><th>2183</th><td>43.5</td><td>4.3</td><td>91.9</td><td>75.9</td><td>0.00</td><td>6</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>2184 rows × 7 columns</p>\n",
"</div>
" ], "text/plain": [ " DEWP WDSP MAX MIN PRCP MONTH isUnhealthy\n", "0 47.3 6.8 72.0 51.1 0.02 3 0\n", "1 44.8 3.7 82.9 44.1 0.00 3 0\n", "2 42.8 6.9 82.9 48.0 0.00 3 0\n", "3 43.2 2.5 73.0 45.0 0.23 3 0\n", "4 43.5 3.8 66.9 45.0 0.00 3 0\n", "... ... ... ... ... ... ... ...\n", "2179 50.5 7.3 91.0 64.0 0.00 6 0\n", "2180 45.1 6.4 90.0 63.0 0.00 6 0\n", "2181 58.7 7.5 90.0 63.0 0.00 6 0\n", "2182 47.8 5.4 91.0 66.0 0.00 6 0\n", "2183 43.5 4.3 91.9 75.9 0.00 6 0\n", "\n", "[2184 rows x 7 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAEDCAYAAADZUdTgAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAQVElEQVR4nO3df4xlZX3H8fdHVvAHjcuPCYFddGlca1Cj4hSxtIaIVVDjEoMUS8uKpJs2YLXUyNKmob9MIG1KpRqaLaxASkCKVDaKUoooaQ3IoID8LBMEdzf8GORHtaiIfvvHPBuv4w6zO3e4A/O8X8nknvN9nnOe5yaTz5x97jl3U1VIkvrwgsWegCRpdAx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOLFvsCTyTvffeu1atWrXY05Ck55Wbbrrpkaoa217bczr0V61axcTExGJPQ5KeV5LcP1ubyzuS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHVkztBPsjHJw0luG6j9XZK7ktya5N+TLB9oOy3JZJK7k7xzoH5Eq00mWb/g70SSNKcdeTjrfOBTwIUDtauB06rq6SRnAqcBpyY5EDgWeA2wH/CfSV7Vjvk08NvAFuDGJJuq6o6FeRuLa9X6Ly72FJaU+85492JPQVqy5rzSr6rrgEdn1P6jqp5uu9cDK9v2GuCSqvpxVX0HmAQObj+TVXVvVT0FXNL6SpJGaCHW9D8EfKltrwA2D7RtabXZ6pKkERoq9JP8OfA0cNHCTAeSrEsykWRiampqoU4rSWKI0E/yQeA9wHH18/9dfSuw/0C3la02W/2XVNWGqhqvqvGxse1+SZwkaZ7mFfpJjgA+Dry3qp4caNoEHJtktyQHAKuBbwA3AquTHJBkV6Y/7N003NQlSTtrzrt3klwMHAbsnWQLcDrTd+vsBlydBOD6qvrDqro9yaXAHUwv+5xUVT9t5zkZuArYBdhYVbc/C+9HkvQM5gz9qvrAdsrnPUP/TwCf2E79SuDKnZqdJGlB+USuJHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHVkztBPsjHJw0luG6jtmeTqJPe01z1aPUnOTjKZ5NYkBw0cs7b1vyfJ2mfn7UiSnsmOXOmfDxwxo7YeuKaqVgPXtH2AI4HV7WcdcA5M/5EATgfeDBwMnL7tD4UkaXTmDP2qug54dEZ5DXBB274AOGqgfmFNux5YnmRf4J3A1VX1aFU9BlzNL/8hkSQ9y+a7pr9PVT3Qth8E9mnbK4DNA/22tNpsdUnSCA39QW5VFVALMBcAkq
xLMpFkYmpqaqFOK0li/qH/UFu2ob0+3Opbgf0H+q1stdnqv6SqNlTVeFWNj42NzXN6kqTtmW/obwK23YGzFrhioH58u4vnEOCJtgx0FfCOJHu0D3Df0WqSpBFaNleHJBcDhwF7J9nC9F04ZwCXJjkRuB84pnW/EngXMAk8CZwAUFWPJvkb4MbW76+rauaHw5KkZ9mcoV9VH5il6fDt9C3gpFnOsxHYuFOzkyQtqDlDX9Lz26r1X1zsKSwZ953x7sWewtD8GgZJ6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHhgr9JH+S5PYktyW5OMmLkhyQ5IYkk0k+m2TX1ne3tj/Z2lctyDuQJO2weYd+khXAHwPjVfVaYBfgWOBM4KyqeiXwGHBiO+RE4LFWP6v1kySN0LDLO8uAFydZBrwEeAB4G3BZa78AOKptr2n7tPbDk2TI8SVJO2HeoV9VW4G/B77LdNg/AdwEPF5VT7duW4AVbXsFsLkd+3Trv9d8x5ck7bxhlnf2YPrq/QBgP+ClwBHDTijJuiQTSSampqaGPZ0kacAwyztvB75TVVNV9RPgcuBQYHlb7gFYCWxt21uB/QFa+8uA7808aVVtqKrxqhofGxsbYnqSpJmGCf3vAockeUlbmz8cuAO4Fji69VkLXNG2N7V9WvtXqqqGGF+StJOGWdO/gekPZL8JfLudawNwKnBKkkmm1+zPa4ecB+zV6qcA64eYtyRpHpbN3WV2VXU6cPqM8r3Awdvp+yPg/cOMJ0kajk/kSlJHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHhgr9JMuTXJbkriR3JnlLkj2TXJ3knva6R+ubJGcnmUxya5KDFuYtSJJ21LBX+p8EvlxVrwZeD9wJrAeuqarVwDVtH+BIYHX7WQecM+TYkqSdNO/QT/Iy4K3AeQBV9VRVPQ6sAS5o3S4Ajmrba4ALa9r1wPIk+853fEnSzhvmSv8AYAr4TJJvJTk3yUuBfarqgdbnQWCftr0C2Dxw/JZWkySNyDChvww4CDinqt4I/B8/X8oBoKoKqJ05aZJ1SSaSTExNTQ0xPUnSTMOE/hZgS1Xd0PYvY/qPwEPblm3a68OtfSuw/8DxK1vtF1TVhqoar6rxsbGxIaYnSZpp3qFfVQ8Cm5P8WisdDtwBbALWttpa4Iq2vQk4vt3FcwjwxMAykCRpBJYNefyHgYuS7ArcC5zA9B+SS5OcCNwPHNP6Xgm8C5gEnmx9JUkjNFToV9XNwPh2mg7fTt8CThpmPEnScHwiV5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6MnToJ9klybeSfKHtH5DkhiSTST6bZNdW363tT7b2VcOOLUnaOQtxpf8R4M6B/TOBs6rqlcBjwImtfiLwWKuf1fpJkkZoqNBPshJ4N3Bu2w/wNuCy1uUC4Ki2vabt09oPb/0lSSMy7JX+PwIfB37W9vcCHq+qp9v+FmBF214BbAZo7U+0/pKkEZl36Cd5D/BwVd20gPMhybokE0kmpqamFvLUktS9Ya70DwXem+Q+4BKml3U+CSxPsqz1WQlsbdtbgf0BWvvLgO/NPGlVbaiq8aoaHxsbG2J6kqSZ5h36VXVaVa2sqlXAscBXquo44Frg6NZtLXBF297U9mntX6
mqmu/4kqSd92zcp38qcEqSSabX7M9r9fOAvVr9FGD9szC2JOkZLJu7y9yq6qvAV9v2vcDB2+nzI+D9CzGeJGl+fCJXkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjoy79BPsn+Sa5PckeT2JB9p9T2TXJ3knva6R6snydlJJpPcmuSghXoTkqQdM8yV/tPAn1bVgcAhwElJDgTWA9dU1WrgmrYPcCSwuv2sA84ZYmxJ0jzMO/Sr6oGq+mbb/j5wJ7ACWANc0LpdABzVttcAF9a064HlSfad7/iSpJ23IGv6SVYBbwRuAPapqgda04PAPm17BbB54LAtrSZJGpGhQz/J7sDngI9W1f8OtlVVAbWT51uXZCLJxNTU1LDTkyQNGCr0k7yQ6cC/qKoub+WHti3btNeHW30rsP/A4Stb7RdU1YaqGq+q8bGxsWGmJ0maYZi7dwKcB9xZVf8w0LQJWNu21wJXDNSPb3fxHAI8MbAMJEkagWVDHHso8PvAt5Pc3Gp/BpwBXJrkROB+4JjWdiXwLmASeBI4YYixJUnzMO/Qr6r/AjJL8+Hb6V/ASfMdT5I0PJ/IlaSOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1JGRh36SI5LcnWQyyfpRjy9JPRtp6CfZBfg0cCRwIPCBJAeOcg6S1LNRX+kfDExW1b1V9RRwCbBmxHOQpG4tG/F4K4DNA/tbgDcPdkiyDljXdn+Q5O4Rza0HewOPLPYk5pIzF3sGWiTP+d/P59Hv5itmaxh16M+pqjYAGxZ7HktRkomqGl/seUjb4+/naIx6eWcrsP/A/spWkySNwKhD/0ZgdZIDkuwKHAtsGvEcJKlbI13eqaqnk5wMXAXsAmysqttHOYfOuWym5zJ/P0cgVbXYc5AkjYhP5EpSRwx9SeqIoS9JHXnO3aevhZPk1Uw/8byilbYCm6rqzsWblaTF5JX+EpXkVKa/5iLAN9pPgIv9ojs9lyU5YbHnsJR5984SleR/gNdU1U9m1HcFbq+q1YszM+mZJfluVb18seexVLm8s3T9DNgPuH9Gfd/WJi2aJLfO1gTsM8q59MbQX7o+ClyT5B5+/iV3LwdeCZy8WJOSmn2AdwKPzagH+Prop9MPQ3+JqqovJ3kV019nPfhB7o1V9dPFm5kEwBeA3avq5pkNSb468tl0xDV9SeqId+9IUkcMfUnqiKGvJSHJM374l+QHM/Y/mORTcxxzWJIvLND8/jLJxwbG3m+g7b4key/EONJcDH0tCVX1G4s9h53wQaZvp5VGztDXkrDtSj7JvkmuS3JzktuS/NYOHHt+krOTfD3JvUmOHmjePcllSe5KclGStGPelORrSW5KclWSfVv9D5LcmOSWJJ9L8pIZYx0NjAMXtTm+uDV9OMk3k3w7yauTvCDJPUnG2nEvSDK5bV+aL0NfS83vAldV1RuA1wM37+Bx+wK/CbwHOGOg/kamn3k4EPhV4NAkLwT+CTi6qt4EbAQ+0fpfXlW/XlWvB+4EThwcpKouAyaA46rqDVX1w9b0SFUdBJwDfKyqfgb8K3Bca387cEtVTe3g+5G2y/v0tdTcCGxswfz57d0HPmDwfuXPt6C9I8ngE6HfqKotAEluBlYBjwOvBa5uF/67AA+0/q9N8rfAcmB3pv+XuB1xeXu9CXhf294IXAH8I/Ah4DM7eC5pVl7pa0mpquuAtzL9INr5SY5vTT9s3zu0zZ7AIwP7Px7Yziz1nzJ9oRSmv7/oDe3ndVX1jtbnfO
Dkqnod8FfAi3Zw6tvG2TYGVbUZeCjJ25h+yO5LO3guaVaGvpaUJK8AHqqqfwHOBQ5qTV8Dfq/1eTFwDHDtPIe5GxhL8pZ2vhcmeU1r+xXggfYvjeNmOf77rd+OOJfpZZ5/80lqLQRDX0vNYcAtSb4F/A7wyVb/CPC+tkRzPdMhet18Bqiqp4CjgTOT3ML05wbb7h76C+AG4L+Bu2Y5xfnAP8/4IHc2m5heJnJpRwvCr2GQnsOSjANnVdWcdyFJO8IPcqXnqPaf3fwRsy8TSTvNK31J6ohr+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakj/w9ogwL5p/SQzwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Merge the NOAA GSOD weather data with our OpenAQ data by DATE...\n", "# Perform another column drop to remove columns we don't want as features/inputs.\n", "# This column removal will NOT be necessary once we can use Autogluon ignore_columns param (TBD).\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "merged_df = AQbyWeather.getMergedDataFrame(noaagsod_df, aq_df)\n", "\n", "if(len(merged_df > 0)):\n", " # Output DataFrame properties...\n", " print('merged_df.shape =', merged_df.shape)\n", " display(merged_df)\n", " merged_df.groupby([AQbyWeather.mlTargetLabel]).size().plot(kind=\"bar\")\n", " merged_df.to_csv(AQbyWeather.getFilenameOther(\"dataMERGED\"), index=False)\n", "\n", "# TODO: Explore other ways to join the data and/or handle null values..." ] }, { "cell_type": "code", "execution_count": 10, "id": "c5dee84f-da68-4b10-a075-a5cab419cd16", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Visualize correlations in our merged dataframe...\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "correlations = merged_df.corr()\n", "fig = plt.figure(figsize=(10, 8))\n", "ax = fig.add_subplot(111)\n", "cax = ax.matshow(correlations, vmin=-1, vmax=1)\n", "fig.colorbar(cax)\n", "ticks = np.arange(0, len(merged_df.columns), 1)\n", "ax.set_xticks(ticks)\n", "ax.set_yticks(ticks)\n", "ax.set_xticklabels(merged_df.columns)\n", "ax.set_yticklabels(merged_df.columns)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "219a6c62-70eb-41ad-ae88-5ccd27f28aab", "metadata": {}, "source": [ "## Begin ML on Merged Dataframe...\n", "In order to build, validate, and test our ML model, we'll begin by creating three dataframes:\n", "1. **train_df:** This is split from the full merged_df data as the larger chunk and used for training.\n", "2. **validate_df:** This is split from the full merged_df data as the smaller chunk and used for validation.\n", "3. **test_df:** This is the validation dataframe, but with the true target label values removed to enable model testing." 
] }, { "cell_type": "code", "execution_count": 11, "id": "66bce085-f0ac-418f-b3e9-5e1da7cefbd9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n", "Number of training samples: 1463\n", "Number of validation samples: 721\n" ] } ], "source": [ "# Additional import statements for autogluon+sklearn and split out train_df + validate_df data...\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "from autogluon.tabular import TabularPredictor\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", "train_df, validate_df = train_test_split(merged_df, test_size=0.33, random_state=1)\n", "print('Number of training samples:', len(train_df))\n", "print('Number of validation samples:', len(validate_df))" ] }, { "cell_type": "code", "execution_count": 12, "id": "cff0133a-7dfe-4187-8dda-026d0695ec92", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>DEWP</th><th>WDSP</th><th>MAX</th><th>MIN</th><th>PRCP</th><th>MONTH</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>241</th><td>41.9</td><td>4.7</td><td>89.1</td><td>55.0</td><td>0.00</td><td>11</td></tr>\n",
"    <tr><th>994</th><td>40.8</td><td>2.5</td><td>62.1</td><td>37.9</td><td>0.00</td><td>1</td></tr>\n",
"    <tr><th>1820</th><td>42.2</td><td>5.5</td><td>90.0</td><td>63.0</td><td>0.00</td><td>5</td></tr>\n",
"    <tr><th>1442</th><td>50.0</td><td>4.3</td><td>73.9</td><td>52.0</td><td>0.06</td><td>4</td></tr>\n",
"    <tr><th>596</th><td>39.7</td><td>2.7</td><td>66.0</td><td>39.9</td><td>0.00</td><td>12</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>849</th><td>53.0</td><td>5.5</td><td>93.0</td><td>64.0</td><td>0.00</td><td>8</td></tr>\n",
"    <tr><th>208</th><td>37.2</td><td>5.6</td><td>91.9</td><td>57.9</td><td>0.00</td><td>10</td></tr>\n",
"    <tr><th>288</th><td>37.1</td><td>2.8</td><td>66.0</td><td>36.0</td><td>0.00</td><td>12</td></tr>\n",
"    <tr><th>1226</th><td>39.3</td><td>6.8</td><td>99.0</td><td>62.1</td><td>0.00</td><td>9</td></tr>\n",
"    <tr><th>2129</th><td>42.3</td><td>5.9</td><td>73.9</td><td>42.1</td><td>0.00</td><td>4</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>721 rows × 6 columns</p>\n",
"</div>
" ], "text/plain": [ " DEWP WDSP MAX MIN PRCP MONTH\n", "241 41.9 4.7 89.1 55.0 0.00 11\n", "994 40.8 2.5 62.1 37.9 0.00 1\n", "1820 42.2 5.5 90.0 63.0 0.00 5\n", "1442 50.0 4.3 73.9 52.0 0.06 4\n", "596 39.7 2.7 66.0 39.9 0.00 12\n", "... ... ... ... ... ... ...\n", "849 53.0 5.5 93.0 64.0 0.00 8\n", "208 37.2 5.6 91.9 57.9 0.00 10\n", "288 37.1 2.8 66.0 36.0 0.00 12\n", "1226 39.3 6.8 99.0 62.1 0.00 9\n", "2129 42.3 5.9 73.9 42.1 0.00 4\n", "\n", "[721 rows x 6 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create the test_df data and remove the target label column...\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "test_df=validate_df.drop([AQbyWeather.mlTargetLabel], axis=1)\n", "display(test_df)" ] }, { "cell_type": "markdown", "id": "be7097cd-6b74-4824-8eec-06d526ee7f67", "metadata": {}, "source": [ "## Run AutoGluon TabularPredictor.fit(...)\n", "- Instantiate TabularPredictor with explicit values for label, eval_metric, and path.\n", "- Call TabularPredictor.fit(...) using our train_df tabular data as the main input." ] }, { "cell_type": "code", "execution_count": 13, "id": "499df50a-239b-4173-a91f-f86941d69b15", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³'" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Warning: path already exists! This predictor may overwrite an existing predictor! 
path=\"AutogluonModels/aq_bakersfield_pm25_2016-2022/\"\n", "Presets specified: ['best_quality']\n", "Beginning AutoGluon training ...\n", "AutoGluon will save models to \"AutogluonModels/aq_bakersfield_pm25_2016-2022/\"\n", "AutoGluon Version: 0.4.2\n", "Python Version: 3.9.13\n", "Operating System: Linux\n", "Train Data Rows: 1463\n", "Train Data Columns: 6\n", "Label Column: isUnhealthy\n", "Preprocessing data ...\n", "AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).\n", "\t2 unique label values: [0, 1]\n", "\tIf 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])\n", "Selected class <--> label mapping: class 1 = 1, class 0 = 0\n", "Using Feature Generators to preprocess the data ...\n", "Fitting AutoMLPipelineFeatureGenerator...\n", "\tAvailable Memory: 15321.24 MB\n", "\tTrain Data (Original) Memory Usage: 0.07 MB (0.0% of available memory)\n", "\tInferring data type of each feature based on column values. 
Set feature_metadata_in to manually specify special dtypes of the features.\n", "\tStage 1 Generators:\n", "\t\tFitting AsTypeFeatureGenerator...\n", "\tStage 2 Generators:\n", "\t\tFitting FillNaFeatureGenerator...\n", "\tStage 3 Generators:\n", "\t\tFitting IdentityFeatureGenerator...\n", "\tStage 4 Generators:\n", "\t\tFitting DropUniqueFeatureGenerator...\n", "\tTypes of features in original data (raw dtype, special dtypes):\n", "\t\t('float', []) : 5 | ['DEWP', 'WDSP', 'MAX', 'MIN', 'PRCP']\n", "\t\t('int', []) : 1 | ['MONTH']\n", "\tTypes of features in processed data (raw dtype, special dtypes):\n", "\t\t('float', []) : 5 | ['DEWP', 'WDSP', 'MAX', 'MIN', 'PRCP']\n", "\t\t('int', []) : 1 | ['MONTH']\n", "\t0.0s = Fit runtime\n", "\t6 features in original data used to generate 6 features in processed data.\n", "\tTrain Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)\n", "Data preprocessing and feature engineering runtime = 0.04s ...\n", "AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n", "\tTo change this, specify the eval_metric parameter of Predictor()\n", "AutoGluon will fit 2 stack levels (L1 to L2) ...\n", "Fitting 13 L1 models ...\n", "Fitting model: KNeighborsUnif_BAG_L1 ...\n", "\t0.7416\t = Validation score (accuracy)\n", "\t0.0s\t = Training runtime\n", "\t0.01s\t = Validation runtime\n", "Fitting model: KNeighborsDist_BAG_L1 ...\n", "\t0.7573\t = Validation score (accuracy)\n", "\t0.0s\t = Training runtime\n", "\t0.01s\t = Validation runtime\n", "Fitting model: LightGBMXT_BAG_L1 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "2022-06-14 22:38:13,323\tWARNING services.py:1856 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 4294967296 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. 
If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=4.77gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.\n", "\t0.8373\t = Validation score (accuracy)\n", "\t5.38s\t = Training runtime\n", "\t0.04s\t = Validation runtime\n", "Fitting model: LightGBM_BAG_L1 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8182\t = Validation score (accuracy)\n", "\t6.25s\t = Training runtime\n", "\t0.03s\t = Validation runtime\n", "Fitting model: RandomForestGini_BAG_L1 ...\n", "\t0.812\t = Validation score (accuracy)\n", "\t0.77s\t = Training runtime\n", "\t0.16s\t = Validation runtime\n", "Fitting model: RandomForestEntr_BAG_L1 ...\n", "\t0.8059\t = Validation score (accuracy)\n", "\t0.67s\t = Training runtime\n", "\t0.15s\t = Validation runtime\n", "Fitting model: CatBoost_BAG_L1 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8278\t = Validation score (accuracy)\n", "\t6.0s\t = Training runtime\n", "\t0.05s\t = Validation runtime\n", "Fitting model: ExtraTreesGini_BAG_L1 ...\n", "\t0.8141\t = Validation score (accuracy)\n", "\t0.77s\t = Training runtime\n", "\t0.16s\t = Validation runtime\n", "Fitting model: ExtraTreesEntr_BAG_L1 ...\n", "\t0.8072\t = Validation score (accuracy)\n", "\t0.69s\t = Training runtime\n", "\t0.16s\t = Validation runtime\n", "Fitting model: NeuralNetFastAI_BAG_L1 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8018\t = Validation score (accuracy)\n", "\t12.3s\t = Training runtime\n", "\t0.12s\t = Validation runtime\n", "Fitting model: XGBoost_BAG_L1 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8148\t = Validation score (accuracy)\n", "\t5.06s\t = Training runtime\n", "\t0.07s\t = Validation 
runtime\n", "Fitting model: NeuralNetTorch_BAG_L1 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8291\t = Validation score (accuracy)\n", "\t17.65s\t = Training runtime\n", "\t0.11s\t = Validation runtime\n", "Fitting model: LightGBMLarge_BAG_L1 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8195\t = Validation score (accuracy)\n", "\t8.32s\t = Training runtime\n", "\t0.05s\t = Validation runtime\n", "Fitting model: WeightedEnsemble_L2 ...\n", "\t0.8373\t = Validation score (accuracy)\n", "\t0.98s\t = Training runtime\n", "\t0.0s\t = Validation runtime\n", "Fitting 11 L2 models ...\n", "Fitting model: LightGBMXT_BAG_L2 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8401\t = Validation score (accuracy)\n", "\t6.48s\t = Training runtime\n", "\t0.04s\t = Validation runtime\n", "Fitting model: LightGBM_BAG_L2 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8483\t = Validation score (accuracy)\n", "\t7.34s\t = Training runtime\n", "\t0.04s\t = Validation runtime\n", "Fitting model: RandomForestGini_BAG_L2 ...\n", "\t0.8346\t = Validation score (accuracy)\n", "\t0.83s\t = Training runtime\n", "\t0.15s\t = Validation runtime\n", "Fitting model: RandomForestEntr_BAG_L2 ...\n", "\t0.8325\t = Validation score (accuracy)\n", "\t0.77s\t = Training runtime\n", "\t0.15s\t = Validation runtime\n", "Fitting model: CatBoost_BAG_L2 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8442\t = Validation score (accuracy)\n", "\t11.68s\t = Training runtime\n", "\t0.04s\t = Validation runtime\n", "Fitting model: ExtraTreesGini_BAG_L2 ...\n", "\t0.8189\t = Validation score (accuracy)\n", "\t0.73s\t = Training runtime\n", "\t0.16s\t = Validation runtime\n", "Fitting model: ExtraTreesEntr_BAG_L2 ...\n", 
"\t0.8257\t = Validation score (accuracy)\n", "\t0.68s\t = Training runtime\n", "\t0.16s\t = Validation runtime\n", "Fitting model: NeuralNetFastAI_BAG_L2 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8325\t = Validation score (accuracy)\n", "\t12.15s\t = Training runtime\n", "\t0.23s\t = Validation runtime\n", "Fitting model: XGBoost_BAG_L2 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8476\t = Validation score (accuracy)\n", "\t6.37s\t = Training runtime\n", "\t0.07s\t = Validation runtime\n", "Fitting model: NeuralNetTorch_BAG_L2 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8284\t = Validation score (accuracy)\n", "\t11.78s\t = Training runtime\n", "\t0.13s\t = Validation runtime\n", "Fitting model: LightGBMLarge_BAG_L2 ...\n", "\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "\t0.8401\t = Validation score (accuracy)\n", "\t13.94s\t = Training runtime\n", "\t0.11s\t = Validation runtime\n", "Fitting model: WeightedEnsemble_L3 ...\n", "\t0.8483\t = Validation score (accuracy)\n", "\t0.93s\t = Training runtime\n", "\t0.0s\t = Validation runtime\n", "AutoGluon training complete, total runtime = 177.76s ... Best model: \"WeightedEnsemble_L3\"\n", "TabularPredictor saved. 
To load, use: predictor = TabularPredictor.load(\"AutogluonModels/aq_bakersfield_pm25_2016-2022/\")\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use AutoGluon TabularPredictor to fit a model for our training data...\n", "display(AQbyWeather.selectedScenario.getSummary()) #Using display for consistent/sequential output order.\n", "predictor = TabularPredictor(label=AQbyWeather.mlTargetLabel, \n", " eval_metric=AQbyWeather.mlEvalMetric, \n", " path=AQbyWeather.selectedScenario.getModelPath())\n", "predictor.fit(train_data=train_df, time_limit=AQbyWeather.mlTimeLimitSecs, verbosity=2, presets='best_quality')" ] }, { "cell_type": "markdown", "id": "021921b8-6133-46b3-aed5-8e733e627aab", "metadata": {}, "source": [ "## Evaluating Our AutoGluon Model...\n", "- Use TabularPredictor.feature_importance(...) with our validate_df to get feature importance data.\\\n", " (this data is displayed further below)\n", "- Use TabularPredictor.leaderboard(...) with our validate_df to get individual model performance data.\\\n", " (this data is displayed further below)\n", "- Call TabularPredictor.evaluate(...) with our validate_df to output overall model quality metrics." 
] }, { "cell_type": "code", "execution_count": 14, "id": "ace98288-c321-4ea9-ac20-b163d9753af1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³'" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Computing feature importance via permutation shuffling for 6 features using 721 rows with 5 shuffle sets...\n", "\t63.95s\t= Expected runtime (12.79s per shuffle set)\n", "\t15.71s\t= Actual runtime (Completed 5 of 5 shuffle sets)\n", "Evaluation: accuracy on test data: 0.7947295423023578\n", "Evaluations on test data:\n", "{\n", " \"accuracy\": 0.7947295423023578,\n", " \"balanced_accuracy\": 0.7923707440100882,\n", " \"mcc\": 0.5820230438897135,\n", " \"roc_auc\": 0.874984237074401,\n", " \"f1\": 0.7620578778135049,\n", " \"precision\": 0.7476340694006309,\n", " \"recall\": 0.7770491803278688\n", "}\n" ] } ], "source": [ "# Get dataframes for feature importance + model leaderboard AND get+display model evaluation...\n", "display(AQbyWeather.selectedScenario.getSummary()) #Using display for consistent/sequential output order.\n", "featureimp_df = predictor.feature_importance(validate_df)\n", "leaderboard_df = predictor.leaderboard(validate_df, silent=True)\n", "modelEvaluation = predictor.evaluate(validate_df, auxiliary_metrics=True)" ] }, { "cell_type": "code", "execution_count": 15, "id": "6d9a3a88-aab0-49f8-a6d4-ba20474bfd0b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>model</th><th>score_test</th><th>score_val</th><th>pred_time_test</th><th>pred_time_val</th><th>fit_time</th><th>pred_time_test_marginal</th><th>pred_time_val_marginal</th><th>fit_time_marginal</th><th>stack_level</th><th>can_infer</th><th>fit_order</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>NeuralNetTorch_BAG_L2</td><td>0.812760</td><td>0.828435</td><td>1.367442</td><td>1.244300</td><td>75.653306</td><td>0.148462</td><td>0.127412</td><td>11.780471</td><td>2</td><td>True</td><td>24</td></tr>\n",
"    <tr><th>1</th><td>RandomForestEntr_BAG_L2</td><td>0.804438</td><td>0.832536</td><td>1.344696</td><td>1.265710</td><td>64.646867</td><td>0.125717</td><td>0.148821</td><td>0.774032</td><td>2</td><td>True</td><td>18</td></tr>\n",
"    <tr><th>2</th><td>ExtraTreesGini_BAG_L2</td><td>0.803051</td><td>0.818865</td><td>1.352479</td><td>1.281073</td><td>64.600034</td><td>0.133500</td><td>0.164185</td><td>0.727199</td><td>2</td><td>True</td><td>20</td></tr>\n",
"    <tr><th>3</th><td>LightGBMXT_BAG_L1</td><td>0.801664</td><td>0.837321</td><td>0.072428</td><td>0.043975</td><td>5.381871</td><td>0.072428</td><td>0.043975</td><td>5.381871</td><td>1</td><td>True</td><td>3</td></tr>\n",
"    <tr><th>4</th><td>WeightedEnsemble_L2</td><td>0.801664</td><td>0.837321</td><td>0.074507</td><td>0.046960</td><td>6.365828</td><td>0.002078</td><td>0.002985</td><td>0.983958</td><td>2</td><td>True</td><td>14</td></tr>\n",
"    <tr><th>5</th><td>CatBoost_BAG_L2</td><td>0.801664</td><td>0.844156</td><td>1.232805</td><td>1.160382</td><td>75.556857</td><td>0.013825</td><td>0.043494</td><td>11.684022</td><td>2</td><td>True</td><td>19</td></tr>\n",
"    <tr><th>6</th><td>XGBoost_BAG_L2</td><td>0.801664</td><td>0.847573</td><td>1.281233</td><td>1.191477</td><td>70.239682</td><td>0.062253</td><td>0.074588</td><td>6.366847</td><td>2</td><td>True</td><td>23</td></tr>\n",
"    <tr><th>7</th><td>LightGBMLarge_BAG_L2</td><td>0.801664</td><td>0.840055</td><td>1.448996</td><td>1.226459</td><td>77.812670</td><td>0.230016</td><td>0.109571</td><td>13.939835</td><td>2</td><td>True</td><td>25</td></tr>\n",
"    <tr><th>8</th><td>NeuralNetFastAI_BAG_L2</td><td>0.801664</td><td>0.832536</td><td>1.493159</td><td>1.346558</td><td>76.021008</td><td>0.274179</td><td>0.229669</td><td>12.148173</td><td>2</td><td>True</td><td>22</td></tr>\n",
"    <tr><th>9</th><td>LightGBMXT_BAG_L2</td><td>0.800277</td><td>0.840055</td><td>1.286816</td><td>1.161328</td><td>70.349639</td><td>0.067837</td><td>0.044440</td><td>6.476804</td><td>2</td><td>True</td><td>15</td></tr>\n",
"    <tr><th>10</th><td>RandomForestGini_BAG_L2</td><td>0.800277</td><td>0.834586</td><td>1.347594</td><td>1.263777</td><td>64.703335</td><td>0.128615</td><td>0.146888</td><td>0.830500</td><td>2</td><td>True</td><td>17</td></tr>\n",
"    <tr><th>11</th><td>RandomForestGini_BAG_L1</td><td>0.798890</td><td>0.812030</td><td>0.132431</td><td>0.155190</td><td>0.770934</td><td>0.132431</td><td>0.155190</td><td>0.770934</td><td>1</td><td>True</td><td>5</td></tr>\n",
"    <tr><th>12</th><td>LightGBMLarge_BAG_L1</td><td>0.797503</td><td>0.819549</td><td>0.101114</td><td>0.046891</td><td>8.318884</td><td>0.101114</td><td>0.046891</td><td>8.318884</td><td>1</td><td>True</td><td>13</td></tr>\n",
"    <tr><th>13</th><td>ExtraTreesEntr_BAG_L2</td><td>0.797503</td><td>0.825701</td><td>1.349804</td><td>1.281102</td><td>64.551041</td><td>0.130825</td><td>0.164214</td><td>0.678206</td><td>2</td><td>True</td><td>21</td></tr>\n",
"    <tr><th>14</th><td>ExtraTreesGini_BAG_L1</td><td>0.796117</td><td>0.814081</td><td>0.139428</td><td>0.155610</td><td>0.767929</td><td>0.139428</td><td>0.155610</td><td>0.767929</td><td>1</td><td>True</td><td>8</td></tr>\n",
"    <tr><th>15</th><td>RandomForestEntr_BAG_L1</td><td>0.794730</td><td>0.805878</td><td>0.127543</td><td>0.152630</td><td>0.670142</td><td>0.127543</td><td>0.152630</td><td>0.670142</td><td>1</td><td>True</td><td>6</td></tr>\n",
"    <tr><th>16</th><td>ExtraTreesEntr_BAG_L1</td><td>0.794730</td><td>0.807245</td><td>0.140027</td><td>0.158820</td><td>0.690295</td><td>0.140027</td><td>0.158820</td><td>0.690295</td><td>1</td><td>True</td><td>9</td></tr>\n",
"    <tr><th>17</th><td>LightGBM_BAG_L2</td><td>0.794730</td><td>0.848257</td><td>1.261405</td><td>1.155427</td><td>71.213395</td><td>0.042426</td><td>0.038539</td><td>7.340560</td><td>2</td><td>True</td><td>16</td></tr>\n",
"    <tr><th>18</th><td>WeightedEnsemble_L3</td><td>0.794730</td><td>0.848257</td><td>1.263417</td><td>1.158216</td><td>72.141569</td><td>0.002012</td><td>0.002789</td><td>0.928173</td><td>3</td><td>True</td><td>26</td></tr>\n",
"    <tr><th>19</th><td>CatBoost_BAG_L1</td><td>0.793343</td><td>0.827751</td><td>0.009780</td><td>0.050919</td><td>5.999756</td><td>0.009780</td><td>0.050919</td><td>5.999756</td><td>1</td><td>True</td><td>7</td></tr>\n",
"    <tr><th>20</th><td>LightGBM_BAG_L1</td><td>0.793343</td><td>0.818182</td><td>0.056508</td><td>0.032668</td><td>6.247002</td><td>0.056508</td><td>0.032668</td><td>6.247002</td><td>1</td><td>True</td><td>4</td></tr>\n",
"    <tr><th>21</th><td>NeuralNetTorch_BAG_L1</td><td>0.790569</td><td>0.829118</td><td>0.164121</td><td>0.106688</td><td>17.652812</td><td>0.164121</td><td>0.106688</td><td>17.652812</td><td>1</td><td>True</td><td>12</td></tr>\n",
"    <tr><th>22</th><td>NeuralNetFastAI_BAG_L1</td><td>0.789182</td><td>0.801777</td><td>0.198543</td><td>0.116403</td><td>12.301760</td><td>0.198543</td><td>0.116403</td><td>12.301760</td><td>1</td><td>True</td><td>10</td></tr>\n",
"    <tr><th>23</th><td>XGBoost_BAG_L1</td><td>0.779473</td><td>0.814764</td><td>0.059501</td><td>0.073944</td><td>5.063055</td><td>0.059501</td><td>0.073944</td><td>5.063055</td><td>1</td><td>True</td><td>11</td></tr>\n",
"    <tr><th>24</th><td>KNeighborsUnif_BAG_L1</td><td>0.748960</td><td>0.741627</td><td>0.008464</td><td>0.012038</td><td>0.004562</td><td>0.008464</td><td>0.012038</td><td>0.004562</td><td>1</td><td>True</td><td>1</td></tr>\n",
"    <tr><th>25</th><td>KNeighborsDist_BAG_L1</td><td>0.747573</td><td>0.757348</td><td>0.009090</td><td>0.011113</td><td>0.003834</td><td>0.009090</td><td>0.011113</td><td>0.003834</td><td>1</td><td>True</td><td>2</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " model score_test score_val pred_time_test \\\n", "0 NeuralNetTorch_BAG_L2 0.812760 0.828435 1.367442 \n", "1 RandomForestEntr_BAG_L2 0.804438 0.832536 1.344696 \n", "2 ExtraTreesGini_BAG_L2 0.803051 0.818865 1.352479 \n", "3 LightGBMXT_BAG_L1 0.801664 0.837321 0.072428 \n", "4 WeightedEnsemble_L2 0.801664 0.837321 0.074507 \n", "5 CatBoost_BAG_L2 0.801664 0.844156 1.232805 \n", "6 XGBoost_BAG_L2 0.801664 0.847573 1.281233 \n", "7 LightGBMLarge_BAG_L2 0.801664 0.840055 1.448996 \n", "8 NeuralNetFastAI_BAG_L2 0.801664 0.832536 1.493159 \n", "9 LightGBMXT_BAG_L2 0.800277 0.840055 1.286816 \n", "10 RandomForestGini_BAG_L2 0.800277 0.834586 1.347594 \n", "11 RandomForestGini_BAG_L1 0.798890 0.812030 0.132431 \n", "12 LightGBMLarge_BAG_L1 0.797503 0.819549 0.101114 \n", "13 ExtraTreesEntr_BAG_L2 0.797503 0.825701 1.349804 \n", "14 ExtraTreesGini_BAG_L1 0.796117 0.814081 0.139428 \n", "15 RandomForestEntr_BAG_L1 0.794730 0.805878 0.127543 \n", "16 ExtraTreesEntr_BAG_L1 0.794730 0.807245 0.140027 \n", "17 LightGBM_BAG_L2 0.794730 0.848257 1.261405 \n", "18 WeightedEnsemble_L3 0.794730 0.848257 1.263417 \n", "19 CatBoost_BAG_L1 0.793343 0.827751 0.009780 \n", "20 LightGBM_BAG_L1 0.793343 0.818182 0.056508 \n", "21 NeuralNetTorch_BAG_L1 0.790569 0.829118 0.164121 \n", "22 NeuralNetFastAI_BAG_L1 0.789182 0.801777 0.198543 \n", "23 XGBoost_BAG_L1 0.779473 0.814764 0.059501 \n", "24 KNeighborsUnif_BAG_L1 0.748960 0.741627 0.008464 \n", "25 KNeighborsDist_BAG_L1 0.747573 0.757348 0.009090 \n", "\n", " pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal \\\n", "0 1.244300 75.653306 0.148462 0.127412 \n", "1 1.265710 64.646867 0.125717 0.148821 \n", "2 1.281073 64.600034 0.133500 0.164185 \n", "3 0.043975 5.381871 0.072428 0.043975 \n", "4 0.046960 6.365828 0.002078 0.002985 \n", "5 1.160382 75.556857 0.013825 0.043494 \n", "6 1.191477 70.239682 0.062253 0.074588 \n", "7 1.226459 77.812670 0.230016 0.109571 \n", "8 1.346558 76.021008 
0.274179 0.229669 \n", "9 1.161328 70.349639 0.067837 0.044440 \n", "10 1.263777 64.703335 0.128615 0.146888 \n", "11 0.155190 0.770934 0.132431 0.155190 \n", "12 0.046891 8.318884 0.101114 0.046891 \n", "13 1.281102 64.551041 0.130825 0.164214 \n", "14 0.155610 0.767929 0.139428 0.155610 \n", "15 0.152630 0.670142 0.127543 0.152630 \n", "16 0.158820 0.690295 0.140027 0.158820 \n", "17 1.155427 71.213395 0.042426 0.038539 \n", "18 1.158216 72.141569 0.002012 0.002789 \n", "19 0.050919 5.999756 0.009780 0.050919 \n", "20 0.032668 6.247002 0.056508 0.032668 \n", "21 0.106688 17.652812 0.164121 0.106688 \n", "22 0.116403 12.301760 0.198543 0.116403 \n", "23 0.073944 5.063055 0.059501 0.073944 \n", "24 0.012038 0.004562 0.008464 0.012038 \n", "25 0.011113 0.003834 0.009090 0.011113 \n", "\n", " fit_time_marginal stack_level can_infer fit_order \n", "0 11.780471 2 True 24 \n", "1 0.774032 2 True 18 \n", "2 0.727199 2 True 20 \n", "3 5.381871 1 True 3 \n", "4 0.983958 2 True 14 \n", "5 11.684022 2 True 19 \n", "6 6.366847 2 True 23 \n", "7 13.939835 2 True 25 \n", "8 12.148173 2 True 22 \n", "9 6.476804 2 True 15 \n", "10 0.830500 2 True 17 \n", "11 0.770934 1 True 5 \n", "12 8.318884 1 True 13 \n", "13 0.678206 2 True 21 \n", "14 0.767929 1 True 8 \n", "15 0.670142 1 True 6 \n", "16 0.690295 1 True 9 \n", "17 7.340560 2 True 16 \n", "18 0.928173 3 True 26 \n", "19 5.999756 1 True 7 \n", "20 6.247002 1 True 4 \n", "21 17.652812 1 True 12 \n", "22 12.301760 1 True 10 \n", "23 5.063055 1 True 11 \n", "24 0.004562 1 True 1 \n", "25 0.003834 1 True 2 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# View Autogluon Individual Model Leaderboard...\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "display(leaderboard_df)" ] }, { "cell_type": "code", "execution_count": 16, "id": "80cb9042-12b7-4659-beed-20c71bb3f13b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate 
Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n" ] }, { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " importance stddev p_value n p99_high p99_low\n", "WDSP 0.110125 0.022908 0.000212 5 0.157293 0.062957\n", "MONTH 0.102913 0.012382 0.000025 5 0.128408 0.077418\n", "MAX 0.051318 0.003923 0.000004 5 0.059395 0.043240\n", "PRCP 0.033564 0.005668 0.000094 5 0.045235 0.021894\n", "MIN 0.014147 0.007685 0.007328 5 0.029970 -0.001676\n", "DEWP 0.009986 0.008459 0.028801 5 0.027404 -0.007432" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# View and Plot Feature Importance... (this various from Scenario to Scenario)\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "display(featureimp_df)\n", "featureimp_df.drop(columns=[\"n\"]).plot(kind=\"bar\", figsize=(12, 4), xlabel=\"feature\")" ] }, { "cell_type": "markdown", "id": "7456a6b0-2236-47db-b6a8-dc11f7b3d35c", "metadata": {}, "source": [ "## Using + Testing Our Model...\n", "Now that we've built our Autogluon ML Model, we can use our test_df dataframe to make actual target label predictions and then compare those predictions against the known true label values to determine how the model is performing across each binary classification { 0=OKAY, 1=Unhealthy }. Further below, we'll plot a _Confusion Matrix_ and the **Summary** explains how to interpret results and covers **Key Learnings**." ] }, { "cell_type": "code", "execution_count": 17, "id": "af1ec4ac-5e75-406b-8fe1-980ecf8d6986", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n" ] }, { "data": { "text/plain": [ "241 1\n", "994 1\n", "1820 0\n", "1442 0\n", "596 1\n", " ..\n", "849 0\n", "208 1\n", "288 1\n", "1226 0\n", "2129 0\n", "Name: isUnhealthy, Length: 721, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Load + Use Our Model (this line is unnecessary, but shows how to load a built model)...\n", "predictor = TabularPredictor.load(AQbyWeather.selectedScenario.getModelPath())\n", "\n", "# Make Predictions, which are saved to an array: y_pred\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "y_pred = predictor.predict(test_df)\n", "display(y_pred)" ] }, { "cell_type": "code", "execution_count": 18, "id": "efa500e0-4d0b-47be-95fd-6b45a5dea0f4", "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n" ] }, { "data": { "text/plain": [ "241 1\n", "994 1\n", "1820 0\n", "1442 0\n", "596 1\n", " ..\n", "849 1\n", "208 1\n", "288 1\n", "1226 0\n", "2129 0\n", "Name: isUnhealthy, Length: 721, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get true label values as an array: y_true\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "y_true = validate_df[AQbyWeather.mlTargetLabel]\n", "display(y_true)" ] }, { "cell_type": "code", "execution_count": 19, "id": "43d7e462-2462-4fb1-9b6c-c7cd6cd56554", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n", "True Negatives (TN): 336 of 416 => 80.77%\n", "True Positives (TP): 237 of 305 => 77.7%\n", "False Negatives (FN): 68 of 305 => 22.3%\n", "False Positives (FP): 80 of 416 => 19.23%\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Get Confusion Matrix (CM), Get Additional CM Data, View CM...\n", "# Learn More: https://towardsdatascience.com/confusion-matrix-what-is-it-e859e1bbecdc\n", "# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "cm = confusion_matrix(y_true, y_pred)\n", "\n", "# Print Confusion Matrix data...\n", "cmData = AQbyWeather.getConfusionMatrixData(cm)\n", "print(cmData.TN_Output)\n", "print(cmData.TP_Output)\n", "print(cmData.FN_Output)\n", "print(cmData.FP_Output)\n", "\n", "# Plot Confusion Matrix...\n", "cmd = ConfusionMatrixDisplay(confusion_matrix=cm)\n", "fig, ax = plt.subplots(figsize=(8, 6))\n", "cmd.plot(ax=ax)" ] }, { "cell_type": "markdown", "id": "152d9203-6987-4fbb-a33e-acef22cfb993", "metadata": {}, "source": [ "## SUMMARY...\n", "This demo showed how to use a SageMaker Studio Lab Jupyter Notebook to connect to open-source Amazon Sustainability Data Initiative (ASDI) datasets both via Amazon S3 (for NOAA weather data) and via HTTPS API (for OpenAQ air quality averages). The NOAA weather data and OpenAQ measurement averages are merged into a single table with irrelevant features dropped. 
This table is then used to build an AutoGluon [TabularPredictor](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) ML model to predict a binary classification { 0=OKAY, 1=Unhealthy } based on weather features.\n", "\n", "### Key Learnings\n", "- The relationship between weather and air quality varies geographically, and the models built in this demo confirmed it.\n", "    - Feature Importance review shows differing factors in models built for different locations.\n", "    - Academic research indicates the relationship can change by season, and an engineered “MONTH” feature was important for most models.\n", "    - An engineered “DAYOFWEEK” feature was expected to be important, but the observed correlation was minimal and removing the feature improved most models.\n", "- Predicting the exact target measurement was infeasible, but predicting a binary “OKAY” vs “Unhealthy” classification based on threshold values yielded decent results (~75-90% accuracy metrics).\n", "- For more on production-level air quality predictions, research how climatology and deeper statistical analysis are merged into 3D models that incorporate emissions models, meteorological models, and chemical models.\n", "\n", "### Confusion Matrix and Saving Final Results\n", "The Confusion Matrix above has the true labels as rows and the predicted labels as columns. Each pair below reads predicted=>actual:\n", "- True Positives (1=>1; see bottom-right)\n", "- True Negatives (0=>0; see top-left)\n", "- False Positives (1=>0; see top-right)\n", "- False Negatives (0=>1; see bottom-left)\n", "\n", "_The validation dataframe, with predictions appended as the final column, is saved below as a CSV for additional analysis._" ] }, { "cell_type": "code", "execution_count": 20, "id": "c1c7d5e7-0d51-4170-b1dd-ba09b97632bf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n", "Results saved to dataRESULTS_bakersfield_pm25_2016-2022.csv. 
DONE.\n" ] }, { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " DEWP WDSP MAX MIN PRCP MONTH isUnhealthy PREDICTION\n", "241 41.9 4.7 89.1 55.0 0.00 11 1 1\n", "994 40.8 2.5 62.1 37.9 0.00 1 1 1\n", "1820 42.2 5.5 90.0 63.0 0.00 5 0 0\n", "1442 50.0 4.3 73.9 52.0 0.06 4 0 0\n", "596 39.7 2.7 66.0 39.9 0.00 12 1 1\n", "... ... ... ... ... ... ... ... ...\n", "849 53.0 5.5 93.0 64.0 0.00 8 1 0\n", "208 37.2 5.6 91.9 57.9 0.00 10 1 1\n", "288 37.1 2.8 66.0 36.0 0.00 12 1 1\n", "1226 39.3 6.8 99.0 62.1 0.00 9 0 0\n", "2129 42.3 5.9 73.9 42.1 0.00 4 0 0\n", "\n", "[721 rows x 8 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create and save final results...\n", "print(AQbyWeather.selectedScenario.getSummary())\n", "resultsFile = AQbyWeather.getFilenameOther(\"dataRESULTS\")\n", "results_df = pd.DataFrame()\n", "results_df['PREDICTION'] = pd.DataFrame(y_pred)\n", "results_df = pd.concat([validate_df, results_df], axis=1)\n", "results_df.to_csv(resultsFile, index=False)\n", "print(f\"Results saved to {resultsFile}. DONE.\")\n", "display(results_df)" ] }, { "cell_type": "markdown", "id": "2da4da59-1326-4b15-95e4-3f86a877cfed", "metadata": {}, "source": [ "## Thanks and Clean-up...\n", "**Thanks for your time! I hope this demo was insightful...**\n", "\n", "If you'd like to clean up locally generated files, change _isCleanupTime_ to **True** below and run the cell.\n", "\n", "The End." 
] }, { "cell_type": "code", "execution_count": 21, "id": "c1e368f6-c54d-4a6c-bf9d-9a6667628761", "metadata": {}, "outputs": [], "source": [ "# OPTIONAL: Clean Up Generated Files...\n", "isCleanupTime = False\n", "if(isCleanupTime):\n", " # Remove local Autogluon Models + Folder...\n", " shutil.rmtree(AQbyWeather.selectedScenario.modelFolder)\n", " \n", " # Remove local CSV data files...\n", " for filename in glob.glob(\"*.csv\"):\n", " os.remove(filename)" ] }, { "cell_type": "code", "execution_count": null, "id": "7d6db503-8406-4cb7-b49c-ada5b8918780", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "aq_by_weather:Python", "language": "python", "name": "conda-env-aq_by_weather-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }