{
"cells": [
{
"cell_type": "markdown",
"id": "ae929fb3-0316-494b-accb-1951756d4d32",
"metadata": {
"tags": []
},
"source": [
"# ML Demo: Predicting Air Quality w/ ASDI NOAA + OpenAQ Datasets\n",
"This demo is a Jupyter Notebook that connects to Amazon Sustainability Data Initiative (ASDI) datasets from NOAA and OpenAQ to build a Machine Learning (ML) model that predicts air quality levels from weather data. The model is a Binary Classification model built with [AutoGluon](https://auto.gluon.ai/stable/index.html). The project's purpose is to demonstrate using two different types of ASDI datasets (files in Amazon S3 and an HTTPS API) within a Jupyter Notebook, such as one provided by Amazon SageMaker Studio Lab. This demo is NOT for scientific or health purposes.\n",
"\n",
"This Jupyter Notebook can be run for free using Amazon SageMaker Studio Lab and open-source Amazon Sustainability Data Initiative (ASDI) datasets without needing an AWS account.\n",
"\n",
"[Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-smsl-predict-airquality-via-weather/blob/main/aq_by_weather.ipynb)\n",
"\n",
"## BACKGROUND: 1 out of 8 deaths in the world is due to poor air quality*\n",
"This demo explores correlations between weather and air quality, since factors such as temperature and wind speed are known to affect certain air quality parameters. Predicting air quality can involve highly sophisticated ML techniques, but this demo shows that merging NOAA GSOD weather data with OpenAQ air quality data and training a Binary Classification model with AutoGluon (AutoML from AWS) can reach prediction accuracy of roughly 75-90% for various high-pollution areas and air quality parameters (mostly tested with 2.5 micron Particulate Matter and some Ground Level Ozone).\\\n",
"*Source: [OpenAQ.org](https://OpenAQ.org)\n",
"\n",
"## INSTRUCTIONS\n",
"- Configure your environment using the repo's environment.yml file or by running the *pip install* commands below.\n",
"- If you want to run this Notebook yourself, first use the _Clear All Outputs_ option.\n",
"- CELL #3: Review the Classes and Variables defined in this cell to see defaults and how ASDI access occurs.\n",
"- CELL #4: Review the pre-defined AQParams and AQScenarios in this cell. You can edit these and/or add your own.\n",
"- CELL #5: Select a scenario via the drop-down to use throughout the Notebook. This will drive the ML process.\n",
"- Step through the remaining cells in the Notebook to access data, merge data, and build a Binary Classification AutoGluon model.\n",
"- NOTE: NOAA and OpenAQ input data is saved locally as CSV files and re-used to minimize unnecessary data access and compute. If you want to re-fetch the data, delete the local CSV files first."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5a41f10b-f9ed-433a-8d32-bb718f23d811",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n",
"Note: you may need to restart the kernel to use updated packages.\n",
"Note: you may need to restart the kernel to use updated packages.\n",
"Note: you may need to restart the kernel to use updated packages.\n",
"Note: you may need to restart the kernel to use updated packages.\n",
"Note: you may need to restart the kernel to use updated packages.\n",
"Note: you may need to restart the kernel to use updated packages.\n",
"Note: you may need to restart the kernel to use updated packages.\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"# Set up your environment according to the repo's environment.yml file or run the following...\n",
"# Comment these out, once installed or otherwise not needed.\n",
"# This creates an empty pip_requirements.txt file used to suppress 'already satisfied' output.\n",
"import os\n",
"with open('pip_requirements.txt', mode='a'): pass\n",
"%pip install boto3 -r pip_requirements.txt | grep -v 'already satisfied'\n",
"%pip install pandas -r pip_requirements.txt | grep -v 'already satisfied'\n",
"%pip install numpy -r pip_requirements.txt | grep -v 'already satisfied'\n",
"%pip install requests -r pip_requirements.txt | grep -v 'already satisfied'\n",
"%pip install ipywidgets -r pip_requirements.txt | grep -v 'already satisfied'\n",
"%pip install scikit-learn -r pip_requirements.txt | grep -v 'already satisfied'\n",
"%pip install autogluon -r pip_requirements.txt | grep -v 'already satisfied'\n",
"%pip install matplotlib -r pip_requirements.txt | grep -v 'already satisfied'\n",
"%pip install nbconvert -r pip_requirements.txt | grep -v 'already satisfied'"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e844d780-d995-4bdc-98ae-570bc8253265",
"metadata": {},
"outputs": [],
"source": [
"# Import statements for packages used...\n",
"import os, glob, shutil, sys, requests, json\n",
"import boto3\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import ipywidgets as widgets\n",
"\n",
"from botocore import UNSIGNED\n",
"from botocore.config import Config\n",
"from io import StringIO\n",
"from datetime import datetime\n",
"from types import SimpleNamespace\n",
"from IPython.display import clear_output\n",
"\n",
"# The following is required for matplotlib plots to display in some envs...\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9ea427ea-8ce7-4dc5-80e1-c674f7f569ee",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classes and Variables are ready.\n"
]
}
],
"source": [
"# CELL #3: Review the Variables and Classes defined in this cell to see defaults and how ASDI access occurs...\n",
"\n",
"# class AQParam => Used to define attributes for the (6) main OpenAQ parameters.\n",
"class AQParam:\n",
"    def __init__(self, id, name, unit, unhealthyThresholdDefault, desc):\n",
"        self.id = id\n",
"        self.name = name\n",
"        self.unit = unit\n",
"        self.unhealthyThresholdDefault = unhealthyThresholdDefault\n",
"        self.desc = desc\n",
"\n",
"    def isValid(self):\n",
"        if(self is not None and self.id > 0 and self.unhealthyThresholdDefault > 0.0 and \n",
"           len(self.name) > 0 and len(self.unit) > 0 and len(self.desc) > 0):\n",
"            return True\n",
"        else:\n",
"            return False\n",
"\n",
"    def toJSON(self):\n",
"        return json.dumps(self, default=lambda o: o.__dict__, sort_keys=True, indent=2)\n",
"\n",
"# class AQScenario => Defines an ML scenario including a Location w/ NOAA Weather Station ID \n",
"# and the target OpenAQ Param.\n",
"# Note: OpenAQ data mostly begins sometime in 2016, so using that as a default yearStart value.\n",
"class AQScenario:\n",
"    def __init__(self, location=None, noaaStationID=None, aqParamTarget=None, unhealthyThreshold=None, \n",
"                 yearStart=2016, yearEnd=2022, aqRadiusMiles=10, featureColumnsToDrop=None):\n",
"        self.location = location\n",
"        self.name = location + \"_\" + aqParamTarget.name\n",
"        self.noaaStationID = noaaStationID\n",
"        self.noaaStationLat = 0.0\n",
"        self.noaaStationLng = 0.0\n",
"        self.openAqLocIDs = []\n",
"\n",
"        self.aqParamTarget = aqParamTarget\n",
"\n",
"        if unhealthyThreshold and unhealthyThreshold > 0.0:\n",
"            self.unhealthyThreshold = unhealthyThreshold\n",
"        else:\n",
"            self.unhealthyThreshold = self.aqParamTarget.unhealthyThresholdDefault\n",
"\n",
"        self.yearStart = yearStart\n",
"        self.yearEnd = yearEnd\n",
"        self.aqRadiusMiles = aqRadiusMiles\n",
"        self.aqRadiusMeters = aqRadiusMiles * 1610 # Rough integer approximation is fine here.\n",
"\n",
"        self.modelFolder = \"AutogluonModels\"\n",
"\n",
"    def getSummary(self):\n",
"        return f\"Scenario: {self.name} => {self.aqParamTarget.desc} ({self.aqParamTarget.name}) with UnhealthyThreshold > {self.unhealthyThreshold} {self.aqParamTarget.unit}\"\n",
"\n",
"    def getModelPath(self):\n",
"        return f\"{self.modelFolder}/aq_{self.name}_{self.yearStart}-{self.yearEnd}/\"\n",
"\n",
"    def updateNoaaStationLatLng(self, noaagsod_df_row):\n",
"        # Use a NOAA row to set Lat+Lng values used for the OpenAQ API requests...\n",
"        if(noaagsod_df_row is not None and noaagsod_df_row['LATITUDE'] and noaagsod_df_row['LONGITUDE']):\n",
"            self.noaaStationLat = noaagsod_df_row['LATITUDE']\n",
"            self.noaaStationLng = noaagsod_df_row['LONGITUDE']\n",
"            print(f\"NOAA Station Lat,Lng Updated for Scenario: {self.name} => {self.noaaStationLat},{self.noaaStationLng}\")\n",
"        else:\n",
"            print(\"NOAA Station Lat,Lng COULD NOT BE UPDATED.\")\n",
"\n",
"    def isValid(self):\n",
"        if(self is not None and self.aqParamTarget is not None and\n",
"           self.yearStart > 0 and self.yearEnd > 0 and self.yearEnd >= self.yearStart and \n",
"           self.aqRadiusMiles > 0 and self.aqRadiusMeters > 0 and self.unhealthyThreshold > 0.0 and \n",
"           len(self.name) > 0 and len(self.noaaStationID) > 0):\n",
"            return True\n",
"        else:\n",
"            return False\n",
"\n",
"    def toJSON(self):\n",
"        return json.dumps(self, default=lambda o: o.__dict__, sort_keys=True, indent=2)\n",
"\n",
"# class AQbyWeatherApp => Main app class with settings, AQParams, AQScenarios, and data access methods...\n",
"class AQbyWeatherApp:\n",
"    def __init__(self, mlTargetLabel='isUnhealthy', mlEvalMetric='accuracy', mlTimeLimitSecs=None):\n",
"        self.mlTargetLabel = mlTargetLabel\n",
"        self.mlEvalMetric = mlEvalMetric\n",
"        self.mlTimeLimitSecs = mlTimeLimitSecs\n",
"        self.mlIgnoreColumns = ['DATE','NAME','LATITUDE','LONGITUDE','day','unit','average','parameter']\n",
"\n",
"        self.defaultColumnsNOAA = ['DATE','NAME','LATITUDE','LONGITUDE',\n",
"                                   'DEWP','WDSP','MAX','MIN','PRCP','MONTH'] # Default relevant NOAA columns\n",
"        self.defaultColumnsOpenAQ = ['day','parameter','unit','average'] # Default relevant OpenAQ columns\n",
"\n",
"        self.aqParams = {} # A dict of AQParam objects, keyed by name\n",
"        self.aqScenarios = {} # A dict of AQScenario objects, keyed by name\n",
"\n",
"        self.selectedScenario = None\n",
"\n",
"    def addAQParam(self, aqParam):\n",
"        if aqParam and aqParam.isValid():\n",
"            self.aqParams[aqParam.name] = aqParam\n",
"            return True\n",
"        else:\n",
"            return False\n",
"\n",
"    def addAQScenario(self, aqScenario):\n",
"        if aqScenario and aqScenario.isValid():\n",
"            self.aqScenarios[aqScenario.name] = aqScenario\n",
"            if(self.selectedScenario is None):\n",
"                self.selectedScenario = self.aqScenarios[next(iter(self.aqScenarios))] # Default selectedScenario to 1st item.\n",
"            return True\n",
"        else:\n",
"            return False\n",
"\n",
"    def getFilenameNOAA(self):\n",
"        if self and self.selectedScenario and self.selectedScenario.isValid():\n",
"            return f\"dataNOAA_{self.selectedScenario.name}_{self.selectedScenario.yearStart}-{self.selectedScenario.yearEnd}_{self.selectedScenario.noaaStationID}.csv\"\n",
"        else:\n",
"            return \"\"\n",
"\n",
"    def getFilenameOpenAQ(self):\n",
"        if self and self.selectedScenario and self.selectedScenario.isValid() and len(self.selectedScenario.openAqLocIDs) > 0:\n",
"            # Join the OpenAQ Location IDs with dashes (ie: 6885-28777-6883)...\n",
"            idString = \"-\".join(str(locID) for locID in self.selectedScenario.openAqLocIDs)\n",
"            return f\"dataOpenAQ_{self.selectedScenario.name}_{self.selectedScenario.yearStart}-{self.selectedScenario.yearEnd}_{idString}.csv\"\n",
"        else:\n",
"            return \"\"\n",
"\n",
"    def getFilenameOther(self, prefix):\n",
"        if self and self.selectedScenario and self.selectedScenario.isValid():\n",
"            return f\"{prefix}_{self.selectedScenario.name}_{self.selectedScenario.yearStart}-{self.selectedScenario.yearEnd}.csv\"\n",
"        else:\n",
"            return \"\" # Match the other getFilename methods instead of implicitly returning None\n",
"\n",
"    def getNoaaDataFrame(self):\n",
"        # ASDI Dataset Name: NOAA GSOD\n",
"        # ASDI Dataset URL : https://registry.opendata.aws/noaa-gsod/\n",
"        # NOAA GSOD README : https://www.ncei.noaa.gov/data/global-summary-of-the-day/doc/readme.txt\n",
"        # NOAA GSOD data in S3 is organized by year and Station ID values, so this is straightforward\n",
"        # Example S3 path format => s3://noaa-gsod-pds/{yyyy}/{stationid}.csv\n",
"        # Let's start with a new DataFrame and load it from a local CSV or the NOAA data source...\n",
"        noaagsod_df = pd.DataFrame()\n",
"        filenameNOAA = self.getFilenameNOAA()\n",
"\n",
"        if os.path.exists(filenameNOAA):\n",
"            # Use local data file already accessed + prepared...\n",
"            print('Loading NOAA GSOD data from local file: ', filenameNOAA)\n",
"            noaagsod_df = pd.read_csv(filenameNOAA)\n",
"        else:\n",
"            # Access + prepare data and save to a local data file...\n",
"            noaagsod_bucket = 'noaa-gsod-pds'\n",
"            print(f'Accessing and preparing data from ASDI-hosted NOAA GSOD dataset in Amazon S3 (bucket: {noaagsod_bucket})...')\n",
"            s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))\n",
"\n",
"            for year in range(self.selectedScenario.yearStart, self.selectedScenario.yearEnd + 1):\n",
"                key = f'{year}/{self.selectedScenario.noaaStationID}.csv' # Compute the key to get\n",
"                csv_obj = s3.get_object(Bucket=noaagsod_bucket, Key=key) # Get the S3 object\n",
"                csv_string = csv_obj['Body'].read().decode('utf-8') # Read object contents to a string\n",
"                noaagsod_df = pd.concat([noaagsod_df, pd.read_csv(StringIO(csv_string))], ignore_index=True) # Use the string to build the DataFrame\n",
"\n",
"            # Perform some Feature Engineering to append potentially useful columns to our dataset... (TODO: Add more to optimize...)\n",
"            # Month may affect air quality (ie: seasonal considerations; tends to have correlation for certain areas)\n",
"            noaagsod_df['MONTH'] = pd.to_datetime(noaagsod_df['DATE']).dt.month\n",
"\n",
"            # Trim down to the desired key columns... (do this last in case engineered columns are to be removed)\n",
"            noaagsod_df = noaagsod_df[self.defaultColumnsNOAA]\n",
"\n",
"        return noaagsod_df\n",
"\n",
"    def getOpenAqDataFrame(self):\n",
"        # ASDI Dataset Name: OpenAQ\n",
"        # ASDI Dataset URL : https://registry.opendata.aws/openaq/\n",
"        # OpenAQ API Docs : https://docs.openaq.org/#/v2/\n",
"        # OpenAQ S3 data is only organized by date folders, so each folder is large and contains all stations.\n",
"        # Because of this, it's better to query ASDI OpenAQ data using the CloudFront-hosted API.\n",
"        # Note that some days may not have values and will get filtered out via an INNER JOIN later.\n",
"        # Let's start with a new DataFrame and load it from a local CSV or the OpenAQ data source...\n",
"        aq_df = pd.DataFrame()\n",
"        aq_reqUrlBase = \"https://api.openaq.org/v2\" # OpenAQ ASDI API Endpoint URL Base (ie: add /locations OR /averages)\n",
"\n",
"        if self.selectedScenario.noaaStationLat == 0.0 or self.selectedScenario.noaaStationLng == 0.0:\n",
"            print(\"NOAA Station Lat/Lng NOT DEFINED. CANNOT PROCEED\")\n",
"            return aq_df\n",
"\n",
"        if len(self.selectedScenario.openAqLocIDs) == 0:\n",
"            # We must start by querying nearby OpenAQ Locations for their IDs...\n",
"            print('Accessing ASDI-hosted OpenAQ Locations (HTTPS API)...')\n",
"            aq_reqParams = {\n",
"                'limit': 10,\n",
"                'page': 1,\n",
"                'offset': 0,\n",
"                'sort': 'desc',\n",
"                'order_by': 'location',\n",
"                'parameter': self.selectedScenario.aqParamTarget.name,\n",
"                'coordinates': f'{self.selectedScenario.noaaStationLat},{self.selectedScenario.noaaStationLng}',\n",
"                'radius': self.selectedScenario.aqRadiusMeters,\n",
"                'isMobile': 'false',\n",
"                'sensorType': 'reference grade',\n",
"                'dumpRaw': 'false'\n",
"            }\n",
"            aq_resp = requests.get(aq_reqUrlBase + \"/locations\", params=aq_reqParams)\n",
"            aq_data = aq_resp.json()\n",
"            if aq_data['results'] and len(aq_data['results']) >= 1:\n",
"                for i in range(0, len(aq_data['results'])):\n",
"                    self.selectedScenario.openAqLocIDs.append(aq_data['results'][i]['id'])\n",
"                print(f\"OpenAQ Location IDs within {self.selectedScenario.aqRadiusMiles} miles ({self.selectedScenario.aqRadiusMeters}m) \" +\n",
"                      f\"of NOAA Station {self.selectedScenario.noaaStationID} at \" +\n",
"                      f\"{self.selectedScenario.noaaStationLat},{self.selectedScenario.noaaStationLng}: {self.selectedScenario.openAqLocIDs}\")\n",
"            else:\n",
"                print(f\"NO OpenAQ Location IDs found within {self.selectedScenario.aqRadiusMiles} miles ({self.selectedScenario.aqRadiusMeters}m) \" +\n",
"                      f\"of NOAA Station {self.selectedScenario.noaaStationID}. CANNOT PROCEED.\")\n",
"\n",
"        if len(self.selectedScenario.openAqLocIDs) >= 1:\n",
"            filenameOpenAQ = self.getFilenameOpenAQ()\n",
"\n",
"            if os.path.exists(filenameOpenAQ):\n",
"                # Use local data file already accessed + prepared...\n",
"                print('Loading OpenAQ data from local file: ', filenameOpenAQ)\n",
"                aq_df = pd.read_csv(filenameOpenAQ)\n",
"            else:\n",
"                # Access + prepare data (NOTE: calling OpenAQ API one year at a time to avoid timeouts)\n",
"                print('Accessing ASDI-hosted OpenAQ Averages (HTTPS API)...')\n",
"\n",
"                for year in range(self.selectedScenario.yearStart, self.selectedScenario.yearEnd + 1):\n",
"                    aq_reqParams = {\n",
"                        'date_from': f'{year}-01-01',\n",
"                        'date_to': f'{year}-12-31',\n",
"                        'parameter': [self.selectedScenario.aqParamTarget.name],\n",
"                        'limit': 366,\n",
"                        'page': 1,\n",
"                        'offset': 0,\n",
"                        'sort': 'asc',\n",
"                        'spatial': 'location',\n",
"                        'temporal': 'day',\n",
"                        'location': self.selectedScenario.openAqLocIDs,\n",
"                        'group': 'true'\n",
"                    }\n",
"                    aq_resp = requests.get(aq_reqUrlBase + \"/averages\", params=aq_reqParams)\n",
"                    aq_data = aq_resp.json()\n",
"                    if aq_data['results'] and len(aq_data['results']) >= 1:\n",
"                        aq_df = pd.concat([aq_df, pd.json_normalize(aq_data['results'])], ignore_index=True)\n",
"\n",
"                # Skip the column trimming and labeling if no year returned any results (avoids a KeyError)...\n",
"                if len(aq_df) > 0:\n",
"                    # Using the joined dataframe, trim down to the desired key columns...\n",
"                    aq_df = aq_df[self.defaultColumnsOpenAQ]\n",
"\n",
"                    # Perform some Label Engineering to add our binary classification label => {0=OKAY, 1=UNHEALTHY}\n",
"                    aq_df[self.mlTargetLabel] = np.where(aq_df['average'] <= self.selectedScenario.unhealthyThreshold, 0, 1)\n",
"\n",
"        return aq_df\n",
"\n",
"    def getMergedDataFrame(self, noaagsod_df, aq_df):\n",
"        if len(noaagsod_df) > 0 and len(aq_df) > 0:\n",
"            merged_df = pd.merge(noaagsod_df, aq_df, how=\"inner\", left_on=\"DATE\", right_on=\"day\")\n",
"            merged_df = merged_df.drop(columns=self.mlIgnoreColumns)\n",
"            return merged_df\n",
"        else:\n",
"            return pd.DataFrame()\n",
"\n",
"    def getConfusionMatrixData(self, cm):\n",
"        # Expects a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]...\n",
"        cmData = SimpleNamespace()\n",
"        cmData.TN = cm[0][0]\n",
"        cmData.TP = cm[1][1]\n",
"        cmData.FN = cm[1][0]\n",
"        cmData.FP = cm[0][1]\n",
"\n",
"        cmData.TN_Rate = cmData.TN/(cmData.TN+cmData.FP)\n",
"        cmData.TP_Rate = cmData.TP/(cmData.TP+cmData.FN)\n",
"        cmData.FN_Rate = cmData.FN/(cmData.FN+cmData.TP)\n",
"        cmData.FP_Rate = cmData.FP/(cmData.FP+cmData.TN)\n",
"\n",
"        cmData.TN_Output = f\"True Negatives (TN): {cmData.TN} of {cmData.TN+cmData.FP} => {round(cmData.TN_Rate * 100, 2)}%\"\n",
"        cmData.TP_Output = f\"True Positives (TP): {cmData.TP} of {cmData.TP+cmData.FN} => {round(cmData.TP_Rate * 100, 2)}%\"\n",
"        cmData.FN_Output = f\"False Negatives (FN): {cmData.FN} of {cmData.FN+cmData.TP} => {round(cmData.FN_Rate * 100, 2)}%\"\n",
"        cmData.FP_Output = f\"False Positives (FP): {cmData.FP} of {cmData.FP+cmData.TN} => {round(cmData.FP_Rate * 100, 2)}%\"\n",
"\n",
"        return cmData\n",
" \n",
"print(\"Classes and Variables are ready.\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "599478b3-5abd-4bc3-9888-d3dee71b1c09",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AQbyWeather.aqParams: 6\n",
"AQbyWeather.aqScenarios: 15 (Default Selected: bakersfield_pm25)\n"
]
}
],
"source": [
"# CELL #4: Review the pre-defined AQParams and AQScenarios in this cell. You can edit these and/or use your own...\n",
"# AQParams are added with default thresholds, which can be overridden on a per-AQScenario basis.\n",
"# These AQParams are based on the OpenAQ /parameters API call where isCore=true (https://api.openaq.org/v2/parameters).\n",
"# Default thresholds were chosen using data from EPA.gov (https://www.epa.gov/criteria-air-pollutants/naaqs-table).\n",
"# Confirm and adjust params or thresholds to suit your needs... Not for scientific or health purposes.\n",
"\n",
"# Instantiate main App class with explicit mlTargetLabel and mlEvalMetric provided...\n",
"AQbyWeather = AQbyWeatherApp(mlTargetLabel='isUnhealthy', mlEvalMetric='accuracy')\n",
"\n",
"# Define and add new AQParams...\n",
"AQbyWeather.addAQParam(AQParam( 1, \"pm10\", \"µg/m³\", 150.0, \"Particulate Matter < 10 micrometers\"))\n",
"AQbyWeather.addAQParam(AQParam( 2, \"pm25\", \"µg/m³\", 12.0, \"Particulate Matter < 2.5 micrometers\"))\n",
"AQbyWeather.addAQParam(AQParam( 7, \"no2\", \"ppm\", 0.100, \"Nitrogen Dioxide\")) # EPA 1-hour standard is 100 ppb = 0.100 ppm\n",
"AQbyWeather.addAQParam(AQParam( 8, \"co\", \"ppm\", 9.0, \"Carbon Monoxide\"))\n",
"AQbyWeather.addAQParam(AQParam( 9, \"so2\", \"ppm\", 0.075, \"Sulfur Dioxide\")) # EPA 1-hour standard is 75 ppb = 0.075 ppm\n",
"AQbyWeather.addAQParam(AQParam(10, \"o3\", \"ppm\", 0.070, \"Ground Level Ozone\"))\n",
"\n",
"# Define available AQ Scenarios for certain locations with their associated NOAA GSOD StationID values...\n",
"# NOAA GSOD Station Search: https://www.ncei.noaa.gov/access/search/data-search/global-summary-of-the-day\n",
"# TODO: Consider how to OPTIONALLY append more scenarios via an optional JSON file\n",
"# (ie: without adding a dependency outside the .ipynb file)\n",
"# NOTE: For Ozone Scenarios, we're generally using 0.035 ppm to override the default threshold.\n",
"AQbyWeather.addAQScenario(AQScenario(\"bakersfield\", \"72384023155\", AQbyWeather.aqParams[\"pm25\"], None))\n",
"AQbyWeather.addAQScenario(AQScenario(\"bakersfield\", \"72384023155\", AQbyWeather.aqParams[\"pm10\"], None)) # Attempt at pm10 prediction.\n",
"AQbyWeather.addAQScenario(AQScenario(\"bakersfield\", \"72384023155\", AQbyWeather.aqParams[\"o3\"], 0.035))\n",
"AQbyWeather.addAQScenario(AQScenario(\"fresno\", \"72389093193\", AQbyWeather.aqParams[\"pm25\"], None))\n",
"AQbyWeather.addAQScenario(AQScenario(\"fresno\", \"72389093193\", AQbyWeather.aqParams[\"o3\"], 0.035))\n",
"AQbyWeather.addAQScenario(AQScenario(\"visalia\", \"72389693144\", AQbyWeather.aqParams[\"pm25\"], None))\n",
"AQbyWeather.addAQScenario(AQScenario(\"visalia\", \"72389693144\", AQbyWeather.aqParams[\"o3\"], 0.035))\n",
"AQbyWeather.addAQScenario(AQScenario(\"san-jose\", \"72494693232\", AQbyWeather.aqParams[\"pm25\"], None))\n",
"AQbyWeather.addAQScenario(AQScenario(\"san-jose\", \"72494693232\", AQbyWeather.aqParams[\"o3\"], 0.035))\n",
"AQbyWeather.addAQScenario(AQScenario(\"los-angeles\", \"72287493134\", AQbyWeather.aqParams[\"pm25\"], None))\n",
"AQbyWeather.addAQScenario(AQScenario(\"los-angeles\", \"72287493134\", AQbyWeather.aqParams[\"o3\"], 0.035))\n",
"AQbyWeather.addAQScenario(AQScenario(\"phoenix\", \"72278023183\", AQbyWeather.aqParams[\"pm25\"], None))\n",
"AQbyWeather.addAQScenario(AQScenario(\"phoenix\", \"72278023183\", AQbyWeather.aqParams[\"o3\"], 0.035))\n",
"AQbyWeather.addAQScenario(AQScenario(\"fairbanks\", \"70261026411\", AQbyWeather.aqParams[\"pm25\"], None))\n",
"AQbyWeather.addAQScenario(AQScenario(\"lahore-pk\", \"41640099999\", AQbyWeather.aqParams[\"pm25\"], None))\n",
"\n",
"print(f\"AQbyWeather.aqParams: {str(len(AQbyWeather.aqParams))}\")\n",
"print(f\"AQbyWeather.aqScenarios: {str(len(AQbyWeather.aqScenarios))} (Default Selected: {AQbyWeather.selectedScenario.name})\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "3395018f-87bd-4a7f-8ee5-7e90d38bfb04",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*** CHOOSE YOUR OWN ADVENTURE HERE ***\n",
"Please select a Scenario via the following drop-down-list...\n",
"(NOTE: If you change Scenario, you must re-run remaining cells to see changes.)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "54d02042d8c74215baadef6fc83953d1",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Dropdown(options=('bakersfield_pm25', 'bakersfield_pm10', 'bakersfield_o3', 'fresno_pm25', 'fresno_o3', 'visal…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# CELL #5: Select a Scenario via DROP DOWN LIST to use throughout the Notebook. This will drive the ML process...\n",
"# A default \"value\" is set to avoid issues. Change this default to run the Notebook from start-to-finish for that Scenario.\n",
"print(\"*** CHOOSE YOUR OWN ADVENTURE HERE ***\")\n",
"print(\"Please select a Scenario via the following drop-down-list...\")\n",
"print(\"(NOTE: If you change Scenario, you must re-run remaining cells to see changes.)\")\n",
"ddl = widgets.Dropdown(options=AQbyWeather.aqScenarios.keys(), \n",
" value=AQbyWeather.aqScenarios[\"bakersfield_pm25\"].name) # <-- DEFAULT / FULL-RUN VALUE\n",
"ddl"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ffed30d9-1725-405b-a87f-e2aac0be5634",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n",
"{\n",
" \"aqParamTarget\": {\n",
" \"desc\": \"Particulate Matter < 2.5 micrometers\",\n",
" \"id\": 2,\n",
" \"name\": \"pm25\",\n",
" \"unhealthyThresholdDefault\": 12.0,\n",
" \"unit\": \"\\u00b5g/m\\u00b3\"\n",
" },\n",
" \"aqRadiusMeters\": 16100,\n",
" \"aqRadiusMiles\": 10,\n",
" \"location\": \"bakersfield\",\n",
" \"modelFolder\": \"AutogluonModels\",\n",
" \"name\": \"bakersfield_pm25\",\n",
" \"noaaStationID\": \"72384023155\",\n",
" \"noaaStationLat\": 0.0,\n",
" \"noaaStationLng\": 0.0,\n",
" \"openAqLocIDs\": [],\n",
" \"unhealthyThreshold\": 12.0,\n",
" \"yearEnd\": 2022,\n",
" \"yearStart\": 2016\n",
"}\n"
]
}
],
"source": [
"if ddl.value:\n",
" AQbyWeather.selectedScenario = AQbyWeather.aqScenarios[ddl.value]\n",
" print(AQbyWeather.selectedScenario.getSummary())\n",
" print(AQbyWeather.selectedScenario.toJSON())\n",
"else:\n",
" print(\"Please select a Scenario via the above drop-down-list.\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "9c3a647c-0fee-4b49-8576-0ee355629f0e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n",
"Accessing and preparing data from ASDI-hosted NOAA GSOD dataset in Amazon S3 (bucket: noaa-gsod-pds)...\n",
"NOAA Station Lat,Lng Updated for Scenario: bakersfield_pm25 => 35.4344,-119.0542\n",
"noaagsod_df.shape = (2346, 10)\n"
]
},
{
"data": {
"text/plain": [
"            DATE                        NAME  LATITUDE   LONGITUDE  DEWP  WDSP  \\\n",
"0     2016-01-01  BAKERSFIELD AIRPORT, CA US  35.43440  -119.05420  29.2   3.0  \n",
"1     2016-01-02  BAKERSFIELD AIRPORT, CA US  35.43440  -119.05420  32.8   2.4  \n",
"2     2016-01-03  BAKERSFIELD AIRPORT, CA US  35.43440  -119.05420  32.6   4.3  \n",
"3     2016-01-04  BAKERSFIELD AIRPORT, CA US  35.43440  -119.05420  32.0   4.9  \n",
"4     2016-01-05  BAKERSFIELD AIRPORT, CA US  35.43440  -119.05420  42.8   4.5  \n",
"...          ...                         ...       ...         ...   ...   ...  \n",
"2341  2022-06-04  BAKERSFIELD AIRPORT, CA US  35.43424  -119.05524  50.5   7.3  \n",
"2342  2022-06-05  BAKERSFIELD AIRPORT, CA US  35.43424  -119.05524  45.1   6.4  \n",
"2343  2022-06-06  BAKERSFIELD AIRPORT, CA US  35.43424  -119.05524  58.7   7.5  \n",
"2344  2022-06-07  BAKERSFIELD AIRPORT, CA US  35.43424  -119.05524  47.8   5.4  \n",
"2345  2022-06-08  BAKERSFIELD AIRPORT, CA US  35.43424  -119.05524  43.5   4.3  \n",
"\n",
"       MAX   MIN  PRCP  MONTH  \n",
"0     55.0  27.0   0.0      1  \n",
"1     57.0  27.0   0.0      1  \n",
"2     68.0  34.0   0.0      1  \n",
"3     68.0  39.0   0.0      1  \n",
"4     64.9  42.1   0.0      1  \n",
"...    ...   ...   ...    ...  \n",
"2341  91.0  64.0   0.0      6  \n",
"2342  90.0  63.0   0.0      6  \n",
"2343  90.0  63.0   0.0      6  \n",
"2344  91.0  66.0   0.0      6  \n",
"2345  91.9  75.9   0.0      6  \n",
"\n",
"[2346 rows x 10 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# GET NOAA GSOD WEATHER DATA...\n",
"print(AQbyWeather.selectedScenario.getSummary())\n",
"noaagsod_df = AQbyWeather.getNoaaDataFrame()\n",
"\n",
"if(len(noaagsod_df) >= 1):\n",
" # Update NOAA Station Lat/Lng...\n",
" AQbyWeather.selectedScenario.updateNoaaStationLatLng(noaagsod_df.iloc[0])\n",
" \n",
" # Save DataFrame to CSV...\n",
" noaagsod_df.to_csv(AQbyWeather.getFilenameNOAA(), index=False)\n",
"\n",
" # Output DataFrame properties...\n",
" print('noaagsod_df.shape =', noaagsod_df.shape)\n",
" display(noaagsod_df)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "2893d81f-a488-464d-b374-dafbfc0ae3fb",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n",
"Accessing ASDI-hosted OpenAQ Locations (HTTPS API)...\n",
"OpenAQ Location IDs within 10 miles (16100m) of NOAA Station 72384023155 at 35.4344,-119.0542: [6885, 28777, 6883, 230814, 887]\n",
"Accessing ASDI-hosted OpenAQ Averages (HTTPS API)...\n",
"aq_df.shape = (2195, 5)\n"
]
},
{
"data": {
"text/plain": [
" day parameter unit average isUnhealthy\n",
"0 2016-03-06 pm25 µg/m³ 2.0000 0\n",
"1 2016-03-10 pm25 µg/m³ 10.5000 0\n",
"2 2016-03-11 pm25 µg/m³ 5.4286 0\n",
"3 2016-03-12 pm25 µg/m³ 4.1667 0\n",
"4 2016-03-13 pm25 µg/m³ 6.0000 0\n",
"... ... ... ... ... ...\n",
"2190 2022-06-10 pm25 µg/m³ 9.2045 0\n",
"2191 2022-06-11 pm25 µg/m³ 9.4583 0\n",
"2192 2022-06-12 pm25 µg/m³ 6.1667 0\n",
"2193 2022-06-13 pm25 µg/m³ 3.5435 0\n",
"2194 2022-06-14 pm25 µg/m³ 7.3000 0\n",
"\n",
"[2195 rows x 5 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# GET OPENAQ AIR QUALITY DAILY AVERAGES DATA...\n",
"print(AQbyWeather.selectedScenario.getSummary())\n",
"aq_df = AQbyWeather.getOpenAqDataFrame() # Gets nearby Location IDs THEN gets associated daily averages.\n",
" \n",
"if len(aq_df) > 0:\n",
" # Output DataFrame properties...\n",
" print('aq_df.shape =', aq_df.shape)\n",
" display(aq_df)\n",
" aq_df.to_csv(AQbyWeather.getFilenameOpenAQ(), index=False)"
]
},
{
"cell_type": "markdown",
"id": "0a676054-ca63-47a2-832c-c136d1f0ecb4",
"metadata": {},
"source": [
"## Merging the NOAA and OpenAQ data into one dataframe...\n",
"We'll be using AutoGluon's [_TabularPredictor_](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) on this merged dataframe to predict our target label column. For best results, we need enough data overall and enough examples of each target label class {0,1}.\n",
"\n",
"Review the shape of the merged_df table and the target label bar chart to determine if you have enough data and enough variation in the target label. Ideally, we should have thousands of total rows and at least hundreds of each class value.\n",
"\n",
"After the merge cell, we also visualize correlations between the different columns in our merged_df table."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "6a873736-65b4-41a5-82e0-794ef33188b3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n",
"merged_df.shape = (2184, 7)\n"
]
},
{
"data": {
"text/html": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Merge the NOAA GSOD weather data with our OpenAQ data by DATE...\n",
"# Perform another column drop to remove columns we don't want as features/inputs.\n",
"# This column removal will NOT be necessary once we can use the AutoGluon ignore_columns param (TBD).\n",
"print(AQbyWeather.selectedScenario.getSummary())\n",
"merged_df = AQbyWeather.getMergedDataFrame(noaagsod_df, aq_df)\n",
"\n",
"if len(merged_df) > 0:\n",
" # Output DataFrame properties...\n",
" print('merged_df.shape =', merged_df.shape)\n",
" display(merged_df)\n",
" merged_df.groupby([AQbyWeather.mlTargetLabel]).size().plot(kind=\"bar\")\n",
" merged_df.to_csv(AQbyWeather.getFilenameOther(\"dataMERGED\"), index=False)\n",
"\n",
"# TODO: Explore other ways to join the data and/or handle null values..."
]
},
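{
"cell_type": "markdown",
"id": "sketch-label-balance-check",
"metadata": {},
"source": [
"The review of the target label balance described above can be sketched in plain Python. This is a minimal illustration and not part of the AQbyWeather helper class: the function name `check_label_balance` and the thresholds `min_rows=1000` and `min_per_class=100` are assumptions chosen to mirror the *thousands of rows, hundreds per class* guidance, and the label counts in the usage lines are made up.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def check_label_balance(labels, min_rows=1000, min_per_class=100):\n",
"    '''Return (ok, counts) for a sequence of 0/1 target labels.'''\n",
"    counts = Counter(labels)\n",
"    enough_rows = len(labels) >= min_rows\n",
"    # Both classes must be present, and each must meet the assumed minimum.\n",
"    enough_per_class = all(counts.get(c, 0) >= min_per_class for c in (0, 1))\n",
"    return enough_rows and enough_per_class, dict(counts)\n",
"\n",
"# Hypothetical label column: 1,500 OKAY days and 700 Unhealthy days.\n",
"labels = [0] * 1500 + [1] * 700\n",
"ok, counts = check_label_balance(labels)\n",
"print(ok, counts)\n",
"```\n",
"\n",
"In this notebook, the same check could be applied to the values of `merged_df[AQbyWeather.mlTargetLabel]`."
]
},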
{
"cell_type": "code",
"execution_count": 10,
"id": "c5dee84f-da68-4b10-a075-a5cab419cd16",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Visualize correlations in our merged dataframe...\n",
"print(AQbyWeather.selectedScenario.getSummary())\n",
"correlations = merged_df.corr()\n",
"fig = plt.figure(figsize=(10, 8))\n",
"ax = fig.add_subplot(111)\n",
"cax = ax.matshow(correlations, vmin=-1, vmax=1)\n",
"fig.colorbar(cax)\n",
"ticks = np.arange(0, len(merged_df.columns), 1)\n",
"ax.set_xticks(ticks)\n",
"ax.set_yticks(ticks)\n",
"ax.set_xticklabels(merged_df.columns)\n",
"ax.set_yticklabels(merged_df.columns)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "219a6c62-70eb-41ad-ae88-5ccd27f28aab",
"metadata": {},
"source": [
"## Begin ML on Merged Dataframe...\n",
"In order to build, validate, and test our ML model, we'll begin by creating three dataframes:\n",
"1. **train_df:** This is split from the full merged_df data as the larger chunk and used for training.\n",
"2. **validate_df:** This is split from the full merged_df data as the smaller chunk and used for validation.\n",
"3. **test_df:** This is the validation dataframe, but with the true target label values removed to enable model testing."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "66bce085-f0ac-418f-b3e9-5e1da7cefbd9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n",
"Number of training samples: 1463\n",
"Number of validation samples: 721\n"
]
}
],
"source": [
"# Additional import statements for autogluon+sklearn and split out train_df + validate_df data...\n",
"print(AQbyWeather.selectedScenario.getSummary())\n",
"from autogluon.tabular import TabularPredictor\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
"train_df, validate_df = train_test_split(merged_df, test_size=0.33, random_state=1)\n",
"print('Number of training samples:', len(train_df))\n",
"print('Number of validation samples:', len(validate_df))"
]
},
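{
"cell_type": "markdown",
"id": "sketch-split-size-arithmetic",
"metadata": {},
"source": [
"The sample counts printed above follow from simple arithmetic on the merged row count. The sketch below assumes the held-out count is the `test_size` fraction rounded up, with the remainder used for training; this is our understanding of how `train_test_split` sizes a fractional split, so treat the rounding rule as an assumption. The helper name `split_sizes` is hypothetical.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def split_sizes(n_rows, test_size=0.33):\n",
"    '''Return (n_train, n_validate) for a fractional hold-out split.'''\n",
"    n_validate = math.ceil(test_size * n_rows)  # assumed rounding rule\n",
"    return n_rows - n_validate, n_validate\n",
"\n",
"# 2,184 is the merged_df row count reported for the Bakersfield pm25 scenario.\n",
"n_train, n_validate = split_sizes(2184)\n",
"print('train:', n_train, 'validate:', n_validate)\n",
"```\n",
"\n",
"For 2,184 merged rows this gives 1,463 training rows and 721 validation rows, which agrees with the counts printed by the split cell."
]
},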
{
"cell_type": "code",
"execution_count": 12,
"id": "cff0133a-7dfe-4187-8dda-026d0695ec92",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>DEWP</th><th>WDSP</th><th>MAX</th><th>MIN</th><th>PRCP</th><th>MONTH</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>241</th><td>41.9</td><td>4.7</td><td>89.1</td><td>55.0</td><td>0.00</td><td>11</td></tr>\n",
"    <tr><th>994</th><td>40.8</td><td>2.5</td><td>62.1</td><td>37.9</td><td>0.00</td><td>1</td></tr>\n",
"    <tr><th>1820</th><td>42.2</td><td>5.5</td><td>90.0</td><td>63.0</td><td>0.00</td><td>5</td></tr>\n",
"    <tr><th>1442</th><td>50.0</td><td>4.3</td><td>73.9</td><td>52.0</td><td>0.06</td><td>4</td></tr>\n",
"    <tr><th>596</th><td>39.7</td><td>2.7</td><td>66.0</td><td>39.9</td><td>0.00</td><td>12</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>849</th><td>53.0</td><td>5.5</td><td>93.0</td><td>64.0</td><td>0.00</td><td>8</td></tr>\n",
"    <tr><th>208</th><td>37.2</td><td>5.6</td><td>91.9</td><td>57.9</td><td>0.00</td><td>10</td></tr>\n",
"    <tr><th>288</th><td>37.1</td><td>2.8</td><td>66.0</td><td>36.0</td><td>0.00</td><td>12</td></tr>\n",
"    <tr><th>1226</th><td>39.3</td><td>6.8</td><td>99.0</td><td>62.1</td><td>0.00</td><td>9</td></tr>\n",
"    <tr><th>2129</th><td>42.3</td><td>5.9</td><td>73.9</td><td>42.1</td><td>0.00</td><td>4</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>721 rows × 6 columns</p>\n",
"</div>"
],
"text/plain": [
" DEWP WDSP MAX MIN PRCP MONTH\n",
"241 41.9 4.7 89.1 55.0 0.00 11\n",
"994 40.8 2.5 62.1 37.9 0.00 1\n",
"1820 42.2 5.5 90.0 63.0 0.00 5\n",
"1442 50.0 4.3 73.9 52.0 0.06 4\n",
"596 39.7 2.7 66.0 39.9 0.00 12\n",
"... ... ... ... ... ... ...\n",
"849 53.0 5.5 93.0 64.0 0.00 8\n",
"208 37.2 5.6 91.9 57.9 0.00 10\n",
"288 37.1 2.8 66.0 36.0 0.00 12\n",
"1226 39.3 6.8 99.0 62.1 0.00 9\n",
"2129 42.3 5.9 73.9 42.1 0.00 4\n",
"\n",
"[721 rows x 6 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Create the test_df data and remove the target label column...\n",
"print(AQbyWeather.selectedScenario.getSummary())\n",
"test_df=validate_df.drop([AQbyWeather.mlTargetLabel], axis=1)\n",
"display(test_df)"
]
},
{
"cell_type": "markdown",
"id": "be7097cd-6b74-4824-8eec-06d526ee7f67",
"metadata": {},
"source": [
"## Run AutoGluon TabularPredictor.fit(...)\n",
"- Instantiate TabularPredictor with explicit values for label, eval_metric, and path.\n",
"- Call TabularPredictor.fit(...) using our train_df tabular data as the main input."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "499df50a-239b-4173-a91f-f86941d69b15",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Warning: path already exists! This predictor may overwrite an existing predictor! path=\"AutogluonModels/aq_bakersfield_pm25_2016-2022/\"\n",
"Presets specified: ['best_quality']\n",
"Beginning AutoGluon training ...\n",
"AutoGluon will save models to \"AutogluonModels/aq_bakersfield_pm25_2016-2022/\"\n",
"AutoGluon Version: 0.4.2\n",
"Python Version: 3.9.13\n",
"Operating System: Linux\n",
"Train Data Rows: 1463\n",
"Train Data Columns: 6\n",
"Label Column: isUnhealthy\n",
"Preprocessing data ...\n",
"AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).\n",
"\t2 unique label values: [0, 1]\n",
"\tIf 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])\n",
"Selected class <--> label mapping: class 1 = 1, class 0 = 0\n",
"Using Feature Generators to preprocess the data ...\n",
"Fitting AutoMLPipelineFeatureGenerator...\n",
"\tAvailable Memory: 15321.24 MB\n",
"\tTrain Data (Original) Memory Usage: 0.07 MB (0.0% of available memory)\n",
"\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
"\tStage 1 Generators:\n",
"\t\tFitting AsTypeFeatureGenerator...\n",
"\tStage 2 Generators:\n",
"\t\tFitting FillNaFeatureGenerator...\n",
"\tStage 3 Generators:\n",
"\t\tFitting IdentityFeatureGenerator...\n",
"\tStage 4 Generators:\n",
"\t\tFitting DropUniqueFeatureGenerator...\n",
"\tTypes of features in original data (raw dtype, special dtypes):\n",
"\t\t('float', []) : 5 | ['DEWP', 'WDSP', 'MAX', 'MIN', 'PRCP']\n",
"\t\t('int', []) : 1 | ['MONTH']\n",
"\tTypes of features in processed data (raw dtype, special dtypes):\n",
"\t\t('float', []) : 5 | ['DEWP', 'WDSP', 'MAX', 'MIN', 'PRCP']\n",
"\t\t('int', []) : 1 | ['MONTH']\n",
"\t0.0s = Fit runtime\n",
"\t6 features in original data used to generate 6 features in processed data.\n",
"\tTrain Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)\n",
"Data preprocessing and feature engineering runtime = 0.04s ...\n",
"AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n",
"\tTo change this, specify the eval_metric parameter of Predictor()\n",
"AutoGluon will fit 2 stack levels (L1 to L2) ...\n",
"Fitting 13 L1 models ...\n",
"Fitting model: KNeighborsUnif_BAG_L1 ...\n",
"\t0.7416\t = Validation score (accuracy)\n",
"\t0.0s\t = Training runtime\n",
"\t0.01s\t = Validation runtime\n",
"Fitting model: KNeighborsDist_BAG_L1 ...\n",
"\t0.7573\t = Validation score (accuracy)\n",
"\t0.0s\t = Training runtime\n",
"\t0.01s\t = Validation runtime\n",
"Fitting model: LightGBMXT_BAG_L1 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"2022-06-14 22:38:13,323\tWARNING services.py:1856 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 4294967296 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=4.77gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.\n",
"\t0.8373\t = Validation score (accuracy)\n",
"\t5.38s\t = Training runtime\n",
"\t0.04s\t = Validation runtime\n",
"Fitting model: LightGBM_BAG_L1 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8182\t = Validation score (accuracy)\n",
"\t6.25s\t = Training runtime\n",
"\t0.03s\t = Validation runtime\n",
"Fitting model: RandomForestGini_BAG_L1 ...\n",
"\t0.812\t = Validation score (accuracy)\n",
"\t0.77s\t = Training runtime\n",
"\t0.16s\t = Validation runtime\n",
"Fitting model: RandomForestEntr_BAG_L1 ...\n",
"\t0.8059\t = Validation score (accuracy)\n",
"\t0.67s\t = Training runtime\n",
"\t0.15s\t = Validation runtime\n",
"Fitting model: CatBoost_BAG_L1 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8278\t = Validation score (accuracy)\n",
"\t6.0s\t = Training runtime\n",
"\t0.05s\t = Validation runtime\n",
"Fitting model: ExtraTreesGini_BAG_L1 ...\n",
"\t0.8141\t = Validation score (accuracy)\n",
"\t0.77s\t = Training runtime\n",
"\t0.16s\t = Validation runtime\n",
"Fitting model: ExtraTreesEntr_BAG_L1 ...\n",
"\t0.8072\t = Validation score (accuracy)\n",
"\t0.69s\t = Training runtime\n",
"\t0.16s\t = Validation runtime\n",
"Fitting model: NeuralNetFastAI_BAG_L1 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8018\t = Validation score (accuracy)\n",
"\t12.3s\t = Training runtime\n",
"\t0.12s\t = Validation runtime\n",
"Fitting model: XGBoost_BAG_L1 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8148\t = Validation score (accuracy)\n",
"\t5.06s\t = Training runtime\n",
"\t0.07s\t = Validation runtime\n",
"Fitting model: NeuralNetTorch_BAG_L1 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8291\t = Validation score (accuracy)\n",
"\t17.65s\t = Training runtime\n",
"\t0.11s\t = Validation runtime\n",
"Fitting model: LightGBMLarge_BAG_L1 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8195\t = Validation score (accuracy)\n",
"\t8.32s\t = Training runtime\n",
"\t0.05s\t = Validation runtime\n",
"Fitting model: WeightedEnsemble_L2 ...\n",
"\t0.8373\t = Validation score (accuracy)\n",
"\t0.98s\t = Training runtime\n",
"\t0.0s\t = Validation runtime\n",
"Fitting 11 L2 models ...\n",
"Fitting model: LightGBMXT_BAG_L2 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8401\t = Validation score (accuracy)\n",
"\t6.48s\t = Training runtime\n",
"\t0.04s\t = Validation runtime\n",
"Fitting model: LightGBM_BAG_L2 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8483\t = Validation score (accuracy)\n",
"\t7.34s\t = Training runtime\n",
"\t0.04s\t = Validation runtime\n",
"Fitting model: RandomForestGini_BAG_L2 ...\n",
"\t0.8346\t = Validation score (accuracy)\n",
"\t0.83s\t = Training runtime\n",
"\t0.15s\t = Validation runtime\n",
"Fitting model: RandomForestEntr_BAG_L2 ...\n",
"\t0.8325\t = Validation score (accuracy)\n",
"\t0.77s\t = Training runtime\n",
"\t0.15s\t = Validation runtime\n",
"Fitting model: CatBoost_BAG_L2 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8442\t = Validation score (accuracy)\n",
"\t11.68s\t = Training runtime\n",
"\t0.04s\t = Validation runtime\n",
"Fitting model: ExtraTreesGini_BAG_L2 ...\n",
"\t0.8189\t = Validation score (accuracy)\n",
"\t0.73s\t = Training runtime\n",
"\t0.16s\t = Validation runtime\n",
"Fitting model: ExtraTreesEntr_BAG_L2 ...\n",
"\t0.8257\t = Validation score (accuracy)\n",
"\t0.68s\t = Training runtime\n",
"\t0.16s\t = Validation runtime\n",
"Fitting model: NeuralNetFastAI_BAG_L2 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8325\t = Validation score (accuracy)\n",
"\t12.15s\t = Training runtime\n",
"\t0.23s\t = Validation runtime\n",
"Fitting model: XGBoost_BAG_L2 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8476\t = Validation score (accuracy)\n",
"\t6.37s\t = Training runtime\n",
"\t0.07s\t = Validation runtime\n",
"Fitting model: NeuralNetTorch_BAG_L2 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8284\t = Validation score (accuracy)\n",
"\t11.78s\t = Training runtime\n",
"\t0.13s\t = Validation runtime\n",
"Fitting model: LightGBMLarge_BAG_L2 ...\n",
"\tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
"\t0.8401\t = Validation score (accuracy)\n",
"\t13.94s\t = Training runtime\n",
"\t0.11s\t = Validation runtime\n",
"Fitting model: WeightedEnsemble_L3 ...\n",
"\t0.8483\t = Validation score (accuracy)\n",
"\t0.93s\t = Training runtime\n",
"\t0.0s\t = Validation runtime\n",
"AutoGluon training complete, total runtime = 177.76s ... Best model: \"WeightedEnsemble_L3\"\n",
"TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"AutogluonModels/aq_bakersfield_pm25_2016-2022/\")\n"
]
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Use AutoGluon TabularPredictor to fit a model for our training data...\n",
"display(AQbyWeather.selectedScenario.getSummary()) #Using display for consistent/sequential output order.\n",
"predictor = TabularPredictor(label=AQbyWeather.mlTargetLabel, \n",
" eval_metric=AQbyWeather.mlEvalMetric, \n",
" path=AQbyWeather.selectedScenario.getModelPath())\n",
"predictor.fit(train_data=train_df, time_limit=AQbyWeather.mlTimeLimitSecs, verbosity=2, presets='best_quality')"
]
},
{
"cell_type": "markdown",
"id": "021921b8-6133-46b3-aed5-8e733e627aab",
"metadata": {},
"source": [
"## Evaluating Our AutoGluon Model...\n",
"- Use TabularPredictor.feature_importance(...) with our validate_df to get feature importance data.\\\n",
" (this data is displayed further below)\n",
"- Use TabularPredictor.leaderboard(...) with our validate_df to get individual model performance data.\\\n",
" (this data is displayed further below)\n",
"- Display TabularPredictor.evaluate(...) to output various model quality characteristics."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "ace98288-c321-4ea9-ac20-b163d9753af1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Computing feature importance via permutation shuffling for 6 features using 721 rows with 5 shuffle sets...\n",
"\t63.95s\t= Expected runtime (12.79s per shuffle set)\n",
"\t15.71s\t= Actual runtime (Completed 5 of 5 shuffle sets)\n",
"Evaluation: accuracy on test data: 0.7947295423023578\n",
"Evaluations on test data:\n",
"{\n",
" \"accuracy\": 0.7947295423023578,\n",
" \"balanced_accuracy\": 0.7923707440100882,\n",
" \"mcc\": 0.5820230438897135,\n",
" \"roc_auc\": 0.874984237074401,\n",
" \"f1\": 0.7620578778135049,\n",
" \"precision\": 0.7476340694006309,\n",
" \"recall\": 0.7770491803278688\n",
"}\n"
]
}
],
"source": [
"# Get dataframes for feature importance + model leaderboard AND get+display model evaluation...\n",
"display(AQbyWeather.selectedScenario.getSummary()) #Using display for consistent/sequential output order.\n",
"featureimp_df = predictor.feature_importance(validate_df)\n",
"leaderboard_df = predictor.leaderboard(validate_df, silent=True)\n",
"modelEvaluation = predictor.evaluate(validate_df, auxiliary_metrics=True)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "6d9a3a88-aab0-49f8-a6d4-ba20474bfd0b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n"
]
},
{
"data": {
"text/html": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# View and Plot Feature Importance... (this varies from Scenario to Scenario)\n",
"print(AQbyWeather.selectedScenario.getSummary())\n",
"display(featureimp_df)\n",
"featureimp_df.drop(columns=[\"n\"]).plot(kind=\"bar\", figsize=(12, 4), xlabel=\"feature\")"
]
},
{
"cell_type": "markdown",
"id": "7456a6b0-2236-47db-b6a8-dc11f7b3d35c",
"metadata": {},
"source": [
"## Using + Testing Our Model...\n",
"Now that we've built our AutoGluon ML model, we can use our test_df dataframe to make actual target label predictions and then compare those predictions against the known true label values to determine how the model performs for each binary class { 0=OKAY, 1=Unhealthy }. Further below, we'll plot a _Confusion Matrix_; the **Summary** explains how to interpret the results and covers **Key Learnings**."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "af1ec4ac-5e75-406b-8fe1-980ecf8d6986",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n"
]
},
{
"data": {
"text/plain": [
"241 1\n",
"994 1\n",
"1820 0\n",
"1442 0\n",
"596 1\n",
" ..\n",
"849 0\n",
"208 1\n",
"288 1\n",
"1226 0\n",
"2129 0\n",
"Name: isUnhealthy, Length: 721, dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Load + Use Our Model (this line is unnecessary, but shows how to load a built model)...\n",
"predictor = TabularPredictor.load(AQbyWeather.selectedScenario.getModelPath())\n",
"\n",
"# Make Predictions, which are returned as a pandas Series: y_pred\n",
"print(AQbyWeather.selectedScenario.getSummary())\n",
"y_pred = predictor.predict(test_df)\n",
"display(y_pred)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "efa500e0-4d0b-47be-95fd-6b45a5dea0f4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n"
]
},
{
"data": {
"text/plain": [
"241 1\n",
"994 1\n",
"1820 0\n",
"1442 0\n",
"596 1\n",
" ..\n",
"849 1\n",
"208 1\n",
"288 1\n",
"1226 0\n",
"2129 0\n",
"Name: isUnhealthy, Length: 721, dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Get true label values as a pandas Series: y_true\n",
"print(AQbyWeather.selectedScenario.getSummary())\n",
"y_true = validate_df[AQbyWeather.mlTargetLabel]\n",
"display(y_true)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "43d7e462-2462-4fb1-9b6c-c7cd6cd56554",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n",
"True Negatives (TN): 336 of 416 => 80.77%\n",
"True Positives (TP): 237 of 305 => 77.7%\n",
"False Negatives (FN): 68 of 305 => 22.3%\n",
"False Positives (FP): 80 of 416 => 19.23%\n"
]
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Get Confusion Matrix (CM), Get Additional CM Data, View CM...\n",
"# Learn More: https://towardsdatascience.com/confusion-matrix-what-is-it-e859e1bbecdc\n",
"# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html\n",
"print(AQbyWeather.selectedScenario.getSummary())\n",
"cm = confusion_matrix(y_true, y_pred)\n",
"\n",
"# Print Confusion Matrix data...\n",
"cmData = AQbyWeather.getConfusionMatrixData(cm)\n",
"print(cmData.TN_Output)\n",
"print(cmData.TP_Output)\n",
"print(cmData.FN_Output)\n",
"print(cmData.FP_Output)\n",
"\n",
"# Plot Confusion Matrix...\n",
"cmd = ConfusionMatrixDisplay(confusion_matrix=cm)\n",
"fig, ax = plt.subplots(figsize=(8, 6))\n",
"cmd.plot(ax=ax)"
]
},
{
"cell_type": "markdown",
"id": "152d9203-6987-4fbb-a33e-acef22cfb993",
"metadata": {},
"source": [
"## SUMMARY...\n",
"This demo showed how to use a SageMaker Studio Lab Jupyter Notebook to connect to open-source Amazon Sustainability Data Initiative (ASDI) datasets both via Amazon S3 (for NOAA weather data) and via HTTPS API (for OpenAQ air quality averages). The NOAA weather data and OpenAQ measurement averages were merged into a single table, with irrelevant features dropped. This table was then used to build an AutoGluon [TabularPredictor](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) ML model to predict a binary classification { 0=OKAY, 1=Unhealthy } based on weather features.\n",
"\n",
"### Key Learnings\n",
"- The relationship between weather and air quality varies geographically and this was confirmed.\n",
" - Feature Importance review shows differing factors in models built for different locations.\n",
" - Academic research indicates the relationship can change by season and an engineered “MONTH” feature was important for most models.\n",
" - An engineered “DAYOFWEEK” feature was expected to be important, but observed correlation was minimal and removing the feature improved most models.\n",
"- Predicting a target measurement was infeasible, but predicting a binary “OKAY” vs “unhealthy” classification based on threshold values yielded decent results (~75-90% accuracy metrics).\n",
"- For more on production-level air quality predictions, research how climatology and deeper statistical analysis are merged into 3D models that incorporate emissions models, meteorological models, and chemical models.\n",
"\n",
"### Confusion Matrix and Saving Final Results\n",
"The Confusion Matrix above plots the relationships between:\n",
"- True Positives (actual 1, predicted 1; see bottom-right)\n",
"- True Negatives (actual 0, predicted 0; see top-left)\n",
"- False Positives (actual 0, predicted 1; see top-right)\n",
"- False Negatives (actual 1, predicted 0; see bottom-left)\n",
"\n",
"_The validation dataframe, with predictions appended as the final column, is saved below as a CSV for additional analysis._"
]
},
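{
"cell_type": "markdown",
"id": "sketch-metrics-from-confusion",
"metadata": {},
"source": [
"The quadrant counts from the Confusion Matrix can be tied back to the metrics reported by `TabularPredictor.evaluate(...)`. The sketch below is a plain-Python illustration using the Bakersfield pm25 counts printed earlier in this run (TN=336, FP=80, FN=68, TP=237). The function name `metrics_from_counts` is hypothetical and is not part of AQbyWeather or AutoGluon.\n",
"\n",
"```python\n",
"def metrics_from_counts(tn, fp, fn, tp):\n",
"    '''Derive common binary-classification metrics from quadrant counts.'''\n",
"    total = tn + fp + fn + tp\n",
"    precision = tp / (tp + fp)\n",
"    recall = tp / (tp + fn)        # true positive rate\n",
"    specificity = tn / (tn + fp)   # true negative rate\n",
"    return {\n",
"        'accuracy': (tp + tn) / total,\n",
"        'balanced_accuracy': (recall + specificity) / 2,\n",
"        'precision': precision,\n",
"        'recall': recall,\n",
"        'f1': 2 * precision * recall / (precision + recall),\n",
"    }\n",
"\n",
"# Counts laid out as [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions.\n",
"cm_counts = [[336, 80], [68, 237]]\n",
"(tn, fp), (fn, tp) = cm_counts\n",
"for name, value in metrics_from_counts(tn, fp, fn, tp).items():\n",
"    print(f'{name}: {value:.4f}')\n",
"```\n",
"\n",
"With these counts, the derived accuracy, balanced accuracy, precision, recall, and F1 agree with the values in the evaluation output earlier in the notebook, which is a quick way to confirm that the matrix and the reported metrics describe the same predictions."
]
},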
{
"cell_type": "code",
"execution_count": 20,
"id": "c1c7d5e7-0d51-4170-b1dd-ba09b97632bf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scenario: bakersfield_pm25 => Particulate Matter < 2.5 micrometers (pm25) with UnhealthyThreshold > 12.0 µg/m³\n",
"Results saved to dataRESULTS_bakersfield_pm25_2016-2022.csv. DONE.\n"
]
},
{
"data": {
"text/html": [
"