{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# EMR on EC2 - Account Usage Report (Monthly)\n",
    "\n",
    "This notebook provides a breakdown of the costs for EMR on EC2 Clusters running in your AWS account. The dashboard generated at the end of the notebook provides the following information:\n",
    "\n",
    "- Aggregated hourly costs for all the clusters running in the account\n",
    "- AWS Costs repartitioning by service usage (Amazon EC2, Amazon EMR, EBS Storage, Data Transfer)\n",
    "- Total Costs for all EMR clusters running in the account\n",
    "- Instance Trends to determine which instances are launched by the EMR service (Instance type, family, market, architecture)\n",
    "\n",
    "To generate the analytics cost and usage dashboard, please modify the following **Configuration** section to match your environment, and run this notebook.\n",
    "\n",
    "***\n",
    "## Prerequisites\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>NOTE :</b> In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.</div>\n",
    "\n",
    "* This notebook uses the `PySpark` kernel.\n",
    "* The EMR cluster attached to this notebook should have `Spark` `JupyterEnterpriseGateway` `Livy` applications installed.\n",
    "* The EMR cluster should be configured to use the [AWS Glue Data Catalog as metastore for Spark](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html).\n",
    "***\n",
    "## Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Current year and month used to filter data\n",
    "from datetime import date\n",
    "current_date = date.today()\n",
    "spark.conf.set(\"year\", current_date.year)\n",
    "spark.conf.set(\"month\", current_date.month)\n",
    "\n",
    "# AWS Glue Database where the AWS Cost and Usage Report is located (e.g. athenacurcfn_a_w_s_costs_daily)\n",
    "spark.conf.set(\"cur_db\", \"YOUR_CUR_DATABASE_NAME\")\n",
    "\n",
    "# AWS Glue Table Name used by AWS Cost and Usage Report (e.g. aws_costs_daily)\n",
    "spark.conf.set(\"cur_table\", \"YOUR_CUR_TABLE_NAME\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data collection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%sql -o emr_detailed_hourly_costs -n -1 -q\n",
    "SELECT \n",
    "  split_part(line_item_resource_id, '/', 2) AS cluster, \n",
    "  line_item_usage_account_id AS account,\n",
    "  date_format(line_item_usage_start_date - INTERVAL 1 hours , 'yyyy-MM-dd HH:mm') AS date,\n",
    "  year(line_item_usage_start_date - INTERVAL 1 hours ) AS year, \n",
    "  month(line_item_usage_start_date - INTERVAL 1 hours ) AS month, \n",
    "  day(line_item_usage_start_date - INTERVAL 1 hours ) AS day, \n",
    "  hour(line_item_usage_start_date - INTERVAL 1 hours ) AS hour, \n",
    "  product_product_name AS service, \n",
    "  product_product_family AS family, \n",
    "  sum(line_item_unblended_cost) AS cost \n",
    "FROM ${cur_db}.${cur_table}\n",
    "WHERE \n",
    "  product_product_name = 'Amazon Elastic MapReduce' \n",
    "  AND year = '${year}'\n",
    "  AND month = '${month}'\n",
    "  AND split_part(line_item_resource_id, '/', 2) NOT IN ('null', 'applications', 'virtualclusters') \n",
    "  AND split_part(line_item_resource_id, '/', 2) <> '' \n",
    "GROUP BY \n",
    "  split_part(line_item_resource_id, '/', 2), \n",
    "  line_item_usage_account_id,\n",
    "  line_item_usage_start_date,\n",
    "  year(line_item_usage_start_date), \n",
    "  month(line_item_usage_start_date), \n",
    "  day(line_item_usage_start_date), \n",
    "  hour(line_item_usage_start_date), \n",
    "  product_product_name, \n",
    "  product_product_family \n",
    "UNION ALL \n",
    "SELECT \n",
    "  resource_tags_aws_elasticmapreduce_job_flow_id AS cluster, \n",
    "  line_item_usage_account_id AS account,\n",
    "  date_format(line_item_usage_start_date, 'yyyy-MM-dd HH:mm') AS date,\n",
    "  year(line_item_usage_start_date) AS year, \n",
    "  month(line_item_usage_start_date) AS month, \n",
    "  day(line_item_usage_start_date) AS day, \n",
    "  hour(line_item_usage_start_date) AS hour, \n",
    "  product_product_name AS service, \n",
    "  product_product_family AS family, \n",
    "  sum(line_item_unblended_cost) AS cost \n",
    "FROM ${cur_db}.${cur_table}\n",
    "WHERE \n",
    "  product_product_name = 'Amazon Elastic Compute Cloud' \n",
    "  AND year = '${year}'\n",
    "  AND month = '${month}'\n",
    "  AND resource_tags_aws_elasticmapreduce_job_flow_id NOT IN ('null', 'applications', 'virtualclusters') \n",
    "  AND resource_tags_aws_elasticmapreduce_job_flow_id <> '' \n",
    "GROUP BY \n",
    "  resource_tags_aws_elasticmapreduce_job_flow_id, \n",
    "  line_item_usage_account_id,\n",
    "  line_item_usage_start_date,\n",
    "  year(line_item_usage_start_date), \n",
    "  month(line_item_usage_start_date), \n",
    "  day(line_item_usage_start_date), \n",
    "  hour(line_item_usage_start_date), \n",
    "  product_product_name, \n",
    "  product_product_family "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%sql -o emr_instance_details -n -1 -q\n",
    "SELECT DISTINCT\n",
    "line_item_resource_id AS instance_id,\n",
    "product_physical_processor AS arch,\n",
    "product_instance_type_family AS instance_family,\n",
    "product_instance_family AS instance_detail,\n",
    "product_instance_type AS instance_type,\n",
    "product_marketoption AS market,\n",
    "1 as num\n",
    "FROM ${cur_db}.${cur_table}\n",
    "WHERE \n",
    "  product_product_name = 'Amazon Elastic Compute Cloud' \n",
    "  AND year = '${year}'\n",
    "  AND month = '${month}'\n",
    "  AND product_product_family = 'Compute Instance'\n",
    "  AND resource_tags_aws_elasticmapreduce_job_flow_id NOT IN ('null', 'applications', 'virtualclusters') \n",
    "  AND resource_tags_aws_elasticmapreduce_job_flow_id <> ''\n",
    "  AND product_instance_type_family <> ''\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Dashboard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%local\n",
    "from plotly.subplots import make_subplots\n",
    "import plotly.express as px\n",
    "import plotly.graph_objects as go\n",
    "import plotly.io as pio\n",
    "\n",
    "# Default Theme\n",
    "pio.templates[\"aws_template\"] = go.layout.Template(\n",
    "    layout_autosize=True,\n",
    "    layout_height=600\n",
    ")\n",
    "pio.templates.default = 'aws_template+gridon'\n",
    "\n",
    "# Cluster Cost Details\n",
    "account_hourly_split = emr_detailed_hourly_costs[['date', 'family', 'cost']].groupby(['date', 'family'], as_index=False)['cost'].sum()\n",
    "account_month_split = emr_detailed_hourly_costs[['month', 'family', 'cost']].groupby(['month', 'family'], as_index=False)['cost'].sum()\n",
    "account_cluster_costs = emr_detailed_hourly_costs[['cluster', 'year', 'month', 'cost']].groupby(['cluster', 'year', 'month'], as_index=False)['cost'].sum().sort_values(by='cost', ascending=False)\n",
    "\n",
    "## Hourly Cost Dashboard\n",
    "fig = px.bar(account_hourly_split, x=account_hourly_split.date, y=account_hourly_split.cost, color=account_hourly_split.family)\n",
    "fig.update_layout(title='Hourly Costs', xaxis_title=\"\", yaxis_title=\"Costs ($)\")\n",
    "fig.update_layout(legend_title_text='', legend=dict(orientation=\"h\", yanchor=\"bottom\", y=1.02, xanchor=\"right\", x=1))\n",
    "fig.show()\n",
    "\n",
    "## Costs Trends\n",
    "fig = make_subplots(rows=1, cols=3, specs=[[{\"type\": \"xy\"}, {\"type\": \"domain\"},{\"type\": \"table\"}]], subplot_titles=(\"Expensive Clusters - Top 10\", \"Split by Service\", \"All Cluster Costs\"))\n",
    "fig.add_trace(go.Bar(x=account_cluster_costs.cluster, y=account_cluster_costs.head(10).cost.round(2), text=account_cluster_costs.cost.round(2), texttemplate = \"%{y:$,.2f}\", showlegend=False, name=\"\"), row=1, col=1)\n",
    "fig.add_trace(go.Pie(labels=account_month_split.family, values=account_month_split.cost.round(2), name=\"\", texttemplate = \"<b>%{label}</b>: %{value:$,.2f} <br> %{percent}\", textposition = \"inside\"), row=1, col=2)\n",
    "fig.add_trace(go.Table(header=dict(values=['Cluster', 'Costs ($)'], align=\"left\"), cells=dict(values=[account_cluster_costs.cluster, account_cluster_costs.cost.round(2)], align='left')), row=1, col=3)\n",
    "\n",
    "fig.update_layout(height=500, showlegend=False, title='Costs Trends')\n",
    "fig.update_yaxes(title_text=\"Costs ($)\", row=1, col=1)\n",
    "fig.show()\n",
    "\n",
    "## Instance Trends\n",
    "instance_market = emr_instance_details[['market', 'num']].groupby(['market'], as_index=False)['num'].sum()\n",
    "instance_family = emr_instance_details[['instance_family', 'instance_type', 'num']].groupby(['instance_family', 'instance_type'], as_index=False)['num'].sum()\n",
    "instance_arch = emr_instance_details[['arch', 'num']].groupby(['arch'], as_index=False)['num'].sum()\n",
    "instance_desc = emr_instance_details[['instance_detail', 'num']].groupby(['instance_detail'], as_index=False)['num'].sum()\n",
    "\n",
    "fig = px.bar(instance_family, x=instance_family.instance_type, y=instance_family.num)\n",
    "fig.update_layout(title='Instance Trends', xaxis_title=\"\", yaxis_title=\"Usage\")\n",
    "fig.update_layout(legend_title_text='', legend=dict(orientation=\"h\", yanchor=\"bottom\", y=1.02, xanchor=\"center\", x=1))\n",
    "fig.show()\n",
    "\n",
    "fig = make_subplots(rows=1, cols=3, specs=[[{\"type\": \"domain\"}, {\"type\": \"domain\"}, {\"type\": \"domain\"}]], subplot_titles=(\"Processor\", \"Market\", \"Families\"))\n",
    "fig.add_trace(go.Pie(labels=instance_arch.arch, values=instance_arch.num, name=\"\", texttemplate = \"<b>%{label}</b>: %{percent}\", textposition = \"inside\"), row=1, col=1)\n",
    "fig.add_trace(go.Pie(labels=instance_market.market, values=instance_market.num, name=\"\", texttemplate = \"<b>%{label}</b>: %{percent}\", textposition = \"inside\"), row=1, col=2)\n",
    "fig.add_trace(go.Pie(labels=instance_desc.instance_detail, values=instance_desc.num, name=\"\", texttemplate = \"<b>%{label}</b>: %{percent}\", textposition = \"inside\"), row=1, col=3)\n",
    "fig.update_layout(height=500, showlegend=False)\n",
    "fig.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "PySpark",
   "language": "python",
   "name": "pysparkkernel"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "python",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "pyspark",
   "pygments_lexer": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}