{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# EMR on EC2 - Account Usage Report (Monthly)\n", "\n", "This notebook provides a breakdown of the costs for EMR on EC2 Clusters running in your AWS account. The dashboard generated at the end of the notebook provides the following information:\n", "\n", "- Aggregated hourly costs for all the clusters running in the account\n", "- AWS Costs repartitioning by service usage (Amazon EC2, Amazon EMR, EBS Storage, Data Transfer)\n", "- Total Costs for all EMR clusters running in the account\n", "- Instance Trends to determine which instances are launched by the EMR service (Instance type, family, market, architecture)\n", "\n", "To generate the analytics cost and usage dashboard, please modify the following **Configuration** section to match your environment, and run this notebook.\n", "\n", "***\n", "## Prerequisites\n", "
\n", "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n", "\n", "* This notebook uses the `PySpark` kernel.\n", "* The EMR cluster attached to this notebook should have `Spark` `JupyterEnterpriseGateway` `Livy` applications installed.\n", "* The EMR cluster should be configured to use the [AWS Glue Data Catalog as metastore for Spark](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html).\n", "***\n", "## Configuration" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Current year and month used to filter data\n", "from datetime import date\n", "current_date = date.today()\n", "spark.conf.set(\"year\", current_date.year)\n", "spark.conf.set(\"month\", current_date.month)\n", "\n", "# AWS Glue Database where the AWS Cost and Usage Report is located (e.g. athenacurcfn_a_w_s_costs_daily)\n", "spark.conf.set(\"cur_db\", \"YOUR_CUR_DATABASE_NAME\")\n", "\n", "# AWS Glue Table Name used by AWS Cost and Usage Report (e.g. aws_costs_daily)\n", "spark.conf.set(\"cur_table\", \"YOUR_CUR_TABLE_NAME\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data collection" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%sql -o emr_detailed_hourly_costs -n -1 -q\n", "SELECT \n", " split_part(line_item_resource_id, '/', 2) AS cluster, \n", " line_item_usage_account_id AS account,\n", " date_format(line_item_usage_start_date - INTERVAL 1 hours , 'yyyy-MM-dd HH:mm') AS date,\n", " year(line_item_usage_start_date - INTERVAL 1 hours ) AS year, \n", " month(line_item_usage_start_date - INTERVAL 1 hours ) AS month, \n", " day(line_item_usage_start_date - INTERVAL 1 hours ) AS day, \n", " hour(line_item_usage_start_date - INTERVAL 1 hours ) AS hour, \n", " product_product_name AS service, \n", " product_product_family AS family, \n", " sum(line_item_unblended_cost) AS cost \n", "FROM ${cur_db}.${cur_table}\n", "WHERE \n", " product_product_name = 'Amazon Elastic MapReduce' \n", " AND year = '${year}'\n", " AND month = '${month}'\n", " AND split_part(line_item_resource_id, '/', 2) NOT IN ('null', 'applications', 'virtualclusters') \n", " AND split_part(line_item_resource_id, '/', 2) <> '' \n", "GROUP BY \n", " split_part(line_item_resource_id, '/', 2), \n", " line_item_usage_account_id,\n", " line_item_usage_start_date,\n", " year(line_item_usage_start_date), \n", " month(line_item_usage_start_date), \n", " day(line_item_usage_start_date), \n", " hour(line_item_usage_start_date), \n", " product_product_name, \n", " product_product_family \n", "UNION ALL \n", "SELECT \n", " resource_tags_aws_elasticmapreduce_job_flow_id AS cluster, \n", " line_item_usage_account_id AS account,\n", " date_format(line_item_usage_start_date, 'yyyy-MM-dd HH:mm') AS date,\n", " year(line_item_usage_start_date) AS year, \n", " month(line_item_usage_start_date) AS month, \n", " day(line_item_usage_start_date) AS day, \n", " hour(line_item_usage_start_date) AS hour, \n", " product_product_name AS service, \n", " product_product_family AS family, \n", " sum(line_item_unblended_cost) AS cost \n", "FROM ${cur_db}.${cur_table}\n", "WHERE \n", " product_product_name = 'Amazon Elastic Compute Cloud' \n", " AND year = '${year}'\n", " AND month = '${month}'\n", " AND resource_tags_aws_elasticmapreduce_job_flow_id NOT IN ('null', 'applications', 'virtualclusters') \n", " AND resource_tags_aws_elasticmapreduce_job_flow_id <> '' \n", "GROUP BY \n", " resource_tags_aws_elasticmapreduce_job_flow_id, \n", " line_item_usage_account_id,\n", " line_item_usage_start_date,\n", " year(line_item_usage_start_date), \n", " month(line_item_usage_start_date), \n", " day(line_item_usage_start_date), \n", " hour(line_item_usage_start_date), \n", " product_product_name, \n", " product_product_family " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%sql -o emr_instance_details -n -1 -q\n", "SELECT DISTINCT\n", "line_item_resource_id AS instance_id,\n", "product_physical_processor AS arch,\n", "product_instance_type_family AS instance_family,\n", "product_instance_family AS instance_detail,\n", "product_instance_type AS instance_type,\n", "product_marketoption AS market,\n", "1 as num\n", "FROM ${cur_db}.${cur_table}\n", "WHERE \n", " product_product_name = 'Amazon Elastic Compute Cloud' \n", " AND year = '${year}'\n", " AND month = '${month}'\n", " AND product_product_family = 'Compute Instance'\n", " AND resource_tags_aws_elasticmapreduce_job_flow_id NOT IN ('null', 'applications', 'virtualclusters') \n", " AND resource_tags_aws_elasticmapreduce_job_flow_id <> ''\n", " AND product_instance_type_family <> ''\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Dashboard" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%local\n", "from plotly.subplots import make_subplots\n", "import plotly.express as px\n", "import plotly.graph_objects as go\n", "import plotly.io as pio\n", "\n", "# Default Theme\n", "pio.templates[\"aws_template\"] = go.layout.Template(\n", " layout_autosize=True,\n", " layout_height=600\n", ")\n", "pio.templates.default = 'aws_template+gridon'\n", "\n", "# Cluster Cost Details\n", "account_hourly_split = emr_detailed_hourly_costs[['date', 'family', 'cost']].groupby(['date', 'family'], as_index=False)['cost'].sum()\n", "account_month_split = emr_detailed_hourly_costs[['month', 'family', 'cost']].groupby(['month', 'family'], as_index=False)['cost'].sum()\n", "account_cluster_costs = emr_detailed_hourly_costs[['cluster', 'year', 'month', 'cost']].groupby(['cluster', 'year', 'month'], as_index=False)['cost'].sum().sort_values(by='cost', ascending=False)\n", "\n", "## Hourly Cost Dashboard\n", "fig = px.bar(account_hourly_split, x=account_hourly_split.date, y=account_hourly_split.cost, color=account_hourly_split.family)\n", "fig.update_layout(title='Hourly Costs', xaxis_title=\"\", yaxis_title=\"Costs ($)\")\n", "fig.update_layout(legend_title_text='', legend=dict(orientation=\"h\", yanchor=\"bottom\", y=1.02, xanchor=\"right\", x=1))\n", "fig.show()\n", "\n", "## Costs Trends\n", "fig = make_subplots(rows=1, cols=3, specs=[[{\"type\": \"xy\"}, {\"type\": \"domain\"},{\"type\": \"table\"}]], subplot_titles=(\"Expensive Clusters - Top 10\", \"Split by Service\", \"All Cluster Costs\"))\n", "fig.add_trace(go.Bar(x=account_cluster_costs.cluster, y=account_cluster_costs.head(10).cost.round(2), text=account_cluster_costs.cost.round(2), texttemplate = \"%{y:$,.2f}\", showlegend=False, name=\"\"), row=1, col=1)\n", "fig.add_trace(go.Pie(labels=account_month_split.family, values=account_month_split.cost.round(2), name=\"\", texttemplate = \"%{label}: %{value:$,.2f}
%{percent}\", textposition = \"inside\"), row=1, col=2)\n", "fig.add_trace(go.Table(header=dict(values=['Cluster', 'Costs ($)'], align=\"left\"), cells=dict(values=[account_cluster_costs.cluster, account_cluster_costs.cost.round(2)], align='left')), row=1, col=3)\n", "\n", "fig.update_layout(height=500, showlegend=False, title='Costs Trends')\n", "fig.update_yaxes(title_text=\"Costs ($)\", row=1, col=1)\n", "fig.show()\n", "\n", "## Instance Trends\n", "instance_market = emr_instance_details[['market', 'num']].groupby(['market'], as_index=False)['num'].sum()\n", "instance_family = emr_instance_details[['instance_family', 'instance_type', 'num']].groupby(['instance_family', 'instance_type'], as_index=False)['num'].sum()\n", "instance_arch = emr_instance_details[['arch', 'num']].groupby(['arch'], as_index=False)['num'].sum()\n", "instance_desc = emr_instance_details[['instance_detail', 'num']].groupby(['instance_detail'], as_index=False)['num'].sum()\n", "\n", "fig = px.bar(instance_family, x=instance_family.instance_type, y=instance_family.num)\n", "fig.update_layout(title='Instance Trends', xaxis_title=\"\", yaxis_title=\"Usage\")\n", "fig.update_layout(legend_title_text='', legend=dict(orientation=\"h\", yanchor=\"bottom\", y=1.02, xanchor=\"center\", x=1))\n", "fig.show()\n", "\n", "fig = make_subplots(rows=1, cols=3, specs=[[{\"type\": \"domain\"}, {\"type\": \"domain\"}, {\"type\": \"domain\"}]], subplot_titles=(\"Processor\", \"Market\", \"Families\"))\n", "fig.add_trace(go.Pie(labels=instance_arch.arch, values=instance_arch.num, name=\"\", texttemplate = \"%{label}: %{percent}\", textposition = \"inside\"), row=1, col=1)\n", "fig.add_trace(go.Pie(labels=instance_market.market, values=instance_market.num, name=\"\", texttemplate = \"%{label}: %{percent}\", textposition = \"inside\"), row=1, col=2)\n", "fig.add_trace(go.Pie(labels=instance_desc.instance_detail, values=instance_desc.num, name=\"\", texttemplate = \"%{label}: %{percent}\", textposition = \"inside\"), row=1, col=3)\n", "fig.update_layout(height=500, showlegend=False)\n", "fig.show()" ] } ], "metadata": { "kernelspec": { "display_name": "PySpark", "language": "python", "name": "pysparkkernel" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "pyspark", "pygments_lexer": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }