{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Orchestration with Step Functions\n", "\n", "In this workshop we will automate many of the steps that were run manually in the `ny-taxi-right-tool.ipynb` and `ny-taxi-streaming.ipynb` notebooks. The way we will accomplish this is through the use of step functions to build a state machine that will execute an AWS Glue job and once finished run a Glue Crawler to update the schema and partitions. A common scenario will be used of a scheduled file drop in a raw location in your data lake that will trigger an S3 event to start the orchetration process. \n", "\n", "![Glue Orchestration](../../docs/assets/images/glue_orch.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import botocore\n", "import json\n", "import time\n", "import os\n", "import pandas as pd\n", "\n", "import project_path # path to helper methods\n", "from lib import workshop\n", "from pandas import read_sql\n", "\n", "glue = boto3.client('glue')\n", "s3 = boto3.resource('s3')\n", "s3_client = boto3.client('s3')\n", "cfn = boto3.client('cloudformation')\n", "lambda_client = boto3.client('lambda')\n", "\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "\n", "workshop_user = 'datalake'\n", "database_name = 'taxi_' + workshop_user # AWS Glue Data Catalog Database Name\n", "table_name = 'yellow'\n", "email = 'user@example.com' #Replace with your own email address\n", "\n", "s3_output_key_prefix = 'datalake/curated/yellow_parq/' # Output location of the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create S3 Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html)\n", "\n", "We will create an S3 bucket that will be used throughout the workshop for storing our data.\n", "\n", "[s3.create_bucket](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_bucket) boto3 documentation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucket = workshop.create_bucket(region, session, 'taxi-')\n", "print(bucket)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discover the data in your Data Lake\n", "\n", "In this next section we will be using [AWS Glue](https://aws.amazon.com/glue/) to discover, catalog, and transform your data. Glue currently only supports `Python 2.7`, hence we'll write the script in `Python 2.7`.\n", "\n", "### Permission setup for invoking AWS Glue from this Notebook\n", "In order to enable this Notebook to run AWS Glue jobs, we need to add one additional permission to the default execution role of this notebook. We will be using SageMaker Python SDK to retrieve the default execution role and then you have to go to [IAM Dashboard](https://console.aws.amazon.com/iam/home) to edit the Role to add AWS Glue specific permission. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding out the current execution role of the Notebook\n", "We are using SageMaker Python SDK to retrieve the current role for this Notebook which needs to be enhanced to support the functionality in AWS Glue." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import SageMaker Python SDK to get the Session and execution_role\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "sess = sagemaker.Session()\n", "role = get_execution_role()\n", "role_name = role[role.rfind('/') + 1:]\n", "print(role_name)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding AWS Glue as an additional trusted entity to this role\n", "This step is needed if you want to pass the execution role of this Notebook while calling Glue APIs as well without creating an additional **Role**. If you have not used AWS Glue before, then this step is mandatory. \n", "\n", "If you have used AWS Glue previously, then you should have an already existing role that can be used to invoke Glue APIs. In that case, you can pass that role while calling Glue (later in this notebook) and skip this next step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the IAM dashboard, please click on **Roles** on the left sidenav and search for this Role. Once the Role appears, click on the Role to go to its **Summary** page. Click on the **Trust relationships** tab on the **Summary** page to add AWS Glue as an additional trusted entity. \n", "\n", "Click on **Edit trust relationship** and replace the JSON with this JSON.\n", "```\n", "{\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": [\n", " \"sagemaker.amazonaws.com\",\n", " \"glue.amazonaws.com\"\n", " ]\n", " },\n", " \"Action\": \"sts:AssumeRole\"\n", " }\n", " ]\n", "}\n", "```\n", "Once this is complete, click on **Update Trust Policy** and you are done.\n", "\n", "![IAM Roles](../../docs/assets/images/iam_roles_hl.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"https://console.aws.amazon.com/iam/home?region={0}#/roles/{1}\".format(region, role_name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Copy Sample Data to S3 bucket](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html) \n", "\n", "We will download some files from New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset available on the [AWS Open Data Registry](https://registry.opendata.aws/nyc-tlc-trip-records-pds/).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 cp s3://nyc-tlc/trip\\ data/yellow_tripdata_2017-01.csv s3://$bucket/datalake/raw/yellow/\n", "!aws s3 cp s3://nyc-tlc/trip\\ data/yellow_tripdata_2017-02.csv s3://$bucket/datalake/raw/yellow/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the [AWS Glue Catalog Database](https://docs.aws.amazon.com/glue/latest/dg/define-database.html)\n", "\n", "When you define a table in the AWS Glue Data Catalog, you add it to a database. A database is used to organize tables in AWS Glue. You can organize your tables using a crawler or using the AWS Glue console. A table can be in only one database at a time.\n", "\n", "There is a central Glue Catalog for each AWS account. 
When creating the database you will use your account id declared above as `account_id`\n", "\n", "[glue.create_database](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.create_database)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "workshop.create_db(glue, account_id, database_name, 'New York City Taxi and Limousine Commission (TLC) Trip Record Data')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use a [Glue Crawler](https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html) to Discover the transformed data\n", "\n", "You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. You add a crawler within your Data Catalog to traverse your data stores. The output of the crawler consists of one or more metadata tables that are defined in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these metadata tables as sources and targets.\n", "\n", "A crawler can crawl both file-based and table-based data stores. Crawlers can crawl the following data stores:\n", "\n", "* Amazon Simple Storage Service (Amazon S3)\n", " * [Built-in Classifiers](https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html#classifier-built-in)\n", " * [Custom Classifiers](https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html)\n", "* Amazon Redshift\n", "* Amazon Relational Database Service (Amazon RDS)\n", " * Amazon Aurora\n", " * MariaDB\n", " * Microsoft SQL Server\n", " * MySQL\n", " * Oracle\n", " * PostgreSQL\n", "* Amazon DynamoDB\n", "* Publicly accessible databases [Blog](https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/)\n", " * Aurora\n", " * MariaDB\n", " * SQL Server\n", " * MySQL\n", " * Oracle\n", " * PostgreSQL\n", "\n", "[glue.create_crawler](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.create_crawler)\n", "\n", "### Create the raw data crawler" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "crawler_name = 'NY-Taxi-Orch-Crawler-{0}'.format(workshop_user)\n", "raw_prefix = 'datalake/raw/yellow/'\n", "crawler_path = 's3://{0}/{1}'.format(bucket, raw_prefix)\n", "\n", "response = glue.create_crawler(\n", " Name=crawler_name,\n", " Role=role,\n", " DatabaseName=database_name,\n", " Description='Crawler for NY Taxi Data',\n", " Targets={\n", " 'S3Targets': [\n", " {\n", " 'Path': crawler_path\n", " }\n", " ]\n", " },\n", " SchemaChangePolicy={\n", " 'UpdateBehavior': 'UPDATE_IN_DATABASE',\n", " 'DeleteBehavior': 'DEPRECATE_IN_DATABASE'\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the curated crawler" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parq_crawler_name = 'NY-Taxi-Orch-Parq-Crawler-{0}'.format(workshop_user)\n", "parq_crawler_path = 's3://'+bucket+'/datalake/curated/yellow_parq'\n", "\n", "response = glue.create_crawler(\n", " Name=parq_crawler_name,\n", " Role=role,\n", " DatabaseName=database_name,\n", " Description='Crawler for Curated NY Taxi Data',\n", " Targets={\n", " 'S3Targets': [\n", " {\n", " 'Path': parq_crawler_path\n", " }\n", " ]\n", " },\n", " SchemaChangePolicy={\n", " 'UpdateBehavior': 'UPDATE_IN_DATABASE',\n", " 'DeleteBehavior': 'DEPRECATE_IN_DATABASE'\n", " }\n", ")" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "### Start the Raw Data Glue Crawler\n", "\n", "We will start with populating the Glue Data Catalog with the raw database of NY Taxi data. \n", "\n", "You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. You add a crawler within your Data Catalog to traverse your data stores. The output of the crawler consists of one or more metadata tables that are defined in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these metadata tables as sources and targets.\n", "\n", "[glue.start_crawler](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.start_crawler)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = glue.start_crawler(\n", " Name=crawler_name\n", ")\n", "\n", "print (\"Crawler: https://{0}.console.aws.amazon.com/glue/home?region={0}#crawler:name={1}\".format(region, crawler_name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Monitor the status of the Parquet crawler" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "crawler_status = glue.get_crawler(Name=crawler_name)['Crawler']['State']\n", "\n", "while crawler_status not in ('READY'):\n", " crawler_status = glue.get_crawler(Name=crawler_name)['Crawler']['State']\n", " print(crawler_status)\n", " time.sleep(30)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('https://{0}.console.aws.amazon.com/glue/home?region={0}#database:name={1}'.format(region, database_name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Parquet version of the yellow CSV table\n", "\n", "From [Wikipedia](https://en.wikipedia.org/wiki/Apache_Parquet), \"Apache Parquet is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop namely RCFile and ORC. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.\"\n", "\n", "The key points in this code is how easy it is to get access to the AWS Glue Data Catalog leveraging the [Glue libraries](https://github.com/awslabs/aws-glue-libs). Some of the key concepts are below:\n", "\n", "* [`glueContext.create_dynamic_frame.from_catalog`](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog) - Read table metadata from the Glue Data Catalog using Glue libs to load tables into the pyspark job.\n", "* Writing back S3 [`glueContext.write_dynamic_frame.from_options`](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_catalog) with options:\n", " * Convert data to different format `format=\"parquet\"`. 
This format is [columnar](https://docs.aws.amazon.com/athena/latest/ug/columnar-storage.html) and provides [Snappy](https://en.wikipedia.org/wiki/Snappy_(compression)) compression by default.\n", " \n", "You can find more best practices for Glue and Athena [here](https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile yellow_parquet_orch_etl.py\n", "\n", "import sys\n", "import os\n", "from awsglue.transforms import *\n", "from awsglue.utils import getResolvedOptions\n", "from pyspark.context import SparkContext\n", "from awsglue.context import GlueContext\n", "from awsglue.job import Job\n", "\n", "from pyspark.context import SparkContext\n", "from pyspark.sql.functions import *\n", "from awsglue.dynamicframe import DynamicFrame\n", "\n", "## @params: [JOB_NAME]\n", "args = getResolvedOptions(sys.argv, ['JOB_NAME', 'S3_OUTPUT_BUCKET', 'S3_OUTPUT_KEY_PREFIX', 'DATABASE_NAME', 'TABLE_NAME', 'REGION'])\n", "\n", "sc = SparkContext()\n", "glueContext = GlueContext(sc)\n", "spark = glueContext.spark_session\n", "job = Job(glueContext)\n", "job.init(args['JOB_NAME'], args)\n", "## @type: DataSource\n", "## @args: [database = \"taxi\", table_name = \"yellow\", transformation_ctx = \"datasource0\"]\n", "## @return: datasource0\n", "## @inputs: []\n", "datasource0 = glueContext.create_dynamic_frame.from_catalog(database=args['DATABASE_NAME'], table_name=args['TABLE_NAME'], transformation_ctx = \"datasource0\")\n", "## @type: ResolveChoice\n", "## @args: [choice = \"make_struct\", transformation_ctx = \"resolvechoice1\"]\n", "## @return: resolvechoice1\n", "## @inputs: [frame = datasource0]\n", "resolvechoice1 = ResolveChoice.apply(frame = datasource0, choice = \"make_struct\", transformation_ctx = \"resolvechoice1\")\n", "## @type: DropNullFields\n", "## @args: [transformation_ctx = \"dropnullfields2\"]\n", "## @return: dropnullfields2\n", "## @inputs: [frame = resolvechoice1]\n", "dropnullfields2 = DropNullFields.apply(frame = resolvechoice1, transformation_ctx = \"dropnullfields2\")\n", "\n", "addcolumn_df = dropnullfields2.toDF()\n", "\n", "addcolumn_df \\\n", " .withColumn(\"pu_datetime\", to_timestamp(col(\"tpep_pickup_datetime\"), \"yyyy-MM-dd HH:mm:ss\")) \\\n", " .withColumn(\"do_datetime\", to_timestamp(col(\"tpep_dropoff_datetime\"), \"yyyy-MM-dd HH:mm:ss\")) \\\n", " .withColumn(\"pu_year\", year(\"pu_datetime\")) \\\n", " .withColumn(\"pu_month\", month(\"pu_datetime\")) \\\n", " .withColumn(\"pu_day\", dayofmonth(\"pu_datetime\"))\n", "\n", "parquet_output_path = 's3://' + os.path.join(args['S3_OUTPUT_BUCKET'], args['S3_OUTPUT_KEY_PREFIX'])\n", "print(parquet_output_path)\n", "\n", "writeframe = DynamicFrame.fromDF(addcolumn_df, glueContext, \"writeframe\")\n", "## @type: DataSink\n", "## @args: [connection_type = \"s3\", connection_options = {\"path\": \"\"}, format = \"parquet\", transformation_ctx = \"datasink3\"]\n", "## @return: datasink3\n", "## @inputs: [frame = dropnullfields2]\n", "datasink3 = glueContext.write_dynamic_frame.from_options(frame = writeframe, connection_type = \"s3\", connection_options = {\"path\": parquet_output_path}, format = \"parquet\", transformation_ctx = \"datasink4\")\n", "job.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload the ETL script to S3\n", "We will be uploading the `yellow_parquet_orch_etl` script to S3 so Glue can use it to run the PySpark job. 
You can replace it with your own script if needed. If your code has multiple files, you need to zip those files and upload to S3 instead of uploading a single file like it's being done here." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "script_location = sess.upload_data(path='yellow_parquet_orch_etl.py', bucket=bucket, key_prefix='codes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Authoring jobs with AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/author-job.html)\n", "\n", "Next we'll be creating Glue client via Boto so that we can invoke the `create_job` API of Glue. `create_job` API will create a job definition which can be used to execute your jobs in Glue. The job definition created here is mutable. While creating the job, we are also passing the code location as well as the dependencies location to Glue.\n", "\n", "`AllocatedCapacity` parameter controls the hardware resources that Glue will use to execute this job. It is measures in units of `DPU`. For more information on `DPU`, please see [here](https://docs.aws.amazon.com/glue/latest/dg/add-job.html).\n", "\n", "[glue.create_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.create_job)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime\n", "import time\n", "\n", "timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "\n", "job_name = 'ny-yellow-parquet-orch-' + timestamp_prefix\n", "response = glue.create_job(\n", " Name=job_name,\n", " Description='PySpark job to convert yellow taxi csv data to parquet',\n", " Role=role, # you can pass your existing AWS Glue role here if you have used Glue before\n", " ExecutionProperty={\n", " 'MaxConcurrentRuns': 1\n", " },\n", " Command={\n", " 'Name': 'glueetl',\n", " 'ScriptLocation': script_location\n", " },\n", " DefaultArguments={\n", " '--job-language': 'python',\n", " '--job-bookmark-option': 'job-bookmark-enable'\n", " },\n", " AllocatedCapacity=5,\n", " Timeout=60,\n", ")\n", "glue_job_name = response['Name']\n", "print(glue_job_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build Lambda functions for orchestration" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p cfn/start_crawler\n", "!mkdir -p cfn/poller" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Lambda function to start Glue crawler\n", "\n", "This function will be triggered to start the crawler process when the Glue job has finished. It will crawl new data created from the job and update the schema and partitions of the parquet table." 
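, "\n", "\n",
"Note that `glue.start_crawler` raises a `CrawlerRunningException` if the crawler is already running, which is why the function below checks the crawler state first. An equivalent pattern (a sketch you could also run from this notebook once the crawlers exist) is to attempt the start and catch that error:\n", "\n",
"```python\n",
"# Alternative pattern (sketch): try the start and handle the 'already running' case\n",
"try:\n",
"    glue.start_crawler(Name=parq_crawler_name)\n",
"    print('Crawler started')\n",
"except glue.exceptions.CrawlerRunningException:\n",
"    print('Crawler is already running; skipping')\n",
"```"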
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile cfn/start_crawler/start_crawler.py\n", "from __future__ import print_function\n", "import datetime\n", "import json\n", "import boto3\n", "\n", "glue = boto3.client('glue')\n", "\n", "def handler(event, context):\n", " # Log the received event\n", " print(\"Received event: \" + json.dumps(event, indent=2))\n", " # Get jobId from the event\n", " crawler_name = event['crawler_name']\n", " try:\n", " response = glue.get_crawler(Name=crawler_name)\n", " event['status'] = \"UNKNOWN\"\n", " \n", " if response['Crawler']['State'] == 'READY': \n", " print(\"Execute crawler: \" + crawler_name)\n", " \n", " # Call Start Crawler\n", " response = glue.start_crawler(Name=crawler_name)\n", " # Return the jobtatus\n", " event['status'] = \"STARTED\"\n", " else:\n", " event['status'] = \"FAILED\"\n", " event['status_message'] = 'Crawler in use' \n", " except Exception as e:\n", " print(e)\n", " message = 'Error starting Glue Crawler'\n", " print(message)\n", " event['status'] = \"FAILED\"\n", " event['status_message'] = message + ' ' + str(e)\n", " \n", " return event" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Lambda function to poll crawler status\n", "\n", "Crawler's are not a natively supported service with step functions, therefore we will need to monitor the activity of the crawler looking for it to finish. This function gets the crawler status and returns the avlue to the step function to proceed accordingly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile cfn/poller/poll_crawler_status.py\n", "from __future__ import print_function\n", "import datetime\n", "import json\n", "import boto3\n", "from datetime import timedelta \n", "import os\n", "\n", "glue = boto3.client('glue')\n", "\n", "def handler(event, context):\n", " # Log the received event\n", " print(\"RECEIVED EVENT: \" + json.dumps(event, indent=2))\n", " crawler_name = event['crawler_name']\n", " \n", " try:\n", " response = glue.get_crawler(Name=crawler_name)\n", " event['status'] = response['Crawler']['State']\n", " except Exception as e:\n", " print(e)\n", " message = 'Error getting Glue Job status'\n", " print(message)\n", " event['status'] = \"UNKNOWN\"\n", " \n", " print(\"RESPONSE EVENT: \" + json.dumps(event, indent=2))\n", " return event" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create orchestration with [CloudFormation](https://aws.amazon.com/cloudformation/)\n", "\n", "In the interest of time, we will use CloudFormation to create the state machine and hook up the required events. You can view the template below to get a better understanding of how CloudFormation works." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat cfn/sf_orch_glue.yaml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Package deployment](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-cli-package.html)\n", "\n", "For some resource properties that require an Amazon S3 location (a bucket name and filename), you can specify local references instead. For example, you might specify the S3 location of your AWS Lambda function's source code or an Amazon API Gateway REST API's OpenAPI (formerly Swagger) file. 
Instead of manually uploading the files to an S3 bucket and then adding the location to your template, you can specify local references, called local artifacts, in your template and then use the package command to quickly upload them. A local artifact is a path to a file or folder that the package command uploads to Amazon S3. For example, an artifact can be a local path to your AWS Lambda function's source code or an Amazon API Gateway REST API's OpenAPI file.\n", "\n", "With the yaml file created we can now `package` out CloudFormation template to prepare it for deployment and finally call the `deploy` function on the CloudFormation service to build our API service for out Inference endpoint." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws cloudformation package \\\n", " --template-file cfn/sf_orch_glue.yaml \\\n", " --output-template-file orch_packaged.yaml \\\n", " --s3-bucket $bucket --s3-prefix cfn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Deploy Application](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-cli-deploy.html)\n", "\n", "AWS CloudFormation requires you to use a change set to create a template that includes transforms. Instead of independently creating and then executing a change set, use the `aws cloudformation deploy` command. When you run this command, it creates a change set, executes the change set, and then terminates. This command reduces the numbers of required steps when you create or update a stack that includes transforms." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stack_name = 'glue-orch-{0}'.format(workshop_user)\n", "\n", "!aws cloudformation deploy \\\n", " --template-file orch_packaged.yaml \\\n", " --stack-name $stack_name \\\n", " --capabilities CAPABILITY_IAM \\\n", " --parameter-overrides NotificationEmail=$email" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Get Outputs of the CloudFormation template](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/outputs-section-structure.html)\n", "\n", "The optional `Outputs` section declares output values that you can import into other stacks, return in response, or view on the AWS CloudFormation console. We provide outputs for the `Name` and `ARN`s for the required resources." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = cfn.describe_stacks(StackName=stack_name)\n", "\n", "outputs = response[\"Stacks\"][0][\"Outputs\"]\n", "\n", "for output in response['Stacks'][0]['Outputs']:\n", " if (output['OutputKey'] == 'StepFunctionsStateMachine'):\n", " sf_arn = output['OutputValue']\n", " if (output['OutputKey'] == 'StepFunctionsAPIUrl'):\n", " api = output['OutputValue']\n", " \n", "pd.set_option('display.max_colwidth', -1)\n", "pd.DataFrame(outputs, columns=[\"OutputKey\", \"OutputValue\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the request object for the API call to start the Glue Job\n", "\n", "Now that the CloudFormation template is complete we can trigger the first run of the orchestration. The Step Function is proxied by API gateway to simplify calling the orchestration from 3rd party applications. 
[API Gateway](https://aws.amazon.com/api-gateway/) allows developers to quickly build REST API that are secure an scalable creating a \"front door\" to services with AWS.\n", "\n", "The step function orchestration requires the below JSON input." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "input = {\n", " \"job_name\": glue_job_name,\n", " \"arguments\": {\n", " \"region\": region,\n", " \"database_name\": database_name,\n", " \"table_name\": table_name,\n", " \"s3_output_bucket\": bucket,\n", " \"s3_output_key_prefix\": s3_output_key_prefix\n", " },\n", " \"crawler_name\": parq_crawler_name,\n", " \"wait_time\": 60,\n", " \"name\": \"IngestionOrchestration\",\n", " \"stateMachineArn\": sf_arn\n", "}\n", "\n", "payload = json.dumps(input, indent=4)\n", "post_val = \"'{0}'\".format(payload)\n", "\n", "print(post_val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Trigger first Glue Job\n", "\n", "Simply using `curl` we can make the required `POST` call of the API to start orchestration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!curl -X POST -d $post_val $api --header \"Content-Type: application/json\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Watch orchestration\n", "\n", "With the link below find the `executionArn` above to see the status of the state machine. You can follow the orchestration to completion in the UI as well as get the status from the CLI and SDKs. Once complete we should now be able to query the newly added data in your data lake.\n", "\n", "Once you click on the job run you will notice the steps to start and monitor the Glue job, run the curated crawler to populate new partitions and update schema changes and complete the job run. If at any point in the process the orchestration fails it will submit a notification to the email address used above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('https://{0}.console.aws.amazon.com/states/home?region={0}#/statemachines/view/{1}'.format(region, sf_arn)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parquet processed data set\n", "\n", "Once you hit the link below click the `tables` link and see there is a new `yellow_parq` table that is utilizing the parquet schema." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('https://{0}.console.aws.amazon.com/glue/home?region={0}#database:name={1}'.format(region, database_name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Query the Data Lake with Athena](https://aws.amazon.com/athena/)\n", "\n", "For the self-serve end users that need the ability to create ad-hoc queries against the data Athena is a great choice the utilizes Presto and ANSI SQL to query a number of file formats on S3.\n", "\n", "To query the tables created by the crawler we will be installing a python library for querying the data in the Glue Data Catalog with Athena. For more information jump to [PyAthena](https://pypi.org/project/PyAthena/). You can also use the AWS console by browsing to the Athena service and run queries through the browser. Alternatively, you can also use the [JDBC/ODBC](https://docs.aws.amazon.com/athena/latest/ug/athena-bi-tools-jdbc-odbc.html) drivers available." 
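, "\n", "\n",
"If you would rather not install an extra library, you can issue the same kind of query through the Athena API with boto3; a minimal sketch that starts a query, waits for it to finish, and prints the rows:\n", "\n",
"```python\n",
"# Sketch: query the curated table through the Athena API directly\n",
"athena = boto3.client('athena')\n",
"qid = athena.start_query_execution(\n",
"    QueryString='SELECT COUNT(*) FROM {0}.yellow_parq'.format(database_name),\n",
"    ResultConfiguration={'OutputLocation': 's3://{0}/athena/temp/'.format(bucket)}\n",
")['QueryExecutionId']\n",
"\n",
"state = 'RUNNING'\n",
"while state in ('QUEUED', 'RUNNING'):\n",
"    time.sleep(2)\n",
"    state = athena.get_query_execution(QueryExecutionId=qid)['QueryExecution']['Status']['State']\n",
"\n",
"if state == 'SUCCEEDED':\n",
"    for row in athena.get_query_results(QueryExecutionId=qid)['ResultSet']['Rows']:\n",
"        print([col.get('VarCharValue') for col in row['Data']])\n",
"```"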
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install PyAthena" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "from pyathena import connect\n", "from pyathena.util import as_pandas\n", "\n", "cursor = connect(region_name=region, s3_staging_dir='s3://'+bucket+'/athena/temp').cursor()\n", "cursor.execute('select * from ' + database_name + '.yellow_parq limit 10')\n", "\n", "df = as_pandas(cursor)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create S3 Event Notification to execute new job run](https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html)\n", "\n", "In this example, we are abstracting the call to the Step Functions State machine behind an API gateway call. This makes it easy for 3rd parties to trigger the start of the orchestration by utilizing a REST API call. We could easily modify the below Lambda function to use the appropriate boto3 call to execution the state machine bypassing the API call. What we want to call out here is if you have a scheduled job to drop data into S3 that you want to automate on boarding you can do that with S3 events and Step Functions to automate the AWS Glue calls to register, update partitions, and transform into parquet." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir trigger" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile trigger/trigger.py\n", "\n", "import json\n", "import os\n", "import requests\n", "\n", "def handler(event, context):\n", " bucket_name = event['Records'][0]['s3']['bucket']['name']\n", " s3_key_name = event['Records'][0]['s3']['object']['key']\n", " \n", " API = os.environ['API']\n", " data = {\n", " \"job_name\": os.environ['GLUE_JOB_NAME'],\n", " \"arguments\": {\n", " \"region\": os.environ['AWS_DEFAULT_REGION'],\n", " \"database_name\": os.environ['DATABASE_NAME'],\n", " \"table_name\": os.environ['TABLE_NAME'],\n", " \"s3_output_bucket\": os.environ['BUCKET'],\n", " \"s3_output_key_prefix\": os.environ['BUCKET_PREFIX']\n", " },\n", " \"crawler_name\": os.environ['CRAWLER_NAME'],\n", " \"wait_time\": os.environ['WAIT_TIME'],\n", " \"name\": \"PartitionUpdateExecution\",\n", " \"stateMachineArn\": os.environ['SF_ARN']\n", " }\n", " \n", " result = requests.post(API, headers={'content-type': 'application/json'}, data=json.dumps(data))\n", " return result.status_code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Zip Lambda function to trigger the Glue Job State Machine\n", "\n", "The Lambda function needs the `requests` library to make the API call to trigger the orchestration and the python file itself. We will zip them and create the new Lambda function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd trigger\n", "pip install requests --target .\n", "zip trigger.zip -r6 ." 
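, "\n# Optional sanity check: confirm the deployment archive was created and note its size\n", "ls -lh trigger.zip"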
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the Lamba function to trigger the orchestration\n", "\n", "The trigger function needs a number of environment variables to trigger the appropriate state machine and pass the required Glue params for loading the data.\n", "\n", "[lambda_client.create_function](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/lambda.html#Lambda.Client.create_function)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fn_name = \"NY-Taxi-Trigger-SF-{0}\".format(workshop_user)\n", "fn_role = 'arn:aws:iam::{0}:role/lambda_basic_execution'.format(account_id)\n", "\n", "response = lambda_client.create_function(\n", " FunctionName=fn_name,\n", " Runtime='python2.7',\n", " Role=fn_role,\n", " Handler=\"trigger.handler\",\n", " Code={'ZipFile': open(\"trigger/trigger.zip\", 'rb').read(), },\n", " Environment={\n", " 'Variables': {\n", " 'API': api,\n", " 'GLUE_JOB_NAME': glue_job_name,\n", " 'DATABASE_NAME': database_name,\n", " 'TABLE_NAME': table_name,\n", " 'BUCKET': bucket,\n", " 'BUCKET_PREFIX': s3_output_key_prefix,\n", " 'CRAWLER_NAME': parq_crawler_name,\n", " 'WAIT_TIME': '60',\n", " 'SF_ARN': sf_arn\n", " }\n", " }, \n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trigger_arn = response['FunctionArn']\n", "print(trigger_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Give S3 Bucket permission to execute Lambda function](https://docs.aws.amazon.com/lambda/latest/dg/lambda-permissions.html)\n", "\n", "We need to grant S3 permission to use a function. You can apply the policy at the function level, or specify a qualifier to restrict access to a single version or alias. If you use a qualifier, the invoker must use the full Amazon Resource Name (ARN) of that version or alias to invoke the function.\n", "\n", "[lambda_client.add_permission](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/lambda.html#Lambda.Client.add_permission)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_arn = 'arn:aws:s3:::{0}'.format(bucket)\n", "\n", "response = lambda_client.add_permission(\n", " FunctionName=fn_name,\n", " StatementId='1',\n", " Action='lambda:InvokeFunction',\n", " Principal='s3.amazonaws.com',\n", " SourceArn=s3_arn,\n", " SourceAccount=account_id\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create S3 Bucket Notification](https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html)\n", "\n", "Amazon S3 can publish events (for example, when an object is created in a bucket) to AWS Lambda and invoke your Lambda function by passing the event data as a parameter. This integration enables you to write Lambda functions that process Amazon S3 events. In Amazon S3, you add bucket notification configuration that identifies the type of event that you want Amazon S3 to publish and the Lambda function that you want to invoke." 
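, "\n", "\n",
"For reference, the event the trigger function receives looks roughly like the following (heavily trimmed, illustrative values only); this is the structure that `event['Records'][0]['s3']` unpacks in `trigger.py`:\n", "\n",
"```python\n",
"# Trimmed example of an S3 ObjectCreated event as delivered to Lambda (illustrative only)\n",
"sample_event = {\n",
"    'Records': [{\n",
"        'eventName': 'ObjectCreated:Put',\n",
"        's3': {\n",
"            'bucket': {'name': 'your-datalake-bucket'},\n",
"            'object': {'key': 'datalake/raw/yellow/yellow_tripdata_2017-03.csv'}\n",
"        }\n",
"    }]\n",
"}\n",
"\n",
"print(sample_event['Records'][0]['s3']['object']['key'])\n",
"```"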
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = {}\n", "data['LambdaFunctionConfigurations'] = [\n", " {\n", " 'Id': \"{0}-dl-notifications\".format(bucket),\n", " 'LambdaFunctionArn': trigger_arn,\n", " 'Events': [\"s3:ObjectCreated:*\"],\n", " 'Filter': {\n", " \"Key\": {\n", " \"FilterRules\": [\n", " {\n", " \"Name\": \"prefix\",\n", " \"Value\": raw_prefix\n", " }\n", " ]\n", " }\n", " }\n", " }\n", "]\n", "\n", "print(json.dumps(data, indent=4))\n", "s3 = boto3.resource('s3')\n", "bucket_notification = s3.BucketNotification(bucket)\n", "try:\n", " response = bucket_notification.put(NotificationConfiguration=data)\n", " print('Bucket notification updated successfully')\n", "except Exception as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add more raw data to trigger orchestration\n", "\n", "Let's add another file to the raw S3 prefix and watch the orchestration be triggers to start the Glue job and crawler to update the schema and partitions in the curated table. Wait for the step function to complete and go to Athena to query for data from this file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 cp s3://nyc-tlc/trip\\ data/yellow_tripdata_2017-03.csv s3://$bucket/datalake/raw/yellow/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View Step Function state\n", "\n", "The uploading of a new file to the `raw/yellow` prefix in the data lake triggered a new job run of the step function to update the data lake. When the state machine is complete you should be able to query for data on `2017-03` as well as the original files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('https://{0}.console.aws.amazon.com/states/home?region={0}#/statemachines/view/{1}'.format(region, sf_arn))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean Up" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = cfn.delete_stack(StackName=stack_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = glue.delete_crawler(Name=parq_crawler_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = glue.delete_crawler(Name=crawler_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = glue.delete_job(JobName=glue_job_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = glue.delete_database(\n", " CatalogId = account_id,\n", " Name = database_name\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = lambda_client.delete_function(FunctionName=fn_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 rb s3://$bucket --force " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "waiter = cfn.get_waiter('stack_delete_complete')\n", "waiter.wait(\n", " StackName=stack_name\n", ")\n", "\n", "print('The wait is over for {0}'.format(stack_name))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": 
"conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }