{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Serverless Data Lake Ingestion with Kinesis\n", "\n", "In this notebook we will walk through the steps required to use the Kinesis suite of tools as a swiss army knife to land data into your Data Lake. We will simulate Apache logs from a web server that could be sent from the [Kinesis Logs Agent](https://github.com/awslabs/amazon-kinesis-agent) to an Kinesis Data Stream. Once the logs have been sent to Kinesis we can convert, transform, and persist the processed and raw logs in your Data Lake. Finally, we will also show the real-time aspect of Kinesis by using an enhanced fan-out consumer with Lambda to send 500 errors found in the stream to Slack. The diagram below depicts the solutions we will be creating below.\n", "\n", "![Kinesis Ingestion](../../docs/assets/images/kinesis-swiss-army.png)\n", "\n", "You will need a Slack account and web hook to complete this workshop. Create a Slack account [here](https://slack.com/get-started). Once you have the acocunt created you will need to create a [Slack Channel](https://get.slack.help/hc/en-us/articles/201402297-Create-a-channel) and add a [WebHook](https://get.slack.help/hc/en-us/articles/115005265063-Incoming-WebHooks-for-Slack) to it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import botocore\n", "import json\n", "import time\n", "import project_path\n", "import getpass\n", "import pandas as pd\n", "\n", "from lib import workshop\n", "\n", "cfn = boto3.client('cloudformation')\n", "logs = boto3.client('logs')\n", "firehose = boto3.client('firehose')\n", "s3 = boto3.client('s3')\n", "\n", "# General variables for the region and account id for the location of the resources being created\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "\n", "delivery_stream_name = 'dc-demo-firehose'\n", "kdg_stack = 'kinesis-data-generator-cognito'\n", "kdg_username = 'admin'\n", "\n", "slack_web_hook = '{{SlackURL}}'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create S3 Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html)\n", "\n", "We will create an S3 bucket that will be used throughout the workshop for storing our data.\n", "\n", "[s3.create_bucket](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_bucket) boto3 documentation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucket = workshop.create_bucket(region, session, 'demo-')\n", "print(bucket)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Launch Kinesis Data Generator](https://awslabs.github.io/amazon-kinesis-data-generator/web/help.html)\n", "\n", "*** This is optional if you haven't already launched the KDG in your account ***\n", "\n", "The Amazon Kinesis Data Generator (KDG) makes it easy to send data to Kinesis Streams or Kinesis Firehose. Learn how to use the tool and create templates for your records.\n", "\n", "Create an Amazon Cognito User\n", "Before you can send data to Kinesis, you must first create an [Amazon Cognito](https://aws.amazon.com/cognito/) user in your AWS account with permissions to access Amazon Kinesis. 
To simplify this process, an [AWS Lambda](https://aws.amazon.com/lambda/) function and an [AWS CloudFormation](https://aws.amazon.com/cloudformation/) template are provided to create the user and assign just enough permissions to use the KDG.\n", "\n",
 "Create the CloudFormation stack by clicking the link below. It will take you to the AWS CloudFormation console and start the stack creation wizard. You only need to provide a Username and Password for the user that you will use to log in to the KDG. Accept the defaults for any other options presented by CloudFormation.\n", "\n",
 "[Create a Cognito user with CloudFormation](https://console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/new?stackName=Kinesis-Data-Generator-Cognito-User&templateURL=https://s3-us-west-2.amazonaws.com/kinesis-helpers/cognito-setup.json)\n", "\n",
 "The CloudFormation template can be downloaded from [here](https://s3-us-west-2.amazonaws.com/kinesis-helpers/cognito-setup.json). Because Amazon Cognito is not supported by CloudFormation, much of the setup is done in a Lambda function. The source code for the function can be downloaded from [here](https://s3-us-west-2.amazonaws.com/kinesis-helpers/datagen-cognito-setup.zip)." ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload [CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/GettingStarted.html) template\n", "\n", "In the interest of time, we will use CloudFormation to launch many of the supporting resources needed to have Kinesis Data Firehose store the raw data, pre-process the incoming data, and store the curated data in S3." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "demo_file = 'cfn/kinesis-swiss-army.yaml'\n", "session.resource('s3').Bucket(bucket).Object(demo_file).upload_file(demo_file)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "What does the CloudFormation template look like?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat cfn/kinesis-swiss-army.yaml" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cfn_template = 'https://s3.amazonaws.com/{0}/{1}'.format(bucket, demo_file)\n", "print(cfn_template)\n", "\n", "stack_name = 'kinesis-swiss-army'\n", "response = cfn.create_stack(\n", " StackName=stack_name,\n", " TemplateURL=cfn_template,\n", " Capabilities = [\"CAPABILITY_NAMED_IAM\"],\n", " Parameters=[\n", " {\n", " 'ParameterKey': 'SlackWebHookUrl',\n", " 'ParameterValue': slack_web_hook\n", " }\n", " ] \n", " \n", ")\n", "\n", "print(response)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = cfn.describe_stacks(\n", " StackName=stack_name\n", ")\n", "\n", "while response['Stacks'][0]['StackStatus'] != 'CREATE_COMPLETE':\n", " print('Not yet complete.')\n", " time.sleep(30)\n", " response = cfn.describe_stacks(\n", " StackName=stack_name\n", " )\n", "\n", "outputs = response['Stacks'][0]['Outputs']\n", "\n", "for output in outputs:\n", " if (output['OutputKey'] == 'FirehoseExecutionRole'):\n", " firehose_arn = output['OutputValue']\n", " if (output['OutputKey'] == 'LambdaPreProcessArn'):\n", " pre_processing_arn = output['OutputValue']\n", " if (output['OutputKey'] == 'GlueDatabase'):\n", " database = output['OutputValue']\n", " if (output['OutputKey'] == 'RawTable'):\n", " raw_table = output['OutputValue']\n", " if (output['OutputKey'] == 'CuratedTable'):\n", " curated_table = output['OutputValue']\n", " if (output['OutputKey'] == 'WeblogsBucket'):\n", " event_bucket = output['OutputValue']\n", " if (output['OutputKey'] == 'FirehoseLogGroup'):\n", " cloudwatch_logs_group_name = output['OutputValue']\n", " if (output['OutputKey'] == 'KinesisEventStream'):\n", " event_stream_arn = output['OutputValue']\n", " \n", "pd.set_option('display.max_colwidth', -1)\n", "pd.DataFrame(outputs, columns=[\"OutputKey\", \"OutputValue\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create the Kinesis Firehose we will use to send Apache Logs to our Data Lake](https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html)\n", "\n", "Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), and Splunk. Kinesis Data Firehose is part of the Kinesis streaming data platform, along with Kinesis Data Streams, Kinesis Video Streams, and Amazon Kinesis Data Analytics. With Kinesis Data Firehose, you don't need to write applications or manage resources. You configure your data producers to send data to Kinesis Data Firehose, and it automatically delivers the data to the destination that you specified. You can also configure Kinesis Data Firehose to transform your data before delivering it.\n", "\n", "In this example, we will create custom S3 prefixes for when the data lands in S3. This will allow us to precreate the partitions that will be cataloged in the Glue Data Catalog. 
, "\n",
 "For more information on custom prefixes, see the [Amazon S3 prefixes documentation](https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html).\n", "\n",
 "[firehose.create_delivery_stream](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/firehose.html#Firehose.Client.create_delivery_stream)" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "response = firehose.create_delivery_stream(\n",
 "    DeliveryStreamName=delivery_stream_name,\n",
 "    DeliveryStreamType='KinesisStreamAsSource',\n",
 "    KinesisStreamSourceConfiguration={\n",
 "        'KinesisStreamARN': event_stream_arn,\n",
 "        'RoleARN': firehose_arn\n",
 "    },\n",
 "    ExtendedS3DestinationConfiguration={\n",
 "        'RoleARN': firehose_arn,\n",
 "        'BucketARN': 'arn:aws:s3:::' + event_bucket,\n",
 "        'Prefix': 'weblogs/processed/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/',\n",
 "        'ErrorOutputPrefix': 'weblogs/failed/!{firehose:error-output-type}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/',\n",
 "        'BufferingHints': {\n",
 "            'SizeInMBs': 128,\n",
 "            'IntervalInSeconds': 60\n",
 "        },\n",
 "        'CompressionFormat': 'UNCOMPRESSED',\n",
 "        'EncryptionConfiguration': {\n",
 "            'NoEncryptionConfig': 'NoEncryption'\n",
 "        },\n",
 "        'CloudWatchLoggingOptions': {\n",
 "            'Enabled': True,\n",
 "            'LogGroupName': cloudwatch_logs_group_name,\n",
 "            'LogStreamName': 'ingestion_stream'\n",
 "        },\n",
 "        'ProcessingConfiguration': {\n",
 "            'Enabled': True,\n",
 "            'Processors': [\n",
 "                {\n",
 "                    'Type': 'Lambda',\n",
 "                    'Parameters': [\n",
 "                        {\n",
 "                            'ParameterName': 'LambdaArn',\n",
 "                            'ParameterValue': '{0}:$LATEST'.format(pre_processing_arn)\n",
 "                        },\n",
 "                        {\n",
 "                            'ParameterName': 'NumberOfRetries',\n",
 "                            'ParameterValue': '1'\n",
 "                        },\n",
 "                        {\n",
 "                            'ParameterName': 'RoleArn',\n",
 "                            'ParameterValue': firehose_arn\n",
 "                        },\n",
 "                        {\n",
 "                            'ParameterName': 'BufferSizeInMBs',\n",
 "                            'ParameterValue': '3'\n",
 "                        },\n",
 "                        {\n",
 "                            'ParameterName': 'BufferIntervalInSeconds',\n",
 "                            'ParameterValue': '60'\n",
 "                        }\n",
 "                    ]\n",
 "                }\n",
 "            ]\n",
 "        },\n",
 "        'S3BackupMode': 'Enabled',\n",
 "        'S3BackupConfiguration': {\n",
 "            'RoleARN': firehose_arn,\n",
 "            'BucketARN': 'arn:aws:s3:::' + event_bucket,\n",
 "            'Prefix': 'weblogs/raw/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/',\n",
 "            'ErrorOutputPrefix': 'weblogs/failed/!{firehose:error-output-type}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/',\n",
 "            'BufferingHints': {\n",
 "                'SizeInMBs': 128,\n",
 "                'IntervalInSeconds': 60\n",
 "            },\n",
 "            'CompressionFormat': 'UNCOMPRESSED',\n",
 "            'EncryptionConfiguration': {\n",
 "                'NoEncryptionConfig': 'NoEncryption'\n",
 "            },\n",
 "            'CloudWatchLoggingOptions': {\n",
 "                'Enabled': True,\n",
 "                'LogGroupName': cloudwatch_logs_group_name,\n",
 "                'LogStreamName': 'raw_stream'\n",
 "            }\n",
 "        },\n",
 "        'DataFormatConversionConfiguration': {\n",
 "            'SchemaConfiguration': {\n",
 "                'RoleARN': firehose_arn,\n",
 "                'DatabaseName': database,\n",
 "                'TableName': curated_table,\n",
 "                'Region': region,\n",
 "                'VersionId': 'LATEST'\n",
 "            },\n",
 "            'InputFormatConfiguration': {\n",
 "                'Deserializer': {\n",
 "                    'OpenXJsonSerDe': {}\n",
 "                }\n",
 "            },\n",
 "            'OutputFormatConfiguration': {\n",
 "                'Serializer': {\n",
 "                    'ParquetSerDe': {}\n",
 "                }\n",
 "            },\n",
 "            'Enabled': True\n",
 "        }\n",
 "    }\n",
 ")\n",
 "\n",
 "print(response)" ] }
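,
 { "cell_type": "markdown", "metadata": {}, "source": [ "The `ProcessingConfiguration` above points Firehose at the pre-processing Lambda function created by the CloudFormation stack (the `LambdaPreProcessArn` stack output). Its code ships with the template rather than this notebook; purely as a hypothetical sketch, a Firehose transformation handler that turns an Apache access-log line into the JSON consumed by the `OpenXJsonSerDe` deserializer could look something like this (the regular expression and field names are illustrative assumptions, not the stack's actual implementation):\n", "\n",
 "``` python\n",
 "import base64\n",
 "import json\n",
 "import re\n",
 "\n",
 "# Hypothetical pattern for the combined-style log lines produced by the KDG template\n",
 "LOG_PATTERN = re.compile(r'^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$')\n",
 "\n",
 "def lambda_handler(event, context):\n",
 "    output = []\n",
 "    for record in event['records']:\n",
 "        # Firehose delivers each record base64 encoded\n",
 "        line = base64.b64decode(record['data']).decode('utf-8').strip()\n",
 "        match = LOG_PATTERN.match(line)\n",
 "        if match:\n",
 "            host, ident, authuser, ts, request, status, size, referer, agent = match.groups()\n",
 "            payload = json.dumps({\n",
 "                'host': host, 'datetime': ts, 'request': request,\n",
 "                'status': status, 'size': size, 'referer': referer, 'agent': agent\n",
 "            }) + '\\n'\n",
 "            output.append({'recordId': record['recordId'], 'result': 'Ok',\n",
 "                           'data': base64.b64encode(payload.encode('utf-8')).decode('utf-8')})\n",
 "        else:\n",
 "            # Unparseable lines are handed back so Firehose writes them to the error prefix\n",
 "            output.append({'recordId': record['recordId'], 'result': 'ProcessingFailed',\n",
 "                           'data': record['data']})\n",
 "    return {'records': output}\n",
 "```\n" ] }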
,
 { "cell_type": "markdown", "metadata": {}, "source": [ "### Wait for the Kinesis Firehose to become 'Active'\n", "\n", "The Kinesis Firehose delivery stream is in the process of being created.\n", "\n", "[firehose.describe_delivery_stream](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/firehose.html#Firehose.Client.describe_delivery_stream)" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "response = firehose.describe_delivery_stream(\n",
 "    DeliveryStreamName=delivery_stream_name\n",
 ")\n",
 "\n",
 "status = response['DeliveryStreamDescription']['DeliveryStreamStatus']\n",
 "print(status)\n",
 "\n",
 "while status == 'CREATING':\n",
 "    time.sleep(30)\n",
 "    response = firehose.describe_delivery_stream(\n",
 "        DeliveryStreamName=delivery_stream_name\n",
 "    )\n",
 "    status = response['DeliveryStreamDescription']['DeliveryStreamStatus']\n",
 "    print(status)\n",
 "\n",
 "print('Kinesis Firehose created.')" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "We will send simulated Apache logs to the Kinesis Data Stream and land the data in the raw and processed prefixes in the data lake. The KDG record template below generates the log lines:\n", "\n",
 "``` json\n",
 "{{internet.ip}} - - [{{date.now(\"DD/MMM/YYYY:HH:mm:ss ZZ\")}}] \"{{random.weightedArrayElement({\"weights\":[0.6,0.1,0.1,0.2],\"data\":[\"GET\",\"POST\",\"DELETE\",\"PUT\"]})}} {{random.arrayElement([\"/list\",\"/wp-content\",\"/wp-admin\",\"/explore\",\"/search/tag/list\",\"/app/main/posts\",\"/posts/posts/explore\"])}} HTTP/1.1\" {{random.weightedArrayElement({\"weights\": [0.9,0.04,0.02,0.04], \"data\":[\"200\",\"404\",\"500\",\"301\"]})}} {{random.number(10000)}} \"-\" \"{{internet.userAgent}}\"\n",
 "```\n",
 "\n",
 "Once you log in to the KDG with your Cognito user, fill out the form as shown below, selecting the appropriate Kinesis Data Stream, and paste the template above into the record template field to simulate Apache logs. Once the form is filled out correctly, click 'Send Data' and Apache logs will be sent to your Kinesis Data Stream.\n", "\n",
 "![KDG Setup](../../docs/assets/images/kdg-send.png)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, if you have everything set up correctly, you will see 500 errors streaming to your Slack channel. You will also see data being persisted to the S3 bucket from the CloudFormation output `WeblogsBucket`, under the `raw` prefix for the raw data and the `processed` prefix for the Parquet-formatted data.\n", "\n",
 "![S3 Setup](../../docs/assets/images/kinesis-s3-bucket.png)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "### [Query the Data with Athena](https://aws.amazon.com/athena/)\n", "\n",
 "For self-serve end users that need to run ad-hoc queries against the data, Athena is a great choice that utilizes Presto and ANSI SQL to query a number of file formats on S3.\n", "\n",
 "To query the tables created by the CloudFormation stack, we will install a Python library for querying the data in the Glue Data Catalog with Athena. For more information, see [PyAthena](https://pypi.org/project/PyAthena/). You can also use the AWS console by browsing to the Athena service and running queries through the browser, or use the [JDBC/ODBC](https://docs.aws.amazon.com/athena/latest/ug/athena-bi-tools-jdbc-odbc.html) drivers available."
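, "\n", "\n",
 "As an illustration of the kind of ad-hoc query you could run once the tables are populated, the sketch below aggregates requests by HTTP status for a single hour's partition, using the PyAthena cursor created in the cells that follow. It assumes the curated table exposes a `status` column and `year`/`month`/`day`/`hour` partition keys that match the Firehose prefixes; check the actual column names in the Glue console before running it.\n", "\n",
 "``` python\n",
 "# Hypothetical ad-hoc query: requests per HTTP status for one hour's partition\n",
 "cursor.execute(\"\"\"\n",
 "    SELECT status, count(*) AS requests\n",
 "    FROM weblogs.p_streaming_logs\n",
 "    WHERE year = '2019' AND month = '07' AND day = '04' AND hour = '13'\n",
 "    GROUP BY status\n",
 "    ORDER BY requests DESC\n",
 "\"\"\")\n",
 "as_pandas(cursor)\n",
 "```"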
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install PyAthena" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select Query from Raw data\n", "\n", "In this first query we will create a simple query to show the ability of Athena to query the raw CSV data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "from pyathena import connect\n", "from pyathena.util import as_pandas\n", "\n", "cursor = connect(region_name=region, s3_staging_dir='s3://'+bucket+'/athena/temp').cursor()\n", "cursor.execute('select * from weblogs.r_streaming_logs limit 10')\n", "\n", "df = as_pandas(cursor)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select Query from Processed data\n", "\n", "In this first query we will create a simple query to show the ability of Athena to query the processed Parquet data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "cursor.execute('select * from weblogs.p_streaming_logs limit 10')\n", "\n", "df = as_pandas(cursor)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 rb s3://$event_bucket --force" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 rb s3://$bucket --force" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = cfn.delete_stack(StackName=stack_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "waiter = cfn.get_waiter('stack_delete_complete')\n", "waiter.wait(\n", " StackName=stack_name\n", ")\n", "\n", "print('The wait is over for {0}'.format(stack_name))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = firehose.delete_delivery_stream(\n", " DeliveryStreamName=delivery_stream_name\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }