# Automated Bucketing of Streaming Data

This repository accompanies the [Automating bucketing of streaming data using Amazon Athena and AWS Lambda](https://aws.amazon.com/blogs/big-data/automating-bucketing-of-streaming-data-using-amazon-athena-and-aws-lambda/) blog post. It contains an [AWS Serverless Application Model (AWS SAM)](https://aws.amazon.com/serverless/sam/) template that deploys two [AWS Lambda](https://aws.amazon.com/lambda) functions: LoadPartition and Bucketing.

The [LoadPartition](./functions/LoadPartition.py) function runs every hour, reads the new folder created under the /raw folder, and loads that folder as a new partition of SourceTable.

The [Bucketing](./functions/Bucketing.py) function runs every hour and copies the previous hour's data from /raw to /curated by executing a [CREATE TABLE AS SELECT (CTAS)](https://docs.aws.amazon.com/athena/latest/ug/ctas.html) query on [Amazon Athena](https://aws.amazon.com/athena). The copied data lands in a new sub-folder under /curated, which the function then loads as a partition of TargetTable. (Illustrative sketches of both hourly steps appear after the *How it works* section below.)

```bash
├── README.MD           <-- This instructions file
├── functions           <-- Two Lambda functions used to bucket streaming data
```

## Requirements

* AWS CLI already configured with Administrator permission
* Source and target tables created in Athena
* Streaming data being written into an Amazon S3 bucket and partitioned like this: dt=YYYY-mm-dd-HH

## Installation Instructions

1. [Install the SAM CLI](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html) if you do not have it.

2. Clone the repo onto your local development machine using `git clone https://github.com/aws-samples/automated-bucketing-of-streaming-data.git`.

3. [OPTIONAL] The Lambda functions included here work on data that is partitioned on an hourly basis, using a flat partition strategy that looks like the following: dt=YYYY-mm-dd-HH. If your data has a different structure, edit the Lambda functions accordingly.

4. From the command line, change directory to the SAM template directory, then run:

```
sam build
sam deploy --guided
```

Follow the prompts in the deploy process to set the stack name, AWS Region, and other parameters.

## Parameter Details

* S3BucketName: The name of the data lake S3 bucket for this application.
* CuratedKeyPrefix: Prefix of the new bucketed files that are written by the Bucketing function. This is the Amazon S3 location of TargetTable without the 's3://<S3BucketName>' part. **Do not add the trailing slash**. For example, /curated
* AthenaResultLocation: Full S3 location where Athena will store query results. For example, s3://<S3BucketName>/athena_results
* DatabaseName: Data Catalog database name that holds SourceTable and TargetTable.
* SourceTableName: Name of the source table that points to the raw data.
* TargetTableName: Name of the target table that points to the curated data.
* BucketingKey: The column used as the bucketing key. The solution supports a single bucketing key; to add more, edit the Lambda function.
* BucketCount: Number of Hive buckets to create within a partition. This has to be the same number that was used when creating TargetTable.

## How it works

* Start writing streaming data to the S3 bucket.
* Create SourceTable and TargetTable in Athena.
* Within an hour of the SAM deployment, you will see new data written under ***CuratedKeyPrefix***. The data will be bucketed and can be queried from TargetTable in Athena.
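As an illustration of the first hourly step, the following is a minimal sketch of how a new /raw folder can be registered as a partition of SourceTable with a boto3 Athena call. The database, table, and bucket names are placeholders, not the identifiers used by this solution; see [LoadPartition](./functions/LoadPartition.py) for the actual implementation.

```python
# Minimal sketch: register the previous hour's /raw folder as a partition
# of the source table. All names below (my_database, source_table,
# <bucket-name>) are placeholders; see functions/LoadPartition.py for
# the real logic.
from datetime import datetime, timedelta

import boto3

athena = boto3.client("athena")


def load_previous_hour_partition():
    # The hour that just closed, in the flat dt=YYYY-mm-dd-HH layout.
    dt = (datetime.utcnow() - timedelta(hours=1)).strftime("%Y-%m-%d-%H")
    athena.start_query_execution(
        QueryString=(
            f"ALTER TABLE source_table ADD IF NOT EXISTS "
            f"PARTITION (dt = '{dt}') "
            f"LOCATION 's3://<bucket-name>/raw/dt={dt}/'"
        ),
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://<bucket-name>/athena_results/"},
    )
```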
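The second hourly step is the CTAS query that rewrites the hour's raw data as bucketed files. Below is a hedged sketch of that flow under the same placeholder names; the bucketing column and bucket count shown here are examples only, and a production version would poll `get_query_execution` until each query succeeds before issuing the next one. [Bucketing](./functions/Bucketing.py) is the authoritative implementation.

```python
# Minimal sketch of the hourly bucketing step: a CTAS query writes one hour
# of raw data as bucketed Parquet files under /curated, the temporary CTAS
# table is dropped (its data files stay in S3), and the new folder is added
# as a partition of the target table. Names and bucket settings are
# placeholders, not the solution's actual parameters.
from datetime import datetime, timedelta

import boto3

athena = boto3.client("athena")


def run_query(query):
    # Fire-and-forget for brevity; a real implementation should poll
    # get_query_execution until each query completes before continuing.
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://<bucket-name>/athena_results/"},
    )


def bucket_previous_hour():
    dt = (datetime.utcnow() - timedelta(hours=1)).strftime("%Y-%m-%d-%H")
    tmp_table = "tmp_bucketing_" + dt.replace("-", "_")

    # CTAS: copy the hour from /raw to /curated, bucketed by the chosen key.
    run_query(f"""
        CREATE TABLE {tmp_table}
        WITH (
            external_location = 's3://<bucket-name>/curated/dt={dt}/',
            format = 'PARQUET',
            bucketed_by = ARRAY['my_bucketing_key'],
            bucket_count = 10
        ) AS
        SELECT * FROM source_table WHERE dt = '{dt}'
    """)

    # Drop the temporary table; the bucketed data files remain under /curated.
    run_query(f"DROP TABLE {tmp_table}")

    # Expose the new folder as a partition of the curated table.
    run_query(
        f"ALTER TABLE target_table ADD IF NOT EXISTS PARTITION (dt = '{dt}') "
        f"LOCATION 's3://<bucket-name>/curated/dt={dt}/'"
    )
```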
## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This library is licensed under the MIT-0 License. See the [LICENSE](LICENSE) file.