# Step Functions Distributed Map weather analysis

This is a SAM application that processes all [37+ GB of NOAA Global Surface Summary of Day](https://registry.opendata.aws/noaa-gsod/) data. The application code in this example finds the weather station that has _the highest average temperature on the planet each month_.

This data set is interesting for a few reasons:

1. The data is organized by station/day. Each weather station has a single record with averages per _day_. This example Step Functions workflow finds the highest average temperature across all stations by _month_. That is, it answers the question: "What place on earth recorded the highest average daily temperature within a given month?"
2. There are over 558,000 CSV files in the data set, totaling over 37 GB. The average CSV file size is 66.5 KB.
3. The CSV files are relatively simple to understand and parse.

## Design

This implementation uses a Lambda map function (using a Distributed Map state from Step Functions) and a Lambda reducer function. The reducer function performs a final aggregation and writes the results to DynamoDB. The reducer function is necessary because two child workflows in the Distributed Map run may each find a high temperature for the same month. For example, child workflow 1 may find that Seattle, Washington, USA had the highest temperature for "2022-07" (July 2022), while child workflow 2 finds that Jahra, Kuwait had the highest temperature for "2022-07". The reducer function takes a final pass through the outputs from all of the child workflows to find the correct highs.

![](dmap-state-machine.png)

**Using a Distributed Map batch size of 500, this workflow completes in approximately 90-150 seconds.**

## Input data

Each CSV file has the following format (many columns not shown):

![](noaa-gsod-pds-data.png)

The mapper Lambda function reads each line into a dictionary and uses the `DATE` and `TEMP` fields to find the `STATION` with the highest average daily temperature in a month.
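The deployed mapper ships with the SAM application's source; purely as an illustration of that logic (a sketch, not the actual code), the per-batch work might look something like this in Python, assuming `DATE` values are formatted `YYYY-MM-DD`:

```python
import csv


def find_monthly_highs(csv_lines):
    """Return a dict mapping "YYYY-MM" to the hottest station/day record
    seen in the given CSV lines (header row included)."""
    highs = {}
    for row in csv.DictReader(csv_lines):
        date, temp_raw = row.get("DATE"), row.get("TEMP")
        if not date or not temp_raw:
            continue
        try:
            temp = float(temp_raw)
        except ValueError:
            continue  # skip malformed rows
        if temp == 9999.9:
            continue  # GSOD uses 9999.9 to mark a missing temperature value
        month = date[:7]  # "2022-07-14" -> "2022-07"
        best = highs.get(month)
        if best is None or temp > best["TEMP"]:
            highs[month] = {"STATION": row.get("STATION"), "DATE": date, "TEMP": temp}
    return highs
```

Each child workflow would run something like this over its batch of up to 500 objects and return its per-month highs as its output.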
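The reducer's merge step can be sketched the same way. Again, this is illustrative only: the table name and key attribute below are placeholders, not the names the stack actually creates.

```python
from decimal import Decimal

import boto3


def reduce_and_store(child_results, table_name):
    """Merge per-month highs from all child workflows and keep the true
    maximum for each month, then persist the winners to DynamoDB.

    `child_results` is a list of dicts mapping "YYYY-MM" to the hottest
    {"STATION", "DATE", "TEMP"} record a single child workflow found;
    `table_name` is a placeholder for the table created by the stack.
    """
    merged = {}
    for result in child_results:
        for month, record in result.items():
            best = merged.get(month)
            if best is None or record["TEMP"] > best["TEMP"]:
                merged[month] = record

    table = boto3.resource("dynamodb").Table(table_name)
    with table.batch_writer() as batch:
        for month, record in merged.items():
            batch.put_item(Item={
                "month": month,  # placeholder key name, unique by YYYY-MM
                "STATION": record["STATION"],
                "DATE": record["DATE"],
                # DynamoDB numbers must be Decimal, not float
                "TEMP": Decimal(str(record["TEMP"])),
            })
    return merged
```

Because each child workflow only sees its own slice of the 558,000+ files, this final merge is what guarantees a single correct winner per month.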
## Output

Results are written to DynamoDB in the same way they are represented in the CSV files, using the same columns. Each row is unique by month (`YYYY-MM`).

![](weather-station-output-in-ddb.png)

## Running this in your own account

> ## NOTE: This demo can incur charges in your AWS account once you are beyond the free tier. The charges are on the order of $0.10 for one run.

This demo needs to run in the `us-east-1` AWS region. To run this demo in your own account:

1. Deploy this example using AWS SAM in the `us-east-1` region. Look at the `Outputs` after you deploy it since you will need some of these in future steps.

   ![](sam-outputs.png)

2. Run the `CopyNOAAS3DataStateMachine` workflow. The `input` into the workflow doesn't matter. Running this workflow will copy the public NOAA data into your own bucket. **NOTE: THIS TAKES ~45-60 min to complete.** If you would like to copy a subset of the files, simply stop the execution in the AWS console.
3. Run the `NOAAWeatherStateMachine` workflow, which will read the data you just copied, process it as described above, and write the results to DynamoDB.
4. Look at the DynamoDB table contents in the AWS console to see the final results. Also look at the processing time for the `NOAAWeatherStateMachine` run, which should be on the order of 90-150 seconds, depending on how much data you copied.

## Cleaning up

### Emptying S3 buckets

The SAM application will create a state machine called `DeleteNOAADataStateMachine`. Before you can delete all of the resources, you need to empty two buckets:

- `NOAADataBucket`, where you copied the public NOAA data
- `StateMachineResultsBucket`, where the Distributed Map results are written

Run `DeleteNOAADataStateMachine` twice, once for each bucket. Use the following payloads to tell the workflow which S3 bucket to empty:

```json
{
  "BucketToEmpty": "value-of-the-StateMachineResultsBucket-from-outputs"
}
```

and:

```json
{
  "BucketToEmpty": "value-of-the-NOAADataBucket-from-outputs"
}
```

This is what it looks like in the console:

![Input to empty a bucket using the DeleteNOAADataStateMachine workflow](empty-buckets-workflow.png)

Because the `ListObjectsV2` API returns a maximum of 1,000 S3 objects per call, the workflows work page by page for both the copy and the delete.

_Note: this workflow only has permission to delete objects from these two buckets, so there is no danger of accidentally emptying another bucket._

### SAM delete

Once the buckets are emptied, simply run `sam delete` and answer `y` to any questions.
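For reference, the page-by-page emptying that `DeleteNOAADataStateMachine` performs maps roughly onto the following boto3 loop. This sketch only shows the pattern (running the state machine is the supported path); substitute the bucket names from your stack `Outputs`.

```python
import boto3

s3 = boto3.client("s3")


def empty_bucket(bucket_name):
    """Delete every object in the bucket, at most 1,000 keys per page."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket=bucket_name, Delete={"Objects": objects})


# Use the bucket names shown in the SAM stack Outputs, e.g.:
# empty_bucket("value-of-the-NOAADataBucket-from-outputs")
```

Either way, both buckets must be empty before `sam delete` can remove the rest of the stack.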