### Introduction

This is an annotated version of the CloudFormation script that kicks off the serverless data lake.

#### Standard Header

```YAML
AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Description: A serverless datalake workshop.
Resources:
```

#### IngestionBucket

This is the S3 bucket that will contain the S3 data lake.

```YAML
IngestionBucket:
  Type: AWS::S3::Bucket
```

#### ApacheLogs

This is the CloudWatch Logs log group that will contain the log data generated by the log file generator.

```YAML
ApacheLogs:
  Type: AWS::Logs::LogGroup
  Properties:
    LogGroupName: !Sub /${AWS::StackName}/apache
    RetentionInDays: 1
```

#### ApacheLogsKinesis

This is the Kinesis Firehose delivery stream that receives the logs from CloudWatch Logs. CloudWatch Logs publishes the logs in a compressed JSON format, so a Lambda function extracts the log data, and that is what is written to S3. You could use the compressed log data in AWS Glue, but the compression makes the files harder to read, and the JSON format adds hierarchy to the data in the data lake and obscures the data. To keep the workshop simple, the data is written uncompressed as CSV.

```YAML
ApacheLogsKinesis:
  Type: AWS::KinesisFirehose::DeliveryStream
  DependsOn: GenerateSampleDataFunction
  Properties:
    DeliveryStreamType: DirectPut
    ExtendedS3DestinationConfiguration:
      RoleARN: !GetAtt ApacheLogsServiceRole.Arn
      BucketARN: !GetAtt IngestionBucket.Arn
      BufferingHints:
        IntervalInSeconds: 60
        SizeInMBs: 3
      CloudWatchLoggingOptions:
        Enabled: False
      CompressionFormat: UNCOMPRESSED
      Prefix: weblogs/live/
      ProcessingConfiguration:
        Enabled: true
        Processors:
          - Type: Lambda
            Parameters:
              - ParameterName: LambdaArn
                ParameterValue: !Sub ${TransformKinesis.Arn}
              - ParameterName: BufferSizeInMBs
                ParameterValue: 3
              - ParameterName: BufferIntervalInSeconds
                ParameterValue: 60
```

#### CloudWatchLogsToKinesis

This is the subscription filter that publishes the logs from CloudWatch Logs to the Kinesis Firehose delivery stream.

```YAML
CloudWatchLogsToKinesis:
  Type: AWS::Logs::SubscriptionFilter
  Properties:
    DestinationArn: !Sub ${ApacheLogsKinesis.Arn}
    FilterPattern: ""
    LogGroupName: !Sub ${ApacheLogs}
    RoleArn: !Sub ${LogsToKinesisServiceRole.Arn}
```

#### LogsToKinesisServiceRole

This is the IAM role that CloudWatch Logs assumes in order to publish the data to Kinesis Firehose. It needs authorization to write to the delivery stream.

```YAML
LogsToKinesisServiceRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: !Sub ${AWS::StackName}_logs_kinesis_role
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: !Sub logs.${AWS::Region}.amazonaws.com
          Action: sts:AssumeRole
```
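As noted under ApacheLogsKinesis above, CloudWatch Logs hands each record to the delivery stream as base64-encoded, gzip-compressed JSON. The sketch below shows roughly what the transformation function (TransformKinesis, defined later in the template) has to do to turn that payload into flat rows. It is an illustrative outline written in Python 3 syntax, not the workshop's actual transformKinesis.py (which declares the python2.7 runtime), and the choice to emit one raw log line per row is an assumption.

```python
# Sketch of a Kinesis Firehose transformation handler for CloudWatch Logs
# subscription data. Illustrative only; the workshop's transformKinesis.py
# may format its CSV output differently.
import base64
import gzip
import json


def handler(event, context):
    output = []
    for record in event['records']:
        # Firehose base64-encodes each record; CloudWatch Logs additionally
        # gzip-compresses the JSON payload it delivers to the stream.
        compressed = base64.b64decode(record['data'])
        payload = json.loads(gzip.decompress(compressed).decode('utf-8'))

        if payload.get('messageType') != 'DATA_MESSAGE':
            # Control messages carry no log data, so mark them as dropped.
            output.append({'recordId': record['recordId'],
                           'result': 'Dropped',
                           'data': record['data']})
            continue

        # Strip the CloudWatch Logs JSON wrapper and keep only the raw
        # Apache log lines, one per row, so the S3 objects stay flat.
        lines = ''.join(e['message'] + '\n' for e in payload['logEvents'])
        output.append({'recordId': record['recordId'],
                       'result': 'Ok',
                       'data': base64.b64encode(lines.encode('utf-8')).decode('utf-8')})

    return {'records': output}
```

The `result` field tells Firehose whether to deliver, drop, or retry each record, which is why control messages are marked `Dropped` rather than simply skipped.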
#### LogsToKinesisRolePolicy

This is the IAM policy that grants CloudWatch Logs authorization to write to the Kinesis Firehose delivery stream.

```YAML
LogsToKinesisRolePolicy:
  Type: AWS::IAM::ManagedPolicy
  Properties:
    ManagedPolicyName: !Sub ${AWS::StackName}_logs_kineis_policy
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Action:
            - 'firehose:*'
          Resource:
            - !Sub '${ApacheLogsKinesis.Arn}'
        - Effect: Allow
          Action:
            - 'iam:PassRole'
          Resource:
            - !Sub '${LogsToKinesisServiceRole.Arn}'
    Roles:
      - !Ref 'LogsToKinesisServiceRole'
```

#### ApacheLogsServiceRole

This is the IAM role that Kinesis Firehose assumes to call the transformation Lambda function and write the result to S3.

```YAML
ApacheLogsServiceRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: !Sub ${AWS::StackName}_weblog_delivery_role
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: firehose.amazonaws.com
          Action: sts:AssumeRole
```

#### ApacheLogsRolePolicy

This is the IAM policy that grants Kinesis Firehose access to call the transformation Lambda function and write the result to S3.

```YAML
ApacheLogsRolePolicy:
  Type: AWS::IAM::ManagedPolicy
  Properties:
    ManagedPolicyName: !Sub ${AWS::StackName}_weblog_delivery_policy
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Action:
            - 's3:*'
          Resource:
            - !Sub '${IngestionBucket.Arn}/*'
            - !Sub '${IngestionBucket.Arn}'
        - Effect: Allow
          Action:
            - 'lambda:InvokeFunction'
            - 'lambda:InvokeAsync'
          Resource:
            - !Sub '${TransformKinesis.Arn}'
    Roles:
      - !Ref 'ApacheLogsServiceRole'
```

#### TransformKinesis

This is the Lambda function that transforms the CloudWatch Logs format into a simple CSV. It uncompresses the log payload and strips away the JSON document that supplies the CloudWatch Logs metadata.

```YAML
TransformKinesis:
  Type: 'AWS::Serverless::Function'
  Properties:
    Handler: transformKinesis.handler
    Runtime: python2.7
    Description: ''
    MemorySize: 512
    Timeout: 60
    CodeUri: ./src
```

#### GenerateSampleDataFunction

This is the Lambda function that randomly generates website traffic data for one minute. It is scheduled to run every minute.

```YAML
GenerateSampleDataFunction:
  Type: 'AWS::Serverless::Function'
  Properties:
    Handler: writelogs.lambda_handler
    Runtime: python2.7
    Description: ''
    MemorySize: 512
    Timeout: 60
    CodeUri: ./src
    Events:
      Schedule:
        Type: Schedule
        Properties:
          Schedule: rate(1 minute)
    Environment:
      Variables:
        LOG_GROUP_NAME: !Sub /${AWS::StackName}/apache
```

#### LoadSampleDataFunction

This is the Lambda function that provides the functionality for the LoadSampleData custom resource. It copies the sample data from a public S3 bucket into the IngestionBucket. Upon delete, it removes all the data in the bucket so the bucket itself can be deleted.

```YAML
LoadSampleDataFunction:
  Type: 'AWS::Serverless::Function'
  Properties:
    Handler: load-data-files.lambda_handler
    Runtime: python2.7
    Description: ''
    MemorySize: 512
    Timeout: 240
    Policies:
      - S3CrudPolicy:
          BucketName: !Ref IngestionBucket
    CodeUri: ./src
    Environment:
      Variables:
        BUCKET_NAME: !Ref IngestionBucket
```

#### LoadSampleData

This is a custom CloudFormation resource that pre-populates the sample data and cleans up the IngestionBucket when the stack is deleted. It also uploads the lab instructions to the bucket, substituting the actual bucket name into the instructions to reduce copy-and-paste errors.

```YAML
LoadSampleData:
  Type: Custom::LoadSampleData
  DependsOn:
    - IngestionBucket
  Properties:
    ServiceToken: !GetAtt LoadSampleDataFunction.Arn
    StackName: !Ref AWS::StackName
```
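A custom resource like LoadSampleData works because its backing function replies to CloudFormation over the pre-signed ResponseURL included in the event. The sketch below is a minimal outline in the style of LoadSampleDataFunction, not the workshop's load-data-files.py; the SOURCE_BUCKET name and the copy/cleanup details are assumptions, and it is written in Python 3 syntax, while only the custom resource response protocol is standard.

```python
# Minimal sketch of a custom resource handler in the style of
# LoadSampleDataFunction. SOURCE_BUCKET is a hypothetical placeholder.
import json
import os
import urllib.request

import boto3

s3 = boto3.client('s3')
SOURCE_BUCKET = 'example-public-sample-data'  # hypothetical source bucket


def send_response(event, context, status, reason=''):
    # Every custom resource must PUT a result to the pre-signed ResponseURL
    # so CloudFormation can finish creating, updating, or deleting it.
    body = json.dumps({
        'Status': status,
        'Reason': reason,
        'PhysicalResourceId': context.log_stream_name,
        'StackId': event['StackId'],
        'RequestId': event['RequestId'],
        'LogicalResourceId': event['LogicalResourceId'],
    }).encode('utf-8')
    req = urllib.request.Request(event['ResponseURL'], data=body,
                                 headers={'Content-Type': ''}, method='PUT')
    urllib.request.urlopen(req)


def lambda_handler(event, context):
    bucket = os.environ['BUCKET_NAME']
    try:
        if event['RequestType'] == 'Delete':
            # Empty the bucket so CloudFormation can delete it with the stack.
            for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket):
                keys = [{'Key': o['Key']} for o in page.get('Contents', [])]
                if keys:
                    s3.delete_objects(Bucket=bucket, Delete={'Objects': keys})
        else:
            # On Create/Update, copy the sample data into the ingestion bucket.
            for page in s3.get_paginator('list_objects_v2').paginate(Bucket=SOURCE_BUCKET):
                for obj in page.get('Contents', []):
                    s3.copy_object(Bucket=bucket, Key=obj['Key'],
                                   CopySource={'Bucket': SOURCE_BUCKET, 'Key': obj['Key']})
        send_response(event, context, 'SUCCESS')
    except Exception as exc:
        send_response(event, context, 'FAILED', reason=str(exc))
```

Emptying the bucket on Delete matters because CloudFormation cannot remove a non-empty S3 bucket; without that step the stack would fail to delete cleanly.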
#### Outputs

These outputs are displayed in the AWS console after the stack has been created. The WorkshopInstructionsUrl is a link to the customized instructions for this workshop.

```YAML
Outputs:
  WorkshopInstructionsUrl:
    Description: Follow the link for the instructions for the serverless datalake workshop.
    Value: !Sub https://s3.${AWS::Region}.amazonaws.com/${IngestionBucket}/instructions/instructions.html
```
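If you would rather pull the instructions link from a script than from the console, something like the following works; this is a sketch assuming boto3 with default credentials and a placeholder stack name.

```python
# Sketch: read the WorkshopInstructionsUrl output back after deployment.
# The stack name 'serverless-datalake' is a placeholder.
import boto3

cloudformation = boto3.client('cloudformation')
stack = cloudformation.describe_stacks(StackName='serverless-datalake')['Stacks'][0]

# Stack outputs come back as a list of OutputKey/OutputValue pairs.
outputs = {o['OutputKey']: o['OutputValue'] for o in stack.get('Outputs', [])}
print(outputs['WorkshopInstructionsUrl'])
```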