Transform: AWS::Serverless-2016-10-31 Description: Example template for deploying AWS Glue DataBrew Data Quality Rules and an AWS Step Functions State Machine Parameters: S3Bucket: Type: String Default: databrew-public-datasets-us-east-1 Description: S3 bucket name where the data is stored S3Key: Type: String Default: votes.csv Description: S3 key name where the data is stored EmailAddress: Type: String AllowedPattern: '^[^\s@]+@[^\s@]+\.[^\s@]+$' Description: Email address of store manager for receiving notifications Globals: Function: Runtime: python3.8 Handler: app.handler Timeout: 300 Resources: GlueDataBrewOutputS3Bucket: Type: AWS::S3::Bucket Properties: VersioningConfiguration: Status: Enabled LifecycleConfiguration: Rules: - Id: All Status: Enabled NoncurrentVersionExpiration: NewerNoncurrentVerssions: 3 NoncurrentDays: 90 PublicAccessBlockConfiguration: BlockPublicAcls: true BlockPublicPolicy: true IgnorePublicAcls: true RestrictPublicBuckets: true BucketEncryption: ServerSideEncryptionConfiguration: - ServerSideEncryptionByDefault: SSEAlgorithm: AES256 GlueDataBrewInputS3Bucket: Type: AWS::S3::Bucket Properties: VersioningConfiguration: Status: Enabled LifecycleConfiguration: Rules: - Id: All Status: Enabled NoncurrentVersionExpiration: NewerNoncurrentVerssions: 3 NoncurrentDays: 90 NotificationConfiguration: # New configuration to enable EventBridge for S3 Event Notifications EventBridgeConfiguration: EventBridgeEnabled: true PublicAccessBlockConfiguration: BlockPublicAcls: true BlockPublicPolicy: true IgnorePublicAcls: true RestrictPublicBuckets: true BucketEncryption: ServerSideEncryptionConfiguration: - ServerSideEncryptionByDefault: SSEAlgorithm: AES256 s3ManagerLambdaFunction: Type: AWS::Serverless::Function Properties: CodeUri: ../lambdaFunctions/s3ManagerCustomResource Policies: - S3FullAccessPolicy: BucketName: !Ref GlueDataBrewOutputS3Bucket - S3FullAccessPolicy: BucketName: !Ref GlueDataBrewInputS3Bucket - S3FullAccessPolicy: BucketName: !Ref S3Bucket - Statement: Effect: Allow Action: - s3:ListBucketVersions Resource: - !GetAtt GlueDataBrewOutputS3Bucket.Arn - !GetAtt GlueDataBrewInputS3Bucket.Arn CopyObjectResource: Type: Custom::S3CopyObject Properties: ServiceToken: !GetAtt s3ManagerLambdaFunction.Arn SourceBucketName: !Ref S3Bucket SourceKey: !Ref S3Key InputBucketName: !Ref GlueDataBrewInputS3Bucket OutputBucketName: !Ref GlueDataBrewOutputS3Bucket copyVotes: true EmptyBucketsResource: Type: Custom::S3EmptyBuckets Properties: ServiceToken: !GetAtt s3ManagerLambdaFunction.Arn InputBucketName: !Ref GlueDataBrewInputS3Bucket OutputBucketName: !Ref GlueDataBrewOutputS3Bucket EventsPolicy: Type: AWS::IAM::ManagedPolicy Properties: PolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Action: - states:StartExecution Resource: - !Ref ProfileJobStepFunction EventsRole: Type: AWS::IAM::Role Properties: ManagedPolicyArns: - !Ref EventsPolicy AssumeRolePolicyDocument: Version: "2012-10-17" Statement: - Effect: Allow Principal: Service: events.amazonaws.com Action: sts:AssumeRole Path: / S3EventBridgeRule: Type: AWS::Events::Rule DependsOn: CopyObjectResource Properties: Targets: - Arn: !Ref ProfileJobStepFunction Id: ABC123 RoleArn: !GetAtt EventsRole.Arn RoleArn: !GetAtt EventsRole.Arn EventPattern: source: - "aws.s3" detail-type: - "Object Created" detail: bucket: name: - !Ref GlueDataBrewInputS3Bucket GlueDataBrewDataset: DependsOn: CopyObjectResource Type: AWS::DataBrew::Dataset Properties: Format: CSV Input: S3InputDefinition: Bucket: Ref: GlueDataBrewInputS3Bucket Key: Ref: S3Key Name: awsbigdatablog-votes GlueDataBrewDataQualityRuleset: Type: AWS::DataBrew::Ruleset Properties: Name: awsbigdatablog-votes-ruleset Description: AWS Big Data Blog Votes Data Quality Ruleset Rules: - Name: Check All Columns For Missing Values Disabled: false CheckExpression: AGG(MISSING_VALUES_PERCENTAGE) == :val1 SubstitutionMap: - ValueReference: ":val1" Value: '0' ColumnSelectors: - Regex: ".*" - Name: Check assembly_session For Outliers Disabled: false CheckExpression: AGG(:col1, Z_SCORE_OUTLIERS_PERCENTAGE, :param1) <= :val1 SubstitutionMap: - ValueReference: ":col1" Value: "`assembly_session`" - ValueReference: ":param1" Value: '3' - ValueReference: ":val1" Value: '0' - Name: Check state_name For String Length Disabled: false CheckExpression: LENGTH(:col1) >= :val1 AND LENGTH(:col1) <= :val2 SubstitutionMap: - ValueReference: ":col1" Value: "`state_name`" - ValueReference: ":val1" Value: '4' - ValueReference: ":val2" Value: '32' Threshold: Value: 100 Type: GREATER_THAN_OR_EQUAL Unit: PERCENTAGE TargetArn: !Sub arn:${AWS::Partition}:databrew:${AWS::Region}:${AWS::AccountId}:dataset/${GlueDataBrewDataset} ProfileJobIAMPolicy: Type: AWS::IAM::ManagedPolicy Properties: PolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Action: - s3:GetObject - s3:PutObject - s3:DeleteObject - s3:ListBucket - s3:PutObjectAcl Resource: - !Sub arn:${AWS::Partition}:s3:::${GlueDataBrewOutputS3Bucket}/* - !Sub arn:${AWS::Partition}:s3:::${GlueDataBrewOutputS3Bucket} - Effect: Allow Action: - s3:ListBucket - s3:GetObject Resource: - !Sub arn:aws:s3:::${GlueDataBrewInputS3Bucket} - !Sub arn:aws:s3:::${GlueDataBrewInputS3Bucket}/* ProfileJobIAMRole: Type: AWS::IAM::Role Properties: ManagedPolicyArns: - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSGlueDataBrewServiceRole - !Ref ProfileJobIAMPolicy AssumeRolePolicyDocument: Version: "2012-10-17" Statement: - Effect: Allow Principal: Service: databrew.amazonaws.com Action: sts:AssumeRole Path: / GlueDataBrewProfileJob: Type: AWS::DataBrew::Job Properties: OutputLocation: Bucket: !Ref GlueDataBrewOutputS3Bucket Key: !Sub ${GlueDataBrewDataset}/profile-job/ RoleArn: !GetAtt ProfileJobIAMRole.Arn ValidationConfigurations: - RulesetArn: !Sub arn:${AWS::Partition}:databrew:${AWS::Region}:${AWS::AccountId}:ruleset/${GlueDataBrewDataQualityRuleset} DatasetName: !Ref GlueDataBrewDataset Name: awsbigdatablog-votes-profile EncryptionMode: SSE-S3 LogSubscription: ENABLE MaxCapacity: 5 MaxRetries: 0 Type: PROFILE Timeout: 2880 JobSample: Mode: CUSTOM_ROWS Size: 20000 SNSNotifierTopic: Type: 'AWS::SNS::Topic' Properties: TopicName: 'databrew-dq-topic' KmsMasterKeyId: alias/aws/sns Subscription: - Endpoint: !Ref EmailAddress Protocol: email TopicPolicy: Type: AWS::SNS::TopicPolicy Properties: PolicyDocument: Id: sns-topic-policy Version: '2012-10-17' Statement: - Sid: allow-sns Effect: Allow Principal: AWS: !Ref 'AWS::AccountId' Action: sns:Publish Resource: '*' Topics: - !Ref SNSNotifierTopic StepFunctionsPolicy: Type: AWS::IAM::ManagedPolicy Properties: PolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Action: - databrew:startJobRun - databrew:stopJobRun - databrew:listJobRuns Resource: - !Sub arn:${AWS::Partition}:databrew:${AWS::Region}:${AWS::AccountId}:job/${GlueDataBrewProfileJob} - Effect: Allow Action: - lambda:InvokeFunction Resource: - !GetAtt DQCheckFunction.Arn - Effect: Allow Action: - sns:Publish Resource: - !Ref SNSNotifierTopic DQCheckFunction: Type: AWS::Serverless::Function Properties: CodeUri: ../lambdaFunctions/getDataBrewDqResults Policies: - S3ReadPolicy: BucketName: !Ref GlueDataBrewOutputS3Bucket ProfileJobStepFunction: Type: AWS::Serverless::StateMachine Properties: Definition: StartAt: StartDQJob States: StartDQJob: Type: Task Resource: !Sub arn:${AWS::Partition}:states:::databrew:startJobRun.sync Parameters: Name: !Ref GlueDataBrewProfileJob Next: Check Check: Type: Task Resource: !GetAtt DQCheckFunction.Arn ResultPath: $.CheckTaskResult Next: NotifyChoice NotifyChoice: Type: Choice Choices: - Next: Notify Variable: "$.CheckTaskResult.status" BooleanEquals: false Default: Success Notify: Type: Task Resource: arn:aws:states:::sns:publish Next: Fail Parameters: TopicArn: !Ref SNSNotifierTopic Message.$: "States.Format('Your AWS Glue DataBrew Data Quality job failed. Job Name: {}', $['JobName'])" Subject: Data Quality check failed. Fail: Type: Fail Success: Type: Succeed Policies: - !Ref StepFunctionsPolicy Outputs: AWSGlueDataBrewProfileJob: Value: Fn::Sub: https://console.aws.amazon.com/databrew/home?region=${AWS::Region}#job-details?job=${GlueDataBrewProfileJob}&tab=details AWSGlueDataBrewRuleset: Value: Fn::Sub: https://console.aws.amazon.com/databrew/home?region=us-east-1#ruleset-details?ruleset=${GlueDataBrewDataQualityRuleset} AWSGlueDataBrewProfileResults: Value: Fn::Sub: https://console.aws.amazon.com/databrew/home?region=${AWS::Region}#dataset-details?dataset=${GlueDataBrewDataset}&tab=data-quality-rules AWSStepFunctionsStateMachine: Value: Fn::Sub: https://console.aws.amazon.com/states/home?region=${AWS::Region}#/statemachines/view/${ProfileJobStepFunction} CommandToDownloadTestData: Value: Fn::Sub: aws s3 cp s3://${GlueDataBrewInputS3Bucket}/${S3Key} . CommandToUploadTestData: Value: !Sub - aws s3 cp ${LocalPath} s3://${GlueDataBrewInputS3Bucket}/${S3Key} - { LocalPath: !Select [2, !Split ['/', !Ref S3Key]] }