{ "paragraphs": [ { "text": "%md\n## Amazon Kinesis Data Analytics for Apache Flink application\n\nIn this notebook, a new Amazon Kinesis Data Analytics for SQL Application will be initialized and created. The notebook identifies the source/destination MSK topics and defines schema for the respective topics via AWS Glue Schema Registry. In addition, this notebook creates a streaming SQL query to aggregate data from the source MSK topic and store the results in the destination MSK topic. \n\nWith a simple click of a few buttons the process to identify records in the source MSK topic, define logic and push it to the destination topic is accomplished. By using Flink SQL, the logic is understandable in natural language without any need to learn programming languages. Once the data transformation logic is defined in this notebook, you can build your code and export it to Amazon S3. You can promote the code that you wrote in your note to a continuously running stream processing application. After you deploy a note to run in streaming mode, Kinesis Data Analytics creates an application for you that runs continuously, reads data from your sources, writes to your destinations, maintains long-running application state, and autoscales automatically based on the throughput of your source streams. Documentation on this topic is available at https://docs.aws.amazon.com/kinesisanalytics/latest/dev/what-is.html.\n\nPLEASE NOTE WE DO NOT EXCECUTE INDIVIDUAL CELLS IN THIS NOTEBOOK, RATHER WE BUILD AND DEPLOY INTO AN EXECUTABLE APPLICATION.", "user": "anonymous", "dateUpdated": "2023-04-12T22:40:54+0000", "progress": 0, "config": { "tableHide": false, "editorSetting": { "language": "markdown", "editOnDblClick": true, "completionSupport": false }, "editorMode": "ace/mode/markdown", "colWidth": 12, "editorHide": true, "fontSize": 9, "results": {}, "enabled": true }, "settings": { "params": {}, "forms": {} }, "results": { "code": "SUCCESS", "msg": [ { "type": "HTML", "data": "
\n\nBefore proceeding, fetch the MSK cluster connection string (the bootstrap brokers string). For example, in the Amazon MSK console, select your cluster and choose View client information to copy the bootstrap servers string; it is the value used for the 'properties.bootstrap.servers' connector option in the table definitions below.
\nWe are using Flink's connector for Apache Kafka to interact with Kafka topics. The Kafka connector maps a Kafka topic to a table that you create. When you select data from the table, the connector acts as a Kafka consumer and reads messages from the topic; when you insert data into the table, the connector acts as a Kafka producer and writes messages to the topic. The topic is specified by the 'topic' option in the CREATE TABLE SQL statement, as shown in the sketch below. See https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/table/kafka/#how-to-create-a-kafka-table.
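\nThe following is a minimal sketch of such a table definition, assuming hypothetical topic, broker, and column names (replace them with the values for your environment). The connector options follow the Flink 1.13 Kafka connector documentation linked above.

```sql
%flink.ssql

-- Minimal sketch of a Kafka-backed source table. The topic, brokers, and
-- columns below are hypothetical placeholders.
CREATE TABLE source_transactions (
  transaction_id STRING,
  card_number    STRING,
  amount         DOUBLE,
  event_time     TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'source-topic',          -- the MSK topic this table maps to
  'properties.bootstrap.servers' = '<msk-bootstrap-brokers>',  -- from the MSK console step above
  'properties.group.id' = 'kda-studio-consumer',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);
```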
\nWhen you create a table in a Kinesis Data Analytics Studio notebook, an accompanying table is also created in the AWS Glue Data Catalog. The AWS Glue Data Catalog is a persistent metadata store that serves as a central repository of table definitions, and you can use it to quickly discover and search across multiple AWS datasets. Kinesis Data Analytics Studio integrates with the Glue Data Catalog, where you can define the schemas for your source and destination tables. This allows other applications, whether Apache Flink applications, batch applications, or anything else, to reference the same schema that we define in this notebook instead of creating an entirely separate copy, which keeps applications working with the same data consistent.
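\nAs a quick check (assuming the hypothetical source_transactions table sketched above), you can confirm from the notebook that the definition landed in the catalog, where any other application can reference it by name:

```sql
%flink.ssql

-- Both statements read table metadata from the Glue Data Catalog backing this notebook.
SHOW TABLES;
DESCRIBE source_transactions;
```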
\n\nWhen executing the 4_streaming_predictions.ipynb notebook from SageMaker Studio, the put_to_topic function streams credit card transactions to the source MSK topic. From there, the InvokeFraudEndpoint Lambda function predicts potentially fraudulent transactions. In addition, our KDA Flink application aggregates all transactions from the last 10 minutes and streams the results to the destination topic for downstream processing. In the SQL cell below, we define the aggregation logic by tying together the source and destination topics: the SQL selects from the credit card source topic, computes the counts and averages, and writes to the credit card destination topic. To learn more about Flink range aggregations, see https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/over-agg/#range-definitions.
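\nA sketch of that aggregation follows, assuming the hypothetical source_transactions table from above and a destination_transactions Kafka table created the same way with a matching schema; it uses the RANGE definition from the linked documentation to aggregate over the preceding 10 minutes per card.

```sql
%flink.ssql(type=update)

-- For each incoming transaction, emit the count and average amount over the
-- preceding 10 minutes for the same card, writing results to the destination
-- topic through the destination_transactions table.
INSERT INTO destination_transactions
SELECT
  card_number,
  COUNT(*)    OVER w AS txn_count_10m,
  AVG(amount) OVER w AS avg_amount_10m,
  event_time
FROM source_transactions
WINDOW w AS (
  PARTITION BY card_number
  ORDER BY event_time
  RANGE BETWEEN INTERVAL '10' MINUTE PRECEDING AND CURRENT ROW
);
```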