# Parquet Sink/Output Codec

This is an implementation of the Parquet Sink Codec, which parses Data Prepper Events into Parquet records and writes them to the underlying OutputStream.

## Usages

The Parquet Output Codec can be configured with sink plugins (e.g. S3 Sink) in the pipeline file.

## Configuration Options

```
pipeline:
  ...
  sink:
    - s3:
        aws:
          region: us-east-1
          sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
          sts_header_overrides:
        max_retries: 5
        bucket: bucket_name
        object_key:
          path_prefix: my-elb/%{yyyy}/%{MM}/%{dd}/
        threshold:
          event_count: 2000
          maximum_size: 50mb
          event_collect_timeout: 15s
        codec:
          parquet:
            schema: "{\"namespace\": \"org.example.test\", \"type\": \"record\", \"name\": \"TestMessage\", \"fields\": [{\"name\": \"name\", \"type\": \"string\"}, {\"name\": \"age\", \"type\": \"int\"}]}"
            schema_file_location: "C:\\Path\\to\\your\\schema.json"
            schema_registry_url: https://your.schema.registry.url.com
            exclude_keys:
              - s3
            region:
            bucket:
            path_prefix:
        buffer_type: in_memory
```

## AWS Configuration

### Codec Configuration:

1) `schema`: A JSON string that the user can provide directly in the pipeline YAML file. The codec parses the schema object from this string.
2) `schema_file_location`: Path to a schema JSON file through which the user can provide the schema.
3) `exclude_keys`: Keys of the event that the user wants to exclude while converting it to a Parquet record.
4) `schema_registry_url`: Another way of providing the schema, through a schema registry.
5) `region`: AWS region of the S3 bucket which the user wants to use as a buffer for records parsed by the Parquet Output Codec.
6) `bucket`: Name of the S3 bucket which the user wants to use as a buffer for records parsed by the Parquet Output Codec.
7) `path_prefix`: Path to the folder within the S3 bucket where the intermediate files should be written.
8) `schema_bucket`: Name of the S3 bucket in which the `schema.json` file is kept.
9) `file_key`: File key of the `schema.json` file kept in the S3 bucket.
10) `schema_region`: AWS region of the S3 bucket in which the `schema.json` file is kept.

### Note:

1) The user can provide only one schema at a time, i.e. through only one of the ways listed in the codec configuration.
2) If the user wants tags to be part of the resultant Parquet data and has configured `tagsTargetKey` in the config file, the schema must also be modified to accommodate the tags by adding another field to the `schema.json` file: `{ "name": "yourTagsTargetKey", "type": { "type": "array", "items": "string" } }` (see the example schema file at the end of this README).
3) The user must provide a valid `region`, `bucket` and `path_prefix` for the codec to work.
4) If the user wants to provide the schema through a `schema.json` file kept in S3, the corresponding details must also be provided, i.e. the region, bucket name and file key of that file (see the configuration sketch at the end of this README).

## Developer Guide

This plugin is compatible with Java 11. See below:

- [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md)
- [monitoring](https://github.com/opensearch-project/data-prepper/blob/main/docs/monitoring.md)

The integration tests for this plugin do not run as part of the Data Prepper build. The following command runs the integration tests:

```
./gradlew :data-prepper-plugins:s3-sink:integrationTest -Dtests.s3sink.region= -Dtests.s3sink.bucket=
```
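
## Example schema file

For reference, here is a minimal sketch of a `schema.json` file matching the example schema in the configuration above, with the additional tags field described in Note 2. The field name `yourTagsTargetKey` is only a placeholder and should match the `tagsTargetKey` configured for the sink.

```
{
  "namespace": "org.example.test",
  "type": "record",
  "name": "TestMessage",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "yourTagsTargetKey", "type": {"type": "array", "items": "string"}}
  ]
}
```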
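
If the schema is instead kept in S3 (Note 4), the codec options might look like the sketch below. This assumes that `schema_bucket`, `file_key` and `schema_region` are nested under the `parquet:` codec block like the other codec options; the bucket name and file key shown are placeholders.

```
        codec:
          parquet:
            schema_bucket: my-schema-bucket
            file_key: path/to/schema.json
            schema_region: us-east-1
```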