# Operational Excellence Best Practices ## Development ### OPS_DEV_01: Codify Infrastructure In the cloud, you can apply the same engineering discipline that you use for application code to your entire environment. Define your Security Server infrastructure as templates and scripts. Store the codified infrastructure in a version control system. Make frequent, small, reversible changes to your templates and scripts. Apply code review practices to reduce human errors. **Recommended Tools** * [AWS CloudFormation](https://aws.amazon.com/cloudformation/) * [AWS Cloud Development Kit](https://aws.amazon.com/cdk/) * [AWS CodeCommit](https://aws.amazon.com/codecommit/) * [AWS CodePipeline](https://aws.amazon.com/codepipeline/) * [AWS Cloud9](https://aws.amazon.com/cloud9/) **Example** ![Infrastructure Example](img/ops-codify-infrastructure.png) When working on Security Server setup, system engineers use Cloud9 as a co-working, online integrated development environment to author infrastructure changes as code. Engineers use the Cloud Development Kit to work in a more traditional programming language, synthesizing their infrastructure code into CloudFormation templates or work on CloudFormation templates directly. Templates are stored in a Git repository in CodeCommit and picked up by CodePipeline to facilitate automated deployment. ### OPS_DEV_02: Test and Validate Changes Similarly to how you can apply coding practices to both software and infrastructure, you should apply testing and validation practices to both. Build a pipeline that would deploy the latest version of the Security Server packages into a pre-production (test) environment. **Recommended tools:** * [AWS CloudFormation](https://aws.amazon.com/cloudformation/) * [AWS CodePipeline](https://aws.amazon.com/codepipeline/) * [CDK Pipelines](https://aws.amazon.com/blogs/developer/cdk-pipelines-continuous-delivery-for-aws-cdk-applications/) **Example:** ![Test and Validate Changes](img/ops-test-validate-resources.png) AWS resources, like Security Server instances, databases and security groups are deployed in separate VPCs for different X-Road environments. Depending on your security and governance need, the consumer and producer information systems can be either deployed into the same VPCs as corresponding Security Servers, into separate VPCs or into completely separate AWS accounts. ### OPS_DEV_03: Use Configuration Management Manage and track configuration changes externally to the Security Servers, either through source control or a configuration management service. Store sensitive configuration (user credentials, keystore passwords etc.), in Secrets Manager, SSM Parameter Store or any other service that enables the safe handling of secret data. **Recommended tools:** * [AWS Systems Manager](https://aws.amazon.com/systems-manager/) * [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/) ![Use Configuration Management](img/ops-config-management.png) Configuration parameters that change over time can be stored in AWS Systems Manager Parameter Store. This allows the parameters to be fetched when the Security Server starts. When using the containerized version of the security server, the configuration can directly be referred to when setting up the environment for the container. Keep the parameters containing secrets (usernames, passwords) in AWS Secrets Manager to enable automatic rotation. ## Deployment ### OPS_DEP_01: Automate Deployment Automate Security Server deployments, such that they are repeatable without user intervention. This minimizes human error in the deployment process, enables automated recovery from infrastructure failures, and allows you to create test environments with relative ease. For more complex deployment processes use AWS Step Functions for coordination. **Recommended tools:** * [AWS CodePipeline](https://aws.amazon.com/codepipeline/) * [AWS CodeDeploy](https://aws.amazon.com/codedeploy/) * [AWS Step Functions](https://aws.amazon.com/step-functions/) ### OPS_DEP_02: Perform Rolling Deployments When deploying a newer version of the Security Server, roll your environment gradually over to the new version, without impacting availability. For example, if your environment contains two Security Servers in high availability configuration, start a deployment by deploying a third Security Server and direct some X-Road request traffic to that server. If the server behaves as expected, decommission one of the original two servers and deploy a second copy of the new version, eventually finishing the deployment by terminating the second original Security Server. If at any point in this process, a Security Server or its infrastructure should fail, you will still have at least one healthy Security Server serving a part of the traffic. **Recommended tools:** * [EC2 Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html) * [CodeDeploy](https://aws.amazon.com/codedeploy/) ## Monitoring ### OPS_MON_01: Collect Logs and Metrics Centrally Collect system logs and health metrics from Security Servers into a central log storage that can be used to analyze these and take action upon anomalies found in the logs. Use Amazon CloudWatch Agent to collect logs from Security Servers. If the CloudWatch agent cannot be installed, you can mount an Elastic File System share to your Security Server and write the logs to a mounted file system. To store logs for a longer period of time, store these in S3. Configure S3 intelligent tiering to automatically optimize your storage costs, depending on the frequency of access of logs. **Recommended tools:** * [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) * [Amazon Elastic File System](https://aws.amazon.com/efs/) * [Amazon S3](https://aws.amazon.com/s3/) * [Amazon S3 Storage Classes](https://aws.amazon.com/s3/storage-classes/) * [Amazon EventBridge](https://aws.amazon.com/eventbridge/) * [AWS Lambda](https://aws.amazon.com/lambda/) **Example:** ![Central Logging](img/ops-central-logging.png) CloudWatch can be used as the central hub for collecting and working with near-realtime logs. One of the easiest ways to collect logs to CloudWatch is by using the CloudWatch Agent. If using the agent is not an option, logs can be written to a mounted Elastic File System volume and delivered to CloudWatch using a periodically scheduled Lambda function. If logs need to be stored for longer periods of time than actively used (e.g. for auditing purposes), a log export task can be created periodically to export the specific log groups to S3 for long term storage. ### OPS_MON_02: Build a Monitoring Dashboard Build a dashboard that surfaces the most critical metrics about Security Servers that you need to assess system health. For example: 1. Server CPU and Memory usage 2. Number of errors detected in system logs over a period of x minutes 3. Number of healthy Security Servers 4. Number of failed health checks over a period of x minutes Review and improve the metrics and dashboard periodically to minimize the time that it takes for you to diagnose problems occurring in your X-Road environments. Recommended tools: * [CloudWatch Dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) ### OPS_MON_03: Define Alarms Define thresholds in health metrics that could indicate problems with Security Servers. Use alarms to monitor these metrics and notify human operators when then thresholds have been breached. For example: 1. Server CPU or Memory usage is over 80% for more than x data points. 2. More than x errors detected in system logs over a period of y minutes. 3. Number of healthy Security Servers drops below x% of the total. 4. A health check fails x times in a row. Review and improve the thresholds, alarm triggers and notification content and mechanisms to decrease the number of false alarms, decrease the time it takes to receive a significant notification and decrease the time it takes for an operator from seeing the notification to understanding the problem. **Recommended tools:** * [CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) **Example:** ![Monitoring Flow](img/ops-monitoring.png) --- **Next Topic:** [Security](security.md)