# Reliability Best Practices

## Architecture

### REL_ARC_01: Define Recovery Time and Point Objectives

Understand the criticality of your X-Road infrastructure to your customers - internal and external. Define how fast
do you need to recover from a failure (Recovery Time Objective, RTO) and how much data loss can you tolerate (Recovery
Point Objective, RPO). Adjust your backup methods and frequency accordingly. Align these objectives with the objectives
of your customers to avoid X-Road infrastructure becoming the weakest link in the system, but also to avoid cost 
overheads.

**Recommended tools:**
* [Example Implementations for Availability Goals](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/example-implementations-for-availability-goals.html)


### REL_ARC_02: Separate Compute and Storage

By separating storage from the compute layer, it becomes easier to replace and scale each separately. For Security Servers
running as EC2 instances, you can pick between EBS volumes or EFS file systems for storage. With sidecar container deployments, 
mounting EFS file systems is the best option for persistent storage. 

Prefer using an Amazon RDS database over the built-in PostgreSQL option for best performance and availability. Consider
sharing an RDS cluster between multiple Security Servers over setting up an RDS cluster per Security Server. RDS supports multiple availability zones out of the box, which makes them highly available and reliable even in case of an availability zone failure. More info for RDS multi AZ support [here](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html)

**Recommended tools:**
* [Amazon Elastic Block Store](https://aws.amazon.com/ebs/)
* [Amazon Elastic File System](https://aws.amazon.com/efs/)
* [Amazon Relational Database Service](https://aws.amazon.com/rds/)

### REL_ARC_03: Scale Horizontally to Increase Availability

Instead of one large Security Server, deploy multiple small ones to reduce the impact of a single failure on the 
overall workload. Distribute requests across Security Servers to ensure that they don’t share a common point of 
failure. Additionaly ensure that the when creating mutiple instances of the security server the instances are spread across different availability zones for high availability and reliability. Information on how to do cross zone load balancing can be found [here](https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html)

**Recommended tools:**
* [Amazon EC2 Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html)

## Change Management

### REL_CHM_01: Integrate Testing as Part of Your Deployment

Before deploying an updated version of a Security Server to a production environment, deploy it in a test environment.
Run automated integration tests that involve connecting to / from the subsystem(s) behind the Security Server. 
Verify that the Security Server management UI is available for operator access. When tests succeed, deploy the changes
into a production environment.

### REL_CHM_02: Deploy Using Immutable Infrastructure

Prefer separately staged changes to the Security Servers to in-place changes. Build Security Server images periodically
up-front in an automated build pipeline to get the latest versions of required operating system packages. Disable 
automatic in-place upgrades of packages on the Security Server in favor of replacing the Security Servers from the
latest verified machine image.

**Recommended tools:**
* [Amazon Machine Images](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html)

### REL_CHM_3: Deploy With Automation

Minimize the risk of human error by automating Security Server deployments. Automate both roll-forward and roll-back
scenarios, such that any deployment can be stopped and reverted when needed. Integrate deployment automation with 
canary tests, such that you can assess system health during and after the deployment has occurred. 

## Failure Management

### REL_FLM_01: Back Up Data

Configure automatic backups for Security Server databases, through RDS snapshots and point-in-time recovery. Configure
logs to be backed up to S3 in addition to CloudWatch for auditing purposes. Periodically verify that backups can be
used in a disaster recovery scenario. Also, don't forget to transfer the message log archive files to an external
storage, for example, EFS file system or S3 bucket.

Recommended tools:
* [Amazon RDS Backups](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html)

### REL_FLM_02: Design to Withstand Component Failures

Run at least two Security Server instances behind a load balancer in separate availability zones, with at least two 
database instances (writer and reader) in the same availability zones for lowest latency and minimum downtime in the 
situation of component failure. Configure health checks on Security Servers to allow for the auto-scaling group to 
replace a failed instance.

For sidecar Security Servers (container-based deployments), define your workload as an ECS or EKS service, to let
the container platform manage the lifecycles of Security Server containers.

### REL_FLM_03: Test Reliability

Periodically verify recovery procedures. Adopt chaos engineering principles to introduce failures in your security
server environment. Verify that you can quickly recover from database failures or corruptions. Verify recovery from
the loss of a single Security Server instance.

**Recommended tools:**
* [AWS Fault Injection Simulator](https://aws.amazon.com/fis/)

---

**Previous Topic:** [Security](security.md)

**Next Topic:** [Performance Efficiency](performance-efficiency.md)