# Step Functions Gradual Deployments
These are reference scripts to demonstrate how to do gradual deployments using
AWS Step Functions versions and aliases.

You can use these scripts as inspiration to provision your own gradual
deployments in your CI/CD environments of choice.

The Python example shows how to use an AWS SDK to manage a gradual deployment,
whereas the Bash script shows which AWS CLI commands you can use if you prefer. 
An alternative is to [use CloudFormation for Step Functions Gradual Deployments](TODO: Linkhere).

## Python API example
### Prerequisites
Since this is a Python script, you need the Python 3 runtime.

To run [sfndeploy.py](sfndeploy.py), you will need to
[install boto3](https://aws.amazon.com/sdk-for-python/) and
[configure it with your 
credentials](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration).

tl;dr: `pip install boto3`

### Script Overview
[sfndeploy.py](sfndeploy.py) is a Python 3 script showing how to use the
[boto3](https://aws.amazon.com/sdk-for-python/) AWS SDK for Python to create
gradual deployments with Step Functions.

This script demonstrates the following deployment strategies:
1. Canary - route a small percentage of traffic to the new version initially,
   then after a validation period where no alarms trigger, switch 100% to that
   new version.
2. Linear (aka Rolling) - route a percentage of traffic, which increases over
   time from 0% to 100%, to the new version, rolling back immediately if any
   alarms trigger.
3. All at Once (aka Blue/Green) - immediately switch 100% to the new version,
   monitor the new version and roll-back automatically to the previous version
   if any alarms trigger. 

### Canary
A Canary strategy deploys in two steps: first a small increment of traffic
routes to the new version, and if there are no problems during the set testing
period it will switch 100% of traffic to the new version.

In this script, use `--increment` to set the initial percentage of traffic to
route to the new version. The `--interval` input specifies for how long (in
seconds) the Canary testing period lasts before switching 100% of traffic to
the new version.
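
As a rough illustration of the two canary steps, the sketch below builds the kind of weighted routing list that the boto3 `update_state_machine_alias` call accepts as `routingConfiguration`. The helper names and the ARNs are hypothetical placeholders, and this is not necessarily how `sfndeploy.py` structures it.

```python
def canary_routing(old_version_arn, new_version_arn, increment):
    """Build the routing for the canary step: a weighted split between versions."""
    if not 0 < increment < 100:
        raise ValueError("increment must be between 1 and 99 for a traffic split")
    return [
        {"stateMachineVersionArn": old_version_arn, "weight": 100 - increment},
        {"stateMachineVersionArn": new_version_arn, "weight": increment},
    ]


def full_routing(version_arn):
    """Route all traffic to one version (the final canary step, or a rollback)."""
    return [{"stateMachineVersionArn": version_arn, "weight": 100}]


# Placeholder ARNs, for illustration only.
OLD = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine:1"
NEW = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine:2"

print(canary_routing(OLD, NEW, 5))  # step 1: 95% old, 5% new
print(full_routing(NEW))            # step 2: 100% new
```

The weights in a routing configuration must add up to 100, which is why the old version always receives `100 - increment`.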

#### Example with defaults
Here is an example showing a Canary deploy using the defaults for increment
(5) and interval (120 seconds):
```
./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --file my-dir/sample.asl.json --publish-revision --strategy canary
```
This example will:
- Upload the `sample.asl.json` file as a new revision of the state machine.
- Publish the state machine definition you just uploaded as the next version.
- Initially point 5% of traffic to this new version, using the `my-alias` alias.
  You can change the percentage of traffic with the `--increment` argument.
- Wait for the default period of 120s. You can change this value with the
  `--interval` input.
- Switch 100% of traffic to the new version.

#### Example with values for increment and interval
Now let's switch 30% of traffic to the new version for a test period of
300 seconds. During the 300s the script monitors two different alarms. If
either alarm triggers, the deployment rolls back. If the 300s complete
with no alarms, the script switches 100% of traffic to the new version.

```
./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --publish-revision --strategy canary --increment 30 --interval 300 --alarms MaxCPU "API Error Breach"
```

Note that this invocation does not specify the optional `--file` argument,
so the `--publish-revision` flag will publish the latest revision of the
state machine as the new version without uploading a new definition.

### Linear
A Linear (or Rolling) deployment strategy gradually increases the percentage of
traffic to the new state machine version from 0% to 100%, in regular increments.

For example, an `--increment 20` with `--interval 600` will increase traffic
by 20% every 600 seconds until the new version receives 100% of traffic.

If you set `--alarms`, the script will monitor the specified alarms during the
deployment until 100% of traffic routes to the new version. If any of the
alarms go into the `ALARM` state during the deployment window, the script will
automatically and immediately roll back to the previous version. You can
configure how often the script polls for alarms with `--alarm-polling`.

```
./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --file my-dir/sample.asl.json --publish-revision --strategy linear --increment 20 --interval=600 --alarms MaxCPU "API Error Breach" --history-max 11
```

This example will:
- Upload the `sample.asl.json` file as a new revision of the state machine.
- Publish the state machine revision created in the previous step as the next
  version.
- Route 20% of traffic to the new version for 600s.
- Increase the percentage of traffic directed to the new version by 20% every
  600 seconds.
- Monitor the two alarms every minute, and roll back automatically if an alarm
  triggers.
- The script will then delete historic versions prior to 11 versions ago.

The `increment` does not need to be a factor of 100. The script will increment
linearly until it reaches 100. The script caps the maximum weight at 100. If,
for example, you set `increment` to 15, the script will increment in seven
steps - six steps of 15 to reach weight 90, and then the final step would only
add ten to reach 100. There wouldn't be any further increments in this case.
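
The capping behavior described above can be sketched as a small function. This is a minimal illustration of the arithmetic only; `linear_weights` is a hypothetical name and not taken from `sfndeploy.py`.

```python
def linear_weights(increment):
    """Return the successive weights for the new version, capped at 100."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    weights = []
    weight = 0
    while weight < 100:
        # The last step is shortened so the weight never exceeds 100.
        weight = min(weight + increment, 100)
        weights.append(weight)
    return weights


print(linear_weights(15))  # [15, 30, 45, 60, 75, 90, 100]
print(linear_weights(20))  # [20, 40, 60, 80, 100]
```

With `increment` set to 15 the list has seven entries, matching the seven steps in the example above.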

### All at Once
An All at Once strategy routes 100% of traffic to the new version immediately,
then monitors for problems during a configurable period. This is useful to
support Blue/Green style deployment where you test the Green version first, then
switch all your production traffic to that version. If any alarms trigger,
the script will automatically roll back the alias to point to the Blue version.

You can set the monitoring period with `--interval` (in seconds).

This deployment strategy ignores the `--increment` input.

```
./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --file my-dir/sample.asl.json --publish-revision --strategy allatonce --interval=500 --alarms MaxCPU "API Error Breach" --history-max 10
```

This example will:
- Upload the `sample.asl.json` file as a new revision of the state machine.
- Publish the state machine you just uploaded as the next version.
- Point 100% of traffic to this new version, using the alias.
- Monitor the two alarms for 500s, and roll back automatically if an alarm
  triggers.
- If there are no alarms during this period, the deploy was a success.
- The script will then delete historic versions prior to ten versions ago.

If you do not pass the optional `--file` argument, the `--publish-revision` flag
will just publish the latest revision of the state machine to the new version
without first uploading a new definition from a local file.
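
The difference between publishing with and without a file can be sketched with the boto3 Step Functions calls `update_state_machine` and `publish_state_machine_version`. The function name `upload_and_publish` is hypothetical, and `sfn` is assumed to be a client created with something like `boto3.client("stepfunctions", region_name="us-east-1")`. The real script may handle this differently.

```python
def upload_and_publish(sfn, state_machine_arn, definition=None):
    """Optionally upload a new definition, then publish a version.

    `sfn` is assumed to be a boto3 Step Functions client.
    `definition` is the state machine definition as a JSON string, or None.
    """
    publish_args = {"stateMachineArn": state_machine_arn}
    if definition is not None:
        # Uploading a definition creates a new revision of the state machine.
        updated = sfn.update_state_machine(
            stateMachineArn=state_machine_arn, definition=definition
        )
        # Passing revisionId makes the publish fail if the current revision
        # is no longer the one that was just uploaded.
        publish_args["revisionId"] = updated["revisionId"]
    # Without revisionId, the current latest revision is published.
    published = sfn.publish_state_machine_version(**publish_args)
    return published["stateMachineVersionArn"]
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without touching a real AWS account.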

### CLI inputs
To get CLI input help, pass `--help`:
```
./sfndeploy.py --help
```

Here is a summary of the inputs:
```
❯ ./sfndeploy.py --help
usage: sfndeploy [-h] --state-machine STATE_MACHINE --alias ALIAS --region REGION
                 [--strategy {allatonce,canary,linear}] [--alarms [ALARMS ...]]
                 [--file SM_FILE] [--publish-revision | --no-publish-revision]
                 [--increment INCREMENT] [--interval INTERVAL]
                 [--alarm-polling ALARM_POLLING] [--history-max HISTORY_MAX]
                 [--force | --no-force]

Gradually deploy AWS Step Functions state machines.

options:
  -h, --help            show this help message and exit
  --state-machine STATE_MACHINE
                        Name of the state machine (not ARN).
  --alias ALIAS         Name of alias.
  --region REGION       Region name. e.g 'us-east-1'
  --strategy {allatonce,canary,linear}
                        The type of deployment to do. By default will deploy AllAtOnce.
  --alarms [ALARMS ...]
                        Optional list of CloudWatch alarm names to monitor during
                        deployment.
  --file SM_FILE        Optional path to state machine definition file to deploy. Will
                        upload this file as the latest revision of the state machine. If
                        you don't set this, will use the current latest revision.
  --publish-revision, --no-publish-revision
                        Publish the current revision to the next version.
  --increment INCREMENT
                        The increment for weight increase during deploy strategy, from
                        0-100%. Just input the number, not the % sign.
  --interval INTERVAL   The interval in seconds at which to increase weight during the
                        deploy strategy.
  --alarm-polling ALARM_POLLING
                        Poll alarms at this interval in seconds. Default 60s.
  --history-max HISTORY_MAX
                        Maximum number of versions to keep in history. Will delete
                        versions older than this. Set to 0 to disable (this is the
                        default). There is a 1000 version limit in Step Functions.
  --force, --no-force   Force the deploy to start, even if the alias is not currently
                        pointing 100% at the old version. This may be required to recover
                        from a previous deploy that failed and didn't roll back correctly.
                        This means you might be overwriting an in-progress deploy, or that
                        something went wrong in a previous deploy. Be careful when
                        combining with publish_revision - if you just rerun the script you
                        might force publish a previously uploaded revision without
                        testing.
```

### Version History Deletion
Step Functions limits the number of versions per state machine to 1000. As you
release new versions of a state machine, the older versions remain in the
state machine. This can be useful because you might need to rollback to a
previous version.

To keep historic versions from building up to the limit of 1000, you
need to trim your version history by deleting older versions once you are sure
that you do not need them anymore.

This script provides an automatic version history deletion mechanism that runs
after a deploy completes. You enable this with the `--history-max` argument.
The script will delete any versions prior to `n` versions ago, where `n` is the
number you pass to `--history-max`.

For example, if you pass `--history-max 5`, the script will only keep five
versions and delete any versions prior to that.

Carefully consider when a previous version is ready for deletion - you might
need to rollback to it or to refer to it for auditing purposes. Once you
delete a state machine version, it is gone forever.
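
To make the trimming rule concrete, here is a minimal sketch of selecting which versions fall outside the history window. It works on plain version numbers and assumes the newest `n` versions are the ones kept. `versions_to_delete` is a hypothetical helper, and the actual script may count the window differently.

```python
def versions_to_delete(version_numbers, history_max):
    """Return the version numbers that fall outside the newest `history_max`."""
    if history_max <= 0:
        # 0 disables deletion, mirroring the documented default.
        return []
    newest_first = sorted(version_numbers, reverse=True)
    return newest_first[history_max:]


# Versions 1 through 8 exist; keep the newest five.
print(versions_to_delete(range(1, 9), 5))  # [3, 2, 1]
print(versions_to_delete(range(1, 9), 0))  # []
```

In a real deployment each selected version would then be removed with the boto3 `delete_state_machine_version` call, which cannot be undone.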

### Alarm Polling Frequency
By default, the script polls for alarms every 60 seconds. This is because many
AWS services only have an alarm granularity of 60s.

You can set the polling frequency with the `--alarm-polling` argument.
For example, set `--alarm-polling 23` and the script will poll all the
`--alarms` every 23 seconds.

The alarm polling interval is completely independent from `--interval`, so
`--interval` does NOT need to be evenly divisible by `--alarm-polling`.

Take care to align how often you poll for alarms with the deployment window
that you set with `--interval`. If `--alarm-polling` is high relative to
`--interval` the deployment window could finish before the script polls the
alarms.
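
The interaction between the two settings can be illustrated with a little arithmetic. The sketch below assumes the script waits one polling period before each poll, which is an assumption about timing and not something confirmed from `sfndeploy.py`. `poll_offsets` is a hypothetical helper.

```python
def poll_offsets(interval, alarm_polling):
    """Seconds into the deployment window at which an alarm poll would occur.

    Assumes a wait of `alarm_polling` seconds before every poll.
    """
    if alarm_polling <= 0:
        raise ValueError("alarm_polling must be positive")
    return list(range(alarm_polling, interval + 1, alarm_polling))


print(poll_offsets(120, 23))  # [23, 46, 69, 92, 115]
print(poll_offsets(120, 60))  # [60, 120]
print(poll_offsets(50, 60))   # [] - the window ends before the first poll
```

Under that assumption, a 60 second polling period with a 50 second interval would never check the alarms at all.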

### Script Process Flow
The script takes the following actions:
1. If `--file` is specified, upload that file as the new revision for the state
   machine.
2. If `--publish-revision` set, publish the latest revision of the state machine
   as the next version. This will become the new version to deploy. If you
   combine this with the `--file` input, this will publish the file you just
   uploaded as the next version.
3. Note that if the script fails later on it will NOT undeploy any revision
   uploaded or promoted to a version in the first two steps.
   See [CloudFormation](TODO: link here) for provisioning with full rollback.
4. If `--publish-revision` is not set, the most recent published version of the
   state machine will deploy. This is useful if you have some other
   process or tool that updates your state machine definitions, and you just
   want to use this script as a way to switch the alias from the old version to
   that new version.
5. Create the specified alias if it does not exist. If the alias didn't exist,
   route 100% of traffic to the new version and exit the script. This is because
   it would be a first deploy, so there is no rollback possible.
6. Start routing traffic to the alias using the deployment strategy set by
   `--strategy` (AllAtOnce, Linear, or Canary). The `--increment` and
   `--interval` arguments govern how the strategy you select behaves.
7. Monitor any `--alarms` specified during the entire deployment period and
   rollback automatically if any of these go into ALARM state.
8. If deployment completes successfully, keep the number of versions set by
   `history_max` and delete state machine versions prior to that. The default
   value of 0 for `history_max` disables this deletion of old versions, but
   remember there is a limit of 1000 versions per state machine.
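
The alarm check in step 7 could look roughly like the sketch below, which uses the CloudWatch `describe_alarms` call. `alarms_firing` is a hypothetical name, and `cloudwatch` is assumed to be a client created with something like `boto3.client("cloudwatch", region_name="us-east-1")`. Pagination is not handled, so this only suits a short list of alarm names.

```python
def alarms_firing(cloudwatch, alarm_names):
    """Return the names of the given alarms that are in the ALARM state.

    `cloudwatch` is assumed to be a boto3 CloudWatch client.
    """
    if not alarm_names:
        return []
    response = cloudwatch.describe_alarms(
        AlarmNames=list(alarm_names),
        # Without AlarmTypes, only metric alarms are returned.
        AlarmTypes=["MetricAlarm", "CompositeAlarm"],
    )
    alarms = response.get("MetricAlarms", []) + response.get("CompositeAlarms", [])
    return [a["AlarmName"] for a in alarms if a["StateValue"] == "ALARM"]
```

A deployment loop would call this once per polling period and switch the alias back to 100% on the previous version as soon as the returned list is non-empty.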

### Unit tests
You can run the unit tests for `sfndeploy.py` like this:

```
python -m unittest sfndeploy_test.py
```

## AWS CLI example
This bash script shows how to use AWS CLI commands to do a gradual deployment.

### Prerequisites
To run [sfn-canary-deploy.sh](sfn-canary-deploy.sh), you will need the
[AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
installed and configured.

### CLI Script Overview
[sfn-canary-deploy.sh](sfn-canary-deploy.sh) is a bash script showing how
to use the AWS CLI to create and manage a Canary-style deployment. For
AllAtOnce or Linear deployments, see the Python version above.

The script does the following:
1. Publish the most recent revision as the next version of the state machine if
   `publish_revision` is true. This will become the new live version.
2. If `publish_revision` is false, the most recent published version of the
   state machine will deploy.
3. Create the alias if it doesn’t exist yet. If the alias didn't exist,
   point 100% of traffic for this alias to the new version, then exit the
   script.
4. Update the routing configuration for the alias to direct a small
   percentage of traffic from the previous to the new version. You set this
   canary percentage with `canary_percentage`.
5. Monitor the configurable CloudWatch alarms every 60s by default. If any of
   these alarms trigger, roll back the deployment immediately by pointing 100%
   of traffic to the known-good previous version. The script keeps polling the
   alarms every `alarm_polling_interval` seconds until
   `canary_interval_seconds` have passed.
6. If there were no alarms during the canary interval, shift 100% of traffic to
   the new version. You set this interval with `canary_interval_seconds`.
7. Upon successful deployment, delete any versions older than `history_max`.

## Gradual Deployments from CI/CD tools
Here are some tips to get you started with popular CD platforms:

### Jenkins
You can run your customized Bash or Python script on Jenkins by using the `sh`
step in the `Jenkinsfile`.

```
pipeline {
    agent any

    stages {
        stage('Build') {
            steps {
                echo 'Building..'
            }
        }
        stage('Test') {
            steps {
                echo 'Testing..'
            }
        }
        stage('Gradual Deploy') {
            steps {
                sh '/path/to/gradual-deploy-script.sh'
            }
        }
    }
}
```

You have some options to configure the prerequisites:
- If you want to run your script directly from the Jenkins pipeline, you must
  install and configure your prerequisites on the Jenkins server instance - in
  this case the AWS CLI for the Bash script or Boto3 for the Python script.
- The Jenkins user must have AWS credentials to access the Step Functions
  service.
- If you are using the standard Amazon Machine Image (AMI) as a base for your
  Jenkins installation, this already contains the prerequisites.
- Alternatively, if you want to use custom Docker images to encapsulate your
  dependencies and scripts, you can use the
  [Docker Pipeline Plugin](https://plugins.jenkins.io/docker-workflow/) and let
  [Jenkins run your scripts inside the container](https://www.jenkins.io/doc/pipeline/tour/hello-world/#python).

### Spinnaker
Use the [Jenkins](https://spinnaker.io/docs/reference/pipeline/stages/#jenkins)
stage or the [Script](https://spinnaker.io/docs/setup/other_config/features/script-stage/)
stage in Spinnaker to run a custom shell or Python script from your pipeline.

With the Script stage, Spinnaker uses Jenkins to sandbox your scripts, so you
need to set up a Jenkins instance in order to use it.

In your Spinnaker deck, select:
- Add Stage.
- Select the `Script` type of stage.
- Under `Command` enter your script invocation.
- Set `Depends On` if there’s a preceding stage that should run before your
  custom script.

Alternatively, you can encapsulate your logic and its dependencies in a
container and execute it with a
[Run Job stage](https://spinnaker.io/docs/guides/user/kubernetes-v2/run-job-manifest/).

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gradual-deploy
spec:
  backoffLimit: 0
  template:
    spec:
      containers:
        - command:
            - python
            - path/to/my/script.py
          image: 'myrepo/mycontainer:1.2.3'
          name: my-custom-script
      restartPolicy: Never
```

## Warning
Remember that creating and running resources in AWS costs money. Take care to
delete resources when you're done to avoid billing surprises.

All the scripts in this repo are examples that are not meant for production
systems. The scripts here do not clean up or release resources when finished.
Take care and run at your own risk.