# Batch Graduation

* **Original Author(s):**: @comcalvi
* **Tracking Issue**: #485
* **API Bar Raiser**: @rix0rrr

This proposes a new set of L2 constructs for the aws-batch module.

[Existing (experimental) L2 API docs](https://docs.aws.amazon.com/cdk/api/v1/docs/aws-batch-readme.html)
[CloudFormation resource reference](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_Batch.html)

## Batch Overview

Batch has four resources:

* `ComputeEnvironment`
  * Can be `managed` or `unmanaged`
    * `managed` can use EC2 or Fargate resources
    * `unmanaged` can only use EC2 resources.
  * Can use EC2 or Fargate
    * Each of these can be regular or SPOT type.
    * EC2 can then specify EKS or ECS.
  * ImageId has been deprecated!
    * Likely in favor of EC2Configuration
    * see [API Ref](https://docs.aws.amazon.com/batch/latest/APIReference/API_ComputeResource.html)
  * Subnets work with any managed ComputeEnvironment, but you cannot use Local Zones with Fargate
* `JobDefinition`
  * Can be `multinode` or `container`
    * `container`s can be ECS or EKS
  * Can run on EC2 or Fargate resources
    * `multinode` only runs on EC2
      * does not work on spot instances!
    * `container` can use EC2 or Fargate
* `JobQueue`
  * This just uses a `ComputeEnvironment` with an order
    * Why not add the order to the `ComputeEnvironment`? `ComputeEnvironment`s can be reused across different `JobQueue`s.
* `SchedulingPolicy`
  * The only type of `SchedulingPolicy` is a `FairsharePolicy`
  * These let you define Job shares, which associate a Job with a `weightFactor`.
    The `weightFactor` is inversely correlated with the compute resources that are assigned to a particular Job

`ComputeEnvironment`s are used by `JobQueue`s to run incoming Jobs.
When a user submits a job to a `JobQueue`, the `JobQueue` chooses one of its available `ComputeEnvironment`s to run the job.
Jobs are defined by `JobDefinition`s, which are of type either `multinode` or `container`.
`multinode` jobs are only compatible with `ComputeEnvironment`s that use `managed` ec2 instances,
whereas `container` jobs are compatible with any type of `ComputeEnvironment`.

`SchedulingPolicy`s are used by `JobQueues` to decide which resources to allocate to which Jobs.

## Use Cases & Examples

### Cost Optimization

#### Spot Instances

Spot instances are significantly discounted EC2 instances that can be reclaimed at any time by AWS.
Workloads that are fault-tolerant or stateless can take advantage of spot pricing.
To use spot spot instances, set `spot` to `true` on a managed Ec2 or Fargate Compute Environment:

```ts
const vpc = new ec2.Vpc(this, 'VPC');
new batch.FargateComputeEnvironment(this, 'myFargateComputeEnv', {
  vpc,
  spot: true,
});
```

Batch allows you to specify the percentage of the on-demand instance that the current spot price
must be to provision the instance using the `spotBidPercentage`.
This defaults to 100%, which is the recommended value.
This value cannot be specified for `FargateComputeEnvironment`s
and only applies to `ManagedEc2EcsComputeEnvironment`s.
The following code configures a Compute Environment to only use spot instances that
are at most 20% the price of the on-demand instance price:

```ts
const vpc = new ec2.Vpc(this, 'VPC');
new batch.ManagedEc2EcsComputeEnvironment(this, 'myEc2ComputeEnv', {
   vpc,
   spot: true,
   spotBidPercentage: 20,
});
```

For stateful or otherwise non-interruption-tolerant workflows, omit `spot` or set it to `false` to only provision on-demand instances.

#### Choosing Your Instance Types

Batch allows you to choose the instance types or classes that will run your workload.
This example configures your `ComputeEnvironment` to use only the `M5AD.large` instance:

```ts
const vpc = new ec2.Vpc(this, 'VPC');

new batch.ManagedEc2EcsComputeEnvironment(this, 'myEc2ComputeEnv', {
  vpc,
  instanceTypes: [ec2.InstanceType.of(ec2.InstanceClass.M5AD, ec2.InstanceSize.LARGE)],
});
```

Batch allows you to specify only the instance class and to let it choose the size, which you can do like this:

```ts
const vpc = new ec2.Vpc(this, 'VPC');

declare const computeEnv: batch.IManagedEc2EcsComputeEnvironment
computeEnv.addInstanceClass(ec2.InstanceClass.M5AD);
// Or, specify it on the constructor:
new batch.ManagedEc2EcsComputeEnvironment(this, 'myEc2ComputeEnv', {
  vpc,
  instanceClasses: [ec2.InstanceClass.R4],
});
```

Unless you explicitly specify `useOptimalInstanceClasses: false`, this compute environment will use `'optimal'` instances,
which tells Batch to pick an instance from the C4, M4, and R4 instance families.
*Note*: Batch does not allow specifying instance types or classes with different architectures.
For example, `InstanceClass.A1` cannot be specified alongside `'optimal'`,
because `A1` uses ARM and `'optimal'` uses x86_64.
You can specify both `'optimal'` alongside several different instance types in the same compute environment:

```ts
declare const vpc: ec2.IVpc;

const computeEnv = new batch.ManagedEc2EcsComputeEnvironment(this, 'myEc2ComputeEnv', {
  instanceTypes: [ec2.InstanceType.of(ec2.InstanceClass.M5AD, ec2.InstanceSize.LARGE)],
  useOptimalInstanceClasses: true, // default
  vpc,
});
// Note: this is equivalent to specifying
computeEnv.addInstanceType(ec2.InstanceType.of(ec2.InstanceClass.M5AD, ec2.InstanceSize.LARGE));
computeEnv.addInstanceClass(ec2.InstanceClass.C4);
computeEnv.addInstanceClass(ec2.InstanceClass.M4);
computeEnv.addInstanceClass(ec2.InstanceClass.R4);
```

#### Allocation Strategies

| Allocation Strategy     | Optimized for      | Downsides                     |
| ----------------------- | -------------      | ----------------------------- |
| BEST_FIT                | Cost               | May limit throughput          |
| BEST_FIT_PROGRESSIVE    | Throughput         | May increase cost             |
| SPOT_CAPACITY_OPTIMIZED | Least interruption | Only useful on Spot instances |

Batch provides different Allocation Strategies to help it choose which instances to provision.
If your workflow tolerates interruptions, you should enable `spot` on your `ComputeEnvironment`
and use `SPOT_CAPACITY_OPTIMIZED` (this is the default if `spot` is enabled).
This will tell Batch to choose the instance types from the ones you’ve specified that have
the most spot capacity available to minimize the chance of interruption.
To get the most benefit from your spot instances,
you should allow Batch to choose from as many different instance types as possible.

If your workflow does not tolerate interruptions and you want to minimize your costs at the expense
of potentially longer waiting times, use `AllocationStrategy.BEST_FIT`.
This will choose the lowest-cost instance type that fits all the jobs in the queue.
If instances of that type are not available,
the queue will not choose a new type; instead, it will wait for the instance to become available.
This can stall your `Queue`, with your compute environment only using part of its max capacity
(or none at all) until the `BEST_FIT` instance becomes available.

If you are running a workflow that does not tolerate interruptions and you want to maximize throughput,
you can use `AllocationStrategy.BEST_FIT_PROGRESSIVE`.
This is the default Allocation Strategy if `spot` is `false` or unspecified.
This strategy will examine the Jobs in the queue and choose whichever instance type meets the requirements
of the jobs in the queue and with the lowest cost per vCPU, just as `BEST_FIT`.
However, if not all of the capacity can be filled with this instance type,
it will choose a new next-best instance type to run any jobs that couldn’t fit into the `BEST_FIT` capacity.
To make the most use of this allocation strategy,
it is recommended to use as many instance classes as is feasible for your workload.
This example shows a `ComputeEnvironment` that uses `BEST_FIT_PROGRESSIVE`
with `'optimal'` and `InstanceClass.M5` instance types:

```ts
declare const vpc: ec2.IVpc;

const computeEnv = new batch.ManagedEc2EcsComputeEnvironment(this, 'myEc2ComputeEnv', {
  vpc,
  instanceClasses: [ec2.InstanceClass.M5],
});
```

This example shows a `ComputeEnvironment` that uses `BEST_FIT` with `'optimal'` instances:

```ts
declare const vpc: ec2.IVpc;

const computeEnv = new batch.ManagedEc2EcsComputeEnvironment(this, 'myEc2ComputeEnv', {
  vpc,
  allocationStrategy: batch.AllocationStrategy.BEST_FIT,
});
```

*Note*: `allocationStrategy` cannot be specified on Fargate Compute Environments.

### Controlling vCPU allocation

You can specify the maximum and minimum vCPUs a managed `ComputeEnvironment` can have at any given time.
Batch will *always* maintain `minvCpus` worth of instances in your ComputeEnvironment, even if it is not executing any jobs,
and even if it is disabled. Batch will scale the instances up to `maxvCpus` worth of instances as
jobs exit the JobQueue and enter the ComputeEnvironment. If you use `AllocationStrategy.BEST_FIT_PROGRESSIVE` or `AllocationStrategy.SPOT_CAPACITY_OPTIMIZED`,
batch may exceed `maxvCpus`; it will never exceed `maxvCpus` by more than a single instance type. This example configures a
`minvCpus` of 10 and a `maxvCpus` of 100:

```ts
declare const vpc: ec2.IVpc;

new batch.ManagedEc2EcsComputeEnvironment(this, 'myEc2ComputeEnv', {
  vpc,
  instanceClasses: [ec2.InstanceClass.R4],
  minvCpus: 10,
  maxvCpus: 100,
});
```

Unmanaged `ComputeEnvironment`s do not support `maxvCpus` or `minvCpus` because you must provision and manage the instances yourself;
that is, Batch will not scale them up and down as needed.

### Sharing a ComputeEnvironment between multiple JobQueues

Multiple `JobQueue`s can share the same `ComputeEnvironment`.
If multiple Queues are attempting to submit Jobs to the same `ComputeEnvironment`,
Batch will pick the Job from the Queue with the highest priority.
This example creates two `JobQueue`s that share a `ComputeEnvironment`:

```ts
declare const vpc: ec2.IVpc;
const sharedComputeEnv = new batch.FargateComputeEnvironment(this, 'spotEnv', {
  vpc,
  spot: true,
});
const lowPriorityQueue = new batch.JobQueue(this, 'JobQueue', {
   priority: 1,
});
const highPriorityQueue = new batch.JobQueue(this, 'JobQueue', {
   priority: 10,
});
lowPriorityQueue.addComputeEnvironment(sharedComputeEnv, 1);
highPriorityQueue.addComputeEnvironment(sharedComputeEnv, 1);
```

### Fairshare Scheduling

Batch `JobQueue`s execute Jobs submitted to them in FIFO order unless you specify a `SchedulingPolicy`.
FIFO queuing can cause short-running jobs to be starved while long-running jobs fill the compute environment.
To solve this, Jobs can be associated with a share.

Shares consist of a `shareIdentifier` and a `weightFactor`, which is inversely correlated with the vCPU allocated to that share identifier.
When submitting a Job, you can specify its `shareIdentifier` to associate that particular job with that share.
Let's see how the scheduler uses this information to schedule jobs.

For example, if there are two shares defined as follows:

| Share Identifier | Weight Factor |
| ---------------- | ------------- |
| A                | 1             |
| B                | 1             |

The weight factors share the following relationship:

```math
A_{vCpus} / A_{Weight} = B_{vCpus} / B_{Weight}
```

where `BvCpus` is the number of vCPUs allocated to jobs with share identifier `'B'`, and `B_weight` is the weight factor of `B`.

The total number of vCpus allocated to a share is equal to the amount of jobs in that share times the number of vCpus necessary for every job.
Let's say that each A job needs 32 VCpus (`A_requirement` = 32) and each B job needs 64 vCpus (`B_requirement` = 64):

```math
A_{vCpus} = A_{Jobs} * A_{Requirement}
```

```math
B_{vCpus} = B_{Jobs} * B_{Requirement}
```

We have:

```math
A_{vCpus} / A_{Weight} = B_{vCpus} / B_{Weight}
```

```math
A_{Jobs} * A_{Requirement} / A_{Weight} = B_{Jobs} * B_{Requirement} / B_{Weight}
```

```math
A_{Jobs} * 32 / 1 = B_{Jobs} * 64 / 1
```

```math
A_{Jobs} * 32 = B_{Jobs} * 64
```

```math
A_{Jobs} = B_{Jobs} * 2
```

Thus the scheduler will schedule two `'A'` jobs for each `'B'` job.

You can control the weight factors to change these ratios, but note that
weight factors are inversely correlated with the vCpus allocated to the corresponding share.

This example would be configured like this:

```ts
const fairsharePolicy = new batch.FairshareSchedulingPolicy(this, 'myFairsharePolicy');

fairsharePolicy.addShare({
  shareIdentifier: 'A',
  weightFactor: 1,
});
fairsharePolicy.addShare({
  shareIdentifier: 'B',
  weightFactor: 1,
});
new batch.JobQueue(this, 'JobQueue', {
  schedulingPolicy: fairsharePolicy,
});
```

*Note*: The scheduler will only consider the current usage of the compute environment unless you specify `shareDecay`.
For example, a `shareDecay` of 5 minutes in the above example means that at any given point in time, twice as many `'A'` jobs
will be scheduled for each `'B'` job, but only for the past 5 minutes. If `'B'` jobs run longer than 5 minutes, then
the scheduler is allowed to put more than two `'A'` jobs for each `'B'` job, because the usage of those long-running
`'B'` jobs will no longer be considered after 5 minutes. `shareDecay` linearly decreases the usage of
long running jobs for calculation purposes. eg if share decay is 60 seconds,
then jobs that run for 30 seconds have their usage considered to be only 50% of what it actually is,
but after a whole minute the scheduler pretends they don't exist for fairness calculations.

The following code specifies a `shareDecay` of 5 minutes:

```ts
import * as cdk from 'aws-cdk-lib'
const fairsharePolicy = new batch.FairshareSchedulingPolicy(this, 'myFairsharePolicy', {
   shareDecay: cdk.Duration.minutes(5),
});
```

If you have high priority jobs that should always be executed as soon as they arrive,
you can define a `computeReservation` to specify the percentage of the
maximum vCPU capacity that should be reserved for shares that are *not in the queue*.
The actual reserved percentage is defined by Batch as:

```math
 (\frac{computeReservation}{100}) ^ {ActiveFairShares}
```

where `ActiveFairShares` is the number of shares for which there exists
at least one job in the queue with a unique share identifier.

This is best illustrated with an example.
Suppose there three shares with share identifiers `A`, `B` and `C` respectively
and we specify the `computeReservation` to be 75%. The queue is currently empty,
and no other shares exist.

There are no active fair shares, since the queue is empty.
Thus (75/100)^0 = 1 = 100% of the maximum vCpus are reserved for all shares.

A job with identifier `A` enters the queue.

The number of active fair shares is now 1, hence
(75/100)^1 = .75 = 75% of the maximum vCpus are reserved for all shares that do not have the identifier `A`;
for this example, this is `B` and `C`,
(but if jobs are submitted with a share identifier not covered by this fairshare policy, those would be considered just as `B` and `C` are).

Now a `B` job enters the queue. The number of active fair shares is now 2,
so (75/100)^2 = .5625 = 56.25% of the maximum vCpus are reserved for all shares that do not have the identifier `A` or `B`.

Now a second `A` job enters the queue. The number of active fair shares is still 2,
so the percentage reserved is still 56.25%

Now a `C` job enters the queue. The number of active fair shares is now 3,
so (75/100)^3 = .421875 = 42.1875% of the maximum vCpus are reserved for all shares that do not have the identifier `A`, `B`, or `C`.

If these are no other shares that your jobs can specify, this means that 42.1875% of your capacity will never be used!

Now, `A`, `B`, and `C` can only consume 100% - 42.1875% = 57.8125% of the maximum vCpus.
Note that the this percentage is **not** split between `A`, `B`, and `C`.
Instead, the scheduler will use their `weightFactor`s to decide which jobs to schedule;
the only difference is that instead of competing for 100% of the max capacity, jobs compete for 57.8125% of the max capacity.

This example specifies a `computeReservation` of 75% that will behave as explained in the example above:

```ts
new batch.FairshareSchedulingPolicy(this, 'myFairsharePolicy', {
  computeReservation: 75,
  shares: [
    { weightFactor: 1, shareIdentifier: 'A' },
    { weightFactor: 0.5, shareIdentifier: 'B' },
    { weightFactor: 2, shareIdentifier: 'C' },
  ],
});
```

You can specify a `priority` on your `JobDefinition`s to tell the scheduler to prioritize certain jobs that share the same share identifier.

### Configuring Job Retry Policies

Certain workflows may result in Jobs failing due to intermittent issues.
Jobs can specify retry policies to respond to different failures with different actions.
There are three different ways information about the way a Job exited can be conveyed;

* `exitCode`: the exit code returned from the process executed by the container. Will only match non-zero exit codes.
* `reason`: any middleware errors, like your Docker registry being down.
* `statusReason`: infrastructure errors, most commonly your spot instance being reclaimed.

For most use cases, only one of these will be associated with a particular action at a time.
To specify common `exitCode`s, `reason`s, or `statusReason`s, use the corresponding value from
the `Reason` class. This example shows some common failure reasons:

```ts
import * as cdk from 'aws-cdk-lib';

const jobDefn = new batch.EcsJobDefinition(this, 'JobDefn', {
   container: new batch.EcsEc2ContainerDefinition(this, 'containerDefn', {
    image: ecs.ContainerImage.fromRegistry('public.ecr.aws/amazonlinux/amazonlinux:latest'),
    memory: cdk.Size.mebibytes(2048),
    cpu: 256,
  }),
  retryAttempts: 5,
  retryStrategies: [
    batch.RetryStrategy.of(batch.Action.EXIT, batch.Reason.CANNOT_PULL_CONTAINER),
  ],
});
jobDefn.addRetryStrategy(
  batch.RetryStrategy.of(batch.Action.EXIT, batch.Reason.SPOT_INSTANCE_RECLAIMED),
);
jobDefn.addRetryStrategy(
  batch.RetryStrategy.of(batch.Action.EXIT, batch.Reason.CANNOT_PULL_CONTAINER),
);
jobDefn.addRetryStrategy(
  batch.RetryStrategy.of(batch.Action.EXIT, batch.Reason.custom({
    onExitCode: '40*',
    onReason: 'some reason',
  })),
);
```

When specifying a custom reason,
you can specify a glob string to match each of these and react to different failures accordingly.
Up to five different retry strategies can be configured for each Job,
and each strategy can match against some or all of `exitCode`, `reason`, and `statusReason`.
You can optionally configure the number of times a job will be retried,
but you cannot configure different retry counts for different strategies; they all share the same count.
If multiple conditions are specified in a given retry strategy,
they must all match for the action to be taken; the conditions are ANDed together, not ORed.

### Running single-container ECS workflows

Batch can jobs on ECS or EKS. ECS jobs can defined as single container or multinode.
This examples creates a `JobDefinition` that runs a single container with ECS:

```ts
import * as cdk from 'aws-cdk-lib';
import * as efs from 'aws-cdk-lib/aws-efs';

declare const myFileSystem: efs.IFileSystem;

const jobDefn = new batch.EcsJobDefinition(this, 'JobDefn', {
  container: new batch.EcsEc2ContainerDefinition(this, 'containerDefn', {
    image: ecs.ContainerImage.fromRegistry('public.ecr.aws/amazonlinux/amazonlinux:latest'),
    memory: cdk.Size.mebibytes(2048),
    cpu: 256,
    volumes: [batch.EcsVolume.efs({
      name: 'myVolume',
      fileSystem: myFileSystem,
      containerPath: '/Volumes/myVolume',
    })],
  }),
});
```

For workflows that need persistent storage, batch supports mounting `Volume`s to the container.
You can both provision the volume and mount it to the container in a single operation:

```ts
import * as efs from 'aws-cdk-lib/aws-efs';

declare const myFileSystem: efs.IFileSystem;
declare const jobDefn: batch.EcsJobDefinition;

jobDefn.container.addVolume(batch.EcsVolume.efs({
  name: 'myVolume',
  fileSystem: myFileSystem,
  containerPath: '/Volumes/myVolume',
}));
```

### Running Kubernetes Workflows

Batch also supports running workflows on EKS. The following example creates a `JobDefinition` that runs on EKS:

```ts
import * as cdk from 'aws-cdk-lib';
const jobDefn = new batch.EksJobDefinition(this, 'eksf2', {
  container: new batch.EksContainerDefinition(this, 'container', {
    image: ecs.ContainerImage.fromRegistry('amazon/amazon-ecs-sample'),
    volumes: [batch.EksVolume.emptyDir({
      name: 'myEmptyDirVolume',
      mountPath: '/mount/path',
      medium: batch.EmptyDirMediumType.MEMORY,
      readonly: true,
      sizeLimit: cdk.Size.mebibytes(2048),
    })],
  }),
});
```

You can mount `Volume`s to these containers in a single operation:

```ts
declare const jobDefn: batch.EksJobDefinition;
jobDefn.container.addVolume(batch.EksVolume.emptyDir({
  name: 'emptyDir',
  mountPath: '/Volumes/emptyDir',
}));
jobDefn.container.addVolume(batch.EksVolume.hostPath({
  name: 'hostPath',
  hostPath: '/sys',
  mountPath: '/Volumes/hostPath',
}));
jobDefn.container.addVolume(batch.EksVolume.secret({
  name: 'secret',
  optional: true,
  mountPath: '/Volumes/secret',
  secretName: 'mySecret',
}));
```

### Running Distributed Workflows

Some workflows benefit from parallellization and are most powerful when run in a distributed environment,
such as certain numerical calculations or simulations. Batch offers `MultiNodeJobDefinition`s,
which allow a single job to run on multiple instances in parallel, for this purpose.
Message Passing Interface (MPI) is often used with these workflows.
You must configure your containers to use MPI properly,
but Batch allows different nodes running different containers to communicate easily with one another.
You must configure your containers to use certain environment variables that Batch will provide them,
which lets them know which one is the main node, among other information.
For an in-depth example on using MPI to perform numerical computations on Batch,
see this [blog post](https://aws.amazon.com/blogs/compute/building-a-tightly-coupled-molecular-dynamics-workflow-with-multi-node-parallel-jobs-in-aws-batch/)
In particular, the environment variable that tells the containers which one is the main node can be configured on your `MultiNodeJobDefinition` as follows:

```ts
import * as cdk from 'aws-cdk-lib';
const multiNodeJob = new batch.MultiNodeJobDefinition(this, 'JobDefinition', {
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R4, ec2.InstanceSize.LARGE),
  containers: [{
    container: new batch.EcsEc2ContainerDefinition(this, 'mainMPIContainer', {
      image: ecs.ContainerImage.fromRegistry('yourregsitry.com/yourMPIImage:latest'),
      cpu: 256,
      memory: cdk.Size.mebibytes(2048),
    }),
    startNode: 0,
    endNode: 5,
  }],
});
// convenience method
multiNodeJob.addContainer({
  startNode: 6,
  endNode: 10,
  container: new batch.EcsEc2ContainerDefinition(this, 'multiContainer', {
    image: ecs.ContainerImage.fromRegistry('amazon/amazon-ecs-sample'),
    cpu: 256,
    memory: cdk.Size.mebibytes(2048),
  }),
});
```

If you need to set the control node to an index other than 0, specify it in directly:

```ts
const multiNodeJob = new batch.MultiNodeJobDefinition(this, 'JobDefinition', {
  mainNode: 5,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R4, ec2.InstanceSize.LARGE),
});
```

### Pass Parameters to a Job

Batch allows you define parameters in your `JobDefinition`, which can be referenced in the container command. For example:

```ts
import * as cdk from 'aws-cdk-lib';
new batch.EcsJobDefinition(this, 'JobDefn', {
  parameters: { echoParam: 'foobar' },
  container: new batch.EcsEc2ContainerDefinition(this, 'containerDefn', {
    image: ecs.ContainerImage.fromRegistry('public.ecr.aws/amazonlinux/amazonlinux:latest'),
    memory: cdk.Size.mebibytes(2048),
    cpu: 256,
    command: [
      'echo',
      'Ref::echoParam',
    ],
  }),
});
```

### Understanding Progressive Allocation Strategies

AWS Batch uses an [allocation strategy](https://docs.aws.amazon.com/batch/latest/userguide/allocation-strategies.html)
to determine what compute resource will efficiently handle incoming job requests.
By default, **BEST_FIT** will pick an available compute instance based on vCPU requirements.
If none exist, the job will wait until resources become available.
However, with this strategy, you may have jobs waiting in the queue unnecessarily
despite having more powerful instances available. Below is an example of how that situation might look like:

```plaintext
Compute Environment:

1. m5.xlarge => 4 vCPU
2. m5.2xlarge => 8 vCPU
```

```plaintext
Job Queue:
---------
| A | B |
---------

Job Requirements:
A => 4 vCPU - ALLOCATED TO m5.xlarge
B => 2 vCPU - WAITING
```

In this situation, Batch will allocate **Job A** to compute resource #1 because it is the most cost efficient
resource that matches the vCPU requirement.
However, with this `BEST_FIT` strategy, **Job B** will not be allocated to our other available
compute resource even though it is strong enough to handle it.
Instead, it will wait until the first job is finished processing or wait a similar `m5.xlarge`
resource to be provisioned.

The alternative would be to use the `BEST_FIT_PROGRESSIVE` strategy in order for the remaining
job to be handled in larger containers regardless of vCPU requirement and costs.

## API

* `ManagedEc2ComputeEnvironment`
  * Requires an L2 for `CfnPlacementGroup` in aws-ec2
  * Or, we could use a `placementGroupArn` instead, but that’s sub-par
* `UnmanagedEc2ComputeEnvironment`
* `FargateComputeEnvironment`
* `MultinodeJobDefinition`
* `ContainerJobDefinition`
* `JobQueue`
* `SchedulingPolicy`

```ts
new ManagedEc2ComputeEnvironment(this, 'myEc2Env', {
  images: [BatchMachineImage.of(ec2MachineImage, BatchImageType.ECS)], // required
  name: 'specifiedName', // optional, defaults to CFN generated name
  replaceComputeEnvironment: false, //optional, defaults to false
  spot: true, // optional, defaults to false
  enabled: true, // optional, defaults to true
  updateTimeout: cdk.Duration.minutes(30), // optional, defaults to 30 (CFN default)
  terimnateOnUpdate: false, // optional, defaults to false (CFN default)
  allocationStrategy: batch.AllocationStrategy.SPOT_CAPACITY_OPTIMIZED, // optional (not spot exclusive)
  spotBidPercentage: 20, // optional (Spot exclusive)
  desiredvCpus: 3, // optional: default???
  instanceTypes: [ec2.InstanceType.of(ec2.InstanceClass.M5AD, ec2.InstanceSize.LARGE)], // optional, defaults to ['Instance.OPTIMAL']
  instanceRole: myRole, // optional, creates it if not supplied
  serviceRole: myServiceRole, // optional, defaults to undefined (batch recommended default)
  launchTemplate: myLaunchTemplate // optional, defaults to undefined
  maxvCpus: 5, // optional, defaults to 256
  minvCpus: 1, // optional, defaults to 0 (required on ec2 envs, even though the docs say it's optional)
  placementGroup: myPlacementGroup, // optional: defaults to no placementgroup
  securityGroups: [mySecurityGroup1, mySecurityGroup2], // optional, defaults to newly created
  vpcSubnets: myEc2SubnetSelection, // optional: defaults to newly created
  updateToLatestImageVersion: true // optional, defaults to true
  eksClusterNamespace: batch.clusterNamespace.from(cluster, 'namespace');
});

new UnmanagedEc2ComputeEnvironment(this, 'myEc2Env', {
  name: 'specifiedName', // optional, defaults to CFN generated name
  replaceComputeEnvironment: false, //optional, defaults to false
  enabled: true, // optional, defaults to true
  updateTimeout: cdk.Duration.minutes(30), // optional, defaults to 30 (CFN default)
  terimnateOnUpdate: false, // optional, defaults to false (CFN default)
  serviceRole: myServiceRole, // optional, creates it if not supplied (batch recommended default)
  updateToLatestImageVersion: true // optional, defaults to true
  unmanagedvCpus: undefined // TODO: asked service team why ComputeResources can be specified on UNMANAGED env
  eksClusterNamespace: batch.clusterNamespace.from(cluster, 'namespace');
});

new FargateComputeEnvironment(this, 'myFargateEnv', {
  name: 'specifiedName', // optional, defaults to CFN generated name
  replaceComputeEnvironment: false, //optional, defaults to false
  enabled: true, // optional, defaults to true
  spot: true, // optional, defaults to false
  updateTimeout: cdk.Duration.minutes(30), // optional, defaults to 30 (CFN default)
  terimnateOnUpdate: false, // optional, defaults to false (CFN default)
  serviceRole: myServiceRole, // optional, defaults to undefined (batch recommended default)
  maxvCpus: 10, // optional, defaults to 256
  securityGroups: [mySecurityGroup1, mySecurityGroup2], // optional, defaults to newly created
  vpcSubnets: myEc2SubnetSelection, // optional: defaults to newly created
  updateToLatestImageVersion: false, // optional, defaults to false (CFN default)
  eksClusterNamespce: batch.clusterNamespace.from(cluster, 'namespace');
});

new EcsJobDefinition(this, 'myContainerJob', {
  taskDefinition: myEcsTaskDefinition, // required
  name: 'myJob', // optional, defaults to CFN generated name
  parameters: { foo: 'bar' }, // optional, defaults to none
  // platform: batch.Platform.EC2_AND_FARGATE, // optional, defaults to EC2 (CFN default)
  fargatePlatformVersion: batch.FargatePlatformVersion.LATEST, // optional, defaults to LATEST
  propogateTags: true, // optional, defaults to false (CFN default)
  retryAttempts: 5, // optional, defaults to 1
  retryStrategy: [new RetryStrategy({// optional, defaults to none
    action: batch.Action.RETRY, // ACTION, required
    onExitCode: '40*', // optional, defaults to none (onExitCode)
    onReason: 'error: *', // optional, defaults to none (onReason)
    onStatusReason: 'reason: *', // optional, defaults to none (onStatusReason)
  })],
  schedulingPriority: 4, // optional, defaults to none
  timeout: cdk.Duration.seconds(10),
  jobRole: myRole, // optional, defaults to none
});

new EksJobDefinition(this, 'myEksJob', {
  name: 'myJob', // optional, defaults to CFN generated name
  parameters: { foo: 'bar' }, // optional, defaults to none
  platform: batch.Platform.EC2_AND_FARGATE, // optional, defaults to EC2 (CFN default)
  propogateTags: true, // optional, defaults to false (CFN default)
  retryAttempts: 5, // optional, defaults to 1
  retryStrategy: new RetryStrategy({// optional, defaults to none
    action: Action.RETRY, // ACTION, required
    onExitCode: '40*', // optional, defaults to none (onExitCode)
    onReason: 'error: *', // optional, defaults to none (onReason)
    onStatusReason: 'reason: *', // optional, defaults to none (onStatusReason)
  }),
  schedulingPriority: 4, // optional, defaults to none
  timeout: cdk.Duration.seconds(10), // optional, defaults to none
  pod: new batch.EksPod(this, 'myEksPod', { // required
    containers: [
      new batch.EksContainerDefinition(this, 'myEksContainer', {
        image: ecs.ContainerImage.fromRegistry('my-registry/my-image:latest'),
        volumes: [new batch.EmptyDirVolume()], // optional, automatically adds to the pod
      });
    ],
  });
});

new MultiNodeJobDefinition(this, 'myMultiNodeJob', {
  mainNode: 0, // required
  containers: [ // optional
    new MultiNodeContainer(
      startNode: 0,
      endNode: 5,
      instanceType: ec2.InstanceType.of(ec2.InstanceClass.A1, ec2.InstanceSize.LARGE), // required,
      TaskDefinition: new ecs.TaskDefinition(this, 'ECSTask', {
        compatibility: ecs.Compatibility.EC2,
      }).addContainer('myContainer', {
        image: ecs.ContainerImage.fromRegistry('public.ecr.aws/amazonlinux/amazonlinux:latest'),
        memoryLimitMiB: 2048,
      }),
    ),
  ],
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.A1, ec2.InstanceSize.LARGE), // required,
  name: 'myJob', // optional, defaults to CFN generated name
  parameters: { foo: 'bar' }, // optional, defaults to none
  propogateTags: true, // optional, defaults to false (CFN default)
  retryAttempts: 5, // optional, defaults to 1
  retryStrategy: new RetryStrategy({// optional, defaults to none
    action: Action.RETRY, // ACTION, required
    onExitCode: '40*', // optional, defaults to none (onExitCode)
    onReason: 'error: *', // optional, defaults to none (onReason)
    onStatusReason: 'reason: *', // optional, defaults to none (onStatusReason)
  }),
  schedulingPriority: 4, // optional, defaults to none
  timeout: cdk.Duration.seconds(10), // optional, defaults to none
});

new JobQueue(this, 'myJobQueue', {
  orderedComputeEnvironment: OrderedComputeEnvironment.of(myComputeEnv, 10), // required
  priority: 10, // required
  name: 'myJobQueueName', // optional, defaults to CFN generated name
  enabled: true, // optional, defaults to true
  schedulingPolicy: mySchedulingPolicy, // optional, defaults to none
});

new FairshareSchedulingPolicy(this, 'mySchedulingPolicy', {
  computeReservation: 20, // optional, defaults to none
  shareDecaySeconds: 20, // optional, defaults to none
  shares: [{ // optional, defaults to none
    shareIdentifier: 'foo', // optional, defaults to none
    weightFactor: 5, // optional, defaults to none
  }],
  name: 'myPolicyName', // optional, defaults to generated by CFN
});
```

```ts
// public ec2 members
export class PlacementGroup {
   // TODO
}

export class IMachineImage {
   // ...
   keyPairName: string,
}

// public batch members
// ComputeEnvironment
export interface IBatchMachineImage {
  image: ec2.IMachineImage;
  imageType: BatchMachineImageType;
  imageKubernetesVersion?: string;
}

export enum BatchMachineImageType {
  ECS_AL2 = 'ECS_AL2',
  ECS_AL2_NVIDIA = 'ECS_AL2_NVIDIA',
  EKS_AL2 = 'EKS_AL2',
  EKS_AL2_NVIDIA = 'EKS_AL2_NVIDIA',
}

export enum AllocationStrategy {
  BEST_FIT = 'BEST_FIT',
  BEST_FIT_PROGRESSIVE = 'BEST_FIT_PROGRESSIVE',
  SPOT_CAPACITY_OPTIMIZED = 'SPOT_CAPACITY_OPTIMIZED',
}

// JobDefinition
enum Compatibility {
  EC2 = ['EC2'],
  FARGATE = ['FARGATE'],
  EC2_AND_FARGATE = ['EC2', 'FARGATE'],
}

interface RetryStrategy {
  readonly retry: boolean;
  readonly onExitCode?: string;
  readonly onReason?: string;
  readonly onStatusReason?: string;
}

enum ImagePullPolicy {
  ALWAYS = 'Always';
  IF_NOT_PRESENT = 'IfNotPresent';
  NEVER = 'Never';
}

// private stuff
enum Ec2ComputeEnvironmentType {
  ON_DEMAND = 'EC2',
  SPOT = 'SPOT',
}

enum FargateComputeEnvironmentType {
  ON_DEMAND = 'FARGATE',
  SPOT = 'FARGATE_SPOT',
}
```

### Migration Plan

Removing all of the existing L2 constructs and forcing people to migrate to the new construct is not a great customer experience.
We should deprecate the existing construct,
move it to a different repository but still publish it under the same name, and leave it there until the new API is stable.
I propose we leave the new API as experimental until Q3 and graduate it then.

### Differences Between the Existing API and this Proposal

Existing API consists of:

* `JobDefinition`
  * It only supports the ECS variant, as the only required property it has it mapped directly to `ContainerProperties`.
    Specifying `ContainerProperties` prevents you from specifying `EksProperties` or `NodeRangeProperties`,
    meaning the existing construct does not support either of these (even with escape hatches).
* `ComputeEnvironment`
  * No distinction in type between EC2 and Fargate. The result is this massive amount of
    [runtime error checking](https://github.com/aws/aws-cdk/blob/main/packages/%40aws-cdk/aws-batch/lib/compute-environment.ts#L487-538) for Fargate
  * All of that runtime error checking could be removed with an EC2 type and a Fargate type.
* `JobQueue`
  * No support for `FairshareSchedulingPolicy`, so they can’t be added to a `JobQueue` easily.

Existing API does not have `FairsharechedulingPolicys`.

## Public FAQ

### What are we launching today?

We are launching a new set of L2 constructs for Batch (`aws-cdk-lib/aws-batch`).
These new constructs will fully support Batch within the CDK.

### Why should I use this feature?

Efficiently run your batch computing jobs without managing servers.
Use cases include machine learning and high-performance computing.
CDK provides seamless integrations with your existing infrastructure
and simplifies configuration of your resources.

## Internal FAQ

### Why are we doing this?

We are graduating alpha modules to build trust with our customers.
Batch is a highly-requested module for stabilization, but the existing experimental
constructs do not cover many of the use cases that batch supports and extending them
will result in a subpar customer experience.

### Why should we *not* do this?

Our existing alpha module has customers today. This new API will be substantially
different, which will require them to understand the new API to migrate.

### What is the technical solution (design) of this feature?

#### Compute Environment

* `IComputeEnvironment` and `ComputeEnvironmentBase` -- abstract base class and interface

```ts
interface IComputeEnvironment extends cdk.IResource, iam.Grantable, cdk.ITaggable {
  readonly name?: string;
  readonly serviceRole?: iam.IRole;
  readonly enabled?: boolean;
  readonly maxvCpus?: number; //note: becomes unmanageVCPUs on unmanaged, maxvCPUs on managed

  grant(grantee: iam.IGrantable, ...actions: string[]): iam.Grant;
}

interface ComputeEnvironmentProps {
  readonly name?: string;
  readonly serviceRole?: iam.IRole;
  readonly enabled?: boolean;
  readonly maxvCpus?: number; //note: becomes unmanagedvCPUs on unmanaged, maxvCPUs on managed
}

abstract class ComputeEnvironmentBase implements IComputeEnvironment {
  constructor(readonly props: ComputeEnvironmentProps = {}) {}
}
```

* `IUnmanagedComputeEnvironment` and `UnmanagedComputeEnvironmentBase` -- interface to define and deploy Batch Unmanaged Compute Environments

```ts
interface IUnmanagedComputeEnvironment extends IComputeEnvironment {}

interface UnmanagedComputeEnvironmentProps extends ComputeEnvironmentProps {}

abstract class UnmanagedComputeEnvironmentBase extends ComputeEnvironmentBase implements IUnmanagedComputeEnvironment {
  constructor(readonly props: UnmanagedComputeEnvironmentProps = {}) {}
}
```

* `IUnmanagedEc2ComputeEnvironment` -- interface to define and deploy Batch Unmanaged Compute Environments (can only use ec2)

```ts
interface IUnmanagedEc2ComputeEnvironment extends IUnmanagedComputeEnvironment {}

interface UnmanagedEc2ComputeEnvironmentProps extends UnmanagedComputeEnvironmentProps {}

abstract class UnmanagedEc2ComputeEnvironmentBase extends UnmanagedComputeEnvironmentBase implements IUnmanagedEc2ComputeEnvironment {
  constructor(readonly props: UnmanagedEc2ComputeEnvironmentProps = {}) {}
}
```

* `IManagedComputeEnvironment` -- interface to define and deploy Batch Managed Compute Environments

```ts
interface IManagedComputeEnvironment extends IComputeEnvironment {
  readonly replaceComputeEnvironment?: boolean;
  readonly updateTimeout?: Duration;
  readonly terimnateOnUpdate?: boolean;
  readonly securityGroups?: ec2.ISecurityGroup[];
  readonly subnets?: ec2.ISubnet[];
  readonly updateToLatestImageVersion?: boolean;
}

interface ManagedComputeEnvironmentProps extends ComputeEnvironmentProps {
  readonly replaceComputeEnvironment?: boolean;
  readonly updateTimeout?: Duration;
  readonly terimnateOnUpdate?: boolean;
  readonly securityGroups?: ec2.ISecurityGroup[];
  readonly subnets?: ec2.ISubnet[];
  readonly updateToLatestImageVersion?: boolean;
}

abstract class ManagedComputeEnvironmentBase extends ComputeEnvironmentBase implements IManagedComputeEnvironment {
  constructor(readonly props: ManagedComputeEnvironmentProps = {}) {}
}
```

* `IManagedEc2ComputeEnvironment` -- interface to define and deploy Batch Managed Compute Environments

```ts
interface IManagedEc2ComputeEnvironment extends IManagedComputeEnvironment {
  readonly images?: IBatchMachineImage[];
  readonly allocationStrategy?: AllocationStrategy;
  readonly spotBidPercentage?: number;
  readonly spot?: boolean;
  readonly desiredvCpus?: number;
  readonly instanceTypes?: InstanceType[];
  readonly instanceRole?: iam.IRole;
  readonly launchTemplate?: ec2.ILaunchTemplate;
  readonly minvCpus?: number;
  readonly placementGroup?: ec2.IPlacementGroup;
}

interface IBatchMachineImage {
  image: ec2.IMachineImage;
  imageType: BatchMachineImageType;
  imageKubernetesVersion?: string;
}

enum BatchMachineImageType {
  ECS_AL2 = 'ECS_AL2',
  ECS_AL2_NVIDIA = 'ECS_AL2_NVIDIA',
  EKS_AL2 = 'EKS_AL2',
  EKS_AL2_NVIDIA = 'EKS_AL2_NVIDIA',
}

enum InstanceType extends ec2.InstanceType {
  OPTIMAL = 'optimal',
}

enum AllocationStrategy {
  BEST_FIT = 'BEST_FIT',
  BEST_FIT_PROGRESSIVE = 'BEST_FIT_PROGRESSIVE',
  SPOT_CAPACITY_OPTIMIZED = 'SPOT_CAPACITY_OPTIMIZED',
}

interface ManagedEc2ComputeEnvironmentProps extends ComputeEnvironmentProps {
  readonly images?: IBatchMachineImage[];
  readonly allocationStrategy?: AllocationStrategy;
  readonly spotBidPercentage?: number;
  readonly spot?: boolean;
  readonly desiredvCpus?: number;
  readonly instanceTypes?: InstanceType[];
  readonly instanceClasses?: ec2.InstanceClass[];
  readonly instanceRole?: iam.IRole;
  readonly launchTemplate?: ec2.ILaunchTemplate;
  readonly minvCpus?: number;
  readonly placementGroup?: ec2.IPlacementGroup;
}

class ManagedEc2ComputeEnvironment extends ManagedComputeEnvironmentBase implements IManagedEc2ComputeEnvironment {
  constructor(readonly props: ManagedEc2ComputeEnvironmentProps = {}) {}
  public addInstanceType() {}
  public addInstanceClass() {}
}
```

* `IEksComputeEnvironment` -- interface to define and deploy Batch EKS Compute Environments

```ts
interface IEksComputeEnvironment extends IManagedEc2ComputeEnvironment {
  readonly eksCluster: eks.ICluster,
  readonly kubernetesNamespace: string,
}

interface EksComputeEnvironmentProps extends ManagedEc2ComputeEnvironmentProps {}

class EksComputeEnvironment extends ManagedEc2ComputeEnvironment implements IEksComputeEnvironment {
  constructor(readonly props: EksComputeEnvironmentProps = {}) {}
}
```

* `IFargateComputeEnvironment` -- interface to define and deploy Batch Managed Fargate Compute Environments

```ts
interface IFargateComputeEnvironment extends IManagedComputeEnvironment {}

interface FargateComputeEnvironmentProps extends ComputeEnvironmentProps {}

class FargateComputeEnvironment extends ManagedComputeEnvironmentBase implements IFargateComputeEnvironment {
  constructor(readonly props: FargateComputeEnvironmentProps = {}) {}
}
```

#### JobDefinition

* `IJobDefinition`

```ts
interface IJobDefinition extends cdk.IResource, iam.Grantable, cdk.ITaggable {
  name?: string;
  parameters?: { [key:string]: any };
  propogateTags?: boolean;
  retryAttempts?: number;
  retryStrategies?: RetryStrategy[];
  schedulingPriority?: number;
  timeout?: Duration;

  grant(grantee: iam.IGrantable, ...actions: string[]): iam.Grant;
}

interface JobDefinitionProps {
  name?: string;
  parameters?: { [key:string]: any };
  propogateTags?: boolean;
  retryAttempts?: number;
  retryStrategies?: RetryStrategy[];
  schedulingPriority?: number;
  timeout?: Duration;
}

class RetryStrategy {
  public readonly action: Action;
  public readonly onExitCode: string;
  public readonly onStatusReason: string;
  public readonly onReason: string;

  constructor(action: Action, onExitCode: string, onStatusReason: string, onReason: string) {}
  addRetryStrategy() {}
}

enum Action {
  EXIT = 'EXIT';
  RETRY = 'RETRY';
}

abstract class JobDefinitionBase implements IJobDefinition {
  constructor(readonly props: JobDefinitionProps = {}) {}
}
```

* `IEcsJobDefinition`

```ts
interface IEcsJobDefinition extends IJobDefinition {
  containerDefinition: EcsContainerDefinition
  fargatePlatformVersion?: FargatePlatformVersion
  compatibility?: Compatibility
}

class EcsContainerDefinition {
  command?: string
  environments?: Array<{ [key:string]: string }>
  executionRoleArn?: iam.IRole
  fargateVersion?: FargateVersion
  image: ContainerImage
  jobRoleArn?: iam.IRole
  linuxParameters?: LinuxParameters
  logDriver?: LogDriver
  disableNetworking?: boolean
  priveleged?: boolean
  readonlyRootFileSystem?: boolean
  cpu: number
  gpuCount?: number
  secrets?: { [key: string]: Secret }
  user?: string

  addUlimit(...Ulimit[])
  addVolume(...EcsVolume[])
  addMountedVolume(..MountedEcsVolume[])
}

enum FargateVersion {
// TODO

}

interface Ulimit {
  hardLimit: number;
  name: UlimitName;
  softLimit: number;
}

enum UlimitName {
  //TODO
}

interface MountedEcsVolume {
  name: string;
  efsVolumeConfiguration?: EfsVolumeConfiguration;
  host?: string;
  containerPath: string;
  readOnly: boolean;
}

interface EfsVolumeConfiguration {
  fileSystemId: string;
  rootDirectory?: string;
  transitEncryption?: string;
  transitEncryptionPort?: number;
  authorizationConfig?: AuthorizationConfig;
}

interface AuthorizationConfig {
  accessPointId?: string;
  iam?: string;
}

interface EcsJobDefinitionProps {
  containerDefinition: EcsContainerDefinition
  fargatePlatformVersion?: FargatePlatformVersion
  compatibility?: Compatibility
}

class EcsJobDefinition implements IEcsJobDefinition {
  constructor(readonly props: EcsJobDefinitionProps = {}) {}
}
```

* `IEksJobDefinition`

```ts
interface IEksJobDefinition extends IJobDefinition {
  pod: batch.IEksPod;
  platform?: Platform;
}

enum Platform {
  // TODO
}

class EksPod {
  containers: batch.EksContainerDefinition[]
  dnsPolicy?: DnsPolicy
  useHostNetwork?: boolean
  serviceAccount?: string

  addContainer(batch.EksContainerDefinition)
}

class EksContainerDefinition {
  image: string
  args?: string[]
  command?: string[]
  env?: { [key:string]: string }
  imagePullPolicy?: ImagePullPolicy
  name?: string
  resources?: EksContainerResources
  priveleged?: boolean
  readonlyFileSystem?: boolean
  runAsGroup?: number
  runAsRoot?: boolean
  runAsUser?: number
  volumes?: EksVolume[]

  addVolume(...EksVolume[])
}

// TODO: need props for all of these smaller classes
// TODO: EksContainerResources has no definition
abstract class EksVolume {
  name: string;
  mountPath?: string;
  readonly?: boolean;
}

class EmptyDirVolume extends EksVolume {
  medium?: MediumType;
  sizeLimit?: number;
}

class HostPathVolume extends EksVolume {
  path: string;
}

class SecretPathVolume extends EksVolume {
  secret: string;
  optional?: boolean;
}

interface EksJobDefinitionProps {
  pod: batch.IEksPod;
  platform?: Platform;
}

class EksJobDefinition implements IEksJobDefinition {
  constructor(readonly props: EksJobDefinitionProps = {}) {}
}
```

* `IMultiNodeJobDefinition`

```ts
interface IMultiNodeJobDefinition extends IJobDefinition {
  containers: MultiNodeContainer[];
  mainNode: number;
  instanceType: InstanceType;
}

interface MultiNodeContainer {
  startNode: number;
  endNode: number;
  container: ContainerDefinition;
}

interface MultiNodeJobDefinitionProps {
  containers: MultiNodeContainer[];
  mainNode: number;
  instanceType: InstanceType;
}

class MultiNodeJobDefinition implements IMultiNodeJobDefinition {
  constructor(readonly props: MultiNodeJobDefinitionProps = {}) {}

  public addContainer(...containers: MultiNodeContainer[]) {}
}
```

#### JobQueue

* `IJobQueue`

```ts
interface IJobQueue extends cdk.IResource, iam.Grantable, cdk.ITaggable {
  priority: number
  name?: string
  enabled?: boolean
  schedulingPolicy?: SchedulingPolicy
  computeEnvironments: OrderedComputeEnvironment[]
}

interface JobQueueProps {
  priority: number
  name?: string
  enabled?: boolean
  schedulingPolicy?: ISchedulingPolicy
  computeEnvironments: OrderedComputeEnvironment[]
}

interface OrderedComputeEnvironment {
  computeEnvironment: IComputeEnvironment;
  order: number;
}

class JobQueue implements IJobQueue {
  constructor(readonly props: JobQueueProps = {}) {}

  addComputeEnvironment(computeEnvironment: ComputeEnvironment, order: number)
}
```

#### SchedulingPolicy

* `ISchedulingPolicy`

```ts
interface ISchedulingPolicy extends cdk.IResource, iam.Grantable, cdk.ITaggable {
}

interface SchedulingPolicyProps {
}

class SchedulingPolicyBase implements ISchedulingPolicy {
  constructor(readonly props: SchedulingPolicyProps = {}) {}
}
```

* `IFairshareSchedulingPolicy`

```ts
interface IFairshareSchedulingPolicy extends ISchedulingPolicy {
  readonly computeReservation?: number;
  readonly shareDecay?: Duration;
  readonly shares?: Share[];
}

interface FairshareSchedulingPolicyProps {
  readonly computeReservation?: number;
  readonly shareDecay?: Duration;
  readonly shares?: Share[];
}

interface Share {
  id: string;
  weight: number;
}

class FairshareSchedulingPolicy implements IFairshareSchedulingPolicy {
  constructor(readonly props: FairshareSchedulingPolicyProps = {}) {}

  addShare(id: string, weight: number) {}
}
```

### Is this a breaking change?

No, because we will continue to make the old API available. Their existing
code will not be compatible with the new API.

### What alternative solutions did you consider?

We could expand the coverage of the existing design to cover all the missing use cases.
Supporting these additional use cases without redesigning the API requires
us to provide a subpar customer experience because the existing API suffers from the same
problem as the CloudFormation resource; there are too many logically distinct resources
lumped into a single type. Many properties are always silently ignored or rejected by CloudFormation
depending on which other properties are specified (12 on a single resource, among many others).

### What are the drawbacks of this solution?

The new API is substantially different from the old one, which means migration will not be free.

### What is the high-level project plan?

Once this RFC is approved, implementation will begin and will be completed by Q1.

### Are there any open issues that need to be addressed later?

No.

## Appendix

### L2 props → L1 props

#### abstract class ComputeEnvironment

```
name?: string -> ComputeEnvironmentName
serviceRole?: iam.IRole -> ServiceRole
replaceComputeEnvironment?: boolean -> ReplaceComputeEnvironment
enabled?: boolean -> State
updateTimeout?: Duration -> UpdatePolicy.JobExecutionTimeoutMinutes
terimnateOnUpdate?: boolean -> UpdatePolicy.TerminateJobsOnUpdate
maxvCpus?: number -> ComputeResources.MaxvCpus
securityGroups?: ec2.ISecurityGroup[] -> ComputeResources.SecurityGroupIDs
subnets?: ec2.ISubnet[] -> ComputeResources.Subnets
updateToLatestImageVersion?: boolean -> ComputeResources.UpdateToLatestImageVersion
eksClusterWithNamespace -> EksConfiguration
```

#### ManagedEc2ComputeEnvironment

```
images?: IBatchMachineImage[] -> ComputeResources.Ec2Configuration, ComputeResources.Ec2Configuration
allocationStrategy?: AllocationStrategy -> ComputeResources.AllocationStrategy
spotBidPercentage?: number -> ComputeResources.BidPercentage
spot?: boolean -> ComputeResources.Type
desiredvCpus?: number -> ComputeResources.DesiredvCpus
instanceTypes?: ec2.InstanceType[] -> ComputeResources.InstanceTypes
instanceRole?: iam.IRole -> ComputeResources.InstanceRole
launchTemplate?: ec2.ILaunchTemplate -> ComputeResources.LaunchTemplate
minvCpus?: number -> ComputeResources.MinvCpus
placementGroup?: ec2.IPlacementGroup -> ComputeResources.PlacementGroup
```

#### UnmanagedEc2ComputeEnvironment

```
unmanagedvCpus?: number -> UnmanagedvCpus
```

#### FargateComputeEnvironment

```
// has no extra props
```

#### abstract class JobDefinition

```
name?: string -> JobDefinitionName
parameters?: { [key:string]: any } -> Parameters
propogateTags?: boolean -> PropogateTags
retryAttempts?: number -> RetryStrategy.Attempts
retryStrategy?: RetryStrategy -> RetryStrategy.EvaluateOnExit
schedulingPriority?: number -> SchedulingPriority
timeout?: Duration -> Timeout
```

#### EcsJobDefinition extends JobDefinition

```
containerDefinition: EcsContainerDefinition -> ContainerProperties
fargatePlatformVersion?: Batch.FargatePlatformVersion -> ContainerProperties.FargatePlatformConfiguration
compatibility?: Compatibility -> PlatformCapabilities

// no addContainer() because this is a single-node only job type
```

#### EcsContainerDefinition

```
command?: string -> ContainerProperties.Command
environments?: Array<{ [key:string]: string }> -> ContainerProperties.Environment
executionRoleArn?: iam.IRole -> ContainerProperties.ExecutionRoleArn
fargateVersion?: FargateVersion -> ContainerProperties.FargatePlatformConfiguration
image: ContainerImage -> ContainerProperties.Image, ContainerProperties.ResourceRequirements.MEMORY
// InstanceType is not applicable to single-node ECS jobs. All node ranges must use the same instance type, so configure it there
jobRoleArn?: iam.IRole -> ContainerProperties.JobRoleArn,
linuxParameters?: LinuxParameters -> ContainerProperties.LinuxParameters
logDriver?: LogDriver -> ContainerProperties.LogConfiguration
// Memory is deprecated in favor of ResourceRequirements
// mountpoints added via addMountPoint() or addMountVolume()
disableNetworking?: boolean -> ContainerProperties.NetworkConfiguration
priveleged?: boolean -> ContainerProperties.Priveleged
readonlyRootFileSystem?: boolean -> ContainerProperties.ReadonlyRootFileSystem
// image covers ContainerProperties.ResourceRequirements.MEMORY
cpu: number -> ContainerProperties.ResourceRequirements.VCPU
gpuCount?: number -> ContainerProperties.ResourceRequirements.GPU
secrets?: { [key: string]: Secret } -> ContainerProperties.Secrets
// Ulimits are added via addUlimit()
user?: string -> ContainerProperties.User
// Vcpus is deprecated in favor of ResourceRequirements
// Volumes are added via addVolume() or addMountedVolume()

addMountPoint(...MountPoint[]) -> MountPoints
addUlimit(...Ulimit[]) -> Ulimits
addVolume(...EcsVolume[]) -> Volumes
addMountedVolume(..MountedEcsVolume[]) -> MountPoints, Volumes
```

#### ContainerImage

```
see https://docs.aws.amazon.com/cdk/api/v1/docs/@aws-cdk_aws-ecs.ContainerImage.html
```

#### MountPoint

```
containerPath: string
readOnly: boolean
sourceVolume: string
```

#### Ulimit

```
hardLimit: number
name: UlimitName
softLimit: number
```

#### UlimitName

```
see https://docs.aws.amazon.com/cdk/api/v1/docs/@aws-cdk_aws-ecs.UlimitName.html
```

#### EcsVolume

```
name: string
efsVolumeConfiguration?: EfsVolumeConfiguration
host?: string -> Host.SourcePath
```

#### EfsVolumeConfiguration

```
fileSystemId: string
rootDirectory?: string
transitEncryption?: string
transitEncryptionPort?: number
authorizationConfig?: AuthorizationConfig
```

#### AuthorizationConfig

```
accessPointId?: string
iam?: string
```

#### MountedEcsVolume

```
name: string
efsVolumeConfiguration?: EfsVolumeConfiguration
host?: string
containerPath: string
readOnly: boolean
```

#### EksJobDefinition extends JobDefinition

```
pod: batch.IEksPod -> EksProperties
platform?: Platform -> PlatformCapabilities
```

### EksPod

```
containers: batch.EksContainerDefinition[] -> Containers, Volumes (container volumes will be added to these pod Volumes)
dnsPolicy?: DnsPolicy -> DnsPolicy
useHostNetwork?: boolean -> HostNetwork
serviceAccount?: string -> ServiceAccountName

addContainer(batch.EksContainerDefinition) -> Containers, Volumes (container volumes will be added to these pod Volumes)
```

#### EksContainerDefinition

```
image: string -> Image
args?: string[] -> Args
command?: string[] -> Command
env?: { [key:string]: string } -> Env
imagePullPolicy?: ImagePullPolicy -> ImagePullPolicy
name?: string -> Name
resources?: EksContainerResources -> Resources
priveleged?: boolean -> SecurityContext.Priveleged
readonlyFileSystem?: boolean -> SecurityContext.ReadOnlyFileSystem
runAsGroup?: number -> SecurityContext.RunAsGroup
runAsRoot?: boolean -> SecurityContext.RunAsNonRoot
runAsUser?: number -> SecurityContext.RunAsUser
volumes?: Volume[] -> VolumeMounts

addVolume(...Volume[]) -> VolumeMounts
```

#### abstract class EksVolume

```
name: string -> Name
mountPath?: string -> MountPath
readonly?: boolean -> ReadOnly
```

#### EmptyDirVolume extends EksVolume

```
medium?: MediumType -> Medium
sizeLimit?: number -> SizeLimit
```

#### HostPathVolume extends EksVolume

```
path: string -> Path
```

#### SecretVolume extends EksVolume

```
secretName: string,
optional?: boolean,
```

#### MultiNodeJobDefinition extends JobDefinition

```
containers: MultiNodeContainer[] -> NodeProperties.NodeRangeProperties
mainNode: number -> NodeProperties.MainNode
instanceType: NodeProperties.NodeRangeProperties.Container.InstanceType
// FargatePlatformConfiguration doesn't apply to multinode
```

#### MultiNodeContainer

```
startNode: number -> NodeProperties.NodeRangeProperties.TargetNodes, NodeProperties.NumNodes
endNode: number -> NodeProperties.NodeRangeProperties.TargetNodes, NodeProperties.NumNodes
container: ecs.ContainerDefinition -> NodeProperties.NodeRangeProperties.Container
```

#### JobQueue

```
//ComputeEnvironmentOrder // can't mix fargate and ec2, added via addComputeEnvironment()
 priority: number -> Priority
name?: string -> JobQueueName
enabled?: boolean ->  State
schedulingPolicy?: SchedulingPolicy -> SchedulingPolicyArn
computeEnvironments: OrderedComputeEnvironment[] -> ComputeEnvironmentOrder

addComputeEnvironment(computeEnvironment: ComputeEnvironment, order: number) -> ComputeEnvironmentOrder
```

#### FairshareSchedulingPolicy extends SchedulingPolicyBase

```
computeReservation?: number -> FairsharePolicy.ComputeReservation
shareDecay?: Duration -> FairsharePolicy.ShareDecaySeconds
shares?: Share[] -> FairsharePolicy.ShareDistribution
```

#### Share

```
id: string -> ShareIdentifier
weight: number -> WeightFactor
```

#### Batch::JobDefinition ← ECS::TaskDefinition

```
ContainerProperties: {
  Command: ContainerDefinitions.Command,
  Environment: ContainerDefinition.Environment,
  ExecutionRoleArn: ExecutionRoleArn,
  FargatePlatformConfiguration: ?????
  Image: ContainerDefinition.Image,
  InstanceType: ????? // only used in multinode
  JobRoleArn: TaskRoleArn,
  LinuxParameters: ContainerDefinition.LinuxParameters,
  LogConfiguration: ContainerDefinition.LogCongifuration,
  Memory: ContainerDefinition.Memory,
  MountPoints: ContainerDefinition.MountPoints,
  NetworkConfiguration.AssignPublicIp: ContainerDefinition.DisableNetworking,
  Priveleged: ContainerDefinition.Priveleged,
  ReadonlyRootFilesystem: ContainerDefinition.ReadonlyRootFilesystem,
  ResourceRequirements: ???????
  Secrets: ContainerDefinition.Secrets,
  Ulimits: ContainerDefinition.Ulimits,
  User: ContainerDefinition.User,
  Vcpu: deprecated, use ResourceRequirements instead
  Volumes: Volumes,
}

Platform: RequiresCompatibilities
```