---
title: "Accounts"
date: 2021-09-02T13:52:05-04:00
draft: false
weight: 1
description: >
    How Amazon Genomics CLI interacts with AWS Accounts
---

Amazon Genomics CLI requires an AWS account in which to deploy the cloud infrastructure required to run and manage workflows. To begin working with Amazon Genomics CLI, an account must be "Activated" by the Amazon Genomics CLI application using the [account activate]( {{< relref "#activate" >}}) command.

## Which AWS Account is Used by Amazon Genomics CLI?

Amazon Genomics CLI uses the same [AWS credential chain](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-precedence) used by the AWS CLI to determine which account should be used and with what credentials. All that is required is that you have an existing AWS account (or create a new one) which contains at least one IAM Principal (User/Role) that you can access.

## Which Region is Used by Amazon Genomics CLI?

Much like accounts and credentials, Amazon Genomics CLI uses the same chain used by the AWS CLI to determine the region that is being targeted. For example, if your AWS profile uses `us-east-1` then Amazon Genomics CLI will use the same. Likewise, if you set the `AWS_REGION` environment variable to `eu-west-1` then that region will be used by Amazon Genomics CLI for all subsequent commands in that shell.

## Shared Infrastructure

When a region is first activated for Amazon Genomics CLI, some basic infrastructure is deployed, including a [VPC](https://docs.aws.amazon.com/vpc/latest/userguide/index.html), which is used for the compute infrastructure that will be deployed in a [context]( {{< relref "../contexts" >}} ), and an [S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/index.html) bucket, which will be used to store workflow intermediates and results. This core infrastructure will be shared by all Amazon Genomics CLI users and projects in that region.
The following diagram shows the infrastructure deployed when the command `agc account activate` is run:

![Image of shared infrastructure](AccountActivateArchitecture.png "Shared Infrastructure Components")

Note that context-specific infrastructure is not shared; it is unique and namespaced by user and project.

## Bring your Own VPC and S3 Bucket

During account [activation]( {{< relref "#activate" >}}) you may specify an existing VPC ID or S3 bucket name for use by Amazon Genomics CLI. If you do not, these will be created for you. Although we use AWS best practices for these, if your organization has specific security requirements for networking and storage, specifying your own may be the easiest way to activate Amazon Genomics CLI in your environment.

## Account Commands

A full reference of the account commands is [here]( {{< relref "../../Reference/agc_account" >}} ).

### `activate`

You can activate an account using `agc account activate`. An account must be activated before any contexts can be deployed or workflows run. Activating an account will also bootstrap the AWS Environment for CDK app deployments.

#### Using an Existing S3 Bucket

Amazon Genomics CLI requires an S3 bucket to store workflow results and associated information. If you prefer to use an existing bucket you can use the form `agc account activate --bucket my-existing-bucket`. If you do this, the AWS [IAM](https://docs.aws.amazon.com/IAM/latest/UserGuide/index.html) role used to run Amazon Genomics CLI must be able to write to that bucket.

#### Using an Existing VPC

To use an existing VPC you can use the form `agc account activate --vpc my-existing-vpc-id`. This VPC must have at least 3 availability zones, each with at least one private subnet. The private subnets must have connectivity to the internet, such as via a NAT gateway, and connectivity to AWS services either through VPC endpoints or the internet. Amazon Genomics CLI will not modify the network topology of the specified VPC.
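
The VPC requirements above can be roughly checked before activation. The following is a minimal sketch, assuming the AWS CLI is installed and configured for the target account and region. The helper names `list_vpc_subnets` and `count_candidate_private_zones` are hypothetical and are not part of Amazon Genomics CLI. The `MapPublicIpOnLaunch` attribute is only a hint that a subnet is private; the route tables remain the authoritative source.

```shell
# Hypothetical helper: list the subnets of a VPC, one per line, as
# <subnet-id> <availability-zone> <auto-assigns-public-ip>.
list_vpc_subnets() {
  aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$1" \
    --query 'Subnets[].[SubnetId,AvailabilityZone,MapPublicIpOnLaunch]' \
    --output text
}

# Hypothetical helper: read that listing on stdin and count the distinct
# availability zones that contain at least one subnet which does NOT
# auto-assign public IPs (a rough proxy for "private").
count_candidate_private_zones() {
  awk '$3 == "False" { print $2 }' | sort -u | wc -l | tr -d ' '
}

# Example (replace the placeholder with your own VPC ID):
# list_vpc_subnets my-existing-vpc-id | count_candidate_private_zones
```

A result of 3 or more suggests the VPC may meet the availability zone requirement; connectivity through a NAT gateway or VPC endpoints still needs to be confirmed separately.
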
#### Specifying Subnets

When using an existing VPC you may need to specify which subnets of the VPC can be used for infrastructure. This is useful when only some private subnets have internet routing. To do this you can supply a comma-separated list of subnet IDs using the `--subnets` flag, or repeat the flag multiple times. For example:

`agc account activate --vpc my-existing-vpc-id --subnets subnet-id-1,subnet-id-2 --subnets subnet-id-3`

We recommend a minimum of 3 subnets across availability zones to take advantage of EC2 instance availability and to ensure high availability of infrastructure.

#### Using a Specific AMI for Compute Environments

Some organizations restrict the use of AMIs to a pre-approved list. By default, Amazon Genomics CLI uses the most recent version of the Amazon Linux 2 ECS Optimized AMI. To change this behavior you can supply the ID of an alternative AMI at account activation. This AMI will then be used for all compute environments used by all newly deployed contexts.

```shell
agc account activate --ami <ami-id>
```

There are some specific requirements that the AMI must comply with. It must be a private AMI from the same account that you will use for deploying Amazon Genomics CLI infrastructure. It must also be capable of successfully running all parts of the [LaunchTemplate](https://github.com/aws/amazon-genomics-cli/blob/main/packages/cdk/lib/constructs/launch-template-data.ts) executed at startup time, including the [ecs-additions](https://github.com/aws/amazon-genomics-cli/tree/main/packages/cdk/lib/artifacts/batch-artifacts/ecs-additions) dependencies. We recommend an ECS optimized image based on Amazon Linux 2, RHEL, Fedora or similar.

If the LaunchTemplate cannot complete successfully, the result is an EC2 instance that cannot join a compute cluster and cannot complete workflow tasks. A common symptom of this is workflow tasks that become stuck in a "runnable" state but are never assigned to a cluster node.
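
The ownership and visibility requirements for a custom AMI can be checked with the AWS CLI before activation. This is a minimal sketch, assuming the AWS CLI is installed and configured with the same credentials that will be used for Amazon Genomics CLI. The helper names are hypothetical, and the check does not verify that the LaunchTemplate will run successfully on the image.

```shell
# Hypothetical helper: print "<owner-account-id> <is-public>" for an AMI.
describe_ami_ownership() {
  aws ec2 describe-images \
    --image-ids "$1" \
    --query 'Images[0].[OwnerId,Public]' \
    --output text
}

# Hypothetical helper: print the account ID of the current credentials.
current_account_id() {
  aws sts get-caller-identity --query Account --output text
}

# Succeeds only if the AMI is private and owned by the current account.
ami_is_private_and_owned() {
  ami_info="$(describe_ami_ownership "$1")"
  owner="$(printf '%s\n' "$ami_info" | awk '{ print $1 }')"
  is_public="$(printf '%s\n' "$ami_info" | awk '{ print $2 }')"
  [ "$owner" = "$(current_account_id)" ] && [ "$is_public" = "False" ]
}

# Example (replace the placeholder with your own AMI ID):
# ami_is_private_and_owned <ami-id> && echo "AMI ownership looks suitable"
```
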
#### Using Only Public Subnets

Amazon Genomics CLI can create a new VPC with only public subnets to use for its infrastructure using the `--usePublicSubnets` flag.

`agc account activate --usePublicSubnets`

This can reduce costs by removing the need for NAT Gateways and VPC Gateway Endpoints to route internet traffic from private subnets. It can also reduce the number of Elastic IP Addresses consumed by your infrastructure.

{{% alert title="Warning" color="warning" %}}
When using a VPC with only public subnets, you will need to ensure that the contexts defined in `agc-project.yaml` files declare that they will use public subnets. For example:
{{% /alert %}}

```yaml
contexts:
  myContext:
    usePublicSubnets: true
    engines:
      - type: nextflow
        engine: nextflow
```

{{% alert title="Warning" color="warning" %}}
Currently, use of public subnets is only supported for contexts that use the Nextflow engine. Use of public IPs with the Cromwell server creates a security risk and will fail. Assignment of public IPs to AWS Batch Fargate tasks (as used by miniwdl and SnakeMake) is possible but will require changes to the WES adapters of those engines. If you need this, please file a [feature request](https://github.com/aws/amazon-genomics-cli/issues/new?labels=enhancement) with your use case.
{{% /alert %}}

##### Security Considerations

Although your infrastructure will be protected by security groups, you should be aware that any manual modification of these may result in exposing your infrastructure to the internet. For this reason *we do **not** recommend using this configuration in production or with sensitive data*.

#### Updating

Issuing `account activate` commands more than once effectively updates the core infrastructure with the difference between the two commands according to the rules below.

##### Updating the VPC

You may change the VPC used by issuing the command `agc account activate --vpc <vpc-id>`.
If a `--vpc` argument is *not* provided as part of an `agc account activate` command then the last VPC used will be 'remembered' and used by default. If you wish to change to a new default VPC created by Amazon Genomics CLI you must deactivate (`agc account deactivate`) and reactivate with no `--vpc` flag.

```shell
agc account activate            # VPC 1 created.
agc account activate --vpc abc  # VPC 1 destroyed and customer VPC abc used.
agc account activate            # Customer VPC abc remembered and used again.
agc account deactivate          # AGC core infrastructure destroyed. Customer VPC abc retained.
agc account activate            # VPC 2 created. Customer VPC abc retained.
```

##### Updating to Use Public Subnets Only

If you wish to change the VPC to use public subnets only, or change it from public subnets to private subnets, you must deactivate the account and reactivate it with (or without) the `--usePublicSubnets` flag. For example:

```shell
agc account activate --usePublicSubnets  # New VPC with only public subnets
agc account deactivate                   # VPC destroyed
agc account activate                     # New VPC with private subnets
```

##### Updating Selected Subnets

To change a VPC to use a different selection of subnets you must supply both the VPC ID and the required subnet IDs. If you omit the `--subnets` flag, then future context deployments will use *all* private subnets of the VPC.

```shell
agc account activate --vpc <vpc-id> --subnets <subnet-1>,<subnet-2>  # use subnets 1 and 2 of vpc-id
agc account activate --vpc <vpc-id> --subnets <subnet-1>,<subnet-4>  # use subnets 1 and 4 of vpc-id
agc account activate --vpc <vpc-id>                                  # use all private subnets of vpc-id
```

##### Updating the Compute-Environment AMI

The compute-environment AMI can be changed by re-issuing the `account activate` command with (or without) the `--ami` flag. If the flag is not provided, the latest Amazon Linux 2 ECS optimized image will be used.
```shell
agc account activate                   # Latest Amazon Linux 2 ECS Optimized AMI used for all contexts
agc account activate --ami <ami-1234>  # AMI 1234 used for new contexts
agc account activate                   # Latest Amazon Linux 2 ECS Optimized AMI used for new contexts
```

### `deactivate`

The `deactivate` command is used to remove the core infrastructure deployed by Amazon Genomics CLI in the current region when an account is activated. The S3 bucket deployed by Amazon Genomics CLI and its contents are retained. If a VPC and/or S3 bucket were specified by the user during account activation, these will also be retained. Any CloudWatch logs produced by Amazon Genomics CLI will also be retained.

If there are existing deployed contexts the command will fail; however, you can force the removal of these at the same time with the `--force` flag. Note that this will also interrupt any running workflow of any user in that region.

The deactivate command will only operate on infrastructure in the current region.

If the deployed infrastructure has been modified through the console or the AWS CLI rather than through Amazon Genomics CLI, deactivation may fail because the infrastructure state is inconsistent with the [CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/index.html) state. If this happens you may need to manually clean up through the CloudFormation console.

## Costs

Core infrastructure deployed for Amazon Genomics CLI is [tagged]( {{< relref "namespaces#tags" >}} ) with the `application-name: agc` tag. This tag can be activated for cost tracking in [AWS Cost Explorer](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/ce-what-is.html). The core infrastructure is shared and *not* tagged with any username, context name or project name.

While a region of an account is activated there will be ongoing charges from the deployed core infrastructure, including VPC NAT gateways and VPC endpoints.
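
The `application-name: agc` tag described above can also be used to see which resources in the current region carry it. This is a minimal sketch, assuming the AWS CLI is installed and that your IAM principal is allowed to call the Resource Groups Tagging API. The helper name `list_agc_tagged_resources` is hypothetical, and not every resource type is necessarily reported by that API.

```shell
# Hypothetical helper: print the ARN of each resource in the current region
# that carries the application-name=agc tag, one ARN per line.
list_agc_tagged_resources() {
  aws resourcegroupstaggingapi get-resources \
    --tag-filters Key=application-name,Values=agc \
    --query 'ResourceTagMappingList[].ResourceARN' \
    --output text | tr '\t' '\n'
}

# Example:
# list_agc_tagged_resources
```
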
If you no longer use Amazon Genomics CLI in a region we recommend you deactivate it. You may also wish to remove the S3 bucket along with its objects as well as the [CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/index.html) logs produced by Amazon Genomics CLI. These are retained by default so that you can view workflow results and logs even after deactivation.

However, if you wish to have this infrastructure remain deployed, you can significantly reduce ongoing costs by using `agc account activate --usePublicSubnets`. This prevents the creation of private subnets with NAT gateways, and the use of VPC endpoints, both of which have associated ongoing costs. Please note that **you must also set `usePublicSubnets: true` in your `agc-project.yaml` if you choose to use this option**. Please also note that this is not recommended for security-critical deployments, as it means that any edits to the stack security groups risk exposing worker nodes to the public internet.

### Network traffic

When running genomics workflows, network traffic can become a significant expense when the traffic is routed through NAT gateways into private subnets (where worker nodes are usually located). To minimize these costs we recommend the use of VPC Endpoints [(see below)]( {{< relref "#vpc-endpoints" >}} ) as well as activating Amazon Genomics CLI and running your workflows in the same region as the S3 bucket holding your genome files. VPC Gateway endpoints are regional, so cross-region S3 access will *not* be routed through a VPC gateway.

If you make use of large container images in your workflows (such as the GATK images) we recommend copying these to a private [ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/index.html) repository in the same region in which you will run your analysis, so that image pulls use the ECR endpoints and avoid traffic through NAT gateways.
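
Copying a container image into a private ECR repository, as recommended above, can be sketched as follows. This is an illustration only, assuming Docker and the AWS CLI are installed, that the target ECR repository already exists, and that your credentials may push to it. The helper names `ecr_registry` and `copy_image_to_ecr` are hypothetical.

```shell
# Hypothetical helper: build the private ECR registry host name
# for an account ID ($1) and region ($2).
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# Hypothetical helper: copy an image into an existing private ECR repository.
# Arguments: source image, account ID, region, target repository, target tag.
copy_image_to_ecr() {
  registry="$(ecr_registry "$2" "$3")"
  target="${registry}/$4:$5"
  # Authenticate Docker against the private registry, then pull, re-tag and push.
  aws ecr get-login-password --region "$3" |
    docker login --username AWS --password-stdin "$registry" &&
    docker pull "$1" &&
    docker tag "$1" "$target" &&
    docker push "$target"
}

# Example (all values are placeholders):
# copy_image_to_ecr broadinstitute/gatk:<tag> <account-id> <region> gatk <tag>
```

Workflows would then reference the image by its ECR URI rather than the public registry name.
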
### VPC Endpoints

When Amazon Genomics CLI creates a VPC it creates the following VPC endpoints:

* `com.amazonaws.{region}.dynamodb`
* `com.amazonaws.{region}.ecr.api`
* `com.amazonaws.{region}.ecr.dkr`
* `com.amazonaws.{region}.ecs`
* `com.amazonaws.{region}.ecs-agent`
* `com.amazonaws.{region}.ecs-telemetry`
* `com.amazonaws.{region}.logs`
* `com.amazonaws.{region}.s3`
* `com.amazonaws.{region}.ec2`

If you provide your own VPC we recommend that the VPC has these endpoints. This will improve the security posture of Amazon Genomics CLI in your VPC and will also reduce NAT gateway traffic charges, which can be substantial for genomics analyses that use large S3 objects and/or large container images.

If you are using the Amazon Genomics CLI client on an EC2 instance in a subnet with no access to the internet you will need a VPC endpoint to `com.amazonaws.{region}.execute-api` so that the client can make calls to the REST services deployed during account activation.

## Technical Details

Amazon Genomics CLI core infrastructure is defined in code and deployed by [AWS CDK](https://aws.amazon.com/cdk/). The CDK app responsible for creating the core infrastructure can be found in [`packages/cdk/apps/core/`](https://github.com/aws/amazon-genomics-cli/tree/main/packages/cdk/apps/core).