# AWS Batch for Protein Folding With Nextflow Orchestration

*Deprecated*: The [AWS Batch Guidance for Protein Folding and Design](https://github.com/aws-solutions-library-samples/aws-batch-arch-for-protein-folding) now natively supports Nextflow; please use that functionality instead. See [here](https://github.com/aws-solutions-library-samples/aws-batch-arch-for-protein-folding/tree/main/notebooks/orchestration) for sample notebooks.

_____________

## Deployment

The steps below show how to modify the [AWS Batchfold architecture](https://github.com/aws-solutions-library-samples/aws-batch-arch-for-protein-folding) to include orchestration with Nextflow. The `infrastructure` directory contains the modified CloudFormation templates and a Dockerfile used in these steps.

**1.** Deploy the [AWS Batchfold architecture](https://github.com/aws-solutions-library-samples/aws-batch-arch-for-protein-folding) CloudFormation template and wait for the architecture to fully deploy.

**2.** Once deployed, the Batchfold architecture creates a repository in AWS CodeCommit with a name following the pattern `batch-protein-folding-code-repo-xxxxxxxxx`. This repository is also copied into a corresponding SageMaker notebook instance. In the next two steps, you will copy files **from** *this* repository (the one containing this README) **to** your copy of the Batchfold repository in the SageMaker notebook instance. This adds the components needed to run Nextflow.

**3.** Within the SageMaker notebook instance, locate the repository named `batch-protein-folding-code-repo-xxxxxxxxx`. Within that repository, add the following files (from *this* repository) to the `infrastructure/cloudformation` directory: `batch-protein-folding-cfn-batch.yaml`, `batch-protein-folding-cfn-module-nextflow.yaml`, and `batch-protein-folding-cfn-root.yaml`.

**4.** Copy the `infrastructure/docker/nextflow` directory from *this* repository to a new `infrastructure/docker/nextflow` directory in the repository in the SageMaker notebook instance. Do the same for `notebooks/orchestration`, copying it in as a new directory from *this* repository.

Now push these changes back to the main branch in CodeCommit:

**5.** Execute the following:

```bash
git add .
git commit -m "modify to include nextflow orchestration"
git push origin main
```

**6.** Now that the repository has been updated, re-run CloudFormation to update the stack. Make sure the stack name is the **same stack name** you chose when deploying the original Batchfold architecture.

```bash
# just enter the name of the bucket in the format "mybucket" (no s3:// prefix)
S3_BUCKET="ENTER AN S3 BUCKET OF YOUR CHOICE"
REGION="YOUR REGION"
STACK_NAME="THE STACK NAME OF THE ORIGINAL BATCHFOLD DEPLOYMENT"

aws cloudformation package \
  --template-file infrastructure/cloudformation/batch-protein-folding-cfn-root.yaml \
  --region $REGION \
  --output-template-file infrastructure/cloudformation/batch-protein-folding-cfn-packaged.yaml \
  --s3-bucket $S3_BUCKET

# redeploy the CFN
aws cloudformation deploy \
  --template-file infrastructure/cloudformation/batch-protein-folding-cfn-packaged.yaml \
  --stack-name $STACK_NAME \
  --region $REGION \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides CreateG5ComputeEnvironment=N MultiAZ=N DownloadFsxData=Y
```

**7.** You may need to manually start the Nextflow CodeBuild step; you can do so from the CodeBuild console, or from the command line as shown in the sketch below.
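If you prefer the command line to the console, the following is a minimal sketch using the AWS CLI. The project name is a placeholder (an assumption, not a name from this repository); take the actual Nextflow project name from `aws codebuild list-projects` or the CodeBuild console.

```bash
# Find the CodeBuild project that builds the Nextflow container image
aws codebuild list-projects --region $REGION

# Start the build; the project name below is a placeholder -- replace it with
# the Nextflow project name reported by the command above
aws codebuild start-build \
  --project-name <nextflow-codebuild-project-name> \
  --region $REGION
```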
## Submission of a Nextflow script

Once the previous steps are executed, you can submit a Nextflow script to Batchfold.

To submit a sample Nextflow script, go to the `notebooks/orchestration` directory within the SageMaker notebook instance and run the following:

```bash
python nextflow_job.py
```

This will run a pipeline that first runs RFDesign, followed by ESMFold on each of the structures generated by RFDesign. Note that prior to running this script, you will first have to retrieve the Nextflow orchestrator and the Nextflow job definition from the AWS Batch console.

You can construct your own Nextflow pipelines as well. When doing so, you will need to first place code dependencies in S3, which will be retrieved by the Nextflow orchestrator. See the script `nextflow_job.py` for more details.
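For reference, the orchestrator job is a normal AWS Batch job, so the submission can also be expressed with the AWS CLI. The sketch below is an illustration under stated assumptions, not a documented interface: the job queue and job definition names are placeholders to be read from the AWS Batch console, and the container command (the S3 location of your pipeline code) is an assumption that should be checked against what `nextflow_job.py` actually passes.

```bash
# Placeholder values -- take the real job queue and job definition names from
# the AWS Batch console; check nextflow_job.py for the exact container command
# the Nextflow orchestrator expects (the S3 pipeline location is an assumption).
aws batch submit-job \
  --job-name nextflow-rfdesign-esmfold \
  --job-queue <nextflow-orchestrator-job-queue> \
  --job-definition <nextflow-job-definition> \
  --container-overrides '{"command": ["s3://<your-bucket>/nextflow/pipeline/"]}'
```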