# Container Basics for Research


Several container technologies are available, and Docker is the most widely used. In this workshop, we start with a simple application running in a Docker container and take a closer look at the key components and environments it needs. We also explore different ways of running Docker containers on AWS using different services.

Why should we use containers for research?
- Repeatable and shareable tools and applications
- Portable - run in different environments (develop on laptop, test on-premises, run large scale in the cloud)
- Stackable - run different stages of a pipeline/applications with different OS settings and libraries without conflicts
- Easier development - each part of an analysis pipeline can be developed independently by different teams and with most appropriate technologies 


# Getting started

In this workshop, we will use Jupyter/JupyterLab notebooks (with conda_python3 or similar kernels) to experiment with containers on the Amazon SageMaker platform. We set up the learning environment with the AWS SDK for Python (the boto3 library), which lets us interact with the necessary AWS services via API calls. For standalone applications (e.g., scripts), we can use either the AWS SDK for the respective language (if available) or the AWS CLI. This notebook uses both to illustrate their use in the respective contexts.

In [None]:
import boto3
import botocore
import json
import time
import os
import base64
import docker
import pandas as pd

import project_path # path to helper methods
from lib import workshop
from botocore.exceptions import ClientError

The containers we are building in this notebook will be ephemeral. Thus, we need a place to store our output files so that we can later inspect them from this notebook. We will first create a boto3 session and then an S3 bucket (command line tools would internally establish a session in a similar fashion before being able to interact with the AWS APIs). 

In [None]:
session = boto3.session.Session()
region = session.region_name
bucket_name = workshop.create_bucket_name('sagemaker-container-ws-')

bucket = workshop.create_bucket(region, session, bucket_name, False)
print(bucket)
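The `workshop.create_bucket` helper hides the underlying API call. As a rough sketch of what such a helper might do with plain boto3 (this is not the workshop library's actual code, and the names `bucket_request` and `create_bucket_sketch` are hypothetical), the main detail to handle is that S3 expects no `LocationConstraint` for the `us-east-1` region:

```python
def bucket_request(bucket_name, region):
    """Build keyword arguments for the S3 CreateBucket call (hypothetical helper)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a constraint
    if region and region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket_sketch(session, bucket_name):
    """Create the bucket through an existing boto3 session (makes a real API call)."""
    s3_client = session.client("s3")
    s3_client.create_bucket(**bucket_request(bucket_name, session.region_name))
    return bucket_name
```

With the session created above, this could be called as `create_bucket_sketch(session, bucket_name)`.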

We'll also create a helper magic to easily create and save a file from the notebook.

In [None]:
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writetemplate(line, cell):
    # line is the target filename; cell is the template text
    with open(line, 'w+') as f:
        f.write(cell.format(**globals()))
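Because the magic passes the cell text through `str.format`, any `{name}` in a cell is replaced by the notebook variable of that name, and literal braces have to be doubled. The sketch below reproduces only that substitution step; `render_template` and the `base_image` variable are hypothetical names used for illustration.

```python
def render_template(cell, variables):
    """Mimic the substitution step of the writetemplate magic."""
    return cell.format(**variables)


# {base_image} is substituted; ${{PATH}} is escaped and becomes ${PATH}
template = 'FROM {base_image}\nENV PATH="/opt/tool/bin:${{PATH}}"\n'
rendered = render_template(template, {"base_image": "ubuntu"})
print(rendered)
```

This is why a Dockerfile written through the magic needs `${{PATH}}` where a hand-written Dockerfile would use `${PATH}`.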

# Running an application in a container locally

This SageMaker Jupyter notebook runs on an EC2 instance with the Docker daemon pre-installed. Thus, we can build and test Docker containers on the very same instance.

We are going to build a simple web server container that says "Hello World!". 

## The Dockerfile

The Dockerfile is similar to a cooking recipe. It describes the steps necessary to prepare the environment of your application and get it started.

If you already have automation scripts that prepare a new machine (VM or physical) for your application, then most of the work is already done, as Docker can simply run these inside a container.

In the more general case, we start with a base image (i.e., a well-known initial configuration, such as a fresh install of the underlying OS, in this case public.ecr.aws/ubuntu/ubuntu:latest). We write this in the first line (FROM ...) of the Dockerfile below.

We then install, configure, compile, or build all the software we need (including libraries and other dependencies the application might have). Each "RUN" line below has one or more commands and arguments that are executed in sequence (some lines are long, so we use the backslash character to continue them on the next line and avoid horizontal scrolling). Each RUN instruction creates a "layer" or stage in building the application environment.

For this example, we need the webserver (Apache) to listen on port 80/HTTP, thus we use the EXPOSE statement to declare that port. The actual mapping to a port on the host is made later, when we run the container.

Lastly, we start the webserver with the "CMD" statement pointing to the "run_apache.sh" shell script.



In [None]:
%%writetemplate Dockerfile
FROM public.ecr.aws/ubuntu/ubuntu:latest

 
# Install dependencies and apache web server
RUN apt-get update && apt-get -y install apache2

# Create the index html
RUN echo 'Hello World!' > /var/www/html/index.html

# Configure apache 
RUN echo '. /etc/apache2/envvars' > /root/run_apache.sh && \
 echo 'mkdir -p /var/run/apache2' >> /root/run_apache.sh && \
 echo 'mkdir -p /var/lock/apache2' >> /root/run_apache.sh && \
 echo '/usr/sbin/apache2 -D FOREGROUND' >> /root/run_apache.sh && \
 chmod 755 /root/run_apache.sh

EXPOSE 80

CMD /root/run_apache.sh

## Now let's build the container. 

We already have the "docker" runtime installed in this Jupyter environment, so we can easily test our container locally. In other environments, we would need to install Docker and its dependencies (e.g., "sudo yum install docker" or "sudo apt-get install docker.io", depending on the underlying OS).

Before we test the container, we need to build its image by following the recipe in the Dockerfile above. We use the "-t" flag to tag the image as we build it. The resulting container image is stored in the local Docker image store.

We will later learn how to use an external image registry (e.g., AWS Elastic Container Registry/ECR) to push and save the image there. 

In [None]:
!docker build -t simple_server .
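The same build can also be started from Python through the Docker SDK (the `docker` library imported at the top of this notebook). The following is a minimal sketch, assuming a client obtained from `docker.from_env()`; the wrapper name `build_image` is hypothetical.

```python
def build_image(client, tag, path=".", dockerfile="Dockerfile"):
    """Build and tag an image, returning it together with the build log text."""
    image, build_log = client.images.build(path=path, tag=tag, dockerfile=dockerfile)
    # each log entry is a dict; the build steps carry their text under "stream"
    lines = [entry["stream"].rstrip() for entry in build_log if "stream" in entry]
    return image, lines
```

For example, `image, lines = build_image(docker.from_env(), "simple_server")` would correspond to the command-line build above.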

## Run the container 

Recall that the webserver inside our container listens on port 80 for HTTP connections. When we run the container locally, we bind (i.e., create a mapping from) the container port 80 to the localhost port 8080 ("-d" runs the container detached, in the background). Thus, we can use "curl" to access the webserver within the container on port 8080 of our local machine.


In [None]:
c_id = !docker run -d -p 8080:80 simple_server
 
! sleep 3 && docker ps 
! curl http://localhost:8080


Our "Hello World" example should show up. 
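A fixed sleep works for a quick demo, but the time a container needs to start varies. One possible alternative, sketched here with only the Python standard library, is to poll the endpoint until it answers; the function name `wait_for_http` and its default timings are assumptions made for illustration.

```python
import time
import urllib.request


def wait_for_http(url, timeout=30.0, interval=0.5):
    """Poll url until it answers with HTTP 200, or raise after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval + 1) as response:
                if response.status == 200:
                    return response.read().decode("utf-8", errors="replace")
        except OSError:
            pass  # connection refused, HTTP error, or timeout: not ready yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{url} did not respond within {timeout} seconds")
        time.sleep(interval)
```

Calling `wait_for_http("http://localhost:8080")` after starting the container would return the page body as soon as Apache is up.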

We used the command line above to do all this work, but we can achieve the same thing with the Docker SDK for Python (the "docker" library, or a similar SDK in the desired language). Below, we build a simple function that lists the running containers. Then, we stop our webserver example using its container id.

In [None]:
def list_all_running_containers():
    docker_client = docker.from_env()
    container_list = docker_client.containers.list()
    for c in container_list:
        print(c.attrs['Id'], c.attrs['State']['Status'])
    return container_list


docker_client = docker.from_env()
running_containers = list_all_running_containers()

print("Stopping container... ", c_id)
simple_server_container = docker_client.containers.get(c_id[0])
simple_server_container.stop()
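Starting the container can be done through the Docker SDK as well. The sketch below mirrors "docker run -d -p 8080:80 simple_server"; the wrapper name `run_detached` is hypothetical, and the client is assumed to come from `docker.from_env()`.

```python
def run_detached(client, image, host_port, container_port=80):
    """Start image in the background and publish container_port on host_port."""
    return client.containers.run(
        image,
        detach=True,  # same as "docker run -d"
        ports={f"{container_port}/tcp": host_port},  # same as "-p host:container"
    )
```

For example, `container = run_detached(docker_client, "simple_server", 8080)` returns a container object that can later be stopped with `container.stop()`.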

## Let's run some real workload

We are going to use a genomics example here and run the NCBI SRA (Sequence Read Archive) Tool (https://github.com/ncbi/sra-tools), fasterq-dump (https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) to extract fastq (sequence of nucleotides, such as GATTATTATTATTACCTTACA, https://en.wikipedia.org/wiki/FASTQ_format) from SRA-accessions.

The command takes a package name as an argument:
```
$ fasterq-dump SRR000001
```

We will use the official NCBI container base image from https://hub.docker.com/r/ncbi/sra-tools as our starting point instead of building one from scratch. Third party versions exist too, but they might not be up to date (e.g., https://hub.docker.com/r/pegi3s/sratoolkit/)

The workflow implemented by our container is:
1. Upon start, the container runs the script "sratest.sh".
2. sratest.sh "prefetches" the data package, whose name is passed via an environment variable.
3. sratest.sh then runs "fasterq-dump" on the data package.
4. sratest.sh then uploads the result to our s3://{bucket}.

The output of the fasterq-dump will be stored in s3://{bucket}/data/sra-toolkit/fasterq/{PACKAGE_NAME}

We first need to set up our environment with the necessary credentials and input/output options.


In [None]:
PACKAGE_NAME='SRR000002'

# this is where the output will be stored
sra_prefix = 'data/sra-toolkit/fasterq'
sra_output = f"s3://{bucket}/{sra_prefix}"

# to run the docker container locally, you need the access credentials inside the container when using AWS CLI
# pass the current keys and session token to the container via environment variables
credentials = boto3.session.Session().get_credentials()
current_credentials = credentials.get_frozen_credentials() 

# Please don't print these out or store them in files: 
access_key=current_credentials.access_key
secret_key=current_credentials.secret_key
token=current_credentials.token
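These values reach the container as environment variables. If the container is started through the Docker SDK instead of the command line, they can be collected in a dictionary first. This is a minimal sketch; the helper name `container_environment` is hypothetical, and it assumes a credentials object with `access_key`, `secret_key`, and `token` attributes, like the frozen credentials above.

```python
def container_environment(credentials, sra_output, package_name):
    """Collect the settings the container expects as environment variables."""
    env = {
        "SRA_OUTPUT": sra_output,
        "PACKAGE_NAME": package_name,
        "AWS_ACCESS_KEY_ID": credentials.access_key,
        "AWS_SECRET_ACCESS_KEY": credentials.secret_key,
    }
    # long-term IAM user keys come without a session token
    if credentials.token:
        env["AWS_SESSION_TOKEN"] = credentials.token
    return env
```

The resulting dictionary can be passed as the `environment` argument of `docker_client.containers.run(...)`. As with the variables above, avoid printing it or writing it to a file.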


We then build our "sratest.sh" script following the workflow above. Note that the script reads the settings above (PACKAGE_NAME, SRA_OUTPUT) from environment variables when the container runs. We use the "aws s3 sync" command to push our output files into S3 for later use.

In [None]:
%%writetemplate sratest.sh
#!/bin/bash
set -x

## Prefetch accession
prefetch $PACKAGE_NAME --output-directory /tmp

## Perform conversion using 8 threads (more will lead to I/O issues)
fasterq-dump $PACKAGE_NAME -e 8

## Upload results to S3 bucket
aws s3 sync . $SRA_OUTPUT/$PACKAGE_NAME
 

We are now ready to build our container starting from the available NCBI official image instead of the bare OS. We still need python and the AWS CLI besides the "sratest.sh" script we created above. 

In [None]:
%%writetemplate Dockerfile.ncbi
FROM ncbi/sra-tools

RUN apk add aws-cli
# ENV persists in the image; a "RUN export" would be lost after its own build step
ENV PATH=/usr/local/bin/aws/bin:$PATH
ADD sratest.sh /usr/local/bin/sratest.sh
RUN chmod +x /usr/local/bin/sratest.sh
WORKDIR /tmp
ENTRYPOINT ["/bin/sh","/usr/local/bin/sratest.sh"]

We can now build a new container image with the above Dockerfile and tag it as "myncbi/sra-tools" so that we can use it later.

In [None]:
!docker build -t myncbi/sra-tools -f Dockerfile.ncbi .

Now we can give this new container a try. We will provide our runtime settings via environment variables specified in the Docker command line.

In [None]:
PACKAGE_NAME='SRR000002'

# only run this when you need to clean up the registry and storage
#!docker system prune -a -f
!docker run --env SRA_OUTPUT=$sra_output --env PACKAGE_NAME=$PACKAGE_NAME --env AWS_ACCESS_KEY_ID=$access_key \
 --env AWS_SECRET_ACCESS_KEY=$secret_key --env AWS_SESSION_TOKEN=$token myncbi/sra-tools:latest
 

Let's try something else by changing the PACKAGE_NAME and running our container again.

In [None]:
# Now try a different package
PACKAGE_NAME = 'SRR000003'
!docker run --env SRA_OUTPUT=$sra_output --env PACKAGE_NAME=$PACKAGE_NAME --env AWS_ACCESS_KEY_ID=$access_key \
 --env AWS_SECRET_ACCESS_KEY=$secret_key --env AWS_SESSION_TOKEN=$token myncbi/sra-tools:latest

# Starting from scratch with our own Docker image

So far, we have been using the existing NCBI base image. This was a great time saver, but perhaps we have some special code optimizations, need additional tools, or have certain settings that we want to take advantage of. Let's build our own image starting with the base Ubuntu Linux image.

Workflow:
1. Install tzdata - this is a dependency for other packages we need. Under normal circumstances, we do not need to install it explicitly; however, "tzdata" prompts for a timezone during installation, which would halt the docker build. Thus we install it separately with DEBIAN_FRONTEND="noninteractive" and -y.
2. Install wget and awscli.
3. Download the sratoolkit Ubuntu binary and unpack it into /opt. We need to generate a UUID for the configuration to avoid issues with the vdb-config interactive mode in sratoolkit version 2.10.3 and above.
4. Set the PATH to include sratoolkit/bin and HOME to /tmp in order to set up the base configuration.
5. USER nobody is needed to set the permissions for the sratoolkit configuration.
6. Use the same sratest.sh script 

We will build a new Dockerfile for this:

In [None]:
%%writetemplate Dockerfile.myown
#FROM ubuntu:18.04 
FROM public.ecr.aws/ubuntu/ubuntu:latest

RUN apt-get update 
RUN DEBIAN_FRONTEND="noninteractive" apt-get -y install tzdata \
 && apt-get install -y curl wget libxml-libxml-perl awscli uuid-runtime

### Known older version that is preconfigured
#RUN wget -q https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.0/sratoolkit.2.10.0-ubuntu64.tar.gz -O /tmp/sratoolkit.tar.gz \
# && tar zxf /tmp/sratoolkit.tar.gz -C /opt/ && rm /tmp/sratoolkit.tar.gz && ln -s /opt/sratoolkit.2.10.0-ubuntu64 /opt/sratoolkit

### Latest version requires workaround below
RUN wget -q https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz -O /tmp/sratoolkit.tar.gz \
 && tar zxf /tmp/sratoolkit.tar.gz -C /opt/ && rm /tmp/sratoolkit.tar.gz && \
 ln -s /opt/sratoolkit.$(curl -s https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current.version)-ubuntu64 /opt/sratoolkit
 
ENV PATH="/opt/sratoolkit/bin/:${{PATH}}"

ADD sratest.sh /usr/local/bin/sratest.sh
RUN chmod +x /usr/local/bin/sratest.sh

### Workaround for vdb-config --interactive
RUN mkdir /tmp/.ncbi && printf '/LIBS/GUID = "%s"\n' `uuidgen` > /tmp/.ncbi/user-settings.mkfg

ENV HOME=/tmp
WORKDIR /tmp

USER nobody
ENTRYPOINT ["/usr/local/bin/sratest.sh"]
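To make the vdb-config workaround less opaque: the `printf` line in the Dockerfile writes a single setting containing a freshly generated UUID. A Python sketch of the same file content is shown below (the function name `ncbi_user_settings` is hypothetical).

```python
import uuid


def ncbi_user_settings(guid=None):
    """Return the one-line content written to /tmp/.ncbi/user-settings.mkfg."""
    guid = guid or str(uuid.uuid4())
    return f'/LIBS/GUID = "{guid}"\n'


print(ncbi_user_settings(), end="")
```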

We will use this Dockerfile to build a new image and tag it with "myownncbi/sra-tools" to differentiate it from the previous build.

In [None]:
!docker build -t myownncbi/sra-tools -f Dockerfile.myown .

We can now test it out in a similar manner using the PACKAGE_NAME='SRR000004' and "myownncbi/sra-tools:latest" arguments.

In [None]:
PACKAGE_NAME='SRR000004'

!docker run --env SRA_OUTPUT=$sra_output --env PACKAGE_NAME=$PACKAGE_NAME --env AWS_ACCESS_KEY_ID=$access_key \
 --env AWS_SECRET_ACCESS_KEY=$secret_key --env AWS_SESSION_TOKEN=$token myownncbi/sra-tools:latest

Recall that we used the "aws s3 sync" command in our "sratest.sh" script above to save our output into the desired bucket... Now we can check the result of our Docker experiments.

We will use the boto3 session we created above to list the objects in our bucket and then download them. We could do the same with a simple "aws s3 cp --recursive s3://{bucket}/ ." or our favorite S3 browser tool.

In [None]:
# check out the output files on S3
s3_client = session.client('s3')
objs = s3_client.list_objects(Bucket=bucket, Prefix=sra_prefix)
# 'Contents' is absent when no object matches the prefix
for obj in objs.get('Contents', []):
    fn = obj['Key']
    p = os.path.dirname(fn)
    # create the local directory for this key if needed
    if not os.path.exists(p):
        os.makedirs(p)
    s3_client.download_file(bucket, fn, fn)
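A single listing call returns at most 1,000 objects, which is enough here but not for larger result sets. A possible variant that follows all result pages with a boto3 paginator is sketched below; the function names `iter_keys` and `download_prefix` are hypothetical.

```python
import os


def iter_keys(s3_client, bucket, prefix):
    """Yield every object key under prefix, following all result pages."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def download_prefix(s3_client, bucket, prefix, dest="."):
    """Download all objects under prefix into dest, keeping the key layout."""
    downloaded = []
    for key in iter_keys(s3_client, bucket, prefix):
        if key.endswith("/"):
            continue  # skip zero-byte "folder" placeholder objects
        local_path = os.path.join(dest, key)
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        s3_client.download_file(bucket, key, local_path)
        downloaded.append(local_path)
    return downloaded
```

With the client above, `download_prefix(s3_client, bucket, sra_prefix)` would fetch the same files as the loop in the previous cell.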

We now have the files locally (in this Jupyter notebook environment) and can do a quick inspection.

In [None]:
#You only need to do this once per kernel - used in analyzing fastq data. If you don't want to run the last inspection step below, then you don't need this.
!pip install bioinfokit 

In [None]:
from bioinfokit.analys import fastq
fastq_iter = fastq.fastq_reader(file=f"{sra_prefix}/{PACKAGE_NAME}/{PACKAGE_NAME}.fastq")
# read the fastq file and print out the first 10 records
i = 0
for record in fastq_iter:
    # get sequence headers, sequence, and quality values
    header_1, sequence, header_2, qual = record
    # get sequence length
    sequence_len = len(sequence)
    # count A bases
    a_base = sequence.count('A')
    if i < 10:
        print(sequence, qual, a_base, sequence_len)
    i += 1

print(f"Total number of records for package {PACKAGE_NAME} : {i}")
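If installing bioinfokit is not an option, a FASTQ file can also be read with a few lines of standard-library Python. The sketch below assumes the common four-lines-per-record layout (header, sequence, "+" separator, quality) without wrapped sequence lines; `read_fastq` is a hypothetical name, and the two sample records are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


# two made-up records, only to show the call pattern
sample = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!!!\n")
for header, sequence, quality in read_fastq(sample):
    print(header, len(sequence), sequence.count("A"))
```

For a downloaded file, the same generator can be used with `open(path)` in place of the in-memory sample.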

# Cleanup

We are done! We built and ran multiple containers, got the results saved for later, and did a quick analysis. Let's do some cleanup.

In [None]:
!aws s3 rb s3://$bucket --force 
!rm -rf $sra_prefix
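The bucket cleanup can also be done from Python. The following is a minimal sketch of an equivalent to "aws s3 rb --force" for a bucket without versioning enabled; the function name `empty_and_delete_bucket` is hypothetical.

```python
def empty_and_delete_bucket(s3_client, bucket):
    """Delete every object in an unversioned bucket, then the bucket itself."""
    paginator = s3_client.get_paginator("list_objects_v2")
    deleted = 0
    for page in paginator.paginate(Bucket=bucket):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # a result page holds at most 1,000 keys, the limit for one delete call
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    s3_client.delete_bucket(Bucket=bucket)
    return deleted
```

This permanently removes data, so it is worth double-checking the bucket name before calling it.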

## Other ways to run the container 

We looked at creating and running containers locally in this notebook. While this is great for small examples, we often need more computing or storage resources than our local machine can provide. 

Please check the "notebook/hpc/batch-fastqc" notebook for running containers with the AWS Batch service. 