# Steps to build spark-benchmark-assembly application

## Pre-requisites

We can complete all the steps either from a local desktop or using [AWS Cloud9](https://aws.amazon.com/cloud9/). If you're using AWS Cloud9, follow the instructions in the first step to spin up an AWS Cloud9 environment; otherwise, skip to the next step.

### 1. Set up AWS Cloud9

AWS Cloud9 is a cloud-based IDE that lets you write, run, and debug your code with just a browser. It comes preconfigured with many of the dependencies we require to build our application. Create an AWS Cloud9 environment from the [AWS Management Console](https://console.aws.amazon.com/cloud9) with an instance type of t3.small or larger. For our testing, we used m5.xlarge to get adequate memory and CPU to compile and build our benchmark application. Provide the required name, and leave the remaining settings at their default values. After your environment is created, you should have access to a terminal window. If you are new to AWS Cloud9, follow [this tutorial](https://docs.aws.amazon.com/cloud9/latest/user-guide/tutorial.html) to create your environment.

You must increase the size of the [Amazon Elastic Block Store (Amazon EBS)](https://aws.amazon.com/ebs/) volume attached to your AWS Cloud9 instance to 20 GB, because the default size (10 GB) is not enough. For instructions, refer to [Resize an Amazon EBS volume used by an environment](https://docs.aws.amazon.com/cloud9/latest/user-guide/move-environment.html#move-environment-resize).

### 2. Install Docker if required

The AWS Cloud9 m5.xlarge EC2 instance comes with Docker pre-installed. Depending on your environment, you may or may not need to install Docker. To install Docker, follow the instructions on the [Docker Desktop page](https://docs.docker.com/desktop/#download-and-install).

## Build Benchmark application

To build our application, we reuse the source code of the TPC-DS benchmarking application from [the EMR on EKS benchmark GitHub repo](https://github.com/aws-samples/emr-on-eks-benchmark).

### 1. Download the source code from the GitHub repo as shown below:

```
git clone https://github.com/aws-samples/emr-on-eks-benchmark.git
```

### 2. Build a Docker image of Apache Spark

First change to the project root directory, and then build Spark version 3.3.0 with Hadoop 3.3.4. Feel free to change the Spark version to the one you need.

```
cd emr-on-eks-benchmark
docker build -t spark:3.3.0_hadoop_3.3.4 -f docker/hadoop-aws-3.3.1/Dockerfile --build-arg HADOOP_VERSION=3.3.4 --build-arg SPARK_VERSION=3.3.0 .
```

### 3. Build the Spark Benchmark application as a Docker image

Build the benchmark utility based on the Spark image we created above. To do that, we need to make sure the Dockerfile points to the correct Spark and Hadoop versions. Edit [docker/benchmark-util/Dockerfile](https://github.com/aws-samples/emr-on-eks-benchmark/blob/main/docker/benchmark-util/Dockerfile) and make sure the Spark and Hadoop versions are correct. In our example, we are benchmarking Spark version 3.3.0.

```
ARG SPARK_VERSION=3.3.0
ARG HADOOP_VERSION=3.3.4
```

Use this Dockerfile to build the benchmark utility as shown below:

```
docker build -t eks-spark-benchmark:3.3.0 -f docker/benchmark-util/Dockerfile --build-arg SPARK_BASE_IMAGE=spark:3.3.0_hadoop_3.3.4 .
```
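Before extracting the benchmark jar, you can optionally confirm that both images were built successfully. This is a minimal sanity check; the image names and tags match the build commands above, so adjust them if you used different versions.

```
# Optional: both images should be listed with a recent CREATED time
docker images spark:3.3.0_hadoop_3.3.4
docker images eks-spark-benchmark:3.3.0
```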
### 4. Copy the benchmark application jar file from the Docker image

To do this, open two terminals. In the first terminal, run a Docker container from the image built in the previous step. In the example below, we give it the name `spark-benchmark` using the `--name` argument.

```
docker run --name spark-benchmark -it eks-spark-benchmark:3.3.0 bash
```

This should start a bash prompt in your spark-benchmark Docker container. If the build was successful, inside the container you should see a jar file named `eks-spark-benchmark-assembly-1.0.jar` in the `$SPARK_HOME/examples/jars` directory, as shown in the example below:

```
hadoop@9ca5b2afe778:/opt/spark/work-dir$ pwd
/opt/spark/work-dir
hadoop@9ca5b2afe778:/opt/spark/work-dir$ cd ../examples/jars
hadoop@9ca5b2afe778:/opt/spark/examples/jars$ ls
eks-spark-benchmark-assembly-1.0.jar  scopt_2.12-3.7.1.jar  spark-examples_2.12-3.3.0.jar
```

In the second terminal in AWS Cloud9, running the `docker ps` command shows our running container. Here is an example:

```
sekar:~/environment $ docker ps
CONTAINER ID   IMAGE                       COMMAND                  CREATED         STATUS         PORTS     NAMES
9ca5b2afe778   eks-spark-benchmark:3.3.0   "/opt/entrypoint.sh …"   7 seconds ago   Up 6 seconds             spark-benchmark
```

Now you can copy the `eks-spark-benchmark-assembly-1.0.jar` file from the Docker container into your local directory using the `docker cp` command, as shown below:

```
docker cp spark-benchmark:/opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar ./spark-benchmark-assembly-3.3.0.jar
```

Optionally, upload the benchmark application to Amazon S3. Replace `$YOUR_S3_BUCKET` with your S3 bucket name.

```
aws s3 cp spark-benchmark-assembly-3.3.0.jar s3://$YOUR_S3_BUCKET/blog/jar/spark-benchmark-assembly-3.3.0.jar
```
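As a final check, you can confirm that the jar was copied locally and, if you uploaded it, that it landed in your bucket, and then clean up the container started in the first terminal. This is a minimal sketch; the local file name and S3 path mirror the commands above.

```
# Optional: confirm the local copy and the S3 upload (if performed)
ls -lh spark-benchmark-assembly-3.3.0.jar
aws s3 ls s3://$YOUR_S3_BUCKET/blog/jar/spark-benchmark-assembly-3.3.0.jar

# Optional: stop and remove the spark-benchmark container once you're done
docker rm -f spark-benchmark
```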