# Benchmarking Amazon S3 performance with PyWren and AWS Lambda


# Step by Step instructions

### Setup Logging (optional)
Run the cell below only if you want to see PyWren's log messages (the cell sets the log level to `INFO`). _Note: The output will be rather chatty and lengthy._


In [None]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
%env PYWREN_LOGLEVEL=INFO

Let's set up all the necessary libraries:

In [None]:
%pylab inline
import numpy as np
import time
import s3_benchmark
import pandas as pd
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3, where cPickle no longer exists
import seaborn as sns
sns.set_style('whitegrid')

**IMPORTANT** - We need to set the `S3BUCKET` variable to the Amazon S3 bucket that was created earlier with the AWS CloudFormation template. Please update the following variable with the bucket name that you copied from the Outputs tab.



In [None]:
S3BUCKET = 'pywren-workshop-s3bucket-12apl9us09h1d'

## Step by Step instructions

_**IMPORTANT** - This lab writes and reads many 200 MB files in your Amazon S3 bucket, which will incur costs according to [Amazon S3 pricing](https://aws.amazon.com/s3/pricing/). Make sure to empty the bucket after the lab to avoid unnecessary cost._

We are going to benchmark an Amazon S3 bucket by writing a large amount of data to it and then reading that data back. We use the `S3BUCKET` variable, which holds the name of the bucket that was created with the CloudFormation template. We will run 100 AWS Lambda functions in parallel. (_Your account might have a lower soft limit on parallel executions; if you encounter an error, we suggest changing the `--number` parameter to a smaller value. Also feel free to try a higher number of workers to see the performance boost of parallel execution._)
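
To get a feel for how much data a run moves (and therefore what it may cost), the arithmetic can be sketched as below. This is a rough estimate only; it assumes each AWS Lambda function writes exactly one object of `--mb_per_file` MB, and the helper name `total_volume_mb` is made up for this illustration.

```python
# Rough sizing of one benchmark run. Assumption: each function writes
# exactly one object of mb_per_file megabytes.
def total_volume_mb(number, mb_per_file):
    """Total data written (and later read back) in MB."""
    return number * mb_per_file

number = 100        # value passed as --number in this lab
mb_per_file = 200   # value passed as --mb_per_file in this lab

volume_mb = total_volume_mb(number, mb_per_file)
print("{} files x {} MB = {} MB (~{} GB)".format(
    number, mb_per_file, volume_mb, volume_mb / 1000.0))
```

Lowering `--number` reduces both the parallelism and the total volume proportionally.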

All of the actual benchmark code is in a stand-alone Python file, which the command cell below invokes. It places the output in `write.pickle`. If you are interested in the details, you can inspect the [s3_benchmark.py](/edit/Lab-1-Hello-World/benchmark_s3/s3_benchmark.py) file. Here is the relevant code snippet that invokes the distributed PyWren functions:

```python
wrenexec = pywren.default_executor()

# create list of random keys
keynames = [ key_prefix + str(uuid.uuid4().get_hex().upper()) for _ in range(number)]
futures = wrenexec.map(run_command, keynames)

results = [f.result() for f in futures]
```
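
The same fan-out pattern (build a list of random keys, map a function over them, collect the results from futures) can be tried locally without AWS. The sketch below is not PyWren: it swaps in `ThreadPoolExecutor` from the Python 3 standard library for the PyWren executor, and `fake_run_command` and the `demo_` prefix are hypothetical stand-ins that never touch S3.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


def fake_run_command(key):
    # Hypothetical stand-in for the benchmark's run_command: it only
    # reports the key it received and its length.
    return key, len(key)


key_prefix = "demo_"  # hypothetical prefix, for illustration only
number = 4            # number of parallel tasks

# One random key per task, mirroring the snippet above.
keynames = [key_prefix + uuid.uuid4().hex.upper() for _ in range(number)]

# Submit one task per key, then block until every result is available.
with ThreadPoolExecutor(max_workers=number) as pool:
    futures = [pool.submit(fake_run_command, k) for k in keynames]
    results = [f.result() for f in futures]

print(results)
```

Collecting `f.result()` in submission order keeps each result aligned with the key that produced it.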


In [None]:
!python s3_benchmark.py write --mb_per_file=200 --bucket_name={S3BUCKET} --number=100 --outfile=write.pickle

Let's have a quick look at our bucket to see which files have been created. Here's a direct link to your [S3 Management Console](https://s3.console.aws.amazon.com/s3/buckets/?region=us-west-2&tab=overview).

We then run the read test:

In [None]:
!python s3_benchmark.py read --key_file=write.pickle --outfile=read.pickle

Now let's plot the results and look at the distribution of read and write rates to Amazon S3 across our 200 AWS Lambda function executions (100 writes followed by 100 reads):

In [None]:
current_palette = sns.color_palette()
read_color = current_palette[0]
write_color = current_palette[1]

# Pickle files must be opened in binary mode ('rb')
with open("write.pickle", 'rb') as f:
    write_data = pickle.load(f)
write = s3_benchmark.compute_times_rates(write_data['results'])

with open("read.pickle", 'rb') as f:
    read_data = pickle.load(f)
# Keep only the first three fields of each read result
read_time_results = [r[:3] for r in read_data['results']]
read = s3_benchmark.compute_times_rates(read_time_results)

fig = pylab.figure(figsize=(8, 6))
sns.distplot(read['rate'], label='read', color=read_color)
sns.distplot(write['rate'], label='write', color=write_color)
pylab.legend()
pylab.xlabel("MB/sec")
pylab.ylabel("Function count (per 1000)")
pylab.grid(True)
pylab.title("Read/Write rates per single AWS Lambda function")
sns.despine()
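
Alongside the distribution plot, a few summary numbers can make the read and write rates easier to compare. The helper below is a minimal sketch; `summarize` is a made-up name, and the example values are invented for illustration rather than measured. In the notebook you could presumably pass `read['rate']` or `write['rate']` instead, assuming they are plain sequences of MB/sec values.

```python
def summarize(rates):
    """Return min, median, mean and max of a sequence of MB/sec values."""
    ordered = sorted(rates)
    n = len(ordered)
    mid = n // 2
    if n % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2.0
    return {
        "min": ordered[0],
        "median": median,
        "mean": sum(ordered) / float(n),
        "max": ordered[-1],
    }


# Made-up MB/sec values, for illustration only (not benchmark results).
example_rates = [40.0, 55.0, 60.0, 65.0, 80.0]
stats = summarize(example_rates)
print(stats)
```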

We can investigate when jobs start and how long they run. Each horizontal line is one job, drawn from its start time to its end time; the overlaid curve shows the total number of jobs running at that moment.

In [None]:
from matplotlib.collections import LineCollection
fig = pylab.figure(figsize=(10, 6))

for plot_i, (datum, l, c) in enumerate([(read, 'read', read_color),
                                        (write, 'write', write_color)]):
    ax = fig.add_subplot(1, 2, 1 + plot_i)

    N = len(datum['start_time'])
    line_segments = LineCollection([[[datum['start_time'][i], i],
                                     [datum['end_time'][i], i]] for i in range(N)],
                                   linestyles='solid', color='k', alpha=0.4, linewidth=0.2)
    #line_segments.set_array(x)

    ax.add_collection(line_segments)

    ax.plot(s3_benchmark.runtime_bins, datum['runtime_jobs_hist'].sum(axis=0),
            c=c, label='active jobs total',
            zorder=-1)


    ax.set_xlim(0, np.max(datum['end_time']))
    ax.set_ylim(0, len(datum['start_time'])*1.05)
    ax.set_xlabel("time (sec)")
    if plot_i == 0:
        ax.set_ylabel("AWS Lambda function execution")
    ax.grid(False)
    ax.legend(loc='upper right')
fig.tight_layout()
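
The "active jobs total" curve in the plot above counts how many jobs are running at each point in time. The sketch below shows one way such a count can be derived from start and end times; it is an illustration with made-up values and a made-up helper name (`active_jobs`), not the code used inside `s3_benchmark.py`.

```python
def active_jobs(start_times, end_times, bins):
    """For each time in bins, count the jobs with start <= time < end."""
    counts = []
    for t in bins:
        running = sum(1 for s, e in zip(start_times, end_times) if s <= t < e)
        counts.append(running)
    return counts


# Three made-up jobs (times in seconds), for illustration only.
starts = [0.0, 1.0, 1.5]
ends = [3.0, 2.5, 4.0]
bins = [0, 1, 2, 3, 4]

counts = active_jobs(starts, ends, bins)
print(counts)  # [1, 2, 3, 1, 0]
```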

Lastly, let's plot the aggregate read and write throughput over time:

In [None]:
fig = pylab.figure(figsize=(8, 4))
ax = fig.add_subplot(1, 1, 1)
for d, l, c in [(read, 'read', read_color), (write, 'write', write_color)]:
    ax.plot(d['runtime_rate_hist'].sum(axis=0)/1000, label=l, c=c)
ax.set_xlabel('time (sec)')
ax.set_ylabel("GB/sec")
pylab.legend()
sns.despine()
fig.tight_layout()
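
The aggregate curve sums the rates of all functions that are active at a given moment and divides by 1000 to convert MB/sec to GB/sec. The sketch below illustrates that idea with made-up numbers; `aggregate_rate_gb` is a hypothetical helper, and it assumes each job is described by a `(start_sec, end_sec, mb_per_sec)` tuple with a constant rate while it runs.

```python
def aggregate_rate_gb(jobs, bins):
    """Sum the MB/sec of all jobs active at each time, returned in GB/sec.

    jobs: iterable of (start_sec, end_sec, mb_per_sec) tuples (assumed layout).
    """
    totals = []
    for t in bins:
        mb_per_sec = sum(rate for start, end, rate in jobs if start <= t < end)
        totals.append(mb_per_sec / 1000.0)  # same MB -> GB factor as the plot
    return totals


# Three made-up jobs, for illustration only (not benchmark results).
jobs = [(0.0, 3.0, 50.0), (1.0, 2.5, 70.0), (1.5, 4.0, 80.0)]
throughput = aggregate_rate_gb(jobs, range(5))
print(throughput)
```

Because the rates add up, the peak of this curve grows with the number of functions that overlap in time.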

Please note that the performance will increase with the number of workers (parallel AWS Lambda function executions) you use. We tried this with 2800 workers, and here is the resulting performance chart:

![2800 Workers](performance_2800workers.png)
