# Benchmarking GFLOPS with PyWren and AWS Lambda

In [None]:
%pylab inline
import numpy as np
import time
import flops_benchmark
import pandas as pd
import cPickle as pickle
import seaborn as sns
sns.set_style('whitegrid')

# Step-by-step instructions


### Setup Logging (optional)
Run the lines below only if you want to see PyWren's log messages at the `INFO` level. _Note: The output will be rather chatty and lengthy._

In [None]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
%env PYWREN_LOGLEVEL=INFO

We are going to benchmark the function below, which generates two matrices and computes their matrix (dot) product. Each matrix is of size `MAT_N` × `MAT_N`, and the product is computed `loopcount` times.

In [None]:
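def compute_flops(loopcount, MAT_N):
    # Two dense MAT_N x MAT_N float64 matrices
    A = np.arange(MAT_N**2, dtype=np.float64).reshape(MAT_N, MAT_N)
    B = np.arange(MAT_N**2, dtype=np.float64).reshape(MAT_N, MAT_N)

    t1 = time.time()
    for i in range(loopcount):
        c = np.sum(np.dot(A, B))

    # One dense matrix product costs about 2 * MAT_N**3 floating point operations
    FLOPS = 2 * MAT_N**3 * loopcount
    t2 = time.time()
    # Achieved floating point operations per second
    return FLOPS / (t2-t1)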
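
The `2 * MAT_N**3` factor comes from the usual operation count for a dense matrix product: each of the `MAT_N**2` output entries needs `MAT_N` multiplications and roughly as many additions. The sketch below separates that count from the timing so it can be checked locally before anything is sent to AWS Lambda. The helper names `matmul_flop_count` and `measure_gflops` are hypothetical and are not part of `flops_benchmark.py`.

```python
import time
import numpy as np

def matmul_flop_count(loopcount, mat_n):
    """Approximate FLOPs for `loopcount` dense (mat_n x mat_n) products."""
    # mat_n**2 output entries, each needing ~mat_n multiplies and ~mat_n adds
    return 2 * mat_n**3 * loopcount

def measure_gflops(loopcount, mat_n):
    """Time the products on the local machine and return achieved GFLOPS."""
    A = np.arange(mat_n**2, dtype=np.float64).reshape(mat_n, mat_n)
    B = np.arange(mat_n**2, dtype=np.float64).reshape(mat_n, mat_n)
    t1 = time.time()
    for _ in range(loopcount):
        np.dot(A, B)
    # Guard against a zero reading on clocks with coarse resolution
    elapsed = max(time.time() - t1, 1e-9)
    return matmul_flop_count(loopcount, mat_n) / elapsed / 1e9

# Work per function for the two configurations used in this notebook
small_gflop = matmul_flop_count(10, 1024) / 1e9   # about 21.5 GFLOP
big_gflop = matmul_flop_count(10, 4096) / 1e9     # about 1374.4 GFLOP
print(small_gflop, big_gflop)
print("local GFLOPS:", measure_gflops(5, 256))
```

The local figure depends entirely on your machine and its BLAS library, so treat it only as a sanity check on the formula, not as a prediction of what a Lambda function will achieve.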

All of the actual benchmark code is in a stand-alone Python file, which you can call as shown in the next cell. It places the output in `small.pickle`.
If you are interested in the details, you can inspect the [flops_benchmark.py](/edit/Lab-1-Hello-World/benchmark_flops/flops_benchmark.py) file. Here is the relevant code snippet that invokes the distributed PyWren functions:

```python
iters = np.arange(N)

def f(x):
    return {'flops': compute_flops(loopcount, matn)}

pwex = pywren.lambda_executor()
futures = pwex.map(f, iters)

```
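
The snippet follows a map-then-collect pattern: one call per input is submitted, a list of futures comes back, and the results are gathered afterwards. To experiment with that pattern without an AWS account, here is a minimal local stand-in that uses `concurrent.futures` from the Python 3 standard library instead of PyWren. It is only an analogy: the threads run on your own machine, and the body of `f` returns made-up numbers rather than calling `compute_flops`.

```python
from concurrent.futures import ThreadPoolExecutor

N = 4  # number of parallel invocations (stand-in for --workers)

def f(x):
    # Stand-in for the remote function; the value is illustrative only
    return {'flops': float(x) * 1e9}

with ThreadPoolExecutor(max_workers=N) as executor:
    # Submit one call per input, similar in spirit to a map over `iters`
    futures = [executor.submit(f, x) for x in range(N)]
    # Block until each call has finished and collect its return value
    results = [fut.result() for fut in futures]

print(results)
```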

In [None]:
!python flops_benchmark.py --workers=10 --loopcount=10 --matn=1024 --outfile="small.pickle"

We can now plot a histogram of the results: 

In [None]:
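# Open in binary mode so the pickle loads correctly on every platform
with open("small.pickle", 'rb') as fh:
    exp_results = pickle.load(fh)
results_df = flops_benchmark.results_to_dataframe(exp_results)
# Dividing by 1e9 puts the x axis in GFLOPS
sns.distplot(results_df.intra_func_flops/1e9, bins=np.arange(10, 30), kde=False, axlabel='GFLOPS measured intra function')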

# Scaling up
Now we will run a larger number of functions simultaneously. Note that this depends on the soft limit on concurrent AWS Lambda executions for your AWS account. For this workshop we suggest starting with 100 parallel AWS Lambda functions (`--workers=100`). The more functions you run in parallel, the higher your potential aggregate performance.
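
As a rough guide to what "potential performance" means here, the sketch below computes the total work of a run and the ideal aggregate rate if every function ran at the same speed at the same time. Both helper names are hypothetical, and the 20 GFLOPS per-function figure is an assumption chosen for illustration, not a measured result. Real runs fall short of this ceiling because of launch overhead and stragglers.

```python
def total_work_gflop(workers, loopcount, mat_n):
    """Total floating point work of a run, in GFLOP."""
    return workers * 2 * mat_n**3 * loopcount / 1e9

def ideal_aggregate_gflops(workers, per_function_gflops):
    """Upper bound on aggregate GFLOPS if all functions overlap perfectly."""
    return workers * per_function_gflops

# Parameters of the larger run in this notebook
work = total_work_gflop(workers=100, loopcount=10, mat_n=4096)

# ASSUMPTION: 20 GFLOPS per function, for illustration only
ceiling = ideal_aggregate_gflops(workers=100, per_function_gflops=20.0)
print(work, ceiling)
```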

In [None]:
# Parallel Function limit of 100 simultaneous invocations
!python flops_benchmark.py --workers=100 --loopcount=10 --matn=4096 --outfile="big.pickle"

Let's plot the intra function FLOPS first:

In [None]:
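# Open in binary mode so the pickle loads correctly on every platform
with open("big.pickle", 'rb') as fh:
    big_exp_results = pickle.load(fh)
big_results_df = flops_benchmark.results_to_dataframe(big_exp_results)
# Plot the results of the large run (big_results_df), in GFLOPS
sns.distplot(big_results_df.intra_func_flops/1e9, bins=np.arange(10, 36), kde=False, axlabel='GFLOPS measured intra function')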

Now let's aggregate this to understand our total GFLOPS across all the parallel executions and plot it:

In [None]:
est_total_flops = big_results_df['est_flops']
total_jobs = len(big_results_df)
JOB_GFLOPS = est_total_flops /1e9 /total_jobs 
# Time grid covering the whole run, with roughly one bin per second
time_offset = np.min(big_results_df.host_submit_time)
max_time = np.max(big_results_df.download_output_timestamp) - time_offset
# np.linspace needs an integer number of samples
runtime_bins = np.linspace(0, max_time, int(max_time), endpoint=False)


runtime_flops_hist = np.zeros((len(big_results_df), len(runtime_bins)))
for i in range(len(big_results_df)):
    row = big_results_df.iloc[i]
    # The compute window starts once setup has finished
    s = (row.start_time + row.setup_time) - time_offset
    e = row.end_time - time_offset
    a, b = np.searchsorted(runtime_bins, [s, e])
    if b-a > 0:
        # Spread the job's FLOPs evenly over the bins in which it was running
        runtime_flops_hist[i, a:b] = row.est_flops / float(b-a)
 
results_by_endtime = big_results_df.sort_values('download_output_timestamp')
results_by_endtime['job_endtime_zeroed'] = big_results_df.download_output_timestamp - time_offset
results_by_endtime['flops_done'] = results_by_endtime.est_flops.cumsum()
results_by_endtime['rolling_flops_rate'] = results_by_endtime.flops_done/results_by_endtime.job_endtime_zeroed

 
fig = pylab.figure(figsize=(8, 6))
ax = fig.add_subplot(1, 1, 1)
ax.plot(runtime_flops_hist.sum(axis=0)/1e9, label='peak GFLOPS')
ax.plot(results_by_endtime.job_endtime_zeroed,
        results_by_endtime.rolling_flops_rate/1e9, label='effective GFLOPS')
ax.set_xlabel('time (sec)')
ax.set_ylabel("GFLOPS")
pylab.legend()
ax.grid(False)
sns.despine()
fig.tight_layout()
fig.savefig("flops_benchmark.gflops.png")
fig.savefig("flops_benchmark.gflops.pdf")

This plot shows two quantities:
* **Peak GFLOPS**: Across all cores, what is the total simultaneous FLOPS being computed at each point in time?
* **Effective GFLOPS**: If the job ended at this point in time, what would our aggregate effective GFLOPS have been, including the time to launch the jobs and download the results?

We see that "peak GFLOPS" reaches its maximum in the middle of the job, when all 100 Lambda functions are running at once. "Effective GFLOPS" starts climbing as results quickly return, but stragglers cause the total effective GFLOPS to drop slightly toward the end. Still not bad for pure Python!
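
To make the difference between the two curves concrete, here is a small sketch on made-up data. It assumes four hypothetical jobs of 4 GFLOP each, described by their compute start, compute end and download time in seconds since submission. It mirrors the logic of the cell above in simplified form (one-second bins, whole-second timestamps) and is not taken from `flops_benchmark.py`.

```python
import numpy as np

# Hypothetical jobs: (compute start, compute end, result downloaded)
jobs = [
    (2.0, 6.0, 7.0),
    (2.0, 6.0, 7.0),
    (3.0, 7.0, 8.0),
    (4.0, 12.0, 14.0),  # straggler
]
FLOPS_PER_JOB = 4e9  # made-up amount of work per job

# Peak: spread each job's FLOPs evenly over the seconds it was computing
max_time = int(np.ceil(max(done for _, _, done in jobs)))
peak = np.zeros(max_time)
for start, end, _ in jobs:
    a, b = int(start), int(end)
    peak[a:b] += FLOPS_PER_JOB / (b - a)

# Effective: cumulative FLOPs delivered divided by wall-clock time so far
done_times = np.array(sorted(done for _, _, done in jobs))
flops_done = FLOPS_PER_JOB * np.arange(1, len(jobs) + 1)
effective = flops_done / done_times

print("peak GFLOPS per second:", peak / 1e9)
print("effective GFLOPS at each completion:", effective / 1e9)
```

In this toy run the peak curve tops out at 3.5 GFLOPS while three or four jobs overlap. The effective rate climbs to 1.5 GFLOPS after the third result arrives and then falls once the straggler's late finish is included, which is the same shape described above.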