Testing PySpark Locally with Docker

This part is the preferred method for testing and debugging, and it will be useful for our 6th part: Transformations. The alternative is to use a Glue Development Endpoint, which is explained in the next part of the lab. Without a local environment, debugging your Spark script would be difficult and time consuming.

Developers usually build their applications locally before deploying them to testing and production environments. To test your PySpark scripts locally, you will need a Spark environment on your local machine. You will also need your data files (the exported CSVs) available on your local machine.

One of the easiest ways to set up such an environment is with Docker. Many Docker images built for this specific purpose can be found online. We will use the Jupyter PySpark Notebook image (jupyter/pyspark-notebook).

Docker must be installed and running on your local machine. Instructions for installing and running Docker on your specific operating system can be found online.

  1. Open a terminal (or PowerShell on Windows).
  2. Run:
docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
  3. It will take a few minutes to download and start the container. If you don't want to wait, leave your terminal open, continue with the workshop, and check back before the 5th Lab: Transformation. (See the volume-mount variant after this list for making your exported CSV files available inside the container.)
  4. When the container is ready, the terminal output will give you the instructions to open the notebook in your browser (it will print a URL containing a specific access token).
  5. Open the notebook in your browser, create a new notebook file, and you are ready to write and run your scripts.
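To make your exported CSV files visible inside the container, you can add a volume mount to the same docker run command. The sketch below is one possible variant: /path/to/your/csvs is a placeholder for the local directory holding your exports, and /home/jovyan/data assumes the image's default jovyan home directory as the mount point; adjust both to your setup.

docker run -it --rm -p 8888:8888 -v /path/to/your/csvs:/home/jovyan/data jupyter/pyspark-notebook

With this mount in place, the files appear inside the notebook environment under /home/jovyan/data.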

Here is example code to test your environment. Copy and paste it into your notebook and run it.

import pyspark

# Create a SparkContext that uses all local cores
sc = pyspark.SparkContext('local[*]')

# Do something simple to prove the environment works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
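Once the test above runs, you can try reading one of your exported CSV files with Spark. The sketch below assumes you started the container with the volume-mount variant shown earlier; the mount point /home/jovyan/data and the file name sales.csv are placeholders, so replace them with your own paths and file names.

from pyspark.sql import SparkSession

# Reuse (or create) a local Spark session
spark = SparkSession.builder.master('local[*]').getOrCreate()

# Hypothetical path: assumes your exported CSVs are mounted at /home/jovyan/data
df = spark.read.csv('/home/jovyan/data/sales.csv', header=True, inferSchema=True)

# Inspect the inferred schema and a few rows
df.printSchema()
df.show(5)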