## Running Spark queries against your Data Lake

We will now use Spark on EMR, with the Glue Data Catalog, to query the same yellow taxi data we used in Athena and Redshift. The first step is to create a Spark session against which we can run Spark SQL queries.

In [None]:
from pyspark.sql import SparkSession
import datetime

# Create the Spark session, or reuse one that is already running
spark = SparkSession \
 .builder \
 .appName("Spark Taxi demo") \
 .getOrCreate()
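
Before running the queries, it can help to confirm that the tables they reference are visible through the catalog. The following is a minimal sketch of such a check, assuming the session resolves the `taxi` database through the Glue Data Catalog. `missing_tables` and `REQUIRED_TABLES` are hypothetical names introduced here; the table names are taken from the queries in this notebook.

```python
# Tables referenced by the queries in this notebook
REQUIRED_TABLES = {
    "yellow",
    "yellow_parquet",
    "paymenttype",
    "ratecode",
    "taxi_zone_lookup",
}


def missing_tables(spark, database="taxi", required=REQUIRED_TABLES):
    """Return the required table names that the catalog does not list.

    Hypothetical helper: `spark` is expected to be a SparkSession, and
    spark.catalog.listTables(database) returns objects with a `name` field.
    """
    found = {table.name for table in spark.catalog.listTables(database)}
    return sorted(set(required) - found)


# Usage in the notebook (needs the live session created above):
# print(missing_tables(spark))
```

An empty list means every table the queries join against was found in the `taxi` database.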

### Run the same query with Spark on EMR to count the yellow taxi rides between January 1 and January 10, 2017, using the CSV-formatted data.

In [None]:
# Unoptimized query: reads the CSV-backed taxi.yellow table
currentDT1 = datetime.datetime.now()  # start of wall-clock timing

sql = (
    'SELECT count(yellow.vendorid) FROM taxi.yellow '
    'INNER JOIN taxi.paymenttype ON yellow.payment_type = paymenttype.id '
    'INNER JOIN taxi.ratecode ON yellow.ratecodeid = ratecode.id '
    'INNER JOIN taxi.taxi_zone_lookup AS pu_taxizone ON yellow.pulocationid = pu_taxizone.locationid '
    'INNER JOIN taxi.taxi_zone_lookup AS do_taxizone ON yellow.dolocationid = do_taxizone.locationid '
    'WHERE month(to_date(tpep_pickup_datetime)) = 1 '
    'AND year(to_date(tpep_pickup_datetime)) = 2017 '
    'AND dayofmonth(to_date(tpep_pickup_datetime)) BETWEEN 1 AND 10'
)

sqlDF = spark.sql(sql)

# show() is an action, so the query actually executes here
sqlDF.show()

currentDT2 = datetime.datetime.now()
print(currentDT2 - currentDT1)  # elapsed time for the query
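
The timing pattern used in these cells (record a start time, run the query, print the difference) can be wrapped in a small helper so the CSV and Parquet runs are measured the same way. This is only a sketch: `timed`, `label`, and `results` are hypothetical names, and the workload shown is a stand-in for `spark.sql(sql).show()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time spent inside the with-block.

    Hypothetical helper: if a dict is passed as `results`, the elapsed
    seconds are also stored in it under `label`.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} s")


# Stand-in workload; in the notebook this would be spark.sql(sql).show()
timings = {}
with timed("csv", timings):
    sum(range(100_000))
```

Collecting the timings in a dictionary makes it easy to compare the two runs side by side once both cells have finished.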

### Run the same query with Spark on EMR to count the yellow taxi rides between January 1 and January 10, 2017, using the Parquet-formatted data.

In [None]:
# Optimized query: reads the Parquet-backed taxi.yellow_parquet table
currentDT1 = datetime.datetime.now()  # start of wall-clock timing

sql = (
    'SELECT count(yellow.vendorid) FROM taxi.yellow_parquet AS yellow '
    'INNER JOIN taxi.paymenttype ON yellow.payment_type = paymenttype.id '
    'INNER JOIN taxi.ratecode ON yellow.ratecodeid = ratecode.id '
    'INNER JOIN taxi.taxi_zone_lookup AS pu_taxizone ON yellow.pulocationid = pu_taxizone.locationid '
    'INNER JOIN taxi.taxi_zone_lookup AS do_taxizone ON yellow.dolocationid = do_taxizone.locationid '
    'WHERE month(to_date(tpep_pickup_datetime)) = 1 '
    'AND year(to_date(tpep_pickup_datetime)) = 2017 '
    'AND dayofmonth(to_date(tpep_pickup_datetime)) BETWEEN 1 AND 10'
)

sqlDF = spark.sql(sql)
# show() is an action, so the query actually executes here
sqlDF.show()

currentDT2 = datetime.datetime.now()
print(currentDT2 - currentDT1)  # elapsed time for the query
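
The two queries differ only in the table they read, so the SQL text can be generated from one function. The sketch below also swaps the `month()`/`year()`/`dayofmonth()` filter for a plain date range on the pickup column, a form that may give Spark a better chance of pushing the filter down to the Parquet reader. It assumes `tpep_pickup_datetime` is a timestamp (or an ISO-formatted string) in both tables, which the notebook does not confirm; `build_count_query` is a hypothetical name.

```python
from datetime import date, timedelta


def build_count_query(table, start, end):
    """Return the ride-count SQL for pickups from `start` to `end`, inclusive.

    Hypothetical helper: `table` is a table name in the taxi database,
    for example "yellow" (CSV) or "yellow_parquet" (Parquet).
    """
    # A half-open range [start, end + 1 day) covers all of the last day
    upper = end + timedelta(days=1)
    return (
        f"SELECT count(yellow.vendorid) FROM taxi.{table} AS yellow "
        "INNER JOIN taxi.paymenttype ON yellow.payment_type = paymenttype.id "
        "INNER JOIN taxi.ratecode ON yellow.ratecodeid = ratecode.id "
        "INNER JOIN taxi.taxi_zone_lookup AS pu_taxizone "
        "ON yellow.pulocationid = pu_taxizone.locationid "
        "INNER JOIN taxi.taxi_zone_lookup AS do_taxizone "
        "ON yellow.dolocationid = do_taxizone.locationid "
        f"WHERE yellow.tpep_pickup_datetime >= '{start.isoformat()}' "
        f"AND yellow.tpep_pickup_datetime < '{upper.isoformat()}'"
    )


csv_sql = build_count_query("yellow", date(2017, 1, 1), date(2017, 1, 10))
parquet_sql = build_count_query("yellow_parquet", date(2017, 1, 1), date(2017, 1, 10))
print(parquet_sql)
```

Either string can be passed to `spark.sql(...)` in place of the hand-written queries above. If the pickup column is stored in a different format, the range comparison would need an explicit cast.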