{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring data with Python and Amazon S3 Select\n",
    "by Manav Sehgal | on 3 MAY 2019\n",
    "\n",
    "We hear from public institutions all the time that they are looking to extract more value from their data but struggle to capture, store, and analyze all the data generated by today’s modern and digital sources. Data is growing exponentially, coming from new sources, is increasingly diverse, and needs to be securely accessed and analyzed by any number of applications and people. The size, complexity, and varied sources of the data means the same technology and approaches that worked in the past don’t work anymore.\n",
    "\n",
    "![Data Analytics Workflow](https://s3.amazonaws.com/cloudstory/notebooks-media/data-analytics-workflow.png)\n",
    "\n",
    "A new approach is needed to extract insights and value from data. This approach needs to address complexities of multi-step data analytics workflow. This includes setting up durable, secure, and scalable storage for data, moving data from source to destination with speed and low cost, ease of data preparation for analytics, and making data available for different types of analytics including ad-hoc, real-time, and predictive.\n",
    "\n",
    "### About AWS Open Data Analytics Notebooks\n",
    "This notebook is first in a series of AWS Open Data Analytics Notebooks following step-by-step workflow for open data analytics on cloud. We will present these notebooks with guidance on using AWS Cloud programmatically, introduce relevant AWS services, explaining the code, reviewing the code outputs, evaluating alternative steps in our workflow, and ultimately designing a reusable API for open data analytics workflow on cloud. The first step in this workflow is sourcing the appropriate open dataset(s) for setting up our analytics pipeline. You may want to run these notebooks using [Amazon SageMaker](https://aws.amazon.com/sagemaker/). Amazon SageMaker is a fully-managed service that covers the entire machine learning workflow to label and prepare your data, choose an algorithm, train the model, tune and optimize it for deployment, make predictions, and take action.\n",
    "\n",
    "### Why Open Datasets\n",
    "When building analytical models it is best to start with tried and tested open datasets from the problem domain we are solving. This enables us to setup our data analytics workflow, determine the appropriate models and analytical methods, benchmark the results, collaborate with open data community, before we apply these to our own data. Such open datasets are available at the [Registry of Open Data on AWS](https://registry.opendata.aws/).\n",
    "\n",
    "For this notebook let us start with a big open dataset. Big enough that we will struggle to open it in Excel on a laptop. Excel has around million rows limit. We will setup AWS services to source from a 270GB data source, filter and store more than 8 million rows or 100 million data points into a flat file, extract schema from this file, transform this data, load into analytics tools, run Structured Query Language (SQL) on this data, perform exploratory data analytics, train and build machine learning models, and visualize all 100 million data points using an interactive dashboard.\n",
    "\n",
    "### Open Data Analytics Architecture\n",
    "When we complete these workflow notebooks we will build the following open data analytics architecture. This is a serverless architecture. It requires no software licenses to be procured. You do not need to manage any virtual servers or operating systems. Billing of each of the services is pay-per-use. You can plug-and-play 160 AWS services within this stack based on your specific requirements.\n",
    "\n",
    "![Open Data Analytics Architecture](https://s3.amazonaws.com/cloudstory/notebooks-media/open-data-analytics-architecture.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup Notebook Environment\n",
    "We begin by importing the required Python dependencies. We will use ``Boto3`` Python SDK for using AWS services. The import ``Pandas`` is a popular library providing high-performance, easy-to-use data structures and data analysis tools for Python. The ``IPython.display`` and ``Markdown`` dependencies are required for well-formatted output from Notebook cells. We require ``botocore`` for exceptions management."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import botocore\n",
    "import pandas as pd\n",
    "from IPython.display import display, Markdown"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we start to access AWS services from an Amazon SageMaker notebook we need to ensure that the [SageMaker Execution IAM role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) associated with the notebook instance is allowed permissions to use the specific services like Amazon S3.\n",
    "\n",
    "We will setup an S3 client to call most of the S3 APIs. S3 resource is required for specific calls to object loading and copying features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3 = boto3.client('s3')\n",
    "s3_resource = boto3.resource('s3')"
   ]
  },
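  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are unsure which IAM identity the notebook is actually using, a quick check with AWS Security Token Service (STS) can help before attempting any S3 calls. The following is a minimal sketch using the standard ``get_caller_identity`` API; on a SageMaker notebook instance the printed ARN should correspond to the SageMaker execution role."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confirm which IAM identity (user or assumed role) this notebook is using\n",
    "sts = boto3.client('sts')\n",
    "identity = sts.get_caller_identity()\n",
    "print(f\"Account: {identity['Account']}\")\n",
    "print(f\"Caller ARN: {identity['Arn']}\")"
   ]
  },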
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Bucket\n",
    "Now we come to an important part of the workflow of creating a Python function. These functions are created all along this notebook and others in the series. Think of these functions as reusable APIs for applying all that you learn from AWS Open Data Analytics Notebooks into your own projects by simply importing these functions as a library.\n",
    "\n",
    "Before we source the open dataset from the Registry, we need to define a destination for our data. We will store our open datasets within Amazon S3. S3 storage in turn is organized in universally unique ``buckets``. These bucket names form special URLs of the format ``s3://bucket-name`` which access the contents of the buckets depending on security and access policies applied to the bucket and its contents. Buckets can further contain folders and files. ``Keys`` are combination of folder and file name path or just the file name in case it is within the bucket root.\n",
    "\n",
    "Our first function ``create_bucket`` will do just that, it will create a bucket or return as-is if the bucket already exists for your account. If the bucket name is already used by someone else other than you, then this generates an exception caught by the message ``Bucket <name> could not be created`` as defined.\n",
    "\n",
    "AWS services can be accessed using the SDK as we are using right now, using browser based console GUI, or using a Command Line Interface (CLI) over OS terminal or shell. Benefits of using the SDK are reusability of commands across different use cases, handling exceptions with custom actions, and focusing on just the functionality needed by the solution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_bucket(bucket):\n",
    "    import logging\n",
    "\n",
    "    try:\n",
    "        s3.create_bucket(Bucket=bucket)\n",
    "    except botocore.exceptions.ClientError as e:\n",
    "        logging.error(e)\n",
    "        return 'Bucket ' + bucket + ' could not be created.'\n",
    "    return 'Created or already exists ' + bucket + ' bucket.'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Created or already exists open-data-analytics-taxi-trips bucket.'"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "create_bucket('open-data-analytics-taxi-trips')"
   ]
  },
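  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One caveat worth noting: the ``create_bucket`` call above assumes the default ``us-east-1`` region. In any other region, S3 requires an explicit ``LocationConstraint``. The following is a hedged sketch of a region-aware variant; the ``us-west-2`` default is an arbitrary example, not a recommendation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_bucket_in_region(bucket, region='us-west-2'):\n",
    "    # Hypothetical variant: outside us-east-1 a LocationConstraint is required\n",
    "    import logging\n",
    "\n",
    "    try:\n",
    "        s3.create_bucket(Bucket=bucket,\n",
    "                         CreateBucketConfiguration={'LocationConstraint': region})\n",
    "    except botocore.exceptions.ClientError as e:\n",
    "        logging.error(e)\n",
    "        return 'Bucket ' + bucket + ' could not be created.'\n",
    "    return 'Created ' + bucket + ' bucket in ' + region + '.'"
   ]
  },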
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List Buckets\n",
    "We can confirm that the new bucket has been created by listing the buckets within S3. The ``list_buckets`` function takes a ``match`` parameter which enables us to search among available buckets and only list the ones which contain the matching string in their name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [],
   "source": [
    "def list_buckets(match=''):\n",
    "    response = s3.list_buckets()\n",
    "    if match:\n",
    "        print(f'Existing buckets containing \"{match}\" string:')\n",
    "    else:\n",
    "        print('All existing buckets:')\n",
    "    for bucket in response['Buckets']:\n",
    "        if match:\n",
    "            if match in bucket[\"Name\"]:\n",
    "                print(f'  {bucket[\"Name\"]}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Existing buckets containing \"open\" string:\n",
      "  open-analytics-assistant\n",
      "  open-data-analytics\n",
      "  open-data-analytics-taxi-trips\n",
      "  open-data-on-cloud\n"
     ]
    }
   ],
   "source": [
    "list_buckets(match='open')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List Bucket Contents\n",
    "Now that we have prepared our destination bucket we can shift our attention to the source for our dataset. The [Registry of Open Data on AWS](https://registry.opendata.aws) also happens to be a listing of S3 hosted open datasets. So all Registry listed datasets can be accessed by the same API we use for S3 within our own AWS account.\n",
    "\n",
    "Next, all we need to do is search the Registry for the dataset we want to analyze. For this notebook let us analyze the [New York Taxi Trips](https://registry.opendata.aws/nyc-tlc-trip-records-pds/) dataset. On the dataset description page we make a note of the Amazon Resource Name (ARN) which is ``arn:aws:s3:::nyc-tlc`` in this case. We are interested in the last part which provides access to the open datasets using the ``s3://nyc-tlc`` URL.\n",
    "\n",
    "Let us create a function to list contents of this open dataset. We will iterate through the keys or the path names of file objects stored within the bucket. The function allows us to match and return only keys which contain the matching string. It also optionally allows us to return only those files in the listing which are less than a certain size in MB. This helps traverse a large open dataset which may contain data in Gigabytes or even Terabytes with hundreds if not thousands of files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "def list_bucket_contents(bucket, match='', size_mb=0):\n",
    "    bucket_resource = s3_resource.Bucket(bucket)\n",
    "    total_size_gb = 0\n",
    "    total_files = 0\n",
    "    match_size_gb = 0\n",
    "    match_files = 0\n",
    "    for key in bucket_resource.objects.all():\n",
    "        key_size_mb = key.size/1024/1024\n",
    "        total_size_gb += key_size_mb\n",
    "        total_files += 1\n",
    "        list_check = False\n",
    "        if not match:\n",
    "            list_check = True\n",
    "        elif match in key.key:\n",
    "            list_check = True\n",
    "        if list_check and not size_mb:\n",
    "            match_files += 1\n",
    "            match_size_gb += key_size_mb\n",
    "            print(f'{key.key} ({key_size_mb:3.0f}MB)')\n",
    "        elif list_check and key_size_mb <= size_mb:\n",
    "            match_files += 1\n",
    "            match_size_gb += key_size_mb\n",
    "            print(f'{key.key} ({key_size_mb:3.0f}MB)')\n",
    "\n",
    "    if match:\n",
    "        print(f'Matched file size is {match_size_gb/1024:3.1f}GB with {match_files} files')            \n",
    "    \n",
    "    print(f'Bucket {bucket} total size is {total_size_gb/1024:3.1f}GB with {total_files} files')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this notebook we want to list the latest data files matching year 2018 and we also want files which are less than 250MB in size for reasons explained shortly. Note that the function results in quickly filtering 12 or 251 files within a dataset size of 270GB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "trip data/green_tripdata_2018-01.csv ( 68MB)\n",
      "trip data/green_tripdata_2018-02.csv ( 66MB)\n",
      "trip data/green_tripdata_2018-03.csv ( 71MB)\n",
      "trip data/green_tripdata_2018-04.csv ( 68MB)\n",
      "trip data/green_tripdata_2018-05.csv ( 68MB)\n",
      "trip data/green_tripdata_2018-06.csv ( 63MB)\n",
      "trip data/green_tripdata_2018-07.csv ( 58MB)\n",
      "trip data/green_tripdata_2018-08.csv ( 57MB)\n",
      "trip data/green_tripdata_2018-09.csv ( 57MB)\n",
      "trip data/green_tripdata_2018-10.csv ( 61MB)\n",
      "trip data/green_tripdata_2018-11.csv ( 56MB)\n",
      "trip data/green_tripdata_2018-12.csv ( 59MB)\n",
      "Matched file size is 0.7GB with 12 files\n",
      "Bucket nyc-tlc total size is 273.3GB with 251 files\n"
     ]
    }
   ],
   "source": [
    "list_bucket_contents(bucket='nyc-tlc', match='2018', size_mb=250)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preview CSV Dataset\n",
    "Now that we know which files we are interested in for our analytics, we want to write a function to quickly preview this big data from source without having to download the entire data file locally or open it in Excel. The ``preview_csv_dataset`` function takes bucket and key names as parameters for identifying the file object to preview. It also takes an optional parameter to determine the number of rows of records to return or display when previewing the dataset.\n",
    "\n",
    "We use Pandas DataFrame feature to read data from a web URL. As Pandas does not recognise S3 URLs, we first generate a presigned web URL which makes the source data available securely to our dataframe.\n",
    "\n",
    "Benefit of this approach is that we can quickly preview CSV based open datasets from the Registry listings without having to store these datasets into our own S3 account or download locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "def preview_csv_dataset(bucket, key, rows=10):\n",
    "    data_source = {\n",
    "            'Bucket': bucket,\n",
    "            'Key': key\n",
    "        }\n",
    "    # Generate the URL to get Key from Bucket\n",
    "    url = s3.generate_presigned_url(\n",
    "        ClientMethod = 'get_object',\n",
    "        Params = data_source\n",
    "    )\n",
    "\n",
    "    data = pd.read_csv(url, nrows=rows)\n",
    "    return data"
   ]
  },
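  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on the presigned URL: by default ``generate_presigned_url`` produces a URL that expires after one hour. The API also accepts an ``ExpiresIn`` parameter in seconds if you want tighter control over how long the object stays readable. A minimal sketch follows; the 300-second window is an arbitrary choice for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate a presigned URL valid for only 5 minutes (default is 3600 seconds)\n",
    "short_lived_url = s3.generate_presigned_url(\n",
    "    ClientMethod='get_object',\n",
    "    Params={'Bucket': 'nyc-tlc', 'Key': 'trip data/green_tripdata_2018-02.csv'},\n",
    "    ExpiresIn=300\n",
    ")\n",
    "preview = pd.read_csv(short_lived_url, nrows=5)"
   ]
  },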
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can perform some manual analysis based on the preview dataset. We note that the dataset contains 19 columns. Data types are mixed among float, object, and int. We also note potentially categorical features including ``trip_type`` and ``payment_type`` among others. There are continuous features include ``fare_amount`` and ``trip_distance`` among others. Data quality seems good as there are no missing data (Nulls) in preview, only one column ``ehail_fee`` which has ``NaN`` values and the values in the columns seem consistent at a glance. Of course there are formal methods to confirm all these observations however at this stage we are only interested in filtering and sourcing a dataset for further analytics.\n",
    "\n",
    "As you will appreciate the ability of filtering a big data repository containing hundreds of files and Gigabytes of data and previewing one of the files without having to download the entire file, is a really powerful feature for our open data analytics workflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = preview_csv_dataset(bucket='nyc-tlc', key='trip data/green_tripdata_2018-02.csv', rows=100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>VendorID</th>\n",
       "      <th>lpep_pickup_datetime</th>\n",
       "      <th>lpep_dropoff_datetime</th>\n",
       "      <th>store_and_fwd_flag</th>\n",
       "      <th>RatecodeID</th>\n",
       "      <th>PULocationID</th>\n",
       "      <th>DOLocationID</th>\n",
       "      <th>passenger_count</th>\n",
       "      <th>trip_distance</th>\n",
       "      <th>fare_amount</th>\n",
       "      <th>extra</th>\n",
       "      <th>mta_tax</th>\n",
       "      <th>tip_amount</th>\n",
       "      <th>tolls_amount</th>\n",
       "      <th>ehail_fee</th>\n",
       "      <th>improvement_surcharge</th>\n",
       "      <th>total_amount</th>\n",
       "      <th>payment_type</th>\n",
       "      <th>trip_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>2018-02-01 00:39:38</td>\n",
       "      <td>2018-02-01 00:39:41</td>\n",
       "      <td>N</td>\n",
       "      <td>5</td>\n",
       "      <td>97</td>\n",
       "      <td>65</td>\n",
       "      <td>1</td>\n",
       "      <td>0.00</td>\n",
       "      <td>20.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>23.00</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>2018-02-01 00:58:28</td>\n",
       "      <td>2018-02-01 01:05:35</td>\n",
       "      <td>N</td>\n",
       "      <td>1</td>\n",
       "      <td>256</td>\n",
       "      <td>80</td>\n",
       "      <td>5</td>\n",
       "      <td>1.60</td>\n",
       "      <td>7.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.88</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.3</td>\n",
       "      <td>9.68</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>2018-02-01 00:56:05</td>\n",
       "      <td>2018-02-01 01:18:54</td>\n",
       "      <td>N</td>\n",
       "      <td>1</td>\n",
       "      <td>25</td>\n",
       "      <td>95</td>\n",
       "      <td>1</td>\n",
       "      <td>9.60</td>\n",
       "      <td>28.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>5.96</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.3</td>\n",
       "      <td>35.76</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>2018-02-01 00:12:40</td>\n",
       "      <td>2018-02-01 00:15:50</td>\n",
       "      <td>N</td>\n",
       "      <td>1</td>\n",
       "      <td>61</td>\n",
       "      <td>61</td>\n",
       "      <td>1</td>\n",
       "      <td>0.73</td>\n",
       "      <td>4.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.3</td>\n",
       "      <td>5.80</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>2018-02-01 00:45:18</td>\n",
       "      <td>2018-02-01 00:51:56</td>\n",
       "      <td>N</td>\n",
       "      <td>1</td>\n",
       "      <td>65</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>1.87</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.3</td>\n",
       "      <td>9.30</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag  \\\n",
       "0         2  2018-02-01 00:39:38   2018-02-01 00:39:41                  N   \n",
       "1         2  2018-02-01 00:58:28   2018-02-01 01:05:35                  N   \n",
       "2         2  2018-02-01 00:56:05   2018-02-01 01:18:54                  N   \n",
       "3         2  2018-02-01 00:12:40   2018-02-01 00:15:50                  N   \n",
       "4         2  2018-02-01 00:45:18   2018-02-01 00:51:56                  N   \n",
       "\n",
       "   RatecodeID  PULocationID  DOLocationID  passenger_count  trip_distance  \\\n",
       "0           5            97            65                1           0.00   \n",
       "1           1           256            80                5           1.60   \n",
       "2           1            25            95                1           9.60   \n",
       "3           1            61            61                1           0.73   \n",
       "4           1            65            17                2           1.87   \n",
       "\n",
       "   fare_amount  extra  mta_tax  tip_amount  tolls_amount  ehail_fee  \\\n",
       "0         20.0    0.0      0.0        3.00           0.0        NaN   \n",
       "1          7.5    0.5      0.5        0.88           0.0        NaN   \n",
       "2         28.5    0.5      0.5        5.96           0.0        NaN   \n",
       "3          4.5    0.5      0.5        0.00           0.0        NaN   \n",
       "4          8.0    0.5      0.5        0.00           0.0        NaN   \n",
       "\n",
       "   improvement_surcharge  total_amount  payment_type  trip_type  \n",
       "0                    0.0         23.00             1          2  \n",
       "1                    0.3          9.68             1          1  \n",
       "2                    0.3         35.76             1          1  \n",
       "3                    0.3          5.80             2          1  \n",
       "4                    0.3          9.30             2          1  "
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(100, 19)"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 100 entries, 0 to 99\n",
      "Data columns (total 19 columns):\n",
      "VendorID                 100 non-null int64\n",
      "lpep_pickup_datetime     100 non-null object\n",
      "lpep_dropoff_datetime    100 non-null object\n",
      "store_and_fwd_flag       100 non-null object\n",
      "RatecodeID               100 non-null int64\n",
      "PULocationID             100 non-null int64\n",
      "DOLocationID             100 non-null int64\n",
      "passenger_count          100 non-null int64\n",
      "trip_distance            100 non-null float64\n",
      "fare_amount              100 non-null float64\n",
      "extra                    100 non-null float64\n",
      "mta_tax                  100 non-null float64\n",
      "tip_amount               100 non-null float64\n",
      "tolls_amount             100 non-null float64\n",
      "ehail_fee                0 non-null float64\n",
      "improvement_surcharge    100 non-null float64\n",
      "total_amount             100 non-null float64\n",
      "payment_type             100 non-null int64\n",
      "trip_type                100 non-null int64\n",
      "dtypes: float64(9), int64(7), object(3)\n",
      "memory usage: 14.9+ KB\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>VendorID</th>\n",
       "      <th>RatecodeID</th>\n",
       "      <th>PULocationID</th>\n",
       "      <th>DOLocationID</th>\n",
       "      <th>passenger_count</th>\n",
       "      <th>trip_distance</th>\n",
       "      <th>fare_amount</th>\n",
       "      <th>extra</th>\n",
       "      <th>mta_tax</th>\n",
       "      <th>tip_amount</th>\n",
       "      <th>tolls_amount</th>\n",
       "      <th>ehail_fee</th>\n",
       "      <th>improvement_surcharge</th>\n",
       "      <th>total_amount</th>\n",
       "      <th>payment_type</th>\n",
       "      <th>trip_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.00000</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.000000</td>\n",
       "      <td>100.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>1.840000</td>\n",
       "      <td>1.200000</td>\n",
       "      <td>120.29000</td>\n",
       "      <td>138.890000</td>\n",
       "      <td>1.280000</td>\n",
       "      <td>2.946800</td>\n",
       "      <td>11.735000</td>\n",
       "      <td>0.465000</td>\n",
       "      <td>0.465000</td>\n",
       "      <td>1.054100</td>\n",
       "      <td>0.115200</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.279000</td>\n",
       "      <td>14.113300</td>\n",
       "      <td>1.540000</td>\n",
       "      <td>1.050000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.368453</td>\n",
       "      <td>0.876172</td>\n",
       "      <td>73.43757</td>\n",
       "      <td>78.900256</td>\n",
       "      <td>0.792388</td>\n",
       "      <td>3.356363</td>\n",
       "      <td>9.716033</td>\n",
       "      <td>0.146594</td>\n",
       "      <td>0.146594</td>\n",
       "      <td>2.011155</td>\n",
       "      <td>0.810462</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.087957</td>\n",
       "      <td>11.151924</td>\n",
       "      <td>0.520683</td>\n",
       "      <td>0.219043</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>7.00000</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-4.500000</td>\n",
       "      <td>-0.500000</td>\n",
       "      <td>-0.500000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.300000</td>\n",
       "      <td>-5.800000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>69.00000</td>\n",
       "      <td>68.750000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.947500</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.300000</td>\n",
       "      <td>8.195000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>106.00000</td>\n",
       "      <td>135.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.885000</td>\n",
       "      <td>8.750000</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.300000</td>\n",
       "      <td>10.300000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>168.75000</td>\n",
       "      <td>207.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>3.340000</td>\n",
       "      <td>13.875000</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>1.485000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.300000</td>\n",
       "      <td>16.922500</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>256.00000</td>\n",
       "      <td>265.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>20.660000</td>\n",
       "      <td>56.000000</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>10.560000</td>\n",
       "      <td>5.760000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.300000</td>\n",
       "      <td>63.360000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         VendorID  RatecodeID  PULocationID  DOLocationID  passenger_count  \\\n",
       "count  100.000000  100.000000     100.00000    100.000000       100.000000   \n",
       "mean     1.840000    1.200000     120.29000    138.890000         1.280000   \n",
       "std      0.368453    0.876172      73.43757     78.900256         0.792388   \n",
       "min      1.000000    1.000000       7.00000      7.000000         1.000000   \n",
       "25%      2.000000    1.000000      69.00000     68.750000         1.000000   \n",
       "50%      2.000000    1.000000     106.00000    135.000000         1.000000   \n",
       "75%      2.000000    1.000000     168.75000    207.000000         1.000000   \n",
       "max      2.000000    5.000000     256.00000    265.000000         5.000000   \n",
       "\n",
       "       trip_distance  fare_amount       extra     mta_tax  tip_amount  \\\n",
       "count     100.000000   100.000000  100.000000  100.000000  100.000000   \n",
       "mean        2.946800    11.735000    0.465000    0.465000    1.054100   \n",
       "std         3.356363     9.716033    0.146594    0.146594    2.011155   \n",
       "min         0.000000    -4.500000   -0.500000   -0.500000    0.000000   \n",
       "25%         0.947500     6.000000    0.500000    0.500000    0.000000   \n",
       "50%         1.885000     8.750000    0.500000    0.500000    0.000000   \n",
       "75%         3.340000    13.875000    0.500000    0.500000    1.485000   \n",
       "max        20.660000    56.000000    0.500000    0.500000   10.560000   \n",
       "\n",
       "       tolls_amount  ehail_fee  improvement_surcharge  total_amount  \\\n",
       "count    100.000000        0.0             100.000000    100.000000   \n",
       "mean       0.115200        NaN               0.279000     14.113300   \n",
       "std        0.810462        NaN               0.087957     11.151924   \n",
       "min        0.000000        NaN              -0.300000     -5.800000   \n",
       "25%        0.000000        NaN               0.300000      8.195000   \n",
       "50%        0.000000        NaN               0.300000     10.300000   \n",
       "75%        0.000000        NaN               0.300000     16.922500   \n",
       "max        5.760000        NaN               0.300000     63.360000   \n",
       "\n",
       "       payment_type   trip_type  \n",
       "count    100.000000  100.000000  \n",
       "mean       1.540000    1.050000  \n",
       "std        0.520683    0.219043  \n",
       "min        1.000000    1.000000  \n",
       "25%        1.000000    1.000000  \n",
       "50%        2.000000    1.000000  \n",
       "75%        2.000000    1.000000  \n",
       "max        3.000000    2.000000  "
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.describe()"
   ]
  },
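  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As noted above, there are formal methods to confirm the observations we made from the preview. A minimal sketch using standard Pandas calls on the 100-row preview dataframe: missing-value counts should be zero everywhere except ``ehail_fee``, and low cardinality supports treating ``payment_type`` and ``trip_type`` as categorical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count missing values per column; only ehail_fee should be non-zero\n",
    "print(df.isnull().sum())\n",
    "\n",
    "# Low cardinality suggests categorical features\n",
    "print(df[['payment_type', 'trip_type']].nunique())"
   ]
  },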
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Copy Among Buckets\n",
    "We are ready to query our dataset so we copy it over from the S3 bucket listed on the Registry to our own account. To perform this action we first check if the file already exists in our destination bucket using the ``key_exists`` function. You would be running this notebook over several iterations and it may be a case that the data file is already copied over. If the file does not exist we copy from one S3 bucket to another. You will notice that even for big datasets in GBs the copy operation from S3 bucket to bucket across accounts does not take much time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [],
   "source": [
    "def key_exists(bucket, key):\n",
    "    try:\n",
    "        s3_resource.Object(bucket, key).load()\n",
    "    except botocore.exceptions.ClientError as e:\n",
    "        if e.response['Error']['Code'] == \"404\":\n",
    "            # The key does not exist.\n",
    "            return(False)\n",
    "        else:\n",
    "            # Something else has gone wrong.\n",
    "            raise\n",
    "    else:\n",
    "        # The key does exist.\n",
    "        return(True)\n",
    "\n",
    "def copy_among_buckets(from_bucket, from_key, to_bucket, to_key):\n",
    "    if not key_exists(to_bucket, to_key):\n",
    "        s3_resource.meta.client.copy({'Bucket': from_bucket, 'Key': from_key}, \n",
    "                                        to_bucket, to_key)        \n",
    "        print(f'File {to_key} saved to S3 bucket {to_bucket}')\n",
    "    else:\n",
    "        print(f'File {to_key} already exists in S3 bucket {to_bucket}') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File few-trips/trips-2018-02.csv already exists in S3 bucket open-data-analytics-taxi-trips\n"
     ]
    }
   ],
   "source": [
    "copy_among_buckets(from_bucket='nyc-tlc', from_key='trip data/green_tripdata_2018-02.csv',\n",
    "                      to_bucket='open-data-analytics-taxi-trips', to_key='few-trips/trips-2018-02.csv')"
   ]
  },
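  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because ``copy_among_buckets`` is a reusable function, sourcing the full year of green taxi data is a simple loop. A sketch, assuming the same destination bucket and a hypothetical ``many-trips/`` prefix for the copied keys:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy all twelve 2018 monthly green taxi files (many-trips/ prefix is hypothetical)\n",
    "for month in range(1, 13):\n",
    "    copy_among_buckets(from_bucket='nyc-tlc',\n",
    "                       from_key=f'trip data/green_tripdata_2018-{month:02d}.csv',\n",
    "                       to_bucket='open-data-analytics-taxi-trips',\n",
    "                       to_key=f'many-trips/trips-2018-{month:02d}.csv')"
   ]
  },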
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Amazon S3 Select\n",
    "Structured Query Language (SQL) SELECT statement is generally associated with relational databases and is a powerful first tool for querying and analyzing a dataset. Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and server-side encrypted objects. This means we do not need to deploy servers, setup databases, import data into our database, before querying our data. Simply copy datasets to S3 and Query. S3 Select can query a file which is up to 256MB uncompressed and 100 columns.\n",
    "\n",
    "As we build the function to run S3 Select we capture the results as a set of events payload. This payload includes records of results and statistics of query operation which can be useful to calculate the cost of running the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [],
   "source": [
    "def s3_select(bucket, key, statement):\n",
    "    import io\n",
    "\n",
    "    s3_select_results = s3.select_object_content(\n",
    "        Bucket=bucket,\n",
    "        Key=key,\n",
    "        Expression=statement,\n",
    "        ExpressionType='SQL',\n",
    "        InputSerialization={'CSV': {\"FileHeaderInfo\": \"Use\"}},\n",
    "        OutputSerialization={'JSON': {}},\n",
    "    )\n",
    "\n",
    "    for event in s3_select_results['Payload']:\n",
    "        if 'Records' in event:\n",
    "            df = pd.read_json(io.StringIO(event['Records']['Payload'].decode('utf-8')), lines=True)\n",
    "        elif 'Stats' in event:\n",
    "            print(f\"Scanned: {int(event['Stats']['Details']['BytesScanned'])/1024/1024:5.2f}MB\")            \n",
    "            print(f\"Processed: {int(event['Stats']['Details']['BytesProcessed'])/1024/1024:5.2f}MB\")\n",
    "            print(f\"Returned: {int(event['Stats']['Details']['BytesReturned'])/1024/1024:5.2f}MB\")\n",
    "    return (df)"
   ]
  },
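  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As mentioned earlier, S3 Select also works on GZIP- or BZIP2-compressed CSV and JSON objects. Querying a compressed object only requires declaring the ``CompressionType`` in the input serialization. A hedged sketch of the relevant call follows; the ``few-trips/trips.csv.gz`` key is hypothetical and only illustrates the parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Querying a GZIP-compressed CSV only changes the InputSerialization\n",
    "compressed_results = s3.select_object_content(\n",
    "    Bucket='open-data-analytics-taxi-trips',\n",
    "    Key='few-trips/trips.csv.gz',  # hypothetical compressed object\n",
    "    Expression=\"select * from s3object s limit 10\",\n",
    "    ExpressionType='SQL',\n",
    "    InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}, 'CompressionType': 'GZIP'},\n",
    "    OutputSerialization={'JSON': {}},\n",
    ")"
   ]
  },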
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the power of serverless at its best. We did not provision any servers, virtual or otherwise. We did not write more than a handful lines of code and even that could be avoided in future reuse of the ``s3_select`` function or when we run this operation directly on the AWS Console. We did not setup any physical database engine. We simply copied a flat file and ran SQL to query the results. The query did not even have to scan the entire file for sending back structured results for our analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Scanned:  1.72MB\n",
      "Processed:  1.71MB\n",
      "Returned:  0.01MB\n"
     ]
    }
   ],
   "source": [
    "df = s3_select(bucket='open-data-analytics-taxi-trips', key='few-trips/trips-2018-02.csv', \n",
    "          statement=\"\"\"\n",
    "          select passenger_count, payment_type, trip_distance \n",
    "          from s3object s \n",
    "          where s.passenger_count = '4' \n",
    "          limit 100\n",
    "          \"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>passenger_count</th>\n",
       "      <th>payment_type</th>\n",
       "      <th>trip_distance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>7.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1.05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0.63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>8.41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>1.38</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   passenger_count  payment_type  trip_distance\n",
       "0                4             1           7.20\n",
       "1                4             1           1.05\n",
       "2                4             1           0.63\n",
       "3                4             2           8.41\n",
       "4                4             2           1.38"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In case you do not need to manipulate or edit the dataset within your local S3 environment, you can also use S3 Select on the source dataset directly. This save you steps in copying the dataset over and also saves on storage costs within your account. In fact if you ``list_bucket_contents`` to match the S3 Select [size limits](https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html#selecting-content-from-objects-requirements-and-limits) and then use ``s3_select`` function, it turns into a much faster and more flexible preview option that the ``preview_csv_dataset`` function described earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Scanned:  1.72MB\n",
      "Processed:  1.71MB\n",
      "Returned:  0.01MB\n"
     ]
    }
   ],
   "source": [
    "df = s3_select(bucket='nyc-tlc', key='trip data/green_tripdata_2018-02.csv', \n",
    "          statement=\"\"\"\n",
    "          select passenger_count, payment_type, trip_distance \n",
    "          from s3object s \n",
    "          where s.passenger_count = '4' \n",
    "          limit 100\n",
    "          \"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>passenger_count</th>\n",
       "      <th>payment_type</th>\n",
       "      <th>trip_distance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>7.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1.05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0.63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>8.41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>1.38</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   passenger_count  payment_type  trip_distance\n",
       "0                4             1           7.20\n",
       "1                4             1           1.05\n",
       "2                4             1           0.63\n",
       "3                4             2           8.41\n",
       "4                4             2           1.38"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us enter the big data leagues now. Let's list all the files in the Registry dataset which match the year 2018 with no constraints on file size this time. We are now reusing the function written earlier suggesting how the API will get used when we complete the notebook series. This time the results have files going beyond 1.5GB in size. We pick a file which is +700MB in size for our analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "trip data/fhv_tripdata_2018-01.csv (1337MB)\n",
      "trip data/fhv_tripdata_2018-02.csv (1307MB)\n",
      "trip data/fhv_tripdata_2018-03.csv (1486MB)\n",
      "trip data/fhv_tripdata_2018-04.csv (1425MB)\n",
      "trip data/fhv_tripdata_2018-05.csv (1459MB)\n",
      "trip data/fhv_tripdata_2018-06.csv (1430MB)\n",
      "trip data/fhv_tripdata_2018-07.csv (1463MB)\n",
      "trip data/fhv_tripdata_2018-08.csv (1498MB)\n",
      "trip data/fhv_tripdata_2018-09.csv (1501MB)\n",
      "trip data/fhv_tripdata_2018-10.csv (1578MB)\n",
      "trip data/fhv_tripdata_2018-11.csv (1550MB)\n",
      "trip data/fhv_tripdata_2018-12.csv (1616MB)\n",
      "trip data/green_tripdata_2018-01.csv ( 68MB)\n",
      "trip data/green_tripdata_2018-02.csv ( 66MB)\n",
      "trip data/green_tripdata_2018-03.csv ( 71MB)\n",
      "trip data/green_tripdata_2018-04.csv ( 68MB)\n",
      "trip data/green_tripdata_2018-05.csv ( 68MB)\n",
      "trip data/green_tripdata_2018-06.csv ( 63MB)\n",
      "trip data/green_tripdata_2018-07.csv ( 58MB)\n",
      "trip data/green_tripdata_2018-08.csv ( 57MB)\n",
      "trip data/green_tripdata_2018-09.csv ( 57MB)\n",
      "trip data/green_tripdata_2018-10.csv ( 61MB)\n",
      "trip data/green_tripdata_2018-11.csv ( 56MB)\n",
      "trip data/green_tripdata_2018-12.csv ( 59MB)\n",
      "trip data/yellow_tripdata_2018-01.csv (736MB)\n",
      "trip data/yellow_tripdata_2018-02.csv (714MB)\n",
      "trip data/yellow_tripdata_2018-03.csv (793MB)\n",
      "trip data/yellow_tripdata_2018-04.csv (783MB)\n",
      "trip data/yellow_tripdata_2018-05.csv (777MB)\n",
      "trip data/yellow_tripdata_2018-06.csv (734MB)\n",
      "trip data/yellow_tripdata_2018-07.csv (660MB)\n",
      "trip data/yellow_tripdata_2018-08.csv (660MB)\n",
      "trip data/yellow_tripdata_2018-09.csv (677MB)\n",
      "trip data/yellow_tripdata_2018-10.csv (743MB)\n",
      "trip data/yellow_tripdata_2018-11.csv (686MB)\n",
      "trip data/yellow_tripdata_2018-12.csv (688MB)\n",
      "Matched file size is 26.4GB with 36 files\n",
      "Bucket nyc-tlc total size is 273.3GB with 251 files\n"
     ]
    }
   ],
   "source": [
    "list_bucket_contents(bucket='nyc-tlc', match='2018')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You will notice the preview function takes longer to return results for a larger file. It is still usable for preview purposes however this is an indication that we need more suitable tools for running our analytics this time. We cannot use S3 Select here due to the 256 MB size limit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>VendorID</th>\n",
       "      <th>tpep_pickup_datetime</th>\n",
       "      <th>tpep_dropoff_datetime</th>\n",
       "      <th>passenger_count</th>\n",
       "      <th>trip_distance</th>\n",
       "      <th>RatecodeID</th>\n",
       "      <th>store_and_fwd_flag</th>\n",
       "      <th>PULocationID</th>\n",
       "      <th>DOLocationID</th>\n",
       "      <th>payment_type</th>\n",
       "      <th>fare_amount</th>\n",
       "      <th>extra</th>\n",
       "      <th>mta_tax</th>\n",
       "      <th>tip_amount</th>\n",
       "      <th>tolls_amount</th>\n",
       "      <th>improvement_surcharge</th>\n",
       "      <th>total_amount</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>2018-06-01 00:15:40</td>\n",
       "      <td>2018-06-01 00:16:46</td>\n",
       "      <td>1</td>\n",
       "      <td>0.00</td>\n",
       "      <td>1</td>\n",
       "      <td>N</td>\n",
       "      <td>145</td>\n",
       "      <td>145</td>\n",
       "      <td>2</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>4.30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>2018-06-01 00:04:18</td>\n",
       "      <td>2018-06-01 00:09:18</td>\n",
       "      <td>1</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1</td>\n",
       "      <td>N</td>\n",
       "      <td>230</td>\n",
       "      <td>161</td>\n",
       "      <td>1</td>\n",
       "      <td>5.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>1.35</td>\n",
       "      <td>0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>8.15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>2018-06-01 00:14:39</td>\n",
       "      <td>2018-06-01 00:29:46</td>\n",
       "      <td>1</td>\n",
       "      <td>3.30</td>\n",
       "      <td>1</td>\n",
       "      <td>N</td>\n",
       "      <td>100</td>\n",
       "      <td>263</td>\n",
       "      <td>2</td>\n",
       "      <td>13.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>14.30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>2018-06-01 00:51:25</td>\n",
       "      <td>2018-06-01 00:51:29</td>\n",
       "      <td>3</td>\n",
       "      <td>0.00</td>\n",
       "      <td>1</td>\n",
       "      <td>N</td>\n",
       "      <td>145</td>\n",
       "      <td>145</td>\n",
       "      <td>2</td>\n",
       "      <td>2.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>3.80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>2018-06-01 00:55:06</td>\n",
       "      <td>2018-06-01 00:55:10</td>\n",
       "      <td>1</td>\n",
       "      <td>0.00</td>\n",
       "      <td>1</td>\n",
       "      <td>N</td>\n",
       "      <td>145</td>\n",
       "      <td>145</td>\n",
       "      <td>2</td>\n",
       "      <td>2.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>3.80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1</td>\n",
       "      <td>2018-06-01 00:09:00</td>\n",
       "      <td>2018-06-01 00:24:01</td>\n",
       "      <td>1</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1</td>\n",
       "      <td>N</td>\n",
       "      <td>161</td>\n",
       "      <td>234</td>\n",
       "      <td>1</td>\n",
       "      <td>11.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>2.55</td>\n",
       "      <td>0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>15.35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1</td>\n",
       "      <td>2018-06-01 00:02:33</td>\n",
       "      <td>2018-06-01 00:13:01</td>\n",
       "      <td>2</td>\n",
       "      <td>1.50</td>\n",
       "      <td>1</td>\n",
       "      <td>N</td>\n",
       "      <td>163</td>\n",
       "      <td>233</td>\n",
       "      <td>1</td>\n",
       "      <td>8.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>1.95</td>\n",
       "      <td>0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>11.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1</td>\n",
       "      <td>2018-06-01 00:13:23</td>\n",
       "      <td>2018-06-01 00:16:52</td>\n",
       "      <td>1</td>\n",
       "      <td>0.70</td>\n",
       "      <td>1</td>\n",
       "      <td>N</td>\n",
       "      <td>186</td>\n",
       "      <td>246</td>\n",
       "      <td>1</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>1.85</td>\n",
       "      <td>0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>8.15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1</td>\n",
       "      <td>2018-06-01 00:24:29</td>\n",
       "      <td>2018-06-01 01:08:43</td>\n",
       "      <td>1</td>\n",
       "      <td>5.70</td>\n",
       "      <td>1</td>\n",
       "      <td>N</td>\n",
       "      <td>230</td>\n",
       "      <td>179</td>\n",
       "      <td>2</td>\n",
       "      <td>22.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>23.30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2</td>\n",
       "      <td>2018-06-01 00:17:01</td>\n",
       "      <td>2018-06-01 00:23:16</td>\n",
       "      <td>1</td>\n",
       "      <td>0.85</td>\n",
       "      <td>1</td>\n",
       "      <td>N</td>\n",
       "      <td>179</td>\n",
       "      <td>223</td>\n",
       "      <td>2</td>\n",
       "      <td>6.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0</td>\n",
       "      <td>0.3</td>\n",
       "      <td>7.30</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \\\n",
       "0         1  2018-06-01 00:15:40   2018-06-01 00:16:46                1   \n",
       "1         1  2018-06-01 00:04:18   2018-06-01 00:09:18                1   \n",
       "2         1  2018-06-01 00:14:39   2018-06-01 00:29:46                1   \n",
       "3         1  2018-06-01 00:51:25   2018-06-01 00:51:29                3   \n",
       "4         1  2018-06-01 00:55:06   2018-06-01 00:55:10                1   \n",
       "5         1  2018-06-01 00:09:00   2018-06-01 00:24:01                1   \n",
       "6         1  2018-06-01 00:02:33   2018-06-01 00:13:01                2   \n",
       "7         1  2018-06-01 00:13:23   2018-06-01 00:16:52                1   \n",
       "8         1  2018-06-01 00:24:29   2018-06-01 01:08:43                1   \n",
       "9         2  2018-06-01 00:17:01   2018-06-01 00:23:16                1   \n",
       "\n",
       "   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \\\n",
       "0           0.00           1                  N           145           145   \n",
       "1           1.00           1                  N           230           161   \n",
       "2           3.30           1                  N           100           263   \n",
       "3           0.00           1                  N           145           145   \n",
       "4           0.00           1                  N           145           145   \n",
       "5           2.00           1                  N           161           234   \n",
       "6           1.50           1                  N           163           233   \n",
       "7           0.70           1                  N           186           246   \n",
       "8           5.70           1                  N           230           179   \n",
       "9           0.85           1                  N           179           223   \n",
       "\n",
       "   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \\\n",
       "0             2          3.0    0.5      0.5        0.00             0   \n",
       "1             1          5.5    0.5      0.5        1.35             0   \n",
       "2             2         13.0    0.5      0.5        0.00             0   \n",
       "3             2          2.5    0.5      0.5        0.00             0   \n",
       "4             2          2.5    0.5      0.5        0.00             0   \n",
       "5             1         11.5    0.5      0.5        2.55             0   \n",
       "6             1          8.5    0.5      0.5        1.95             0   \n",
       "7             1          5.0    0.5      0.5        1.85             0   \n",
       "8             2         22.0    0.5      0.5        0.00             0   \n",
       "9             2          6.0    0.5      0.5        0.00             0   \n",
       "\n",
       "   improvement_surcharge  total_amount  \n",
       "0                    0.3          4.30  \n",
       "1                    0.3          8.15  \n",
       "2                    0.3         14.30  \n",
       "3                    0.3          3.80  \n",
       "4                    0.3          3.80  \n",
       "5                    0.3         15.35  \n",
       "6                    0.3         11.75  \n",
       "7                    0.3          8.15  \n",
       "8                    0.3         23.30  \n",
       "9                    0.3          7.30  "
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preview_csv_dataset(bucket='nyc-tlc', key='trip data/yellow_tripdata_2018-06.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copy operation for a larger file does not take that much longer though. If interested you can time the operation by adding ``%%time`` magic function in the first line. Before timing the operation do ensure that you delete the file if it already exists in your S3 bucket. You can do so using the AWS Management Console."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File many-trips/trips-2018-06.csv already exists in S3 bucket open-data-analytics-taxi-trips\n"
     ]
    }
   ],
   "source": [
    "copy_among_buckets(from_bucket='nyc-tlc', from_key='trip data/yellow_tripdata_2018-06.csv',\n",
    "                      to_bucket='open-data-analytics-taxi-trips', to_key='many-trips/trips-2018-06.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This time when we list our bucket contents we should see the smaller file used in earlier S3 Select use case and the larger one we have copied over just now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "few-trips/trips-2018-02.csv ( 66MB)\n",
      "many-trips/trips-2018-06.csv (734MB)\n",
      "Matched file size is 0.8GB with 2 files\n",
      "Bucket open-data-analytics-taxi-trips total size is 0.9GB with 57 files\n"
     ]
    }
   ],
   "source": [
    "list_bucket_contents(bucket='open-data-analytics-taxi-trips', match='trips/trips')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Change Log\n",
    "\n",
    "This section captures changes and updates to this notebook across releases.\n",
    "\n",
    "#### Source S3 Select - Release 3 MAY 2019\n",
    "This release adds alternative workflow for directly querying source datasets on the Registry of Open Data on AWS. You may want to use this alternative workflow if you do not want to retain a copy of the source dataset within your local S3 bucket, saving on workflow steps and storage costs.\n",
    "\n",
    "Known issue: Running s3_select with query limit 1000 or more results in ValueError - Expected object or value. Is this exception because of the maximum length of a record in the result as 1 MB limit? [TODO] Handle exception gracefully.\n",
    "\n",
    "#### Launch - Release 30 APR 2019\n",
    "This is the launch release which builds the AWS Open Data Analytics API for exploring open datasets within your Amazon S3 account using S3 Select.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "#### Exploring data with Python and Amazon S3 Select\n",
    "by Manav Sehgal | on 3 MAY 2019\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}