## Exploring data with Python and Amazon S3 Select

by Manav Sehgal | on 3 MAY 2019

We hear from public institutions all the time that they are looking to extract more value from their data but struggle to capture, store, and analyze all the data generated by today's modern and digital sources. Data is growing exponentially, coming from new sources, is increasingly diverse, and needs to be securely accessed and analyzed by any number of applications and people. The size, complexity, and varied sources of the data mean that the same technology and approaches that worked in the past don't work anymore.

![Data Analytics Workflow](https://s3.amazonaws.com/cloudstory/notebooks-media/data-analytics-workflow.png)

A new approach is needed to extract insights and value from data. This approach needs to address the complexities of a multi-step data analytics workflow. This includes setting up durable, secure, and scalable storage for data, moving data from source to destination with speed and low cost, making data preparation for analytics easy, and making data available for different types of analytics including ad-hoc, real-time, and predictive.

### About AWS Open Data Analytics Notebooks

This notebook is the first in a series of AWS Open Data Analytics Notebooks following a step-by-step workflow for open data analytics on the cloud. We will present these notebooks with guidance on using the AWS Cloud programmatically, introducing relevant AWS services, explaining the code, reviewing the code outputs, evaluating alternative steps in our workflow, and ultimately designing a reusable API for an open data analytics workflow on the cloud. The first step in this workflow is sourcing the appropriate open dataset(s) for setting up our analytics pipeline.

You may want to run these notebooks using [Amazon SageMaker](https://aws.amazon.com/sagemaker/). Amazon SageMaker is a fully managed service that covers the entire machine learning workflow to label and prepare your data, choose an algorithm, train the model, tune and optimize it for deployment, make predictions, and take action.

### Why Open Datasets

When building analytical models it is best to start with tried and tested open datasets from the problem domain we are solving. This enables us to set up our data analytics workflow, determine the appropriate models and analytical methods, benchmark the results, and collaborate with the open data community, before we apply these to our own data. Such open datasets are available at the [Registry of Open Data on AWS](https://registry.opendata.aws/).

For this notebook let us start with a big open dataset. Big enough that we will struggle to open it in Excel on a laptop, since Excel has a limit of around one million rows. We will set up AWS services to source from a 270GB data source, filter and store more than 8 million rows or 100 million data points into a flat file, extract a schema from this file, transform this data, load it into analytics tools, run Structured Query Language (SQL) on this data, perform exploratory data analytics, train and build machine learning models, and visualize all 100 million data points using an interactive dashboard.

### Open Data Analytics Architecture

When we complete these workflow notebooks we will have built the following open data analytics architecture. This is a serverless architecture: it requires no software licenses to be procured, and you do not need to manage any virtual servers or operating systems. Billing for each of the services is pay-per-use. You can plug and play 160 AWS services within this stack based on your specific requirements.
![Open Data Analytics Architecture](https://s3.amazonaws.com/cloudstory/notebooks-media/open-data-analytics-architecture.png)

### Setup Notebook Environment

We begin by importing the required Python dependencies. We will use the ``Boto3`` Python SDK to work with AWS services. ``Pandas`` is a popular library providing high-performance, easy-to-use data structures and data analysis tools for Python. The ``IPython.display`` and ``Markdown`` dependencies are required for well-formatted output from notebook cells. We require ``botocore`` for exception handling.

```python
import boto3
import botocore
import pandas as pd
from IPython.display import display, Markdown
```

Before we start to access AWS services from an Amazon SageMaker notebook, we need to ensure that the [SageMaker Execution IAM role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) associated with the notebook instance has permissions to use the specific services, such as Amazon S3. We will set up an S3 client to call most of the S3 APIs. The S3 resource is required for specific calls such as object loading and copying.

```python
s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')
```

### Create Bucket

Now we come to an important part of the workflow: creating a Python function. Such functions are created throughout this notebook and others in the series. Think of these functions as reusable APIs for applying all that you learn from AWS Open Data Analytics Notebooks to your own projects, by simply importing these functions as a library.

Before we source the open dataset from the Registry, we need to define a destination for our data. We will store our open datasets within Amazon S3. S3 storage is organized into ``buckets`` with universally unique names. These bucket names form special URLs of the format ``s3://bucket-name`` which access the contents of the bucket depending on the security and access policies applied to the bucket and its contents. Buckets can further contain folders and files. ``Keys`` are the combination of folder path and file name, or just the file name if the object sits in the bucket root.

Our first function ``create_bucket`` does just that: it creates a bucket, or returns as-is if the bucket already exists in your account. If the bucket name is already used by someone other than you, the call raises an exception which is caught and reported with the ``Bucket could not be created`` message defined below.

AWS services can be accessed using the SDK as we are doing right now, using the browser-based console GUI, or using the Command Line Interface (CLI) in a terminal or shell. Benefits of using the SDK are reusability of commands across different use cases, handling exceptions with custom actions, and focusing on just the functionality needed by the solution.

```python
def create_bucket(bucket):
    import logging

    try:
        s3.create_bucket(Bucket=bucket)
    except botocore.exceptions.ClientError as e:
        logging.error(e)
        return 'Bucket ' + bucket + ' could not be created.'
    return 'Created or already exists ' + bucket + ' bucket.'
```

```python
create_bucket('open-data-analytics-taxi-trips')
```

    'Created or already exists open-data-analytics-taxi-trips bucket.'

### List Buckets

We can confirm that the new bucket has been created by listing the buckets within S3. The ``list_buckets`` function takes a ``match`` parameter which enables us to search among available buckets and only list the ones which contain the matching string in their name.
```python
def list_buckets(match=''):
    response = s3.list_buckets()
    if match:
        print(f'Existing buckets containing "{match}" string:')
    else:
        print('All existing buckets:')
    for bucket in response['Buckets']:
        if match:
            if match in bucket["Name"]:
                print(f' {bucket["Name"]}')
        else:
            # No match string given, so list every bucket name.
            print(f' {bucket["Name"]}')
```

```python
list_buckets(match='open')
```

    Existing buckets containing "open" string:
     open-analytics-assistant
     open-data-analytics
     open-data-analytics-taxi-trips
     open-data-on-cloud

### List Bucket Contents

Now that we have prepared our destination bucket we can shift our attention to the source for our dataset. The [Registry of Open Data on AWS](https://registry.opendata.aws) also happens to be a listing of S3-hosted open datasets, so all Registry-listed datasets can be accessed with the same API we use for S3 within our own AWS account. Next, all we need to do is search the Registry for the dataset we want to analyze.

For this notebook let us analyze the [New York Taxi Trips](https://registry.opendata.aws/nyc-tlc-trip-records-pds/) dataset. On the dataset description page we make a note of the Amazon Resource Name (ARN), which is ``arn:aws:s3:::nyc-tlc`` in this case. We are interested in the last part, which provides access to the open dataset using the ``s3://nyc-tlc`` URL.

Let us create a function to list the contents of this open dataset. We will iterate through the keys, or the path names, of the file objects stored within the bucket. The function allows us to match and list only the keys which contain the matching string. It also optionally allows us to list only those files which are less than a certain size in MB. This helps traverse a large open dataset which may contain gigabytes or even terabytes of data across hundreds if not thousands of files.

```python
def list_bucket_contents(bucket, match='', size_mb=0):
    bucket_resource = s3_resource.Bucket(bucket)
    total_size_gb = 0
    total_files = 0
    match_size_gb = 0
    match_files = 0
    for key in bucket_resource.objects.all():
        key_size_mb = key.size/1024/1024
        total_size_gb += key_size_mb
        total_files += 1
        list_check = False
        if not match:
            list_check = True
        elif match in key.key:
            list_check = True
        if list_check and not size_mb:
            match_files += 1
            match_size_gb += key_size_mb
            print(f'{key.key} ({key_size_mb:3.0f}MB)')
        elif list_check and key_size_mb <= size_mb:
            match_files += 1
            match_size_gb += key_size_mb
            print(f'{key.key} ({key_size_mb:3.0f}MB)')

    if match:
        print(f'Matched file size is {match_size_gb/1024:3.1f}GB with {match_files} files')

    print(f'Bucket {bucket} total size is {total_size_gb/1024:3.1f}GB with {total_files} files')
```

For this notebook we want to list the latest data files matching the year 2018, and we also want files which are less than 250MB in size, for reasons explained shortly. Note that the function quickly filters 12 of 251 files within a dataset that is 270GB in size.
```python
list_bucket_contents(bucket='nyc-tlc', match='2018', size_mb=250)
```

    trip data/green_tripdata_2018-01.csv ( 68MB)
    trip data/green_tripdata_2018-02.csv ( 66MB)
    trip data/green_tripdata_2018-03.csv ( 71MB)
    trip data/green_tripdata_2018-04.csv ( 68MB)
    trip data/green_tripdata_2018-05.csv ( 68MB)
    trip data/green_tripdata_2018-06.csv ( 63MB)
    trip data/green_tripdata_2018-07.csv ( 58MB)
    trip data/green_tripdata_2018-08.csv ( 57MB)
    trip data/green_tripdata_2018-09.csv ( 57MB)
    trip data/green_tripdata_2018-10.csv ( 61MB)
    trip data/green_tripdata_2018-11.csv ( 56MB)
    trip data/green_tripdata_2018-12.csv ( 59MB)
    Matched file size is 0.7GB with 12 files
    Bucket nyc-tlc total size is 273.3GB with 251 files

### Preview CSV Dataset

Now that we know which files we are interested in for our analytics, we want to write a function to quickly preview this big data at the source, without having to download the entire data file locally or open it in Excel. The ``preview_csv_dataset`` function takes bucket and key names as parameters identifying the file object to preview. It also takes an optional parameter determining the number of rows of records to return or display when previewing the dataset.

We use the Pandas DataFrame feature to read data from a web URL. As Pandas does not recognize S3 URLs, we first generate a presigned web URL which makes the source data securely available to our dataframe. The benefit of this approach is that we can quickly preview CSV-based open datasets from the Registry listings without having to store these datasets in our own S3 account or download them locally.

```python
def preview_csv_dataset(bucket, key, rows=10):
    data_source = {
        'Bucket': bucket,
        'Key': key
    }
    # Generate the URL to get Key from Bucket
    url = s3.generate_presigned_url(
        ClientMethod='get_object',
        Params=data_source
    )

    data = pd.read_csv(url, nrows=rows)
    return data
```

We can perform some manual analysis based on the preview dataset. We note that the dataset contains 19 columns. Data types are mixed among float, object, and int. We also note potentially categorical features including ``trip_type`` and ``payment_type`` among others. Continuous features include ``fare_amount`` and ``trip_distance`` among others. Data quality seems good: there is no missing data (nulls) in the preview, apart from the single column ``ehail_fee`` which contains only ``NaN`` values, and the values in the columns seem consistent at a glance. Of course there are formal methods to confirm all these observations; however, at this stage we are only interested in filtering and sourcing a dataset for further analytics. As you will appreciate, the ability to filter a big data repository containing hundreds of files and gigabytes of data, and to preview one of those files without having to download it entirely, is a really powerful feature for our open data analytics workflow.

```python
df = preview_csv_dataset(bucket='nyc-tlc', key='trip data/green_tripdata_2018-02.csv', rows=100)
```

```python
df.head()
```
|   | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type |
|---|----------|----------------------|-----------------------|--------------------|------------|--------------|--------------|-----------------|---------------|-------------|-------|---------|------------|--------------|-----------|-----------------------|--------------|--------------|-----------|
| 0 | 2 | 2018-02-01 00:39:38 | 2018-02-01 00:39:41 | N | 5 | 97 | 65 | 1 | 0.00 | 20.0 | 0.0 | 0.0 | 3.00 | 0.0 | NaN | 0.0 | 23.00 | 1 | 2 |
| 1 | 2 | 2018-02-01 00:58:28 | 2018-02-01 01:05:35 | N | 1 | 256 | 80 | 5 | 1.60 | 7.5 | 0.5 | 0.5 | 0.88 | 0.0 | NaN | 0.3 | 9.68 | 1 | 1 |
| 2 | 2 | 2018-02-01 00:56:05 | 2018-02-01 01:18:54 | N | 1 | 25 | 95 | 1 | 9.60 | 28.5 | 0.5 | 0.5 | 5.96 | 0.0 | NaN | 0.3 | 35.76 | 1 | 1 |
| 3 | 2 | 2018-02-01 00:12:40 | 2018-02-01 00:15:50 | N | 1 | 61 | 61 | 1 | 0.73 | 4.5 | 0.5 | 0.5 | 0.00 | 0.0 | NaN | 0.3 | 5.80 | 2 | 1 |
| 4 | 2 | 2018-02-01 00:45:18 | 2018-02-01 00:51:56 | N | 1 | 65 | 17 | 2 | 1.87 | 8.0 | 0.5 | 0.5 | 0.00 | 0.0 | NaN | 0.3 | 9.30 | 2 | 1 |
```python
df.shape
```

    (100, 19)

```python
df.info()
```

    RangeIndex: 100 entries, 0 to 99
    Data columns (total 19 columns):
    VendorID                 100 non-null int64
    lpep_pickup_datetime     100 non-null object
    lpep_dropoff_datetime    100 non-null object
    store_and_fwd_flag       100 non-null object
    RatecodeID               100 non-null int64
    PULocationID             100 non-null int64
    DOLocationID             100 non-null int64
    passenger_count          100 non-null int64
    trip_distance            100 non-null float64
    fare_amount              100 non-null float64
    extra                    100 non-null float64
    mta_tax                  100 non-null float64
    tip_amount               100 non-null float64
    tolls_amount             100 non-null float64
    ehail_fee                0 non-null float64
    improvement_surcharge    100 non-null float64
    total_amount             100 non-null float64
    payment_type             100 non-null int64
    trip_type                100 non-null int64
    dtypes: float64(9), int64(7), object(3)
    memory usage: 14.9+ KB

```python
df.describe()
```
|       | VendorID | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type |
|-------|----------|------------|--------------|--------------|-----------------|---------------|-------------|-------|---------|------------|--------------|-----------|-----------------------|--------------|--------------|-----------|
| count | 100.000000 | 100.000000 | 100.00000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 0.0 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| mean | 1.840000 | 1.200000 | 120.29000 | 138.890000 | 1.280000 | 2.946800 | 11.735000 | 0.465000 | 0.465000 | 1.054100 | 0.115200 | NaN | 0.279000 | 14.113300 | 1.540000 | 1.050000 |
| std | 0.368453 | 0.876172 | 73.43757 | 78.900256 | 0.792388 | 3.356363 | 9.716033 | 0.146594 | 0.146594 | 2.011155 | 0.810462 | NaN | 0.087957 | 11.151924 | 0.520683 | 0.219043 |
| min | 1.000000 | 1.000000 | 7.00000 | 7.000000 | 1.000000 | 0.000000 | -4.500000 | -0.500000 | -0.500000 | 0.000000 | 0.000000 | NaN | -0.300000 | -5.800000 | 1.000000 | 1.000000 |
| 25% | 2.000000 | 1.000000 | 69.00000 | 68.750000 | 1.000000 | 0.947500 | 6.000000 | 0.500000 | 0.500000 | 0.000000 | 0.000000 | NaN | 0.300000 | 8.195000 | 1.000000 | 1.000000 |
| 50% | 2.000000 | 1.000000 | 106.00000 | 135.000000 | 1.000000 | 1.885000 | 8.750000 | 0.500000 | 0.500000 | 0.000000 | 0.000000 | NaN | 0.300000 | 10.300000 | 2.000000 | 1.000000 |
| 75% | 2.000000 | 1.000000 | 168.75000 | 207.000000 | 1.000000 | 3.340000 | 13.875000 | 0.500000 | 0.500000 | 1.485000 | 0.000000 | NaN | 0.300000 | 16.922500 | 2.000000 | 1.000000 |
| max | 2.000000 | 5.000000 | 256.00000 | 265.000000 | 5.000000 | 20.660000 | 56.000000 | 0.500000 | 0.500000 | 10.560000 | 5.760000 | NaN | 0.300000 | 63.360000 | 3.000000 | 2.000000 |
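The ``df.info()`` and ``df.describe()`` outputs above confirm the column types and the all-null ``ehail_fee`` column. As a small additional check, the sketch below (not part of the original notebook, and assuming ``df`` still holds the 100-row preview) counts nulls per column and tabulates the candidate categorical features noted earlier. The helper name ``summarize_preview`` is hypothetical.

```python
# Hypothetical follow-up checks on the preview dataframe returned by preview_csv_dataset.
import pandas as pd

def summarize_preview(df: pd.DataFrame, categorical=('payment_type', 'trip_type')):
    # Count missing values in each column; only ehail_fee should be all-null in this preview.
    print('Null counts per column:')
    print(df.isnull().sum())

    # Tabulate the assumed categorical columns to see how many distinct codes they carry.
    for column in categorical:
        if column in df.columns:
            print(f'\nValue counts for {column}:')
            print(df[column].value_counts())

# Example usage after running preview_csv_dataset:
# summarize_preview(df)
```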
### Copy Among Buckets

We are ready to query our dataset, so we copy it over from the S3 bucket listed on the Registry to our own account. To perform this action we first check whether the file already exists in our destination bucket using the ``key_exists`` function. You may run this notebook over several iterations, and the data file may already have been copied over. If the file does not exist we copy it from one S3 bucket to another. You will notice that even for big datasets in GBs the copy operation from S3 bucket to bucket across accounts does not take much time.

```python
def key_exists(bucket, key):
    try:
        s3_resource.Object(bucket, key).load()
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == "404":
            # The key does not exist.
            return(False)
        else:
            # Something else has gone wrong.
            raise
    else:
        # The key does exist.
        return(True)

def copy_among_buckets(from_bucket, from_key, to_bucket, to_key):
    if not key_exists(to_bucket, to_key):
        s3_resource.meta.client.copy({'Bucket': from_bucket, 'Key': from_key},
                                     to_bucket, to_key)
        print(f'File {to_key} saved to S3 bucket {to_bucket}')
    else:
        print(f'File {to_key} already exists in S3 bucket {to_bucket}')
```

```python
copy_among_buckets(from_bucket='nyc-tlc', from_key='trip data/green_tripdata_2018-02.csv',
                   to_bucket='open-data-analytics-taxi-trips', to_key='few-trips/trips-2018-02.csv')
```

    File few-trips/trips-2018-02.csv already exists in S3 bucket open-data-analytics-taxi-trips

### Amazon S3 Select

The Structured Query Language (SQL) ``SELECT`` statement is generally associated with relational databases and is a powerful first tool for querying and analyzing a dataset. Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and with server-side encrypted objects. This means we do not need to deploy servers, set up databases, or import data into a database before querying our data. Simply copy datasets to S3 and query. S3 Select can query a file of up to 256MB uncompressed with up to 100 columns.

As we build the function to run S3 Select, we capture the results as a payload of events. This payload includes records containing the results and statistics about the query operation, which can be useful for calculating the cost of running the query (a rough cost sketch follows the first query results below).

```python
def s3_select(bucket, key, statement):
    import io

    s3_select_results = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=statement,
        ExpressionType='SQL',
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'JSON': {}},
    )

    df = pd.DataFrame()  # returned empty if the query matches no records
    for event in s3_select_results['Payload']:
        if 'Records' in event:
            df = pd.read_json(io.StringIO(event['Records']['Payload'].decode('utf-8')), lines=True)
        elif 'Stats' in event:
            print(f"Scanned: {int(event['Stats']['Details']['BytesScanned'])/1024/1024:5.2f}MB")
            print(f"Processed: {int(event['Stats']['Details']['BytesProcessed'])/1024/1024:5.2f}MB")
            print(f"Returned: {int(event['Stats']['Details']['BytesReturned'])/1024/1024:5.2f}MB")
    return (df)
```

This is the power of serverless at its best. We did not provision any servers, virtual or otherwise. We did not write more than a handful of lines of code, and even that can be avoided by reusing the ``s3_select`` function in future or by running this operation directly in the AWS Console. We did not set up any physical database engine. We simply copied a flat file and ran SQL to query the results.
The query did not even have to scan the entire file to send back structured results for our analysis.

```python
df = s3_select(bucket='open-data-analytics-taxi-trips', key='few-trips/trips-2018-02.csv',
               statement="""
                   select passenger_count, payment_type, trip_distance
                   from s3object s
                   where s.passenger_count = '4'
                   limit 100
               """)
```

    Scanned:  1.72MB
    Processed:  1.71MB
    Returned:  0.01MB

```python
df.head()
```
|   | passenger_count | payment_type | trip_distance |
|---|-----------------|--------------|---------------|
| 0 | 4 | 1 | 7.20 |
| 1 | 4 | 1 | 1.05 |
| 2 | 4 | 1 | 0.63 |
| 3 | 4 | 2 | 8.41 |
| 4 | 4 | 2 | 1.38 |
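The ``Stats`` event shown above can be turned into a rough cost estimate. The sketch below is an illustration only and is not part of the original notebook; the per-GB rates are assumptions, so check the current Amazon S3 pricing page for your region before relying on the numbers.

```python
# Hypothetical helper: estimate S3 Select cost from the Stats event details.
# The rates below are illustrative assumptions; consult the S3 pricing page
# for the actual per-GB charges for data scanned and data returned in your region.
ASSUMED_PRICE_PER_GB_SCANNED = 0.002    # USD per GB scanned (assumption)
ASSUMED_PRICE_PER_GB_RETURNED = 0.0007  # USD per GB returned (assumption)

def estimate_select_cost(bytes_scanned, bytes_returned):
    gb = 1024 ** 3
    return (bytes_scanned / gb) * ASSUMED_PRICE_PER_GB_SCANNED \
         + (bytes_returned / gb) * ASSUMED_PRICE_PER_GB_RETURNED

# Example with the approximate byte counts reported by the query above.
print(f"Estimated cost: ${estimate_select_cost(1.72 * 1024**2, 0.01 * 1024**2):.8f}")
```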
In case you do not need to manipulate or edit the dataset within your own S3 environment, you can also use S3 Select on the source dataset directly. This saves you the steps of copying the dataset over and also saves on storage costs within your account. In fact, if you use ``list_bucket_contents`` to match the S3 Select [size limits](https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html#selecting-content-from-objects-requirements-and-limits) and then use the ``s3_select`` function, it turns into a much faster and more flexible preview option than the ``preview_csv_dataset`` function described earlier (a small wrapper combining the two is sketched after the output below).

```python
df = s3_select(bucket='nyc-tlc', key='trip data/green_tripdata_2018-02.csv',
               statement="""
                   select passenger_count, payment_type, trip_distance
                   from s3object s
                   where s.passenger_count = '4'
                   limit 100
               """)
```

    Scanned:  1.72MB
    Processed:  1.71MB
    Returned:  0.01MB

```python
df.head()
```
|   | passenger_count | payment_type | trip_distance |
|---|-----------------|--------------|---------------|
| 0 | 4 | 1 | 7.20 |
| 1 | 4 | 1 | 1.05 |
| 2 | 4 | 1 | 0.63 |
| 3 | 4 | 2 | 8.41 |
| 4 | 4 | 2 | 1.38 |
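For convenience, the two steps can be combined into a direct preview over a source dataset. The wrapper below is a minimal sketch, not part of the original notebook; the function name ``preview_with_s3_select`` is hypothetical and it reuses the ``s3_resource`` and ``s3_select`` objects defined earlier in this notebook.

```python
# Hypothetical convenience wrapper: preview a source dataset directly with S3 Select
# instead of downloading it via a presigned URL. It picks the first key that matches
# the given string and fits under the assumed S3 Select size limit.
def preview_with_s3_select(bucket, match='', size_mb=250, rows=10):
    for key in s3_resource.Bucket(bucket).objects.all():
        key_size_mb = key.size / 1024 / 1024
        if match in key.key and key_size_mb <= size_mb:
            print(f'Previewing {key.key} ({key_size_mb:3.0f}MB)')
            # Return the first few rows as a dataframe using the s3_select function above.
            return s3_select(bucket, key.key,
                             f'select * from s3object s limit {rows}')
    print('No matching key found under the size limit.')

# Example usage:
# preview_with_s3_select(bucket='nyc-tlc', match='green_tripdata_2018-02', rows=10)
```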
Let us enter the big data leagues now. Let's list all the files in the Registry dataset which match the year 2018, with no constraints on file size this time. We are now reusing the function written earlier, which suggests how the API will be used once we complete the notebook series. This time the results include files going beyond 1.5GB in size. We pick a file which is more than 700MB in size for our analysis.

```python
list_bucket_contents(bucket='nyc-tlc', match='2018')
```

    trip data/fhv_tripdata_2018-01.csv (1337MB)
    trip data/fhv_tripdata_2018-02.csv (1307MB)
    trip data/fhv_tripdata_2018-03.csv (1486MB)
    trip data/fhv_tripdata_2018-04.csv (1425MB)
    trip data/fhv_tripdata_2018-05.csv (1459MB)
    trip data/fhv_tripdata_2018-06.csv (1430MB)
    trip data/fhv_tripdata_2018-07.csv (1463MB)
    trip data/fhv_tripdata_2018-08.csv (1498MB)
    trip data/fhv_tripdata_2018-09.csv (1501MB)
    trip data/fhv_tripdata_2018-10.csv (1578MB)
    trip data/fhv_tripdata_2018-11.csv (1550MB)
    trip data/fhv_tripdata_2018-12.csv (1616MB)
    trip data/green_tripdata_2018-01.csv ( 68MB)
    trip data/green_tripdata_2018-02.csv ( 66MB)
    trip data/green_tripdata_2018-03.csv ( 71MB)
    trip data/green_tripdata_2018-04.csv ( 68MB)
    trip data/green_tripdata_2018-05.csv ( 68MB)
    trip data/green_tripdata_2018-06.csv ( 63MB)
    trip data/green_tripdata_2018-07.csv ( 58MB)
    trip data/green_tripdata_2018-08.csv ( 57MB)
    trip data/green_tripdata_2018-09.csv ( 57MB)
    trip data/green_tripdata_2018-10.csv ( 61MB)
    trip data/green_tripdata_2018-11.csv ( 56MB)
    trip data/green_tripdata_2018-12.csv ( 59MB)
    trip data/yellow_tripdata_2018-01.csv (736MB)
    trip data/yellow_tripdata_2018-02.csv (714MB)
    trip data/yellow_tripdata_2018-03.csv (793MB)
    trip data/yellow_tripdata_2018-04.csv (783MB)
    trip data/yellow_tripdata_2018-05.csv (777MB)
    trip data/yellow_tripdata_2018-06.csv (734MB)
    trip data/yellow_tripdata_2018-07.csv (660MB)
    trip data/yellow_tripdata_2018-08.csv (660MB)
    trip data/yellow_tripdata_2018-09.csv (677MB)
    trip data/yellow_tripdata_2018-10.csv (743MB)
    trip data/yellow_tripdata_2018-11.csv (686MB)
    trip data/yellow_tripdata_2018-12.csv (688MB)
    Matched file size is 26.4GB with 36 files
    Bucket nyc-tlc total size is 273.3GB with 251 files

You will notice the preview function takes longer to return results for a larger file. It is still usable for preview purposes; however, this is an indication that we need more suitable tools for running our analytics this time. We cannot use S3 Select here due to the 256MB size limit.

```python
preview_csv_dataset(bucket='nyc-tlc', key='trip data/yellow_tripdata_2018-06.csv')
```
|   | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|----------|----------------------|-----------------------|-----------------|---------------|------------|--------------------|--------------|--------------|--------------|-------------|-------|---------|------------|--------------|-----------------------|--------------|
| 0 | 1 | 2018-06-01 00:15:40 | 2018-06-01 00:16:46 | 1 | 0.00 | 1 | N | 145 | 145 | 2 | 3.0 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 4.30 |
| 1 | 1 | 2018-06-01 00:04:18 | 2018-06-01 00:09:18 | 1 | 1.00 | 1 | N | 230 | 161 | 1 | 5.5 | 0.5 | 0.5 | 1.35 | 0 | 0.3 | 8.15 |
| 2 | 1 | 2018-06-01 00:14:39 | 2018-06-01 00:29:46 | 1 | 3.30 | 1 | N | 100 | 263 | 2 | 13.0 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 14.30 |
| 3 | 1 | 2018-06-01 00:51:25 | 2018-06-01 00:51:29 | 3 | 0.00 | 1 | N | 145 | 145 | 2 | 2.5 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 3.80 |
| 4 | 1 | 2018-06-01 00:55:06 | 2018-06-01 00:55:10 | 1 | 0.00 | 1 | N | 145 | 145 | 2 | 2.5 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 3.80 |
| 5 | 1 | 2018-06-01 00:09:00 | 2018-06-01 00:24:01 | 1 | 2.00 | 1 | N | 161 | 234 | 1 | 11.5 | 0.5 | 0.5 | 2.55 | 0 | 0.3 | 15.35 |
| 6 | 1 | 2018-06-01 00:02:33 | 2018-06-01 00:13:01 | 2 | 1.50 | 1 | N | 163 | 233 | 1 | 8.5 | 0.5 | 0.5 | 1.95 | 0 | 0.3 | 11.75 |
| 7 | 1 | 2018-06-01 00:13:23 | 2018-06-01 00:16:52 | 1 | 0.70 | 1 | N | 186 | 246 | 1 | 5.0 | 0.5 | 0.5 | 1.85 | 0 | 0.3 | 8.15 |
| 8 | 1 | 2018-06-01 00:24:29 | 2018-06-01 01:08:43 | 1 | 5.70 | 1 | N | 230 | 179 | 2 | 22.0 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 23.30 |
| 9 | 2 | 2018-06-01 00:17:01 | 2018-06-01 00:23:16 | 1 | 0.85 | 1 | N | 179 | 223 | 2 | 6.0 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 7.30 |
The copy operation for a larger file does not take that much longer, though. If interested, you can time the operation by adding the ``%%time`` magic function on the first line of the cell. Before timing the operation, do ensure that you delete the file if it already exists in your S3 bucket. You can do so using the AWS Management Console.

```python
copy_among_buckets(from_bucket='nyc-tlc', from_key='trip data/yellow_tripdata_2018-06.csv',
                   to_bucket='open-data-analytics-taxi-trips', to_key='many-trips/trips-2018-06.csv')
```

    File many-trips/trips-2018-06.csv already exists in S3 bucket open-data-analytics-taxi-trips

This time when we list our bucket contents we should see the smaller file used in the earlier S3 Select use case and the larger one we have just copied over.

```python
list_bucket_contents(bucket='open-data-analytics-taxi-trips', match='trips/trips')
```

    few-trips/trips-2018-02.csv ( 66MB)
    many-trips/trips-2018-06.csv (734MB)
    Matched file size is 0.8GB with 2 files
    Bucket open-data-analytics-taxi-trips total size is 0.9GB with 57 files

### Change Log

This section captures changes and updates to this notebook across releases.

#### Source S3 Select - Release 3 MAY 2019

This release adds an alternative workflow for directly querying source datasets on the Registry of Open Data on AWS. You may want to use this alternative workflow if you do not want to retain a copy of the source dataset within your local S3 bucket, saving on workflow steps and storage costs.

Known issue: running ``s3_select`` with a query limit of 1000 or more results in ``ValueError: Expected object or value``. Is this exception caused by the 1 MB limit on the maximum length of a record in the result? [TODO] Handle the exception gracefully (a possible approach is sketched below).

#### Launch - Release 30 APR 2019

This is the launch release, which builds the AWS Open Data Analytics API for exploring open datasets within your Amazon S3 account using S3 Select.
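One possible approach to the known issue above, under the assumption that the error stems from large results being streamed back across multiple ``Records`` event chunks that do not each contain complete JSON lines, is to accumulate the raw payload bytes first and parse them once at the end. The variant below is a sketch, not part of the original notebook, and reuses the ``s3`` client and ``pd`` import defined earlier.

```python
# Hypothetical variant of s3_select that buffers all Records payload chunks before
# parsing, so a record split across event boundaries no longer breaks pd.read_json.
import io

def s3_select_buffered(bucket, key, statement):
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=statement,
        ExpressionType='SQL',
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'JSON': {}},
    )
    buffer = b''
    for event in response['Payload']:
        if 'Records' in event:
            # Accumulate raw bytes instead of parsing each chunk separately.
            buffer += event['Records']['Payload']
        elif 'Stats' in event:
            print(f"Scanned: {int(event['Stats']['Details']['BytesScanned'])/1024/1024:5.2f}MB")
            print(f"Returned: {int(event['Stats']['Details']['BytesReturned'])/1024/1024:5.2f}MB")
    # Parse the complete set of JSON lines in one pass.
    return pd.read_json(io.StringIO(buffer.decode('utf-8')), lines=True)
```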