## Exploring data with Python and Amazon S3 Select

by Manav Sehgal | on 3 MAY 2019

We hear from public institutions all the time that they are looking to extract more value from their data but struggle to capture, store, and analyze all the data generated by today's modern and digital sources. Data is growing exponentially, coming from new sources, is increasingly diverse, and needs to be securely accessed and analyzed by any number of applications and people. The size, complexity, and varied sources of the data mean that the same technology and approaches that worked in the past don't work anymore.

![Data Analytics Workflow](https://s3.amazonaws.com/cloudstory/notebooks-media/data-analytics-workflow.png)

A new approach is needed to extract insights and value from data. This approach needs to address the complexities of a multi-step data analytics workflow. This includes setting up durable, secure, and scalable storage for data, moving data from source to destination with speed and low cost, making data preparation for analytics easy, and making data available for different types of analytics including ad-hoc, real-time, and predictive.

### About AWS Open Data Analytics Notebooks

This notebook is the first in a series of AWS Open Data Analytics Notebooks following a step-by-step workflow for open data analytics on the cloud. We will present these notebooks with guidance on using the AWS Cloud programmatically, introducing relevant AWS services, explaining the code, reviewing the code outputs, evaluating alternative steps in our workflow, and ultimately designing a reusable API for an open data analytics workflow on the cloud. The first step in this workflow is sourcing the appropriate open dataset(s) for setting up our analytics pipeline.

You may want to run these notebooks using [Amazon SageMaker](https://aws.amazon.com/sagemaker/). Amazon SageMaker is a fully managed service that covers the entire machine learning workflow to label and prepare your data, choose an algorithm, train the model, tune and optimize it for deployment, make predictions, and take action.

### Why Open Datasets

When building analytical models it is best to start with tried and tested open datasets from the problem domain we are solving. This enables us to set up our data analytics workflow, determine the appropriate models and analytical methods, benchmark the results, and collaborate with the open data community, before we apply these to our own data. Such open datasets are available at the [Registry of Open Data on AWS](https://registry.opendata.aws/).

For this notebook let us start with a big open dataset. Big enough that we will struggle to open it in Excel on a laptop, since Excel has a limit of around one million rows. We will set up AWS services to source from a 270GB data source, filter and store more than 8 million rows or 100 million data points into a flat file, extract a schema from this file, transform this data, load it into analytics tools, run Structured Query Language (SQL) on this data, perform exploratory data analytics, train and build machine learning models, and visualize all 100 million data points using an interactive dashboard.

### Open Data Analytics Architecture

When we complete these workflow notebooks we will have built the following open data analytics architecture. This is a serverless architecture: it requires no software licenses to be procured, and you do not need to manage any virtual servers or operating systems. Billing for each of the services is pay-per-use. You can plug and play 160 AWS services within this stack based on your specific requirements.
![Open Data Analytics Architecture](https://s3.amazonaws.com/cloudstory/notebooks-media/open-data-analytics-architecture.png)

### Setup Notebook Environment

We begin by importing the required Python dependencies. We will use the ``Boto3`` Python SDK to work with AWS services. ``Pandas`` is a popular library providing high-performance, easy-to-use data structures and data analysis tools for Python. The ``IPython.display`` and ``Markdown`` dependencies are required for well-formatted output from notebook cells. We require ``botocore`` for exception handling.

```python
import boto3
import botocore
import pandas as pd
from IPython.display import display, Markdown
```

Before we start to access AWS services from an Amazon SageMaker notebook, we need to ensure that the [SageMaker Execution IAM role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) associated with the notebook instance has permissions to use the specific services, such as Amazon S3. We will set up an S3 client to call most of the S3 APIs. The S3 resource is required for specific calls such as object loading and copying.

```python
s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')
```

### Create Bucket

Now we come to an important part of the workflow: creating a Python function. Such functions are created throughout this notebook and others in the series. Think of these functions as reusable APIs for applying all that you learn from AWS Open Data Analytics Notebooks to your own projects, by simply importing these functions as a library.

Before we source the open dataset from the Registry, we need to define a destination for our data. We will store our open datasets within Amazon S3. S3 storage is organized into ``buckets`` with universally unique names. These bucket names form special URLs of the format ``s3://bucket-name`` which access the contents of the bucket depending on the security and access policies applied to the bucket and its contents. Buckets can further contain folders and files. ``Keys`` are the combination of folder path and file name, or just the file name if the object sits in the bucket root.

Our first function ``create_bucket`` does just that: it creates a bucket, or returns as-is if the bucket already exists in your account. If the bucket name is already used by someone other than you, the call raises an exception which is caught and reported with the ``Bucket could not be created`` message defined below.

AWS services can be accessed using the SDK as we are doing right now, using the browser-based console GUI, or using the Command Line Interface (CLI) in a terminal or shell. Benefits of using the SDK are reusability of commands across different use cases, handling exceptions with custom actions, and focusing on just the functionality needed by the solution.

```python
def create_bucket(bucket):
    import logging

    try:
        s3.create_bucket(Bucket=bucket)
    except botocore.exceptions.ClientError as e:
        logging.error(e)
        return 'Bucket ' + bucket + ' could not be created.'
    return 'Created or already exists ' + bucket + ' bucket.'
```

```python
create_bucket('open-data-analytics-taxi-trips')
```

    'Created or already exists open-data-analytics-taxi-trips bucket.'

### List Buckets

We can confirm that the new bucket has been created by listing the buckets within S3. The ``list_buckets`` function takes a ``match`` parameter which enables us to search among available buckets and only list the ones which contain the matching string in their name.
```python
def list_buckets(match=''):
    response = s3.list_buckets()
    if match:
        print(f'Existing buckets containing "{match}" string:')
    else:
        print('All existing buckets:')
    for bucket in response['Buckets']:
        if match:
            if match in bucket["Name"]:
                print(f' {bucket["Name"]}')
        else:
            # No match string given, so list every bucket name.
            print(f' {bucket["Name"]}')
```

```python
list_buckets(match='open')
```

    Existing buckets containing "open" string:
     open-analytics-assistant
     open-data-analytics
     open-data-analytics-taxi-trips
     open-data-on-cloud

### List Bucket Contents

Now that we have prepared our destination bucket we can shift our attention to the source for our dataset. The [Registry of Open Data on AWS](https://registry.opendata.aws) also happens to be a listing of S3-hosted open datasets, so all Registry-listed datasets can be accessed with the same API we use for S3 within our own AWS account. Next, all we need to do is search the Registry for the dataset we want to analyze.

For this notebook let us analyze the [New York Taxi Trips](https://registry.opendata.aws/nyc-tlc-trip-records-pds/) dataset. On the dataset description page we make a note of the Amazon Resource Name (ARN), which is ``arn:aws:s3:::nyc-tlc`` in this case. We are interested in the last part, which provides access to the open dataset using the ``s3://nyc-tlc`` URL.

Let us create a function to list the contents of this open dataset. We will iterate through the keys, or the path names, of the file objects stored within the bucket. The function allows us to match and list only the keys which contain the matching string. It also optionally allows us to list only those files which are less than a certain size in MB. This helps traverse a large open dataset which may contain gigabytes or even terabytes of data across hundreds if not thousands of files.

```python
def list_bucket_contents(bucket, match='', size_mb=0):
    bucket_resource = s3_resource.Bucket(bucket)
    total_size_gb = 0
    total_files = 0
    match_size_gb = 0
    match_files = 0
    for key in bucket_resource.objects.all():
        key_size_mb = key.size/1024/1024
        total_size_gb += key_size_mb
        total_files += 1
        list_check = False
        if not match:
            list_check = True
        elif match in key.key:
            list_check = True
        if list_check and not size_mb:
            match_files += 1
            match_size_gb += key_size_mb
            print(f'{key.key} ({key_size_mb:3.0f}MB)')
        elif list_check and key_size_mb <= size_mb:
            match_files += 1
            match_size_gb += key_size_mb
            print(f'{key.key} ({key_size_mb:3.0f}MB)')

    if match:
        print(f'Matched file size is {match_size_gb/1024:3.1f}GB with {match_files} files')

    print(f'Bucket {bucket} total size is {total_size_gb/1024:3.1f}GB with {total_files} files')
```

For this notebook we want to list the latest data files matching the year 2018, and we also want files which are less than 250MB in size, for reasons explained shortly. Note that the function quickly filters 12 of 251 files within a dataset that is 270GB in size.
```python
list_bucket_contents(bucket='nyc-tlc', match='2018', size_mb=250)
```

    trip data/green_tripdata_2018-01.csv ( 68MB)
    trip data/green_tripdata_2018-02.csv ( 66MB)
    trip data/green_tripdata_2018-03.csv ( 71MB)
    trip data/green_tripdata_2018-04.csv ( 68MB)
    trip data/green_tripdata_2018-05.csv ( 68MB)
    trip data/green_tripdata_2018-06.csv ( 63MB)
    trip data/green_tripdata_2018-07.csv ( 58MB)
    trip data/green_tripdata_2018-08.csv ( 57MB)
    trip data/green_tripdata_2018-09.csv ( 57MB)
    trip data/green_tripdata_2018-10.csv ( 61MB)
    trip data/green_tripdata_2018-11.csv ( 56MB)
    trip data/green_tripdata_2018-12.csv ( 59MB)
    Matched file size is 0.7GB with 12 files
    Bucket nyc-tlc total size is 273.3GB with 251 files

### Preview CSV Dataset

Now that we know which files we are interested in for our analytics, we want to write a function to quickly preview this big data at the source, without having to download the entire data file locally or open it in Excel. The ``preview_csv_dataset`` function takes bucket and key names as parameters identifying the file object to preview. It also takes an optional parameter determining the number of rows of records to return or display when previewing the dataset.

We use the Pandas DataFrame feature to read data from a web URL. As Pandas does not recognize S3 URLs, we first generate a presigned web URL which makes the source data securely available to our dataframe. The benefit of this approach is that we can quickly preview CSV-based open datasets from the Registry listings without having to store these datasets in our own S3 account or download them locally.

```python
def preview_csv_dataset(bucket, key, rows=10):
    data_source = {
        'Bucket': bucket,
        'Key': key
    }
    # Generate the URL to get Key from Bucket
    url = s3.generate_presigned_url(
        ClientMethod='get_object',
        Params=data_source
    )

    data = pd.read_csv(url, nrows=rows)
    return data
```

We can perform some manual analysis based on the preview dataset. We note that the dataset contains 19 columns. Data types are mixed among float, object, and int. We also note potentially categorical features including ``trip_type`` and ``payment_type`` among others. Continuous features include ``fare_amount`` and ``trip_distance`` among others. Data quality seems good: there is no missing data (nulls) in the preview, apart from the single column ``ehail_fee`` which contains only ``NaN`` values, and the values in the columns seem consistent at a glance. Of course there are formal methods to confirm all these observations; however, at this stage we are only interested in filtering and sourcing a dataset for further analytics. As you will appreciate, the ability to filter a big data repository containing hundreds of files and gigabytes of data, and to preview one of those files without having to download it entirely, is a really powerful feature for our open data analytics workflow.

```python
df = preview_csv_dataset(bucket='nyc-tlc', key='trip data/green_tripdata_2018-02.csv', rows=100)
```

```python
df.head()
```
|   | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type |
|---|----------|----------------------|-----------------------|--------------------|------------|--------------|--------------|-----------------|---------------|-------------|-------|---------|------------|--------------|-----------|-----------------------|--------------|--------------|-----------|
| 0 | 2 | 2018-02-01 00:39:38 | 2018-02-01 00:39:41 | N | 5 | 97 | 65 | 1 | 0.00 | 20.0 | 0.0 | 0.0 | 3.00 | 0.0 | NaN | 0.0 | 23.00 | 1 | 2 |
| 1 | 2 | 2018-02-01 00:58:28 | 2018-02-01 01:05:35 | N | 1 | 256 | 80 | 5 | 1.60 | 7.5 | 0.5 | 0.5 | 0.88 | 0.0 | NaN | 0.3 | 9.68 | 1 | 1 |
| 2 | 2 | 2018-02-01 00:56:05 | 2018-02-01 01:18:54 | N | 1 | 25 | 95 | 1 | 9.60 | 28.5 | 0.5 | 0.5 | 5.96 | 0.0 | NaN | 0.3 | 35.76 | 1 | 1 |
| 3 | 2 | 2018-02-01 00:12:40 | 2018-02-01 00:15:50 | N | 1 | 61 | 61 | 1 | 0.73 | 4.5 | 0.5 | 0.5 | 0.00 | 0.0 | NaN | 0.3 | 5.80 | 2 | 1 |
| 4 | 2 | 2018-02-01 00:45:18 | 2018-02-01 00:51:56 | N | 1 | 65 | 17 | 2 | 1.87 | 8.0 | 0.5 | 0.5 | 0.00 | 0.0 | NaN | 0.3 | 9.30 | 2 | 1 |
```python
df.shape
```

    (100, 19)

```python
df.info()
```

    RangeIndex: 100 entries, 0 to 99
    Data columns (total 19 columns):
    VendorID                 100 non-null int64
    lpep_pickup_datetime     100 non-null object
    lpep_dropoff_datetime    100 non-null object
    store_and_fwd_flag       100 non-null object
    RatecodeID               100 non-null int64
    PULocationID             100 non-null int64
    DOLocationID             100 non-null int64
    passenger_count          100 non-null int64
    trip_distance            100 non-null float64
    fare_amount              100 non-null float64
    extra                    100 non-null float64
    mta_tax                  100 non-null float64
    tip_amount               100 non-null float64
    tolls_amount             100 non-null float64
    ehail_fee                0 non-null float64
    improvement_surcharge    100 non-null float64
    total_amount             100 non-null float64
    payment_type             100 non-null int64
    trip_type                100 non-null int64
    dtypes: float64(9), int64(7), object(3)
    memory usage: 14.9+ KB

```python
df.describe()
```
|       | VendorID | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type |
|-------|----------|------------|--------------|--------------|-----------------|---------------|-------------|-------|---------|------------|--------------|-----------|-----------------------|--------------|--------------|-----------|
| count | 100.000000 | 100.000000 | 100.00000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 0.0 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| mean | 1.840000 | 1.200000 | 120.29000 | 138.890000 | 1.280000 | 2.946800 | 11.735000 | 0.465000 | 0.465000 | 1.054100 | 0.115200 | NaN | 0.279000 | 14.113300 | 1.540000 | 1.050000 |
| std | 0.368453 | 0.876172 | 73.43757 | 78.900256 | 0.792388 | 3.356363 | 9.716033 | 0.146594 | 0.146594 | 2.011155 | 0.810462 | NaN | 0.087957 | 11.151924 | 0.520683 | 0.219043 |
| min | 1.000000 | 1.000000 | 7.00000 | 7.000000 | 1.000000 | 0.000000 | -4.500000 | -0.500000 | -0.500000 | 0.000000 | 0.000000 | NaN | -0.300000 | -5.800000 | 1.000000 | 1.000000 |
| 25% | 2.000000 | 1.000000 | 69.00000 | 68.750000 | 1.000000 | 0.947500 | 6.000000 | 0.500000 | 0.500000 | 0.000000 | 0.000000 | NaN | 0.300000 | 8.195000 | 1.000000 | 1.000000 |
| 50% | 2.000000 | 1.000000 | 106.00000 | 135.000000 | 1.000000 | 1.885000 | 8.750000 | 0.500000 | 0.500000 | 0.000000 | 0.000000 | NaN | 0.300000 | 10.300000 | 2.000000 | 1.000000 |
| 75% | 2.000000 | 1.000000 | 168.75000 | 207.000000 | 1.000000 | 3.340000 | 13.875000 | 0.500000 | 0.500000 | 1.485000 | 0.000000 | NaN | 0.300000 | 16.922500 | 2.000000 | 1.000000 |
| max | 2.000000 | 5.000000 | 256.00000 | 265.000000 | 5.000000 | 20.660000 | 56.000000 | 0.500000 | 0.500000 | 10.560000 | 5.760000 | NaN | 0.300000 | 63.360000 | 3.000000 | 2.000000 |
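The ``df.info()`` and ``df.describe()`` outputs above confirm the column types and the all-null ``ehail_fee`` column. As a small additional check, the sketch below (not part of the original notebook, and assuming ``df`` still holds the 100-row preview) counts nulls per column and tabulates the candidate categorical features noted earlier. The helper name ``summarize_preview`` is hypothetical.

```python
# Hypothetical follow-up checks on the preview dataframe returned by preview_csv_dataset.
import pandas as pd

def summarize_preview(df: pd.DataFrame, categorical=('payment_type', 'trip_type')):
    # Count missing values in each column; only ehail_fee should be all-null in this preview.
    print('Null counts per column:')
    print(df.isnull().sum())

    # Tabulate the assumed categorical columns to see how many distinct codes they carry.
    for column in categorical:
        if column in df.columns:
            print(f'\nValue counts for {column}:')
            print(df[column].value_counts())

# Example usage after running preview_csv_dataset:
# summarize_preview(df)
```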
### Copy Among Buckets

We are ready to query our dataset, so we copy it over from the S3 bucket listed on the Registry to our own account. To perform this action we first check whether the file already exists in our destination bucket using the ``key_exists`` function. You may run this notebook over several iterations, and the data file may already have been copied over. If the file does not exist we copy it from one S3 bucket to another. You will notice that even for big datasets in GBs the copy operation from S3 bucket to bucket across accounts does not take much time.

```python
def key_exists(bucket, key):
    try:
        s3_resource.Object(bucket, key).load()
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == "404":
            # The key does not exist.
            return(False)
        else:
            # Something else has gone wrong.
            raise
    else:
        # The key does exist.
        return(True)

def copy_among_buckets(from_bucket, from_key, to_bucket, to_key):
    if not key_exists(to_bucket, to_key):
        s3_resource.meta.client.copy({'Bucket': from_bucket, 'Key': from_key},
                                     to_bucket, to_key)
        print(f'File {to_key} saved to S3 bucket {to_bucket}')
    else:
        print(f'File {to_key} already exists in S3 bucket {to_bucket}')
```

```python
copy_among_buckets(from_bucket='nyc-tlc', from_key='trip data/green_tripdata_2018-02.csv',
                   to_bucket='open-data-analytics-taxi-trips', to_key='few-trips/trips-2018-02.csv')
```

    File few-trips/trips-2018-02.csv already exists in S3 bucket open-data-analytics-taxi-trips

### Amazon S3 Select

The Structured Query Language (SQL) ``SELECT`` statement is generally associated with relational databases and is a powerful first tool for querying and analyzing a dataset. Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and with server-side encrypted objects. This means we do not need to deploy servers, set up databases, or import data into a database before querying our data. Simply copy datasets to S3 and query. S3 Select can query a file of up to 256MB uncompressed with up to 100 columns.

As we build the function to run S3 Select, we capture the results as a payload of events. This payload includes records containing the results and statistics about the query operation, which can be useful for calculating the cost of running the query (a rough cost sketch follows the first query results below).

```python
def s3_select(bucket, key, statement):
    import io

    s3_select_results = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=statement,
        ExpressionType='SQL',
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'JSON': {}},
    )

    df = pd.DataFrame()  # returned empty if the query matches no records
    for event in s3_select_results['Payload']:
        if 'Records' in event:
            df = pd.read_json(io.StringIO(event['Records']['Payload'].decode('utf-8')), lines=True)
        elif 'Stats' in event:
            print(f"Scanned: {int(event['Stats']['Details']['BytesScanned'])/1024/1024:5.2f}MB")
            print(f"Processed: {int(event['Stats']['Details']['BytesProcessed'])/1024/1024:5.2f}MB")
            print(f"Returned: {int(event['Stats']['Details']['BytesReturned'])/1024/1024:5.2f}MB")
    return (df)
```

This is the power of serverless at its best. We did not provision any servers, virtual or otherwise. We did not write more than a handful of lines of code, and even that can be avoided by reusing the ``s3_select`` function in future or by running this operation directly in the AWS Console. We did not set up any physical database engine. We simply copied a flat file and ran SQL to query the results.
The query did not even have to scan the entire file to send back structured results for our analysis.

```python
df = s3_select(bucket='open-data-analytics-taxi-trips', key='few-trips/trips-2018-02.csv',
               statement="""
                   select passenger_count, payment_type, trip_distance
                   from s3object s
                   where s.passenger_count = '4'
                   limit 100
               """)
```

    Scanned:  1.72MB
    Processed:  1.71MB
    Returned:  0.01MB

```python
df.head()
```
|   | passenger_count | payment_type | trip_distance |
|---|-----------------|--------------|---------------|
| 0 | 4 | 1 | 7.20 |
| 1 | 4 | 1 | 1.05 |
| 2 | 4 | 1 | 0.63 |
| 3 | 4 | 2 | 8.41 |
| 4 | 4 | 2 | 1.38 |
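The ``Stats`` event shown above can be turned into a rough cost estimate. The sketch below is an illustration only and is not part of the original notebook; the per-GB rates are assumptions, so check the current Amazon S3 pricing page for your region before relying on the numbers.

```python
# Hypothetical helper: estimate S3 Select cost from the Stats event details.
# The rates below are illustrative assumptions; consult the S3 pricing page
# for the actual per-GB charges for data scanned and data returned in your region.
ASSUMED_PRICE_PER_GB_SCANNED = 0.002    # USD per GB scanned (assumption)
ASSUMED_PRICE_PER_GB_RETURNED = 0.0007  # USD per GB returned (assumption)

def estimate_select_cost(bytes_scanned, bytes_returned):
    gb = 1024 ** 3
    return (bytes_scanned / gb) * ASSUMED_PRICE_PER_GB_SCANNED \
         + (bytes_returned / gb) * ASSUMED_PRICE_PER_GB_RETURNED

# Example with the approximate byte counts reported by the query above.
print(f"Estimated cost: ${estimate_select_cost(1.72 * 1024**2, 0.01 * 1024**2):.8f}")
```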
In case you do not need to manipulate or edit the dataset within your own S3 environment, you can also use S3 Select on the source dataset directly. This saves you the steps of copying the dataset over and also saves on storage costs within your account. In fact, if you use ``list_bucket_contents`` to match the S3 Select [size limits](https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html#selecting-content-from-objects-requirements-and-limits) and then use the ``s3_select`` function, it turns into a much faster and more flexible preview option than the ``preview_csv_dataset`` function described earlier (a small wrapper combining the two is sketched after the output below).

```python
df = s3_select(bucket='nyc-tlc', key='trip data/green_tripdata_2018-02.csv',
               statement="""
                   select passenger_count, payment_type, trip_distance
                   from s3object s
                   where s.passenger_count = '4'
                   limit 100
               """)
```

    Scanned:  1.72MB
    Processed:  1.71MB
    Returned:  0.01MB

```python
df.head()
```
|   | passenger_count | payment_type | trip_distance |
|---|-----------------|--------------|---------------|
| 0 | 4 | 1 | 7.20 |
| 1 | 4 | 1 | 1.05 |
| 2 | 4 | 1 | 0.63 |
| 3 | 4 | 2 | 8.41 |
| 4 | 4 | 2 | 1.38 |
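For convenience, the two steps can be combined into a direct preview over a source dataset. The wrapper below is a minimal sketch, not part of the original notebook; the function name ``preview_with_s3_select`` is hypothetical and it reuses the ``s3_resource`` and ``s3_select`` objects defined earlier in this notebook.

```python
# Hypothetical convenience wrapper: preview a source dataset directly with S3 Select
# instead of downloading it via a presigned URL. It picks the first key that matches
# the given string and fits under the assumed S3 Select size limit.
def preview_with_s3_select(bucket, match='', size_mb=250, rows=10):
    for key in s3_resource.Bucket(bucket).objects.all():
        key_size_mb = key.size / 1024 / 1024
        if match in key.key and key_size_mb <= size_mb:
            print(f'Previewing {key.key} ({key_size_mb:3.0f}MB)')
            # Return the first few rows as a dataframe using the s3_select function above.
            return s3_select(bucket, key.key,
                             f'select * from s3object s limit {rows}')
    print('No matching key found under the size limit.')

# Example usage:
# preview_with_s3_select(bucket='nyc-tlc', match='green_tripdata_2018-02', rows=10)
```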
Let us enter the big data leagues now. Let's list all the files in the Registry dataset which match the year 2018, with no constraints on file size this time. We are now reusing the function written earlier, which suggests how the API will be used once we complete the notebook series. This time the results include files going beyond 1.5GB in size. We pick a file which is more than 700MB in size for our analysis.

```python
list_bucket_contents(bucket='nyc-tlc', match='2018')
```

    trip data/fhv_tripdata_2018-01.csv (1337MB)
    trip data/fhv_tripdata_2018-02.csv (1307MB)
    trip data/fhv_tripdata_2018-03.csv (1486MB)
    trip data/fhv_tripdata_2018-04.csv (1425MB)
    trip data/fhv_tripdata_2018-05.csv (1459MB)
    trip data/fhv_tripdata_2018-06.csv (1430MB)
    trip data/fhv_tripdata_2018-07.csv (1463MB)
    trip data/fhv_tripdata_2018-08.csv (1498MB)
    trip data/fhv_tripdata_2018-09.csv (1501MB)
    trip data/fhv_tripdata_2018-10.csv (1578MB)
    trip data/fhv_tripdata_2018-11.csv (1550MB)
    trip data/fhv_tripdata_2018-12.csv (1616MB)
    trip data/green_tripdata_2018-01.csv ( 68MB)
    trip data/green_tripdata_2018-02.csv ( 66MB)
    trip data/green_tripdata_2018-03.csv ( 71MB)
    trip data/green_tripdata_2018-04.csv ( 68MB)
    trip data/green_tripdata_2018-05.csv ( 68MB)
    trip data/green_tripdata_2018-06.csv ( 63MB)
    trip data/green_tripdata_2018-07.csv ( 58MB)
    trip data/green_tripdata_2018-08.csv ( 57MB)
    trip data/green_tripdata_2018-09.csv ( 57MB)
    trip data/green_tripdata_2018-10.csv ( 61MB)
    trip data/green_tripdata_2018-11.csv ( 56MB)
    trip data/green_tripdata_2018-12.csv ( 59MB)
    trip data/yellow_tripdata_2018-01.csv (736MB)
    trip data/yellow_tripdata_2018-02.csv (714MB)
    trip data/yellow_tripdata_2018-03.csv (793MB)
    trip data/yellow_tripdata_2018-04.csv (783MB)
    trip data/yellow_tripdata_2018-05.csv (777MB)
    trip data/yellow_tripdata_2018-06.csv (734MB)
    trip data/yellow_tripdata_2018-07.csv (660MB)
    trip data/yellow_tripdata_2018-08.csv (660MB)
    trip data/yellow_tripdata_2018-09.csv (677MB)
    trip data/yellow_tripdata_2018-10.csv (743MB)
    trip data/yellow_tripdata_2018-11.csv (686MB)
    trip data/yellow_tripdata_2018-12.csv (688MB)
    Matched file size is 26.4GB with 36 files
    Bucket nyc-tlc total size is 273.3GB with 251 files

You will notice the preview function takes longer to return results for a larger file. It is still usable for preview purposes; however, this is an indication that we need more suitable tools for running our analytics this time. We cannot use S3 Select here due to the 256MB size limit.

```python
preview_csv_dataset(bucket='nyc-tlc', key='trip data/yellow_tripdata_2018-06.csv')
```
|   | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|----------|----------------------|-----------------------|-----------------|---------------|------------|--------------------|--------------|--------------|--------------|-------------|-------|---------|------------|--------------|-----------------------|--------------|
| 0 | 1 | 2018-06-01 00:15:40 | 2018-06-01 00:16:46 | 1 | 0.00 | 1 | N | 145 | 145 | 2 | 3.0 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 4.30 |
| 1 | 1 | 2018-06-01 00:04:18 | 2018-06-01 00:09:18 | 1 | 1.00 | 1 | N | 230 | 161 | 1 | 5.5 | 0.5 | 0.5 | 1.35 | 0 | 0.3 | 8.15 |
| 2 | 1 | 2018-06-01 00:14:39 | 2018-06-01 00:29:46 | 1 | 3.30 | 1 | N | 100 | 263 | 2 | 13.0 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 14.30 |
| 3 | 1 | 2018-06-01 00:51:25 | 2018-06-01 00:51:29 | 3 | 0.00 | 1 | N | 145 | 145 | 2 | 2.5 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 3.80 |
| 4 | 1 | 2018-06-01 00:55:06 | 2018-06-01 00:55:10 | 1 | 0.00 | 1 | N | 145 | 145 | 2 | 2.5 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 3.80 |
| 5 | 1 | 2018-06-01 00:09:00 | 2018-06-01 00:24:01 | 1 | 2.00 | 1 | N | 161 | 234 | 1 | 11.5 | 0.5 | 0.5 | 2.55 | 0 | 0.3 | 15.35 |
| 6 | 1 | 2018-06-01 00:02:33 | 2018-06-01 00:13:01 | 2 | 1.50 | 1 | N | 163 | 233 | 1 | 8.5 | 0.5 | 0.5 | 1.95 | 0 | 0.3 | 11.75 |
| 7 | 1 | 2018-06-01 00:13:23 | 2018-06-01 00:16:52 | 1 | 0.70 | 1 | N | 186 | 246 | 1 | 5.0 | 0.5 | 0.5 | 1.85 | 0 | 0.3 | 8.15 |
| 8 | 1 | 2018-06-01 00:24:29 | 2018-06-01 01:08:43 | 1 | 5.70 | 1 | N | 230 | 179 | 2 | 22.0 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 23.30 |
| 9 | 2 | 2018-06-01 00:17:01 | 2018-06-01 00:23:16 | 1 | 0.85 | 1 | N | 179 | 223 | 2 | 6.0 | 0.5 | 0.5 | 0.00 | 0 | 0.3 | 7.30 |
The copy operation for a larger file does not take that much longer, though. If interested, you can time the operation by adding the ``%%time`` magic function on the first line of the cell. Before timing the operation, do ensure that you delete the file if it already exists in your S3 bucket. You can do so using the AWS Management Console.

```python
copy_among_buckets(from_bucket='nyc-tlc', from_key='trip data/yellow_tripdata_2018-06.csv',
                   to_bucket='open-data-analytics-taxi-trips', to_key='many-trips/trips-2018-06.csv')
```

    File many-trips/trips-2018-06.csv already exists in S3 bucket open-data-analytics-taxi-trips

This time when we list our bucket contents we should see the smaller file used in the earlier S3 Select use case and the larger one we have just copied over.

```python
list_bucket_contents(bucket='open-data-analytics-taxi-trips', match='trips/trips')
```

    few-trips/trips-2018-02.csv ( 66MB)
    many-trips/trips-2018-06.csv (734MB)
    Matched file size is 0.8GB with 2 files
    Bucket open-data-analytics-taxi-trips total size is 0.9GB with 57 files

### Change Log

This section captures changes and updates to this notebook across releases.

#### Source S3 Select - Release 3 MAY 2019

This release adds an alternative workflow for directly querying source datasets on the Registry of Open Data on AWS. You may want to use this alternative workflow if you do not want to retain a copy of the source dataset within your local S3 bucket, saving on workflow steps and storage costs.

Known issue: running ``s3_select`` with a query limit of 1000 or more results in ``ValueError: Expected object or value``. Is this exception caused by the 1 MB limit on the maximum length of a record in the result? [TODO] Handle the exception gracefully (a possible approach is sketched below).

#### Launch - Release 30 APR 2019

This is the launch release, which builds the AWS Open Data Analytics API for exploring open datasets within your Amazon S3 account using S3 Select.
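One possible approach to the known issue above, under the assumption that the error stems from large results being streamed back across multiple ``Records`` event chunks that do not each contain complete JSON lines, is to accumulate the raw payload bytes first and parse them once at the end. The variant below is a sketch, not part of the original notebook, and reuses the ``s3`` client and ``pd`` import defined earlier.

```python
# Hypothetical variant of s3_select that buffers all Records payload chunks before
# parsing, so a record split across event boundaries no longer breaks pd.read_json.
import io

def s3_select_buffered(bucket, key, statement):
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=statement,
        ExpressionType='SQL',
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'JSON': {}},
    )
    buffer = b''
    for event in response['Payload']:
        if 'Records' in event:
            # Accumulate raw bytes instead of parsing each chunk separately.
            buffer += event['Records']['Payload']
        elif 'Stats' in event:
            print(f"Scanned: {int(event['Stats']['Details']['BytesScanned'])/1024/1024:5.2f}MB")
            print(f"Returned: {int(event['Stats']['Details']['BytesReturned'])/1024/1024:5.2f}MB")
    # Parse the complete set of JSON lines in one pass.
    return pd.read_json(io.StringIO(buffer.decode('utf-8')), lines=True)
```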