# Documentation of the Athena SQL Query

This powerful transformation is shown in the SQL statement “Q2: DynamoDBAthenaMLScooterPredict”. The statement make use of sub-queries, which are also known as Common Table Expressions (CTE) and subquery factoring.
Let’s break the statement down in the different steps:

First we define two subqueries “current_ts” and “now” because this example uses historic data and we have to go back in time and select a date and time. I this case September 7, 2019 at 3 PM. With real-time data you can replace the time-stamp “ts” with `NOW()` to get the current time. Athena does not have support for variables in SQL statements. We use a common method to achieve the same by using single-row CTEs that can be joined to subsequent expressions.

The next CTE “trips_raw” uses the Lambda function to pull data from the DynamoDB table. It only returns records that are less than five hours older than the target time “now.ts_epoch”. We use epoch time, aka. UNIX time, because DynamoDB does not support a native time-stamp format. This is faster than using strings and convert them to timestamps.

The next CTE “trips” prepares the selected data from “trips_raw” for aggregation over time and geography by associating 1-hour bins and neighborhoods to the trip records. The trip has a start time and location and end time and location. The sub-query generates the respective fields “t_hour_start”, “start_nbid” and “t_hour_end”, “end_nbid”. We use geospatial functions (https://docs.aws.amazon.com/athena/latest/ug/geospatial-functions-list-v2.html) in Athena to map the longitude-latitude coordinates of the start and end locations to their respective neighborhoods. The geospatial function `ST_WITHIN()`, that determines if the given points lays within the boundaries of the polygon, is used to join the neighborhood table the trip data. It has to be joined twice, for the start and the end location. 

The CTEs “start_count” and “end_count” perform the aggregation over hours and neighborhoods. Both sub-queries operate on the same way. Technically, we aggregate over neighborhood, “*_nbid”, and hour “t_hour_*”. However, the query uses `GROUP BY` only on the neighborhood. The construct `SUM(CASE WHEN ...)` is used to mimic result of a pivot table in order to get the aggregates for different grouping in separate columns.

The final CTE “predictions” uses the counts per preceding hours and neighborhoods to build the feature vector for the ML model. In Athena you can access SageMaker endpoints for ML inference as external function with the keyword `SAGEMAKER` and the name of the endpoint. The top of the SQL statement shows the definition of the inference function `predict_demand()` that returns the predicted number of trips for the next hour based on the counts of the past four hours, the neighborhood, and day of the week and hour of day. The endpoint with the pre-trained ML model was launched during installation. You can find the Python code for training the ML model in the Notebook “Demand Prediction for Dockless Vehicles using Amazon SageMaker and Amazon Athena”.