# Regression **Note**: The contents of this folder was taken from the AWS Blog Post: [Build regression models with Amazon Redshift ML](https://aws.amazon.com/blogs/machine-learning/build-regression-models-with-amazon-redshift-ml/). ## Introduction In this folder, we have a working example to create models in Amazon Redshift using the regression problem type, which is used to predict a numerical outcome, like the price of a house or how many people will use a city’s bike rental service. ## Pre-requisites Use the [Cloudformation template](../cloud-formation/cloud-formation-template.yaml) to setup your Amazon Redshift cluster with Redshift ML or create an IAM role allowing Amazon Redshift to communicate with Amazon SageMaker. ## Use Case For our use case, we want to build a regression model that predicts the number of people that may use the city of Toronto’s bike sharing service at any given hour of a day. The model accounts for various aspects, including holidays and weather conditions. Because we need to predict a numerical outcome, we create a regression model. ### Using Amazon SageMaker Notebook You may follow the step-by-step instructions in the [multi-class-classification notebook](./multi-class-classification.ipynb) for this setup which is auto-provisioned in the [Cloudformation template](../cloud-formation/cloud-formation-template.yaml). ## Data Preparation Please use the the [data preparation SQL file](./data_preparation.sql) to populate the input tables for this example setup. ## Create Model We can use the create model statement in Amazon Redshift to create our ML model using regression. We specify the problem type but we let AutoML take care of everything else. In this model, the target we want to predict is `trip_count`. Amazon SageMaker creates the function predict_customer_activity, which we use to do inference in Amazon Redshift. See the following code: ```sql CREATE MODEL predict_rental_count FROM training_data TARGET trip_count FUNCTION predict_rental_count IAM_ROLE '<>' PROBLEM_TYPE regression OBJECTIVE 'mse' SETTINGS (s3_bucket '<>', s3_garbage_collect off, max_runtime 5000); ``` ## Evaluation In this step, we evaluate the accuracy of our ML model against our validation data. While creating the model, Amazon SageMaker Autopilot automatically splits the input data into train and validation sets, and selects the model with the best objective metric, which is deployed in the Amazon Redshift cluster. You can use the show model statement in your cluster to view various metrics, including the accuracy score. If you don’t specify explicitly, SageMaker automatically uses accuracy for the objective type. See the following code: Show model predict_rental_count; You can see the validation:mse value in the output of the `SHOW MODEL` command. A value closer to 100 is better. Run the inference query: ```sql SELECT trip_time , actual_count , predicted_count , ( actual_count - predicted_count ) difference FROM (SELECT trip_time , trip_count AS actual_count , PREDICT_RENTAL_COUNT (trip_hour, trip_day, trip_month, trip_year, trip_quarter, trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday) predicted_count FROM validation_data) LIMIT 5; ``` ## ML Predictions Let’s run a query to see what time the customers would take a bike-ride: ```sql SELECT trip_time , trip_count AS actual_count , PREDICT_RENTAL_COUNT (trip_hour, trip_day, trip_month, trip_year, trip_quarter, trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday) predicted_count FROM validation_data limit 10; ```