{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Discovery: how to use Data API to query RedShift from Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup: configure your cluster access, your role ARN, your output S3 URI\n", "\n", "In the following cell, you will be prompted to provide some of the information associated to your Redshift cluster and S3 path of output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "session = sagemaker.Session()\n", "\n", "###### CLUSTER CONFIGURATION\n", "cluster_id = input(\"The name of your Redshift cluster:\")\n", "database = input(\"The database of your Redshift cluster (default: dev)\") or 'dev'\n", "db_user = input(\"The user of your Redshift cluster (default: awsuser)\") or 'awsuser'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that Redshift cluster is set, we can now prepare our SQL query string. In this example, we plan on predicting total sales for a specific event, provided its venue, category, date and holiday information. The query is a pretty basic one but can be improved as needed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "###### QUERY STRING\n", "query_string = \"\"\"\n", "-- Find total sales for specific event, plus additional features\n", "SELECT sum(s.qtysold) AS total_sold, sum(s.pricepaid) AS total_paid, e.venueid, e.catid, d.caldate, d.holiday\n", "from sales s, event e, date d\n", "WHERE s.eventid = e.eventid and e.dateid = d.dateid\n", "GROUP BY e.venueid, e.catid, d.caldate, d.holiday\n", "\"\"\" # this will work on the default Free Tier Redshift cluster. Change if needed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now use the AWS SDK for Python, `boto3`, to query the Redshift Data API. You can check more details on the boto3 client at this [link](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift-data.html#client)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import time\n", "\n", "# Execute the Data API query\n", "client = boto3.client('redshift-data')\n", "execution_id = client.execute_statement(\n", " ClusterIdentifier=cluster_id,\n", " Database=database,\n", " DbUser=db_user,\n", " Sql=query_string,\n", ")['Id']\n", "print(f'Execution started with ID {execution_id}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Wait for the query to be done\n", "status = client.describe_statement(Id=execution_id)['Status']\n", "while status not in ['FINISHED','ABORTED','FAILED']:\n", " time.sleep(10)\n", " status = client.describe_statement(Id=execution_id)['Status']\n", "print(f'Execution {execution_id} finished with status {status}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if status == 'FINISHED':\n", " columns = [c['label'] for c in client.get_statement_result(Id=execution_id)['ColumnMetadata']]\n", " records = client.get_statement_result(Id=execution_id)['Records']\n", " print(f'SUCCESS. Found {len(records)} records')\n", "else:\n", " print(f'Failed with Error: {client.describe_statement(Id=execution_id)[\"Error\"]}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the query return successful, we can now parse the content and use it to create a Pandas DataFrame." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from io import StringIO\n", "\n", "def content_parser(columns, records):\n", " # Build the CSV\n", " content = \"\"\n", " line = \"\"\n", " for c in columns:\n", " line+=f\"{c},\"\n", " content += line[:-1]+'\\n'\n", " for record in records:\n", " line = \"\"\n", " for v in record:\n", " data = v.get('longValue') or v.get('stringValue')\n", " line += f'{data},'\n", " content += line[:-1]+'\\n'\n", " return StringIO(content)\n", "\n", "data = content_parser(columns, records)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(data, sep=',')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:470317259841:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }