{ "cells": [ { "cell_type": "markdown", "id": "a2c4326d", "metadata": {}, "source": [ "# Load dataset into Snowflake\n", "\n", "***This notebook works well with the `conda_python3` kernel on a SageMaker Notebook `ml.t3.xlarge` instance.***\n", "\n", "---\n", "\n", "In this notebook we will do the following:\n", "1. Install dependencies i.e. Python packages needed to connect to Snowflake.\n", "1. Connect to Snowflake and create a dateware and a database.\n", "1. Download the [California housing dataset](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html) from scikit-learn and ingest it into a new Snowflake table.\n" ] }, { "cell_type": "markdown", "id": "c130e21d-b0a6-4f48-be81-3f74b99b0f54", "metadata": { "tags": [] }, "source": [ "---\n", "\n", "## Install dependencies " ] }, { "cell_type": "code", "execution_count": null, "id": "ae86dfa4-f67e-4fde-8813-66c3eb5ba8ee", "metadata": {}, "outputs": [], "source": [ "!pip install snowflake-connector-python==2.9.0\n", "!pip install pyarrow==10.0.1" ] }, { "cell_type": "markdown", "id": "c395ee8b-0eff-480d-8eca-41d224affe05", "metadata": {}, "source": [ "---\n", "\n", "## Connect to Snowflake account\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f3a05856-f388-47af-a7a9-67bf61cd81b7", "metadata": {}, "source": [ "Now you need to retrieve your account identifier for Snowflake. This can be done by looking into Snowflake account and then copying the account URL from the bottom left corner of the page as shown in the screenshot below.\n", "\n", "![](./img/snowflake-accountid-1.png)" ] }, { "cell_type": "markdown", "id": "1b3c62a7-b9d9-403b-b39b-f4983a5c1f0f", "metadata": {}, "source": [ "The account URL is of the form `https://abc12345.us-east-1.snowflakecomputing.com` the account identifier is the `abc12345.us-east-1` part. _Note that in your case the `us-east-1` might be replaced with a different region_." ] }, { "cell_type": "code", "execution_count": null, "id": "dbfe0da4-b67a-4180-9ab1-59d001ac522f", "metadata": {}, "outputs": [], "source": [ "sf_account_id = \"your-snowflake-account-id\"\n", "%store sf_account_id \n", "\n", "sf_secret_id = \"snowflake_credentials\"\n", "%store sf_secret_id " ] }, { "cell_type": "markdown", "id": "efed0d24-addb-40c4-b855-4bda9570bf85", "metadata": {}, "source": [ "### Retrieve Snowflake credentials from AWS secrets manager\n", "\n", "The following code retrieves your Snowflake username and password from a secret stored called `snowflake_credentials` store in AWS Secrets Manager. The secret is expected to have two key-value pairs: `username`: this is your Snowflake username and `password`: this is the password associated with your Snowflake username. For instructions on how to create a secret in AWS Secrets Mananger refer to [`Create an AWS Secrets Mananger secret`](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html)." ] }, { "cell_type": "code", "execution_count": null, "id": "ef6a2422-a023-4a8f-8b5f-d7945880fb49", "metadata": {}, "outputs": [], "source": [ "\n", "import os\n", "import json\n", "import boto3\n", "import pandas as pd \n", "import snowflake.connector\n", "from snowflake.connector.pandas_tools import write_pandas\n", "\n", "# collect snowflake credentials from Secrets Manager\n", "client = boto3.client('secretsmanager')\n", "response = client.get_secret_value(SecretId=sf_secret_id)\n", "secrets_credentials = json.loads(response['SecretString'])\n", "sf_password = secrets_credentials['password']\n", "sf_username = secrets_credentials['username']" ] }, { "cell_type": "code", "execution_count": null, "id": "3f874ec8-15dd-4d78-986a-35a867bdc5b7", "metadata": {}, "outputs": [], "source": [ "print(f\"sf_username={sf_username}\")" ] }, { "cell_type": "markdown", "id": "38acd10e-864c-4acd-9ba2-e964d3febc4d", "metadata": {}, "source": [ "### Connect" ] }, { "cell_type": "code", "execution_count": null, "id": "6c3f5879-0cbf-422b-8e1b-dbd35d690752", "metadata": {}, "outputs": [], "source": [ "# connect to Snowflake account\n", "conn = snowflake.connector.connect(user=sf_username,\n", " password=sf_password,\n", " account=sf_account_id,\n", " protocol=\"https\")" ] }, { "cell_type": "markdown", "id": "ca4e7659-8b48-464d-9bec-a19f30ba1af0", "metadata": {}, "source": [ "## Create Snowflake data warehouse and database" ] }, { "cell_type": "code", "execution_count": null, "id": "87cba8a4-48c1-4733-9d0a-75c42da63b59", "metadata": {}, "outputs": [], "source": [ "dwh = \"amazon_sagemake_w_snowflake_as_datasource\"\n", "db = \"housing\"\n", "schema = \"housing_schema\"\n", "table = \"california_housing\"" ] }, { "cell_type": "code", "execution_count": null, "id": "e0295f21-fd71-4650-af6f-5095e82855d6", "metadata": {}, "outputs": [], "source": [ "conn.cursor().execute(f\"CREATE WAREHOUSE IF NOT EXISTS {dwh}\")\n", "conn.cursor().execute(f\"USE WAREHOUSE {dwh}\")\n", "conn.cursor().execute(f\"CREATE DATABASE IF NOT EXISTS {db}\")\n", "conn.cursor().execute(f\"USE DATABASE {db}\")" ] }, { "cell_type": "markdown", "id": "aeba7432", "metadata": {}, "source": [ "---\n", "\n", "## Download the dataset from scikit-learn" ] }, { "cell_type": "code", "execution_count": null, "id": "8c74f61d-9d7b-489a-9a9f-934a8029b14f", "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_california_housing\n", "\n", "california_housing = fetch_california_housing(as_frame=True)\n", "california_housing = california_housing.frame\n", "california_housing.head()" ] }, { "cell_type": "markdown", "id": "a5765e64", "metadata": {}, "source": [ "---\n", "\n", "## Ingest data into the Snowflake table" ] }, { "cell_type": "code", "execution_count": null, "id": "907fb43e", "metadata": {}, "outputs": [], "source": [ "# connect to Snowflake Table schema\n", "conn.cursor().execute(f\"CREATE SCHEMA IF NOT EXISTS {schema}\")\n", "conn.cursor().execute(f\"USE SCHEMA {schema}\")\n", "\n", "\n", "create_table_sql = f\"CREATE TABLE IF NOT EXISTS {db}.{schema}.{table}\\n (\"\n", "\n", "california_housing.rename(columns=str.upper, inplace=True)\n", "california_housing.columns\n", "\n", "# iterating through the columns\n", "for col in california_housing.columns:\n", " column_name = col.upper()\n", "\n", " if (california_housing[col].dtype.name == \"int\" or california_housing[col].dtype.name == \"int64\"):\n", " create_table_sql = create_table_sql + column_name + \" int\"\n", " elif california_housing[col].dtype.name == \"object\":\n", " create_table_sql = create_table_sql + column_name + \" varchar(16777216)\"\n", " elif california_housing[col].dtype.name == \"datetime64[ns]\":\n", " create_table_sql = create_table_sql + column_name + \" datetime\"\n", " elif california_housing[col].dtype.name == \"float64\":\n", " create_table_sql = create_table_sql + column_name + \" float8\"\n", " elif california_housing[col].dtype.name == \"bool\":\n", " create_table_sql = create_table_sql + column_name + \" boolean\"\n", " else:\n", " create_table_sql = create_table_sql + column_name + \" varchar(16777216)\"\n", "\n", " # deciding next steps. Either column is not the last column (add comma) else end create_tbl_statement\n", " if california_housing[col].name != california_housing.columns[-1]:\n", " create_table_sql = create_table_sql + \",\\n\"\n", " else:\n", " create_table_sql = create_table_sql + \")\"\n", "\n", " # execute the SQL statement to create the table\n", " print(f\"create_table_sql={create_table_sql}\")\n", " conn.cursor().execute(create_table_sql) \n", "print(f\"snowflake_table={table}\")\n", "conn.cursor().execute(f'TRUNCATE TABLE IF EXISTS {table}') \n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3057cfbf-0218-4c39-a01b-d55d05f75045", "metadata": {}, "outputs": [], "source": [ "print(f\"database={db}, schema={schema}, snowflake_table={table}\")\n", "# Write the data from the DataFrame to the Snowflake table.\n", "write_pandas(conn=conn,\n", " df=california_housing,\n", " table_name=table.upper(),\n", " database=db.upper(),\n", " schema=schema.upper())" ] }, { "cell_type": "code", "execution_count": null, "id": "2ea7f4ec-a122-4611-8505-ee3e42a26f06", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "vscode": { "interpreter": { "hash": "c788421e88f093f8dd506f4527624e47e75432cd7217f9d7714b2cd296741f7d" } } }, "nbformat": 4, "nbformat_minor": 5 }